Re: [jgit-dev] Encoding used in the repository, e.g., for file names


From: "Shawn O. Pearce" <spearce@xxxxxxxxxxx>
Date: Thu, 19 Aug 2010 11:21:01 -0700
Delivered-to: jgit-dev@xxxxxxxxxxx
List-archive: <https://dev.eclipse.org/mailman/private/jgit-dev>

Thomas Singer <jgit-dev@xxxxxxxxx> wrote:
> 
> According to my understanding (and in contrast with, for example, SVN), Git
> has not standardized on the encoding used to store, e.g., file names in the
> repository; it simply takes the byte sequence it gets from the operating
> system. With pure Java it is not possible to get the byte sequence of file
> names, because Java has already translated them to characters. What impact
> does this have on the compatibility between JGit and Git?

A lot.  :-(

JGit assumes the file names are encoded in UTF-8, but falls back
to the default platform encoding or ISO-8859-1 when a name is not
a valid UTF-8 string.
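
To make that fallback order concrete, here is a rough sketch.  It is
not JGit's actual code; the class PathNameDecoder and its decode
method are hypothetical names, and the platform charset is passed in
as a parameter only so the behaviour is easy to reproduce.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of JGit: decodes raw path bytes by
// trying strict UTF-8 first, then the platform charset, then ISO-8859-1.
final class PathNameDecoder {
    private PathNameDecoder() {
    }

    // Convenience overload that uses the JVM's default charset.
    static String decode(byte[] raw) {
        return decode(raw, Charset.defaultCharset());
    }

    static String decode(byte[] raw, Charset platform) {
        for (Charset cs : new Charset[] { StandardCharsets.UTF_8, platform }) {
            try {
                return strict(cs, raw);
            } catch (CharacterCodingException e) {
                // Not valid in this charset; try the next candidate.
            }
        }
        // ISO-8859-1 maps every byte to a character, so this never fails.
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // Decode without silently replacing malformed or unmappable input.
    private static String strict(Charset cs, byte[] raw)
            throws CharacterCodingException {
        CharsetDecoder decoder = cs.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(raw)).toString();
    }
}
```

The strict decoder matters here: new String(bytes, UTF_8) would
replace bad bytes with U+FFFD instead of failing, so the fallback
would never trigger.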

Linus Torvalds assumes you only use US-ASCII file names, or you
mount your filesystem to use UTF-8, because any other encoding is
flat out bat shit insane.  Therefore C Git can mostly assume the
file names are UTF-8.  But yea, uh, there is no promise of that.

Even this UTF-8 assumption falls apart horribly though, thanks to
Unicode normalization forms and Mac OS X HFS+ using a different
normal form than Linux does.  C Git tries to fix this by doing some
sort of hashing and name mangling when looking at entries in the
index.  I haven't tried to understand that logic, so I'm not entirely
clear on what they are doing right now.
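
A small illustration of why normal forms break the UTF-8 assumption,
using java.text.Normalizer from the JDK.  The class NormalFormDemo
and its methods are hypothetical names for this sketch; it does not
reproduce what C Git does, it only shows that one visible name can
have two different UTF-8 byte sequences.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

// Hypothetical demo class, not part of JGit or C Git.
final class NormalFormDemo {
    private NormalFormDemo() {
    }

    // UTF-8 bytes of a name after normalizing it to the given form.
    static byte[] utf8(String name, Normalizer.Form form) {
        return Normalizer.normalize(name, form)
                .getBytes(StandardCharsets.UTF_8);
    }

    // Compare two names after bringing both to NFC (composed form).
    static boolean sameName(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFC);
        return na.equals(nb);
    }
}
```

For example, "\u00e4" (a with diaeresis, composed) is two bytes in
UTF-8, while its decomposed form "a\u0308" is three bytes.  A plain
byte or String comparison treats them as different names, whereas
sameName() treats them as equal.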

There are some old JGit patches lying around in Gerrit by Semyon
Vadishev that try to address this problem in JGit.  You can see them
at http://egit.eclipse.org/r/r/status:open%20project:jgit%20owner:semen.vadishev@xxxxxxxxxxx

-- 
Shawn.
