Re: [jgit-dev] Encoding used in the repository, e.g., for file names

Thomas Singer <jgit-dev@xxxxxxxxx> wrote:
> 
> According to my understanding (and in contrast with, for example, SVN), Git
> has not standardized on the encoding used to store, e.g., file names in the
> repository; it simply takes the byte sequence it gets from the operating
> system. With pure Java it is not possible to get the byte sequence of file
> names, because Java has already translated them to characters. What impact
> does this have on the compatibility between JGit and Git?

A lot.  :-(

JGit assumes the file names are encoded in UTF-8, but falls back
to the default platform encoding or ISO-8859-1 when a name is not
a valid UTF-8 string.
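
To make the fallback idea concrete, here is a minimal sketch in plain
Java.  It is not JGit's actual code: the class and method names are
made up for illustration, and it skips the platform-default step and
goes straight from strict UTF-8 to ISO-8859-1.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of JGit.
final class NameDecoder {
    // Decode raw name bytes as strict UTF-8; if they are not valid
    // UTF-8, reinterpret them as ISO-8859-1 (which never fails,
    // since every byte maps to exactly one character).
    static String decodeName(byte[] raw) {
        CharsetDecoder utf8 = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return utf8.decode(ByteBuffer.wrap(raw)).toString();
        } catch (CharacterCodingException e) {
            return new String(raw, StandardCharsets.ISO_8859_1);
        }
    }

    public static void main(String[] args) {
        // The same visible name, stored once as UTF-8 and once as Latin-1.
        byte[] utf8Name = "\u00e9.txt".getBytes(StandardCharsets.UTF_8);
        byte[] latin1Name = {(byte) 0xE9, '.', 't', 'x', 't'};
        // prints "true": both byte sequences decode to the same string
        System.out.println(decodeName(utf8Name).equals(decodeName(latin1Name)));
    }
}
```

The decoder has to be told to REPORT errors; the plain
new String(bytes, UTF_8) constructor silently substitutes U+FFFD for
bad input, so the fallback would never trigger.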

Linus Torvalds assumes you only use US-ASCII file names, or you
mount your filesystem to use UTF-8, because any other encoding is
flat out bat shit insane.  Therefore C Git can mostly assume the
file names are UTF-8.  But yea, uh, there is no promise of that.

Even this UTF-8 assumption falls apart horribly though, thanks to
Unicode normalization forms: Mac OS X HFS+ stores names in a
decomposed form, while Linux systems typically end up with composed
names.  C Git tries to fix this by doing some sort of hashing and
name mangling when looking at entries in the index.  I haven't tried
to understand that logic so I'm not entirely clear what they are
doing right now.
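
For a feel of the normal form problem, here is a small sketch using
java.text.Normalizer from the standard library.  It only shows why
two names that look identical compare unequal, and one way to compare
them anyway; it is not what C Git does, and the class and method
names are made up for illustration.

```java
import java.text.Normalizer;

// Hypothetical demo class, not part of JGit or C Git.
final class NormalFormDemo {
    // Compare two names after bringing both into the composed form (NFC).
    static boolean sameName(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String composed = "\u00e9.txt";    // single code point U+00E9
        String decomposed = "e\u0301.txt"; // 'e' plus combining acute accent U+0301
        System.out.println(composed.equals(decomposed));    // prints "false"
        System.out.println(sameName(composed, decomposed)); // prints "true"
    }
}
```

Both strings are valid UTF-8 once encoded, so a UTF-8 validity check
alone cannot tell you that they name the same file.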

There are some old JGit patches lying around in Gerrit by Semyon
Vadishev that try to address this in JGit.  You can see them
at http://egit.eclipse.org/r/r/status:open%20project:jgit%20owner:semen.vadishev@xxxxxxxxxxx

-- 
Shawn.
