Re: [jgit-dev] Encoding used in the repository, e.g., for file names
Thomas Singer <jgit-dev@xxxxxxxxx> wrote:
>
> According to my understanding (and in contrast with, for example, SVN) Git
> has not standardized on the encoding used to store, e.g., file names in the
> repository; it simply takes the byte sequence it gets from the operating
> system. With pure Java it is not possible to get the byte sequence of file
> names, because Java has already translated them to characters. What impact
> does this have on the compatibility between JGit and Git?

A lot. :-(

JGit assumes the file names are encoded in UTF-8, but falls back to
the default platform encoding or ISO-8859-1 when a name is not a valid
UTF-8 string.

Linus Torvalds assumes you only use US-ASCII file names, or you mount
your filesystem to use UTF-8, because any other encoding is flat out
bat shit insane. Therefore C Git can mostly assume the file names are
UTF-8. But yeah, uh, there is no promise of that.

Even this UTF-8 assumption falls apart horribly though, thanks to
Unicode normal forms and Mac OS X HFS+ using a different normal form
than Linux does. C Git tries to fix this by doing some sort of hashing
and name mangling when looking at entries in the index. I haven't
tried to understand that logic, so I'm not entirely clear on what they
are doing right now.

There are some old JGit patches lying around in Gerrit by Semyon
Vadishev that try to address this in JGit. You can see them at
http://egit.eclipse.org/r/r/status:open%20project:jgit%20owner:semen.vadishev@xxxxxxxxxxx

--
Shawn.
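
To make the two problems in the reply concrete, here is a minimal Java sketch. It is not JGit's or C Git's code: the class name PathNameDecoding and both method names are made up for illustration. The decoder tries strict UTF-8 first and, as a simplification, falls back straight to ISO-8859-1 (it skips the default-platform-encoding step mentioned above). The comparison helper assumes that normalizing both names to NFC is an acceptable way to treat a composed and a decomposed spelling as the same name, which is only one possible policy.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

// Hypothetical helper for illustration only; not part of JGit.
final class PathNameDecoding {
    private PathNameDecoding() {}

    // Decode raw path bytes as strict UTF-8. If the bytes are not valid
    // UTF-8, fall back to ISO-8859-1, which maps every byte to a character.
    static String decode(byte[] raw) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            // Assumption: skip the platform default and go straight to Latin-1.
            return new String(raw, StandardCharsets.ISO_8859_1);
        }
    }

    // Compare two names after normalizing both to NFC, so that a composed
    // name and its decomposed spelling are treated as equal.
    static boolean sameNameIgnoringNormalForm(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFC);
        return na.equals(nb);
    }
}
```

With this sketch, the bytes 0xC3 0xA9 decode as UTF-8 to "é", while the single byte 0xE9 is not valid UTF-8 and decodes through the ISO-8859-1 fallback to the same character. The strings "\u00e9" and "e\u0301" are unequal under String.equals but equal under sameNameIgnoringNormalForm, which is the kind of mismatch that arises when one filesystem hands back decomposed names and another composed ones.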