|Re: [jgit-dev] RawTextComparator hash is bad|
"Shawn O. Pearce" <spearce@xxxxxxxxxxx> wrote: > RawTextComparator's hash() method is horrible for Java source code. > Completely horrible. ... > This afternoon I'll run some tests on a bunch of different source > files and see if I can get more information about how bad the chain > lengths might get. And here are test results run over some "common" repositories that I happen to have locally. The test scans every text file in the current HEAD revision to determine the maximum chain length when using a given hash function. The hash table is sized for each file to be the next power of 2 larger than the number of unique lines in that file. So a file with 122 unique lines will use a hash table of 128 buckets; a file with 258 unique lines uses a 512 bucket table. This is what we do inside of HistogramDiff. The Max column reports the maximum number of elements in a single bucket, for any single file in the repository. test_jgit: 931 files; 122 avg. unique lines/file Algorithm Max rtc TRUNC 418 rtc LK 174 djb2 TRUNC 5 djb2 LK 6 string_hash31 TRUNC 11 string_hash31 LK 5 crapwow32 TRUNC 5 crapwow32 LK 6 sha1 TRUNC 6 sha1 LK 6 test_linux26: 30198 files; 258 avg. unique lines/file Algorithm Max rtc TRUNC 8675 rtc LK 1894 djb2 TRUNC 32 djb2 LK 7 string_hash31 TRUNC 32 string_hash31 LK 7 crapwow32 TRUNC 54 crapwow32 LK 54 sha1 TRUNC 8 sha1 LK 7 test_frameworks_base: 8381 files; 184 avg. unique lines/file Algorithm Max rtc TRUNC 4615 rtc LK 1592 djb2 TRUNC 10 djb2 LK 7 string_hash31 TRUNC 13 string_hash31 LK 6 crapwow32 TRUNC 16 crapwow32 LK 17 sha1 TRUNC 6 sha1 LK 6 As we can see, "rtc" (aka the current RawTextComparator hash) is utter crap. Even after applying the Linux kernel GOLDEN_RATIO_PRIME_32 cleanup step ("rtc LK") its producing collisions into the thousands for some files in the Linux kernel or the Android platform/frameworks/base repository. crapwow32 is a 32 bit version of CrapWow that I got off GitHub. The code complexity is very high, and its collision count isn't as low as the simpler djb2 or string_hash31. sha1 is applying SHA-1 to the line, and then using only the first 4 bytes of the result as the 32 bit hash code. I offer this as an example because members of the Git community typically consider SHA-1 to be very collision resistant, but its too slow for this sort of application. Notice in the output above that its collision rates are actually very close to "djb2 LK" or "string_hash31 LK". Based on these results we should use either djb2 or string_hash31, with the LK cleanup applied after. The thing is, these are virtually identical hash routines: djb2: h = ((h << 5) + h) + (raw[ptr] & 0xff); string_hash31: h = (31 * h) + (raw[ptr] & 0xff) Either multiply by 33 (djb2) or by 31. And the latter can obviously be written without the multiply: string_hash31: h = (h << 5) - h) + (raw[ptr] & 0xff)  http://www.team5150.com/~andrew/noncryptohashzoo/CrapWow.html  http://github.com/sunnygleason/g414-hash/blob/master/src/main/java/com/g414/hash/impl/CWowHash.java -- Shawn.
Back to the top