|[jgit-dev] RawTextComparator hash is bad|
RawTextComparator's hash() method is horrible for Java source code. Completely horrible. I wrote a test to compare a few hash functions as run over the JGit source blob 3fad2c3 (this is a version of MyersDiff.java): 574 lines; 398 unique; 8 bits = 256 table Algorithm Ratio Max StdDev rtc T 91.92 201 13.42 rtc L 3.25 8 1.68 djb2 T 2.71 6 1.33 djb2 L 2.84 6 1.42 djb2_nows T 2.74 6 1.35 djb2_nows L 2.80 5 1.39 full_name_hash T 2.82 7 1.41 full_name_hash L 2.73 6 1.34 string_hash31 T 2.76 6 1.36 string_hash31 L 2.86 6 1.43 The test computes the unique elements (398) and then inserts them into a simple hash table of 256 buckets, incrementing a counter for each new element that fell into that bucket. Max shows the longest chain (total number of elements that fell into the same bucket). Ratio is the estimated number of probes we need to perform to locate an element in the table, divided by the best-case number of probes if the elements were perfectly distributed into the buckets. A ratio of 1.0 is therefore a perfect hash function, as it evenly distributed the elements. Now to describe the functions. The T and L suffixes mean different methods of converting from the raw hash value into a bucket index in the table. * T: Simple case for a table size that is a power of 2: return hash & ((1 << table_bits) - 1); * L: I got this from , which appears to be itself derived from a hash cleanup used by the Linux kernel: /* 2^31 + 2^29 - 2^25 + 2^22 - 2^19 - 2^16 + 1 */ static final int GOLDEN_RATIO_PRIME_32 = 0x9e370001; int bucket = hash * GOLDEN_RATIO_PRIME_32; return bucket >>> (32 - table_bits); * rtc T 91.92 201 13.42 * rtc L 3.25 8 1.68 This is the current RawTextComparator.DEFAULT function, which appears to be based on what xdiff in C Git uses: int hash = 5381; for (; ptr < end; ptr++) hash = (hash << 5) ^ (raw[ptr] & 0xff); As we can see, its a _horrible_ hash function. 201 lines out of 398 wound up in the same bucket. Applying the L cleanup routine improved matters dramatically, because more bits from the high end of the hash become involved in the final outcome. * djb2 T 2.71 6 1.33 * djb2 L 2.84 6 1.42 This is the original djb2 algorithm, before it got busted when it was put into RawTextComparator. Its far better than rtc above. int hash = 5381; for (; ptr < end; ptr++) hash = ((hash << 5) + hash) + (raw[ptr] & 0xff); * djb2_nows T 2.74 6 1.35 * djb2_nows L 2.80 5 1.39 This is a modification that squashes values <= ASCII space to 0, and shifts all other values down by 0x20. For ASCII like sources this gets us a tiny bit more space in the hash for important characters, but as we can see the chain lengths aren't very different from djb2. * full_name_hash T 2.82 7 1.41 * full_name_hash L 2.73 6 1.34 This is the function used by the Linux kernel inside of its VFS layer to hash file or directory names. They primarily care about short strings like 'bin', 'usr', or 'home'. It seems to be not as good with Java source code lines. * string_hash31 T 2.76 6 1.36 * string_hash31 L 2.86 6 1.43 This is identical to what Sun JRE version 6 uses for java.lang.String hashCode(): int hash = 0; for (; ptr < end; ptr++) hash = 31 * hash + (raw[ptr] & 0xff); Granted, this is only 1 source file out of many. But I think the current hash is really bad if I got this unlucky by having such a horrible chain length on the first file I tried it on. :-) This afternoon I'll run some tests on a bunch of different source files and see if I can get more information about how bad the chain lengths might get.  http://www.spinics.net/lists/netdev/msg110556.html -- Shawn.
Back to the top