[
Date Prev][
Date Next][
Thread Prev][
Thread Next][
Date Index][
Thread Index]
[
List Home]
Re: [jgit-dev] RawTextComparator hash is bad
|
torsdagen den 23 september 2010 20.42.13 skrev Shawn O. Pearce:
> RawTextComparator's hash() method is horrible for Java source code.
> Completely horrible.
>
> I wrote a test to compare a few hash functions as run over the JGit
> source blob 3fad2c3 (this is a version of MyersDiff.java):
>
> 574 lines; 398 unique; 8 bits = 256 table
> Algorithm Ratio Max StdDev
> rtc T 91.92 201 13.42
> rtc L 3.25 8 1.68
>
> djb2 T 2.71 6 1.33
> djb2 L 2.84 6 1.42
>
> djb2_nows T 2.74 6 1.35
> djb2_nows L 2.80 5 1.39
>
> full_name_hash T 2.82 7 1.41
> full_name_hash L 2.73 6 1.34
>
> string_hash31 T 2.76 6 1.36
> string_hash31 L 2.86 6 1.43
>
>
> The test computes the unique elements (398) and then inserts them
> into a simple hash table of 256 buckets, incrementing a counter
> for each new element that fell into that bucket.
>
> Max shows the longest chain (total number of elements that fell
> into the same bucket).
>
> Ratio is the estimated number of probes we need to perform to
> locate an element in the table, divided by the best-case number of
> probes if the elements were perfectly distributed into the buckets.
> A ratio of 1.0 is therefore a perfect hash function, as it evenly
> distributed the elements.
>
>
> Now to describe the functions. The T and L suffixes mean different
> methods of converting from the raw hash value into a bucket index
> in the table.
>
> * T: Simple case for a table size that is a power of 2:
>
> return hash & ((1 << table_bits) - 1);
>
> * L: I got this from [1], which appears to be itself derived from
> a hash cleanup used by the Linux kernel:
>
> /* 2^31 + 2^29 - 2^25 + 2^22 - 2^19 - 2^16 + 1 */
> static final int GOLDEN_RATIO_PRIME_32 = 0x9e370001;
> int bucket = hash * GOLDEN_RATIO_PRIME_32;
> return bucket >>> (32 - table_bits);
>
>
> * rtc T 91.92 201 13.42
> * rtc L 3.25 8 1.68
>
> This is the current RawTextComparator.DEFAULT function, which
> appears to be based on what xdiff in C Git uses:
>
> int hash = 5381;
> for (; ptr < end; ptr++)
> hash = (hash << 5) ^ (raw[ptr] & 0xff);
>
> As we can see, its a _horrible_ hash function. 201 lines out of
> 398 wound up in the same bucket. Applying the L cleanup routine
> improved matters dramatically, because more bits from the high
> end of the hash become involved in the final outcome.
>
>
> * djb2 T 2.71 6 1.33
> * djb2 L 2.84 6 1.42
>
> This is the original djb2 algorithm, before it got busted when
> it was put into RawTextComparator. Its far better than rtc above.
>
> int hash = 5381;
> for (; ptr < end; ptr++)
> hash = ((hash << 5) + hash) + (raw[ptr] & 0xff);
>
>
> * djb2_nows T 2.74 6 1.35
> * djb2_nows L 2.80 5 1.39
>
> This is a modification that squashes values <= ASCII space to 0,
> and shifts all other values down by 0x20. For ASCII like sources
> this gets us a tiny bit more space in the hash for important
> characters, but as we can see the chain lengths aren't very
> different from djb2.
>
>
> * full_name_hash T 2.82 7 1.41
> * full_name_hash L 2.73 6 1.34
>
> This is the function used by the Linux kernel inside of its VFS
> layer to hash file or directory names. They primarily care about
> short strings like 'bin', 'usr', or 'home'. It seems to be not
> as good with Java source code lines.
>
>
> * string_hash31 T 2.76 6 1.36
> * string_hash31 L 2.86 6 1.43
>
> This is identical to what Sun JRE version 6 uses for
> java.lang.String hashCode():
>
> int hash = 0;
> for (; ptr < end; ptr++)
> hash = 31 * hash + (raw[ptr] & 0xff);
>
>
> Granted, this is only 1 source file out of many. But I think the
> current hash is really bad if I got this unlucky by having such a
> horrible chain length on the first file I tried it on. :-)
Sun's HashMap tries to improve bad hash functions by rehashing the hash
code, Maybe you too could do that.
-- robin