Re: [jgit-dev] RawTextComparator hash is bad

On Thursday 23 September 2010 at 20.42.13, Shawn O. Pearce wrote:
> RawTextComparator's hash() method is horrible for Java source code.
> Completely horrible.
> 
> I wrote a test to compare a few hash functions as run over the JGit
> source blob 3fad2c3 (this is a version of MyersDiff.java):
> 
>   574 lines; 398 unique; 8 bits = 256 table
>   Algorithm               Ratio       Max   StdDev
>   rtc T                   91.92       201  13.42
>   rtc L                    3.25         8   1.68
> 
>   djb2 T                   2.71         6   1.33
>   djb2 L                   2.84         6   1.42
> 
>   djb2_nows T              2.74         6   1.35
>   djb2_nows L              2.80         5   1.39
> 
>   full_name_hash T         2.82         7   1.41
>   full_name_hash L         2.73         6   1.34
> 
>   string_hash31 T          2.76         6   1.36
>   string_hash31 L          2.86         6   1.43
> 
> 
> The test computes the unique elements (398) and then inserts them
> into a simple hash table of 256 buckets, incrementing a counter
> for each new element that fell into that bucket.
> 
> Max shows the longest chain (total number of elements that fell
> into the same bucket).
> 
> Ratio is the estimated number of probes we need to perform to
> locate an element in the table, divided by the best-case number of
> probes if the elements were perfectly distributed into the buckets.
> A ratio of 1.0 therefore indicates a perfect hash function, one that
> distributes the elements evenly.
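
As an aside, the bucket-spread measurement described above can be
reproduced with a small standalone harness along these lines.  This is
only a sketch: the 8-bit table matches the numbers quoted, but the file
reading and the exact ratio and standard-deviation formulas are guesses
at what the test computed, and djb2 is just a placeholder for whichever
hash is under test.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class HashSpread {
        public static void main(String[] args) throws IOException {
            List<String> lines = Files.readAllLines(Paths.get(args[0]));
            Set<String> unique = new HashSet<>(lines);

            final int tableBits = 8;              // 2^8 = 256 buckets
            int[] bucket = new int[1 << tableBits];
            for (String line : unique)
                bucket[djb2(line) & ((1 << tableBits) - 1)]++;  // "T" reduction

            int n = unique.size(), max = 0;
            double probes = 0, sumSq = 0;
            for (int c : bucket) {
                max = Math.max(max, c);
                probes += c * (c + 1) / 2.0;      // cost of scanning each chain
                sumSq += (double) c * c;
            }
            double mean = (double) n / bucket.length;
            double stdDev = Math.sqrt(sumSq / bucket.length - mean * mean);
            double best = n * (mean + 1) / 2.0;   // perfectly even spread
            System.out.printf("%d unique; ratio=%.2f max=%d stddev=%.2f%n",
                    n, probes / best, max, stdDev);
        }

        static int djb2(String line) {
            int hash = 5381;
            for (int i = 0; i < line.length(); i++)
                hash = ((hash << 5) + hash) + (line.charAt(i) & 0xff);
            return hash;
        }
    }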
> 
> 
> Now to describe the functions.  The T and L suffixes mean different
> methods of converting from the raw hash value into a bucket index
> in the table.
> 
> * T: Simple case for a table size that is a power of 2:
> 
>     return hash & ((1 << table_bits) - 1);
> 
> * L: I got this from [1], which appears to be itself derived from
>      a hash cleanup used by the Linux kernel:
> 
>     /* 2^31 + 2^29 - 2^25 + 2^22 - 2^19 - 2^16 + 1 */
>     static final int GOLDEN_RATIO_PRIME_32 = 0x9e370001;
>     int bucket = hash * GOLDEN_RATIO_PRIME_32;
>     return bucket >>> (32 - table_bits);
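
Packaged as compilable helpers, the two reductions look like this; it is
just a restatement of the two snippets above, with table_bits passed in
as a parameter instead of being a free variable.

    final class Bucket {
        /** 2^31 + 2^29 - 2^25 + 2^22 - 2^19 - 2^16 + 1 */
        static final int GOLDEN_RATIO_PRIME_32 = 0x9e370001;

        /** "T": keep only the low-order tableBits bits of the hash. */
        static int t(int hash, int tableBits) {
            return hash & ((1 << tableBits) - 1);
        }

        /** "L": multiply so every input bit can reach the high bits,
            then take the top tableBits bits as the index. */
        static int l(int hash, int tableBits) {
            int bucket = hash * GOLDEN_RATIO_PRIME_32;
            return bucket >>> (32 - tableBits);
        }

        private Bucket() {
            // static utility class
        }
    }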
> 
> 
> * rtc T                   91.92       201  13.42
> * rtc L                    3.25         8   1.68
> 
>   This is the current RawTextComparator.DEFAULT function, which
>   appears to be based on what xdiff in C Git uses:
> 
>     int hash = 5381;
>     for (; ptr < end; ptr++)
>       hash = (hash << 5) ^ (raw[ptr] & 0xff);
> 
>   As we can see, it's a _horrible_ hash function.  201 lines out of
>   398 wound up in the same bucket.  With the T reduction only the low
>   8 bits of the hash survive, and because each step shifts the prior
>   hash left by 5, only the last two characters of a line can influence
>   those bits.  Applying the L cleanup routine improved matters
>   dramatically, because more bits from the high end of the hash become
>   involved in the final outcome.
> 
> 
> * djb2 T                   2.71         6   1.33
> * djb2 L                   2.84         6   1.42
> 
>   This is the original djb2 algorithm, before it got busted when
>   it was put into RawTextComparator.  It's far better than rtc above.
> 
>     int hash = 5381;
>     for (; ptr < end; ptr++)
>       hash = ((hash << 5) + hash) + (raw[ptr] & 0xff);
> 
> 
> * djb2_nows T              2.74         6   1.35
> * djb2_nows L              2.80         5   1.39
> 
>   This is a modification that squashes values <= ASCII space to 0,
>   and shifts all other values down by 0x20.  For ASCII-like sources
>   this gets us a tiny bit more space in the hash for important
>   characters, but as we can see the chain lengths aren't very
>   different from djb2.
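
The code for this variant is not quoted; taken literally, the described
modification would be something like the following (a reconstruction
from the prose above, meant to be dropped into a test class, not the
code that was actually benchmarked):

    // djb2, but with bytes <= ASCII space squashed to 0 and all other
    // byte values shifted down by 0x20 before they enter the hash.
    static int djb2NoWs(byte[] raw, int ptr, int end) {
        int hash = 5381;
        for (; ptr < end; ptr++) {
            int c = raw[ptr] & 0xff;
            c = (c <= 0x20) ? 0 : c - 0x20;
            hash = ((hash << 5) + hash) + c;
        }
        return hash;
    }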
> 
> 
> * full_name_hash T         2.82         7   1.41
> * full_name_hash L         2.73         6   1.34
> 
>   This is the function used by the Linux kernel inside of its VFS
>   layer to hash file or directory names.  They primarily care about
>   short strings like 'bin', 'usr', or 'home'.  It does not seem to
>   work as well on Java source code lines.
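
The kernel routine itself is not quoted above.  From memory, the
classic byte-at-a-time version mixes each byte with
(hash + (c << 4) + (c >> 4)) * 11, so a rough Java transcription would
look like the sketch below; treat the constants as approximate and
check the kernel sources before relying on it.

    // Rough transcription of the classic Linux VFS full_name_hash().
    static int fullNameHash(byte[] raw, int ptr, int end) {
        int hash = 0;                                   // init_name_hash()
        for (; ptr < end; ptr++) {
            int c = raw[ptr] & 0xff;
            hash = (hash + (c << 4) + (c >> 4)) * 11;   // partial_name_hash()
        }
        return hash;                                    // already 32 bits here
    }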
> 
> 
> * string_hash31 T          2.76         6   1.36
> * string_hash31 L          2.86         6   1.43
> 
>   This is identical to what Sun JRE version 6 uses for
>   java.lang.String hashCode():
> 
>     int hash = 0;
>     for (; ptr < end; ptr++)
>       hash = 31 * hash + (raw[ptr] & 0xff);
> 
> 
> Granted, this is only 1 source file out of many.  But I think the
> current hash is really bad if I got this unlucky by having such a
> horrible chain length on the first file I tried it on.  :-)

Sun's HashMap tries to improve bad hash functions by rehashing the hash
code.  Maybe you could do that too.
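
For reference, the supplemental hash that Sun's JDK 6 HashMap runs
every hashCode() through before indexing is essentially the following
(quoted from memory, so check the actual source before copying it):

    // Spread the high-order bits downward so that power-of-two
    // indexing (h & (length - 1)) still sees them.
    static int supplementalHash(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

Applied on top of any of the hashes above, it plays much the same role
as the "L" cleanup step.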

-- robin

