[
Date Prev][
Date Next][
Thread Prev][
Thread Next][
Date Index][
Thread Index]
[
List Home]
Re: [jgit-dev] RawTextComparator hash is bad
|
"Shawn O. Pearce" <spearce@xxxxxxxxxxx> wrote:
> RawTextComparator's hash() method is horrible for Java source code.
> Completely horrible.
...
> This afternoon I'll run some tests on a bunch of different source
> files and see if I can get more information about how bad the chain
> lengths might get.
And here are test results run over some "common" repositories that
I happen to have locally. The test scans every text file in the
current HEAD revision to determine the maximum chain length when
using a given hash function.
The hash table is sized for each file to be the next power of 2
larger than the number of unique lines in that file. So a file
with 122 unique lines will use a hash table of 128 buckets; a file
with 258 unique lines uses a 512 bucket table. This is what we do
inside of HistogramDiff.
The Max column reports the maximum number of elements in a single
bucket, for any single file in the repository.
test_jgit: 931 files; 122 avg. unique lines/file
Algorithm Max
rtc TRUNC 418
rtc LK 174
djb2 TRUNC 5
djb2 LK 6
string_hash31 TRUNC 11
string_hash31 LK 5
crapwow32 TRUNC 5
crapwow32 LK 6
sha1 TRUNC 6
sha1 LK 6
test_linux26: 30198 files; 258 avg. unique lines/file
Algorithm Max
rtc TRUNC 8675
rtc LK 1894
djb2 TRUNC 32
djb2 LK 7
string_hash31 TRUNC 32
string_hash31 LK 7
crapwow32 TRUNC 54
crapwow32 LK 54
sha1 TRUNC 8
sha1 LK 7
test_frameworks_base: 8381 files; 184 avg. unique lines/file
Algorithm Max
rtc TRUNC 4615
rtc LK 1592
djb2 TRUNC 10
djb2 LK 7
string_hash31 TRUNC 13
string_hash31 LK 6
crapwow32 TRUNC 16
crapwow32 LK 17
sha1 TRUNC 6
sha1 LK 6
As we can see, "rtc" (aka the current RawTextComparator
hash) is utter crap. Even after applying the Linux kernel
GOLDEN_RATIO_PRIME_32 cleanup step ("rtc LK") its producing
collisions into the thousands for some files in the Linux kernel
or the Android platform/frameworks/base repository.
crapwow32 is a 32 bit version of CrapWow[1] that I got off GitHub[2].
The code complexity is very high, and its collision count isn't as
low as the simpler djb2 or string_hash31.
sha1 is applying SHA-1 to the line, and then using only the first
4 bytes of the result as the 32 bit hash code. I offer this as an
example because members of the Git community typically consider
SHA-1 to be very collision resistant, but its too slow for this
sort of application. Notice in the output above that its collision
rates are actually very close to "djb2 LK" or "string_hash31 LK".
Based on these results we should use either djb2 or string_hash31,
with the LK cleanup applied after. The thing is, these are virtually
identical hash routines:
djb2: h = ((h << 5) + h) + (raw[ptr] & 0xff);
string_hash31: h = (31 * h) + (raw[ptr] & 0xff)
Either multiply by 33 (djb2) or by 31. And the latter can obviously
be written without the multiply:
string_hash31: h = (h << 5) - h) + (raw[ptr] & 0xff)
[1] http://www.team5150.com/~andrew/noncryptohashzoo/CrapWow.html
[2] http://github.com/sunnygleason/g414-hash/blob/master/src/main/java/com/g414/hash/impl/CWowHash.java
--
Shawn.