Re: [jgit-dev] RawTextComparator hash is bad
On Thursday, 23 September 2010, at 22:04:28, Shawn O. Pearce wrote:
> "Shawn O. Pearce" <spearce@xxxxxxxxxxx> wrote:
> > RawTextComparator's hash() method is horrible for Java source code.
> > Completely horrible.
>
> ...
>
> > This afternoon I'll run some tests on a bunch of different source
> > files and see if I can get more information about how bad the chain
> > lengths might get.
>
> And here are test results run over some "common" repositories that
> I happen to have locally. The test scans every text file in the
> current HEAD revision to determine the maximum chain length when
> using a given hash function.
>
> The hash table is sized for each file to be the next power of 2
> larger than the number of unique lines in that file. So a file
> with 122 unique lines will use a hash table of 128 buckets; a file
> with 258 unique lines uses a 512 bucket table. This is what we do
> inside of HistogramDiff.
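The sizing rule above (round the unique-line count up to a power of two) can be
sketched like this; just an illustration, not the actual HistogramDiff code,
and `tableSize` is a hypothetical helper name:

```java
public class TableSize {
	// Smallest power of two >= n, for n >= 1 — the per-file table
	// sizing described above (122 lines -> 128 buckets, 258 -> 512).
	static int tableSize(int n) {
		int p = 1;
		while (p < n)
			p <<= 1;
		return p;
	}
}
```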
>
> The Max column reports the maximum number of elements in a single
> bucket, for any single file in the repository.
>
>
> test_jgit: 931 files; 122 avg. unique lines/file
> Algorithm Max
> rtc TRUNC 418
> rtc LK 174
>
> djb2 TRUNC 5
> djb2 LK 6
>
> string_hash31 TRUNC 11
> string_hash31 LK 5
>
> crapwow32 TRUNC 5
> crapwow32 LK 6
>
> sha1 TRUNC 6
> sha1 LK 6
>
> test_linux26: 30198 files; 258 avg. unique lines/file
> Algorithm Max
> rtc TRUNC 8675
> rtc LK 1894
>
> djb2 TRUNC 32
> djb2 LK 7
>
> string_hash31 TRUNC 32
> string_hash31 LK 7
>
> crapwow32 TRUNC 54
> crapwow32 LK 54
>
> sha1 TRUNC 8
> sha1 LK 7
>
> test_frameworks_base: 8381 files; 184 avg. unique lines/file
> Algorithm Max
> rtc TRUNC 4615
> rtc LK 1592
>
> djb2 TRUNC 10
> djb2 LK 7
>
> string_hash31 TRUNC 13
> string_hash31 LK 6
>
> crapwow32 TRUNC 16
> crapwow32 LK 17
>
> sha1 TRUNC 6
> sha1 LK 6
>
>
> As we can see, "rtc" (aka the current RawTextComparator
> hash) is utter crap. Even after applying the Linux kernel
> GOLDEN_RATIO_PRIME_32 cleanup step ("rtc LK") it's producing
> collisions into the thousands for some files in the Linux kernel
> or the Android platform/frameworks/base repository.
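For reference, the "LK" cleanup mentioned above is the kernel's hash_32()-style
finalizer: multiply by GOLDEN_RATIO_PRIME_32 and keep the top bits. A sketch,
assuming the constant used by older kernels:

```java
public class LkCleanup {
	// Linux kernel GOLDEN_RATIO_PRIME_32: a prime near 2^32 / phi.
	static final int GOLDEN_RATIO_PRIME_32 = 0x9e370001;

	// hash_32()-style finalizer: multiply by the golden-ratio prime,
	// then keep only the top `bits` bits of the product.
	static int hash32(int h, int bits) {
		h *= GOLDEN_RATIO_PRIME_32;
		return h >>> (32 - bits);
	}
}
```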
>
> crapwow32 is a 32 bit version of CrapWow[1] that I got off GitHub[2].
> The code complexity is very high, and its collision count isn't as
> low as the simpler djb2 or string_hash31.
>
> sha1 is applying SHA-1 to the line, and then using only the first
> 4 bytes of the result as the 32 bit hash code. I offer this as an
> example because members of the Git community typically consider
> SHA-1 to be very collision resistant, but it's too slow for this
> sort of application. Notice in the output above that its collision
> rates are actually very close to "djb2 LK" or "string_hash31 LK".
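The "sha1" variant above could be implemented roughly like this (a sketch with
a hypothetical helper name, using the standard MessageDigest API):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Sha1Hash {
	// SHA-1 the line, keep only the first 4 digest bytes as the
	// 32-bit hash code (big-endian), per the "sha1" rows above.
	static int sha1Hash(byte[] line) {
		try {
			byte[] d = MessageDigest.getInstance("SHA-1").digest(line);
			return ((d[0] & 0xff) << 24) | ((d[1] & 0xff) << 16)
					| ((d[2] & 0xff) << 8) | (d[3] & 0xff);
		} catch (NoSuchAlgorithmException e) {
			throw new AssertionError(e); // SHA-1 ships with every JRE
		}
	}
}
```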
>
>
> Based on these results we should use either djb2 or string_hash31,
> with the LK cleanup applied after. The thing is, these are virtually
> identical hash routines:
>
> djb2: h = ((h << 5) + h) + (raw[ptr] & 0xff);
> string_hash31: h = (31 * h) + (raw[ptr] & 0xff);
>
> Either multiply by 33 (djb2) or by 31. And the latter can obviously
> be written without the multiply:
>
> string_hash31: h = ((h << 5) - h) + (raw[ptr] & 0xff);
The shift trick obviously works just as well with 33 as with 31:

string_hash33: h = ((h << 5) + h) + (raw[ptr] & 0xff);
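As a sanity check, the shift forms really are the same multiplies. A small
sketch with hypothetical helper names:

```java
public class ShiftHash {
	// djb2-style: multiply by 33, written with the shift trick.
	static int hash33(byte[] raw) {
		int h = 5381; // djb2's traditional seed
		for (byte b : raw)
			h = ((h << 5) + h) + (b & 0xff); // h * 33 + byte
		return h;
	}

	// string_hash31-style: multiply by 31 via shift-and-subtract,
	// the same recurrence as java.lang.String.hashCode().
	static int hash31(byte[] raw) {
		int h = 0;
		for (byte b : raw)
			h = ((h << 5) - h) + (b & 0xff); // h * 31 + byte
		return h;
	}
}
```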
I couldn't resist. Here's a stolen competitor [1], which beats the hashes
above in my measurements; you might want to measure it in your test suite.
It is slightly more expensive to compute even with a single round of bit
mixing. Repeating the mixing makes it much slower, but the distribution
into buckets gives a few per cent fewer scans through the buckets.

Update: Cutting away part of the bit-mixing code surprised me by giving even
better hashing and (not surprisingly) better performance, very close to
string_hash33 with shifting. But I only ran it on the MyersDiff source, so I'm
very curious to see what your test suite gives.
public static int hash_t(byte[] s) {
	int h = 0x79FEf6E8;
	for (int i = 0; i < s.length; ++i) {
		int v = s[i] & 0xFF;
		h += v + (v << 16);
		// Jenkins mixing. Repeat this block two or three
		// times for better hashing. (inline, not loop)
		h += (h << 12);
		h ^= (h >> 22);
		h += (h << 4);
		h ^= (h >> 9);
		h += (h << 10);
		h ^= (h >> 2);
		h += (h << 7);
		h ^= (h >> 12);
	}
	return h;
}

public static int hash_tx2(byte[] s) {
	int h = 0x79FEf6E8;
	for (int i = 0; i < s.length; ++i) {
		int v = s[i] & 0xFF;
		h += v + (v << 16);
		// Jenkins mixing x 2
		h += (h << 12);
		h ^= (h >> 22);
		h += (h << 4);
		h ^= (h >> 9);
		h += (h << 10);
		h ^= (h >> 2);
		h += (h << 7);
		h ^= (h >> 12);
		h += (h << 12);
		h ^= (h >> 22);
		h += (h << 4);
		h ^= (h >> 9);
		h += (h << 10);
		h ^= (h >> 2);
		h += (h << 7);
		h ^= (h >> 12);
	}
	return h;
}

public static int hash_tb(byte[] s) {
	int h = 0x79FEf6E8;
	for (int i = 0; i < s.length; ++i) {
		int v = s[i] & 0xFF;
		h += v + (v << 16);
		// Jenkins mixing, truncated to the first half.
		h += (h << 12);
		h ^= (h >> 22);
		h += (h << 4);
		h ^= (h >> 9);
	}
	return h;
}
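To reproduce the chain-length measurement described earlier (power-of-two
table, longest chain in any bucket), something like this rough sketch should
do; `maxChain` is a hypothetical helper, not the actual test harness:

```java
import java.util.HashMap;
import java.util.Map;

public class ChainLength {
	// Longest chain when the given 32-bit hash codes land in a table
	// of `buckets` slots. `buckets` must be a power of two, so a mask
	// picks the slot instead of a modulo.
	static int maxChain(int[] hashes, int buckets) {
		Map<Integer, Integer> count = new HashMap<>();
		int max = 0;
		for (int h : hashes) {
			int slot = h & (buckets - 1);
			int c = count.merge(slot, 1, Integer::sum);
			if (c > max)
				max = c;
		}
		return max;
	}
}
```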
-- robin
[1] http://bretm.home.comcast.net/~bretm/hash/3.html