[
Date Prev][
Date Next][
Thread Prev][
Thread Next][
Date Index][
Thread Index]
[
List Home]
Re: [jgit-dev] RawTextComparator hash is bad
|
Robin Rosenberg <robin.rosenberg@xxxxxxxxxx> wrote:
>
> I couldn't resist. Here's a stolen competitor [1], which wins over the hashes
> above according to my measurements. You might want to measure it in your test
> suite.
...
> [1] http://bretm.home.comcast.net/~bretm/hash/3.html
Its not really that much better. With these functions we don't need
to run the LK cleanup afterwards; its safe to just take the lower
N bits like TRUNC does. But the inner loop of the hash function
is far more complex.
My test suite isn't doing benchmarks, but I'm willing to go out on
a limb and say that doing 4 more shifts, 2 more additions, and 2
more xors *per byte of a line* (like hash_tb does) is going to be
slower than a single multiply at the end of the line.
If we assume the average line is at least 3 bytes to hash (aka
"\t\t}\n") then hash_tb needs to expend an extra 24 clock cycles
over djb2. For a slightly more typical line that has say 15 bytes
("\tint count = 0;\n"), its an at extra 120 clock cycles per line.
Integer multiplication is slow, but its not *that* slow.
So hash_t, hash_tx2, hash_tb all have lower collision rates than
djb2 TRUNC, but they are more complex on the inner loop and don't
get us that much better of a result than djb2 LK.
I'd rather stick with djb2 LK.
test_jgit: 928 files; 122 avg. unique lines/file
Algorithm Max
djb2 TRUNC 5
djb2 LK 6
string_hash31 TRUNC 11
string_hash31 LK 5
hash_t TRUNC 5
hash_t LK 5
hash_tx2 TRUNC 6
hash_tx2 LK 5
hash_tb TRUNC 5
hash_tb LK 7
test_linux26: 30198 files; 258 avg. unique lines/file
Algorithm Max
djb2 TRUNC 32
djb2 LK 7
string_hash31 TRUNC 32
string_hash31 LK 7
hash_t TRUNC 7
hash_t LK 8
hash_tx2 TRUNC 7
hash_tx2 LK 7
hash_tb TRUNC 7
hash_tb LK 7
test_frameworks_base: 8381 files; 184 avg. unique lines/file
Algorithm Max
djb2 TRUNC 10
djb2 LK 7
string_hash31 TRUNC 13
string_hash31 LK 6
hash_t TRUNC 7
hash_t LK 7
hash_tx2 TRUNC 7
hash_tx2 LK 6
hash_tb TRUNC 6
hash_tb LK 6
--
Shawn.