[
Date Prev][
Date Next][
Thread Prev][
Thread Next][
Date Index][
Thread Index]
[
List Home]
Re: [jgit-dev] RawTextComparator hash is bad
|
The Java String hash goes on a stride after a certain length for longer values. Is there any merit in skipping out some characters for faster hash computation? Or maybe skip whitespace chars?
Finally - aren't there likely to be lots of lines which have the same content (and therefore hash) - lines containing a } or <tab>} will probably be both common and repeated.
Alex
Sent from my iPhone 4
On 24 Sep 2010, at 00:57, "Shawn O. Pearce" <spearce@xxxxxxxxxxx> wrote:
> Robin Rosenberg <robin.rosenberg@xxxxxxxxxx> wrote:
>>
>> I couldn't resist. Here's a stolen competitor [1], which wins over the hashes
>> above according to my measurements. You might want to measure it in your test
>> suite.
> ...
>> [1] http://bretm.home.comcast.net/~bretm/hash/3.html
>
> Its not really that much better. With these functions we don't need
> to run the LK cleanup afterwards; its safe to just take the lower
> N bits like TRUNC does. But the inner loop of the hash function
> is far more complex.
>
> My test suite isn't doing benchmarks, but I'm willing to go out on
> a limb and say that doing 4 more shifts, 2 more additions, and 2
> more xors *per byte of a line* (like hash_tb does) is going to be
> slower than a single multiply at the end of the line.
>
> If we assume the average line is at least 3 bytes to hash (aka
> "\t\t}\n") then hash_tb needs to expend an extra 24 clock cycles
> over djb2. For a slightly more typical line that has say 15 bytes
> ("\tint count = 0;\n"), its an at extra 120 clock cycles per line.
> Integer multiplication is slow, but its not *that* slow.
>
> So hash_t, hash_tx2, hash_tb all have lower collision rates than
> djb2 TRUNC, but they are more complex on the inner loop and don't
> get us that much better of a result than djb2 LK.
>
> I'd rather stick with djb2 LK.
>
>
> test_jgit: 928 files; 122 avg. unique lines/file
> Algorithm Max
> djb2 TRUNC 5
> djb2 LK 6
>
> string_hash31 TRUNC 11
> string_hash31 LK 5
>
> hash_t TRUNC 5
> hash_t LK 5
>
> hash_tx2 TRUNC 6
> hash_tx2 LK 5
>
> hash_tb TRUNC 5
> hash_tb LK 7
>
> test_linux26: 30198 files; 258 avg. unique lines/file
> Algorithm Max
> djb2 TRUNC 32
> djb2 LK 7
>
> string_hash31 TRUNC 32
> string_hash31 LK 7
>
> hash_t TRUNC 7
> hash_t LK 8
>
> hash_tx2 TRUNC 7
> hash_tx2 LK 7
>
> hash_tb TRUNC 7
> hash_tb LK 7
>
> test_frameworks_base: 8381 files; 184 avg. unique lines/file
> Algorithm Max
> djb2 TRUNC 10
> djb2 LK 7
>
> string_hash31 TRUNC 13
> string_hash31 LK 6
>
> hash_t TRUNC 7
> hash_t LK 7
>
> hash_tx2 TRUNC 7
> hash_tx2 LK 6
>
> hash_tb TRUNC 6
> hash_tb LK 6
>
>
>
> --
> Shawn.
> _______________________________________________
> jgit-dev mailing list
> jgit-dev@xxxxxxxxxxx
> https://dev.eclipse.org/mailman/listinfo/jgit-dev