Re: [jgit-dev] RawTextComparator hash is bad
Robin Rosenberg <robin.rosenberg@xxxxxxxxxx> wrote:
> I couldn't resist. Here's a stolen competitor, which wins over the hashes
> above according to my measurements. You might want to measure it in your
> test suite.
> ...
> http://bretm.home.comcast.net/~bretm/hash/3.html

It's not really that much better. With these functions we don't need to run
the LK cleanup afterwards; it's safe to just take the lower N bits like
TRUNC does. But the inner loop of the hash function is far more complex.

My test suite isn't doing benchmarks, but I'm willing to go out on a limb
and say that doing 4 more shifts, 2 more additions, and 2 more xors *per
byte of a line* (like hash_tb does) is going to be slower than a single
multiply at the end of the line. If we assume the average line is at least
3 bytes to hash (aka "\t\t}\n") then hash_tb needs to expend an extra 24
clock cycles over djb2. For a slightly more typical line of, say, 15 bytes
("\tint count = 0;\n"), that's an extra 120 clock cycles per line. Integer
multiplication is slow, but it's not *that* slow.

So hash_t, hash_tx2 and hash_tb all have lower collision rates than djb2
TRUNC, but they are more complex in the inner loop and don't get us that
much better of a result than djb2 LK. I'd rather stick with djb2 LK.

test_jgit: 928 files; 122 avg. unique lines/file

  Algorithm             Max
  djb2 TRUNC              5
  djb2 LK                 6
  string_hash31 TRUNC    11
  string_hash31 LK        5
  hash_t TRUNC            5
  hash_t LK               5
  hash_tx2 TRUNC          6
  hash_tx2 LK             5
  hash_tb TRUNC           5
  hash_tb LK              7

test_linux26: 30198 files; 258 avg. unique lines/file

  Algorithm             Max
  djb2 TRUNC             32
  djb2 LK                 7
  string_hash31 TRUNC    32
  string_hash31 LK        7
  hash_t TRUNC            7
  hash_t LK               8
  hash_tx2 TRUNC          7
  hash_tx2 LK             7
  hash_tb TRUNC           7
  hash_tb LK              7

test_frameworks_base: 8381 files; 184 avg. unique lines/file

  Algorithm             Max
  djb2 TRUNC             10
  djb2 LK                 7
  string_hash31 TRUNC    13
  string_hash31 LK        6
  hash_t TRUNC            7
  hash_t LK               7
  hash_tx2 TRUNC          7
  hash_tx2 LK             6
  hash_tb TRUNC           6
  hash_tb LK              6

--
Shawn.
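P.S. For anyone following along, here is a minimal sketch of what the thread
means by djb2 with the two bucket-index strategies. This is my reading of the
discussion, not the JGit source: the class name, the mixing constant
0x9e370001, and the method names are illustrative assumptions; only the djb2
recurrence (h = h * 33 + byte) is the well-known Bernstein hash.

```java
// Sketch of djb2 over one line's bytes, plus the two bucket strategies
// compared in the tables above:
//   TRUNC - keep the lower n bits of the raw hash
//   LK    - one multiplicative mix at end-of-line, then keep the upper n bits
class LineHash {
    // djb2: h = h * 33 + byte, implemented as (h << 5) + h + byte
    static int djb2(byte[] buf, int ptr, int end) {
        int h = 5381;
        for (; ptr < end; ptr++)
            h = ((h << 5) + h) + (buf[ptr] & 0xff);
        return h;
    }

    // TRUNC: take the lower n bits as the bucket index.
    static int trunc(int h, int n) {
        return h & ((1 << n) - 1);
    }

    // LK: a single multiply per line (the cost discussed above), then the
    // upper n bits. 0x9e370001 is an assumed Knuth-style golden-ratio odd
    // constant, not necessarily the one JGit uses.
    static int lk(int h, int n) {
        h *= 0x9e370001;
        return h >>> (32 - n);
    }
}
```

The point of the cost argument is visible here: LK adds one multiply per
line, while the hash_tb-style functions add several shifts, adds and xors
inside the per-byte loop.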