Re: [jgit-dev] RawTextComparator hash is bad
Robin Rosenberg <robin.rosenberg@xxxxxxxxxx> wrote:
> I couldn't resist. Here's a stolen competitor, which wins over the hashes
> above according to my measurements. You might want to measure it in your
> test suite.
> ...
> http://bretm.home.comcast.net/~bretm/hash/3.html

It's not really that much better. With these functions we don't need to run
the LK cleanup afterwards; it's safe to just take the lower N bits like
TRUNC does. But the inner loop of the hash function is far more complex.

My test suite isn't doing benchmarks, but I'm willing to go out on a limb
and say that doing 4 more shifts, 2 more additions, and 2 more xors *per
byte of a line* (like hash_tb does) is going to be slower than a single
multiply at the end of the line. If we assume the average line is at least
3 bytes to hash (aka "\t\t}\n") then hash_tb needs to expend an extra 24
clock cycles over djb2. For a slightly more typical line of, say, 15 bytes
("\tint count = 0;\n"), that's an extra 120 clock cycles per line. Integer
multiplication is slow, but it's not *that* slow.

So hash_t, hash_tx2 and hash_tb all have lower collision rates than djb2
TRUNC, but they are more complex in the inner loop and don't get us that
much better of a result than djb2 LK. I'd rather stick with djb2 LK.

test_jgit: 928 files; 122 avg. unique lines/file

  Algorithm             Max
  djb2 TRUNC              5
  djb2 LK                 6
  string_hash31 TRUNC    11
  string_hash31 LK        5
  hash_t TRUNC            5
  hash_t LK               5
  hash_tx2 TRUNC          6
  hash_tx2 LK             5
  hash_tb TRUNC           5
  hash_tb LK              7

test_linux26: 30198 files; 258 avg. unique lines/file

  Algorithm             Max
  djb2 TRUNC             32
  djb2 LK                 7
  string_hash31 TRUNC    32
  string_hash31 LK        7
  hash_t TRUNC            7
  hash_t LK               8
  hash_tx2 TRUNC          7
  hash_tx2 LK             7
  hash_tb TRUNC           7
  hash_tb LK              7

test_frameworks_base: 8381 files; 184 avg. unique lines/file

  Algorithm             Max
  djb2 TRUNC             10
  djb2 LK                 7
  string_hash31 TRUNC    13
  string_hash31 LK        6
  hash_t TRUNC            7
  hash_t LK               7
  hash_tx2 TRUNC          7
  hash_tx2 LK             6
  hash_tb TRUNC           6
  hash_tb LK              6

--
Shawn.
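P.S. For anyone following along, here is a minimal sketch of what the thread
means by djb2 with the two bucket-index strategies. This is my reading of the
discussion, not the JGit source: the class name, the mixing constant
0x9e370001, and the method names are illustrative assumptions; only the djb2
recurrence (h = h * 33 + byte) is the well-known Bernstein hash.

```java
// Sketch of djb2 over one line's bytes, plus the two bucket strategies
// compared in the tables above:
//   TRUNC - keep the lower n bits of the raw hash
//   LK    - one multiplicative mix at end-of-line, then keep the upper n bits
class LineHash {
    // djb2: h = h * 33 + byte, implemented as (h << 5) + h + byte
    static int djb2(byte[] buf, int ptr, int end) {
        int h = 5381;
        for (; ptr < end; ptr++)
            h = ((h << 5) + h) + (buf[ptr] & 0xff);
        return h;
    }

    // TRUNC: take the lower n bits as the bucket index.
    static int trunc(int h, int n) {
        return h & ((1 << n) - 1);
    }

    // LK: a single multiply per line (the cost discussed above), then the
    // upper n bits. 0x9e370001 is an assumed Knuth-style golden-ratio odd
    // constant, not necessarily the one JGit uses.
    static int lk(int h, int n) {
        h *= 0x9e370001;
        return h >>> (32 - n);
    }
}
```

The point of the cost argument is visible here: LK adds one multiply per
line, while the hash_tb-style functions add several shifts, adds and xors
inside the per-byte loop.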