Skip to main content

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] [List Home]
Re: [jgit-dev] RawTextComparator hash is bad

"Shawn O. Pearce" <spearce@xxxxxxxxxxx> wrote:
> RawTextComparator's hash() method is horrible for Java source code.
> Completely horrible.
> This afternoon I'll run some tests on a bunch of different source
> files and see if I can get more information about how bad the chain
> lengths might get.

And here are test results run over some "common" repositories that
I happen to have locally.  The test scans every text file in the
current HEAD revision to determine the maximum chain length when
using a given hash function.

The hash table is sized for each file to be the next power of 2
larger than the number of unique lines in that file.  So a file
with 122 unique lines will use a hash table of 128 buckets; a file
with 258 unique lines uses a 512 bucket table.  This is what we do
inside of HistogramDiff.

The Max column reports the maximum number of elements in a single
bucket, for any single file in the repository.

test_jgit: 931 files; 122 avg. unique lines/file
  Algorithm                  Max
  rtc TRUNC                  418
  rtc LK                     174

  djb2 TRUNC                   5
  djb2 LK                      6

  string_hash31 TRUNC         11
  string_hash31 LK             5

  crapwow32 TRUNC              5
  crapwow32 LK                 6

  sha1 TRUNC                   6
  sha1 LK                      6

test_linux26: 30198 files; 258 avg. unique lines/file
  Algorithm                  Max
  rtc TRUNC                 8675
  rtc LK                    1894

  djb2 TRUNC                  32
  djb2 LK                      7

  string_hash31 TRUNC         32
  string_hash31 LK             7

  crapwow32 TRUNC             54
  crapwow32 LK                54

  sha1 TRUNC                   8
  sha1 LK                      7

test_frameworks_base: 8381 files; 184 avg. unique lines/file
  Algorithm                  Max
  rtc TRUNC                 4615
  rtc LK                    1592

  djb2 TRUNC                  10
  djb2 LK                      7

  string_hash31 TRUNC         13
  string_hash31 LK             6

  crapwow32 TRUNC             16
  crapwow32 LK                17

  sha1 TRUNC                   6
  sha1 LK                      6

As we can see, "rtc" (aka the current RawTextComparator
hash) is utter crap.  Even after applying the Linux kernel
GOLDEN_RATIO_PRIME_32 cleanup step ("rtc LK") its producing
collisions into the thousands for some files in the Linux kernel
or the Android platform/frameworks/base repository.

crapwow32 is a 32 bit version of CrapWow[1] that I got off GitHub[2].
The code complexity is very high, and its collision count isn't as
low as the simpler djb2 or string_hash31.

sha1 is applying SHA-1 to the line, and then using only the first
4 bytes of the result as the 32 bit hash code.  I offer this as an
example because members of the Git community typically consider
SHA-1 to be very collision resistant, but its too slow for this
sort of application.  Notice in the output above that its collision
rates are actually very close to "djb2 LK" or "string_hash31 LK".

Based on these results we should use either djb2 or string_hash31,
with the LK cleanup applied after.  The thing is, these are virtually
identical hash routines:

  djb2:           h = ((h << 5) + h) + (raw[ptr] & 0xff);
  string_hash31:  h = (31 * h)       + (raw[ptr] & 0xff)

Either multiply by 33 (djb2) or by 31.  And the latter can obviously
be written without the multiply:

  string_hash31:  h = (h << 5) - h)  + (raw[ptr] & 0xff)



Back to the top