[jgit-dev] RawTextComparator hash is bad

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] [List Home]

[jgit-dev] RawTextComparator hash is bad

From: "Shawn O. Pearce" <spearce@xxxxxxxxxxx>
Date: Thu, 23 Sep 2010 11:42:13 -0700
Delivered-to: jgit-dev@xxxxxxxxxxx
List-archive: <https://dev.eclipse.org/mailman/private/jgit-dev>
List-help: <mailto:jgit-dev-request@eclipse.org?subject=help>
List-subscribe: <https://dev.eclipse.org/mailman/listinfo/jgit-dev>, <mailto:jgit-dev-request@eclipse.org?subject=subscribe>
List-unsubscribe: <https://dev.eclipse.org/mailman/options/jgit-dev>, <mailto:jgit-dev-request@eclipse.org?subject=unsubscribe>
User-agent: Mutt/1.5.17+20080114 (2008-01-14)

RawTextComparator's hash() method is horrible for Java source code.
Completely horrible.

I wrote a test to compare a few hash functions as run over the JGit
source blob 3fad2c3 (this is a version of MyersDiff.java):

  574 lines; 398 unique; 8 bits = 256 table
  Algorithm               Ratio       Max   StdDev
  rtc T                   91.92       201  13.42
  rtc L                    3.25         8   1.68

  djb2 T                   2.71         6   1.33
  djb2 L                   2.84         6   1.42

  djb2_nows T              2.74         6   1.35
  djb2_nows L              2.80         5   1.39

  full_name_hash T         2.82         7   1.41
  full_name_hash L         2.73         6   1.34

  string_hash31 T          2.76         6   1.36
  string_hash31 L          2.86         6   1.43


The test computes the unique elements (398) and then inserts them
into a simple hash table of 256 buckets, incrementing a counter
for each new element that fell into that bucket.

Max shows the longest chain (total number of elements that fell
into the same bucket).

Ratio is the estimated number of probes we need to perform to
locate an element in the table, divided by the best-case number of
probes if the elements were perfectly distributed into the buckets.
A ratio of 1.0 is therefore a perfect hash function, as it evenly
distributed the elements.


Now to describe the functions.  The T and L suffixes mean different
methods of converting from the raw hash value into a bucket index
in the table.

* T: Simple case for a table size that is a power of 2:

    return hash & ((1 << table_bits) - 1);

* L: I got this from [1], which appears to be itself derived from
     a hash cleanup used by the Linux kernel:

    /* 2^31 + 2^29 - 2^25 + 2^22 - 2^19 - 2^16 + 1 */
    static final int GOLDEN_RATIO_PRIME_32 = 0x9e370001;
    int bucket = hash * GOLDEN_RATIO_PRIME_32;
    return bucket >>> (32 - table_bits);


* rtc T                   91.92       201  13.42
* rtc L                    3.25         8   1.68

  This is the current RawTextComparator.DEFAULT function, which
  appears to be based on what xdiff in C Git uses:

    int hash = 5381;
    for (; ptr < end; ptr++)
      hash = (hash << 5) ^ (raw[ptr] & 0xff);

  As we can see, its a _horrible_ hash function.  201 lines out of
  398 wound up in the same bucket.  Applying the L cleanup routine
  improved matters dramatically, because more bits from the high
  end of the hash become involved in the final outcome.


* djb2 T                   2.71         6   1.33
* djb2 L                   2.84         6   1.42

  This is the original djb2 algorithm, before it got busted when
  it was put into RawTextComparator.  Its far better than rtc above.

    int hash = 5381;
    for (; ptr < end; ptr++)
      hash = ((hash << 5) + hash) + (raw[ptr] & 0xff);


* djb2_nows T              2.74         6   1.35
* djb2_nows L              2.80         5   1.39

  This is a modification that squashes values <= ASCII space to 0,
  and shifts all other values down by 0x20.  For ASCII like sources
  this gets us a tiny bit more space in the hash for important
  characters, but as we can see the chain lengths aren't very
  different from djb2.


* full_name_hash T         2.82         7   1.41
* full_name_hash L         2.73         6   1.34

  This is the function used by the Linux kernel inside of its VFS
  layer to hash file or directory names.  They primarily care about
  short strings like 'bin', 'usr', or 'home'.  It seems to be not
  as good with Java source code lines.


* string_hash31 T          2.76         6   1.36
* string_hash31 L          2.86         6   1.43

  This is identical to what Sun JRE version 6 uses for
  java.lang.String hashCode():

    int hash = 0;
    for (; ptr < end; ptr++)
      hash = 31 * hash + (raw[ptr] & 0xff);


Granted, this is only 1 source file out of many.  But I think the
current hash is really bad if I got this unlucky by having such a
horrible chain length on the first file I tried it on.  :-)

This afternoon I'll run some tests on a bunch of different source
files and see if I can get more information about how bad the chain
lengths might get.


[1] http://www.spinics.net/lists/netdev/msg110556.html

-- 
Shawn.

Follow-Ups:
- Re: [jgit-dev] RawTextComparator hash is bad
  - From: Shawn O. Pearce
- Re: [jgit-dev] RawTextComparator hash is bad
  - From: Robin Rosenberg

Prev by Date: Re: [jgit-dev] Maven Repositories
Next by Date: Re: [jgit-dev] RawTextComparator hash is bad
Previous by thread: [jgit-dev] Implement cherry-pick by merge
Next by thread: Re: [jgit-dev] RawTextComparator hash is bad
Index(es):
- Date
- Thread

Breadcrumbs