Re: [jgit-dev] RawTextComparator hash is bad

On Thursday, 23 September 2010 at 22:04:28, Shawn O. Pearce wrote:
> "Shawn O. Pearce" <spearce@xxxxxxxxxxx> wrote:
> > RawTextComparator's hash() method is horrible for Java source code.
> > Completely horrible.
> 
> ...
> 
> > This afternoon I'll run some tests on a bunch of different source
> > files and see if I can get more information about how bad the chain
> > lengths might get.
> 
> And here are test results run over some "common" repositories that
> I happen to have locally.  The test scans every text file in the
> current HEAD revision to determine the maximum chain length when
> using a given hash function.
> 
> The hash table is sized for each file to be the next power of 2
> larger than the number of unique lines in that file.  So a file
> with 122 unique lines will use a hash table of 128 buckets; a file
> with 258 unique lines uses a 512 bucket table.  This is what we do
> inside of HistogramDiff.
> 
> The Max column reports the maximum number of elements in a single
> bucket, for any single file in the repository.
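(Interjecting for context, in case anyone wants to replay this at home: a
minimal, hypothetical sketch of how I read the measurement above, with the
TRUNC variants simply masking the hash into the power-of-two table. The
helper names are mine, not from Shawn's actual test code.)

	interface LineHash {
		int hash(byte[] line);
	}

	// Smallest power of two larger than the number of unique lines
	// (122 -> 128, 258 -> 512, as described above).
	static int tableSize(int uniqueLines) {
		int sz = 1;
		while (sz <= uniqueLines)
			sz <<= 1;
		return sz;
	}

	// The "Max" column: the largest number of elements landing in any
	// single bucket, here with TRUNC indexing (mask off the low bits).
	static int maxChainLength(java.util.List<byte[]> uniqueLines, LineHash fn) {
		int size = tableSize(uniqueLines.size());
		int[] count = new int[size];
		int max = 0;
		for (byte[] line : uniqueLines) {
			int bucket = fn.hash(line) & (size - 1);
			max = Math.max(max, ++count[bucket]);
		}
		return max;
	}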
> 
> 
> test_jgit: 931 files; 122 avg. unique lines/file
>   Algorithm                  Max
>   rtc TRUNC                  418
>   rtc LK                     174
> 
>   djb2 TRUNC                   5
>   djb2 LK                      6
> 
>   string_hash31 TRUNC         11
>   string_hash31 LK             5
> 
>   crapwow32 TRUNC              5
>   crapwow32 LK                 6
> 
>   sha1 TRUNC                   6
>   sha1 LK                      6
> 
> test_linux26: 30198 files; 258 avg. unique lines/file
>   Algorithm                  Max
>   rtc TRUNC                 8675
>   rtc LK                    1894
> 
>   djb2 TRUNC                  32
>   djb2 LK                      7
> 
>   string_hash31 TRUNC         32
>   string_hash31 LK             7
> 
>   crapwow32 TRUNC             54
>   crapwow32 LK                54
> 
>   sha1 TRUNC                   8
>   sha1 LK                      7
> 
> test_frameworks_base: 8381 files; 184 avg. unique lines/file
>   Algorithm                  Max
>   rtc TRUNC                 4615
>   rtc LK                    1592
> 
>   djb2 TRUNC                  10
>   djb2 LK                      7
> 
>   string_hash31 TRUNC         13
>   string_hash31 LK             6
> 
>   crapwow32 TRUNC             16
>   crapwow32 LK                17
> 
>   sha1 TRUNC                   6
>   sha1 LK                      6
> 
> 
> As we can see, "rtc" (aka the current RawTextComparator
> hash) is utter crap.  Even after applying the Linux kernel
> GOLDEN_RATIO_PRIME_32 cleanup step ("rtc LK") it's producing
> collisions into the thousands for some files in the Linux kernel
> or the Android platform/frameworks/base repository.
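(For readers who, like me, had to look up "LK": as far as I can tell it is
the Linux kernel's hash_32() finalizer from include/linux/hash.h, i.e.
multiply by GOLDEN_RATIO_PRIME_32 and keep the top bits as the bucket index,
rather than masking the low bits as TRUNC does. Roughly, my reading:)

	// "LK" cleanup as I understand it: Linux kernel hash_32() style.
	// tableBits is log2 of the power-of-two table size.
	static int lkBucket(int hash, int tableBits) {
		hash *= 0x9e370001;               // GOLDEN_RATIO_PRIME_32
		return hash >>> (32 - tableBits); // keep the high-order bits
	}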
> 
> crapwow32 is a 32 bit version of CrapWow[1] that I got off GitHub[2].
> The code complexity is very high, and its collision count isn't as
> low as the simpler djb2 or string_hash31.
> 
> sha1 is applying SHA-1 to the line, and then using only the first
> 4 bytes of the result as the 32 bit hash code.  I offer this as an
> example because members of the Git community typically consider
> SHA-1 to be very collision-resistant, but it's too slow for this
> sort of application.  Notice in the output above that its collision
> rates are actually very close to "djb2 LK" or "string_hash31 LK".
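(For completeness, the sha1 data point presumably amounts to something like
the following; a sketch of my own using java.security.MessageDigest, not
Shawn's actual test code.)

	// SHA-1 the line, then keep only the first 4 digest bytes as the
	// 32 bit hash code (big-endian).
	static int sha1Hash(byte[] line) throws Exception {
		byte[] d = java.security.MessageDigest.getInstance("SHA-1").digest(line);
		return ((d[0] & 0xff) << 24) | ((d[1] & 0xff) << 16)
				| ((d[2] & 0xff) << 8) | (d[3] & 0xff);
	}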
> 
> 
> Based on these results we should use either djb2 or string_hash31,
> with the LK cleanup applied after.  The thing is, these are virtually
> identical hash routines:
> 
>   djb2:           h = ((h << 5) + h) + (raw[ptr] & 0xff);
>   string_hash31:  h = (31 * h)       + (raw[ptr] & 0xff)
> 
> Either multiply by 33 (djb2) or by 31.  And the latter can obviously
> be written without the multiply:
> 
>   string_hash31:  h = ((h << 5) - h) + (raw[ptr] & 0xff)

The shift form obviously works just as well with 33 as with 31:

	string_hash33:  h = ((h << 5) + h) + (raw[ptr] & 0xff)
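Spelled out over a whole line they all look the same; here is my (untested)
reading of these variants, assuming the usual djb2 seed of 5381 and a zero
seed for the 31 string hash:

	public static int djb2(byte[] s) {
		int h = 5381;                           // djb2's traditional seed
		for (int i = 0; i < s.length; ++i)
			h = ((h << 5) + h) + (s[i] & 0xff); // h * 33 + c
		return h;
	}

	public static int string_hash31(byte[] s) {
		int h = 0;
		for (int i = 0; i < s.length; ++i)
			h = ((h << 5) - h) + (s[i] & 0xff); // h * 31 + c
		return h;
	}

	// string_hash33 is the same loop as djb2, just starting h at 0.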

I couldn't resist. Here's a stolen competitor [1], which beats the hashes
above according to my measurements. You might want to try it in your test
suite. With a single round of bit mixing it is slightly more expensive to
compute; with the mixing repeated it's much slower, but the distribution
into buckets gives a few per cent fewer scans through the buckets.

Update: Cutting away part of the bit mixing code surprised me by giving even
better hashing and (not surprisingly) better performance, very close to the
shift-based string_hash33, but I only ran it on the MyersDiff source, so I'm
very curious to see what your test suite gives.

	public static int hash_t(byte[] s) {
		int h = 0x79FEf6E8;
		for (int i = 0; i < s.length; ++i) {
			int v = s[i] & 0xFF;
			h += v + (v << 16);

			// Jenkins mixing. Repeat this block two or three
			// times for better hashing. (inline, not loop)
			h += (h << 12);
			h ^= (h >> 22);
			h += (h << 4);
			h ^= (h >> 9);
			h += (h << 10); 
			h ^= (h >> 2);
			h += (h << 7);
			h ^= (h >> 12);
		}
		return h;
	}

	public static int hash_tx2(byte[] s) {
		int h = 0x79FEf6E8;
		for (int i = 0; i < s.length; ++i) {
			int v = s[i] & 0xFF;
			h += v + (v << 16);

			// Jenkins mixing x 2
			h += (h << 12);
			h ^= (h >> 22);
			h += (h << 4);
			h ^= (h >> 9);
			h += (h << 10); 
			h ^= (h >> 2);
			h += (h << 7);
			h ^= (h >> 12);

			h += (h << 12);
			h ^= (h >> 22);
			h += (h << 4);
			h ^= (h >> 9);
			h += (h << 10); 
			h ^= (h >> 2);
			h += (h << 7);
			h ^= (h >> 12);
		}
		return h;
	}

	public static int hash_tb(byte[] s) {
		int h = 0x79FEf6E8;
		for (int i = 0; i < s.length; ++i) {
			int v = s[i] & 0xFF;
			h += v + (v << 16);

			// Jenkins mixing. 
			h += (h << 12);
			h ^= (h >> 22);
			h += (h << 4);
			h ^= (h >> 9);
		}
		return h;
	}

-- robin

[1] http://bretm.home.comcast.net/~bretm/hash/3.html

