RawTextComparator's hash() method is horrible for Java source code.
Completely horrible.

I wrote a test to compare a few hash functions as run over the JGit
source blob 3fad2c3 (this is a version of MyersDiff.java):

574 lines; 398 unique; 8 bits = 256 table
Algorithm               Ratio       Max   StdDev
rtc T                   91.92       201  13.42
rtc L                    3.25         8   1.68

djb2 T                   2.71         6   1.33
djb2 L                   2.84         6   1.42

djb2_nows T              2.74         6   1.35
djb2_nows L              2.80         5   1.39

full_name_hash T         2.82         7   1.41
full_name_hash L         2.73         6   1.34

string_hash31 T          2.76         6   1.36
string_hash31 L          2.86         6   1.43

The test computes the unique elements (398) and then inserts them
into a simple hash table of 256 buckets, incrementing a counter
for each new element that fell into that bucket.

Max shows the longest chain (total number of elements that fell
into the same bucket).

Ratio is the estimated number of probes needed to locate an
element in the table, divided by the best-case number of probes
if the elements were perfectly distributed across the buckets.
A ratio of 1.0 therefore indicates a perfect hash function, one
that distributes the elements evenly.
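To make the metrics concrete, here is a sketch of that
measurement in Java.  The method name and the exact probe-count
formula are my own reading of the description above, not the
actual test code:

```java
// Distributes precomputed hash values into a power-of-2 table and
// reports { Ratio, Max, StdDev } as described above.
static double[] measure(int[] hashes, int tableBits) {
	int buckets = 1 << tableBits;
	int[] count = new int[buckets];
	for (int h : hashes)
		count[h & (buckets - 1)]++; // the "T" bucket mapping

	int n = hashes.length;
	double best = (double) n / buckets; // chain length if perfectly even

	int max = 0;
	double probes = 0, var = 0;
	for (int c : count) {
		if (c > max)
			max = c;
		probes += c * (c + 1) / 2.0; // probes to find every element of this chain
		var += (c - best) * (c - best);
	}
	double bestProbes = buckets * (best * (best + 1) / 2.0);
	return new double[] { probes / bestProbes, max, Math.sqrt(var / buckets) };
}
```

Feeding it the 398 unique line hashes with tableBits = 8 would
reproduce the style of numbers in the table above.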

Now to describe the functions.  The T and L suffixes mean different
methods of converting from the raw hash value into a bucket index
in the table.

* T: Simple case for a table size that is a power of 2:

return hash & ((1 << table_bits) - 1);

* L: I got this from , which appears to be itself derived from
a hash cleanup used by the Linux kernel:

/* 2^31 + 2^29 - 2^25 + 2^22 - 2^19 - 2^16 + 1 */
static final int GOLDEN_RATIO_PRIME_32 = 0x9e370001;
int bucket = hash * GOLDEN_RATIO_PRIME_32;
return bucket >>> (32 - table_bits);

* rtc T                   91.92       201  13.42
* rtc L                    3.25         8   1.68

This is the current RawTextComparator.DEFAULT function, which
appears to be based on what xdiff in C Git uses:

int hash = 5381;
for (; ptr < end; ptr++)
hash = (hash << 5) ^ (raw[ptr] & 0xff);

As we can see, it's a _horrible_ hash function: 201 lines out of
398 wound up in the same bucket.  Applying the L cleanup routine
improved matters dramatically, because more bits from the high
end of the hash become involved in the final bucket index.
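The failure mode is easy to demonstrate.  Each step shifts the
accumulated hash left by 5 bits before XORing in the next byte,
so the low 8 bits kept by the T mapping depend only on the last
character and the low 3 bits of the character before it.  A small
standalone illustration (my own example lines, not from the test):

```java
// Self-contained copy of the rtc hash from above.
static int rtcHash(String line) {
	int hash = 5381;
	for (int i = 0; i < line.length(); i++)
		hash = (hash << 5) ^ (line.charAt(i) & 0xff);
	return hash;
}

// Two very different lines that happen to end in the same two
// characters fall into the same bucket of a 256-entry table:
//   rtcHash("return a;") & 0xff == rtcHash("b = c + a;") & 0xff
```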

* djb2 T                   2.71         6   1.33
* djb2 L                   2.84         6   1.42

This is the original djb2 algorithm, before it got busted when
it was put into RawTextComparator.  It's far better than rtc above.

int hash = 5381;
for (; ptr < end; ptr++)
hash = ((hash << 5) + hash) + (raw[ptr] & 0xff);

* djb2_nows T              2.74         6   1.35
* djb2_nows L              2.80         5   1.39

This is a modification that squashes values <= ASCII space to 0,
and shifts all other values down by 0x20.  For ASCII like sources
this gets us a tiny bit more space in the hash for important
characters, but as we can see the chain lengths aren't very
different from djb2.
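Based on that description, the loop would look roughly like this
(my reconstruction, not the actual patch):

```java
// djb2 with whitespace squashing: bytes <= ASCII space hash as 0,
// everything else is shifted down by 0x20 before mixing.
static int djb2NoWs(byte[] raw, int ptr, int end) {
	int hash = 5381;
	for (; ptr < end; ptr++) {
		int c = raw[ptr] & 0xff;
		c = c <= 0x20 ? 0 : c - 0x20;
		hash = ((hash << 5) + hash) + c;
	}
	return hash;
}
```

A side effect is that lines differing only in the kind of
whitespace (tab vs. space) hash identically.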

* full_name_hash T         2.82         7   1.41
* full_name_hash L         2.73         6   1.34

This is the function used by the Linux kernel inside of its VFS
layer to hash file or directory names.  They primarily care about
short strings like 'bin', 'usr', or 'home'.  It doesn't seem to
fare as well on Java source code lines.
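From memory, the kernel's routine at the time was the
partial_name_hash()/full_name_hash() pair; ported to Java it
looks roughly like this (a sketch only, as the kernel operates
on unsigned long):

```java
// Approximation of Linux's full_name_hash() in 32-bit arithmetic.
static int fullNameHash(byte[] name, int ptr, int end) {
	int hash = 0; // init_name_hash()
	for (; ptr < end; ptr++) {
		int c = name[ptr] & 0xff;
		hash = (hash + (c << 4) + (c >> 4)) * 11; // partial_name_hash()
	}
	return hash; // end_name_hash() truncates; here we already have an int
}
```

It was tuned for short path components, which may be part of why
it does less well on long source lines.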

* string_hash31 T          2.76         6   1.36
* string_hash31 L          2.86         6   1.43

This is identical to what Sun JRE version 6 uses for
java.lang.String hashCode():

int hash = 0;
for (; ptr < end; ptr++)
hash = 31 * hash + (raw[ptr] & 0xff);

Granted, this is only 1 source file out of many.  But I think the
current hash is really bad if I got this unlucky by having such a
horrible chain length on the first file I tried it on.  :-)

This afternoon I'll run some tests on a bunch of different source