Eclipse Community Forums
Forum Search:

Search      Help    Register    Login    Home
Home » Eclipse Projects » Eclipse Communications Framework (ECF) » hash function for text comparison
hash function for text comparison [message #621628] Wed, 20 February 2008 11:10 Go to next message
Mustafa K. Isik is currently offline Mustafa K. Isik
Messages: 2
Registered: July 2009
Junior Member
Is anybody aware of a "good" hash function to compare text of the size
of a "typical" sourcecode file?Good in this case means fast to compute,
well distributed in the target domain for text as source and with
values in the target domain short enough to facilitate quick comparison?

I am toying around with the idea to send state information for shared
editing operations along with the messages as hashes based on the state
of the sourcecode they originated from. Those hashes would be used to
compare the states of sourcecode documents from which operations
originated from.

The way cola does this now is by using counters for local and remote
operation applications on a document state. Thanks to a number of
assumptions that can be made, this is currently enough to tell that two
operations must have originated from the same document state when their
respective counters for local and remote ops are equal. One of the
problems with this approach is that at some point, for instance during
long/modification-heavy sessions, the counters will run out of range.

Performance impact of computing hash values based on the sourcecode
file contents for every edit operation might prove to be a concern, but
I can't say much about it yet. Comparisons of hash values should not
prove to be an issue, since they only happen when conflicts arise.

Thanks,

Mustafa
Re: hash function for text comparison [message #621629 is a reply to message #621628] Thu, 21 February 2008 14:42 Go to previous messageGo to next message
Scott Lewis is currently offline Scott Lewis
Messages: 964
Registered: July 2009
Senior Member
Hi Mustafa,

I'm not at all sure this will perform sufficiently well, but I suppose
you could use one of the built in hash functions...e.g. SHA-1:

MessageDigest digest = MessageDigest.getInstance("SHA-1");
digest.reset();
digest.update(saltBytes);
byte [] result = digest.digest(documentBytes);

where saltBytes is a seed number, and documentBytes are the current
state of the document.

I'm not at all expert, however, in the performance characteristics of
these different hash algorithms, so this post is mainly a goad to those
out there that know more and can correct my errant ways.

I did find this chapter in a little google search for hash performance,
but I haven't read the research yet

http://www.springerlink.com/content/0rn1h0gk3lmmtm92/

Scott


Mustafa K. Isik wrote:
> Is anybody aware of a "good" hash function to compare text of the size
> of a "typical" sourcecode file?Good in this case means fast to compute,
> well distributed in the target domain for text as source and with values
> in the target domain short enough to facilitate quick comparison?
>
> I am toying around with the idea to send state information for shared
> editing operations along with the messages as hashes based on the state
> of the sourcecode they originated from. Those hashes would be used to
> compare the states of sourcecode documents from which operations
> originated from.
>
> The way cola does this now is by using counters for local and remote
> operation applications on a document state. Thanks to a number of
> assumptions that can be made, this is currently enough to tell that two
> operations must have originated from the same document state when their
> respective counters for local and remote ops are equal. One of the
> problems with this approach is that at some point, for instance during
> long/modification-heavy sessions, the counters will run out of range.
>
> Performance impact of computing hash values based on the sourcecode file
> contents for every edit operation might prove to be a concern, but I
> can't say much about it yet. Comparisons of hash values should not prove
> to be an issue, since they only happen when conflicts arise.
>
> Thanks,
>
> Mustafa
>
Re: hash function for text comparison [message #621630 is a reply to message #621629] Thu, 21 February 2008 16:08 Go to previous messageGo to next message
Eugene Kuleshov is currently offline Eugene Kuleshov
Messages: 505
Registered: July 2009
Senior Member
Do you really need a crypto hash function? If not, look at
java.util.zip.CRC32 or much faster java.util.zip.Adler32. I suppose it
all boils down to resolving collisions if checksum/hash will
accidentally be the same...

regards,
Eugene


Scott Lewis wrote:
> Hi Mustafa,
>
> I'm not at all sure this will perform sufficiently well, but I suppose
> you could use one of the built in hash functions...e.g. SHA-1:
>
> MessageDigest digest = MessageDigest.getInstance("SHA-1");
> digest.reset();
> digest.update(saltBytes);
> byte [] result = digest.digest(documentBytes);
>
> where saltBytes is a seed number, and documentBytes are the current
> state of the document.
>
> I'm not at all expert, however, in the performance characteristics of
> these different hash algorithms, so this post is mainly a goad to
> those out there that know more and can correct my errant ways.
>
> I did find this chapter in a little google search for hash
> performance, but I haven't read the research yet
>
> http://www.springerlink.com/content/0rn1h0gk3lmmtm92/
>
> Scott
>
>
> Mustafa K. Isik wrote:
>> Is anybody aware of a "good" hash function to compare text of the
>> size of a "typical" sourcecode file?Good in this case means fast to
>> compute, well distributed in the target domain for text as source and
>> with values in the target domain short enough to facilitate quick
>> comparison?
>>
>> I am toying around with the idea to send state information for shared
>> editing operations along with the messages as hashes based on the
>> state of the sourcecode they originated from. Those hashes would be
>> used to compare the states of sourcecode documents from which
>> operations originated from.
>>
>> The way cola does this now is by using counters for local and remote
>> operation applications on a document state. Thanks to a number of
>> assumptions that can be made, this is currently enough to tell that
>> two operations must have originated from the same document state when
>> their respective counters for local and remote ops are equal. One of
>> the problems with this approach is that at some point, for instance
>> during long/modification-heavy sessions, the counters will run out of
>> range.
>>
>> Performance impact of computing hash values based on the sourcecode
>> file contents for every edit operation might prove to be a concern,
>> but I can't say much about it yet. Comparisons of hash values should
>> not prove to be an issue, since they only happen when conflicts arise.
>>
>> Thanks,
>>
>> Mustafa
>>
Re: hash function for text comparison [message #621631 is a reply to message #621630] Thu, 21 February 2008 18:36 Go to previous messageGo to next message
Scott Lewis is currently offline Scott Lewis
Messages: 964
Registered: July 2009
Senior Member
Hi Eugene,

Eugene Kuleshov wrote:
>
> Do you really need a crypto hash function? If not, look at
> java.util.zip.CRC32 or much faster java.util.zip.Adler32. I suppose it
> all boils down to resolving collisions if checksum/hash will
> accidentally be the same...

Yeah...I think the utility of the crypto hash functions is their
likelihood of collision is vanishingly small...but cost being relatively
high.

Is there any info about the collision probability for CRC32 and/or
Adler32? I didn't see any in javadocs.

Thanks,

Scott


>
> regards,
> Eugene
>
>
> Scott Lewis wrote:
>> Hi Mustafa,
>>
>> I'm not at all sure this will perform sufficiently well, but I suppose
>> you could use one of the built in hash functions...e.g. SHA-1:
>>
>> MessageDigest digest = MessageDigest.getInstance("SHA-1");
>> digest.reset();
>> digest.update(saltBytes);
>> byte [] result = digest.digest(documentBytes);
>>
>> where saltBytes is a seed number, and documentBytes are the current
>> state of the document.
>>
>> I'm not at all expert, however, in the performance characteristics of
>> these different hash algorithms, so this post is mainly a goad to
>> those out there that know more and can correct my errant ways.
>>
>> I did find this chapter in a little google search for hash
>> performance, but I haven't read the research yet
>>
>> http://www.springerlink.com/content/0rn1h0gk3lmmtm92/
>>
>> Scott
>>
>>
>> Mustafa K. Isik wrote:
>>> Is anybody aware of a "good" hash function to compare text of the
>>> size of a "typical" sourcecode file?Good in this case means fast to
>>> compute, well distributed in the target domain for text as source and
>>> with values in the target domain short enough to facilitate quick
>>> comparison?
>>>
>>> I am toying around with the idea to send state information for shared
>>> editing operations along with the messages as hashes based on the
>>> state of the sourcecode they originated from. Those hashes would be
>>> used to compare the states of sourcecode documents from which
>>> operations originated from.
>>>
>>> The way cola does this now is by using counters for local and remote
>>> operation applications on a document state. Thanks to a number of
>>> assumptions that can be made, this is currently enough to tell that
>>> two operations must have originated from the same document state when
>>> their respective counters for local and remote ops are equal. One of
>>> the problems with this approach is that at some point, for instance
>>> during long/modification-heavy sessions, the counters will run out of
>>> range.
>>>
>>> Performance impact of computing hash values based on the sourcecode
>>> file contents for every edit operation might prove to be a concern,
>>> but I can't say much about it yet. Comparisons of hash values should
>>> not prove to be an issue, since they only happen when conflicts arise.
>>>
>>> Thanks,
>>>
>>> Mustafa
>>>
Re: hash function for text comparison [message #621632 is a reply to message #621631] Thu, 21 February 2008 19:02 Go to previous messageGo to next message
Eugene Kuleshov is currently offline Eugene Kuleshov
Messages: 505
Registered: July 2009
Senior Member
Scott Lewis wrote:
>> Do you really need a crypto hash function? If not, look at
>> java.util.zip.CRC32 or much faster java.util.zip.Adler32. I suppose
>> it all boils down to resolving collisions if checksum/hash will
>> accidentally be the same...
> Yeah...I think the utility of the crypto hash functions is their
> likelihood of collision is vanishingly small...but cost being
> relatively high.
>
> Is there any info about the collision probability for CRC32 and/or
> Adler32? I didn't see any in javadocs.
Wikipedia has few articles on crc32, adler32 and message digests:

http://en.wikipedia.org/wiki/Adler32 (apparently it have bad
distribution on short messages)
http://en.wikipedia.org/wiki/CRC-32
http://en.wikipedia.org/wiki/Sha1
http://en.wikipedia.org/wiki/Cryptographic_hash_function

Basically if you have other ways to track document integrity (i.e.
operation counters) and don't care about intentional hash forgery, crc32
is probably good enough.

Another idea is to apply crc32 or Md5 on the stream of commands in
addition to counters. So you will have more reliable indication that
things got out of sync.

Anyways, before selecting particular algorithm you should run a
simulation of the typical and extreme streams of commands and source
documents and see how one or the other hash function would affect the
overall system performance in such simulation. It is quite possible that
on the modern hardware even md5 will be fast enough...

regards,
Eugene

>> Scott Lewis wrote:
>>> Hi Mustafa,
>>>
>>> I'm not at all sure this will perform sufficiently well, but I
>>> suppose you could use one of the built in hash functions...e.g. SHA-1:
>>>
>>> MessageDigest digest = MessageDigest.getInstance("SHA-1");
>>> digest.reset();
>>> digest.update(saltBytes);
>>> byte [] result = digest.digest(documentBytes);
>>>
>>> where saltBytes is a seed number, and documentBytes are the current
>>> state of the document.
>>>
>>> I'm not at all expert, however, in the performance characteristics
>>> of these different hash algorithms, so this post is mainly a goad to
>>> those out there that know more and can correct my errant ways.
>>>
>>> I did find this chapter in a little google search for hash
>>> performance, but I haven't read the research yet
>>>
>>> http://www.springerlink.com/content/0rn1h0gk3lmmtm92/
>>>
>>> Scott
>>>
>>>
>>> Mustafa K. Isik wrote:
>>>> Is anybody aware of a "good" hash function to compare text of the
>>>> size of a "typical" sourcecode file?Good in this case means fast to
>>>> compute, well distributed in the target domain for text as source
>>>> and with values in the target domain short enough to facilitate
>>>> quick comparison?
>>>>
>>>> I am toying around with the idea to send state information for
>>>> shared editing operations along with the messages as hashes based
>>>> on the state of the sourcecode they originated from. Those hashes
>>>> would be used to compare the states of sourcecode documents from
>>>> which operations originated from.
>>>>
>>>> The way cola does this now is by using counters for local and
>>>> remote operation applications on a document state. Thanks to a
>>>> number of assumptions that can be made, this is currently enough to
>>>> tell that two operations must have originated from the same
>>>> document state when their respective counters for local and remote
>>>> ops are equal. One of the problems with this approach is that at
>>>> some point, for instance during long/modification-heavy sessions,
>>>> the counters will run out of range.
>>>>
>>>> Performance impact of computing hash values based on the sourcecode
>>>> file contents for every edit operation might prove to be a concern,
>>>> but I can't say much about it yet. Comparisons of hash values
>>>> should not prove to be an issue, since they only happen when
>>>> conflicts arise.
>>>>
>>>> Thanks,
>>>>
>>>> Mustafa
>>>>
Re: hash function for text comparison [message #621633 is a reply to message #621632] Fri, 22 February 2008 13:07 Go to previous message
Scott Lewis is currently offline Scott Lewis
Messages: 964
Registered: July 2009
Senior Member
Looks great...thanks for the pointers/refs Eugene.

Scott

Eugene Kuleshov wrote:
> Scott Lewis wrote:
>>> Do you really need a crypto hash function? If not, look at
>>> java.util.zip.CRC32 or much faster java.util.zip.Adler32. I suppose
>>> it all boils down to resolving collisions if checksum/hash will
>>> accidentally be the same...
>> Yeah...I think the utility of the crypto hash functions is their
>> likelihood of collision is vanishingly small...but cost being
>> relatively high.
>>
>> Is there any info about the collision probability for CRC32 and/or
>> Adler32? I didn't see any in javadocs.
> Wikipedia has few articles on crc32, adler32 and message digests:
>
> http://en.wikipedia.org/wiki/Adler32 (apparently it have bad
> distribution on short messages)
> http://en.wikipedia.org/wiki/CRC-32
> http://en.wikipedia.org/wiki/Sha1
> http://en.wikipedia.org/wiki/Cryptographic_hash_function
>
> Basically if you have other ways to track document integrity (i.e.
> operation counters) and don't care about intentional hash forgery, crc32
> is probably good enough.
>
> Another idea is to apply crc32 or Md5 on the stream of commands in
> addition to counters. So you will have more reliable indication that
> things got out of sync.
>
> Anyways, before selecting particular algorithm you should run a
> simulation of the typical and extreme streams of commands and source
> documents and see how one or the other hash function would affect the
> overall system performance in such simulation. It is quite possible that
> on the modern hardware even md5 will be fast enough...
>
> regards,
> Eugene
>
>>> Scott Lewis wrote:
>>>> Hi Mustafa,
>>>>
>>>> I'm not at all sure this will perform sufficiently well, but I
>>>> suppose you could use one of the built in hash functions...e.g. SHA-1:
>>>>
>>>> MessageDigest digest = MessageDigest.getInstance("SHA-1");
>>>> digest.reset();
>>>> digest.update(saltBytes);
>>>> byte [] result = digest.digest(documentBytes);
>>>>
>>>> where saltBytes is a seed number, and documentBytes are the current
>>>> state of the document.
>>>>
>>>> I'm not at all expert, however, in the performance characteristics
>>>> of these different hash algorithms, so this post is mainly a goad to
>>>> those out there that know more and can correct my errant ways.
>>>>
>>>> I did find this chapter in a little google search for hash
>>>> performance, but I haven't read the research yet
>>>>
>>>> http://www.springerlink.com/content/0rn1h0gk3lmmtm92/
>>>>
>>>> Scott
>>>>
>>>>
>>>> Mustafa K. Isik wrote:
>>>>> Is anybody aware of a "good" hash function to compare text of the
>>>>> size of a "typical" sourcecode file?Good in this case means fast to
>>>>> compute, well distributed in the target domain for text as source
>>>>> and with values in the target domain short enough to facilitate
>>>>> quick comparison?
>>>>>
>>>>> I am toying around with the idea to send state information for
>>>>> shared editing operations along with the messages as hashes based
>>>>> on the state of the sourcecode they originated from. Those hashes
>>>>> would be used to compare the states of sourcecode documents from
>>>>> which operations originated from.
>>>>>
>>>>> The way cola does this now is by using counters for local and
>>>>> remote operation applications on a document state. Thanks to a
>>>>> number of assumptions that can be made, this is currently enough to
>>>>> tell that two operations must have originated from the same
>>>>> document state when their respective counters for local and remote
>>>>> ops are equal. One of the problems with this approach is that at
>>>>> some point, for instance during long/modification-heavy sessions,
>>>>> the counters will run out of range.
>>>>>
>>>>> Performance impact of computing hash values based on the sourcecode
>>>>> file contents for every edit operation might prove to be a concern,
>>>>> but I can't say much about it yet. Comparisons of hash values
>>>>> should not prove to be an issue, since they only happen when
>>>>> conflicts arise.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Mustafa
>>>>>
Previous Topic:Reminder: ECF conference call TODAY 2/19/2008
Next Topic:Announce: weekly ECF integration builds
Goto Forum:
  


Current Time: Fri Apr 18 10:17:01 EDT 2014

Powered by FUDForum. Page generated in 0.02825 seconds