Home » Archived » Eclipse Communications Framework (ECF) » hash function for text comparison
|
Re: hash function for text comparison [message #621629 is a reply to message #621628] |
Thu, 21 February 2008 19:42 |
Scott Lewis Messages: 1038 Registered: July 2009 |
Senior Member |
|
|
Hi Mustafa,
I'm not at all sure this will perform sufficiently well, but I suppose
you could use one of the built in hash functions...e.g. SHA-1:
MessageDigest digest = MessageDigest.getInstance("SHA-1");
digest.reset();
digest.update(saltBytes);
byte [] result = digest.digest(documentBytes);
where saltBytes is a seed number, and documentBytes are the current
state of the document.
I'm not at all expert, however, in the performance characteristics of
these different hash algorithms, so this post is mainly a goad to those
out there that know more and can correct my errant ways.
I did find this chapter in a little google search for hash performance,
but I haven't read the research yet
http://www.springerlink.com/content/0rn1h0gk3lmmtm92/
Scott
Mustafa K. Isik wrote:
> Is anybody aware of a "good" hash function to compare text of the size
> of a "typical" sourcecode file?Good in this case means fast to compute,
> well distributed in the target domain for text as source and with values
> in the target domain short enough to facilitate quick comparison?
>
> I am toying around with the idea to send state information for shared
> editing operations along with the messages as hashes based on the state
> of the sourcecode they originated from. Those hashes would be used to
> compare the states of sourcecode documents from which operations
> originated from.
>
> The way cola does this now is by using counters for local and remote
> operation applications on a document state. Thanks to a number of
> assumptions that can be made, this is currently enough to tell that two
> operations must have originated from the same document state when their
> respective counters for local and remote ops are equal. One of the
> problems with this approach is that at some point, for instance during
> long/modification-heavy sessions, the counters will run out of range.
>
> Performance impact of computing hash values based on the sourcecode file
> contents for every edit operation might prove to be a concern, but I
> can't say much about it yet. Comparisons of hash values should not prove
> to be an issue, since they only happen when conflicts arise.
>
> Thanks,
>
> Mustafa
>
|
|
|
Re: hash function for text comparison [message #621630 is a reply to message #621629] |
Thu, 21 February 2008 21:08 |
Eugene Kuleshov Messages: 504 Registered: July 2009 |
Senior Member |
|
|
Do you really need a crypto hash function? If not, look at
java.util.zip.CRC32 or much faster java.util.zip.Adler32. I suppose it
all boils down to resolving collisions if checksum/hash will
accidentally be the same...
regards,
Eugene
Scott Lewis wrote:
> Hi Mustafa,
>
> I'm not at all sure this will perform sufficiently well, but I suppose
> you could use one of the built in hash functions...e.g. SHA-1:
>
> MessageDigest digest = MessageDigest.getInstance("SHA-1");
> digest.reset();
> digest.update(saltBytes);
> byte [] result = digest.digest(documentBytes);
>
> where saltBytes is a seed number, and documentBytes are the current
> state of the document.
>
> I'm not at all expert, however, in the performance characteristics of
> these different hash algorithms, so this post is mainly a goad to
> those out there that know more and can correct my errant ways.
>
> I did find this chapter in a little google search for hash
> performance, but I haven't read the research yet
>
> http://www.springerlink.com/content/0rn1h0gk3lmmtm92/
>
> Scott
>
>
> Mustafa K. Isik wrote:
>> Is anybody aware of a "good" hash function to compare text of the
>> size of a "typical" sourcecode file?Good in this case means fast to
>> compute, well distributed in the target domain for text as source and
>> with values in the target domain short enough to facilitate quick
>> comparison?
>>
>> I am toying around with the idea to send state information for shared
>> editing operations along with the messages as hashes based on the
>> state of the sourcecode they originated from. Those hashes would be
>> used to compare the states of sourcecode documents from which
>> operations originated from.
>>
>> The way cola does this now is by using counters for local and remote
>> operation applications on a document state. Thanks to a number of
>> assumptions that can be made, this is currently enough to tell that
>> two operations must have originated from the same document state when
>> their respective counters for local and remote ops are equal. One of
>> the problems with this approach is that at some point, for instance
>> during long/modification-heavy sessions, the counters will run out of
>> range.
>>
>> Performance impact of computing hash values based on the sourcecode
>> file contents for every edit operation might prove to be a concern,
>> but I can't say much about it yet. Comparisons of hash values should
>> not prove to be an issue, since they only happen when conflicts arise.
>>
>> Thanks,
>>
>> Mustafa
>>
|
|
|
Re: hash function for text comparison [message #621631 is a reply to message #621630] |
Thu, 21 February 2008 23:36 |
Scott Lewis Messages: 1038 Registered: July 2009 |
Senior Member |
|
|
Hi Eugene,
Eugene Kuleshov wrote:
>
> Do you really need a crypto hash function? If not, look at
> java.util.zip.CRC32 or much faster java.util.zip.Adler32. I suppose it
> all boils down to resolving collisions if checksum/hash will
> accidentally be the same...
Yeah...I think the utility of the crypto hash functions is their
likelihood of collision is vanishingly small...but cost being relatively
high.
Is there any info about the collision probability for CRC32 and/or
Adler32? I didn't see any in javadocs.
Thanks,
Scott
>
> regards,
> Eugene
>
>
> Scott Lewis wrote:
>> Hi Mustafa,
>>
>> I'm not at all sure this will perform sufficiently well, but I suppose
>> you could use one of the built in hash functions...e.g. SHA-1:
>>
>> MessageDigest digest = MessageDigest.getInstance("SHA-1");
>> digest.reset();
>> digest.update(saltBytes);
>> byte [] result = digest.digest(documentBytes);
>>
>> where saltBytes is a seed number, and documentBytes are the current
>> state of the document.
>>
>> I'm not at all expert, however, in the performance characteristics of
>> these different hash algorithms, so this post is mainly a goad to
>> those out there that know more and can correct my errant ways.
>>
>> I did find this chapter in a little google search for hash
>> performance, but I haven't read the research yet
>>
>> http://www.springerlink.com/content/0rn1h0gk3lmmtm92/
>>
>> Scott
>>
>>
>> Mustafa K. Isik wrote:
>>> Is anybody aware of a "good" hash function to compare text of the
>>> size of a "typical" sourcecode file?Good in this case means fast to
>>> compute, well distributed in the target domain for text as source and
>>> with values in the target domain short enough to facilitate quick
>>> comparison?
>>>
>>> I am toying around with the idea to send state information for shared
>>> editing operations along with the messages as hashes based on the
>>> state of the sourcecode they originated from. Those hashes would be
>>> used to compare the states of sourcecode documents from which
>>> operations originated from.
>>>
>>> The way cola does this now is by using counters for local and remote
>>> operation applications on a document state. Thanks to a number of
>>> assumptions that can be made, this is currently enough to tell that
>>> two operations must have originated from the same document state when
>>> their respective counters for local and remote ops are equal. One of
>>> the problems with this approach is that at some point, for instance
>>> during long/modification-heavy sessions, the counters will run out of
>>> range.
>>>
>>> Performance impact of computing hash values based on the sourcecode
>>> file contents for every edit operation might prove to be a concern,
>>> but I can't say much about it yet. Comparisons of hash values should
>>> not prove to be an issue, since they only happen when conflicts arise.
>>>
>>> Thanks,
>>>
>>> Mustafa
>>>
|
|
|
Re: hash function for text comparison [message #621632 is a reply to message #621631] |
Fri, 22 February 2008 00:02 |
Eugene Kuleshov Messages: 504 Registered: July 2009 |
Senior Member |
|
|
Scott Lewis wrote:
>> Do you really need a crypto hash function? If not, look at
>> java.util.zip.CRC32 or much faster java.util.zip.Adler32. I suppose
>> it all boils down to resolving collisions if checksum/hash will
>> accidentally be the same...
> Yeah...I think the utility of the crypto hash functions is their
> likelihood of collision is vanishingly small...but cost being
> relatively high.
>
> Is there any info about the collision probability for CRC32 and/or
> Adler32? I didn't see any in javadocs.
Wikipedia has few articles on crc32, adler32 and message digests:
http://en.wikipedia.org/wiki/Adler32 (apparently it have bad
distribution on short messages)
http://en.wikipedia.org/wiki/CRC-32
http://en.wikipedia.org/wiki/Sha1
http://en.wikipedia.org/wiki/Cryptographic_hash_function
Basically if you have other ways to track document integrity (i.e.
operation counters) and don't care about intentional hash forgery, crc32
is probably good enough.
Another idea is to apply crc32 or Md5 on the stream of commands in
addition to counters. So you will have more reliable indication that
things got out of sync.
Anyways, before selecting particular algorithm you should run a
simulation of the typical and extreme streams of commands and source
documents and see how one or the other hash function would affect the
overall system performance in such simulation. It is quite possible that
on the modern hardware even md5 will be fast enough...
regards,
Eugene
>> Scott Lewis wrote:
>>> Hi Mustafa,
>>>
>>> I'm not at all sure this will perform sufficiently well, but I
>>> suppose you could use one of the built in hash functions...e.g. SHA-1:
>>>
>>> MessageDigest digest = MessageDigest.getInstance("SHA-1");
>>> digest.reset();
>>> digest.update(saltBytes);
>>> byte [] result = digest.digest(documentBytes);
>>>
>>> where saltBytes is a seed number, and documentBytes are the current
>>> state of the document.
>>>
>>> I'm not at all expert, however, in the performance characteristics
>>> of these different hash algorithms, so this post is mainly a goad to
>>> those out there that know more and can correct my errant ways.
>>>
>>> I did find this chapter in a little google search for hash
>>> performance, but I haven't read the research yet
>>>
>>> http://www.springerlink.com/content/0rn1h0gk3lmmtm92/
>>>
>>> Scott
>>>
>>>
>>> Mustafa K. Isik wrote:
>>>> Is anybody aware of a "good" hash function to compare text of the
>>>> size of a "typical" sourcecode file?Good in this case means fast to
>>>> compute, well distributed in the target domain for text as source
>>>> and with values in the target domain short enough to facilitate
>>>> quick comparison?
>>>>
>>>> I am toying around with the idea to send state information for
>>>> shared editing operations along with the messages as hashes based
>>>> on the state of the sourcecode they originated from. Those hashes
>>>> would be used to compare the states of sourcecode documents from
>>>> which operations originated from.
>>>>
>>>> The way cola does this now is by using counters for local and
>>>> remote operation applications on a document state. Thanks to a
>>>> number of assumptions that can be made, this is currently enough to
>>>> tell that two operations must have originated from the same
>>>> document state when their respective counters for local and remote
>>>> ops are equal. One of the problems with this approach is that at
>>>> some point, for instance during long/modification-heavy sessions,
>>>> the counters will run out of range.
>>>>
>>>> Performance impact of computing hash values based on the sourcecode
>>>> file contents for every edit operation might prove to be a concern,
>>>> but I can't say much about it yet. Comparisons of hash values
>>>> should not prove to be an issue, since they only happen when
>>>> conflicts arise.
>>>>
>>>> Thanks,
>>>>
>>>> Mustafa
>>>>
|
|
|
Re: hash function for text comparison [message #621633 is a reply to message #621632] |
Fri, 22 February 2008 18:07 |
Scott Lewis Messages: 1038 Registered: July 2009 |
Senior Member |
|
|
Looks great...thanks for the pointers/refs Eugene.
Scott
Eugene Kuleshov wrote:
> Scott Lewis wrote:
>>> Do you really need a crypto hash function? If not, look at
>>> java.util.zip.CRC32 or much faster java.util.zip.Adler32. I suppose
>>> it all boils down to resolving collisions if checksum/hash will
>>> accidentally be the same...
>> Yeah...I think the utility of the crypto hash functions is their
>> likelihood of collision is vanishingly small...but cost being
>> relatively high.
>>
>> Is there any info about the collision probability for CRC32 and/or
>> Adler32? I didn't see any in javadocs.
> Wikipedia has few articles on crc32, adler32 and message digests:
>
> http://en.wikipedia.org/wiki/Adler32 (apparently it have bad
> distribution on short messages)
> http://en.wikipedia.org/wiki/CRC-32
> http://en.wikipedia.org/wiki/Sha1
> http://en.wikipedia.org/wiki/Cryptographic_hash_function
>
> Basically if you have other ways to track document integrity (i.e.
> operation counters) and don't care about intentional hash forgery, crc32
> is probably good enough.
>
> Another idea is to apply crc32 or Md5 on the stream of commands in
> addition to counters. So you will have more reliable indication that
> things got out of sync.
>
> Anyways, before selecting particular algorithm you should run a
> simulation of the typical and extreme streams of commands and source
> documents and see how one or the other hash function would affect the
> overall system performance in such simulation. It is quite possible that
> on the modern hardware even md5 will be fast enough...
>
> regards,
> Eugene
>
>>> Scott Lewis wrote:
>>>> Hi Mustafa,
>>>>
>>>> I'm not at all sure this will perform sufficiently well, but I
>>>> suppose you could use one of the built in hash functions...e.g. SHA-1:
>>>>
>>>> MessageDigest digest = MessageDigest.getInstance("SHA-1");
>>>> digest.reset();
>>>> digest.update(saltBytes);
>>>> byte [] result = digest.digest(documentBytes);
>>>>
>>>> where saltBytes is a seed number, and documentBytes are the current
>>>> state of the document.
>>>>
>>>> I'm not at all expert, however, in the performance characteristics
>>>> of these different hash algorithms, so this post is mainly a goad to
>>>> those out there that know more and can correct my errant ways.
>>>>
>>>> I did find this chapter in a little google search for hash
>>>> performance, but I haven't read the research yet
>>>>
>>>> http://www.springerlink.com/content/0rn1h0gk3lmmtm92/
>>>>
>>>> Scott
>>>>
>>>>
>>>> Mustafa K. Isik wrote:
>>>>> Is anybody aware of a "good" hash function to compare text of the
>>>>> size of a "typical" sourcecode file?Good in this case means fast to
>>>>> compute, well distributed in the target domain for text as source
>>>>> and with values in the target domain short enough to facilitate
>>>>> quick comparison?
>>>>>
>>>>> I am toying around with the idea to send state information for
>>>>> shared editing operations along with the messages as hashes based
>>>>> on the state of the sourcecode they originated from. Those hashes
>>>>> would be used to compare the states of sourcecode documents from
>>>>> which operations originated from.
>>>>>
>>>>> The way cola does this now is by using counters for local and
>>>>> remote operation applications on a document state. Thanks to a
>>>>> number of assumptions that can be made, this is currently enough to
>>>>> tell that two operations must have originated from the same
>>>>> document state when their respective counters for local and remote
>>>>> ops are equal. One of the problems with this approach is that at
>>>>> some point, for instance during long/modification-heavy sessions,
>>>>> the counters will run out of range.
>>>>>
>>>>> Performance impact of computing hash values based on the sourcecode
>>>>> file contents for every edit operation might prove to be a concern,
>>>>> but I can't say much about it yet. Comparisons of hash values
>>>>> should not prove to be an issue, since they only happen when
>>>>> conflicts arise.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Mustafa
>>>>>
|
|
|
|
|
Goto Forum:
Current Time: Sun Sep 22 07:55:23 GMT 2024
Powered by FUDForum. Page generated in 0.04072 seconds
|