RE: [stellation-res] Merge Algorithm

> >-----Original Message-----
> >From: stellation-res-admin@xxxxxxxxxxxxxxx
> >[mailto:stellation-res-admin@xxxxxxxxxxxxxxx]On Behalf Of Mark C.
> >Chu-Carroll
> >Sent: October 15, 2002 9:19 PM
> >To: stellation-res@xxxxxxxxxxxxxxx
> >Subject: RE: [stellation-res] Merge Algorithm
> >
> >
> >On Tue, 2002-10-15 at 18:40, Jonathan Gossage wrote:
> >> What is the meaning of the acronym LCS?
> >
> >Longest common subsequence. The LCS is the basis of
> >optimal diffs.
> >
> >The LCS basically gives you the anchors for your diffs:
> >it finds the largest region of unmodified text between
> >two versions.
> >
> >For what it means, and how we use it, I'll explain a bit, because it
> >confused the heck out of me at first.
> >
> >A sequence is an ordered list of values.
> >
> >For any members x and y of a sequence X, we'll say
> >that x <: y if and only if x occurs before y in X.
> >
> >A sequence X is a subsequence of Y if all members
> >of X are members of Y, and for all pairs (x,y) of
> >members in X,  x <: y in X if and only if x <: y in Y.
> >
> >Example:
> >	X = [ 1, 1, 2, 3, 5, 6, 7 ]
> >	Y = [ 1, 3, 5, 7 ]
> >	Z = [1, 2, 1, 3, 5]
> >Y is a subsequence of X. Z is not.

I presume that the definition of a pair includes the stipulation that its
members are adjacent in the subsequence.
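
For anyone else following along, here is the definition as I read it,
written out as a small Java sketch (illustration only, not Stellation
code). It implements the usual order-preserving subsequence check
against the example above:

import java.util.List;

public final class Subseq {
    // Standard subsequence test: walk through 'big' once, advancing a
    // cursor into 'small' whenever the current elements match.  'small'
    // is a subsequence of 'big' iff the cursor reaches the end.
    static <T> boolean isSubsequence(List<T> small, List<T> big) {
        int i = 0;
        for (T item : big) {
            if (i < small.size() && small.get(i).equals(item)) {
                i++;
            }
        }
        return i == small.size();
    }

    public static void main(String[] args) {
        List<Integer> x = List.of(1, 1, 2, 3, 5, 6, 7);
        List<Integer> y = List.of(1, 3, 5, 7);
        List<Integer> z = List.of(1, 2, 1, 3, 5);
        System.out.println(isSubsequence(y, x)); // true
        System.out.println(isSubsequence(z, x)); // false (X has no 1 after its 2)
    }
}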

> >
> >Given this much, the meaning of the longest common subsequence should be
> >fairly clear. It's *not* easy
> >to compute, and we use a rather odd algorithm for it.
> >It's a dynamic programming problem. The common solution
> >is quadratic in space, which can be insanely large. (There's a good
> >writeup of the algorithm in the Cormen, Leiserson and Rivest algorithms
> >textbook. It's easiest to understand how the algorithms work if you
> >first understand that one. Doing
> >some tests of that algorithm, however, we found that on files 15000
> >lines long, with the tables used by the algorithm, we needed a heap of
> >something like 200 megabytes!)
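
For reference, the textbook formulation being described is roughly the
following (a sketch, not the Stellation implementation). The
(m+1) x (n+1) table is what eats the heap:

// Textbook LCS length via the full dynamic-programming table (the CLRS
// formulation).  The table is (m+1) x (n+1) ints, so two 15000-line
// files already mean over 200 million cells.
final class NaiveLcs {
    static int lcsLength(String[] a, String[] b) {
        int m = a.length, n = b.length;
        int[][] table = new int[m + 1][n + 1];
        for (int i = 1; i <= m; i++) {
            for (int j = 1; j <= n; j++) {
                if (a[i - 1].equals(b[j - 1])) {
                    table[i][j] = table[i - 1][j - 1] + 1;
                } else {
                    table[i][j] = Math.max(table[i - 1][j], table[i][j - 1]);
                }
            }
        }
        // The LCS itself (and hence the diff anchors) is recovered by
        // walking the table back from table[m][n].
        return table[m][n];
    }
}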
> >
> >So we needed a better algorithm. One of the advantages of
> >working at Watson is that we're surrounded by incredibly
> >smart people with lots of different specialties. The guy
> >across the hall from me is an algorithms specialist. He
> >found us an LCS algorithm used by computational biologists
> >for gene matching, which is the sparse dynamic programming version of
> >the original algorithm.
> >
> >That *still* wasn't good enough. Its space complexity is
> >based, roughly, on the recurrence factor: the length of
> >one sequence times the average number of times that each
> >value from that sequence occurs in the other sequence.
> >If you're working
> >with files, which are sequences of strings, you'll find
> >that you have a large number of lines with *very* high
> >recurrence (empty line, " */", "   }", etc., all recur
> >*very* many times). We had particularly horrid blowups
> >trying to use Jikes as a test: one of the files in Jikes
> >is generated by a parser generator, which follows a template
> >form for each parser rule... So you had several hundred
> >parser rules, each of which had something like 10 identical
> >boilerplate lines.
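
If I follow, the memory cost of the sparse version comes from the
per-line match lists. Here is a rough Java sketch of that bookkeeping
(my own illustration, with made-up names, and obviously not the full
algorithm):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// The sparse approach keeps, for each line of A, the positions in B
// where the same line occurs.  Its memory is (roughly) the total size
// of these match lists: the length of A times the average recurrence.
final class MatchLists {
    static List<List<Integer>> build(String[] a, String[] b) {
        Map<String, List<Integer>> positionsInB = new HashMap<>();
        for (int j = 0; j < b.length; j++) {
            positionsInB.computeIfAbsent(b[j], k -> new ArrayList<>()).add(j);
        }
        List<List<Integer>> matches = new ArrayList<>(a.length);
        for (String line : a) {
            matches.add(positionsInB.getOrDefault(line, List.of()));
        }
        return matches;
    }

    // The "recurrence factor": average number of matches per line of A.
    static double recurrenceFactor(List<List<Integer>> matches) {
        long total = 0;
        for (List<Integer> m : matches) {
            total += m.size();
        }
        return matches.isEmpty() ? 0.0 : (double) total / matches.size();
    }
}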
> >
> >So back to Mr. Algorithms. An interesting property of the
> >high recurrence lines in a source file is that, in general,
> >they are  the *least* interesting lines for our purposes! The
> >purpose of the LCS is to find anchors for the diff: the fact that you
> >have a common string between two versions isn't
> >particularly significant if that common string is "*/".
> >In fact, what turns out to be true is the *lower* the
> >recurrence, the more likely the string is to be significant in finding a
> >meaningful anchor point.
> >
> >So we want to get rid of those things. The way we do it is
> >by clumping them together. When a line recurs many times,
> >we attach it to the line before it. This doesn't always
> >work out well, but the vast majority of the time, it's
> >extremely good. And it drops the recurrence factor *way*
> >down. On test files, we could push recurrence factors down
> >from an average of 15 to between 2 and 3. In memory use,
> >that brings us to the point where we haven't found any
> >files that we couldn't diff in less than, I think, 32M.
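
Just to check my understanding of the clumping step, is it something
along these lines? (A sketch only; the threshold and the way lines get
joined are my guesses, not what Stellation actually does.)

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Clumping, as I understand it: any line that recurs more than some
// threshold is glued onto the element before it, so boilerplate like
// "" or " */" or "   }" stops inflating the match lists.  The diff
// then runs over the clumped elements instead of raw lines.
final class Clumper {
    static List<String> clump(List<String> lines, int threshold) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            counts.merge(line, 1, Integer::sum);
        }
        List<String> clumped = new ArrayList<>();
        for (String line : lines) {
            boolean frequent = counts.get(line) > threshold;
            if (frequent && !clumped.isEmpty()) {
                // Attach the high-recurrence line to its predecessor.
                int last = clumped.size() - 1;
                clumped.set(last, clumped.get(last) + "\n" + line);
            } else {
                clumped.add(line);
            }
        }
        return clumped;
    }
}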
> >
> >	-Mark
> >
> >

Thank you for the excellent description. This will be a big help when I go
bug hunting.


