RE: [stellation-res] Merge Algorithm
On Tue, 2002-10-15 at 18:40, Jonathan Gossage wrote:
> What is the meaning of the acronym LCS?
Longest common subsequence. The LCS is the basis of
optimal diffs.
The LCS gives you the anchors for your diffs: it finds the longest
(not necessarily contiguous) stretch of text that is left
unmodified between two versions.
I'll explain a bit about what it means and how we use it, because it
confused the heck out of me at first.
A sequence is an ordered list of values.
For any members x and y of a sequence X, we'll say
that x <: y if and only if x occurs before y in X.
A sequence X is a subsequence of Y if all members
of X are members of Y, and for all pairs (x,y) of
members in X, x <: y in X if and only if x <: y in Y.
Example:
X = [ 1, 1, 2, 3, 5, 6, 7 ]
Y = [ 1, 3, 5, 7 ]
Z = [ 1, 2, 1, 3, 5 ]
Y is a subsequence of X. Z is not.
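To make the definition concrete, here's a quick Python sketch (just an
illustration, not Stellation code):

  def is_subsequence(x, y):
      # Walk through y, picking off the members of x in order.
      # The shared iterator makes each 'in' test resume where
      # the previous one left off.
      it = iter(y)
      return all(member in it for member in x)

  X = [1, 1, 2, 3, 5, 6, 7]
  Y = [1, 3, 5, 7]
  Z = [1, 2, 1, 3, 5]
  print(is_subsequence(Y, X))   # True
  print(is_subsequence(Z, X))   # False: no 1 occurs after the 2 in X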
Given this much, the meaning of the longest common subsequence should be
fairly clear. It's *not* easy
to compute, and we use a rather odd algorithm for it.
It's a dynamic programming problem. The common solution
is quadratic in space, which can get insanely large. (There's a good
writeup of the algorithm in the Cormen, Leiserson, and Rivest algorithms
textbook. It's easiest to understand the fancier algorithms if you
understand that one first. Testing that algorithm, however, we found
that on files 15000 lines long, the tables it builds needed a heap of
something like 200 megabytes!)
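For reference, the textbook version looks roughly like this (a sketch of
the standard algorithm, not our code). The table is what eats the memory:
it has len(a) * len(b) entries no matter how similar the files are.

  def lcs_length(a, b):
      # table[i][j] = length of the LCS of a[:i] and b[:j]
      table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
      for i, x in enumerate(a, 1):
          for j, y in enumerate(b, 1):
              if x == y:
                  table[i][j] = table[i-1][j-1] + 1
              else:
                  table[i][j] = max(table[i-1][j], table[i][j-1])
      return table[len(a)][len(b)]

(15000 lines squared is over 200 million table entries.)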
So we needed a better algorithm. One of the advantages of
working at Watson is that we're surrounded by incredibly
smart people with lots of different specialties. The guy
across the hall from me is an algorithms specialist. He
found us an LCS algorithm used by computational biologists
for gene matching, which is the sparse dynamic programming version of
the original algorithm.
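I won't reproduce the real thing here, but a sketch in the style of
Hunt-Szymanski (my choice of illustration; the algorithm we actually
use differs in its details) shows what "sparse" buys you: instead of
filling in the whole quadratic table, you only touch the positions
where the two sequences actually match, and reduce the problem to a
longest increasing subsequence over those match points.

  from bisect import bisect_left
  from collections import defaultdict

  def sparse_lcs_length(a, b):
      # Where does each value occur in b?
      positions = defaultdict(list)
      for j, y in enumerate(b):
          positions[y].append(j)
      # thresh[k] = smallest index into b that ends a common
      # subsequence of length k+1 found so far.
      thresh = []
      for x in a:
          # Right-to-left, so matches in the same row of the
          # (conceptual) table can't chain with each other.
          for j in reversed(positions.get(x, [])):
              k = bisect_left(thresh, j)
              if k == len(thresh):
                  thresh.append(j)
              else:
                  thresh[k] = j
      return len(thresh)

The work here is proportional to the number of matching pairs of
positions, not to the product of the lengths -- which is exactly where
the next problem comes from.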
That *still* wasn't good enough. Its space complexity is
governed, roughly, by the recurrence factor: the length of
one sequence times the average number of times that each value from
that sequence occurs in the other sequence. If you're working
with files, which are sequences of strings, you'll find
that you have a large number of lines with *very* high
recurrence (empty line, " */", " }", etc., all recur
*very* many times). We had particularly horrid blowups
trying to use Jikes as a test: one of the files in Jikes
is generated by a parser generator, which follows a template
form for each parser rule. So you had several hundred
parser rules, each of which had something like 10 identical
boilerplate lines.
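For concreteness, the recurrence factor is easy to compute (a throwaway
sketch, not Stellation code), and it shows why boilerplate hurts: a line
with m copies in one file and n copies in the other contributes m * n
match points to the sparse algorithm.

  from collections import Counter

  def recurrence_factor(a_lines, b_lines):
      # Average number of times each line of A occurs in B.
      counts = Counter(b_lines)
      total_matches = sum(counts[line] for line in a_lines)
      return total_matches / len(a_lines)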
So back to Mr. Algorithms. An interesting property of the
high-recurrence lines in a source file is that, in general,
they are the *least* interesting lines for our purposes! The
purpose of the LCS is to find anchors for the diff: the fact that you
have a common string between two versions isn't
particularly significant if that common string is "*/".
In fact, it turns out that the *lower* a line's recurrence,
the more likely it is to make a meaningful anchor point.
So we want to get rid of those things. The way we do it is
by clumping them together. When a line recurs many times,
we attach it to the line before it. This doesn't always
work out well, but the vast majority of the time it works
extremely well, and it drops the recurrence factor *way*
down. On test files, we could push recurrence factors down
from an average of 15 to between 2 and 3. In terms of memory,
that brings us to the point where we haven't found any
files that we couldn't diff in less than, I think, 32M.
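The clumping itself is simple. Something like this (an illustrative
sketch -- the threshold and the exact counting are made up, not the
real implementation):

  from collections import Counter

  def clump(lines, threshold=5):
      # Glue any line that occurs more than 'threshold' times
      # onto the line before it, so the combined line becomes
      # one much rarer value in the sequence.
      counts = Counter(lines)
      clumped = []
      for line in lines:
          if counts[line] > threshold and clumped:
              clumped[-1] = clumped[-1] + "\n" + line
          else:
              clumped.append(line)
      return clumped

(In this sketch, you'd diff the clumped sequences and then map the
anchors back to real line numbers.)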
-Mark
--
Mark Craig Chu-Carroll, IBM T.J. Watson Research Center
*** The Stellation project: Advanced SCM for Collaboration
*** http://www.eclipse.org/stellation
*** Work Email: mcc@xxxxxxxxxxxxxx ------- Personal Email: markcc@xxxxxxxxxxx