Skip to main content

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] [List Home]
[stellation-res] A database proposal, for fixing 22135.

This is probably going to be long, so please be patient.

I've been exploring 22135, which I think is also the corruption
problem that Florin just discovered in a unit test.

I don't know the specific cause yet. But the difficulty of trying to
approach it has caused me to reconsider something about the
basic structure of the repository. I'd like to propose a change,
which will have major repurcussions.

Before I can explain it, I need to start by describing how
we currently store text artifacts in the database. We're
using a format called in-place-delta. In in-place-delta (IPD)
all of the versions of a single text artifact are interleaved
into a single list of lines of text. Each line of text has a tag
which identifies which version its a part of.

For example, take a three line artifact:

Version 1:
This is line one
This is line two
This is line three

Version 2:
This is line one.
Line two changed.
This is line three

Version 3:
This is line one
Line two changed.
Line three also changed.

This would be represented in IPD format as:
TAG		IPDLINENUM	LINE
1,2,3		1			This is line one
1		2			This is line two
2,3		2			Line two changed
1,2		3			This is line three
3		5			Line three also changed.

(The IPDLINEUM is used to maintain the order of lines in
the database.) In stellation, we've been representing the tag portion as a 
bitstring encoded into ASCII in a rather tricky form. The advantage of
the encoding is that some pretty tricky SQL makes it easy to
have the server extract the lines that make up a version. 

There are two major disadvantage to it. First is that updating an
atomic artifact means completely replacing the IPD lines for a
document. (It's too hard to work out the correct set of update/inserts,
so we remove all lines, and then replace them.)

The second major disadvantage is that when looking at the database,
there's no good way for a human viewing the database to tell
what lines are members of what versions. It's really hard to decode
those tag strings by hand, and it's really hard to write a probe query
that extracts a particular version from the repository.

OK. Enough background. Here's what I'd like to do.

I'd like to separate the tags from the lines in the database. And I'd
also like to burst the tags out, so that instead of bunding the information
about what versions a given line is part of into a tag string, it gets
broken out into a bunch of lines in a different table.

So... The Texts table becomes two tables: TextLines, and TextVersions.
Each row in TextLines is an artifactID, and a line ID, and the line string.

TextVersions is a list of (linenumber, LineID, VersionIDs). There is an
entry in TextVersions for a LineID,VersionID if the line is a member
of the version.

There are several advantages to this, and one major disadvantage.

The pros:
- It's easy for a human viewer to see what's going on in the texts table.
- The tags are a lot easier to read and maintain. (I'm pretty sure that
  the current bitstring tricks are what's failing in these problems.)
- Updates can be done in-place in the database. 
- An update adding a new version will *not* be able to corrupt
 an older version. (Now, it's possible for a new version to
  mangle a tagline, effectively damaging earlier versions.)

The (major) con:
- The database size will increase significantly for projects with
 many versions.

So... What do folks think about this idea for a change? It would
take a couple of days of work for me to do this. Before I do it,
I'd like to know what people think about it.

	-Mark


-- 
Mark Craig Chu-Carroll,  IBM T.J. Watson Research Center  
*** The Stellation project: Advanced SCM for Collaboration
***		http://www.eclipse.org/stellation
*** Work Email: mcc@xxxxxxxxxxxxxx  ------- Personal Email: markcc@xxxxxxxxxxx




Back to the top