Skip to main content

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] [List Home]
Re: [stellation-res] A database proposal, for fixing 22135.

On Sun, Aug 04, 2002 at 08:10:13PM -0400, Mark C. Chu-Carroll wrote:
> 
> This is probably going to be long, so please be patient.
> 
> I've been exploring 22135, which I think is also the corruption
> problem that Florin just discovered in a unit test.
> 
> I don't know the specific cause yet. But the difficulty of trying to
> approach it has caused me to reconsider something about the
> basic structure of the repository. I'd like to propose a change,
> which will have major repurcussions.
> 
> Before I can explain it, I need to start by describing how
> we currently store text artifacts in the database. We're
> using a format called in-place-delta. In in-place-delta (IPD)
> all of the versions of a single text artifact are interleaved
> into a single list of lines of text. Each line of text has a tag
> which identifies which version its a part of.
> 
> For example, take a three line artifact:
> 
> Version 1:
> This is line one
> This is line two
> This is line three
> 
> Version 2:
> This is line one.
> Line two changed.
> This is line three
> 
> Version 3:
> This is line one
> Line two changed.
> Line three also changed.
> 
> This would be represented in IPD format as:
> TAG		IPDLINENUM	LINE
> 1,2,3		1			This is line one
> 1		2			This is line two
> 2,3		2			Line two changed
> 1,2		3			This is line three
> 3		5			Line three also changed.
> 
> (The IPDLINEUM is used to maintain the order of lines in
> the database.) In stellation, we've been representing the tag portion as a 
> bitstring encoded into ASCII in a rather tricky form. The advantage of
> the encoding is that some pretty tricky SQL makes it easy to
> have the server extract the lines that make up a version. 
> 
> There are two major disadvantage to it. First is that updating an
> atomic artifact means completely replacing the IPD lines for a
> document. (It's too hard to work out the correct set of update/inserts,
> so we remove all lines, and then replace them.)
> 
> The second major disadvantage is that when looking at the database,
> there's no good way for a human viewing the database to tell
> what lines are members of what versions. It's really hard to decode
> those tag strings by hand, and it's really hard to write a probe query
> that extracts a particular version from the repository.
> 
> OK. Enough background. Here's what I'd like to do.
> 
> I'd like to separate the tags from the lines in the database. And I'd
> also like to burst the tags out, so that instead of bunding the information
> about what versions a given line is part of into a tag string, it gets
> broken out into a bunch of lines in a different table.
> 
> So... The Texts table becomes two tables: TextLines, and TextVersions.
> Each row in TextLines is an artifactID, and a line ID, and the line string.
> 
> TextVersions is a list of (linenumber, LineID, VersionIDs). There is an
> entry in TextVersions for a LineID,VersionID if the line is a member
> of the version.
> 
> There are several advantages to this, and one major disadvantage.
> 
> The pros:
> - It's easy for a human viewer to see what's going on in the texts table.
> - The tags are a lot easier to read and maintain. (I'm pretty sure that
>   the current bitstring tricks are what's failing in these problems.)
> - Updates can be done in-place in the database. 
> - An update adding a new version will *not* be able to corrupt
>  an older version. (Now, it's possible for a new version to
>   mangle a tagline, effectively damaging earlier versions.)
> 
> The (major) con:
> - The database size will increase significantly for projects with
>  many versions.

Concons:

1. Not that much. Since you can get native short integers in Postgres, the
representation will be compact enough.

2. Also disk space it getting cheaper.

3. After a certain time, you can start squeezing harder the old
versions. You can have a janitor process doing this kind of work.

> So... What do folks think about this idea for a change? It would
> take a couple of days of work for me to do this. Before I do it,
> I'd like to know what people think about it.

It makes a lot of sense. When I saw the IPD table in your e-mail it
looked rather strange for something that sits in a relational database.
Your proposal came to my mind as the obvious way to break/normalize it.

florin

-- 

"If it's not broken, let's fix it till it is."

41A9 2BDE 8E11 F1C5 87A6  03EE 34B3 E075 3B90 DFE4

Attachment: pgptKkKUtsMcb.pgp
Description: PGP signature


Back to the top