[
Date Prev][
Date Next][
Thread Prev][
Thread Next][
Date Index][
Thread Index]
[
List Home]
Re: [stellation-res] A database proposal, for fixing 22135.
|
On Sun, Aug 04, 2002 at 08:10:13PM -0400, Mark C. Chu-Carroll wrote:
>
> This is probably going to be long, so please be patient.
>
> I've been exploring 22135, which I think is also the corruption
> problem that Florin just discovered in a unit test.
>
> I don't know the specific cause yet. But the difficulty of trying to
> approach it has caused me to reconsider something about the
> basic structure of the repository. I'd like to propose a change,
> which will have major repurcussions.
>
> Before I can explain it, I need to start by describing how
> we currently store text artifacts in the database. We're
> using a format called in-place-delta. In in-place-delta (IPD)
> all of the versions of a single text artifact are interleaved
> into a single list of lines of text. Each line of text has a tag
> which identifies which version its a part of.
>
> For example, take a three line artifact:
>
> Version 1:
> This is line one
> This is line two
> This is line three
>
> Version 2:
> This is line one.
> Line two changed.
> This is line three
>
> Version 3:
> This is line one
> Line two changed.
> Line three also changed.
>
> This would be represented in IPD format as:
> TAG IPDLINENUM LINE
> 1,2,3 1 This is line one
> 1 2 This is line two
> 2,3 2 Line two changed
> 1,2 3 This is line three
> 3 5 Line three also changed.
>
> (The IPDLINEUM is used to maintain the order of lines in
> the database.) In stellation, we've been representing the tag portion as a
> bitstring encoded into ASCII in a rather tricky form. The advantage of
> the encoding is that some pretty tricky SQL makes it easy to
> have the server extract the lines that make up a version.
>
> There are two major disadvantage to it. First is that updating an
> atomic artifact means completely replacing the IPD lines for a
> document. (It's too hard to work out the correct set of update/inserts,
> so we remove all lines, and then replace them.)
>
> The second major disadvantage is that when looking at the database,
> there's no good way for a human viewing the database to tell
> what lines are members of what versions. It's really hard to decode
> those tag strings by hand, and it's really hard to write a probe query
> that extracts a particular version from the repository.
>
> OK. Enough background. Here's what I'd like to do.
>
> I'd like to separate the tags from the lines in the database. And I'd
> also like to burst the tags out, so that instead of bunding the information
> about what versions a given line is part of into a tag string, it gets
> broken out into a bunch of lines in a different table.
>
> So... The Texts table becomes two tables: TextLines, and TextVersions.
> Each row in TextLines is an artifactID, and a line ID, and the line string.
>
> TextVersions is a list of (linenumber, LineID, VersionIDs). There is an
> entry in TextVersions for a LineID,VersionID if the line is a member
> of the version.
>
> There are several advantages to this, and one major disadvantage.
>
> The pros:
> - It's easy for a human viewer to see what's going on in the texts table.
> - The tags are a lot easier to read and maintain. (I'm pretty sure that
> the current bitstring tricks are what's failing in these problems.)
> - Updates can be done in-place in the database.
> - An update adding a new version will *not* be able to corrupt
> an older version. (Now, it's possible for a new version to
> mangle a tagline, effectively damaging earlier versions.)
>
> The (major) con:
> - The database size will increase significantly for projects with
> many versions.
Concons:
1. Not that much. Since you can get native short integers in Postgres, the
representation will be compact enough.
2. Also disk space it getting cheaper.
3. After a certain time, you can start squeezing harder the old
versions. You can have a janitor process doing this kind of work.
> So... What do folks think about this idea for a change? It would
> take a couple of days of work for me to do this. Before I do it,
> I'd like to know what people think about it.
It makes a lot of sense. When I saw the IPD table in your e-mail it
looked rather strange for something that sits in a relational database.
Your proposal came to my mind as the obvious way to break/normalize it.
florin
--
"If it's not broken, let's fix it till it is."
41A9 2BDE 8E11 F1C5 87A6 03EE 34B3 E075 3B90 DFE4
Attachment:
pgptKkKUtsMcb.pgp
Description: PGP signature