Re: [jgit-dev] JDBC based storage layer for JGit

On Tue, Feb 15, 2011 at 11:40, Thun, Philipp <philipp.thun@xxxxxxx> wrote:
> I've tried to review Shawn's DHT change (http://egit.eclipse.org/r/2295) and
> then thought that it might be an interesting task for me to write an
> implementation for JDBC. It's easier to understand something when you try it
> out…
>
> So today I have uploaded a very first version of this JDBC storage provider
> to github: https://github.com/philippthun/jgit_jdbc
>
> It is a really simple, prototypical (e.g. the complete database contents
> are Base64 encoded) and primitive implementation - but I have successfully
> pushed and cloned JGit itself into a PostgreSQL database with it, so I was
> actually quite happy ;-)

This is quite cool. I was thinking just this morning that we should
also write a JDBC-based backend, because I think it would at least be
an interesting exercise, but it might also make for a decent
small-scale implementation.  :-)

Some comments after some random browsing through the code:

* RowKey.toBytes() returns US-ASCII data, but in a byte[]. There is no
need to encode it further with Base64. That won't really simplify your
code, because JDBC really wants a String rather than a byte[] here, so
you'll have to convert the byte[] into a String anyway (and do the
reverse coming back). But you don't want to mangle the already-encoded
value further: it makes range scans difficult, and managing the data
by peeking at the raw database even harder.
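
For example, the conversion in both directions is just a charset decode
(sketch only; rowKey stands in for any RowKey instance):

  // import java.nio.charset.Charset;
  Charset ascii = Charset.forName("US-ASCII");
  // RowKey.toBytes() is already US-ASCII, so a plain decode yields a
  // database-friendly String; no Base64 required.
  String key = new String(rowKey.toBytes(), ascii);
  // ... and the reverse when reading the key back out:
  byte[] raw = key.getBytes(ascii);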

* Database is a singleton; you'll need some sort of connection pool to
borrow a java.sql.Connection from during each of the spi methods, and
to put it back into the pool when the method ends. The pool should be
fast, which implies it cannot validate the connection when you borrow
it. This is a limitation in the way the DHT code is put together; we
may find that we should expose the DhtReader to the spi and allow it
to cache some sort of connection handle inside of the DhtReader.
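
In sketch form, every spi method would follow the same borrow/return
pattern (pool here is any pooling javax.sql.DataSource; the names are
placeholders):

  Connection conn = pool.getConnection();
  try {
    // ... execute the statements for this spi method ...
  } finally {
    conn.close(); // a pooling DataSource returns it to the pool here
  }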

* Asynchronous methods like ObjectIndexTable.get() should do all of
their query work in the background through the ExecutorService. Both
the loop over the objects and the callback invocation should occur in
the Runnable you submit to the ExecutorService. The generic DHT code
uses this calling strategy to schedule a bunch of stuff in parallel at
once (which is another reason why each method call needs its own
unique java.sql.Connection).
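
Roughly like this (sketch only; findObjects() is a hypothetical helper
doing the per-object SELECTs, and the callback method names are
approximate):

  executor.submit(new Runnable() {
    public void run() {
      try {
        // All of the per-object query work happens here, off the
        // caller's thread, and the callback fires from here too.
        callback.onSuccess(findObjects(objects));
      } catch (DhtException err) {
        callback.onFailure(err);
      }
    }
  });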

* Methods like ObjectIndexTable.remove() are removing 1 "cell", that
is, the unique pairing of an ObjectIndexKey and a ChunkKey. Right now
you are removing all rows for that object, which would break under
garbage collection after a repack occurs.
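
That is, the DELETE needs both halves of the key in its WHERE clause,
something like this (table and column names invented; asString() is
the US-ASCII conversion from above):

  // Remove exactly one cell: this ObjectIndexKey/ChunkKey pairing,
  // leaving the object's entries in other chunks untouched.
  PreparedStatement stmt = conn.prepareStatement(
      "DELETE FROM object_index WHERE oi_object = ? AND oi_chunk = ?");
  stmt.setString(1, asString(objectKey));
  stmt.setString(2, asString(chunkKey));
  stmt.executeUpdate();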

* RepositoryTable, I would probably map onto multiple tables in SQL:

  create table repository_chunk_info (r_key, r_chunk_key, r_chunk_info);
  create table repository_cached_pack (r_key, r_cached_pack_key,
      r_cached_pack_info);

* ChunkTable, when trying for the update, use "SELECT 1 FROM chunk
WHERE c_key = ..". You don't actually need the old column values if
you dynamically build the SQL to update only those fields of
PackChunk.Members which are non-null. This avoids dragging a lot of
data from the SQL server back to the application, and reduces how much
the SQL server has to churn through its WAL. The way the DHT code
performs puts, the large segment (chunkData) is put during the
insertion; during later calls chunk.getChunkData() == null, but the
other two smaller payloads, getChunkIndex() and getMeta(), are
populated and should be updated into the row. You may even be able to
key off of that hint and do something like:

  if (chunk.getChunkData() != null) {
    // assume insert; if we get a duplicate key exception, the DHT is busted
  } else {
    // assume update; if we don't update exactly 1 row, the DHT is busted
  }

I think we might be able to make this promise in the generic DHT code
and make it part of the spi contract, which means our unit tests for
the generic DHT code should ensure this holds true all of the time.
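
Spelled out a bit further in JDBC terms (still only a sketch: the
schema, the asString()/asBytes() helpers, and the exact accessor names
are assumptions; key is the ChunkKey for this chunk):

  if (chunk.getChunkData() != null) {
    // First put: insert the big payload. A duplicate key exception
    // here would mean the DHT is busted.
    PreparedStatement ins = conn.prepareStatement(
        "INSERT INTO chunk (c_key, c_data) VALUES (?, ?)");
    ins.setString(1, asString(key));
    ins.setBytes(2, chunk.getChunkData());
    ins.executeUpdate();
  } else {
    // Later put: only the two small payloads changed, so update just
    // those columns and never read the old row back.
    PreparedStatement upd = conn.prepareStatement(
        "UPDATE chunk SET c_index = ?, c_meta = ? WHERE c_key = ?");
    upd.setBytes(1, chunk.getChunkIndex());
    upd.setBytes(2, asBytes(chunk.getMeta())); // meta serialized to byte[]
    upd.setString(3, asString(key));
    if (upd.executeUpdate() != 1)
      throw new DhtException("chunk vanished during update; DHT is busted");
  }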

* You really don't want to Base64 encode the ChunkTable. Its data is
all quite large; you should be using binary column types.
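
e.g. for PostgreSQL (bytea is the PostgreSQL spelling; other databases
call it BLOB or varbinary), paired with PreparedStatement.setBytes()
and ResultSet.getBytes() on the JDBC side:

  create table chunk (c_key varchar, c_data bytea, c_index bytea, c_meta bytea);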

* I've wondered if we should expose more of the data to the SPI layer.
For example, RefData in RefTable. The fields it holds might be very
useful to have broken out as proper SQL columns... even if that makes
the JDBC SPI provider more annoying to maintain.

-- 
Shawn.

