Re: [jgit-dev] [RFC] Cassandra based storage layer for JGit

On Fri, Oct 22, 2010 at 14:30, Shawn Pearce <spearce@xxxxxxxxxxx> wrote:
> Last weekend I wrote a read-only storage layer for JGit that uses
> Apache Cassandra[1] as the backend data store.  I have now posted the
> source under the EDL here:

I have rewritten this from scratch to use a generic JGit DHT
implementation, making the actual glue onto the database very, very
thin.

Also, it now supports push, which means it's a full-blown repository
system, and is as useful as JGit itself.  Yes kids, Christmas came
late this year.

The implementation is now offered as part of JGit, and has no
additional dependencies outside of the JGit core library, which makes
it easy to host this as part of the JGit project.  It's broken down
into an "all of the code doing stuff" part within JGit and a thin spi
layer to talk to the database; this simple spi layer doesn't have to
be part of the JGit project.
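To make that split concrete, here is a hedged sketch of what a thin
spi surface could look like; the names (DhtDatabase, MemoryDatabase)
are invented for illustration and are not the actual interfaces in
the JGit commit:

```java
import java.util.HashMap;
import java.util.Map;

// Invented spi surface: the generic DHT code would only ever see a
// small set of key-value operations like these.
interface DhtDatabase {
    byte[] get(String table, String key);              // null when absent
    void put(String table, String key, byte[] value);
}

// The entire "glue" for a toy backend is one map per table; a real
// binding would issue the equivalent Cassandra/S3/JDBC calls instead.
class MemoryDatabase implements DhtDatabase {
    private final Map<String, Map<String, byte[]>> tables = new HashMap<>();

    public byte[] get(String table, String key) {
        Map<String, byte[]> t = tables.get(table);
        return t == null ? null : t.get(key);
    }

    public void put(String table, String key, byte[] value) {
        tables.computeIfAbsent(table, k -> new HashMap<>()).put(key, value);
    }
}
```

A backend that small also shows why the glue can stay outside the
JGit project proper.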

The Cassandra spi glue is about 1,600 lines of code, and is on
GitHub.  This still depends upon Hector, Thrift, and about a billion
other 3rd party libraries, making it very difficult to import under
the foundation umbrella.  (If we can get the dependencies resolved,
we can move this over too.)

I plan to write some more spi implementations, and will try to make at
least some of them available through the core JGit project.  Amazon S3
should be possible, since we have an S3 client already available
within JGit, and I think it might actually be suitable for the spi
layer.  JDBC shouldn't be too bad either, backing onto any SQL
database with a JDBC driver.  (JDBC might be ugly if you insist on
having the protobuf encoded data members as SQL columns, rather than a
binary blob shoved into the database... but I digress.)
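As a sketch of the blob approach (the table and column names here are
made up, not a schema JGit defines), the whole protobuf-encoded
record goes into a single BLOB column via plain JDBC:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Hypothetical schema: one key column plus the encoded message bytes.
class ChunkTable {
    static final String CREATE =
        "CREATE TABLE IF NOT EXISTS chunk"
        + " (chunk_key VARCHAR(64) PRIMARY KEY, data BLOB)";

    // Store the protobuf message whole, without mapping its fields
    // onto individual SQL columns.
    static void put(Connection conn, String key, byte[] encoded)
            throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO chunk (chunk_key, data) VALUES (?, ?)")) {
            ps.setString(1, key);
            ps.setBytes(2, encoded);
            ps.executeUpdate();
        }
    }

    static byte[] get(Connection conn, String key) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT data FROM chunk WHERE chunk_key = ?")) {
            ps.setString(1, key);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getBytes(1) : null;
            }
        }
    }
}
```

Mapping the protobuf fields onto real columns instead would mean a
schema change every time the message format grows, which is where the
ugliness comes in.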

I had been working on an in-memory only version of Repository to
make unit testing easier.  It's hard, like five thousand lines hard.
I might just create an in-memory only spi for the DHT code above and
offer that as the in-memory unit testing implementation.  :-)

> repositories that are stored in Cassandra.  With all processes on my
> laptop, it was serving about 420 KiB/sec.

This newer code does things way faster (still all on my laptop):

$ git push git://localhost:9418/jgit.git master
Counting objects: 16375, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (3667/3667), done.
Writing objects: 100% (16375/16375), 3.34 MiB | 739 KiB/s, done.
Total 16375 (delta 9455), reused 13894 (delta 7911)
To git://localhost:9418/jgit.git
 * [new branch]      master -> master

$ git clone --bare git://localhost:9418/jgit.git in_jgit.git
Cloning into bare repository in_jgit.git...
remote: Counting objects: 16375, done
remote: Compressing objects: 100% (5838/5838)
remote: Compressing objects: 100% (5823/5823)
Receiving objects: 100% (16375/16375), 3.27 MiB, done.
Resolving deltas: 100% (9566/9566), done.

It's missing a lot of documentation, totally lacks unit tests, and
has some TODOs left in, related to handling really big objects.
Performance of the DhtPackParser leaves a lot to be desired, as there
isn't any prefetching occurring during delta resolution.  Basically
the code works and is feature complete, but isn't optimized.  My plan
is to try to do the remainder of the work incrementally in the open,
built on top of this first commit.

No, this is *NOT* production ready.  I just got it working today.
I've only been working on the rewrite since about 10 pm Monday
evening.  It's the result of an all-night hacking session and two
and a half very long days.  I wouldn't trust storing my hello world
collection in it, let alone something I cared about.

