[jgit-dev] [RFC] Cassandra based storage layer for JGit

Last weekend I wrote a read-only storage layer for JGit that uses
Apache Cassandra[1] as the backend data store.  I have now posted the
source under the EDL here:

  http://github.com/spearce/jgit_cassandra

Right now the implementation is read-only.  A repository must be
imported using the cassandra-import command line utility, and the
source repository needs to be fully repacked so that there is exactly
1 pack file containing the repository's object contents (the importer
checks for this and fails if it isn't packed as expected).  The
cassandra-daemon program provides a git:// style daemon that serves
repositories that are stored in Cassandra.  With all processes on my
laptop, it was serving about 420 KiB/sec.
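
The single-pack precondition described above can be sketched roughly as
follows. This is not the importer's actual code: the class and method
names are hypothetical, and it only counts *.pack files under
objects/pack in a standard repository layout, without checking for
loose objects.

```java
import java.io.File;

/** Hypothetical pre-import check: the source repository must hold exactly one pack. */
final class SinglePackCheck {
    /** Counts the entries whose name ends in ".pack" (index files are ignored). */
    static int countPacks(String[] fileNames) {
        int count = 0;
        for (String name : fileNames) {
            if (name.endsWith(".pack")) {
                count++;
            }
        }
        return count;
    }

    /** Throws if gitDir/objects/pack does not contain exactly one pack file. */
    static void requireSinglePack(File gitDir) {
        File packDir = new File(new File(gitDir, "objects"), "pack");
        String[] names = packDir.list(); // null when the directory is missing
        if (names == null) {
            throw new IllegalStateException("no pack directory: " + packDir);
        }
        int packs = countPacks(names);
        if (packs != 1) {
            throw new IllegalStateException("expected exactly 1 pack file, found "
                    + packs + "; try 'git repack -a -d' first");
        }
    }
}
```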

The implementation is far slower than the local filesystem, but it can
complete a clone request.  It's ~5k lines of code, doesn't support
writing/pushing, doesn't support large objects, has no unit tests, and
is missing the schema documentation describing exactly how we store
Git data in Cassandra (and more importantly why we do it the way we
do it).  I expect a complete implementation to be closer to ~30-40kloc
in size once writing support, more performance optimizations, and unit
tests are also included.  I'll make the schema documentation available
early next week, as I plan to present it at GitTogether '10.

The implementation sits on top of hector[2], a Java client library for
Cassandra.  Given the huge dependency tree involved here we might not
contribute the code to the JGit project at the Eclipse Foundation.  I
don't have the desire to manage this giant stack of non-Eclipse
libraries in Orbit.  But I would like to continue to work on it and
develop this storage implementation to one day be production quality.
One possible use for this is a Gerrit Code Review repository manager
sitting on top of a Cassandra cluster, to support large project
hosting systems with multiple servers.


If you want to start looking at the code, start with
CassandraRepositoryBuilder, which configures our connection to the
cluster.  You might also want to see pgm.Main, which permits the user
to run something like:

  jgit --git-dir git+cassandra://server/cluster/keyspace/repository.git log

and have that read from Cassandra rather than from a local filesystem
directory.  It does this by passing the URL into
CassandraRepositoryBuilder and using that to feed the Repository to
the rest of the program infrastructure that is reachable from the
command line.
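
To illustrate how a URL of that shape breaks down, here is a minimal
sketch that splits it into server, cluster, keyspace and repository
using java.net.URI.  The CassandraUrl class is hypothetical and is not
what CassandraRepositoryBuilder actually does; it assumes the path
always has exactly the three components shown in the example above.

```java
import java.net.URI;

/** Hypothetical holder for the parts of a git+cassandra:// URL. */
final class CassandraUrl {
    final String server;
    final String cluster;
    final String keyspace;
    final String repository;

    private CassandraUrl(String server, String cluster, String keyspace, String repository) {
        this.server = server;
        this.cluster = cluster;
        this.keyspace = keyspace;
        this.repository = repository;
    }

    /** Parses git+cassandra://server/cluster/keyspace/repository.git */
    static CassandraUrl parse(String url) {
        URI uri = URI.create(url);
        if (!"git+cassandra".equals(uri.getScheme())) {
            throw new IllegalArgumentException("not a git+cassandra URL: " + url);
        }
        String host = uri.getHost();
        String path = uri.getPath();
        if (host == null || path == null || path.length() < 2) {
            throw new IllegalArgumentException("missing server or path: " + url);
        }
        // Drop the leading '/', then split into cluster / keyspace / repository.
        String[] parts = path.substring(1).split("/", 3);
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected /cluster/keyspace/repository: " + url);
        }
        for (String part : parts) {
            if (part.isEmpty()) {
                throw new IllegalArgumentException("empty path component: " + url);
            }
        }
        return new CassandraUrl(host, parts[0], parts[1], parts[2]);
    }
}
```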

CassandraObjectDatabase is the main ObjectDatabase glue required to
implement a new storage layer for JGit.  As you may notice, it's not much
more than a factory for CsObjectReader, which does the real work
associated with loading objects from the storage system.
CsObjectReader is fairly complex because it implements the optional
async interfaces to try and perform reads asynchronously to the main
JGit worker thread in order to hide the latency involved in talking to
Cassandra.  CassandraRefDatabase provides the read implementation
necessary to load references from the storage system.  Most of the
rest of the code exists only to support JGit's PackWriter, and is what
permits us to serve a clone request within a reasonable time bound.
It's a huge amount of code.  :-(
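
The latency-hiding idea can be sketched in a few lines: issue every
read up front on a background pool, then consume the results in
request order, so the caller waits roughly one round trip instead of
one per object.  This is only an illustration using
java.util.concurrent, not JGit's async interfaces or the CsObjectReader
code; AsyncPrefetch, loadAll and the slowLoad function are hypothetical
names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

/** Hypothetical helper: overlap many slow reads instead of doing them one by one. */
final class AsyncPrefetch {
    /** Loads every key with slowLoad on a background pool; results keep the key order. */
    static <K, V> List<V> loadAll(List<K> keys, Function<K, V> slowLoad, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Start all reads before waiting on any of them.
            List<CompletableFuture<V>> pending = new ArrayList<>();
            for (K key : keys) {
                pending.add(CompletableFuture.supplyAsync(() -> slowLoad.apply(key), pool));
            }
            // Collect in the original order; join() blocks only until that read is done.
            List<V> results = new ArrayList<>();
            for (CompletableFuture<V> future : pending) {
                results.add(future.join());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```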

Most of this code was developed for a distributed database other
than Apache Cassandra.  I ported it all over in about 1 day, and
most of it is identical to the other implementation I have.  Which
tells me that JGit on top of a database system probably needs most of
this same code, and there may be a suitable abstraction we can make
that permits us to reuse most of this code but just swap the
underlying database.  Assuming you use the same schema anyway.  I
unfortunately had to make some schema changes, because Apache Cassandra
isn't able to do exactly the same things as the other distributed
database I was using in my original implementation.


I'm currently in the process of tearing JGit's IndexPack class apart
and putting it back together in a way that will allow us to support
pushing into Cassandra, or having a Cassandra based repository act as
a fetch client.  I'll post changes for review as I get something
working; right now it doesn't even compile.  I am effectively
starting over from scratch, because the IndexPack code is a mess and
can't really be refactored.  After I'm done rebuilding something new, I'll
see if I can do it as a series of refactorings, but I doubt that I
can.


[1]  http://cassandra.apache.org/
[2]  http://github.com/rantav/hector

-- 
Shawn.
