Re: [jgit-dev] [RFC] Cassandra based storage layer for JGit
On Fri, Oct 22, 2010 at 14:30, Shawn Pearce <spearce@xxxxxxxxxxx> wrote:
> Last weekend I wrote a read-only storage layer for JGit that uses
> Apache Cassandra as the backend data store. I have now posted the
> source under the EDL here:
>
>   http://github.com/spearce/jgit_cassandra

I have rewritten this from scratch to use a generic JGit DHT
implementation, making the actual glue onto the database very, very
thin. It also now supports push, which means it is a full-blown
repository system, and is as useful as JGit itself. Yes kids,
Christmas came late this year.

The implementation is now offered as part of JGit, and has no
additional dependencies outside of the JGit core library, which makes
it easy to host as part of the JGit project. It is broken down into an
"all of the code doing stuff" part within JGit, and a thin spi layer
to talk to the database; this simple spi layer doesn't have to be part
of the JGit project.

  http://egit.eclipse.org/r/2295

The Cassandra spi glue is about 1,600 lines of code, and is on
GitHub. It still depends upon Hector, Thrift, and about a billion
other 3rd party libraries, making it very difficult to import under
the foundation umbrella. (If we can get the dependencies resolved, we
can move this over too.)

  http://github.com/spearce/jgit_cassandra

I plan to write some more spi implementations, and will try to make
at least some of them available through the core JGit project. Amazon
S3 should be possible, since we have an S3 client already available
within JGit, and I think it might actually be suitable for the spi
layer. JDBC shouldn't be too bad either, backing onto any SQL
database with a JDBC driver. (JDBC might be ugly if you insist on
having the protobuf-encoded data members as SQL columns, rather than
a binary blob shoved into the database... but I digress.)

I had been working on an in-memory only version of Repository to make
unit testing easier. It's hard; like five thousand lines hard.
I might just create an in-memory only spi for the DHT code above and
offer that as the in-memory unit testing implementation. :-)

> repositories that are stored in Cassandra. With all processes on my
> laptop, it was serving about 420 KiB/sec.

This newer code does things way faster (still all on my laptop):

  $ git push git://localhost:9418/jgit.git master
  Counting objects: 16375, done.
  Delta compression using up to 4 threads.
  Compressing objects: 100% (3667/3667), done.
  Writing objects: 100% (16375/16375), 3.34 MiB | 739 KiB/s, done.
  Total 16375 (delta 9455), reused 13894 (delta 7911)
  To git://localhost:9418/jgit.git
   * [new branch]      master -> master

  $ git clone --bare git://localhost:9418/jgit.git in_jgit.git
  Cloning into bare repository in_jgit.git...
  remote: Counting objects: 16375, done
  remote: Compressing objects: 100% (5838/5838)
  remote: Compressing objects: 100% (5823/5823)
  Receiving objects: 100% (16375/16375), 3.27 MiB, done.
  Resolving deltas: 100% (9566/9566), done.

It's missing a lot of documentation, totally lacks unit tests, and
has some TODOs left in relating to handling really big objects.
Performance of the DhtPackParser leaves a lot to be desired, as there
isn't any prefetching occurring during delta resolution. Basically
the code works and is feature complete, but isn't optimized. My plan
is to try to do the remainder of the work incrementally in the open,
built on top of this first commit.

No, this is *NOT* production ready. I just got it working today. I've
only been working on the rewrite since about 10 pm Monday evening.
It's the result of an all-night hacking session, and two and a half
very long days. I wouldn't trust storing my hello world collection in
it, let alone something I cared about.

-- 
Shawn.