Re: [jgit-dev] [RFC] Cassandra based storage layer for JGit
On Fri, Jan 28, 2011 at 08:11, Adrian <adrian.wilkins@xxxxxxxxx> wrote:
> A couple of questions
>
> Q: Presumably since this is essentially a K/V store,

Yes.

> you could just
> write a backend that used a boring old local, in-process, K/V store
> instead of a distributed one?

Yes. That was my point about an in-memory repository. We could create
an in-memory repository type by just making an in-memory K/V store,
based on java.util.concurrent.ConcurrentHashMap. It's limited by your
heap space, and isn't persistent, but this is acceptable for some
forms of unit testing.

We could also use an in-process K/V store that persists to local
disk... but there is very little value to this in my opinion. The
"native" storage.file format is very efficient and is compatible with
C Git, as well as other Git implementations like Dulwich.

The value of a distributed K/V store is for large centralized hosting,
such as what eclipse.org needs to do for committers, or what a big
company might do internally for its engineers. Here dedicating 4
machines to running Git processes in a load-balanced, automatic
hot-failover system can make a lot of sense. Using a stable
distributed K/V for most of that storage simplifies things for
everyone.

> Q: Given that as the case, would it make sense to keep pack files as
> the pack storage, presuming their performance is much higher, only
> retaining the use of the K/V store for loose objects?

The scheme this patch proposes is to use the pack file format even
inside of the K/V store. A pack is sliced into chunks, and each chunk
is saved in the K/V store almost exactly as-is. It's very efficient,
and means that on at least some K/V stores the K/V storage cost of a
repository is only a few KBs larger than the same repository on local
disk would be. (Where the few KBs is the K/V store's framing of the
pack chunks.)

E.g. the linux-2.6 repository is about 400 MB, and it goes into the
K/V at about the same size...
because it's sliced into about 407 chunks, each around 1 MB in size.
If the K/V only uses a few hundred bytes per chunk for framing in its
own storage files, that's only a few KBs additional overhead.

But again, this K/V store stuff is really meant for larger
installations where the owners want multiple machines dedicated to the
task, for redundancy and load-balancing purposes. Given how cheap disk
storage is these days, and that each disk is probably much larger than
you would ever need anyway for Git repository storage, it's likely you
would ask the K/V store to replicate the data several times (e.g. once
to each machine, and keep the data current on writes).

This is acceptable though. Saying linux-2.6 costs you 4x storage in a
K/V when you have 4x replication for redundancy and read
load-balancing makes a lot of sense to an IT administrator who is
trying to determine how much hardware to purchase. :-)

> Specifically I'm exploring the idea of managing objects that are
> values in a K/V store, rather than objects that are files, by breaking
> the keyspace into trees.

I'm not sure this is going to be very helpful.

Last October-ish I tried a different NoSQL based storage
implementation that stored each object into its own row. For linux-2.6
that meant 1.8 million rows, rather than ~407 as I described above. It
also required range scan support from the NoSQL server, which reduced
the number of systems that could be supported: it was no longer just a
K/V store, it had to use a binary tree or sorted file as its
underlying storage system. It was _way_ slower than what I'm doing
now, and it took a lot more coding to get a lot less functionality.

Taking a very simple approach of just treating the K/V store as a big
virtual memory system and slicing the pack up into ~1 MB chunks for
the K/V seems to be working well. There are still some operations that
are sucking (e.g.
object counting linux-2.6 takes 15+ minutes), but these suck anyway in
JGit on local disk (easily takes a few minutes). Last night I came up
with another idea to try and cache data to prevent needing to do
this... and I think it works for most projects. So I'm off trying to
code that now. :-)

--
Shawn.
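
For readers who want to see the shape of the idea, here is a minimal
sketch of "slice the pack into chunks and put each chunk in a K/V
store", using a ConcurrentHashMap as the in-memory store mentioned
above. It is not the code from the patch and not JGit's API: the class
name ChunkedPackSketch, the "<packName>:<index>" key layout, and the
putPack/readByte methods are all assumptions made up for illustration,
and the byte array is only a stand-in for real pack data.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch only; names and key layout are not from JGit. */
final class ChunkedPackSketch {
    /** In-memory K/V store: chunk key -> chunk bytes. Lost on exit. */
    private final Map<String, byte[]> store = new ConcurrentHashMap<>();
    private final int chunkSize;

    ChunkedPackSketch(int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        this.chunkSize = chunkSize;
    }

    /** Assumed key layout: pack name plus chunk index. */
    private static String key(String packName, long index) {
        return packName + ":" + index;
    }

    /** Slices the pack bytes into fixed-size chunks; returns the chunk count. */
    int putPack(String packName, byte[] pack) {
        int count = 0;
        for (int off = 0; off < pack.length; off += chunkSize) {
            int end = Math.min(pack.length, off + chunkSize);
            // Each chunk is stored as-is; only the last one may be shorter.
            store.put(key(packName, count), Arrays.copyOfRange(pack, off, end));
            count++;
        }
        return count;
    }

    /** Reads one byte at a pack offset by fetching only the chunk holding it. */
    byte readByte(String packName, long offset) {
        byte[] chunk = store.get(key(packName, offset / chunkSize));
        int pos = (int) (offset % chunkSize);
        if (chunk == null || pos >= chunk.length) {
            throw new IndexOutOfBoundsException("offset " + offset);
        }
        return chunk[pos];
    }

    public static void main(String[] args) {
        int oneMiB = 1024 * 1024; // the "~1 MB" chunk size from the message
        ChunkedPackSketch kv = new ChunkedPackSketch(oneMiB);
        byte[] fakePack = new byte[2 * oneMiB + 512 * 1024]; // 2.5 MiB stand-in
        fakePack[oneMiB + 7] = 42;
        System.out.println(kv.putPack("pack-demo", fakePack) + " chunks");
        System.out.println(kv.readByte("pack-demo", oneMiB + 7));
    }
}
```

The point of the layout is that a read at any pack offset touches
exactly one key, computed by integer division, so the store never
needs range scans. A real backend would swap the map for a client of
the distributed store and would need to handle objects that span a
chunk boundary, which this sketch ignores.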