
Re: [jgit-dev] [RFC] Cassandra based storage layer for JGit

On Fri, Jan 28, 2011 at 08:11, Adrian <adrian.wilkins@xxxxxxxxx> wrote:
> A couple of questions
>
> Q: Presumably since this is essentially a K/V store,

Yes.

> you could just
> write a backend that used a boring old local, in-process, K/V store
> instead of a distributed one?

Yes.  That was my point about an in-memory repository.  We could
create an in-memory repository type by just making an in-memory K/V
store based on java.util.ConcurrentHashMap.  It's limited by your
heap space and isn't persistent, but that is acceptable for some
forms of unit testing.
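
For what it's worth, a bare-bones version of that idea might look
something like the sketch below.  It's only an illustration under my
own assumptions: the MemKeyValueStore name and its byte[] get/put
interface are mine, not the API this patch defines.

import java.util.Arrays;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Toy in-memory K/V store.  Keys and values are raw bytes, as the
// storage layer would see them; keys are wrapped so equals() and
// hashCode() compare contents instead of array identity.
class MemKeyValueStore {
  private static final class Key {
    final byte[] bytes;
    Key(byte[] b) { bytes = b.clone(); }
    @Override public int hashCode() { return Arrays.hashCode(bytes); }
    @Override public boolean equals(Object o) {
      return o instanceof Key && Arrays.equals(bytes, ((Key) o).bytes);
    }
  }

  private final ConcurrentMap<Key, byte[]> table =
      new ConcurrentHashMap<Key, byte[]>();

  void put(byte[] key, byte[] value) {
    table.put(new Key(key), value.clone());
  }

  byte[] get(byte[] key) {
    byte[] v = table.get(new Key(key));
    return v != null ? v.clone() : null;
  }

  void delete(byte[] key) {
    table.remove(new Key(key));
  }
}

Everything lives in the heap, so it disappears when the JVM exits,
which is exactly what you want for unit tests.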

We could also use an in-process K/V store that persists to local
disk... but in my opinion there is very little value in that.  The
"native" storage.file format is very efficient and is compatible with
C Git, as well as other Git implementations like Dulwich.  The value
of a distributed K/V store is for large centralized hosting, such as
what eclipse.org needs to do for committers, or what a big company
might do internally for its engineers.  Here dedicating 4 machines to
running Git processes in a load-balanced, automatic hot-failover
system can make a lot of sense.  Using a stable distributed K/V for
most of that storage simplifies things for everyone.

> Q: Given that as the case, would it make sense to keep pack files as
> the pack storage, presuming their performance is much higher, only
> retaining the use of the K/V store for loose objects?

The scheme this patch proposes is to use the pack file format even
inside of the K/V store.  A pack is sliced into chunks, and each chunk
is saved in the K/V store almost exactly as-is.  It's very efficient,
and means that on at least some K/V stores the K/V storage cost of a
repository is only a few KBs larger than the same repository on local
disk would be.  (Where the few KBs is the K/V store's framing of the
pack chunks.)  E.g. the linux-2.6 repository is about 400 MB, and it
goes into the K/V store at about the same size... because it's sliced into
about 407 chunks, each around 1 MB in size.  If the K/V only uses a
few hundred bytes per chunk for framing in its own storage files,
that's only a few KBs additional overhead.
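
If it helps to picture the slicing, here is a toy version of it,
again purely illustrative: the PackSlicer name, the 1 MB constant,
and the "repo/chunk/N" key format are my own stand-ins, not what the
patch actually uses for chunk keys.

import java.io.IOException;
import java.io.InputStream;
import java.util.Map;

// Slice a pack stream into ~1 MB values and store each slice in a
// K/V map under a simple string key.  Only the sizing matters here.
class PackSlicer {
  static final int CHUNK_SIZE = 1024 * 1024; // ~1 MB per chunk

  static int slice(InputStream pack, Map<String, byte[]> store,
      String repoName) throws IOException {
    byte[] buf = new byte[CHUNK_SIZE];
    int chunkCount = 0;
    int n;
    while ((n = readFully(pack, buf)) > 0) {
      byte[] value = new byte[n];
      System.arraycopy(buf, 0, value, 0, n);
      store.put(repoName + "/chunk/" + chunkCount, value);
      chunkCount++;
    }
    return chunkCount; // roughly 400 chunks for a ~400 MB pack
  }

  // Fill buf as far as possible; returns bytes read, 0 at end of stream.
  private static int readFully(InputStream in, byte[] buf)
      throws IOException {
    int off = 0;
    while (off < buf.length) {
      int n = in.read(buf, off, buf.length - off);
      if (n < 0)
        break;
      off += n;
    }
    return off;
  }
}

The pack bytes themselves are stored verbatim, so the only overhead on
top of the pack is whatever the K/V store spends framing each of those
few hundred values.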

But again, this K/V store stuff is really meant for larger
installations where the owners want multiple machines dedicated to the
task, for redundancy and load-balancing purposes.  Given how cheap
disk storage is these days, and that each disk is probably much larger
than you would ever need for Git repository storage anyway, it's likely
you would ask the K/V store to replicate the data several times (e.g.
once to each machine, keeping the data current on writes).  This is
acceptable though: saying linux-2.6 costs you 4x storage in the K/V
store when you have 4x replication for redundancy and read
load-balancing makes a lot of sense to an IT administrator who is
trying to determine how much hardware to purchase.  :-)
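
(To put numbers on that: the ~400 MB linux-2.6 pack replicated once to
each of 4 machines is roughly 4 x 400 MB = 1.6 GB of raw disk across
the cluster, plus a few KBs of chunk framing per replica.)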

> Specifically I'm exploring the idea of managing objects that are
> values in a K/V store, rather than objects that are files, by breaking
> the keyspace into trees.

I'm not sure this is going to be very helpful.

Last October-ish I tried a different NoSQL based storage
implementation that stored each object into its own row.  For
linux-2.6 that meant 1.8 million rows, rather than ~407 as I described
above.  It also required range scan support from the NoSQL server,
which reduced the number of systems that could be supported; the
server was no longer just a K/V store, it had to use a binary tree or
sorted file as its underlying storage system.

It was _way_ slower than what I'm doing now, and it took a lot more
coding to get a lot less functionality.  Taking a very simple approach
of just treating the K/V store as a big virtual memory system and
slicing the pack up into ~1 MB chunks for the K/V seems to be working
well.  There are still some operations that are sucking (e.g. object
counting on linux-2.6 takes 15+ minutes), but these suck anyway in
JGit on local disk (it easily takes a few minutes).  Last night I came
up with another idea to cache data so this isn't needed... and I think
it works for most projects.  So I'm off trying to code
that now.  :-)

-- 
Shawn.

