Skip to main content

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] [List Home]
Re: [jgit-dev] Update on JGit on DHT

On Fri, Jul 1, 2011 at 15:58, Matthias Sohn
<matthias.sohn@xxxxxxxxxxxxxx> wrote:
> 2011/7/1 Shawn Pearce <spearce@xxxxxxxxxxx>
>> I may be giving up on the concept of putting JGit on a DHT. I just
>> can't find any DHT that can handle the read traffic caused by the
>> schema, and a clone request for the linux-2.6 repository (2 million
>> objects / 400 MB).
> that's a pity regarding the work you invested

It happens. I kept chasing down something that I thought was certain
to work. It doesn't really always work. I probably should have given
up sooner.

>> What *may* work is putting the standard Git pack file and index file
>> onto a high-latency "filesystem"... and ensuring the Git process has a
>> bunch of memory available for caching. I simulated a filesystem with a
>> 1 MB block read latency of 500-2000ms... if the linux-2.6 repository
>> is fully repacked, and we have ~1 GB of cache, a cold-cache clone can
>> still run in a matter of minutes. (Compared to the DHT stuff I
>> submitted before, where it may be 15-30 minutes!)
> what's the major difference causing this large difference between DHT and
> DFS performance and why is DFS solution easier to make fast ? Compared
> to local file system both solutions store large Blobs on some high-latency storage...

I think it is a few things.

First is code complexity, there is a lot less code involved, so its
just plain simpler. There are less smarts that we are trying to
balance to always do the right thing for any given read workload, and
the `git clone` type of action goes through different stages of
different types of read workloads. Its hard to be right all of the
time. As an example, the "Finding sources" phase of PackWriter looks
up every stored copy of an object to see if it is delta compressed,
and if so, who its delta base is. The DHT storage layer has a huge
amount of code to try and convert the 2 million ObjectIds into batches
that can be forwarded to the database, run batches in parallel (to try
and hide RPC latency), absorb the responses, and match them back up to
the ObjectToPack instances that need to be filled. This is a ton of
effort. Do you know what the DFS / local disk version does?  The
PackIndex is pegged in-memory. It binary searches for each object in
that in-memory PackIndex, gets the data it needs. Much simpler, and
much faster.

Another problem is that OBJECT_INDEX table. In the DHT code its used
in place of the PackIndex, to translate ObjectId to location of raw
object storage. The table is huge. The row keys are 40 byte hex
strings, each row contains a ChunkKey (another 49 byte string), and
position + delta base data (at least another 28 bytes). So for
something that the PackIndex stores in 28 bytes... the DHT code stores
in *at least* 117 bytes, before you even consider the database's own
storage overheads. We need to read all 2 million rows related to the
linux-2.6 kernel to clone the linux-2.6 kernel, and at >4x bigger than
the DFS/local version, that is pretty ugly.  Gigabit Ethernet still
only runs at Gigabit speeds... 4x larger volume of data takes 4x
longer to transfer. In the DHT version, its actually even more than
that because its handled in batches of rows, there are a lot of
batches to move 2 million object records around. In the DFS version,
its just streaming the 55 MB pack-*.idx file, even if this is RPC-ish
and not a straight TCP stream like HTTP GET would be, you still get
better density in the RPC messages.

Prefetching is easier with the DFS. In the DHT schema pack chunks were
written out to the CHUNK table using a SHA-1 of the content as the row
key. This means the row keys have no relationship to processing order.
Given one chunk, we cannot easily predict what the next chunk(s) the
reader needs are, and thus cannot try to load them in the background
while the reader is chewing on the current chunk. I tried to
approximate this with the ChunkMeta field and its prefetch hints.
Unfortunately these ChunkMeta objects are actually rather large, to
get decent prefetching I had to put 50 chunk keys into each meta
listing then the next 50 chunks. That is a lot of repetition as chunk
A lists B,C,D,E..., B lists C,D,E,F,... etc. At 52 bytes per entry
this was 2.6 KiB per chunk. Not an insignificant figure. The
prefetching code was very complex, and again like above with the
OBJECT_INDEX table there was a lot of code trying to load these chunks
as fast as possible from the database and then putting them back
together for the reader.  None of this matters with the DFS.  Git pack
files have good locality... assuming you use JGit's PackWriter (it
produces better locality than C Git does right now). Given good data
locality, a very simple read-ahead X bytes algorithm works great. And
given that the DFS is just a stream of bytes, this is easy to
implement, and have perform well.

Another big problem with the DHT was the local object index inside of
each chunk. Chunks on average held 2600 objects (for commit/tree
chunks anyway). The local index inside the chunk row is a pack-index
style mapping of ObjectId to offset within that chunk, and was used to
avoid round-trips to the OBJECT_INDEX table when processing commits
and trees during a RevWalk or ObjectWalk. On average each object
needed 23 bytes, plus the 1024 byte fan-out table at the front. So the
local index was usually at least 60 KiB. When add that to the meta, we
have an additional 62.6 KiB of data for every ~1 MiB of data. And all
of this is redundant given that we already need to read OBJECT_INDEX.

On a ~400 MiB repository, we have an extra 25 MiB of redundant data
that gets downloaded to the JGit process. And that is before you
consider the cost of the OBJECT_INDEX table also being fully loaded,
which is at least 223 MiB of data for the linux kernel repository.

So the DHT schema, to answer a `git clone` of the ~400 MiB linux
kernel, needs to load 248 MiB of "index" data from the DHT, in
addition to the ~400 MiB of pack data.

In the DFS schema, we stream down 55 MiB of PackIndex, and then load
the 400 MiB of pack data. Its a full 193 MiB less data to transfer,
and we are doing it in a format that TCP is more suited for

Wow, that was the first time I worked out any of those figures. I
should have done it sooner, because its pretty obvious now.


Back to the top