
Re: [jgit-dev] Persistent caching to speed up fetch of remote dumb transports (e.g., Amazon S3) ?

Hi Matthias,
Thanks for the comments and your prompt help with https://git.eclipse.org/r/#/c/156984/.
After I dug into the AmazonS3 code a bit, I realized that, rather than devise a pack-caching scheme, it was simpler to sort the pack files so that recent ones are tried first when looking for objects, which led to change 156894.
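For the record, the gist of the sorting is roughly this (an illustrative sketch only, not the code from the change; RemotePack and its fields are made-up names):

// Order remote pack descriptors so the most recently modified packs are
// searched first: a just-pushed object is most likely in a recent pack.
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class RemotePack {
    final String name;        // e.g. "pack-<hash>.pack"
    final long lastModified;  // epoch millis from the remote listing

    RemotePack(String name, long lastModified) {
        this.name = name;
        this.lastModified = lastModified;
    }
}

class PackOrdering {
    static List<RemotePack> newestFirst(List<RemotePack> packs) {
        List<RemotePack> sorted = new ArrayList<>(packs);
        sorted.sort(Comparator.comparingLong((RemotePack p) -> p.lastModified).reversed());
        return sorted;
    }
}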
An API tailored to high-latency ops does seem like a better long-term approach, as you mention.
Cheers,
Josh



On Sun, Feb 2, 2020 at 3:27 PM Matthias Sohn <matthias.sohn@xxxxxxxxx> wrote:
On Wed, Jan 29, 2020 at 10:15 PM Joshua Redstone <redstone@xxxxxxxxx> wrote:
Hi,
I store a git repository on Amazon S3 and notice that "jgit fetch" can be very slow, fetching lots of pack-*.idx files even when the remote is ahead of local by only a single commit. It looks like WalkFetchConnection::downloadObject essentially iterates by brute force through all remote pack-*.idx files looking for an object. Since it's difficult to GC remote dumb repositories (I think best practice for Amazon S3 is to run git gc via s3fs-fuse), pack files accumulate over time and "fetch" becomes slow.
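To illustrate the pattern (a simplified sketch, not jgit's actual WalkFetchConnection code; PackIndex here is a hypothetical stand-in):

// Every missing object triggers a linear walk over all remote pack
// indexes until one contains it; each probe may first have to download
// that pack's pack-*.idx file.
import java.util.List;

class PackScan {
    interface PackIndex {
        boolean hasObject(String objectId);
    }

    /** Returns the first pack whose index contains the object, or null. */
    static PackIndex findPackFor(String objectId, List<PackIndex> remotePacks) {
        for (PackIndex idx : remotePacks) {
            if (idx.hasObject(objectId)) {
                return idx;
            }
        }
        return null; // fall back to fetching the loose object
    }
}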

So what if a local repository kept a persistent cache of remote pack-*.idx files? WalkFetchConnection could try that cache before the big iteration through all remote pack files. Further, maybe before consulting the cache, WalkFetchConnection could check the local .git/objects/pack directory for index files as well.
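Something along these lines (a hypothetical sketch; the class and the cache directory name are made up for illustration):

// Cache downloaded remote pack-*.idx files under the local repository,
// keyed by pack name, so a later fetch can consult them without
// re-downloading.
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class RemoteIdxCache {
    private final Path cacheDir; // e.g. .git/objects/remote-idx-cache

    RemoteIdxCache(Path cacheDir) {
        this.cacheDir = cacheDir;
    }

    /** Returns the cached index for this pack, or null if not cached yet. */
    Path lookup(String packName) {
        Path cached = cacheDir.resolve(packName + ".idx");
        return Files.exists(cached) ? cached : null;
    }

    /** Stores a freshly downloaded pack-*.idx for reuse by later fetches. */
    Path store(String packName, InputStream idxData) throws IOException {
        Files.createDirectories(cacheDir);
        Path cached = cacheDir.resolve(packName + ".idx");
        Files.copy(idxData, cached, StandardCopyOption.REPLACE_EXISTING);
        return cached;
    }
}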

Pack indexes are tied to their respective pack files; if a repack happened, the corresponding cached index file would become stale.
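That said, pack names embed a content hash, so a repack produces new names; a cache could at least detect staleness by comparing its entries against the current remote pack listing. An illustrative sketch (names are hypothetical):

// Evict any cached pack-*.idx whose pack no longer appears in the
// remote listing, i.e. it was repacked away.
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;

class CacheEviction {
    static void evictStale(Path cacheDir, Set<String> currentRemotePacks) throws IOException {
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(cacheDir, "pack-*.idx")) {
            for (Path cached : entries) {
                String packName = cached.getFileName().toString().replaceAll("\\.idx$", "");
                if (!currentRemotePacks.contains(packName)) {
                    Files.delete(cached); // stale: the pack was repacked away
                }
            }
        }
    }
}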
 
It'd also be nice if jgit supported remote gc of dumb repositories, but that's perhaps a separate optimization.

well, the dumb protocol is - dumb
 
Thoughts? Am I understanding things correctly, and does this seem like a workable idea?
 
The AmazonS3 implementation is very old and predates the introduction of the DfsRepository API, which is meant for storing git objects in a distributed file system with much higher latency than a local filesystem. Maybe a Dfs-based S3 implementation would be the better approach.
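For reference, the DFS layer already ships with an in-memory reference implementation that shows the shape of the API; an S3 backend would implement the same object/ref database pair against S3. A minimal usage sketch (note InMemoryRepository lives in jgit's internal package, so this is illustrative only):

import org.eclipse.jgit.internal.storage.dfs.DfsRepositoryDescription;
import org.eclipse.jgit.internal.storage.dfs.InMemoryRepository;
import org.eclipse.jgit.lib.Repository;

class DfsExample {
    static Repository open() {
        // Packs and refs live entirely behind the DfsObjDatabase /
        // DfsRefDatabase abstractions; no local filesystem is touched.
        return new InMemoryRepository(new DfsRepositoryDescription("s3-backed-repo"));
    }
}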


-Matthias
