
Re: [jgit-dev] Persistent caching to speed up fetch of remote dumb transports (e.g., Amazon S3) ?

Hi Matthias,
Thanks for the comments and your prompt help with https://git.eclipse.org/r/#/c/156984/.
After I dug into the AmazonS3 code a bit, I realized that, rather than devise a pack-caching scheme, it was simpler to sort the pack files so that recent ones are tried first when looking for objects, which led to change 156894.
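For the record, the gist of the sorting is roughly this (an illustrative sketch only, not the code from the change; RemotePack and its fields are made-up names):

// Order remote pack descriptors so the most recently modified packs are
// searched first: a just-pushed object is most likely in a recent pack.
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class RemotePack {
    final String name;        // e.g. "pack-<hash>.pack"
    final long lastModified;  // epoch millis from the remote listing

    RemotePack(String name, long lastModified) {
        this.name = name;
        this.lastModified = lastModified;
    }
}

class PackOrdering {
    static List<RemotePack> newestFirst(List<RemotePack> packs) {
        List<RemotePack> sorted = new ArrayList<>(packs);
        sorted.sort(Comparator.comparingLong((RemotePack p) -> p.lastModified).reversed());
        return sorted;
    }
}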
An API tailored to high-latency ops does seem like a better long-term approach, as you mention.
Cheers,
Josh



On Sun, Feb 2, 2020 at 3:27 PM Matthias Sohn <matthias.sohn@xxxxxxxxx> wrote:
On Wed, Jan 29, 2020 at 10:15 PM Joshua Redstone <redstone@xxxxxxxxx> wrote:
Hi,
I store a git repository on Amazon S3 and notice that "jgit fetch" can be very slow, fetching lots of pack-*.idx files even when the remote is ahead of local by only a single commit. It looks like WalkFetchConnection::downloadObject essentially iterates by brute force through all remote pack-*.idx files looking for an object. Since it's difficult to GC remote dumb repositories (I think best practice for Amazon S3 is to run git gc via s3fs-fuse), pack files accumulate over time and "fetch" becomes slow.
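To illustrate the pattern (a simplified sketch, not jgit's actual WalkFetchConnection code; PackIndex here is a hypothetical stand-in):

// Every missing object triggers a linear walk over all remote pack
// indexes until one contains it; each probe may first have to download
// that pack's pack-*.idx file.
import java.util.List;

class PackScan {
    interface PackIndex {
        boolean hasObject(String objectId);
    }

    /** Returns the first pack whose index contains the object, or null. */
    static PackIndex findPackFor(String objectId, List<PackIndex> remotePacks) {
        for (PackIndex idx : remotePacks) {
            if (idx.hasObject(objectId)) {
                return idx;
            }
        }
        return null; // fall back to fetching the loose object
    }
}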

So what if a local repository kept a persistent cache of remote pack-*.idx files? WalkFetchConnection could try that cache before the big iteration through all remote pack files. Further, maybe before consulting the cache, WalkFetchConnection could check the local .git/objects/pack directory for index files as well.
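Something along these lines (a hypothetical sketch; the class and the cache directory name are made up for illustration):

// Cache downloaded remote pack-*.idx files under the local repository,
// keyed by pack name, so a later fetch can consult them without
// re-downloading.
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class RemoteIdxCache {
    private final Path cacheDir; // e.g. .git/objects/remote-idx-cache

    RemoteIdxCache(Path cacheDir) {
        this.cacheDir = cacheDir;
    }

    /** Returns the cached index for this pack, or null if not cached yet. */
    Path lookup(String packName) {
        Path cached = cacheDir.resolve(packName + ".idx");
        return Files.exists(cached) ? cached : null;
    }

    /** Stores a freshly downloaded pack-*.idx for reuse by later fetches. */
    Path store(String packName, InputStream idxData) throws IOException {
        Files.createDirectories(cacheDir);
        Path cached = cacheDir.resolve(packName + ".idx");
        Files.copy(idxData, cached, StandardCopyOption.REPLACE_EXISTING);
        return cached;
    }
}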

Pack indexes are tied to their respective pack files; if a repack happened, the corresponding cached index file would become stale.
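That said, pack names embed a content hash, so a repack produces new names; a cache could at least detect staleness by comparing its entries against the current remote pack listing. An illustrative sketch (names are hypothetical):

// Evict any cached pack-*.idx whose pack no longer appears in the
// remote listing, i.e. it was repacked away.
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;

class CacheEviction {
    static void evictStale(Path cacheDir, Set<String> currentRemotePacks) throws IOException {
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(cacheDir, "pack-*.idx")) {
            for (Path cached : entries) {
                String packName = cached.getFileName().toString().replaceAll("\\.idx$", "");
                if (!currentRemotePacks.contains(packName)) {
                    Files.delete(cached); // stale: the pack was repacked away
                }
            }
        }
    }
}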
 
It'd also be nice if jgit supported remote gc of dumb repositories, but that's perhaps a separate optimization.

well, the dumb protocol is - dumb
 
Thoughts? Am I understanding things correctly, and does this seem like a workable idea?
 
The AmazonS3 implementation is very old and predates the introduction of the DfsRepository API, which is meant for storing git objects in a distributed file system with much higher latency than a local filesystem. Maybe a Dfs-based S3 implementation would be the better approach.
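For reference, the DFS layer already ships with an in-memory reference implementation that shows the shape of the API; an S3 backend would implement the same object/ref database pair against S3. A minimal usage sketch (note InMemoryRepository lives in jgit's internal package, so this is illustrative only):

import org.eclipse.jgit.internal.storage.dfs.DfsRepositoryDescription;
import org.eclipse.jgit.internal.storage.dfs.InMemoryRepository;
import org.eclipse.jgit.lib.Repository;

class DfsExample {
    static Repository open() {
        // Packs and refs live entirely behind the DfsObjDatabase /
        // DfsRefDatabase abstractions; no local filesystem is touched.
        return new InMemoryRepository(new DfsRepositoryDescription("s3-backed-repo"));
    }
}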


-Matthias
