Re: [jgit-dev] Persistent caching to speed up fetch of remote dumb transports (e.g., Amazon S3) ?

On Wed, Jan 29, 2020 at 10:15 PM Joshua Redstone <redstone@xxxxxxxxx> wrote:
> Hi,
> I store a git repository on Amazon S3 and notice that "jgit fetch" can be very slow, fetching lots of pack-*.idx files even when the remote is ahead of local by only a single commit. It looks like WalkFetchConnection::downloadObject essentially iterates by brute force through all remote pack-*.idx files looking for an object. Since it's difficult to GC remote dumb repositories (I think the best practice for Amazon S3 is doing a git gc via s3fs-fuse), over time pack files accumulate and "fetch" becomes slow.

> So what if a local repository kept a persistent cache of remote pack-*.idx files? WalkFetchConnection could try that cache before the big iteration through all remote pack files. Further, maybe before consulting the cache, WalkFetchConnection could check the local .git/objects/pack directory for index files as well.
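
Concretely, and purely as a sketch (none of these names exist in JGit; the downloader interface stands in for whatever WalkFetchConnection uses today), such a cache would sit in front of the per-pack index downloads that the brute-force iteration currently performs:

// Hypothetical persistent cache of remote pack index files, e.g. kept under
// .git/objects/remote-idx-cache/<remote-name>/. All names are illustrative.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class RemoteIdxCache {
    private final Path cacheDir; // e.g. .git/objects/remote-idx-cache/origin

    RemoteIdxCache(Path cacheDir) {
        this.cacheDir = cacheDir;
    }

    /** Return the cached pack-<hash>.idx, downloading it only on a miss. */
    Path getIdx(String packName, IdxDownloader downloader) throws IOException {
        Path local = cacheDir.resolve(packName + ".idx");
        if (!Files.exists(local)) {
            Files.createDirectories(cacheDir);
            downloader.fetchTo(packName, local); // one GET against the dumb remote
        }
        return local;
    }

    /** Stand-in for the code that currently fetches an index per pack. */
    interface IdxDownloader {
        void fetchTo(String packName, Path destination) throws IOException;
    }
}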

Pack indexes are tied to their respective pack file; if a repack happened on the remote, the corresponding cached index file would become stale.
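
If such a cache were keyed by pack name (pack-<hash>), handling that staleness could come down to dropping entries for packs the remote no longer lists after a repack. A rough companion to the hypothetical sketch above:

// Hypothetical pruning pass for the RemoteIdxCache sketch above: delete
// cached .idx files whose pack no longer appears in the remote's pack list.
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;

final class RemoteIdxCachePruner {
    static void prune(Path cacheDir, Set<String> remotePackNames) throws IOException {
        try (DirectoryStream<Path> dir =
                Files.newDirectoryStream(cacheDir, "pack-*.idx")) {
            for (Path idx : dir) {
                String packName =
                        idx.getFileName().toString().replaceFirst("\\.idx$", "");
                if (!remotePackNames.contains(packName))
                    Files.delete(idx); // stale entry left behind by a remote repack
            }
        }
    }
}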
 
> It'd also be nice if jgit supported remote gc of dumb repositories, but that's maybe a separate optimization.

well, the dumb protocol is - dumb
 
> Thoughts? Am I understanding things correctly and does this seem like a workable idea?
 
The AmazonS3 implementation is very old and predates the introduction of the DfsRepository APIs, which are meant for storing git objects in a distributed file system with much higher latency than a local filesystem. Maybe a Dfs-based S3 implementation would be the better approach.
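
For anyone curious what that would involve: a Dfs backend is essentially a DfsObjDatabase subclass that maps pack files to objects in the store. Below is a very rough, non-functional skeleton, assuming the abstract methods match the DfsObjDatabase of a recent JGit (they may differ between versions); the S3 interaction itself is only hinted at in comments, not implemented.

// Skeleton only: what an S3-backed DfsObjDatabase could roughly look like.
// Method bodies are placeholders; no S3 client code is shown.
import java.io.IOException;
import java.util.Collection;
import java.util.List;

import org.eclipse.jgit.internal.storage.dfs.DfsObjDatabase;
import org.eclipse.jgit.internal.storage.dfs.DfsOutputStream;
import org.eclipse.jgit.internal.storage.dfs.DfsPackDescription;
import org.eclipse.jgit.internal.storage.dfs.DfsReaderOptions;
import org.eclipse.jgit.internal.storage.dfs.DfsRepository;
import org.eclipse.jgit.internal.storage.dfs.ReadableChannel;
import org.eclipse.jgit.internal.storage.pack.PackExt;

class S3ObjDatabase extends DfsObjDatabase {
    S3ObjDatabase(DfsRepository repo) {
        super(repo, new DfsReaderOptions());
    }

    @Override
    protected List<DfsPackDescription> listPacks() throws IOException {
        // One LIST of the bucket prefix; each pack becomes a DfsPackDescription.
        throw new UnsupportedOperationException("sketch only");
    }

    @Override
    protected DfsPackDescription newPack(PackSource source) throws IOException {
        // Pick a unique name for a pack about to be written.
        throw new UnsupportedOperationException("sketch only");
    }

    @Override
    protected void commitPackImpl(Collection<DfsPackDescription> desc,
            Collection<DfsPackDescription> replaces) throws IOException {
        // Publish the new pack list; packs in 'replaces' can then be deleted.
        throw new UnsupportedOperationException("sketch only");
    }

    @Override
    protected void rollbackPack(Collection<DfsPackDescription> desc) {
        // Delete keys written for an aborted pack.
    }

    @Override
    protected ReadableChannel openFile(DfsPackDescription desc, PackExt ext)
            throws IOException {
        // Ranged GETs against the key for this pack and extension (pack, idx, ...).
        throw new UnsupportedOperationException("sketch only");
    }

    @Override
    protected DfsOutputStream writeFile(DfsPackDescription desc, PackExt ext)
            throws IOException {
        // Multipart upload to the key for this pack and extension.
        throw new UnsupportedOperationException("sketch only");
    }
}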


-Matthias
