[jgit-dev] Re: large object support patches
On Fri, Jul 2, 2010 at 2:34 AM, Shawn Pearce <spearce@xxxxxxxxxxx> wrote:
>
> The last one, delta packed objects, is a nightmare of a patch. Doing
> streaming deltas efficiently is virtually impossible.

Some more thoughts on the streaming deltas:

One of the nastier parts here is that we walk backwards through the entire delta chain O(delta_depth) times just to open the damn stream up for reading... so it's O(delta_depth^2). We walk the entire chain once to identify the type of the object, as that is part of the ObjectLoader interface. We walk it all again as we open up each base object's ObjectLoader, for the same damn reason.

I think we can fix that by making getType() on the large delta loader lazy, and only computing it when asked. Unfortunately an earlier patch in my series checks the getType() return value when opening an object and a type hint was supplied. We usually have a type hint supplied now. :-) Making the type lazy would still reduce that cost from O(delta_depth^2) to O(delta_depth), as we would not look up the types for the base objects.

The delta instruction stream is inflated twice: partially once when we open the loader to get the result object size, and again when we actually start to execute the stream and build the object. This is to support the loader's getSize() method, but also because we want to know the size of the base so we can check it's still valid at the start of our stream. Again, we can probably make this lazy, and try to check the size of the base when we open the base's stream, reducing the number of times we have to inflate the instruction stream.

Deltas can copy from random positions in the base file. This is to better support block moves. Consider the file "ABCD" as a base and the new version "ACQBD". Right now we read the base twice: once to copy A, skip over B, and copy C, then close and open it again to skip over A, copy B, then skip over C, and finally copy D.
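To make the reverse-seek cost concrete, here is a minimal, self-contained Java sketch. The Insn class, apply() method, and reopen counter are hypothetical illustrations, not JGit's real ObjectLoader or delta code. It models a delta as copy/insert instructions applied against a forward-only base, and must reopen the base from the start whenever a copy targets an offset behind the current read position:

```java
import java.io.ByteArrayInputStream;
import java.util.List;

public class DeltaSketch {
    // A delta instruction: either copy [offset, offset+len) out of the base,
    // or insert literal data carried in the delta itself.
    static final class Insn {
        final boolean copy;
        final int offset, len;  // valid when copy == true
        final byte[] data;      // valid when copy == false
        Insn(int offset, int len) { this.copy = true; this.offset = offset; this.len = len; this.data = null; }
        Insn(byte[] data) { this.copy = false; this.offset = 0; this.len = data.length; this.data = data; }
    }

    // Apply instructions against a base we can only read forward. Each time a
    // copy targets an offset behind the current position (a "reverse seek"),
    // we throw the stream away and reopen it from the start -- for a real
    // inflater-backed stream that means re-inflating everything we skip.
    static String apply(byte[] base, List<Insn> insns, int[] reopenCount) {
        StringBuilder out = new StringBuilder();
        ByteArrayInputStream in = new ByteArrayInputStream(base);
        int pos = 0;
        for (Insn i : insns) {
            if (!i.copy) {
                out.append(new String(i.data));
                continue;
            }
            if (i.offset < pos) {           // reverse seek: reopen from 0
                in = new ByteArrayInputStream(base);
                pos = 0;
                reopenCount[0]++;
            }
            in.skip(i.offset - pos);        // fast-forward, cheap compared to reopening
            byte[] buf = new byte[i.len];
            in.read(buf, 0, buf.length);
            pos = i.offset + i.len;
            out.append(new String(buf));
        }
        return out.toString();
    }
}
```

With the "ABCD" -> "ACQBD" example above, the copy of B after the copy of C forces exactly one reopen of the base stream.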
Skipping over a section of another delta is reasonably efficient; we just fast-forward through the instructions without performing them. But that final base object is an inflater stream, which isn't seekable, and closing, reopening, and inflating into a skip buffer really hurts performance.

We could be smarter here and perform some look-ahead on our instruction set. If we see a copy instruction that would cause a "reverse seek" like the one above, then we could try to save a copy of that region under a SoftReference as we scan over it. When we need that region, if the SoftReference is still valid, we can avoid the reverse seek and just use it from our local cache.

The issue I'm worried about is what happens when the chain is the following sequence:

  base: ABCD
  d1:   ACBD
  d2:   ADBC
  d3:   ADQC

If each level has this reverse seek identified in its look-ahead, then that level will try to cache that section from its base, leading to block "B" being cached several times, even though it's not even part of the final output. If the blocks are really big, we run the risk that our SoftReferences will be cleared by the GC, or that we can't even complete the allocations required for the buffering. Spooling to a temporary file might help, but is probably only cost effective if we know the block must appear in the final output.

FWIW, Hg does delta application differently than Git. In Hg, delta instructions are inflated and merged together until a non-delta base is reached... then there is a single clear instruction stream to apply to one whole object, and you know for certain what will be reused and what won't be. Right now we don't do this for the large deltas because the instruction stream itself can be arbitrarily large. But we might be able to make some rule that delta streams smaller than some reasonable threshold (e.g. 64 KiB) get expanded, flattened, and then executed against the base, while larger streams use this slower multiple-scan approach.

Food for thought.

-- 
Shawn.
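The Hg-style flattening can be sketched as delta composition: map each copy range of the outer delta through the inner delta's instructions until everything is expressed directly against the real base. The following is a hypothetical plain-Java illustration of that idea (not JGit or Hg code); it assumes both deltas fit in memory, which is exactly why the mail suggests only doing this below some size threshold:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FlattenSketch {
    // Same hypothetical instruction model as before: copy from the base, or
    // insert literal data carried in the delta.
    static final class Insn {
        final boolean copy;
        final int offset, len;  // valid when copy == true
        final byte[] data;      // valid when copy == false
        Insn(int offset, int len) { this.copy = true; this.offset = offset; this.len = len; this.data = null; }
        Insn(byte[] data) { this.copy = false; this.offset = 0; this.len = data.length; this.data = data; }
    }

    // 'outer' is expressed against the result of 'inner'. Return an equivalent
    // instruction list expressed directly against inner's own base, so the
    // final result can be built in one forward pass with no reverse seeks.
    static List<Insn> flatten(List<Insn> inner, List<Insn> outer) {
        List<Insn> result = new ArrayList<>();
        for (Insn o : outer) {
            if (!o.copy) {                  // literal data passes straight through
                result.add(o);
                continue;
            }
            // Resolve the range [o.offset, o.offset+o.len) of inner's output
            // by walking inner's instructions and clipping each overlap.
            int pos = 0;                    // start offset of current inner insn
            for (Insn i : inner) {
                int lo = Math.max(o.offset, pos);
                int hi = Math.min(o.offset + o.len, pos + i.len);
                if (hi > lo) {
                    if (i.copy)
                        result.add(new Insn(i.offset + (lo - pos), hi - lo));
                    else
                        result.add(new Insn(Arrays.copyOfRange(i.data, lo - pos, hi - pos)));
                }
                pos += i.len;
            }
        }
        return result;
    }

    // Trivial materializer: flattened instructions only reference the base,
    // which we hold as a byte array, so application is a single pass.
    static String apply(byte[] base, List<Insn> insns) {
        StringBuilder sb = new StringBuilder();
        for (Insn i : insns)
            sb.append(i.copy ? new String(base, i.offset, i.len) : new String(i.data));
        return sb.toString();
    }
}
```

Repeating flatten() down the chain (d3 over d2 over d1) yields one instruction stream against the non-delta base, which is the single clear stream the Hg approach produces.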