[jgit-dev] Re: large object support patches
On Fri, Jul 2, 2010 at 2:34 AM, Shawn Pearce <spearce@xxxxxxxxxxx> wrote:
>
> The last one, delta packed objects, is a nightmare of a patch. Doing
> streaming deltas efficiently is virtually impossible.

Some more thoughts on the streaming deltas:

One of the nastier parts here is that we walk backwards through the entire delta chain O(delta_depth) times just to open the damn stream up for reading... so it's O(delta_depth^2). We walk the entire chain once to identify the type of the object, as that is part of the ObjectLoader interface. We walk it all again as we open up each base object's ObjectLoader, for the same damn reason.

I think we can fix that by making getType() on the large delta loader lazy, and only computing it when asked. Unfortunately an earlier patch in my series checks the getType() return value when opening an object and a type hint was supplied. We usually have a type hint supplied now. :-) Making the type lazy would still reduce that cost from O(delta_depth^2) to O(delta_depth), as we would not look up the types for the base objects.

The delta instruction stream is inflated twice: partially once when we open the loader to get the result object size, and again when we actually start to execute the stream and build the object. This is to support the loader's getSize() method, but also because we want to know the size of the base so we can check it's still valid at the start of our stream. Again, we can probably make this lazy, and try to check the size of the base when we open the base's stream, reducing the number of times we have to inflate the instruction stream.

Deltas can copy from random positions in the base file. This is to better support block moves. Consider the file "ABCD" as a base and the new version "ACQBD". Right now we read the base twice: once to copy A, skip over B, and copy C, then close and open it again to skip over A, copy B, then skip over C, and finally copy D.
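To make the reverse-seek cost concrete, here is a minimal, self-contained Java sketch. The Insn class, apply() method, and reopen counter are hypothetical illustrations, not JGit's real ObjectLoader or delta code. It models a delta as copy/insert instructions applied against a forward-only base, and must reopen the base from the start whenever a copy targets an offset behind the current read position:

```java
import java.io.ByteArrayInputStream;
import java.util.List;

public class DeltaSketch {
    // A delta instruction: either copy [offset, offset+len) out of the base,
    // or insert literal data carried in the delta itself.
    static final class Insn {
        final boolean copy;
        final int offset, len;  // valid when copy == true
        final byte[] data;      // valid when copy == false
        Insn(int offset, int len) { this.copy = true; this.offset = offset; this.len = len; this.data = null; }
        Insn(byte[] data) { this.copy = false; this.offset = 0; this.len = data.length; this.data = data; }
    }

    // Apply instructions against a base we can only read forward. Each time a
    // copy targets an offset behind the current position (a "reverse seek"),
    // we throw the stream away and reopen it from the start -- for a real
    // inflater-backed stream that means re-inflating everything we skip.
    static String apply(byte[] base, List<Insn> insns, int[] reopenCount) {
        StringBuilder out = new StringBuilder();
        ByteArrayInputStream in = new ByteArrayInputStream(base);
        int pos = 0;
        for (Insn i : insns) {
            if (!i.copy) {
                out.append(new String(i.data));
                continue;
            }
            if (i.offset < pos) {           // reverse seek: reopen from 0
                in = new ByteArrayInputStream(base);
                pos = 0;
                reopenCount[0]++;
            }
            in.skip(i.offset - pos);        // fast-forward, cheap compared to reopening
            byte[] buf = new byte[i.len];
            in.read(buf, 0, buf.length);
            pos = i.offset + i.len;
            out.append(new String(buf));
        }
        return out.toString();
    }
}
```

With the "ABCD" -> "ACQBD" example above, the copy of B after the copy of C forces exactly one reopen of the base stream.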
Skipping over a section of another delta is reasonably efficient; we just fast-forward through the instructions without performing them. But that final base object is an inflater stream, which isn't seekable, and closing, reopening, and inflating into a skip buffer really hurts performance.

We could be smarter here and perform some look-ahead on our instruction set. If we see a copy instruction that would cause a "reverse seek" like the one above, then we could try to save a copy of that region under a SoftReference as we scan over it. When we need that region, if the SoftReference is still valid, we can avoid the reverse seek and just use it from our local cache.

The issue I'm worried about is what happens when the chain is the following sequence:

  base: ABCD
  d1:   ACBD
  d2:   ADBC
  d3:   ADQC

If each level has this reverse seek identified in its look-ahead, then that level will try to cache that section from its base, leading to block "B" being cached several times, even though it's not even part of the final output. If the blocks are really big, we run the risk that our SoftReferences will be cleared by the GC, or that we can't even complete the allocations required for the buffering. Spooling to a temporary file might help, but is probably only cost effective if we know the block must appear in the final output.

FWIW, Hg does delta application differently than Git. In Hg, delta instructions are inflated and merged together until a non-delta base is reached... then there is a single clear instruction stream to apply to one whole object, and you know for certain what will be reused and what won't be. Right now we don't do this for the large deltas because the instruction stream itself can be arbitrarily large. But we might be able to make some rule that delta streams smaller than some reasonable threshold (e.g. 64 KiB) get expanded, flattened, and then executed against the base, while larger streams use this slower multiple-scan approach.

Food for thought.

-- 
Shawn.
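The Hg-style flattening can be sketched as delta composition: map each copy range of the outer delta through the inner delta's instructions until everything is expressed directly against the real base. The following is a hypothetical plain-Java illustration of that idea (not JGit or Hg code); it assumes both deltas fit in memory, which is exactly why the mail suggests only doing this below some size threshold:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FlattenSketch {
    // Same hypothetical instruction model as before: copy from the base, or
    // insert literal data carried in the delta.
    static final class Insn {
        final boolean copy;
        final int offset, len;  // valid when copy == true
        final byte[] data;      // valid when copy == false
        Insn(int offset, int len) { this.copy = true; this.offset = offset; this.len = len; this.data = null; }
        Insn(byte[] data) { this.copy = false; this.offset = 0; this.len = data.length; this.data = data; }
    }

    // 'outer' is expressed against the result of 'inner'. Return an equivalent
    // instruction list expressed directly against inner's own base, so the
    // final result can be built in one forward pass with no reverse seeks.
    static List<Insn> flatten(List<Insn> inner, List<Insn> outer) {
        List<Insn> result = new ArrayList<>();
        for (Insn o : outer) {
            if (!o.copy) {                  // literal data passes straight through
                result.add(o);
                continue;
            }
            // Resolve the range [o.offset, o.offset+o.len) of inner's output
            // by walking inner's instructions and clipping each overlap.
            int pos = 0;                    // start offset of current inner insn
            for (Insn i : inner) {
                int lo = Math.max(o.offset, pos);
                int hi = Math.min(o.offset + o.len, pos + i.len);
                if (hi > lo) {
                    if (i.copy)
                        result.add(new Insn(i.offset + (lo - pos), hi - lo));
                    else
                        result.add(new Insn(Arrays.copyOfRange(i.data, lo - pos, hi - pos)));
                }
                pos += i.len;
            }
        }
        return result;
    }

    // Trivial materializer: flattened instructions only reference the base,
    // which we hold as a byte array, so application is a single pass.
    static String apply(byte[] base, List<Insn> insns) {
        StringBuilder sb = new StringBuilder();
        for (Insn i : insns)
            sb.append(i.copy ? new String(base, i.offset, i.len) : new String(i.data));
        return sb.toString();
    }
}
```

Repeating flatten() down the chain (d3 over d2 over d1) yields one instruction stream against the non-delta base, which is the single clear stream the Hg approach produces.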