Re: [jgit-dev] RevSort#COMMIT_TIME

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] [List Home]

Re: [jgit-dev] RevSort#COMMIT_TIME_DESC confusion

From: Shawn Pearce <spearce@xxxxxxxxxxx>
Date: Sat, 13 Aug 2011 12:05:37 -0700
Delivered-to: jgit-dev@xxxxxxxxxxx
List-archive: <https://dev.eclipse.org/mailman/private/jgit-dev>
List-help: <mailto:jgit-dev-request@eclipse.org?subject=help>
List-subscribe: <https://dev.eclipse.org/mailman/listinfo/jgit-dev>, <mailto:jgit-dev-request@eclipse.org?subject=subscribe>
List-unsubscribe: <https://dev.eclipse.org/mailman/options/jgit-dev>, <mailto:jgit-dev-request@eclipse.org?subject=unsubscribe>

On Sat, Aug 13, 2011 at 10:15, Marc Strapetz <marc.strapetz@xxxxxxxxxxx> wrote:
>>> Should I keep them local to our project or submit as patch? In
>>> latter case, any suggestions where to place that methods?
>>
>> Operations like `git branch --contains` need something like this. So
>> we would appreciate a contribution to RevWalk or a new class in that
>> same package that makes this operation easy in JGit.
>
> Actually, answering isMergedIntoAny(BASE, List<> TIPS) and
> isAnyMergedInto(List<> BASES, TIP) are simpler tasks and will usually
> require less effort for positive answers. So it's desirable to have some
> shortcuts implemented for the more general functions:
>
> getTipsThatContainsBase(BASE, List<> TIPS), used for "git branch
> --contains" and
>
> getBasesWhichAreContainedInTip(List<> BASE, TIP), used for "git branch
> --merged"
>
> Both functions require in worst case one full RevWalk for an arbitrary
> number of BASES and tips, hence I would not like to use
> MergeBaseGenerator here (actually I couldn't see how that class could
> help here, probably it would require some significant changes). The main
> problem of both functions is when to stop the RevWalk? If I understood
> correctly, the idea of MergeBaseGenerator is to stop once all pending
> commits to process are ancestors of all of our start commits (to solve
> the problem of violations of topological order). We could use the same
> approach for some general purpose commit-traversal which takes a List of
> TIP commits and a list of BASE commits and results in mapping of (TIP
> commits -> reachable BASE commits) resp. a mapping of (BASE commits ->
> contained TIP commits). With that information both above mentioned
> functions can be implemented straight-forward and maybe the traversal
> could also be a replacement for the more complex part of
> MergeBaseGenerator which deals with finding the merge base commits.
>
> Implementation of that commit-traversal would start with assigning
> BitSets to every START commit. The set of START commits is the union of
> TIP and BASE commits. If bit N is set, this means that the current
> commit is reachable from the Nth START commit. The traversal is done in
> the same way as e.g. PendingGenerator does. For every commit
> encountered, its BitSet is propagated to its parent commits (i.e. BitSet
> instances will remain the same). If a parent already contains a BitSet,
> both BitSets will be joined (ORed) to a new BitSet. After propagating to
> the parent commits, the BitSet of the current commit will be set to
> null. So we actually have a kind of "BitSet-front" which is moving back
> in history and getting filled more and more. We will finish the
> traversal at those commits which have their BitSets completely filled
> with 1s. For these commits we know that they are descendants of all TIP
> and BASE commits.
>
> This approach will only work if commits are processed in topological
> order, i.e. all children of a commit have been processed before the
> commit itself is processed. Otherwise the basic joining of BitSets will
> not work and we will need some additional effort: when a child commit is
> processed after its parent commit(s), we will propagate its BitSet
> recursively through the set of already processed commits until we reach
> the BitSet front and then perform the BitSet joining here.
>
> Now assumption is that the vast majority of commits is received in
> topological order and if topological order is violated, it's only a
> minor violation (i.e. not many commits would have to be rearranged to
> yield topological order again). This means, that the cases where a child
> is processed after its parents are rare and in such a case the child is
> not far away from the BitSet-front, so propagation is cheap. Hence,
> additional effort introduced by violations of the topological order
> should be low (though it looks like worst case running time in case of
> extreme violation of the topological order would be O(N^2) with N number
> of commits in the repository).
>
> Regarding memory usage: the size of each BitSet is the number of START
> commits, the number of alive BitSets is of O(N) with N being the number
> of concurrent development lines in the repository. So that should be
> acceptable.
>
> Concluding, I think that for the common use cases (topological order,
> TIP commits and BASE commits are near to each other), this approach
> should perform well. Still some basic questions remain:
>
> - is the approach reasonable or did I miss something and it won't work
>  at all?
>
> - is the assumption on rare violation of topological order acceptable?
>
> - what would the implementation be based on? Would it be something
>  low-level like a Generator? Or something more high-level, using a
>  RevWalk?
>
> --
> Best regards,
> Marc Strapetz
> =============
> syntevo GmbH
> http://www.syntevo.com
> http://blog.syntevo.com
>
>
>
> On 12.08.2011 23:23, Shawn Pearce wrote:
>> On Fri, Aug 12, 2011 at 12:21, Marc Strapetz <marc.strapetz@xxxxxxxxxxx> wrote:
>>>> Don't do this. Instead use RevWalk.isMergedInto(TARGET, SRC).
>>>
>>> That's definitely a better solution, thanks for pointing out. I'll
>>> probably need some more general methods:
>>>
>>> RevWalk.isAnyMergedTo(List<> BASES, TIP)
>>>
>>> and
>>>
>>> RevWalk.isMergedToAny(BASE, List<> TIPS)
>>>
>>> MergeBaseGenerator looks somewhat specific to me and JavaDoc states:
>>> "The maximum number of starting commits is bounded by the number of free
>>> flags available in the RevWalk when the generator is initialized.". So
>>> I'd probably implement this by two utility methods, creating their own
>>> RevWalk.
>>
>> Discarding and creating a new RevWalk on each call is a bit ugly.
>> Right now there are 26 flags free in a RevWalk when it is first
>> initialized. So that is the maximum number of starting points you can
>> run at once. In your generalized form of isMergedToAny or
>> isAnyMergedTo you would have to run in batches if the List argument
>> had more than 25 inputs.
>>
>>> Should I keep them local to our project or submit as patch? In
>>> latter case, any suggestions where to place that methods?
>>
>> Operations like `git branch --contains` need something like this. So
>> we would appreciate a contribution to RevWalk or a new class in that
>> same package that makes this operation easy in JGit.

What you just described is how the flags field works inside of a
RevCommit. Except its limited in size to a 32 bit value, and 6 of the
bits are reserved for processing within RevWalk itself. We use the
flags field rather than a Map<RevCommit, BitSet> like I think you
propose to maintain reasonable performance. Directly accessing the
flags field is cheap, going through a Map lookup to find the BitSet
and then process that set is much more costly.

I wouldn't want to replace the current isMergedInto() implementation
with your proposed solution as I can be pretty certain its going to
always be faster, and isMergedInto is run often enough during
reference updates and the like that I don't want to slow them down in
order to use a more generic approach.

That said, there are some algorithms where a single pass through the
graph using the more generalized BitSet you describe is going to be
faster... such as when there are more than 20 or so branch tips of
interest that need to have their bits maintained through the history
graph. So its worth building the more generalized form. IMHO, this
should be inside a Generator that RevWalk can execute, you probably
need to maintain a fair amount of state about the pending branches in
the commit-date priority queue so you know when you can discard a
BitSet, or when you need to build a new one.

-- 
Shawn.

Follow-Ups:
- Re: [jgit-dev] RevSort#COMMIT_TIME_DESC confusion
  - From: Marc Strapetz

References:
- [jgit-dev] RevSort#COMMIT_TIME_DESC confusion
  - From: Marc Strapetz
- Re: [jgit-dev] RevSort#COMMIT_TIME_DESC confusion
  - From: Alex Blewitt
- Re: [jgit-dev] RevSort#COMMIT_TIME_DESC confusion
  - From: Marc Strapetz
- Re: [jgit-dev] RevSort#COMMIT_TIME_DESC confusion
  - From: Shawn Pearce
- Re: [jgit-dev] RevSort#COMMIT_TIME_DESC confusion
  - From: Marc Strapetz
- Re: [jgit-dev] RevSort#COMMIT_TIME_DESC confusion
  - From: Shawn Pearce
- Re: [jgit-dev] RevSort#COMMIT_TIME_DESC confusion
  - From: Marc Strapetz

Prev by Date: Re: [jgit-dev] RevSort#COMMIT_TIME_DESC confusion
Next by Date: [jgit-dev] space regression in PackWriter
Previous by thread: Re: [jgit-dev] RevSort#COMMIT_TIME_DESC confusion
Next by thread: Re: [jgit-dev] RevSort#COMMIT_TIME_DESC confusion
Index(es):
- Date
- Thread

Breadcrumbs