Re: [jgit-dev] RevSort#COMMIT_TIME

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] [List Home]

Re: [jgit-dev] RevSort#COMMIT_TIME_DESC confusion

From: Marc Strapetz <marc.strapetz@xxxxxxxxxxx>
Date: Sat, 13 Aug 2011 19:15:29 +0200
Delivered-to: jgit-dev@xxxxxxxxxxx
List-archive: <https://dev.eclipse.org/mailman/private/jgit-dev>
List-help: <mailto:jgit-dev-request@eclipse.org?subject=help>
List-subscribe: <https://dev.eclipse.org/mailman/listinfo/jgit-dev>, <mailto:jgit-dev-request@eclipse.org?subject=subscribe>
List-unsubscribe: <https://dev.eclipse.org/mailman/options/jgit-dev>, <mailto:jgit-dev-request@eclipse.org?subject=unsubscribe>
User-agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0) Gecko/20110624 Thunderbird/5.0

>> Should I keep them local to our project or submit as patch? In
>> latter case, any suggestions where to place that methods?
>
> Operations like `git branch --contains` need something like this. So
> we would appreciate a contribution to RevWalk or a new class in that
> same package that makes this operation easy in JGit.

Actually, answering isMergedIntoAny(BASE, List<> TIPS) and
isAnyMergedInto(List<> BASES, TIP) are simpler tasks and will usually
require less effort for positive answers. So it's desirable to have some
shortcuts implemented for the more general functions:

getTipsThatContainsBase(BASE, List<> TIPS), used for "git branch
--contains" and

getBasesWhichAreContainedInTip(List<> BASE, TIP), used for "git branch
--merged"

Both functions require in worst case one full RevWalk for an arbitrary
number of BASES and tips, hence I would not like to use
MergeBaseGenerator here (actually I couldn't see how that class could
help here, probably it would require some significant changes). The main
problem of both functions is when to stop the RevWalk? If I understood
correctly, the idea of MergeBaseGenerator is to stop once all pending
commits to process are ancestors of all of our start commits (to solve
the problem of violations of topological order). We could use the same
approach for some general purpose commit-traversal which takes a List of
TIP commits and a list of BASE commits and results in mapping of (TIP
commits -> reachable BASE commits) resp. a mapping of (BASE commits ->
contained TIP commits). With that information both above mentioned
functions can be implemented straight-forward and maybe the traversal
could also be a replacement for the more complex part of
MergeBaseGenerator which deals with finding the merge base commits.

Implementation of that commit-traversal would start with assigning
BitSets to every START commit. The set of START commits is the union of
TIP and BASE commits. If bit N is set, this means that the current
commit is reachable from the Nth START commit. The traversal is done in
the same way as e.g. PendingGenerator does. For every commit
encountered, its BitSet is propagated to its parent commits (i.e. BitSet
instances will remain the same). If a parent already contains a BitSet,
both BitSets will be joined (ORed) to a new BitSet. After propagating to
the parent commits, the BitSet of the current commit will be set to
null. So we actually have a kind of "BitSet-front" which is moving back
in history and getting filled more and more. We will finish the
traversal at those commits which have their BitSets completely filled
with 1s. For these commits we know that they are descendants of all TIP
and BASE commits.

This approach will only work if commits are processed in topological
order, i.e. all children of a commit have been processed before the
commit itself is processed. Otherwise the basic joining of BitSets will
not work and we will need some additional effort: when a child commit is
processed after its parent commit(s), we will propagate its BitSet
recursively through the set of already processed commits until we reach
the BitSet front and then perform the BitSet joining here.

Now assumption is that the vast majority of commits is received in
topological order and if topological order is violated, it's only a
minor violation (i.e. not many commits would have to be rearranged to
yield topological order again). This means, that the cases where a child
is processed after its parents are rare and in such a case the child is
not far away from the BitSet-front, so propagation is cheap. Hence,
additional effort introduced by violations of the topological order
should be low (though it looks like worst case running time in case of
extreme violation of the topological order would be O(N^2) with N number
of commits in the repository).

Regarding memory usage: the size of each BitSet is the number of START
commits, the number of alive BitSets is of O(N) with N being the number
of concurrent development lines in the repository. So that should be
acceptable.

Concluding, I think that for the common use cases (topological order,
TIP commits and BASE commits are near to each other), this approach
should perform well. Still some basic questions remain:

- is the approach reasonable or did I miss something and it won't work
  at all?

- is the assumption on rare violation of topological order acceptable?

- what would the implementation be based on? Would it be something
  low-level like a Generator? Or something more high-level, using a
  RevWalk?

--
Best regards,
Marc Strapetz
=============
syntevo GmbH
http://www.syntevo.com
http://blog.syntevo.com

On 12.08.2011 23:23, Shawn Pearce wrote:
> On Fri, Aug 12, 2011 at 12:21, Marc Strapetz <marc.strapetz@xxxxxxxxxxx> wrote:
>>> Don't do this. Instead use RevWalk.isMergedInto(TARGET, SRC).
>>
>> That's definitely a better solution, thanks for pointing out. I'll
>> probably need some more general methods:
>>
>> RevWalk.isAnyMergedTo(List<> BASES, TIP)
>>
>> and
>>
>> RevWalk.isMergedToAny(BASE, List<> TIPS)
>>
>> MergeBaseGenerator looks somewhat specific to me and JavaDoc states:
>> "The maximum number of starting commits is bounded by the number of free
>> flags available in the RevWalk when the generator is initialized.". So
>> I'd probably implement this by two utility methods, creating their own
>> RevWalk.
> 
> Discarding and creating a new RevWalk on each call is a bit ugly.
> Right now there are 26 flags free in a RevWalk when it is first
> initialized. So that is the maximum number of starting points you can
> run at once. In your generalized form of isMergedToAny or
> isAnyMergedTo you would have to run in batches if the List argument
> had more than 25 inputs.
> 
>> Should I keep them local to our project or submit as patch? In
>> latter case, any suggestions where to place that methods?
> 
> Operations like `git branch --contains` need something like this. So
> we would appreciate a contribution to RevWalk or a new class in that
> same package that makes this operation easy in JGit.
>

Follow-Ups:
- Re: [jgit-dev] RevSort#COMMIT_TIME_DESC confusion
  - From: Shawn Pearce

References:
- [jgit-dev] RevSort#COMMIT_TIME_DESC confusion
  - From: Marc Strapetz
- Re: [jgit-dev] RevSort#COMMIT_TIME_DESC confusion
  - From: Alex Blewitt
- Re: [jgit-dev] RevSort#COMMIT_TIME_DESC confusion
  - From: Marc Strapetz
- Re: [jgit-dev] RevSort#COMMIT_TIME_DESC confusion
  - From: Shawn Pearce
- Re: [jgit-dev] RevSort#COMMIT_TIME_DESC confusion
  - From: Marc Strapetz
- Re: [jgit-dev] RevSort#COMMIT_TIME_DESC confusion
  - From: Shawn Pearce

Prev by Date: Re: [jgit-dev] RevSort#COMMIT_TIME_DESC confusion
Next by Date: Re: [jgit-dev] RevSort#COMMIT_TIME_DESC confusion
Previous by thread: Re: [jgit-dev] RevSort#COMMIT_TIME_DESC confusion
Next by thread: Re: [jgit-dev] RevSort#COMMIT_TIME_DESC confusion
Index(es):
- Date
- Thread

Breadcrumbs