Re: [jgit-dev] Looking for an example of a custom RevWalk Sort

Hello,

 

I found a way to do this without changing the JGit internals (I tried that first but couldn’t make it work).

 

I used a bitmap-aware Comparator<RevCommit>, built on the JavaEWAH library (a transitive dependency of JGit), which creates a bitmap for each commit in the list to be sorted so that a commit’s natural and fusion parents sort before it.

 

I know that the JGit GC creates bitmaps for a subset of all commits when the packfile is written, but because of the size of my repository I need to use C git for GCs, and I was unsure whether it writes a compatible index that I would be able to use.

 

So I started from scratch and it worked out.

 

Steps:

1.      Create a Java-based tree mirroring the structure of the git commit graph.

2.      Create a bitmap for each commit representing its directly connected natural and fusion parents (i.e. not looking into the parents’ parents).

a.      The index of the commit in the original list provides the index of the bitmap bit for that commit id.

b.      The bitmap is just for the direct parents

3.      In a comparator, accumulate the full parentage bitmap by looping over all of the parent nodes (from step 1) in turn.

a.      This is an OR’ing of the accumulated bitmap against each node’s bitmap. 

b.      At the end of the accumulation it is known whether the right commit is among the parents of the left commit.

c.      Treat the commits as equal if they are in totally different and unrelated branches (neither is a parent of the other).
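The steps above can be sketched roughly as follows. This is a minimal illustration, not the code from the linked review: it swaps JavaEWAH for java.util.BitSet so that it runs on the JDK alone, identifies commits by plain string ids instead of RevCommit, and the names AncestryIndex, ancestors and parentsFirst are hypothetical. The parentsOf map is assumed to already contain both natural and fusion parents for each commit.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Comparator;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: ancestry bitmaps over a fixed list of commit ids. */
final class AncestryIndex {
    private final Map<String, Integer> position = new HashMap<>();
    private final List<BitSet> directParents = new ArrayList<>();
    private final Map<Integer, BitSet> ancestorCache = new HashMap<>();

    AncestryIndex(List<String> ids, Map<String, List<String>> parentsOf) {
        // Step 2a: a commit's index in the original list is its bit index.
        for (int i = 0; i < ids.size(); i++) {
            position.put(ids.get(i), i);
        }
        // Step 2: one bitmap per commit, covering direct parents only.
        for (String id : ids) {
            BitSet bits = new BitSet(ids.size());
            for (String parent : parentsOf.getOrDefault(id, List.of())) {
                Integer bit = position.get(parent);
                if (bit != null) { // parents outside the list are ignored
                    bits.set(bit);
                }
            }
            directParents.add(bits);
        }
    }

    /** Step 3: OR direct-parent bitmaps together until no new ancestor appears. */
    BitSet ancestors(String id) {
        int start = position.get(id); // throws NullPointerException for unknown ids
        BitSet cached = ancestorCache.get(start);
        if (cached != null) {
            return cached;
        }
        BitSet accumulated = new BitSet();
        Deque<Integer> pending = new ArrayDeque<>();
        pending.push(start);
        while (!pending.isEmpty()) {
            BitSet direct = directParents.get(pending.pop());
            BitSet unseen = (BitSet) direct.clone();
            unseen.andNot(accumulated); // parents not visited yet
            accumulated.or(direct);     // step 3a: OR into the accumulated bitmap
            for (int bit = unseen.nextSetBit(0); bit >= 0; bit = unseen.nextSetBit(bit + 1)) {
                pending.push(bit);
            }
        }
        ancestorCache.put(start, accumulated);
        return accumulated;
    }

    /** Orders parents before children; unrelated commits compare as equal. */
    Comparator<String> parentsFirst() {
        return (left, right) -> {
            if (ancestors(left).get(position.get(right))) {
                return 1;  // step 3b: right is an ancestor of left
            }
            if (ancestors(right).get(position.get(left))) {
                return -1; // left is an ancestor of right
            }
            return 0;      // step 3c: unrelated branches
        };
    }
}
```

Because unrelated commits compare as equal, this is a partial order rather than a total one, so a general-purpose sort is not guaranteed to place every parent before its child on arbitrary input. The sketch only shows the pairwise ancestry test.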

 

https://fisheye.kuali.org/cru/ks-2027#CFR-56286

 

Look at patch b: Bitmap.java, RevCommitBitmapIndex.java, FusionAwareTopoSortComparator.java, RewriteFusionPluginData.java and AbstractRespositoryCleaner.java (the last one is the driver of the rewriting process).

 

Regards,

 

Michael

 

From: jgit-dev-bounces@xxxxxxxxxxx [mailto:jgit-dev-bounces@xxxxxxxxxxx] On Behalf Of Michael O'Cleirigh
Sent: Wednesday, August 13, 2014 11:04 AM
To: Roberto Tyley
Cc: jgit-dev@xxxxxxxxxxx
Subject: Re: [jgit-dev] Looking for an example of a custom RevWalk Sort

 

Hi Roberto,

 

Thanks for your very useful feedback.

 

From: Roberto Tyley [mailto:roberto.tyley@xxxxxxxxx]
Sent: Tuesday, August 12, 2014 4:53 PM
To: Michael O'Cleirigh
Cc: jgit-dev@xxxxxxxxxxx
Subject: Re: [jgit-dev] Looking for an example of a custom RevWalk Sort

 

On 12 August 2014 21:09, Michael O'Cleirigh <michael.ocleirigh@xxxxxxxxxxx> wrote:

Hello,

 

I’m working on some history-rewriting programs for a Subversion-to-Git conversion (>70,000 commits). When we do the initial conversion, we translate the svn:externals properties on a branch into a fusion.dat file in the root directory of that branch.

 

This is sort of like a submodule, but within the same repository. It maps a subdirectory name to a commit id. We have a custom Maven plugin that can do a special subtree-like merge to materialize the subdirectories from the given commit object.

 

If I understand you correctly this submodule stuff is not a standard part of SVN, but a custom approach used by your project?

 

So the contents of this fusion.dat file (in the root of the commit's file tree) are something like this? :

 

src/main/java/libA -> d065748c3fcbc3d2449012ac75b02fba962fe735

src/main/java/libB -> ac2971e610a8f784ba74644aaff46276a24d6bc3

 

...and then the file tree of commit d065748c3fcbc3d2449012ac75b02fba962fe735 needs to be stuck into the file tree of the current commit at the location of src/main/java/libA - and this is done by your maven plugin?

 

In Subversion, svn:externals is a property on a path, say ^/aggregate/trunk, in a format like:

 

^/module1/trunk module1

^/module2/trunk module2

 

When you check out aggregate/trunk in Subversion, you get all the files actually in aggregate/trunk, plus a module1 directory containing the module1 trunk and a module2 directory containing the module2 trunk, like:

 

$ svn co http://svn.repo.org/repo/aggregate/trunk

 

$ ls

some-file-from-aggregate-trunk-like-a-pom.xml

module1

module2

 

The fusion.dat file is my customization that contains:

Directory name::branch name::commitID

module1::module1_trunk::d065748c3fcbc3d2449012ac75b02fba962fe735

module2::module2_trunk::ac2971e610a8f784ba74644aaff46276a24d6bc3
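For illustration, lines in this format could be parsed as below. This is a sketch under assumptions: the FusionData and Entry names are hypothetical, the file is assumed to hold one `directory::branch::commitId` entry per line, and the "Directory name::branch name::commitID" line above is taken to describe the format rather than to appear in the file.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical parser for the fusion.dat line format described above. */
final class FusionData {
    /** One fusion.dat entry: subdirectory name, source branch, commit id. */
    static final class Entry {
        final String directory;
        final String branch;
        final String commitId;

        Entry(String directory, String branch, String commitId) {
            this.directory = directory;
            this.branch = branch;
            this.commitId = commitId;
        }
    }

    /** Splits the file content into entries, skipping blank lines. */
    static List<Entry> parse(String content) {
        List<Entry> entries = new ArrayList<>();
        for (String line : content.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            String[] parts = trimmed.split("::");
            if (parts.length != 3) {
                throw new IllegalArgumentException("Malformed fusion.dat line: " + line);
            }
            entries.add(new Entry(parts[0], parts[1], parts[2]));
        }
        return entries;
    }
}
```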

 

So in git when you check out the aggregate_trunk branch it only contains:

some-file-from-aggregate-trunk-like-a-pom.xml

fusion.dat

 

My custom Maven plugin uses JGit to create what is effectively a subtree merge commit (for all file entries at the same time), so that the tree looks the same as it would have in Subversion with the externals included.

 

Like this (copied out from a plugin unit test):

[Inline image image001.png is not preserved in this archive.]

 

Our aggregate/trunk/pom.xml defines a modules section which expects those directories to be present. While this is not a problem at the tip of the development branch, the fusion approach lets old build tags and releases be reconstituted easily to a buildable state.

 

So this is very similar to your object-id-in-comment problem. The commit ids are stored when the Subversion-to-Git conversion takes place. My importer notices the svn:externals properties and, if they are present, creates a fusion.dat file that maps each named external to its branch head.

 

Then, when the history is rewritten, the commits that the fusion.dat files refer to are no longer valid.

I have several history-rewriting programs that use a reverse and topo sort, and they end up rewriting the original commit ids that were initially stored in these fusion.dat files.

 

a) So you're not planning to get rid of the fusion.dat files? You plan to continue using your maven plugin to materialize the subdirectories?

 

Yes, keep the fusion.dat files but rewrite their contents so that they refer to the latest commit id.

 

module1::module1_trunk::d065748c3fcbc3d2449012ac75b02fba962fe735

 

rewrite 1:

d065748c3fcbc3d2449012ac75b02fba962fe735 changed to 5882135d679f0e2a39822e5c4da419bb504d21a0

 

rewrite 2:

5882135d679f0e2a39822e5c4da419bb504d21a0 changed to 1a4565c5560e7f888ab743a8745896935779137b

 

b) What are the rewrites you're doing (if they're not removing/replacing the dat files)?

 

So the rewrite is changing the contents of the fusion.dat file for the module1 line to point at 1a4565c5560e7f888ab743a8745896935779137b*

 

*(Technically this can still shift: if 1a4565c5560e7f888ab743a8745896935779137b is itself rewritten as part of the final rewrite, we need to use the final object id.)
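Following a commit id through successive rewrites to its final value can be sketched like this. It is only an illustration: the RewriteChain class and its method names are hypothetical, and it assumes each rewrite pass records its old-id-to-new-id pairs in one shared map.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical record of old-to-new commit ids across several rewrite passes. */
final class RewriteChain {
    private final Map<String, String> oldToNew = new HashMap<>();

    /** Records that a rewrite pass turned oldId into newId. */
    void record(String oldId, String newId) {
        oldToNew.put(oldId, newId);
    }

    /** Follows the chain of rewrites until an id that was never rewritten. */
    String resolve(String id) {
        Set<String> seen = new HashSet<>();
        String current = id;
        while (oldToNew.containsKey(current)) {
            if (!seen.add(current)) {
                throw new IllegalStateException("Cycle in rewrite records at " + current);
            }
            current = oldToNew.get(current);
        }
        return current;
    }
}
```

With the two rewrites listed above recorded, resolving d065748c3fcbc3d2449012ac75b02fba962fe735 yields 1a4565c5560e7f888ab743a8745896935779137b, and an id that was never rewritten resolves to itself.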

 

I have a tool that should be able to rewrite these fusion.dat files using the old-commit-to-new-commit records from the previous history rewriting; however, the sort order of the commits is not exactly what I need.

 

If I look at a commit with this fusion.dat file, I know the ids of the commits that have to be processed before this one. You can imagine that the aggregate branch contains the fusion.dat file but it is referencing lateral branches. In Subversion these might all have been in a single commit, but in Git there is the top-level branch and each individual module branch.

 

The BFG Repo Cleaner has a similar-ish problem when it tries to rewrite commit-ids that are embedded in commit messages. It's not guaranteed that it has already encountered that commit id (even though it uses a reverse and topo sort), as it could be from a lateral branch. So it just attempts, in a very simple way, to recursively clean that commit-id and the history behind it. This works pretty well most of the time because it memoizes all cleaning operations on git-id, but it occasionally risks blowing up with a StackOverflowError if the commit-id's unseen history is too deep.
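A memoized, dependencies-first clean can also be written with an explicit stack, which avoids the deep recursion mentioned above. The sketch below is not the BFG's code: MemoizedCleaner and its parameters are hypothetical, ids are plain strings, and the dependency graph (parents plus any laterally referenced commits) is assumed to be acyclic.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical cleaner: cleans each id once, after everything it depends on. */
final class MemoizedCleaner {
    private final Map<String, List<String>> dependencies; // id -> ids it refers to
    private final Function<String, String> cleanOne;      // produces the cleaned id
    private final Map<String, String> cleaned = new HashMap<>();

    MemoizedCleaner(Map<String, List<String>> dependencies, Function<String, String> cleanOne) {
        this.dependencies = dependencies;
        this.cleanOne = cleanOne;
    }

    /** Returns the cleaned id, cleaning unseen dependencies first without recursion. */
    String clean(String id) {
        Deque<String> stack = new ArrayDeque<>();
        stack.push(id);
        while (!stack.isEmpty()) {
            String current = stack.peek();
            if (cleaned.containsKey(current)) {
                stack.pop(); // memoized: already cleaned via another path
                continue;
            }
            boolean ready = true;
            for (String dependency : dependencies.getOrDefault(current, List.of())) {
                if (!cleaned.containsKey(dependency)) {
                    stack.push(dependency); // clean this dependency before current
                    ready = false;
                }
            }
            if (ready) {
                cleaned.put(current, cleanOne.apply(current));
                stack.pop();
            }
        }
        return cleaned.get(id);
    }
}
```

The pending work lives on the heap rather than the call stack, so the depth of unseen history is limited by available memory instead of the thread's stack size.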

 

Incidentally, depending on what clean-up operations you're doing, you might be able to make use of the BFG as your cleaner, adding your fusion.dat file-updater as a bfg.Cleaner[Seq[Tree.Entry]].

 

Yes, I think this sounds like the same problem. I’ll try your recursive approach first to see if I have enough stack headroom for it to work.

I’d like to add in my own sorter so that the RevWalk will consider this additional interdependence when ordering the results.

 

I can sort of see how I might subclass RevWalk and TopoSortGenerator to include my additional sort constraint data.

 

But those classes are all package-scoped, so they do not seem designed to be extended directly by JGit users.

 

Just my 2 cents, if you want to do this custom sorting, and given this sounds like a one-off, you're probably best off just doing a small JGit fork and doing a TopoSortGenerator that works exactly how you want. Changing org.eclipse.jgit.revwalk.RevSort and all that lot for a general solution might well require a breaking-API change.

 

Apologies if my questions appear dim, I have a bit of a cold and may be misunderstanding you (also there are brighter people on this mailing list than me, I just had my interest piqued because of my work on the BFG repo-cleaner).

 

If the recursive sort doesn’t work I’ll probably end up with a fork as you suggest.

 

Thanks again for your response; it’s very useful.

 

Regards,

 

Michael

 
