On 12 August 2014 21:09, Michael O'Cleirigh <michael.ocleirigh@xxxxxxxxxxx> wrote:
Hello,
I’m working on some history rewriting programs for a subversion to git conversion (>70,000 commits). When we do the initial subversion to git conversion we translate svn:externals
properties on the branch into a fusion.dat file in the root directory of the branch.
This is sort of like a submodule but within the same repository. It maps a subdirectory name to a commit id. We have a custom maven plugin that can do a special subtree-like merge
to materialize the subdirectories from the given commit object.
If I understand you correctly this submodule stuff is not a standard part of SVN, but a custom approach used by your project?
So the contents of this fusion.dat file (in the root of the commit's file tree) are something like this? :
src/main/java/libA -> d065748c3fcbc3d2449012ac75b02fba962fe735
src/main/java/libB -> ac2971e610a8f784ba74644aaff46276a24d6bc3
...and then the file tree of commit d065748c3fcbc3d2449012ac75b02fba962fe735 needs to be stuck into the file tree of the current commit at the location of src/main/java/libA - and this is done by your maven plugin?
In subversion svn:externals are a property on a path say ^/aggregate/trunk in a format like:
^/module1/trunk module1
^/module2/trunk module2
When you check out aggregate/trunk in subversion you get all the files actually in aggregate/trunk and a module1 directory containing the module1 trunk and
a module2 directory containing the module2 trunk like:
$ svn co http://svn.repo.org/repo/aggregate/trunk
$ ls
some-file-from-aggregate-trunk-like-a-pom.xml
module1
module2
The fusion.dat file is my customization that contains:
Directory name::branch name::commitID
module1::module1_trunk::d065748c3fcbc3d2449012ac75b02fba962fe735
module2::module2_trunk::ac2971e610a8f784ba74644aaff46276a24d6bc3
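As a rough illustration of the format above, here is a minimal Python sketch that parses fusion.dat content into entries. It assumes the "Directory name::branch name::commitID" line describes the format and is not itself stored in the file; the names `FusionEntry` and `parse_fusion_dat` are made up for this example and are not part of the plugin.

```python
from typing import NamedTuple


class FusionEntry(NamedTuple):
    """One fusion.dat line: directory::branch::commit id (hypothetical type)."""
    directory: str
    branch: str
    commit_id: str


def parse_fusion_dat(text: str) -> list[FusionEntry]:
    """Parse fusion.dat content, one 'directory::branch::commitId' per line."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        parts = line.split("::")
        if len(parts) != 3:
            raise ValueError(f"malformed fusion.dat line: {line!r}")
        entries.append(FusionEntry(*parts))
    return entries
```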
So in git when you check out the aggregate_trunk branch it only contains:
some-file-from-aggregate-trunk-like-a-pom.xml
fusion.dat
Running my custom maven plugin uses JGit to create effectively a subtree merge commit (for all file entries at the same time) to make the tree the same as it
would have looked in subversion with the externals included.
Like this (copied out from a plugin unit test):
Our aggregate/trunk/pom.xml defines a modules section which expects those to be present. While not a problem at the tip of the development branch, this fusion
approach lets old build tags and releases be reconstituted easily to a buildable state.
So this is very similar to your object-id-in-comment problem. The commit ids are stored when the subversion to git conversion takes place. My importer notices
the existence of the svn:externals properties and, if present, creates a fusion.dat file mapping each named external to its branch head.
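A minimal sketch of that import step, in Python rather than the importer's own code. It assumes each svn:externals line has the form "^/path directory" and that the branch name is the svn path with "^/" removed and "/" replaced by "_" (consistent with module1_trunk above, but still an assumption). The function name and the `branch_heads` lookup are hypothetical.

```python
def externals_to_fusion(externals: str, branch_heads: dict[str, str]) -> str:
    """Turn an svn:externals property value into fusion.dat content.

    branch_heads maps a git branch name to its current head commit id
    (a hypothetical lookup supplied by the importer).
    """
    lines = []
    for raw in externals.splitlines():
        fields = raw.split()
        if len(fields) != 2:
            continue  # skip blank or unrecognised lines
        svn_path, directory = fields
        # Assumed naming rule: ^/module1/trunk -> module1_trunk
        branch = svn_path.removeprefix("^/").replace("/", "_")
        lines.append(f"{directory}::{branch}::{branch_heads[branch]}")
    return "".join(line + "\n" for line in lines)
```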
Then when the history is rewritten the commits that the fusion.dat files refer to are no longer valid.
I have several rewrite history programs that are using a reverse and topo sort which end up rewriting the original commit ids that were stored initially in these fusion.dat files.
a) So you're not planning to get rid of the fusion.dat files? You plan to continue using your maven plugin to materialize the subdirectories?
Yes, keep the fusion.dat files but rewrite their contents so that they refer to the latest commit id.
module1::module1_trunk::d065748c3fcbc3d2449012ac75b02fba962fe735
rewrite 1:
d065748c3fcbc3d2449012ac75b02fba962fe735 changed to 5882135d679f0e2a39822e5c4da419bb504d21a0
rewrite 2:
5882135d679f0e2a39822e5c4da419bb504d21a0 changed to 1a4565c5560e7f888ab743a8745896935779137b
b) What are the rewrites you're doing, if they're not removing/replacing the dat files?
So the rewrite is changing the contents of the fusion.dat file for the module1 line to point at 1a4565c5560e7f888ab743a8745896935779137b*
*(Technically this can still shift: if 1a4565c5560e7f888ab743a8745896935779137b is itself rewritten as part of the final rewrite, we need to use the final object id.)
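The chained lookup can be sketched as follows, assuming each rewrite pass leaves behind an old-id to new-id mapping and the passes are applied in chronological order. This is an illustration in Python, not the actual tool; `final_id` and `rewrite_fusion_dat` are hypothetical names.

```python
def final_id(commit_id: str, rewrites: list[dict[str, str]]) -> str:
    """Follow a commit id through successive rewrite passes.

    rewrites holds one {old_id: new_id} mapping per pass, oldest first.
    An id that a pass did not touch is carried through unchanged.
    """
    for mapping in rewrites:
        commit_id = mapping.get(commit_id, commit_id)
    return commit_id


def rewrite_fusion_dat(text: str, rewrites: list[dict[str, str]]) -> str:
    """Replace the commit id on every 'directory::branch::commitId' line."""
    out = []
    for line in text.splitlines():
        parts = line.split("::")
        if len(parts) == 3:
            parts[2] = final_id(parts[2], rewrites)
        out.append("::".join(parts))
    return "".join(line + "\n" for line in out)
```

With the two rewrites listed above, the module1 line would end up pointing at 1a4565c5560e7f888ab743a8745896935779137b.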
I have a tool that should be able to rewrite these fusion.dat files using the old-commit-to-new-commit records from the previous history rewrites; however, the sort order of the commits is not exactly what I need.
If I look at a commit with this fusion.dat file I know the ids of the commits that have to be processed before this one. You can imagine that the aggregate branch contains the fusion.dat
file but it is referencing lateral branches. In subversion these might have been all in a single commit but in Git there is the top level branch and each individual module branch.
The BFG Repo Cleaner has a similar-ish problem when it tries to rewrite commit-ids that are embedded in commit messages - it's not guaranteed that it's encountered that commit id before (even tho it uses a reverse and topo sort), as it
could be from a lateral branch, so it just attempts, in a very simple way, to recursively clean that commit-id and the history behind it, which works pretty well most of the time because it memoizes all cleaning operations on git-id - but occasionally risks
blowing up with a StackOverflowError if the commit-id's unseen history is too deep.
Incidentally, depending on what clean-up operations you're doing, you might be able to make use of the BFG as your cleaner, adding your fusion.dat file-updater as a bfg.Cleaner[Seq[Tree.Entry]].
Yes I think this sounds like the same problem. I’ll try your recursive approach first to see if I have the stack overhead to allow it to work.
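If stack depth does turn out to be a problem, the same memoized clean-on-demand idea can be written with an explicit stack instead of recursion. The sketch below is a Python illustration under stated assumptions, not the BFG's implementation: commits are modelled as a plain dict, the id references form no cycles, and a SHA-1 over the old id plus the cleaned dependency ids stands in for the real rewrite. `clean_all` is a hypothetical name.

```python
import hashlib


def clean_all(commits: dict, start: str) -> dict:
    """Clean `start` and everything it depends on, memoizing by commit id.

    commits maps id -> {"parents": [...], "refs": [...]}, where "refs" are
    commit ids embedded in the commit (e.g. in a fusion.dat file).
    Assumes the dependency graph is acyclic.
    """
    memo = {}        # old id -> new id
    stack = [start]  # explicit stack replaces the call stack
    while stack:
        cid = stack[-1]
        if cid in memo:
            stack.pop()
            continue
        deps = commits[cid]["parents"] + commits[cid]["refs"]
        pending = [d for d in deps if d not in memo]
        if pending:
            # Dependencies first; revisit cid once they are all cleaned.
            stack.extend(pending)
            continue
        cleaned = ",".join(memo[d] for d in deps)
        # Stand-in for the real rewrite: derive a new id from the cleaned deps.
        memo[cid] = hashlib.sha1(f"{cid}|{cleaned}".encode()).hexdigest()
        stack.pop()
    return memo
```

Because the pending work lives in a heap-allocated list, the depth of unseen history is limited by memory rather than by the thread's stack size.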
I’d like to add in my own sorter so that the RevWalk will consider this additional interdependence when ordering the results.
I can sort of see how I might subclass RevWalk and TopoSortGenerator to include my additional sort constraint data.
But those classes are all package scoped, so they do not seem designed to be extended directly by JGit users.
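To make the extra constraint concrete: the ordering wanted here is a topological sort in which a commit depends on its parents and also on every commit its fusion.dat names. The sketch below shows that idea with Python's standard graphlib module; it is not JGit code and does not touch RevWalk or TopoSortGenerator, and `rewrite_order` and both input dicts are hypothetical.

```python
from graphlib import TopologicalSorter


def rewrite_order(parents: dict, fusion_refs: dict) -> list:
    """Order commits so parents AND fusion.dat targets come first.

    parents:     commit id -> list of parent commit ids
    fusion_refs: commit id -> list of commit ids named in its fusion.dat
    """
    graph = {}
    for cid, ps in parents.items():
        graph.setdefault(cid, set()).update(ps)
    for cid, refs in fusion_refs.items():
        # The additional constraint: lateral-branch commits are predecessors too.
        graph.setdefault(cid, set()).update(refs)
    # static_order() yields every node after all of its predecessors and
    # raises graphlib.CycleError if the combined graph has a cycle.
    return list(TopologicalSorter(graph).static_order())
```

Processing commits in this order means every id in a fusion.dat file already has a rewritten counterpart by the time the commit holding that file is reached.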
Just my 2 cents, if you want to do this custom sorting, and given this sounds like a one-off, you're probably best off just doing a small JGit fork and doing a TopoSortGenerator that works exactly how you want. Changing org.eclipse.jgit.revwalk.RevSort
and all that lot for a general solution might well require a breaking-API change.
Apologies if my questions appear dim, I have a bit of a cold and may be misunderstanding you (also there are brighter people on this mailing list than me, I just had my interest piqued because of my work on the BFG repo-cleaner).
If the recursive sort doesn’t work I’ll probably end up with a fork as you suggest.
Thanks again for your response; it's very useful.
Regards,
Michael