[jgit-dev] Fastest way of retrieving contents of new/changed files of all commits?
Hello everyone,
For a research project I need the contents of all new and changed files, 
as well as the file paths of deleted files, for every commit of a 
particular branch in a repository. I currently do it as follows:
1. clone the repository using Git.cloneRepository
2. get the list of commits using git.log
3. create a new "temporary" branch and point it to the initial commit of 
the target branch
4. read all files (of the initial commit, needed for research)
5. for each subsequent commit
    a) reset the temporary branch by doing (note this is Scala code):
          git.reset
            .setMode(org.eclipse.jgit.api.ResetCommand.ResetType.HARD)
            .setRef(nextCommitRev).call
    b) calculate the diff of HEAD^{tree} and HEAD^^{tree}:
          val head = repository.resolve("HEAD^{tree}")
          val prev = repository.resolve("HEAD^^{tree}")
          val reader = repository.newObjectReader()
          val oldTreeIter = new CanonicalTreeParser()
          oldTreeIter.reset(reader, prev)
          val newTreeIter = new CanonicalTreeParser()
          newTreeIter.reset(reader, head)
          val diffs = git.diff()
            .setNewTree(newTreeIter)
            .setOldTree(oldTreeIter).call()
6. From the diff above, I can determine the new, changed and deleted 
file paths and then read the contents of the new/changed files from the 
working directory.
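For illustration, step 6 could be sketched roughly as below. This assumes Scala 2.13+ (for scala.jdk.CollectionConverters) and that `diffs` is the java.util.List[DiffEntry] returned by git.diff()...call(). The names DiffClassifier, Changes and classify are hypothetical helpers, not part of JGit.

```scala
import org.eclipse.jgit.diff.DiffEntry
import org.eclipse.jgit.diff.DiffEntry.ChangeType
import scala.jdk.CollectionConverters._

object DiffClassifier {
  // Hypothetical container: paths whose contents must be read, and deleted paths.
  final case class Changes(toRead: List[String], deleted: List[String])

  def classify(diffs: java.util.List[DiffEntry]): Changes = {
    val entries = diffs.asScala.toList
    // ADD, MODIFY, COPY and RENAME all leave a file at the new path.
    val toRead = entries.collect {
      case e if e.getChangeType != ChangeType.DELETE => e.getNewPath
    }
    // DELETE removes the old path; a RENAME (if reported) also vacates it.
    val deleted = entries.collect {
      case e if e.getChangeType == ChangeType.DELETE ||
                e.getChangeType == ChangeType.RENAME => e.getOldPath
    }
    Changes(toRead, deleted)
  }
}
```

The paths in `toRead` are then the files I read from the working directory after the reset.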
My problem now is that this is too slow for my purposes. I actually have 
to check out the files into the working dir so that I can read them, 
and for every commit I have to create another diff etc. Plus, it's hard 
to parallelize this. I could just run N instances on different parts of 
the commit range (e.g. the 1st 1000, the 2nd 1000 and so on), but then I 
would need N temporary directories to check out the commits.
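The splitting I have in mind would look roughly like the sketch below. It is plain Scala with no JGit involved; CommitChunks and split are hypothetical names, and the element type is left generic so it would work for commit ids or RevCommit objects alike.

```scala
object CommitChunks {
  // Splits `commits` into at most `n` contiguous chunks of near-equal size,
  // preserving the original order inside and across chunks.
  def split[A](commits: Vector[A], n: Int): Vector[Vector[A]] = {
    require(n > 0, "n must be positive")
    if (commits.isEmpty) Vector.empty
    else {
      val size = math.ceil(commits.size.toDouble / n).toInt
      commits.grouped(size).toVector
    }
  }
}
```

Each instance would still need the commit just before its chunk in order to diff the chunk's first commit against its parent.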
Does anyone have an idea how I could do this more quickly?
Ideally, I would like to not even check out the files. Isn't there some 
way to walk the commits and stream the contents of old/new/changed files 
on the fly from a bare repository?
To clarify what I need for each commit C:
  * A list of paths (e.g. relative to the root of the git repo) to all 
files that have been deleted in C.
  * The full content of all files which are newly added or have been 
changed in C.
  * If I could get this data sequentially (or in multiple parallel 
sequences), that would be best, but even if it were "random access", i.e. 
unordered, it would still be fine.
  * It should be as fast and as parallelizable as possible.
I hope it's clear what I mean. Thank you for your assistance.
Cheers,
Tom