[jgit-dev] Fastest way of retrieving contents of new/changed files of all commits?

Hello everyone,

For a research project I need the contents of all new and changed files, as well as the file paths of deleted files, for every commit of a particular branch in a repository. I currently do it as follows:

1. clone the repository using Git.cloneRepository
2. get the list of commits using git.log
3. create a new "temporary" branch and point it to the initial commit of the target branch
4. read all files (of the initial commit, needed for research)
5. for each subsequent commit
    a) reset the temporary branch by doing (note this is Scala code):


    b) calculate the diff of HEAD^{tree} and HEAD^^{tree}:

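          import org.eclipse.jgit.treewalk.CanonicalTreeParser

          // Resolve the trees of the current commit and of its first parent
          val head = repository.resolve("HEAD^{tree}")
          val prev = repository.resolve("HEAD^^{tree}")
          val reader = repository.newObjectReader()

          // One tree iterator per side of the diff
          val oldTreeIter = new CanonicalTreeParser()
          oldTreeIter.reset(reader, prev)
          val newTreeIter = new CanonicalTreeParser()
          newTreeIter.reset(reader, head)

          // Assumed completion: the snippet as posted stopped at git.diff()
          val diffs = git.diff().setOldTree(oldTreeIter).setNewTree(newTreeIter).call()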

6. From the diff above, I can determine the new, changed and deleted file paths and then read the contents of the new/changed files from the working directory.

My problem now is that this is too slow for my purposes. I actually have to check out the files into the working directory so that I can read them, and for every commit I have to create another diff, etc. Plus, it's hard to parallelize this. I could just run N instances on different parts of the commit range (e.g. the 1st 1000, the 2nd 1000 and so on), but then I would need N temporary directories to check out the commits.
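To make the range-splitting idea concrete, here is a minimal sketch in plain Scala. It assumes the commit ids are already available as a list ordered oldest-first along a linear history. The names CommitChunks, Chunk and split are hypothetical and not part of JGit.

```scala
// Hypothetical helper (not part of JGit): split an ordered commit list into
// chunks that separate workers could process independently.
object CommitChunks {
  // Each chunk also carries the commit just before its first element (if any),
  // because diffing the first commit of a chunk needs its predecessor.
  // Assumption: the list is ordered oldest-first along a linear history.
  final case class Chunk(previous: Option[String], commits: Vector[String])

  def split(commits: Vector[String], chunkSize: Int): Vector[Chunk] = {
    require(chunkSize > 0, "chunkSize must be positive")
    commits.grouped(chunkSize).zipWithIndex.map { case (group, i) =>
      val start = i * chunkSize
      val previous = if (start == 0) None else Some(commits(start - 1))
      Chunk(previous, group)
    }.toVector
  }
}
```

Each chunk would then go to its own worker, which with the current approach still needs its own working directory.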

Does anyone have an idea how I could do this more quickly?

Ideally, I would like to not even check out the files. Isn't there some way to walk the commits and stream the contents of new/changed files on the fly from a bare repository?

To clarify what I need for each commit C:
  * A list of paths (e.g. relative to the root of the git repo) to all files that have been deleted in C.
  * The full content of all files which are newly added or have been changed in C.
  * If I could get this data sequentially (or in multiple/parallel sequences) this would be best, but even if it were "random access", i.e. unordered, it would still be fine.
  * It should be as fast and parallelized as possible

I hope it's clear what I mean. Thank you for your assistance.

