Re: [jgit-dev] Performance of indexing blob metadata
On Sat, Mar 10, 2012 at 12:59, <james.moger@xxxxxxxxxxx> wrote:

> Here is the goal: I want to traverse the head tree of each branch and
> index the blobs with the appropriate author metadata.

This is not a fast operation in Git. The data structures used to store the data on disk do not lend themselves to computing this information. If you go down this road, you should try to build your own index or cache alongside of the Git data that will help you update the Lucene index more quickly as branches advance. But indexing a tree without this cache will be very, very slow.

code.google.com supports showing the last author of a file through a very complex and yet still very brute-force method. They check every single commit, comparing it against its ancestor(s), and work a blame matrix through the tree to compute, for each file at each commit, what the last author of that file was in a predecessor commit. It takes a lot of time and is a very MapReduce-y sort of problem. It's also very I/O- and CPU-intensive to shuffle that much data around. :-(

> Consider the following unit tests which implement the traversal logic
> I'm using in my indexer. If fullIndex = false then I take a shortcut
> and assume the branch head metadata applies to all blobs. This is
> wrong, but doing so allows me to traverse all JGit
> branches/blobs/commits in ~1.4 seconds on my older Debian box. If
> fullIndex = true, I get the correct metadata for the blobs but the same
> traversal takes 833 seconds (~14 minutes). Wow. CPU is working really
> hard when this is going on; heap stays fairly controlled.

Not unexpected, given what you are doing. Seriously.

> Perhaps the obvious answer is to re-use blobWalk and not make a new one
> on every treewalk traversal. But when I do that, each .next() after the
> _first_ blob returns the branch head commit (markStart) instead of the
> correct commit.
> And that returned commit cannot really be parsed for author info - it
> throws null pointers, so my getAuthor/getCommitter methods return
> "unknown".

You need to call setRetainBody(true) on the RevWalk being used.

I think you can reuse the same RevWalk instance through this code and benefit from the internal hash map that caches the Git graph... except that the graph doesn't cache with a PathFilter. A PathFilter destroys the parent links in the graph and creates a subgraph specific to what the PathFilter matched. If you change the PathFilter, you need to discard the old RevWalk and make a new one.

> Am I doing something glaringly wrong here? Or can someone suggest an
> alternative? I had considered traversing the commits first and building
> a hashmap of file-lastcommitid.

You can't use the algorithm you are using. :-)

A better approach would be to avoid the PathFilter entirely. Instead, start a RevWalk from the tip of a branch. Create a Map<String, RevCommit> to hold the most recent commit that modified each file. Prefill this map with a TreeWalk on the tip of the branch; this tells you how many files you need to locate the RevCommit for. Run a loop over the RevWalk, popping each RevCommit. When you get a RevCommit, diff it against its parent using a TreeWalk with TreeFilter.ANY_DIFF set on it. For each of these changed paths, if the path is in the Map with a null RevCommit, put this RevCommit in and decrement the number of paths you are missing. When the number of missing paths reaches zero, you can abort the walk and move on to the next branch, as you have run "blame" for all file paths.

If you are clever, I think you can find a way to reuse portions of these results across branches. Once there is common history between two branches, all remaining paths that you haven't found a RevCommit for can be copied over from that prior branch... provided that the RevCommits are older than the common history point. :-)
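The loop above can be sketched roughly like this. It is a plain-Java illustration of the bookkeeping only: the Commit record is a toy stand-in for JGit's RevCommit, and its "changed" set stands in for what a TreeWalk with TreeFilter.ANY_DIFF would report against the first parent. A real implementation would drive the walk with RevWalk.markStart() and rw.next() instead.

```java
import java.util.*;

public class LastCommitIndex {
    // Toy stand-in for RevCommit: id, first parent, and the set of paths
    // that differ from the parent (what TreeFilter.ANY_DIFF would report).
    record Commit(String id, Commit parent, Set<String> changed) {}

    // Walk backwards from the branch tip, recording for each path still
    // present at the tip the most recent commit that touched it.
    static Map<String, Commit> lastCommitPerPath(Commit tip, Set<String> pathsAtTip) {
        Map<String, Commit> result = new HashMap<>();
        for (String p : pathsAtTip)
            result.put(p, null);          // prefill, as with a TreeWalk on the tip
        int missing = result.size();
        for (Commit c = tip; c != null && missing > 0; c = c.parent()) {
            for (String p : c.changed()) {
                if (result.containsKey(p) && result.get(p) == null) {
                    result.put(p, c);     // newest commit touching p wins
                    missing--;            // allows aborting once all are resolved
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Commit a = new Commit("a", null, Set.of("README", "src/Main.java"));
        Commit b = new Commit("b", a, Set.of("src/Main.java"));
        Commit tip = new Commit("c", b, Set.of("docs/guide.txt"));
        Map<String, Commit> m = lastCommitPerPath(tip,
                Set.of("README", "src/Main.java", "docs/guide.txt"));
        System.out.println(m.get("README").id());         // a
        System.out.println(m.get("src/Main.java").id());  // b
        System.out.println(m.get("docs/guide.txt").id()); // c
    }
}
```

Note the toy model is single-parent; with real merges you would diff against each parent (or at least the first parent), and the early-abort check is what keeps this from degenerating into a full history scan on active repositories.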
You will want to save these Map<String, RevCommit>s for each branch to some persistent storage, so that when a branch moves forward you can reload the map and only scan backwards through the branch history as far as the last point you indexed to. The initial indexing will not be speedy, but you can get quicker incremental updates.

IIRC code.google.com stores all of this in Google Bigtable, because it's a lot of data and is very expensive to compute, but relatively cheap to store on disk. GitHub, I think, uses Redis to store the same information and computes it on the fly when a user first asks for it (hence the spinner in the web UI on each file). Both sites use an algorithm similar to what I described above to populate the data tables in their databases.