
Re: [jgit-dev] Performance of indexing blob metadata

On Sat, Mar 10, 2012 at 12:59,  <james.moger@xxxxxxxxxxx> wrote:
> Here is the goal:  I want to traverse the head tree of each branch and
> index the blobs with the appropriate author metadata.

This is not a fast operation in Git. The data structures used to store
the data on disk do not lend themselves to computing this information.

If you go down this road, you should build your own index or cache
alongside the Git data to help you update the Lucene index quickly as
branches advance. But indexing a tree without such a cache will be
very, very slow. One large Git hosting site supports showing the last
author of a file through a very complex and yet still very brute-force
method: it checks every single commit, comparing it against its
ancestor(s), and works a blame matrix through the tree to compute, for
each file at each commit, what the last author of that file was in a
predecessor commit. It takes a lot of time and is a very MapReduce-y
sort of problem. It's also very I/O- and CPU-intensive to shuffle that
much data around. :-(

> Consider the following unit tests which implement the traversal logic
> I'm using in my indexer.  if fullIndex = false then I take a shortcut
> and assume the branch head metadata applies to all blobs.  This is
> wrong, but doing so allows me to traverse all JGit
> branches/blobs/commits in ~1.4 seconds on my older Debian box.  If
> fullIndex = true, I get the correct metadata for the blobs but the same
> traversal takes 833 seconds (~14 minutes).  Wow.  CPU is working really
> hard when this is going on, heap stays fairly controlled.

Not unexpected given what you are doing. Seriously.

> Perhaps the obvious answer is to re-use blobWalk and not make a new one
> on every treewalk traversal.  But when I do that each .next() after the
> _first_ blob returns the branch head commit (markStart) instead of the
> correct commit.  And that returned commit can not really be parsed for
> author info - it throws null pointers so my getAuthor/getCommitter
> methods return "unknown".

You need to call setRetainBody(true) on the RevWalk being used. I
think you can reuse the same RevWalk instance through this code and
benefit from the internal hash map that caches the Git graph... except
that the graph doesn't cache when a PathFilter is set. A PathFilter
destroys the parent links in the graph and creates a subgraph specific
to what the PathFilter matched. If you change the PathFilter, you need
to discard the old RevWalk and make a new one.

> Am I doing something glaringly wrong here?  Or can someone suggest an
> alternative?  I had considered traversing the commits first and building
> a hashmap of file-lastcommitid.

You can't use the algorithm you are using.  :-)

A better approach is to avoid the PathFilter entirely. Instead, start
a RevWalk from the tip of a branch. Create a Map<String, RevCommit>
to hold the most recent commit that modified each file. Prefill this
map with a TreeWalk on the tip of the branch; this tells you how many
files you need to locate the RevCommit for.

Run a loop over the RevWalk, popping each RevCommit. When you get a
RevCommit, diff that RevCommit against its parent using a TreeWalk
with TreeFilter.ANY_DIFF set on it. For each of these paths, if the
path is in the Map with a null RevCommit, put this RevCommit in and
decrement the number of paths you are missing. When the number of
paths you are missing is zero, you can abort the walk and move to the
next branch, as you have run "blame" for all file paths.
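The bookkeeping in the two paragraphs above can be sketched in plain
Java. To keep the sketch self-contained, the JGit pieces (RevWalk,
TreeWalk, TreeFilter.ANY_DIFF) are replaced by a toy in-memory commit
model; ToyCommit and changedPaths() are illustrative stand-ins, not
JGit API:

```java
import java.util.*;

// Toy stand-in for RevCommit so the bookkeeping is runnable here; in
// JGit you would pop RevCommits from a RevWalk and diff each against
// its parent with a TreeWalk using TreeFilter.ANY_DIFF.
class ToyCommit {
    final String id;
    final ToyCommit parent;          // single parent, for simplicity
    final Map<String, String> tree;  // path -> blob content
    ToyCommit(String id, ToyCommit parent, Map<String, String> tree) {
        this.id = id; this.parent = parent; this.tree = tree;
    }
    // Paths whose blobs differ from the parent (what ANY_DIFF yields).
    Set<String> changedPaths() {
        Map<String, String> old = parent == null ? Map.of() : parent.tree;
        Set<String> changed = new HashSet<>();
        for (var e : tree.entrySet())
            if (!e.getValue().equals(old.get(e.getKey())))
                changed.add(e.getKey());
        return changed;
    }
}

class LastCommitIndex {
    // For each path in the tip tree, find the newest commit touching it.
    static Map<String, String> lastCommitPerPath(ToyCommit tip) {
        Map<String, String> last = new HashMap<>();
        for (String path : tip.tree.keySet())
            last.put(path, null);        // prefill from the tip's tree
        int missing = last.size();
        // Walk newest-first; stop as soon as every path is resolved.
        for (ToyCommit c = tip; c != null && missing > 0; c = c.parent) {
            for (String path : c.changedPaths()) {
                if (last.containsKey(path) && last.get(path) == null) {
                    last.put(path, c.id);  // newest commit touching it
                    missing--;
                }
            }
        }
        return last;
    }

    public static void main(String[] args) {
        ToyCommit a = new ToyCommit("a", null,
            Map.of("README", "v1", "Main.java", "v1"));
        ToyCommit b = new ToyCommit("b", a,
            Map.of("README", "v2", "Main.java", "v1"));
        ToyCommit c = new ToyCommit("c", b,
            Map.of("README", "v2", "Main.java", "v1", "Util.java", "v1"));
        Map<String, String> last = lastCommitPerPath(c);
        System.out.println(last.get("README"));    // "b" touched it last
        System.out.println(last.get("Main.java")); // unchanged since "a"
        System.out.println(last.get("Util.java")); // added in "c"
    }
}
```

In the real implementation, changedPaths() corresponds to a TreeWalk
over the commit and its parent with TreeFilter.ANY_DIFF set, and the
outer loop corresponds to popping commits off the RevWalk.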

If you are clever, I think you can find a way to reuse portions of
these results across branches. Once there is common history between
two branches, all remaining paths that you haven't found a RevCommit
for can be copied over from that prior branch... provided that the
RevCommits are older than the common history point. :-)

You will want to save this Map<String, RevCommit> for each branch to
some persistent storage, so that when a branch moves forward you can
reload the map and only scan backwards through the branch history as
far as the last point you indexed. The initial indexing will not be
speedy, but you can get much quicker incremental updates.
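The incremental step can be sketched the same way (again with a toy
data shape rather than JGit API; with JGit you would start the RevWalk
at the new tip and markUninteresting() the previously indexed tip):

```java
import java.util.*;

class IncrementalIndex {
    // commitsNewestFirst: each entry = {commitId, changedPath1, ...}.
    // savedLast is the path->commit map persisted for the old tip;
    // the walk stops once it reaches lastIndexedId.
    static Map<String, String> update(List<String[]> commitsNewestFirst,
                                      String lastIndexedId,
                                      Map<String, String> savedLast) {
        Map<String, String> last = new HashMap<>(savedLast);
        Set<String> done = new HashSet<>(); // paths claimed by a newer commit
        for (String[] c : commitsNewestFirst) {
            if (c[0].equals(lastIndexedId))
                break;                      // older history is already indexed
            for (int i = 1; i < c.length; i++)
                if (done.add(c[i]))
                    last.put(c[i], c[0]);   // newest touch of this path wins
        }
        return last;
    }

    public static void main(String[] args) {
        Map<String, String> saved = Map.of("README", "a", "Main.java", "a");
        List<String[]> walk = List.of(
            new String[]{"c", "README"},     // newest commit since last index
            new String[]{"b", "Main.java"},
            new String[]{"a", "README"});    // old tip; walk stops here
        Map<String, String> updated = update(walk, "a", saved);
        System.out.println(updated.get("README"));    // now "c"
        System.out.println(updated.get("Main.java")); // now "b"
    }
}
```

Only the commits between the two tips are visited, which is what makes
the incremental update cheap relative to the initial indexing pass.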

IIRC one of these sites stores all of this in Google Bigtable, because
it's a lot of data that is very expensive to compute but relatively
cheap to store on disk. GitHub, I think, uses Redis to store the same
information, computing it on the fly when a user first asks for it
(hence the spinner in the web UI on each file). Both sites use an
algorithm similar to the one described above to populate the data
tables in their databases.
