These are definitely all great recommendations. This code has gone through several iterations and approaches and I didn't catch all those cleanups/optimizations.
As for "do I need to know the branch"... that is a good question. Chris Aniszczyk originally requested the Lucene integration and he pointed to OpenGrok as a reference, which I believe indexes each branch. I can't compete with most of OpenGrok because it relies on Exuberant Ctags (which is GPL-licensed native code) to extract blob metadata goodies. Indexing only the default branch would simplify things, but I can see wanting to search on a particular branch, especially an orphaned branch. I'm not sure what GitHub does, but it appears to index only the default branch, which is reasonable considering GitHub's scale. :)
Before I asked for help I kinda figured that Git's data design was going to be the bottleneck, and Shawn confirms that. Since I can't be the first one to run into this problem, I held onto a hope that someone had a trick to improve this, but it seems like some variant of brute force will be required. I'll prototype Shawn's approach next, and I'll probably sort the branches in reverse chronological order so that common ancestry is reused.
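To make the ordering idea concrete, here is a rough sketch in plain Java. It is only a toy model, not JGit code and not Shawn's approach: the Branch class, the parents map, and the method name are all made up for illustration. Branches are visited newest tip first, and a shared "seen" set means history already reached from an earlier branch is skipped on later ones.

```java
import java.util.*;

class BranchOrderSketch {
    /** Hypothetical stand-in for a branch: name, tip commit id, tip commit time. */
    public static final class Branch {
        final String name;
        final String tip;
        final long tipTime;

        public Branch(String name, String tip, long tipTime) {
            this.name = name;
            this.tip = tip;
            this.tipTime = tipTime;
        }
    }

    /**
     * Visits branches newest tip first and returns, per branch, the commits
     * that no earlier branch had already reached.
     *
     * @param parents toy history: maps a commit id to its parent ids
     */
    public static Map<String, List<String>> newCommitsPerBranch(
            List<Branch> branches, Map<String, List<String>> parents) {
        List<Branch> ordered = new ArrayList<>(branches);
        // Reverse chronological order by tip commit time.
        ordered.sort((a, b) -> Long.compare(b.tipTime, a.tipTime));

        Set<String> seen = new HashSet<>();
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Branch branch : ordered) {
            List<String> fresh = new ArrayList<>();
            Deque<String> todo = new ArrayDeque<>();
            todo.push(branch.tip);
            while (!todo.isEmpty()) {
                String commit = todo.pop();
                if (!seen.add(commit)) {
                    continue; // shared ancestry, already handled via an earlier branch
                }
                fresh.add(commit);
                for (String parent : parents.getOrDefault(commit, Collections.<String>emptyList())) {
                    todo.push(parent);
                }
            }
            result.put(branch.name, fresh);
        }
        return result;
    }
}
```

In this model each commit is attributed only to the first branch that reaches it, so an index built this way would still need some way to record the other branches that contain the same commit.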
I wonder if the core Git team has ever considered freshening the repository format to track a little more info on commits and blobs so that metadata reconstruction is not so painful? Such a change would probably not be backwards compatible.
On Sat, Mar 10, 2012, at 07:11 PM, Kevin Sawicki wrote:
Also, do you need to know which branch the current commit/blobs are on?
You can call markStart with multiple commits (the tip of each branch) and do all the processing in a single walk, but you wouldn't know which branch(es) the current commit was on and I'm not sure if you need that information for what you are doing.
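The trade-off can be sketched with a toy model in plain Java. This is not JGit code: the class name, the parents map, and the walk method are hypothetical and only mimic the idea of seeding one traversal with several tips. Every reachable commit comes out exactly once, and nothing records which tip led to it.

```java
import java.util.*;

class MultiStartWalkSketch {
    /**
     * Walks a toy history from all tips at once and returns each reachable
     * commit exactly once. Which tip reached a commit is deliberately not tracked.
     *
     * @param tips    starting commit ids, one per branch
     * @param parents toy history: maps a commit id to its parent ids
     */
    public static List<String> walk(Collection<String> tips,
                                    Map<String, List<String>> parents) {
        Deque<String> pending = new ArrayDeque<>(tips);
        Set<String> seen = new LinkedHashSet<>();
        while (!pending.isEmpty()) {
            String commit = pending.poll();
            if (!seen.add(commit)) {
                continue; // already reached from another tip
            }
            for (String parent : parents.getOrDefault(commit, Collections.<String>emptyList())) {
                pending.add(parent);
            }
        }
        return new ArrayList<>(seen);
    }
}
```

If branch membership does matter, it would have to be computed separately, for example by a per-branch pass or by tagging commits as they are reached.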
On Sat, Mar 10, 2012 at 7:03 PM, Kevin Sawicki <kevinsawicki@xxxxxxxxx> wrote:
Also, you can call PathFilterGroup.createFromStrings(blobPath) instead of PathFilterGroup.createFromStrings(Collections.singleton(blobPath)), which creates one fewer object per blob path.
On Sat, Mar 10, 2012 at 7:02 PM, Kevin Sawicki <kevinsawicki@xxxxxxxxx> wrote:
I would suggest reusing blobWalk and calling reset() each time before you call next(), and passing the head variable directly to blobWalk.markStart instead of calling blobWalk.parseCommit each time with head.getId().