Skip to main content

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] [List Home]
Re: [egit-dev] [jgit-dev] Avoiding stat calls (again)

On Tue, Nov 16, 2010 at 2:57 PM, Robin Rosenberg
<robin.rosenberg@xxxxxxxxxx> wrote:
> See the Gerrit http://egit.eclipse.org/r/#change,1905. The main thing I wanted, was to make the Commit
> dialog pop up faster. Now, what I found is that it actually opens and read every file, No wonder it
> takes forever. Git is supposed to be fast.
>
> Daemon Thread [ModalContext] (Suspended)
>        ContainerTreeIterator(WorkingTreeIterator).idBuffer() line: 213
>        ContainerTreeIterator(AbstractTreeIterator).idEqual(AbstractTreeIterator) line: 368
>        TreeWalk.idEqual(int, int) line: 696
>        TreeFilter$2.include(TreeWalk) line: 133

Ouch.  Someone is using the default ANY_DIFF with an iterator that
cannot provide fast access to the ObjectId of the current entry (aka a
WorkingTreeIterator).  No wonder its taking ages, we have to SHA-1
checksum the entire working directory.  That's a bad use of the
ANY_DIFF filter.

> The TreeWalk needs a better interface to the TreeIterators to support this cleanly and TreeWalk
> must be able to receive parameters regarding how to compare with regards to filemode and crlf etc.

I don't see how the TreeWalk should be modified.  TreeWalk is *only*
about performing the merge join of N sorted iterators.  That's really
all it does.  Everything else is the responsibility of the iterators
and the TreeFilters.

We can argue for two different things:

Idea 1)  Teach ANY_DIFF filter to check to see if the current
iterators are a DirCacheIterator and a WorkingTreeIterator.  If so use
the WorkingTreeIterator's concept of isModified() instead of checking
idEqual() to determine if the content differs at this path name.  This
seems like a good idea, as the check is reasonably inexpensive (two
instanceof calls), and then ANY_DIFF just works as expected, but gets
a performance boost when both a DirCacheIterator and a
WorkingTreeIterator are passed in the TreeWalk.  However this is hard
to code because there are N trees (where N is unknown) and we don't
know in which slot the two iterators will appear.

Idea 2)  Add knowledge of the TreeWalk and the position of the
DirCacheIterator to the WorkingTreeIterator.  That is, maybe we add a
method like:

  public void setDirCacheIterator(TreeWalk tw, int pos);

To be used like this:

  TreeWalk tw = new TreeWalk(db);
  tw.reset();


  WorkingTreeIterator twi = new WorkingTreeIterator(...);
  twi.setDirCacheIterator(tw, tw.add(new DirCacheIterator(...));
  tw.add(twi);

Then modify WorkingTreeIterator's idBuffer() call to try and reuse the
DirCacheIterator's idBuffer() instead of its own whenever the TreeWalk
has a valid DirCacheIterator at the given position, and isModified()
is false.  This will allow the ANY_DIFF filter to work without
modifications.

> I think we need to break the TreeWalk API. Let's suprise Shawn when he gets back from vacation }:>

I don't think you need to change TreeWalk at all.  Just teach ANY_DIFF
to be smarter, avoid using ANY_DIFF, create a new special variant of
ANY_DIFF that knows about the WorkingTreeIterator, or teach the
WorkingTreeIterator how to use a sibling DirCacheIterator to respond
to the idBuffer() request more efficiently than scanning the whole
file.

See, no breakage.  :-)

-- 
Shawn.


Back to the top