[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] [List Home]
Re: [jgit-dev] [egit-dev] speeding up re-indexing

On 05/31/2012 10:44 PM, Matthias Sohn wrote:
> 
> How about using a profiler to find out what's going on ?

Hey all!

(please note that all the times i'm noting in texts are approximate measurements of hitting the refresh button in the staging view, which also refreshes all projects. so for cold caches this takes a good amount of the time).

good idea :) i jprofile'd egits re-indexing, and found out that most of the time (with hot caches) is spent finding a container for resources:

  23.5% - 516 s - 1 inv. org.eclipse.egit.core.internal.indexdiff.IndexDiffCacheEntry$4.run
  23.5% - 516 s - 1 inv. org.eclipse.egit.core.internal.indexdiff.IndexDiffCacheEntry.access$4
  23.5% - 516 s - 1 inv. org.eclipse.egit.core.internal.indexdiff.IndexDiffCacheEntry.calcIndexDiff
  23.4% - 516 s - 1 inv. org.eclipse.jgit.lib.IndexDiff.diff
  23.4% - 515 s - 3 inv. org.eclipse.jgit.treewalk.TreeWalk.next
  21.7% - 477 s - 12,005 inv. org.eclipse.jgit.treewalk.TreeWalk.enterSubtree
  21.6% - 476 s - 23,995 inv. org.eclipse.jgit.treewalk.AbstractTreeIterator.createSubtreeIterator
  20.9% - 460 s - 1,873 inv. org.eclipse.egit.core.AdaptableFileTreeIterator.createSubtreeIterator
  20.8% - 458 s - 1,873 inv. org.eclipse.egit.core.IteratorService.findContainer
  20.8% - 458 s - 1,873 inv. org.eclipse.core.internal.resources.WorkspaceRoot.findContainersForLocationURI

as a really stupid optimization, i tried to change AdaptableFileTreeIterator.createSubtreeIterator() in a way that it only tries to find a container if a .project file exists (i know i', breaking project-less containers in the workspace). This sped up the whole thing _a lot_ (127 vs 516 seconds):

  17.2% - 127 s - 1 inv. org.eclipse.egit.core.internal.indexdiff.IndexDiffCacheEntry$4.run
  17.2% - 127 s - 1 inv. org.eclipse.egit.core.internal.indexdiff.IndexDiffCacheEntry.access$4
  17.2% - 127 s - 1 inv. org.eclipse.egit.core.internal.indexdiff.IndexDiffCacheEntry.calcIndexDiff
  17.2% - 127 s - 1 inv. org.eclipse.jgit.lib.IndexDiff.diff
  17.1% - 126 s - 3 inv. org.eclipse.jgit.treewalk.TreeWalk.next
  13.0% - 96,737 ms - 12,005 inv. org.eclipse.jgit.treewalk.TreeWalk.enterSubtree
  12.9% - 95,484 ms - 23,995 inv. org.eclipse.jgit.treewalk.AbstractTreeIterator.createSubtreeIterator
  11.3% - 83,680 ms - 1,873 inv. org.eclipse.egit.core.AdaptableFileTreeIterator.createSubtreeIterator
  11.1% - 82,487 ms - 696 inv. org.eclipse.egit.core.IteratorService.findContainer
  11.1% - 82,374 ms - 696 inv. org.eclipse.core.internal.resources.WorkspaceRoot.findContainersForLocationURI

also note that of course profiling slowed this down a lot. without profiling the time for re-indexing with hot caches changed from ~20 seconds to ~14 seconds.

now for the bad(er) case: cold caches. it turned out that the IndexDiffFilter seems to do a contentCheck() on (nearly) every file - 33826 times. after debugging this, it seems that jgit thinks all timestamps differ.

interesting is, that org.eclipse.jgit.treewalk.WorkingTreeIterator.compareMetadata(DirCacheEntry).cacheLastModified is nearly always zero, /except/ for files that have been modified in the last commit (tip of current branch)(!!!!).

times look more like this then:

  26.1% - 396 s - 1 inv. org.eclipse.egit.core.internal.indexdiff.IndexDiffCacheEntry$4.run
  26.1% - 396 s - 1 inv. org.eclipse.egit.core.internal.indexdiff.IndexDiffCacheEntry.access$4
  26.1% - 396 s - 1 inv. org.eclipse.egit.core.internal.indexdiff.IndexDiffCacheEntry.calcIndexDiff
  26.1% - 395 s - 1 inv. org.eclipse.jgit.lib.IndexDiff.diff
  26.0% - 394 s - 3 inv. org.eclipse.jgit.treewalk.TreeWalk.next
  19.0% - 288 s - 46,620 inv. org.eclipse.jgit.treewalk.filter.AndTreeFilter$List.include
  19.0% - 288 s - 46,620 inv. org.eclipse.jgit.treewalk.filter.IndexDiffFilter.include
  18.5% - 281 s - 33,828 inv. org.eclipse.jgit.treewalk.WorkingTreeIterator.isModified
  18.0% - 273 s - 33,826 inv. org.eclipse.jgit.treewalk.WorkingTreeIterator.contentCheck
  18.0% - 272 s - 33,826 inv. org.eclipse.jgit.treewalk.AbstractTreeIterator.getEntryObjectId
  17.9% - 271 s - 33,826 inv. org.eclipse.jgit.treewalk.WorkingTreeIterator.idBuffer
  17.9% - 271 s - 33,826 inv. org.eclipse.jgit.treewalk.WorkingTreeIterator.idBufferBlob
  17.6% - 267 s - 33,826 inv. org.eclipse.jgit.treewalk.WorkingTreeIterator.computeHash
  17.0% - 258 s - 543,873 inv. java.io.InputStream.read

i think the bug is in how the dircache is created, maybe? could it be that only mod-times of entries in the current HEAD commit are applied to the (initial) DCEs? seems so. as soon as i stage a file, this also gets correct last modifier timestamps.

i added this:

	if (e.getLastModified() != 0) {
		System.out.println("mod: " + e.getLastModified() + ": "
					+ e.getPathString());
	}

to org.eclipse.jgit.dircache.DirCache.writeTo(OutputStream):666, where the DCEs are written to the real index file, and only files modified in the last commit, and newly staged entries are shown.

even more investigation showed that the only one that seems to set the timestamp on index entries during non-merge "normal" checkout (org.eclipse.jgit.dircache.DirCacheCheckout.checkoutEntry(Repository, File, DirCacheEntry, ObjectReader)) is only called for files that changed - this matches my observations. my stupid approach to fix this is to add a setLastModified() when keep() is called (index entry unmodified). this gives me valid timestamps here and everywhere, pushing down the re-indexing time to ~5 seconds with hot caches, and ~108 seconds (vs. ~500 seconds!) with cold caches.

i pushed the fix i have done [1]; it is probably completely wrong, so correct me please :)

[1] https://git.eclipse.org/r/6232

Regards,
Markus