[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] [List Home]
Re: [jgit-dev] jGit memory management and optimizations

On Sat, Nov 24, 2012 at 11:16 AM,  <emilian.bold@xxxxxxxxx> wrote:
>
>>> I'm just starting to look at jGit but from my small tests, it is extremely
>>> liberal with RAM.
>>
>> So is the C implementation of Git. The graph algorithms and data
>> structures don't lend themselves to low memory processing. Both
>> implementations trade RAM in order to reduce running time.
>
> What I would like is some way to trade RAM for more running time.
>
> I can display a loading messange, but I can't ask for 400MB to display some UI. I mostly care about desktop apps: I can easily trigger a memory error on NetBeans with large repositories.

Memory usage in Java is painful. The JRE isn't good at allocating
memory dynamically as needed by an application. The C implementation
gets the luxury of just asking the OS for memory as needed, and
returning it all when the process is done. Java code can't do this,
and really can't when they get embedded into a larger application like
an IDE.

>>> The only advanced guide I could find about this mentions very few tricks:
>>> http://help.eclipse.org/indigo/index.jsp?topic=%2Forg.eclipse.egit.doc%2Fhelp%2FJGit%2FUser_Guide%2FAdvanced-Topics.html
>>>
>>> I haven't yet analyzed the source code itself very much but I'll start with
>>> a simple questions: how does one efficiently count the commits?
>>
>> You don't. :-(
>>
>> This is one of the things that takes RAM and CPU time.
>
> Any place I can learn more about the design that impacts this? Other than reading the source code, of course.

Git doesn't really keep indexes around, most data is computed on the
fly from the base data. This has been a design decision of Git for a
very long time. The best I can do is point you at the Git mailing list
archives. Many threads over time have discussed this on and off over
the years. :-(

We haven't successfully found index structures that work well. Most
consume a lot more disk and the overhead of getting that data from
disk into RAM usually exceeds the computation time required when using
only the base data. So in general the Git community has been skeptical
about adding indexes. If there is an index that works, the community
may embrace it, but really nothing has come out as effective enough to
justify its existence.

>>> RevWalkUtils.count(...) calls find(walk, start, end).size() which basically
>>> builds a huge ArrayList with all the commits.
>>
>> OK so that is ugly that count requires making the ArrayList.
>
> Yeah, that's a minor bug.

Seems easily fixable too. :-)

>>> Counting by hand is better,
>>> but not by much as, it seems to consume lots of RAM even so (via the RevWalk
>>> itself, I assume).
>>
>> Yes, RevWalk must maintain a map of all commits.
>
> Why?

The map is needed to know if a commit has already been visited or not.
Multiple paths can lead to the same commit. RevWalk keeps a map of
ObjectId to RevCommit and uses this to identify if it has seen the
commit already, and if so, avoid visiting it again. (Internally this
is actually a graph coloring algorithm, the map converts from ObjectId
to RevCommit, a singleton for that commit node in the graph, and
RevCommit keeps a set of colors on it, one of which means "already
visited".)

>>> What am I missing?
>>
>> Have you tried setRetainBody(false) ?
>
> Yes, this is also in the wiki but doesn't seem to help much.

Well, it should help _some_. But a large amount of the memory is in
the map and RevCommit structures.

>>> I'm starting to believe that perhaps I should read more about the Git files
>>> format (http://git-scm.com/book/en/Git-Internals-Packfiles ?) and parse that
>>> somehow directly -- at least for the whole repository, counting should be
>>> fast.
>>
>> Not much faster. You may be able to save some memory, but this is an
>> odd question to try and accelerate an answer to. If you really need
>> this commit count fast you may be better off to cache the value on the
>> side. Store it as of some commit and refresh the cache when you notice
>> the HEAD is no longer at that commit by doing a RevWalk between the
>> two points and adding the difference to the counter.
>
> Maybe it's an odd question because I'm looking at jGit for a desktop app. It's not just counting commits, it's most of the git interractiong that would need to be done within memory constraints, but where I could let the user wait some more.

EGit has the same problem... and right now we have been telling users
to assign more memory to Eclipse. :-(

>>> It there something inherent in the git design that makes this so RAM hungry?
>>> I realize we are doing a topological sort on a DAG, but this seems to be a
>>> rather particular kind of DAG (generally, each vertex has only one
>>> incoming/outgoing edge) and I somehow expected operations on it to be much
>>> more efficient in terms of both memory and time.
>>
>> Nope. :-(
>
> This is sad. I read an article about the Eclipse Memory tool using a 'dominator tree' to speed lookup on a heapdump graph. Somehow I'm hoping for something similar for jgit that would speed some operations up or allow them to support some sort of indexing.

See above, if you happen to find an index format that saves a lot of
memory at runtime with low enough penalties that its actually
effective... the Git community may get behind it.

>>> Any low-hanging fruit remaining? Perhaps some ideas about building some
>>> 'index' to speed up jgit operations?
>>
>> There is new work that uses compressed bitmaps to speed up counting
>> operations during packing, which is primarily useful when JGit is used
>> as a server. Unfortunately this doesn't generalize to all commit
>> walking algorithms.
>
> Any link for this work so I could read some more?

  http://thread.gmane.org/gmane.comp.version-control.git/206457/
  https://git.eclipse.org/r/7940
  https://git.eclipse.org/r/#/q/status:open+project:jgit/jgit+branch:master+topic:bitmaps,n,z