Re: [jgit-dev] jGit memory management and optimizations

>> I'm just starting to look at jGit but from my small tests, it is extremely
>> liberal with RAM.
> 
> So is the C implementation of Git. The graph algorithms and data
> structures don't lend themselves to low memory processing. Both
> implementations trade RAM in order to reduce running time.

What I would like is some way to trade running time for lower RAM usage.

I can display a loading message, but I can't ask for 400MB just to display some UI. I mostly care about desktop apps: I can easily trigger a memory error in NetBeans with large repositories.

>> The only advanced guide I could find about this mentions very few tricks:
>> http://help.eclipse.org/indigo/index.jsp?topic=%2Forg.eclipse.egit.doc%2Fhelp%2FJGit%2FUser_Guide%2FAdvanced-Topics.html
>> 
>> I haven't yet analyzed the source code itself very much but I'll start with
>> a simple question: how does one efficiently count the commits?
> 
> You don't. :-(
> 
> This is one of the things that takes RAM and CPU time.

Any place I can learn more about the design that impacts this? Other than reading the source code, of course.

>> RevWalkUtils.count(...) calls find(walk, start, end).size() which basically
>> builds a huge ArrayList with all the commits.
> 
> OK so that is ugly that count requires making the ArrayList.

Yeah, that's a minor bug.

>> Counting by hand is better,
>> but not by much, as it seems to consume lots of RAM even so (via the RevWalk
>> itself, I assume).
> 
> Yes, RevWalk must maintain a map of all commits.

Why?
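
One plausible reason, sketched without JGit: in a history with merges, the same commit can be reached through more than one child, so a walk has to remember which commits it has already visited or it will count them twice. The plain-Java model below is only an illustration under that assumption (the `parents` map and `countReachable` are hypothetical names, not JGit API). It shows that the visited set grows with the number of commits walked.

```java
import java.util.*;

class CommitCountSketch {
    // Hypothetical stand-in for a commit graph: commit id -> parent ids.
    static int countReachable(Map<String, List<String>> parents, String start) {
        Set<String> seen = new HashSet<>();        // one entry per commit visited
        Deque<String> pending = new ArrayDeque<>();
        seen.add(start);
        pending.push(start);
        int count = 0;
        while (!pending.isEmpty()) {
            String commit = pending.pop();
            count++;
            for (String p : parents.getOrDefault(commit, Collections.emptyList())) {
                // Without this check, A below would be counted once via B and once via C.
                if (seen.add(p)) {
                    pending.push(p);
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Map<String, List<String>> parents = new HashMap<>();
        parents.put("A", Collections.emptyList());
        parents.put("B", Arrays.asList("A"));
        parents.put("C", Arrays.asList("A"));
        parents.put("D", Arrays.asList("B", "C")); // merge commit
        System.out.println(countReachable(parents, "D")); // prints 4
    }
}
```

Dropping the `seen` set would save memory but report 5 for this four-commit history, which is why counting by hand still needs per-commit state.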

>> What am I missing?
> 
> Have you tried setRetainBody(false) ?

Yes, this is also in the wiki but doesn't seem to help much.

>> I'm starting to believe that perhaps I should read more about the Git files
>> format (http://git-scm.com/book/en/Git-Internals-Packfiles ?) and parse that
>> somehow directly -- at least for the whole repository, counting should be
>> fast.
> 
> Not much faster. You may be able to save some memory, but this is an
> odd question to try and accelerate an answer to. If you really need
> this commit count fast you may be better off to cache the value on the
> side. Store it as of some commit and refresh the cache when you notice
> the HEAD is no longer at that commit by doing a RevWalk between the
> two points and adding the difference to the counter.

Maybe it's an odd question because I'm looking at jGit for a desktop app. It's not just counting commits: most of the Git interaction would need to fit within memory constraints, though I could let the user wait longer.
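
The caching idea quoted above could look roughly like this. It is a plain-Java model, not JGit code: `Commit`, `countNew` and `CachedCommitCount` are hypothetical names, and the sketch assumes that every commit is strictly newer than its parents and that the new HEAD descends from the cached one.

```java
import java.util.*;

class CachedCommitCount {
    /** Hypothetical commit record; not a JGit type. */
    static final class Commit {
        final String id; final long time; final List<String> parents;
        Commit(String id, long time, String... parents) {
            this.id = id; this.time = time; this.parents = Arrays.asList(parents);
        }
    }

    private final Map<String, Commit> graph;
    private String cachedHead;  // commit the cached count refers to; null before first use
    private int cachedCount;

    CachedCommitCount(Map<String, Commit> graph) { this.graph = graph; }

    /** Counts commits reachable from head but not from base (base may be null). */
    static int countNew(Map<String, Commit> graph, String head, String base) {
        Set<String> uninteresting = new HashSet<>(); // reachable from base
        Set<String> queued = new HashSet<>();
        PriorityQueue<Commit> queue = new PriorityQueue<>(
                Comparator.comparingLong((Commit c) -> c.time).reversed()); // newest first
        queued.add(head);
        queue.add(graph.get(head));
        if (base != null) {
            uninteresting.add(base);
            if (queued.add(base)) queue.add(graph.get(base));
        }
        int count = 0;
        while (!queue.isEmpty()) {
            boolean anyInteresting = false;
            for (Commit c : queue) {
                if (!uninteresting.contains(c.id)) { anyInteresting = true; break; }
            }
            if (!anyInteresting) break; // everything older is already counted
            Commit c = queue.poll();
            boolean boring = uninteresting.contains(c.id);
            if (!boring) count++;
            for (String p : c.parents) {
                if (boring) uninteresting.add(p); // parents of counted history are counted too
                if (queued.add(p)) queue.add(graph.get(p));
            }
        }
        return count;
    }

    /** Returns the commit count at head, walking only the part not yet cached. */
    int count(String head) {
        if (!head.equals(cachedHead)) {
            // Assumption: head descends from cachedHead (no history rewrite).
            cachedCount += countNew(graph, head, cachedHead);
            cachedHead = head;
        }
        return cachedCount;
    }

    static Map<String, Commit> sampleGraph() {
        Map<String, Commit> g = new HashMap<>();
        g.put("A", new Commit("A", 1));
        g.put("B", new Commit("B", 2, "A"));
        g.put("C", new Commit("C", 3, "A"));
        g.put("D", new Commit("D", 4, "B", "C")); // merge
        g.put("E", new Commit("E", 5, "D"));
        return g;
    }

    public static void main(String[] args) {
        CachedCommitCount cache = new CachedCommitCount(sampleGraph());
        System.out.println(cache.count("B")); // prints 2 (full walk: B, A)
        System.out.println(cache.count("D")); // prints 4 (adds D and C only)
        System.out.println(cache.count("E")); // prints 5 (adds E only)
    }
}
```

Note that simply stopping the walk at the cached commit would miscount the merge: A is reachable from D through C but was already counted through B, so the walk has to propagate an "already counted" mark from the cached commit down to its ancestors.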

>> Is there something inherent in the git design that makes this so RAM hungry?
>> I realize we are doing a topological sort on a DAG, but this seems to be a
>> rather particular kind of DAG (generally, each vertex has only one
>> incoming/outgoing edge) and I somehow expected operations on it to be much
>> more efficient in terms of both memory and time.
> 
> Nope. :-(

This is sad. I read an article about the Eclipse Memory tool using a 'dominator tree' to speed lookup on a heapdump graph. Somehow I'm hoping for something similar for jgit that would speed some operations up or allow them to support some sort of indexing.

>> Any low-hanging fruit remaining? Perhaps some ideas about building some
>> 'index' to speed up jgit operations?
> 
> There is new work that uses compressed bitmaps to speed up counting
> operations during packing, which is primarily useful when JGit is used
> as a server. Unfortunately this doesn't generalize to all commit
> walking algorithms.

Any link for this work so I could read some more?

--emi