Re: [jgit-dev] jGit memory management and optimizations

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] [List Home]

Re: [jgit-dev] jGit memory management and optimizations

From: Emilian Bold <emilian.bold@xxxxxxxxx>
Date: Sun, 25 Nov 2012 00:02:13 +0200
Delivered-to: jgit-dev@xxxxxxxxxxx
List-archive: <https://dev.eclipse.org/mailman/private/jgit-dev>
List-help: <mailto:jgit-dev-request@eclipse.org?subject=help>
List-subscribe: <https://dev.eclipse.org/mailman/listinfo/jgit-dev>, <mailto:jgit-dev-request@eclipse.org?subject=subscribe>
List-unsubscribe: <https://dev.eclipse.org/mailman/options/jgit-dev>, <mailto:jgit-dev-request@eclipse.org?subject=unsubscribe>

It's not about releasing the memory back to the OS, it's about just not using so much of it. Ideally I would like some way to configure jGit to only cache 100MB in RAM and re-load/re-compute cache misses.

The fact that even EGit is as broken is slightly reassuring, it means the problem is on the radar. But does anybody see this as a priority or we just rely on the fact the people that use git have enough RAM anyhow?

--emi

On Nov 24, 2012, at 9:54 PM, Shawn Pearce wrote:

> On Sat, Nov 24, 2012 at 11:16 AM,  <emilian.bold@xxxxxxxxx> wrote:
>> 
>>>> I'm just starting to look at jGit but from my small tests, it is extremely
>>>> liberal with RAM.
>>> 
>>> So is the C implementation of Git. The graph algorithms and data
>>> structures don't lend themselves to low memory processing. Both
>>> implementations trade RAM in order to reduce running time.
>> 
>> What I would like is some way to trade RAM for more running time.
>> 
>> I can display a loading messange, but I can't ask for 400MB to display some UI. I mostly care about desktop apps: I can easily trigger a memory error on NetBeans with large repositories.
> 
> Memory usage in Java is painful. The JRE isn't good at allocating
> memory dynamically as needed by an application. The C implementation
> gets the luxury of just asking the OS for memory as needed, and
> returning it all when the process is done. Java code can't do this,
> and really can't when they get embedded into a larger application like
> an IDE.
> 
>>>> The only advanced guide I could find about this mentions very few tricks:
>>>> http://help.eclipse.org/indigo/index.jsp?topic=%2Forg.eclipse.egit.doc%2Fhelp%2FJGit%2FUser_Guide%2FAdvanced-Topics.html
>>>> 
>>>> I haven't yet analyzed the source code itself very much but I'll start with
>>>> a simple questions: how does one efficiently count the commits?
>>> 
>>> You don't. :-(
>>> 
>>> This is one of the things that takes RAM and CPU time.
>> 
>> Any place I can learn more about the design that impacts this? Other than reading the source code, of course.
> 
> Git doesn't really keep indexes around, most data is computed on the
> fly from the base data. This has been a design decision of Git for a
> very long time. The best I can do is point you at the Git mailing list
> archives. Many threads over time have discussed this on and off over
> the years. :-(
> 
> We haven't successfully found index structures that work well. Most
> consume a lot more disk and the overhead of getting that data from
> disk into RAM usually exceeds the computation time required when using
> only the base data. So in general the Git community has been skeptical
> about adding indexes. If there is an index that works, the community
> may embrace it, but really nothing has come out as effective enough to
> justify its existence.
> 
>>>> RevWalkUtils.count(...) calls find(walk, start, end).size() which basically
>>>> builds a huge ArrayList with all the commits.
>>> 
>>> OK so that is ugly that count requires making the ArrayList.
>> 
>> Yeah, that's a minor bug.
> 
> Seems easily fixable too. :-)
> 
>>>> Counting by hand is better,
>>>> but not by much as, it seems to consume lots of RAM even so (via the RevWalk
>>>> itself, I assume).
>>> 
>>> Yes, RevWalk must maintain a map of all commits.
>> 
>> Why?
> 
> The map is needed to know if a commit has already been visited or not.
> Multiple paths can lead to the same commit. RevWalk keeps a map of
> ObjectId to RevCommit and uses this to identify if it has seen the
> commit already, and if so, avoid visiting it again. (Internally this
> is actually a graph coloring algorithm, the map converts from ObjectId
> to RevCommit, a singleton for that commit node in the graph, and
> RevCommit keeps a set of colors on it, one of which means "already
> visited".)
> 
>>>> What am I missing?
>>> 
>>> Have you tried setRetainBody(false) ?
>> 
>> Yes, this is also in the wiki but doesn't seem to help much.
> 
> Well, it should help _some_. But a large amount of the memory is in
> the map and RevCommit structures.
> 
>>>> I'm starting to believe that perhaps I should read more about the Git files
>>>> format (http://git-scm.com/book/en/Git-Internals-Packfiles ?) and parse that
>>>> somehow directly -- at least for the whole repository, counting should be
>>>> fast.
>>> 
>>> Not much faster. You may be able to save some memory, but this is an
>>> odd question to try and accelerate an answer to. If you really need
>>> this commit count fast you may be better off to cache the value on the
>>> side. Store it as of some commit and refresh the cache when you notice
>>> the HEAD is no longer at that commit by doing a RevWalk between the
>>> two points and adding the difference to the counter.
>> 
>> Maybe it's an odd question because I'm looking at jGit for a desktop app. It's not just counting commits, it's most of the git interractiong that would need to be done within memory constraints, but where I could let the user wait some more.
> 
> EGit has the same problem... and right now we have been telling users
> to assign more memory to Eclipse. :-(
> 
>>>> It there something inherent in the git design that makes this so RAM hungry?
>>>> I realize we are doing a topological sort on a DAG, but this seems to be a
>>>> rather particular kind of DAG (generally, each vertex has only one
>>>> incoming/outgoing edge) and I somehow expected operations on it to be much
>>>> more efficient in terms of both memory and time.
>>> 
>>> Nope. :-(
>> 
>> This is sad. I read an article about the Eclipse Memory tool using a 'dominator tree' to speed lookup on a heapdump graph. Somehow I'm hoping for something similar for jgit that would speed some operations up or allow them to support some sort of indexing.
> 
> See above, if you happen to find an index format that saves a lot of
> memory at runtime with low enough penalties that its actually
> effective... the Git community may get behind it.
> 
>>>> Any low-hanging fruit remaining? Perhaps some ideas about building some
>>>> 'index' to speed up jgit operations?
>>> 
>>> There is new work that uses compressed bitmaps to speed up counting
>>> operations during packing, which is primarily useful when JGit is used
>>> as a server. Unfortunately this doesn't generalize to all commit
>>> walking algorithms.
>> 
>> Any link for this work so I could read some more?
> 
>  http://thread.gmane.org/gmane.comp.version-control.git/206457/
>  https://git.eclipse.org/r/7940
>  https://git.eclipse.org/r/#/q/status:open+project:jgit/jgit+branch:master+topic:bitmaps,n,z

Follow-Ups:
- Re: [jgit-dev] jGit memory management and optimizations
  - From: Shawn Pearce

References:
- [jgit-dev] jGit memory management and optimizations
  - From: Emilian Bold
- Re: [jgit-dev] jGit memory management and optimizations
  - From: Shawn Pearce
- Re: [jgit-dev] jGit memory management and optimizations
  - From: emilian . bold
- Re: [jgit-dev] jGit memory management and optimizations
  - From: Shawn Pearce

Prev by Date: Re: [jgit-dev] jGit memory management and optimizations
Next by Date: Re: [jgit-dev] jGit memory management and optimizations
Previous by thread: Re: [jgit-dev] jGit memory management and optimizations
Next by thread: Re: [jgit-dev] jGit memory management and optimizations
Index(es):
- Date
- Thread

Breadcrumbs