[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [jgit-dev] jGit memory management and optimizations
- From: Emilian Bold <emilian.bold@xxxxxxxxx>
- Date: Sun, 25 Nov 2012 00:02:13 +0200
- Delivered-to: firstname.lastname@example.org
It's not about releasing the memory back to the OS, it's about just not using so much of it. Ideally I would like some way to configure jGit to only cache 100MB in RAM and re-load/re-compute cache misses.
The fact that even EGit is as broken is slightly reassuring, it means the problem is on the radar. But does anybody see this as a priority or we just rely on the fact the people that use git have enough RAM anyhow?
On Nov 24, 2012, at 9:54 PM, Shawn Pearce wrote:
> On Sat, Nov 24, 2012 at 11:16 AM, <emilian.bold@xxxxxxxxx> wrote:
>>>> I'm just starting to look at jGit but from my small tests, it is extremely
>>>> liberal with RAM.
>>> So is the C implementation of Git. The graph algorithms and data
>>> structures don't lend themselves to low memory processing. Both
>>> implementations trade RAM in order to reduce running time.
>> What I would like is some way to trade RAM for more running time.
>> I can display a loading messange, but I can't ask for 400MB to display some UI. I mostly care about desktop apps: I can easily trigger a memory error on NetBeans with large repositories.
> Memory usage in Java is painful. The JRE isn't good at allocating
> memory dynamically as needed by an application. The C implementation
> gets the luxury of just asking the OS for memory as needed, and
> returning it all when the process is done. Java code can't do this,
> and really can't when they get embedded into a larger application like
> an IDE.
>>>> The only advanced guide I could find about this mentions very few tricks:
>>>> I haven't yet analyzed the source code itself very much but I'll start with
>>>> a simple questions: how does one efficiently count the commits?
>>> You don't. :-(
>>> This is one of the things that takes RAM and CPU time.
>> Any place I can learn more about the design that impacts this? Other than reading the source code, of course.
> Git doesn't really keep indexes around, most data is computed on the
> fly from the base data. This has been a design decision of Git for a
> very long time. The best I can do is point you at the Git mailing list
> archives. Many threads over time have discussed this on and off over
> the years. :-(
> We haven't successfully found index structures that work well. Most
> consume a lot more disk and the overhead of getting that data from
> disk into RAM usually exceeds the computation time required when using
> only the base data. So in general the Git community has been skeptical
> about adding indexes. If there is an index that works, the community
> may embrace it, but really nothing has come out as effective enough to
> justify its existence.
>>>> RevWalkUtils.count(...) calls find(walk, start, end).size() which basically
>>>> builds a huge ArrayList with all the commits.
>>> OK so that is ugly that count requires making the ArrayList.
>> Yeah, that's a minor bug.
> Seems easily fixable too. :-)
>>>> Counting by hand is better,
>>>> but not by much as, it seems to consume lots of RAM even so (via the RevWalk
>>>> itself, I assume).
>>> Yes, RevWalk must maintain a map of all commits.
> The map is needed to know if a commit has already been visited or not.
> Multiple paths can lead to the same commit. RevWalk keeps a map of
> ObjectId to RevCommit and uses this to identify if it has seen the
> commit already, and if so, avoid visiting it again. (Internally this
> is actually a graph coloring algorithm, the map converts from ObjectId
> to RevCommit, a singleton for that commit node in the graph, and
> RevCommit keeps a set of colors on it, one of which means "already
>>>> What am I missing?
>>> Have you tried setRetainBody(false) ?
>> Yes, this is also in the wiki but doesn't seem to help much.
> Well, it should help _some_. But a large amount of the memory is in
> the map and RevCommit structures.
>>>> I'm starting to believe that perhaps I should read more about the Git files
>>>> format (http://git-scm.com/book/en/Git-Internals-Packfiles ?) and parse that
>>>> somehow directly -- at least for the whole repository, counting should be
>>> Not much faster. You may be able to save some memory, but this is an
>>> odd question to try and accelerate an answer to. If you really need
>>> this commit count fast you may be better off to cache the value on the
>>> side. Store it as of some commit and refresh the cache when you notice
>>> the HEAD is no longer at that commit by doing a RevWalk between the
>>> two points and adding the difference to the counter.
>> Maybe it's an odd question because I'm looking at jGit for a desktop app. It's not just counting commits, it's most of the git interractiong that would need to be done within memory constraints, but where I could let the user wait some more.
> EGit has the same problem... and right now we have been telling users
> to assign more memory to Eclipse. :-(
>>>> It there something inherent in the git design that makes this so RAM hungry?
>>>> I realize we are doing a topological sort on a DAG, but this seems to be a
>>>> rather particular kind of DAG (generally, each vertex has only one
>>>> incoming/outgoing edge) and I somehow expected operations on it to be much
>>>> more efficient in terms of both memory and time.
>>> Nope. :-(
>> This is sad. I read an article about the Eclipse Memory tool using a 'dominator tree' to speed lookup on a heapdump graph. Somehow I'm hoping for something similar for jgit that would speed some operations up or allow them to support some sort of indexing.
> See above, if you happen to find an index format that saves a lot of
> memory at runtime with low enough penalties that its actually
> effective... the Git community may get behind it.
>>>> Any low-hanging fruit remaining? Perhaps some ideas about building some
>>>> 'index' to speed up jgit operations?
>>> There is new work that uses compressed bitmaps to speed up counting
>>> operations during packing, which is primarily useful when JGit is used
>>> as a server. Unfortunately this doesn't generalize to all commit
>>> walking algorithms.
>> Any link for this work so I could read some more?