Re: [jgit-dev] jGit memory management and optimizations

From: Shawn Pearce <spearce@xxxxxxxxxxx>
Date: Sat, 24 Nov 2012 10:31:51 -0800

On Fri, Nov 16, 2012 at 9:36 AM, Emilian Bold <emilian.bold@xxxxxxxxx> wrote:
> I'm just starting to look at jGit but from my small tests, it is extremely
> liberal with RAM.

So is the C implementation of Git. The graph algorithms and data
structures don't lend themselves to low-memory processing. Both
implementations spend RAM to reduce running time.

> The only advanced guide I could find about this mentions very few tricks:
> http://help.eclipse.org/indigo/index.jsp?topic=%2Forg.eclipse.egit.doc%2Fhelp%2FJGit%2FUser_Guide%2FAdvanced-Topics.html
>
> I haven't yet analyzed the source code itself very much but I'll start with
> a simple question: how does one efficiently count the commits?

You don't. :-(

This is one of the things that takes RAM and CPU time.

> RevWalkUtils.count(...) calls find(walk, start, end).size() which basically
> builds a huge ArrayList with all the commits.

OK, it is ugly that count() has to build the ArrayList.

> Counting by hand is better,
> but not by much, as it seems to consume lots of RAM even so (via the RevWalk
> itself, I assume).

Yes, RevWalk must maintain a map of all commits.
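
To see why a walk has to remember every commit it has visited, here is
a toy sketch in plain Java. It is not JGit code: commits are plain
strings, the parent map is supplied by the caller, and the names
ToyCommitCount and countReachable are made up for illustration. Without
the "seen" set, a commit reachable through both sides of a merge would
be counted twice, so memory grows with the number of reachable commits.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model only (hypothetical names, not the JGit API).
// Commits are strings; "parents" maps a commit to its parent commits.
class ToyCommitCount {
    static int countReachable(Map<String, List<String>> parents, String start) {
        // One entry per reachable commit: this is the unavoidable memory cost.
        Set<String> seen = new HashSet<>();
        Deque<String> pending = new ArrayDeque<>();
        pending.push(start);
        while (!pending.isEmpty()) {
            String commit = pending.pop();
            if (!seen.add(commit)) {
                continue; // already reached through another path, e.g. a merge
            }
            // Root commits have no entry in the map, so default to no parents.
            for (String parent : parents.getOrDefault(commit, List.of())) {
                pending.push(parent);
            }
        }
        return seen.size();
    }
}
```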

> What am I missing?

Have you tried setRetainBody(false) on the RevWalk? It lets the walk
discard each commit's message body once the commit has been parsed.

> I'm starting to believe that perhaps I should read more about the Git file
> format (http://git-scm.com/book/en/Git-Internals-Packfiles ?) and parse that
> somehow directly -- at least for the whole repository, counting should be
> fast.

Not much faster. You may be able to save some memory, but this is an
odd thing to try to make fast. If you really need the commit count
quickly, you may be better off caching the value on the side. Store it
as of some commit, and when you notice that HEAD is no longer at that
commit, refresh the cache by doing a RevWalk between the two points
and adding the difference to the counter.
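
A minimal sketch of that side cache, in plain Java. The class
CommitCountCache and the countBetween callback are hypothetical names,
not part of JGit. In a real program the callback would presumably be a
RevWalk that marks the new tip with markStart() and the old tip with
markUninteresting(), then counts what it iterates. The sketch assumes
history only moves forward from the cached commit; if history is
rewritten, the cache would have to be thrown away and rebuilt.

```java
import java.util.function.ToIntBiFunction;

// Sketch only: hypothetical helper, not a JGit class.
final class CommitCountCache {
    private String cachedTip;   // commit id the count was computed at; null if none yet
    private int cachedCount;    // number of commits reachable from cachedTip

    // Assumed contract: returns how many commits are reachable from the first
    // argument (new tip) but not from the second (old tip, or null for "none").
    private final ToIntBiFunction<String, String> countBetween;

    CommitCountCache(ToIntBiFunction<String, String> countBetween) {
        this.countBetween = countBetween;
    }

    int count(String head) {
        if (!head.equals(cachedTip)) {
            // Walk only the commits added since the cached tip.
            cachedCount += countBetween.applyAsInt(head, cachedTip);
            cachedTip = head;
        }
        return cachedCount;
    }
}
```

The first call walks the whole history once; later calls cost only as
much as the number of commits added since the cached one.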

> Is there something inherent in the git design that makes this so RAM hungry?
> I realize we are doing a topological sort on a DAG, but this seems to be a
> rather particular kind of DAG (generally, each vertex has only one
> incoming/outgoing edge) and I somehow expected operations on it to be much
> more efficient in terms of both memory and time.

Nope. :-(

> Any low-hanging fruit remaining? Perhaps some ideas about building some
> 'index' to speed up jgit operations?

There is new work that uses compressed bitmaps to speed up counting
operations during packing, which is primarily useful when JGit is used
as a server. Unfortunately this doesn't generalize to all commit
walking algorithms.
