Re: [jgit-dev] PackFile and PackIndex size

From: Shawn Pearce <spearce@xxxxxxxxxxx>
Date: Thu, 9 Sep 2010 20:32:16 -0700

On Thu, Sep 9, 2010 at 7:31 AM, Dmitry Neverov <dmitry.neverov@xxxxxxxxx> wrote:
> I wonder how a size of PackIndex and PackFile objects is related
> to the size of pack-* and idx files?

They are very closely related.  :-)

The size of the *.pack is actually irrelevant.  The JGit PackFile type
contains only a pointer to the file.  Individual blocks are loaded
into the WindowCache as needed, and the size of the WindowCache is
bounded by configuration parameters (default is 20M).  WindowCache is
a poor-man's virtual memory system providing on-demand paging of
blocks and eviction when space runs low.  So the size of the *.pack
files doesn't matter, we'll only load up to the limit and then evict
blocks if others are needed.
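
As a rough illustration of that paging idea, here is a toy cache with a
byte budget and least-recently-used eviction.  It is only a sketch: the
class and method names are hypothetical, and JGit's real WindowCache is
implemented differently.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy byte-budgeted LRU cache of pack blocks; not JGit's WindowCache. */
final class BlockCacheSketch {
    private final long maxBytes;
    private long usedBytes;
    // accessOrder=true keeps entries ordered least recently used first.
    private final LinkedHashMap<String, byte[]> blocks =
            new LinkedHashMap<>(16, 0.75f, true);

    BlockCacheSketch(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    /** Returns the cached block, or null if it must be read from disk. */
    synchronized byte[] get(String packName, long blockIndex) {
        return blocks.get(key(packName, blockIndex));
    }

    /** Caches a block, then evicts old blocks until under the budget. */
    synchronized void put(String packName, long blockIndex, byte[] data) {
        byte[] old = blocks.put(key(packName, blockIndex), data);
        if (old != null) {
            usedBytes -= old.length;
        }
        usedBytes += data.length;
        Iterator<Map.Entry<String, byte[]>> it = blocks.entrySet().iterator();
        while (usedBytes > maxBytes && it.hasNext()) {
            usedBytes -= it.next().getValue().length;
            it.remove();
        }
    }

    synchronized long usedBytes() {
        return usedBytes;
    }

    private static String key(String packName, long blockIndex) {
        return packName + ':' + blockIndex;
    }
}
```

However large the pack is, usedBytes() never stays above the configured
budget; a block that was evicted is simply read again on its next use.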

The size of the *.idx is another matter.  Initially the PackFile does
not contain the *.idx file data, so it's a rather thin header naming
the files on disk.  The first time the PackFile is consulted, the
*.idx file is loaded completely into memory, and it is never evicted.
Once loaded it stays loaded.  We could safely evict it; the way the
data is accessed makes it thread-safe to clear it out and let it GC.
We just don't have any infrastructure to select which PackIndex
instances should be evicted and discarded.  The same is true for the
reverse indexes, which are used during packing, and are now also used
when large deltified objects are encountered.

> Suppose I run a code:
>
> RevWalk walk = new RevWalk(db);
> try {
>   walk.parseCommit(commitId);
> } finally {
>   walk.release();
> }
>
> and the method ObjectDirectory.scanPacksImpl() creates a PackFile
> object for every found pack-* file. If we have a lot of pack
> files - a lot of such objects created.

Yes.  However the PackFile collection is sorted descending by last
modification date.  This places the most recent PackFile first.  If
the object named commitId is in an early PackFile, we'll stop
searching quickly and avoid loading the other *.idx files.
Unfortunately if the object is loose, we'll have to load all of the
*.idx files before we look at the loose directory.
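
The search order can be pictured with a small model.  This is a sketch
under stated assumptions, not the ObjectDirectory code: Pack is a
hypothetical stand-in for a PackFile, a Set of ids stands in for the
*.idx contents, and the indexLoaded flag records whether a lookup has
forced the index into memory.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Set;

/** Toy model of the pack search order; not JGit's ObjectDirectory code. */
final class PackSearchSketch {
    /** Hypothetical stand-in for a PackFile and its lazily loaded index. */
    static final class Pack {
        final long lastModified;
        private final Set<String> ids; // stands in for the *.idx contents
        boolean indexLoaded;           // false until the first lookup

        Pack(long lastModified, Set<String> ids) {
            this.lastModified = lastModified;
            this.ids = ids;
        }

        boolean contains(String id) {
            indexLoaded = true; // the first lookup pulls in the whole index
            return ids.contains(id);
        }
    }

    /** Returns the newest pack holding id, or null if no pack has it. */
    static Pack find(List<Pack> packs, String id) {
        List<Pack> newestFirst = new ArrayList<>(packs);
        newestFirst.sort(
                Comparator.comparingLong((Pack p) -> p.lastModified).reversed());
        for (Pack p : newestFirst) {
            if (p.contains(id)) {
                return p; // stop early; older indexes stay unloaded
            }
        }
        return null; // caller falls back to the loose object directory
    }

    /** Counts how many packs have had their index loaded so far. */
    static long loadedIndexes(List<Pack> packs) {
        return packs.stream().filter(p -> p.indexLoaded).count();
    }
}
```

A hit in the newest pack leaves every other index unloaded, while a
lookup for a loose object returns null only after touching all of them.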

This week I sent a change that caches the last 2048 loose objects we
know about and bypasses pack lookups when there is a hit here.  If
JGit created the object, it's automatically put into that cache, which
should help us avoid needing to open a ton of pack files just to read
a commit we just created.  Unfortunately it doesn't help the case of a
commit being created by an external process and then accessed by a
running JGit.

> After running 'git gc' I get single big pack and big index. Since
> PackIndex and PackFile are allocated on the stack it is hard to
> understand if one big file is better for memory usage than a lot
> of small files.

They aren't allocated on the stack; Java objects always live on the
heap.  :-)

One big file is slightly better for memory usage, because we have
fewer PackIndex and PackFile instances running around in the JVM.
These instances are fairly lightweight, maybe only 128 bytes combined
between them.  So 100 packs use roughly 12.5 KiB more memory than if
you had only 1 pack.

The issue of one pack file vs. many actually has to do with lookup
performance.  Within a single pack file we can do an O(log N) lookup
for an object, which gives us very good performance even when the
number of objects stored goes up.  However multiple pack files require
a linear scan through them.  So 100 packs means that in the worst case
we need to do 100 different log N searches.
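
To put rough numbers on that, the sketch below counts worst-case
comparisons, assuming a plain binary search per index and objects
spread evenly across the packs.  Both are simplifying assumptions made
for the illustration, and the class is hypothetical.

```java
/** Back-of-envelope lookup cost; assumes plain binary search per pack. */
final class LookupCostSketch {
    /** Worst-case comparisons for one binary search over n sorted ids. */
    static int probes(long n) {
        int count = 0;
        while (n > 0) { // counts the bits of n: floor(log2 n) + 1
            count++;
            n >>= 1;
        }
        return count;
    }

    /**
     * Worst case when the object sits in the last pack searched (or in
     * no pack at all) and objects are spread evenly over the packs.
     */
    static long worstCase(long totalObjects, int packCount) {
        return (long) packCount * probes(totalObjects / packCount);
    }
}
```

For 1,000,000 objects this gives worstCase(1_000_000, 1) == 20 but
worstCase(1_000_000, 100) == 1400, which is why repacking helps lookup
speed far more than it helps memory.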

> The outline of the tutorial on
> RevWalk (http://code.google.com/p/egit/wiki/JGitTutorialRevWalk)
> contains section on reducing memory usage. Could you provide some
> hints on how to do that? Thanks!

There is a setRetainBody(false) method you can use to discard the body
of a commit if you don't need the author, committer or message
information during the traversal.  Examples of when you don't need
this data are when you are only using the RevWalk to compute the merge
base between branches, or to perform a task for which you would have
used `git rev-list` with its default formatting.

If you do need the body, consider extracting the data you need and
then calling dispose() on the RevCommit, assuming you only need the
data once and can then discard it.  If you need to hang onto the data,
you may find that JGit's internal representation uses less overall
memory than if you held onto it yourself... especially if you want the
full message.  (Because we use a byte[] internally to store the
message in UTF-8.  Java String storage would be bigger using UTF-16,
assuming the message is mostly US-ASCII data.)
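
The size difference is easy to check with a small sketch.  The helper
names are hypothetical, and the UTF-16 figure counts only the two bytes
per char of a classic String payload.  JVMs since Java 9 can store
Latin-1-only strings at one byte per char, which narrows the gap.

```java
import java.nio.charset.StandardCharsets;

/** Compares the payload size of a message as UTF-8 bytes and as UTF-16. */
final class MessageSizeSketch {
    /** Bytes needed to hold the message as a UTF-8 byte[]. */
    static int utf8Bytes(String message) {
        return message.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Bytes needed for the char payload of a UTF-16 backed String. */
    static int utf16Bytes(String message) {
        return message.length() * 2;
    }
}
```

For a mostly US-ASCII message, utf16Bytes() comes out at twice
utf8Bytes(), before counting any per-object overhead.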

If you need to attach additional data to a commit, consider
subclassing both RevWalk and RevCommit, and using the createCommit()
method in RevWalk to construct an instance of your RevCommit subclass.
Put the additional data as fields in your RevCommit subclass, so that
you don't need to use an auxiliary HashMap to translate from RevCommit
or ObjectId to your additional data fields.
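
The shape of that pattern, reduced to a toy that runs without JGit,
might look like the sketch below.  Commit and Walk are hypothetical
stand-ins for RevCommit and RevWalk, with String ids in place of
ObjectId, and reviewScore is an invented example field.  Only the
factory-override idea carries over; the real signatures differ.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy version of the subclassing pattern; not the JGit classes. */
final class SubclassSketch {
    /** Hypothetical stand-in for RevCommit. */
    static class Commit {
        final String id;

        Commit(String id) {
            this.id = id;
        }
    }

    /** Hypothetical stand-in for RevWalk: one Commit instance per id. */
    static class Walk {
        private final Map<String, Commit> objects = new HashMap<>();

        /** Factory hook that subclasses override, like createCommit(). */
        protected Commit createCommit(String id) {
            return new Commit(id);
        }

        final Commit lookupCommit(String id) {
            return objects.computeIfAbsent(id, this::createCommit);
        }
    }

    /** Commit subclass that carries the extra data as a plain field. */
    static final class ReviewCommit extends Commit {
        int reviewScore; // invented example of additional per-commit data

        ReviewCommit(String id) {
            super(id);
        }
    }

    /** Walk subclass whose object map holds ReviewCommit instances. */
    static final class ReviewWalk extends Walk {
        @Override
        protected Commit createCommit(String id) {
            return new ReviewCommit(id);
        }
    }
}
```

Because the walk hands back the same instance for the same id, the
extra field travels with the commit and no side table is needed.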

Also try to walk only the part of the graph you actually need.  That
is, if you are looking for the commits in refs/heads/master not yet in
refs/remotes/origin/master, make sure you markStart() for
refs/heads/master and markUninteresting() for
refs/remotes/origin/master.  The RevWalk traversal will only parse the
commits necessary for it to answer you, and will try to avoid looking
back further in history.  That reduces the size of the internal object
map, and thus reduces overall memory usage.
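
Why the traversal can stop early is easier to see in a toy model.  The
sketch below is not JGit's algorithm or API.  It assumes every commit
has a strictly newer timestamp than its parents, and all names in it
are hypothetical.  It pops commits newest first and stops as soon as
only uninteresting commits remain queued.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Set;

/** Toy newest-first walk with an uninteresting tip; not JGit's RevWalk. */
final class LimitedWalkSketch {
    static final class Commit {
        final String name;
        final int time; // assumed strictly newer than every parent
        final List<Commit> parents;
        boolean uninteresting;

        Commit(String name, int time, Commit... parents) {
            this.name = name;
            this.time = time;
            this.parents = List.of(parents);
        }
    }

    /**
     * Returns the names of commits reachable from start but not from
     * hidden.  Every commit the walk had to look at is added to seen.
     */
    static List<String> walk(Commit start, Commit hidden, Set<Commit> seen) {
        PriorityQueue<Commit> queue = new PriorityQueue<>(
                Comparator.comparingInt((Commit c) -> c.time).reversed());
        hidden.uninteresting = true;
        if (seen.add(start)) {
            queue.add(start);
        }
        if (seen.add(hidden)) {
            queue.add(hidden);
        }
        List<String> result = new ArrayList<>();
        // Stop once nothing interesting is left in the queue.
        while (queue.stream().anyMatch(c -> !c.uninteresting)) {
            Commit c = queue.poll();
            for (Commit p : c.parents) {
                if (c.uninteresting) {
                    p.uninteresting = true; // the flag flows to ancestors
                }
                if (seen.add(p)) {
                    queue.add(p);
                }
            }
            if (!c.uninteresting) {
                result.add(c.name);
            }
        }
        return result;
    }
}
```

On a linear history c0..c9 with start c9 and hidden c6, the walk returns
c9, c8, c7 and the seen set holds only four commits; c0 through c5 are
never looked at, so they never occupy memory.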

A RevWalk cannot shrink its internal object map.  If you have just
done a huge traversal of, say, all history of the repository, that
will load everything into the object map, and it cannot be released.
If you don't need this data in the near future, it may be a good idea
to throw away the RevWalk and allocate a new one for your next
traversal.  That will let the GC reclaim everything and make it
available for another use.  On the other hand, reusing an existing
object map is much faster than building a new one from scratch.  So
you need to balance the reclaiming of memory against the user's desire
to perform fast updates of an existing repository view.

At some point there isn't much more we can do.  We already use fairly
lightweight RevCommit instances.  I think our current overhead on the
32-bit OpenJDK runtime is 72 bytes per commit if the body has been
disposed, and 84+sizeof(commit_text) bytes if we are retaining the
body.  I'd like to trim this more, but I haven't yet found a solution
I am happy with.

-- 
Shawn.
