Skip to main content

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] [List Home]
Re: [jgit-dev] Making RevCommit object representation more compact in memory

Alex Blewitt <alex.blewitt@xxxxxxxxx> wrote:
> On 8 Oct 2010, at 19:44, Shawn Pearce <spearce@xxxxxxxxxxx> wrote:
> > On Fri, Oct 8, 2010 at 11:21 AM, Alex Blewitt <alex.blewitt@xxxxxxxxx> wrote:
> >> Using a visitor pattern for processing the parents coupled with using a
> >> different subclass (SingleParentRevCommit vs MultiParentRevCommit) would
> >> avoid the instanceof check.
> > 
> > At the time of allocation we don't know the number of parents.
> Is that a limitation of the current implementation or a more
> significant architectural one?

Its a signifcant architectural issue.  When we encounter a reference
to a commit, we allocate its structure immediately, without looking
up details of the commit.  This allows us to have the flag data
for that commit available right away, without needing to pay the
cost of parsing it from a pack file.

Later when we absolutely need to know more about the commit (like who
its parents are) we load that data on-demand.  This on-demand loading
is what allows us to only examine the last ~100 commits when showing
you the most recent 100 changes to the project in the History view.

> > And
> > that number might change anyway due to a path limiter killing a merge.
> > So we can't do that.
> > 
> > ...
> > No.  We rely on reference equality in a lot of the revwalk code.
> > Creating a new instance would break that.
> Does the number of parents increase or the remaining parents change
> value?

I don't think we ever increase the parent count, but the parents
could change to point to another commit, or they could be decreased.

> If not, we could pass a filter argument into the visitor
> with known skippable values, such that the filtered parents never
> show up in the visit. This would also mean nor having to mutate
> the data structure.

True, but the visitor pattern will perform more slowly here.
Its extra method dispatches that we don't do today.  And it inverts
control, which makes a major change to the way applications interact
with us.

> I was just thinking aloud though - if it can't be done, or it's
> more inefficient to do this, never mind.

Its not going to perform better, and there are limitations which
make it hard for us to change from what we have now.

The C Git folks have speculated that another way to handle this is
to assign each object an integer when it is allocated, and then
have look-aside tables that are indexed by the object integer to
store things like the parent data.  This allows us to defer the
parent allocation until later.  It also allows us to discard the
parent data, without discarding other data.  This is useful if the
TreeFilter changes on a RevWalk, we might be able to retain the
original commit graph in memory and just build a different parent
data graph alongside.

Likewise we could move the flag data to be a bitset, each flag is
its own bitset that is indexed by this integer, rather than being
stored inside of the RevObject.  I think this would cause massive
slowdown for performance critical algorithms like object enumeration.
But applications would find it useful to be able to more efficiently
"attach" data to a RevCommit without subclassing it.

Basically the approach would be:

  public class RevCommit {
    private final int annotationId;

    public <T> T get(RevCommitAnnotation<T> a) {
      return a.get(annotationId);

    public <T> void set(RevCommitAnnotation<T> a, T value) {
      a.set(annotationId, value);

    public int getParentCount(RevCommitParents a) {
      return a.getParentCount(annotationId);

    public RevCommit getParent(RevCommitParents a, int nth) {
      return a.getParent(annotationId, nth);

  public class RevCommitAnnotation<T> {
    private T[] annotations;

    T get(int annotationId) {
      if (annotationId < annotations.length)
        return annotations[annotationId];
      return null;

    void set(int annotationId, T t) {
      if (annotationId <= annotations.length) {
        ... grow array to be larger ...
      annotations[annotationId] = t;

  public class RevCommitParents {
    private Object[] parents;

    int getParentCount(int annotationId) {
      Object p = parents[annotationId];
      if (p instanceof RevCommit)
        return 1;
      return ((RevCommit[])p).length;

    RevCommit getParent(int anontationId, int nth) {
      Object p = parents[annotationId];
      if (nth > 1)
        return ((RevCommit[])p)[nth];
      if (p instanceof RevCommit)
        return (RevCommit)p;
      return ((RevCommit[])p)[0];

The nice part of this approach is, additional data members can
be cleanly added to a RevCommit instance by an application at an
ammortized cost of only 1 additional reference per data item added.

The downside is, this is slower.  With proper use of final and
package access we can get a lot of this to inline (or at least
avoid virtual dispatch) in the JIT, but it is an extra array index
operation per data access, which is slower than adding a field
through a subclass.


Back to the top