Re: [jgit-dev] Gerrit clone time takes ages, because of PackWriter.searchForReuse()

On Thursday, October 1, 2020 10:51:19 PM MDT Luca Milanesio wrote:
> Looking at the code called by the searchForReuse, I ended up in:
> 
> 	@Override
> 	public void selectObjectRepresentation(PackWriter packer,
> 			ProgressMonitor monitor, Iterable<ObjectToPack> objects)
> 			throws IOException, MissingObjectException {
> 		for (ObjectToPack otp : objects) {
> 			db.selectObjectRepresentation(packer, otp, this);
> 			monitor.update(1);
> 		}
> 	}
> 
> 
>  The above is a loop over *all* objects, and for each object it goes into
> another scan over *all* packfiles inside selectObjectRepresentation().
> 
> The slow clones were going through 2M objects on a repository with 4k
> packfiles … the math would say it went through a nested cycle of 2M x
> 4k => 8BN operations. I am not surprised it is slow after all :-)

Yes, it is terrible to have 4K pack files (or even 300) in a repo. It clearly 
needs to be repacked (but you knew that)!

> So, it looks like it works the way it is designed: very very slowly.
> 
> My questions on the above are:
> 1. Is there anyone else in the world, using Gerrit or JGit, with the same
> problem? 

Well, I do think most people (especially you) know that a repo with 4K 
packfiles should be expected to perform atrociously! We have certainly 
experienced it, and we avoid it!

> 2. How to disable the search for reuse? (Even if I disable the
> reuseDelta or reuseObjects in the [pack] section of the gerrit.config, the
> searchForReuse() phase is triggered anyway)
> 3. Would it make sense to estimate the combinatorial explosion of the
> phase beforehand (it is simple: just multiply the number of objects by the
> number of packfiles) and automatically disable that phase?
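The estimate in question 3 is indeed cheap to compute. A minimal sketch of such a gate, assuming a hypothetical `ReuseGate` class and an illustrative threshold (neither is an existing JGit setting):

```java
public class ReuseGate {
    // The nested scan costs roughly objectCount * packCount operations,
    // so skip the search-for-reuse phase above a configurable budget.
    // This default is an illustrative assumption, not a JGit default.
    static final long DEFAULT_MAX_OPS = 1_000_000_000L;

    // Return true when the search-for-reuse phase looks affordable.
    static boolean shouldSearchForReuse(long objectCount, long packCount,
            long maxOps) {
        long ops = objectCount * packCount; // e.g. 2M objects x 4k packs = 8BN
        return ops <= maxOps;
    }
}
```

With the numbers from this thread (2M objects, 4k packfiles) the estimate lands at 8 billion operations, well above the example budget, so the phase would be skipped.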

I don't think the search for reuse is technically the problem. I think the 
problem is that the search is not short-circuited when a representation is 
found. If I remember correctly, the loop searches all the possibilities to 
find the best one. So I do believe that some mechanism to short circuit this 
is needed, and not just for the degenerate case of a repo that has not been 
repacked.
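To make the idea concrete, here is a minimal sketch of such a short-circuiting search. The `Candidate` record and `selectWithShortCircuit` method are hypothetical illustrations, not JGit API:

```java
import java.util.List;

public class ReuseSearch {
    // Hypothetical stand-in for one reusable representation found in a pack.
    record Candidate(String packName, boolean isDelta) {}

    // Stop scanning the remaining packfiles as soon as a delta
    // representation is found, instead of always visiting every pack.
    static Candidate selectWithShortCircuit(List<Candidate> candidates) {
        Candidate best = null;
        for (Candidate c : candidates) {
            if (best == null) {
                best = c; // remember the first hit as a fallback
            }
            if (c.isDelta()) {
                return c; // short circuit: a delta is usually good enough
            }
        }
        return best; // no delta anywhere: settle for a whole-object copy
    }
}
```

The point is only that the loop returns early; with 4k packfiles, ending the scan at the first acceptable hit removes most of the 8BN operations.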

We have repos that we call siblings: they share objects via the alternates 
mechanism. They are different copies of the kernel/msm repo, and each copy
points to the other via alternates. When they get repacked, each one
ensures that it has a copy of all the objects it references, and since there 
is a lot of shared history in these repos, the main objects are present in 
many of them. In the past, I measured clones taking on the order of 20% 
longer for each alternate than if there were no alternates. I tracked it 
down to this same problem, so I welcome a solution in this area.

My thought for solving this is to introduce ways to short circuit. Of course, 
short circuiting could lead to subpar performance in some cases too, so it is 
tricky. I would guess that once a delta is found, it would usually make sense 
to just send it. If, however, a non-deltified copy is found first, it might 
still be worth looking a bit further for a deltified one, maybe by a 
configurable amount, one or two more packs?
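That "look a bit further" heuristic could be sketched like this, assuming a hypothetical `Repr` record and `extraLooks` budget (both illustrative, not JGit API):

```java
import java.util.List;

public class DeltaPreference {
    // Hypothetical stand-in for one representation of an object in a pack.
    record Repr(boolean isDelta) {}

    // Return the index of the first delta found. If a whole-object copy is
    // found first, keep looking at most `extraLooks` more packs for a delta
    // before settling for the whole object.
    static int pickIndex(List<Repr> reprs, int extraLooks) {
        int firstWhole = -1;
        int budget = 0; // looks remaining after the first whole-object hit
        for (int i = 0; i < reprs.size(); i++) {
            if (reprs.get(i).isDelta()) {
                return i; // a delta is good enough: send it
            }
            if (firstWhole < 0) {
                firstWhole = i;
                budget = extraLooks;
            } else if (--budget < 0) {
                break; // budget exhausted: settle for the whole object
            }
        }
        return firstWhole;
    }
}
```

With `extraLooks` of one or two, the degenerate case stays bounded while a nearby delta can still be preferred over the first whole-object hit.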

To make short circuiting work well, I believe it would make sense to order the 
packfiles so that fewer searches are likely to be needed before finding an 
object. I have thought about ordering by date: older packfiles are more likely 
to have deltas of more (or most) objects. Similarly, the largest packfiles are 
also more likely to have deltified objects in them, since the largest 
packfiles are likely the ones that have been repacked, whereas small ones are 
likely to be new pushes, which are more likely to have been thickened (which 
removes some deltas) on receipt of the pack. Additionally, it might make sense 
to dynamically sort the packs based on the results of the searches themselves: 
as certain packfiles start to shine as the best candidates, they would get 
searched first. This might help transition well from packfile to packfile 
dynamically, especially with large object counts (clones), or if they have 
"islands" in them.

Lastly, what does git do in this situation? What tradeoffs, if any, does it 
make when deciding which copy of an object to send?

> P.S. I am planning to prepare a patch implementing 3, if we believe it’s
> a good idea to auto-disable the phase.

I look forward to testing out what you come up with,

-Martin

-- 
The Qualcomm Innovation Center, Inc. is a member of Code 
Aurora Forum, hosted by The Linux Foundation
