Re: [jgit-dev] Gerrit clone time takes ages, because of PackWriter.searchForReuse()

Thanks, Martin, for the prompt reply.

> On 1 Oct 2020, at 23:44, Martin Fick <mfick@xxxxxxxxxxxxxx> wrote:
> 
> On Thursday, October 1, 2020 10:51:19 PM MDT Luca Milanesio wrote:
>> Looking at the code called by the searchForReuse, I ended up in:
>> 
>> 	@Override
>> 	public void selectObjectRepresentation(PackWriter packer,
>> 			ProgressMonitor monitor, Iterable<ObjectToPack> objects)
>> 			throws IOException, MissingObjectException {
>> 		for (ObjectToPack otp : objects) {
>> 			db.selectObjectRepresentation(packer, otp, this);
>> 			monitor.update(1);
>> 		}
>> 	}
>> 
>> 
>> The above is a loop over *all* objects, and each iteration then scans
>> *all* packfiles inside selectObjectRepresentation().
>> 
>> The slow clones were going through 2M objects on a repository with 4k
>> packfiles … the math says that is a nested cycle of 2M x 4k => 8B
>> operations. I am not surprised it is slow after all :-)
> 
> Yes, it is terrible to have 4K pack files (or even 300) in a repo. It clearly 
> needs to be repacked (but you knew that)!

Yeah, and the GC of the repo solves the problem … but that takes time (this is a *veeeery big* repo) and during the day pack files are piling up again.
So the problem is really *in between* GC cycles.

> 
>> So, it looks like it works the way it is designed: very very slowly.
>> 
>> My questions on the above are:
>> 1. Is there anyone else in the world, using Gerrit or JGit, with the same
>> problem? 
> 
> Well, I think most people (especially you) know that you should expect a
> repo with 4K pack files to perform atrociously! We have certainly
> experienced it, and we avoid it!

I have to say that *IF* we sort out this problem, then it won't be *SO bad* after all.

> 
>> 2. How to disable the search for reuse? (Even if I disable the
>> reuseDelta or reuseObjects in the [pack] section of the gerrit.config, the
>> searchForReuse() phase is triggered anyway)
>> 3. Would it make sense to estimate the combinatorial explosion of the
>> phase beforehand (it is simple: just multiply the number of objects by
>> the number of packfiles) and automatically disable that phase?
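
For question 3., the estimate really is just one multiplication. A minimal sketch of such a guard (the class name, method, and threshold below are hypothetical illustrations, not existing JGit API or config options):

```java
// Hypothetical sketch: decide whether to skip the search-for-reuse phase
// based on the estimated number of object-by-pack probes. The names and
// the cut-off value are illustrative, not actual JGit options.
public class ReuseSearchEstimator {

    // Assumed cut-off: above this many probes, skip the phase entirely.
    private static final long MAX_SEARCH_OPERATIONS = 100_000_000L;

    static boolean shouldSearchForReuse(long objectCount, long packCount) {
        // objectCount * packCount approximates the nested-loop cost of
        // calling selectObjectRepresentation() on every pack per object.
        return objectCount * packCount <= MAX_SEARCH_OPERATIONS;
    }

    public static void main(String[] args) {
        // 2M objects x 4k packs = 8e9 probes, far above the cut-off.
        System.out.println(shouldSearchForReuse(2_000_000L, 4_000L)); // false
        // A freshly repacked repo with a single pack stays cheap.
        System.out.println(shouldSearchForReuse(2_000_000L, 1L)); // true
    }
}
```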
> 
> I don't think the search for reuse is technically the problem. I think the
> problem is not short-circuiting the search when a candidate is found. If I
> remember correctly, the loop searches through all the possibilities to find
> the best one. So I do believe that some mechanism to short-circuit this is needed,
> not just for the degenerate case of a repo that has not been repacked.
> 
> We have repos that we call siblings; they share objects via the alternatives
> mechanism. They are different copies of the kernel/msm repo. Each copy
> points to the others via alternatives. When they get repacked, each one
> ensures that it has a copy of all the objects it references, and since there
> is a lot of shared history in these repos, the main objects are in many of
> them. In the past, I measured clones taking on the order of 20%
> longer for each alternative than if there were no alternatives. I tracked it
> down to this same problem. I welcome a solution in this area.
> 
> My thoughts on solving this are to introduce ways to short-circuit. Of course,
> short-circuiting could lead to subpar performance in some cases too, so it is
> tricky. I would guess that once a delta is found, it would usually make sense
> to just send it. If, however, a non-deltified copy is found, it might still be
> worth looking a bit further for a deltified one, maybe a configurable amount,
> one or two?
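
That short-circuit idea could be sketched roughly like this; the `Representation` type, the pack list, and the extra-probe budget are simplified stand-ins for illustration, not JGit's real classes:

```java
import java.util.List;

// Hedged sketch: stop as soon as a delta representation is found; if only
// a whole-object copy turns up first, probe at most a configurable number
// of additional packs before settling for it.
public class ShortCircuitSearch {

    enum Kind { DELTA, WHOLE }

    record Representation(Kind kind, String packName) { }

    static Representation select(List<Representation> candidates,
            int extraProbesAfterWhole) {
        Representation best = null;
        int probesLeft = -1; // budget not started until a whole copy is seen
        for (Representation r : candidates) {
            if (r.kind() == Kind.DELTA) {
                return r; // a delta is usually good enough: stop immediately
            }
            if (best == null) {
                best = r; // remember the first whole copy found
                probesLeft = extraProbesAfterWhole; // start the probe budget
            } else if (--probesLeft <= 0) {
                break; // budget exhausted, settle for the whole copy
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Representation> packs = List.of(
                new Representation(Kind.WHOLE, "pack-1"),
                new Representation(Kind.WHOLE, "pack-2"),
                new Representation(Kind.DELTA, "pack-3"));
        // With a budget of 2 extra probes, the delta in pack-3 is reached.
        System.out.println(select(packs, 2).packName()); // pack-3
        // With a budget of 1, the search settles for the whole copy.
        System.out.println(select(packs, 1).packName()); // pack-1
    }
}
```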
> 
> To make short-circuiting work well, I believe it would make sense to order the
> packfiles so that fewer searches are likely to be needed before finding
> an object. I have thought about ordering by date: older packfiles are
> more likely to have deltas of more (or most) objects. Similarly, the largest
> packfiles are also more likely to have deltified objects in them, since the
> largest packfiles are likely the ones that have been repacked, whereas small
> ones are likely to be new pushes, which are more likely to have been thickened
> (which removes some deltas) on receipt of the pack. Additionally, it might make
> sense to dynamically sort the packs based on the results of the searches
> themselves. As certain packfiles start to shine as the best candidates, they
> would get searched first. This might help transition smoothly from packfile to
> packfile, especially with large object counts (clones), or if they
> have "islands" in them.
> 
> Lastly, what does git do in this situation? What tradeoffs, if any, does it 
> make when deciding which copy of an object to send?

Good point, let me repeat the tests with the latest and greatest C-based Git implementation.

> 
>> P.S. I am planning to prepare a patch implementing 3., if we believe it's
>> a good idea to auto-disable the phase.
> 
> I look forward to testing out what you come up with,

+1

Luca.

> 
> -Martin
> 
> -- 
> The Qualcomm Innovation Center, Inc. is a member of Code 
> Aurora Forum, hosted by The Linux Foundation
> 


