Re: [jgit-dev] Gerrit clone time takes ages, because of PackWriter.searchForReuse()
On Thursday, October 1, 2020 10:51:19 PM MDT Luca Milanesio wrote:
> Looking at the code called by the searchForReuse, I ended up in:
>
> @Override
> public void selectObjectRepresentation(PackWriter packer,
> 		ProgressMonitor monitor, Iterable<ObjectToPack> objects)
> 		throws IOException, MissingObjectException {
> 	for (ObjectToPack otp : objects) {
> 		db.selectObjectRepresentation(packer, otp, this);
> 		monitor.update(1);
> 	}
> }
>
>
> The above is a loop over *all* objects, and each iteration goes into another
> scan of *all* packfiles inside selectObjectRepresentation().
>
> The slow clones were going through 2M objects on a repository with 4k
> packfiles … the math says that is a nested loop of 2M x 4k => 8BN
> operations. I am not surprised it is slow after all :-)
Yes, it is terrible to have 4K pack files (or even 300) in a repo. It clearly
needs to be repacked (but you knew that)!
> So, it looks like it works the way it is designed: very very slowly.
>
> My questions on the above are:
> 1. Is there anyone else in the world, using Gerrit or JGit, with the same
> problem?
Well, I do think most people (especially you, now) know that you should expect
a repo with 4K packfiles to perform atrociously! We have certainly experienced
it, and we avoid it!
> 2. How to disable the search for reuse? (Even if I disable the
> reuseDelta or reuseObjects in the [pack] section of the gerrit.config, the
> searchForReuse() phase is triggered anyway)
> 3. Would it make sense to estimate the combinatorial explosion of the phase
> beforehand (it is simple: just multiply the number of objects by the number
> of packfiles) and automatically disable that phase?
I don't think the search for reuse is technically the problem. I think the
problem is not short-circuiting the search when one is found. If I remember
correctly, the loop searches through all the possibilities to attempt to find
the best one. So I do believe that some mechanism to short-circuit this is
needed, not just for the degenerate case of a repo that has not been repacked.
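To illustrate the cost difference, here is a minimal, self-contained sketch. The `Pack`, `Representation`, and probe-counting names are invented for this example and are not JGit's actual classes; the point is only that probing every pack for every object costs objects x packs lookups, while stopping at the first hit does not:

```java
import java.util.List;
import java.util.Optional;

// Hypothetical stand-ins for JGit's pack machinery, for illustration only.
public class ReuseSearchSketch {
	record Representation(boolean isDelta) {}

	// A "pack" that may or may not contain a representation of an object.
	interface Pack {
		Optional<Representation> find(String objectId);
	}

	static int probes = 0;

	// Exhaustive behaviour (simplified): probe every pack, keep the best hit.
	static Optional<Representation> exhaustive(String id, List<Pack> packs) {
		Optional<Representation> best = Optional.empty();
		for (Pack p : packs) {
			probes++;
			Optional<Representation> r = p.find(id);
			if (r.isPresent() && (best.isEmpty() || r.get().isDelta())) {
				best = r; // prefer a delta, but keep scanning regardless
			}
		}
		return best;
	}

	// Short-circuit variant: return on the first hit.
	static Optional<Representation> firstHit(String id, List<Pack> packs) {
		for (Pack p : packs) {
			probes++;
			Optional<Representation> r = p.find(id);
			if (r.isPresent()) {
				return r;
			}
		}
		return Optional.empty();
	}
}
```

With 2M objects and 4k packs, the exhaustive loop always pays 2M x 4k probes; the short-circuit variant pays only as many probes per object as it takes to reach a pack that holds it.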
We have repos that we call siblings; they share objects via the alternates
mechanism. They are different copies of the kernel/msm repo, and each copy
points to the other via alternates. When they get repacked, each one ensures
that it has a copy of all the objects it references, and since there is a lot
of shared history in these repos, the main objects are present in many of
them. In the past, I measured clones taking on the order of 20% longer for
each alternate than if there were no alternates. I tracked it down to this
same problem, so I welcome a solution in this area.
My thought for solving this is to introduce ways to short circuit. Of course,
short circuiting could lead to subpar performance in some cases too, so it is
tricky. I would guess that once a delta is found, it would usually make sense
to just send it. If, however, a non-deltified copy is found, it might still be
worth looking a bit further for a deltified one, maybe a configurable number
of extra packs, one or two?
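The stopping rule above could be sketched roughly like this. Again, `Pack`, `Representation`, `select`, and `extraProbes` are invented names for illustration, not JGit API: return immediately on a delta; when only a whole object has been found, probe at most a configurable number of further packs before settling for it:

```java
import java.util.List;
import java.util.Optional;

// Illustrative sketch of a possible stopping rule; names are invented.
public class StoppingRuleSketch {
	record Representation(boolean isDelta) {}

	interface Pack {
		Optional<Representation> find(String objectId);
	}

	static Optional<Representation> select(String id, List<Pack> packs,
			int extraProbes) {
		Optional<Representation> whole = Optional.empty();
		int remaining = -1; // countdown, armed once a whole object is seen
		for (Pack p : packs) {
			Optional<Representation> r = p.find(id);
			if (r.isPresent()) {
				if (r.get().isDelta()) {
					return r; // a delta is usually good enough: send it
				}
				if (whole.isEmpty()) {
					whole = r;
					remaining = extraProbes; // look a bit further for a delta
				}
			}
			if (remaining == 0) {
				break;
			}
			if (remaining > 0) {
				remaining--;
			}
		}
		return whole; // fall back to the whole-object copy, if any
	}
}
```

With `extraProbes` set to one or two, the search stays bounded even on a badly fragmented repo, at the cost of occasionally sending a whole object when a delta existed in a later pack.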
To make short circuiting work well, I believe it would make sense to order the
packfiles so that fewer searches are likely to be needed before finding an
object. I have thought about ordering based on dates: older packfiles are more
likely to have deltas of more (or most) objects. Similarly, the largest
packfiles are also more likely to have deltified objects in them, since the
largest packfiles are likely the ones that have been repacked, whereas small
ones are likely to be new pushes, which are more likely to have been thickened
(which removes some deltas) on receipt of the pack. Additionally, it might
make sense to dynamically sort the packs based on the results of the searches
themselves: as certain packfiles start to shine as the best candidates, they
would get searched first. This might help the search transition well from
packfile to packfile dynamically, especially with large object counts
(clones), or if the packs have "islands" in them.
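The dynamic-ordering idea could be as simple as a per-pack hit counter that periodically re-sorts the search order. A minimal sketch, with invented names (this is not JGit code):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch: track which packs keep winning the representation
// search, and probe those packs first on subsequent lookups.
public class AdaptivePackOrder {
	private final List<String> packs; // pack names, in search order
	private final Map<String, Integer> hits = new HashMap<>();

	AdaptivePackOrder(List<String> packs) {
		this.packs = new ArrayList<>(packs);
	}

	// Record that `pack` supplied the chosen representation.
	void recordHit(String pack) {
		hits.merge(pack, 1, Integer::sum);
	}

	// Re-sort: most frequently winning packs first (stable for ties).
	List<String> searchOrder() {
		packs.sort(Comparator.comparingInt(
				(String p) -> hits.getOrDefault(p, 0)).reversed());
		return packs;
	}
}
```

Re-sorting after every N objects, rather than every lookup, would keep the bookkeeping cheap while still letting the order track locality as a clone walks through history.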
Lastly, what does git do in this situation? What tradeoffs, if any, does it
make when deciding which copy of an object to send?
> P.S. I am planning to prepare a patch for implementing 3. If we believe it’s
> a good idea to auto-disable the phase.
I look forward to testing out what you come up with,
-Martin
--
The Qualcomm Innovation Center, Inc. is a member of Code
Aurora Forum, hosted by The Linux Foundation