
Re: [jgit-dev] Gerrit clone time takes ages, because of PackWriter.searchForReuse()



On 2 Oct 2020, at 22:24, Terry Parker <tparker@xxxxxxxxxx> wrote:



On Thu, Oct 1, 2020 at 4:05 PM Martin Fick <mfick@xxxxxxxxxxxxxx> wrote:
On Thursday, October 1, 2020 4:51:46 PM MDT Martin Fick wrote:
> On Thursday, October 1, 2020 4:44:22 PM MDT Martin Fick wrote:
> > On Thursday, October 1, 2020 10:51:19 PM MDT Luca Milanesio wrote:
> > >  The above is a loop over *all* objects that goes into another scan
> > >  of *all* packfiles inside selectObjectRepresentation().
> 
> Also, check the jgit DFS implementation; I seem to recall that it might have
> done something a bit better here?

If I remember correctly, it stores something about the size of the objects
during the initial walking phase so that it does not need to re-find that data
during selectObjectRepresentation() (I think I was hacking a solution to do
that and noticed that the DFS version had already done the same thing).
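
(For illustration only, and not the actual DFS code: a minimal sketch of that idea, with hypothetical names, where the walking/counting phase records each object's pack, offset and packed size once, so the reuse-selection phase can look it up instead of probing every packfile again.)

import java.util.HashMap;
import java.util.Map;

import org.eclipse.jgit.lib.AnyObjectId;
import org.eclipse.jgit.lib.ObjectId;

/**
 * Hypothetical sketch: remember where each object was found during the
 * initial object walk, so the reuse-selection phase can look it up in O(1)
 * instead of probing every packfile again. All names are illustrative.
 */
class ObjectLocationCache {
	/** Where an object lives: which pack, at what offset, and its packed size. */
	static final class PackLocation {
		final String packName;
		final long offset;
		final long packedSize;

		PackLocation(String packName, long offset, long packedSize) {
			this.packName = packName;
			this.offset = offset;
			this.packedSize = packedSize;
		}
	}

	private final Map<ObjectId, PackLocation> cache = new HashMap<>();

	/** Called once per object while walking/counting. */
	void remember(AnyObjectId id, String packName, long offset, long packedSize) {
		cache.put(id.copy(), new PackLocation(packName, offset, packedSize));
	}

	/** Called from the reuse-selection phase instead of re-scanning all packs. */
	PackLocation lookup(AnyObjectId id) {
		return cache.get(id.toObjectId());
	}
}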

In DfsPackCompactor, searchForReuse shows up as the scalability bottleneck
when the pack count grows, but we are aggressive about compacting after
nearly every push. That workaround has kept it from being too urgent. It shows
up in GC too but generally takes less time than building reachability bitmaps.
It only shows up in UploadPack when our compaction or GC background
jobs get wedged, and even then we push back with TOO_MANY_PACKS
service errors at 2.5k packs, so we get paged and fix things before read
latencies get too bad.

One thing that may help is ignoring duplicated content that a client sends
to the server (as a follow-on to the "goodput" measurement). The idea is that
when a server receives a pack from a client and realizes it contains objects
the server already has, we can drop those objects from the received pack's
index (making them invisible) and the next GC or compaction will drop them
and reclaim the storage.

I'm not sure if that solves all of the searchForReuse inefficiencies, but it
should help.
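
(A rough sketch of that idea, again with hypothetical names: assuming the receive path already has the list of object ids found in the incoming pack, only the genuinely new ones would get indexed, leaving the duplicates invisible until the next GC or compaction reclaims their storage.)

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.eclipse.jgit.lib.ObjectId;
import org.eclipse.jgit.lib.ObjectReader;
import org.eclipse.jgit.lib.Repository;

/**
 * Hypothetical sketch: given the object ids found in a freshly received
 * pack, keep only those the repository does not already have. A receive
 * path could index just these, leaving the duplicates invisible so a later
 * GC or compaction reclaims their storage.
 */
final class DuplicateObjectFilter {

	/** Returns the subset of received ids that is new to the repository. */
	static List<ObjectId> onlyNewObjects(Repository repo, List<ObjectId> received)
			throws IOException {
		List<ObjectId> fresh = new ArrayList<>();
		try (ObjectReader reader = repo.newObjectReader()) {
			for (ObjectId id : received) {
				// Anything the server already stores is pure duplication.
				if (!reader.has(id)) {
					fresh.add(id);
				}
			}
		}
		return fresh;
	}
}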

The Gerrit workflow of creating new refs/changes branches from a client
that may not have been synced for a few days, plus the lack of negotiation
in push, makes for this conversation:
client: hello server!
server: here are sha-1s for my current refs (including active branches 
updated in the last N minutes)
client: I synced two days ago. I don't recognize most of those sha-1s. You
didn't mention the parent commit for my new change, which was the tip of
the 'main' branch when I started working on it. I see a sha-1 for the build we
tagged last week. Let me send you all of history from that sha-1 to my
shiny new change.
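
(A small JGit-flavoured sketch of what the client effectively computes in that last step, assuming the only advertised tip it recognized is last week's tag: everything reachable from the new change but not from a recognized tip gets packed and sent, duplicates and all.)

import java.io.IOException;
import java.util.Collection;

import org.eclipse.jgit.lib.ObjectId;
import org.eclipse.jgit.lib.Repository;
import org.eclipse.jgit.revwalk.RevCommit;
import org.eclipse.jgit.revwalk.RevWalk;

/**
 * Sketch: count the commits a push would send, i.e. commits reachable from
 * the new change but not from any tip the client recognized in the ref
 * advertisement. With only an old tag recognized, this covers days of
 * history the server already has.
 */
final class PushFrontierSketch {

	static int countCommitsToSend(Repository repo, ObjectId newChange,
			Collection<ObjectId> recognizedTips) throws IOException {
		int count = 0;
		try (RevWalk walk = new RevWalk(repo)) {
			walk.markStart(walk.parseCommit(newChange));
			for (ObjectId tip : recognizedTips) {
				// Commits reachable from a recognized tip are excluded...
				walk.markUninteresting(walk.parseCommit(tip));
			}
			for (RevCommit c : walk) {
				// ...everything else gets packed and sent.
				count++;
			}
		}
		return count;
	}
}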

Oh yes, that makes things clearer and yes, it is worrying :-(

In my local tests last night, I managed to create a 10GB repository very quickly, which after a gc ended up at just over 200MB.
That’s a symptom of the huge duplication happening.


The situation is bad. Those goodput metrics show that for active Gerrit
repos, over 90% of the data sent in pushes is data the server already has.
Negotiation in push is the best way to solve it, but until that is widely
available in clients (and even then, to deal with cases where
negotiation breaks down), I think ignoring duplicate objects at the
point where the server receives them will make life a lot better.
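
(For concreteness, a hypothetical goodput-style ratio, not the actual measurement code: the fraction of received pack bytes that were genuinely new to the server. The 90%+ duplication above corresponds to a ratio of roughly 0.1 or less.)

/**
 * Hypothetical "goodput" metric: how much of what the client uploaded was
 * data the server did not already have.
 */
final class Goodput {

	/** Returns a value in [0, 1]; 1.0 means every received byte was new. */
	static double ratio(long newObjectBytes, long totalReceivedBytes) {
		if (totalReceivedBytes <= 0) {
			return 1.0; // nothing received, nothing duplicated
		}
		return (double) newObjectBytes / (double) totalReceivedBytes;
	}
}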

Yeah, agreed.

Thanks for the detailed explanation.
Luca.


-Martin


-- 
The Qualcomm Innovation Center, Inc. is a member of Code 
Aurora Forum, hosted by The Linux Foundation

_______________________________________________
jgit-dev mailing list
jgit-dev@xxxxxxxxxxx
To unsubscribe from this list, visit https://www.eclipse.org/mailman/listinfo/jgit-dev

