Skip to main content

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] [List Home]
Re: [jgit-dev] stable-4.1 fixes

Have you contributed the revert & workaround?

Luca

Sent from my iPhone

On 11 Apr 2018, at 01:08, Hugo Ares <hugo.ares@xxxxxxxxxxxx> wrote:

On Wed, 2018-04-11 at 01:27 +0200, Matthias Sohn wrote:
On Wed, Apr 11, 2018 at 1:00 AM, Nasser Grainawi <nasser@xxxxxxxxxxxxxx> wrote:

On Apr 10, 2018, at 4:50 PM, Matthias Sohn <matthias.sohn@xxxxxxxxx> wrote:

On Wed, Apr 11, 2018 at 12:45 AM, Nasser Grainawi <nasser@codeaurora.org> wrote:

On Apr 10, 2018, at 4:25 PM, Matthias Sohn <matthias.sohn@xxxxxxxxx> wrote:

On Tue, Apr 10, 2018 at 10:36 PM, Nasser Grainawi <nasser@xxxxxxxxxxxxxg> wrote:

On Mar 29, 2018, at 10:21 AM, Nasser Grainawi <nasser@xxxxxxxxxxxxxx> wrote:


On Mar 24, 2018, at 1:45 PM, Matthias Sohn <matthias.sohn@xxxxxxxxx> wrote:

On Sat, Mar 24, 2018 at 1:32 AM, David Pursehouse <david.pursehouse@gmail.com> wrote:


On Fri, 23 Mar 2018, 23:01 Matthias Sohn, <matthias.sohn@xxxxxxxxx> wrote:
On Fri, Mar 23, 2018 at 9:52 PM, Nasser Grainawi <nasser@xxxxxxxxxxxxxg> wrote:
Hey folks,

I'm doing some work to stabilize the Gerrit 2.12 release series that we're running in several places still internally. As part of that, I found a few JGit commits that I'd like to backport to the stable-4.1 branch.

You can see the fixes as the latest 12 commits on https://source.codeaurora.org/quic/gerrit4msm/jgit/log/?h=stable-4.1-backports



I cherry-picked the patches you want and pushed them for stable-4.1
 
Please review this series and let me know if you really want 4.1.3 or if you can use 4.5.4 like we do

I'll test out 4.5.4 today and let you know mid-next week. Thanks for telling me about that version and that you have used it in prod with 2.12, that really helps.


Sorry for the delay getting back to you. We tried 4.5.4 and it works pretty well. I found one NFS-related bug and uploaded a fix [1]. We're also seeing MissingObjectExceptions that we think are NFS related, but we're not certain. The setup we have for this instance is a master replicating over git:// to git-daemon through a load-balancer to two slaves sharing an NFS filesystem. The slaves are running a custom git.git (has a few NFS error-avoidance patches) for repacking.



did I get it right that the master also accesses git repositories via NFS ?

No, master has its own disk.


 So jgit running on master is victim to NFS issues on a slave when trying to replicate changes to the slave ?

No, sorry. We're only seeing these errors on the slaves. AFAIK, replication is working fine.


Ah, ok so this is UploadPack failure on slave when some client fetches from it and client
wants a commit not available on slave or slave doesn't see it yet since it arrived via replication
from master to the other slave. This sounds similar to issues we saw in concurrent push tests
to a multi-master setup running on NFS. We introduced new options supportsatomicfilecreation
and also setting trustfolderstat to false might help, though Hugo reported massive performance problems.

listing folder content of parent folder a given commit resides in instead of setting trustfolderstat to false could help.
I think Hugo said this works and is faster than setting trustfolderstat

On our side, all repositories are configured wiht trustfolderstat to false. We had that for a long time and this setting was only used
when listing the pack files. Without this setting, we used to have a lot of problems on slaves reading outdated stuff because of NFS.
When we deployed this change[1] which make this setting effective for packed-refs we had such big performance degradation that
it caused an outage.

We reverted that change internally and instead , based on Christian feedback, we simply list the content of the parent folder of the
packed-refs and this seems to be enough to not get an outdated version of the packed-refs from NFS.





Why do you use slaves sharing the same NFS ?

Mostly because of less replication traffic and the slaves share the effort of repacking, but there's added benefits like knowing the slaves have a consistent view (one can't get out of date). With just 2 slaves it's not as big of a deal, but we have other instances with 8-10 slaves in a site.

 
This is what the exceptions look like (jgit code is the same as [1]):
[2018-04-06 10:42:38,524] [SSH git-upload-pack 'XXXXXXX' (YYYYYYY)] ERROR com.google.gerrit.sshd.BaseCommand : Internal server error (user YYYYYY account ZZZZZZ) during git-upload-pack 'XXXXXXX'
org.eclipse.jgit.transport.UploadPackInternalServerErrorException
        at org.eclipse.jgit.transport.UploadPack.service(UploadPack.java:768)
        at org.eclipse.jgit.transport.UploadPack.upload(UploadPack.java:667)
        at com.google.gerrit.sshd.commands.Upload.runImpl(Upload.java:80)
        at com.google.gerrit.sshd.AbstractGitCommand.service(AbstractGitCommand.java:101)
        at com.google.gerrit.sshd.AbstractGitCommand.access$000(AbstractGitCommand.java:32)
        at com.google.gerrit.sshd.AbstractGitCommand$1.run(AbstractGitCommand.java:70)
        at com.google.gerrit.sshd.BaseCommand$TaskThunk.run(BaseCommand.java:437)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
        at com.google.gerrit.server.git.WorkQueue$Task.run(WorkQueue.java:376)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.eclipse.jgit.transport.WantNotValidException: want 56c76057710dd3ca29e66ef6cdc4b727be781384 not valid
        at org.eclipse.jgit.transport.UploadPack.parseWants(UploadPack.java:1197)
        at org.eclipse.jgit.transport.UploadPack.processHaveLines(UploadPack.java:1072)
        at org.eclipse.jgit.transport.UploadPack.negotiate(UploadPack.java:1052)
        at org.eclipse.jgit.transport.UploadPack.service(UploadPack.java:745)
        ... 14 more
Caused by: org.eclipse.jgit.errors.MissingObjectException: Missing unknown 56c76057710dd3ca29e66ef6cdc4b727be781384
        at org.eclipse.jgit.internal.storage.file.WindowCursor.open(WindowCursor.java:158)
        at org.eclipse.jgit.lib.ObjectReader$1.open(ObjectReader.java:300)
        at org.eclipse.jgit.revwalk.RevWalk$2.next(RevWalk.java:971)
        at org.eclipse.jgit.transport.UploadPack.parseWants(UploadPack.java:1184)
        ... 17 more

We're assuming this error happens due to NFS caching directory entries. We're looking at testing that out by changing the mount options (lookupcache, noac, ac*time), but these are prod hosts, so we don't want to hurt the performance too much. Do you have any ideas on how we'd approach solving this in JGit or a different hypothesis on what could be causing this error?



AFAIR I never saw a WantNotValidException

@Chris you did a lot of testing on NFS when we implemented multi-master on shared NFS, maybe you have an idea ?

-Matthias

-- 
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, 
a Linux Foundation Collaborative Project


-- 
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, 
a Linux Foundation Collaborative Project



_______________________________________________
jgit-dev mailing list
jgit-dev@xxxxxxxxxxx
To change your delivery options, retrieve your password, or unsubscribe from this list, visit
https://dev.eclipse.org/mailman/listinfo/jgit-dev
_______________________________________________
jgit-dev mailing list
jgit-dev@xxxxxxxxxxx
To change your delivery options, retrieve your password, or unsubscribe from this list, visit
https://dev.eclipse.org/mailman/listinfo/jgit-dev

Back to the top