Eclipse Community Forums
Eclipse Projects » EGit » Performance issue with large files (core.streamFileThreshold not working?)
Performance issue with large files [message #754854] Fri, 04 November 2011 12:01
Chris Lee
Messages: 6
Registered: November 2011
Junior Member
Hello,

We're in the process of converting from SVN to Git, and are running into a performance issue when cloning the repository with Eclipse/EGit. The clone appears to hang indefinitely.

From various research and the stack trace below, the issue appears to be related to large files in the repository. The specific file it is stuck on is a 48M (uncompressed) SQL script (there are subsequent files that are slightly larger, up to 72M uncompressed).

Have tried setting core.streamFileThreshold to various values (100m, 200m and finally 2047m) - in each case the behaviour is unchanged.

Have been unable to locate any information on how to correctly set core.streamFileThreshold through Eclipse; have been testing with it added to Team->Git->Configuration->User Settings. Adding it to the System Settings doesn't seem to take (on redisplay the value is not present).

Current test configuration (after trying many other permutations without success) is Eclipse 3.7.1 w/ EGit 1.1.0.201109151100-r on Windows 7, with Sun 64-bit JVM, 3G heap.


"Worker-8" prio=6 tid=0x000000000963f800 nid=0x12e8 runnable [0x00000000101af000]
   java.lang.Thread.State: RUNNABLE
    at java.util.zip.Inflater.inflateBytes(Native Method)
    at java.util.zip.Inflater.inflate(Inflater.java:238)
    - locked <0x0000000755f0a370> (a java.util.zip.ZStreamRef)
    at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:135)
    at java.util.zip.InflaterInputStream.skip(InflaterInputStream.java:191)
    at java.io.BufferedInputStream.skip(BufferedInputStream.java:349)
    - locked <0x00000007ec7b3b98> (a java.io.BufferedInputStream)
    at org.eclipse.jgit.lib.ObjectStream$Filter.skip(ObjectStream.java:199)
    at org.eclipse.jgit.util.IO.skipFully(IO.java:244)
    at org.eclipse.jgit.storage.pack.DeltaStream.seekBase(DeltaStream.java:339)
    at org.eclipse.jgit.storage.pack.DeltaStream.read(DeltaStream.java:213)
    at org.eclipse.jgit.storage.pack.DeltaStream.read(DeltaStream.java:214)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
    - locked <0x000000075657a338> (a java.io.BufferedInputStream)
    at org.eclipse.jgit.util.io.TeeInputStream.read(TeeInputStream.java:111)
    at org.eclipse.jgit.lib.ObjectStream$Filter.read(ObjectStream.java:209)
    at java.io.InputStream.read(InputStream.java:82)
    at org.eclipse.jgit.lib.ObjectLoader.copyTo(ObjectLoader.java:254)
    at org.eclipse.jgit.dircache.DirCacheCheckout.checkoutEntry(DirCacheCheckout.java:936)
    at org.eclipse.jgit.dircache.DirCacheCheckout.doCheckout(DirCacheCheckout.java:447)
    at org.eclipse.jgit.dircache.DirCacheCheckout.checkout(DirCacheCheckout.java:380)
    at org.eclipse.jgit.api.CloneCommand.checkout(CloneCommand.java:225)
    at org.eclipse.jgit.api.CloneCommand.call(CloneCommand.java:120)
    at org.eclipse.egit.core.op.CloneOperation.run(CloneOperation.java:142)
    at org.eclipse.egit.ui.internal.clone.GitCloneWizard.executeCloneOperation(GitCloneWizard.java:306)
    at org.eclipse.egit.ui.internal.clone.GitCloneWizard.access$3(GitCloneWizard.java:299)
    at org.eclipse.egit.ui.internal.clone.GitCloneWizard$5.run(GitCloneWizard.java:278)
    at org.eclipse.core.internal.jobs.Worker.run(Worker.java:54)
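
The top frames of this trace (InflaterInputStream.skip calling InflaterInputStream.read, reached from DeltaStream.seekBase via IO.skipFully) suggest that moving forward inside a compressed object is done by inflating the skipped data and throwing it away. The following is a minimal sketch of that cost using only java.util.zip. It is not JGit code, and the names SkipCostDemo, compress and inflatedBytesForSkip are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.Inflater;
import java.util.zip.InflaterInputStream;

// Hypothetical demo class; it relies only on standard java.util.zip APIs.
final class SkipCostDemo {

    // Deflate `size` bytes of repetitive filler data into a byte array.
    static byte[] compress(int size) {
        byte[] raw = new byte[size];
        for (int i = 0; i < size; i++) {
            raw[i] = (byte) ('a' + (i % 26));
        }
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (DeflaterOutputStream out = new DeflaterOutputStream(bos)) {
            out.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    // Skip `offset` bytes of a compressed stream and report how many
    // uncompressed bytes the Inflater had to produce to get there.
    static long inflatedBytesForSkip(byte[] compressed, long offset) {
        Inflater inflater = new Inflater();
        try (InflaterInputStream in =
                new InflaterInputStream(new ByteArrayInputStream(compressed), inflater)) {
            long remaining = offset;
            while (remaining > 0) {
                long skipped = in.skip(remaining);
                if (skipped <= 0) {
                    break; // end of stream reached before the offset
                }
                remaining -= skipped;
            }
            return inflater.getBytesWritten();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) {
        byte[] compressed = compress(1 << 20);
        long offset = 500_000;
        System.out.println("bytes inflated to skip " + offset + ": "
                + inflatedBytesForSkip(compressed, offset));
    }
}
```

In this sketch the work done by a skip grows with the distance skipped, so a reader that has to reposition many times within a large compressed object pays that cost each time.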

Re: Performance issue with large files [message #755100 is a reply to message #754854] Mon, 07 November 2011 06:38
Christian Halstrick
Messages: 18
Registered: July 2009
Junior Member
I tried to reproduce this with a simple repo containing a 70MB and a 50MB text file. I was able to clone it without problems over the git protocol (I cloned git://localhost:xxx/yyy.git). Which protocol are you cloning with (http, git, ssh)? Is it possible to publish the repo URL here?
Re: Performance issue with large files [message #755151 is a reply to message #755100] Mon, 07 November 2011 09:34
Chris Lee
The clone is done over SSH, though that doesn't appear to be related - can see the .git directory (pack file - ~647M) being downloaded and completed. It's when the working copy is being populated and this file is encountered that the issue arises.

The equivalent operation on the Linux server itself goes fast, though it benefits from direct local access:

[git@devtools02 foo]$ time git clone emr
Cloning into emr...
done.

real 0m5.967s
user 0m3.779s
sys 0m2.040s

...the problematic file is there (size: 50126474 bytes)

The file has no history itself. It was moved at some point in the repo (from an active to archive folder), though this was done in CVS so there is no explicit history of the move.

When migrating the repo from SVN, we do a final 'git repack -f -a -d --depth=250 --window=250' to generate a nice pack file.

The repo is private (internal) and can't be made public; will see about pulling this single file out for testing.
Re: Performance issue with large files [message #755203 is a reply to message #755151] Mon, 07 November 2011 11:31
Christian Halstrick
Messages: 100
Registered: July 2009
Senior Member
The stack trace you sent also points to the code where we check out what we have cloned. It seems to be the writing into the working tree that is very slow in your case. On which platform are you syncing? Could it be because of virus scanners or compressed/encrypted folders? How long does cloning http://egit.eclipse.org/r/p/egit take on your machine?


Ciao
Chris
Re: Performance issue with large files [message #755245 is a reply to message #755203] Mon, 07 November 2011 13:54
Chris Lee
This is on Windows 7 64-bit. Not likely the virus scanner - no other issues, and can see all the other workspace files being created. This specific file crawls along painfully slowly - it took ~6 hours to get 3M of it written. Looking at the JGit code, it uses isLargeFile() to determine which path to take - a byte array in memory (for small files) or a streaming/incremental method (the one in question here). isLargeFile() uses the core.streamFileThreshold setting (default: 50M), though it isn't clear where (if at all) this setting would be specified in Eclipse to pass to JGit.

Checked out the EGit repo in ~ 6 minutes - ~5 minutes to download the repo (pack files, etc) and ~ 1 minute for the working copy.
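
The size-based branch described in this post can be sketched roughly as follows. This is a simplified illustration, not the JGit source: the names ThresholdSketch, parseSize and isLarge are hypothetical, the k/m/g suffixes are read as powers of 1024 in the usual Git config style, and whether the real comparison is > or >= is an assumption.

```java
// Hypothetical sketch of a size threshold check; not taken from JGit.
final class ThresholdSketch {

    // Convert a Git-config-style size such as "50m" or "2047m" to bytes.
    static long parseSize(String value) {
        String v = value.trim().toLowerCase();
        if (v.isEmpty()) {
            throw new IllegalArgumentException("empty size value");
        }
        long multiplier = 1L;
        switch (v.charAt(v.length() - 1)) {
            case 'k': multiplier = 1024L; break;
            case 'm': multiplier = 1024L * 1024L; break;
            case 'g': multiplier = 1024L * 1024L * 1024L; break;
            default: break; // no suffix: plain byte count
        }
        String digits = (multiplier == 1L) ? v : v.substring(0, v.length() - 1);
        return Long.parseLong(digits.trim()) * multiplier;
    }

    // Assumed rule: objects above the threshold take the streaming path,
    // everything else is loaded into a byte array.
    static boolean isLarge(long objectSizeBytes, long thresholdBytes) {
        return objectSizeBytes > thresholdBytes;
    }

    public static void main(String[] args) {
        long object72MiB = 72L * 1024L * 1024L;
        for (String setting : new String[] {"50m", "100m", "200m", "2047m"}) {
            long threshold = parseSize(setting);
            System.out.println(setting + " = " + threshold + " bytes, 72 MiB object streams: "
                    + isLarge(object72MiB, threshold));
        }
    }
}
```

Under these assumptions a 72 MiB object would stream with a 50m threshold but not with 100m or more, which is why raising the setting would be expected to change the behaviour if the value were actually being picked up.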
Re: Performance issue with large files [message #755274 is a reply to message #755245] Mon, 07 November 2011 16:53
Robin Rosenberg
Messages: 319
Registered: July 2009
Senior Member
Chris Lee wrote on 2011-11-07 19.54:
> This is on Windows 7 64-bit. Not likely the virus scanner - no other issues, and can see all the other workspace files being created. This specific file hums along
> painfully slow - took ~6 hours to get 3M of it there. In looking at the JGit code, it uses isLargeFile() to determine which path to take - a byte array in memory (for small
> files) or a streaming/incremental method (the one in question here). The isLargeFile() uses the core.streamFileThreshold settings (default: 50M), though it isn't clear
> where (if at all) this setting would be specified in Eclipse to pass to JGit.

You can change it in Preferences > Team > Git > Configuration and then add the setting under either User or System settings.
These settings are shared with C Git.

-- robin

>
> Checked out the EGit repo in ~ 6 minutes - ~5 minutes to download the repo (pack files, etc) and ~ 1 minute for the working copy.
Re: Performance issue with large files [message #755275 is a reply to message #755274] Mon, 07 November 2011 16:58
Chris Lee
Tried that, as indicated in the original post - no matter what the value is set to (based on my current JVM heap of 3G, core.streamFileThreshold should be 768m), the behaviour is unchanged.
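
The 768m figure in this post is one quarter of a 3G heap. The post does not say where that ratio comes from, so the divide-by-four rule in the sketch below is an assumption read off those two numbers, not a documented JGit contract. HeapQuarter, quarterOf and asMegabyteSetting are hypothetical names.

```java
// Hypothetical arithmetic sketch; the "quarter of the heap" rule is an assumption.
final class HeapQuarter {

    static final long MIB = 1024L * 1024L;

    // One quarter of the given heap size, in bytes.
    static long quarterOf(long heapBytes) {
        return heapBytes / 4L;
    }

    // Render a byte count as a whole-megabyte config value such as "768m".
    static String asMegabyteSetting(long bytes) {
        return (bytes / MIB) + "m";
    }

    public static void main(String[] args) {
        long threeGiB = 3L * 1024L * MIB;
        System.out.println("3G heap -> " + asMegabyteSetting(quarterOf(threeGiB)));

        // The same calculation against the heap limit of the running JVM.
        long maxHeap = Runtime.getRuntime().maxMemory();
        System.out.println("this JVM -> " + asMegabyteSetting(quarterOf(maxHeap)));
    }
}
```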

Re: Performance issue with large files [message #755280 is a reply to message #755245] Mon, 07 November 2011 17:08
Christian Halstrick
As I said: I tried it out, and current JGit is able to clone repos with text files of more than 50MB in a reasonable amount of time. Your slow performance can't simply be because checking out >50MB files is always slow. There must be another reason. Either your blobs are strangely packed, or file I/O is slow either where your objects directory is or where your working directory is. See this log, where I compare cloning with native git to cloning with the jgit command line. It takes 14 minutes to clone the big repo with either native git, jgit or egit. What's your time for git://github.com/chalstrick/testBig.git?

/c/git/tmp> time git clone git://github.com/chalstrick/testBig.git
Cloning into testBig...
remote: Counting objects: 1098, done.
remote: Compressing objects: 100% (1049/1049), done.
remote: Total 1098 (delta 14), reused 1098 (delta 14)
Receiving objects: 100% (1098/1098), 69.58 MiB | 105 KiB/s, done.
Resolving deltas: 100% (14/14), done.

real 14m13.707s
user 0m0.000s
sys 0m0.031s
/c/git/tmp> time jgit clone git://github.com/chalstrick/testBig.git testBig2
Initialized empty Git repository in c:\git\tmp\testBig2\.git
remote: Counting objects: 1098
remote: Compressing objects: 100% (1049/1049)
Receiving objects: 100% (1098/1098)
Resolving deltas: 100% (14/14)
Updating references: 100% (1/1)
remote: Total 1098 (delta 14), reused 1098 (delta 14)
From git://github.com/chalstrick/testBig.git
* [new branch] master -> origin/master

real 14m29.182s
user 0m0.030s
sys 0m0.167s
/c/git/tmp> ls -Sl testBig2 | head -5
total 80238
-rw-r--r-- 1 D032780 Administ 81873707 Nov 7 22:41 rfc-all.txt
-rw-r--r-- 1 D032780 Administ 3208268 Nov 7 22:41 rfc635.pdf
-rw-r--r-- 1 D032780 Administ 2682880 Nov 7 22:41 rfc1305.tar
-rw-r--r-- 1 D032780 Administ 2293049 Nov 7 22:40 rfc1131.ps
/c/git/tmp>


Ciao
Chris
Re: Performance issue with large files [message #755285 is a reply to message #755280] Mon, 07 November 2011 17:18
Chris Lee
Approximately 2 minutes to clone that repo.

Our repo was packed with 'git repack -f -a -d --depth=250 --window=250' following the conversion from SVN.

From reviewing the JGit code, it branches when hitting a large file (> 50M) - it's the streaming branch that is pathologically slow (as evidenced by the .tmp files created by this code path). This link appears to be on target, and may explain why certain cases will be slow: (argh, won't let me use a hyperlink - google for "jgit large object stream")

It seems that core.streamFileThreshold is not taking effect, otherwise we wouldn't be on that code path. Have set this under User Settings, as the current (and latest nightly) EGit versions aren't letting me set anything under System Settings.
Re: Performance issue with large files [message #755327 is a reply to message #755285] Tue, 08 November 2011 02:55
Christian Halstrick
OK, you found the best info on this topic: http://dev.eclipse.org/mhonarc/lists/jgit-dev/msg00689.html explains quite well a possible reason for your performance problem. What is strange is that setting core.streamFileThreshold to something big doesn't help. Did you try the repack with -f and with the -delta gitattribute set? This has helped here: http://dev.eclipse.org/mhonarc/lists/jgit-dev/msg00694.html

Ciao
Chris
Re: Performance issue with large files [message #755454 is a reply to message #755327] Tue, 08 November 2011 09:59
Chris Lee

That worked.

Added this to <repo>/info/attributes:
Quote:

*.sql -delta
*.zip -delta
*.jpg -delta
*.gif -delta
*.png -delta
*.doc -delta
*.docx -delta
*.pdf -delta


Repacked using:
Quote:

git repack -f -a -d --depth=250 --window=250


...and can now successfully clone from EGit.

Thx.