While updating the BFG Repo-Cleaner to JGit 4.0, I was surprised to see a consistent performance regression on clean-large-repo benchmarks (/dev/shm on Ubuntu 15.04, OpenJDK 1.8.0_45). Total run time had increased by ~10%. After some fairly chunky automated git-bisecting, I traced the regression to this change:
...and then *specifically* to the exists() method in FS_POSIX, now overriding the default implementation in FS:
The default implementation of FS.exists() uses java.io.File.exists(), while the new implementation in FS_POSIX uses java.nio.file.Files.exists() - by simply removing the override in FS_POSIX, performance was restored.
Profiling the BFG benchmark, it became clear that in my environment at least, j.nio.file.Files.exists() is substantially slower than j.io.File.exists(), to the point where the exists() call doubles the average cost of a call to ObjectDirectory.insertUnpackedObject() - which the BFG uses a lot, because it's rewriting history. Average times are:
j.io.File.exists() - 4 microseconds
j.nio.file.Files.exists() - 60 microseconds
If you look at the implementation of j.nio.file.Files.exists() it's not hard to believe it's slower, j.io.File.exists() drops almost immediately to a native method, while the NIO method has multiple if statements & layers of indirection (profiling says the real cost comes quite far down, in calls to 'readAttributes' methods).
I had a look at the change that originally introduced use of j.nio.file.Files.exists() (for org.eclipse.jgit.java7 only at that point):
The commit message on this change mentions that "there are claims that Files.exists is faster the File.exists" (ie that NIO is faster, contrary to what I've seen). I think that claim might have come from here:
I can't explain the conflicting results for the performance of NIO exists(), though obviously there are many different OSs and Java versions out there, which may behave differently. I've only tested on Ubuntu so far, but can grab a Mac. Getting a Windows box is more difficult for me, let me know if you'd like to volunteer for benchmarking duty.
Aside from performance, I'm not sure if there was any other strong motivation for using j.nio.file.Files.exists(). It's called with LinkOption.NOFOLLOW_LINKS, which means that when checking a symbolic link, it will return true so long as the symbolic link is there, regardless of whether the link points to an existing file. This differs from j.io.File.exists(), which
does follow the link, and returns false if the underlying file is not present. Is the NIO behavior useful?
Obviously, my preference would be to remove use of the NIO call. I'm inclined to think it could be removed everywhere, from both FS_POSIX & FS_Win32, but at the very least from ObjectDirectory.insertUnpackedObject().
Roberto
PS I have to confess the difference in total execution time run is just 45s rather than 41s, working on a 1.1GB repository - no BFG users will really care. But, having spent so long trying to create a fast tool, a 10% regression is pretty hard for me to swallow!