Garbage collection concurrency with ref update [message #1699170] Mon, 22 June 2015 10:09
Mathieu Bruyen
Hello,

I have an application that commits to and reads from a git repository through jgit. I create and delete references, so the repository ends up containing many dangling objects that I want to remove using git garbage collection. I would like this not to stop operations on the repository, so I tested performing garbage collection concurrently with inserting objects into the database.

I created a test bench that commits random files, triggers garbage collection, and periodically reads content (checking that the repository is in a consistent state). In some cases objects are missing from the database. I drilled down a bit and found that this happens when a commit "reuses" a blob that has not been referenced for a while.

A specific scenario (attached as a junit test) is to have a blob that was created a long time ago and is no longer referenced. Two processes then start at the same time: one triggers a garbage collection while the other creates a commit which includes the old blob in its tree and references it. Finally, a last process tries to read the content pointed to by the reference.

Am I using the API incorrectly (for example, failing to take a lock somewhere), or should there be something in jgit (or even git) to prevent this?
Re: Garbage collection concurrency with ref update [message #1699414 is a reply to message #1699170] Wed, 24 June 2015 07:50
Christian Halstrick
I am also hunting for a bug which may exist in JGit in this area. My latest attempts to fix the symptoms are in [1]. Please see the discussion in [2], [3]. My fixes only cure the symptom (a missing object exception during push); I haven't found the real cause of the bug. Therefore I really appreciate that somebody tried to get a grip on this bug with tests. Thanks a lot.

I tested your code locally and also see that we are losing objects ... but I do think that works as designed ;-) When you call gc() in the middle of your test there is not a single ref in the repo. No object is referenced and you set a small expiration time for the gc. That means the repo will be EMPTY after the gc: all objects are removed. If you afterwards create commits which refer to the IDs of the removed objects then your repo is corrupt.

Sounds scary, but in the real world I think that's not such a big problem. If you work with files in the working tree which you add to the index and then commit, this will not occur. The add command will do the ObjectDirectory.insert() and also update the .git/index file without a big delay (and with a lock set on .git/index). From then on the objects are referenced by the index, and gc() is not allowed to delete blobs referenced by the index. The following commit command will be safe.

But what you do is unsafe. Whatever you add to the git object database which is not referenced by a ref may potentially be garbage collected. That's why we have the expiration time in gc. Make sure that the gc expiration time is longer than the time your application needs to create references to the objects it has added to the object store. You may also work with the index and add files/objects to the index to save them from being garbage collected.

[1] https://git.eclipse.org/r/#/c/50230/
[2] https://bugs.eclipse.org/bugs/show_bug.cgi?id=468024
[3] https://groups.google.com/forum/#!topic/repo-discuss/XmjP7PF59cc


Ciao
Chris
Re: Garbage collection concurrency with ref update [message #1699427 is a reply to message #1699414] Wed, 24 June 2015 09:39
Mathieu Bruyen
Thanks for your reply.

In the test case a larger expiration time would not help, because the blob was supposedly referenced long ago, the reference has since been removed, and by misfortune a new reference wants to reuse the blob. Since it was created for something else long ago, its modification time on the file system is very old. Even if I ask an object inserter to insert it again, the inserter sees that the file already exists and won't touch it. I also cannot create a reference to the object before gc ends, as there is no commit pointing to the object yet.
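
To make the scenario above concrete, here is a minimal sketch in plain Java. It is a toy model, not JGit code: the class `StaleObjectDemo`, its method names, the file name `blob-1`, and the 30-day and 14-day values are all made up for illustration. It only assumes what is described above, namely that the inserter skips an object whose file already exists and that pruning is decided by the file's modification time.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Toy loose-object store (hypothetical, not JGit) modelling an mtime-based grace period. */
final class StaleObjectDemo {

    /** Writes the object only if its file is absent, like an inserter that skips known objects. */
    static boolean insertIfAbsent(Path dir, String id, byte[] data) throws IOException {
        Path file = dir.resolve(id);
        if (Files.exists(file)) {
            return false; // the existing file is left untouched, so its mtime stays old
        }
        Files.write(file, data);
        return true;
    }

    /** Deletes every object file last modified before the cutoff and returns how many were removed. */
    static int pruneOlderThan(Path dir, Instant cutoff) throws IOException {
        List<Path> files;
        try (Stream<Path> listing = Files.list(dir)) {
            files = listing.collect(Collectors.toList());
        }
        int removed = 0;
        for (Path file : files) {
            if (Files.getLastModifiedTime(file).toInstant().isBefore(cutoff)) {
                Files.delete(file);
                removed++;
            }
        }
        return removed;
    }

    /** Replays the scenario and reports whether the reused blob is still on disk afterwards. */
    static boolean oldBlobSurvives() {
        try {
            Path dir = Files.createTempDirectory("objects");
            byte[] content = "hello".getBytes(StandardCharsets.UTF_8);

            // A blob written long ago whose reference has since been deleted.
            Path blob = dir.resolve("blob-1");
            Files.write(blob, content);
            Files.setLastModifiedTime(blob, FileTime.from(Instant.now().minus(Duration.ofDays(30))));

            // The writer wants to reuse the same content; the insert is a no-op.
            insertIfAbsent(dir, "blob-1", content);

            // gc runs with a 14-day grace period before the new commit and ref exist.
            pruneOlderThan(dir, Instant.now().minus(Duration.ofDays(14)));
            return Files.exists(blob);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("old blob survives gc: " + oldBlobSurvives());
    }
}
```

In this model the blob is pruned whatever grace period is chosen, as long as the blob is older than it, because the second insert never refreshes the timestamp.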

I don't really know about the index, and even less how to use it from jgit. It may be the way to go for me if gc handles it well; I need to test that (I'm using the low-level API, not commands).
Re: Garbage collection concurrency with ref update [message #1699438 is a reply to message #1699427] Wed, 24 June 2015 10:50
Christian Halstrick
Got your problem. But there is no good way around it: if you have inserted something into git's object database then it may be garbage collected at any point in time unless you have references pointing (at least indirectly) to those objects. References can point to commits, but they can also directly reference blobs or trees. So, if you have blobs which you want to protect, you can create refs pointing to them.
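
The reachability rule described above can be sketched with a small in-memory model. This is plain Java and not JGit code: the class `RefPinDemo`, the object ids, and the `refs/keep/` namespace are hypothetical, and the model ignores the expiration time entirely so that only the effect of refs is visible.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy object graph (hypothetical, not JGit): gc keeps only what some ref reaches. */
final class RefPinDemo {
    /** Object id mapped to the ids it points to (commit to tree, tree to blobs, blob to nothing). */
    final Map<String, List<String>> objects = new HashMap<>();
    /** Ref name mapped to the object id it points to. */
    final Map<String, String> refs = new HashMap<>();

    void put(String id, String... children) {
        objects.put(id, List.of(children));
    }

    /** Collects every object reachable, directly or indirectly, from any ref. */
    Set<String> reachable() {
        Set<String> seen = new HashSet<>();
        Deque<String> todo = new ArrayDeque<>(refs.values());
        while (!todo.isEmpty()) {
            String id = todo.pop();
            if (objects.containsKey(id) && seen.add(id)) {
                todo.addAll(objects.get(id));
            }
        }
        return seen;
    }

    /** Drops every object that no ref reaches (no grace period in this model). */
    void gc() {
        objects.keySet().retainAll(reachable());
    }

    /** Builds a small repository, pins one blob with a ref of its own, and runs gc. */
    static RefPinDemo scenario() {
        RefPinDemo repo = new RefPinDemo();
        repo.put("blob-a");
        repo.put("blob-b");
        repo.put("blob-c");
        repo.put("tree-1", "blob-a");
        repo.put("commit-1", "tree-1");
        repo.refs.put("refs/heads/master", "commit-1"); // reaches blob-a indirectly
        repo.refs.put("refs/keep/blob-b", "blob-b");    // points at a blob directly
        repo.gc();                                        // blob-c has no ref and is dropped
        return repo;
    }

    public static void main(String[] args) {
        System.out.println(new TreeSet<>(scenario().objects.keySet()));
    }
}
```

In JGit itself, a ref of this kind could presumably be created with `Repository.updateRef(name)` followed by `RefUpdate.setNewObjectId(id)` and `RefUpdate.update()`, and deleted again once a commit references the blob; that mapping is an assumption and has not been tested against the junit scenario from this thread.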

Ciao
Chris