
Re: [jgit-dev] Storage interface.

OK, thanks a lot for the clarifications.

I think I'm going to avoid this implementation...
I'm working on a European project (QualiPSo) which aims to provide a next-generation forge; for now I'm focusing my effort on providing a working system with a Git backend (only push/pull on the server side).
So, because of the performance issue you pointed out, I understand that accessing objects through my framework (EJB3 + XACML + events +++) would introduce terrible response times, and I don't want to have to think about caching first!
For now my strategy is to replicate the needed data (refs, commits, and entries) in my framework so that cross-cutting framework services can work; access to the store for push/pull operations is delegated to JGit using direct streams. Using a post-receive hook, I crawl the repository after each push in order to replicate objects... This is not very clean, but it is enough to get a good view of a repository's content in my framework.

The only bad point of this approach is that, because the repository content is crawled after the fact, I'm not able to revert the push if something goes wrong in the framework.
For instance, we only use the SSH transport, so it's easy to start the crawler from a PostReceiveHook on the ReceivePack, as sketched below. I think adding HTTP and git protocol support will mean revisiting this crawling.
Another way of doing it would be to create a kind of pack filter (already done using special streams) that decodes the pack content (refs + objects) and creates the resources in the framework before passing the stream on to the underlying JGit repository.
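
For illustration, a minimal sketch of such a hook (ReplicatingHook and the replicate() step are hypothetical; PostReceiveHook, ReceivePack, and RevWalk are JGit's own API, installed via receivePack.setPostReceiveHook(...)):

    import java.io.IOException;
    import java.util.Collection;

    import org.eclipse.jgit.revwalk.RevCommit;
    import org.eclipse.jgit.revwalk.RevWalk;
    import org.eclipse.jgit.transport.PostReceiveHook;
    import org.eclipse.jgit.transport.ReceiveCommand;
    import org.eclipse.jgit.transport.ReceivePack;

    /** Crawls newly pushed commits and replicates them into the framework. */
    class ReplicatingHook implements PostReceiveHook {
        public void onPostReceive(ReceivePack rp,
                Collection<ReceiveCommand> commands) {
            RevWalk walk = new RevWalk(rp.getRepository());
            try {
                for (ReceiveCommand cmd : commands) {
                    if (cmd.getType() == ReceiveCommand.Type.DELETE)
                        continue; // nothing new to crawl for a deleted ref
                    if (cmd.getResult() != ReceiveCommand.Result.OK)
                        continue; // only crawl refs that actually updated
                    walk.markStart(walk.parseCommit(cmd.getNewId()));
                }
                RevCommit c;
                while ((c = walk.next()) != null) {
                    // replicate(c): copy the commit (and its tree entries)
                    // into the framework store -- hypothetical helper.
                }
            } catch (IOException err) {
                // Too late to reject the push here; it already happened.
            } finally {
                walk.release();
            }
        }
    }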

More generally, the goal of the JGit integration in this project is to provide full-text search over repository objects, with semantic filters on searches using annotations attached on the fly at git push time. Providing only SSH support is not so terrible in this case.
We also have an SVN service backend, and we'd like to find an ontology that allows Git and SVN commits to be annotated in the same way, so that searches and SPARQL queries work the same whatever the backend is.
As you can see, our goal is close to the GitHub system, but the semantic dimension should ensure more powerful searches over repository content. And the global forge architecture will allow us to develop more services than code revision alone.

By the way, thanks again for the clarifications. Very nice work on JGit, too (so useful for me).
Everything is open source and visible at qualipso.gforge.inria.fr.

Best regards, Jérôme.


2010/10/12 Shawn Pearce <spearce@xxxxxxxxxxx>
On Tue, Oct 12, 2010 at 5:34 AM, Jérôme Blanchard <jayblanc@xxxxxxxxx> wrote:
> Is the storage/file package portable to another type of storage?

It's *almost* portable.

> I mean, does the storage package ensure a complete abstraction layer, or is
> it only organisational?

The goal is a complete abstraction layer.  We have almost done that.

> Recent updates make me think it is now possible to develop another storage
> system (database, ejb, etc...) but I'd like to be sure before trying to
> develop this.

Almost true.  :-)

I do have a closed source code base that allows JGit to sit on top of
a database rather than a local filesystem.  The abstraction works well
enough that I can run JGit's daemon and clone a repository that is
stored in the database over any of the Git transport protocols
(git:// or smart http://).  It also works well enough that you can do
simple operations like log.  It's a fully open source JGit, with the
closed source base just extending classes like Repository,
RefDatabase, and ObjectDatabase (no JGit hacks required; I've already
upstreamed everything needed).
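
To illustrate the seam: callers only ever touch the Repository and
ObjectReader abstraction, so the same read works against storage.file
or a database backend.  A minimal sketch, using only the public JGit
read API (nothing backend-specific):

    import java.io.IOException;

    import org.eclipse.jgit.lib.Constants;
    import org.eclipse.jgit.lib.ObjectId;
    import org.eclipse.jgit.lib.ObjectLoader;
    import org.eclipse.jgit.lib.ObjectReader;
    import org.eclipse.jgit.lib.Repository;

    class ReadExample {
        static byte[] readHeadCommit(Repository repo) throws IOException {
            ObjectId head = repo.resolve(Constants.HEAD);
            if (head == null)
                throw new IOException("HEAD does not resolve to a commit");
            ObjectReader reader = repo.newObjectReader();
            try {
                // Resolution and object access go through the abstraction;
                // the caller never learns what storage sits underneath.
                ObjectLoader loader = reader.open(head, Constants.OBJ_COMMIT);
                return loader.getCachedBytes();
            } finally {
                reader.release();
            }
        }
    }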

I have not yet implemented writing to refs in this implementation.
Consequently I can't say for certain that the RefUpdate API is
sufficiently abstracted.  I know the RefLog API is *NOT* abstracted
yet.  The reason ref writing isn't done is that I'm just swamped and
ran out of time on this project.  I suspect you would need to
duplicate a lot of code from storage.file's RefUpdate implementation,
and that we could be sharing more of it.
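
For reference, the caller-side surface of a ref write is small; a
hedged sketch of what a database-backed RefDatabase would have to
support behind Repository.updateRef():

    import java.io.IOException;

    import org.eclipse.jgit.lib.ObjectId;
    import org.eclipse.jgit.lib.RefUpdate;
    import org.eclipse.jgit.lib.Repository;

    class RefWriteExample {
        static RefUpdate.Result setBranch(Repository repo, ObjectId newTip)
                throws IOException {
            // updateRef() is served by whatever RefDatabase the backend
            // provides; the caller never touches the storage directly.
            RefUpdate ru = repo.updateRef("refs/heads/master");
            ru.setNewObjectId(newTip);
            ru.setForceUpdate(true); // skip the fast-forward check here
            return ru.update(); // e.g. NEW, FORCED, or LOCK_FAILURE
        }
    }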

The major part that is missing from a complete abstraction is the
transport.IndexPack class.  This class is crucial for fetching into a
repository, or for being on the receiving end of a push into a
repository.  It is completely dependent upon local file IO and is
*NOT* abstracted onto an arbitrary ObjectDatabase implementation.
That means I can't fetch into my database, nor can I push into it.
So the way I get a Git repository into the database is through a
hacked-up program I wrote that manually injects objects.  (It's not
pretty.  At all.)


The closed source implementation is still closed source because it
sits on top of a database API that isn't public, and the code is
horrid.  I would be embarrassed to show it... especially that importer
program that injects objects.  I do plan to open source this, but only
once I've cleaned it up enough that I'm willing to put my name on it
and call it my work.  :-)

I had hoped to spend some of my time over the past month cleaning up
that code and getting it open sourced before the end of this month.
But then my son arrived 5 weeks early, and I discovered life had other
plans for me right now.  So that just didn't happen.


I have learned that writing a new storage implementation is a lot of
work.  You can do something really naive in about a day or two's worth
of work... it's a lot of typing to implement the various classes that
JGit requires.  But performance will be so bad it's unusable on
anything beyond a toy repository.  Then you need to spend a lot of
time implementing the rest of those APIs (like the async reading
methods in ObjectReader) in order to work back towards something even
halfway acceptable.
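
The bulk path mentioned above is what lets a remote backend batch its
round trips; a sketch of using ObjectReader's async open (signatures
as in current JGit, so details may vary by version):

    import java.io.IOException;
    import java.util.List;

    import org.eclipse.jgit.lib.AsyncObjectLoaderQueue;
    import org.eclipse.jgit.lib.ObjectId;
    import org.eclipse.jgit.lib.ObjectLoader;
    import org.eclipse.jgit.lib.ObjectReader;

    class BulkReadExample {
        static void readAll(ObjectReader reader, List<ObjectId> ids)
                throws IOException {
            // Hand the reader the whole batch up front so a smart backend
            // can fetch many objects per round trip to its storage.
            AsyncObjectLoaderQueue<ObjectId> queue = reader.open(ids, true);
            try {
                while (queue.next()) {
                    ObjectLoader loader = queue.open();
                    // process loader.getCachedBytes() or loader.openStream()
                }
            } finally {
                queue.release();
            }
        }
    }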

Replacing the storage layer in JGit isn't like swapping out MySQL for
PostgreSQL in a SQL-based application.  It's more like trying to build
a rocket and fly to the moon and back using some twine and paperclips
you found in the office supply cabinet.  The fundamental problem is
that most of the algorithms in Git assume object access takes very
small constant time, and most of them have very little lookahead
available to them.  This means your implementation's performance is
determined entirely by the round-trip time to your storage system.
If that storage system isn't mapped into local memory the way
storage.file is, it's going to be a lot slower.

--
Shawn.

