Eclipse Community Forums
Git HTTP protocol/improvements? [message #3654] Thu, 23 April 2009 10:19
Originally posted by: alex_blewitt.nospam.yahoo.com

The git: protocol is claimed to be faster/better than the HTTP access.
Assuming that's true, what can be done to speed up or improve HTTP? Is it an
implementation of the protocol, or the fact that the protocol itself is
chattier than the git: protocol? Can we optimise JGit somehow?

Alex
Re: Git HTTP protocol/improvements? [message #3687 is a reply to message #3654] Thu, 23 April 2009 14:23
Shawn O. Pearce
Alex Blewitt wrote:
> The git: protocol is claimed to be faster/better than the HTTP access.
> Assuming that's true, what can be done to speed up or improve HTTP? Is it an
> implementation of the protocol, or the fact that the protocol itself is
> chattier than the git: protocol? Can we optimise JGit somehow?

It isn't a claim, it's a fact.

The http:// support in git is implemented by assuming no Git specific
knowledge on the server side. Instead we treat the server as a dumb
peer that can only respond to standard HTTP/1.0 GET requests.

So, when you issue "git fetch http://... master" (get the current
version of the master branch of that repository) the client goes
something like this:

GET $URL/info/refs
GET $URL/objects/??/.......
GET $URL/objects/??/.......
GET $URL/objects/??/.......
GET $URL/objects/??/.......
GET $URL/objects/??/.......
GET $URL/objects/??/.......
GET $URL/objects/??/.......
GET $URL/objects/??/.......
GET $URL/objects/??/.......
GET $URL/objects/??/.......
GET $URL/objects/??/.......
GET $URL/objects/??/.......
GET $URL/objects/??/.......
GET $URL/objects/??/.......
GET $URL/objects/??/.......
GET $URL/objects/??/.......
GET $URL/objects/??/.......
GET $URL/objects/??/.......

Where it's downloading one object at a time. In the second GET, it
downloaded the commit that is at the tip of the master branch. It
parses that to discover what the top level directory object name is.
Then it requests that in the third GET. It parses that file listing,
and starts requesting the file or subtree for each of those entries, one
at a time.

Eventually, it fails with a 404. Or a 200 OK with some HTML message
saying "Dude, not found".

At which point it turns around and starts looking at the pack files:

GET $URL/objects/info/packs
GET $URL/objects/pack/pack-??.idx
GET $URL/objects/pack/pack-??.idx
GET $URL/objects/pack/pack-??.idx
GET $URL/objects/pack/pack-??.idx
GET $URL/objects/pack/pack-??.idx

Until it has downloaded an index file which says that the corresponding
pack file contains that missing object. Then it downloads that pack file:

GET $URL/objects/pack/pack-??.pack

This process repeats until the download is complete, or it's unable to
download a necessary object.

It isn't uncommon for an HTTP fetch of 100 or 200 changes to turn into
thousands of GET requests. Although the client supports HTTP/1.1
pipelining, there isn't a lot of parallelism available as the next
object to obtain requires information from the object currently being
requested just to make that next request.
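The serial dependency can be seen in a toy simulation; everything below (the object names, the dict standing in for the server) is invented for illustration, but it shows why N reachable objects cost N dependent GETs: the ids an object references are only known after that object has been fetched and parsed.

```python
# Toy model of the dumb-HTTP walk. The "server" maps each object id to the
# ids that object references (commit -> tree, tree -> blobs/subtrees, ...).
# All names are made up for illustration.
server = {
    "commit1": ["tree1"],
    "tree1": ["blob1", "blob2", "tree2"],
    "tree2": ["blob3"],
    "blob1": [], "blob2": [], "blob3": [],
}

def dumb_fetch(tip):
    """Fetch everything reachable from tip; each object is one GET."""
    gets = 0
    have = set()
    queue = [tip]
    while queue:
        obj = queue.pop()
        if obj in have:
            continue
        gets += 1                  # GET $URL/objects/??/.......
        have.add(obj)
        queue.extend(server[obj])  # only knowable after parsing the body
    return gets

print(dumb_fetch("commit1"))  # 6 objects, 6 dependent round trips
```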


The git:// protocol on the other hand uses a direct TCP stream and has
both sides send basic state information to each other in a
bi-directional handshake until they can agree upon a set of objects that
both peers have. Once that set is agreed upon, the sending side can
compute a set difference and stream everything that the receiver does
not yet have.

In practice this only takes a couple of round trips. A massive
difference compared to 1000+ HTTP round trips.

Also, since the sender only transmits *exactly* what the client is
missing, and does so by transferring deltas whenever possible, the
overall data transfer is quite a bit smaller than what occurs using HTTP.
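In set terms, the exchange reduces to a single set difference once a common base is agreed; a minimal sketch with invented object names:

```python
# Sketch of the smart transfer: after negotiation, the sender computes the
# difference between what it has and what the receiver has, and streams
# exactly that, once. Object names are invented for illustration.
server_objects = {"c1", "c2", "c3", "t1", "t2", "b1", "b2"}
client_objects = {"c1", "t1", "b1"}        # receiver's state at the common base

to_send = server_objects - client_objects  # one pass, no per-object requests
print(sorted(to_send))                     # ['b2', 'c2', 'c3', 't2']
```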


Pushing over HTTP uses WebDAV, and otherwise assumes the remote server
is any standard WebDAV filesystem. You probably could push a Git
repository into an SVN repository that way, treating the SVN repository
as a WebDAV backend.

Since the WebDAV backend is assumed to have no capabilities beyond those
of storing files, the git push client is forced to transmit whole data
objects, just like the git fetch client is forced to download whole data
objects.


Last August we had a mailing list thread about improving HTTP
performance by installing a Git specific extension on the HTTP server.
For example, by creating a "gitserver.cgi" that could be invoked through
the CGI standard. Easier to install than mod_svn in Apache, and could
be installed alongside gitweb.cgi if the site administrator wanted to.

The mailing list thread is here:

http://thread.gmane.org/gmane.comp.version-control.git/91104

The RFC, as of the last time I touched it, is now online in my
fastimport fork:

http://repo.or.cz/w/git/fastimport.git?a=blob;f=Documentation/technical/http-protocol.txt;hb=smart-http

I haven't had time to work on it in months.

The git.git C code for HTTP support is difficult to work with, though it
has recently been improved over the past couple of months. It may be
easier to prototype something in JGit, but whatever gets implemented
needs to also be implemented in git.git eventually, as users will demand it.

Mercurial has a more efficient HTTP protocol. They require a custom
Mercurial HTTP server, but if that custom server is in place then their
protocol's efficiency generally matches that of git://. They also
support a dumb HTTP approach, like I described for Git above, but I hear
people avoid it like the plague because of the performance problems I
described above.
Re: Git HTTP protocol/improvements? [message #3718 is a reply to message #3687] Thu, 23 April 2009 15:17
Originally posted by: alex_blewitt.nospam.yahoo.com

Shawn Pearce wrote:
> The http:// support in git is implemented by assuming no Git specific
> knowledge on the server side. Instead we treat the server as a dumb
> peer that can only respond to standard HTTP/1.0 GET requests.
>
> So, when you issue "git fetch http://... master" (get the current
> version of the master branch of that repository) the client goes
> something like this:

I can see how that would be slow :-)

I think a generic REST-style API to the server would be a useful protocol to
define, which then could be implemented in (Java,C) and then accessed by
(Java,C).

My forte is not C though; I'd be much more comfortable putting together a
servlet-based (JGit backed?) http-git-server implementation to iron out the
ideas and kinks; and if it flies, then maybe back-porting that to the C git
implementation and/or mod_git of sorts?

Alex
Re: Git HTTP protocol/improvements? [message #4086 is a reply to message #3718] Thu, 23 April 2009 18:01
Shawn O. Pearce
Alex Blewitt wrote:
> Shawn Pearce wrote:
>> The http:// support in git is implemented by assuming no Git specific
>> knowledge on the server side.
>
> I think a generic REST-style API to the server would be a useful protocol to
> define, which then could be implemented in (Java,C) and then accessed by
> (Java,C).

We should be careful here. I've been told proxy servers don't like HTTP
methods other than GET or POST. So those "fancy" methods like "PUT" are
just too much for some proxy servers to handle.

So embedding into POST is probably the safest approach.

> My forte is not C though; I'd be much more comfortable putting together a
> servlet-based (JGit backed?) http-git-server implementation to iron out the
> ideas and kinks; and if it flies, then maybe back-porting that to the C git
> implementation and/or mod_git of sorts?

Right.

But the C folks would probably prefer a CGI over mod_git. The C
implementation isn't suitable for running in long-lived processes, or a
server process that still needs to return a response to a client in the
face of an error.
Re: Git HTTP protocol/improvements? [message #4142 is a reply to message #4086] Sun, 26 April 2009 08:35
Originally posted by: alex_blewitt.nospam.yahoo.com

Shawn Pearce wrote:
> Alex Blewitt wrote:
>> Shawn Pearce wrote:
>>> The http:// support in git is implemented by assuming no Git specific
>>> knowledge on the server side.
>>
>> I think a generic REST-style API to the server would be a useful protocol to
>> define, which then could be implemented in (Java,C) and then accessed by
>> (Java,C).
>
> We should be careful here. I've been told proxy servers don't like HTTP
> methods other than GET or POST. So those "fancy" methods like "PUT" are
> just too much for some proxy servers to handle.

Actually, the limitation on PUT is more to do with the client than with
proxies. In any case, WebDAV uses PUT to upload content; so if a WebDAV-based
solution works, it's not going to make any difference.

But REST is more than what HTTP methods you use; it's about designing
resources around URIs. The key thing here is to get a resource which allows us
to navigate from the tip of a branch back to its ancestors, instead of one
HTTP round trip for each step.

> But the C folks would probably prefer a CGI over mod_git. The C
> implementation isn't suitable for running in long-lived processes, or a
> server process that still needs to return a response to a client in the
> face of an error.

Fair enough. Even better, then. A JGit-backed Jetty server would be pretty
sweet; and if the protocol admits it, the same API could be re-used to
provide a web-based view à la viewvc in AJAX.

Anyway, I'm going to give that a go now - the .gitignore UI addition is done
and waiting to be applied, so I'll switch tack and start investigating the
HTTP optimisation. Once we've verified it from a pure Java perspective, we can
look at other clients implementing the same HTTP protocol.

Incidentally, this is good timing - Google Code just announced support for Hg
as their DVCS (not surprisingly, since they're a Python shop) but did single
out Git's poor HTTP performance as one of the disadvantages.

http://google-opensource.blogspot.com/2009/04/distributed-version-control-for-project.html


Alex
Re: Git HTTP protocol/improvements? [message #4212 is a reply to message #3687] Tue, 28 April 2009 08:43
Originally posted by: alex_blewitt.nospam.yahoo.com

Shawn Pearce wrote:
> So, when you issue "git fetch http://... master" (get the current
> version of the master branch of that repository) the client goes
> something like this:
>
> GET $URL/info/refs
> GET $URL/objects/??/.......
> GET $URL/objects/??/.......

So, with my limited understanding of the Git format, the 'info/refs'
would correspond to a directory in .git/info/refs (except I can't find it).
However, there's a refs/heads/master which contains a string like
ee933d31d2ca4a4270aa9f4be6e60beec388e8af, which would then map to a file
in objects/ee/933d31d2ca4a4270aa9f4be6e60beec388e8af.

Presumably this is some kind of one-way linked list structure, so if I knew
how to open/parse this file, I'd then find another reference like
9bed0610017d97b6fd3fb19a5256646f4d2399e4 which in turn would take me to
objects/9b/ed0610017d97b6fd3fb19a5256646f4d2399e4 and so on.

If that's the case, then calculating the list of hashes for a branch would
be a case of following refs/heads/master through to build up a list like:

ee933d31d2ca4a4270aa9f4be6e60beec388e8af
9bed0610017d97b6fd3fb19a5256646f4d2399e4

So, creating a URL that looked like:

GET $URL/webgit/ee933d31d2ca4a4270aa9f4be6e60beec388e8af

could load/process the refs and produce a JSON representation like:

[
"ee933d31d2ca4a4270aa9f4be6e60beec388e8af",
"9bed0610017d97b6fd3fb19a5256646f4d2399e4",
...
]

That would save a bunch of the round-trips up front and then allow the client
to start downloading the packs in parallel (or at least the subset of them
that it needed).

So, how do I go about opening/parsing the objects/ file? I guess there's
something in the JGit stuff that would help here, but I don't know the
terminology that is used to describe the various files in the directory.

Alex
Re: Git HTTP protocol/improvements? [message #4282 is a reply to message #4212] Tue, 28 April 2009 15:10
Originally posted by: j16sdiz.gmail.com

Alex Blewitt wrote:
> Shawn Pearce wrote:
>> So, when you issue "git fetch http://... master" (get the current
>> version of the master branch of that repository) the client goes
>> something like this:
>>
>> GET $URL/info/refs
>> GET $URL/objects/??/.......
>> GET $URL/objects/??/.......
>
> So, with my limited understanding of the Git format, the 'info/refs'
> would correspond to a directory in .git/info/refs (except I can't find it).

info/refs is generated by `git update-server-info`
(or, sometimes, `git repack`)

> However, there's a refs/heads/master which contains a string like
> ee933d31d2ca4a4270aa9f4be6e60beec388e8af, which would then map to a file
> in objects/ee/933d31d2ca4a4270aa9f4be6e60beec388e8af.

without info/refs, git won't know refs/heads/master:
- 'master' is just an arbitrary name, it can be anything.
- plain old HTTP does not support file listing, so
we need a list of available refs.

[..]
Re: Git HTTP protocol/improvements? [message #4352 is a reply to message #4212] Tue, 28 April 2009 16:58
Shawn O. Pearce
Alex Blewitt wrote:
> Shawn Pearce wrote:
>> So, when you issue "git fetch http://... master" (get the current
>> version of the master branch of that repository) the client goes
>> something like this:
>>
>> GET $URL/info/refs
>> GET $URL/objects/??/.......
>> GET $URL/objects/??/.......
>
> So, with my limited understanding of the Git format, the 'info/refs'
> would correspond to a directory in .git/info/refs (except I can't find it).

Yea, like Daniel Cheng said, you need to run `git update-server-info`
here to get .git/info/refs created. Normally this is run by `git gc`,
or by a post-update hook under .git/hooks/post-update. It is only
needed by the HTTP support, so normally the file doesn't exist unless
you are serving this repository over HTTP.

> However, there's a refs/heads/master which contains a string like
> ee933d31d2ca4a4270aa9f4be6e60beec388e8af, which would then map to a file
> in objects/ee/933d31d2ca4a4270aa9f4be6e60beec388e8af.

Yes. info/refs is just a union catalog of the packed-refs file, and the
recursive contents of refs/. As Daniel Cheng pointed out, HTTP lacks a
generic "directory listing" mechanism so info/refs provides a catalog.
It could just have been a catalog of the file names under refs/, but it
also contains the SHA-1s to try and remove a bunch of round-trips in the
common case of "Nothing changed".
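Each line of that catalog is just the SHA-1, a tab, and the ref name. A sketch of generating one (the ref names and ids below are invented):

```python
# Build an info/refs-style catalog: one "<sha-1>\t<refname>" line per ref,
# the format `git update-server-info` writes for dumb HTTP clients.
refs = {
    "refs/heads/master": "ee933d31d2ca4a4270aa9f4be6e60beec388e8af",
    "refs/tags/v1.0": "9bed0610017d97b6fd3fb19a5256646f4d2399e4",
}

info_refs = "".join(
    "%s\t%s\n" % (sha, name) for name, sha in sorted(refs.items())
)
print(info_refs, end="")
```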

> Presumably this is some kind of one-way linked list structure, so if I knew
> how to open/parse this file, I'd then find another reference like
> 9bed0610017d97b6fd3fb19a5256646f4d2399e4 which in turn would take me to
> objects/9b/ed0610017d97b6fd3fb19a5256646f4d2399e4 and so on.

Yup, exactly.

> If that's the case, then calculating the list of hashes for a branch would
> be a case of following refs/heads/master through to build up a list like:
>
> ee933d31d2ca4a4270aa9f4be6e60beec388e8af
> 9bed0610017d97b6fd3fb19a5256646f4d2399e4
>
> So, creating a URL that looked like:
>
> GET $URL/webgit/ee933d31d2ca4a4270aa9f4be6e60beec388e8af
>
> could load/process the refs and produce a JSON representation like:
>
> [
> "ee933d31d2ca4a4270aa9f4be6e60beec388e8af",
> "9bed0610017d97b6fd3fb19a5256646f4d2399e4",
> ...
> ]

Eeeeek. No.

Well, yes, in theory you can do this. But I think it's a bad idea.

Assuming the Linux kernel repository, this listing would need to be a
JSON list of 1,174,664 SHA-1 values. That's more than 48.17 MiB of text
to transfer.

Really what you want is to have the client and server negotiate on a
common ancestor; some commit or tree that they both contain. Once that
common ancestor is found, *then* you can talk about sending that list of
object identities to the client, as now it's only a subset of that 1
million object list.

Since the object identity can be recovered from the object data (just
run SHA-1 over it after decompression) there actually is no reason to
send the object identities to the client. Instead, we should just have
the server send the object data for that group of objects that the
client doesn't yet have, but has told the server it wants to have.

This is fundamentally how the fetch-pack/upload-pack protocol used by
`git fetch` over git:// and ssh:// works.

> That would save a bunch of the round-trips up front and then allow the client
> to start downloading the packs in parallel (or at least the subset of them
> that it needed).

Ideally, the round trips should be just 1 for the *entire* data
transfer. And then we're just looking at the round trips required to
negotiate the common ancestor point.

> So, how do I go about opening/parsing the objects/ file? I guess there's
> something in the JGit stuff that would help here,

Yes, yes it would. See WalkFetchConnection in JGit. It's quite a bit of
code. But the code handles downloading both the loose objects from the
objects/ directory, and pack files from the objects/pack/ directory, and
parsing each of the 4 basic object types (commit, tree, tag, blob) in
order to determine any more pointers that must be followed.

> but I don't know the
> terminology that is used to describe the various files in the directory.

Loose objects are the things under objects/??/. Packs are the things
under objects/pack/pack-*.pack. A pack is something like a ZIP file: it
contains multiple compressed objects in a single file stream. The
corresponding pack-*.idx file contains a directory to support efficient
O(log N) access time to any object within that pack.

Two different encodings are used for the data. The loose objects are
deflated with libz, but are otherwise the complete file content, they
never store a delta. The packed objects can be stored either as the
full content but deflated with libz, or they can be stored as a delta
relative to another object in the same pack file.
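Both points (the loose encoding here, and the earlier one that the identity is recoverable from the data) can be demonstrated in a few lines; this sketch builds a blob's loose form in memory rather than reading a real repository:

```python
import hashlib
import zlib

# A loose object is zlib-deflated "<type> <size>\0<content>"; the object id
# is the SHA-1 of those same uncompressed bytes.
content = b"hello"
store = b"blob %d\x00" % len(content) + content
object_id = hashlib.sha1(store).hexdigest()

loose = zlib.compress(store)  # what would sit under objects/??/.......

# The receiver recovers the id from the data alone: inflate, then SHA-1.
# This is why a smart sender never needs to transmit names alongside bodies.
recovered = hashlib.sha1(zlib.decompress(loose)).hexdigest()
print(recovered == object_id)  # True
```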

For reference, see these documents:

http://book.git-scm.com/7_how_git_stores_objects.html
http://book.git-scm.com/7_browsing_git_objects.html
http://book.git-scm.com/7_the_packfile.html
http://www.gelato.unsw.edu.au/archives/git/0608/25286.html
http://www.kernel.org/pub/software/scm/git/docs/technical/pack-format.txt
http://www.kernel.org/pub/software/scm/git/docs/technical/pack-heuristics.txt

also, some data about the current fetch-pack/upload-pack protocol:

http://book.git-scm.com/7_transfer_protocols.html
http://www.kernel.org/pub/software/scm/git/docs/technical/pack-protocol.txt
Re: Git HTTP protocol/improvements? [message #4423 is a reply to message #4282] Tue, 28 April 2009 18:50
Originally posted by: alex_blewitt.nospam.yahoo.com

Daniel Cheng wrote:
> Alex Blewitt wrote:
>> Shawn Pearce wrote:
>>> So, when you issue "git fetch http://... master" (get the current
>>> version of the master branch of that repository) the client goes
>>> something like this:
>>>
>>> GET $URL/info/refs
>>> GET $URL/objects/??/.......
>>> GET $URL/objects/??/.......
>>
>> So, with my limited understanding of the Git format, the 'info/refs'
>> would correspond to a directory in .git/info/refs (except I can't find it).
>
> info/refs is generated by `git update-server-info`
> (or, sometimes, `git repack`)

Ah, thanks.

>> However, there's a refs/heads/master which contains a string like
>> ee933d31d2ca4a4270aa9f4be6e60beec388e8af, which would then map to a file
>> in objects/ee/933d31d2ca4a4270aa9f4be6e60beec388e8af.
>
> without info/refs, git won't know refs/heads/master:
> - 'master' is just an arbitrary name, it can be anything.
> - plain old HTTP does not support file listing, so
> we need a list of available refs.

OK. This could be something computed dynamically by a server-side process,
rather than generated in batch, too. Plus, WebDAV supports a listing (though that
isn't vanilla HTTP). Do we support that if available?

Alex
Re: Git HTTP protocol/improvements? [message #4493 is a reply to message #4352] Tue, 28 April 2009 18:50
Originally posted by: alex_blewitt.nospam.yahoo.com

Shawn Pearce wrote:
> Alex Blewitt wrote:
>> Shawn Pearce wrote:
>>> So, when you issue "git fetch http://... master" (get the current
>>> version of the master branch of that repository) the client goes
>>> something like this:
>>>
>>> GET $URL/info/refs
>>> GET $URL/objects/??/.......
>>> GET $URL/objects/??/.......
>>
>> So, with my limited understanding of the Git format, the 'info/refs'
>> would correspond to a directory in .git/info/refs (except I can't find it).
>
> Yea, like Daniel Cheng said, you need to run `git update-server-info`
> here to get .git/info/refs created. Normally this is run by `git gc`,
> or by a post-update hook under .git/hooks/post-update. It is only
> needed by the HTTP support, so normally the file doesn't exist unless
> you are serving this repository over HTTP.

Right. And the only reason we need this is to support HTTP then.

>> could load/process the refs and produce a JSON representation like:
>>
>> [
>> "ee933d31d2ca4a4270aa9f4be6e60beec388e8af",
>> "9bed0610017d97b6fd3fb19a5256646f4d2399e4",
>> ...
>> ]
>
> Eeeeek. No.
>
> Well, yes, in theory you can do this. But I think it's a bad idea.
>
> Assuming the Linux kernel repository, this listing would need to be a
> JSON list of 1,174,664 SHA-1 values. That's more than 48.17 MiB of text
> to transfer.

OK. But that's assuming a whole world change, right? The URL doesn't have to
generate the entire collection of trees from the beginning (in the same way
that 'git status' offers you a paginated view). We could limit it to an
arbitrary/fixed/user-requestable paging figure, so:

C: GET $URL/aaa
S: [
"aaa.."
"bbb.."
"ccc.."
...
"mmm.."
]
C: GET $URL/mmm
S: [
"mmm.."
"nnn.."
"ooo.."
]

Assuming a relatively recent change was a common ancestor, you'd probably get
it in the first couple of pages of requests.

> Really what you want is to have the client and server negotiate on a
> common ancestor; some commit or tree that they both contain.

As a matter of interest, is the hash assumed to be unique for all commits over
time? In other words, if I find "ooo..." in the server response, and I too
have "ooo..." in my client tree, then is that de facto the common ancestor?
Are there any chances that the "ooo..." could be the same hash but a
completely different part of the tree?

> Since the object identity can be recovered from the object data (just
> run SHA-1 over it after decompression) there actually is no reason to
> send the object identities to the client. Instead, we should just have
> the server send the object data for that group of objects that the
> client doesn't yet have, but has told the server it wants to have.

OK, so the same mechanism could be used to upload the hashes of the identities
to the server, right?

>> but I don't know the
>> terminology that is used to describe the various files in the directory.
>
> For reference, see these documents:

Thanks, I'll take a while to peruse and understand them.

On the subject of dependencies; writing a web app is going to require some
kind of server support. I was thinking of using Jetty, now it's under the
Eclipse banner. Is there any reason why we can't use other EPL code in the
server-side part of this component?

For the client side, I hope the protocol will be easy enough to add in to
(say) JGit as a BSD implementation instead of having to bring in other
dependencies. I assume the reason why we are not using (say) Apache Commons
Net is to avoid any extra dependencies?

Alex
Re: Git HTTP protocol/improvements? [message #4563 is a reply to message #4493] Tue, 28 April 2009 20:10
Shawn O. Pearce
Alex Blewitt wrote:
> Shawn Pearce wrote:
>> Assuming the Linux kernel repository, this listing would need to be a
>> JSON list of 1,174,664 SHA-1 values. That's more than 48.17 MiB of text
>> to transfer.
>
> OK. But that's assuming a whole world change, right? The URL doesn't have to
> generate the entire collection of trees from the beginning (in the same way
> that 'git status' offers you a paginated view). We could limit it to an
> arbitrary/fixed/user-requestable paging figure, so:
>
> C: GET $URL/aaa
> S: [
> "aaa.."
> "bbb.."
> "ccc.."
> ...
> "mmm.."
> ]
> C: GET $URL/mmm
> S: [
> "mmm.."
> "nnn.."
> "ooo.."
> ]

Ugh.

So, what if the whole world was being downloaded for the first time?
(Initial clone of a project over HTTP.) How many "pages" would I need
for the Linux kernel's 1,174,664 values?

How do you define the boundary for a page?

The most recent commit in the Linux kernel has 27,829+ objects in it.
Probably closer to 30,000 when you include all of the directories.
That's just that first commit. How many objects did you want to put per
page?

You are thinking about this all wrong. You seriously can't do what you
are suggesting and still get good performance, for either an initial
clone, or for an incremental update.

> Assuming a relatively recent change was a common ancestor, you'd probably get
> it in the first couple of pages of requests.

Sure. That's the point of the negotiation that currently takes place: you
want to find that common ancestor in some small number of round trips.

> As a matter of interest, is the hash assumed to be unique for all commits over
> time?

Yes.

> In other words, if I find "ooo..." in the server response, and I too
> have "ooo..." in my client tree, then is that de facto the common ancestor?

Yes.

> Are there any chances that the "ooo..." could be the same hash but a
> completely different part of the tree?

No.

>> Since the object identity can be recovered from the object data (just
>> run SHA-1 over it after decompression) there actually is no reason to
>> send the object identities to the client. Instead, we should just have
>> the server send the object data for that group of objects that the
>> client doesn't yet have, but has told the server it wants to have.
>
> OK, so the same mechanism could be used to upload the hashes of the identities
> to the server, right?

You aren't seriously suggesting that we take the object data, which is
usually larger than 40 bytes, and upload it to the server, just to send
the server a 40 byte token saying "I have this object"?

> On the subject of dependencies; writing a web app is going to require some
> kind of server support.

I would try to stick to the J2EE servlet specification, so that any
servlet container can be used.

> I was thinking of using Jetty, now it's under the
> Eclipse banner.

Sure.

But I'd also like to let people deploy under any other servlet container.

Seriously, how much "server side support" do you need to speak this
protocol? You need to register something with the container to handle
POST, that's a subclass of HttpServlet. You need InputStream to read
that POST body, that's the HttpServletRequest.getInputStream(). You
need an OutputStream to send a response, that's the
HttpServletResponse.getOutputStream(). That's J2EE servlet
specification circa 1999.
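As a rough stdlib analogue of that servlet surface (this is not JGit or git code; the handler and its echo response are invented), the whole "server side support" is one POST handler, one input stream, one output stream:

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class SmartHttpStub(BaseHTTPRequestHandler):
    """Minimal POST endpoint: read the request body, write a response."""
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        reply = b"ACK " + body  # a real server would parse want/have lines
        self.send_response(200)
        self.send_header("Content-Length", str(len(reply)))
        self.end_headers()
        self.wfile.write(reply)

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), SmartHttpStub)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = "http://127.0.0.1:%d/git-stub" % server.server_port
resp = urllib.request.urlopen(url, data=b"want deadbeef").read()
server.shutdown()
print(resp)  # b'ACK want deadbeef'
```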

After that, everything should be available in JGit, as it's all Git
specific. I see no reason to tie this to Jetty, even if Jetty is under
the Eclipse banner (which I think is great).

> Is there any reason why we can't use other EPL code in the
> server-side part of this component?

I'd rather not.

See my remark above about how you really shouldn't need anything that
isn't already in JGit, or that can't be trivially reimplemented in JGit,
to support this.

Remember that to maximize use of this HTTP protocol you also need to
implement both client and server in git.git, the canonical C
implementation, which use a GPLv2 license *ONLY*. If you try to port
EPL "support libraries" to C they won't be accepted into git.git because
the EPL isn't compatible with the GPLv2 for linking.

> For the client side, I hope the protocol will be easy enough to add in to
> (say) JGit as a BSD implementation instead of having to bring in other
> dependencies.

It's not a hope, it's a requirement. Robin and I won't merge something into
JGit that isn't a BSD implementation, or that requires non-BSD or non-MIT
dependencies. Actually, we try quite hard not to add any additional
dependencies to JGit.

> I assume the reason why we are not using (say) Apache Commons
> Net is to avoid any extra dependencies?

Yup, exactly. Although Apache License 2.0 plays nicely with BSD and
EPL, we don't use Apache Commons Net because it's overkill for what we
need, and it's yet another dependency.

We only depend upon JSch because there was no other Java-based SSH
client implementation at the time, its license is very acceptable (also
BSD), and rewriting it would be very time consuming.
Re: Git HTTP protocol/improvements? [message #4632 is a reply to message #4563] Tue, 28 April 2009 21:03
Originally posted by: alex_blewitt.nospam.yahoo.com

> Shawn Pearce wrote:
> > Alex Blewitt wrote:
> So, what if the whole world was being downloaded for the first time?
> (Initial clone of a project over HTTP.) How many "pages" would I need
> for the Linux kernel's 1,174,664 values?

How does it work for the git:// protocol at the moment? I was under the
impression that the client would download the SHA-1 names in any case.
Obviously the initial clone could probably be handled in a more optimised
manner if needed.

> How do you define the boundary for a page?

It could be part of the URL, for example .../aaaa/2/50 (for 2nd page of 50)

> The most recent commit in the Linux kernel has 27,829+ objects in it.
> Probably closer to 30,000 when you include all of the directories.
> That's just that first commit. How many objects did you want to put per
> page?

I suspect it's probably going to take some measurement to find out what the
optimal number(s) are. Ideally, you'd like to get the majority
(recent updates) of commits in a single hit, but I frankly don't know much at
this stage - I'm just exploring ideas.

> You are thinking about this all wrong. You seriously can't do what you
> are suggesting and still get good performance, for either an initial
> clone, or for an incremental update.

I'm exploring ideas. I'm bound to explore more bad ones than good ones in
order to get there :-)

> You aren't seriously suggesting that we take the object data, which is
> usually larger than 40 bytes, and upload it to the server, just to send
> the server a 40 byte token saying "I have this object"?

No, I was only suggesting submitting the hashes as part of the handshake to
find the common ancestor and/or what the client and server both have.

>> On the subject of dependencies; writing a web app is going to require some
>> kind of server support.
>
> I would try to stick to the J2EE servlet specification, so that any
> servlet container can be used.

Yes, that is the plan.

>> I was thinking of using Jetty, now it's under the
>> Eclipse banner.
>
> Sure.
>
> But I'd also like to let people deploy under any other servlet container.

Agreed. I was just thinking of having a downloadable server, like Hudson,
which can be executed with java -jar webgit.jar as well as being installed
into other servers (Tomcat etc.)

> After that, everything should be available in JGit, as its all Git
> specific. I see no reason to tie this to Jetty, even if Jetty is under
> the Eclipse banner (which I think is great).

I didn't mean to tie it in at the code level, just as a way of
downloading/running it.

> Remember that to maximize use of this HTTP protocol you also need to
> implement both client and server in git.git, the canonical C
> implementation, which uses a GPLv2 license *ONLY*. If you try to port
> EPL "support libraries" to C they won't be accepted into git.git because
> the EPL isn't compatible with being linked against GPLv2 code.

Agreed. The plan is to evolve a protocol whose client can be implemented in C
without needing any other aspects.

Alex
Re: Git HTTP protocol/improvements? [message #4701 is a reply to message #4632] Tue, 28 April 2009 21:30
Shawn O. Pearce
Alex Blewitt wrote:
>> Shawn Pearce wrote:
>>> Alex Blewitt wrote:
>> So, what if the whole world was being downloaded for the first time?
>> (Initial clone of a project over HTTP.) How many "pages" would I need
>> for the Linux kernel's 1,174,664 values?
>
> How does it work for the GIT protocol at the moment?

Look at the links I sent earlier today about the fetch-pack/upload-pack
protocol. Basically the exchange goes something like this for an
initial clone of the whole world:

C: CONNECT
S: here's the list of refs I have, and their current SHA-1 values

C: want deadbeef...
C: want aaaabbbb...
C: <END>

S: PACK...compressed data for the entire project...

For an incremental update:

C: CONNECT
S: here's the list of refs I have, and their current SHA-1 values

C: want deadbeef...
C: want aaaabbbb...
C: have 1831123...
C: have asd813c...
... up to 32 more have lines ...

S: ACK 1831123...
S: NAK asd813c...

C: <END>

S: PACK...compressed data for the incremental update...
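On the wire, each line of the exchanges above is framed as a Git "pkt-line": four lower-case hex digits giving the total length (counting the 4 header bytes themselves), followed by the payload, with "0000" acting as a flush packet that ends a section. A minimal sketch of the framing (the class name is invented for illustration):

```java
import java.nio.charset.StandardCharsets;

public class PktLine {
    // A flush-pkt ("0000") terminates a section of the conversation,
    // e.g. the end of the client's want list.
    static final String FLUSH = "0000";

    // Frame a payload as a pkt-line: four hex digits of total length
    // (payload bytes + the 4 header bytes), then the payload itself.
    static String frame(String payload) {
        byte[] body = payload.getBytes(StandardCharsets.UTF_8);
        return String.format("%04x", body.length + 4) + payload;
    }

    public static void main(String[] args) {
        System.out.print(frame("want deadbeef...\n"));  // "0015want deadbeef...\n"
        System.out.println(FLUSH);
    }
}
```

A real want line ("want " plus a 40-hex SHA-1 plus LF) is 46 payload bytes, so it goes out as "0032want ...".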

The want lines are the client saying, "I don't have X, but you said in
your initial advertisement that you have it, so give it to me". The
client selected these SHA-1s out of the initial advertisement by first
looking to see if the object exists on disk; if it doesn't but its
corresponding ref is in the pattern of refs the client was instructed to
fetch (e.g. fetch = refs/heads/* in .git/config) then the client "wants"
that SHA-1.

The have lines are the client listing every commit it knows about,
starting from the most recent revisions the client has, walking
backwards in time through the project history.

have lines are sent by the client in batches of 32, with up to 2 batches
in flight at a time.

The server sends ACK lines to let the client know that the server also
has that object, and thus that the client can stop enumerating history
reachable from that point in time. This is a potential common ancestor.
There may be multiple, due to different side branches being active on
both sides.

The server sends NAK lines to let the client know it doesn't have a
particular object. Such objects are unique to the client (e.g. commits
you created but haven't published to anyone, or commits you got from
some other repository that this repository hasn't communicated with).
On these objects the client goes further backwards in that history to
look for another possible match.

That's a simplification of it, but the rough idea. See the links I
pointed you to and BasePackFetchConnection in JGit for the Java
implementation of this client, and transport.UploadPack for the server
side implementation of this protocol.

> I was under the
> impression that the client would download the SHA1 names in any case.

No, we don't transfer the SHA-1 names of the objects the client is going
to download. Instead, the client computes them on the fly from the data
it receives. This is actually a safety measure: it allows the client to
verify that the data received matches the signature it expects.
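Concretely, an object's name is the SHA-1 of its canonical form: the header "<type> <size>\0" followed by the raw content, which is why the client can recompute and check every name from the bytes it receives. A self-contained sketch of that computation (not the JGit code itself):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class GitObjectId {
    // Git object name = SHA-1 over "<type> <size>\0" + content.
    static String idFor(String type, byte[] content) throws Exception {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        sha1.update((type + " " + content.length + "\0").getBytes(StandardCharsets.UTF_8));
        sha1.update(content);
        StringBuilder hex = new StringBuilder();
        for (byte b : sha1.digest())
            hex.append(String.format("%02x", b));
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // Same result as `echo 'hello' | git hash-object --stdin`:
        // ce013625030ba8dba906f756967f9e9ca394464a
        System.out.println(idFor("blob", "hello\n".getBytes(StandardCharsets.UTF_8)));
    }
}
```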

A really paranoid client performs a full check of the object pointers
too. Checking the tips of what you fetched (the things you want'd in
the protocol) all the way back to the common base (the things you and
the server agreed upon as existing) validates that the entire data
stream is what you believed it should be.

It's more about data integrity during transit and protection against broken
Git tools than about filtering out an evil MITM attack.

For the JGit validation code see transport.IndexPack, ObjectChecker, and
FetchProcess.fetchObjects() for the client side, or
transport.ReceivePack.checkConnectivity() for the server side.

>>> I was thinking of using Jetty, now it's under the
>>> Eclipse banner.
>>
>> But I'd also like to let people deploy under any other servlet container.
>
> Agreed. I was just thinking of having a downloadable server, like Hudson,
> which can be executed with java -jar webgit.jar as well as being installed
> into other servers (Tomcat etc.)

Oh, yea, that's awesome. Jetty is quite embeddable and is under a good
license for this sort of binary redistribution build. But that is an
unrelated goal to a more efficient HTTP support in Git. Jetty has made
it really easy for anyone to roll a servlet into a simple downloadable
JAR. My point is, anyone can roll that distribution. But I'm not
against having it as an eventual downloadable product once both JGit and
Jetty have exited incubating status and can make formal releases.
Re: Git HTTP protocol/improvements? [message #4772 is a reply to message #4701] Tue, 28 April 2009 21:55
Originally posted by: alex_blewitt.nospam.yahoo.com

Shawn Pearce wrote:
> Alex Blewitt wrote:
>>> Shawn Pearce wrote:
>>>> Alex Blewitt wrote:
>>> So, what if the whole world was being downloaded for the first time?
>>> (Initial clone of a project over HTTP.) How many "pages" would I need
>>> for the Linux kernel's 1,174,664 values?
>>
>> How does it work for the GIT protocol at the moment?
>
> Look at the links I sent earlier today about the fetch-pack/upload-pack
> protocol. Basically the exchange goes something like this for an
> initial clone of the whole world:

OK, so on the initial clone, we just say "give me everything reachable from
'deadbeef'" without caring what those happen to be.

In the case of an incremental update, we have a subset of things the server
might be interested in, plus the 'everything from deadbeef' (which may
include some of the things we already have). The server will know to send
only deadbeef..common ancestor(s), and it works out the common ancestor(s)
by drilling down through the final ACKs we get for the combined updates.

The reason we don't need SHAs is that once we've agreed on the download set
(from deadbeef to common ancestor, including 1831123/asd813c/...) we just get
the data (from which we can reconstruct the SHAs).

> have lines are sent by the client in batches of 32, with up to 2 batches
> in flight at a time.

OK. I guess I had a similar idea of batching the SHA-1 earlier, but we don't
need to do that on the client; we should be able to compute it on the server.

> The server sends ACK lines to let the client know that the server also
> has that object, and thus that the client can stop enumerating history
> reachable from that point in time. This is a potential common ancestor.

Why only a potential common ancestor? I can imagine there being multiple of
these rather than a single 'the' common ancestor, but I'm not sure how it
might fail to be a common ancestor at all.

> That's a simplification of it, but the rough idea. See the links I
> pointed you to and BasePackFetchConnection in JGit for the Java
> implementation of this client, and transport.UploadPack for the server
> side implementation of this protocol.

That's great - this has been very useful to me. I'll take a look at the Java
implementation a little more to see what I can do.

> No, we don't transfer the SHA-1 names of the objects the client is going
> to download. Instead, the client computes them on the fly from the data
> it receives. This is actually a safety measure, it allows the client to
> verify the data received matches the signature it expects.

OK, we get an implicit set of data rather than the SHAs. I was trying to find
out how we could come to a common ancestor using the SHAs on the client side,
but a server-side computation can work just as well.

> Oh, yea, that's awesome. Jetty is quite embeddable and is under a good
> license for this sort of binary redistribution build. But that is an
> unrelated goal to a more efficient HTTP support in Git. Jetty has made
> it really easy for anyone to roll a servlet into a simple downloadable
> JAR. My point is, anyone can roll that distribution. But I'm not
> against having it as an eventual downloadable product once both JGit and
> Jetty have exited incubating status and can make formal releases.

Great.

Thanks again for the detailed response; now, it's over to me to start playing
around with it in code.

Alex
Re: Git HTTP protocol/improvements? [message #4839 is a reply to message #4772] Thu, 30 April 2009 18:54
Originally posted by: alex_blewitt.nospam.yahoo.com

>Alex Blewitt wrote:
>> Shawn Pearce wrote:
>> Look at the links I sent earlier today about the fetch-pack/upload-pack
>> protocol. Basically the exchange goes something like this for an
>> initial clone of the whole world:
>
> OK, so on the initial clone, we just say "give me everything reachable from
> 'deadbeef'" without caring what those happen to be.

I think I'm getting closer to understanding what's going on. I'm going to
start throwing some code together over the weekend to see if I can figure out
what's going on.

If I end up with a URL based on a current tip (like /webgit/pack/ab01230a0...)
then the contents that get served back can be the same format (pack) as with
the git protocol. This will be a useful proof of concept, as well as handling
the initial check-out case where you don't have anything. If you GET the
response, you'll get everything reachable from tip, whereas if you POST the
response (with some details to be worked out later) along the want/need kind
of lines of the git protocol, then it can send a subset instead.

One advantage of this is it should be possible to do something like curl
/webgit/pack/ab012340a | git receive-pack as a proof of concept without having
to change the C code, at least initially.

There's also no reason the webapp can't serve the info/refs as well, so that
it's dynamically calculated instead of being regenerated on each commit. We
could use some header flags to determine whether the server was smart or dumb
in what we request next.
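For reference, the info/refs file that `git update-server-info` writes (and that a servlet could compute on the fly) is plain text: one "<sha-1> TAB <refname> LF" record per advertised ref. A minimal sketch of rendering it from an in-memory map (class and method names are invented for illustration):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class InfoRefs {
    // Render the dumb-transport info/refs payload:
    // one "<sha-1>\t<refname>\n" record per advertised ref.
    static String render(Map<String, String> refs) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, String> e : refs.entrySet())
            out.append(e.getValue()).append('\t').append(e.getKey()).append('\n');
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> refs = new LinkedHashMap<>();
        refs.put("refs/heads/master", "ee933d31d2ca4a4270aa9f4be6e60beec388e8af");
        System.out.print(render(refs));
    }
}
```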

The challenge I have is how to convert a tree identifier into a pack
structure, I suspect. Objects might already be in a packed structure, or they
might have to get packed on the server side. I'm also aware that that
operation (in Git, at least) can take a while. I'm not sure whether there are
HTTP timeouts that might be involved if the server takes too long to pack
something; it might be necessary to somehow send the packs back as chunks
instead of as a single big pack. From what I can infer, the git protocol uses
a series of status messages to indicate progress without any data on the
remote end, then switches over to sending the pack file as one big lump.

Alex
Re: Git HTTP protocol/improvements? [message #4908 is a reply to message #4839] Thu, 30 April 2009 20:23
Shawn O. Pearce
Alex Blewitt wrote:
> If I end up with a URL based on a current tip (like /webgit/pack/ab01230a0...)
> then the contents that get served back can be the same format (pack) as with
> the git protocol. This will be a useful proof of concept, as well as handling
> the initial check-out case where you don't have anything. If you GET the
> response, you'll get everything reachable from tip, whereas if you POST the
> response (with some details to be worked out later) along the want/need kind
> of lines of the git protocol, then it can send a subset instead.

OK.

> One advantage of this is it should be possible to do something like curl
> /webgit/pack/ab012340a | git receive-pack as a proof of concept without having
> to change the C code, at least initially.

Actually, that should be git index-pack. I think you'd want something like:

mkdir foo
cd foo
git init
curl /webgit/pack/ab012340a | git index-pack --stdin --fix-thin in.pack
mv in.pack in.idx .git/objects/pack
git update-ref HEAD ab012340a

It should work, but it still doesn't quite get everything right. The
remaining pieces are minor, but still important, details that can be worked
out later.

> There's also no reason the webapp can't serve the info/refs as well, so that
> it's dynamcially calculated instead of regenerated on each commit. We could
> use some header flags to determine whether the server was smart or dumb in
> what we request next.

Yes. If you had looked at that HTTP thread from last July/August I
mentioned doing something like that. And also having it compute
objects/info/packs for dumb clients, in case they are accessing a smart
server and don't know any better (e.g. older versions that predate the
smart HTTP support).

> The challenge I have is how to convert a tree identifier into a pack
> structure, I suspect.

In JGit? Use a PackWriter. You feed preparePack the interestingObjects
(wants) and the uninterestingObjects (common base/haves, can be empty to
get the whole world) and it builds up a list of what to send. Then you
ask it to dump that to an OutputStream with writePack().

> Objects might already be in a packed structure, or they
> might have to get packed on the server side.

Yup. PackWriter automatically handles this distinction by taking
objects from whatever location they are at.

> I'm also aware that that
> operation (in Git, at least) can take a while.

And JGit is no different. Worse actually, it's in Java and isn't nearly
as optimized as C Git is.

> I'm not sure whether there are
> HTTP timeouts that might be involved if the server takes too long to pack
> something; it might be necessary to somehow send the packs back as chunks
> instead of as a single big pack.

Yes. That is a problem.

> From what I can infer, the git protocol uses
> a series of status messages to indicate progress without any data on the
> remote end, then switches over to sending the pack file as one big lump.

Actually, that huge data lump is framed inside of a multiplexing
stream, with the pack data in stream #0 and progress messages in stream
#1. It's just that Git stops sending progress messages once the data
starts flowing. The progress messages are just there so the end user
doesn't abort while the server is computing... and it can take a while
for that computation to complete. Once the computation is done, data
transfer starts, and the user gets progress messages from the client as
it processes that data through `git index-pack`.
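On the wire this multiplexing is the "side-band" capability: the first payload byte of each pkt-line is a band number (1 = pack data, 2 = progress, 3 = error), so the streams called #0 and #1 above travel as bands 1 and 2. A sketch of demultiplexing one packet payload (names invented for illustration):

```java
public class SideBand {
    static final int DATA = 1, PROGRESS = 2, ERROR = 3;

    // The first byte of a side-band packet payload selects the stream.
    static int bandOf(byte[] packet) {
        return packet[0];
    }

    // The rest of the payload is that stream's data.
    static byte[] dataOf(byte[] packet) {
        byte[] out = new byte[packet.length - 1];
        System.arraycopy(packet, 1, out, 0, out.length);
        return out;
    }

    public static void main(String[] args) {
        byte[] pkt = {PROGRESS, 'R', 'e', 's', 'o', 'l', 'v', 'i', 'n', 'g'};
        System.out.println(bandOf(pkt) == PROGRESS);  // true: a progress message
        System.out.println(new String(dataOf(pkt)));  // Resolving
    }
}
```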

In the case of JGit's PackWriter class, that server computation phase
happens inside of the preparePack(Collection,Collection) method.
Blocking in there for 2m30s while handling the Linux kernel repository
isn't unheard of. Once that method is complete, the caller switches to
writePack(), which starts writing data immediately.

In terms of HTTP timeouts, yes, 2m30s before any data transfer even
starts is a lot. And HTTP can't send progress messages to at least let
the user know the server is chugging on their behalf.
Re: Git HTTP protocol/improvements? [message #4977 is a reply to message #4908] Thu, 30 April 2009 21:33
Originally posted by: alex_blewitt.nospam.yahoo.com

> Shawn Pearce wrote:
>> Alex Blewitt wrote:
>> One advantage of this is it should be possible to do something like curl
>> /webgit/pack/ab012340a | git receive-pack as a proof of concept without
>> having to change the C code, at least initially.
>
> Actually, that should be git index-pack. I think you'd want something like:
>
> mkdir foo
> cd foo
> git init
> curl /webgit/pack/ab012340a | git index-pack --stdin --fix-thin in.pack
> mv in.pack in.idx .git/objects/pack
> git update-ref HEAD ab012340a


Thanks, that is useful. I might use something like that in testing later on.

>> There's also no reason the webapp can't serve the info/refs as well,
>
> Yes. If you had looked at that HTTP thread from last July/August I
> mentioned doing something like that.

Yup, that's where I remembered it from ;-)

>> The challenge I have is how to convert a tree identifier into a pack
>> structure, I suspect.
>
> In JGit? Use a PackWriter.

Thanks, I'll take a look at that. Sounds like it should be easy, provided that
I get the objects in the right order.

>> I'm not sure whether there are
>> HTTP timeouts that might be involved if the server takes too long to pack
>> something; it might be necessary to somehow send the packs back as chunks
>> instead of as a single big pack.
>
> Yes. That is a problem.

According to Apache HTTP docs, the timeout defaults to 300s.

http://httpd.apache.org/docs/2.2/mod/mod_proxy.html#proxytimeout
http://httpd.apache.org/docs/2.2/mod/core.html#timeout

There's always the possibility - for a simple GET request, at least - of being
able to temporarily cache a generated pack file, so if the client re-tries, it
might be available already. An alternative would be to throw back a different
error message, and request that the client acquire different sub-sets of the
problem instead (say, /webgit/pack/a1234...bc3456)

> In terms of HTTP timeouts, yes, 2m30s before any data transfer even
> starts is a lot. And HTTP can't send progress messages to at least let
> the user know the server is chugging on their behalf.

Ah well, I've got enough to be getting on with ... as always, thanks for your
help!

Alex
Re: Git HTTP protocol/improvements? [message #6699 is a reply to message #4977] Wed, 06 May 2009 19:41
Originally posted by: alex_blewitt.nospam.yahoo.com

>Alex Blewitt wrote:
>> Shawn Pearce wrote:
>> Actually, that should be git index-pack. I think you'd want something
>> like:
>>> The challenge I have is how to convert a tree identifier into a pack
>>> structure, I suspect.
>>
>> In JGit? Use a PackWriter.

Thanks for the advice. I've not been able to do much recently (headaches with
Eclipse 3.5M7 notwithstanding), but I've been able to take a repo, use
PackWriter to generate the pack file, and then pipe it through a few of the
commands you mentioned in order to reconstitute the file. So I at least
understand how to operate the basics for a full checkout, even if I don't yet
know how to plug in the only-send-updates part (well, how to orchestrate the
protocol; I'm assuming that PackWriter will do the work for that as well).

My next stab will be to write the servlet that can serve this down. Given the
imminent provisioning for EGit at Eclipse, I'm wondering if it makes sense to
start using org.eclipse.egit straight away rather than org.spearce (or
equivalent subpackage). So I don't know if it makes sense to start committing
it to the existing git repository yet or wait until the new Eclipse one is up
there. Any thoughts?

Alex
Re: Git HTTP protocol/improvements? [message #6704 is a reply to message #6699] Wed, 06 May 2009 20:17
Shawn O. Pearce
Alex Blewitt wrote:
> I don't yet know how to plug in the only-send-updates (well, how to
> orchestrate the protocol; I'm assuming that the PackWriter will do the work
> for that as well).

Yes. The uninteresting collection that can't be null (the bug you just
posted a patch for) is the common base; PackWriter automatically handles
selecting only the delta between the interesting collection and the
uninteresting collection, and writing only that.

JGit is still slightly inefficient here compared to C Git. PackWriter
can reuse a binary delta stored in a pack file, but it can't create a
new binary delta on its own. So JGit may wind up sending an entire file
when only a small insertion delta is necessary (e.g. to insert 5 lines
into a 200 line file). It's something that will get fixed eventually,
and it's internal to PackWriter, so it's still transparent to your
application. And this only affects network efficiency, so transfers may
run a bit slower on a slow connection, but the result is still correct
without binary delta generation support.

> My next stab will be to write the servlet that can serve this down. Given the
> imminent provisioning for EGit at Eclipse, I'm wondering if it makes sense to
> start using org.eclipse.egit straight away rather than org.spearce (or
> equivalent subpackage).

org.eclipse.jgit ? And we're still not sure JGit is going to be able to
move to Eclipse. The JGit license is going to remain EDL. We need to
ask the board of directors for an exemption to place JGit under the EDL.
Without their permission to do so, JGit can't move to eclipse.org.

> So I don't know if it makes sense to start committing
> it to the existing git repository yet or wait until the new Eclipse one is up
> there. Any thoughts?

I would just commit to the existing repository. At worst you'll have to
move the commits to the new one. Unless you've already polished them
for submission, you'll probably want to rewrite part of the development
history anyway to clean it up, so, no big deal.

But, yea, you could also just make your own git repository for this and
link it to the JGit code. It's not like it's hard to compile it and
import it into your project's classpath.
Re: Git HTTP protocol/improvements? [message #571492 is a reply to message #3654] Thu, 23 April 2009 14:23
Shawn O. Pearce
Alex Blewitt wrote:
> The git: protocol is claimed to be faster/better than the HTTP access.
> Assuming that's true, what can be done to speed up or improve HTTP? Is it an
> implementation of the protocol, or the fact that the protocol itself is
> chattier than the git: protocol? Can we optimise JGit somehow?

It isn't a claim, it's a fact.

The http:// support in git is implemented by assuming no Git specific
knowledge on the server side. Instead we treat the server as a dumb
peer that can only respond to standard HTTP/1.0 GET requests.

So, when you issue "git fetch http://.. master" (get the current
version of the master branch of that repository) the client goes
something like this:

GET $URL/info/refs
GET $URL/objects/??/.......
GET $URL/objects/??/.......
GET $URL/objects/??/.......
GET $URL/objects/??/.......
GET $URL/objects/??/.......
GET $URL/objects/??/.......
GET $URL/objects/??/.......
GET $URL/objects/??/.......
GET $URL/objects/??/.......
GET $URL/objects/??/.......
GET $URL/objects/??/.......
GET $URL/objects/??/.......
GET $URL/objects/??/.......
GET $URL/objects/??/.......
GET $URL/objects/??/.......
GET $URL/objects/??/.......
GET $URL/objects/??/.......
GET $URL/objects/??/.......

Here it's downloading one object at a time. In the second GET, it
downloaded the commit that is at the tip of the master branch. It
parses that to discover what the top level directory object name is.
Then it requests that in the third GET. It parses that file listing,
and starts requesting the file or subtree for each of those entries, one
at a time.

Eventually, it fails with a 404. Or a 200 OK with some HTML message
saying "Dude, not found".

At which point it turns around and starts looking at the pack files:

GET $URL/objects/info/packs
GET $URL/objects/pack/pack-??.idx
GET $URL/objects/pack/pack-??.idx
GET $URL/objects/pack/pack-??.idx
GET $URL/objects/pack/pack-??.idx
GET $URL/objects/pack/pack-??.idx

Until it has downloaded an index file which says that the corresponding
pack file contains that missing object. Then it downloads that pack file:

GET $URL/objects/pack/pack-??.pack

This process repeats until the download is complete, or its unable to
download a necessary object.

It isn't uncommon for an HTTP fetch of 100 or 200 changes to turn into
thousands of GET requests. Although the client supports HTTP/1.1
pipelining, there isn't a lot of parallelism available as the next
object to obtain requires information from the object currently being
requested just to make that next request.


The git:// protocol on the other hand uses a direct TCP stream and has
both sides send basic state information to each other in a
bi-directional handshake until they can agree upon a set of objects that
both peers have. Once that set is agreed upon, the sending side can
compute a set difference and stream everything that the receiver does
not yet have.

In practice this only takes a couple of round trips. A massive
difference compared to 1000+ HTTP round trips.

Also, since the sender only transmits *exactly* what the client is
missing, and does so by transferring deltas whenever possible, the
overall data transfer is quite a bit smaller than what occurs using HTTP.


Pushing over HTTP uses WebDAV, and otherwise assumes the remote server
is any standard WebDAV filesystem. You probably could push a Git
repository into an SVN repository that way, treating the SVN repository
as a WebDAV backend.

Since the WebDAV backend is assumed to have no capabilities beyond those
of storing files, the git push client is forced to transmit whole data
objects, just like the git fetch client is forced to download whole data
objects.


Last August we had a mailing list thread about improving HTTP
performance by installing a Git specific extension on the HTTP server.
For example, by creating a "gitserver.cgi" that could be invoked through
the CGI standard. Easier to install than mod_svn in Apache, and could
be installed alongside gitweb.cgi if the site administrator wanted to.

The mailing list thread is here:

http://thread.gmane.org/gmane.comp.version-control.git/91104

The RFC, as of the last time I touched it, is now online in my
fastimport fork:

http://repo.or.cz/w/git/fastimport.git?a=blob;f=Documentation/technical/http-protocol.txt;hb=smart-http

I haven't had time to work on it in months.

The git.git C code for HTTP support is difficult to work with, though it
has recently been improved over the past couple of months. It may be
easier to prototype something in JGit, but whatever gets implemented
needs to also be implemented in git.git eventually, as users will demand it.

Mercurial has a more efficient HTTP protocol. They require a custom
Mercurial HTTP server, but if that custom server is in place then their
protocol's efficiency generally matches that of git://. They also
support a dumb HTTP approach, like the one I described for Git above, but I
hear people avoid it like the plague because of the same performance
problems.
Re: Git HTTP protocol/improvements? [message #571527 is a reply to message #3687] Thu, 23 April 2009 15:17
Alex Blewitt
Shawn Pearce wrote:
> The http:// support in git is implemented by assuming no Git specific
> knowledge on the server side. Instead we treat the server as a dumb
> peer that can only respond to standard HTTP/1.0 GET requests.
>
> So, when you issue "git fetch http://.. master" (get the current
> version of the master branch of that repository) the client goes
> something like this:

I can see how that would be slow :-)

I think a generic REST-style API to the server would be a useful protocol to
define, which then could be implemented in (Java,C) and then accessed by
(Java,C).

My forte is not C though; I'd be much more comfortable putting together a
servlet-based (JGit backed?) http-git-server implementation to iron out the
ideas and kinks; and if it flies, then maybe back-porting that to the C git
implementation and/or mod_git of sorts?

Alex
Re: Git HTTP protocol/improvements? [message #571557 is a reply to message #3718] Thu, 23 April 2009 18:01
Shawn O. Pearce
Alex Blewitt wrote:
> Shawn Pearce wrote:
>> The http:// support in git is implemented by assuming no Git specific
>> knowledge on the server side.
>
> I think a generic REST-style API to the server would be a useful protocol to
> define, which then could be implemented in (Java,C) and then accessed by
> (Java,C).

We should be careful here. I've been told proxy servers don't like HTTP
methods other than GET or POST. So those "fancy" methods like "PUT" are
just too much for some proxy servers to handle.

So embedding into POST is probably the safest approach.

> My forte is not C though; I'd be much more comfortable putting together a
> servlet-based (JGit backed?) http-git-server implementation to iron out the
> ideas and kinks; and if it flies, then maybe back-porting that to the C git
> implementation and/or mod_git of sorts?

Right.

But the C folks would probably prefer a CGI over mod_git. The C
implementation isn't suitable for running in long-lived processes, or a
server process that still needs to return a response to a client in the
face of an error.
Re: Git HTTP protocol/improvements? [message #571596 is a reply to message #4086] Sun, 26 April 2009 08:35
Alex Blewitt
Shawn Pearce wrote:
> Alex Blewitt wrote:
>> Shawn Pearce wrote:
>>> The http:// support in git is implemented by assuming no Git specific
>>> knowledge on the server side.
>>
>> I think a generic REST-style API to the server would be a useful protocol
>> to define, which then could be implemented in (Java,C) and then accessed
>> by (Java,C).
>
> We should be careful here. I've been told proxy servers don't like HTTP
> methods other than GET or POST. So those "fancy" methods like "PUT" are
> just too much for some proxy servers to handle.

Actually, the limitation on PUT is more to do with the client rather than
proxies. In any case, WebDAV uses PUT to upload content; so if a WebDAV based
solution works, it's not going to make any difference.

But REST is more than what HTTP methods you use; it's about designing
resources around URIs. The key thing here is to get a resource which allows us
to navigate from the tip of a branch back to its ancestors, instead of a
single HTTP-round-trip to do each of those.

> But the C folks would probably prefer a CGI over mod_git. The C
> implementation isn't suitable for running in long-lived processes, or a
> server process that still needs to return a response to a client in the
> face of an error.

Fair enough. Even better, then. A JGit-backed Jetty server would be pretty
sweet; and if the protocol admits it, the same API could be re-used to
provide a web-based view a la viewvc in AJAX.

Anyway, I'm going to give that a go now - the .gitignore UI addition is done
and waiting to be applied, so I'll switch tack and start investigating the
HTTP optimisation. Once we've verified it from a pure Java perspective, we can
look at other clients implementing the same HTTP protocol.

Incidentally, this is good timing - Google Code just announced support for Hg
as their DVCS (not surprisingly, since they're a Python shop) but did single
out Git's poor HTTP performance as one of the disadvantages.

http://google-opensource.blogspot.com/2009/04/distributed-version-control-for-project.html


Alex
Re: Git HTTP protocol/improvements? [message #571630 is a reply to message #3687] Tue, 28 April 2009 08:43
Alex Blewitt
Shawn Pearce wrote:
> So, when you issue "git fetch http://.. master" (get the current
> version of the master branch of that repository) the client goes
> something like this:
>
> GET $URL/info/refs
> GET $URL/objects/??/.......
> GET $URL/objects/??/.......

So, with my limited understanding of the Git format, the 'info/refs'
would correspond to a directory in .git/info/refs (except I can't find it).
However, there's a refs/heads/master which contains a string like
ee933d31d2ca4a4270aa9f4be6e60beec388e8af, which would then map to a file
in objects/ee/933d31d2ca4a4270aa9f4be6e60beec388e8af.
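That mapping is purely mechanical: the first two hex digits become a directory name and the remaining 38 the file name. A minimal sketch in Java (the class and method names here are mine, not JGit's):

```java
public class LoosePath {
    // A loose object lives at objects/<first 2 hex chars>/<remaining 38 chars>
    // relative to the .git directory.
    static String looseObjectPath(String sha1) {
        return "objects/" + sha1.substring(0, 2) + "/" + sha1.substring(2);
    }

    public static void main(String[] args) {
        // Prints objects/ee/933d31d2ca4a4270aa9f4be6e60beec388e8af
        System.out.println(looseObjectPath("ee933d31d2ca4a4270aa9f4be6e60beec388e8af"));
    }
}
```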

Presumably this is some kind of one-way linked list structure, so if I knew
how to open/parse this file, I'd then find another reference like
9bed0610017d97b6fd3fb19a5256646f4d2399e4 which in turn would take me to
objects/9b/ed0610017d97b6fd3fb19a5256646f4d2399e4 and so on.

If that's the case, then calculating the list of hashes for a branch would
be a case of following refs/heads/master through to build up a list like:

ee933d31d2ca4a4270aa9f4be6e60beec388e8af
9bed0610017d97b6fd3fb19a5256646f4d2399e4

So, creating a URL that looked like:

GET $URL/webgit/ee933d31d2ca4a4270aa9f4be6e60beec388e8af

could load/process the refs and produce a JSON representation like:

[
"ee933d31d2ca4a4270aa9f4be6e60beec388e8af",
"9bed0610017d97b6fd3fb19a5256646f4d2399e4",
...
]

That would solve a bunch of the round-trips up front and then allow the client
to start downloading the packs in parallel (or at least the subset of them
that it needed).

So, how do I go about opening/parsing the objects/ file? I guess there's
something in the JGit stuff that would help here, but I don't know the
terminology that is used to describe the various files in the directory.

Alex
Re: Git HTTP protocol/improvements? [message #571683 is a reply to message #4212] Tue, 28 April 2009 15:10
Originally posted by: j16sdiz.gmail.com

Alex Blewitt wrote:
> Shawn Pearce wrote:
>> So, when you issue "git fetch http://.. master" (get the current
>> version of the master branch of that repository) the client goes
>> something like this:
>>
>> GET $URL/info/refs
>> GET $URL/objects/??/.......
>> GET $URL/objects/??/.......
>
> So, with my limited understanding of the Git format, the 'info/refs'
> would correspond to a directory in .git/info/refs (except I can't find it).

info/refs is generated by `git update-server-info`
(or, sometimes, `git repack`)

> However, there's a refs/heads/master which contains a string like
> ee933d31d2ca4a4270aa9f4be6e60beec388e8af, which would then map to a file
> in objects/ee/933d31d2ca4a4270aa9f4be6e60beec388e8af.

without info/refs, git won't know refs/heads/master:
- 'master' is just an arbitrary name, it can be anything.
- plain old HTTP does not support file listing, so
we need a list of available refs.

[..]
Re: Git HTTP protocol/improvements? [message #571703 is a reply to message #4212] Tue, 28 April 2009 16:58
Shawn O. Pearce
Alex Blewitt wrote:
> Shawn Pearce wrote:
>> So, when you issue "git fetch http://.. master" (get the current
>> version of the master branch of that repository) the client goes
>> something like this:
>>
>> GET $URL/info/refs
>> GET $URL/objects/??/.......
>> GET $URL/objects/??/.......
>
> So, with my limited understanding of the Git format, the 'info/refs'
> would correspond to a directory in .git/info/refs (except I can't find it).

Yea, like Daniel Cheng said, you need to run `git update-server-info`
here to get .git/info/refs created. Normally this is run by `git gc`,
or by a post-update hook under .git/hooks/post-update. It is only
needed by the HTTP support, so normally the file doesn't exist unless
you are serving this repository over HTTP.

> However, there's a refs/heads/master which contains a string like
> ee933d31d2ca4a4270aa9f4be6e60beec388e8af, which would then map to a file
> in objects/ee/933d31d2ca4a4270aa9f4be6e60beec388e8af.

Yes. info/refs is just a union catalog of the packed-refs file, and the
recursive contents of refs/. As Daniel Cheng pointed out, HTTP lacks a
generic "directory listing" mechanism so info/refs provides a catalog.
It could just have been a catalog of the file names under refs/, but it
also contains the SHA-1s to try and remove a bunch of round-trips in the
common case of "Nothing changed".
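For illustration, each line of info/refs is just a SHA-1, a tab, and a ref name; a hypothetical catalog (the ids below are invented) would look like:

```
ee933d31d2ca4a4270aa9f4be6e60beec388e8af	refs/heads/master
9bed0610017d97b6fd3fb19a5256646f4d2399e4	refs/tags/v0.1
```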

> Presumably this is some kind of one-way linked list structure, so if I knew
> how to open/parse this file, I'd then find another reference like
> 9bed0610017d97b6fd3fb19a5256646f4d2399e4 which in turn would take me to
> objects/9b/ed0610017d97b6fd3fb19a5256646f4d2399e4 and so on.

Yup, exactly.

> If that's the case, then calculating the list of hashes for a branch would
> be a case of following refs/heads/master through to build up a list like:
>
> ee933d31d2ca4a4270aa9f4be6e60beec388e8af
> 9bed0610017d97b6fd3fb19a5256646f4d2399e4
>
> So, creating a URL that looked like:
>
> GET $URL/webgit/ee933d31d2ca4a4270aa9f4be6e60beec388e8af
>
> could load/process the refs and produce a JSON representation like:
>
> [
> "ee933d31d2ca4a4270aa9f4be6e60beec388e8af",
> "9bed0610017d97b6fd3fb19a5256646f4d2399e4",
> ...
> ]

Eeeeek. No.

Well, yes, in theory you can do this. But I think it's a bad idea.

Assuming the Linux kernel repository, this listing would need to be a
JSON list of 1,174,664 SHA-1 values. That's more than 48.17 MiB of text
to transfer.

Really what you want is to have the client and server negotiate on a
common ancestor; some commit or tree that they both contain. Once that
common ancestor is found, *then* you can talk about sending that list of
object identities to the client, as now it's only a subset of that 1
million object list.

Since the object identity can be recovered from the object data (just
run SHA-1 over it after decompression) there actually is no reason to
send the object identities to the client. Instead, we should just have
the server send the object data for that group of objects that the
client doesn't yet have, but has told the server it wants to have.

This is fundamentally how the fetch-pack/upload-pack protocol used by
`git fetch` over git:// and ssh:// works.

> That would solve a bunch of the round-trips up front and then allow the client
> to start downloading the packs in parallel (which it would need, or at least,
> the subset of them that it needed).

Ideally, the round trips should be just 1 for the *entire* data
transfer. And then we're just looking at the round trips required to
negotiate the common ancestor point.

> So, how do I go about opening/parsing the objects/ file? I guess there's
> something in the JGit stuff that would help here,

Yes, yes it would. See WalkFetchConnection in JGit. It's quite a bit of
code, but it handles downloading both the loose objects from the
objects/ directory and pack files from the objects/pack/ directory, and
parsing each of the 4 basic object types (commit, tree, tag, blob) in
order to determine any more pointers that must be followed.

> but I don't know the
> terminology that is used to describe the various files in the directory.

Loose objects are the things under objects/??/. Packs are the things
under objects/pack/pack-*.pack. A pack is something like a ZIP file: it
contains multiple compressed objects in a single file stream. The
corresponding pack-*.idx file contains a directory to support efficient
O(log N) access time to any object within that pack.
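The real idx format is binary, but the lookup idea is just a binary search over sorted object names; a toy sketch where short strings stand in for the 20-byte ids:

```java
import java.util.Arrays;

public class IdxLookup {
    // pack-*.idx keeps the object names sorted, so locating any one of
    // them takes O(log N) comparisons.
    static int find(String[] sortedIds, String id) {
        int pos = Arrays.binarySearch(sortedIds, id);
        return pos < 0 ? -1 : pos; // -1 means: not in this pack
    }

    public static void main(String[] args) {
        String[] ids = { "1a00", "9bed", "ee93" }; // already sorted
        System.out.println(find(ids, "9bed")); // 1
        System.out.println(find(ids, "ffff")); // -1
    }
}
```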

Two different encodings are used for the data. The loose objects are
deflated with libz, but are otherwise the complete file content, they
never store a delta. The packed objects can be stored either as the
full content but deflated with libz, or they can be stored as a delta
relative to another object in the same pack file.
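A sketch of the loose-object encoding just described, using the JDK's zlib streams (the class and method names are mine; only the blob type is handled, for brevity):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

public class LooseBlob {
    // A loose object is deflate("<type> <size>\0" + content); never a delta.
    static byte[] encode(byte[] body) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DeflaterOutputStream z = new DeflaterOutputStream(out)) {
            z.write(("blob " + body.length + "\0").getBytes(StandardCharsets.UTF_8));
            z.write(body);
        }
        return out.toByteArray();
    }

    // Inflate and strip the "<type> <size>\0" header to get the content back.
    static byte[] decode(byte[] stored) throws IOException {
        InflaterInputStream z =
            new InflaterInputStream(new ByteArrayInputStream(stored));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int b; (b = z.read()) != -1; ) out.write(b);
        byte[] raw = out.toByteArray();
        int nul = 0;
        while (raw[nul] != 0) nul++;
        byte[] body = new byte[raw.length - nul - 1];
        System.arraycopy(raw, nul + 1, body, 0, body.length);
        return body;
    }

    public static void main(String[] args) throws IOException {
        byte[] body = "hello\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(new String(decode(encode(body)), StandardCharsets.UTF_8));
    }
}
```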

For reference, see these documents:

http://book.git-scm.com/7_how_git_stores_objects.html
http://book.git-scm.com/7_browsing_git_objects.html
http://book.git-scm.com/7_the_packfile.html
http://www.gelato.unsw.edu.au/archives/git/0608/25286.html
http://www.kernel.org/pub/software/scm/git/docs/technical/pack-format.txt
http://www.kernel.org/pub/software/scm/git/docs/technical/pack-heuristics.txt

also, some data about the current fetch-pack/upload-pack protocol:

http://book.git-scm.com/7_transfer_protocols.html
http://www.kernel.org/pub/software/scm/git/docs/technical/pack-protocol.txt
Re: Git HTTP protocol/improvements? [message #571740 is a reply to message #4282] Tue, 28 April 2009 18:50
Alex Blewitt
Daniel Cheng wrote:
> Alex Blewitt wrote:
>> Shawn Pearce wrote:
>>> So, when you issue "git fetch http://.. master" (get the current
>>> version of the master branch of that repository) the client goes
>>> something like this:
>>>
>>> GET $URL/info/refs
>>> GET $URL/objects/??/.......
>>> GET $URL/objects/??/.......
>>
>> So, with my limited understanding of the Git format, the 'info/refs'
>> would correspond to a directory in .git/info/refs (except I can't find it).
>
> info/refs is generated by `git update-server-info`
> (or, sometimes, `git repack`)

Ah, thanks.

>> However, there's a refs/heads/master which contains a string like
>> ee933d31d2ca4a4270aa9f4be6e60beec388e8af, which would then map to a file
>> in objects/ee/933d31d2ca4a4270aa9f4be6e60beec388e8af.
>
> without info/refs, git won't know refs/heads/master:
> - 'master' is just an arbitrary name, it can be anything.
> - plain old HTTP does not support file listing, so
> we need a list of available refs.

OK. This could be something computed dynamically by a server-side process,
rather than regenerated in batches, too. Plus, WebDAV supports directory
listings (though that isn't vanilla HTTP). Do we support that if available?

Alex
Re: Git HTTP protocol/improvements? [message #571775 is a reply to message #4352] Tue, 28 April 2009 18:50
Alex Blewitt
Shawn Pearce wrote:
> Alex Blewitt wrote:
>> Shawn Pearce wrote:
>>> So, when you issue "git fetch http://.. master" (get the current
>>> version of the master branch of that repository) the client goes
>>> something like this:
>>>
>>> GET $URL/info/refs
>>> GET $URL/objects/??/.......
>>> GET $URL/objects/??/.......
>>
>> So, with my limited understanding of the Git format, the 'info/refs'
>> would correspond to a directory in .git/info/refs (except I can't find it).
>
> Yea, like Daniel Cheng said, you need to run `git update-server-info`
> here to get .git/info/refs created. Normally this is run by `git gc`,
> or by a post-update hook under .git/hooks/post-update. It is only
> needed by the HTTP support, so normally the file doesn't exist unless
> you are serving this repository over HTTP.

Right. And the only reason we need this is to support HTTP then.

>> could load/process the refs and produce a JSON representation like:
>>
>> [
>> "ee933d31d2ca4a4270aa9f4be6e60beec388e8af",
>> "9bed0610017d97b6fd3fb19a5256646f4d2399e4",
>> ...
>> ]
>
> Eeeeek. No.
>
> Well, yes, in theory you can do this. But I think its a bad idea.
>
> Assuming the Linux kernel repository, this listing would need to be a
> JSON list of 1,174,664 SHA-1 values. That's more than 48.17 MiB of text
> to transfer.

OK. But that's assuming a whole-world change, right? The URL doesn't have to
generate the entire collection of trees from the beginning (in the same way
that 'git log' pages its output). We could limit it to an
arbitrary/fixed/user-requestable paging figure, so:

C: GET $URL/aaa
S: [
"aaa.."
"bbb.."
"ccc.."
...
"mmm.."
]
C: GET $URL/mmm
S: [
"mmm.."
"nnn.."
"ooo.."
]

Assuming a relatively recent change was a common ancestor, you'd probably get
it in the first couple of pages of requests.

> Really what you want is to have the client and server negotiate on a
> common ancestor; some commit or tree that they both contain.

As a matter of interest, is the hash assumed to be unique for all commits over
time? In other words, if I find "ooo..." in the server response, and I too
have "ooo..." in my client tree, then is that de facto the common ancestor?
Are there any chances that the "ooo..." could be the same hash but a
completely different part of the tree?

> Since the object identity can be recovered from the object data (just
> run SHA-1 over it after decompression) there actually is no reason to
> send the object identities to the client. Instead, we should just have
> the server send the object data for that group of objects that the
> client doesn't yet have, but has told the server it wants to have.

OK, so the same mechanism could be used to upload the hashes of the identities
to the server, right?

>> but I don't know the
>> terminology that is used to describe the various files in the directory.
>
> For reference, see these documents:

Thanks, I'll take a while to peruse and understand them.

On the subject of dependencies; writing a web app is going to require some
kind of server support. I was thinking of using Jetty, now it's under the
Eclipse banner. Is there any reason why we can't use other EPL in the
server-side part of this component?

For the client side, I hope the protocol will be easy enough to add into
(say) JGit as a BSD implementation instead of having to bring in other
dependencies. I assume the reason why we are not using (say) Apache Commons
Net is to avoid any extra dependencies?

Alex
Re: Git HTTP protocol/improvements? [message #571807 is a reply to message #4493] Tue, 28 April 2009 20:10
Shawn O. Pearce
Alex Blewitt wrote:
> Shawn Pearce wrote:
>> Assuming the Linux kernel repository, this listing would need to be a
>> JSON list of 1,174,664 SHA-1 values. That's more than 48.17 MiB of text
>> to transfer.
>
> OK. But that's assuming a whole-world change, right? The URL doesn't have to
> generate the entire collection of trees from the beginning (in the same way
> that 'git log' pages its output). We could limit it to an
> arbitrary/fixed/user-requestable paging figure, so:
>
> C: GET $URL/aaa
> S: [
> "aaa.."
> "bbb.."
> "ccc.."
> ...
> "mmm.."
> ]
> C: GET $URL/mmm
> S: [
> "mmm.."
> "nnn.."
> "ooo.."
> ]

Ugh.

So, what if the whole world was being downloaded for the first time?
(Initial clone of a project over HTTP.) How many "pages" would I need
for the Linux kernel's 1,174,664 values?

How do you define the boundary for a page?

The most recent commit in the Linux kernel has 27,829+ objects in it.
Probably closer to 30,000 when you include all of the directories.
That's just that first commit. How many objects did you want to put per
page?

You are thinking about this all wrong. You seriously can't do what you
are suggesting and still get good performance, for either an initial
clone, or for an incremental update.

> Assuming a relatively recent change was a common ancestor, you'd probably get
> it in the first couple of pages of requests.

Sure. That's the point of the negotiation that currently takes place: you
want to find that common ancestor in some small number of round trips.

> As a matter of interest, is the hash assumed to be unique for all commits over
> time?

Yes.

> In other words, if I find "ooo..." in the server response, and I too
> have "ooo..." in my client tree, then is that de facto the common ancestor?

Yes.

> Are there any chances that the "ooo..." could be the same hash but a
> completely different part of the tree?

No.

>> Since the object identity can be recovered from the object data (just
>> run SHA-1 over it after decompression) there actually is no reason to
>> send the object identities to the client. Instead, we should just have
>> the server send the object data for that group of objects that the
>> client doesn't yet have, but has told the server it wants to have.
>
> OK, so the same mechanism could be used to upload the hashes of the identities
> to the server, right?

You aren't seriously suggesting that we take the object data, which is
usually larger than 40 bytes, and upload it to the server, just to send
the server a 40 byte token saying "I have this object"?

> On the subject of dependencies; writing a web app is going to require some
> kind of server support.

I would try to stick to the J2EE servlet specification, so that any
servlet container can be used.

> I was thinking of using Jetty, now it's under the
> Eclipse banner.

Sure.

But I'd also like to let people deploy under any other servlet container.

Seriously, how much "server side support" do you need to speak this
protocol? You need to register something with the container to handle
POST, that's a subclass of HttpServlet. You need InputStream to read
that POST body, that's the HttpServletRequest.getInputStream(). You
need an OutputStream to send a response, that's the
HttpServletResponse.getOutputStream(). That's the J2EE servlet
specification, circa 1999.

After that, everything should be available in JGit, as it's all Git
specific. I see no reason to tie this to Jetty, even if Jetty is under
the Eclipse banner (which I think is great).

> Is there any reason why we can't use other EPL in the
> server-side part of this component?

I'd rather not.

See my remark above about how you really shouldn't need anything that
isn't already in JGit, or that can't be trivially reimplemented in JGit,
to support this.

Remember that to maximize use of this HTTP protocol you also need to
implement both client and server in git.git, the canonical C
implementation, which use a GPLv2 license *ONLY*. If you try to port
EPL "support libraries" to C they won't be accepted into git.git because
EPL isn't compatible to be linked with GPLv2.

> For the client side, I hope the protocol will be easy enough to add into
> (say) JGit as a BSD implementation instead of having to bring in other
> dependencies.

It's not a hope, it's a requirement. Robin and I won't merge something into
JGit that isn't a BSD implementation, or that requires non-BSD or non-MIT
dependencies. Actually, we try quite hard not to add any additional
dependencies to JGit.

> I assume the reason why we are not using (say) Apache Commons
> Net is to avoid any extra dependencies?

Yup, exactly. Although the Apache License 2.0 plays nicely with BSD and
EPL, we don't use Apache Commons Net because it's overkill for what we
need, and it's yet another dependency.

We only depend upon JSch because there was no other Java based SSH
client implementation at the time, its license is very acceptable (also
BSD), and rewriting it would be very time consuming.
Re: Git HTTP protocol/improvements? [message #571840 is a reply to message #4563] Tue, 28 April 2009 21:03
Alex Blewitt
> Shawn Pearce wrote:
> > Alex Blewitt wrote:
> So, what if the whole world was being downloaded for the first time?
> (Initial clone of a project over HTTP.) How many "pages" would I need
> for the Linux kernel's 1,174,664 values?

How does it work for the GIT protocol at the moment? I was under the
impression that the client would download the SHA1 names in any case.
Obviously the initial clone could probably be handled in a more optimised
manner if needed.

> How do you define the boundary for a page?

It could be part of the URL, for example .../aaaa/2/50 (for 2nd page of 50)

> The most recent commit in the Linux kernel has 27,829+ objects in it.
> Probably closer to 30,000 when you include all of the directories.
> That's just that first commit. How many objects did you want to put per
> page?

I suspect it's probably going to take some measurement to find out what the
optimal number(s) are. Ideally, you'd fetch the majority (recent updates) of
commits in a single hit, but I frankly don't know much at this stage - I'm
just exploring ideas.

> You are thinking about this all wrong. You seriously can't do what you
> are suggesting and still get good performance, for either an initial
> clone, or for an incremental update.

I'm exploring ideas. I'm bound to explore more bad ones than good ones in
order to get there :-)

> You aren't seriously suggesting that we take the object data, which is
> usually larger than 40 bytes, and upload it to the server, just to send
> the server a 40 byte token saying "I have this object"?

No, I was only suggesting submitting the hashes as part of the handshake to
find the common ancestor and/or what the client and server both have.

>> On the subject of dependencies; writing a web app is going to require some
>> kind of server support.
>
> I would try to stick to the J2EE servlet specification, so that any
> servlet container can be used.

Yes, that is the plan.

>> I was thinking of using Jetty, now it's under the
>> Eclipse banner.
>
> Sure.
>
> But I'd also like to let people deploy under any other servlet container.

Agreed. I was just thinking of having a downloadable server, like Hudson,
which can be executed with java -jar webgit.jar as well as being installed
into other servers (Tomcat etc.)

> After that, everything should be available in JGit, as its all Git
> specific. I see no reason to tie this to Jetty, even if Jetty is under
> the Eclipse banner (which I think is great).

I didn't mean to tie it in at the code level, just as a way of
downloading/running it.

> Remember that to maximize use of this HTTP protocol you also need to
> implement both client and server in git.git, the canonical C
> implementation, which use a GPLv2 license *ONLY*. If you try to port
> EPL "support libraries" to C they won't be accepted into git.git because
> EPL isn't compatible to be linked with GPLv2.

Agreed. The plan is to evolve a protocol whose client can be implemented in C
without needing any other aspects.

Alex
Re: Git HTTP protocol/improvements? [message #571872 is a reply to message #4632] Tue, 28 April 2009 21:30
Shawn O. Pearce
Alex Blewitt wrote:
>> Shawn Pearce wrote:
>>> Alex Blewitt wrote:
>> So, what if the whole world was being downloaded for the first time?
>> (Initial clone of a project over HTTP.) How many "pages" would I need
>> for the Linux kernel's 1,174,664 values?
>
> How does it work for the GIT protocol at the moment?

Look at the links I sent earlier today about the fetch-pack/upload-pack
protocol. Basically the exchange goes something like this for an
initial clone of the whole world:

C: CONNECT
S: here's the list of refs I have, and their current SHA-1 values

C: want deadbeef...
C: want aaaabbbb...
C: <END>

S: PACK...compressed data for the entire project...

For an incremental update:

C: CONNECT
S: here's the list of refs I have, and their current SHA-1 values

C: want deadbeef...
C: want aaaabbbb...
C: have 1831123...
C: have asd813c...
... up to 32 more have lines ...

S: ACK 1831123...
S: NAK asd813c...

C: <END>

S: PACK...compressed data for the incremental update...

The want lines are the client saying, "I don't have X, but you said in
your initial advertisement that you have it, so give it to me". The
client selected these SHA-1s out of the initial advertisement by first
looking to see if the object exists on disk; if it doesn't but its
corresponding ref is in the pattern of refs the client was instructed to
fetch (e.g. fetch = refs/heads/* in .git/config) then the client "wants"
that SHA-1.

The have lines are the client listing every commit it knows about,
starting from the most recent revisions the client has, walking
backwards in time through the project history.

have lines are sent by the client in batches of 32, with up to 2 batches
in flight at a time.

The server sends ACK lines to let the client know that the server also
has that object, and thus that the client can stop enumerating history
reachable from that point in time. This is a potential common ancestor.
There may be multiple, due to different side branches being active on
both sides.

The server sends NAK lines to let the client know it doesn't have a
particular object. Such objects are unique to the client (e.g. commits
you created but haven't published to anyone, or commits you got from
some other repository that this repository hasn't communicated with).
On these objects the client goes further backwards in that history to
look for another possible match.
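The effect of those ACK/NAK rounds can be modelled as a simple search (a toy model only; the real exchange is batched and windowed as described above):

```java
import java.util.List;
import java.util.Set;

public class Negotiate {
    // The client offers its commits newest-first as "have" lines; the first
    // one the server ACKs is a potential common ancestor. NAKed commits are
    // unique to the client, so it keeps walking backwards in history.
    static String commonAncestor(List<String> clientNewestFirst,
                                 Set<String> serverObjects) {
        for (String have : clientNewestFirst) {
            if (serverObjects.contains(have)) return have; // ACK
            // NAK: continue further back in history
        }
        return null; // nothing in common: server must send everything
    }

    public static void main(String[] args) {
        List<String> client = List.of("c3", "c2", "c1"); // c3 is newest
        Set<String> server = Set.of("c2", "c1", "s9");
        System.out.println(commonAncestor(client, server)); // c2
    }
}
```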

That's a simplification of it, but the rough idea. See the links I
pointed you to and BasePackFetchConnection in JGit for the Java
implementation of this client, and transport.UploadPack for the server
side implementation of this protocol.

> I was under the
> impression that the client would download the SHA1 names in any case.

No, we don't transfer the SHA-1 names of the objects the client is going
to download. Instead, the client computes them on the fly from the data
it receives. This is actually a safety measure, it allows the client to
verify the data received matches the signature it expects.
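That recomputation is just SHA-1 over a small header plus the uncompressed content; a sketch for the blob case, using the JDK's MessageDigest (class and method names are mine):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class GitObjectId {
    // A git object name is SHA-1 over "<type> <size>\0" + uncompressed content.
    static String id(String type, byte[] body) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        md.update((type + " " + body.length + "\0").getBytes(StandardCharsets.UTF_8));
        md.update(body);
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) hex.append(String.format("%02x", b));
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // Same result as `echo hello | git hash-object --stdin`
        System.out.println(id("blob", "hello\n".getBytes(StandardCharsets.UTF_8)));
        // ce013625030ba8dba906f756967f9e9ca394464a
    }
}
```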

A really paranoid client performs a full check of the object pointers
too. Checking the tips of what you fetched (the things you want'd in
the protocol) all the way back to the common base (the things you and
the server agreed upon as existing) validates that the entire data
stream is what you believed it should be.

It's more about data integrity during transit and against broken Git
tools than filtering out an evil MITM attack.

For the JGit validation code see transport.IndexPack, ObjectChecker, and
FetchProcess.fetchObjects() for the client side, or
transport.ReceivePack.checkConnectivity() for the server side.

>>> I was thinking of using Jetty, now it's under the
>>> Eclipse banner.
>>
>> But I'd also like to let people deploy under any other servlet container.
>
> Agreed. I was just thinking of having a downloadable server, like Hudson,
> which can be executed with java -jar webgit.jar as well as being installed
> into other servers (Tomcat etc.)

Oh, yea, that's awesome. Jetty is quite embeddable and is under a good
license for this sort of binary redistribution build. But that is an
unrelated goal to a more efficient HTTP support in Git. Jetty has made
it really easy for anyone to roll a servlet into a simple downloadable
JAR. My point is, anyone can roll that distribution. But I'm not
against having it as an eventual downloadable product once both JGit and
Jetty have exited incubating status and can make formal releases.
Re: Git HTTP protocol/improvements? [message #571911 is a reply to message #4701] Tue, 28 April 2009 21:55
Alex Blewitt
Shawn Pearce wrote:
> Alex Blewitt wrote:
>>> Shawn Pearce wrote:
>>>> Alex Blewitt wrote:
>>> So, what if the whole world was being downloaded for the first time?
>>> (Initial clone of a project over HTTP.) How many "pages" would I need
>>> for the Linux kernel's 1,174,664 values?
>>
>> How does it work for the GIT protocol at the moment?
>
> Look at the links I sent earlier today about the fetch-pack/upload-pack
> protocol. Basically the exchange goes something like this for an
> initial clone of the whole world:

OK, so on the initial clone, we just say "give me everything reachable from
'deadbeef'" without caring what those happen to be.

In the case of an incremental, we have a subset of things the server (might)
be interested in, plus the 'everything from deadbeef' (which may include some
of the things we have). The server will know to only send deadbeef..common
ancestor(s). It works out the common ancestor(s) based on drilling down
through the final ACKs that we get of combined updates.

The reason we don't need SHAs is that once we've agreed on the download set
(from deadbeef to common ancestor including 1831123/asd813c/...) we just get
the data (from which we can reconstruct the SHAs).

> have lines are sent by the client in batches of 32, with up to 2 batches
> in flight at a time.

OK. I guess I had a similar idea of batching the SHA-1 earlier, but we don't
need to do that on the client; we should be able to compute it on the server.

> The server sends ACK lines to let the client know that the server also
> has that object, and thus that the client can stop enumerating history
> reachable from that point in time. This is a potential common ancestor.

Why only a potential common ancestor? I can imagine there being multiple of
them, rather than just 'the' one. I'm not sure how it might not be a common
ancestor, though.

> That's a simplification of it, but the rough idea. See the links I
> pointed you to and BasePackFetchConnection in JGit for the Java
> implementation of this client, and transport.UploadPack for the server
> side implementation of this protocol.

That's great - this has been very useful to me. I'll take a look at the Java
implementation a little more to see what I can do.

> No, we don't transfer the SHA-1 names of the objects the client is going
> to download. Instead, the client computes them on the fly from the data
> it receives. This is actually a safety measure, it allows the client to
> verify the data received matches the signature it expects.

OK, we get an implicit set of data rather than the SHAs. I was trying to find
out how we could come to a common ancestor using the SHAs on the client side,
but a server-side computation can work just as well.

> Oh, yea, that's awesome. Jetty is quite embeddable and is under a good
> license for this sort of binary redistribution build. But that is an
> unrelated goal to a more efficient HTTP support in Git. Jetty has made
> it really easy for anyone to roll a servlet into a simple downloadable
> JAR. My point is, anyone can roll that distribution. But I'm not
> against having it as an eventual downloadable product once both JGit and
> Jetty have exited incubating status and can make formal releases.

Great.

Thanks again for the detailed response; now, it's over to me to start playing
around with it in code.

Alex
Re: Git HTTP protocol/improvements? [message #571968 is a reply to message #4772] Thu, 30 April 2009 18:54
Alex Blewitt
>Alex Blewitt wrote:
>> Shawn Pearce wrote:
>> Look at the links I sent earlier today about the fetch-pack/upload-pack
>> protocol. Basically the exchange goes something like this for an
>> initial clone of the whole world:
>
> OK, so on the initial clone, we just say "give me everything reachable from
> 'deadbeef'" without caring what those happen to be.

I think I'm getting closer to understanding what's going on. I'm going to
start throwing some code together over the weekend to try it out.

If I end up with a URL based on a current tip (like /webgit/pack/ab01230a0...)
then the contents that get served back can be the same format (pack) as with
the git protocol. This will be a useful proof of concept, as well as handling
the initial check-out case where you don't have anything. If you GET the
URL, you'll get everything reachable from the tip, whereas if you POST to it
(with some details to be worked out later) along the want/have kind
of lines of the git protocol, then it can send a subset instead.

One advantage of this is it should be possible to do something like curl
/webgit/pack/ab012340a | git receive-pack as a proof of concept without having
to change the C code, at least initially.

There's also no reason the webapp can't serve the info/refs as well, so that
it's dynamically calculated instead of being regenerated on each commit. We
could use some header flags to determine whether the server was smart or dumb
in what we request next.

The challenge I have is how to convert a tree identifier into a pack
structure, I suspect. Objects might already be in a packed structure, or they
might have to get packed on the server side. I'm also aware that that
operation (in Git, at least) can take a while. I'm not sure whether there are
HTTP timeouts that might be involved if the server takes too long to pack
something; it might be necessary to somehow send the packs back as chunks
instead of as a single big pack. From what I can infer, the git protocol uses
a series of status messages to indicate progress without any data on the
remote end, then switches over to sending the pack file as one big lump.

Alex
Re: Git HTTP protocol/improvements? [message #571998 is a reply to message #4839] Thu, 30 April 2009 20:23
Shawn O. Pearce
Alex Blewitt wrote:
> If I end up with a URL based on a current tip (like /webgit/pack/ab01230a0...)
> then the contents that get served back can be in the same format (pack) as with
> the git protocol. This will be a useful proof of concept, as well as handling
> the initial check-out case where you don't have anything. If you GET that URL,
> you'll get everything reachable from the tip, whereas if you POST to it (with
> some details to be worked out later) along the want/have lines of the git
> protocol, it can send a subset instead.

OK.

> One advantage of this is it should be possible to do something like curl
> /webgit/pack/ab012340a | git receive-pack as a proof of concept without having
> to change the C code, at least initially.

Actually, that should be git index-pack. I think you'd want something like:

mkdir foo
cd foo
git init
curl /webgit/pack/ab012340a | git index-pack --stdin --fix-thin in.pack
mv in.pack in.idx .git/objects/pack
git update-ref HEAD ab012340a

It should work, but it still doesn't quite get everything right. What's left
is minor, but still important, detail that can be worked out later.

> There's also no reason the webapp can't serve the info/refs as well, so that
> it's dynamically calculated instead of being regenerated on each commit. We
> could use some header flags to determine whether the server was smart or dumb
> in what we request next.

Yes. In that HTTP thread from last July/August I mentioned doing something
like that, and also having it compute objects/info/packs for dumb clients, in
case they are accessing a smart server and don't know any better (e.g. older
clients that predate the smart HTTP support).
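Rendering that info/refs listing on the fly is straightforward: the dumb-transport format is one "&lt;object-id&gt; TAB &lt;refname&gt; LF" line per ref. A minimal sketch (the class name and sample values are made up):

```java
import java.util.Map;

// Renders the dumb-transport info/refs listing on the fly: one
// "<object-id>\t<refname>\n" line per ref, instead of relying on a hook
// to regenerate the file on every commit. Keys are ref names, values
// are the object ids they point at.
public class InfoRefs {
    public static String format(Map<String, String> refs) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, String> ref : refs.entrySet())
            out.append(ref.getValue())  // object id first
               .append('\t')
               .append(ref.getKey())    // then the ref name
               .append('\n');
        return out.toString();
    }
}
```

A servlet could hand this to any dumb client while advertising smarter behaviour (via headers, say) to clients that know to look for it.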

> The challenge I have is how to convert a tree identifier into a pack
> structure, I suspect.

In JGit? Use a PackWriter. You feed preparePack the interestingObjects
(wants) and the uninterestingObjects (common base/haves, can be empty to
get the whole world) and it builds up a list of what to send. Then you
ask it to dump that to an OutputStream with writePack().

> Objects might already be in a packed structure, or they
> might have to get packed on the server side.

Yup. PackWriter automatically handles this distinction, taking objects from
wherever they happen to be stored.

> I'm also aware that that
> operation (in Git, at least) can take a while.

And JGit is no different. Worse, actually: it's in Java and isn't nearly as
optimized as C Git is.

> I'm not sure whether there are
> HTTP timeouts that might be involved if the server takes too long to pack
> something; it might be necessary to somehow send the packs back as chunks
> instead of as a single big pack.

Yes. That is a problem.

> From what I can infer, the git protocol uses
> a series of status messages to indiciate progress without any data on the
> remote end, then switches over to sending the pack file as one big lump.

Actually, that huge data lump is framed inside a multiplexing stream, with
the pack data in band #1 and progress messages in band #2. It's just that Git
stops sending progress messages once the data starts flowing. The progress
messages are just there so the end user doesn't abort while the server is
computing... and it can take a while for that computation to complete. Once
the computation is done, data transfer starts, and the user gets progress
messages from the client as it processes that data through `git index-pack`.
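That side-band framing is simple to parse: each pkt-line starts with a 4-digit hex length (which counts the 4 length bytes themselves), then a band byte (1 = pack data, 2 = progress, 3 = error), then the payload, with "0000" as the terminating flush-pkt. A minimal demultiplexer sketch, independent of any Git library:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class SideBandDemux {
    // Splits a side-band multiplexed stream into pack bytes and progress
    // text. Each pkt-line: 4 hex digits (total length, including those 4
    // bytes), one band byte (1 = pack data, 2 = progress, 3 = error),
    // then payload. "0000" is the flush-pkt that ends the stream.
    public static byte[] demux(byte[] in, StringBuilder progress) {
        ByteArrayOutputStream pack = new ByteArrayOutputStream();
        int i = 0;
        while (i + 4 <= in.length) {
            int len = Integer.parseInt(
                    new String(in, i, 4, StandardCharsets.US_ASCII), 16);
            if (len == 0)
                break; // flush-pkt: end of the multiplexed stream
            int band = in[i + 4] & 0xff;
            int payload = len - 5; // minus length header and band byte
            if (band == 1)
                pack.write(in, i + 5, payload); // pack data
            else if (band == 2)
                progress.append(new String(in, i + 5, payload,
                        StandardCharsets.UTF_8)); // progress message
            i += len;
        }
        return pack.toByteArray();
    }
}
```

This is why the client can show server-side progress right up until the pack data starts flowing: the two streams share one connection.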

In the case of JGit's PackWriter class, that server computation phase
happens inside of the preparePack(Collection,Collection) method.
Blocking in there for 2m30s while handling the Linux kernel repository
isn't unheard of. Once that method is complete, the caller switches to
writePack(), which starts writing data immediately.

In terms of HTTP timeouts, yes, 2m30s before any data transfer even starts
is a lot. And plain HTTP has no way to send progress messages to at least let
the user know the server is chugging away on their behalf.
Re: Git HTTP protocol/improvements? [message #572028 is a reply to message #4908] Thu, 30 April 2009 21:33
Alex Blewitt
> Shawn Pearce wrote:
>> Alex Blewitt wrote:
>> One advantage of this is it should be possible to do something like curl
>> /webgit/pack/ab012340a | git receive-pack as a proof of concept without
>> having to change the C code, at least initially.
>
> Actually, that should be git index-pack. I think you'd want something like:
>
> mkdir foo
> cd foo
> git init
> curl /webgit/pack/ab012340a | git index-pack --stdin --fix-thin in.pack
> mv in.pack in.idx .git/objects/pack
> git update-ref HEAD ab012340a


Thanks, that is useful. I might use something like that in testing later on.

>> There's also no reason the webapp can't serve the info/refs as well,
>
> Yes. If you had looked at that HTTP thread from last July/August I
> mentioned doing something like that.

Yup, that's where I remembered it from ;-)

>> The challenge I have is how to convert a tree identifier into a pack
>> structure, I suspect.
>
> In JGit? Use a PackWriter.

Thanks, I'll take a look at that. Sounds like it should be easy, provided that
I get the objects in the right order.

>> I'm not sure whether there are
>> HTTP timeouts that might be involved if the server takes too long to pack
>> something; it might be necessary to somehow send the packs back as chunks
>> instead of as a single big pack.
>
> Yes. That is a problem.

According to Apache HTTP docs, the timeout defaults to 300s.

http://httpd.apache.org/docs/2.2/mod/mod_proxy.html#proxytimeout
http://httpd.apache.org/docs/2.2/mod/core.html#timeout

There's always the possibility - for a simple GET request, at least - of
temporarily caching a generated pack file, so that if the client re-tries, it
might be available already. An alternative would be to throw back a different
error message and ask the client to acquire different subsets of the problem
instead (say, /webgit/pack/a1234...bc3456).
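That caching idea amounts to memoizing pack generation per requested tip, so a retry after a timeout finds the finished result. A minimal sketch (the class name and the byte[] payloads are illustrative, not JGit API; a real server would also bound and evict the cache):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

// Memoizes expensive pack generation per requested tip: the first GET kicks
// off generation; if the client times out and retries, the retry finds the
// finished pack already in the cache instead of regenerating it.
public class PackCache {
    private final ConcurrentMap<String, byte[]> packs =
            new ConcurrentHashMap<>();
    private final Function<String, byte[]> generate;

    public PackCache(Function<String, byte[]> generate) {
        this.generate = generate;
    }

    public byte[] packFor(String tip) {
        // computeIfAbsent invokes the generator at most once per tip key
        return packs.computeIfAbsent(tip, generate);
    }
}
```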

> In terms of HTTP timeouts, yes, 2m30s before any data transfer even
> starts is a lot. And HTTP can't send progress messages to at least let
> the user know the server is chugging on their behalf.

Ah well, I've got enough to be getting on with ... as always, thanks for your
help!

Alex
Re: Git HTTP protocol/improvements? [message #572234 is a reply to message #4977] Wed, 06 May 2009 19:41
Alex Blewitt
>Alex Blewitt wrote:
>> Shawn Pearce wrote:
>> Actually, that should be git index-pack. I think you'd want something like:
>>> The challenge I have is how to convert a tree identifier into a pack
>>> structure, I suspect.
>>
>> In JGit? Use a PackWriter.

Thanks for the advice. I've not been able to do much recently (headaches with
Eclipse 3.5M7 notwithstanding), but I've been able to take a repo, use
PackWriter to generate the pack file, and then pipe it through a few of the
commands you mentioned to reconstitute the file. So I at least understand how
to operate the basics for a full checkout, even if I don't yet know how to
plug in the only-send-updates part (well, how to orchestrate the protocol; I'm
assuming that PackWriter will do the work for that as well).

My next stab will be to write the servlet that can serve this down. Given the
imminent provisioning for EGit at Eclipse, I'm wondering if it makes sense to
start using org.eclipse.egit straight away rather than org.spearce (or
equivalent subpackage). So I don't know if it makes sense to start committing
it to the existing git repository yet or wait until the new Eclipse one is up
there. Any thoughts?

Alex
Re: Git HTTP protocol/improvements? [message #572262 is a reply to message #6699] Wed, 06 May 2009 20:17 Go to previous message
Shawn O. Pearce is currently offline Shawn O. PearceFriend
Messages: 82
Registered: July 2009
Member
Alex Blewitt wrote:
> I don't yet know how to plug in the only-send-updates (well, how to
> orchestrate the protocol; I'm assuming that the PackWriter will do the work
> for that as well).

Yes. The uninteresting collection that can't be null (the bug you just
posted a patch for) is the common base; PackWriter automatically handles
selecting only the delta between the interesting collection and the
uninteresting collection, and writing only that.

JGit is still slightly inefficient here compared to C Git. PackWriter can
reuse a binary delta stored in a pack file, but it can't create a new binary
delta on its own. So JGit may wind up sending an entire file when only a
small insertion delta is necessary (e.g. to insert 5 lines into a 200 line
file). It's something that will get fixed eventually, and it's internal to
PackWriter, so it stays transparent to your application. And this is only a
network efficiency issue: transfers may run a bit slower on a slow network
connection, but the result is still correct without binary delta generation
support.

> My next stab will be to write the servlet that can serve this down. Given the
> imminent provisioning for EGit at Eclipse, I'm wondering if it makes sense to
> start using org.eclipse.egit straight away rather than org.spearce (or
> equivalent subpackage).

org.eclipse.jgit? And we're still not sure JGit is going to be able to
move to Eclipse. The JGit license is going to remain EDL. We need to
ask the board of directors for an exemption to place JGit under the EDL.
Without their permission to do so, JGit can't move to eclipse.org.

> So I don't know if it makes sense to start committing
> it to the existing git repository yet or wait until the new Eclipse one is up
> there. Any thoughts?

I would just commit to the existing repository. At worst you'll have to
move the commits to the new one. Unless you've already polished them
for submission, you'll probably want to rewrite part of the development
history anyway to clean it up, so, no big deal.

But, yeah, you could also just make your own git repository for this and link
it to the JGit code. It's not like it's hard to compile JGit and import it
into your project's classpath.