Re: [jgit-dev] Rename (move) detection with JGit

On Sun, Feb 5, 2012 at 05:54, Pawel Kozlowski
<pkozlowski.opensource@xxxxxxxxx> wrote:
> First of all: big Thank You for all the work on JGit. We are using it
> as an embedded VCS and it works like a charm! Amazing library, it opens
> up so many possibilities where git can be used as a versioned
> key(path)-value(content) store.

I am glad the library has been useful to you.

> Since I had such a positive experience with JGit I'm looking into some
> other usages where rename detection would be handy (I'm talking about
> git log -M here). From what I can see nothing like this exists in
> JGit.

It does. You can enable rename detection on a DiffFormatter. There is
also a RenameDetector class that provides a lower-level interface to
the rename detection logic for an arbitrary pair of trees, but most
applications that want rename detection just enable it on
DiffFormatter and then get back either a patch script or a list of
DiffEntry objects describing the changes.
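
A minimal sketch of the DiffFormatter route, assuming a recent JGit
on the classpath (one where DiffFormatter and Repository are
AutoCloseable). The RenameDemo class, its helper methods, the
in-memory repository and the file names are made up for
illustration; in a real application you would pass your own
Repository and the tree ids of two commits.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import org.eclipse.jgit.diff.DiffEntry;
import org.eclipse.jgit.diff.DiffFormatter;
import org.eclipse.jgit.internal.storage.dfs.DfsRepositoryDescription;
import org.eclipse.jgit.internal.storage.dfs.InMemoryRepository;
import org.eclipse.jgit.lib.Constants;
import org.eclipse.jgit.lib.FileMode;
import org.eclipse.jgit.lib.ObjectId;
import org.eclipse.jgit.lib.ObjectInserter;
import org.eclipse.jgit.lib.Repository;
import org.eclipse.jgit.lib.TreeFormatter;
import org.eclipse.jgit.util.io.DisabledOutputStream;

public class RenameDemo {
    /** Hypothetical helper: writes a tree holding one file, returns the tree id. */
    static ObjectId oneFileTree(ObjectInserter ins, String path, String content)
            throws IOException {
        ObjectId blob = ins.insert(Constants.OBJ_BLOB,
                content.getBytes(StandardCharsets.UTF_8));
        TreeFormatter tree = new TreeFormatter();
        tree.append(path, FileMode.REGULAR_FILE, blob);
        return ins.insert(tree);
    }

    /** Lists "old -> new" for every entry the formatter reports as a rename. */
    static List<String> describeRenames(Repository repo, ObjectId oldTree,
            ObjectId newTree) throws IOException {
        List<String> out = new ArrayList<>();
        // No patch text is wanted here, so the output stream is discarded.
        try (DiffFormatter df = new DiffFormatter(DisabledOutputStream.INSTANCE)) {
            df.setRepository(repo);
            df.setDetectRenames(true); // the "git log -M" behaviour
            for (DiffEntry e : df.scan(oldTree, newTree)) {
                if (e.getChangeType() == DiffEntry.ChangeType.RENAME) {
                    out.add(e.getOldPath() + " -> " + e.getNewPath());
                }
            }
        }
        return out;
    }

    /** Builds two one-file trees in memory: the file is moved and gains a line. */
    static List<String> demo() throws IOException {
        try (InMemoryRepository repo =
                     new InMemoryRepository(new DfsRepositoryDescription("demo"));
             ObjectInserter ins = repo.newObjectInserter()) {
            String body = "line one\nline two\nline three\nline four\n";
            ObjectId before = oneFileTree(ins, "old.txt", body);
            ObjectId after = oneFileTree(ins, "new.txt", body + "line five\n");
            ins.flush();
            return describeRenames(repo, before, after);
        }
    }
}
```

If the default threshold is not what you want, the detector behind
the formatter can be tuned after setDetectRenames(true), for
example with df.getRenameDetector().setRenameScore(50).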

> It is a complete mystery to me how git establishes if a modification is
> 'not too big', I presume with some threshold on ... modified lines?

Yes. More accurately, it is the number of bytes in common between the two
files. Rename detection works by breaking the file into lines, then
summing up the total number of bytes in the lines that are common.
This causes longer lines to count more towards the score than short
lines like "}\n". If the score is above the threshold, the files are
"the same" and a rename is assumed.

There is also some weight in the scoring that comes from the file
name/path similarity. This can try to break ties between *.h and *.c
files for example, or to better guess renames for small Java classes
that are just moving between packages but have more copyright header
boilerplate than they do real code content.
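
To make the content part of the scoring concrete, here is a
simplified sketch in plain Java. It is not the code JGit or git
actually run (the real implementations hash lines rather than
storing them, and handle binary and very long lines specially), and
it leaves out the path-name weight. The LineSimilarity class and
its method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class LineSimilarity {
    /** Maps each distinct line (terminator included) to the bytes it occupies. */
    static Map<String, Integer> bytesPerLine(String text) {
        Map<String, Integer> table = new HashMap<>();
        int start = 0;
        while (start < text.length()) {
            int nl = text.indexOf('\n', start);
            int end = (nl < 0) ? text.length() : nl + 1;
            String line = text.substring(start, end);
            int bytes = line.getBytes(StandardCharsets.UTF_8).length;
            table.merge(line, bytes, Integer::sum); // repeated lines accumulate
            start = end;
        }
        return table;
    }

    /** Returns 0..100: bytes in common lines, relative to the larger file. */
    static int score(String a, String b) {
        int sizeA = a.getBytes(StandardCharsets.UTF_8).length;
        int sizeB = b.getBytes(StandardCharsets.UTF_8).length;
        int max = Math.max(sizeA, sizeB);
        if (max == 0) {
            return 100; // two empty files are trivially identical
        }
        Map<String, Integer> tableA = bytesPerLine(a);
        Map<String, Integer> tableB = bytesPerLine(b);
        long common = 0;
        for (Map.Entry<String, Integer> e : tableA.entrySet()) {
            Integer other = tableB.get(e.getKey());
            if (other != null) {
                common += Math.min(e.getValue(), other);
            }
        }
        return (int) (common * 100 / max);
    }
}
```

With the 32-byte input "int x = compute(alpha, beta);\n}\n",
changing only the "}" line still scores 93, while changing only
the long line drops the score to 6, which is the "longer lines
count more" effect.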

> Anyway, my question is quite simple: do you think rename detection
> (with 'small' content modification) would be possible with JGit?

Yes. Use the RenameDetector class if you have trees you want to compare.

> How
> one could go about finding a blob that is 'similar enough' to another
> blob?

If you need to perform the compare against an arbitrary set of blobs
without path information, you would need to build the rename matrix
yourself and run the content similarity function. The RenameDetector
tries to optimize this by building a hashtable for one blob, and then
running all other blobs against that hashtable before switching to a
new blob and retrying.
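
A rough sketch of that do-it-yourself approach for one blob against
a set of candidates, in plain Java with no JGit types. The
BlobMatcher and LineIndex names are hypothetical, the index stores
whole lines instead of hashes, and the threshold is whatever you
pick. To fill a full matrix you would repeat findSimilar with each
blob in turn as the target.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BlobMatcher {
    /** Hash table for one blob: each distinct line mapped to the bytes it contributes. */
    static final class LineIndex {
        private final Map<String, Integer> bytesByLine = new HashMap<>();
        private final int size;

        LineIndex(byte[] blob) {
            size = blob.length;
            // ISO-8859-1 maps every byte to exactly one char, so
            // substring lengths below are byte counts.
            String text = new String(blob, StandardCharsets.ISO_8859_1);
            int start = 0;
            while (start < text.length()) {
                int nl = text.indexOf('\n', start);
                int end = (nl < 0) ? text.length() : nl + 1;
                bytesByLine.merge(text.substring(start, end), end - start, Integer::sum);
                start = end;
            }
        }

        /** Returns 0..100: bytes in common lines, relative to the larger blob. */
        int score(LineIndex other) {
            int max = Math.max(size, other.size);
            if (max == 0) {
                return 100;
            }
            long common = 0;
            for (Map.Entry<String, Integer> e : other.bytesByLine.entrySet()) {
                Integer mine = bytesByLine.get(e.getKey());
                if (mine != null) {
                    common += Math.min(mine, e.getValue());
                }
            }
            return (int) (common * 100 / max);
        }
    }

    /** Names of the candidates whose score against target reaches the threshold. */
    static List<String> findSimilar(byte[] target, Map<String, byte[]> candidates,
            int threshold) {
        LineIndex index = new LineIndex(target); // built once, reused for every candidate
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, byte[]> c : candidates.entrySet()) {
            if (index.score(new LineIndex(c.getValue())) >= threshold) {
                matches.add(c.getKey());
            }
        }
        return matches;
    }
}
```

The cost is one pass over every candidate per target blob, so for
large sets it is worth pruning first, for example by skipping
pairs whose sizes differ too much to ever reach the threshold.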