Re: [jgit-dev] Rename (move) detection with JGit
On Sun, Feb 5, 2012 at 05:54, Pawel Kozlowski <pkozlowski.opensource@xxxxxxxxx> wrote:
> First of all: big Thank You for all the work on JGit. We are using it
> as an embedded VCS and it works like a charm! Amazing library, it opens
> up so many possibilities where git can be used as a versioned
> key(path)-value(content) store.

I am glad the library has been useful to you.

> Since I had such a positive experience with JGit I'm looking into some
> other usages where rename detection would be handy (I'm talking about
> git log -M here). From what I can see nothing like this exists in
> JGit.

It does. You can enable rename detection on a DiffFormatter. There is
also a RenameDetector class that provides a lower-level interface to
the rename detection logic for an arbitrary pair of trees, but most
applications that want rename detection just enable it on DiffFormatter
and then get back either a patch script or a list of DiffEntry
describing the changes.

> It is complete mystery for me how git establishes if a modification is
> 'not too big', I presume with some threshold on ... modified lines?

Yes. More accurately, it's the number of bytes in common between the
two files. Rename detection works by breaking each file into lines,
then summing up the total number of bytes in the lines that are common
to both. This causes longer lines to count more towards the score than
short lines like "}\n". If the score is above the threshold, the files
are "the same" and a rename is assumed.

There is also some weight in the scoring that comes from the file
name/path similarity. This can break ties between *.h and *.c files,
for example, or better guess renames for small Java classes that are
just moving between packages but have more copyright header boilerplate
than they do real code content.

> Anyway, my question is quite simple: do you think rename detection
> (with 'small' content modification) would be possible with JGit?

Yes. Use the RenameDetector class if you have trees you want to compare.

> How
> one could go about finding a blob that is 'similar enough' to another
> blob?

If you need to compare against an arbitrary set of blobs without path
information, you would need to build the rename matrix yourself and run
the content similarity function. The RenameDetector tries to optimize
this by building a hashtable for one blob, and then running all other
blobs against that hashtable before switching to a new blob and
retrying.
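
To make the scoring idea above concrete, here is a minimal plain-Java sketch of a line-based "bytes in common" score. It is not JGit's implementation: the class name SimilaritySketch, the method names, the 0..100 scale, and dividing by the larger file's size are all assumptions made for illustration. It also leaves out the path-name weighting and the hashtable reuse across blobs described above.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not part of JGit.
class SimilaritySketch {

    // Map each distinct line (terminator included) to the total number
    // of bytes that line contributes to the content.
    static Map<String, Integer> lineWeights(String content) {
        Map<String, Integer> weights = new HashMap<>();
        int start = 0;
        while (start < content.length()) {
            int nl = content.indexOf('\n', start);
            int end = (nl < 0) ? content.length() : nl + 1;
            String line = content.substring(start, end);
            int bytes = line.getBytes(StandardCharsets.UTF_8).length;
            weights.merge(line, bytes, Integer::sum);
            start = end;
        }
        return weights;
    }

    // Score in 0..100: bytes in lines common to both inputs, relative
    // to the size of the larger input (an assumed normalization).
    static int score(String a, String b) {
        Map<String, Integer> wa = lineWeights(a);
        Map<String, Integer> wb = lineWeights(b);
        long common = 0;
        for (Map.Entry<String, Integer> e : wa.entrySet()) {
            Integer other = wb.get(e.getKey());
            if (other != null) {
                // A repeated line only counts as often as it appears in both.
                common += Math.min(e.getValue(), other);
            }
        }
        long sizeA = a.getBytes(StandardCharsets.UTF_8).length;
        long sizeB = b.getBytes(StandardCharsets.UTF_8).length;
        long max = Math.max(sizeA, sizeB);
        return max == 0 ? 100 : (int) (common * 100 / max);
    }

    // The caller chooses the threshold; nothing here fixes a default.
    static boolean looksLikeRename(String a, String b, int thresholdPercent) {
        return score(a, b) >= thresholdPercent;
    }

    public static void main(String[] args) {
        String before = "int x = 1;\nint y = 2;\n}\n";
        String after = "int x = 1;\nint z = 3;\n}\n";
        System.out.println("score = " + score(before, after));
    }
}
```

Note how the short "}\n" line adds only 2 bytes to the common total while each longer statement adds 11, which is the effect described above: long shared lines dominate the score.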