Re: [jgit-dev] RevWalk next is slow for git repos that have a long commit history. (2)

If you find a file that hasn't been modified since (say) 1999, you are going to walk back through the entire history until you reach a commit from 1999.

If you then find a second file in the same repo that also hasn't been modified since 1999, you're going to go through exactly the same process again. So you are now at O(n * m) for repo depth n and file count m, which is effectively quadratic.

I'd suggest:

1. Presumably you only need to go back as far as the last time you ran the tool. If you're updating 2014-2015, you can abort the commit search as soon as you reach commits older than 2014.

2. You should search history for all files at once, not one file at a time. You can probably use a single walk to collect the hashes of all the files and then step backwards to find the commit where each file's hash changes.
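
Both suggestions can be combined into one backwards pass. Here is a rough sketch in plain Java, assuming a linear history modelled as a list of snapshots. The Commit class and the lastModifiedYears method are made up for illustration and are not JGit API; a real version would presumably drive a single RevWalk and compare each commit's tree with its parent's.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeMap;

/** Toy model of a linear history (assumption: no merges). Not JGit API. */
class LastModifiedScan {

    /** Hypothetical stand-in for a commit: its year plus a path -> blob hash snapshot. */
    static final class Commit {
        final int year;
        final Map<String, String> blobs;

        Commit(int year, Map<String, String> blobs) {
            this.year = year;
            this.blobs = blobs;
        }
    }

    /**
     * Walks the history once, newest first, and returns path -> year of last change
     * for the files present in the newest commit. Files whose last change is older
     * than cutoffYear are absent from the result because the walk stops there.
     */
    static Map<String, Integer> lastModifiedYears(List<Commit> newestFirst, int cutoffYear) {
        Map<String, Integer> result = new TreeMap<>();
        if (newestFirst.isEmpty()) {
            return result;
        }
        // Every file at the tip still needs an answer.
        Set<String> pending = new HashSet<>(newestFirst.get(0).blobs.keySet());

        for (int i = 0; i < newestFirst.size() && !pending.isEmpty(); i++) {
            Commit commit = newestFirst.get(i);
            if (commit.year < cutoffYear) {
                break; // suggestion 1: stop once we are past the period of interest
            }
            // The root commit has no parent, so compare it against an empty snapshot.
            Map<String, String> parentBlobs = (i + 1 < newestFirst.size())
                    ? newestFirst.get(i + 1).blobs
                    : Collections.<String, String>emptyMap();

            // Suggestion 2: resolve every still-pending file against this one commit.
            for (Iterator<String> it = pending.iterator(); it.hasNext();) {
                String path = it.next();
                if (!Objects.equals(commit.blobs.get(path), parentBlobs.get(path))) {
                    result.put(path, commit.year); // hash differs: this commit changed the file
                    it.remove();
                }
            }
        }
        return result;
    }
}
```

Each commit is visited at most once no matter how many files there are, and the pending set shrinks as files get resolved. Anything still unresolved at the cutoff was not touched in the period being updated, so its header can be left alone.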

Alex


> On 3 Jun 2015, at 21:15, Leo Ufimtsev <lufimtse@xxxxxxxxxx> wrote:
> 
> Hello jgit developers,
> 
> I'm working on improving the performance of Eclipse's Releng Copyright fix
> tool:
> https://bugs.eclipse.org/bugs/show_bug.cgi?id=468850
> 
> We ran into the problem that RevWalk.next() is very slow for repositories
> that have a long commit history.
> For example, eclipse.jdt.ui has 26,000+ commits and 15,000+ files.
> 
> I was wondering if this is a known issue and if there is a way to improve
> performance or work around it?
> 
> 
> ------ To be specific:  -------
> 
> The tool traverses each file in a project, for each file:
> - it finds its repository,
> - starting from git's HEAD commit it does a RevWalk.next() backwards through
> history
>   to find the commit when the file was last modified.
> - it extracts the year
> - then updates the file's copyright header (2001-2011) -> (2001-2014).
> 
> The problem is that RevWalk.next() takes 2-3 seconds per file for
> repositories that have very long commit histories (e.g. eclipse.jdt.ui has
> 26,814 commits), and with 15,000+ files in a project this operation can take
> many hours to complete.
> 
> To be specific:
> RevWalk.next()
>  -> StartGenerator.next()
>    -> FIFORevQueue constructor
>     -> 56: BlockRevQueue constructor (Generator s)
>        -- the 'for loop' can loop 10k+ times per file.
> 
> I found that the native git-log command is also very slow.
> E.g. calling git-log on 15,000 files takes 13 minutes for eclipse.jdt.ui:
>   time find . -name "*.java" -exec git log -1 {} \; > /dev/null
> 
> In contrast, cat-ing every file takes only 6 seconds:
>   find . -exec cat {} \; > /dev/null 2>&1
> 
> 
> Being aware of the git-log limitation, is there some way to, e.g., cache the
> repo and the commit history, or find the last-modified date of a file faster
> than just traversing the git commit history?
> 
> Any advice/tips?
> 
> Thank you.
> 
> --
> Leo Ufimtsev | Intern Software Engineer @ Eclipse Team