
[jgit-dev] A faster alternative to git-filter-branch - the BFG Repo-Cleaner (based on JGit, written in Scala)


I've been working on an open-source, JGit-based project that's now ready for some feedback. It's an alternative to git-filter-branch called The BFG, and it's substantially faster:

http://rtyley.github.com/bfg-repo-cleaner/

Doing a quick test with a large-ish repo (GCC - 148495 commits), removing a single file from history takes 3.5 minutes with the BFG:

$ time bfg -D README-fixinc
...
real 3m29.086s
user 12m18.994s
sys 0m41.483s

... doing the same thing with git-filter-branch takes nearly 8 hours (a roughly 135x speed-up for the BFG):

$ time git filter-branch --index-filter 'git rm --cached --ignore-unmatch gcc/README-fixinc' --prune-empty --tag-name-filter cat -- --all
...
real 472m30.941s
user 356m53.974s
sys 59m15.350s
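
For anyone who wants to check the 135x figure, the arithmetic on the two 'real' times above can be sketched in a few lines of Python. The to_seconds helper is a hypothetical name used only for this illustration; it simply parses the 'XmY.YYYs' format printed by the shell's time builtin.

```python
def to_seconds(wall: str) -> float:
    """Convert a `time`-style value such as '3m29.086s' to seconds."""
    minutes, seconds = wall.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)


# 'real' (wall-clock) times copied from the two runs quoted in this mail.
bfg = to_seconds("3m29.086s")
filter_branch = to_seconds("472m30.941s")

speedup = filter_branch / bfg
print(f"{filter_branch / 3600:.2f} hours vs {bfg / 60:.2f} minutes: {speedup:.1f}x")
```

Note this compares wall-clock time only; the 'user' figure for the BFG is higher than its 'real' figure because the work is spread over several cores.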


The BFG is all about removing unwanted data from the history of your Git repository, i.e. its main use cases are:
  • Removing Crazy Big Files (eg bfg --strip-blobs-bigger-than 1M my-repo.git)
  • Removing Passwords, Credentials & other Private data
Hopefully people who are open-sourcing projects or moving to hosted Git will find the BFG useful for easing the small (but important) part of the process that requires removing private data from their repository history. There are some more usage examples here:

http://rtyley.github.com/bfg-repo-cleaner/#examples


BFG's performance advantage is due to these factors:
  • git-filter-branch steps through every commit in the repo, examining the complete file hierarchy of each one. For the intended use cases of The BFG this is wasteful: we don't care where in a file structure a 'bad' file exists, we just want it dealt with. Consequently the BFG processes the Git object db on a memoised tree-by-tree basis, processing each unique file & folder exactly once, so no tree ever needs to be examined more than once. This does mean that it's not possible to delete files based on their absolute path within the repo, but they can be deleted based on their filename, blob-id, or contents.
  • The BFG uses multi-core concurrent processing by default and typically consumes 100% of CPU capacity for a large part of the run.
  • All action takes place in a single process (the JVM), so it doesn't require the frequent fork-and-exec-ing needed by git-filter-branch's mix of Bash and C code.
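
To make the memoised tree-by-tree idea from the first point concrete, here is a minimal sketch in Python. It is not the BFG's actual code (which is Scala on top of JGit): the dict-based object db, the TreeCleaner class and the toy ids are all assumptions made up for illustration. It shows a shared subtree being cleaned once and the result reused for every commit tree that references it.

```python
import hashlib
from typing import Dict, List, Set, Tuple

# Toy object db: tree id -> entries of (name, kind, object id).
# Illustrative model only, not JGit's or the BFG's real data structures.
Entry = Tuple[str, str, str]  # kind is "blob" or "tree"


class TreeCleaner:
    def __init__(self, trees: Dict[str, List[Entry]], bad_names: Set[str]):
        self.trees = trees
        self.bad_names = bad_names
        self.memo: Dict[str, str] = {}  # original tree id -> cleaned tree id
        self.trees_examined = 0

    def _store(self, entries: List[Entry]) -> str:
        # Content-addressed id, loosely mimicking how Git names objects.
        tree_id = hashlib.sha1(repr(entries).encode()).hexdigest()
        self.trees[tree_id] = entries
        return tree_id

    def clean(self, tree_id: str) -> str:
        # Memoisation: a tree seen before is never walked again.
        if tree_id in self.memo:
            return self.memo[tree_id]
        self.trees_examined += 1
        original = self.trees[tree_id]
        cleaned: List[Entry] = []
        for name, kind, oid in original:
            if kind == "blob":
                # Match on filename only; the path to the file is never known.
                if name not in self.bad_names:
                    cleaned.append((name, kind, oid))
            else:
                cleaned.append((name, kind, self.clean(oid)))
        new_id = tree_id if cleaned == original else self._store(cleaned)
        self.memo[tree_id] = new_id
        return new_id


# Two commit root trees that share the same "docs" subtree.
trees = {
    "t_docs": [("README-fixinc", "blob", "b1"), ("guide.txt", "blob", "b2")],
    "t_root1": [("docs", "tree", "t_docs"), ("main.c", "blob", "b3")],
    "t_root2": [("docs", "tree", "t_docs"), ("main.c", "blob", "b4")],
}
cleaner = TreeCleaner(trees, {"README-fixinc"})
new_roots = [cleaner.clean(root) for root in ("t_root1", "t_root2")]
print(cleaner.trees_examined)  # number of trees actually walked
```

Because the cleaner only ever sees a tree's own entries, it has no notion of where that tree sits in the repo, which is why deletion by absolute path isn't available in this scheme. The sketch also leaves out details a real tool has to handle, such as empty trees and rewriting the commits that point at the new root trees.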
There's a bit more performance data here:

https://docs.google.com/spreadsheet/ccc?key=0AsR1d5Zpes8HdER3VGU1a3dOcmVHMmtzT2dsS2xNenc
(tests done using a 4GB tmpfs ramdisk - a situation which benefits git-filter-branch more than the BFG)


If anyone has cause to filter a repository's history in future, I'd appreciate you giving the BFG a try and letting me know how you found it and how it compared to git-filter-branch.

thanks in advance,

Roberto
software dev @ The Guardian


