[jgit-dev] A faster alternative to git-filter-branch - the BFG Repo-Cleaner (based on JGit, written in Scala)
I've been working on an open-sourced JGit-based project, ready for some feedback - it's an alternative to git-filter-branch called The BFG, which has substantial performance improvements:
Doing a quick test with a large-ish repo (GCC, 148495 commits), removing a single file from history takes 3.5 minutes with the BFG:
$ time bfg -D README-fixinc
... doing the same thing with git-filter-branch takes 8 hours (135x speed increase for BFG):
$ time git filter-branch --index-filter 'git rm --cached --ignore-unmatch gcc/README-fixinc' --prune-empty --tag-name-filter cat -- --all
The BFG is all about removing unwanted data from the history of your Git repository, i.e. its main use-cases are:
- Removing Crazy Big Files (eg bfg --strip-blobs-bigger-than 1M my-repo.git)
- Removing Passwords, Credentials & other Private data
Hopefully people who are looking at open-sourcing projects or moving to hosted Git will find the BFG useful for easing the small (but important) part of the process that requires removing private data from their repository history. There are some more usage examples here:
BFG's performance advantage is due to these factors:
- git-filter-branch steps through every commit in the repo, examining the complete file-hierarchy of each one. For the intended use-cases of The BFG this is wasteful: we don't care where in a file structure a 'bad' file exists, we just want it dealt with. Consequently the BFG processes the Git object db on a memoised tree-by-tree basis, processing each and every file & folder exactly once, so no unique tree ever needs to be examined more than once. This does mean that it's not possible to delete files based on their absolute path within the repo, but they can be deleted based on their filename, blob-id, or contents.
- The BFG uses multi-core concurrent processing by default and typically consumes 100% of CPU capacity for a large part of the run.
- All action takes place in a single process (the process of the JVM), so doesn't require the frequent fork-and-exec-ing needed by git-filter-branch's mix of Bash and C code.
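To make the memoisation point a bit more concrete, here is a toy sketch in Python. It is not the BFG's actual code (which is Scala on top of JGit): ToyObjectDb and MemoisedTreeCleaner are made-up names, the object db is just an in-memory dict, and blob ids like "b1" are placeholders. It only illustrates the idea that a tree shared by many commits is cleaned once and the result reused.

```python
import hashlib
from typing import Dict, Iterable, Tuple

# A tree maps entry name -> (kind, object id); kind is "blob" or "tree".
Tree = Dict[str, Tuple[str, str]]


class ToyObjectDb:
    """Hypothetical in-memory stand-in for a Git object database."""

    def __init__(self) -> None:
        self.trees: Dict[str, Tree] = {}

    def put_tree(self, tree: Tree) -> str:
        # Content-address the tree so identical trees share one id, as in Git.
        payload = repr(sorted(tree.items())).encode()
        tree_id = hashlib.sha1(payload).hexdigest()
        self.trees[tree_id] = dict(tree)
        return tree_id


class MemoisedTreeCleaner:
    """Removes blobs by filename, visiting each unique tree at most once."""

    def __init__(self, db: ToyObjectDb, bad_filenames: Iterable[str]) -> None:
        self.db = db
        self.bad = set(bad_filenames)
        self.memo: Dict[str, str] = {}  # original tree id -> cleaned tree id
        self.trees_examined = 0

    def clean(self, tree_id: str) -> str:
        if tree_id in self.memo:
            # Already cleaned for some earlier commit: reuse the answer.
            return self.memo[tree_id]
        self.trees_examined += 1
        cleaned: Tree = {}
        for name, (kind, oid) in self.db.trees[tree_id].items():
            if kind == "blob":
                # Match on filename only; the path to this tree is irrelevant.
                if name not in self.bad:
                    cleaned[name] = (kind, oid)
            else:
                cleaned[name] = (kind, self.clean(oid))
        new_id = self.db.put_tree(cleaned)
        self.memo[tree_id] = new_id
        return new_id


# Two commit root trees that share the same "docs" subtree.
db = ToyObjectDb()
docs = db.put_tree({"README-fixinc": ("blob", "b1"), "guide.txt": ("blob", "b2")})
root1 = db.put_tree({"docs": ("tree", docs), "main.c": ("blob", "b3")})
root2 = db.put_tree({"docs": ("tree", docs), "main.c": ("blob", "b4")})

cleaner = MemoisedTreeCleaner(db, {"README-fixinc"})
new_roots = [cleaner.clean(root) for root in (root1, root2)]
print(cleaner.trees_examined)  # 3: root1, docs, root2 (docs is not revisited)
```

A per-commit walk would look at "docs" twice here; with a long history where most trees are unchanged between commits, that saving is where the bulk of the difference comes from. Note the cleaner never sees the path leading to a tree, which is why the sketch (like the BFG) can match on filename but not on absolute path.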
There's a bit more performance data here:
(tests done using a 4GB tmpfs ramdisk - a situation which benefits git-filter-branch more than the BFG)
If anyone has cause to filter a repository's history in future, I'd appreciate you giving the BFG a try and letting me know how you found it, and how it compared to git-filter-branch.
thanks in advance,
software dev @ The Guardian