
About a year ago, I set up Git repositories for our projects and subprojects. Because I was leading the effort towards a Service Oriented Architecture, we decomposed the whole application into services that could run independently. Since each service was independent, we treated it as a separate project and gave it its own Git repository.

Lately, I noticed that the repositories had grown quite large, around 300+ MB. Upon inspection, I discovered that large SQL dumps were being committed, and every change to them took its toll in disk space and in the time it takes to clone, push and pull.
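If you are not sure which files are bloating a repository, something like the following can help. It is only a sketch: the function name largest_blobs is made up for this post, and it assumes a Git recent enough to support the --batch-check format string.

```shell
# List the largest blobs reachable from any ref, biggest first.
# Output columns: size in bytes, then the path the blob was committed under.
# $1 (optional): how many entries to show, default 10.
largest_blobs() {
    git rev-list --objects --all |
        git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
        sed -n 's/^blob //p' |
        sort -k1,1nr |
        head -n "${1:-10}"
}
```

Run it from inside the clone, for example as largest_blobs 20, and the SQL dumps (or whatever else is oversized) should show up at the top.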

I wanted to remove the files not just from the current tree but from the entire history as well. Here’s what I did:

$ git clone git@code.from.somewhere.com:repo.git # Get the repo
$ cd repo
$ # Remove the db/ directory and all files in it from every commit on the current branch.
$ git filter-branch --index-filter 'git rm -r --cached --ignore-unmatch db'
$ rm -rf .git/refs/original/
$ git reflog expire --expire=now --all
$ git gc --prune=now
$ git gc --aggressive --prune=now
$ mv .git .. && rm -fr * # Now we make it a bare repository
$ mv ../.git .
$ mv .git/* .
$ rmdir .git
$ git config --bool core.bare true
$ cd ..; mv repo repo.git # Done with making it bare, renaming just for clarity
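Before swapping the repository in, it may be worth checking that the rewrite did what was intended. The helper below is a sketch with a made-up name, history_is_clean, and it assumes the backup refs under refs/original have already been deleted as in the steps above (otherwise --all would still see the old commits).

```shell
# Succeed (exit 0) only if no commit reachable from any ref touches the path.
# $1: repository directory (bare or not), $2: path that should be gone.
history_is_clean() {
    [ -z "$(git -C "$1" log --all --format=%H -- "$2")" ]
}
```

For example, history_is_clean repo.git db && echo "db/ is gone". Running git count-objects -vH inside repo.git should also show how much space the repository takes now.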

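Now copy the new repo.git in place of the existing one and that’s all. Keep in mind that the rewrite changes commit hashes, so anyone with an existing clone will need to clone afresh.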

Please note that Git garbage collection can take a long time and consume a lot of CPU if the files were large.
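If the machine is shared, one option is to cap what the repack may use. The sketch below is an assumption on my part rather than something I measured: gentle_gc is a made-up name, and the 256m window limit is an arbitrary value to tune. It runs at low priority on a single thread, so it will likely take longer but should leave the box usable.

```shell
# Run an aggressive gc at low priority, limited to one packing thread
# and a capped delta-window memory (256m is an arbitrary example value).
gentle_gc() {
    nice git -c pack.threads=1 -c pack.windowMemory=256m \
        gc --aggressive --prune=now
}
```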

Another issue might be that you imported from SVN and the .svn directories are still there. In that case, find them all with the command below:

$ find . -type d -name .svn

And apply the process to each of them. I know that’s tedious, and I am looking to use xargs to do the job or turn it into a shell script.
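One possible shape for such a script, as a sketch rather than a tested recipe: instead of feeding each directory through xargs, let Git's own pathspec matching catch every .svn directory in a single pass. The function name purge_svn_dirs is made up, and the -- --all at the end is my addition so that all branches and tags are rewritten, not just the current one. The FILTER_BRANCH_SQUELCH_WARNING variable only silences the warning newer Git versions print; older versions ignore it.

```shell
# Rewrite all branches and tags, dropping every .svn directory at any depth.
# Run from the top level of a throwaway clone, never on the only copy.
purge_svn_dirs() {
    # '.svn' covers the top-level directory; the quoted glob is expanded
    # by Git (not the shell) and matches .svn directories further down.
    FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch \
        --index-filter \
        "git rm -r -q --cached --ignore-unmatch -- .svn '*/.svn/*'" \
        -- --all
}
```

After it finishes, the same cleanup as before applies: remove .git/refs/original/, expire the reflog and run git gc.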