# Understanding Git Filter-branch and the Git Storage Model

The other day Steve wanted git alchemy done on the Rust repo.

Specifically, he wanted the reference and nomicon moved out into their own repositories, preserving history. Both had some interesting quirks: the reference has lived in `src/doc/reference/*` and `src/doc/reference.md`, and the nomicon has lived in `src/doc/nomicon`, `src/doc/tarpl`, and at the top level in a separate git root.

As you can guess from the title of this post, the correct tool for this job is git filter-branch. My colleague Greg calls it “the swiss-army knife of Git history rewriting”.

I had some fun with filter-branch that day, and thought I’d finally write an accessible tutorial for it. A lot of folks treat filter-branch like rebase, but it isn’t, and this crucial difference can lead to many false starts. It certainly did for me back when I first learned it.

This kind of ties into the common bit of pedantry about the nature of a commit I keep seeing pop up:

Git commits appear to be diffs, but they’re actually file copies, but they’re actually ACTUALLY diffs.

## So what is a git commit?

Generally we interact with git commits via `git show` or by looking at commits on a git GUI / web UI. Here, we see diffs. It’s natural to think of a commit as a diff; it’s the model that makes the most sense for the most common ways of interacting with commits. It also makes some sense from an implementation point of view, since diffs seem like an efficient way of storing things.

It turns out that the “real” model is not this, it’s actually that each commit is a snapshot of the whole repo state at the time.

But actually, it isn’t, the underlying implementation does make use of deltas in packfiles and some other tricks like copy-on-write forking.

Ultimately, arguing about the “real” mental model is mostly pedantry. There are multiple ways of looking at a commit. The documentation tends to implicitly think of them as “full copies of the entire file tree”, which is where most of the confusion about filter-branch comes from. But often it’s important to picture them as diffs, too.

Understanding the implementation can be helpful, especially when you break the repository whilst doing crazy things (I do this often). I’ve explained how it works in a later section; it’s not really a prerequisite for understanding filter-branch, but it’s interesting.

## How do I rewrite history with git rebase?

This is where some of the confusion around filter-branch stems from. Folks have worked with rebase, and they think filter-branch is a generalized version of this. They’re actually quite different.

For those of you who haven’t worked with git rebase, it’s a pretty useful way of rewriting history, and is probably what you should use when you want to rewrite history, especially for maintaining clean git history in an unmerged under-review branch.

Rebase does a whole bunch of things. Its core task: given the current branch and a branch that you want to “rebase onto”, it takes all commits unique to your branch and applies them, in order, to the new one. Here, “apply” means “apply the diff of the commit, attempting to resolve any conflicts”. At times, it may ask you to manually resolve the conflicts, using the same tooling you use for conflicts during `git merge`.

Rebase is much more powerful than that, though. `git rebase -i` will open up “interactive rebase”, which will show you the commits that are going to be rebased. In this interface, you can reorder commits, mark them for edits (wherein the rebase will stop at that commit and let you `git commit --amend` changes into it), and even “squash” commits, which lets you mark a commit to be absorbed into the previous one. This is rather useful for when you’re working on a feature and want to keep your commits neat, but also want to make fixup patches to older commits. Filippo’s `git fixup` alias packages this particular task into a single git command. Changing `EDITOR=true` into `EDITOR=: GIT_SEQUENCE_EDITOR=:` will make it not even open the editor for confirmation and try to do the whole thing automatically.

`git rebase -x some_command` is also pretty neat: it lets you run a shell command on each step during a rebase.

In this model, you are fundamentally thinking of commits as diffs. When you move around commits in the interactive rebase editor, you’re moving around diffs. When you mark things for squashing, you’re basically merging diffs. The whole process is about taking a set of diffs and applying them to a different “base commit”.
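To make the “commits as diffs” view concrete, here is a minimal sketch on a throwaway repository. The file and branch names are made up, and it assumes a reasonably recent git where `-x` works without `-i`. Two commits on a `feature` branch are replayed onto a trunk that has moved on, with a check run after each replayed commit.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"

echo "base" > base.txt
git add base.txt && git commit -q -m "base"
trunk=$(git symbolic-ref --short HEAD)   # whatever the default branch is called

# two commits that exist only on the feature branch
git checkout -q -b feature
echo "one" > feature.txt
git add feature.txt && git commit -q -m "feature: one"
echo "two" >> feature.txt
git commit -q -am "feature: two"

# meanwhile the trunk moves on
git checkout -q "$trunk"
echo "other" > other.txt
git add other.txt && git commit -q -m "unrelated work"

# replay the feature diffs onto the new trunk tip,
# running a check after each replayed commit
git checkout -q feature
git rebase -q -x 'test -f feature.txt' "$trunk"

replayed=$(git rev-list --count "$trunk"..feature)
git --no-pager log --oneline --graph
```

After the rebase, `feature` contains the trunk’s `other.txt` plus the same two diffs, now recorded as new commits on top of the new base.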

## How do I rewrite history with git filter-branch?

filter-branch does not work with diffs. You’re working with the “snapshot” model of commits here, where each commit is a snapshot of the tree, and rewriting these commits.

What git filter-branch will do is for each commit in the specified branch, apply filters to the snapshot, and create a new commit. The new commit’s parent will be the filtered version of the old commit’s parent. So it creates a parallel commit DAG.

Because the filters apply on the snapshots instead of the diffs, there’s no chance for this to cause conflicts like in git rebase. In git rebase, if I have one commit that makes changes to a file, and I change the previous commit to just remove the area of the file that was changed, I’d have a conflict and git would ask me to figure out how the changes are supposed to be applied.

In git-filter-branch, if I do this, it will just power through. Unless you explicitly write your filters to refer to previous commits, the new commit is created in isolation, so it doesn’t worry about changes to the previous commits. If you had indeed edited the previous commit, the new commit will appear to undo those changes and apply its own on top of that.
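A small sketch of the “parallel DAG” idea, on a scratch repository with made-up file names. A tree filter adds a file to every snapshot. The branch ends up pointing at brand-new commits, while filter-branch keeps the old tip reachable under `refs/original/`. Setting `FILTER_BRANCH_SQUELCH_WARNING` only silences the warning that newer git versions print.

```shell
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"

echo "a" > a.txt
git add a.txt && git commit -q -m "add a"
echo "b" > b.txt
git add b.txt && git commit -q -m "add b"

branch=$(git symbolic-ref --short HEAD)
old_tip=$(git rev-parse HEAD)

# the filter runs once per snapshot; no diffs are applied, so nothing can conflict
git filter-branch --tree-filter 'echo extra > extra.txt' HEAD >/dev/null

new_tip=$(git rev-parse HEAD)
# the pre-rewrite history is still reachable from the backup ref
backup_tip=$(git rev-parse "refs/original/refs/heads/$branch")
# every rewritten snapshot, including the very first one, now has the file
new_root=$(git rev-list --max-parents=0 HEAD)
git ls-tree --name-only "$new_root"
```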

filter-branch is generally for operations you want to apply pervasively to a repository. If you just want to tweak a few commits, it won’t work, since future commits will appear to undo your changes. git rebase is for when you want to tweak a few commits.

So, how do you use it?

The basic syntax is `git filter-branch <filters> branch_name`. You can use `HEAD` or `@` to refer to the current branch instead of explicitly typing `branch_name`.

A very simple and useful filter is the subdirectory filter. It makes a given subdirectory the repository root. You use it via `git filter-branch --subdirectory-filter name_of_subdir @`. This is useful for extracting the history of a folder into its own repository.
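As a rough illustration (the directory and file names below are invented), this scratch repository has a `docs/guide` folder alongside unrelated code. After the subdirectory filter, only the commits that touched the folder remain, and its contents sit at the root.

```shell
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"

mkdir -p docs/guide
echo "intro" > docs/guide/intro.md
echo "code" > main.c
git add . && git commit -q -m "initial"
echo "more" >> docs/guide/intro.md
git commit -q -am "expand intro"
echo "fix" >> main.c
git commit -q -am "unrelated code fix"

# make docs/guide the new root of the history
git filter-branch --subdirectory-filter docs/guide HEAD >/dev/null

files=$(git ls-tree -r --name-only HEAD)
commits=$(git rev-list --count HEAD)
echo "files: $files, commits: $commits"
```

The commit that only touched `main.c` is gone from the rewritten branch, since it never changed anything under `docs/guide`.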

Another useful filter is the tree filter. You can use it to do things like moving around, creating, or removing files. For example, if you want to move `README.md` to `README` in the entire history, you’d do something like `git filter-branch --tree-filter 'mv README.md README' @` (you can also achieve this much faster with some manual work and rebase). The tree filter works by checking out each commit (in a separate temporary folder), running your filter on the working directory, adding any changes to the index (no need to `git add` yourself), and committing the new index.

The `--prune-empty` argument is useful here, as it removes commits which are now empty due to the rewrite.
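Here is a sketch of the README rename on a scratch repository (the history is invented for the example). The `|| true` guard is an addition so that snapshots without a `README.md` don’t abort the run. The commit that originally performed the rename becomes a no-op after the rewrite, which is exactly what `--prune-empty` drops.

```shell
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"

echo "hello" > README.md
git add README.md && git commit -q -m "add readme"
echo "more" >> README.md
git commit -q -am "expand readme"
git mv README.md README
git commit -q -m "rename README.md to README"
echo "code" > main.c
git add main.c && git commit -q -m "add code"

# rename the file in every snapshot; drop commits that no longer change anything
git filter-branch --prune-empty \
    --tree-filter 'mv README.md README 2>/dev/null || true' HEAD >/dev/null

commits=$(git rev-list --count HEAD)
root_files=$(git ls-tree --name-only "$(git rev-list --max-parents=0 HEAD)")
touching_old_name=$(git rev-list --count HEAD -- README.md)
echo "commits: $commits, files in first commit: $root_files"
```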

Because it is checking out each commit, this filter is quite slow. When I was initially trying to do Steve’s task on the Rust repo, I wrote a long tree filter and it was taking forever.

The faster version is the index filter. However, this is a bit trickier to work with (which is why I tend to use a tree filter if I can get away with it). What this does is operate on the index, directly.

The “index” is basically where things go when you git add them. Running git add will create temporary objects for the added file, and modify the WIP index (directory tree) to include a reference to the new file or change an existing file reference to the new one. When you commit, this index is packaged up into a commit and stored as an object. (More on how these objects work in a later section)

Now, since this deals with files that are already stored as objects, git doesn’t need to unwrap these objects and create a working directory to operate on them. So, with --index-filter, you can operate on these in a much faster way. However, since you don’t have a working directory, stuff like adding and moving files can be trickier. You often have to use git update-index to make this work.

However, a useful index filter is one which just scrubs a file (or files) from history:
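A minimal sketch of such a scrub, run on a scratch repository. The file name `secrets.txt` is a placeholder, and the `-q` is only there to keep `git rm` quiet.

```shell
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"

echo "app" > app.txt
git add app.txt && git commit -q -m "add app"
echo "hunter2" > secrets.txt
git add secrets.txt && git commit -q -m "oops, add secrets"
echo "more app" >> app.txt
git commit -q -am "more app"

# remove the file from the index of every commit; no checkout needed
git filter-branch --index-filter \
    'git rm -q --cached --ignore-unmatch secrets.txt' HEAD >/dev/null

touching=$(git rev-list --count HEAD -- secrets.txt)
commits=$(git rev-list --count HEAD)
echo "commits touching secrets.txt: $touching (of $commits)"
```

Without `--prune-empty`, the commit that only added the file stays behind as an empty commit. The old history also stays reachable through `refs/original/` until that backup ref is removed.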

The `--ignore-unmatch` makes the command still succeed if the file doesn’t exist. filter-branch will fail if one of the filters fails. In general I tend to write fallible filters like `command1 >/dev/null 2>&1 ; command2 >/dev/null 2>&1 ; true`, which makes it always succeed and also ignores any stdout/stderr output (which tends to make the progress screen fill up fast).

The `--cached` argument on `git rm` makes it operate only on the index, not the working directory. This is great, because we don’t have a working directory right now.

I rarely use git update-index so I’m not really going to try and explain how it can be used here. But if you need to do more complex operations in an index filter, that’s the way to go.

There are many other filters, like `--commit-filter` (lets you discard a commit entirely), `--msg-filter` (rewriting commit messages), and `--env-filter` (changing things like author metadata or other env vars). You can see a complete list with examples in the docs.
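For instance, here is a hedged sketch combining an env filter and a message filter on a scratch repository. The email addresses and the `[docs]` prefix are made up. The env filter rewrites the author and committer email, and the message filter prefixes the first line of every commit message.

```shell
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "old@example.com"

echo "a" > a.txt
git add a.txt && git commit -q -m "add a"
echo "b" > b.txt
git add b.txt && git commit -q -m "add b"

git filter-branch \
    --env-filter '
        if [ "$GIT_AUTHOR_EMAIL" = "old@example.com" ]; then
            export GIT_AUTHOR_EMAIL="new@example.com"
        fi
        if [ "$GIT_COMMITTER_EMAIL" = "old@example.com" ]; then
            export GIT_COMMITTER_EMAIL="new@example.com"
        fi' \
    --msg-filter 'sed "1s/^/[docs] /"' \
    HEAD >/dev/null

emails=$(git log --format='%ae %ce' | sort -u)
subject=$(git log -1 --format=%s)
echo "$emails"
echo "$subject"
```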

## How did I perform the rewrites on the reference and nomicon?

For the Rust Reference, basically I had to extract the history of `src/doc/reference.md` AND `src/doc/reference/*` (`reference.md` was split up into `reference/*.md` recently) into its own repository. This is an easy tree filter to write, but tree filters take forever.

Instead of trying my luck with an index filter, I decided to just make it so that the tree filter would be faster. I first extracted `src/doc/` with a subdirectory filter.

Now I had a branch that contained only the history of src/doc, with the root directory moved to doc. This is a much smaller repo than the entirety of Rust.

Now, I moved `reference.md` into `reference/` with a tree filter.

As mentioned before, I sent the output of `mv` to `/dev/null` and ended the filter with `true` because the `mv` command will fail in some cases (when `reference.md` doesn’t exist), and I want it to just continue without complaining when that happens. I only care about moving that file where it exists; if it isn’t there in a given commit, that’s still okay.

Now, everything I cared about was within `reference`. The next step was simple: another subdirectory filter, this time on `reference`.
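Put together, the three passes can be sketched on a toy repository that mimics the layout described above. The file contents and commit messages are invented, and the `-f` and `--prune-empty` flags are assumptions about how the passes were chained rather than a record of the exact commands run on the Rust repo. `-f` is needed from the second pass on, because the first pass leaves a backup in `refs/original/`.

```shell
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"

# toy history: a single-file reference that is later split into a directory
mkdir -p src/doc src/lib
echo "compiler" > src/lib/main.rs
echo "# Reference" > src/doc/reference.md
git add . && git commit -q -m "add compiler and single-file reference"
echo "more" >> src/doc/reference.md
git commit -q -am "expand reference"
mkdir src/doc/reference
git mv src/doc/reference.md src/doc/reference/introduction.md
git commit -q -m "split reference into a directory"
echo "fix" >> src/lib/main.rs
git commit -q -am "compiler-only change"

# pass 1: make src/doc the root (shrinks what the slow tree filter has to check out)
git filter-branch -f --prune-empty --subdirectory-filter src/doc HEAD >/dev/null
# pass 2: move reference.md into reference/ wherever it exists
git filter-branch -f --prune-empty --tree-filter \
    'mkdir -p reference; mv reference.md reference/ >/dev/null 2>&1; true' \
    HEAD >/dev/null
# pass 3: make reference/ the root
git filter-branch -f --prune-empty --subdirectory-filter reference HEAD >/dev/null

commits=$(git rev-list --count HEAD)
tip_files=$(git ls-tree -r --name-only HEAD)
root_files=$(git ls-tree -r --name-only "$(git rev-list --max-parents=0 HEAD)")
echo "commits: $commits, first: $root_files, last: $tip_files"
```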

The whole process took maybe 10 minutes to run, most of the time being spent by the second command. The final result can be found here.

For the nomicon, the task was easier. It has always resided in `src/doc/nomicon`, `src/doc/tarpl`, or at the root. This last bit is interesting: when Alexis was working on the nomicon, he started off by hacking on it in a separate repo, but then within that repo moved it to `src/doc/tarpl`, and performed a merge commit with rustc. There’s no inherent restriction in Git that all merges must have a common ancestor, and you can do stuff like this. I was quite surprised when I saw this, since it’s pretty uncommon in general, but really, many projects of that size will have stuff like this. Servo and html5ever do too, and usually it’s when a large project is merged into it after being developed on the side.

This sounds complicated to work with, but it wasn’t that hard. I took the same subdirectory-filtered doc directory branch used for the reference. Then, I renamed `tarpl/` to `nomicon/` via a tree filter, and ran another subdirectory filter on `nomicon`.

Now, I had the whole history of the nomicon in the root dir. Except for the commits made by Alexis before his frankenmerge, because these got removed in the first subdirectory filter (the commits were operating outside of src/doc, even though their contents eventually got moved there).

But, at this stage, I already had a branch with the nomicon at the root. Alexis’ original commits were also operating on the root directory. I can just rebase here, and the diffs of my commits will cleanly apply!

I found the commit (a54e64) where everything was moved to `tarpl/`, and took its parent (c7919f). Then, I just ran `git rebase --onto c7919f --root`, and everything cleanly rebased. This was expected: I had a history going back to the first child of a54e64 with files moved, and a54e64 itself only moved files, so the diffs should cleanly apply.
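The whole nomicon procedure can be rehearsed on a toy repository. Everything below is invented for illustration: a “book” starts life at the root of an unrelated history, gets moved into `src/doc/tarpl` and merged into a bigger project, and is later renamed to `nomicon`. The sketch assumes a recent git, where a replayed commit that becomes empty is dropped by default, and `$pre_move` plays the role of c7919f.

```shell
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"

# the big project
mkdir -p src/lib
echo "compiler" > src/lib/main.rs
git add . && git commit -q -m "add compiler"
trunk=$(git symbolic-ref --short HEAD)

# the book, developed at the root of an unrelated history
git checkout -q --orphan book
git rm -rfq .
echo "chapter 1" > README.md
git add README.md && git commit -q -m "book: chapter 1"
echo "chapter 2" >> README.md
git commit -q -am "book: chapter 2"
pre_move=$(git rev-parse HEAD)
mkdir -p src/doc/tarpl
git mv README.md src/doc/tarpl/README.md
git commit -q -m "book: move into src/doc/tarpl"

# merge the two unrelated histories, then keep working
git checkout -q "$trunk"
git merge -q --allow-unrelated-histories -m "merge the book" book
echo "chapter 3" >> src/doc/tarpl/README.md
git commit -q -am "chapter 3"
git mv src/doc/tarpl src/doc/nomicon
git commit -q -m "rename tarpl to nomicon"
echo "chapter 4" >> src/doc/nomicon/README.md
git commit -q -am "chapter 4"

# filter-branch passes: the pre-merge root-level commits are lost here
git filter-branch -f --prune-empty --subdirectory-filter src/doc HEAD >/dev/null
git filter-branch -f --prune-empty --tree-filter \
    'mv tarpl nomicon >/dev/null 2>&1; true' HEAD >/dev/null
git filter-branch -f --prune-empty --subdirectory-filter nomicon HEAD >/dev/null
filtered=$(git rev-list --count HEAD)

# graft the filtered commits onto the book's original root-level history
git rebase -q --onto "$pre_move" --root
total=$(git rev-list --count HEAD)
root=$(git rev-list --max-parents=0 HEAD)
git --no-pager log --oneline
```

The first filtered commit re-adds a file that already exists with identical contents at `$pre_move`, so it replays to nothing. The remaining diffs then apply on top of the original early commits.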

The final result can be found here.

## Appendix: How are commits actually stored?

The way the actual implementation of a commit works is that each file being stored is hashed and stored in a compressed format, indexed by the hash. A directory (“tree”) will be a list of hashes, one for each file/directory inside it, alongside the filenames and other metadata. This list will be hashed and used everywhere else to refer to the directory.

A commit will reference the “tree” object for the root directory via its hash.

Now, if you make a commit changing some files, most of the files will be unchanged. So will most of the directories. So the commits can share the objects for the unchanged files/directories, reducing their size. This is basically a copy-on-write model. Furthermore, there’s a second optimization called a “packfile”, wherein instead of storing a file git will store a delta (a diff) and a reference to the file the diff must be applied to.
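To watch the packfile step happen, here is a small sketch on a scratch repository. The file is generated with `seq` purely to get something large enough to be worth deltifying, so whether git actually chooses to store a delta for it is up to git’s heuristics. Freshly written objects start out “loose”, and `git gc` moves them into a single pack.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"

# three revisions of a largish file that differ only in the last line
for i in 1 2 3; do
    seq 1 2000 > big.txt
    echo "revision $i" >> big.txt
    git add big.txt
    git commit -q -m "revision $i"
done

# 3 blobs + 3 trees + 3 commits, each stored as its own compressed file
loose_before=$(git count-objects -v | sed -n 's/^count: //p')

git gc -q

loose_after=$(git count-objects -v | sed -n 's/^count: //p')
packs=$(git count-objects -v | sed -n 's/^packs: //p')
echo "loose objects: $loose_before -> $loose_after, packs: $packs"

# one line per object; deltified objects also list a depth and the id of their base
git verify-pack -v .git/objects/pack/pack-*.idx
```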

We can see this at work using `git cat-file`. `cat-file` lets you view objects in the “git filesystem”, which is basically a bunch of hash-indexed objects stored in `.git/objects`. You can view them directly by traversing that directory (they’re organized as a trie), but `cat-file -p` will let you pretty-print their contents since they’re stored in a binary format.
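As a quick sketch of what “hash-indexed” means here, the snippet below writes a blob into a scratch repository, recomputes its id by hand, and finds it on disk. It assumes `sha1sum` is installed and that the repository uses git’s default SHA-1 object format. The id is the hash of a short `blob <size>` header, a NUL byte, and then the file contents.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q

# -w writes the blob into the object store and prints its id
blob_id=$(printf 'hello\n' | git hash-object -w --stdin)

# the same id, computed by hand: header, NUL, then the 6 bytes of content
manual_id=$(printf 'blob 6\0hello\n' | sha1sum | cut -d' ' -f1)

# loose objects live at .git/objects/<first two hex digits>/<remaining digits>
prefix=$(printf '%s' "$blob_id" | cut -c1-2)
suffix=$(printf '%s' "$blob_id" | cut -c3-)
ls -l ".git/objects/$prefix/$suffix"

# the file on disk is compressed; cat-file inflates it for us
object_type=$(git cat-file -t "$blob_id")
stored=$(git cat-file -p "$blob_id")
echo "$blob_id is a $object_type containing: $stored"
```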

I’m working with the repo for the Rust Book, playing with commit 4822f2. It’s a commit that changes just one file (second-edition/src/ch15-01-box.md), perfect.

Running `git cat-file -p` on the commit tells us that it is a thing with some author information, a pointer to a parent, a commit message, and a “tree”. What’s this tree?

Pretty-printing the tree’s hash the same way shows that it is just a directory! Each entry has a hash, and we can use `git cat-file -p` to view each one. Looking at a tree object will just give us a subdirectory, but the blobs will show us actual files!

So how does this share objects? Let’s look at the root tree of the previous commit.

If you look closely, all of its hashes are the same as in the newer commit, except for the hash for `second-edition`. For the hashes which are the same, these objects are being shared across commits. The differing hash is d5672d in the newer commit, and d48b2e in the older one.
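The same inspection can be tried on a throwaway repository without cloning the book. The layout below is made up to loosely resemble it: the second commit edits one chapter, so only that blob and the trees on the path down to it get new ids, and everything else is shared.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"

mkdir -p book/src
echo "one" > book/src/ch01.md
echo "two" > book/src/ch02.md
echo "readme" > README.md
git add . && git commit -q -m "initial"
echo "more" >> book/src/ch01.md
git commit -q -am "edit ch01 only"

git cat-file -p HEAD            # tree, parent, author, committer, message
git cat-file -p 'HEAD^{tree}'   # the root directory: one line per entry

# <commit>:<path> resolves to the object id recorded for that path
old_readme=$(git rev-parse HEAD~1:README.md)
new_readme=$(git rev-parse HEAD:README.md)
old_ch02=$(git rev-parse HEAD~1:book/src/ch02.md)
new_ch02=$(git rev-parse HEAD:book/src/ch02.md)
old_ch01=$(git rev-parse HEAD~1:book/src/ch01.md)
new_ch01=$(git rev-parse HEAD:book/src/ch01.md)
old_book=$(git rev-parse HEAD~1:book)
new_book=$(git rev-parse HEAD:book)

# only the entries that actually differ between the two snapshots
git diff-tree -r HEAD~1 HEAD
```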

Let’s look at those two objects.

Again, their entries are the same, except for that of `src`. `src` has a lot of files in it, which will clutter this post, so I’ll run a diff on the outputs of `cat-file`:

```diff
$ diff -U5 <(g cat-file -p f9fc05a6ff78b8211f4df931ed5e32c937aba66c) <(g cat-file -p 3f8db396566716299330cdd5f569fb0a0c4615dd)
--- /dev/fd/63  2017-03-05 11:58:22.000000000 -0800
+++ /dev/fd/62  2017-03-05 11:58:22.000000000 -0800
@@ -63,11 +63,11 @@
 100644 blob ff6b8f8cd44f624e1239c47edda59560cdf491ae   ch14-02-publishing-to-crates-io.md
 100644 blob c53ef854a74b6c9fbd915be1bf824c6e78439c42   ch14-03-cargo-workspaces.md
 100644 blob 3fb59f9cc85b6b81994e83a34d542871a260a8f0   ch14-04-installing-binaries.md
 100644 blob e1cd1ca779fdf202af433108a8af6eda317f2717   ch14-05-extending-cargo.md
 100644 blob 3173cc508484cc447ebe42a024eac7d9e6c2ddcd   ch15-00-smart-pointers.md
-100644 blob 14c5533bb3b604c6e6274db278d1e7129f78d55d   ch15-01-box.md
+100644 blob 29d87933d6832374b87d98aa5588e09e0c1a4991   ch15-01-box.md
 100644 blob 47b35ed489d63ce6a885289fec01b7b16ba1afea   ch15-02-deref.md
```

As you can see, only the file that was changed in the commit has a new blob stored. If you view 14c553 and 29d879 you’ll get the pre- and post- commit versions of the file respectively.