# Understanding Git Filter-branch and the Git Storage Model

The other day Steve wanted git alchemy done on the Rust repo.

Specifically, he wanted the reference and nomicon moved out into their own repositories, preserving history. Both situations had some interesting quirks: the reference has lived in src/doc/reference/* and src/doc/reference.md, and the nomicon has lived in src/doc/nomicon, src/doc/tarpl, and at the top level in a separate git root.

As you can guess from the title of this post, the correct tool for this job is git filter-branch. My colleague Greg calls it “the swiss-army knife of Git history rewriting”.

I had some fun with filter-branch that day, and thought I’d finally write an accessible tutorial for it. A lot of folks treat filter-branch like rebase, but it isn’t, and this crucial difference can lead to many false starts. It certainly did for me back when I first learned it.

This kind of ties into the common bit of pedantry about the nature of a commit I keep seeing pop up:

Git commits appear to be diffs, but they’re actually file copies, but they’re actually ACTUALLY diffs.

## So what is a git commit?

Generally we interact with git commits via git show or by looking at commits on a git GUI / web UI. Here, we see diffs. It’s natural to think of a commit as a diff, it’s the model that makes the most sense for the most common ways of interacting with commits. It also makes some sense from an implementation point of view, diffs seem like an efficient way of storing things.

It turns out that the “real” model is not this, it’s actually that each commit is a snapshot of the whole repo state at the time.

But actually, it isn’t, the underlying implementation does make use of deltas in packfiles and some other tricks like copy-on-write forking.

Ultimately, arguing about the “real” mental model is mostly pedantry. There are multiple ways of looking at a commit. The documentation tends to implicitly think of them as “full copies of the entire file tree”, which is where most of the confusion about filter-branch comes from. But often it’s important to picture them as diffs, too.

Understanding the implementation can be helpful, especially when you break the repository whilst doing crazy things (I do this often). I’ve explained how it works in a later section, it’s not really a prerequisite for understanding filter-branch, but it’s interesting.

## How do I rewrite history with git rebase?

This is where some of the confusion around filter-branch stems from. Folks have worked with rebase, and they think filter-branch is a generalized version of this. They’re actually quite different.

For those of you who haven’t worked with git rebase, it’s a pretty useful way of rewriting history, and is probably what you should use when you want to rewrite history, especially for maintaining clean git history in an unmerged under-review branch.

Rebase does a whole bunch of things. Its core task is, given the current branch and a branch that you want to “rebase onto”, it will take all commits unique to your branch, and apply them in order to the new one. Here, “apply” means “apply the diff of the commit, attempting to resolve any conflicts”. At times, it may ask you to manually resolve the conflicts, using the same tooling you use for conflicts during git merge.

Rebase is much more powerful than that, though. git rebase -i will open up “interactive rebase”, which will show you the commits that are going to be rebased. In this interface, you can reorder commits, mark them for edits (wherein the rebase will stop at that commit and let you git commit --amend changes into it), and even “squash” commits which lets you mark a commit to be absorbed into the previous one. This is rather useful for when you’re working on a feature and want to keep your commits neat, but also want to make fixup patches to older commits. Filippo’s git fixup alias packages this particular task into a single git command. Changing EDITOR=true into EDITOR=: GIT_SEQUENCE_EDITOR=: will make it not even open the editor for confirmation and try to do the whole thing automatically.

git rebase -x some_command is also pretty neat: it lets you run a shell command on each step during a rebase.

In this model, you are fundamentally thinking of commits as diffs. When you move around commits in the interactive rebase editor, you’re moving around diffs. When you mark things for squashing, you’re basically merging diffs. The whole process is about taking a set of diffs and applying them to a different “base commit”.

## How do I rewrite history with git filter-branch?

filter-branch does not work with diffs. You’re working with the “snapshot” model of commits here, where each commit is a snapshot of the tree, and rewriting these commits.

What git filter-branch will do is for each commit in the specified branch, apply filters to the snapshot, and create a new commit. The new commit’s parent will be the filtered version of the old commit’s parent. So it creates a parallel commit DAG.
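To make the "parallel DAG of rewritten snapshots" idea concrete, here is a rough sketch in Rust. It is a toy model, not how git is implemented: `Commit`, `Snapshot`, `filter_history`, and `rename_readme` are all hypothetical names, and the history is assumed to be linear and stored oldest-first.

```rust
use std::collections::BTreeMap;

/// A snapshot maps file paths to file contents.
type Snapshot = BTreeMap<String, String>;

/// A commit in this toy model: a full snapshot plus an optional parent index.
#[derive(Clone, Debug, PartialEq)]
struct Commit {
    parent: Option<usize>,
    tree: Snapshot,
}

/// Rewrite a history by applying `filter` to every snapshot in isolation.
/// Commits are stored oldest-first, so the rewritten commit at index i is
/// the filtered version of the original commit at index i.
fn filter_history(history: &[Commit], filter: impl Fn(&Snapshot) -> Snapshot) -> Vec<Commit> {
    history
        .iter()
        .map(|c| Commit {
            // The new parent is the rewritten counterpart of the old parent,
            // which sits at the same index in the new history.
            parent: c.parent,
            tree: filter(&c.tree),
        })
        .collect()
}

/// A hypothetical tree filter: rename README.md to README if it exists.
fn rename_readme(tree: &Snapshot) -> Snapshot {
    let mut out = tree.clone();
    if let Some(contents) = out.remove("README.md") {
        out.insert("README".to_string(), contents);
    }
    out
}

/// Helper to build a snapshot from (path, contents) pairs.
fn snapshot(files: &[(&str, &str)]) -> Snapshot {
    files.iter().map(|(k, v)| (k.to_string(), v.to_string())).collect()
}

fn main() {
    let history = vec![
        Commit { parent: None, tree: snapshot(&[("README.md", "v1")]) },
        Commit { parent: Some(0), tree: snapshot(&[("README.md", "v2"), ("a.rs", "fn a() {}")]) },
    ];
    let rewritten = filter_history(&history, rename_readme);
    for commit in &rewritten {
        println!("{:?}", commit);
    }
}
```

Note that the filter never looks at a diff or at a neighbouring commit, which is why nothing in this model can produce a conflict.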

Because the filters apply on the snapshots instead of the diffs, there’s no chance for this to cause conflicts like in git rebase. In git rebase, if I have one commit that makes changes to a file, and I change the previous commit to just remove the area of the file that was changed, I’d have a conflict and git would ask me to figure out how the changes are supposed to be applied.

In git-filter-branch, if I do this, it will just power through. Unless you explicitly write your filters to refer to previous commits, the new commit is created in isolation, so it doesn’t worry about changes to the previous commits. If you had indeed edited the previous commit, the new commit will appear to undo those changes and apply its own on top of that.

filter-branch is generally for operations you want to apply pervasively to a repository. If you just want to tweak a few commits, it won’t work, since future commits will appear to undo your changes. git rebase is for when you want to tweak a few commits.

So, how do you use it?

The basic syntax is git filter-branch <filters> branch_name. You can use HEAD or @ to refer to the current branch instead of explicitly typing branch_name.

A very simple and useful filter is the subdirectory filter. It makes a given subdirectory the repository root. You use it via git filter-branch --subdirectory-filter name_of_subdir @. This is useful for extracting the history of a folder into its own repository.

Another useful filter is the tree filter, you can use it to do things like moving around, creating, or removing files. For example, if you want to move README.md to README in the entire history, you’d do something like git filter-branch --tree-filter 'mv README.md README' @ (you can also achieve this much faster with some manual work and rebase). The tree filter will work by checking out each commit (in a separate temporary folder), running your filter on the working directory, adding any changes to the index (no need to git add yourself), and committing the new index.

The --prune-empty argument is useful here, as it removes commits which are now empty due to the rewrite.

Because it is checking out each commit, this filter is quite slow. When I initially was trying to do Steve’s task on the rust repo, I wrote a long tree filter and it was taking forever.

The faster version is the index filter. However, this is a bit trickier to work with (which is why I tend to use a tree filter if I can get away with it). What this does is operate on the index, directly.

The “index” is basically where things go when you git add them. Running git add will create temporary objects for the added file, and modify the WIP index (directory tree) to include a reference to the new file or change an existing file reference to the new one. When you commit, this index is packaged up into a commit and stored as an object. (More on how these objects work in a later section)

Now, since this deals with files that are already stored as objects, git doesn’t need to unwrap these objects and create a working directory to operate on them. So, with --index-filter, you can operate on these in a much faster way. However, since you don’t have a working directory, stuff like adding and moving files can be trickier. You often have to use git update-index to make this work.

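However, a useful index filter is one which just scrubs a file (or files) from history, something like git filter-branch --index-filter 'git rm --cached --ignore-unmatch filename' @ (with filename standing in for the path you want gone).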

The --ignore-unmatch makes the command still succeed if the file doesn’t exist. filter-branch will fail if one of the filters fails. In general I tend to write fallible filters like command1 1>&2 2>/dev/null ; command2 1>&2 2>/dev/null ; true, which makes it always succeed and also ignores any stdout/stderr output (which tends to make the progress screen fill up fast).

The --cached argument on git rm makes it operate only on the index, not the working directory. This is great, because we don’t have a working directory right now.

I rarely use git update-index so I’m not really going to try and explain how it can be used here. But if you need to do more complex operations in an index filter, that’s the way to go.

There are many other filters, like --commit-filter (lets you discard a commit entirely), --msg-filter (rewriting commit messages), and --env-filter (changing things like author metadata or other env vars). You can see a complete list with examples in the docs.

## How did I perform the rewrites on the reference and nomicon?

For the Rust Reference, basically I had to extract the history of src/doc/reference.md, AND src/doc/reference/* (reference.md was split up into reference/*.md recently) into its own repository. This is an easy tree filter to write, but tree filters take forever.

Instead of trying my luck with an index filter, I decided to just make it so that the tree filter would be faster. I first extracted src/doc/:

Now I had a branch that contained only the history of src/doc, with the root directory moved to doc. This is a much smaller repo than the entirety of Rust.

Now, I moved reference.md into reference/:

As mentioned before, the /dev/null and true bits are because the mv command will fail in some cases (when reference.md doesn’t exist), and I want it to just continue without complaining when that happens. I only care about moving instances of that file, if that file doesn’t exist there it’s still okay.

Now, everything I cared about was within reference. The next step was simple:

The whole process took maybe 10 minutes to run, most of the time being spent by the second command. The final result can be found here.

For the nomicon, the task was easier. In the case of the nomicon, it has always resided in src/doc/nomicon, src/doc/tarpl, or at the root. This last bit is interesting, when Alexis was working on the nomicon, he started off by hacking on it in a separate repo, but then within that repo moved it to src/doc/tarpl, and performed a merge commit with rustc. There’s no inherent restriction in Git that all merges must have a common ancestor, and you can do stuff like this. I was quite surprised when I saw this, since it’s pretty uncommon in general, but really, many projects of that size will have stuff like this. Servo and html5ever do too, and usually it’s when a large project is merged into it after being developed on the side.

This sounds complicated to work with, but it wasn’t that hard. I took the same subdirectory-filtered doc directory branch used for the reference. Then, I renamed tarpl/ to nomicon/ via a tree filter, and ran another subdirectory filter:

Now, I had the whole history of the nomicon in the root dir. Except for the commits made by Alexis before his frankenmerge, because these got removed in the first subdirectory filter (the commits were operating outside of src/doc, even though their contents eventually got moved there).

But, at this stage, I already had a branch with the nomicon at the root. Alexis’ original commits were also operating on the root directory. I can just rebase here, and the diffs of my commits will cleanly apply!

I found the commit (a54e64) where everything was moved to tarpl/, and took its parent (c7919f). Then, I just ran git rebase --root c7919f, and everything cleanly rebased. This was expected: my history went back to the first child of a54e64 with the files already moved, and a54e64 itself only moved files, so the diffs should apply cleanly.

The final result can be found here.

## Appendix: How are commits actually stored?

The way the actual implementation of a commit works is that each file being stored is hashed and stored in a compressed format, indexed by the hash. A directory (“tree”) will be a list of hashes, one for each file/directory inside it, alongside the filenames and other metadata. This list will be hashed and used everywhere else to refer to the directory.

A commit will reference the “tree” object for the root directory via its hash.

Now, if you make a commit changing some files, most of the files will be unchanged. So will most of the directories. So the commits can share the objects for the unchanged files/directories, reducing their size. This is basically a copy-on-write model. Furthermore, there’s a second optimization called a “packfile”, wherein instead of storing a file git will store a delta (a diff) and a reference to the file the diff must be applied to.
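The sharing described above can be sketched with a tiny content-addressed store in Rust. This is only an illustration under loud assumptions: `Store`, `Object`, and `put_tree` are made-up names, trees are a single directory level deep, and `DefaultHasher` stands in for SHA-1 (real git also hashes a type-and-length header along with the content).

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

/// Stand-in for git's SHA-1 object IDs (assumption: any stable hash will do
/// for illustrating the idea).
type ObjectId = u64;

#[derive(Hash, Clone, Debug)]
enum Object {
    Blob(String),
    /// A tree maps entry names to the IDs of the objects they refer to.
    Tree(BTreeMap<String, ObjectId>),
}

#[derive(Default)]
struct Store {
    objects: BTreeMap<ObjectId, Object>,
}

impl Store {
    /// Store an object under the hash of its contents. Identical contents
    /// hash to the same ID, so they are only stored once.
    fn put(&mut self, obj: Object) -> ObjectId {
        let mut hasher = DefaultHasher::new();
        obj.hash(&mut hasher);
        let id = hasher.finish();
        self.objects.entry(id).or_insert(obj);
        id
    }

    /// Build a flat tree (one directory level) from (name, contents) pairs.
    fn put_tree(&mut self, files: &[(&str, &str)]) -> ObjectId {
        let mut entries = BTreeMap::new();
        for (name, contents) in files {
            let id = self.put(Object::Blob(contents.to_string()));
            entries.insert(name.to_string(), id);
        }
        self.put(Object::Tree(entries))
    }
}

fn main() {
    let mut store = Store::default();
    let before = store.put_tree(&[("a.md", "alpha"), ("b.md", "beta")]);
    let after = store.put_tree(&[("a.md", "alpha"), ("b.md", "beta, edited")]);
    // Two trees plus three distinct blobs: "alpha" is stored once and shared.
    println!("trees differ: {}", before != after);
    println!("objects stored: {}", store.objects.len());
}
```

Only the edited file and the tree that points at it get new objects; everything else is reused by ID, which is the copy-on-write behaviour in miniature.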

We can see this at work using git cat-file. cat-file lets you view objects in the “git filesystem”, which is basically a bunch of hash-indexed objects stored in .git/objects. You can view them directly by traversing that directory (they’re organized as a trie), but cat-file -p will let you pretty-print their contents since they’re stored in a binary format.

I’m working with the repo for the Rust Book, playing with commit 4822f2. It’s a commit that changes just one file (second-edition/src/ch15-01-box.md), perfect.

This tells us that the commit is a thing with some author information, a pointer to a parent, a commit message, and a “tree”. What’s this tree?

This is just a directory! You can see that each entry has a hash. We can use git cat-file -p to view each one. Looking at a tree object will just give us a subdirectory, but the blobs will show us actual files!

So how does this share objects? Let’s look at the previous commit:

If you look closely, all of these hashes are the same, except for the hash for second-edition. For the hashes which are the same, these objects are being shared across commits. The differing hash is d5672d in the newer commit, and d48b2e in the older one.

Let’s look at the objects:

Again, these are the same, except for that of src. src has a lot of files in it, which will clutter this post, so I’ll run a diff on the outputs of cat-file:

```
$ diff -U5 <(g cat-file -p f9fc05a6ff78b8211f4df931ed5e32c937aba66c) <(g cat-file -p 3f8db396566716299330cdd5f569fb0a0c4615dd)
--- /dev/fd/63  2017-03-05 11:58:22.000000000 -0800
+++ /dev/fd/62  2017-03-05 11:58:22.000000000 -0800
@@ -63,11 +63,11 @@
 100644 blob ff6b8f8cd44f624e1239c47edda59560cdf491ae   ch14-02-publishing-to-crates-io.md
 100644 blob c53ef854a74b6c9fbd915be1bf824c6e78439c42   ch14-03-cargo-workspaces.md
 100644 blob 3fb59f9cc85b6b81994e83a34d542871a260a8f0   ch14-04-installing-binaries.md
 100644 blob e1cd1ca779fdf202af433108a8af6eda317f2717   ch14-05-extending-cargo.md
 100644 blob 3173cc508484cc447ebe42a024eac7d9e6c2ddcd   ch15-00-smart-pointers.md
-100644 blob 14c5533bb3b604c6e6274db278d1e7129f78d55d   ch15-01-box.md
+100644 blob 29d87933d6832374b87d98aa5588e09e0c1a4991   ch15-01-box.md
 100644 blob 47b35ed489d63ce6a885289fec01b7b16ba1afea   ch15-02-deref.md
 100644 blob 2d20c55cc8605c0c899bc4867adc6b6ea1f5c902   ch15-03-drop.md
 100644 blob 8e3fcf4e83fe1ce985a7c0b479b8b16701765aaf   ch15-04-rc.md
 100644 blob a4ade4ae8bf5296d79ed51d69506e71a83f9f489   ch15-05-interior-mutability.md
 100644 blob 3a4db5616c4f5baeb95d04ea40c6747e60181684   ch15-06-reference-cycles.md
```


As you can see, only the file that was changed in the commit has a new blob stored. If you view 14c553 and 29d879 you’ll get the pre- and post- commit versions of the file respectively.

So basically, each commit stores a tree of references to objects, often sharing nodes with other commits.

I haven’t had the opportunity to work with packfiles much, but they’re an additional optimization on top of this. Aditya’s post is a good intro to these.

# What Are Sum, Product, and Pi Types?

See also: Tony’s post on the same topic

You often hear people saying “Language X[1] has sum types” or “I wish language X had sum types”[2], or “Sum types are cool”.

Much like fezzes and bow ties, sum types are indeed cool.

These days, I’ve also seen people asking about “Pi types”, because of this Rust RFC.

But what does “sum type” mean? And why is it called that? And what, in the name of sanity, is a Pi type?

Before I start, I’ll mention that while I will be covering some type theory to explain the names “sum” and “product”, you don’t need to understand these names to use these things! Far too often do people have trouble understanding relatively straightforward concepts in languages because they have confusing names with confusing mathematical backgrounds[3].

## So what’s a sum type? (the no-type-theory version)

In essence, a sum type is basically an “or” type. Let’s first look at structs.

Foo is a bool AND a String. You need one of each to make one. This is an “and” type, or a “product” type (I’ll explain the name later).
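For concreteness, a struct like the one described could look as follows in Rust. The field names `x` and `y` are placeholders chosen for this sketch.

```rust
/// A product ("and") type: every Foo holds both a bool and a String.
#[derive(Debug, Clone, PartialEq)]
struct Foo {
    x: bool,
    y: String,
}

fn main() {
    // Constructing a Foo requires a value for each field.
    let foo = Foo { x: true, y: String::from("hello") };
    println!("{:?}", foo);
}
```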

So what would an “or” type be? It would be one where the value can be a bool OR a String. You can achieve this with C++ with a union:

However, this isn’t exactly right, since the value doesn’t store the information of which variant it is. You could store false and the reader wouldn’t know if you had stored an empty string or a false bool.

There’s a pattern called “tagged union” (or “discriminated union”) in C++ which bridges this gap.

Here, you manually set the tag when setting the value. C++ also has std::variant (or boost::variant) that encapsulates this pattern with a better API.

While I’m calling these “or” types here, the technical term for such types is “sum” types. Other languages have built-in sum types.

Rust has them and calls them “enums”. These are a more generalized version of the enums you see in other languages.
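As a small sketch of what that looks like, here is a bool-or-String type written as a Rust enum. The variant names and the `describe` helper are made up for illustration.

```rust
/// A sum ("or") type: a Foo is either a bool or a String, never both.
#[derive(Debug, Clone, PartialEq)]
enum Foo {
    Bool(bool),
    Str(String),
}

/// The variant acts as the tag, so a reader can always tell which case
/// was stored, unlike with a bare C++ union.
fn describe(foo: &Foo) -> String {
    match foo {
        Foo::Bool(b) => format!("bool: {}", b),
        Foo::Str(s) => format!("string: {:?}", s),
    }
}

fn main() {
    // A false bool and an empty string are now distinguishable.
    println!("{}", describe(&Foo::Bool(false)));
    println!("{}", describe(&Foo::Str(String::new())));
}
```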

Swift is similar, and also calls them enums.

You can fake these in Go using interfaces, as well. Typescript has built-in unions which can be typechecked without any special effort, but you need to add a tag (like in C++) to pattern match on them.

Of course, Haskell has them:

One of the very common things that languages with sum types do is express nullability as a sum type;

Generally, these languages have “pattern matching”, which is like a switch statement on steroids. It lets you match on and destructure all kinds of things, sum types being one of them. Usually, these are “exhaustive”, which means that you are forced to handle all possible cases. In Rust, if you remove that None branch, the program won’t compile. So you’re forced to deal with the none case, somehow.
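Here is a minimal sketch of nullability as a sum type with an exhaustive match, using Rust's standard `Option`. The functions `first_char` and `describe` are hypothetical examples.

```rust
/// Return the first character of `s`, or None if the string is empty.
fn first_char(s: &str) -> Option<char> {
    s.chars().next()
}

fn describe(s: &str) -> String {
    // Removing either arm makes this match non-exhaustive, and the
    // compiler rejects the program.
    match first_char(s) {
        Some(c) => format!("starts with {}", c),
        None => String::from("empty"),
    }
}

fn main() {
    println!("{}", describe("abc"));
    println!("{}", describe(""));
}
```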

In general sum types are a pretty neat and powerful tool. Languages with them built-in tend to make heavy use of them, almost as much as they use structs.

## Why do we call it a sum type?

Here be (type theory) dragons

Let’s step back a bit and figure out what a type is.

It’s really a restriction on the values allowed. It can have things like methods and whatnot dangling off it, but that’s not so important here.

In other words, it’s like[4] a set. A boolean is the set $$\{\mathtt{true}, \mathtt{false}\}$$. An 8-bit unsigned integer (u8 in Rust) is the set $$\{0, 1, 2, 3, \dots, 254, 255\}$$. A string is a set with infinite elements, containing all possible valid strings[5].

What’s a struct? A struct with two fields contains every possible combination of elements from the two sets.

The set of possible values of Foo is

$\{(\mathtt{x}, \mathtt{y}): \mathtt{x} \in \mathtt{bool}, \mathtt y \in \mathtt{u8}\}$

(Read as “The set of all $$(\mathtt{x}, \mathtt{y})$$ where $$\tt x$$ is in $$\mathtt{bool}$$ and $$\tt y$$ is in $$\mathtt{u8}$$”)

This is called a Cartesian product, and is often represented as $$\tt Foo = bool \times u8$$. An easy way to view this as a product is to count the possible values: The number of possible values of Foo is the number of possible values of bool (2) times the number of possible values of u8 (256).

A general struct would be a “product” of the types of each field, so something like

is $$\mathtt{Bar = bool \times u8 \times bool \times String}$$

This is why structs are called “product types”[6].

You can probably guess what comes next – Rust/Swift enums are “sum types”, because they are the sum of the two sets.

is a set of all values which are valid booleans, and all values which are valid integers. This is a sum of sets, $$\tt Foo = bool + u8$$. More accurately, it’s a disjoint union, where if the input sets have overlap, the overlap is “discriminated” out.

An example of this being a disjoint union is:

This is not $$\tt Bar = bool + bool + u8$$, because $$\tt bool + bool = bool$$, (regular set addition doesn’t duplicate the overlap).

Instead, it’s something like

$\tt Bar = bool + otherbool + u8$

where $$\tt otherbool$$ is also a set $$\tt \{true, false\}$$, except that these elements are different from those in $$\tt bool$$. You can look at it as if

$\tt otherbool = \{true_2, false_2\}$

so that

$\mathtt{bool + otherbool} = \{\mathtt{true, false, true_2, false_2}\}$

For sum types, the number of possible values is the sum of the number of possible values of each of its component types.
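The counting argument is easy to check by brute force. The sketch below enumerates every value of a bool-and-u8 struct and a bool-or-u8 enum; `Pair` and `Either` are hypothetical names used here to keep the two apart.

```rust
/// Product type: bool x u8.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Pair {
    x: bool,
    y: u8,
}

/// Sum type: bool + u8.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Either {
    Bool(bool),
    Byte(u8),
}

/// Every combination of a bool and a u8.
fn all_pairs() -> Vec<Pair> {
    let mut out = Vec::new();
    for x in [false, true] {
        for y in 0..=u8::MAX {
            out.push(Pair { x, y });
        }
    }
    out
}

/// Every bool, followed by every u8, each tagged with its variant.
fn all_eithers() -> Vec<Either> {
    let mut out = Vec::new();
    for b in [false, true] {
        out.push(Either::Bool(b));
    }
    for n in 0..=u8::MAX {
        out.push(Either::Byte(n));
    }
    out
}

fn main() {
    assert_eq!(all_pairs().len(), 2 * 256); // product of the sizes
    assert_eq!(all_eithers().len(), 2 + 256); // sum of the sizes
    println!("product: {}, sum: {}", all_pairs().len(), all_eithers().len());
}
```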

So, Rust/Swift enums are “sum types”.

You may often notice the terminology “algebraic datatypes” (ADT) being used, usually that’s just talking about sum and product types together – a language with ADTs will have both.

In fact, you can even have exponential types! The notation $$A^B$$ in set theory does mean something: it’s the set of all possible mappings from $$B$$ to $$A$$. The number of elements is $$N_A^{N_B}$$. So basically, the type of a function (which is a mapping) is an “exponential” type. You can also view it as an iterated product type, a function from type B to A is really a struct like this:

given a value of the input b, the function will find the right field of my_func and return the mapping. Since a struct is a product type, this is

$\mathtt{A}^{N_\mathtt{B}} = \tt A \times A \times A \times \dots$

making it an exponential type.
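For a type small enough to enumerate, this can be checked directly. The sketch below writes a function from bool to bool as a lookup-table struct with one field per input; `BoolFn` and its field names are invented for this example.

```rust
/// A function from bool to bool, written out as a lookup table:
/// one field per possible input.
#[derive(Debug, Clone, Copy, PartialEq)]
struct BoolFn {
    on_false: bool,
    on_true: bool,
}

impl BoolFn {
    /// "Calling" the function just picks the field for the given input.
    fn call(&self, input: bool) -> bool {
        if input {
            self.on_true
        } else {
            self.on_false
        }
    }
}

/// Enumerate every possible function from bool to bool.
fn all_bool_fns() -> Vec<BoolFn> {
    let mut out = Vec::new();
    for on_false in [false, true] {
        for on_true in [false, true] {
            out.push(BoolFn { on_false, on_true });
        }
    }
    out
}

fn main() {
    // 2 possible outputs raised to the power of 2 possible inputs.
    assert_eq!(all_bool_fns().len(), 2usize.pow(2));
    let not = BoolFn { on_false: true, on_true: false };
    println!("not(true) = {}", not.call(true));
}
```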

You can even take derivatives of types! (h/t Sam Tobin-Hochstadt for pointing this out to me)

## What, in the name of sanity, is a Pi type?

It’s essentially a form of dependent type. A dependent type is when your type can depend on a value. An example of this is integer generics, where you can do things like Array<bool, 5>, or template<unsigned int N, typename T> Array<T, N> ... (in C++).

Note that the type signature contains a type dependent on an integer, being generic over multiple different array lengths.

The name comes from how a constructor for these types would look:
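A hedged sketch of such a constructor, written with the const generics that present-day Rust provides. This syntax is an assumption on my part, and the built-in array type `[bool; N]` stands in for `Array<bool, N>`.

```rust
/// Build an array of N `false` values; N is part of the return type.
fn make_array<const N: usize>() -> [bool; N] {
    [false; N]
}

fn main() {
    // Each choice of N picks out a function with a different return type.
    let a = make_array::<3>();
    let b: [bool; 5] = make_array(); // N is inferred from the annotation
    println!("{} {}", a.len(), b.len());
}
```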

What’s the type of make_array here? It’s a function which can accept any integer and return a different type in each case. You can view it as a set of functions, where each function corresponds to a different integer input. It’s basically:

Given an input, the function chooses the right child function here, and calls it.

This is a struct, or a product type! But it’s a product of an infinite number of types[7].

We can look at it as

$\texttt{make_array} = \prod\limits_{x = 0}^\infty\left( \texttt{fn()} \mathtt\to \texttt{Array<bool, x>}\right)$

The usage of the $$\Pi$$ symbol to denote an iterative product gives this the name “Pi type”.

In languages with lazy evaluation (like Haskell), there is no difference between having a function that can give you a value, and actually having the value. So, the type of make_array is the type of Array<bool, N> itself in languages with lazy evaluation.

There’s also a notion of a “sigma” type, which is basically

$\sum\limits_{x = 0}^\infty \left(\texttt{fn()} \mathtt\to \texttt{Array<bool, x>}\right)$

With the Pi type, we had “for all N we can construct an array”, with the sigma type we have “there exists some N for which we can construct this array”. As you can expect, this type can be expressed with a possibly-infinite enum, and instances of this type are basically instances of Array<bool, N> for some specific N where the N is only known at runtime. (much like how regular sum types are instances of one amongst multiple types, where the exact type is only known at runtime). Vec<bool> is conceptually similar to the sigma type Array<bool, ?>, as is &[bool].
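As a rough illustration, here is a finite stand-in for that possibly-infinite enum. `SomeArray` is a hypothetical type with only three variants; a real sigma type would have one per possible length.

```rust
/// "An array of bools of *some* length", cut off at length 2 for brevity.
#[derive(Debug, Clone, PartialEq)]
enum SomeArray {
    Len0([bool; 0]),
    Len1([bool; 1]),
    Len2([bool; 2]),
}

impl SomeArray {
    /// The length is only recovered at runtime, by inspecting the variant.
    fn as_slice(&self) -> &[bool] {
        match self {
            SomeArray::Len0(a) => &a[..],
            SomeArray::Len1(a) => &a[..],
            SomeArray::Len2(a) => &a[..],
        }
    }
}

fn main() {
    let values = vec![
        SomeArray::Len0([]),
        SomeArray::Len1([true]),
        SomeArray::Len2([true, false]),
    ];
    for v in &values {
        println!("length {}", v.as_slice().len());
    }
}
```

Erasing the variant and keeping only the slice is more or less what &[bool] gives you.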

## Wrapping up

Types are sets, and we can do set-theory things on them to make cooler types.

Let’s try to avoid using confusing terminology, however. If Rust does get “pi types”, let’s just call them “dependent types” or “const generics” :)

Thanks to Zaki, Avi Weinstock, Corey Richardson, and Peter Atashian for reviewing drafts of this post.

1. Rust, Swift, sort of Typescript, and all the functional languages who had it before it was cool.

2. Lookin’ at you, Go.

4. Types are not exactly sets due to some differences, but for the purposes of this post we can think of them like sets.

5. Though you can argue that strings often have their length bounded by the pointer size of the platform, so it’s still a finite set.

6. This even holds for zero-sized types. For more examples, check out this blog post.

7. Like with strings, in practice this would probably be bounded by the integer type chosen.

# Mitigating Underhandedness: Fuzzing Your Code

This may be part of a collaborative blog post series about underhanded Rust code. Or it may not. I invite you to write your own posts about underhanded code to make it so!

The submission deadline for the Underhanded Rust competition has been extended, so let’s talk more about how to keep your code working and free from bugs/underhandedness!

Now, really, underhanded bugs are just another form of bug. And how do we find bugs? We test!

We write unit tests. We run the code under Valgrind, ASan, MSan, UBSan, TSan, and any other sanitizer we can get our hands on. Tests tests tests. More tests. Tests.

But, there’s a problem here. You need to write test cases to make this work. These are inputs fed to your code after which you check whatever invariants your code has. There’s no guarantee that the test cases you write will exercise all the code paths in your program. This applies to sanitizers too: they are limited to testing the code paths that your test cases hit.

Of course, you can use code coverage tools to ensure that all these code paths will be hit. However, there’s a conflict here – your code will have many code paths that are not supposed to be hit ever. Things like redundant bounds checks, null checks, etc. In Rust programs such code paths generally use panics.

Now, these code paths are never supposed to be hit, so they’ll never show up in your code coverage. But you don’t have a guarantee that they can never be hit, short of formally verifying your program. The only solution here is writing more test cases.

Aside from that, even ignoring those code paths, you still need to manually write test cases for everything. For each possible code path in your code, if you want to be sure.

Who wants to manually write a million test cases?

Enter fuzzing. What fuzzing will do is feed your program random inputs, carefully watching the codepaths being taken, and try to massage the inputs so that new, interesting (usually crashy) codepaths are taken. You write tests for the fuzzer such that they can accept arbitrary input, and the fuzzer will find cases where they crash or panic.

One of the most popular fuzzers out there is AFL, which takes a binary and feeds it random input. Rust has a library that you can use for running AFL, however it currently needs to be run via a Docker image or needs a recompilation of rustc, since it adds a custom LLVM pass. We’re working on making this step unnecessary.

However, as of a few weeks ago, we now have bindings for libFuzzer, which uses existing instrumentation options built in to LLVM itself! libFuzzer works a bit differently; instead of giving it a binary, you write a function in a special way and give it a library containing that function, which it turns into a fuzzer binary. This is faster, since the fuzzer lives inside the binary itself and it doesn’t need to execute a new program each time.

Using libFuzzer in Rust is easy. Install cargo-fuzz:

Now, within your crate, initialize the fuzz setup:

This will create a fuzzing crate in fuzz/, with a single “fuzz target”, fuzzer_script_1. You can add more such targets with cargo fuzz add name_of_target. Fuzz targets are small libraries with a single function in them; the function that will be called over and over again by the fuzzer. It is up to you to fill in the body of this function, such that the program will crash or panic if and only if something goes wrong.

For example, for the unicode-segmentation crate, one of the fuzz targets I wrote just takes the string, splits it by grapheme and word boundaries, recombines it, and then asserts that the new string is the same.

The other targets ensure that the forward and reverse word/grapheme iterators produce the same results. They all take the byte slice input, attempt to convert to UTF8 (silently failing – NOT panicking – if not possible), and then use the string as an input testcase.
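The shape of such a target can be sketched with the standard library alone. This is not the unicode-segmentation target itself: splitting on spaces with `split_inclusive` is a stand-in for grapheme and word segmentation, and `roundtrips` and `fuzz_one` are hypothetical names. In a real cargo-fuzz target, the body of `fuzz_one` would sit inside the entry point that cargo fuzz init generates.

```rust
/// Property under test: splitting a string into pieces and gluing the
/// pieces back together must reproduce the original string.
fn roundtrips(s: &str) -> bool {
    let pieces: Vec<&str> = s.split_inclusive(' ').collect();
    pieces.concat() == s
}

/// Shape of a fuzz target body: accept arbitrary bytes, silently skip
/// inputs that are not valid UTF-8, and panic only if the property fails.
fn fuzz_one(data: &[u8]) {
    if let Ok(s) = std::str::from_utf8(data) {
        assert!(roundtrips(s), "round trip changed {:?}", s);
    }
}

fn main() {
    // A fuzzer would call this over and over with inputs it generates.
    fuzz_one(b"hello fuzzing world");
    fuzz_one(&[0xff, 0xfe]); // invalid UTF-8: ignored, not a panic
    fuzz_one(b"");
}
```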

Now, these targets will panic if the test fails, and the fuzzer will try and force that panic to happen. But also, these targets put together exercise most of the API surface of the crate, so the fuzzer may also find panics (or even segmentation faults!) in the crate itself. For example, the fuzz target for rust-url doesn’t itself assert; all it does is try to parse the given string. The fuzzer will try to get the URL parser to panic.

To run a fuzz script:

This will start the fuzzer, running until it finds a crash or panic. It may also find other things like inputs which make the code abnormally slow.

Fuzzing can find some interesting bugs. For example, the unicode-segmentation fuzzers found this bug, where an emoji followed by two skin tone modifiers isn’t handled correctly. We’d probably never have been able to come up with this testcase on our own. But the fuzzer could find it!

The Rust Cap’n Proto crate ran cargo-fuzz and found a whole ton of bugs. There are more such examples in the trophy case (be sure to add any of your own findings to the trophy case, too!)

cargo-fuzz is relatively new, so the API and behavior may still be tweaked a bit before 1.0. But you can start taking it for a spin now, and finding bugs!

# Clarifying Misconceptions About SHAttered

This week Google published a SHA-1 collision.

There’s a lot of confusion about the implications of this. A lot of this is due to differences of opinion on what exactly constitutes a “new” collision. I tweeted about this. The webpage for the attack itself is misleading, saying that the answer to “Who is capable of mounting this attack?” is people with Google-esque resources. This depends on what exactly you mean by “this attack”.

So I’m seeing a lot of “oh well just another anti-milestone for SHA, doesn’t affect anyone since it’s still quite expensive to exploit” reactions, as well as the opposite “aaaaa everything is on fire” reaction. Both are wrong. It has practical implications for you even if you are certain that you won’t attract the ire of an entity with a lot of computational power. None of these implications, however, are likely to be disastrous.

TLDR: Now anyone, without needing Google-esque resources, can generate two colliding PDFs with arbitrary visual content in each.

(In fact, there’s already a PDF collision-generator up where you can upload two images and get a PDF with collisions in it)

## Okay, back up a bit. What’s a hash? What’s SHA-1?

I explained this a bit in my older post about zero-knowledge-proofs.

In essence, a hash function takes some data (usually of arbitrary size), and produces a value called a hash (usually of fixed size). The function has some additional properties:

• In almost all cases, a small perturbation in the input will lead to a large perturbation in the hash
• Given an input and its hash, it is computationally hard to find an alternate input producing the same hash
• It’s also hard to just find two inputs that hash to the same value, though this is usually easier than the previous one

When two inputs hash to the same value, this is called a collision. As mentioned, it is easier to find a collision than to find a colliding alternate input for a known input.

SHA-1 is one such hash function. It’s been known for a while that it’s insecure, and the industry has largely moved off of it, but it’s still used, so it can still be a problem.

## What did the researchers do?

They found a hash collision for SHA-1. In essence, they found two strings, A and B, where SHA1(A) == SHA1(B).

However, given the way SHA-1 works, this means that you can generate infinitely many other such pairs of strings. And given the nature of the exact A and B they created, it is possible to use this to create arbitrary colliding PDFs.

Basically, SHA-1 (like many other hash functions) operates on “blocks”. These are fixed-size chunks of data, where the size is a property of the hash function. For SHA-1 this is 512 bits.

The function starts off with an “initial” built-in hash. It takes the first block of your data and this hash, and does some computation with the two to produce a new hash, which is its state after the first block.

It will then take this hash and the second block, and run the same computations to produce a newer hash, which is its state after the second block. This is repeated till all blocks have been processed, and the final state is the result of the function.

There’s an important thing to notice here. At each block, the only inputs are the block itself and the hash of the string up to that block.

This means, if A and B are of a size that is a multiple of the block size, and SHA1(A) == SHA1(B), then SHA1(A + C) == SHA1(B + C). This is because, when the hash function reaches C, the state will be the same due to the hash collision, and after this point the next input blocks are identical in both cases, so the final hash will be the same.

Now, while you might consider A+C, B+C to be the “same collision” as A, B, the implications of this are different than just “there is now one known pair of inputs that collide”, since everyone now has the ability to generate new colliding inputs by appending an arbitrary string to A and B.

Of course, these new collisions have the restriction that the strings will always start with A or B and the suffixes will be identical. If you want to break this restriction, you will have to devote expensive resources to finding a new collision, like Google did.
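To make the suffix argument concrete, here is a minimal sketch using a toy block-chained hash. It is not SHA-1: the `compress` function, its constants, and the 16-bit state are all made up for this illustration, chosen only so that a collision can be brute-forced in a fraction of a second. The toy also skips padding, so it requires block-aligned input.

```rust
use std::collections::HashMap;

/// Block size of the toy hash in bytes (SHA-1 uses 64-byte blocks).
const BLOCK: usize = 4;

/// Toy compression function: mixes one block into a 16-bit state.
/// This is NOT SHA-1; the state is tiny so that collisions are cheap.
fn compress(state: u16, block: &[u8]) -> u16 {
    let mut s = state;
    for &b in block {
        s = s.rotate_left(5) ^ u16::from(b);
        s = s.wrapping_mul(0x9E37).wrapping_add(0x79B9);
    }
    s
}

/// Chain the compression function over a block-aligned input,
/// starting from a fixed built-in initial state.
fn toy_hash(data: &[u8]) -> u16 {
    assert!(data.len() % BLOCK == 0, "input must be block-aligned");
    data.chunks(BLOCK)
        .fold(0x1234, |state, block| compress(state, block))
}

/// Brute-force two different one-block inputs with the same toy hash.
fn find_collision() -> ([u8; BLOCK], [u8; BLOCK]) {
    let mut seen: HashMap<u16, [u8; BLOCK]> = HashMap::new();
    for n in 0u32.. {
        let block = n.to_le_bytes();
        let h = toy_hash(&block);
        if let Some(&earlier) = seen.get(&h) {
            return (earlier, block);
        }
        seen.insert(h, block);
    }
    unreachable!("a 16-bit state must repeat within 65537 inputs")
}

/// Build prefix + suffix as one byte string.
fn with_suffix(prefix: &[u8], suffix: &[u8]) -> Vec<u8> {
    [prefix, suffix].concat()
}

fn main() {
    let (a, b) = find_collision();
    assert_ne!(a, b);
    assert_eq!(toy_hash(&a), toy_hash(&b));

    // Any shared, block-aligned suffix C keeps the two hashes equal,
    // because both computations enter C with the same state.
    let c = b"any block-aligned suffix";
    assert_eq!(
        toy_hash(&with_suffix(&a, c)),
        toy_hash(&with_suffix(&b, c))
    );
    println!("A = {:?}, B = {:?} collide, with or without C", a, b);
}
```

The expensive part is `find_collision`; once a colliding pair exists, extending it with a common suffix costs nothing, which is the situation everyone is now in with the published SHA-1 pair.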

## How does this let us generate arbitrary colliding PDFs?

So this exploit actually uses features of the JPEG format to work. It was done in a PDF format since JPEGs often get compressed when sent around the Internet. However, since both A and B start a partial PDF document, they can only be used to generate colliding PDFs, not JPEGs.

I’m going to first sketch out a simplified example of what this is doing, using a hypothetical pseudocode-y file format. The researchers found a collision between the strings:

• A: <header data> COMMENT(<nonce for A>) DISPLAY IMAGE 1
• B: <header data> COMMENT(<nonce for B>) DISPLAY IMAGE 2

Here, <header data> is whatever is necessary to make the format work, and the “nonce”s are strings that make A and B have the same hash. Finding these nonces is where the computational power is required, since you basically have to brute-force a solution.

Now, to both these strings, they append a suffix C: IMAGE 1(<data for image 1>) IMAGE 2(<data for image 2>). This creates two complete documents. Both of the documents contain both images, but each one is instructed to display a different one. Note that since SHA1(A) == SHA1(B), SHA1(A + C) == SHA1(B + C), so these final documents have the same hash.

The contents of C don’t affect the collision at all. So, we can insert any two images in C, to create our own personal pair of colliding PDFs.

The actual technique used is similar to this, and it relies on JPEG comment fields. They have found a collision between two strings that look like:

By playing with the nonces, they managed to generate a collision between A and B. In the first PDF, the embedded image has a comment containing only the nonce. Once the JPEG reader gets past that comment, it sees the first image, displays it, and then sees the end-of-file marker and decides to stop. Since the PDF format doesn’t try to interpret the image itself, the PDF format won’t be boggled by the fact that there’s some extra garbage data after the JPEG EOF marker. It simply takes all the data between the “begin embedded image” and “end embedded image” blocks, and passes it to the JPEG decoder. The JPEG decoder itself stops after it sees the end of file marker, and doesn’t get to the extra data for the second image.

In the second PDF, the JPEG comment is longer, and subsumes the first image (as well as the EOF marker). Thus, the JPEG decoder directly gets to the second image, which it displays.

Since the actual images are not part of the original collision (A and B), you can substitute any pair of JPEG images there, with some length restrictions.

## What are the implications?

This does mean that you should not trust the integrity of a PDF when all you have to go on is its SHA-1 hash. Use a better hash. Anyone can generate these colliding PDFs now.

Fortunately, since all such PDFs will have the same prefix A or B, you can detect when such a deception is being carried out.

Don’t check colliding PDFs into SVN. Things break.

In some cases it is possible to use the PDF collision in other formats. For example, it can be used to create colliding HTML documents. I think it can be used to collide ZIP files too.

Outside the world of complex file formats, little has changed. It’s still a bad idea to use SHA-1. It’s still possible for people to generate entirely new collisions like Google did, though this needs a lot of resources. It’s possible that someone with resources has already generated such a “universal-key collision” for some other file format¹ and will use it on you, but this was equally possible before Google published their attack.

This does not make it easier to collide with arbitrary hashes – if someone else has uploaded a document with a hash, and you trust them to not be playing any tricks, an attacker won’t be able to generate a colliding document for this without immense resources. The attack only works when the attacker has control over the initial document; e.g. in a bait-and-switch-like attack where the attacker uploads document A, you read and verify it and broadcast your trust in document A with hash SHA1(A), and then the attacker switches it with document B.

1. Google’s specific collision was designed to be a “universal key”, since A and B are designed to have the image-switching mechanism built into it. Some other collision may not be like this; it could just be a collision of two images (or whatever) with no such switching mechanism. It takes about the same effort to do either of these, however, so if you have a file format that can be exploited to create a switching mechanism, it would always make more sense to build one into any collision you look for.

# Mitigating Underhandedness: Clippy!

This may be part of a collaborative blog post series about underhanded Rust code. Or it may not. I invite you to write your own posts about underhanded code to make it so!

Last month we opened up The Underhanded Rust competition. This contest is about writing seemingly-innocuous malicious code; code that is deliberately written to do some harm, but will pass a typical code review.

It is inspired by the Underhanded C contest. Most of the underhanded C submissions have to do with hidden buffer overflows, pointer arithmetic fails, or misuse of C macros; and these problems largely don’t occur in Rust programs. However, the ability to layer abstractions on each other does open up new avenues to introducing underhandedness by relying on sufficiently confusing abstraction sandwiches. There are probably other interesting avenues. Overall, I’m pretty excited to see what kind of underhandedness folks come up with!

Of course, underhandedness is not just about fun and games; we should be hardening our code against this kind of thing. Even if you trust your fellow programmers. Even if you are the sole programmer and you trust yourself. After all, you can’t spell Trust without Rust; and Rust is indeed about trust. Specifically, Rust is about trusting nobody. Not even yourself.

Rust protects you from your own mistakes when it comes to memory management. But we should be worried about other kinds of mistakes, too. Many of the techniques used in underhanded programming involve sleights of hand that could just as well be introduced in the code by accident, causing bugs. Not memory safety bugs (in Rust), but still, bugs. The existence of these sleights of hand is great for that very common situation when you are feeling severely under-plushied and must win a competition to replenish your supply, but we really don’t want these creeping into real-world code, either by accident or intentionally.

Allow me to take a moment out of your busy underhanded-submission-writing schedules to talk to you about our Lord and Savior Clippy.

Clippy is for those of you who have become desensitized to the constant whining of the Rust compiler and need a higher dosage of whininess to be kept on their toes. Clippy is for those perfectionists amongst you who want to know every minute thing wrong with their code so that they can fix it. But really, Clippy is for everyone.

Clippy is simply a large repository of lints. As of the time of writing this post, there are 183 lints in it, though not all of them are enabled by default. These use the regular Rust lint system so you can pick and choose the ones you need via #[allow(lint_name)] and #[warn(lint_name)]. These lints cover a wide range of functions:

• Improving readability of the code (though rustfmt is the main tool you should use for this)
• Helping make the code more compact by reducing unnecessary things (my absolute favorite is needless_lifetimes)
• Helping make the code more idiomatic
• Making sure you don’t do things that you’re not supposed to
• Catching mistakes and cases where the code may not work as expected

The last two really are the ones which help with underhanded code. Just to give an example, we have lints like:

• cmp_nan, which disallows things like x == NaN
• clone_double_ref, which disallows calling .clone() on double-references (&&T), since that’s a straightforward copy and you probably meant to do something like (*x).clone()
• for_loop_over_option: Option<T> is iterable, and while this is useful when composing iterators, directly iterating over an option is usually an indication of a mistake.
• match_same_arms, which checks for identical match arm bodies (strong indication of a typo)
• suspicious_assignment_formatting, which checks for possible typos with the += and -= operators
• unused_io_amount, which ensures that you don’t forget that some I/O APIs may not write all bytes in the span of a single call

These catch many of the gotchas that might crop up in Rust code. In fact, I based my solution of an older, more informal Underhanded Rust contest on one of these.
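To make one of these concrete, here is a minimal sketch of the mistake unused_io_amount looks for. The `Trickle` writer and the two `send_*` functions are hypothetical names invented for this illustration. The `clippy::` prefix in the attribute is how current clippy releases spell lint names; the lint is allowed on the buggy function only so that it can be kept around for comparison.

```rust
use std::io::{self, Write};

/// A writer that accepts at most 3 bytes per call, the way a socket or
/// pipe may accept only part of a buffer. Hypothetical, for illustration.
struct Trickle {
    received: Vec<u8>,
}

impl Write for Trickle {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        let n = buf.len().min(3);
        self.received.extend_from_slice(&buf[..n]);
        Ok(n)
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

/// Buggy: `write` reports how many bytes it took, and that count is
/// silently dropped here. This is the pattern the lint complains about.
#[allow(clippy::unused_io_amount)]
fn send_careless(out: &mut Trickle, msg: &[u8]) -> io::Result<()> {
    out.write(msg)?;
    Ok(())
}

/// Fixed: `write_all` keeps calling `write` until everything is sent.
fn send_careful(out: &mut Trickle, msg: &[u8]) -> io::Result<()> {
    out.write_all(msg)
}

fn main() -> io::Result<()> {
    let mut careless = Trickle { received: Vec::new() };
    send_careless(&mut careless, b"hello")?;
    println!("careless: {:?}", String::from_utf8_lossy(&careless.received));

    let mut careful = Trickle { received: Vec::new() };
    send_careful(&mut careful, b"hello")?;
    println!("careful:  {:?}", String::from_utf8_lossy(&careful.received));
    Ok(())
}
```

Both versions compile without complaint from rustc and look equally innocent in review, which is exactly why this kind of thing is worth linting for.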

## Usage

Clippy is still nightly-only. We hook straight into the compiler’s guts to obtain the information we need, and like most internal compiler APIs, this is completely unstable. This does mean that you usually need a latest or near-latest nightly for clippy to work, and there will be times when it won’t compile while we’re working to update it.

There is a plan to ship clippy as an optional component of rustc releases, which will fix all of these issues (yay!).

But, for now, you can use clippy via:

If you’re going to be making it part of the development procedures of a crate you maintain, you can also make it an optional dependency.

If you’re on Windows, there’s currently a rustup/cargo bug where you may have to add the rustc libs path in your PATH for cargo clippy to work.

There’s an experimental project called rustfix which can automatically apply suggestions from clippy and rustc to your code. This may help in clippy-izing a large codebase, but it may also eat your code and/or laundry, so beware.

## Contributing

There’s a lot of work that can be done on clippy. A hundred and eighty lints is just a start; there are hundreds more lint ideas filed on the issue tracker. We’re willing to mentor anyone who wants to get involved, and have specially tagged “easy” issues for folks new to compiler internals. In general, contributing to clippy is a great way to gain an understanding of compiler internals if you want to contribute to the compiler itself.

If you don’t want to write code for clippy, you can also run it on random crates, open pull requests with fixes, and file bugs on clippy for any false positives that appear.

There are more tips about contributing in our CONTRIBUTING.md.

I hope this helps reduce mistakes and underhandedness in your code!

..unless you’re writing code for the Underhanded Rust competition. In that case, underhand away!

# Breaking Our Latin-1 Assumptions

So in my previous post I explored a specific (wrong) assumption that programmers tend to make about the nature of code points and text.

I was asked multiple times about other assumptions we tend to make. There are a lot. Latin-based scripts are comparatively simple, and most programmers spend their time dealing with Latin text, so the complexities of other scripts never come up.

I thought it would be useful to share my personal list of scripts that break our Latin-1 assumptions. This is a list I mentally check against whenever I am attempting to reason about text. I check if I’m making any assumptions that break in these scripts. Most of these concepts are independent of Unicode; so any program would have to deal with this regardless of encoding.

I again recommend going through eevee’s post, since it covers many related issues. Awesome-Unicode also has a lot of random tidbits about Unicode.

Anyway, here’s the list. Note that a lot of the concepts here exist in scripts other than the ones listed, these are just the scripts I use for comparing.

## Arabic / Hebrew

Both Arabic and Hebrew are RTL scripts; they read right-to-left. This may even affect how a page is laid out, see the Hebrew Wikipedia.

They both have a concept of letters changing how they look depending on where they are in the word. Hebrew has the “sofit” letters, which use separate code points. For example, Kaf (כ) should be typed as ך at the end of a word. Greek has something similar with the sigma.

In Arabic, the letters can have up to four different forms, depending on whether they start a word, end a word, are inside a word, or are used by themselves. These forms can look very different. They don’t use separate code points for this, however. You can see a list of these forms here.

Arabic can get pretty tricky – the characters have to join up; and in cursive fonts (like those for Nastaliq), you get a lot of complex ligatures.

As I mentioned in the last post, U+FDFD (﷽), a ligature representing the Basmala, is also a character that breaks a lot of assumptions.

## Indic scripts

Indic scripts are abugidas, where you have consonants with vowel modifiers. For example, क is “kə”, where the upside down “e” is a schwa, something like an “uh” vowel sound. You can change the vowel by adding a diacritic (e.g. ा), getting things like का (“kaa”), को (“koh”), and कू (“koo”).

You can also mash together consonants to create consonant clusters. The “virama” is a vowel-killer symbol that removes the inherent schwa vowel. So, क + ् becomes क्. This sound itself is unpronounceable since क is a stop consonant (vowel-killed consonants can be pronounced for nasal and some other consonants though), but you can combine it with another consonant, as क् + र (“rə”), to get क्र (“krə”). Consonants can be strung up infinitely, and you can stick one or more vowel diacritics after that. Usually, you won’t see more than two consonants in a cluster, but larger ones are not uncommon in Sanskrit (or when writing down some onomatopoeia). They may not get rendered as single glyphs, depending on the font.

One thing that crops up is that there’s no unambiguous concept of a letter here. There is a concept of an “akshara”, which basically includes the vowel diacritics, and depending on who you talk to may also include consonant clusters. Often, whether a cluster is considered an akshara depends on whether it’s drawn with an explicit virama or forms a single glyph.

In general the nature of the virama as a two-way combining character in Unicode is pretty new.

## Hangul

Korean does its own fun thing when it comes to conjoining characters. Hangul has a concept of a “syllable block”, which is basically a letter. It’s made up of a leading consonant, medial vowel, and an optional tail consonant. 각 is an example of such a syllable block, and it can be typed as ᄀ + ᅡ + ᆨ. It can also be typed as 각, which is a “precomposed form” (and a single code point).

These characters are examples of combining characters with very specific combining rules. Unlike accents or other diacritics, these combining characters will combine with the surrounding characters only when the surrounding characters form an L-V-T or L-V syllable block.
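The relationship between the conjoining jamo and the precomposed form is purely arithmetic, which a short sketch can show. The constants below are the ones the Unicode Standard gives for Hangul syllable composition; the function name `compose_hangul` is just a name picked for this example, and it only handles the simple L-V and L-V-T cases.

```rust
/// Compose a leading consonant (L), a vowel (V), and an optional trailing
/// consonant (T) into a single precomposed Hangul syllable code point.
/// Returns None if any input is outside the modern conjoining jamo ranges.
fn compose_hangul(l: char, v: char, t: Option<char>) -> Option<char> {
    const S_BASE: u32 = 0xAC00; // first precomposed syllable, 가
    const L_BASE: u32 = 0x1100; // first leading consonant, ᄀ
    const V_BASE: u32 = 0x1161; // first vowel, ᅡ
    const T_BASE: u32 = 0x11A7; // one before the first trailing consonant
    const L_COUNT: u32 = 19;
    const V_COUNT: u32 = 21;
    const T_COUNT: u32 = 28; // includes the "no trailing consonant" slot

    let l_index = (l as u32).checked_sub(L_BASE).filter(|&i| i < L_COUNT)?;
    let v_index = (v as u32).checked_sub(V_BASE).filter(|&i| i < V_COUNT)?;
    let t_index = match t {
        Some(t) => (t as u32)
            .checked_sub(T_BASE)
            .filter(|i| (1..T_COUNT).contains(i))?,
        None => 0,
    };

    char::from_u32(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)
}

fn main() {
    // ᄀ + ᅡ + ᆨ, written as escapes so the individual jamo stay visible.
    let syllable = compose_hangul('\u{1100}', '\u{1161}', Some('\u{11A8}'));
    assert_eq!(syllable, Some('\u{AC01}')); // 각
    println!("{:?}", syllable);
}
```

A normalization library does this (and the reverse) for you as part of NFC/NFD; the sketch is only meant to show that the three-code-point and one-code-point spellings are two views of the same syllable block.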

As I mentioned in my previous post, apparently syllable blocks with more (adjacent) Ls, Vs, and Ts are also valid and used in Old Korean, so the grapheme segmentation algorithm in Unicode considers “ᄀᄀᄀ각ᆨᆨ” to be a single grapheme (it explicitly mentions this). I’m not aware of any fonts which render these as a single syllable block, or if that’s even a valid thing to do.

## Han scripts

So Chinese (Hanzi), Japanese (Kanji¹), Korean (Hanja²), and Vietnamese (Hán tự, along with Chữ Nôm³) all share glyphs, collectively called “Han characters” (or CJK characters⁴). These languages at some point in their history borrowed the Chinese writing system, and made their own changes to it to tailor it to their needs.

Now, the Han characters are ideographs. This is not a phonetic script; individual characters represent words. The word/idea they represent is not always consistent across languages. The pronunciation is usually different too. Sometimes, the glyph is drawn slightly differently based on the language used. There are around 80,000 Han ideographs in Unicode right now.

The concept of ideographs itself breaks some of our Latin-1 assumptions. For example, how do you define Levenshtein edit distance for text using Han ideographs? The straight answer is that you can’t, though if you step back and decide why you need edit distance you might be able to find a workaround. For example, if you need it to detect typos, the user’s input method may help. If it’s based on pinyin or bopomofo, you might be able to reverse-convert to the phonetic script, apply edit distance in that space, and convert back. Or not. I only maintain an idle curiosity in these scripts and don’t actually use them, so I’m not sure how well this would work.

The concept of halfwidth character is a quirk that breaks some assumptions.

In the space of Unicode in particular, all of these scripts are represented by a single set of ideographs. This is known as “Han unification”. This is a pretty controversial issue, but the end result is that rendering may sometimes be dependent on the language of the text, which e.g. in HTML you set with a <span lang=whatever>. The wiki page has some examples of encoding-dependent characters.

Unicode also has a concept of variation selector, which is a code point that can be used to select between variations for a code point that has multiple ways of being drawn. These do get used in Han scripts.

While this doesn’t affect rendering, Unicode, as a system for describing text, also has a concept of interlinear annotation characters. These are used to represent furigana / ruby. Fonts don’t render this, but it’s useful if you want to represent text that uses ruby. Similarly, there are ideographic description sequences which can be used to “build up” glyphs from smaller ones when the glyph can’t be encoded in Unicode. These, too, are not to be rendered, but can be used when you want to describe the existence of a character like biáng. These are not things a programmer needs to worry about; I just find them interesting and couldn’t resist mentioning them :)

Japanese speakers haven’t completely moved to Unicode; there are a lot of things out there using Shift-JIS, and IIRC there are valid reasons for that (perhaps Han unification?). This is another thing you may have to consider.

Finally, these scripts are often written vertically, top-down. Mongolian, while not being a Han script, is written vertically sideways, which is pretty unique. The CSS writing modes spec introduces various concepts related to this, though that’s mostly in the context of the Web.

## Thai / Khmer / Burmese / Lao

These scripts don’t use spaces to split words. Instead, they have rules for what kinds of sequences of characters start and end a word. This can be determined programmatically; however, IIRC the Unicode spec does not attempt to deal with this. There are libraries you can use here instead.

## Latin scripts themselves!

Turkish uses a Latin-based script. But it has a quirk: the uppercase of “i” is a dotted “İ”, and the lowercase of “I” is “ı”. If doing case-based operations, try to use a Unicode-aware library, and try to provide the locale if possible.

Also, not all code points have a single-codepoint uppercase version. The eszett (ß) capitalizes to “SS”. There’s also the “capital” eszett ẞ, but its usage seems to vary and I’m not exactly sure how it interacts here.
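Both quirks are easy to see with Rust's standard library, whose case mapping is Unicode-aware but deliberately locale-independent. The `turkish_uppercase` function below is a hypothetical, minimal stand-in for what a locale-aware library would do; it is a sketch of the idea rather than a substitute for such a library.

```rust
/// Uppercase a string with the Turkish dotted/dotless i rule applied.
/// Only the two special cases are tailored; everything else falls back
/// to the default Unicode mapping.
fn turkish_uppercase(s: &str) -> String {
    let mut out = String::new();
    for c in s.chars() {
        match c {
            'i' => out.push('\u{130}'),         // i -> İ (dotted capital)
            '\u{131}' => out.push('I'),         // ı -> I (dotless)
            _ => out.extend(c.to_uppercase()),  // may yield several chars
        }
    }
    out
}

fn main() {
    // One code point in, two code points out.
    assert_eq!("ß".to_uppercase(), "SS");
    assert_eq!('ß'.to_uppercase().count(), 2);

    // The default mapping knows nothing about Turkish.
    let word = "di\u{15F}"; // diş
    assert_eq!(word.to_uppercase(), "DI\u{15E}"); // DIŞ
    assert_eq!(turkish_uppercase(word), "D\u{130}\u{15E}"); // DİŞ

    println!("{} / {}", word.to_uppercase(), turkish_uppercase(word));
}
```

The takeaway is that "uppercase this string" is not a per-code-point, language-neutral operation: the output can be longer than the input, and the right answer depends on the locale.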

While Latin-1 uses precomposed characters, Unicode also introduces ways to specify the same characters via combining diacritics. Treating these the same involves using the normalization algorithms (NFC/NFD).

## Emoji

Well, not a script⁵. But emoji is weird enough that it breaks many of our assumptions. The scripts above cover most of these, but it’s sometimes easier to think of them in the context of emoji.

The main thing with emoji is that you can use a zero-width-joiner character to glue emoji together.

For example, the family emoji 👩‍👩‍👧‍👦 (may not render for you) is made by using the woman/man/girl/boy emoji and gluing them together with ZWJs. You can see its decomposition in uniview.

There are more sequences like this, which you can see in the emoji-zwj-sequences file. For example, MAN + ZWJ + COOK will give a male cook emoji (font support is sketchy). Similarly, SWIMMER + ZWJ + FEMALE SIGN is a female swimmer. You have both sequences of the form “gendered person + zwj + thing”, and “emoji containing human + zwj + gender”, IIRC due to legacy issues⁶.
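As a small sketch of how these sequences are put together, the code below glues individual emoji code points with U+200D ZERO WIDTH JOINER and takes them apart again. The helper names `join_with_zwj` and `split_on_zwj` are made up for this example, and the code points are written as escapes so that the structure survives any font or editor.

```rust
/// U+200D ZERO WIDTH JOINER.
const ZWJ: char = '\u{200D}';

/// Glue emoji code points together with ZWJs between them.
fn join_with_zwj(parts: &[char]) -> String {
    let mut out = String::new();
    for (i, &part) in parts.iter().enumerate() {
        if i > 0 {
            out.push(ZWJ);
        }
        out.push(part);
    }
    out
}

/// Split a ZWJ sequence back into its component pieces.
fn split_on_zwj(s: &str) -> Vec<String> {
    s.split(ZWJ).map(str::to_owned).collect()
}

fn main() {
    // WOMAN, WOMAN, GIRL, BOY
    let family = join_with_zwj(&['\u{1F469}', '\u{1F469}', '\u{1F467}', '\u{1F466}']);
    assert_eq!(family.chars().count(), 7); // 4 emoji + 3 joiners
    assert_eq!(family.len(), 25); // 4 * 4 bytes + 3 * 3 bytes in UTF-8

    // MAN + ZWJ + COOKING
    let cook = join_with_zwj(&['\u{1F468}', '\u{1F373}']);
    assert_eq!(split_on_zwj(&cook), vec!["\u{1F468}", "\u{1F373}"]);

    println!("{} {}", family, cook);
}
```

Whether either string shows up as one picture or several depends entirely on the font; the text itself is seven (or three) code points either way.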

There are also modifier characters that let you change the skin tone of an emoji that contains a human (or human body part, like the hand-gesture emojis) in it.

Finally, the flag emoji are pretty special snowflakes. For example, 🇪🇸 is the Spanish flag. It’s made up of two regional indicator characters for “E” and “S”.

Unicode didn’t want to deal with adding new flags each time a new country or territory pops up. Nor did they want to get into the tricky business of determining what a country is, for example when dealing with disputed territories. So instead, they just defined these regional indicator symbols. Fonts are supposed to take pairs of RI symbols⁷ and map the country code to a flag. This mapping is up to them, so it’s totally valid for a font to render a regional indicator pair “E” + “S” as something other than the flag of Spain. On some Chinese systems, for example, the flag for Taiwan (🇹🇼) may not render.
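The mapping from a two-letter code to a pair of regional indicators is a fixed offset from the letters A to Z, which the following sketch spells out. `flag_sequence` is a name invented for this example; note that it happily produces a pair for any two uppercase letters, whether or not anyone has a flag for them.

```rust
/// Turn a two-letter uppercase ASCII code such as "ES" into the
/// corresponding pair of regional indicator symbols.
fn flag_sequence(code: &str) -> Option<String> {
    // U+1F1E6 REGIONAL INDICATOR SYMBOL LETTER A; B to Z follow in order.
    const RI_A: u32 = 0x1F1E6;

    if code.len() != 2 || !code.bytes().all(|b| b.is_ascii_uppercase()) {
        return None;
    }
    code.bytes()
        .map(|b| char::from_u32(RI_A + u32::from(b - b'A')))
        .collect()
}

fn main() {
    let spain = flag_sequence("ES").unwrap();
    assert_eq!(spain, "\u{1F1EA}\u{1F1F8}");
    assert_eq!(spain.chars().count(), 2); // two code points, one flag

    // Nothing stops us from building a pair no font knows about.
    let unknown = flag_sequence("ZZ").unwrap();
    println!("{} {}", spain, unknown);
}
```

There are 26 × 26 = 676 possible pairs, and what each one looks like on screen is the font's decision, not the text's.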

I highly recommend comparing against this relatively small list of scripts the next time you are writing code that does heavy manipulation of user-provided strings.

1. Supplemented (but not replaced) by the Hiragana and Katakana phonetic scripts. In widespread use.

2. Replaced by Hangul in modern usage

3. Replaced by chữ quốc ngữ in modern usage, which is based on the Latin alphabet

4. “CJK” (Chinese-Japanese-Korean) is probably more accurate here, though it probably should include “V” for Vietnamese too. Not all of these ideographs come from Han; the other scripts invented some of their own. See: Kokuji, Gukja, Chữ Nôm.

5. Back in my day we painstakingly typed actual real words on numeric phone keypads, while trudging to 🏫 in three feet of ❄️️, and it was uphill both ways, and we weren’t even allowed 📱s in 🏫. Get off my lawn!

6. We previously had individual code points for professions and stuff and they decided to switch over to using existing object emoji with combiners instead of inventing new profession emoji all the time

7. 676 countries should be enough for anybody

# Let’s Stop Ascribing Meaning to Code Points

Update: This post got a sequel, Breaking Our Latin-1 Assumptions.

I’ve seen misconceptions about Unicode crop up regularly in posts discussing it. One very common misconception I’ve seen is that code points have cross-language intrinsic meaning.

It usually comes up when people are comparing UTF8 and UTF32. Folks start implying that code points mean something, and that O(1) indexing or slicing at code point boundaries is a useful operation. I’ve also seen this assumption manifest itself in actual programs which make incorrect assumptions about the nature of code points and mess things up when fed non-Latin text.

If you like reading about unicode, you might also want to go through Eevee’s article on the dark corners of unicode. Great read!

## Encodings

So, anyway, we have some popular encodings for Unicode. UTF8 encodes 7-bit code points as a single byte, 11-bit code points as two bytes, 16-bit code points as 3 bytes, and 21-bit code points as four bytes. UTF-16 encodes the first three in two bytes, and the last one as four bytes (logically, a pair of two-byte code units). UTF-32 encodes all code points as 4-byte code units. UTF-16 is mostly a “worst of both worlds” compromise at this point, and the main programming language I can think of that uses it (and exposes it in this form) is Javascript, and that too in a broken way.

The nice thing about UTF-8 is that it saves space. Of course, that is subjective and dependent on the script you use most commonly; for example, my first name is 12 bytes in UTF-8 but only 4 in ISCII (or a hypothetical Unicode-based encoding that swapped the Devanagari Unicode block with the ASCII block). It also uses more space than the very non-hypothetical UTF-16 encoding if you tend to use code points in the U+0800 - U+FFFF range. It never uses more space than UTF-32, however.
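These size trade-offs can be checked directly with the standard library. In the sketch below, `encoded_sizes` and `string_sizes` are helper names made up for the example; they report sizes in bytes for UTF-8, UTF-16 and UTF-32 respectively.

```rust
/// Bytes needed for one code point in (UTF-8, UTF-16, UTF-32).
fn encoded_sizes(c: char) -> (usize, usize, usize) {
    (c.len_utf8(), c.len_utf16() * 2, 4)
}

/// Bytes needed for a whole string in (UTF-8, UTF-16, UTF-32).
fn string_sizes(s: &str) -> (usize, usize, usize) {
    (
        s.len(),                        // Rust strings are UTF-8
        s.encode_utf16().count() * 2,   // 2 bytes per UTF-16 code unit
        s.chars().count() * 4,          // 4 bytes per code point
    )
}

fn main() {
    assert_eq!(encoded_sizes('a'), (1, 2, 4)); // 7-bit code point
    assert_eq!(encoded_sizes('\u{E9}'), (2, 2, 4)); // é, fits in 11 bits
    assert_eq!(encoded_sizes('\u{92E}'), (3, 2, 4)); // म, fits in 16 bits
    assert_eq!(encoded_sizes('\u{1F1FA}'), (4, 4, 4)); // 🇺, needs 21 bits

    // The four Devanagari code points of the name mentioned above.
    let name = "\u{92E}\u{928}\u{940}\u{937}";
    assert_eq!(string_sizes(name), (12, 8, 16));
    println!("{:?}", string_sizes(name));
}
```

For text in the U+0800 to U+FFFF range, UTF-16 is the most compact of the three, which is the caveat made above.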

A commonly touted disadvantage of UTF-8 is that string indexing is O(n). Because code points take up a variable number of bytes, you won’t know where the 5th codepoint is until you scan the string and look for it. UTF-32 doesn’t have this problem; it’s always 4 * index bytes away.

The problem here is that indexing by code point shouldn’t be an operation you ever need!

## Indexing by code point

The main time you want to be able to index by code point is if you’re implementing algorithms defined in the unicode spec that operate on unicode strings (casefolding, segmentation, NFD/NFC). Most if not all of these algorithms operate on whole strings, so implementing them as an iteration pass is usually necessary anyway, so you don’t lose anything if you can’t do arbitrary code point indexing.

But for application logic, dealing with code points doesn’t really make sense. This is because code points have no intrinsic meaning. They are not “characters”. I’m using scare quotes here because a “character” isn’t a well-defined concept either, but we’ll get to that later.

For example, “é” can be written as two code points (e + ◌́), where one of them is a combining accent. My name, “मनीष”, visually looks like three “characters”, but is four code points. The “नी” is made up of न + ी. My last name contains a “character” made up of three code points (and multiple two-code-point “characters”). The flag emoji “🇺🇸” is also made of two code points, 🇺 + 🇸.
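Listing the code points of these strings makes the mismatch visible. The `code_points` helper below is a name chosen for this sketch, and the strings are written with escapes so that it is unambiguous which code points are involved.

```rust
/// List the code points of a string in U+XXXX notation.
fn code_points(s: &str) -> Vec<String> {
    s.chars().map(|c| format!("U+{:04X}", c as u32)).collect()
}

fn main() {
    // "é" spelled as a base letter plus a combining acute accent.
    assert_eq!(code_points("e\u{301}"), vec!["U+0065", "U+0301"]);

    // The same visible "character" also exists precomposed, and the
    // two spellings are different strings unless you normalize them.
    assert_eq!(code_points("\u{E9}"), vec!["U+00E9"]);
    assert_ne!("e\u{301}", "\u{E9}");

    // Three visible "characters", four code points.
    let name = "\u{92E}\u{928}\u{940}\u{937}";
    assert_eq!(
        code_points(name),
        vec!["U+092E", "U+0928", "U+0940", "U+0937"]
    );

    // One flag, two regional indicator code points.
    assert_eq!(code_points("\u{1F1FA}\u{1F1F8}"), vec!["U+1F1FA", "U+1F1F8"]);
}
```

None of these counts matches what a reader would call the number of "characters", which is the point: the code point count is an artifact of the encoding of the script, not a property of the text the user sees.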

One false assumption that’s often made is that code points are a single column wide. They’re not. They sometimes bunch up to form characters that fit in single “columns”. This is often dependent on the font, and if your application relies on this, you should be querying the font. There are even code points like U+FDFD (﷽) which are often rendered multiple columns wide. In fact, in my monospace font in my text editor, that character is rendered almost 12 columns wide. Yes, “almost”, subsequent characters get offset a tiny bit. I don’t know why.

Another false assumption is that editing actions (selection, backspace, cut, paste) operate on code points. In both Chrome and Firefox, selection will often include multiple code points. All the multi-code-point examples I gave above fall into this category. An interesting testcase for this is the string “ᄀᄀᄀ각ᆨᆨ”, which will rarely if ever render as a single “character” but will be considered as one for the purposes of selection, pretty much universally. I’ll get to why this is later.

Backspace can gobble multiple code points at once too, but the heuristics are different. The reason behind this is that backspace needs to mirror the act of typing, and while typing sometimes constructs multi-codepoint characters, backspace decomposes it piece by piece. In cases where a multi-codepoint “character” can be logically decomposed (e.g. “letter + accent”), backspace will decompose it, by removing the accent or whatever. But some multi-codepoint characters are not “constructions” of general concepts that should be exposed to the user. For example, a user should never need to know that the “🇺🇸” flag emoji is made of 🇺 + 🇸, and hitting backspace on it should delete both codepoints. Similarly, variation selectors and other such code points shouldn’t be treated as their own unit when backspacing.

On my Mac most builtin apps (which I presume use the OSX UI toolkits) seem to use the same heuristics that Firefox/Chrome use for selection for both selection and backspace. While the treatment of code points in editing contexts is not consistent, it seems like applications consistently do not consider code points as “editing units”.

Now, it is true that you often need some way to index a string. For example, if you have a large document and need to represent a slice of it. This could be a user-selection, or something delimited by markup. Basically, you’ve already gone through the document and have a section you want to be able to refer to later without copying it out.

However, you don’t need code point indexing here, byte indexing works fine! UTF8 is designed so that you can check if you’re on a code point boundary even if you just byte-index directly. It does this by restricting the kinds of bytes allowed. One-byte code points never have the high bit set (ASCII). All other code points have the high bit set in each byte. The first byte of multibyte codepoints always starts with a sequence that specifies the number of bytes in the codepoint, and such sequences can’t be found in the lower-order bytes of any multibyte codepoint. You can see this visually in the table here. The upshot of all this is that you just need to check the current byte if you want to be sure you’re on a codepoint boundary, and if you receive an arbitrarily byte-sliced string, you will not mistake it for something else. It’s not possible to have a valid code point be a subslice of another, or form a valid code point by subslicing a sequence of two different ones by cutting each in half.

So all you need to do is keep track of the byte indices, and use them for slicing it later.
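To make the boundary check concrete, here is a minimal sketch in Rust. The helper name `is_boundary` is made up for this illustration; the standard library already offers `str::is_char_boundary`, and the sketch only shows that looking at a single byte is enough.

```rust
/// Returns true if `index` falls on a code point boundary of `bytes`,
/// judged only by the byte at that position.
/// UTF-8 continuation bytes always have the bit pattern 0b10xxxxxx.
fn is_boundary(bytes: &[u8], index: usize) -> bool {
    match bytes.get(index) {
        // One past the last byte counts as a boundary; anything further does not.
        None => index == bytes.len(),
        Some(&b) => (b & 0b1100_0000) != 0b1000_0000,
    }
}

fn main() {
    // 'a' (1 byte), U+00E9 (2 bytes), then two 4-byte regional indicators.
    let text = "a\u{e9}\u{1F1FA}\u{1F1F8}";
    for i in 0..=text.len() {
        // The hand-rolled check agrees with the standard library's answer.
        assert_eq!(is_boundary(text.as_bytes(), i), text.is_char_boundary(i));
    }
    // Byte indices recorded earlier can be used to slice later.
    let (start, end) = (1, 3);
    assert_eq!(&text[start..end], "\u{e9}");
}
```

Storing `(start, end)` as byte offsets keeps later slicing O(1), which is the point of the paragraph above.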

All in all, it’s important to always remember that “code point” doesn’t have intrinsic meaning. If you need to do a segmentation operation on a string, find out what exactly you’re looking for, and what concept maps closest to that. It’s rare that “code point” is the concept you’re looking for. In most cases, what you’re looking for instead is “grapheme cluster”.

## Grapheme clusters

The concept of a “character” is a nebulous one. Is “각” a single character, or three? How about “नी”? Or “நி”? Or the “👨‍❤️‍👨” emoji1? Or the “👨‍👨‍👧‍👧” family emoji2? Different scripts have different concepts which may not clearly map to the Latin notion of “letter” or our programmery notion of “character”.

Unicode itself gives the term “character” multiple incompatible meanings, and as far as I know doesn’t use the term in any normative text.

Often, you need to deal with what is actually displayed to the user. A lot of terminal emulators do this wrong, and end up messing up cursor placement. I used to use irssi-xmpp to keep my Facebook and Gchat conversations in my IRC client, but I eventually stopped as I was increasingly chatting in Marathi or Hindi and I prefer using the actual script over romanizing3, and it would just break my terminal4. Also, they got rid of the XMPP bridge but I’d already cut down on it by then.

So sometimes, you need an API querying what the font is doing. Generally, when talking about the actual rendered image, the term “glyph” or “glyph image” is used.

However, you can’t always query the font. Text itself exists independent of rendering, and sometimes you need a rendering-agnostic way of segmenting it into “characters”.

For this, Unicode has a concept of “grapheme cluster”. There’s also “extended grapheme cluster” (EGC), which is basically an updated version of the concept. In this post, whenever I use the term “grapheme cluster”, I am talking about EGCs.

The term is defined and explored in UAX #29. It starts by pinning down the still-nebulous concept of “user-perceived character” (“a basic unit of a writing system for a language”), and then declares the concept of a “grapheme cluster” to be an approximation to this notion that we can determine programmatically.

A rough definition of grapheme cluster is a “horizontally segmentable unit of text”.

The spec goes into detail as to the exact algorithm that segments text at grapheme cluster boundaries. All of the examples I gave in the first paragraph of this section are single grapheme clusters. So is “ᄀᄀᄀ각ᆨᆨ” (or “ᄀᄀᄀ각ᆨᆨ”), which apparently is considered a single syllable block in Hangul even though it is not of the typical form of leading consonant + vowel + optional tail consonant, but is not something you’d see in modern Korean. The spec explicitly talks of this case so it seems to be on purpose. I like this string because nothing I know of renders it as a single glyph; so you can easily use it to tell if a particular segmentation- aware operation uses grapheme clusters as segmentation. If you try and select it, in most browsers you will be forced to select the whole thing, but backspace will delete the jamos one by one. For the second string, backspace will decompose the core syllable block too (in the first string the syllable block 각 is “precomposed” as a single code point, in the second one I built it using combining jamos).

Basically, unless you have very specific requirements or are able to query the font, use an API that segments strings into grapheme clusters wherever you need to deal with the notion of “character”.

## Language defaults

Now, a lot of languages use Unicode-aware encodings by default. This is great. It gets rid of the misconception that characters are one byte long.

But it doesn’t get rid of the misconception that user-perceived characters are one code point long.
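A quick way to see the gap is to count code points for strings that a reader would call one character each. The sketch below uses only the Rust standard library; `code_points` is a helper name invented for the example, and the strings are written as escapes so there is no doubt about which code points they contain.

```rust
/// Number of code points (Rust `char`s) in a string.
fn code_points(s: &str) -> usize {
    s.chars().count()
}

fn main() {
    // Each sample is a single user-perceived character.
    let samples = [
        ("precomposed Hangul syllable", "\u{AC01}"),
        ("same syllable built from jamos", "\u{1100}\u{1161}\u{11A8}"),
        ("Devanagari consonant + vowel sign", "\u{928}\u{940}"),
        ("flag made of two regional indicators", "\u{1F1FA}\u{1F1F8}"),
    ];
    for (label, s) in samples {
        println!("{}: {} code points, {} bytes", label, code_points(s), s.len());
    }
}
```

None of the counts line up with "one character" except by accident, which is why neither bytes nor code points are a good default unit.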

There are only two languages I know of which handle this well: Swift and Perl 6. I don’t know much about Perl 6’s thing so I can’t really comment on it, but I am really happy with what Swift does:

In Swift, the Character type is an extended grapheme cluster. This does mean that a character itself is basically a string, since EGCs can be arbitrarily many code points long.

All the APIs by default deal with EGCs. The length of a string is the number of EGCs in it. They are indexed by EGC. Iteration yields EGCs. The default comparison algorithm uses Unicode canonical equivalence, which I think is kind of neat. Of course, APIs that work with code points are exposed too; you can iterate over the code points using .unicodeScalars.

The internal encoding itself is … weird (and as far as I can tell not publicly exposed), but as a higher level language I think it’s fine to do things like that.

I strongly feel that languages should be moving in this direction, having defaults involving grapheme clusters.

Rust, for example, gets a lot of things right – it has UTF-8 strings. It internally uses byte indices in slices. Explicit slicing usually uses byte indices too, and will panic if out of bounds. The non-O(1) methods are all explicit, since you will use an iterator to perform the operation (e.g. .chars().nth(5)). This encourages people to think about the cost, and it also encourages people to coalesce the cost with nearby iterations – if you are going to do multiple O(n) things, do them in a single iteration! Rust chars represent code points. .char_indices() is a useful string iteration method that bridges the gap between byte indexing and code points.
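As a rough illustration of those APIs, the sketch below walks a short string with `.chars()` and `.char_indices()` and then slices by byte offset. The helper `byte_offset_of_nth_char` is a made-up name, not something from the standard library.

```rust
/// Byte offset at which the `n`th code point starts, found by an explicit O(n) walk.
fn byte_offset_of_nth_char(s: &str, n: usize) -> Option<usize> {
    s.char_indices().nth(n).map(|(offset, _)| offset)
}

fn main() {
    let s = "h\u{e9}llo"; // U+00E9 takes two bytes in UTF-8

    // Code point access is an explicit iterator walk, never `s[1]`.
    assert_eq!(s.chars().nth(1), Some('\u{e9}'));

    // char_indices pairs every code point with its byte offset.
    let offsets: Vec<usize> = s.char_indices().map(|(i, _)| i).collect();
    assert_eq!(offsets, vec![0, 1, 3, 4, 5]);

    // Slicing takes byte ranges. `get` returns None where `&s[1..2]` would panic.
    assert_eq!(s.get(1..3), Some("\u{e9}"));
    assert_eq!(s.get(1..2), None);

    assert_eq!(byte_offset_of_nth_char(s, 2), Some(3));
}
```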

However, while the documentation does mention grapheme clusters, the stdlib is not aware of the concept of grapheme clusters at all. The default “fundamental” unit of the string in Rust is a code point, and the operations revolve around that. If you want grapheme clusters, you may use unicode-segmentation.

Now, Rust is a systems programming language and it just wouldn’t do to have expensive grapheme segmentation operations all over your string defaults. I’m very happy that the expensive O(n) operations are all only possible with explicit acknowledgement of the cost. So I do think that going the Swift route would be counterproductive for Rust. Not that it can anyway, due to backwards compatibility :)

But I would prefer if the grapheme segmentation methods were in the stdlib (they used to be). This is probably not something that will happen, though I should probably push for the unicode crates being moved into the nursery at least.

1. Emoji may not render as a single glyph depending on the font.

2. While writing this paragraph I discovered that wrapping text that contains lots of family emoji hangs Sublime. Neat.

3. Part of the reason here is that I just find romanization confusing. There are some standardized ways to romanize which don’t get used much. My friends and I romanize one way, different from the standardizations. My family members romanize things a completely different way and it’s a bit hard to read. Then again, romanization does hide the fact that my spelling in Hindi is atrocious :)

4. It’s possible to make work. You need a good terminal emulator, with the right settings, the right settings in your env vars, the right settings in irssi, and the right settings in screen. I think my current setup works well with non-ascii text but I’m not sure what I did to make it happen.

# Rust Tidbits: What Is a Lang Item?

Rust is not a simple language. As with any such language, it has many little tidbits of complexity that most folks aren’t aware of. Many of these tidbits are ones which may not practically matter much for everyday Rust programming, but are interesting to know. Others may be more useful. I’ve found that a lot of these aren’t documented anywhere (not that they always should be), and sometimes depend on knowledge of compiler internals or history. As a fan of programming trivia myself, I’ve decided to try writing about these things whenever I come across them. “Tribal Knowledge” shouldn’t be a thing in a programming community; and trivia is fun!

Previously in tidbits: Box is Special

Last time I talked about Box<T> and how it is a special snowflake. Corey asked that I write more about lang items, which are basically all of the special snowflakes in the stdlib.

So what is a lang item? Lang items are a way for the stdlib (and libcore) to define types, traits, functions, and other items which the compiler needs to know about.

For example, when you write x + y, the compiler will effectively desugar that into Add::add(x, y)[1]. How did it know what trait to call? Did it just insert a call to ::core::Add::add and hope the trait was defined there? This is what C++ does; the Itanium ABI spec expects functions of certain names to just exist, which the compiler is supposed to call in various cases. The __cxa_guard_* functions from C++’s deferred-initialization local statics (which I’ve explored in the past) are an example of this. You’ll find that the spec is full of similar __cxa functions. While the spec just expects certain types, e.g. std::type_traits (“Type properties” § 20.10.4.3), to be magic and exist in certain locations, the compilers seem to implement them using intrinsics like __is_trivial<T> which aren’t defined in C++ code at all. So C++ compilers have a mix of solutions here: they partly insert calls to known ABI functions, and they partly implement “special” types via intrinsics which are detected and magicked when the compiler comes across them.

However, this is not Rust’s solution. It does not care what the Add trait is named or where it is placed. Instead, it knew where the trait for addition was located because we told it. When you put #[lang = "add"] on a trait, the compiler knows to call YourTrait::add(x, y) when it encounters the addition operator. Of course, usually the compiler will already have been told about such a trait since libcore is usually the first library in the pipeline. If you want to actually use this, you need to replace libcore.

Huh? You can’t do that, can you?

It’s not a big secret that you can compile rust without the stdlib using #![no_std]. This is useful in cases when you are on an embedded system and can’t rely on an allocator existing. It’s also useful for writing your own alternate stdlib, though that’s not something folks do often. Of course, libstd itself uses #![no_std], because without it the compiler will happily inject an extern crate std while trying to compile libstd and the universe will implode.

What’s less known is that you can do the same thing with libcore, via #![no_core]. And, of course, libcore uses it to avoid the cyclic dependency. Unlike #![no_std], no_core is a nightly-only feature that we may never stabilize[2]. #![no_core] is something that’s basically only to be used if you are libcore (or you are an alternate Rust stdlib/core implementation trying to emulate it).

Still, it’s possible to write a working Rust binary in no_core mode:

If you run this, the program will exit with exit code 42.

Note that this already adds two lang items. Sized and Copy. It’s usually worth looking at the lang item in libcore and copying it over unless you want to make tweaks. Beware that tweaks may not always work; not only does the compiler expect the lang item to exist, it expects it to make sense. There are properties of the lang item that it assumes are true, and failure to provide an appropriate lang item may cause the compiler to assert without a useful error message. In this case I do have a tweak, since the original definition of Copy is pub trait Copy: Clone {}, but I know that this tweak will work.

Lang items are usually only required when you do an operation which needs them. There are 72 non-deprecated lang items and we only had to define three of them here. “start” is necessary to, well, start executables, and Copy/Sized are very crucial to how the compiler reasons about types and must exist.

But let’s try doing something that will trigger a lang item to be required:

Rust will immediately complain:

This is because Rust wants to enforce that types in statics (which can be accessed concurrently) are safe when accessed concurrently, i.e., they implement Sync. We haven’t defined Sync yet, so Rust doesn’t know how to enforce this restriction. The Sync trait is defined with the “sync” lang item, so we need to do:

Note that the trait doesn’t have to be called Sync here, any trait name would work. This definition is also a slight departure from the one in the stdlib, and in general you should include the auto trait impl (instead of specifically using unsafe impl Sync for u8 {}) since the compiler may assume it exists. Our code is small enough for this to not matter.

Alright, let’s try defining our own addition trait as before. First, let’s see what happens if we try to add a struct when addition isn’t defined:

We get an error:

It is interesting to note that here the compiler did refer to Add by its path. This is because the diagnostics in the compiler are free to assume that libcore exists. However, the actual error just noted that it doesn’t know how to add two Foos. But we can tell it how!

This will compile fine and the exit code of the program will be 42.

An interesting bit of behavior is what happens if we try to add two numbers. It will give us the same kind of error, even though the addition of concrete primitives doesn’t go through Add::add (Rust asks LLVM to generate an add instruction directly). However, any addition operation still checks if Add::add is implemented, even though it won’t get used in the case of a primitive. We can even verify this!

This will need to be compiled with -C opt-level=2, since numeric addition in debug mode panics on wrap and we haven’t defined the "panic" lang item to teach the compiler how to panic.

It will exit with 42, not 92, since while the Add implementation is required for this to type check, it doesn’t actually get used.

So what lang items are there, and why are they lang items? There’s a big list in the compiler. Let’s go through them:

The ImplItem ones (core) are used to mark implementations on primitive types. char has some methods, and someone has to say impl char to define them. But coherence only allows us to impl methods on types defined in our own crate, and char isn’t defined … in any crate, so how do we add methods to it? #[lang = "char"] provides an escape hatch; applying that to impl char will allow you to break the coherence rules and add methods, as is done in the standard library. Since lang items can only be defined once, only a single crate gets the honor of adding methods to char, so we don’t have any of the issues that arise from sidestepping coherence.

There are a bunch for the marker traits (core):

• Send is a lang item because you are allowed to use it in a + bound in a trait object (Box<SomeTrait+Send+Sync>), and the compiler caches it aggressively
• Sync is a lang item for the same reasons as Send, but also because the compiler needs to enforce its implementation on types used in statics
• Copy is fundamental to classifying values and reasoning about moves/etc, so it needs to be a lang item
• Sized is also fundamental to reasoning about which values may exist on the stack. It is also magically included as a bound on generic parameters unless excluded with ?Sized
• Unsize is implemented automatically on types using a specific set of rules (listed in the nomicon). Unlike Send and Sync, this mechanism for autoimplementation is tailored for the use case of Unsize and can’t be reused on user-defined marker traits.
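Here is a small sketch of how a few of these marker traits show up in ordinary stable Rust. The helpers `assert_sync` and `describe` are names invented for this example, and the trait object uses the current `dyn` spelling rather than the older `Box<SomeTrait+Send+Sync>` form.

```rust
use std::fmt::Debug;

/// Compiles only when `T` is `Sync`; has no runtime effect.
fn assert_sync<T: Sync>() {}

/// `?Sized` opts out of the `Sized` bound that generics get implicitly,
/// so `T` may be `str`, a slice, or a trait object.
fn describe<T: Debug + ?Sized>(value: &T) -> String {
    format!("{:?}", value)
}

// Types stored in a static must be Sync.
static ANSWER: u8 = 42;
// static BAD: std::cell::Cell<u8> = std::cell::Cell::new(0);
// ^ rejected by the compiler: `Cell<u8>` is not `Sync`.

fn main() {
    assert_sync::<u8>();

    // Send and Sync may appear as extra `+` bounds on a trait object.
    let boxed: Box<dyn Debug + Send + Sync> = Box::new(ANSWER);
    assert_eq!(describe(&*boxed), "42");

    // `str` is unsized; this call only works because of `?Sized`.
    assert_eq!(describe("hi"), "\"hi\"");
}
```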

Drop is a lang item (core) because the compiler needs to know which types have destructors, and how to call these destructors.

CoerceUnsized is a lang item (core) because the compiler is allowed to perform DST coercions (nomicon) when it is implemented.

All of the builtin operators (also Deref and PartialEq/PartialOrd, which are listed later in the file) (core) are lang items because the compiler needs to know what trait to require (and call) when it comes across such an operation.
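In everyday code you see this from the other side: you implement the trait from the standard library and the operator starts working. A minimal sketch, with a hypothetical `Meters` type, showing that the operator and the explicit trait call are the same operation:

```rust
use std::ops::Add;

#[derive(Debug, Clone, Copy, PartialEq)]
struct Meters(u32);

impl Add for Meters {
    type Output = Meters;

    fn add(self, other: Meters) -> Meters {
        Meters(self.0 + other.0)
    }
}

fn main() {
    let (a, b) = (Meters(40), Meters(2));
    // `a + b` is resolved through the trait marked as the "add" lang item.
    assert_eq!(a + b, Add::add(a, b));
    // The `==` hidden inside assert_eq! goes through PartialEq in the same way.
    assert_eq!(a + b, Meters(42));
}
```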

UnsafeCell is a lang item (core) because it has very special semantics; it prevents certain optimizations. Specifically, Rust is allowed to reorder reads/writes to &mut foo with the assumption that the local variable holding the reference is the only alias allowed to read from or write to the data, and it is allowed to reorder reads from &foo assuming that no other alias writes to it. We tell LLVM that these types are noalias. UnsafeCell<T> turns this optimization off, allowing writes to &UnsafeCell<T> references. This is used in the implementation of interior mutability types like Cell<T>, RefCell<T>, and Mutex<T>.
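To give a feel for how interior mutability types build on this, here is a stripped-down sketch of a `Cell`-like type. `TinyCell` is a made-up name and this is not the standard library's implementation, just the same idea in miniature.

```rust
use std::cell::UnsafeCell;

/// A minimal, hypothetical stand-in for `Cell<T>`.
struct TinyCell<T> {
    value: UnsafeCell<T>,
}

impl<T: Copy> TinyCell<T> {
    fn new(value: T) -> Self {
        TinyCell { value: UnsafeCell::new(value) }
    }

    fn get(&self) -> T {
        // SAFETY: TinyCell is not Sync (UnsafeCell is not), and it never hands
        // out references to its contents, so nothing can alias this read.
        unsafe { *self.value.get() }
    }

    fn set(&self, new: T) {
        // SAFETY: same reasoning as in `get`.
        unsafe { *self.value.get() = new }
    }
}

fn main() {
    let cell = TinyCell::new(1);
    let alias = &cell;
    // A write through a shared reference, which only UnsafeCell makes legal.
    alias.set(42);
    assert_eq!(cell.get(), 42);
}
```

Without the `UnsafeCell` wrapper, writing through a pointer derived from `&self` would be undefined behavior, since the compiler is allowed to assume data behind a shared reference does not change.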

The Fn traits (core) are used in dispatching function calls, and can be specified with special syntax sugar, so they need to be lang items. They also get autoimplemented on closures.
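The sugar and the automatic implementations look like this in practice. This is only an illustrative sketch; the three helper functions are invented for the example.

```rust
// `Fn(i32) -> i32` is the special sugar for a bound on the Fn trait.
fn call_twice<F: Fn(i32) -> i32>(f: F, x: i32) -> i32 {
    f(f(x))
}

fn call_mut_twice<F: FnMut()>(mut f: F) {
    f();
    f();
}

fn call_once<F: FnOnce() -> String>(f: F) -> String {
    f()
}

fn main() {
    // A closure that only reads its captures implements Fn.
    let offset = 20;
    assert_eq!(call_twice(|x| x + offset, 2), 42);

    // Plain functions implement the Fn traits too.
    assert_eq!(call_twice(i32::abs, -3), 3);

    // A closure that mutates its captures implements FnMut (and FnOnce).
    let mut count = 0;
    call_mut_twice(|| count += 1);
    assert_eq!(count, 2);

    // A closure that gives away a captured value implements only FnOnce.
    let owned = String::from("moved");
    assert_eq!(call_once(move || owned), "moved");
}
```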

The "str_eq" lang item is outdated. It used to specify how to check the equality of a string value against a literal string pattern in a match (match uses structural equality, not PartialEq::eq), however I believe this behavior is now hardcoded in the compiler.

The panic-related lang items (core) exist because rustc itself inserts panics in a few places. The first one, "panic", is used for integer overflow panics in debug mode, and "panic_bounds_check" is used for out of bounds indexing panics on slices. The last one, "panic_fmt" hooks into a function defined later in libstd.

The "exchange_malloc" and "box_free" (alloc) are for telling the compiler which functions to call in case it needs to do a malloc() or free(). These are used when constructing Box<T> via placement box syntax and when moving out of a deref of a box.

"strdup_uniq" seemed to be used in the past for moving string literals to the heap, but is no longer used.

We’ve already seen the start lang item (std) being used in our minimal example program. This function is basically where you find Rust’s “runtime”: it gets called with a pointer to main and the command line arguments, it sets up the “runtime”, calls main, and tears down anything it needs to. Rust has a C-like minimal runtime, so the actual libstd definition doesn’t do much. But you theoretically could stick a very heavy runtime initialization routine here.

The exception handling lang items (panic_unwind, in multiple platform-specific modules) specify various bits of the exception handling behavior. These hooks are called during various steps of unwinding: eh_personality is called when determining whether or not to stop at a stack frame or unwind up to the next one. eh_unwind_resume is the routine called when the unwinding code wishes to resume unwinding after calling destructors in a landing pad. msvc_try_filter defines some parameter that MSVC needs in its unwinding code. I don’t understand it, and apparently, neither does the person who wrote it.

The "owned_box" (alloc) lang item tells the compiler which type is the Box type. In my previous post I covered how Box is special; this lang item is how the compiler finds impls on Box and knows what the type is. Unlike the other primitives, Box doesn’t actually have a type name (like bool) that can be used if you’re writing libcore or libstd. This lang item gives Box a type name that can be used to refer to it. (It also defines some, but not all, of the semantics of Box<T>)

The "phantom_data" (core) type itself is allowed to have an unused type parameter, and it can be used to help fix the variance and drop behavior of a generic type. More on this in the nomicon.
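A small sketch of the "unused type parameter" part, using hypothetical unit markers `Meters` and `Feet`. It does not touch the variance and drop-check subtleties, which the nomicon covers.

```rust
use std::marker::PhantomData;
use std::mem::size_of;

// Hypothetical unit markers; they exist only at the type level.
enum Meters {}
enum Feet {}

// Without the PhantomData field the compiler rejects this struct,
// because the parameter `Unit` would never be used.
struct Length<Unit> {
    value: f64,
    _unit: PhantomData<Unit>,
}

impl<Unit> Length<Unit> {
    fn new(value: f64) -> Self {
        Length { value, _unit: PhantomData }
    }

    /// Only lengths with the same unit can be added.
    fn plus(self, other: Length<Unit>) -> Length<Unit> {
        Length::new(self.value + other.value)
    }
}

fn main() {
    let total: Length<Meters> = Length::new(40.0).plus(Length::new(2.0));
    assert_eq!(total.value, 42.0);

    let feet: Length<Feet> = Length::new(3.0);
    // `total.plus(feet)` would be a type error: the units differ.
    assert_eq!(feet.value, 3.0);

    // PhantomData is zero-sized, so the marker costs nothing at runtime.
    assert_eq!(size_of::<Length<Meters>>(), size_of::<f64>());
}
```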

The "non_zero" lang item (core) marks the NonZero<T> type, a type which is guaranteed to never contain a bit pattern of only zeroes. This is used inside things like Rc<T> and Box<T> – we know that the pointers in these can/should never be null, so they contain a NonZero<*const T>. When used inside an enum like Option<Rc<T>>, the discriminant (the “tag” value that distinguishes between Some and None) is no longer necessary, since we can mark the None case as the case where the bits occupied by NonZero in the Some case are zero. Beware, this optimization also applies to C-like enums that don’t have a variant corresponding to a discriminant value of zero (unless they are #[repr(C)])
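The effect is easy to observe with `size_of`. The sketch below uses the stable `std::num::NonZeroU32` rather than the internal `NonZero<T>` described above, and `option_is_free` is a helper name made up for the example.

```rust
use std::mem::size_of;
use std::num::NonZeroU32;
use std::rc::Rc;

/// True when wrapping `T` in `Option` does not make it any bigger.
fn option_is_free<T>() -> bool {
    size_of::<Option<T>>() == size_of::<T>()
}

fn main() {
    // The all-zero bit pattern is free, so None can live there.
    assert!(option_is_free::<Box<u64>>());
    assert!(option_is_free::<Rc<u64>>());
    assert!(option_is_free::<NonZeroU32>());

    // A plain u32 can hold every bit pattern, so Option needs a separate tag.
    assert!(!option_is_free::<u32>());
}
```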

There are also a bunch of deprecated lang items there. For example, NoCopy used to be a struct that could be dropped within a type to make it not implement Copy; in the past Copy implementations were automatic like Send and Sync are today. NoCopy was the way to opt out. There also used to be NoSend and NoSync. CovariantType/CovariantLifetime/etc were the predecessors of PhantomData; they could be used to specify variance relations of a type with its type or lifetime parameters, but you can now do this with providing the right PhantomData, e.g. InvariantType<T> is now PhantomData<Cell<T>>. The nomicon has more on variance. I don’t know why these lang items haven’t been removed (they don’t work anymore anyway); the only consumer of them is libcore so “deprecating” them seems unnecessary. It’s probably an oversight.

Interestingly, Iterator and IntoIterator are not lang items, even though they are used in for loops. Instead, the compiler inserts hardcoded calls to ::std::iter::IntoIterator::into_iter and ::std::iter::Iterator::next, and a hardcoded reference to ::std::option::Option (The paths use core in no_std mode). This is probably because the compiler desugars for loops before type resolution is done, so without this, libcore would not be able to use for loops since the compiler wouldn’t know what calls to insert in place of the loops while compiling.
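Written out by hand, the expansion looks roughly like the second function below. This is an approximation for illustration, not the exact code the compiler generates, and both function names are made up.

```rust
fn sum_with_for(values: Vec<i32>) -> i32 {
    let mut total = 0;
    for v in values {
        total += v;
    }
    total
}

/// Roughly what the `for` loop above turns into, using the hardcoded paths.
fn sum_desugared(values: Vec<i32>) -> i32 {
    let mut total = 0;
    let mut iter = ::std::iter::IntoIterator::into_iter(values);
    loop {
        match ::std::iter::Iterator::next(&mut iter) {
            ::std::option::Option::Some(v) => total += v,
            ::std::option::Option::None => break,
        }
    }
    total
}

fn main() {
    let data = vec![1, 2, 3];
    assert_eq!(sum_with_for(data.clone()), sum_desugared(data));
}
```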

Basically, whenever the compiler needs to give an item special treatment – whether it be dispatching calls to functions and trait methods in various situations, conferring special semantics to types/traits, or requiring traits to be implemented – the item will be defined in the standard library (libstd, libcore, or one of the crates behind the libstd façade), and marked as a lang item.

Some of the lang items are useful/necessary when working without libstd. Most only come into play if you want to replace libcore, which is a pretty niche thing to do, and knowing about them is rarely useful outside of the realm of compiler hacking.

But, like with the Box<T> madness, I still find this quite interesting, even if it isn’t generally useful!

1. Though as we learned in the previous post, when x and y are known numeric types it will bypass the trait and directly generate an add instruction in LLVM

2. To be clear, I’m not aware of any plans to eventually stabilize this. It’s something that could happen.

# Rust Tidbits: Box Is Special

Rust is not a simple language. As with any such language, it has many little tidbits of complexity that most folks aren’t aware of. Many of these tidbits are ones which may not practically matter much for everyday Rust programming, but are interesting to know. Others may be more useful. I’ve found that a lot of these aren’t documented anywhere (not that they always should be), and sometimes depend on knowledge of compiler internals or history. As a fan of programming trivia myself, I’ve decided to try writing about these things whenever I come across them. “Tribal Knowledge” shouldn’t be a thing in a programming community; and trivia is fun!

So. Box<T>. Your favorite heap allocation type that nobody uses[1].

I was discussing some stuff on the rfcs repo when @burdges realized that Box<T> has a funky Deref impl.

Let’s look at it:

Wait, what? Squints

The call is coming from inside the house!

In case you didn’t realize it, this deref impl returns &**self – since self is an &Box<T>, dereferencing it once will provide a Box<T>, and the second dereference will dereference the box to provide a T. We then wrap it in a reference and return it.

But wait, we are defining how a Box<T> is to be dereferenced (that’s what Deref::deref is for!), such a definition cannot itself dereference a Box<T>! That’s infinite recursion.

And indeed. For any other type such a deref impl would recurse infinitely. If you run this code:

the compiler will warn you:

Actually trying to dereference the type will lead to a stack overflow.
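For reference, here is a sketch of the kind of impl being described, next to one that bottoms out properly. `Loopy` and `Sound` are hypothetical names, and the recursion lint is silenced with an attribute so the example compiles quietly.

```rust
use std::ops::Deref;

/// A wrapper whose `deref` is defined in terms of itself.
#[allow(dead_code)]
struct Loopy<T>(T);

impl<T> Deref for Loopy<T> {
    type Target = T;

    #[allow(unconditional_recursion)]
    fn deref(&self) -> &T {
        // `*self` is a `Loopy<T>`, so the second `*` calls this function again.
        &**self
    }
}

/// The same wrapper with a deref that reaches the field directly.
struct Sound<T>(T);

impl<T> Deref for Sound<T> {
    type Target = T;

    fn deref(&self) -> &T {
        &self.0
    }
}

fn main() {
    let s = Sound(42);
    assert_eq!(*s, 42);
    // Dereferencing a `Loopy` value would recurse until the stack overflows,
    // so this sketch deliberately never does it.
}
```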

Clearly something is fishy here. This deref impl is similar to the deref impl for &T, or the Add impl for number types, or any other of the implementations of operators on primitive types. For example we literally define Add on two integers to be their addition. The reason these impls need to exist is so that people can still call Add::add if they need to in generic code and be able to pass integers to things with an Add bound. But the compiler knows how to use builtin operators on numbers and dereference borrowed references without these impls. But those are primitive types which are defined in the compiler, while Box<T> is just a regular smart pointer struct, right?

Turns out, Box<T> is special. It, too, is somewhat of a primitive type.

This is partly due to historical accident.

To understand this, we must look back to Ye Olde days of pre-1.0 Rust (ca 2014). Back in these days, we had none of this newfangled “stability” business. The compiler broke your code every two weeks. Of course, you wouldn’t know that because the compiler would usually crash before it could tell you that your code was broken! Sigils roamed the lands freely, and cargo was but a newborn child which was destined to eventually end the tyranny of Makefiles. People were largely happy knowing that their closures were safely boxed and their threads sufficiently green.

Back in these days, we didn’t have Box<T>, Vec<T>, or String. We had ~T, ~[T], and ~str. The latter two are not equivalent to Box<[T]> and Box<str>, even though they may look like it; they are both growable containers like Vec<T> and String. ~ conceptually meant “owned”, though IMO that caused more confusion than it was worth.

You created a box using the ~ operator, e.g. let x = ~1;. It could be dereferenced with the * operator, and autoderef worked much like it does today.

~T was a “primitive” type and, like all primitive types, it was special. The compiler knew things about it. The compiler knew how to dereference it without an explicit Deref impl. In fact, the Deref traits came into existence long after ~T did. ~T never got an explicit Deref impl, though it probably should have.

Eventually, there was a move to remove sigils from the language. The box constructor ~foo was superseded by placement box syntax, which still exists in Rust nightly[2]. Then, the ~T type became Box<T>. (~[T] and ~str would also be removed, though ~str took a very confusing detour with StrBuf first).

However, Box<T> was still special. It no longer needed special syntax to be referred to or constructed, but it was still internally a special type. It didn’t even have a Deref impl yet, that came six months later, and it was implemented as &**self, exactly the same as it is today.

But why does it have to be special now? Rust had all the features it needed (allocations, ownership, overloadable deref) to implement Box<T> in pure rust in the stdlib as if it were a regular type.

Turns out that Rust didn’t. You see, because Box<T> and before it ~T were special, their dereference semantics were implemented in a different part of the code. And, these semantics were not the same as the ones for DerefImm and DerefMut, which were created for use with other smart pointers. I don’t know if the possibility of being used for ~T was considered when DerefImm/DerefMut were being implemented, or if it was a simple oversight, but Box<T> has three pieces of behavior that could not be replicated in pure Rust at the time:

• box foo in a pattern would destructure a box into its contents. It’s somewhat the opposite of ref
• box foo() performed placement box, so the result of foo() could be directly written to a preallocated box, reducing extraneous copies
• You could move out of deref with Box<T>

The third one is the one that really gets to us here[3]. For a regular type, *foo will produce a temporary that must be immediately borrowed or copied. You cannot do let x = *y for a non-Copy type. This dereference operation will call DerefMut::deref_mut or Deref::deref based on how it gets borrowed. With Box<T>, you can do this:

For any other type, such an operation will produce a “cannot move out of a borrow” error.
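A minimal sketch of moving out of a box, with a commented-out contrast for another smart pointer. The function name `unbox` is made up for the example.

```rust
/// Moves the String out of its box. The heap allocation is freed,
/// while the String itself lives on in the caller.
fn unbox(boxed: Box<String>) -> String {
    *boxed
}

fn main() {
    let boxed = Box::new(String::from("moved out"));
    let s: String = unbox(boxed);
    assert_eq!(s, "moved out");

    // The same thing with, say, Rc<String> is rejected with a
    // "cannot move out of" error:
    // let rc = std::rc::Rc::new(String::from("nope"));
    // let s: String = *rc;
}
```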

This operation is colloquially called DerefMove, and there has been an rfc in the past for making it into a trait. I suspect that the DerefMove semantics could even have been removed from Box<T> before 1.0 (I don’t find it necessary), but people had better things to do, like fixing the million other rough edges of the language that can’t be touched after backwards compatibility is a thing.

So now we’re stuck with it. The current status is that Box<T> is still a special type in the compiler. By “special type” I don’t just mean that the compiler treats it a bit differently (this is true for any lang item), I mean that it literally is treated as a completely new kind of type, not as a struct the way it has been defined in liballoc. There’s a TON of cruft in the compiler related to this type, much of which can be removed, but some of which can’t. If we ever do get DerefMove, we should probably try removing it all again. After writing this post I’m half-convinced to try and implement an internal-use-only DerefMove and try cleaning up the code myself.

Most of this isn’t really useful to know unless you actually come across a case where you can make use of DerefMove semantics, or if you work on the compiler. But it certainly is interesting!

Next post: What is a lang item?

1. Seriously though, does anyone use it much? I’ve only seen it getting used for boxed DSTs (trait objects and boxed slices), which themselves are pretty rare, for sending heap types over FFI, recursive types (rare), and random special cases. I find this pretty interesting given that other languages are much more liberal with non-refcounted single-element allocation.

2. It will probably eventually be replaced or made equivalent to the <- syntax before stabilizing

3. It’s easier to special case the first two, much like how for loops are aware of the iterator trait without the iterator trait being extremely special cased

# Reflections on Rusting Trust

The Rust compiler is written in Rust. This is overall a pretty common practice in compiler development. This usually means that the process of building the compiler involves downloading a (typically) older version of the compiler.

This also means that the compiler is vulnerable to what is colloquially known as the “Trusting Trust” attack, an attack described in Ken Thompson’s acceptance speech for the 1983 Turing Award. This kind of thing fascinates me, so I decided to try writing one myself. It’s stuff like this which started my interest in compilers, and I hope this post can help get others interested the same way.

To be clear, this isn’t an indictment of Rust’s security. Quite a few languages out there have popular self-hosted compilers (C, C++, Haskell, Scala, D, Go) and are vulnerable to this attack. For this attack to have any effect, one needs to be able to uniformly distribute this compiler, and there are roughly equivalent ways of doing the same level of damage with that kind of access.

If you already know what a trusting trust attack is, you can skip the next section. If you just want to see the code, it’s in the trusting-trust branch on my Rust fork, specifically this code.

## The attack

The essence of the attack is this:

An attacker can conceivably change a compiler such that it can detect a particular kind of application and make malicious changes to it. The example given in the talk was the UNIX login program — the attacker can tweak a compiler so as to detect that it is compiling the login program, and compile in a backdoor that lets it unconditionally accept a special password (created by the attacker) for any user, thereby giving the attacker access to all accounts on all systems that have login compiled by their modified compiler.

However, this change would be detected in the source. If it was not included in the source, this change would disappear in the next release of the compiler, or when someone else compiles the compiler from source. Avoiding this attack is easily done by compiling your own compilers and not downloading untrusted binaries. This is good advice in general regarding untrusted binaries, and it equally applies here.

To counter this, the attacker can go one step further. If they can tweak the compiler so as to backdoor login, they could also tweak the compiler so as to backdoor itself. The attacker needs to modify the compiler with a backdoor which detects when it is compiling the same compiler, and introduces itself into the compiler that it is compiling. On top of this it can also introduce backdoors into login or whatever other program the attacker is interested in.

Now, in this case, even if the backdoor is removed from the source, every compiler compiled using this backdoored compiler will be similarly backdoored. So if this backdoored compiler somehow starts getting distributed, it will spread itself as it is used to compile more copies of itself (e.g. newer versions, etc). And it will be virtually undetectable — since the source doesn’t need to be modified for it to work; just the non-human-readable binary.
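
To make the self-propagation concrete, here is a toy simulation. It is only a sketch: the `Compiler` and `Artifact` types and the source names are hypothetical, and a "compiler binary" is modelled as nothing more than a flag saying whether it carries the backdoor. The source being compiled is always assumed to be pristine.

```rust
/// A toy model of a compiler binary: the only thing tracked is whether the
/// binary carries the backdoor. Hypothetical, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Compiler {
    backdoored: bool,
}

/// What a compiler produces from (always pristine) source code.
#[derive(Debug, PartialEq)]
enum Artifact {
    /// Compiling the compiler's own source yields a new compiler.
    Compiler(Compiler),
    /// Compiling anything else yields an ordinary program.
    Program { name: String, has_secret_password: bool },
}

impl Compiler {
    fn compile(&self, source_name: &str) -> Artifact {
        match source_name {
            // The backdoor recognises the compiler and copies itself into
            // the output, even though the source contains no trace of it.
            "compiler" => Artifact::Compiler(Compiler {
                backdoored: self.backdoored,
            }),
            // The backdoor recognises login and adds the special password.
            "login" => Artifact::Program {
                name: "login".to_string(),
                has_secret_password: self.backdoored,
            },
            // Everything else is compiled faithfully.
            other => Artifact::Program {
                name: other.to_string(),
                has_secret_password: false,
            },
        }
    }
}

fn main() {
    let mut compiler = Compiler { backdoored: true };
    for generation in 1..=3 {
        // Rebuild the compiler from clean sources using the previous binary.
        if let Artifact::Compiler(next) = compiler.compile("compiler") {
            compiler = next;
        }
        println!("generation {}: {:?}", generation, compiler.compile("login"));
    }
}
```

Every generation is built from clean sources, yet each one still produces a backdoored login, because the only thing that matters is the binary doing the compiling.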

Of course, there are ways to protect against this. Ultimately, before a compiler for language X existed, that compiler had to be written in some other language Y. If you can track the sources back to that point you can bootstrap a working compiler from scratch and keep compiling newer compiler versions till you reach the present. This raises the question of whether or not Y’s compiler is backdoored. While it sounds pretty unlikely that such a backdoor could be so robust as to work on two different compilers and stay put throughout the history of X, you can of course trace Y back to other languages and so on till you find a compiler in assembly that you can verify [1].

## Backdooring Rust

Alright, so I want to backdoor my compiler. I first have to decide when in the pipeline the code that inserts backdoors executes. The Rust compiler operates by taking source code, parsing it into a syntax tree (AST), transforming it into some intermediate representations (HIR and MIR), and feeding it to LLVM in the form of LLVM IR, after which LLVM does its thing and creates binaries. A backdoor can be inserted at any stage of this pipeline. To me, it seems like it’s easier to insert one into the AST, because it’s easier to obtain AST from source, and this is important as we’ll see soon. It also makes this attack less practically viable [2], which is nice since this is just a fun exercise and I don’t actually want to backdoor the compiler.

So the moment the compiler finishes parsing, my code will modify the AST to insert a backdoor.

First, I’ll try to write a simpler backdoor; one which doesn’t affect the compiler but instead affects some programs. I shall write a backdoor that replaces occurrences of the string “hello world” with “जगाला नमस्कार”, a rough translation of the same in my native language.

Now, in rustc, the rustc_driver crate is where the whole process of compiling is coordinated. In particular, phase_2_configure_and_expand is run right after parsing (which is phase 1). Perfect. Within that function, the krate variable contains the parsed AST for the crate [3], and we need to modify that.

In this case, there’s already machinery in syntax::fold for mutating ASTs based on patterns. A Folder basically has the ability to walk the AST, producing a mirror AST, with modifications. For each kind of node, you get to specify a function which will produce a node to be used in its place. Most such functions will default to no-op (returning the same node).
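
To get a feel for the folding pattern without pulling in rustc internals, here is a minimal standalone sketch. The `Expr` enum, the `Folder` trait and `TrustFolder` below are all hypothetical stand-ins, far simpler than the real syntax::fold machinery, but the shape is the same: every method defaults to rebuilding the node unchanged, and the implementor overrides only the node kinds it cares about.

```rust
/// A toy expression tree. These types are hypothetical stand-ins and are
/// far simpler than rustc's real AST.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    /// A string literal such as "hello world".
    Str(String),
    /// `let <name> = <value>;`
    Let { name: String, value: Box<Expr> },
    /// An unexpanded macro call; its arguments are still raw tokens.
    MacroCall { name: String, tokens: String },
    /// A sequence of expressions.
    Block(Vec<Expr>),
}

/// Walks a tree and produces a mirror tree. By default nothing changes.
trait Folder {
    fn fold_str(&mut self, s: String) -> Expr {
        Expr::Str(s)
    }

    fn fold_expr(&mut self, e: Expr) -> Expr {
        match e {
            Expr::Str(s) => self.fold_str(s),
            Expr::Let { name, value } => Expr::Let {
                name,
                value: Box::new(self.fold_expr(*value)),
            },
            // Macro tokens are not expressions yet, so they pass through.
            Expr::MacroCall { name, tokens } => Expr::MacroCall { name, tokens },
            Expr::Block(items) => {
                Expr::Block(items.into_iter().map(|i| self.fold_expr(i)).collect())
            }
        }
    }
}

/// Overrides only string literals, replacing one specific string.
struct TrustFolder;

impl Folder for TrustFolder {
    fn fold_str(&mut self, s: String) -> Expr {
        if s == "hello world" {
            Expr::Str("जगाला नमस्कार".to_string())
        } else {
            Expr::Str(s)
        }
    }
}

fn main() {
    let program = Expr::Block(vec![
        Expr::Let {
            name: "x".to_string(),
            value: Box::new(Expr::Str("hello world".to_string())),
        },
        Expr::MacroCall {
            name: "println".to_string(),
            tokens: "\"hello world\"".to_string(),
        },
    ]);
    println!("{:#?}", TrustFolder.fold_expr(program));
}
```

In this sketch the string bound by the `let` is rewritten, while the one sitting inside the macro call's tokens is left alone, since the folder only ever looks at string literal nodes.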

So I write the following Folder:

I invoke it by calling let krate = trust::fold_crate(krate); as the first line of phase_2_configure_and_expand.

I create a stage 1 build [4] of rustc (make rustc-stage1). I’ve already set up rustup to have a “stage1” toolchain pointing to this folder (rustup toolchain link stage1 /path/to/rust/target_triple/stage1), so I can easily test this new compiler:

Note that I had the string on a separate line instead of directly doing println!("hello world"). This is because our backdoor isn’t perfect; it applies to the pre-expansion AST. In this AST, println! is stored as a macro and the "hello world" is part of the macro token tree, which has not yet been turned into an expression. Our folder ignores it. It is not too hard to perform this same attack post-expansion, however.

So far, so good. We have a compiler that tweaks “hello world” strings. Now, let’s see if we can get it to miscompile itself. This means that our compiler, when compiling a pristine Rust source tree, should produce a compiler that is similarly backdoored (with the trust module and the trust::fold_crate() call).

We need to tweak our folder so that it does two things:

• Inserts the let krate = trust::fold_crate(krate); statement in the appropriate function (phase_2_configure_and_expand) when compiling a pristine Rust source tree
• Inserts the trust module

The former is relatively easy. We need to construct an AST for that statement (can be done by invoking the parser again and extracting the node). The latter is where it gets tricky. We can encode instructions for outputting the AST of the trust module, but these instructions themselves are within the same module, so the instructions for outputting these instructions need to be included, and so on. This clearly isn’t viable.

However, there’s a way around this. It’s a common trick used in writing quines, which face similar issues. The idea is to put the entire block of code in a string. We then construct the code for the module by doing something like

With the code of the module entered in, this will look something like

So you have a string containing the contents of the module, except for itself. You build the code for the module by using the string twice – once to construct the code for the declaration of the string, and once to construct the code for the rest of the module. Now, by parsing this, you’ll get the original AST!
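
Here is a small sketch of that "use the string twice" step. Everything in it is illustrative: `code_for_module` is a hypothetical helper, the module body is a placeholder, and the string is escaped with Rust's `{:?}` formatting (which inserts backslashes where needed) rather than with raw strings.

```rust
/// Rebuilds the source of a hypothetical `trust` module from a string that
/// holds everything in the module except the string's own declaration.
fn code_for_module(self_string: &str) -> String {
    format!(
        "mod trust {{\n    static SELF_STRING: &str = {:?};\n    {}\n}}",
        self_string, // first use: the contents of the string literal
        self_string  // second use: the rest of the module, as code
    )
}

fn main() {
    // Placeholder for "the rest of the module".
    let rest = "pub fn fold_crate() { /* backdoor logic would go here */ }";
    println!("{}", code_for_module(rest));
}
```

The generated text contains the module body twice: once quoted inside the declaration of SELF_STRING and once as live code, which is exactly what the original module looks like.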

Let’s try this step by step. Let’s first see if injecting an arbitrary string (use foo::bar::blah) works, without worrying about this cyclical quineyness:

We also change the original call in phase_2_configure_and_expand to let krate = trust::fold_crate(krate, sess);

Compiling with make rustc-stage2 (we now want the backdoored stage1 compiler to try and compile the same sources and fudge the phase_2_configure_and_expand function the second time around) gets us this error:

This is exactly what we expected! We inserted the code use foo::bar::blah;, which isn’t going to resolve, and thus got a failure when compiling the crate the second time around.

Let’s add the code for the quineyness and for inserting the fold_crate call:

The #s let us specify “raw strings” in Rust, where I can freely include other quotation marks without needing to escape things. For a string starting with n pound symbols, we can have raw strings with up to n - 1 pound symbols inside it. The SELF_STRING is declared with four pound symbols, and the code in the trust module only uses raw strings with three pound symbols. Since the code needs to generate the declaration of SELF_STRING (with four pound symbols), we manually concatenate extra pound symbols on. A four-pound-symbol raw string will not be valid within a three-pound-symbol raw string, since the parser will try to end the string early. So we don’t ever directly type a sequence of four consecutive pound symbols in the code, and instead construct it by concatenating two pairs of pound symbols.
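
A small sketch of the pound-symbol juggling, under the assumption of a hypothetical `declaration_for` helper (it is not the code from the branch). It shows a three-pound raw string holding a two-pound one, and the four-pound delimiter being assembled from two pairs.

```rust
/// Builds the declaration of a hypothetical `SELF_STRING` constant as a
/// four-pound raw string literal.
fn declaration_for(self_string: &str) -> String {
    // Inside the real module this code would itself sit in a four-pound raw
    // string, so the delimiter is assembled from two pairs of pound symbols
    // rather than typed out in one go.
    let pounds = String::from("##") + "##";
    format!(
        "static SELF_STRING: &str = r{p}\"{s}\"{p};",
        p = pounds,
        s = self_string
    )
}

fn main() {
    // A three-pound raw string can hold quotes and two-pound raw strings
    // without any escaping.
    let module_code = r###"let greeting = r##"say "hi""##;"###;
    println!("{}", declaration_for(module_code));
}
```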

Ultimately, the code_for_module declaration really does the same as:

conceptually, but also ensures that things stay escaped. I could get similar results by calling into a function that takes a string and inserts literal backslashes at the appropriate points.

To update SELF_STRING, we just need to include all the code inside the trust module after the declaration of SELF_STRING itself inside the string. I won’t include this inline since it’s big, but this is what it looks like in the end.

If we try compiling this code to stage 2 after updating SELF_STRING, we will get errors about duplicate trust modules, which makes sense because we’re compiling an already-backdoored version of the Rust source code. While we could set up two Rust builds, the easiest way to verify if our attack is working is to just use #[cfg(stage0)] on the trust module and the fold_crate call [5]. These will only get included during “stage 0” (when it compiles the stage 1 compiler [6]), and not when it compiles the stage 2 compiler, so if the stage 2 compiler still backdoors executables, we’re done.
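
As a rough illustration of the cfg gating, here is a standalone miniature. The function bodies are hypothetical placeholders, and it assumes a toolchain that accepts attributes on `let` statements. `stage0` is just a custom cfg name here; it is only set if you pass `--cfg stage0` to rustc yourself, and newer toolchains may warn that the name is unknown.

```rust
// Only compiled in when the custom `stage0` cfg flag is set.
#[cfg(stage0)]
mod trust {
    /// Stand-in for the real folder: here it just tags the crate.
    pub fn fold_crate(krate: String) -> String {
        format!("{} + backdoor", krate)
    }
}

/// Hypothetical miniature of `phase_2_configure_and_expand`.
fn phase_2_configure_and_expand(krate: String) -> String {
    // This statement disappears entirely unless `stage0` is set.
    #[cfg(stage0)]
    let krate = trust::fold_crate(krate);
    krate
}

fn main() {
    println!("{}", phase_2_configure_and_expand("pristine crate".to_string()));
}
```

Compiled normally, neither the module nor the call exists and the crate passes through untouched; compiled with `rustc --cfg stage0`, both are included.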

On building the stage 2 (make rustc-stage2) compiler,

I was also able to make it work with a separate clone of Rust:

Thus, a pristine copy of the rustc source has built a compiler infected with the backdoor.

So we now have a working trusting trust attack in Rust. What can we do with it? Hopefully nothing! This particular attack isn’t very robust, and while that can be improved upon, building a practical and resilient trusting trust attack that won’t get noticed is a bit trickier.

We in the Rust community should be working on ways to prevent such attacks from being successful, though.

A couple of things we could do are:

• Work on an alternate Rust compiler (in Rust or otherwise). For a pair of self-hosted compilers, there’s a technique called “Diverse Double-Compiling” wherein you choose an arbitrary sequence of compilers (something like “gcc followed by 3x clang followed by gcc followed by clang”), and compile each compiler with the output of the previous one. The difficulty of writing a backdoor that can survive this process grows exponentially.
• Try compiling rustc from its OCaml roots, and package up the process into a shell script so that you have reproducible trustworthy rustc builds.
• Make rustc builds deterministic, which means that a known-trustworthy rustc build can be compared against a suspect one to figure out if it has been tampered with.
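
The deterministic-builds idea can be sketched in a few lines. This is a toy model with hypothetical functions: the "compiler" just echoes the source bytes, and a backdoored one appends a marker when it spots the function it wants to tamper with. The point is only that when builds are deterministic, a byte-for-byte comparison against a trusted build means something.

```rust
/// Toy deterministic "compiler": the output depends only on the source and
/// on whether the compiler binary is backdoored. Hypothetical throughout.
fn compile(backdoored: bool, source: &str) -> Vec<u8> {
    let mut binary = source.as_bytes().to_vec();
    if backdoored && source.contains("phase_2_configure_and_expand") {
        // The backdoor injects extra code that is not in the source.
        binary.extend_from_slice(b"trust::fold_crate");
    }
    binary
}

/// With deterministic builds, byte-for-byte equality is a meaningful check.
fn looks_tampered(trusted_build: &[u8], suspect_build: &[u8]) -> bool {
    trusted_build != suspect_build
}

fn main() {
    let source = "fn phase_2_configure_and_expand() {}";
    let trusted = compile(false, source);
    let suspect = compile(true, source);
    println!("tampered: {}", looks_tampered(&trusted, &suspect));
}
```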

Overall trusting trust attacks aren’t that pressing a concern since there are many other ways to get approximately equivalent access with the same threat model. Having the ability to insert any backdoor into distributed binaries is bad enough, and should be protected against regardless of whether or not the backdoor is a self-propagating one. If someone had access to the distribution or build servers, for example, they could as easily insert a backdoor into the server, or place a key so that they can reupload tampered binaries when they want. Now, cleaning up after these attacks is easier than trusting trust, but ultimately this is like comparing being at the epicenter of Little Boy or the Tsar Bomba – one is worse, but you’re atomized regardless, and your mitigation plan shouldn’t need to change.

But it’s certainly an interesting attack, and it’s something we should at least be thinking about.

Thanks to Josh Matthews, Nika Layzell, Diane Hosfelt, Eevee, and Yehuda Katz for reviewing drafts of this post.


1. Of course, this raises the question of whether or not your assembler/OS/loader/processor is backdoored. Ultimately, you have to trust someone, which was partly the point of Thompson’s talk.

2. The AST turns up in the metadata/debuginfo/error messages, can be inspected from the command line, and in general is very far upstream and affects a number of things (all the other stages in the pipeline). You could write code to strip it out from these during inspection and only have it turn up in the binary, but that is much harder.

3. The local variable is called krate because crate is a keyword

4. Stage 1 takes the downloaded (older) Rust compiler and uses it to compile the sources. The stage 2 compiler is built when the stage 1 compiler (which is a “new” compiler) is used to compile the sources again.

5. Using it on the fold_crate call requires enabling the “attributes on statements” feature, but that’s no big deal – we’re only using the cfgs to be able to test easily; this feature won’t actually be required if we use our stage1 compiler to compile a clean clone of the sources.

6. The numbering of the stages is a bit confusing. During “stage 0” (cfg(stage0)), the stage 1 compiler is built. Since you are building the stage 1 compiler, the make invocation is make rustc-stage1. Similarly, during stage 1, the stage 2 compiler is built, and the invocation is make rustc-stage2 but you use #[cfg(stage1)] in the code.