In Pursuit of Laziness

Manish Goregaokar’s blog

I Never Hear the Phrase ‘INHTPAMA’ Anymore

Imagine never hearing the phrase ‘INHTPAMA’ again.

Oh, that’s already the case? Bummer.

Often, when talking about Rust, folks refer to the core aliasing rule as “that &mut thing”, “compile-time RWLock” (or “compile-time RefCell”), or something similar. Basically, referring to the fact that you can’t mutate the data that is currently held via an & reference, and that you can’t mutate or read the data currently held via an &mut reference except through that reference itself.

It’s always bugged me that we really don’t have a name for this thing. It’s one of the core bits of Rust, and crops up often in discussions.

But we did have a name for it! It was “INHTPAMA” (which was later butchered into “INHTWAMA”).

This is a reference to Niko’s 2012 blog post, titled “Imagine Never Hearing The Phrase ‘aliasable, mutable’ again”. It’s where the aliasing rules came from. Go read it, it’s great. It talks about this weird language with at symbols and purity, but I assure you, that language is Baby Rust. Or maybe Teenage Rust. The lifecycle of rusts is complex and interesting and I don’t know how to categorize it.

The point of this post isn’t really to encourage reviving the use of “INHTWAMA”; it’s a rather weird acronym that will probably confuse folks. I would like to have a better way of refering to “that &mut thing”, but I’d prefer if it wasn’t a confusing acronym that carries no meaning of its own if you don’t know the history of it. That’s a recipe for making new community members feel like outsiders.

But that post is amazing and I’d hate to see it drop out of the collective memory of the Rust community.

Use Signal. Use Tor.

I went to send a missive today
As I have done so oft before
But I forgot to employ that scrap of advice
“Use Signal. Use Tor.”

Intercepted of course the missive was
By a ferocious beast of lore
Because I failed to use that bit of advice
“Use Signal. Use Tor.”

The beast was strong; and formidable
He hated the amendments four
I should have remembered that piece of advice
“Use Signal. Use Tor.”

I tried to reason with the beast
but he only wanted war
Do not neglect that important advice
“Use Signal. Use Tor.”

Here I lie in the belly of the beast
I shall discount this advice no more
If I ever manage to leave this place
I’ll use Signal, and Tor.

Heed this advice, children.
It’s not something to ignore
Always, always, always, always
Use Signal. Use Tor.

Why Quantum Computing Is Weird

I’ve been meaning to write about physics for a while. When I started this blog the intention was to write about a wide variety of interests, but I ended up focusing on programming, despite the fact that I was doing more physics than programming for most of the lifetime of this blog. Time to change that, and hopefully write about other non-programming topics too.

Quantum Computing. It’s the new hip thing that’s going to change the world1. Someday.

In it’s essence, where classical computing deals with “bits”, which are on/off states, quantum computing deals with “qubits”, which are probabalistic quantum states that are often a mixture of on and off. These have interesting properties which make certain kinds of so-far-hard computation very easy to perform.

The goal of this post is not to teach quantum computing, rather to garner interest. I come to praise quantum computing, not bury it2. As a result, this post doesn’t require a background in physics. Having worked with very simple logic circuits is probably enough, though you may not even need that.

I’m basically going to sketch out an example of a very simple quantum algorithm. One that’s very logic-defying. It’s even logic-defying for many who have studied quantum mechanics; it certainly was for me. When I learned this first I could understand why it worked but there was a lot of dissonance between that and my intuitive conviction that it was wrong.

The algorithm

This is a quantum circuit (specifically, the circuit for the Deutsch-Jozsa algorithm). It’s used to find out the nature of a black-box function f(x), which takes in one qubit and outputs another3. For now, you can try to interpret this circuit as if it were a regular logic circuit. You’ll soon see that this interpretation is wrong, but it’s useful for the purposes of this explanation.

To run this algorithm, you first construct an “oracle” out of the black-box function. The oracle, given inputs x and y, has outputs x and y ⊕ f(x) (where is the symbol for XOR, the “exclusive OR”).

As with logic circuits, data flow here goes from left to right. This circuit has two constant inputs, a zero and a one. This is similar to how we might have constant “true” and “false” inputs to logic circuits.

They are then passed through “Hadamard gates”. These are like NOT gates, in that applying them twice is a no-op (they are their own inverse), but they’re not actually NOT gates. I like to describe them as “sideways NOT gates” since that description somewhat intuitively captures what’s going on with the qubits. What’s important to note here is that they have one input and one output, so they’re unaffected by the goings-on in a different wire.

Once these inputs have been Hadamard’ed, they are fed to the oracle we constructed. The top input goes on to become the top output. It’s also passed through f(x) and XORd with the bottom input to make the bottom output.

The top output is then Hadamard’ed again, and finally we observe its value.

Here’s where the magic comes in. By observing the top output, we will know the nature of f(x)4.

Wait, what? The top output doesn’t appear to have any interaction with f(x) at all! How can that work?

In fact, we could try to rewrite this circuit such that the measured output definitely has no interaction with f(x) whatever, assuming that the Hadamard gate isn’t doing anything funky5 (it isn’t):

How in the world does this work?

Why it works

Sadly, I can’t give a satisfying explanation to exactly why this works. This requires some quantum mechanics background6 to grasp.

However, I can give a hopefully-satisfying explanation as to why our regular intuition doesn’t work here.

First and foremost: The rewritten circuit I showed above? It’s wrong. If this was a logic circuit, we could always do that, but in quantum computing, T-junctions like the following can’t exist:

This is due to the “No Cloning theorem”. Unlike regular logic circuits, you can’t just “duplicate” a qubit. In some cases (like this one), you can try to create a similar qubit via the same process (e.g. here we could take another 0 and pass it through a Hadamard gate), but it’s not the “same” qubit. Unlike bits, qubits have a stronger notion of unique identity.

And it’s this sense of identity that fuels this algorithm (and most of quantum computing).

You see, while the top output of the oracle was x, it wasn’t exactly the same x. This x had been mixed with the lower output. This means that the upper and lower outputs are now entangled, with their state depending on each other. In fact, it’s really misleading to show the output as two wires in the first place – it’s really a single “entangled” state of two qubits that can’t be decomposed as a “top half” and a “bottom half”. Of course, this way of representing quantum circuits is still used because it’s a tidy way of visualizing these circuits, and physicists are aware of the caveats involved.

So what happens is that when you observe the top output, you are really doing a partial observation on the combined state of the two outputs, and this includes some information about f(x), which leaks out when you perform the observation.

These properties of qubits make quantum circuits work significantly differently from regular logic ones. On one hand, this severely restricts what you can do with them, but at the same time, new avenues of erstwhile-impossible operations open up. Most useful quantum algorithms (like Shor’s factorization algorithm) involve a mixture of a classical algorithm and a quantum circuit due to this reason. It’s pretty cool!


  1. What isn’t?

  2. The abstruseness of physics lives after it; the coolness is oft interred with its bones.

  3. This actually can be generalized to a function with n input and n output qubits, and the circuit stays mostly the same, except the top “x” line becomes n lines all initialized to 0 and passing through n parallel H gates.

  4. Specifically, if the observation is 1, the function is a constant, whereas if the observation is 0, the function is “balanced” (gives a different output for inputs 1 and 0)

  5. For Hadamard is an honorable gate. So are they all, all honorable gates.

  6. If you do have this background, it’s relatively straightforward; the Wikipedia page has the equations for it.

Understanding Git Filter-branch and the Git Storage Model

The other day Steve wanted git alchemy done on the Rust repo.

Specifically, he wanted the reference and nomicon moved out into their own repositories, preserving history. Both situations had some interesting quirks, the reference has lived in src/doc/reference/* and src/doc/reference.md, and the nomicon has lived in src/doc/nomicon, src/doc/tarpl, and at the top level in a separate git root.

As you can guess from the title of this post, the correct tool for this job is git filter-branch. My colleague Greg calls it “the swiss-army knife of Git history rewriting”.

I had some fun with filter-branch that day, thought I’d finally write an accessible tutorial for it. A lot of folks treat filter-branch like rebase, but it isn’t, and this crucial difference can lead to many false starts. It certainly did for me back when I first learned it.

This kind of ties into the common bit of pedantry about the nature of a commit I keep seeing pop up:

Git commits appear to be diffs, but they’re actually file copies, but they’re actually ACTUALLY diffs.

So what is a git commit?

Generally we interact with git commits via git show or by looking at commits on a git GUI / web UI. Here, we see diffs. It’s natural to think of a commit as a diff, it’s the model that makes the most sense for the most common ways of interacting with commits. It also makes some sense from an implementation point of view, diffs seem like an efficient way of storing things.

It turns out that the “real” model is not this, it’s actually that each commit is a snapshot of the whole repo state at the time.

But actually, it isn’t, the underlying implementation does make use of deltas in packfiles and some other tricks like copy-on-write forking.

Ultimately, arguing about the “real” mental model is mostly pedantry. There are multiple ways of looking at a commit. The documentation tends to implicitly think of them as “full copies of the entire file tree”, which is where most of the confusion about filter-branch comes from. But often it’s important to picture them as diffs, too.

Understanding the implementation can be helpful, especially when you break the repository whilst doing crazy things (I do this often). I’ve explained how it works in a later section, it’s not really a prerequisite for understanding filter-branch, but it’s interesting.

How do I rewrite history with git rebase?

This is where some of the confusion around filter-branch stems from. Folks have worked with rebase, and they think filter-branch is a generalized version of this. They’re actually quite different.

For those of you who haven’t worked with git rebase, it’s a pretty useful way of rewriting history, and is probably what you should use when you want to rewrite history, especially for maintaining clean git history in an unmerged under-review branch.

Rebase does a whole bunch of things. Its core task is, given the current branch and a branch that you want to “rebase onto”, it will take all commits unique to your branch, and apply them in order to the new one. Here, “apply” means “apply the diff of the commit, attempting to resolve any conflicts”. At times, it may ask you to manually resolve the conflicts, using the same tooling you use for conflicts during git merge.

Rebase is much more powerful than that, though. git rebase -i will open up “interactive rebase”, which will show you the commits that are going to be rebased. In this interface, you can reorder commits, mark them for edits (wherein the rebase will stop at that commit and let you git commit --amend changes into it), and even “squash” commits which lets you mark a commit to be absorbed into the previous one. This is rather useful for when you’re working on a feature and want to keep your commits neat, but also want to make fixup patches to older commits. Filippo’s git fixup alias packages this particular task into a single git command. Changing EDITOR=true into EDITOR=: GIT_SEQUENCE_EDITOR=: will make it not even open the editor for confirmation and try to do the whole thing automatically.

git rebase -x some_command is also pretty neat, lets you run a shell on each step during a rebase.

In this model, you are fundamentally thinking of commits as diffs. When you move around commits in the interactive rebase editor, you’re moving around diffs. When you mark things for squashing, you’re basically merging diffs. The whole process is about taking a set of diffs and applying them to a different “base commit”.

How do I rewrite history with git filter-branch?

filter-branch does not work with diffs. You’re working with the “snapshot” model of commits here, where each commit is a snapshot of the tree, and rewriting these commits.

What git filter-branch will do is for each commit in the specified branch, apply filters to the snapshot, and create a new commit. The new commit’s parent will be the filtered version of the old commit’s parent. So it creates a parallel commit DAG.

Because the filters apply on the snapshots instead of the diffs, there’s no chance for this to cause conflicts like in git rebase. In git rebase, if I have one commit that makes changes to a file, and I change the previous commit to just remove the area of the file that was changed, I’d have a conflict and git would ask me to figure out how the changes are supposed to be applied.

In git-filter-branch, if I do this, it will just power through. Unless you explicitly write your filters to refer to previous commits, the new commit is created in isolation, so it doesn’t worry about changes to the previous commits. If you had indeed edited the previous commit, the new commit will appear to undo those changes and apply its own on top of that.

filter-branch is generally for operations you want to apply pervasively to a repository. If you just want to tweak a few commits, it won’t work, since future commits will appear to undo your changes. git rebase is for when you want to tweak a few commits.

So, how do you use it?

The basic syntax is git filter-branch <filters> branch_name. You can use HEAD or @ to refer to the current branch instead of explicitly typing branch_name.

A very simple and useful filter is the subdirectory filter. It makes a given subdirectory the repository root. You use it via git filter-branch --subdirectory-filter name_of_subdir @. This is useful for extracting the history of a folder into its own repository.

Another useful filter is the tree filter, you can use it to do things like moving around, creating, or removing files. For example, if you want to move README.md to README in the entire history, you’d do something like git filter-branch --tree-filter 'mv README.md README' @. The tree filter will work by checking out each commit (in a separate temporary folder), running your filter on the working directory, adding any changes to the index (no need to git add yourself), and committing the new index.

The --prune-empty argument is useful here, as it removes commits which are now empty due to the rewrite.

Because it is checking out each commit, this filter is quite slow. When I initially was trying to do Steve’s task on the rust repo, I wrote a long tree filter and it was taking forever.

The faster version is the index filter. However, this is a bit trickier to work with (which is why I tend to use a tree filter if I can get away with it). What this does is operate on the index, directly.

The “index” is basically where things go when you git add them. Running git add will create temporary objects for the added file, and modify the WIP index (directory tree) to include a reference to the new file or change an existing file reference to the new one. When you commit, this index is packaged up into a commit and stored as an object. (More on how these objects work in a later section)

Now, since this deals with files that are already stored as objects, git doesn’t need to unwrap these objects and create a working directory to operate on them. So, with --index-filter, you can operate on these in a much faster way. However, since you don’t have a working directory, stuff like adding and moving files can be trickier. You often have to use git update-index to make this work.

However, a useful index filter is one which just scrubs a file (or files) from history:

1
$ git filter-branch --index-filter 'git rm --cached --ignore-unmatch filename' HEAD

The --ignore-unmatch makes the command still succeed if the file doesn’t exist. filter-branch will fail if one of the filters fails. In general I tend to write fallible filters like command1 1>&2 2>/dev/null ; command2 1>&2 2>/dev/null ; true, which makes it always succeed and also ignores any stdout/stderr output (which tends to make the progress screen fill up fast).

The --cached argument on git rm makes it operate only on the index, not the working directory. This is great, because we don’t have a working directory right now.

I rarely use git update-index so I’m not really going to try and explain how it can be used here. But if you need to do more complex operations in an index filter, that’s the way to go.

There are many other filters, like --commit-filter (lets you discard a commit entirely), --msg-filter (rewriting commit messages), and --env-filter (changing things like author metadata or other env vars). You can see a complete list with examples in the docs

How did I perform the rewrites on the reference and nomicon?

For the Rust Reference, basically I had to extract the history of src/doc/reference.md, AND src/doc/reference/* (reference.md was split up into reference/*.md recently) into its own commit. This is an easy tree filter to write, but tree filters take forever.

Instead of trying my luck with an index filter, I decided to just make it so that the tree filter would be faster. I first extracted src/doc/:

1
$ git filter-branch -f --prune-empty --subdirectory-filter src/doc @

Now I had a branch that contained only the history of src/doc, with the root directory moved to doc. This is a much smaller repo than the entirety of Rust.

Now, I moved reference.md into reference/:

1
$ git filter-branch -f --prune-empty --tree-filter 'mkdir -p reference; mv reference.md reference 1>/dev/null 2>/dev/null; true' @

As mentioned before, the /dev/null and true bits are because the mv command will fail in some cases (when reference.md doesn’t exist), and I want it to just continue without complaining when that happens. I only care about moving instances of that file, if that file doesn’t exist there it’s still okay.

Now, everything I cared about was within reference. The next step was simple:

1
$ git filter-branch -f --prune-empty --subdirectory-filter reference @

The whole process took maybe 10 minutes to run, most of the time being spent by the second command. The final result can be found here.

For the nomicon, the task was easier. In the case of the nomicon, it has always resided in src/doc/nomicon, src/doc/tarpl, or at the root. This last bit is interesting, when Alexis was working on the nomicon, he started off by hacking on it in a separate repo, but then within that repo moved it to src/doc/tarpl, and performed a merge commit with rustc. There’s no inherent restriction in Git that all merges must have a common ancestor, and you can do stuff like this. I was quite surprised when I saw this, since it’s pretty uncommon in general, but really, many projects of that size will have stuff like this. Servo and html5ever do too, and usually it’s when a large project is merged into it after being developed on the side.

This sounds complicated to work with, but it wasn’t that hard. I took the same subdirectory-filtere’d doc directory branch used for the reference. Then, I renamed tarpl/ to nomicon/ via a tree filter, and ran another subdirectory filter:

1
2
$ git filter-branch -f --prune-empty --tree-filter 'mv tarpl nomicon 1>/dev/null 2>/dev/null; true' @
$ git filter-branch -f --prune-empty --subdirectory-filter nomicon @

Now, I had the whole history of the nomicon in the root dir. Except for the commits made by Alexis before his frankenmerge, because these got removed in the first subdirectory filter (the commits were operating outside of src/doc, even though their contents eventually got moved there).

But, at this stage, I already had a branch with the nomicon at the root. Alexis’ original commits were also operating on the root directory. I can just rebase here, and the diffs of my commits will cleanly apply!

I found the commit (a54e64) where everything was moved to tarpl/, and took its parent (c7919f). Then, I just ran git rebase --root c7919f, and everything cleanly rebased. As expected, because I had a history going back to the first child of a54e64 with files moved, and a54e64 itself only moved files, so the diffs should cleanly apply.

The final result can be found here.

Appendix: How are commits actually stored?

The way the actual implementation of a commit works is that each file being stored is hashed and stored in a compressed format, indexed by the hash. A directory (“tree”) will be a list of hashes, one for each file/directory inside it, alongside the filenames and other metadata. This list will be hashed and used everywhere else to refer to the directory.

A commit will reference the “tree” object for the root directory via its hash.

Now, if you make a commit changing some files, most of the files will be unchanged. So will most of the directories. So the commits can share the objects for the unchanged files/directories, reducing their size. This is basically a copy-on-write model. Furthermore, there’s a second optimization called a “packfile”, wherein instead of storing a file git will store a delta (a diff) and a reference to the file the diff must be applied to.

We can see this at work using git cat-file. cat-file lets you view objects in the “git filesystem”, which is basically a bunch of hash-indexed objects stored in .git/objects. You can view them directly by traversing that directory (they’re organized as a trie), but cat-file -p will let you pretty-print their contents since they’re stored in a binary format.

I’m working with the repo for the Rust Book, playing with commit 4822f2. It’s a commit that changes just one file (second-edition/src/ch15-01-box.md), perfect.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
$ git show 4822f2baa69c849e4fa3b85204f219a16bde2f42
commit 4822f2baa69c849e4fa3b85204f219a16bde2f42
Author: Jake Goulding <...>
Date:   Fri Mar 3 14:07:24 2017 -0500

    Reorder sentence about a generic cons list.

diff --git a/second-edition/src/ch15-01-box.md b/second-edition/src/ch15-01-box.md
index 14c5533..29d8793 100644
--- a/second-edition/src/ch15-01-box.md
+++ b/second-edition/src/ch15-01-box.md
(diff omitted)

$ git cat-file -p 4822f2baa69c849e4fa3b85204f219a16bde2f42

tree ec7cd2821d4bcbafe08f3eca6ea60487bfdc1b52
parent 24cd100e061bb11c3f7f3219467d6d644c50d811
author Jake Goulding <...> 1488568044 -0500
committer GitHub <noreply@github.com> 1488568044 -0500

Reorder sentence about a generic cons list.

This tells us that the commit is a thing with some author information, a pointer to a parent, a commit message, and a “tree”. What’s this tree?

1
2
3
4
5
6
7
8
9
10
11
$ git cat-file -p ec7cd2821d4bcbafe08f3eca6ea60487bfdc1b52
100644 blob 4cab1f4d267628ab5f4f7c14b1b64a9d4b032409    .gitattributes
040000 tree e1dcc1c754d72450b03542b2106fcb67c78805ff    .github
100644 blob 4c699f440ac134c577cb6f67b04ec5b93c652440    .gitignore
100644 blob e86d887d84a839417c960faf877c9057a8dc6823    .travis.yml
100644 blob 7990f2738876fc0fbc2ca30f5f91e91745b0b8eb    README.md
040000 tree 17b33cb52a5abb67ff678a03e7ed88cf9f163c69    ci
040000 tree 0ffd2c1238345c1b0e99af6c1c618eee4a0bab58    first-edition
100644 blob 5d1d2bb79e1521b28dd1b8ff67f9b04f38d83620    index.md
040000 tree b7160f7d05d5b5bfe28bad029b1b490e310cff22    redirects
040000 tree d5672dd9ef15adcd1527813df757847d745e299a    second-edition

This is just a directory! You can see that each entry has a hash. We can use git cat-file -p to view each one. Looking at a tree object will just give us a subdirectory, but the blobs will show us actual files!

1
2
3
4
5
6
7
8
$ git cat-file -p 7990f2738876fc0fbc2ca30f5f91e91745b0b8eb # Show README
# The Rust Programming Language

[![Build Status](https://travis-ci.org/rust-lang/book.svg?branch=master)](https://travis-ci.org/rust-lang/book)

To read this book online, visit [rust-lang.github.io/book/][html].

(rest of file omitted)

So how does this share objects? Let’s look at the previous commit:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
$ git cat-file -p 4822f2baa69c849e4fa3b85204f219a16bde2f42^ # `^` means "parent"
tree d219be3c5010f64960ddb609a849fc42a01ad31b
parent 21c063868f9d7fb0fa488b6f1124262f055d275b
author steveklabnik <...> 1488567224 -0500
committer steveklabnik <...> 1488567239 -0500

mdbook needs to be on the PATH for deploy

$ git cat-file -p d219be3c5010f64960ddb609a849fc42a01ad31b # the tree
100644 blob 4cab1f4d267628ab5f4f7c14b1b64a9d4b032409    .gitattributes
040000 tree e1dcc1c754d72450b03542b2106fcb67c78805ff    .github
100644 blob 4c699f440ac134c577cb6f67b04ec5b93c652440    .gitignore
100644 blob e86d887d84a839417c960faf877c9057a8dc6823    .travis.yml
100644 blob 7990f2738876fc0fbc2ca30f5f91e91745b0b8eb    README.md
040000 tree 17b33cb52a5abb67ff678a03e7ed88cf9f163c69    ci
040000 tree 0ffd2c1238345c1b0e99af6c1c618eee4a0bab58    first-edition
100644 blob 5d1d2bb79e1521b28dd1b8ff67f9b04f38d83620    index.md
040000 tree b7160f7d05d5b5bfe28bad029b1b490e310cff22    redirects
040000 tree d48b2e06970cf3a6ae65655c340922ae69723989    second-edition

If you look closely, all of these hashes are the same, except for the hash for second-edition. For the hashes which are the same, these objects are being shared across commits. The differing hash is d5672d in the newer commit, and d48b2e in the older one.

Let’s look at the objects:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
$ git cat-file -p d5672d
100644 blob 82dc67a6b08f0eb62420e4da3b3aa9c0dc10911a    CONTRIBUTING.md
100644 blob 5cd51aa43f05416996c4ef055df5d6eb58fbe737    Cargo.lock
100644 blob 7ab2575fa5bf4abf6eaf767c72347580c9f769dd    Cargo.toml
100644 blob 96e9f0458b55a4047927de5bf04ceda89d772b2b    LICENSE-APACHE
100644 blob 5a56e6e8ed1909b4e4800aa8d2a0e7033ab4babe    LICENSE-MIT
100644 blob be1135fc6d28eca53959c7fc9ae191523e4bc96f    book.json
100644 blob 1400454f36840e916a7d7028d987c42fcb31b4db    dictionary.txt
100644 blob 5103c84d034d6e8a0e4b6090453ad2cdcde21537    doc-to-md.sh
040000 tree 6715d1d4c97e3d17a088922f687b8d9ffacb5953    dot
100644 blob f9e045c4c1824520534270a2643ebe68311503b8    nostarch.sh
040000 tree f8d9a9452b4bbaeba256b95d40b303cd5fb20a64    nostarch
100644 blob 0a2d16852c11355ef9d8758a304b812633dcf03c    spellcheck.sh
040000 tree 3f8db396566716299330cdd5f569fb0a0c4615dd    src
100644 blob 56677811f451084de7c3a2478587a09486209b14    style-guide.md
040000 tree 7601821a2ff38906332082671ea23e4074464dd2    tools

$ git cat-file -p d48b2e
100644 blob 82dc67a6b08f0eb62420e4da3b3aa9c0dc10911a    CONTRIBUTING.md
100644 blob 5cd51aa43f05416996c4ef055df5d6eb58fbe737    Cargo.lock
100644 blob 7ab2575fa5bf4abf6eaf767c72347580c9f769dd    Cargo.toml
100644 blob 96e9f0458b55a4047927de5bf04ceda89d772b2b    LICENSE-APACHE
100644 blob 5a56e6e8ed1909b4e4800aa8d2a0e7033ab4babe    LICENSE-MIT
100644 blob be1135fc6d28eca53959c7fc9ae191523e4bc96f    book.json
100644 blob 1400454f36840e916a7d7028d987c42fcb31b4db    dictionary.txt
100644 blob 5103c84d034d6e8a0e4b6090453ad2cdcde21537    doc-to-md.sh
040000 tree 6715d1d4c97e3d17a088922f687b8d9ffacb5953    dot
100644 blob f9e045c4c1824520534270a2643ebe68311503b8    nostarch.sh
040000 tree f8d9a9452b4bbaeba256b95d40b303cd5fb20a64    nostarch
100644 blob 0a2d16852c11355ef9d8758a304b812633dcf03c    spellcheck.sh
040000 tree f9fc05a6ff78b8211f4df931ed5e32c937aba66c    src
100644 blob 56677811f451084de7c3a2478587a09486209b14    style-guide.md
040000 tree 7601821a2ff38906332082671ea23e4074464dd2    tools

Again, these are the same, except for that of src. src has a lot of files in it, which will clutter this post, so I’ll run a diff on the outputs of cat-file:

$ diff -U5 <(g cat-file -p f9fc05a6ff78b8211f4df931ed5e32c937aba66c) <(g cat-file -p 3f8db396566716299330cdd5f569fb0a0c4615dd)
--- /dev/fd/63  2017-03-05 11:58:22.000000000 -0800
+++ /dev/fd/62  2017-03-05 11:58:22.000000000 -0800
@@ -63,11 +63,11 @@
 100644 blob ff6b8f8cd44f624e1239c47edda59560cdf491ae   ch14-02-publishing-to-crates-io.md
 100644 blob c53ef854a74b6c9fbd915be1bf824c6e78439c42   ch14-03-cargo-workspaces.md
 100644 blob 3fb59f9cc85b6b81994e83a34d542871a260a8f0   ch14-04-installing-binaries.md
 100644 blob e1cd1ca779fdf202af433108a8af6eda317f2717   ch14-05-extending-cargo.md
 100644 blob 3173cc508484cc447ebe42a024eac7d9e6c2ddcd   ch15-00-smart-pointers.md
-100644 blob 14c5533bb3b604c6e6274db278d1e7129f78d55d   ch15-01-box.md
+100644 blob 29d87933d6832374b87d98aa5588e09e0c1a4991   ch15-01-box.md
 100644 blob 47b35ed489d63ce6a885289fec01b7b16ba1afea   ch15-02-deref.md
 100644 blob 2d20c55cc8605c0c899bc4867adc6b6ea1f5c902   ch15-03-drop.md
 100644 blob 8e3fcf4e83fe1ce985a7c0b479b8b16701765aaf   ch15-04-rc.md
 100644 blob a4ade4ae8bf5296d79ed51d69506e71a83f9f489   ch15-05-interior-mutability.md
 100644 blob 3a4db5616c4f5baeb95d04ea40c6747e60181684   ch15-06-reference-cycles.md

As you can see, only the file that was changed in the commit has a new blob stored. If you view 14c553 and 29d879 you’ll get the pre- and post- commit versions of the file respectively.

So basically, each commit stores a tree of references to objects, often sharing nodes with other commits.

I haven’t had the opportunity to work with packfiles much, but they’re an additional optimization on top of this. Aditya’s post is a good intro to these.

What Are Sum, Product, and Pi Types?

See also: Tony’s post on the same topic

You often hear people saying “Language X1 has sum types” or “I wish language X had sum types”2, or “Sum types are cool”.

Much like fezzes and bow ties, sum types are indeed cool.

These days, I’ve also seen people asking about “Pi types”, because of this Rust RFC.

But what does “sum type” mean? And why is it called that? And what, in the name of sanity, is a Pi type?

Before I start, I’ll mention that while I will be covering some type theory to explain the names “sum” and “product”, you don’t need to understand these names to use these things! Far too often do people have trouble understanding relatively straightforward concepts in languages because they have confusing names with confusing mathematical backgrounds3.

So what’s a sum type? (the no-type-theory version)

In it’s essence, a sum type is basically an “or” type. Let’s first look at structs.

1
2
3
4
struct Foo {
    x: bool,
    y: String,
}

Foo is a bool AND a String. You need one of each to make one. This is an “and” type, or a “product” type (I’ll explain the name later).

So what would an “or” type be? It would be one where the value can be a bool OR a String. You can achieve this with C++ with a union:

1
2
3
4
5
6
7
union Foo {
    bool x;
    string y;
}

foo.x = true; // set it to a bool
foo.y = "blah"; // set it to a string

However, this isn’t exactly right, since the value doesn’t store the information of which variant it is. You could store false and the reader wouldn’t know if you had stored an empty string or a false bool.

There’s a pattern called “tagged union” (or “discriminated union”) in C++ which bridges this gap.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
union FooUnion {
    bool x;
    string y;
}

enum FooTag {
    BOOL, STRING
}

struct Foo {
    FooUnion data;
    FooTag tag;
}

// set it to a bool
foo.data.x = true;
foo.tag = BOOL;

// set it to a string
foo.data.y = "blah";
foo.tag = STRING;

Here, you manually set the tag when setting the value. C++ also has std::variant (or boost::variant) that encapsulates this pattern with a better API.

While I’m calling these “or” types here, the technical term for such types is “sum” types. Other languages have built-in sum types.

Rust has them and calls them “enums”. These are a more generalized version of the enums you see in other languages.

1
2
3
4
5
6
7
8
9
10
11
12
enum Foo {
    Str(String),
    Bool(bool)
}

let foo = Foo::Bool(true);

// "pattern matching"
match foo {
    Str(s) => /* do something with string `s` */,
    Bool(b) => /* do something with bool `b` */,
}

Swift is similar, and also calls them enums

1
2
3
4
5
6
7
8
9
10
11
12
enum Foo {
    case str(String)
    case boolean(bool)
}

let foo = Foo.boolean(true);
switch foo {
    case .str(let s):
        // do something with string `s`
    case .boolean(let b):
        // do something with boolean `b`
}

You can fake these in Go using interfaces, as well. Typescript has built-in unions which can be typechecked without any special effort, but you need to add a tag (like in C++) to pattern match on them.

Of course, Haskell has them:

1
2
3
4
5
6
7
8
9
10
data Foo = B Bool | S String

-- define a function
doThing :: Foo -> SomeReturnType
doThing (B b) = -- do something with boolean b
doThing (S s) = -- do something with string s

-- call it
doThing (S "blah")
doThing (B True)

One of the very common things that languages with sum types do is express nullability as a sum type;

1
2
3
4
5
6
7
8
9
10
11
// an Option is either "something", containing a type, or "nothing"
enum Option<T> {
    Some(T),
    None
}

let x = Some("hello");
match x {
    Some(s) => println!("{}", s),
    None => println!("no string for you"),
}

Generally, these languages have “pattern matching”, which is like a switch statement on steroids. It lets you match on and destructure all kinds of things, sum types being one of them. Usually, these are “exhaustive”, which means that you are forced to handle all possible cases. In Rust, if you remove that None branch, the program won’t compile. So you’re forced to deal with the none case, somehow.

In general sum types are a pretty neat and powerful tool. Languages with them built-in tend to make heavy use of them, almost as much as they use structs.

Why do we call it a sum type?

Here be (type theory) dragons

Let’s step back a bit and figure out what a type is.

It’s really a restriction on the values allowed. It can have things like methods and whatnot dangling off it, but that’s not so important here.

In other words, it’s like4 a set. A boolean is the set \(\{\mathtt{true}, \mathtt{false}\}\). An 8-bit unsigned integer (u8 in Rust) is the set \(\{0, 1, 2, 3, …. 254, 255\}\). A string is a set with infinite elements, containing all possible valid strings5.

What’s a struct? A struct with two fields contains every possible combination of elements from the two sets.

1
2
3
4
struct Foo {
    x: bool,
    y: u8,
}

The set of possible values of Foo is

\[\{(\mathtt{x}, \mathtt{y}): \mathtt{x} \in \mathtt{bool}, \mathtt y \in \mathtt{u8}\}\]

(Read as “The set of all \((\mathtt{x}, \mathtt{y})\) where \(\tt x\) is in \(\mathtt{bool}\) and \(\tt y\) is in \(\mathtt{u8}\)”)

This is called a Cartesian product, and is often represented as \(\tt Foo = bool \times u8\). An easy way to view this as a product is to count the possible values: The number of possible values of Foo is the number of possible values of bool (2) times the number of possible values of u8 (256).

A general struct would be a “product” of the types of each field, so something like

1
2
3
4
5
6
struct Bar {
    x: bool,
    y: u8,
    z: bool,
    w: String
}

is \(\mathtt{Bar = bool \times u8 \times bool \times String}\)

This is why structs are called “product types”6.

You can probably guess what comes next – Rust/Swift enums are “sum types”, because they are the sum of the two sets.

1
2
3
4
enum Foo {
    Bool(bool),
    Integer(u8),
}

is a set of all values which are valid booleans, and all values which are valid integers. This is a sum of sets, \(\tt Foo = bool + u8\). More accurately, it’s a disjoint union, where if the input sets have overlap, the overlap is “discriminated” out.

An example of this being a disjoint union is:

1
2
3
4
5
enum Bar {
    Bool1(bool),
    Bool2(bool),
    Integer(u8).
}

This is not \(\tt Bar = bool + bool + u8\), because \(\tt bool + bool = bool\), (regular set addition doesn’t duplicate the overlap).

Instead, it’s something like

\[\tt Bar = bool + otherbool + u8\]

where \(\tt otherbool\) is also a set \(\tt \{true, false\}\), except that these elements are different from those in \(\tt bool\). You can look at it as if

\[\tt otherbool = \{true_2, false_2\}\]

so that

\[\mathtt{bool + otherbool} = \{\mathtt{true, false, true_2, false_2}\}\]

For sum types, the number of possible values is the sum of the number of possible values of each of its component types.

So, Rust/Swift enums are “sum types”.

You may often notice the terminology “algebraic datatypes” (ADT) being used, usually that’s just talking about sum and product types together – a language with ADTs will have both.

In fact, you can even have exponential types! The notation AB in set theory does mean something, it’s the set of all possible mappings from \(B\) to \(A\). The number of elements is \(N_A^{N_B}\). So basically, the type of a function (which is a mapping) is an “exponential” type. You can also view it as an iterated product type, a function from type B to A is really a struct like this:

1
2
3
4
5
6
7
8
9
10
11
// the type
fn my_func(b: B) -> A;

// is conceptually (each possible my_func can be written as an instance of)

struct my_func {
    b1: A, // value for first element in B
    b2: A, // value for second element in B
    b3: A,
    // ... 
}

given a value of the input b, the function will find the right field of my_func and return the mapping. Since a struct is a product type, this is

\[\mathtt{A}^{N_\mathtt{B}} = \tt A \times A \times A \times \dots\]

making it an exponential type.

You can even take derivatives of types! (h/t Sam Tobin-Hochstadt for pointing this out to me)

What, in the name of sanity, is a Pi type?

It’s essentially a form of dependent type. A dependent type is when your type can depend on a value. An example of this is integer generics, where you can do things like Array<bool, 5>, or template<unsigned int N, typename T> Array<T, N> ... (in C++).

Note that the type signature contains a type dependent on an integer, being generic over multiple different array lengths.

The name comes from how a constructor for these types would look:

1
2
3
4
5
6
7
8
9
10
11
// create an array of booleans from a given integer
// I made up this syntax, this is _not_ from the Rust Pi type RFC
fn make_array(x: u8) -> Array<bool, x> {
    // ...
}

// or
// (the proposed rust syntax)
fn make_array<const x: u8>() -> Array<bool, x> {
   // ... 
}

What’s the type of make_array here? It’s a function which can accept any integer and return a different type in each case. You can view it as a set of functions, where each function corresponds to a different integer input. It’s basically:

1
2
3
4
5
6
7
8
9
struct make_array {
    make_array_0: fn() -> Array<bool, 0>,
    make_array_1: fn() -> Array<bool, 1>,
    make_array_2: fn() -> Array<bool, 2>,
    make_array_3: fn() -> Array<bool, 3>,
    make_array_4: fn() -> Array<bool, 4>,
    make_array_5: fn() -> Array<bool, 5>,
    // ... 
}

Given an input, the function chooses the right child function here, and calls it.

This is a struct, or a product type! But it’s a product of an infinite number of types7.

We can look at it as

\[\texttt{make_array} = \prod\limits_{x = 0}^\infty\left( \texttt{fn()} \mathtt\to \texttt{Array<bool, x>}\right)\]

The usage of the \(\Pi\) symbol to denote an iterative product gives this the name “Pi type”.

In languages with lazy evaluation (like Haskell), there is no difference between having a function that can give you a value, and actually having the value. So, the type of make_array is the type of Array<bool, N> itself in languages with lazy evaluation.

There’s also a notion of a “sigma” type, which is basically

\[\sum\limits_{x = 0}^\infty \left(\texttt{fn()} \mathtt\to \texttt{Array<bool, x>}\right)\]

With the Pi type, we had “for all N we can construct an array”, with the sigma type we have “there exists some N for which we can construct this array”. As you can expect, this type can be expressed with a possibly-infinite enum, and instances of this type are basically instances of Array<bool, N> for some specific N where the N is only known at runtime. (much like how regular sum types are instances of one amongst multiple types, where the exact type is only known at runtime). Vec<bool> is conceptually similar to the sigma type Array<bool, ?>, as is &[bool].

Wrapping up

Types are sets, and we can do set-theory things on them to make cooler types.

Let’s try to avoid using confusing terminology, however. If Rust does get “pi types”, let’s just call them “dependent types” or “const generics” :)

Thanks to Zaki, Avi Weinstock, Corey Richardson, and Peter Atashian for reviewing drafts of this post.


  1. Rust, Swift, sort of Typescript, and all the functional languages who had it before it was cool.

  2. Lookin’ at you, Go.

  3. Moooooooooooooooonads

  4. Types are not exactly sets due to some differences, but for the purposes of this post we can think of them like sets.

  5. Though you can argue that strings often have their length bounded by the pointer size of the platform, so it’s still a finite set.

  6. This even holds for zero-sized types, for more examples, check out this blog post

  7. Like with strings, in practice this would probably be bounded by the integer type chosen

Mitigating Underhandedness: Fuzzing Your Code

This may be part of a collaborative blog post series about underhanded Rust code. Or it may not. I invite you to write your own posts about underhanded code to make it so!

The submission deadline for the Underhanded Rust competition has been extended, so let’s talk more about how to keep your code working and free from bugs/underhandedness!

Previously, we talked about Clippy.

Now, really, underhanded bugs are just another form of bug. And how do we find bugs? We test!

We write unit tests. We run the code under Valgrind, ASan, MSan, UBSan, TSan, and any other sanitizer we can get our hands on. Tests tests tests. More tests. Tests.

But, there’s a problem here. You need to write test cases to make this work. These are inputs fed to your code after which you check whatever invariants your code has. There’s no guarantee that the test cases you write will exercise all the code paths in your program. This applies for sanitizers too, sanitizers are limited to testing the code paths that your test cases hit.

Of course, you can use code coverage tools to ensure that all these code paths will be hit. However, there’s a conflict here – your code will have many code paths that are not supposed to be hit ever. Things like redundant bounds checks, null checks, etc. In Rust programs such code paths generally use panics.

Now, these code paths are never supposed to be hit, so they’ll never show up in your code coverage. But you don’t have a guarantee that they can never be hit, short of formally verifying your program. The only solution here is writing more test cases.

Aside from that, even ignoring those code paths, you still need to manually write test cases for everything. For each possible code path in your code, if you want to be sure.

Who wants to manually write a million test cases?

Enter fuzzing. What fuzzing will do is feed your program random inputs, carefully watching the codepaths being taken, and try to massage the inputs so that new, interesting (usually crashy) codepaths are taken. You write tests for the fuzzer such that they can accept arbitrary input, and the fuzzer will find cases where they crash or panic.

One of the most popular fuzzers out there is AFL, which takes a binary and feeds it random input. Rust has a library that you can use for running AFL, however it currently needs to be run via a Docker image or needs a recompilation of rustc, since it adds a custom LLVM pass. We’re working on making this step unnecessary.

However, as of a few weeks ago, we now have bindings for libFuzzer, which uses existing instrumentation options built in to LLVM itself! libFuzzer works a bit differently; instead of giving it a binary, you write a function in a special way and give it a library containing that function, which it turns into a fuzzer binary. This is faster, since the fuzzer lives inside the binary itself and it doesn’t need to execute a new program each time.

Using libFuzzer in Rust is easy. Install cargo-fuzz:

1
$ cargo install cargo-fuzz

Now, within your crate, initialize the fuzz setup:

1
$ cargo fuzz init

This will create a fuzzing crate in fuzz/, with a single “fuzz target”, fuzzer_script_1. You can add more such targets with cargo fuzz add name_of_target. Fuzz targets are small libraries with a single function in them; the function that will be called over and over again by the fuzzer. It is up to you to fill in the body of this function, such that the program will crash or panic if and only if something goes wrong.

For example, for the unicode-segmentation crate, one of the fuzz targets I wrote just takes the string, splits it by grapheme and word boundaries, recombines it, and then asserts that the new string is the same.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
pub extern fn go(data: &[u8]) {
    // we only deal with unicode input
    // bail early, *without panicking* if the input isn't utf8
    if let Ok(s) = str::from_utf8(data) {
        // split into graphemes, recollect
        let result = UnicodeSegmentation::graphemes(s, true).flat_map(|s| s.chars()).collect::<String>();
        // recollected string should be the same as the input, panic if not
        assert_eq!(s, result);

        // split into words, recollect
        let result = s.split_word_bounds().flat_map(|s| s.chars()).collect::<String>();
        // recollected string should be the same as the input, panic if not
        assert_eq!(s, result);
    }
}

The other targets ensure that the forward and reverse word/grapheme iterators produce the same results. They all take the byte slice input, attempt to convert to UTF8 (silently failing – NOT panicking – if not possible), and then use the string as an input testcase.

Now, these targets will panic if the test fails, and the fuzzer will try and force that panic to happen. But also, these targets put together exercise most of the API surface of the crate, so the fuzzer may also find panics (or even segmentation faults!) in the crate itself. For example, the fuzz target for rust-url doesn’t itself assert; all it does is try to parse the given string. The fuzzer will try to get the URL parser to panic.

To run a fuzz script:

1
$ cargo fuzz run fuzzer_script_1

This will start the fuzzer, running until it finds a crash or panic. It may also find other things like inputs which make the code abnormally slow.

Fuzzing can find some interesting bugs. For example, the unicode-segmentation fuzzers found this bug, where an emoji followed by two skin tone modifiers isn’t handled correctly. We’d probably never have been able to come up with this testcase on our own. But the fuzzer could find it!

The Rust Cap’n Proto crate ran cargo-fuzz and found a whole ton of bugs. There are more such examples in the trophy case (be sure to add any of your own findings to the trophy case, too!)

cargo-fuzz is relatively new, so the API and behavior may still be tweaked a bit before 1.0. But you can start taking it for a spin now, and finding bugs!

Clarifying Misconceptions About SHAttered

This week Google published a SHA-1 collision.

There’s a lot of confusion about the implications of this. A lot of this is due to differences of opinion on what exactly constitutes a “new” collision. I tweeted about this. The webpage for the attack itself is misleading, saying that the answer to “Who is capable of mounting this attack?” is people with Google-esque resources. This depends on what exactly you mean by “this attack”.

So I’m seeing a lot of “oh well just another anti-milestone for SHA, doesn’t affect anyone since its still quite expensive to exploit” reactions, as well as the opposite “aaaaa everything is on fire” reaction. Both are wrong. It has practical implications for you even if you are certain that you won’t attract the ire of an entity with a lot of computational power. None of these implications, however, are likely to be disastrous.

TLDR: Now anyone, without needing Google-esque resources, can generate two colliding PDFs with arbitrary visual content in each.

(In fact, there’s already a PDF collision-generator up where you can upload two images and get a PDF with collisions in it)

Okay, back up a bit. What’s a hash? What’s SHA-1?

I explained this a bit in my older post about zero-knowledge-proofs.

In essence, a hash function takes some data (usually of arbitrary size), and produces a value called a hash (usually of fixed size). The function has some additional properties:

  • In almost all cases, a small perturbation in the input will lead to a large perturbation in the hash
  • Given an input and its hash, it is computationally hard to find an alternate input producing the same hash
  • It’s also hard to just find two inputs that has to the same value, though this is usually easier than the previous one

when two inputs hash to the same value, this is called a collision. As mentioned, is easier to find a collision, over finding a colliding alternate input for a known input.

SHA-1 is one such hash function. It’s been known for a while that it’s insecure, and the industry has largely moved off of it, but it’s still used,

What did the researchers do?

They found a hash collision for SHA-1. In essence, they found two strings, A and B, where SHA1(A) == SHA1(B).

However, given the way SHA-1 works, this means that you can generate infinitely many other such pairs of strings. And given the nature of the exact A and B they created, it is possible to use this to create arbitrary colliding PDFs.

Basically, SHA-1 (and many other hash functions), operate on “blocks”. These are fixed-size chunks of data, where the size is a property of the hash function. For SHA1 this is 512 bits.

The function starts off with an “initial” built-in hash. It takes the first block of your data and this hash, and does some computation with the two to produce a new hash, which is its state after the first block.

It will then take this hash and the second block, and run the same computations to produce a newer hash, which is its state after the second block. This is repeated till all blocks have been processed, and the final state is the result of the function.

There’s an important thing to notice here. At each block, the only inputs are the block itself and the hash of the string up to that block.

This means, if A and B are of a size that is a multiple of the block size, and SHA1(A) == SHA1(B), then SHA1(A + C) == SHA1(B + C). This is because, when the hash function reaches C, the state will be the same due to the hash collision, and after this point the next input blocks are identical in both cases, so the final hash will be the same.

Now, while you might consider A+C, B+C to be the “same collision” as A, B, the implications of this are different than just “there is now one known pair of inputs that collide”, since everyone now has the ability to generate new colliding inputs by appending an arbitrary string to A and B.

Of course, these new collisions have the restriction that the strings will always start with A or B and the suffixes will be identical. If you want to break this restriction, you will have to devote expensive resources to finding a new collision, like Google did.

How does this let us generate arbitrary colliding PDFs?

So this exploit actually uses features of the JPEG format to work. It was done in a PDF format since JPEGs often get compressed when sent around the Internet. However, since both A and B start a partial PDF document, they can only be used to generate colliding PDFs, not JPEGs.

I’m going to first sketch out a simplified example of what this is doing, using a hypothetical pseudocode-y file format. The researchers found a collision between the strings:

  • A: <header data> COMMENT(<nonce for A>) DISPLAY IMAGE 1
  • B: <header data> COMMENT(<nonce for B>) DISPLAY IMAGE 2

Here, <header data> is whatever is necessary to make the format work, and the “nonce”s are strings that make A and B have the same hash. Finding these nonces is where the computational power is required, since you basically have to brute-force a solution.

Now, to both these strings, they append a suffix C: IMAGE 1(<data for image 1>) IMAGE 2(<data for image 2>). This creates two complete documents. Both of the documents contain both images, but each one is instructed to display a different one. Note that since SHA1(A) == SHA1(B), SHA1(A + C) = SHA1(B + C), so these final documents have the same hash.

The contents of C don’t affect the collision at all. So, we can insert any two images in C, to create our own personal pair of colliding PDFs.

The actual technique used is similar to this, and it relies on JPEG comment fields. They have found a collision between two strings that look like:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
pdf header data                       | String A
begin embedded image                  |
    jpg header data                   |
    declare jpg comment of length N   |
    random nonce of length N          | (comment ends here)
                                     ---
    image 1, length L                 | String C
    jpg EOF byte (2 bytes)            |
    image 2                           |
end embedded image                    |

and

pdf header data                       | String B
begin embedded image                  |
    jpg header data                   |
    declare jpg comment of length M   |
    random nonce of length M-L-2      |
                                     ---
    image 1, length L                 | String C
    jpg EOF marker (2 bytes)          | (comment ends here)
    image 2                           |
end embedded image                    |

By playing with the nonces, they managed to generate a collision between A and B. In the first pdf, the embedded image has a comment containing only the nonce. Once the JPEG reader gets past that comment, it sees the first image, displays it, and then sees the end-of-file marker and decides to stop. Since the PDF format doesn’t try to interpret the image itself, the PDF format won’t be boggled by the fact that there’s some extra garbage data after the JPEG EOF marker. It simply takes all the data between the “begin embedded image” and “end embedded image” blocks, and passes it to the JPEG decoder. The JPEG decoder itself stops after it sees the end of file marker, and doesn’t get to the extra data for the second image.

In the second pdf, the jpg comment is longer, and subsumes the first image (as well as the EOF marker) Thus, the JPEG decoder directly gets to the second image, which it displays.

Since the actual images are not part of the original collision (A and B), you can substitute any pair of jpeg images there, with some length restrictions.

What are the implications?

This does mean that you should not trust the integrity of a PDF when all you have to go on is its SHA-1 hash. Use a better hash. Anyone can generate these colliding PDFs now.

Fortunately, since all such PDFs will have the same prefix A or B, you can detect when such a deception is being carried out.

Don’t check colliding PDFs into SVN. Things break.

In some cases it is possible to use the PDF collision in other formats. For example, it can be used to create colliding HTML documents. I think it can be used to colide ZIP files too.

Outside the world of complex file formats, little has changed. It’s still a bad idea to use SHA-1. It’s still possible for people to generate entirely new collisions like Google did, though this needs a lot of resources. It’s possible that someone with resources has already generated such a “universal-key collision” for some other file format1 and will use it on you, but this was equally possible before Google published their attack.

This does not make it easier to collide with arbitrary hashes – if someone else has uploaded a document with a hash, and you trust them to not be playing any tricks, an attacker won’t be able to generate a colliding document for this without immense resources. The attack only works when the attacker has control over the initial document; e.g. in a bait-and-switch-like attack where the attacker uploads document A, you read and verify it and broadcast your trust in document A with hash SHA(A), and then the attacker switches it with document B.


  1. Google’s specific collision was designed to be a “universal key”, since A and B are designed to have the image-switching mechanism built into it. Some other collision may not be like this; it could just be a collision of two images (or whatever) with no such switching mechanism. It takes about the same effort to do either of these, however, so if you have a file format that can be exploited to create a switching mechanism, it would always make more sense to build one into any collision you look for.

Mitigating Underhandedness: Clippy!

This may be part of a collaborative blog post series about underhanded Rust code. Or it may not. I invite you to write your own posts about underhanded code to make it so!

Last month we opened up The Underhanded Rust competition. This contest is about writing seemingly-innocuous malicious code; code that is deliberately written to do some harm, but will pass a typical code review.

It is inspired by the Underhanded C contest. Most of the underhanded C submissions have to do with hidden buffer overflows, pointer arithmetic fails, or misuse of C macros; and these problems largely don’t occur in Rust programs. However, the ability to layer abstractions on each other does open up new avenues to introducing underhandedness by relying on sufficiently confusing abstraction sandwiches. There are probably other interesting avenues. Overall, I’m pretty excited to see what kind of underhandedness folks come up with!

Of course, underhandedness is not just about fun and games; we should be hardening our code against this kind of thing. Even if you trust your fellow programmers. Even if you are the sole programmer and you trust yourself. After all, you can’t spell Trust without Rust; and Rust is indeed about trust. Specifically, Rust is about trusting nobody. Not even yourself.

Rust protects you from your own mistakes when it comes to memory management. But we should be worried about other kinds of mistakes, too. Many of the techniques used in underhanded programming involve sleights of hand that could just as well be introduced in the code by accident, causing bugs. Not memory safety bugs (in Rust), but still, bugs. The existence of these sleights of hand is great for that very common situation when you are feeling severely under-plushied and must win a competition to replenish your supply but we really don’t want these creeping into real-world code, either by accident or intentionally.


Allow me to take a moment out of your busy underhanded-submission-writing schedules to talk to you about our Lord and Savior Clippy.

Clippy is for those of you who have become desensitized to the constant whining of the Rust compiler and need a higher dosage of whininess to be kept on their toes. Clippy is for those perfectionists amongst you who want to know every minute thing wrong with their code so that they can fix it. But really, Clippy is for everyone.

Clippy is simply a large repository of lints. As of the time of writing this post, there are 183 lints in it, though not all of them are enabled by default. These use the regular Rust lint system so you can pick and choose the ones you need via #[allow(lint_name)] and #[warn(lint_name)]. These lints cover a wide range of functions:

  • Improving readability of the code (though rustfmt is the main tool you should use for this)
  • Helping make the code more compact by reducing unnecessary things (my absolute favorite is needless_lifetimes)
  • Helping make the code more idiomatic
  • Making sure you don’t do things that you’re not supposed to
  • Catching mistakes and cases where the code may not work as expected

The last two really are the ones which help with underhanded code. Just to give an example, we have lints like:

  • cmp_nan, which disallows things like x == NaN
  • clone_double_ref, which disallows calling .clone() on double-references (&&T), since that’s a straightforward copy and you probably meant to do something like (*x).clone()
  • for_loop_over_option: Option<T> is iterable, and while this is useful when composing iterators, directly iterating over an option is usually an indication of a mistake.
  • match_same_arms, which checks for identical match arm bodies (strong indication of a typo)
  • suspicious_assignment_formatting, which checks for possible typos with the += and -= operators
  • unused_io_amount, which ensures that you don’t forget that some I/O APIs may not write all bytes in the span of a single call

These catch many of the gotchas that might crop up in Rust code. In fact, I based my solution of an older, more informal Underhanded Rust contest on one of these.

Usage

Clippy is still nightly-only. We hook straight into the compiler’s guts to obtain the information we need, and like most internal compiler APIs, this is completely unstable. This does mean that you usually need a latest or near-latest nightly for clippy to work, and there will be times when it won’t compile while we’re working to update it.

There is a plan to ship clippy as an optional component of rustc releases, which will fix all of these issues (yay!).

But, for now, you can use clippy via:

1
2
3
4
5
rustup install nightly
# +nightly not necessary if nightly is your default toolchain
cargo +nightly install clippy
# in your project folder
cargo +nightly clippy

If you’re going to be making it part of the development procedures of a crate you maintain, you can also make it an optional dependency.

If you’re on windows, there’s currently a rustup/cargo bug where you may have to add the rustc libs path in your PATH for cargo clippy to work.

There’s an experimental project called rustfix which can automatically apply suggestions from clippy and rustc to your code. This may help in clippy-izing a large codebase, but it may also eat your code and/or laundry, so beware.

Contributing

There’s a lot of work that can be done on clippy. A hundred and eighty lints is just a start, there are hundreds more lint ideas filed on the issue tracker. We’re willing to mentor anyone who wants to get involved; and have specially tagged “easy” issues for folks new to compiler internals. In general, contributing to clippy is a great way to gain an understanding of compiler internals if you want to contribute to the compiler itself.

If you don’t want to write code for clippy, you can also run it on random crates, open pull requests with fixes, and file bugs on clippy for any false positives that appear.

There are more tips about contributing in our CONTRIBUTING.md.


I hope this helps reduce mistakes and underhandedness in your code!

..unless you’re writing code for the Underhanded Rust competition. In that case, underhand away!

Breaking Our Latin-1 Assumptions

So in my previous post I explored a specific (wrong) assumption that programmers tend to make about the nature of code points and text.

I was asked multiple times about other assumptions we tend to make. There are a lot. Most Latin-based scripts are simple, but most programmers spend their time dealing with Latin text so these complexities never come up.

I thought it would be useful to share my personal list of scripts that break our Latin-1 assumptions. This is a list I mentally check against whenever I am attempting to reason about text. I check if I’m making any assumptions that break in these scripts. Most of these concepts are independent of Unicode; so any program would have to deal with this regardless of encoding.

I again recommend going through eevee’s post, since it covers many related issues. Awesome-Unicode also has a lot of random tidbits about Unicode.

Anyway, here’s the list. Note that a lot of the concepts here exist in scripts other than the ones listed, these are just the scripts I use for comparing.

Arabic / Hebrew

Both Arabic and Hebrew are RTL scripts; they read right-to-left. This may even affect how a page is laid out, see the Hebrew Wikipedia.

They both have a concept of letters changing how they look depending on where they are in the word. Hebrew has the “sofit” letters, which use separate code points. For example, Kaf (כ) should be typed as ך at the end of a word. Greek has something similar with the sigma.

In Arabic, the letters can have up to four different forms, depending on whether they start a word, end a word, are inside a word, or are used by themselves. These forms can look very different. They don’t use separate code points for this; however. You can see a list of these forms here

As I mentioned in the last post, U+FDFD (﷽), a ligature representing the Basamala, is also a character that breaks a lot of assumptions.

Indic scripts

Indic scripts are abugidas, where you have consonants with vowel modifiers. For example, क is “kə”, where the upside down “e” is a schwa, something like an “uh” vowel sound. You can change the vowel by adding a diacritic (e.g ); getting things like का (“kaa”) को (“koh”) कू (“koo”).

You can also mash together consonants to create consonant clusters. The “virama” is a vowel-killer symbol that removes the inherent schwa vowel. So, + becomes क्. This sound itself is unpronounceable since क is a stop consonant (vowel-killed consonants can be pronounced for nasal and some other consonants though), but you can combine it with another consonant, as क् + (“rə”), to get क्र (“krə”). Consonants can be strung up infinitely, and you can stick one or more vowel diacritics after that. Usually, you won’t see more than two consonants in a cluster, but larger ones are not uncommon in Sanskrit (or when writing down some onomatopoeia). They may not get rendered as single glyphs, depending on the font.

One thing that crops up is that there’s no unambiguous concept of a letter here. There is a concept of an “akshara”, which basically includes the vowel diacritics, and depending on who you talk to may also include consonant clusters. Often things are clusters an akshara depending on whether they’re drawn with an explicit virama or form a single glyph.

In general the nature of the virama as a two-way combining character in Unicode is pretty new.

Hangul

Korean does its own fun thing when it comes to conjoining characters. Hangul has a concept of a “syllable block”, which is basically a letter. It’s made up of a leading consonant, medial vowel, and an optional tail consonant. 각 is an example of such a syllable block, and it can be typed as ᄀ + ᅡ + ᆨ. It can also be typed as 각, which is a “precomposed form” (and a single code point).

These characters are examples of combining characters with very specific combining rules. Unlike accents or other diacritics, these combining characters will combine with the surrounding characters only when the surrounding characters form an L-V-T or L-V syllable block.

As I mentioned in my previous post, apparently syllable blocks with more (adjacent) Ls, Vs, and Ts are also valid and used in Old Korean, so the grapheme segmentation algorithm in Unicode considers “ᄀᄀᄀ각ᆨᆨ” to be a single grapheme (it explicitly mentions this). I’m not aware of any fonts which render these as a single syllable block, or if that’s even a valid thing to do.

Han scripts

So Chinese (Hanzi), Japanese (Kanji1), Korean (Hanja2), and Vietnamese (Hán tự, along with Chữ Nôm 3) all share glyphs, collectively called “Han characters” (or CJK characters4). These languages at some point in their history borrowed the Chinese writing system, and made their own changes to it to tailor to their needs.

Now, the Han characters are ideographs. This is not a phonetic script; individual characters represent words. The word/idea they represent is not always consistent across languages. The pronounciation is usually different too. Sometimes, the glyph is drawn slightly differently based on the language used. There are around 80,000 Han ideographs in Unicode right now.

The concept of ideographs itself breaks some of our Latin-1 assumptions. For example, how do you define Levenshtein edit distance for text using Han ideographs? The straight answer is that you can’t, though if you step back and decide why you need edit distance you might be able to find a workaround. For example, if you need it to detect typos, the user’s input method may help. If it’s based on pinyin or bopomofo, you might be able to reverse-convert to the phonetic script, apply edit distance in that space, and convert back. Or not. I only maintain an idle curiosity in these scripts and don’t actually use them, so I’m not sure how well this would work.

The concept of halfwidth character is a quirk that breaks some assumptions.

In the space of Unicode in particular, all of these scripts are represented by a single set of ideographs. This is known as “Han unification”. This is a pretty controversial issue, but the end result is that rendering may sometimes be dependent on the language of the text, which e.g. in HTML you set with a <span lang=whatever>. The wiki page has some examples of encoding-dependent characters.

Unicode also has a concept of variation selector, which is a code point that can be used to select between variations for a code point that has multiple ways of being drawn. These do get used in Han scripts.

While this doesn’t affect rendering, Unicode, as a system for describing text, also has a concept of interlinear annotation characters. These are used to represent furigana / ruby. Fonts don’t render this, but it’s useful if you want to represent text that uses ruby. Similarly, there are ideographic description sequences which can be used to “build up” glyphs from smaller ones when the glyph can’t be encoded in Unicode. These, too, are not to be rendered, but can be used when you want to describe the existence of a character like biáng. These are not things a programmer needs to worry about; I just find them interesting and couldn’t resist mentioning them :)

Japanese speakers haven’t completely moved to Unicode; there are a lot of things out there using Shift-JIS, and IIRC there are valid reasons for that (perhaps Han unification?). This is another thing you may have to consider.

Finally, these scripts are often written vertically, top-down. Mongolian, while not being a Han script, is written vertically sideways, which is pretty unique. The CSS writing modes spec introduces various concepts related to this, though that’s mostly in the context of the Web.

Thai / Khmer / Burmese / Lao

These scripts don’t use spaces to split words. Instead, they have rules for what kinds of sequences of characters start and end a word. This can be determined programmatically, however IIRC the Unicode spec does not attempt to deal with this. There are libraries you can use here instead.

Latin scripts themselves!

Turkish is a latin-based script. But it has a quirk: The uppercase of “i” is a dotted “İ”, and the lowercase of “I” is “ı”. If doing case-based operations, try to use a Unicode-aware library, and try to provide the locale if possible.

Also, not all code points have a single-codepoint uppercase version. The eszett (ß) capitalizes to “SS”. There’s also the “capital” eszett ẞ, but its usage seems to vary and I’m not exactly sure how it interacts here.

While Latin-1 uses precomposed characters, Unicode also introduces ways to specify the same characters via combining diacritics. Treating these the same involves using the normalization algorithms (NFC/NFD).

Emoji

Well, not a script5. But emoji is weird enough that it breaks many of our assumptions. The scripts above cover most of these, but it’s sometimes easier to think of them in the context of emoji.

The main thing with emoji is that you can use a zero-width-joiner character to glue emoji together.

For example, the family emoji 👩‍👩‍👧‍👦 (may not render for you) is made by using the woman/man/girl/boy emoji and gluing them together with ZWJs. You can see its decomposition in uniview.

There are more sequences like this, which you can see in the emoji-zwj-sequences file. For example, MAN + ZWJ + COOK will give a male cook emoji (font support is sketchy). Similarly, SWIMMER + ZWJ + FEMALE SIGN is a female swimmer. You have both sequences of the form “gendered person + zwj + thing”, and “emoji containing human + zwj + gender”, IIRC due to legacy issues6

There are also modifier characters that let you change the skin tone of an emoji that contains a human (or human body part, like the hand-gesture emojis) in it.

Finally, the flag emoji are pretty special snowflakes. For example, 🇪🇸 is the Spanish flag. It’s made up of two regional indicator characters for “E” and “S”.

Unicode didn’t want to deal with adding new flags each time a new country or territory pops up. Nor did they want to get into the tricky business of determining what a country is, for example when dealing with disputed territories. So instead, they just defined these regional indicator symbols. Fonts are supposed to take pairs of RI symbols7 and map the country code to a flag. This mapping is up to them, so it’s totally valid for a font to render a regional indicator pair “E” + “S” as something other than the flag of Spain. On some Chinese systems, for example, the flag for Taiwan (🇹🇼) may not render.


I hightly recommend comparing against this relatively small list of scripts the next time you are writing code that does heavy manipulation of user-provided strings.


  1. Supplemented (but not replaced) by the Hiragana and Katakana phonetic scripts. In widespread use.

  2. Replaced by Hangul in modern usage

  3. Replaced by chữ quốc ngữ in modern usage, which is based on the Latin alphabet

  4. “CJK” (Chinese-Japanese-Korean) is probably more accurate here, though it probably should include “V” for Vietnamese too. Not all of these ideographs come from Han; the other scripts invented some of their own. See: Kokuji, Gukja, Chữ Nôm.

  5. Back in my day we painstakingly typed actual real words on numeric phone keypads, while trudging to 🏫 in three feet of ❄️️, and it was uphill both ways, and we weren’t even allowed 📱s in 🏫. Get off my lawn!

  6. We previously had individual code points for professions and stuff and they decided to switch over to using existing object emoji with combiners instead of inventing new profession emoji all the time

  7. 676 countries should be enough for anybody

Let’s Stop Ascribing Meaning to Code Points

Update: This post got a sequel, Breaking our latin-1 assumptions.

I’ve seen misconceptions about Unicode crop up regularly in posts discussing it. One very common misconception I’ve seen is that code points have cross-language intrinsic meaning.

It usually comes up when people are comparing UTF8 and UTF32. Folks start implying that code points mean something, and that O(1) indexing or slicing at code point boundaries is a useful operation. I’ve also seen this assumption manifest itself in actual programs which make incorrect assumptions about the nature of code points and mess things up when fed non-Latin text.

If you like reading about unicode, you might also want to go through Eevee’s article on the dark corners of unicode. Great read!

Encodings

So, anyway, we have some popular encodings for Unicode. UTF8 encodes 7-bit code points as a single byte, 11-bit code points as two bytes, 16-bit code points as 3 bytes, and 21-bit code points as four bytes. UTF-16 encodes the first three in two bytes, and the last one as four bytes (logically, a pair of two-byte code units). UTF-32 encodes all code points as 4-byte code units. UTF-16 is mostly a “worst of both worlds” compromise at this point, and the main programming language I can think of that uses it (and exposes it in this form) is Javascript, and that too in a broken way.

The nice thing about UTF8 is that it saves space. Of course, that is subjective and dependent on the script you use most commonly, for example my first name is 12 bytes in UTF-8 but only 4 in ISCII (or a hypothetical unicode-based encoding that swapped the Devanagri Unicode block with the ASCII block). It also uses more space over the very non-hypothetical UTF-16 encoding if you tend to use code points in the U+0800 - U+FFFF range. It always uses less space than UTF-32 however.

A commonly touted disadvantage of UTF-8 is that string indexing is O(n). Because code points take up a variable number of bytes, you won’t know where the 5th codepoint is until you scan the string and look for it. UTF-32 doesn’t have this problem; it’s always 4 * index bytes away.

The problem here is that indexing by code point shouldn’t be an operation you ever need!

Indexing by code point

The main time you want to be able to index by code point is if you’re implementing algorithms defined in the unicode spec that operate on unicode strings (casefolding, segmentation, NFD/NFC). Most if not all of these algorithms operate on whole strings, so implementing them as an iteration pass is usually necessary anyway, so you don’t lose anything if you can’t do arbitrary code point indexing.

But for application logic, dealing with code points doesn’t really make sense. This is because code points have no intrinsic meaning. They are not “characters”. I’m using scare quotes here because a “character” isn’t a well-defined concept either, but we’ll get to that later.

For example, “é” is two code points (e +́), where one of them is a combining accent. My name, “मनीष”, visually looks like three “characters”, but is four code points. The “नी” is made up of + . My last name contains a “character” made up of three code points (and multiple two-code-point “characters”). The flag emoji “🇺🇸” is also made of two code points, 🇺 + 🇸.

One false assumption that’s often made is that code points are a single column wide. They’re not. They sometimes bunch up to form characters that fit in single “columns”. This is often dependent on the font, and if your application relies on this, you should be querying the font. There are even code points like U+FDFD (﷽) which are often rendered multiple columns wide. In fact, in my monospace font in my text editor, that character is rendered almost 12 columns wide. Yes, “almost”, subsequent characters get offset a tiny bit. I don’t know why.

Another false assumption is that editing actions (selection, backspace, cut, paste) operate on code points. In both Chrome and Firefox, selection will often include multiple code points. All the multi-code-point examples I gave above fall into this category. An interesting testcase for this is the string “ᄀᄀᄀ각ᆨᆨ”, which will rarely if ever render as a single “character” but will be considered as one for the purposes of selection, pretty much universally. I’ll get to why this is later.

Backspace can gobble multiple code points at once too, but the heuristics are different. The reason behind this is that backspace needs to mirror the act of typing, and while typing sometimes constructs multi-codepoint characters, backspace decomposes it piece by piece. In cases where a multi-codepoint “character” can be logically decomposed (e.g. “letter + accent”), backspace will decompose it, by removing the accent or whatever. But some multi-codepoint characters are not “constructions” of general concepts that should be exposed to the user. For example, a user should never need to know that the “🇺🇸” flag emoji is made of 🇺 + 🇸, and hitting backspace on it should delete both codepoints. Similarly, variation selectors and other such code points shouldn’t be treated as their own unit when backspacing.

On my Mac most builtin apps (which I presume use the OSX UI toolkits) seem to use the same heuristics that Firefox/Chrome use for selection for both selection and backspace. While the treatment of code points in editing contexts is not consistent, it seems like applications consistently do not consider code points as “editing units”.

Now, it is true that you often need some way to index a string. For example, if you have a large document and need to represent a slice of it. This could be a user-selection, or something delimeted by markup. Basically, you’ve already gone through the document and have a section you want to be able to refer to later without copying it out.

However, you don’t need code point indexing here, byte indexing works fine! UTF8 is designed so that you can check if you’re on a code point boundary even if you just byte-index directly. It does this by restricting the kinds of bytes allowed. One-byte code points never have the high bit set (ASCII). All other code points have the high bit set in each byte. The first byte of multibyte codepoints always starts with a sequence that specifies the number of bytes in the codepoint, and such sequences can’t be found in the lower-order bytes of any multibyte codepoint. You can see this visually in the table here. The upshot of all this is that you just need to check the current byte if you want to be sure you’re on a codepoint boundary, and if you receive an arbitrarily byte-sliced string, you will not mistake it for something else. It’s not possible to have a valid code point be a subslice of another, or form a valid code point by subslicing a sequence of two different ones by cutting each in half.

So all you need to do is keep track of the byte indices, and use them for slicing it later.

All in all, it’s important to always remember that “code point” doesn’t have intrinsic meaning. If you need to do a segmentation operation on a string, find out what exactly you’re looking for, and what concept maps closest to that. It’s rare that “code point” is the concept you’re looking for. In most cases, what you’re looking for instead is “grapheme cluster”.

Grapheme clusters

The concept of a “character” is a nebulous one. Is “각” a single character, or three? How about “नी”? Or “நி”? Or the “👨‍❤️‍👨” emoji1? Or the “👨‍👨‍👧‍👧” family emoji2? Different scripts have different concepts which may not clearly map to the Latin notion of “letter” or our programmery notion of “character”.

Unicode itself gives the term “character” multiple incompatible meanings, and as far as I know doesn’t use the term in any normative text.

Often, you need to deal with what is actually displayed to the user. A lot of terminal emulators do this wrong, and end up messing up cursor placement. I used to use irssi-xmpp to keep my Facebook and Gchat conversations in my IRC client, but I eventually stopped as I was increasingly chatting in Marathi or Hindi and I prefer using the actual script over romanizing3, and it would just break my terminal4. Also, they got rid of the XMPP bridge but I’d already cut down on it by then.

So sometimes, you need an API querying what the font is doing. Generally, when talking about the actual rendered image, the term “glyph” or “glyph image” is used.

However, you can’t always query the font. Text itself exists independent of rendering, and sometimes you need a rendering-agnostic way of segmenting it into “characters”.

For this, Unicode has a concept of “grapheme cluster”. There’s also “extended grapheme cluster” (EGC), which is basically an updated version of the concept. In this post, whenever I use the term “grapheme cluster”, I am talking about EGCs.

The term is defined and explored in UAX #29. It starts by pinning down the still-nebulous concept of “user-perceived character” (“a basic unit of a writing system for a language”), and then declares the concept of a “grapheme cluster” to be an approximation to this notion that we can determine programmatically.

A rough definition of grapheme cluster is a “horizontally segmentable unit of text”.

The spec goes into detail as to the exact algorithm that segments text at grapheme cluster boundaries. All of the examples I gave in the first paragraph of this section are single grapheme clusters. So is “ᄀᄀᄀ각ᆨᆨ” (or “ᄀᄀᄀ각ᆨᆨ”), which apparently is considered a single syllable block in Hangul even though it is not of the typical form of leading consonant + vowel + optional tail consonant, but is not something you’d see in modern Korean. The spec explicitly talks of this case so it seems to be on purpose. I like this string because nothing I know of renders it as a single glyph; so you can easily use it to tell if a particular segmentation- aware operation uses grapheme clusters as segmentation. If you try and select it, in most browsers you will be forced to select the whole thing, but backspace will delete the jamos one by one. For the second string, backspace will decompose the core syllable block too (in the first string the syllable block 각 is “precomposed” as a single code point, in the second one I built it using combining jamos).

Basically, unless you have very specific requirements or are able to query the font, use an API that segments strings into grapheme clusters wherever you need to deal with the notion of “character”.

Language defaults

Now, a lot of languages by default are now using Unicode-aware encodings. This is great. It gets rid of the misconception that characters are one byte long.

But it doesn’t get rid of the misconception that user-perceived characters are one code point long.

There are only two languages I know of which handle this well: Swift and Perl 6. I don’t know much about Perl 6’s thing so I can’t really comment on it, but I am really happy with what Swift does:

In Swift, the Character type is an extended grapheme cluster. This does mean that a character itself is basically a string, since EGCs can be arbitrarily many code points long.

All the APIs by default deal with EGCs. The length of a string is the number of EGCs in it. They are indexed by EGC. Iteration yields EGCs. The default comparison algorithm uses unicode canonical equivalence, which I think is kind of neat. Of course, APIs that work with code points are exposed too, you can iterate over the code points using .unicodeScalars.

The internal encoding itself is … weird (and as far as I can tell not publicly exposed), but as a higher level language I think it’s fine to do things like that.

I strongly feel that languages should be moving in this direction, having defaults involving grapheme clusters.

Rust, for example, gets a lot of things right – it has UTF-8 strings. It internally uses byte indices in slices. Explicit slicing usually uses byte indices too, and will panic if out of bounds. The non-O(1) methods are all explicit, since you will use an iterator to perform the operation (E.g. .chars().nth(5)). This encourages people to think about the cost, and it also encourages people to coalesce the cost with nearby iterations – if you are going to do multiple O(n) things, do them in a single iteration! Rust chars represent code points. .char_indices() is a useful string iteration method that bridges the gap between byte indexing and code points.

However, while the documentation does mention grapheme clusters, the stdlib is not aware of the concept of grapheme clusters at all. The default “fundamental” unit of the string in Rust is a code point, and the operations revolve around that. If you want grapheme clusters, you may use unicode-segmentation

Now, Rust is a systems programming language and it just wouldn’t do to have expensive grapheme segmentation operations all over your string defaults. I’m very happy that the expensive O(n) operations are all only possible with explicit acknowledgement of the cost. So I do think that going the Swift route would be counterproductive for Rust. Not that it can anyway, due to backwards compatibility :)

But I would prefer if the grapheme segmentation methods were in the stdlib (they used to be). This is probably not something that will happen, though I should probably push for the unicode crates being move into the nursery at least.


  1. Emoji may not render as a single glyph depending on the font.

  2. While writing this paragraph I discovered that wrapping text that contains lots of family emoji hangs Sublime. Neat.

  3. Part of the reason here is that I just find romanization confusing. There are some standardized ways to romanize which don’t get used much. My friends and I romanize one way, different from the standardizations. My family members romanize things a completely different way and it’s a bit hard to read. Then again, romanization does hide the fact that my spelling in Hindi is atrocious :)

  4. It’s possible to make work. You need a good terminal emulator, with the right settings, the right settings in your env vars, the right settings in irssi, and the right settings in screen. I think my current setup works well with non-ascii text but I’m not sure what I did to make it happen.