In Pursuit of Laziness

Manish Goregaokar’s blog

Breaking Our Latin-1 Assumptions

So in my previous post I explored a specific (wrong) assumption that programmers tend to make about the nature of code points and text.

I was asked multiple times about other assumptions we tend to make. There are a lot. Most Latin-based scripts are simple, but most programmers spend their time dealing with Latin text so these complexities never come up.

I thought it would be useful to share my personal list of scripts that break our Latin-1 assumptions. This is a list I mentally check against whenever I am attempting to reason about text. I check if I’m making any assumptions that break in these scripts. Most of these concepts are independent of Unicode; so any program would have to deal with this regardless of encoding.

I again recommend going through eevee’s post, since it covers many related issues. Awesome-Unicode also has a lot of random tidbits about Unicode.

Anyway, here’s the list. Note that a lot of the concepts here exist in scripts other than the ones listed, these are just the scripts I use for comparing.

Arabic / Hebrew

Both Arabic and Hebrew are RTL scripts; they read right-to-left. This may even affect how a page is laid out, see the Hebrew Wikipedia.

They both have a concept of letters changing how they look depending on where they are in the word. Hebrew has the “sofit” letters, which use separate code points. For example, Kaf (כ) should be typed as ך at the end of a word. Greek has something similar with the sigma.

In Arabic, the letters can have up to four different forms, depending on whether they start a word, end a word, are inside a word, or are used by themselves. These forms can look very different. They don't, however, use separate code points for this. You can see a list of these forms here.

As I mentioned in the last post, U+FDFD (﷽), a ligature representing the Basmala, is also a character that breaks a lot of assumptions.

Indic scripts

Indic scripts are abugidas, where you have consonants with vowel modifiers. For example, क is “kə”, where the upside down “e” is a schwa, something like an “uh” vowel sound. You can change the vowel by adding a diacritic, getting things like का (“kaa”), को (“koh”), and कू (“koo”).

You can also mash together consonants to create consonant clusters. The “virama” is a vowel-killer symbol that removes the inherent schwa vowel. So, क + ् becomes क्. This sound itself is unpronounceable since क् is a stop consonant (vowel-killed nasals and some other consonants can be pronounced, though), but you can combine it with another consonant, as क् + र (“rə”), to get क्र (“krə”). Consonants can be strung together indefinitely, and you can stick one or more vowel diacritics after that. Usually, you won’t see more than two consonants in a cluster, but larger ones are not uncommon in Sanskrit (or when writing down some onomatopoeia). They may not get rendered as single glyphs, depending on the font.
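
To make this concrete, here’s a small Rust check of the decomposition described above (the code points are from the Devanagari block):

fn main() {
    // क्र is a single user-perceived unit, but three code points:
    // क (U+0915) + virama ् (U+094D) + र (U+0930)
    let cluster = "क्र";
    let scalars: Vec<char> = cluster.chars().collect();
    assert_eq!(scalars.len(), 3);
    assert_eq!(scalars[1], '\u{094D}'); // the virama sits in the middle
}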

One thing that crops up is that there’s no unambiguous concept of a letter here. There is a concept of an “akshara”, which basically includes the vowel diacritics, and depending on who you talk to may also include consonant clusters. Whether a cluster counts as a single akshara often depends on whether it’s drawn with an explicit virama or forms a single glyph.

In general the nature of the virama as a two-way combining character in Unicode is pretty unique.

Hangul

Korean does its own fun thing when it comes to conjoining characters. Hangul has a concept of a “syllable block”, which is basically a letter. It’s made up of a leading consonant, medial vowel, and an optional tail consonant. 각 is an example of such a syllable block, and it can be typed as ᄀ + ᅡ + ᆨ. It can also be typed as 각, which is a “precomposed form” (and a single code point).
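
The precomposed syllables are laid out algorithmically, so the mapping between the jamos and the precomposed forms can be computed directly. A minimal sketch of the composition formula from the Unicode standard, using 각 as the example:

fn main() {
    // syllable = 0xAC00 + (leading_index * 21 + vowel_index) * 28 + tail_index
    // For 각: leading ᄀ has index 0, vowel ᅡ has index 0, tail ᆨ has index 1.
    let (l, v, t) = (0u32, 0u32, 1u32);
    let precomposed = std::char::from_u32(0xAC00 + (l * 21 + v) * 28 + t).unwrap();
    assert_eq!(precomposed, '각');
}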

These characters are examples of combining characters with very specific combining rules. Unlike accents or other diacritics, these combining characters will combine with the surrounding characters only when the surrounding characters form an L-V-T or L-V syllable block.

As I mentioned in my previous post, apparently syllable blocks with more (adjacent) Ls, Vs, and Ts are also valid and used in Old Korean, so the grapheme segmentation algorithm in Unicode considers “ᄀᄀᄀ각ᆨᆨ” to be a single grapheme (it explicitly mentions this). I’m not aware of any fonts which render these as a single syllable block, or if that’s even a valid thing to do.

Han scripts

So Chinese (Hanzi), Japanese (Kanji1), Korean (Hanja2), and Vietnamese (Hán tự, along with Chữ Nôm 3) all share glyphs, collectively called “Han characters” (or CJK characters4). These languages at some point in their history borrowed the Chinese writing system, and made their own changes to it to tailor to their needs.

Now, the Han characters are ideographs. This is not a phonetic script; individual characters represent words. The word/idea they represent is not always consistent across languages. The pronunciation is usually different too. Sometimes, the glyph is drawn slightly differently based on the language used. There are around 80,000 Han ideographs in Unicode right now.

The concept of ideographs itself breaks some of our Latin-1 assumptions. For example, how do you define Levenshtein edit distance for text using Han ideographs? The straight answer is that you can’t, though if you step back and decide why you need edit distance you might be able to find a workaround. For example, if you need it to detect typos, the user’s input method may help. If it’s based on pinyin or bopomofo, you might be able to reverse-convert to the phonetic script, apply edit distance in that space, and convert back. Or not. I only maintain an idle curiosity in these scripts and don’t actually use them, so I’m not sure how well this would work.

The concept of halfwidth characters is another quirk that breaks some assumptions.

In the space of Unicode in particular, all of these scripts are represented by a single set of ideographs. This is known as “Han unification”. This is a pretty controversial issue, but the end result is that rendering may sometimes be dependent on the language of the text, which e.g. in HTML you set with a <span lang=whatever>. The wiki page has some examples of encoding-dependent characters.

Unicode also has a concept of variation selector, which is a code point that can be used to select between variations for a code point that has multiple ways of being drawn. These do get used in Han scripts.

While this doesn’t affect rendering, Unicode, as a system for describing text, also has a concept of interlinear annotation characters. These are used to represent furigana / ruby. Fonts don’t render this, but it’s useful if you want to represent text that uses ruby. Similarly, there are ideographic description sequences which can be used to “build up” glyphs from smaller ones when the glyph can’t be encoded in Unicode. These, too, are not to be rendered, but can be used when you want to describe the existence of a character like biáng. These are not things a programmer needs to worry about; I just find them interesting and couldn’t resist mentioning them :)

Japanese speakers haven’t completely moved to Unicode; there are a lot of things out there using Shift-JIS, and IIRC there are valid reasons for that (perhaps Han unification?). This is another thing you may have to consider.

Finally, these scripts are often written vertically, top-down. Mongolian, while not being a Han script, is written vertically sideways, which is pretty unique. The CSS writing modes spec introduces various concepts related to this, though that’s mostly in the context of the Web.

Thai / Khmer / Burmese / Lao

These scripts don’t use spaces to split words. Instead, they have rules for what kinds of sequences of characters start and end a word. This can be determined programmatically; however, IIRC the Unicode spec does not attempt to deal with this. There are libraries you can use here instead.

Latin scripts themselves!

Turkish uses a Latin-based script, but it has a quirk: the uppercase of “i” is a dotted “İ”, and the lowercase of “I” is a dotless “ı”. If you’re doing case-based operations, use a Unicode-aware library, and provide the locale if possible.
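
As a minimal illustration of why the locale matters, Rust’s built-in case conversions are locale-independent and give the default (non-Turkish) answers:

fn main() {
    // These are correct for English but wrong for Turkish text:
    assert_eq!("istanbul".to_uppercase(), "ISTANBUL"); // Turkish expects "İSTANBUL"
    assert_eq!("ISTANBUL".to_lowercase(), "istanbul"); // Turkish expects "ıstanbul"
    // A locale-aware library (e.g. ICU) is needed for the Turkish mappings.
}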

Also, not all code points have a single-codepoint uppercase version. The eszett (ß) capitalizes to “SS”. There’s also the “capital” eszett ẞ, but its usage seems to vary and I’m not exactly sure how it interacts here.

While Latin-1 uses precomposed characters, Unicode also introduces ways to specify the same characters via combining diacritics. Treating these the same involves using the normalization algorithms (NFC/NFD).
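
A minimal sketch of this, assuming the unicode-normalization crate:

use unicode_normalization::UnicodeNormalization;

fn main() {
    let precomposed = "\u{00E9}"; // é as a single code point
    let decomposed = "e\u{0301}"; // e followed by a combining acute accent
    assert_ne!(precomposed, decomposed); // not equal as code point sequences
    // After NFC normalization, both take the precomposed form:
    let a: String = precomposed.nfc().collect();
    let b: String = decomposed.nfc().collect();
    assert_eq!(a, b);
}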

Emoji

Well, not a script5. But emoji is weird enough that it breaks many of our assumptions. The scripts above cover most of these, but it’s sometimes easier to think of them in the context of emoji.

The main thing with emoji is that you can use a zero-width-joiner character to glue emoji together.

For example, the family emoji 👩‍👩‍👧‍👦 (may not render for you) is made by using the woman/man/girl/boy emoji and gluing them together with ZWJs. You can see its decomposition in uniview.

There are more sequences like this, which you can see in the emoji-zwj-sequences file. For example, MAN + ZWJ + COOK will give a male cook emoji (font support is sketchy). Similarly, SWIMMER + ZWJ + FEMALE SIGN is a female swimmer. You have both sequences of the form “gendered person + zwj + thing” and “emoji containing human + zwj + gender”, IIRC due to legacy issues6.
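
To a program, such a sequence is just a run of scalar values with ZWJs in between. For a family sequence like the one above (two women, a girl, and a boy):

fn main() {
    // WOMAN, ZWJ, WOMAN, ZWJ, GIRL, ZWJ, BOY
    let family = "\u{1F469}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}";
    assert_eq!(family.chars().count(), 7); // four people plus three ZWJs
    assert_eq!(family.chars().filter(|&c| c == '\u{200D}').count(), 3);
}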

There are also modifier characters that let you change the skin tone of an emoji that contains a human (or human body part, like the hand-gesture emojis) in it.

Finally, the flag emoji are pretty special snowflakes. For example, 🇪🇸 is the Spanish flag. It’s made up of two regional indicator characters for “E” and “S”.

Unicode didn’t want to deal with adding new flags each time a new country or territory pops up. Nor did they want to get into the tricky business of determining what a country is, for example when dealing with disputed territories. So instead, they just defined these regional indicator symbols. Fonts are supposed to take pairs of RI symbols7 and map the country code to a flag. This mapping is up to them, so it’s totally valid for a font to render a regional indicator pair “E” + “S” as something other than the flag of Spain. On some Chinese systems, for example, the flag for Taiwan (🇹🇼) may not render.
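
Mapping a country code to its regional indicator pair is just an offset into the regional indicator block. A sketch, assuming an uppercase two-letter ASCII code:

fn flag(country_code: &str) -> String {
    // Regional indicator symbols start at U+1F1E6, which corresponds to 'A'
    country_code
        .chars()
        .map(|c| std::char::from_u32(0x1F1E6 + (c as u32 - 'A' as u32)).unwrap())
        .collect()
}

fn main() {
    assert_eq!(flag("ES"), "🇪🇸");
}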


I highly recommend comparing against this relatively small list of scripts the next time you are writing code that does heavy manipulation of user-provided strings.


  1. Supplemented (but not replaced) by the Hiragana and Katakana phonetic scripts. In widespread use.

  2. Replaced by Hangul in modern usage

  3. Replaced by chữ quốc ngữ in modern usage, which is based on the Latin alphabet

  4. “CJK” (Chinese-Japanese-Korean) is probably more accurate here, though it probably should include “V” for Vietnamese too. Not all of these ideographs come from Han; the other scripts invented some of their own. See: Kokuji, Gukja, Chữ Nôm.

  5. Back in my day we painstakingly typed actual real words on numeric phone keypads, while trudging to 🏫 in three feet of ❄️️, and it was uphill both ways, and we weren’t even allowed 📱s in 🏫. Get off my lawn!

  6. We previously had individual code points for professions and stuff and they decided to switch over to using existing object emoji with combiners instead of inventing new profession emoji all the time

  7. 676 countries should be enough for anybody

Let’s Stop Ascribing Meaning to Code Points

Update: This post got a sequel, Breaking our latin-1 assumptions.

I’ve seen misconceptions about Unicode crop up regularly in posts discussing it. One very common misconception I’ve seen is that code points have cross-language intrinsic meaning.

It usually comes up when people are comparing UTF8 and UTF32. Folks start implying that code points mean something, and that O(1) indexing or slicing at code point boundaries is a useful operation. I’ve also seen this assumption manifest itself in actual programs which make incorrect assumptions about the nature of code points and mess things up when fed non-Latin text.

If you like reading about unicode, you might also want to go through Eevee’s article on the dark corners of unicode. Great read!

Encodings

So, anyway, we have some popular encodings for Unicode. UTF8 encodes 7-bit code points as a single byte, 11-bit code points as two bytes, 16-bit code points as 3 bytes, and 21-bit code points as four bytes. UTF-16 encodes the first three in two bytes, and the last one as four bytes (logically, a pair of two-byte code units). UTF-32 encodes all code points as 4-byte code units. UTF-16 is mostly a “worst of both worlds” compromise at this point, and the main programming language I can think of that uses it (and exposes it in this form) is Javascript, and that too in a broken way.

The nice thing about UTF8 is that it saves space. Of course, that is subjective and dependent on the script you use most commonly; for example, my first name is 12 bytes in UTF-8 but only 4 in ISCII (or a hypothetical unicode-based encoding that swapped the Devanagari Unicode block with the ASCII block). It also uses more space than the very non-hypothetical UTF-16 encoding if you tend to use code points in the U+0800 - U+FFFF range. It always uses less space than UTF-32, however.
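
The tradeoff is easy to see directly:

fn main() {
    let name = "मनीष";
    assert_eq!(name.len(), 12);               // bytes in UTF-8
    assert_eq!(name.chars().count(), 4);      // code points
    assert_eq!(4 * name.chars().count(), 16); // bytes it would take in UTF-32
    assert_eq!("hello".len(), 5);             // ASCII text stays one byte per code point
}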

A commonly touted disadvantage of UTF-8 is that string indexing is O(n). Because code points take up a variable number of bytes, you won’t know where the 5th codepoint is until you scan the string and look for it. UTF-32 doesn’t have this problem; it’s always 4 * index bytes away.

The problem here is that indexing by code point shouldn’t be an operation you ever need!

Indexing by code point

The main time you want to be able to index by code point is if you’re implementing algorithms defined in the unicode spec that operate on unicode strings (casefolding, segmentation, NFD/NFC). Most if not all of these algorithms operate on whole strings, so implementing them as an iteration pass is usually necessary anyway; you don’t lose anything if you can’t do arbitrary code point indexing.

But for application logic, dealing with code points doesn’t really make sense. This is because code points have no intrinsic meaning. They are not “characters”. I’m using scare quotes here because a “character” isn’t a well-defined concept either, but we’ll get to that later.

For example, “é” is two code points (an “e” followed by the combining acute accent). My name, “मनीष”, visually looks like three “characters”, but is four code points. The “नी” is made up of न + ी. My last name contains a “character” made up of three code points (and multiple two-code-point “characters”). The flag emoji “🇺🇸” is also made of two code points, 🇺 + 🇸.

One false assumption that’s often made is that code points are a single column wide. They’re not. They sometimes bunch up to form characters that fit in single “columns”. This is often dependent on the font, and if your application relies on this, you should be querying the font. There are even code points like U+FDFD (﷽) which are often rendered multiple columns wide. In fact, in my monospace font in my text editor, that character is rendered almost 12 columns wide. Yes, “almost”, subsequent characters get offset a tiny bit. I don’t know why.

Another false assumption is that editing actions (selection, backspace, cut, paste) operate on code points. In both Chrome and Firefox, selection will often include multiple code points. All the multi-code-point examples I gave above fall into this category. An interesting testcase for this is the string “ᄀᄀᄀ각ᆨᆨ”, which will rarely if ever render as a single “character” but will be considered as one for the purposes of selection, pretty much universally. I’ll get to why this is later.

Backspace can gobble multiple code points at once too, but the heuristics are different. The reason behind this is that backspace needs to mirror the act of typing, and while typing sometimes constructs multi-codepoint characters, backspace decomposes it piece by piece. In cases where a multi-codepoint “character” can be logically decomposed (e.g. “letter + accent”), backspace will decompose it, by removing the accent or whatever. But some multi-codepoint characters are not “constructions” of general concepts that should be exposed to the user. For example, a user should never need to know that the “🇺🇸” flag emoji is made of 🇺 + 🇸, and hitting backspace on it should delete both codepoints. Similarly, variation selectors and other such code points shouldn’t be treated as their own unit when backspacing.

On my Mac most builtin apps (which I presume use the OSX UI toolkits) seem to use the same heuristics that Firefox/Chrome use for selection for both selection and backspace. While the treatment of code points in editing contexts is not consistent, it seems like applications consistently do not consider code points as “editing units”.

Now, it is true that you often need some way to index a string. For example, if you have a large document and need to represent a slice of it. This could be a user selection, or something delimited by markup. Basically, you’ve already gone through the document and have a section you want to be able to refer to later without copying it out.

However, you don’t need code point indexing here, byte indexing works fine! UTF8 is designed so that you can check if you’re on a code point boundary even if you just byte-index directly. It does this by restricting the kinds of bytes allowed. One-byte code points never have the high bit set (ASCII). All other code points have the high bit set in each byte. The first byte of multibyte codepoints always starts with a sequence that specifies the number of bytes in the codepoint, and such sequences can’t be found in the lower-order bytes of any multibyte codepoint. You can see this visually in the table here. The upshot of all this is that you just need to check the current byte if you want to be sure you’re on a codepoint boundary, and if you receive an arbitrarily byte-sliced string, you will not mistake it for something else. It’s not possible to have a valid code point be a subslice of another, or form a valid code point by subslicing a sequence of two different ones by cutting each in half.

So all you need to do is keep track of the byte indices, and use them for slicing it later.
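
A quick Rust illustration: byte indices are what the string APIs hand out, and checking that an index lies on a code point boundary is a constant-time operation:

fn main() {
    let s = "मनीष rocks";
    let space = s.find(' ').unwrap();   // a byte index (12 here)
    assert_eq!(&s[..space], "मनीष");    // O(1) slicing by byte index
    assert!(s.is_char_boundary(space)); // O(1) boundary check
    assert!(!s.is_char_boundary(1));    // byte 1 is inside the first code point
}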

All in all, it’s important to always remember that “code point” doesn’t have intrinsic meaning. If you need to do a segmentation operation on a string, find out what exactly you’re looking for, and what concept maps closest to that. It’s rare that “code point” is the concept you’re looking for. In most cases, what you’re looking for instead is “grapheme cluster”.

Grapheme clusters

The concept of a “character” is a nebulous one. Is “각” a single character, or three? How about “नी”? Or “நி”? Or the “👨‍❤️‍👨” emoji1? Or the “👨‍👨‍👧‍👧” family emoji2? Different scripts have different concepts which may not clearly map to the Latin notion of “letter” or our programmery notion of “character”.

Unicode itself gives the term “character” multiple incompatible meanings, and as far as I know doesn’t use the term in any normative text.

Often, you need to deal with what is actually displayed to the user. A lot of terminal emulators do this wrong, and end up messing up cursor placement. I used to use irssi-xmpp to keep my Facebook and Gchat conversations in my IRC client, but I eventually stopped as I was increasingly chatting in Marathi or Hindi and I prefer using the actual script over romanizing3, and it would just break my terminal4. Also, they got rid of the XMPP bridge but I’d already cut down on it by then.

So sometimes, you need an API querying what the font is doing. Generally, when talking about the actual rendered image, the term “glyph” or “glyph image” is used.

However, you can’t always query the font. Text itself exists independent of rendering, and sometimes you need a rendering-agnostic way of segmenting it into “characters”.

For this, Unicode has a concept of “grapheme cluster”. There’s also “extended grapheme cluster” (EGC), which is basically an updated version of the concept. In this post, whenever I use the term “grapheme cluster”, I am talking about EGCs.

The term is defined and explored in UAX #29. It starts by pinning down the still-nebulous concept of “user-perceived character” (“a basic unit of a writing system for a language”), and then declares the concept of a “grapheme cluster” to be an approximation to this notion that we can determine programmatically.

A rough definition of grapheme cluster is a “horizontally segmentable unit of text”.

The spec goes into detail as to the exact algorithm that segments text at grapheme cluster boundaries. All of the examples I gave in the first paragraph of this section are single grapheme clusters. So is “ᄀᄀᄀ각ᆨᆨ” (or “ᄀᄀᄀ각ᆨᆨ”), which apparently is considered a single syllable block in Hangul even though it is not of the typical form of leading consonant + vowel + optional tail consonant, and is not something you’d see in modern Korean. The spec explicitly talks of this case so it seems to be on purpose. I like this string because nothing I know of renders it as a single glyph; so you can easily use it to tell if a particular segmentation-aware operation uses grapheme clusters for segmentation. If you try to select it, in most browsers you will be forced to select the whole thing, but backspace will delete the jamos one by one. For the second string, backspace will decompose the core syllable block too (in the first string the syllable block 각 is “precomposed” as a single code point; in the second one I built it using combining jamos).

Basically, unless you have very specific requirements or are able to query the font, use an API that segments strings into grapheme clusters wherever you need to deal with the notion of “character”.

Language defaults

Now, a lot of languages by default are now using Unicode-aware encodings. This is great. It gets rid of the misconception that characters are one byte long.

But it doesn’t get rid of the misconception that user-perceived characters are one code point long.

There are only two languages I know of which handle this well: Swift and Perl 6. I don’t know much about Perl 6’s thing so I can’t really comment on it, but I am really happy with what Swift does:

In Swift, the Character type is an extended grapheme cluster. This does mean that a character itself is basically a string, since EGCs can be arbitrarily many code points long.

All the APIs by default deal with EGCs. The length of a string is the number of EGCs in it. They are indexed by EGC. Iteration yields EGCs. The default comparison algorithm uses unicode canonical equivalence, which I think is kind of neat. Of course, APIs that work with code points are exposed too, you can iterate over the code points using .unicodeScalars.

The internal encoding itself is … weird (and as far as I can tell not publicly exposed), but as a higher level language I think it’s fine to do things like that.

I strongly feel that languages should be moving in this direction, having defaults involving grapheme clusters.

Rust, for example, gets a lot of things right – it has UTF-8 strings. It internally uses byte indices in slices. Explicit slicing usually uses byte indices too, and will panic if out of bounds. The non-O(1) methods are all explicit, since you will use an iterator to perform the operation (E.g. .chars().nth(5)). This encourages people to think about the cost, and it also encourages people to coalesce the cost with nearby iterations – if you are going to do multiple O(n) things, do them in a single iteration! Rust chars represent code points. .char_indices() is a useful string iteration method that bridges the gap between byte indexing and code points.

However, while the documentation does mention grapheme clusters, the stdlib is not aware of the concept of grapheme clusters at all. The default “fundamental” unit of the string in Rust is a code point, and the operations revolve around that. If you want grapheme clusters, you may use unicode-segmentation.
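
With that crate, the name example from earlier in this post looks like this:

use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let s = "मनीष";
    assert_eq!(s.chars().count(), 4); // four code points
    let graphemes: Vec<&str> = s.graphemes(true).collect(); // extended grapheme clusters
    assert_eq!(graphemes, ["म", "नी", "ष"]); // three user-perceived characters
}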

Now, Rust is a systems programming language and it just wouldn’t do to have expensive grapheme segmentation operations all over your string defaults. I’m very happy that the expensive O(n) operations are all only possible with explicit acknowledgement of the cost. So I do think that going the Swift route would be counterproductive for Rust. Not that it can anyway, due to backwards compatibility :)

But I would prefer if the grapheme segmentation methods were in the stdlib (they used to be). This is probably not something that will happen, though I should probably push for the unicode crates being moved into the nursery at least.


  1. Emoji may not render as a single glyph depending on the font.

  2. While writing this paragraph I discovered that wrapping text that contains lots of family emoji hangs Sublime. Neat.

  3. It’s possible to make work. You need a good terminal emulator, with the right settings, the right settings in your env vars, the right settings in irssi, and the right settings in screen. I think my current setup works well with non-ascii text but I’m not sure what I did to make it happen.

  4. Part of the reason here is that I just find romanization confusing. There are some standardized ways to romanize which don’t get used much. My friends and I romanize one way, different from the standardizations. My family members romanize things a completely different way and it’s a bit hard to read. Then again, romanization does hide the fact that my spelling in Hindi is atrocious :)

Rust Tidbits: What Is a Lang Item?

Rust is not a simple language. As with any such language, it has many little tidbits of complexity that most folks aren’t aware of. Many of these tidbits are ones which may not practically matter much for everyday Rust programming, but are interesting to know. Others may be more useful. I’ve found that a lot of these aren’t documented anywhere (not that they always should be), and sometimes depend on knowledge of compiler internals or history. As a fan of programming trivia myself, I’ve decided to try writing about these things whenever I come across them. “Tribal Knowledge” shouldn’t be a thing in a programming community; and trivia is fun!

Previously in tidbits: Box is Special

Last time I talked about Box<T> and how it is a special snowflake. Corey asked that I write more about lang items, which are basically all of the special snowflakes in the stdlib.

So what is a lang item? Lang items are a way for the stdlib (and libcore) to define types, traits, functions, and other items which the compiler needs to know about.

For example, when you write x + y, the compiler will effectively desugar that into Add::add(x, y)1. How did it know what trait to call? Did it just insert a call to ::core::Add::add and hope the trait was defined there? This is what C++ does; the Itanium ABI spec expects functions of certain names to just exist, which the compiler is supposed to call in various cases. The __cxa_guard_* functions from C++’s deferred-initialization local statics (which I’ve explored in the past) are an example of this. You’ll find that the spec is full of similar __cxa functions. While the spec just expects certain types, e.g. std::type_traits (“Type properties” § 20.10.4.3), to be magic and exist in certain locations, the compilers seem to implement them using intrinsics like __is_trivial<T> which aren’t defined in C++ code at all. So C++ compilers have a mix of solutions here: they partly insert calls to known ABI functions, and they partly implement “special” types via intrinsics which are detected and magicked when the compiler comes across them.

However, this is not Rust’s solution. It does not care what the Add trait is named or where it is placed. Instead, it knew where the trait for addition was located because we told it. When you put #[lang = "add"] on a trait, the compiler knows to call YourTrait::add(x, y) when it encounters the addition operator. Of course, usually the compiler will already have been told about such a trait since libcore is usually the first library in the pipeline. If you want to actually use this, you need to replace libcore.

Huh? You can’t do that, can you?

It’s not a big secret that you can compile rust without the stdlib using #![no_std]. This is useful in cases when you are on an embedded system and can’t rely on an allocator existing. It’s also useful for writing your own alternate stdlib, though that’s not something folks do often. Of course, libstd itself uses #![no_std], because without it the compiler will happily inject an extern crate std while trying to compile libstd and the universe will implode.

What’s less known is that you can do the same thing with libcore, via #![no_core]. And, of course, libcore uses it to avoid the cyclic dependency. Unlike #![no_std], no_core is a nightly-only feature that we may never stabilize2. #![no_core] is something that’s basically only to be used if you are libcore (or you are an alternate Rust stdlib/core implementation trying to emulate it).

Still, it’s possible to write a working Rust binary in no_core mode:

#![feature(no_core)]
#![feature(lang_items)]

// Look at me.
// Look at me.
// I'm the libcore now.
#![no_core]

// Tell the compiler to link to appropriate runtime libs
// (This way I don't have to specify `-l` flags explicitly)
#[cfg(target_os = "linux")]
#[link(name = "c")]
extern {}
#[cfg(target_os = "macos")]
#[link(name = "System")]
extern {}

// Compiler needs these to proceed
#[lang = "sized"]
pub trait Sized {}
#[lang = "copy"]
pub trait Copy {}

// `main` isn't the actual entry point, `start` is.
#[lang = "start"]
fn start(_main: *const u8, _argc: isize, _argv: *const *const u8) -> isize {
    // we can't really do much in this benighted hellhole of
    // an environment without bringing in more libraries.
    // We can make syscalls, segfault, and set the exit code.
    // To be sure that this actually ran, let's set the exit code.
    42
}

// still need a main unless we want to use `#![no_main]`
// won't actually get called; `start()` is supposed to call it
fn main() {}

If you run this, the program will exit with exit code 42.

Note that this already adds two lang items. Sized and Copy. It’s usually worth looking at the lang item in libcore and copying it over unless you want to make tweaks. Beware that tweaks may not always work; not only does the compiler expect the lang item to exist, it expects it to make sense. There are properties of the lang item that it assumes are true, and it may cause assertions in the code. In this case I do have a tweak, since the original definition of Copy is pub trait Copy: Clone {}, but I know that this tweak will work.

Lang items are usually only required when you do an operation which needs them. There are 72 non-deprecated lang items and we only had to define three of them here. “start” is necessary to, well, start executables, and Copy/Sized are very crucial to how the compiler reasons about types and must exist.

But let’s try doing something that will trigger a lang item to be required:

pub static X: u8 = 1;

Rust will immediately complain:

$ rustc test.rs
error: requires `sync` lang_item

This is because Rust wants to enforce that types in statics (which can be accessed concurrently) are safe when accessed concurrently, i.e., they implement Sync. We haven’t defined Sync yet, so Rust doesn’t know how to enforce this restriction. The Sync trait is defined with the “sync” lang item, so we need to do:

pub static X: u8 = 1;

#[lang = "sync"]
pub unsafe trait Sync {}
unsafe impl Sync for u8 {}

Note that the trait doesn’t have to be called Sync here, any trait name would work. This definition is also a slight departure from the one in the stdlib, and in general you should include the auto trait impl (instead of specifically using unsafe impl Sync for u8 {}) since the compiler may assume it exists. Our code is small enough for this to not matter.

Alright, let’s try defining our own addition trait as before. First, let’s see what happens if we try to add a struct when addition isn’t defined:

struct Foo;
#[lang = "start"]
fn start(_main: *const u8, _argc: isize, _argv: *const *const u8) -> isize {
    Foo + Foo
}

We get an error:

$ rustc test.rs
error[E0369]: binary operation `+` cannot be applied to type `Foo`
  --> test.rs:33:5
   |
33 |     Foo + Foo
   |     ^^^
   |
note: an implementation of `std::ops::Add` might be missing for `Foo`
  --> test.rs:33:5
   |
33 |     Foo + Foo
   |     ^^^

error: aborting due to previous error

It is interesting to note that here the compiler did refer to Add by its path. This is because the diagnostics in the compiler are free to assume that libcore exists. However, the actual error just noted that it doesn’t know how to add two Foos. But we can tell it how!

#[lang = "add"]
trait MyAdd<RHS> {
    type Output;
    fn add(self, other: RHS) -> Self::Output;
}

impl MyAdd<Foo> for Foo {
    type Output = isize;
    fn add(self, other: Foo) -> isize {
        return 42;
    }
}

struct Foo;
#[lang = "start"]
fn start(_main: *const u8, _argc: isize, _argv: *const *const u8) -> isize {
    Foo + Foo
}

This will compile fine and the exit code of the program will be 42.

An interesting bit of behavior is what happens if we try to add two numbers. It will give us the same kind of error, even though I mentioned before that the addition of concrete primitives doesn’t go through Add::add. However, any addition operation still checks if Add::add is implemented, even though it won’t get used in the case of a primitive. We can even verify this!

#[lang = "add"]
trait MyAdd<RHS> {
    type Output;
    fn add(self, other: RHS) -> Self::Output;
}

impl MyAdd<isize> for isize {
    type Output = isize;
    fn add(self, other: isize) -> isize {
        self + other + 50
    }
}

struct Foo;
#[lang = "start"]
fn start(_main: *const u8, _argc: isize, _argv: *const *const u8) -> isize {
    40 + 2
}

This will need to be compiled with -C opt-level=2, since numeric addition in debug mode panics on wrap and we haven’t defined the "panic" lang item to teach the compiler how to panic.

It will exit with 42, not 92, since while the Add implementation is required for this to type check, it doesn’t actually get used.


So what lang items are there, and why are they lang items? There’s a big list in the compiler. Let’s go through them:

The ImplItem ones (core) are used to mark implementations on primitive types. char has some methods, and someone has to say impl char to define them. But coherence only allows us to impl methods on types defined in our own crate, and char isn’t defined … in any crate, so how do we add methods to it? #[lang = "char"] provides an escape hatch; applying that to impl char will allow you to break the coherence rules and add methods, as is done in the standard library. Since lang items can only be defined once, only a single crate gets the honor of adding methods to char, so we don’t have any of the issues that arise from sidestepping coherence.
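
For illustration only, the escape hatch looks roughly like this (the method here is made up, and this only works in libcore or under #![no_core]; it’s a sketch of the mechanism, not code you can use in an ordinary crate):

// Hypothetical sketch: the lang item lets this impl block attach inherent
// methods to the primitive `char` type, sidestepping the coherence rules.
#[lang = "char"]
impl char {
    pub fn is_ascii_digit_ish(self) -> bool {
        '0' <= self && self <= '9'
    }
}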

There are a bunch for the marker traits (core):

  • Send is a lang item because you are allowed to use it in a + bound in a trait object (Box<SomeTrait+Send+Sync>), and the compiler caches it aggressively
  • Sync is a lang item for the same reasons as Send, but also because the compiler needs to enforce its implementation on types used in statics
  • Copy is fundamental to classifying values and reasoning about moves/etc, so it needs to be a lang item
  • Sized is also fundamental to reasoning about which values may exist on the stack. It is also magically included as a bound on generic parameters unless excluded with ?Sized
  • Unsize is implemented automatically on types using a specific set of rules (listed in the nomicon). Unlike Send and Sync, this mechanism for autoimplementation is tailored for the use case of Unsize and can’t be reused on user-defined marker traits.

Drop is a lang item (core) because the compiler needs to know which types have destructors, and how to call these destructors.

CoerceUnsized is a lang item (core) because the compiler is allowed to perform DST coercions (nomicon) when it is implemented.

All of the builtin operators (also Deref and PartialEq/PartialOrd, which are listed later in the file) (core) are lang items because the compiler needs to know what trait to require (and call) when it comes across such an operation.

UnsafeCell is a lang item (core) because it has very special semantics; it prevents certain optimizations. Specifically, Rust is allowed to reorder reads/writes to &mut foo with the assumption that the local variable holding the reference is the only alias allowed to read from or write to the data, and it is allowed to reorder reads from &foo assuming that no other alias writes to it. We tell LLVM that these types are noalias. UnsafeCell<T> turns this optimization off, allowing writes to &UnsafeCell<T> references. This is used in the implementation of interior mutability types like Cell<T>, RefCell<T>, and Mutex<T>.
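
A minimal Cell-like sketch showing the pattern this enables (mutation through a shared reference, with UnsafeCell supplying the opt-out):

use std::cell::UnsafeCell;

struct MyCell<T> {
    value: UnsafeCell<T>,
}

impl<T: Copy> MyCell<T> {
    fn new(value: T) -> Self {
        MyCell { value: UnsafeCell::new(value) }
    }
    fn get(&self) -> T {
        unsafe { *self.value.get() } // read through the raw pointer
    }
    fn set(&self, value: T) {
        unsafe { *self.value.get() = value } // write through a shared reference
    }
}

fn main() {
    let c = MyCell::new(1);
    c.set(2); // mutation through &self
    assert_eq!(c.get(), 2);
}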

The Fn traits (core) are used in dispatching function calls, and can be specified with special syntax sugar, so they need to be lang items. They also get autoimplemented on closures.

The “str_eq” lang item is outdated. It used to specify how to check the equality of a string value against a literal string pattern in a match (match uses structural equality, not PartialEq::eq), however I believe this behavior is now hardcoded in the compiler.

The panic-related lang items (core) exist because rustc itself inserts panics in a few places. The first one, "panic", is used for integer overflow panics in debug mode, and "panic_bounds_check" is used for out of bounds indexing panics on slices. The last one, "panic_fmt" hooks into a function defined later in libstd.

The "exchange_malloc" and "box_free" (alloc) are for telling the compiler which functions to call in case it needs to do a malloc() or free(). These are used when constructing Box<T> via placement box syntax and when moving out of a deref of a box.

"strdup_uniq" seemed to be used in the past for moving string literals to the heap, but is no longer used.

We’ve already seen the start lang item (std) being used in our minimal example program. This function is basically where you find Rust’s “runtime”: it gets called with a pointer to main and the command line arguments, it sets up the “runtime”, calls main, and tears down anything it needs to. Rust has a C-like minimal runtime, so the actual libstd definition doesn’t do much. But you theoretically could stick a very heavy runtime initialization routine here.

The exception handling lang items (panic_unwind, in multiple platform-specific modules) specify various bits of the exception handling behavior. These hooks are called during various steps of unwinding: eh_personality is called when determining whether or not to stop at a stack frame or unwind up to the next one. eh_unwind_resume is the routine called when the unwinding code wishes to resume unwinding after calling destructors in a landing pad. msvc_try_filter defines some parameter that MSVC needs in its unwinding code. I don’t understand it, and apparently, neither does the person who wrote it.

The "owned_box" (alloc) lang item tells the compiler which type is the Box type. In my previous post I covered how Box is special; this lang item is how the compiler finds impls on Box and knows what the type is. Unlike the other primitives, Box doesn’t actually have a type name (like bool) that can be used if you’re writing libcore or libstd. This lang item gives Box a type name that can be used to refer to it. (It also defines some, but not all, of the semantics of Box<T>)

The "phantom_data" (core) type itself is allowed to have an unused type parameter, and it can be used to help fix the variance and drop behavior of a generic type. More on this in the nomicon.

The "non_zero" lang item (core) marks the NonZero<T> type, a type which is guaranteed to never contain a bit pattern of only zeroes. This is used inside things like Rc<T> and Box<T> – we know that the pointers in these can/should never be null, so they contain a NonZero<*const T>. When used inside an enum like Option<Rc<T>>, the discriminant (the “tag” value that distinguishes between Some and None) is no longer necessary, since we can mark the None case as the case where the bits occupied by NonZero in the Some case are zero. Beware, this optimization also applies to C-like enums that don’t have a variant corresponding to a discriminant value of zero (unless they are #[repr(C)])

There are also a bunch of deprecated lang items there. For example, NoCopy used to be a struct that could be dropped within a type to make it not implement Copy; in the past Copy implementations were automatic like Send and Sync are today. NoCopy was the way to opt out. There also used to be NoSend and NoSync. CovariantType/CovariantLifetime/etc were the predecessors of PhantomData; they could be used to specify variance relations of a type with its type or lifetime parameters, but you can now do this with providing the right PhantomData, e.g. InvariantType<T> is now PhantomData<Cell<T>>. The nomicon has more on variance. I don’t know why these lang items haven’t been removed (they don’t work anymore anyway); the only consumer of them is libcore so “deprecating” them seems unnecessary. It’s probably an oversight.

Interestingly, Iterator and IntoIterator are not lang items, even though they are used in for loops. Instead, the compiler inserts hardcoded calls to ::std::iter::IntoIterator::into_iter and ::std::iter::Iterator::next, and a hardcoded reference to ::std::option::Option (the paths use core in no_std mode). This is probably because the compiler desugars for loops before type resolution is done, so without this, libcore would not be able to use for loops, since the compiler wouldn’t know what calls to insert in place of the loops while compiling.
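
The desugaring looks roughly like this (hand-waving the exact form the compiler generates):

fn main() {
    // `for x in vec![1, 2, 3] { ... }` becomes, approximately:
    let mut iter = ::std::iter::IntoIterator::into_iter(vec![1, 2, 3]);
    loop {
        match ::std::iter::Iterator::next(&mut iter) {
            ::std::option::Option::Some(x) => {
                // the loop body goes here
                println!("{}", x);
            }
            ::std::option::Option::None => break,
        }
    }
}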


Basically, whenever the compiler needs to treat an item specially – whether that means dispatching calls to functions and trait methods in various situations, conferring special semantics to types/traits, or requiring traits to be implemented – the item will be defined in the standard library (libstd, libcore, or one of the crates behind the libstd façade) and marked as a lang item.

Some of the lang items are useful/necessary when working without libstd. Most only come into play if you want to replace libcore, which is a pretty niche thing to do, and knowing about them is rarely useful outside of the realm of compiler hacking.

But, like with the Box<T> madness, I still find this quite interesting, even if it isn’t generally useful!


  1. Though as we learned in the previous post, when x and y are known numeric types it will bypass the trait and directly generate an add instruction in LLVM

  2. To be clear, I’m not aware of any plans to eventually stabilize this. It’s something that could happen.

Rust Tidbits: Box Is Special

Rust is not a simple language. As with any such language, it has many little tidbits of complexity that most folks aren’t aware of. Many of these tidbits are ones which may not practically matter much for everyday Rust programming, but are interesting to know. Others may be more useful. I’ve found that a lot of these aren’t documented anywhere (not that they always should be), and sometimes depend on knowledge of compiler internals or history. As a fan of programming trivia myself, I’ve decided to try writing about these things whenever I come across them. “Tribal Knowledge” shouldn’t be a thing in a programming community; and trivia is fun!


So. Box<T>. Your favorite heap allocation type that nobody uses1.

I was discussing some stuff on the rfcs repo when @burdges realized that Box<T> has a funky Deref impl.

Let’s look at it:

#[stable(feature = "rust1", since = "1.0.0")]
impl<T: ?Sized> Deref for Box<T> {
    type Target = T;

    fn deref(&self) -> &T {
        &**self
    }
}

#[stable(feature = "rust1", since = "1.0.0")]
impl<T: ?Sized> DerefMut for Box<T> {
    fn deref_mut(&mut self) -> &mut T {
        &mut **self
    }
}

Wait, what? Squints

fn deref(&self) -> &T {
    &**self
}

The call is coming from inside the house!

In case you didn’t realize it, this deref impl returns &**self – since self is an &Box<T>, dereferencing it once will provide a Box<T>, and the second dereference will dereference the box to provide a T. We then wrap it in a reference and return it.

But wait, we are defining how a Box<T> is to be dereferenced (that’s what Deref::deref is for!), such a definition cannot itself dereference a Box<T>! That’s infinite recursion.

And indeed. For any other type such a deref impl would recurse infinitely. If you run this code:

use std::ops::Deref;

struct LolBox<T>(T);

impl<T> Deref for LolBox<T> {
    type Target = T;
    fn deref(&self) -> &T {
        &**self
    }
}

the compiler will warn you:

warning: function cannot return without recurring, #[warn(unconditional_recursion)] on by default
 --> <anon>:7:5
  |
7 |     fn deref(&self) -> &T {
  |     ^
  |
note: recursive call site
 --> <anon>:8:10
  |
8 |         &**self
  |          ^^^^^^
  = help: a `loop` may express intention better if this is on purpose

Actually trying to dereference the type will lead to a stack overflow.

Clearly something is fishy here. This deref impl is similar to the deref impl for &T, or the Add impl for number types, or any other of the implementations of operators on primitive types. For example we literally define Add on two integers to be their addition. The reason these impls need to exist is so that people can still call Add::add if they need to in generic code and be able to pass integers to things with an Add bound. But the compiler knows how to use builtin operators on numbers and dereference borrowed references without these impls. But those are primitive types which are defined in the compiler, while Box<T> is just a regular smart pointer struct, right?

Turns out, Box<T> is special. It, too, is somewhat of a primitive type.

This is partly due to historical accident.

To understand this, we must look back to Ye Olde days of pre-1.0 Rust (ca 2014). Back in these days, we had none of this newfangled “stability” business. The compiler broke your code every two weeks. Of course, you wouldn’t know that because the compiler would usually crash before it could tell you that your code was broken! Sigils roamed the lands freely, and cargo was but a newborn child which was destined to eventually end the tyranny of Makefiles. People were largely happy knowing that their closures were safely boxed and their threads sufficiently green.

Back in these days, we didn’t have Box<T>, Vec<T>, or String. We had ~T, ~[T], and ~str. The second two are not equivalent to Box<[T]> and Box<str>, even though they may look like it; they are growable containers like Vec<T> and String. ~ conceptually meant “owned”, though IMO that caused more confusion than it was worth.

You created a box using the ~ operator, e.g. let x = ~1;. It could be dereferenced with the * operator, and autoderef worked much like it does today.

As a “primitive” type, like all primitive types, ~T was special. The compiler knew things about it: it knew how to dereference it without an explicit Deref impl. In fact, the Deref traits came into existence much after ~T did. ~T never got an explicit Deref impl, though it probably should have.

Eventually, there was a move to remove sigils from the language. The box constructor ~foo was superseded by placement box syntax, which still exists in Rust nightly2. Then, the ~T type became Box<T>. (~[T] and ~str would also be removed, though ~str took a very confusing detour with StrBuf first).

However, Box<T> was still special. It no longer needed special syntax to be referred to or constructed, but it was still internally a special type. It didn’t even have a Deref impl yet, that came six months later, and it was implemented as &**self, exactly the same as it is today.

But why does it have to be special now? Rust had all the features it needed (allocations, ownership, overloadable deref) to implement Box<T> in pure rust in the stdlib as if it were a regular type.

Turns out that Rust didn’t. You see, because Box<T> and before it ~T were special, their dereference semantics were implemented in a different part of the code. And, these semantics were not the same as the ones for DerefImm and DerefMut, which were created for use with other smart pointers. I don’t know if the possibility of being used for ~T was considered when DerefImm/DerefMut were being implemented, or if it was a simple oversight, but Box<T> has three pieces of behavior that could not be replicated in pure Rust at the time:

  • box foo in a pattern would destructure a box into its contents. It’s somewhat the opposite of ref
  • box foo() performed placement box, so the result of foo() could be directly written to a preallocated box, reducing extraneous copies
  • You could move out of deref with Box<T>

The third one is the one that really gets to us here3. For a regular type, *foo will produce a temporary that must be immediately borrowed or copied. You cannot do let x = *y for a non-Copy type. This dereference operation will call DerefMut::deref_mut or Deref::deref based on how it gets borrowed. With Box<T>, you can do this:

let x = Box::new(vec![1,2,3,4]);
let y = *x; // moves the vec out into `y`, then deallocates the box
            // but does not call a destructor on the vec

For any other type, such an operation will produce a “cannot move out of a borrow” error.
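
For instance, with a hand-rolled box-like type, moving out of the deref is rejected; you have to borrow or clone instead (a sketch, with the rejected line left as a comment):

use std::ops::Deref;

struct MyBox<T>(T);

impl<T> Deref for MyBox<T> {
    type Target = T;
    fn deref(&self) -> &T {
        &self.0
    }
}

fn main() {
    let x = MyBox(vec![1, 2, 3, 4]);
    // let y = *x; // error: cannot move out of a dereference of `MyBox<Vec<i32>>`
    let y = (*x).clone(); // borrowing through the Deref impl and cloning is fine
    assert_eq!(y, vec![1, 2, 3, 4]);
}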

This operation is colloquially called DerefMove, and there has been an rfc in the past for making it into a trait. I suspect that the DerefMove semantics could even have been removed from Box<T> before 1.0 (I don’t find it necessary), but people had better things to do, like fixing the million other rough edges of the language that can’t be touched after backwards compatibility is a thing.

So now we’re stuck with it. The current status is that Box<T> is still a special type in the compiler. By “special type” I don’t just mean that the compiler treats it a bit differently (this is true for any lang item), I mean that it literally is treated as a completely new kind of type, not as a struct the way it has been defined in liballoc. There’s a TON of cruft in the compiler related to this type, much of which can be removed, but some of which can’t. If we ever do get DerefMove, we should probably try removing it all again. After writing this post I’m half-convinced to try and implement an internal-use-only DerefMove and try cleaning up the code myself.

Most of this isn’t really useful to know unless you actually come across a case where you can make use of DerefMove semantics, or if you work on the compiler. But it certainly is interesting!

Next post: What is a lang item?


  1. Seriously though, does anyone use it much? I’ve only seen it getting used for boxed DSTs (trait objects and boxed slices), which themselves are pretty rare, for sending heap types over FFI, recursive types (rare), and random special cases. I find this pretty interesting given that other languages are much more liberal with non-refcounted single-element allocation.

  2. It will probably eventually be replaced or made equivalent to the <- syntax before stabilizing

  3. It’s easier to special case the first two, much like how for loops are aware of the iterator trait without the iterator trait being extremely special cased

Reflections on Rusting Trust

The Rust compiler is written in Rust. This is overall a pretty common practice in compiler development. This usually means that the process of building the compiler involves downloading a (typically) older version of the compiler.

This also means that the compiler is vulnerable to what is colloquially known as the “Trusting Trust” attack, an attack described in Ken Thompson’s acceptance speech for the 1983 Turing Award. This kind of thing fascinates me, so I decided to try writing one myself. It’s stuff like this which started my interest in compilers, and I hope this post can help get others interested the same way.

To be clear, this isn’t an indictment of Rust’s security. Quite a few languages out there have popular self-hosted compilers (C, C++, Haskell, Scala, D, Go) and are vulnerable to this attack. For this attack to have any effect, one needs to be able to uniformly distribute this compiler, and there are roughly equivalent ways of doing the same level of damage with that kind of access.

If you already know what a trusting trust attack is, you can skip the next section. If you just want to see the code, it’s in the trusting-trust branch on my Rust fork, specifically this code.

The attack

The essence of the attack is this:

An attacker can conceivably change a compiler such that it can detect a particular kind of application and make malicious changes to it. The example given in the talk was the UNIX login program — the attacker can tweak a compiler so as to detect that it is compiling the login program, and compile in a backdoor that lets it unconditionally accept a special password (created by the attacker) for any user, thereby giving the attacker access to all accounts on all systems that have login compiled by their modified compiler.

However, this change would be detected in the source. If it was not included in the source, this change would disappear in the next release of the compiler, or when someone else compiles the compiler from source. Avoiding this attack is easily done by compiling your own compilers and not downloading untrusted binaries. This is good advice in general regarding untrusted binaries, and it equally applies here.

To counter this, the attacker can go one step further. If they can tweak the compiler so as to backdoor login, they could also tweak the compiler so as to backdoor itself. The attacker needs to modify the compiler with a backdoor which detects when it is compiling the same compiler, and introduces itself into the compiler that it is compiling. On top of this it can also introduce backdoors into login or whatever other program the attacker is interested in.

Now, in this case, even if the backdoor is removed from the source, every compiler compiled using this backdoored compiler will be similarly backdoored. So if this backdoored compiler somehow starts getting distributed, it will spread itself as it is used to compile more copies of itself (e.g. newer versions, etc). And it will be virtually undetectable — since the source doesn’t need to be modified for it to work; just the non-human-readable binary.

Of course, there are ways to protect against this. Ultimately, before a compiler for language X existed, that compiler had to be written in some other language Y. If you can track the sources back to that point, you can bootstrap a working compiler from scratch and keep compiling newer compiler versions till you reach the present. This raises the question of whether or not Y’s compiler is backdoored. While it sounds pretty unlikely that such a backdoor could be so robust as to work on two different compilers and stay put throughout the history of X, you can of course trace Y back to other languages and so on, till you find a compiler in assembly that you can verify1.

Backdooring Rust

Alright, so I want to backdoor my compiler. I first have to decide where in the pipeline the code that inserts backdoors executes. The Rust compiler operates by taking source code, parsing it into a syntax tree (AST), transforming it into some intermediate representations (HIR and MIR), and feeding it to LLVM in the form of LLVM IR, after which LLVM does its thing and creates binaries. A backdoor can be inserted at any point in this pipeline. To me, it seems like it’s easier to insert one into the AST, because it’s easier to obtain the AST from the source, and this is important as we’ll see soon. It also makes this attack less practically viable2, which is nice since this is just a fun exercise and I don’t actually want to backdoor the compiler.

So the moment the compiler finishes parsing, my code will modify the AST to insert a backdoor.

First, I’ll try to write a simpler backdoor; one which doesn’t affect the compiler but instead affects some programs. I shall write a backdoor that replaces occurrences of the string “hello world” with “जगाला नमस्कार”, a rough translation of the same in my native language.

Now, in rustc, the rustc_driver crate is where the whole process of compiling is coordinated. In particular, phase_2_configure_and_expand is run right after parsing (which is phase 1). Perfect. Within that function, the krate variable contains the parsed AST for the crate3, and we need to modify that.

In this case, there’s already machinery in syntax::fold for mutating ASTs based on patterns. A Folder basically has the ability to walk the AST, producing a mirror AST, with modifications. For each kind of node, you get to specify a function which will produce a node to be used in its place. Most such functions will default to no-op (returning the same node).

So I write the following Folder:

// Understanding the minute details of this code isn't important; it is a bit complex
// since the API used here isn't meant to be used this way. Focus on the comments.

mod trust {
    use syntax::fold::*;
    use syntax::ast::*;
    use syntax::parse::token::InternedString;
    use syntax::ptr::P;
    struct TrustFolder;

    // The trait contains default impls which we override for specific cases
    impl Folder for TrustFolder {
        // every time we come across an expression, run this function
        // on it and replace it with the produced expression in the tree
        fn fold_expr(&mut self, expr: P<Expr>) -> P<Expr> {
            // The peculiar `.map` pattern needs to be used here
            // because of the way AST nodes are stored in immutable
            // `P<T>` pointers. The AST is not typically mutated.
            expr.map(|mut expr| {
                match expr.node {
                    ExprKind::Lit(ref mut l) => {
                        *l = l.clone().map(|mut l| {
                            // look for string literals
                            if let LitKind::Str(ref mut s, _) = l.node {
                                // replace their contents
                                if s == "hello world" {
                                    *s = InternedString::new("जगाला नमस्कार");
                                }
                            }
                            l
                        })
                    }
                    _ => ()
                }
                // recurse down expression with the default fold
                noop_fold_expr(expr, self)
            })
        }
        fn fold_mac(&mut self, mac: Mac) -> Mac {
            // Folders are not typically supposed to operate on pre-macro-expansion ASTs
            // and will by default panic here. We must explicitly specify otherwise.
            noop_fold_mac(mac, self)
        }
    }

    // our entry point
    pub fn fold_crate(krate: Crate) -> Crate {
        // make a folder, fold the crate with it
        TrustFolder.fold_crate(krate)
    }
}

I invoke it by calling let krate = trust::fold_crate(krate); as the first line of phase_2_configure_and_expand.

I create a stage 1 build4 of rustc (make rustc-stage1). I’ve already set up rustup to have a “stage1” toolchain pointing to this folder (rustup toolchain link stage1 /path/to/rust/target_triple/stage1), so I can easily test this new compiler:

// test.rs
fn main() {
    let x = "hello world";
    println!("{}", x);
}
$ rustup run stage1 rustc test.rs
$ ./test
जगाला नमस्कार

Note that I had the string on a separate line instead of directly doing println!("hello world"). This is because our backdoor isn’t perfect; it applies to the pre-expansion AST. In that AST, println! is stored as a macro, and the "hello world" is part of the macro token tree that has not yet been turned into an expression. Our folder ignores it. It is not too hard to perform this same attack post-expansion, however.

So far, so good. We have a compiler that tweaks “hello world” strings. Now, let’s see if we can get it to miscompile itself. This means that our compiler, when compiling a pristine Rust source tree, should produce a compiler that is similarly backdoored (with the trust module and the trust::fold_crate() call).

We need to tweak our folder so that it does two things:

  • Inserts the let krate = trust::fold_crate(krate); statement in the appropriate function (phase_2_configure_and_expand) when compiling a pristine Rust source tree
  • Inserts the trust module

The former is relatively easy. We need to construct an AST for that statement (can be done by invoking the parser again and extracting the node). The latter is where it gets tricky. We can encode instructions for outputting the AST of the trust module, but these instructions themselves are within the same module, so the instructions for outputting these instructions need to be included, and so on. This clearly isn’t viable.

However, there’s a way around this. It’s a common trick used in writing quines, which face similar issues. The idea is to put the entire block of code in a string. We then construct the code for the module by doing something like

mod trust {
    static SELF_STRING: &'static str = "/* stringified contents of this module except for this line */";
    // ..
    fn fold_mod(..) {
        // ..
        // this produces a string that is the same as the code for the module containing it
        // SELF_STRING is used twice, once to produce the string literal for SELF_STRING, and
        // once to produce the code for the module
        let code_for_module = "mod trust { static SELF_STRING: &'static str = \"" + SELF_STRING + "\";" + SELF_STRING + "}";
        insert_into_crate(code_for_module);
        // ..
    }
    // ..
}

With the code of the module entered in, this will look something like

mod trust {
    static SELF_STRING: &'static str = "
        // .. 
        fn fold_mod(..) {
            // ..
            // this produces a string that is the same as the code for the module containing it
            // SELF_STRING is used twice, once to produce the string literal for SELF_STRING, and
            // once to produce the code for the module
            let code_for_module = \"mod trust { static SELF_STRING: &'static str = \\\"\" + SELF_STRING + \"\\\";\" + SELF_STRING + \"}\";
            insert_into_crate(code_for_module);
            // ..
        }
        // ..
    ";

    // ..
    fn fold_mod(..) {
        // ..
        // this produces a string that is the same as the code for the module containing it
        // SELF_STRING is used twice, once to produce the string literal for SELF_STRING, and
        // once to produce the code for the module
        let code_for_module = "mod trust { static SELF_STRING: &'static str = \"" + SELF_STRING + "\";" + SELF_STRING + "}";
        insert_into_crate(code_for_module);
        // ..
    }
    // ..
}

So you have a string containing the contents of the module, except for itself. You build the code for the module by using the string twice – once to construct the code for the declaration of the string, and once to construct the code for the rest of the module. Now, by parsing this, you’ll get the original AST!
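To see the trick in isolation, here’s a tiny standalone program (completely independent of rustc; the placeholder string is mine) that rebuilds the module source from a stand-in SELF_STRING the same way:

fn main() {
    // stand-in for the real SELF_STRING, which would hold the module body
    let self_string = "/* everything in the module after this declaration */";
    // use the string twice: once to regenerate its own declaration,
    // once to regenerate the rest of the module
    let code_for_module = format!(
        "mod trust {{ static SELF_STRING: &'static str = \"{}\";{}}}",
        self_string, self_string
    );
    println!("{}", code_for_module);
}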

Let’s try this step by step. Let’s first see if injecting an arbitrary string (use foo::bar::blah) works, without worrying about this cyclical quineyness:

mod trust {
    // dummy string just to see if it gets injected
    // inserting the full code of this module has some practical concerns
    // about escaping which I'll address later
    static SELF_STRING: &'static str = "use foo::bar::blah;";
    use syntax::fold::*;
    use syntax::ast::*;
    use syntax::parse::parse_crate_from_source_str;
    use syntax::parse::token::InternedString;
    use syntax::ptr::P;
    use syntax::util::move_map::MoveMap;
    use rustc::session::Session;

    struct TrustFolder<'a> {
        // we need the session to be able to parse things. No biggie.
        sess: &'a Session,
    }

    impl<'a> Folder for TrustFolder<'a> {
        fn fold_expr(&mut self, expr: P<Expr>) -> P<Expr> {
            expr.map(|mut expr| {
                match expr.node {
                    ExprKind::Lit(ref mut l) => {
                        *l = l.clone().map(|mut l| {
                            if let LitKind::Str(ref mut s, _) = l.node {
                                if s == "hello world" {
                                    *s = InternedString::new("जगाला नमस्कार");
                                }
                            }
                            l
                        })
                    }
                    _ => ()
                }
                noop_fold_expr(expr, self)
            })
        }
        fn fold_mod(&mut self, m: Mod) -> Mod {
            // move_flat_map takes a vector, constructs a new one by operating
            // on each element by-move. Again, needed because of `P<T>`
            let new_items = m.items.move_flat_map(|item| {
                // we want to modify this function, and give it a sibling from SELF_STRING
                if item.ident.name.as_str() == "phase_2_configure_and_expand" {
                    // parse SELF_STRING
                    let new_crate = parse_crate_from_source_str("trust".into(),
                                                                SELF_STRING.into(),
                                                                &self.sess.parse_sess).unwrap();
                    // extract the first item contained in it, which is the use statement
                    let inner_item = new_crate.module.items[0].clone();

                    // move_flat_map needs an iterator of items to insert
                    vec![inner_item, item].into_iter()
                } else {
                    vec![item].into_iter()
                }
            });
            let m = Mod {
                inner: m.inner,
                items: new_items,
            };
            noop_fold_mod(m, self)
        }
        fn fold_mac(&mut self, _mac: Mac) -> Mac {
            noop_fold_mac(_mac, self)
        }
    }

    pub fn fold_crate(krate: Crate, sess: &Session) -> Crate {
        let mut folder = TrustFolder {sess: sess};
        folder.fold_crate(krate)
    }
}

We also change the original call in phase_2_configure_and_expand to let krate = trust::fold_crate(krate, sess);

Compiling with make rustc-stage2 (we now want the backdoored stage1 compiler to try and compile the same sources and fudge the phase_2_configure_and_expand function the second time around) gets us this error:

rustc: x86_64-apple-darwin/stage1/lib/rustlib/x86_64-apple-darwin/lib/librustc_driver
error[E0432]: unresolved import `foo::bar::blah`
 --> trust:1:5
  |
1 | use foo::bar::blah;
  |     ^^^^^^^^^^^^^^ Maybe a missing `extern crate foo;`?

error: aborting due to previous error

This is exactly what we expected! We inserted the code use foo::bar::blah;, which isn’t going to resolve, and thus got a failure when compiling the crate the second time around.

Let’s add the code for the quineyness and for inserting the fold_crate call:

fn fold_mod(&mut self, m: Mod) -> Mod {
    let new_items = m.items.move_flat_map(|item| {
        // look for the phase_2_configure_and_expand function
        if item.ident.name.as_str() == "phase_2_configure_and_expand" {
            // construct the code for the module contents as described earlier
            let code_for_module = r###"mod trust { static SELF_STRING: &'static str = r##"###.to_string() + r###"##""### + SELF_STRING + r###""##"### + r###"##;"### + SELF_STRING + "}";
            // Parse it into an AST by creating a crate only containing that code
            let new_crate = parse_crate_from_source_str("trust".into(),
                                                        code_for_module,
                                                        &self.sess.parse_sess).unwrap();
            // extract the AST of the contained module
            let inner_mod = new_crate.module.items[0].clone();

            // now to insert the fold_crate() call
            let item = item.map(|mut i| {
                if let ItemKind::Fn(.., ref mut block) = i.node {
                    *block = block.clone().map(|mut b| {
                        // create a temporary crate just containing a fold_crate call
                        let new_crate = parse_crate_from_source_str("trust".into(),
                                                                    "fn trust() {let krate = trust::fold_crate(krate, sess);}".into(),
                                                                    &self.sess.parse_sess).unwrap();
                        // extract the AST from the parsed temporary crate, shove it in here
                        if let ItemKind::Fn(.., ref blk) = new_crate.module.items[0].node {
                            b.stmts.insert(0, blk.stmts[0].clone());
                        }
                        b
                    });
                }
                i
            });
            // yield both the created module and the modified function to move_flat_map
            vec![inner_mod, item].into_iter()
        } else {
            vec![item].into_iter()
        }
    });
    let m = Mod {
        inner: m.inner,
        items: new_items,
    };
    noop_fold_mod(m, self)
}

The #s let us specify “raw strings” in Rust, where I can freely include other quotation marks without needing to escape things. A raw string delimited by n pound symbols ends at the first quote followed by n pound symbols, so it cannot itself contain that closing sequence. The SELF_STRING is declared with four pound symbols, and the code in the trust module only uses raw strings with three pound symbols. Since the code needs to generate the declaration of SELF_STRING (with four pound symbols), we manually concatenate extra pound symbols on – a four-pound-symbol raw string will not be valid within a three-pound-symbol raw string, since the parser would try to end the outer string early. So we don’t ever directly type a sequence of four consecutive pound symbols in the code, and instead construct it by concatenating two pairs of pound symbols.
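Here’s a tiny standalone illustration of that concatenation trick (not taken from the actual backdoor code):

fn main() {
    // r###"..."### ends at the first `"###` sequence, so the code never writes
    // four consecutive pound symbols directly; it builds the run by concatenation
    let four_pounds = r###"##"###.to_string() + r###"##"###;
    assert_eq!(four_pounds, "####");
}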

Ultimately, the code_for_module declaration really does the same as:

let code_for_module = "mod trust { static SELF_STRING: &'static str = \"" + SELF_STRING + "\";" + SELF_STRING + "}";

conceptually, but also ensures that things stay escaped. I could get similar results by calling into a function that takes a string and inserts literal backslashes at the appropriate points.

To update SELF_STRING, we just need to include all the code inside the trust module after the declaration of SELF_STRING itself inside the string. I won’t include this inline since it’s big, but this is what it looks like in the end.

If we try compiling this code to stage 2 after updating SELF_STRING, we will get errors about duplicate trust modules, which makes sense because we’re actually already compiling an already-backdoored version of the Rust source code. While we could set up two Rust builds, the easiest way to verify if our attack is working is to just use #[cfg(stage0)] on the trust module and the fold_crate call5. These will only get included during “stage 0” (when it compiles the stage 1 compiler6), and not when it compiles the stage 2 compiler, so if the stage 2 compiler still backdoors executables, we’re done.

On building the stage 2 (make rustc-stage2) compiler,

$ rustup run stage2 rustc test.rs
$ ./test
जगाला नमस्कार

I was also able to make it work with a separate clone of Rust:

$ cd /path/to/new/clone
# Tell rustup to use our backdoored stage1 compiler whenever rustc is invoked
# from anywhere inside this folder.
$ rustup override set stage1 # Works with stage 2 as well.

# with --enable-local-rust, instead of the downloaded stage 0 compiler compiling
# stage 0 internal libraries (like libsyntax), the libraries from the local Rust get used. Hence we
# need to check out a git commit close to our changes. This commit is the parent of our changes,
# and is bound to work
$ git checkout bfa709a38a8c607e1c13ee5635fbfd1940eb18b1

# This will make it call `rustc` instead of downloading its own compiler.
# We already overrode rustc to be our backdoored compiler for this folder
# using rustup
$ ./configure --enable-local-rust
# build it!
$ make rustc-stage1
# Tell rustup about the new toolchain
$ rustup toolchain link other-stage1 /path/to/new/clone/target_dir/stage1
$ rustup run other-stage1 rustc test.rs
$ ./test
जगाला नमस्कार

Thus, a pristine copy of the rustc source has built a compiler infected with the backdoor.


So we now have a working trusting trust attack in Rust. What can we do with it? Hopefully nothing! This particular attack isn’t very robust, and while that can be improved upon, building a practical and resilient trusting trust attack that won’t get noticed is a bit trickier.

We in the Rust community should be working on ways to prevent such attacks from being successful, though.

A couple of things we could do are:

  • Work on an alternate Rust compiler (in Rust or otherwise). For a pair of self-hosted compilers, there’s a technique called “Diverse Double-Compiling” wherein you choose an arbitrary sequence of compilers (something like “gcc followed by 3x clang followed by gcc followed by clang”), and compile each compiler with the output of the previous one. The difficulty of writing a backdoor that can survive this process grows exponentially.
  • Try compiling rustc from its OCaml roots, and package up the process into a shell script so that you have reproducible trustworthy rustc builds.
  • Make rustc builds deterministic, which means that a known-trustworthy rustc build can be compared against a suspect one to figure out if it has been tampered with.

Overall, trusting trust attacks aren’t that pressing a concern, since there are many other ways to get approximately equivalent access with the same threat model. Having the ability to insert any backdoor into distributed binaries is bad enough, and should be protected against regardless of whether or not the backdoor is a self-propagating one. If someone had access to the distribution or build servers, for example, they could just as easily insert a backdoor into the server, or place a key so that they can reupload tampered binaries when they want. Now, cleaning up after these attacks is easier than cleaning up after trusting trust, but ultimately this is like comparing being at the epicenter of Little Boy or the Tsar Bomba – one is worse, but you’re atomized regardless, and your mitigation plan shouldn’t need to change.

But it’s certainly an interesting attack, and something we should at least be thinking about.

Thanks to Josh Matthews, Michael Layzell, Diane Hosfelt, Eevee, and Yehuda Katz for reviewing drafts of this post.

Discuss: HN, Reddit


  1. Of course, this raises the question of whether or not your assembler/OS/loader/processor is backdoored. Ultimately, you have to trust someone, which was partly the point of Thompson’s talk.

  2. The AST turns up in the metadata/debuginfo/error messages, can be inspected from the command line, and in general is very far upstream and affects a number of things (all the other stages in the pipeline). You could write code to strip it out from these during inspection and only have it turn up in the binary, but that is much harder.

  3. The local variable is called krate because crate is a keyword.

  4. Stage 1 takes the downloaded (older) Rust compiler and uses it to compile the sources. The stage 2 compiler is built when the stage 1 compiler (which is a “new” compiler) is used to compile the sources again.

  5. Using it on the fold_crate call requires enabling the “attributes on statements” feature, but that’s no big deal – we’re only using the cfgs to be able to test easily; this feature won’t actually be required if we use our stage1 compiler to compile a clean clone of the sources.

  6. The numbering of the stages is a bit confusing. During “stage 0” (cfg(stage0)), the stage 1 compiler is built. Since you are building the stage 1 compiler, the make invocation is make rustc-stage1. Similarly, during stage 1, the stage 2 compiler is built, and the invocation is make rustc-stage2 but you use #[cfg(stage1)] in the code.

GC Support in Rust: API Design

Recently we (Felix, Niko, and I) have been working on getting compiler-level GC support for Rust. The plan is to provide a base set of APIs and intrinsics on which GCs can be built, without including an actual GC itself. This blog post serves as a status update and a pre-pre-RFC on the designs. I’m also going to walk through the process of coming up with the current design. We’ll soon be posting more detailed design docs and discussion about some of the unresolved bits.

The motivation behind this is the same as my motivation for writing rust-gc. Firstly, it makes it possible to integrate with languages which themselves have a GC. Being able to safely pass around GCd types in Rust is very useful when writing libraries for Node, Python, or Ruby in Rust.

Secondly, some algorithms are much neater when a GC is involved. Persistent data structures, for example, are easier to deal with when you have one. There are ways around this requirement, but it’s nice to have the full range of options.

Rust tries to be safe without a GC, and this doesn’t change that — we envision that GCs in Rust will be rarely used except for some very specific use cases like the ones listed above.

Compiler support isn’t strictly necessary for a GC in Rust to be safe. rust-gc manages to work without compiler support (except for a #[derive()] plugin). However, there’s a lot of manual tracking of roots involved, which has a much higher cost than compiler-backed GCs. This is suboptimal — we want GC support to be as efficient as possible.

Design goals

We’re considering GCs designed as a Gc<T> object, which, like Rc<T>, can be explicitly wrapped around a value to move it to the GC heap. A pervasive GC (where every Rust object is GCd) is an explicit non-goal; if you need a GC everywhere a different language may make more sense. We’re expecting Gc<T> to be used only where needed, much like how Rc<T> is today.

We want this to work well with other Rust abstractions. Things like Vec<Gc<T>> should be completely legal, for example.

We want implementors to have total freedom in how Gc<T> is represented – they define the type, not the compiler. The compiler provides traits and intrinsics which can be used to find the GC roots. It should be possible for implementors to provide safe APIs for Gc<T>. There will be no canonical Gc<T> in the stdlib.

We are trying to support multiple GCs in a single binary. This should be a pretty niche thing to need, but it strengthens the behavior of GCs as libraries (and not magical one-time things like custom allocators). One possible use case for this is if a library internally uses a GC to run some algorithm, and this library is used by an application which uses a GC for some other reason (perhaps to talk to Node). Interacting GCs are hard to reason about, though. The current design leaves this decision up to the GC designer — while it is possible to let your GCd object contain objects managed by a different GC, this requires some explicit extra work. Interacting GCs are a very niche use case1, so this ability isn’t something we’re adamant on supporting.

We also would like it to be safe to use trait objects with the GC. This raises some concerns which I’ll address in depth later in this post.

Core design

The core idea is to use LLVM stack maps to keep track of roots.

In a tracing GC, the concept of a “root” is basically something which can be directly reached without going through other GC objects. In our case they will be cases of Gc<T> ending up on the stack or in non-gc heap boxes which themselves are reachable from the stack. Some examples:

struct Foo {
    bar: Gc<Bar>,
}

struct Bar {
    inner: Gc<bool>,
}

// `bar` is a root
let bar = Gc::new(Bar::new());
// `bar.inner` is not a root, since it can't be
// accessed without going through `bar`

// `foo.bar` is a root:
let foo = Foo::new(); // This is a root


// `inner` is not a root, because it is a borrowed reference
let inner = &bar.inner;

// `rooted_bool` is a root, since it is a `Gc<bool>` on the stack
// (cloning has the same behavior as that on `Rc<T>`: it creates a
// new reference to the same value)
let rooted_bool = bar.inner.clone();

// `boxed_bar` is a root. While the Gc<Bar> is not on the stack,
// it can be reached without dereferencing another `Gc<T>`
// or passing through a borrowed reference
let boxed_bar = Box::new(Gc::new(Bar::new()));

When figuring out which objects are live (“tracing”), we need to have this initial set of “roots” which contain the list of things directly reachable from the stack. From here, the GC can rifle through the fields and subfields of the roots till it finds other GCd objects, which it can mark as live and continue the process with.

Most runtimes for GCd languages have efficient ways of obtaining this list of roots. Contrast this with conservative collectors like Boehm, which read in the whole stack and consider anything which looks like a pointer to the GC heap to be a root. rust-gc’s approach is inefficient too, because it incurs an additional reference-counting cost on copying and mutation.

However, the set of possible roots is known at compile time; it’s just a matter of which variables are live at any point. We store this list of live variables in a per-call-site “stack map”. To find all the roots, you walk up the call stack, and for each call site look up its entry in the stack map, which will contain the stack offsets of all the roots (and other metadata if we need it). LLVM has native support for this. The stack map is stored in a separate section, so there is no runtime performance hit during regular execution; however, some optimizations may be inhibited by turning on GC.

So basically a GC will have access to a walk_roots<F>(f: F) where F: FnMut(..) intrinsic that will yield all the roots to the provided function (which can then mark them as such and start tracing).
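To make the intended shape a bit more concrete, here is a rough, self-contained sketch of how a GC library might drive such an intrinsic. All the names and the stubbed-out walk_roots are placeholders of my own; the actual signature hasn’t been decided.

trait Tracer {}
trait Trace {
    fn trace(&self, tracer: &mut dyn Tracer);
}

struct MyTracer; // a GC's tracer, which would mark objects as live
impl Tracer for MyTracer {}

// stand-in for the compiler-provided intrinsic: it would walk the stack maps
// and hand every live root to the callback
fn walk_roots<F: FnMut(&dyn Trace)>(_f: F) {
    // provided by the compiler in the real design
}

// how a collection cycle might begin
fn collect() {
    let mut tracer = MyTracer;
    // mark phase: trace everything reachable from the roots
    walk_roots(|root| root.trace(&mut tracer));
    // a sweep phase would then free unmarked GC objects
}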

I’m not going to focus on the implementation of this intrinsic for this blog post — this might be the subject of a later blog post by Felix who is working on this.

Instead, I’m focusing on the higher-level API.

Identifying rootables

The first problem we come across with the design mentioned above is that the compiler doesn’t yet know how to distinguish between a root and a non-root. We can’t mark every variable as a root; that would bloat the stack maps and make walking the roots a very expensive operation.

A very simple way of doing this is via a trait, Root.

// in libcore

unsafe trait Root {}

// auto-trait, anything containing
// a Root will itself be Root
unsafe impl !Root for .. {}

// references are never roots
unsafe impl<'a, T> !Root for &'a T {}


// in a gc impl

struct Gc<T> {
    // ..
}

unsafe impl<T> Root for Gc<T> {}

If we detect Root objects that are directly reachable, we consider them to be roots.

This has a flaw: it doesn’t actually tell us how to find roots inside container types. What would we do if there was a Box<Gc<T>> or a Vec<Gc<T>> on the stack? We can stick their entry in the stack map, but the GC needs to know what to do with them!

We could store some type information in the map and let the GC hardcode how to root each container type. This isn’t extensible though; the GC will have to be able to handle types from arbitrary crates too. Additionally, we have to solve this problem anyway for tracing — when tracing we need to be able to find all values “contained” within a particular value, which is the same operation we need to do to find roots.

For this purpose, we introduce the Trace trait:

// in libcore
unsafe trait Trace {
    fn trace(&self);
}

// in libcollections
// (or any third-party collections library)
unsafe impl<T: Trace> Trace for Vec<T> {
    fn trace(&self) {
        for i in self {
            i.trace()
        }
    }
}

// in gc library

// only allow trace objects
struct Gc<T: Trace> {
    // ..
}

unsafe impl<T> Trace for Gc<T> {
    fn trace(&self) {
        // mark `self`

        // Don't actually trace contained fields,
        // because there may be cycles and we'd recurse infinitely
    }
}

// in consumer of gc library

// autoderived impl will call `bar.trace()` and `baz.trace()`
#[derive(Trace)]
struct Foo {
    bar: Gc<Bar>,
    baz: SomeType,
}

(These traits are unsafe to implement because an incorrect implementation can lead to a reachable value getting cleaned up by the GC, which is unsafe)

Basically, an implementation of Trace will yield all values owned by the object, unless that object is a GC struct like Gc<T>, in which case the GC implementor will have it mark the object. This way, calling .trace() will walk all fields and subfields of an object recursively, until it finds all of the contained Gc<T>s.

This has an issue with multiple GCs, though — we don’t want the GCs to interact unless they want to, and with the Trace trait being shared one GC object may accidentally contain a different GC object.

We need to introduce the concept of a tracer here.

// in libcore

trait Tracer : Any {}

unsafe trait Trace {
    fn trace(&self, tracer: &mut Tracer);
}


// in libcollections

// impl doesn't care about the tracer
unsafe impl<T: Trace> Trace for Vec<T> {
    fn trace(&self, tracer: &mut Tracer) {
        for i in self {
            i.trace(tracer)
        }
    }
}

// in gc library

struct MyTracer {} // more complicated tracers may have state
impl Tracer for MyTracer {}


struct Gc<T: Trace> {
    // ..
}

unsafe impl<T> Trace for Gc<T> {
    fn trace(&self, tracer: &mut Tracer) {
        if let Some(tracer) = tracer.downcast_mut::<MyTracer>() {
            // mark self
        } else {
            panic("Don't mix GCs!");
            // If you want to support multiple GCs interacting with each other,
            // you can let this else block trace the contents.
            // Beware, interacting GCs have subtle safety issues.
        }
    }
}

This also makes it easier to distinguish between rooting and tracing operations. While the operations are similar (“to root/trace a value, walk its fields recursively till you find all of the Gcs, and root/mark those”), the code we run at the leaf Gc<T> nodes is different. In the previous model, this could have been solved with a global static boolean that identifies if the code is currently walking roots or tracing, but with the Tracer trait object we can just pass in different tracer values.

We’re not yet sure if we should be lumping root walking and tracing in a single trait; so we might end up with a second Scan trait that works similarly.

Note that we’re not getting rid of the Root trait here. This is because Root and Trace have slightly incompatible purposes – Root signals to the compiler if something definitely contains roots, whereas Trace marks things which are safe to put inside a GC. bool is Trace, but not Root. Vec<Gc<T>> is Trace and Root, Vec<bool> is Trace but not Root. &T and &mut T are neither. Trace will actually show up in trait bounds for GC code. Root will only be analysed by the compiler itself; bounds like R: Root probably won’t show up.

There should not be any types which are Root but not Trace, because this means the compiler won’t know what to do with them!

Now, when generating the stack map, we include the stack offset of all Root objects in scope, as well as appropriate dynamic dispatch vtable pointers for the Trace implementation2. Walking the stack involves calling the trace method on each entry in the stack map for each call site.

Unresolved problems

There are a lot of these. Suggestions very welcome.

Trait objects

Trait objects provide an interesting challenge. They may or may not contain roots, but what’s more important is that trait objects in libraries that know nothing about GC may also contain roots.

For example, if a library is dealing with a Box<SomeTrait>, and your code feeds it a Box<SomeRoot as SomeTrait>, the trait object is now a root. If a GC is triggered while in this call (perhaps by a callback), then this trait object should be counted as a root.

But this library didn’t depend on the GC, and when it was compiled, it wasn’t compiled with stack map entries for this GC object.

There are two solutions here. The first is to recompile everything (including libstd) from scratch with GC support on, and put all owned trait objects in the stack maps. They will have an extra generated trace entry in the vtable that will ignore the object if it isn’t a root. To put trait objects inside Gc<T>, you will have to explicitly use Box<Trait+Trace>, however – this magical trace entry is just for collecting roots.

The second solution is to simply not allow casting Root objects to owned trait objects. I feel that there are use cases for both – the former has extra bloat and requires a custom libstd (which could be distributed via rustup if necessary), but the latter restricts how you use trait objects. Servo, for example, would probably prefer the latter since we don’t put our DOM objects in owned trait objects. But other GC users may want maximum flexibility. Letting people choose this via a codegen flag (which can be controlled via cargo) might be a good idea.

Should it be Trace<T>?

There is a dynamic dispatch cost on rooting/tracing any Gc<T> leaf with the tracer model.

This can be obviated by having it be:

trait Trace<T: Tracer> {
    fn trace(&self, tracer: &mut T)
}

Most types would implement Trace<T>, and GCs can implement Trace<SpecificTracer>, and only require their contents to be Trace<SpecificTracer>. This lets the type system forbid interacting GCs instead of having it done at runtime.
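A rough sketch (mine, not from the design docs) of what the impls might look like under this variation:

trait Tracer {}
trait Trace<T: Tracer> {
    fn trace(&self, tracer: &mut T);
}

struct MyTracer;
impl Tracer for MyTracer {}

// ordinary types are generic over the tracer...
impl<T: Tracer> Trace<T> for bool {
    fn trace(&self, _tracer: &mut T) {}
}

// ...while a GC pointer only implements (and requires) Trace for its own tracer,
// so mixing two GCs becomes a compile-time error rather than a runtime panic
struct Gc<V: Trace<MyTracer>> {
    _marker: std::marker::PhantomData<V>,
}
impl<V: Trace<MyTracer>> Trace<MyTracer> for Gc<V> {
    fn trace(&self, _tracer: &mut MyTracer) {
        // mark self
    }
}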

This has multiple downsides, however:

  • #[derive(Trace)] becomes #[derive(Trace<MyTracer>)] for things containing Gc<T> (because Gc<T> is not Trace<T> for all T, and macro expansion runs before this information can be computed).
  • If there are multiple GCs, there are multiple Trace<T> vtable pointers in the stack map. Not all libs know about the other GC when being compiled, so you need to defer generation of these stack map entries somehow.
  • The heuristics for forbidding types which are Root but not Trace<T> become subtler. You have to effectively forbid types which are Root but do not have an impl of Trace<T> for at least one tracer T that is active in the compilation.

Non-Trace collections on the stack

If something like the following, defined by a third-party library:

struct Foo<T> {
    x: T,
    // ...
}

doesn’t implement Trace, it’s still okay to use Foo<RootedThing> on the stack, because we can figure out that the inner T is what we need to root.

However, if a third-party MyVec<T> (which behaves like a vector) contains RootedThings, and is on the stack, the compiler doesn’t know what to do with it. Lack of a Trace bound makes it impossible to put such types on the GC heap, but there’s no restriction on putting these types on the stack. As I mentioned before, we can simply forbid the existence of types which are Root but not Trace (MyVec<RootedThing> is Root). This is already done with Copy and Drop.

There’s a subtle difference between this and the Copy/Drop forbidding. Copy and Drop are always explicitly implemented. On the other hand, Root is an auto trait and automatically implements itself on types containing roots. This means that we can’t necessarily forbid such types being created at impl time — third party collections like above for example won’t contain Root types until they are monomorphised. We can error during monomorphization, but this error might not be very user-friendly, like template errors in C++.

Another solution is to make Root into ?Root, much like ?Sized. This means that the writers of collections will explicitly opt in to allowing GCd things inside them. This probably would lead to a lot more churn, however. But the diagnostics would be clearer.

Turns out that this actually works with half-decent diagnostics. This doesn’t forbid the existence of types which impl Root but not Trace, however. It simply avoids autoderiving Root on types which aren’t Trace. But this behavior can be changed. (In fact, it was changed while this post was being written!)

It becomes more complicated with Trace though. Having Root<T> might fix this, but then you have to deal with the auto trait generics.

One solution for the auto trait generics is to simply not include Root in the stdlib. Instead, require code like the following:

// in gc library

struct MyTracer {/* .. */}

struct Gc<T: Trace<MyTracer>> {
    // ...
}

#[gc_root_trait]
unsafe trait MyRoot: Trace<MyTracer> {}

unsafe impl !MyRoot for .. {}

unsafe impl<T: Trace<MyTracer>> MyRoot for Gc<T> {}

This can be further simplified by completely removing the rooting trait requirement and instead require #[gc(tracer=MyTracer)] on all GC structs. This, however, is a bit more special and we lose the free diagnostics that you get from utilizing the type system.

Are Root-containing raw pointers Root?

For the auto-trait to work, types like Vec<ContainsRoot> should also be marked as Root.

This can be done by just marking *const T and *mut T as Root if T is Root using an impl in libcore. However, borrowed types like Iter will also be dragged into this. We only want types which own Root things to be considered roots.

The alternative is to not require this, and solely rely on PhantomData. Vec<T> also contains a PhantomData<T>, which gives the compiler a hint that it owns a T. On the other hand, Iter<'a, T> contains a PhantomData<&'a T>, which hints that it borrows a T. This is already used by the compiler to determine drop soundness, so we can just use the same thing to determine Root types. This is already supported by the autotrait infrastructure.
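As an illustration (these types are hypothetical), the ownership hint looks like this:

use std::marker::PhantomData;

// A third-party container that stores Ts behind a raw pointer. The PhantomData<T>
// field is what tells the compiler it owns Ts, so under this proposal
// MyVec<ContainsRoot> would itself be considered Root.
struct MyVec<T> {
    ptr: *mut T,
    len: usize,
    _owns: PhantomData<T>,
}

// A borrowing iterator instead carries PhantomData<&'a T>, signalling a borrow,
// so it would not be treated as a root.
struct MyIter<'a, T> {
    ptr: *const T,
    remaining: usize,
    _borrows: PhantomData<&'a T>,
}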

A downside here is that we’re relying more on producers of unsafe code remembering to use PhantomData. I’m not 100% certain about this, but triggering dropck unsoundness by neglecting PhantomData is still pretty hard (and often requires types like arenas), whereas forgetting a root can very easily cause a GC segfault. I do not consider this to be a major downside.

Finalizers and Drop

The following code is unsafe:

struct Foo {
    bar: Gc<Bar>,
    baz: Gc<Baz> // baz can contain a Bar
}
impl Drop for Foo {
    fn drop(&mut self) {
        println!("{:?}", *self.bar);
        // or
        *self.baz.bar.borrow_mut() = self.bar.clone();
    }
}

// Foo itself is used as a `Gc<Foo>`

The problem is that destructors run in the sweep cycle of a GC, in some order. This means that bar may have already been collected when Foo’s destructor runs. While in many cases this can be solved with a smart collection algorithm, in the case where there’s a cycle being collected there’s nowhere safe to start.

Additionally, further mutation of the graph after collection may extend the lifetime of a to-be-collected variable.

A simple solution is to forbid all GC accesses during the collection phase. However, this means dereferences too, and this will incur a cost on all GCd types – they stop being simple pointer accesses. This solution places the burden on the GC implementor, instead of the compiler.

We have enough information in the type system to solve this – we can forbid Drop impls on types which are explicitly Root. But it turns out that this isn’t enough. Consider:

trait MyTrait {
    fn do_the_thing(&self);
}
struct HijackableType<T: MyTrait> {
    x: T,
}

impl<T> Drop for HijackableType<T> {
    fn drop(&mut self) {
        self.x.do_the_thing();
    }
}

// in other library

struct Bar {
    inner: Gc<Baz>
}

impl MyTrait for Bar {
    fn do_the_thing(&self) {
        println!("{:?}", *self.inner)
    }
}

Foo<Bar> now has an unsafe destructor. Stopping this behavior requres forbidding Drop impls on structs with trait bounds, but that is too restrictive.

This may end up having a similar solution to the “all roots must be Trace” issue. Warning on monomorphizations isn’t enough, we need to be able to allow Vec<ContainsRoot>, but not HijackableType<ContainsRoot>. Making this distinction without poisoning half the generics out there is tricky.

The notion of a hijackable type is actually already important for sound generic drop impls; see RFC 1327 (dropck eyepatch), RFC 1238 (nonparametric dropck), and their predecessor, RFC 0769 (sound generic drop). We might be able to rely on this, but would need to introduce additional constraints in dropck.

Fortunately, there is always the fallback solution of requiring the implementor to enforce this constraint at runtime.


  1. Firefox does have a garbage collector and a cycle collector which interact, though, so it’s not something which is unthinkable.

  2. If there is an active stack drop flag for the value, that will need to be included too.

Fun Crypto Problem: Designing an Anonymous Reputation System

One of the reasons I like crypto is that it’s a gold mine of interesting problems which feel impossible to solve, and whose solutions, if they exist, feel like magic.

The other day, I came across one such problem here, by @SarahJamieLewis:

Is there a scheme where A can give reputation points to B, & later, B presenting as C can prove their reputation (without revealing A or B)?

(I recommend trying to solve this yourself before reading ahead)

The problem isn’t completely defined because we don’t know how “reputation” is supposed to work. A simple model is to think of it as currency, and use Bitcoin as a proxy. Of course, a real reputation system probably would be different from currency. There might only be a small set of authorized reputation “sellers” (this can still be built on top of Bitcoin, or you can use a system similar to the CA system for TLS certificates). Or there might be a system in which each person can vote for another person at most once (though this needs to be designed in a way that is resilient to sybil attacks).

Let us assume that there is a ledger out there, where each ledger entry is a certificate saying that entity X has given one reputation point to entity Y. A public key is included, where the private key is only known to Y. This model cleanly applies to both Bitcoin and CA systems – in Bitcoin, the transaction is the “certificate”, and in the CA system the certificate is the certificate.
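As a toy model (the field names are made up for illustration), a ledger entry might look like:

// One reputation point granted by entity X to entity Y;
// only Y holds the private key matching receiver_pubkey.
struct Certificate {
    giver: String,
    receiver_pubkey: Vec<u8>,
}

type Ledger = Vec<Certificate>;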

For additional anonymity, you can have a different private key for each certificate. I’m going to assume this is the case, though the solutions don’t change much if it isn’t.

Solution via ZKP

I’m very fond of the concept of a zero-knowledge proof, and when you have a hammer everything looks like a nail.

So my first solution was one involving zero-knowledge proofs.

Construct the problem “Given the certificates in this ledger and X private keys, prove that these private keys each have one certificate they correspond to, and that the keys are distinct”.

In this problem, the certificates (public keys) are hardcoded, whereas the private keys are inputs. This sort of algorithm can be written as a sequential logic circuit, assuming that the method of signing can be. We can then perform a zero-knowledge proof of this problem using the ZKP for general execution outlined here. The prover inserts their private keys into the algorithm, runs it, and proves via the ZKP that the execution was faithful and had an output of true.

Since the ZKP doesn’t leak any information about its inputs, it doesn’t leak which certificates were the ones for which the prover had private keys, so it doesn’t leak the identities of A or B.

However, this is overkill. The general ZKP gets very expensive as the size of the algorithm grows, and since the ledger is hardcoded in it, this ZKP will probably take a while (or a lot of computational power) to execute. One can perform it with a subset of the ledger picked by the prover, but repeating the process may slowly reveal the identity of the prover via the intersection of these subsets.

Solution via secret-sharing

(This solution is technically a ZKP too, but it doesn’t use the “general” ZKP algorithm, which, while expensive, can be used for any combinatorial verification algorithm)

Once I’d gotten the “use a ZKP!” solution out of my system, I thought about it more and realized that the problem is very close to a secret-sharing one.

Secret-sharing is when you want to have a cryptographic “lock” (a shared secret) which can only be revealed/opened when the requisite quorum of (any) X keys out of N total keys is used.

Shamir’s secret sharing is a nice algorithm using polynomials that lets you do this.
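To make the construction concrete, here is a minimal sketch of Shamir’s scheme over a prime field; the prime, the threshold, and the coefficients are arbitrary choices for illustration.

const P: u128 = 2_305_843_009_213_693_951; // 2^61 - 1, a Mersenne prime

fn add(a: u128, b: u128) -> u128 { (a + b) % P }
fn mul(a: u128, b: u128) -> u128 { (a * b) % P }
fn pow(mut base: u128, mut exp: u128) -> u128 {
    let mut acc = 1;
    while exp > 0 {
        if exp & 1 == 1 { acc = mul(acc, base); }
        base = mul(base, base);
        exp >>= 1;
    }
    acc
}
fn inv(a: u128) -> u128 { pow(a, P - 2) } // Fermat's little theorem

// Evaluate the polynomial secret + c1*x + c2*x^2 + ... at x to produce one share.
fn share(secret: u128, coeffs: &[u128], x: u128) -> (u128, u128) {
    let mut y = secret;
    let mut xp = 1;
    for &c in coeffs {
        xp = mul(xp, x);
        y = add(y, mul(c, xp));
    }
    (x, y)
}

// Lagrange interpolation at x = 0 recovers the secret from a quorum of shares.
fn recover(shares: &[(u128, u128)]) -> u128 {
    let mut secret = 0;
    for (i, &(xi, yi)) in shares.iter().enumerate() {
        let mut num = 1;
        let mut den = 1;
        for (j, &(xj, _)) in shares.iter().enumerate() {
            if i != j {
                num = mul(num, xj);
                den = mul(den, (P + xj - xi) % P);
            }
        }
        secret = add(secret, mul(yi, mul(num, inv(den))));
    }
    secret
}

fn main() {
    let secret = 42;
    // a degree-2 polynomial: any 3 of the 5 shares recover the secret
    let coeffs = [123_456_789, 987_654_321];
    let shares: Vec<_> = (1..=5).map(|x| share(secret, &coeffs, x)).collect();
    assert_eq!(recover(&shares[0..3]), secret);
    assert_eq!(recover(&shares[2..5]), secret);
}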

In this situation, we want to prove that we have X private keys out of N total certificates in the ledger.

The verifier (Victor) can construct a secret sharing problem with a single secret and N secret-sharing keys (in the case of Shamir, these would be N x,y-coordinate pairs). Each such key is paired with a certificate, and is encrypted with the corresponding public key of that certificate.

The prover (Peggy) is given all of these encrypted secret-sharing keys, as well as the certificates they correspond to.

If Peggy legitimately has X reputation, she has the X private keys necessary to obtain X of the secret-sharing keys by decrypting them. From this, she can obtain the secret. By showing the secret to Victor, she has proven that she has at least X private keys corresponding to certificates in the ledger, and thus has at least X reputation. In the process, which certificates were involved is not revealed, so both the reputation-giver and the reputation-receiver stay anonymous.

Or was it?

Victor can construct a malicious secret sharing problem. Such a problem would basically reveal a different secret depending on which secret-sharing keys Peggy uses. For example, in Shamir’s secret sharing, Victor can just give N random coordinates. Any X of those coordinates will still determine a polynomial curve, but the curves obtained from different sets of X coordinates will probably have a different constant term (and thus a different secret).

The secret-sharing problem needs to be transmitted in a way that makes it possible for Peggy to verify that it’s not malicious.

One way to do it is to make it possible to uncover all the secret-sharing-keys, but only after the secret has been found. In Shamir’s algorithm, this can be done by pre-revealing the x coordinates and only encrypting the y coordinates. Once Peggy has found the secret, she has the entire polynomial curve, and can input the remaining x coordinates into the curve to find the remaining secret sharing keys (and then verify that they have been encrypted properly).

This is almost perfect. User “otus” on Crypto Stack Exchange pointed out my mistake.

The problem with this scheme (and the previous one to a lesser degree) is that Peggy could simply brute-force the values of the y coordinates beforehand.

This can be solved by using nonces. Instead of encrypting each y-coordinate, Victor encrypts each y-coordinate, plus a nonce. So, instead of encrypting the y-coordinate “42”, a string like “da72ke8lv0q-42” will be encrypted.

On decryption, it is easy to extract the coordinate from the plaintext (presumably the scheme used to add the nonce would be decided upon beforehand). However, we can’t brute-force for the plaintext anymore, because the ciphertext isn’t the encryption of a low-entropy value like a regular, smallish number, it’s the encryption of a relatively high-entropy value.
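For concreteness, one possible shape for this encoding (the exact scheme just has to be agreed upon beforehand) might be:

// prepend a nonce, so the plaintext is high-entropy even if the coordinate isn't
fn encode(nonce: &str, y: u64) -> String {
    format!("{}-{}", nonce, y)
}

// the coordinate is whatever follows the last '-'
fn decode(plaintext: &str) -> Option<u64> {
    plaintext.rsplit('-').next()?.parse().ok()
}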

So far, this prevents brute forcing, but it also prevents Peggy from verifying that the secret-sharing problem was non-malicious, since she doesn’t know the nonces. Nor can these be pre-shared with her, since she can just use them to brute force again.

The solution here is for Victor to use the shared secret as a symmetric key, encrypt all of the nonces with it, and share them with Peggy. Until Peggy knows this key, she cannot use the nonces to brute force. Once she knows this key, she can decrypt the values for the nonces and use them to verify that the nonces are correct.

This is exactly the property we need. If Peggy doesn’t have enough private keys (reputation points), she won’t have the secret and can’t prove her reputation to Victor. Once Peggy does have the quorum of keys, she will know the symmetric key, be able to decrypt the nonces, and use these nonces to verify that the other N-X ciphertexts fall on the curve which she has obtained. Once she has verified this, she can present the shared secret/symmetric key to Victor, who will know that she had enough keys to crack the secret sharing problem and thus has at least X reputation.


This was quite an entertaining problem to solve (and it got me thinking about ZKPs again, which made me write my previous post). Thanks, Sarah!

Got an alternate solution (or other similar fun problems)? Let me know!

Interactive Sudoku Zero-knowledge Proof

Back in March I was particularly interested in Zero-Knowledge Proofs. At the time, I wrote a long blog post introducing them and explaining how the ZKP for generic execution works.

I was really enjoying learning about them, so I decided to do a presentation on them in my crypto course. Sadly there wasn’t going to be time for explaining the structure of the proof for general execution, but I could present something more fun: Sudoku.

Sudoku solutions can be proven via ZKP. That is to say, if Peggy has a solution to Victor’s Sudoku problem, she can prove that she has a valid solution without ever revealing any information about her solution to Victor (aside from the fact that it is valid).

To make the ZKP easier to explain, I wrote an interactive version of it.

I planned to write about it then, but completely forgot till now. Oops.

I’m first going to explain how the ZKP is carried out before I explain how the interactive verifier works. If you aren’t familiar with ZKPs, you might want to read my previous post on the subject up to and including the part about proving graph colorings.

Proving Sudoku

This proof is going to be carried out very similarly to the graph coloring proof. Indeed, Sudoku can be reduced to a graph coloring problem, though that’s not how we’re going to obtain the ZKP.

Victor has a Sudoku problem:

Peggy has a solution:

In order to not leak information about her solution, Peggy permutes it:

Basically, there is a 1-1 mapping between the old digits and the new ones. In this specific permutation, all 3s are replaced by 4s, all 1s by 5s, etc.

She now commits to this permutation by committing to every individual cell. A random nonce is obtained for each cell, and the contents of that cell are hashed along with the nonce. This is the same commitment procedure used in the graph coloring ZKP.
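As a rough sketch (assuming the third-party sha2 crate), the per-cell commitment could be computed like this, matching the SHA256(nonce + "-" + digit) check described later:

use sha2::{Digest, Sha256};

// Commit to a single cell: hash the nonce and the permuted digit together.
fn commit(nonce: &str, digit: u8) -> String {
    let mut hasher = Sha256::new();
    hasher.update(format!("{}-{}", nonce, digit));
    // hex-encode the digest so it can be sent to Victor as a string
    hasher.finalize().iter().map(|b| format!("{:02x}", b)).collect()
}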

These commitments are now sent over to Victor.

Victor ponders for a bit, and demands that Peggy reveal the third row of the sudoku square.

(Note that this is the non-permuted problem statement)

This row is marked in orange. There are some additional elements marked in green, which I shall get to shortly.

Peggy reveals the permuted values for this row:

Victor can now verify that all digits 1-9 appear within this permuted row, and that they match the commitments. This means that they appear in the original solution too (since permutation doesn’t change this fact), and, at least for this row, the solution is correct. If Peggy didn’t have a solution, there was a chance she’d be caught in this round if Victor had asked for the right set of 9 squares to be revealed.

The procedure can be repeated (with a new permutation each time) to minimize this chance, with Victor asking to reveal a row, column, or 3x3 subsquare each time, until he is certain that Peggy has a solution.

But wait! This only works towards proving that Peggy has a valid Sudoku solution, not that this is the solution to Victor’s specific problem. Victor only verified that each row/column/subsquare had no duplicates, a property which is true for all sudoku solutions!

This is where the green squares come in. For any given set of “orange squares” (a row, column, or 3x3 subsquare), we take the “preset” digits appearing in the problem statement (In this case: 7, 8, and 6) in that set of squares. All other instances of those digits preset in the problem statement form the set of “green squares”:

Peggy reveals the permuted values for both the green and orange squares each time:

In addition to verifying that there are no duplicates in the orange squares, Victor additionally verifies that the permutation is consistent. For example, the 7th element in that row is a 6, which is already preset in the problem statement. There are two other 6s in the problem statement, one in the 5th row 8th column, and one in the 7th row 1st column. If the permutation is consistent, their corresponding squares in the revealed portion of the permuted solution should all have the same digit. In this case, that number is 1. Similarly, the 5th element in that row is a preset 8, and there’s a corresponding green square in the 5th row last column that also has an 8. In the permuted solution, Victor verifies that they both have the same digit, in this case 7.

This lets Victor ensure that Peggy has a solution to his sudoku problem. The fact that two given squares must share the same digit is invariant under permutations, so this can be safely verified. In fact, a sudoku problem is really just a problem saying “Fill these 81 squares with 9 symbols such that there are no duplicates in any row/column/subsquare, and these three squares have the same symbol in them, and these five squares have the same symbol in them, and …”. So that’s all we verify: There should be no duplicates, and the digits in certain sets of squares should be the same.

Note that revealing the green squares doesn’t reveal additional information about Peggy’s solution. Assuming Peggy’s solution is correct, from comparing the problem statement with the revealed/permuted values, Victor already knows that in the permutation, 7 has become 6, 8 has become 7, and 6 has become 1. So he already knows what the other preset green squares contain, he is just verifying them.

We cannot reveal anything more than the green squares, since that would reveal additional information about the permutation and thus the solution.

Edit: This actually still isn’t enough, which was pointed out to me by “dooglius” here. Basically, if the sudoku problem has two digits which only appear once each, there is nothing that can stop Peggy from coming up with a solution where these two digits have been changed to something else (since they’ll never be in a green square). Fixing this is easy, we allow Victor to ask Peggy to reveal just the permuted values of the presets (without simultaneously revealing a row/column/subsquare). Victor can then verify that the preset-permutation mapping is consistent (all presets of the same value map to the same permutation) and 1-1.

This check actually obviates the need of the green squares entirely. As long as there is a chance that Victor will ask for the presets to be revealed instead of a row/column/subsquare, Peggy cannot try to trick Victor with the solution of a different sudoku problem without the risk of getting caught when Victor asks for the presets to be revealed. However, the green squares leak no information, so there’s no problem in keeping them as a part of the ZKP as a way to reduce the chances of Peggy duping Victor.

The interactive verifier

Visit the interactive verifier. There’s a sudoku square at the top which you can fill with a problem, and you can fill the solution into the first square on the Prover side – fill this in and click Start. Since I know nobody’s going to actually do that, there’s a “Fill with known problem/solution” button that does this for you.

Once you’ve initiated the process, the ball is in the Prover’s court. The Prover must first permute the solution by clicking the Permute button. You can edit the permutation if you like (to introduce a flaw), or manually do this after clicking the button.

Once you’ve clicked the button, generate nonces by clicking the next one, “Populate Nonces”. These, too, can be edited. You can generate hashes (which can also be edited) by clicking the next button, and after that send the hashes (commitments) over to the Verifier’s side.

The ball is now in the Verifier’s court. As you can see, there’s a set of hashes on the Verifier’s side. The Verifier only knows the problem statement and whatever is visible on their side of the screen, and nothing more.

You, acting on behalf of the Verifier, can now select a row/column/subsquare/preset using the dropdown and text box on the Verifier’s side. As you select, the orange/green squares that are going to be revealed will be shown. When satisfied with your choice, click “Reveal”, and the Prover will populate your squares with the permuted values and nonces. “Verify” will verify that (a sketch of these checks follows the list):

  • The appropriate elements and hashes are revealed
  • The hash is equal to SHA256(nonce + "-" + digit)
  • The orange squares contain distinct digits.
  • The green squares contain digits that match those of the orange squares they correspond to (via the presets in the problem statement)
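
For concreteness, here’s a minimal sketch of those checks in Python (the real verifier runs in the page’s JavaScript; the function name and data layout here are mine, used only for illustration):

```python
import hashlib

def verify_revealed_set(commitments, revealed):
    """Check one revealed row/column/subsquare against its commitments.

    `commitments` is the list of 9 hex SHA-256 hashes received earlier;
    `revealed` is the list of 9 (nonce, digit) pairs revealed now, with
    digits as ints 1..9.
    """
    digits = []
    for commitment, (nonce, digit) in zip(commitments, revealed):
        # The commitment format from the list above: SHA256(nonce + "-" + digit)
        if hashlib.sha256(f"{nonce}-{digit}".encode()).hexdigest() != commitment:
            return False          # a commitment was reneged upon
        digits.append(digit)
    # The orange squares must contain nine distinct digits
    return sorted(digits) == list(range(1, 10))
```

The green-square consistency check would be a further comparison of the revealed digits against the matching orange squares, omitted here for brevity.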

Once you click verify, it will show the probability of correctness (this isn’t an exact value; it’s calculated using an approximate formula that doesn’t depend on the problem statement), and the ball moves back into Peggy’s court, who can permute her solution again and continue. The probability slowly increases each round.

Doing this manually till it reaches 99% is boring, so there’s a button at the top (“Run automatically”) which can be clicked to run it for a given number of rounds, at any stage in the process once started. If you tamper with one of the values in the permuted solution and run it for ~20 rounds, it usually gets caught.

Have fun!

Starting at Mozilla

I got a job!

I’m now working at Mozilla as a Research Engineer, on Servo.

I started two weeks ago, and so far I’m really enjoying it! I feel quite lucky to get to work on an open source project, with an amazing and helpful team. Getting to do most of my work in Rust is great, too :)

So far I’ve been working on the network stack (specifically, “making fetch happen”), and I’ll probably be spending time on DOM things as well.

Really excited to see how this goes!

Exploring Zero-Knowledge Proofs

Follow-up article to this one here

So recently I read this article about how the Bitcoin community had executed what’s known as a “Zero-Knowledge Contingent Payment”, a pretty neat concept.

What was really interesting for me was the (simplified) underlying algorithm for generic zero knowledge proofs. It took me a while (and some questions asked of helpful folks on the Internet) to understand it fully, but the concept is quite intriguing and sounds rather magical. I thought I’d explain it here in an accessible way, both so that others can get it and to improve my own understanding.

I intend this article to be read by people with a programming or mathematical background1, who have some understanding of what logic gates are. Please let me know if you feel that something is inadequately (or wrongly) explained.

So what is a zero knowledge proof?

Let’s say Alice has a problem she wants to solve. It could be a mathematical problem, like factorizing a large number, or coloring a map (or graph!) with only three colors, or solving a Sudoku puzzle. Anything where you can write a program to verify the solution.

She doesn’t have the resources to solve the problem herself, but wants to buy a solution from someone else. She makes the problem public, and Bob says he has a solution.

However, Alice is skeptical. She’s not sure if Bob is telling the truth here, and would like some evidence that he does indeed have a solution. At the same time, Bob is not willing to share his solution with Alice without getting paid. They also don’t want to involve a third party; let’s say this is a rather Important Sudoku Puzzle that affects National Security ¯\_(ツ)_/¯.

What Alice and Bob need here is a way for Bob to prove to Alice that he has the solution, without sharing the solution, and without involving a third party2.

It turns out that this is totally possible (magical, right?). There’s a quick example on Wikipedia of a simple proof of a non-mathematical fact – whether or not someone has a key to a particular door.

For proving more complicated problems, we have to digress into some basic crypto first.

Interlude: Hashes and commitments

Feel free to skip this if you know what a hash function and a nonce is.

In cryptography, there’s something called a “hash function”. In essence it’s an “irreversible” function whose output is known as a “hash”, with the following three properties:

  • It’s not computationally intensive to calculate the hash of an input
  • Given a hash, it’s a computationally hard problem to calculate an input to the hash function that results in this hash, usually involving brute force
  • It’s also a computationally hard problem, given an input and a hash, to find a different input (especially a different input that is similar to the first one) that produces the same hash.

Note that multiple values may result in the same hash.

The result of this is basically that hashes are hard to forge. If Bob shares a hash Y = H(X) with Alice, where X is some secret data and H is a hash function, and Bob reveals X at some later point, then by checking that Y = H(X), Alice can be reasonably certain that the value shared by Bob was indeed the original input to the hash function, and not some other value engineered to produce the same hash. Similarly, Bob can be certain that knowing only Y, Alice cannot reverse-engineer X, since the hash function is “irreversible”.

This brings us to the concept of a commitment. Hashes can be used as described above to “commit” to a value. If Bob decides on a number X, and makes its hash Y public, he has committed to this value without revealing it. When he does decide to reveal it, he is forced to reveal X and not some modified bogus value, thus making the “commitment” binding.

Some of you may have noticed a flaw here: It’s hard to commit to small numbers, or things that come from a restricted set. If Bob wishes to commit to the number 5 (without revealing it), or the color red (out of a set of three colors), Alice can just try H(0) to H(9) or H(red), H(green), H(blue) and find out which one matches. After all, hashes aren’t supposed to be resilient to brute force attacks, and brute force attacks become very easy when the set of inputs is tiny.

A solution to this is to use a nonce (also known as a “trapdoor”). Bob commits to 5 by hashing the string 5-vektvzkjyfdqtnwry, where vektvzkjyfdqtnwry is a random value he selected, known as a “nonce”. When Bob wishes to reveal the value, he just reveals 5-vektvzkjyfdqtnwry and Alice is convinced that the original value committed to was indeed 5. Of course, this requires some agreement on the format of the nonce; in this case the nonce is just “everything after the dash”. Note that the nonce is private, and only revealed when Bob wishes to reveal the committed number.

Note that each new commitment should use a new nonce. Otherwise, information can be leaked; for example if Bob needs to commit to three numbers (say, 2, 5, 2) in a way that they can be individually revealed, he shouldn’t compute the hashes for 2-vektvzkjyfdqtnwry, 5-vektvzkjyfdqtnwry, 2-vektvzkjyfdqtnwry, since the first and last hashes will be equal and Alice will know that the committed values behind them are probably the same too (something which you may not wish to reveal).

Another issue that can turn up is a “rainbow table”, where one party comes into the game with a precomputed table of hashes of all strings up to a certain number of characters. One solution for this is to increase the nonce size; however, since Bob decides the nonces, it’s possible for him to select them cleverly if he’s the one with a table. The solution here is to use a “salt”, which is a large random string combined with the committed value before hashing. Bob and Alice could, for example, mutually decide on a salt of asdcjyxeafxjvikfzmnyfqsehsxwxsfywbreb, and when Bob wishes to commit to the number 5, he hashes asdcjyxeafxjvikfzmnyfqsehsxwxsfywbreb-5-vektvzkjyfdqtnwry. Note that salts work similarly to nonces here, except that the salt is known publicly (you can model it as a last-minute modification of the agreed-upon hash function H, since H'(X) = H(add_salt(X))). In some cases, you may also want a per-instance salt, which is mutually decided every time Bob wants to compute a hash.
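
As a concrete illustration, here’s a tiny commit/reveal sketch in Python. The helper names are mine, and SHA-256 stands in for whatever hash function Alice and Bob agree on:

```python
import hashlib
import secrets

SALT = "asdcjyxeafxjvikfzmnyfqsehsxwxsfywbreb"  # public, agreed upon in advance

def commit(value):
    """Commit to `value`; the nonce stays private until reveal time."""
    nonce = secrets.token_hex(16)              # fresh random nonce per commitment
    digest = hashlib.sha256(f"{SALT}-{value}-{nonce}".encode()).hexdigest()
    return digest, nonce                       # publish digest, keep nonce secret

def check(digest, value, nonce):
    """Verify a revealed (value, nonce) pair against the earlier commitment."""
    return hashlib.sha256(f"{SALT}-{value}-{nonce}".encode()).hexdigest() == digest

# Bob commits to 5 and publishes only the digest; later he reveals (5, nonce)
# and Alice checks that it matches.
digest, nonce = commit(5)
assert check(digest, 5, nonce)
```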

Hashes are a useful building block for many things; they’re a key component in password security, as well as being part of all kinds of cryptographic protocols. In this post we’ll mainly focus on their ability to be used as an unbreakable commitment.

Back to your regularly scheduled blog post.

Coloring graphs

The classic example of zero knowledge proofs is graph coloring. I’ll run through a quick explanation, though it’s explained beautifully here too.

Let’s say Alice has a graph:

No, not that kind, Alice. The other graph.

She wants it colored such that no two adjacent nodes share a color. This is an NP-complete problem (so it can take up a lot of computational resources to solve). Of course, this graph is small and easy to color, but that’s just for the sake of this blog post.

Bob, using his trusty Crayola™ 3-crayon set3, has managed to come up with a valid coloring:

He wishes to prove to Alice that he has this coloring, without revealing it or involving a third party. Probably for National Security Reasons. Something something Nicolas Cage.

Bob and Alice meet, and Alice gives him a large piece of paper with the (uncolored) graph drawn on it.

Bob goes into a private room, and colors it. He also covers each graph node with a hat. Alice now enters the room.

Alice chooses an adjacent pair of nodes. Let’s say she chooses 1 and 2. Bob removes those two hats (since Alice is watching him, he has no chance to tamper with the colorings underneath the hats before revealing them). Now, Alice knows the colors of nodes 1 and 2:

This lets her verify that nodes 1 and 2 had different colorings in the graph Bob drew.

Note that this doesn’t actually tell her anything about Bob’s coloring aside from the increased probability of correctness. The colors can always be permuted, so any valid coloring would give the same answer here if the colors were permuted so that 1 is red and 2 is blue. This is important; we don’t want to leak information about Bob’s solution aside from the fact that it is correct.

Nor is this information enough to verify correctness. Bob could have equally drawn a wrong coloring.

(clearly someone wasn’t paying attention in kindergarten)

Since Alice only looked at nodes 1 and 2, she didn’t see anything wrong with the graph. But if she had by chance picked nodes 3 and 4, Bob’s deception would have been revealed.

So she only has 14% (1/7) certainty4 that Bob’s graph is correct.

However, we can run this experiment again. Bob can permute the colors, draw on a fresh copy of the graph, and ask Alice to choose another pair of adjacent nodes. She can check this, and the probability of correctness will increase to around 27% (1 - (6/7)*(6/7)).

Since Bob has permuted the colors, Alice cannot use the information from the previous round to glean any information about Bob’s solution in this round. Of course, Bob is free to produce a completely different coloring (one that is not a permutation), with a different flaw this time. Regardless of where the flaw is, Alice still has a chance of uncovering it each time.

This can continue until Alice is satisfied that there is very little chance that Bob has cheated. For example, after 60 rounds, Alice would be 99.99% certain.
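
To see where these numbers come from: if Bob’s coloring has (at least) one bad edge, Alice picks it with probability 1/7 in each round, so after n independent rounds her chance of having caught him is 1 - (6/7)^n:

```python
# Chance of having caught a single bad edge (out of 7 edges) after n rounds
for n in (1, 2, 60):
    print(n, round(1 - (6 / 7) ** n, 4))
# 1  -> 0.1429  (~14%)
# 2  -> 0.2653  (~27%)
# 60 -> 0.9999  (99.99%)
```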

Note that this didn’t actually involve any cryptography; it was an algorithm based on information flow. However, if you want this to work securely (in the current solution Alice could push Bob away and reveal all the nodes herself) and make it work without requiring Alice and Bob to be in the same location, you need to use hashes.

Remember when Bob colored the graph whilst alone in the secret room? Once Alice had entered the room, this coloring was committed. There was no way for Bob to tamper with this coloring.

We do the same thing here. After obtaining a valid coloring, Bob commits to this coloring by calculating some hashes.

Node  Color (private)  Nonce (private)  Hash
1     red              wmdqatobck       e1f957bedcceeb217305bfa12cbee4abac36eff1
2     blue             fmcbpzkgyp       87d9d7239909c28ec8d73a3b9a99673cbf870046
3     green            dktuqvrsss       a40bafb81149937c77ae55589aff1b53d9c043d8
4     blue             auhbyuzkmz       b3503962937850f7c1b59cf4b827ca40a62b122a
5     red              gfunjcmygk       d8db52bb36ca595b9231180c1055fe3958c3ea7d


(The hashes here are calculated using SHA-1. It’s not considered very secure anymore, but the more secure algorithms all output huge hashes which spill over the page.)

Bob sends the public part of the table (the node-hash mapping) to Alice. Alice asks for nodes 1 and 2, and Bob reveals the entire table entry for those two nodes (including the nonce).

Note that since Alice now knows the color and nonce for nodes 1 and 2, she can verify that the colors shown are indeed the ones Bob committed to. echo red-wmdqatobck | sha1sum if you want to check on a local Unixy shell.

As in the previous case, Alice can repeat this algorithm until she reaches an acceptable level of certainty (each time with a permutation of colors and a new set of nonces).

A lot of zero knowledge proofs (but not all!) are inherently probabilistic and interactive. They involve multiple rounds: in each round the prover (Bob) commits to something, and the verifier (Alice) challenges the prover to reveal some of it. The process repeats, with the certainty on the verifier’s side approaching 100% as more and more rounds happen.

Zero Knowledge Proof for General Execution

It turns out that you can have a ZKP exchange for the execution of any algorithm that can be transcribed into combinatorial logic. In other words, you should be able to write the program without loops and recursion, though loops bounded by a constant are allowed5. This isn’t as restrictive as it seems; usually verification is a straightforward task not involving convoluted loops. The examples above (graph coloring6, sudoku, prime factorization7) can all be verified without loops.

The algorithm shown here is by Gregory Maxwell, originally published here. It’s somewhat inefficient, but it demonstrates the idea behind ZKP for general execution. As mentioned there, it can be optimized using techniques described in this paper.

Let’s get started. Any combinatorial program can be decomposed into a bunch of AND and NOT gates, taking in a bunch of input values and giving out one or more output values. For simplicity let’s assume that the problem statement (i.e. the specific sudoku puzzle, etc) that needs verifying is embedded inside the program, and the final output of the program is just a single boolean indicating whether or not the input is a solution. This algorithm, however, can work for programs with arbitrarily large outputs.

Alice and Bob do this decomposition. They also agree on a numbering of the AND gates. Let’s say that there are N AND gates. We’re mostly going to ignore the NOT gates for the purpose of this article – they’re there, but they aren’t modified or anything.

Creating encrypted AND gates

Now, Bob creates 4*N encrypted AND gates. An encrypted AND gate is an AND gate, but with the inputs and output all muddled up.

This is a regular AND gate:

This is an encrypted AND gate:

(yes, it can be identical to an AND gate)

So is this:

and this:

Basically, each input and the output may or may not be inverted. We can model this in a different way: there is an encryption key corresponding to each input and to the output. This key is XORd with the input/output (so if the key is 1, the wire is inverted, and if the key is 0, the wire is not inverted).

A regular AND gate has a truth table as follows:

Input 1 Input 2 Output
0 0 0
1 0 0
0 1 0
1 1 1


This truth table, encrypted (with the input keys \(e_1 = 1, e_2 = 0\) and output key \(e_o = 1\)) is:

Encrypted Input 1 Encrypted Input 2 Encrypted Output
1 0 1
0 0 1
1 1 1
0 1 0


So, if the encrypted gate gets the (encrypted) inputs 1 and 0, its (encrypted) output will be 1.

Since XOR is its own inverse (\(x \oplus y \oplus y\) is just \(x\)), if we wish to encrypt an input before piping it through this gate, we just XOR it with the relevant input key. If we wish to decrypt the output, we again XOR it with the output key. The XOR gates being applied will just cancel out with the internal encryption gates. In other words, encryption and decryption are done with the same operation!

To recap, the dotted box below is an encrypted AND gate. An encrypted input enters from the left, and is decrypted by the internal XOR gate to obtain the actual input, which is piped through the AND gate. To encrypt an input so that it can be passed into this gate, one uses the same key with an XOR (not shown in the diagram). Similarly, the actual output of the AND gate exits on the right, and is encrypted by the XOR gate at the output to get the “encrypted output” (the wire that extends out of the box). To decrypt this, one must apply the same XOR operation to the encrypted output to recover the actual output of the gate.
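
To make the XOR-key picture concrete, here’s a small sketch in Python (the helper name is mine) that builds the encrypted truth table for given keys; with \(e_1 = 1, e_2 = 0, e_o = 1\) it produces the same four rows as the table above (order aside):

```python
def encrypted_and_table(e1, e2, eo):
    """Truth table of an AND gate whose input wires are XORed with keys
    e1 and e2, and whose output wire is XORed with key eo."""
    rows = []
    for a in (0, 1):
        for b in (0, 1):
            rows.append((a ^ e1, b ^ e2, (a & b) ^ eo))
    return rows

# e1=1, e2=0, eo=1 reproduces the encrypted table from the article
print(encrypted_and_table(1, 0, 1))
# [(1, 0, 1), (1, 1, 1), (0, 0, 1), (0, 1, 0)]
```

Encrypting an input before feeding it into the gate, and decrypting the output afterwards, are both just another XOR with the same key.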

Creating adaptation keys and commitments

Now, unlike regular AND gates, these encrypted AND gates cannot be composed directly. The output of an encrypted AND gate is encrypted, with a potentially different encryption key than that of the next gate’s input. So, we insert an “adaptation key” between the two. For example, if the output of the first gate is connected to the first input of the second gate, we need to insert this operation between the two gates:

We XOR by \(e_o\) of the first gate (to decrypt), and then again XOR by \(e_1\) of the second gate (to reencrypt). This operation is the same as XORing by \(e_o \oplus e_1\), which is the “adaptation key”. Every pair of encrypted gates will have an adaptation key for every configuration they can be placed in.
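
A quick sanity check of that claim, with made-up key values: decrypting with the first gate’s output key and then re-encrypting with the second gate’s input key is the same as a single XOR with their combination.

```python
e_o, e_1 = 1, 0              # output key of gate A, first-input key of gate B (made up)
adaptation_key = e_o ^ e_1

for wire in (0, 1):
    # decrypt with e_o, then re-encrypt with e_1 ...
    # ... which is exactly one XOR with the adaptation key
    assert (wire ^ e_o) ^ e_1 == wire ^ adaptation_key
```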

Alright. Bob creates a ton of these “encrypted gates”, and calculates all the adaptation keys. He also mixes up the truth tables of each encrypted gate8.

Now, he commits to these truth tables. A commitment for each entry in each truth table is made, so he’ll end up with something like this:

Encrypted Input 1 Encrypted Input 2 Encrypted Output nonce commitment
0 0 1 .. H(001 + nonce)
1 0 1 .. H(101 + nonce)
0 1 0 .. H(010 + nonce)
1 1 1 .. H(111 + nonce)


He also commits to each of the adaptation keys and each of the encryption keys.

As usual, all the commitments will be sent to Alice. Alice will then have data like: “Commitment for Gate 1 entry 1: .., Commitment for Gate 2 entry 2:.., … Commitment for Gate 2 entry 1: .., …. Commitment for adaptation key between Gate 1’s output and Gate 2’s first input: .., Commitment for adaptation key between Gate 1’s output and Gate 2’s second input: .., Commitment for encryption key for Gate 1’s first input, …”.

Shuffling and revealing

These commitments are taken in some predefined order, and the resultant monster string is hashed (without a nonce). This “superhash” is used as the seed to a pseudorandom number generator which is used to shuffle the gates. Both Alice and Bob can calculate this shuffling.

This post-shuffle ordering is used after this point. The hash-shuffle is important here because it adds a layer of tamper protection. If Bob wishes to tamper with, say, the 4th gate post-shuffle, Bob would have to create a bad gate before making the commitments; this changes the commitments, and thus the shuffle order, and so the tampered gate will not end up being the 4th gate. Basically, it’s hard to control where the tampered gate will end up.
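
Here’s a sketch of that shuffle in Python. I’m assuming the commitments are hex strings and using Python’s random module as the PRNG; the real protocol would pin down both the concatenation order and the generator so both sides compute exactly the same ordering.

```python
import hashlib
import random

def shuffle_order(commitments, num_gates):
    """Derive a gate ordering that both Alice and Bob can compute:
    hash all commitments together, seed a PRNG with the result, shuffle."""
    superhash = hashlib.sha256("".join(commitments).encode()).hexdigest()
    rng = random.Random(int(superhash, 16))   # seed the PRNG with the "superhash"
    order = list(range(num_gates))
    rng.shuffle(order)
    return order
```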

Now, out of the 4N gates, Bob takes the last 2N, and reveals everything about them: Their encryption keys, the nonces for their truth table commitments, and all adaptation keys between these gates (along with the nonces for the adaptation key commitments).

Alice ensures that everything in this revealed data adds up. All the truth tables, after decryption, should actually be AND gate truth tables. All adaptation keys must match up with their relevant encryption keys. All commitments must match up.

Double trouble!

Bob duplicates the AND-and-NOT-gate based circuit. He now has two identical circuits which take the same inputs, and have one output each. In itself this is pretty useless; this circuit is obviously redundant. However, in the context of encrypted gates, this redundancy becomes useful.

Bob drops in the 2*N encrypted gates into this double-circuit, using the post-shuffle ordering of encrypted gates and the predecided numbering9 of the AND gates in the circuit. He puts the necessary adaptation gates (i.e. an XOR operation with the relevant adaptation key) between encrypted AND gates to make the circuit work. Note that each “half” of the circuit has a different set of encrypted gates, and thus a different encryption key for each input. There are NOT gates here too (from the original circuit, which was made of ANDs and NOTs); they stay in place (the adaptation gate can go on either side of them) with no modifications or encryption.

Execution

Let’s recall that Bob is claiming to have the correct input for the original circuit – the input that makes that circuit output true.

Since Bob has all the encryption keys, he can encrypt this correct input to get the correct encrypted input, which should make the new circuit output true (well, encrypted true) as well.

Bob goes ahead and does this. He encrypts the input (since there are different encryption keys for either side of the circuit, he does this twice), and runs it through the circuit. He notes down the truth table entry utilized for each gate. He ensures that the output, once decrypted, is true (it should be, if everything has been done correctly till now).

Verification

He now reveals the details of the program execution to Alice. He reveals:

  • All adaptation gates involved (and their nonces, to verify the commitments)
  • All truth table entries involved in the execution (and their nonces …).
  • The output encryption key (and its nonce)
  • The encrypted inputs

Alice goes ahead and verifies that the commitments have not been reneged upon. Note that she also now has a full execution history. It’s an encrypted history – she can’t calculate the original input from it – but she can verify that the execution was faithfully carried out. While she doesn’t have the entire truth table for any encrypted gate, she has the entry that was used in the execution, which is enough. She just has to ensure that the inputs to a gate match the truth table entry, use the entry to see what the output is, apply the relevant adaptation key to get the input for the next gate, and repeat.
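
One step of that walk might look like this (a sketch; the real circuit is a graph of gates rather than a straight line, and the names here are mine):

```python
def step(entry, enc_in1, enc_in2, adaptation_key):
    """Verify one gate of the encrypted execution history.

    `entry` is the revealed truth-table row (enc. input 1, enc. input 2, enc. output);
    `enc_in1`/`enc_in2` are the encrypted wire values that fed this gate;
    the returned value is the encrypted wire feeding the next gate."""
    table_in1, table_in2, table_out = entry
    if (enc_in1, enc_in2) != (table_in1, table_in2):
        raise ValueError("execution history doesn't match the revealed table entry")
    return table_out ^ adaptation_key
```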

And there you have it. Alice has verified that Bob faithfully executed her verification circuit, and thus he must have the correct answer to her problem.

Tampering?

Let’s see if it’s possible for Bob to tamper with any of this. If Bob wishes to tamper with one of the gates, he has to tamper with the gates before calculating commitments, which means that the shuffling will get mixed up, which will mean that he can’t control where the tampered gate will end up in the final circuit. This is compounded by the fact that half the gates are revealed (so the tampered gate may end up in the wrong set), and that there are two copies of the circuit (so you need to tamper with both sides simultaneously, requiring even more luck on getting the shuffle where you want it).

The probability of Bob being able to execute a successful tamper can be adjusted by increasing the number of revealed gates, and increasing the duplication of the circuit. There is also the aforementioned fudge factor that can be introduced by having Alice choose where each encrypted gate should go after Bob has already provided commitments, and finally the procedure can be repeated as many times as necessary with a fresh set of encrypted gates to increase certainty. Unlike the graph coloring algorithm (where the uncertainty in a single run was large – if Bob has a couple of wrong edges there’s a relatively small chance he’ll get caught), here in a single run it is Bob who has a massive disadvantage, since he must tamper with exactly the right gates, and there’s very little chance that his tampered gates will fall in the right place based on Alice’s chosen ordering. Additionally, tampering with the gates in the first place is hard, since you need to avoid having them get revealed. I think that with reasonable choices on the level of duplication and number of revealed gates (e.g., not asking for something like 1000 duplicated circuits), it’s possible for Alice to get a very high level of certainty without needing to conduct multiple rounds.

How about the opposite question: Can Alice find out anything about the input, aside from the fact that it is correct, from the information she has? At first glance it seems like she can, because she can see the whole path of execution. In the case of a program with non-constant loops, this would be damning, since she could figure out how many iterations happened (and thus know the decrypted value for the number of loop iterations) and backtrack using that in a cleverly-written program. However, this program has no loops.

Looking at it closely, any encrypted history of execution can be changed to a different encrypted history of execution for the same nonencrypted execution by adding NOT gates wherever they don’t match, and then absorbing these NOT gates into the input or output keys (by NOTing them) of the adjacent encrypted AND gates. This means that without knowing the details of the encrypted gates, all histories of execution are equally possible for a given actual execution10. Therefore, knowing only a history of execution does not provide you further information about the actual execution, since it could equally have been for some other history of execution.

Bonus: Fixing the escrow and Bitcoin

(I’m going to assume basic knowledge of Bitcoin later on in this section)

After all this, we still only have a way for Bob to prove he has a solution. There’s no way of securely exchanging the solution for money (or whatever) without involving a trusted third party to handle the swap. This is known as escrow, where a third party is given both items for swapping, and after checking that everything is in order the third party completes the swap.

We can build on this so that the third party is only trusted with the money, and cannot actually peek at the answer.

It’s pretty straightforward: Bob and Alice mutually agree on a shared secret “pad” P. Bob takes his answer, bitwise-XORs it with the pad (which is of the same length as the answer) to get the padded input X, and then hashes X to get hash Y.
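
In code, that padding step is just an XOR with the shared one-time pad followed by a hash (SHA-256 here stands in for whichever hash function the circuit actually implements; the names are mine):

```python
import hashlib
import secrets

def pad_and_hash(answer: bytes, pad: bytes):
    """XOR the answer with the shared pad P, then hash the result."""
    assert len(pad) == len(answer)
    x = bytes(a ^ p for a, p in zip(answer, pad))   # padded input X
    y = hashlib.sha256(x).hexdigest()               # public hash Y
    return x, y

pad = secrets.token_bytes(9)                        # shared secret P
x, y = pad_and_hash(b"solution!", pad)
# Alice later recovers the answer from X by XORing with P again
assert bytes(c ^ p for c, p in zip(x, pad)) == b"solution!"
```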

Now, initially we had a verification program which proves the statement “This input is a solution to Alice’s problem”. We modify this program so that it proves the following two statements:

  • This input is a solution to Alice’s problem
  • When the input is XORd with P, and subsequently hashed, the hash that comes out is Y

Alice and Bob now go through the ZKP algorithm and the above is proven. Of course, they must keep the exchange between themselves, since the value of the pad (which can be extracted from the circuit) must remain secret.

Assuming that Bob isn’t able to cause any hash collisions, Alice at this point would be happy with a number that, when hashed, gives Y. This is something that escrow can verify, since neither Y nor X can be reverse-engineered to get the original answer unless you have P.

Now, Alice puts the money in escrow, and notifies the third party handling escrow of the value of Y (the hash). Bob puts the padded input X in escrow as well. The third party verifies that Y is the hash of X, and releases the money to Bob and the padded input to Alice. Since Alice knows pad P, she can XOR it with X to recover the original real input. Everyone walks away happy!

Well, maybe not. There is still the danger of the third party handling escrow walking away with the money. Why trust any one party?

It turns out that Bitcoin provides an alternative here. The technique described in Greg Maxwell’s article (called a Zero-Knowledge Contingent Payment) builds upon the above protocol using “scripts” in Bitcoin.

The way a Bitcoin transaction works is that anyone (well, the first person) who can solve the embedded challenge is allowed to use the money contained in it. Like a piñata. Except with money instead of candy and public-key cryptography instead of a stick.

Most Bitcoin transactions pay directly to a person, and they use a standard kind of challenge (the actual script is here). If Alice wishes to pay Bob 5 BTC, Alice crafts a transaction which says “anyone with the private key behind Bob’s public key (his address) may spend this money”. Of course, in practice this means that only Bob can spend the money. Alice created a piñata which only Bob can break.

We can build on this to make the Bitcoin network behave as a trustworthy escrow. After having stepped through the zero-knowledge protocol and being confident that Y is the hash of the padded input, Alice crafts a transaction which says “anyone with a string that results in this hash may spend this money”11. Bob has this string; it is the padded answer X. He makes a transaction with X as part of the input script (so that he can claim the money), and the Bitcoin network accepts it. Assuming Alice and Bob are not able to tamper with each other’s local networks, by the time Alice sees the transaction containing X, the network should have accepted this transaction already (though it may not yet be part of the blockchain), and Bob should be getting his money.

(If the crucial part is trusting that the escrow doesn’t run off with the money, and you don’t care whether other people can see the answer, you can skip the padding step and directly hash the input. I believe the proof of concept executed in Greg’s post did this, but I’m not sure.)

Thanks to Shantanu Thakoor, eternaleye, and ebfull for feedback on drafts of this post


  1. I have some physics friends who would probably enjoy this too.

  2. Actually, you still need a trusted third party to make the money-swap work, but it can be done in a way that the National Secrets Sudoku Solution isn’t actually shared with the third party. The Bitcoin article linked above describes a way to do away with a trusted third party, instead replacing it with the implicitly trusted Bitcoin network. We’ll discuss this further at the end of the post.

  3. With free sharpener!

  4. There are seven edges. This is a conservative estimate, assuming that Bob’s graph has one bad edge. More mistakes increase this probability, but it becomes more cumbersome to calculate.

  5. We basically want to be able to write this as a series of sequentially-arranged logic gates. If a loop is bounded by a constant, it can just be unrolled. break and continue can be handled here, though goto cannot.

  6. Remember that the number of nodes and edges is already known, so we can just write a program “Check edge 1”, “Check edge 2”, … without needing to explicitly loop over everything

  7. Again, since the number being factorized is known beforehand, there are bounds on the sizes of its factors, and a multiplication circuit for a number of bounded size can be designed.

  8. mixing up a truth table doesn’t change how it works, but it makes it impossible to figure out the original entry just by knowing that your entry was the “third” entry or something

  9. You can actually add another fudge factor here by making Alice decide the gate numbering after having received gate commitments. If N isn’t that large, there’s still a small chance Bob can fake the output by permuting the original gates (and twiddling the nonces) until the tampered gates fall into the right spot. This removes that possibility to a reasonably high level of certainty, which can be strengthened by going through the whole procedure multiple times.

  10. We’re ignoring the commitments made by Bob here, which let us make the opposite statement – “this encrypted history of execution is the only one that’s possible given the commitments”. However, the commitments themselves don’t carry any new information per se; they instead lock in information which is revealed to you in the future (information which is not revealed at all cannot be reverse-engineered from the commitments, so that’s safe too). This means that Alice cannot use them to glean anything about the decrypted input, and we can ignore them for the time being.

  11. She should probably also add a clause that requires Bob’s private key to sign something, so that someone else can’t copy the answer from Bob’s transaction and steal the money. Additional work can be done to make it so that if the transaction goes unclaimed, Alice can reclaim the money.