I believe the difference is rooted in the way systems which must bound their pauses will sacrifice nice-to-haves to more cleanly satisfy that hard requirement.
Researchers and library writers instead tend to aim for maximal guarantees or functionality while assuming (sacrificing) as little as possible. Someone who’s sprinkling lock-free algorithms in a large codebase will similarly want to rely on maximally general algorithms, in order to avoid increasing the scope of their work: if tail latency and lock-freedom were high enough priorities to justify wide-ranging changes, the program would probably have been designed that way from the start.
It makes sense to explore general solutions, and academia is definitely a good place to do so. It was fruitful for mathematicians to ask questions about complex numbers, then move on to fields, rings, groups, etc., like sadistic kids probing what a bug can still do as they tear its legs off one by one. Quicksort and mergesort are probably the strongest exemplars of that sort of research in computer science: we don’t even ask what data is made of before assuming a comparison sort is probably a good idea.
It is however more typical to trade something in return for generality. When there’s no impact on performance or resource usage, code complexity usually takes a hit.
When solving a core problem like lock-freedom in a parallel realtime system, we instead ask how much more we can assume, what else we can give up, in order to obtain simpler, more robust solutions. We don’t want generality, we’re not crippling bugs; we want specificity, we’re dosing eggs with Hox to get more legs.
The first time someone used to academic non-blocking algorithms hears about the resulting maximally specialised solutions, they’ll sometimes complain about “cheating.” Of course, it’s never cheating when a requirement actually is met; the surprise merely shows that the rules typically used to evaluate academic solutions are but approximations of reality, and can be broken… and practitioners faced with specific problems are ideally placed to determine what rules they can flout.
My favourite example of such cheating is type-stable memory. The literature on safe memory reclamation (SMR) conflates^{1} two problems that are addressed by SMR algorithms: reclamation races, and the ABA problem.
A reclamation race is what happens when a thread dereferences a pointer to the heap, but the pointee has already been deallocated; even when the racy accesses are loads, they can result in a segmentation fault (and a crash).
The ABA problem is what happens when a descriptor (e.g., a pointer) is reused to refer to something else, but some code is unaware of the swap. For example, a thread could load a global pointer to a logically read-only object, read data off that pointer, sleep for a while, and observe that the global pointer has the same value. That does not mean nothing has changed: while the thread was sleeping, the pointee could have been freed, and then recycled to satisfy a fresh allocation.
Classic SMR algorithms like epoch reclamation and hazard pointers solve both problems at once; in fact, addressing reclamation races is their core contribution (it’s certainly the challenging part), and ABA protection is simply a nice corollary.
However, programs can choose not to crash on benign use-after-free: reading from freed objects only triggers crashes when memory is mapped and unmapped dynamically, and that’s usually not an option for hard realtime systems. On smaller embedded targets, there’s a fixed physical memory budget; either the program fits, or the program is broken. Larger shared-memory parallel programs often can’t afford the IPIs and other hiccups associated with releasing memory to the operating system. Either way, half the problem solved by SMR doesn’t even exist for them.
The other half, ABA, is still an issue… but that subproblem is easier to solve. For example, we can tag data with sequence counters.^{2}
A lock-free multiple producers / single consumer linked stack might be the simplest lock-free data structure.^{3}
Pushing a new record to such a stack is easy:^{4} load the current top of stack pointer, publish that in our new record’s “next” (CDR) field, and attempt to replace the top of stack with a pointer to our new record with a compare-and-swap (CAS).
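As a sketch, the push sequence could look like the following (C11 atomics, with an illustrative `int` payload; the names are made up for this example):

```c
#include <stdatomic.h>
#include <stddef.h>

struct node {
    struct node *next; /* the "CDR" field */
    int value;         /* illustrative payload */
};

static struct node *_Atomic top; /* NULL = empty stack */

/* Publish the current top of stack in our record's "next" field,
 * then try to swing the top-of-stack pointer to our record. */
static void push(struct node *n)
{
    struct node *cur = atomic_load(&top);

    do {
        n->next = cur;
        /* On failure, cur is updated with the current top; retry. */
    } while (!atomic_compare_exchange_weak(&top, &cur, n));
}
```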
How do we consume from such a stack?
The simplest way is to use a fetch-and-set (atomic exchange) to simultaneously clear the stack (set the top-of-stack pointer to the “empty stack” sentinel, e.g., NULL) and read the previous top-of-stack. Any number of consumers can concurrently execute such a batch pop, although only one will grab everything.
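The batch pop is a one-liner over the same kind of representation (again with hypothetical names):

```c
#include <stdatomic.h>
#include <stddef.h>

struct node {
    struct node *next;
    int value;
};

/* Grab the whole stack: atomically replace the top-of-stack pointer
 * with the empty sentinel (NULL) and return the old chain.  Any
 * number of consumers may race on this; exactly one gets the chain. */
static struct node *pop_all(struct node *_Atomic *top)
{
    return atomic_exchange(top, NULL);
}
```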
Alternatively, if there’s only one consumer at a time, it can pop with a compare-and-swap. The consumer must load the current top-of-stack pointer. If the stack is empty, there’s nothing to pop. Otherwise, it can read the top record’s “next” field, and attempt to CAS out the top-of-stack pointer from the one it just read to the “next” record.
The tricky step here is the one where the consumer reads the “next” field in the current top-of-stack record: that step would be subject to reclamation races, except that there’s only one consumer, so we know no one else concurrently popped that record and freed it.
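A sketch of that single-consumer pop; the unguarded read of `cur->next` is only safe because no other consumer can pop (and free) `cur` from under us:

```c
#include <stdatomic.h>
#include <stddef.h>

struct node {
    struct node *next;
    int value;
};

/* Single-consumer pop: load the top, read its "next" field, and CAS
 * the top-of-stack pointer from the former to the latter. */
static struct node *pop(struct node *_Atomic *top)
{
    struct node *cur = atomic_load(top);

    while (cur != NULL) {
        /* Only safe because no one else pops (and frees) cur. */
        struct node *next = cur->next;

        if (atomic_compare_exchange_weak(top, &cur, next))
            return cur;
        /* On failure, cur holds the new top; retry. */
    }

    return NULL; /* empty stack */
}
```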
What can we do to support multiple consumers? Can we simply make sure that stack records are always safe to read, e.g., by freeing them to an object pool? Unfortunately, while use-after-free is benign for producers, it is not for consumers.
The key problem is that a consumer can observe that the top of stack points to record A, and that record A’s “next” field points to B, and then get stuck or sleep for a while. During that time, another thread pops A and B, frees both, pushes C, and then pushes A’, a new record that happens to have the same address as A. Finally, the initial consumer will compare the top-of-stack pointer with A (which also matches A’), and swap that for B, resurrecting a record that has already been consumed and freed.
Full-blown SMR would fix all that. However, if we can instead assume reads after free do not crash (e.g., we use a type-stable allocator or an explicit object pool for records), we simply have to reliably detect when a record has returned to the top of the stack.^{5}
We can do that by tagging the top-of-stack pointer with a sequence counter, and updating both with a double-wide compare-and-swap: instead of CASing the top-of-stack pointer alone, we CAS that pointer together with its monotonically increasing counter. Every successful CAS of the pointer also increments the counter by one, so the sequence counter will differ when a record is popped and pushed back on the stack.
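One way to sketch this, here with a 32-bit record index standing in for the pointer, packed with a 32-bit counter in a single 64-bit word so a plain single-word CAS suffices (the field widths are illustrative; a double-wide CAS would fit full pointers):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Pack a 32-bit record index with a 32-bit sequence counter; a
 * single 64-bit CAS then updates both atomically. */
#define PACK(idx, seq) (((uint64_t)(seq) << 32) | (uint32_t)(idx))
#define INDEX(word) ((uint32_t)(word))
#define SEQ(word) ((uint32_t)((word) >> 32))

/* Every successful update also increments the counter, so a record
 * that returns to the top of the stack carries a different tag. */
static bool try_replace(_Atomic uint64_t *top,
                        uint64_t expected, uint32_t new_idx)
{
    uint64_t desired = PACK(new_idx, SEQ(expected) + 1);

    return atomic_compare_exchange_strong(top, &expected, desired);
}
```

A stale snapshot now fails its CAS even when the same record index comes back, because the counter has moved on.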
There is still a risk of ABA: the counter can wrap around. That’s not a practical concern with 64-bit counters, and there are reasonable arguments that narrower counters are safe because no consumer will stay “stuck” for minutes or hours.^{6}
Sometimes, the application can naturally guarantee that CASed fields are ABA-free.
For example, a hierarchical bump allocator may carve out global type-specific arenas from a shared chunk of address space, and satisfy allocations for each object type from the type’s current arena. Within an arena, allocations are reserved with atomic increments. Similarly, we carve out each arena from the shared chunk of address space with a (larger) atomic increment. Neither bump pointer ever decreases: once a region of address space has been converted to an arena, it stays that way, and once an object has been allocated from an arena, it also remains allocated (although it might enter a freelist). Arenas are also never recycled: once exhausted, they stay exhausted.
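A sketch of the per-arena half (the names and layout are made up for illustration): reserving an allocation is a single fetch-and-add on a bump offset that only ever grows.

```c
#include <stdatomic.h>
#include <stddef.h>

/* One arena: a base pointer, a capacity, and a monotonically
 * increasing bump offset. */
struct arena {
    _Atomic size_t used;
    size_t capacity;
    unsigned char *base;
};

/* Reserve `size` bytes with an atomic increment.  On exhaustion,
 * callers must grab a fresh arena from the shared chunk (not shown);
 * `used` may overshoot the capacity, but it never decreases. */
static void *arena_alloc(struct arena *a, size_t size)
{
    size_t offset = atomic_fetch_add(&a->used, size);

    if (offset + size > a->capacity)
        return NULL; /* exhausted */

    return a->base + offset;
}
```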
When an allocation type has exhausted its current arena (the arena’s bump pointer has reached the arena’s capacity), we want to atomically grab a new arena from the shared chunk of address space, and replace the type’s arena pointer with the newly created arena.
A lock-free algorithm for such a transaction looks like it would have to build on top of multi-word compare-and-swap (MCAS), a hard operation that can be implemented in a wait-free manner, but with complex algorithms.
However, we know that the compare-and-swapped state evolves monotonically: once an arena has been carved out from the shared chunk, it will never be returned as a fresh arena again. In other words, there is no ABA, and a compare-and-swap on an arena pointer will never succeed spuriously.
Monotonicity also means that we can acquire a consistent snapshot of both the type’s arena pointer and the chunk’s allocation state by reading everything twice. Values are never repeated, so any write that executes concurrently with our snapshot loop will be detected: the first and second reads of the updated data will differ.
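For two monotonic words whose values never repeat, the snapshot loop might look like this sketch:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Snapshot two monotonic words.  If `first` re-reads identical, it
 * held that value while we read `second`: values never repeat, so
 * the pair is a consistent point-in-time snapshot. */
static void snapshot(_Atomic uint64_t *first, _Atomic uint64_t *second,
                     uint64_t *fst_out, uint64_t *snd_out)
{
    do {
        *fst_out = atomic_load(first);
        *snd_out = atomic_load(second);
    } while (atomic_load(first) != *fst_out);
}
```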
We also know that the only way a type’s arena pointer can be replaced is by allocating a new one from the shared chunk. If we took a consistent snapshot of the type’s arena pointer and of the shared chunk’s allocation state, and the allocation state hasn’t changed since, the arena pointer must also be unchanged (there’s a hierarchy).
We can combine all these properties to atomically replace a type’s arena pointer with a new one obtained from the shared chunk, using a much simpler core operation, a single-compare multiple-swap (SCMS). We want to execute a series of CASes (one to allocate an arena in the chunk, a few to initialise the arena, and one to publish the new arena), but we can also assume that once the first updated location matches the CAS’s expected value, all other ones will as well. In short, only the first CAS may fail.
That’s the key simplifier compared to full-blown multi-word compare-and-swap algorithms: they have to incrementally acquire update locations, any of which might turn the operation into a failure.
We can instead encode all the CASes in a transaction descriptor, CAS that descriptor in the first update location, and know that the multi-swaps will all succeed iff that CAS is successful.
If the first CAS is successful, we also know that it’s safe to execute the remaining CASes, and finally replace the descriptor with its final value with one last CAS. We don’t even have to publish the descriptor to all updated locations, because concurrent allocations will notice the current arena has been exhausted, and try to get a new one from the shared chunk… at which point they will notice the transaction descriptor.
All the CASes after the first one are safe to execute arbitrarily often thanks to monotonicity. We already know that any descriptor that has been published with the initial CAS will not fail, which means the only potential issue is spuriously successful CASes… but our mutable fields never repeat a value, so that can’t happen.
The application’s guarantee of ABA-safety ends up really simplifying this single-compare multiple-swap algorithm (SCMS), compared to a multi-word compare-and-swap (MCAS). In a typical MCAS implementation, helpers must abort when they detect that they’re helping a MCAS operation that has already failed or already been completed. Our single-compare assumption (once the first CAS succeeds, the operation succeeds) takes care of the first case: helpers never see failed operations. Lack of ABA means helpers don’t have to worry about their CASes succeeding after the SCMS operation has already been completed: they will always fail.
Finally, we don’t even need any form of SMR on the transaction descriptor: a sequence counter in the descriptor and a copy of that counter in a tag next to pointers to that descriptor suffice to disambiguate incarnations of the same physical descriptor.
Specialising to the allocator’s monotonic state let us use single-compare multiple-swap, a simplification of full multi-word compare-and-swap, and further specialising that primitive for monotonic state let us get away with nearly half as many CASes (k + 1 for k locations) as the state of the art for MCAS (2k + 1 for k locations).
There is a common thread between never unmapping allocated addresses, sequence tags, type-stable memory, and the allocator’s single-compare multiple-swap: monotonicity.
The lock-free stack shows how easy it is to conjure up artificial monotonicity. However, when we integrate algorithms more tightly with the program and assume the program’s state is naturally monotonic, we’ll often unlock simpler and more efficient solutions. I also find there’s something of a virtuous cycle: it’s easier for a module to guarantee monotonicity to its components when it itself only has to handle monotonic state, like a sort of end-to-end monotonicity principle.
Unfortunately, it’s not clear how much latent monotonicity there is in real programs. I suppose that makes it hard to publish algorithms that assume its presence. I think it nevertheless makes sense to explore such stronger assumptions, in order to help practitioners estimate what we could gain in exchange for small sacrifices.
Asymmetric synchronisation is widely used these days, but I imagine it was once unclear how much practical interest there might be in that niche; better understood benefits lead to increased adoption. I hope the same can happen for algorithms that assume monotonic state.
Maged Michael’s original Safe Memory Reclamation paper doesn’t: allowing arbitrary memory management is the paper’s main claim. I think there’s a bit of a first mover’s advantage, and researchers are incentivised to play within the sandbox defined by Michael. For example, Arbel-Raviv and Brown in “Reuse, don’t Recycle” practically hide the implementation of their proposal on page 17 (Section 5), perhaps because a straightforward sequence counter scheme is too simple for publication nowadays. ↩
See Reuse, don’t Recycle for a flexible take. ↩
A stack fundamentally focuses contention towards the top-of-stack pointer, so lock-free definitely doesn’t imply scalable. It’s still a good building block. ↩
In assembly language, anyway. Language memory models make this surprisingly hard. For example, any ABA in the push sequence is benign (we still have the correct bit pattern in the “next” field), but C and C++’s pointer provenance rules say that accessing a live object through a pointer to a freed object that happens to alias the new object is undefined behaviour. ↩
Load Linked / Store Conditional solves this specific problem, but that doesn’t mean LL/SC as found on real computers is necessarily a better primitive than compare-and-swap. ↩
Which is nice, because it means we can pack data and sequence counters in 64-bit words, and use the more widely available single-word compare-and-swap. ↩
Like ridiculous fish mentions in his review of integer divisions on M1 and Xeon, certain divisors (those that lose a lot of precision when rounding up to a fraction of the form \(n / 2^k\)) need a different, slower, code path in classic implementations. Powers of two are also typically different, but at least divert to a faster sequence, a variable right shift.
Reciprocal instead uses a unified code path to implement two expressions, \(f_{m,s}(x) = \left\lfloor \frac{m x}{2^s} \right\rfloor\) and \(g_{m^\prime,s^\prime}(x) = \left\lfloor\frac{m^\prime \cdot \min(x + 1, \mathtt{u64::MAX})}{2^{s^\prime}}\right\rfloor\), that are identical except for the saturating increment of \(x\) in \(g_{m^\prime,s^\prime}(x)\).
The first expression, \(f_{m,s}(x)\) corresponds to the usual div-by-mul approximation (implemented in gcc, LLVM, libdivide, etc.) where the reciprocal \(1/d\) is approximated in fixed point by rounding \(m\) up, with the upward error compensated by the truncating multiplication at runtime. See, for example, Granlund and Montgomery’s Division by invariant integers using multiplication.
The second, \(g_{m^\prime,s^\prime}(x)\), is the multiply-and-add scheme described by Robison in N-Bit Unsigned Division Via N-Bit Multiply-Add.
In that approximation, the reciprocal multiplier \(m^\prime\) is rounded down when converting \(1/d^\prime\) to fixed point. At runtime, we then bump the product up (by the largest value \(\frac{n}{2^{s^\prime}} < 1/d^\prime\), i.e., \(\frac{m^\prime}{2^{s^\prime}}\)) before dropping the low bits.
With a bit of algebra, we see that \(m^\prime x + m^\prime = m^\prime (x + 1)\)… and we can use a saturating increment to avoid a 64x65 multiplication as long as we don’t trigger this second expression for divisors \(d^\prime\) for which \(\left\lfloor \frac{\mathtt{u64::MAX}}{d^\prime}\right\rfloor \neq \left\lfloor \frac{\mathtt{u64::MAX} - 1}{d^\prime}\right\rfloor\).
We have a pair of dual approximations, one that rounds the reciprocal up to a fixed point value, and another that rounds down; it makes sense to round to nearest, which nets us one extra bit of precision in the worst case, compared to always applying one or the other.
Luckily,^{1} all of u64::MAX’s factors (except 1 and u64::MAX) work with the “round up” approximation that doesn’t increment, so the saturating increment is always safe when we actually want to use the second “round-down” approximation (unless \(d^\prime \in \{1, \mathtt{u64::MAX}\}\)).
This duality is the reason why Reciprocal can get away with 64-bit multipliers.
Even better, \(f_{m,s}\) and \(g_{m^\prime,s^\prime}\) differ only in the absence or presence of a saturating increment. Rather than branching, Reciprocal executes a data-driven increment by 0 or 1, for \(f_{m,s}(x)\) or \(g_{m^\prime,s^\prime}(x)\) respectively. The upshot: predictable improvements over hardware division, even when dividing by different constants.
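To make this concrete, here is a sketch of the unified sequence with constants I derived by hand (they are illustrative, not necessarily the ones Reciprocal computes): division by 3 uses the round-up approximation (\(m = \lceil 2^{65}/3\rceil\), increment 0), and division by 7 the round-down one (\(m^\prime = \lfloor 2^{66}/7\rfloor\), increment 1).

```c
#include <stdint.h>

/* One fixed-point reciprocal: multiplier, shift, and a data-driven
 * increment (0 for the round-up form f, 1 for the round-down g). */
struct reciprocal {
    uint64_t mult;
    unsigned shift;
    uint64_t inc;
};

/* Same code path for both approximations: saturating increment,
 * full 64x64->128 multiply, then shift the extra bits out. */
static uint64_t apply(struct reciprocal r, uint64_t x)
{
    uint64_t y = x + r.inc;

    if (y < x)
        y = UINT64_MAX; /* saturate */

    return (uint64_t)(((unsigned __int128)r.mult * y) >> r.shift);
}

/* Hand-derived, illustrative constants. */
static const struct reciprocal div3 = { 12297829382473034411ULL, 65, 0 };
static const struct reciprocal div7 = { 10540996613548315209ULL, 66, 1 };
```

The increment is just data, so dividing by 3 and by 7 executes the exact same instructions.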
Summary of the results below: when measuring the throughput of independent divisions on my i7 7Y75 @ 1.3 GHz, Reciprocal consistently needs 1.3 ns per division, while hardware division can only achieve ~9.6 ns / division (Reciprocal needs 14% as much / 86% less time). This looks comparable to the results reported by fish for libdivide when dividing by 7. Fish’s libdivide no doubt does better on nicer divisors, especially powers of two, but it’s good to know that a simple implementation comes close.
We’ll also see that, in Rust land, the fast_divide crate is dominated by strength_reduce, and that strength_reduce is only faster than Reciprocal when dividing by powers of two (although, looking at the disassembly, it probably comes close for single-result latency).
First, results for division with the same precomputed inverse. The timings are from criterion.rs, for \(10^4\) divisions in a tight loop.
The last two options are the crates I considered before writing Reciprocal. The strength_reduce crate switches between a special case for powers of two (implemented as a bitscan and a shift), and a general slow path that handles everything with a 128-bit fixed point multiplier. fast_divide is inspired by libdivide and implements the same three paths: a fast path for powers of two (shift right), a slow path for reciprocal multipliers that need one more bit than the word size (e.g., division by 7), and a regular round-up div-by-mul sequence.
Let’s look at the three cases in that order.
\(10^4\) divisions by 2 (i.e., a mere shift right by 1)
hardware_u64_div_2           time: [92.297 us 95.762 us 100.32 us]
compiled_u64_div_by_2        time: [2.3214 us 2.3408 us 2.3604 us]
reciprocal_u64_div_by_2      time: [12.667 us 12.954 us 13.261 us]
strength_reduce_u64_div_by_2 time: [2.8679 us 2.9190 us 2.9955 us]
fast_divide_u64_div_by_2     time: [2.7467 us 2.7752 us 2.8025 us]
This is the comparative worst case for Reciprocal: while Reciprocal always uses the same code path (1.3 ns/division), the compiler shows we can do much better with a right shift. Both branchy implementations include a special case for powers of two, and thus come close to the compiler, thanks to a predictable branch into a right shift.
\(10^4\) divisions by 7 (a “hard” division)
hardware_u64_div_7           time: [95.244 us 96.096 us 97.072 us]
compiled_u64_div_by_7        time: [10.564 us 10.666 us 10.778 us]
reciprocal_u64_div_by_7      time: [12.718 us 12.846 us 12.976 us]
strength_reduce_u64_div_by_7 time: [17.366 us 17.582 us 17.827 us]
fast_divide_u64_div_by_7     time: [25.795 us 26.045 us 26.345 us]
Division by 7 is hard for compilers that do not implement the “rounded down” approximation described in Robison’s N-Bit Unsigned Division Via N-Bit Multiply-Add. This is the comparative best case for Reciprocal, since it always uses the same code (1.3 ns/division), but most other implementations switch to a slow path (strength_reduce enters a general case that is arguably more complex, but more transparent to LLVM). Even divisions directly compiled with LLVM are ~20% faster than Reciprocal: LLVM does not implement Robison’s round-down scheme, so it hardcodes a more complex sequence than Reciprocal’s.
\(10^4\) divisions by 11 (a regular division)
hardware_u64_div_11           time: [95.199 us 95.733 us 96.213 us]
compiled_u64_div_by_11        time: [7.0886 us 7.1565 us 7.2309 us]
reciprocal_u64_div_by_11      time: [12.841 us 13.171 us 13.556 us]
strength_reduce_u64_div_by_11 time: [17.026 us 17.318 us 17.692 us]
fast_divide_u64_div_by_11     time: [21.731 us 21.918 us 22.138 us]
This is a typical result. Again, Reciprocal can be trusted to work at 1.3 ns/division. Regular round-up div-by-mul works fine when dividing by 11, so code compiled by LLVM only needs a multiplication and a shift, nearly twice as fast as Reciprocal’s generic sequence. The fast_divide crate does do better here than when dividing by 7, since it avoids the slowest path, but Reciprocal is still faster; simplicity pays.
The three microbenchmarks above reward special-casing, since they always divide by the same constant in a loop, and thus always hit the same code path without ever incurring a mispredicted branch.
What happens to independent divisions by unpredictable precomputed divisors, for divisions by 2, 3, 7, or 11 (respectively easy, regular, hard, and regular divisors)?
hardware_u64_div time: [91.592 us 93.211 us 95.125 us]
reciprocal_u64_div time: [17.436 us 17.620 us 17.828 us]
strength_reduce_u64_div time: [40.477 us 41.581 us 42.891 us]
fast_divide_u64_div time: [69.069 us 69.562 us 70.100 us]
The hardware doesn’t care, and Reciprocal is only a bit slower (1.8 ns/division instead of 1.3 ns/division), presumably because the relevant PartialReciprocal^{2} struct must now be loaded in the loop body.

The other two branchy implementations seemingly take a hit proportional to the number of special cases. The strength_reduce hot path only branches once, to detect divisors that are powers of two; its runtime goes from 0.29–1.8 ns/division to 4.2 ns/division (at least 2.4 ns slower/division). The fast_divide hot path, like libdivide’s, switches between three cases, and goes from 0.28–2.2 ns/division to 7.0 ns/division (at least 4.8 ns slower/division).
And that’s why I prefer to start with predictable baseline implementations: unpredictable code with special cases can easily perform well on benchmarks, but, early on during development, it’s hard to tell how the benchmarks may differ from real workloads, and whether the special cases “overfit” on these differences.
With special cases for classes of divisors, most runtime div-by-mul implementations make you guess whether you’ll tend to divide by powers of two, by “regular” divisors, or by “hard” ones in order to estimate how they will perform. Worse, they also force you to take into account how often you’ll switch between the different classes. Reciprocal does not have that problem: its hot path is the same regardless of the constant divisor, so it has the same predictable performance for all divisors,^{3} and there’s only one code path, so we don’t have to worry about class switches.
Depending on the workload, it may make sense to divert to faster code paths, but it’s usually best to start without special cases when it’s practical to do so… and I think Reciprocal shows that, for integer division by constants, it is.
Is it luck? Sounds like a fun number theory puzzle. ↩
The struct is “partial” because it can’t represent divisions by 1 or u64::MAX. ↩
…all divisors except 1 and u64::MAX, which must instead use the more general Reciprocal struct. ↩
Nine months ago, we embarked on a format migration for the persistent (on-disk) representation of variable-length strings like symbolicated call stacks in the Backtrace server. We chose a variant of consistent overhead byte stuffing (COBS), a self-synchronising code, for the metadata (variable-length as well). This choice let us improve our software’s resilience to data corruption in local files, and then parallelise data hydration, which improved startup times by a factor of 10… without any hard migration from the old to the current on-disk data format.
In this post, I will explain why I believe that the representation of first resort for binary logs (write-ahead, recovery, replay, or anything else that may be consumed by a program) should be self-synchronising, backed by this migration and by prior experience with COBS-style encoding. I will also describe the specific algorithm (available under the MIT license) we implemented for our server software.
This encoding offers low space overhead for framing, fast encoding and faster decoding, resilience to data corruption, and a restricted form of random access. Maybe it makes sense to use it for your own data!
A code is self-synchronising when it’s always possible to unambiguously detect where a valid code word (record) starts in a stream of symbols (bytes). That’s a stronger property than prefix codes like Huffman codes, which only detect when valid code words end. For example, the UTF-8 encoding is self-synchronising, because initial bytes and continuation bytes differ in their high bits. That’s why it’s possible to decode multi-byte code points when tailing a UTF-8 stream.
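Concretely, a resynchronising UTF-8 reader only needs to inspect the top two bits of each byte (a minimal sketch):

```c
#include <stdbool.h>
#include <stdint.h>

/* In UTF-8, continuation bytes all match 10xxxxxx; anything else
 * starts a (single- or multi-byte) code point.  A reader dropped at
 * an arbitrary offset can thus skip ahead to the next start byte. */
static bool utf8_is_start(uint8_t byte)
{
    return (byte & 0xC0) != 0x80;
}
```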
The UTF-8 code was designed for small integers (Unicode code points), and can double the size of binary data. Other encodings are more appropriate for arbitrary bytes; for example, consistent overhead byte stuffing (COBS), a self-synchronising code for byte streams, offers a worst-case space overhead of one byte plus a 0.4% space blow-up.
Self-synchronisation is important for binary logs because it lets us efficiently (with respect to both run time and space overhead) frame records in a simple and robust manner… and we want simplicity and robustness because logs are most useful when something has gone wrong.
Of course, the storage layer should detect and correct errors, but things will sometimes fall through, especially for on-premises software, where no one fully controls deployments. When that happens, graceful partial failure is preferable to, e.g., losing all the information in a file because one of its pages went to the great bit bucket in the sky.
One easy solution is to spread the data out over multiple files or blobs. However, there’s a trade-off between keeping data fragmentation and file metadata overhead in check, and minimising the blast radius of minor corruption. Our server must be able to run on isolated nodes, so we can’t rely on design options available to replicated systems… plus bugs tend to be correlated across replicas, so there is something to be said for defense in depth, even with distributed storage.
When each record is converted with a self-synchronising code like COBS before persisting to disk, we can decode all records that weren’t directly impacted by corruption, exactly like decoding a stream of mostly valid UTF-8 bytes. Any form of corruption will only make us lose the records whose bytes were corrupted, and, at most, the two records that immediately precede or follow the corrupt byte range. This guarantee covers overwritten data (e.g., when a network switch flips a bit, or a read syscall silently errors out with a zero-filled page), as well as bytes removed or garbage inserted in the middle of log files.
The coding doesn’t store redundant information: replication or erasure coding is the storage layer’s responsibility. It instead guarantees to always minimise the impact of corruption, and only lose records that were adjacent to or directly hit by corruption.
A COBS encoding for log records achieves that by unambiguously separating records with a reserved byte (e.g., 0), and re-encoding each record to avoid that separator byte. A reader can thus assume that potential records start and end at a log file’s first and last bytes, and otherwise look for separator bytes to determine where to cut all potential records. These records may be invalid: a separator byte could be introduced or removed by corruption, and the contents of a correctly framed record may be corrupt. When that happens, readers can simply scan for the next separator byte and try to validate that new potential record. The decoder’s state resets after each separator byte, so any corruption is “forgotten” as soon as the decoder finds a valid separator byte.
On the write side, the encoding logic is simple (a couple dozen lines of C code), and uses a predictable amount of space, as expected from an algorithm suitable for microcontrollers.
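For reference, here is a sketch of the textbook COBS encoder and decoder (the classic algorithm, not necessarily the exact variant we ship):

```c
#include <stddef.h>
#include <stdint.h>

/* Classic COBS: replace each zero byte with the distance to the next
 * zero (or to the end), so the output contains no zero byte.  The
 * output buffer must have room for len + len / 254 + 1 bytes. */
static size_t cobs_encode(const uint8_t *src, size_t len, uint8_t *dst)
{
    size_t read = 0, write = 1, code_pos = 0;
    uint8_t code = 1;

    while (read < len) {
        if (src[read] == 0) {
            dst[code_pos] = code;
            code = 1;
            code_pos = write++;
        } else {
            dst[write++] = src[read];
            if (++code == 0xFF) { /* maximal run: restart the block */
                dst[code_pos] = code;
                code = 1;
                code_pos = write++;
            }
        }

        read++;
    }

    dst[code_pos] = code;
    return write;
}

static size_t cobs_decode(const uint8_t *src, size_t len, uint8_t *dst)
{
    size_t read = 0, write = 0;

    while (read < len) {
        uint8_t code = src[read++];

        for (uint8_t i = 1; i < code && read < len; i++)
            dst[write++] = src[read++];

        if (code != 0xFF && read < len)
            dst[write++] = 0; /* implicit zero between blocks */
    }

    return write;
}
```

Both directions are single passes over the data, with no lookahead beyond the current block.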
Actually writing encoded data is also easy: on POSIX filesystems, we can make sure each record is delimited (e.g., prefixed with the delimiter byte), and issue a regular O_APPEND write(2). Vectored writes can even insert delimiters without copying in userspace. Realistically, our code is probably less stable than the operating system and the hardware it runs on, so we make sure our writes make it to the kernel as soon as possible, and let fsyncs happen on a timer.
When a write errors out, we can blindly (maybe once or twice) try again: the encoding is independent of the output file’s state. When a write is cut short, we can still issue the same^{1} write call, without trying to “fix” the short write: the encoding and the read-side logic already protect against that kind of corruption.
What if multiple threads or processes write to the same log file?
When we open with O_APPEND, the operating system can handle the rest. This doesn’t make contention disappear, but at least we’re not adding a bottleneck in userspace on top of what is necessary to append to the same file.
Buffering is also trivial: the encoding is independent of the state of the destination file, so we can always concatenate buffered records and write the result with a single syscall.
This simplicity also plays well with high-throughput I/O primitives like io_uring, and with blob stores that support appends: independent workers can concurrently queue up blind append requests and retry on failure. There’s no need for application-level mutual exclusion or rollback.
Our log encoding will recover from bad bytes, as long as readers can detect and reject invalid records as a whole; the processing logic should also handle duplicated valid records. These are table stakes for a reliable log consumer.
In our variable-length metadata use case, each record describes a symbolicated call stack, and we recreate in-memory data structures by replaying an append-only log of metadata records, one for each unique call stack. The hydration phase handles invalid records by ignoring (not recreating) any call stack with corrupt metadata, but only those call stacks. That’s definitely an improvement over the previous situation, where corruption in a size header would prevent us from decoding the remainder of the file, and thus make us forget about all call stacks stored at file offsets after the corruption.
Of course, losing data should be avoided, so we are careful to fsync regularly and recommend reasonable storage configurations. However, one can only make data loss unlikely, not impossible (if only due to fat fingering), especially when cost is a factor. With the COBS encoding, we can recover gracefully and automatically from any unfortunate data corruption event.
We can also turn this robustness into new capabilities.
It’s often useful to process the tail of a log at a regular cadence. For example, I once maintained a system that regularly tailed hourly logs to update approximate views. One could support that use case with length footers. COBS framing lets us instead scan for a valid record from an arbitrary byte location, and read the rest of the data normally.
When logs grow large enough, we want to process them in parallel. The standard solution is to shard log streams, which unfortunately couples the parallelisation and storage strategies, and adds complexity to the write side.
COBS framing lets us parallelise readers independently of the writer. The downside is that the read-side code and I/O patterns are now more complex, but, all other things being equal, that’s a trade-off I’ll gladly accept, especially given that our servers run on independent machines and store their data in files, where reads are fine-grained and latency relatively low.
A parallel COBS reader partitions a data file arbitrarily (e.g., in fixed size chunks) for independent workers. A worker will scan for the first valid record starting inside its assigned chunk, and handle every record that starts in its chunk. Filtering on the start byte means that a worker may read past the logical end of its chunk, when it fully decodes the last record that starts in the chunk: that’s how we unambiguously assign a worker to every record, including records that straddle chunk boundaries.
Random access even lets us implement a form of binary or interpolation search on raw unindexed logs, when we know the records are (k-)sorted on the search key! This lets us, e.g., access the metadata for a few call stacks without parsing the whole log.
Eventually, we might also want to truncate our logs.
Contemporary filesystems like XFS (and even Ext4) support large sparse files. For example, sparse files can reach \(2^{63} - 1\) bytes on XFS with a minimal metadata-only footprint: the on-disk data for such sparse files is only allocated when we issue actual writes. Nowadays, we can sparsify files after the fact, and convert ranges of non-zero data into zero-filled “holes” in order to release storage without messing with file offsets (or even atomically collapse old data away).
Filesystems can only execute these operations at coarse granularity, but that’s not an issue for our readers: they must merely remember to skip sparse holes, and the decoding loop will naturally handle any garbage partial record left behind.
Cheshire and Baker’s original byte stuffing scheme targets small machines and slow transports (amateur radio and phone lines). That’s why it bounds the amount of buffering needed to 254 bytes for writers and 9 bits of state for readers, and attempts to minimise space overhead, beyond its worst-case bound of 0.4%.
The algorithm is also reasonable. The encoder buffers data until it
encounters a reserved 0 byte (a delimiter byte), or there are 254
bytes of buffered data. Whenever the encoder stops buffering, it
outputs a block whose contents are described by its first byte. If
the writer stopped buffering because it found a reserved byte, it
emits one byte with buffer_size + 1
before writing and clearing the
buffer. Otherwise, it outputs 255 (one more than the buffer size),
followed by the buffer’s contents.
On the decoder side, we know that the first byte of each block describes its size and decoded value (255 means 254 bytes of literal data, any other value is one more than the number of literal bytes to copy, followed by a reserved 0 byte). We denote the end of a record with an implicit delimiter: when we run out of data to decode, we should have just decoded an extra delimiter byte that’s not really part of the data.
With framing, an encoded record surrounded by delimiters thus looks like the following
|0 |blen|(blen - 1) literal data bytes....|blen|literal data bytes ...|0 |
The delimiting “0” bytes are optional at the beginning and end of a file, and each blen size prefix is one byte with value in \([1, 255]\). A value \(\mathtt{blen} \in [1, 254]\) represents a block of \(\mathtt{blen} - 1\) literal bytes, followed by an implicit 0 byte. If we instead have \(\mathtt{blen} = 255\), we have a block of \(254\) bytes, without any implicit byte. Readers only need to remember how many bytes remain until the end of the current block (eight bits for a counter), and whether they should insert an implicit 0 byte before decoding the next block (one binary flag).
We have different goals for the software we write at Backtrace. For our logging use case, we pass around fully constructed records, and we want to issue a single write syscall per record, with periodic fsync.^{2} Buffering is baked in, so there’s no point in making sure we can work with a small write buffer. We also don’t care as much about the space overhead (the worst-case bound is already pretty good) as we do about encoding and decoding speed.
These different design goals lead us to an updated hybrid word/byte stuffing scheme, described below.
This hybrid scheme improves encoding and decoding speed compared to COBS, and even marginally improves the asymptotic space overhead. At the low end, the worst-case overhead is only slightly worse than that of traditional COBS: we need three additional bytes, including the framing separator, for records of 252 bytes or fewer, and five bytes for records of 253-64260 bytes.
In the past, I’ve seen “word” stuffing schemes aim to reduce the run-time overhead of COBS codecs by scaling up the COBS loops to work on two or four bytes at a time. However, a byte search is trivial to vectorise, and there is no guarantee that frameshift corruption will be aligned to word boundaries (for example, POSIX allows short writes of an arbitrary number of bytes).
Our hybrid word-stuffing looks for a reserved two-byte delimiter sequence at arbitrary byte offsets. We must still conceptually process bytes one at a time, but delimiting with a pair of bytes instead of with a single byte makes it easier to craft a delimiter that’s unlikely to appear in our data.
Cheshire and Baker do the opposite, and use a frequent byte (0) to
eliminate the space overhead in the common case. We care a lot more
about encoding and decoding speed, so an unlikely delimiter makes more
sense for us. We picked 0xfe 0xfd because that sequence doesn’t appear in small integers (unsigned, two’s complement, varint, single or double float) regardless of endianness, nor in valid UTF-8 strings.
Any positive integer with 0xfe 0xfd (254 253) in its bytes must be around \(2^{16}\) or more. If the integer is instead negative in two’s complement, 0xfe 0xfd equals -514 as a little-endian int16_t, and -259 in big endian (not as great, but not nothing). Of course, the sequence could appear in two adjacent uint8_ts, but otherwise, 0xfe or 0xfd can only appear in the most significant byte of large 32- or 64-bit integers (unlike 0xff, which could be sign extension for, e.g., -1).
Any (U)LEB varint that includes 0xfe 0xfd must span at least 3 bytes (i.e., 15 bits), since both these bytes have the most significant bit set to 1. Even a negative SLEB has to be at least as negative as \(-2^{14} = -16384\).
For floating point types, we can observe that 0xfe 0xfd in the significand would represent an awful fraction in little or big endian, so it can only happen for the IEEE-754 representation of large integers (approximately \(\pm 2^{15}\)). If we instead assume that 0xfd or 0xfe appear in the sign and exponent fields, we find either very positive or very negative exponents (the exponent is biased, instead of complemented). A semi-exhaustive search confirms that the smallest integer-valued single float that includes the sequence is 32511.0 in little endian and 130554.0 in big endian; among integer-valued double floats, we find 122852.0 and 126928.0 respectively.

Finally, the sequence isn’t valid UTF-8 because both 0xfe and 0xfd have their top bit set (indicating a multi-byte code point), but neither looks like a continuation byte: the two most significant bits are 0b11 in both cases, while UTF-8 continuations must have 0b10.
Consistent overhead byte stuffing rewrites reserved 0 bytes away by counting the number of bytes from the beginning of a record until the next 0, and storing that count in a block size header followed by the non-reserved bytes, then resetting the counter, and doing the same thing for the remainder of the record. A complete record is stored as a sequence of encoded blocks, none of which include the reserved byte 0. Each block header spans exactly one byte, and must never itself be 0, so the byte count is capped at 254, and incremented by one (e.g., a header value of 1 represents a count of 0); when the count in the header is equal to the maximum, the decoder knows that the encoder stopped short without finding a 0.
With our two-byte reserved sequence, we can encode the size of each block in radix 253 (0xfd); given a two-byte header for each block, sizes can go up to \(253^2 - 1 = 64008\). That’s a reasonable granularity for memcpy. This radix conversion replaces the off-by-one weirdness in COBS: that part of the original algorithm merely encodes values from \([0, 254]\) into one byte while avoiding the reserved byte 0.

A two-byte size prefix is a bit ridiculous for small records (ours tend to be on the order of 30-50 bytes). We thus encode the first block specially, with a single byte in \([0, 252]\) for the size prefix. Since the reserved sequence 0xfe 0xfd is unlikely to appear in our data, the encoding for short records often boils down to adding a uint8_t length prefix.
A framed encoded record now looks like
|0xfe|0xfd|blen|blen literal bytes...|blen_1|blen_2|literal bytes...|0xfe|0xfd|
The first blen is in \([0, 252]\) and tells us how many literal bytes follow in the initial block. If the initial \(\mathtt{blen} = 252\), the literal bytes are immediately followed by the next block’s decoded contents. Otherwise, we must first append an implicit 0xfe 0xfd sequence… which may be the artificial reserved sequence that marks the end of every record.

Every subsequent block comes with a two-byte size prefix, in little-endian radix-253. In other words, |blen_1|blen_2| represents the block size \(\mathtt{blen}_1 + 253 \cdot \mathtt{blen}_2\), where \(\mathtt{blen}_{1,2} \in [0, 252]\). Again, if the block size is the maximum encodable size, \(253^2 - 1 = 64008\), we have literal data followed by the next block; otherwise, we must append a 0xfe 0xfd sequence to the output before moving on to the next block.
The encoding algorithm is only a bit more complex than for the original COBS scheme.
Assume the data to encode is suffixed with an artificial two-byte reserved sequence 0xfe 0xfd.

For the first block, look for the reserved sequence in the first 252 bytes. If we find it, emit its position (must be less than 252) in one byte, then all the data bytes up to but not including the reserved sequence, and enter regular encoding after the reserved sequence. If the sequence isn’t in the first block, emit 252, followed by 252 bytes of data, and enter regular encoding after those bytes.
For regular (all but the first) blocks, look for the reserved sequence in the next 64008 bytes. If we find it, emit the sequence’s byte offset (must be less than 64008) in little-endian radix 253, followed by the data up to but not including the reserved sequence, and skip that sequence before encoding the rest of the data. If we don’t find the reserved sequence, emit 64008 in radix 253 (0xfc 0xfc), copy the next 64008 bytes of data, and encode the rest of the data without skipping anything.
Remember that we conceptually padded the data with a reserved sequence at the end. This means we’ll always observe that we fully consumed the input data at a block boundary. When we encode the block that stops at the artificial reserved sequence, we stop (and frame with a reserved sequence to delimit a record boundary).
You can find our implementation in the stuffed-record-stream repository.
When writing short records, we already noted that the encoding step is often equivalent to adding a one-byte size prefix. In fact, we can encode and decode all records of size up to \(252 + 64008 = 64260\) bytes in place, and only ever have to slide the initial 252-byte block: whenever a block is shorter than the maximum length (252 bytes for the first block, 64008 for subsequent ones), that’s because we found a reserved sequence in the decoded data. When that happens, we can replace the reserved sequence with a size header when encoding, and undo the substitution when decoding.
Our code does not implement these optimisations because encoding and decoding stuffed bytes aren’t bottlenecks for our use case, but it’s good to know that we’re nowhere near the performance ceiling.
The stuffing scheme only provides resilient framing. That’s essential, but not enough for an abstract stream or sequence of records. At the very least, we need checksums in order to detect invalid records that happen to be correctly encoded (e.g., when a block’s literal data is overwritten).
Our pre-stuffed records start with the little-endian header
struct record_header {
uint32_t crc;
uint32_t generation;
};
where crc is the crc32c of the whole record, including the header,^{3} and generation is a yet-unused arbitrary 32-bit payload that we added for forward compatibility. There is no size field: the framing already handles that.
The remaining bytes in a record are an arbitrary payload. We use protobuf messages to help with schema evolution (and keep messages small and flat for decoding performance), but there’s no special relationship between the stream of word-stuffed records and the payload’s format.
Our implementation lets writers output to buffered FILE streams, or directly to file descriptors. Buffered streams offer higher write throughput, but are only safe when the caller handles synchronisation and flushing; we use them as part of a commit protocol that fsyncs and publishes files with atomic rename syscalls.
During normal operations, we instead write to file descriptors opened with O_APPEND and a background fsync worker: in practice, the hardware and operating system are more stable than our software, so it’s more important that encoded records immediately make it to the kernel than all the way to persistent storage. We also avoid batching write syscalls because we would often have to wait several minutes if not hours to buffer more than two or three records.
For readers, we can either read from a buffer, or mmap in a file and read from the resulting buffer. While we expose a linear iterator interface, we can also override the start and stop byte offsets of an iterator; we use that capability to replay logs in parallel. Finally, when readers advance an iterator, they can choose to receive a raw data buffer, or have it decoded with a protobuf message descriptor.
We have happily been using this log format for more than nine months to store a log of metadata records that we replay every time the Backtrace server restarts.
Decoupling writes from the parallel read strategy let us improve our startup time incrementally, without any hard migration. Serialising with flexible schemas (protocol buffers) also made it easier to start small and slowly add optional metadata, and only enforce a hard switch-over when we chose to delete backward compatibility code.
This piecemeal approach let us transition from a length-prefixed data format to one where all important metadata lives in a resilient record stream, without any breaking change. We slowly added more metadata to records and eventually parallelised loading from the metadata record stream, all while preserving backward and forward compatibility. Six months after the initial roll out, we flipped the switch and made the new, more robust, format mandatory; the old length-prefixed files still exist, but are now bags of arbitrary checksummed data bytes, with metadata in record streams.
In the past nine months, we’ve gained a respectable amount of pleasant operational experience with the format. Moreover, while performance is good enough for us (the parallel loading phase is currently dominated by disk I/O and parsing in protobuf-c), we also know there’s plenty of headroom: our records are short enough that they can usually be decoded without any write, and always in place.
We’re now laying the groundwork to distribute our single-node embedded database and to make it interact more fluently with other data stores. The first step will be generating a change data capture stream, and re-using the word-stuffed record format was an obvious choice.
Word stuffing is simple, efficient, and robust. If you can’t just defer to a real database (maybe you’re trying to write one yourself) for your log records, give it a shot! Feel free to play with our code if you don’t want to roll your own.
Thank you, Ruchir and Alex, for helping me clarify and restructure an earlier version.
If you append with the delimiter, it probably makes sense to special-case short writes and also prepend with the delimiter after failures, in order to make sure readers will observe a delimiter before the new record. ↩
High-throughput writers should batch records. We do syscall-per-record because the write load for the current use case is so sporadic that any batching logic would usually end up writing individual records. For now, batching would introduce complexity and bug potential for a minimal impact on write throughput. ↩
We overwrite the crc field with UINT32_MAX before computing a checksum for the header and its trailing data. It’s important to avoid zero prefixes because the result of crc-ing a 0 byte into a 0 state is… 0. ↩
The core of UMASH is a hybrid PH/(E)NH block compression function. That function is fast (it needs one multiplication for each 16-byte “chunk” in a block), but relatively weak: despite a 128-bit output, the worst-case probability of collision is \(2^{-64}\).
For a fingerprinting application, we want collision probability less than \(\approx 2^{-70},\) so that’s already too weak, before we even consider merging a variable-length string of compressed block values.
The initial UMASH proposal compresses each block with two independent compression functions. Krovetz showed that we could do so while reusing most of the key material (random parameters), with a Toeplitz extension, and I simply recycled the proof for UMASH’s hybrid compressor.
That’s good for the memory footprint of the random parameters, but doesn’t help performance: we still have to do double the work to get double the hash bits.
Earlier this month, Jim Apple pointed me at a promising alternative that doubles the hash bits with only one more multiplication. The construction adds finite field operations that aren’t particularly efficient in software, on top of the additional 64x64 -> 128 (carryless) multiplication, so it isn’t a slam dunk win over a straightforward Toeplitz extension. However, Jim felt like we could “spend” some of the bits we don’t need for fingerprinting (\(2^{-128}\) collision probability is overkill when we only need \(2^{-70}\)) in order to make do with faster operations.
Turns out he was right! We can use carryless multiplications by sparse constants (concretely, xor-shift and one more shift) without any reducing polynomial, on independent 64-bit halves… and still collide with probability at most \(2^{-98}\).
The proof is fairly simple, but relies on a bit of notation for clarity. Let’s start by re-stating UMASH’s hybrid PH/ENH block compressor in that notation.
The current block compressor in UMASH splits a 256-byte block \(m\) in 16 chunks \(m_i,\, i\in [0, 15]\) of 128 bits each, and processes all but the last chunk with a PH loop,
\[ \bigoplus_{i=0}^{14} \mathtt{PH}(k_i, m_i), \]
where
\[ \mathtt{PH}(k_i, m_i) = ((k_i \bmod 2^{64}) \oplus (m_i \bmod 2^{64})) \odot (\lfloor k_i / 2^{64} \rfloor \oplus \lfloor m_i / 2^{64} \rfloor) \]
and each \(k_i\) is a randomly generated 128-bit parameter.
The compression loop in UMASH handles the last chunk, along with a size tag (to protect against extension attacks), with ENH:
\[ \mathtt{ENH}(k, x, y) = \left( ((k + x) \bmod 2^{64}) \cdot ((\lfloor k / 2^{64}\rfloor + \lfloor x / 2^{64} \rfloor) \bmod 2^{64}) + y \right) \bmod 2^{128}. \]
The core operation in ENH is a full (64x64 -> 128) integer multiplication, which has lower latency than PH’s carryless multiplication on x86-64. That’s why UMASH switches to ENH for the last chunk. We use ENH for only one chunk because combining multiple NH values calls for 128-bit additions, and that’s slower than PH’s xors. Once we have mixed the last chunk and the size tag with ENH, the result is simply xored in with the previous chunks’ PH values:
\[ \left(\bigoplus_{i=0}^{14} \mathtt{PH}(k_i, m_i)\right) \oplus \mathtt{ENH}(k_{15}, m_{15}, \mathit{tag}). \]
This function is annoying to analyse directly, because we end up having to manipulate different proofs of almost-universality. Let’s abstract things a bit, and reduce the ENH/PH to the bare minimum we need to find our collision bounds.
Let’s split our message blocks in \(n\) (\(n = 16\) for UMASH) “chunks”, and apply an independently sampled mixing function to each chunk. Let’s say we have two messages \(m\) and \(m^\prime\) with chunks \(m_i\) and \(m^\prime_i\), for \(i\in [0, n)\), and let \(h_i\) be the result of mixing chunk \(m_i,\) and \(h^\prime_i\) that of mixing \(m^\prime_i.\)
We’ll assume that the first chunk is mixed with a \(2^{-w}\)-almost-universal (\(2^{-64}\) for UMASH) hash function: if \(m_0 \neq m^\prime_0,\) \(\mathrm{P}[h_0 = h^\prime_0] \leq 2^{-w}\) (where the probability is taken over the set of randomly chosen parameters for the mixer). Otherwise, \(m_0 = m^\prime_0 \Rightarrow h_0 = h^\prime_0\).
This first chunk stands for the ENH iteration in UMASH.
Every remaining chunk will instead be mixed with a \(2^{-w}\)-XOR-almost-universal hash function: if \(m_i \neq m^\prime_i\) (\(0 < i < n\)), \(\mathrm{P}[h_i \oplus h^\prime_i = y] \leq 2^{-w}\) for any \(y,\) where the probability is taken over the randomly generated parameter for the mixer.
This stronger condition represents the PH iterations in UMASH.
We hash a full block by xoring all the mixed chunks together:
\[ H = \bigoplus_{i = 0}^{n - 1} h_i, \]
and
\[ H^\prime = \bigoplus_{i = 0}^{n - 1} h^\prime_i. \]
We want to bound the probability that \(H = H^\prime \Leftrightarrow H \oplus H^\prime = 0,\) assuming that the messages differ (i.e., there is at least one index \(i\) such that \(m_i \neq m^\prime_i\)).
If the two messages only differ in \(m_0 \neq m^\prime_0\) (and thus \(m_i = m^\prime_i,\,\forall i \in [1, n)\)),
\[ \bigoplus_{i = 1}^{n - 1} h_i = \bigoplus_{i = 1}^{n - 1} h^\prime_i, \]
and thus \(H = H^\prime \Leftrightarrow h_0 = h^\prime_0\).
By hypothesis, the 0th chunks are mixed with a \(2^{-w}\)-almost-universal hash, so this happens with probability at most \(2^{-w}\).
Otherwise, assume that \(m_j \neq m^\prime_j\), for some \(j \in [1, n)\). We will rearrange the expression
\[ H \oplus H^\prime = h_j \oplus h^\prime_j \oplus \left(\bigoplus_{i\in [0, n) \setminus \{ j \}} h_i \oplus h^\prime_i\right). \]
Let’s conservatively replace that unwieldy sum with an adversarially chosen value \(y\):
\[ H \oplus H^\prime = h_j \oplus h^\prime_j \oplus y, \]
and thus \(H = H^\prime\) iff \(h_j \oplus h^\prime_j = y.\) By hypothesis, the \(j\)th chunk (every chunk but the 0th), is mixed with a \(2^{-w}\)-almost-XOR-universal hash, and this thus happens with probability at most \(2^{-w}\).
In both cases, we find a collision probability at most \(2^{-w}\) with a simple analysis, despite combining mixing functions from different families over different rings.
We combined strong mixers (each is \(2^{-w}\)-almost-universal), and only got a \(2^{-w}\)-almost-universal output. It seems like we should be able to do better when two or more chunks differ.
As Nandi points out, we can apply erasure codes to derive additional chunks from the original messages’ contents. We only need one more chunk, so we can simply xor together all the original chunks:
\[m_n = \bigoplus_{i=0}^{n - 1} m_i,\]
and similarly for \(m^\prime_n\). If \(m\) and \(m^\prime\) differ in only one chunk, \(m_n \neq m^\prime_n\). It’s definitely possible for \(m_n = m^\prime_n\) when \(m \neq m^\prime\), but only if two or more chunks differ.
We will again mix \(m_n\) and \(m^\prime_n\) with a fresh \(2^{-w}\)-almost-XOR-universal hash function to yield \(h_n\) and \(h^\prime_n\).
We want to xor the result \(h_n\) and \(h^\prime_n\) with the second (still undefined) hash values \(H_2\) and \(H^\prime_2\); if \(m_n \neq m^\prime_n\), the final xored values are equal with probability at most \(2^{-w}\), regardless of \(H_2\) and \(H^\prime_2\ldots\) and, crucially, independently of \(H \neq H^\prime\).
When the two messages \(m\) and \(m^\prime\) only differ in a single (initial) chunk, mixing a LRC checksum gives us an independent hash function, which squares the collision probability to \(2^{-2w}\).
Now to the interesting bit: we must define a second hash function that combines \(h_0,h_1,\ldots, h_{n - 1}\) and \(h^\prime_0, h^\prime_1, \ldots, h^\prime_{n - 1}\) such that the resulting hash values \(H_2\) and \(H^\prime_2\) collide independently enough of \(H\) and \(H^\prime\). That’s a tall order, but we do have one additional assumption to work with: we only care about collisions in this second hash function if the additional checksum chunks are equal, which means that the two messages differ in two or more chunks (or they’re identical).
For each index \(0 < i < n\), we’ll fix a public linear (with xor as the addition) function \(\overline{xs}_i(x)\). This family of functions must have two properties: \(x \oplus \overline{xs}_i(x)\) must be invertible for each \(i\), and the null space of \(\overline{xs}_i(x) \oplus \overline{xs}_j(x)\) must be small for each pair \(i \neq j\).
For regularity, we will also define \(\overline{xs}_0(x) = x\).
Concretely, let \(\overline{xs}_1(x) = x \mathtt{«} 1\), where the bitshift is computed for the two 64-bit halves independently, and \(\overline{xs}_i(x) = (x \mathtt{«} 1) \oplus (x \mathtt{«} i)\) for \(i > 1\), again with all the bitshifts computed independently over the two 64-bit halves.
To see that these satisfy our requirements, we can represent the functions as carryless multiplication by distinct “even” constants (the least significant bit is 0) on each 64-bit half: \(\overline{xs}_1\) multiplies each half by \(2\), and \(\overline{xs}_i\) multiplies by \(2 \oplus 2^i\) for \(i > 1\).
To recapitulate, we defined the first hash function as
\[ H = \bigoplus_{i = 0}^{n - 1} h_i, \]
the (xor) sum of the mixed value \(h_i\) for each chunk \(m_i\) in the message block \(m\), and similarly for \(H^\prime\) and \(h^\prime_i\).
We’ll let the second hash function be
\[ H_2 \oplus h_n = \left(\bigoplus_{i = 0}^{n - 1} \overline{xs}_i(h_i)\right) \oplus h_n, \]
and
\[ H^\prime_2 \oplus h^\prime_n = \left(\bigoplus_{i = 0}^{n - 1} \overline{xs}_i(h^\prime_i)\right) \oplus h^\prime_n. \]
We can finally get down to business and find some collision bounds. We’ve already shown that both \(H = H^\prime\) and \(H_2 \oplus h_n = H^\prime_2 \oplus h^\prime_n\) collide simultaneously with probability at most \(2^{-2w}\) when the checksum chunks differ, i.e., when \(m_n \neq m^\prime_n\).
Let’s now focus on the case when \(m \neq m^\prime\), but \(m_n = m^\prime_n\). In that case, we know that at least two chunks \(0 \leq i < j < n\) differ: \(m_i \neq m^\prime_i\) and \(m_j \neq m^\prime_j\).
If only two chunks \(i\) and \(j\) differ, and one of them is the \(i = 0\)th chunk, we want to bound the probability that
\[ h_0 \oplus h_j = h^\prime_0 \oplus h^\prime_j \]
and
\[ h_0 \oplus \overline{xs}_j(h_j) = h^\prime_0 \oplus \overline{xs}_j(h^\prime_j), \]
both at the same time.
Letting \(\Delta_i = h_i \oplus h^\prime_i\), we can reformulate the two conditions as
\[ \Delta_0 = \Delta_j \] and \[ \Delta_0 = \overline{xs}_j(\Delta_j). \]
Taking the xor of the two conditions yields
\[ \Delta_j \oplus \overline{xs}_j(\Delta_j) = 0, \]
which is only satisfied for \(\Delta_j = 0\), since \(f(x) = x \oplus \overline{xs}_j(x)\) is an invertible linear function. This also forces \(\Delta_0 = 0\).
By hypothesis, \(\mathrm{P}[\Delta_j = 0] \leq 2^{-w}\), and \(\mathrm{P}[\Delta_0 = 0] \leq 2^{-w}\) as well. These two probabilities are independent, so the probability that both hashes collide is at most \(2^{-2w}\) (\(2^{-128}\)).
In the other case, we have messages that differ in at least two chunks \(0 < i < j < n\): \(m_i \neq m^\prime_i\) and \(m_j \neq m^\prime_j\).
We can simplify the collision conditions to
\[ h_i \oplus h_j = h^\prime_i \oplus h^\prime_j \oplus y \]
and
\[ \overline{xs}_i(h_i) \oplus \overline{xs}_j(h_j) = \overline{xs}_i(h^\prime_i) \oplus \overline{xs}_j(h^\prime_j) \oplus z, \]
for \(y\) and \(z\) generated arbitrarily (adversarially), but without knowledge of the parameters that generated \(h_i, h_j, h^\prime_i, h^\prime_j\).
Again, let \(\Delta_i = h_i \oplus h^\prime_i\) and \(\Delta_j = h_j \oplus h^\prime_j\), and reformulate the conditions into
\[ \Delta_i \oplus \Delta_j = y \] and \[ \overline{xs}_i(\Delta_i) \oplus \overline{xs}_j(\Delta_j) = z. \]
Let’s apply the linear function \(\overline{xs}_i\) to the first condition
\[ \overline{xs}_i(\Delta_i) \oplus \overline{xs}_i(\Delta_j) = \overline{xs}_i(y); \]
since \(\overline{xs}_i\) isn’t invertible, the result isn’t equivalent, but is a weaker (necessary, not sufficient) version of the initial condition.
After xoring that with the second condition
\[ \overline{xs}_i(\Delta_i) \oplus \overline{xs}_j(\Delta_j) = z, \]
we find
\[ \overline{xs}_i(\Delta_j) \oplus \overline{xs}_j(\Delta_j) = \overline{xs}_i(y) \oplus z. \]
By hypothesis, the null space of \(g(x) = \overline{xs}_i(x) \oplus \overline{xs}_j(x)\) is “small.” For our concrete definition of \(\overline{xs}\), there are \(2^{2j}\) values in that null space, which means that \(\Delta_j\) can only satisfy the combined xored condition by taking one of at most \(2^{2j}\) values; otherwise, the two hashes definitely can’t both collide.
Since \(j < n\), this happens with probability at most \(2^{2(n - 1) - w} \leq 2^{-34}\) for UMASH with \(w = 64\) and \(n = 16\).
Finally, for any given \(\Delta_j\), there is at most one \(\Delta_i\) that satisfies
\[ \Delta_i \oplus \Delta_j = y,\]
and so both hashes collide with probability at most \(2^{-98}\), for \(w = 64\) and \(n = 16\).
Astute readers will notice that we could let \(\overline{xs}_i(x) = x \mathtt{«} i\), and find the same combined collision probability. However, this results in a much weaker secondary hash, since a chunk could lose up to \(2n - 2\) bits (\(n - 1\) in each 64-bit half) of hash information to a plain shift. The shifted xor-shifts might be a bit slower to compute, but guarantee that we only lose at most 2 bits^{1} of information per chunk. This feels like an interface that’s harder to misuse.
If one were to change the \(\overline{xs}_i\) family of functions, I think it would make more sense to look at a more diverse form of (still sparse) multipliers, which would likely let us preserve a couple more bits of independence. Jim has constructed such a family of multipliers, in arithmetic modulo \(2^{64}\); I’m sure we could find something similar in carryless multiplication. The hard part is implementing these multipliers: in order to exploit the multipliers’ sparsity, we’d probably have to fully unroll the block hashing loop, and that’s not something I like to force on implementations.
The base UMASH block compressor mixes all but the last of the message block’s 16-byte chunks with PH: xor the chunk with the corresponding bytes in the parameter array, computes a carryless multiplication of the xored chunks’ half with the other half. The last chunk goes through a variant of ENH with an invertible finaliser (safe because we only rely on \(\varepsilon\)-almost-universality), and everything is xored in the accumulator.
The collision proofs above preserved the same structure for the first hash.
The second hash reuses so much work from the first that it mostly makes sense to consider a combined loop that computes both (regular UMASH and this new xor-shifted variant) block compression functions at the same time.
The first change for this combined loop is that we need to xor together all 16-byte chunks in the message, and mix the resulting checksum with a fresh PH function. That’s equivalent to xoring everything into a new accumulator (or two accumulators when working with 256-bit vectors) initialised with the PH parameters, and CLMULing together the accumulator’s two 64-bit halves at the end.
We also have to apply the \(\overline{xs}_i\) quasi-xor-shift functions to each \(h_i\). The trick is to accumulate the shifted values in two variables: one is the regular UMASH accumulator without \(h_0\) (i.e., \(h_1 \oplus h_2 \ldots\)), and the other shifts the current accumulator before xoring in a new value, i.e., \(\mathtt{acc}^\prime = (\mathtt{acc} \mathtt{«} 1) \oplus h_i,\) where the left shift on parallel 64-bit halves simply adds acc to itself.
This additional shifted accumulator includes another special case to skip \(\overline{xs}_1(x) = x \mathtt{«} 1\); that’s not a big deal for the code, since we already have to special case the last iteration for the ENH mixer.
Armed with \(\mathtt{UMASH} = \bigoplus_{i=1}^{n - 1} h_i\) and \(\mathtt{acc} = \bigoplus_{i=2}^{n - 1} h_i \mathtt{«} (i - 1),\) we have \[\bigoplus_{i=1}^{n - 1} \overline{xs}_i(h_i) = (\mathtt{UMASH} \oplus \mathtt{acc}) \mathtt{«} 1.\]
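This rearrangement is easy to get wrong, so a quick numeric check helps. The toy sketch below (hypothetical helper names; a single 64-bit lane stands in for the SIMD halves, since the shifts act on each half independently) confirms the identity for random chunks.

```python
import random

random.seed(1234)
M64 = (1 << 64) - 1

def shl(x, n):
    # left shift within a 64-bit lane, as the SIMD shifts would
    return (x << n) & M64

def xs(i, x):
    # the quasi-xorshift family: xs_1(x) = x << 1, xs_i(x) = (x << 1) ^ (x << i)
    return shl(x, 1) if i == 1 else shl(x, 1) ^ shl(x, i)

n = 16
h = [None] + [random.getrandbits(64) for _ in range(1, n)]  # h[1] .. h[n - 1]

umash = 0  # xor of h_1 .. h_{n-1}
acc = 0    # xor of h_i << (i - 1), for i >= 2
for i in range(1, n):
    umash ^= h[i]
    if i >= 2:
        acc ^= shl(h[i], i - 1)

direct = 0  # xor of xs_i(h_i), computed the slow way
for i in range(1, n):
    direct ^= xs(i, h[i])

assert direct == shl(umash ^ acc, 1)
```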
We just have to xor in the PH-mixed checksum \(h_n\), and finally \(h_0\) (which naturally goes in GPRs, so can be computed while we extract values out of vector registers).
We added two vector xors and one addition for each chunk in a block, and, at the end, one CLMUL plus a couple more xors and adds. This should most definitely be faster than computing two UMASHes at the same time, which incurs two vector xors and a CLMUL (or full integer multiplication) for each chunk: even when CLMUL can pipeline one instruction per cycle, vector additions can dispatch to more execution units, so the combined throughput is still higher.
It’s easy to show that UMASH is relatively safe when one block is shorter than the other, and we simply xor together fewer mixed chunks. Without loss of generality, we can assume the longer block has \(n\) chunks; that block’s final ENH is independent of the shorter block’s UMASH, and any specific value occurs with probability at most \(2^{-63}\) (the probability of a multiplication by zero).
A similar argument seems more complex to defend for the shifted UMASH.
Luckily, we can tweak the LRC checksum we use to generate an additional chunk in the block: rather than xoring together the raw message chunks, we’ll xor them after xoring them with the PH key, i.e.,
\[m_n = \bigoplus_{i=0}^{n - 1} \left(m_i \oplus k_i\right), \]
where \(k_i\) are the PH parameters for each chunk.
When checksumming blocks of the same size, this is a no-op with respect to collision probabilities. Implementations might however benefit from the ability to use a fused xor with load from memory^{2} to compute \(m_i \oplus k_i\), and feed that both into the checksum and into CLMUL for PH.
Unless we’re extremely unlucky (\(m_{n - 1} = k_{n - 1}\), with probability \(2^{-2w}\)), the long block’s LRC will differ from the shorter block’s. As long as we always xor in the same PH parameters when mixing the artificial LRC, the secondary hashes collide with probability at most \(2^{-64}\).
With a small tweak to the checksum function, we can easily guarantee that blocks with a different number of chunks collide with probability less than \(2^{-126}\).^{3}
Thank you Joonas for helping me rubber duck the presentation, and Jim for pointing me in the right direction, and for the fruitful discussion!
It’s even better for UMASH, since we obtained these shifted chunks by mixing with PH. The result of PH is the carryless product of two 64-bit values, so the most significant bit is always 0. The shifted-xorshift doesn’t erase any information in the high 64-bit half! ↩
This might also come with a small latency hit, which is unfortunate since PH-ing \(m_n\) is likely to be on the critical path… but one cycle doesn’t seem that bad. ↩
The algorithm to expand any input message to a sequence of full 16-byte chunks is fixed. That’s why we incorporate a size tag in ENH; that makes it impossible for two messages of different lengths to collide when they are otherwise identical after expansion. ↩
We accidentally a whole hash function… but we had a good reason! Our MIT-licensed UMASH hash function is a decently fast non-cryptographic hash function that guarantees a worst-case bound on the probability of collision between any two inputs generated independently of the UMASH parameters.
On the 2.5 GHz Intel 8175M servers that power Backtrace’s hosted offering, UMASH computes a 64-bit hash for short cached inputs of up to 64 bytes in 9-22 ns, and for longer ones at up to 22 GB/s, while guaranteeing that two distinct inputs of at most \(s\) bytes collide with probability less than \(\lceil s / 2048 \rceil \cdot 2^{-56}\). If that’s not good enough, we can also reuse most of the parameters to compute two independent UMASH values. The resulting 128-bit fingerprint function offers a short-input latency of 9-26 ns, a peak throughput of 11.2 GB/s, and a collision probability of \(\lceil s / 2048 \rceil^2 \cdot 2^{-112}\) (better than \(2^{-70}\) for input size up to 7.5 GB). These collision bounds hold for all inputs constructed without any feedback about the randomly chosen UMASH parameters.
The latency on short cached inputs (9-22 ns for 64 bits, 9-26 ns for 128) is somewhat worse than the state of the art for non-cryptographic hashes—wyhash achieves 8-15 ns and xxh3 8-12 ns—but still in the same ballpark. It also compares well with latency-optimised hash functions like FNV-1a (5-86 ns) and MurmurHash64A (7-23 ns).
Similarly, UMASH’s peak throughput (22 GB/s) does not match the current best hash throughput (37 GB/s with xxh3 and falkhash, apparently 10% higher with Meow hash), but does come within a factor of two; it’s actually higher than that of some performance-optimised hashes, like wyhash (16 GB/s) and farmhash32 (19 GB/s). In fact, even the 128-bit fingerprint (11.2 GB/s) is comparable to respectable options like MurmurHash64A (5.8 GB/s) and SpookyHash (11.6 GB/s).
What sets UMASH apart from these other non-cryptographic hash functions is its proof of a collision probability bound. In the absence of an adversary that adaptively constructs pathological inputs as it infers more information about the randomly chosen parameters, we know that two distinct inputs of \(s\) or fewer bytes will have the same 64-bit hash with probability at most \(\lceil s / 2048 \rceil \cdot 2^{-56},\) where the expectation is taken over the random “key” parameters.
Only one non-cryptographic hash function in Reini Urban’s fork of SMHasher provides this sort of bound: CLHash guarantees a collision probability \(\approx 2^{-63}\) in the same universal hashing model as UMASH. While CLHash’s peak throughput (22 GB/s) is equal to UMASH’s, its latency on short inputs is worse (23-25 ns instead of 9-22 ns). We will also see that its stronger collision bound remains too weak for many practical applications. In order to compute a fingerprint with CLHash, one would have to combine multiple hashes, exactly like we did for the 128-bit UMASH fingerprint.
Actual cryptographic hash functions provide stronger bounds in a much more pessimistic model; however, they’re also markedly slower than non-cryptographic hashes. BLAKE3 needs at least 66 ns to hash short inputs, and achieves a peak throughput of 5.5 GB/s. Even the reduced-round SipHash-1-3 hashes short inputs in 18-40 ns and longer ones at a peak throughput of 2.8 GB/s. That’s the price of their pessimistically adversarial security model. Depending on the application, it can make sense to consider a more restricted adversary that must prepare its dirty deed before the hash function’s parameters are generated at random, and still ask for provable bounds on the probability of collisions. That’s the niche we’re targeting with UMASH.
Clearly, the industry is comfortable with no bound at all. However, even in the absence of seed-independent collisions, timing side-channels in a data structure implementation could theoretically leak information about colliding inputs, and iterating over a hash table’s entries to print its contents can divulge even more bits. A sufficiently motivated adversary could use something like that to learn more about the key and deploy an algorithmic denial of service attack. For example, the linear structure of UMASH (and of other polynomial hashes like CLHash) makes it easy to combine known collisions to create exponentially more colliding inputs. There is no universal answer; UMASH is simply another point in the solution space.
If reasonable performance coupled with an actual bound on collision probability for data that does not adaptively break the hash sounds useful to you, take a look at UMASH on GitHub!
The next section will explain why we found it useful to design another hash function. The rest of the post sketches how UMASH works and how it balances short-input latency and strength, before describing a few interesting usage patterns.
The latency and throughput results above were all measured on the same unloaded 2.5 GHz Xeon 8175M. While we did not disable frequency scaling (#cloud), the clock rate seemed stable at 3.1 GHz during our run.
Engineering is the discipline of satisficing: crisply defined problems with perfect solutions rarely exist in reality, so we must resign ourselves to satisfying approximate constraint sets “well enough.” However, there are times when all options are not only imperfect, but downright sucky. That’s when one has to put on a different hat, and question the problem itself: are our constraints irremediably at odds, or are we looking at an under-explored solution space?
In the former case, we simply have to want something else. In the latter, it might make sense to spend time to really understand the current set of options and hand-roll a specialised approach.
That’s the choice we faced when we started caching intermediate results in Backtrace’s database and found a dearth of acceptable hash functions. Our in-memory columnar database is a core component of the backend, and, like most analytics databases, it tends to process streams of similar queries. However, a naïve query cache would be ineffective: our more heavily loaded servers handle a constant write load of more than 100 events per second with dozens of indexed attributes (populated column values) each. Moreover, queries invariably select a large number of data points with a time windowing predicate that excludes old data… and the endpoints of these time windows advance with each wall-clock second. The queries evolve over time, and must usually consider newly ingested data points.
Bhatotia et al’s Slider shows how we can specialise the idea of self-adjusting or incremental computation for repeated MapReduce-style queries over a sliding window. The key idea is to split the data set at stable boundaries (e.g., on date change boundaries rather than 24 hours from the beginning of the current time window) in order to expose memoisation opportunities, and to do so recursively to repair around point mutations to older data.
Caching fully aggregated partial results works well for static queries, like scheduled reports… but the first step towards creating a great report is interactive data exploration, and that’s an activity we strive to support well, even when drilling down tens of millions of rich data points. That’s why we want to also cache intermediate results, in order to improve response times when tweaking a saved report, or when crafting ad hoc queries to better understand how and when an application fails.
We must go back to a more general incremental computation strategy: rather than only splitting up inputs, we want to stably partition the data dependency graph of each query, in order to identify shared subcomponents whose results can be reused. This finer grained strategy surfaces opportunities to “resynchronise” computations, to recognize when different expressions end up generating a subset of identical results, enabling reuse in later steps. For example, when someone updates a query by adding a selection predicate that only rejects a small fraction of the data, we can expect to reuse some of the post-selection work executed for earlier incarnations of the query, if we remember to key on the selected data points rather than the predicates.
The complication here is that these intermediate results tend to be large. Useful analytical queries start small (a reasonable query coupled with cache/transaction invalidation metadata to stand in for the full data set), grow larger as we select data points, arrange them in groups, and materialise their attributes, and shrink again at the end, as we summarise data and throw out less interesting groups.
When caching the latter shrinking steps, where resynchronised reuse opportunities abound and can save a lot of CPU time, we often find that storing a fully materialised representation of the cache key would take up more space than the cached result.
A classic approach in this situation is to fingerprint cache keys with a cryptographic hash function like BLAKE or SHA-3, and store a compact (128 or 256 bits) fingerprint instead of the cache key: the probability of a collision is then so low that we might as well assume any false positive will have been caused by a bug in the code or a hardware failure. For example, a study of memory errors at Facebook found that uncorrectable memory errors affect 0.03% of servers each month. Assuming a generous clock rate of 5 GHz, this means each clock cycle may be afflicted by such a memory error with probability \(\approx 2.2\cdot 10^{-20} > 2^{-66}.\) If we can guarantee that distinct inputs collide with probability significantly less than \(2^{-66}\), e.g., \(< 2^{-70},\) any collision is far more likely to have been caused by a bug in our code or by hardware failure than by the fingerprinting algorithm itself.
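The arithmetic behind that threshold is a quick back-of-the-envelope computation; the check below (assuming a 30-day month) lands at the same order of magnitude as the figure in the text.

```python
# 0.03% of servers see an uncorrectable memory error each month
failure_rate_per_month = 0.0003
# generous 5 GHz clock, 30-day month
cycles_per_month = 5e9 * 30 * 24 * 3600
p_per_cycle = failure_rate_per_month / cycles_per_month  # ~2.3e-20

# each cycle is more likely to hit a memory error than a 2^-66 event
assert p_per_cycle > 2 ** -66
```

Any fingerprint collision probability comfortably below that per-cycle error rate, e.g., \(2^{-70}\), is dominated by hardware failure.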
Using cryptographic hashes is certainly safe enough, but requires a lot of CPU time, and, more importantly, worsens latency on smaller keys (for which caching may not be that beneficial, such that our goal should be to minimise overhead). It’s not that state-of-the-art cryptographic hash functions are wasteful, but that they defend against attacks like key recovery or collision amplification that we may not care to consider in our design.
At the other extreme of the hash spectrum, there is a plethora of fast hash functions with no proof of collision probability. However, most of them are keyed on just a 64-bit “seed” integer, and that’s already enough for a pigeonhole argument to show we can construct sets of strings of length \(64m\) bits where any two members collide with probability at least \(m/ 2^{64}\). In practice, security researchers seem to find key-independent collisions wherever they look (i.e., the collision probability is on the order of 1 for some particularly pathological sets of inputs), so it’s safe to assume that lacking a proof of collision probability implies a horrible worst case. I personally wouldn’t put too much faith in “security claims” taking the form of failed attempts at breaking a proposal.
Lemire and Kaser’s CLHash is one of the few exceptions we found: it achieves a high throughput of 22 GB/s and comes with a proof of \(2^{-63}\)-almost-universality. However, its finalisation step is slow (23 ns for one-byte inputs), due to a Barrett reduction followed by three rounds of xorshift/multiply mixing. Dai and Krovetz’s VHASH, which inspired CLHash, offers similar guarantees, with worse performance.
Unfortunately, \(2^{-63}\) is also not quite good enough for our purposes: we estimate that the probability of uncorrectable memory errors is on the order of \(2^{-66}\) per clock cycle, so we want the collision probability for any two distinct inputs to be comfortably less than that, around \(2^{-70}\) (i.e., \(10^{-21}\)) or less. This also tells us that any acceptable fingerprint must consist of more than 64 bits, so we will have to either work in slower multi-word domains, or combine independent hashes.
Interestingly, we also don’t need much more than that for (non-adversarial) fingerprinting: at some point, the theoretical probability of a collision is dominated by the practical possibility of a hardware or networking issue making our program execute the fingerprinting function incorrectly, or pass the wrong data to that function.
While CLHash and VHASH aren’t quite what we want, they’re pretty close, so we felt it made sense to come up with a specialised solution for our fingerprinting use case.
Krovetz et al’s RFC 4418 brings an interesting idea: we can come up with a fast 64-bit hash function structured to make it easy to compute a second independent hash value, and concatenate two independent 64-bit outputs. The hash function can heavily favour computational efficiency and let each 64-bit half collide with probability \(\varepsilon\) significantly worse than \(2^{-64}\), as long as the collision probability for the concatenated fingerprint, \(\varepsilon^2\), is small enough, i.e., as long as \(\varepsilon^2 < 2^{-70} \Longleftrightarrow \varepsilon < 2^{-35}\). We get a more general purpose hash function out of the deal, and the fingerprint comparison logic is now free to only compute and look at half the fingerprint when it makes sense (e.g., in a prepass that tolerates spurious matches).
The design of UMASH is driven by two observations:
CLHash achieves a high throughput, but introduces a lot of latency to finalise its 127-bit state into a 64 bits result.
We can get away with a significantly weaker hash, since we plan to combine two of them when we need a strong fingerprint.
That’s why we started with the high-level structure diagrammed below, the same as UMAC, VHASH, and CLHash: a fast first-level block compression function based on Winograd’s pseudo dot-product, and a second-level Carter-Wegman polynomial hash function to accumulate the compressed outputs in a fixed-size state.
The inner loop in this two-level strategy is the block compressor, which divides each 256-byte block \(m\) into 32 64-bit values \(m_i\), combines them with randomly generated parameters \(k_i\), and converts the resulting sequence of machine words to a 16-byte output. The performance of that component will largely determine the hash function’s global peak throughput. After playing around with the NH inner loop, we came to the same conclusion as Lemire and Kaser: the scalar operations, the outer 128-bit ones in particular, map to too many µops. We thus focused on the same PH inner loop as CLHash. While the similarity to NH is striking, analysing PH is actually much simpler: we can see the xor and carry-less multiplications as working in the same ring of polynomials over \(\mathrm{GF}(2)\), unlike NH’s mixing of \(\bmod 2^{64}\) for the innermost additions with \(\bmod 2^{128}\) for the outer multiplications and sum. In fact, as Bernstein points out, PH is a direct application of Winograd’s pseudo dot-product to compute a multiplicative vector hash in half the multiplications.
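A minimal model of PH makes the pseudo-dot-product trick concrete (hypothetical names; the bit-by-bit `clmul` helper stands in for the CLMUL instruction): each 16-byte chunk costs a single carry-less multiplication, instead of the two a plain multiplicative dot product would need.

```python
def clmul(a, b):
    # carry-less 64x64 -> 128-bit multiplication over GF(2)
    r = 0
    for i in range(64):
        if (b >> i) & 1:
            r ^= a << i
    return r

def ph_block(chunks, keys):
    """PH compressor sketch: each chunk is a (lo, hi) pair of 64-bit
    words; xor with the key pair, then one CLMUL of one half by the
    other, everything xored together (Winograd pseudo dot-product)."""
    acc = 0
    for (lo, hi), (klo, khi) in zip(chunks, keys):
        acc ^= clmul(lo ^ klo, hi ^ khi)
    return acc
```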
CLHash uses an aggressively throughput-optimised block size of 1024 bytes. We found diminishing returns after 256 bytes, and stopped there.
With modular or polynomial ring arithmetic, the collision probability is \(2^{-64}\) for any pair of blocks. Given this fast compression function, the rest of the hashing algorithm must chop the input in blocks, accumulate compressed outputs in a constant-size state, and handle the potentially shorter final block while avoiding length extension issues.
Both VHASH and CLHash accumulate compressed outputs in a polynomial string hash over a large field (\(\mathbb{Z}/M_{127}\mathbb{Z}\) for VHASH, and \(\mathrm{GF}(2^{127})\) with irreducible polynomial \(x^{127} + x + 1\) for CLHash): the collision probability for polynomial string hashes is inversely proportional to the field size and grows with the string length (number of compressed blocks), so working in fields much larger than \(2^{64}\) lets the NH/PH term dominate.
Arithmetic in such large fields is slow, and reducing the 127-bit state to 64 bits is also not fast. CLHash and VHASH make the situation worse by zero-padding the final block, and CLHash defends against length extension attacks with a more complex mechanism than the one in VHASH.
Similarly to VHASH, UMASH uses a polynomial hash over the (much smaller) prime field \(\mathbb{F} = \mathbb{Z}/M_{61}\mathbb{Z}:\) \[\mathtt{CW}_f(y) = \left(\sum_{i=1}^n y_i \cdot f^{n - i + 1}\right) \bmod (2^{61} - 1),\] where \(f\in\mathbb{F}\) is the randomly chosen point at which we evaluate the polynomial, and \(y\), the polynomial’s coefficients, is the stream of 64-bit values obtained by splitting in half the PH output for each block. This choice saves 20-30 cycles of latency in the final block, compared to CLHash: modular multiplications have lower latency than carry-less multiplications for judiciously picked machine-integer-sized moduli, and integer multiplications seem to mix better, so we need less work in the finaliser.
Of course, UMASH sacrifices a lot of strength by working in \(\mathbb{F} = \mathbb{Z}/(2^{61} - 1)\mathbb{Z}:\) the resulting field is much smaller than \(2^{127}\), and we now have to update the polynomial twice for the same number of blocks. This means the collision probability starts worse \((\approx 2^{-61}\) instead of \(\approx 2^{-127})\), and grows twice as fast with the number of blocks \(n\) \((\approx 2n\cdot 2^{-61}\) instead of \(\approx n\cdot 2^{-127})\). But remember, we’re only aiming for collision probability \(< 2^{-35}\) and each block represents 256 bytes of input data, so this is acceptable, assuming that multi-gigabyte inputs are out of scope.
We protect against length extension collisions by xoring (adding, in the polynomial ring) the original byte size of the final block to its compressed PH output. This xor is simpler than CLHash’s finalisation step with a carry-less multiplication, but still sufficient: we can adapt Krovetz’s proof for VHASH by replacing NH’s almost-\(\Delta\)-universality with PH’s almost-XOR-universality.
Having this protection means we can extend short final blocks however we want. Rather than conceptually zero-padding our inputs (which adds complexity and thus latency on short inputs), we allow redundant reads. We bifurcate inputs shorter than 16 bytes to a completely different latency-optimised code path, and let the final PH iteration read the last 16 bytes of the input, regardless of how redundant that might be.
The semi-literate Python reference implementation has the full code and includes more detailed analysis and rationale for the design decisions.
The previous section already showed how we let micro-optimisation inform the high-level structure of UMASH. The use of PH over NH, our choice of a polynomial hash in a small modular field, and the way we handle short blocks all aim to improve the performance of production implementations. We also made sure to enable a couple more implementation tricks with lower level design decisions.
The block size is set to 256 bytes because we observed diminishing returns for larger blocks… but also because it’s reasonable to cache the PH loop’s parameters in 8 AVX registers, if we need to shave load µops.
More importantly, it’s easy to implement a Horner update with the prime modulus \(2^{61} - 1\). Better, that’s also true for a “double-pumped” Horner update, \(h^\prime = H_f(h, a, b) = af + (b + h)f^2.\)
The trick is to work in \(\bmod\ 2^{64} - 8 = \bmod\ 8\cdot(2^{61} - 1),\) which lets us implement modular multiplication of an arbitrary 64-bit integer \(a\) by a multiplier \(0 < f < 2^{61} - 1\) without worrying too much about overflow. \(2^{64} \equiv 8 \mod 2^{64} - 8,\) so we can reduce a value \(x\) to a smaller representative with \[x \equiv 8\lfloor x / 2^{64}\rfloor + (x \bmod 2^{64}) \mod 2^{64} - 8;\] this equivalence is particularly useful when \(x < 2^{125}\): in that case, \(x / 2^{64} < 2^{61},\) and the intermediate product \(8\lfloor x / 2^{64}\rfloor < 2^{64}\) never overflows 64 bits. That’s exactly what happens when \(x = af\) is the product of \(0\leq a < 2^{64}\) and \(0 < f < 2^{61} - 1\). This also holds when we square the multiplier \(f\): it’s sampled from the field \(\mathbb{Z}/(2^{61} - 1)\mathbb{Z},\) so its square also satisfies \(f^2 < 2^{61}\) once fully reduced.
Integer multiplication instructions for 64-bit values will naturally split the product \(x = af\) in its high and low 64-bit half; we get \(\lfloor x / 2^{64}\rfloor\) and \(x\bmod 2^{64}\) for free. The rest of the double-pumped Horner update is a pair of modular additions, where only the final sum must be reduced to fit in \(\bmod 2^{64} - 8\). The resulting instruction-parallel double Horner update is only a few cycles slower than a single Horner update.
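A sketch of the double-pumped update (hypothetical names; Python integers stand in for the hi/lo halves a 64-bit multiply instruction produces) shows the lazy reduction at work: we stay in \(\bmod\ 2^{64} - 8\) throughout, and the result still agrees with exact arithmetic \(\bmod\ 2^{61} - 1\).

```python
import random

M61 = (1 << 61) - 1
M64 = (1 << 64) - 1
MOD = (1 << 64) - 8  # = 8 * (2**61 - 1)

def mul_mod(a, f):
    """a * f mod (2^64 - 8), for 0 <= a < 2^64 and 0 < f < 2^61.
    The multiply instruction hands us the hi/lo 64-bit halves of the
    128-bit product; 2^64 = 8 mod (2^64 - 8) folds the high half back
    in as 8 * hi, which never overflows 64 bits since hi < 2^61."""
    x = a * f
    hi, lo = x >> 64, x & M64
    return (8 * hi + lo) % MOD

def horner2(h, a, b, f):
    """Double-pumped Horner update h' = a*f + (b + h)*f^2."""
    f2 = (f * f) % M61  # fully reduced square stays below 2^61
    return (mul_mod(a, f) + mul_mod((b + h) % MOD, f2)) % MOD

random.seed(7)
f = random.randrange(1, M61)
h, a, b = (random.randrange(MOD) for _ in range(3))

# the lazily reduced result agrees with exact arithmetic mod 2^61 - 1
assert horner2(h, a, b, f) % M61 == (a * f + (b + h) * f * f) % M61
```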
We also never fully reduce to \(\bmod 2^{61} - 1\). While the collision bound assumes that prime field, we simply work in its \(\bmod 2^{64} - 8\) extension. This does not affect the collision bound, and the resulting expression is still amenable to algebraic manipulation: modular arithmetic is a well defined ring even for composite moduli.
A proof of almost-universality doesn’t mean a hash passes the SMHasher test suite. It should definitely guarantee collisions are (probably) rare enough, but SMHasher also looks at bit avalanching and bias, and universality is oblivious to these issues. Even XOR- or \(\Delta\)-universality doesn’t suffice: the hash values for a given string are well distributed when parameters are chosen uniformly at random, but this does not imply that hashes are always (or usually) well distributed for fixed parameters.
The most stringent SMHasher tests focus on short inputs: mostly up to 128 or 256 bits, unless “Extra” torture testing is enabled. In a way, this makes sense, given that arbitrary-length string hashing is provably harder than the bounded-length vector case. Moreover, a specialised code path for these inputs is beneficial, since they’re relatively common and deserve strong and low-latency hashes. That’s why UMASH uses a completely different code path for inputs of at most 8 bytes, and a specialised PH iteration for inputs of 9 to 16 bytes.
However, this means that SMHasher’s best avalanche and bias tests often tell us very little about the general case. For UMASH, the medium length (9 to 16 bytes) code path at least shares the same structure and finalisation logic as the code for longer inputs.
There may also be a bit of co-evolution between the test harness and the design of hash functions: the sort of xorshift/multiply mixers favoured by Appleby in the various versions of MurmurHash tends to do well on SMHasher. These mixers are also invertible, so we can take any hash function with good collision properties, mix its output with someone else’s series of xorshifts and multiplications (in UMASH’s case, the SplitMix64 update function or a subset thereof), and usually find that the result satisfies SMHasher’s bias and avalanche tests.
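To illustrate the invertibility claim, here is the SplitMix64 finaliser (constants from Vigna's public-domain splitmix64.c) together with a sketched inverse: each xorshift and each odd multiplication can be undone, so tacking the mixer onto a hash reshuffles bits without merging any outputs.

```python
M64 = (1 << 64) - 1

def mix(x):
    # the SplitMix64 finaliser: alternating xorshifts and odd multipliers
    x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9 & M64
    x = (x ^ (x >> 27)) * 0x94D049BB133111EB & M64
    return x ^ (x >> 31)

def unshift(y, s):
    # invert x ^ (x >> s): each pass recovers s more of the lower bits
    x = y
    for _ in range(64 // s + 1):
        x = y ^ (x >> s)
    return x

def unmix(y):
    # odd multipliers are invertible mod 2^64; xorshifts via unshift
    x = unshift(y, 31)
    x = x * pow(0x94D049BB133111EB, -1, 1 << 64) & M64
    x = unshift(x, 27)
    x = x * pow(0xBF58476D1CE4E5B9, -1, 1 << 64) & M64
    return unshift(x, 30)

for v in (0, 1, 0xDEADBEEF, M64):
    assert unmix(mix(v)) == v
```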
It definitely looks like interleaving rightward bitwise operations and integer multiplications is a good mixing strategy. However, I find it interesting that the hash evaluation harness created by the author of MurmurHash steers implementations towards MurmurHash-style mixing code.
The structure of UMASH lets us support more sophisticated usage patterns than merely hashing or fingerprinting an array of bytes.
The PH loop needs less than 17 bytes of state for its 16-byte accumulator and an iteration count, and the polynomial hash also needs 17 bytes, for its own 8-byte accumulator, the 8-byte “seed,” and a counter for the final block size (up to 256 bytes). The total comes to 34 bytes of state, plus a 16-byte input buffer, since the PH loop consumes 16-byte chunks at a time. Coupled with the way we only consider the input size at the end of UMASH, this makes it easy to implement incremental hashing.
In fact, the state is small enough that our implementation stashes some parameter data inline in the state struct, and uses the same layout for hashing and fingerprinting with a pair of hashes (and thus double the state): most of the work happens in PH, which only accesses the constant parameter array, the shared input buffer and iteration counter, and its private 16-byte accumulator.
Incremental fingerprinting is a crucial capability for our caching system: cache keys may be large, so we want to avoid serialising them to an array of contiguous bytes just to compute a fingerprint. Efficient incrementality also means we can hash NUL-terminated C strings with a fused UMASH / strlen loop, a nice speed-up when the data is in cache.
The outer polynomial hash in UMASH is so simple to analyse that we can easily process blocks out of order. In my experience, such a “parallel hashing” capability is more important than peak throughput when checksumming large amounts of data coming over the wire. We usually maximise transfer throughput by asking for several ranges of data in parallel. Having to checksum these ranges in order introduces a serial bottleneck and the usual head-of-line blocking challenges; more importantly, checksumming in order adds complexity to code that should be as obviously correct as possible. The polynomial hash lets us hash an arbitrary subsequence of 256-byte aligned blocks and use modular exponentiation to figure out its impact on the final hash value, given the subsequence’s position in the checksummed data. Parallel hashing can exploit multiple cores (more cores, more bandwidth!) with simpler code.
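A toy model (hypothetical names; plain Horner evaluation over \(\mathbb{Z}/(2^{61}-1)\mathbb{Z}\), with one coefficient standing in for each block’s compressed output) shows how modular exponentiation lets independently hashed ranges recombine into the sequential result.

```python
import random

M61 = (1 << 61) - 1

def poly_hash(coeffs, f, h=0):
    # Horner evaluation: h' = (h + y) * f for each coefficient y
    for y in coeffs:
        h = (h + y) * f % M61
    return h

random.seed(9)
f = random.randrange(2, M61)
blocks = [random.randrange(M61) for _ in range(10)]

# sequential hash of all the blocks, in order
expected = poly_hash(blocks, f)

# hash two ranges independently (e.g., on different cores), then shift
# the left range's contribution by f^(number of blocks to its right)
left = poly_hash(blocks[:6], f)
right = poly_hash(blocks[6:], f)
combined = (left * pow(f, len(blocks) - 6, M61) + right) % M61

assert combined == expected
```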
The UMAC RFC uses a Toeplitz extension scheme to compute independent NH values while recycling most of the parameters. We do the same with PH, by adapting Krovetz’s proof to exploit PH’s almost-XOR-universality instead of NH’s almost-\(\Delta\)-universality. Our fingerprinting code reuses all but the first 32 bytes of PH parameters for the second hash: that’s the size of an AVX register, which makes it trivial to avoid loading parameters twice in a fused PH loop.
The same RFC also points out that concatenating the output of fast hashes lets validation code decide which speed-security trade-off makes sense for each situation: some applications may be willing to only compute and compare half the hashes.
We use that freedom when reading from large hash tables keyed on the UMASH fingerprint of strings. We compute a single UMASH hash value to probe the hash tables, and only hash the second half of the fingerprint when we find a probable hit. The idea is that hashing the search key (now hot in cache) a second time will be faster than comparing it against the hash entry’s string key in cold storage.
When we add this sort of trickery to our code base, it’s important to make sure the interfaces are hard to misuse. For example, it would be unfortunate if only one half of the 128-bit fingerprint were well distributed and protected against collisions: this would make it far too easy to implement the two-step lookup-by-fingerprint above correctly but inefficiently. That’s why we maximise the symmetry in the fingerprint: the two 64-bit halves are computed with the same algorithm to guarantee the same worst-case collision probability and distribution quality. This choice leaves fingerprinting throughput on the table when a weaker secondary hash would suffice. However, I prefer a safer if slightly slower interface to one ripe for silent performance bugs.
While we intend for UMASH to become our default hash and fingerprint function, it can’t be the right choice for every application.
First, it shouldn’t be used for authentication or similar cryptographic purposes: the implementation is probably riddled with side-channels, the function has no protection against parameter extraction or adaptive attacks, and collisions are too frequent anyway.
Obviously, this rules out using UMASH in a MAC, but might also be an issue for, e.g., hash tables where attackers control the keys and can extrapolate the hash values. A timing side-channel may let attackers determine when keys collide; once a set of colliding keys is known, the linear structure of UMASH makes it trivial to create more collisions by combining keys from that set. Worse, iterating over the hash table’s entries can leak the hash values, which would let an attacker slowly extract the parameters. We conservatively avoid non-cryptographic hashes and even hashed data structures for sections of the Backtrace code base where such attacks are in scope.
Second, the performance numbers reported by SMHasher (up to 22 ns when hashing 64 bytes or less, and 22 GB/s peak throughput) are probably a lie for real applications, even when running on the exact same 2.5 GHz Xeon 8175M hardware. These are best-case values, when the code and the parameters are all hot in cache… and that’s a fair number of bytes for UMASH. The instruction footprint for a 64-bit hash is 1435 bytes (comparable to heavier high-throughput hashes, like the 1600-byte xxh3_64 or 1350-byte farmhash64), and the parameters span 288 bytes (320 for a fingerprint).
There is a saving grace for UMASH and other complex hash functions: the number of instruction bytes executed is proportional to the input size (e.g., the code for inputs of 8 bytes or fewer only needs 141 bytes, and would inline to around 100 bytes), and the number of parameters read is bounded by the input length. Although UMASH can need a lot of instruction and parameter bytes, the worst case only happens for larger inputs, where the cache misses can hopefully be absorbed by the work of loading and hashing the data.
The numbers are also only representative of powerful CPUs with carry-less multiplication in hardware. The PH inner loop has 50% higher throughput than NH (22 vs 14 GB/s) on contemporary Intel servers. The carry-less approach still has an edge over 128-bit modular arithmetic on AMD’s Naples, but less so, around 20-30%. We did not test on ARM (the Backtrace database only runs on x86-64), but I would assume the situation there is closer to AMD’s than Intel’s.
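For context, one step of each compressor differs only in how key and message words are combined before a 64×64 → 128-bit multiply: NH adds modulo \(2^{64}\) and uses an integer multiply, while PH xors and uses a carry-less (polynomial) multiply. A portable sketch, where the shift-and-xor loop stands in for the single PCLMULQDQ instruction hardware provides:

```c
#include <stdint.h>

/* One NH step: 64-bit modular addition of key material, then a full
 * 64x64 -> 128-bit integer multiply. */
static __uint128_t nh_step(uint64_t m0, uint64_t m1, uint64_t k0, uint64_t k1)
{
    return (__uint128_t)(m0 + k0) * (m1 + k1);
}

/* Carry-less 64x64 -> 128-bit multiply: xor in a shifted copy of x for
 * each set bit of y.  Hardware does this in one PCLMULQDQ instruction;
 * this loop only illustrates the semantics. */
static __uint128_t clmul(uint64_t x, uint64_t y)
{
    __uint128_t acc = 0;

    for (unsigned i = 0; i < 64; i++) {
        if ((y >> i) & 1)
            acc ^= (__uint128_t)x << i;
    }

    return acc;
}

/* One PH step: xor in the key material, then carry-less multiply. */
static __uint128_t ph_step(uint64_t m0, uint64_t m1, uint64_t k0, uint64_t k1)
{
    return clmul(m0 ^ k0, m1 ^ k1);
}
```

The `__uint128_t` type is a GCC/Clang extension; the structural point is that PH replaces NH’s adds and integer multiply with xors and a carry-less multiply, which is why its throughput tracks the hardware’s PCLMULQDQ performance.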
However, I also believe we’re more likely to observe improved performance for PH than NH in future micro-architectures: the core of NH, full-width integer multiplication, has been aggressively optimised by now, while the gap between Intel and AMD shows there may still be low-hanging fruit for the carry-less multiplications at the heart of PH. So, NH is probably already as good as it’s going to be, but we can hope that PH will continue to benefit from hardware optimisations, as chip designers improve the performance of cryptographic algorithms like AES-GCM.
Third and last, UMASH isn’t fully stabilised yet. We do not plan to modify the high-level structure of UMASH, a PH block compressor that feeds into a polynomial string hash. However, we are looking for suggestions to improve its latency on short inputs, and to simplify the finaliser while satisfying SMHasher’s distribution tests.
We believe UMASH is ready for non-persistent usage: we’re confident in its quality, but the algorithm isn’t set in stone yet, so hash or fingerprint values should not reach long-term storage. We do not plan to change anything that will affect the proof of collision bound, but improvements to the rest of the code are more than welcome.
In particular:

- The finaliser is already just a xorshift / multiply, but can we shave even more latency there?
- A hash function is a perfect target for automated correctness and performance testing. I hope to use UMASH as a test bed for the automatic evaluation (and approval?!) of pull requests.
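For reference, here is the general shape of one xorshift / multiply finaliser round. The constant is borrowed from SplitMix64 purely for illustration; this is not UMASH’s actual finaliser:

```c
#include <stdint.h>

/* One xorshift / multiply round: each step (xor with a shifted copy,
 * multiply by an odd constant) is bijective, so no entropy is lost,
 * but each round adds a few cycles of dependent latency.  The constant
 * is SplitMix64's, used here only as an example. */
static uint64_t finalise(uint64_t h)
{
    h ^= h >> 30;
    h *= 0xbf58476d1ce4e5b9ULL;
    h ^= h >> 27;
    return h;
}
```

Because every step is a bijection on 64-bit words, distinct inputs always map to distinct outputs; the latency question is how few such dependent steps still pass SMHasher’s distribution tests.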
Of course, you’re also welcome to just use UMASH as a single-file C library or re-implement it to fit your requirements. The MIT-licensed C code is on GitHub, and we can definitely discuss validation strategies for alternative implementations.
Finally, our fingerprinting use case shows collision rates are probably not something to minimise, but closer to soft constraints. We estimate that, once the collision probability drops to \(2^{-70}\), collisions are rare enough that we can compare only fingerprints instead of the fingerprinted values. However, going lower than \(2^{-70}\) doesn’t do anything for us.
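A back-of-the-envelope birthday bound makes that threshold concrete (the \(n = 2^{30}\) figure below is illustrative, not a number from our production data). With \(n\) fingerprinted values and pairwise collision probability at most \(\varepsilon\), the expected number of colliding pairs is

\[
\mathbb{E}[\text{colliding pairs}] \le \binom{n}{2}\,\varepsilon \approx \frac{n^2 \varepsilon}{2} = \frac{(2^{30})^2 \cdot 2^{-70}}{2} = 2^{-11}.
\]

In other words, even a billion fingerprints are overwhelmingly unlikely to yield a single collision, and shrinking \(\varepsilon\) further only reduces an already negligible number.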
It would be useful to document other back-of-the-envelope requirements for a hash function’s output size or collision rate. Now that most developers work on powerful 64-bit machines, it seems far too easy to add complexity and waste resources for improved collision bounds that may not unlock any additional application.
Any error in the analysis or the code is mine, but a few people helped improve UMASH and its presentation.
Colin Percival scanned an earlier version of the reference implementation for obvious issues, encouraged me to simplify the parameter generation process, and prodded us to think about side channels, even in data structures.
Joonas Pihlaja helped streamline my initial attempt while making the reference implementation easier to understand.
Jacob Shufro independently confirmed that he too found the reference implementation understandable, and tightened the natural language.
Phil Vachon helped me gain more confidence in the implementation tricks borrowed from VHASH after replacing the NH compression function with PH.