The core of UMASH is a hybrid PH/(E)NH block compression function. That function is fast (it needs one multiplication for each 16-byte "chunk" in a block), but relatively weak: despite a 128-bit output, the worst-case probability of collision is \(2^{-64}\).
For a fingerprinting application, we want collision probability less than \(\approx 2^{-70},\) so that's already too weak, before we even consider merging a variable-length string of compressed block values.
The initial UMASH proposal compresses each block with two independent compression functions. Krovetz showed that we could do so while reusing most of the key material (random parameters), with a Toeplitz extension, and I simply recycled the proof for UMASH's hybrid compressor.
That's good for the memory footprint of the random parameters, but doesn't help performance: we still have to do double the work to get double the hash bits.
Earlier this month, Jim Apple pointed me at a promising alternative that doubles the hash bits with only one more multiplication. The construction adds finite field operations that aren't particularly efficient in software, on top of the additional 64x64 -> 128 (carryless) multiplication, so it isn't a slam-dunk win over a straightforward Toeplitz extension. However, Jim felt like we could "spend" some of the bits we don't need for fingerprinting (\(2^{-128}\) collision probability is overkill when we only need \(2^{-70}\)) in order to make do with faster operations.
Turns out he was right! We can use carryless multiplications by sparse constants (concretely, xor-shift and one more shift) without any reducing polynomial, on independent 64-bit halves... and still collide with probability at most \(2^{-98}\).
The proof is fairly simple, but relies on a bit of notation for clarity. Let's start by restating UMASH's hybrid PH/ENH block compressor in that notation.
The current block compressor in UMASH splits a 256-byte block \(m\) in 16 chunks \(m_i,\, i\in [0, 15]\) of 128 bits each, and processes all but the last chunk with a PH loop,
\[ \bigoplus_{i=0}^{14} \mathtt{PH}(k_i, m_i), \]
where
\[ \mathtt{PH}(k_i, m_i) = ((k_i \bmod 2^{64}) \oplus (m_i \bmod 2^{64})) \odot (\lfloor k_i / 2^{64} \rfloor \oplus \lfloor m_i / 2^{64} \rfloor) \]
\(\odot\) is a carryless 64x64 -> 128 multiplication, and each \(k_i\) is a randomly generated 128-bit parameter.
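As a reading aid, here is a minimal Python sketch of the PH mixer defined above. The names `clmul64`, `ph`, and `ph_loop` are made up for this illustration, and the bit-by-bit carryless multiplication stands in for the hardware CLMUL instruction; it shows the arithmetic only, not how UMASH is actually implemented.

```python
MASK64 = (1 << 64) - 1


def clmul64(a: int, b: int) -> int:
    """Carryless product of two 64-bit values (polynomials over GF(2))."""
    result = 0
    for bit in range(64):
        if (b >> bit) & 1:
            result ^= a << bit  # xor instead of add: no carries
    return result  # at most 127 bits


def ph(k: int, m: int) -> int:
    """PH(k, m): xor each 64-bit half with the key, then clmul the halves."""
    lo = (k & MASK64) ^ (m & MASK64)
    hi = (k >> 64) ^ (m >> 64)
    return clmul64(lo, hi)


def ph_loop(keys, chunks) -> int:
    """Xor together PH(k_i, m_i) for every (key, chunk) pair."""
    acc = 0
    for k, m in zip(keys, chunks):
        acc ^= ph(k, m)
    return acc
```

Because the product of two 64-bit polynomials has degree at most 126, the result always fits in 127 bits.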
The compression loop in UMASH handles the last chunk, along with a size tag (to protect against extension attacks), with ENH:
\[ \mathtt{ENH}(k, x, y) = \left(((k + x) \bmod 2^{64}) \cdot ((\lfloor k / 2^{64}\rfloor + \lfloor x / 2^{64} \rfloor) \bmod 2^{64}) + y\right) \bmod 2^{128}. \]
The core operation in ENH is a full (64x64 -> 128) integer multiplication, which has lower latency than PH's carryless multiplication on x86-64. That's why UMASH switches to ENH for the last chunk. We use ENH for only one chunk because combining multiple NH values calls for 128-bit additions, and that's slower than PH's xors. Once we have mixed the last chunk and the size tag with ENH, the result is simply xored in with the previous chunks' PH values:
\[ \left(\bigoplus_{i=0}^{14} \mathtt{PH}(k_i, m_i)\right) \oplus \mathtt{ENH}(k_{15}, m_{15}, \mathit{tag}). \]
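The hybrid compressor can be sketched in a few lines of Python. This is an illustration of the formulas above under simplifying assumptions: chunks, keys, and the size tag are plain 128-bit integers, and `enh`, `ph`, and `compress_block` are hypothetical names, not UMASH's API.

```python
MASK64 = (1 << 64) - 1
MASK128 = (1 << 128) - 1


def clmul64(a: int, b: int) -> int:
    """Carryless 64x64 -> 128 multiplication, one bit at a time."""
    result = 0
    for bit in range(64):
        if (b >> bit) & 1:
            result ^= a << bit
    return result


def ph(k: int, m: int) -> int:
    """PH mixer for one 128-bit chunk."""
    return clmul64((k ^ m) & MASK64, (k ^ m) >> 64)


def enh(k: int, x: int, y: int) -> int:
    """ENH: add halves mod 2^64, full integer multiply, add y mod 2^128."""
    lo = ((k & MASK64) + (x & MASK64)) & MASK64
    hi = ((k >> 64) + (x >> 64)) & MASK64
    return (lo * hi + y) & MASK128


def compress_block(keys, chunks, tag: int) -> int:
    """PH for chunks 0..14, ENH (with the size tag) for chunk 15."""
    assert len(keys) == len(chunks) == 16
    acc = enh(keys[15], chunks[15], tag)
    for k, m in zip(keys[:15], chunks[:15]):
        acc ^= ph(k, m)
    return acc
```

Xoring the key into the whole chunk before splitting it gives the same halves as xoring half by half, which is why `ph` can be written this compactly.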
This function is annoying to analyse directly, because we end up having to manipulate different proofs of almost-universality. Let's abstract things a bit, and reduce the ENH/PH to the bare minimum we need to find our collision bounds.
Let's split our message blocks in \(n\) (\(n = 16\) for UMASH) "chunks", and apply an independently sampled mixing function to each chunk. Let's say we have two messages \(m\) and \(m^\prime\) with chunks \(m_i\) and \(m^\prime_i\), for \(i\in [0, n)\), and let \(h_i\) be the result of mixing chunk \(m_i,\) and \(h^\prime_i\) that of mixing \(m^\prime_i.\)
We'll assume that the first chunk is mixed with a \(2^{-w}\)-almost-universal (\(2^{-64}\) for UMASH) hash function: if \(m_0 \neq m^\prime_0,\) \(\mathrm{P}[h_0 = h^\prime_0] \leq 2^{-w}\) (where the probability is taken over the set of randomly chosen parameters for the mixer). Otherwise, \(m_0 = m^\prime_0 \Rightarrow h_0 = h^\prime_0\).
This first chunk stands for the ENH iteration in UMASH.
Every remaining chunk will instead be mixed with a \(2^{-w}\)-XOR-almost-universal hash function: if \(m_i \neq m^\prime_i\) (\(0 < i < n\)), \(\mathrm{P}[h_i \oplus h^\prime_i = y] \leq 2^{-w}\) for any \(y,\) where the probability is taken over the randomly generated parameter for the mixer.
This stronger condition represents the PH iterations in UMASH.
We hash a full block by xoring all the mixed chunks together:
\[ H = \bigoplus_{i = 0}^{n - 1} h_i, \]
and
\[ H^\prime = \bigoplus_{i = 0}^{n - 1} h^\prime_i. \]
We want to bound the probability that \(H = H^\prime \Leftrightarrow H \oplus H^\prime = 0,\) assuming that the messages differ (i.e., there is at least one index \(i\) such that \(m_i \neq m^\prime_i\)).
If the two messages only differ in \(m_0 \neq m^\prime_0\) (and thus \(m_i = m^\prime_i,\,\forall i \in [1, n)\)),
\[ \bigoplus_{i = 1}^{n - 1} h_i = \bigoplus_{i = 1}^{n - 1} h^\prime_i, \]
and thus \(H = H^\prime \Leftrightarrow h_0 = h^\prime_0\).
By hypothesis, the 0th chunks are mixed with a \(2^{-w}\)-almost-universal hash, so this happens with probability at most \(2^{-w}\).
Otherwise, assume that \(m_j \neq m^\prime_j\), for some \(j \in [1, n)\). We will rearrange the expression
\[ H \oplus H^\prime = h_j \oplus h^\prime_j \oplus \left(\bigoplus_{i\in [0, n) \setminus \{ j \}} h_i \oplus h^\prime_i\right). \]
Let's conservatively replace that unwieldy sum with an adversarially chosen value \(y\):
\[ H \oplus H^\prime = h_j \oplus h^\prime_j \oplus y, \]
and thus \(H = H^\prime\) iff \(h_j \oplus h^\prime_j = y.\) By hypothesis, the \(j\)th chunk (every chunk but the 0th) is mixed with a \(2^{-w}\)-almost-XOR-universal hash, and this thus happens with probability at most \(2^{-w}\).
In both cases, we find a collision probability at most \(2^{-w}\) with a simple analysis, despite combining mixing functions from different families over different rings.
We combined strong mixers (each is \(2^{-w}\)-almost-universal), and only got a \(2^{-w}\)-almost-universal output. It seems like we should be able to do better when two or more chunks differ.
As Nandi points out, we can apply erasure codes to derive additional chunks from the original messages' contents. We only need one more chunk, so we can simply xor together all the original chunks:
\[m_n = \bigoplus_{i=0}^{n - 1} m_i,\]
and similarly for \(m^\prime_n\). If \(m\) and \(m^\prime\) differ in only one chunk, \(m_n \neq m^\prime_n\). It's definitely possible for \(m_n = m^\prime_n\) when \(m \neq m^\prime\), but only if two or more chunks differ.
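A tiny Python sketch makes the checksum's behaviour concrete. The function name `lrc` and the toy four-chunk messages are illustrative only; real chunks are 128-bit values.

```python
from functools import reduce
from operator import xor


def lrc(chunks) -> int:
    """Xor all chunks together to derive the extra checksum chunk m_n."""
    return reduce(xor, chunks, 0)


m = [1, 2, 3, 4]
one_diff = [1, 2, 7, 4]          # differs from m in a single chunk
two_diffs = [1 ^ 5, 2 ^ 5, 3, 4]  # two differences that cancel out

# A single differing chunk always changes the checksum...
single_changes = lrc(m) != lrc(one_diff)
# ...but two differing chunks may leave it untouched.
pair_cancels = lrc(m) == lrc(two_diffs)
```

The second case is exactly the one the secondary hash has to handle without help from the checksum chunk.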
We will again mix \(m_n\) and \(m^\prime_n\) with a fresh \(2^{-w}\)-almost-XOR-universal hash function to yield \(h_n\) and \(h^\prime_n\).
We want to xor the results \(h_n\) and \(h^\prime_n\) with the second (still undefined) hash values \(H_2\) and \(H^\prime_2\); if \(m_n \neq m^\prime_n\), the final xored values are equal with probability at most \(2^{-w}\), regardless of \(H_2\) and \(H^\prime_2\ldots\) and, crucially, independently of whether \(H = H^\prime\).
When the two messages \(m\) and \(m^\prime\) only differ in a single (initial) chunk, mixing an LRC checksum gives us an independent hash function, which squares the collision probability to \(2^{-2w}\).
Now to the interesting bit: we must define a second hash function that combines \(h_0,h_1,\ldots, h_{n - 1}\) and \(h^\prime_0, h^\prime_1, \ldots, h^\prime_{n - 1}\) such that the resulting hash values \(H_2\) and \(H^\prime_2\) collide independently enough of \(H\) and \(H^\prime\). That's a tall order, but we do have one additional assumption to work with: we only care about collisions in this second hash function if the additional checksum chunks are equal, which means that the two messages differ in two or more chunks (or they're identical).
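For each index \(0 < i < n\), we'll fix a public linear (with xor as the addition) function \(\overline{xs}_i(x)\). This family of functions must have two properties: first, \(f(x) = x \oplus \overline{xs}_i(x)\) must be invertible for every \(0 < i < n\); second, for any two distinct indices \(0 < i < j < n\), the combined function \(g(x) = \overline{xs}_i(x) \oplus \overline{xs}_j(x)\) must have a small null space.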
For regularity, we will also define \(\overline{xs}_0(x) = x\).
Concretely, let \(\overline{xs}_1(x) = x \ll 1\), where the bitshift is computed for the two 64-bit halves independently, and \(\overline{xs}_i(x) = (x \ll 1) \oplus (x \ll i)\) for \(i > 1\), again with all the bitshifts computed independently over the two 64-bit halves.
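The following Python sketch spells out this family and checks, by brute-force linear algebra over GF(2), the two properties the proof relies on further down: \(x \oplus \overline{xs}_i(x)\) is invertible, and \(\overline{xs}_i \oplus \overline{xs}_j\) has a small null space. The helper names (`shl_halves`, `xs`, `rank`, `null_space_bits`) are invented for this illustration.

```python
MASK64 = (1 << 64) - 1


def shl_halves(x: int, shift: int) -> int:
    """Shift both 64-bit halves of a 128-bit value left, independently."""
    lo = ((x & MASK64) << shift) & MASK64
    hi = ((x >> 64) << shift) & MASK64
    return (hi << 64) | lo


def xs(i: int, x: int) -> int:
    """xs_0 is the identity, xs_1 a plain shift, xs_i a shifted xor-shift."""
    if i == 0:
        return x
    if i == 1:
        return shl_halves(x, 1)
    return shl_halves(x, 1) ^ shl_halves(x, i)


def rank(f, bits: int = 128) -> int:
    """Rank of a xor-linear map, via Gaussian elimination on basis images."""
    pivots = {}  # position of the leading bit -> reduced image vector
    for b in range(bits):
        v = f(1 << b)
        while v:
            top = v.bit_length() - 1
            if top not in pivots:
                pivots[top] = v
                break
            v ^= pivots[top]
    return len(pivots)


def null_space_bits(i: int, j: int) -> int:
    """log2 of the number of x such that xs_i(x) ^ xs_j(x) == 0."""
    return 128 - rank(lambda x: xs(i, x) ^ xs(j, x))
```

For \(n = 16\), every \(x \oplus \overline{xs}_i(x)\) has full rank 128, and `null_space_bits(i, j)` never exceeds \(2j\); the worst case is the pair \((1, 15)\), which loses 15 bits in each half.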
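To see that these satisfy our requirements, we can represent the functions as carryless multiplication by distinct "even" constants (the least significant bit is 0) on each 64-bit half. Xoring in \(x\) turns the multiplier into an odd constant, and carryless multiplication by an odd constant, truncated to 64 bits, is invertible. Xoring two of these functions together yields a multiplication by a nonzero even constant, which only erases as many high bits in each half as that constant has trailing zeros: at most \(j\) bits per half for \(\overline{xs}_i \oplus \overline{xs}_j\) with \(i < j\).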
To recapitulate, we defined the first hash function as
\[ H = \bigoplus_{i = 0}^{n - 1} h_i, \]
the (xor) sum of the mixed value \(h_i\) for each chunk \(m_i\) in the message block \(m\), and similarly for \(H^\prime\) and \(h^\prime_i\).
We'll let the second hash function be
\[ H_2 \oplus h_n = \left(\bigoplus_{i = 0}^{n - 1} \overline{xs}_i(h_i)\right) \oplus h_n, \]
and
\[ H^\prime_2 \oplus h^\prime_n = \left(\bigoplus_{i = 0}^{n - 1} \overline{xs}_i(h^\prime_i)\right) \oplus h^\prime_n. \]
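In code, the two hashes are two passes over the same mixed values. This Python sketch assumes the mixed chunks \(h_0, \ldots, h_{n-1}\) and the mixed checksum \(h_n\) are already available as 128-bit integers; `primary_hash` and `secondary_hash` are hypothetical names.

```python
MASK64 = (1 << 64) - 1


def shl_halves(x: int, shift: int) -> int:
    """Shift both 64-bit halves of a 128-bit value left, independently."""
    lo = ((x & MASK64) << shift) & MASK64
    hi = ((x >> 64) << shift) & MASK64
    return (hi << 64) | lo


def xs(i: int, x: int) -> int:
    """xs_0(x) = x, xs_1(x) = x << 1, xs_i(x) = (x << 1) ^ (x << i)."""
    if i == 0:
        return x
    if i == 1:
        return shl_halves(x, 1)
    return shl_halves(x, 1) ^ shl_halves(x, i)


def primary_hash(h) -> int:
    """H: plain xor of the mixed chunks h_0 .. h_{n-1}."""
    acc = 0
    for value in h:
        acc ^= value
    return acc


def secondary_hash(h, h_n: int) -> int:
    """H_2 ^ h_n: xor of xs_i(h_i), plus the mixed checksum chunk."""
    acc = h_n
    for i, value in enumerate(h):
        acc ^= xs(i, value)
    return acc
```

Both functions are linear in the \(h_i\), which is what lets the proof reason about the differences \(h_i \oplus h^\prime_i\) alone.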
We can finally get down to business and find some collision bounds. We've already shown that both \(H = H^\prime\) and \(H_2 \oplus h_n = H^\prime_2 \oplus h^\prime_n\) collide simultaneously with probability at most \(2^{-2w}\) when the checksum chunks differ, i.e., when \(m_n \neq m^\prime_n\).
Let's now focus on the case when \(m \neq m^\prime\), but \(m_n = m^\prime_n\). In that case, we know that at least two chunks \(0 \leq i < j < n\) differ: \(m_i \neq m^\prime_i\) and \(m_j \neq m^\prime_j\).
If only two chunks \(i\) and \(j\) differ, and one of them is the \(i = 0\)th chunk, we want to bound the probability that
\[ h_0 \oplus h_j = h^\prime_0 \oplus h^\prime_j \]
and
\[ h_0 \oplus \overline{xs}_j(h_j) = h^\prime_0 \oplus \overline{xs}_j(h^\prime_j), \]
both at the same time.
Letting \(\Delta_i = h_i \oplus h^\prime_i\), we can reformulate the two conditions as
\[ \Delta_0 = \Delta_j \] and \[ \Delta_0 = \overline{xs}_j(\Delta_j). \]
Taking the xor of the two conditions yields
\[ \Delta_j \oplus \overline{xs}_j(\Delta_j) = 0, \]
which is only satisfied for \(\Delta_j = 0\), since \(f(x) = x \oplus \overline{xs}_j(x)\) is an invertible linear function. This also forces \(\Delta_0 = 0\).
By hypothesis, \(\mathrm{P}[\Delta_j = 0] \leq 2^{-w}\), and \(\mathrm{P}[\Delta_0 = 0] \leq 2^{-w}\) as well. These two probabilities are independent, so both hashes collide with probability at most \(2^{-2w}\) (\(2^{-128}\)).
In the other case, we have messages that differ in at least two chunks \(0 < i < j < n\): \(m_i \neq m^\prime_i\) and \(m_j \neq m^\prime_j\).
We can simplify the collision conditions to
\[ h_i \oplus h_j = h^\prime_i \oplus h^\prime_j \oplus y \]
and
\[ \overline{xs}_i(h_i) \oplus \overline{xs}_j(h_j) = \overline{xs}_i(h^\prime_i) \oplus \overline{xs}_j(h^\prime_j) \oplus z, \]
for \(y\) and \(z\) generated arbitrarily (adversarially), but without knowledge of the parameters that generated \(h_i, h_j, h^\prime_i, h^\prime_j\).
Again, let \(\Delta_i = h_i \oplus h^\prime_i\) and \(\Delta_j = h_j \oplus h^\prime_j\), and reformulate the conditions into
\[ \Delta_i \oplus \Delta_j = y \] and \[ \overline{xs}_i(\Delta_i) \oplus \overline{xs}_j(\Delta_j) = z. \]
Let's apply the linear function \(\overline{xs}_i\) to the first condition
\[ \overline{xs}_i(\Delta_i) \oplus \overline{xs}_i(\Delta_j) = \overline{xs}_i(y); \]
since \(\overline{xs}_i\) isn't invertible, the result isn't equivalent, but is a weaker (necessary, not sufficient) version of the initial condition.
After xoring that with the second condition
\[ \overline{xs}_i(\Delta_i) \oplus \overline{xs}_j(\Delta_j) = z, \]
we find
\[ \overline{xs}_i(\Delta_j) \oplus \overline{xs}_j(\Delta_j) = \overline{xs}_i(y) \oplus z. \]
By hypothesis, the null space of \(g(x) = \overline{xs}_i(x) \oplus \overline{xs}_j(x)\) is "small." For our concrete definition of \(\overline{xs}\), there are at most \(2^{2j}\) values in that null space, which means that \(\Delta_j\) can only satisfy the combined xored condition by taking one of at most \(2^{2j}\) values; otherwise, the two hashes definitely can't both collide.
Since \(j < n\), this happens with probability at most \(2^{2(n - 1) - w} = 2^{-34}\) for UMASH with \(w = 64\) and \(n = 16\).
Finally, for any given \(\Delta_j\), there is at most one \(\Delta_i\) that satisfies
\[ \Delta_i \oplus \Delta_j = y,\]
and so both hashes collide with probability at most \(2^{-98}\), for \(w = 64\) and \(n = 16\).
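Spelling out the arithmetic for \(w = 64\) and \(n = 16\), as a sanity check on the exponents (this only restates the two steps above: \(\Delta_j\) must land in one of at most \(2^{2(n - 1)}\) values, and \(\Delta_i\) is then forced):

\[ \underbrace{2^{2(n - 1)} \cdot 2^{-w}}_{\Delta_j \text{ takes an admissible value}} \cdot \underbrace{2^{-w}}_{\Delta_i = y \oplus \Delta_j} = 2^{2(n - 1) - 2w} = 2^{30 - 128} = 2^{-98}. \]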
Astute readers will notice that we could let \(\overline{xs}_i(x) = x \ll i\), and find the same combined collision probability. However, this results in a much weaker secondary hash, since a chunk could lose up to \(2n - 2\) bits (\(n - 1\) in each 64-bit half) of hash information to a plain shift. The shifted xor-shifts might be a bit slower to compute, but guarantee that we only lose at most 2 bits^{1} of information per chunk. This feels like an interface that's harder to misuse.
If one were to change the \(\overline{xs}_i\) family of functions, I think it would make more sense to look at a more diverse form of (still sparse) multipliers, which would likely let us preserve a couple more bits of independence. Jim has constructed such a family of multipliers, in arithmetic modulo \(2^{64}\); I'm sure we could find something similar in carryless multiplication. The hard part is implementing these multipliers: in order to exploit the multipliers' sparsity, we'd probably have to fully unroll the block hashing loop, and that's not something I like to force on implementations.
The base UMASH block compressor mixes all but the last of the message block's 16-byte chunks with PH: xor the chunk with the corresponding bytes in the parameter array, and compute a carryless multiplication of the xored chunk's half with the other half. The last chunk goes through a variant of ENH with an invertible finaliser (safe because we only rely on \(\varepsilon\)-almost-universality), and everything is xored in the accumulator.
The collision proofs above preserved the same structure for the first hash.
The second hash reuses so much work from the first that it mostly makes sense to consider a combined loop that computes both (regular UMASH and this new xorshifted variant) block compression functions at the same time.
The first change for this combined loop is that we need to xor together all 16-byte chunks in the message, and mix the resulting checksum with a fresh PH function. That's equivalent to xoring everything in a new accumulator (or two accumulators when working with 256-bit vectors) initialised with the PH parameters, and CLMULing together the accumulator's two 64-bit halves at the end.
We also have to apply the \(\overline{xs}_i\) quasi-xor-shift functions to each \(h_i\). The trick is to accumulate the shifted values in two variables: one is the regular UMASH accumulator without \(h_0\) (i.e., \(h_1 \oplus h_2 \ldots\)), and the other shifts the current accumulator before xoring in a new value, i.e., \(\mathtt{acc}^\prime = (\mathtt{acc} \ll 1) \oplus h_i\), where the left shift on parallel 64-bit halves simply adds acc to itself.
This additional shifted accumulator includes another special case to skip \(\overline{xs}_1(x) = x \ll 1\); that's not a big deal for the code, since we already have to special case the last iteration for the ENH mixer.
Armed with \(\mathtt{UMASH} = \bigoplus_{i=1}^{n - 1} h_i\) and \(\mathtt{acc} = \bigoplus_{i=2}^{n - 1} h_i \ll (i - 1),\) we have \[\bigoplus_{i=1}^{n - 1} \overline{xs}_i(h_i) = (\mathtt{UMASH} \oplus \mathtt{acc}) \ll 1.\]
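Here is a small Python check of that identity. It is only a sketch: it assumes the values are visited from \(h_{n-1}\) down to \(h_1\) so that later indices accumulate more shifts, which is one ordering that makes the exponents work out and not necessarily the order a real implementation uses. The function names are hypothetical, and \(h_0\) (the ENH value) is left out, as in the text.

```python
MASK64 = (1 << 64) - 1


def shl_halves(x: int, shift: int) -> int:
    """Shift both 64-bit halves of a 128-bit value left, independently."""
    lo = ((x & MASK64) << shift) & MASK64
    hi = ((x >> 64) << shift) & MASK64
    return (hi << 64) | lo


def xs(i: int, x: int) -> int:
    """xs_1(x) = x << 1, and xs_i(x) = (x << 1) ^ (x << i) for i > 1."""
    if i == 1:
        return shl_halves(x, 1)
    return shl_halves(x, 1) ^ shl_halves(x, i)


def xs_sum_direct(h) -> int:
    """Reference: xor of xs_i(h_i) for i in 1 .. n-1 (h[0] is ignored)."""
    total = 0
    for i in range(1, len(h)):
        total ^= xs(i, h[i])
    return total


def xs_sum_incremental(h) -> int:
    """Same value with two accumulators and one final shift."""
    umash = 0  # h_1 ^ h_2 ^ ... ^ h_{n-1}
    acc = 0    # ends up as the xor of h_i << (i - 1) for i >= 2
    for i in range(len(h) - 1, 0, -1):
        umash ^= h[i]
        acc = shl_halves(acc, 1)  # per-half shift, i.e. acc + acc per lane
        if i > 1:                 # special case: skip xoring h_1 into acc
            acc ^= h[i]
    return shl_halves(umash ^ acc, 1)
```

The per-chunk cost of the incremental version is one xor into each accumulator and one shift, with no dependence on the chunk index.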
We just have to xor in the PH-mixed checksum \(h_n\), and finally \(h_0\) (which naturally goes in GPRs, so can be computed while we extract values out of vector registers).
We added two vector xors and one addition for each chunk in a block, and, at the end, one CLMUL plus a couple more xors and adds again.
This should most definitely be faster than computing two UMASH at the same time, which incurred two vector xors and a CLMUL (or full integer multiplication) for each chunk: even when CLMUL can pipeline one instruction per cycle, vector additions can dispatch to more execution units, so the combined throughput is still higher.
It's easy to show that UMASH is relatively safe when one block is shorter than the other, and we simply xor together fewer mixed chunks. Without loss of generality, we can assume the longer block has \(n\) chunks; that block's final ENH is independent of the shorter block's UMASH, and any specific value occurs with probability at most \(2^{-63}\) (the probability of a multiplication by zero).
A similar argument seems more complex to defend for the shifted UMASH.
Luckily, we can tweak the LRC checksum we use to generate an additional chunk in the block: rather than xoring together the raw message chunks, we'll xor them after xoring them with the PH key, i.e.,
\[m_n = \bigoplus_{i=0}^{n - 1} m_i \oplus k_i, \]
where \(k_i\) are the PH parameters for each chunk.
When checksumming blocks of the same size, this is a no-op with respect to collision probabilities. Implementations might however benefit from the ability to use a fused xor with load from memory^{2} to compute \(m_i \oplus k_i\), and feed that both into the checksum and into CLMUL for PH.
Unless we're extremely unlucky (\(m_{n - 1} = k_{n - 1}\), with probability \(2^{-2w}\)), the long block's LRC will differ from the shorter block's. As long as we always xor in the same PH parameters when mixing the artificial LRC, the secondary hashes collide with probability at most \(2^{-64}\).
With a small tweak to the checksum function, we can easily guarantee that blocks with a different number of chunks collide with probability less than \(2^{-126}\).^{3}
Thank you Joonas for helping me rubber duck the presentation, and Jim for pointing me in the right direction, and for the fruitful discussion!
^{1} It's even better for UMASH, since we obtained these shifted chunks by mixing with PH. The result of PH is the carryless product of two 64-bit values, so the most significant bit is always 0. The shifted-xor-shift doesn't erase any information in the high 64-bit half!
^{2} This might also come with a small latency hit, which is unfortunate since PH-ing \(m_n\) is likely to be on the critical path... but one cycle doesn't seem that bad.
^{3} The algorithm to expand any input message to a sequence of full 16-byte chunks is fixed. That's why we incorporate a size tag in ENH; that makes it impossible for two messages of different lengths to collide when they are otherwise identical after expansion.
We accidentally a whole hash function... but we had a good reason! Our MIT-licensed UMASH hash function is a decently fast non-cryptographic hash function that guarantees a worst-case bound on the probability of collision between any two inputs generated independently of the UMASH parameters.
On the 2.5 GHz Intel 8175M servers that power Backtrace's hosted offering, UMASH computes a 64-bit hash for short cached inputs of up to 64 bytes in 9-22 ns, and for longer ones at up to 22 GB/s, while guaranteeing that two distinct inputs of at most \(s\) bytes collide with probability less than \(\lceil s / 2048 \rceil \cdot 2^{-56}\). If that's not good enough, we can also reuse most of the parameters to compute two independent UMASH values. The resulting 128-bit fingerprint function offers a short-input latency of 9-26 ns, a peak throughput of 11.2 GB/s, and a collision probability of \(\lceil s / 2048 \rceil^2 \cdot 2^{-112}\) (better than \(2^{-70}\) for input size up to 7.5 GB). These collision bounds hold for all inputs constructed without any feedback about the randomly chosen UMASH parameters.
The latency on short cached inputs (9-22 ns for 64 bits, 9-26 ns for 128) is somewhat worse than the state of the art for non-cryptographic hashes (wyhash achieves 8-15 ns and xxh3 8-12 ns), but still in the same ballpark. It also compares well with latency-optimised hash functions like FNV-1a (5-86 ns) and MurmurHash64A (7-23 ns).
Similarly, UMASH's peak throughput (22 GB/s) does not match the current best hash throughput (37 GB/s with xxh3 and falkhash, apparently 10% higher with Meow hash), but does come within a factor of two; it's actually higher than that of some performance-optimised hashes, like wyhash (16 GB/s) and farmhash32 (19 GB/s). In fact, even the 128-bit fingerprint (11.2 GB/s) is comparable to respectable options like MurmurHash64A (5.8 GB/s) and SpookyHash (11.6 GB/s).
What sets UMASH apart from these other non-cryptographic hash functions is its proof of a collision probability bound. In the absence of an adversary that adaptively constructs pathological inputs as it infers more information about the randomly chosen parameters, we know that two distinct inputs of \(s\) or fewer bytes will have the same 64-bit hash with probability at most \(\lceil s / 2048 \rceil \cdot 2^{-56},\) where the probability is taken over the random "key" parameters.
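To get a feel for how that bound scales with input size, here is a short Python sketch that evaluates it exactly with rational arithmetic. The function name `umash64_collision_bound` is made up for this example, and the formula is simply the one quoted above, for inputs of at least one byte.

```python
from fractions import Fraction


def umash64_collision_bound(max_bytes: int) -> Fraction:
    """ceil(s / 2048) * 2^-56 for two distinct inputs of at most s bytes."""
    if max_bytes < 1:
        raise ValueError("the bound is stated for non-empty inputs")
    blocks = -(-max_bytes // 2048)  # integer ceiling division
    return Fraction(blocks, 2**56)


# The bound is flat up to 2048 bytes, then grows by 2^-56 per extra 2 KiB.
short_bound = umash64_collision_bound(64)
megabyte_bound = umash64_collision_bound(1 << 20)
```

For a 1 MiB input, the bound works out to \(512 \cdot 2^{-56} = 2^{-47}\).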
Only one non-cryptographic hash function in Reini Urban's fork of SMHasher provides this sort of bound: CLHash guarantees a collision probability \(\approx 2^{-63}\) in the same universal hashing model as UMASH. While CLHash's peak throughput (22 GB/s) is equal to UMASH's, its latency on short inputs is worse (23-25 ns instead of 9-22 ns). We will also see that its stronger collision bound remains too weak for many practical applications. In order to compute a fingerprint with CLHash, one would have to combine multiple hashes, exactly like we did for the 128-bit UMASH fingerprint.
Actual cryptographic hash functions provide stronger bounds in a much more pessimistic model; however they're also markedly slower than non-cryptographic hashes. BLAKE3 needs at least 66 ns to hash short inputs, and achieves a peak throughput of 5.5 GB/s. Even the reduced-round SipHash-1-3 hashes short inputs in 18-40 ns and longer ones at a peak throughput of 2.8 GB/s. That's the price of their pessimistically adversarial security model. Depending on the application, it can make sense to consider a more restricted adversary that must prepare its dirty deed before the hash function's parameters are generated at random, and still ask for provable bounds on the probability of collisions. That's the niche we're targeting with UMASH.
Clearly, the industry is comfortable with no bound at all. However, even in the absence of seed-independent collisions, timing side-channels in a data structure implementation could theoretically leak information about colliding inputs, and iterating over a hash table's entries to print its contents can divulge even more bits. A sufficiently motivated adversary could use something like that to learn more about the key and deploy an algorithmic denial of service attack. For example, the linear structure of UMASH (and of other polynomial hashes like CLHash) makes it easy to combine known collisions to create exponentially more colliding inputs. There is no universal answer; UMASH is simply another point in the solution space.
If reasonable performance coupled with an actual bound on collision probability for data that does not adaptively break the hash sounds useful to you, take a look at UMASH on GitHub!
The next section will explain why we found it useful to design another hash function. The rest of the post sketches how UMASH works and how it balances short-input latency and strength, before describing a few interesting usage patterns.
The latency and throughput results above were all measured on the same unloaded 2.5 GHz Xeon 8175M. While we did not disable frequency scaling (#cloud), the clock rate seemed stable at 3.1 GHz during our run.
Engineering is the discipline of satisficisation: crisply defined problems with perfect solutions rarely exist in reality, so we must resign ourselves to satisfying approximate constraint sets "well enough." However, there are times when all options are not only imperfect, but downright sucky. That's when one has to put on a different hat, and question the problem itself: are our constraints irremediably at odds, or are we looking at an underexplored solution space?
In the former case, we simply have to want something else. In the latter, it might make sense to spend time to really understand the current set of options and hand-roll a specialised approach.
That's the choice we faced when we started caching intermediate results in Backtrace's database and found a dearth of acceptable hash functions. Our in-memory columnar database is a core component of the backend, and, like most analytics databases, it tends to process streams of similar queries. However, a naïve query cache would be ineffective: our more heavily loaded servers handle a constant write load of more than 100 events per second with dozens of indexed attributes (populated column values) each. Moreover, queries invariably select a large number of data points with a time windowing predicate that excludes old data... and the endpoints of these time windows advance with each wall-clock second. The queries evolve over time, and must usually consider newly ingested data points.
Bhatotia et al.'s Slider shows how we can specialise the idea of self-adjusting or incremental computation for repeated MapReduce-style queries over a sliding window. The key idea is to split the data set at stable boundaries (e.g., on date change boundaries rather than 24 hours from the beginning of the current time window) in order to expose memoisation opportunities, and to do so recursively to repair around point mutations to older data.
Caching fully aggregated partial results works well for static queries, like scheduled reports... but the first step towards creating a great report is interactive data exploration, and that's an activity we strive to support well, even when drilling down tens of millions of rich data points. That's why we want to also cache intermediate results, in order to improve response times when tweaking a saved report, or when crafting ad hoc queries to better understand how and when an application fails.
We must go back to a more general incremental computation strategy: rather than only splitting up inputs, we want to stably partition the data dependency graph of each query, in order to identify shared subcomponents whose results can be reused. This finer-grained strategy surfaces opportunities to "resynchronise" computations, to recognize when different expressions end up generating a subset of identical results, enabling reuse in later steps. For example, when someone updates a query by adding a selection predicate that only rejects a small fraction of the data, we can expect to reuse some of the post-selection work executed for earlier incarnations of the query, if we remember to key on the selected data points rather than the predicates.
The complication here is that these intermediate results tend to be large. Useful analytical queries start small (a reasonable query coupled with cache/transaction invalidation metadata to stand in for the full data set), grow larger as we select data points, arrange them in groups, and materialise their attributes, and shrink again at the end, as we summarise data and throw out less interesting groups.
When caching the latter shrinking steps, where resynchronised reuse opportunities abound and can save a lot of CPU time, we often find that storing a fully materialised representation of the cache key would take up more space than the cached result.
A classic approach in this situation is to fingerprint cache keys with a cryptographic hash function like BLAKE or SHA-3, and store a compact (128 or 256 bits) fingerprint instead of the cache key: the probability of a collision is then so low that we might as well assume any false positive will have been caused by a bug in the code or a hardware failure. For example, a study of memory errors at Facebook found that uncorrectable memory errors affect 0.03% of servers each month. Assuming a generous clock rate of 5 GHz, this means each clock cycle may be afflicted by such a memory error with probability \(\approx 2.2\cdot 10^{-20} > 2^{-66}.\) If we can guarantee that distinct inputs collide with probability significantly less than \(2^{-66}\), e.g., \(< 2^{-70},\) any collision is far more likely to have been caused by a bug in our code or by hardware failure than by the fingerprinting algorithm itself.
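The back-of-the-envelope estimate is easy to reproduce. The Python sketch below assumes an average month of 365.25 / 12 days and spreads the monthly error rate uniformly over clock cycles; both are simplifications of mine, and the function name is hypothetical. It lands in the same range as the figure quoted above, between \(2^{-66}\) and \(2^{-65}\).

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60


def per_cycle_error_probability(monthly_rate: float,
                                clock_hz: float,
                                days_per_month: float = 365.25 / 12) -> float:
    """Spread a monthly failure probability uniformly over clock cycles."""
    cycles_per_month = days_per_month * SECONDS_PER_DAY * clock_hz
    return monthly_rate / cycles_per_month


# 0.03% of servers per month, at a generous 5 GHz.
p_error = per_cycle_error_probability(0.0003, 5e9)
log2_p_error = math.log2(p_error)  # a bit above -66
```

Any fingerprint whose collision probability sits comfortably below this value is no longer the weakest link.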
Using cryptographic hashes is certainly safe enough, but requires a lot of CPU time, and, more importantly, worsens latency on smaller keys (for which caching may not be that beneficial, such that our goal should be to minimise overhead). It's not that state-of-the-art cryptographic hash functions are wasteful, but that they defend against attacks like key recovery or collision amplification that we may not care to consider in our design.
At the other extreme of the hash spectrum, there is a plethora of fast hash functions with no proof of collision probability. However, most of them are keyed on just a 64-bit "seed" integer, and that's already enough for a pigeonhole argument to show we can construct sets of strings of length \(64m\) bits where any two members collide with probability at least \(m / 2^{64}\). In practice, security researchers seem to find key-independent collisions wherever they look (i.e., the collision probability is on the order of 1 for some particularly pathological sets of inputs), so it's safe to assume that lacking a proof of collision probability implies a horrible worst case. I personally wouldn't put too much faith in "security claims" taking the form of failed attempts at breaking a proposal.
Lemire and Kaser's CLHash is one of the few exceptions we found: it achieves a high throughput of 22 GB/s and comes with a proof of \(2^{-63}\)-almost-universality. However, its finalisation step is slow (23 ns for one-byte inputs), due to a Barrett reduction followed by three rounds of xorshift/multiply mixing. Dai and Krovetz's VHASH, which inspired CLHash, offers similar guarantees, with worse performance.
Unfortunately, \(2^{-63}\) is also not quite good enough for our purposes: we estimate that the probability of uncorrectable memory errors is on the order of \(2^{-66}\) per clock cycle, so we want the collision probability for any two distinct inputs to be comfortably less than that, around \(2^{-70}\) (i.e., \(10^{-21}\)) or less. This also tells us that any acceptable fingerprint must consist of more than 64 bits, so we will have to either work in slower multi-word domains, or combine independent hashes.
Interestingly, we also don't need much more than that for (non-adversarial) fingerprinting: at some point, the theoretical probability of a collision is dominated by the practical possibility of a hardware or networking issue making our program execute the fingerprinting function incorrectly, or pass the wrong data to that function.
While CLHash and VHASH aren't quite what we want, they're pretty close, so we felt it made sense to come up with a specialised solution for our fingerprinting use case.
Krovetz et al.'s RFC 4418 brings an interesting idea: we can come up with a fast 64-bit hash function structured to make it easy to compute a second independent hash value, and concatenate two independent 64-bit outputs. The hash function can heavily favour computational efficiency and let each 64-bit half collide with probability \(\varepsilon\) significantly worse than \(2^{-64}\), as long as the collision probability for the concatenated fingerprint, \(\varepsilon^2\), is small enough, i.e., as long as \(\varepsilon^2 < 2^{-70} \Longleftrightarrow \varepsilon < 2^{-35}\). We get a more general purpose hash function out of the deal, and the fingerprint comparison logic is now free to only compute and look at half the fingerprint when it makes sense (e.g., in a pre-pass that tolerates spurious matches).
The design of UMASH is driven by two observations:
CLHash achieves a high throughput, but introduces a lot of latency to finalise its 127-bit state into a 64-bit result.
We can get away with a significantly weaker hash, since we plan to combine two of them when we need a strong fingerprint.
That's why we started with the high-level structure diagrammed below, the same as UMAC, VHASH, and CLHash: a fast first-level block compression function based on Winograd's pseudo dot-product, and a second-level Carter-Wegman polynomial hash function to accumulate the compressed outputs in a fixed-size state.
The inner loop in this two-level strategy is the block compressor,
which divides each 256-byte block \(m\) into 32 64-bit values
\(m_i\), combines them with randomly generated parameters \(k_i\),
and converts the resulting sequence of machine words to a 16-byte
output. The performance of that component will largely determine the
hash function's global peak throughput. After playing around with
the NH inner loop, we came to the same conclusion as Lemire and Kaser:
the scalar operations, the outer 128-bit ones in particular, map to too
many µops. We thus focused on the same PH inner loop as CLHash.
While the similarity to NH is striking, analysing PH is actually
much simpler: we can see the xor and carryless multiplications as
working in the same ring of polynomials over \(\mathrm{GF}(2)\), unlike
NH's mixing of \(\bmod 2^{64}\) for the innermost additions with
\(\bmod 2^{128}\) for the outer multiplications and sum. In
fact, as Bernstein points out, PH is a direct application of
Winograd's pseudo dot-product
to compute a multiplicative vector hash in half the multiplications.
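To make the structure concrete, here is a minimal sketch of a textbook PH loop: xor each word with its parameter, multiply pairs carrylessly, and xor the products together. This follows my understanding of PH as used in CLHash and is not UMASH's exact compressor. The `clmul` helper is a slow bit-by-bit stand-in for a hardware carryless multiplication, and all names are illustrative.

```python
MASK64 = (1 << 64) - 1


def clmul(a, b):
    """Carryless product of two integers, i.e., polynomial product over GF(2)."""
    result = 0
    while b:
        if b & 1:
            result ^= a  # "addition" in GF(2)[x] is xor
        a <<= 1
        b >>= 1
    return result


def ph(words, params):
    """Textbook PH over an even number of 64-bit words, one parameter per word.

    Each pair of words costs a single carryless multiplication, and the
    products are accumulated with xor. The result fits in 127 bits.
    """
    assert len(words) == len(params) and len(words) % 2 == 0
    acc = 0
    for i in range(0, len(words), 2):
        x = (words[i] ^ params[i]) & MASK64
        y = (words[i + 1] ^ params[i + 1]) & MASK64
        acc ^= clmul(x, y)
    return acc
```

A 256-byte block corresponds to 32 words and 32 parameters, so the loop performs 16 multiplications, half of what a plain multiplicative vector hash would need.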
CLHash uses an aggressively throughput-optimised block size of 1024 bytes. We found diminishing returns after 256 bytes, and stopped there.
With modular or polynomial ring arithmetic, the collision probability is \(2^{-64}\) for any pair of blocks. Given this fast compression function, the rest of the hashing algorithm must chop the input in blocks, accumulate compressed outputs in a constant-size state, and handle the potentially shorter final block while avoiding length extension issues.
Both VHASH and CLHash accumulate compressed outputs in a
polynomial string hash over a large field
(\(\mathbb{Z}/M_{127}\mathbb{Z}\) for VHASH, and
\(\mathrm{GF}(2^{127})\) with irreducible polynomial \(x^{127} + x + 1\)
for CLHash): the collision probability for polynomial string hashes is
inversely proportional to the field size and grows with the string
length (number of compressed blocks), so working in fields much larger
than \(2^{64}\) lets the NH/PH term dominate.
Arithmetic in such large fields is slow, and reducing the 127-bit state to 64 bits is also not fast. CLHash and VHASH make the situation worse by zero-padding the final block, and CLHash defends against length extension attacks with a more complex mechanism than the one in VHASH.
Similarly to VHASH, UMASH uses a polynomial hash over the (much smaller) prime field \(\mathbb{F} = \mathbb{Z}/M_{61}\mathbb{Z},\)
where \(f\in\mathbb{F}\) is the randomly chosen point at which we
evaluate the polynomial, and \(y\), the polynomial's coefficients, is
the stream of 64-bit values obtained by splitting in half the PH
output for each block. This choice saves 20-30 cycles of latency in
the final block, compared to CLHash: modular multiplications have
lower latency than carryless multiplications for judiciously picked
machine-integer-sized moduli, and integer multiplications seem to mix
better, so we need less work in the finaliser.
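For readers who prefer code, here is a minimal sketch of a polynomial string hash over \(\mathbb{Z}/(2^{61} - 1)\mathbb{Z}\), evaluated with Horner's method. The exponent convention (no constant term, first coefficient paired with the highest power) is an assumption made for the illustration, and `poly_hash` is a made-up name.

```python
M61 = (1 << 61) - 1  # the Mersenne prime 2^61 - 1


def poly_hash(f, coefficients):
    """Evaluates sum(y_j * f^(d - j)) mod 2^61 - 1 for coefficients y_0..y_{d-1}.

    Horner's method needs one modular multiplication per coefficient and
    only a single field element of state.
    """
    acc = 0
    for y in coefficients:
        acc = ((acc + y) * f) % M61
    return acc
```

For example, `poly_hash(3, [1, 2])` evaluates \(1\cdot 3^2 + 2\cdot 3 = 15\).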
Of course, UMASH sacrifices a lot of strength by working in \(\mathbb{F} =\, \bmod 2^{61} - 1:\) the resulting field is much smaller than \(2^{127}\), and we now have to update the polynomial twice for the same number of blocks. This means the collision probability starts worse \((\approx 2^{-61}\) instead of \(\approx 2^{-127})\), and grows twice as fast with the number of blocks \(n\) \((\approx 2n\cdot 2^{-61}\) instead of \(\approx n\cdot 2^{-61})\). But remember, we're only aiming for collision probability \(< 2^{-35}\) and each block represents 256 bytes of input data, so this is acceptable, assuming that multi-gigabyte inputs are out of scope.
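To see where "multi-gigabyte" comes from, the sketch below plugs the approximate bound \(2n\cdot 2^{-61}\) into exact arithmetic and solves for the largest input that stays under \(2^{-35}\). It only models the polynomial term quoted above, not the full collision bound, and the function names are illustrative.

```python
import math
from fractions import Fraction

BLOCK_BYTES = 256


def approx_collision_bound(num_blocks):
    """The approximate bound from the text: 2n * 2^-61 for n blocks."""
    return Fraction(2 * num_blocks, 2**61)


def max_input_bytes(target):
    """Largest input size, in bytes, with approximate bound strictly below target."""
    limit = target * 2**61 / 2  # we need n < limit
    max_blocks = math.ceil(limit) - 1
    return max_blocks * BLOCK_BYTES
```

With a target of \(2^{-35}\), the limit is \(2^{25}\) blocks, i.e., just under 8 GiB of input.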
We protect against length extension collisions
by xoring (adding, in the polynomial ring) the original byte size of
the final block to its compressed PH output. This xor is simpler
than CLHash's finalisation step with a carryless multiplication, but
still sufficient: we can adapt Krovetz's proof for VHASH
by replacing NH's almost-\(\Delta\)-universality with PH's
almost-XOR-universality.
Having this protection means we can extend short final blocks however
we want. Rather than conceptually zero-padding our inputs (which adds
complexity and thus latency on short inputs), we allow redundant
reads. We bifurcate inputs shorter than 16 bytes to a completely
different latency-optimised code path, and let the final PH
iteration read the last 16 bytes of the input, regardless of how
redundant that might be.
The semi-literate Python reference implementation has the full code and includes more detailed analysis and rationale for the design decisions.
The previous section already showed how we let micro-optimisation
inform the high-level structure of UMASH. The use of PH over
NH, our choice of a polynomial hash in a small modular field, and
the way we handle short blocks all aim to improve the performance of
production implementations. We also made sure to enable a couple more
implementation tricks with lower-level design decisions.
The block size is set to 256 bytes because we observed diminishing
returns for larger blocks… but also because it's reasonable to cache
the PH loop's parameters in 8 AVX registers, if we need to shave load
µops.
More importantly, it's easy to implement a Horner update with the prime modulus \(2^{61} - 1\). Better, that's also true for a "double-pumped" Horner update, \(h^\prime = H_f(h, a, b) = af + (b + h)f^2.\)
The trick is to work in \(\bmod 2^{64} - 8 = \bmod 8\cdot(2^{61} - 1),\) which lets us implement modular multiplication of an arbitrary 64-bit integer \(a\) by a multiplier \(0 < f < 2^{61} - 1\) without worrying too much about overflow. \(2^{64} \equiv 8 \mod 2^{64} - 8,\) so we can reduce a value \(x\) to a smaller representative with
\[ x \equiv (x \bmod 2^{64}) + 8\lfloor x / 2^{64}\rfloor \mod 2^{64} - 8; \]
this equivalence is particularly useful when \(x < 2^{125}\): in that case, \(x / 2^{64} < 2^{61},\) and the intermediate product \(8\lfloor x / 2^{64}\rfloor < 2^{64}\) never overflows 64 bits. That's exactly what happens when \(x = af\) is the product of \(0\leq a < 2^{64}\) and \(0 < f < 2^{61} - 1\). This also holds when we square the multiplier \(f\): it's sampled from the field \(\mathbb{Z}/(2^{61} - 1)\mathbb{Z},\) so its square also satisfies \(f^2 < 2^{61}\) once fully reduced.
Integer multiplication instructions for 64-bit values will naturally split the product \(x = af\) into its high and low 64-bit halves; we get \(\lfloor x / 2^{64}\rfloor\) and \(x\bmod 2^{64}\) for free. The rest of the double-pumped Horner update is a pair of modular additions, where only the final sum must be reduced to fit in \(\bmod 2^{64} - 8\). The resulting instruction-parallel double Horner update is only a few cycles slower than a single Horner update.
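The following sketch models that arithmetic with Python integers, masking to 64 bits where a C implementation would rely on machine words. It is a simplified illustration under the assumptions stated in the comments, not the production code: the helper names are made up, and a real implementation would precompute \(f^2\) once instead of on every update.

```python
M61 = (1 << 61) - 1
MASK64 = (1 << 64) - 1


def add_mod(x, y):
    """Adds two 64-bit values and returns a 64-bit representative mod 2^64 - 8."""
    s = x + y
    while s > MASK64:  # each carry out of bit 64 is worth 8, since 2^64 = 8 mod 2^64 - 8
        s = (s & MASK64) + 8
    return s


def reduce_product(x):
    """Reduces x < 2^125 to 64 bits with x = lo + 8 * hi (mod 2^64 - 8)."""
    lo = x & MASK64  # low half of the product
    hi = x >> 64     # high half, < 2^61, so 8 * hi fits in 64 bits
    return add_mod(lo, 8 * hi)


def horner_double_update(h, f, a, b):
    """Computes a*f + (b + h)*f^2 as a 64-bit value, for 0 < f < 2^61 - 1.

    The result is only reduced mod 2^64 - 8 = 8 * (2^61 - 1), which is
    enough to be congruent to the true value mod 2^61 - 1.
    """
    f2 = (f * f) % M61  # fully reduced square, < 2^61
    return add_mod(reduce_product(a * f), reduce_product(add_mod(b, h) * f2))
```

The two multiplications are independent, which is what makes the update instruction-parallel: only the final addition has to wait for both products.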
We also never fully reduce to \(\bmod 2^{61} - 1\). While the collision bound assumes that prime field, we simply work in its \(\bmod 2^{64} - 8\) extension. This does not affect the collision bound, and the resulting expression is still amenable to algebraic manipulation: modular arithmetic is a well-defined ring even for composite moduli.
A proof of almost-universality doesn't mean a hash passes the SMHasher test suite. It should definitely guarantee collisions are (probably) rare enough, but SMHasher also looks at bit avalanching and bias, and universality is oblivious to these issues. Even XOR- or \(\Delta\)-universality doesn't suffice: the hash values for a given string are well distributed when parameters are chosen uniformly at random, but this does not imply that hashes are always (or usually) well distributed for fixed parameters.
The most stringent SMHasher tests focus on short inputs: mostly up to
128 or 256 bits, unless "Extra" torture testing is enabled. In a way,
this makes sense, given that arbitrary-length string hashing is
provably harder than the bounded-length vector case. Moreover, a
specialised code path for these inputs is beneficial, since they're
relatively common and deserve strong and low-latency hashes. That's
why UMASH uses a completely different code path for inputs of at
most 8 bytes, and a specialised PH iteration for inputs of 9 to 16
bytes.
However, this means that SMHasher's best avalanche and bias tests often tell us very little about the general case. For UMASH, the medium-length (9 to 16 bytes) code path at least shares the same structure and finalisation logic as the code for longer inputs.
There may also be a bit of co-evolution between the test harness and
the design of hash functions: the sort of xorshift/multiply mixers
favoured by Appleby in the various versions of MurmurHash tends to do
well on SMHasher. These mixers are also invertible, so we can take
any hash function with good collision properties, mix its output with
someone else's series of xorshift and multiplications (in UMASH's
case, the SplitMix64 update function
or a subset thereof), and usually find that the result satisfies
SMHasher's bias and avalanche tests.
It definitely looks like interleaving rightward bitwise operations and integer multiplications is a good mixing strategy. However, I find it interesting that the hash evaluation harness created by the author of MurmurHash steers implementations towards MurmurHash-style mixing code.
The structure of UMASH lets us support more sophisticated usage patterns than merely hashing or fingerprinting an array of bytes.
The PH loop needs less than 17 bytes of state for its 16-byte
accumulator and an iteration count, and the polynomial hash also needs
17 bytes, for its own 8-byte accumulator, the 8-byte "seed," and a
counter for the final block size (up to 256 bytes). The total comes up to
34 bytes of state, plus a 16-byte input buffer, since the PH loop
consumes 16-byte chunks at a time. Coupled with the way we only
consider the input size at the end of UMASH, this makes it easy to
implement incremental hashing.
In fact, the state is small enough that our implementation stashes some
parameter data inline in the state struct, and uses the same
layout for hashing and fingerprinting with a pair of hashes (and thus
double the state): most of the work happens in PH, which only
accesses the constant parameter array, the shared input buffer and iteration
counter, and its private 16-byte accumulator.
Incremental fingerprinting is a crucial capability for our caching
system: cache keys may be large, so we want to avoid serialising them
to an array of contiguous bytes just to compute a fingerprint.
Efficient incrementality also means we can hash NUL-terminated C
strings with a fused UMASH / strlen loop, a nice speedup when the
data is in cache.
The outer polynomial hash in UMASH is so simple to analyse that we can easily process blocks out of order. In my experience, such a "parallel hashing" capability is more important than peak throughput when checksumming large amounts of data coming over the wire. We usually maximise transfer throughput by asking for several ranges of data in parallel. Having to checksum these ranges in order introduces a serial bottleneck and the usual head-of-line blocking challenges; more importantly, checksumming in order adds complexity to code that should be as obviously correct as possible. The polynomial hash lets us hash an arbitrary subsequence of 256-byte aligned blocks and use modular exponentiation to figure out its impact on the final hash value, given the subsequence's position in the checksummed data. Parallel hashing can exploit multiple cores (more cores, more bandwidth!) with simpler code.
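Here is a minimal sketch of that idea on a bare polynomial hash. It deliberately ignores everything else in UMASH (the PH compressor, the final block's length, the finaliser) and assumes the same illustrative convention as a plain Horner evaluation, \(\sum_j y_j f^{d - j} \bmod 2^{61} - 1\). All names are hypothetical.

```python
M61 = (1 << 61) - 1


def poly_hash(f, values):
    """Horner evaluation of sum(y_j * f^(d - j)) mod 2^61 - 1."""
    acc = 0
    for y in values:
        acc = ((acc + y) * f) % M61
    return acc


def range_contribution(f, values, start, total_len):
    """Contribution of `values`, found at offset `start` in a stream of `total_len` values.

    We hash the range as if it stood alone, then shift it into position by
    multiplying with f raised to the number of values that follow it.
    """
    trailing = total_len - (start + len(values))
    return (poly_hash(f, values) * pow(f, trailing, M61)) % M61


def combine(contributions):
    """Contributions are plain field elements: add them in any order."""
    return sum(contributions) % M61
```

Each range can be hashed on a different core as soon as it arrives; the only sequential step is the final sum, which is commutative.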
The UMAC RFC uses a Toeplitz
extension scheme to compute independent NH values while recycling
most of the parameters. We do the same with PH, by adapting
Krovetz's proof
to exploit PH's almost-XOR-universality instead of NH's
almost-\(\Delta\)-universality. Our fingerprinting code reuses all
but the first 32 bytes of PH parameters for the second hash: that's
the size of an AVX register, which makes it trivial to avoid loading
parameters twice in a fused PH loop.
The same RFC also points out that concatenating the output of fast hashes lets validation code decide which speed-security trade-off makes sense for each situation: some applications may be willing to only compute and compare half the hashes.
We use that freedom when reading from large hash tables keyed on the UMASH fingerprint of strings. We compute a single UMASH hash value to probe the hash tables, and only hash the second half of the fingerprint when we find a probable hit. The idea is that hashing the search key (now hot in cache) a second time will be faster than comparing it against the hash entry's string key in cold storage.
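A toy version of that two-step lookup might look like the sketch below. It is only an illustration of the control flow: `half_hash` uses salted BLAKE2b from Python's standard library as a stand-in for the two 64-bit halves (it is not UMASH), and `FingerprintTable` is a hypothetical wrapper around a dictionary rather than a real hash table implementation.

```python
import hashlib


def half_hash(data, which):
    """Stand-in for one 64-bit half of a fingerprint; `which` is 0 or 1."""
    digest = hashlib.blake2b(data, digest_size=8, salt=bytes([which])).digest()
    return int.from_bytes(digest, "little")


class FingerprintTable:
    """Maps fingerprints to values without storing or comparing the keys."""

    def __init__(self):
        self._slots = {}  # first half -> (second half, value)
        self.second_half_calls = 0  # instrumentation for the example

    def insert(self, key, value):
        self._slots[half_hash(key, 0)] = (half_hash(key, 1), value)

    def lookup(self, key):
        entry = self._slots.get(half_hash(key, 0))
        if entry is None:
            return None  # a miss only costs one hash
        # Probable hit: hash the (now hot) key a second time to confirm.
        self.second_half_calls += 1
        second_half, value = entry
        return value if second_half == half_hash(key, 1) else None
```

Misses, which tend to dominate in sparse tables, never pay for the second half of the fingerprint.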
When we add this sort of trickery to our code base, it's important to make sure the interfaces are hard to misuse. For example, it would be unfortunate if only one half of the 128-bit fingerprint were well distributed and protected against collisions: this would make it far too easy to implement the two-step lookup-by-fingerprint above correctly but inefficiently. That's why we maximise the symmetry in the fingerprint: the two 64-bit halves are computed with the same algorithm to guarantee the same worst-case collision probability and distribution quality. This choice leaves fingerprinting throughput on the table when a weaker secondary hash would suffice. However, I prefer a safer if slightly slower interface to one ripe for silent performance bugs.
While we intend for UMASH to become our default hash and fingerprint function, it can't be the right choice for every application.
First, it shouldn't be used for authentication or similar cryptographic purposes: the implementation is probably riddled with side-channels, the function has no protection against parameter extraction or adaptive attacks, and collisions are too frequent anyway.
Obviously, this rules out using UMASH in a MAC, but might also be an issue for, e.g., hash tables where attackers control the keys and can extrapolate the hash values. A timing side-channel may let attackers determine when keys collide; once a set of colliding keys is known, the linear structure of UMASH makes it trivial to create more collisions by combining keys from that set. Worse, iterating over the hash table's entries can leak the hash values, which would let an attacker slowly extract the parameters. We conservatively avoid non-cryptographic hashes and even hashed data structures for sections of the Backtrace code base where such attacks are in scope.
Second, the performance numbers reported by SMHasher (up to 22 ns when hashing 64 bytes or less, and 22 GB/s peak throughput) are probably a lie for real applications, even when running on the exact same 2.5 GHz Xeon 8175M hardware. These are best-case values, when the code and the parameters are all hot in cache… and that's a fair amount of bytes for UMASH. The instruction footprint for a 64-bit hash is 1435 bytes (comparable to heavier high-throughput hashes, like the 1600-byte xxh3_64 or 1350-byte farmhash64), and the parameters span 288 bytes (320 for a fingerprint).
There is a saving grace for UMASH and other complex hash functions: the amount of bytes executed is proportional to the input size (e.g., the code for inputs of 8 or fewer bytes only needs 141 bytes, and would inline to around 100 bytes), and the number of parameters read is bounded by the input length. Although UMASH can need a lot of instruction and parameter bytes, the worst case only happens for larger inputs, where the cache misses can hopefully be absorbed by the work of loading and hashing the data.
The numbers are also only representative of powerful CPUs with
carryless multiplication in hardware. The PH inner loop has 50%
higher throughput than NH (22 vs 14 GB/s) on contemporary Intel
servers. The carryless approach still has an edge over 128-bit
modular arithmetic on AMD's Naples,
but less so, around 20-30%. We did not test on ARM (the
Backtrace database
only runs on x86-64), but I would assume the situation there is closer
to AMD's than Intel's.
However, I also believe we're more likely to observe improved
performance for PH than NH in future micro-architectures: the core
of NH, full-width integer multiplication, has been aggressively
optimised by now, while the gap between Intel and AMD shows there
may still be low-hanging fruit for the carryless multiplications
at the heart of PH. So, NH is probably already as good as
it's going to be, but we can hope that PH will continue to benefit from
hardware optimisations, as chip designers improve the performance of
cryptographic algorithms like AES-GCM.
Third and last, UMASH isn't fully stabilised yet. We do not plan to
modify the high-level structure of UMASH, a PH block compressor
that feeds into a polynomial string hash. However, we are looking for
suggestions to improve its latency on short inputs, and to simplify
the finaliser while satisfying SMHasher's distribution tests.
We believe UMASH is ready for non-persistent usage: we're confident in its quality, but the algorithm isn't set in stone yet, so hash or fingerprint values should not reach long-term storage. We do not plan to change anything that will affect the proof of collision bound, but improvements to the rest of the code are more than welcome.
In particular:
xorshift / multiply in the finaliser, but can we shave even
more latency there?
A hash function is a perfect target for automated correctness and performance testing. I hope to use UMASH as a test bed for the automatic evaluation (and approval?!) of pull requests.
Of course, you're also welcome to just use UMASH as a single-file C library or re-implement it to fit your requirements. The MIT-licensed C code is on GitHub, and we can definitely discuss validation strategies for alternative implementations.
Finally, our fingerprinting use case shows collision rates are probably not something to minimise, but closer to soft constraints. We estimate that, once the probability reaches \(2^{-70}\), collisions are rare enough to only compare fingerprints instead of the fingerprinted values. However, going lower than \(2^{-70}\) doesn't do anything for us.
It would be useful to document other back-of-the-envelope requirements for a hash function's output size or collision rate. Now that most developers work on powerful 64-bit machines, it seems far too easy to add complexity and waste resources for improved collision bounds that may not unlock any additional application.
Any error in the analysis or the code is mine, but a few people helped improve UMASH and its presentation.
Colin Percival scanned an earlier version of the reference implementation for obvious issues, encouraged me to simplify the parameter generation process, and prodded us to think about side channels, even in data structures.
Joonas Pihlaja helped streamline my initial attempt while making the reference implementation easier to understand.
Jacob Shufro independently confirmed that he too found the reference implementation understandable, and tightened the natural language.
Phil Vachon helped me gain more confidence in the implementation
tricks borrowed from VHASH after replacing the NH compression
function with PH.
hp_read_swf
without changing the fast path. See the addendum.
Back in February 2020, Blelloch and Wei submitted this cool preprint: Concurrent Reference Counting and Resource Management in Wait-free Constant Time. Their work mostly caught my attention because they propose a wait-free implementation of hazard pointers for safe memory reclamation.^{1} Safe memory reclamation (PDF) is a key component in lock-free algorithms when garbage collection isn't an option,^{2} and hazard pointers (PDF) let us bound the amount of resources stranded by delayed cleanups much more tightly than, e.g., epoch reclamation (PDF). However, the usual implementation has a loop in its read barriers (in the garbage collection sense), which can be annoying for code generation and bad for worst-case time bounds.
Blelloch and Wei's wait-free algorithm eliminates that loop… with a construction that stacks two emulated primitives (strong LL/SC, and atomic copy, implemented with the former) on top of what real hardware offers. I see the real value of the construction in proving that wait-freedom is achievable,^{3} and that the key is atomic memory-memory copies.
In this post, I'll show how to flatten down that abstraction tower into something practical with a bit of engineering elbow grease, and come up with wait-free alternatives to the usual lock-free hazard pointers that are competitive in the best case. Blelloch and Wei's insight that hazard pointers can use any wait-free atomic memory-memory copy lets us improve the worst case without impacting the common case!
But first, what are hazard pointers?
Hazard pointers were introduced by Maged Michael in Hazard Pointers: Safe Memory Reclamation for Lock-Free Objects (2005, PDF), as the first solution to reclamation races in lock-free code. The introduction includes a concise explanation of the safe memory reclamation (SMR) problem.
When a thread removes a node, it is possible that some other contending thread – in the course of its lock-free operation – has earlier read a reference to that node, and is about to access its contents. If the removing thread were to reclaim the removed node for arbitrary reuse, the contending thread might corrupt the object or some other object that happens to occupy the space of the freed node, return the wrong result, or suffer an access error by dereferencing an invalid pointer value. […] Simply put, the memory reclamation problem is how to allow the memory of removed nodes to be freed (i.e., reused arbitrarily or returned to the OS), while guaranteeing that no thread accesses free memory, and how to do so in a lock-free manner.
In other words, a solution to the SMR problem lets us know when it's safe to physically release resources that used to be owned by a linked data structure, once all links to these resources have been removed from that data structure (after "logical deletion"). The problem makes intuitive sense for dynamically managed memory, but it applies equally well to any resource (e.g., file descriptors), and its solutions can even be seen as extremely read-optimised reader/writer locks.^{4}
The basic idea behind Hazard Pointers is to have each thread publish to permanently allocated^{5} hazard pointer records (HP records) the set of resources (pointers) it's temporarily borrowing from a lock-free data structure. That's enough information for a background thread to snapshot the current list of resources that have been logically deleted but not yet physically released (the limbo list), scan all records for all threads, and physically release all resources in the snapshot that aren't in any HP record.
With just enough batching of the limbo list, this scheme can be practical: in practice, lock-free algorithms only need to pin a few (usually one or two) nodes at a time to ensure memory safety. As long as we avoid running arbitrary code while holding hazardous references, we can bound the number of records each thread may need at any one time. Scanning the records thus takes time roughly linear in the number of active threads, and we can amortise that to constant time per deleted item by waiting until the size of the limbo list is greater than a multiple of the number of active threads.^{6}
The tricky bit is figuring out how to reliably publish to a HP record without locking. Hazard pointers simplify that challenge with three observations:
This is where the clever bit of hazard pointers comes in: we must make sure that any resource (pointer to a node, etc.) we borrow from a lock-free data structure is immediately protected by a HP record. We can't make two things happen atomically without locking. Instead, we'll guess^{8} what resource we will borrow, publish that guess, and then actually borrow the resource. If we guessed correctly, we can immediately use the borrowed resource; if we were wrong, we must try again.
On an ideal sequentially consistent machine,
the pseudocode looks like the following. The cell
argument points to the resource we wish to acquire
(e.g., it's a reference to the next field in a linked list node),
and record is the hazard pointer
record that will protect the value borrowed from cell.
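A minimal sketch of that guess, publish, and check loop follows. The `Cell` and `Record` classes and the `hp_read_sc` name are hypothetical stand-ins for shared memory words and for the read-side entry point; plain Python attribute accesses play the role of sequentially consistent loads and stores.

```python
class Cell:
    """A shared word; loads and stores are assumed sequentially consistent."""

    def __init__(self, value=None):
        self.value = value

    def load(self):
        return self.value

    def store(self, value):
        self.value = value


class Record:
    """A hazard pointer record with a single pin slot."""

    def __init__(self):
        self.pin = None


def hp_read_sc(cell, record):
    """Reads the resource (pointer) in `cell`, and protects it through `record`."""
    while True:
        guess = cell.load()       # guess what we are about to borrow
        record.pin = guess        # publish the guess
        if guess == cell.load():  # still current: the pin was published in time
            return guess
```

The loop only retries when a writer replaced the cell's value between the two loads, which makes the read side lock-free but not wait-free.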

In practice, we must make sure that our write to record.pin is visible before rereading the cell's value, and we should also make sure the pointer read is ordered with respect to the rest of the calling read-side code.^{9}
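With explicit ordering, the read side might look like the sketch below. Python cannot express memory ordering, so the method names (`load_relaxed`, `load_acquire`, `store_relaxed`) and the `fence_store_load` placeholder only document the ordering a C or C++ implementation would request; they are assumptions made for the illustration. The `R1` and `R2` comments mark the two points the prose refers to.

```python
class Cell:
    """A shared word; method names document the intended memory ordering."""

    def __init__(self, value=None):
        self._value = value

    def load_relaxed(self):
        return self._value

    def load_acquire(self):
        return self._value

    def store_relaxed(self, value):
        self._value = value


class Record:
    """A hazard pointer record; its pin slot is itself a shared word."""

    def __init__(self):
        self.pin = Cell()


def fence_store_load():
    """Placeholder for a store-load fence (e.g., MFENCE on x86); a no-op in this model."""


def hp_read_explicit(cell, record):
    while True:
        guess = cell.load_relaxed()
        record.pin.store_relaxed(guess)
        fence_store_load()                # R1: publish the pin before re-reading
        if guess == cell.load_acquire():  # R2: orders the rest of the read side
            return guess
```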

We need a store/load fence in R1 to make sure the store to the record
(just above R1) is visible by the time the second read (R2) executes.
Under the TSO memory model implemented by x86 chips (PDF),
this fence is the only one that isn't implicitly satisfied by the hardware.
It also happens that fences are best implemented with atomic operations
on x86-oids,
so we can eliminate the fence in R1 by replacing the store just before R1
with an atomic exchange (fetch-and-set).
The slow cleanup path has its own fence that matches R1 (the one in R2
matches the mutators' writes to cell).
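The matching cleanup path can be sketched as follows. `Limbo`, `Record`, and the `destroy` callback are hypothetical, and the snapshot is taken with a simple swap that stands in for an atomic exchange with acquire ordering. The `C1` and `C2` comments mark the two points discussed in the prose.

```python
class Record:
    """A hazard pointer record with a single pin slot."""

    def __init__(self, pin=None):
        self.pin = pin


class Limbo:
    """Resources that were logically deleted but not yet physically released."""

    def __init__(self):
        self._items = []

    def add(self, resource):
        self._items.append(resource)

    def consume_snapshot_acquire(self):
        """Takes every item currently in the list; models an acquire exchange."""
        items, self._items = self._items, []
        return items


def hp_cleanup_explicit(limbo, records, destroy):
    to_reclaim = limbo.consume_snapshot_acquire()  # C1
    # C2: scan every record; pins are compared by identity, like pointers.
    pinned = {id(record.pin) for record in records if record.pin is not None}
    for resource in to_reclaim:
        if id(resource) in pinned:
            limbo.add(resource)  # still borrowed: try again next time
        else:
            destroy(resource)
```

Resources that are still pinned simply go back on the limbo list, so a slow reader delays reclamation of at most the few items it has pinned.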

We must make sure all the values in the limbo list we grab in C1
were added to the list (and thus logically deleted) before we read any
of the records in C2, with the acquire read in C1
matching the store-load fence in R1.
It's important to note that the cleanup loop does not implement
anything like an atomic snapshot of all the records. The reclamation
logic is correct as long as we scan the correct value for records that
have had the same pinned value since before C1: we assume that a
resource only enters the limbo list once all its persistent
references have been cleared (in particular, this means circular
back-references must be broken before scheduling a node for
destruction), so any newly pinned value cannot refer to any resource
awaiting destruction in the limbo list.^{10}
The following sequence diagrams show how the fencing guarantees that any
iteration of hp_read_explicit will fail if it starts before C1
and observes a stale value.
If the read succeeds, the ordering between R1 and C1 instead
guarantees that the cleanup loop will observe the pinned value
when it reads the record in C2.
This all works, but it's slow:^{11} we added an atomic write instruction (or worse, a fence) to a read-only operation. We can do better with a little help from our operating system.
When we use fences or memory ordering correctly, there should be
an implicit pairing between fences or ordered operations: we use fencing to
enforce an ordering (one must fully execute before or after another,
overlap is forbidden) between pairs of instructions in different
threads. For example, the pseudocode for hazard pointers with
explicit fencing and memory ordering paired the store-load fence in
R1 with the acquisition of the limbo list in C1.
We only need that pairing very rarely, when a thread actually executes the cleanup function. The amortisation strategy guarantees we don't scan records all the time, and we can always increase the amortisation factor if we're generating tiny amounts of garbage very quickly.
It kind of sucks that we have to incur a full fence on the fast read path, when it only matches reads in the cleanup loop maybe as rarely as, e.g., once a second. If we waited long enough on the slow path, we could rely on events like preemption or other interrupts to insert a barrier in all threads that are executing the read-side.
How long is "enough?"
Linux has the membarrier syscall
to block the calling thread until (more than) long enough has elapsed,
Windows has the similar FlushProcessWriteBuffers, and
on other operating systems, we can probably do something useful with scheduler statistics or ask for a new syscall.
Armed with these new blocking system calls, we can replace the store-load fence in R1
with a compiler barrier, and execute a slow membarrier/FlushProcessWriteBuffers after C1.
The cleanup function will then wait long enough^{12} to ensure that any
read-side operation that had executed before R1
at the time we read the limbo list in C1
will be visible (e.g., because the operating system knows a preemption interrupt executed at least once on each core).
The pseudocode for this asymmetric strategy follows.
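Both halves of the asymmetric strategy can be sketched together. As before, Python cannot express fences, so `compiler_barrier` and `os_membarrier` are hypothetical no-op placeholders (the latter for a wrapper around `membarrier` or `FlushProcessWriteBuffers`), and `hp_read_membarrier`, `Cell`, `Record`, and `Limbo` are illustrative names. The `membarrier` parameter exists only so the example can be instrumented.

```python
class Cell:
    """A shared word; method names document the intended memory ordering."""

    def __init__(self, value=None):
        self._value = value

    def load_relaxed(self):
        return self._value

    def load_acquire(self):
        return self._value

    def store_relaxed(self, value):
        self._value = value


class Record:
    def __init__(self):
        self.pin = Cell()


class Limbo:
    def __init__(self):
        self._items = []

    def add(self, resource):
        self._items.append(resource)

    def consume_snapshot_acquire(self):
        items, self._items = self._items, []
        return items


def compiler_barrier():
    """Stands in for a compiler-only barrier: no fence instruction is emitted."""


def os_membarrier():
    """Placeholder for a blocking membarrier-like system call; a no-op here."""


def hp_read_membarrier(cell, record):
    while True:
        guess = cell.load_relaxed()
        record.pin.store_relaxed(guess)
        compiler_barrier()                # R1: no hardware fence on the fast path
        if guess == cell.load_acquire():  # R2
            return guess


def hp_cleanup_membarrier(limbo, records, destroy, membarrier=os_membarrier):
    to_reclaim = limbo.consume_snapshot_acquire()  # C1
    membarrier()  # slow path: pairs with the compiler barrier in R1
    pinned = {id(record.pin.load_relaxed()) for record in records}  # C2
    for resource in to_reclaim:
        if id(resource) in pinned:
            limbo.add(resource)
        else:
            destroy(resource)
```

All the cost of the pairing now sits in `hp_cleanup_membarrier`, which runs rarely thanks to the amortisation of the limbo list.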

We've replaced a fence on the fast read path with a compiler barrier, at the expense of executing a heavy syscall on the slow path. That's usually an advantageous trade-off, and is the preferred implementation strategy for Folly's hazard pointers.
The ability to pair mere compiler barriers with membarrier
syscalls opens the door for many more "atomic enough" operations, not
just the fenced stores and loads we used until now:
similarly to the key idea in Concurrency Kit's atomic-free SPMC event count,
we can use non-interlocked read-modify-write instructions,
since any interrupt (please don't mention imprecise interrupts) will happen before or after any such instruction,
and never in the middle of an instruction.
Let's use that to simplify wait-free hazard pointers.
The key insight that lets Blelloch and Wei
achieve wait-freedom in hazard pointers is that the combination
of publishing a guess and confirming that the guess is correct in hp_read
emulates an atomic memory-memory copy. Given such an atomic copy primitive, the read-side becomes trivial.
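Assuming such a primitive existed, the read side could be as short as the sketch below. The `atomic_copy` helper emulates the primitive with a global lock, which is only a model of "locking all other cores out of memory" and not something a real implementation would do; `hp_read_atomic_copy`, `Cell`, and `Record` are illustrative names.

```python
import threading


class Cell:
    """A shared word."""

    def __init__(self, value=None):
        self.value = value


class Record:
    """A hazard pointer record; its pin slot is itself a shared word."""

    def __init__(self):
        self.pin = Cell()


# Models exclusive access to memory for the duration of one copy.
_memory_lock = threading.Lock()


def atomic_copy(dst, src):
    """Copies the word in `src` to `dst` as one indivisible step (in this model)."""
    with _memory_lock:
        dst.value = src.value


def hp_read_atomic_copy(cell, record):
    """Pins and returns the resource in `cell` without any retry loop."""
    atomic_copy(record.pin, cell)
    return record.pin.value
```

There is no loop left, so the read side completes in a bounded number of steps regardless of what writers do.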

The "only" problem is that atomic copies (which would look like locking all other cores out of memory accesses, copying the cell's word-sized contents to record.pin, and releasing the lock) don't exist in contemporary hardware.
However, we've already noted that syscalls like membarrier
mean we can weaken our requirements to interrupt atomicity. In other words, individual non-atomic instructions work since we're assuming precise interrupts… and x86 and amd64
do have an instruction for memory-memory copies!
The MOVS instructions are typically only used with a REP
prefix. However, they can also be executed without any prefix, to execute one iteration of the copy loop. Executing a REP-free MOVSQ
instruction copies one quadword (8 bytes) from the memory address in the source register [RSI]
to the address in the destination register [RDI], and advances both registers by 8 bytes… and all this stuff happens in one instruction, so will never be split by an interrupt.
That's an interrupt-atomic copy, which we can slot in place
of the software atomic copy in Blelloch and Wei's proposal!

Again, the MOVS instruction is not atomic, but will be ordered with
respect to the membarrier syscall in hp_cleanup_membarrier: either
the copy fully executes before the membarrier in C1, in which case the
pinned value will be visible to the cleanup loop, or it executes after
the membarrier, which guarantees the copy will not observe a stale value
that's waiting in the limbo list.
That's just one instruction, but instructions aren't all created
equal. MOVS is on the heavy side: in order to read from memory, write to memory, and increment two registers,
a modern Intel chip has to execute 5 micro-ops in at least ~5 cycles.
That's not exactly fast; definitely better than an atomic (LOCKed)
instruction, but not fast.
We can improve that with a trick from sidechannel attacks, and
preserve waitfreedom. We can usually guess what value weâll find in
record.pin
, simply by reading cell
with a regular relaxed load.
Unless weâre extremely unlucky (realistically, as long as the reader
thread isnât interrupted), MOVSQ
will copy the same value we just
guessed. Thatâs enough to exploit branch prediction and turn a data
dependency on MOVSQ
(a high latency instruction) into a data
dependency on a regular load MOV
(low latency), and a highly
predictable control dependency. In very low level pseudo code, this
âspeculativeâ version of the MOVS
readside might look like:
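Here is one hedged way to write that speculative read sequence in C, assuming GCC or clang builtins. The struct layout and the helper copy_word are assumptions made for this sketch. A compiler is allowed to fold the final branch into a plain reload of record->pin, so production code would likely need inline assembly or an optimisation barrier to keep the dependency on the guess.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record layout: one pinned value per record. */
struct hp_record {
    uint64_t pin;
};

/* One REP-free MOVSQ on x86-64; a plain (not interrupt-atomic) copy elsewhere. */
static inline void copy_word(uint64_t *dst, const uint64_t *src)
{
#if defined(__x86_64__)
    __asm__ volatile("movsq" : "+D"(dst), "+S"(src) : : "memory");
#else
    *dst = *src;
#endif
}

static inline uint64_t hp_read_movs_spec(struct hp_record *record,
                                         const uint64_t *cell)
{
    assert(record != NULL && cell != NULL);
    /* Guess what MOVSQ is about to copy, with a cheap relaxed load. */
    uint64_t guess = __atomic_load_n(cell, __ATOMIC_RELAXED);

    copy_word(&record->pin, cell);
    uint64_t pinned = __atomic_load_n(&record->pin, __ATOMIC_RELAXED);
    /* Rarely taken: callers normally depend on the guess, not on MOVSQ. */
    if (__builtin_expect(pinned != guess, 0))
        guess = pinned;
    return guess;
}
```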

We'll see that, in reasonable circumstances, this wait-free code sequence is faster than the usual membarrier-based lock-free read side. But first, let's see how we can achieve wait-freedom when CISCy instructions like MOVSQ aren't available, with an asymmetric "helping" scheme.
Blelloch and Wei's wait-free atomic copy primitive builds on the usual trick for wait-free algorithms: when a thread would wait for an operation to complete, it helps that operation make progress instead of blocking. In this specific case, a thread initiates an atomic copy by acquiring a fresh hazard pointer record, setting that descriptor's pin field to \(\bot\), publishing the address it wants to copy from, and then performing the actual copy. When another thread enters the cleanup loop and wishes to read the record's pin field, it may either find a value, or \(\bot\); in the latter case, the cleanup thread has to help the hazard pointer descriptor forward, by attempting to update the descriptor's pin field.
This strategy has the marked advantage of working. However, it's also symmetric between the common case (the thread that initiated the operation quickly completes it), and the worst case (a cleanup thread notices the initiating thread got stuck and moves the operation along). This forces the common case to use atomic operations, similarly to the way cleanup threads would. We pessimised the common case in order to eliminate blocking in the worst case, a frequent and unfortunate pattern in wait-free algorithms.
The source of that symmetry is our specification of an atomic copy from the source field to a single destination pin field, which must be written exactly once by the thread that initiated the copy (the hazard pointer reader), or any concurrent helper (the cleanup loop).
We can relax that requirement, since we know that the hazard pointer scanning loop can handle spurious or garbage pinned values. Rather than forcing both the read sequence (fast path) and the cleanup loop (slow path) to write to the same pin field, we will give each HP record two pin fields: a single-writer one for the fast path, and a multi-writer one for all helpers (all threads in cleanup code).
The read-side sequence will have to first write to the HP record to publish the cell it's reading from, read and publish the cell's pinned value, and then check if a cleanup thread helped the record along. If the record was helped, the read-side sequence must use the value written by the helping cleanup thread. This means cleanup threads can detect when a HP record is missing its pinned value, and help it along with the cell's current value. Cleanup threads may later observe two pinned values (both the reader and a cleanup thread wrote a pinned value); in that case, both values are conservatively protected from physical destruction.
Until now a hazard pointer record has only had one field, the "pinned" value. We must add some complexity to make this asymmetric helping scheme work: in order for cleanup threads to be able to help, we must publish the cell we are reading, and we need somewhere for cleanup threads to write the pinned value they read. We also need some sort of ABA protection to make sure slow cleanup threads don't overwrite a fresher pinned value with a stale one, when the cleanup thread gets stuck (preempted).
Concretely, the HP record still has a pin field, which is only written by the reader that owns the record, and read by cleanup threads. The help subrecord is written by both the owner of the record and any cleanup thread that might want to move a reader along. The reader will first write the address of the pointer it wants to read and protect in help.cell, generate a new unique generation id by incrementing gen_sequence, and write that to help.pin_or_gen. We'll tag generation ids with their sign: negative values are generation ids, positive ones are addresses.
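The record layout and sign-based tagging described above could be sketched as follows. The field names come from the text; the exact types, the placement of gen_sequence, and the way the counter is mapped to a negative value are assumptions. Real code would also access these fields with atomic operations.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct hp_record {
    intptr_t pin;            /* written only by the owning reader */
    struct {
        void **cell;         /* address of the pointer being read */
        intptr_t pin_or_gen; /* < 0: generation id, >= 0: pinned value */
    } help;
    uintptr_t gen_sequence;  /* private counter for the owning reader (assumed) */
};

/* Negative values are generation ids; positive ones are addresses. */
static inline bool hp_is_generation(intptr_t value)
{
    return value < 0;
}

/* Map the next counter value into [INTPTR_MIN, -1], so it reads as a generation. */
static inline intptr_t hp_next_generation(struct hp_record *record)
{
    assert(record != NULL);
    uintptr_t seq = record->gen_sequence++ & (uintptr_t)INTPTR_MAX;

    return INTPTR_MIN + (intptr_t)seq;
}
```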

At this point, any cleanup thread should be able to notice that the help.pin_or_gen is a generation value, and find a valid cell address in help.cell. That's all the information a cleanup thread needs to attempt to help the record move forward. It can read the cell's value, and publish the pinned value it just read with an atomic compare-and-swap (CAS) of pin_or_gen; if the CAS fails, another cleanup thread got there first, or the reader has already moved on to a new target cell. In the latter case, any in-flight hazard pointer read sequence started before we started reclaiming the limbo list, and it doesn't matter what pinned value we extract from the record.
Having populated the help subrecord, a reader can now publish a value in pin, and then look for a pinned value in help.pin_or_gen: if a cleanup thread published a pinned value there, the reader must use it, and not the potentially stale (already destroyed) value the reader wrote to pin.
On the read side, we obtain plain wait-freedom, with standard operations. All we need are two compiler barriers to let membarriers guarantee writes to the help subrecord are visible before we start reading from the target cell, and to guarantee that any cleanup thread's write to record.help.pin_or_gen is visible before we compare record.help.pin_or_gen against gen:
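A hedged C11 sketch of such a read sequence follows. It is an illustration built from the prose, not the original listing: the struct layout and the use of atomic_signal_fence as the compiler barriers labelled RA and RB are assumptions, and correctness still depends on the cleanup side issuing the matching membarriers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

struct hp_record {
    _Atomic intptr_t pin;              /* written only by the owning reader */
    struct {
        _Atomic(_Atomic intptr_t *) cell;
        _Atomic intptr_t pin_or_gen;   /* < 0: generation id, >= 0: pinned value */
    } help;
    uintptr_t gen_sequence;            /* private to the owning reader */
};

static inline intptr_t hp_read_wf(struct hp_record *record, _Atomic intptr_t *cell)
{
    assert(record != NULL && cell != NULL);
    /* Fresh negative generation id for this read. */
    intptr_t gen = INTPTR_MIN
        + (intptr_t)(record->gen_sequence++ & (uintptr_t)INTPTR_MAX);

    /* Publish the target cell, then mark the record as in flight. */
    atomic_store_explicit(&record->help.cell, cell, memory_order_relaxed);
    atomic_store_explicit(&record->help.pin_or_gen, gen, memory_order_release);
    atomic_signal_fence(memory_order_seq_cst); /* RA: compiler barrier */

    intptr_t guess = atomic_load_explicit(cell, memory_order_relaxed);
    atomic_store_explicit(&record->pin, guess, memory_order_relaxed);
    atomic_signal_fence(memory_order_seq_cst); /* RB: compiler barrier */

    /* If a cleanup thread helped, its pinned value replaces our guess. */
    intptr_t helped = atomic_load_explicit(&record->help.pin_or_gen,
                                           memory_order_relaxed);
    return (helped != gen) ? helped : guess;
}
```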

On the cleanup side, we will consume the limbo list, issue a membarrier to catch any read-side critical section that wrote to pin_or_gen before we consumed the list, help these sections along, issue another membarrier to guarantee that either the readers' writes to record.pin are visible, or our writes to record.help.pin_or_gen are visible to readers, and finally scan the records while remembering to pin the union of record.pin and record.help.pin_or_gen if the latter holds a pinned value.

The membarrier in C1 matches the compiler barrier in RA: if a read-side section executed R2 before we consumed the limbo list, its writes to record.help must be visible. The second membarrier in C2 matches the compiler barrier in RB: if the read-side section has written to record.pin, that write must be visible; otherwise, the cleanup thread's write to help.pin_or_gen must be visible to the reader.
Finally, when scanning for pinned values, we can't determine whether the reader used its own value or the one we published, so we must conservatively add both to the pinned set.
That's a couple more instructions on the read-side than the speculative MOVSQ implementation. However, the instructions are simpler, and the portable wait-free implementation benefits even more from speculative execution: the final branch is equally predictable, and now depends only on a read of record.help.pin_or_gen, which can be satisfied by forwarding the reader's own write to that same field.
The end result is that, in my microbenchmarks, this portable wait-free implementation does slightly better than the speculative MOVSQ code.
We can make this even tighter, by further specialising the code. The cleanup path is already slow. What if we also assumed mutual exclusion; what if, for each record, only one cleanup at a time could be in flight?
Once we may assume mutual exclusion between cleanup loops (more specifically, the "help" loop, the only part that writes to records), we don't have to worry about ABA protection anymore. Hazard pointer records can lose some weight.

We'll also use tagging with negative or positive values, this time to distinguish target cell addresses (positive) from pinned values (negative). Now that the read side doesn't have to update a generation counter to obtain unique sequence values, it's even simpler.

The cleanup function isn't particularly different, except for the new encoding scheme.

Again, RA matches C1, and RB matches C2. This new implementation is simpler than hp_read_wf on the read side, and needs even fewer instructions.
A key attribute for hazard pointers is how much they slow down pointer traversal in the common case. However, there are other qualitative factors that should impact our choice of implementation.
The classic fenced (hp_read_explicit) implementation needs one atomic or fence instruction per read, but does not require any exotic OS operation.
A simple membarrier implementation (hp_read_membarrier) is ABI-compatible with the fenced implementations, but lets the read side replace the fence with a compiler barrier, as long as the slow cleanup path can issue membarrier syscalls on Linux, or the similar FlushProcessWriteBuffers on Windows. All the remaining implementations (we won't mention the much more complex wait-free implementation of Blelloch and Wei) also rely on the same syscalls to avoid fences or atomic instructions on the read side, while additionally providing wait-freedom (constant execution time) for readers, rather than mere lock-freedom.
The simple MOVSQ-based implementation (hp_read_movs) is fully compatible with hp_read_membarrier, wait-free, and usually compiles down to fewer instruction bytes, but is slightly slower. Adding speculation (hp_read_movs_spec) retains compatibility and closes the performance gap, with a number of instruction bytes comparable to the lock-free membarrier implementation. In both cases, we rely on MOVSQ, an instruction that only exists on x86 and amd64.
However, we can also provide portable wait-freedom, once we modify the cleanup code to help the read-side sections forward. The basic implementation hp_read_wf compiles to many more instructions than the other read-side implementations, but those instructions are mostly upstream of the protected pointer read; in microbenchmarks, the result can even be faster than the simple hp_read_membarrier or the speculative hp_read_movs_spec. The downside is that instruction bytes tend to hurt much more in real code than in microbenchmarks. We also rely on pointer tagging, which could make the code less widely applicable.
We can simplify and shrink the portable wait-free code by assuming mutual exclusion on the cleanup path (hp_read_swf). Performance is improved or roughly the same, and the instruction bytes are comparable to hp_read_membarrier. However, we've introduced more opportunities for reclamation hiccups.
More importantly, achieving wait-freedom with concurrent help suffers from a fundamental issue: helpers don't know that the pointer read they're helping move forward is stale until they (fail to) CAS into place the value they just read. This means they must be able to safely read potentially stale pointers without crashing. One might think mutual exclusion in the cleanup function fixes that, but programs often mix and match different reclamation schemes, as well as lock-free and lockful code. On Linux, we could abuse the process_vm_readv syscall;^{13} in general I suppose we could install signal handlers to catch SIGSEGV and SIGBUS.
The stale read problem is even worse for the single-cleanup hp_read_swf read sequence: there's no ABA protection, so a cleanup helper can pin an old value in record.help.cell_or_pin. This could happen if a read-side sequence is initiated before hp_cleanup_swf's membarrier in C1, and the associated incomplete record is noticed by the helper, at which point the helper is preempted. The read-side sequence completes, and later uses the same record to read from the same address... and that's when the helper resumes execution, with a compare_exchange that succeeds.
The pinned value "helped in" by hp_cleanup_swf is still valid (the call to hp_cleanup_swf hasn't physically destroyed anything yet), so the hazard pointer implementation is technically correct. However, this scenario shows that hp_read_swf can violate memory ordering and causality, and even let long-overwritten values time travel into the future. The simpler read-side code sequence comes at a cost: its load is extremely relaxed, much more so than any intuitive mental model might allow.^{14}
EDIT 2020-07-09: However, see this addendum for a way to fix that race without affecting the fast (read) path.
Having to help readers forward also loses a nice practical property of hazard pointers: it's always safe to spuriously consider arbitrary (readable) memory as a hazard pointer record, it only costs us additional conservatism in reclamation. That's not the case anymore, once the cleanup thread has to help readers, and thus must write to HP records. This downside does not impact plain implementations of hazard pointers, but does make it harder to improve record management overhead by taking inspiration from managed language runtimes.
The overhead of hazard pointers only matters in code that traverses a lot of pointers, especially pointer chains. That's why I'll focus on microbenchmarking a loop that traverses a pseudorandomly shuffled circular linked list (embedded in an array of 1024 nodes, at 16 bytes per node) for a fixed number of pointer chasing hops. You can find the code to replicate the results in this gist.
The unprotected (baseline) inner loop follows. Note the NULL end-of-list check, for realism; the list is circular, so the loop never breaks early.
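A minimal sketch of such a baseline loop, assuming a 16-byte node with a 64-bit value and a next pointer. The names struct node and traverse_nodes, and the xor accumulator, are assumptions for illustration; frob_ptr is shown here in its no-op form.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 16 bytes per node on a 64-bit target. */
struct node {
    uint64_t value;
    struct node *next;
};

/* No-op variant: returns its first argument unchanged. */
static inline struct node *frob_ptr(struct node *ptr, uint64_t value)
{
    (void)value;
    return ptr;
}

/* Follow up to num_hops links, folding node values into an accumulator. */
static uint64_t traverse_nodes(struct node *head, size_t num_hops)
{
    uint64_t acc = 0;

    for (size_t i = 0; i < num_hops; i++) {
        if (head == NULL)
            break; /* never taken when the list is circular */

        uint64_t value = head->value;

        acc ^= value;
        head = head->next; /* the load a hazard pointer would protect */
        head = frob_ptr(head, value);
    }

    return acc;
}
```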

There's clearly a dependency chain between each read of head->next. The call to frob_ptr lets us introduce work in the dependency chain, which more closely represents realistic use cases. For example, when using hazard pointers to protect a binary search tree traversal, we must perform a small amount of work to determine whether we want to go down the left or the right subtree.
A hazard-pointered implementation of this loop would probably unroll the loop body twice, to more easily implement hand-over-hand locking. That's why I also include an unrolled version of this inner loop in the microbenchmark: it keeps us from concluding that hazard pointer protection improves performance when the gain really comes from unrolling, and it gives us an idea of how much variation we can expect from small code generation changes.

The hazard pointer inner loops are just like the above, except that head = head->next is replaced with calls to an inline function.

The dependency chain is pretty obvious; we can measure the sum of latencies for 1000 pointer dereferences by running the loop 1000 times. I'm using a large iteration count to absorb noise from the timing harness (roughly 50 cycles per call), as well as any boundary effect around the first and last few loop iterations.
All the cycle measurements here are from my unloaded E5-4617, running at 2.9 GHz without Turbo Boost. First, let's see what happens with a pure traversal, where frob_ptr is an inline no-op function that simply returns its first argument. This microbenchmark is far from realistic (if I found an inner loop that only traversed a singly linked list, I'd consider a different data structure), but helps establish an upper bound on the overhead from different hazard pointer read sides. I would usually show a faceted graph of the latency distribution for the various methods... but the results are so stable^{15} that I doubt there's any additional information to be found in the graphs, compared to the tables below.
The following table shows the cycle counts for following 1000 pointers in a circular linked list, with various hazard pointer schemes and no work to find the next node, on an unloaded E5-4617 @ 2.9 GHz, without Turbo Boost.

| Method (1000 iterations) | p00.1 latency (cycles) | median latency (cycles) | p99.9 latency (cycles) |
|---|---|---|---|
| noop | 52 | 56 | 56 |
| baseline | 4056 | 4060 | 4080 |
| unrolled | 4056 | 4060 | 4080 |
| hp_read_explicit | 20136 | 20160 | 24740 |
| hp_read_membarrier | 5080 | 5092 | 5164 |
| hp_read_movs | 10060 | 10076 | 10348 |
| hp_read_movs_spec | 8568 | 8568 | 8572 |
| hp_read_wf | 6572 | 7620 | 8140 |
| hp_read_swf | 4268 | 4304 | 4368 |

The table above reports quantiles for the total runtime of 1000 pointer dereferences, after one million repetitions.
We're looking at a baseline of 4 cycles/pointer dereference (the L1 cache latency), regardless of unrolling. The only implementation with an actual fence or atomic operation, hp_read_explicit, fares pretty badly, at roughly 5x the latency. Replacing that fence with a compiler barrier in hp_read_membarrier reduces the overhead to ~1 cycle per pointer dereference. Our first wait-free implementation, hp_read_movs (based on a raw MOVSQ), doesn't do too great, with a ~6 cycle (150%) overhead for each pointer dereference. However, speculation (hp_read_movs_spec) does help shave that to ~4.5 cycles (110%). The portable wait-free implementation hp_read_wf does slightly better, and its single-cleanup version hp_read_swf takes the crown, by adding around 0.2 cycle/dereference.
These results are stable and repeatable, but still fragile, in a way: except for hp_read_explicit, which is massively slowed down by its atomic operation, and for hp_read_movs, which adds a known latency bump on the hot path, the other slowdowns mostly reflect contention for execution resources. In real life, such contention usually only occurs in heavily tuned code, and the actual execution units (ports) in high demand will vary from one inner loop to another.
Let's see what happens when we insert a ~4-cycle latency slowdown (three for the multiplication, and one more for the increment) in the hot path, by redefining frob_ptr. The result of the integer multiplication by 0 is always 0, but adds a (non-speculated) data dependency on the node value and on the multiplication to the pointer chasing dependency chain. Only four cycles of work to decide which pointer to traverse is on the low end of my practical experience, but suffices to equalise away most of the difference between the hazard pointer implementations.
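One way to write such a frob_ptr is sketched below, assuming GCC or clang. The empty asm statement is an assumption of this sketch: it only hides the zero from the optimiser so that a real multiplication and addition stay on the dependency chain.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns ptr unchanged, but only after a multiply and an add that depend on value. */
static inline void *frob_ptr(void *ptr, uint64_t value)
{
    uint64_t zero = 0;

#if defined(__GNUC__) || defined(__clang__)
    /* Make the compiler forget that zero is a constant, so it must emit a MUL. */
    __asm__("" : "+r"(zero));
#endif
    /* value * 0 is always 0; adding it leaves the address intact. */
    return (void *)((uintptr_t)ptr + (uintptr_t)(value * zero));
}
```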

Let's again look at the quantiles for the cycle count for one million loops of 1000 pointer dereferences, on an unloaded E5-4617 @ 2.9 GHz.

| Method (1000 iterations) | p00.1 latency (cycles) | median latency (cycles) | p99.9 latency (cycles) |
|---|---|---|---|
| noop | 52 | 56 | 56 |
| baseline | 10260 | 10320 | 10572 |
| unrolled | 9056 | 9060 | 9180 |
| hp_read_explicit | 22124 | 22156 | 26768 |
| hp_read_membarrier | 10052 | 10084 | 10264 |
| hp_read_movs | 12084 | 12112 | 15896 |
| hp_read_movs_spec | 9888 | 9940 | 10152 |
| hp_read_wf | 9380 | 9420 | 9672 |
| hp_read_swf | 10112 | 10136 | 10360 |

The difference between unrolled in this table and in the previous one shows we actually added around 5 cycles of latency per iteration with the multiplication in frob_ptr. This dominates the overhead we estimated earlier for all the hazard pointer schemes except for the remarkably slow hp_read_explicit and hp_read_movs. It's thus not surprising that all hazard pointer implementations but the latter two are on par with the unprotected traversal loops (within 1.1 cycle per pointer dereference, less than the impact of unrolling the loop without unlocking any further rewrite).
The relative speed of the methods has changed, compared to the previous table. The speculative wait-free implementation hp_read_movs_spec was slower than hp_read_membarrier and much slower than hp_read_swf; it's now slightly faster than both. The simple portable wait-free implementation hp_read_wf was slower than hp_read_membarrier and hp_read_swf; it's now the fastest implementation.
I wouldn't read too much into the relative rankings of hp_read_membarrier, hp_read_movs_spec, hp_read_wf, and hp_read_swf. They only differ by fractions of a cycle per dereference (median latencies all fall between 9.4 and 10.2 cycles/deref), and the exact values are a function of the specific mix of micro-ops in the inner loop, and of the near-unpredictable impact of instruction ordering on the chip's scheduling logic. What really matters is that their impact on traversal latency is negligible once the pointer chasing loop does some work to find the next node.
I hope I've made a convincing case that hazard pointers can be wait-free and efficient on the read-side, as long as we have access to something like membarrier or FlushProcessWriteBuffers on the slow cleanup (reclamation) path. If one were to look at the microbenchmarks alone, one would probably pick hp_read_swf.
However, the real world is more complex than microbenchmarks. When I have to extrapolate from microbenchmarks, I usually worry about the hidden impact of instruction bytes or cold branches, since microbenchmarks tend to fail at surfacing these things. I'm not as worried for hp_read_movs_spec and hp_read_swf: they both compile down to roughly as many instructions as the incumbent, hp_read_membarrier, and their forward untaken branch would be handled fine by a static predictor.
What I would take into account is the ability to transparently use hp_read_movs_spec in code that already uses hp_read_membarrier, and the added requirements of hp_read_swf. In addition to relying on membarrier for correctness, hp_read_swf needs a pointer tagging scheme to distinguish target pointers from pinned ones, a way for cleanup threads to read stale pointers without crashing, and also imposes mutual exclusion around the scanning of (sets of) hazard pointer records. These additional requirements don't seem impractical, but I can imagine code bases where they would constitute hard blockers (e.g., library code, or when protecting arbitrary integers). Finally, hp_read_swf can let protected values time travel into the future, with read sequences returning values so long after they were overwritten that the result violates pretty much any memory ordering model... unless you implement the addendum below.
TL;DR: Use hp_read_swf if you're willing to sacrifice wait-freedom on the reclamation path and remember to implement the cleanup function with time travel protection. When targeting x86 and amd64, hp_read_movs_spec is a well rounded option, and still wait-free. Otherwise, hp_read_wf uses standard operations, but compiles down to more code.
P.S., Travis Downs notes that mem-mem PUSH might be an alternative to MOVSQ, but that requires either pointing RSP to arbitrary memory, or allocating hazard pointers on the stack (which isn't necessarily a bad idea). Another idea worthy of investigation!
Thank you, Travis, for deciphering and validating a much rougher draft when the preprint dropped, and Paul and Jacob, for helping me clarify this last iteration.
hp_read_swf
There is one huge practical issue with hp_read_swf, our simple and wait-free hazard pointer read sequence that sacrifices lock-free reclamation to avoid x86-specific instructions: when the cleanup loop must help a record forward, it can fill in old values in ways that violate causality.
I noted that the reason for this hole is the lack of ABA protection in HP records... and hp_read_wf is what we'd get if we were to add full ABA protection.
However, given mutual exclusion around the "help" loop in the cleanup function, we don't need full ABA protection. What we really need to know is whether a given in-flight record we're about to CAS forward is the same in-flight record for which we read the cell's value, or was overwritten by a read-side section. We can encode that by stealing one more bit from the target cell address in cell_or_pin.
We already steal the sign bit to distinguish the address of the cell to read (positive), from pinned values (negative). The split makes sense because 64 bit architectures tend to reserve high (negative) addresses for kernel space. I doubt we'll see full 64 bit address spaces for a while, so it seems safe to steal the next bit (bit 62) to tag cell addresses. The next table summarises the tagging scheme.
0b00xxxx: untagged cell address
0b01xxxx: tagged cell address
0b1yyyyy: helped pinned value
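The three cases in that table can be told apart with a couple of mask tests. The sketch below assumes 64-bit words; the macro and function names are hypothetical, and it deliberately leaves out how pinned values are encoded, since the text only specifies that they carry the sign bit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define HP_PIN_BIT (UINT64_C(1) << 63) /* sign bit: helped pinned value */
#define HP_TAG_BIT (UINT64_C(1) << 62) /* bit 62: cell address claimed by the helper */

/* 0b1yyyyy: helped pinned value. */
static inline bool hp_is_pinned(uint64_t word)
{
    return (word & HP_PIN_BIT) != 0;
}

/* 0b01xxxx: tagged cell address. */
static inline bool hp_is_tagged_cell(uint64_t word)
{
    return (word & (HP_PIN_BIT | HP_TAG_BIT)) == HP_TAG_BIT;
}

/* 0b00xxxx -> 0b01xxxx; only valid for untagged cell addresses. */
static inline uint64_t hp_tag_cell(uint64_t cell_address)
{
    assert((cell_address & (HP_PIN_BIT | HP_TAG_BIT)) == 0);
    return cell_address | HP_TAG_BIT;
}

/* Recover the cell address from a tagged (or untagged) cell word. */
static inline uint64_t hp_untag_cell(uint64_t word)
{
    return word & ~HP_TAG_BIT;
}
```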
At a high level, we'll change hp_cleanup_swf to tag cell_or_pin before reading the value pointed to by cell, and only CAS in the new pinned value if the cell is still tagged. Thanks to mutual exclusion, we know cell_or_pin can't be re-tagged by another thread. Only the for record in records block has to change.

We doubled the number of atomic operations in the helping loop, but that's acceptable on the slow path. We also rely on strong compare_exchange (compare-and-swap) acting as a store-load fence. If that doesn't come for free, we could also tag records in one pass, issue a store-load fence, and help tagged records in a second pass.
Another practical issue with the swf/wf cleanup approaches is that they require two membarriers, and OS-provided implementations can be slow, especially under load. This is particularly important for the swf approach, since mutual exclusion means that one slow membarrier delays the physical destruction of everything on the limbo list.
I don't think we can get rid of mutual exclusion, and, while we can improve membarrier latency, reducing the number of membarriers on the reclamation path is always good.
We can software pipeline calls to the cleanup function, and use the same membarrier for C1 and C2 in two consecutive cleanup calls. Overlapping cleanups decreases the worst-case reclaim latency from 4 membarriers to 3, and that's not negligible when each membarrier can block for 30ms.
I also tend to read anything by Guy Blelloch.
In fact, I've often argued that SMR is garbage collection, just not fully tracing GC. Hazard pointers in particular look a lot like deferred reference counting, a form of tracing GC.
Something that wasn't necessarily obvious until then. See, for example, this article presented at PPoPP 2020, which conjectures that "making the original Hazard Pointers scheme or epoch-based reclamation completely wait-free seems infeasible;" Blelloch was in attendance, so this must have been a fun session.
The SMR problem is essentially the same problem as determining when it's safe to return from synchronize_rcu or execute a call_rcu callback. Hence, the same reader-writer lock analogy holds.
Hazard pointer records must still be managed separately, e.g., with a type stable allocator, but we can bootstrap everything else once we have a few records per thread.
We can even do that without keeping track of the number of nodes that were previously pinned by hazard pointer records and kept in the limbo list: each record can only pin at most one node, so we can wait until the limbo list is, e.g., twice the size of the record set.
Let's also hope efforts like CHERI don't have to break lock-free code.
The scheme is correct with any actual guess; we could even use a random number generator. However, performance is ideal (the loop exits) when we "guess" by reading the current value of the pointer we want to borrow safely.
I tend to implement lock-free algorithms with a heavy dose of inline assembly, or Concurrency Kit's wrappers: it's far too easy to run into subtle undefined behaviour. For example, comparing a pointer after it has been freed is UB in C and C++, even if we don't access the pointee. Even if we compare as uintptr_t, it's apparently debatable whether the code is well defined when the comparison happens to succeed because the pointee was freed, then recycled in an allocation and published to cell again.
In Reclaiming Memory for Lock-Free Data Structures: There has to be a Better Way, Trevor Brown argues this requirement is a serious flaw in all hazard pointer schemes. I think it mostly means applications should be careful to coarsen the definition of resource in order to ensure the resulting condensed heap graph is a DAG. In extreme cases, we end up proxy collecting an epoch, but we can usually do much better.
That's what Travis Downs classifies as a Level 1 concurrency cost, which is usually fine for writes, but adds a sizable overhead to simple read-only code.
I suppose this means the reclamation path isn't wait-free, or even lock-free, anymore, in the strict sense of the words. In practice, we're simply waiting for periodic events that would occur regardless of the syscalls we issue. People who really know what they're doing might have fully isolated cores. If they do, they most likely have a watchdog on their isolated and latency-sensitive tasks, so we can still rely on running some code periodically, potentially after some instrumentation: if an isolated task fails to check in for a short while, the whole box will probably be presumed wedged and taken offline.
With the caveat that the public documentation for process_vm_readv does not mention any atomic load guarantee. In practice, I saw a long-by-long copy loop the last time I looked at the code, and I'm pretty sure the kernel's build flags prevent GCC/clang from converting it to memcpy. We could rely on the strong "don't break userspace" culture, but it's probably a better idea to try and get that guarantee in writing.
This problem feels like something we could address with a coarse epoch-based versioning scheme. It's however not clear to me that the result would be much simpler than hp_read_wf, and we'd have to steal even more bits (2 bits, I expect) from cell_or_pin to make room for the epoch. EDIT 2020-07-09: it turns out we only need to steal one bit.
Although I did compare several independent executions and confirmed the reported cycle counts were stable, I did not try to randomise code generation... mostly because I'm not looking for super fine differences as much as close enough runtimes. Hopefully, aligning functions to 256 bytes leveled some bias away.
In the fourth installment of his series on sorting with AVX2, @damageboy has a short aside where he tries to detect partitioning (pivot) patterns where elements less than and greater than or equal to the pivot are already in the correct order: in that case, the partitioning routine does not need to permute the block of values. The practical details are irrelevant for this post; what matters is that we wish to quickly identify whether a byte value matches any of the following nine cases:
0b11111111
0b11111110
0b11111100
0b11111000
0b11110000
0b11100000
0b11000000
0b10000000
0b00000000
Looking at the bit patterns,^{1} the OP's solution with popcount and bitscan is pretty natural. These instructions are somewhat complex (latency closer to 3 cycles than 1, and often port restricted), and it seems like the sort of problem that would have had efficient solutions before SSE4 finally graced x86 with a population count instruction.
In the context of a sorting library's partition loop, popcnt and bsf are probably more than good enough: the post shows that the real issue is branch mispredictions being slower than permuting unconditionally. This is just a fun challenge to think about (:
is_power_of_two
Detecting whether a machine integer is a power of two (or zero) is another task that has a straightforward solution in terms of popcount or bitscan. There's also a simpler classic solution to this problem:
x == 0 || is_power_of_two(x) <==> (x & (x - 1)) == 0
How does that expression work? Say x is a power of two. Its binary representation is 0b0...010...0: any number of leading zeros,^{2} a single "1" bit, and trailing zeros (maybe none). Let's see what happens when we subtract 1 from x:
x           = 0b00...0010...0
x - 1       = 0b00...0001...1
x & (x - 1) = 0b00...0000...0
The subtraction triggered a chain of borrows throughout the trailing zeros, until we finally hit that 1 bit. In decimal, subtracting one from 10...0 yields 09...9; in binary we instead find 01...1.
If you ever studied the circuit depth (latency) of carry chains
(for me, that was for circuit complexity theory), you know
that this is difficult to do well.
Luckily for us, chip makers work hard to pull it off,
and we can just use carries as a datacontrolled
primitive to efficiently flip ranges of bits.
When x is a power of two, x and x - 1 have no "1" bit in common, so taking the bitwise and yields zero. That's also true when x is 0, since anding anything with 0 yields zero. Let's see what happens for non-zero, non-power-of-two values x = 0bxx...xx10..0, i.e., where x consists of an arbitrary non-zero sequence of bits xx..xx followed by the least set bit (there's at least one, since x is neither zero nor a power of two), and the trailing zeros:
          x = 0bxx...xx10...0
      x - 1 = 0bxx...xx01...1
x & (x - 1) = 0bxx...xx000000
The leading not-all-zero 0bxx...xx is unaffected by the subtraction, so it passes through the bitwise and unscathed (anding any bit with itself yields that same bit), and we know there's at least one non-zero bit in there; our test correctly rejects it!
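Here's a minimal Python sketch of that test, checked against a popcount-style reference. The function names are mine, and the input is assumed to be a non-negative integer (Python integers are unbounded, so there is no word size to worry about here).

```python
def is_zero_or_power_of_two(x: int) -> bool:
    """True when the non-negative integer x has at most one bit set."""
    if x < 0:
        raise ValueError("expected a non-negative integer")
    # Subtracting 1 flips the lowest set bit and every zero below it;
    # anything left after the `and` is a higher set bit.
    return (x & (x - 1)) == 0


def popcount_reference(x: int) -> bool:
    """Same predicate, spelled with a population count."""
    return bin(x).count("1") <= 1


if __name__ == "__main__":
    for x in (0, 1, 2, 3, 64, 96, 1 << 40):
        print(x, is_zero_or_power_of_two(x))
```

The borrow chain does the work that a bitscan or popcount would otherwise do, with only a subtraction and a bitwise and.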
When decoding variable length integers in ULEB format, e.g., for protocol buffers, it quickly becomes clear that, in order to avoid byte-at-a-time logic, we must rapidly segment (lex or tokenize, in a way) our byte stream to determine where each ULEB ends. Let's focus on the fast path, when the encoded ULEB fits in a machine register.
We have uleb = 0bnnnnnnnnmmmmmmmm...0zzzzzzz1yyyyyyy1...: a sequence of bytes^{3} with the topmost bit equal to 1, terminated by a byte with the top bit set to 0, and, finally, arbitrary nuisance bytes (m...m, n...n, etc.) we wish to ignore.
Ideally, we'd extract data = 0b0000000000000000...?zzzzzzz?yyyyyyy?... from uleb: we want to clear the nuisance bytes, and are fine with arbitrary values in the ULEB's control bits.
It's much easier to find bits set to 1 than to zero, so the first thing to do is to complement the ULEB data and clear out everything but potential ULEB control bits (the high bit of each byte), with c = ~uleb & (128 * (WORD_MAX / 255)), i.e., compute the bitwise and of ~uleb with a bitmask of the high bit in each byte.
 uleb = 0bnnnnnnnnmmmmmmmm...0zzzzzzz1yyyyyyy1...
~uleb = 0bn̅n̅n̅n̅n̅n̅n̅n̅m̅m̅m̅m̅m̅m̅m̅m̅...1z̅z̅z̅z̅z̅z̅z̅0y̅y̅y̅y̅y̅y̅y̅0...
    c = 0bn̅0000000m̅0000000...10000000000000000...
We could now bitscan to find the index of the first 1 (marking the last ULEB byte), and then generate a mask. However, it seems wasteful to generate an index with a scan, only to convert it back into bitmap space with a shift. We'll probably still want that index to know how far to advance the decoder's cursor, but we can hopefully update the cursor in parallel with decoding the current ULEB value.
When we were trying to detect powers of two, we subtracted 1 from x, a value kind of like c, in order to generate a new value that differed from x in all the bits up to and including the first set (equal to 1) bit of x, and identical in the remaining bits. We then used the fact that anding a bit with itself yields that same bit to detect whether there was any non-zero bit in the remainder.
Here, we wish to do something else with the remaining untouched bits: we wish to set them all to zero. Another bitwise operator does what we want: xoring a bit with itself always yields zero, while xoring bits that differ yields 1. That's the plan for ULEB. We'll subtract 1 from c and xor that back with c.
       uleb = 0bnnnnnnnnmmmmmmmm...0zzzzzzz1yyyyyyy1...
      ~uleb = 0bn̅n̅n̅n̅n̅n̅n̅n̅m̅m̅m̅m̅m̅m̅m̅m̅...1z̅z̅z̅z̅z̅z̅z̅0y̅y̅y̅y̅y̅y̅y̅0...
          c = 0bn̅0000000m̅0000000...10000000000000000...
      c - 1 = 0bn̅0000000m̅0000000...01111111111111111...
c ^ (c - 1) = 0b0000000000000000...11111111111111111...
We now just have to bitwise and uleb with c ^ (c - 1) to obtain the bits of the first ULEB value in uleb, while overwriting everything else with 0. Once we have that, we can either extract data bits with PEXT on recent Intel chips, or otherwise dust off interesting stunts for SWAR shifts by variable amounts.
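The masking steps can be sketched in Python as follows. This is only an illustration under a few assumptions of mine: a 64-bit word loaded in little-endian order, a ULEB value that terminates within those 8 bytes, and helper names (first_uleb_mask, decode_first_uleb) that don't come from any library. The final loop is a plain byte-at-a-time stand-in for PEXT or a SWAR shift, just to show that the masked word holds the right payload.

```python
HIGH_BITS = 0x8080808080808080  # 128 * (WORD_MAX / 255) for a 64-bit word
WORD_MASK = (1 << 64) - 1


def first_uleb_mask(word: int) -> int:
    """Mask covering every byte of the first ULEB value in a 64-bit word."""
    c = ~word & HIGH_BITS  # 1 wherever a byte's control (top) bit is 0
    # c - 1 flips the lowest set bit of c and all the zeros below it, so the
    # xor keeps exactly the bits up to and including the terminating byte's
    # top bit. Python integers are unbounded, hence the explicit 64-bit mask.
    return (c ^ (c - 1)) & WORD_MASK


def decode_first_uleb(buf: bytes):
    """Return (value, encoded length in bytes) for the ULEB at the start of buf."""
    word = int.from_bytes(buf[:8].ljust(8, b"\x00"), "little")
    mask = first_uleb_mask(word)
    data = word & mask  # nuisance bytes are now zero
    length = mask.bit_length() // 8
    value = 0
    for i in range(length):  # stand-in for PEXT: drop each control bit
        value |= ((data >> (8 * i)) & 0x7F) << (7 * i)
    return value, length


if __name__ == "__main__":
    # 624485 encodes as e5 8e 26; the trailing bytes are nuisance data.
    print(decode_first_uleb(bytes([0xE5, 0x8E, 0x26, 0xFF, 0x7F])))
```

The length falls out of the same mask (its bit length divided by 8), so the cursor update doesn't have to wait for the payload extraction.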
Let's first repeat the question that motivated this post. We want to detect when a byte p is one of the following nine values:
0b11111111
0b11111110
0b11111100
0b11111000
0b11110000
0b11100000
0b11000000
0b10000000
0b00000000
These bit patterns feel similar to those for power of two bytes: if we complement the bits, these values are all 1 less than a power of two (or -1, one less than zero). We already know how to detect when a value x is zero or a power of two (x & (x - 1) == 0), so it's easy to instead determine whether ~p is one less than zero or a power of two: (~p + 1) & ~p == 0.
This is already pretty good: bitwise not the byte p, and check if it's one less than zero or a power of two (three simple instructions on the critical path). We can do better.
There's another name for ~p + 1, i.e., for bitwise complementing a value and adding one: that's simply -p, the additive inverse of p in two's complement! We can use -p & ~p == 0. That's one fewer instruction on the critical path of our dependency graph (down to two, since we can test whether anding yields zero), and still only uses simple instructions that are unlikely to be port constrained.
Let's check our logic by enumerating all byte-sized values.
CL-USER> (dotimes (p 256)
           (when (zerop (logand (- p) (lognot p) 255))
             (format t "0b~2,8,'0r~%" p)))
0b00000000
0b10000000
0b11000000
0b11100000
0b11110000
0b11111000
0b11111100
0b11111110
0b11111111
These are the bytes we're looking for (in ascending rather than descending order)!
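For readers who'd rather not fire up a Lisp REPL, here's the same exhaustive check as a small Python sketch. The function name is mine; the only subtlety is that Python integers are unbounded, so the result is masked down to a byte with & 0xFF, just like the 255 in the logand above.

```python
def is_high_ones_run(p: int) -> bool:
    """True when byte p is a (possibly empty) run of 1s starting at the top bit."""
    return (-p & ~p & 0xFF) == 0


# The nine target patterns: 0xFF shifted left by 0..8 bits, truncated to a byte.
EXPECTED = sorted((0xFF << k) & 0xFF for k in range(9))

MATCHES = [p for p in range(256) if is_high_ones_run(p)]

if __name__ == "__main__":
    for p in MATCHES:
        print(f"0b{p:08b}")
```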
I hope the examples above communicated a pattern I often observe when mangling bits: operations that are annoying (not hard, just a bit more complex than we'd like) in the bitmap domain can be simpler in two's complement arithmetic. Arithmetic operations are powerful mutators for bitmaps, but they're often hard to control. Subtracting or adding 1 are the main exceptions: it's easy to describe their impact in terms of the low bits of the bitmap. In fact, we can extend that trick to subtracting or adding powers of two: it's the same carry/borrow chain effect as for 1, except that bits smaller than the power of two pass straight through… which might be useful when we expect a known tag followed by a ULEB value that must be decoded.
If you find yourself wishing for a way to flip ranges of bits in a data-dependent fashion, it's always worth considering the two's complement representation of the problem for a couple minutes. Adding or subtracting powers of two doesn't always work, but the payoff is pretty good when it does.
P.S., Wojciech Muła offers a different 3-operation sequence with -p to solve damageboy's problem. That's another nice primitive to generate bitmasks dynamically.
Thank you Ruchir for helping me clarify the notation around the ULEB section.
Here's a graph of the empirical distribution functions for the number of calls into OpenSSL 1.0.1f it takes for Hypothesis to find Heartbleed, given a description of the format for Heartbeat requests ("grammar"), and the same with additional branch coverage feedback ("bts", for Intel Branch Trace Store).
On average, knowing the grammar for Heartbeat requests lets Hypothesis find the vulnerability after 535 calls to OpenSSL; when we add branch coverage feedback, the average goes down to 473^{1} calls, 11.5% faster. The plot shows that, as long as we have time for more than 100 calls to OpenSSL, branch coverage information makes it more likely that Hypothesis will find Heartbleed for every attempt budget. I'll describe this experiment in more detail later in the post; first, why did I ask that question?
I care about efficiently generating test cases because I believe that most of the time people spend writing, reviewing, and maintaining classic small or medium tests is misallocated. There's a place for them, same as there is for manual QA testing… but their returns curve is strongly concave.
Here are the sort of things I want to know when I look at typical unit test code:
As programmers, I think it's natural to say that, if we need to formalize how we come up with assertions, or how we decide what inputs to use during testing, we should express that in code. But once we have that code, why would we enumerate test cases manually? That's a job for a computer!
I have such conviction in this thesis that, as soon as I took over the query processing component (a library written in C with a dash of intrinsics and inline assembly) at Backtrace, I introduced a new testing approach based on Hypothesis. The query engine was always designed and built as a discrete library, so it didn't take too much work to disentangle it from other pieces of business logic. That let us quickly add coverage for the external interface (partial coverage, this is still work in progress), as well as for key internal components that we chose to expose for testing.
Along the way, I had to answer a few reasonable questions:
Hypothesis is a Python library that helps programmers generate test cases programmatically, and handles a lot of the annoying work involved in making such generators practical to use. Here's an excerpt from real test code for our vectorised operations on arrays of 64-bit integers.
A key component of these operations is a C (SSE intrinsics, really) function that accepts eight 64-bit masks as four 128-bit SSE registers (one pair of masks per register), and returns a byte, with one bit per mask. In order to test that logic, we expose u64_block_bitmask, a wrapper that accepts an array of eight masks and forwards its contents to the mask-to-bit function. check_u64_bitmask calls that wrapper function from Python, and asserts that its return value is as expected: 1 for each mask that's all 1s, and 0 for each mask that's all 0s. We turn check_u64_bitmask into a test case generator by annotating the test method test_u64_block_bitmask with a @given decorator that asks Hypothesis to generate arbitrary lists of eight booleans.
Of course, there are only 256 lists of eight booleans; we should test that exhaustively. Hypothesis has the @example decorator, so it's really easy to implement our own Python decorator to apply hypothesis.example once for each element in an iterable.
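To make the exhaustive enumeration concrete without depending on Hypothesis or on the C library, here's a standard-library-only sketch. The expected-value logic mirrors check_u64_bitmask; model_u64_block_bitmask is a hypothetical pure-Python stand-in for the real C wrapper, and check_all_boolean_lists is a name I made up. With Hypothesis, each tuple produced by itertools.product would instead be handed to the @example decorator.

```python
import itertools
import struct

ALL_ONES = 2 ** 64 - 1


def expected_bitmask(booleans) -> int:
    """Bit i of the result is set iff booleans[i] is true."""
    expected = 0
    for i, value in enumerate(booleans):
        if value:
            expected |= 1 << i
    return expected


def model_u64_block_bitmask(packed_masks: bytes) -> int:
    """Hypothetical stand-in for the C wrapper: one output bit per 64-bit mask."""
    masks = struct.unpack("8Q", packed_masks)
    return sum(1 << i for i, mask in enumerate(masks) if mask == ALL_ONES)


def check_all_boolean_lists() -> int:
    """Check every list of eight booleans; return how many were checked."""
    count = 0
    for booleans in itertools.product([False, True], repeat=8):
        packed = struct.pack("8Q", *[ALL_ONES if b else 0 for b in booleans])
        assert model_u64_block_bitmask(packed) == expected_bitmask(booleans)
        count += 1
    return count


if __name__ == "__main__":
    print(check_all_boolean_lists(), "lists checked")
```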
That's pretty cool, but we could enumerate that more efficiently in C. Where Hypothesis shines is in reporting easily actionable failures. We're wasting computer time in order to save developer time, and that's usually a reasonable tradeoff. For example, here's what Hypothesis spits back via pytest when I introduce a bug to ignore the last pair of masks, masks 6 and 7, and instead reuse the register that holds masks 4 and 5 (a mistake that's surprisingly easy to make when writing intrinsics by hand).
booleans = [False, False, False, False, False, False, ...]
    def check_u64_bitmask(booleans):
        """Given a list of 8 booleans, generates the corresponding array of
        u64 masks, and checks that they are correctly compressed to a u8
        bitmask.
        """
        expected = 0
        for i, value in enumerate(booleans):
            if value:
                expected |= 1 << i
        masks = struct.pack("8Q", *[2 ** 64 - 1 if boolean else 0 for boolean in booleans])
>       assert expected == C.u64_block_bitmask(FFI.from_buffer("uint64_t[]", masks))
E       AssertionError: assert 128 == 0
E         -128
E         +0
test_u64_block.py:40: AssertionError
------------------------------ Hypothesis ------------------------------
Falsifying example: test_u64_block_bitmask(
    self=<test_u64_block.TestU64BlockOp testMethod=test_u64_block_bitmask>,
    bits=[False, False, False, False, False, False, False, True],
)
Hypothesis finds the bug in a fraction of a second, but, more importantly, it then works harder to report a minimal counter-example.^{4} With all but the last mask set to 0, it's easy to guess that we're probably ignoring the value of the last mask (and maybe more), which would be why we found a bitmask of 0 rather than 128.
So far this is all regular property-based testing, with a hint of more production-readiness than we've come to expect from clever software correctness tools. What really sold me was Hypothesis's stateful testing capability, which makes it easy to test not only individual functions, but also methods on stateful objects.
For example, here is the test code for our specialised representation of lists of row ids (of 64-bit integers), which reuses internal bits as inline storage in the common case when the list is small (the type is called entry because it's a pair of a key and a list of values).
Most @rule-decorated methods call a C function before updating a Python-side model of the values we expect to find in the list. The @rule decorator marks methods that Hypothesis may call to generate examples; the decorator's arguments declare how each method should be invoked. Some methods also have a @precondition decorator to specify when they can be invoked (a @rule without any @precondition is always safe to call).
There is one method (create_entry) to create a new list populated with an initial row id, another (add_row) to add a row id to the list, one to finalize the list, something which should always be safe to call and must be done before reading the list's contents (finalizing converts away from the inline representation), and a check to compare the list of row ids in C to the one we maintained in Python.
Row ids are arbitrary 64-bit integers, so we could simply ask Hypothesis to generate integers in \([0, 2^{64} - 1]\). However, we also know that the implementation uses high row id values around UINT64_MAX as sentinels in its implementation of inline storage, as shown below.
struct inlined_list {
    union {
        uint64_t data[2];
        struct {
            uint64_t *arr;
            unsigned int capacity;
            unsigned int length;
        };
    };
    uint64_t control;
};
          Out of line list     Inline list of 2    Inline list of 3
          -----------------    ----------------    -----------------------
data[0]:  arr                  elt #0              elt #0
data[1]:  capacity & length    elt #1              elt #1
control:  UINT64_MAX           UINT64_MAX - 2      elt #2 < UINT64_MAX - 2
That's why we bias the data generation towards row ids that could also be mistaken for sentinel values: we generate each row id by either sampling from a set of sentinel-like values, or by sampling from all 64-bit integers. This approach to defining the input data is decidedly non-magical, compared to the way I work with fuzzers. However, fuzzers also tend to be slow and somewhat fickle. I think it makes sense to ask programmers to think about how their own code should be tested, instead of hoping a computer program will eventually intuit what edge cases look like.^{5}
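The biased generator is simple enough to sketch with the standard library's random module. The exact set of sentinel-like values and the 50/50 split are assumptions of mine for illustration, not the values used in the real test suite; in Hypothesis itself, the same idea can be expressed by combining sampled_from and integers strategies with one_of.

```python
import random

UINT64_MAX = 2 ** 64 - 1

# Hypothetical choice: the handful of ids at and just below UINT64_MAX,
# where the inline representation's control-word sentinels live.
SENTINEL_LIKE = [UINT64_MAX - i for i in range(4)]


def sample_row_id(rng: random.Random, sentinel_bias: float = 0.5) -> int:
    """Draw a row id, biased towards values that resemble sentinels."""
    if rng.random() < sentinel_bias:
        return rng.choice(SENTINEL_LIKE)
    return rng.randrange(0, UINT64_MAX + 1)  # any 64-bit unsigned integer


if __name__ == "__main__":
    rng = random.Random(1234)
    print([sample_row_id(rng) for _ in range(5)])
```

Sampling uniformly from all 64-bit integers would essentially never hit one of the four interesting values; the explicit bias is what makes the sentinel edge cases show up in practice.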
It's pretty cool that Hypothesis will now generate a bunch of API calls for me, but, again, what makes Hypothesis really valuable is the way it minimises long sequences of random calls into understandable counter-examples. Here's what Hypothesis/pytest reports when I remove a guard that saves us from writing a row id to a control word when that id would look like a sentinel value.
self = PregrouperEntryList({})
    @precondition(lambda self: self.entry)
    @rule()
    def check(self):
        self.finalize()
>       assert self.rows == [
            self.entry.payload.list.values[i]
            for i in range(self.entry.payload.list.n_entries)
        ]
E       AssertionError: assert [184467440737...4073709551613] == [184467440737...4073709551613]
E         Left contains one more item: 18446744073709551613
E         Full diff:
E         - [18446744073709551613, 18446744073709551613, 18446744073709551613]
E         ?
E         + [18446744073709551613, 18446744073709551613]
test_pregroup_merge.py:81: AssertionError
------------------------------ Hypothesis ------------------------------
Falsifying example:
state = PregrouperEntryList()
state.create_entry(row=18446744073709551613)
state.add_row(row=18446744073709551613)
state.add_row(row=18446744073709551613)
state.check()
state.teardown()
We can see that we populated a list thrice with the same row id, 18446744073709551613, but only found it twice in the final C-side list. That row id is \(2^{64} - 3,\) the value we use to denote inline lists of two values. This drawing shows how the last write ended up going to a control word where it was treated as a sentinel, making the inline representation look like a list of two values, instead of three values.
          Bad list of 3               Identical list of 2
          ------------------------    -------------------------
data[0]:  UINT64_MAX - 2              UINT64_MAX - 2
data[1]:  UINT64_MAX - 2              UINT64_MAX - 2
control:  UINT64_MAX - 2 (payload)    UINT64_MAX - 2 (sentinel)
I restored the logic to prematurely stop using the inline representation and convert to a heap-allocated vector whenever the row value would be interpreted as a sentinel, and now Hypothesis doesn't find anything wrong. We also know that this specific failure is fixed, because Hypothesis retries examples from its failure database^{6} on every rerun.
There are multiple language-specific libraries under the Hypothesis project; none of them is in C, and only the Python implementation is actively maintained. One might think that makes Hypothesis inappropriate for testing C. However, the big ideas in Hypothesis are language-independent. There is a practical reason for the multiple implementations: for a tool to be really usable, it has to integrate well with the programming language's surrounding ecosystem. In most languages, we also expect to write test code in the same language as the system under test.
C (and C++, I would argue) is an exception. When I tell an experienced C developer they should write test code in Python, I expect a sigh of relief. The fact that Hypothesis is written in Python, as opposed to another managed language like Ruby or C#, also helps: embedding and calling C libraries is Python's bread and butter. The weak state of C tooling is another factor: no one has a strong opinion regarding how to invoke C test code, or how the results should be reported. I'll work with anything standard enough (e.g., pytest's JUnit dump) to be ingested by Jenkins.
The last thing that sealed the deal for me is Python's CFFI. With that library, I simply had to make sure my public header files were clean enough to be parsed without a full-blown compiler; I could then write some simple Python code to strip away preprocessor directives, read headers in dependency order, and load the production shared object, without any test-specific build step. The snippets of test code above don't look that different from tests for a regular Python module, but there's also a clear mapping from each Python call to a C call. The level of magic is just right.
There is one testing concern that's almost specific to C and C++ programs: we must be on the lookout for memory management bugs. For that, we use our ASan build and bubble up some issues (e.g., read overflows or leaks) back to Python; everything else results in an abort(), which, while suboptimal for minimisation, is still useful. I simplified our test harness for the Heartbleed experiment; see below for gists that anyone can use as starting points.
I abused fuzzers a lot at Google, mostly because the tooling was there and it was trivial to burn hundreds of CPU-hours. Technical workarounds seemed much easier than trying to add support for actual property-based testers, so I would add domain specific assertions in order to detect more than crash conditions, and translate bytes into more structured inputs or even convert them to series of method calls. Now that I don't have access to prebuilt infrastructure for fuzzers, that approach doesn't make as much sense. That's particularly true when byte arrays don't naturally fit the entry point: I spent a lot of time making sure the fuzzer was exercising branches in the system under test, not the test harness, and designing input formats such that common fuzzer mutations, e.g., removing a byte, had a local effect and did not, e.g., cause a frameshift mutation for the rest of the input… and that's before considering the time I wasted manually converting bytes to their structured interpretation on failures.
Hypothesis natively supports structured inputs and function calls, and reports buggy inputs in terms of these high level concepts. It is admittedly slower than fuzzers, especially when compared to fuzzers that (like Hypothesis) don't fork/exec for each evaluation. I'm comfortable with that: my experience with NP-Hard problems tells me it's better to start by trying to do smart things with the structure of the problem, and later speed that up, rather than putting all our hopes in making bad decisions really really fast. Brute force can only do so much to an exponential-time cliff.
I had one niggling concern when leaving the magical world of fuzzers for staid property-based testing. In some cases, I had seen coverage information steer fuzzers towards really subtle bugs; could I benefit from the same smartness in Hypothesis? That's how I came up with the idea of simulating a reasonable testing process that ought to find CVE-2014-0160, Heartbleed.
CVE-2014-0160 a.k.a. Heartbleed is a read heap overflow in OpenSSL 1.0.1 (\(\leq\) 1.0.1f) that leaks potentially private data over the wire. It's a straightforward logic bug in the implementation of RFC 6520, a small optional (D)TLS extension that lets clients or servers ask for a ping. The Register has a concise visualisation of a bad packet. We must send correctly sized packets that happen to ask for more pingback data than was sent to OpenSSL.
State of the art fuzzers like AFL or Honggfuzz find that bug in seconds when given an entry point that passes bytes to the connection handler, like data that had just come over the wire. When provided with a corpus of valid messages, they're even faster.
It's a really impressive showing of brute force that these programs come up with almost valid packets so quickly, and traditional property-based testing frameworks are really not up to that level of magic. However, I don't find the black box setting that interesting. It's probably different for security testing, since not leaking invalid data is apparently seen as a feature one can slap on after the fact: there are so many things to test that it's probably best to focus on what fuzzing can easily find, and, in any case, it doesn't seem practical to manually boil that ocean one functionality at a time.
From a software testing point of view however, I would expect the people who send in a patch to implement something like the Heartbeat extension to also have a decent idea how to send bytes that exercise their new code. Of course, a sufficiently diligent coder would have found the heap overflow during testing, or simply not have introduced that bug. That's not a useful scenario to explore; I'm interested in something that does find Heartbleed, and also looks like a repeatable process. The question becomes "Within this repeatable process, can branch coverage feedback help Hypothesis find Heartbleed?"
Here's what I settled on: let's only assume the new feature code also comes with a packet generator to exercise that code. In Hypothesis, it might look like the following.
We have a single rule to initialize the test buffer with a valid heartbeat packet, and we send the whole buffer to a standard fuzzer entry point at the end. In practice, I would also ask for some actual assertions that the system under test handles the packets correctly, but that's not important when it comes to Heartbleed: we just need to run with ASan and look for heap overflows, which are essentially never OK.
Being able to provide a happy-path-only "test" like the above should be less than table stakes in a healthy project. Let's simulate a bug-finding process that looks for crashes after adding three generic buffer mutating primitives: one that replaces a single byte in the message buffer, another one that removes some bytes from the end of the buffer, and a last one that appends some bytes to the buffer.
Given the initial generator rules and these three mutators, Hypothesis will assemble sequences of calls to create and mutate buffers before sending them to OpenSSL at the end of the "test."
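The three mutators are deliberately generic. As a rough sketch of what they do to a buffer (plain functions on bytes rather than Hypothesis rules; the names and the clamping behaviour are my own choices, not taken from the test code):

```python
def replace_byte(buf: bytes, index: int, value: int) -> bytes:
    """Overwrite one byte; the index wraps around so any integer is usable."""
    if not buf:
        return buf
    index %= len(buf)
    return buf[:index] + bytes([value & 0xFF]) + buf[index + 1:]


def strip_suffix(buf: bytes, count: int) -> bytes:
    """Remove up to `count` bytes from the end of the buffer."""
    count = max(0, min(count, len(buf)))
    return buf[:len(buf) - count]


def append_bytes(buf: bytes, extra: bytes) -> bytes:
    """Append arbitrary bytes to the buffer."""
    return buf + extra


if __name__ == "__main__":
    buf = bytes(range(8))
    buf = replace_byte(buf, 2, 0xFF)
    buf = strip_suffix(buf, 3)
    buf = append_bytes(buf, b"\x00\x01")
    print(buf.hex())
```

None of these know anything about Heartbeat messages: truncating the end of a valid packet, or overwriting a byte of its length field, is enough to produce a message that claims more payload than it carries.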
The first time I ran this, Hypothesis found the bug in a second or two
==12810==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x62900012b748 at pc 0x7fb70d372f4e bp 0x7ffe55e56a30 sp 0x7ffe55e561e0
READ of size 30728 at 0x62900012b748 thread T0
0x62900012b748 is located 0 bytes to the right of 17736-byte region [0x629000127200,0x62900012b748)
allocated by thread T0 here:
    #0 0x7fb70d3e3628 in malloc (/usr/lib/x86_64-linux-gnu/libasan.so.5+0x107628)
    #1 0x7fb7058ee939 (fuzzer-test-suite/openssl-1.0.1f-fsanitize.so+0x19a939)
    #2 0x7fb7058caa08 (fuzzer-test-suite/openssl-1.0.1f-fsanitize.so+0x176a08)
    #3 0x7fb7058caf41 (fuzzer-test-suite/openssl-1.0.1f-fsanitize.so+0x176f41)
    #4 0x7fb7058a658d (fuzzer-test-suite/openssl-1.0.1f-fsanitize.so+0x15258d)
    #5 0x7fb705865a12 (fuzzer-test-suite/openssl-1.0.1f-fsanitize.so+0x111a12)
    #6 0x7fb708326deb in ffi_call_unix64 (/home/pkhuong/follicle/follicle/lib/Python3.7/site-packages/.libs_cffi_backend/libffi-806b1a9d.so.6.0.4+0x6deb)
    #7 0x7fb7081b4f0f (<unknown module>)
and then spent 30 seconds trying to minimise the failure. Unfortunately, it seems that ASan tries to be smart and avoids reporting duplicate errors, so minimisation does not work for memory errors. It also doesn't matter that much if we find a minimal test case: a core dump and a reproducer is usually good enough.
You can find the complete code for GrammarTester, along with the ASan wrapper, and CFFI glue in this gist.
For the rest of this post, we'll make ASan crash on the first error, and count the number of test cases generated (i.e., the number of calls into the OpenSSL fuzzing entry point) until ASan made OpenSSL crash.
Intel chips have offered BTS, a branch trace store, since Nehalem (2008-9), if not earlier. BTS is slower than PT, its full blown hardware tracing successor, and only traces branches, but it does so in a dead simple format. I wrapped Linux perf's interface for BTS in a small library; let's see if we can somehow feed that trace to Hypothesis and consistently find Heartbleed faster.
But first, how do fuzzers use coverage information?
A lot of the impressive stuff that fuzzers find stems from their ability to infer the constants that were compared with the fuzzing input by the system under test. That's not really that important when we assume a more white box testing setting, where the same person or group responsible for writing the code is also tasked with testing it, or at least specifying how to do so.
Apart from guessing what valid inputs look like, fuzzers also use branch coverage information to diversify their inputs. In the initial search phase, none of the inputs cause a crash, and fuzzers can only mutate existing, equally non-crashy, inputs or create new ones from scratch. The goal is to maintain a small population that can trigger all the behaviours (branches) we've observed so far, and search locally around each member of that population. Pruning to keep the population small is beneficial for two reasons: first, it's faster to iterate over a smaller population, and second, it avoids redundantly exploring the neighbourhood of nearly identical inputs.
Of course, this isn't ideal. We'd prefer to keep one input per program state, but we don't know what the distinct program states are. Instead, we only know what branches we took on the way to wherever we ended up. It's as if we were trying to generate directions to every landmark in a city, but the only feedback we received was the set of streets we walked on while following the directions. That's far from perfect, but, with enough brute force, it might just be good enough.
We can emulate this diversification and local exploration logic with multidimensional Targeted example generation, for which Hypothesis has experimental support.
We'll assign an arbitrary unique label to each origin/destination pair we observe via BTS, and assign a score of 1.0 to every such label (regardless of how many times we observed each pair). Whenever Hypothesis compares two score vectors, a missing value is treated as \(-\infty\), so, everything else being equal, covering a branch is better than not covering it. After that, we'll rely on the fact that multidimensional discrete optimisation is also all about maintaining a small but diverse population: Hypothesis regularly prunes redundant examples (examples that exercise a subset of the branches triggered by another example), and generates new examples by mutating members of the population. With our scoring scheme, the multidimensional search will split its efforts between families of examples that trigger different sets of branches, and will also stop looking around examples that trigger a strict subset of another example's branches.
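To illustrate the pruning idea only (this is not Hypothesis's actual search code, and every name here is hypothetical), here's a small sketch that represents each example by the set of branch labels it covered, builds the corresponding score vector, and drops examples whose coverage is contained in another example's.

```python
from typing import Dict, FrozenSet


def score_vector(branches: FrozenSet[int]) -> Dict[int, float]:
    """One dimension per covered branch label, always scored 1.0."""
    return {label: 1.0 for label in branches}


def prune_redundant(population: Dict[str, FrozenSet[int]]) -> Dict[str, FrozenSet[int]]:
    """Keep only examples whose branch set isn't covered by another example."""
    kept: Dict[str, FrozenSet[int]] = {}
    for name, branches in population.items():
        if any(branches < other for other in population.values()):
            continue  # strict subset of another example's branches
        if branches in kept.values():
            continue  # identical coverage: keep the first one we saw
        kept[name] = branches
    return kept


if __name__ == "__main__":
    population = {
        "a": frozenset({1, 2}),
        "b": frozenset({1}),      # dominated by "a"
        "c": frozenset({3}),      # covers something nobody else does
        "d": frozenset({1, 2}),   # same coverage as "a"
    }
    print(sorted(prune_redundant(population)))
```

Examples "a" and "c" survive because neither's coverage contains the other's; those are the families between which a multidimensional search would split its effort.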
Here's the plan to give BTS feedback to Hypothesis and diversify its initial search for failing examples. I'll use my libbts to wrap perf syscalls into something usable, and wrap that in Python to more easily gather origin/destination pairs. Even though I enable BTS tracing only around FFI calls, there will be some noise from libffi, as well as dummy transitions for interrupts or context switches; bts.py attempts to only consider interesting branches by remembering the set of executable mappings present at startup, before loading the library under test, and dropping jumps to or from addresses that were mapped at startup (presumably, that's from the test harness), or invalid zero or negative addresses (which I'm pretty sure denote interrupts and syscalls).
We'll then wrap the Python functions that call into OpenSSL to record the set of branches executed during that call, and convert that set to a multidimensional score at the end of the test, in the teardown method.
The only difference is in the __init__ method, which must also reset BTS state, and in the teardown method, where we score the example if it failed to crash.
Let's instrument the teardown methods to print "CHECK" and flush before every call to the fuzzing entry point, and make sure ASan crashes when it finds an issue. We'll run the test files, grep for "CHECK", and, assuming we don't run into any runtime limit, find out how many examples Hypothesis had to generate before causing a crash.
I ran this 12000 times for both test_with_grammar.py and test_with_grammar_bts.py. Let's take a look at the empirical distribution functions for the number of calls until a crash, for test_with_grammar.py (grammar) and test_with_grammar_bts.py (bts).
There's a crossover around 100 calls: as long as we have enough time for at least 100 calls to OpenSSL, we're more likely to find Heartbleed with coverage feedback than by rapidly searching blindly. With fewer than 100 calls, it seems likely that branch coverage only guides the search towards clearly invalid inputs that trigger early exit conditions. Crucially, the curves are smooth and tap out before our limit of 10000 examples per execution, so we're probably not measuring a side-effect of the experimental setup.
In theory, the distribution function for the uninstrumented grammar search should look a lot like a geometric distribution, but I don't want to assume too much of the bts implementation. Let's confirm our gut feeling in R with a simple non-parametric test, the Wilcoxon rank sum test, an unpaired analogue to the sign test:
> wilcox.test(data$Calls[data$Impl == 'bts'], data$Calls[data$Impl == 'grammar'], 'less')
        Wilcoxon rank sum test with continuity correction
data:  data$Calls[data$Impl == "bts"] and data$Calls[data$Impl == "grammar"]
W = 68514000, p-value = 4.156e-11
alternative hypothesis: true location shift is less than 0
We're reasonably certain that the number of calls for a random run of bts is more frequently less than the number of calls for an independently chosen random run of grammar than the opposite. Since the tracing overhead is negligible, this means branch feedback saves us time more often than not. For the 24 000 separate runs we observed, bts is only faster 52% of the time, but that's mostly because both distributions of calls to the system under test go pretty high. On average, vanilla Hypothesis, without coverage feedback, found the vulnerability after 535 calls to OpenSSL; with feedback, Hypothesis is 11.5% faster on average, and only needs 473 calls.
We have been using this testing approach (without branch coverage) at Backtrace for a couple months now, and it’s working well as a developer tool that offers rapid feedback, enough that we’re considering running these tests in a commit hook. Most of the work involved in making the approach useful was just plumbing, e.g., dealing with the way ASan reports errors, or making sure we don’t report leaks that happen in Python, outside the system under test. Once the commit hook is proven solid, we’ll probably want to look into running tests in the background for a long time. That’s a very different setting from pre-commit or commit-time runs, where every bug is sacred. If I let something run over the weekend, I must be able to rely on deduplication (i.e., minimisation is even more useful), and I will probably want to silence or otherwise triage some issues. That’s the sort of thing Backtrace already handles, so we are looking into sending Hypothesis reports directly to Backtrace, the same way we do for clang-analyzer reports and for crashes or leaks found by ASan in our end-to-end tests.
The challenges are more researchy for coverage feedback. There are no doubt a lot of mundane plumbing issues involved in making this feedback robust (see, e.g., the logic in bts.py to filter out interrupts and branches from the kernel, or branches from code that’s probably not in the system under test). However, there’s also a fundamental question that we unfortunately can’t answer by copying fuzzers.
How should we score coverage over multiple calls? After all, the ability to test a stateful interface via a sequence of method calls and expectations is what really sold me on Hypothesis. Fuzzers have it easy: they assume each call is atomic and independent of any other call to the system under test, even when they don’t fork for each execution. This seems like a key reason why simple coverage metrics work well in fuzzers, and I don’t know that we can trivially port ideas from fuzzing to stateful testing.
For example, my first stab at this experiment found no statistically significant improvement from BTS feedback. The only difference is that the assertion lived in a check rule, and not in the teardown method, which let Hypothesis trigger a call to OpenSSL at various points in the buffer mutation sequence… usually a good thing for test coverage.
I’m pretty sure the problem is that a single example could collect “points” for covering branches in multiple unrelated calls to OpenSSL, while we would rather cover many branches in a single call.
What does it mean for stateful testing, where we want to invoke different functions multiple times in a test?
I have no idea; maybe we should come up with some synthetic stateful testing benchmarks that are expected to benefit from coverage information. However, the experiment in this post gives me hope that there exists some way to exploit coverage information in stateful testing. The MIT-licensed support code in this gist (along with libbts) should give you all a head start to try more stuff and report your experiences.
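To make the scoring question concrete, here is a toy comparison of two ways to score one example made of several calls. Both scoring functions and all the branch labels are hypothetical; this is not how bts.py or Hypothesis scores coverage, only an illustration of why pooling branches across calls might reward the wrong examples.

```python
def pooled_score(calls):
    """Distinct branches covered by any call in the example."""
    covered = set()
    for branches in calls:
        covered |= set(branches)
    return len(covered)


def best_single_call_score(calls):
    """Largest number of distinct branches covered by one call."""
    return max((len(set(branches)) for branches in calls), default=0)


# Hypothetical branch labels. The first example makes three shallow calls that
# each bail out early; the second makes one call that gets deeper into parsing.
shallow_calls = [{"early_exit_a"}, {"early_exit_b"}, {"early_exit_c", "early_exit_d"}]
deep_call = [{"parse_header", "check_type", "copy_payload"}]

print(pooled_score(shallow_calls), best_single_call_score(shallow_calls))  # 4 2
print(pooled_score(deep_call), best_single_call_score(deep_call))          # 3 3
```

Under the pooled score, the example with three shallow calls looks better than the one deep call; the per-call maximum ranks them the other way around. Whether either metric is the right one for stateful tests remains the open question.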
Thank you David, Ruchir, Samy, and Travis for your comments on an early draft.
In a performance-oriented C library, the main obstacle to testing with Hypothesis and CFFI will probably be that the headers are too complex for pycparser. I had to use a few tricks to make ours parse.
First, I retroactively imposed more discipline on what was really public and what implementation details should live in private headers: we want CFFI to handle every public header, and only some private headers for more targeted testing. This separation is a good thing to maintain in general, and, if anything, having pycparser yell at us when we make our public interface depend on internals is a net benefit.
I then had to reduce our reliance on the C preprocessor. In some cases, that meant making types opaque and adding getters. In many more cases, I simply converted small #defined integer constants to anonymous enums enum { CONSTANT_FOO = ... }.
Finally, especially when testing internal functionality, I had to remove static inline functions. That’s another case where pycparser forces us to maintain cleaner headers, and the fix is usually simple: declare inline (not static) functions in the header, #include a .inl file with the inline definition (we can easily drop directives in Python), and redeclare the function as extern in the main source file for that header. With this approach, the header can focus on documenting the interface, the compiler still has access to an inline definition, and we don’t waste instruction bytes on duplicate out-of-line definitions.
That doesn’t always work, mostly because regular inline functions aren’t supposed to call static inline functions. When I ran into that issue, I either tried to factor the more complex slow path out to an out-of-line definition, or, more rarely, resorted to CPP tricks (also hidden in a .inl file) to rewrite calls to extern int foo(...) with macros like #define foo(...) static_inline_foo(__VA_ARGS__).
All that work isn’t really testing overhead; it’s mostly things that library maintainers should do, but that are easy to forget when we only hear from C programmers.
Once the headers were close enough to being accepted by CFFI, I closed the gap with string munging in Python. All the tests depend on the same file that parses all the headers we care about in the correct order and loads the shared object. Loading everything in the same order also enforces a reasonable dependency graph, another unintended benefit. Once everything is loaded, that file also postprocesses the CFFI functions to hook in ASan (and branch tracing), and strip away any namespacing prefix.
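The string munging can be quite small. The sketch below is an assumption about what such a shared loader might do, not the actual file: the header text, the `mylib_` prefix, and both helper names are invented for illustration. In a real setup, the cleaned text would presumably be handed to CFFI (e.g., `FFI.cdef`) before loading the shared object; that step is left out here so the example runs without CFFI installed.

```python
import re

# Hypothetical public header, in the style described above: an anonymous enum
# instead of a #define, and an inline declaration whose definition lives in
# a separate .inl file.
HEADER = """
#pragma once
#include <stddef.h>

enum { MYLIB_MAX_SIZE = 128 };

inline size_t mylib_round_up(size_t n);

#include "mylib_round_up.inl"
"""

# Matches a whole preprocessor directive line. Directives continued with a
# trailing backslash are not handled by this simple pattern.
_DIRECTIVE = re.compile(r"^[ \t]*#.*$", re.MULTILINE)


def strip_directives(header_text):
    """Blank out preprocessor directive lines, keeping line numbers stable."""
    return _DIRECTIVE.sub("", header_text)


def strip_prefix(names, prefix):
    """Map un-prefixed names to the original C names that start with prefix."""
    return {name[len(prefix):]: name for name in names if name.startswith(prefix)}


cleaned = strip_directives(HEADER)
print(cleaned)
print(strip_prefix(["mylib_round_up", "unrelated"], "mylib_"))
```

Blanking directive lines instead of deleting them keeps any parse error pointing at the right line of the original header.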
The end result is a library that’s more easily used from managed languages like Python, and which we can now test like any other Python module.
The graph shows time in terms of calls to the SUT, not real time. However, the additional overhead of gathering and considering coverage information is small enough that the feedback also improves wall-clock and CPU time.
High-level matchers like GMock’s help, but they don’t always explain why a given matcher makes sense.
There’s also a place for smart change detectors, especially when refactoring ill-understood code.
I had to disable the exhaustive list of examples to benefit from minimisation. Maybe one day I’ll figure out how to make Hypothesis treat explicit examples more like an initial test corpus.
That reminds me of an AI Lab Koan. «In the days when Sussman was a novice, Minsky once came to him as he sat hacking at the PDP-6. “What are you doing?” asked Minsky. “I am training a randomly wired neural net to play Tic-tac-toe,” Sussman replied. “Why is the net wired randomly?” asked Minsky. Sussman replied, “I do not want it to have any preconceptions of how to play.” Minsky then shut his eyes. “Why do you close your eyes?” Sussman asked his teacher. “So that the room will be empty,” replied Minsky. At that moment, Sussman was enlightened.»
The database is a directory of files full of binary data. The specific meaning of these bytes depends on the Hypothesis version and on the test code, so the database should probably not be checked in. Important examples (e.g., for regression testing) should instead be made persistent with @example decorators. Hypothesis does guarantee that these files are always valid (i.e., any sequence of bytes in the database will result in an example that can be generated by the version of Hypothesis and of the test code that reads it), so we don’t have to invalidate the cache when we update the test harness.