Nine months ago, we embarked on a format migration for the persistent (on-disk) representation of variable-length strings like symbolicated call stacks in the Backtrace server. We chose a variant of consistent overhead byte stuffing (COBS), a self-synchronising code, for the metadata (variable-length as well). This choice let us improve our software’s resilience to data corruption in local files, and then parallelise data hydration, which improved startup times by a factor of 10… without any hard migration from the old to the current on-disk data format.
In this post, I will explain why I believe that the representation of first resort for binary logs (write-ahead, recovery, replay, or anything else that may be consumed by a program) should be self-synchronising, backed by this migration and by prior experience with COBS-style encoding. I will also describe the specific algorithm (available under the MIT license) we implemented for our server software.
This encoding offers low space overhead for framing, fast encoding and faster decoding, resilience to data corruption, and a restricted form of random access. Maybe it makes sense to use it for your own data!
A code is self-synchronising when it’s always possible to unambiguously detect where a valid code word (record) starts in a stream of symbols (bytes). That’s a stronger property than that of prefix codes like Huffman codes, which only detect when valid code words end. For example, the UTF-8 encoding is self-synchronising, because initial bytes and continuation bytes differ in their high bits. That’s why it’s possible to decode multi-byte code points when tailing a UTF-8 stream.
The UTF-8 code was designed for small integers (Unicode code points), and can double the size of binary data. Other encodings are more appropriate for arbitrary bytes; for example, consistent overhead byte stuffing (COBS), a self-synchronising code for byte streams, offers a worst-case space overhead of one byte plus a 0.4% space blowup.
Self-synchronisation is important for binary logs because it lets us efficiently (with respect to both run time and space overhead) frame records in a simple and robust manner… and we want simplicity and robustness because logs are most useful when something has gone wrong.
Of course, the storage layer should detect and correct errors, but things will sometimes fall through, especially for on-premises software, where no one fully controls deployments. When that happens, graceful partial failure is preferable to, e.g., losing all the information in a file because one of its pages went to the great bit bucket in the sky.
One easy solution is to spread the data out over multiple files or blobs. However, there’s a tradeoff between keeping data fragmentation and file metadata overhead in check, and minimising the blast radius of minor corruption. Our server must be able to run on isolated nodes, so we can’t rely on design options available to replicated systems… plus bugs tend to be correlated across replicas, so there is something to be said for defense in depth, even with distributed storage.
When each record is converted with a self-synchronising code like COBS before persisting to disk, we can decode all records that weren’t directly impacted by corruption, exactly like decoding a stream of mostly valid UTF-8 bytes. Any form of corruption will only make us lose the records whose bytes were corrupted, and, at most, the two records that immediately precede or follow the corrupt byte range. This guarantee covers overwritten data (e.g., when a network switch flips a bit, or a read syscall silently errors out with a zero-filled page), as well as bytes removed or garbage inserted in the middle of log files.
The coding doesn’t store redundant information: replication or erasure coding is the storage layer’s responsibility. It instead guarantees to always minimise the impact of corruption, and only lose records that were adjacent to or directly hit by corruption.
A COBS encoding for log records achieves that by unambiguously separating records with a reserved byte (e.g., 0), and re-encoding each record to avoid that separator byte. A reader can thus assume that potential records start and end at a log file’s first and last bytes, and otherwise look for separator bytes to determine where to cut all potential records. These records may be invalid: a separator byte could be introduced or removed by corruption, and the contents of a correctly framed record may be corrupt. When that happens, readers can simply scan for the next separator byte and try to validate that new potential record. The decoder’s state resets after each separator byte, so any corruption is “forgotten” as soon as the decoder finds a valid separator byte.
On the write side, the encoding logic is simple (a couple dozen lines of C code), and uses a predictable amount of space, as expected from an algorithm suitable for microcontrollers.
Actually writing encoded data is also easy: on POSIX filesystems, we can make sure each record is delimited (e.g., prefixed with the delimiter byte), and issue a regular O_APPEND write(2). Vectored writes can even insert delimiters without copying in userspace. Realistically, our code is probably less stable than the operating system and the hardware it runs on, so we make sure our writes make it to the kernel as soon as possible, and let fsyncs happen on a timer.
When a write errors out, we can blindly (maybe once or twice) try again: the encoding is independent of the output file’s state. When a write is cut short, we can still issue the same^{1} write call, without trying to “fix” the short write: the encoding and the readside logic already protect against that kind of corruption.
What if multiple threads or processes write to the same log file? When we open with O_APPEND, the operating system can handle the rest. This doesn’t make contention disappear, but at least we’re not adding a bottleneck in userspace on top of what is necessary to append to the same file.
Buffering is also trivial: the encoding is independent of the state of the destination file, so we can always concatenate buffered records and write the result with a single syscall.
This simplicity also plays well with high-throughput I/O primitives like io_uring, and with blob stores that support appends: independent workers can concurrently queue up blind append requests and retry on failure. There’s no need for application-level mutual exclusion or rollback.
Our log encoding will recover from bad bytes, as long as readers can detect and reject invalid records as a whole; the processing logic should also handle duplicated valid records. These are table stakes for a reliable log consumer.
In our variable-length metadata use case, each record describes a symbolicated call stack, and we recreate in-memory data structures by replaying an append-only log of metadata records, one for each unique call stack. The hydration phase handles invalid records by ignoring (not recreating) any call stack with corrupt metadata, but only those call stacks. That’s definitely an improvement over the previous situation, where corruption in a size header would prevent us from decoding the remainder of the file, and thus make us forget about all call stacks stored at file offsets after the corruption.
Of course, losing data should be avoided, so we are careful to fsync regularly and recommend reasonable storage configurations. However, one can only make data loss unlikely, not impossible (if only due to fat fingering), especially when cost is a factor. With the COBS encoding, we can recover gracefully and automatically from any unfortunate data corruption event.
We can also turn this robustness into new capabilities.
It’s often useful to process the tail of a log at a regular cadence. For example, I once maintained a system that regularly tailed hourly logs to update approximate views. One could support that use case with length footers. COBS framing lets us instead scan for a valid record from an arbitrary byte location, and read the rest of the data normally.
When logs grow large enough, we want to process them in parallel. The standard solution is to shard log streams, which unfortunately couples the parallelisation and storage strategies, and adds complexity to the write side.
COBS framing lets us parallelise readers independently of the writer. The downside is that the read-side code and I/O patterns are now more complex, but, all other things being equal, that’s a trade-off I’ll gladly accept, especially given that our servers run on independent machines and store their data in files, where reads are fine-grained and latency relatively low.
A parallel COBS reader partitions a data file arbitrarily (e.g., in fixed-size chunks) for independent workers. A worker will scan for the first valid record starting inside its assigned chunk, and handle every record that starts in its chunk. Filtering on the start byte means that a worker may read past the logical end of its chunk, when it fully decodes the last record that starts in the chunk: that’s how we unambiguously assign a worker to every record, including records that straddle chunk boundaries.
Random access even lets us implement a form of binary or interpolation search on raw unindexed logs, when we know the records are (k-)sorted on the search key! This lets us, e.g., access the metadata for a few call stacks without parsing the whole log.
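The chunk assignment rule can be sketched in a few lines of C. This is only an illustration for a single-byte 0 delimiter, not our reader: `scan_chunk` and `record_fn` are hypothetical names, and the callback receives records that are still encoded and unvalidated.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback type: receives one still-encoded record. */
typedef void record_fn(const uint8_t *record, size_t len, void *ctx);

/*
 * Invokes `fn` (when non-NULL) on every non-empty 0-delimited record
 * that starts in buf[begin, end), and returns the number of such
 * records. The last record may extend past `end`, up to `size`.
 */
size_t scan_chunk(const uint8_t *buf, size_t size, size_t begin, size_t end,
                  record_fn *fn, void *ctx)
{
    size_t count = 0;
    size_t pos = begin;

    assert(begin <= end && end <= size);
    /* Unless we are at the file's first byte, a record can only start
     * right after a delimiter: skip the tail of the previous record. */
    if (pos > 0) {
        while (pos < end && buf[pos - 1] != 0)
            pos++;
    }

    while (pos < end) {
        size_t stop = pos;

        /* Read up to the next delimiter, even past the chunk's end. */
        while (stop < size && buf[stop] != 0)
            stop++;
        if (stop > pos) {
            if (fn != NULL)
                fn(buf + pos, stop - pos, ctx);
            count++;
        }
        pos = stop + 1; /* resume right after the delimiter */
    }

    return count;
}
```

Because a record belongs to the chunk that contains its first byte, calling `scan_chunk` on any partition of the file visits each record exactly once, however the chunk boundaries fall.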
Eventually, we might also want to truncate our logs.
Contemporary filesystems like XFS (and even Ext4) support large sparse files. For example, sparse files can reach \(2^{63} - 1\) bytes on XFS with a minimal metadata-only footprint: the on-disk data for such sparse files is only allocated when we issue actual writes. Nowadays, we can sparsify files after the fact, and convert ranges of non-zero data into zero-filled “holes” in order to release storage without messing with file offsets (or even atomically collapse old data away).
Filesystems can only execute these operations at coarse granularity, but that’s not an issue for our readers: they must merely remember to skip sparse holes, and the decoding loop will naturally handle any garbage partial record left behind.
Cheshire and Baker’s original byte stuffing scheme targets small machines and slow transports (amateur radio and phone lines). That’s why it bounds the amount of buffering needed to 254 bytes for writers and 9 bits of state for readers, and attempts to minimise space overhead, beyond its worst-case bound of 0.4%.
The algorithm is also reasonable. The encoder buffers data until it encounters a reserved 0 byte (a delimiter byte), or there are 254 bytes of buffered data. Whenever the encoder stops buffering, it outputs a block whose contents are described by its first byte. If the writer stopped buffering because it found a reserved byte, it emits one byte with buffer_size + 1 before writing and clearing the buffer. Otherwise, it outputs 255 (one more than the buffer size), followed by the buffer’s contents.
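For concreteness, here is a minimal sketch of that encoder in C. It assumes the whole record is already in memory, so it writes each block's size byte retroactively instead of buffering up to 254 bytes; `cobs_encode` is an illustrative name, not a function from any particular library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Encodes `len` bytes from `src` into `dst` with classic COBS and
 * returns the encoded size. `dst` must have room for
 * len + len / 254 + 1 bytes. The output never contains a 0 byte.
 */
size_t cobs_encode(uint8_t *dst, const uint8_t *src, size_t len)
{
    size_t header = 0; /* index of the current block's size byte */
    size_t out = 1;    /* next output index; dst[0] is the first header */
    uint8_t code = 1;  /* number of buffered bytes, plus one */

    assert(dst != NULL && (src != NULL || len == 0));
    for (size_t i = 0; i < len; i++) {
        if (src[i] == 0) {
            /* Reserved byte: close the block; the 0 stays implicit. */
            dst[header] = code;
            header = out++;
            code = 1;
            continue;
        }

        dst[out++] = src[i];
        if (++code == 255) {
            /* 254 literal bytes: close without an implicit 0. */
            dst[header] = code;
            header = out++;
            code = 1;
        }
    }

    /* The final block stops at the record's implicit delimiter. */
    dst[header] = code;
    return out;
}
```

For example, the four bytes `11 22 00 33` encode to `03 11 22 02 33`: the embedded 0 disappears, and the size bytes tell the decoder where to put it back.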
On the decoder side, we know that the first byte of each block describes its size and decoded value (255 means 254 bytes of literal data, any other value is one more than the number of literal bytes to copy, followed by a reserved 0 byte). We denote the end of a record with an implicit delimiter: when we run out of data to decode, we should have just decoded an extra delimiter byte that’s not really part of the data.
With framing, an encoded record surrounded by delimiters thus looks like the following
|0|blen|(blen - 1) literal data bytes....|blen|literal data bytes ....|0|
The delimiting “0” bytes are optional at the beginning and end of a file, and each blen size prefix is one byte with value in \([1, 255]\). A value \(\mathtt{blen} \in [1, 254]\) represents a block of \(\mathtt{blen} - 1\) literal bytes, followed by an implicit 0 byte. If we instead have \(\mathtt{blen} = 255\), we have a block of \(254\) bytes, without any implicit byte. Readers only need to remember how many bytes remain until the end of the current block (eight bits for a counter), and whether they should insert an implicit 0 byte before decoding the next block (one binary flag).
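The decoding rules translate to a short loop. The sketch below assumes the caller already cut the record out of the stream at the delimiters, and that a record must end on a short block (the implicit delimiter); `cobs_decode` is an illustrative name, and a streaming decoder would keep the counter and flag described above instead of indexing into a buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Decodes one COBS record of `len` encoded bytes (delimiters already
 * stripped) into `dst`, which needs room for `len` bytes. Returns the
 * decoded size, or (size_t)-1 if the record is malformed.
 */
size_t cobs_decode(uint8_t *dst, const uint8_t *src, size_t len)
{
    size_t out = 0;
    size_t i = 0;
    int ended_on_short_block = 0;

    assert(dst != NULL && (src != NULL || len == 0));
    while (i < len) {
        uint8_t blen = src[i++];

        if (blen == 0 || (size_t)(blen - 1) > len - i)
            return (size_t)-1; /* stray delimiter or truncated block */

        for (uint8_t j = 1; j < blen; j++) {
            if (src[i] == 0)
                return (size_t)-1; /* delimiter inside a block */
            dst[out++] = src[i++];
        }

        /* Short blocks end with an implicit 0, except at the very end
         * of the record, where that 0 is the implicit delimiter. */
        ended_on_short_block = (blen != 255);
        if (ended_on_short_block && i < len)
            dst[out++] = 0;
    }

    return ended_on_short_block ? out : (size_t)-1;
}
```

Decoding `03 11 22 02 33` yields `11 22 00 33`, while a truncated record such as `03 11` is rejected as a whole.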
We have different goals for the software we write at Backtrace. For our logging use case, we pass around fully constructed records, and we want to issue a single write syscall per record, with periodic fsync.^{2} Buffering is baked in, so there’s no point in making sure we can work with a small write buffer. We also don’t care about the space overhead (the worst-case bound is already pretty good) as much as we do about encoding and decoding speed.
These different design goals lead us to an updated hybrid word/byte stuffing scheme:
This hybrid scheme improves encoding and decoding speed compared to COBS, and even marginally improves the asymptotic space overhead. At the low end, the worst-case overhead is only slightly worse than that of traditional COBS: we need three additional bytes, including the framing separator, for records of 252 bytes or fewer, and five bytes for records of 253-64260 bytes.
In the past, I’ve seen “word” stuffing schemes aim to reduce the runtime overhead of COBS codecs by scaling up the COBS loops to work on two or four bytes at a time. However, a byte search is trivial to vectorise, and there is no guarantee that frameshift corruption will be aligned to word boundaries (for example, POSIX allows short writes of an arbitrary number of bytes).
Our hybrid word stuffing looks for a reserved two-byte delimiter sequence at arbitrary byte offsets. We must still conceptually process bytes one at a time, but delimiting with a pair of bytes instead of with a single byte makes it easier to craft a delimiter that’s unlikely to appear in our data.
Cheshire and Baker do the opposite, and use a frequent byte (0) to eliminate the space overhead in the common case. We care a lot more about encoding and decoding speed, so an unlikely delimiter makes more sense for us. We picked 0xfe 0xfd because that sequence doesn’t appear in small integers (unsigned, two’s complement, varint, single or double float) regardless of endianness, nor in valid UTF-8 strings.
Any positive integer with 0xfe 0xfd (254 253) in its bytes must be around \(2^{16}\) or more. If the integer is instead negative in little-endian two’s complement, 0xfe 0xfd equals -514 as a little-endian int16_t, and -259 in big endian (not as great, but not nothing). Of course, the sequence could appear in two adjacent uint8_ts, but otherwise, 0xfe or 0xfd can only appear in the most significant byte of large 32- or 64-bit integers (unlike 0xff, which could be sign extension for, e.g., -1).
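The two signed values are easy to double check. The helper below is only an illustration (`int16_from_bytes` is a made-up name): it reassembles two bytes in either byte order and sign-extends the result by hand.

```c
#include <assert.h>
#include <stdint.h>

/* Interprets two consecutive bytes as a two's complement int16_t. */
int16_t int16_from_bytes(uint8_t first, uint8_t second, int little_endian)
{
    uint32_t bits = little_endian
        ? ((uint32_t)second << 8) | first
        : ((uint32_t)first << 8) | second;
    /* Sign-extend by hand instead of relying on a narrowing cast. */
    int32_t value = (bits & 0x8000u) ? (int32_t)bits - 65536 : (int32_t)bits;

    assert(value >= INT16_MIN && value <= INT16_MAX);
    return (int16_t)value;
}
```

Calling `int16_from_bytes(0xfe, 0xfd, 1)` reads the bytes as 0xfdfe, i.e., 65022 - 65536 = -514, and `int16_from_bytes(0xfe, 0xfd, 0)` reads 0xfefd, i.e., 65277 - 65536 = -259.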
Any (U)LEB varint that includes 0xfe 0xfd must span at least 3 bytes (i.e., 15 bits), since both these bytes have the most significant bit set to 1. Even a negative SLEB has to be at least as negative as \(-2^{14} = -16384\).
For floating point types, we can observe that 0xfe 0xfd in the significand would represent an awful fraction in little or big endian, so can only happen for the IEEE-754 representation of large integers (approximately \(\pm 2^{15}\)). If we instead assume that 0xfd or 0xfe appear in the sign and exponent fields, we find either very positive or very negative exponents (the exponent is biased, instead of complemented). A semi-exhaustive search confirms that the smallest integer-valued single float that includes the sequence is 32511.0 in little endian and 130554.0 in big endian; among integer-valued double floats, we find 122852.0 and 126928.0 respectively.
Finally, the sequence isn’t valid UTF-8 because both 0xfe and 0xfd have their top bit set (indicating a multi-byte code point), but neither looks like a continuation byte: the two most significant bits are 0b11 in both cases, while UTF-8 continuations must have 0b10.
Consistent overhead byte stuffing rewrites reserved 0 bytes away by counting the number of bytes from the beginning of a record until the next 0, and storing that count in a block size header followed by the non-reserved bytes, then resetting the counter, and doing the same thing for the remainder of the record. A complete record is stored as a sequence of encoded blocks, none of which include the reserved byte 0. Each block header spans exactly one byte, and must never itself be 0, so the byte count is capped at 254, and incremented by one (e.g., a header value of 1 represents a count of 0); when the count in the header is equal to the maximum, the decoder knows that the encoder stopped short without finding a 0.
With our two-byte reserved sequence, we can encode the size of each block in radix 253 (0xfd); given a two-byte header for each block, sizes can go up to \(253^2 - 1 = 64008\). That’s a reasonable granularity for memcpy. This radix conversion replaces the off-by-one weirdness in COBS: that part of the original algorithm merely encodes values from \([0, 254]\) into one byte while avoiding the reserved byte 0.
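The radix conversion itself is tiny. Here is a sketch, assuming the little-endian digit order used for the size headers in this post; `encode_block_size` and `decode_block_size` are illustrative names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RADIX 253u
#define MAX_BLOCK_SIZE (RADIX * RADIX - 1) /* 64008 */

/* Writes `size` as two little-endian radix-253 digits in [0, 252]. */
void encode_block_size(uint8_t header[2], size_t size)
{
    assert(size <= MAX_BLOCK_SIZE);
    header[0] = (uint8_t)(size % RADIX);
    header[1] = (uint8_t)(size / RADIX);
}

/* Returns the block size, or (size_t)-1 if a digit is out of range. */
size_t decode_block_size(const uint8_t header[2])
{
    if (header[0] >= RADIX || header[1] >= RADIX)
        return (size_t)-1;
    return header[0] + RADIX * (size_t)header[1];
}
```

Both digits stay in \([0, 252]\), so neither 0xfd nor 0xfe can show up in a header: the largest size, 64008, encodes as 0xfc 0xfc, and a size of 300 encodes as 47 then 1, since \(300 = 47 + 253 \cdot 1\).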
A two-byte size prefix is a bit ridiculous for small records (ours tend to be on the order of 30-50 bytes). We thus encode the first block specially, with a single byte in \([0, 252]\) for the size prefix. Since the reserved sequence 0xfe 0xfd is unlikely to appear in our data, the encoding for short records often boils down to adding a uint8_t length prefix.
A framed encoded record now looks like
|0xfe|0xfd|blen|blen literal bytes...|blen_1|blen_2|literal bytes...|0xfe|0xfd|
The first blen is in \([0, 252]\) and tells us how many literal bytes follow in the initial block. If the initial \(\mathtt{blen} = 252\), the literal bytes are immediately followed by the next block’s decoded contents. Otherwise, we must first append an implicit 0xfe 0xfd sequence… which may be the artificial reserved sequence that marks the end of every record.
Every subsequent block comes with a two-byte size prefix, in little-endian radix 253. In other words, blen_1 blen_2 represents the block size \(\mathtt{blen}\sb{1} + 253 \cdot \mathtt{blen}\sb{2}\), where \(\mathtt{blen}\sb{\{1, 2\}} \in [0, 252]\). Again, if the block size is the maximum encodable size, \(253^2 - 1 = 64008\), we have literal data followed by the next block; otherwise, we must append a 0xfe 0xfd sequence to the output before moving on to the next block.
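These rules are enough to write a decoder. The following is a minimal sketch, not the code in our repository: `stuffed_decode` is an illustrative name, the caller is assumed to have already split the stream on 0xfe 0xfd, and the function treats a record that does not end on a short block as malformed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { RADIX = 253, FIRST_MAX = 252, BLOCK_MAX = 253 * 253 - 1 };

/*
 * Decodes one stuffed record of `len` bytes (framing delimiters already
 * removed) into `dst`, which needs room for `len` bytes. Returns the
 * decoded size, or (size_t)-1 if the record is malformed.
 */
size_t stuffed_decode(uint8_t *dst, const uint8_t *src, size_t len)
{
    size_t in = 0, out = 0;
    int first = 1;

    assert(dst != NULL && src != NULL);
    for (;;) {
        size_t max = first ? FIRST_MAX : BLOCK_MAX;
        size_t header_size = first ? 1 : 2;
        size_t blen;

        if (len - in < header_size)
            return (size_t)-1; /* missing size header */
        if (src[in] >= RADIX || (!first && src[in + 1] >= RADIX))
            return (size_t)-1; /* not a radix-253 digit */
        blen = src[in];
        if (!first)
            blen += RADIX * (size_t)src[in + 1];
        in += header_size;

        if (blen > len - in)
            return (size_t)-1; /* truncated block */
        memcpy(dst + out, src + in, blen);
        in += blen;
        out += blen;
        first = 0;

        if (blen == max)
            continue;   /* full block: no implicit sequence */
        if (in == len)
            return out; /* the implicit sequence ends the record */
        dst[out++] = 0xfe;
        dst[out++] = 0xfd;
    }
}
```

For a short record without the reserved sequence, e.g., `02 'h' 'i'`, decoding only strips the one-byte length prefix. The encoded bytes `01 'a' 01 00 'b'` decode to `'a' 0xfe 0xfd 'b'`: the first block is short, so the decoder re-inserts the reserved sequence before the second block.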
The encoding algorithm is only a bit more complex than for the original COBS scheme.
Assume the data to encode is suffixed with an artificial two-byte reserved sequence 0xfe 0xfd.
For the first block, look for the reserved sequence in the first 252 bytes. If we find it, emit its position (must be less than 252) in one byte, then all the data bytes up to but not including the reserved sequence, and enter regular encoding after the reserved sequence. If the sequence isn’t in the first block, emit 252, followed by 252 bytes of data, and enter regular encoding after those bytes.
For regular (all but the first) blocks, look for the reserved sequence in the next 64008 bytes. If we find it, emit the sequence’s byte offset (must be less than 64008) in little-endian radix 253, followed by the data up to but not including the reserved sequence, and skip that sequence before encoding the rest of the data. If we don’t find the reserved sequence, emit 64008 in radix 253 (0xfc 0xfc), copy the next 64008 bytes of data, and encode the rest of the data without skipping anything.
Remember that we conceptually padded the data with a reserved sequence at the end. This means we’ll always observe that we fully consumed the input data at a block boundary. When we encode the block that stops at the artificial reserved sequence, we stop (and frame with a reserved sequence to delimit a record boundary).
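The steps above can be sketched as follows. This is an illustration under my own assumptions rather than the code in our repository: `stuffed_encode` and `find_reserved` are made-up names, the artificial trailing sequence is handled implicitly instead of being appended to the input, and the framing delimiter is left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { RADIX = 253, FIRST_MAX = 252, BLOCK_MAX = 253 * 253 - 1 };

/* Returns the offset of the first 0xfe 0xfd in src[0, len), or `len`. */
static size_t find_reserved(const uint8_t *src, size_t len)
{
    for (size_t i = 0; i + 1 < len; i++) {
        if (src[i] == 0xfe && src[i + 1] == 0xfd)
            return i;
    }
    return len;
}

/*
 * Encodes `len` bytes from `src` into `dst` and returns the encoded
 * size. `dst` needs room for len + 3 + 2 * (len / BLOCK_MAX) bytes.
 */
size_t stuffed_encode(uint8_t *dst, const uint8_t *src, size_t len)
{
    size_t in = 0, out = 0;
    int first = 1;

    assert(dst != NULL && src != NULL);
    for (;;) {
        size_t max = first ? FIRST_MAX : BLOCK_MAX;
        size_t remaining = len - in;
        /* A sequence at offset < max lives in the next max + 1 bytes. */
        size_t window = remaining < max + 1 ? remaining : max + 1;
        size_t found = find_reserved(src + in, window);
        int in_data = found < window; /* else: artificial or too far */
        size_t blen = found < max ? found : max;

        if (first) {
            dst[out++] = (uint8_t)blen;
        } else {
            dst[out++] = (uint8_t)(blen % RADIX);
            dst[out++] = (uint8_t)(blen / RADIX);
        }
        memcpy(dst + out, src + in, blen);
        out += blen;
        in += blen;
        first = 0;

        if (blen == max)
            continue;   /* full block: nothing to skip */
        if (!in_data)
            return out; /* stopped at the artificial sequence */
        in += 2;        /* skip the reserved sequence in the data */
    }
}
```

With this sketch, `'h' 'i'` encodes to `02 'h' 'i'`, and `'a' 0xfe 0xfd 'b'` encodes to `01 'a' 01 00 'b'`: the reserved sequence in the data is replaced by the two-byte header of the following block, so it costs no additional space.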
You can find our implementation in the stuffed-record-stream repository.
When writing short records, we already noted that the encoding step is often equivalent to adding a one-byte size prefix. In fact, we can encode and decode all records of size up to \(252 + 64008 = 64260\) bytes in place, and only ever have to slide the initial 252-byte block: whenever a block is shorter than the maximum length (252 bytes for the first block, 64008 for subsequent ones), that’s because we found a reserved sequence in the decoded data. When that happens, we can replace the reserved sequence with a size header when encoding, and undo the substitution when decoding.
Our code does not implement these optimisations because encoding and decoding stuffed bytes aren’t bottlenecks for our use case, but it’s good to know that we’re nowhere near the performance ceiling.
The stuffing scheme only provides resilient framing. That’s essential, but not enough for an abstract stream or sequence of records. At the very least, we need checksums in order to detect invalid records that happen to be correctly encoded (e.g., when a block’s literal data is overwritten).
Our pre-stuffed records start with the little-endian header
struct record_header {
uint32_t crc;
uint32_t generation;
};
where crc is the crc32c of the whole record, including the header,^{3} and generation is a yet-unused arbitrary 32-bit payload that we added for forward compatibility. There is no size field: the framing already handles that.
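To make the checksum convention concrete, here is a sketch that validates a decoded record. It follows the footnote's description (the crc field reads as UINT32_MAX while checksumming), but the function names, the bitwise CRC loop, and the exact way the checksum is chained over header and payload are my assumptions for illustration, not a copy of our code; a production implementation would use a table-driven or hardware-accelerated crc32c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct record_header {
    uint32_t crc;
    uint32_t generation;
};

/* Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82f63b78. */
uint32_t crc32c(uint32_t crc, const void *data, size_t len)
{
    const uint8_t *bytes = data;

    crc = ~crc;
    for (size_t i = 0; i < len; i++) {
        crc ^= bytes[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0x82f63b78u & -(crc & 1));
    }
    return ~crc;
}

/* Checksums header + payload with the crc field set to UINT32_MAX. */
uint32_t record_checksum(const uint8_t *record, size_t len)
{
    struct record_header header;
    uint32_t crc;

    assert(len >= sizeof(header));
    memcpy(&header, record, sizeof(header));
    header.crc = UINT32_MAX;
    crc = crc32c(0, &header, sizeof(header));
    return crc32c(crc, record + sizeof(header), len - sizeof(header));
}

/* Returns 1 if the little-endian crc field matches the contents. */
int record_is_valid(const uint8_t *record, size_t len)
{
    uint32_t stored;

    if (len < sizeof(struct record_header))
        return 0;
    stored = (uint32_t)record[0] | ((uint32_t)record[1] << 8)
        | ((uint32_t)record[2] << 16) | ((uint32_t)record[3] << 24);
    return stored == record_checksum(record, len);
}
```

A reader would call `record_is_valid` on each decoded record and drop the ones that fail, which is how correctly framed but corrupt records get rejected as a whole.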
The remaining bytes in a record are an arbitrary payload. We use protobuf messages to help with schema evolution (and keep messages small and flat for decoding performance), but there’s no special relationship between the stream of word-stuffed records and the payload’s format.
Our implementation lets writers output to buffered FILE streams, or directly to file descriptors. Buffered streams offer higher write throughput, but are only safe when the caller handles synchronisation and flushing; we use them as part of a commit protocol that fsyncs and publishes files with atomic rename syscalls.
During normal operations, we instead write to file descriptors opened with O_APPEND and a background fsync worker: in practice, the hardware and operating system are more stable than our software, so it’s more important that encoded records immediately make it to the kernel than all the way to persistent storage. We also avoid batching write syscalls because we would often have to wait several minutes if not hours to buffer more than two or three records.
For readers, we can either read from a buffer, or mmap in a file, and read from the resulting buffer. While we expose a linear iterator interface, we can also override the start and stop byte offset of an iterator; we use that capability to replay logs in parallel. Finally, when readers advance an iterator, they can choose to receive a raw data buffer, or have it decoded with a protobuf message descriptor.
We have happily been using this log format for more than nine months to store a log of metadata records that we replay every time the Backtrace server restarts.
Decoupling writes from the parallel read strategy let us improve our startup time incrementally, without any hard migration. Serialising with flexible schemas (protocol buffers) also made it easier to start small and slowly add optional metadata, and only enforce a hard switchover when we chose to delete backward compatibility code.
This piecemeal approach let us transition from a length-prefixed data format to one where all important metadata lives in a resilient record stream, without any breaking change. We slowly added more metadata to records and eventually parallelised loading from the metadata record stream, all while preserving backward and forward compatibility. Six months after the initial rollout, we flipped the switch and made the new, more robust, format mandatory; the old length-prefixed files still exist, but are now bags of arbitrary checksummed data bytes, with metadata in record streams.
In the past nine months, we’ve gained a respectable amount of pleasant operational experience with the format. Moreover, while performance is good enough for us (the parallel loading phase is currently dominated by disk I/O and parsing in protobuf-c), we also know there’s plenty of headroom: our records are short enough that they can usually be decoded without any write, and always in place.
We’re now starting to lay the groundwork to distribute our single-node embedded database and make it interact more fluently with other data stores. The first step will be generating a change data capture stream, and reusing the word-stuffed record format was an obvious choice.
Word stuffing is simple, efficient, and robust. If you can’t just defer to a real database (maybe you’re trying to write one yourself) for your log records, give it a shot! Feel free to play with our code if you don’t want to roll your own.
Thank you, Ruchir and Alex, for helping me clarify and restructure an earlier version.
1. If you append with the delimiter, it probably makes sense to special-case short writes and also prepend with the delimiter after failures, in order to make sure readers will observe a delimiter before the new record.
2. High-throughput writers should batch records. We do syscall-per-record because the write load for the current use case is so sporadic that any batching logic would usually end up writing individual records. For now, batching would introduce complexity and bug potential for a minimal impact on write throughput.
3. We overwrite the crc field with UINT32_MAX before computing a checksum for the header and its trailing data. It’s important to avoid zero prefixes because the result of crc-ing a 0 byte into a 0 state is… 0.
The core of UMASH is a hybrid PH/(E)NH block compression function. That function is fast (it needs one multiplication for each 16-byte “chunk” in a block), but relatively weak: despite a 128-bit output, the worst-case probability of collision is \(2^{-64}\).
For a fingerprinting application, we want collision probability less than \(\approx 2^{-70},\) so that’s already too weak, before we even consider merging a variable-length string of compressed block values.
The initial UMASH proposal compresses each block with two independent compression functions. Krovetz showed that we could do so while reusing most of the key material (random parameters), with a Toeplitz extension, and I simply recycled the proof for UMASH’s hybrid compressor.
That’s good for the memory footprint of the random parameters, but doesn’t help performance: we still have to do double the work to get double the hash bits.
Earlier this month, Jim Apple pointed me at a promising alternative that doubles the hash bits with only one more multiplication. The construction adds finite field operations that aren’t particularly efficient in software, on top of the additional 64x64 -> 128 (carryless) multiplication, so isn’t a slam dunk win over a straightforward Toeplitz extension. However, Jim felt like we could “spend” some of the bits we don’t need for fingerprinting (\(2^{-128}\) collision probability is overkill when we only need \(2^{-70}\)) in order to make do with faster operations.
Turns out he was right! We can use carryless multiplications by sparse constants (concretely, xor-shift and one more shift) without any reducing polynomial, on independent 64-bit halves… and still collide with probability at most \(2^{-98}\).
The proof is fairly simple, but relies on a bit of notation for clarity. Let’s start by restating UMASH’s hybrid PH/ENH block compressor in that notation.
The current block compressor in UMASH splits a 256-byte block \(m\) in 16 chunks \(m_i,\, i\in [0, 15]\) of 128 bits each, and processes all but the last chunk with a PH loop,
\[ \bigoplus_{i=0}^{14} \mathtt{PH}(k_i, m_i), \]
where
\[ \mathtt{PH}(k_i, m_i) = ((k_i \bmod 2^{64}) \oplus (m_i \bmod 2^{64})) \odot (\lfloor k_i / 2^{64} \rfloor \oplus \lfloor m_i / 2^{64} \rfloor) \]
and each \(k_i\) is a randomly generated 128-bit parameter.
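As a reference point, PH can be written in portable C with a bit-by-bit carryless multiplication. This is only a slow sketch to pin down the definition (`struct u128`, `clmul64`, and `ph` are illustrative names); real code would use a hardware carryless multiply instruction.

```c
#include <assert.h>
#include <stdint.h>

/* A 128-bit value as two 64-bit halves. */
struct u128 {
    uint64_t lo, hi;
};

/* Carryless (polynomial over GF(2)) 64x64 -> 128 multiplication. */
struct u128 clmul64(uint64_t a, uint64_t b)
{
    struct u128 r = { 0, 0 };

    for (unsigned i = 0; i < 64; i++) {
        if ((b >> i) & 1) {
            r.lo ^= a << i;
            if (i > 0)
                r.hi ^= a >> (64 - i);
        }
    }
    return r;
}

/* PH(k, m): xor each half of m with the key, then multiply the halves. */
struct u128 ph(struct u128 k, struct u128 m)
{
    return clmul64(k.lo ^ m.lo, k.hi ^ m.hi);
}
```

For instance, `clmul64(3, 3)` returns 5 in the low half: \((x + 1)^2 = x^2 + 1\) when coefficients are added with xor, whereas the integer product would be 9.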
The compression loop in UMASH handles the last chunk, along with a size tag (to protect against extension attacks), with ENH:
\[ \mathtt{ENH}(k, x, y) = ((k + x) \bmod 2^{64}) \cdot (\lfloor k / 2^{64}\rfloor + \lfloor x / 2^{64} \rfloor \bmod 2^{64}) + y \mod 2^{128}. \]
The core operation in ENH is a full (64x64 -> 128) integer multiplication, which has lower latency than PH’s carryless multiplication on x86-64. That’s why UMASH switches to ENH for the last chunk. We use ENH for only one chunk because combining multiple NH values calls for 128-bit additions, and that’s slower than PH’s xors. Once we have mixed the last chunk and the size tag with ENH, the result is simply xored in with the previous chunks’ PH values:
\[ \left(\bigoplus_{i=0}^{14} \mathtt{PH}(k_i, m_i)\right) \oplus \mathtt{ENH}(k_{15}, m_{15}, \mathit{tag}). \]
This function is annoying to analyse directly, because we end up having to manipulate different proofs of almost-universality. Let’s abstract things a bit, and reduce the ENH/PH to the bare minimum we need to find our collision bounds.
Let’s split our message blocks in \(n\) (\(n = 16\) for UMASH) “chunks”, and apply an independently sampled mixing function to each chunk. Let’s say we have two messages \(m\) and \(m^\prime\) with chunks \(m_i\) and \(m^\prime_i\), for \(i\in [0, n)\), and let \(h_i\) be the result of mixing chunk \(m_i,\) and \(h^\prime_i\) that of mixing \(m^\prime_i.\)
We’ll assume that the first chunk is mixed with a \(2^{-w}\)-almost-universal (\(2^{-64}\) for UMASH) hash function: if \(m_0 \neq m^\prime_0,\) \(\mathrm{P}[h_0 = h^\prime_0] \leq 2^{-w},\) (where the probability is taken over the set of randomly chosen parameters for the mixer). Otherwise, \(m_0 = m^\prime_0 \Rightarrow h_0 = h^\prime_0\).
This first chunk stands for the ENH iteration in UMASH.
Every remaining chunk will instead be mixed with a \(2^{-w}\)-XOR-almost-universal hash function: if \(m_i \neq m^\prime_i\) (\(0 < i < n\)), \(\mathrm{P}[h_i \oplus h^\prime_i = y] \leq 2^{-w}\) for any \(y,\) where the probability is taken over the randomly generated parameter for the mixer.
This stronger condition represents the PH iterations in UMASH.
We hash a full block by xoring all the mixed chunks together:
\[ H = \bigoplus_{i = 0}^{n - 1} h_i, \]
and
\[ H^\prime = \bigoplus_{i = 0}^{n - 1} h^\prime_i. \]
We want to bound the probability that \(H = H^\prime \Leftrightarrow H \oplus H^\prime = 0,\) assuming that the messages differ (i.e., there is at least one index \(i\) such that \(m_i \neq m^\prime_i\)).
If the two messages only differ in \(m_0 \neq m^\prime_0\) (and thus \(m_i = m^\prime_i,\,\forall i \in [1, n)\)),
\[ \bigoplus_{i = 1}^{n - 1} h_i = \bigoplus_{i = 1}^{n - 1} h^\prime_i, \]
and thus \(H = H^\prime \Leftrightarrow h_0 = h^\prime_0\).
By hypothesis, the 0th chunks are mixed with a \(2^{-w}\)-almost-universal hash, so this happens with probability at most \(2^{-w}\).
Otherwise, assume that \(m_j \neq m^\prime_j\), for some \(j \in [1, n)\). We will rearrange the expression
\[ H \oplus H^\prime = h_j \oplus h^\prime_j \oplus \left(\bigoplus_{i\in [0, n) \setminus \{ j \}} h_i \oplus h^\prime_i\right). \]
Let’s conservatively replace that unwieldy sum with an adversarially chosen value \(y\):
\[ H \oplus H^\prime = h_j \oplus h^\prime_j \oplus y, \]
and thus \(H = H^\prime\) iff \(h_j \oplus h^\prime_j = y.\) By hypothesis, the \(j\)th chunk (every chunk but the 0th), is mixed with a \(2^{-w}\)-almost-XOR-universal hash, and this thus happens with probability at most \(2^{-w}\).
In both cases, we find a collision probability at most \(2^{-w}\) with a simple analysis, despite combining mixing functions from different families over different rings.
We combined strong mixers (each is \(2^{-w}\)-almost-universal), and only got a \(2^{-w}\)-almost-universal output. It seems like we should be able to do better when two or more chunks differ.
As Nandi points out, we can apply erasure codes to derive additional chunks from the original messages’ contents. We only need one more chunk, so we can simply xor together all the original chunks:
\[m_n = \bigoplus_{i=0}^{n - 1} m_i,\]
and similarly for \(m^\prime_n\). If \(m\) and \(m^\prime\) differ in only one chunk, \(m_n \neq m^\prime_n\). It’s definitely possible for \(m_n = m^\prime_n\) when \(m \neq m^\prime\), but only if two or more chunks differ.
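A small C sketch makes the behaviour of this checksum chunk easy to check; `struct u128` and `lrc_chunk` are illustrative names, with each 128-bit chunk stored as two 64-bit halves.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A 128-bit chunk as two 64-bit halves. */
struct u128 {
    uint64_t lo, hi;
};

/* Derives the additional chunk m_n by xoring chunks m_0 ... m_{n-1}. */
struct u128 lrc_chunk(const struct u128 *chunks, size_t n)
{
    struct u128 acc = { 0, 0 };

    assert(chunks != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        acc.lo ^= chunks[i].lo;
        acc.hi ^= chunks[i].hi;
    }
    return acc;
}
```

Changing a single chunk always changes the result, since the xor of the other chunks is unchanged. Changing two chunks by the same xor difference, however, leaves it intact: the two-chunk messages with low halves (1, 2) and (0, 3) both yield 3.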
We will again mix \(m_n\) and \(m^\prime_n\) with a fresh \(2^{-w}\)-almost-XOR-universal hash function to yield \(h_n\) and \(h^\prime_n\).
We want to xor the result \(h_n\) and \(h^\prime_n\) with the second (still undefined) hash values \(H_2\) and \(H^\prime_2\); if \(m_n \neq m^\prime_n\), the final xored values are equal with probability at most \(2^{-w}\), regardless of \(H_2\) and \(H^\prime_2\ldots\) and, crucially, independently of whether \(H = H^\prime\).
When the two messages \(m\) and \(m^\prime\) only differ in a single (initial) chunk, mixing an LRC checksum gives us an independent hash function, which squares the collision probability to \(2^{-2w}\).
Now to the interesting bit: we must define a second hash function that combines \(h_0,h_1,\ldots, h_{n - 1}\) and \(h^\prime_0, h^\prime_1, \ldots, h^\prime_{n - 1}\) such that the resulting hash values \(H_2\) and \(H^\prime_2\) collide independently enough of \(H\) and \(H^\prime\). That’s a tall order, but we do have one additional assumption to work with: we only care about collisions in this second hash function if the additional checksum chunks are equal, which means that the two messages differ in two or more chunks (or they’re identical).
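For each index \(0 < i < n\), we’ll fix a public linear (with xor as the addition) function \(\overline{xs}_i(x)\). This family of functions must have two properties:
\(f(x) = x \oplus \overline{xs}_i(x)\) is invertible for all \(0 < i < n\).
For any two distinct indices \(0 < i < j < n\), \(g(x) = \overline{xs}_i(x) \oplus \overline{xs}_j(x)\) has a small null space.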
For regularity, we will also define \(\overline{xs}_0(x) = x\).
Concretely, let \(\overline{xs}_1(x) = x \mathtt{<<} 1\), where the bitshift is computed for the two 64-bit halves independently, and \(\overline{xs}_i(x) = (x \mathtt{<<} 1) \oplus (x \mathtt{<<} i)\) for \(i > 1\), again with all the bitshifts computed independently over the two 64-bit halves.
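The proof further down relies on two facts about these functions: \(x \oplus \overline{xs}_j(x)\) is invertible, and \(\overline{xs}_i(x) \oplus \overline{xs}_j(x)\) has a small null space. Here is a hedged numerical check on a single 64-bit half, assuming \(n = 16\); the helper names (`xs_half`, `rank_gf2`, `rank_of`) are made up for this sketch. It computes the rank over \(\mathrm{GF}(2)\) of each linear map from its action on the 64 unit vectors.

```python
MASK64 = (1 << 64) - 1

def xs_half(i, x):
    """xs_i restricted to one 64-bit half (shifts truncate at 64 bits)."""
    if i == 0:
        return x
    if i == 1:
        return (x << 1) & MASK64
    return ((x << 1) ^ (x << i)) & MASK64

def rank_gf2(vectors):
    """Rank over GF(2) of integers viewed as bit vectors."""
    basis = []
    for v in vectors:
        for b in basis:
            v = min(v, v ^ b)  # clears b's leading bit from v when it is set
        if v:
            basis.append(v)
    return len(basis)

def rank_of(fn):
    """Rank of a linear map on 64-bit values, from its images of unit vectors."""
    return rank_gf2([fn(1 << bit) for bit in range(64)])

N = 16  # assumed number of chunks per block
# f(x) = x ^ xs_j(x) should have full rank (i.e., be invertible).
f_ranks = [rank_of(lambda x, j=j: x ^ xs_half(j, x)) for j in range(1, N)]
# g(x) = xs_i(x) ^ xs_j(x): record the null space dimension per 64-bit half.
g_nullities = {
    (i, j): 64 - rank_of(lambda x, i=i, j=j: xs_half(i, x) ^ xs_half(j, x))
    for i in range(1, N)
    for j in range(i + 1, N)
}

assert all(rank == 64 for rank in f_ranks)
assert all(nullity <= j for (i, j), nullity in g_nullities.items())
```

A null space of dimension at most \(j\) in each half gives at most \(2^{2j}\) values over the full 128 bits.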
To see that these satisfy our requirements, we can represent the functions as carryless multiplication by distinct “even” constants (the least significant bit is 0) on each 64-bit half:
To recapitulate, we defined the first hash function as
\[ H = \bigoplus_{i = 0}^{n - 1} h_i, \]
the (xor) sum of the mixed value \(h_i\) for each chunk \(m_i\) in the message block \(m\), and similarly for \(H^\prime\) and \(h^\prime_i\).
We’ll let the second hash function be
\[ H_2 \oplus h_n = \left(\bigoplus_{i = 0}^{n - 1} \overline{xs}_i(h_i)\right) \oplus h_n, \]
and
\[ H^\prime_2 \oplus h^\prime_n = \left(\bigoplus_{i = 0}^{n - 1} \overline{xs}_i(h^\prime_i)\right) \oplus h^\prime_n. \]
We can finally get down to business and find some collision bounds. We’ve already shown that both \(H = H^\prime\) and \(H_2 \oplus h_n = H^\prime_2 \oplus h^\prime_n\) collide simultaneously with probability at most \(2^{-2w}\) when the checksum chunks differ, i.e., when \(m_n \neq m^\prime_n\).
Let’s now focus on the case when \(m \neq m^\prime\), but \(m_n = m^\prime_n\). In that case, we know that at least two chunks \(0 \leq i < j < n\) differ: \(m_i \neq m^\prime_i\) and \(m_j \neq m^\prime_j\).
If only two chunks \(i\) and \(j\) differ, and one of them is the \(i = 0\)th chunk, we want to bound the probability that
\[ h_0 \oplus h_j = h^\prime_0 \oplus h^\prime_j \]
and
\[ h_0 \oplus \overline{xs}_j(h_j) = h^\prime_0 \oplus \overline{xs}_j(h^\prime_j), \]
both at the same time.
Letting \(\Delta_i = h_i \oplus h^\prime_i\), we can reformulate the two conditions as
\[ \Delta_0 = \Delta_j \] and \[ \Delta_0 = \overline{xs}_j(\Delta_j). \]
Taking the xor of the two conditions yields
\[ \Delta_j \oplus \overline{xs}_j(\Delta_j) = 0, \]
which is only satisfied for \(\Delta_j = 0\), since \(f(x) = x \oplus \overline{xs}_j(x)\) is an invertible linear function. This also forces \(\Delta_0 = 0\).
By hypothesis, \(\mathrm{P}[\Delta_j = 0] \leq 2^{-w}\), and \(\mathrm{P}[\Delta_0 = 0] \leq 2^{-w}\) as well. These two probabilities are independent, so the probability that both hashes collide is at most \(2^{-2w}\) (\(2^{-128}\)).
In the other case, we have messages that differ in at least two chunks \(0 < i < j < n\): \(m_i \neq m^\prime_i\) and \(m_j \neq m^\prime_j\).
We can simplify the collision conditions to
\[ h_i \oplus h_j = h^\prime_i \oplus h^\prime_j \oplus y \]
and
\[ \overline{xs}_i(h_i) \oplus \overline{xs}_j(h_j) = \overline{xs}_i(h^\prime_i) \oplus \overline{xs}_j(h^\prime_j) \oplus z, \]
for \(y\) and \(z\) generated arbitrarily (adversarially), but without knowledge of the parameters that generated \(h_i, h_j, h^\prime_i, h^\prime_j\).
Again, let \(\Delta_i = h_i \oplus h^\prime_i\) and \(\Delta_j = h_j \oplus h^\prime_j\), and reformulate the conditions into
\[ \Delta_i \oplus \Delta_j = y \] and \[ \overline{xs}_i(\Delta_i) \oplus \overline{xs}_j(\Delta_j) = z. \]
Let’s apply the linear function \(\overline{xs}_i\) to the first condition
\[ \overline{xs}_i(\Delta_i) \oplus \overline{xs}_i(\Delta_j) = \overline{xs}_i(y); \]
since \(\overline{xs}_i\) isn’t invertible, the result isn’t equivalent, but is a weaker (necessary, not sufficient) version of the initial condition.
After xoring that with the second condition
\[ \overline{xs}_i(\Delta_i) \oplus \overline{xs}_j(\Delta_j) = z, \]
we find
\[ \overline{xs}_i(\Delta_j) \oplus \overline{xs}_j(\Delta_j) = \overline{xs}_i(y) \oplus z. \]
By hypothesis, the null space of \(g(x) = \overline{xs}_i(x) \oplus \overline{xs}_j(x)\) is “small.” For our concrete definition of \(\overline{xs}\), there are \(2^{2j}\) values in that null space, which means that \(\Delta_j\) can only satisfy the combined xored condition by taking one of at most \(2^{2j}\) values; otherwise, the two hashes definitely can’t both collide.
Since \(j < n\), this happens with probability at most \(2^{2(n - 1) - w} \leq 2^{-34}\) for UMASH with \(w = 64\) and \(n = 16\).
Finally, for any given \(\Delta_j\), there is at most one \(\Delta_i\) that satisfies
\[ \Delta_i \oplus \Delta_j = y,\]
and so both hashes collide with probability at most \(2^{-98}\), for \(w = 64\) and \(n = 16\).
Astute readers will notice that we could let \(\overline{xs}_i(x) = x \mathtt{<<} i\), and find the same combined collision probability. However, this results in a much weaker secondary hash, since a chunk could lose up to \(2n - 2\) bits (\(n - 1\) in each 64-bit half) of hash information to a plain shift. The shifted xorshifts might be a bit slower to compute, but guarantee that we only lose at most 2 bits^{1} of information per chunk. This feels like an interface that’s harder to misuse.
If one were to change the \(\overline{xs}_i\) family of functions, I think it would make more sense to look at a more diverse form of (still sparse) multipliers, which would likely let us preserve a couple more bits of independence. Jim has constructed such a family of multipliers, in arithmetic modulo \(2^{64}\); I’m sure we could find something similar in carryless multiplication. The hard part is implementing these multipliers: in order to exploit the multipliers’ sparsity, we’d probably have to fully unroll the block hashing loop, and that’s not something I like to force on implementations.
The base UMASH block compressor mixes all but the last of the message block’s 16-byte chunks with PH: xor the chunk with the corresponding bytes in the parameter array, then compute a carryless multiplication of one half of the xored chunk with the other half. The last chunk goes through a variant of ENH with an invertible finaliser (safe because we only rely on \(\varepsilon\)-almost-universality), and everything is xored in the accumulator.
The collision proofs above preserved the same structure for the first hash.
The second hash reuses so much work from the first that it mostly makes sense to consider a combined loop that computes both (regular UMASH and this new xorshifted variant) block compression functions at the same time.
The first change for this combined loop is that we need to xor together all 16-byte chunks in the message, and mix the resulting checksum with a fresh PH function. That’s equivalent to xoring everything in a new accumulator (or two accumulators when working with 256-bit vectors) initialised with the PH parameters, and CLMULing together the accumulator’s two 64-bit halves at the end.
We also have to apply the \(\overline{xs}_i\) quasi-xorshift functions to each \(h_i\). The trick is to accumulate the shifted values in two variables: one is the regular UMASH accumulator without \(h_0\) (i.e., \(h_1 \oplus h_2 \ldots\)), and the other shifts the current accumulator before xoring in a new value, i.e., \(\mathtt{acc}^\prime = (\mathtt{acc} \mathtt{<<} 1) \oplus h_i\), where the left shift on parallel 64-bit halves simply adds acc to itself.
This additional shifted accumulator includes another special case to skip \(\overline{xs}_1(x) = x \mathtt{<<} 1\); that’s not a big deal for the code, since we already have to special case the last iteration for the ENH mixer.
Armed with \(\mathtt{UMASH} = \bigoplus_{i=1}^{n - 1} h_i\) and \(\mathtt{acc} = \bigoplus_{i=2}^{n - 1} h_i \mathtt{<<} (i - 1),\) we have \[\bigoplus_{i=1}^{n - 1} \overline{xs}_i(h_i) = (\mathtt{UMASH} \oplus \mathtt{acc}) \mathtt{<<} 1.\]
We just have to xor in the PH-mixed checksum \(h_n\), and finally \(h_0\) (which naturally goes in GPRs, so can be computed while we extract values out of vector registers).
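The two-accumulator identity can be checked numerically. The sketch below models 128-bit values as Python integers and compares the definition of the shifted sum against the accumulator formulation. The function names are illustrative, and the iteration order (last chunk first, with one final shift of the accumulator) is an assumption chosen so that \(h_i\) ends up shifted by \(i - 1\); it is not necessarily the order used in the actual implementation.

```python
import random

MASK64 = (1 << 64) - 1

def shl(x, s):
    """Shift the two 64-bit halves of a 128-bit value left by s, independently."""
    lo = ((x & MASK64) << s) & MASK64
    hi = ((x >> 64) << s) & MASK64
    return (hi << 64) | lo

def xs(i, x):
    """The quasi-xorshift xs_i from the text, for i >= 1."""
    return shl(x, 1) if i == 1 else shl(x, 1) ^ shl(x, i)

def shifted_sum_direct(h):
    """Xor of xs_i(h[i]) for i in 1..n-1, straight from the definition."""
    out = 0
    for i in range(1, len(h)):
        out ^= xs(i, h[i])
    return out

def shifted_sum_two_accumulators(h):
    """Same value, using one plain and one shifted accumulator."""
    umash = 0  # h_1 ^ h_2 ^ ... ^ h_{n-1}
    acc = 0    # becomes the xor of h_i << (i - 1) for i >= 2
    for i in range(len(h) - 1, 0, -1):  # assumed order: last chunk first
        umash ^= h[i]
        if i >= 2:  # special case: h_1 skips the shifted accumulator
            acc = shl(acc, 1) ^ h[i]
    acc = shl(acc, 1)  # h_2 was xored in unshifted; give it its one shift
    return shl(umash ^ acc, 1)

rng = random.Random(42)
h = [rng.getrandbits(128) for _ in range(16)]
assert shifted_sum_direct(h) == shifted_sum_two_accumulators(h)
```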
We added two vector xors and one addition for each chunk in a block, and, at the end, one CLMUL plus a couple more xors and adds again.
This should most definitely be faster than computing two UMASH at the same time, which incurred two vector xors and a CLMUL (or full integer multiplication) for each chunk: even when CLMUL can pipeline one instruction per cycle, vector additions can dispatch to more execution units, so the combined throughput is still higher.
It’s easy to show that UMASH is relatively safe when one block is shorter than the other, and we simply xor together fewer mixed chunks. Without loss of generality, we can assume the longer block has \(n\) chunks; that block’s final ENH is independent of the shorter block’s UMASH, and any specific value occurs with probability at most \(2^{-63}\) (the probability of a multiplication by zero).
A similar argument seems more complex to defend for the shifted UMASH.
Luckily, we can tweak the LRC checksum we use to generate an additional chunk in the block: rather than xoring together the raw message chunks, we’ll xor them after xoring them with the PH key, i.e.,
\[m_n = \bigoplus_{i=0}^{n - 1} m_i \oplus k_i, \]
where \(k_i\) are the PH parameters for each chunk.
When checksumming blocks of the same size, this is a no-op with respect to collision probabilities. Implementations might however benefit from the ability to use a fused xor with load from memory^{2} to compute \(m_i \oplus k_i\), and feed that both into the checksum and into CLMUL for PH.
Unless we’re extremely unlucky (\(m_{n - 1} = k_{n - 1}\), with probability \(2^{-2w}\)), the long block’s LRC will differ from the shorter block’s. As long as we always xor in the same PH parameters when mixing the artificial LRC, the secondary hashes collide with probability at most \(2^{-64}\).
With a small tweak to the checksum function, we can easily guarantee that blocks with a different number of chunks collide with probability less than \(2^{-126}\).^{3}
Thank you Joonas for helping me rubber duck the presentation, and Jim for pointing me in the right direction, and for the fruitful discussion!
1. It’s even better for UMASH, since we obtained these shifted chunks by mixing with PH. The result of PH is the carryless product of two 64-bit values, so the most significant bit is always 0. The shifted-xorshift doesn’t erase any information in the high 64-bit half!
2. This might also come with a small latency hit, which is unfortunate since PHing \(m_n\) is likely to be on the critical path… but one cycle doesn’t seem that bad.
3. The algorithm to expand any input message to a sequence of full 16-byte chunks is fixed. That’s why we incorporate a size tag in ENH; that makes it impossible for two messages of different lengths to collide when they are otherwise identical after expansion.
We accidentally a whole hash function… but we had a good reason! Our MIT-licensed UMASH hash function is a decently fast non-cryptographic hash function that guarantees a worst-case bound on the probability of collision between any two inputs generated independently of the UMASH parameters.
On the 2.5 GHz Intel 8175M servers that power Backtrace’s hosted offering, UMASH computes a 64-bit hash for short cached inputs of up to 64 bytes in 9-22 ns, and for longer ones at up to 22 GB/s, while guaranteeing that two distinct inputs of at most \(s\) bytes collide with probability less than \(\lceil s / 2048 \rceil \cdot 2^{-56}\). If that’s not good enough, we can also reuse most of the parameters to compute two independent UMASH values. The resulting 128-bit fingerprint function offers a short-input latency of 9-26 ns, a peak throughput of 11.2 GB/s, and a collision probability of \(\lceil s / 2048 \rceil^2 \cdot 2^{-112}\) (better than \(2^{-70}\) for input size up to 7.5 GB). These collision bounds hold for all inputs constructed without any feedback about the randomly chosen UMASH parameters.
The latency on short cached inputs (9-22 ns for 64 bits, 9-26 ns for 128) is somewhat worse than the state of the art for non-cryptographic hashes (wyhash achieves 8-15 ns and xxh3 8-12 ns), but still in the same ballpark. It also compares well with latency-optimised hash functions like FNV-1a (5-86 ns) and MurmurHash64A (7-23 ns).
Similarly, UMASH’s peak throughput (22 GB/s) does not match the current best hash throughput (37 GB/s with xxh3 and falkhash, apparently 10% higher with Meow hash), but does come within a factor of two; it’s actually higher than that of some performance-optimised hashes, like wyhash (16 GB/s) and farmhash32 (19 GB/s). In fact, even the 128-bit fingerprint (11.2 GB/s) is comparable to respectable options like MurmurHash64A (5.8 GB/s) and SpookyHash (11.6 GB/s).
What sets UMASH apart from these other non-cryptographic hash functions is its proof of a collision probability bound. In the absence of an adversary that adaptively constructs pathological inputs as it infers more information about the randomly chosen parameters, we know that two distinct inputs of \(s\) or fewer bytes will have the same 64-bit hash with probability at most \(\lceil s / 2048 \rceil \cdot 2^{-56},\) where the expectation is taken over the random “key” parameters.
Only one non-cryptographic hash function in Reini Urban’s fork of SMHasher provides this sort of bound: CLHash guarantees a collision probability \(\approx 2^{-63}\) in the same universal hashing model as UMASH. While CLHash’s peak throughput (22 GB/s) is equal to UMASH’s, its latency on short inputs is worse (23-25 ns instead of 9-22 ns). We will also see that its stronger collision bound remains too weak for many practical applications. In order to compute a fingerprint with CLHash, one would have to combine multiple hashes, exactly like we did for the 128-bit UMASH fingerprint.
Actual cryptographic hash functions provide stronger bounds in a much more pessimistic model; however they’re also markedly slower than non-cryptographic hashes. BLAKE3 needs at least 66 ns to hash short inputs, and achieves a peak throughput of 5.5 GB/s. Even the reduced-round SipHash-1-3 hashes short inputs in 18-40 ns and longer ones at a peak throughput of 2.8 GB/s. That’s the price of their pessimistically adversarial security model. Depending on the application, it can make sense to consider a more restricted adversary that must prepare its dirty deed before the hash function’s parameters are generated at random, and still ask for provable bounds on the probability of collisions. That’s the niche we’re targeting with UMASH.
Clearly, the industry is comfortable with no bound at all. However, even in the absence of seed-independent collisions, timing side-channels in a data structure implementation could theoretically leak information about colliding inputs, and iterating over a hash table’s entries to print its contents can divulge even more bits. A sufficiently motivated adversary could use something like that to learn more about the key and deploy an algorithmic denial of service attack. For example, the linear structure of UMASH (and of other polynomial hashes like CLHash) makes it easy to combine known collisions to create exponentially more colliding inputs. There is no universal answer; UMASH is simply another point in the solution space.
If reasonable performance coupled with an actual bound on collision probability for data that does not adaptively break the hash sounds useful to you, take a look at UMASH on GitHub!
The next section will explain why we found it useful to design another hash function. The rest of the post sketches how UMASH works and how it balances short-input latency and strength, before describing a few interesting usage patterns.
The latency and throughput results above were all measured on the same unloaded 2.5 GHz Xeon 8175M. While we did not disable frequency scaling (#cloud), the clock rate seemed stable at 3.1 GHz during our run.
Engineering is the discipline of satisficisation: crisply defined problems with perfect solutions rarely exist in reality, so we must resign ourselves to satisfying approximate constraint sets “well enough.” However, there are times when all options are not only imperfect, but downright sucky. That’s when one has to put on a different hat, and question the problem itself: are our constraints irremediably at odds, or are we looking at an underexplored solution space?
In the former case, we simply have to want something else. In the latter, it might make sense to spend time to really understand the current set of options and handroll a specialised approach.
That’s the choice we faced when we started caching intermediate results in Backtrace’s database and found a dearth of acceptable hash functions. Our in-memory columnar database is a core component of the backend, and, like most analytics databases, it tends to process streams of similar queries. However, a naïve query cache would be ineffective: our more heavily loaded servers handle a constant write load of more than 100 events per second with dozens of indexed attributes (populated column values) each. Moreover, queries invariably select a large number of data points with a time windowing predicate that excludes old data… and the endpoints of these time windows advance with each wall-clock second. The queries evolve over time, and must usually consider newly ingested data points.
Bhatotia et al’s Slider shows how we can specialise the idea of self-adjusting or incremental computation for repeated MapReduce-style queries over a sliding window. The key idea is to split the data set at stable boundaries (e.g., on date change boundaries rather than 24 hours from the beginning of the current time window) in order to expose memoisation opportunities, and to do so recursively to repair around point mutations to older data.
Caching fully aggregated partial results works well for static queries, like scheduled reports… but the first step towards creating a great report is interactive data exploration, and that’s an activity we strive to support well, even when drilling down tens of millions of rich data points. That’s why we want to also cache intermediate results, in order to improve response times when tweaking a saved report, or when crafting ad hoc queries to better understand how and when an application fails.
We must go back to a more general incremental computation strategy: rather than only splitting up inputs, we want to stably partition the data dependency graph of each query, in order to identify shared subcomponents whose results can be reused. This finer grained strategy surfaces opportunities to “resynchronise” computations, to recognize when different expressions end up generating a subset of identical results, enabling reuse in later steps. For example, when someone updates a query by adding a selection predicate that only rejects a small fraction of the data, we can expect to reuse some of the postselection work executed for earlier incarnations of the query, if we remember to key on the selected data points rather than the predicates.
The complication here is that these intermediate results tend to be large. Useful analytical queries start small (a reasonable query coupled with cache/transaction invalidation metadata to stand in for the full data set), grow larger as we select data points, arrange them in groups, and materialise their attributes, and shrink again at the end, as we summarise data and throw out less interesting groups.
When caching the latter shrinking steps, where resynchronised reuse opportunities abound and can save a lot of CPU time, we often find that storing a fully materialised representation of the cache key would take up more space than the cached result.
A classic approach in this situation is to fingerprint cache keys with a cryptographic hash function like BLAKE or SHA3, and store a compact (128 or 256 bits) fingerprint instead of the cache key: the probability of a collision is then so low that we might as well assume any false positive will have been caused by a bug in the code or a hardware failure. For example, a study of memory errors at Facebook found that uncorrectable memory errors affect 0.03% of servers each month. Assuming a generous clock rate of 5 GHz, this means each clock cycle may be afflicted by such a memory error with probability \(\approx 2.2\cdot 10^{-20} > 2^{-66}.\) If we can guarantee that distinct inputs collide with probability significantly less than \(2^{-66}\), e.g., \(< 2^{-70},\) any collision is far more likely to have been caused by a bug in our code or by hardware failure than by the fingerprinting algorithm itself.
Using cryptographic hashes is certainly safe enough, but requires a lot of CPU time, and, more importantly, worsens latency on smaller keys (for which caching may not be that beneficial, such that our goal should be to minimise overhead). It’s not that state-of-the-art cryptographic hash functions are wasteful, but that they defend against attacks like key recovery or collision amplification that we may not care to consider in our design.
At the other extreme of the hash spectrum, there is a plethora of fast hash functions with no proof of collision probability. However, most of them are keyed on just a 64-bit “seed” integer, and that’s already enough for a pigeonhole argument to show we can construct sets of strings of length \(64m\) bits where any two members collide with probability at least \(m/ 2^{64}\). In practice, security researchers seem to find key-independent collisions wherever they look (i.e., the collision probability is on the order of 1 for some particularly pathological sets of inputs), so it’s safe to assume that lacking a proof of collision probability implies a horrible worst case. I personally wouldn’t put too much faith in “security claims” taking the form of failed attempts at breaking a proposal.
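The per-cycle estimate is a one-line calculation. The sketch below reproduces it, assuming a 31-day month (the month length is my assumption; it is the choice that lands closest to the quoted \(2.2\cdot 10^{-20}\)).

```python
SECONDS_PER_MONTH = 31 * 24 * 3600   # assumption: a 31-day month
CLOCK_HZ = 5e9                       # the "generous" 5 GHz clock rate
MONTHLY_ERROR_RATE = 0.03 / 100      # 0.03% of servers each month

cycles_per_month = CLOCK_HZ * SECONDS_PER_MONTH
p_error_per_cycle = MONTHLY_ERROR_RATE / cycles_per_month

print(f"{p_error_per_cycle:.2e}")  # per-cycle probability of a memory error
print(f"{2.0 ** -66:.2e}")         # the power of two just below it
print(f"{2.0 ** -70:.2e}")         # the target collision probability
```

The target of \(2^{-70}\) sits a factor of 16 below \(2^{-66}\), which is the margin the text describes as "significantly less".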
Lemire and Kaser’s CLHash is one of the few exceptions we found: it achieves a high throughput of 22 GB/s and comes with a proof of \(2^{-63}\)-almost-universality. However, its finalisation step is slow (23 ns for one-byte inputs), due to a Barrett reduction followed by three rounds of xorshift/multiply mixing. Dai and Krovetz’s VHASH, which inspired CLHash, offers similar guarantees, with worse performance.
Unfortunately, \(2^{-63}\) is also not quite good enough for our purposes: we estimate that the probability of uncorrectable memory errors is on the order of \(2^{-66}\) per clock cycle, so we want the collision probability for any two distinct inputs to be comfortably less than that, around \(2^{-70}\) (i.e., \(10^{-21}\)) or less. This also tells us that any acceptable fingerprint must consist of more than 64 bits, so we will have to either work in slower multi-word domains, or combine independent hashes.
Interestingly, we also don’t need much more than that for (non-adversarial) fingerprinting: at some point, the theoretical probability of a collision is dominated by the practical possibility of a hardware or networking issue making our program execute the fingerprinting function incorrectly, or pass the wrong data to that function.
While CLHash and VHASH aren’t quite what we want, they’re pretty close, so we felt it made sense to come up with a specialised solution for our fingerprinting use case.
Krovetz et al’s RFC 4418 brings an interesting idea: we can come up with a fast 64-bit hash function structured to make it easy to compute a second independent hash value, and concatenate two independent 64-bit outputs. The hash function can heavily favour computational efficiency and let each 64-bit half collide with probability \(\varepsilon\) significantly worse than \(2^{-64}\), as long as the collision probability for the concatenated fingerprint, \(\varepsilon^2\), is small enough, i.e., as long as \(\varepsilon^2 < 2^{-70} \Longleftrightarrow \varepsilon < 2^{-35}\). We get a more general purpose hash function out of the deal, and the fingerprint comparison logic is now free to only compute and look at half the fingerprint when it makes sense (e.g., in a pre-pass that tolerates spurious matches).
The design of UMASH is driven by two observations:
CLHash achieves a high throughput, but introduces a lot of latency to finalise its 127-bit state into a 64-bit result.
We can get away with a significantly weaker hash, since we plan to combine two of them when we need a strong fingerprint.
That’s why we started with the high-level structure diagrammed below, the same as UMAC, VHASH, and CLHash: a fast first-level block compression function based on Winograd’s pseudo dot-product, and a second-level Carter-Wegman polynomial hash function to accumulate the compressed outputs in a fixed-size state.
The inner loop in this two-level strategy is the block compressor, which divides each 256-byte block \(m\) into 32 64-bit values \(m_i\), combines them with randomly generated parameters \(k_i\), and converts the resulting sequence of machine words to a 16-byte output. The performance of that component will largely determine the hash function’s global peak throughput. After playing around with the NH inner loop, we came to the same conclusion as Lemire and Kaser: the scalar operations, the outer 128-bit ones in particular, map to too many µops. We thus focused on the same PH inner loop as CLHash.
While the similarity to NH is striking, analysing PH is actually much simpler: we can see the xor and carryless multiplications as working in the same ring of polynomials over \(\mathrm{GF}(2)\), unlike NH’s mixing of \(\bmod 2^{64}\) for the innermost additions with \(\bmod 2^{128}\) for the outer multiplications and sum. In fact, as Bernstein points out, PH is a direct application of Winograd’s pseudo dot-product to compute a multiplicative vector hash in half the multiplications.
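As a rough illustration of the PH inner loop, here is a bit-by-bit sketch that xors each 64-bit word with its parameter and carrylessly multiplies adjacent pairs. The function names `clmul` and `ph` are mine, and the sketch ignores byte order, block layout, and everything else a production implementation (which would use a hardware CLMUL instruction) has to get right.

```python
MASK64 = (1 << 64) - 1

def clmul(a, b):
    """Carryless product of two 64-bit integers (polynomials over GF(2))."""
    acc = 0
    while b:
        if b & 1:
            acc ^= a  # "add" (xor) the current shifted copy of a
        a <<= 1
        b >>= 1
    return acc

def ph(words, keys):
    """Xor of carryless products of adjacent, key-masked 64-bit words."""
    if len(words) != len(keys) or len(words) % 2:
        raise ValueError("need an even number of words and as many keys")
    acc = 0
    for i in range(0, len(words), 2):
        acc ^= clmul(words[i] ^ keys[i], words[i + 1] ^ keys[i + 1])
    return acc  # fits in 127 bits: two degree-63 polynomials multiplied

# (x + 1)^2 = x^2 + 1 over GF(2): the middle term cancels.
assert clmul(0b11, 0b11) == 0b101
```

A block of 32 words thus costs 16 multiplications, half of what a plain dot product with the parameters would need.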
CLHash uses an aggressively throughputoptimised block size of 1024 bytes. We found diminishing returns after 256 bytes, and stopped there.
With modular or polynomial ring arithmetic, the collision probability is \(2^{-64}\) for any pair of blocks. Given this fast compression function, the rest of the hashing algorithm must chop the input in blocks, accumulate compressed outputs in a constant-size state, and handle the potentially shorter final block while avoiding length extension issues.
Both VHASH and CLHash accumulate compressed outputs in a polynomial string hash over a large field (\(\mathbb{Z}/M_{127}\mathbb{Z}\) for VHASH, and \(\mathrm{GF}(2^{127})\) with irreducible polynomial \(x^{127} + x + 1\) for CLHash): the collision probability for polynomial string hashes is inversely proportional to the field size and grows with the string length (number of compressed blocks), so working in fields much larger than \(2^{64}\) lets the NH/PH term dominate.
Arithmetic in such large fields is slow, and reducing the 127-bit state to 64 bits is also not fast. CLHash and VHASH make the situation worse by zero-padding the final block, and CLHash defends against length extension attacks with a more complex mechanism than the one in VHASH.
Similarly to VHASH, UMASH uses a polynomial hash over the (much smaller) prime field \(\mathbb{F} = \mathbb{Z}/M_{61}\mathbb{Z},\)
where \(f\in\mathbb{F}\) is the randomly chosen point at which we evaluate the polynomial, and \(y\), the polynomial’s coefficients, is the stream of 64-bit values obtained by splitting in half the PH output for each block. This choice saves 20-30 cycles of latency in the final block, compared to CLHash: modular multiplications have lower latency than carryless multiplications for judiciously picked machine-integer-sized moduli, and integer multiplications seem to mix better, so we need less work in the finaliser.
Of course, UMASH sacrifices a lot of strength by working in \(\mathbb{F} =\, \bmod 2^{61} - 1:\) the resulting field is much smaller than \(2^{127}\), and we now have to update the polynomial twice for the same number of blocks. This means the collision probability starts worse \((\approx 2^{-61}\) instead of \(\approx 2^{-127})\), and grows twice as fast with the number of blocks \(n\) \((\approx 2n\cdot 2^{-61}\) instead of \(\approx n\cdot 2^{-61})\). But remember, we’re only aiming for collision probability \(< 2^{-35}\) and each block represents 256 bytes of input data, so this is acceptable, assuming that multi-gigabyte inputs are out of scope.
We protect against length extension collisions by xoring (adding, in the polynomial ring) the original byte size of the final block to its compressed PH output. This xor is simpler than CLHash’s finalisation step with a carryless multiplication, but still sufficient: we can adapt Krovetz’s proof for VHASH by replacing NH’s almost-\(\Delta\)-universality with PH’s almost-XOR-universality.
Having this protection means we can extend short final blocks however we want. Rather than conceptually zero-padding our inputs (which adds complexity and thus latency on short inputs), we allow redundant reads. We bifurcate inputs shorter than 16 bytes to a completely different latency-optimised code path, and let the final PH iteration read the last 16 bytes of the input, regardless of how redundant that might be.
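To make the redundant read concrete, here is a small sketch of how an input of at least 16 bytes could be cut into 16-byte chunks without padding: every chunk but the last starts at a multiple of 16, and the last one is simply the final 16 bytes, overlapping its predecessor when the length isn't a multiple of 16. `split_chunks` is a hypothetical helper for illustration only; the reference implementation has the actual layout.

```python
def split_chunks(data: bytes, chunk_size: int = 16) -> list:
    """Split data into full chunks; the last chunk is the final chunk_size bytes."""
    if len(data) < chunk_size:
        raise ValueError("short inputs take a different code path")
    # Aligned chunks that end strictly before the tail chunk begins.
    chunks = [
        data[start:start + chunk_size]
        for start in range(0, len(data) - chunk_size, chunk_size)
    ]
    # Redundant read: the tail may overlap bytes already covered above.
    chunks.append(data[-chunk_size:])
    return chunks

data = bytes(range(20))
assert split_chunks(data) == [bytes(range(16)), bytes(range(4, 20))]
```

The overlap is harmless here because the original byte size is mixed into the final block, as described above.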
The semi-literate Python reference implementation has the full code and includes more detailed analysis and rationale for the design decisions.
The previous section already showed how we let micro-optimisation inform the high-level structure of UMASH. The use of PH over NH, our choice of a polynomial hash in a small modular field, and the way we handle short blocks all aim to improve the performance of production implementations. We also made sure to enable a couple more implementation tricks with lower level design decisions.
The block size is set to 256 bytes because we observed diminishing returns for larger blocks… but also because it’s reasonable to cache the PH loop’s parameters in 8 AVX registers, if we need to shave load µops.
More importantly, it’s easy to implement a Horner update with the prime modulus \(2^{61} - 1\). Better, that’s also true for a “double-pumped” Horner update, \(h^\prime = H_f(h, a, b) = af + (b + h)f^2.\)
The trick is to work in \(\bmod 2^{64} - 8 = \bmod 8\cdot(2^{61} - 1),\) which lets us implement modular multiplication of an arbitrary 64-bit integer \(a\) by a multiplier \(0 < f < 2^{61} - 1\) without worrying too much about overflow. \(2^{64} \equiv 8 \mod 2^{64} - 8,\) so we can reduce a value \(x\) to a smaller representative with
\[ x \equiv 8\lfloor x / 2^{64}\rfloor + (x \bmod 2^{64}) \mod 2^{64} - 8; \]
this equivalence is particularly useful when \(x < 2^{125}\): in that case, \(x / 2^{64} < 2^{61},\) and the intermediate product \(8\lfloor x / 2^{64}\rfloor < 2^{64}\) never overflows 64 bits. That’s exactly what happens when \(x = af\) is the product of \(0\leq a < 2^{64}\) and \(0 < f < 2^{61} - 1\). This also holds when we square the multiplier \(f\): it’s sampled from the field \(\mathbb{Z}/(2^{61} - 1)\mathbb{Z},\) so its square also satisfies \(f^2 < 2^{61}\) once fully reduced.
Integer multiplication instructions for 64bit values will naturally split the product \(x = af\) in its high and low 64bit half; we get \(\lfloor x / 2^{64}\rfloor\) and \(x\bmod 2^{64}\) for free. The rest of the doublepumped Horner update is a pair of modular additions, where only the final sum must be reduced to fit in \(\bmod 2^{64}  8\). The resulting instructionparallel double Horner update is only a few cycles slower than a single Horner update.
We also never fully reduce to \(\bmod 2^{61}  1\). While the collision bound assumes that prime field, we simply work in its \(\bmod 2^{64}  8\) extension. This does not affect the collision bound, and the resulting expression is still amenable to algebraic manipulation: modular arithmetic is a well defined ring even for composite moduli.
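To make the arithmetic concrete, here is a minimal Python sketch of the double-pumped Horner update in \(\bmod 2^{64} - 8\). The names reduce64 and horner_double are made up for this illustration, and Python’s unbounded integers stand in for the 64-bit multiplier’s high and low halves; this is not the production C code.

```python
MASK64 = 2**64 - 1
FIELD = 2**61 - 1        # prime modulus of the polynomial hash
MODULUS = 8 * FIELD      # 2**64 - 8, the modulus we actually work in


def reduce64(x):
    """Folds x down to a 64-bit representative of x mod 2**64 - 8.

    Since 2**64 = 8 (mod 2**64 - 8), the bits above bit 63 can be
    multiplied by 8 and added back to the low 64 bits.
    """
    while x >> 64:
        x = 8 * (x >> 64) + (x & MASK64)
    return x


def horner_double(h, a, b, f):
    """Returns a 64-bit representative of a*f + (b + h)*f**2 mod 2**64 - 8."""
    assert 0 < f < FIELD
    f2 = (f * f) % FIELD              # fully reduced square, below 2**61
    left = reduce64(a * f)            # product below 2**125
    right = reduce64(reduce64(b + h) * f2)
    return reduce64(left + right)     # may exceed 2**64 - 9: never fully reduced
```

The result is only a representative of the right residue class, but reducing it \(\bmod 2^{61} - 1\) always agrees with the exact value, because \(2^{61} - 1\) divides \(2^{64} - 8\).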
A proof of almost-universality doesn’t mean a hash passes the SMHasher test suite. It should definitely guarantee collisions are (probably) rare enough, but SMHasher also looks at bit avalanching and bias, and universality is oblivious to these issues. Even XOR- or \(\Delta\)-universality doesn’t suffice: the hash values for a given string are well distributed when parameters are chosen uniformly at random, but this does not imply that hashes are always (or usually) well distributed for fixed parameters.
The most stringent SMHasher tests focus on short inputs: mostly up to 128 or 256 bits, unless “Extra” torture testing is enabled. In a way, this makes sense, given that arbitrary-length string hashing is provably harder than the bounded-length vector case. Moreover, a specialised code path for these inputs is beneficial, since they’re relatively common and deserve strong and low-latency hashes. That’s why UMASH uses a completely different code path for inputs of at most 8 bytes, and a specialised PH iteration for inputs of 9 to 16 bytes.
However, this means that SMHasher’s best avalanche and bias tests often tell us very little about the general case. For UMASH, the medium length (9 to 16 bytes) code path at least shares the same structure and finalisation logic as the code for longer inputs.
There may also be a bit of co-evolution between the test harness and the design of hash functions: the sort of xorshift/multiply mixers favoured by Appleby in the various versions of MurmurHash tends to do well on SMHasher. These mixers are also invertible, so we can take any hash function with good collision properties, mix its output with someone else’s series of xorshift and multiplications (in UMASH’s case, the SplitMix64 update function or a subset thereof), and usually find that the result satisfies SMHasher’s bias and avalanche tests.
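A small Python sketch shows why such mixers are invertible, and so cannot add collisions. The constants are the ones commonly published for the SplitMix64 finaliser; whether UMASH uses all of these rounds is not something this sketch claims, and the function names are mine. It needs Python 3.8 or later for the modular inverse in pow.

```python
MASK64 = 2**64 - 1
M1 = 0xBF58476D1CE4E5B9   # odd, hence invertible mod 2**64
M2 = 0x94D049BB133111EB   # odd, hence invertible mod 2**64


def xorshift(z, shift):
    return z ^ (z >> shift)


def unxorshift(z, shift):
    """Inverts xorshift: each pass recovers `shift` more of the top bits."""
    x = z
    for _ in range(64 // shift + 1):
        x = z ^ (x >> shift)
    return x


def mix(z):
    """Alternates rightward xorshifts and multiplications mod 2**64."""
    z = (xorshift(z, 30) * M1) & MASK64
    z = (xorshift(z, 27) * M2) & MASK64
    return xorshift(z, 31)


def unmix(z):
    """Undoes mix, one step at a time, in reverse order."""
    z = unxorshift(z, 31)
    z = (z * pow(M2, -1, 2**64)) & MASK64
    z = unxorshift(z, 27)
    z = (z * pow(M1, -1, 2**64)) & MASK64
    return unxorshift(z, 30)
```

Because unmix(mix(x)) == x for every 64-bit x, two distinct inputs to the mixer can never produce the same output: the collision bound of whatever feeds the mixer is preserved.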
It definitely looks like interleaving rightward bitwise operations and integer multiplications is a good mixing strategy. However, I find it interesting that the hash evaluation harness created by the author of MurmurHash steers implementations towards MurmurHash-style mixing code.
The structure of UMASH lets us support more sophisticated usage patterns than merely hashing or fingerprinting an array of bytes.
The PH loop needs less than 17 bytes of state for its 16-byte accumulator and an iteration count, and the polynomial hash also needs 17 bytes, for its own 8-byte accumulator, the 8-byte “seed,” and a counter for the final block size (up to 256 bytes). The total comes up to 34 bytes of state, plus a 16-byte input buffer, since the PH loop consumes 16-byte chunks at a time. Coupled with the way we only consider the input size at the end of UMASH, this makes it easy to implement incremental hashing.
In fact, the state is small enough that our implementation stashes some parameter data inline in the state struct, and uses the same layout for hashing and fingerprinting with a pair of hashes (and thus double the state): most of the work happens in PH, which only accesses the constant parameter array, the shared input buffer and iteration counter, and its private 16-byte accumulator.
Incremental fingerprinting is a crucial capability for our caching system: cache keys may be large, so we want to avoid serialising them to an array of contiguous bytes just to compute a fingerprint. Efficient incrementality also means we can hash NUL-terminated C strings with a fused UMASH / strlen loop, a nice speed-up when the data is in cache.
The outer polynomial hash in UMASH is so simple to analyse that we can easily process blocks out of order. In my experience, such a “parallel hashing” capability is more important than peak throughput when checksumming large amounts of data coming over the wire. We usually maximise transfer throughput by asking for several ranges of data in parallel. Having to checksum these ranges in order introduces a serial bottleneck and the usual head-of-line blocking challenges; more importantly, checksumming in order adds complexity to code that should be as obviously correct as possible. The polynomial hash lets us hash an arbitrary subsequence of 256-byte aligned blocks and use modular exponentiation to figure out its impact on the final hash value, given the subsequence’s position in the checksummed data. Parallel hashing can exploit multiple cores (more cores, more bandwidth!) with simpler code.
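The algebra behind out-of-order hashing fits in a few lines of Python. This sketch assumes each block has already been compressed to an integer, works in the prime field directly, and ignores how UMASH encodes the input length and the final block; the function names are hypothetical.

```python
FIELD = 2**61 - 1


def poly_hash(blocks, f):
    """In-order Horner evaluation: h = (h + block) * f mod p."""
    h = 0
    for block in blocks:
        h = ((h + block) * f) % FIELD
    return h


def range_contribution(blocks, end, total, f):
    """Contribution of a contiguous run of blocks to the final hash.

    `end` is the (exclusive) index of the run's last block in a message
    of `total` blocks; the run's own hash is scaled by f**(total - end).
    """
    return (poly_hash(blocks, f) * pow(f, total - end, FIELD)) % FIELD


def parallel_hash(blocks, ranges, f):
    """Combines the contributions of disjoint ranges, in any order."""
    total = len(blocks)
    return sum(
        range_contribution(blocks[start:end], end, total, f)
        for start, end in ranges
    ) % FIELD
```

As long as the ranges cover every block exactly once, the order in which they arrive does not matter: the sum of their contributions equals the in-order hash.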
The UMAC RFC uses a Toeplitz extension scheme to compute independent NH values while recycling most of the parameters. We do the same with PH, by adapting Krovetz’s proof to exploit PH’s almost-XOR-universality instead of NH’s almost-\(\Delta\)-universality. Our fingerprinting code reuses all but the first 32 bytes of PH parameters for the second hash: that’s the size of an AVX register, which makes it trivial to avoid loading parameters twice in a fused PH loop.
The same RFC also points out that concatenating the output of fast hashes lets validation code decide which speed-security tradeoff makes sense for each situation: some applications may be willing to only compute and compare half the hashes.
We use that freedom when reading from large hash tables keyed on the UMASH fingerprint of strings. We compute a single UMASH hash value to probe the hash tables, and only hash the second half of the fingerprint when we find a probable hit. The idea is that hashing the search key (now hot in cache) a second time will be faster than comparing it against the hash entry’s string key in cold storage.
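A toy version of that two-step lookup might look like the following Python sketch. It is only an illustration: keyed BLAKE2b from hashlib stands in for the two 64-bit halves of a UMASH fingerprint, and FingerprintTable, first_half and second_half are hypothetical names.

```python
import hashlib


def _hash64(data, key):
    """Stand-in for one 64-bit half of a fingerprint (not UMASH)."""
    digest = hashlib.blake2b(data, digest_size=8, key=key).digest()
    return int.from_bytes(digest, "little")


def first_half(data):
    return _hash64(data, b"first")


def second_half(data):
    return _hash64(data, b"second")


class FingerprintTable:
    """Maps the first fingerprint half to (second half, value)."""

    def __init__(self):
        self.slots = {}

    def insert(self, key, value):
        self.slots[first_half(key)] = (second_half(key), value)

    def lookup(self, key):
        entry = self.slots.get(first_half(key))
        if entry is None:
            return None                  # miss: we only paid for one hash
        second, value = entry
        if second != second_half(key):   # probable hit: hash the hot key again
            return None
        return value
```

The string keys themselves never need to be read back from cold storage: a lookup touches the search key (twice at most) and one table entry.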
When we add this sort of trickery to our code base, it’s important to make sure the interfaces are hard to misuse. For example, it would be unfortunate if only one half of the 128-bit fingerprint were well distributed and protected against collisions: this would make it far too easy to implement the two-step lookup-by-fingerprint above correctly but inefficiently. That’s why we maximise the symmetry in the fingerprint: the two 64-bit halves are computed with the same algorithm to guarantee the same worst-case collision probability and distribution quality. This choice leaves fingerprinting throughput on the table when a weaker secondary hash would suffice. However, I prefer a safer if slightly slower interface to one ripe for silent performance bugs.
While we intend for UMASH to become our default hash and fingerprint function, it can’t be the right choice for every application.
First, it shouldn’t be used for authentication or similar cryptographic purposes: the implementation is probably riddled with side-channels, the function has no protection against parameter extraction or adaptive attacks, and collisions are too frequent anyway.
Obviously, this rules out using UMASH in a MAC, but might also be an issue for, e.g., hash tables where attackers control the keys and can extrapolate the hash values. A timing side-channel may let attackers determine when keys collide; once a set of colliding keys is known, the linear structure of UMASH makes it trivial to create more collisions by combining keys from that set. Worse, iterating over the hash table’s entries can leak the hash values, which would let an attacker slowly extract the parameters. We conservatively avoid non-cryptographic hashes and even hashed data structures for sections of the Backtrace code base where such attacks are in scope.
Second, the performance numbers reported by SMHasher (up to 22 ns when hashing 64 bytes or less, and 22 GB/s peak throughput) are probably a lie for real applications, even when running on the exact same 2.5 GHz Xeon 8175M hardware. These are best-case values, when the code and the parameters are all hot in cache… and that’s a fair amount of bytes for UMASH. The instruction footprint for a 64-bit hash is 1435 bytes (comparable to heavier high-throughput hashes, like the 1600-byte xxh3_64 or 1350-byte farmhash64), and the parameters span 288 bytes (320 for a fingerprint).
There is a saving grace for UMASH and other complex hash functions: the amount of bytes executed is proportional to the input size (e.g., the code for inputs of 8 or fewer bytes only needs 141 bytes, and would inline to around 100 bytes), and the number of parameters read is bounded by the input length. Although UMASH can need a lot of instruction and parameter bytes, the worst case only happens for larger inputs, where the cache misses can hopefully be absorbed by the work of loading and hashing the data.
The numbers are also only representative of powerful CPUs with carry-less multiplication in hardware. The PH inner loop has 50% higher throughput than NH (22 vs 14 GB/s) on contemporary Intel servers. The carry-less approach still has an edge over 128-bit modular arithmetic on AMD’s Naples, but less so, around 20-30%. We did not test on ARM (the Backtrace database only runs on x86-64), but I would assume the situation there is closer to AMD’s than Intel’s.
However, I also believe we’re more likely to observe improved performance for PH than NH in future microarchitectures: the core of NH, full-width integer multiplication, has been aggressively optimised by now, while the gap between Intel and AMD shows there may still be low-hanging fruit for the carry-less multiplications at the heart of PH. So, NH is probably already as good as it’s going to be, but we can hope that PH will continue to benefit from hardware optimisations, as chip designers improve the performance of cryptographic algorithms like AES-GCM.
Third and last, UMASH isn’t fully stabilised yet. We do not plan to modify the high-level structure of UMASH, a PH block compressor that feeds into a polynomial string hash. However, we are looking for suggestions to improve its latency on short inputs, and to simplify the finaliser while satisfying SMHasher’s distribution tests.
We believe UMASH is ready for non-persistent usage: we’re confident in its quality, but the algorithm isn’t set in stone yet, so hash or fingerprint values should not reach long-term storage. We do not plan to change anything that will affect the proof of collision bound, but improvements to the rest of the code are more than welcome.
In particular:
xorshift / multiply in the finaliser, but can we shave even more latency there?
A hash function is a perfect target for automated correctness and performance testing. I hope to use UMASH as a test bed for the automatic evaluation (and approval?!) of pull requests.
Of course, you’re also welcome to just use UMASH as a single-file C library or re-implement it to fit your requirements. The MIT-licensed C code is on GitHub, and we can definitely discuss validation strategies for alternative implementations.
Finally, our fingerprinting use case shows collision rates are probably not something to minimise, but closer to soft constraints. We estimate that, once the probability reaches \(2^{-70}\), collisions are rare enough to only compare fingerprints instead of the fingerprinted values. However, going lower than \(2^{-70}\) doesn’t do anything for us.
It would be useful to document other back-of-the-envelope requirements for a hash function’s output size or collision rate. Now that most developers work on powerful 64-bit machines, it seems far too easy to add complexity and waste resources for improved collision bounds that may not unlock any additional application.
Any error in the analysis or the code is mine, but a few people helped improve UMASH and its presentation.
Colin Percival scanned an earlier version of the reference implementation for obvious issues, encouraged me to simplify the parameter generation process, and prodded us to think about side channels, even in data structures.
Joonas Pihlaja helped streamline my initial attempt while making the reference implementation easier to understand.
Jacob Shufro independently confirmed that he too found the reference implementation understandable, and tightened the natural language.
Phil Vachon helped me gain more confidence in the implementation tricks borrowed from VHASH after replacing the NH compression function with PH.
hp_read_swf without changing the fast path. See the addendum.
Back in February 2020, Blelloch and Wei submitted this cool preprint: Concurrent Reference Counting and Resource Management in Wait-free Constant Time. Their work mostly caught my attention because they propose a wait-free implementation of hazard pointers for safe memory reclamation.^{1} Safe memory reclamation (PDF) is a key component in lock-free algorithms when garbage collection isn’t an option,^{2} and hazard pointers (PDF) let us bound the amount of resources stranded by delayed cleanups much more tightly than, e.g., epoch reclamation (PDF). However, the usual implementation has a loop in its read barriers (in the garbage collection sense), which can be annoying for code generation and bad for worst-case time bounds.
Blelloch and Wei’s wait-free algorithm eliminates that loop… with a construction that stacks two emulated primitives (strong LL/SC, and atomic copy, implemented with the former) on top of what real hardware offers. I see the real value of the construction in proving that wait-freedom is achievable,^{3} and that the key is atomic memory-memory copies.
In this post, I’ll show how to flatten down that abstraction tower into something practical with a bit of engineering elbow grease, and come up with wait-free alternatives to the usual lock-free hazard pointers that are competitive in the best case. Blelloch and Wei’s insight that hazard pointers can use any wait-free atomic memory-memory copy lets us improve the worst case without impacting the common case!
But first, what are hazard pointers?
Hazard pointers were introduced by Maged Michael in Hazard Pointers: Safe Memory Reclamation for Lock-Free Objects (2005, PDF), as the first solution to reclamation races in lock-free code. The introduction includes a concise explanation of the safe memory reclamation (SMR) problem.
When a thread removes a node, it is possible that some other contending thread—in the course of its lock-free operation—has earlier read a reference to that node, and is about to access its contents. If the removing thread were to reclaim the removed node for arbitrary reuse, the contending thread might corrupt the object or some other object that happens to occupy the space of the freed node, return the wrong result, or suffer an access error by dereferencing an invalid pointer value. […] Simply put, the memory reclamation problem is how to allow the memory of removed nodes to be freed (i.e., reused arbitrarily or returned to the OS), while guaranteeing that no thread accesses free memory, and how to do so in a lock-free manner.
In other words, a solution to the SMR problem lets us know when it’s safe to physically release resources that used to be owned by a linked data structure, once all links to these resources have been removed from that data structure (after “logical deletion”). The problem makes intuitive sense for dynamically managed memory, but it applies equally well to any resource (e.g., file descriptors), and its solutions can even be seen as extremely read-optimised reader/writer locks.^{4}
The basic idea behind Hazard Pointers is to have each thread publish to permanently allocated^{5} hazard pointer records (HP records) the set of resources (pointers) it’s temporarily borrowing from a lock-free data structure. That’s enough information for a background thread to snapshot the current list of resources that have been logically deleted but not yet physically released (the limbo list), scan all records for all threads, and physically release all resources in the snapshot that aren’t in any HP record.
With just enough batching of the limbo list, this scheme can be practical: in practice, lock-free algorithms only need to pin a few (usually one or two) nodes at a time to ensure memory safety. As long as we avoid running arbitrary code while holding hazardous references, we can bound the number of records each thread may need at any one time. Scanning the records thus takes time roughly linear in the number of active threads, and we can amortise that to constant time per deleted item by waiting until the size of the limbo list is greater than a multiple of the number of active threads.^{6}
The tricky bit is figuring out how to reliably publish to a HP record without locking. Hazard pointers simplify that challenge with three observations:
This is where the clever bit of hazard pointers comes in: we must make sure that any resource (pointer to a node, etc.) we borrow from a lock-free data structure is immediately protected by a HP record. We can’t make two things happen atomically without locking. Instead, we’ll guess^{8} what resource we will borrow, publish that guess, and then actually borrow the resource. If we guessed correctly, we can immediately use the borrowed resource; if we were wrong, we must try again.
On an ideal sequentially consistent machine, the pseudocode looks like the following. The cell argument points to the resource we wish to acquire (e.g., it’s a reference to the next field in a linked list node), and record is the hazard pointer record that will protect the value borrowed from cell.
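A minimal Python rendition of the guess, publish, borrow sequence, assuming a sequentially consistent machine. Cell and Record are hypothetical stand-ins for a shared pointer-sized word and a hazard pointer record, and hp_read_sc is my name for the read sequence.

```python
class Cell:
    """A shared word holding a pointer, modelled as any Python value."""

    def __init__(self, value=None):
        self.value = value

    def load(self):
        return self.value


class Record:
    """A hazard pointer record with a single pinned value."""

    def __init__(self):
        self.pin = None


def hp_read_sc(cell, record):
    """Reads the pointer in `cell` and protects it through `record`."""
    while True:
        guess = cell.load()        # guess what we are about to borrow
        record.pin = guess         # publish the guess
        if guess == cell.load():   # borrow for real; retry on a bad guess
            return guess
```

The loop only exits once the value published in record.pin is the one that was still in the cell afterwards, so the returned pointer is protected from the moment we use it.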

In practice, we must make sure that our write to record.pin is visible before re-reading the cell’s value, and we should also make sure the pointer read is ordered with respect to the rest of the calling read-side code.^{9}
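A sketch of the same read sequence with the ordering constraints spelled out. Python has no memory-ordering primitives, so store_load_fence and the relaxed/acquire loads are placeholders that only mark where a real implementation would put them; the labels R1 and R2 follow the discussion below.

```python
def store_load_fence():
    """Placeholder: a real implementation issues a store-load fence here."""


class Cell:
    """A shared word; both loads are plain reads in this model."""

    def __init__(self, value=None):
        self.value = value

    def load_relaxed(self):
        return self.value

    def load_acquire(self):
        return self.value


class Record:
    def __init__(self):
        self.pin = None


def hp_read_explicit(cell, record):
    """Reads the pointer in `cell` and protects it through `record`."""
    while True:
        guess = cell.load_relaxed()
        record.pin = guess                 # relaxed store of the guess
        store_load_fence()                 # R1: publish before re-reading
        if guess == cell.load_acquire():   # R2: ordered before the caller's reads
            return guess
```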

We need a store/load fence in R1 to make sure the store to the record (just above R1) is visible by the time the second read (R2) executes. Under the TSO memory model implemented by x86 chips (PDF), this fence is the only one that isn’t implicitly satisfied by the hardware. It also happens that fences are best implemented with atomic operations on x86oids, so we can eliminate the fence in R1 by replacing the store just before R1 with an atomic exchange (fetch-and-set).
The slow cleanup path has its own fence that matches R1 (the one in R2 matches the mutators’ writes to cell).
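A sketch of the matching cleanup path, under the same caveat that Python can only model the control flow. The limbo list is a plain Python list, destroy is a caller-supplied callback, and hp_cleanup_explicit is my name for the function; C1 and C2 are the labels used in the discussion below.

```python
def store_load_fence():
    """Placeholder: a real implementation issues a fence here."""


class Record:
    def __init__(self, pin=None):
        self.pin = pin


def hp_cleanup_explicit(limbo, records, destroy):
    """Releases every limbo resource that no record currently pins."""
    to_reclaim = list(limbo)       # C1: snapshot the limbo list (acquire read)
    limbo.clear()
    store_load_fence()             # pairs with the readers' fence in R1
    pinned = set()
    for record in records:
        pinned.add(record.pin)     # C2: one read per record, not a snapshot
    for resource in to_reclaim:
        if resource in pinned:
            limbo.append(resource)     # still borrowed: retry next time
        else:
            destroy(resource)
```

Pinned resources simply go back on the limbo list, so a slow reader delays the release of at most the few resources it has pinned.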

We must make sure all the values in the limbo list we grab in C1 were added to the list (and thus logically deleted) before we read any of the records in C2, with the acquire read in C1 matching the store-load fence in R1.
It’s important to note that the cleanup loop does not implement anything like an atomic snapshot of all the records. The reclamation logic is correct as long as we scan the correct value for records that have had the same pinned value since before C1: we assume that a resource only enters the limbo list once all its persistent references have been cleared (in particular, this means circular back-references must be broken before scheduling a node for destruction), so any newly pinned value cannot refer to any resource awaiting destruction in the limbo list.^{10}
The following sequence diagram shows how the fencing guarantees that any iteration of hp_read_explicit will fail if it starts before C1 and observes a stale value.
If the read succeeds, the ordering between R1 and C1 instead guarantees that the cleanup loop will observe the pinned value when it reads the record in C2.
This all works, but it’s slow:^{11} we added an atomic write instruction (or worse, a fence) to a read-only operation. We can do better with a little help from our operating system.
When we use fences or memory ordering correctly, there should be an implicit pairing between fences or ordered operations: we use fencing to enforce an ordering (one must fully execute before or after another, overlap is forbidden) between pairs of instructions in different threads. For example, the pseudocode for hazard pointers with explicit fencing and memory ordering paired the store-load fence in R1 with the acquisition of the limbo list in C1.
We only need that pairing very rarely, when a thread actually executes the cleanup function. The amortisation strategy guarantees we don’t scan records all the time, and we can always increase the amortisation factor if we’re generating tiny amounts of garbage very quickly.
It kind of sucks that we have to incur a full fence on the fast read path, when it only matches reads in the cleanup loop maybe as rarely as, e.g., once a second. If we waited long enough on the slow path, we could rely on events like preemption or other interrupts to insert a barrier in all threads that are executing the readside.
How long is “enough?”
Linux has the membarrier syscall to block the calling thread until (more than) long enough has elapsed, Windows has the similar FlushProcessWriteBuffers, and on other operating systems, we can probably do something useful with scheduler statistics or ask for a new syscall.
Armed with these new blocking system calls, we can replace the store-load fence in R1 with a compiler barrier, and execute a slow membarrier/FlushProcessWriteBuffers after C1.
The cleanup function will then wait long enough^{12} to ensure that any read-side operation that had executed before R1 at the time we read the limbo list in C1 will be visible (e.g., because the operating system knows a preemption interrupt executed at least once on each core).
The pseudocode for this asymmetric strategy follows.
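A sketch of the asymmetric pair, again only as a model of the control flow: compiler_barrier and membarrier are no-op placeholders for, respectively, a compiler-only barrier and the blocking membarrier or FlushProcessWriteBuffers call, and the function names are assumptions of this sketch.

```python
def compiler_barrier():
    """Placeholder: a real implementation only stops compiler reordering."""


def membarrier():
    """Placeholder for membarrier(2) or FlushProcessWriteBuffers."""


class Cell:
    def __init__(self, value=None):
        self.value = value

    def load_relaxed(self):
        return self.value

    def load_acquire(self):
        return self.value


class Record:
    def __init__(self):
        self.pin = None


def hp_read_membarrier(cell, record):
    while True:
        guess = cell.load_relaxed()
        record.pin = guess
        compiler_barrier()                 # R1: no hardware fence anymore
        if guess == cell.load_acquire():   # R2
            return guess


def hp_cleanup_membarrier(limbo, records, destroy):
    to_reclaim = list(limbo)               # C1: snapshot the limbo list...
    limbo.clear()
    membarrier()                           # ...then force a barrier on readers
    pinned = {record.pin for record in records}   # C2
    for resource in to_reclaim:
        if resource in pinned:
            limbo.append(resource)         # still borrowed: retry later
        else:
            destroy(resource)
```

The read side is instruction-for-instruction the fenced version minus the fence; all the extra cost moved to the cleanup function, which runs rarely.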

We’ve replaced a fence on the fast read path with a compiler barrier, at the expense of executing a heavy syscall on the slow path. That’s usually an advantageous tradeoff, and is the preferred implementation strategy for Folly’s hazard pointers.
The ability to pair mere compiler barriers with membarrier syscalls opens the door for many more “atomic enough” operations, not just the fenced stores and loads we used until now: similarly to the key idea in Concurrency Kit’s atomic-free SPMC event count, we can use non-interlocked read-modify-write instructions, since any interrupt (please don’t mention imprecise interrupts) will happen before or after any such instruction, and never in the middle of an instruction.
Let’s use that to simplify wait-free hazard pointers.
The key insight that lets Blelloch and Wei achieve wait-freedom in hazard pointers is that the combination of publishing a guess and confirming that the guess is correct in hp_read emulates an atomic memory-memory copy. Given such an atomic copy primitive, the read-side becomes trivial.

The “only” problem is that atomic copies (which would look like locking all other cores out of memory accesses, copying the cell’s word-sized contents to record.pin, and releasing the lock) don’t exist in contemporary hardware.
However, we’ve already noted that syscalls like membarrier mean we can weaken our requirements to interrupt atomicity. In other words, individual non-atomic instructions work since we’re assuming precise interrupts… and x86 and amd64 do have an instruction for memory-memory copies!
The MOVS instructions are typically only used with a REP prefix. However, they can also be executed without any prefix, to execute one iteration of the copy loop. Executing a REP-free MOVSQ instruction copies one quadword (8 bytes) from the memory address in the source register [RSI] to the address in the destination register [RDI], and advances both registers by 8 bytes… and all this stuff happens in one instruction, so will never be split by an interrupt.
That’s an interrupt-atomic copy, which we can slot in place of the software atomic copy in Blelloch and Wei’s proposal!

Again, the MOVS instruction is not atomic, but will be ordered with respect to the membarrier syscall in hp_cleanup_membarrier: either the copy fully executes before the membarrier in C1, in which case the pinned value will be visible to the cleanup loop, or it executes after the membarrier, which guarantees the copy will not observe a stale value that’s waiting in the limbo list.
That’s just one instruction, but instructions aren’t all created equal. MOVS is on the heavy side: in order to read from memory, write to memory, and increment two registers, a modern Intel chip has to execute 5 micro-ops in at least ~5 cycles. That’s not exactly fast; definitely better than an atomic (LOCKed) instruction, but not fast.
We can improve that with a trick from side-channel attacks, and preserve wait-freedom. We can usually guess what value we’ll find in record.pin, simply by reading cell with a regular relaxed load. Unless we’re extremely unlucky (realistically, as long as the reader thread isn’t interrupted), MOVSQ will copy the same value we just guessed. That’s enough to exploit branch prediction and turn a data dependency on MOVSQ (a high latency instruction) into a data dependency on a regular load MOV (low latency), and a highly predictable control dependency. In very low level pseudo code, this “speculative” version of the MOVS read-side might look like:
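A Python model of the speculative read side, for illustration only. Here interrupt_atomic_copy stands in for the REP-free MOVSQ (Python obviously cannot issue one), Cell and Record are hypothetical stand-ins, and the branch comment describes what we expect a branch predictor to do on real hardware.

```python
class Cell:
    def __init__(self, value=None):
        self.value = value

    def load(self):
        return self.value


class Record:
    def __init__(self):
        self.pin = None


def interrupt_atomic_copy(record, cell):
    """Models MOVSQ: copies the cell's word to record.pin in one instruction."""
    record.pin = cell.load()


def hp_read_movs_spec(cell, record):
    guess = cell.load()                   # cheap regular load (MOV)
    interrupt_atomic_copy(record, cell)   # slow copy, off the critical path
    pinned = record.pin
    if pinned != guess:                   # almost never taken: predictable
        guess = pinned                    # the pinned value is the safe one
    return guess
```

There is no retry loop: whatever MOVSQ copied is pinned, and we return it. The initial load only lets the caller start using the (almost always identical) value early.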

We’ll see that, in reasonable circumstances, this wait-free code sequence is faster than the usual membarrier-based lock-free read side. But first, let’s see how we can achieve wait-freedom when CISCy instructions like MOVSQ aren’t available, with an asymmetric “helping” scheme.
Blelloch and Wei’s wait-free atomic copy primitive builds on the usual trick for wait-free algorithms: when a thread would wait for an operation to complete, it helps that operation make progress instead of blocking. In this specific case, a thread initiates an atomic copy by acquiring a fresh hazard pointer record, setting that descriptor’s pin field to \(\bot\), publishing the address it wants to copy from, and then performing the actual copy. When another thread enters the cleanup loop and wishes to read the record’s pin field, it may either find a value, or \(\bot\); in the latter case, the cleanup thread has to help the hazard pointer descriptor forward, by attempting to update the descriptor’s pin field.
This strategy has the marked advantage of working. However, it’s also symmetric between the common case (the thread that initiated the operation quickly completes it), and the worst case (a cleanup thread notices the initiating thread got stuck and moves the operation along). This forces the common case to use atomic operations, similarly to the way cleanup threads would. We pessimised the common case in order to eliminate blocking in the worst case, a frequent and unfortunate pattern in wait-free algorithms.
The source of that symmetry is our specification of an atomic copy from the source field to a single destination pin field, which must be written exactly once by the thread that initiated the copy (the hazard pointer reader), or any concurrent helper (the cleanup loop).
We can relax that requirement, since we know that the hazard pointer scanning loop can handle spurious or garbage pinned values. Rather than forcing both the read sequence (fast path) and the cleanup loop (slow path) to write to the same pin field, we will give each HP record two pin fields: a single-writer one for the fast path, and a multi-writer one for all helpers (all threads in cleanup code).
The read-side sequence will have to first write to the HP record to publish the cell it’s reading from, read and publish the cell’s pinned value, and then check if a cleanup thread helped the record along. If the record was helped, the read-side sequence must use the value written by the helping cleanup thread. This means cleanup threads can detect when a HP record is missing its pinned value, and help it along with the cell’s current value. Cleanup threads may later observe two pinned values (both the reader and a cleanup thread wrote a pinned value); in that case, both values are conservatively protected from physical destruction.
Until now a hazard pointer record has only had one field, the “pinned” value. We must add some complexity to make this asymmetric helping scheme work: in order for cleanup threads to be able to help, we must publish the cell we are reading, and we need somewhere for cleanup threads to write the pinned value they read. We also need some sort of ABA protection to make sure slow cleanup threads don’t overwrite a fresher pinned value with a stale one, when the cleanup thread gets stuck (preempted).
Concretely, the HP record still has a pin field, which is only written by the reader that owns the record, and read by cleanup threads. The help subrecord is written by both the owner of the record and any cleanup thread that might want to move a reader along. The reader will first write the address of the pointer it wants to read and protect in cell, generate a new unique generation id by incrementing gen_sequence, and write that to pin_or_gen. We’ll tag generation ids with their sign: negative values are generation ids, positive ones are addresses.

At this point, any cleanup thread should be able to notice that the help.pin_or_gen is a generation value, and find a valid cell address in help.cell. That’s all the information a cleanup thread needs to attempt to help the record move forward. It can read the cell’s value, and publish the pinned value it just read with an atomic compare-and-swap (CAS) of pin_or_gen; if the CAS fails, another cleanup thread got there first, or the reader has already moved on to a new target cell. In the latter case, any in-flight hazard pointer read sequence started before we started reclaiming the limbo list, and it doesn’t matter what pinned value we extract from the record.
Having populated the help subrecord, a reader can now publish a value in pin, and then look for a pinned value in help.pin_or_gen: if a cleanup thread published a pinned value there, the reader must use it, and not the potentially stale (already destroyed) value the reader wrote to pin.
On the read side, we obtain plain wait-freedom, with standard operations. All we need are two compiler barriers to let membarriers guarantee writes to the help subrecord are visible before we start reading from the target cell, and to guarantee that any cleanup thread’s write to record.help.pin_or_gen is visible before we compare record.help.pin_or_gen against gen.

On the cleanup side, we will consume the limbo list, issue a membarrier to catch any read-side critical section that wrote to pin_or_gen before we consumed the list, help these sections along, issue another membarrier to guarantee that either the readers’ writes to record.pin are visible, or our writes to record.help.pin_or_gen are visible to readers, and finally scan the records while remembering to pin the union of record.pin and record.help.pin_or_gen if the latter holds a pinned value.
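Putting the record layout, the read side and the cleanup side together, a single-threaded Python model of the helping scheme might look like this. It is a sketch under loud assumptions: pointers are positive integers, gen_sequence lives in each record, the barrier functions are no-op placeholders, cas merely models an atomic compare-and-swap, and the model does not address how a cleanup thread obtains a consistent view of the (cell, pin_or_gen) pair.

```python
import itertools


def compiler_barrier():
    """Placeholder for a compiler-only barrier."""


def membarrier():
    """Placeholder for membarrier(2) or FlushProcessWriteBuffers."""


class Cell:
    def __init__(self, value):
        self.value = value


class Help:
    def __init__(self):
        self.cell = None
        self.pin_or_gen = 0   # negative: generation id; positive: pinned address


class Record:
    def __init__(self):
        self.pin = 0
        self.gen_sequence = itertools.count(-2**62)   # fresh negative ids
        self.help = Help()


def hp_read_wf(cell, record):
    gen = next(record.gen_sequence)
    record.help.cell = cell
    record.help.pin_or_gen = gen
    compiler_barrier()                  # RA
    guess = cell.value                  # R2
    record.pin = guess
    compiler_barrier()                  # RB
    helped = record.help.pin_or_gen
    if helped != gen:                   # a cleanup thread helped us along
        guess = helped
    return guess


def cas(help_record, expected, new):
    """Models an atomic compare-and-swap on help.pin_or_gen."""
    if help_record.pin_or_gen == expected:
        help_record.pin_or_gen = new


def hp_cleanup_wf(limbo, records, destroy):
    to_reclaim = list(limbo)
    limbo.clear()
    membarrier()                        # C1
    for record in records:
        cell, observed = record.help.cell, record.help.pin_or_gen
        if observed < 0:                # no pinned value published yet: help
            cas(record.help, observed, cell.value)
    membarrier()                        # C2
    pinned = set()
    for record in records:
        pinned.add(record.pin)
        if record.help.pin_or_gen > 0:
            pinned.add(record.help.pin_or_gen)
    for resource in to_reclaim:
        if resource in pinned:
            limbo.append(resource)
        else:
            destroy(resource)
```

Note that the read side has no loop and no atomic instruction, while the cleanup side conservatively pins both record.pin and any helped value.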

The membarrier in C1 matches the compiler barrier in RA: if a read-side section executed R2 before we consumed the limbo list, its writes to record.help must be visible. The second membarrier in C2 matches the compiler barrier in RB: if the read-side section has written to record.pin, that write must be visible; otherwise, the cleanup thread’s write to help.pin_or_gen must be visible to the reader.
Finally, when scanning for pinned values, we can’t determine whether
the reader used its own value or the one we published, so we must
conservatively add both to the pinned set.
That’s a couple more instructions on the read-side than the speculative MOVSQ implementation. However, the instructions are simpler, and the portable wait-free implementation benefits even more from speculative execution: the final branch is equally predictable, and now depends only on a read of record.help.pin_or_gen, which can be satisfied by forwarding the reader’s own write to that same field.
The end result is that, in my microbenchmarks, this portable wait-free implementation does slightly better than the speculative MOVSQ code.
We can make this even tighter by further specialising the code. The cleanup path is already slow. What if we also assumed mutual exclusion; what if, for each record, only one cleanup at a time could be in flight?
Once we may assume mutual exclusion between cleanup loops (more specifically, the “help” loop, the only part that writes to records), we don’t have to worry about ABA protection anymore. Hazard pointer records can lose some weight.

We’ll also use tagging with negative or positive values, this time to distinguish target cell addresses (positive) from pinned values (negative). Now that the read side doesn’t have to update a generation counter to obtain unique sequence values, it’s even simpler.

The cleanup function isn’t particularly different, except for the new encoding scheme.

Again, RA matches C1, and RB matches C2. This new implementation is simpler than hp_read_wf on the read side, and needs even fewer instructions.
A key attribute for hazard pointers is how much they slow down pointer traversal in the common case. However, there are other qualitative factors that should impact our choice of implementation.
The classic fenced (hp_read_explicit) implementation needs one atomic or fence instruction per read, but does not require any exotic OS operation.
A simple membarrier implementation (hp_read_membarrier) is ABI-compatible with the fenced implementations, but lets the read side replace the fence with a compiler barrier, as long as the slow cleanup path can issue membarrier syscalls on Linux, or the similar FlushProcessWriteBuffers on Windows. All the remaining implementations (we won’t mention the much more complex wait-free implementation of Blelloch and Wei) also rely on the same syscalls to avoid fences or atomic instructions on the read side, while additionally providing wait-freedom (constant execution time) for readers, rather than mere lock-freedom.
The simple MOVSQ-based implementation (hp_read_movs) is fully compatible with hp_read_membarrier, wait-free, and usually compiles down to fewer instruction bytes, but is slightly slower. Adding speculation (hp_read_movs_spec) retains compatibility and closes the performance gap, with a number of instruction bytes comparable to the lock-free membarrier implementation. In both cases, we rely on MOVSQ, an instruction that only exists on x86 and amd64.
However, we can also provide portable wait-freedom, once we modify the cleanup code to help the read-side sections forward. The basic implementation hp_read_wf compiles to many more instructions than the other read-side implementations, but those instructions are mostly upstream of the protected pointer read; in microbenchmarks, the result can even be faster than the simple hp_read_membarrier or the speculative hp_read_movs_spec. The downside is that instruction bytes tend to hurt much more in real code than in microbenchmarks.
We also rely on pointer tagging, which could make the code less widely applicable.
We can simplify and shrink the portable wait-free code by assuming mutual exclusion on the cleanup path (hp_read_swf). Performance is improved or roughly the same, and the instruction bytes are comparable to hp_read_membarrier. However, we’ve introduced more opportunities for reclamation hiccups.
More importantly, achieving wait-freedom with concurrent help suffers from a fundamental issue: helpers don’t know that the pointer read they’re helping move forward is stale until they (fail to) CAS into place the value they just read. This means they must be able to safely read potentially stale pointers without crashing. One might think mutual exclusion in the cleanup function fixes that, but programs often mix and match different reclamation schemes, as well as lock-free and lockful code. On Linux, we could abuse the process_vm_readv syscall;^{13} in general I suppose we could install signal handlers to catch SIGSEGV and SIGBUS.
The stale read problem is even worse for the single-cleanup hp_read_swf read sequence: there’s no ABA protection, so a cleanup helper can pin an old value in record.help.cell_or_pin. This could happen if a read-side sequence is initiated before hp_cleanup_swf’s membarrier in C1, and the associated incomplete record is noticed by the helper, at which point the helper is preempted. The read-side sequence completes, and later uses the same record to read from the same address… and that’s when the helper resumes execution, with a compare_exchange that succeeds.
The pinned value “helped in” by hp_cleanup_swf is still valid (the call to hp_cleanup_swf hasn’t physically destroyed anything yet), so the hazard pointer implementation is technically correct. However, this scenario shows that hp_read_swf can violate memory ordering and causality, and even let long-overwritten values time travel into the future. The simpler read-side code sequence comes at a cost: its load is extremely relaxed, much more so than any intuitive mental model might allow.^{14}
EDIT 2020-07-09: However, see this addendum for a way to fix that race without affecting the fast (read) path.
Having to help readers forward also loses a nice practical property of hazard pointers: it’s always safe to spuriously consider arbitrary (readable) memory as a hazard pointer record; it only costs us additional conservatism in reclamation. That’s not the case anymore, once the cleanup thread has to help readers, and thus must write to HP records. This downside does not impact plain implementations of hazard pointers, but does make it harder to improve record management overhead by taking inspiration from managed language runtimes.
The overhead of hazard pointers only matters in code that traverses a lot of pointers, especially pointer chains. That’s why I’ll focus on microbenchmarking a loop that traverses a pseudorandomly shuffled circular linked list (embedded in an array of 1024 nodes, at 16 bytes per node) for a fixed number of pointer-chasing hops. You can find the code to replicate the results in this gist.
The unprotected (baseline) inner loop follows. Note the NULL end-of-list check, for realism; the list is circular, so the loop never breaks early.
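A minimal sketch of such a loop, assuming 16-byte nodes that hold a 64-bit value and a next pointer. The names struct node and traverse_nodes, the xor accumulator, and the trivial frob_ptr are illustrative assumptions, not necessarily the exact code in the gist:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 16-byte node (on 64-bit targets): one payload word, one link. */
struct node {
    uint64_t value;
    struct node *next;
};

/* Pure traversal variant: frob_ptr simply returns its first argument. */
static inline struct node *
frob_ptr(struct node *ptr, uint64_t value)
{
    (void)value;
    return ptr;
}

/* Follows `num_hops` links from `head`, xor-ing node values along the way. */
static uint64_t
traverse_nodes(struct node *head, size_t num_hops)
{
    uint64_t acc = 0;

    for (size_t i = 0; i < num_hops; i++) {
        uint64_t value;

        /* End-of-list check, for realism; never taken on a circular list. */
        if (head == NULL)
            break;

        value = head->value;
        acc ^= value;
        head = head->next;
        head = frob_ptr(head, value);
    }

    return acc;
}
```

Each iteration's load of head->next depends on the previous one, so the loop's run time is dominated by the latency of that chain rather than by throughput.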

There’s clearly a dependency chain between each read of head->next. The call to frob_ptr lets us introduce work in the dependency chain, which more closely represents realistic use cases. For example, when using hazard pointers to protect a binary search tree traversal, we must perform a small amount of work to determine whether we want to go down the left or the right subtree.
A hazard pointered implementation of this loop would probably unroll the loop body twice, to more easily implement hand-over-hand locking. That’s why I also include an unrolled version of this inner loop in the microbenchmark: it keeps us from concluding that hazard pointer protection improves performance when the gain really comes from the associated unrolling, and it gives us an idea of how much variation we can expect from small code generation changes.

The hazard pointer inner loops are just like the above, except that head = head->next is replaced with calls to an inline function.

The dependency chain is pretty obvious; we can measure the sum of latencies for 1000 pointer dereferences by running the loop 1000 times. I’m using a large iteration count to absorb noise from the timing harness (roughly 50 cycles per call), as well as any boundary effect around the first and last few loop iterations.
All the cycle measurements here are from my unloaded E5-4617, running at 2.9 GHz without Turbo Boost. First, let’s see what happens with a pure traversal, where frob_ptr is an inline no-op function that simply returns its first argument. This microbenchmark is far from realistic (if I found an inner loop that only traversed a singly linked list, I’d consider a different data structure), but helps establish an upper bound on the overhead from different hazard pointer read sides. I would usually show a faceted graph of the latency distribution for the various methods… but the results are so stable^{15} that I doubt there’s any additional information to be found in the graphs, compared to the tables below.
The following table shows the cycle counts for following 1000 pointers in a circular linked list, with various hazard pointer schemes and no work to find the next node, on an unloaded E5-4617 @ 2.9 GHz, without Turbo Boost.

| Method (1000 iterations) | p00.1 latency (cycles) | median latency (cycles) | p99.9 latency (cycles) |
|--------------------------|------------------------|-------------------------|------------------------|
| noop                     | 52                     | 56                      | 56                     |
| baseline                 | 4056                   | 4060                    | 4080                   |
| unrolled                 | 4056                   | 4060                    | 4080                   |
| hp_read_explicit         | 20136                  | 20160                   | 24740                  |
| hp_read_membarrier       | 5080                   | 5092                    | 5164                   |
| hp_read_movs             | 10060                  | 10076                   | 10348                  |
| hp_read_movs_spec        | 8568                   | 8568                    | 8572                   |
| hp_read_wf               | 6572                   | 7620                    | 8140                   |
| hp_read_swf              | 4268                   | 4304                    | 4368                   |

The table above reports quantiles for the total runtime of 1000 pointer dereferences, after one million repetitions.
We’re looking at a baseline of 4 cycles/pointer dereference (the L1 cache latency), regardless of unrolling. The only implementation with an actual fence or atomic operation, hp_read_explicit, fares pretty badly, at more than 5x the latency. Replacing that fence with a compiler barrier in hp_read_membarrier reduces the overhead to ~1 cycle per pointer dereference. Our first wait-free implementation, hp_read_movs (based on a raw MOVSQ), doesn’t do too well, with a ~6 cycle (150%) overhead for each pointer dereference. However, speculation (hp_read_movs_spec) does help shave that to ~4.5 cycles (110%). The portable wait-free implementation hp_read_wf does slightly better, and its single-cleanup version hp_read_swf takes the crown, by adding only around 0.2 cycle/dereference.
These results are stable and repeatable, but still fragile, in a way:
except for hp_read_explicit
, which is massively slowed down by its
atomic operation, and for hp_read_movs
, which adds a known latency bump on the hot path, the other slowdowns mostly reflect contention for
execution resources. In real life, such contention usually only occurs in
heavily tuned code, and the actual execution units (ports) in high
demand will vary from one inner loop to another.
Let’s see what happens when we insert a ~4-cycle latency slowdown (three for the multiplication, and one more for the increment) in the hot path, by redefining frob_ptr. The result of the integer multiplication by 0 is always 0, but adds a (non-speculated) data dependency on the node value and on the multiplication to the pointer chasing dependency chain. Only four cycles of work to decide which pointer to traverse is on the low end of my practical experience, but suffices to equalise away most of the difference between the hazard pointer implementations.
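A minimal sketch of such a frob_ptr, assuming GCC or clang: the empty asm statement is a compiler extension that hides the zero from the optimiser, so the multiplication and the addition stay in the generated code. The signature and the variable names are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Returns `ptr` unchanged, but only after multiplying `value` by a zero
 * the compiler cannot see through and adding the product to the pointer.
 */
static inline void *
frob_ptr(void *ptr, uint64_t value)
{
    uintptr_t zero = 0;

    /* GCC/clang extension: pretend `zero` may have been modified. */
    __asm__("" : "+r"(zero));
    return (void *)((uintptr_t)ptr + (uintptr_t)(value * zero));
}
```

The returned pointer now depends on the node's value through a multiply and an add, which lengthens the pointer chasing dependency chain without changing where the traversal goes.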

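Let’s again look at the quantiles for the cycle count for one million loops of 1000 pointer dereferences, on an unloaded E5-4617 @ 2.9 GHz.

| Method (1000 iterations) | p00.1 latency (cycles) | median latency (cycles) | p99.9 latency (cycles) |
|--------------------------|------------------------|-------------------------|------------------------|
| noop                     | 52                     | 56                      | 56                     |
| baseline                 | 10260                  | 10320                   | 10572                  |
| unrolled                 | 9056                   | 9060                    | 9180                   |
| hp_read_explicit         | 22124                  | 22156                   | 26768                  |
| hp_read_membarrier       | 10052                  | 10084                   | 10264                  |
| hp_read_movs             | 12084                  | 12112                   | 15896                  |
| hp_read_movs_spec        | 9888                   | 9940                    | 10152                  |
| hp_read_wf               | 9380                   | 9420                    | 9672                   |
| hp_read_swf              | 10112                  | 10136                   | 10360                  |
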
The difference between unrolled
in this table and in the previous
one shows we actually added around 5 cycles of latency per iteration
with the multiplication in frob_ptr
. This dominates the overhead we
estimated earlier for all the hazard pointer schemes except for the
remarkably slow hp_read_explicit
and hp_read_movs
. It’s thus not
surprising that all hazard pointer implementations but the latter
two are on par with the unprotected traversal loops (within 1.1 cycle
per pointer dereference, less than the impact of unrolling the loop
without unlocking any further rewrite).
The relative speed of the methods has changed, compared to the previous table. The speculative wait-free implementation hp_read_movs_spec was slower than hp_read_membarrier and much slower than hp_read_swf; it’s now slightly faster than both. The simple portable wait-free implementation hp_read_wf was slower than hp_read_membarrier and hp_read_swf; it’s now the fastest implementation.
I wouldn’t read too much into the relative rankings of hp_read_membarrier, hp_read_movs_spec, hp_read_wf, and hp_read_swf. They only differ by fractions of a cycle per dereference (all between 9.4 and 10.1 cycles/deref), and the exact values are a function of the specific mix of micro-ops in the inner loop, and of the near-unpredictable impact of instruction ordering on the chip’s scheduling logic. What really matters is that their impact on traversal latency is negligible once the pointer chasing loop does some work to find the next node.
I hope I’ve made a convincing case that hazard pointers can be wait-free and efficient on the read side, as long as we have access to something like membarrier or FlushProcessWriteBuffers on the slow cleanup (reclamation) path. If one were to look at the microbenchmarks alone, one would probably pick hp_read_swf.
However, the real world is more complex than microbenchmarks. When I have to extrapolate from microbenchmarks, I usually worry about the hidden impact of instruction bytes or cold branches, since microbenchmarks tend to fail at surfacing these things. I’m not as worried for hp_read_movs_spec and hp_read_swf: they both compile down to roughly as many instructions as the incumbent, hp_read_membarrier, and their forward untaken branch would be handled fine by a static predictor.
What I would take into account is the ability to transparently use hp_read_movs_spec in code that already uses hp_read_membarrier, and the added requirements of hp_read_swf. In addition to relying on membarrier for correctness, hp_read_swf needs a pointer tagging scheme to distinguish target pointers from pinned ones and a way for cleanup threads to read stale pointers without crashing, and it also imposes mutual exclusion around the scanning of (sets of) hazard pointer records. These additional requirements don’t seem impractical, but I can imagine code bases where they would constitute hard blockers (e.g., library code, or when protecting arbitrary integers). Finally, hp_read_swf can let protected values time travel into the future, with read sequences returning values so long after they were overwritten that the result violates pretty much any memory ordering model… unless you implement the addendum below.
TL;DR: Use hp_read_swf if you’re willing to sacrifice wait-freedom on the reclamation path and remember to implement the cleanup function with time travel protection. When targeting x86 and amd64, hp_read_movs_spec is a well-rounded option, and still wait-free. Otherwise, hp_read_wf uses standard operations, but compiles down to more code.
P.S., Travis Downs notes that mem-mem PUSH might be an alternative to MOVSQ, but that requires either pointing RSP to arbitrary memory, or allocating hazard pointers on the stack (which isn’t necessarily a bad idea). Another idea worthy of investigation!
Thank you, Travis, for deciphering and validating a much rougher draft when the preprint dropped, and Paul and Jacob, for helping me clarify this last iteration.
hp_read_swf
There is one huge practical issue with hp_read_swf, our simple and wait-free hazard pointer read sequence that sacrifices lock-free reclamation to avoid x86-specific instructions: when the cleanup loop must help a record forward, it can fill in old values in ways that violate causality.
I noted that the reason for this hole is the lack of ABA protection in HP records… and hp_read_wf is what we’d get if we were to add full ABA protection.
However, given mutual exclusion around the “help” loop in the cleanup function, we don’t need full ABA protection. What we really need to know is whether a given in-flight record we’re about to CAS forward is the same in-flight record for which we read the cell’s value, or was overwritten by a read-side section. We can encode that by stealing one more bit from the target cell address in cell_or_pin.
We already steal the sign bit to distinguish the address of the cell to read (positive), from pinned values (negative). The split makes sense because 64-bit architectures tend to reserve high (negative) addresses for kernel space. I doubt we’ll see full 64-bit address spaces for a while, so it seems safe to steal the next bit (bit 62) to tag cell addresses. The next table summarises the tagging scheme.
0b00xxxx: untagged cell address
0b01xxxx: tagged cell address
0b1yyyyy: helped pinned value
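The tagging scheme above can be sketched with a few C helpers. This assumes 64-bit words stored as uint64_t, with bit 63 as the sign bit; the helper and macro names are hypothetical, and how a pinned value is packed under the sign bit is left open here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: bit 63 marks pinned values, bit 62 tags cell addresses. */
#define HP_PIN_BIT ((uint64_t)1 << 63)
#define HP_TAG_BIT ((uint64_t)1 << 62)

/* 0b1yyyyy: a cleanup thread helped a pinned value in. */
static inline bool
hp_is_pinned(uint64_t cell_or_pin)
{
    return (cell_or_pin & HP_PIN_BIT) != 0;
}

/* 0b01xxxx: a cell address that the cleanup thread has tagged. */
static inline bool
hp_is_tagged_cell(uint64_t cell_or_pin)
{
    return (cell_or_pin & (HP_PIN_BIT | HP_TAG_BIT)) == HP_TAG_BIT;
}

/* 0b00xxxx -> 0b01xxxx; only meaningful for cell addresses. */
static inline uint64_t
hp_tag_cell(uint64_t cell)
{
    return cell | HP_TAG_BIT;
}

/* Strips the tag to recover the address of the cell to read. */
static inline uint64_t
hp_cell_address(uint64_t cell_or_pin)
{
    return cell_or_pin & ~HP_TAG_BIT;
}
```

A reader that overwrites cell_or_pin with a fresh (untagged) cell address implicitly clears the tag, which is what lets the helper detect that the record it tagged has been reused.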
At a high level, we’ll change hp_cleanup_swf to tag cell_or_pin before reading the value pointed to by cell, and only CAS in the new pinned value if the cell is still tagged. Thanks to mutual exclusion, we know cell_or_pin can’t be re-tagged by another thread.
Only the for record in records block has to change.

We doubled the number of atomic operations in the helping loop, but that’s acceptable on the slow path. We also rely on strong compare_exchange (compare-and-swap) acting as a store-load fence. If that doesn’t come for free, we could also tag records in one pass, issue a store-load fence, and help tagged records in a second pass.
Another practical issue with the swf/wf cleanup approaches is that they require two membarriers, and OS-provided implementations can be slow, especially under load. This is particularly important for the swf approach, since mutual exclusion means that one slow membarrier delays the physical destruction of everything on the limbo list.
I don’t think we can get rid of mutual exclusion, and, while we can improve membarrier latency, reducing the number of membarriers on the reclamation path is always good.
We can software pipeline calls to the cleanup function, and use the same membarrier for C1 and C2 in two consecutive cleanup calls. Overlapping cleanups decreases the worst-case reclaim latency from 4 membarriers to 3, and that’s not negligible when each membarrier can block for 30 ms.
I also tend to read anything by Guy Blelloch. ↩
In fact, I’ve often argued that SMR is garbage collection, just not fully tracing GC. Hazard pointers in particular look a lot like deferred reference counting, a form of tracing GC. ↩
Something that wasn’t necessarily obvious until then. See, for example, this article presented at PPoPP 2020, which conjectures that “making the original Hazard Pointers scheme or epoch-based reclamation completely wait-free seems infeasible;” Blelloch was in attendance, so this must have been a fun session. ↩
The SMR problem is essentially the same problem as determining when it’s safe to return from rcu_synchronize or execute a rcu_call callback. Hence, the same reader-writer lock analogy holds. ↩
Hazard pointer records must still be managed separately, e.g., with a type stable allocator, but we can bootstrap everything else once we have a few records per thread. ↩
We can even do that without keeping track of the number of nodes that were previously pinned by hazard pointer records and kept in the limbo list: each record can only pin at most one node, so we can wait until the limbo list is, e.g., twice the size of the record set. ↩
Let’s also hope efforts like CHERI don’t have to break lock-free code. ↩
The scheme is correct with any actual guess; we could even use a random number generator. However, performance is ideal (the loop exits) when we “guess” by reading the current value of the pointer we want to borrow safely. ↩
I tend to implement lock-free algorithms with a heavy dose of inline assembly, or Concurrency Kit’s wrappers: it’s far too easy to run into subtle undefined behaviour. For example, comparing a pointer after it has been freed is UB in C and C++, even if we don’t access the pointee. Even if we compare as uintptr_t, it’s apparently debatable whether the code is well defined when the comparison happens to succeed because the pointee was freed, then recycled in an allocation and published to cell again. ↩
In Reclaiming Memory for Lock-Free Data Structures: There has to be a Better Way, Trevor Brown argues this requirement is a serious flaw in all hazard pointer schemes. I think it mostly means applications should be careful to coarsen the definition of resource in order to ensure the resulting condensed heap graph is a DAG. In extreme cases, we end up proxy collecting an epoch, but we can usually do much better. ↩
That’s what Travis Downs classifies as a Level 1 concurrency cost, which is usually fine for writes, but adds a sizable overhead to simple read-only code. ↩
I suppose this means the reclamation path isn’t wait-free, or even lock-free, anymore, in the strict sense of the words. In practice, we’re simply waiting for periodic events that would occur regardless of the syscalls we issue. People who really know what they’re doing might have fully isolated cores. If they do, they most likely have a watchdog on their isolated and latency-sensitive tasks, so we can still rely on running some code periodically, potentially after some instrumentation: if an isolated task fails to check in for a short while, the whole box will probably be presumed wedged and taken offline. ↩
With the caveat that the public documentation for process_vm_readv does not mention any atomic load guarantee. In practice, I saw a long-by-long copy loop the last time I looked at the code, and I’m pretty sure the kernel’s build flags prevent GCC/clang from converting it to memcpy. We could rely on the strong “don’t break userspace” culture, but it’s probably a better idea to try and get that guarantee in writing. ↩
This problem feels like something we could address with a coarse epoch-based versioning scheme. It’s however not clear to me that the result would be much simpler than hp_read_wf, and we’d have to steal even more bits (2 bits, I expect) from cell_or_pin to make room for the epoch. EDIT 2020-07-09: it turns out we only need to steal one bit. ↩
Although I did compare several independent executions and confirmed the reported cycle counts were stable, I did not try to randomise code generation… mostly because I’m not looking for super fine differences as much as close enough runtimes. Hopefully, aligning functions to 256 bytes leveled some bias away. ↩
In the fourth installment of his series on sorting with AVX2, @damageboy has a short aside where he tries to detect partitioning (pivot) patterns where elements less than and greater than or equal to the pivot are already in the correct order: in that case, the partitioning routine does not need to permute the block of values. The practical details are irrelevant for this post; what matters is that we wish to quickly identify whether a byte value matches any of the following nine cases:
0b11111111
0b11111110
0b11111100
0b11111000
0b11110000
0b11100000
0b11000000
0b10000000
0b00000000
Looking at the bit patterns,^{1} the OP’s solution with popcount and bitscan is pretty natural. These instructions are somewhat complex (latency closer to 3 cycles than 1, and often port restricted), and it seems like the sort of problem that would have had efficient solutions before SSE4 finally graced x86 with a population count instruction.
In the context of a sorting library’s partition loop, popcnt and bsf are probably more than good enough: the post shows that the real issue is branch mispredictions being slower than permuting unconditionally.
This is just a fun challenge to think about (:
is_power_of_two
Detecting whether a machine integer is a power of two (or zero) is another task that has a straightforward solution in terms of popcount or bitscan. There’s also a simpler classic solution to this problem:
x == 0 || is_power_of_two(x) <==> (x & (x - 1)) == 0
How does that expression work? Say x is a power of two. Its binary representation is 0b0...010...0: any number of leading zeros,^{2} a single “1” bit, and trailing zeros (maybe none). Let’s see what happens when we subtract 1 from x:
x           = 0b00...0010...0
x - 1       = 0b00...0001...1
x & (x - 1) = 0b00...0000...0
The subtraction triggered a chain of borrows throughout the trailing zeros, until we finally hit that 1 bit. In decimal, subtracting one from 10...0 yields 09...9; in binary we instead find 01...1. If you ever studied the circuit depth (latency) of carry chains (for me, that was for circuit complexity theory), you know that this is difficult to do well. Luckily for us, chip makers work hard to pull it off, and we can just use carries as a data-controlled primitive to efficiently flip ranges of bits.
When x is a power of two, x and x - 1 have no “1” bit in common, so taking the bitwise and yields zero. That’s also true when x is 0, since anding anything with 0 yields zero. Let’s see what happens for non-zero, non-power-of-two values x = 0bxx...xx10...0, i.e., where x consists of an arbitrary non-zero sequence of bits xx...xx followed by the least set bit (there’s at least one, since x is neither zero nor a power of two), and the trailing zeros:
x           = 0bxx...xx10...0
x - 1       = 0bxx...xx01...1
x & (x - 1) = 0bxx...xx00...0
The leading not-all-zero 0bxx...xx is unaffected by the subtraction, so it passes through the bitwise and unscathed (anding any bit with itself yields that same bit), and we know there’s at least one non-zero bit in there; our test correctly rejects it!
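The test above translates directly to C. A minimal sketch on 64-bit unsigned integers (the function names are illustrative); unsigned arithmetic keeps the x == 0 case well defined, since 0 - 1 wraps around to all ones:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when x is zero or has exactly one bit set. */
static inline bool
is_zero_or_power_of_two(uint64_t x)
{
    /* x - 1 flips the lowest set bit and everything below it. */
    return (x & (x - 1)) == 0;
}

/* Rejects zero explicitly to keep only the powers of two. */
static inline bool
is_power_of_two(uint64_t x)
{
    return x != 0 && is_zero_or_power_of_two(x);
}
```

For example, 96 = 0b1100000 is rejected because 96 & 95 = 0b1000000 keeps the leading bit, while 64 & 63 is zero.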
When decoding variable length integers in ULEB format, e.g., for protocol buffers, it quickly becomes clear that, in order to avoid byte-at-a-time logic, we must rapidly segment (lex or tokenize, in a way) our byte stream to determine where each ULEB ends. Let’s focus on the fast path, when the encoded ULEB fits in a machine register.
We have uleb = 0bnnnnnnnnmmmmmmmm...0zzzzzzz1yyyyyyy1...: a sequence of bytes^{3} with the topmost bit equal to 1, terminated by a byte with the top bit set to 0, and, finally, arbitrary nuisance bytes (m...m, n...n, etc.) we wish to ignore. Ideally, we’d extract data = 0b0000000000000000...?zzzzzzz?yyyyyyy?... from uleb: we want to clear the nuisance bytes, and are fine with arbitrary values in the ULEB’s control bits.
It’s much easier to find bits set to 1 than to zero, so the first thing to do is to complement the ULEB data and clear out everything but potential ULEB control bits (the high bit of each byte), with c = ~uleb & (128 * (WORD_MAX / 255)), i.e., compute the bitwise and of ~uleb with a bitmask of the high bit in each byte.
uleb = 0bnnnnnnnnmmmmmmmm...0zzzzzzz1yyyyyyy1...
~uleb = 0b̅n̅n̅n̅n̅n̅n̅n̅n̅m̅m̅m̅m̅m̅m̅m̅m̅...1z̅z̅z̅z̅z̅z̅z0y̅y̅y̅y̅y̅y̅y0...
c = 0b̅n̅0000000̅m̅0000000...10000000000000000...
We could now bitscan to find the index of the first 1 (marking the last ULEB byte), and then generate a mask. However, it seems wasteful to generate an index with a scan, only to convert it back into bitmap space with a shift. We’ll probably still want that index to know how far to advance the decoder’s cursor, but we can hopefully update the cursor in parallel with decoding the current ULEB value.
When we were trying to detect powers of two, we subtracted 1 from x, a value kind of like c, in order to generate a new value that differed from x in all the bits up to and including the first set (equal to 1) bit of x, and identical in the remaining bits. We then used the fact that anding a bit with itself yields that same bit to detect whether there was any non-zero bit in the remainder.
Here, we wish to do something else with the remaining untouched bits: we wish to set them all to zero. Another bitwise operator does what we want: xoring a bit with itself always yields zero, while xoring bits that differ yields 1. That’s the plan for ULEB. We’ll subtract 1 from c and xor that back with c.
uleb = 0bnnnnnnnnmmmmmmmm...0zzzzzzz1yyyyyyy1...
~uleb = 0b̅n̅n̅n̅n̅n̅n̅n̅n̅m̅m̅m̅m̅m̅m̅m̅m̅...1z̅z̅z̅z̅z̅z̅z0y̅y̅y̅y̅y̅y̅y0...
c = 0b̅n̅0000000̅m̅0000000...10000000000000000...
c - 1 = 0b̅n̅0000000̅m̅0000000...01111111111111111...
c ^ (c - 1) = 0b0000000000000000...11111111111111111...
We now just have to bitwise and uleb with c ^ (c - 1) to obtain the bits of the first ULEB value in uleb, while overwriting everything else with 0. Once we have that, we can either extract data bits with PEXT on recent Intel chips, or otherwise dust off interesting stunts for SWAR shifts by variable amounts.
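The masking steps above can be sketched in C for 64-bit words, assuming the word was loaded in little-endian order so that the first ULEB byte sits in the low bits. The names uleb_mask and uleb_extract are illustrative, and the extraction loop is only a simple portable stand-in for PEXT or a SWAR routine, not a fast replacement.

```c
#include <assert.h>
#include <stdint.h>

/* High bit of every byte: 128 * (UINT64_MAX / 255) == 0x8080808080808080. */
#define ULEB_CONTROL_BITS (128 * (UINT64_MAX / 255))

/*
 * `word` holds 8 bytes loaded in little-endian order, starting at the
 * first byte of a ULEB value. Returns a mask that covers every byte up to
 * and including the terminating byte (all ones if there is none).
 */
static inline uint64_t
uleb_mask(uint64_t word)
{
    uint64_t c = ~word & ULEB_CONTROL_BITS;

    return c ^ (c - 1);
}

/* Portable stand-in for PEXT: gathers the low 7 bits of each byte. */
static inline uint64_t
uleb_extract(uint64_t masked)
{
    uint64_t value = 0;

    for (unsigned i = 0; i < 8; i++)
        value |= ((masked >> (8 * i)) & 0x7f) << (7 * i);
    return value;
}
```

With the bytes 0xac 0x02 (the encoding of 300) followed by nuisance bytes, uleb_mask returns 0xffff, and uleb_extract(word & uleb_mask(word)) recovers 300.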
Let’s first repeat the question that motivated this post. We want to detect when a byte p
is one of the following nine values:
0b11111111
0b11111110
0b11111100
0b11111000
0b11110000
0b11100000
0b11000000
0b10000000
0b00000000
These bit patterns feel similar to those for power of two bytes: if we complement the bits, these values are all 1 less than a power of two (or -1, one less than zero). We already know how to detect when a value x is zero or a power of two (x & (x - 1) == 0), so it’s easy to instead determine whether ~p is one less than zero or a power of two: (~p + 1) & ~p == 0.
This is already pretty good: bitwise not
the byte p
,
and check if it’s one less than zero or a power of two (three simple
instructions on the critical path). We can do better.
There’s another name for ~p + 1, i.e., for bitwise complementing a value and adding one: that’s simply -p, the additive inverse of p in two’s complement! We can use -p & ~p == 0. That’s one fewer instruction on the critical path of our dependency graph (down to two, since we can test whether anding yields zero), and still only uses simple instructions that are unlikely to be port constrained.
Let’s check our logic by enumerating all bytesized values.
CL-USER> (dotimes (p 256)
           (when (zerop (logand (- p) (lognot p) 255))
             (format t "0b~2,8,'0r~%" p)))
0b00000000
0b10000000
0b11000000
0b11100000
0b11110000
0b11111000
0b11111100
0b11111110
0b11111111
These are the bytes we’re looking for (in ascending rather than descending order)!
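The same check can be sketched in C. Since C promotes a byte to int before arithmetic, this version computes in unsigned int and masks the result back to 8 bits, mirroring the 255 in the Lisp expression; the function name is an illustrative assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when p is a run of 1 bits followed by a run of 0 bits (either may be empty). */
static inline bool
is_ones_then_zeros(uint8_t p)
{
    unsigned int x = p;

    /* -x & ~x, truncated back to the 8 bits of a byte. */
    return ((0u - x) & ~x & 0xffu) == 0;
}
```

Looping over all 256 byte values accepts exactly the nine patterns listed above.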
I hope the examples above communicated a pattern I often observe when mangling bits: operations that are annoying (not hard, just a bit more complex than we’d like) in the bitmap domain can be simpler in two’s complement arithmetic. Arithmetic operations are powerful mutators for bitmaps, but they’re often hard to control. Subtracting or adding 1 are the main exceptions: it’s easy to describe their impact in terms of the low bits of the bitmap. In fact, we can extend that trick to subtracting or adding powers of two: it’s the same carry/borrow chain effect as for 1, except that bits smaller than the power of two pass straight through… which might be useful when we expect a known tag followed by a ULEB value that must be decoded.
If you find yourself wishing for a way to flip ranges of bits in a datadependent fashion, it’s always worth considering the two’s complement representation of the problem for a couple minutes. Adding or subtracting powers of two doesn’t always work, but the payoff is pretty good when it does.
P.S., Wojciech Muła offers a different 3-operation sequence with p to solve damageboy’s problem. That’s another nice primitive to generate bitmasks dynamically.
Thank you Ruchir for helping me clarify the notation around the ULEB section.