I believe the difference is rooted in the way systems which must bound their pauses will sacrifice nice-to-haves to more cleanly satisfy that hard requirement.
Researchers and library writers instead tend to aim for maximal guarantees or functionality while assuming (sacrificing) as little as possible. Someone who’s sprinkling lock-free algorithms in a large codebase will similarly want to rely on maximally general algorithms, in order to avoid increasing the scope of their work: if tail latency and lock-freedom were high enough priorities to justify wide-ranging changes, the program would probably have been designed that way from the start.
It makes sense to explore general solutions, and academia is definitely a good place to do so. It was fruitful for mathematicians to ask questions about complex numbers, then move on to fields, rings, groups, etc., like sadistic kids probing what a bug can still do as they tear its legs off one by one. Quicksort and mergesort are probably the strongest exemplars of that sort of research in computer science: we don’t even ask what data is made of before assuming a comparison sort is probably a good idea.
It is however more typical to trade something in return for generality. When there’s no impact on performance or resource usage, code complexity usually takes a hit.
When solving a core problem like lock-freedom in a parallel realtime system, we instead ask how much more we can assume, what else we can give up, in order to obtain simpler, more robust solutions. We don’t want generality, we’re not crippling bugs; we want specificity, we’re dosing eggs with Hox to get more legs.
The first time someone used to academic non-blocking algorithms hears about the resulting maximally specialised solutions, they’ll sometimes complain about “cheating.” Of course, it’s never cheating when a requirement actually is met; the surprise merely shows that the rules typically used to evaluate academic solutions are but approximations of reality, and can be broken… and practitioners faced with specific problems are ideally placed to determine what rules they can flout.
My favourite example of such cheating is type-stable memory. The literature on safe memory reclamation (SMR) conflates^{1} two problems that are addressed by SMR algorithms: reclamation races, and the ABA problem.
A reclamation race is what happens when a thread dereferences a pointer to the heap, but the pointee has already been deallocated; even when the racy accesses are loads, they can result in a segmentation fault (and a crash).
The ABA problem is what happens when a descriptor (e.g., a pointer) is reused to refer to something else, but some code is unaware of the swap. For example, a thread could load a global pointer to a logically read-only object, read data off that pointer, sleep for a while, and observe that the global pointer has the same value. That does not mean nothing has changed: while the thread was sleeping, the pointee could have been freed, and then recycled to satisfy a fresh allocation.
Classic SMR algorithms like epoch reclamation and hazard pointers solve both problems at once; in fact, addressing reclamation races is their core contribution (it’s certainly the challenging part), and ABA protection is simply a nice corollary.
However, programs can choose not to crash on benign use-after-free: reading from freed objects only triggers crashes when memory is mapped and unmapped dynamically, and that’s usually not an option for hard realtime systems. On smaller embedded targets, there’s a fixed physical memory budget; either the program fits, or the program is broken. Larger shared-memory parallel programs often can’t afford the IPIs and other hiccups associated with releasing memory to the operating system. Either way, half the problem solved by SMR doesn’t even exist for them.
The other half, ABA, is still an issue… but that subproblem is easier to solve. For example, we can tag data with sequence counters.^{2}
A lock-free multiple producers / single consumer linked stack might be the simplest lock-free data structure.^{3}
Pushing a new record to such a stack is easy:^{4} load the current top of stack pointer, publish that in our new record’s “next” (CDR) field, and attempt to replace the top of stack with a pointer to our new record with a compare-and-swap (CAS).
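As a sketch, the push sequence could look like the following (C11 atomics, with an illustrative `int` payload; the names are made up for this example):

```c
#include <stdatomic.h>
#include <stddef.h>

struct node {
    struct node *next; /* the "CDR" field */
    int value;         /* illustrative payload */
};

static struct node *_Atomic top; /* NULL = empty stack */

/* Publish the current top of stack in our record's "next" field,
 * then try to swing the top-of-stack pointer to our record. */
static void push(struct node *n)
{
    struct node *cur = atomic_load(&top);

    do {
        n->next = cur;
        /* On failure, cur is updated with the current top; retry. */
    } while (!atomic_compare_exchange_weak(&top, &cur, n));
}
```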
How do we consume from such a stack?
The simplest way is to use a fetch-and-set (atomic exchange) to simultaneously clear the stack (set the top-of-stack pointer to the “empty stack” sentinel, e.g., NULL) and read the previous top-of-stack. Any number of consumers can concurrently execute such a batch pop, although only one will grab everything.
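The batch pop is a one-liner over the same kind of representation (again with hypothetical names):

```c
#include <stdatomic.h>
#include <stddef.h>

struct node {
    struct node *next;
    int value;
};

/* Grab the whole stack: atomically replace the top-of-stack pointer
 * with the empty sentinel (NULL) and return the old chain.  Any
 * number of consumers may race on this; exactly one gets the chain. */
static struct node *pop_all(struct node *_Atomic *top)
{
    return atomic_exchange(top, NULL);
}
```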
Alternatively, if there’s only one consumer at a time, it can pop with a compare-and-swap. The consumer must load the current top-of-stack pointer. If the stack is empty, there’s nothing to pop. Otherwise, it can read the top record’s “next” field, and attempt to CAS out the top-of-stack pointer from the one it just read to the “next” record.
The tricky step here is the one where the consumer reads the “next” field in the current top-of-stack record: that step would be subject to reclamation races, except that there’s only one consumer, so we know no one else concurrently popped that record and freed it.
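A sketch of that single-consumer pop; the unguarded read of `cur->next` is only safe because no other consumer can pop (and free) `cur` from under us:

```c
#include <stdatomic.h>
#include <stddef.h>

struct node {
    struct node *next;
    int value;
};

/* Single-consumer pop: load the top, read its "next" field, and CAS
 * the top-of-stack pointer from the former to the latter. */
static struct node *pop(struct node *_Atomic *top)
{
    struct node *cur = atomic_load(top);

    while (cur != NULL) {
        /* Only safe because no one else pops (and frees) cur. */
        struct node *next = cur->next;

        if (atomic_compare_exchange_weak(top, &cur, next))
            return cur;
        /* On failure, cur holds the new top; retry. */
    }

    return NULL; /* empty stack */
}
```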
What can we do to support multiple consumers? Can we simply make sure that stack records are always safe to read, e.g., by freeing them to an object pool? Unfortunately, while use-after-free is benign for producers, it is not for consumers.
The key problem is that a consumer can observe that the top of stack points to record A, and that record A’s “next” field points to B, and then get stuck or sleep for a while. During that time, another thread pops A and B, frees both, pushes C, and then pushes A’, a new record that happens to have the same address as A. Finally, the initial consumer will compare the top-of-stack pointer with A (which also matches A’), and swap that for B, resurrecting a record that has already been consumed and freed.
Full-blown SMR would fix all that. However, if we can instead assume reads after free do not crash (e.g., we use a type-stable allocator or an explicit object pool for records), we simply have to reliably detect when a record has returned to the top of the stack.^{5}
We can do that by tagging the top-of-stack pointer with a sequence counter, and updating both with a double-wide compare-and-swap: instead of CASing the top-of-stack pointer alone, we CAS that pointer together with its monotonically increasing counter. Every successful CAS of the pointer also increments the counter by one, so the sequence counter will differ when a record is popped and pushed back on the stack.
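One way to sketch this, here with a 32-bit record index standing in for the pointer, packed with a 32-bit counter in a single 64-bit word so a plain single-word CAS suffices (the field widths are illustrative; a double-wide CAS would fit full pointers):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Pack a 32-bit record index with a 32-bit sequence counter; a
 * single 64-bit CAS then updates both atomically. */
#define PACK(idx, seq) (((uint64_t)(seq) << 32) | (uint32_t)(idx))
#define INDEX(word) ((uint32_t)(word))
#define SEQ(word) ((uint32_t)((word) >> 32))

/* Every successful update also increments the counter, so a record
 * that returns to the top of the stack carries a different tag. */
static bool try_replace(_Atomic uint64_t *top,
                        uint64_t expected, uint32_t new_idx)
{
    uint64_t desired = PACK(new_idx, SEQ(expected) + 1);

    return atomic_compare_exchange_strong(top, &expected, desired);
}
```

A stale snapshot now fails its CAS even when the same record index comes back, because the counter has moved on.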
There is still a risk of ABA: the counter can wrap around. That’s not a practical concern with 64-bit counters, and there are reasonable arguments that narrower counters are safe because no consumer will stay “stuck” for minutes or hours.^{6}
Sometimes, the application can naturally guarantee that CASed fields are ABA-free.
For example, a hierarchical bump allocator may carve out global type-specific arenas from a shared chunk of address space, and satisfy allocations for each object type from the type’s current arena. Within an arena, allocations are reserved with atomic increments. Similarly, we carve out each arena from the shared chunk of address space with a (larger) atomic increment. Neither bump pointer ever decreases: once a region of address space has been converted to an arena, it stays that way, and once an object has been allocated from an arena, it also remains allocated (although it might enter a freelist). Arenas are also never recycled: once exhausted, they stay exhausted.
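A sketch of the per-arena half (the names and layout are made up for illustration): reserving an allocation is a single fetch-and-add on a bump offset that only ever grows.

```c
#include <stdatomic.h>
#include <stddef.h>

/* One arena: a base pointer, a capacity, and a monotonically
 * increasing bump offset. */
struct arena {
    _Atomic size_t used;
    size_t capacity;
    unsigned char *base;
};

/* Reserve `size` bytes with an atomic increment.  On exhaustion,
 * callers must grab a fresh arena from the shared chunk (not shown);
 * `used` may overshoot the capacity, but it never decreases. */
static void *arena_alloc(struct arena *a, size_t size)
{
    size_t offset = atomic_fetch_add(&a->used, size);

    if (offset + size > a->capacity)
        return NULL; /* exhausted */

    return a->base + offset;
}
```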
When an allocation type has exhausted its current arena (the arena’s bump pointer has reached the arena’s capacity), we want to atomically grab a new arena from the shared chunk of address space, and replace the type’s arena pointer with the newly created arena.
A lock-free algorithm for such a transaction looks like it would have to build on top of multi-word compare-and-swap (MCAS), a hard operation that can be implemented in a wait-free manner, but with complex algorithms.
However, we know that the compare-and-swapped state evolves monotonically: once an arena has been carved out from the shared chunk, it will never be returned as a fresh arena again. In other words, there is no ABA, and a compare-and-swap on an arena pointer will never succeed spuriously.
Monotonicity also means that we can acquire a consistent snapshot of both the type’s arena pointer and the chunk’s allocation state by reading everything twice. Values are never repeated, so any write that executes concurrently with our snapshot loop will be detected: the first and second reads of the updated data will differ.
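For two monotonic words whose values never repeat, the snapshot loop might look like this sketch:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Snapshot two monotonic words.  If `first` re-reads identical, it
 * held that value while we read `second`: values never repeat, so
 * the pair is a consistent point-in-time snapshot. */
static void snapshot(_Atomic uint64_t *first, _Atomic uint64_t *second,
                     uint64_t *fst_out, uint64_t *snd_out)
{
    do {
        *fst_out = atomic_load(first);
        *snd_out = atomic_load(second);
    } while (atomic_load(first) != *fst_out);
}
```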
We also know that the only way a type’s arena pointer can be replaced is by allocating a new one from the shared chunk. If we took a consistent snapshot of the type’s arena pointer and of the shared chunk’s allocation state, and the allocation state hasn’t changed since, the arena pointer must also be unchanged (there’s a hierarchy).
We can combine all these properties to atomically replace a type’s arena pointer with a new one obtained from the shared chunk, using a much simpler core operation, a single-compare multiple-swap (SCMS). We want to execute a series of CASes (one to allocate an arena in the chunk, a few to initialise the arena, and one to publish the new arena), but we can also assume that once the first updated location matches the CAS’s expected value, all other ones will as well. In short, only the first CAS may fail.
That’s the key simplifier compared to full-blown multi-word compare-and-swap algorithms: they have to incrementally acquire update locations, any of which might turn the operation into a failure.
We can instead encode all the CASes in a transaction descriptor, CAS that descriptor in the first update location, and know that the multi-swaps will all succeed iff that CAS is successful.
If the first CAS is successful, we also know that it’s safe to execute the remaining CASes, and finally replace the descriptor with its final value with one last CAS. We don’t even have to publish the descriptor to all updated locations, because concurrent allocations will notice the current arena has been exhausted, and try to get a new one from the shared chunk… at which point they will notice the transaction descriptor.
All the CASes after the first one are safe to execute arbitrarily often thanks to monotonicity. We already know that any descriptor that has been published with the initial CAS will not fail, which means the only potential issue is spuriously successful CASes… but our mutable fields never repeat a value, so that can’t happen.
The application’s guarantee of ABA-safety ends up really simplifying this single-compare multiple-swap algorithm (SCMS), compared to a multi-word compare-and-swap (MCAS). In a typical MCAS implementation, helpers must abort when they detect that they’re helping a MCAS operation that has already failed or already been completed. Our single-compare assumption (once the first CAS succeeds, the operation succeeds) takes care of the first case: helpers never see failed operations. Lack of ABA means helpers don’t have to worry about their CASes succeeding after the SCMS operation has already been completed: they will always fail.
Finally, we don’t even need any form of SMR on the transaction descriptor: a sequence counter in the descriptor and a copy of that counter in a tag next to pointers to that descriptor suffice to disambiguate incarnations of the same physical descriptor.
Specialising to the allocator’s monotonic state let us use single-compare multiple-swap, a simplification of full multi-word compare-and-swap, and further specialising that primitive for monotonic state let us get away with nearly half as many CASes (k + 1 for k locations) as the state of the art for MCAS (2k + 1 for k locations).
There is a common thread between never unmapping allocated addresses, sequence tags, type-stable memory, and the allocator’s single-compare multiple-swap: monotonicity.
The lock-free stack shows how easy it is to conjure up artificial monotonicity. However, when we integrate algorithms more tightly with the program and assume the program’s state is naturally monotonic, we’ll often unlock simpler and more efficient solutions. I also find there’s something of a virtuous cycle: it’s easier for a module to guarantee monotonicity to its components when it itself only has to handle monotonic state, like a sort of end-to-end monotonicity principle.
Unfortunately, it’s not clear how much latent monotonicity there is in real programs. I suppose that makes it hard to publish algorithms that assume its presence. I think it nevertheless makes sense to explore such stronger assumptions, in order to help practitioners estimate what we could gain in exchange for small sacrifices.
Asymmetric synchronisation is widely used these days, but I imagine it was once unclear how much practical interest there might be in that niche; better understood benefits lead to increased adoption. I hope the same can happen for algorithms that assume monotonic state.
Maged Michael’s original Safe Memory Reclamation paper doesn’t: allowing arbitrary memory management is the paper’s main claim. I think there’s a bit of a first mover’s advantage, and researchers are incentivised to play within the sandbox defined by Michael. For example, Arbel-Raviv and Brown in “Reuse, don’t Recycle” practically hide the implementation of their proposal on page 17 (Section 5), perhaps because a straightforward sequence counter scheme is too simple for publication nowadays. ↩
See Reuse, don’t Recycle for a flexible take. ↩
A stack fundamentally focuses contention towards the top-of-stack pointer, so lock-free definitely doesn’t imply scalable. It’s still a good building block. ↩
In assembly language, anyway. Language memory models make this surprisingly hard. For example, any ABA in the push sequence is benign (we still have the correct bit pattern in the “next” field), but C and C++’s pointer provenance rules say that accessing a live object through a pointer to a freed object that happens to alias the new object is undefined behaviour. ↩
Load Linked / Store Conditional solves this specific problem, but that doesn’t mean LL/SC as found on real computers is necessarily a better primitive than compare-and-swap. ↩
Which is nice, because it means we can pack data and sequence counters in 64-bit words, and use the more widely available single-word compare-and-swap. ↩
Like ridiculous fish mentions in his review of integer divisions on M1 and Xeon, certain divisors (those that lose a lot of precision when rounding up to a fraction of the form \(n / 2^k\)) need a different, slower, code path in classic implementations. Powers of two are also typically different, but at least divert to a faster sequence, a variable right shift.
Reciprocal instead uses a unified code path to implement two expressions, \(f_{m,s}(x) = \left\lfloor \frac{m x}{2^s} \right\rfloor\) and \(g_{m^\prime,s^\prime}(x) = \left\lfloor\frac{m^\prime \cdot \min(x + 1, \mathtt{u64::MAX})}{2^{s^\prime}}\right\rfloor\), that are identical except for the saturating increment of \(x\) in \(g_{m^\prime,s^\prime}(x)\).
The first expression, \(f_{m,s}(x)\) corresponds to the usual div-by-mul approximation (implemented in gcc, LLVM, libdivide, etc.) where the reciprocal \(1/d\) is approximated in fixed point by rounding \(m\) up, with the upward error compensated by the truncating multiplication at runtime. See, for example, Granlund and Montgomery’s Division by invariant integers using multiplication.
The second, \(g_{m^\prime,s^\prime}(x)\), is the multiply-and-add scheme described by Robison in N-Bit Unsigned Division Via N-Bit Multiply-Add.
In that approximation, the reciprocal multiplier \(m^\prime\) is rounded down when converting \(1/d^\prime\) to fixed point. At runtime, we then bump the product up (by the largest value \(\frac{n}{2^{s^\prime}} < 1/d^\prime\), i.e., \(\frac{m^\prime}{2^{s^\prime}}\)) before dropping the low bits.
With a bit of algebra, we see that \(m^\prime x + m^\prime = m^\prime (x + 1)\)… and we can use a saturating increment to avoid a 64x65 multiplication as long as we don’t trigger this second expression for divisors \(d^\prime\) for which \(\left\lfloor \frac{\mathtt{u64::MAX}}{d^\prime}\right\rfloor \neq \left\lfloor \frac{\mathtt{u64::MAX} - 1}{d^\prime}\right\rfloor\).
We have a pair of dual approximations, one that rounds the reciprocal up to a fixed point value, and another that rounds down; it makes sense to round to nearest, which nets us one extra bit of precision in the worst case, compared to always applying one or the other.
Luckily,^{1} all of u64::MAX’s factors (except 1 and u64::MAX) work with the “round up” approximation that doesn’t increment, so the saturating increment is always safe when we actually want to use the second “round-down” approximation (unless \(d^\prime \in \{1, \mathtt{u64::MAX}\}\)).
This duality is the reason why Reciprocal can get away with 64-bit multipliers.
Even better, \(f_{m,s}\) and \(g_{m^\prime,s^\prime}\) differ only in the absence or presence of a saturating increment. Rather than branching, Reciprocal executes a data-driven increment by 0 or 1, for \(f_{m,s}(x)\) or \(g_{m^\prime,s^\prime}(x)\) respectively. The upshot: predictable improvements over hardware division, even when dividing by different constants.
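To make this concrete, here is a sketch of the unified sequence with constants I derived by hand (they are illustrative, not necessarily the ones Reciprocal computes): division by 3 uses the round-up approximation (\(m = \lceil 2^{65}/3\rceil\), increment 0), and division by 7 the round-down one (\(m^\prime = \lfloor 2^{66}/7\rfloor\), increment 1).

```c
#include <stdint.h>

/* One fixed-point reciprocal: multiplier, shift, and a data-driven
 * increment (0 for the round-up form f, 1 for the round-down g). */
struct reciprocal {
    uint64_t mult;
    unsigned shift;
    uint64_t inc;
};

/* Same code path for both approximations: saturating increment,
 * full 64x64->128 multiply, then shift the extra bits out. */
static uint64_t apply(struct reciprocal r, uint64_t x)
{
    uint64_t y = x + r.inc;

    if (y < x)
        y = UINT64_MAX; /* saturate */

    return (uint64_t)(((unsigned __int128)r.mult * y) >> r.shift);
}

/* Hand-derived, illustrative constants. */
static const struct reciprocal div3 = { 12297829382473034411ULL, 65, 0 };
static const struct reciprocal div7 = { 10540996613548315209ULL, 66, 1 };
```

The increment is just data, so dividing by 3 and by 7 executes the exact same instructions.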
Summary of the results below: when measuring the throughput of independent divisions on my i7 7Y75 @ 1.3 GHz, Reciprocal consistently needs 1.3 ns per division, while hardware division can only achieve ~9.6 ns / division (Reciprocal needs 14% as much / 86% less time). This looks comparable to the results reported by fish for libdivide when dividing by 7. Fish’s libdivide no doubt does better on nicer divisors, especially powers of two, but it’s good to know that a simple implementation comes close.
We’ll also see that, in Rust land, the fast_divide crate is dominated by strength_reduce, and that strength_reduce is only faster than Reciprocal when dividing by powers of two (although, looking at the disassembly, it probably comes close for single-result latency).
First, results for division with the same precomputed inverse. The timings are from criterion.rs, for \(10^4\) divisions in a tight loop.
The last two options are the crates I considered before writing Reciprocal. The strength_reduce crate switches between a special case for powers of two (implemented as a bitscan and a shift), and a general slow path that handles everything with a 128-bit fixed point multiplier. fast_divide is inspired by libdivide and implements the same three paths: a fast path for powers of two (shift right), a slow path for reciprocal multipliers that need one more bit than the word size (e.g., division by 7), and a regular round-up div-by-mul sequence.
Let’s look at the three cases in that order.
\(10^4\) divisions by 2 (i.e., a mere shift right by 1)
hardware_u64_div_2           time: [92.297 us 95.762 us 100.32 us]
compiled_u64_div_by_2        time: [2.3214 us 2.3408 us 2.3604 us]
reciprocal_u64_div_by_2      time: [12.667 us 12.954 us 13.261 us]
strength_reduce_u64_div_by_2 time: [2.8679 us 2.9190 us 2.9955 us]
fast_divide_u64_div_by_2     time: [2.7467 us 2.7752 us 2.8025 us]
This is the comparative worst case for Reciprocal: while Reciprocal always uses the same code path (1.3 ns/division), the compiler shows we can do much better with a right shift. Both branchy implementations include a special case for powers of two, and thus come close to the compiler, thanks to a predictable branch into a right shift.
\(10^4\) divisions by 7 (a “hard” division)
hardware_u64_div_7           time: [95.244 us 96.096 us 97.072 us]
compiled_u64_div_by_7        time: [10.564 us 10.666 us 10.778 us]
reciprocal_u64_div_by_7      time: [12.718 us 12.846 us 12.976 us]
strength_reduce_u64_div_by_7 time: [17.366 us 17.582 us 17.827 us]
fast_divide_u64_div_by_7     time: [25.795 us 26.045 us 26.345 us]
Division by 7 is hard for compilers that do not implement the “rounded down” approximation described in Robison’s N-Bit Unsigned Division Via N-Bit Multiply-Add. This is the comparative best case for Reciprocal, since it always uses the same code (1.3 ns/division), but most other implementations switch to a slow path (strength_reduce enters a general case that is arguably more complex, but more transparent to LLVM). Even divisions directly compiled with LLVM are ~20% faster than Reciprocal: LLVM does not implement Robison’s round-down scheme, so it hardcodes a more complex sequence than Reciprocal’s.
\(10^4\) divisions by 11 (a regular division)
hardware_u64_div_11           time: [95.199 us 95.733 us 96.213 us]
compiled_u64_div_by_11        time: [7.0886 us 7.1565 us 7.2309 us]
reciprocal_u64_div_by_11      time: [12.841 us 13.171 us 13.556 us]
strength_reduce_u64_div_by_11 time: [17.026 us 17.318 us 17.692 us]
fast_divide_u64_div_by_11     time: [21.731 us 21.918 us 22.138 us]
This is a typical result. Again, Reciprocal can be trusted to work at 1.3 ns/division. Regular round-up div-by-mul works fine when dividing by 11, so code compiled by LLVM only needs a multiplication and a shift, nearly twice as fast as Reciprocal’s generic sequence. The fast_divide crate does do better here than when dividing by 7, since it avoids the slowest path, but Reciprocal is still faster; simplicity pays.
The three microbenchmarks above reward special-casing, since they always divide by the same constant in a loop, and thus always hit the same code path without ever incurring a mispredicted branch.
What happens to independent divisions by unpredictable precomputed divisors, for divisions by 2, 3, 7, or 11 (respectively easy, regular, hard, and regular divisors)?
hardware_u64_div time: [91.592 us 93.211 us 95.125 us]
reciprocal_u64_div time: [17.436 us 17.620 us 17.828 us]
strength_reduce_u64_div time: [40.477 us 41.581 us 42.891 us]
fast_divide_u64_div time: [69.069 us 69.562 us 70.100 us]
The hardware doesn’t care, and Reciprocal is only a bit slower (1.8 ns/division instead of 1.3 ns/division), presumably because the relevant PartialReciprocal^{2} struct must now be loaded in the loop body.

The other two branchy implementations seemingly take a hit proportional to the number of special cases. The strength_reduce hot path only branches once, to detect divisors that are powers of two; its runtime goes from 0.29–1.8 ns/division to 4.2 ns/division (at least 2.4 ns slower/division). The fast_divide hot path, like libdivide’s, switches between three cases, and goes from 0.28–2.2 ns/division to 7.0 ns/division (at least 4.8 ns slower/division).
And that’s why I prefer to start with predictable baseline implementations: unpredictable code with special cases can easily perform well on benchmarks, but, early on during development, it’s hard to tell how the benchmarks may differ from real workloads, and whether the special cases “overfit” on these differences.
With special cases for classes of divisors, most runtime div-by-mul implementations make you guess whether you’ll tend to divide by powers of two, by “regular” divisors, or by “hard” ones in order to estimate how they will perform. Worse, they also force you to take into account how often you’ll switch between the different classes. Reciprocal does not have that problem: its hot path is the same regardless of the constant divisor, so it has the same predictable performance for all divisors,^{3} and there’s only one code path, so we don’t have to worry about class switches.
Depending on the workload, it may make sense to divert to faster code paths, but it’s usually best to start without special cases when it’s practical to do so… and I think Reciprocal shows that, for integer division by constants, it is.
Is it luck? Sounds like a fun number theory puzzle. ↩
The struct is “partial” because it can’t represent divisions by 1 or u64::MAX. ↩
…all divisors except 1 and u64::MAX, which must instead use the more general Reciprocal struct. ↩
Nine months ago, we embarked on a format migration for the persistent (on-disk) representation of variable-length strings like symbolicated call stacks in the Backtrace server. We chose a variant of consistent overhead byte stuffing (COBS), a self-synchronising code, for the metadata (variable-length as well). This choice let us improve our software’s resilience to data corruption in local files, and then parallelise data hydration, which improved startup times by a factor of 10… without any hard migration from the old to the current on-disk data format.
In this post, I will explain why I believe that the representation of first resort for binary logs (write-ahead, recovery, replay, or anything else that may be consumed by a program) should be self-synchronising, backed by this migration and by prior experience with COBS-style encoding. I will also describe the specific algorithm (available under the MIT license) we implemented for our server software.
This encoding offers low space overhead for framing, fast encoding and faster decoding, resilience to data corruption, and a restricted form of random access. Maybe it makes sense to use it for your own data!
A code is self-synchronising when it’s always possible to unambiguously detect where a valid code word (record) starts in a stream of symbols (bytes). That’s a stronger property than prefix codes like Huffman codes, which only detect when valid code words end. For example, the UTF-8 encoding is self-synchronising, because initial bytes and continuation bytes differ in their high bits. That’s why it’s possible to decode multi-byte code points when tailing a UTF-8 stream.
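Concretely, a resynchronising UTF-8 reader only needs to inspect the top two bits of each byte (a minimal sketch):

```c
#include <stdbool.h>
#include <stdint.h>

/* In UTF-8, continuation bytes all match 10xxxxxx; anything else
 * starts a (single- or multi-byte) code point.  A reader dropped at
 * an arbitrary offset can thus skip ahead to the next start byte. */
static bool utf8_is_start(uint8_t byte)
{
    return (byte & 0xC0) != 0x80;
}
```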
The UTF-8 code was designed for small integers (Unicode code points), and can double the size of binary data. Other encodings are more appropriate for arbitrary bytes; for example, consistent overhead byte stuffing (COBS), a self-synchronising code for byte streams, offers a worst-case space overhead of one byte plus a 0.4% space blow-up.
Self-synchronisation is important for binary logs because it lets us efficiently (with respect to both run time and space overhead) frame records in a simple and robust manner… and we want simplicity and robustness because logs are most useful when something has gone wrong.
Of course, the storage layer should detect and correct errors, but things will sometimes fall through, especially for on-premises software, where no one fully controls deployments. When that happens, graceful partial failure is preferable to, e.g., losing all the information in a file because one of its pages went to the great bit bucket in the sky.
One easy solution is to spread the data out over multiple files or blobs. However, there’s a trade-off between keeping data fragmentation and file metadata overhead in check, and minimising the blast radius of minor corruption. Our server must be able to run on isolated nodes, so we can’t rely on design options available to replicated systems… plus bugs tend to be correlated across replicas, so there is something to be said for defense in depth, even with distributed storage.
When each record is converted with a self-synchronising code like COBS before persisting to disk, we can decode all records that weren’t directly impacted by corruption, exactly like decoding a stream of mostly valid UTF-8 bytes. Any form of corruption will only make us lose the records whose bytes were corrupted, and, at most, the two records that immediately precede or follow the corrupt byte range. This guarantee covers overwritten data (e.g., when a network switch flips a bit, or a read syscall silently errors out with a zero-filled page), as well as bytes removed or garbage inserted in the middle of log files.
The coding doesn’t store redundant information: replication or erasure coding is the storage layer’s responsibility. It instead guarantees to always minimise the impact of corruption, and only lose records that were adjacent to or directly hit by corruption.
A COBS encoding for log records achieves that by unambiguously separating records with a reserved byte (e.g., 0), and re-encoding each record to avoid that separator byte. A reader can thus assume that potential records start and end at a log file’s first and last bytes, and otherwise look for separator bytes to determine where to cut all potential records. These records may be invalid: a separator byte could be introduced or removed by corruption, and the contents of a correctly framed record may be corrupt. When that happens, readers can simply scan for the next separator byte and try to validate that new potential record. The decoder’s state resets after each separator byte, so any corruption is “forgotten” as soon as the decoder finds a valid separator byte.
On the write side, the encoding logic is simple (a couple dozen lines of C code), and uses a predictable amount of space, as expected from an algorithm suitable for microcontrollers.
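For reference, here is a sketch of the textbook COBS encoder and decoder (the classic algorithm, not necessarily the exact variant we ship):

```c
#include <stddef.h>
#include <stdint.h>

/* Classic COBS: replace each zero byte with the distance to the next
 * zero (or to the end), so the output contains no zero byte.  The
 * output buffer must have room for len + len / 254 + 1 bytes. */
static size_t cobs_encode(const uint8_t *src, size_t len, uint8_t *dst)
{
    size_t read = 0, write = 1, code_pos = 0;
    uint8_t code = 1;

    while (read < len) {
        if (src[read] == 0) {
            dst[code_pos] = code;
            code = 1;
            code_pos = write++;
        } else {
            dst[write++] = src[read];
            if (++code == 0xFF) { /* maximal run: restart the block */
                dst[code_pos] = code;
                code = 1;
                code_pos = write++;
            }
        }

        read++;
    }

    dst[code_pos] = code;
    return write;
}

static size_t cobs_decode(const uint8_t *src, size_t len, uint8_t *dst)
{
    size_t read = 0, write = 0;

    while (read < len) {
        uint8_t code = src[read++];

        for (uint8_t i = 1; i < code && read < len; i++)
            dst[write++] = src[read++];

        if (code != 0xFF && read < len)
            dst[write++] = 0; /* implicit zero between blocks */
    }

    return write;
}
```

Both directions are single passes over the data, with no lookahead beyond the current block.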
Actually writing encoded data is also easy: on POSIX filesystems, we can make sure each record is delimited (e.g., prefixed with the delimiter byte), and issue a regular O_APPEND write(2). Vectored writes can even insert delimiters without copying in userspace. Realistically, our code is probably less stable than the operating system and the hardware it runs on, so we make sure our writes make it to the kernel as soon as possible, and let fsyncs happen on a timer.
When a write errors out, we can blindly (maybe once or twice) try again: the encoding is independent of the output file’s state. When a write is cut short, we can still issue the same^{1} write call, without trying to “fix” the short write: the encoding and the read-side logic already protect against that kind of corruption.
What if multiple threads or processes write to the same log file?
When we open with O_APPEND, the operating system can handle the rest. This doesn’t make contention disappear, but at least we’re not adding a bottleneck in userspace on top of what is necessary to append to the same file.
Buffering is also trivial: the encoding is independent of the state of the destination file, so we can always concatenate buffered records and write the result with a single syscall.
This simplicity also plays well with high-throughput I/O primitives like io_uring, and with blob stores that support appends: independent workers can concurrently queue up blind append requests and retry on failure. There’s no need for application-level mutual exclusion or rollback.
Our log encoding will recover from bad bytes, as long as readers can detect and reject invalid records as a whole; the processing logic should also handle duplicated valid records. These are table stakes for a reliable log consumer.
In our variable-length metadata use case, each record describes a symbolicated call stack, and we recreate in-memory data structures by replaying an append-only log of metadata records, one for each unique call stack. The hydration phase handles invalid records by ignoring (not recreating) any call stack with corrupt metadata, but only those call stacks. That’s definitely an improvement over the previous situation, where corruption in a size header would prevent us from decoding the remainder of the file, and thus make us forget about all call stacks stored at file offsets after the corruption.
Of course, losing data should be avoided, so we are careful to fsync regularly and recommend reasonable storage configurations. However, one can only make data loss unlikely, not impossible (if only due to fat fingering), especially when cost is a factor. With the COBS encoding, we can recover gracefully and automatically from any unfortunate data corruption event.
We can also turn this robustness into new capabilities.
It’s often useful to process the tail of a log at a regular cadence. For example, I once maintained a system that regularly tailed hourly logs to update approximate views. One could support that use case with length footers. COBS framing lets us instead scan for a valid record from an arbitrary byte location, and read the rest of the data normally.
When logs grow large enough, we want to process them in parallel. The standard solution is to shard log streams, which unfortunately couples the parallelisation and storage strategies, and adds complexity to the write side.
COBS framing lets us parallelise readers independently of the writer. The downside is that the read-side code and I/O patterns are now more complex, but, all other things being equal, that’s a trade-off I’ll gladly accept, especially given that our servers run on independent machines and store their data in files, where reads are fine-grained and latency relatively low.
A parallel COBS reader partitions a data file arbitrarily (e.g., in fixed size chunks) for independent workers. A worker will scan for the first valid record starting inside its assigned chunk, and handle every record that starts in its chunk. Filtering on the start byte means that a worker may read past the logical end of its chunk, when it fully decodes the last record that starts in the chunk: that’s how we unambiguously assign a worker to every record, including records that straddle chunk boundaries.
Random access even lets us implement a form of binary or interpolation search on raw unindexed logs, when we know the records are (k-)sorted on the search key! This lets us, e.g., access the metadata for a few call stacks without parsing the whole log.
Eventually, we might also want to truncate our logs.
Contemporary filesystems like XFS (and even Ext4) support large sparse files. For example, sparse files can reach \(2^{63} - 1\) bytes on XFS with a minimal metadata-only footprint: the on-disk data for such sparse files is only allocated when we issue actual writes. Nowadays, we can sparsify files after the fact, and convert ranges of non-zero data into zero-filled “holes” in order to release storage without messing with file offsets (or even atomically collapse old data away).
Filesystems can only execute these operations at coarse granularity, but that’s not an issue for our readers: they must merely remember to skip sparse holes, and the decoding loop will naturally handle any garbage partial record left behind.
Cheshire and Baker’s original byte stuffing scheme targets small machines and slow transports (amateur radio and phone lines). That’s why it bounds the amount of buffering needed to 254 bytes for writers and 9 bits of state for readers, and attempts to minimise space overhead, beyond its worst-case bound of 0.4%.
The algorithm is also reasonable. The encoder buffers data until it
encounters a reserved 0 byte (a delimiter byte), or there are 254
bytes of buffered data. Whenever the encoder stops buffering, it
outputs a block whose contents are described by its first byte. If
the writer stopped buffering because it found a reserved byte, it
emits one byte with buffer_size + 1
before writing and clearing the
buffer. Otherwise, it outputs 255 (one more than the buffer size),
followed by the buffer’s contents.
On the decoder side, we know that the first byte of each block describes its size and decoded value (255 means 254 bytes of literal data, any other value is one more than the number of literal bytes to copy, followed by a reserved 0 byte). We denote the end of a record with an implicit delimiter: when we run out of data to decode, we should have just decoded an extra delimiter byte that’s not really part of the data.
With framing, an encoded record surrounded by delimiters thus looks like the following
|0 |blen|(blen - 1) literal data bytes....|blen|literal data bytes ...|0 |
The delimiting “0” bytes are optional at the beginning and end of a file, and each blen size prefix is one byte with value in \([1, 255]\). A value \(\mathtt{blen} \in [1, 254]\) represents a block of \(\mathtt{blen} - 1\) literal bytes, followed by an implicit 0 byte. If we instead have \(\mathtt{blen} = 255\), we have a block of \(254\) bytes, without any implicit byte. Readers only need to remember how many bytes remain until the end of the current block (eight bits for a counter), and whether they should insert an implicit 0 byte before decoding the next block (one binary flag).
We have different goals for the software we write at Backtrace. For our logging use case, we pass around fully constructed records, and we want to issue a single write syscall per record, with periodic fsync.^{2} Buffering is baked in, so there’s no point in making sure we can work with a small write buffer. We also don’t care as much about the space overhead (the worst-case bound is already pretty good) as we do about encoding and decoding speed.
These different design goals lead us to an updated hybrid word/byte stuffing scheme, described below.
This hybrid scheme improves encoding and decoding speed compared to COBS, and even marginally improves the asymptotic space overhead. At the low end, the worst-case overhead is only slightly worse than that of traditional COBS: we need three additional bytes, including the framing separator, for records of 252 bytes or fewer, and five bytes for records of 253-64260 bytes.
In the past, I’ve seen “word” stuffing schemes aim to reduce the run-time overhead of COBS codecs by scaling up the COBS loops to work on two or four bytes at a time. However, a byte search is trivial to vectorise, and there is no guarantee that frameshift corruption will be aligned to word boundaries (for example, POSIX allows short writes of an arbitrary number of bytes).
Our hybrid word-stuffing looks for a reserved two-byte delimiter sequence at arbitrary byte offsets. We must still conceptually process bytes one at a time, but delimiting with a pair of bytes instead of with a single byte makes it easier to craft a delimiter that’s unlikely to appear in our data.
Cheshire and Baker do the opposite, and use a frequent byte (0) to
eliminate the space overhead in the common case. We care a lot more
about encoding and decoding speed, so an unlikely delimiter makes more
sense for us. We picked 0xfe 0xfd because that sequence doesn’t appear in small integers (unsigned, two’s complement, varint, single or double float) regardless of endianness, nor in valid UTF-8 strings.
Any positive integer with 0xfe 0xfd (254 253) in its bytes must be around \(2^{16}\) or more. If the integer is instead negative in two’s complement, 0xfe 0xfd equals -514 as a little-endian int16_t, and -259 in big endian (not as great, but not nothing). Of course, the sequence could appear in two adjacent uint8_ts, but otherwise, 0xfe or 0xfd can only appear in the most significant byte of large 32- or 64-bit integers (unlike 0xff, which could be sign extension for, e.g., -1).
Any (U)LEB varint that includes 0xfe 0xfd must span at least 3 bytes (i.e., 15 bits), since both these bytes have the most significant bit set to 1. Even a negative SLEB has to be at least as negative as \(-2^{14} = -16384\).
For floating point types, we can observe that 0xfe 0xfd in the significand would represent an awful fraction in little or big endian, so it can only happen for the IEEE-754 representation of large integers (approximately \(\pm 2^{15}\)). If we instead assume that 0xfd or 0xfe appear in the sign and exponent fields, we find either very positive or very negative exponents (the exponent is biased, instead of complemented). A semi-exhaustive search confirms that the smallest integer-valued single float that includes the sequence is 32511.0 in little endian and 130554.0 in big endian; among integer-valued double floats, we find 122852.0 and 126928.0 respectively.

Finally, the sequence isn’t valid UTF-8 because both 0xfe and 0xfd have their top bit set (indicating a multi-byte code point), but neither looks like a continuation byte: the two most significant bits are 0b11 in both cases, while UTF-8 continuations must have 0b10.
Consistent overhead byte stuffing rewrites reserved 0 bytes away by counting the number of bytes from the beginning of a record until the next 0, and storing that count in a block size header followed by the non-reserved bytes, then resetting the counter, and doing the same thing for the remainder of the record. A complete record is stored as a sequence of encoded blocks, none of which include the reserved byte 0. Each block header spans exactly one byte, and must never itself be 0, so the byte count is capped at 254, and incremented by one (e.g., a header value of 1 represents a count of 0); when the count in the header is equal to the maximum, the decoder knows that the encoder stopped short without finding a 0.
With our two-byte reserved sequence, we can encode the size of each block in radix 253 (0xfd); given a two-byte header for each block, sizes can go up to \(253^2 - 1 = 64008\). That’s a reasonable granularity for memcpy. This radix conversion replaces the off-by-one weirdness in COBS: that part of the original algorithm merely encodes values from \([0, 254]\) into one byte while avoiding the reserved byte 0.

A two-byte size prefix is a bit ridiculous for small records (ours tend to be on the order of 30-50 bytes). We thus encode the first block specially, with a single byte in \([0, 252]\) for the size prefix. Since the reserved sequence 0xfe 0xfd is unlikely to appear in our data, the encoding for short records often boils down to adding a uint8_t length prefix.
A framed encoded record now looks like
|0xfe|0xfd|blen|blen literal bytes...|blen_1|blen_2|literal bytes...|0xfe|0xfd|
The first blen is in \([0, 252]\) and tells us how many literal bytes follow in the initial block. If the initial \(\mathtt{blen} = 252\), the literal bytes are immediately followed by the next block’s decoded contents. Otherwise, we must first append an implicit 0xfe 0xfd sequence… which may be the artificial reserved sequence that marks the end of every record.

Every subsequent block comes with a two-byte size prefix, in little-endian radix-253. In other words, |blen_1|blen_2| represents the block size \(\mathtt{blen}_1 + 253 \cdot \mathtt{blen}_2\), where \(\mathtt{blen}_{1,2} \in [0, 252]\). Again, if the block size is the maximum encodable size, \(253^2 - 1 = 64008\), we have literal data followed by the next block; otherwise, we must append a 0xfe 0xfd sequence to the output before moving on to the next block.
The encoding algorithm is only a bit more complex than for the original COBS scheme.
Assume the data to encode is suffixed with an artificial two-byte reserved sequence 0xfe 0xfd.

For the first block, look for the reserved sequence in the first 252 bytes. If we find it, emit its position (must be less than 252) in one byte, then all the data bytes up to but not including the reserved sequence, and enter regular encoding after the reserved sequence. If the sequence isn’t in the first block, emit 252, followed by 252 bytes of data, and enter regular encoding after those bytes.
For regular (all but the first) blocks, look for the reserved sequence in the next 64008 bytes. If we find it, emit the sequence’s byte offset (must be less than 64008) in little-endian radix 253, followed by the data up to but not including the reserved sequence, and skip that sequence before encoding the rest of the data. If we don’t find the reserved sequence, emit 64008 in radix 253 (0xfc 0xfc), copy the next 64008 bytes of data, and encode the rest of the data without skipping anything.
Remember that we conceptually padded the data with a reserved sequence at the end. This means we’ll always observe that we fully consumed the input data at a block boundary. When we encode the block that stops at the artificial reserved sequence, we stop (and frame with a reserved sequence to delimit a record boundary).
You can find our implementation in the stuffed-record-stream repository.
When writing short records, we already noted that the encoding step is often equivalent to adding a one-byte size prefix. In fact, we can encode and decode all records of size up to \(252 + 64008 = 64260\) bytes in place, and only ever have to slide the initial 252-byte block: whenever a block is shorter than the maximum length (252 bytes for the first block, 64008 for subsequent ones), that’s because we found a reserved sequence in the decoded data. When that happens, we can replace the reserved sequence with a size header when encoding, and undo the substitution when decoding.
Our code does not implement these optimisations because encoding and decoding stuffed bytes aren’t bottlenecks for our use case, but it’s good to know that we’re nowhere near the performance ceiling.
The stuffing scheme only provides resilient framing. That’s essential, but not enough for an abstract stream or sequence of records. At the very least, we need checksums in order to detect invalid records that happen to be correctly encoded (e.g., when a block’s literal data is overwritten).
Our pre-stuffed records start with the little-endian header
struct record_header {
uint32_t crc;
uint32_t generation;
};
where crc is the crc32c of the whole record, including the header,^{3} and generation is a yet-unused arbitrary 32-bit payload that we added for forward compatibility. There is no size field: the framing already handles that.
The remaining bytes in a record are an arbitrary payload. We use protobuf messages to help with schema evolution (and keep messages small and flat for decoding performance), but there’s no special relationship between the stream of word-stuffed records and the payload’s format.
Our implementation lets writers output to buffered FILE streams, or directly to file descriptors. Buffered streams offer higher write throughput, but are only safe when the caller handles synchronisation and flushing; we use them as part of a commit protocol that fsyncs and publishes files with atomic rename syscalls.
During normal operations, we instead write to file descriptors opened with O_APPEND and a background fsync worker: in practice, the hardware and operating system are more stable than our software, so it’s more important that encoded records immediately make it to the kernel than all the way to persistent storage. We also avoid batching write syscalls because we would often have to wait several minutes if not hours to buffer more than two or three records.
For readers, we can either read from a buffer, or mmap in a file and read from the resulting buffer. While we expose a linear iterator interface, we can also override the start and stop byte offsets of an iterator; we use that capability to replay logs in parallel. Finally, when readers advance an iterator, they can choose to receive a raw data buffer, or have it decoded with a protobuf message descriptor.
We have happily been using this log format for more than nine months to store a log of metadata records that we replay every time the Backtrace server restarts.
Decoupling writes from the parallel read strategy let us improve our startup time incrementally, without any hard migration. Serialising with flexible schemas (protocol buffers) also made it easier to start small and slowly add optional metadata, and only enforce a hard switch-over when we chose to delete backward compatibility code.
This piecemeal approach let us transition from a length-prefixed data format to one where all important metadata lives in a resilient record stream, without any breaking change. We slowly added more metadata to records and eventually parallelised loading from the metadata record stream, all while preserving backward and forward compatibility. Six months after the initial roll out, we flipped the switch and made the new, more robust, format mandatory; the old length-prefixed files still exist, but are now bags of arbitrary checksummed data bytes, with metadata in record streams.
In the past nine months, we’ve gained a respectable amount of pleasant operational experience with the format. Moreover, while performance is good enough for us (the parallel loading phase is currently dominated by disk I/O and parsing in protobuf-c), we also know there’s plenty of headroom: our records are short enough that they can usually be decoded without any write, and always in place.
We’re now laying the groundwork to distribute our single-node embedded database and to make it interact more fluently with other data stores. The first step will be generating a change data capture stream, and re-using the word-stuffed record format was an obvious choice.
Word stuffing is simple, efficient, and robust. If you can’t just defer to a real database (maybe you’re trying to write one yourself) for your log records, give it a shot! Feel free to play with our code if you don’t want to roll your own.
Thank you, Ruchir and Alex, for helping me clarify and restructure an earlier version.
If you append with the delimiter, it probably makes sense to special-case short writes and also prepend with the delimiter after failures, in order to make sure readers will observe a delimiter before the new record. ↩
High-throughput writers should batch records. We do syscall-per-record because the write load for the current use case is so sporadic that any batching logic would usually end up writing individual records. For now, batching would introduce complexity and bug potential for a minimal impact on write throughput. ↩
We overwrite the crc field with UINT32_MAX before computing a checksum for the header and its trailing data. It’s important to avoid zero prefixes because the result of crc-ing a 0 byte into a 0 state is… 0. ↩
The core of UMASH is a hybrid PH/(E)NH block compression function. That function is fast (it needs one multiplication for each 16-byte “chunk” in a block), but relatively weak: despite a 128-bit output, the worst-case probability of collision is \(2^{-64}\).
For a fingerprinting application, we want collision probability less than \(\approx 2^{-70},\) so that’s already too weak, before we even consider merging a variable-length string of compressed block values.
The initial UMASH proposal compresses each block with two independent compression functions. Krovetz showed that we could do so while reusing most of the key material (random parameters), with a Toeplitz extension, and I simply recycled the proof for UMASH’s hybrid compressor.
That’s good for the memory footprint of the random parameters, but doesn’t help performance: we still have to do double the work to get double the hash bits.
Earlier this month, Jim Apple pointed me at a promising alternative that doubles the hash bits with only one more multiplication. The construction adds finite field operations that aren’t particularly efficient in software, on top of the additional 64x64 -> 128 (carryless) multiplication, so it isn’t a slam dunk win over a straightforward Toeplitz extension. However, Jim felt like we could “spend” some of the bits we don’t need for fingerprinting (\(2^{-128}\) collision probability is overkill when we only need \(2^{-70}\)) in order to make do with faster operations.
Turns out he was right! We can use carryless multiplications by sparse constants (concretely, xor-shift and one more shift) without any reducing polynomial, on independent 64-bit halves… and still collide with probability at most \(2^{-98}\).
The proof is fairly simple, but relies on a bit of notation for clarity. Let’s start by re-stating UMASH’s hybrid PH/ENH block compressor in that notation.
The current block compressor in UMASH splits a 256-byte block \(m\) in 16 chunks \(m_i,\, i\in [0, 15]\) of 128 bits each, and processes all but the last chunk with a PH loop,
\[ \bigoplus_{i=0}^{14} \mathtt{PH}(k_i, m_i), \]
where
\[ \mathtt{PH}(k_i, m_i) = ((k_i \bmod 2^{64}) \oplus (m_i \bmod 2^{64})) \odot (\lfloor k_i / 2^{64} \rfloor \oplus \lfloor m_i / 2^{64} \rfloor) \]
and each \(k_i\) is a randomly generated 128-bit parameter.
The compression loop in UMASH handles the last chunk, along with a size tag (to protect against extension attacks), with ENH:
\[ \mathtt{ENH}(k, x, y) = \left( ((k + x) \bmod 2^{64}) \cdot ((\lfloor k / 2^{64}\rfloor + \lfloor x / 2^{64} \rfloor) \bmod 2^{64}) + y \right) \bmod 2^{128}. \]
The core operation in ENH is a full (64x64 -> 128) integer multiplication, which has lower latency than PH’s carryless multiplication on x86-64. That’s why UMASH switches to ENH for the last chunk. We use ENH for only one chunk because combining multiple NH values calls for 128-bit additions, and that’s slower than PH’s xors. Once we have mixed the last chunk and the size tag with ENH, the result is simply xored in with the previous chunks’ PH values:
\[ \left(\bigoplus_{i=0}^{14} \mathtt{PH}(k_i, m_i)\right) \oplus \mathtt{ENH}(k_{15}, m_{15}, \mathit{tag}). \]
This function is annoying to analyse directly, because we end up having to manipulate different proofs of almost-universality. Let’s abstract things a bit, and reduce the ENH/PH to the bare minimum we need to find our collision bounds.
Let’s split our message blocks in \(n\) (\(n = 16\) for UMASH) “chunks”, and apply an independently sampled mixing function to each chunk. Let’s say we have two messages \(m\) and \(m^\prime\) with chunks \(m_i\) and \(m^\prime_i\), for \(i\in [0, n)\), and let \(h_i\) be the result of mixing chunk \(m_i,\) and \(h^\prime_i\) that of mixing \(m^\prime_i.\)
We’ll assume that the first chunk is mixed with a \(2^{-w}\)-almost-universal (\(2^{-64}\) for UMASH) hash function: if \(m_0 \neq m^\prime_0,\) \(\mathrm{P}[h_0 = h^\prime_0] \leq 2^{-w}\) (where the probability is taken over the set of randomly chosen parameters for the mixer). Otherwise, \(m_0 = m^\prime_0 \Rightarrow h_0 = h^\prime_0\).
This first chunk stands for the ENH iteration in UMASH.
Every remaining chunk will instead be mixed with a \(2^{-w}\)-XOR-almost-universal hash function: if \(m_i \neq m^\prime_i\) (\(0 < i < n\)), \(\mathrm{P}[h_i \oplus h^\prime_i = y] \leq 2^{-w}\) for any \(y,\) where the probability is taken over the randomly generated parameter for the mixer.
This stronger condition represents the PH iterations in UMASH.
We hash a full block by xoring all the mixed chunks together:
\[ H = \bigoplus_{i = 0}^{n - 1} h_i, \]
and
\[ H^\prime = \bigoplus_{i = 0}^{n - 1} h^\prime_i. \]
We want to bound the probability that \(H = H^\prime \Leftrightarrow H \oplus H^\prime = 0,\) assuming that the messages differ (i.e., there is at least one index \(i\) such that \(m_i \neq m^\prime_i\)).
If the two messages only differ in \(m_0 \neq m^\prime_0\) (and thus \(m_i = m^\prime_i,\,\forall i \in [1, n)\)),
\[ \bigoplus_{i = 1}^{n - 1} h_i = \bigoplus_{i = 1}^{n - 1} h^\prime_i, \]
and thus \(H = H^\prime \Leftrightarrow h_0 = h^\prime_0\).
By hypothesis, the 0th chunks are mixed with a \(2^{-w}\)-almost-universal hash, so this happens with probability at most \(2^{-w}\).
Otherwise, assume that \(m_j \neq m^\prime_j\), for some \(j \in [1, n)\). We will rearrange the expression
\[ H \oplus H^\prime = h_j \oplus h^\prime_j \oplus \left(\bigoplus_{i\in [0, n) \setminus \{ j \}} h_i \oplus h^\prime_i\right). \]
Let’s conservatively replace that unwieldy sum with an adversarially chosen value \(y\):
\[ H \oplus H^\prime = h_j \oplus h^\prime_j \oplus y, \]
and thus \(H = H^\prime\) iff \(h_j \oplus h^\prime_j = y.\) By hypothesis, the \(j\)th chunk (every chunk but the 0th), is mixed with a \(2^{-w}\)-almost-XOR-universal hash, and this thus happens with probability at most \(2^{-w}\).
In both cases, we find a collision probability at most \(2^{-w}\) with a simple analysis, despite combining mixing functions from different families over different rings.
We combined strong mixers (each is \(2^{-w}\)-almost-universal), and only got a \(2^{-w}\)-almost-universal output. It seems like we should be able to do better when two or more chunks differ.
As Nandi points out, we can apply erasure codes to derive additional chunks from the original messages’ contents. We only need one more chunk, so we can simply xor together all the original chunks:
\[m_n = \bigoplus_{i=0}^{n - 1} m_i,\]
and similarly for \(m^\prime_n\). If \(m\) and \(m^\prime\) differ in only one chunk, \(m_n \neq m^\prime_n\). It’s definitely possible for \(m_n = m^\prime_n\) when \(m \neq m^\prime\), but only if two or more chunks differ.
We will again mix \(m_n\) and \(m^\prime_n\) with a fresh \(2^{-w}\)-almost-XOR-universal hash function to yield \(h_n\) and \(h^\prime_n\).
We want to xor the result \(h_n\) and \(h^\prime_n\) with the second (still undefined) hash values \(H_2\) and \(H^\prime_2\); if \(m_n \neq m^\prime_n\), the final xored values are equal with probability at most \(2^{-w}\), regardless of \(H_2\) and \(H^\prime_2\ldots\) and, crucially, independently of \(H \neq H^\prime\).
When the two messages \(m\) and \(m^\prime\) only differ in a single (initial) chunk, mixing a LRC checksum gives us an independent hash function, which squares the collision probability to \(2^{-2w}\).
Now to the interesting bit: we must define a second hash function that combines \(h_0,h_1,\ldots, h_{n - 1}\) and \(h^\prime_0, h^\prime_1, \ldots, h^\prime_{n - 1}\) such that the resulting hash values \(H_2\) and \(H^\prime_2\) collide independently enough of \(H\) and \(H^\prime\). That’s a tall order, but we do have one additional assumption to work with: we only care about collisions in this second hash function if the additional checksum chunks are equal, which means that the two messages differ in two or more chunks (or they’re identical).
For each index \(0 < i < n\), we’ll fix a public linear (with xor as the addition) function \(\overline{xs}_i(x)\). This family of functions must have two properties: \(x \oplus \overline{xs}_i(x)\) must be invertible for each \(i\), and the null space of \(\overline{xs}_i(x) \oplus \overline{xs}_j(x)\) must be small for each pair \(i \neq j\).
For regularity, we will also define \(\overline{xs}_0(x) = x\).
Concretely, let \(\overline{xs}_1(x) = x \mathtt{«} 1\), where the bitshift is computed for the two 64-bit halves independently, and \(\overline{xs}_i(x) = (x \mathtt{«} 1) \oplus (x \mathtt{«} i)\) for \(i > 1\), again with all the bitshifts computed independently over the two 64-bit halves.
To see that these satisfy our requirements, we can represent the functions as carryless multiplication by distinct “even” constants (the least significant bit is 0) on each 64-bit half: \(\overline{xs}_1\) multiplies each half by \(2\), and \(\overline{xs}_i\) multiplies by \(2 \oplus 2^i\) for \(i > 1\).
To recapitulate, we defined the first hash function as
\[ H = \bigoplus_{i = 0}^{n - 1} h_i, \]
the (xor) sum of the mixed value \(h_i\) for each chunk \(m_i\) in the message block \(m\), and similarly for \(H^\prime\) and \(h^\prime_i\).
We’ll let the second hash function be
\[ H_2 \oplus h_n = \left(\bigoplus_{i = 0}^{n - 1} \overline{xs}_i(h_i)\right) \oplus h_n, \]
and
\[ H^\prime_2 \oplus h^\prime_n = \left(\bigoplus_{i = 0}^{n - 1} \overline{xs}_i(h^\prime_i)\right) \oplus h^\prime_n. \]
We can finally get down to business and find some collision bounds. We’ve already shown that both \(H = H^\prime\) and \(H_2 \oplus h_n = H^\prime_2 \oplus h^\prime_n\) collide simultaneously with probability at most \(2^{-2w}\) when the checksum chunks differ, i.e., when \(m_n \neq m^\prime_n\).
Let’s now focus on the case when \(m \neq m^\prime\), but \(m_n = m^\prime_n\). In that case, we know that at least two chunks \(0 \leq i < j < n\) differ: \(m_i \neq m^\prime_i\) and \(m_j \neq m^\prime_j\).
If only two chunks \(i\) and \(j\) differ, and one of them is the \(i = 0\)th chunk, we want to bound the probability that
\[ h_0 \oplus h_j = h^\prime_0 \oplus h^\prime_j \]
and
\[ h_0 \oplus \overline{xs}_j(h_j) = h^\prime_0 \oplus \overline{xs}_j(h^\prime_j), \]
both at the same time.
Letting \(\Delta_i = h_i \oplus h^\prime_i\), we can reformulate the two conditions as
\[ \Delta_0 = \Delta_j \] and \[ \Delta_0 = \overline{xs}_j(\Delta_j). \]
Taking the xor of the two conditions yields
\[ \Delta_j \oplus \overline{xs}_j(\Delta_j) = 0, \]
which is only satisfied for \(\Delta_j = 0\), since \(f(x) = x \oplus \overline{xs}_j(x)\) is an invertible linear function. This also forces \(\Delta_0 = 0\).
By hypothesis, \(\mathrm{P}[\Delta_j = 0] \leq 2^{-w}\), and \(\mathrm{P}[\Delta_0 = 0] \leq 2^{-w}\) as well. These two probabilities are independent, so the probability that both hashes collide is at most \(2^{-2w}\) (\(2^{-128}\)).
In the other case, we have messages that differ in at least two chunks \(0 < i < j < n\): \(m_i \neq m^\prime_i\) and \(m_j \neq m^\prime_j\).
We can simplify the collision conditions to
\[ h_i \oplus h_j = h^\prime_i \oplus h^\prime_j \oplus y \]
and
\[ \overline{xs}_i(h_i) \oplus \overline{xs}_j(h_j) = \overline{xs}_i(h^\prime_i) \oplus \overline{xs}_j(h^\prime_j) \oplus z, \]
for \(y\) and \(z\) generated arbitrarily (adversarially), but without knowledge of the parameters that generated \(h_i, h_j, h^\prime_i, h^\prime_j\).
Again, let \(\Delta_i = h_i \oplus h^\prime_i\) and \(\Delta_j = h_j \oplus h^\prime_j\), and reformulate the conditions into
\[ \Delta_i \oplus \Delta_j = y \] and \[ \overline{xs}_i(\Delta_i) \oplus \overline{xs}_j(\Delta_j) = z. \]
Let’s apply the linear function \(\overline{xs}_i\) to the first condition
\[ \overline{xs}_i(\Delta_i) \oplus \overline{xs}_i(\Delta_j) = \overline{xs}_i(y); \]
since \(\overline{xs}_i\) isn’t invertible, the result isn’t equivalent, but is a weaker (necessary, not sufficient) version of the initial condition.
After xoring that with the second condition
\[ \overline{xs}_i(\Delta_i) \oplus \overline{xs}_j(\Delta_j) = z, \]
we find
\[ \overline{xs}_i(\Delta_j) \oplus \overline{xs}_j(\Delta_j) = \overline{xs}_i(y) \oplus z. \]
By hypothesis, the null space of \(g(x) = \overline{xs}_i(x) \oplus \overline{xs}_j(x)\) is “small.” For our concrete definition of \(\overline{xs}\), there are \(2^{2j}\) values in that null space, which means that \(\Delta_j\) can only satisfy the combined xored condition by taking one of at most \(2^{2j}\) values; otherwise, the two hashes definitely can’t both collide.
Since \(j < n\), this happens with probability at most \(2^{2(n - 1) - w} \leq 2^{-34}\) for UMASH with \(w = 64\) and \(n = 16\).
Finally, for any given \(\Delta_j\), there is at most one \(\Delta_i\) that satisfies
\[ \Delta_i \oplus \Delta_j = y,\]
and so both hashes collide with probability at most \(2^{-98}\), for \(w = 64\) and \(n = 16\).
Astute readers will notice that we could let \(\overline{xs}_i(x) = x \mathtt{«} i\), and find the same combined collision probability. However, this results in a much weaker secondary hash, since a chunk could lose up to \(2n - 2\) bits (\(n - 1\) in each 64-bit half) of hash information to a plain shift. The shifted xor-shifts might be a bit slower to compute, but guarantee that we only lose at most 2 bits^{1} of information per chunk. This feels like an interface that’s harder to misuse.
If one were to change the \(\overline{xs}_i\) family of functions, I think it would make more sense to look at a more diverse form of (still sparse) multipliers, which would likely let us preserve a couple more bits of independence. Jim has constructed such a family of multipliers, in arithmetic modulo \(2^{64}\); I’m sure we could find something similar in carryless multiplication. The hard part is implementing these multipliers: in order to exploit the multipliers’ sparsity, we’d probably have to fully unroll the block hashing loop, and that’s not something I like to force on implementations.
The base UMASH block compressor mixes all but the last of the message block’s 16-byte chunks with PH: xor the chunk with the corresponding bytes in the parameter array, computes a carryless multiplication of the xored chunks’ half with the other half. The last chunk goes through a variant of ENH with an invertible finaliser (safe because we only rely on \(\varepsilon\)-almost-universality), and everything is xored in the accumulator.
The collision proofs above preserved the same structure for the first hash.
The second hash reuses so much work from the first that it mostly makes sense to consider a combined loop that computes both (regular UMASH and this new xor-shifted variant) block compression functions at the same time.
The first change for this combined loop is that we need to xor together all 16-byte chunks in the message, and mix the resulting checksum with a fresh PH function. That’s equivalent to xoring everything into a new accumulator (or two accumulators when working with 256-bit vectors) initialised with the PH parameters, and CLMULing together the accumulator’s two 64-bit halves at the end.
We also have to apply the \(\overline{xs}_i\) quasi-xor-shift functions to each \(h_i\). The trick is to accumulate the shifted values in two variables: one is the regular UMASH accumulator without \(h_0\) (i.e., \(h_1 \oplus h_2 \ldots\)), and the other shifts the current accumulator before xoring in a new value, i.e., \(\mathtt{acc}^\prime = (\mathtt{acc} \mathtt{«} 1) \oplus h_i,\) where the left shift on parallel 64-bit halves simply adds acc to itself.
This additional shifted accumulator includes another special case to skip \(\overline{xs}_1(x) = x \mathtt{«} 1\); that’s not a big deal for the code, since we already have to special case the last iteration for the ENH mixer.
Armed with \(\mathtt{UMASH} = \bigoplus_{i=1}^{n - 1} h_i\) and \(\mathtt{acc} = \bigoplus_{i=2}^{n - 1} h_i \mathtt{«} (i - 1),\) we have \[\bigoplus_{i=1}^{n - 1} \overline{xs}_i(h_i) = (\mathtt{UMASH} \oplus \mathtt{acc}) \mathtt{«} 1.\]
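This rearrangement is easy to get wrong, so a quick numeric check helps. The toy sketch below (hypothetical helper names; a single 64-bit lane stands in for the SIMD halves, since the shifts act on each half independently) confirms the identity for random chunks.

```python
import random

random.seed(1234)
M64 = (1 << 64) - 1

def shl(x, n):
    # left shift within a 64-bit lane, as the SIMD shifts would
    return (x << n) & M64

def xs(i, x):
    # the quasi-xorshift family: xs_1(x) = x << 1, xs_i(x) = (x << 1) ^ (x << i)
    return shl(x, 1) if i == 1 else shl(x, 1) ^ shl(x, i)

n = 16
h = [None] + [random.getrandbits(64) for _ in range(1, n)]  # h[1] .. h[n - 1]

umash = 0  # xor of h_1 .. h_{n-1}
acc = 0    # xor of h_i << (i - 1), for i >= 2
for i in range(1, n):
    umash ^= h[i]
    if i >= 2:
        acc ^= shl(h[i], i - 1)

direct = 0  # xor of xs_i(h_i), computed the slow way
for i in range(1, n):
    direct ^= xs(i, h[i])

assert direct == shl(umash ^ acc, 1)
```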
We just have to xor in the PH-mixed checksum \(h_n\), and finally \(h_0\) (which naturally goes in GPRs, so can be computed while we extract values out of vector registers).
We added two vector xors and one addition for each chunk in a block, and, at the end, one CLMUL plus a couple more xors and adds. This should most definitely be faster than computing two UMASHes at the same time, which incurs two vector xors and a CLMUL (or full integer multiplication) for each chunk: even when CLMUL can pipeline one instruction per cycle, vector additions can dispatch to more execution units, so the combined throughput is still higher.
It’s easy to show that UMASH is relatively safe when one block is shorter than the other, and we simply xor together fewer mixed chunks. Without loss of generality, we can assume the longer block has \(n\) chunks; that block’s final ENH is independent of the shorter block’s UMASH, and any specific value occurs with probability at most \(2^{-63}\) (the probability of a multiplication by zero).
A similar argument seems more complex to defend for the shifted UMASH.
Luckily, we can tweak the LRC checksum we use to generate an additional chunk in the block: rather than xoring together the raw message chunks, we’ll xor them after xoring them with the PH key, i.e.,
\[m_n = \bigoplus_{i=0}^{n - 1} \left(m_i \oplus k_i\right), \]
where \(k_i\) are the PH parameters for each chunk.
When checksumming blocks of the same size, this is a no-op with respect to collision probabilities. Implementations might however benefit from the ability to use a fused xor with load from memory^{2} to compute \(m_i \oplus k_i\), and feed that both into the checksum and into CLMUL for PH.
Unless we’re extremely unlucky (\(m_{n - 1} = k_{n - 1}\), with probability \(2^{-2w}\)), the long block’s LRC will differ from the shorter block’s. As long as we always xor in the same PH parameters when mixing the artificial LRC, the secondary hashes collide with probability at most \(2^{-64}\).
With a small tweak to the checksum function, we can easily guarantee that blocks with a different number of chunks collide with probability less than \(2^{-126}\).^{3}
Thank you Joonas for helping me rubber duck the presentation, and Jim for pointing me in the right direction, and for the fruitful discussion!
It’s even better for UMASH, since we obtained these shifted chunks by mixing with PH. The result of PH is the carryless product of two 64-bit values, so the most significant bit is always 0. The shifted-xorshift doesn’t erase any information in the high 64-bit half! ↩
This might also come with a small latency hit, which is unfortunate since PH-ing \(m_n\) is likely to be on the critical path… but one cycle doesn’t seem that bad. ↩
The algorithm to expand any input message to a sequence of full 16-byte chunks is fixed. That’s why we incorporate a size tag in ENH; that makes it impossible for two messages of different lengths to collide when they are otherwise identical after expansion. ↩
We accidentally a whole hash function… but we had a good reason! Our MIT-licensed UMASH hash function is a decently fast non-cryptographic hash function that guarantees a worst-case bound on the probability of collision between any two inputs generated independently of the UMASH parameters.
On the 2.5 GHz Intel 8175M servers that power Backtrace’s hosted offering, UMASH computes a 64-bit hash for short cached inputs of up to 64 bytes in 9-22 ns, and for longer ones at up to 22 GB/s, while guaranteeing that two distinct inputs of at most \(s\) bytes collide with probability less than \(\lceil s / 2048 \rceil \cdot 2^{-56}\). If that’s not good enough, we can also reuse most of the parameters to compute two independent UMASH values. The resulting 128-bit fingerprint function offers a short-input latency of 9-26 ns, a peak throughput of 11.2 GB/s, and a collision probability of \(\lceil s / 2048 \rceil^2 \cdot 2^{-112}\) (better than \(2^{-70}\) for input size up to 7.5 GB). These collision bounds hold for all inputs constructed without any feedback about the randomly chosen UMASH parameters.
The latency on short cached inputs (9-22 ns for 64 bits, 9-26 ns for 128) is somewhat worse than the state of the art for non-cryptographic hashes—wyhash achieves 8-15 ns and xxh3 8-12 ns—but still in the same ballpark. It also compares well with latency-optimised hash functions like FNV-1a (5-86 ns) and MurmurHash64A (7-23 ns).
Similarly, UMASH’s peak throughput (22 GB/s) does not match the current best hash throughput (37 GB/s with xxh3 and falkhash, apparently 10% higher with Meow hash), but does come within a factor of two; it’s actually higher than that of some performance-optimised hashes, like wyhash (16 GB/s) and farmhash32 (19 GB/s). In fact, even the 128-bit fingerprint (11.2 GB/s) is comparable to respectable options like MurmurHash64A (5.8 GB/s) and SpookyHash (11.6 GB/s).
What sets UMASH apart from these other non-cryptographic hash functions is its proof of a collision probability bound. In the absence of an adversary that adaptively constructs pathological inputs as it infers more information about the randomly chosen parameters, we know that two distinct inputs of \(s\) or fewer bytes will have the same 64-bit hash with probability at most \(\lceil s / 2048 \rceil \cdot 2^{-56},\) where the expectation is taken over the random “key” parameters.
Only one non-cryptographic hash function in Reini Urban’s fork of SMHasher provides this sort of bound: CLHash guarantees a collision probability \(\approx 2^{-63}\) in the same universal hashing model as UMASH. While CLHash’s peak throughput (22 GB/s) is equal to UMASH’s, its latency on short inputs is worse (23-25 ns instead of 9-22 ns). We will also see that its stronger collision bound remains too weak for many practical applications. In order to compute a fingerprint with CLHash, one would have to combine multiple hashes, exactly like we did for the 128-bit UMASH fingerprint.
Actual cryptographic hash functions provide stronger bounds in a much more pessimistic model; however, they’re also markedly slower than non-cryptographic hashes. BLAKE3 needs at least 66 ns to hash short inputs, and achieves a peak throughput of 5.5 GB/s. Even the reduced-round SipHash-1-3 hashes short inputs in 18-40 ns and longer ones at a peak throughput of 2.8 GB/s. That’s the price of their pessimistically adversarial security model. Depending on the application, it can make sense to consider a more restricted adversary that must prepare its dirty deed before the hash function’s parameters are generated at random, and still ask for provable bounds on the probability of collisions. That’s the niche we’re targeting with UMASH.
Clearly, the industry is comfortable with no bound at all. However, even in the absence of seed-independent collisions, timing side-channels in a data structure implementation could theoretically leak information about colliding inputs, and iterating over a hash table’s entries to print its contents can divulge even more bits. A sufficiently motivated adversary could use something like that to learn more about the key and deploy an algorithmic denial of service attack. For example, the linear structure of UMASH (and of other polynomial hashes like CLHash) makes it easy to combine known collisions to create exponentially more colliding inputs. There is no universal answer; UMASH is simply another point in the solution space.
If reasonable performance coupled with an actual bound on collision probability for data that does not adaptively break the hash sounds useful to you, take a look at UMASH on GitHub!
The next section will explain why we found it useful to design another hash function. The rest of the post sketches how UMASH works and how it balances short-input latency and strength, before describing a few interesting usage patterns.
The latency and throughput results above were all measured on the same unloaded 2.5 GHz Xeon 8175M. While we did not disable frequency scaling (#cloud), the clock rate seemed stable at 3.1 GHz during our run.
Engineering is the discipline of satisficing: crisply defined problems with perfect solutions rarely exist in reality, so we must resign ourselves to satisfying approximate constraint sets “well enough.” However, there are times when all options are not only imperfect, but downright sucky. That’s when one has to put on a different hat, and question the problem itself: are our constraints irremediably at odds, or are we looking at an under-explored solution space?
In the former case, we simply have to want something else. In the latter, it might make sense to spend time to really understand the current set of options and hand-roll a specialised approach.
That’s the choice we faced when we started caching intermediate results in Backtrace’s database and found a dearth of acceptable hash functions. Our in-memory columnar database is a core component of the backend, and, like most analytics databases, it tends to process streams of similar queries. However, a naïve query cache would be ineffective: our more heavily loaded servers handle a constant write load of more than 100 events per second with dozens of indexed attributes (populated column values) each. Moreover, queries invariably select a large number of data points with a time windowing predicate that excludes old data… and the endpoints of these time windows advance with each wall-clock second. The queries evolve over time, and must usually consider newly ingested data points.
Bhatotia et al’s Slider shows how we can specialise the idea of self-adjusting or incremental computation for repeated MapReduce-style queries over a sliding window. The key idea is to split the data set at stable boundaries (e.g., on date change boundaries rather than 24 hours from the beginning of the current time window) in order to expose memoisation opportunities, and to do so recursively to repair around point mutations to older data.
Caching fully aggregated partial results works well for static queries, like scheduled reports… but the first step towards creating a great report is interactive data exploration, and that’s an activity we strive to support well, even when drilling down tens of millions of rich data points. That’s why we want to also cache intermediate results, in order to improve response times when tweaking a saved report, or when crafting ad hoc queries to better understand how and when an application fails.
We must go back to a more general incremental computation strategy: rather than only splitting up inputs, we want to stably partition the data dependency graph of each query, in order to identify shared subcomponents whose results can be reused. This finer grained strategy surfaces opportunities to “resynchronise” computations, to recognize when different expressions end up generating a subset of identical results, enabling reuse in later steps. For example, when someone updates a query by adding a selection predicate that only rejects a small fraction of the data, we can expect to reuse some of the post-selection work executed for earlier incarnations of the query, if we remember to key on the selected data points rather than the predicates.
The complication here is that these intermediate results tend to be large. Useful analytical queries start small (a reasonable query coupled with cache/transaction invalidation metadata to stand in for the full data set), grow larger as we select data points, arrange them in groups, and materialise their attributes, and shrink again at the end, as we summarise data and throw out less interesting groups.
When caching the latter shrinking steps, where resynchronised reuse opportunities abound and can save a lot of CPU time, we often find that storing a fully materialised representation of the cache key would take up more space than the cached result.
A classic approach in this situation is to fingerprint cache keys with a cryptographic hash function like BLAKE or SHA-3, and store a compact (128 or 256 bits) fingerprint instead of the cache key: the probability of a collision is then so low that we might as well assume any false positive will have been caused by a bug in the code or a hardware failure. For example, a study of memory errors at Facebook found that uncorrectable memory errors affect 0.03% of servers each month. Assuming a generous clock rate of 5 GHz, this means each clock cycle may be afflicted by such a memory error with probability \(\approx 2.2\cdot 10^{-20} > 2^{-66}.\) If we can guarantee that distinct inputs collide with probability significantly less than \(2^{-66}\), e.g., \(< 2^{-70},\) any collision is far more likely to have been caused by a bug in our code or by hardware failure than by the fingerprinting algorithm itself.
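The arithmetic behind that threshold is a quick back-of-the-envelope computation; the check below (assuming a 30-day month) lands at the same order of magnitude as the figure in the text.

```python
# 0.03% of servers see an uncorrectable memory error each month
failure_rate_per_month = 0.0003
# generous 5 GHz clock, 30-day month
cycles_per_month = 5e9 * 30 * 24 * 3600
p_per_cycle = failure_rate_per_month / cycles_per_month  # ~2.3e-20

# each cycle is more likely to hit a memory error than a 2^-66 event
assert p_per_cycle > 2 ** -66
```

Any fingerprint collision probability comfortably below that per-cycle error rate, e.g., \(2^{-70}\), is dominated by hardware failure.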
Using cryptographic hashes is certainly safe enough, but requires a lot of CPU time, and, more importantly, worsens latency on smaller keys (for which caching may not be that beneficial, such that our goal should be to minimise overhead). It’s not that state-of-the-art cryptographic hash functions are wasteful, but that they defend against attacks like key recovery or collision amplification that we may not care to consider in our design.
At the other extreme of the hash spectrum, there is a plethora of fast hash functions with no proof of collision probability. However, most of them are keyed on just a 64-bit “seed” integer, and that’s already enough for a pigeonhole argument to show we can construct sets of strings of length \(64m\) bits where any two members collide with probability at least \(m/ 2^{64}\). In practice, security researchers seem to find key-independent collisions wherever they look (i.e., the collision probability is on the order of 1 for some particularly pathological sets of inputs), so it’s safe to assume that lacking a proof of collision probability implies a horrible worst case. I personally wouldn’t put too much faith in “security claims” taking the form of failed attempts at breaking a proposal.
Lemire and Kaser’s CLHash is one of the few exceptions we found: it achieves a high throughput of 22 GB/s and comes with a proof of \(2^{-63}\)-almost-universality. However, its finalisation step is slow (23 ns for one-byte inputs), due to a Barrett reduction followed by three rounds of xorshift/multiply mixing. Dai and Krovetz’s VHASH, which inspired CLHash, offers similar guarantees, with worse performance.
Unfortunately, \(2^{-63}\) is also not quite good enough for our purposes: we estimate that the probability of uncorrectable memory errors is on the order of \(2^{-66}\) per clock cycle, so we want the collision probability for any two distinct inputs to be comfortably less than that, around \(2^{-70}\) (i.e., \(10^{-21}\)) or less. This also tells us that any acceptable fingerprint must consist of more than 64 bits, so we will have to either work in slower multi-word domains, or combine independent hashes.
Interestingly, we also don’t need much more than that for (non-adversarial) fingerprinting: at some point, the theoretical probability of a collision is dominated by the practical possibility of a hardware or networking issue making our program execute the fingerprinting function incorrectly, or pass the wrong data to that function.
While CLHash and VHASH aren’t quite what we want, they’re pretty close, so we felt it made sense to come up with a specialised solution for our fingerprinting use case.
Krovetz et al’s RFC 4418 brings an interesting idea: we can come up with a fast 64-bit hash function structured to make it easy to compute a second independent hash value, and concatenate two independent 64-bit outputs. The hash function can heavily favour computational efficiency and let each 64-bit half collide with probability \(\varepsilon\) significantly worse than \(2^{-64}\), as long as the collision probability for the concatenated fingerprint, \(\varepsilon^2\), is small enough, i.e., as long as \(\varepsilon^2 < 2^{-70} \Longleftrightarrow \varepsilon < 2^{-35}\). We get a more general purpose hash function out of the deal, and the fingerprint comparison logic is now free to only compute and look at half the fingerprint when it makes sense (e.g., in a prepass that tolerates spurious matches).
The design of UMASH is driven by two observations:
CLHash achieves a high throughput, but introduces a lot of latency to finalise its 127-bit state into a 64 bits result.
We can get away with a significantly weaker hash, since we plan to combine two of them when we need a strong fingerprint.
That’s why we started with the high-level structure diagrammed below, the same as UMAC, VHASH, and CLHash: a fast first-level block compression function based on Winograd’s pseudo dot-product, and a second-level Carter-Wegman polynomial hash function to accumulate the compressed outputs in a fixed-size state.
The inner loop in this two-level strategy is the block compressor, which divides each 256-byte block \(m\) into 32 64-bit values \(m_i\), combines them with randomly generated parameters \(k_i\), and converts the resulting sequence of machine words to a 16-byte output. The performance of that component will largely determine the hash function’s global peak throughput. After playing around with the NH inner loop, we came to the same conclusion as Lemire and Kaser: the scalar operations, the outer 128-bit ones in particular, map to too many µops. We thus focused on the same PH inner loop as CLHash. While the similarity to NH is striking, analysing PH is actually much simpler: we can see the xor and carry-less multiplications as working in the same ring of polynomials over \(\mathrm{GF}(2)\), unlike NH’s mixing of \(\bmod 2^{64}\) for the innermost additions with \(\bmod 2^{128}\) for the outer multiplications and sum. In fact, as Bernstein points out, PH is a direct application of Winograd’s pseudo dot-product to compute a multiplicative vector hash in half the multiplications.
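A minimal model of PH makes the pseudo-dot-product trick concrete (hypothetical names; the bit-by-bit `clmul` helper stands in for the CLMUL instruction): each 16-byte chunk costs a single carry-less multiplication, instead of the two a plain multiplicative dot product would need.

```python
def clmul(a, b):
    # carry-less 64x64 -> 128-bit multiplication over GF(2)
    r = 0
    for i in range(64):
        if (b >> i) & 1:
            r ^= a << i
    return r

def ph_block(chunks, keys):
    """PH compressor sketch: each chunk is a (lo, hi) pair of 64-bit
    words; xor with the key pair, then one CLMUL of one half by the
    other, everything xored together (Winograd pseudo dot-product)."""
    acc = 0
    for (lo, hi), (klo, khi) in zip(chunks, keys):
        acc ^= clmul(lo ^ klo, hi ^ khi)
    return acc
```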
CLHash uses an aggressively throughput-optimised block size of 1024 bytes. We found diminishing returns after 256 bytes, and stopped there.
With modular or polynomial ring arithmetic, the collision probability is \(2^{-64}\) for any pair of blocks. Given this fast compression function, the rest of the hashing algorithm must chop the input in blocks, accumulate compressed outputs in a constant-size state, and handle the potentially shorter final block while avoiding length extension issues.
Both VHASH and CLHash accumulate compressed outputs in a polynomial string hash over a large field (\(\mathbb{Z}/M_{127}\mathbb{Z}\) for VHASH, and \(\mathrm{GF}(2^{127})\) with irreducible polynomial \(x^{127} + x + 1\) for CLHash): the collision probability for polynomial string hashes is inversely proportional to the field size and grows with the string length (number of compressed blocks), so working in fields much larger than \(2^{64}\) lets the NH/PH term dominate.
Arithmetic in such large fields is slow, and reducing the 127-bit state to 64 bits is also not fast. CLHash and VHASH make the situation worse by zero-padding the final block, and CLHash defends against length extension attacks with a more complex mechanism than the one in VHASH.
Similarly to VHASH, UMASH uses a polynomial hash over the (much smaller) prime field \(\mathbb{F} = \mathbb{Z}/M_{61}\mathbb{Z}:\) \[\mathtt{CW}_f(y) = \left(\sum_{i=1}^n y_i \cdot f^{n - i + 1}\right) \bmod (2^{61} - 1),\] where \(f\in\mathbb{F}\) is the randomly chosen point at which we evaluate the polynomial, and \(y\), the polynomial’s coefficients, is the stream of 64-bit values obtained by splitting in half the PH output for each block. This choice saves 20-30 cycles of latency in the final block, compared to CLHash: modular multiplications have lower latency than carry-less multiplications for judiciously picked machine-integer-sized moduli, and integer multiplications seem to mix better, so we need less work in the finaliser.
Of course, UMASH sacrifices a lot of strength by working in \(\mathbb{F} = \mathbb{Z}/(2^{61} - 1)\mathbb{Z}:\) the resulting field is much smaller than \(2^{127}\), and we now have to update the polynomial twice for the same number of blocks. This means the collision probability starts worse \((\approx 2^{-61}\) instead of \(\approx 2^{-127})\), and grows twice as fast with the number of blocks \(n\) \((\approx 2n\cdot 2^{-61}\) instead of \(\approx n\cdot 2^{-127})\). But remember, we’re only aiming for collision probability \(< 2^{-35}\) and each block represents 256 bytes of input data, so this is acceptable, assuming that multi-gigabyte inputs are out of scope.
We protect against length extension collisions by xoring (adding, in the polynomial ring) the original byte size of the final block to its compressed PH output. This xor is simpler than CLHash’s finalisation step with a carry-less multiplication, but still sufficient: we can adapt Krovetz’s proof for VHASH by replacing NH’s almost-\(\Delta\)-universality with PH’s almost-XOR-universality.
Having this protection means we can extend short final blocks however we want. Rather than conceptually zero-padding our inputs (which adds complexity and thus latency on short inputs), we allow redundant reads. We bifurcate inputs shorter than 16 bytes to a completely different latency-optimised code path, and let the final PH iteration read the last 16 bytes of the input, regardless of how redundant that might be.
The semi-literate Python reference implementation has the full code and includes more detailed analysis and rationale for the design decisions.
The previous section already showed how we let micro-optimisation inform the high-level structure of UMASH. The use of PH over NH, our choice of a polynomial hash in a small modular field, and the way we handle short blocks all aim to improve the performance of production implementations. We also made sure to enable a couple more implementation tricks with lower level design decisions.
The block size is set to 256 bytes because we observed diminishing returns for larger blocks… but also because it’s reasonable to cache the PH loop’s parameters in 8 AVX registers, if we need to shave load µops.
More importantly, it’s easy to implement a Horner update with the prime modulus \(2^{61} - 1\). Better, that’s also true for a “double-pumped” Horner update, \(h^\prime = H_f(h, a, b) = af + (b + h)f^2.\)
The trick is to work in \(\bmod\ 2^{64} - 8 = \bmod\ 8\cdot(2^{61} - 1),\) which lets us implement modular multiplication of an arbitrary 64-bit integer \(a\) by a multiplier \(0 < f < 2^{61} - 1\) without worrying too much about overflow. \(2^{64} \equiv 8 \mod 2^{64} - 8,\) so we can reduce a value \(x\) to a smaller representative with \[x \equiv 8\lfloor x / 2^{64}\rfloor + (x \bmod 2^{64}) \mod 2^{64} - 8;\] this equivalence is particularly useful when \(x < 2^{125}\): in that case, \(x / 2^{64} < 2^{61},\) and the intermediate product \(8\lfloor x / 2^{64}\rfloor < 2^{64}\) never overflows 64 bits. That’s exactly what happens when \(x = af\) is the product of \(0\leq a < 2^{64}\) and \(0 < f < 2^{61} - 1\). This also holds when we square the multiplier \(f\): it’s sampled from the field \(\mathbb{Z}/(2^{61} - 1)\mathbb{Z},\) so its square also satisfies \(f^2 < 2^{61}\) once fully reduced.
Integer multiplication instructions for 64-bit values will naturally split the product \(x = af\) in its high and low 64-bit half; we get \(\lfloor x / 2^{64}\rfloor\) and \(x\bmod 2^{64}\) for free. The rest of the double-pumped Horner update is a pair of modular additions, where only the final sum must be reduced to fit in \(\bmod 2^{64} - 8\). The resulting instruction-parallel double Horner update is only a few cycles slower than a single Horner update.
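A sketch of the double-pumped update (hypothetical names; Python integers stand in for the hi/lo halves a 64-bit multiply instruction produces) shows the lazy reduction at work: we stay in \(\bmod\ 2^{64} - 8\) throughout, and the result still agrees with exact arithmetic \(\bmod\ 2^{61} - 1\).

```python
import random

M61 = (1 << 61) - 1
M64 = (1 << 64) - 1
MOD = (1 << 64) - 8  # = 8 * (2**61 - 1)

def mul_mod(a, f):
    """a * f mod (2^64 - 8), for 0 <= a < 2^64 and 0 < f < 2^61.
    The multiply instruction hands us the hi/lo 64-bit halves of the
    128-bit product; 2^64 = 8 mod (2^64 - 8) folds the high half back
    in as 8 * hi, which never overflows 64 bits since hi < 2^61."""
    x = a * f
    hi, lo = x >> 64, x & M64
    return (8 * hi + lo) % MOD

def horner2(h, a, b, f):
    """Double-pumped Horner update h' = a*f + (b + h)*f^2."""
    f2 = (f * f) % M61  # fully reduced square stays below 2^61
    return (mul_mod(a, f) + mul_mod((b + h) % MOD, f2)) % MOD

random.seed(7)
f = random.randrange(1, M61)
h, a, b = (random.randrange(MOD) for _ in range(3))

# the lazily reduced result agrees with exact arithmetic mod 2^61 - 1
assert horner2(h, a, b, f) % M61 == (a * f + (b + h) * f * f) % M61
```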
We also never fully reduce to \(\bmod 2^{61} - 1\). While the collision bound assumes that prime field, we simply work in its \(\bmod 2^{64} - 8\) extension. This does not affect the collision bound, and the resulting expression is still amenable to algebraic manipulation: modular arithmetic is a well defined ring even for composite moduli.
A proof of almost-universality doesn’t mean a hash passes the SMHasher test suite. It should definitely guarantee collisions are (probably) rare enough, but SMHasher also looks at bit avalanching and bias, and universality is oblivious to these issues. Even XOR- or \(\Delta\)-universality doesn’t suffice: the hash values for a given string are well distributed when parameters are chosen uniformly at random, but this does not imply that hashes are always (or usually) well distributed for fixed parameters.
The most stringent SMHasher tests focus on short inputs: mostly up to 128 or 256 bits, unless “Extra” torture testing is enabled. In a way, this makes sense, given that arbitrary-length string hashing is provably harder than the bounded-length vector case. Moreover, a specialised code path for these inputs is beneficial, since they’re relatively common and deserve strong and low-latency hashes. That’s why UMASH uses a completely different code path for inputs of at most 8 bytes, and a specialised PH iteration for inputs of 9 to 16 bytes.
However, this means that SMHasher’s best avalanche and bias tests often tell us very little about the general case. For UMASH, the medium length (9 to 16 bytes) code path at least shares the same structure and finalisation logic as the code for longer inputs.
There may also be a bit of co-evolution between the test harness and the design of hash functions: the sort of xorshift/multiply mixers favoured by Appleby in the various versions of MurmurHash tends to do well on SMHasher. These mixers are also invertible, so we can take any hash function with good collision properties, mix its output with someone else’s series of xorshifts and multiplications (in UMASH’s case, the SplitMix64 update function or a subset thereof), and usually find that the result satisfies SMHasher’s bias and avalanche tests.
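To illustrate the invertibility claim, here is the SplitMix64 finaliser (constants from Vigna's public-domain splitmix64.c) together with a sketched inverse: each xorshift and each odd multiplication can be undone, so tacking the mixer onto a hash reshuffles bits without merging any outputs.

```python
M64 = (1 << 64) - 1

def mix(x):
    # the SplitMix64 finaliser: alternating xorshifts and odd multipliers
    x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9 & M64
    x = (x ^ (x >> 27)) * 0x94D049BB133111EB & M64
    return x ^ (x >> 31)

def unshift(y, s):
    # invert x ^ (x >> s): each pass recovers s more of the lower bits
    x = y
    for _ in range(64 // s + 1):
        x = y ^ (x >> s)
    return x

def unmix(y):
    # odd multipliers are invertible mod 2^64; xorshifts via unshift
    x = unshift(y, 31)
    x = x * pow(0x94D049BB133111EB, -1, 1 << 64) & M64
    x = unshift(x, 27)
    x = x * pow(0xBF58476D1CE4E5B9, -1, 1 << 64) & M64
    return unshift(x, 30)

for v in (0, 1, 0xDEADBEEF, M64):
    assert unmix(mix(v)) == v
```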
It definitely looks like interleaving rightward bitwise operations and integer multiplications is a good mixing strategy. However, I find it interesting that the hash evaluation harness created by the author of MurmurHash steers implementations towards MurmurHash-style mixing code.
The structure of UMASH lets us support more sophisticated usage patterns than merely hashing or fingerprinting an array of bytes.
The PH loop needs less than 17 bytes of state for its 16-byte accumulator and an iteration count, and the polynomial hash also needs 17 bytes, for its own 8-byte accumulator, the 8-byte “seed,” and a counter for the final block size (up to 256 bytes). The total comes to 34 bytes of state, plus a 16-byte input buffer, since the PH loop consumes 16-byte chunks at a time. Coupled with the way we only consider the input size at the end of UMASH, this makes it easy to implement incremental hashing.
In fact, the state is small enough that our implementation stashes some parameter data inline in the state struct, and uses the same layout for hashing and fingerprinting with a pair of hashes (and thus double the state): most of the work happens in PH, which only accesses the constant parameter array, the shared input buffer and iteration counter, and its private 16-byte accumulator.
Incremental fingerprinting is a crucial capability for our caching system: cache keys may be large, so we want to avoid serialising them to an array of contiguous bytes just to compute a fingerprint. Efficient incrementality also means we can hash NUL-terminated C strings with a fused UMASH / strlen loop, a nice speed-up when the data is in cache.
The outer polynomial hash in UMASH is so simple to analyse that we can easily process blocks out of order. In my experience, such a “parallel hashing” capability is more important than peak throughput when checksumming large amounts of data coming over the wire. We usually maximise transfer throughput by asking for several ranges of data in parallel. Having to checksum these ranges in order introduces a serial bottleneck and the usual head-of-line blocking challenges; more importantly, checksumming in order adds complexity to code that should be as obviously correct as possible. The polynomial hash lets us hash an arbitrary subsequence of 256-byte aligned blocks and use modular exponentiation to figure out its impact on the final hash value, given the subsequence’s position in the checksummed data. Parallel hashing can exploit multiple cores (more cores, more bandwidth!) with simpler code.
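A toy model (hypothetical names; plain Horner evaluation over \(\mathbb{Z}/(2^{61}-1)\mathbb{Z}\), with one coefficient standing in for each block’s compressed output) shows how modular exponentiation lets independently hashed ranges recombine into the sequential result.

```python
import random

M61 = (1 << 61) - 1

def poly_hash(coeffs, f, h=0):
    # Horner evaluation: h' = (h + y) * f for each coefficient y
    for y in coeffs:
        h = (h + y) * f % M61
    return h

random.seed(9)
f = random.randrange(2, M61)
blocks = [random.randrange(M61) for _ in range(10)]

# sequential hash of all the blocks, in order
expected = poly_hash(blocks, f)

# hash two ranges independently (e.g., on different cores), then shift
# the left range's contribution by f^(number of blocks to its right)
left = poly_hash(blocks[:6], f)
right = poly_hash(blocks[6:], f)
combined = (left * pow(f, len(blocks) - 6, M61) + right) % M61

assert combined == expected
```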
The UMAC RFC uses a Toeplitz extension scheme to compute independent NH values while recycling most of the parameters. We do the same with PH, by adapting Krovetz’s proof to exploit PH’s almost-XOR-universality instead of NH’s almost-\(\Delta\)-universality. Our fingerprinting code reuses all but the first 32 bytes of PH parameters for the second hash: that’s the size of an AVX register, which makes it trivial to avoid loading parameters twice in a fused PH loop.
The same RFC also points out that concatenating the output of fast hashes lets validation code decide which speed-security trade-off makes sense for each situation: some applications may be willing to only compute and compare half the hashes.
We use that freedom when reading from large hash tables keyed on the UMASH fingerprint of strings. We compute a single UMASH hash value to probe the hash tables, and only hash the second half of the fingerprint when we find a probable hit. The idea is that hashing the search key (now hot in cache) a second time will be faster than comparing it against the hash entry’s string key in cold storage.
When we add this sort of trickery to our code base, it’s important to make sure the interfaces are hard to misuse. For example, it would be unfortunate if only one half of the 128-bit fingerprint were well distributed and protected against collisions: this would make it far too easy to implement the two-step lookup-by-fingerprint above correctly but inefficiently. That’s why we maximise the symmetry in the fingerprint: the two 64-bit halves are computed with the same algorithm to guarantee the same worst-case collision probability and distribution quality. This choice leaves fingerprinting throughput on the table when a weaker secondary hash would suffice. However, I prefer a safer if slightly slower interface to one ripe for silent performance bugs.
While we intend for UMASH to become our default hash and fingerprint function, it can’t be the right choice for every application.
First, it shouldn’t be used for authentication or similar cryptographic purposes: the implementation is probably riddled with side-channels, the function has no protection against parameter extraction or adaptive attacks, and collisions are too frequent anyway.
Obviously, this rules out using UMASH in a MAC, but might also be an issue for, e.g., hash tables where attackers control the keys and can extrapolate the hash values. A timing side-channel may let attackers determine when keys collide; once a set of colliding keys is known, the linear structure of UMASH makes it trivial to create more collisions by combining keys from that set. Worse, iterating over the hash table’s entries can leak the hash values, which would let an attacker slowly extract the parameters. We conservatively avoid non-cryptographic hashes and even hashed data structures for sections of the Backtrace code base where such attacks are in scope.
Second, the performance numbers reported by SMHasher (up to 22 ns when hashing 64 bytes or less, and 22 GB/s peak throughput) are probably a lie for real applications, even when running on the exact same 2.5 GHz Xeon 8175M hardware. These are best-case values, when the code and the parameters are all hot in cache… and that’s a fair number of bytes for UMASH. The instruction footprint for a 64-bit hash is 1435 bytes (comparable to heavier high-throughput hashes, like the 1600-byte xxh3_64 or 1350-byte farmhash64), and the parameters span 288 bytes (320 for a fingerprint).
There is a saving grace for UMASH and other complex hash functions: the number of instruction bytes executed is proportional to the input size (e.g., the code for inputs of 8 bytes or fewer only needs 141 bytes, and would inline to around 100 bytes), and the number of parameters read is bounded by the input length. Although UMASH can need a lot of instruction and parameter bytes, the worst case only happens for larger inputs, where the cache misses can hopefully be absorbed by the work of loading and hashing the data.
The numbers are also only representative of powerful CPUs with carry-less multiplication in hardware. The PH inner loop has 50% higher throughput than NH (22 vs 14 GB/s) on contemporary Intel servers. The carry-less approach still has an edge over 128-bit modular arithmetic on AMD’s Naples, but less so, around 20-30%. We did not test on ARM (the Backtrace database only runs on x86-64), but I would assume the situation there is closer to AMD’s than Intel’s.
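For context, one step of each compressor differs only in how key and message words are combined before a 64×64 → 128-bit multiply: NH adds modulo \(2^{64}\) and uses an integer multiply, while PH xors and uses a carry-less (polynomial) multiply. A portable sketch, where the shift-and-xor loop stands in for the single PCLMULQDQ instruction hardware provides:

```c
#include <stdint.h>

/* One NH step: 64-bit modular addition of key material, then a full
 * 64x64 -> 128-bit integer multiply. */
static __uint128_t nh_step(uint64_t m0, uint64_t m1, uint64_t k0, uint64_t k1)
{
    return (__uint128_t)(m0 + k0) * (m1 + k1);
}

/* Carry-less 64x64 -> 128-bit multiply: xor in a shifted copy of x for
 * each set bit of y.  Hardware does this in one PCLMULQDQ instruction;
 * this loop only illustrates the semantics. */
static __uint128_t clmul(uint64_t x, uint64_t y)
{
    __uint128_t acc = 0;

    for (unsigned i = 0; i < 64; i++) {
        if ((y >> i) & 1)
            acc ^= (__uint128_t)x << i;
    }

    return acc;
}

/* One PH step: xor in the key material, then carry-less multiply. */
static __uint128_t ph_step(uint64_t m0, uint64_t m1, uint64_t k0, uint64_t k1)
{
    return clmul(m0 ^ k0, m1 ^ k1);
}
```

The `__uint128_t` type is a GCC/Clang extension; the structural point is that PH replaces NH’s adds and integer multiply with xors and a carry-less multiply, which is why its throughput tracks the hardware’s PCLMULQDQ performance.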
However, I also believe we’re more likely to observe improved performance for PH than NH in future micro-architectures: the core of NH, full-width integer multiplication, has been aggressively optimised by now, while the gap between Intel and AMD shows there may still be low-hanging fruit for the carry-less multiplications at the heart of PH. So, NH is probably already as good as it’s going to be, but we can hope that PH will continue to benefit from hardware optimisations, as chip designers improve the performance of cryptographic algorithms like AES-GCM.
Third and last, UMASH isn’t fully stabilised yet. We do not plan to modify the high-level structure of UMASH, a PH block compressor that feeds into a polynomial string hash. However, we are looking for suggestions to improve its latency on short inputs, and to simplify the finaliser while satisfying SMHasher’s distribution tests.
We believe UMASH is ready for non-persistent usage: we’re confident in its quality, but the algorithm isn’t set in stone yet, so hash or fingerprint values should not reach long-term storage. We do not plan to change anything that will affect the proof of collision bound, but improvements to the rest of the code are more than welcome.
In particular:

- The finaliser is already just a xorshift / multiply, but can we shave even more latency there?
- A hash function is a perfect target for automated correctness and performance testing. I hope to use UMASH as a test bed for the automatic evaluation (and approval?!) of pull requests.
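For reference, here is the general shape of one xorshift / multiply finaliser round. The constant is borrowed from SplitMix64 purely for illustration; this is not UMASH’s actual finaliser:

```c
#include <stdint.h>

/* One xorshift / multiply round: each step (xor with a shifted copy,
 * multiply by an odd constant) is bijective, so no entropy is lost,
 * but each round adds a few cycles of dependent latency.  The constant
 * is SplitMix64's, used here only as an example. */
static uint64_t finalise(uint64_t h)
{
    h ^= h >> 30;
    h *= 0xbf58476d1ce4e5b9ULL;
    h ^= h >> 27;
    return h;
}
```

Because every step is a bijection on 64-bit words, distinct inputs always map to distinct outputs; the latency question is how few such dependent steps still pass SMHasher’s distribution tests.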
Of course, you’re also welcome to just use UMASH as a single-file C library or re-implement it to fit your requirements. The MIT-licensed C code is on GitHub, and we can definitely discuss validation strategies for alternative implementations.
Finally, our fingerprinting use case shows collision rates are probably not something to minimise, but closer to soft constraints. We estimate that, once the collision probability drops to \(2^{-70}\), collisions are rare enough that we can compare only fingerprints instead of the fingerprinted values. However, going lower than \(2^{-70}\) doesn’t do anything for us.
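A back-of-the-envelope birthday bound makes that threshold concrete (the \(n = 2^{30}\) figure below is illustrative, not a number from our production data). With \(n\) fingerprinted values and pairwise collision probability at most \(\varepsilon\), the expected number of colliding pairs is

\[
\mathbb{E}[\text{colliding pairs}] \le \binom{n}{2}\,\varepsilon \approx \frac{n^2 \varepsilon}{2} = \frac{(2^{30})^2 \cdot 2^{-70}}{2} = 2^{-11}.
\]

In other words, even a billion fingerprints are overwhelmingly unlikely to yield a single collision, and shrinking \(\varepsilon\) further only reduces an already negligible number.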
It would be useful to document other back-of-the-envelope requirements for a hash function’s output size or collision rate. Now that most developers work on powerful 64-bit machines, it seems far too easy to add complexity and waste resources for improved collision bounds that may not unlock any additional application.
Any error in the analysis or the code is mine, but a few people helped improve UMASH and its presentation.
Colin Percival scanned an earlier version of the reference implementation for obvious issues, encouraged me to simplify the parameter generation process, and prodded us to think about side channels, even in data structures.
Joonas Pihlaja helped streamline my initial attempt while making the reference implementation easier to understand.
Jacob Shufro independently confirmed that he too found the reference implementation understandable, and tightened the natural language.
Phil Vachon helped me gain more confidence in the implementation tricks borrowed from VHASH after replacing the NH compression function with PH.