Paul Khuong: some Lisp

UMASH: a fast and universal enough hash

Aug 24th, 2020 | Comments

Originally posted on the Backtrace I/O tech blog.

We accidentally a whole hash function… but we had a good reason! Our MIT-licensed UMASH hash function is a decently fast non-cryptographic hash function that guarantees a worst-case bound on the probability of collision between any two inputs generated independently of the UMASH parameters.

On the 2.5 GHz Intel 8175M servers that power Backtrace’s hosted offering, UMASH computes a 64-bit hash for short cached inputs of up to 64 bytes in 9-22 ns, and for longer ones at up to 22 GB/s, while guaranteeing that two distinct inputs of at most \(s\) bytes collide with probability less than \(\lceil s / 2048 \rceil \cdot 2^{-56}\). If that’s not good enough, we can also reuse most of the parameters to compute two independent UMASH values. The resulting 128-bit fingerprint function offers a short-input latency of 9-26 ns, a peak throughput of 11.2 GB/s, and a collision probability of \(\lceil s / 2048 \rceil^2 \cdot 2^{-112}\) (better than \(2^{-70}\) for input size up to 7.5 GB). These collision bounds hold for all inputs constructed without any feedback about the randomly chosen UMASH parameters.

The latency on short cached inputs (9-22 ns for 64 bits, 9-26 ns for 128) is somewhat worse than the state of the art for non-cryptographic hashes— wyhash achieves 8-15 ns and xxh3 8-12 ns—but still in the same ballpark. It also compares well with latency-optimised hash functions like FNV-1a (5-86 ns) and MurmurHash64A (7-23 ns).

Similarly, UMASH’s peak throughput (22 GB/s) does not match the current best hash throughput (37 GB/s with xxh3 and falkhash, apparently 10% higher with Meow hash), but does comes within a factor of two; it’s actually higher than that of some performance-optimised hashes, like wyhash (16 GB/s) and farmhash32 (19 GB/s). In fact, even the 128-bit fingerprint (11.2 GB/s) is comparable to respectable options like MurmurHash64A (5.8 GB/s) and SpookyHash (11.6 GB/s).

What sets UMASH apart from these other non-cryptographic hash functions is its proof of a collision probability bound. In the absence of an adversary that adaptively constructs pathological inputs as it infers more information about the randomly chosen parameters, we know that two distinct inputs of \(s\) or fewer bytes will have the same 64-bit hash with probability at most \(\lceil s / 2048 \rceil \cdot 2^{-56},\) where the expectation is taken over the random “key” parameters.

Only one non-cryptographic hash function in Reini Urban’s fork of SMHasher provides this sort of bound: CLHash guarantees a collision probability \(\approx 2^{-63}\) in the same universal hashing model as UMASH. While CLHash’s peak throughput (22 GB/s) is equal to UMASH’s, its latency on short inputs is worse (23-25 ns instead of 9-22ns). We will also see that its stronger collision bound remains too weak for many practical applications. In order to compute a fingerprint with CLHash, one would have to combine multiple hashes, exactly like we did for the 128-bit UMASH fingerprint.

Actual cryptographic hash functions provide stronger bounds in a much more pessimistic model; however they’re also markedly slower than non-cryptographic hashes. BLAKE3 needs at least 66 ns to hash short inputs, and achieves a peak throughput of 5.5 GB/s. Even the reduced-round SipHash-1-3 hashes short inputs in 18-40 ns and longer ones at a peak throughput of 2.8 GB/s. That’s the price of their pessimistically adversarial security model. Depending on the application, it can make sense to consider a more restricted adversary that must prepare its dirty deed before the hash function’s parameters are generated at random, and still ask for provable bounds on the probability of collisions. That’s the niche we’re targeting with UMASH.

Clearly, the industry is comfortable with no bound at all. However, even in the absence of seed-independent collisions, timing side-channels in a data structure implementation could theoretically leak information about colliding inputs, and iterating over a hash table’s entries to print its contents can divulge even more bits. A sufficiently motivated adversary could use something like that to learn more about the key and deploy an algorithmic denial of service attack. For example, the linear structure of UMASH (and of other polynomial hashes like CLHash) makes it easy to combine known collisions to create exponentially more colliding inputs. There is no universal answer; UMASH is simply another point in the solution space.

If reasonable performance coupled with an actual bound on collision probability for data that does not adaptively break the hash sounds useful to you, take a look at UMASH on GitHub!

The next section will explain why we found it useful to design another hash function. The rest of the post sketches how UMASH works and how it balances short-input latency and strength, before describing a few interesting usage patterns.

The latency and throughput results above were all measured on the same unloaded 2.5 GHz Xeon 8175M. While we did not disable frequency scaling (#cloud), the clock rate seemed stable at 3.1 GHz during our run.

How did we even get here?

Engineering is the discipline of satisficisation: crisply defined problems with perfect solutions rarely exist in reality, so we must resign ourselves to satisfying approximate constraint sets “well enough.” However, there are times when all options are not only imperfect, but downright sucky. That’s when one has to put on a different hat, and question the problem itself: are our constraints irremediably at odds, or are we looking at an under-explored solution space?

In the former case, we simply have to want something else. In the latter, it might make sense to spend time to really understand the current set of options and hand-roll a specialised approach.

That’s the choice we faced when we started caching intermediate results in Backtrace’s database and found a dearth of acceptable hash functions. Our in-memory columnar database is a core component of the backend, and, like most analytics databases, it tends to process streams of similar queries. However, a naïve query cache would be ineffective: our more heavily loaded servers handle a constant write load of more than 100 events per second with dozens of indexed attributes (populated column values) each. Moreover, queries invariably select a large number of data points with a time windowing predicate that excludes old data… and the endpoints of these time windows advance with each wall-clock second. The queries evolve over time, and must usually consider newly ingested data points.

Bhatotia et al’s Slider show how we can specialise the idea of self-adjusting or incremental computation for repeated MapReduce-style queries over a sliding window. The key idea is to split the data set at stable boundaries (e.g., on date change boundaries rather than 24 hours from the beginning of the current time window) in order to expose memoisation opportunities, and to do so recursively to repair around point mutations to older data.

Caching fully aggregated partial results works well for static queries, like scheduled reports… but the first step towards creating a great report is interactive data exploration, and that’s an activity we strive to support well, even when drilling down tens of millions of rich data points. That’s why we want to also cache intermediate results, in order to improve response times when tweaking a saved report, or when crafting ad hoc queries to better understand how and when an application fails.

We must go back to a more general incremental computation strategy: rather than only splitting up inputs, we want to stably partition the data dependency graph of each query, in order to identify shared subcomponents whose results can be reused. This finer grained strategy surfaces opportunities to “resynchronise” computations, to recognize when different expressions end up generating a subset of identical results, enabling reuse in later steps. For example, when someone updates a query by adding a selection predicate that only rejects a small fraction of the data, we can expect to reuse some of the post-selection work executed for earlier incarnations of the query, if we remember to key on the selected data points rather than the predicates.

The complication here is that these intermediate results tend to be large. Useful analytical queries start small (a reasonable query coupled with cache/transaction invalidation metadata to stand in for the full data set), grow larger as we select data points, arrange them in groups, and materialise their attributes, and shrink again at the end, as we summarise data and throw out less interesting groups.

When caching the latter shrinking steps, where resynchronised reuse opportunities abound and can save a lot of CPU time, we often find that storing a fully materialised representation of the cache key would take up more space than the cached result.

A classic approach in this situation is to fingerprint cache keys with a cryptographic hash function like BLAKE or SHA-3, and store a compact (128 or 256 bits) fingerprint instead of the cache key: the probability of a collision is then so low that we might as well assume any false positive will have been caused by a bug in the code or a hardware failure. For example, a study of memory errors at Facebook found that uncorrectable memory errors affect 0.03% of servers each month. Assuming a generous clock rate of 5 GHz, this means each clock cycle may be afflicted by such a memory error with probability \(\approx 2.2\cdot 10^{-20} > 2^{-66}.\) If we can guarantee that distinct inputs collide with probability significantly less than \(2^{-66}\), e.g., \(< 2^{-70},\) any collision is far more likely to have been caused by a bug in our code or by hardware failure than by the fingerprinting algorithm itself.

Using cryptographic hashes is certainly safe enough, but requires a lot of CPU time, and, more importantly, worsens latency on smaller keys (for which caching may not be that beneficial, such that our goal should be to minimise overhead). It’s not that state-of-the-art cryptographic hash functions are wasteful, but that they defend against attacks like key recovery or collision amplification that we may not care to consider in our design.

At the other extreme of the hash spectrum, there is a plethora of fast hash functions with no proof of collision probability. However, most of them are keyed on just a 64-bit “seed” integer, and that’s already enough for a pigeonhole argument to show we can construct sets of strings of length \(64m\) bits where any two members collide with probability at least \(m/ 2^{64}\). In practice, security researchers seem to find key-independent collisions wherever they look (i.e., the collision probability is on the order of 1 for some particularly pathological sets of inputs), so it’s safe to assume that lacking a proof of collision probability implies a horrible worst case. I personally wouldn’t put too much faith in “security claims” taking the form of failed attempts at breaking a proposal.

Lemire and Kaser’s CLHash is one of the few exceptions we found: it achieves a high throughput of 22 GB/s and comes with a proof of \(2^{-63}\)-almost-universality. However, its finalisation step is slow (23 ns for one-byte inputs), due to a Barrett reduction followed by three rounds of xorshift/multiply mixing. Dai and Krovetz’s VHASH, which inspired CLHash, offers similar guarantees, with worse performance.

Unfortunately, \(2^{-63}\) is also not quite good enough for our purposes: we estimate that the probability of uncorrectable memory errors is on the order of \(2^{-66}\) per clock cycle, so we want the collision probability for any two distinct inputs to be comfortably less than that, around \(2^{-70}\) (i.e., \(10^{-21}\)) or less. This also tells us that any acceptable fingerprint must consist of more than 64 bits, so we will have to either work in slower multi-word domains, or combine independent hashes.

Interestingly, we also don’t need much more than that for (non-adversarial) fingerprinting: at some point, the theoretical probability of a collision is dominated by the practical possibility of a hardware or networking issue making our program execute the fingerprinting function incorrectly, or pass the wrong data to that function.

While CLHash and VHASH aren’t quite what we want, they’re pretty close, so we felt it made sense to come up with a specialised solution for our fingerprinting use case.

Krovetz et al’s RFC 4418 brings an interesting idea: we can come up with a fast 64-bit hash function structured to make it easy to compute a second independent hash value, and concatenate two independent 64-bit outputs. The hash function can heavily favour computational efficiency and let each 64-bit half collide with probability \(\varepsilon\) significantly worse than \(2^{-64}\), as long as the collision probability for the concatenated fingerprint, \(\varepsilon^2\), is small enough, i.e., as long as \(\varepsilon^2 < 2^{-70} \Longleftrightarrow \varepsilon < 2^{-35}\). We get a more general purpose hash function out of the deal, and the fingerprint comparison logic is now free to only compute and look at half the fingerprint when it makes sense (e.g., in a prepass that tolerates spurious matches).

UMASH, at a high level

The design of UMASH is driven by two observations:

  1. CLHash achieves a high throughput, but introduces a lot of latency to finalise its 127-bit state into a 64 bits result.

  2. We can get away with a significantly weaker hash, since we plan to combine two of them when we need a strong fingerprint.

That’s why we started with the high-level structure diagrammed below, the same as UMAC, VHASH, and CLHash: a fast first-level block compression function based on Winograd’s pseudo dot-product, and a second-level Carter-Wegman polynomial hash function to accumulate the compressed outputs in a fixed-size state.

The inner loop in this two-level strategy is the block compressor, which divides each 256-byte block \(m\) into 32 64-bit values \(m_i\), combines them with randomly generated parameters \(k_i\), and converts the resulting sequence of machine words to a 16-byte output. The performance of that component will largely determine the hash function’s global peak throughput. After playing around with the NH inner loop,

\[ \mathtt{NH}_k(m) = \sum_{i=0}^{|m| / 2 - 1} (m_{2i} + k_{2i} \bmod 2^{64})\cdot(m_{2i + 1} + k_{2i + 1} \bmod 2^{64})\mod 2^{128},\]

we came to the same conclusion as Lemire and Kaser: the scalar operations, the outer 128-bit ones in particular, map to too many µops. We thus focused on the same PH inner loop as CLHash,

\[ \mathtt{PH}_k(m) = \bigoplus_{i=0}^{|m| / 2 - 1} (m_{2i} \oplus k_{2i})\odot (m_{2i + 1} \oplus k_{2i + 1}).\]

While the similarity to NH is striking, analysing PH is actually much simpler: we can see the xor and carry-less multiplications as working in the same ring of polynomials over \(\mathrm{GF}(2)\), unlike NH’s mixing of \(\bmod 2^{64}\) for the innermost additions with \(\bmod 2^{128}\) for the outer multiplications and sum. In fact, as Bernstein points out, PH is a direct application of Winograd’s pseudo dot-product to compute a multiplicative vector hash in half the multiplications.

CLHash uses an aggressively throughput-optimised block size of 1024 bytes. We found diminishing returns after 256 bytes, and stopped there.

With modular or polynomial ring arithmetic, the collision probability is \(2^{-64}\) for any pair of blocks. Given this fast compression function, the rest of the hashing algorithm must chop the input in blocks, accumulate compressed outputs in a constant-size state, and handle the potentially shorter final block while avoiding length extension issues.

Both VHASH and CLHash accumulate compressed outputs in a polynomial string hash over a large field (\(\mathbb{Z}/M_{127}\mathbb{Z}\) for VHASH, and \(\mathrm{GF}(2^{127})\) with irreducible polynomial \(x^{127} + x + 1\) for CLHash): the collision probability for polynomial string hashes is inversely proportional to the field size and grows with the string length (number of compressed blocks), so working in fields much larger than \(2^{64}\) lets the NH/PH term dominate.

Arithmetic in such large fields is slow, and reducing the 127-bit state to 64 bits is also not fast. CLHash and VHASH make the situation worse by zero-padding the final block, and CLHash defends against length extension attacks with a more complex mechanism than the one in VHASH.

Similarly to VHASH, UMASH uses a polynomial hash over the (much smaller) prime field \(\mathbb{F} = \mathbb{Z}/M_{61}\mathbb{Z},\)

\[ CW_f(y) = \left(\sum_{j=0}^{d - 1} y_j \cdot f^{d - j}\right) \mod 2^{61} - 1, \]

where \(f\in\mathbb{F}\) is the randomly chosen point at which we evaluate the polynomial, and \(y\), the polynomial’s coefficients, is the stream of 64-bit values obtained by splitting in half the PH output for each block. This choice saves 20-30 cycles of latency in the final block, compared to CLHash: modular multiplications have lower latency than carry-less multiplications for judiciously picked machine-integer-sized moduli, and integer multiplications seem to mix better, so we need less work in the finaliser.

Of course, UMASH sacrifices a lot of strength by working in \(\mathbb{F} =\, \bmod 2^{61} - 1:\) the resulting field is much smaller than \(2^{127}\), and we now have to update the polynomial twice for the same number of blocks. This means the collision probability starts worse \((\approx 2^{-61}\) instead of \(\approx 2^{-127})\), and grows twice as fast with the number of blocks \(n\) \((\approx 2n\cdot 2^{-61}\) instead of \(\approx n\cdot 2^{-61})\). But remember, we’re only aiming for collision probability \(< 2^{-35}\) and each block represents 256 bytes of input data, so this is acceptable, assuming that multi-gigabyte inputs are out of scope.

We protect against length extension collisions by xoring (adding, in the polynomial ring) the original byte size of the final block to its compressed PH output. This xor is simpler than CLHash’s finalisation step with a carry-less multiplication, but still sufficient: we can adapt Krovetz’s proof for VHASH by replacing NH’s almost-\(\Delta\)-universality with PH’s almost-XOR-universality.

Having this protection means we can extend short final blocks however we want. Rather than conceptually zero-padding our inputs (which adds complexity and thus latency on short inputs), we allow redundant reads. We bifurcate inputs shorter than 16 bytes to a completely different latency-optimised code path, and let the final PH iteration read the last 16 bytes of the input, regardless of how redundant that might be.

The semi-literate Python reference implementation has the full code and includes more detailed analysis and rationale for the design decisions.

Internal implementation tricks

The previous section already showed how we let micro-optimisation inform the high-level structure of UMASH. The use of PH over NH, our choice of a polynomial hash in a small modular field, and the way we handle short blocks all aim to improve the performance of production implementations. We also made sure to enable a couple more implementation tricks with lower level design decisions.

The block size is set to 256 bytes because we observed diminishing returns for larger blocks… but also because it’s reasonable to cache the PH loop’s parameters in 8 AVX registers, if we need to shave load µops.

More importantly, it’s easy to implement a Horner update with the prime modulus \(2^{61} - 1\). Better, that’s also true for a “double-pumped” Horner update, \(h^\prime = H_f(h, a, b) = af + (b + h)f^2.\)

The trick is to work in \(\bmod 2^{64} - 8 = \bmod 8\cdot(2^{-61} - 1),\) which lets us implement modular multiplication of an arbitrary 64-bit integer \(a\) by a multiplier \(0 < f < 2^{61} - 1\) without worrying too much about overflow. \(2^{64} \equiv 8 \mod 2^{64} - 8,\) so we can reduce a value \(x\) to a smaller representative with

\[af = x \equiv 8\lfloor x / 2^{64}\rfloor + (x\bmod 2^{64}) \mod 2^{64} - 8;\]

this equivalence is particularly useful when \(x < 2^{125}\): in that case, \(x / 2^{64} < 2^{61},\) and the intermediate product \(8\lfloor x / 2^{64}\rfloor < 2^{64}\) never overflows 64 bits. That’s exactly what happens when \(x = af\) is the product of \(0\leq a < 2^{64}\) and \(0 < f < 2^{61} - 1\). This also holds when we square the multiplier \(f\): it’s sampled from the field \(\mathbb{Z}/(2^{61} - 1)\mathbb{Z},\) so its square also satisfies \(f^2 < 2^{61}\) once fully reduced.

Integer multiplication instructions for 64-bit values will naturally split the product \(x = af\) in its high and low 64-bit half; we get \(\lfloor x / 2^{64}\rfloor\) and \(x\bmod 2^{64}\) for free. The rest of the double-pumped Horner update is a pair of modular additions, where only the final sum must be reduced to fit in \(\bmod 2^{64} - 8\). The resulting instruction-parallel double Horner update is only a few cycles slower than a single Horner update.

We also never fully reduce to \(\bmod 2^{61} - 1\). While the collision bound assumes that prime field, we simply work in its \(\bmod 2^{64} - 8\) extension. This does not affect the collision bound, and the resulting expression is still amenable to algebraic manipulation: modular arithmetic is a well defined ring even for composite moduli.

SMHasher tricks

A proof of almost-universality doesn’t mean a hash passes the SMHasher test suite. It should definitely guarantee collisions are (probably) rare enough, but SMHasher also looks at bit avalanching and bias, and universality is oblivious to these issues. Even XOR- or \(\Delta\)-universality doesn’t suffice: the hash values for a given string are well distributed when parameters are chosen uniformly at random, but this does not imply that hashes are always (or usually) well distributed for fixed parameters.

The most stringent SMHasher tests focus on short inputs: mostly up to 128 or 256 bits, unless “Extra” torture testing is enabled. In a way, this makes sense, given that arbitrary-length string hashing is provably harder than the bounded-length vector case. Moreover, a specialised code path for these inputs is beneficial, since they’re relatively common and deserve strong and low-latency hashes. That’s why UMASH uses a completely different code path for inputs of at most 8 bytes, and a specialised PH iteration for inputs of 9 to 16 bytes.

However, this means that SMHasher’s best avalanche and bias tests often tell us very little about the general case. For UMASH, the medium length (9 to 16 bytes) code path at least shares the same structure and finalisation logic as the code for longer inputs.

There may also be a bit of co-evolution between the test harness and the design of hash functions: the sort of xorshift/multiply mixers favoured by Appleby in the various versions of MurmurHash tends to do well on SMHasher. These mixers are also invertible, so we can take any hash function with good collision properties, mix its output with someone else’s series of xorshift and multiplications (in UMASH’s case, the SplitMix64 update function or a subset thereof), and usually find that the result satisfies SMHasher’s bias and avalanche tests.

It definitely looks like interleaving rightward bitwise operations and integer multiplications is a good mixing strategy. However, I find it interesting that the hash evaluation harness created by the author of MurmurHash steers implementations towards MurmurHash-syle mixing code.

Fun things to do with UMASH

The structure of UMASH lets us support more sophisticated usage patterns than merely hashing or fingerprinting an array of bytes.

The PH loop needs less than 17 bytes of state for its 16-byte accumulator and an iteration count, and the polynomial hash also needs 17 bytes, for its own 8-byte accumulator, the 8-byte “seed,” and a counter for the final block size (up to 256 bytes). The total comes up to 34 bytes of state, plus a 16-byte input buffer, since the PH loop consumes 16-byte chunks at a time. Coupled with the way we only consider the input size at the end of UMASH, this makes it easy to implement incremental hashing.

In fact, the state is small enough that our implementation stashes some parameter data inline in the state struct, and uses the same layout for hashing and fingerprinting with a pair of hashes (and thus double the state): most of the work happens in PH, which only accesses the constant parameter array, the shared input buffer and iteration counter, and its private 16-byte accumulator.

Incremental fingerprinting is a crucial capability for our caching system: cache keys may be large, so we want to avoid serialising them to an array of contiguous bytes just to compute a fingerprint. Efficient incrementality also means we can hash NUL-terminated C strings with a fused UMASH / strlen loop, a nice speed-up when the data is in cache.

The outer polynomial hash in UMASH is so simple to analyse that we can easily process blocks out of order. In my experience, such a “parallel hashing” capability is more important than peak throughput when checksumming large amounts of data coming over the wire. We usually maximise transfer throughput by asking for several ranges of data in parallel. Having to checksum these ranges in order introduces a serial bottleneck and the usual head-of-line blocking challenges; more importantly, checksumming in order adds complexity to code that should be as obviously correct as possible. The polynomial hash lets us hash an arbitrary subsequence of 256-byte aligned blocks and use modular exponentiation to figure out its impact on the final hash value, given the subsequence’s position in the checksummed data. Parallel hashing can exploit multiple cores (more cores, more bandwidth!) with simpler code.

The UMAC RFC uses a Toeplitz extension scheme to compute independent NH values while recycling most of the parameters. We do the same with PH, by adapting Krovetz’s proof to exploit PH’s almost-XOR-universality instead of NH’s almost-\(\Delta\)-universality. Our fingerprinting code reuses all but the first 32 bytes of PH parameters for the second hash: that’s the size of an AVX register, which makes is trivial to avoid loading parameters twice in a fused PH loop.

The same RFC also points out that concatenating the output of fast hashes lets validation code decide which speed-security trade-off makes sense for each situation: some applications may be willing to only compute and compare half the hashes.

We use that freedom when reading from large hash tables keyed on the UMASH fingerprint of strings. We compute a single UMASH hash value to probe the hash tables, and only hash the second half of the fingerprint when we find a probable hit. The idea is that hashing the search key (now hot in cache) a second time will be faster than comparing it against the hash entry’s string key in cold storage.

When we add this sort of trickery to our code base, it’s important to make sure the interfaces are hard to misuse. For example, it would be unfortunate if only one half of the 128-bit fingerprint were well distributed and protected against collisions: this would make it far too easy to implement the two-step lookup-by-fingerprint above correctly but inefficiently. That’s why we maximise the symmetry in the fingerprint: the two 64-bit halves are computed with the same algorithm to guarantee the same worst-case collision probability and distribution quality. This choice leaves fingerprinting throughput on the table when a weaker secondary hash would suffice. However, I prefer a safer if slightly slower interface to one ripe for silent performance bugs.

Caveat programmator

While we intend for UMASH to become our default hash and fingerprint function, it can’t be the right choice for every application.

First, it shouldn’t be used for authentication or similar cryptographic purposes: the implementation is probably riddled with side-channels, the function has no protection against parameter extraction or adaptive attacks, and collisions are too frequent anyway.

Obviously, this rules out using UMASH in a MAC, but might also be an issue for, e.g., hash tables where attackers control the keys and can extrapolate the hash values. A timing side-channel may let attackers determine when keys collide; once a set of colliding keys is known, the linear structure of UMASH makes it trivial to create more collisions by combining keys from that set. Worse, iterating over the hash table’s entries can leak the hash values, which would let an attacker slowly extract the parameters. We conservatively avoid non-cryptographic hashes and even hashed data structures for sections of the Backtrace code base where such attacks are in scope.

Second, the performance numbers reported by SMHasher (up to 22 ns when hashing 64 bytes or less, and 22 GB/s peak throughput) are probably a lie for real applications, even when running on the exact same 2.5 GHz Xeon 8175M hardware. These are best-case values, when the code and the parameters are all hot in cache… and that’s a fair amount of bytes for UMASH. The instruction footprint for a 64-bit hash is 1435 bytes (comparable to heavier high-throughput hashes, like the 1600-byte xxh3_64 or 1350-byte farmhash64), and the parameters span 288 bytes (320 for a fingerprint).

There is a saving grace for UMASH and other complex hash functions: the amount of bytes executed is proportional to the input size (e.g., the code for 8 or fewer byte only needs 141 bytes, and would inline to around 100 bytes), and the number of parameters read is bounded by the input length. Although UMASH can need a lot of instruction and parameter bytes, the worst case only happens for larger inputs, where the cache misses can hopefully be absorbed by the work of loading and hashing the data.

The numbers are also only representative of powerful CPUs with carry-less multiplication in hardware. The PH inner loop has 50% higher throughput than NH (22 vs 14 GB/s) on contemporary Intel servers. The carry-less approach still has an edge over 128-bit modular arithmetic on AMD’s Naples, but less so, around 20-30%. We did not test on ARM (the Backtrace database only runs on x86-64), but I would assume the situation there is closer to AMD’s than Intel’s.

However, I also believe we’re more likely to observe improved performance for PH than NH in future micro-architectures: the core of NH, full-width integer multiplication, has been aggressively optimised by now, while the gap between Intel and AMD shows there may still be low-hanging fruits for the carry-less multiplications at the heart of PH. So, NH is probably already as good as it’s going to be, but we can hope that PH will continue to benefit from hardware optimisations, as chip designers improve the performance of cryptographic algorithms like AES-GCM.

Third and last, UMASH isn’t fully stabilised yet. We do not plan to modify the high level structure of UMASH, a PH block compressor that feeds into a polynomial string hash. However, we are looking for suggestions to improve its latency on short inputs, and to simplify the finaliser while satisfying SMHasher’s distribution tests.

Help us improve UMASH

We believe UMASH is ready for non-persistent usage: we’re confident in its quality, but the algorithm isn’t set in stone yet, so hash or fingerprint values should not reach long-term storage. We do not plan to change anything that will affect the proof of collision bound, but improvements to the rest of the code are more than welcome.

In particular:

  1. The short (8 or fewer bytes) input code can hopefully be simpler.
  2. The medium-length (9-15 bytes) input code path is a micro-optimised version of the general case, but does not actually share any machine code; can we improve the latency and maintain the collision bound by replacing it with something completely different?
  3. It’s already nice that we can get away with a single round of xorshift / multiply in the finaliser, but can we shave even more latency there?
  4. We only looked at straightforward x86-64 implementations; we will consider tweaks that improve performance on x86-64, or on other platforms without penalising x86-64.
  5. We currently only use incremental and one-shot hashing interfaces. If someone needs parallel hashing, we can collaborate to find out what that interface could look like.

A hash function is a perfect target for automated correctness and performance testing. I hope to use UMASH as a test bed for the automatic evaluation (and approval?!) of pull requests.

Of course, you’re also welcome to just use UMASH as a single-file C library or re-implement it to fit your requirements. The MIT-licensed C code is on GitHub, and we can definitely discuss validation strategies for alternative implementations.

Finally, our fingerprinting use case shows collision rates are probably not something to minimise, but closer to soft constraints. We estimate that, once the probability reaches \(2^{-70}\), collisions are rare enough to only compare fingerprints instead of the fingerprinted values. However, going lower than \(2^{-70}\) doesn’t do anything for us.

It would be useful to document other back-of-the-envelope requirements for a hash function’s output size or collision rate. Now that most developers work on powerful 64-bit machines, it seems far too easy to add complexity and waste resources for improved collision bounds that may not unlock any additional application.


Any error in the analysis or the code is mine, but a few people helped improve UMASH and its presentation.

Colin Percival scanned an earlier version of the reference implementation for obvious issues, encouraged me to simplify the parameter generation process, and prodded us to think about side channels, even in data structures.

Joonas Pihlaja helped streamline my initial attempt while making the reference implementation easier to understand.

Jacob Shufro independently confirmed that he too found the reference implementation understandable, and tightened the natural language.

Phil Vachon helped me gain more confidence in the implementation tricks borrowed from VHASH after replacing the NH compression function with PH.