Turns out the part where I simply asserted a property was slightly off đ. I had to go back to an older proof with weaker bounds, but thatâs not catastrophic: we still collide way more rarely than \(2^{70}\).
The core of UMASH is a hybrid PH/(E)NH block compression function. That function is fast (it needs one multiplication for each 16byte âchunkâ in a block), but relatively weak: despite a 128bit output, the worstcase probability of collision is \(2^{64}\).
For a fingerprinting application, we want collision probability less than \(\approx 2^{70},\) so thatâs already too weak, before we even consider merging a variablelength string of compressed block values.
The initial UMASH proposal compresses each block with two independent compression functions. Krovetz showed that we could do so while reusing most of the key material (random parameters), with a Toeplitz extension, and I simply recycled the proof for UMASHâs hybrid compressor.
Thatâs good for the memory footprint of the random parameters, but doesnât help performance: we still have to do double the work to get double the hash bits.
Earlier this month, Jim Apple pointed me at a promising alternative that doubles the hash bit with only one more multiplication. The construction adds finite field operations that arenât particularly efficient in software, on top of the additional 64x64 > 128 (carryless) multiplication, so isnât a slam dunk win over a straightforward Toeplitz extension. However, Jim felt like we could âspendâ some of the bits we donât need for fingerprinting (\(2^{128}\) collision probability is overkill when we only need \(2^{70}\)) in order to make do with faster operations.
Turns out he was right! We can use carryless multiplications by sparse
constants (concretely, xorshift and one more shift) without any
reducing polynomial, on independent 64bit halvesâŠ and still
collide with probability at most 2^{126} \(2^{98}\).
The proof is fairly simple, but relies on a bit of notation for clarity. Letâs start by restating UMASHâs hybrid PH/ENH block compressor in that notation.
How does UMASH currently work?
The current block compressor in UMASH splits a 256byte block \(m\) in 16 chunks \(m_i,\, i\in [0, 15]\) of 128 bits each, and processes all but the last chunk with a PH loop,
\[ \bigoplus_{i=0}^{14} \mathtt{PH}(k_i, m_i), \]
where
\[ \mathtt{PH}(k_i, m_i) = ((k_i \bmod 2^{64}) \oplus (m_i \bmod 2^{64})) \odot (\lfloor k_i / 2^{64} \rfloor \oplus \lfloor m_i / 2^{64} \rfloor) \]
and each \(k_i\) is a randomly generated 128bit parameter.
The compression loop in UMASH handles the last chunk, along with a size tag (to protect against extension attacks), with ENH:
\[ \mathtt{ENH}(k, x, y) = ((k + x) \bmod 2^{64}) \cdot (\lfloor k / 2^{64}\rfloor + \lfloor x / 2^{64} \rfloor \bmod 2^{64}) + y \mod 2^{128}. \]
The core operation in ENH is a full (64x64 > 128) integer multiplication, which has lower latency than PHâs carryless multiplication on x8664. Thatâs why UMASH switches to ENH for the last chunk. We use ENH for only one chunk because combining multiple NH values calls for 128bit additions, and thatâs slower than PHâs xors. Once we have mixed the last chunk and the size tag with ENH, the result is simply xored in with the previous chunksâ PH values:
\[ \left(\bigoplus_{i=0}^{14} \mathtt{PH}(k_i, m_i)\right) \oplus \mathtt{ENH}(m_{15}, k_{15}, \mathit{tag}). \]
This function is annoying to analyse directly, because we end up having to manipulate different proofs of almostuniversality. Letâs abstract things a bit, and reduce the ENH/PH to the bare minimum we need to find our collision bounds.
Letâs split our message blocks in \(n\) (\(n = 16\) for UMASH) âchunksâ, and apply an independently sampled mixing function to each chunk. Letâs say we have two messages \(m\) and \(m^\prime\) with chunks \(m_i\) and \(m^\prime_i\), for \(i\in [0, n)\), and let \(h_i\) be the result of mixing chunk \(m_i,\) and \(h^\prime_i\) that of mixing \(m^\prime_i.\)
Weâll assume that the first chunk is mixed with a \(2^{w}\)almostuniversal (\(2^{64}\) for UMASH) hash function: if \(m_0 \neq m^\prime_0,\) \(\mathrm{P}[h_0 = h^\prime_0] \leq 2^{w},\) (where the probability is taken over the set of randomly chosen parameters for the mixer). Otherwise, \(m_0 = m^\prime_0 \Rightarrow h_i = h^\prime_i\).
This first chunks stands for the ENH iteration in UMASH.
Every remaining chunk will instead be mixed with a \(2^{w}\)XORalmostuniversal hash function: if \(m_i \neq m^\prime_i\) (\(0 < i < n\)), \(\mathrm{P}[h_i \oplus h^\prime_i = y] \leq 2^{w}\) for any \(y,\) where the probability is taken over the randomly generated parameter for the mixer.
This stronger condition represents the PH iterations in UMASH.
We hash a full block by xoring all the mixed chunks together:
\[ H = \bigoplus_{i = 0}^{n  1} h_i, \]
and
\[ H^\prime = \bigoplus_{i = 0}^{n  1} h^\prime_i. \]
We want to bound the probability that \(H = H^\prime \Leftrightarrow H \oplus H^\prime = 0,\) assuming that the messages differ (i.e., there is at least one index \(i\) such that \(m_i \neq m^\prime_i\)).
If the two messages only differ in \(m_0 \neq n^\prime_0\) (and thus \(m_i = m^\prime_i,\,\forall i \in [1, n)\)),
\[ \bigoplus_{i = 1}^{n  1} h_i = \bigoplus_{i = 1}^{n  1} h^\prime_i, \]
and thus \(H = H^\prime \Leftrightarrow h_0 = h^\prime_0\).
By hypothesis, the 0th chunks are mixed with a \(2^{w}\)almostuniversal hash, so this happens with probability at most \(2^{w}\).
Otherwise, assume that \(m_j \neq m^\prime_j\), for some \(j \in [1, n)\). We will rearrange the expression
\[ H \oplus H^\prime = h_j \oplus h^\prime_j \oplus \left(\bigoplus_{i\in [0, n) \setminus \{ j \}} h_i \oplus h^\prime_i\right). \]
Letâs conservatively replace that unwieldly sum with an adversarially chosen value \(y\):
\[ H \oplus H^\prime = h_j \oplus h^\prime_j \oplus y, \]
and thus \(H = H^\prime\) iff \(h_j \oplus h^\prime_j = y.\) By hypothesis, the \(j\)th chunk (every chunk but the 0th), is mixed with a \(2^{w}\)almostXORuniversal hash, and this thus happens with probability at most \(2^{w}\).
In both cases, we find a collision probability at most \(2^{w}\) with a simple analysis, despite combining mixing functions from different families over different rings.
Wringing more bits out of the same mixers
We combined strong mixers (each is \(2^{w}\)almostuniversal), and only got a \(2^{w}\)almostuniversal output. It seems like we should be able to do better when two or more chunks differ.
As Nandi points outs, we can apply erasure codes to derive additional chunks from the original messagesâ contents. We only need one more chunk, so we can simply xor together all the original chunks:
\[m_n = \bigoplus_{i=0}^{n  1} m_i,\]
and similarly for \(m^\prime_n\). If \(m\) and \(m^\prime\) differ in only one chunk, \(m_n \neq m^\prime_n\). Itâs definitely possible for \(m_n = m^\prime_n\) when \(m \neq m^\prime\), but only if two or more chunks differ.
We will again mix \(m_n\) and \(m^\prime_n\) with a fresh \(2^{w}\)almostXORuniversal hash function to yield \(h_n\) and \(h^\prime_n\).
We want to xor the result \(h_n\) and \(h^\prime_n\) with the second (still undefined) hash values \(H_2\) and \(H^\prime_2\); if \(m_n \neq m^\prime_n\), the final xored values are equal with probability at most \(2^{w}\), regardless of \(H_2\) and \(H^\prime_2\ldots\) and, crucially, independently of \(H \neq H^\prime\).
When the two messages \(m\) and \(m^\prime\) only differ in a single (initial) chunk, mixing a LRC checksum gives us an independent hash function, which squares the collision probability to \(2^{2w}\).
Now to the interesting bit: we must define a second hash function that combines \(h_0,h_1,\ldots, h_{n  1}\) and \(h^\prime_0, h^\prime_1, \ldots, h^\prime_{n  1}\) such that the resulting hash values \(H_2\) and \(H^\prime_2\) collide independently enough of \(H\) and \(H^\prime\). Thatâs a tall order, but we do have one additional assumption to work with: we only care about collisions in this second hash function if the additional checksum chunks are equal, which means that the two messages differ in two or more chunks (or theyâre identical).
For each index \(0 < i < n\), weâll fix a public linear (with xor as the addition) function \(\overline{xs}_i(x)\). This family of function must have two properties:
 \(f(x) = x \oplus \overline{xs}_i(x)\) is invertible for all \(0 < i < n\).
 For any two distinct \(0 < i < j < n\), xoring the two functions together into \(g(x) = \overline{xs}_i(x) \oplus \overline{xs}_j(x)\) yields a function with a small (low rank) null space. In other words, while \(g\) may not be invertible, it must be âpretty close.â
For regularity, we will also define \(\overline{xs}_0(x) = x\).
Concretely, let \(\overline{xs}_1(x) = x \mathtt{Â«} 1\), where the bitshift is computed for the two 64bit halves independently, and \(\overline{xs}_i(x) = (x \mathtt{Â«} 1) \oplus (x \mathtt{Â«} i)\) for \(i > 1\), again with all the bitshifts computed independently over the two 64bit halves.
To see that these satisfy our requirements, we can represent the functions as carryless multiplication by distinct âevenâ constants (the least significant bit is 0) on each 64bit half:
 Once we xor in \(x\), we get a multiplication by an odd constant, and thatâs invertible.
 Combining \(\overline{xs}_i\) and \(\overline{xs}_j\) with xor yields either \(x \mathtt{Â«} j\) or \((x \mathtt{Â«} i) \oplus (x \mathtt{Â«} j)\). In the worst case, we lose \(j\) bits in each 64bit half, and there are thus \(2^{2j}\) values in \(g^{1}(0)\).
To recapitulate, we defined the first hash function as
\[ H = \bigoplus_{i = 0}^{n  1} h_i, \]
the (xor) sum of the mixed value \(h_i\) for each chunk \(m_i\) in the message block \(m\), and similarly for \(H^\prime\) and \(h^\prime_i\).
Weâll let the second hash function be
\[ H_2 \oplus h_n = \left(\bigoplus_{i = 0}^{n  1} \overline{xs}_i(h_i)\right) \oplus h_n, \]
and
\[ H^\prime_2 \oplus h^\prime_n = \left(\bigoplus_{i = 0}^{n  1} \overline{xs}_i(h^\prime_i)\right) \oplus h^\prime_n. \]
We can finally get down to business and find some collision bounds. Weâve already shown that both \(H = H^\prime\) and \(H_2 \oplus h_n = H^\prime_2 \oplus h^\prime_n\) collide simultaneously with probability at most \(2^{2w}\) when the checksum chunks differ, i.e., when \(m_n \neq m^\prime_n\).
Letâs now focus on the case when \(m \neq m^\prime\), but \(m_n = m^\prime_n\). In that case, we know that at least two chunks \(0 \leq i < j < n\) differ: \(m_i \neq m^\prime_i\) and \(m_j \neq m^\prime_j\).
If only two chunks \(i\) and \(j\) differ, and one of them is the \(i = 0\)th chunk, we want to bound the probability that
\[ h_0 \oplus h_j = h^\prime_0 \oplus h^\prime_j \]
and
\[ h_0 \oplus \overline{xs}_j(h_j) = h^\prime_0 \oplus \overline{xs}_j(h^\prime_j), \]
both at the same time.
Letting \(\Delta_i = h_i \oplus h^\prime_i\), we can reformulate the two conditions as
\[ \Delta_0 = \Delta_j \] and \[ \Delta_0 = \overline{xs}_j(\Delta_j). \]
Taking the xor of the two conditions yields
\[ \Delta_j \oplus \overline{xs}_j(\Delta_j) = 0, \]
which is only satisfied for \(\Delta_j = 0\), since \(f(x) = x \oplus \overline{xs}_j(x)\) is an invertible linear function. This also forces \(\Delta_0 = 0\).
By hypothesis, \(\mathrm{P}[\Delta_j = 0] \leq 2^{w}\), and \(\mathrm{P}[\Delta_0 = 0] \leq 2^{w}\) as well. These two probabilities are independent, so we get a probability that both hash collide less than or equal to \(2^{2w}\) (\(2^{128}\)).
In the other case, we have messages that differ in at least two chunks \(0 < i < j < n\): \(m_i \neq m^\prime_i\) and \(m_j \neq m^\prime_j\).
We can simplify the collision conditions to
\[ h_i \oplus h_j = h^\prime_i \oplus h^\prime_j \oplus y \]
and
\[ \overline{xs}_i(h_i) \oplus \overline{xs}_j(h_j) = \overline{xs}_i(h^\prime_i) \oplus \overline{xs}_j(h^\prime_j) \oplus z, \]
for \(y\) and \(z\) generated arbitrarily (adversarially), but without knowledge of the parameters that generated \(h_i, h_j, h^\prime_i, h^\prime_j\).
Again, let \(\Delta_i = h_i \oplus h^\prime_i\) and \(\Delta_j = h_j \oplus h^\prime_j\), and reformulate the conditions into
\[ \Delta_i \oplus \Delta_j = y \] and \[ \overline{xs}_i(\Delta_i) \oplus \overline{xs}_j(\Delta_j) = z. \]
Letâs apply the linear function \(\overline{xs}_i\) to the first condition
\[ \overline{xs}_i(\Delta_i) \oplus \overline{xs}_i(\Delta_j) = \overline{xs}_i(y); \]
since \(\overline{xs}_i\) isnât invertible, the result isnât equivalent, but is a weaker (necessary, not sufficient) version of the initial condiion.
After xoring that with the second condition
\[ \overline{xs}_i(\Delta_i) \oplus \overline{xs}_j(\Delta_j) = z, \]
we find
\[ \overline{xs}_i(\Delta_j) \oplus \overline{xs}_j(\Delta_j) = \overline{xs}_i(y) \oplus z. \]
By hypothesis, the null space of \(g(x) = \overline{xs}_i(x) \oplus \overline{xs}_j(x)\) is âsmall.â For our concrete definition of \(\overline{xs}\), there are \(2^{2j}\) values in that null space, which means that \(\Delta_j\) can only satisfy the combined xored condition by taking one of at most \(2^{2j}\) values; otherwise, the two hashes definitely canât both collide.
Since \(j < n\), this happens with probability at most \(2^{2(n  1)  w} \leq 2^{34}\) for UMASH with \(w = 64\) and \(n = 16\).
Finally, for any given \(\Delta_j\), there is at most one \(\Delta_i\) that satisfies
\[ \Delta_i \oplus \Delta_j = y,\]
and so both hashes collide with probability at most \(2^{98}\), for \(w = 64\) and \(n = 16\).
Astute readers will notice that we could let \(\overline{xs}_i(x) = x \mathtt{Â«} i\), and find the same combined collision probability. However, this results in a much weaker secondary hash, since a chunk could lose up to \(2n  2\) bits (\(n  1\) in each 64bit half) of hash information to a plain shift. The shifted xorshifts might be a bit slower to compute, but guarantees that we only lose at most 2 bits^{1} of information per chunk. This feels like an interface thatâs harder to misuse.
If one were to change the \(\overline{xs}_i\) family of functions, I think it would make more sense to look at a more diverse form of (still sparse) multipliers, which would likely let us preserve a couple more bits of independence. Jim has constructed such a family of multipliers, in arithmetic modulo \(2^{64}\); Iâm sure we could find something similar in carryless multiplication. The hard part is implementing these multipliers: in order to exploit the multipliersâ sparsity, weâd probably have to fully unroll the block hashing loop, and thatâs not something I like to force on implementations.
What does this look like in code?
The base UMASH block compressor mixes all but the last of the message blockâs 16byte chunks with PH: xor the chunk with the corresponding bytes in the parameter array, computes a carryless multiplication of the xored chunksâ half with the other half. The last chunk goes through a variant of ENH with an invertible finaliser (safe because we only rely on \(\varepsilon\)almostuniversality), and everything is xored in the accumulator.
The collision proofs above preserved the same structure for the first hash.
The second hash reuses so much work from the first that it mostly makes sense to consider a combined loop that computes both (regular UMASH and this new xorshifted variant) block compression functions at the same time.
The first change for this combined loop is that we need to xor
together all 16bytes chunk in the message, and mix the resulting
checksum with a fresh PH function. Thatâs equivalent to xoring
everything in a new accumulator (or two accumulators when working with
256bit vectors) initialised with the PH parameters, and CLMUL
ing
together the accumulatorâs two 64bit halves at the end.
We also have to apply the \(\overline{xs}_i\) quasixorshift
functions to each \(h_i\). The trick is to accumulate the shifted
values in two variables: one is the regular UMASH accumulator without
\(h_0\) (i.e., \(h_1 \oplus h_2 \ldots\)), and the other shifts
the current accumulator before xoring in a new value, i.e.,
\(\mathtt{acc}^\prime = (\mathtt{acc} \mathtt{Â«} 1) \oplus h_i\),
where the left shift on parallel 64bit halves simply adds acc
to itself.
This additional shifted accumulator includes another special case to skip \(\overline{xs}_1(x) = x \mathtt{Â«} 1\); thatâs not a big deal for the code, since we already have to special case the last iteration for the ENH mixer.
Armed with \(\mathtt{UMASH} = \bigoplus_{i=1}^{n  1} h_i\) and \(\mathtt{acc} = \bigoplus_{i=2}^{n  1} h_i \mathtt{Â«} (i  1),\) we have \[\bigoplus_{i=1}^{n  1} \overline{xs}_i(h_i) = (\mathtt{UMASH} \oplus \mathtt{acc}) \mathtt{Â«} 1.\]
We just have to xor in the PH
mixed checksum \(h_n\), and finally
\(h_0\) (which naturally goes in GPRs, so can be computed while we
extract values out of vector registers).
We added two vector xors and one addition for each chunk in a block,
and, at the end, one CLMUL
plus a couple more xors and adds again.
This should most definitely be faster than computing two UMASH at the
same time, which incurred two vector xors and a CLMUL
(or full
integer multiplication) for each chunk: even when CLMUL
can pipeline
one instruction per cycle, vector additions can dispatch to more
execution units, so the combined throughput is still higher.
One last thing: what if we have blocks of different length?
Itâs easy to show that UMASH is relatively safe when one block is shorter than the other, and we simply xor together fewer mixed chunks. Without loss of generality, we can assume the longer block has \(n\) chunks; that blockâs final ENH is independent of the shorter blockâs UMASH, and any specific value occurs with probability at most \(2^{63}\) (the probability of a multiplication by zero).
A similar argument seems more complex to defend for the shifted UMASH.
Luckily, we can tweak the LRC checksum we use to generate an additional chunk in the block: rather than xoring together the raw message chunks, weâll xor them after xoring them with the PH key, i.e.,
\[m_n = \bigoplus_{i=0}^{n  1} m_i \oplus k_i, \]
where \(k_i\) are the PH parameters for each chunk.
When checksumming blocks of the same size, this is a noop with respect
to collision probabilities. Implementations might however benefit
from the ability to use a fused xor
with load from memory^{2}
to compute \(m_i \oplus k_i\), and feed that both into the checksum
and into CLMUL
for PH.
Unless weâre extremely unlucky (\(m_{n  1} = k_{n  1}\), with probability \(2^{2w}\)), the long blockâs LRC will differ from the shorter blockâs. As long as we always xor in the same PH parameters when mixing the artificial LRC, the secondary hashes collide with probability at most \(2^{64}\).
With a small tweak to the checksum function, we can easily guarantee that blocks with a different number of chunks collide with probability less than \(2^{126}\).^{3}
Thank you Joonas for helping me rubber duck the presentation, and Jim for pointing me in the right direction, and for the fruitful discussion!

Itâs even better for UMASH, since we obtained these shifted chunks by mixing with PH. The result of PH is the carryless product of two 64bit values, so the most significant bit is always 0. The shiftedxorshift doesnât erase any information in the high 64bit half!Â ↩

This might also come with a small latency hit, which is unfortunate since PHing \(m_n\) is likely to be on the critical pathâŠ but one cycle doesnât seem that bad.Â ↩

The algorithm to expand any input message to a sequence of full 16byte chunks is fixed. Thatâs why we incorporate a size tag in ENH; that makes it impossible for two messages of different lengths to collide when they are otherwise identical after expansion.Â ↩