Getting these nonblocking protocols right is still challenging, but the challenge is fundamental to reliable systems. The same problems, solutions, and space/functionality tradeoffs appear in all distributed systems. Some would even argue that the kind of interfaces that guarantee lock or wait freedom are closer to the object-oriented ideals.
Of course, there is still a place for clever instruction sequences that avoid internal locks, for code that may be paused anywhere without freezing the whole system: interrupts can’t always be disabled, read operations should avoid writing to shared memory if they can, and a single atomic read-modify-write operation may be faster than locking. The key point for me is that this complexity is opt-in: we can choose to tackle it incrementally, as a performance problem rather than as a prerequisite for correctness.
We don’t have the same luxury in userspace. We can’t start by focusing on the fundamentals of a nonblocking algorithm, and only implement interruptable sequences where it makes sense. Userspace can’t disable preemption, so we must think about the minutiae of interruptable code sequences from the start; nonblocking algorithms in userspace are always in hard mode, where every step of the protocol might be paused at any instruction.
Specifically, the problem with nonblocking code in user space isn’t that threads or processes can be preempted at any point, but rather that the preemption can be observed. It’s a PCLSRing issue! Even Unices guarantee programmers won’t observe a thread in the middle of a syscall: when a thread (process) must be interrupted, any pending syscall either runs to completion, or returns with an error^{1}. What we need is a similar guarantee for steps of our own nonblocking protocols^{2}.
Hardware transactional memory kind of solves the problem (preemption aborts any pending transaction) but is a bit slow^{3}, and needs a fallback mechanism. Other emulation schemes for PCLSRing userspace code divide the problem in two:

1. detecting that the holder of a critical section has been preempted;
2. preventing the preempted code from simply resuming normal execution.
The first part is relatively easy. For per-CPU data, it suffices to observe that we are running on a given CPU (e.g., core #4), and that another thread claims to own the same CPU’s (core #4’s) data. For global locks, we can instead spin for a while before entering a slow path that determines whether the holder has been preempted, by reading scheduling information in /proc.
The second part is harder. I have played with schemes that relied on signals, but was never satisfied: I found Linux perf will rarely, but not never, drop interrupts when I used it to “profile” context switches, and signaling when we determine that the holder has been preempted has memory visibility issues for per-CPU data^{5}.
Until earlier this month, the best known solution on mainline Linux involved cross-modifying code! When a CPU executes a memory write instruction, that write is affected by the registers, virtual memory mappings, and the instruction’s bytes. Contemporary operating systems rarely let us halt and tweak another thread’s general purpose registers (Linux won’t let us self-ptrace, nor pause an individual thread). Virtual memory mappings are per-process, and can’t be modified from the outside. The only remaining angle is modifying the preemptee’s machine code.
That’s what Facebook’s experimental library Rseq (restartable sequences) actually does.
I’m not happy with that solution either: while it “works,” it requires per-thread clones of each critical section, and makes us deal with cross-modifying code. I’m not comfortable with leaving code pages writable, and we also have to guarantee the preemptee’s writes are visible. For me, the only defensible implementation is to modify the code by mmaping pages in place, which incurs an IPI per modification. The total system overhead thus scales superlinearly with the number of CPUs.
With Mathieu Desnoyers’s, Paul Turner’s, and Andrew Hunter’s patch to add an rseq syscall to Linux 4.18, we finally have a decent answer. Rather than triggering special code when a thread detects that another thread has been preempted in the middle of a critical section, userspace can associate recovery code with the address range for each restartable critical section’s instructions. Whenever the kernel preempts a thread, it detects whether the interruptee is in such a restartable sequence, and, if so, redirects the instruction pointer to the associated recovery code. This essentially means that critical sections must be read-only except for the last instruction in the section, but that’s not too hard to satisfy. It also means that we incur recovery even when no one would have noticed, but the overhead should be marginal (there’s at most one recovery per timeslice), and we get a simpler programming model in return.
Earlier this year, I found another way to prevent critical sections from resuming normal execution after being preempted. It’s a total hack that exercises a state-saving defect in Linux/x86-64, but I’m comfortable sharing it now that Rseq is in mainline: if anyone needs the functionality, they can update to 4.18, or backport the feature.
(Code listing elided: read_value, a short load whose address depends on a segment register.)
With an appropriate setup, the read_value function above will return a different value once the executing thread is switched out. No, the kernel isn’t overwriting read-only data while we’re switched out. When I listed the set of inputs that affect a memory store or load instruction (general purpose registers, virtual memory mappings, and the instruction bytes), I left out one last x86 thing: segment registers.
Effective addresses on x86oids are about as feature-rich as it gets: they sum a base address, a shifted index, a constant offset, and, optionally, a segment base. Today, we simply use segment bases to implement thread-local storage (each thread’s FS or GS offset points to its thread-local block), but that usage repurposes memory segmentation, an old 8086 feature… and x86-64 still maintains some backward compatibility with its 16-bit ancestor. There’s a lot of unused complexity there, so it’s plausible that we’ll find information leaks or otherwise flawed architectural state switching by poking around segment registers.
After learning about this trick to observe interrupts from userland, I decided to do a close reading of Linux’s task switching code on x86-64 and eventually found this interesting comment^{6}.
Observing a value of 0 in the FS or GS registers can mean two things:

1. the last value written in the register was 0, and the segment base is 0;
2. userspace wrote a 0 in there before setting up the segment base directly, with WR{FS,GS}BASE or by writing to a model-specific register (MSR).

Hardware has to efficiently keep track of which is actually in effect. If userspace wrote a 0 in FS or GS, prefixing an instruction with that segment has no impact; if the MSR write is still active (and is non-zero), using that segment must impact effective address computation.
There’s no easy way to do the same in software. Even in ring 0, the only surefire way to distinguish between the two cases is to actually read the current segment base value, and that’s slow. Linux instead fast-paths the common case, where the segment register is 0 because the kernel is handling segment bases. It prioritises that use case so much that the code knowingly sacrifices correctness when userspace writes a 0 in a segment register after asking the kernel to set up its segment base directly.
This incorrectness is acceptable because it only affects the thread that overwrites its segment register, and no one should go through that sequence of operations. Legacy code can still manipulate segment descriptor tables and address them in segment registers. However, being legacy code, it won’t use the modern syscall that directly manipulates the segment base. Modern code can let the kernel set the segment base without playing with descriptor tables, and has no reason to look at segment registers.
The only way to observe the buggy state saving is to go looking for it, with something like the code below (which uses GS because FS is already taken by glibc to implement thread-local storage).
(Code listing elided: h4x.c, the experiment whose output is analysed below.)
Running the above on my Linux 4.14/x86-64 machine yields
$ gcc-6 -std=gnu99 h4x.c && ./a.out
Reads: XXYX
Rereads: XYX
The first set of reads shows that:

1. the GS base is initially 0 (reads[0] == values[0]);
2. writing a 0 in GS does not change that (reads[1] == values[0]);
3. setting the GS base to 1 with arch_prctl does work (reads[2] == values[1]);
4. resetting the GS selector to 0 resets the base (reads[3] == values[0]).

The second set of reads shows that:
1. the GS base is still 0 (re_reads[0] == values[0]);
2. being suspended and resuming execution resets the GS base to the arch_prctl value (re_reads[1] == values[1]);
3. writing a 0 in GS resets the base again (re_reads[2] == values[0]).

The property demonstrated in the hack above is that, after our call to arch_prctl, we can write a 0 in GS with a regular instruction to temporarily reset the GS base to 0, and know it will revert to the arch_prctl offset again when the thread resumes execution, after being suspended.
We now have to ensure our restartable sequences are no-ops when the GS base is reset to the arch_prctl offset, and that the no-op is detected as such. For example, we could set the arch_prctl offset to something small, like 4 or 8 bytes, and make sure that any address we wish to mutate in a critical section is followed by 4 or 8 bytes of padding that can be detected as such. If a thread is switched out in the middle of a critical section, its GS base will be reset to 4 or 8 when the thread resumes execution; we must guarantee that this offset will make the critical section’s writes fail.
If a write is a compare-and-swap, we only have to make sure the padding’s value is unambiguously different from real data: reading the padding instead of the data will make the compare-and-swap fail, and the old value will tell us that it failed because we read padding, which should only happen after the section is preempted. We can play similar tricks with fetch-and-add (e.g., real data is always even, while the padding is odd), or atomic bitwise operations (steal the sign bit).
If we’re willing to eat a signal after a context switch, we can set the arch_prctl offset to something very large, and take a segmentation fault after being rescheduled. Another option is to set the arch_prctl offset to 1, and use a double-wide compare-and-swap (CMPXCHG16B), or turn on the AC (alignment check) bit in EFLAGS. After a context switch, our destination address will be misaligned, which will trigger a SIGBUS that we can handle.
The last two options aren’t great, but, if we make sure to regularly write a 0 in GS, signals should be triggered rarely, only when preemption happens between the last write to GS and a critical section. They also have the advantages of avoiding the need for padding, and making it trivial to detect when a restartable section was interrupted. Detection is crucial because it often isn’t safe to assume an operation failed when it actually succeeded (e.g., unwittingly succeeding at popping from a memory allocator’s freelist would leak memory). When a GS-prefixed instruction fails, we must be able to tell from the instruction’s result, and nothing else. We can’t just check if the segment base is still what we expect, after the fact: our thread could have been preempted right after the special GS-prefixed instruction, before our check.
Once we have restartable sections, we can use them to implement per-CPU data structures (instead of per-thread ones), or to let threads acquire locks and hold them until they are preempted: with restartable sections that only write if there was no preemption between the lock acquisition and the final store instruction, we can create a revocable lock abstraction and implement wait-free coöperation or flat combining.
Unfortunately, our restartable sections will always be hard to debug: observing a thread’s state in a regular debugger like GDB will reset the GS base and abort the section. That’s not unique to the segment hack approach. Hardware transactional memory will abort critical sections when debugged, and there’s similar behaviour with the official rseq syscall. It’s hard enough to PCLSR userspace code; it would be even harder to PCLSR-except-when-the-interruption-is-for-debugging.
The null GS
hack sounds like it only works because of a pile of
questionable design decisions. However, if we look at the historical
context, I’d say everything made sense.
Intel came up with segmentation back when 16-bit pointers were big, but 64KB of RAM not quite capacious enough. They didn’t have 32-bit (never mind 64-bit) addresses in mind, nor threads; they only wanted to address 1 MB of RAM with their puny registers. When thread libraries abused segments to implement thread-local storage, the only other options were to overalign the stack and hide information there, or to steal a register. Neither sounds great, especially with x86’s six-and-a-half general purpose registers. Finally, when AMD decided to rip out segmentation, but keep FS and GS, they needed to make porting x86 code as easy as possible, since that was the whole value proposition for AMD64 over Itanium.
I guess that’s what systems programming is about. We take our tools, get comfortable with their imperfections, and use that knowledge to build new tools by breaking the ones we already have (#Mickens).
Thank you Andrew for a fun conversation that showed the segment
hack might be of interest to someone else, and to Gabe for snarkily
reminding us Rseq
is another Linux/Silicon Valley
reinvention.
That’s not as nice as rewinding the PC to just before the syscall, with a fixed-up state that will resume the operation, but it is simpler to implement, and usually good enough. Classic worse is better (Unix semantics are also safer with concurrency, but that could have been opt-in…).↩
That’s not a new observation, and SUN heads like to point to prior art like Dice’s and Garthwaite’s Mostly Lock-Free Malloc, Garthwaite’s, Dice’s, and White’s work on preemption notification for per-CPU buffers, or Harris’s and Fraser’s revocable locks. Linux sometimes has to reinvent everything with its special flavour.↩
For instance, SuperMalloc optimistically uses TSX to access per-CPU caches, but TSX is slow enough that SuperMalloc first tries to use a per-thread cache. Dice and Harris explored the use of hardware transactional lock elision solely to abort on context switches; they maintained high system throughput under contention by trying the transaction once before falling back to a regular lock.↩
I did not expect systems programming to get near multi-agent epistemic logic ;)↩
Which is fixable with LOCKed instructions, but that defeats the purpose of per-CPU data.↩
I actually found the logic bug before the Spectre/Meltdown fire drill and was worried the hole would be plugged. This one survived the purge. fingers crossed↩
I recently resumed thinking about balls and bins for hash tables. This time, I’m looking at large bins (on the order of one 2MB huge page). There are many hashing methods with solid worstcase guarantees that unfortunately query multiple uncorrelated locations; I feel like we could automatically adapt them to modern hierarchical storage (or address translation) to make them more efficient, for a small loss in density.
In theory, large enough bins can be allocated statically with a minimal waste of space. I wanted some actual non-asymptotic numbers, so I ran numerical experiments and got the following distribution of global utilisation (fill rate) when the first bin fills up.
It looks like, even with one thousand bins of thirty thousand values, we can expect almost 98% space utilisation until the first bin saturates. I want something more formal.
Could I establish something like a service level objective, “When distributing balls randomly between one thousand bins with individual capacity of thirty thousand balls, we can utilise at least 98% of the total space before a bin fills up, x% of the time?”
The natural way to compute the “x%” that makes the proposition true is to first fit a distribution on the observed data, then find out the probability mass for that distribution that lies above 98% fill rate. Fitting distributions takes a lot of judgment, and I’m not sure I trust myself that much.
Alternatively, we can observe independent identically distributed fill rates, check if they achieve 98% space utilisation, and bound the success rate for this Bernoulli process.
There are some non-trivial questions associated with this approach.
Thankfully, I have been sitting on a software package to compute satisfaction rates for exactly this kind of SLO-type property, properties of the form “this indicator satisfies $PREDICATE x% of the time,” with arbitrarily bounded false positive rates.
The code takes care of adaptive stopping, generates a credible interval, and spits out a report like this: we see the threshold (0.98), the empirical success rate estimate (0.993 ≫ 0.98), a credible interval for the success rate, and the shape of the probability mass for success rates.
This post shows how to compute credible intervals for the Bernoulli’s success rate, how to implement a dynamic stopping criterion, and how to combine the two while compensating for multiple hypothesis testing. It also gives two examples of converting more general questions to SLO form, and answers them with the same code.
If we run the same experiment \(n\) times, and observe \(a\) successes (\(b = n - a\) failures), it’s natural to ask for an estimate of the success rate \(p\) for the underlying Bernoulli process, assuming the observations are independent and identically distributed.
Intuitively, that estimate should be close to \(a / n\), the empirical success rate, but that’s not enough. I also want something that reflects the uncertainty associated with small \(n\), much like in the following ridge line plot, where different phrases are assigned not only a different average probability, but also a different spread.
I’m looking for an interval of plausible success rates \(p\) that responds to both the empirical success rate \(a / n\) and the sample size \(n\); that interval should be centered around \(a / n\), be wide when \(n\) is small, and become gradually tighter as \(n\) increases.
The Bayesian approach is straightforward, if we’re willing to shut up and calculate. Once we fix the underlying success rate \(p = \hat{p}\), the conditional probability of observing \(a\) successes and \(b\) failures is
\[P((a, b) \mid p = \hat{p}) \sim \hat{p}\sp{a} \cdot (1 - \hat{p})\sp{b},\]
where the right-hand side is a proportion^{1}, rather than a probability.
We can now apply Bayes’s theorem to invert the condition and the event. The inversion will give us the conditional probability that \(p = \hat{p}\), given that we observed \(a\) successes and \(b\) failures. We only need to impose a prior distribution on the underlying rate \(p\). For simplicity, I’ll go with the uniform \(U[0, 1]\), i.e., every success rate is equally plausible, at first. We find
\[P(p = \hat{p} \mid (a, b)) = \frac{P((a, b) \mid p = \hat{p}) P(p = \hat{p})}{P(a, b)}.\]
We already picked the uniform prior, \(P(p = \hat{p}) = 1\,\forall \hat{p}\in [0,1],\) and the denominator is a constant with respect to \(\hat{p}\). The expression simplifies to
\[P(p = \hat{p} \mid (a, b)) \sim \hat{p}\sp{a} \cdot (1 - \hat{p})\sp{b},\]
or, if we normalise to obtain a probability,
\[P(p = \hat{p} \mid (a, b)) = \frac{\hat{p}\sp{a} \cdot (1 - \hat{p})\sp{b}}{\int\sb{0}\sp{1} \hat{p}\sp{a} \cdot (1 - \hat{p})\sp{b}\, d\hat{p}} = \textrm{Beta}(a+1, b+1).\]
A bit of calculation, and we find that our credibility estimate for the underlying success rate follows a Beta distribution. If one is really into statistics, they can observe that the uniform prior distribution is just the \(\textrm{Beta}(1, 1)\) distribution, and rederive that the Beta is the conjugate distribution for the Binomial distribution.
For me, it suffices to observe that the distribution \(\textrm{Beta}(a+1, b+1)\) is unimodal, does peak around \(a / (a + b)\), and becomes tighter as the number of observations grows. In the following image, I plotted three Beta distributions, all with empirical success rate 0.9; red corresponds to \(n = 10\) (\(a = 9\), \(b = 1\), \(\textrm{Beta}(10, 2)\)), black to \(n = 100\) (\(\textrm{Beta}(91, 11)\)), and blue to \(n = 1000\) (\(\textrm{Beta}(901, 101)\)).
We calculated, and we got something that matches my intuition. Before trying to understand what it means, let’s take a detour to simply plot points from that unnormalised proportion function \(\hat{p}\sp{a} \cdot (1 - \hat{p})\sp{b}\), on an arbitrary \(y\) axis.
Let \(\hat{p} = 0.4\), \(a = 901\), \(b = 101\). Naïvely entering the expression at the REPL yields nothing useful.
CL-USER> (* (expt 0.4d0 901) (expt (- 1 0.4d0) 101))
0.0d0
The issue here is that the unnormalised proportion is so small that it underflows double floats and becomes a round zero. We can guess that the normalisation factor \(\frac{1}{\mathrm{Beta}(\cdot,\cdot)}\) quickly grows very large, which will bring its own set of issues when we do care about the normalised probability.
How can we renormalise a set of points without underflow? The usual trick to handle extremely small or large magnitudes is to work in the log domain. Rather than computing \(\hat{p}\sp{a} \cdot (1  \hat{p})\sp{b}\), we shall compute
\[\log\left[\hat{p}\sp{a} \cdot (1 - \hat{p})\sp{b}\right] = a \log\hat{p} + b \log (1 - \hat{p}).\]
CL-USER> (+ (* 901 (log 0.4d0)) (* 101 (log (- 1 0.4d0))))
-877.1713374189787d0
CL-USER> (exp *)
0.0d0
That’s somewhat better: the log-domain value is not \(-\infty\), but converting it back to a regular value still gives us 0.
The \(\log\) function is monotonic, so we can find the maximum proportion value for a set of points, and divide everything by that maximum value to get plottable points. There’s one last thing that should change: when \(x\) is small, \(1 - x\) will round most of \(x\) away. Instead of (log (- 1 x)), we should use (log1p (- x)) to compute \(\log (1 + (-x)) = \log (1 - x)\). Common Lisp did not standardise log1p, but SBCL does have it in its internals, as a wrapper around libm. We’ll just abuse that for now.
CL-USER> (defun proportion (x) (+ (* 901 (log x)) (* 101 (sb-kernel:%log1p (- x)))))
PROPORTION
CL-USER> (defparameter *points* (loop for i from 1 upto 19 collect (/ i 20d0)))
*POINTS*
CL-USER> (reduce #'max *points* :key #'proportion)
-327.4909190001001d0
We have to normalise in the log domain, which is simply a subtraction: \(\log(x / y) = \log x - \log y\). In the case above, we will subtract \(-327.49\ldots\), i.e., add a massive \(327.49\ldots\) to each log proportion (i.e., multiply by \(10\sp{142}\)). The resulting values should have a reasonably non-zero range.
CL-USER> (mapcar (lambda (x) (cons x (exp (- (proportion x) *)))) *points*)
((0.05d0 . 0.0d0)
 (0.1d0 . 0.0d0)
 [...]
 (0.35d0 . 3.443943164733533d-288)
 [...]
 (0.8d0 . 2.0682681158181894d-16)
 (0.85d0 . 2.6252352579425913d-5)
 (0.9d0 . 1.0d0)
 (0.95d0 . 5.65506756824607d-10))
There’s finally some signal in there. This is still just an unnormalised proportion function, not a probability density function, but that’s already useful to show the general shape of the density function, something like the following, for \(\mathrm{Beta}(901, 101)\).
Finally, we have a probability density function for the Bayesian update of our belief about the success rate after \(n\) observations of a Bernoulli process, and we know how to compute its proportion function. Until now, I’ve carefully avoided the question of what all these computations even mean. No more (:
The Bayesian view assumes that the underlying success rate (the value we’re trying to estimate) is unknown, but sampled from some distribution. In our case, we assumed a uniform distribution, i.e., that every success rate is a priori equally likely. We then observe \(n\) outcomes (successes or failures), and assign an updated probability to each success rate. It’s like a many-worlds interpretation in which we assume we live in one of a set of worlds, each with a success rate sampled from the uniform distribution; after observing 900 successes and 100 failures, we’re more likely to be in a world where the success rate is 0.9 than in one where it’s 0.2. With Bayes’s theorem to formalise the update, we assign posterior probabilities to each potential success rate value.
We can compute an equal-tailed credible interval from that \(\mathrm{Beta}(a+1,b+1)\) posterior distribution by excluding the leftmost values, \([0, l)\), such that the Beta CDF (cumulative distribution function) at \(l\) is \(\varepsilon / 2\), and doing the same with the rightmost values to cut away \(\varepsilon / 2\) of the probability density. The CDF for \(\mathrm{Beta}(a+1,b+1)\) at \(x\) is the incomplete beta function, \(I\sb{x}(a+1,b+1)\). That function is really hard to compute (this technical report detailing Algorithm 708 deploys five different evaluation strategies), so I’ll address that later.
The more orthodox “frequentist” approach to confidence intervals treats the whole experiment, from data collection to analysis (to publication, independent of the observations 😉) as an Atlantic City algorithm: if we allow a false positive rate of \(\varepsilon\) (e.g., \(\varepsilon=5\%\)), the experiment must return a confidence interval that includes the actual success rate (population statistic or parameter, in general) with probability \(1 - \varepsilon\), for any actual success rate (or underlying population statistic / parameter). When the procedure fails, with probability at most \(\varepsilon\), it is allowed to fail in an arbitrary manner.
The same Atlantic City logic applies to \(p\)values. An experiment (data collection and analysis) that accepts when the \(p\)value is at most \(0.05\) is an Atlantic City algorithm that returns a correct result (including “don’t know”) with probability at least \(0.95\), and is otherwise allowed to yield any result with probability at most \(0.05\). The \(p\)value associated with a conclusion, e.g., “success rate is more than 0.8” (the confidence level associated with an interval) means something like “I’m pretty sure that the success rate is more than 0.8, because the odds of observing our data if that were false are small (less than 0.05).” If we set that threshold (of 0.05, in the example) ahead of time, we get an Atlantic City algorithm to determine if “the success rate is more than 0.8” with failure probability 0.05. (In practice, reporting is censored in all sorts of ways, so…)
There are ways to recover a classical confidence interval, given \(n\) observations from a Bernoulli. However, they’re pretty convoluted, and, as Jaynes argues in his note on confidence intervals, the classical approach gives values that are roughly the same^{2} as the Bayesian approach… so I’ll just use the Bayesian credibility interval instead.
See this stackexchange post for a lot more details.
The way statistics are usually deployed is that someone collects a data set, as rich as is practical, and squeezes that static data set dry for significant results. That’s exactly the setting for the credible interval computation I sketched in the previous section.
When studying the properties of computer programs or systems, we can usually generate additional data on demand, given more time. The problem is knowing when it’s ok to stop wasting computer time, because we have enough data… and how to determine that without running into multiple hypothesis testing issues (ask anyone who’s run A/B tests).
Here’s an example of an intuitive but completely broken dynamic stopping criterion. Let’s say we’re trying to find out if the success rate is less than or greater than 90%, and are willing to be wrong 5% of the time. We could get \(k\) data points, run a statistical test on those data points, and stop if the data let us conclude with 95% confidence that the underlying success rate differs from 90%. Otherwise, collect \(2k\) fresh points, run the same test; collect \(4k, \ldots, 2\sp{i}k\) points. Eventually, we’ll have enough data.
The issue is that each time we execute the statistical test that determines if we should stop, we run a 5% risk of being totally wrong. For an extreme example, if the success rate is exactly 90%, we will eventually stop, with probability 1. When we do stop, we’ll inevitably conclude that the success rate differs from 90%, and we will be wrong. The worstcase (over all underlying success rates) false positive rate is 100%, not 5%!
In my experience, programmers tend to sidestep the question by wasting CPU time with a large, fixed, number of iterations… people are then less likely to run our statistical tests, since they’re so slow, and everyone loses (the other popular option is to impose a reasonable CPU budget, with error thresholds so lax we end up with a smoke test).
Robbins, in Statistical Methods Related to the Law of the Iterated Logarithm, introduces a criterion that, given a threshold success rate \(p\) and a sequence of (infinitely many!) observations from the same Bernoulli with unknown success rate parameter, will be satisfied infinitely often when \(p\) differs from the Bernoulli’s success rate. Crucially, Robbins also bounds the false positive rate, the probability that the criterion be satisfied even once in the infinite sequence of observations if the Bernoulli’s unknown success rate is exactly equal to \(p\). That criterion is
\[{n \choose a} p\sp{a} (1-p)\sp{n-a} \leq \frac{\varepsilon}{n+1},\]
where \(n\) is the number of observations, \(a\) the number of successes, \(p\) the threshold success rate, and \(\varepsilon\) the error (false positive) rate. As the number of observations grows, the criterion becomes more and more stringent to maintain a bounded false positive rate over the whole infinite sequence of observations.
There are similar “Confidence Sequence” results for other distributions (see, for example, this paper of Lai), but we only care about the Binomial here.
More recently, Ding, Gandy, and Hahn showed that Robbins’s criterion also guarantees that, when it is satisfied, the empirical success rate (\(a/n\)) lies on the correct side of the threshold \(p\) (same side as the actual unknown success rate) with probability \(1\varepsilon\). This result leads them to propose the use of Robbins’s criterion to stop Monte Carlo statistical tests, which they refer to as the Confidence Sequence Method (CSM).
(defun csm-stop-p (successes failures threshold eps)
  "Pseudocode; this will not work on a real machine."
  (let ((n (+ successes failures)))
    (<= (* (choose n successes)
           (expt threshold successes)
           (expt (- 1 threshold) failures))
        (/ eps (1+ n)))))
We may call this predicate at any time with more independent and identically distributed results, and stop as soon as it returns true.
The CSM is simple (it’s all in Robbins’s criterion), but still provides good guarantees. The downside is that it is conservative when we have a limit on the number of observations: the method “hedges” against the possibility of having a false positive in the infinite number of observations after the limit, observations we will never make. For computergenerated data sets, I think having a principled limit is pretty good; it’s not ideal to ask for more data than strictly necessary, but not a blocker either.
In practice, there are still real obstacles to implementing the CSM on computers with finite precision (floating point) arithmetic, especially since I want to preserve the method’s theoretical guarantees (i.e., make sure rounding is one-sided to overestimate the left-hand side of the inequality).
If we implement the expression well, the effect of rounding on correctness should be less than marginal. However, I don’t want to be stuck wondering whether my bad results are due to known approximation errors in the method, rather than to errors in the code. Moreover, if we do have a tight expression with little rounding error, adjusting it to make the errors one-sided should have almost no impact. That seems like a good tradeoff to me, especially if I’m going to use the CSM semi-automatically, in continuous integration scripts, for example.
One look at csm-stop-p shows we'll have the same problem we had with the proportion function for the Beta distribution: we're multiplying very small and very large values. We'll apply the same fix: work in the log domain and exploit \(\log\)'s monotonicity.
\[{n \choose a} p\sp{a} (1-p)\sp{n-a} \leq \frac{\varepsilon}{n+1}\]
becomes
\[\log {n \choose a} + a \log p + (n-a)\log (1-p) \leq \log\varepsilon - \log(n+1),\]
or, after some more expansions, and with \(b = n - a\),
\[\log n! - \log a! - \log b! + a \log p + b \log(1 - p) + \log(n+1) \leq \log\varepsilon.\]
The new obstacle is computing the factorial \(x!\), or rather the log-factorial \(\log x!\). We shouldn't compute the factorial iteratively: otherwise, we could spend more time in the stopping criterion than in the data generation subroutine. Robbins has another useful result for us:
\[\sqrt{2\pi} n\sp{n + ½} \exp(-n) \exp\left(\frac{1}{12n+1}\right) < n! < \sqrt{2\pi} n\sp{n + ½} \exp(-n) \exp\left(\frac{1}{12n}\right),\]
or, in the log domain,
\[\log\sqrt{2\pi} + \left(n + \frac{1}{2}\right)\log n - n + \frac{1}{12n+1} < \log n! < \log\sqrt{2\pi} + \left(n + \frac{1}{2}\right)\log n - n +\frac{1}{12n}.\]
This double inequality gives us a way to overapproximate \(\log {n \choose a} = \log \frac{n!}{a! b!} = \log n! - \log a! - \log b!,\) where \(b = n - a\):
\[\log {n \choose a} < -\log\sqrt{2\pi} + \left(n + \frac{1}{2}\right)\log n - n +\frac{1}{12n} - \left(a + \frac{1}{2}\right)\log a + a - \frac{1}{12a+1} - \left(b + \frac{1}{2}\right)\log b + b - \frac{1}{12b+1},\]
where the rightmost expression in Robbins's double inequality replaces \(\log n!\), which must be overapproximated, and the leftmost replaces \(\log a!\) and \(\log b!\), which must be underapproximated (the two subtracted \(\log\sqrt{2\pi}\) terms are what flips the sign of the constant).
Robbins's approximation works well for us because it is one-sided, and guarantees that the (relative) error in \(n!\), \(\frac{\exp\left(\frac{1}{12n}\right) - \exp\left(\frac{1}{12n+1}\right)}{n!},\) is small, even for small values like \(n = 5\) (error \(< 0.0023\%\)), and decreases with \(n\): as we perform more trials, the approximation is increasingly accurate, thus less likely to spuriously prevent us from stopping.
Now that we have a conservative approximation of Robbins's criterion that only needs the four arithmetic operations and logarithms (and log1p), we can implement it on a real computer. The only challenge left is regular floating point arithmetic stuff: if rounding must occur, we must make sure it is in a safe (conservative) direction for our predicate.
Hardware usually lets us manipulate the rounding mode to force floating point arithmetic operations to round up or down, instead of the usual round to even. However, that tends to be slow, so most language (implementations) don't support changing the rounding mode, or do so badly… which leaves us in a multi-decade hardware/software co-evolution Catch-22.
I could think hard and derive tight bounds on the round-off error, but I'd rather apply a bit of brute force. IEEE-754 compliant implementations must round the four basic operations correctly. This means that \(z = x \oplus y\) is at most half a ULP away from \(x + y,\) and thus either \(z = x \oplus y \geq x + y,\) or the next floating point value after \(z,\) \(z^\prime,\) satisfies \(z^\prime \geq x + y\). We can find this "next value" portably in Common Lisp, with decode-float/scale-float, and some hand-waving for denormals.
(defun next (x &optional (delta 1))
  "Increment X by DELTA ULPs. Very conservative for
small (0/denormalised) values."
  (declare (type double-float x)
           (type unsigned-byte delta))
  (let* ((exponent (nth-value 1 (decode-float x)))
         (ulp (max (scale-float double-float-epsilon exponent)
                   least-positive-normalized-double-float)))
    (+ x (* delta ulp))))
I prefer to manipulate IEEE-754 bits directly. That's theoretically not portable, but the platforms I care about make sure we can treat floats as sign-magnitude integers.
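The bit-punning listing isn't reproduced here; the real code reinterprets the bits of a double directly. As a hedged, portable sketch of what the pair of functions might look like (assuming IEEE-754 doubles, that integer-decode-float returns 53-bit significands for normalised values, and ignoring infinities and NaNs):

```lisp
(defun double-float-bits (x)
  "Map X to an integer such that the two's complement order of
the integers matches the order of the doubles."
  (declare (type double-float x))
  (flet ((raw (x) ; raw IEEE-754 bits of a non-negative double
           (if (zerop x)
               0
               (multiple-value-bind (significand exponent)
                   (integer-decode-float x)
                 (if (< significand (ash 1 52)) ; denormal
                     significand
                     (logior (ash (+ exponent 52 1023) 52)
                             (ldb (byte 52 0) significand)))))))
    (if (or (plusp x) (eql x 0d0)) ; positive, or positive zero
        (raw x)
        (lognot (raw (- x)))))) ; two's complement for negatives

(defun bits-double-float (bits)
  "Inverse of DOUBLE-FLOAT-BITS."
  (declare (type integer bits))
  (flet ((from-raw (raw)
           (let ((biased (ash raw -52))
                 (frac (ldb (byte 52 0) raw)))
             (if (zerop biased) ; denormal or zero
                 (scale-float (float frac 1d0) -1074)
                 (scale-float (float (logior (ash 1 52) frac) 1d0)
                              (- biased 1075))))))
    (if (minusp bits)
        (- (from-raw (lognot bits)))
        (from-raw bits))))
```

This sketch should reproduce the REPL session below, modulo the hand-waving around implementation-defined corners of integer-decode-float.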
CL-USER> (double-float-bits pi)
4614256656552045848
CL-USER> (double-float-bits (- pi))
-4614256656552045849
The two's complement value for (- pi) is one less than (- (double-float-bits pi)) because two's complement does not support signed zeros.
CL-USER> (eql 0 (- 0))
T
CL-USER> (eql 0d0 (- 0d0))
NIL
CL-USER> (double-float-bits 0d0)
0
CL-USER> (double-float-bits -0d0)
-1
We can quickly check that the round trip from float to integer and back is an identity.
CL-USER> (eql pi (bits-double-float (double-float-bits pi)))
T
CL-USER> (eql (- pi) (bits-double-float (double-float-bits (- pi))))
T
CL-USER> (eql 0d0 (bits-double-float (double-float-bits 0d0)))
T
CL-USER> (eql -0d0 (bits-double-float (double-float-bits -0d0)))
T
We can also check that incrementing or decrementing the integer representation does increase or decrease the floating point value.
CL-USER> (< (bits-double-float (1- (double-float-bits pi))) pi)
T
CL-USER> (< (bits-double-float (1- (double-float-bits (- pi)))) (- pi))
T
CL-USER> (bits-double-float (1- (double-float-bits 0d0)))
-0.0d0
CL-USER> (bits-double-float (1+ (double-float-bits -0d0)))
0.0d0
CL-USER> (bits-double-float (1+ (double-float-bits 0d0)))
4.9406564584124654d-324
CL-USER> (bits-double-float (1- (double-float-bits -0d0)))
-4.9406564584124654d-324
The code doesn't handle special values like infinities or NaNs, but that's out of scope for the CSM criterion anyway. That's all we need to nudge the result of the four operations to guarantee an over- or under-approximation of the real value. We can also look at the documentation for our libm (e.g., for GNU libm) to find error bounds on functions like log; GNU claims their log is never off by more than 3 ULP. We can round up to the fourth next floating point value to obtain a conservative upper bound on \(\log x\).
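The nudging code itself is not shown here. A minimal sketch of the idea, in terms of the double-float-bits/bits-double-float pair above (the nudge helper name and the ±4 ULP fudge factor are mine, based on GNU's claimed 3 ULP bound):

```lisp
(defun nudge (x delta)
  "Move X by DELTA ULPs in the total order induced by
DOUBLE-FLOAT-BITS; negative DELTA moves down."
  (bits-double-float (+ (double-float-bits x) delta)))

(defun log-up (x)
  "Conservative upper bound on (log x): trust libm to be within
3 ULPs of the true value, and round 4 ULPs up."
  (nudge (log x) 4))

(defun log-down (x)
  "Conservative lower bound on (log x)."
  (nudge (log x) -4))
```

The same trick, with a delta of ±1, rounds the result of the four correctly rounded basic operations in a known-safe direction.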
I could go ahead and use the building blocks above (ULP nudging for directed rounding) to directly implement Robbins’s criterion,
\[\log {n \choose a} + a \log p + b\log (1-p) + \log(n+1) \leq \log\varepsilon,\]
with Robbins’s factorial approximation,
\[\log {n \choose a} < -\log\sqrt{2\pi} + \left(n + \frac{1}{2}\right)\log n - n +\frac{1}{12n} - \left(a + \frac{1}{2}\right)\log a + a - \frac{1}{12a+1} - \left(b + \frac{1}{2}\right)\log b + b - \frac{1}{12b+1}.\]
However, even in the log domain, there's a lot of cancellation: we're taking the difference of relatively large numbers to find a small result. It's possible to avoid some of that by reassociating the terms above, e.g., for \(a\):
\[\left(a + \frac{1}{2}\right) \log a + a - a \log p = \frac{\log a}{2} + a (\log a + 1 - \log p).\]
Instead, I'll just brute force things (again) with Kahan summation. Shewchuk's presentation in Adaptive Precision Floating-Point Arithmetic and Fast Robust Geometric Predicates highlights how the only step where we may lose precision to rounding is when we add the current compensation term to the new summand. We can implement Kahan summation with directed rounding in only that one place: all the other operations are exact!
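The summation code is omitted here. A rough round-up sketch, reusing the conservative next function from earlier (the real code is more careful, and also provides a round-down variant):

```lisp
(defun sum-up (values)
  "Compensated summation that should never underestimate the exact
sum of VALUES.  The only inexact step, folding the running
compensation term into the next summand, is rounded up with NEXT;
the rounding error of each accumulation is then recovered exactly
with Knuth's two-sum (which assumes round-to-nearest)."
  (let ((acc 0d0)
        (err 0d0))
    (declare (type double-float acc err))
    (dolist (x values (next (+ acc err)))
      (let* ((y (next (+ x err))) ; the one lossy step: round up
             (sum (+ acc y))
             ;; Knuth's two-sum: despite intermediate rounding,
             ;; the value stored in ERR below is exactly the
             ;; rounding error of acc + y.
             (y-virtual (- sum acc))
             (acc-virtual (- sum y-virtual)))
        (setf err (+ (- acc acc-virtual) (- y y-virtual))
              acc sum)))))
```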
We need one last thing to implement \(\log {n \choose a}\), and then Robbins's confidence sequence: a safely rounded floating-point approximation of \(\log \sqrt{2 \pi}\). I precomputed one with computable-reals:
CL-USER> (computable-reals:log-r
          (computable-reals:sqrt-r computable-reals:+2pi-r+))
+0.91893853320467274178...
CL-USER> (computable-reals:ceiling-r
          (computable-reals:*-r * (ash 1 53)))
8277062471433908
-0.65067431749790398594...
CL-USER> (* 8277062471433908 (expt 2d0 -53))
0.9189385332046727d0
CL-USER> (computable-reals:-r (rational *)
                              ***)
+0.00000000000000007224...
We can safely replace \(\log\sqrt{2\pi}\) with 0.9189385332046727d0, or, equivalently, (scale-float 8277062471433908.0d0 -53), for an upper bound. If we wanted a lower bound, we could decrement the integer significand by one.
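The robbins-log-choose listing is omitted here. A simplified sketch of the shape of the computation (without the directed rounding and compensated summation the surrounding text calls for, with boundary cases handled crudely, and with a hypothetical constant name) might be:

```lisp
(defconstant +log-sqrt-2pi-lo+ (scale-float 8277062471433907d0 -53)
  "Lower bound on log sqrt(2 pi): the up-rounded significand from
the computable-reals session above, decremented by one.")

(defun robbins-log-choose (n a)
  "Upper bound on (log (choose n a)) via Robbins's factorial bounds.
Sketch only: the real code rounds every operation conservatively."
  (let ((b (- n a)))
    (if (or (zerop a) (zerop b))
        0d0 ; (choose n 0) = (choose n n) = 1, and log 1 = 0
        (let ((n (float n 1d0))
              (a (float a 1d0))
              (b (float b 1d0)))
          (+ (- +log-sqrt-2pi-lo+)
             (* (+ n .5d0) (log n)) (- n) (/ (* 12d0 n))
             (- (* (+ a .5d0) (log a))) a (- (/ (1+ (* 12d0 a))))
             (- (* (+ b .5d0) (log b))) b (- (/ (1+ (* 12d0 b)))))))))
```

Up to rounding details, this sketch should essentially reproduce the checks against computable-reals below.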
We can quickly check against an exact implementation with computable-reals and a brute force factorial.
CL-USER> (defun cr-log-choose (n s)
           (computable-reals:-r
            (computable-reals:log-r (alexandria:factorial n))
            (computable-reals:log-r (alexandria:factorial s))
            (computable-reals:log-r (alexandria:factorial (- n s)))))
CR-LOG-CHOOSE
CL-USER> (computable-reals:-r (rational (robbins-log-choose 10 5))
                              (cr-log-choose 10 5))
+0.00050526703375914436...
CL-USER> (computable-reals:-r (rational (robbins-log-choose 1000 500))
                              (cr-log-choose 1000 500))
+0.00000005551513197557...
CL-USER> (computable-reals:-r (rational (robbins-log-choose 1000 5))
                              (cr-log-choose 1000 5))
+0.00025125559085509706...
That’s not obviously broken: the error is pretty small, and always positive.
Given a function to overapproximate log-choose, the Confidence Sequence Method's stopping criterion is straightforward.
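The criterion listing is omitted here. Given robbins-log-choose and the log-domain inequality above, a hedged sketch matching the (csm n alpha s log-eps) call shape used later in this post could look like (the real code rounds each step conservatively, e.g., with a log1p and ULP nudging):

```lisp
(defun csm (n alpha s log-eps)
  "Robbins's criterion for N i.i.d. trials with S successes, at
threshold ALPHA and log false positive rate LOG-EPS.  Returns two
values: whether we may stop, and the log-domain level that is
compared against LOG-EPS."
  (let ((log-level (+ (robbins-log-choose n s)
                      (* s (log alpha))
                      (* (- n s) (log (- 1d0 alpha))) ; log1p territory
                      (log (+ n 1d0)))))
    (values (<= log-level log-eps) log-level)))
```

Up to rounding details, this sketch should agree with the transcript in the demo section below.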
The other, much harder, part is computing credible (Bayesian) intervals for the Beta distribution. I won’t go over the code, but the basic strategy is to invert the CDF, a monotonic function, by bisection^{3}, and to assume we’re looking for improbable (\(\mathrm{cdf} < 0.5\)) thresholds. This assumption lets us pick a simple hypergeometric series that is normally useless, but converges well for \(x\) that correspond to such small cumulative probabilities; when the series converges too slowly, it’s always conservative to assume that \(x\) is too central (not extreme enough).
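The bisection loop itself is worth sketching, with a hypothetical helper that is not the post's code: invert any non-decreasing function over a bracket, gaining one bit of precision per iteration.

```lisp
(defun invert-monotone (f target lo hi &optional (iterations 64))
  "Find X in [LO, HI] such that (funcall F X) straddles TARGET,
for a non-decreasing function F; each iteration adds one bit of
precision.  Returns the conservative (low) end of the bracket."
  (declare (type double-float target lo hi))
  (loop repeat iterations
        do (let ((mid (+ lo (* 0.5d0 (- hi lo)))))
             (if (< (funcall f mid) target)
                 (setf lo mid)
                 (setf hi mid)))
        finally (return lo)))
```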
That's all we need to demo the code. Looking at the distribution of fill rates for the 1000 bins @ 30K ball/bin facet, it looks like we almost always hit at least 97.5% global density; let's say that happens with probability at least 98%. We can ask the CSM to tell us when we have enough data to confirm or disprove that hypothesis, with a 0.1% false positive rate.
Instead of generating more data on demand, I’ll keep things simple and prepopulate a list with new independently observed fill rates.
CL-USER> (defparameter *observations* '(0.978518900
                                        0.984687300
                                        0.983160833
                                        [...]))
CL-USER> (defun test (n)
           (let ((count (count-if (lambda (x) (>= x 0.975))
                                  *observations*
                                  :end n)))
             (csm:csm n 0.98d0 count (log 0.001d0))))
CL-USER> (test 10)
NIL
2.1958681996231784d0
CL-USER> (test 100)
NIL
2.5948497850893184d0
CL-USER> (test 1000)
NIL
-3.0115331544604658d0
CL-USER> (test 2000)
NIL
-4.190687115879456d0
CL-USER> (test 4000)
T
-17.238559826956475d0
We can also use the inverse Beta CDF to get a 99.9% credible interval. After 4000 trials, we found 3972 successes.
CL-USER> (count-if (lambda (x) (>= x 0.975))
                   *observations*
                   :end 4000)
3972
These values give us the following lower and upper bounds on the 99.9% CI.
CL-USER> (csm:beta-icdf 3972 (- 4000 3972) 0.001d0)
0.9882119750976562d0
1.515197753898523d-5
CL-USER> (csm:beta-icdf 3972 (- 4000 3972) 0.001d0 t)
0.9963832682169742d0
2.0372679238045424d-13
And we can even reuse and extend the Beta proportion code from earlier to generate this embeddable SVG report.
There’s one small problem with the sample usage above: if we compute the stopping criterion with a false positive rate of 0.1%, and do the same for each end of the credible interval, our total false positive (error) rate might actually be 0.3%! The next section will address that, and the equally important problem of estimating power.
It’s not always practical to generate data forever. For example, we might want to bound the number of iterations we’re willing to waste in an automated testing script. When there is a bound on the sample size, the CSM is still correct, just conservative.
We would then like to know the probability that the CSM will stop successfully when the underlying success rate differs from the threshold rate \(p\) (alpha in the code). The problem is that, for any bounded number of iterations, we can come up with an underlying success rate so close to \(p\) (but still different) that the CSM can't reliably distinguish between the two.
If we want to be able to guarantee any termination rate, we need two thresholds: the CSM will stop whenever it’s likely that the underlying success rate differs from either of them. The hardest probability to distinguish from both thresholds is close to the midpoint between them.
With two thresholds and the credible interval, we’re running three tests in parallel. I’ll apply a Bonferroni correction, and use \(\varepsilon / 3\) for each of the two CSM tests, and \(\varepsilon / 6\) for each end of the CI.
That logic is encapsulated in csm-driver. We only have to pass a success value generator function to the driver. In our case, the generator is itself a call to csm-driver, with fixed thresholds (e.g., 96% and 98%), and a Bernoulli sampler (e.g., return T with probability 97%). We can see if the driver returns successfully and correctly at each invocation of the generator function, with the parameters we would use in production, and recursively compute an estimate for that procedure's success rate with the CSM. The following expression simulates a CSM procedure with thresholds at 96% and 98%, the (usually unknown) underlying success rate in the middle, at 97%, a false positive rate of at most 0.1%, and an iteration limit of ten thousand trials. We pass that simulation's result to csm-driver, and ask whether the simulation's success rate differs from 99%, while allowing one in a million false positives.
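The simulation expression is omitted here; its shape, per the description above, might look roughly like the following hypothetical sketch (the exact csm-driver signature, the :max-count keyword, and the inner-run-successful-p helper are simplifications of the real interface, not its actual API):

```lisp
;; Hypothetical sketch of the power simulation described above.
(csm:csm-driver
 (lambda (i)
   (declare (ignore i))
   ;; One trial: a full inner CSM run with thresholds at 96% and
   ;; 98%, a Bernoulli(0.97) sampler, eps = 0.1%, and a limit of
   ;; ten thousand iterations.  The trial counts as a success if
   ;; the inner run stops in time with the correct conclusion.
   (inner-run-successful-p
    (csm:csm-driver (lambda (i)
                      (declare (ignore i))
                      (< (random 1d0) 0.97d0)) ; Bernoulli sampler
                    0.96d0 1d-3
                    :alpha-hi 0.98d0 :max-count 10000)))
 0.99d0 1d-6)
```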
We find that yes, we can expect the 96%/98%/0.1% false positive/10K iterations setup to succeed more than 99% of the time. The code above is available as csm-power, with a tighter outer false positive rate of 1e-9. If we only allow 1000 iterations, csm-power quickly tells us that, with one CSM success in 100 attempts, we can expect the CSM success rate to be less than 99%.
CL-USER> (csm:csm-power 0.97d0 0.96d0 1000 :alpha-hi 0.98d0 :eps 1d-3 :stream *standard-output*)
1 0.000e+0 1.250e-10 10.000e-1 1.699e+0
10 0.000e+0 0.000e+0 8.660e-1 1.896e+1
20 0.000e+0 0.000e+0 6.511e-1 3.868e+1
30 0.000e+0 0.000e+0 5.099e-1 5.851e+1
40 2.500e-2 5.518e-7 4.659e-1 7.479e+1
50 2.000e-2 4.425e-7 3.952e-1 9.460e+1
60 1.667e-2 3.694e-7 3.427e-1 1.144e+2
70 1.429e-2 3.170e-7 3.024e-1 1.343e+2
80 1.250e-2 2.776e-7 2.705e-1 1.542e+2
90 1.111e-2 2.469e-7 2.446e-1 1.741e+2
100 1.000e-2 2.223e-7 2.232e-1 1.940e+2
100 iterations, 1 successes (false positive rate < 1.000000e-9)
success rate p ~ 1.000000e-2
confidence interval [2.223495e-7, 0.223213 ]
p < 0.990000
max inner iteration count: 816
T
T
0.01d0
100
1
2.2234953205868331d-7
0.22321314110840665d0
Until now, I've only used the Confidence Sequence Method (CSM) for Monte Carlo simulation of phenomena that are naturally seen as boolean success/failure processes. We can apply the same CSM to implement an exact test for null hypothesis testing, with a bit of resampling magic.
Looking back at the balls and bins grid, the average fill rate seems to be slightly worse for 100 bins @ 60K ball/bin than for 1000 bins @ 128K ball/bin. How can we test that with the CSM?
First, we should get a fresh dataset for the two setups we wish to compare.
CL-USER> (defparameter *100-60k* #(0.988110167
                                   0.990352500
                                   0.989940667
                                   0.991670667
                                   [...]))
CL-USER> (defparameter *1000-128k* #(0.991456281
                                     0.991559578
                                     0.990970109
                                     0.990425805
                                     [...]))
CL-USER> (alexandria:mean *100-60k*)
0.9897938
CL-USER> (alexandria:mean *1000-128k*)
0.9909645
CL-USER> (- * **)
0.0011706948
The mean for 1000 bins @ 128K ball/bin is slightly higher than that for 100 bins @ 60K ball/bin. We will now simulate the null hypothesis (in our case, that the distributions for the two setups are identical), and determine how rarely we observe a difference of 0.00117 in means. I only use a null hypothesis where the distributions are identical for simplicity; we could use the same resampling procedure to simulate distributions that, e.g., have identical shapes, but one is shifted right of the other.
In order to simulate our null hypothesis, we want to be as close to the test we performed as possible, with the only difference being that we generate data by reshuffling from our observations.
CL-USER> (defparameter *resampling-data* (concatenate 'simple-vector *100-60k* *1000-128k*))
*RESAMPLING-DATA*
CL-USER> (length *100-60k*)
10000
CL-USER> (length *1000-128k*)
10000
The two observation vectors have the same size, 10000 values; in general, that's not always the case, and we must make sure to replicate the sample sizes in the simulation. We'll generate our simulated observations by shuffling the *resampling-data* vector, and splitting it in two subvectors of ten thousand elements.
CL-USER> (let* ((shuffled (alexandria:shuffle *resampling-data*))
                (60k (subseq shuffled 0 10000))
                (128k (subseq shuffled 10000)))
           (- (alexandria:mean 128k) (alexandria:mean 60k)))
6.2584877e-6
We’ll convert that to a truth value by comparing the difference of simulated means with the difference we observed in our real data, \(0.00117\ldots\), and declare success when the simulated difference is at least as large as the actual one. This approach gives us a onesided test; a twosided test would compare the absolute values of the differences.
CL-USER> (csm:csm-driver
          (lambda (_)
            (declare (ignore _))
            (let* ((shuffled (alexandria:shuffle *resampling-data*))
                   (60k (subseq shuffled 0 10000))
                   (128k (subseq shuffled 10000)))
              (>= (- (alexandria:mean 128k) (alexandria:mean 60k))
                  0.0011706948)))
          0.005 1d-9 :alpha-hi 0.01 :stream *standard-output*)
1 0.000e+0 7.761e-11 10.000e-1 2.967e-1
10 0.000e+0 0.000e+0 8.709e-1 9.977e-1
20 0.000e+0 0.000e+0 6.577e-1 1.235e+0
30 0.000e+0 0.000e+0 5.163e-1 1.360e+0
40 0.000e+0 0.000e+0 4.226e-1 1.438e+0
50 0.000e+0 0.000e+0 3.569e-1 1.489e+0
60 0.000e+0 0.000e+0 3.086e-1 1.523e+0
70 0.000e+0 0.000e+0 2.718e-1 1.546e+0
80 0.000e+0 0.000e+0 2.427e-1 1.559e+0
90 0.000e+0 0.000e+0 2.192e-1 1.566e+0
100 0.000e+0 0.000e+0 1.998e-1 1.568e+0
200 0.000e+0 0.000e+0 1.060e-1 1.430e+0
300 0.000e+0 0.000e+0 7.207e-2 1.169e+0
400 0.000e+0 0.000e+0 5.460e-2 8.572e-1
500 0.000e+0 0.000e+0 4.395e-2 5.174e-1
600 0.000e+0 0.000e+0 3.677e-2 1.600e-1
700 0.000e+0 0.000e+0 3.161e-2 -2.096e-1
800 0.000e+0 0.000e+0 2.772e-2 -5.882e-1
900 0.000e+0 0.000e+0 2.468e-2 -9.736e-1
1000 0.000e+0 0.000e+0 2.224e-2 -1.364e+0
2000 0.000e+0 0.000e+0 1.119e-2 -5.428e+0
NIL
T
0.0d0
2967
0
0.0d0
0.007557510165262294d0
We tried to replicate the difference 2967 times, and did not succeed even once. The CSM stopped us there, and we find a CI for the probability of observing our difference, under the null hypothesis, of [0, 0.007557] (i.e., \(p < 0.01\)). Or, for a graphical summary, see the embedded report above. We can also test for a lower \(p\)-value by changing the thresholds and running the simulation more times (around thirty thousand iterations for \(p < 0.001\)).
This experiment lets us conclude that the difference in mean fill rate between 100 bins @ 60K ball/bin and 1000 bins @ 128K ball/bin is probably not due to chance: it's unlikely that we would observe such an extreme difference between data sampled from the same distribution. In other words, "I'm confident that the fill rate for 1000 bins @ 128K ball/bin is greater than for 100 bins @ 60K ball/bin, because it would be highly unlikely to observe a difference in means that extreme if they had the same distribution (\(p < 0.01\))".
In general, we can use this exact test when we have two sets of observations, \(X\sb{0}\) and \(Y\sb{0}\), and a statistic \(f\sb{0} = f(X\sb{0}, Y\sb{0})\), where \(f\) is a pure function (the extension to three or more sets of observations is straightforward).
The test lets us determine the likelihood of observing \(f(X, Y) \geq f\sb{0}\) (we could also test for \(f(X, Y) \leq f\sb{0}\)), if \(X\) and \(Y\) were taken from similar distributions, modulo simple transformations (e.g., \(X\)’s mean is shifted compared to \(Y\)’s, or the latter’s variance is double the former’s).
We answer that question by repeatedly sampling without replacement from \(X\sb{0} \cup Y\sb{0}\) to generate \(X\sb{i}\) and \(Y\sb{i}\), such that \(|X\sb{i}| = |X\sb{0}|\) and \(|Y\sb{i}| = |Y\sb{0}|\) (e.g., by shuffling a vector and splitting it in two). We can apply any simple transformation here (e.g., increment every value in \(Y\sb{i}\) by \(\Delta\) to shift its mean by \(\Delta\)). Finally, we check if \(f(X\sb{i}, Y\sb{i}) \geq f\sb{0} = f(X\sb{0}, Y\sb{0})\); if so, we return success for this iteration, otherwise failure.
The loop above is a Bernoulli process that generates independent, identically distributed (assuming the random sampling is correct) truth values, and its success rate is equal to the probability of observing a value for \(f\) “as extreme” as \(f\sb{0}\) under the null hypothesis. We use the CSM with false positive rate \(\varepsilon\) to know when to stop generating more values and compute a credible interval for the probability under the null hypothesis. If that probability is low (less than some predetermined threshold, like \(\alpha = 0.001\)), we infer that the null hypothesis does not hold, and declare that the difference in our sample data points at a real difference in distributions. If we do everything correctly (cough), we will have implemented an Atlantic City procedure that fails with probability \(\alpha + \varepsilon\).
Personally, I often just set the threshold and the false positive rate unreasonably low and handwave some Bayes.
I pushed the code above, and much more, to github, in Common Lisp, C, and Python (probably Py3, although 2.7 might work). Hopefully anyone can run with the code and use it to test not only SLO-type properties, but also answer more general questions with an exact test. I'd love to have ideas or contributions on the usability front.
I have some throwaway code in attic/, which I used to generate the SVG in this post, but it's not great. I also feel like I can do something to make it easier to stick the logic in shell scripts and continuous testing pipelines.
When I passed around a first draft of this post, many readers who could have used the CSM got stuck on the process of moving from mathematical expressions to computer code; not just how to do it, but, more fundamentally, why we can't just transliterate Greek to C or CL. I hope this revised post is clearer. Also, I hope it's clear that the reason I care so much about not introducing false positives via rounding isn't that I believe they're likely to make a difference, but simply that I want peace of mind with respect to numerical issues; I really don't want to be debugging some issue in my tests and have to wonder if it's all just caused by numerical errors.
The reason I care so much about making sure users can understand what the CSM code does (and why it does what it does) is that I strongly believe we should minimise dependencies whose inner workings we're unable to (legally) explore. Every abstraction leaks, and leakage is particularly frequent in failure situations. We may not need to understand magic if everything works fine, but everything breaks eventually, and that's when expertise is most useful. When shit's on fire, we must be able to break the abstraction and understand how the magic works, and how it fails.
This post only tests ideal SLO-type properties (and regular null hypothesis tests translated to SLO properties): properties of the form "I claim that this indicator satisfies $PREDICATE x% of the time, with false positive rate y%," where the indicator's values are independent and identically distributed.
The last assumption is rarely truly satisfied in practice. I've seen an interesting choice, where the service level objective is defined in terms of a sample of production requests, which can be replayed, shuffled, etc., to ensure i.i.d.-ness. If the nature of the traffic changes abruptly, the SLO may not be representative of behaviour in production; but, then again, how could the service provider have guessed the change was about to happen? I like this approach because it is amenable to predictive statistical analysis, and incentivises communication between service users and providers, rather than users assuming the service will gracefully handle radically new crap being thrown at it.
Even if we have a representative sample of production, it’s not true that the service level indicators for individual requests are distributed identically. There’s an easy fix for the CSM and our credible intervals: generate i.i.d. sets of requests by resampling (e.g., shuffle the requests sample) and count successes and failures for individual requests, but only test for CSM termination after each resampled set.
On a more general note, I see the Binomial and Exact tests as instances of a general pattern that avoids intuitive functional decompositions that create subproblems harder to solve than the original problem. For example, instead of trying to directly determine how frequently the SLI satisfies some threshold, it's natural to first fit a distribution on the SLI, and then compute percentiles on that distribution. Automatically fitting an arbitrary distribution is hard, especially with the weird outliers computer systems spit out. Reducing to a Bernoulli process before applying statistics is much simpler. Similarly, rather than coming up with analytical distributions in the Exact test, we brute-force the problem by resampling from the empirical data. I have more examples from online control systems… I guess the moral is to be wary of decompositions where internal subcomponents generate intermediate values that are richer in information than the final output.
Thank you Jacob, Ruchir, Barkley, and Joonas for all the editing and restructuring comments.
Proportions are unscaled probabilities that don’t have to sum or integrate to 1. Using proportions instead of probabilities tends to make calculations simpler, and we can always get a probability back by rescaling a proportion by the inverse of its integral.↩
Instead of a \(\mathrm{Beta}(a+1, b+1)\), they tend to bound with a \(\mathrm{Beta}(a, b)\). The difference is marginal for double-digit \(n\).↩
I used the bisection method instead of more sophisticated ones with better convergence, like Newton's method or the derivative-free secant method, because bisection already adds one bit of precision per iteration, only needs a predicate that returns "too high" or "too low," and is easily tweaked to be conservative when the predicate declines to return an answer.↩
The question is interesting because stream processing in constant space is a subset of L (or FL), and thus probably not P-complete, let alone Turing complete. Having easily characterisable subsets of stream processing that can be implemented in constant space would be a boon for the usability of stream DSLs.
I think I find this academic trope as suspicious as @DRMacIver does, so I have mixed feelings about the fact that this one still feels true seven years later.
Is it just me or do impossibility theorems which claim "these three obviously desirable properties cannot simultaneously be satisfied" always include at least one obviously undesirable or at least suspicious property?
— David R. MacIver (@DRMacIver) June 19, 2018
The main reason I believe in this conjecture is the following example, F(S(X), X), where S is the function that takes a stream and outputs every other value. Or, more formally, \(F\sb{i} = f(X\sb{2i}, X\sb{i})\).
Let's say X is some stream of values that can't be easily recomputed (e.g., each output value is the result of a slow computation). How do we then compute F(S(X), X) without either recomputing the stream X, or buffering an unbounded amount of past values from that stream? I don't see a way to do so, not just in any stream processing DSL (domain specific language), but also in any general purpose language.
For me, the essence of the problem is that the two inputs to F are out of sync with respect to the same source of values, X: one consumes two values of X per invocation of F, and the other only one. This issue could also occur if we forced stream transducers (processing nodes) to output a fixed number of values at each invocation: let S repeat each value of X twice, i.e., interleave X with X (\(F\sb{i} = f(X\sb{\lfloor i / 2\rfloor}, X\sb{i})\)).
Forcing each invocation of a transducer to always produce exactly one value is one way to rule out this class of stream processing network. Two other common options are to forbid either forks (everything is single-use, or subtrees are copied and recomputed for each reuse) or joins (only single-input stream processing nodes).
I don't think this turtle-and-hare desynchronisation problem is a weakness in stream DSLs; I only see a reasonable task that can't be performed in constant space. Given the existence of such tasks, I'd like to see stream processing DSLs be explicit about the tradeoffs they make to balance performance guarantees, expressiveness, and usability, especially when it comes to the performance model.
In the words of a friend and former colleague:
Two years of my life in one repository….
— John Wittrock (@johnwittrock) December 19, 2017
Congrats @pkhuong @arexus and all! https://t.co/jPFnYrc5V4
If you don't want to read more about what's in ACF and why I feel it's important to open source imperfect repositories, jump to the section on fast itoa.
ACF contains the base data structure and runtime library code we use to build production services, in C that targets Linux/x86-64. Some of it is correctly packaged; most of it just has the raw files from our internal repository. Ironically, after settling on the project's name, we decided not to publish the most "frameworky" bits of code: it's unclear why anyone else would want to use it. The data structures are in C, and tend to be read-optimised, with perhaps some support for nonblocking single-writer/multi-reader concurrency. There are also nonblocking algorithms to support the data structures, and basic HTTP server code that we find useful to run CPU-intensive or mixed CPU/network-intensive services.
Publishing this internal code took a long time because we were trying to open a project that didn't exist yet, despite being composed of code that we use every day. AppNexus doesn't sell code or binaries. Like many other companies, AppNexus sells services backed by in-house code. Our code base is full of informal libraries (I would be unable to make sense of the code base if it wasn't organised that way), but enforcing a clean separation between pseudo-libraries can be a lot of extra work for questionable value.
These fuzzy demarcations are made worse by the way we imported some ideas directly from Operating Systems literature, in order to support efficient concurrent operations. That had a snowball effect: everything, even basic data structures, ends up indirectly depending on runtime system/framework code specialised for our use case. The usual initial offenders are the safe memory reclamation module, and the tracking memory allocator (with a bump pointer mode); both go deep in internals that probably don’t make sense outside AppNexus.
Back in 2015, we looked at our support code (i.e., code that doesn't directly run the business) and decided we should share it. We were, and still are, sure that other people face similar challenges, and exchanging ideas, if not directly trading code, can only be good for us and for programming in general. We tried to untangle the "Common" (great name) support library from the rest of the code base, and to decouple it from the more opinionated parts of our code, while keeping integration around (we need it), but purely opt-in.
That was hard. Aiming for a separate shared object and a real Debian package made it even harder than it had to be. The strong separation between packaged ACF code and the rest of the repo added a lot of friction, and the majority of the support code remained in-tree.
Maybe we made a mistake when we tried to librarify our internals. We want a library of reusable code; that doesn’t have to mean a literal shared object. I’m reminded of the two definitions of portable code: code sprinkled with platform conditionals, or code that can be made to run on a new machine with minimal effort. Most of the time, I’d rather have the latter. Especially when code mostly runs on a single platform, or is integrated in few programs, I try to reduce overhead for the common case, while making reuse possible and easy enough that others can benefit.
And that’s how we got the ACF effort out of the door: we accepted that the result would not be as polished as our favourite open source libraries, and that most of the code wouldn’t even be packaged or disentangled from internals. That’s far from an ideal state, but it’s closer to our goals than keeping the project private and on the backburner. We got it out by “feature” boxing the amount of work (paring it down to figuring out what would never be useful to others, and tracking down licenses and provenance) before pushing the partial result out to a public repository. Unsurprisingly, once that was done, we completed more tasks on ACF in a few days than we had in the previous year.
Now that ACF is out, we still have to figure out the best way to help others co-opt our code, to synchronise the public repository with our internal repository, and, in my dreams, to accept patches for the public repo and have them also work for the internal one. In the end, what’s important is that the code is out there with a clear license, and that someone with similar problems can easily borrow our ideas, if not our code.
The source isn’t always pretty, and is definitely not as well packaged and easily reusable as we’d like it to be, but it has proved itself in production (years of use on thousands of cores), and answers our real needs. The code also tries to expose correct and efficient enough functionality in ways that make correct usage easy and, ideally, misuse hard. Since we were addressing specific concrete challenges, we were able to tweak contracts and interfaces a bit, even for standard functionality like memory allocation.
The last two things are what I’m really looking for when exploring other people’s support code: how did usage and development experience drive interface design, and what kind of non-standard tradeoffs allowed them to find new low-hanging fruit?
If anyone else is in the same situation, please give yourself the permission to open source something that’s not yet fully packaged. As frustrating as that can be, it has to be better than keeping it closed. I’d rather see real, flawed but production-tested, code from which I can take inspiration than nothing at all.
The integer to string conversion file (an_itoa) is one instance of code that relaxes the usual [u]itoa contract because it was written for a specific problem (which also gave us real data to optimise for). The relaxation stems from the fact that callers should reserve up to 10 chars to convert 32-bit (unsigned) integers, and 20 chars for 64-bit ones: we let the routines write garbage (0/NUL bytes) after the converted string, as long as it’s in bounds. This allowance, coupled with a smidge of thinking, let us combine a few cute ideas to solve the depressingly common problem of needing to print integers quickly.
Switching to an_itoa might be a quick win for someone else, so I cleaned it up and packaged it immediately after making the repository public.
We wrote an_itoa in July 2014. Back then, we had an application with a moderate deployment (a couple racks on three continents) that was approaching capacity. While more machines were in the pipeline, a quick perf run showed it was spending a lot of time converting strings to integers and back. We already had a fast-ish string to integer function. Converting machine integers back to strings, however, is a bit more work, and took up around 20% of total CPU time.
Of course, the real solution here is to not have this problem. We shouldn’t have been using a human-readable format like JSON in the first place. We had realised long ago that the format would be a problem, and were actually in the middle of a transition to protobuf, after a first temporary fix (replacing a piece of theoretically reconfigurable JavaScript that was almost never reconfigured with hardcoded C that performed the same JSON manipulation). But, there we were, in the middle of this slow transition involving terabytes of valuable persistent data, and we needed another speed boost until protobuf was ready to go.
When you’re stuck with C code that was manually converted, line by line, from JavaScript, you don’t want to try to make high-level changes to the code. The only reasonable quick win was to make the conversion from integer to string faster.
Human-readable formats wasting CPU cycles to print integers is a common problem, and we quickly found a few promising approaches and libraries. Our baseline was the radix-10 code in stringencoders. This post about Lwan suggested using radix-10, but generating strings backward instead of reversing like the stringencoders library. Facebook apparently hit a similar problem in 2013, which led to this solution by Andrei Alexandrescu. The Facebook code combines two key ideas: radix-100 encoding, and finding the length of the string with a galloping search in order to write the result backward, directly where it should go.
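To make the discussion concrete, here’s a rough Python model of those two ideas; it is a sketch, not Facebook’s actual code: a branchy search for the digit count, then backward radix-100 generation through a 100-entry pair table.

```python
# A rough model of fb_itoa's two ideas: find the digit count up front,
# then emit two digits at a time, backward, at the right offset.

PAIRS = [f"{i:02d}" for i in range(100)]  # the 200-byte radix-100 table

def digit_count(v):
    # Stand-in for the galloping search: a chain of comparisons
    # against powers of ten, one (hard to predict) branch per digit.
    n, p = 1, 10
    while v >= p:
        n, p = n + 1, p * 10
    return n

def fb_style_itoa(v):
    n = digit_count(v)
    out = [""] * ((n + 1) // 2)
    i = len(out) - 1
    while v >= 100:
        v, pair = divmod(v, 100)
        out[i] = PAIRS[pair]
        i -= 1
    out[i] = PAIRS[v] if n % 2 == 0 else str(v)  # leading 1- or 2-digit limb
    return "".join(out)

print(fb_style_itoa(12345678))  # 12345678
```

Knowing the digit count up front is what lets the C version write each pair directly into its final position in the output buffer.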
Radix-100 made sense, although I wasn’t a fan of the 200-byte lookup table. I was also dubious of the galloping search; it’s a lot of branches, and not necessarily easy to predict. The kind of memmove we need to fix up after conversion is small and easy to specialise on x86, so we might not need to predict the number of digits at all.
I then looked at the microbenchmarks for Andrei’s code, and they made it look like the code was either tested on integers with a fixed number of digits (e.g., only 4-digit integers), or randomly picked with uniform probability over a large range.
If the number of digits is fixed, the branchiness of the galloping search isn’t an issue. When sampling uniformly… it’s also not an issue, because most integers are large! If I pick an integer at random in [0, 1e6), 90% of the integers have 6 digits, 99% have 5 or 6, etc.
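That claim is quick to check: counting how many values in [0, 1e6) fall in each digit class gives exactly those proportions.

```python
# Digit-count distribution for uniform integers in [0, 1e6):
# values in [1e5, 1e6) have 6 digits, values in [1e4, 1e5) have 5, etc.
N = 10**6
six = N - 10**5          # values with exactly 6 digits
five_or_six = N - 10**4  # values with 5 or 6 digits
print(six / N)           # 0.9
print(five_or_six / N)   # 0.99
```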
Sometimes, uniform selection is representative of the real workload (e.g., random uids or sequential object ids). Often, not so much. In general, small numbers are more common; for example, small counts can be expected to roughly follow a Poisson distribution.
I was also worried about the data cache footprint of the larger lookup table for radix-100 encoding, but then realised we were converting integers in tight loops, so the lookup table should usually be hot. That also meant we could afford a lot of instruction bytes; a multi-KB itoa function wouldn’t be acceptable, but a couple hundred bytes was fine.
Given these known solutions, John and I started doodling for a bit. Clearly, the radix100 encoding was a good idea. We now had to know if we could do better.
Our first attempt was to find the number of decimal digits more quickly than with the galloping search. It turns out that approximating \(\log_{10}\) is hard, and we gave up ;)
We then realised we didn’t need to know the number of decimal digits. If we generated the string in registers, we could find the length after the fact, slide bytes with bitwise shifts, and directly write to memory.
I was still worried about the lookup table: the random accesses in the 200-byte table for radix-100 encoding could hurt when converting short arrays of small integers. I was more comfortable with some form of arithmetic that would trade best-case speed for consistent, if slightly suboptimal, performance. As it turns out, it’s easy to convert values between 0 and 100 to unpacked BCD with a reciprocal multiplication by \( 1/10 \) and some in-register bit twiddling. Once we have a string of BCD bytes buffered in a general purpose register, we can vertically add '0' to every byte in the register to convert to ASCII characters. We can even do the whole conversion on a pair of such values at once, with SIMD within a register.
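Here’s a Python model of the scalar version of that trick (the real code works on pairs of values packed in a 64-bit register): a reciprocal multiplication replaces the division by 10, and a single addition converts both BCD bytes to ASCII at once.

```python
def two_digits(v):
    # v in [0, 100). (v * 103) >> 10 computes v // 10 by reciprocal
    # multiplication, valid for this range; no division needed.
    tens = (v * 103) >> 10
    ones = v - 10 * tens
    # Unpacked BCD, one digit per byte; a single "vertical" add of
    # '0' (0x30) to both bytes converts BCD to ASCII.
    bcd = (tens << 8) | ones
    return (bcd + 0x3030).to_bytes(2, "big")

print(two_digits(42))  # b'42'
```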
The radix-100 approach is nice because it chops up the input two digits at a time; the makespan for a given integer is roughly halved, since modern CPUs have plenty of execution units to absorb the slightly wider loop body.
The dependency graph for radix-10 encoding of 12345678 is a chain of 7 serial steps. Going for radix-100 halves the number of steps, to 4. The steps are still serial, except for the conversion of integers in [0, 100) to strings.
Could we expose even more ILP than the radix-100 loop?
The trick is to divide and conquer: divide by 10000 (1e4) before splitting each group of four digits with a radix-100 conversion.
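In Python terms, the split might look like the following (divmod stands in for the reciprocal-multiplication tricks the real code would use):

```python
# One division by 1e4 exposes two independent 4-digit halves; each is
# then split by 100. The two inner divmods have no dependency on each
# other, so a superscalar core can execute them in parallel.
def split_8_digits(v):
    hi, lo = divmod(v, 10000)                  # serial step
    return divmod(hi, 100) + divmod(lo, 100)   # two independent steps

print(split_8_digits(12345678))  # (12, 34, 56, 78)
```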
Recursive encoding gives us fewer steps, and 2 of the 3 steps can execute in parallel. However, that might not always be worth the trouble for small integers, and we know that small numbers are common. Even if we have a good divide-and-conquer approach for larger integers, we must also implement a fast path for small integers.
The fast path for small integers (or the most significant limb of larger integers) converts a 2- or 4-digit integer to unpacked BCD, bit-scans for the number of leading zeros, converts the BCD to ASCII by adding '0' (0x30) to each byte, and shifts out any leading zeros; we assume that trailing noise is acceptable, and it’s all NUL bytes anyway.
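A Python sketch of that fast path, with byte-string stripping standing in for the bit-scan and shift:

```python
def small_to_str(v):
    # v in [0, 10000): build unpacked BCD in a 32-bit "register",
    # one digit per byte, most significant digit in the top byte.
    hi, lo = divmod(v, 100)
    bcd = ((hi // 10) << 24) | ((hi % 10) << 16) | ((lo // 10) << 8) | (lo % 10)
    ascii4 = (bcd + 0x30303030).to_bytes(4, "big")  # vertical add of '0'
    # The real code bit-scans for leading zero digits and shifts them
    # out; lstrip models that, keeping at least one digit for v == 0.
    return ascii4.lstrip(b"0") or b"0"

print(small_to_str(42))  # b'42'
```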
For 32-bit integers, an_itoa (really an_uitoa) looks like:

    if number < 100:
        execute specialised 2-digit function
    if number < 10000:
        execute specialised 4-digit function
    partition number into the first 4 digits, the next 4 digits, and the remainder
    convert the first 2 groups of 4 digits to strings
    if number < 1e8:  # the remainder is 0!
        shift out leading zeros, print string
    else:
        print remainder  # less than 100, since 2^32 < 1e10
        print strings for the first 2 groups of 4 digits
The 64-bit version, an_ltoa (really an_ultoa), is more of the same, with differences when the input number exceeds 1e8.
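Here’s a runnable Python mirror of that branch structure; string formatting stands in for the in-register SWAR conversion, but the dispatch and the remainder logic follow the pseudocode:

```python
def an_uitoa_model(v):
    # Dispatch mirrors the pseudocode above; f-strings stand in for
    # the in-register BCD/ASCII manipulation.
    if v < 100:
        return str(v)                    # specialised 2-digit path
    if v < 10000:
        return str(v)                    # specialised 4-digit path
    rem, low8 = divmod(v, 10**8)         # remainder and low 8 digits
    hi4, lo4 = divmod(low8, 10**4)
    body = f"{hi4:04d}{lo4:04d}"         # two groups of 4 digits
    if rem == 0:
        return body.lstrip("0")          # shift out leading zeros
    return str(rem) + body               # rem < 100 since 2^32 < 1e10

print(an_uitoa_model(4294967295))  # 4294967295
```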
I’ve already concluded that cache footprint was mostly not an issue, but we should still make sure we didn’t generate anything too big.
- an_itoa: 400 bytes
- an_ltoa: 880 bytes
- fb_itoa: 426 bytes + 200-byte LUT
- fb_constant_itoa (without the galloping search): 172 bytes + 200-byte LUT
- lwan_itoa (radix-10, backward generation): 60 bytes
- modp_uitoa10: 91 bytes

The galloping search in Facebook’s converter takes a lot of space (there’s a ton of conditional branches, and large numbers must be encoded somewhere). Even if we disregard the lookup table, an_itoa is smaller than fb_itoa, and an_ltoa (which adds code for > 32-bit integers) is only 254 bytes larger than fb_itoa (+ LUT). Now, Facebook’s galloping search attempts to make small integers go faster by checking for them first; if we convert small numbers, we don’t expect to use all ~250 bytes in the galloping search. However, an_itoa and an_ltoa are similar: the code is set up such that larger numbers jump forward over specialised subroutines for small integers. Small integers thus fall through to only execute code at the beginning of the functions. 400 or 880 bytes are sizable footprints compared to the 60 or 90 bytes of the radix-10 functions, but acceptable when called in tight loops.
Now that we feel like the code and lookup table sizes are reasonable (something that microbenchmarks rarely highlight), we can look at speed.
I first ran the conversion with random integers in each digit count class, from 1 digit (i.e., numbers in [0, 10)) to 19 (numbers in [1e18, 1e19)). The instruction cache was hot, but the routines were not warmed up on that size class of numbers (more realistic that way).
The results are cycle counts (with the minimum overhead for a no-op conversion subtracted from the raw count), on an unloaded 2.4 GHz Xeon E5-2630L, a machine that’s similar to our older production hardware.
We have data for:

- an_itoa, our 32-bit conversion routine;
- an_ltoa, our 64-bit conversion routine;
- fb_constant_itoa, Facebook’s code, with the galloping search stubbed out;
- fb_itoa, Facebook’s radix-100 code;
- itoa, GNU libc conversion (via sprintf);
- lw_itoa, Lwan’s backward radix-10 converter;
- modp, stringencoders’ radix-10 / strreverse converter.

I included fb_constant_itoa to serve as a lower bound on the radix-100 approach: the conversion loop stops as soon as it hits 0 (same as fb_itoa), but the data is written at a fixed offset, like lw_itoa does. In both fb_constant_itoa’s and lw_itoa’s cases, we’d need another copy to slide the populated part of the output buffer over the unused padding (that’s why fb_itoa has a galloping search).
When I chose these functions back in 2014, they were all I could find that was reasonable. Since then, I’ve seen one other divide-and-conquer implementation (although it uses a lookup table instead of arithmetic to convert radix-100 limbs to characters), and an SSE2 implementation that only pays off for larger integers (32 bits or more).
Some functions only go up to UINT32_MAX, in which case we have no data after 9 digits. The raw data is here; I used this R script to generate the plot.
The solid line is the average time per conversion (in cycles), over 10K data points, while the shaded region covers the 10th percentile to the 90th percentile.
(GNU) libc’s conversion is just way out there. The straightforward modp (stringencoders) code overlaps with Facebook’s itoa; it’s slightly slower, but so much smaller.
We then have two incomplete string encoders: neither fb_constant_itoa nor lw_itoa generates its output where it should go. They fill a buffer from the end, and something else (not benchmarked) is responsible for copying the valid bytes where they belong. If an incomplete implementation suffices, Lwan’s radix-10 approach is already competitive with, arguably faster than, the Facebook code. The same backward loop, but in radix-100, is definitely faster than Facebook’s full galloping search/radix-100 converter.
Finally, we have an_itoa and an_ltoa, which are neck and neck with one another, faster than both modp and fb_itoa on small and large integers, and even comparable with or faster than the incomplete converters. Their runtime is also more reliable (less variance) than modp’s and fb_itoa’s: modp pays for the second variable-length loop in strreverse, and fb_itoa for the galloping search. There are more code paths in an_itoa and an_ltoa, but no loop, so the number of (unpredictable) conditional branches is lower.
What have we learned from this experiment?

- Generic conversion is slow: everything beats going through sprintf. That makes sense, since that code is so generic. However, in practice, we only convert to decimal, some hex, even less octal, and the rest is noise. Maybe we can afford to special-case these bases.
- The second reversal pass in modp_uitoa10 hurts. It does make sense to avoid that by generating backward, ideally in the right spot from the start.
- Radix-100 beats radix-10 (fb_constant_itoa is faster than lwan_itoa).
- The specialised fast paths pay off (an_itoa and an_ltoa are faster for small values).
- Divide and conquer pays off at the high end (an_ltoa is flatter for large integers).

With results that made sense for an easily understood microbenchmark, I decided to try a bunch of distributions. Again, the code was hot, the predictors lukewarm, and we gathered 10K cycle counts per distribution/function. The raw data is here, and I used this R script to generate the plot.
The independent variables are all categorical here, so I use one facet per distribution, and, in each facet, a boxplot per conversion function, as well as a jittered scatter plot to show the distribution of cycle counts.
Clearly, we can disregard glibc’s sprintf (itoa).
The first facet generated integers by choosing uniformly between \(100, 1000, 10^{4}, \ldots, 10^{8}\). That’s a semi-realistic variation on the earlier dataset, which generated a bunch of numbers in each size class, and serves as an easily understood worst case for branch prediction. Both an_itoa and an_ltoa are faster than the other implementations, and the branchier implementations (fb_itoa and modp) show their variance. Facebook’s fb_itoa isn’t even faster than modp’s radix-10/strreverse encoder. The galloping search really hurts: fb_constant_itoa, without that component, is slightly faster than the radix-10 lw_itoa.
The second facet is an even harder case for branch predictors: random values skewed with an exponential (pow(2, 64.0 * random() / RAND_MAX)), to simulate real-world counts. Both an_itoa and an_ltoa are faster than the other implementations, although an_ltoa less so: an_itoa only handles 32-bit integers, so it deals with less entropy. Between the 32-bit implementations, an_itoa is markedly faster and more consistent than lw_itoa (which is incomplete) and modp. Full 64-bit converters generally exhibit more variance in runtime (their input is more randomised), but an_ltoa is still visibly faster than fb_itoa, and even than the incomplete fb_constant_itoa. We also notice that fb_itoa’s runtimes are more spread out than fb_constant_itoa’s: the galloping search adds overhead in time, but also a lot of variance. That makes me think that the Facebook code is more sensitive than the others to differences in data distribution between microbenchmarks and production.
The third facet should be representative of printing internal sequential object ids: uniform integers in [0, 256K). As expected, every approach is tighter than with the skewed “counts” distribution (most integers are large). The an_itoa/an_ltoa options are faster than the rest, and it’s far from clear that fb_itoa is preferable even to modp. The range was also chosen because it’s somewhat of a worst case for an_itoa: the code does extra work for values between \(10^{4}\) and \(10^{8}\), to have more to do before the conditional branch on x < 1e8. That never pays off in the range tested here. However, even with this weakness, an_itoa still seems preferable to fb_itoa, and even to the simpler modp_uitoa10.
The fourth facet (first of the second row) shows what happens when we choose random integers in [0, 20). That test case is interesting because it’s small, thus semi-representative of some of our counts, and because it needs 1 or 2 digits with equal probability. Everything does pretty well, and runtime distributions are overall tight; branch predictors can do a decent job when there are only two options. I’m not sure why there’s such a difference between an_itoa’s and an_ltoa’s distributions. Although the code for any value less than 100 is identical at the C level, there are small differences in code generation… but I can’t pinpoint where the difference might come from.
The fifth facet, for random integers in [100, 200), is similar, with a bit more variance.
The sixth facet generates Unix timestamps around a date in 2014, with uniform selection plus or minus one million seconds. It’s meant to be representative of printing timestamps. Again, an_itoa and an_ltoa are faster than the rest, with an_itoa being slightly faster and more consistent. Radix-100 (fb_constant_itoa) is faster and more consistent than radix-10 (lw_itoa), but it’s not clear whether fb_itoa is preferable to modp. The variance for modp is larger than for the other implementations, even fb_itoa: that’s the cost of a radix-10 loop and of the additional strreverse.
This set of results shows that conditional branches are an issue when converting integers to strings, and that the impact of branches depends strongly on the distribution. The Facebook approach, with a galloping search for the number of digits, seems particularly sensitive to the distribution. Running something like fb_itoa because it does well in a microbenchmark is thus only a good idea if we know that the microbenchmark is representative of production.
Bigger numbers take more time to convert, but the divide-and-conquer approach of an_itoa and an_ltoa is consistently faster at the high end, while their unrolled SIMD-within-a-register fast path does well for small numbers.
s[n]printf
The correct solution to the “integer printing is too slow” problem is simple: don’t do that. After all, remember the first rule of high-performance string processing: “DON’T.” When there’s no special requirement, I find Protobuf does very well as a better JSON.
However, once you find yourself in this bad spot, it’s trivial to do better than generic libc conversion code. That makes it a dangerously fun problem, in a way… especially given that the data distribution can matter so much. No benchmark is perfect, and various implementations are affected differently by flaws in microbenchmarks. It’s thus essential not to overfit on the benchmark data, probably even more important than improving performance by another 10% or 20% (doing 4-5x better than libc code is already a given). That’s why I prefer integer conversion code with more consistent cycle counts: there’s less room for differences due to the distribution of data.
Finally, if, like 2014-AppNexus, you find yourself converting a lot of integers to strings in tight loops (on x86-64 machines), try an_itoa or an_ltoa! The whole repository is Apache 2.0, and it should be easy to copy and paste all the dependencies to pare it down to two files. If you do snatch our code, note that the functions use their destination array (up to 10 bytes for an_itoa, and 20 for an_ltoa) as scratch space, even for small integers.
Thank you for reviewing drafts, John, Ruchir, Shreyas, and Andrew.
Whenever I mention a data or work distribution problem where I ideally want everything related to a given key to hit the same machine, everyone jumps to consistent hashing. I don’t know how this technique achieved the mindshare it has, although I suspect Amazon’s 2007 Dynamo paper is to blame (by introducing the problem to many of us, and mentioning exactly one decentralised solution)… or maybe some Google interview prep package.
Karger et al.’s paper doesn’t help, since they introduce the generic concept of a consistent hash function and call their specific solution… “consistent hashing.” I’m not sure where I first encountered rendezvous hashing, but I vaguely remember a technical report by Karger, so it’s probably not some MIT vs UMich thing.
Regardless of the reason for consistent hashing’s popularity, I feel the go-to technique should instead be rendezvous hashing. Its basic form is simple enough to remember without really trying (one of those desert island algorithms), it is more memory efficient than consistent hashing in practice, and its downside (a simple implementation assigns a location in time linear in the number of hosts) is not a problem for small deployments, or even medium (a couple racks) scale ones, if you actually think about failure domains.
Side question: why did rendezvous have to lose its hyphen to cross the Channel?
Basic rendezvous hashing takes a distribution key (e.g., a filename), and a set of destinations (e.g., hostnames). It then uses a hash function to pseudorandomly map each (distribution_key, destination) pair to a value in [0, 1) or [0, 2^64 - 1), and picks the destination that gives the minimal hash value. If it needs k destinations for redundancy, it can pick the destinations that yield the least k hash values. If there are ties (unlikely with a good hash function), it breaks them arbitrarily but consistently, e.g., by imposing a total order on hostnames.
A Python implementation could look like the following.
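A minimal sketch of such an implementation (the choice of blake2b as the hash and the exact key/destination encoding are my assumptions):

```python
import hashlib

def score(key, node):
    # Pseudorandomly map the (key, node) pair to a value in [0, 2^64);
    # blake2b is an arbitrary choice, any well-mixed hash would do.
    h = hashlib.blake2b(f"{key}\x00{node}".encode(), digest_size=8)
    return int.from_bytes(h.digest(), "big")

def rendezvous(key, nodes, k=1):
    # Pick the k nodes with the least hash values; break (unlikely)
    # ties consistently via the total order on node names.
    return sorted(nodes, key=lambda n: (score(key, n), n))[:k]

nodes = ["host-a", "host-b", "host-c", "host-d"]
print(rendezvous("some/file/name", nodes))
```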
We only need to store the list of destinations, and we can convince ourselves that data distribution is pretty good (close to uniform) and that small changes in the set of destinations only affect a small fraction of keys (those going to destinations added/removed), either with pen and paper or with a few simulations. That compares positively with consistent hashing, where a practical implementation has to create a lot (sometimes hundreds) of pseudo-nodes for each real destination in order to mitigate clumping in the hash ring.
The downside is that we must iterate over all the nodes, while consistent hashing is easily \(\mathcal{O}(\log n)\) time, or even \(\mathcal{O}(\log \log n)\), with respect to the number of (pseudo-)nodes. However, that’s only a problem if you have a lot of nodes, and rendezvous hashing, unlike consistent hashing, does not inflate the number of nodes.
Another thing I like about rendezvous hashing is that it naturally handles weights. With consistent hashing, if I want a node to receive ten times as much load as another, I create ten times more pseudo-nodes. As the greatest common divisor of weights shrinks, the number of pseudo-nodes per node grows, which makes distribution a bit slower and, more importantly, increases memory usage (linear in the number of pseudo-nodes). Worse, if you hit the fundamental theorem of arithmetic (as a coworker once snarked in a commit message), you may have to rescale everything, potentially causing massive data movement.
Rendezvous hashing generates pseudorandom scores by hashing, and ranks them to find the right node(s). Intuitively, we want to use weights so that the distribution of pseudorandom scores generated for a node A with twice the weight of another node B has the same shape as that of node B, but is linearly stretched so that the average hash value for A is twice that for B. We also want the distribution to cover \([0, \infty)\); otherwise, a proportion of hashes will always go to the heavier node, regardless of what the lighter node hashes to, and that seems wrong.

The trick, as explained by Jason Resch at Cleversafe, is to map our hashes from uniform in [0, 1) to \([0, \infty)\) not with an exponential, but as -weight / log(h). If you simulate just using an exponential, you can quickly observe that it doesn’t reweigh things correctly: while the mean is correctly scaled, the mass of the probability density function isn’t shifted quite right. Resch’s proof of correctness for this tweaked exponential fits on a single page.
The Python code becomes something like:
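A sketch along those lines (the hash choice is again an assumption; the score is Resch’s construction, -weight / log(h) with h uniform in (0, 1)):

```python
import hashlib
import math

def weighted_score(key, node, weight):
    # Map the hash to h uniform in (0, 1), then -weight / log(h):
    # scores cover (0, infinity) and scale linearly with the weight.
    raw = hashlib.blake2b(f"{key}\x00{node}".encode(), digest_size=8)
    h = (int.from_bytes(raw.digest(), "big") + 0.5) / 2.0**64
    return -weight / math.log(h)

def weighted_rendezvous(key, weighted_nodes):
    # weighted_nodes: iterable of (node, weight) pairs; highest score wins.
    return max(weighted_nodes,
               key=lambda nw: weighted_score(key, nw[0], nw[1]))[0]

print(weighted_rendezvous("some/file/name", [("a", 1.0), ("b", 2.0)]))
```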
There are obvious micro-optimisations here (for example, computing the inverse of the score lets us precompute the reciprocal of each destination’s weight), but that’s all details. The salient part to me is that space and time are still linear in the number of nodes, regardless of the weights; consistent hashing instead needs space pseudo-linear(!) in the weights, and is thus a bit slower than its \(\mathcal{O}(\log n)\) runtime would have us believe.
The linear-time computation for weighted rendezvous hashing is also CPU friendly. The memory accesses are all linear and easily prefetchable (load all metadata from an array of nodes), and the computational kernel is standard, vectorisable floating point arithmetic.
In practice, I’m also not sure I ever really want to distribute between hundreds of machines: what kind of failure/resource allocation domain encompasses that many equivalent nodes? For example, when distributing data, I would likely want a hierarchical consistent distribution scheme, like Ceph’s CRUSH: something that first assigns data to sections of a datacenter, then to racks, and only then to individual machines. I should never blindly distribute data across hundreds of machines; I need to distribute between a handful of sections of the network, then one of a dozen racks, and finally to one of twenty machines. The difference between linear and logarithmic time at each level of this “failure trie” is marginal and is easily compensated by a bit of programming.
The simplicity of basic rendezvous hashing, combined with its minimal space usage and the existence of a weighted extension, makes me believe it’s a better initial/default implementation of consistent hash functions than consistent hashing. Moreover, consistent hashing’s main advantage, sublinear-time distribution, isn’t necessarily compelling when you think about the whole datacenter (or even many datacenters) as a resilient system of failure-prone domains. Maybe rendezvous hashing deserves a rebranding campaign (: