<![CDATA[Paul Khuong: some Lisp]]> 2019-07-09T16:59:49-04:00 https://www.pvk.ca/ Octopress <![CDATA[Fractional Set Covering With Experts]]> 2019-04-23T18:05:09-04:00 https://www.pvk.ca/Blog/2019/04/23/fractional-set-covering-with-experts Last winter break, I played with one of the annual capacitated vehicle routing problem (CVRP) “Santa Claus” contests. Real world family stuff took precedence, so, after the obvious LKH with Concorde polishing for individual tours, I only had enough time for one diversification moonshot. I decided to treat the high level problem of assembling prefabricated routes as a set covering problem: I would solve the linear programming (LP) relaxation for the min-cost set cover, and use randomised rounding to feed new starting points to LKH. Add a lot of luck, and that might just strike the right balance between solution quality and diversity.

Unsurprisingly, luck failed to show up, but I had ulterior motives: I’m much more interested in exploring first order methods for relaxations of combinatorial problems than in solving CVRPs. The routes I had accumulated after a couple days turned into a set covering LP with 1.1M decision variables, 10K constraints, and 20M nonzeros. That’s maybe denser than most combinatorial LPs (the aspect ratio is definitely atypical), but 0.2% non-zeros is in the right ballpark.

As soon as I had that fractional set cover instance, I tried to solve it with a simplex solver. Like any good Googler, I used Glop… and stared at a blank terminal for more than one hour.

Having observed that lack of progress, I implemented the toy I really wanted to try out: first order online “learning with experts” (specifically, AdaHedge) applied to LP optimisation. I let this not-particularly-optimised serial CL code run on my 1.6 GHz laptop for 21 hours, at which point the first order method had found a 4.5% infeasible solution (i.e., all the constraints were satisfied with $$\ldots \geq 0.955$$ instead of $$\ldots \geq 1$$). I left Glop running long after the contest was over, and finally stopped it with no solution after more than 40 days on my 2.9 GHz E5.

Given the shape of the constraint matrix, I would have loved to try an interior point method, but all my licenses had expired, and I didn’t want to risk OOMing my workstation. Erling Andersen was later kind enough to test Mosek’s interior point solver on it. The runtime was much more reasonable: 10 minutes on 1 core, and 4 on 12 cores, with the sublinear speed-up mostly caused by the serial crossover to a simplex basis.

At 21 hours for a naïve implementation, the “learning with experts” first order method isn’t practical yet, but also not obviously uninteresting, so I’ll write it up here.

Using online learning algorithms for the “experts problem” (e.g., Freund and Schapire’s Hedge algorithm) to solve linear programming feasibility is now a classic result; Jeremy Kun has a good explanation on his blog. What’s new here is:

1. Directly solving the optimisation problem.
2. Confirming that the parameter-free nature of AdaHedge helps.

The first item is particularly important to me because it’s a simple modification to the LP feasibility meta-algorithm, and might make the difference between a tool that’s only suitable for theoretical analysis and a practical approach.

I’ll start by reviewing the experts problem, and how LP feasibility is usually reduced to the former problem. After that, I’ll cast the reduction as a surrogate relaxation method, rather than a Lagrangian relaxation; optimisation should flow naturally from that point of view. Finally, I’ll guess why I had more success with AdaHedge this time than with Multiplicative Weight Update eight years ago.1

## The experts problem and LP feasibility

I first heard about the experts problem while researching dynamic sorted set data structures: Igal Galperin’s PhD dissertation describes scapegoat trees, but is really about online learning with experts. Arora, Hazan, and Kale’s 2012 survey of multiplicative weight update methods. is probably a better introduction to the topic ;)

The experts problem comes in many variations. The simplest form sounds like the following. Assume you’re playing a binary prediction game over a predetermined number of turns, and have access to a fixed finite set of experts at each turn. At the beginning of every turn, each expert offers their binary prediction (e.g., yes it will rain today, or it will not rain today). You then have to make a prediction yourself, with no additional input. The actual result (e.g., it didn’t rain today) is revealed at the end of the turn. In general, you can’t expect to be right more often than the best expert at the end of the game. Is there a strategy that bounds the “regret,” how many more wrong prediction you’ll make compared to the expert(s) with the highest number of correct predictions, and in what circumstances?

Amazingly enough, even with an omniscient adversary that has access to your strategy and determines both the experts’ predictions and the actual result at the end of each turn, a stream of random bits (hidden from the adversary) suffice to bound our expected regret in $$\mathcal{O}(\sqrt{T}\,\lg n)$$, where $$T$$ is the number of turns and $$n$$ the number of experts.

I long had trouble with that claim: it just seems too good of a magic trick to be true. The key realisation for me was that we’re only comparing against invidivual experts. If each expert is a move in a matrix game, that’s the same as claiming you’ll never do much worse than any pure strategy. One example of a pure strategy is always playing rock in Rock-Paper-Scissors; pure strategies are really bad! The trick is actually in making that regret bound useful.

We need a more continuous version of the experts problem for LP feasibility. We’re still playing a turn-based game, but, this time, instead of outputting a prediction, we get to “play” a mixture of the experts (with non-negative weights that sum to 1). At the beginning of each turn, we describe what weight we’d like to give to each experts (e.g., 60% rock, 40% paper, 0% scissors). The cost (equivalently, payoff) for each expert is then revealed (e.g., $$\mathrm{rock} = -0.5$$, $$\mathrm{paper} = 0.5$$, $$\mathrm{scissors} = 0$$), and we incur the weighted average from our play (e.g., $$60\% \cdot -0.5 + 40\% \cdot 0.5 = -0.1$$) before playing the next round.2 The goal is to minimise our worst-case regret, the additive difference between the total cost incurred by our mixtures of experts and that of the a posteriori best single expert. In this case as well, online learning algorithms guarantee regret in $$\mathcal{O}(\sqrt{T} \, \lg n)$$

This line of research is interesting because simple algorithms achieve that bound, with explicit constant factors on the order of 1,3 and those bounds are known to be non-asymptotically tight for a large class of algorithms. Like dense linear algebra or fast Fourier transforms, where algorithms are often compared by counting individual floating point operations, online learning has matured into such tight bounds that worst-case regret is routinely presented without Landau notation. Advances improve constant factors in the worst case, or adapt to easier inputs in order to achieve “better than worst case” performance.

The reduction below lets us take any learning algorithm with an additive regret bound, and convert it to an algorithm with a corresponding worst-case iteration complexity bound for $$\varepsilon$$-approximate LP feasibility. An algorithm that promises low worst-case regret in $$\mathcal{O}(\sqrt{T})$$ gives us an algorithm that needs at most $$\mathcal{O}(1/\varepsilon\sp{2})$$ iterations to return a solution that almost satisfies every constraint in the linear program, where each constraint is violated by $$\varepsilon$$ or less (e.g., $$x \leq 1$$ is actually $$x \leq 1 + \varepsilon$$).

We first split the linear program in two components, a simple domain (e.g., the non-negative orthant or the $$[0, 1]\sp{d}$$ box) and the actual linear constraints. We then map each of the latter constraints to an expert, and use an arbitrary algorithm that solves our continuous version of the experts problem as a black box. At each turn, the black box will output a set of non-negative weights for the constraints (experts). We will average the constraints using these weights, and attempt to find a solution in the intersection of our simple domain and the weighted average of the linear constraints. We can do so in the “experts problem” setting by consider each linear constraint’s violation as a payoff, or, equivalently, satisfaction as a loss.

Let’s use Stigler’s Diet Problem with three foods and two constraints as a small example, and further simplify it by disregarding the minimum value for calories, and the maximum value for vitamin A. Our simple domain here is at least the non-negative orthant: we can’t ingest negative food. We’ll make things more interesting by also making sure we don’t eat more than 10 servings of any food per day.

The first constraint says we mustn’t get too many calories

$72 x\sb{\mathrm{corn}} + 121 x\sb{\mathrm{milk}} + 65 x\sb{\mathrm{bread}} \leq 2250,$

and the second constraint (tweaked to improve this example) ensures we ge enough vitamin A

$107 x\sb{\mathrm{corn}} + 400 x\sb{\mathrm{milk}} \geq 5000,$

or, equivalently,

$-107 x\sb{\mathrm{corn}} - 400 x\sb{\mathrm{milk}} \leq -5000,$

Given weights $$[¾, ¼]$$, the weighted average of the two constraints is

$27.25 x\sb{\mathrm{corn}} - 9.25 x\sb{\mathrm{milk}} + 48.75 x\sb{\mathrm{bread}} \leq 437.5,$

where the coefficients for each variable and for the right-hand side were averaged independently.

The subproblem asks us to find a feasible point in the intersection of these two constraints: $27.25 x\sb{\mathrm{corn}} - 9.25 x\sb{\mathrm{milk}} + 48.75 x\sb{\mathrm{bread}} \leq 437.5,$ $0 \leq x\sb{\mathrm{corn}},\, x\sb{\mathrm{milk}},\, x\sb{\mathrm{bread}} \leq 10.$

Classically, we claim that this is just Lagrangian relaxation, and find a solution to

$\min 27.25 x\sb{\mathrm{corn}} - 9.25 x\sb{\mathrm{milk}} + 48.75 x\sb{\mathrm{bread}}$ subject to $0 \leq x\sb{\mathrm{corn}},\, x\sb{\mathrm{milk}},\, x\sb{\mathrm{bread}} \leq 10.$

In the next section, I’ll explain why I think this analogy is wrong and worse than useless. For now, we can easily find the minimum one variable at a time, and find the solution $$x\sb{\mathrm{corn}} = 0$$, $$x\sb{\mathrm{milk}} = 10$$, $$x\sb{\mathrm{bread}} = 0$$, with objective value $$-92.5$$ (which is $$530$$ less than $$437.5$$).

In general, three things can happen at this point. We could discover that the subproblem is infeasible. In that case, the original non-relaxed linear program itself is infeasible: any solution to the original LP satisfies all of its constraints, and thus would also satisfy any weighted average of the same constraints. We could also be extremely lucky and find that our optimal solution to the relaxation is ($$\varepsilon$$-)feasible for the original linear program; we can stop with a solution. More commonly, we have a solution that’s feasible for the relaxation, but not for the original linear program.

Since that solution satisfies the weighted average constraint and payoffs track constraint violation, the black box’s payoff for this turn (and for every other turn) is non-positive. In the current case, the first constraint (on calories) is satisfied by $$1040$$, while the second (on vitamin A) is violated by $$1000$$. On weighted average, the constraints are satisfied by $$\frac{1}{4}(3 \cdot 1040 - 1000) = 530.$$ Equivalently, they’re violated by $$-530$$ on average.

We’ll add that solution to an accumulator vector that will come in handy later.

The next step is the key to the reduction: we’ll derive payoffs (negative costs) for the black box from the solution to the last relaxation. Each constraint (expert) has a payoff equal to its level of violation in the relaxation’s solution. If a constraint is strictly satisfied, the payoff is negative; for example, the constraint on calories is satisfied by $$1040$$, so its payoff this turn is $$-1040$$. The constraint on vitamin A is violated by $$1000$$, so its payoff this turn is $$1000$$. Next turn, we expect the black box to decrease the weight of the constraint on calories, and to increase the weight of the one on vitamin A.

After $$T$$ turns, the total payoff for each constraint is equal to the sum of violations by all solutions in the accumulator. Once we divide both sides by $$T$$, we find that the divided payoff for each constraint is equal to its violation by the average of the solutions in the accumulator. For example, if we have two solutions, one that violates the calories constraint by $$500$$ and another that satisfies it by $$1000$$ (violates it by $$-1000$$), the total payoff for the calories constraint is $$-500$$, and the average of the two solutions does strictly satisfy the linear constraint by $$\frac{500}{2} = 250$$!

We also know that we only generated feasible solutions to the relaxed subproblem (otherwise, we’d have stopped and marked the original LP as infeasible), so the black box’s total payoff is $$0$$ or negative.

Finally, we assumed that the black box algorithm guarantees an additive regret in $$\mathcal{O}(\sqrt{T}\, \lg n)$$, so the black box’s payoff of (at most) $$0$$ means that any constraint’s payoff is at most $$\mathcal{O}(\sqrt{T}\, \lg n)$$. After dividing by $$T$$, we obtain a bound on the violation by the arithmetic mean of all solutions in the accumulator: for all constraint, that violation is in $$\mathcal{O}\left(\frac{\lg n}{\sqrt{T}}\right)$$. In other words, the number of iteration $$T$$ must scale with $$\mathcal{O}\left(\frac{\lg n}{\varepsilon\sp{2}}\right)$$, which isn’t bad when $$n$$ is in the millions but $$\varepsilon \approx 0.01$$.

Theoreticians find this reduction interesting because there are concrete implementations of the black box, e.g., the multiplicative weight update (MWU) method with non-asymptotic bounds. For many problems, this makes it possible to derive the exact number of iterations necessary to find an $$\varepsilon-$$feasible fractional solution, given $$\varepsilon$$ and the instance’s size (but not the instance itself).

That’s why algorithms like MWU are theoretically useful tools for fractional approximations, when we already have subgradient methods that only need $$\mathcal{O}\left(\frac{1}{\varepsilon}\right)$$ iterations: state-of-the-art algorithms for learning with experts explicit non-asymptotic regret bounds that yield, for many problems, iteration bounds that only depend on the instance’s size, but not its data. While the iteration count when solving LP feasibility with MWU scales with $$\frac{1}{\varepsilon\sp{2}}$$, it is merely proportional to $$\lg n$$, the log of the the number of linear constraints. That’s attractive, compared to subgradient methods for which the iteration count scales with $$\frac{1}{\varepsilon}$$, but also scales linearly with respect to instance-dependent values like the distance between the initial dual solution and the optimum, or the Lipschitz constant of the Lagrangian dual function; these values are hard to bound, and are often proportional to the square root of the number of constraints. Given the choice between $$\mathcal{O}\left(\frac{\lg n}{\varepsilon\sp{2}}\right)$$ iterations with explicit constants, and a looser $$\mathcal{O}\left(\frac{\sqrt{n}}{\varepsilon}\right)$$, it’s obvious why MWU and online learning are powerful additions to the theory toolbox.

Theoreticians are otherwise not concerned with efficiency, so the usual answer to someone asking about optimisation is to tell them they can always reduce linear optimisation to feasibility with a binary search on the objective value. I once made the mistake of implementing that binary search last strategy. Unsurprisingly, it wasn’t useful. I also tried another theoretical reduction, where I looked for a pair of primal and dual -feasible solutions that happened to have the same objective value. That also failed, in a more interesting manner: since the two solution had to have almost the same value, the universe spited me by sending back solutions that were primal and dual infeasible in the worst possible way. In the end, the second reduction generated fractional solutions that were neither feasible nor superoptimal, which really isn’t helpful.

## Direct linear optimisation with experts

The reduction above works for any “simple” domain, as long as it’s convex and we can solve the subproblems, i.e., find a point in the intersection of the simple domain and a single linear constraint or determine that the intersection is empty.

The set of (super)optimal points in some initial simple domain is still convex, so we could restrict our search to the search of the domain that is superoptimal for the linear program we wish to optimise, and directly reduce optimisation to the feasibility problem solved in the last section, without binary search.

That sounds silly at first: how can we find solutions that are superoptimal when we don’t even know the optimal value?

Remember that the subproblems are always relaxations of the original linear program. We can port the objective function from the original LP over to the subproblems, and optimise the relaxations. Any solution that’s optimal for a realxation must have an optimal or superoptimal value for the original LP.

Rather than treating the black box online solver as a generator of Lagrangian dual vectors, we’re using its weights as solutions to the surrogate relaxation dual. The latter interpretation isn’t just more powerful by handling objective functions. It also makes more sense: the weights generated by algorithms for the experts problem are probabilities, i.e., they’re non-negative and sum to $$1$$. That’s also what’s expected for surrogate dual vectors, but definitely not the case for Lagrangian dual vectors, even when restricted to $$\leq$$ constraints.

We can do even better!

Unlike Lagrangian dual solvers, which only converge when fed (approximate) subgradients and thus make us (nearly) optimal solutions to the relaxed subproblems, our reduction to the experts problem only needs feasible solutions to the subproblems. That’s all we need to guarantee an $$\varepsilon-$$feasible solution to the initial problem in a bounded number of iterations. We also know exactly how that $$\varepsilon-$$feasible solution is generated: it’s the arithmetic mean of the solutions for relaxed subproblems.

This lets us decouple finding lower bounds from generating feasible solutions that will, on average, $$\varepsilon-$$satisfy the original LP. In practice, the search for an $$\varepsilon-$$feasible solution that is also superoptimal will tend to improve the lower bound. However, nothing forces us to evaluate lower bounds synchronously, or to only use the experts problem solver to improve our bounds.

We can find a new bound from any vector of non-negative constraint weights: they always yield a valid surrogate relaxation. We can solve that relaxation, and update our best bound when it’s improved. The Diet subproblem earlier had

$27.25 x\sb{\mathrm{corn}} - 9.25 x\sb{\mathrm{milk}} + 48.75 x\sb{\mathrm{bread}} \leq 437.5,$ $0 \leq x\sb{\mathrm{corn}},\, x\sb{\mathrm{milk}},\, x\sb{\mathrm{bread}} \leq 10.$

Adding the original objective function back yields the linear program

$\min 0.18 x\sb{\mathrm{corn}} + 0.23 x\sb{\mathrm{milk}} + 0.05 x\sb{\mathrm{bread}}$ subject to $27.25 x\sb{\mathrm{corn}} - 9.25 x\sb{\mathrm{milk}} + 48.75 x\sb{\mathrm{bread}} \leq 437.5,$ $0 \leq x\sb{\mathrm{corn}},\, x\sb{\mathrm{milk}},\, x\sb{\mathrm{bread}} \leq 10,$

which has a trivial optimal solution at $$[0, 0, 0]$$.

When we generate a feasible solution for the same subproblem, we can use any valid bound on the objective value to find the most feasible solution that is also assuredly (super)optimal. For example, if some oracle has given us a lower bound of $$2$$ for the original Diet problem, we can solve for

$\min 27.25 x\sb{\mathrm{corn}} - 9.25 x\sb{\mathrm{milk}} + 48.75 x\sb{\mathrm{bread}}$ subject to $0.18 x\sb{\mathrm{corn}} + 0.23 x\sb{\mathrm{milk}} + 0.05 x\sb{\mathrm{bread}}\leq 2$ $0 \leq x\sb{\mathrm{corn}},\, x\sb{\mathrm{milk}},\, x\sb{\mathrm{bread}} \leq 10.$

We can relax the objective value constraint further, since we know that the final $$\varepsilon-$$feasible solution is a simple arithmetic mean. Given the same best bound of $$2$$, and, e.g., a current average of $$3$$ solutions with a value of $$1.9$$, a new solution with an objective value of $$2.3$$ (more than our best bound, so not necessarily optimal!) would yield a new average solution with a value of $$2$$, which is still (super)optimal. This means we can solve the more relaxed subproblem

$\min 27.25 x\sb{\mathrm{corn}} - 9.25 x\sb{\mathrm{milk}} + 48.75 x\sb{\mathrm{bread}}$ subject to $0.18 x\sb{\mathrm{corn}} + 0.23 x\sb{\mathrm{milk}} + 0.05 x\sb{\mathrm{bread}}\leq 2.3$ $0 \leq x\sb{\mathrm{corn}},\, x\sb{\mathrm{milk}},\, x\sb{\mathrm{bread}} \leq 10.$

Given a bound on the objective value, we swapped the constraint and the objective; the goal is to maximise feasibility, while generating a solution that’s “good enough” to guarantee that the average solution is still (super)optimal.

For box-constrained linear programs where the box is the convex domain, subproblems are bounded linear knapsacks, so we can simply stop the greedy algorithm as soon as the objective value constraint is satisfied, or when the knapsack constraint becomes active (we found a better bound).

This last tweak doesn’t just accelerate convergence to $$\varepsilon-$$feasible solutions. More importantly for me, it pretty much guarantees that out $$\varepsilon-$$feasible solution matches the best known lower bound, even if that bound was provided by an outside oracle. Bundle methods and the Volume algorithm can also mix solutions to relaxed subproblems in order to generate $$\varepsilon-$$feasible solutions, but the result lacks the last guarantee: their fractional solutions are even more superoptimal than the best bound, and that can make bounding and variable fixing difficult.

## The secret sauce: AdaHedge

Before last Christmas’s CVRP set covering LP, I had always used the multiplicative weight update (MWU) algorithm as my black box online learning algorithm: it wasn’t great, but I couldn’t find anything better. The two main downsides for me were that I had to know a “width” parameter ahead of time, as well as the number of iterations I wanted to run.

The width is essentially the range of the payoffs; in our case, the potential level of violation or satisfaction of each constraints by any solution to the relaxed subproblems. The dependence isn’t surprising: folklore in Lagrangian relaxation also says that’s a big factor there. The problem is that the most extreme violations and satisfactions are initialisation parameters for the MWU algorithm, and the iteration count for a given $$\varepsilon$$ is quadratic in the width ($$\mathrm{max}\sb{violation} \cdot \mathrm{max}\sb{satisfaction}$$).

What’s even worse is that the MWU is explicitly tuned for a specific iteration count. If I estimate that, give my worst-case width estimate, one million iterations will be necessary to achieve $$\varepsilon-$$feasibility, MWU tuned for 1M iterations will need 1M iterations, even if the actual width is narrower.

de Rooij and others published AdaHedge in 2013, an algorithm that addresses both these issues by smoothly estimating its parameter over time, without using the doubling trick.4 AdaHedge’s loss (convergence rate to an $$\varepsilon-$$solution) still depends on the relaxation’s width. However, it depends on the maximum width actually observed during the solution process, and not on any explicit worst-case bound. It’s also not explicily tuned for a specific iteration count, and simply keeps improving at a rate that roughly matches MWU. If the instance happens to be easy, we will find an $$\varepsilon-$$feasible solution more quickly. In the worst case, the iteration count is never much worse than that of an optimally tuned MWU.

These 400 lines of Common Lisp implement AdaHedge and use it to optimise the set covering LP. AdaHedge acts as the online black box solver for the surrogate dual problem, the relaxed set covering LP is a linear knapsack, and each subproblem attempts to improve the lower bound before maximising feasibility.

When I ran the code, I had no idea how long it would take to find a feasible enough solution: covering constraints can never be violated by more than $$1$$, but some points could be covered by hundreds of tours, so the worst case satisfaction width is high. I had to rely on the way AdaHedge adapts to the actual hardness of the problem. In the end, $$34492$$ iterations sufficed to find a solution that was $$4.5\%$$ infeasible.5 This corresponds to a worst case with a width of less than $$2$$, which is probably not what happened. It seems more likely that the surrogate dual isn’t actually an omniscient adversary, and AdaHedge was able to exploit some of that “easiness.”

The iterations themselves are also reasonable: one sparse matrix / dense vector multiplication to convert surrogate dual weights to an average constraint, one solve of the relaxed LP, and another sparse matrix / dense vector multiplication to compute violations for each constraint. The relaxed LP is a fractional $$[0, 1]$$ knapsack, so the bottleneck is sorting double floats. Each iteration took 1.8 seconds on my old laptop; I’m guessing that could easily be 10-20 times faster with vectorisation and parallelisation.

In another post, I’ll show how using the same surrogate dual optimisation algorithm to mimick Lagrangian decomposition instead of Lagrangian relaxation guarantees an iteration count in $$\mathcal{O}\left(\frac{\lg \#\mathrm{nonzero}}{\varepsilon\sp{2}}\right)$$ independently of luck or the specific linear constraints.

1. Yes, I have been banging my head against that wall for a while.

2. This is equivalent to minimising expected loss with random bits, but cleans up the reduction.

3. When was the last time you had to worry whether that log was natural or base-2?

4. The doubling trick essentially says to start with an estimate for some parameters (e.g., width), then adjust it to at least double the expected iteration count when the parameter’s actual value exceeds the estimate. The sum telescopes and we only pay a constant multiplicative overhead for the dynamic update.

5. I think I computed the $$\log$$ of the number of decision variables instead of the number of constraints, so maybe this could have gone a bit better.

]]>
<![CDATA[The Unscalable, Deadlock-prone, Thread Pool]]> 2019-02-25T21:27:06-05:00 https://www.pvk.ca/Blog/2019/02/25/the-unscalable-thread-pool Epistemic Status: I’ve seen thread pools fail this way multiple times, am confident the pool-per-state approach is an improvement, and have confirmed with others they’ve also successfully used it in anger. While I’ve thought about this issue several times over ~4 years and pool-per-state seems like a good fix, I’m not convinced it’s undominated and hope to hear about better approaches.

Thread pools tend to only offer a sparse interface: pass a closure or a function and its arguments to the pool, and that function will be called, eventually.1 Functions can do anything, so this interface should offer all the expressive power one could need. Experience tells me otherwise.

The standard pool interface is so impoverished that it is nearly impossible to use correctly in complex programs, and leads us down design dead-ends. I would actually argue it’s better to work with raw threads than to even have generic amorphous thread pools: the former force us to stop and think about resource requirements (and lets the OS’s real scheduler help us along), instead of making us pretend we only care about CPU usage. I claim thread pools aren’t scalable because, with the exception of CPU time, they actively hinder the development of programs that achieve high resource utilisation.

This post comes in two parts. First, the story of a simple program that’s parallelised with a thread pool, then hits a wall as a wider set of resources becomes scarce. Second, a solution I like for that kind of program: an explicit state machine, where each state gets a dedicated queue that is aware of the state’s resource requirements.

## Stages of parallelisation

We start with a simple program that processes independent work units, a serial loop that pulls in work (e.g., files in a directory), or wait for requests on a socket, one work unit at a time.

At some point, there’s enough work to think about parallelisation, and we choose threads.2 To keep things simple, we simply spawn a thread per work unit. Load increases further, and we observe that we spend more time switching between threads or contending on shared data than doing actual work. We could use a semaphore to limit the number of work units we process concurrently, but we might as well just push work units to a thread pool and recycle threads instead of wasting resources on a thread-per-request model. We can even start thinking about queueing disciplines, admission control, backpressure, etc. Experienced developers will often jump directly to this stage after the serial loop.

The 80s saw a lot of research on generalising this “flat” parallelism model to nested parallelism, where work units can spawn additional requests and wait for the results (e.g., to recursively explore sub-branches of a search tree). Nested parallelism seems like a good fit for contemporary network services: we often respond to a request by sending simpler requests downstream, before merging and munging the responses and sending the result back to the original requestor. That may be why futures and promises are so popular these days.

I believe that, for most programs, the futures model is an excellent answer to the wrong question. The moment we perform I/O (be it network, disk, or even with hardware accelerators) in order to generate a result, running at scale will have to mean controlling more resources than just CPU, and both the futures and the generic thread pool models fall short.

The issue is that futures only work well when a waiter can help along the value it needs, with task stealing, while thread pools implement a trivial scheduler (dedicate a thread to a function until that function returns) that must be oblivious to resource requirements, since it handles opaque functions.

Once we have futures that might be blocked on I/O, we can’t guarantee a waiter will achieve anything by lending CPU time to its children. We could help sibling tasks, but that way stack overflows lie.

The deficiency of flat generic thread pools is more subtle. Obviously, one doesn’t want to take a tight thread pool, with one thread per core, and waste it on synchronous I/O. We’ll simply kick off I/O asynchronously, and re-enqueue the continuation on the pool upon completion!

Instead of doing

A, I/O, B


in one function, we’ll split the work in two functions and a callback

A, initiate asynchronous I/O
On I/O completion: enqueue B in thread pool
B


The problem here is that it’s easy to create too many asynchronous requests, and run out of memory, DOS the target, or delay the rest of the computation for too long. As soon as the I/O requests has been initiated in A, the function returns to the thread pool, which will just execute more instances of A and initiate even more I/O.

At first, when the program doesn’t heavily utilise any resource in particular, there’s an easy solution: limit the total number of in-flight work units with a semaphore. Note that I wrote work unit, not function calls. We want to track logical requests that we started processing, but for which there is still work to do (e.g., the response hasn’t been sent back yet).

I’ve seen two ways to cap in-flight work units. One’s buggy, the other doesn’t generalise.

The buggy implementation acquires a semaphore in the first stage of request handling (A) and releases it in the last stage (B). The bug is that, by the time we’re executing A, we’re already using up a slot in the thread pool, so we might be preventing Bs from executing. We have a lock ordering problem: A acquires a thread pool slot before acquiring the in-flight semaphore, but B needs to acquire a slot before releasing the same semaphore. If you’ve seen code that deadlocks when the thread pool is too small, this was probably part of the problem.

The correct implementation acquires the semaphore before enqueueing a new work unit, before shipping a call to A to the thread pool (and releases it at the end of processing, in B). This only works because we can assume that the first thing A does is to acquire the semaphore. As our code becomes more efficient, we’ll want to more finely track the utilisation of multiple resources, and pre-acquisition won’t suffice. For example, we might want to limit network requests going to individual hosts, independently from disk reads or writes, or from database transactions.

## Resource-aware thread pools

The core issue with thread pools is that the only thing they can do is run opaque functions in a dedicated thread, so the only way to reserve resources is to already be running in a dedicated thread. However, the one resource that every function needs is a thread on which to run, thus any correct lock order must acquire the thread last.

We care about reserving resources because, as our code becomes more efficient and scales up, it will start saturating resources that used to be virtually infinite. Unfortunately, classical thread pools can only control CPU usage, and actively hinder correct resource throttling. If we can’t guarantee we won’t overwhelm the supply of a given resource (e.g., read IOPS), we must accept wasteful overprovisioning.

Once the problem has been identified, the solution becomes obvious: make sure the work we push to thread pools describes the resources to acquire before running the code in a dedicated thread.

My favourite approach assigns one global thread pool (queue) to each function or processing step. The arguments to the functions will change, but the code is always the same, so the resource requirements are also well understood. This does mean that we incur complexity to decide how many threads or cores each pool is allowed to use. However, I find that the resulting programs are better understandable at a high level: it’s much easier to write code that traverses and describes the work waiting at different stages when each stage has a dedicated thread pool queue. They’re also easier to model as queueing systems, which helps answer “what if?” questions without actually implementing the hypothesis.

In increasing order of annoyingness, I’d divide resources to acquire in four classes.

1. Resources that may be seamlessly3 shared or timesliced, like CPU.
2. Resources that are acquired for the duration of a single function call or processing step, like DB connections.
3. Resources that are acquired in one function call, then released in another thread pool invocation, like DB transactions, or asynchronous I/O semaphores.
4. Resources that may only be released after temporarily using more of it, or by cancelling work: memory.

We don’t really have to think about the first class of resources, at least when it comes to correctness. However, repeatedly running the same code on a given core tends to improve performance, compared to running all sorts of code on all cores.

The second class of resources may be acquired once our code is running in a thread pool, so one could pretend it doesn’t exist. However, it is more efficient to batch acquisition, and execute a bunch of calls that all need a given resource (e.g., a DB connection from a connection pool) before releasing it, instead of repetitively acquiring and releasing the same resource in back-to-back function calls, or blocking multiple workers on the same bottleneck.4 More importantly, the property of always being acquired and released in the same function invocation, is a global one: as soon as even one piece of code acquires a given resource and releases in another thread pool call (e.g., acquires a DB connection, initiates an asynchronous network call, writes the result of the call to the DB, and releases the connection), we must always treat that resource as being in the third, more annoying, class. Having explicit stages with fixed resource requirements helps us confirm resources are classified correctly.

The third class of resources must be acquired in a way that preserves forward progress in the rest of the system. In particular, we must never have all workers waiting for resources of this third class. In most cases, it suffices to make sure there at least as many workers as there are queues or stages, and to only let each stage run the initial resource acquisition code in one worker at a time. However, it can pay off to be smart when different queued items require different resources, instead of always trying to satisfy resource requirements in FIFO order.

The fourth class of resources is essentially heap memory. Memory is special because the only way to release it is often to complete the computation. However, moving the computation forward will use even more heap. In general, my only solution is to impose a hard cap on the total number of in-flight work units, and to make sure it’s easy to tweak that limit at runtime, in disaster scenarios. If we still run close to the memory capacity with that limit, the code can either crash (and perhaps restart with a lower in-flight cap), or try to cancel work that’s already in progress. Neither option is very appealing.

There are some easier cases. For example, I find that temporary bumps in heap usage can be caused by parsing large responses from idempotent (GET) requests. It would be nice if networking subsystems tracked memory usage to dynamically throttle requests, or even cancel and retry idempotent ones.

Once we’ve done the work of explicitly writing out the processing steps in our program as well as their individual resource requirements, it makes sense to let that topology drive the structure of the code.

Over time, we’ll gain more confidence in that topology and bake it in our program to improve performance. For example, rather than limiting the number of in-flight requests with a semaphore, we can have a fixed-size allocation pool of request objects. We can also selectively use bounded ring buffers once we know we wish to impose a limit on queue size. Similarly, when a sequence (or subgraph) of processing steps is fully synchronous or retires in order, we can control both the queue size and the number of in-flight work units with a disruptor, which should also improve locality and throughput under load. These transformations are easy to apply once we know what the movement of data and resource looks like. However, they also ossify the structure of the program, so I only think about such improvements if they provide a system property I know I need (e.g., a limit on the number of in-flight requests), or once the code is functional and we have load-testing data.

Complex programs are often best understood as state machines. These state machines can be implicit, or explicit. I prefer the latter. I claim that it’s also preferable to have one thread pool5 per explicit state than to dump all sorts of state transition logic in a shared pool. If writing functions that process flat tables is data-oriented programming, I suppose I’m arguing for data-oriented state machines.

1. Convenience wrappers, like parallel map, or “run after this time,” still rely on the flexibility of opaque functions.

2. Maybe we decided to use threads because there’s a lot of shared, read-mostly, data on the heap. It doesn’t really matter, process pools have similar problems.

3. Up to a point, of course. No model is perfect, etc. etc.

4. Explicit resource requirements combined with one queue per stage lets us steal ideas from SEDA.

5. One thread pool per state in the sense that no state can fully starve out another of CPU time. The concrete implementation may definitely let a shared set of workers pull from all the queues.

]]>
<![CDATA[Preemption Is GC for Memory Reordering]]> 2019-01-09T16:29:37-05:00 https://www.pvk.ca/Blog/2019/01/09/preemption-is-gc-for-memory-reordering I previously noted how preemption makes lock-free programming harder in userspace than in the kernel. I now believe that preemption ought to be treated as a sunk cost, like garbage collection: we’re already paying for it, so we might as well use it. Interrupt processing (returning from an interrupt handler, actually) is fully serialising on x86, and on other platforms, no doubt: any userspace instruction either fully executes before the interrupt, or is (re-)executed from scratch some time after the return back to userspace. That’s something we can abuse to guarantee ordering between memory accesses, without explicit barriers.

This abuse of interrupts is complementary to Bounded TSO. Bounded TSO measures the hardware limit on the number of store instructions that may concurrently be in-flight (and combines that with the knowledge that instructions are retired in order) to guarantee liveness without explicit barriers, with no overhead, and usually marginal latency. However, without worst-case execution time information, it’s hard to map instruction counts to real time. Tracking interrupts lets us determine when enough real time has elapsed that earlier writes have definitely retired, albeit after a more conservative delay than Bounded TSO’s typical case.

I reached this position after working on two lock-free synchronisation primitives—event counts, and asymmetric flag flips as used in hazard pointers and epoch reclamation—that are similar in that a slow path waits for a sign of life from a fast path, but differ in the way they handle “stuck” fast paths. I’ll cover the event count and flag flip implementations that I came to on Linux/x86[-64], which both rely on interrupts for ordering. Hopefully that will convince you too that preemption is a useful source of pre-paid barriers for lock-free code in userspace.

I’m writing this for readers who are already familiar with lock-free programming, safe memory reclamation techniques in particular, and have some experience reasoning with formal memory models. For more references, Samy’s overview in the ACM Queue is a good resource. I already committed the code for event counts in Concurrency Kit, and for interrupt-based reverse barriers in my barrierd project.

# Event counts with x86-TSO and futexes

An event count is essentially a version counter that lets threads wait until the current version differs from an arbitrary prior version. A trivial “wait” implementation could spin on the version counter. However, the value of event counts is that they let lock-free code integrate with OS-level blocking: waiters can grab the event count’s current version v0, do what they want with the versioned data, and wait for new data by sleeping rather than burning cycles until the event count’s version differs from v0. The event count is a common synchronisation primitive that is often reinvented and goes by many names (e.g., blockpoints); what matters is that writers can update the version counter, and waiters can read the version, run arbitrary code, then efficiently wait while the version counter is still equal to that previous version.

The explicit version counter solves the lost wake-up issue associated with misused condition variables, as in the pseudocode below.

bad condition waiter:

while True:
atomically read data
if need to wait:
WaitOnConditionVariable(cv)
else:
break


In order to work correctly, condition variables require waiters to acquire a mutex that protects both data and the condition variable, before checking that the wait condition still holds and then waiting on the condition variable.

good condition waiter:

while True:
with(mutex):
read data
if need to wait:
WaitOnConditionVariable(cv, mutex)
else:
break


Waiters must prevent writers from making changes to the data, otherwise the data change (and associated condition variable wake-up) could occur between checking the wait condition, and starting to wait on the condition variable. The waiter would then have missed a wake-up and could end up sleeping forever, waiting for something that has already happened.

good condition waker:

with(mutex):
update data
SignalConditionVariable(cv)


The six diagrams below show the possible interleavings between the signaler (writer) making changes to the data and waking waiters, and a waiter observing the data and entering the queue to wait for changes. The two left-most diagrams don’t interleave anything; these are the only scenarios allowed by correct locking. The remaining four actually interleave the waiter and signaler, and show that, while three are accidentally correct (lucky), there is one case, WSSW, where the waiter misses its wake-up.

If any waiter can prevent writers from making progress, we don’t have a lock-free protocol. Event counts let waiters detect when they would have been woken up (the event count’s version counter has changed), and thus patch up this window where waiters can miss wake-ups for data changes they have yet to observe. Crucially, waiters detect lost wake-ups, rather than preventing them by locking writers out. Event counts thus preserve lock-freedom (and even wait-freedom!).

We could, for example, use an event count in a lock-free ring buffer: rather than making consumers spin on the write pointer, the write pointer could be encoded in an event count, and consumers would then efficiently block on that, without burning CPU cycles to wait for new messages.

The challenging part about implementing event counts isn’t making sure to wake up sleepers, but to only do so when there are sleepers to wake. For some use cases, we don’t need to do any active wake-up, because exponential backoff is good enough: if version updates signal the arrival of a response in a request/response communication pattern, exponential backoff, e.g., with a 1.1x backoff factor, could bound the increase in response latency caused by the blind sleep during backoff, e.g., to 10%.

Unfortunately, that’s not always applicable. In general, we can’t assume that signals corresponds to responses for prior requests, and we must support the case where progress is usually fast enough that waiters only spin for a short while before grabbing more work. The latter expectation means we can’t “just” unconditionally execute a syscall to wake up sleepers whenever we increment the version counter: that would be too slow. This problem isn’t new, and has a solution similar to the one deployed in adaptive spin locks.

The solution pattern for adaptive locks relies on tight integration with an OS primitive, e.g., futexes. The control word, the machine word on which waiters spin, encodes its usual data (in our case, a version counter), as well as a new flag to denote that there are sleepers waiting to be woken up with an OS syscall. Every write to the control word uses atomic read-modify-write instructions, and before sleeping, waiters ensure the “sleepers are present” flag is set, then make a syscall to sleep only if the control word is still what they expect, with the sleepers flag set.

OpenBSD’s compatibility shim for Linux’s futexes is about as simple an implementation of the futex calls as it gets. The OS code for futex wake and wait is identical to what userspace would do with mutexes and condition variables (waitqueues). Waiters lock out wakers for the futex word or a coarser superset, check that the futex word’s value is as expected, and enters the futex’s waitqueue. Wakers acquire the futex word for writes, and wake up the waitqueue. The difference is that all of this happens in the kernel, which, unlike userspace, can force the scheduler to be helpful. Futex code can run in the kernel because, unlike arbitrary mutex/condition variable pairs, the protected data is always a single machine integer, and the wait condition an equality test. This setup is simple enough to fully implement in the kernel, yet general enough to be useful.

OS-assisted conditional blocking is straightforward enough to adapt to event counts. The control word is the event count’s version counter, with one bit stolen for the “sleepers are present” flag (sleepers flag).

Incrementing the version counter can use a regular atomic increment; we only need to make sure we can tell whether the sleepers flag might have been set before the increment. If the sleepers flag was set, we clear it (with an atomic bit reset), and wake up any OS thread blocked on the control word.

increment event count:

old <- fetch_and_add(event_count.counter, 2)  # flag is in the low bit
if (old & 1):
atomic_and(event_count.counter, -2)
signal waiters on event_count.counter


Waiters can spin for a while, waiting for the version counter to change. At some point, a waiter determines that it’s time to stop wasting CPU time. The waiter then sets the sleepers flag with a compare-and-swap: the CAS (compare-and-swap) can only fail because the counter’s value has changed or because the flag is already set. In the former failure case, it’s finally time to stop waiting. In the latter failure care, or if the CAS succeeded, the flag is now set. The waiter can then make a syscall to block on the control word, but only if the control word still has the sleepers flag set and contains the same expected (old) version counter.

wait until event count differs from prev:

repeat k times:
if (event_count.counter / 2) != prev:  # flag is in low bit.
return
compare_and_swap(event_count.counter, prev * 2, prev * 2 + 1)
if cas_failed and cas_old_value != (prev * 2 + 1):
return
repeat k times:
if (event_count.counter / 2) != prev:
return
sleep_if(event_count.center == prev * 2 + 1)


This scheme works, and offers decent performance. In fact, it’s good enough for Facebook’s Folly.
I certainly don’t see how we can improve on that if there are concurrent writers (incrementing threads).

However, if we go back to the ring buffer example, there is often only one writer per ring. Enqueueing an item in a single-producer ring buffer incurs no atomic, only a release store: the write pointer increment only has to be visible after the data write, which is always the case under the TSO memory model (including x86). Replacing the write pointer in a single-producer ring buffer with an event count where each increment incurs an atomic operation is far from a no-brainer. Can we do better, when there is only one incrementer?

On x86 (or any of the zero other architectures with non-atomic read-modify-write instructions and TSO), we can… but we must accept some weirdness.

The operation that must really be fast is incrementing the event counter, especially when the sleepers flag is not set. Setting the sleepers flag on the other hand, may be slower and use atomic instructions, since it only happens when the executing thread is waiting for fresh data.

I suggest that we perform the former, the increment on the fast path, with a non-atomic read-modify-write instruction, either inc mem or xadd mem, reg. If the sleepers flag is in the sign bit, we can detect it (modulo a false positive on wrap-around) in the condition codes computed by inc; otherwise, we must use xadd (fetch-and-add) and look at the flag bit in the fetched value.

The usual ordering-based arguments are no help in this kind of asymmetric synchronisation pattern. Instead, we must go directly to the x86-TSO memory model. All atomic (LOCK prefixed) instructions conceptually flush the executing core’s store buffer, grab an exclusive lock on memory, and perform the read-modify-write operation with that lock held. Thus, manipulating the sleepers flag can’t lose updates that are already visible in memory, or on their way from the store buffer. The RMW increment will also always see the latest version update (either in global memory, or in the only incrementer’s store buffer), so won’t lose version updates either. Finally, scheduling and thread migration must always guarantee that the incrementer thread sees its own writes, so that won’t lose version updates.

increment event count without atomics in the common case:

old <- non_atomic_fetch_and_add(event_count.counter, 2)
if (old & 1):
atomic_and(event_count.counter, -2)
signal waiters on event_count.counter


The only thing that might be silently overwritten is the sleepers flag: a waiter might set that flag in memory just after the increment’s load from memory, or while the increment reads a value with the flag unset from the local store buffer. The question is then how long waiters must spin before either observing an increment, or knowing that the flag flip will be observed by the next increment. That question can’t be answered with the memory model, and worst-case execution time bounds are a joke on contemporary x86.

I found an answer by remembering that IRET, the instruction used to return from interrupt handlers, is a full barrier.1 We also know that interrupts happen at frequent and regular intervals, if only for the preemption timer (every 4-10ms on stock Linux/x86oid).

Regardless of the bound on store visibility, a waiter can flip the sleepers-are-present flag, spin on the control word for a while, and then start sleeping for short amounts of time (e.g., a millisecond or two at first, then 10 ms, etc.): the spin time is long enough in the vast majority of cases, but could still, very rarely, be too short.

At some point, we’d like to know for sure that, since we have yet to observe a silent overwrite of the sleepers flag or any activity on the counter, the flag will always be observed and it is now safe to sleep forever. Again, I don’t think x86 offers any strict bound on this sort of thing. However, one second seems reasonable. Even if a core could stall for that long, interrupts fire on every core several times a second, and returning from interrupt handlers acts as a full barrier. No write can remain in the store buffer across interrupts, interrupts that occur at least once per second. It seems safe to assume that, once no activity has been observed on the event count for one second, the sleepers flag will be visible to the next increment.

That assumption is only safe if interrupts do fire at regular intervals. Some latency sensitive systems dedicate cores to specific userspace threads, and move all interrupt processing and preemption away from those cores. A correctly isolated core running Linux in tickless mode, with a single runnable process, might not process interrupts frequently enough. However, this kind of configuration does not happen by accident. I expect that even a half-second stall in such a system would be treated as a system error, and hopefully trigger a watchdog. When we can’t count on interrupts to get us barriers for free, we can instead rely on practical performance requirements to enforce a hard bound on execution time.

Either way, waiters set the sleepers flag, but can’t rely on it being observed until, very conservatively, one second later. Until that time has passed, waiters spin on the control word, then block for short, but growing, amounts of time. Finally, if the control word (event count version and sleepers flag) has not changed in one second, we assume the incrementer has no write in flight, and will observe the sleepers flag; it is safe to block on the control word forever.

wait until event count differs from prev:

repeat k times:
if (event_count.counter / 2) != prev:
return
compare_and_swap(event_count.counter, 2 * prev, 2 * prev + 1)
if cas_failed and cas_old_value != 2 * prev + 1:
return
repeat k times:
if event_count.counter != 2 * prev + 1:
return
repeat for 1 second:
sleep_if_until(event_count.center == 2 * prev + 1,
$exponential_backoff) if event_count.counter != 2 * prev + 1: return sleep_if(event_count.center == prev * 2 + 1)  That’s the solution I implemented in this pull request for SPMC and MPMC event counts in concurrency kit. The MP (multiple producer) implementation is the regular adaptive logic, and matches Folly’s strategy. It needs about 30 cycles for an uncontended increment with no waiter, and waking up sleepers adds another 700 cycles on my E5-46xx (Linux 4.16). The single producer implementation is identical for the slow path, but only takes ~8 cycles per increment with no waiter, and, eschewing atomic instruction, does not flush the pipeline (i.e., the out-of-order execution engine is free to maximise throughput). The additional overhead for an increment without waiter, compared to a regular ring buffer pointer update, is 3-4 cycles for a single predictable conditional branch or fused test and branch, and the RMW’s load instead of a regular add/store. That’s closer to zero overhead, which makes it much easier for coders to offer OS-assisted blocking in their lock-free algorithms, without agonising over the penalty when no one needs to block. # Asymmetric flag flip with interrupts on Linux Hazard pointers and epoch reclamation. Two different memory reclamation technique, in which the fundamental complexity stems from nearly identical synchronisation requirements: rarely, a cold code path (which is allowed to be very slow) writes to memory, and must know when another, much hotter, code path is guaranteed to observe the slow path’s last write. For hazard pointers, the cold code path waits until, having overwritten an object’s last persistent reference in memory, it is safe to destroy the pointee. The hot path is the reader: 1. read pointer value *(T **)x. 2. write pointer value to hazard pointer table 3. check that pointer value *(T **)x has not changed  Similarly, for epoch reclamation, a read-side section will grab the current epoch value, mark itself as reading in that epoch, then confirm that the epoch hasn’t become stale. 1.$epoch <- current epoch
2. publish self as entering a read-side section under $epoch 3. check that$epoch is still current, otherwise retry


Under a sequentially consistent (SC) memory model, the two sequences are valid with regular (atomic) loads and stores. The slow path can always make its write, then scan every other thread’s single-writer data to see if any thread has published something that proves it executed step 2 before the slow path’s store (i.e., by publishing the old pointer or epoch value).

The diagrams below show all possible interleavings. In all cases, once there is no evidence that a thread has failed to observe the slow path’s new write, we can correctly assume that all threads will observe the write. I simplified the diagrams by not interleaving the first read in step 1: its role is to provide a guess for the value that will be re-read in step 3, so, at least with respect to correctness, that initial read might as well be generating random values. I also kept the second “scan” step in the slow path abstract. In practice, it’s a non-snapshot read of all the epoch or hazard pointer tables for threads that execute the fast path: the slow path can assume an epoch or pointer will not be resurrected once the epoch or pointer is absent from the scan.

No one implements SC in hardware. X86 and SPARC offer the strongest practical memory model, Total Store Ordering, and that’s still not enough to correctly execute the read-side critical sections above without special annotations. Under TSO, reads (e.g., step 3) are allowed to execute before writes (e.g., step 2). X86-TSO models that as a buffer in which stores may be delayed, and that’s what the scenarios below show, with steps 2 and 3 of the fast path reversed (the slow path can always be instrumented to recover sequential order, it’s meant to be slow). The TSO interleavings only differ from the SC ones when the fast path’s steps 2 and 3 are separated by something on slow path’s: when the two steps are adjacent, their order relative to the slow path’s steps is unaffected by TSO’s delayed stores. TSO is so strong that we only have to fix one case, FSSF, where the slow path executes in the middle of the fast path, with the reversal of store and load order allowed by TSO.

Simple implementations plug this hole with a store-load barrier between the second and third steps, or implement the store with an atomic read-modify-write instruction that doubles as a barrier. Both modifications are safe and recover SC semantics, but incur a non-negligible overhead (the barrier forces the out of order execution engine to flush before accepting more work) which is only necessary a minority of the time.

The pattern here is similar to the event count, where the slow path signals the fast path that the latter should do something different. However, where the slow path for event counts wants to wait forever if the fast path never makes progress, hazard pointer and epoch reclamation must detect that case and ignore sleeping threads (that are not in the middle of a read-side SMR critical section).

In this kind of asymmetric synchronisation pattern, we wish to move as much of the overhead to the slow (cold) path. Linux 4.3 gained the membarrier syscall for exactly this use case. The slow path can execute its write(s) before making a membarrier syscall. Once the syscall returns, any fast path write that has yet to be visible (hasn’t retired yet), along with every subsequent instruction in program order, started in a state where the slow path’s writes were visible. As the next diagram shows, this global barrier lets us rule out the one anomalous execution possible under TSO, without adding any special barrier to the fast path. The problem with membarrier is that it comes in two flavours: slow, or not scalable. The initial, unexpedited, version waits for kernel RCU to run its callback, which, on my machine, takes anywhere between 25 and 50 milliseconds. The reason it’s so slow is that the condition for an RCU grace period to elapse are more demanding than a global barrier, and may even require multiple such barriers. For example, if we used the same scheme to nest epoch reclamation ten deep, the outermost reclaimer would be 1024 times slower than the innermost one. In reaction to this slowness, potential users of membarrier went back to triggering IPIs, e.g., by mprotecting a dummy page. mprotect isn’t guaranteed to act as a barrier, and does not do so on AArch64, so Linux 4.16 added an “expedited” mode to membarrier. In that expedited mode, each membarrier syscall sends an IPI to every other core… when I look at machines with hundreds of cores, $$n - 1$$ IPI per core, a couple times per second on every $$n$$ core, start to sound like a bad idea.

Let’s go back to the observation we made for event count: any interrupt acts as a barrier for us, in that any instruction that retires after the interrupt must observe writes made before the interrupt. Once the hazard pointer slow path has overwritten a pointer, or the epoch slow path advanced the current epoch, we can simply look at the current time, and wait until an interrupt has been handled at a later time on all cores. The slow path can then scan all the fast path state for evidence that they are still using the overwritten pointer or the previous epoch: any fast path that has not published that fact before the interrupt will eventually execute the second and third steps after the interrupt, and that last step will notice the slow path’s update.

There’s a lot of information in /proc that lets us conservatively determine when a new interrupt has been handled on every core. However, it’s either too granular (/proc/stat) or extremely slow to generate (/proc/schedstat). More importantly, even with ftrace, we can’t easily ask to be woken up when something interesting happens, and are forced to poll files for updates (never mind the weirdly hard to productionalise kernel interface).

What we need is a way to read, for each core, the last time it was definitely processing an interrupt. Ideally, we could also block and let the OS wake up our waiter on changes to the oldest “last interrupt” timestamp, across all cores. On x86, that’s enough to get us the asymmetric barriers we need for hazard pointers and epoch reclamation, even if only IRET is serialising, and not interrupt handler entry. Once a core’s update to its “last interrupt” timestamp is visible, any write prior to the update, and thus any write prior to the interrupt is also globally visible: we can only observe the timestamp update from a different core than the updater, in which case TSO saves us, or after the handler has returned with a serialising IRET.

We can bundle all that logic in a short eBPF program.2 The program has a map of thread-local arrays (of 1 CLOCK_MONOTONIC timestamp each), a map of perf event queues (one per CPU), and an array of 1 “watermark” timestamp. Whenever the program runs, it gets the current time. That time will go in the thread-local array of interrupt timestamps. Before storing a new value in that array, the program first reads the previous interrupt time: if that time is less than or equal to the watermark, we should wake up userspace by enqueueing in event in perf. The enqueueing is conditional because perf has more overhead than a thread-local array, and because we want to minimise spurious wake-ups. A high signal-to-noise ratio lets userspace set up the read end of the perf queue to wake up on every event and thus minimise update latency.

We now need a single global daemon to attach the eBPF program to an arbitrary set of software tracepoints triggered by interrupts (or PMU events that trigger interrupts), to hook the perf fds to epoll, and to re-read the map of interrupt timestamps whenever epoll detects a new perf event. That’s what the rest of the code handles: setting up tracepoints, attaching the eBPF program, convincing perf to wake us up, and hooking it all up to epoll. On my fully loaded 24-core E5-46xx running Linux 4.18 with security patches, the daemon uses ~1-2% (much less on 4.16) of a core to read the map of timestamps every time it’s woken up every ~4 milliseconds. perf shows the non-JITted eBPF program itself uses ~0.1-0.2% of every core.

Amusingly enough, while eBPF offers maps that are safe for concurrent access in eBPF programs, the same maps come with no guarantee when accessed from userspace, via the syscall interface. However, the implementation uses a hand-rolled long-by-long copy loop, and, on x86-64, our data all fit in longs. I’ll hope that the kernel’s compilation flags (e.g., -ffree-standing) suffice to prevent GCC from recognising memcpy or memmove, and that we thus get atomic store and loads on x86-64. Given the quality of eBPF documentation, I’ll bet that this implementation accident is actually part of the API. Every BPF map is single writer (either per-CPU in the kernel, or single-threaded in userspace), so this should work.

Once the barrierd daemon is running, any program can mmap its data file to find out the last time we definitely know each core had interrupted userspace, without making any further syscall or incurring any IPI. We can also use regular synchronisation to let the daemon wake up threads waiting for interrupts as soon as the oldest interrupt timestamp is updated. Applications don’t even need to call clock_gettime to get the current time: the daemon also works in terms of a virtual time that it updates in the mmaped data file.

The barrierd data file also includes an array of per-CPU structs with each core’s timestamps (both from CLOCK_MONOTONIC and in virtual time). A client that knows it will only execute on a subset of CPUs, e.g., cores 2-6, can compute its own “last interrupt” timestamp by only looking at entries 2 to 6 in the array. The daemon even wakes up any futex waiter on the per-CPU values whenever they change. The convenience interface is pessimistic, and assumes that client code might run on every configured core. However, anyone can mmap the same file and implement tighter logic.

Again, there’s a snag with tickless kernels. In the default configuration already, a fully idle core might not process timer interrupts. The barrierd daemon detects when a core is falling behind, and starts looking for changes to /proc/stat. This backup path is slower and coarser grained, but always works with idle cores. More generally, the daemon might be running on a system with dedicated cores. I thought about causing interrupts by re-affining RT threads, but that seems counterproductive. Instead, I think the right approach is for users of barrierd to treat dedicated cores specially. Dedicated threads can’t (shouldn’t) be interrupted, so they can regularly increment a watchdog counter with a serialising instruction. Waiters will quickly observe a change in the counters for dedicated threads, and may use barrierd to wait for barriers on preemptively shared cores. Maybe dedicated threads should be able to hook into barrierd and check-in from time to time. That would break the isolation between users of barrierd, but threads on dedicated cores are already in a privileged position.

I quickly compared the barrier latency on an unloaded 4-way E5-46xx running Linux 4.16, with a sample size of 20000 observations per method (I had to remove one outlier at 300ms). The synchronous methods mprotect (which abuses mprotect to send IPIs by removing and restoring permissions on a dummy page), or explicit IPI via expedited membarrier, are much faster than the other (unexpedited membarrier with kernel RCU, or barrierd that counts interrupts). We can zoom in on the IPI-based methods, and see that an expedited membarrier (IPI) is usually slightly faster than mprotect; IPI via expedited membarrier hits a worst-case of 0.041 ms, versus 0.046 for mprotect. The performance of IPI-based barriers should be roughly independent of system load. However, we did observe a slowdown for expedited membarrier (between $$68.4-73.0\%$$ of the time, $$p < 10\sp{-12}$$ according to a binomial test3) on the same 4-way system, when all CPUs were running CPU-intensive code at low priority. In this second experiment, we have a sample size of one million observations for each method, and the worst case for IPI via expedited membarrier was 0.076 ms (0.041 ms on an unloaded system), compared to a more stable 0.047 ms for mprotect. Now for non-IPI methods: they should be slower than methods that trigger synchronous IPIs, but hopefully have lower overhead and scale better, while offering usable latencies.

On an unloaded system, the interrupts that drive barrierd are less frequent, sometimes outright absent, so unexpedited membarrier achieves faster response times. We can even observe barrierd’s fallback logic, which scans /proc/stat for evidence of idle CPUs after 10 ms of inaction: that’s the spike at 20ms. The values for vtime show the additional slowdown we can expect if we wait on barrierd’s virtual time, rather than directly reading CLOCK_MONOTONIC. Overall, the worst case latencies for barrierd (53.7 ms) and membarrier (39.9 ms) aren’t that different, but I should add another fallback mechanism based on membarrier to improve barrierd’s performance on lightly loaded machines. When the same 4-way, 24-core, system is under load, interrupts are fired much more frequently and reliably, so barrierd shines, but everything has a longer tail, simply because of preemption of the benchmark process. Out of the one million observations we have for each of unexpedited membarrier, barrierd, and barrierd with virtual time on this loaded system, I eliminated 54 values over 100 ms (18 for membarrier, 29 for barrierd, and 7 for virtual time). The rest is shown below. barrierd is consistently much faster than membarrier, with a geometric mean speedup of 23.8x. In fact, not only can we expect barrierd to finish before an unexpedited membarrier $$99.99\%$$ of the time ($$p<10\sp{-12}$$ according to a binomial test), but we can even expect barrierd to be 10 times as fast $$98.3-98.5\%$$ of the time ($$p<10\sp{-12}$$). The gap is so wide that even the opportunistic virtual-time approach is faster than membarrier (geometric mean of 5.6x), but this time with a mere three 9s (as fast as membarrier $$99.91-99.96\%$$ of the time, $$p<10\sp{-12}$$). With barrierd, we get implicit barriers with worse overhead than unexpedited membarrier (which is essentially free since it piggybacks on kernel RCU, another sunk cost), but 1/10th the latency (0-4 ms instead of 25-50 ms). In addition, interrupt tracking is per-CPU, not per-thread, so it only has to happen in a global single-threaded daemon; the rest of userspace can obtain the information it needs without causing additional system overhead. More importantly, threads don’t have to block if they use barrierd to wait for a system-wide barrier. That’s useful when, e.g., a thread pool worker is waiting for a reverse barrier before sleeping on a futex. When that worker blocks in membarrier for 25ms or 50ms, there’s a potential hiccup where a work unit could sit in the worker’s queue for that amount of time before it gets processed. With barrierd (or the event count described earlier), the worker can spin and wait for work units to show up until enough time has passed to sleep on the futex.

While I believe that information about interrupt times should be made available without tracepoint hacks, I don’t know if a syscall like membarrier is really preferable to a shared daemon like barrierd. The one salient downside is that barrierd slows down when some CPUs are idle; that’s something we can fix by including a membarrier fallback, or by sacrificing power consumption and forcing kernel ticks, even for idle cores.

# Preemption can be an asset

When we write lock-free code in userspace, we always have preemption in mind. In fact, the primary reason for lock-free code in userspace is to ensure consistent latency despite potentially adversarial scheduling. We spend so much effort to make our algorithms work despite interrupts and scheduling that we can fail to see how interrupts can help us. Obviously, there’s a cost to making our code preemption-safe, but preemption isn’t an option. Much like garbage collection in managed language, preemption is a feature we can’t turn off. Unlike GC, it’s not obvious how to make use of preemption in lock-free code, but this post shows it’s not impossible.

We can use preemption to get asymmetric barriers, nearly for free, with a daemon like barrierd. I see a duality between preemption-driven barriers and techniques like Bounded TSO: the former are relatively slow, but offer hard bounds, while the latter guarantee liveness, usually with negligible latency, but without any time bound.

I used preemption to make single-writer event counts faster (comparable to a regular non-atomic counter), and to provide a lower-latency alternative to membarrier’s asymmetric barrier. In a similar vein, SPeCK uses time bounds to ensure scalability, at the expense of a bit of latency, by enforcing periodic TLB reloads instead of relying on synchronous shootdowns. What else can we do with interrupts, timer or otherwise?

Thank you Samy, Gabe, and Hanes for discussions on an earlier draft. Thank you Ruchir for improving this final version.

# P.S. event count without non-atomic RMW?

The single-producer event count specialisation relies on non-atomic read-modify-write instructions, which are hard to find outside x86. I think the flag flip pattern in epoch and hazard pointer reclamation shows that’s not the only option.

We need two control words, one for the version counter, and another for the sleepers flag. The version counter is only written by the incrementer, with regular non-atomic instructions, while the flag word is written to by multiple producers, always with atomic instructions.

The challenge is that OS blocking primitives like futex only let us conditionalise the sleep on a single word. We could try to pack a pair of 16-bit shorts in a 32-bit int, but that doesn’t give us a lot of room to avoid wrap-around. Otherwise, we can guarantee that the sleepers flag is only cleared immediately before incrementing the version counter. That suffices to let sleepers only conditionalise on the version counter… but we still need to trigger a wake-up if the sleepers flag was flipped between the last clearing and the increment.

On the increment side, the logic looks like

must_wake = false
if sleepers flag is set:
must_wake = true
clear sleepers flag
increment version
if must_wake or sleepers flag is set:
wake up waiters


and, on the waiter side, we find

if version has changed
return
set sleepers flag
sleep if version has not changed


The separate “sleepers flag” word doubles the space usage, compared to the single flag bit in the x86 single-producer version. Composite OS uses that two-word solution in blockpoints, and the advantages seem to be simplicity and additional flexibility in data layout. I don’t know that we can implement this scheme more efficiently in the single producer case, under other memory models than TSO. If this two-word solution is only useful for non-x86 TSO, that’s essentially SPARC, and I’m not sure that platform still warrants the maintenance burden.

But, we’ll see, maybe we can make the above work on AArch64 or POWER.

1. I actually prefer another, more intuitive, explanation that isn’t backed by official documentation.The store buffer in x86-TSO doesn’t actually exist in silicon: it represents the instructions waiting to be retired in the out-of-order execution engine. Precise interrupts seem to imply that even entering the interrupt handler flushes the OOE engine’s state, and thus acts as a full barrier that flushes the conceptual store buffer.

2. I used raw eBPF instead of the C frontend because that frontend relies on a ridiculous amount of runtime code that parses an ELF file when loading the eBPF snippet to know what eBPF maps to setup and where to backpatch their fd number. I also find there’s little advantage to the C frontend for the scale of eBPF programs (at most 4096 instructions, usually much fewer). I did use clang to generate a starting point, but it’s not that hard to tighten 30 instructions in ways that a compiler can’t without knowing what part of the program’s semantics is essential. The bpf syscall can also populate a string buffer with additional information when loading a program. That’s helpful to know that something was assembled wrong, or to understand why the verifier is rejecting your program.

3. I computed these extreme confidence intervals with my old code to test statistical SLOs.

]]>
<![CDATA[Restartable Sequences With the Polysemic Null Segment Selector]]> 2018-08-25T21:57:06-04:00 https://www.pvk.ca/Blog/2018/08/25/restartable-sequences-with-the-polysemic-null-segment-selector Implementing non-blocking algorithms is one of the few things that are easier to code in a kernel than in userspace (the only other one I can think of is physically wrecking a machine). In the kernel, we only have to worry about designing a protocol that achieves forward progress even if some threads stop participating. We must guarantee the absence of conceptual locks (we can’t have operations that must be paired, or barriers that everyone must reach), but are free to implement the protocol by naïvely spinlocking around individual steps: kernel code can temporarily disable preemption and interrupts to bound a critical section’s execution time.

Getting these non-blocking protocols right is still challenging, but the challenge is one fundamental for reliable systems. The same problems, solutions, and space/functionality trade-offs appear in all distributed systems. Some would even argue that the kind of interfaces that guarantee lock- or wait- freedom are closer to the object oriented ideals.

Of course, there is still a place for clever instruction sequences that avoid internal locks, for code that may be paused anywhere without freezing the whole system: interrupts can’t always be disabled, read operations should avoid writing to shared memory if they can, and a single atomic read-modify-write operation may be faster than locking. The key point for me is that this complexity is opt-in: we can choose to tackle it incrementally, as a performance problem rather than as a prerequisite for correctness.

We don’t have the same luxury in userspace. We can’t start by focusing on the fundamentals of a non-blocking algorithm, and only implement interruptable sequences where it makes sense. Userspace can’t disable preemption, so we must think about the minutiae of interruptable code sequences from the start; non-blocking algorithms in userspace are always in hard mode, where every step of the protocol might be paused at any instruction.

Specifically, the problem with non-blocking code in user space isn’t that threads or processes can be preempted at any point, but rather that the preemption can be observed. It’s a PCLSRing issue! Even Unices guarantee programmers won’t observe a thread in the middle of a syscall: when a thread (process) must be interrupted, any pending syscall either runs to completion, or returns with an error1. What we need is a similar guarantee for steps of our own non-blocking protocols2.

Hardware transactional memory kind of solves the problem (preemption aborts any pending transaction) but is a bit slow3, and needs a fallback mechanism. Other emulation schemes for PCLSRing userspace code divide the problem in two:

1. Determining that another thread was preempted in the middle of a critical section.
2. Guaranteeing that that other thread will not blindly resume executing its critical section (i.e., knowing that the thread knows that we know it’s been interrupted4).

The first part is relatively easy. For per-CPU data, it suffices to observe that we are running on a given CPU (e.g., core #4), and that another thread claims to own the same CPU’s (core #4’s) data. For global locks, we can instead spin for a while before entering a slow path that determines whether the holder has been preempted, by reading scheduling information in /proc.

The second part is harder. I have played with schemes that relied on signals, but was never satisfied: I found Linux perf will rarely, but not never, drop interrupts when I used it to “profile” context switches, and signaling when we determine that the holder has been pre-empted has memory visibility issues for per-CPU data5.

Until earlier this month, the best known solution on mainline Linux involved cross-modifying code! When a CPU executes a memory write instruction, that write is affected by the registers, virtual memory mappings, and the instruction’s bytes. Contemporary operating systems rarely let us halt and tweak another thread’s general purpose registers (Linux won’t let us self-ptrace, nor pause an individual thread). Virtual memory mappings are per-process, and can’t be modified from the outside. The only remaining angle is modifying the premptee’s machine code.

That’s what Facebook’s experimental library Rseq (restartable sequences) actually does.

I’m not happy with that solution either: while it “works,” it requires per-thread clones of each critical section, and makes us deal with cross-modifying code. I’m not comfortable with leaving code pages writable, and we also have to guarantee the pre-emptee’s writes are visible. For me, the only defensible implementation is to modify the code by mmap-ing pages in place, which incurs an IPI per modification. The total system overhead thus scales superlinearly with the number of CPUs.

With Mathieu Desnoyers’s, Paul Turner’s, and Andrew Hunter’s patch to add an rseq syscall to Linux 4.18, we finally have a decent answer. Rather than triggering special code when a thread detects that another thread has been pre-empted in the middle of a critical section, userspace can associate recovery code with the address range for each restartable critical section’s instructions. Whenever the kernel preempts a thread, it detects whether the interruptee is in such a restartable sequence, and, if so, redirects the instruction pointer to the associated recovery code. This essentially means that critical sections must be read-only except for the last instruction in the section, but that’s not too hard to satisfy. It also means that we incur recovery even when no one would have noticed, but the overhead should be marginal (there’s at most one recovery per timeslice), and we get a simpler programming model in return.

Earlier this year, I found another way to prevent critical sections from resuming normal execution after being preempted. It’s a total hack that exercises a state saving defect in Linux/x86-64, but I’m comfortable sharing it now that Rseq is in mainline: if anyone needs the functionality, they can update to 4.18, or backport the feature.

## Here’s a riddle!

With an appropriate setup, the read_value function above will return a different value once the executing thread is switched out. No, the kernel isn’t overwriting read-only data while we’re switched out. When I listed the set of inputs that affect a memory store or load instruction (general purpose registers, virtual memory mappings, and the instruction bytes), I left out one last x86 thing: segment registers.

Effective addresses on x86oids are about as feature rich as it gets: they sum a base address, a shifted index, a constant offset, and, optionally, a segment base. Today, we simply use segment bases to implement thread-local storage (each thread’s FS or GS offset points to its thread-local block), but that usage repurposes memory segmentation, an old 8086 feature… and x86-64 still maintains some backward compatibility with its 16-bit ancestor. There’s a lot of unused complexity there, so it’s plausible that we’ll find information leaks or otherwise flawed architectural state switching by poking around segment registers.

## How to set that up

After learning about this trick to observe interrupts from userland, I decided to do a close reading of Linux’s task switching code on x86-64 and eventually found this interesting comment6.

Observing a value of 0 in the FS or GS registers can mean two things:

1. Userspace explicitly wrote the null segment selector in there, and reset the segment base to 0.
2. The kernel wrote a 0 in there before setting up the segment base directly, with WR{FS,GS}BASE or by writing to a model-specific register (MSR).

Hardware has to efficiently keep track of which is actually in effect. If userspace wrote a 0 in FS or GS, prefixing an instruction with that segment has no impact; if the MSR write is still active (and is non-zero), using that segment must impact effective address computation.

There’s no easy way to do the same in software. Even in ring 0, the only sure-fire way to distinguish between the two cases is to actually read the current segment base value, and that’s slow. Linux instead fast-paths the common case, where the segment register is 0 because the kernel is handling segment bases. It prioritises that use case so much that the code knowingly sacrifices correctness when userspace writes 0 in a segment register after asking the kernel to setup its segment base directly.

This incorrectness is acceptable because it only affects the thread that overwrites its segment register, and no one should go through that sequence of operations. Legacy code can still manipulate segment descriptor tables and address them in segment registers. However, being legacy code, it won’t use the modern syscall that directly manipulates the segment base. Modern code can let the kernel set the segment base without playing with descriptor tables, and has no reason to look at segment registers.

The only way to observe the buggy state saving is to go looking for it, with something like the code below (which uses GS because FS is already taken by glibc to implement thread-local storage).

Running the above on my Linux 4.14/x86-64 machine yields

gcc-6 -std=gnu99 h4x.c && ./a.out Reads: XXYX Re-reads: XYX  The first set of reads shows that: 1. our program starts with no offset in GS (reads == values) 2. explicitly setting GS to 0 does not change that (reads == values) 3. changing the GS base to 1 with arch_prctl does work (reads == values) 4. resetting the GS selector to 0 resets the base (reads == values). The second set of reads shows that: 1. the reset base survives short syscalls (re_reads == values) 2. an actual context switch reverts the GS base to the arch_prctl value (re_reads == values) 3. writing a 0 in GS resets the base again (re_reads == values). ## Cute hack, why is it useful? The property demonstrated in the hack above is that, after our call to arch_prctl, we can write a 0 in GS with a regular instruction to temporarily reset the GS base to 0, and know it will revert to the arch_prctl offset again when the thread resumes execution, after being suspended. We now have to ensure our restartable sequences are no-ops when the GS base is reset to the arch_prctl offset, and that the no-op is detected as such. For example, we could set the arch_prctl offset to something small, like 4 or 8 bytes, and make sure that any address we wish to mutate in a critical section is followed by 4 or 8 bytes of padding that can be detected as such. If a thread is switched out in the middle of a critical section, its GS base will be reset to 4 or 8 when the thread resumes execution; we must guarantee that this offset will make the critical section’s writes fail. If a write is a compare-and-swap, we only have to make sure the padding’s value is unambiguously different from real data: reading the padding instead of the data will make compare-and-swap fail, and the old value will tell us that it failed because we read padding, which should only happen after the section is pre-empted. We can play similar tricks with fetch-and-add (e.g., real data is always even, while the padding is odd), or atomic bitwise operations (steal the sign bit). If we’re willing to eat a signal after a context switch, we can set the arch_prctl offset to something very large, and take a segmentation fault after being re-scheduled. Another option is to set the arch_prctl offset to 1, and use a double-wide compare-and-swap (CMPXCHG16B), or turn on the AC (alignment check) bit in EFLAGS. After a context switch, our destination address will be misaligned, which will trigger a SIGBUS that we can handle. The last two options aren’t great, but, if we make sure to regularly write a 0 in GS, signals should be triggered rarely, only when pre-emption happens between the last write to GS and a critical section. They also have the advantages of avoiding the need for padding, and making it trivial to detect when a restartable section was interrupted. Detection is crucial because it often isn’t safe to assume an operation failed when it succeeded (e.g., unwittingly succeeding at popping from a memory allocator’s freelist would leak memory). When a GS-prefixed instruction fails, we must be able to tell from the instruction’s result, and nothing else. We can’t just check if the segment base is still what we expect, after the fact: our thread could have been preempted right after the special GS-prefixed instruction, before our check. Once we have restartable sections, we can use them to implement per-CPU data structures (instead of per-thread), or to let thread acquire locks and hold them until they are preempted: with restartable sections that only write if there was no preemption between the lock acquisition and the final store instruction, we can create a revocable lock abstraction and implement wait-free coöperation or flat-combining. Unfortunately, our restartable sections will always be hard to debug: observing a thread’s state in a regular debugger like GDB will reset the GS base and abort the section. That’s not unique to the segment hack approach. Hardware transactional memory will abort critical sections when debugged, and there’s similar behaviour with the official rseq syscall. It’s hard enough to PCLSR userspace code; it would be even harder to PCLSR-except-when-the-interruption-is-for-debugging. ## Who’s to blame? The null GS hack sounds like it only works because of a pile of questionable design decisions. However, if we look at the historical context, I’d say everything made sense. Intel came up with segmentation back when 16 bit pointers were big, but 64KB of RAM not quite capacious enough. They didn’t have 32 bit (never mind 64 bit) addresses in mind, nor threads; they only wanted to address 1 MB of RAM with their puny registers. When thread libraries abused segments to implement thread-local storage, the only other options were to over-align the stack and hide information there, or to steal a register. Neither sounds great, especially with x86’s six-and-a-half general purpose registers. Finally, when AMD decided to rip out segmentation, but keep FS and GS, they needed to make porting x86 code as easy as possible, since that was the whole value proposition for AMD64 over Itanium. I guess that’s what systems programming is about. We take our tools, get comfortable with their imperfections, and use that knowledge to build new tools by breaking the ones we already have (#Mickens). Thank you Andrew for a fun conversation that showed the segment hack might be of interest to someone else, and to Gabe for snarkily reminding us Rseq is another Linux/Silicon Valley re-invention. 1. That’s not as nice as rewinding the PC to just before the syscall, with a fixed up state that will resume the operation, but is simpler to implement, and usually good enough. Classic worst is better (Unix semantics are also safer with concurrency, but that could have been opt-in…). 2. That’s not a new observation, and SUN heads like to point to prior art like Dice’s and Garthwaite’s Mostly Lock-Free Malloc, Garthwaite’s, Dice’s, and White’s work on Preemption notification for per-CPU buffers, or Harris’s and Fraser’s Revocable locks. Linux sometimes has to reinvent everything with its special flavour. 3. For instance, SuperMalloc optimistically uses TSX to access per-CPU caches, but TSX is slow enough that SuperMalloc first tries to use a per-thread cache. Dice and Harris explored the use of hardware transactional lock elision solely to abort on context switches; they maintained high system throughput under contention by trying the transaction once before falling back to a regular lock. 4. I did not expect systems programming to get near multi-agent epistemic logic ;) 5. Which is fixable with LOCKed instructions, but that defeats the purpose of per-CPU data. 6. I actually found the logic bug before the Spectre/Meltdown fire drill and was worried the hole would be plugged. This one survived the purge. fingers crossed ]]> <![CDATA[The Confidence Sequence Method: A Computer-age Test for Statistical SLOs]]> 2018-07-06T18:02:40-04:00 https://www.pvk.ca/Blog/2018/07/06/testing-slo-type-properties-with-the-confidence-sequence-method This post goes over some code that I pushed to github today. All the snippets below should be in the repo, which also includes C and Python code with the same structure. I recently resumed thinking about balls and bins for hash tables. This time, I’m looking at large bins (on the order of one 2MB huge page). There are many hashing methods with solid worst-case guarantees that unfortunately query multiple uncorrelated locations; I feel like we could automatically adapt them to modern hierarchical storage (or address translation) to make them more efficient, for a small loss in density. In theory, large enough bins can be allocated statically with a minimal waste of space. I wanted some actual non-asymptotic numbers, so I ran numerical experiments and got the following distribution of global utilisation (fill rate) when the first bin fills up. It looks like, even with one thousand bins of thirty thousand values, we can expect almost 98% space utilisation until the first bin saturates. I want something more formal. Could I establish something like a service level objective, “When distributing balls randomly between one thousand bins with individual capacity of thirty thousand balls, we can utilise at least 98% of the total space before a bin fills up, x% of the time?” The natural way to compute the “x%” that makes the proposition true is to first fit a distribution on the observed data, then find out the probability mass for that distribution that lies above 98% fill rate. Fitting distributions takes a lot of judgment, and I’m not sure I trust myself that much. Alternatively, we can observe independent identically distributed fill rates, check if they achieve 98% space utilisation, and bound the success rate for this Bernoulli process. There are some non-trivial questions associated with this approach. 1. How do we know when to stop generating more observations… without fooling ourselves with $$p$$-hacking? 2. How can we generate something like a confidence interval for the success rate? Thankfully, I have been sitting on a software package to compute satisfaction rate for exactly this kind of SLO-type properties, properties of the form “this indicator satisfiesPREDICATE x% of the time,” with arbitrarily bounded false positive rates.

The code takes care of adaptive stopping, generates a credible interval, and spits out a report like this : we see the threshold (0.98), the empirical success rate estimate (0.993 ≫ 0.98), a credible interval for the success rate, and the shape of the probability mass for success rates.

This post shows how to compute credible intervals for the Bernoulli’s success rate, how to implement a dynamic stopping criterion, and how to combine the two while compensating for multiple hypothesis testing. It also gives two examples of converting more general questions to SLO form, and answers them with the same code.

# Credible intervals for the Binomial

If we run the same experiment $$n$$ times, and observe $$a$$ successes ($$b = n - a$$ failures), it’s natural to ask for an estimate of the success rate $$p$$ for the underlying Bernoulli process, assuming the observations are independent and identically distributed.

Intuitively, that estimate should be close to $$a / n$$, the empirical success rate, but that’s not enough. I also want something that reflects the uncertainty associated with small $$n$$, much like in the following ridge line plot, where different phrases are assigned not only a different average probability, but also a different spread. I’m looking for an interval of plausible success rates $$p$$ that responds to both the empirical success rate $$a / n$$ and the sample size $$n$$; that interval should be centered around $$a / n$$, be wide when $$n$$ is small, and become gradually tighter as $$n$$ increases.

The Bayesian approach is straightforward, if we’re willing to shut up and calculate. Once we fix the underlying success rate $$p = \hat{p}$$, the conditional probability of observing $$a$$ successes and $$b$$ failures is

$P((a, b) | p = \hat{p}) \sim \hat{p}\sp{a} \cdot (1 - \hat{p})\sp{b},$

where the right-hand side is a proportion1, rather than a probability.

We can now apply Bayes’s theorem to invert the condition and the event. The inversion will give us the conditional probability that $$p = \hat{p}$$, given that we observed $$a$$ successes and $$b$$ successes. We only need to impose a prior distribution on the underlying rate $$p$$. For simplicity, I’ll go with the uniform $$U[0, 1]$$, i.e., every success rate is equally plausible, at first. We find

$P(p = \hat{p} | (a, b)) = \frac{P((a, b) | p = \hat{p}) P(p = \hat{p})}{P(a, b)}.$

We already picked the uniform prior, $$P(p = \hat{p}) = 1\,\forall \hat{p}\in [0,1],$$ and the denominator is a constant with respect to $$\hat{p}$$. The expression simplifies to

$P(p = \hat{p} | (a, b)) \sim \hat{p}\sp{a} \cdot (1 - \hat{p})\sp{b},$

or, if we normalise to obtain a probability,

$P(p = \hat{p} | (a, b)) = \frac{\hat{p}\sp{a} \cdot (1 - \hat{p})\sp{b}}{\int\sb{0}\sp{1} \hat{p}\sp{a} \cdot (1 - \hat{p})\sp{b}\, d\hat{p}} = \textrm{Beta}(a+1, b+1).$

A bit of calculation, and we find that our credibility estimate for the underlying success rate follows a Beta distribution. If one is really into statistics, they can observe that the uniform prior distribution is just the $$\textrm{Beta}(1, 1)$$ distribution, and rederive that the Beta is the conjugate distribution for the Binomial distribution.

For me, it suffices to observe that the distribution $$\textrm{Beta}(a+1, b+1)$$ is unimodal, does peak around $$a / (a + b)$$, and becomes tighter as the number of observations grows. In the following image, I plotted three Beta distributions, all with empirical success rate 0.9; red corresponds to $$n = 10$$ ($$a = 9$$, $$b = 1$$, $$\textrm{Beta}(10, 2)$$), black to $$n = 100$$ ($$\textrm{Beta}(91, 11)$$), and blue to $$n = 1000$$ ($$\textrm{Beta}(901, 101)$$). We calculated, and we got something that matches my intuition. Before trying to understand what it means, let’s take a detour to simply plot points from that un-normalised proportion function $$\hat{p}\sp{a} \cdot (1 - \hat{p})\sp{b}$$, on an arbitrary $$y$$ axis.

Let $$\hat{p} = 0.4$$, $$a = 901$$, $$b = 101$$. Naïvely entering the expression at the REPL yields nothing useful.

CL-USER> (* (expt 0.4d0 901) (expt (- 1 0.4d0) 101))
0.0d0


The issue here is that the un-normalised proportion is so small that it underflows double floats and becomes a round zero. We can guess that the normalisation factor $$\frac{1}{\mathrm{Beta}(\cdot,\cdot)}$$ quickly grows very large, which will bring its own set of issues when we do care about the normalised probability.

How can we renormalise a set of points without underflow? The usual trick to handle extremely small or large magnitudes is to work in the log domain. Rather than computing $$\hat{p}\sp{a} \cdot (1 - \hat{p})\sp{b}$$, we shall compute

$\log\left[\hat{p}\sp{a} \cdot (1 - \hat{p})\sp{b}\right] = a \log\hat{p} + b \log (1 - \hat{p}).$

CL-USER> (+ (* 901 (log 0.4d0)) (* 101 (log (- 1 0.4d0))))
-877.1713374189787d0
CL-USER> (exp *)
0.0d0


That’s somewhat better: the log-domain value is not $$-\infty$$, but converting it back to a regular value still gives us 0.

The $$\log$$ function is monotonic, so we can find the maximum proportion value for a set of points, and divide everything by that maximum value to get plottable points. There’s one last thing that should change: when $$x$$ is small, $$1 - x$$ will round most of $$x$$ away. Instead of (log (- 1 x)), we should use (log1p (- x)) to compute $$\log (1 + -x) = \log (1 - x)$$. Common Lisp did not standardise log1p, but SBCL does have it in internals, as a wrapper around libm. We’ll just abuse that for now.

CL-USER> (defun proportion (x) (+ (* 901 (log x)) (* 101 (sb-kernel:%log1p (- x)))))
PROPORTION
CL-USER> (defparameter *points* (loop for i from 1 upto 19 collect (/ i 20d0)))
*POINTS*
CL-USER> (reduce #'max *points* :key #'proportion)
-327.4909190001001d0


We have to normalise in the log domain, which is simply a subtraction: $$\log(x / y) = \log x - \log y$$. In the case above, we will subtract $$-327.49\ldots$$, or add a massive $$327.49\ldots$$ to each log proportion (i.e., multiply by $$10\sp{142}$$). The resulting values should have a reasonably non-zero range.

CL-USER> (mapcar (lambda (x) (cons x (exp (- (proportion x) *)))) *points*)
((0.05d0 . 0.0d0)
(0.1d0 . 0.0d0)
[...]
(0.35d0 . 3.443943164733533d-288)
[...]
(0.8d0 . 2.0682681158181894d-16)
(0.85d0 . 2.6252352579425913d-5)
(0.9d0 . 1.0d0)
(0.95d0 . 5.65506756824607d-10))


There’s finally some signal in there. This is still just an un-normalised proportion function, not a probability density function, but that’s already useful to show the general shape of the density function, something like the following, for $$\mathrm{Beta}(901, 101)$$. Finally, we have a probability density function for the Bayesian update of our belief about the success rate after $$n$$ observations of a Bernoulli process, and we know how to compute its proportion function. Until now, I’ve carefully avoided the question of what all these computations even mean. No more (:

The Bayesian view assumes that the underlying success rate (the value we’re trying to estimate) is unknown, but sampled from some distribution. In our case, we assumed a uniform distribution, i.e., that every success rate is a priori equally likely. We then observe $$n$$ outcomes (successes or failures), and assign an updated probability to each success rate. It’s like a many-world interpretation in which we assume we live in one of a set of worlds, each with a success rate sampled from the uniform distribution; after observing 900 successes and 100 failures, we’re more likely to be in a world where the success rate is 0.9 than in one where it’s 0.2. With Bayes’s theorem to formalise the update, we assign posterior probabilities to each potential success rate value.

We can compute an equal-tailed credible interval from that $$\mathrm{Beta}(a+1,b+1)$$ posterior distribution by excluding the left-most values, $$[0, l)$$, such that the Beta CDF (cumulative distribution function) at $$l$$ is $$\varepsilon / 2$$, and doing the same with the right most values to cut away $$\varepsilon / 2$$ of the probability density. The CDF for $$\mathrm{Beta}(a+1,b+1)$$ at $$x$$ is the incomplete beta function, $$I\sb{x}(a+1,b+1)$$. That function is really hard to compute (this technical report detailing Algorithm 708 deploys five different evaluation strategies), so I’ll address that later.

The more orthodox “frequentist” approach to confidence intervals treats the whole experiment, from data colleaction to analysis (to publication, independent of the observations 😉) as an Atlantic City algorithm: if we allow a false positive rate of $$\varepsilon$$ (e.g., $$\varepsilon=5\%$$), the experiment must return a confidence interval that includes the actual success rate (population statistic or parameter, in general) with probability $$1 - \varepsilon$$, for any actual success rate (or underlying population statistic / parameter). When the procedure fails, with probability at most $$\varepsilon$$, it is allowed to fail in an arbitrary manner.

The same Atlantic City logic applies to $$p$$-values. An experiment (data collection and analysis) that accepts when the $$p$$-value is at most $$0.05$$ is an Atlantic City algorithm that returns a correct result (including “don’t know”) with probability at least $$0.95$$, and is otherwise allowed to yield any result with probability at most $$0.05$$. The $$p$$-value associated with a conclusion, e.g., “success rate is more than 0.8” (the confidence level associated with an interval) means something like “I’m pretty sure that the success rate is more than 0.8, because the odds of observing our data if that were false are small (less than 0.05).” If we set that threshold (of 0.05, in the example) ahead of time, we get an Atlantic City algorithm to determine if “the success rate is more than 0.8” with failure probability 0.05. (In practice, reporting is censored in all sorts of ways, so…)

There are ways to recover a classical confidence interval, given $$n$$ observations from a Bernoulli. However, they’re pretty convoluted, and, as Jaynes argues in his note on confidence intervals, the classical approach gives values that are roughly the same2 as the Bayesian approach… so I’ll just use the Bayesian credibility interval instead.

See this stackexchange post for a lot more details.

# Dynamic stopping for Binomial testing

The way statistics are usually deployed is that someone collects a data set, as rich as is practical, and squeezes that static data set dry for significant results. That’s exactly the setting for the credible interval computation I sketched in the previous section.

When studying the properties of computer programs or systems, we can usually generate additional data on demand, given more time. The problem is knowing when it’s ok to stop wasting computer time, because we have enough data… and how to determine that without running into multiple hypothesis testing issues (ask anyone who’s run A/B tests).

Here’s an example of an intuitive but completely broken dynamic stopping criterion. Let’s say we’re trying to find out if the success rate is less than or greater than 90%, and are willing to be wrong 5% of the time. We could get $$k$$ data points, run a statistical test on those data points, and stop if the data let us conclude with 95% confidence that the underlying success rate differs from 90%. Otherwise, collect $$2k$$ fresh points, run the same test; collect $$4k, \ldots, 2\sp{i}k$$ points. Eventually, we’ll have enough data.

The issue is that each time we execute the statistical test that determines if we should stop, we run a 5% risk of being totally wrong. For an extreme example, if the success rate is exactly 90%, we will eventually stop, with probability 1. When we do stop, we’ll inevitably conclude that the success rate differs from 90%, and we will be wrong. The worst-case (over all underlying success rates) false positive rate is 100%, not 5%!

In my experience, programmers tend to sidestep the question by wasting CPU time with a large, fixed, number of iterations… people are then less likely to run our statistical tests, since they’re so slow, and everyone loses (the other popular option is to impose a reasonable CPU budget, with error thresholds so lax we end up with a smoke test).

Robbins, in Statistical Methods Related to the Law of the Iterated Logarithm, introduces a criterion that, given a threshold success rate $$p$$ and a sequence of (infinitely many!) observations from the same Bernoulli with unknown success rate parameter, will be satisfied infinitely often when $$p$$ differs from the Bernoulli’s success rate. Crucially, Robbins also bounds the false positive rate, the probability that the criterion be satisfied even once in the infinite sequence of observations if the Bernoulli’s unknown success rate is exactly equal to $$p$$. That criterion is

${n \choose a} p\sp{a} (1-p)\sp{n-a} \leq \frac{\varepsilon}{n+1},$

where $$n$$ is the number of observations, $$a$$ the number of successes, $$p$$ the threshold success rate, and $$\varepsilon$$ the error (false positive) rate. As the number of observation grows, the criterion becomes more and more stringent to maintain a bounded false positive rate over the whole infinite sequence of observations.

There are similar “Confidence Sequence” results for other distributions (see, for example, this paper of Lai), but we only care about the Binomial here.

More recently, Ding, Gandy, and Hahn showed that Robbins’s criterion also guarantees that, when it is satisfied, the empirical success rate ($$a/n$$) lies on the correct side of the threshold $$p$$ (same side as the actual unknown success rate) with probability $$1-\varepsilon$$. This result leads them to propose the use of Robbins’s criterion to stop Monte Carlo statistical tests, which they refer to as the Confidence Sequence Method (CSM).

(defun csm-stop-p (successes failures threshold eps)
"Pseudocode, this will not work on a real machine."
(let ((n (+ successes failures)))
(<= (* (choose n successes)
(expt threshold successes)
(expt (- 1 threshold) failures))
(/ eps (1+ n)))))


We may call this predicate at any time with more independent and identically distributed results, and stop as soon as it returns true.

The CSM is simple (it’s all in Robbins’s criterion), but still provides good guarantees. The downside is that it is conservative when we have a limit on the number of observations: the method “hedges” against the possibility of having a false positive in the infinite number of observations after the limit, observations we will never make. For computer-generated data sets, I think having a principled limit is pretty good; it’s not ideal to ask for more data than strictly necessary, but not a blocker either.

In practice, there are still real obstacles to implementing the CSM on computers with finite precision (floating point) arithmetic, especially since I want to preserve the method’s theoretical guarantees (i.e., make sure rounding is one-sided to overestimate the left-hand side of the inequality).

If we implement the expression well, the effect of rounding on correctness should be less than marginal. However, I don’t want to be stuck wondering if my bad results are due to known approximation errors in the method, rather than errors in the code. Moreover, if we do have a tight expression with little rounding errors, adjusting it to make the errors one-sided should have almost no impact. That seems like a good trade-off to me, especially if I’m going to use the CSM semi-automatically, in continuous integration scripts, for example.

One look at csm-stop-p shows we’ll have the same problem we had with the proportion function for the Beta distribution: we’re multiplying very small and very large values. We’ll apply the same fix: work in the log domain and exploit $$\log$$’s monotonicity.

${n \choose a} p\sp{a} (1-p)\sp{n-a} \leq \frac{\varepsilon}{n+1}$

becomes

$\log {n \choose a} + a \log p + (n-a)\log (1-p) \leq \log\varepsilon -\log(n+1),$

or, after some more expansions, and with $$b = n - a$$,

$\log n! - \log a! - \log b! + a \log p + b \log(1 - p) + \log(n+1) \leq \log\varepsilon.$

The new obstacle is computing the factorial $$x!$$, or the log-factorial $$\log x!$$. We shouldn’t compute the factorial iteratively: otherwise, we could spend more time in the stopping criterion than in the data generation subroutine. Robbins has another useful result for us:

$\sqrt{2\pi} n\sp{n + ½} \exp(-n) \exp\left(\frac{1}{12n+1}\right) < n! < \sqrt{2\pi} n\sp{n + ½} \exp(-n) \exp\left(\frac{1}{12n}\right),$

or, in the log domain,

$\log\sqrt{2\pi} + \left(n + \frac{1}{2}\right)\log n -n + \frac{1}{12n+1} < \log n! < \log\sqrt{2\pi} + \left(n + \frac{1}{2}\right)\log n -n +\frac{1}{12n}.$

This double inequality gives us a way to over-approximate $$\log {n \choose a} = \log \frac{n!}{a! b!} = \log n! - \log a! - \log b!,$$ where $$b = n - a$$:

$\log {n \choose a} < -\log\sqrt{2\pi} + \left(n + \frac{1}{2}\right)\log n -n +\frac{1}{12n} - \left(a + \frac{1}{2}\right)\log a +a - \frac{1}{12a+1} - \left(b + \frac{1}{2}\right)\log b +b - \frac{1}{12b+1},$

where the right-most expression in Robbins’s double inequality replaces $$\log n!$$, which must be over-approximated, and the left-most $$\log a!$$ and $$\log b!$$, which must be under-approximated.

Robbins’s approximation works well for us because, it is one-sided, and guarantees that the (relative) error in $$n!$$, $$\frac{\exp\left(\frac{1}{12n}\right) - \exp\left(\frac{1}{12n+1}\right)}{n!},$$ is small, even for small values like $$n = 5$$ (error $$< 0.0023\%$$), and decreases with $$n$$: as we perform more trials, the approximation is increasingly accurate, thus less likely to spuriously prevent us from stopping.

Now that we have a conservative approximation of Robbins’s criterion that only needs the four arithmetic operations and logarithms (and log1p), we can implement it on a real computer. The only challenge left is regular floating point arithmetic stuff: if rounding must occur, we must make sure it is in a safe (conservative) direction for our predicate.

Hardware usually lets us manipulate the rounding mode to force floating point arithmetic operations to round up or down, instead of the usual round to even. However, that tends to be slow, so most language (implementations) don’t support changing the rounding mode, or do so badly… which leaves us in a multi-decade hardware/software co-evolution Catch-22.

I could think hard and derive tight bounds on the round-off error, but I’d rather apply a bit of brute force. IEEE-754 compliant implementations must round the four basic operations correctly. This means that $$z = x \oplus y$$ is at most half a ULP away from $$x + y,$$ and thus either $$z = x \oplus y \geq x + y,$$ or the next floating point value after $$z,$$ $$z^\prime \geq x + y$$. We can find this “next value” portably in Common Lisp, with decode-float/scale-float, and some hand-waving for denormals.

(defun next (x &optional (delta 1))
"Increment x by delta ULPs. Very conservative for
small (0/denormalised) values."
(declare (type double-float x)
(type unsigned-byte delta))
(let* ((exponent (nth-value 1 (decode-float x)))
(ulp (max (scale-float double-float-epsilon exponent)
least-positive-normalized-double-float)))
(+ x (* delta ulp))))


I prefer to manipulate IEEE-754 bits directly. That’s theoretically not portable, but the platforms I care about make sure we can treat floats as sign-magnitude integers.

CL-USER> (double-float-bits pi)
4614256656552045848
CL-USER> (double-float-bits (- pi))
-4614256656552045849


The two’s complement value for pi is one less than (- (double-float-bits pi)) because two’s complement does not support signed zeros.

CL-USER> (eql 0 (- 0))
T
CL-USER> (eql 0d0 (- 0d0))
NIL
CL-USER> (double-float-bits 0d0)
0
CL-USER> (double-float-bits -0d0)
-1


We can quickly check that the round trip from float to integer and back is an identity.

CL-USER> (eql pi (bits-double-float (double-float-bits pi)))
T
CL-USER> (eql (- pi) (bits-double-float (double-float-bits (- pi))))
T
CL-USER> (eql 0d0 (bits-double-float (double-float-bits 0d0)))
T
CL-USER> (eql -0d0 (bits-double-float (double-float-bits -0d0)))
T


We can also check that incrementing or decrementing the integer representation does increase or decrease the floating point value.

CL-USER> (< (bits-double-float (1- (double-float-bits pi))) pi)
T
CL-USER> (< (bits-double-float (1- (double-float-bits (- pi)))) (- pi))
T
CL-USER> (bits-double-float (1- (double-float-bits 0d0)))
-0.0d0
CL-USER> (bits-double-float (1+ (double-float-bits -0d0)))
0.0d0
CL-USER> (bits-double-float (1+ (double-float-bits 0d0)))
4.9406564584124654d-324
CL-USER> (bits-double-float (1- (double-float-bits -0d0)))
-4.9406564584124654d-324


The code doesn’t handle special values like infinities or NaNs, but that’s out of scope for the CSM criterion anyway. That’s all we need to nudge the result of the four operations to guarantee an over- or under- approximation of the real value. We can also look at the documentation for our libm (e.g., for GNU libm) to find error bounds on functions like log; GNU claims their log is never off by more than 3 ULP. We can round up to the fourth next floating point value to obtain a conservative upper bound on $$\log x$$.

I could go ahead and use the building blocks above (ULP nudging for directed rounding) to directly implement Robbins’s criterion,

$\log {n \choose a} + a \log p + b\log (1-p) + \log(n+1) \leq \log\varepsilon,$

with Robbins’s factorial approximation,

$\log {n \choose a} < -\log\sqrt{2\pi} + \left(n + \frac{1}{2}\right)\log n -n +\frac{1}{12n} - \left(a + \frac{1}{2}\right)\log a +a - \frac{1}{12a+1} - \left(b + \frac{1}{2}\right)\log b +b - \frac{1}{12b+1}.$

However, even in the log domain, there’s a lot of cancellation: we’re taking the difference of relatively large numbers to find a small result. It’s possible to avoid that by re-associating some of the terms above, e.g., for $$a$$:

$-\left(a + \frac{1}{2}\right) \log a + a - a \log p = -\frac{\log a}{2} + a (-\log a + 1 - \log p).$

Instead, I’ll just brute force things (again) with Kahan summation. Shewchuk’s presentation in Adaptive Precision Floating-Point Arithmetic and Fast Robust Geometric Predicates highlights how the only step where we may lose precision to rounding is when we add the current compensation term to the new summand. We can implement Kahan summation with directed rounding in only that one place: all the other operations are exact!

We need one last thing to implement $$\log {n \choose a}$$, and then Robbins’s confidence sequence: a safely rounded floating-point value approximation of $$-\log \sqrt{2 \pi}$$. I precomputed one with computable-reals:

CL-USER> (computable-reals:-r
(computable-reals:log-r
(computable-reals:sqrt-r computable-reals:+2pi-r+)))
-0.91893853320467274178...
CL-USER> (computable-reals:ceiling-r
(computable-reals:*r *
(ash 1 53)))
-8277062471433908
-0.65067431749790398594...
CL-USER> (* -8277062471433908 (expt 2d0 -53))
-0.9189385332046727d0
CL-USER> (computable-reals:-r (rational *)
***)
+0.00000000000000007224...


We can safely replace $$-\log\sqrt{2\pi}$$ with -0.9189385332046727d0, or, equivalently, (scale-float -8277062471433908.0d0 -53), for an upper bound. If we wanted a lower bound, we could decrement the integer significand by one.

We can quickly check against an exact implementation with computable-reals and a brute force factorial.

CL-USER> (defun cr-log-choose (n s)
(computable-reals:-r
(computable-reals:log-r (alexandria:factorial n))
(computable-reals:log-r (alexandria:factorial s))
(computable-reals:log-r (alexandria:factorial (- n s)))))
CR-LOG-CHOOSE
CL-USER> (computable-reals:-r (rational (robbins-log-choose 10 5))
(cr-log-choose 10 5))
+0.00050526703375914436...
CL-USER> (computable-reals:-r (rational (robbins-log-choose 1000 500))
(cr-log-choose 1000 500))
+0.00000005551513197557...
CL-USER> (computable-reals:-r (rational (robbins-log-choose 1000 5))
(cr-log-choose 1000 5))
+0.00025125559085509706...


That’s not obviously broken: the error is pretty small, and always positive.

Given a function to over-approximate log-choose, the Confidence Sequence Method’s stopping criterion is straightforward.

The other, much harder, part is computing credible (Bayesian) intervals for the Beta distribution. I won’t go over the code, but the basic strategy is to invert the CDF, a monotonic function, by bisection3, and to assume we’re looking for improbable ($$\mathrm{cdf} < 0.5$$) thresholds. This assumption lets us pick a simple hypergeometric series that is normally useless, but converges well for $$x$$ that correspond to such small cumulative probabilities; when the series converges too slowly, it’s always conservative to assume that $$x$$ is too central (not extreme enough).

That’s all we need to demo the code. Looking at the distribution of fill rates for the 1000 bins @ 30K ball/bin facet in it looks like we almost always hit at least 97.5% global density, let’s say with probability at least 98%. We can ask the CSM to tell us when we have enough data to confirm or disprove that hypothesis, with a 0.1% false positive rate.

Instead of generating more data on demand, I’ll keep things simple and prepopulate a list with new independently observed fill rates.

CL-USER> (defparameter *observations* '(0.978518900
0.984687300
0.983160833
[...]))
CL-USER> (defun test (n)
(let ((count (count-if (lambda (x) (>= x 0.975))
*observations*
:end n)))
(csm:csm n 0.98d0 count (log 0.001d0))))
CL-USER> (test 10)
NIL
2.1958681996231784d0
CL-USER> (test 100)
NIL
2.5948497850893184d0
CL-USER> (test 1000)
NIL
-3.0115331544604658d0
CL-USER> (test 2000)
NIL
-4.190687115879456d0
CL-USER> (test 4000)
T
-17.238559826956475d0


We can also use the inverse Beta CDF to get a 99.9% credible interval. After 4000 trials, we found 3972 successes.

CL-USER> (count-if (lambda (x) (>= x 0.975))
*observations*
:end 4000)
3972


These values give us the following lower and upper bounds on the 99.9% CI.

CL-USER> (csm:beta-icdf 3972 (- 4000 3972) 0.001d0)
0.9882119750976562d0
1.515197753898523d-5
CL-USER> (csm:beta-icdf 3972 (- 4000 3972) 0.001d0 t)
0.9963832682169742d0
2.0372679238045424d-13


And we can even re-use and extend the Beta proportion code from earlier to generate this embeddable SVG report.

There’s one small problem with the sample usage above: if we compute the stopping criterion with a false positive rate of 0.1%, and do the same for each end of the credible interval, our total false positive (error) rate might actually be 0.3%! The next section will address that, and the equally important problem of estimating power.

# Monte Carlo power estimation

It’s not always practical to generate data forever. For example, we might want to bound the number of iterations we’re willing to waste in an automated testing script. When there is a bound on the sample size, the CSM is still correct, just conservative.

We would then like to know the probability that the CSM will stop successfully when the underlying success rate differs from the threshold rate $$p$$ (alpha in the code). The problem here is that, for any bounded number of iterations, we can come up with an underlying success rate so close to $$p$$ (but still different) that the CSM can’t reliably distinguish between the two.

If we want to be able to guarantee any termination rate, we need two thresholds: the CSM will stop whenever it’s likely that the underlying success rate differs from either of them. The hardest probability to distinguish from both thresholds is close to the midpoint between them.

With two thresholds and the credible interval, we’re running three tests in parallel. I’ll apply a Bonferroni correction, and use $$\varepsilon / 3$$ for each of the two CSM tests, and $$\varepsilon / 6$$ for each end of the CI.

That logic is encapsulated in csm-driver. We only have to pass a success value generator function to the driver. In our case, the generator is itself a call to csm-driver, with fixed thresholds (e.g., 96% and 98%), and a Bernoulli sampler (e.g., return T with probability 97%). We can see if the driver returns successfully and correctly at each invocation of the generator function, with the parameters we would use in production, and recursively compute an estimate for that procedure’s success rate with CSM. The following expression simulates a CSM procedure with thresholds at 96% and 98%, the (usually unknown) underlying success rate in the middle, at 97%, a false positive rate of at most 0.1%, and an iteration limit of ten thousand trials. We pass that simulation’s result to csm-driver, and ask whether the simulation’s success rate differs from 99%, while allowing one in a million false positives.

We find that yes, we can expect the 96%/98%/0.1% false positive/10K iterations setup to succeed more than 99% of the time. The code above is available as csm-power, with a tighter outer false positive rate of 1e-9. If we only allow 1000 iterations, csm-power quickly tells us that, with one CSM success in 100 attempts, we can expect the CSM success rate to be less than 99%.

CL-USER> (csm:csm-power 0.97d0 0.96d0 1000 :alpha-hi 0.98d0 :eps 1d-3 :stream *standard-output*)
1 0.000e+0 1.250e-10 10.000e-1 1.699e+0
10 0.000e+0 0.000e+0 8.660e-1 1.896e+1
20 0.000e+0 0.000e+0 6.511e-1 3.868e+1
30 0.000e+0 0.000e+0 5.099e-1 5.851e+1
40 2.500e-2 5.518e-7 4.659e-1 7.479e+1
50 2.000e-2 4.425e-7 3.952e-1 9.460e+1
60 1.667e-2 3.694e-7 3.427e-1 1.144e+2
70 1.429e-2 3.170e-7 3.024e-1 1.343e+2
80 1.250e-2 2.776e-7 2.705e-1 1.542e+2
90 1.111e-2 2.469e-7 2.446e-1 1.741e+2
100 1.000e-2 2.223e-7 2.232e-1 1.940e+2
100 iterations, 1 successes (false positive rate < 1.000000e-9)
success rate p ~ 1.000000e-2
confidence interval [2.223495e-7, 0.223213    ]
p < 0.990000
max inner iteration count: 816

T
T
0.01d0
100
1
2.2234953205868331d-7
0.22321314110840665d0


# SLO-ify all the things with this Exact test

Until now, I’ve only used the Confidence Sequence Method (CSM) for Monte Carlo simulation of phenomena that are naturally seen as boolean success / failures processes. We can apply the same CSM to implement an exact test for null hypothesis testing, with a bit of resampling magic.

Looking back at the balls and bins grid, the average fill rate seems to be slightly worse for 100 bins @ 60K ball/bin, than for 1000 bins @ 128K ball/bin. How can we test that with the CSM? First, we should get a fresh dataset for the two setups we wish to compare.

CL-USER> (defparameter *100-60k* #(0.988110167
0.990352500
0.989940667
0.991670667
[...]))
CL-USER> (defparameter *1000-128k* #(0.991456281
0.991559578
0.990970109
0.990425805
[...]))
CL-USER> (alexandria:mean *100-60k*)
0.9897938
CL-USER> (alexandria:mean *1000-128k*)
0.9909645
CL-USER> (- * **)
0.0011706948


The mean for 1000 bins @ 128K ball/bin is slightly higher than that for 100 bins @ 60k ball/bin. We will now simulate the null hypothesis (in our case, that the distributions for the two setups are identical), and determine how rarely we observe a difference of 0.00117 in means. I only use a null hypothesis where the distributions are identical for simplicity; we could use the same resampling procedure to simulate distributions that, e.g., have identical shapes, but one is shifted right of the other.

In order to simulate our null hypothesis, we want to be as close to the test we performed as possible, with the only difference being that we generate data by reshuffling from our observations.

CL-USER> (defparameter *resampling-data* (concatenate 'simple-vector *100-60k* *1000-128k*))
*RESAMPLING-DATA*
CL-USER> (length *100-60k*)
10000
CL-USER> (length *1000-128k*)
10000


The two observation vectors have the same size, 10000 values; in general, that’s not always the case, and we must make sure to replicate the sample sizes in the simulation. We’ll generate our simulated observations by shuffling the *resampling-data* vector, and splitting it in two subvectors of ten thousand elements.

CL-USER> (let* ((shuffled (alexandria:shuffle *resampling-data*))
(60k (subseq shuffled 0 10000))
(128k (subseq shuffled 10000)))
(- (alexandria:mean 128k) (alexandria:mean 60k)))
6.2584877e-6


We’ll convert that to a truth value by comparing the difference of simulated means with the difference we observed in our real data, $$0.00117\ldots$$, and declare success when the simulated difference is at least as large as the actual one. This approach gives us a one-sided test; a two-sided test would compare the absolute values of the differences.

CL-USER> (csm:csm-driver
(lambda (_)
(declare (ignore _))
(let* ((shuffled (alexandria:shuffle *resampling-data*))
(60k (subseq shuffled 0 10000))
(128k (subseq shuffled 10000)))
(>= (- (alexandria:mean 128k) (alexandria:mean 60k))
0.0011706948)))
0.005 1d-9 :alpha-hi 0.01 :stream *standard-output*)
1 0.000e+0 7.761e-11 10.000e-1 -2.967e-1
10 0.000e+0 0.000e+0 8.709e-1 -9.977e-1
20 0.000e+0 0.000e+0 6.577e-1 -1.235e+0
30 0.000e+0 0.000e+0 5.163e-1 -1.360e+0
40 0.000e+0 0.000e+0 4.226e-1 -1.438e+0
50 0.000e+0 0.000e+0 3.569e-1 -1.489e+0
60 0.000e+0 0.000e+0 3.086e-1 -1.523e+0
70 0.000e+0 0.000e+0 2.718e-1 -1.546e+0
80 0.000e+0 0.000e+0 2.427e-1 -1.559e+0
90 0.000e+0 0.000e+0 2.192e-1 -1.566e+0
100 0.000e+0 0.000e+0 1.998e-1 -1.568e+0
200 0.000e+0 0.000e+0 1.060e-1 -1.430e+0
300 0.000e+0 0.000e+0 7.207e-2 -1.169e+0
400 0.000e+0 0.000e+0 5.460e-2 -8.572e-1
500 0.000e+0 0.000e+0 4.395e-2 -5.174e-1
600 0.000e+0 0.000e+0 3.677e-2 -1.600e-1
700 0.000e+0 0.000e+0 3.161e-2 2.096e-1
800 0.000e+0 0.000e+0 2.772e-2 5.882e-1
900 0.000e+0 0.000e+0 2.468e-2 9.736e-1
1000 0.000e+0 0.000e+0 2.224e-2 1.364e+0
2000 0.000e+0 0.000e+0 1.119e-2 5.428e+0

NIL
T
0.0d0
2967
0
0.0d0
0.007557510165262294d0


We tried to replicate the difference 2967 times, and did not succeed even once. The CSM stopped us there, and we find a CI for the probability of observing our difference, under the null hypothesis, of [0, 0.007557] (i.e., $$p < 0.01$$). Or, for a graphical summary, . We can also test for a lower $$p$$-value by changing the thresholds and running the simulation more times (around thirty thousand iterations for $$p < 0.001$$).

This experiment lets us conclude that the difference in mean fill rate between 100 bins @ 60K ball/bin and 1000 @ 128K is probably not due to chance: it’s unlikely that we observed an expected difference between data sampled from the same distribution. In other words, “I’m confident that the fill rate for 1000 bins @ 128K ball/bin is greater than for 100 bins @ 60K ball/bins, because it would be highly unlikely to observe a difference in means that extreme if they had the same distribution ($$p < 0.01$$)”.

In general, we can use this exact test when we have two sets of observations, $$X\sb{0}$$ and $$Y\sb{0}$$, and a statistic $$f\sb{0} = f(X\sb{0}, Y\sb{0})$$, where $$f$$ is a pure function (the extension to three or more sets of observations is straightforward).

The test lets us determine the likelihood of observing $$f(X, Y) \geq f\sb{0}$$ (we could also test for $$f(X, Y) \leq f\sb{0}$$), if $$X$$ and $$Y$$ were taken from similar distributions, modulo simple transformations (e.g., $$X$$’s mean is shifted compared to $$Y$$’s, or the latter’s variance is double the former’s).

We answer that question by repeatedly sampling without replacement from $$X\sb{0} \cup Y\sb{0}$$ to generate $$X\sb{i}$$ and $$Y\sb{i}$$, such that $$|X\sb{i}| = |X\sb{0}|$$ and $$|Y\sb{i}| = |Y\sb{0}|$$ (e.g., by shuffling a vector and splitting it in two). We can apply any simple transformation here (e.g., increment every value in $$Y\sb{i}$$ by $$\Delta$$ to shift its mean by $$\Delta$$). Finally, we check if $$f(X\sb{i}, Y\sb{i}) \geq f\sb{0} = f(X\sb{0}, Y\sb{0})$$; if so, we return success for this iteration, otherwise failure.

The loop above is a Bernoulli process that generates independent, identically distributed (assuming the random sampling is correct) truth values, and its success rate is equal to the probability of observing a value for $$f$$ “as extreme” as $$f\sb{0}$$ under the null hypothesis. We use the CSM with false positive rate $$\varepsilon$$ to know when to stop generating more values and compute a credible interval for the probability under the null hypothesis. If that probability is low (less than some predetermined threshold, like $$\alpha = 0.001$$), we infer that the null hypothesis does not hold, and declare that the difference in our sample data points at a real difference in distributions. If we do everything correctly (cough), we will have implemented an Atlantic City procedure that fails with probability $$\alpha + \varepsilon$$.

Personally, I often just set the threshold and the false positive rate unreasonably low and handwave some Bayes.

# That’s all!

I pushed the code above, and much more, to github, in Common Lisp, C, and Python (probably Py3, although 2.7 might work). Hopefully anyone can run with the code and use it to test, not only SLO-type properties, but also answer more general questions, with an exact test. I’d love to have ideas or contributions on the usability front. I have some throw-away code in attic/, which I used to generate the SVG in this post, but it’s not great. I also feel like I can do something to make it easier to stick the logic in shell scripts and continuous testing pipelines.

When I passed around a first draft for this post, many readers that could have used the CSM got stuck on the process of moving from mathematical expressions to computer code; not just how to do it, but, more fundamentally, why we can’t just transliterate Greek to C or CL. I hope this revised post is clearer. Also, I hope it’s clear that the reason I care so much about not introducing false positive via rounding isn’t that I believe they’re likely to make a difference, but simply that I want peace of mind with respect to numerical issues; I really don’t want to be debugging some issue in my tests and have to wonder if it’s all just caused by numerical errors.

The reason I care so much about making sure users can understand what the CSM codes does (and why it does what it does) is that I strongly believe we should minimise dependencies whose inner working we’re unable to (legally) explore. Every abstraction leaks, and leakage is particularly frequent in failure situations. We may not need to understand magic if everything works fine, but, everything breaks eventually, and that’s when expertise is most useful. When shit’s on fire, we must be able to break the abstraction and understand how the magic works, and how it fails.

This post only tests ideal SLO-type properties (and regular null hypothesis tests translated to SLO properties), properties of the form “I claim that this indicator satisfies \$PREDICATE x% of the time, with false positive rate y%” where the indicator’s values are independent and identically distributed.

The last assumption is rarely truly satisfied in practice. I’ve seen an interesting choice, where the service level objective is defined in terms of a sample of production requests, which can replayed, shuffled, etc. to ensure i.i.d.-ness. If the nature of the traffic changes abruptly, the SLO may not be representative of behaviour in production; but, then again, how could the service provider have guessed the change was about to happen? I like this approach because it is amenable to predictive statistical analysis, and incentivises communication between service users and providers, rather than users assuming the service will gracefully handle radically new crap being thrown at it.

Even if we have a representative sample of production, it’s not true that the service level indicators for individual requests are distributed identically. There’s an easy fix for the CSM and our credible intervals: generate i.i.d. sets of requests by resampling (e.g., shuffle the requests sample) and count successes and failures for individual requests, but only test for CSM termination after each resampled set.

On a more general note, I see the Binomial and Exact tests as instances of a general pattern that avoids intuitive functional decompositions that create subproblems that are harder to solve than the original problem. For example, instead of trying to directly determine how frequently the SLI satisfies some threshold, it’s natural to first fit a distribution on the SLI, and then compute percentiles on that distribution. Automatically fitting an arbitrary distribution is hard, especially with the weird outliers computer systems spit out. Reducing to a Bernoulli process before applying statistics is much simpler. Similarly, rather than coming up with analytical distributions in the Exact test, we brute-force the problem by resampling from the empirical data. I have more examples from online control systems… I guess the moral is to be wary of decompositions where internal subcomponents generate intermediate values that are richer in information than the final output.

Thank you Jacob, Ruchir, Barkley, and Joonas for all the editing and restructuring comments.

1. Proportions are unscaled probabilities that don’t have to sum or integrate to 1. Using proportions instead of probabilities tends to make calculations simpler, and we can always get a probability back by rescaling a proportion by the inverse of its integral.

2. Instead of a $$\mathrm{Beta}(a+1, b+1)$$, they tend to bound with a $$\mathrm{Beta}(a, b)$$. The difference is marginal for double-digit $$n$$.

3. I used the bisection method instead of more sophisticated ones with better convergence, like Newton’s method or the derivative-free Secant method, because bisection already adds one bit of precision per iteration, only needs a predicate that returns “too high” or “too low,” and is easily tweaked to be conservative when the predicate declines to return an answer.

]]>