I’ve been responsible for Backtrace.io’s crash analytics database^{1} for a couple of months now. I have focused my recent efforts on improving query times for in-memory grouped aggregations, i.e., the archetypal MapReduce use case where we generate key-value pairs, then fold over the values for each key in some (semi)group. We have a cute cache-efficient data structure for this type of workload; the inner loop simply inserts in a small hash table with Robin Hood linear probing, in order to guarantee entries in the table are ordered by hash value. This ordering lets us easily dump the entries in sorted order, and block the merge loop for an arbitrary number of sorted arrays into a unified, larger, ordered hash table (which we can, again, dump to a sorted array).^{2}
As I updated more operators to use this data structure, I noticed that we were spending a lot of time in its inner loop. In fact, perf showed that the query server as a whole was spending 4% of its CPU time on one instruction in that loop:
2.17  movdqu (%rbx),%xmm0
39.63  lea 0x1(%r8),%r14 # that's 40% of the annotated function
 mov 0x20(%rbx),%rax
0.15  movaps %xmm0,0xa0(%rsp)
The first thing to note is that instruction-level profiling tends to put the blame on the instruction following the one that triggered a sampling interrupt.
It’s not the lea (which computes r14 ← r8 + 1) that’s slow, but the movdqu just before.
So, what is that movdqu loading into xmm0? Maybe it’s just a normal cache miss, something inherent to the workload. I turned on source locations (hit s in perf report), and observed that this instruction was simply copying to the stack an argument that was passed by address.
The source clearly showed that the argument should be hot in cache: the inner loop was essentially
A1. Generate a new keyvalue pair
B1. Mangle that kv pair just a bit to turn it into a hash element
C1. Insert the new hash element
A2.
B2.
C2.
and the movdqu happens in step C, to copy the element that step B just constructed.^{3}
At this point, an important question suggests itself: does it matter? We could simply increase the size of the base case and speed up the rest of the bottomup recursion… eventually, the latency for the random accesses in the initial hash table will dominate the inner loop.
When I look into the performance of these deep inner loops, my goal isn’t only to do the same thing better. The big wins, in my experience, come from the additional design freedom that we get from being able to find new uses for the same code. Improved latency, throughput, or memory footprint really shines when the increased optionality from multiple such improvements compounds and lets us consider a much larger design space for the project as a whole. That’s why I wanted to make sure this hash table insertion loop worked on as wide a set of parameters as possible: because that will give future me the ability to combine versatile tools.
Back to the original question. Why do we spend so many cycles loading data we just wrote to cache?
The answer is in the question and in the title of this post: too little time elapses between the instructions that write data to the cache and the ones that read the same data.^{4} A modern out-of-order machine (e.g., most amd64 chips) can execute multiple instructions at the same time, and will start executing instructions as soon as their operands are ready, even when earlier instructions in program order are still waiting for theirs. Machine code is essentially a messy way to encode a dataflow graph, which means our job as micro-optimisers is, at a high level, to avoid long dependency chains and make the dataflow graph as wide as possible. When that’s too hard, we should distribute as much scheduling slack as possible between nodes in a chain, in order to absorb the knock-on effects of cache misses and other latency spikes. If we fail, the chip will often find itself with no instruction ready to execute; stalling the pipeline like that is like slowing down by a factor of 10.
The initial inner loop simply executes steps A, B, and C in order, where step C depends on the result of step B, and step B on that of step A. In theory, a chip with a wide enough instruction reordering window could pipeline multiple loop iterations. In practice, real hardware can only plan on the order of 100-200 instructions ahead, and that mechanism depends on branches being predicted correctly. We have to explicitly insert slack in our dataflow schedule, and we must distribute it well enough for instruction reordering to see the gaps.
This specific instance is a particularly bad case for contemporary machines:
step B populates the entry with regular (64-bit) register writes,
while step C copies the same bytes with vector reads and writes.
Travis Downs looked into this forwarding scenario and found that no other read-after-write setup behaves this badly, on Intel or AMD.
That’s probably why the movdqu vector load instruction was such an issue. If the compiler had emitted the copy with GPR reads and writes, that might have been enough for the hardware to hide the latency.
However, as Travis points out on Twitter, it’s hard for a compiler to get that right across compilation units.
In any case, our most reliable (more so than passing this large struct by value and hoping the compiler will avoid mismatched instructions) and powerful tool to fix this at the source level is to schedule operations manually.
The dataflow graph for each loop iteration is currently a pure chain:
A1    A2
|     |
v     v
B1    B2
|     |
v     v
C1    C2
How does one add slack to these chains? With bounded queues!
My first fix was to add a oneelement buffer between steps B and C. The inner loop became
A1. Generate a new keyvalue pair
C0. Insert the hash element from the previous iteration
B1. Mangle the kv pair and stash that in the buffer
A2.
C1.
B2.
etc.
which yields a dataflow graph like
A1
|    C0   (insert the element buffered by the previous iteration)
v
B1
|    A2   (generate the next iteration’s pair, independent of B1)
v
C1
...
We’ve introduced slack between steps A and B (step C from the previous iteration now executes between them), and between steps B and C (we shifted the next iteration’s step A between them). The delay between the definition of a value and its use is not so long that the data is likely to be evicted from L1, yet there is more than enough work between them to keep the pipeline busy with useful work while C waits for B’s result, or B for A’s. That was a nice single-digit improvement in query latency for my internal benchmark, just from permuting a loop.
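In Python-flavoured pseudocode, the permutation with a one-element buffer looks like the following sketch. The function and parameter names are hypothetical (not the actual Backtrace code), and in Python this only illustrates the schedule, not the speedup:

```python
# Sketch of the loop permutation with a one-element buffer.  All names
# here are made up for illustration.

def insert_all_naive(pairs, make_element, table_insert):
    # Original order: A (generate) -> B (mangle) -> C (insert), a pure
    # dependency chain within each iteration.
    for kv in pairs:            # step A
        elt = make_element(kv)  # step B, depends on A
        table_insert(elt)       # step C, depends on B

def insert_all_buffered(pairs, make_element, table_insert):
    # Permuted order: A_i, C_{i-1}, B_i.  Step C now consumes the element
    # buffered by the *previous* iteration, adding slack to the chain.
    buffered = None
    for kv in pairs:                  # step A_i
        if buffered is not None:
            table_insert(buffered)    # step C_{i-1}
        buffered = make_element(kv)   # step B_i
    if buffered is not None:
        table_insert(buffered)        # flush the final buffered element
```

Both variants perform exactly the same inserts, in the same order; only the interleaving of the generate/mangle/insert steps changes.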
If a oneelement buffer helps, we should clearly experiment with the buffer size, and that’s where I found a more impactful speedup. Once we have an array of elements to insert in a hash table, we can focus on a bulk insert of maybe 8 or 10 elements: instead of trying to improve the latency for individual writes, we can focus on the throughput for multiple inserts at once. That’s good because throughput is an easier problem than latency. In the current case, passing the whole buffer to the hash table code made it easier to pipeline the insert loop in software: we can compute hashes ahead of time, and accelerate random accesses to the hash table with software prefetching. The profile for the new inner loop is flatter, and the hottest part is as follows
 mov 0x8(%rsp),%rdx
9.91  lea (%r12,%r12,4),%rax
0.64  prefetcht0 (%rdx,%rax,8)
17.04  cmp %rcx,0x28(%rsp)
Again, the blame for a “slow” instruction hits the following instruction, so it’s not the lea (multiplying by 5) or the cmp that are slow; it’s the load from the stack and the prefetch.
The good news is that these instructions do not have any dependents. It’s all prefetching, executed only for its side effects.
Moreover, they come from a block of code that was pipelined in software and executes one full iteration ahead of where its side effects might be useful.
It doesn’t really matter if these instructions are slow: they’re still far from being on the critical path! This last restructuring yielded a 20% speedup on a few slow queries.
I described two tools that I use regularly when optimising code for contemporary hardware. Finding ways to scatter around scheduling slack is always useful, both in software and in real-life planning.^{5} One simple way to do so is to add bounded buffers, and to flush buffers as soon as they fill up (or refill them when they become empty), instead of waiting until the next write to the buffer. However, I think the more powerful transformation is using buffering to expose bulk operations, which tends to open up more opportunities than just doing the same thing in a loop. In the case above, we found a 20% speedup; for someone who visits their Backtrace dashboard a couple of times a day, that can add up to an hour or two by the end of the year.
TL;DR: When a function is hot enough to look into, it’s worth asking why it’s called so often, in order to focus on higher level bulk operations.
And by that, I mean I started working there a couple months ago (: ↩
I think that’s a meaty idea, and am planning a longer post on that data structure and where it fits in the hash/sort join continuum. ↩
Would I have avoided this issue if I had directly passed by value? The resulting code might have been friendlier to store-to-load forwarding than loading a whole 128-bit SSE register, but see the next footnote. ↩
Store-to-load forwarding can help improve the performance of this pattern, when we use forwarding patterns that the hardware supports. However, this mechanism can only decrease the penalty of serial dependencies, e.g., by shaving away some or all of the time it takes to store a result to cache and load it back; even when results can feed directly into dependencies, we still have to wait for inputs to be computed. This is fundamentally a scheduling issue. ↩
Unless you’re writing schedule optimising software and people will look at the result. A final hill climbing pass to make things look artificially tight often makes for an easier sale in that situation. ↩
\[\max_{x \in [0, 1]^n} p’x\] subject to \[w’x \leq b.\]
It has a linear objective, one constraint, and each decision variable is bounded to ensure the optimum exists. Note the key difference from the binary knapsack problem: decision variables are allowed to take any value between 0 and 1. In other words, we can, e.g., stick half of a profitable but large item in the knapsack. That’s why this knapsack problem can be solved in linear time.
Duality also lets us determine the shape of all optimal solutions to this problem. For each item \(i\) with weight \(w_i\) and profit \(p_i\), let its profit ratio be \(r_i = p_i / w_i,\) and let \(\lambda^\star\) be the optimal dual (Lagrange or linear) multiplier associated with the capacity constraint \(w’x \leq b.\) If \(\lambda^\star = 0,\) we simply take all items with a positive profit ratio (\(r_i > 0\)) and a nonnegative weight \(w_i \geq 0.\) Otherwise, every item with a profit ratio \(r_i > \lambda^\star\) will be at its weight upper bound (1 if \(w_i \geq 0\), 0 otherwise), and items with \(r_i < \lambda^\star\) will instead be at their lower bound (0 if \(w_i \leq 0\), and 1 otherwise).
Critical items, items with \(r_i = \lambda^\star,\) will take any value that results in \(w’x = b.\) Given \(\lambda^\star,\) we can derive the sum of weights for noncritical items; divide the remaining capacity for critical items by the total weight of critical items, and let that be the value for every critical item (with the appropriate sign for the weight).
For example, if we have capacity \(b = 10,\) and the sum of weights for noncritical items in the knapsack is \(8,\) we’re left with another two units of capacity to distribute however we want among critical items (they all have the same profit ratio \(r_i = \lambda^\star,\) so it doesn’t matter where that capacity goes). Say critical items with a positive weight have a collective weight of 4; we could then assign a value of \(2 / 4 = 0.5\) to the corresponding decision variables (and 0 for critical items with a nonpositive weight).
We could instead have \(b = 10,\) and the sum of weights for noncritical items in the knapsack \(12\): we must find two units of capacity among critical items (they all cost \(r_i = \lambda^\star\) per unit, so it doesn’t matter which). If critical items with a negative weight have a collective weight of \(3,\) we could assign a value of \(2 / 3 = 0.6\overline{6}\) to the corresponding decision variables, and 0 for critical items with a nonnegative weight.
The last case highlights something important about the knapsack: in general, we can’t assume that the weights or profits are positive. We could have an item with a nonpositive weight and nonnegative profit (that’s always worth taking), an item with positive weight and negative profit (never interesting), or weights and profits of the same sign. The last case is the only one that calls for actual decision making. Classically, items with negative weight and profit are rewritten away, by assuming they’re taken in the knapsack, and replacing them with a decision variable for the complementary decision of removing that item from the knapsack (i.e., removing the additional capacity in order to improve the profit). I’ll try to treat them directly as much as possible, because that reduction can be a significant fraction of solve times in practice.
The characterisation of optimal solutions above makes it easy to directly handle elements with a negative weight: just find the optimal multiplier, compute the contribution of noncritical elements (with decision variables at a bound) to the left-hand side of the capacity constraint, separately sum the negative and positive weights for critical elements, then do a final pass to distribute the remaining capacity to critical elements (and 0-weight / 0-value elements if one wishes).
Finding the optimal multiplier \(\lambda^\star\) is similar to a selection problem: the value is either 0 (the capacity constraint is redundant), or one of the profit ratios \(r_i,\) and, given a multiplier value \(\lambda,\) we can determine if it’s too high or too low in linear time. If the noncritical elements yield a left-hand side such that critical elements can’t add enough capacity (i.e., no solution with the optimal form can be feasible), \(\lambda\) is too low. If the maximum weight of potentially optimal solutions is too low, \(\lambda\) is too high.
We can thus sort the items by profit ratio \(r_i\), compute the total weight corresponding to each ratio with a prefix sum (with a prepass to sum all negative weights), and perform a linear (or binary) search to find the critical profit ratio. Moreover, the status of noncritical items is monotonic as \(\lambda\) grows: if an item with positive weight is taken at \(\lambda_0\), it is also taken for every \(\lambda \leq \lambda_0\), and a negativeweight item that’s taken at \(\lambda_0\) is also taken for every \(\lambda \geq \lambda_0.\) This means we can adapt selection algorithms like Quickselect to solve the continuous knapsack problem in linear time.
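As a reference point, the sort-based version can be sketched as follows. For clarity this assumes strictly positive weights and profits (a simplification of the general mixed-sign setting discussed above), and the names are mine:

```python
def fractional_knapsack(items, b):
    """Solve max p'x s.t. w'x <= b, 0 <= x <= 1, by decreasing profit
    ratio.  Assumes positive weights and profits, a simplification of
    the general case discussed in the text.

    items: list of (profit, weight) pairs.  Returns (objective, x).
    """
    order = sorted(range(len(items)),
                   key=lambda i: items[i][0] / items[i][1],
                   reverse=True)  # decreasing profit ratio r_i = p_i / w_i
    x = [0.0] * len(items)
    remaining = b
    objective = 0.0
    for i in order:
        if remaining <= 0:
            break
        p, w = items[i]
        take = min(1.0, remaining / w)  # the critical item may be fractional
        x[i] = take
        objective += take * p
        remaining -= take * w
    return objective, x
```

Everything before the critical ratio is taken fully, everything after is dropped, and at most one item ends up fractional, matching the characterisation of optimal solutions above.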
I’m looking at large instances, so I would like to run these algorithms in parallel or even distributed on multiple machines, and ideally use GPUs or SIMD extensions. Unfortunately, selection doesn’t parallelise very well: we can run a distributed quickselect where every processor partitions the data in its local RAM, but that still requires a logarithmic number of iterations.
Lazy Select offers a completely different angle for the selection problem. Selecting the \(k\)th smallest element from a list of \(n\) elements is the same as finding the \(k / n\)th quantile^{1} in that list of \(n\) elements. We can use concentration bounds^{2} to estimate quantiles from a sample of, e.g., \(m = n^{3/4}\) elements: the population quantile value is very probably between the \(\left(q - \frac{\log m}{\sqrt{m}}\right)m\)th and \(\left(q + \frac{\log m}{\sqrt{m}}\right)m\)th values of the sample. Moreover, this range very probably includes at most \(\mathcal{O}(n^{3/4})\) elements^{3}, so a second pass suffices to buffer all the elements around the quantile, and find the exact quantile. Even with a much smaller sample size \(m = \sqrt{n},\) we would only need four passes.
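The two-pass idea can be sketched as follows. The sample size and slack terms are illustrative choices (not the exact constants from the literature), and the full sort at the end is a fallback for unlucky samples:

```python
import math
import random

def lazyselect_quantile(values, q, rng=random):
    """Lazy Select-style quantile: sample, bracket the target rank in the
    sample, then collect candidates around the quantile in one more pass.

    Sample size (n^{3/4}) and rank slack are illustrative choices."""
    n = len(values)
    k = max(1, math.ceil(q * n))           # target rank in the population
    m = max(1, int(round(n ** 0.75)))      # sample size
    sample = sorted(rng.choice(values) for _ in range(m))
    slack = math.sqrt(m) * math.log(m + 1)  # rank slack within the sample
    lo_rank = max(0, int(q * m - slack))
    hi_rank = min(m - 1, int(q * m + slack))
    lo, hi = sample[lo_rank], sample[hi_rank]
    # Second pass: count elements below the bracket, gather candidates.
    below = sum(1 for v in values if v < lo)
    candidates = sorted(v for v in values if lo <= v <= hi)
    idx = k - 1 - below
    if 0 <= idx < len(candidates):
        return candidates[idx]
    return sorted(values)[k - 1]  # unlucky sample: fall back to a full sort
```

The expensive full sort only runs when the sampled bracket misses the true quantile, which the concentration bounds make very improbable.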
Unfortunately, we can’t directly use that correspondence between selection and quantile estimation for the continuous knapsack.
I tried to apply a similar idea by sampling the knapsack elements equiprobably, and extrapolating from a solution to the sample. For every \(\lambda,\) we can derive a selection function \(f_\lambda (i) = I[r_i \geq \lambda]w_i\) (invert the condition if the weight is negative), and scale up \(\sum_i f_\lambda(i)\) from the sample to the population. As long as we sample independently of \(f,\) we can reuse the same sample for all \(f_\lambda.\) The difficulty here is that, while the error for Lazy Select scales as a function of \(n,\) the equivalent bounds with variable weights are a function of \(n(\max_i w_i - \min_i w_i)^2.\) That isn’t necessarily practical; scaling with \(\sum_i w_i\) would be more reasonable.
Good news: we can hit that, thanks to linearity.
Let’s assume weights are all integers. Any item with weight \(w_i\) is equivalent to \(w_i\) subitems with unit weight (or \(w_i\) elements with negative unit weight), and the same profit ratio \(r_i\), i.e., profit \(p_i / w_i\). The range of subitem weights is now a constant.
We could sample uniformly from the subitems with a Bernoulli for each subitem, but that’s clearly linear time in the sum of the weights, rather than in the number of elements. If we wish to sample roughly \(m\) elements from a total weight \(W = \sum_i w_i,\) we can instead determine how many subitems (units of weight) to skip before the next sample with a Geometric of success probability \(m / W.\) This also shows us how to lift the integrality constraint on weights: sample the skip length from an Exponential with the same rate \(m / W\)!
That helps, but we could still end up spending much more than constant time on very heavy elements. The trick is to deterministically special-case these elements: stash any element with large weight \(w_i \geq W / m\) to the side, exactly once. By Markov’s inequality,^{4} we know there aren’t too many heavy elements: at most \(m.\)
The heart of the estimation problem can be formalised as follows: given a list of elements \(i \in [n]\) with weight \(w_i \geq 0\), generate a sample of \(m \leq n\) elements ahead of time. After the sample has been generated, we want to accept an arbitrary predicate \(p \in \{0,1\}^n\) and estimate \(\sum_{i\in [n]} p(i) w_i.\)
We just sketched an algorithm for this problem. Let’s see what it looks like in Python. The initial sampling logic has to determine the total weight, and sample items with probability proportional to their weight. Items heavier than the cutoff are not considered in the sample and are instead saved to an auxiliary list.
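The original listing did not survive in this copy of the post; here is a sketch of the sampling pass the paragraph above describes (all names are mine, not necessarily the author’s):

```python
import random

def sample_items(items, m, rng=random):
    """Sample roughly m units of weight from `items` (non-negative
    weights), with Exponential skips of rate m / W, stashing heavy items
    (w_i >= W / m) aside deterministically.

    Returns (total_weight, sample_indices, heavy_indices).  A sketch of
    the approach described in the text, not the original code."""
    total = sum(items)
    if total == 0 or m == 0:
        return total, [], []
    cutoff = total / m
    heavy = [i for i, w in enumerate(items) if w >= cutoff]
    sample = []
    # Walk the weight axis of the light items; hits form a Poisson
    # process of rate m / W, so skips are Exponential(m / W).
    next_hit = rng.expovariate(m / total)
    position = 0.0
    for i, w in enumerate(items):
        if w >= cutoff:
            continue  # heavy items are handled exactly, never sampled
        while next_hit < position + w:
            sample.append(i)  # a hit landed inside this item's span
            next_hit += rng.expovariate(m / total)
        position += w
    return total, sample, heavy
```

An item can appear several times in the sample when multiple hits land in its weight span, which is exactly the subitem interpretation above.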
We can assemble the resulting sample (and list of “large” elements) to compute a lower bound on the weight of items that satisfy any predicate that’s independent of the sampling decisions. The value for large elements is trivial: we have a list of all large elements. We can subtract the weight of all large elements from the total item weight, and determine how much we have to extrapolate up.
And finally, here’s how we can sample from an arbitrary list of items, compute a lower bound on the weight of items that satisfy a predicate, and compare that with the real lower bound.
How do we test that? Far too often, I see tests for randomised algorithms where the success rate is computed over randomly generated inputs. That’s too weak! For example, this approach could lead us to accept that the identity function is a randomised sort function, with success probability \(\frac{1}{n!}.\)
The property we’re looking for is that, for any input, the success rate (with the expectation over the pseudorandom sampling decisions) is as high as requested.
For a given input (list of items and predicate), we can use the Confidence sequence method (CSM) to confirm that the lower bound is valid at least \(1 - \alpha\) of the time.
With a false positive rate of at most one in a million,^{5} we can run automated tests against check_bounds. I’ll use Hypothesis to generate lists of pairs of weights and predicate values:
Bimodal inputs tend to be harder, so we can add a specialised test generator.
Again, we use Hypothesis to generate inputs, and the Confidence sequence method (available in C, Common Lisp, and Python) to check that the lower bound is valid with probability at least \(1 - \alpha\). The CSM tests for this statistical property with power 1 and adjustable error rate (in our case, one in a million): we only provide a generator for success values, and the driver adaptively determines when it makes sense to make a call and stop generating more data, while accounting for multiple hypothesis testing.
TL;DR: the estimation algorithm for individual sampling passes works, and the combination of Hypothesis and Confidence Sequence Method lets us painlessly test for a statistical property.
We can iteratively use this sampling procedure to derive lower and (symmetrically) upper bounds for the optimal Lagrange multiplier \(\lambda^\star,\) and Hoeffding’s inequality lets us control the probability that the lower and upper bounds are valid. Typically, we’d use a tolerance of \(\sqrt{\log(n) / n},\) for an error rate of \(1 / n^2.\) I prefer to simply use something like \(7 / \sqrt{n}:\) the error rate is then less than \(10^{-42},\) orders of magnitude smaller than the probability of hardware failure in any given nanosecond.^{6} We can still check for failure of our Las Vegas algorithm, but if something went wrong, it’s much more likely that we detected a hardware failure than anything else. It’s like running SuperPi to stress test a computer, except the work is useful. 😉
How many sampling passes do we need? Our bounds are in terms of the sum of item weights: if we let our sample size be in \(\Theta(\sqrt{n}),\) the sum of weights \(\sum_i w_i\) for unfathomed items (that may or may not be chosen depending on the exact optimal multiplier \(\lambda^\star\) in the current range) will very probably shrink by a factor of \(\Omega(n^{1/4}).\) The initial sum can, in the worst case, be exponentially larger than the bit length of the input, so even a division by \(n^{1/4}\) isn’t necessarily that great.
I intend to apply this Lazy Linear Knapsack algorithm on subproblems in a more interesting solver, and I know that the sum of weights is bounded by the size of the initial problem, so that’s good enough for me! After a constant (\(\approx 4\)) number of passes, the difference in item weight between the lower and upper bound on \(\lambda^\star\) should also be at most 1. One or two additional passes will get me near optimality (e.g., within \(10^{-4}\)), and the lower bound on \(\lambda^\star\) should thus yield a super-optimal solution that’s infeasible by at most \(10^{-4},\) which is, for my intended usage (again), good enough.
Given an optimal enough \(\lambda^\star,\) we can construct an explicit solution in one pass, plus a simple fixup for critical items. This Lazy Knapsack seems pretty reasonable for parallel or GPU computing: each sampling pass only needs to read the items (i.e., no partitioninglike shuffling) before writing a fraction of the data to a sample buffer, and we only need a constant number of passes (around 6 or 7) in the worst case.
It’s more like a fractional percentile, but you know what I mean: the value such that the distribution function at that point equals \(k / n\). ↩
Binomial bounds offer even stronger confidence intervals when the estimate is close to 0 or 1 (where Hoeffding’s bound would yield a confidence interval that juts outside \([0, 1]\)), but don’t impact worst-case performance. ↩
Thanks to Hoeffding’s inequality, again. ↩
That’s a troll. I think any selfrespecting computer person would rather see it as a sort of pigeonhole argument. ↩
We’re juggling a handful of error rates here. We’re checking whether the success rate for the Lazy Knapsack sampling subroutine is at least as high as \(1 - \alpha,\) as requested in the test parameters, and we’re doing so with another randomised procedure that will give an incorrect conclusion at most once every one million invocations. ↩
This classic Google study found 8% of DIMMs hit at least one error per year; that’s more than one single-bit error every \(10^9\) DIMM-seconds, and they’re mostly hard errors. More recently, Facebook reported that uncorrectable errors affect 0.03% of servers each month; that’s more than one uncorrectable error every \(10^{10}\) server-seconds. If we performed one statistical test every nanosecond, the probability of memory failure alone would still dominate statistical errors by \(10^{20}!\) ↩
Let’s say you have a multiset (bag) of “reals” (floats or rationals), where each value is a sampled observation. It’s easy to augment any implementation of the multiset ADT to also return the sample mean of the values in the multiset in constant time: track the sum of values in the multiset, as they are individually added and removed. This requires one accumulator and a counter for the number of observations in the multiset (i.e., constant space), and adds a constant time overhead to each update.
It’s not as simple when you also need the sample variance of the multiset \(X\), i.e.,
\[\frac{1}{n - 1} \sum\sb{x \in X} (x - \hat{x})\sp{2},\]
where \(n = |X|\) is the sample size and \(\hat{x}\) is the sample mean \(\sum\sb{x\in X} x/n,\) ideally with constant query time, and constant space and update-time overhead.
One could try to apply the textbook equality
\[s\sp{2} = \frac{1}{n(n-1)}\left[n\sum\sb{x\in X} x\sp{2} - \left(\sum\sb{x\in X} x\right)\sp{2}\right].\]
However, as Knuth notes in TAoCP volume 2,
this expression loses a lot of precision to roundoff in floating point:
in extreme cases, the difference might be negative
(and we know the variance is never negative).
More commonly, we’ll lose precision
when the sampled values are clustered around a large mean.
For example, the sample standard deviation of 1e8 and 1e8 - 1 is the same as that of 0 and 1. However, the expression above would evaluate the former to 0.0, even in double precision: while 1e8 is comfortably within range for double floats, its square 1e16 is outside the range where all integers are represented exactly.
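We can check this numerically (a quick demonstration, not from the original post):

```python
import statistics

def textbook_variance(xs):
    # s^2 = [n * sum(x^2) - (sum x)^2] / (n (n - 1)):
    # mathematically correct, numerically prone to cancellation.
    n = len(xs)
    return (n * sum(x * x for x in xs) - sum(xs) ** 2) / (n * (n - 1))

clustered = [1e8, 1e8 - 1]
bad = textbook_variance(clustered)       # cancellation: evaluates to 0.0
good = statistics.variance(clustered)    # exact-arithmetic reference: 0.5
```

The textbook formula subtracts two nearly equal 17-digit quantities and loses every significant digit, while Python’s `statistics.variance` (which computes with exact rationals internally) returns the true sample variance.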
Knuth refers to a better-behaved recurrence by Welford, where a running sample mean is subtracted from each new observation before squaring. John Cook has a C++ implementation of the recurrence that adds observations to a sample variance in constant time.
In Python, this streaming algorithm looks like this.
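The Python listing is missing from this copy of the post; here is a reconstruction of Welford’s streaming recurrence, consistent with the attribute names (`mean`, `var_sum`) and the `observe` method referenced in the text — treat it as a sketch, not the original code:

```python
class StreamingVariance:
    """Welford's streaming mean/variance (reconstruction)."""
    def __init__(self):
        self.n = 0          # number of observations so far
        self.mean = 0.0     # running sample mean
        self.var_sum = 0.0  # running sum of (x - old_mean) * (x - new_mean)

    def observe(self, v):
        self.n += 1
        old_mean = self.mean
        self.mean += (v - old_mean) / self.n
        # Accumulate the centered second moment incrementally.
        self.var_sum += (v - old_mean) * (v - self.mean)

    def get_mean(self):
        return self.mean

    def get_variance(self):
        # Sample variance, with the n - 1 (Bessel) correction.
        return self.var_sum / (self.n - 1) if self.n > 1 else 0.0
```

Because each observation is centered on the running mean before squaring, the intermediate quantities stay on the order of the deviations, not the squares of the raw values.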
That’s all we need for insert-only multisets, but it does not handle removals; if only we had removals, we could always implement updates (replacement) as a removal and an insertion.
Luckily, StreamingVariance.observe looks invertible. It shouldn’t be hard to recover the previous sample mean, given v, and, given the current and previous sample means, we can re-evaluate (v - old_mean) * (v - self.mean) and subtract it from self.var_sum.
Let \(\hat{x}\sp{\prime}\) be the sample mean after observe(v). We can derive the previous sample mean \(\hat{x}\) from \(v\):
\[(n - 1)\hat{x} = n\hat{x}\sp{\prime} - v \Leftrightarrow \hat{x} = \hat{x}\sp{\prime} + \frac{\hat{x}\sp{\prime} - v}{n-1}.\]
This invertibility means that we can undo calls to observe in LIFO order. We can’t handle arbitrary multiset updates, only a stack of observations. That’s still better than nothing.
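This listing is also missing; here is a reconstruction of the observation stack, consistent with the push/pop/get_mean/get_variance interface named in the text (again a sketch, not the original — in particular, taking the popped value as an argument to `pop` is my choice):

```python
class VarianceStack:
    """Welford's streaming variance, plus a pop() that undoes the most
    recent push() by inverting the recurrence (reconstruction)."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.var_sum = 0.0

    def push(self, v):
        self.n += 1
        old_mean = self.mean
        self.mean += (v - old_mean) / self.n
        self.var_sum += (v - old_mean) * (v - self.mean)

    def pop(self, v):
        """Undo push(v); v must be the most recently pushed value."""
        assert self.n > 0
        if self.n == 1:
            self.n, self.mean, self.var_sum = 0, 0.0, 0.0
            return
        # Recover the mean from before v was pushed:
        #   mean_old = mean' + (mean' - v) / (n - 1).
        old_mean = self.mean + (self.mean - v) / (self.n - 1)
        self.var_sum -= (v - old_mean) * (v - self.mean)
        self.mean = old_mean
        self.n -= 1

    def get_mean(self):
        return self.mean

    def get_variance(self):
        return self.var_sum / (self.n - 1) if self.n > 1 else 0.0
```

The pop subtracts exactly the term that the matching push added, so a push immediately followed by a pop restores the previous state bit for bit (modulo the usual floating-point caveats explored below).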
Before going any further, let’s test this.
VarianceStack
The best way to test the VarianceStack is to execute a series of push and pop calls, and compare the results of get_mean and get_variance with batch reference implementations.
I could hardcode calls in unit tests. However, that quickly hits diminishing returns in terms of marginal coverage vs. developer time. Instead, I’ll be lazy, completely skip unit tests, and rely on Hypothesis, in particular its high-level “stateful” testing API.
We’ll keep track of the values pushed and popped off the observation stack in the driver: we must make sure they’re matched in LIFO order, and we need the stack’s contents to compute the reference mean and variance. We’ll also want to compare the results with reference implementations, modulo some numerical noise. Let’s try to be aggressive and bound the number of float values between the reference and the actual results.
This initial driver does not even use the VarianceStack yet. All it does is push values to the reference stack, pop values when the stack has something to pop, and check that the reference implementations match themselves after each call: I want to first shake out any bug in the test harness itself.
Not surprisingly, Hypothesis does find an issue in the reference implementation:
Falsifying example:
state = VarianceStackDriver()
state.push(v=0.0)
state.push(v=2.6815615859885194e+154)
state.teardown()
We get a numerical OverflowError in reference_variance: 2.68...e154 / 2 is slightly greater than sqrt(sys.float_info.max) = 1.3407807929942596e+154, so taking the square of that value errors out instead of returning infinity. Let’s start by clamping the range of the generated floats.
Now that the test harness doesn’t find fault in itself, let’s hook in the VarianceStack, and see what happens when only push calls are generated (i.e., first test only the standard streaming variance algorithm).
This already fails horribly.
Falsifying example:
state = VarianceStackDriver()
state.push(v=1.0)
state.push(v=1.488565707357403e+138)
state.teardown()
F
The reference finds a variance of 5.54e275, which is very much not the streaming computation’s 1.108e276. We can manually check that the reference is wrong: it’s missing the n - 1 correction term in the denominator. We should use this updated reference.
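The corrected listing is missing from this copy; it presumably looks something like the following sketch, a two-pass sample variance with the n - 1 (Bessel) denominator:

```python
def reference_mean(values):
    return sum(values) / len(values) if values else 0.0

def reference_variance(values):
    # Two-pass sample variance, with the n - 1 correction the first
    # version of the reference was missing.
    n = len(values)
    if n < 2:
        return 0.0
    mean = reference_mean(values)
    return sum((x - mean) ** 2 for x in values) / (n - 1)
```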
Let’s now re-enable calls to pop().
And now things fail in new and excitingly numerical ways.
Falsifying example:
state = VarianceStackDriver()
state.push(v=0.0)
state.push(v=0.00014142319560050964)
state.push(v=14188.9609375)
state.pop()
state.teardown()
F
This counterexample fails with the online variance returning 0.0 instead of 1e-8. That’s not unexpected: removing (the square of) a large value from a running sum spells catastrophic cancellation. It’s also not that bad for my use case, where I don’t expect to observe very large values.
Another problem for our test harness is that floats are very dense around 0.0, and I’m OK with small (around 1e-8) absolute error because the inputs and outputs will be single floats. Let’s relax assert_almost_equal, and restrict generated observations to fall in \([-2\sp{12}, 2\sp{12}].\)
With all these tweaks to make sure we generate easy (i.e., interesting) test cases, Hypothesis fails to find a failure after its default time budget.
I’m willing to call that a victory.
We have tested code to undo updates in Welford’s classic streaming variance algorithm.
Unfortunately, inverting pushes away only works for LIFO edits, and we’re looking for arbitrary inserts and removals (and updates) to a multiset of observations.
However, both the mean \(\hat{x} = \sum\sb{x\in X} x/n\) and the centered second moment \(\sum\sb{x\in X}(x - \hat{x})\sp{2}\) are order-independent: they’re just sums over all observations. Disregarding round-off, we’ll find the same mean and second moment regardless of the order in which the observations were pushed in. Thus, whenever we wish to remove an observation from the multiset, we can assume it was the last one added to the estimates, and pop it off.
We think we know how to implement running mean and variance for a multiset of observations. How do we test that with Hypothesis?
The hardest part about testing dictionary (map)-like interfaces is making sure to generate valid identifiers when removing values. As it turns out, Hypothesis has built-in support for this important use case, with its Bundles. We’ll use that to test a dictionary from observation name to observation value, augmented to keep track of the current mean and variance of all values.
Each call to add_entry
will either go to update_entry
if
the key already exists, or add an observation to the dictionary
and streaming estimator. If we have a new key, it is added
to the keys
Bundle; calls to del_entry
and update_entry
draw keys from this Bundle. When we remove an entry, it’s
also consumed from the keys
Bundle.
Hypothesis finds no fault with our new implementation of dictionary-with-variance,
but update
seems like it could be made much faster and more numerically stable,
and I intend to mostly use this data structure for calls to update
.
The key operation for my use case is to update one observation
by replacing its old
value with a new
one.
We can maintain the estimator by popping old
away and pushing new
in,
but this business with updating the number of observations n
and
rescaling everything seems like a lot of numerical trouble.
We should be able to do better.
We’re replacing the multiset of sampled observations \(X\) with \(X\sp{\prime} = X \setminus \{\textrm{old}\} \cup \{\textrm{new}\}.\) It’s easy to maintain the mean after this update: \(\hat{x}\sp{\prime} = \hat{x} + (\textrm{new} - \textrm{old})/n.\)
The update to self.var_sum
, the sum of squared differences from the mean, is trickier.
We start with \(v = \sum\sb{x\in X} (x - \hat{x})\sp{2},\)
and we wish to find \(v\sp{\prime} = \sum\sb{x\sp{\prime}\in X\sp{\prime}} (x\sp{\prime} - \hat{x}\sp{\prime})\sp{2}.\)
Let \(\delta = \textrm{new} - \textrm{old}\) and \(\delta\sb{\hat{x}} = \delta/n.\) We have \[\sum\sb{x\in X} (x - \hat{x}\sp{\prime})\sp{2} = \sum\sb{x\in X} [(x - \hat{x}) - \delta\sb{\hat{x}}]\sp{2},\] and \[[(x - \hat{x}) - \delta\sb{\hat{x}}]\sp{2} = (x - \hat{x})\sp{2} - 2\delta\sb{\hat{x}} (x - \hat{x}) + \delta\sb{\hat{x}}\sp{2}.\]
We can reassociate the sum, and find
\[\sum\sb{x\in X} (x - \hat{x}\sp{\prime})\sp{2} = \sum\sb{x\in X} (x - \hat{x})\sp{2} - 2\delta\sb{\hat{x}} \left(\sum\sb{x \in X} (x - \hat{x})\right) + n \delta\sb{\hat{x}}\sp{2}.\]
Once we notice that \(\hat{x} = \sum\sb{x\in X} x/n,\) it’s clear that the middle term sums to zero, and we find the very reasonable
\[v\sb{\hat{x}\sp{\prime}} = \sum\sb{x\in X} (x - \hat{x})\sp{2} + n \delta\sb{\hat{x}}\sp{2} = v + \delta \delta\sb{\hat{x}}.\]
This new accumulator \(v\sb{\hat{x}\sp{\prime}}\) corresponds to the sum of the
squared differences between the old observations \(X\) and the new mean \(\hat{x}\sp{\prime}\).
We still have to update one observation from old
to new
.
The remaining adjustment to \(v\) (self.var_sum
) corresponds to
going from \((\textrm{old} - \hat{x}\sp{\prime})\sp{2}\)
to \((\textrm{new} - \hat{x}\sp{\prime})\sp{2},\)
where \(\textrm{new} = \textrm{old} + \delta.\)
After a bit of algebra, we get \[(\textrm{new} - \hat{x}\sp{\prime})\sp{2} = [(\textrm{old} - \hat{x}\sp{\prime}) + \delta]\sp{2} = (\textrm{old} - \hat{x}\sp{\prime})\sp{2} + \delta (\textrm{old} - \hat{x}\sp{\prime} + \textrm{new} - \hat{x}\sp{\prime}).\]
The adjusted \(v\sb{\hat{x}\sp{\prime}}\) already includes
\((\textrm{old} - \hat{x}\sp{\prime})\sp{2}\)
in its sum, so we only have to add the last term
to obtain the final updated self.var_sum:
\[v\sp{\prime} = v\sb{\hat{x}\sp{\prime}} + \delta (\textrm{old} - \hat{x}\sp{\prime} + \textrm{new} - \hat{x}\sp{\prime}) = v + \delta [2 (\textrm{old} - \hat{x}) + \delta - \delta\sb{\hat{x}}].\]
That’s our final implementation for VarianceBag.update
,
for which Hypothesis also fails to find failures.
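In code, the constant-time update might look like the following sketch (my reconstruction, not the actual gist; it applies the same adjustment that the Z3 session below checks):

```python
class VarianceBag:
    """Running mean and sum of squared deviations for a multiset."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.var_sum = 0.0

    def push(self, v):
        self.n += 1
        delta = v - self.mean
        self.mean += delta / self.n
        self.var_sum += delta * (v - self.mean)

    def update(self, old, new):
        """Replace one observation `old` with `new` in constant time."""
        delta = new - old
        delta_mean = delta / self.n
        # v' = v + delta * (2 (old - mean) + (delta - delta_mean)),
        # where `mean` is read *before* it is shifted by delta_mean.
        self.var_sum += delta * (2.0 * (old - self.mean) + (delta - delta_mean))
        self.mean += delta_mean
```

The order matters: update must read self.mean before shifting it by delta_mean.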
We have automated propertybased tests and some humanchecked proofs. Ship it?
I was initially going to ask a CAS
to check my reformulations,
but the implicit \(\forall\) looked messy.
Instead, I decided to check the induction hypothesis implicit in
VarianceBag.update
, and enumerate all cases up to a certain number
of values with Z3 in IPython.
In [1]: from z3 import *
In [2]: x, y, z, new_x = Reals("x y z new_x")
In [3]: mean = (x + y + z) / 3
In [4]: var_sum = sum((v - mean) * (v - mean) for v in (x, y, z))
In [5]: delta = new_x - x
In [6]: new_mean = mean + delta / 3
In [7]: delta_mean = delta / 3
In [8]: adjustment = delta * (2 * (x - mean) + (delta - delta_mean))
In [9]: new_var_sum = var_sum + adjustment
# We have our expressions. Let's check equivalence for mean, then var_sum
In [10]: s = Solver()
In [11]: s.push()
In [12]: s.add(new_mean != (new_x + y + z) / 3)
In [13]: s.check()
Out[13]: unsat # No counter example of size 3 for the updated mean
In [14]: s.pop()
In [15]: s.push()
In [16]: s.add(new_mean == (new_x + y + z) / 3) # We know the mean matches
In [17]: s.add(new_var_sum != sum((v - new_mean) * (v - new_mean) for v in (new_x, y, z)))
In [18]: s.check()
Out[18]: unsat # No counter example of size 3 for the updated variance
Given this script, it’s a small matter of programming to generalise
from 3 values (x
, y
, and z
) to any fixed number of values, and
generate all small cases up to, e.g., 10 values.
I find the most important thing when it comes to using automated proofs is to insert errors and confirm we can find the bugs we’re looking for.
I did that by manually mutating the expressions for new_mean
and new_var_sum
in updated_expressions
. This let me find a simple bug in the initial
implementation of test_num_var
: I used if not result
instead of result != unsat
,
and both sat
and unsat
are truthy. The code initially failed to flag a failure
when z3
found a counterexample for our correctness condition!
I have code to augment an arbitrary multiset or dictionary with a running estimate of the mean and variance; that code is based on a classic recurrence, with some new math checked by hand, with automated tests, and with some exhaustive checking of small inputs (to which I claim most bugs can be reduced).
I’m now pretty sure the code works, but there’s another more obviously correct way to solve that update problem. This 2008 report by Philippe Pébay^{1} presents formulas to compute the mean, variance, and arbitrary moments in one pass, and shows how to combine accumulators, a useful operation in parallel computing.
We could use these formulas to augment an arbitrary \(k\)-ary tree and recombine the merged accumulator as we go back up the (search) tree from the modified leaf to the root. The update would be much more stable (we only add and merge observations), and incur logarithmic time overhead (with linear space overhead). However, given the same time budget, and a logarithmic space overhead, we could also implement the constant-time update with arbitrary precision software floats, and probably guarantee even better precision.
The constant-time update I described in this post demanded more effort to convince myself of its correctness, but I think it’s always a better option than an augmented tree for serial code, especially if initial values are available to populate the accumulators with batch-computed mean and variance. I’m pretty sure the code works, and it’s up in this gist. I’ll be reimplementing it in C++ because that’s the language used by the project that led me to this problem; feel free to steal that gist.
There’s also a 2016 journal article by Pébay and others with numerical experiments, but I failed to implement their simpler-looking scalar update… ↩
There’s a lot of work on the expected time complexity of operations on linear probing Robin Hood hash tables. Alfredo Viola, along with a few collaborators, has long been exploring the distribution of displacements (i.e., search times) for random elements. The packed memory array angle has also been around for a while.^{1}
I’m a bit wary of the “random element” aspect of the linear probing bounds: while I’m comfortable with an expectation over the hash function (i.e., over the uniformly distributed hash values), a program could repeatedly ask for the same key, and consistently experience worse-than-expected performance. I’m more interested in bounding the worst-case displacement (the distance between the ideal location for an element, and where it is actually located) across all values in a randomly generated^{2} Robin Hood table, with high enough probability. That probability doesn’t have to be extremely high: \(p = 0.9\) or even \(p = 0.5\) is good enough, as long as we can either rehash with an independent hash function, or the probability of failure drops exponentially enough as the displacement leeway grows.
The people who study hashing with buckets, or hashing as load balancing, seem more interested in these probable worst-case bounds: as soon as one bucket overflows, it’s game over for that hash table! In that context, we wish to determine how much headroom we must reserve in each bucket, on top of the expected occupancy, in order to make sure failures are rare enough. That’s a balls into bins problem, where the \(m\) balls are entries in the hash table, and the \(n\) bins its hash buckets.
Raab and Steger’s Simple and Tight Analysis of the “balls into bins” problem shows that the case where the average occupancy grows with \((\log m)\sp{k}\) and \(k > 1\) has potential, when it comes to worst-case bounds that shrink quickly enough: we only need headroom that grows with \(\sqrt{\log\sp{k+1} m} = \log\sp{(k+1)/2} m\), slightly more than the square root of the average occupancy.
The only issue is that the balls-into-bins analysis is asymptotic, and, more importantly, doesn’t apply at all to linear probing!
One could propose a form of packed memory array, where the sorted set is subdivided in chunks such that the expected load per chunk is in \(\Theta(\log\sp{k} m)\), and the size of each chunk multiplicatively larger (more than \(\log\sp{(k+1)/2}m\))…
Can we instead derive similar bounds with regular linear probing? It turns out that Raab and Steger’s bounds are indeed simple: they find the probability of overflow for one bucket, and derive a union bound for the probability of overflow for any bucket by multiplying the singlebucket failure probability by the number of buckets. Moreover, the singlebucket case itself is a Binomial confidence interval.
We can use the same approach for linear probing; I don’t expect a tight result, but it might be useful.
Let’s say we want to determine how unlikely it is to observe a clump of \(\log\sp{k}n\) entries, where \(n\) is the capacity of the hash table. We can bound the probability of observing such a clump starting at index 0 in the backing array, and multiply by the size of the array for our union bound (the clump could start anywhere in the array).
Given density \(d = m / n\), where \(m\) is the number of entries and \(n\) is the size of the array that backs the hash table, the probability that any given element falls in a range of size \(\log\sp{k}n\) is \(p = \log\sp{k}n / n\). The number of entries in such a range follows a Binomial distribution \(B(dn, p)\), with expected value \(d \log\sp{k}n\). We want to determine the maximum density \(d\) such that \(\mathrm{Pr}[B(dn, p) > \log\sp{k}n] < \frac{\alpha}{dn}\), where \(\alpha\) is our overall failure rate. If rehashing is acceptable, we can let \(\alpha = 0.5\), and expect to find a suitably uniform hash function after half a rehash on average.
We know we want \(k > 1\) for the tail to shrink rapidly enough as \(n\) grows, but even \(\log\sp{2}n\) doesn’t shrink very rapidly. After some trial and error, I settled on a chunk size \(s(n) = 5 \log\sb{2}\sp{3/2} n\). That’s not great for small or medium-sized tables (e.g., \(s(1024) = 158.1\)), but grows slowly, and reflects extreme worst cases; in practice, we can expect the worst case for any table to be more reasonable.
Assuming we have a quantile function for the Binomial distribution, we can find the occupancy of our chunk, at \(q = 1 - \frac{\alpha}{n}\). The occupancy is a monotonic function of the density, so we can use, e.g., bisection search to find the maximum density such that the probability that we saturate our chunk is \(\frac{\alpha}{n}\), and thus the probability that any continuous run of entries has size at least \(s(n) = 5 \log\sb{2}\sp{3/2} n\) is less than \(\alpha\).
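As a sketch of that computation (summing the binomial tail directly instead of assuming a quantile function; the function and variable names are mine):

```python
import math

def binom_tail(m, p, k):
    """P[B(m, p) > k], with each term computed in log space for stability."""
    total = 0.0
    for i in range(k + 1, m + 1):
        log_pmf = (math.lgamma(m + 1) - math.lgamma(i + 1) - math.lgamma(m - i + 1)
                   + i * math.log(p) + (m - i) * math.log1p(-p))
        total += math.exp(log_pmf)
    return total

def max_density(n, alpha=0.5, iters=30):
    """Bisect for the highest density d such that a run of s(n) entries
    starting at any fixed index has probability < alpha / n."""
    s = 5.0 * math.log2(n) ** 1.5
    p = s / n  # probability that one entry hashes into a given chunk
    lo, hi = 0.01, 1.0
    for _ in range(iters):
        d = (lo + hi) / 2.0
        if binom_tail(int(d * n), p, int(s)) < alpha / n:
            lo = d  # rare enough: we can pack the table more densely
        else:
            hi = d
    return lo
```

Bisecting on the density works because the chunk overflow probability only grows as the table gets denser.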
For \(\alpha = 0.5\), the plot of densities looks like the following.
This curve roughly matches the shape of some of my older purely numerical experiments with Robin Hood hashing. When the table is small (less than \(\approx 1000\)), \(\log n\) is a large fraction of \(n\), so the probability of finding a run of size \(s(n) = 5 \log\sb{2}\sp{3/2} n\) is low. When the table is much larger, the asymptotic result kicks in, and the probability slowly shrinks. However, even around the worst case \(n \approx 4500\), we can exceed \(77\%\) density and only observe a run of length \(s(n)\) half the time.
If we really don’t want to rehash, we can let \(\alpha = 10\sp{-10}\), which compresses the curve and shifts it down: the minimum value is now slightly above \(50\%\) density, and we can clearly see the growth in permissible density as the size of the table grows.
In practice, we can dynamically compute the worst-case displacement, which is always less than the longest run (i.e., less than \(s(n) = 5 \log\sb{2}\sp{3/2} n\)). However, having non-asymptotic bounds lets us write size-specialised code and know that its assumptions are likely to be satisfied in real life.
I mentioned at the beginning of this post that we can also manipulate Robin Hood hash tables as sorted sets, where the sort keys are uniformly distributed hash values.
Let’s say we wished to merge the immutable source table S
into the
larger destination table D
inplace, without copying all of D
.
For example, from
S = [2, 3, 4]
D = [1, 6, 7, 9, 10];
we want the merged result
D' = [1, 2, 3, 4, 6, 7, 9, 10].
The issue here is that, even with gaps, we might have to overwrite
elements of D
, and buffer them in some scratch space until we get to
their final position. In this case, all three elements of S
must be
inserted between the first and second elements of D
, so we could
need to buffer D[1:4]
.
How large of a merge buffer should we reasonably plan for?
In general, we might have to buffer as many elements as there are in
the smaller table of S
and D
. However, we’re working with hash
values, so we can expect them to be distributed uniformly. That
should give us some grip on the problem.
We can do even better and only assume that both sorted sets were
sampled from the same underlying distribution. The key idea
is that the rank of an element in S
is equal to the
value of S
’s
empirical distribution function
for that element, multiplied by the size of S
(similarly for
D
).
The amount of buffering we might need is simply a measure of the
worst-case difference between the two empirical DFs: the more S
gets
ahead of D
, the more we need to buffer values of D
before
overwriting them (if we’re very unlucky, we might need a buffer the
same size as S
). That’s the
two-sample Kolmogorov-Smirnov statistic, and we have
simple bounds for that distance.
With probability \(1 - \alpha\), we’ll consume from S
and D
at the same rate \(\pm \sqrt{\frac{-(|S| + |D|) \ln \alpha}{2 |S| |D|}}\).
We can let \(\alpha = 10\sp{-10}\) and preallocate a buffer of size
\[|S| \sqrt{\frac{-(|S| + |D|) \ln \alpha}{2 |S| |D|}} < \sqrt{\frac{23.03 |S| (|S| + |D|)}{2 |D|}}.\]
In the worst case, \(|S| = |D|\), and we can preallocate a buffer of size \(\sqrt{23.03 |D|} < 4.8 \sqrt{|D|}\) and only need to grow the buffer every ten billion (\(\alpha\sp{-1}\)) merges.
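In code, the preallocation size is a one-liner (a sketch; the constant 23.03 is just \(-\ln 10\sp{-10}\)):

```python
import math

def merge_buffer_bound(s_size, d_size, alpha=1e-10):
    """Two-sample KS bound on how many elements of D we may have to
    buffer while merging sorted samples S and D of the same distribution."""
    return s_size * math.sqrt(
        (s_size + d_size) * -math.log(alpha) / (2.0 * s_size * d_size))
```

For \(|S| = |D| = n\), this simplifies to \(\sqrt{23.03 n} < 4.8\sqrt{n}\).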
The same bound applies in a stream processing setting; I assume this is closer to what Frank had in mind when he brought up this question.
Let’s assume a “push” dataflow model, where we still work on sorted sets of uniformly distributed hash values (and the data tuples associated with them), but now in streams that generate values every tick. The buffer size problem now sounds as follows. We wish to implement a sorted merge operator for two input streams that generate one value every tick, and we can’t tell our sources to cease producing values; how much buffer space might we need in order to merge them correctly?
Again, we can go back to the Kolmogorov-Smirnov statistic. In this case however, we could buffer each stream independently, so we’re looking for critical values for the one-sided one-sample Kolmogorov-Smirnov test (how much one stream might get ahead of the hypothetical exactly uniform stream). We have recent (1990) simple and tight bounds for this case as well.
The critical values for the one-sided case are stronger than the two-sided two-sample critical values we used earlier: given an overflow probability of \(\alpha\), we need to buffer at most \(\sqrt{\frac{-n \ln \alpha}{2}}\) elements. For \(\alpha = 10\sp{-20}\) that’s less than \(4.8 \sqrt{n}\).^{3} This square root scaling is pretty good news in practice: shrinking \(n\) to \(\sqrt{n}\) tends to correspond to going down a rung or two in the storage hierarchy. For example, \(10\sp{15}\) elements is clearly in the range of distributed storage; however, such a humongous stream calls for a buffer of fewer than \(1.5 \cdot 10\sp{8}\) elements, which, at a couple gigabytes, should fit in RAM on one large machine. Similarly, \(10\sp{10}\) elements might fill the RAM on one machine, but the corresponding buffer of less than half a million elements could fit in L3 cache, while one million elements could fill the L3, and 4800 elements fit in L1 or L2.
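The corresponding computation for one stream is even simpler (again a sketch, with \(\alpha = 10\sp{-20}\) as in the examples above):

```python
import math

def stream_buffer_bound(n, alpha=1e-20):
    """One-sided KS bound: how far a sorted stream of n i.i.d. values
    can run ahead of its expected uniform consumption rate."""
    return math.sqrt(n * -math.log(alpha) / 2.0)
```
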
What I find neat about this (probabilistic) bound on the buffer size is its independence from the size of the other inputs to the merge operator. We can have a shared \(\Theta(\sqrt{n})\)-size buffer in front of each stream, and do all our operations without worrying about getting stuck (unless we’re extremely unlucky, in which case we can grow the buffer a bit and resume or restart the computation).
Probably of more theoretical interest is the fact that these bounds do not assume a uniform distribution, only that all the input streams are identically and independently sampled from the same underlying distribution. That’s the beauty of working in terms of the (inverse) distribution functions.
That’s it. Two cute tricks that use wellunderstood statistical distributions in hashed data structure and algorithm design. I doubt there’s anything to generalise from either bounding approach.
However, I definitely believe they’re useful in practice. I like knowing that I can expect the maximum displacement for a table of \(n\) elements with Robin Hood linear probing to be less than \(5 \log\sb{2}\sp{3/2} n\), because that lets me select an appropriate option for each table, as a function of that table’s maximum displacement, while knowing the range of displacements I might have to handle. Having a strong bound on how much I might have to buffer for stream join operators feels even more useful: I can preallocate a single buffer per stream and not think about efficiently growing the buffer, or signaling that a consumer is falling behind. The probability that I’ll need a larger buffer is so low that I just need to handle it, however inefficiently. In a replicated system, where each node picks an independent hash function, I would even consider crashing when the buffer is too small!
Unsurprisingly, luck failed to show up, but I had ulterior motives: I’m much more interested in exploring first order methods for relaxations of combinatorial problems than in solving CVRPs. The routes I had accumulated after a couple days turned into a set covering LP with 1.1M decision variables, 10K constraints, and 20M nonzeros. That’s maybe denser than most combinatorial LPs (the aspect ratio is definitely atypical), but 0.2% nonzeros is in the right ballpark.
As soon as I had that fractional set cover instance, I tried to solve it with a simplex solver. Like any good Googler, I used Glop… and stared at a blank terminal for more than one hour.
Having observed that lack of progress, I implemented the toy I really wanted to try out: first order online “learning with experts” (specifically, AdaHedge) applied to LP optimisation. I let this not-particularly-optimised serial CL code run on my 1.6 GHz laptop for 21 hours, at which point the first order method had found a 4.5% infeasible solution (i.e., all the constraints were satisfied with \(\ldots \geq 0.955\) instead of \(\ldots \geq 1\)). I left Glop running long after the contest was over, and finally stopped it with no solution after more than 40 days on my 2.9 GHz E5.
Given the shape of the constraint matrix, I would have loved to try an interior point method, but all my licenses had expired, and I didn’t want to risk OOMing my workstation. Erling Andersen was later kind enough to test Mosek’s interior point solver on it. The runtime was much more reasonable: 10 minutes on 1 core, and 4 on 12 cores, with the sublinear speedup mostly caused by the serial crossover to a simplex basis.
At 21 hours for a naïve implementation, the “learning with experts” first order method isn’t practical yet, but also not obviously uninteresting, so I’ll write it up here.
Using online learning algorithms for the “experts problem” (e.g., Freund and Schapire’s Hedge algorithm) to solve linear programming feasibility is now a classic result; Jeremy Kun has a good explanation on his blog. What’s new here is:
The first item is particularly important to me because it’s a simple modification to the LP feasibility meta-algorithm, and might make the difference between a tool that’s only suitable for theoretical analysis and a practical approach.
I’ll start by reviewing the experts problem, and how LP feasibility is usually reduced to the former problem. After that, I’ll cast the reduction as a surrogate relaxation method, rather than a Lagrangian relaxation; optimisation should flow naturally from that point of view. Finally, I’ll guess why I had more success with AdaHedge this time than with Multiplicative Weight Update eight years ago.^{1}
I first heard about the experts problem while researching dynamic sorted set data structures: Igal Galperin’s PhD dissertation describes scapegoat trees, but is really about online learning with experts. Arora, Hazan, and Kale’s 2012 survey of multiplicative weight update methods is probably a better introduction to the topic ;)
The experts problem comes in many variations. The simplest form sounds like the following. Assume you’re playing a binary prediction game over a predetermined number of turns, and have access to a fixed finite set of experts at each turn. At the beginning of every turn, each expert offers their binary prediction (e.g., yes it will rain today, or it will not rain today). You then have to make a prediction yourself, with no additional input. The actual result (e.g., it didn’t rain today) is revealed at the end of the turn. In general, you can’t expect to be right more often than the best expert at the end of the game. Is there a strategy that bounds the “regret,” how many more wrong predictions you’ll make compared to the expert(s) with the highest number of correct predictions, and in what circumstances?
Amazingly enough, even with an omniscient adversary that has access to your strategy and determines both the experts’ predictions and the actual result at the end of each turn, a stream of random bits (hidden from the adversary) suffices to bound our expected regret in \(\mathcal{O}(\sqrt{T}\,\lg n)\), where \(T\) is the number of turns and \(n\) the number of experts.
I long had trouble with that claim: it just seems too good of a magic trick to be true. The key realisation for me was that we’re only comparing against individual experts. If each expert is a move in a matrix game, that’s the same as claiming you’ll never do much worse than any pure strategy. One example of a pure strategy is always playing rock in Rock-Paper-Scissors; pure strategies are really bad! The trick is actually in making that regret bound useful.
We need a more continuous version of the experts problem for LP feasibility. We’re still playing a turn-based game, but, this time, instead of outputting a prediction, we get to “play” a mixture of the experts (with nonnegative weights that sum to 1). At the beginning of each turn, we describe what weight we’d like to give to each expert (e.g., 60% rock, 40% paper, 0% scissors). The cost (equivalently, payoff) for each expert is then revealed (e.g., \(\mathrm{rock} = 0.5\), \(\mathrm{paper} = -0.5\), \(\mathrm{scissors} = 0\)), and we incur the weighted average from our play (e.g., \(60\% \cdot 0.5 + 40\% \cdot (-0.5) = 0.1\)) before playing the next round.^{2} The goal is to minimise our worst-case regret, the additive difference between the total cost incurred by our mixtures of experts and that of the a posteriori best single expert. In this case as well, online learning algorithms guarantee regret in \(\mathcal{O}(\sqrt{T} \, \lg n)\).
This line of research is interesting because simple algorithms achieve that bound, with explicit constant factors on the order of 1,^{3} and those bounds are known to be non-asymptotically tight for a large class of algorithms. Like dense linear algebra or fast Fourier transforms, where algorithms are often compared by counting individual floating point operations, online learning has matured into such tight bounds that worst-case regret is routinely presented without Landau notation. Advances improve constant factors in the worst case, or adapt to easier inputs in order to achieve “better than worst case” performance.
The reduction below lets us take any learning algorithm with an additive regret bound, and convert it to an algorithm with a corresponding worst-case iteration complexity bound for \(\varepsilon\)-approximate LP feasibility. An algorithm that promises low worst-case regret in \(\mathcal{O}(\sqrt{T})\) gives us an algorithm that needs at most \(\mathcal{O}(1/\varepsilon\sp{2})\) iterations to return a solution that almost satisfies every constraint in the linear program, where each constraint is violated by \(\varepsilon\) or less (e.g., \(x \leq 1\) is actually \(x \leq 1 + \varepsilon\)).
We first split the linear program in two components, a simple domain (e.g., the nonnegative orthant or the \([0, 1]\sp{d}\) box) and the actual linear constraints. We then map each of the latter constraints to an expert, and use an arbitrary algorithm that solves our continuous version of the experts problem as a black box. At each turn, the black box will output a set of nonnegative weights for the constraints (experts). We will average the constraints using these weights, and attempt to find a solution in the intersection of our simple domain and the weighted average of the linear constraints. We can do so in the “experts problem” setting by considering each linear constraint’s violation as a payoff, or, equivalently, satisfaction as a loss.
Let’s use Stigler’s Diet Problem with three foods and two constraints as a small example, and further simplify it by disregarding the minimum value for calories, and the maximum value for vitamin A. Our simple domain here is at least the nonnegative orthant: we can’t ingest negative food. We’ll make things more interesting by also making sure we don’t eat more than 10 servings of any food per day.
The first constraint says we mustn’t get too many calories
\[72 x\sb{\mathrm{corn}} + 121 x\sb{\mathrm{milk}} + 65 x\sb{\mathrm{bread}} \leq 2250,\]
and the second constraint (tweaked to improve this example) ensures we get enough vitamin A
\[107 x\sb{\mathrm{corn}} + 400 x\sb{\mathrm{milk}} \geq 5000,\]
or, equivalently,
\[-107 x\sb{\mathrm{corn}} - 400 x\sb{\mathrm{milk}} \leq -5000.\]
Given weights \([3/4, 1/4]\), the weighted average of the two constraints is
\[27.25 x\sb{\mathrm{corn}} - 9.25 x\sb{\mathrm{milk}} + 48.75 x\sb{\mathrm{bread}} \leq 437.5,\]
where the coefficients for each variable and for the righthand side were averaged independently.
The subproblem asks us to find a feasible point in the intersection of these two constraints: \[27.25 x\sb{\mathrm{corn}} - 9.25 x\sb{\mathrm{milk}} + 48.75 x\sb{\mathrm{bread}} \leq 437.5,\] \[0 \leq x\sb{\mathrm{corn}},\, x\sb{\mathrm{milk}},\, x\sb{\mathrm{bread}} \leq 10.\]
Classically, we claim that this is just Lagrangian relaxation, and find a solution to
\[\min 27.25 x\sb{\mathrm{corn}} - 9.25 x\sb{\mathrm{milk}} + 48.75 x\sb{\mathrm{bread}}\] subject to \[0 \leq x\sb{\mathrm{corn}},\, x\sb{\mathrm{milk}},\, x\sb{\mathrm{bread}} \leq 10.\]
In the next section, I’ll explain why I think this analogy is wrong and worse than useless. For now, we can easily find the minimum one variable at a time, and find the solution \(x\sb{\mathrm{corn}} = 0\), \(x\sb{\mathrm{milk}} = 10\), \(x\sb{\mathrm{bread}} = 0\), with objective value \(-92.5\) (which is \(530\) less than \(437.5\)).
In general, three things can happen at this point. We could discover that the subproblem is infeasible. In that case, the original non-relaxed linear program itself is infeasible: any solution to the original LP satisfies all of its constraints, and thus would also satisfy any weighted average of the same constraints. We could also be extremely lucky and find that our optimal solution to the relaxation is (\(\varepsilon\)-)feasible for the original linear program; we can stop with a solution. More commonly, we have a solution that’s feasible for the relaxation, but not for the original linear program.
Since that solution satisfies the weighted average constraint and payoffs track constraint violation, the black box’s payoff for this turn (and for every other turn) is nonpositive. In the current case, the first constraint (on calories) is satisfied by \(1040\), while the second (on vitamin A) is violated by \(1000\). On weighted average, the constraints are satisfied by \(\frac{1}{4}(3 \cdot 1040 - 1000) = 530.\) Equivalently, they’re violated by \(-530\) on average.
We’ll add that solution to an accumulator vector that will come in handy later.
The next step is the key to the reduction: we’ll derive payoffs (negative costs) for the black box from the solution to the last relaxation. Each constraint (expert) has a payoff equal to its level of violation in the relaxation’s solution. If a constraint is strictly satisfied, the payoff is negative; for example, the constraint on calories is satisfied by \(1040\), so its payoff this turn is \(-1040\). The constraint on vitamin A is violated by \(1000\), so its payoff this turn is \(1000\). Next turn, we expect the black box to decrease the weight of the constraint on calories, and to increase the weight of the one on vitamin A.
After \(T\) turns, the total payoff for each constraint is equal to the sum of violations by all solutions in the accumulator. Once we divide both sides by \(T\), we find that the divided payoff for each constraint is equal to its violation by the average of the solutions in the accumulator. For example, if we have two solutions, one that violates the calories constraint by \(500\) and another that satisfies it by \(1000\) (violates it by \(-1000\)), the total payoff for the calories constraint is \(-500\), and the average of the two solutions does strictly satisfy the linear constraint by \(\frac{500}{2} = 250\)!
We also know that we only generated feasible solutions to the relaxed subproblem (otherwise, we’d have stopped and marked the original LP as infeasible), so the black box’s total payoff is \(0\) or negative.
Finally, we assumed that the black box algorithm guarantees an additive regret in \(\mathcal{O}(\sqrt{T \lg n})\), so the black box’s total payoff of (at most) \(0\) means that any single constraint’s payoff is at most \(\mathcal{O}(\sqrt{T \lg n})\). After dividing by \(T\), we obtain a bound on the violation by the arithmetic mean of all solutions in the accumulator: for every constraint, that violation is in \(\mathcal{O}\left(\sqrt{\frac{\lg n}{T}}\right)\). In other words, the number of iterations \(T\) must scale with \(\mathcal{O}\left(\frac{\lg n}{\varepsilon\sp{2}}\right)\), which isn’t bad when \(n\) is in the millions but \(\varepsilon \approx 0.01\).
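Spelling out that last step, with the same hypothetical notation \(a\sb{i} \cdot x \leq b\sb{i}\) for constraint \(i\): the best expert’s total payoff exceeds the black box’s by at most the regret, and the black box’s total payoff is non-positive, so

\[\sum\sb{t=1}\sp{T}\left(a\sb{i} \cdot x\sp{(t)} - b\sb{i}\right) \in \mathcal{O}(\sqrt{T \lg n}) \quad \Rightarrow \quad a\sb{i} \cdot \bar{x} - b\sb{i} \in \mathcal{O}\left(\sqrt{\frac{\lg n}{T}}\right) \textrm{ for all } i,\]

and setting the right-hand side to \(\varepsilon\) before solving for \(T\) yields \(T \in \mathcal{O}\left(\frac{\lg n}{\varepsilon\sp{2}}\right)\).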
Theoreticians find this reduction interesting because there are concrete implementations of the black box, e.g., the multiplicative weight update (MWU) method with non-asymptotic bounds. For many problems, this makes it possible to derive the exact number of iterations necessary to find an \(\varepsilon\)-feasible fractional solution, given \(\varepsilon\) and the instance’s size (but not the instance itself).
That’s why algorithms like MWU are theoretically useful tools for fractional approximations, even when we already have subgradient methods that only need \(\mathcal{O}\left(\frac{1}{\varepsilon}\right)\) iterations: state-of-the-art algorithms for learning with experts have explicit non-asymptotic regret bounds that yield, for many problems, iteration bounds that depend only on the instance’s size, not its data. While the iteration count when solving LP feasibility with MWU scales with \(\frac{1}{\varepsilon\sp{2}}\), it is merely proportional to \(\lg n\), the log of the number of linear constraints. That’s attractive compared to subgradient methods, for which the iteration count scales with \(\frac{1}{\varepsilon}\), but also scales linearly with respect to instance-dependent values like the distance between the initial dual solution and the optimum, or the Lipschitz constant of the Lagrangian dual function; these values are hard to bound, and are often proportional to the square root of the number of constraints. Given the choice between \(\mathcal{O}\left(\frac{\lg n}{\varepsilon\sp{2}}\right)\) iterations with explicit constants, and a looser \(\mathcal{O}\left(\frac{\sqrt{n}}{\varepsilon}\right)\), it’s obvious why MWU and online learning are powerful additions to the theory toolbox.
Theoreticians are otherwise not concerned with efficiency, so the usual answer to someone asking about optimisation is that they can always reduce linear optimisation to feasibility with a binary search on the objective value. I once made the mistake of implementing that binary search strategy. Unsurprisingly, it wasn’t useful. I also tried another theoretical reduction, where I looked for a pair of primal and dual feasible solutions that happened to have the same objective value. That also failed, in a more interesting manner: since the two solutions had to have almost the same value, the universe spited me by sending back solutions that were primal and dual infeasible in the worst possible way. In the end, the second reduction generated fractional solutions that were neither feasible nor superoptimal, which really isn’t helpful.
The reduction above works for any “simple” domain, as long as it’s convex and we can solve the subproblems, i.e., find a point in the intersection of the simple domain and a single linear constraint or determine that the intersection is empty.
The set of (super)optimal points in some initial simple domain is still convex, so we could restrict our search to the subset of the domain that is superoptimal for the linear program we wish to optimise, and directly reduce optimisation to the feasibility problem solved in the last section, without binary search.
That sounds silly at first: how can we find solutions that are superoptimal when we don’t even know the optimal value?
Remember that the subproblems are always relaxations of the original linear program. We can port the objective function from the original LP over to the subproblems, and optimise the relaxations. Any solution that’s optimal for a relaxation must have an optimal or superoptimal value for the original LP.
Rather than treating the black box online solver as a generator of Lagrangian dual vectors, we’re using its weights as solutions to the surrogate relaxation dual. The latter interpretation isn’t just more powerful because it handles objective functions. It also makes more sense: the weights generated by algorithms for the experts problem are probabilities, i.e., they’re nonnegative and sum to \(1\). That’s also what’s expected for surrogate dual vectors, but definitely not the case for Lagrangian dual vectors, even when restricted to \(\leq\) constraints.
We can do even better!
Unlike Lagrangian dual solvers, which only converge when fed (approximate) subgradients, and thus force us to find (nearly) optimal solutions to the relaxed subproblems, our reduction to the experts problem only needs feasible solutions to the subproblems. That’s all we need to guarantee an \(\varepsilon\)-feasible solution to the initial problem in a bounded number of iterations. We also know exactly how that \(\varepsilon\)-feasible solution is generated: it’s the arithmetic mean of the solutions for relaxed subproblems.
This lets us decouple finding lower bounds from generating feasible solutions that will, on average, \(\varepsilon\)-satisfy the original LP. In practice, the search for an \(\varepsilon\)-feasible solution that is also superoptimal will tend to improve the lower bound. However, nothing forces us to evaluate lower bounds synchronously, or to only use the experts problem solver to improve our bounds.
We can find a new bound from any vector of nonnegative constraint weights: they always yield a valid surrogate relaxation. We can solve that relaxation, and update our best bound when it’s improved. The Diet subproblem earlier had
\[27.25 x\sb{\mathrm{corn}} - 9.25 x\sb{\mathrm{milk}} + 48.75 x\sb{\mathrm{bread}} \leq 437.5,\] \[0 \leq x\sb{\mathrm{corn}},\, x\sb{\mathrm{milk}},\, x\sb{\mathrm{bread}} \leq 10.\]
Adding the original objective function back yields the linear program
\[\min 0.18 x\sb{\mathrm{corn}} + 0.23 x\sb{\mathrm{milk}} + 0.05 x\sb{\mathrm{bread}}\] subject to \[27.25 x\sb{\mathrm{corn}} - 9.25 x\sb{\mathrm{milk}} + 48.75 x\sb{\mathrm{bread}} \leq 437.5,\] \[0 \leq x\sb{\mathrm{corn}},\, x\sb{\mathrm{milk}},\, x\sb{\mathrm{bread}} \leq 10,\]
which has a trivial optimal solution at \([0, 0, 0]\).
When we generate a feasible solution for the same subproblem, we can use any valid bound on the objective value to find the most feasible solution that is also assuredly (super)optimal. For example, if some oracle has given us a lower bound of \(2\) for the original Diet problem, we can solve for
\[\min 27.25 x\sb{\mathrm{corn}} - 9.25 x\sb{\mathrm{milk}} + 48.75 x\sb{\mathrm{bread}}\] subject to \[0.18 x\sb{\mathrm{corn}} + 0.23 x\sb{\mathrm{milk}} + 0.05 x\sb{\mathrm{bread}} \leq 2,\] \[0 \leq x\sb{\mathrm{corn}},\, x\sb{\mathrm{milk}},\, x\sb{\mathrm{bread}} \leq 10.\]
We can relax the objective value constraint further, since we know that the final \(\varepsilon\)feasible solution is a simple arithmetic mean. Given the same best bound of \(2\), and, e.g., a current average of \(3\) solutions with a value of \(1.9\), a new solution with an objective value of \(2.3\) (more than our best bound, so not necessarily optimal!) would yield a new average solution with a value of \(2\), which is still (super)optimal. This means we can solve the more relaxed subproblem
\[\min 27.25 x\sb{\mathrm{corn}} - 9.25 x\sb{\mathrm{milk}} + 48.75 x\sb{\mathrm{bread}}\] subject to \[0.18 x\sb{\mathrm{corn}} + 0.23 x\sb{\mathrm{milk}} + 0.05 x\sb{\mathrm{bread}} \leq 2.3,\] \[0 \leq x\sb{\mathrm{corn}},\, x\sb{\mathrm{milk}},\, x\sb{\mathrm{bread}} \leq 10.\]
Given a bound on the objective value, we swapped the constraint and the objective; the goal is to maximise feasibility, while generating a solution that’s “good enough” to guarantee that the average solution is still (super)optimal.
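The relaxed value bound from the averaging argument above is a one-liner. This is a hypothetical helper (names are mine, not from the post’s code), assuming a minimisation problem, where "(super)optimal" means a value at most the best known lower bound:

```python
def value_slack(count, mean_value, bound):
    """Largest objective value a new solution may take while the mean of
    count + 1 solutions stays at or below bound."""
    return (count + 1) * bound - count * mean_value

# The example from the text: 3 accumulated solutions averaging 1.9, best
# bound 2, so the next solution's value may go up to 2.3.
slack = value_slack(3, 1.9, 2.0)
new_mean = (3 * 1.9 + slack) / 4  # equals the bound, up to float rounding
```

The slack grows whenever the accumulated solutions beat the bound, which is what lets us relax the objective-value constraint from \(2\) to \(2.3\) in the subproblem above.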
For box-constrained linear programs where the box is the convex domain, subproblems are bounded linear knapsacks, so we can simply stop the greedy algorithm as soon as the objective value constraint is satisfied, or when the knapsack constraint becomes active (we found a better bound).
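Here’s one plausible reading of that early exit in code, a Python sketch (the post’s implementation is Common Lisp) under simplifying assumptions: a set-covering-style subproblem recast as "maximise surrogate satisfaction \(s \cdot x\) subject to a budget on the objective value \(c \cdot x\)," with non-negative \(s\) and \(c\), and all names mine.

```python
def greedy_fractional_knapsack(s, c, budget, target):
    """Maximise s . x subject to c . x <= budget and 0 <= x_j <= 1,
    stopping early once s . x >= target ("good enough" for the average
    solution to stay superoptimal), or once the budget is exhausted."""
    x = [0.0] * len(s)
    value = 0.0
    # Classic fractional-knapsack order: satisfaction per unit of cost,
    # with free (zero-cost) items first.
    order = sorted(range(len(s)),
                   key=lambda j: s[j] / c[j] if c[j] > 0 else float("inf"),
                   reverse=True)
    for j in order:
        if value >= target or budget <= 0.0:
            break  # early exit: good enough, or the value bound is active
        frac = 1.0 if c[j] == 0.0 else min(1.0, budget / c[j])
        x[j] = frac
        value += frac * s[j]
        budget -= frac * c[j]
    return x, value
```

The sort dominates the work, which is consistent with the post’s later observation that the bottleneck of these fractional knapsack subproblems is sorting double floats.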
This last tweak doesn’t just accelerate convergence to \(\varepsilon\)-feasible solutions. More importantly for me, it pretty much guarantees that our \(\varepsilon\)-feasible solution matches the best known lower bound, even if that bound was provided by an outside oracle. Bundle methods and the Volume algorithm can also mix solutions to relaxed subproblems in order to generate \(\varepsilon\)-feasible solutions, but the result lacks this last guarantee: their fractional solutions are even more superoptimal than the best bound, and that can make bounding and variable fixing difficult.
Before last Christmas’s CVRP set covering LP, I had always used the multiplicative weight update (MWU) algorithm as my black box online learning algorithm: it wasn’t great, but I couldn’t find anything better. The two main downsides for me were that I had to know a “width” parameter ahead of time, as well as the number of iterations I wanted to run.
The width is essentially the range of the payoffs; in our case, the potential level of violation or satisfaction of each constraint by any solution to the relaxed subproblems. The dependence isn’t surprising: folklore in Lagrangian relaxation also says the width is a big factor there. The problem is that the most extreme violations and satisfactions are initialisation parameters for the MWU algorithm, and the iteration count for a given \(\varepsilon\) is quadratic in the width (\(\mathrm{max}\sb{violation} \cdot \mathrm{max}\sb{satisfaction}\)).
What’s even worse is that MWU is explicitly tuned for a specific iteration count. If I estimate that, given my worst-case width estimate, one million iterations will be necessary to achieve \(\varepsilon\)-feasibility, MWU tuned for 1M iterations will need 1M iterations, even if the actual width is narrower.
de Rooij and others published AdaHedge in 2013, an algorithm that addresses both these issues by smoothly estimating its parameters over time, without using the doubling trick.^{4} AdaHedge’s loss (convergence rate to an \(\varepsilon\)-solution) still depends on the relaxation’s width. However, it depends on the maximum width actually observed during the solution process, and not on any explicit worst-case bound. It’s also not explicitly tuned for a specific iteration count, and simply keeps improving at a rate that roughly matches MWU. If the instance happens to be easy, we will find an \(\varepsilon\)-feasible solution more quickly. In the worst case, the iteration count is never much worse than that of an optimally tuned MWU.
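The core of AdaHedge is compact enough to sketch. This Python version (not the post’s Common Lisp; written from my reading of the AdaHedge paper, so treat the details as assumptions) tracks cumulative losses and the cumulative mixability gap \(\Delta\), and derives the learning rate \(\eta = \ln(n)/\Delta\) from \(\Delta\) itself, which is what removes the need for a priori width and iteration-count estimates:

```python
import math

class AdaHedge:
    """Sketch of AdaHedge for n experts receiving loss vectors.

    The reduction's payoffs become losses by negation.  While delta == 0,
    the weights degenerate to follow-the-leader; each round can only grow
    delta, which smoothly lowers the learning rate."""

    def __init__(self, n):
        self.n = n
        self.L = [0.0] * n  # cumulative loss of each expert
        self.delta = 0.0    # cumulative mixability gap

    def weights(self):
        lo = min(self.L)
        if self.delta == 0.0:  # eta = infinity: follow the leader(s)
            w = [1.0 if l == lo else 0.0 for l in self.L]
        else:
            eta = math.log(self.n) / self.delta
            w = [math.exp(-eta * (l - lo)) for l in self.L]
        total = sum(w)
        return [wi / total for wi in w]

    def update(self, loss):
        w = self.weights()
        h = sum(wi * li for wi, li in zip(w, loss))  # Hedge's own loss
        if self.delta == 0.0:  # eta = infinity limit of the mix loss
            m = min(l + li for l, li in zip(self.L, loss)) - min(self.L)
        else:
            eta = math.log(self.n) / self.delta
            m = -math.log(sum(wi * math.exp(-eta * li)
                              for wi, li in zip(w, loss))) / eta
        self.delta += h - m  # mixability gap of this round, always >= 0
        self.L = [l + li for l, li in zip(self.L, loss)]
```

No numerical hardening here (e.g., the mix loss can underflow for extreme losses); it’s only meant to show how little state the adaptive scheme needs.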
These 400 lines of Common Lisp implement AdaHedge and use it to optimise the set covering LP. AdaHedge acts as the online black box solver for the surrogate dual problem, the relaxed set covering LP is a linear knapsack, and each subproblem attempts to improve the lower bound before maximising feasibility.
When I ran the code, I had no idea how long it would take to find a feasible enough solution: covering constraints can never be violated by more than \(1\), but some points could be covered by hundreds of tours, so the worst-case satisfaction width is high. I had to rely on the way AdaHedge adapts to the actual hardness of the problem. In the end, \(34492\) iterations sufficed to find a solution that was \(4.5\%\) infeasible.^{5} This corresponds to a worst case with a width of less than \(2\), which is probably not what happened. It seems more likely that the surrogate dual isn’t actually an omniscient adversary, and AdaHedge was able to exploit some of that “easiness.”
The iterations themselves are also reasonable: one sparse matrix / dense vector multiplication to convert surrogate dual weights to an average constraint, one solve of the relaxed LP, and another sparse matrix / dense vector multiplication to compute violations for each constraint. The relaxed LP is a fractional \([0, 1]\) knapsack, so the bottleneck is sorting double floats. Each iteration took 1.8 seconds on my old laptop; I’m guessing that could easily be 10-20 times faster with vectorisation and parallelisation.
In another post, I’ll show how using the same surrogate dual optimisation algorithm to mimic Lagrangian decomposition instead of Lagrangian relaxation guarantees an iteration count in \(\mathcal{O}\left(\frac{\lg \#\mathrm{nonzero}}{\varepsilon\sp{2}}\right)\) independently of luck or the specific linear constraints.
Yes, I have been banging my head against that wall for a while. ↩
This is equivalent to minimising expected loss with random bits, but cleans up the reduction. ↩
When was the last time you had to worry whether that log was natural or base-2? ↩
The doubling trick essentially says to start with an estimate for some parameters (e.g., width), then adjust it to at least double the expected iteration count when the parameter’s actual value exceeds the estimate. The sum telescopes and we only pay a constant multiplicative overhead for the dynamic update. ↩
I think I computed the \(\log\) of the number of decision variables instead of the number of constraints, so maybe this could have gone a bit better. ↩