Paul Khuong: some Lisp (https://www.pvk.ca/)

Check for borrows in bitwise operations
2020-05-02T17:24:39-04:00
https://www.pvk.ca/Blog/2020/05/02/check-for-borrows-in-bitwise-operations

2020-05-03: Had to add a complement step in the ULEB section. Seems I couldn’t actually avoid that crummy-looking notation. Spotted by redditor /u/Y_Less.

In the fourth installment of his series on sorting with AVX2, @damageboy has a short aside where he tries to detect partitioning (pivot) patterns where elements less than and greater than or equal to the pivot are already in the correct order: in that case, the partitioning routine does not need to permute the block of values. The practical details are irrelevant for this post; what matters is that we wish to quickly identify whether a byte value matches any of the following nine cases:

• 0b11111111
• 0b11111110
• 0b11111100
• 0b11111000
• 0b11110000
• 0b11100000
• 0b11000000
• 0b10000000
• 0b00000000

Looking at the bit patterns,1 the OP’s solution with popcount and bitscan is pretty natural. These instructions are somewhat complex (latency closer to 3 cycles than 1, and often port restricted), and it seems like the sort of problem that would have had efficient solutions before SSE4 finally graced x86 with a population count instruction.

In the context of a sorting library’s partition loop, popcnt and bsf is probably more than good enough: the post shows that the real issue is branch mispredictions being slower than permuting unconditionally. This is just a fun challenge to think about (:

## Warm-up: is_power_of_two

Detecting whether a machine integer is a power of two (or zero) is another task that has a straightforward solution in terms of popcount or bitscan. There’s also a simpler classic solution to this problem:

x == 0 || is_power_of_two(x) <==> (x & (x - 1)) == 0

How does that expression work? Say x is a power of two. Its binary representation is 0b0...010...0: any number of leading zeros,2 a single “1” bit, and trailing zeros (maybe none). Let’s see what happens when we subtract 1 from x:

x           = 0b00...0010...0
x - 1       = 0b00...0001...1
x & (x - 1) = 0b00...0000...0


The subtraction triggered a chain of borrows throughout the trailing zeros, until we finally hit that 1 bit. In decimal, subtracting one from 10...0 yields 09...9; in binary we instead find 01...1. If you ever studied the circuit depth (latency) of carry chains (for me, that was for circuit complexity theory), you know that this is difficult to do well. Luckily for us, chip makers work hard to pull it off, and we can just use carries as a data-controlled primitive to efficiently flip ranges of bits.

When x is a power of two, x and x - 1 have no “1” bit in common, so taking the bitwise and yields zero. That’s also true when x is 0, since anding anything with 0 yields zero. Let’s see what happens for non-zero, non-power-of-two values x = 0bxx...xx10..0, i.e., where x consists of an arbitrary non-zero sequence of bits xx..xx followed by the least set bit (there’s at least one, since x is neither zero nor a power of two), and the trailing zeros:

x           = 0bxx...xx10...0
x - 1       = 0bxx...xx01...1
x & (x - 1) = 0bxx...xx00...0


The leading not-all-zero 0bxx...xx is unaffected by the subtraction, so it passes through the bitwise and unscathed (anding any bit with itself yields that same bit), and we know there’s at least one non-zero bit in there; our test correctly rejects it!
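In code, the predicate is a one-liner; here’s a quick Python sketch (the function name is mine), with an exhaustive check against a popcount-style definition:

```python
def is_power_of_two_or_zero(x):
    # x & (x - 1) clears the lowest set bit of x, so the result is zero
    # exactly when x has at most one bit set.
    return x & (x - 1) == 0


# Sanity check over a small range: agree with "at most one bit set".
assert all(is_power_of_two_or_zero(x) == (bin(x).count("1") <= 1)
           for x in range(4096))
```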

## Stretching: decoding varints

When decoding variable length integers in ULEB format, e.g., for protocol buffers, it quickly becomes clear that, in order to avoid byte-at-a-time logic, we must rapidly segment (lex or tokenize, in a way) our byte stream to determine where each ULEB ends. Let’s focus on the fast path, when the encoded ULEB fits in a machine register.

We have uleb = 0bnnnnnnnnmmmmmmmm...0zzzzzzz1yyyyyyy1...: a sequence of bytes3 with the topmost bit equal to 1, terminated by a byte with the top bit set to 0, and, finally, arbitrary nuisance bytes (m...m, n...n, etc.) we wish to ignore. Ideally, we’d extract data = 0b0000000000000000...?zzzzzzz?yyyyyyy?... from uleb: we want to clear the nuisance bytes, and are fine with arbitrary values in the ULEB’s control bits.

It’s much easier to find bits set to 1 than to zero, so the first thing to do is to complement the ULEB data and clear out everything but potential ULEB control bits (the high bit of each byte), with c = ~uleb & (128 * (WORD_MAX / 255)), i.e., compute the bitwise and of ~uleb with a bitmask of the high bit in each byte.

 uleb = 0bnnnnnnnnmmmmmmmm...0zzzzzzz1yyyyyyy1...
~uleb = 0bn̅n̅n̅n̅n̅n̅n̅n̅m̅m̅m̅m̅m̅m̅m̅m̅...1z̅z̅z̅z̅z̅z̅z̅0y̅y̅y̅y̅y̅y̅y̅0...
    c = 0bn̅0000000m̅0000000...1000000000000000...


We could now bitscan to find the index of the first 1 (marking the last ULEB byte), and then generate a mask. However, it seems wasteful to generate an index with a scan, only to convert it back into bitmap space with a shift. We’ll probably still want that index to know how far to advance the decoder’s cursor, but we can hopefully update the cursor in parallel with decoding the current ULEB value.

When we were trying to detect powers of two, we subtracted 1 from x, a value kind of like c, in order to generate a new value that differed from x in all the bits up to and including the first set (equal to 1) bit of x, and identical in the remaining bits. We then used the fact that anding a bit with itself yields that same bit to detect whether there was any non-zero bit in the remainder.

Here, we wish to do something else with the remaining untouched bits, we wish to set them all to zero. Another bitwise operator does what we want: xoring a bit with itself always yields zero, while xoring bits that differ yields 1. That’s the plan for ULEB. We’ll subtract 1 from c and xor that back with c.

       uleb = 0bnnnnnnnnmmmmmmmm...0zzzzzzz1yyyyyyy1...
      ~uleb = 0bn̅n̅n̅n̅n̅n̅n̅n̅m̅m̅m̅m̅m̅m̅m̅m̅...1z̅z̅z̅z̅z̅z̅z̅0y̅y̅y̅y̅y̅y̅y̅0...
          c = 0bn̅0000000m̅0000000...1000000000000000...
      c - 1 = 0bn̅0000000m̅0000000...0111111111111111...
c ^ (c - 1) = 0b0000000000000000...1111111111111111...


We now just have to bitwise and uleb with c ^ (c - 1) to obtain the bits of the first ULEB value in uleb, while overwriting everything else with 0. Once we have that, we can either extract data bits with PEXT on recent Intel chips, or otherwise dust off interesting stunts for SWAR shifts by variable amounts.
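Putting the whole fast path together in portable Python (a sketch, with names of my own invention; the byte-gathering loop stands in for PEXT or a SWAR shift):

```python
MASK64 = (1 << 64) - 1
HIGH_BITS = 0x8080808080808080  # high bit of each byte


def uleb_prefix_mask(word):
    """Mask covering the first ULEB value in `word` (bytes are
    little-endian, so the first ULEB byte is the lowest byte)."""
    c = ~word & HIGH_BITS & MASK64
    # c - 1 borrows through every bit below the terminator's high bit;
    # xor keeps exactly the flipped range and zeroes everything above.
    return (c ^ (c - 1)) & MASK64


def decode_first_uleb(word):
    """Decode the first ULEB value in `word`; the 7-bit gathering loop
    is a portable stand-in for PEXT."""
    masked = word & uleb_prefix_mask(word)
    result, shift = 0, 0
    for i in range(8):
        result |= ((masked >> (8 * i)) & 0x7F) << shift
        shift += 7
    return result
```

For example, 300 encodes as the bytes 0xAC 0x02, so a register holding 0xFFFFFFFFFFFF02AC (nuisance bytes up top) decodes back to 300.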

## Now for damageboy’s problem

Let’s first repeat the question that motivated this post. We want to detect when a byte p is one of the following nine values:

• 0b11111111
• 0b11111110
• 0b11111100
• 0b11111000
• 0b11110000
• 0b11100000
• 0b11000000
• 0b10000000
• 0b00000000

These bit patterns feel similar to those for power-of-two bytes: if we complement the bits, these values are all 1 less than a power of two (or -1, one less than zero). We already know how to detect when a value x is zero or a power of two (x & (x - 1) == 0), so it’s easy to instead determine whether ~p is one less than zero or a power of two: (~p + 1) & ~p == 0.

This is already pretty good: bitwise not the byte p, and check if it’s one less than zero or a power of two (three simple instructions on the critical path). We can do better.

There’s another name for ~p + 1, i.e., for bitwise complementing a value and adding one: that’s simply -p, the additive inverse of p in two’s complement! We can use -p & ~p == 0. That’s one fewer instruction on the critical path of our dependency graph (down to two, since we can test whether anding yields zero), and still only uses simple instructions that are unlikely to be port constrained.

Let’s check our logic by enumerating all byte-sized values.

CL-USER> (dotimes (p 256)
           (when (zerop (logand (- p) (lognot p) 255))
             (format t "0b~2,8,'0r~%" p)))
0b00000000
0b10000000
0b11000000
0b11100000
0b11110000
0b11111000
0b11111100
0b11111110
0b11111111


These are the bytes we’re looking for (in ascending rather than descending order)!
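For readers who don’t speak Lisp, here’s the same exhaustive check in Python; masking with 0xFF makes the two’s complement arithmetic explicit:

```python
def matches(p):
    # True iff the byte p is a run of ones followed by zeros (including
    # all-ones and all-zero): -p & ~p == 0 over 8-bit values.
    return (-p & ~p) & 0xFF == 0


hits = [p for p in range(256) if matches(p)]
```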

## Remember the power of borrows

I hope the examples above communicated a pattern I often observe when mangling bits: operations that are annoying (not hard, just a bit more complex than we’d like) in the bitmap domain can be simpler in two’s complement arithmetic. Arithmetic operations are powerful mutators for bitmaps, but they’re often hard to control. Subtracting or adding 1 are the main exceptions: it’s easy to describe their impact in terms of the low bits of the bitmap. In fact, we can extend that trick to subtracting or adding powers of two: it’s the same carry/borrow chain effect as for 1, except that bits smaller than the power of two pass straight through… which might be useful when we expect a known tag followed by a ULEB value that must be decoded.
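For instance, here’s a small Python sketch (hypothetical, with a name of my own) of what subtracting 1 << k does:

```python
def flip_low_range(x, k):
    """Subtract 1 << k: bits below position k pass through untouched,
    while the borrow flips every bit from position k up to (and
    including) the first set bit at or above k."""
    assert x >> k, "need a set bit at or above position k"
    return x - (1 << k)


# 0b110100011 - 0b100 = 0b110011111: bits 0-1 pass through,
# bits 2-5 are flipped by the borrow chain.
```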

If you find yourself wishing for a way to flip ranges of bits in a data-dependent fashion, it’s always worth considering the two’s complement representation of the problem for a couple minutes. Adding or subtracting powers of two doesn’t always work, but the payoff is pretty good when it does.

P.S., Wojciech Muła offers a different 3-operation sequence with -p to solve damageboy’s problem. That’s another nice primitive to generate bitmasks dynamically.

Thank you Ruchir for helping me clarify the notation around the ULEB section.

1. I use the 0b... syntax throughout this post to denote bit literals, similarly to the usual 0x... hexadecimal literals.

2. While I prefer to work with little-endian bytes, I find everything makes more sense with big-endian bits.

3. Remember, while ULEB is little-endian, we use big bit-endianness.

How hard is it to guide test case generators with branch coverage feedback?
2020-03-11T18:00:26-04:00
https://www.pvk.ca/Blog/2020/03/11/how-hard-is-it-to-guide-test-case-generators-with-branch-coverage-feedback

To make a long story short, it’s almost easy, now that tracing is available on commodity hardware, especially with a small library to handle platform-specific setup.

Here’s a graph of the empirical distribution functions for the number of calls into OpenSSL 1.0.1f it takes for Hypothesis to find Heartbleed, given a description of the format for Heartbeat requests (“grammar”), and the same with additional branch coverage feedback (“bts”, for Intel Branch Trace Store). On average, knowing the grammar for Heartbeat requests lets Hypothesis find the vulnerability after 535 calls to OpenSSL; when we add branch coverage feedback, the average goes down to 473 calls, 11.5% faster. The plot shows that, as long as we have time for more than 100 calls to OpenSSL, branch coverage information makes it more likely that Hypothesis will find Heartbleed for every attempt budget. I’ll describe this experiment in more detail later in the post; first, why did I ask that question?

## Enumerating test cases by hand is a waste of people’s time

I care about efficiently generating test cases because I believe that most of the time people spend writing, reviewing, and maintaining classic small or medium tests is misallocated. There’s a place for them, same as there is for manual QA testing… but their returns curve is strongly concave.

Here are the sort of things I want to know when I look at typical unit test code:

1. Why do we make this exact assertion; what makes a result acceptable?2
2. Why do we feed this or that input; what makes input data useful or interesting?
3. What is expected to change when I tweak the external behaviour of the system under test;3 how are answers to the previous two questions encoded?

As programmers, I think it’s natural to say that, if we need to formalize how we come up with assertions, or how we decide what inputs to use during testing, we should express that in code. But once we have that code, why would we enumerate test cases manually? That’s a job for a computer!

## Property-based testing at Backtrace

I have such conviction in this thesis that, as soon as I took over the query processing component (a library written in C with a dash of intrinsics and inline assembly) at Backtrace, I introduced a new testing approach based on Hypothesis. The query engine was always designed and built as a discrete library, so it didn’t take too much work to disentangle it from other pieces of business logic. That let us quickly add coverage for the external interface (partial coverage; this is still a work in progress), as well as for key internal components that we chose to expose for testing.

1. What is Hypothesis?
2. Why Hypothesis?
3. Wait, there are multiple language backends for Hypothesis, but none for C or C++; how does that work?

### What is Hypothesis?

Hypothesis is a Python library that helps programmers generate test cases programmatically, and handles a lot of the annoying work involved in making such generators practical to use. Here’s an excerpt from real test code for our vectorised operations on arrays of 64-bit integers.

A key component of these operations is a C (SSE intrinsics, really) function that accepts eight 64-bit masks as four 128-bit SSE registers (one pair of masks per register), and returns a byte, with one bit per mask. In order to test that logic, we expose u64_block_bitmask, a wrapper that accepts an array of eight masks and forwards its contents to the mask-to-bit function. check_u64_bitmask calls that wrapper function from Python, and asserts that its return value is as expected: 1 for each mask that’s all 1s, and 0 for each mask that’s all 0s. We turn check_u64_bitmask into a test case generator by annotating the test method test_u64_block_bitmask with a @given decorator that asks Hypothesis to generate arbitrary lists of eight booleans.

Of course, there are only 256 lists of eight booleans; we should test that exhaustively. Hypothesis has the @example decorator, so it’s really easy to implement our own Python decorator to apply hypothesis.example once for each element in an iterable.
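Such a decorator only has to fold hypothesis.example over an iterable; enumerating the inputs themselves is a one-liner with itertools. A sketch (expected_bitmask mirrors the model in the test below, and the names are mine):

```python
from itertools import product


def expected_bitmask(booleans):
    # Model of the wrapper's contract: bit i of the result is set iff
    # mask i was all ones.
    expected = 0
    for i, value in enumerate(booleans):
        if value:
            expected |= 1 << i
    return expected


# All 256 possible lists of eight booleans, suitable for feeding to an
# @example-applying decorator.
ALL_CASES = [list(bits) for bits in product([False, True], repeat=8)]
```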

That’s pretty cool, but we could enumerate that more efficiently in C. Where Hypothesis shines is in reporting easily actionable failures. We’re wasting computer time in order to save developer time, and that’s usually a reasonable trade-off. For example, here’s what Hypothesis spits back via pytest when I introduce a bug to ignore the last pair of masks, masks 6 and 7, and instead reuse the register that holds masks 4 and 5 (a mistake that’s surprisingly easy to make when writing intrinsics by hand).

booleans = [False, False, False, False, False, False, ...]

    """Given a list of 8 booleans, generates the corresponding array of
    u64 masks, and checks that they are correctly compressed to a u8
    """
    expected = 0
    for i, value in enumerate(booleans):
        if value:
            expected |= 1 << i
    masks = struct.pack("8Q", *[2 ** 64 - 1 if boolean else 0 for boolean in booleans])
E       AssertionError: assert 128 == 0
E         -128
E         +0

test_u64_block.py:40: AssertionError
-------------------------------------- Hypothesis ----------------------------------------------
bits=[False, False, False, False, False, False, False, True],
)


Hypothesis finds the bug in a fraction of a second, but, more importantly, it then works harder to report a minimal counter-example.4 With all but the last mask set to 0, it’s easy to guess that we’re probably ignoring the value of the last mask (and maybe more), which would be why we found a bitmask of 0 rather than 128.

### Why Hypothesis?

So far this is all regular property-based testing, with a hint of more production-readiness than we’ve come to expect from clever software correctness tools. What really sold me was Hypothesis’s stateful testing capability, which makes it easy to test not only individual functions, but also methods on stateful objects.

For example, here is the test code for our specialised representation of lists of row ids (of 64-bit integers), which reuses internal bits as inline storage in the common case when the list is small (the type is called entry because it’s a pair of a key and a list of values).

Most @rule-decorated methods call a C function before updating a Python-side model of the values we expect to find in the list. The @rule decorator marks methods that Hypothesis may call to generate examples; the decorator’s arguments declare how each method should be invoked. Some methods also have a @precondition decorator to specify when they can be invoked (a @rule without any @precondition is always safe to call). There is one method (create_entry) to create a new list populated with an initial row id, another (add_row) to add a row id to the list, one to finalize the list, something that should always be safe to call and must be done before reading the list’s contents (finalizing converts away from the inline representation), and a check to compare the list of row ids in C to the one we maintained in Python.

Row ids are arbitrary 64-bit integers, so we could simply ask Hypothesis to generate integers in $$[0, 2^{64} - 1]$$. However, we also know that the implementation uses high row id values around UINT64_MAX as sentinels in its implementation of inline storage, as shown below.

struct inlined_list {
	union {
		uint64_t data;
		struct {
			uint64_t *arr;
			unsigned int capacity;
			unsigned int length;
		};
	};
	uint64_t control;
};

Out of line list             Inline list of 2        Inline list of 3
----------------             ----------------        ----------------
data:    arr                 elt #0                  elt #0
data:    capacity & length   elt #1                  elt #1
control: UINT64_MAX          UINT64_MAX - 2          elt #2 < UINT64_MAX - 2


That’s why we bias the data generation towards row ids that could also be mistaken for sentinel values: we generate each row id by either sampling from a set of sentinel-like values, or by sampling from all 64-bit integers. This approach to defining the input data is decidedly non-magical, compared to the way I work with fuzzers. However, fuzzers also tend to be slow and somewhat fickle. I think it makes sense to ask programmers to think about how their own code should be tested, instead of hoping a computer program will eventually intuit what edge cases look like.5
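Stripped of Hypothesis’s strategy combinators, the biased generation is just a mixture distribution. Here’s a plain-Python sketch (the exact sentinel-like set is an assumption on my part, based on the layout above):

```python
import random

UINT64_MAX = 2**64 - 1
# Values that could collide with the inline representation's sentinels.
SENTINEL_LIKE = [UINT64_MAX, UINT64_MAX - 1, UINT64_MAX - 2, UINT64_MAX - 3]


def biased_row_id(rng):
    """Half the time, draw a sentinel-like value; otherwise draw
    uniformly from all 64-bit integers."""
    if rng.random() < 0.5:
        return rng.choice(SENTINEL_LIKE)
    return rng.randrange(0, 2**64)
```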

It’s pretty cool that Hypothesis will now generate a bunch of API calls for me, but, again, what makes Hypothesis really valuable is the way it minimises long sequences of random calls into understandable counter-examples. Here’s what Hypothesis/pytest reports when I remove a guard that saves us from writing a row id to a control word when that id would look like a sentinel value.

self = PregrouperEntryList({})

    @precondition(lambda self: self.entry)
    @rule()
    def check(self):
        self.finalize()
>       assert self.rows == [
        ]
E       AssertionError: assert [184467440737...4073709551613] == [184467440737...4073709551613]
E         Left contains one more item: 18446744073709551613
E         Full diff:
E         - [18446744073709551613, 18446744073709551613, 18446744073709551613]
E         ?                        ----------------------
E         + [18446744073709551613, 18446744073709551613]

test_pregroup_merge.py:81: AssertionError
------------------------------------------ Hypothesis ------------------------------------------
Falsifying example:
state = PregrouperEntryList()
state.create_entry(row=18446744073709551613)
state.check()
state.teardown()


We can see that we populated a list thrice with the same row id, 18446744073709551613, but only found it twice in the final C-side list. That row id is $$2^{64} - 3,$$ the value we use to denote inline lists of two values. This drawing shows how the last write ended up going to a control word where it was treated as a sentinel, making the inline representation look like a list of two values, instead of three values.

Bad list of 3                       Identical list of 2
----------------                    ---------------------
data:    UINT64_MAX - 2             UINT64_MAX - 2
data:    UINT64_MAX - 2             UINT64_MAX - 2
control: UINT64_MAX - 2 (payload)   UINT64_MAX - 2 (sentinel)



I restored the logic to prematurely stop using the inline representation and convert to a heap-allocated vector whenever the row value would be interpreted as a sentinel, and now Hypothesis doesn’t find anything wrong. We also know that this specific failure is fixed, because Hypothesis retries examples from its failure database6 on every rerun.

### But Hypothesis is a Python library?!

There are multiple language-specific libraries under the Hypothesis project; none of them is in C, and only the Python implementation is actively maintained. One might think that makes Hypothesis inappropriate for testing C. However, the big ideas in Hypothesis are language-independent. There is a practical reason for the multiple implementations: for a tool to be really usable, it has to integrate well with the programming language’s surrounding ecosystem. In most languages, we also expect to write test code in the same language as the system under test.

C (and C++, I would argue) is an exception. When I tell an experienced C developer they should write test code in Python, I expect a sigh of relief. The fact that Hypothesis is written in Python, as opposed to another managed language like Ruby or C#, also helps: embedding and calling C libraries is Python’s bread and butter. The weak state of C tooling is another factor: no one has a strong opinion regarding how to invoke C test code, or how the results should be reported. I’ll work with anything standard enough (e.g., pytest’s JUnit dump) to be ingested by Jenkins.

The last thing that sealed the deal for me is Python’s CFFI. With that library, I simply had to make sure my public header files were clean enough to be parsed without a full-blown compiler; I could then write some simple Python code to strip away preprocessor directives, read headers in dependency order, and load the production shared object, without any test-specific build step. The snippets of test code above don’t look that different from tests for a regular Python module, but there’s also a clear mapping from each Python call to a C call. The level of magic is just right.

There is one testing concern that’s almost specific to C and C++ programs: we must be on the lookout for memory management bugs. For that, we use our ASan build and bubble up some issues (e.g., read overflows or leaks) back to Python; everything else results in an abort(), which, while suboptimal for minimisation, is still useful. I simplified our test harness for the Heartbleed experiment; see below for gists that anyone can use as starting points.

## Why not fuzzing?

I abused fuzzers a lot at Google, mostly because the tooling was there and it was trivial to burn hundreds of CPU-hours. Technical workarounds seemed much easier than trying to add support for actual property-based testers, so I would add domain specific assertions in order to detect more than crash conditions, and translate bytes into more structured inputs or even convert them to series of method calls. Now that I don’t have access to prebuilt infrastructure for fuzzers, that approach doesn’t make as much sense. That’s particularly true when byte arrays don’t naturally fit the entry point: I spent a lot of time making sure the fuzzer was exercising branches in the system under test, not the test harness, and designing input formats such that common fuzzer mutations, e.g., removing a byte, had a local effect rather than, e.g., causing a frameshift mutation for the rest of the input… and that’s before considering the time I wasted manually converting bytes to their structured interpretation on failures.

Hypothesis natively supports structured inputs and function calls, and reports buggy inputs in terms of these high level concepts. It is admittedly slower than fuzzers, especially when compared to fuzzers that (like Hypothesis) don’t fork/exec for each evaluation. I’m comfortable with that: my experience with NP-Hard problems tells me it’s better to start by trying to do smart things with the structure of the problem, and later speed that up, rather than putting all our hopes in making bad decisions really really fast. Brute force can only do so much to an exponential-time cliff.

I had one niggling concern when leaving the magical world of fuzzers for staid property-based testing. In some cases, I had seen coverage information steer fuzzers towards really subtle bugs; could I benefit from the same smartness in Hypothesis? That’s how I came up with the idea of simulating a reasonable testing process that ought to find CVE-2014-0160, Heartbleed.

## Role-playing CVE-2014-0160

CVE-2014-0160, a.k.a. Heartbleed, is a heap read overflow in OpenSSL 1.0.1 ($$\leq$$ 1.0.1f) that leaks potentially private data over the wire. It’s a straightforward logic bug in the implementation of RFC 6520, a small optional (D)TLS extension that lets clients or servers ask for a ping. The Register has a concise visualisation of a bad packet. We must send correctly sized packets that happen to ask for more pingback data than was sent to OpenSSL. State-of-the-art fuzzers like AFL or Honggfuzz find that bug in seconds when given an entry point that passes bytes to the connection handler, like data that had just come over the wire. When provided with a corpus of valid messages, they’re even faster.

It’s a really impressive showing of brute force that these programs come up with almost valid packets so quickly, and traditional property-based testing frameworks are really not up to that level of magic. However, I don’t find the black box setting that interesting. It’s probably different for security testing, since not leaking invalid data is apparently seen as a feature one can slap on after the fact: there are so many things to test that it’s probably best to focus on what fuzzing can easily find, and, in any case, it doesn’t seem practical to manually boil that ocean one functionality at a time.

From a software testing point of view, however, I would expect the people who send in a patch to implement something like the Heartbeat extension to also have a decent idea of how to send bytes that exercise their new code. Of course, a sufficiently diligent coder would have found the heap overflow during testing, or simply not have introduced that bug. That’s not a useful scenario to explore; I’m interested in something that does find Heartbleed, and also looks like a repeatable process. The question becomes “Within this repeatable process, can branch coverage feedback help Hypothesis find Heartbleed?”

Here’s what I settled on: let’s only assume the new feature code also comes with a packet generator to exercise that code. In Hypothesis, it might look like the following.

We have a single rule to initialize the test buffer with a valid heartbeat packet, and we send the whole buffer to a standard fuzzer entry point at the end. In practice, I would also ask for some actual assertions that the system under test handles the packets correctly, but that’s not important when it comes to Heartbleed: we just need to run with ASan and look for heap overflows, which are essentially never OK.

Being able to provide a happy-path-only “test” like the above should be less than table stakes in a healthy project. Let’s simulate a bug-finding process that looks for crashes after adding three generic buffer mutating primitives: one that replaces a single byte in the message buffer, another one that removes some bytes from the end of the buffer, and a last one that appends some bytes to the buffer.

Given the initial generator rules and these three mutators, Hypothesis will assemble sequences of calls to create and mutate buffers before sending them to OpenSSL at the end of the “test.”

The first time I ran this, Hypothesis found the bug in a second or two

==12810==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x62900012b748 at pc 0x7fb70d372f4e bp 0x7ffe55e56a30 sp 0x7ffe55e561e0

0x62900012b748 is located 0 bytes to the right of 17736-byte region [0x629000127200,0x62900012b748)
#0 0x7fb70d3e3628 in malloc (/usr/lib/x86_64-linux-gnu/libasan.so.5+0x107628)
#1 0x7fb7058ee939  (fuzzer-test-suite/openssl-1.0.1f-fsanitize.so+0x19a939)
#2 0x7fb7058caa08  (fuzzer-test-suite/openssl-1.0.1f-fsanitize.so+0x176a08)
#3 0x7fb7058caf41  (fuzzer-test-suite/openssl-1.0.1f-fsanitize.so+0x176f41)
#4 0x7fb7058a658d  (fuzzer-test-suite/openssl-1.0.1f-fsanitize.so+0x15258d)
#5 0x7fb705865a12  (fuzzer-test-suite/openssl-1.0.1f-fsanitize.so+0x111a12)
#6 0x7fb708326deb in ffi_call_unix64 (/home/pkhuong/follicle/follicle/lib/Python3.7/site-packages/.libs_cffi_backend/libffi-806b1a9d.so.6.0.4+0x6deb)
#7 0x7fb7081b4f0f  (<unknown module>)


and then spent 30 seconds trying to minimise the failure. Unfortunately, it seems that ASan tries to be smart and avoids reporting duplicate errors, so minimisation does not work for memory errors. It also doesn’t matter that much if we find a minimal test case: a core dump and a reproducer are usually good enough.

You can find the complete code for GrammarTester, along with the ASan wrapper, and CFFI glue in this gist.

For the rest of this post, we’ll make ASan crash on the first error, and count the number of test cases generated (i.e., the number of calls into the OpenSSL fuzzing entry point) until ASan made OpenSSL crash.

## How can we do better with hardware tracing?

Intel chips have offered BTS, a branch trace store, since Nehalem (2008-9), if not earlier. BTS is slower than PT, its full-blown hardware tracing successor, and only traces branches, but it does so in a dead simple format. I wrapped Linux perf’s interface for BTS in a small library; let’s see if we can somehow feed that trace to Hypothesis and consistently find Heartbleed faster.

But first, how do fuzzers use coverage information?

A lot of the impressive stuff that fuzzers find stems from their ability to infer the constants that were compared with the fuzzing input by the system under test. That’s not really that important when we assume a more white box testing setting, where the same person or group responsible for writing the code is also tasked with testing it, or at least specifying how to do so.

Apart from guessing what valid inputs look like, fuzzers also use branch coverage information to diversify their inputs. In the initial search phase, none of the inputs cause a crash, and fuzzers can only mutate existing, equally non-crashy, inputs or create new ones from scratch. The goal is to maintain a small population that can trigger all the behaviours (branches) we’ve observed so far, and search locally around each member of that population. Pruning to keep the population small is beneficial for two reasons: first, it’s faster to iterate over a smaller population, and second, it avoids redundantly exploring the neighbourhood of nearly identical inputs.

Of course, this isn’t ideal. We’d prefer to keep one input per program state, but we don’t know what the distinct program states are. Instead, we only know what branches we took on the way to wherever we ended up. It’s as if we were trying to generate directions to every landmark in a city, but the only feedback we received was the set of streets we walked on while following the directions. That’s far from perfect, but, with enough brute force, it might just be good enough.

We can emulate this diversification and local exploration logic with multi-dimensional Targeted example generation, for which Hypothesis has experimental support.

We’ll assign an arbitrary unique label to each origin/destination pair we observe via BTS, and assign a score of 1.0 to every such label (regardless of how many times we observed each pair). Whenever Hypothesis compares two score vectors, a missing value is treated as $$-\infty$$, so, everything else being equal, covering a branch is better than not covering it. After that, we’ll rely on the fact that multidimensional discrete optimisation is also all about maintaining a small but diverse population: Hypothesis regularly prunes redundant examples (examples that exercise a subset of the branches triggered by another example), and generates new examples by mutating members of the population. With our scoring scheme, the multidimensional search will split its efforts between families of examples that trigger different sets of branches, and will also stop looking around examples that trigger a strict subset of another example’s branches.
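Concretely, the bookkeeping for that scoring scheme might look like the following sketch (the names are mine; each resulting entry would then be reported to Hypothesis’s targeting machinery, e.g., via hypothesis.target, in the test’s teardown):

```python
def branch_scores(branch_pairs, label_table):
    """Assign a stable, arbitrary label to each origin/destination pair
    and a flat score of 1.0, regardless of how often the pair was
    observed; `label_table` persists across examples."""
    scores = {}
    for pair in set(branch_pairs):
        label = label_table.setdefault(pair, "branch-%d" % len(label_table))
        scores[label] = 1.0
    return scores
```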

Here’s the plan to give BTS feedback to Hypothesis and diversify its initial search for failing examples. I’ll use my libbts to wrap perf syscalls into something usable, and wrap that in Python to more easily gather origin/destination pairs. Even though I enable BTS tracing only around FFI calls, there will be some noise from libffi, as well as dummy transitions for interrupts or context switches; bts.py attempts to only consider interesting branches by remembering the set of executable mappings present at startup, before loading the library under test, and dropping jumps to or from addresses that were mapped at startup (presumably, that’s from the test harness), or invalid zero or negative addresses (which I’m pretty sure denote interrupts and syscalls).
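The filtering logic might look like this sketch (the function and the `baseline_ranges` parameter are my own names for illustration, not the actual bts.py interface):

```python
def interesting_branch(src, dst, baseline_ranges):
    """Keep an origin/destination pair only if neither endpoint belongs to
    code that was already mapped at startup (i.e., the test harness rather
    than the library under test).  baseline_ranges holds [start, end)
    executable mappings captured before loading the library; zero or
    negative addresses stand in for interrupts and syscalls in the stream."""
    if src <= 0 or dst <= 0:
        return False

    def mapped_at_startup(addr):
        return any(lo <= addr < hi for lo, hi in baseline_ranges)

    return not (mapped_at_startup(src) or mapped_at_startup(dst))
```

Everything that survives this filter is presumed to come from the system under test.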

We’ll then wrap the Python functions that call into OpenSSL to record the set of branches executed during that call, and convert that set to a multidimensional score at the end of the test, in the teardown method.
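The conversion itself is almost trivial; a sketch (the label format is my own choice):

```python
def branch_score_vector(branches):
    """Turn a set of (origin, destination) pairs into a multidimensional
    score: one label per pair, always worth 1.0, no matter how many times
    the pair was observed during the test."""
    return {"branch:%x->%x" % (src, dst): 1.0 for src, dst in branches}
```

In the teardown method, each (label, value) entry would then be handed to Hypothesis’s experimental `target(value, label=label)` so the search treats every covered branch as its own optimisation dimension.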

The only difference is in the __init__ method, which must also reset BTS state, and in the teardown method, where we score the example if it failed to crash.

Let’s instrument the teardown methods to print “CHECK” and flush before every call to the fuzzing entry point, and make sure ASan crashes when it finds an issue. We’ll run the test files, grep for “CHECK”, and, assuming we don’t run into any runtime limit, find out how many examples Hypothesis had to generate before causing a crash.

I ran this 12000 times for both test_with_grammar.py and test_with_grammar_bts.py. Let’s take a look at the empirical distribution functions for the number of calls until a crash, for test_with_grammar.py (grammar) and test_with_grammar_bts.py (bts). There’s a cross-over around 100 calls: as long as we have enough time for at least 100 calls to OpenSSL, we’re more likely to find Heartbleed with coverage feedback than by rapidly searching blindly. With fewer than 100 calls, it seems likely that branch coverage only guides the search towards clearly invalid inputs that trigger early exit conditions. Crucially, the curves are smooth and top out before our limit of 10000 examples per execution, so we’re probably not measuring a side-effect of the experimental setup.

In theory, the distribution function for the uninstrumented grammar search should look a lot like a geometric distribution, but I don’t want to assume too much of the bts implementation. Let’s confirm our gut feeling in R with a simple non-parametric test, the Wilcoxon rank sum test, an unpaired analogue to the sign test:

> wilcox.test(data$Calls[data$Impl == 'bts'], data$Calls[data$Impl == 'grammar'], 'less')

Wilcoxon rank sum test with continuity correction

data:  data$Calls[data$Impl == "bts"] and data$Calls[data$Impl == "grammar"]
W = 68514000, p-value = 4.156e-11
alternative hypothesis: true location shift is less than 0


We’re reasonably certain that the number of calls for a random run of bts is more frequently less than the number of calls for an independently chosen random run of grammar than the opposite. Since the tracing overhead is negligible, this means branch feedback saves us time more often than not. For the 24 000 separate runs we observed, bts is only faster 52% of the time, but that’s mostly because both distributions of calls to the system under test go pretty high. On average, vanilla Hypothesis, without coverage feedback, found the vulnerability after 535 calls to OpenSSL; with feedback, Hypothesis is 11.5% faster on average, and only needs 473 calls.

## Time to use this in anger

We have been using this testing approach (without branch coverage) at Backtrace for a couple months now, and it’s working well as a developer tool that offers rapid feedback, enough that we’re considering running these tests in a commit hook. Most of the work involved in making the approach useful was just plumbing, e.g., dealing with the way ASan reports errors, or making sure we don’t report leaks that happen in Python, outside the system under test. Once the commit hook is proven solid, we’ll probably want to look into running tests in the background for a long time. That’s a very different setting from pre-commit or commit-time runs, where every bug is sacred. If I let something run over the weekend, I must be able to rely on deduplication (i.e., minimisation is even more useful), and I will probably want to silence or otherwise triage some issues. That’s the sort of thing Backtrace already handles, so we are looking into sending Hypothesis reports directly to Backtrace, the same way we do for clang-analyzer reports and for crashes or leaks found by ASan in our end-to-end tests.

The challenges are more research-y for coverage feedback. There’s no doubt a lot of mundane plumbing issues involved in making this feedback robust (see, e.g., the logic in bts.py to filter out interrupts and branches from the kernel, or branches from code that’s probably not in the system under test). However, there’s also a fundamental question that we unfortunately can’t answer by copying fuzzers.

How should we score coverage over multiple calls? After all, the ability to test a stateful interface via a sequence of method calls and expectations is what really sold me on Hypothesis. Fuzzers have it easy: they assume each call is atomic and independent of any other call to the system under test, even when they don’t fork for each execution. This seems like a key reason why simple coverage metrics work well in fuzzers, and I don’t know that we can trivially port ideas from fuzzing to stateful testing.

For example, my first stab at this experiment found no statistically significant improvement from BTS feedback. The only difference is that the assertion lived in a check rule, and not in the teardown method, which let Hypothesis trigger a call to OpenSSL at various points in the buffer mutation sequence… usually a good thing for test coverage. I’m pretty sure the problem is that a single example could collect “points” for covering branches in multiple unrelated calls to OpenSSL, while we would rather cover many branches in a single call. What does it mean for stateful testing, where we want to invoke different functions multiple times in a test?

I have no idea; maybe we should come up with some synthetic stateful testing benchmarks that are expected to benefit from coverage information. However, the experiment in this post gives me hope that there exists some way to exploit coverage information in stateful testing. The MIT-licensed support code in this gist (along with libbts) should give you all a head start to try more stuff and report your experiences.

Thank you David, Ruchir, Samy, and Travis for your comments on an early draft.

## Appendix: How to get started with your own project

In a performance-oriented C library, the main obstacle to testing with Hypothesis and CFFI will probably be that the headers are too complex for pycparser. I had to use a few tricks to make ours parse.

First, I retro-actively imposed more discipline on what was really public and what implementation details should live in private headers: we want CFFI to handle every public header, and only some private headers for more targeted testing. This separation is a good thing to maintain in general, and, if anything, having pycparser yell at us when we make our public interface depend on internals is a net benefit.

I then had to reduce our reliance on the C preprocessor. In some cases, that meant making types opaque and adding getters. In many more cases, I simply converted small #defined integer constants to anonymous enums: enum { CONSTANT_FOO = ... }.

Finally, especially when testing internal functionality, I had to remove static inline functions. That’s another case where pycparser forces us to maintain cleaner headers, and the fix is usually simple: declare inline (not static) functions in the header, #include a .inl file with the inline definition (we can easily drop directives in Python), and re-declare the function as extern in the main source file for that header. With this approach, the header can focus on documenting the interface, the compiler still has access to an inline definition, and we don’t waste instruction bytes on duplicate out-of-line definitions.

That doesn’t always work, mostly because regular inline functions aren’t supposed to call static inline functions. When I ran into that issue, I either tried to factor the more complex slow path out to an out-of-line definition, or, more rarely, resorted to CPP tricks (also hidden in a .inl file) to rewrite calls to extern int foo(...) with macros like #define foo(...) static_inline_foo(__VA_ARGS__).

All that work isn’t really testing overhead; it’s mostly things that library maintainers should do, but that are easy to forget when we only hear from C programmers.

Once the headers were close enough to being accepted by CFFI, I closed the gap with string munging in Python. All the tests depend on the same file that parses all the headers we care about in the correct order and loads the shared object. Loading everything in the same order also enforces a reasonable dependency graph, another unintended benefit. Once everything is loaded, that file also post-processes the CFFI functions to hook in ASan (and branch tracing), and strip away any namespacing prefix.

The end result is a library that’s more easily used from managed languages like Python, and which we can now test like any other Python module.

1. The graph shows time in terms of calls to the SUT, not real time. However, the additional overhead of gathering and considering coverage information is small enough that the feedback also improves wallclock and CPU time.

2. High level matchers like GMock’s help, but they don’t always explain why a given matcher makes sense.

3. There’s also a place for smart change detectors, especially when refactoring ill-understood code.

4. I had to disable the exhaustive list of examples to benefit from minimisation. Maybe one day I’ll figure out how to make Hypothesis treat explicit examples more like an initial test corpus.

5. That reminds me of an AI Lab Koan. «In the days when Sussman was a novice, Minsky once came to him as he sat hacking at the PDP-6. “What are you doing?” asked Minsky. “I am training a randomly wired neural net to play Tic-tac-toe,” Sussman replied. “Why is the net wired randomly?” asked Minsky. Sussman replied, “I do not want it to have any preconceptions of how to play.” Minsky then shut his eyes. “Why do you close your eyes?” Sussman asked his teacher. “So that the room will be empty,” replied Minsky. At that moment, Sussman was enlightened.»

6. The database is a directory of files full of binary data. The specific meaning of these bytes depends on the Hypothesis version and on the test code, so the database should probably not be checked in. Important examples (e.g., for regression testing) should instead be made persistent with @example decorators. Hypothesis does guarantee that these files are always valid (i.e., any sequence of bytes in the database will result in an example that can be generated by the version of Hypothesis and of the test code that reads it), so we don’t have to invalidate the cache when we update the test harness.

]]>
<![CDATA[Too much locality... for stores to forward]]> 2020-02-01T17:29:21-05:00 https://www.pvk.ca/Blog/2020/02/01/too-much-locality-for-store-forwarding Apologies for the failed Cake reference.
2020-02-02: Refer to Travis Downs’s investigation into this pathological case for forwarding.

I’ve been responsible for Backtrace.io’s crash analytic database1 for a couple months now. I have focused my recent efforts on improving query times for in-memory grouped aggregations, i.e., the archetypal MapReduce use-case where we generate key-value pairs, and fold over the values for each key in some (semi)group. We have a cute cache-efficient data structure for this type of workload; the inner loop simply inserts in a small hash table with Robin Hood linear probing, in order to guarantee entries in the table are ordered by hash value. This ordering lets us easily dump the entries in sorted order, and block the merge loop for an arbitrary number of sorted arrays into a unified, larger, ordered hash table (which we can, again, dump to a sorted array).2

# Observation

As I updated more operators to use this data structure, I noticed that we were spending a lot of time in its inner loop. In fact, perf showed that the query server as a whole was spending 4% of its CPU time on one instruction in that loop:

 2.17 |       movdqu     (%rbx),%xmm0
39.63 |       lea        0x1(%r8),%r14  # that's 40% of the annotated function
|       mov        0x20(%rbx),%rax
0.15 |       movaps     %xmm0,0xa0(%rsp)


The first thing to note is that instruction-level profiling tends to put the blame on the instruction following the one that triggered a sampling interrupt. It’s not the lea (which computes r14 <- r8 + 1) that’s slow, but the movdqu just before. So, what is that movdqu loading into xmm0? Maybe it’s just a normal cache miss, something inherent to the workload.

I turned on source locations (hit s in perf report), and observed that this instruction was simply copying to the stack an argument that was passed by address. The source clearly showed that the argument should be hot in cache: the inner loop was essentially

A1. Generate a new key-value pair
B1. Mangle that kv pair just a bit to turn it into a hash element
C1. Insert the new hash element
A2.
B2.
C2.


and the movdqu happens in step C, to copy the element that step B just constructed.3

At this point, an important question suggests itself: does it matter? We could simply increase the size of the base case and speed up the rest of the bottom-up recursion… eventually, the latency for the random accesses in the initial hash table will dominate the inner loop.

When I look into the performance of these deep inner loops, my goal isn’t only to do the same thing better. The big wins, in my experience, come from the additional design freedom that we get from being able to find new uses for the same code. Improved latency, throughput, or memory footprint really shine when the increased optionality from multiple such improvements compounds and lets us consider a much larger design space for the project as a whole. That’s why I wanted to make sure this hash table insertion loop worked on as wide a set of parameters as possible: because that will give future me the ability to combine versatile tools.

# Hypothesis

Back to the original question. Why do we spend so many cycles loading data we just wrote to cache?

The answer is in the question and in the title of this post: too little time elapses between the instructions that write data to the cache and the ones that read the same data.4 A modern out-of-order machine (e.g., most amd64 chips) can execute multiple instructions at the same time, and will start executing instructions as soon as their operands are ready, even when earlier instructions in program order are still waiting for theirs. Machine code is essentially a messy way to encode a dataflow graph, which means our job as micro-optimisers is, at a high level, to avoid long dependency chains and make the dataflow graph as wide as possible. When that’s too hard, we should distribute as much scheduling slack as possible between nodes in a chain, in order to absorb the knock-on effects of cache misses and other latency spikes. If we fail, the chip will often find itself with no instruction ready to execute; stalling the pipeline like that is like slowing down by a factor of 10.

The initial inner loop simply executes steps A, B, and C in order, where step C depends on the result of step B, and step B on that of step A. In theory, a chip with a wide enough instruction reordering window could pipeline multiple loop iterations. In practice, real hardware can only plan on the order of 100-200 instructions ahead, and that mechanism depends on branches being predicted correctly. We have to explicitly insert slack in our dataflow schedule, and we must distribute it well enough for instruction reordering to see the gaps.

This specific instance is a particularly bad case for contemporary machines: step B populates the entry with regular (64-bit) register writes, while step C copies the same bytes with vector reads and writes. Travis Downs looked into this forwarding scenario and found that no other read-after-write setup behaves this badly, on Intel or AMD. That’s probably why the movdqu vector load instruction was such an issue. If the compiler had emitted the copy with GPR reads and writes, that might have been enough for the hardware to hide the latency. However, as Travis points out on Twitter, it’s hard for a compiler to get that right across compilation units. In any case, our most reliable (more so than passing this large struct by value and hoping the compiler will avoid mismatched instructions) and powerful tool to fix this at the source level is to schedule operations manually.

The dataflow graph for each loop iteration is currently a pure chain:

A1
|
v
B1
|
v
C1

A2
|
v
B2
|
v
C2


How does one add slack to these chains? With bounded queues!

# Experiment

My first fix was to add a one-element buffer between steps B and C. The inner loop became

A1. Generate a new key-value pair
C0. Insert the hash element from the previous iteration
B1. Mangle the kv pair and stash that in the buffer
A2.
C1.
B2
etc.


which yields a dataflow graph like

|     A1
v     |
C0    |
      |
      v
      B1
      |
      |     A2
      v     |
      C1    |
            |
            v
            B2
            |


We’ve introduced slack between steps A and B (there’s now step C from the previous iteration between them), and between steps B and C (we shifted step A from the next iteration between them). There isn’t such a long delay between the definition of a value and its use that the data is likely to be evicted from L1. However, there is more than enough work available between them to keep the pipeline busy with useful work while C waits for B’s result, or B for A’s. That was a nice single-digit improvement in query latency for my internal benchmark, just by permuting a loop.
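In Python-flavoured pseudocode (the element format and the dict-based table are stand-ins for the real hash table code), the permuted loop looks like:

```python
def insert_all(pairs):
    """One-element buffer between steps B and C: step C always operates on
    the element built in the *previous* iteration, so independent work sits
    between dependent steps in program order."""
    def make_element(kv):            # step B: mangle a kv pair (stand-in)
        key, value = kv
        return (hash(key), key, value)

    table = {}
    pending = None                   # the one-element buffer
    for kv in pairs:                 # step A: next key-value pair
        if pending is not None:      # step C: insert the previous element
            _, key, value = pending
            table[key] = value
        pending = make_element(kv)   # step B: build this iteration's element
    if pending is not None:          # drain the buffer after the loop
        _, key, value = pending
        table[key] = value
    return table
```

The interpreter obviously won’t benefit the way compiled code does; the point is the shape of the dependency graph, not the language.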

If a one-element buffer helps, we should clearly experiment with the buffer size, and that’s where I found a more impactful speed-up. Once we have an array of elements to insert in a hash table, we can focus on a bulk insert of maybe 8 or 10 elements: instead of trying to improve the latency for individual writes, we can focus on the throughput for multiple inserts at once. That’s good because throughput is an easier problem than latency. In the current case, passing the whole buffer to the hash table code made it easier to pipeline the insert loop in software: we can compute hashes ahead of time, and accelerate random accesses to the hash table with software prefetching. The profile for the new inner loop is flatter, and the hottest part is as follows

      |       mov        0x8(%rsp),%rdx
9.91 |       lea        (%r12,%r12,4),%rax
0.64 |       prefetcht0 (%rdx,%rax,8)
17.04 |       cmp        %rcx,0x28(%rsp)


Again, the blame for a “slow” instruction hits the following instruction, so it’s not lea (multiplying by 5) or cmp that are slow; it’s the load from the stack and the prefetch. The good news is that these instructions do not have any dependents. It’s all prefetching, and that’s only used for its side effects. Moreover, they come from a block of code that was pipelined in software and executes one full iteration ahead of where its side effects might be useful. It doesn’t really matter if these instructions are slow: they’re still far from being on the critical path! This last restructuring yielded a 20% speed-up on a few slow queries.
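The bulk restructuring can be sketched as follows (Python has no prefetch instruction, so the up-front hashing pass stands in for the stage where the C code also issues prefetches for the destination buckets):

```python
def insert_batch(table, elements, batch=8):
    """Software-pipelined bulk insert: the first pass over each chunk
    computes every hash (and, in the real code, prefetches the target
    buckets); the second pass performs the table writes."""
    for start in range(0, len(elements), batch):
        chunk = elements[start:start + batch]
        hashes = [hash(key) for key, _ in chunk]    # hash/prefetch pass
        for h, (key, value) in zip(hashes, chunk):  # insert pass
            table[key] = (h, value)
    return table
```

Splitting the work into passes is what lets the prefetches run a full iteration ahead of the dependent accesses.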

I described two tools that I use regularly when optimising code for contemporary hardware. Finding ways to scatter around scheduling slack is always useful, both in software and in real life planning.5 One simple way to do so is to add bounded buffers, and to flush buffers as soon as they fill up (or refill when they become empty), instead of waiting until the next write to the buffer. However, I think the more powerful transformation is using buffering to expose bulk operations, which tends to open up more opportunities than just doing the same thing in a loop. In the case above, we found a 20% speed-up; for someone who visits their Backtrace dashboard a couple times a day, that can add up to an hour or two at the end of the year.

TL;DR: When a function is hot enough to look into, it’s worth asking why it’s called so often, in order to focus on higher level bulk operations.

1. And by that, I mean I started working there a couple months ago (:

2. I think that’s a meaty idea, and am planning a longer post on that data structure and where it fits in the hash/sort join continuum.

3. Would I have avoided this issue if I had directly passed by value? The resulting code might have been friendlier to store-to-load forwarding than loading a whole 128-bit SSE register, but see the next footnote.

4. Store-to-load forwarding can help improve the performance of this pattern, when we use forwarding patterns that the hardware supports. However, this mechanism can only decrease the penalty of serial dependencies, e.g., by shaving away some or all of the time it takes to store a result to cache and load it back; even when results can feed directly into dependencies, we still have to wait for inputs to be computed. This is fundamentally a scheduling issue.

5. Unless you’re writing schedule optimising software and people will look at the result. A final hill climbing pass to make things look artificially tight often makes for an easier sale in that situation.

]]>
<![CDATA[Lazy Linear Knapsack]]> 2020-01-20T23:10:19-05:00 https://www.pvk.ca/Blog/2020/01/20/lazy-linear-knapsack The continuous knapsack problem may be the simplest non-trivial linear programming problem:

$\max_{x \in [0, 1]^n} p'x$ subject to $w'x \leq b.$

It has a linear objective, one constraint, and each decision variable is bounded to ensure the optimum exists. Note the key difference from the binary knapsack problem: decision variables are allowed to take any value between 0 and 1. In other words, we can, e.g., stick half of a profitable but large item in the knapsack. That’s why this knapsack problem can be solved in linear time.

## Dual to primal is reasonable

Duality also lets us determine the shape of all optimal solutions to this problem. For each item $$i$$ with weight $$w_i$$ and profit $$p_i$$, let its profit ratio be $$r_i = p_i / w_i,$$ and let $$\lambda^\star$$ be the optimal dual (Lagrange or linear) multiplier associated with the capacity constraint $$w'x \leq b.$$ If $$\lambda^\star = 0,$$ we simply take all items with a positive profit ratio ($$r_i > 0$$) and a non-negative weight $$w_i \geq 0.$$ Otherwise, every item with a profit ratio $$r_i > \lambda^\star$$ will be at its weight upper bound (1 if $$w_i \geq 0$$, 0 otherwise), and items with $$r_i < \lambda^\star$$ will instead be at their lower bound (0 if $$w_i \geq 0,$$ and 1 otherwise).

Critical items, items with $$r_i = \lambda^\star,$$ will take any value that results in $$w’x = b.$$ Given $$\lambda^\star,$$ we can derive the sum of weights for non-critical items; divide the remaining capacity for critical items by the total weight of critical items, and let that be the value for every critical item (with the appropriate sign for the weight).

For example, if we have capacity $$b = 10,$$ and the sum of weights for non-critical items in the knapsack is $$8,$$ we’re left with another two units of capacity to distribute however we want among critical items (they all have the same profit ratio $$r_i = \lambda^\star,$$ so it doesn’t matter where that capacity goes). Say critical items with a positive weight have a collective weight of 4; we could then assign a value of $$2 / 4 = 0.5$$ to the corresponding decision variable (and 0 for critical items with a non-positive weight).

We could instead have $$b = 10,$$ and the sum of weights for non-critical items in the knapsack $$12$$: we must find two units of capacity among critical items (they all cost $$r_i = \lambda^\star$$ per unit, so it doesn’t matter which). If critical items with a negative weight have a collective weight of $$-3,$$ we could assign a value of $$-2 / -3 = 0.6\overline{6}$$ to the corresponding decision variables, and 0 for critical items with a non-negative weight.

The last case highlights something important about the knapsack: in general, we can’t assume that the weights or profits are positive. We could have an item with a non-positive weight and non-negative profit (that’s always worth taking), an item with positive weight and negative profit (never interesting), or weights and profits of the same sign. The last case is the only one that calls for actual decision making. Classically, items with negative weight and profit are rewritten away, by assuming they’re taken in the knapsack, and replacing them with a decision variable for the complementary decision of removing that item from the knapsack (i.e., removing the additional capacity in order to improve the profit). I’ll try to treat them directly as much as possible, because that reduction can be a significant fraction of solve times in practice.

The characterisation of optimal solutions above makes it easy to directly handle elements with a negative weight: just find the optimal multiplier, compute the contribution of non-critical elements (with decision variables at a bound) to the left-hand side of the capacity constraint, separately sum the negative and positive weights for critical elements, then do a final pass to distribute the remaining capacity to critical elements (and 0-weight / 0-value elements if one wishes).
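Assuming we already know $$\lambda^\star > 0$$ and all weights are nonzero, that procedure might look like this sketch (function name and representation are mine):

```python
def solution_given_multiplier(items, b, lam):
    """items: (weight, profit) pairs with nonzero weights; lam: the optimal
    multiplier, assumed > 0.  Returns decision values x in [0, 1]^n."""
    x = [0.0] * len(items)
    used = 0.0                  # capacity claimed by non-critical items
    crit_pos = crit_neg = 0.0   # critical weight, tallied by sign
    for i, (w, p) in enumerate(items):
        r = p / w
        if r == lam:            # critical: value assigned in the final pass
            if w > 0:
                crit_pos += w
            else:
                crit_neg += w
        elif (r > lam) == (w > 0):
            x[i] = 1.0          # non-critical item at its weight upper bound
            used += w
    residual = b - used         # capacity left for critical items
    for i, (w, p) in enumerate(items):
        if p / w == lam:
            if residual >= 0 and w > 0:
                x[i] = residual / crit_pos
            elif residual < 0 and w < 0:
                x[i] = residual / crit_neg
    return x
```

Exact floating-point comparison against `lam` is fine here only because the multiplier is one of the $$r_i$$ by construction.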

## Solving the dual looks like selection

Finding the optimal multiplier $$\lambda^\star$$ is similar to a selection problem: the value is either 0 (the capacity constraint is redundant), or one of the profit ratios $$r_i,$$ and, given a multiplier value $$\lambda,$$ we can determine if it’s too high or too low in linear time. If the non-critical elements yield a left-hand side such that critical elements can’t add enough capacity (i.e., no solution with the optimal form can be feasible), $$\lambda$$ is too low. If the maximum weight of potentially optimal solutions is too low, $$\lambda$$ is too high.

We can thus sort the items by profit ratio $$r_i$$, compute the total weight corresponding to each ratio with a prefix sum (with a pre-pass to sum all negative weights), and perform a linear (or binary) search to find the critical profit ratio. Moreover, the status of non-critical items is monotonic as $$\lambda$$ grows: if an item with positive weight is taken at $$\lambda_0$$, it is also taken for every $$\lambda \leq \lambda_0$$, and a negative-weight item that’s taken at $$\lambda_0$$ is also taken for every $$\lambda \geq \lambda_0.$$ This means we can adapt selection algorithms like Quickselect to solve the continuous knapsack problem in linear time.
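Before reaching for Quickselect, the sort-and-prefix-sum version is a handy reference point; here’s a sketch under the simplifying assumption that all weights are positive:

```python
def critical_ratio(items, b):
    """items: (weight, profit) pairs with positive weights.  Returns the
    optimal multiplier: 0.0 if everything fits (the capacity constraint is
    redundant), otherwise the profit ratio of the critical item."""
    order = sorted(items, key=lambda wp: wp[1] / wp[0], reverse=True)
    used = 0.0                      # prefix sum of weights, best ratio first
    for w, p in order:
        if used + w >= b:
            return p / w            # capacity runs out at this ratio
        used += w
    return 0.0
```

Quickselect replaces the sort with expected-linear-time partitioning, but the prefix-sum structure is the same.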

I’m looking at large instances, so I would like to run these algorithms in parallel or even distributed on multiple machines, and ideally use GPUs or SIMD extensions. Unfortunately, selection doesn’t parallelise very well: we can run a distributed quickselect where every processor partitions the data in its local RAM, but that still requires a logarithmic number of iterations.

## Selection looks like quantile estimation; does the dual?

Lazy Select offers a completely different angle for the selection problem. Selecting the $$k$$th smallest element from a list of $$n$$ elements is the same as finding the $$k / n$$th quantile1 in that list of $$n$$ elements. We can use concentration bounds2 to estimate quantiles from a sample of, e.g., $$m = n^{3/4}$$ elements: the population quantile value is very probably between the $$qm - \sqrt{m}\log m$$th and $$qm + \sqrt{m}\log m$$th values of the sample. Moreover, this range very probably includes at most $$\mathcal{O}(n^{3/4})$$ elements3, so a second pass suffices to buffer all the elements around the quantile, and find the exact quantile. Even with a much smaller sample size $$m = \sqrt{n},$$ we would only need four passes.

Unfortunately, we can’t directly use that correspondence between selection and quantile estimation for the continuous knapsack.

I tried to apply a similar idea by sampling the knapsack elements equiprobably, and extrapolating from a solution to the sample. For every $$\lambda,$$ we can derive a selection function $$f_\lambda (i) = I[r_i \geq \lambda]w_i$$ (invert the condition if the weight is negative), and scale up $$\sum_i f_\lambda(i)$$ from the sample to the population. As long as we sample independently of $$f_\lambda$$, we can reuse the same sample for all $$f_\lambda.$$ The difficulty here is that, while the error for Lazy Select scales as a function of $$n,$$ the equivalent bounds with variable weights are a function of $$n(|\max_i w_i| + |\min_i w_i|)^2.$$ That doesn’t seem necessarily practical; scaling with $$\sum_i |w_i|$$ would be more reasonable.

Good news: we can hit that, thanks to linearity.

Let’s assume weights are all integers. Any item with weight $$w_i$$ is equivalent to $$w_i$$ subitems with unit weight (or $$-w_i$$ elements with negative unit weight), and the same profit ratio $$r_i$$, i.e., profit $$p_i / |w_i|$$. The range of subitem weights is now a constant.

We could sample uniformly from the subitems with a Bernoulli for each subitem, but that’s clearly linear time in the sum of weights, rather than the number of elements. If we wish to sample roughly $$m$$ elements from a total weight $$W = \sum_i |w_i|,$$ we can instead determine how many subitems (units of weight) to skip before sampling with a Geometric of success probability $$m / W.$$ This shows us how to lift the integrality constraint on weights: sample from an Exponential with the same parameter $$m / W$$!

That helps, but we could still end up spending much more than constant time on very heavy elements. The trick is to deterministically special-case these elements: stash any element with large weight $$w_i \geq W / m$$ to the side, exactly once. By Markov’s inequality,4 we know there aren’t too many heavy elements: at most $$m.$$

## Let’s test this out

The heart of the estimation problem can be formalised as follows: given a list of elements $$i \in [n]$$ with weight $$w_i \geq 0$$, generate a sample of $$m \leq n$$ elements ahead of time. After the sample has been generated, we want to accept an arbitrary predicate $$p \in \{0,1\}^n$$ and estimate $$\sum_{i\in [n]} p(i) w_i.$$

We just sketched an algorithm for this problem; let’s see what it looks like in Python. The initial sample logic has to determine the total weight, and sample items with probability proportional to their weight. Items heavier than the cutoff are not considered in the sample and instead saved to an auxiliary list.
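A sketch of that sampling pass (my own reconstruction, under the simplifying assumption of non-negative weights):

```python
import random

def build_sample(weights, m):
    """Sample ~m indices with probability proportional to weight, using
    exponential skips of rate m / W along the cumulative weight axis.
    Items at least as heavy as W / m are recorded exactly, once each."""
    W = float(sum(weights))
    cutoff = W / m
    heavy, sample = [], []
    next_point = random.expovariate(m / W)  # position of the next sample
    pos = 0.0                               # cumulative sampled weight
    for i, w in enumerate(weights):
        if w >= cutoff:
            heavy.append(i)                 # handled deterministically
            continue
        while next_point < pos + w:         # sample points landing in item i
            sample.append(i)
            next_point += random.expovariate(m / W)
        pos += w
    return heavy, sample
```

An item can appear several times in the sample; that’s exactly the subitem view from the previous section.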

We can assemble the resulting sample (and list of “large” elements) to compute a lower bound on the weight of items that satisfy any predicate that’s independent of the sampling decisions. The value for large elements is trivial: we have a list of all large elements. We can subtract the weight of all large elements from the total item weight, and determine how much we have to extrapolate up.
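Given such a sample and heavy-item list, the extrapolation might look like this sketch (the confidence slack that would turn the estimate into a proper lower bound is omitted; each sample point stands for $$W / m$$ units of weight):

```python
def estimate_weight(weights, heavy, sample, m, pred):
    """Estimate the total weight of items satisfying pred: exact for the
    heavy list, extrapolated from the weight-proportional sample for the
    rest.  Subtracting a concentration-bound slack from the sampled term
    would yield a probabilistic lower bound instead of a point estimate."""
    W = float(sum(weights))
    exact = sum(weights[i] for i in heavy if pred(i))
    hits = sum(1 for i in sample if pred(i))  # sample points that match
    return exact + hits * (W / m)             # each point ~ W / m weight
```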

And finally, here’s how we can sample from an arbitrary list of items, compute a lower bound on the weight of items that satisfy a predicate, and compare that with the exact value.

How do we test that? Far too often, I see tests for randomised algorithms where the success rate is computed over randomly generated inputs. That’s too weak! For example, this approach could lead us to accept that the identity function is a randomised sort function, with success probability $$\frac{1}{n!}.$$

The property we’re looking for is that, for any input, the success rate (with the expectation over the pseudorandom sampling decisions) is as high as requested.

For a given input (list of items and predicate), we can use the Confidence sequence method (CSM) to confirm that the lower bound is valid at least $$1 - \alpha$$ of the time.

With a false positive rate of at most one in a million,5 we can run automated tests against check_bounds. I’ll use Hypothesis to generate lists of pairs of weight and predicate value:
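The generator might be sketched as follows; the property asserted here is a trivial stand-in for the real check_bounds, which is not reproduced:

```python
from hypothesis import given, strategies as st

# Each test input is a list of (weight, predicate value) pairs:
# non-negative finite weights, and one boolean per item.
weighted_items = st.lists(
    st.tuples(
        st.floats(min_value=0.0, max_value=1e6, allow_nan=False),
        st.booleans(),
    ),
    min_size=1,
)

@given(weighted_items)
def test_bounds(items):
    weights = [w for w, _ in items]
    true_weight = sum(w for w, f in items if f)
    # Stand-in property: the real test would call check_bounds here.
    assert 0.0 <= true_weight <= sum(weights)
```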

Bimodal inputs tend to be harder, so we can add a specialised test generator.

Again, we use Hypothesis to generate inputs, and the Confidence sequence method (available in C, Common Lisp, and Python) to check that the lower bound is valid with probability at least $$1 - \alpha$$. The CSM tests for this statistical property with power 1 and adjustable error rate (in our case, one in a million): we only provide a generator for success values, and the driver adaptively determines when it makes sense to make a call and stop generating more data, while accounting for multiple hypothesis testing.

TL;DR: the estimation algorithm for individual sampling passes works, and the combination of Hypothesis and Confidence Sequence Method lets us painlessly test for a statistical property.

We can iteratively use this sampling procedure to derive lower and (symmetrically) upper bounds for the optimal Lagrange multiplier $$\lambda^\star,$$ and Hoeffding’s inequality lets us control the probability that the lower and upper bounds are valid. Typically, we’d use a tolerance of $$\sqrt{\log(n) / n},$$ for an error rate of $$1 / n^2.$$ I prefer to simply use something like $$7 / \sqrt{n}:$$ the error rate is then less than $$10^{-42},$$ orders of magnitude smaller than the probability of hardware failure in any given nanosecond.6 We can still check for failure of our Las Vegas algorithm, but if something went wrong, it’s much more likely that we detected a hardware failure than anything else. It’s like running SuperPi to stress test a computer, except the work is useful. 😉

## Repeat as necessary to solve a knapsack

How many sampling passes do we need? Our bounds are in terms of the sum of item weight: if we let our sample size be in $$\Theta(\sqrt{n}),$$ the sum of weights $$\sum_i |w_i|$$ for unfathomed items (that may or may not be chosen depending on the exact optimal multiplier $$\lambda^\star$$ in the current range) will very probably shrink by a factor of $$\Omega(n^{1/4}).$$ The initial sum can, in the worst case, be exponentially larger than the bitlength of the input, so even a division by $$n^{1/4}$$ isn’t necessarily that great.

I intend to apply this Lazy Linear Knapsack algorithm on subproblems in a more interesting solver, and I know that the sum of weights is bounded by the size of the initial problem, so that’s good enough for me! After a constant ($$\approx 4$$) number of passes, the difference in item weight between the lower and upper bound on $$\lambda^\star$$ should also be at most 1. One or two additional passes will get me near optimality (e.g., within $$10^{-4}$$), and the lower bound on $$\lambda^\star$$ should thus yield a super-optimal solution that’s infeasible by at most $$10^{-4},$$ which is, for my intended usage (again), good enough.

Given an optimal enough $$\lambda^\star,$$ we can construct an explicit solution in one pass, plus a simple fixup for critical items. This Lazy Knapsack seems pretty reasonable for parallel or GPU computing: each sampling pass only needs to read the items (i.e., no partitioning-like shuffling) before writing a fraction of the data to a sample buffer, and we only need a constant number of passes (around 6 or 7) in the worst case.

1. It’s more like a fractional percentile, but you know what I mean: the value such that the distribution function at that point equals $$k / n$$.

2. Binomial bounds offer even stronger confidence intervals when the estimate is close to 0 or 1 (where Hoeffding’s bound would yield a confidence interval that juts outside $$[0, 1]$$), but don’t impact worst-case performance.

3. Thanks to Hoeffding’s inequality, again.

4. That’s a troll. I think any self-respecting computer person would rather see it as a sort of pigeonhole argument.

5. We’re juggling a handful of error rates here. We’re checking whether the success rate for the Lazy Knapsack sampling subroutine is at least as high as $$1 - \alpha,$$ as requested in the test parameters, and we’re doing so with another randomised procedure that will give an incorrect conclusion at most once every one million invocations.

6. This classic Google study found 8% of DIMMs hit at least one error per year; that’s more than one single-bit error every $$10^9$$ DIMM-second, and they’re mostly hard errors. More recently, Facebook reported that uncorrectable errors affect 0.03% of servers each month; that’s more than one uncorrectable error every $$10^{10}$$ server-second. If we performed one statistical test every nanosecond, the probability of memory failure alone would still dominate statistical errors by $$10^{20}!$$

]]>
<![CDATA[A multiset of observations with constant-time sample mean and variance]]> 2019-11-30T23:51:41-05:00 https://www.pvk.ca/Blog/2019/11/30/a-multiset-of-observations-with-constant-time-sample-mean-and-variance Fixed notation issues in the “Faster multiset updates” section. Thank you Joonas.

Let’s say you have a multiset (bag) of “reals” (floats or rationals), where each value is a sampled observation. It’s easy to augment any implementation of the multiset ADT to also return the sample mean of the values in the multiset in constant time: track the sum of values in the multiset, as they are individually added and removed. This requires one accumulator and a counter for the number of observations in the multiset (i.e., constant space), and adds a constant time overhead to each update.
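As a sketch, that augmentation is just two fields and a handful of constant-time updates (MeanBag is a made-up name):

```python
class MeanBag:
    """Multiset of observations with O(1) insert/remove and sample mean."""
    def __init__(self):
        self.n = 0        # number of observations in the multiset
        self.total = 0.0  # running sum of all observations

    def add(self, x):
        self.n += 1
        self.total += x

    def remove(self, x):  # caller guarantees x is in the multiset
        self.n -= 1
        self.total -= x

    def mean(self):
        return self.total / self.n
```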

It’s not as simple when you also need the sample variance of the multiset $$X$$, i.e.,

$\frac{1}{n - 1} \sum\sb{x \in X} (x - \hat{x})\sp{2},$

where $$n = |X|$$ is the sample size and $$\hat{x}$$ is the sample mean $$\sum\sb{x\in X} x/n,$$ ideally with constant query time, and constant and update time overhead.

One could try to apply the textbook equality

$s\sp{2} = \frac{1}{n(n-1)}\left[n\sum\sb{x\in X} x\sp{2} - \left(\sum\sb{x\in X} x\right)\sp{2}\right].$

However, as Knuth notes in TAoCP volume 2, this expression loses a lot of precision to round-off in floating point: in extreme cases, the difference might be negative (and we know the variance is never negative). More commonly, we’ll lose precision when the sampled values are clustered around a large mean. For example, the sample standard deviation of 1e8 and 1e8 - 1 is the same as for 0 and 1 (about 0.71). However, the expression above would evaluate it to 0.0 for the first pair, even in double precision: while 1e8 is comfortably within range for double floats, its square 1e16 is outside the range where all integers are represented exactly.

Knuth refers to a better behaved recurrence by Welford, where a running sample mean is subtracted from each new observation before squaring. John Cook has a C++ implementation of the recurrence that adds observations to a sample variance in constant time. In Python, this streaming algorithm looks like this.
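A sketch of that streaming algorithm, with field names chosen to match the discussion that follows (observe, mean, var_sum):

```python
class StreamingVariance:
    """Welford's streaming accumulator for sample mean and variance."""
    def __init__(self):
        self.n = 0          # number of observations so far
        self.mean = 0.0     # running sample mean
        self.var_sum = 0.0  # sum of squared differences from the mean

    def observe(self, v):
        self.n += 1
        old_mean = self.mean
        self.mean += (v - old_mean) / self.n
        self.var_sum += (v - old_mean) * (v - self.mean)

    def get_mean(self):
        return self.mean

    def get_variance(self):
        return self.var_sum / (self.n - 1) if self.n > 1 else 0.0
```

On the problem pair above, this accumulator finds a variance of 0.5 for {1e8, 1e8 - 1}, where the textbook formula rounds to 0.0.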

That’s all we need for insert-only multisets, but does not handle removals; if only we had removals, we could always implement updates (replacement) as a removal and an insertion.

Luckily, StreamingVariance.observe looks invertible. It shouldn’t be hard to recover the previous sample mean, given v, and, given the current and previous sample means, we can re-evaluate (v - old_mean) * (v - self.mean) and subtract it from self.var_sum.

Let $$\hat{x}\sp{\prime}$$ be the sample mean after observe(v). We can derive the previous sample mean $$\hat{x}$$ from $$\hat{x}\sp{\prime}$$ and $$v$$:

$(n - 1)\hat{x} = n\hat{x}\sp{\prime} - v \Leftrightarrow \hat{x} = \hat{x}\sp{\prime} + \frac{\hat{x}\sp{\prime} - v}{n-1}.$

This invertibility means that we can undo calls to observe in LIFO order. We can’t handle arbitrary multiset updates, only a stack of observations. That’s still better than nothing.
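A sketch of such a stack, using the identity above to recover the previous mean in pop (the push/pop/get_mean/get_variance names match the test driver described below):

```python
class VarianceStack:
    """Welford accumulator whose pushes can be undone in LIFO order."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.var_sum = 0.0

    def push(self, v):  # Welford's forward update
        self.n += 1
        old_mean = self.mean
        self.mean += (v - old_mean) / self.n
        self.var_sum += (v - old_mean) * (v - self.mean)

    def pop(self, v):
        """Undo push(v); `v` must be the most recently pushed value."""
        if self.n == 1:
            self.n, self.mean, self.var_sum = 0, 0.0, 0.0
            return
        # Previous mean: x-hat = x-hat' + (x-hat' - v) / (n - 1).
        old_mean = self.mean + (self.mean - v) / (self.n - 1)
        self.var_sum -= (v - old_mean) * (v - self.mean)
        self.mean = old_mean
        self.n -= 1

    def get_mean(self):
        return self.mean

    def get_variance(self):
        return self.var_sum / (self.n - 1) if self.n > 1 else 0.0
```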

Before going any further, let’s test this.

## Testing the VarianceStack

The best way to test the VarianceStack is to execute a series of push and pop calls, and compare the results of get_mean and get_variance with batch reference implementations.

I could hardcode calls in unit tests. However, that quickly hits diminishing returns in terms of marginal coverage VS developer time. Instead, I’ll be lazy, completely skip unit tests, and rely on Hypothesis, its high level “stateful” testing API in particular.

We’ll keep track of the values pushed and popped off the observation stack in the driver: we must make sure they’re matched in LIFO order, and we need the stack’s contents to compute the reference mean and variance. We’ll also want to compare the results with reference implementations, modulo some numerical noise. Let’s try to be aggressive and bound the number of float values between the reference and the actual results.

This initial driver does not even use the VarianceStack yet. All it does is push values to the reference stack, pop values when the stack has something to pop, and check that the reference implementations match themselves after each call: I want to first shake out any bug in the test harness itself.

Not surprisingly, Hypothesis does find an issue in the reference implementation:

Falsifying example:
state = VarianceStackDriver()
state.push(v=0.0)
state.push(v=2.6815615859885194e+154)
state.teardown()


We get a numerical OverflowError in reference_variance: 2.68...e154 / 2 is slightly greater than sqrt(sys.float_info.max) = 1.3407807929942596e+154, so taking the square of that value errors out instead of returning infinity.

Let’s start by clamping the range of the generated floats.

Now that the test harness doesn’t find fault in itself, let’s hook in the VarianceStack, and see what happens when only push calls are generated (i.e., first test only the standard streaming variance algorithm).

Falsifying example:
state = VarianceStackDriver()
state.push(v=1.0)
state.push(v=1.488565707357403e+138)
state.teardown()
F


The reference finds a variance of 5.54e275, which is very much not the streaming computation’s 1.108e276. We can manually check that the reference is wrong: it’s missing the n - 1 correction term in the denominator.

We should use this updated reference.

Let’s now re-enable calls to pop().

And now things fail in new and excitingly numerical ways.

Falsifying example:
state = VarianceStackDriver()
state.push(v=0.0)
state.push(v=0.00014142319560050964)
state.push(v=14188.9609375)
state.pop()
state.teardown()
F


This counter-example fails with the online variance returning 0.0 instead of 1e-8. That’s not unexpected: removing (the square of) a large value from a running sum spells catastrophic cancellation. It’s also not that bad for my use case, where I don’t expect to observe very large values.

Another problem for our test harness is that floats are very dense around 0.0, and I’m ok with small (around 1e-8) absolute error because the input and output will be single floats.

Let’s relax assert_almost_equal, and restrict generated observations to fall in $$[-2\sp{12}, 2\sp{12}].$$

With all these tweaks to make sure we generate easy (i.e., interesting) test cases, Hypothesis fails to find a failure after its default time budget.

I’m willing to call that a victory.

## From stack to full multiset

We have tested code to undo updates in Welford’s classic streaming variance algorithm. Unfortunately, inverting pushes away only works for LIFO edits, and we’re looking for arbitrary inserts and removals (and updates) to a multiset of observations.

However, both the mean $$\hat{x} = \sum\sb{x\in X} x/n$$ and the centered second moment $$\sum\sb{x\in X}(x - \hat{x})\sp{2}$$ are order-independent: they’re just sums over all observations. Disregarding round-off, we’ll find the same mean and second moment regardless of the order in which the observations were pushed in. Thus, whenever we wish to remove an observation from the multiset, we can assume it was the last one added to the estimates, and pop it off.

We think we know how to implement running mean and variance for a multiset of observations. How do we test that with Hypothesis?

The hardest part about testing dictionary (map)-like interfaces is making sure to generate valid identifiers when removing values. As it turns out, Hypothesis has built-in support for this important use case, with its Bundles. We’ll use that to test a dictionary from observation name to observation value, augmented to keep track of the current mean and variance of all values.

Each call to add_entry will either go to update_entry if the key already exists, or add an observation to the dictionary and streaming estimator. If we have a new key, it is added to the keys Bundle; calls to del_entry and update_entry draw keys from this Bundle. When we remove an entry, it’s also consumed from the keys Bundle.
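A sketch of the driver’s shape, with a plain dict standing in for the full dictionary-with-variance (a real driver would also maintain the streaming estimator and cross-check it against the reference computation):

```python
from hypothesis import strategies as st
from hypothesis.stateful import (Bundle, RuleBasedStateMachine,
                                 consumes, invariant, rule)

class DictWithVarianceDriver(RuleBasedStateMachine):
    """Stateful test sketch: a dict from name to value, with reference
    mean/variance recomputed from scratch after each step."""
    keys = Bundle("keys")

    def __init__(self):
        super().__init__()
        self.entries = {}

    @rule(target=keys, name=st.text(), v=st.floats(-2.0**12, 2.0**12))
    def add_entry(self, name, v):
        self.entries[name] = v  # insert, or overwrite an existing key
        return name

    @rule(name=consumes(keys))
    def del_entry(self, name):
        self.entries.pop(name, None)  # the same name may be drawn twice

    @invariant()
    def estimates_match(self):
        values = list(self.entries.values())
        if len(values) < 2:
            return
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
        assert var >= 0.0  # a real test compares against the estimator
```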

Hypothesis finds no fault with our new implementation of dictionary-with-variance, but update seems like it could be much faster and more numerically stable, and I intend to mostly use this data structure for calls to update.

The key operation for my use-case is to update one observation by replacing its old value with a new one. We can maintain the estimator by popping old away and pushing new in, but this business with updating the number of observations n and rescaling everything seems like a lot of numerical trouble.

We should be able to do better.

We’re replacing the multiset of sampled observations $$X$$ with $$X\sp{\prime} = X \setminus \{\textrm{old}\} \cup \{\textrm{new}\}.$$ It’s easy to maintain the mean after this update: $$\hat{x}\sp{\prime} = \hat{x} + (\textrm{new} - \textrm{old})/n.$$

The update to self.var_sum, the sum of squared differences from the mean, is trickier. We start with $$v = \sum\sb{x\in X} (x - \hat{x})\sp{2},$$ and we wish to find $$v\sp{\prime} = \sum\sb{x\sp{\prime}\in X\sp{\prime}} (x\sp{\prime} - \hat{x}\sp{\prime})\sp{2}.$$

Let $$\delta = \textrm{new} - \textrm{old}$$ and $$\delta\sb{\hat{x}} = \delta/n.$$ We have $\sum\sb{x\in X} (x - \hat{x}\sp{\prime})\sp{2} = \sum\sb{x\in X} [(x - \hat{x}) - \delta\sb{\hat{x}}]\sp{2},$ and $[(x - \hat{x}) - \delta\sb{\hat{x}}]\sp{2} = (x - \hat{x})\sp{2} - 2\delta\sb{\hat{x}} (x - \hat{x}) + \delta\sb{\hat{x}}\sp{2}.$

We can reassociate the sum, and find

$\sum\sb{x\in X} (x - \hat{x}\sp{\prime})\sp{2} = \sum\sb{x\in X} (x - \hat{x})\sp{2} - 2\delta\sb{\hat{x}} \sum\sb{x \in X} (x - \hat{x}) + n \delta\sb{\hat{x}}\sp{2}$

Once we notice that $$\hat{x} = \sum\sb{x\in X} x/n,$$ it’s clear that the middle term sums to zero, and we find the very reasonable

$v\sb{\hat{x}\sp{\prime}} = \sum\sb{x\in X} (x - \hat{x})\sp{2} + n \delta\sb{\hat{x}}\sp{2} = v + \delta \delta\sb{\hat{x}}.$

This new accumulator $$v\sb{\hat{x}\sp{\prime}}$$ corresponds to the sum of the squared differences between the old observations $$X$$ and the new mean $$\hat{x}\sp{\prime}$$. We still have to update one observation from old to new. The remaining adjustment to $$v$$ (self.var_sum) corresponds to going from $$(\textrm{old} - \hat{x}\sp{\prime})\sp{2}$$ to $$(\textrm{new} - \hat{x}\sp{\prime})\sp{2},$$ where $$\textrm{new} = \textrm{old} + \delta.$$

After a bit of algebra, we get $(\textrm{new} - \hat{x}\sp{\prime})\sp{2} = [(\textrm{old} - \hat{x}\sp{\prime}) + \delta]\sp{2} = (\textrm{old} - \hat{x}\sp{\prime})\sp{2} + \delta (\textrm{old} - \hat{x}\sp{\prime} + \textrm{new} - \hat{x}\sp{\prime}).$

The adjusted $$v\sb{\hat{x}\sp{\prime}}$$ already includes $$(\textrm{old} - \hat{x}\sp{\prime})\sp{2}$$ in its sum, so we only have to add the last term to obtain the final updated self.var_sum

$v\sp{\prime} = v\sb{\hat{x}\sp{\prime}} + \delta (\textrm{old} - \hat{x}\sp{\prime} + \textrm{new} - \hat{x}\sp{\prime}) = v + \delta [2 (\textrm{old} - \hat{x}) + \delta - \delta\sb{\hat{x}}].$
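In code, the update might look like the following sketch (VarianceBag here only implements what this derivation needs; the adjustment term mirrors the algebra above and the Z3 check in the next section):

```python
class VarianceBag:
    """Multiset of observations with O(1) mean/variance and O(1) update."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.var_sum = 0.0

    def add(self, v):  # Welford's forward update
        self.n += 1
        old_mean = self.mean
        self.mean += (v - old_mean) / self.n
        self.var_sum += (v - old_mean) * (v - self.mean)

    def update(self, old, new):
        """Replace observation `old` with `new`, in constant time."""
        delta = new - old
        delta_mean = delta / self.n
        # v' = v + delta * (2 * (old - mean) + delta - delta_mean)
        self.var_sum += delta * (2 * (old - self.mean) + (delta - delta_mean))
        self.mean += delta_mean

    def variance(self):
        return self.var_sum / (self.n - 1) if self.n > 1 else 0.0
```

For example, updating 0 to 4 in the multiset {0, 2} yields the mean and variance of {4, 2}.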

That’s our final implementation for VarianceBag.update, for which Hypothesis also fails to find failures.

## How much do you trust testing?

We have automated property-based tests and some human-checked proofs. Ship it?

I was initially going to ask a CAS to check my reformulations, but the implicit $$\forall$$ looked messy. Instead, I decided to check the induction hypothesis implicit in VarianceBag.update, and enumerate all cases up to a certain number of values with Z3 in IPython.

In : from z3 import *
In : x, y, z, new_x = Reals("x y z new_x")
In : mean = (x + y + z) / 3
In : var_sum = sum((v - mean) * (v - mean) for v in (x, y, z))
In : delta = new_x - x
In : new_mean = mean + delta / 3
In : delta_mean = delta / 3
In : adjustment = delta * (2 * (x - mean) + (delta - delta_mean))
In : new_var_sum = var_sum + adjustment

# We have our expressions. Let's check equivalence for mean, then var_sum
In : s = Solver()
In : s.push()
In : s.add(new_mean != (new_x + y + z) / 3)
In : s.check()
Out: unsat  # No counter example of size 3 for the updated mean
In : s.pop()

In : s.push()
In : s.add(new_mean == (new_x + y + z) / 3)  # We know the mean matches
In : s.add(new_var_sum != sum((v - new_mean) * (v - new_mean) for v in (new_x, y, z)))
In : s.check()
Out: unsat  # No counter example of size 3 for the updated variance


Given this script, it’s a small matter of programming to generalise from 3 values (x, y, and z) to any fixed number of values, and generate all small cases up to, e.g., 10 values.

I find the most important thing when it comes to using automated proofs is to insert errors and confirm we can find the bugs we’re looking for.

I did that by manually mutating the expressions for new_mean and new_var_sum in updated_expressions. This let me find a simple bug in the initial implementation of test_num_var: I used if not result instead of result != unsat, and both sat and unsat are truthy. The code initially failed to flag a failure when z3 found a counter-example for our correctness condition!

## And now I’m satisfied

I have code to augment an arbitrary multiset or dictionary with a running estimate of the mean and variance; that code is based on a classic recurrence, with some new math checked by hand, with automated tests, and with some exhaustive checking of small inputs (to which I claim most bugs can be reduced).

I’m now pretty sure the code works, but there’s another more obviously correct way to solve that update problem. This 2008 report by Philippe Pébay1 presents formulas to compute the mean, variance, and arbitrary moments in one pass, and shows how to combine accumulators, a useful operation in parallel computing.

We could use these formulas to augment an arbitrary $$k$$-ary tree and re-combine the merged accumulator as we go back up the (search) tree from the modified leaf to the root. The update would be much more stable (we only add and merge observations), and incur logarithmic time overhead (with linear space overhead). However, given the same time budget, and a logarithmic space overhead, we could also implement the constant-time update with arbitrary precision software floats, and probably guarantee even better precision.

The constant-time update I described in this post demanded more effort to convince myself of its correctness, but I think it’s always a better option than an augmented tree for serial code, especially if initial values are available to populate the accumulators with batch-computed mean and variance. I’m pretty sure the code works, and it’s up in this gist. I’ll be re-implementing it in C++ because that’s the language used by the project that led me to this problem; feel free to steal that gist.

1. There’s also a 2016 journal article by Pébay and others with numerical experiments, but I failed to implement their simpler-looking scalar update…

]]>