Revisiting VM tricks for safepoints
In February this year, Nathan Froyd described allocation sequences in SBCL and the alternative approaches other implementations use to make sure the runtime never sees half-initialised objects. At first sight, SBCL's sequence seems fairly large and slow for what is, in the end, just a thread-local pointer increment, even on x86-64, with its awesome 14 usable GPRs (RBP actually serves as a frame pointer in SBCL):
    OR BYTE PTR [R12+160], 8    ; set the pseudo-atomic bit
    ;; (what you'd expect for bump allocation:)
    MOV R11, [R12+80]           ; load the allocation pointer
    LEA RDX, [R11+16]           ; test for overflow
    CMP [R12+88], RDX
    JBE L4                      ; jump to an out-of-line sequence
    MOV [R12+80], RDX           ; or save the incremented pointer
    LEA RDX, [R11+7]            ; ... and tag it
    ;; (end of the allocation per se)
    XOR BYTE PTR [R12+160], 8   ; finally, unset the pseudo-atomic bit
    JEQ L2                      ; check for pending interrupt
    BREAK 9                     ; pending interrupt trap: trigger a signal if so
    L2: ...
That's two read-modify-write memory accesses and a test, just to make sure we're not caught with our pants down handling improperly-initialised objects.
One element of the alternatives is to actively check for pending interrupts (or GCs, etc.). Instead of allowing interruptions everywhere except around key sequences, interruptions are queued, and explicit checks are inserted by the compiler. That seems like an interesting implementation choice, so I wanted to see what it'd look like in SBCL.
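In C-ish pseudocode (a minimal sketch of the general technique, not anything from SBCL; every name here is made up), such an explicitly-polled scheme boils down to a per-thread flag and a cheap check at the locations the compiler picks:

    #include <stdbool.h>

    struct thread {
        volatile bool interrupt_pending;  /* set by the runtime when work is queued */
        /* ... the rest of the thread structure ... */
    };

    /* Run whatever interruption (signal, GC request, ...) was queued. */
    static void handle_pending_interrupt(struct thread *self)
    {
        self->interrupt_pending = false;
        /* ... perform the queued work ... */
    }

    /* What a compiler-inserted check amounts to: a load and a rarely-taken branch. */
    static inline void safepoint_poll(struct thread *self)
    {
        if (self->interrupt_pending)
            handle_pending_interrupt(self);
    }

    long count_to(struct thread *self, long n)
    {
        long acc = 0;

        safepoint_poll(self);         /* end of the function's prologue */
        for (long i = 0; i < n; i++) {
            safepoint_poll(self);     /* head of the loop */
            acc += i;
        }
        return acc;
    }

The branch is almost never taken, but each check still costs a load and a conditional jump; the memory-protection trick described below turns the whole thing into a plain memory access instead.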
It was pretty easy to adapt the compiler to insert such checks at appropriate locations (in the current case, at the end of each function's prologue, and at the head of loops). Using memory protection to turn an access into a conditional interruption seemed like a good idea, and that's what I quickly implemented.
These checks can be executed fairly often in tight loops, so it's important that they be very efficient. The first version loaded a magic address from thread-local storage (an address pointed to by a register), and then wrote to that address. When the thread had to be interrupted, the page containing that address was made unreadable and unwritable, so the next such access triggered a segfault, which was then handled specially.
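Concretely, that first version amounts to something like the following C sketch (hypothetical names, not SBCL's runtime code): each check loads the magic pointer and stores through it, and interrupting a thread means revoking access to the page behind it.

    #define _GNU_SOURCE
    #include <signal.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    struct thread {
        volatile char *safepoint_page;    /* the "magic" address */
        /* ... the rest of the thread structure ... */
    };

    /* The check itself: load the magic address, then store through it. */
    static inline void safepoint_check(struct thread *self)
    {
        *self->safepoint_page = 0;        /* harmless, unless the page is armed */
    }

    /* To interrupt a thread, revoke access to its page: the next check faults. */
    static void request_interruption(struct thread *target)
    {
        mprotect((char *)target->safepoint_page, getpagesize(), PROT_NONE);
    }

    static void segv_handler(int sig, siginfo_t *info, void *ctx)
    {
        (void)sig; (void)info; (void)ctx;
        /* A real handler would confirm info->si_addr lies on the safepoint page,
         * restore PROT_READ | PROT_WRITE with mprotect, and run the queued
         * interruption (or GC) before resuming the thread. */
    }

    static void install_safepoints(struct thread *self)
    {
        struct sigaction sa;

        /* One normally-writable page per thread. */
        self->safepoint_page = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
                                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        memset(&sa, 0, sizeof sa);
        sa.sa_sigaction = segv_handler;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);
    }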
The result was slow... about 25% more time than the original (signals-ful) version for a simple (loop repeat n do [not cons]) function, and no difference for a consing loop. Modifying the safepoint sequence to read from the magic address instead of writing to it halved the slowdown to ~10%, and did not improve the runtime of the consing loop. That's still far from impressive.
Executing two instructions at each safepoint seems obviously fishy.
Indeed, things improved noticeably when the safepoint sequence became a single TEST, as in Nathan's example from HotSpot. Instead of having to read an arbitrary address from the thread struct, I moved the "magic page" to a fixed offset from the thread structure. The safepoint sequence then became a single instruction, TEST EAX, [THREAD_STRUCT_REG + offset] (a register is dedicated to thread-local storage). That single instruction reads from memory and does not clobber any register. Unfortunately, that was only enough to bring the runtime of the safepoint version to the same level (+/- 1-2%) as that of the original code (it does save ~50 KB out of 40 MB on the x86-64 core :).
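In the same hypothetical C (field names and offsets are illustrative; SBCL actually keeps the thread structure's base in a dedicated register, R12 in the listing above), the single-instruction variant looks roughly like this, assuming the thread structure is mmap()ed at a page boundary with a spare page right after it:

    #include <sys/mman.h>

    #define SAFEPOINT_OFFSET 4096         /* one page past the thread struct's base */

    struct thread {
        char slots[SAFEPOINT_OFFSET];     /* the usual thread-local slots live here */
        char safepoint_page[4096];        /* the magic page, at a fixed offset */
    };

    /* The whole check: a read at a fixed offset from the thread base.  Compiled,
     * it is one instruction, e.g. TEST EAX, [R12 + SAFEPOINT_OFFSET]: it reads
     * memory, writes only the flags, and clobbers no general-purpose register. */
    static inline void safepoint_check(struct thread *self)
    {
        __asm__ volatile ("testl %%eax, %0"
                          : /* no outputs */
                          : "m" (*(const int *)self->safepoint_page)
                          : "cc");
    }

    /* Arming and disarming just toggle the protection of that single page,
     * exactly as in the previous sketch. */
    static void arm_safepoints(struct thread *self)
    {
        mprotect(self->safepoint_page, 4096, PROT_NONE);
    }

    static void disarm_safepoints(struct thread *self)
    {
        mprotect(self->safepoint_page, 4096, PROT_READ | PROT_WRITE);
    }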
I'm not sure how to explain the results, except by saying that x86-64 (both Core 2 and K10) memory subsystems are awesome. In any case, unless I was doing something extremely wrong, the current OR/XOR/JEQ/BREAK sequence seems to perform well, even when compared to what other implementations (not constrained by any systems-programming goal) do. There still are reasons to look into a safepoint system for SBCL (simpler to reason about, easier to avoid certain states, easier to control where and when interruptions can happen, ...), but performance doesn't seem to be one of them.