Huge month for #SBCL: devious bugs fixed, new optimisations, new features. Help us find regressions in HEAD! sbcl.git.sourceforge.net/git/gitweb.cgi…
— Paul Khuong (@pkhuong) May 24, 2013
One feature I’m particularly excited about is the basic support for SSE intrinsics on x86-64. It’s only the fundamental infrastructure needed for the compiler and runtime to manipulate and compile operations on SIMD packs (the platform-independent name suggested by Kevin Reid). However, now that it’s merged in, we can easily add new functionality that maps (more or less) directly to SSE assembly in a running image, like the sb-rotate-byte contrib does for bitwise rotations.
There is currently no such contrib in SBCL, although Alexander Gavrilov has been maintaining cl-simd, an extension for SBCL and ECL (he also kept an old SSE/SBCL branch of mine on life support since 2008). I’m really not sure what the right interface will look like for Common Lisp, so I decided to only provide the mechanism to implement such an interface; when something emerges that looks sane and has been used in anger, we can merge it in.
In the meantime, we get to define our own interface and see where that experience leads us. In this post, I’ll use a classic example to guide my first stab at an interface: the Mandelbrot set. I’ll only define as much of an interface as I need, and adapt it along the way if I see the need for more sophistication; I prefer to build programs and libraries bottom-up and iteratively, rather than trying to divine my requirements ahead of time. A consequence of this approach is that I have to/am able to maintain a dynamic development style, even when working with (and working on) SSE intrinsics. I know many people use REPLs as desktop calculators; maybe others can be persuaded to also use SBCL as an SIMD calculator, to doodle on SSE code sequences (:
Packed floating-point arithmetic
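First, we declare the new operations to the compiler. A sketch of what these declarations can look like (the exact attributes and the :overwrite-fndb-silently keyword are assumptions):

```lisp
;; Sketch: declare f4+, f4* and f4- to the compiler (assumes a MANDELBROT
;; package exists; attribute names and :overwrite-fndb-silently are guesses).
(in-package "SB-VM")

(defknown (mandelbrot::f4+ mandelbrot::f4* mandelbrot::f4-)
    ((simd-pack single-float) (simd-pack single-float))
    (simd-pack single-float)
    (movable flushable always-translatable)
  :overwrite-fndb-silently t)
```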
The last form adds three new functions (f4+, f4* and f4-) to the compiler’s database (fndb). Each function accepts two packs of single floats as arguments, and returns another single float pack (the result of element-wise addition, multiplication or subtraction). Moreover, the functions can be reordered and flushed as dead code by the compiler, and, whenever they appear in code, they should ultimately be translated to assembly code. The last bit is so the runtime system silently overwrites the database if the functions are already there, rather than warning at load-time.
Next, we add a template (VOP) to convert calls to f4+ into assembly code. This task is so deeply tied to the internals that it makes sense to just do it from SB-VM.
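A sketch of the VOP, pieced together from the description that follows (the generator cost of 4 is arbitrary):

```lisp
(define-vop (mandelbrot::f4+)
  (:translate mandelbrot::f4+)
  (:policy :fast-safe)
  (:args (x :scs (single-sse-reg) :target r)
         (y :scs (single-sse-reg)))
  (:arg-types simd-pack-single simd-pack-single)
  (:results (r :scs (single-sse-reg)))
  (:result-types simd-pack-single)
  (:generator 4 ; sketch: the cost value is a guess
    (cond ((location= r y)
           ;; r and y share a register: exploit commutativity, add x into y
           (inst addps y x))
          (t
           ;; copy x into r (move is a no-op for equivalent locations),
           ;; then add y into r
           (move r x)
           (inst addps r y)))))
```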
The first line defines a VOP with the name mandelbrot::f4+ (it’s an arbitrary symbol and the important bit is only to avoid collisions, but good naming is useful when reading compiler traces or optimisation notes).

Then, we specify that it’s to be automatically considered when there is a call to mandelbrot::f4+, and that it can be used both in fast and in safe code.

The template converts a call with two arguments, x and y, that must both be in SSE registers of single floats. Both values must be represented as single floats in SIMD packs: when defining VOPs, types refer to primitive types, and primitive types are concerned with representation (like C types) rather than only sets of values (a given CL value or type can be mapped to many primitive types depending on the context). We also declare a preference for x to be allocated in the same register as the result r, if possible.
The template has a single result, which is also an SIMD pack of single floats allocated in an SSE register of single floats.
Finally, we have code to emit assembly. If r and y were packed (register-allocated) in the same register, we exploit commutativity to add x into y. Otherwise, we move x into r if necessary (the move function takes care of checking for equivalent locations), and emit a packed single float addition (addps) of y into r (which has just been overwritten with the contents of x).
The reason we have to specify that the SSE registers hold single float values (rather than doubles or integers) is so that functions like move emit a move between FP SSE registers rather than an integer move, for example. However, this additional information doesn’t really affect register allocation: the three storage classes (SC) – single-sse-reg, double-sse-reg and int-sse-reg – all map to the same storage base (SB), and the same SSE register can be used for an simd-pack-single value at one program point, an simd-pack-int at another, and a double-float at yet another.
This VOP is the bare minimum for a useful packed addition of single floats: in the future, it will probably make sense to add support for a constant argument, which could be directly loaded from a RIP-relative address to save a register (SBCL isn’t clever enough to determine when it makes sense to keep a constant in a register across a basic block or a loop).
We do the same for multiplication and subtraction. There’s one last twist for f4-. We can’t exploit commutativity there, so we have to make sure y and r aren’t packed in the same register; this is achieved by extending the live range of r from the end of the (default) live range for x. There are obvious opportunities for macro-isation, but, again, I can refactor when I have a clear need (e.g. when I decide to add a dozen more intrinsics).
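As a sketch, the subtraction template can look like this; f4* has the same shape as f4+, with mulps instead of addps, and the :from (:argument 0) clause is one way to express the live-range extension:

```lisp
;; Sketch of the subtraction template (f4* is analogous, with mulps and
;; the same commutativity trick as f4+).
(define-vop (mandelbrot::f4-)
  (:translate mandelbrot::f4-)
  (:policy :fast-safe)
  (:args (x :scs (single-sse-reg) :target r)
         (y :scs (single-sse-reg)))
  (:arg-types simd-pack-single simd-pack-single)
  ;; :from (:argument 0) is a guess at how to make r conflict with y
  ;; so they are never packed in the same register.
  (:results (r :scs (single-sse-reg) :from (:argument 0)))
  (:result-types simd-pack-single)
  (:generator 4
    (move r x)           ; r := x (no-op when they already coincide)
    (inst subps r y)))   ; r := r - y, element-wise
```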
Now that we’ve declared the functions and defined one way to convert each to machine code, we can define stubs to interact with them from interpreted code or as normal first-class functions. The compiler already has all the information needed to convert any call to f4+ et al. to machine code; however, no such function is actually defined. We can define them with what looks like infinite recursion: the defknown will ensure the compiler inserts type checks for (simd-pack single-float) as needed and infers that the result is also an (simd-pack single-float), so the templates will be applicable.
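A sketch of the stubs (the type declarations are one possible choice); the apparently recursive calls are known calls, so they compile to the templates above rather than actual self-calls:

```lisp
(in-package "MANDELBROT")

;; Sketch: each "recursive" call is a known call and compiles to the VOP.
(defun f4+ (x y)
  (declare (type (sb-ext:simd-pack single-float) x y))
  (f4+ x y))

(defun f4* (x y)
  (declare (type (sb-ext:simd-pack single-float) x y))
  (f4* x y))

(defun f4- (x y)
  (declare (type (sb-ext:simd-pack single-float) x y))
  (f4- x y))
```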
Now, we can call our new functions at the REPL, pass them to higher-order functions, etc., like any normal function.
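A hypothetical session (the printed representation of SIMD packs is approximate):

```lisp
;; Hypothetical transcript; the printed form of SIMD packs is approximate.
CL-USER> (sb-ext:%make-simd-pack-single 1.0 1.0 1.0 1.0)
#<SIMD-PACK 1.0 1.0 1.0 1.0>
CL-USER> (mandelbrot::f4+ * *)   ; * is the previous result, used twice
#<SIMD-PACK 2.0 2.0 2.0 2.0>
```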
The first form makes an simd-pack from four single float values. We must already track at compile-time whether each SIMD pack value is a pack of singles, doubles or integers, so it makes sense to do the same at runtime: that way we preserve type information when constant folding, and we can print packs nicely. Thus, we can call f4+ on that value, and see that 1.0 + 1.0 = 2.0. We can also disassemble the function f4* to confirm that it is a normal function that can be passed around (in this case to disassemble), and that it doesn’t endlessly recurse.
A first stab at the Mandelbrot inner loop
The inner loop body to approximate the Mandelbrot set is simply z' = z*z + c, where z and c are complex values. If we expand the complexes into their real and imaginary parts (zr + i zi, cr + i ci), we find z*z + c = (zr*zr - zi*zi + cr) + i(2*zr*zi + ci). That’s easily vectorised if we have a pack of four zr values, another for the zi, and similarly for cr and ci.
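A sketch of the iteration function; the name %mandelbrot-iter reappears below, while the declarations and argument order are assumptions:

```lisp
;; Sketch: one step of z <- z*z + c on four points at a time.
;; Returns the new real parts and the new imaginary parts as two packs.
(defun %mandelbrot-iter (zr zi cr ci)
  (declare (type (sb-ext:simd-pack single-float) zr zi cr ci)
           (optimize speed))
  (values (f4+ (f4- (f4* zr zr) (f4* zi zi)) cr)   ; zr*zr - zi*zi + cr
          (f4+ (f4* (f4+ zr zr) zi) ci)))          ; 2*zr*zi + ci
```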
This is a straight translation of the scalar formula above, but it computes four values in parallel (that’s why struct-of-arrays layouts lead to easy vectorisation). Crucially, a disassembly reveals we get the code we expect.
The beauty of the representation selection capabilities in Python (SBCL’s compiler) is that we can specify multiple representations (unboxed in a register or on the stack, or boxed as a generic pointer to a heap-allocated value) for SIMD packs, and the right (usually…) one will be chosen depending on context. In this case, the function receives boxed SIMD packs, unboxes them into SSE registers in the prologue, performs its FP arithmetic only on SSE registers, and converts the results back into boxed SIMD packs.
Even better: it seems to compute the right thing (:
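For example, a hypothetical check with z = 1 + 1i and c = 0.5 + 0.25i in every lane, for which z*z + c = 0.5 + 2.25i:

```lisp
;; Hypothetical check; printed forms approximate.
CL-USER> (flet ((rep (x) (sb-ext:%make-simd-pack-single x x x x)))
           (mandelbrot::%mandelbrot-iter (rep 1.0) (rep 1.0)
                                         (rep 0.5) (rep 0.25)))
#<SIMD-PACK 0.5 0.5 0.5 0.5>       ; real parts
#<SIMD-PACK 2.25 2.25 2.25 2.25>   ; imaginary parts
```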
The first return value is a pack of real components, the second a pack of imaginary components, and the values do correspond to the scalar computation in normal complex arithmetic.
Comparisons and packed integers
The next step is to compute the (squared) norm of complex values, in order to detect escape from the orbit.
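A sketch of the squared-norm helper:

```lisp
;; Sketch: |z|^2 = zr*zr + zi*zi, four points at a time.
(defun %norm^2 (zr zi)
  (declare (type (sb-ext:simd-pack single-float) zr zi)
           (optimize speed))
  (f4+ (f4* zr zr) (f4* zi zi)))
```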
Again, a couple of smoke tests and a disassembly confirm that we get what we expect.
Finally, we have to compare floats (squared norms) against a limit (4.0), and perform some integer arithmetic to count the number of iterations until escape. f4-sign-mask is the first instance of a function that accepts arbitrary SIMD pack types: whether the pack holds singles, integers or even doubles, it extracts the sign bit from each of the four 32-bit chunks.
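A sketch of the corresponding declarations, in the same style as before; the result types follow from the semantics described further down:

```lisp
(in-package "SB-VM")

;; Sketch; result types are inferred from the semantics described below.
(defknown mandelbrot::f4<=
    ((simd-pack single-float) (simd-pack single-float))
    (simd-pack integer)           ; a pack of 32-bit comparison masks
    (movable flushable always-translatable)
  :overwrite-fndb-silently t)

(defknown mandelbrot::i4-
    ((simd-pack integer) (simd-pack integer))
    (simd-pack integer)           ; element-wise 32-bit subtraction
    (movable flushable always-translatable)
  :overwrite-fndb-silently t)

(defknown mandelbrot::f4-sign-mask (simd-pack) (unsigned-byte 4)
    (movable flushable always-translatable)  ; any pack -> four sign bits
  :overwrite-fndb-silently t)
```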
The VOPs are really straightforward and somewhat repetitive, again: we only want the bare minimum to get something working, and fancy special cases can wait until they actually show up.
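As a sketch, here are two of them; the assembler mnemonics (psubd, movmskps) and the storage class unions are assumptions about the x86-64 backend, and the f4<= template follows the same pattern with the packed compare instruction:

```lisp
;; Sketch: packed 32-bit integer subtraction.
(define-vop (mandelbrot::i4-)
  (:translate mandelbrot::i4-)
  (:policy :fast-safe)
  (:args (x :scs (int-sse-reg) :target r)
         (y :scs (int-sse-reg)))
  (:arg-types simd-pack-int simd-pack-int)
  (:results (r :scs (int-sse-reg) :from (:argument 0)))
  (:result-types simd-pack-int)
  (:generator 4
    (move r x)
    (inst psubd r y)))      ; mnemonic assumed

;; Sketch: sign-bit extraction; accepts any pack representation and
;; returns a small integer in a general-purpose register.
(define-vop (mandelbrot::f4-sign-mask)
  (:translate mandelbrot::f4-sign-mask)
  (:policy :fast-safe)
  (:args (x :scs (int-sse-reg single-sse-reg double-sse-reg)))
  (:arg-types (:or simd-pack-int simd-pack-single simd-pack-double))
  (:results (r :scs (unsigned-reg)))
  (:result-types unsigned-num)
  (:generator 3
    (inst movmskps r x)))   ; mnemonic assumed
```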
Again, we add stubs to easily work at the REPL:
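The stubs follow the same pattern as before (a sketch, with the same assumed type declarations):

```lisp
(in-package "MANDELBROT")

(defun f4<= (x y)
  (declare (type (sb-ext:simd-pack single-float) x y))
  (f4<= x y))

(defun i4- (x y)
  (declare (type (sb-ext:simd-pack integer) x y))
  (i4- x y))

(defun f4-sign-mask (x)
  (declare (type sb-ext:simd-pack x))
  (f4-sign-mask x))
```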
And some smoke tests:
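A hypothetical session (assuming sb-ext:%make-simd-pack-ub32 as the integer pack constructor); the values follow from the semantics described below:

```lisp
;; Hypothetical transcript, from a REPL with *package* set to MANDELBROT.
;; (1 2 3 4) <= (2 2 2 2) holds in the first two lanes only.
MANDELBROT> (sb-ext:%simd-pack-ub32s
             (f4<= (sb-ext:%make-simd-pack-single 1.0 2.0 3.0 4.0)
                   (sb-ext:%make-simd-pack-single 2.0 2.0 2.0 2.0)))
4294967295
4294967295
0
0
;; Subtracting the masks from a zero pack turns each #xFFFFFFFF into 1.
MANDELBROT> (sb-ext:%simd-pack-ub32s
             (i4- (sb-ext:%make-simd-pack-ub32 0 0 0 0)
                  (f4<= (sb-ext:%make-simd-pack-single 1.0 2.0 3.0 4.0)
                        (sb-ext:%make-simd-pack-single 2.0 2.0 2.0 2.0))))
1
1
0
0
;; The sign bits of the comparison masks, as a plain integer: #b0011.
MANDELBROT> (f4-sign-mask (f4<= (sb-ext:%make-simd-pack-single 1.0 2.0 3.0 4.0)
                                (sb-ext:%make-simd-pack-single 2.0 2.0 2.0 2.0)))
3
```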
f4<= compares two packs of single floats element by element, and returns a pack of integers with all 32 bits set to 1 for pairs that are <=, and 0 otherwise. i4- subtracts 32-bit signed integers element-wise; subtracting -1 (#xffff...) is equivalent to adding 1. Finally, f4-sign-mask extracts the topmost (sign) bit from each chunk of 32 bits, and returns the 4 bits as a normal integer (i.e. in a general-purpose register). The function sb-ext:%simd-pack-ub32s is useful to extract the four 32-bit unsigned values from an SIMD pack – in this case the result of subtracting the comparison masks from zero – and work on them in a scalar context.
A complete Mandelbrot inner loop
The idea is that we’ll want to compare the result of %norm^2 with the limit (4.0), and stop iterating either when the maximal iteration count is reached, or when all four norms are greater than 4.0 (all sign bits are zero). Until then, we subtract the comparison masks from the counter to increment the number of unescaped iterations (recall that subtracting -1 adds 1). When we’re done, we can easily extract individual iteration counts from the packed counters.

This yields:
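Here’s a sketch of one way to write it; the initialisation of z to c and returning the counts as four values via sb-ext:%simd-pack-ub32s are arbitrary choices:

```lisp
;; Sketch: count, for each of the four points c, how many of the first
;; MAX-ITERATIONS iterates (starting at z = c) satisfy |z|^2 <= 4.
(defun mandelbrot-escape (cr ci max-iterations)
  (declare (type (sb-ext:simd-pack single-float) cr ci)
           (type (and fixnum unsigned-byte) max-iterations)
           (optimize speed))
  (let ((zr cr)
        (zi ci)
        (counts (sb-ext:%make-simd-pack-ub32 0 0 0 0))
        (limit  (sb-ext:%make-simd-pack-single 4.0 4.0 4.0 4.0)))
    (dotimes (i max-iterations (sb-ext:%simd-pack-ub32s counts))
      (let ((still-in-orbit (f4<= (%norm^2 zr zi) limit)))
        ;; all sign bits zero: every point has escaped
        (when (zerop (f4-sign-mask still-in-orbit))
          (return (sb-ext:%simd-pack-ub32s counts)))
        ;; subtracting -1 from the lanes still in orbit increments them
        (setf counts (i4- counts still-in-orbit))
        (setf (values zr zi) (%mandelbrot-iter zr zi cr ci))))))
```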
We can run a couple of random tests and compare with a straightforward scalar version:
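A hypothetical scalar reference with the same counting convention as the sketch above, plus (not shown) a dozen lines that generate random inputs and compare the two versions:

```lisp
;; Hypothetical scalar reference for a single point, same convention
;; as the SIMD sketch above.
(defun scalar-escape (cr ci max-iterations)
  (declare (type single-float cr ci)
           (type (and fixnum unsigned-byte) max-iterations))
  (let ((zr cr)
        (zi ci)
        (count 0))
    (dotimes (i max-iterations count)
      (when (> (+ (* zr zr) (* zi zi)) 4.0)
        (return count))
      (incf count)
      (psetf zr (+ (- (* zr zr) (* zi zi)) cr)
             zi (+ (* 2.0 zr zi) ci)))))
```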
Disassembling mandelbrot-escape shows some register allocation oddities (mostly because the VOPs don’t handle the case when both arguments are the same, I think), but the code is pretty good. The one issue that bothers me is (zerop (f4-sign-mask still-in-orbit)): it’s a common operation, and there’s a slight miscompile (the result of f4-sign-mask is converted to a fixnum before the comparison… an instance of bad representation selection). These two reasons – factoring code out and improving code generation for a medium-level operation – are enough for me to add a specialised function to test whether all the sign bits are zero.
A new predicate
We define predicates simply as functions that return boolean values (this one is prefixed with f4- because the instruction officially works on single floats).
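A sketch of the declaration, with f4-sign-zerop as a stand-in name for the new predicate:

```lisp
(in-package "SB-VM")

;; Sketch; f4-sign-zerop is a placeholder name.
(defknown mandelbrot::f4-sign-zerop (simd-pack) boolean
    (movable flushable always-translatable)
  :overwrite-fndb-silently t)
```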
However, VOPs for such functions shouldn’t return values; instead, they can define how to determine that their result is true: in this case, the result is true if the :z (zero) flag is set. Code generation will take care of using that information in conditional branches or conditional moves.
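A sketch of such a conditional VOP; the temporary register and the movmskps/test sequence are assumptions, and the interesting part is the :conditional :z clause:

```lisp
;; Sketch; the temporary and the movmskps/test sequence are assumptions.
(define-vop (mandelbrot::f4-sign-zerop)
  (:translate mandelbrot::f4-sign-zerop)
  (:policy :fast-safe)
  (:args (x :scs (int-sse-reg single-sse-reg double-sse-reg)))
  (:arg-types (:or simd-pack-int simd-pack-single simd-pack-double))
  (:temporary (:sc unsigned-reg) bits)
  (:conditional :z)               ; the result is true when :z is set
  (:generator 3
    (inst movmskps bits x)        ; four sign bits into a GPR
    (inst test bits bits)))       ; :z is set iff all four bits are zero
```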
For example, in this stub, the compiler will emit the template, followed by a conditional move to return T or NIL.
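And the stub, still with the stand-in name:

```lisp
(in-package "MANDELBROT")

(defun f4-sign-zerop (x)
  (declare (type sb-ext:simd-pack x))
  (f4-sign-zerop x))
```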
Wrapping up
Now, we can slightly adjust mandelbrot-escape to exploit this new function:
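Continuing the earlier sketch of mandelbrot-escape, only the escape test changes:

```lisp
;; Sketch: same loop as before, with the new predicate for the escape test.
(defun mandelbrot-escape (cr ci max-iterations)
  (declare (type (sb-ext:simd-pack single-float) cr ci)
           (type (and fixnum unsigned-byte) max-iterations)
           (optimize speed))
  (let ((zr cr)
        (zi ci)
        (counts (sb-ext:%make-simd-pack-ub32 0 0 0 0))
        (limit  (sb-ext:%make-simd-pack-single 4.0 4.0 4.0 4.0)))
    (dotimes (i max-iterations (sb-ext:%simd-pack-ub32s counts))
      (let ((still-in-orbit (f4<= (%norm^2 zr zi) limit)))
        (when (f4-sign-zerop still-in-orbit)   ; all four points escaped
          (return (sb-ext:%simd-pack-ub32s counts)))
        (setf counts (i4- counts still-in-orbit))
        (setf (values zr zi) (%mandelbrot-iter zr zi cr ci))))))
```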
This all looks like a lot of work, but the vast majority of it is reusable code to define SSE operations: 22 LOC of defknown, 83 LOC for assembly templates, 20 LOC in stubs and utility code. There’s only 31 LOC for the SIMD mandelbrot loop (%mandelbrot-iter, %norm^2 and mandelbrot-escape), 6 for the naïve scalar loop, and 12 for the random test harness. I uploaded all the files on github to simplify further experimentation; just fork and play around! The function names could certainly be improved, for one.
Miscellaneous notes
My request for extra-hard bug shaking in SBCL 1.1.8 didn’t work as well as I’d wished: only a few hours after release (and after a week of code freeze), we received a new bug report (with fixes committed very rapidly, at least). We observed the same phenomenon with 1.1.6, and wound up being extremely conservative for 1.1.7. I’m not sure what the best way to encourage testing is (binary release candidates during freeze?), but I am starting to see the point of the old odd/even policy for Linux. For now, the result is that 1.1.8 offers a lot of exciting improvements, but also more than its share of bugs; for production code, it’s probably best to either build an early 1.1.8.x git commit, or wait for 1.1.9.
Long-time readers might remember that I first blogged about SSE intrinsics in 2008. The problem was that I hit some issues with representing type information (whether an SIMD pack contains integer, single float or double float data), and didn’t manage to find a nice resolution. I also had trouble with type operations (intersection, union, negation) on (simd-pack [type]). For the first problem, I bit the bullet and created many storage classes and primitive types; as shown above, a function can accept arbitrary simd-pack values, and so can VOPs: when converting to primitive types, unspecialised simd-packs are treated like integer simd-packs. The second, I solved by moving to a type upgrading system inspired by array types. There are really only three specialised simd-pack types: on integers, single floats and double floats. Any specialised simd-pack type specifier must be specialised on a recognisable subtype of one of these three types. It’s a hack, but all type operations become trivial and easy to think about.
The reason I finally sat down, re-read my code and thought hard for a few days is Google’s Summer of Code. A student proposed to build on that work to vectorise standard sequence operations in CL, and I really didn’t want to be blocking their work. In the end we were only allotted two slots, and had to select a pair of extremely strong proposals from ten interesting submissions… more on that soon!
In the meantime, I hope the recent burst of activity and outside interest can motivate others to contribute to SBCL, even without Google support. We already have one person working on a contrib to exploit GMP for bignum and rational arithmetic, instead of our hand-rolled routines; I expect to see it merged in 1.1.9 or in 1.1.10.
P.S. I recently stumbled on Glassman’s FFT and it’s really nifty: an in-order general-size FFT (not so fast when prime factors are huge) in approximately 50 LOC. The performance isn’t awesome on cached architectures, but shouldn’t be atrocious either… and it handles arbitrary DFT sizes. There are issues with numerical stability when computing the twiddle factors on the fly, and the access patterns could be improved, but it’s a very simple and interesting foundation… especially given that the inner loop is naturally vectorisable.