Hacking SSE Intrinsics in SBCL (part 1)
I think I managed to add sane support for SSE values (octwords) and operations to SBCL this weekend (all that because of a discussion on PRNGs, which led to remembering that SFMT wasn't too hard to implement, which led to SSE intrinsics). It wasn't too hard, but clearly the path could have been better documented. The interesting part is the creation of a new primitive data type in the compiler; once that is done, we only have to define new VOPs. Note that since I'm only targeting x86-64, alignment isn't an issue: objects are aligned to 128 bits by default.
The compiler has to be informed of the new data type at two levels:
- The front/middle-end, to create a new type (as in CL:TYPEP), and to define how to map from that type to the primitive type (which is more concerned with representation than semantics) used in the back-end;
- The back-end, which has to be informed about the existence of a new primitive type, and must also be modified to correctly map that primitive type to machine registers or stack locations, to know how to move from one to the other, and to know how to box such values in a GC-friendly object on the heap.
Obviously, the runtime also had to be modified to be able to GC the new type of objects.
I believe it makes more sense to do this by starting low-level, in the back-end, and then building things up to the front-end, so I'll try and explain it that way.
Hacking the machine definition
SBCL's backend has a sort of type system at the VOP/TN level (virtual operations/registers). Two pieces of type information are associated with each TN: the primitive type and the storage class. As its name implies, the primitive type is a low-level, C-style type (e.g. SIMPLE-ARRAY-UNSIGNED-BYTE-64 or DOUBLE-FLOAT), which is almost entirely concerned with representation. Apart from bad puns, there is no subtyping; COMPLEX and COMPLEX-DOUBLE-FLOAT are disjoint types. However, that still leaves some leeway to the back-end. A DOUBLE-FLOAT may be stored in an FP register, on the stack, or as a boxed value. Thus, TNs are also assigned a storage class before generating code.
The first step was to define new storage classes for SSE values: sse-reg and sse-stack, for SSE values in XMM registers and on the stack, respectively. That's done in src/compiler/x86-64/vm.lisp, in !define-storage-classes:
```lisp
(sse-stack stack :element-size 2 :alignment 2)
[...]
(sse-reg float-registers
         :locations #.(loop for i from 0 below 15 collect i)
         :constant-scs ()
         :save-p t
         :alternate-scs (sse-stack))
```
The first form defines a new storage class (SC) that's stored on the stack (the storage base, SB), where each element takes two locations (in this case, 64-bit words) and requires an alignment of two locations. The second form defines a new storage class that uses the float-registers storage base, and may take any of the first 15 locations in that SB (xmm15 is reserved). Values in that SC must be saved to the stack when needed. The sse-stack SC is to be used when there aren't enough registers or when saving registers (e.g. for a call). Some more modifications were needed in the assembler to make it aware of the new SC, but that's a mostly orthogonal concern.
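Here is a toy model in C that makes ":element-size 2 :alignment 2" more concrete. It hands out stack locations in a frame of 64-bit words. It is not SBCL's actual packing code, and the names toy_frame and toy_alloc_slot are made up for illustration. The sketch assumes the frame base is itself 16-byte aligned. Under that assumption, rounding an SSE slot up to an even location keeps every spilled octword on a 16-byte boundary, where movdqa can load and store it.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of a stack storage base: locations are 64-bit words handed
   out in increasing order. Illustration only, not SBCL's packer. */
struct toy_frame {
    size_t next_free; /* first location not yet in use */
};

/* Reserve element_size consecutive locations, the first of which is a
   multiple of alignment; returns that first location. Locations skipped
   to satisfy the alignment are simply left unused. */
static size_t toy_alloc_slot(struct toy_frame *f, size_t element_size,
                             size_t alignment)
{
    assert(element_size > 0 && alignment > 0);
    size_t loc = (f->next_free + alignment - 1) / alignment * alignment;
    f->next_free = loc + element_size;
    return loc;
}

/* Byte offset of a location from the frame base (8 bytes per word). */
static size_t toy_byte_offset(size_t loc)
{
    return loc * 8;
}
```

Suppose you allocate a one-word slot and then an SSE-sized slot (element size 2, alignment 2). Location 1 is left unused, and the octword lands in locations 2 and 3, at byte offset 16. Wasting a word now and then is the price of being able to use aligned moves.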
Adding a new primitive type
That's enough to define a new primitive type in src/compiler/generic/primtype.lisp:
```lisp
(!def-primitive-type sse-value (sse-reg descriptor-reg))
```
An sse-value can be stored in an sse-reg or a descriptor-reg (as a boxed value), or in any of their alternate SCs.
Defining a new kind of primitive object
I've defined a new primitive type that can be stored as a boxed value. However, I haven't yet defined how that boxed value should be represented. I allocated one of the unused widetags to sse-values in src/compiler/generic/early-objdef.lisp:
```lisp
#!-x86-64 unused02
#!+x86-64 sse-value ; 01100010
```
Strictly speaking, that's not necessary, but without a widetag of its own I would have to define my own type-checking VOPs, adapt the GC somehow, etc. The new entry defines a constant, SB!VM::SSE-VALUE-WIDETAG, which I use in the primitive object definition (src/compiler/generic/objdef.lisp):
```lisp
(define-primitive-object (sse-value :lowtag other-pointer-lowtag
                                    :widetag sse-value-widetag)
  (filler) ; preserve the natural 128-bit alignment
  (lo-value :c-type "long" :type (unsigned-byte 64))
  (hi-value :c-type "long" :type (unsigned-byte 64)))
```
The macro also defines useful constants, e.g. SSE-VALUE-SIZE, as well as slot offsets for the low and high values. The GC needs that information. Genesis only exports CL constants to C when the symbols are external to their package, so I had to add some symbols to the export list for SB!VM in ./package-data-list.lisp-expr:
```lisp
#!+x86-64 "SSE-VALUE"
#!+x86-64 "SSE-VALUE-P" ; will be defined later on
#!+x86-64 "SSE-VALUE-HI-VALUE-SLOT"
#!+x86-64 "SSE-VALUE-LO-VALUE-SLOT"
#!+x86-64 "SSE-VALUE-SIZE"
#!+x86-64 "SSE-VALUE-WIDETAG"
```
and similarly for SB!KERNEL:
```lisp
#!+x86-64 "OBJECT-NOT-SSE-VALUE-ERROR" ; so will this
```
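In memory, an sse-value is four 64-bit words: the header, the filler, then the two halves. The C sketch below mirrors that layout with made-up constant names; the real constants are generated by genesis. It assumes x86-64 conventions, where OTHER_POINTER_LOWTAG is 15. It shows why the filler is there: it makes the lo-value slot start 16 bytes into a 16-byte aligned object. It also shows where the ":disp 1" in the boxing VOPs later on comes from: 2 words * 8 bytes - 15 = 1.

```c
#include <assert.h>

/* Hypothetical mirror of the sse-value layout, assuming x86-64 SBCL
   conventions: 8-byte words and OTHER_POINTER_LOWTAG == 15. The TOY_
   names are illustrative, not the constants genesis emits. */
enum {
    N_WORD_BYTES = 8,
    OTHER_POINTER_LOWTAG = 15,
    TOY_SSE_VALUE_WIDETAG = 0x62,    /* #b01100010, the reused unused02 code */
    /* word 0 is the header, so slots are numbered from 1 */
    TOY_SSE_VALUE_FILLER_SLOT = 1,
    TOY_SSE_VALUE_LO_VALUE_SLOT = 2,
    TOY_SSE_VALUE_HI_VALUE_SLOT = 3,
    TOY_SSE_VALUE_SIZE = 4           /* header + filler + lo + hi, in words */
};

/* Byte displacement from a tagged other-pointer to the given slot. */
static int toy_slot_disp(int slot)
{
    assert(slot >= 1 && slot < TOY_SSE_VALUE_SIZE);
    return slot * N_WORD_BYTES - OTHER_POINTER_LOWTAG;
}

/* Objects start on 16-byte boundaries, so a slot is 16-byte aligned
   (and usable by movdqa) iff its offset from the untagged base is. */
static int toy_slot_is_16_aligned(int slot)
{
    return (slot * N_WORD_BYTES) % 16 == 0;
}
```

Without the filler, lo-value would sit at word 1 (byte offset 8), and a 16-byte load from it would have to be unaligned.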
Adapting the GC
I chose a very simple representation for sse-values: the header word contains the right widetag, obviously, and the rest encodes the size of the object (in words). That's a common scheme in SBCL, and well supported by generic code everywhere. The garbage collector (a copying Cheney GC) has three important tables that are used to dispatch to the correct function given an object's widetag:
- scavtab, to scavenge objects for pointers;
- transother, to copy objects to the new space;
- sizetab, to compute the size of an object.
They're all initialised in gc_init_tables, in src/runtime/gc-common.c. The representation is standard, so I only had to add pointers to the predefined functions scav_unboxed, trans_unboxed and size_unboxed:
```c
#ifdef SSE_VALUE_WIDETAG
    scavtab[SSE_VALUE_WIDETAG] = scav_unboxed;
#endif
[...]
#ifdef SSE_VALUE_WIDETAG
    transother[SSE_VALUE_WIDETAG] = trans_unboxed;
#endif
[...]
#ifdef SSE_VALUE_WIDETAG
    sizetab[SSE_VALUE_WIDETAG] = size_unboxed;
#endif
```
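For readers unfamiliar with the runtime, a stripped-down model of the sizetab dispatch might help. This is not the code in gc-common.c, and the toy_ names are hypothetical. It assumes that the bits above the 8-bit widetag store the object's length minus one, in words, and that objects are padded to an even number of words. Under those assumptions, an sse-value header with 3 above the widetag describes a 4-word object. The size function is then found through the widetag alone.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t lispobj;

#define N_WIDETAG_BITS 8
#define WIDETAG_MASK 0xff
#define TOY_SSE_VALUE_WIDETAG 0x62

/* Size of an object, in words, given a pointer to its header word. */
typedef size_t (*toy_sizefn)(const lispobj *where);

static toy_sizefn toy_sizetab[256];

/* Assumed header scheme: bits above the widetag hold (length - 1) words;
   the result is rounded up to an even number of words (16 bytes). */
static size_t toy_size_unboxed(const lispobj *where)
{
    size_t length = (size_t)(where[0] >> N_WIDETAG_BITS) + 1;
    return (length + 1) & ~(size_t)1;
}

/* Fallback for widetags nobody registered a handler for. */
static size_t toy_size_lose(const lispobj *where)
{
    (void)where;
    assert(!"no size function for this widetag");
    return 0;
}

static void toy_gc_init_tables(void)
{
    for (int i = 0; i < 256; i++)
        toy_sizetab[i] = toy_size_lose;
    /* the one line the new type needs, as in the real gc_init_tables */
    toy_sizetab[TOY_SSE_VALUE_WIDETAG] = toy_size_unboxed;
}

/* Dispatch on the low 8 bits of the header: the widetag. */
static size_t toy_object_size(const lispobj *where)
{
    return toy_sizetab[where[0] & WIDETAG_MASK](where);
}
```

Because the length lives in the header, the same generic size, scavenge and transport functions work for any unboxed object. That is why registering the existing *_unboxed functions was enough.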
Adding utility VOPs to the compiler
That's enough code to be able to use the sse-value primitive type, use our new storage classes in VOP definitions, and pass boxed sse-values around without crashing the GC. The first VOPs to define are probably those that let the compiler move sse-values around, from an sse-reg, sse-stack or descriptor-reg to another. Not every VOP in the Cartesian product needs to be defined; the compiler can figure out how to piece together move VOPs to a certain extent.
```lisp
(define-move-fun (load-sse-value 2) (vop x y)
  ((sse-stack) (sse-reg))
  (inst movdqa y (ea-for-sse-stack x)))

(define-move-fun (store-sse-value 2) (vop x y)
  ((sse-reg) (sse-stack))
  (inst movdqa (ea-for-sse-stack y) x))
```
are enough to define how to move between the stack and xmm registers (ea-for-sse-stack is a helper function that generates an effective address from an sse-value's location in the compile-time stack frame).
```lisp
(define-vop (sse-move)
  (:args (x :scs (sse-reg)
            :target y
            :load-if (not (location= x y))))
  (:results (y :scs (sse-reg)
               :load-if (not (location= x y))))
  (:note "sse move")
  (:generator 0
    (unless (location= y x)
      (inst movdqa y x))))
(define-move-vop sse-move :move (sse-reg) (sse-reg))
```
provides and registers code to move from one xmm register to another.
```lisp
(define-vop (move-from-sse)
  (:args (x :scs (sse-reg)))
  (:results (y :scs (descriptor-reg)))
  (:node-var node)
  (:note "sse to pointer coercion")
  (:generator 13
    (with-fixed-allocation (y sse-value-widetag sse-value-size node)
      ;; y is tagged with other-pointer-lowtag (15), so y + 1 is the
      ;; untagged base + 16: the 16-byte aligned lo-value slot
      (inst movdqa (make-ea :qword :base y :disp 1) x))))
(define-move-vop move-from-sse :move (sse-reg) (descriptor-reg))

(define-vop (move-to-sse)
  (:args (x :scs (descriptor-reg)))
  (:results (y :scs (sse-reg)))
  (:note "pointer to sse coercion")
  (:generator 2
    ;; same slot as above: load both 64-bit halves at once
    (inst movdqa y (make-ea :qword :base x :disp 1))))
(define-move-vop move-to-sse :move (descriptor-reg) (sse-reg))
```
are used to move from an xmm register to a boxed representation and vice versa.
Finally,
```lisp
(define-vop (move-sse-arg)
  (:args (x :scs (sse-reg) :target y)
         (fp :scs (any-reg)
             :load-if (not (sc-is y sse-reg))))
  (:results (y))
  (:note "SSE argument move")
  (:generator 4
    (sc-case y
      (sse-reg
       (unless (location= x y)
         (inst movdqa y x)))
      (sse-stack
       (inst movdqa (ea-for-sse-stack y fp) x)))))
(define-move-vop move-sse-arg :move-arg (sse-reg descriptor-reg) (sse-reg))
```
defines how arguments are loaded before calling a function.
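What move-from-sse and move-to-sse do to memory can be simulated in plain C. The sketch below is an illustration, not SBCL code:
- a two-field struct stands in for an xmm register;
- aligned_alloc stands in for with-fixed-allocation;
- memcpy stands in for movdqa.

It assumes x86-64 tag values and a header holding the length minus one above the widetag. The toy_ names are hypothetical. The key point is that the tagged pointer plus 1 is the untagged base plus 16. That is the address the VOPs reach with (make-ea :qword :base y :disp 1).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define OTHER_POINTER_LOWTAG 15
#define TOY_SSE_VALUE_WIDETAG 0x62
#define TOY_SSE_VALUE_SIZE 4 /* words: header, filler, lo, hi */

/* A 128-bit value as two 64-bit halves, standing in for an xmm register. */
struct octword { uint64_t lo, hi; };

/* Box: allocate a 16-byte aligned object, write its header, then store the
   octword at tagged + 1 (== base + 16, the lo-value slot). Returns the
   tagged pointer; release it with toy_free_sse_value. */
static uintptr_t toy_box_sse(struct octword v)
{
    uint64_t *base = aligned_alloc(16, TOY_SSE_VALUE_SIZE * sizeof(uint64_t));
    assert(base != NULL);
    base[0] = ((uint64_t)(TOY_SSE_VALUE_SIZE - 1) << 8) | TOY_SSE_VALUE_WIDETAG;
    base[1] = 0; /* filler */
    uintptr_t tagged = (uintptr_t)base | OTHER_POINTER_LOWTAG;
    memcpy((char *)tagged + 1, &v, sizeof v); /* lo -> base[2], hi -> base[3] */
    return tagged;
}

/* Unbox: read the 16 bytes back from tagged + 1. */
static struct octword toy_unbox_sse(uintptr_t tagged)
{
    struct octword v;
    memcpy(&v, (const char *)tagged + 1, sizeof v);
    return v;
}

/* Strip the lowtag to recover the pointer aligned_alloc returned. */
static void toy_free_sse_value(uintptr_t tagged)
{
    free((void *)(tagged & ~(uintptr_t)OTHER_POINTER_LOWTAG));
}
```

Since the base is 16-byte aligned, its low four bits are zero. Tagging by OR-ing in 15 and untagging by masking are therefore exact inverses, and tagged + 1 is always 16-byte aligned.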
Creating a new CL type and associated functions
Now that pretty much all the groundwork has been done in the backend, it's time to inform the frontend about the new data type. The system's built-in classes are defined in src/code/class.lisp, right after (defvar *built-in-classes*). I only had to insert another sublist in the list of built-in classes:
```lisp
(sb!vm:sse-value :codes (#.sb!vm:sse-value-widetag))
```
That takes care of defining a new class for the middle-end, and of mapping the class's name, in the front-end, to the class object in the middle-end.
SBCL tries pretty hard to always provide a safe language by default, so I have to make sure the type sse-value can be checked. First, a new error kind is defined in src/code/interr.lisp:
```lisp
(deferr object-not-sse-value-error (object)
  (error 'type-error
         :datum object
         :expected-type 'sb!vm:sse-value))
```
The mapping from internal error number to a meaningful message is specified in src/compiler/generic/interr.lisp, in define-internal-errors:
```lisp
(object-not-sse-value "Object is not of type SSE-VALUE.")
```
Since I use a normal representation, I can use preexisting machinery to define the type-checking VOPs in src/compiler/generic/late-type-vops.lisp:
```lisp
(!define-type-vops sse-value-p check-sse-value sse-value
    object-not-sse-value-error
  (sse-value-widetag))
```
That only creates a VOP for sse-value-p; we'd sometimes like to have a real function. That's created in src/code/pred.lisp, with (def-type-predicate-wrapper sb!vm:sse-value-p). Moreover, when a VOP is defined as a translation for a function, that function must be declared to the compiler with defknown. I do that in src/compiler/generic/vm-fndb.lisp, with (defknown sb!vm:sse-value-p (t) boolean (foldable flushable)).
TYPEP must also be told to use that function. That's set up in src/compiler/generic/vm-typetran.lisp, with (define-type-predicate sb!vm:sse-value-p sb!vm:sse-value).
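As I understand it, the check that !define-type-vops produces for a header-tagged object like this one comes down to two tests. First, is the pointer's lowtag OTHER_POINTER_LOWTAG? If so, does the widetag in the header match? Below is a C rendition of that logic, not the generated assembly. It assumes x86-64 tag values (lowtag 15, and widetag 0x62 from the #b01100010 code chosen earlier), and toy_sse_value_p is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

#define LOWTAG_MASK 15
#define OTHER_POINTER_LOWTAG 15
#define WIDETAG_MASK 0xff
#define TOY_SSE_VALUE_WIDETAG 0x62

/* Returns nonzero iff obj is a tagged pointer to an sse-value. */
static int toy_sse_value_p(uintptr_t obj)
{
    /* Only other-pointer objects have a header word to inspect, so
       reject fixnums, conses, etc. before touching memory. */
    if ((obj & LOWTAG_MASK) != OTHER_POINTER_LOWTAG)
        return 0;
    const uint64_t *header = (const uint64_t *)(obj - OTHER_POINTER_LOWTAG);
    assert(((uintptr_t)header % 8) == 0); /* headers are word-aligned */
    return (*header & WIDETAG_MASK) == TOY_SSE_VALUE_WIDETAG;
}
```

The order of the two tests matters. Reading a "header" through a fixnum or a cons pointer would at best return garbage, so the cheap lowtag test guards the memory access.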
Making the middle and the back meet
The modifications above added a new primitive type to the back-end, along with some machinery to represent and manipulate values of that type. They also added a new built-in class to the front- and middle-end, and some type-checking code for that new class. The only thing left is to make sure the new class is mapped to the correct primitive type. The default is to map everything to the primitive type T, a boxed value. That mapping is done at the bottom of src/compiler/generic/primtype.lisp, in primitive-type-aux. I only had to modify the case that translates built-in class(oids) to primitive types: sse-values are treated like complex, function, system-area-pointer or weak-pointer.
The class sse-value is mapped to the primitive type sse-value. That way, a function that is defknown to, e.g., take an argument of type sse-value (the class) can be translated by a VOP that takes an argument of primitive type sse-value, which can be stored in an sse-reg (an xmm register).
Now what?
We have the data type definitions. To make them useful, we still have to define a lot of functions and VOPs (and SSE instructions). However, that's much closer to regular development and doesn't require as much digging around in the source. I'll leave that for another post.