EDIT: 20120829: I added a section to compare comparison counts with known bounds for general comparison sorts and sorting networks.
In an earlier post, I noted how tedious coding unrolled sorts can be. Frankly, that’s the main reason I stopped at leaf sorts of size three. Recently, Neil Toronto wrote a nice post on the generation of sizespecialised merge sorts. The post made me think about that issue a bit more, and I now have a neat way to generate unrolled/inlined merge sorts that are significantly smaller than the comparison and size “optimal” inlined merge sorts.
The code is up as a singlefile library, and sorts short vectors faster than SBCL’s inlined heapsort by a factor of two to three… and compiles to less machine code. The generator is a bit less than 100 LOC, so I’m not sure I want to include it in the mainline yet. If someone wants to add support for more implementations, I’d be happy to extend Tabasco sort, and might even consider letting it span multiple files ;)
Differentlyoptimal sorts
The inlined merge sort for three values (a
, b
, and c
) is copied
below. It has to detect between \(3! = 6\) permutation, and does so
with an optimal binary search tree. That scheme leads to code with
\(n!  1\) comparisons to sort n values, and for which each
execution only goes through two or three comparisons (\(\approx \lg n!\)).
1 2 3 4 5 6 7 8 9 10 11 

An optimal sorting network for three values needs only three comparisons, and always executes those three comparisons.
1 2 3 4 5 6 7 

Finally, the leaf sort I used in SBCL is smaller than the inlined merge sort (three comparisons), but sometimes executes fewer than three comparisons. It’s superoptimal ;)
1 2 3 4 5 6 7 8 

The optimal merge sort is larger than the optimal sorting network, and the optimal sorting network performs potentially more comparisons than the optimal merge sort…
Each implementation is optimal for different search spaces: the optimal merge sort never merges continuations, and the sorting network’s only control dependencies are in the conditional swaps.
The “superoptimal” merge sort does better by allowing itself both assignments (or tailcalls) and nontrivial control flow: it’s smaller than the inlined merge sort (but performs more data movement), and potentially executes fewer comparisons than the sorting network (with a larger total code size). And, essential attribute in practice, it’s easy to generate. This contrasts with optimal sorting networks, for which we do not have any generation method short of brute force search; in fact, in practice, sorting networks tend to exploit suboptimal (by a factor of \(\log n\)) schemes like bitonic sort or oddeven merge sort. Then again, we’re only concerned with tiny sorts, and asymptotics can be misleading: Batcher’s oddeven merge sort happens to be optimal for \(n\leq 8\). The issue with sorting networks remains: dataoblivious control flow pessimises their comparison count.
Generalising from size 3
What the last merge sort does is to first sort both halves of the
values (a
is trivially sorted, and b c
needs one conditional
swap), and then, assuming that each half (a
and b c
) is sorted,
find the right permutation with which to merge them. Rather than
\(n!\) permutations, a merge only needs to distinguish between
\(C(n, \lfloor n/2\rfloor) = \frac{n!}{\lfloor n/2\rfloor!\lceil n/2\rceil!}\)
permutations, and the recursive sorts are negligible compared to the
merge step. That’s a huge reduction in code size!
A simple merge generator fits in half a screenful.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 

Given two lists of sorted variables, emitmerge
calls emitmerge1
to generate code that finds the right permutation, and executes it at
the leaf. A binary search tree is generated by keeping track of the
merged list in a reverseorder (to enable tailsharing) accumulator of
variable names. As expected, when merging a list of length one with
another of length two, we get pretty much the code I wrote by hand
earlier.
CLUSER> (emitmerge '(a) '(b c))
(if (< b a)
(if (< c a)
(setf (values a b c) (values b c a))
(setf (values a b c) (values b a c)))
(setf (values a b c) (values a b c)))
There’s one striking weakness: we generate useless code for the
identity permutation. We could detect that case, or, more generally,
we could find the cycle decomposition of each permutation and use it
to minimise temporary values; that’d implicitly take care of cases
like (setf (values a b c) (values b a c))
, in which some values are
left unaffected.
A smarter permutation generator
I’ll represent permutations as associative lists, from source to destination. Finding a cycle is easy: just walk the permutation from an arbitrary value until we loop back.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 

To generate the code corresponding to a permutation, I can extract all
the cycles, execute each cycle with a rotatef
.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 

The merge step for a sort of size three is now a bit more explicit, but likely compiles to code that uses fewer registers as well. It probably doesn’t matter on good SSAbased backends, but those are the exception rather than the norm in the Lisp world.
CLUSER> (emitmerge '(a) '(b c))
(if (< b a)
(if (< c a)
(progn (rotatef a b c))
(progn (rotatef a b)))
(progn))
Adding the recursive steps
The only thing missing for a merge sort is to add base cases and recursion. The base case is easy: lists of length one are sorted. Inlining recursion is trivial, as is usually the case when generating Lisp code.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 

The resulting threevalue sorter looks good; there are some
redundancies with nested or empty progn
s, but any halfdecent
compiler will take care of that. Python certainly does a goob job on
that code.
CLUSER> (emitsort '(a b c))
(progn
nil
(progn
nil
nil
(if (< c b)
(progn (rotatef b c))
(progn)))
(if (< b a)
(if (< c a)
(progn (rotatef a b c))
(progn (rotatef a b)))
(progn)))
CLUSER> (disassemble (lambda (a b c)
(declare (type fixnum a b c))
(inlinesort a b c)))
; disassembly for (lambda (a b c))
; 0E88F150: 498BD0 mov rdx, r8 ; noargparsing entry point
; 53: 498BC9 mov rcx, r9
; 56: 498BDA mov rbx, r10
; 59: 4D39CA cmp r10, r9
; 5C: 7C30 jl L3
; 5E: L0: 4C39C1 cmp rcx, r8
; 61: 7D0B jnl L1
; 63: 4C39C3 cmp rbx, r8
; 66: 7C1B jl L2
; 68: 488BD1 mov rdx, rcx
; 6B: 498BC8 mov rcx, r8
; 6E: L1: 488BF9 mov rdi, rcx
; 71: 488BF3 mov rsi, rbx
; 74: 488D5D10 lea rbx, [rbp+16]
; 78: B906000000 mov ecx, 6
; 7D: F9 stc
; 7E: 488BE5 mov rsp, rbp
; 81: 5D pop rbp
; 82: C3 ret
; 83: L2: 488BD1 mov rdx, rcx
; 86: 488BCB mov rcx, rbx
; 89: 498BD8 mov rbx, r8
; 8C: EBE0 jmp L1
; 8E: L3: 498BCA mov rcx, r10
; 91: 498BD9 mov rbx, r9
; 94: EBC8 jmp L0
I thought about generating calls to values
in the final merge,
rather than permuting, but decided against: I know SBCL doesn’t
generate clever code when permuting registers, and that’d result in
avoidable spills. I also considered generating code in CPS rather
than emitting assignments; again, I decided against because I can’t
depend on SBCL to emit clever permutation code. The transformation
would make sense in a dialect with weaker support (both compiler and
social) for assignment.
How good is the generated code?
Both this inline merge sort and the original, permutationfree (except at the leaves), one actually define the exact same algorithms. For any input, both (should) execute the same comparisons in the same order: the original inline merge sort simply inlines the whole set of execution traces, without even merging control flow.
The permutationfree sort guarantees that it never performs redundant comparisons. Whether it performs the strict minimum number of comparisons, either on average or in the worst case, is another question. At first, I thought that \(\lg n!\) should be a good estimate, since the search tree seems optimally balanced. The problem is \(n!\) tends to have many other prime factors than 2, and we can thus expect multiple comparisons to extract less than 1 bit of information, for each execution. The lower bound can thus be fairly far from the actual value… Still, this question is only practically relevant for tiny sorts, so the discrepancy shouldn’t be too large.
A simple way to get the minimum, average or maximum comparison count would be to annotate the permutationfree generator to compute the shortest, average and longest path as it generates the search tree.
I’ll instead mimic the current generator.
The first step is to find the number of comparisons to perform a
merge of m
and n
values.
If either m
or n
is zero, merging the sequences is trivial.
Otherwise, the minimum number of comparisons is (min m n)
: the
sequences are presorted, and the shortest sequence comes first. The
maximum is (1 (+ m n))
. I couldn’t find a simple expression for
the average over all permutations. Instead, I iterate over all
possible combinations and count the number of comparisons until either
subsequence is exhausted.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 

Counting the number of comparisons in sorts is then trivial, with a recursive function. I didn’t even try to memoise repeated computations: the generated code is ludicrously long when sorting as few as 13 or 14 values.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 

CLUSER> (loop for i from 2 upto 16
do (multiplevaluebind (min avg max)
(sortcount i)
(format t "~4D ~6,2F ~6D ~6,2F ~6D~%"
i (log (! i) 2d0)
min (float avg) max)))
;; n lg(n!) min avg max ; best network
2 1.00 1 1.00 1 ; 1 1
3 2.58 2 2.67 3 ; 3 3
4 4.58 4 4.67 5 ; 5 5
5 6.91 5 7.17 8 ; 7 9
6 9.49 7 9.83 11 ; 10 12
7 12.30 9 12.73 14 ; 13 16
8 15.30 12 15.73 17 ; 16 19
9 18.47 13 19.17 21 ; 19 25?
10 21.79 15 22.67 25 ; 22 29?
11 25.25 17 26.29 29 ; 26 35?
12 28.84 20 29.95 33 ; 30 39?
13 32.54 22 33.82 37 ; 34 45?
14 36.34 25 37.72 41 ; 38 51?
15 40.25 28 41.69 45 ; 42 56?
16 44.25 32 45.69 49 ; 46? 60?
I annotated the output with comments (marked with semicolons). The columns are, from left to right, the sort size, the theoretical lower bound (on the average or maximum number of comparisons), the minimum number of comparisons (depending on the input permutation), the average (over all input permutations), and the maximum count. I added two columns by hand: the optimal worstcase (maximum) comparison counts (over all sorting methods, copied from the OEIS), and the optimal size for sorting networks, when known (lifted from a table here [pdf]). Inexact (potentially higher than the optimum) bounds are marked with a question mark.
For the input size the inline merge sort can reasonably tackle (up to ten or so), its worstcase is reasonably close to the best possible, and its average case tends to fall between the lower bound and the best possible. Over all these sizes, the merge sort’s worst case performs fewer comparisons than the optimal or bestknown sorting networks. The current best upper bounds on the minimal worstcase comparison count seem to be based on insertion sort passes that minimise the number of comparisons with a binary search. I don’t believe that’s directly useful for the current use case, but a similar trick might be useful to reduce the number of comparisons, at the expense of reasonably more data movement.
Making it CLstrength
That’s already a decent proof of concept. It’s also far too plain to fit in the industrialstrength Common Lisp way. An inline sorting macro worthy of CL should be parameterised on both comparator and key functions, and work with arbitrary places rather than only variables.
Parameterising the comparator is trivial. The key could be handled by calling it at each comparison, but that’s wasteful. We’re generating code; might as well go for glory. Just like in the list merge sort, I’ll cache calls to the key function in the merge tree. I’ll also use special variables instead of passing a halfdozen parameters around in the generator.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 

Finally, handling arbitrary places, thus letting the macro take care of writing results back to the places, is just regular macrology.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 

Now, the macro can be used to sort, e.g., vectors of double floats “inplace” (inasmuch as copying everything to registers can be considered inplace).
CLUSER> (macroexpand1 `(inlinesort (#'< :key #')
(aref array 0) (aref array 1) (aref array 2)))
(let* ((#:comparator1184 #'<)
(#:comparator1184
(if (functionp #:comparator1184)
#:comparator11184
(symbolfunction #:comparator1184)))
(#:key1185 #')
(#:key1185
(if (functionp #:key1185)
#:key1185
(symbolfunction #:key1185)))
(#:overwrite1186 t)
(#:array1195 array)
(#:array1192 array)
(#:array1189 array)
(#:temp1193 (aref #:array1195 0))
(#:temp1190 (aref #:array1192 1))
(#:temp1187 (aref #:array1189 2)))
(declare (ignorable #:comparator1184 #:key1185 #:overwrite1186))
(progn
nil
(progn
nil
nil
(let ((#:leftheadkey1196 (funcall #:key1185 #:temp1190))
(#:rightheadkey1197 (funcall #:key1185 #:temp1187)))
(if (funcall #:comparator1184 #:rightheadkey1197 #:leftheadkey1196)
(progn (rotatef #:temp1190 #:temp1187))
(progn))))
(let ((#:leftheadkey1198 (funcall #:key1185 #:temp1193))
(#:rightheadkey1199 (funcall #:key1185 #:temp1190)))
(if (funcall #:comparator1184 #:rightheadkey1199 #:leftheadkey1198)
(let ((#:rightheadkey1199 (funcall #:key1185 #:temp1187)))
(if (funcall #:comparator1184 #:rightheadkey1199
#:leftheadkey1198)
(progn (rotatef #:temp1193 #:temp1190 #:temp1187))
(progn (rotatef #:temp1193 #:temp1190))))
(progn))))
(when #:overwrite1186
(let ((#:new1194 #:temp1193))
(sbkernel:%aset #:array1195 0 #:new1194))
(let ((#:new1191 #:temp1190))
(sbkernel:%aset #:array1192 1 #:new1191))
(let ((#:new1188 #:temp1187))
(sbkernel:%aset #:array1189 2 #:new1188)))
(values #:temp1193 #:temp1190 #:temp1187))
t
CLUSER> (disassemble (lambda (array)
(declare (type (simplearray doublefloat (3)) array))
(inlinesort (#'< :key #')
(aref array 0) (aref array 1) (aref array 2))
array))
; disassembly for (lambda (array))
; 0C5A5661: F20F105201 movsd XMM2, [rdx+1] ; noargparsing entry point
; 666: F20F104209 movsd XMM0, [rdx+9]
; 66B: F20F104A11 movsd XMM1, [rdx+17]
; 670: 660F28E0 movapd XMM4, XMM0
; 674: 660F5725A4000000 xorpd XMM4, [rip+164]
; 67C: 660F28D9 movapd XMM3, XMM1
; 680: 660F571D98000000 xorpd XMM3, [rip+152]
; 688: 660F2FDC comisd XMM3, XMM4
; 68C: 7A02 jp L0
; 68E: 7267 jb L3
; 690: L0: 660F28DA movapd XMM3, XMM2
; 694: 660F571D84000000 xorpd XMM3, [rip+132] ; negate doublefloat
; 69C: 660F28E0 movapd XMM4, XMM0
; 6A0: 660F572578000000 xorpd XMM4, [rip+120]
; 6A8: 660F2FE3 comisd XMM4, XMM3
; 6AC: 7A26 jp L1
; 6AE: 7324 jnb L1
; 6B0: 660F28E1 movapd XMM4, XMM1
; 6B4: 660F572564000000 xorpd XMM4, [rip+100]
; 6BC: 660F2FE3 comisd XMM4, XMM3
; 6C0: 7A27 jp L2
; 6C2: 7325 jnb L2
; 6C4: 660F28DA movapd XMM3, XMM2
; 6C8: 660F28D0 movapd XMM2, XMM0
; 6CC: 660F28C1 movapd XMM0, XMM1
; 6D0: 660F28CB movapd XMM1, XMM3
; 6D4: L1: F20F115201 movsd [rdx+1], XMM2
; 6D9: F20F114209 movsd [rdx+9], XMM0
; 6DE: F20F114A11 movsd [rdx+17], XMM1
; 6E3: 488BE5 mov rsp, rbp
; 6E6: F8 clc
; 6E7: 5D pop rbp
; 6E8: C3 ret
; 6E9: L2: 660F28DA movapd XMM3, XMM2
; 6ED: 660F28D0 movapd XMM2, XMM0
; 6F1: 660F28C3 movapd XMM0, XMM3
; 6F5: EBDD jmp L1
; 6F7: L3: 660F28D8 movapd XMM3, XMM0
; 6FB: 660F28C1 movapd XMM0, XMM1
; 6FF: 660F28CB movapd XMM1, XMM3
; 703: EB8B jmp L0
Bonus: Hooking in SBCL
The inline sort supports the same options as CL:SORT
, so it’d be
really interesting to opportunistically compile calls to the latter
into sizespecialised inline sort. The usual, portable, way to code
that sort of macro qua sourcetosource optimiser in CL is with
compiler macros; compiler macros have access to all the usual
macroexpansiontime utility, but the function definition is left in
place. That way the user can still use the function as a firstclass
function, and the compilermacro can decline the transformation if a
regular call would work better (and the compiler can ignore any
compiler macro). That’s not enough for our needs, though… and there
can only be one compiler macro per function, so adding one to code we
don’t own is a bad idea.
Python’s first internal representation (ir1) is optimised by iteratively deriving tighter type information, and (mostly) pattern matching on the type of function calls. Its DEFTRANSFORM form lets us add new rules, and there may be an arbitrary number of such rules for each function.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 

The two deftransforms at the end define new rules that match on calls
to CL:SORT
and CL:STABLESORT
, with arbitrary argument types and
return types: basic type checks are performed elsewhere, and
maybeemitunrolledmergesort
does the rest. Transforms are
identified by the docstring (which also improve compiler notes), and
the argument and return types, so the forms are reevaluationsafe.
All the logic lies in maybeemitunrolledmergesort
. The policy
form checks that the optimisation policy at the call node has speed
greater than space
, and gives up on the transformation otherwise.
The next step is to make sure the sequence argument is an array, and
that its dimensions are known and define a vector (its dimension list
is a list of one number). The final guard makes sure we only
specialise on small sorts (at most
*unrolledvectorsortmaxlength*
).
Finally, we get to code generation itself. A vector of length 1 or 0
is trivially presorted. I could also directly emit the inner
inlinesort
form, but SBCL has some stuff to hoist out computations
related to hairy arrays. witharraydata
takes a potentially
complex array (e.g. displaced, or not a vector), and binds the
relevant variables to the underlying simple array of rank 1, and the
start and end indices corresponding to the range we defined
(defaulting to the whole range of the input array). Bound checks are
eliminated because static information ensures the accesses are safe
(or the user lied and asked not to insert type checks earlier), and
the start
index is declared to be small enough that we can add to it
without overflow – Python doesn’t implement the sort of sophisticated
shape analyses that could figure that out. Finally, a straight
inlinesort
form can be emitted.
That machinery means we’ll get quicker and shorter inline sort code
when the size is known ahead of time. For example, a quick
disassembly shows the following is about one fourth the size of the
sizegeneric inline code (with (simplearray doublefloat (*))
, and
(optimize speed (space 0))
.
CLUSER> (lambda (x)
(declare (type (simplearray doublefloat (4)) x)
(optimize speed))
(sort x #’<))
#<FUNCTION (lambda (x)) {100BFF457B}>
CLUSER> (disassemble *)
; disassembly for (lambda (x))
; 0BFF45CF: F20F105A01 movsd XMM3, [rdx+1] ; noargparsing entry point
; 5D4: F20F104209 movsd XMM0, [rdx+9]
; 5D9: F20F104A11 movsd XMM1, [rdx+17]
; 5DE: F20F105219 movsd XMM2, [rdx+25]
; 5E3: 660F2FC3 comisd XMM0, XMM3
; 5E7: 7A0E jp L0
; 5E9: 730C jnb L0
; 5EB: 660F28E3 movapd XMM4, XMM3
; 5EF: 660F28D8 movapd XMM3, XMM0
; 5F3: 660F28C4 movapd XMM0, XMM4
[…]
Even for vectors of length 8 (the default limit), the specialised merge sort is shorter than SBCL’s inlined heapsort, and about three times as fast on shuffled vectors.
That’s it
It took me much longer to write this up than to code the generator, but I hope this can be useful to other people. One thing I’d like to note is that sorting networks are much harder to get right than this generator, and pessimise performance: without branches, there must be partially redundant computations on nonpoweroftwo sizes. In the absence of solid compiler support for conditional swaps, I doubt the additional overhead of optimal sorting networks can be compensated by the simpler control flow, never mind the usual oddeven or bitonic networks.