Fast Constant Integer Division
Integer division (and its kissing cousins, remainder and modulo) is an operation that regularly pops up in surprising places. The first time I encountered the problem was a few years ago, when I wanted to implement a corewar VM in SBCL (probably my first attempt at a real project in CL). My program was much slower than I would have expected: MOD by constants was compiled to actual division instructions. More recently, I hit the same problem when trying to implement hash functions and PRNGs efficiently. The main technique to make such operations decently fast approximates division by a multiplication and a shift (in other words, multiplication by a fraction). However, there are other tricks for more restricted situations that can sometimes yield simpler code.
Expressing division as a multiplication is actually simple when you think of it in terms of fractions. Division (and then truncation) by x is like multiplication by 1/x. Since the only type of division we can easily express is division by a power of 2, we want to approximate multiplication by 1/x with a multiplication by y/2^k. Moreover, since the result will be truncated, we want to overapproximate (in absolute value) the fraction. We thus find (for x positive) the folklore formula y = ceiling(2^k/x) (or 1 + floor(2^k/x), which suffers from an off-by-one when x is a power of 2). Note that since the multiplier is rounded up, any error will be one of overapproximation. If the sign of the input is known (or the range otherwise asymmetric), it is possible to use a small additive constant to bring the result of the multiplication closer to 0 before dividing and taking the floor (with a shift), and reduce the overapproximation.
In http://www.discontinuity.info/~pkhuong/pseudoinverse.pdf, we find a nice way to compute the smallest z such that floor(z/x) differs from floor(z*y/2^k). If the input is known to fit in that range (which is often on the order of a machine int), then it is possible to convert the integer division to a multiplication. Unfortunately, this analysis doesn't even consider the possibility of a (constant) additive fixup between the multiplication and the division/flooring.
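
To make this concrete, here's a minimal sketch in C (the helper names are made up, and it assumes a compiler with unsigned __int128, such as GCC or Clang). For a 32-bit input with k = 64, the rounding error e = y*x - 2^64 is less than x <= 2^32, so z*e < 2^64 and the truncated result is exact for every divisor x >= 2:

    #include <assert.h>
    #include <stdint.h>

    /* y = ceiling(2^64/x) for x >= 2.  Computing it as
       floor((2^64 - 1)/x) + 1 gives the right value even when x is a
       power of 2, avoiding the off-by-one noted above. */
    static uint64_t reciprocal(uint32_t x)
    {
        return UINT64_MAX / x + 1;
    }

    /* Truncated division of z by the x that y encodes: multiply by
       y/2^64, i.e. keep the high half of the 64x64 -> 128 product. */
    static uint32_t div_by_mul(uint32_t z, uint64_t y)
    {
        return (uint32_t)(((unsigned __int128)z * y) >> 64);
    }

    int main(void)
    {
        uint64_t y = reciprocal(7);

        for (uint64_t z = 0; z <= UINT32_MAX; z += 12345)
            assert(div_by_mul((uint32_t)z, y) == z / 7);
        return 0;
    }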
When implementing this on a machine that offers BxB -> B,B multiplication (e.g., 32x32 -> 32,32 on IA32, or 64x64 -> 64,64 on x86-64), or even just the most significant half of the product (like Itanium or Alpha), there are two main interesting values for k (as in y/2^k): B and B + floor(lg x). If we take k = B, then we can simply use the most significant word as an implicitly shifted value. If that isn't precise enough, then B + floor(lg x) is the largest denominator such that y (skipping leading zeros) still fits in B bits; the result must however then be shifted. In both cases, but especially when shifting, it is essential to ensure that the result won't overflow the space available in the significant half of the product. When a machine offers several multiplication widths, it can be useful to try the smaller ones first: wider multiplications are sometimes slower, and larger constants always take up more space.
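
Here's a sketch of how those two values of k play out for B = 32 with a 32x32 -> 64 multiply (hypothetical helper names; __builtin_clz as in GCC/Clang). With k = 32, the high word is exact for every 32-bit z only when the error e = y*x - 2^32 is at most 1; with k = 32 + floor(lg x), exactness requires e <= 2^floor(lg x), which holds for many divisors (e.g., 3, 5 or 10) but not all (e.g., 7 would need the additive fixup mentioned above, or a wider multiply):

    #include <assert.h>
    #include <stdint.h>

    struct recip { uint32_t y; unsigned k; };

    /* Pick a (y, k) pair for x >= 2, preferring k = 32 (no shift). */
    static struct recip make_recip(uint32_t x)
    {
        unsigned j = 31 - __builtin_clz(x);        /* floor(lg x) */
        uint64_t y = UINT32_MAX / x + 1;           /* ceiling(2^32/x) */
        uint64_t e = y * x - ((uint64_t)1 << 32);  /* rounding error */

        if (e <= 1)               /* the high word alone is exact */
            return (struct recip){ (uint32_t)y, 32 };

        y = (((uint64_t)1 << (32 + j)) - 1) / x + 1;  /* ceiling(2^(32+j)/x) */
        e = y * x - ((uint64_t)1 << (32 + j));
        /* Some divisors (e.g. 7) fail this test and need an additive
           fixup or a wider multiplication, as discussed above. */
        assert(e <= ((uint64_t)1 << j));
        return (struct recip){ (uint32_t)y, 32 + j };
    }

    static uint32_t div_recip(uint32_t z, struct recip r)
    {
        return (uint32_t)(((uint64_t)z * r.y) >> r.k);
    }

For divisors that trip the assert, the wider multiply from the previous sketch still applies.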
Sometimes, we're only interested in the remainder (MOD is slightly more complex and can be expressed in terms of REM). In the general case, it is impossible to avoid a division, or at least a multiplication by a pseudoinverse, followed by a multiplication and a subtraction. However, when the input range is restricted, some nice tricks are available. It's much simpler to assume that the input is always positive (or always negative). Doing otherwise requires a non-trivial amount of mask generation and subsequent masking, making the tricks less attractive.
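
As a sketch of that general case (same hypothetical names and __int128 assumption as above): one multiplication by the pseudoinverse to get the quotient, then a multiplication and a subtraction to recover the remainder.

    #include <stdint.h>

    /* rem via the quotient: multiply by y = ceiling(2^64/x) to get
       floor(z/x) (exact for 32-bit z, as argued above), then
       multiply back by x and subtract. */
    static uint32_t rem_by_mul(uint32_t z, uint32_t x, uint64_t y)
    {
        uint32_t q = (uint32_t)(((unsigned __int128)z * y) >> 64);
        return z - q * x;
    }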
Apart from powers of 2, the most obvious trick is when we want (rem z x), where z is known to be smaller than 2x. We can express x as 2^n - e. We have (rem z x) = z if z < x, and (rem z x) = z - x otherwise. It would be simple to compare, generate a mask and do a masked subtraction. However, when z + e might overflow (and n is the width of the register), a simple way to check for that is to add e to z. If the result is greater than 2^n - 1, then z >= x (but we still have z < 2^(n+1)). Luckily, the implicit wraparound (masking off everything but the lower n bits) is enough to subtract x (remember that e was previously added, so subtracting 2^n really subtracts x). If z < x (the overflow bit isn't set), then we have to subtract e back. This logic can easily be expressed with a masked subtraction.
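
In C, where the carry flag isn't directly accessible, the same logic can be sketched with an overflow test that compilers recognise; here for n = 64, so x = 2^64 - e (the choice of e is only an example, e.g. 59 for the prime 2^64 - 59), and z < 2x holds automatically:

    #include <stdint.h>

    /* (rem z x) for x = 2^64 - e, with 0 < e << 2^63.  Adding e to z
       wraps around exactly when z >= x; the wraparound subtracts
       2^64, and since e was just added, the net effect is subtracting
       x.  When there is no carry, z < x and we subtract e back. */
    static uint64_t rem_near_word(uint64_t z, uint64_t e)
    {
        uint64_t t = z + e;        /* wraps mod 2^64 */
        int carry = t < z;         /* unsigned overflow <=> z >= x */

        return carry ? t : t - e;  /* masked/conditional subtraction */
    }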
A slightly more widely applicable trick is also based on representing the modulo x as 2^k - e. We can split z in 2 (or more) parts: z = z1 2^k + z0. Thus (rem z x) = (rem (z1 e + z0) x). If the product z1 e is small enough, we can then use the previous trick instead of computing the remainder the usual way (it would be possible to recurse). By adding e to z1 e and then adding that to z0, we can guarantee that any overflow will only happen in the last addition, thus making it possible to easily use hardware flags.
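
The simplest instance is e = 1, i.e. x = 2^k - 1, where the product z1 e needs no extra care; here's a sketch for z mod (2^31 - 1), a classic PRNG modulus, folding twice and finishing with a conditional subtraction as in the previous trick:

    #include <stdint.h>

    /* z mod (2^31 - 1): split z = z1 2^31 + z0, and rem(z, m) =
       rem(z1 + z0, m) since e = 1.  Two folds bring z below 2m, and
       a single conditional subtraction finishes the job. */
    static uint32_t mod_mersenne31(uint64_t z)
    {
        const uint64_t m = ((uint64_t)1 << 31) - 1;

        z = (z >> 31) + (z & m);   /* now z < 2^34 */
        z = (z >> 31) + (z & m);   /* now z < m + 8 < 2m */
        return (uint32_t)(z >= m ? z - m : z);
    }

With a larger e, each fold would multiply z1 by e, and the flag-based bookkeeping described above becomes necessary.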
Official SBCL should be able to express some of these tricks (at the very least reciprocal multiplication) some time soon. The neat thing about optimising divisions away is that they're so expensive (on the order of 100-200 clock cycles on modern x86) that even branches are pretty much guaranteed to be an improvement, even in the worst case :)