Space-complexity of SSA in practice
Converting code to static single assignment (SSA) style really simplifies analyses in compilers. The ease with which I can hack things up in LLVM’s SSA intermediate representation is impressive, especially when compared to the same task in SBCL’s mostly-CFG IR.
In theory, there’s an important downside to converting to SSA before analyses: the conversion can cause a quadratic explosion in size from the source to the IR. A linear-time pass on an SSA-converted IR is actually quadratic-time (in the worst case) with respect to the initial input. Obviously, this is an issue for just-in-time compilation, but also for some Lisp and Scheme compilers. SBCL is frequently used to compile very large, programmatically generated functions, and I’m told that Chez Scheme faces similar challenges. The fact that both languages have local functions compounds the issue: analysing a single toplevel function can actually be equivalent to interprocedural analysis in other languages.
On the other hand, all the reference texts on SSA construction that I’ve read reassure the reader that the output is linear-size in practice... the practice of Fortran, C, and other curly-brace languages, that is.
What’s the practical size blowup like in Scheme or Lisp? I think it’s much closer to the theoretical worst case.
The worst-case expansion for SSA happens when a ϕ node is inserted for each variable at the head of each basic block. A function with m variables and n basic blocks (i.e. size m + n) is converted into something with an additional mn ϕs.
Unfortunately, that’s exactly what tends to happen for code generated by the macro in Shriram Krishnamurthi’s Automata via Macros. That’s a pretty bad sign: the macro is simple enough that the author used it as a pedagogic example. It also seems very idiomatic; it’s certainly very close to how I would solve the problem.
In something like C, I would implement that simple state machine with a while loop and a switch:

    enum state state = ...;
    int var1 = ..., var2 = ..., ...;
    while (1) {
        switch (state) {
        case ...
        }
    }
That loop has a linear-size SSA conversion. Each state has a single successor, and ϕ nodes are only inserted at the head of the while loop. clang and the llvm toolchain are perfect for this sort of exploration. The tricky thing is that clang’s optimisations involve some code duplication, even at -O1. It’s probably best to compile to bitcode at -O0, and then use opt to enable only the passes we need (mem2reg to convert to SSA and jump-threading to clean up trivial basic blocks).
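To make that concrete, here is a made-up instance of the while/switch pattern (the states, variables and transitions are invented for illustration), the sort of input one could push through mem2reg and jump-threading to check the claim:

    /* Made-up three-state machine in the while/switch style.  After
     * mem2reg (plus jump-threading to clean up trivial blocks), the
     * variables state, var1 and var2 should each need a single ϕ node,
     * at the join point at the head of the loop, and none in the
     * individual case blocks: the SSA form stays linear in the size of
     * the source.
     */
    enum state { S0, S1, S2, DONE };

    int run(int x)
    {
        enum state state = S0;
        int var1 = 0, var2 = x;

        while (1) {
            switch (state) {
            case S0:
                var1++;
                state = (var2 > 0) ? S1 : S2;
                break;
            case S1:
                var2--;
                state = S0;
                break;
            case S2:
                state = DONE;
                break;
            case DONE:
                return var1 + var2;
            }
        }
    }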
The automaton macro’s output is equivalent to having a few gotos at the end of each state:

    {
    s0: var1++;
        if (...) goto s1;
        goto s4;
    s1: ...
    }
Clearly, the goto version will more easily yield efficient code, and it is arguably more transparent to analyses. However, even very simple states like the one shown above, in which only one variable is read or written and there are two successors, can have a quadratic-size SSA conversion (of course, at least one of the states must also use the variables to compute a return value, otherwise all that code is dead).
It suffices to have n basic blocks and n variables, with each basic block writing and reading exactly one variable (and each variable being read and written by exactly one basic block). If there are two (directed) cycles that go through each of the n basic blocks in different orders, the conversion will have n ϕ nodes in each basic block. In fact, I think it suffices for the basic blocks to form a single strong component, and for each basic block to have more than one predecessor.
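Here is a made-up C rendition of that construction with n = 4; the labels, variables and branch conditions are placeholders (real automaton code would branch on actual input), and the return in s0 is only there to keep the variables live:

    /* Made-up instance of the quadratic pattern, with n = 4.  The states
     * are linked by two directed cycles that visit all four of them in
     * different orders (s0 -> s1 -> s2 -> s3 -> s0 and its reverse), so
     * every state sits in a single strong component and has at least two
     * predecessors.  Each state writes a single variable, but all four
     * variables stay live around both cycles, so mem2reg should insert a
     * ϕ node for (nearly) every variable at the head of (nearly) every
     * state: on the order of n² ϕs for a function of size O(n).
     */
    int quadratic(int x)
    {
        int v0 = 0, v1 = 0, v2 = 0, v3 = 0;

    s0: v0++;
        if (v0 > 100)
            return v0 + v1 + v2 + v3;
        if (x & 1) goto s1; else goto s3;
    s1: v1 += x;
        if (x & 2) goto s2; else goto s0;
    s2: v2 += x;
        if (x & 4) goto s3; else goto s1;
    s3: v3 += x;
        if (x & 8) goto s0; else goto s2;
    }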
That condition is not very strong. I don’t think I would even express logic as a state machine if the automaton in question did not have a large number of states with multiple predecessors in a single strong component; without either of those conditions, normal conditional branches and a couple small loops would probably be easier to read and write.
One argument I can see against this example is that idiomatic Scheme would lift the variables as arguments to each state function. The corresponding SSA IR isn’t smaller, though. The main difference is that the size explosion is performed by the macro, at the source level, instead of by a later IR construction pass. If anything, it’s an argument against manually writing “pure” tail-recursive loops in Scheme: generating a quadratically blown-up intermediate representation is definitely not a task for human programmers.
So, it seems that the practice of static single assignment differs between C-style and lisp-esque languages. I don’t know whether the difference is large and common enough to be an issue, however. Simpler analyses may very well compensate for the size increase. Lambda-lifted IRs are very common in functional language compilers, and those are even more easily trapped into quadratic code size explosions. For now, I’ll try and figure out if we can be clever and adapt a lambda-lifted IR so that it has the same (or comparable) size increase vulnerabilities as SSA.