Paul Khuong: some Lisp

An old conjecture on stream transducers

June 24th, 2018 | Comments

I’ve been thinking about stream processing again, and came back to an old pick-two-of-three conjecture of mine: for stream processing without dynamic allocation, “arbitrary outputs per input, multiple consumers, multiple producers: choose two.”

The question is interesting because stream processing in constant space is a subset of L (or FL, and thus probably not P-complete, let alone Turing complete. Having easily characterisable subsets of stream processing that can be implemented in constant space would be a boon for the usability of stream DSLs.

I think I find this academic trope as suspicious as @DRMavIver does, so I have mixed feelings about the fact that this one still feels true seven years later.

The main reason I believe in this conjecture is the following example, F(S(X), X), where S is the function that takes a stream and ouputs every other value. Or, more formally, \(F\sb{i} = f(X\sb{2i}, X\sb{i})\).

Let’s say X is some stream of values that can’t be easily re-computed (e.g., each output value is the result of a slow computation). How do we then compute F(S(X), X) without either recomputing the stream X, or buffering an unbounded amount of past values from that stream? I don’t see a way to do so, not just in any stream processing DSL (domain specific language), but also in any general purpose language.

For me, the essence of the problem is that the two inputs to F are out of sync with respect to the same source of values, X: one consumes two values of X per invocation of F, and the other only one. This issue could also occur if we forced stream transducers (processing nodes) to output a fixed number of value at each invocation: let S repeat each value of X twice, i.e., interleave X with X (\(F\sb{i} = f(X\sb{\lfloor i / 2\rfloor}, X\sb{i})\)).

Forcing each invocation of a transducer to always produce exactly one value is one way to rule out this class of stream processing network. Two other common options are to forbid either forks (everything is single-use or subtrees copied and recomputed for each reuse) or joins (only single-input stream processing nodes).

I don’t think this turtle-and-hare desynchronisation problem is a weakness in stream DSLs, I only see a reasonable task that can’t be performed in constant space. Given the existence of such tasks, I’d like to see stream processing DSLs be explicit about the tradeoffs they make to balance performance guarantees, expressiveness, and usability, especially when it comes to the performance model.

Comments