Paul Khuong mostly on Lisp

rss feed

Sun, 30 Sep 2007

 

String-case (bis)

I already wrote a relatively long document on the topic (available at http://www.discontinuity.info/~pkhuong/string-case.lisp), but the format (pdf or source file) might not be ideal for everyone. I think writing pattern matchers is a surprisingly frequent task, so I'll go over the main points again here.

The usual approach used to implement string-case is to intern the strings in a hash table (e.g. via symbols), and then use a normal eql case. This seems lossy for at least two reasons: a hash lookup must traverse the string at least twice (once to compute the hash value, and some more to find the right key) and the eql case (usually) adds a pass over all the cases. A specialised pattern matcher can do much less work: each test's branches eliminate at least 1 candidate, and once a single candidate is left, a single pass over the parts of the input string that haven't been examined is needed. A string-case based on perfect hashing would also be an option. However, finding such a hash function can take some time, and pessimises behaviour on mismatches.

Obviously, a pattern matcher should, at each step, split the candidate patterns as close to 50/50 as possible, with only half of the candidates left in the 'true' branch, and the other half in the 'false' branch. This can be a problem when some candidates are applicable in both branches (code explosion or backtracking, more complex metric, ...). Fortunately, this is not the case here, since we only have constant values as patterns. In this case, first discriminating on the length of the input seems like a good idea; it may not give an even distribution between all the candidates, but it simplifies the rest of the code (no need for bounds checking). For the rest of the matching process, the function find-best-split simply iterates over all the possible tests to find the one that leads to the most even split. Note that an interesting property of strings is that random access takes constant time, so, faced with similar cases (e.g. "foobaR" and "foobaZ"), it is obviously a good idea to discriminate on indices that actually differ (the last one here), instead of fixing an arbitrary order ahead of time.

Once a single candidate is left, a last series of tests is still needed to make sure the candidate does match the input. It would be simple to simply keep track of the indices that haven't been used to discriminate yet and emit the corresponding checks. This would however lead to some code duplication for similar patterns, since most of the indices (those for which the patterns are identical) would not have been tested. It is only slightly more complex to emit the tests as soon as all the candidates are identical on some indices, which will lead to much less repetition.

While it's not very apparent in this case (since the patterns are simply constant values), there's a certain tension between not doing work twice (by fully specialising code) and controlling code size (by executing common code as early as possible or by not fully specialising code, thus making it common). On modern computers, there's also the issue of branch prediction. Completely balanced pattern matching code is practically the worst case possible: most branches are taken exactly 50% of the time (there is no 'right' way to predict them).

One way to alleviate this problem is to coalesce tests together and only branch at the very end. This reduces the number of branches, and thus the number of mispredictions. However, it also pessimises the amount of computation needed, since there is no branch to abort early. Again, there is a balance to be found to reduce the number of branches without overly increasing the amount of work. In the string-case macro, it is achieved by coalescing equality tests by chunks of up to 4 indices (via or of multiple xor before branching).

Another way to attack the problem is to use the probability of each case (if they're not uniform) to produce an unbalanced search tree that gives a better expected runtime than a balanced one. This can be pretty complex to achieve, though.

Pattern matchers are like the code-generation equivalent of dictionaries (as in map, hash table, etc). Dictionaries are obviously extremely important data structures in programming; I feel that the usefulness of pattern matchers is often underestimated, especially by users of languages that don't offer such functionalities built-in (they instead end up writing simplistic code by hand). Another point that is easily overlooked, even by people who already use pattern matching, is that, just like there are many ways to implement dictionaries (e.g. trie VS hash table), each with their strengths and weaknesses, it is possible to implement pattern matching to take advantage of properties of the problem (a typical pattern matcher for algebraic data types would not use random access to variable-length inputs). This might be why I end up writing special-purpose pattern matchers much more often than most other programmers, and (yet) still haven't found a way to write a satisfying generic pattern matcher.

Pattern matching is your friend, use it!

P.S. Here are some URLs that might be useful to someone who want to implement a pattern matcher:

  1. http://discontinuity.info/~pkhuong/xt-pattern.lisp is my extensible pattern matcher. It's 'generic', so probably decent for many things, but wouldn't easily allow us to express the strategy used in string-case. Still useful if you don't feel like writing your own from scratch, or would like some inspiration for the general architecture.
  2. F. Le Fessant and L. Maranget. Optimizing Pattern Matching. In International Conference on Functional Programming. ACM press, 2001. is a paper describing the authors' experience while improving OCaml's pattern matching, which is based in backtracking (which tightly controls output code size but may do redundant work).
  3. P. Sestoft. ML Pattern Match Compilation and Partial Evaluation. In Dagstuhl Seminar on Partial Evaluation. Springer-Verlag, 1996. Lecture Notes in Computer Science 1110. describes a pattern matcher for ML ADTs (again), but this time from a program specialisation perspective. By fully specialising the pattern matching code on known data, redundant work is completely eliminated, but code size is not controlled anymore. Note that avoiding redundant work is far from enough to ensure 'optimal' code: as evidenced by string-case, deciding what to test for in what order (even with ADTs, there is random access to members and temporaries) is crucial. In fact, it would not be surprising to see a backtracking-based pattern matcher outperform a fully specialising one, simply by choosing the order in which to perform tests better.

posted at: 22:38 | /Lisp | permalink

Made with PyBlosxom Contact me by email: pvk@pvk.ca.