Like many, when I first saw VPTERNLOG, my reaction was “\(\log_2(3) \approx 1.58\) is a nice reduction in depth, but my code mostly doesn’t have super deep reductions.”
A little bit of thinking reveals a big win at smaller (reasonable) scales: a binary operator takes two values and outputs one, while a ternary operator takes three and outputs one. In a reduction, each application of the binary operator decrements the number of values by \(2 - 1 = 1\), but each application of the ternary operator decrements it by \(3 - 1 = 2\)!
We thus need half as many ternary operations to reduce a given number of bitvectors, compared to binary operations… and it’s not like the throughput (or latency for that matter) is worse. Plus it’s hard to be more orthogonal than a lookup table.
Cute lightweight instruction, two thumbs up!