# How Bad Can 1GB Pages Be?

I joined the ad server team at AppNexus two weeks ago. It’s a new problem domain for me, one that I find pretty interesting so far: the workload is definitely on the branchy/pointer-chasing side of things, and, although aggregate throughput matters, we work under stringent latency goals. There are a few hash table lookups on the code path I’ve been working on, and some micro-benchmarking revealed that 1GB pages could easily shave 25% off the latency for lookups in large (hundreds of MB) tables.

That’s not too surprising. Given any change, it’s usually easy to design a situation that makes it shine. I find it just as important to ask the opposite question: what’s the worst slowdown I can get from switching from regular 4KB pages to huge 1GB pages? This way, I can make more robust decisions, by taking into account both a good (if not best) and a worst case. By the end of this post, I’ll multiply runtime by 250% for the same operation, simply by switching from 4KB to 1GB pages… with a setup so contrived that it seems unlikely to occur by accident.

But first, a quick overview of the benefits and downsides of huge pages.

# Why are huge pages interesting?

In short, manufacturers are adding huge pages because translating virtual addresses to physical ones is slow.

On x86-64, the mapping from virtual to physical pages is represented with a trie; each level dispatches on 9 bits (i.e., each node has 512 children), and leaves correspond to 4KB pages. There are 4 levels from the root node to the leaves, which covers the (currently) standard 48-bit virtual address space.
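To make the walk concrete, here’s a small C sketch (my own illustration, not kernel code) that splits a 48-bit virtual address into the four 9-bit trie indices and the offset within a 4KB page:

    #include <stdint.h>
    #include <stdio.h>

    /* Split a 48-bit virtual address into the four 9-bit page table
     * indices and the 12-bit offset within a 4KB page.  A 2MB page
     * stops the walk one level earlier (21 offset bits), and a 1GB
     * page two levels earlier (30 offset bits). */
    static void decompose(uint64_t vaddr)
    {
            unsigned offset = vaddr & 0xFFF;        /* 12 bits */
            unsigned pt = (vaddr >> 12) & 0x1FF;    /* level 1: page table */
            unsigned pd = (vaddr >> 21) & 0x1FF;    /* level 2: page directory */
            unsigned pdpt = (vaddr >> 30) & 0x1FF;  /* level 3 */
            unsigned pml4 = (vaddr >> 39) & 0x1FF;  /* level 4: root */

            printf("%u -> %u -> %u -> %u + %u\n", pml4, pdpt, pd, pt, offset);
    }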

The address translation table is (mostly) stored in normal memory and is too large to fit in cache. Thus, translating a virtual address requires four reads, any of which can hit uncached memory.

This is where the translation lookaside buffer (TLB) comes in. On my E5-4617, each core has 64 entries for regular 4KB pages in its L1 dTLB, and 512 (shared) entries in its L2 TLB. I don’t know if the TLBs are exclusive or inclusive, but even if they’re exclusive, that’s only enough for 2.25MB worth of data. Assuming that the working set is completely contiguous (i.e., the best case), the TLB space for 4KB pages is less than the total cache/core (2.5MB L3/core + 256 KB L2 + 32 KB L1D).

2MB and 1GB “huge pages” address this imbalance: nine 2MB pages suffice to cover more address space than all the caches in a 6-core E5-4617. However, there are only 32 L1dTLB entries for 2MB pages – and 4 entries for 1GB pages – on my E5.

In addition to covering more address space in the TLB, huge pages offer secondary benefits: there are fewer page table entries, and the trie is shallower. Fewer page table entries means that a larger fraction of memory can be used by data, rather than metadata, and that the page table walk is more likely to stay in cache. Moreover, larger pages are closer to the trie’s root: while the processor traverses 4 levels to get to a 4KB page, it only traverses 3 levels to reach a 2MB page and 2 levels for a 1GB page. These two effects compound to make TLB misses quicker to handle.

# Now, the downsides…

This idea that one must cover as much address space as possible with the TLB is most relevant in two settings: trivially, if the working set is completely covered by the (L1d)TLB, or, more interestingly, when the access patterns show a lot of locality. Examples of the latter case are BLAS routines: with appropriate blocking, they can usually access each page once or twice, but read almost every byte in a page before switching to the next.

The opposite, worst, case would be something like lookups in a large (too big for the TLB) hash table: we choose a virtual address pseudorandomly and painstakingly translate it to a physical address, only to read a handful of words from that page. In that situation, we want as many TLB entries as possible, regardless of the address space each one covers… and that’s where 4KB pages ought to shine. Taking into account both the L1DTLB and the L2TLB, each core has 576 TLB entries for 4KB (data) pages, versus 64x2MB and 4x1GB. Now, I don’t know if the TLBs are exclusive or not, so I’ll assume the worst case and work with 512*4KB entries.

The thing is, 512 TLB entries aren’t that many. If, by chance, our hash table lookups keep hitting the same 512 pages, a contrived microbenchmark will show that 4KB pages are a big win (but really, a software cache might be a better way to exploit the situation). It’s more likely that it’ll be hard to avoid TLB misses regardless of page size, and huge pages then become useful because each TLB miss is handled more quickly. Regardless, I’ll try to approximate this worst-case behaviour to see how bad things can get.

# A first stab at pessimising huge pages

Ideally, I would want to read from 512 (or a bit fewer) locations 1GB apart, but I don’t have that much RAM. In the interest of realism, I decided to “only” allocate 24GB.

My first microbenchmark follows: I allocate 24GB, divide that space into 512 chunks, and read the first word of each chunk in a loop. At first, I didn’t even randomise the traversal order (so as to abuse LRU), but there seems to be some prefetching for 1GB pages.
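The timing loop looks roughly like the sketch below, assuming the 24GB buffer has already been mmap()ed with the page size under test and that indices holds the chunks’ byte offsets (touch_chunks is just an illustrative name):

    #include <stddef.h>
    #include <stdint.h>
    #include <x86intrin.h>  /* __rdtsc */

    /* Read the first word of each chunk and return the elapsed cycles.
     * buffer: region backed by 4KB, 2MB or 1GB pages;
     * indices: byte offsets of the chunks to touch. */
    static uint64_t touch_chunks(const char *buffer, const size_t *indices,
                                 size_t nindices, size_t repetitions)
    {
            uint64_t acc = 0;
            uint64_t begin = __rdtsc();

            for (size_t rep = 0; rep < repetitions; rep++)
                    for (size_t i = 0; i < nindices; i++)
                            acc += *(volatile const uint64_t *)(buffer + indices[i]);

            uint64_t end = __rdtsc();
            (void)acc;
            return end - begin;
    }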

The results: 1.13e10 cycles for 4KB pages, 1.60e10 for 2MB and 1.56e10 for 1GB. That’s only 40% more cycles… it’s bad, but not horrible. The reason is that the data vector spans only 24x1GB, so 1/6th of the random lookups will hit the 1GB TLB. Instead, let’s try and load from each of these 24 pages, in random order. 24 pages will easily fit in the L1DTLB for 4KB pages, but not in the 4 slots for 1GB pages.

# Takes two to six

The results are even worse (better)! 4.82e9 cycles for 4KB pages, versus 3.96e9 and 2.84e9 for 2MB and 1GB pages!

The problem is aliasing. The TLB on my E5 has limited way-ness (4-way, I believe), so, by aligning everything to a 1GB boundary, the effective size of the 4KB page TLB is 4 entries (same for 2MB). In a way, this highlights the effect of page size when TLBs are useless (random accesses to dozens or hundreds of GBs): 2MB pages shave 18% off the runtime, and 1GB pages another 30%, for a total of 60% as much time to handle a 1GB TLB miss versus 4KB.

Let’s try again, with indices[i] = (x*stride + (x*4096)%(1ul<<30)) % (24ul << 30); for the index computation. I now find 1.14e9, 6.18e9 and 2.65e9 cycles. Much better (for 4KB pages, at least)!

For fun, I also tried to offset by 2MB increments, with indices[i] = (x*stride + (x<<21)%(1ul<<30)) % (24ul << 30);, and found 2.76e9, 1.30e9, and 2.85e9 cycles.

Finally, I tried

    size_t offset = 4096 + (1ul<<21);
    indices[i] = (x*stride + (x*offset)%(1ul<<30)) % (24ul << 30);


so that neither 4KB nor 2MB pages would alias, and got 1.13e9, 1.08e9 and 2.65e9 cycles. That’s 234% as much time for 1GB pages as for 4KB.

We’re close: this setup is such that 1GB pages cause a lot of TLB misses, but neither 4KB nor 2MB pages do. However, perf stat shows there are a lot of cache misses, and that probably reduces the difference between 4KB and 1GB pages.

Let’s try one last thing, with size_t offset = 4096 + (1ul<<21) + 64; (to avoid aliasing at the data cache level), and a smaller index vector that fits in cache.

We get 1.06e9, 9.94e8, and 2.62e9 cycles, i.e., 250% as much time with 1GB pages as with 4KB ones.

We can easily turn this around: we just have to loop over more than four 4KB-aligned locations spread across a 4GB space, so that everything fits in the four 1GB TLB entries while the 4KB and 2MB TLBs keep missing.

With that setup, I find 7.55e9 cycles for 4KB pages, 3.35e9 for 2MB and 1.09e9 for 1GB pages. Here, 4KB pages are almost 7x as slow as 1GB pages. If I instead let size_t offset = 4096 + 64; (to avoid aliasing in the 4KB TLB), I get 4.72e9 cycles for 4KB pages, so still 433% as much time.

We can also play the same trick over 32*2MB = 64MB. On my E5, I find 3.23e9 cycles for 4KB pages, versus 1.09e9 for 2MB and 1GB pages. Eliminating page-level aliasing only brings the 4KB case down to 3.02e9 cycles, and doesn’t affect the other two cases.

# So, are 1GB pages generally useful?

The following table summarises the runtimes of all the variations above with 2MB and 1GB pages (as a fraction of the number of cycles for 4KB pages).

| Setup | 2MB/4KB | 1GB/4KB |
|---|---|---|
| 512 chunks across 24GB, in order | 1.42 | 1.38 |
| 24 1GB-aligned locations, random order | 0.82 | 0.59 |
| … with 4KB offsets | 5.42 | 2.32 |
| … with 2MB offsets | 0.47 | 1.03 |
| … with 4KB and 2MB offsets | 0.96 | 2.34 |
| … plus 64B offsets, small index vector | 0.94 | 2.47 |
| 4KB-aligned locations in a 4GB space | 0.44 | 0.14 |
| … without 4KB TLB aliasing | 0.72 | 0.23 |
| Same trick over 32×2MB = 64MB | 0.34 | 0.34 |
| … without page-level aliasing | 0.36 | 0.36 |

Overall, I think that I wouldn’t automatically switch to 2MB pages, but that 1GB pages are a solid choice for machines that basically run a single process at a time. When the data fits in 4GB, 1GB pages completely eliminate TLB misses. When the data is even larger, 2MB and 1GB pages make page table walks quicker (by 18% and 40%, respectively). It takes a very contrived situation – in which a program keeps hitting fewer than 512 4KB-pages that are spread out across multiple GBs – for smaller pages to be preferable. The worst I managed was 250% as much time for 1GB pages vs 4KB; in the other direction, I achieved 693% as much time for 4KB pages versus 1GB, and 433% with a realistic situation (e.g., repeated lookups in a 4GB hash table). Plus, there’s another interesting benefit of larger pages that did not show up in this post: we get more control over aliasing in data caches.

With multiple processes in play, there are fragmentation issues, and things aren’t as clear-cut… especially given that 1GB pages must currently be allocated at boot-time, on Linux.

I’m also still unsure how 1GB pages interact with NUMA. I’m particularly worried about interleaving: interleaving at a 1GB granularity seems unlikely to smooth out the ratio of local:remote accesses as much as doing it at a 4KB granularity.

# A Whirlwind Introduction to Decomposition Methods in Discrete Optimisation

TL;DR: Optimisers compose, satisfiers don’t.

One rainy day at Sophia-Antipolis, a researcher told me a story about an optimisation software company. They’d looked into adding explicit support for Dantzig-Wolfe decomposition, but market research showed there were a few hundred potential users at most: too few people are familiar enough with decomposition methods to implement new ones. The problem is, the strength of these methods is that they are guided by domain expertise; useful decompositions are custom-made for the target problem, and thus obviously novel. What this story tells me is that we, researchers in decomposition, are really bad communicators.

It’s not like these methods are new or ill understood. The theory on how to break up optimisation problems into smaller ones (for Benders and Dantzig-Wolfe decompositions, at least) was fully developed in the fifties and sixties. It’s the same theory that backs state-of-the-art solvers for planet-scale planning problems today. Surprisingly, it seems that a single undergraduate or graduate course in optimisation methods does not suffice for expertise to trickle down to the wider computer science and management communities.

The divide seems particularly wide with the constraint programming (CP) community (despite the work of Hooker, Chvátal and Achterberg [that I know of]). Perhaps this explains why the VPRI’s efforts on cooperating constraint solvers are not informed by decomposition methods at all. Then again, it’s not just decomposition people who exchange too little with the CP community, but all of mathematical programming. I only single out decomposition because our tools are general enough to work with CP. We write and sometimes think in mixed integer linear programming (MILP) terms, but mostly for theoretical reasons; implementations may (useful ones usually do) include specialised solvers for combinatorial problems that only happen to be presented as MILP.

That’s why I spent three months in Nice: I explored how to cross-pollinate between CP and decomposition by developing a TSP solver (it’s a simple problem and Concorde is an impressive baseline). I have a lot to say about that, certainly more than can fit in this buffer. In this first post of many, I hope to communicate an intuition on why Dantzig-Wolfe and Benders decomposition work, to relate these methods to work from the SAT/CSP/CP community, and to show how to decompose finite domain problems.

# But first, duality

I noted earlier that we work in MILP terms because it gives a sound theoretical basis for decomposition methods. That basis is duality, especially linear programming duality.

In optimisation, duality refers to the relationship between two problems: a primal problem (P) and its dual (D). I’ll assume that we’re interested in solving (P), a minimisation problem (minimise badness, e.g., costs); (D) is then a maximisation problem (maximise goodness, e.g., profits). These problems are such that the minimal (least) value for (P) is always greater than or equal to the maximal (greatest) value for (D).

So, even if (P) and (D) are so hard that we can only solve them heuristically, a pair of feasible solutions always brackets the optimal value for the primal problem (P). $$\tilde{x}$$, the feasible solution for (P) might not be optimal, but it’s still a solution. In turn, $$\tilde{y}$$, the dual feasible solution, gives us a lower bound on the optimal value: we don’t know the optimal value, but we know it can’t be lower than that of any solution for (D). We can then compare the value of $$\tilde{x}$$ to this lower bound. If they’re close enough, we stop looking; otherwise we use both primal and dual information to refine our solutions and tighten the bracket (close the gap).

The primal solution is useful: it’s a feasible plan. However, the dual solution is often where we gain insights into the problem. For example, if I wish to minimise my carbon footprint, it’s clear that driving less will help me. However, what is the impact of allowing myself more commuting time? and how does that compare with simply shortening the distance between my home and work? That’s what dual solutions tell us: they estimate the impact of (the right-hand side of) each constraint on the objective value.

Linear programs are a special case: they have a “strong” duality, in that we can easily write down an exact dual program for any primal linear program. The dual is exact because its maximal value matches the minimal value of the primal; there is no duality gap. This dual is another linear program that’s exactly as large as the primal, and the dual of the dual is… the initial primal program.

Decomposition exploits this relationship to extract information about simpler subproblems, and translate that into a refined approximation of the complete problem. It would be disappointing if that only worked on linear programs, and this is where Lagrangian duality comes in.

You may remember Lagrange multipliers from calculus courses. That’s exactly what I’m talking about. For reasons both practical and theoretical, I’ll assume that we have a (sub)problem that’s simpler once a few linear (affine) constraints are removed, i.e. something like $$\min\sb{x} z(x)$$ subject to $$Ax \leq b,$$ $$x\in X.$$

This form is general; the objective function $$z$$ need only be convex and the feasible set $$X$$ is arbitrary.

Lagrangian duality says that we can relax this problem into $$\min\sb{x\in X} z(x) + \lambda(Ax-b),$$ where $$\lambda$$ is any vector of non-negative multipliers for each row (constraint) in $$A$$. I find such subproblems interesting because minimising over $$X$$ can be easy with a specialised procedure, but hard as a linear program. For example, $$X$$ could represent spanning trees for a graph. That’s how the one-tree relaxation for the TSP works. Once you remove an arbitrary “one-node” from a Hamiltonian cycle, the remaining Hamiltonian path is a special kind of spanning tree: the degree of all but two nodes is two, and the remaining two nodes are endpoints (would be connected to the one-node). The one-tree relaxation dualises (relaxes into the objective function) the constraint that the degree of each node be at most two, and solves the remaining minimum spanning tree problem for many values of $$\lambda$$.

The one-tree subproblem, like Lagrangian subproblems in general, is a relaxation: its optimal value is a lower bound on the optimal value of the initial problem.

To show this, we first notice that the feasible set of the subproblem is a superset of that of the initial problem: we only removed a set of linear constraints. Moreover, for any feasible solution $$\bar{x}$$ for the initial problem, $$z(\bar{x}) + \lambda(A\bar{x}-b) \leq z(\bar{x}):$$ $$A\bar{x} \leq b$$, i.e., $$A\bar{x}-b \leq 0$$, and $$\lambda\geq 0$$. In particular, that is true if $$\bar{x}$$ is an optimal solution to the initial problem. Obviously, an optimal solution to the relaxed subproblem must be at least as good as $$\bar{x}$$, and the optimal value of the subproblem is a lower bound for that of the initial problem.

This relation between optimal solutions for the initial problem and for the relaxed subproblem is true given any $$\lambda\geq 0$$. However, it’s often useless: bad multipliers (e.g., $$\lambda = 0$$) lead to trivially weak bounds. The Lagrangian dual problem with respect to $$Ax \leq b$$ is to maximise this lower bound: $$\max\sb{\lambda\geq 0} \min\sb{x\in X} F(\lambda, x),$$ where $$F(\lambda, x) = z(x) + \lambda(Ax-b) = z(x) + \lambda Ax - \lambda b.$$

Again, we don’t have to maximise this dual exactly. As long as the minimisation subproblem is solved to optimality, we have a set of lower bounds; we only have to consider the highest lower bound. Maximising this Lagrangian dual, given an oracle that solves the linearly penalised subproblem, is an interesting problem in itself that I won’t address in this post. The important bit is that it’s doable, if only approximately, even with very little storage and computing power.

There are some nifty theoretical results that tell us that, when $$z(x)$$ is a linear function, maximising the Lagrangian dual function is equivalent to solving the linear programming dual of $$\min z(x)$$ subject to $$Ax \leq b,$$ $$x\in \textrm{conv}(X),$$ where $$\textrm{conv}(X)$$ is the convex hull of $$X$$.

This is a relaxation (the convex hull is a superset of $$X$$), but a particularly tight one. For example, if $$X$$ is a finite set, optimising a linear function over $$\textrm{conv}(X)$$ is equivalent to doing so over $$X$$ itself. Sadly, this doesn’t hold anymore once we add the linear constraint $$Ax \leq b$$ after taking the convex hull, but it still gives an idea of how strong the relaxation can be.

One key detail here is that the optimality of $$x$$ depends on $$\lambda$$, $$z$$, $$X$$ and $$A$$, but not on $$b$$: once we have a pair $$(\lambda\sp{\star}, x\sp{\star})$$ such that $$x\sp{\star}$$ is minimal for $$F(\lambda\sp{\star},\cdot)$$, their objective value is always a valid lower bound for the initial problem, regardless of the right-hand side $$b$$. Thus, it makes sense to cache a bunch of multiplier-solution pairs: whenever $$b$$ changes, we only need to reevaluate their new objective value to obtain lower bounds for the modified problem (and interesting pairs can serve to warm start the search for better maximising multipliers).

# Intermission

## Why optimisation, when satisfaction is hard enough?

Because I’m a heartless capitalist. (I find it interesting how economists sometimes seem overly eager to map results based on strong duality to the [not obviously convex] world.)

Seriously, because optimisation composes better than satisfaction. For example, imagine that we only had a routine to enumerate elements of $$X$$. There’s not a lot we can do to optimise over $$X$$ (unless you seriously consider a British Museum search…), or even just to determine if the intersection of $$X$$ and $$Ax \leq b$$ is empty.

I’ve hinted at how we can instead exploit an optimisation routine over $$X$$ to optimise over an outer approximation of that intersection. It’s not exact, but it’s often close enough; when it isn’t, that approximation is still useful to guide branch-and-bound searches. A satisfaction problem is simply an optimisation problem with $$z(x) = 0$$. Whenever we have a relaxed solution with a strictly positive objective value, we can prune a section of the search space; at some point we either find a feasible solution or prune away all the search space. Thus, even for pure satisfaction problems, the objective function is useful: it replaces a binary indicator (perhaps feasible, definitely infeasible) with a continuous one (higher values for less clearly feasible parts of the search space).

I said optimisation composes better because decomposition also works for the intersection of sets defined by oracles. For example, we can satisfy over the intersection of finite sets $$X$$ and $$Y$$ by relaxing the linear constraint in $$x = y,$$ $$x\in X, y\in Y.$$ The relaxed subproblem is then $$\min\sb{(x,y)\in X\times Y} \lambda(x-y),$$ which we can separate (linear objectives are separable) into two optimisation subproblems: $$\min\sb{x\in X} \lambda x,$$ and $$\min\sb{y\in Y} -\lambda y.$$ Even if the initial problem is one of satisfaction, Lagrangian decomposition considers each finite set in the intersection through independent minimisation: components communicate via their objective functions.

## But I don’t deal with numbers

All this can also work with categorical data; we just have to do what no CP practitioner would ever do and map each choice to a binary (0/1) variable. When, in the constraint program, a variable $$x$$ can take values $$\{a, b, c\}$$, we express that in integer programs with a triplet of variables:

• $$x\sb{a} = 1$$ iff $$x=a$$ (0 otherwise),
• $$x\sb{b} = 1$$ iff $$x=b$$,
• $$x\sb{c} = 1$$ iff $$x=c$$,

and a constraint $$\sum\sb{i\in \{a,b,c\}} x\sb{i} = 1$$.

We can perform this mapping around each subproblem (e.g., optimisation over $$X$$). On entry, the coefficient for $$x\sb{a}$$ in the linear objective corresponds to the impact on the objective function of letting $$x=a$$. On exit, $$x=c$$ becomes $$x\sb{c}=1$$, $$x\sb{a}=0$$ and $$x\sb{b}=0$$, for example.
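When a variable appears in no other constraint, the subproblem oracle behind this mapping degenerates to an argmin over its domain; a minimal sketch of the entry/exit translation (pick_value is a hypothetical helper):

    #include <stddef.h>

    /* Entry/exit translation for one categorical variable with domain
     * {0, ..., nvalues-1}: cost[v] is the (multiplier-adjusted) objective
     * coefficient of the indicator x_v; the chosen value comes back out
     * as a 0/1 vector with a single 1. */
    static size_t pick_value(const double *cost, size_t nvalues, double *indicator)
    {
            size_t best = 0;

            for (size_t v = 1; v < nvalues; v++)
                    if (cost[v] < cost[best])
                            best = v;
            for (size_t v = 0; v < nvalues; v++)
                    indicator[v] = (v == best) ? 1.0 : 0.0;
            return best;
    }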

# Benders decomposition

Benders decomposition is the mathematical programming equivalent of clause learning in SAT solvers, or explanation-based search in CP.

Clause learning improves straight backtracking by extracting information from infeasible branches: what’s a small (minimal is NP-hard) set of assignments (e.g. $$x=\mathtt{false}$$ and $$y=\mathtt{false}$$) that causes this infeasibility? The problem is then updated to avoid this partial assignment.

Explanations in CP are similar: when a global constraint declares infeasibility, it can also provide an explanation, a small set of assignments that must be avoided. This, combined with explanations from other constraints, can trigger further propagation.

Benders decomposition generalises this feasibility-oriented view for the optimisation of problems with a hierarchical structure. I think the parallel has been known for a while, just not popularised. John Hooker proposed logic-based Benders to try and strengthen the link with logic programming, but I don’t know who (if anyone) else works in that direction.

The idea behind Benders decomposition is to partition the decision variables into two sets, strategic variables $$y$$, and tactical ones $$x$$. The decomposition attempts to determine good values for $$y$$ without spending too much time on $$x$$, which are left for the subproblem. This method is usually presented for mixed integer programs with continuous tactical ($$x$$) variables, but I’ll instead present the generalised form, which seems well suited to constraint programming.

We start with the initial integrated formulation $$\min\sb{(x,y)\in S} cx + fy,$$ which we reformulate as $$\min\sb{(x,y)} cx + fy,$$ $$y\in Y,$$ $$x\in X(y).$$

For example, this could be a network design problem: we wish to determine where to build links so that all (known) communication demands are satisfied at least total cost.

Benders decomposition then eliminates variables $$x$$ from the reformulation. The latter becomes the master problem $$\min\sb{y,z} fy + z$$ subject to $$Ay = d,$$ $$By \leq z,$$ $$y\in Y.$$

The two linear constraints are generated dynamically, by solving subproblems over $$X$$. The first corresponds directly to learned clauses or failure explanations: whenever an assignment for $$y$$ happens to be infeasible, we add a linear constraint to avoid it (and similar assignments) in the future. The second enables optimisation over $$S$$.

The idea is to solve a Lagrangian relaxation of the (slave) subproblem for a given assignment $$y\sp{i}$$: $$\min\sb{x,y} cx$$ subject to $$y = y\sp{i},$$ $$x\in X(y).$$

We will relax the first equality constraint to obtain a new feasible set that is a relaxation (outer approximation) of $$S$$: it only considers interactions within $$X$$ and between $$X$$ and $$Y$$, but not within $$Y$$ (those are handled in the master problem).

In my network design example, the unrelaxed subproblem would be a lot of independent shortest path problems (over the edges that are open in $$y\sp{i}$$).

We dualise the first constraint and obtain the Lagrangian dual $$\max\sb{\lambda} \min\sb{y\in\mathbb{R}\sp{m}, x\in X(y)} cx + \lambda(y-y\sp{i})$$ or, equivalently, $$\max\sb{\lambda} -\lambda y\sp{i} + \min\sb{y\in\mathbb{R}\sp{m}, x\in X(y)} cx + \lambda y.$$

The network design subproblem still reduces to a lot of independent shortest path problems, but (with appropriate duplication of decision variables) forbidden edges may be taken, at additional cost per flow.

In this example, the previous step is gratuitously convoluted: we can solve the unrelaxed subproblem and shortest paths can be seen as linear programs, so there are other ways to recover optimal multipliers directly. However, this is useful for more complicated subproblems, particularly when the $$x = y\sp{i}$$ constraint is difficult to handle. Note that the slave can be reformulated (e.g., with multiple clones of $$y\sp{i}$$) to simplify the Lagrangian subproblem.

I pointed out earlier that, given $$\bar{\lambda}\sp{i}$$ and an optimal solution $$(\bar{x}\sp{i},\bar{y}\sp{i})$$ for that vector of multipliers, we always have a valid lower bound, regardless of the right-hand side, i.e., regardless of the assignment $$y\sp{i}$$.

We found a first (hopefully feasible) solution by solving a restricted subproblem in the previous step. The multipliers for our dualised constraint $$y = y\sp{i}$$ explain the value associated with $$y\sp{i}$$: what part of $$y\sp{i}$$ makes the solution to the restricted subproblem so bad (or good).

That’s how we dynamically constrain $$z$$. Each time we solve the (updated) master problem, we get a tentative assignment for the strategic variables, and a lower bound for the initial problem. We can then solve for $$x\in X(y\sp{i})$$ to find a heuristic solution (or generate a feasibility cut to tell the master why the assignment is infeasible); if the heuristic solution is close enough to our current lower bound, we can stop the search. Otherwise, we solve the fixed Lagrangian dual and generate a fresh optimality cut $$-\bar{\lambda}\sp{i}y + c\sp{i} \leq z,$$ where $$c\sp{i} = c\bar{x}\sp{i} + \bar{\lambda}\sp{i}\bar{y}\sp{i}$$: we update the master problem with information about why its last tentative assignment isn’t as good as it looked.

It can be harder to explain infeasibility than to solve the Lagrangian subproblem. In that case, it suffices to have an upper bound on the optimal value (e.g., 0 for pure feasibility problems): when the lower bound is greater than the upper bound, the problem is infeasible. Given that the master problem also tries to minimise its total cost, it will avoid solutions that are infeasible if only because their value is too high. The advantage of feasibility cuts is that, with some thinking, a single feasibility cut can forbid a lot more solutions than multiple optimality cuts.

The problem with Benders decomposition is that the master mixed integer program gains new constraints at each iteration. Eventually, its size becomes a liability.

There are a few folk workarounds. One is to not stop (and then restart) the branch-and-bound search as soon as it finds an optimal (for the current relaxed master problem) integer feasible assignment for $$y$$: instead, add a new constraint to the formulation and resume the search (previously computed lower bounds remain valid when we add constraints). Another is to drop constraints that have been inactive for a long while (much like cache replacement strategies); they can always be added back if they ever would be active again.

Chvátal’s Resolution Search takes a different tack: it imposes a structure on the clauses it learns to make sure they can be represented compactly. The master problem is completely different and doesn’t depend on solving mixed integer programs anymore. It works well on problems where the difficulty lies in satisfying the constraints more than finding an optimal solution. A close friend of mine worked on extensions of the method, and it’s not yet clear to me that it’s widely applicable… but there’s hope.

Another issue is that the Benders lower bound is only exact if the slave subproblem is a linear program. In the general case, the master problem is still useful, as a lower bounding method and as a fancy heuristic to guide our search, but it may be necessary to also branch on $$x$$ to close the gap.

# Lagrangian decomposition

I believe Lagrangian decomposition, a special case of Lagrangian relaxation, is a more promising approach to improve constraint programming with mathematical programming ideas: I think it works better with the split between constraint propagation and search. The decomposition scheme helps constraints communicate better than via only domain reductions: adjustments to the objective functions pass around partial information, e.g., “$$x$$ is likely not equal to $$a$$ in feasible/optimal solutions.” Caveat lector, after 7 years of hard work, I’m probably biased (Stockholm syndrome and all that ;).

I already presented the basic idea. We take an initial integrated problem $$\min\sb{x\in X\cap Y} cx$$ and reformulate it as $$\min\sb{(x,y)\in X\times Y} cx$$ subject to $$x = y.$$ Of course, this works for any finite number of sets in the intersection as well as for partial intersections, e.g., $$x\sb{0} = y\sb{0}$$ and $$x$$ and $$y$$ otherwise independent.

Dualising the linear constraint leaves the independent subproblems $$\min\sb{x\in X} cx+\lambda x,$$ and $$\min\sb{y\in Y} -\lambda y.$$

The two subproblems communicate indirectly, through adjustments to $$\lambda$$… and updating Lagrange multipliers is a fascinating topic in itself that I’ll leave for a later post (: The basic idea is that, if $$x\sb{i} = 1$$ and $$y\sb{i} = 0$$, we increase the associated multiplier $$\lambda\sb{i}$$, and decrease it in the opposite situation: the new multipliers make the current incoherence less attractive. For a first implementation, subgradient methods (e.g., the volume algorithm) may be interesting because they’re so simple. In my experience, however, bundle methods work better for non-trivial subproblems.
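The simplest such update is a plain subgradient step on the dualised constraint $$x = y$$; a sketch (step-size selection is the part I’m glossing over):

    #include <stddef.h>

    /* One subgradient step for the multipliers of the dualised x = y:
     * when x_i > y_i, lambda_i grows, which makes that incoherence more
     * expensive in the x subproblem and makes y_i = 1 more attractive in
     * the y subproblem; the opposite happens when x_i < y_i. */
    static void update_multipliers(double *lambda, const double *x,
                                   const double *y, size_t n, double step)
    {
            for (size_t i = 0; i < n; i++)
                    lambda[i] += step * (x[i] - y[i]);
    }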

When we have an upper bound on the optimal value (e.g., 0 for pure satisfaction problems), each new set of multipliers $$\lambda$$ can also trigger more pruning: if forcing $$x = a$$ means that the lower bound becomes higher than the known upper bound, we must have $$x \neq a$$. The challenge becomes finding specialised ways to perform such pruning (based on reduced costs) efficiently, without evaluating each possibility.

Lagrangian decomposition also gives us a fractional solution to go with our lower bound. Thus, we may guide branching decisions with fractional variables, rather than only domains. Again, an optimisation-oriented view augments the classic discrete indicators of CP (this variable may or may not take that value) with continuous ones (this assignment is unattractive, or that variable is split .5/.5 between these two values).

# Hybrid mathematical/constraint programming

I believe decomposition methods are useful for constraint programming at two levels: to enable more communication between constraints, and then to guide branching choices and heuristics.

In classic constraint programming, constraints communicate by reducing the domain of decision variables. For example, given $$x \leq y$$, any change to the upper bound for $$y$$ (e.g., $$y\in [0, 5]$$) means that we can tighten the upper bound of $$x$$: $$x \leq y \leq 5$$. This can in turn trigger more domain reduction.

Some constraints also come in “weighted” versions, e.g., $$x\sb{i}$$ form a Hamiltonian cycle of weight (cost) at most 5, but these constraints still only communicate through the domain of shared decision variables.

With Lagrangian decomposition, we generalise domain reduction through optimisation and reduced cost oracles. For each constraint, we ask, in isolation,

• to the optimisation oracle: given this linear objective function for the variables that appear in the constraint, what’s an optimal solution?
• to the reduced cost oracle: given this objective function and optimal solution, what’s the impact (or a lower estimate of that impact, a reduced cost) on the objective value of forcing each variable to a different value?

Adding the objective values computed by the optimisation oracle gives us a lower bound. If the lower bound is greater than some upper bound, the problem (search node) is infeasible. Otherwise, we can use reduced costs to prune possibilities.

Let’s say the difference between our current lower and upper bounds is $$\Delta$$. We prune possibilities by determining whether setting $$x=a$$ increases the lower bound by more than $$\Delta$$. We can do this independently for each constraint, or we can go for additive bounding. Additive bounding computes a (lower) estimate for the impact of each assignment on the objective function, and sums these reduced costs for all constraints. The result is a set of valid lower estimates for all constraints simultaneously. We can then scan these reduced costs and prune away all possibilities that lead to an increase that’s greater than $$\Delta$$.
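A sketch of that pruning pass for a single variable’s domain (the names are mine; reduced_costs[k][v] is constraint k’s lower estimate of the cost of forcing the assignment to value v):

    #include <stdbool.h>
    #include <stddef.h>

    /* Additive bounding: summing each constraint's reduced cost for the
     * same assignment is still a valid lower estimate of the increase in
     * the lower bound, so any value whose total exceeds the gap
     * (upper bound - lower bound) can be pruned from the domain. */
    static void prune_domain(const double *const *reduced_costs,
                             size_t nconstraints, bool *value_allowed,
                             size_t nvalues, double gap)
    {
            for (size_t v = 0; v < nvalues; v++) {
                    double total = 0;

                    for (size_t k = 0; k < nconstraints; k++)
                            total += reduced_costs[k][v];
                    if (total > gap)
                            value_allowed[v] = false;
            }
    }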

The optimisation oracle must be exact, but the reduced cost oracle can be trivially weak (in the extreme, reduced costs are always 0) without affecting correctness. Domain reduction propagators are simply a special case of reduced cost oracles: infeasible solutions have an arbitrarily bad objective value.

Some constraints also come with explanations: whenever they declare infeasibility, they can also generate a small assignment to avoid in the future. The mathematical programming equivalent is cut generation to eliminate fractional solutions. Some very well understood problems (constraints) come with specialised cuts that eliminate fractional solutions, which seems useful in hybrid solvers. For example, in my work on the TSP, I added the (redundant) constraint that Hamiltonian cycles must be 2-matchings. Whenever the Lagrangian dual problem was solved fairly accurately, I stopped to compute a fractional solution (for a convexification of the discrete feasible set) that approximately met the current lower bound. I then used that fractional solution to generate new constraints that forbid fractional 2-matchings, including the current fractional solution. These constraints increased the lower bound the next time I maximised the Lagrangian dual, and thus caused additional reduced cost pruning.

The same fractional solution I used to generate integrality cuts can also guide branching. There’s a lot of interesting literature on branching choices, in CP as well as in mathematical programming. Lagrangian decomposition, by computing fractional solutions and lower bounds, lets us choose from both worlds. Again, the lower bound is useful even for pure feasibility problems: they are simply problems for which we have a trivial upper bound of 0, but relaxations may be lower. Branching only on fractional decision variables also lets us cope better with weak domain propagators: the relaxation will naturally avoid assignments that are infeasible for individual constraints… we just can’t tell the difference between a choice that’s infeasible or simply uninteresting because of the objective function.

# Automated CP decomposition

This is a lot more tentative than everything above, but I believe that we can generate decompositions automatically by studying (hyper)tree decompositions for our primal constraint graphs. Problems with a small (hyper)treewidth are amenable to quick dynamic programming, if we explore the search space in the order defined by the tree decomposition: a small treewidth means that each subtree depends on a few memoisation keys, and there are thus few possibilities to consider.

For example, I’m pretty sure that the fun poly-time dynamic programming algorithms in the Squiggol book all correspond to problems with a bounded tree-width for the constraint graph.

In general though, problems don’t have a small treewidth. This is where mathematical decomposition methods come in. Benders decomposition could move a few key linking variables to the master problem; once they are taken out of the slave subproblem, the fixed Lagrangian subproblem has a small treewidth.

However, I already wrote that Lagrangian decomposition seems like a better fit to me. In that case, we find a few variables such that removing them from the primal graph leaves a small tree decomposition. Each subproblem gets its own clones of these variables, and Lagrangian decomposition manipulates the objective function to make the clones agree with one another.

With this approach, we could also automatically generate dynamic programming solvers for Lagrangian subproblems… and it might even be practical to do the same for reduced cost oracles.

Going to hypertree-width improves our support for global constraints (e.g., alldiff) or constraints that otherwise affect many variables directly. Global constraints should probably get their own decomposition component, to easily exploit their pruning routines, but not necessarily large ad hoc ones. Hypertree decomposition takes the latter point into account, that a large clique created by a single constraint is simpler to handle than one that corresponds to many smaller (e.g., pairwise) constraints.

# What’s next?

I don’t know. I won’t have a lot of time to work on this stuff now that I have a real job. I’ve decided to dedicate what little time I have to a solver for Lagrangian master problems, because even the research community has a bad handle on that.

For the rest, I’m mostly unexcited about work that attempts to explore the search space as quickly as possible. I believe we should instead reify the constraint graph as much as possible to expose the structure of each instance to analyses and propagators, and thus reduce our reliance on search.

As a first step, I’d look into a simple system that only does table constraints (perhaps represented as streams of feasible tuples). A bucket heuristic for tree decomposition may suffice to get decent dynamic programming orders and efficiently optimise over the intersection of these table constraints. (Yes, this is highly related to query optimisation; constraint programming and relational joins have much in common.)

After that, I’d probably play with a few externally specified decomposition schemes, then add support for global constraints, and finally automatic decomposition.

All in all, this looks like more than a thesis’s worth of work…

# So You Want to Write an LP Solver (a.k.a. My Little LP: Cholesky Is Magic)

In which a Lisp hacker exhibits his ignorance of applied mathematics ;)

Linear programming (LP) is awesome. I based my PhD work on solving LPs with millions of variables and constraints… instead of integer programs a couple orders of magnitude smaller. However, writing LP solvers is an art that should probably be left to experts, or to those willing to dedicate a couple years to becoming one.

That being said, if one must write their own, an interior point method (IPM) is probably the best approach. This post walks through the development of a primal affine scaling method; it doesn’t guarantee state-of-the-art practical or theoretical efficiency (it’s not even polynomial time), but is simple and easy to understand.

I think interior point methods suffer from an undeserved bad reputation: they only seem complicated compared to the nice combinatorial Simplex method. That would explain why I regularly (once or twice a year) see hackers implement a (classic!) simplex method, but rarely IPMs.

The problem is that simplex only works because world-class implementations are marvels of engineering: it’s almost like their performance is in spite of the algorithm, only thanks to coder-centuries of effort. Worse, textbooks usually present the classic simplex method (the dark force behind simplex tableaux) so that’s what unsuspecting hackers implement. Pretty much no one uses that method: it’s slow and doesn’t really exploit sparsity (the fact that each constraint tends to only involve a few variables). Avoiding the trap of the classic simplex is only the first step. Factorising the basis — and updating the factors — is essential for efficient revised simplex methods, and some clever tricks really help in practice (packages like LUSOL mostly take care of that); pivot selection (pricing) is fiddly but has a huge impact on the number of iterations — never mind degeneracy, which complicates everything; and all that must be balanced with minimising work per iteration (which explains the edge of dual simplex: there’s no primal analogue to devex normalised dual steepest edge pricing). There’s also presolving and crash basis construction, which can simplify a problem so much that it’s solved without any simplex pivot. Despite all this sophistication, only recent theoretical advances protect against instances going exponential.

Sidenote: Vasek Chvatal’s Linear Programming takes a view that’s closer to the revised simplex algorithm and seems more enlightening to me. Also, it’s been a while since I read about pricing in the simplex, but the dual simplex definitely has a computational edge for obscure reasons; this dissertation might have interesting details.

There’s also the learning angle. My perspective is biased (I’ve been swimming in this for 5+ years), but I’d say that implementing IPMs teaches a lot more optimisation theory than implementing simplex methods. The latter are mostly engineering hacks, and relying on tableaux or bases to understand duality is a liability for large scale optimisation.

I hope to show that IPMs are the better choice, with respect to both the performance of naïve implementations and the insights gained by coding them, by deriving a decent implementation from an intuitive starting point.

# Affine scaling in 30 minutes

I’ll solve linear problems of the form $$\min\sb{x} cx$$ subject to $$Ax = b$$ $$l \leq x \leq u.$$

I will assume that we already have a feasible point that is strictly inside the box $$[l, u]$$ and that A has full row rank. Full rank doesn’t always pan out in practice, but that’s a topic for another day.

If it weren’t for the box constraint on x, we would be minimising a smooth function subject to affine equalities. We could then easily find a feasible descent direction by projecting $$-c$$ (the opposite of the gradient) in the nullspace of A, i.e., by solving the linear equality-constrained least squares $$\min\sb{d} \|(-c)-d\|\sb{2}$$ subject to $$Ad = 0,$$ e.g. with LAPACK’s xGGLSE.

The problem with this direction is that it doesn’t take into account the box constraints. Once some component of x is at its bound, we can’t move past that wall, so we should try to avoid getting too close to any bound.

Interior point methods add a logarithmic penalty on the distance between x and its bounds; (the gradient of) the objective function then reflects how close each component of x is to its lower or upper bound. As long as x isn’t directly on the border of the box, we’ll be able to make a non-trivial step forward without getting too close to that border.

There’s a similar interpretation for primal affine scaling methods, but I prefer the original explanation. These methods rescale the space in which we solve for d so that x is always far from its bounds; rescaling the direction back in the original space means that we’ll tend to move more quickly along coordinates that are far from their bounds, and less for those close to either bound. As long as we’re strictly inside the box, the transformation is meaningful and invertible.

It’s a primal method because we only manipulate primal variables; dual methods instead work with the dual variables associated with the linear constraints.

This sketch might be useful. x is close to the left-hand side bound, and, after projection in the nullspace, d quickly hits the border. With scaling, d is more vertical, and this carries over to the unscaled direction.

The descent direction may naturally guide us away from a close-by bound; scaling is then useless. However, each jump away from that bound makes the next longer, and such situations resolve themselves after a couple iterations.

Formally, let $$s = \min\{x-l, u-x\} > 0$$ be the slack vector (computed element wise) for the distance from x to its bounds. We’ll just scale d and c elementwise by S (S is the diagonal matrix with the elements of s on its diagonal). The scaled least squares system is $$\min\sb{d\sb{s}} \|(-Sc)-d\sb{s}\|$$ subject to $$ASd\sb{s} = 0.$$

Now that we have a descent direction $$d=Sd\sb{s}$$, we only have to decide how far along to go in that direction. We determine the limit, r, with a ratio test. For each coordinate, look at $$d\sb{i}$$: if it’s zero, this coordinate doesn’t affect the limit; if it’s negative, the limit must be at most $$(l-x)\sb{i}/d\sb{i}$$; if it’s positive, the limit is at most $$(u-x)\sb{i}/d\sb{i}$$. We then take the minimum of these ratios.

If the minimum doesn’t exist (i.e., $$r=\infty$$), the problem is unbounded.

Otherwise, we could go up to $$x+rd$$ to improve the solution while maintaining feasibility. However, we want to stay strictly inside the feasible space: once we hit a bound, we’re stuck there. That’s why we instead take a slightly smaller step, e.g., $$0.9r$$.
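In code, the ratio test is a single pass over the coordinates (a sketch; ratio_test is an illustrative name); the caller then moves by 0.9r along d when r is finite:

    #include <math.h>
    #include <stddef.h>

    /* Largest step r such that x + r*d stays inside [l, u]; INFINITY
     * means no coordinate ever hits a bound (the problem is unbounded). */
    static double ratio_test(const double *x, const double *d,
                             const double *l, const double *u, size_t n)
    {
            double r = INFINITY;

            for (size_t i = 0; i < n; i++) {
                    double limit;

                    if (d[i] > 0)
                            limit = (u[i] - x[i]) / d[i];
                    else if (d[i] < 0)
                            limit = (l[i] - x[i]) / d[i];
                    else
                            continue;  /* this coordinate never limits the step */
                    if (limit < r)
                            r = limit;
            }
            return r;
    }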

This drawing shows what’s going on: given d, r takes us exactly to the edge, so we take a slightly shorter step. The new solution is still strictly feasible, but has a better objective value than the previous one. In this case, we’re lucky and the new iterate $$x\sp{\prime}$$ is more isotropic than the previous one; usually, we slide closer to the edge, only less so than without scaling.

We then solve the constrained least squares system again, with a new value for S.

## Finding a feasible solution

I assumed that x is a strictly feasible (i.e., interior) solution: it satisfies the equality constraint $$Ax=b$$ and is strictly inside its box $$[l, u]$$. That’s not always easy to achieve; in theory, finding a feasible solution is as hard as optimising a linear program.

I’ll now assume that x is strictly inside its box and repair it toward feasibility, again with a scaled least squares solution. This time, we’re looking for the least norm (the system is underdetermined) solution of $$Ad = (b - Ax).$$

We rescale the norm to penalise movements in coordinates that are already close to their bound by instead solving $$ASd\sb{s} = (b - Ax),$$ for example with xGELSD.

If we move in direction $$Sd\sb{s}$$ with step $$r=1$$, we’ll satisfy the equality exactly. Again, we must also take the box into account, and perform a ratio test to determine the step size; if $$0.9r > 1$$, we instead set $$r=1/0.9$$ and the result is a strictly feasible solution.

## An initial implementation

I uploaded some hacktastic CL to github. The initial affine scaling method corresponds to this commit. The outer loop looks at the residual $$Ax-b$$ to determine whether to run a repair (feasibility) or optimisation iteration.

The code depends on a few libraries for the MPS parser and on matlisp, and isn’t even ASDF-loadable; I just saved and committed whatever I defined in the REPL.

This commit depends on a patch to matlisp’s src/lapack.lisp (and to packages.lisp to export lapack:dgglse).

MPS is a very old (punchcard-oriented) format to describe linear and mixed integer programs. The parser mostly works (the format isn’t very well specified), and standard-form.lisp converts everything to our standard form with equality constraints and a box around decision variables.

I will test the code on a couple LPs from netlib, a classic instance set (I think the tarball in CUTEr is OK).

AFIRO is a tiny LP (32 variables by 28 constraints, 88 nonzeros). Good news: the program works on this trivial instance and finds the optimum (the relative difference with the reference value is 7e-6). The first four “repair” iterations find a feasible solution, and 11 more iterations eventually get to the optimum.

I also tested on ADLITTLE, another tiny instance (97x57, 465 nz): 48 iterations, 0.37 seconds total.

AGG is more interesting, at 163x489 and 2541 nz; that one took 156 iterations and 80 seconds (130% CPU, thanks to Apple’s parallel BLAS). Finally, FIT1D is challenging, at 1026x25 and 14430 nz; that one took 100 seconds and the final objective value was off .5%.

All these instances are solved in a fraction of a second by state-of-the-art solvers. The next steps will get us closer to a decent implementation.

# First, simplify the linear algebra

The initial implementation resorted to DGGLSE for all least squares solves. That function is much too general.

I changed the repair least squares to a GELSY solve (GELSD isn’t nicely wrapped in matlisp yet).

We can do the same for the optimisation constrained least squares. Some doodling around with Lagrange multipliers shows that the solution of $$\min\sb{x} \|x-c\|\sb{2}\sp{2}$$ subject to $$Ax = 0$$ is $$x = c - A\sp{t}(AA\sp{t})\sp{-1}Ac;$$ since we actually project $$-c$$ to get the descent direction, that direction is $$d = A\sp{t}(AA\sp{t})\sp{-1}Ac - c.$$

This is a series of matrix-vector multiplications and a single linear solve, which I first perform with GELSY: when we’re nearly optimal, the system is really badly conditioned and regular linear solves just crash.

Both linear algebra microoptimisations are in this commit.

ADLITTLE becomes about five times faster, and FIT1D now runs in 2.9 seconds (instead of 100), but AGG eventually fails to make progress: we still have problems with numerical stability.

# Now, the magic sauce

Our numerical issues aren’t surprising: we work with a scaled matrix $$AS$$, where the diagonal of $$S$$ reflects how close $$x$$ is to its bounds. As we get closer to an optimal solution, some of $$x$$ will converge to the bounds, and some to a value in-between: the limit is a basic solution. We then take this badly conditioned matrix and multiply it by its own transpose, which squares the condition number! It’s a wonder that we can get any useful bit from our linear solves.

There’s a way out: the normal matrix $$AA\sp{t}$$ is not only symmetric, but also positive (semi)definite (the full rank assumption isn’t always satisfied in practice). This means we can go for a Cholesky $$LL\sp{t}$$ (or $$LDL\sp{t}$$) factorisation.

We’re lucky: it turns out that Cholesky factorisation reacts to our badly conditioned matrix in a way that is safe for Newton steps in IPMs (for reasons I totally don’t understand). It also works for us. I guess the intuition is that the bad conditioning stems from our scaling term. When we go back in the unscaled space, rounding errors have accumulated exactly in the components that are reduced into nothingness.

I used Cholesky factorisation with LAPACK’s DPOTRF and DPOTRS for the optimisation projection, and for repair iterations as well.
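In C (rather than matlisp), that dense projection with DPOTRF/DPOTRS boils down to the following sketch; project_direction is an illustrative name, and in the solver the arguments would be the scaled $$AS$$ and $$Sc$$:

    #include <stdlib.h>
    #include <string.h>
    #include <cblas.h>
    #include <lapacke.h>

    /* Project -c onto the nullspace of the m x n row-major matrix A:
     *   d = A'(AA')^{-1} A c - c,
     * factoring the normal matrix AA' with Cholesky.  Returns DPOTRF's
     * status: nonzero means the matrix wasn't numerically positive
     * definite, and the caller should fall back to something like GELSY. */
    static int project_direction(const double *A, const double *c, double *d,
                                 int m, int n)
    {
            double *normal = malloc((size_t)m * m * sizeof(double));
            double *w = malloc((size_t)m * sizeof(double));
            int status;

            /* normal = A A', w = A c. */
            cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasTrans, m, m, n,
                        1.0, A, n, A, n, 0.0, normal, m);
            cblas_dgemv(CblasRowMajor, CblasNoTrans, m, n, 1.0, A, n, c, 1,
                        0.0, w, 1);
            /* Solve (A A') w = A c in place via Cholesky. */
            status = LAPACKE_dpotrf(LAPACK_ROW_MAJOR, 'L', m, normal, m);
            if (status == 0)
                    status = LAPACKE_dpotrs(LAPACK_ROW_MAJOR, 'L', m, 1,
                                            normal, m, w, 1);
            if (status == 0) {
                    /* d = A' w - c. */
                    memcpy(d, c, (size_t)n * sizeof(double));
                    cblas_dgemv(CblasRowMajor, CblasTrans, m, n, 1.0, A, n,
                                w, 1, -1.0, d, 1);
            }
            free(w);
            free(normal);
            return status;
    }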

LAPACK’s factorisation may fail on semidefinite matrices; I call out to GELSY when this happens.

To my eyes, this yields the first decent commit.

The result terminates even closer to optimum, more quickly. AGG stops after 168 iterations, in 26 seconds, and everything else is slightly quicker than before.

# Icing on the cake

I think the code is finally usable: the key was to reduce everything to PD (positive definite) solves and Cholesky factorisation. So far, we’ve been doing dense linear algebra with BLAS and LAPACK and benefitted from our platform’s tuned and parallel code. For small (up to a hundred or so constraints and variables) or dense instances, this is a good simple implementation.

We’ll now get a 10-100x speedup on practical instances. I already noted, en passant, that linear programs in the wild tend to be very sparse. This is certainly true of our four test instances: their nonzero density is around 1% or less. Larger instances tend to be even sparser.

There’s been a lot of work on sparse PD solvers. Direct sparse linear solving is another complex area that I don’t think should be approached lightly: expensive supercomputers have been solving sparse PD linear systems for a couple decades, and there’s some really crazy code around. I’ve read reports that, with appropriate blocking and vectorisation, a sparse matrix can be factored to disk at 80% peak FLOP/s. If you’re into that, I’m told there are nice introductions, like Tim Davis’s book, which covers a didactic yet useful implementation of sparse Cholesky in 2000 LOC.

I decided to go with CHOLMOD from SuiteSparse, a mixed GPL/LGPL library. CHOLMOD implements state of the art methods: when all its dependencies are available, it exposes parallelisation and vectorisation opportunities to the BLAS (thanks to permutations computed, e.g., by METIS) and exploits Intel’s TBB and Nvidia’s CUDA. It’s also well designed for embedding in other programs; for example, it won’t abort(3) on error (I’m looking at you, LLVM), and includes data checking/printing routines that help detect FFI issues. Its interface even helps you find memory leaks! Overall, it’s a really refreshing experience compared to other academic code, and I wish all libraries were this well thought out.

I built only the barest version on my Mac and linked it with a wrapper to help my bindings. I had to build the dynamic library with -Wl,-all_load on OS X to preserve symbols that only appear in archives; -Wl,--whole-archive should do it for gnu ld. (I also bound a lot of functions that I don’t use: I was paying more attention to the TV at the time.)

CHOLMOD includes functions to simultaneously multiply a sparse matrix with its transpose and factorise the result. Function solve-dense shows how I use it to solve a dense PD system. It’s a three-step process:

1. Analyse the nonzero pattern of the constraint matrix to determine an elimination order (trivial for a dense matrix);
2. Compute the normal matrix and factor it according to the analysis stored in the factor struct;
3. Solve for that factorisation.
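A minimal sketch of those three steps with CHOLMOD’s C API (error handling elided; the cholmod_common is assumed to have been set up with cholmod_start):

    #include <cholmod.h>

    /* Solve (A A') x = b, where A is a cholmod_sparse with stype == 0:
     * for such matrices, cholmod_factorize computes and factors A A'. */
    static cholmod_dense *solve_normal_system(cholmod_sparse *A, cholmod_dense *b,
                                              cholmod_common *common)
    {
            /* 1. Analyse the nonzero pattern to pick an elimination order. */
            cholmod_factor *L = cholmod_analyze(A, common);
            /* 2. Compute the normal matrix and factor it into L. */
            cholmod_factorize(A, L, common);
            /* 3. Solve with that factorisation. */
            cholmod_dense *x = cholmod_solve(CHOLMOD_A, L, b, common);

            cholmod_free_factor(&L, common);
            return x;
    }

The analysis in the first step only depends on the nonzero pattern, which is what the reuse described under “Finishing touches” below exploits: analyse once, then refactor and solve at every iteration.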

This is really stupid: I’m not even dropping zero entries in the dense matrix. Yet, it suffices to speed up AGG to 14 seconds!

It’s clear that we need to preserve the constraint matrix A in a sparse format. CHOLMOD has a function to translate from the trivial triplet representation (one vector of row indices, another of column indices, and another of values) to the more widely used compressed representation. Instances in our standard form already represent the constraint matrix as a vector of triplets. We only have to copy from CL to a CHOLMOD triplet struct to exploit sparsity in Cholesky solves and in matrix-vector multiplications.
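The copy itself is straightforward; a sketch (triplets_to_sparse is my name for it), assuming the triplets have already been pulled out of CL into plain C arrays:

    #include <cholmod.h>

    /* Copy an m x n constraint matrix given as (row, column, value)
     * triplets into CHOLMOD's compressed-column format. */
    static cholmod_sparse *triplets_to_sparse(const int *rows, const int *cols,
                                              const double *vals, size_t nz,
                                              size_t m, size_t n,
                                              cholmod_common *common)
    {
            cholmod_triplet *T = cholmod_allocate_triplet(m, n, nz,
                                                          0 /* unsymmetric */,
                                                          CHOLMOD_REAL, common);
            int *Ti = T->i, *Tj = T->j;
            double *Tx = T->x;

            for (size_t k = 0; k < nz; k++) {
                    Ti[k] = rows[k];
                    Tj[k] = cols[k];
                    Tx[k] = vals[k];
            }
            T->nnz = nz;

            cholmod_sparse *A = cholmod_triplet_to_sparse(T, nz, common);
            cholmod_free_triplet(&T, common);
            return A;
    }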

We now solve AGG in 0.83 seconds and FIT1D in 0.95. I think we can expect runtimes of one second or less for instances up to ~200x200. Better, we can finally hope to solve respectable LPs, like FIT2D (10500x26, 138018 nz, 2.3 s) or FIT1P (1677x628, 10894 nz, 14 s).

# Finishing touches

Our normal matrix $$ASS\sp{t}A\sp{t}$$ always has the same pattern (nonzero locations): we only change the scaling diagonals in the middle. Sparse solvers separate the analysis and factorisation steps for exactly such situations. When we solve a lot of systems with the same pattern, it makes sense to spend a lot of time on a one-time analysis that we then reuse at each iteration: fancy analysis routines generate factors that take up less space and need fewer FLOPs to build.

I do that here. Our less tiny instances are almost twice as fast. We solve FIT1P in 8 seconds now, and even FIT2P (13525x3001, 60784 nz) in 200 seconds.

I then made some microoptimisations to reuse storage.

Finally, I added steps that push the current solution away from close-by bounds. These centering steps help subsequent iterations make longer jumps toward optimality.

The last two changes shave a couple more seconds off large instances and get closer to optimality on nastier ones.

# Not quite the end

I hope I managed to communicate the intuition behind primal affine scaling methods, and that the walkthrough helped map that intuition to a sparse implementation. I also realise that the code isn’t pretty: I wrote it during a short marathon and tried to only make incremental changes. Still, the algorithm should be more or less usable for small instances; more than a naïve simplex method, anyway.

That’s particularly true on multicore machines. Parallelising simplex methods has been an area of slow research for a couple decades; the best work I’ve seen so far takes a huge hit by reverting to the classic algorithm and hopes that parallelisation can compensate for that initial 10-1000x slowdown. In contrast, even our sparse method is already parallel: CHOLMOD automatically vectorises, parallelises and offloads work to the GPU.

I’ll try to code a sparse primal-dual affine scaling method from scratch soon. Primal-dual methods usually work better than pure primal (or dual) methods and I find their theory interesting (if a bit complicated).

If you liked this post, you might be interested in Stephen Boyd’s Convex optimisation course. He’s offering it online this winter, starting January 21st.