#5: Applications of Grovercomp
There are several game-changing problems that become solvable with this.
Let’s go through a few applications that Grovercomp should be well-suited for.
Chip simulation
Designing the Manitowoc chips that will be inside Anodyne and verifying properties about performance, efficiency, and programmability will require a significant amount of computing power and memory. Ideally, we want to simulate the processor close to or at the gate level, as this should allow a nice compromise between high simulation precision and usability on our end.
Something akin to vectorized bit matrices should be sufficient to emulate digital circuits fairly efficiently, and the dual 512-bit vector engines used in the Xeon Phis should be very effective for this. This is one reason why 1-bit precision matters a lot for Grovercomp; each bit-operation in sim can correspond to a single logic gate in the chip.
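As a rough illustration of that bit-parallel trick (the full adder below is just a toy example of mine, not a piece of Manitowoc): if every signal in the simulated circuit is stored as a machine word, then each bitwise instruction evaluates that gate for as many independent simulation lanes as the word has bits.

```cpp
// Toy bit-parallel gate simulation: each uint64_t holds one circuit signal across
// 64 independent lanes (e.g. 64 test vectors), so every bitwise operator below
// stands in for 64 copies of a single logic gate evaluated in one instruction.
#include <cstdint>

struct FullAdderOut { uint64_t sum, carry; };

FullAdderOut full_adder(uint64_t a, uint64_t b, uint64_t cin) {
    uint64_t a_xor_b = a ^ b;                       // 64 XOR gates
    uint64_t sum     = a_xor_b ^ cin;               // 64 XOR gates
    uint64_t carry   = (a & b) | (a_xor_b & cin);   // 2x64 AND gates + 64 OR gates
    return {sum, carry};
}
```

The same code written against 512-bit vectors widens each of those operations from 64 lanes to 512, which is exactly what the Phi's dual vector units are good at.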
Simulating a small number of cores on a single Xeon Phi shouldn’t be particularly hard. Simulating the full Manitowoc chip with its thousands of cores may still technically be doable with a single Phi, though it gets much faster when the full cluster is in use.
Performance is a serious consideration; if we want to write software to test out the architecture and make tweaks to improve efficiency and programmability, we’d ideally like to get as close as possible to real-time. In practice, my current estimates are that Grovercomp will be able to simulate the full Manitowoc chip at around 30-100 kHz, mostly bottlenecked by network latency. Depending on NoC congestion, we may be able to get somewhat higher than that in practice, and if we really need high performance we can probably scale back to smaller numbers of cores.
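As a back-of-envelope reading of that estimate: if (and this is an assumption of mine) each simulated clock cycle requires roughly one round of communication between the Phis, then 30-100 kHz simply corresponds to a communication budget of a few tens of microseconds per cycle:

$$
t_{\text{cycle}} \approx \frac{1}{f_{\text{sim}}}, \qquad f_{\text{sim}} \approx 30\text{ to }100\ \text{kHz} \;\Longrightarrow\; t_{\text{cycle}} \approx 10\text{ to }33\ \mu\text{s}.
$$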
One very valuable use of this will be in exploring advanced programming techniques for large tiled architectures. There is an enormous amount of efficiency that can be gained from intelligently laying out memory and computing power in space, turning optimization into a problem at the intersection of dynamic code execution and geometry.
SAT solving
Boolean satisfiability (SAT) solvers are very valuable for a wide range of verification problems, as they are a general-purpose tool for checking properties across all possible states of a finite system. Verification is extremely important in chip design (it’s very hard to fix a bug when it’s baked into silicon), and so even exponential-time algorithms are on the table.
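To make "checking properties across all possible states" concrete, here is a minimal sketch (a toy example of my own, unrelated to Manitowoc): take two circuits that are supposed to be equivalent, XOR their outputs together (a "miter"), encode the whole thing as CNF clauses, and ask a SAT solver whether that XOR can ever be 1. An UNSAT answer is a proof that the circuits agree on every input.

```cpp
// Emit a DIMACS CNF asking whether AND(a,b) can ever differ from NOT(OR(NOT a, NOT b)).
// Variables: 1=a, 2=b, 3=a&b, 4=!a, 5=!b, 6=4|5, 7=!6, 8=3 XOR 7 (the miter output).
#include <cstdio>
#include <vector>

int main() {
    std::vector<std::vector<int>> clauses = {
        {-3, 1}, {-3, 2}, {3, -1, -2},              // 3 <-> (1 AND 2)
        {-4, -1}, {4, 1},                           // 4 <-> NOT 1
        {-5, -2}, {5, 2},                           // 5 <-> NOT 2
        {6, -4}, {6, -5}, {-6, 4, 5},               // 6 <-> (4 OR 5)
        {-7, -6}, {7, 6},                           // 7 <-> NOT 6
        {-8, 3, 7}, {-8, -3, -7},                   // 8 <-> (3 XOR 7)
        {8, -3, 7}, {8, 3, -7},
        {8}                                         // assert the two outputs differ
    };
    std::printf("p cnf 8 %zu\n", clauses.size());
    for (const auto& c : clauses) {
        for (int lit : c) std::printf("%d ", lit);
        std::printf("0\n");
    }
}
```

Feed the output to any DIMACS-format solver (MiniSat, CaDiCaL, etc.); it answers UNSATISFIABLE, which is exactly De Morgan's law checked over every input. Real hardware properties get encoded the same way, just with millions of clauses instead of seventeen.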
Luckily, SAT solvers tend to be much more efficient in practice than in theory. The precise nature of this efficiency gap between theory and practice is a major unsolved problem in complexity theory – closely related to the P vs NP problem. This makes it hard to reason in the abstract about how quantitatively beneficial different approaches to the problem may be.
One of the disadvantages of SAT solving on a big, highly parallel compute cluster is that SAT solvers are not well-suited to parallelism. Yes, technically you could brute-force SAT problems in parallel to an enormous degree, but in practice you tend to get quasi-exponential speedups from more sequential code. This is mostly because SAT solvers are bottlenecked by the information they learn about the problem as they run, which tends to be of higher quality when it builds on previously learned information, and which is generated and revised so frequently that parallel SAT algorithms end up bottlenecked by communication.
That said, there are many known ways that you can throw parallelism at preprocessing to speed up the sequential part. I also happen to be building a SAT solver that makes heavy use of vectorized bitwise operations to rapidly brute-force local searches in the SAT problem, which should hopefully provide more speedups while conforming very well to the capabilities of the Xeon Phis.
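I won't go into that solver's internals here, but the basic vectorized trick presumably looks something like the sketch below (the data layout is my assumption): pack 64 candidate assignments into each machine word, so that one pass over the clauses evaluates all 64 candidates at once.

```cpp
// Bit-parallel CNF evaluation sketch: bit i of vals[v] is the value of variable v
// in candidate assignment i, so each bitwise operation processes 64 candidates.
// Clauses use DIMACS-style signed literals, and vals is 1-indexed by variable.
#include <cstdint>
#include <cstdlib>
#include <vector>

// Returns a mask: bit i is set iff candidate assignment i satisfies every clause.
uint64_t eval64(const std::vector<std::vector<int>>& cnf,
                const std::vector<uint64_t>& vals) {
    uint64_t all_sat = ~0ULL;                      // start with every lane satisfied
    for (const auto& clause : cnf) {
        uint64_t clause_sat = 0;
        for (int lit : clause) {
            uint64_t v = vals[std::abs(lit)];
            clause_sat |= (lit > 0) ? v : ~v;      // one op covers 64 candidates
        }
        all_sat &= clause_sat;                     // a lane survives only if every clause does
    }
    return all_sat;
}
```

On the Phis the same loop can be widened to the 512-bit vector registers, so each bitwise operation covers 512 candidate assignments instead of 64.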
At the very least, it’s likely we’ll have many small SAT problems that need to be run as well, not just big ones, and so having 1200 cores or so will be beneficial regardless.
Superoptimization
The general approach I am taking toward developing the Manitowoc chips is to start out with a well-mapped landscape of the tradeoffs between different chip designs and computing techniques, combined with some statistics on code structure, to use as a foundation. Once that foundation is laid and a starting chip is available, an evolutionary process can hill-climb toward a more ideal architecture: one capable of running ordinary code without enormous programmer effort while simultaneously doing so as efficiently as possible.
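The skeleton of that evolutionary loop is nothing exotic; the sketch below is a generic hill-climber (the architecture parameters and cost function are hypothetical placeholders, not the real Manitowoc design space): mutate a candidate architecture, score it in simulation, and keep the mutation only if it improves the score.

```cpp
// Generic hill-climbing skeleton; Arch, mutate(), and simulated_cost() are
// placeholders standing in for a real architecture description and for
// benchmarks run on the simulated chip.
#include <algorithm>
#include <random>

struct Arch { int cores; int sram_kb_per_core; int noc_width_bits; };

Arch mutate(Arch a, std::mt19937& rng) {
    std::uniform_int_distribution<int> which(0, 2), delta(-1, 1);
    switch (which(rng)) {
        case 0:  a.cores            = std::max(64, a.cores + 64 * delta(rng)); break;
        case 1:  a.sram_kb_per_core = std::max(4,  a.sram_kb_per_core + 4 * delta(rng)); break;
        default: a.noc_width_bits   = std::max(16, a.noc_width_bits + 16 * delta(rng)); break;
    }
    return a;
}

double simulated_cost(const Arch& a) {
    // Placeholder score so the sketch compiles; the real thing would fold runtime,
    // area, and programmability measured in simulation into a single number.
    return 1e6 / (double(a.cores) * a.sram_kb_per_core) + 0.01 * a.noc_width_bits;
}

Arch hill_climb(Arch current, int steps, std::mt19937& rng) {
    double best = simulated_cost(current);
    for (int i = 0; i < steps; ++i) {
        Arch candidate = mutate(current, rng);
        double cost = simulated_cost(candidate);
        if (cost < best) { best = cost; current = candidate; }  // keep improvements only
    }
    return current;
}
```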
When you have a clean-slate chip design, you have the luxury of being able to scrap all the parts of legacy instruction sets that waste transistors while no longer being used enough to justify their existence from an efficiency perspective. In fact, anything can be on the table if you’re ambitious enough.
The big problem with this is that it means the instruction set may be changing constantly during development of the chip, which can make the simulated chip a very difficult compilation target!
The solution is superoptimization, which is simply search-based code optimization. You take the space of possible instruction sequences, treat it as a search space, and ask what the fastest possible sequence of instructions is that does the same thing as the original code. This produces near-optimal code with little more than a depth-first search algorithm and a lot of time. New instruction sets could likely be generated by the same code that generates the CPU simulator.
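Here is a minimal sketch of the idea (the instruction set, target function, and names are all invented for illustration): iterative-deepening depth-first search over short instruction sequences, accepting the first one that matches the target function on every input.

```cpp
// Toy superoptimizer: search straight-line programs over a tiny made-up ISA for the
// shortest sequence computing target(x) on all 8-bit inputs (here: x & -x, i.e.
// "isolate the lowest set bit"). Verification is by brute-force enumeration of inputs.
#include <cstdint>
#include <cstdio>
#include <vector>

enum Op { NOT, NEG, INC, DEC, AND_X, OR_X, XOR_X, SHL1, SHR1, NUM_OPS };
const char* NAMES[] = {"not", "neg", "inc", "dec", "and x", "or x", "xor x", "shl1", "shr1"};

uint8_t apply(Op op, uint8_t r, uint8_t x) {
    switch (op) {
        case NOT:   return ~r;       case NEG:  return -r;
        case INC:   return r + 1;    case DEC:  return r - 1;
        case AND_X: return r & x;    case OR_X: return r | x;
        case XOR_X: return r ^ x;    case SHL1: return r << 1;
        default:    return r >> 1;   // SHR1
    }
}

uint8_t target(uint8_t x) { return x & -x; }

// Does this program (run with r initialized to x) match target() on all 256 inputs?
bool correct(const std::vector<Op>& prog) {
    for (int x = 0; x < 256; ++x) {
        uint8_t r = (uint8_t)x;
        for (Op op : prog) r = apply(op, r, (uint8_t)x);
        if (r != target((uint8_t)x)) return false;
    }
    return true;
}

bool dfs(std::vector<Op>& prog, int depth) {
    if (correct(prog)) return true;
    if (depth == 0) return false;
    for (int op = 0; op < NUM_OPS; ++op) {
        prog.push_back((Op)op);
        if (dfs(prog, depth - 1)) return true;
        prog.pop_back();
    }
    return false;
}

int main() {
    std::vector<Op> prog;
    for (int len = 0; len <= 3 && !dfs(prog, len); ++len) {}  // iterative deepening
    for (Op op : prog) std::printf("%s\n", NAMES[op]);        // finds: neg, and x
}
```

Even this toy version spends most of its time re-checking candidates against all 256 inputs, which is the verification cost discussed next; real superoptimizers typically replace that inner loop with a SAT/SMT equivalence check.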
This is of course an exceptionally hard problem in theory as you don’t just have the exponential search problem of finding the optimal code, but also the exponential problem of verifying that the code is correct across all possible inputs.
In practice though, it’s not completely impractical, as there are many ways of optimizing the search (and of course it’s a lot easier if you have an entire compute cluster available rather than just a laptop). The existing academic literature demonstrates that this is a pretty promising approach in terms of the performance of the code it generates, but I’m far less impressed with the search algorithms used to find that code. Throwing Grovercomp at analyzing the code space for shortcuts would be well within the realm of possibility, as would optimizing huge amounts of code quickly.
Thanks for reading! If you enjoy my work, please support me with a subscription to Bzogramming, my main newsletter. One-time donations can be made to Alexander via PayPal to fund the build. Until next time!