A brief overview of IBM’s new 7 nm Telum mainframe CPU
From the perspective of a traditional x86 computer enthusiast or professional, mainframes are weird and archaic beasts. They are physically huge, power hungry, and expensive compared to more traditional data center equipment, typically offering less compute per rack at a higher cost.
This raises the question: why keep using mainframes at all? Once you’ve hand-waved away the cynical answers that boil down to “because that’s the way we’ve always done it,” the practical answers largely come down to reliability and consistency. As AnandTech’s Ian Cutress points out in a speculative article focusing on Telum’s redesigned cache, “the downtime of these [IBM Z] systems is measured in milliseconds per year.” (If accurate, that’s at least seven nines of availability.)
IBM’s own Telum announcement shows how different mainframe priorities are from those of conventional servers. It casually describes Telum’s memory interface as “capable of tolerating complete channel or DIMM failures, and designed to transparently recover data without impacting response time.”
When you yank a DIMM out of a running x86 server, that server does not “transparently recover data”: it simply crashes.
IBM Z series architecture
Telum is designed to be a single do-it-all mainframe chip, replacing the much more heterogeneous setup found in earlier IBM mainframes.
The 14nm IBM z15 processor that Telum replaces includes five chips in total: four 12-core compute processors (arranged as two pairs) and a system controller. Each compute processor hosts 256MB of L3 cache shared among its 12 cores, while the system controller hosts a whopping 960MB of L4 cache shared among the four compute processors.
Five of these z15 chips (four compute processors plus one system controller) constitute a “drawer”, and four drawers combine in a single z15-powered mainframe.
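The z15 totals implied by those figures can be sanity-checked with a few lines of Python. All numbers below come from this article’s description, not from IBM spec sheets:

```python
# Sanity-check the z15 figures described above.
CORES_PER_COMPUTE_CHIP = 12
COMPUTE_CHIPS_PER_DRAWER = 4    # plus one system-controller chip per drawer
DRAWERS_PER_SYSTEM = 4

L3_PER_COMPUTE_CHIP_MB = 256    # shared among that chip's 12 cores
L4_PER_CONTROLLER_MB = 960      # on each drawer's system controller

cores = CORES_PER_COMPUTE_CHIP * COMPUTE_CHIPS_PER_DRAWER * DRAWERS_PER_SYSTEM
l3_mb = L3_PER_COMPUTE_CHIP_MB * COMPUTE_CHIPS_PER_DRAWER * DRAWERS_PER_SYSTEM
l4_mb = L4_PER_CONTROLLER_MB * DRAWERS_PER_SYSTEM

print(cores)   # 192 cores system-wide
print(l3_mb)   # 4096 MB of hardware L3 across the system
print(l4_mb)   # 3840 MB of hardware L4 across the system
```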
While the concept of multiple processors in a drawer and multiple drawers in a system remains, the architecture inside Telum itself is radically different and significantly simplified.
Telum is a bit simpler at first glance than the z15: it’s an eight-core processor built on Samsung’s 7nm process, with two chips combined in each package (similar to AMD’s chiplet approach for Ryzen). There is no separate system-controller processor; all Telum chips are identical.
From there, four Telum packages combine to form a four-socket “drawer”, and four of those drawers go into a single mainframe system. This provides 256 cores in total across 32 chips. Each core runs at a base clock above 5GHz, providing more predictable and consistent latency for real-time transactions than a lower base with a higher turbo rate would.
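The core-count arithmetic in that description can be checked directly (figures taken from this article, not from IBM spec sheets):

```python
# Telum system topology as described in this article.
CORES_PER_CHIP = 8
CHIPS_PER_PACKAGE = 2       # dual-chip module in each socket
SOCKETS_PER_DRAWER = 4
DRAWERS_PER_SYSTEM = 4

chips = CHIPS_PER_PACKAGE * SOCKETS_PER_DRAWER * DRAWERS_PER_SYSTEM
cores = CORES_PER_CHIP * chips

print(chips)   # 32 chips
print(cores)   # 256 cores
```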
Pockets full of cache
Doing away with the system-controller processor also meant redesigning Telum’s cache: the massive 960MB L4 cache is gone, along with the die-shared L3. In Telum, each individual core has a private 32MB L2 cache, and that’s it. There is no hardware L3 or L4 cache at all.
This is where things get deeply weird: while each Telum core’s 32MB L2 cache is technically private, it is really only virtually private. When a line in one core’s L2 is evicted, the processor looks for empty space in the other cores’ L2s. If it finds some, the line evicted from core X’s L2 is tagged as an L3 cache line and stored in core Y’s L2.
OK, so we have a shared virtual L3 cache of up to 256MB on each Telum chip, composed of the 32MB “private” L2 caches on each of its eight cores. From there, things go a step further: the 256MB of shared “virtual L3” on each chip can, in turn, be used as a “virtual L4” shared among all chips in a system.
Telum’s “virtual L4” works much the same way its “virtual L3” does: L3 cache lines evicted from one chip look for a home on a different chip. If another chip in the same Telum system has free space, the evicted L3 cache line gets retagged as L4 and lives in the virtual L3 on that other chip (which is, again, composed of the “private” L2s of its eight cores).
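The eviction chain described above can be sketched as a toy model. This is a hypothetical simplification for illustration only; the real hardware’s replacement policy, coherence protocol, and latency behavior are far more involved, and none of the names below come from IBM:

```python
# Toy model of Telum's virtual L3/L4 eviction chain (illustrative only;
# ignores replacement policy, coherence, and latency entirely).
class Core:
    def __init__(self, capacity=4):
        self.capacity = capacity   # how many lines this core's L2 can hold
        self.lines = {}            # addr -> level tag: "L2", "L3", or "L4"

    def has_room(self):
        return len(self.lines) < self.capacity

class Chip:
    """One Telum die: eight cores, each with a 'private' L2."""
    def __init__(self, n_cores=8):
        self.cores = [Core() for _ in range(n_cores)]

def evict_l2(system, chip, core, addr):
    """Evict addr from core's L2, then try to re-home it: first as a
    virtual-L3 line in a sibling core's L2 on the same chip, then as a
    virtual-L4 line on another chip, else drop it back to main memory."""
    core.lines.pop(addr, None)
    for sibling in chip.cores:             # same die -> virtual L3
        if sibling is not core and sibling.has_room():
            sibling.lines[addr] = "L3"
            return "L3"
    for other in system:                   # other dies -> virtual L4
        if other is chip:
            continue
        for c in other.cores:
            if c.has_room():
                c.lines[addr] = "L4"
                return "L4"
    return "memory"                        # nowhere left to re-home it

# Demo: in an otherwise idle two-chip system, an evicted line lands in
# a sibling core's L2, tagged as L3.
system = [Chip() for _ in range(2)]
core0 = system[0].cores[0]
core0.lines[0x100] = "L2"
print(evict_l2(system, system[0], core0, 0x100))   # prints "L3"
```

Once every sibling core on the first chip is full, the same call would return "L4" instead, with the line re-homed on the second chip.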
AnandTech’s Ian Cutress goes into more detail on Telum’s caching mechanisms. He ends up summarizing them by answering “How is this possible?” with a simple “magic.”
Accelerating AI Inference
Telum also introduces a 6TFLOPS on-chip inference accelerator. It is intended to be used for, among other things, real-time fraud detection during financial transactions (as opposed to shortly after the transaction completes).
In the quest for maximum performance and minimum latency, IBM threads several needles. The new inference accelerator sits on the die, which allows for lower-latency interconnects between the accelerator and the CPU cores, but it is not built into the cores themselves, the way Intel’s AVX-512 instruction set is.
The problem with in-core inference acceleration like Intel’s is that it typically limits the AI processing power available to any single core. A Xeon core executing an AVX-512 instruction has only the hardware inside that core to work with, which means larger inference jobs must be split across multiple Xeon cores to extract the full available performance.
Telum’s accelerator is on-die but off-core. This allows a single core to run inference workloads with the power of the entire on-die accelerator, not just the portion that would otherwise be built into that core itself.
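A back-of-the-envelope model illustrates why that matters for single-threaded latency. All workload numbers below are invented for illustration; only the 6 TFLOPS accelerator figure comes from IBM’s announcement:

```python
# Illustrative model: in-core accelerator slices vs. one shared
# off-core accelerator. Workload sizes are hypothetical.
JOB_TFLOP = 0.6   # made-up size of a single inference job

def time_in_core(per_core_tflops=0.2, cores_used=1):
    # An in-core design caps one thread at its own core's slice of
    # inference hardware; going faster means splitting the job
    # across more cores.
    return JOB_TFLOP / (per_core_tflops * cores_used)

def time_shared(accel_tflops=6.0):
    # A single core driving the entire shared accelerator, as on Telum.
    return JOB_TFLOP / accel_tflops

print(time_in_core())               # ~3.0 (arbitrary time units)
print(time_in_core(cores_used=8))   # ~0.375, but occupies 8 cores
print(time_shared())                # ~0.1, from a single core
```

The point of the sketch: the shared design gets the single-job latency of the whole accelerator without tying up extra cores.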
Listing image by IBM