Telum II at Hot Chips 2024: Mainframe with a Unique Caching Strategy
by mfiguiere
"Why Do Mainframes Still Exist? What's Inside One? 40TB, 200+ Cores, AI, and more! - Dave explores the IBM z16 mainframe from design to assembly and testing. What's inside a modern IBM z16 mainframe that makes it relevant today?" - by Dave Plummer.
This is an amazing 23-minute video by the Microsoft programmer who developed the Windows NT Task Manager, among other things. He visits IBM and talks to engineers about the Telum chip architecture (Hot Chips 2023), used in the z16 mainframe. Special attention is paid to the cache.
Dave Plummer seems to be a bit careless with facts in his videos, and I wouldn't generally trust him as a source of information.
In an episode on hard drives, he talked about how drivers for hard drives still report a constant number of sectors per track, so they must have a physical layout that matches that. Hard drive manufacturers are open about the actual layout of their drives and that they virtualize the hard drive for the OS so that it behaves well.
A few Microsoft engineers also dispute a lot of the facts of his stories about the development of the start menu.
Caveat emptor.
Do you happen to have factual insights into false reporting on the mainframe in this specific video?
afaict (and I've worked with mainframes for a couple of years) this is spot on. poor signal/noise ratio but the facts are right.
Another good article by Chips and Cheese on Telum II @ Hot Chips – an interview with IBM directors of development at the processor and system levels. [1]
[1] https://chipsandcheese.com/2024/09/05/an-interview-with-susa...
That's a huge amount of effort to let most of the transistors in a computer (in the RAM) sit idle most of the time. Surely there are viable non-von Neumann alternatives that could be spun out into general purpose computing.
The vast majority of transistors in any modern CPU are ‘idle’ at any given moment.
With the cascade of different clock domains on a core and package, the control loops can spend that thermal budget effectively elsewhere; idling is one of the benefits of CMOS.
Perhaps most weirdly, we've reached the point where that is actually desirable. Power+heat is the limit now, and slapping on some extra circuitry that is only used for some operations makes the chip perform better.
I believe GP was counting the transistors in DRAM, not only those on the CPU.
If someone is really into high performance, it's ideal to never have to wait for DRAM, either with predictive fetches or explicit cache warming. For that, the more cache you have, the better.
There have been quite a number of attempts at compute-in-memory chip designs and architectures, so far with very limited success. It's not impossible, but it is still in search of good applications.
These mainframe CPUs are really cool. But I still don’t feel like I understand where they make sense. I’ve heard they are used in finance, but what does that mean? High frequency trading? Processing credit card payments? Managing bank balances? I’ve always thought of mainframes as batch / offline systems, but this sounds much more online and low latency.
Transaction processing was basically invented on mainframes. High-throughput, high-volume, low-latency. I'm not sure about high-frequency trading, but in article nxobject linked above they say ~70% of financial transactions are processed on mainframes. So credit card transactions, bank transactions. Historically, things like airline reservations were also done on mainframes.
https://en.wikipedia.org/wiki/Transaction_Processing_Facilit...
Mainframe sounds like a good idea to solve many of today's problems. Why don't people start thinking about making a RISC-V or x86 Mainframe?
A mainframe is just a very large server, with lots of reliability features (RAID-like memory, fault detection and mitigation, redundant components, etc) and lots of intelligent peripherals that offload work from the CPU so that it can spend as much time as possible running application code (and doesn't waste time handling interrupts, assembling network packets, dealing with IO, etc). A lot of these offload functions are baked into the ISA, making it a VERY CISC machine.
I believe Unisys still makes x86-based mainframes running MCP.
And loads of IP that IBM will defend vigorously with any infringement.
They invest a colossal amount of money creating those patents. There are lots of bullshit patents in this space, but IBM is not playing that kind of game.
Modern cloud environments are basically virtualised infinitely scalable mainframes.
Modern cloud environments tend to be aimed at running multiple independent workloads well on a huge server. Mainframes are generally aimed at running a smaller number of large workloads well on a huge server. Sort of analogous to multithreaded vs singlethreaded performance in CPU benchmarks.
My personal take:
The typical x86[1] is a sports car. Gets going fast, reaches most destinations fast, not great for driving for several hours, and not great at moving lots of cargo.
A mainframe is a freight train. Somewhat slow to get going, but can haul large amounts of cargo without breaks for a long time.
Mainframes weren't built for an interactive, highly variable, query-response workload; they were built for the classic overnight/monthly batch job that streams through a large amount of data.
[1]: It's not about the CPU, it's about the architecture around it, like this article talks about cache, expanded to I/O etc concerns.
That’s how I had always thought about mainframes before, but the focus on low latency here seems to suggest a different purpose (more sports car like than any x86 server CPU). Is this a different kind of mainframe?
The freight train part of my thinking is admittedly dated.
This is an impressively fast design, but you'll get much more x86 for the same money -> x86 wins when you can scale out.
Random question to anyone who might know anything about this – is the uarch internally POWER, like System i (AS/400)?
No. It's completely different. Unlike System i, it's also fully documented (in the rather impenetrable "Principles of Operation" red book).
Ah, thank you! I would have never thought to search under that name. It makes sense for such a non-mainstream architecture, but I wish there were (even reverse-engineered) resources as well on "this is what the execution engine looks like". There are instructions there that clearly scream "very extensive microcoding is going on here", e.g. vintage EBCDIC/BCD conversion, string/stream instructions, control channel supervision, etc.
It seems to me the mentioned "Principles of Operation" document describes a virtual machine compatible with the good old System/370, now known as IBM z/Architecture. But the Telum CPU itself runs on Power10 cores or similar RISC cores.
No. The cores are not POWER10 at all. While there is a ton of microcode and interception magic happening (and nothing but the hypervisor, PR/SM IIRC, runs on metal - the hypervisor exposes partitions to the zVM environment) the cores are still very different from POWER and run their own s390x ISA.
You might be confusing it with the AS/400 CISC ISA, which exists as an emulation layer on top of POWER, since all IBMi machines are almost identical to their POWER counterparts.
There isn't an emulation layer for AS/400 though, it's native POWER code.
The AS/400 / 'i' are descendants of the System/38 and implement a "Technology Independent Machine Interface". Applications target this high-level interface, rather than the underlying hardware. Before first run (or when they're installed?) applications get compiled from abstract Machine Interface code to native code.
So I looked into this on Wikipedia and it seems like it is "z/Architecture", which is basically a 64-bit extension of the s/390 instruction set, which was the evolution of the s/360, which was in a sense the first instruction set (it was the first instruction set intended to be implemented by multiple CPUs). It looks like software for the s/360 should still be able to run unchanged on the modern CPUs, though there was some mention that this does not extend to operating systems.
Too bad the mainframe business will not be spun off from IBM. Then you might see innovation, but IBM sees it as a cash cow.
I fail to understand where you think IBM has been lacking in innovation. Telum and Telum II (like POWER10) are very impressive designs, the likes of which you won't see anytime soon in the x86 or ARM space. They target a relatively small segment of the market where people will pay whatever it takes to reach 99.99999% uptime or the most transactions per second.
If mainframes were not competitive at that, they would have ceased to exist a long time ago.
Or, at the very least, IBM would have sold a product that required far less investment in unique technology, e.g. software emulation on commodity hardware.
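For a sense of scale, the 99.99999% uptime figure mentioned above can be turned into a downtime budget. This is plain arithmetic, not an IBM specification; the function name and the 365.25-day year are my own choices for the sketch.

```python
# Downtime budget implied by an availability percentage.
# Assumption: a year is 365.25 days; the helper name is illustrative only.
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def downtime_seconds_per_year(availability_percent):
    """Seconds per year a system may be down at the given availability."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * SECONDS_PER_YEAR


if __name__ == "__main__":
    for pct in (99.9, 99.999, 99.99999):
        secs = downtime_seconds_per_year(pct)
        print(f"{pct}% uptime allows about {secs:.2f} s of downtime per year")
```

Seven nines comes out to roughly three seconds of downtime per year, versus a little over five minutes for the more common five nines.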
Everything from the ground up is designed for business transaction performance. That includes the OS, which is somewhat limited for a lot of other uses.
Do the customers want innovation?
The CPUs have been a tour de force since the S/360 and they have never relented, so empirically yes, the customers care a lot or they wouldn't keep doing this.
The software side seems to be more a tale of dichotomy. The MVS lineage is technically impressive but undoubtedly bizarre and old feeling. The TPF lineage seems like eventually somewhere the cloud movement will dip for certain cases so it is ahead of time. Linux is neither stale nor avant-garde, I guess that is their strategy to remain "contemporary". VM was always the most delightful one but internally forever the odd one out.
Seems very complex therefore very expensive (and possibly slow where it matters, at L2). Or it might just work.
On the contrary!
Yes there's a lot of cache. But rather than try to have a bunch of cores reading each cache (sharing 96MB L3 for AMD's consumer cores), now there's a lot of separate 36MB L2 caches.
(And yes, then again, some fancy protocols to create a virtual L3 cache from these L2 caches. But less cache hierarchy & more like networking. It still seems beautifully simpler in many ways to me!)
L3 caches are already basically distributed and networked through a ring bus or other NoC on many x86 chips. For example, Sapphire Rapids has 1.875MB of L3 per core, which is pooled into a single coherent L3. Fun fact: this is smaller than each core's L2 (2MB).
From https://chipsandcheese.com/2023/03/12/a-peek-at-sapphire-rap... “ the chip appears to be set up to expose all four chiplets as a monolithic entity, with a single large L3 instance. Interconnect optimization gets harder when you have to connect more nodes, and SPR is a showcase of this. Intel’s mesh has to connect 56 cores with 56 L3 slices.”
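To put numbers on the Sapphire Rapids comparison, here is a quick back-of-the-envelope sketch. It only multiplies the figures quoted in these comments (1.875MB of L3 per core, 2MB of L2 per core, 56 slices); the variable and function names are made up for illustration.

```python
# Figures quoted in the thread for Sapphire Rapids, in MB.
SPR_L3_SLICE_MB = 1.875  # L3 slice per core
SPR_L2_MB = 2.0          # private L2 per core
SPR_CORES = 56           # "56 cores with 56 L3 slices"


def pooled_capacity_mb(per_core_mb, cores):
    """Total capacity when every core contributes one equally sized slice."""
    return per_core_mb * cores


if __name__ == "__main__":
    l3_total = pooled_capacity_mb(SPR_L3_SLICE_MB, SPR_CORES)
    l2_total = pooled_capacity_mb(SPR_L2_MB, SPR_CORES)
    print(f"pooled L3: {l3_total} MB, sum of private L2s: {l2_total} MB")
```

So the pooled L3 comes to 105MB, slightly less than the 112MB of private L2 summed across the cores.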
I wonder what workloads would benefit from having an L4 victim cache on another CPU, but that other CPU doesn't need its own L2 cache.
The claimed latency for it seems not far off from some other vendor's L3 caches which may be by virtue of rethinking where to share and therefore paying interconnection coherency taxes.
The innovation here seems to be adaptive sizing so if by whatever algorithm/metric a remote core is idle, it can volunteer cache to L4.
Presumably the interconnect is much richer than contemporary processors in typical IBM fashion and they can do all the control at a very low level (hw state machines, µcoding) so it is fast and transparent. It will be interesting to hear how it works in practice and if POWER12 gets a similar feature since it shares a lot of R&D.
At a basic level, anything with a working set on the order of 360 MB should benefit from 360 MB of combined L3 with a worst-case latency of 11.5 ns, regardless of which parts end up in which L2 slice (and the cache allocation heuristics described in the article look pretty smart to me). Similarly, if you have a total working set of a couple of GB then the 2.8 GB combined L4 at 48.5 ns latency should be great. Is there any other hardware on the market that can offer so much memory at such a low latency?
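A rough way to see what those two tiers buy you is a toy expected-latency model. This is only a sketch under loud assumptions: accesses are uniformly random over the working set, placement is ideal, the 2.8 GB figure is treated as the total reachable at L4 latency, and the 100 ns DRAM number is a placeholder rather than a Telum II measurement. The 360 MB figure is consistent with ten of the 36MB L2 caches mentioned upthread. All names in the code are illustrative.

```python
# Toy model: expected load latency for a uniformly accessed working set.
# Assumptions (not from IBM): ideal placement, cumulative capacities,
# and a placeholder DRAM latency.
VIRTUAL_L3 = (360.0, 11.5)         # (capacity in MB, latency in ns), from the thread
VIRTUAL_L4 = (2.8 * 1024.0, 48.5)  # 2.8 GB expressed in MB, from the thread
ASSUMED_DRAM_NS = 100.0            # placeholder, not a measured Telum II value


def expected_latency_ns(working_set_mb, levels, memory_ns):
    """Average latency when each byte of the working set is equally likely.

    levels must be sorted by capacity; each capacity is the total amount of
    data reachable at that latency or better.
    """
    served_mb = 0.0
    total_ns = 0.0
    for capacity_mb, latency_ns in levels:
        reachable_mb = min(capacity_mb, working_set_mb)
        share = max(reachable_mb - served_mb, 0.0) / working_set_mb
        total_ns += share * latency_ns
        served_mb = max(served_mb, reachable_mb)
    # Whatever does not fit in any cache level is served from memory.
    total_ns += (working_set_mb - served_mb) / working_set_mb * memory_ns
    return total_ns


if __name__ == "__main__":
    levels = [VIRTUAL_L3, VIRTUAL_L4]
    for ws in (300.0, 720.0, 2048.0, 8192.0):
        avg = expected_latency_ns(ws, levels, ASSUMED_DRAM_NS)
        print(f"{ws:7.0f} MB working set: about {avg:.1f} ns per access")
```

Under these assumptions a working set that fits in 360 MB averages 11.5 ns, and one twice that size averages 30 ns. Real workloads are not uniform, so treat this as intuition only.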
/Uneducated/ these latency numbers seem large to me. DDR5 memory sticks I browsed yesterday for a home PC listed 10ns first word latency.
If the data is not in cache, it takes quite a while longer from the time the CPU core issues a load instruction for the results to get back to the next instruction. The CPU core has to first try L1 and L2, do a TLB lookup to convert a virtual address to a physical address, send a request to L3 over an on-chip connection, then after L3 lookup fails the memory controller has to transfer a 64-byte cache line from the main memory, and the results are then sent back to the core...
Have a look at the section "Cache setup" at https://chipsandcheese.com/2024/08/14/amds-ryzen-9950x-zen-5... for some real-world latency values. Once we're talking about a 100+ MB working set (i.e. DDR5 instead of cache), a top-of-the-line Ryzen 9950X has an access latency of about 100 ns. There is also some older data for a wider variety of CPUs at https://chipsandcheese.com/memory-latency-data/ - and there the older IBM z15 is in a class of its own.
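The 10 ns on a DDR5 spec sheet is typically just the CAS latency converted to time, which is only one step of the path described above. A small sketch of that conversion follows; the DDR5-6000 CL30 kit is a hypothetical example that happens to give 10 ns, and the function name is my own.

```python
# "First word latency" as listed for memory kits: CAS cycles converted to ns.
# DDR transfers data twice per clock, so the clock period is 2000 / (MT/s) ns.
def first_word_latency_ns(cas_cycles, transfer_rate_mts):
    """CAS latency in nanoseconds for a DDR module."""
    clock_period_ns = 2000.0 / transfer_rate_mts
    return cas_cycles * clock_period_ns


if __name__ == "__main__":
    # Hypothetical kit chosen for illustration: DDR5-6000 with CL30.
    cas_ns = first_word_latency_ns(30, 6000)
    measured_ns = 100.0  # approximate load-to-use figure quoted above for the 9950X
    print(f"CAS latency: {cas_ns:.1f} ns")
    print(f"rest of the path: about {measured_ns - cas_ns:.0f} ns")
```

In this example about 90 of the roughly 100 ns is spent outside the CAS step, which is why the spec-sheet figure and the measured latency are not comparable.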
At a guess, a single thread which benefits from as much cache as it can get.
Sure, but having to buy entire CPUs filled with idle cores to scale up cache seems very expensive.
These cores are typically licensed with class restrictions, so in absolute terms yes, but in the financial engineering of how the system is delivered, with excess and restricted hardware, no (see core types on the prior/shipping generation here https://www.ibm.com/downloads/cas/6NW3RPQV)
There are probably design reuse and RAS considerations that make it not currently worthwhile to, e.g., have a distinct physical design for SAP or whatever cores.
I don't know if it's still the case, but in terms of RAS, the Z/Series CPUs from ~2004 had duplicated/compared instruction-fetch/decode and execution units.
https://pages.cs.wisc.edu/~remzi/Classes/838/Fall2001/Papers...
I wouldn't be surprised as well if there was some binning that occurred – the dies are huge, so why not overprovision in the design? (Although erring on the side of slightly more surprise in the case of some binning, since IBM mainframes seem to exist beyond the laws of commodity economics, and it looks like they're using a 5nm node.)
People buy whole machines to run memcached!
I said this is where it would help (AFAICS), not that it was the best solution.
The CPUs only appear to use about 1/3 of the die area. Most of the space is cache.