Discussion at the time https://news.ycombinator.com/item?id=27750610
I'd like to necro one particular comment from this discussion. Someone said that "x86 cannot have acquire/release semantics", because by default x86 has total store order (TSO).
My question: What stops Intel or AMD from providing an opt-in weaker memory model? Or what would the programming model look like for new programs that wanted to abandon TSO for better performance? Would it just be weak order prefixes for existing memory-altering instructions? Would you need a process-wide bit to weakly order the whole program? Would that affect loaded libraries (including OS-provided libraries) too? Would programs dropping TSO potentially affect kernels or hypervisors above them on the privilege hierarchy of the CPU?
To be pedantic, x86 does have opt-in weaker memory models. When you're setting up the TLB entries, you can configure how strong the cache behavior is for the memory... which means in practice, you can't really access these memory models unless you're doing firmware, or you're using nontemporal stores [1].
On a more practical level, however, it's disruptive to give the hardware optional, less-restrictive memory operations, because the existing instructions might then need more fences to guarantee the necessary properties. For example, on x86, compilers drop all explicit store fences because they're "unnecessary" on the hardware; adding operations that would make those fences necessary would break existing code, which was compiled without any knowledge of instructions that didn't exist yet.
[1] Incidentally, this means trying to read the manual to figure out what a nontemporal store actually does can feel like turning your brain to mush.
The cache behavior regarding the ordering of memory accesses cannot be configured.
What you can select when mapping memory is to make some regions uncached (with several variants, e.g. write-combining).
Some of the kinds of uncached memory regions, especially those whose names include "write-combining", implement weaker orderings of the memory accesses in comparison with how cacheable memory works on x86, where 3 of the 4 kinds of memory access reorderings are prohibited (i.e. all except that subsequent loads can be performed before prior stores).
However choosing a memory type with weakly-ordered accesses cannot increase the performance on x86, because the loss of performance by not using the cache is much greater.
The weakly-ordered accesses only recover a small part of the performance loss caused by uncached memory.
Is the mapping memory stuff separate from what's being set up by the OS as part of the TLB? I'm generally willfully ignorant of the privileged side of ISAs since I'm firmly a compilers guy, so I mostly only got this level of stuff by having to look at it thanks to it being incorporated into the definition of nontemporal store on x86 (which still ends up being confusing as heck).
Yes, the "mapping memory stuff" is the page tables. (The TLB is an internal cache the processor keeps of recently used page table entries, which is mostly internal to the processor and transparent to the OS, except that the OS needs to remember to issue a special instruction to invalidate the TLB after modifying page tables.)
> For example, on x86, compilers drop all explicit store fences because they're "unnecessary" on the hardware
I mean specifically for new programs that are written with weaker memory models in mind. So you'd have to enable an -fNOTSO flag on your compiler that emits all the fences that would otherwise be skipped.
Looking at what I can find about nontemporal stores, they sound like they already have a weird kind of release semantics on x86, even though their intent was to avoid cache thrashing, not so much to allow greater memory reordering. Are these actually used in compilers?
Compilers can generate non-temporal ops if you ask for it explicitly in some form. And they are problematic: https://github.com/llvm/llvm-project/issues/64521
Compilers will generally expose a nontemporal store intrinsic, but don't expect them to actually try to use it automatically.
I think they were particularly beneficial in some workloads for optane persistent memory
(We detached this from https://news.ycombinator.com/item?id=42400471 since it's a good top-level comment)
> My question: What stops Intel or AMD from providing an opt-in weaker memory model?
Probably for the same reason Intel/AMD doesn't get rid of the rest of the cruft in x86-64, i.e. backwards compatibility. Additionally, there would be issues with Intel/AMD leveraging an optional weak memory model in their chips without compromising the performance of legacy TSO applications. They are probably better off making x86-64 perform best under TSO.
There is this myth that having a weaker memory model can lead to higher performance. It sounds plausible, and it was a good idea to explore at the end of the 80s, but in the end it turned out that it's simply not substantiated by facts. Note that Apple Silicon is TSO.
> Note that Apple Silicon is TSO.
Apple Silicon supports both Arm's relaxed memory model and Intel's TSO for backwards compatibility. Microsoft is doing the same for Windows on Arm.
Apple's Mx chip has an instruction to enable TSO, which Rosetta uses when running x86 code. I don't believe it uses TSO when running native ARM code, but I could be mistaken.
> x86 cannot have acquire/release semantics
As I said at the time, this is nonsense. All stores on x86 have release semantics and all loads have acquire semantics.
Atomic RMW are sequentially consistent.
Technically you could implement relaxed stores via non temporal stores but it would be pointless.
It is correct that on x86 all normal stores have release semantics and all normal loads have acquire semantics (only a few instructions behave differently, i.e. the string instructions and the streaming "non-temporal" store instructions).
However the ordering properties of the x86 stores and loads are stronger than that. An x86 store is not only a release store; it is also what could be called an "initial store", i.e. a store that is guaranteed to be performed before all subsequent stores.
An x86 load is not only an acquire load; it is also what could be called a "final load", i.e. a load that is guaranteed to be performed after all prior loads.
The research paper that introduced the concepts of acquire loads and release stores (in May 1990) was mistaken in claiming that these 2 kinds of accesses are enough for the synchronization of accesses to shared resources.
While it is true that acquire loads and release stores are sufficient for implementing critical sections, there are other algorithms for accessing shared resources that need the other 2 kinds of ordered loads and stores, i.e. final loads and initial stores.
On x86, this does not matter, because the normal loads and stores provide all 4 kinds of ordered loads and stores, so any algorithm can be implemented without using memory barriers or other special instructions.
In ISAs that provide only 2 kinds of ordered loads and stores, i.e. acquire loads and release stores, like Arm Aarch64, the other 2 kinds of ordered loads and stores must be synthesized using memory barriers, i.e. an initial store is made by a weakly-ordered store followed by a store barrier, while a final load is made by a weakly-ordered load preceded by a load barrier.
Arm Aarch64 does not have load barriers, but it has stronger acquire barriers, which can always be used instead of any load barriers. The Arm Aarch64 acquire barrier has the confusing mnemonic DMB LD, apparently inspired by the x86 LFENCE, which is also an acquire barrier, not a load barrier, despite its confusing mnemonic.
(a load barrier guarantees that all prior loads are performed before all subsequent loads; an acquire barrier guarantees that all prior loads are performed before all subsequent loads and stores; such memory barriers are stronger than necessary; the weaker ordering properties provided by the 4 kinds of ordered loads and stores are sufficient)
Interesting. I don't consider myself an expert, but I have written my share of lock free algorithms. For which algorithms did you specifically need an "initial store" or "final read"?
I could see these being useful for seq locks, but usually you represent them with relaxed operations and explicit memory barriers.
An example is when you have a shared data structure that is written by one or more writers and to which one or more readers want read-only access.
In this case, the readers can use optimistic access to the shared data structure, also known as lock-free access (a very inappropriate name in my opinion).
Thus all the readers access the shared resource concurrently, but they must be able to detect when an update is in progress or an update has overlapped their read access in time, in order to stall or retry their reading of the shared data structure.
To implement this, a version counter is used, which is incremented after each update by a writer of the shared data. The version counter must be updated twice, both before and after the rest of the shared data is updated. The first update must provide an invalid version number, to enable the detection of an update being in progress.
There are two simple ways to achieve this. The version counter can be complemented before data update and negated after data update. Complement followed by negate is equal to increment, and negative version numbers would indicate an update in progress. The alternative is to increment the version counter both before and after the data update, when odd numbers would indicate an update in progress and even numbers would be valid version numbers.
A reader must read the version number both before and after reading the shared data. An invalid version number before reading will stall the reading, while after the reading an invalid version number or one that is different from the first version number will cause a retry of the reading.
The two version counter stores by a writer and the two version counter loads by a reader have different ordering requirements.
The version counter store before data update must be an "initial store" (as defined in the previously posted message).
The version counter store after data update must be a release store.
The version counter load before data reading must be an acquire load.
The version counter load after data reading must be a "final load" (as defined in the previously posted message).
There are also other frequently used algorithms, e.g. for queuing locks, which need "initial stores".
The explicit memory barriers are a mistake in ISA design. They have stronger ordering properties than the various kinds of ordered loads and stores that can replace them, and these stronger ordering properties are never needed.
Moreover, explicit memory barriers increase the code size and they occupy instruction decode slots and various locations in the internal structures of a CPU core without providing any benefit.
Explicit memory barriers are absolutely necessary for Arm Aarch64 and for Intel/AMD x86-64 (for the latter only when accessing memory regions with weakly-ordered accesses), but only because these ISAs do not include all the kinds of ordered loads and stores that are needed.
You are describing a seq lock. Hans Boehm [1] has discussed in detail the impedance mismatch between seq lock and the acquire/release model, and the bottom line is that you need compiler help to recognize non-modifying RMW that are purely used to establish ordering and lower them to the correct optimal sequence. No compiler does it yet so we all hack it around.
[1] "Can Seqlocks Get Along with Programming Language Memory Models?" I can no longer find an easily accessible free version. The link on Boehm's HP home page is broken.
I am pretty sure that variants of this algorithm have been used many decades before Linux and before the coining of the name "seq lock", which is not a good name for it.
It is not needed only on multiprocessors, but it is needed also even when there is a single writer and a single reader, both running on the same CPU core, e.g. when some data is updated by an interrupt service routine or by a signal handler and the main program must also access that data.
However, before the fashion of implementing weakly-ordered memory accesses, which spread mostly after 1990, nobody paid much attention to the ordering requirements for its stores and loads, because in ancient times all existing computers satisfied those ordering requirements implicitly for all stores and loads.
Thanks for pointing to the paper written by Hans Boehm, SciHub has it.
A name is a name; RCU is a similarly terrible name, but it stuck and it is immediately recognizable. The kernel (and related articles on LWN) simply popularized the algo; I have no doubt that it was in use well before Linux, but that's how I learned it.
(oh, I just realized you must be the same Adrian from RWT that was arguing for initial stores/last load)
For those watching along at home, we're talking about these two comments from dragontamer:
- https://news.ycombinator.com/item?id=27753913
- https://news.ycombinator.com/item?id=27762749
I interpreted dragontamer's comment to mean "x86 does not have relaxed stores", not "x86 does not have a strong memory model". The (presumed) problem is that you can't actually test acquire/release code on x86, because it won't crash if you get it wrong.
There's a second level of this in C++'s memory model: consume/release. I've no idea what the difference between acquire and consume are; when I look it up there's usually some reference to the DEC Alpha, a 30 year old workstation and server chip nobody uses today which was legendary for pushing the boundaries of memory ordering. My assumption is that no hardware provides memory ordering exactly as weak as consume, so I shouldn't bother asking for it, since it'll just get strengthened to acquire on any chip that matters.
Also, from you and one other person, it sounds like relaxed memory models aren't actually a performance benefit? Would the opposite - ARM mandating TSO in ARMv9.1-A or whatever - make sense? A lot of ink and code is wasted talking about memory model strengths.
> There's a second level of this in C++'s memory model: consume/release. I've no idea what the difference between acquire and consume are; when I look it up there's usually some reference to the DEC Alpha, a 30 year old workstation and server chip nobody uses today which was legendary for pushing the boundaries of memory ordering. My assumption is that no hardware provides memory ordering exactly as weak as consume, so I shouldn't bother asking for it, since it'll just get strengthened to acquire on any chip that matters.
You have it wrong here.
So the basic idea behind release/consume comes from an observation that, on all hardware not named Alpha [1], there is no hardware fence needed to guarantee that a load happens-before another load that is data-dependent on the first load. So the C++ committee decided to add a memory model that would preserve this guarantee. A consume load is exactly like an acquire load, but only for other loads that are data-dependent on that load as opposed to all loads in general. So whereas an acquire load requires a fence on pretty much every hardware not x86 or SPARC in TSO mode, a consume load would only require a fence on Alpha.
For various reasons, data-dependency isn't a property that compilers are in the position of guaranteeing, which means every compiler ended up implementing consume as the stronger acquire instead, defeating its entire design goal of eliminating unnecessary fences. There have been a couple of attempts within the committee to find a path that would let something like consume work by tweaking the necessary data-dependence definitions, but none of them have swayed the implementers, so the current state is that they've given up and let consume die.
Of particular note is that release/consume is not designed to support the Alpha memory model; in fact, it's quite the opposite: it's designed to support everybody but Alpha, since for Alpha (and essentially only Alpha), release/consume is not usefully weaker than release/acquire. Instead, the people who benefit are the PPCs and the ARMs of the world, for whom release/consume is a better approximation of their native memory model and would allow many fences to be omitted.
[1] Actually, some accelerator hardware might have memory models as weak as Alpha here, but I'm far less familiar with memory models as they apply to accelerators.
So, just to confirm my own (new) understanding now:
Consume is "I want to read data from a pointer, please do not reorder loads that go through the pointer"
and Acquire is "I want to read data from a pointer, please do not reorder loads that happen after this one, *whether or not they go through the same pointer*"
there's no relation to the Alpha aside from it being the only processor that needs a barrier for Consume
and for whatever reason compilers don't or can't do the dependency tracking that would let them take advantage of the weaker guarantees of Consume, so in practice it just becomes Acquire?
Consume is for data dependency, loading through a pointer is the obvious one, but there are other variants (using the loaded value as an index for example).
Acquire is also not about pointers but about independent reads: reading a flag and then reading data protected by that flag for example.
Consume exists because it can be implemented cheaply on some platforms with relatively expensive acquire barriers (ARM, POWER). Alpha didn't respect load dependencies, so consume is as expensive as acquire.
Thank you! I have argued this point so many times on HN :).
Just wanted to say that to me, this series is the holy grail of breaking down an inscrutably technical topic with decades of research behind it into understandable language, while at the same time retaining all the necessary complexities and not “dumbing down” anything. Truly a delight to read if you are interested in the topic. I can only aspire to be a writer like Russ some day.