The equivalent in Rust is RwLock: https://doc.rust-lang.org/std/sync/struct.RwLock.html
The more general idea for this is readers-writer lock: https://en.wikipedia.org/wiki/Readers%E2%80%93writer_lock
https://www.cppstories.com/2026/shared_mutex/#newer-concurre...
> Since C++17, the concurrency library has expanded significantly. We now have:
> ...
> safe memory reclamation mechanisms (RCU, hazard pointers in C++26)
> These tools focus mostly on thread lifetime, coordination, cancellation or lock-free programming.
> std::shared_mutex fills a different role.
> It is still a mutual-exclusion primitive, explicitly designed to protect a shared state with many readers and few writers. It does not compete with atomics, condition variables, or RCU.
What does it mean to say shared_mutex does not "compete" with atomics/CVs/RCU? There are many situations where RCU or other safe reclamation mechanism is a good replacement for state + shared_mutex. (And some situations where atomics are a decent replacement.)
If you want the write to be a synchronization point after which all threads observe only the new value, a shared mutex is the natural fit. Of course you can use a barrier to accomplish that instead, but something like hazard pointers or RCU doesn't synchronize by itself.
This is true, but it is a subset of designs using shared_mutex.
Not an expert, but can’t you get synchronization like this just by using release/acquire memory order with C11 atomic stores and loads?
From my knowledge, RCU/epoch-based reclamation/hazard pointers are useful in data structures and algorithms where raw atomics cannot be used but you still need lock-free, or in some cases wait-free, semantics.
If you can use an atomic then these are overkill and you should just use an atomic. But note that wrapping something in std::atomic does not make it lock-free: if there's no hardware support, the implementation will fall back to a lock.
Yes. But if you are tempted to do this in most cases you should just use a mutex anyway.
I think it's possible with an atomic<shared_ptr> too (C++20)?
A shared_mutex comes in useful when you can't really keep multiple copies of the shared data around (due to memory usage, perhaps), so readers simply block while the writer is updating it.
It often is the case that APIs like this, in both C++ and Rust, don't offer you enough knobs once your use case deviates from the trivial.
It happens with locking APIs, it happens with socket APIs, anything platform-dependent.
Does the C++ standard give you an idiomatic way to set PTHREAD_RWLOCK_PREFER_READER_NP or PTHREAD_RWLOCK_PREFER_WRITER_NP explicitly when initializing a rwlock? Nope. Then you either roll your own or in Rust you reach for a crate where someone did the work of making a smarter primitive for you.
Yeah, you can't enable priority inheritance for mutexes in std of either C++ or Rust. Which is a show stopper for hard realtime (my dayjob).
And then you have mutexes internally inside some dependency still (e.g. grpc or what have you). What I would really like is the ability to change defaults for all mutexes created in the program, and have everyone use the same std mutexes.
By the way: rwlocks are often a bad idea, since you still get cache contention between readers on the counter of active readers. Unless you hold the lock for a really long time (several milliseconds at least), it usually doesn't improve performance compared to a plain mutex. Consider alternatives like seqlocks, RCU, hazard pointers etc. instead, depending on the specifics of your situation (there is no silver bullet when it comes to performance in concurrency primitives).
> What I would really like is the ability to change defaults for all mutexes created in the program, and have everyone use the same std mutexes.
Yes. I imagine subbing out a debug mutex implementation that tracks lock ordering and warns about order inversion (similar to things like witness(4)): https://man.freebsd.org/cgi/man.cgi?witness(4)
> rwlocks are often a bad idea
Yes.
> since you still get cache contention between readers on the counter for number of active readers
There are rwlock impls that put the reader counts on distinct cache lines per core, or something like that (e.g., folly::SharedMutex), mitigating this particular problem. But it isn't the only problem with rwlocks.
> Consider alternatives like seqlocks, RCU, hazard pointers etc instead, depending on the specifics of your situation (there is no silver bullet when it comes to performance in concurrent primtitves).
Yes. :)
What are some of the other problems with rwlocks? I genuinely ask because I used them with quite good success for pretty complicated use-case at large scale and very high concurrency.
I have as well. I find RW locks much easier to use than, say, a recursive mutex. Mainly since it took me a long time to actually understand how a recursive mutex actually works in the first place. When you want to use only the stdlib, you aren't left with many choices. At least in the STL.
Mostly that if you actually have both readers and writers, they obstruct each other; this is often undesirable. And you have to pick some bias in advance. You can get priority inversion because readers are anonymous.
Sure, the workload had both readers and writers, and it was a pretty "bursty" one with a high volume of data that had to scale across all the cores. So, not a particularly light workload. It was basically a high-concurrency cache which was write-mostly in the first phase (ingestion) and read-mostly in the second phase (crunching). It had to support multiple sessions simultaneously, so in the end it was about supporting heavily mixed read-write workloads, e.g. the second phase of session nr. 1 could overlap with the first phase of session nr. 2.
To avoid lock contention I managed to get away with sharding across an array of shared mutexes and load-balancing the sessions by their UUIDs. And this worked pretty well: after ~10 years it's still rock solid, and the workloads are basically ever-changing.
I considered RCU for this use case too, but I figured it wouldn't be as good a fit: the workload is heavily mixed, so I thought it would put a lot of strain on the memory subsystem by having to handle multiple copies of the data (which was not small).
One thing I don't understand is the priority inversion and how that may happen. I'll think about it, thanks.
> What I would really like is the ability to change defaults for all mutexes created in the program, and have everyone use the same std mutexes.
Assuming you're building the whole userspace at once with something like yocto... you can just patch pthread to change the default to PTHREAD_PRIO_INHERIT and silently ignore attempts to set it to PTHREAD_PRIO_NONE. It's a little evil though.
> By the way: rwlocks are often a bad idea
+1
That is a great terrible idea (I really have to think a bit more on that). It won't help for Rust, since the mutexes there use futex directly, so you would have to patch the standard library itself (and with futexes it is more complex than just enabling a flag). It seems plausible that other libraries and language runtimes do similar things.
The Rust std mutex implementation when targeting Fuchsia does implement priority inheritance by default, but the Zircon kernel's scheduler and futex implementation were written with priority inheritance in mind as the default approach rather than something tacked on ad hoc. Unfortunately, on Linux there seems to be a large performance tradeoff which may not be worthwhile for the common case. It would be nice to be able to change the behavior through an env variable, though, rather than requiring a recompile of libstd. A lot of programs also use alternatives to the std library, like parking_lot, which is indeed a pain.
Sometimes I feel like trying to use Linux for realtime is an effort in futility. The ecosystem is optimized for throughput over fairness, predictability, and latency.
> Unfortunately on Linux it seems like there is a large performance tradeoff
Implementing transitive priority inheritance is just inherently algorithmically more expensive: there's no avoiding that.
> Sometimes I feel like trying to use Linux for realtime is an effort in futility.
If you're not actually using an RT kernel, yeah, it's futile. But if you are, the guarantees are pretty strong... on x86 PCs, the hardware gets in the way much more than the software in my experience. There's a lot of active work upstream.
The Linux priority inheritance futexes are also fair, which adds unnecessary overhead if you only care about PI, not fairness.
i think both you guys have the same job as me lol
There are rwlock implementations where waiters (whether readers or writers) don't contend on a shared cache line (they only touch it once, to enqueue themselves, not to spin/wait).
These are usually called "scalable locks" and the algorithms for them have been out there for decades. They are optimal from a cache-coherence point of view.
The issue with them is that it's impossible to support the same API you're used to with std::shared_mutex, as every waiting thread needs its own cache line.
> it's impossible to support the same API as you're used to with std::shared_mutex
If I understand the problem correctly, the standard library could specialize the scoped lock objects for the shared mutex, allocate the waiter object inline in the lock guard (so it lives on the stack), and use an internal dedicated API.
It might be harder (but not impossible) to interoperate threads using direct lock calls with threads using the scoped API.
In any case it is a moot point, as I don't think any std library does it, nor will they ever, as it would probably be ABI-breaking.
One thing I appreciate about Rust's stdlib is that it exposes enough platform details to let you build the missing knobs without reimplementing the entire wrapper (e.g. File, TcpStream, etc. give access to raw file descriptors, OpenOptionsExt lets me use FILE_FLAG_DELETE_ON_CLOSE on Windows, etc.).
Because usually that is OS-specific and not portable enough to be part of a standard library that is supposed to work everywhere.
NP means the API is not portable. There are Linux-specific extensions for many things, but not everything has one. There is also nothing wrong with needing an alternative to the standard library if you have more niche requirements.