> Legend says it only happens in obscure, low-level systems, but I'm here to refute the legend.
In a somewhat self-deprecating, but also "anyone can do this if they have the will," fashion, I tend to say my real skill in performance tuning is stubbornness. Orphaned gains can easily be responsible for half the runtime of your average operation, sometimes as much as two thirds. The first 20% is fairly straightforward; the rest is just hard work. There are always four smartasses willing to misquote a Knuth aphorism they never understood, but the fact is, if your customers hate how slow the product is, and more to the point, if your marketing people are talking about it, it's time to get off your asses and do the hard work, because your shit is broken.
Yes, there's specialized knowledge, but is it really any harder than understanding how databases work? Probably not. But if what is described in the quote above is a "legend," then either things have gotten a lot worse while I wasn't looking, or I'm underplaying the value of those skills in the overall mix, or both.
The biggest memory-pressure problem I ever fixed: we were running the same DB query twice and intersecting two filters of that data, so I deduplicated the query. I was expecting about a 2x improvement, but instead I got 10x. In retrospect I should have hoped for maybe as high as 4x due to the extra short-circuiting on the data set, but not 10x. The rest was all better data locality.
The irony of that situation is that I had been pulled in as a fixer and asked if I could figure out how to get a request taking thirty seconds down to three. This was supposed to be my opening salvo in a several week drill down and a series of improvements going from 15 to 10 to 8 to 7 seconds etc and prepping to apologize when I was only able to get it down to five, maybe six seconds. And instead I was done in like two days. But it did serve my larger thesis that we were using caching wrong. Data passed on the call stack doesn’t have to be looked up a second time. Or a third, or a fifth.
It feels like my era of education, 2012-2020 (a couple of degrees over that time), really deemphasized perf tuning; I even heard a few times that it was practically useless these days.
I had a computer organization course that came close but mostly just described microarchitecture and its historical development, not so much the practical ways to exploit it.
Actually taking the time to sit down and poke around with these techniques was mind-blowing. I grew up during the golden age of CPU and OS advancements (the 90s and 00s), and the rush from seeing 'instructions per cycle' > 1 captured a bit of that magic that CRUD app dev and wrangling k8s just don't have.
It's not worth optimising if you don't have a problem. Focus your effort.
If this code runs once per week at midnight, needs to finish by 5am, and currently it takes 18 minutes, the fact it could take 40 seconds isn't actually important and so spending meaningful engineering effort to go from 18 minutes to 40 seconds is a waste.
On the other hand, if the code runs on every toaster when it's started and ideally would finish before the toast pops up, but currently takes 4 minutes, then even getting it down to 2.5 minutes will make more customers happy [also, why the fuck are we running software in the toaster? But that's beside the point] and might well be worth doing.
The classic UX examples given are much closer to the latter category. When I type fast, the symbols ought to appear immediately, for example; if you can't do that, then you have a performance problem and optimisation is appropriate. But so much of what software engineers do all day isn't in that space and doesn't need to prioritise performance, so optimisation shouldn't be a priority.
In particular Fast but Wrong is just Wrong. https://x.com/magdraws/status/1551612747569299458
>so much of what software engineers do all day isn't in that space
Seems to me that critical infra that supports a lot of modern computing is in that space though.
If you want to develop that depth of knowledge, you need to go into HPC/scientific computing, trading, or accelerator hardware. I didn't get into this sometimes-crazy industry to NOT learn stuff and push the limits of my computer.
I’m glad I know about those applications now, but I wonder how much of a disservice we did to the industry by just focusing on frameworks and abstraction especially now that you can just sling a lot of that out with a prompt…
In the era of kubernetes and edge servers and everything running on battery power, that distinction between need and want becomes much fuzzier because of course we can bin pack the more efficient one better or preserve another five minutes of standby time even if the wall clock behavior is moot.
And I'd also argue that if you wait to use a skill until the need is dire, you will 1) be shit at doing it and fail to achieve your goal, and 2) not have spent enough time on the cost/benefit analysis to know when things have crossed over from want to need. Like the blind people I allude to in my top-level comment.
I went to a top-ten school. I had one semester of circuit design, one of EE, and a couple of computer architecture courses that went over MIPS and writing assembly.
I think there was some sort of curriculum transition going on with the introductory classes, though, because the difficulty from one homework assignment to the next in that first year of CS 1XX classes was pretty choppy. A friend and I made a game of one-upmanship, adding our own constraints to the easier assignments to make them more interesting, like taking larger inputs than the requirements called for and timing execution.
At my first job out of school, the application was glacially slow, and I learned half of what I know about optimization in a short stint there through trial and error. It was a couple of jobs in before I ever got pushback and had to learn the human-factors element. But optimization, balanced against readability, robustness, and extensibility, was a way I have always made pedestrian work more interesting. There are whole classes of code smells that also carry performance penalties, and at the peak of my restlessness I needed those to keep my sanity without irritating coworkers. I'm just cleaning up this messy code, nothing to see here.
Reading release notes for other tools bragging about their improvements also helps. Dev tools and frameworks are more forthcoming about the how and what than consumer apps, but there are standouts from time to time. I read a ton of SIGPLAN proceedings during that era. Fortune favors the prepared mind, and you look a lot smarter when you confront a problem or opportunity with a primed pump rather than coming in cold (being friendly with other disciplines in your company also helps there).
Post can be summarised quite succinctly:
Everything was slow because sorting was taking a lot of time. Sorting was slow because its comparator was taking ~6 read locks on every comparison and was cloning large structures to avoid holding the locks for long. The first fix was to access just the information needed, avoiding the clones; the second fix was to cache exactly the data needed for sorting whenever the underlying data was updated, and to use that in the comparators without taking the underlying lock.
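A minimal sketch of that second fix, with hypothetical names (the real matrix-sdk types are more involved): recompute a plain sort key once per update, so the comparator only ever reads cached data, with no locks and no clones.

```rust
use std::cmp::Reverse;

// Hypothetical stand-in for the article's room type; the real one keeps its
// fields behind locks, which is what made each comparison expensive.
struct Room {
    name: String,
    recency: u64, // e.g. timestamp of the latest event
}

// The cached sort key: computed once per update, read many times per sort.
#[derive(PartialEq, Eq, PartialOrd, Ord)]
struct SortKey {
    recency: Reverse<u64>, // most recent first
    name: String,
}

struct RoomEntry {
    room: Room,
    key: SortKey, // refreshed whenever `room` changes, never during a sort
}

impl RoomEntry {
    fn new(room: Room) -> Self {
        let key = SortKey { recency: Reverse(room.recency), name: room.name.clone() };
        RoomEntry { room, key }
    }

    // Called on update: recompute the key once, outside any sort.
    fn touch(&mut self, recency: u64) {
        self.room.recency = recency;
        self.key.recency = Reverse(recency);
    }
}

fn sort_rooms(rooms: &mut [RoomEntry]) {
    // The comparator only reads plain cached data: no locks, no clones.
    rooms.sort_by(|a, b| a.key.cmp(&b.key));
}

fn main() {
    let mut rooms = vec![
        RoomEntry::new(Room { name: "alpha".into(), recency: 10 }),
        RoomEntry::new(Room { name: "beta".into(), recency: 30 }),
        RoomEntry::new(Room { name: "gamma".into(), recency: 20 }),
    ];
    rooms[0].touch(40); // alpha just received a new event
    sort_rooms(&mut rooms);
    let names: Vec<&str> = rooms.iter().map(|e| e.room.name.as_str()).collect();
    assert_eq!(names, ["alpha", "beta", "gamma"]);
}
```

The trade, of course, is the consistency problem alluded to below: the cache must be invalidated on every path that mutates the underlying data.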
I'm looking forward to the next post about how cache consistency is tough.
By far my favorite feature of lodash is the sortBy function: instead of taking a custom comparator, as most standard libraries' sort() does, it lets you supply a simpler derived object for the comparison to chew on. If your comparator needs to go beyond a couple of nested conditionals, into actual data transforms or grabbing locks, then that nasty little log n factor in the runtime can take your sort from using 20% of the time budget of an expensive operation to taking 120%. Especially when you consider the memory bandwidth sloshing around L3 or the main memory bus to do the full table scan log n times.
I think the world would be better off if this weren't in a third-party library in a single programming language. IIRC Ruby is the only language I know of with it as a built-in.
Rust's slice type [T] provides [T]::sort_by_cached_key, a stable sort which lets you provide a key-transform callable f; it will call f at most once per item being sorted, then sort by comparing the (cached) results.
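Concretely, it looks like this (the key closure here is trivial, but it stands in for an expensive transform; it runs once per element rather than once per comparison):

```rust
fn main() {
    let mut words = vec!["pear", "fig", "watermelon", "kiwi"];

    // The key function is called at most once per element; the sort then
    // compares the cached keys instead of recomputing them in each of the
    // O(n log n) comparisons. Worth it when the key is expensive.
    words.sort_by_cached_key(|w| (w.len(), w.to_string()));

    assert_eq!(words, ["fig", "kiwi", "pear", "watermelon"]);
}
```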
https://doc.rust-lang.org/std/primitive.slice.html#method.so...
However, ..._by_cached_key is not provided for Rust's unstable sort, because the unstable sort doesn't require an allocator, and a key cache necessarily does.
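If you want the cached-key behavior on top of the unstable sort, you can hand-roll the decorate/sort/undecorate dance yourself; the Vec of pairs below is exactly the allocation std declines to make for you (the function name is mine, not std's):

```rust
// Hand-rolled cached-key sort over the allocation-free unstable sort:
// decorate with precomputed keys, sort the pairs, strip the keys back off.
fn sort_unstable_by_cached_key<T, K: Ord>(items: Vec<T>, key: impl Fn(&T) -> K) -> Vec<T> {
    // This Vec of (key, item) pairs is the cache: one key computation per item.
    let mut decorated: Vec<(K, T)> = items.into_iter().map(|t| (key(&t), t)).collect();
    decorated.sort_unstable_by(|a, b| a.0.cmp(&b.0));
    decorated.into_iter().map(|(_, t)| t).collect()
}

fn main() {
    let sorted = sort_unstable_by_cached_key(vec!["watermelon", "fig", "kiwi"], |w| w.len());
    assert_eq!(sorted, ["fig", "kiwi", "watermelon"]);
}
```

Note the caveat in the name: with equal keys, the unstable sort may reorder ties, which the stable sort_by_cached_key won't.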
Yeah, this is a function called sort_by, but I don't think it's doing the same thing:

    let (initial_values, stream) = (initial_values, stream)
        .filter(filter)
        .sort_by(new_sorter_lexicographic(vec![
            // Sort by latest event's kind.
            Box::new(new_sorter_latest_event()),
            // Sort rooms by their recency.
            Box::new(new_sorter_recency()),
            // Finally, sort by name.
            Box::new(new_sorter_name()),
        ]))
        .dynamic_head_with_initial_value(page_size, limit_stream);

That's an API that would make an OG Java developer get tingles.

sortBy should lock each object once, and I'm reasonably sure this is happening at least three times here. The author ends up approximating _.sortBy() at the bottom by introducing a struct.
This is well known in Perl as the Schwartzian transform.
IMO, there are a lot of smells in this code not addressed in the article. I only skimmed, and still, here are a few:
1. They represent a single room change with this sequence of three operations:

VectorDiff::Set { index: 3, value: new_room } because of the new "preview",
VectorDiff::Remove { index: 3 } to remove the room… immediately followed by
VectorDiff::PushFront { value: new_room } to insert the room at the top of the Room List.

and I don't see any mention of atomic sequences. I think the room will momentarily disappear from view before being placed into the correct spot. That kind of thing would drive me nuts as a user. It suggests to me this is not the right abstraction.

Also, if you are actually representing the result with a vector, it's O(n), so from a performance perspective it's not great if the vector can be large: you're shifting everything in [3, n) one spot forward and then one spot back, unnecessarily. If there were a `VectorDiff::Move`, you'd only be shifting 3 elements (the distance moved). That could still be the full length of the list, but usually wouldn't be. Something like a `BTreeSet` would make it actually O(lg n).
2. Taking a lock in a comparison function (they call it `Sorter`, but the name is wrong) is a smell for correctness as well as performance. Can the values change mid-sort? Then the result is non-deterministic. (In C++ it's actually undefined behavior to use a non-deterministic comparator. In Rust it's safe but still a bad idea.) You just can't sort values while they're changing, full stop, so inner mutability in a list you're sorting is suss. [edit: and for what? within a client, are you seriously doing heavy mutations on many rooms at once? or is a single lock on all the rooms sufficient?]
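A minimal sketch of the lock-free alternative (names are mine, not the article's): acquire each item's lock exactly once to build an immutable snapshot, then sort the snapshot, so the comparator never touches a lock and concurrent mutation can't make the sort non-deterministic; it just sorts a slightly stale view.

```rust
use std::sync::{Arc, RwLock};

// Hypothetical shared item: mutable from other threads, hence the lock.
struct Shared {
    id: u32,
    score: i64,
}

fn sorted_ids(items: &[Arc<RwLock<Shared>>]) -> Vec<u32> {
    // One lock acquisition per item, outside the sort.
    let mut snapshot: Vec<(i64, u32)> = items
        .iter()
        .map(|item| {
            let guard = item.read().unwrap();
            (guard.score, guard.id)
        })
        .collect();
    // The comparator reads only plain snapshot data: no locks taken here.
    snapshot.sort_by(|a, b| b.0.cmp(&a.0)); // highest score first
    snapshot.into_iter().map(|(_, id)| id).collect()
}

fn main() {
    let items: Vec<Arc<RwLock<Shared>>> = vec![
        Arc::new(RwLock::new(Shared { id: 1, score: 5 })),
        Arc::new(RwLock::new(Shared { id: 2, score: 9 })),
        Arc::new(RwLock::new(Shared { id: 3, score: 7 })),
    ];
    assert_eq!(sorted_ids(&items), [2, 3, 1]);
}
```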
3. The sorted adapter just degrades to insertion sort of changes right here: <https://docs.rs/eyeball-im-util/0.10.0/src/eyeball_im_util/v...> and decomposes what could have been an atomic operation (append) into several inserts. Even `Set` does a linear scan and then becomes a (non-atomic again) remove and an insert, because it can change the sort order.
4. The `.sort_by(new_sorter_lexicographic(vec![Box(...), Box(...), Box(...)]))` means that it's doing up to three dynamic dispatches on each comparison. The `new_sorter_lexicographic` is trivial, so inline those instead. And definitely don't take a separate lock on each, yuck, although see above anyway about how you just shouldn't have locks within the vec you're sorting.
I would never use these abstractions.
5. In their "dessert" section, they talk about a problem with sort when the items are shallow clones. It's an example of a broader problem: they put something into an `ObservableVector` but then semantically mutate it via inner mutability (defeating the "observable"). You just can't do that. The sort infinite loop is the tip of the iceberg. Everything relying on the observable aspect is then wrong. The lesson isn't just "jumping on an optimization can lead to a bug"; it's also that abstractions have contracts.
The useful distinction here is not just AoS vs. SoA; it is moving expensive work off the hot path. The biggest win in the article seems to be caching the sort/filter inputs so that lock-taking and cache misses happen on updates, not during every comparison. That is a very transferable lesson even if you never go full data-oriented design.
> So, yes, it takes time to read from memory. That's why we try to avoid allocations as much as possible.
This whole thing is summed up by some pretty basic physics. What you actually want to minimize is the communication of information between physical cores. Nothing else really matters; certainly not terminology or clever tricks that effectively try to cheat thermodynamics. The cost of communicating information is almost always much more substantial than the cost of computing over that same information. The ALU is not nearly as power-hungry as Infinity Fabric.
I've always wondered whether it would be possible/useful to model multi-core CPU caches as CISC cores, where the complex instructions are all the operations you can do within L1 in the time frame of an L2 access.
Minimizing core-to-core communication is a nice theory until you actually run workloads with unpredictable access patterns or legacy interfaces bolted on where you can't just flatten and pack everything for optimal cache locality. There's a whole separate world of pain once you start scaling out mutability and pointer-heavy structures: you get play-by-play races between TLB shootdowns, OS-level dirty page tracking and whatever the thread scheduler feels like that day.
In the old Windows NT days, the scheduler still wasn't that clever, so on multi-core servers it would throw processes around, thrashing the whole cache and CPU context.
Pinning processes to specific CPUs yielded quite visible performance gains.
What the hell do allocations even have to do with time to read from memory...
It seems all of these Rust performance warriors have no understanding of computer architecture fundamentals.
It's not Rust per se. It's a human trait. People make uneducated assumptions about topics all the time, and it's completely normal, since you have to bootstrap yourself somehow to be able to grow further. It's impossible to know everything. And it's also a difficult situation, because you do have to arrive at some conclusion. I do think, though, that in this particular case the assumptions made are a little all over the place and the tone is overconfident.
If an allocation falls in a forest and never gets written to, did it really happen?
We don't allocate things we don't write to. Maybe if the question sounds stupid, it's because you didn't appreciate it, not because other people are stupid.