Investigating Split Locks on x86-64

(chipsandcheese.com)

77 points | by ingve 5 days ago

5 comments

  • anematode 2 days ago

    Cool investigation. This part perplexes me, though:

    > Games have apparently been using split locks for quite a while, and have not created issues even on AMD’s Zen 2 and Zen 5.

    For the life of me I don't understand why you'd ever want to do an atomic operation that's not naturally aligned, let alone one split across cache lines....
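    One way a misaligned atomic arises without anyone asking for it, as a hypothetical sketch: a counter embedded in a packed struct lands at an odd offset, and whenever the containing object happens to sit near the end of a 64-byte cache line, a lock-prefixed RMW on that field straddles two lines. The `struct msg` layout below is invented for illustration:

    ```c
    #include <stdio.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical packed wire-format struct: the 1-byte tag pushes
     * the 64-bit counter to offset 1, so it is never 8-byte aligned. */
    struct __attribute__((packed)) msg {
        uint8_t  tag;
        uint64_t counter;
    };

    int main(void) {
        /* counter sits at byte offset 1 inside msg... */
        printf("offsetof counter = %zu\n", offsetof(struct msg, counter));
        /* ...so a msg starting 57 or more bytes into a cache line puts
         * an 8-byte atomic RMW on `counter` across two lines. */
        return 0;
    }
    ```

    Code that casts byte buffers to structs like this and then does `lock xadd` on the field gets a split lock "for free", which may be how games end up issuing them.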

  • lifis a day ago

    But why doesn't the CPU just lock two cachelines? Seems relatively easy to do in microcode: just sort the two lines by physical address with a conditional swap and then run the "lock one cacheline" algorithm twice, no?

    Perhaps the issue is that each core has a locked-cacheline entry for every other core, but even then, given the size of current CPUs, doubling that shouldn't be that significant. One could also add just a single extra entry guarded by a global lock, where that global lock only serializes the ability to lock a second cacheline.
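    A user-space analogy of the ordering lifis describes, as a minimal sketch (real hardware would do this in the coherence protocol, not in software; the `line_lock` type and helpers here are invented): acquire both per-line locks in ascending address order, so two cores split-locking the same pair of lines can never deadlock against each other.

    ```c
    #include <stdatomic.h>
    #include <stdint.h>

    /* One software "line lock" per cache line (illustrative only). */
    typedef atomic_flag line_lock;

    static void lock_line(line_lock *l) {
        while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
            ;  /* spin until the line is ours */
    }

    static void unlock_line(line_lock *l) {
        atomic_flag_clear_explicit(l, memory_order_release);
    }

    /* Acquire both lines in ascending address order -- the
     * conditional swap -- so concurrent two-line lockers always
     * take the pair in the same order and cannot deadlock. */
    static void lock_two_lines(line_lock *a, line_lock *b) {
        if ((uintptr_t)b < (uintptr_t)a) { line_lock *t = a; a = b; b = t; }
        lock_line(a);
        if (b != a)
            lock_line(b);
    }

    int main(void) {
        line_lock x = ATOMIC_FLAG_INIT, y = ATOMIC_FLAG_INIT;
        lock_two_lines(&y, &x);  /* passed out of order: swapped internally */
        unlock_line(&x);
        unlock_line(&y);
        return 0;
    }
    ```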

  • strstr 2 days ago

    Split locks are weird. It’s never been obvious to me why you’d want to do them unless you are on a small-core-count system. When split lock detection rolled out for Linux, it massacred perf for some games (which were probably min-maxing single-core perf and didn’t care about noisy-neighbor effects).

    Frankly, I’m surprised split lock detection is enabled anywhere outside of multi-tenant clouds.

  • sidkshatriya 2 days ago

    This article seems relevant to me for the following scenario:

    - You have faulty software (e.g. games) that happen to have split locks

    AND

    - You have DISABLED split lock detection and "mitigation", which would otherwise have hugely penalised the thread in question (making the split lock painfully evident to that program and forcing it to be fixed).

    AND

    - You want to see which CPU does best in this scenario

    In other words you just assume the CPU will take the bus lock penalty and continue WITHOUT the culprit thread being actively throttled by the OS.

    In the normal case, IIUC, Linux should helpfully throttle the thread so the rest of the system is not affected by the bus lock. In this benchmark the assumption is that the thread will NOT be throttled by Linux, via the appropriate kernel setting.

    So to be honest I don't see the merit of this study. It essentially measures how fast your interconnect is, i.e. whether it can survive bad software that is allowed to run untrammelled.

    On aarch64 the thread would simply be killed. It's possible to do the same on modern AMD / Intel, OR to simply throttle the thread so that its bus locks don't cause problems for other threads -- neither is done in this benchmark.
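    For reference, the behaviors sidkshatriya describes map onto Linux's split-lock knobs, if I have the names right (the boot parameter is documented in the kernel's admin guide; the sysctl was added around 6.2, so hedge on older kernels):

    ```shell
    # Kernel boot parameter controlling x86 split-lock handling:
    #   split_lock_detect=off    # ignore split locks entirely (the benchmark scenario)
    #   split_lock_detect=warn   # default: log and throttle the offending thread
    #   split_lock_detect=fatal  # SIGBUS the offending thread (the aarch64-like behavior)
    grep -o 'split_lock_detect=[a-z]*' /proc/cmdline

    # On recent kernels, warn-mode throttling can also be toggled at
    # runtime (1 = throttle the offender, 0 = only log):
    sysctl kernel.split_lock_mitigate
    ```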

  • Cold_Miserable 2 days ago

    It went from ~30 ns to ~2K ns (but mostly timed out) when I changed the alignment to +7.5 QWORDs on Golden Cove.
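    A quick sanity check of that offset arithmetic (assuming 64-byte cache lines): +7.5 QWORDs is 60 bytes, so an 8-byte locked operation covers bytes 60..67 and straddles the line boundary, which is exactly the split-lock case.

    ```c
    #include <assert.h>

    int main(void) {
        unsigned line = 64;                 /* cache-line size, assumed */
        unsigned off  = 7 * 8 + 4;          /* +7.5 QWORDs = 60 bytes   */
        unsigned last = off + 8 - 1;        /* last byte of an 8-byte op */
        /* first and last byte fall in different lines: a split lock */
        assert(off / line != last / line);
        return 0;
    }
    ```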