What Makes System Calls Expensive: A Linux Internals Deep Dive

(blog.codingconfessions.com)

70 points | by rbanffy 2 days ago

10 comments

  • anonymousiam 2 days ago

    On a secure system (not exposed to the Internet, with all local users trusted), you can add "mitigations=off" to the kernel command line to greatly improve performance.

    https://fosspost.org/disable-cpu-mitigations-on-linux
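
    A quick way to gauge what the mitigations currently cost you is to see which ones are active. This is a sketch of mine (not from the linked article), using the standard sysfs and procfs paths:

    ```shell
    # List the mitigation status per vulnerability (Linux exposes one file each)
    grep . /sys/devices/system/cpu/vulnerabilities/* 2>/dev/null || true

    # Check whether the running kernel was already booted with a mitigations= setting
    grep -o 'mitigations=[^ ]*' /proc/cmdline || echo "mitigations: kernel default"

    # To disable on the next boot (GRUB-based distros; file location varies):
    #   append mitigations=off to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub,
    #   then: sudo update-grub && sudo reboot
    ```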

    • abnercoimbre a day ago

      This depends on the CPU. From the article you linked:

      > some CPUs like those in the AMD 7000 series can actually give a worse performance if mitigations are turned off.

      Due diligence!

  • blakepelton 2 days ago

    The article quotes the Intel docs: "Instruction ordering: Instructions following a SYSCALL may be fetched from memory before earlier instructions complete execution, but they will not execute (even speculatively) until all instructions prior to the SYSCALL have completed execution (the later instructions may execute before data stored by the earlier instructions have become globally visible)."

    More detail here would be great, especially using the terms "issue" and "commit" rather than "execute".

    A barrier makes sense to me, but preventing instructions from issuing seems like too strict a requirement; how could anyone tell?

    • eigenform a day ago

      > preventing instructions from issuing seems like too hard of a requirement

      If this were the case, you could perform SYSCALL in the shadow of a mispredicted branch, and then try to use it to leak data from privileged code.

      When the machine encounters an instruction that changes privilege level, you need to validate that you're on a correct path before you start scheduling and executing instructions from another context. Otherwise, you might be creating a situation where instructions in userspace can speculatively influence instructions in the kernel (among probably many other things).

      That's why you typically make things like this drain the pipeline - once all younger instructions have retired, you know that you're on a correct [not-predicted] path through the program.

      edit: Also, here's a recent example[^1] of how tricky these things can be (where SYSCALL isn't even serializing enough to prevent effects in one privilege level from propagating to another).

      [^1]: https://comsec.ethz.ch/wp-content/files/bprc_sec25.pdf

    • convolvatron 2 days ago

      It might have more to do with the difficulty of separating out the contexts of the two execution streams across the rings. Someone may have looked at the cost and complexity of all that accounting and said 'hell no'.

      • blakepelton 2 days ago

        Yeah, I would probably say the same. It is a bit strange to document this as part of the architecture (rather than leaving it open as a potential future microarchitectural optimization). Is there some advantage an OS has knowing that the CPU flushes the pipeline on each system call?

      • BobbyTables2 2 days ago

        And given Intel’s numerous speculation related vulnerabilities, it must have been quite a rare moment!!!

      • codedokode a day ago

        Is it that difficult to add a "ring" bit to every instruction in the instruction queue? Sorry, I've never made an OoO CPU before.

  • codedokode a day ago

    There are so many extra steps; the CPU is obviously designed for a legacy monolithic OS like Windows, which makes syscalls rarely, and it would perform poorly with microkernels that are much safer and better than Windows.

    For example, why bother saving userspace registers? Just zero them out to prevent leaks. Ideally with a single instruction.

  • pengaru 2 days ago

    Linux used to deliver relatively low syscall overhead, especially on modern, aggressively speculating CPUs.

    But after the Spectre+Meltdown mitigations landed, it felt like the 1990s all over again, when syscall overhead was a huge cost relative to the MIPS available.
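
    That overhead is easy to measure for yourself. Here's a rough micro-benchmark of my own (not from the thread or article): it times a cheap syscall through the raw syscall() wrapper, so libc can't short-circuit it in userspace. Compare runs with and without mitigations=off; absolute numbers vary widely by CPU.

    ```c
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    #define N 1000000

    /* Monotonic wall-clock time in nanoseconds */
    static long long now_ns(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1000000000LL + ts.tv_nsec;
    }

    int main(void) {
        long long t0 = now_ns();
        for (int i = 0; i < N; i++)
            syscall(SYS_getppid); /* a real kernel entry on every iteration */
        long long t1 = now_ns();
        printf("avg cost per syscall: %.1f ns\n", (double)(t1 - t0) / N);
        return 0;
    }
    ```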