Linux 7.0 Broke PostgreSQL: The Preemption Regression Explained

(read.thecoder.cafe)

133 points | by 0xKelsey a day ago

71 comments

  • ameliaquining a day ago

    This post comes uncomfortably close to plagiarizing https://thebuild.com/blog/2026/04/23/preempt_none-is-dead-yo..., which it cites as a source; almost all the technical explanation is in there and some of the wording is extremely similar. Compare, e.g., "What Linux 7.0 actually changed" in Pettus's post to "What Is Preemption?" in this one. I think this link should have been to Pettus's post instead.

    • jdonaldson a day ago

      That post comes uncomfortably close to how Opus writes this kind of prose. It's a good idea to acknowledge all stakeholders.

    • teivah 21 hours ago

      I used that post as a source, yes, and that's stated explicitly, but it's not the only one. One section in particular is similar, since we both present the different preemption modes. However, the audiences are different: thebuild.com's audience is composed of PostgreSQL enthusiasts (if not experts), and mine isn't. So a significant part of my post was about explaining things from first principles (what a page is, a TLB, a spinlock, etc.). I explain far more "basic" things, and he goes beyond me in terms of how to cope with the problem. I don't think the posts are that close.

    • galkk 21 hours ago

      After your comment I went to the original, and it really does look like an AI-assisted rewrite with a prompt like "give more explanations about basic concepts"…

      • 20 hours ago
        [deleted]
  • fulafel 21 hours ago

    > PREEMPT_NONE: The kernel almost never interrupts a running thread

    This seems confused. These are options for the preemptibility of the kernel, which is a relatively modern feature. Userspace could always be preempted, and these options don't change anything there. The kernel must in any case frequently interrupt threads and processes to implement preemptive multitasking, which Linux has of course had since the beginning.

    Read more, e.g., at https://lwn.net/Articles/944686/ or in the help texts at https://github.com/torvalds/linux/blob/master/kernel/Kconfig...

  • singron 20 hours ago

    This has the wrong explanation of the proposed rseq (Restartable Sequences) solution.

    > a Linux kernel facility that lets userspace code detect whether it was preempted or migrated during a critical section and restart it if so. PostgreSQL's spinlock paths would use rseq to detect preemption and retry, avoiding the scenario where a preempted lock holder stalls all waiting backends.

    The real proposal is about time-slice extension, a feature that uses the rseq ABI but otherwise has nothing to do with retrying critical sections. While a process holds an s_lock, it would set a request bit. If the kernel tries to preempt that thread while the request bit is set, it instead extends the time slice once and returns control to the thread. It's further explained here: https://docs.kernel.org/userspace-api/rseq.html

  • ozgrakkurt a day ago

    It is a crime that postgres isn't able to allocate with 1GB huge pages by changing a config parameter in 2026

    Also a crime that people are still running databases with 4KB pages.

    To put it in perspective, this means more than 30 million pages on a server with 128GB of RAM. As an example, if there are 16 bytes of metadata per memory page, the metadata itself takes more than half a gigabyte.

    • dezgeg a day ago

      Even worse, the actual struct page on Linux is 64 bytes, so 4x your example

    • ldargin 21 hours ago

      Database systems lock pages when writing to them, to maintain integrity. Using 1GB pages would cause excessive blocking in many if not most transactional databases.

      • jlokier 20 hours ago

        In database engines that use page locks, the locked page size can be different from the file/mapped/allocated page sizes. If you still have excessive locking while using smaller page locks, there are other ways to reduce contention as well, such as CoW to protect concurrent reads, deferred write-merging to assist concurrent writes, and the storage equivalent of RCU.

      • andrewf 17 hours ago

        I don't think hardware page size has to match database page size. It would if Postgres was mmap'ing it. https://www.postgresql.org/docs/current/runtime-config-prese... says the database page size is 8KB by default.

    • anarazel 16 hours ago

      > It is a crime that postgres isn't able to allocate with 1GB huge pages by changing a config parameter in 2026

      It is able to? Configure huge_page_size=1GB?

      Support for 2MB pages was added in 2014, for larger pages 2020.

      Edit: year details.

      • ozgrakkurt 10 hours ago

        Didn't know that, thanks. Sorry for the wrong comment.

    • bonzini a day ago

      There is 64 bytes of metadata per memory page indeed.

    • andrewstuart a day ago

      Sensible defaults would be nice.

  • selckin a day ago ago
    • neogodless 19 hours ago ago

      Expanded:

      AWS engineer reports PostgreSQL perf halved by Linux 7.0, fix may not be easy (phoronix.com) ~24 days ago, 165+ comments

  • MBCook a day ago

    This only happened under a very odd configuration. Yeah it wasn’t great but it was not the normal case.

    The headline implies it broke PG everywhere. It didn’t.

  • buster a day ago

    I'd rather know whether any real-world usage broke before concluding that an edge-case synthetic benchmark is worth changing the kernel back (or whatever), given that the change that broke the benchmark supposedly had real-world benefits.

    Since we will never know, it might be a good idea to feature-gate the change, flip the default, and let users decide to change it back. That might give the LKML (or whoever) some feedback on whether the change is worthwhile.

    • nijave a day ago

      "synthetic benchmark" is doing some heavy lifting here. Pgbench just runs a bunch of SQL statements against a real Postgres instance.

      It's very close to a real world simulation of a production workload

      • buster a day ago

        I am not questioning the benchmark. But the benchmark is NOT measuring a real-world application in a real-world setting. Anyway, I am merely wondering IF there is a company out there affected at all. I understand that this was only measured on a Graviton 4 setup under very heavy load, without huge tables.

        For example, this issue aside, I'd naturally rather split such a workload into multiple smaller instances, because the impact of a crash on this single-node, heavy-load, many-cores, many-clients scenario would be huge.

  • ahartmetz a day ago

    PREEMPT_LAZY triggering on page faults seems like a bad idea in light of this. It is probably not a good idea to suspend processes right when they get unexpectedly bogged down. The logic makes a little more sense for syscalls that are expected to take long compared to a scheduling quantum (a few milliseconds). But page faults are mostly invisible and unplannable.

    It only took a few decades for Linux to get a good CPU scheduler and good I/O schedulers, too. I don't get how such an important area can be so bad for so long. But then, bad scheduling is everywhere. I find it to be a pretty fun area to work in, but, judging by how much it is less than half-assed in much existing software, most developers seem to hate dealing with it?

    • AlienRobot 21 hours ago

      One thing I miss from using Windows is that the desktop didn't just freeze completely if you ran out of RAM.

      At first I thought that maybe Linux doesn't have ways to give priority to the desktop environment (a.k.a. "graphical shell") which is why running out of RAM means your cursor starts lagging, clicking on things stops working, etc.

      But maybe Linux is just bad at that in general and a single process eating too much RAM can simply bring the whole system to a halt as it tries to move and compress RAM to a pagefile on an HDD (not SSD).

      Every time it happens to me I just find it incredible. Here I am with a PC with multiple cores and multiple processors, and a single process eating all the RAM can bottleneck ALL of them at once? Am I misunderstanding something? Shouldn't it, ideally, work in such a way that as long as one processor is free, the system can process mouse input, render the cursor, and do all the desktop stuff no matter what I/O is happening in the background?

      Since it's Linux maybe it's just my DE/distro (Cinnamon/Mint). Maybe it does allocations under the assumption there will always be a few free bytes in RAM available, so it halts if RAM runs out while some other DE wouldn't. But even then you'd think there would be a way to just reserve "premium" memory for critical processes so they never become unresponsive.

      I wonder if other people have the same experience as me. This part of Linux just always felt fundamentally poor for me.

      • rcxdude 21 hours ago

        This issue is much worse if you don't have swap. What happens, I think, is that as memory allocated by processes grows to fill the available RAM, it starts to push out memory that doesn't technically need to be in RAM, like cached file pages. Which accounts for some of the slowdown, until it reaches the code itself, which is 'just' a memory mapped file. So eventually most of the code that is actively trying to run is being pushed out of RAM and must be loaded in as it executes, slowing everything to a crawl and generally creating a death spiral. If you have swap the kernel can decide to put other pages onto disk and keep the more important stuff in memory. Or you can run something like early-oom which stops things from getting to that point in the first place (albeit in a somewhat brute-force manner).

        Dealing with low-memory situations elegantly is pretty hard. Firstly, Linux uses memory overcommit by default, in part because the semantics of fork imply very large memory commitments which are almost never realised, and in part because a lot of software does the same since it's the default. Secondly, handling allocation failures is often tricky and ill-tested, and often requires coordination between different systems. The DE could, in principle, put running applications in a container that would prevent them from using more than a certain amount of memory, but the results are similar to early-oom, in that reaching the limit almost certainly means terminating the process using the most memory.

        • AlienRobot 20 hours ago

          Yes, but the problem, I feel, is the priority of what gets pushed out of RAM on Linux.

          You could split the processes into 2 categories:

          1: applications that are doing tasks the user wants.

          2: OS processes that the user needs to interact with in order to terminate applications.

          There is an argument for applications taking priority: the user wants to do a task, and if you move an application out of RAM, the task is going to take longer.

          But to me OS processes, including the graphical shell (taskbar, windowing system, etc.), should have priority: if an application hangs on I/O, the user NEEDS to be able to use the taskbar in order to terminate the application, otherwise they're going to have to wait who knows how long for the application to finish its task (or just hard reset the computer).

          I don't know anything about how Linux handles memory, but the impression I have is that it has its priorities wrong, or it may not even have a way to configure priorities (unlikely), or maybe there is a way to prioritize what is kept in memory but it only distinguishes kernel/userspace memory, so DEs that sit in userspace don't get priority (i.e. it's inadequate for a graphical operating system).

          To be frank, as a desktop Linux user my biggest fear is that the Linux kernel is perfectly capable of prioritizing kernel/userspace memory, but it has no way to prioritize DE's. In other words, that the "graphical OS" use case of Linux is a second-class citizen, a feature bolted on top of GNU/Linux/Systemd. Because that would mean a lot of things are considered only from the perspective of a Linux server. This is only my imagination talking, since I'm not really involved with how Linux works. But to be fair I was never involved with how Windows worked either, and I never doubted it considered desktop a primary use case.

          • singron an hour ago

            A specific process can use mlockall to keep all its mapped memory resident and prevent swapping or page cache eviction. That's what earlyoom does so that it can stay responsive when memory gets low. It's unfortunately underutilized in other infrastructure. It's also all-or-nothing: everything stays resident until it's munlocked regardless of how frequently it's used.

            I had hoped that something like Linux Pressure Stall Information (PSI) would become more useful for low-memory scenarios. E.g. you could put critical processes in a cgroup that could rate-limit swap-outs/evictions so that it was always responsive. There are some cgroup knobs that affect reclamation, but you need a really good guess about how much memory something needs, which makes it hard to use.

            https://docs.kernel.org/accounting/psi.html

          • rcxdude 4 hours ago

            The issue is, if you don't have swap, then it's not a matter of prioritisation: there are some things in RAM that can't be pushed out, regardless of how unimportant they are, so the only recourse is to terminate some processes, usually the ones using the most RAM. (The kernel by default tries really hard not to do this, which is why there are userspace applications like earlyoom to do it; userspace is probably the better place for such logic.)

            And yeah, you can adjust the priorities and the latency/throughput tradeoff (even per-application, to some extent), but it's a difficult thing to get right in general (what works for one use case might make another a lot worse). I don't know of any DE that really tries to adjust this, though (not because the kernel can't, but probably because no one on the DE side has really prioritised it, or they have tried and it hasn't made a noticeable difference).

      • jcgl 21 hours ago

        Same experience here. Linux admin. I’d absolutely love to be told I’m holding it wrong, but all I can see is that there’s no way to hold it right.

        Your consternation is seconded.

        • baq 20 hours ago

          It’s even worse than that… you can hard lock a system with significant freeable memory left if you have insane vm.dirty_* settings (which is of course the case by default)

        • rcxdude 21 hours ago

          The two mitigations are:

          - (somewhat counterintuitively) have swap enabled

          - run something like earlyoom to stop the system from reaching a low-RAM situation in the first place

        • ahartmetz 18 hours ago

          zram + no swap is a surprisingly workable workaround IME. The system slows down by a factor of ~100 instead of ~100,000, which lets you kill the offending process in a few seconds, or have the OOM killer do it faster than a reboot would.

      • nijave 20 hours ago

        More aggressive oomkiller and cgroups have helped in recent years

        Edit: systemd-oomd is what I was thinking of

    • bobmcnamara a day ago

      Userspace spinlocks seem like a risky idea too.

      What if it was on a VM and the core holding the lock got descheduled from the hypervisor?

  • nijave a day ago

    Right on the heels of 6.19 breaking tcmalloc and Mongo

    • matharmin a day ago

      Yup - interesting to see so much written about Postgres having a performance regression on Linux 7.0, in a scenario that affects almost no-one in practice. Meanwhile MongoDB refuses to run at all on Linux 7.0 due to some issue with tcmalloc.

      https://jira.mongodb.org/browse/SERVER-121885

      • duskwuff 21 hours ago

        The underlying tcmalloc issue is interesting - the library was relying on an implementation detail of the rseq kernel API which was never guaranteed, and which already generated warnings in previous versions.

        https://lore.kernel.org/all/20260126204745.GP171111@noisy.pr...

        • nijave 19 hours ago

          Implemented behavior of the interface vs documented behavior of the interface

          I thought the warnings were only generated when you turned on a kernel config "that no one uses in practice"

  • nubinetwork 13 hours ago

    They got rid of PREEMPT_NONE? Just a while ago they got rid of SLAB, and the noop I/O scheduler before that. Why do they insist on removing features just because they don't make sense on a desktop, according to some random bozo? Not everyone is running a dyntick laptop.

  • fabian2k 21 hours ago

    That regression is maybe most useful as a reminder to configure huge pages for PostgreSQL. That's the one recommended basic performance-tuning step that is just annoying enough to set up that I suspect many people with smaller DBs skip it.

    Though I actually don't know how large shared buffers has to be for huge pages to make a noticeable difference.

  • jeltz a day ago

    Moderators should change this headline because it is nowhere near true. It only regressed performance on some incorrect configurations.

    • nijave 19 hours ago

      What is considered incorrect?

      Edit: It may not be optimal or recommended config but I was under the impression it's very close to default config. As far as I know, most popular distros are shipping with no hugepage pool reservation and shared memory transparent hugepages disabled.

  • ApolloFortyNine a day ago

    I can't help but think of the classic XKCD about breaking a user's workflow [1].

    Doing some research, though, a spinlock actually doesn't seem as unusual a hack as it first appears. Do drivers and the like not have similar issues because they don't trigger page faults, I guess?

    [1] https://xkcd.com/1172/

    • doubletwoyou 20 hours ago

      From what I understand userspace spinlocks are particularly hazardous whereas in-kernel spinlocks are the norm

  • cachius 20 hours ago

    The last time a Linux upgrade broke PG was the xz backdoor.

  • baq a day ago

    TLDR of the LKML thread: a 120GB-RAM postgres with hugepages=off; lock contention went from terrible to abysmal. Nothing to see here, except that Amazon for whatever reason runs DB tests with huge pages disabled. (Hope I'm not paying for RDS and Auroras like that in production!)

    • Twirrim a day ago

      Huge pages have had a spotty history, which led to people being paranoid about them, and no doubt a whole bunch of folks just disable them "because that's what we've always done". They have been stable and reliable for quite a while now; I'd really hope folks could move away from that perspective.

      • jeltz a day ago

        Are you sure you are not thinking of transparent huge pages? They have a spotty history but you are supposed to run big PostgreSQL instances with huge pages, not transparent huge pages.

      • nijave a day ago

        I tested it once, about two years ago, on an Azure VM and got a nice 10-15% perf boost in pgbench (I want to say with at least 64GB of shared memory).

      • lstodd a day ago

        I remember when support for them first appeared: you had to LD_PRELOAD a shim, IIRC, to make Postgres actually use them. We jumped on it, enabled them immediately, and got a pretty significant boost, around 15-20%, yes.

        That was, idk, 2008-9-ish? I don't know what spotty history you are talking about; if you have multi-gigabyte address spaces floating around on a machine, it's stupid not to use huge pages.

    • nijave a day ago

      In fairness, AWS could (and almost certainly is) using their own kernel build that does who-knows-what

    • andriy_koval 18 hours ago

      > nothing to see here except that amazon for whatever reason runs DB tests with huge pages disabled

      Do you consider running with huge pages disabled a discouraged config? If the data doesn't fit into memory, a single lookup will read multiple NVMe pages instead of one, which could lead to a significant regression.

    • mplanchard a day ago

      Also was only on ARM, wasn’t it?

      • pavon 19 hours ago

        I think that ended up being a red herring. It just happened to be the case that the ARM test had huge pages disabled while the AMD64 test had them enabled.

    • dist-epoch a day ago

      Many people have desktops with 128 GB RAM. Should they enable hugepages? I've never heard this recommendation for a desktop.

      • nijave a day ago

        Huge pages are good when a single process reserves a giant block of memory, which I think isn't that common.

        You might have transparent huge pages on by default depending on the distro

      • baq 21 hours ago

        If they're running any sort of VM (which they probably are, with that amount of RAM), they absolutely should, and they should also consider pre-reserving them.

  • dataflow a day ago

    An X% performance regression is basically a (100 - X)% feature breakage, so whatever that implies in terms of breaking userspace...

  • PunchyHamster a day ago

    Seems Linus needs to yell at someone again.

    Especially with containers around, you might very well hit the case of running a new kernel but an older version of PostgreSQL with no code mitigation for the problem.

    • nobleach a day ago

      I get that folks love a good Linus rant. But as someone who's been on the receiving end of that style of "feedback", nothing can be more humiliating or demotivating. Certainly there are contributors making "rookie mistakes". There are folks who aren't willing to ingest the entire context of what was tried back in 2.0.36, 2.2, 2.4... etc. And perhaps it's wise to simply stay away until you're completely certain you've got the chops to contribute. More than half the folks that enjoy that sort of abuse don't have those chops.

      I can defend someone who is unwilling to yield on quality. After all, this truly is his baby. But issuing scathing rebukes to well-intentioned contributors is like slapping my kid when he brings me the wrong type of screwdriver.

      • ecshafer a day ago

        I don't think a Linus rant ever hit anyone who was a rookie; they are always, AFAIK, aimed at people "who should know better": veteran developers with multiple commits merged.

      • slackfan a day ago

        Code quality does not care about your feelings.

        • nijave 19 hours ago

          No, but code quality can suffer if you piss off all the competent people and they leave

        • 20 hours ago
          [deleted]
      • themafia a day ago

        > scathing rebukes

        Would you be able to point one out?

        > to well-intentioned contributors

          This is a system used and relied upon by billions of people around the world. Your intentions, while good, are not material to the problem. Put another way: we have an endless supply of people with "good intentions", but no comparable abundance of people with "good skills."

        • vogelke 20 hours ago

          https://lwn.net/Articles/343828/ describes Alan Cox trying to fix the TTY layer, being trashed by Linus, and removing himself from the maintainer page.

          • themafia 19 hours ago

            I find it hard to call this a "scathing rebuke:"

            https://lkml.org/lkml/2009/7/28/373

            It also didn't just happen out of the blue. Alan had already been working on the kernel for 15 years, was an employee of Red Hat at the time, and his wife's health was starting to fail.

            If you follow the thread it goes back and forth across quite a few messages with frustration building on both sides with Alan ultimately deciding to step away from a single (and very hairy) subsystem.

      • colechristensen a day ago

        If you're at the level of delivering to Linus, I'm sorry but humiliation and demotivation are earned.

        You don't talk like this to junior or even senior engineers, but you do reach a level at which gently telling isn't necessary.

        If you don't like it go fork Linux and try being the nice benevolent dictator and we'll applaud your success.

    • panny 20 hours ago

      This was my first thought. How long was the Linus rant?

    • bonzini a day ago

      Nope, there was and will be no yelling.