Tuning async IO in PostgreSQL 18

(vondra.me)

96 points | by fanf2 2 days ago

23 comments

  • fabian2k 2 days ago

    Interesting that the recommended default for performance is "worker" instead of "io_uring". The early posts about async IO in Postgres 18 read a bit differently; this post appears more authoritative to me, though. We'll see how this turns out in future releases.

    At least there are only two settings to look at for tuning, though they do seem to be the kind of settings you have to try out with your own DB and workload.
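
    If I'm reading the post right, the two knobs are io_method and io_workers; roughly this in postgresql.conf (illustrative values, and io_method needs a restart):

      io_method = worker    # or io_uring / sync
      io_workers = 3        # size of the shared pool of I/O worker processes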

    Right now async IO is used for sequential scans and bitmap scans, not for index scans. My initial guess would be that it mostly helps for complex queries (that use multiple indexes, so bitmap scans) and unoptimized queries (sequential scans), not so much for straightforward and/or optimized queries that use a single index. But this is really just a guess, I'm way out of my depth at this point. I'm curious how much it'll help once it is implemented for index scans as well.

    • nromiun 2 days ago

      The article explains this:

      > The default is io_method = worker. We did consider defaulting both to sync or io_uring, but I think worker is the right choice. It’s actually “asynchronous”, and it’s available everywhere (because it’s our implementation).

      I think a pool of workers is a sensible default. Many environments disable io_uring for security reasons anyway.
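
      On newer kernels you can at least check whether the host allows io_uring at all before experimenting; something like:

        # 0 = allowed, 1 = disabled for unprivileged processes, 2 = disabled entirely (kernel 6.6+)
        sysctl kernel.io_uring_disabled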

      • fabian2k 2 days ago

        Sorry, I was ambiguous in my statement there. I'm not surprised it's the default; that makes sense for all the reasons you state. I'm surprised this post also recommends "worker" instead of "io_uring" for performance in general. That is different from the impression I got from earlier blog posts, before Postgres 18 was released.

        • pgaddict a day ago

          I did a lot of tests comparing the io_method choices, and I've yet to see a realistic query where it makes a significant difference of more than a couple of percent (in either direction). I'm sure it's possible to construct such queries, and it's interesting, but for real workloads it's mostly not noticeable.

          At least that's how I see it right now; we'll see how that works on a much wider range of hardware and systems. The GitHub repo linked from the pgsql-hackers post has a lot more results, and some charts include results for the index prefetching patch - there it makes more of a difference in some cases. But the patch is still fairly rough, so it could be a bug in the patch too, and it has changed a lot since August.
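
          If you want to check your own workload, switching between the methods is easy - roughly (io_method requires a restart, and the query is whatever you care about):

            psql -c "ALTER SYSTEM SET io_method = 'io_uring'"    # or 'worker' / 'sync'
            pg_ctl restart -D "$PGDATA"
            psql -c "EXPLAIN (ANALYZE, BUFFERS) SELECT ..."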

          • bluetech a day ago

            Which Linux version did you use for the io_uring tests? Unless I missed it, I don't see it mentioned.

            • pgaddict a day ago

              Debian 12/13, with kernel 6.15.

              Sorry, should have mentioned that in the blog post.

        • taeric a day ago

          My gut is that the recommendation flows from wanting a portable solution with more controls? That is, io_uring is limited both in target deployments and in what you can do to adjust for performance.

          The first is somewhat obvious, given how new io_uring is and how it is specifically a Linux thing. I don't think that is necessarily bad. My guess is you can get really good performance with relatively little code compared to other options.

          The second, though, is a bit tougher to weigh. For one, it will be more code: managing workers means maintaining code that you probably don't have on your radar. That said, you have full control over the workers, so you can build work queues with different priorities in ways that I don't see how you could with io_uring.

        • film42 a day ago

          My guess is both will look about the same with real-world workloads. Worker is certainly more predictable, which is safer in general. That said, I appreciate the callout about signal throughput on workers (fewer connections farm out to more processes, vs. each connection getting its own io_uring setup with the upper bound being the throughput of a single process). Again, I doubt it makes any difference for 99.9999% of apps out there.

    • pgaddict a day ago

      > Right now async IO is used for sequential scans and bitmap scans, not for index scans. My initial guess would be that it mostly helps for complex queries (that use multiple indexes, so bitmap scans) and unoptimized queries (sequential scans), not so much for straightforward and/or optimized queries that use a single index. But this is really just a guess, I'm way out of my depth at this point. I'm curious how much it'll help once it is implemented for index scans as well.

      Those are good guesses, IMHO.

      For sequential scans, some of the "async" work could be done by kernel read-ahead, but AIO makes it explicit and loads the data into shared buffers, not just the page cache. For bitmap scans we already had prefetching via fadvise, which is somewhat similar to read-ahead (also into the page cache), but there were some bugs that made it ineffective in some cases; AIO fixes that.
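
      If you want to see the effect yourself, looking at EXPLAIN with BUFFERS while playing with effective_io_concurrency (which caps the prefetch distance) is a simple way to poke at it, e.g.:

        psql -c "SHOW effective_io_concurrency"
        psql -c "SET effective_io_concurrency = 64; EXPLAIN (ANALYZE, BUFFERS) SELECT ..."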

      For index scans the difference can be an order of magnitude (say, 5-10x). Doing random I/O block by block is simply awful; prefetching data is important. I was just doing some testing on TPC-H, and at scale 50 I see Q8 going from 130s to 20s, and Q19 from 50s to 8s, with smaller improvements for a couple of other queries. Of course, it depends on what else the query is doing - if it's only spending 1% of its time on the index scan, you won't notice a difference.

  • rtp4me a day ago

    I am looking forward to testing this feature on our DBs running in Azure (on our own VMs - not Azure's managed offering). We use ZFS heavily for compression (5:1 to 8:1) and tend to do lots of sequential scans. ZFS can hit about 150MB/sec on large scans, but with the new async I/O feature and worker queues, I am hoping we can double or triple our seq scan performance. Time for testing!

    • antonkochubey a day ago

      How is your experience overall running Postgres on a CoW filesystem? I thought that was something highly frowned upon - e.g. guides for Postgres on btrfs recommend setting chattr +C on the pg_data folder, essentially disabling CoW.
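
      For reference, the advice I've seen is along these lines (path just an example, and it only works if applied while the directory is still empty, since +C affects only newly created files):

        chattr +C /var/lib/postgresql/18/main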

      • rtp4me a day ago

        After lots of trial and error, we have everything running pretty well. We mainly use ZFS for the cost savings on virtual drives (compression=zstd). Previously we were using XFS and the DB sizes were 5TB or more; now we can use much smaller disks (1TB or so) and still get usable performance.

        That said, ZFS presents some challenges for a few reasons:

        - As you probably already know, PGSQL relies heavily on system RAM for caching (which it estimates via effective_cache_size). That said, the ZFS ARC and the OS page cache are NOT the same thing, so you need to take this into account when configuring PGSQL. We normally set effective_cache_size=512MB and use the `zfs_arc_min` and `zfs_arc_max` module options to adjust the ZFS ARC size (rough sketch after this list). We typically get a 95%+ hit rate on the ZFS ARC.

        - ZFS is definitely slower than XFS or EXT4, and it took a while to understand how options like `zfs_compressed_arc_enabled`, `zfs_abd_scatter_enabled`, and `zfs_prefetch_disable` affect performance. In particular, `zfs_compressed_arc_enabled` determines whether the cached data is compressed in RAM as well as on disk. When enabled, it can seriously affect latency, since the data has to be decompressed each time it is read or written. That said, a very nice side effect of `zfs_compressed_arc_enabled=1` is the amount of data that fits in the cache: from my understanding, if you get 5:1 compression on disk, you get about the same in the ARC. Thus, if you give ZFS 12GB of ARC, you hold roughly 60GB of data in the memory cache.

        - Getting ZFS installed requires building additional kernel modules against matching kernel header files. This is especially important if you update your kernel often (each new kernel requires the ZFS modules to be rebuilt and installed).
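
        To make the ARC and install bits above concrete, here is roughly what ours looks like (values are illustrative, not a recommendation):

          # /etc/modprobe.d/zfs.conf - cap the ARC and keep it compressed in RAM
          options zfs zfs_arc_min=8589934592     # 8 GiB
          options zfs zfs_arc_max=12884901888    # 12 GiB
          options zfs zfs_compressed_arc_enabled=1

          # postgresql.conf - keep PG's cache estimate small; the ARC does the caching
          effective_cache_size = 512MB

          # Debian/Ubuntu install; DKMS rebuilds the module for whatever kernel you run
          apt install linux-headers-$(uname -r) zfs-dkms zfsutils-linux
          arcstat 5    # watch the ARC hit rate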

        There are lots of blog posts on the 'net describing some of these challenges. They're worth checking out...

        • abrookewood a day ago

          Have you ever done a blog post describing these recommendations? It's super interesting to me, but not well discussed as far as I can tell.

      • supermatt a day ago

        I’m not the parent, so I can’t comment on their experience, but ZFS groups writes into transaction groups, so it isn’t as write-heavy as btrfs.

    • adgjlsfhk1 a day ago

      Why are you doing compression at the FS layer rather than the DB layer? Postgres supports compression, and I'd assume it would do a better job than the filesystem.

      • rtp4me a day ago

        Based on our testing, we get much better compression using ZFS than PGSQL. As a side benefit, we also get snapshots and the ability to easily find out how much compression our DBs are getting.

        According to a quick Google search (to refresh my memory), PGSQL compression (e.g. TOAST) targets specific large data values within tables, while ZFS compresses all data written to the ZFS pool.
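
        Checking the ratio (or taking a snapshot) is a one-liner, e.g. (dataset name made up):

          zfs get compressratio,logicalused,used tank/pgdata
          zfs snapshot tank/pgdata@pre-upgrade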

      • tempest_ a day ago

        Postgres doesn't have great compression.

        Depending on your database, you can sometimes reduce the size by 50% with aggressive zstd ZFS settings.

        No lunch is free, though. Aside from the obvious CPU cycles spent compressing, configuring ZFS / Postgres is a pain in the ass and really depends on the trade-offs and use cases.
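
        For anyone curious, the settings people usually mean by that are roughly the following (dataset name and values are just examples, and both only apply to newly written data):

          zfs set compression=zstd-9 tank/pgdata
          zfs set recordsize=16K tank/pgdata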

  • nextaccountic a day ago

    > I saw suggestions to “use io_uring for maximum efficiency”, but the earlier benchmark clearly shows io_uring being significantly slower than worker for sequential scans.

    What's unstated is: why does this happen? Is this a performance bug that can be fixed in a later release, or an inherent io_uring limitation?

    • ozgrakkurt a day ago

      FWIW, there is no inherent reason io_uring should be slower. Both implementations look pretty basic to me (I checked the method_*.c files). Doing thread-pool IO is much easier than using io_uring properly with all its features, so it's possible it will get better in the future.

      You can compare worker pool/io_uring/AIO etc. using `fio` to see how the different IO approaches perform on the same hardware.
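
      E.g. something along these lines, with sizes and paths adjusted to your own setup:

        fio --name=seqread --filename=/tmp/fio.dat --size=4G --rw=read --bs=128k \
            --direct=1 --ioengine=io_uring --iodepth=32
        # rerun with --ioengine=psync or --ioengine=libaio to compare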

      • pgaddict a day ago

        I believe there are reasons why e.g. io_uring could be inherently slower in some cases, and I tried to point some of those out.

        With io_uring everything happens in the backend process, and so consumes some of the CPU time that might otherwise be spent executing the query. All the checksum verification, memcpy into shared buffers, etc. happen in the backend. And those things can be quite expensive. With worker this happens in the other processes, spreading the overhead.

        Of course, on a truly I/O-bound workload (actually waiting on the I/O), this may not make a huge difference. For warmed-up cases it may be more significant.

        • ozgrakkurt 14 hours ago

          You can have an io_uring worker per thread and send jobs into a queue for them to pick up and execute, so it won't max out the CPU. Each worker would do cooperative async scheduling internally between the tasks, so nothing delays anything else too much.

          But this is more difficult than spawning thousands of OS threads and running sync workers on them.

          • pgaddict 12 hours ago

            I'm not sure what exactly you mean by "thread" here. Postgres is not thread-based - there are people working on that, but for now it's all processes.

            Some of these limitations are mostly due to Postgres design, no doubt about that.

            • ozgrakkurt 8 hours ago

              Ah, my mistake. I meant to write something like: you can have a thread per CPU core running an io_uring instance, the way you would have a bunch of workers in a pool in the "worker" implementation.

              If there are no threads, then it would be a process per CPU core.