> sudo echo $NUM_PAGES > /proc/sys/vm/nr_hugepages
This won't work :) echo will run as root but the redirection is still running as the unprivileged user. Needs to be run from a privileged shell or by doing something like sudo sh -c "echo $NUM_PAGES > /proc/sys/vm/nr_hugepages"
The point gets across, though, technicality notwithstanding.
Or:

  echo $NUM_PAGES | sudo tee /proc/sys/vm/nr_hugepages
I've always found it odd that there isn't a standard command to write stdin to a file that doesn't also write it to stdout. Or that tee doesn't have an option to suppress writing to stdout.

You forgot the "sudo" before "tee".
> write stdin to a file that doesn't also write it to stdout
You mean like "dd of=/path/file" ?
I physically/literally squinted when I saw disk destroyer.
I know it's useful for other things, but it has become a fearful instinct at this point.
I've always thought that there should be a `cat -o output-file` flag for that. GNU coreutils have myriads of useless flags and are missing the one actually useful flag LoL.
And probably `echo -o output-file` as well.
> I've always found it odd that there isn't a standard command to write stdin to a file that doesn't also write it to stdout
If you happen to have moreutils installed, you can do that with pee
why not write sh -c then?
Because 'sh' isn't in moreutils and is harder to remember, next question.
Why use something portable when you can use cat pee?
Personally? Because I hate the smell.
It’s not an option or feature because it’s built into the shell. Just slap a > /dev/null on the end.
That works, but it feels wasteful to needlessly copy it, and it is verbose.
Great point, I was running as root so I didn't pick this up. Corrected, thank you!
From the article:
> The real bottleneck is the single-threaded main loop in the postmaster.
A single-threaded event loop can do a lot of stuff. Certainly handle 4000 tasks of some sort in under 10s. Just offhand it seems like it would be eminently possible to handle incoming connections on the scale they describe in a single-threaded event loop.
Clearly the existing postgres postmaster thread is a bottleneck as it is implemented today. But I'd be interested to go deeper into what it's doing that causes it to be unable to keep up with a fairly low workload vs. what is possible to do on a single thread/core.
Servers usually have massive numbers of cores that are individually slow. Not surprised that single-threading would be a bottleneck.
Can’t believe they needed this investigation to realize they need a connection pooler. It’s a fundamental component of every large-scale Postgres deployment, especially for serverless environments.
Pooling connections somewhere has been fundamental for several decades now.
Fun quick anecdote: a friend of mine worked at an EA subsidiary when Sim City (2013) was released, to great disaster as the online stuff failed under load. Got shifted over to the game a day after release to firefight their server stuff. He was responsible for the most dramatic initial improvement when he discovered the servers weren't using connection pooling, and instead were opening a new connection on almost every single query, using up all the connections on the back end DB.

EA's approach had been "you're programmers, you could build the back end", not accepting games devs accurately telling them it was a distinct skill set.
No? It sounds like they rejected the need for a connection pooler and took an alternative approach. I imagine they were aware of connection poolers and just didn't add one until they had to.
can't believe postgres still uses a process-per-connection model that leads to endless problems like this one.
You can't process significantly more queries at the same time than you've got CPU cores anyway.
Much of the time in a transaction can reasonably be non-db-cpu time, be it io wait or be it client CPU processing between queries. Note I'm not talking about transactions that run >10 seconds, just ones with the queries themselves technically quite cheap. At 10% db-CPU-usage, you get a 1 second transaction from just 100ms of CPU.
That many long-running transactions seems like a pretty unusual workload to me, and one that potentially runs into isolation issues. I can see running a few of these, but not a lot, especially at the same time?
A few (single digits, maybe low double digits) per database CPU core isn't necessarily too much. Of course the pattern can be conflict-heavy, where it would be problematic, but this is not about absolute concurrency but about the concurrency needed to saturate the CPU.
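To make that concrete, here's a back-of-the-envelope sketch in Python (the core count and the 10% CPU fraction are illustrative assumptions, not numbers from the article):

  # Rough sketch of the "concurrency needed to saturate CPU" argument above.
  # The numbers are illustrative assumptions, not measurements.
  db_cores = 32          # CPU cores on the database host
  cpu_fraction = 0.10    # share of each transaction's wall time spent on DB CPU;
                         # the rest is client think time, network and I/O waits

  # Each open transaction keeps a core busy only cpu_fraction of the time,
  # so you need roughly 1 / cpu_fraction open transactions per core to saturate it.
  concurrent_txns_to_saturate = db_cores / cpu_fraction
  print(concurrent_txns_to_saturate)  # 320.0 concurrent transactions

  # The parent's example: 100 ms of DB CPU inside a transaction that stays
  # open for 1 second is exactly this 10% pattern.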
In a properly optimized database the absolute majority of queries will hit indices and most data will be in the memory cache, so the majority of transactions will be CPU or RAM bound. So increasing the number of concurrent transactions will reduce throughput. There will be a few transactions waiting for I/O, but if the majority of transactions are waiting for I/O, it's either a horrifically inefficient database or very non-standard usage.
Your arguments make sense for concurrent queries (though high-latency storage like S3 is becoming increasingly popular, especially for analytic loads).
But transactions aren't processing queries all the time. Often the application will do processing between sending queries to the database. During that time a transaction is open, but doesn't do any work on the database server.
It is bad application architecture. Database work should be concentrated in minimal transactional units, and the connection should be released between these units. All data should be prepared before the unit starts, and additional processing should take place after the transaction has ended. Using long transactions will cause locks, even deadlocks, and generally should be avoided. That's my experience at least. Sometimes a business transaction should be split into several database transactions.
Your database usage should not involve application-focused locks; MVCC will restart your transaction if needed to resolve concurrency.
If you aren't hitting IO (I don't mean HDDs) on a large fraction of queries, you either skipped putting a cache in front of the DB, or your data is very small, or you spent too much on RAM and too little on your NVMe, which then isn't the bottleneck.
A process that is blocked for io, whether network or disk, will get taken off the cpu and another process put on the cpu. It doesn’t just waste the cpu until the quantum is gone.
redis is single-threaded but handles lots of connections (i.e. > 500) with much better performance vs. postgres. there's zero chance someone building postgres in 2025 would do one process per connection, I don't think there's any argument that it's a good design for performance. it's just a long-ago design choice that would be difficult to change now.
I disagree. If that was the case, pgBouncer wouldn't need to exist.
The problem of resource usage for many connections is real.
It's about queueing work, not running all these queries at the same time. You can run pgbouncer or you can have a pool on your backend. Having more connections won't make it go faster, so that really seems like a low-priority thing for postgres to me. Even if you integrated pooling into postgres the overhead of auth would be still taking time for small queries anyway.
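For the "pool on your backend" option, a minimal sketch with psycopg2's built-in pool (the DSN and pool sizes are placeholders, not a recommendation):

  # Client-side pooling sketch with psycopg2; DSN and sizes are placeholders.
  from psycopg2.pool import ThreadedConnectionPool

  pool = ThreadedConnectionPool(
      minconn=2,    # connections opened up front
      maxconn=10,   # hard cap on backends this app will ever hold open
      dsn="dbname=app user=app host=db.example.internal",
  )

  def run_query(sql, params=None):
      conn = pool.getconn()                    # borrow an existing connection
      try:
          with conn, conn.cursor() as cur:     # commits (or rolls back) the transaction
              cur.execute(sql, params)
              return cur.fetchall()
      finally:
          pool.putconn(conn)                   # return it instead of closing it

  rows = run_query("SELECT 1")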
that's too simplistic.
there are many reasons to need something like pgbouncer.
1) not all workloads lend themselves to backend pools
2) not all backends can afford pooling
3) you don't always control your backend
4) you might have a lot of different backends connecting to the database. it's much simpler to just tell them to connect to pgbouncer.
5) you don't want to trust the backend team to not bring postgresql down with many simultaneous connections.
6) backend pools often misbehave
people don't wake up one day and just add another cog to their infrastructure randomly.
I was surprised too to need it in front of RDS (but not on vanilla, as you pointed out).
In the serverless world for sure, but in old-school architecture it's common to use persistent connections to a database, which makes a connection pooler less essential. Also, the last time I checked (many years ago admittedly), connection poolers didn't play well with server-side prepared statements and transactions.
pgbouncer added support for prepared statements a couple years back.
Good reminder to always remember Chesterton's Fence. The post indicates that the bottleneck occurs when "many thousands" of EC2 instances are connecting simultaneously. In order for this to happen, presumably someone had to turn `max_connections` way up on their database server to make this work at all. Seems like the issue could have been avoided at that point with a bit more understanding of why the default is an order of magnitude or more lower than whatever they tuned it to.
Seems like a prime example of a service that naturally peaks at round hours.
We have a habit of never scheduling long-running processes at round hours. Usually because they tend to be busier.
https://hakibenita.com/sql-tricks-application-dba#dont-sched...
I wish more applications would adopt the "H" option that Jenkins uses in its cron notation - essentially it is a randomiser, based on some sort of deterministic hashing function. So you say you want this job to run hourly and it will always run at the same minute past the hour, but you don't know (or care) what that minute is. Designed to prevent the thundering herd problem with scheduled work.
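The same idea is easy to reproduce outside Jenkins. A minimal sketch of hashing a job name to a stable minute (the job names are made-up examples):

  # Jenkins-"H"-style scheduling: derive a stable per-job minute from a hash
  # so hourly jobs spread out instead of all firing at :00.
  import hashlib

  def stable_minute(job_name: str) -> int:
      digest = hashlib.sha256(job_name.encode()).digest()
      return int.from_bytes(digest[:4], "big") % 60   # same minute every time for a given name

  for job in ("cleanup-temp-files", "rotate-logs", "refresh-report-cache"):
      print(f"{stable_minute(job):02d} * * * *  {job}")  # cron line: hourly, at a hashed minute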
systemd timers have this, and more.
https://www.freedesktop.org/software/systemd/man/latest/syst...
I use fqdn_rand [1] in puppet for most cron jobs - it allows running cron jobs at different times on different hosts (with different FQDNs) but with a consistent interval between job runs. I would expect any modern configuration management system to have something like that.
[1] https://github.com/puppetlabs/puppet/blob/main/lib/puppet/pa...
People are usually confused when I use prime numbers for periodic jobs, but then they understand.
I don't work at this company, but I've found that these types of problems can often be simplified in the architecture.
> Most meetings start on the hour, some on the half, but most on the full. It sounds obvious to say it aloud, but the implication of this has rippled through our entire media processing infrastructure.
When you can control when it happens, you can often jitter things. For instance the naive approach of rate limiting users down to quantized times (eg: the minute, the hour, etc.) leads to every client coming back at the same time. The solution there is to apply a stable jitter so different clients get different resets.
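A minimal sketch of that stable jitter, assuming an hourly window and hashing the client ID to a per-client reset offset (the window size and client ID are made up):

  # Stable jitter for rate-limit resets: each client's window resets at a
  # per-client offset derived from a hash, not at the top of the hour.
  import hashlib

  WINDOW_SECONDS = 3600   # assumed hourly reset window

  def reset_offset(client_id: str) -> int:
      digest = hashlib.sha256(client_id.encode()).digest()
      return int.from_bytes(digest[:4], "big") % WINDOW_SECONDS

  def next_reset(client_id: str, now: float) -> float:
      offset = reset_offset(client_id)
      window_start = now - ((now - offset) % WINDOW_SECONDS)
      return window_start + WINDOW_SECONDS    # same offset every window for this client

  print(next_reset("client-42", 1_700_000_000.0))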
That pattern does not go all that well with meetings as they need to happen when they happen, which is going to be mostly the hour and 30 minutes etc. However often the lead up time to those meetings is quite long, so you can do the work needed that should happen on the hour, quite a bit ahead of time and then apply the changes in one large batch on the minute.
You have similar problems quite often with things like weekly update emails. At scale it can take you a lot of time to prepare all the updates, often more than 12 hours. But you don't want the mails to come in at different times of the day so you really need to get the reports prepared and then send them out when ready.
They mention that they implemented jitter later in the post.
But my reading of their jitter is a very narrow one for the actual connection to the database. They are still doing most of the work on the minute.
Cool debugging, but…

1) if you have very spiky loads (on the hour) and you can distribute them a little, it's obvious that this will be a good thing.

2) they had the answer all along in their telemetry.

Sometimes wisdom beats effort.
> We record millions of meetings every week.
My first thought was "why even use big databases, you have perfect workload to shard it between a bunch of instances and as a bonus any downtime would only affect smaller part of customers"
This is not a big database usecase. It just needs one to not do silly things like opening a new database session for every query when it's well documented that this is expensive.
Ha, .net does that automatically unless you really want not to - connection pooling I mean.
Isn't this kind of the reason why teams will tend to put database proxies in front of their postgres instances, to handle massive sudden influxes of potentially short lived connections?
This sounds exactly like the problem tools like pgbouncer were designed to solve. If you're on AWS one could look at RDS Proxy.
The article is very well written but is somewhat lacking at the end.
The conclusion lists pgbouncer as one of the solutions but it does not explain it clearly.
> Many pieces of wisdom in the engineering zeitgeist are well preached but poorly understood. Postgres connection pooling falls neatly into this category. In this expedition we found one of the underlying reasons that connection pooling is so widely deployed on postgres systems running at scale. [...] an artificial constraint that has warped the shape of the developer ecosystem (RDS Proxy, pgbouncer, pgcat, etc) around it.
The artificial constraint is the single core nature of postmaster.
Other points at the end of the article that can be improved:
> we can mechnically reason about a solution.
Mechanically as in letting an AI find a solution, or as in reasoning like a mechanic, or? Furthermore:
> * Implementing jitter in our fleet of EC2 instances reduced the peak connection rate
How? Did they wait a random amount of milliseconds before sending queries to the db?
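One plausible reading - a guess at the kind of change, not code from the post - is sleeping a small random amount before opening the connection when work fires on the hour, so thousands of instances don't all hit the postmaster in the same second:

  # Hypothetical jitter sketch; the window size and DSN are assumptions.
  import random
  import time

  import psycopg2

  JITTER_WINDOW_SECONDS = 30   # how far to spread the herd; tune to what the workload tolerates

  def connect_with_jitter(dsn: str):
      time.sleep(random.uniform(0, JITTER_WINDOW_SECONDS))  # desynchronize the fleet
      return psycopg2.connect(dsn)

  # conn = connect_with_jitter("dbname=app user=app host=db.example.internal")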
> * Eliminating bursts of parallel queries from our API servers
How?
> Mechanically as in letting an AI find a solution, or as in reasoning like a mechanic, or?
As in it's fully characterized, so you can use only math and logic rather than relying on experience and guesswork.
Also check out ProxySQL [1][2], it's an extremely powerful and battle-tested proxy. Originally it was only for MySQL/MariaDB, where it is very widely used at scale, even despite MySQL already having excellent built-in scalable threaded connection management. But ProxySQL also added Postgres support too in 2024 and that has become a major focus.
[1] https://proxysql.com/
[2] https://github.com/sysown/proxysql
+1 to ProxySQL, especially in RDS environments with huge monoliths attached that open a shitload of threads. RDS has a fixed max_connections depending on the instance size, so if you don't want to pay $$$$ for bigger but underused instances - and while you are trying to get the devs to update all the hundreds of old dependencies in the monolith to improve it - ProxySQL can save your day. It did, for me. And yes, it's a self-managed system, but it's pretty easy to operate and very stable.
You can change the max_connections in RDS though. The default is insanely high and I have no idea why it is that way: 4vCPU instances running with 5k max connections IIRC. I have never seen an instance this size handle more than 100-200 concurrent connections on a CPU-bound workload.
There is still a max cap based on the instance size [1] (well, the RAM available), although now that I'm checking it again I swear it was different last time I read it. Anyway, I totally did have issues with big replicated monoliths with their own connection pool (i.e. old Rails) maxing out connections on RDS.
[1] https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_...
And lets you rewrite queries on the fly. :)
PgBouncer introduces its own problems and strictly speaking adds additional complexity to your infrastructure. It needs to be monitored and scaled separately, not to mention the different modes of session/transaction/statement connection pooling. Adding another proxy in the middle also increases latency.
Yes, currently, putting PgBouncer in the middle helps handle massive sudden influxes of potentially short lived connections. That is indeed the correct current best practice. But I hardly fault the author for wishing that postmaster could be configured to run on multiple cores so that the additional complexity of running PgBouncer for this relatively simple use-case could be eliminated.
Great investigation. Slight aside, but I found the few typos that I noticed to make me feel better about continuing to read -- as a sign that the post wasn't AI-generated.
Note that they were running Postgres on a 32 CPU box with 256GB of ram.
I'm actually surprised that it handled that many connections. The data implies that they have 4000 new connections/sec...but is it 4000 connections handled/sec?
32 vCPU, meaning an undetermined slice of a CPU that varies depending on what else is running on the box (and the provider has an incentive to run as many VMs on the box as possible).
It’s likely an actual CPU would’ve handled this load just fine.
>... we run an unusual workload
ya, right. just make up some reason for not following best practices
I'm a bit confused here, do they have a single database they're writing to? Wouldn't it be easier and more reliable to shard the data per customer?
When one customer is 50 times bigger than your average customer then sharding doesn't do much.
Combination of partitioning + sharding perhaps? Oftentimes it is only a handful of tables that grow large, and even fewer for a single large customer, so sharding that customer out and then partitioning the data by a common/natural boundary should get you 90% there. The majority of data can be partitioned, and it doesn't have to be by date - it pays dividends to go sit with the data and reflect on what is being stored, its read/write pattern and its overall shape, to determine where to slice the partitions best. Sometimes splitting a wide table into two or three smaller tables can work if your joins aren't too frequent or complex. It can also help if you can determine which rows are hot or cold, so you move the colder/hotter data to separate tables to speed up reads/writes. There are always opportunities for storage optimization in large datasets, but it takes time & careful attention to get it right.
It does if you have thousands of customers.
Orders of complexity don’t give a tinker’s damn about how many other customers you have.
I wouldn't call that "easier" per se.
Sharding is often not easy. Depending on the application, it may add significant complexity to the application. For example, what do you do if you have data related to multiple customers? How do you handle customers of significantly different sizes?
And that is assuming you have a solution for things like balancing, and routing to the correct shard.
Presumably sharding is a lot easier than trying to debug lockups in an individual postgres thread? It's well known; we've been doing it for at least 30+ years as an industry.
deja vu
did you comment exactly the same things some months ago?
Not that I recall
One of the many problems PgDog will solve for you!
The article addresses this, sort of. I don't understand how you can run multiple postmasters.
> Most online resources chalk this up to connection churn, citing fork rates and the pid-per-backend yada, yada. This is all true but in my opinion misses the forest from the trees. The real bottleneck is the single-threaded main loop in the postmaster. Every operation requiring postmaster involvement is pulling from a fixed pool, the size of a single CPU core. A rudimentary experiment shows that we can linearly increase connection throughput by adding additional postmasters on the same host.
You don't need multiple postmasters to spawn connection processes, if you have a set of Postgres proxies each maintaining a set pool of long-standing connections, and parceling them out to application servers upon request. When your proxies use up all their allocated connections, they throttle the application servers rather than overwhelming Postgres itself (either postmaster or query-serving systems).
That said, proxies aren't perfect. https://jpcamara.com/2023/04/12/pgbouncer-is-useful.html outlines some dangers of using them (particularly when you might need session-level variables). My understanding is that PgDog does more tracking that mitigates some of these issues, but some of these are fundamental to the model. They're not a drop-in component the way other "proxies" might be.
> I don't understand how you can run multiple postmasters.
I believe they're just referring to having several completely-independent postgres instances on the same host.
In other words: say that postgres is maxing out at 2000 conns/sec. If the bottleneck actually was fork rate on the host, then having 2 independent copies of postgres on a host wouldn't improve the total number of connections per second that could be handled: each instance would max out at ~1000 conns/sec, since they're competing for process-spawning. But in reality that isn't the case, indicating that the fork rate isn't the bottleneck.
That makes sense, thanks.
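For completeness, a rough sketch of how one could reproduce that kind of experiment (ports and DSNs are placeholders for independent instances on the same host; this is not the article's actual harness):

  # Measure how many short-lived connections per second a set of Postgres
  # instances can absorb. If fork rate on the host were the bottleneck,
  # adding a second instance wouldn't raise the combined rate; if the
  # per-instance postmaster loop is, it should roughly double.
  import time
  from concurrent.futures import ThreadPoolExecutor

  import psycopg2

  def churn_connections(port: int, seconds: float = 10.0) -> int:
      """Open and immediately close connections against one instance; return the count."""
      count = 0
      deadline = time.monotonic() + seconds
      while time.monotonic() < deadline:
          conn = psycopg2.connect(f"dbname=postgres user=postgres host=localhost port={port}")
          conn.close()
          count += 1
      return count

  def total_rate(ports, workers_per_port: int = 8, seconds: float = 10.0) -> float:
      with ThreadPoolExecutor(max_workers=workers_per_port * len(ports)) as executor:
          futures = [executor.submit(churn_connections, p, seconds)
                     for p in ports for _ in range(workers_per_port)]
          return sum(f.result() for f in futures) / seconds

  # print(total_rate([5432]))        # one instance
  # print(total_rate([5432, 5433]))  # two independent instances on the same host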
Why do you need a connection to a database during the meeting? Doesn't it make more sense to record the meeting data to some local state first, and then serialize it to database at the end of the meeting or when a database connection is available? Or better yet, have a lightweight API service that can be scaled horizontally that is responsible for talking to the database and maintains its own pool of connections.
They probably don't even need a database anyway for data that is likely write once, read many. You could store the JSON of the meeting in S3. It's not like people are going back in time and updating meeting records. It's more like a log file and logging systems and data structures should be enough here. You can then take that data and ingest it into a database later, or some kind of search system, vector database etc.
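A minimal sketch of that write-once pattern with boto3 (the bucket name and key layout are made-up assumptions):

  # Buffer events locally during the meeting, then upload one JSON object at the end.
  import json
  import boto3

  class MeetingRecorder:
      def __init__(self, meeting_id: str, bucket: str = "example-meeting-archive"):
          self.meeting_id = meeting_id
          self.bucket = bucket
          self.events = []                  # local buffer; no DB connection held open

      def record(self, event: dict) -> None:
          self.events.append(event)

      def finalize(self) -> None:
          body = json.dumps({"meeting_id": self.meeting_id, "events": self.events})
          boto3.client("s3").put_object(    # a single write at meeting end
              Bucket=self.bucket,
              Key=f"meetings/{self.meeting_id}.json",
              Body=body.encode(),
          )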
Database connections are designed this way on purpose, it's why connection pools exist. This design is suboptimal.
It took me a long time to realize this but yes asking people to just open and write to files (or S3) is in fact asking a lot.
What you describe makes sense, of course, but few can build it without it being drastically worse than abusing a database like postgres. It's a sad state of affairs.
very stupid question: similar to how we had a GIL replacement in python, can't we replace postmaster with something better?
Specifically on the cost of forking a process for each connection (vs using threads), there are active efforts to make Postgres multi-threaded.
Since Postgres is a mature project, this is a non-trivial effort. See the Postgres wiki for some context: https://wiki.postgresql.org/wiki/Multithreading
But, I'm hopeful that in 2-3 years from now, we'll see this bear fruition. The recent asynchronous read I/O improvements in Postgres 18 show that Postgres can evolve, one just needs to be patient, potentially help contribute, and find workarounds (connection pooling, in this case).
Would be nice if the OrioleDB improvements were to be incorporated in postgresql proper some day.. https://www.slideshare.net/slideshow/solving-postgresql-wick...
maybe this is silly but these days cloud resources are so cheap. just loading up instances and putting this stuff into memory and processing it is so fast and scalable. even if you have billions of things to process daily you can just split if needed.
you can keep things synced across databases easily and keep it super duper simple.
It's not really my experience that cloud resources are very cheap.
They are expensive compared to buying a server and running it 24/7.

They are cheap if you use a tiny fraction of a server for $20/mo or have 50 engineers working on code.
I guess I’m talking about a company actually making money.
I would much rather spend 5k per month to make 1 million, keeping things extremely simple.
Yeah, you can get an AMD 9454P with 1TB of memory and 20TB of redundant NVMe storage for like $1000 a month; it's crazy how cheap compute and storage is these days.
If people are building things which actually require massive amounts of data stored in databases they should be able to charge accordingly.
I think this is the kind of investigation that AI can really accelerate. I imagine it did. I would love to see someone walk through a challenging investigation assisted by AI.