To speculate on how that cache-line-at-a-time view got lost in the teaching of Bloom filters: once you have a new dimension (accuracy, in space-time-accuracy here), it is easy and natural to only consider "projections", but such oversimplifications can lead one astray. E.g., most people I know optimize for space only to indirectly optimize for time, but most discussion on this topic is about space-accuracy.
It's also not clear that a better filter will result in a substantial improvement for SQLite. If the improvement is on the order of tens or hundreds of nanoseconds per query, that's nothing compared to hitting disk.
Edit: I guess you could get really efficient at detecting things not in the database, but as soon as you get a hit (true positive or false positive) the process comes to the equivalent of a screeching halt, because you then have to actually check whether the row is there.
What happens if the table is one with a large number of deletes? The Bloom filter's false positive rate will keep increasing as time goes on.
One way to address this is to recalculate it every n deletes, but this sounds similar to AUTOVACUUM issues in PostgreSQL and might result in unexpected drops in performance.
We do rotating filters for that. Items are added to both the current and the next bloom filter; we stop serving from the current one, serve from the next, delete the old one, and start a new "next" filter.
Another option is a cuckoo filter; it is like a bloom filter but allows deletion.
I like to think it's more than a nit to pick, but does anyone else absolutely despise the way the SQL is written in the example? Join table1, table2, table3, table4... then the "on" logic in the where clause, without explicitly defining which columns belong to which table. Completely unmaintainable, and it wastes so much time years from now. Please don't write SQL like that, everyone.
"update ... from ..." still uses that syntax and it throws me for a loop every time. I have to stop and think about it while convincing myself it is doing the same thing as "join on".
If you mean in the MySQL EXPLAIN output, "Using join buffer (Block Nested Loop)": it's not slow because a nested loop algorithm is being used, it's slow because of the join buffer part, which is an optimization used when it's not possible to immediately get the right row of the inner table via an index.
As far as I know (I might be wrong, I'm not really familiar with MySQL internals), MySQL (like SQLite) generally uses nested loop joins all the time. EXPLAIN just only says something in the join buffer case; when using a simple nested loop join, EXPLAIN does not mention the fact that it is using that algorithm.
It was a long time ago. The schema was basically one table with a lot of fields and records and multiple small tables with far fewer records. The query in question returned a small number of rows (because of a LIMIT clause).
It was faster to get rows from the big table and then run a few additional queries to get data for the found rows from the smaller tables than to join everything together in one query. Maybe the nested loop buffer was the culprit, or maybe I ran into an edge case of the query planner MySQL had 20 years ago. Who knows.
SQLite is self-described as not open contribution. So yes, by their own measure, they've made it more difficult to mainline features (and intentionally so).
I submitted a bug report on SQLite a year or so back (a simple test case only, not a solution). The folks were super nice, and their patch went into the next release.
Me too, repeatedly. I've asked questions, reported bugs, asked for enhancements, made suggestions, submitted patches for consideration, and was always welcomed. Even when I'm asking for stuff that doesn't necessarily align with their goals.
OTOH, I've requested clarification (just some basic documentation really) on the “open contribution” fork of SQLite… and they never documented their own code.
And I'm sorry, I know sarcasm isn't the way here, and is impolite, but that was exactly the point.
Less than a week ago we had a whole thread where, again, we discussed the impossibility of improving SQLite from the outside because it's not “open contribution.”
Well, this is just a great example of a much larger feature that was developed in collaboration with them.
Note that the measurements in the paper were made before they fixed a bug where they confused bits and bytes. So SQLite only used 1/8 of the reserved bloom filter space, thus increasing the false positive rate significantly:
https://sqlite.org/src/info/56d9bb7aa63043f5
I found and reported the bug because I wanted to know how the bloom filters work in SQLite for my uni seminar paper. Still wondering if one can find those kinds of bugs with test cases.
On top of that, I don't think it's fair to say it's 10x faster when the btree was tested only on an integer primary key column. Benchmarks making such bold claims should include short string (1-16 chars, maybe) and UUID indexes at least.
I do not know if it is still the case, but the last time I looked into the source code SQLite did hash all strings to the exact same value.
So the bloom filter optimization does not work there.
It had to do with the different ways strings can be compared with collating functions, as strings may be equal even if they have different bytes: https://sqlite.org/forum/forumpost/0846211821
Why do you want a UUID index? Use an integer index and have the UUID in another column.
Because if you want to refer to things by a UUID, now you have two indexes.
UUIDs are very wasteful [1]. For most use cases you can replace them with much shorter strings and still have very low chances of collisions [2]
[1] https://henvic.dev/posts/uuid/
[2] https://alex7kom.github.io/nano-nanoid-cc/
Call me crazy, but I'm simply splitting my UUID into the higher and lower bits and indexing off that.
IE
CREATE TABLE foo(
id_ms UNSIGNED BIG INT NOT NULL,
id_ls UNSIGNED BIG INT NOT NULL,
PRIMARY KEY (id_ms, id_ls)
) WITHOUT ROWID;
That works well with UUIDv7 and is just storing 128 bits rather than a full string. In most languages it's pretty trivial to turn 2 longs into a UUID and vice versa.
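For illustration, a round trip between a UUID and the two integer columns above might look like this in Python (hypothetical helper names; the sign handling is only there because SQLite stores 64-bit signed integers):

import uuid

def uuid_to_pair(u: uuid.UUID) -> tuple[int, int]:
    # split the 128-bit value into the high and low 64 bits
    ms = (u.int >> 64) & 0xFFFFFFFFFFFFFFFF
    ls = u.int & 0xFFFFFFFFFFFFFFFF
    # shift into the signed range so the values fit SQLite's 64-bit INTEGER
    to_signed = lambda x: x - (1 << 64) if x >= (1 << 63) else x
    return to_signed(ms), to_signed(ls)

def pair_to_uuid(ms: int, ls: int) -> uuid.UUID:
    to_unsigned = lambda x: x + (1 << 64) if x < 0 else x
    return uuid.UUID(int=(to_unsigned(ms) << 64) | to_unsigned(ls))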
Is there any advantage to this approach over Postgres native uuid support, which should store the same number of bits?
No. This approach is strictly for DBs like SQLite without UUID or 128-bit integer support.
Sure, at cost of increased complexity of access. Sometimes the waste is worth the simplicity.
Sounds complex, just use a UUID. If that’s the dominating factor for storage, then you have a different problem to solve.
In SQLite, if you were to define a TEXT column (or anything other than INTEGER, for that matter) with a UUID as the PK, you’d already have two indices, because it stores data based on the rowid [0]. So you’d already have a level of indirection, where the “PK” would be pointing to the rowid.
You could define the table as WITHOUT ROWID [1], but as docs point out, the average row size shouldn’t exceed 200 bytes for the default 4 KiB page size. Since a UUID in text form is at best 32 chars, that doesn’t leave much for the rest of the columns.
[0]: https://www.sqlite.org/lang_createtable.html#rowid
[1]: https://www.sqlite.org/withoutrowid.html
It's also a problem in machine learning. Your data might be mangled due to a bug but the NN will still extract something useful out of it. Or, on the flip side, if you make a change to the data and things do break (learning stops converging), you never really know if it's the architecture or the data that's the issue.
How much of a slowdown did you estimate this bug caused?
SQLite only knows nested loop joins and the bloom filter can just tell us "no need to do a join, there is definitely no matching entry".
If it has a false positive all the time (the worst case) then the performance is the same as before the bloom filter optimization was implemented (besides the small bloom filter overhead).
As the bloom filter size in SQLite directly depends on the table size I estimated a false positive rate of 63.2% due to this bug, while it could have been just 11.75%.
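Those two numbers line up with the textbook false-positive formula for a Bloom filter that sets a single bit per key, assuming the intended sizing was 8 bits (one byte) per row and the bug shrank that to 1 bit per row; a quick check in Python:

import math

# false positive rate of a Bloom filter with one hash function:
# p = 1 - exp(-keys / bits)
fp = lambda bits_per_key: 1 - math.exp(-1.0 / bits_per_key)
print(f"intended, 8 bits per row: {fp(8):.2%}")    # ~11.75%
print(f"with the bug, 1 bit per row: {fp(1):.2%}") # ~63.21%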
It actually would have performed faster, but the false positive rate drastically increased.
I guess the person is asking how much of a slowdown the whole query received.
90%?
Small aside based on some comments here -
People frequently bring up write concurrency issues in SQLite, with the implied idea that if two physical people are using your app at once they'll get errors. But of course it's concurrency at a transaction level, vastly different - if your writes take 1ms (in SQLite they might even be faster), you can support 1,000 writes per second. And if your users are generating a write click every 5 seconds (a very engaged user base, in most apps more users are readers) you can support 5,000 simultaneous physical people before you need to start worrying about scaling, replication, and distributed shenanigans.
If only 5% of those users are writers and the rest are readers / lurkers, you can support 100,000 users simultaneously.
I suspect close to 99% of distributed app architectures are premature optimizations for a scale the company will never reach.
The issue isn’t quite just that. If you are running your app and database on one single server and don’t need to move past that, SQLite is a great solution. But let’s say you have a web server and a worker server or more than one of each. Now you can’t access SQLite from multiple hosts as it doesn’t have a network access model. So your architecture is somewhat limited. You can still utilize it by either putting a network front end on it or setting up an API server in front of it that everything else uses. But at that point you have basically abstracted away SQLite enough that just using Postgres is easier. And with Postgres you get a lot more features that go hand in hand with multiple servers: replication, failover, etc.
I am a huge fan of SQLite but its best use case is a local database for a local application. You can use it for a client-server setup in some special circumstances but in my mind you’d need a compelling reason to do that rather than using the standard solution of something like Postgres.
Agreed, if you have to have a web server and separate worker servers, then Postgres is likely better than a weird network layer on top of SQLite. But I'd question the true need for separate worker servers. Is it for performance or functionality, or just hand-wavey stuff like "best practices" or "modern" or "scalability" without any hard numbers attached? If a single server can handle 100,000 simultaneous users, that's quite "scalable".
I guess my point is, a single cloud VM with the (web) app server and a SQLite database will take you VERY far these days in terms of performance and concurrent users. The web server becomes your client-server layer, which can possibly eliminate the need for a separate client-server interface on the database end. Of course each app is different, but it's worth taking a hard look at whether you need that additional server & network connection (i.e. a bunch of app servers sharing the same database, that can't be done with an API). It's good to delete any part or process that you can.
Cloud vendor developer-marketing and resume-driven-development has pushed the industry into ever more complicated distributed infrastructures, but I suspect the needs of 99% of businesses could be handled much more simply, with vertical scaling if needed (larger VM class) before diving into kubernetes, clustering, load balancing, lambda, microservices, replication, etc.
Even if you are running it all on the same machine, a separate worker queue process and a web server process cannot share an SQLite database file.
And it’s not that it’s best practice. It is because sometimes you legitimately do need separate processes. Imagine you have a web UI for a service that transcodes video. You aren’t running transcodes on the same CPU cores as your web requests. And chances are you need a lot more power for the worker side than the web side.
For toy projects you can get away with almost any setup. For a project where people are paying money for a service you really need to start engineering your infrastructure because you have actual trade offs.
> Even if you are running it all on the same machine, a separate worker queue process and a web server process cannot share an SQLite database file.
From SQLite’s FAQ page, yes they can [0]. Two processes cannot simultaneously write to it, but they can both have it open, and read. With a modicum of tuning and application error handling, you can quite easily have multiple processes writing.
> …you really need to start engineering your infrastructure because you have actual trade offs.
Step one is fully understanding the abilities and limitations of available tooling.
[0]: https://www.sqlite.org/faq.html
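As a sketch of the kind of tuning meant here (Python's sqlite3 module, illustrative settings only): WAL mode plus a busy timeout lets several processes write to the same file by taking turns instead of erroring immediately.

import sqlite3

conn = sqlite3.connect("app.db", timeout=5.0)  # Python-level busy handler, in seconds
conn.execute("PRAGMA journal_mode=WAL;")       # readers no longer block the single writer
conn.execute("PRAGMA busy_timeout=5000;")      # wait up to 5s for the write lock (milliseconds)
conn.execute("PRAGMA synchronous=NORMAL;")     # common pairing with WAL
conn.execute("CREATE TABLE IF NOT EXISTS jobs(id INTEGER PRIMARY KEY, payload TEXT);")
conn.execute("INSERT INTO jobs(payload) VALUES (?);", ("hello",))
conn.commit()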
Correct. You cannot have concurrent writers even to two independent tables. That page also clearly states the limitations of even this global lock system. I should have been more precise with my language: while it's possible, it can functionally become useless pretty quickly.
SQLite is a wonderful database engine. But it isn't the end all be all, and no, it doesn't scale in the same way as something like Postgres. You can stretch it, but you hit diminishing returns fairly quickly.
> Even if you are running it all on the same machine, a separate worker queue process and a web server process cannot share an SQLite database file.
Excuse me? The only way you would be able to come to this conclusion is if you did no reading whatsoever and never tried it.
This “confidently wrong” attitude really needs to stop.
If you read the relevant docs you will see how this is implemented and maybe realize that locking your entire database for a write transaction isn’t something that works for a whole lot of cases. Of course multiple readers are allowed, but multiple writers even to different tables still contend for the same lock (that doesn’t work on all file systems). There isn’t table-level locking, let alone row level locking.
Exactly. As Dr. Hipp writes on the web site, "SQLite does not compete with client/server databases. SQLite competes with fopen()."
Sure, but is this host incapable of performing 1k req/s? It's a crazy world if we have a DBMS that can parse a query, execute reads and writes against a database file, and return the results faster than it can be string concatenated into an html response.
That’s not the point. And the 1k req/s is a made up number in this case. The point is that SQLite is a single process or at best a single process group if you implement locking. It’s not about performance. I wouldn’t be surprised if SQLite is faster for writes than Postgres, being simpler. The point is that one is a really good bicycle and another is a really good all terrain truck. Both have uses and for a certain type of getting from A to B there is overlap in their utility but they have different use cases.
Related:
SQLite: Past, Present, and Future - https://news.ycombinator.com/item?id=32675861 - Sept 2022 (143 comments)
The description of nested loop join is confusing; it's mainly a single pass through the outer table with one B-tree probe per row to each inner table.
The linked paper is clearer:
"However, the inner loops in the join are typically accelerated with existing primary key indexes or temporary indexes built on the fly."
"Note that SQLite probes the part table index for every tuple in the lineorder table."
The Bloom filter does not reduce the cardinality of the join, it simply replaces each B-tree probe and filter with a Bloom probe on a pre-filtered key set.
This technique is well-known; the paper cites several examples, and Bloom filter pushdown is common to many commercial systems.
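In rough pseudocode, the pattern being described is a cheap membership check in front of each B-tree probe. A toy Python sketch (a set stands in for the real Bloom filter, and a dict stands in for the inner table's index; names are made up):

def build_filter(inner_index):
    # build phase: remember every join key present in the (pre-filtered) inner table
    return set(inner_index)            # a real system would use a Bloom filter here

def nested_loop_join(outer_rows, inner_index, key_filter):
    for row in outer_rows:             # single pass over the outer table
        key = row["part_key"]
        if key not in key_filter:      # definite miss: skip the expensive probe
            continue
        match = inner_index.get(key)   # "B-tree" probe (a false positive still misses here)
        if match is not None:
            yield row, match

The join result is identical with or without the filter; the filter only skips probes that were guaranteed to miss.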
hey could you let me know which part you found confusing specifically, so that I can rephrase it? Thanks!
The animation, pseudocode, and join order discussion all imply that the cartesian product is being generated.
Bloom filters are great, I wish more people knew about them. The most important part about a bloom filter:
They will never have a false negative (and only sometimes a false positive).
We used this to vastly improve render times for comments pages on reddit. We used two tricks. The first was to store the time of your last vote as a first class property on your user object. If you loaded a comments page for a link that was submitted after your last vote, we knew that you couldn't have voted on any of those comments.
But if you had voted afterwards, we had to look up every single comment on the page to see if you had voted on it (we couldn't only do the comments made before your last vote because we didn't know the creation time until after we looked up the comment, and it was faster to just look up the vote).
But with a bloom filter, we could very quickly look up all the comments and get back a list of all the ones you voted on (with a couple of false positives in there). Then we could go to the cache and see if your actual vote was there (and if it was an upvote or a downvote). Only after a cache miss did we have to actually go to the database.
But that bloom filter saved us from doing sometimes 1000s of cache lookups.
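For anyone who hasn't seen one, a minimal Bloom filter is only a few lines. This toy Python version (not reddit's code) shows the add / probably-contains interface and why false negatives can't happen: every bit set by add() is still set when you check later.

import hashlib

class BloomFilter:
    def __init__(self, m_bits=8192, k=4):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item):
        # k independent-ish bit positions derived from a cryptographic hash
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):  # False means "definitely not", True means "probably"
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

votes = BloomFilter()
votes.add("user123:comment456")
print("user123:comment456" in votes)  # True, never a false negative
print("user123:comment789" in votes)  # almost certainly False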
That's wonderful. Can you help me understand why every single button click on Reddit has an 800ms user visible latency gap? Reddit has always been this way, ever since its very beginning. Its latency has always been so slow compared to Hacker News. Why can't Reddit go snappy like HN? Is Python just really really slow compared to LISP? Why didn't your bloom filter improve user visible latency?
I can't tell if you're trolling or serious. If you're serious, I think the problem may be on your end, because I don't see that latency, and when I worked there, we didn't see that latency either.
My average latency on reddit is about 150ms, most of that network transit time. Some calls are slow, yes, but on average site-wide when I worked there it was about 300ms (yes, we measured it).
> Is Python just really really slow compared to LISP?
Now I'm almost certain you're trolling. Reddit was originally written in LISP. It was rewritten to Python for a massive speedup. Then we rewrote parts of it in C for another speedup. And now I'm pretty sure they have Rust and Golang too.
Also, I'm pretty sure HN isn't in Arc anymore.
I am serious. If you're not logged in then reddit pages will take 150ms to respond if you exclude the snooserv ?rdt=xxxx redirect (which makes it take 200ms total).
master jart@luna:~/cosmo$ time curl --tcp-fastopen -L 'https://www.reddit.com/r/torties/comments/1hldi83/i_adopted_two_girls/' 2>/dev/null | head -c1 >/dev/null
real 0m0.203s
user 0m0.016s
sys 0m0.006s
But if you're logged in as "jart" and use that chrome inspect feature to "copy as curl command" and run it on the shell, then the above URL takes 800ms to respond. Pretty much every page on Reddit takes 800ms+ to send back the first byte of content. Yes I use old reddit mode. I imagine the same problem afflicts all your long-time loyal fans who use Reddit while logged in. FWIW my ping to reddit.com is 4.5ms.
Something is broken in your profile then. I also use old reddit and most pages load within 150ms for me.
Any idea how I get unbroken?
You could contact them but honestly that probably wouldn't work very well. Things that might help: switch your settings from old reddit to new reddit, use new reddit to load a few pages, and then switch back. Also try using "newest" reddit at https://sh.reddit.com.
Hey that brought it down to about 650ms. Thank you! That's a noticeable improvement.
Did you maintain a single bloom filter for each user listing the comment IDs they had voted on across the whole site, or was it one bloom filter per user per thread?
I honestly don't remember for sure, but I believe it was one row per comment with a list of everyone who voted on it. So you could quickly get the answer to "did user X vote on comment Y?"
And of course the answer from a bloom filter was either "no" or "probably yes?", which was good enough to then do an actual lookup of that person's vote on that comment.
Thanks to its simplicity for development and hosting, SQLite has become our first choice for new Apps. We use a number of different ways to work around its single concurrent writer limitation [1]. Whilst we're currently using Litestream for replication, we're evaluating switching to SQLite's native rsync [2].
[1] https://servicestack.net/posts/scalable-sqlite
[2] https://www.sqlite.org/rsync.html
Dumb question, but how do you use SQLite on kubernetes? SQLite doesn't work over NFS (and I guess other remote file shares), so how do you share access to it with other pods?
Without an additional layer you will have to be happy with a single vertically scaled instance of your application. If you want to resort to horizontal scaling, you can look into something like LiteFS
Is switching to SQLite really making hosting your web apps less of a headache? Most hosting providers make spinning up your standard client-server RDBMSs (MySQL, Postgres) a breeze.
It probably depends!
If you’re using one of the hosting providers or managed services that just give you a DB to use and you don’t have to think about administering it, it’s a pretty relaxed experience.
Even if you self-host, running a container with the DB of your choice usually isn’t that much more difficult than just having an SQLite DB for your app, at that point it’s probably more about the features which each solution provides.
Personally, I have some stuff running on MariaDB, PostgreSQL and SQLite and I’ve almost never had any problems with either. Then again, I’ve never gotten to a scale with personal stuff where the specifics of that choice would matter much.
Not having to worry about needing to configure, manage or pay for any additional infrastructure dependencies definitely makes hosting a lot simpler. Using RDS was the only thing keeping us on AWS; by switching to SQLite we're now running all new Apps on Hetzner VMs, which costs around €0.60/mo to host a .NET + SQLite Docker App.
Litestream replicates continuously. sqlite3_rsync takes a snapshot. How do you plan to use the latter?
Steve Gibson (grc.com) did a really great job of explaining how cascading bloom filters work in order to efficiently achieve fast certificate revocation lookups in Firefox. It's definitely worth checking out the episode
https://twit.tv/posts/tech/cascading-bloom-filters-revolutio...
If anyone is interested, here are multiple membership filter implementations: https://github.com/FastFilter/fastfilter_cpp/
Just a thought: just because a general problem is NP-hard doesn't mean that we can't find specific solutions quickly or that a given input is hard to search. If the downstream effect is an order of magnitude less work, it makes sense; it's just a tradeoff.
The one I liked was the postgres genetic optimizer.
https://www.postgresql.org/docs/17/geqo-pg-intro.html
The theory being that an exhaustive search of all possible query plans is NP-hard and would take too long, so you do a limited, iterative, best-fit search.
My understanding is it never worked super great and would only be used if your query exceeded some high level of complexity. I distinctly remember it being removed at some point, but I see it mentioned in current docs, so I am probably wrong about that.
Anyway I wonder if, with some of the new advances in machine learning, it would be worth revisiting this approach to optimization.
As you mentioned machine learning, a useful way to implement inference in language models is to use a beam search: https://en.wikipedia.org/wiki/Beam_search
Beam search approximates NP-hard solutions (the wider the beam, the better the approximation), is very old, and is used in the SQLite query planner: https://www.sqlite.org/queryplanner-ng.html
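A generic beam search over join orders looks roughly like this (illustrative Python with a made-up cost model; the linked doc describes SQLite's own variant of the idea, not this code):

def beam_search_join_order(tables, cost_of, beam_width=3):
    # keep only the `beam_width` cheapest partial join orders at each step
    beam = sorted(((t,) for t in tables), key=cost_of)[:beam_width]
    for _ in range(len(tables) - 1):
        candidates = [order + (t,)
                      for order in beam
                      for t in tables if t not in order]
        beam = sorted(candidates, key=cost_of)[:beam_width]
    return beam[0]  # cheapest complete ordering found (approximate, not exhaustive)

sizes = {"lineorder": 1_000_000, "part": 20_000, "supplier": 2_000}

def cost(order):
    # toy cost: sum of intermediate result sizes under a worst-case cross-product model
    total, running = 0, 1
    for t in order:
        running *= sizes[t]
        total += running
    return total

print(beam_search_join_order(list(sizes), cost))  # smaller tables end up outermost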
Well yes, heuristics for query planning are a very well researched field.
I was thinking more about solving NP-hard problems. Modern CPUs are fast; if the benefit is worth it for the downstream task, just do it.
Most instances of most NP hard problems are fast and easy to solve in practice.
E.g., you have to go to quite a bit of effort to construct a knapsack problem that's hard to solve.
What complexity class would the problem of constructing only hard-to-solve knapsack (or other) instances be in?
For knapsack, you can do that easily in polynomial time. Well, given a few minimal assumptions, like P!=NP; because otherwise there are no hard instances.
First, you start with an NP problem where virtually all instances are expected to be hard, like finding the pre-image to a given sha256 hash digest. Second, you sample a random instance in O(n). Third and last, you reduce this problem from the original sha256 inversion to knapsack. You can do this in polynomial time, because knapsack is NP complete.
Note for the pedantic: inverting sha256 is certainly in NP, but it's not expected to be NP complete.
Second note for the pedantic: because sha256's digest has a specific fixed size, you can technically solve any problems around it in constant time with a big lookup table. So you should replace sha256 in my example with any other problem that's expected to be hard on average.
The article states that the order of joins matters because the loops then nest differently. But we still go through entire loops everywhere. Where do those numbers in the example come from? If we have 1000, 20 and 200 elements in 3 loops, algorithmically it does not matter in which order you iterate. Complexity is always 1000×20×200.
What am I missing?
You don't go through entire loops everywhere because if there isn't a match in the first two tables, you don't have to check the match with the third table.
It's better to check A x C before A x B if you know that A x C has fewer matching rows, because the final loop will be shorter.
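A toy version of that, with numbers in the same spirit as the example (illustrative Python, not the article's code): the second inner table is only probed for rows that already matched, so the more selective join should come first.

import random
random.seed(0)

A = range(1000)
B = set(random.sample(range(1000), 20))    # A x B has ~20 matching rows
C = set(random.sample(range(1000), 200))   # A x C has ~200 matching rows

def probes_of_second_table(first_inner):
    # every outer row probes the first inner table; only matches go on to probe the second
    return sum(1 for a in A if a in first_inner)

print(probes_of_second_table(B))  # join B first: the second probe runs ~20 times
print(probes_of_second_table(C))  # join C first: the second probe runs ~200 times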
Ah, I see, the numbers in the example are the numbers of matched rows, not the total number of rows... Makes sense. I do not work with databases, so I did not know that you should pay attention to the order here.
https://llimllib.github.io/bloomfilter-tutorial/ for understanding how bloom filters work
SQLite is getting better and better. I am using it in production for a bunch of websites and never got a problem.
It should be fine for read-only data. If you want to write, be aware that only one process can write at a time, and if you forget to set busy_timeout at the start of the connection, it defaults to zero milliseconds and you'll get an error if another process has locked the database for writing while you try to read or write it. Client-server databases tend to handle concurrent writers better.
I think people overstate this.
Yes, the sqlite concurrency model is a bad choice if you have a high degree of concurrent writes. However, for many applications that simply isn't true. When it comes to websites, I think people significantly overestimate the amount of concurrent writes.
This is a baseline benchmark I published of various p9x latencies for multiple readers/writers with a single SQLite database with different configurations set: https://github.com/mqudsi/sqlite-readers-writers
Even if you don’t use message passing to use one thread to perform all updates/writes, it still performs very solidly for many use cases.
Depending on the amount of write throughput you need and whether you care about latency, "concurrent writes" aren't necessarily a problem either. You can just shove them into a queue and then have a single thread pull stuff out of the queue and push it into SQLite. That still scales, up to a point.
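A bare-bones version of that single-writer-queue pattern might look like this (illustrative Python; a real service would need shutdown and error handling):

import queue, sqlite3, threading

write_q = queue.Queue()

def writer_loop(db_path="app.db"):
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA journal_mode=WAL;")
    conn.execute("CREATE TABLE IF NOT EXISTS events(id INTEGER PRIMARY KEY, name TEXT);")
    while True:
        sql, params = write_q.get()   # the only thread that ever writes
        conn.execute(sql, params)
        conn.commit()
        write_q.task_done()

threading.Thread(target=writer_loop, daemon=True).start()

# any number of request threads can enqueue writes without contending on the DB lock
write_q.put(("INSERT INTO events(name) VALUES (?);", ("signup",)))
write_q.join()  # wait until the queued writes have been applied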
FYI this is basically what Redis does (which does not support concurrent writes or reads afaik), and that still scales quite a lot.
No, because your readers are still blocked by writers and still randomly fail if you forgot to set busy_timeout, which is something at least one person didn't know they had to do until they read this comment.
> readers are still blocked by writers
We're all using WAL mode these days, right?
No, it's indeed a very real problem. I ran into it with a very small service.
You want to configure it so it has a timeout. Take turns. That’s how locks work.
The only difference between SQLite and Postgres write locking is the granularity.
It's not a trivial difference like you suggest.
I agree It’s not a trivial difference in implementation and the concurrent write performance is probably a lot worse.
But it’s talked about as if it’s a categorical limitation, the app will fail if there is concurrency. But it’s just a question of how much time will be spent locking.
A website with a 16 process pool for handling requests will be fine.
sqlite also polls for lock availability
SQLite really isn't meant to be used exactly like a hosted solution. I don't know who is advocating for this.
If you are sharing your database between processes or machines, you are probably doing "the fancy new SQLite thing" wrong.
If you need to share information contained in a SQLite database with other processes or machines, you can write application-level code to handle this concern.
Sharing between processes? Absolutely. Machines? No.
For example, LAMP stack applications could swap the M for SQLite (and I think it would have been better historically if they did).
Sharing between processes isn't impossible, but this is where you get all of your performance caveats.
I think it is a bit unfair to argue against SQLite on performance grounds but then not explore ways to utilize instances of it within a single process scope where it is most effective.
Sharing a single SQLite connection across all threads within a process is where you get the best performance outcomes. Everything is serialized within the SQLite provider - assuming you are using a mainstream build. So, if your storage subsystem is fast (NVMe & friends), you will achieve very good outcomes. Any utilization model that requires a file open/close operation every time you want to touch the data is a complete non-starter by comparison. The "unit of work" paradigm that hosted providers recommend is catastrophic for performance with SQLite.
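In Python terms (a stand-in for whatever provider the parent has in mind), the "one connection shared by every thread, all access serialized" pattern is roughly:

import sqlite3, threading

conn = sqlite3.connect("app.db", check_same_thread=False)  # one connection for the whole process
conn_lock = threading.Lock()

def run(sql, params=()):
    with conn_lock:              # serialize all access through the single connection
        cur = conn.execute(sql, params)
        rows = cur.fetchall()
        conn.commit()
        return rows

run("CREATE TABLE IF NOT EXISTS kv(k TEXT PRIMARY KEY, v TEXT);")
run("INSERT OR REPLACE INTO kv VALUES (?, ?);", ("greeting", "hello"))
print(run("SELECT v FROM kv WHERE k = ?;", ("greeting",)))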
One option for the "I absolutely must share this SQLite instance with 30+ services" scenario is to simply wrap it with a REST API or similar.
I don't see why performance would be significantly different between multiple threads using the same sqlite db vs multiple processes on the same machine. Can you explain more what you mean?
See section 5:
https://www.sqlite.org/lockingv3.html
In the single process access model, you can connect exactly once and remain in a reserved lock state the entire time.
Maybe I misunderstand, but it doesn't seem like the linked document supports what you are saying. The linked document also doesn't describe how things work in WAL mode, which is how most people would (or should) use sqlite in production.
> but this is where you get all of your performance caveats.
You mean the one where it locks on write? It's totally fine; if you wrote any cross-process code yourself, it would probably do similar locking.
Yeah, but you're locking the whole file and if you try to open it while it's locked, sleep-polling for it to be unlocked. It's safe, but it's a minimum viable product - as they say, sqlite is a replacement for fopen, not for a fully featured database. Client/server systems have better locking.
I mostly agree. If you need an RDBMS, get a real one that has all the features and granular locking.
The only point I’m disagreeing about is the blanket statement “your SQLite website won’t work if it has concurrent writers”. It will.
I thought WAL mode solves this. Am I misunderstanding the docs, or is this SQLite running without a write-ahead log?
There are advantages and disadvantages to using WAL instead of a rollback journal. The main advantage here is that WAL provides more concurrency: readers do not block writers and a writer does not block readers.
Then you have to remember to enable WAL mode. And it still has its caveats, but different ones: it's possible that the WAL file can grow without bound under certain conditions.
You know, it never occurred to me that there's probably a whole new generation (experience-wise at least) of programmers who 1) know SQLite is commonly used for web backends now but 2) don't know about WAL mode.
To me the concept of SQLite in these scenarios, without the WAL, is just nuts.
So configure busy_timeout, that’s what it’s for.
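Rough illustration of what busy_timeout buys you (the file name and the 2-second value are arbitrary; Python's sqlite3 also exposes the same knob as the timeout= argument to connect()):

  import sqlite3

  writer = sqlite3.connect("demo.db", isolation_level=None)
  other  = sqlite3.connect("demo.db", isolation_level=None, timeout=0)
  other.execute("PRAGMA busy_timeout = 2000")   # retry for up to 2 s before giving up

  writer.execute("CREATE TABLE IF NOT EXISTS t(x)")
  writer.execute("BEGIN IMMEDIATE")             # writer takes the write lock and holds it
  writer.execute("INSERT INTO t VALUES (1)")

  try:
      other.execute("BEGIN IMMEDIATE")          # waits up to 2 s, then raises
  except sqlite3.OperationalError as e:
      print(e)                                  # "database is locked"
  finally:
      writer.execute("COMMIT")                  # lock released; a retry would now succeed

With a timeout that comfortably covers your longest write transaction, concurrent writers simply take turns instead of surfacing errors to users.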
I've been really impressed with SQLite + LiteFS on fly.io. It's pretty easy to get distributed reads with single-master writes.
If you never got a problem then how is it getting better?
What if we didn't know we had a problem? And now it's "ooooh it's better now".
It packs in more features and improvements over time. One can read the release notes and see that it gets better.
Perhaps it's getting faster?
Next should be this -> https://x.com/lemire/status/1869752213402157131
What progress we have with these. Amazing times.
Maybe not such a great fit for sqlite:
> One of the challenges with binary fuse filters, is that they are immutable once populated, so data cannot be added incrementally, and they consume a significant amount of memory during the populate process
Same restriction with cuckoo filters. Are there any filters better than Bloom filters that don't have this restriction?
I think you may be confusing the strict immutability of binary fuse filters with the "degraded" behavior (i.e. needing to resize the hash table once it is "full") of https://en.wikipedia.org/wiki/Cuckoo_filter under high table load.
In any event, to address your question: most of the time people truncate hashes to some round number of bits (8, 16, 32, 64) because it is very cheap, but it is actually barely more expensive to truncate to any number of bits below 64 with masks and shifts. Storing those truncated b-bit numbers back-to-back in an array (https://github.com/c-blake/adix/blob/master/adix/sequint.nim), structured as a regular hash table (such as robin hood linear probing, https://github.com/c-blake/adix/blob/master/adix/bltab.nim, where a hit or miss induces a nearly guaranteed single cache line miss up to ~90..95+% utilization), lets you build a filter with many fewer CPU cache misses than a Bloom filter (~10X fewer, though everything depends on where you land in the parameter space) at a small 2..4X cost in extra space.
There are some example numbers and a “calculus analysis” at the bottom of https://github.com/c-blake/adix/blob/master/tests/bl.nim, but it’s all only a few hundred lines of Nim and you could re-do a test in your favorite ProgLang. This table does eventually "fill up" like Cuckoo or any fixed malloc'd arena at which point people usually double/whatever to make more space resulting in a linear amortized cost growing up from zero. I would just call this a b-bit hash existence filter or maybe b-filter or bit-level filter if you want to get brief.
FWIW, I believe this was even understood in Bloom's original 1970 CACM article (https://cacm.acm.org/research/space-time-trade-offs-in-hash-...), based on his footnote 2, though I think he was referring less to a CPU cache and more to "loading a whole word of memory at a time", like the 36-bit-word IBM mainframes of the day. These are similar mathematically (just think of a 64B cache line as a 512-bit aligned word). Somehow it got lost in the teaching of Bloom filters.
To speculate on that "somehow", once you have a new dimension (accuracy in space-time-accuracy here), it is easy/natural to only consider "projections", but such oversimplifications can lead one astray. E.g., most people I know optimize for space only to indirectly optimize for time, but most discussion on this topic is about space-accuracy.
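If it helps, here is a toy Python sketch of that b-bit filter idea (not the adix code itself; plain linear probing rather than robin hood, arbitrary sizing, and the caller must not insert more than `capacity` items): a flat table of truncated fingerprints, so a lookup usually costs a single cache line.

  import hashlib

  class BitFilter:
      # Linear-probed table of b-bit fingerprints: answers "maybe present" or
      # "definitely not present", like a Bloom filter but with one probe chain.
      def __init__(self, capacity, b=16):
          size = 1
          while size < capacity / 0.9:   # keep load factor around 0.9
              size *= 2
          self.mask = size - 1
          self.b = b
          self.slots = [0] * size        # 0 means "empty slot"

      def _slot_and_fp(self, key: bytes):
          h = int.from_bytes(hashlib.blake2b(key, digest_size=8).digest(), "little")
          slot = h & self.mask                        # where to start probing
          fp = (h >> 40) % ((1 << self.b) - 1) + 1    # b-bit fingerprint, never 0
          return slot, fp

      def add(self, key: bytes):
          slot, fp = self._slot_and_fp(key)
          while self.slots[slot] not in (0, fp):      # plain linear probing
              slot = (slot + 1) & self.mask
          self.slots[slot] = fp

      def __contains__(self, key: bytes):
          slot, fp = self._slot_and_fp(key)
          while self.slots[slot] != 0:
              if self.slots[slot] == fp:
                  return True                         # maybe present (can be a false positive)
              slot = (slot + 1) & self.mask
          return False                                # definitely never added

  f = BitFilter(1000)
  f.add(b"alice")
  print(b"alice" in f, b"bob" in f)                   # True False (almost always)

False positives only happen when two keys collide on the same b-bit fingerprint along a probe chain, roughly a 2^-b chance per occupied slot probed, which you trade against space by picking b.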
You can convert static data structures like these into dynamic ones with about a logarithmic slowdown. So it might still be worthwhile.
(You could also combine a static filter with a dynamic bloom filter in front. A bit like generational garbage collection.)
For others interested in this idea, look up "Bentley-Saxe transformation". These lecture notes are very readable: https://jeffe.cs.illinois.edu/teaching/datastructures/notes/... .
A recent paper trying to systematically apply the idea to databases is "Towards Systematic Index Dynamization" (https://bpb-us-e1.wpmucdn.com/sites.psu.edu/dist/b/123163/fi...)
Thanks! I got my intro to this topic from Okasaki's Purely Functional Data Structures.
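A rough sketch of the Bentley-Saxe bookkeeping mentioned above (build_filter is a hypothetical stand-in for whatever immutable structure you'd use, e.g. a binary fuse filter; here it's just a frozenset so the snippet runs):

  # Keep immutable filters at sizes 1, 2, 4, 8, ... items; inserting works like
  # binary addition, so each item is rebuilt O(log n) times and a lookup asks
  # O(log n) filters.

  def build_filter(items):
      # Hypothetical stand-in for building an immutable filter (binary fuse, etc.).
      return frozenset(items)

  class DynamicFilter:
      def __init__(self):
          # levels[i] holds exactly 2**i items, or None if empty; we keep the raw
          # items alongside because a real static filter usually can't be enumerated.
          self.levels = []

      def add(self, item):
          carry = [item]
          i = 0
          while True:
              if i == len(self.levels):
                  self.levels.append(None)
              if self.levels[i] is None:
                  self.levels[i] = (tuple(carry), build_filter(carry))
                  return
              items, _ = self.levels[i]   # level i is full: fold it into the carry
              carry.extend(items)
              self.levels[i] = None
              i += 1

      def __contains__(self, item):
          return any(lvl is not None and item in lvl[1] for lvl in self.levels)

  d = DynamicFilter()
  for word in ["a", "b", "c", "d", "e"]:
      d.add(word)
  print("c" in d, "z" in d)       # True False

Each item gets rebuilt O(log n) times and a lookup consults O(log n) filters, which is the "about a logarithmic slowdown" referred to above.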
It's also not clear that a better filter would result in a substantial improvement for SQLite. If the improvement is on the order of tens or hundreds of nanoseconds per query, that's nothing compared to hitting disk.
Edit: I guess you could get really efficient at detecting things not in the database, but as soon as you get a single false or true positive, the process comes to the equivalent of a screeching halt if you then need to actually check whether it's there.
What happens if the table is one with a big number of deletes? The Bloom filter false positive rate will keep increasing as time goes on.
One way to address this is to recalculate it every n deletes, but this sounds similar to AUTOVACUUM issues in PostgreSQL and might result in unexpected drops in performance
We do rotating filters for that. Items are added to both the current and the next bloom filter; then we stop serving from the current one, start serving from the next, delete the former, and begin populating a new next filter.
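A hedged sketch of that rotation scheme (the BloomFilter here is a toy with arbitrary sizing, not SQLite's; only the double-write plus swap is the point):

  import hashlib

  class BloomFilter:
      def __init__(self, m_bits=1 << 16, k=4):    # sizes are arbitrary for the sketch
          self.m, self.k = m_bits, k
          self.bits = bytearray(m_bits // 8)

      def _positions(self, key: bytes):
          for i in range(self.k):                 # k independent hashes via salting
              h = hashlib.blake2b(key, digest_size=8, salt=bytes([i])).digest()
              yield int.from_bytes(h, "little") % self.m

      def add(self, key: bytes):
          for p in self._positions(key):
              self.bits[p // 8] |= 1 << (p % 8)

      def __contains__(self, key: bytes):
          return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

  class RotatingFilter:
      def __init__(self):
          self.current, self.upcoming = BloomFilter(), BloomFilter()

      def add(self, key: bytes):
          self.current.add(key)      # serve lookups from this one
          self.upcoming.add(key)     # and pre-populate its successor

      def __contains__(self, key: bytes):
          return key in self.current

      def rotate(self):
          # Call every N deletes (or on a timer): retire the filter that has
          # accumulated stale entries, promote the fresher one, start a new successor.
          self.current, self.upcoming = self.upcoming, BloomFilter()

Deleted items simply never get re-added, so once the filter that saw them is retired, the false positive rate drops back down.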
Another option is a cuckoo filter; it is like a bloom filter but allows deletion.
The article says "at the start of the join operation the bits will be set in the bloom filter".
So maybe this is built for every query? It would be needed anyway if there is a WHERE clause on the joined table.
You might clear the single bit position (which would invalidate all items hashing to it), which is still better than recalculating everything or mass invalidation.
My guess though is that many use cases are likely data stores that don't experience deletes at all.
I like to think it's more than a nit to pick, but does anyone else absolutely despise the way the SQL is written in the example? Joining table1, table2, table3, table4... then putting the "on" logic in the WHERE clause, without explicitly stating which columns belong to which table? Completely unsupportable, and it wastes so much time years from now. Please don't write SQL like that, everyone.
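For anyone unsure what's being described, a hypothetical before/after (tables, columns, and data invented for illustration):

  import sqlite3

  db = sqlite3.connect(":memory:")
  db.executescript("""
      CREATE TABLE customers(id INTEGER PRIMARY KEY, name TEXT);
      CREATE TABLE orders(id INTEGER PRIMARY KEY, customer_id INT, total INT);
      INSERT INTO customers VALUES (1, 'alice');
      INSERT INTO orders VALUES (10, 1, 99);
  """)

  # The complained-about style: comma joins, the "on" logic buried in WHERE,
  # and bare column names that don't say which table they came from.
  implicit = """SELECT name, total
                  FROM customers, orders
                 WHERE customers.id = orders.customer_id"""

  # Same query with explicit JOIN ... ON and every column qualified.
  explicit = """SELECT c.name, o.total
                  FROM customers AS c
                  JOIN orders    AS o ON o.customer_id = c.id"""

  print(db.execute(implicit).fetchall())   # [('alice', 99)]
  print(db.execute(explicit).fetchall())   # same rows, far easier to maintain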
"update ... from ..." still uses that syntax and it throws me for a loop every time. I have to stop and think about it while convincing myself it is doing the same thing as "join on".
https://www.postgresql.org/docs/16/sql-update.html
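Concretely, that UPDATE ... FROM shape (SQLite has supported it since 3.33 as well; table names and data here are made up):

  import sqlite3

  db = sqlite3.connect(":memory:")
  db.executescript("""
      CREATE TABLE prices(sku TEXT PRIMARY KEY, price INT);
      CREATE TABLE corrections(sku TEXT PRIMARY KEY, price INT);
      INSERT INTO prices      VALUES ('a', 100), ('b', 200);
      INSERT INTO corrections VALUES ('b', 250);
  """)

  # The join condition lives in the WHERE clause, not in an ON clause:
  # the same comma-join flavor the parent comment finds disorienting.
  db.execute("""
      UPDATE prices
         SET price = corrections.price
        FROM corrections
       WHERE corrections.sku = prices.sku
  """)
  print(db.execute("SELECT * FROM prices ORDER BY sku").fetchall())
  # [('a', 100), ('b', 250)]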
> SQLite does Nested Loop join
Only that? Never anything better? Really?
EDIT: Really.
Section titled Joins here https://sqlite.org/optoverview.html
states:
"SQLite implements joins as nested loops."
That's quite shocking.
Back when I was doing MySQL and Postgres, whenever a nested loop showed up in EXPLAIN, in almost all cases I knew I had botched my query and/or indexes.
If you mean "Using join buffer (Block Nested Loop)" in MySQL's EXPLAIN: it's not slow because a nested loop algorithm is being used, it's slow because of the join buffer part, which is an optimization used when it's not possible to immediately get the right row of the inner table via an index.
As far as I know (might be wrong, I'm not really familiar with MySQL internals), MySQL (like SQLite) generally uses nested loop joins all the time. EXPLAIN only says something in the join buffer case; when using a simple nested loop join, it does not mention the fact that it is using that algorithm.
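For intuition, here is roughly what a nested loop join means, sketched in Python (toy rows; the second version, a nested loop that probes an index on the inner table, is the case SQLite and MySQL rely on being fast):

  orders    = [(1, "alice"), (2, "bob"), (3, "alice")]   # (order_id, customer_name)
  customers = [("alice", "US"), ("bob", "DE")]           # (name, country)

  # Plain nested loop: rescan the inner table for every outer row, O(N*M).
  slow = [(oid, country)
          for (oid, cust) in orders
          for (name, country) in customers
          if name == cust]

  # Nested loop with an index probe on the inner table; the dict stands in
  # for a B-tree index, so each outer row costs one cheap lookup.
  by_name = {name: country for (name, country) in customers}
  fast = [(oid, by_name[cust]) for (oid, cust) in orders if cust in by_name]

  assert slow == fast == [(1, "US"), (2, "DE"), (3, "US")]

So "nested loop" in a plan isn't bad by itself; it's bad when the inner side has no usable index (or, in MySQL's case, when the join buffer fallback kicks in).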
Fair enough, most of my memories about building fast, complex queries come from my Postgres times.
Though I remember one instance where, while using MySQL in a web app, it turned out that N+1 queries were faster than doing a JOIN.
Then your table schema was likely sub-optimal.
It was a long time ago. The schema was basically one big table with a lot of fields and records, plus multiple small tables with far fewer records. The query in question returned a small number of rows (because of a LIMIT clause).
It was faster to get rows from the big table and then run a few additional queries to fetch data for the found rows from the smaller tables than to join everything together in one query. Maybe the join buffer was the culprit, or maybe I ran into an edge case of the query planner MySQL had 20 years ago. Who knows.
SQLite is self-described as not open contribution. So yes, by their own measure, they've made it more difficult to mainline features (and intentionally so).
I submitted a bug report on SQLite a year or so back (a simple test case only, not a solution). The folks were super nice, and their patch went into the next release.
Me too, repeatedly. I've asked questions, reported bugs, asked for enhancements, made suggestions, submitted patches for consideration, and was always welcomed. Even when I'm asking for stuff that doesn't necessarily align with their goals.
OTOH, I've requested clarification (just some basic documentation really) on the “open contribution” fork of SQLite… and they never documented their own code.
And I'm sorry, I know sarcasm isn't the way here, and is impolite, but that was exactly the point.
Less than a week ago we had a whole thread where, again, we discussed the impossibility of improving SQLite from the outside because it's not “open contribution.”
Well, this is just a great example of a much larger feature that was developed in collaboration with them.
Even if true, it seems like they're doing a pretty good job on their own.
Open contribution isn’t a good in and of itself.