DuckDB also runs in Excel, by the way, via the free xlwings Lite add-in that you can install from the add-in store. It’s using the Python package and allows to write scripts, custom functions, as well as use a Jupyter-like notebook workflow.
If you start with Excel, I'll counter with Postgres: https://github.com/duckdb/pg_duckdb.
I haven't found the time to check this on one of our installation, though.
I had a project a year ago that worked with lots and lots of csv files and it was just such a lifesaver. Doing a JOIN and WHERE on two (massive) csv files and getting instant results felt like magic. Saved me a lot of time on ad hoc queries. Also Claude being good with the tool is a bonus too. Easily my favourite piece of software recently.
I benchmarked DuckDB 1.5.2 with the latest Java JDBC driver which now supports user defined functions. This allows very fast modifications https://sqg.dev/blog/java-duckdb-benchmark/
I'd be lost without it for log analysis. It's like a swiss army knife for making sense of disparate or large sets of data. So easy to pull up, so cooperative with data that's easy to compose from curl and bash, etc. It makes life so much easier
From my very limited experience with duckdb, being a column based database, simple query also need loads entire columns and that uses a lots of RAM if a table has millions of rows. Perhaps there is a way to make it more memory friendly but I don't know how.
i must be doing something wrong but if i try a huge join on a table bigger than my ram no matter the flags or the spill-to-disk modes enabled i get crashes. im sure im doing something wrong.
So is DuckDb a database or a cli tool to query all sorts of file format using SQL statements? I've used it as a CLI tool, somehow don't understand the comparison to a database, which stores your data reliably, besides responding to your SQL queries.
My personal use case is a replacement for pandas for ad hoc analysis in Jupyter notebooks, which I have to do very often these days. If I had to store the data I'd pick S3+Glue+Athena.
Along with parsing various file formats, you can create duckdb files to store tables, and make related views, schema, etc. They also have a newer ducklake tool
Whoa, nice! I could see this being useful to people I work with. Do you think it would be a good setup for people who are technical but not great software developers? People who use basic R and Python for ETL and analysis, mostly.
I'm using DuckDB in another project (on my laptop) where `NetworkX` fails due to the memory limit of 32 GB. So yes, as soon as you are doing out of core work I'd assume the combination to be quite powerful. Knowledge in SQL would be a plus, though.
It is a educational/R&D type project. We are more of backend developers and `rill` worked fine as a rapid visualization frontend with low learning curve for us.
Edit: still realizing that I can't use markdown on HN...
I got introduced to it by Claude the other day as I was interrogating several GB of public csv files. Seemed magical as it out them all in parquet files and transformed what I needed into the normalized sqllite for my server. Coding agents seen quite comfortable with it!
claude + duckdb combo is legendary for doing quick analysis of huge datasets. every time i need to analyze a big ass csv (200mb+) or as you noted a parquet file or really anything columnar i'll tell claude, 'you have duckdb at your disposal for this' and within minutes it's all sorted (no pun intended)
I use it almost daily. Any time I benchmark changes or analyze logs, I collect the data I need as CSV and analyze it with duckdb. The flexibility and ease makes it so I find so much more interesting information. It's indispensable to me now
Probably because you probably don't have to do those arbitrary transformations that often. I do, being in a security-related role. But I wouldn't have recognized its usefulness in my previous roles as a front/backend dev.
I found it unusable due to out of memory errors with a billion row 8 column dataset.
It needs manual tuning to avoid those errors and I couldn’t find the right incantation, nor should I need to - memory management is the job of the db, not me. Far too flakey for any production usage.
> Search the issues of the duckdb GitHub there’s at least 110 open and closed oom (out of memory) and maybe 400 to 500 that reference “memory”.
Ah, missed this the first time around. Will check this out. And yes, I noticed that DuckDB rather aggressively tries to use the resources of your computer.
I think the authors disagree with me, but I see it like a online analytical processing (OLAP) database, not like a OLTP (online transaction processing) database, so crashes are more tolerable.
Agree with your assessment of small and reliable for SQLite.
Disagree with your baseline requirement. ACID is more important for me and does not contain `No crashes`.
I’ve had very good experience with it last year. I used it at large scale with data that had been in iceberg previously and it worked flawlessly. It’s only improved since. Highly recommend.
The problem space that ducklake solves is smaller, but it helped me to get a working metabase dashboard quickly on ~1tb of data with 128gb ram. Queries were much, much faster than all alternatives.
Some downsides are: No unique constraints with indexes (can accidentally shoot yourself in the foot with double ingestion), writing is a bit cumbersome if you already have parquet files.
Hm, our internal benchmarking shows something like a 30x speedup compared to SQLite (https://github.com/ClickHouse/ClickBench shows an even greater speedup due to not considering cache size). Calculating back on the envelope I'd estimate 8x for multithreading and 4x for SIMD. Should we expect even more?
"Performance
Does DuckDB use SIMD?
DuckDB does not use explicit SIMD (single instruction, multiple data) instructions because they greatly complicate portability and compilation. Instead, DuckDB uses implicit SIMD, where we go to great lengths to write our C++ code in such a way that the compiler can auto-generate SIMD instructions for the specific hardware. As an example why this is a good idea, it took 10 minutes to port DuckDB to the Apple Silicon architecture."
DuckDB also runs in Excel, by the way, via the free xlwings Lite add-in that you can install from the add-in store. It’s using the Python package and allows to write scripts, custom functions, as well as use a Jupyter-like notebook workflow.
If you start with Excel, I'll counter with Postgres: https://github.com/duckdb/pg_duckdb. I haven't found the time to check this on one of our installation, though.
I had a project a year ago that worked with lots and lots of csv files and it was just such a lifesaver. Doing a JOIN and WHERE on two (massive) csv files and getting instant results felt like magic. Saved me a lot of time on ad hoc queries. Also Claude being good with the tool is a bonus too. Easily my favourite piece of software recently.
I benchmarked DuckDB 1.5.2 with the latest Java JDBC driver which now supports user defined functions. This allows very fast modifications https://sqg.dev/blog/java-duckdb-benchmark/
It's been a lifesaver for some analysis I had to do on 70GB of Cloudflare logs.
I'd be lost without it for log analysis. It's like a swiss army knife for making sense of disparate or large sets of data. So easy to pull up, so cooperative with data that's easy to compose from curl and bash, etc. It makes life so much easier
Yep. And easy to reuse as well since it's just SQL.
Data engineer here: I use this all the time. It's amazing. For most of the data the sizes we often deal with it's perfect.
> For most of the data the sizes we often deal with it's perfect.
Interested here: for me it works for out of core work. Where is the limit? On a related note: do you need to handle concurrency restrictions?
From my very limited experience with duckdb, being a column based database, simple query also need loads entire columns and that uses a lots of RAM if a table has millions of rows. Perhaps there is a way to make it more memory friendly but I don't know how.
i must be doing something wrong but if i try a huge join on a table bigger than my ram no matter the flags or the spill-to-disk modes enabled i get crashes. im sure im doing something wrong.
Hm, only anecdotal evidence, but page rank computation for Wikipedia works on my laptop (https://github.com/idesis-gmbh/WikiExperiments) where `NetworkX` fails. And it uses some joins like here: https://github.com/idesis-gmbh/WikiExperiments/blob/0b108f3f...
So is DuckDb a database or a cli tool to query all sorts of file format using SQL statements? I've used it as a CLI tool, somehow don't understand the comparison to a database, which stores your data reliably, besides responding to your SQL queries.
It can act as an in-process database, like SQLite. You can import the library directly into your code.
My personal use case is a replacement for pandas for ad hoc analysis in Jupyter notebooks, which I have to do very often these days. If I had to store the data I'd pick S3+Glue+Athena.
Along with parsing various file formats, you can create duckdb files to store tables, and make related views, schema, etc. They also have a newer ducklake tool
duckdb is a generational technology innovation. insanely good ergonomics, great performance, it's awesome.
Can confirm: together with `dbt` and `rill` I'm able do to [this](https://github.com/idesis-gmbh/GitHubExperiments/blob/master...) on my laptop.
OP here. Thanks for sharing. This is why I keep coming back to HN.
Whoa, nice! I could see this being useful to people I work with. Do you think it would be a good setup for people who are technical but not great software developers? People who use basic R and Python for ETL and analysis, mostly.
I'm using DuckDB in another project (on my laptop) where `NetworkX` fails due to the memory limit of 32 GB. So yes, as soon as you are doing out of core work I'd assume the combination to be quite powerful. Knowledge in SQL would be a plus, though.
Why did you pick rill?
It is a educational/R&D type project. We are more of backend developers and `rill` worked fine as a rapid visualization frontend with low learning curve for us.
Edit: still realizing that I can't use markdown on HN...
is rill open source?
https://github.com/rilldata/rill, Apache 2.0 license
I got introduced to it by Claude the other day as I was interrogating several GB of public csv files. Seemed magical as it out them all in parquet files and transformed what I needed into the normalized sqllite for my server. Coding agents seen quite comfortable with it!
claude + duckdb combo is legendary for doing quick analysis of huge datasets. every time i need to analyze a big ass csv (200mb+) or as you noted a parquet file or really anything columnar i'll tell claude, 'you have duckdb at your disposal for this' and within minutes it's all sorted (no pun intended)
I use it almost daily. Any time I benchmark changes or analyze logs, I collect the data I need as CSV and analyze it with duckdb. The flexibility and ease makes it so I find so much more interesting information. It's indispensable to me now
I use duckdb often too, but the way it is being hyped in these comments makes me feel like I'm missing out on some insane usecase.
I basically use it to load csv, jsonl, parquet etc etc formats and do arbitrary transformations. Are people doing something else with it?
Maybe you are unconsciously doing the right thing(TM) already? So try doing it with SQLite instead :)
Probably because you probably don't have to do those arbitrary transformations that often. I do, being in a security-related role. But I wouldn't have recognized its usefulness in my previous roles as a front/backend dev.
loading data and doing arbitrary transformations turns out to be a huge PITA with most other tools over a long enough timescale.
I found it unusable due to out of memory errors with a billion row 8 column dataset.
It needs manual tuning to avoid those errors and I couldn’t find the right incantation, nor should I need to - memory management is the job of the db, not me. Far too flakey for any production usage.
That sounds like a rather serious application. Did you file an issue?
I filed many issues. They were aurtoclosed after 3 months of inactivity
No, I tried Clickhouse instead, which worked without crashing or manual memory tuning.
Search the issues of the duckdb GitHub there’s at least 110 open and closed oom (out of memory) and maybe 400 to 500 that reference “memory”.
> Search the issues of the duckdb GitHub there’s at least 110 open and closed oom (out of memory) and maybe 400 to 500 that reference “memory”.
Ah, missed this the first time around. Will check this out. And yes, I noticed that DuckDB rather aggressively tries to use the resources of your computer.
Understood: SQLite is to Postgres as DuckDB is to ClickHouse.
I don’t see the analogy, if you’re using it to excuse crashing on small data sets and indexes.
SQLite isn’t small and crashy, it’s small and reliable.
There’s something fundamentally wrong with the codebase/architecture if there’s so many memory problems.
And the absolute baseline requirement for a production database is no crashes.
I think the authors disagree with me, but I see it like a online analytical processing (OLAP) database, not like a OLTP (online transaction processing) database, so crashes are more tolerable.
Agree with your assessment of small and reliable for SQLite. Disagree with your baseline requirement. ACID is more important for me and does not contain `No crashes`.
Any opinions on DuckLake?
I’ve had very good experience with it last year. I used it at large scale with data that had been in iceberg previously and it worked flawlessly. It’s only improved since. Highly recommend.
The problem space that ducklake solves is smaller, but it helped me to get a working metabase dashboard quickly on ~1tb of data with 128gb ram. Queries were much, much faster than all alternatives.
Some downsides are: No unique constraints with indexes (can accidentally shoot yourself in the foot with double ingestion), writing is a bit cumbersome if you already have parquet files.
With my enterprise hat on, I'd say Athena + S3 is good enough. Only use DuckDB for ad hoc analysis.
Seems stable enough, they patched a bunch of things.
Did they finally enable full SIMD or keep insisting its okay not to have it?
Hm, our internal benchmarking shows something like a 30x speedup compared to SQLite (https://github.com/ClickHouse/ClickBench shows an even greater speedup due to not considering cache size). Calculating back on the envelope I'd estimate 8x for multithreading and 4x for SIMD. Should we expect even more?
fwiw:
"Performance Does DuckDB use SIMD? DuckDB does not use explicit SIMD (single instruction, multiple data) instructions because they greatly complicate portability and compilation. Instead, DuckDB uses implicit SIMD, where we go to great lengths to write our C++ code in such a way that the compiler can auto-generate SIMD instructions for the specific hardware. As an example why this is a good idea, it took 10 minutes to port DuckDB to the Apple Silicon architecture."
https://duckdb.org/faq
sqlite
both