Show HN: Pg_incremental – Incremental Data Processing in Postgres

(github.com)

6 points | by craigkerstiens 8 days ago ago

1 comments

mslot 8 days ago ago
I created pg_incremental because I keep running into the same challenge in PostgreSQL: You have a table of raw event data that you insert into, either individual rows when the event happens, or batches of events in other systems.
You then maybe want to aggregate the data, but the table is too big to keep reprocessing it, so you create a rollup table and only aggregate new data and insert into or update the rollup table.
However, how do you actually select the "new" data? That's more challenging than it seems, and you also need to orchestrate everything.
pg_incremental is a tool to help you create automated, reliable, incremental processing pipelines. It is built on top of pg_cron and around the idea of parameterized SQL commands.
You can define several types of pipelines:
- Sequence pipelines process a range of sequence values, to automatically aggregate or transform new data.
- Time interval pipelines process a range of time intervals after a time interval has passed, to automatically aggregate or export new data.
- File list pipelines process new files showing up in a directory, to automatically import data.
After defining a pipeline, new inserts will automatically get processed by the periodic background job. The SQL command is executed for a range of new sequence values, a new time interval, or a new file name, or skipped if there's no new work. Brin indexes are very useful for fast range scans.
The extension also ensures correct behaviour in the presence of concurrent inserts by waiting for ongoing writes to finish.
Overall, it simplifies the process of setting up an automated incremental processing pipeline to a single SQL command. There's not a lot of magic to it, but it's simple, reliable, and very versatile.