I still don't see how this is different from Iceberg. You don't need a catalog to use it, and atomic replace of metadata.json plus deletion vectors seems to be exactly the same thing.
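For what it's worth, the catalog-free commit part comes down to a conditional PUT of the next metadata version. A minimal sketch, assuming a recent boto3 with S3 conditional writes and a made-up `_metadata/vNNNNN.json` naming scheme (the post's actual layout may differ):

```python
import json

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def try_commit(bucket: str, version: int, snapshot: dict) -> bool:
    """Attempt to publish the metadata for `version` atomically.

    IfNoneMatch="*" makes the PUT fail with 412 PreconditionFailed if the
    key already exists, so at most one writer can claim a given version.
    """
    key = f"_metadata/v{version:05d}.json"
    try:
        s3.put_object(
            Bucket=bucket,
            Key=key,
            Body=json.dumps(snapshot).encode(),
            IfNoneMatch="*",  # put-if-absent via S3 conditional writes
        )
        return True
    except ClientError as exc:
        if exc.response["Error"]["Code"] == "PreconditionFailed":
            return False  # another writer already committed this version
        raise
```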
S3 is an HTTP API; does that mean this database would be very slow, especially if it relies on immutability and creates copies of large files?
Yeah, it's mentioned in a few places - compared to OLTP or similar workloads, this will definitely be slow.
The sequence diagram seems to have a mistake: the second writer somehow knows to create v124 despite only having observed v122.
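For reference, the loop a writer would normally run (reusing the hypothetical `try_commit` and `s3` client sketched above): having observed v122 it should first attempt v123, and only land on v124 after losing that race and re-listing.

```python
def latest_version(bucket: str) -> int:
    """Highest committed metadata version, found by listing _metadata/.
    Ignores pagination for brevity."""
    resp = s3.list_objects_v2(Bucket=bucket, Prefix="_metadata/v")
    return max(
        (
            int(obj["Key"][len("_metadata/v"):-len(".json")])
            for obj in resp.get("Contents", [])
        ),
        default=0,
    )

def commit(bucket: str, snapshot: dict) -> int:
    """Optimistic commit loop: observe v122 -> try v123; only after that
    conditional PUT fails does the writer re-list and try v124."""
    while True:
        target = latest_version(bucket) + 1
        if try_commit(bucket, target, snapshot):
            return target
```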
Fun fact -- try to search for "124" there.
For some reason they thought a hard-positioned top-to-bottom SVG was somehow better than adding "white-space: pre" once in the CSS ¯\_(ツ)_/¯
Thanks! Looks like I messed up some CSS a bit in my last frontend refresh.
Wow fast, now it's much better!
I know Iceberg has this same issue, but you state that deletion done this way (recording tombstones) is sufficient for GDPR compliance. Is it really? The 'deleted' data is still trivially readable.
It's OK provided there's a garbage collection procedure. But the write-up seems to regard this as optional.
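To make that concrete: a tombstone only hides rows at read time; the bytes are gone only once a compaction rewrite drops the file and a pass like the sketch below physically deletes it. The bucket layout and snapshot format here are assumptions, not the post's actual design.

```python
import boto3

s3 = boto3.client("s3")

def garbage_collect(bucket: str, live_files: set[str], prefix: str = "data/") -> int:
    """Physically delete data files the current snapshot no longer references.

    Until this runs, 'deleted' personal data is still sitting in S3 and is
    trivially readable by anything that lists the bucket.
    """
    deleted = 0
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        orphans = [
            {"Key": obj["Key"]}
            for obj in page.get("Contents", [])
            if obj["Key"] not in live_files
        ]
        # delete_objects accepts at most 1000 keys per call
        for i in range(0, len(orphans), 1000):
            batch = orphans[i : i + 1000]
            s3.delete_objects(Bucket=bucket, Delete={"Objects": batch})
            deleted += len(batch)
    return deleted
```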
> Deletes accumulate in tombstone files over time. Eventually we would want to coalesce 100 small tombstone files into one and/or rewrite data files if a row group has >50% rows deleted, resulting in further compaction.
The bigger problem for me is that tombstones that remove rows can make reads quite inefficient, because they reduce the usefulness of min-max and bloom filter indexes. They can also hurt vectorized execution if you have to apply delete predicates within row groups. Finally, there are degenerate cases where the tombstones would be bigger than the compressed columns themselves.
Any assertion that this would be performant needs to be backed up by code. ClickHouse took many years to implement so-called lightweight deletes. It's a hard problem to solve in a performant way.
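For what it's worth, the ">50% rows deleted" rule from the quote could look roughly like the sketch below; rewriting also restores the value of the min-max stats, since afterwards they once again describe only live rows. The deletion-set representation (absolute row positions per file) is an assumption, not the post's actual format.

```python
import pyarrow as pa
import pyarrow.parquet as pq

REWRITE_THRESHOLD = 0.5  # rewrite a file once >50% of its rows are tombstoned

def maybe_rewrite(path: str, deleted_rows: set[int], out_path: str) -> bool:
    """Rewrite a Parquet file without its tombstoned rows if past the threshold."""
    table = pq.read_table(path)
    if table.num_rows == 0 or len(deleted_rows) / table.num_rows <= REWRITE_THRESHOLD:
        return False  # keep the file and its tombstones as-is
    keep = pa.array(
        [i not in deleted_rows for i in range(table.num_rows)], type=pa.bool_()
    )
    pq.write_table(table.filter(keep), out_path)
    # The old file and its tombstones can then be dropped from the next
    # snapshot and hard-deleted by garbage collection.
    return True
```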
Given that it's Parquet, deletes are nice, but what about inserts?
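Presumably the same way other immutable-Parquet formats handle it: an insert writes a brand-new file and the next metadata version references it; existing files are never modified. A sketch with guessed names and snapshot shape:

```python
import uuid

import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

def append_rows(bucket: str, rows: pa.Table, snapshot: dict) -> dict:
    """Insert = write a new immutable Parquet file and add it to the file list."""
    key = f"data/{uuid.uuid4()}.parquet"
    pq.write_table(rows, f"{bucket}/{key}", filesystem=fs.S3FileSystem())
    new_snapshot = dict(snapshot)
    new_snapshot["files"] = list(snapshot.get("files", [])) + [key]
    return new_snapshot  # to be published with a conditional metadata PUT
```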