Related:
Show HN: Cloudberry Database – Greenplum Fork, Now an Apache Incubator Project - https://news.ycombinator.com/item?id=42256186 - Nov 2024 (1 comment)
Looks interesting, but my first question is how does it compare to all the others. A comparison would be interesting. We have ClickHouse, Spark, Trino/Dremio, StarRocks, Druid, Apache Drill, and Apache DataFusion. Some of these are full products, some are engines.
It seems like Cloudberry uses PostgreSQL in some way? What does this entail? Can we use PostgreSQL extensions? How does it compare to ParadeDB?
Sorry for the question dump:-P
Cloudberry is a fork of Greenplum, which is a fork of Postgres. Greenplum was open sourced in 2015, but when Broadcom acquired the parent company, the source was closed in 2024. Cloudberry is a continuation of that open-source version.
Both are based on old versions of Postgres; once you fork Postgres and change the innards, keeping up with the upstream has historically been a chore (you see the same problem with forks like Yugabyte). Cloudberry is not so bad compared to some; they are now up to date with Postgres 14, I believe.
In the Cloudberry architecture, as with Greenplum, database tables are partitioned into "segments", which are automatically distributed over a cluster of "segment hosts". A special optimizer called GPORCA figures out how to parallelize the query across these segments and then merge the results back together.
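To make the segment idea concrete, here is a toy Python sketch of hash distribution plus scatter/gather. It is not Cloudberry code: the segment count, the CRC32 hash, and every function name are assumptions made up for illustration, and the real system does this inside the database with its own hash function and planner.

```python
from collections import defaultdict
from zlib import crc32

NUM_SEGMENTS = 4  # hypothetical cluster size, chosen only for this example


def segment_for(key: str, num_segments: int = NUM_SEGMENTS) -> int:
    """Pick a segment by hashing the distribution key (CRC32 is an assumption)."""
    return crc32(key.encode("utf-8")) % num_segments


def distribute(rows, key_column, num_segments=NUM_SEGMENTS):
    """Scatter rows across segments by the hash of one column."""
    segments = defaultdict(list)
    for row in rows:
        segments[segment_for(str(row[key_column]), num_segments)].append(row)
    return segments


def parallel_count(segments):
    """Each segment counts its own rows; the coordinator sums the partial counts."""
    partial_counts = [len(rows) for rows in segments.values()]
    return sum(partial_counts)


orders = [{"order_id": i, "customer": f"c{i % 7}", "amount": 10 * i} for i in range(100)]
segments = distribute(orders, "customer")
total = parallel_count(segments)
```

Because the distribution key is hashed, all rows for a given customer land on the same segment, which is what lets per-customer work run locally without moving data between hosts.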
The strategy is a classic shared-nothing, single-master architecture, which differs from newer disaggregated compute/data architectures (used especially in "data lake" systems like Delta Lake and Iceberg) in that compute is kept close to the original data; each segment is basically a full database instance, except it just has a subset of the data.
GPORCA achieves high speed by "pushing down" operators such as filters and joins to the individual segments. Greenplum is designed to be used together with a low-latency, high-bandwidth network interconnect, on top of which they use a custom UDP protocol, because each query needs to fan out to potentially a large number of parallel executors.
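A rough illustration of what pushdown buys you, again as a toy Python sketch rather than anything GPORCA actually does: the filter and a partial aggregate run on each segment, and only a small (sum, count) pair travels back to the coordinator. The function names and sample data are hypothetical, and joins are left out to keep it short.

```python
def segment_scan(rows, predicate):
    """Runs on a segment: filter locally and return a partial (sum, count)."""
    matching = [row["amount"] for row in rows if predicate(row)]
    return sum(matching), len(matching)


def coordinator_avg(segments, predicate):
    """Runs on the coordinator: merge the partial results into one average."""
    partials = [segment_scan(rows, predicate) for rows in segments]
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None


# Three segments, each holding a subset of the table.
segments = [
    [{"region": "eu", "amount": 10}, {"region": "us", "amount": 40}],
    [{"region": "eu", "amount": 30}],
    [{"region": "us", "amount": 5}],
]
eu_avg = coordinator_avg(segments, lambda row: row["region"] == "eu")
```

Without pushdown, every row would have to be shipped to the coordinator before filtering, which is exactly the traffic the interconnect would otherwise have to carry.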
Like ClickHouse, Cloudberry supports columnar table layouts (like CH's MergeTree engine family) as well as the native Postgres row-oriented layout (like CH's Atomic table engine). A difference from CH is that there's not really any single server mode; distributed tables are always distributed by the cluster for you.
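For anyone unfamiliar with the row vs. columnar distinction, a minimal Python sketch of the two layouts follows. It only shows the shape of the data in memory and makes no claim about how either Cloudberry or ClickHouse stores things on disk; the table contents and the `to_columns` helper are made up.

```python
# Row-oriented: each record is stored together.
rows = [
    {"id": 1, "region": "eu", "amount": 10},
    {"id": 2, "region": "us", "amount": 40},
    {"id": 3, "region": "eu", "amount": 30},
]


def to_columns(rows):
    """Pivot a list of records into one list per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


# Column-oriented: each column is stored together.
columns = to_columns(rows)

total_from_rows = sum(row["amount"] for row in rows)  # walks every whole record
total_from_columns = sum(columns["amount"])  # reads only the one column it needs
```

The aggregate over a single column is where the columnar layout pays off, while fetching or updating one whole record is more natural in the row layout.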
I can't compare CH's optimizer to Cloudberry's, but I suspect the latter is more sophisticated. I also don't know how performance compares. Cloudberry inherits a lot from Postgres, so I suspect that for non-columnar (OLTP) data, performance may be a lot better, but not necessarily for columnar (OLAP) use cases.
Thanks! It was exactly this kind of explanation I needed:-)
Looks very interesting. How does an MPP database differ from so-called "cloud native" databases that separate storage and compute?
Is any support for data sharing or for querying Iceberg (on S3 and similar) on the roadmap?
Thanks for asking. I'm from the Apache Cloudberry community.
I think there is a plan to integrate with Iceberg; you can see this discussion for reference: https://github.com/apache/cloudberry/discussions/369. We are also discussing the new roadmap, FYI.