Aren't the alternatives you mentioned - Iceberg and DuckDB - both storage solutions while spar...

isignal • 05/14/2025 • 5 replies

Aren't the alternatives you mentioned - Iceberg and DuckDB - both storage solutions, while Spark is a way to express distributed compute? I'm a bit out of touch with this space. Is there a newer way to express distributed compute?

Replies

mritchie712 • 05/14/2025

DuckDB is primarily a query engine. It does have a storage format, but one of its strengths is querying data where it already resides (e.g. a Parquet file sitting in S3).

There are some examples[0] of enabling DuckDB to manage distributed workloads, but these are pretty experimental.

0 - https://www.definite.app/blog/smallpond

robertlacok • 05/15/2025

I think what many people are finding out is they don’t really need distributed processing. DuckDB on a single node can get you really far, and it’s much simpler.

tomjakubowski • 05/14/2025

DuckDB is not only a storage solution. It can directly query a variety of file formats at rest, without having to re-store anything. That's one of its selling points: you can query across archival/log data stored in S3 (or wherever) without needing to "ingest" anything or double-pay to duplicate the data you've already stored.
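To make the "query in place" point concrete, here is a minimal sketch of what that could look like from Python. The bucket path and the `level` column are hypothetical, and it assumes the `duckdb` package is installed and that S3 credentials are already configured in the environment; treat it as an illustration rather than a recommended setup.

```python
def count_by_level_sql(parquet_glob: str) -> str:
    """Build a DuckDB query that aggregates Parquet files where they sit."""
    # Escape single quotes so the path is a valid SQL string literal.
    escaped = parquet_glob.replace("'", "''")
    return (
        "SELECT level, count(*) AS n "  # `level` is a hypothetical column
        f"FROM read_parquet('{escaped}') "
        "GROUP BY level ORDER BY n DESC"
    )


def run(parquet_glob: str):
    """Run the aggregation; nothing is ingested into DuckDB's own storage."""
    import duckdb  # third-party package, imported lazily (pip install duckdb)

    con = duckdb.connect()  # in-memory database, no persistent file created
    # The httpfs extension provides s3:// and https:// access.
    con.execute("INSTALL httpfs")
    con.execute("LOAD httpfs")
    return con.execute(count_by_level_sql(parquet_glob)).fetchall()


if __name__ == "__main__":
    # Hypothetical bucket and prefix; replace with a real location.
    print(count_by_level_sql("s3://example-bucket/logs/*.parquet"))
```

Calling `run("s3://example-bucket/logs/*.parquet")` would scan the files directly from S3; the same function works on a local path such as `logs/*.parquet`, in which case the httpfs extension is not strictly needed.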

winwang • 05/14/2025

Not a new way like Ray, but a new way to express Spark super-efficiently (GPU-acceleration): https://news.ycombinator.com/item?id=43964505

Nate75Sanders • 05/14/2025

Flink. It has more momentum than Spark right now.
