logoalt Hacker News

isignalyesterday at 3:38 PM5 repliesview on HN

Aren't the alternatives you mentioned - icerberg and duckdb - both storage solutions while spark is a way to express distributed compute? I'm a bit out of touch with this space, is there a newer way to express distributed compute?


Replies

robertlacoktoday at 9:45 AM

I think what many people are finding out is they don’t really need distributed processing. DuckDB on a single node can get you really far, and it’s much simpler.

mritchie712yesterday at 5:28 PM

duckdb is primarily a query engine. It does have a storage format, but one of it's strengths is querying data where it already resides (e.g. a parquet file sitting in S3).

There are some examples[0] of enabling DuckDB to manage distributed workloads, but these are pretty experimental.

0 - https://www.definite.app/blog/smallpond

show 1 reply
tomjakubowskiyesterday at 5:15 PM

DuckDB is not only a storage solution. It can directly query a variety of file formats at rest, without having to re-store anything. That's one of its selling points: you can query across archival/log data stored in S3 (or wherever) without needing to "ingest" anything or double-pay to duplicate the data you've already stored.

show 1 reply
winwangyesterday at 7:45 PM

Not a new way like Ray, but a new way to express Spark super-efficiently (GPU-acceleration): https://news.ycombinator.com/item?id=43964505

Nate75Sandersyesterday at 3:48 PM

Flink. It has more momentum than Spark right now.

show 2 replies