I don’t totally understand the fascination with storing analytical data on S3. It’s not fast, and if you’re in a write-heavy environment it’s definitely not cheap either.
What’s with the avoidance of ClickHouse or DuckDB paired with insanely fast EBS or even physically attached storage? You can still back up to S3, but using S3 for live analytics queries misses out on so much of the speed.
Would love the authors to pitch in with their use cases, but I think most people simply do not need sub-millisecond analytics. This is mostly replacing typical Spark pipelines where you're okay with sub-second latencies.
S3 is the cheapest fully managed storage you can get that scales practically without limit. When you're already archiving to S3, reusing that same data for analytics saves cost and simplifies data management.
S3 is a protocol understood by "everyone". If you're on AWS, as many are, it's basically the only natural choice, but a number of other cloud providers and a bunch of self-hostable software offer an S3 interface too.
ClickHouse on local NVMe is one possible solution, but then you are married to that solution. An S3 interface is more universal and lets you mix and match your tools, even though this comes at some expense.
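As a rough illustration of the "mix and match" point: the same client code can talk to AWS S3 or to an S3-compatible store by changing only the endpoint. This is a minimal sketch, assuming boto3 is installed and credentials are configured in the environment; the helper names and the endpoint URL in the usage note are made up for illustration.

```python
from typing import List, Optional


def s3_client_kwargs(endpoint_url: Optional[str] = None) -> dict:
    """Build keyword arguments for boto3.client().

    endpoint_url=None targets AWS S3 itself; any other value targets an
    S3-compatible service reachable at that (hypothetical) URL.
    """
    kwargs = {"service_name": "s3"}
    if endpoint_url is not None:
        kwargs["endpoint_url"] = endpoint_url
    return kwargs


def list_keys(bucket: str, prefix: str, endpoint_url: Optional[str] = None) -> List[str]:
    """List object keys under a prefix; the call is the same for every backend."""
    import boto3  # imported lazily so s3_client_kwargs works without boto3

    client = boto3.client(**s3_client_kwargs(endpoint_url))
    # Only the first page (up to 1000 keys) is fetched in this sketch.
    response = client.list_objects_v2(Bucket=bucket, Prefix=prefix)
    return [obj["Key"] for obj in response.get("Contents", [])]
```

For example, `list_keys("my-bucket", "events/")` would go to AWS, while `list_keys("my-bucket", "events/", endpoint_url="http://localhost:9000")` would go to a self-hosted S3-compatible server at that placeholder address. The bucket and prefix names are placeholders too.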
It is a mix of immense amounts of marketing and people being too lazy to implement actual local storage with replication etc. S3 just makes everything easier, and it is marketed a lot.
My two cents:
- Separating compute from storage simplifies managing a system because compute becomes "ephemeral"
- Compute resources can be scaled separately without worrying about scaling storage
- Object storage provides much higher durability (S3 is designed for 99.999999999%) than individual disks
- Open table formats on S3 become a universal interface in the data space, letting you bring in many other data tools if necessary
- Costs at scale can actually be lower, since there is no data transfer cost between S3 and compute within the same region. For example, you can check out the WarpStream (Kafka on object storage) case studies, which claim 5-10x savings
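
To put the durability figure from the list above into perspective, here is a back-of-the-envelope sketch. It assumes the 99.999999999% figure can be read as an independent per-object, per-year loss probability of 1 in 10^11, which is a simplification, and the object count is a hypothetical example.

```python
from fractions import Fraction

# Assumption: eleven nines of annual durability means each object has a
# 1 in 10**11 chance of being lost in a given year, independently of others.
ANNUAL_LOSS_PROBABILITY = Fraction(1, 10**11)


def expected_objects_lost_per_year(object_count: int) -> Fraction:
    """Expected number of objects lost per year out of object_count."""
    return object_count * ANNUAL_LOSS_PROBABILITY


def years_per_expected_loss(object_count: int) -> Fraction:
    """Average number of years between single-object losses."""
    return 1 / expected_objects_lost_per_year(object_count)


if __name__ == "__main__":
    stored_objects = 10_000_000  # hypothetical bucket size
    print(expected_objects_lost_per_year(stored_objects))  # 1/10000
    print(years_per_expected_loss(stored_objects))  # 10000
```

Under those assumptions, ten million objects work out to roughly one lost object every 10,000 years on average.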