Hacker News

staticassertion · today at 2:09 PM · 4 replies

Yeah, I mean, I think we're all basically doing this now, right? I wouldn't choose this design, but I think something similar to DeltaLake can be simplified down for tons of use cases. Manifest with CAS + buffered objects to S3, maybe compaction if you intend to do lots of reads. It's not hard to put it together.
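Roughly, the "manifest with CAS" commit loop looks like this. This is a toy in-memory stand-in for S3-style conditional puts (all the names here are mine, not a real API); against real S3 you'd use conditional writes on the manifest's ETag instead:

```python
import hashlib
import json

class FakeObjectStore:
    """In-memory stand-in for an object store with conditional puts."""
    def __init__(self):
        self._objects = {}  # key -> (etag, bytes)

    def get(self, key):
        return self._objects.get(key)  # (etag, data) or None

    def put_if_match(self, key, data, expected_etag):
        """Write only if the current ETag matches (None = key must not exist)."""
        current = self._objects.get(key)
        current_etag = current[0] if current else None
        if current_etag != expected_etag:
            return None  # CAS failure: someone else committed first
        etag = hashlib.md5(data).hexdigest()
        self._objects[key] = (etag, data)
        return etag

def commit(store, key, mutate, retries=10):
    """Read-modify-write loop: reread the manifest and retry on CAS failure."""
    for _ in range(retries):
        current = store.get(key)
        if current:
            etag, manifest = current[0], json.loads(current[1])
        else:
            etag, manifest = None, {"segments": []}
        mutate(manifest)
        if store.put_if_match(key, json.dumps(manifest).encode(), etag):
            return manifest
    raise RuntimeError("too much contention on manifest")
```

Data objects get buffered and uploaded first (they're immutable, so no coordination needed), and only the tiny manifest write goes through the CAS loop.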

You can achieve stupidly fast read/write operations if you do this right, with a system that is shockingly simple to reason about.

> Step 4: queue.json with an HA brokered group commit
> The broker is stateless, so it's easy and inexpensive to move. And if we end up with more than one broker at a time? That's fine: CAS ensures correctness even with two brokers.

TBH this is the part I think is tricky: resolving it in a way that doesn't leave tons of clients wasting time talking to a stale broker that buffers their writes, pushes them, and then always fails the CAS. I solved this at one point with token fencing, then decided it wasn't worth it, and now I just use a single instance to manage all writes. I'd again point to DeltaLake for the "good" design here: multiple manifests with only compaction serialized, which also unlocks parallel writers.
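The fencing-token idea, in miniature (hypothetical names; in practice the token check has to happen atomically at the storage layer, e.g. folded into the same CAS write):

```python
class FencedLog:
    """Append-only log that rejects writes from brokers holding a stale token."""
    def __init__(self):
        self.highest_token = 0
        self.entries = []

    def acquire_token(self):
        """A new broker takes over by grabbing a strictly higher token."""
        self.highest_token += 1
        return self.highest_token

    def append(self, token, entry):
        """A broker whose token has been superseded gets refused."""
        if token < self.highest_token:
            return False  # stale broker: its buffered writes never land
        self.entries.append(entry)
        return True
```

The point is that an old broker can keep buffering and pushing forever, but once a newer broker has taken a higher token, every one of the old broker's writes bounces, which is exactly the "clients wasting time" failure mode above.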

The other hard part is data deletion. For the queue it looks dead simple, since it's one file, but if you want to ramp up your scale and get multiple writers, or manage indexes (also in S3), then deletion becomes something you have to slip into compaction. Again, I had it at one point and backed it out because it was painful.
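What "slipping deletion into compaction" means in the small, treating segments as dicts (a toy model, not the real on-disk layout): compaction merges segments newest-last and drops tombstoned keys, and because compaction is the one serialized operation, deletes can't race with parallel writers.

```python
def compact(segments, deleted):
    """Merge segments into one (later writes win) and drop deleted keys."""
    merged = {}
    for seg in segments:  # ordered oldest -> newest
        merged.update(seg)
    return {k: v for k, v in merged.items() if k not in deleted}
```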

But I have 40k writes per second working just fine for my setup, so I'm not worried. I'd suggest others punt as hard as possible on this. If you need more writes, start up a separate index with its own partition for its own separate set of data, or do naive sharding.


Replies

tomnicholas1 · today at 7:10 PM

What you describe is very similar to how Icechunk[1] works. It works beautifully for transactional writes to "repos" containing PBs of scientific array data in object storage.

[1]: https://icechunk.io/en/latest/

thomas_fa · today at 3:10 PM

A lot of good insights here. I'm also wondering if they could simply put different jobs (unclaimed, in-progress, deleted/done) under different directories/prefixes and rely on an atomic object-rename primitive [1][2][3] to solve the problem more gracefully (group commit could still be used if needed).

[1] https://docs.cloud.google.com/storage/docs/samples/storage-m...
[2] https://docs.aws.amazon.com/AmazonS3/latest/API/API_RenameOb...
[3] https://fractalbits.com/blog/why-we-built-another-object-sto...
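The claim-by-rename idea, sketched against an in-memory store (hypothetical names; it assumes the store's rename is atomic, which is true of the primitives linked above but not of a plain S3 copy+delete):

```python
import threading

class PrefixQueue:
    """Job states as prefixes; claiming a job is an atomic rename."""
    def __init__(self):
        self._lock = threading.Lock()
        self._objects = {}

    def put(self, key, data):
        with self._lock:
            self._objects[key] = data

    def rename(self, src, dst):
        """Atomic rename: fails if src is gone or dst already exists."""
        with self._lock:
            if src not in self._objects or dst in self._objects:
                return False
            self._objects[dst] = self._objects.pop(src)
            return True

def claim(q, job_id, worker):
    """Two workers racing on the same job: exactly one rename succeeds."""
    return q.rename(f"unclaimed/{job_id}", f"in-progress/{job_id}.{worker}")
```

Because the rename fails for everyone except the first caller, the prefix itself serves as the claim record and no separate manifest write is needed for that transition.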

allknowingfrog · today at 5:07 PM

This is news to me. What motivates you to reach for an S3-backed queue versus SQS?

zbentley · today at 3:06 PM

> I solved this at one point with token fencing

Could you expand on that? Even if it wasn't the approach you stuck with, I'm curious.
