Hacker News

S3 Files and the changing face of S3

104 points by werner today at 7:44 PM | 32 comments

Comments

minutesmith today at 9:41 PM

The pricing math is the actual story here. What AWS is doing is moving the "decision point" from "should I use S3?" to "what's my read/write ratio and cache hit rate?".

For most applications, this is actually a better outcome than the old model. It forces you to think about your actual access patterns instead of choosing a service based on name recognition.

The dangerous case is when teams deploy this without properly profiling. If you provision a large EFS cache assuming "everything will be cached efficiently" without validating that assumption, the bill surprise is real.
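
A quick way to sanity-check that assumption before provisioning, sketched below (the 128 kB cached-read threshold comes from elsewhere in the thread; the log format is made up):

```python
# Hypothetical pre-provisioning profile: estimate how many reads would be
# billed at the cached-read rate, assuming the 128 kB small-read threshold
# cited elsewhere in this thread. access_log is a made-up stand-in for
# your real request logs.
SMALL_READ_BYTES = 128 * 1024

access_log = [
    ("GET", 4_096), ("GET", 2_000_000), ("PUT", 64_000), ("GET", 90_000),
]  # (operation, object size in bytes)

reads = [size for op, size in access_log if op == "GET"]
cached = [s for s in reads if s < SMALL_READ_BYTES]

print(f"reads likely billed at the cached rate: {len(cached)}/{len(reads)}")
print(f"GB billed for cached reads: {sum(cached) / 1e9:.4f}")
```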

This is an architectural pattern that works great once, then becomes a gotcha once. AWS's fault here isn't the product, it's the documentation: they need a pricing-impact section that shows representative costs for different workload types, not just the raw rates.

Same lesson as many AWS products: the service works well when you've thought it through. It works badly when you haven't.

MontyCarloHall today at 9:12 PM

This is essentially S3FS using EFS (AWS's managed NFS service) as a cache layer for active data and small random accesses. Unfortunately, this also means it comes with some of EFS's eye-watering pricing (a rough cost sketch follows the list):

— All writes cost $0.06/GB, since everything is first written to the EFS cache. For write-heavy applications, this could be a dealbreaker.

— Reads hitting the cache get billed at $0.03/GB. Large reads (>128kB) get directly streamed from the underlying S3 bucket, which is free.

— Cache is charged at $0.30/GB/month. Even though everything is written to the cache (for consistency purposes), it seems like it's only used for persistent storage of small files (<128kB), so this shouldn't cost too much.
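
Putting those rates together, a back-of-the-envelope model (a sketch, not an official calculator; the workload figures are invented for illustration):

```python
# Back-of-the-envelope model of the rates above. The workload numbers
# in the examples are made up.
WRITE_RATE = 0.06          # $/GB, every write lands in the EFS cache first
CACHED_READ_RATE = 0.03    # $/GB, reads served from the cache
CACHE_STORAGE_RATE = 0.30  # $/GB-month of cache storage

def monthly_cost(gb_written, gb_read, cached_read_fraction, cache_size_gb):
    # Reads over 128 kB stream straight from the S3 bucket and cost nothing here.
    return (gb_written * WRITE_RATE
            + gb_read * cached_read_fraction * CACHED_READ_RATE
            + cache_size_gb * CACHE_STORAGE_RATE)

print(monthly_cost(5_000, 500, 0.8, 100))  # write-heavy: ~$342/month
print(monthly_cost(100, 5_000, 0.2, 100))  # read-mostly: ~$66/month
```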

nyc_pizzadev today at 9:28 PM

This is very close to its first official release: https://fiberfs.io/

Built-in cache, CDN-compatible, JSON metadata, concurrency-safe, and it targets all S3-compatible storage layers.

jitl today at 9:31 PM

I wish they offered some managed bridging to local NVMe storage. AWS NVMe is super fast compared to EBS, and EBS (node-exclusive access as a block device) is faster than EFS (multi-node access). I imagine this can go fast if you put some kind of further cache-to-NVMe FS on top, but a completely vertically integrated option would be much better.
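
As a sketch of what that further NVMe layer might look like, a hypothetical read-through cache over the mount (all paths are made up):

```python
# Hypothetical read-through layer: check a local NVMe scratch directory
# before falling back to the S3 Files mount. All paths are made up.
import shutil
from pathlib import Path

NVME_CACHE = Path("/nvme/cache")  # instance-local NVMe scratch space
S3_MOUNT = Path("/mnt/s3files")   # the S3 Files mount point

def read_through(rel_path: str) -> bytes:
    cached = NVME_CACHE / rel_path
    if not cached.exists():
        cached.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(S3_MOUNT / rel_path, cached)  # warm the NVMe copy
    return cached.read_bytes()
```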

rdtsc today at 9:12 PM

The synchronization bits are what I was wondering about: https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-fil...

> For example, suppose you edit /mnt/s3files/report.csv through the file system. Before S3 Files synchronizes your changes back to the S3 bucket, another application uploads a new version of report.csv directly to the S3 bucket. When S3 Files detects the conflict, it moves your version of report.csv to the lost and found directory and replaces it with the version from the S3 bucket.

> The lost and found directory is located in your file system's root directory under the name .s3files-lost+found-file-system-id.
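
Based on the quoted docs, a minimal sketch of a conflict sweep might look like this (mount point and file-system id are placeholders):

```python
# Minimal conflict sweep based on the quoted docs. The mount point and
# file-system id below are placeholders.
from pathlib import Path

MOUNT = Path("/mnt/s3files")
FS_ID = "fs-0123456789abcdef0"  # hypothetical file system id
lost_and_found = MOUNT / f".s3files-lost+found-{FS_ID}"

if lost_and_found.exists():
    for p in sorted(lost_and_found.rglob("*")):
        if p.is_file():
            # Each file here is a local version that lost a sync conflict
            # to a newer object uploaded directly to the bucket.
            print(f"conflicted copy: {p}")
```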

mbana today at 9:21 PM

Werner Vogels is awesome. I first discovered his writing when I learned about DynamoDB.

koolba today at 9:22 PM

If you thought locking semantics over NFS were wonky, just wait till we throw a remote S3 backend into the mix!

mgaunard today at 8:11 PM

Zero mention of s3fs, which has already done this for decades.

gonzalohm today at 8:28 PM

I cannot 100% confirm this, but I believe AWS insisted a lot on NOT using S3 as a file system. Why the change now?

PunchyHamster today at 9:06 PM

Eagerly awaiting the first blog post where developers didn't read the eventually-consistent part, lost data, and made some "genius" workaround with help from the LLM that got them into that spot in the first place.

nvartolomei today at 8:23 PM

> changes are aggregated and committed back to S3 roughly every 60 seconds as a single PUT

A single PUT per file, I assume?

gervwyk today at 8:53 PM

Any recommendations for a Lambda-based SFTP server setup?

themafia today at 8:01 PM

> we locked a bunch of our most senior engineers in a room and said we weren’t going to let them out till they had a plan that they all liked.

That's one way to do it.

> When you create or modify files, changes are aggregated and committed back to S3 roughly every 60 seconds as a single PUT. Sync runs in both directions, so when other applications modify objects in the bucket, S3 Files automatically spots those modifications and reflects them in the filesystem view automatically.

That sounds about right given the above. I have trouble seeing this as something other than a giant "hack." I already don't enjoy projecting costs for new types of S3 access patterns, and I feel like this has the potential to double the complication I already experience here.

Maybe I'm too frugal, but I've been in the cloud for a decade now, and I've worked very hard to prevent any "surprise" bills from showing up. This seems like a great feature, if you don't care what your AWS bill is each month.
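
Given the roughly 60-second sync window quoted above, cross-interface read-after-write needs care. A hedged sketch with boto3, polling until a file written through the mount becomes visible in S3 (bucket and key names are made up):

```python
# With the ~60-second sync window, a file written through the mount may
# not be visible in the bucket immediately, so a consumer reading S3
# directly should poll first. Bucket and key below are made up.
import time

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def wait_for_object(bucket: str, key: str, timeout: float = 120, interval: float = 5) -> bool:
    """Poll until the object is visible in S3 or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            s3.head_object(Bucket=bucket, Key=key)
            return True
        except ClientError:
            time.sleep(interval)  # not synced yet; try again shortly
    return False

# e.g. after writing /mnt/s3files/report.csv through the mount:
# wait_for_object("my-bucket", "report.csv")
```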

mritchie712 today at 9:35 PM

tldr: this caches your S3 data in EFS.

we run data lakes using DuckLake and this sounds really useful. GCP should follow suit quickly.

DenisM today at 8:08 PM

TLDR: Eventually consistent file system view on top of s3 with read/write cache.

ovaistariq today at 9:01 PM

TLDR: EFS as an eventually consistent cache in front of S3.

CrzyLngPwd today at 8:13 PM

If there is ever a post that needs a TLDR or an AI summary, it is that one.

Sell the benefits.

I have around 9 TB in 21m files on S3. How does this change benefit me?
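
For a rough upper bound using the rates quoted earlier in the thread (a sketch only; the real answer depends entirely on the file-size distribution):

```python
# Worst case: all 21M files sit just under the 128 kB threshold and so
# live persistently in the EFS cache.
FILES = 21_000_000
SMALL_FILE_BYTES = 128 * 1024
CACHE_STORAGE_RATE = 0.30  # $/GB-month, from the rates quoted above

cache_gb = FILES * SMALL_FILE_BYTES / 1e9  # ~2,753 GB
print(f"worst-case cache storage: ~${cache_gb * CACHE_STORAGE_RATE:,.0f}/month")
# ~$826/month before any read/write charges; anything larger than the
# threshold would stream from S3 directly instead.
```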
