Hacker News

Removing fsync from our local storage engine

50 points by zzsheng, last Thursday at 9:25 AM | 38 comments

Comments

mightyham today at 4:51 PM

Unless I am mistaken, it seems like there is a glaring flaw in this scheme, which is that without fsync you cannot guarantee the previous WAL blocks have been persisted before the current one, so a power loss event could leave a hole in the log and cause erroneous recovery. I believe that SSDs reorder writes internally so even having atomic batched O_DIRECT is not a strong enough guarantee for durability. I'll admit that I could be misunderstanding something about the system that alleviates this concern.
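To make the "hole in the log" concern concrete, here is a minimal sketch of a recovery scan. It is not the article's design: the record layout (`HEADER`, `BLOCK`), the slot-index sequence numbers, and the function names are all hypothetical. It shows that a recovery pass that stops at the first invalid slot stays consistent, but silently drops any later record that reached the disk ahead of an earlier one.

```python
import struct
import zlib

# Hypothetical layout: sequence number, payload length, CRC32 of (seq + payload).
HEADER = struct.Struct("<QII")
BLOCK = 64  # hypothetical fixed slot size in bytes


def encode_record(seq: int, payload: bytes) -> bytes:
    """Pack one record into a fixed-size slot, zero-padded to BLOCK bytes."""
    if len(payload) > BLOCK - HEADER.size:
        raise ValueError("payload too large for slot")
    crc = zlib.crc32(struct.pack("<Q", seq) + payload)
    return (HEADER.pack(seq, len(payload), crc) + payload).ljust(BLOCK, b"\0")


def recover(log: bytes) -> list[bytes]:
    """Return the payloads of the valid prefix of the log."""
    payloads = []
    for start in range(0, len(log) - BLOCK + 1, BLOCK):
        slot = log[start:start + BLOCK]
        seq, length, crc = HEADER.unpack_from(slot)
        if length > BLOCK - HEADER.size:
            break  # corrupt length field
        payload = slot[HEADER.size:HEADER.size + length]
        expected_seq = start // BLOCK
        if seq != expected_seq or crc != zlib.crc32(struct.pack("<Q", seq) + payload):
            break  # hole or torn write: nothing after this slot is trusted
        payloads.append(payload)
    return payloads
```

If slot 1 never reached the media but slot 2 did, `recover` returns only the first record. Any commit acknowledged for slot 2 is lost, which is the durability gap the comment describes.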

bradfa today at 5:09 PM

There’s lies, damn lies, and lies that disks tell the operating system. Don’t believe any of them!

If you need to know it’s been persisted to non-volatile storage then you need to own the full stack of every piece of software between the OS and the actual physical memory.

Every managed flash drive is going to have layers and layers of complexity and caching and things you simply can’t easily control or really understand. Don’t trust it unless you know exactly how it works all the way down.

nh2 today at 2:41 PM

> fsync doesn’t just sync the file’s data, it syncs every piece of metadata the file depends on: ... directory entry

Famously not, as the man page says.

It is also said later in the article:

> POSIX strictly requires a parent-directory fsync to make a newly created file’s existence durable.

So I'm not sure why the dirent sync is claimed earlier.
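For reference, the parent-directory step quoted above can be sketched as follows. This is a minimal illustration that assumes a POSIX system where a directory can be opened read-only and passed to `fsync`; the function name `create_durably` is hypothetical and not taken from the article.

```python
import os


def create_durably(path: str, data: bytes) -> None:
    """Create a new file and sync both its contents and its directory entry."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than requested, so loop.
            view = view[os.write(fd, view):]
        os.fsync(fd)  # syncs the file's data and its own metadata
    finally:
        os.close(fd)

    # The name of the new file is stored in the parent directory,
    # so the directory needs its own fsync.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Skipping the second `fsync` leaves a window in which the file's data is on disk but the name pointing to it is not.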

matja today at 3:15 PM

Even with O_DIRECT and aligned blocks, I still don't understand how the storage engine can return a "successful commit" to the client without a sync at some point, because a sync (IIRC) is the only way to guarantee an ATA/NVMe FUA command is sent, and the device write cache/buffer is committed.

sethev today at 4:29 PM

This seems sketchy. O_DIRECT skips the operating system's page cache, but it does not guarantee that the SSD driver sent the data to the SSD or issued a flush to the drive itself. The data could still be in the driver's memory or in the non-durable memory in the drive itself when this engine says "ok, we're good".

EDIT: sketchy from an answering "what exactly are the guarantees?" perspective

loeg today at 5:09 PM

This design ACKs writes that aren't yet durably persisted (to the journal or data areas). That might be ok, but it might not. It's certainly unusual not to at least persist the journal update.

myself248 today at 2:22 PM

To step back a bit: the device still has a filesystem on it, and the structures described here are files within that filesystem? You're just able to write directly into them, bypassing the filesystem layer, because you've constrained yourself to writes that don't require updating other parts of the filesystem structure?
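One way to picture "writes that don't require updating other parts of the filesystem structure" is preallocation followed by in-place overwrites. The sketch below is an assumption about the general technique, not the article's code; `preallocate` and `overwrite_block` are hypothetical names. Blocks are allocated once, and later writes never extend the file. Timestamps still change on each write, so this does not remove every metadata update.

```python
import os


def preallocate(path: str, size: int) -> None:
    """Reserve `size` bytes up front so later writes never grow the file."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        try:
            os.posix_fallocate(fd, 0, size)  # not available on every platform
        except (AttributeError, OSError):
            # Fallback for this sketch only: fill the file with zeros.
            os.pwrite(fd, bytes(size), 0)
        os.fsync(fd)  # persist the allocation before relying on it
    finally:
        os.close(fd)


def overwrite_block(path: str, offset: int, block: bytes) -> None:
    """Overwrite bytes strictly inside the preallocated range."""
    fd = os.open(path, os.O_WRONLY)
    try:
        if offset < 0 or offset + len(block) > os.fstat(fd).st_size:
            raise ValueError("write would extend the file")
        if os.pwrite(fd, block, offset) != len(block):
            raise OSError("short write")
    finally:
        os.close(fd)
```

Because `overwrite_block` refuses any write past the preallocated size, the file length and block allocation stay fixed after `preallocate` returns.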

seebeen today at 4:56 PM

So basically, you are writing data without guarantees it's actually written? "YOLO mode" but for data written to a device?

Would you be so kind as to explain what happens in a power-loss scenario?

zzsheng last Thursday at 9:25 AM

Author here. This is not a general argument against fsync; the design depends on SSD-only deployment, preallocated files, O_DIRECT, single-key atomicity, and device write guarantees.

seastarer today at 4:25 PM

It's more correct to use O_DSYNC in addition to O_DIRECT. This adds FUA to the disk write if the disk requires it for durability.
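A minimal sketch of that flag combination, assuming Linux and a 4096-byte block size (real code would query the device). The helper names are hypothetical. The fallback to `O_DSYNC` alone is a convenience added for this sketch, because some filesystems reject `O_DIRECT` at open time.

```python
import mmap
import os

BLOCK = 4096  # assumed block size for this sketch


def open_for_durable_writes(path: str) -> int:
    """Open with O_DSYNC, adding O_DIRECT where the platform and filesystem allow it."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_DSYNC
    direct = getattr(os, "O_DIRECT", 0)  # Linux-only flag
    try:
        return os.open(path, flags | direct, 0o644)
    except OSError:
        # Sketch-only fallback: O_DSYNC still asks that data be reported
        # stable before each write returns, but the page cache is used.
        return os.open(path, flags, 0o644)


def write_block(fd: int, offset: int, payload: bytes) -> None:
    """Write one zero-padded, block-aligned block at a block-aligned offset."""
    if len(payload) > BLOCK or offset % BLOCK:
        raise ValueError("payload must fit one block at an aligned offset")
    buf = mmap.mmap(-1, BLOCK)  # anonymous mapping is page-aligned, as O_DIRECT requires
    try:
        buf[:len(payload)] = payload
        if os.pwrite(fd, buf, offset) != BLOCK:
            raise OSError("short write")
    finally:
        buf.close()
```

With `O_DSYNC` set, `write_block` should return only after the kernel has asked the device to make the data stable. Whether the device honors that request is the trust question raised elsewhere in this thread.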

alexhnn last Thursday at 9:43 AM

Working with files is hard [1], and most of the complexity comes from the fsync API. I am glad it can be eliminated from a KV storage engine.

[1] https://news.ycombinator.com/item?id=42805425

bawolff today at 3:38 PM

Am I understanding correctly that you are just targeting consistency and not durability?

7e today at 3:30 PM

This is really great work. Kudos to the team for such an elegant solution.

dboreham today at 2:05 PM

Almost full-circle back to when Oracle took over the entire volume and implemented its own filesystem.
