Hacker News

nh2 · today at 6:00 PM

Even then, I also share the confusion of the poster you're replying to.

I don't see how a virtualised NVMe disk is different from a physical one.

Especially if you don't have control over the underlying hardware (so you don't know whether it has SSDs with power-loss protection (PLP)), you should send the FUA.

> O_DATA_SYNC

You mean `O_DSYNC`?

Why would you need `O_DSYNC` on-premise, but not on cloud VMs? (Or are you saying you'd include it everywhere?) Similar to my above point, surely it is the task of the VM to pass through any FUA commands the VM guest issues to the actual storage?

Further: Is `O_DSYNC` actually substantially different from writing and then `fdatasync()`ing yourself?

My understanding is that no, it's the same. In particular, the same amount of data gets written. So if you avoid `fdatasync()` in order to avoid the "can trigger an order of magnitude more I/O" problem, you would just re-introduce that I/O with `O_DSYNC`.

However, I suspect that that whole consideration is pointless:

The only thing that makes your O_DIRECT+preallocated-only-overwrites writes safe are enterprise SSDs with Power Loss Protection (PLP), usually capacitors.

On those SSDs, NVMe Flush/FUA are no-ops [1]. So you might as well `fdatasync()`/`O_DSYNC`, always. This is simpler, and also better because you do not need to assume/hope that your underlying SSDs have PLP: Doing the safe thing is fast on PLP [2], and safe on non-PLP.

    [1] https://news.ycombinator.com/item?id=46532675
    [2] https://tanelpoder.com/posts/using-pg-test-fsync-for-testing-low-latency-writes/
So the only remaining benefit of `O_DSYNC` over `fdatasync()` is that you save a syscall. That's an OK optimisation given they are otherwise equivalent, but it would surprise me if it had any noticeable impact at the latencies you are reporting ("413 us"), because [2] reports the difference being around 6 us.

Let me know if I got anything wrong.

The only remaining question is: Why do you then see any difference in your benchmark?

    Configuration            Throughput (obj/s)
    -------------------------------------------
    ext4 + O_DIRECT + fsync             116,041
    Our engine                          190,985
That is what I'd find very valuable to investigate.

The first suspicion I have is: Shouldn't you be measuring `+ fdatasync` instead?

So I'd be interested in:

    ext4 + O_DIRECT + fdatasync
    ext4 + O_DIRECT + O_DSYNC
    Our engine + O_DSYNC (which you're suggesting above)
Also, I don't fully understand what the remaining difference between "ext4 + O_DIRECT + O_DSYNC" and "Our engine + O_DSYNC" would be.

Replies

thomas_fa · today at 6:17 PM

Thanks for the feedback. I have already replied in another thread about `O_DSYNC`, which a lot of folks have suggested, so I won't repeat it here.

As for the benchmark results: the difference was mainly due to metadata management. We implemented our own KV store (see the internals here [1]), which is more efficient than ext4's namespace management, even after very aggressive fs tuning [2] (plus 65,536-way sharding for each leveled directory).

[1] https://fractalbits.com/blog/metadata-engine-for-our-object-...

[2] https://github.com/fractalbits-labs/fractalbits/commit/12109...