mightyham • today at 4:51 PM • 4 replies

Unless I am mistaken, it seems like there is a glaring flaw in this scheme, which is that without fsync you cannot guarantee the previous WAL blocks have been persisted before the current one, so a power loss event could leave a hole in the log and cause erroneous recovery. I believe that SSDs reorder writes internally so even having atomic batched O_DIRECT is not a strong enough guarantee for durability. I'll admit that I could be misunderstanding something about the system that alleviates this concern.

Replies

hedora • today at 5:31 PM

Assuming O_DIRECT actually blocks until the SSD has acked (this isn't actually what O_DIRECT's contract says, but what they rely on), you have to wait until each page write acks whenever you need a persistence barrier.

My guess is the preallocation + zeroing is what got them most of the win, and the O_DIRECT is actually hurting, not helping throughput. This has been the case 100% of the time I've benchmarked such things.

If you're doing this sort of stuff for real under Linux, check out sync_file_range. It's the only non-broken and performant sync API for ext4 (note that it's broken by design for many other file systems, and the API is terribly difficult to use correctly).

If you really care, it's probably just easier to use SPDK or something. Linux has historically been pretty hostile towards DBMS implementations.
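
As a rough illustration of the preallocation + zeroing idea mentioned above, here is a minimal Python sketch. It is not the scheme from the article: it uses a plain `fdatasync` as the persistence barrier instead of relying on O_DIRECT, and the names `preallocate_zeroed`, `append_block`, and the 4096-byte `BLOCK` size are assumptions made up for this example. The point is only that once every block has been written and synced once, later appends overwrite already-allocated blocks and should not need to change the file size.

```python
import os

BLOCK = 4096  # assumed block size for this sketch

# fdatasync is missing on some platforms; fall back to fsync there.
_datasync = getattr(os, "fdatasync", os.fsync)

def preallocate_zeroed(path, size):
    """Create a log file of `size` bytes, written with zeros and synced once."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o644)
    zeros = bytes(BLOCK)
    for off in range(0, size, BLOCK):
        os.pwrite(fd, zeros[: min(BLOCK, size - off)], off)
    os.fsync(fd)  # one-time sync of the zeroed data and the file size
    return fd

def append_block(fd, offset, payload):
    """Overwrite one preallocated block, then wait for it to be persisted."""
    if len(payload) > BLOCK:
        raise ValueError("payload larger than one block")
    os.pwrite(fd, payload.ljust(BLOCK, b"\0"), offset)
    _datasync(fd)  # persistence barrier before the next block is written
    return offset + BLOCK
```

A real implementation would also need to sync the parent directory after creating the file, which this sketch leaves out.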

jandrewrogers • today at 5:20 PM

Many storage devices guarantee that all successful DMA (e.g. O_DIRECT) writes are persisted even in the event of a power loss. Obviously, this does not work on storage devices that do not offer this guarantee. It also does not work if the filesystem does not support direct I/O or if the write requires metadata updates.

This is not a new trick. It has been used in many storage engine designs to effect durability without an fsync.

seebeen • today at 4:58 PM

I also asked what happens when a power loss happens.

convolvatron • today at 5:37 PM

if there is a hole in the log then the end of the log is before the hole. you do have to have checksums on log chunks, and better a kind of rolling hash, but you're really just talking about the number of entries that we would have liked to commit but didn't
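
To make the "end of the log is before the hole" rule concrete, here is a small Python sketch of recovery over checksummed records. The record layout (a little-endian length and CRC32 header followed by the payload) and the function names `append_record` and `recover` are assumptions for illustration, and seeding each CRC with the previous record's CRC is just one possible reading of "a kind of rolling hash". Recovery stops at the first record that fails verification, so anything after a hole is treated as never committed, even if it happens to be intact on disk.

```python
import struct
import zlib

HEADER = struct.Struct("<II")  # payload length, chained CRC32 (assumed layout)

def append_record(log, payload, prev_crc):
    """Append one record to `log` (a bytearray) and return its chained CRC."""
    crc = zlib.crc32(payload, prev_crc)  # seed with the previous record's CRC
    log += HEADER.pack(len(payload), crc) + payload
    return crc

def recover(log):
    """Return the payloads of the valid prefix of the log."""
    records, offset, prev_crc = [], 0, 0
    while offset + HEADER.size <= len(log):
        length, crc = HEADER.unpack_from(log, offset)
        start = offset + HEADER.size
        end = start + length
        if length == 0 or end > len(log):
            break  # zeroed (never written) header or truncated record
        payload = bytes(log[start:end])
        if zlib.crc32(payload, prev_crc) != crc:
            break  # hole or stale data: the log ends before this record
        records.append(payload)
        prev_crc = crc
        offset = end
    return records
```

Because each checksum depends on the one before it, a stale record left over from an earlier use of the same file region would not verify against the current chain, which a plain per-record checksum would not catch.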
