Hacker News

What Does a Database for SSDs Look Like?

37 points by charleshn · today at 10:13 AM · 22 comments

Comments

mrkeen · today at 10:48 AM

> Design decisions like write-ahead logs, large page sizes, and buffering table writes in bulk were built around disks where I/O was SLOW, and where sequential I/O was order(s)-of-magnitude faster than random.

Overall speed is irrelevant; what mattered was the relative speed difference between sequential and random access.

And since there's still a massive difference between sequential and random access with SSDs, I doubt the overall approach of using buffers needs to be reconsidered.
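
As a rough illustration of that relative gap, here is a minimal sketch (not from the article; the file name and size are arbitrary, and it measures through the page cache rather than the raw device, so it understates the difference):

```python
# Illustrative only: times 4 KiB reads over the same file, first in order,
# then at shuffled offsets. No O_DIRECT, so the page cache softens the gap,
# but the shape of the result holds.
import os, random, time

PAGE = 4096
PATH = "bench.dat"            # hypothetical scratch file
SIZE = 256 * 1024 * 1024      # 256 MiB

with open(PATH, "wb") as f:
    f.write(os.urandom(SIZE))

def timed_reads(offsets):
    fd = os.open(PATH, os.O_RDONLY)
    start = time.perf_counter()
    for off in offsets:
        os.pread(fd, PAGE, off)
    os.close(fd)
    return time.perf_counter() - start

sequential = [i * PAGE for i in range(SIZE // PAGE)]
shuffled = sequential[:]
random.shuffle(shuffled)

print("sequential:", timed_reads(sequential))
print("random:    ", timed_reads(shuffled))
```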

PunchyHamster · today at 12:37 PM

> WALs, and related low-level logging details, are critical for database systems that care deeply about durability on a single system. But the modern database isn’t like that: it doesn’t depend on commit-to-disk on a single system for its durability story. Commit-to-disk on a single system is both unnecessary (because we can replicate across storage on multiple systems) and inadequate (because we don’t want to lose writes even if a single system fails).

And then a bug crashes your whole database cluster at once, and instead of losing seconds of writes you lose minutes' worth, because some smartass thought "surely if I send the request to 5 nodes, some of that will land on disk in the reasonably near future?"

I love how this industry invents best practices that are actually good, and then people invent badly researched reasons to just... not do them.
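
To make the trade-off concrete, here is a toy sketch of the two durability stories being argued over, with entirely hypothetical names: a quorum of in-memory acks (the quoted design) versus a quorum of fsync'd writes (the commenter's position).

```python
# Toy model only. With flush_to_disk=False, a bug that takes every node down
# at once loses whatever is still sitting in `buffered`; with True, the
# acknowledged records are on each replica's disk.
import os

class Replica:
    def __init__(self, path):
        self.path = path
        self.buffered = []            # records held only in memory

    def accept(self, record, flush_to_disk):
        self.buffered.append(record)
        if flush_to_disk:
            with open(self.path, "ab") as f:
                f.write(record + b"\n")
                f.flush()
                os.fsync(f.fileno())  # survives this node crashing
        return True                   # ack back to the coordinator

def replicate(replicas, record, quorum, flush_to_disk):
    acks = sum(r.accept(record, flush_to_disk) for r in replicas)
    return acks >= quorum

nodes = [Replica(f"replica{i}.log") for i in range(5)]
print(replicate(nodes, b"set last_login_time", quorum=3, flush_to_disk=True))
```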

zokier · today at 11:17 AM

The author could have started by surveying the current state of the art instead of just falsely assuming that DB devs have been resting on their laurels for decades. If you want to see a (relational) DB for SSDs, check out stuff like myrocks on zenfs+; it's pretty impressive stuff.

londons_explore · today at 10:53 AM

The median database workload is probably doing writes of just a few bytes per transaction, e.g. 'set last_login_time = now() where userid=12345'.

But because the interface between the SSD and the host OS is block based, you are forced to write a full 4k page. Which means you still really benefit from a write-ahead log to batch all those changes together, at least up to page size, if not larger.
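
A minimal sketch of the batching being described, with a made-up file name and record framing:

```python
# Tiny per-transaction records accumulate in memory and go out as one
# page-sized (or larger) sequential append, instead of a 4k page write
# per few-byte change.
import os

PAGE = 4096

class WalBatcher:
    def __init__(self, path, flush_bytes=PAGE):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self.flush_bytes = flush_bytes
        self.pending = bytearray()

    def append(self, record: bytes):
        # length-prefix each record so the log can be replayed later
        self.pending += len(record).to_bytes(4, "little") + record
        if len(self.pending) >= self.flush_bytes:
            self.flush()

    def flush(self):
        if self.pending:
            os.write(self.fd, bytes(self.pending))   # one sequential append
            os.fsync(self.fd)
            self.pending.clear()

wal = WalBatcher("wal.log")
for user_id in range(1000):
    wal.append(f"set last_login_time = now() where userid={user_id}".encode())
wal.flush()
```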

ljosifov · today at 12:01 PM

Not about SSDs specifically, but I assume the compact design doesn't hurt: DuckDB saved my sanity recently. Single file, columnar, with built-in compression I presume (given that in a columnar layout even the simplest compression can be very effective), and with $ duckdb -ui /path/to/data/base.duckdb opening a notebook in the browser. I didn't find a single thing to dislike about DuckDB, as a single user. To top it off, AFAIK it can be zero-copy 'overlaid' on top of a bunch of Parquet binary files to provide SQL over them?? (I didn't try it; it would be amazing if it works well.)
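
For what it's worth, DuckDB can indeed run SQL directly over Parquet files without importing them first; a small sketch with made-up paths and column names:

```python
# Query Parquet files in place via read_parquet (glob patterns are supported).
import duckdb

con = duckdb.connect("base.duckdb")   # or duckdb.connect() for in-memory
rows = con.execute("""
    SELECT userid, max(last_login_time) AS last_seen
    FROM read_parquet('data/logins/*.parquet')
    GROUP BY userid
""").fetchall()
print(rows[:5])
```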

dist1ll · today at 12:37 PM

Is there more detail on the design of the distributed multi-AZ journal? That feels like the meat of the architecture.

danielfalbo · today at 11:07 AM

Reminds me of: Databases on SSDs, Initial Ideas on Tuning (2010) [1]

[1] https://www.dr-josiah.com/2010/08/databases-on-ssds-initial-...

raggi · today at 11:50 AM

It may not matter for clouds with massive margins, but there are substantial opportunities for optimizing wear.