> By Jan 2024, our largest table had roughly 100 million rows.
I did a double take at this. At the start of the article, the fact that they're using a distributed database, plus the mention of a "mid six figure" DB bill, made me assume they had some obscenely large database far beyond what a single node could handle. They don't detail the Postgres setup that replaced it, so I assume it's a pretty standard single primary, and a 100 million row table is well within its abilities; I have a 150 million row table happily plugging along on a 2 vCPU + 16 GB instance. Apples and oranges, perhaps, but people shouldn't underestimate what a single modern server can do.
It’s incredible how much Postgres can handle.
At $WORK, we write ~100M rows per day and keep years of history, all in a single database. Sure, the box is big, but I have beautiful transactional workloads and no distributed systems to worry about!
You don't even need to be that "modern." Back in 2010 I was working on a MySQL 5.x system with about 300 million rows on a dual Xeon box with 16 gigs RAM and a few hundred gigs of RAID 10. This was before SSDs were common.
The largest table was over 100 million rows. Some migrations were painful, however. At that time, some of them would lock the whole table and we'd need to run them overnight. Fortunately, this was for an internal app so we could do that.
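(Worth noting for anyone hitting this today: later MySQL versions made many of these migrations non-blocking. A minimal sketch, assuming MySQL 5.6+ online DDL and a hypothetical `orders` table:

```sql
-- MySQL 5.6+ can perform many ALTERs online; spelling out the
-- ALGORITHM/LOCK clauses makes the statement fail fast instead of
-- silently taking a full-table lock.
ALTER TABLE orders
    ADD COLUMN status TINYINT NOT NULL DEFAULT 0,
    ALGORITHM=INPLACE, LOCK=NONE;
```

Back on 5.x, tools like Percona's pt-online-schema-change were the usual workaround.)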
Yeah, we have 300M+ rows in a table as well. It's partitioned by time and chugs along with no issues. Granted, it's a 30 vCPU, 100 GB RAM machine, but it hosts billions of rows in aggregate.
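For anyone curious what that looks like, here's a minimal sketch of time-based partitioning using Postgres 10+ declarative syntax (the `events` table and its columns are hypothetical):

```sql
-- Parent table is partitioned by a timestamp column; each child
-- partition holds one month of data.
CREATE TABLE events (
    id         bigint GENERATED ALWAYS AS IDENTITY,
    created_at timestamptz NOT NULL,
    payload    jsonb
) PARTITION BY RANGE (created_at);

CREATE TABLE events_2024_01 PARTITION OF events
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
CREATE TABLE events_2024_02 PARTITION OF events
    FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');

-- Queries that filter on created_at are pruned to the matching
-- partitions, so most of the table is never touched.
SELECT count(*)
FROM events
WHERE created_at >= '2024-01-15' AND created_at < '2024-01-16';
```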
Does mid six figure mean ~$500k?
That sounds insane for a CRUD app with one million users.
What am I missing?
Agreed. Devs usually do a double take when I tell them that their table with 100K rows is not in fact big, or even medium. Everyone's experiences are different, of course, but to me, big starts somewhere in the high hundreds of millions. Past a billion it doesn't really matter; the difference between 1 billion and 5 billion isn't important, because it's exceedingly unlikely that (a) your working set is that large, or (b) your server could possibly cope with all of it at once. I hope you have partitions.
Yeah, 100M is really not that much. I worked on a 10B-row table on an RDS r6g.4xlarge, and Postgres handled it fine, even with 20+ indexes. Not ideal, and I'd rather have fewer indexes and shard the table, but Postgres dealt with it.
OTOH, they are admittedly using an ORM (Prisma, known for its weight).
Also screamed in my head: I have way more rows than that in a Postgres instance right now, and I'm paying less than $500!
We have a couple of tables with about a billion rows now on single MySQL nodes: 256 GB RAM and a number of 2 TB NVMe drives. It works completely fine, but you can forget about a timely restore if something goes completely fucked. And we can't run any operation that isn't directly using an index, or the whole system's performance suffers immediately. Which means we basically have to use those tables as if they were a distributed database, but at least we have transactionality.
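That "index or nothing" rule is easy to enforce mechanically: check the plan before anything new touches the big tables. A hedged sketch, assuming MySQL 8.0+ and a hypothetical `orders` table:

```sql
-- If the plan shows a full table scan ("Table scan on orders" in
-- the tree output), the query doesn't get to run in production.
EXPLAIN FORMAT=TREE
SELECT * FROM orders WHERE customer_id = 12345;
```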
I missed that number, but I caught that they migrated all the data in 15 minutes, and I blinked: "wait, how little data are we talking about, for how much money!?"
When I was running tech for a (tiny) nonprofit we self-hosted a geographic database because it was cheaper and easier.
There were something like 120 million rows in the database. It ran on a single VM. It really needed the indexes, but once those were built, it just sang.
This was easily 10+ years ago.
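I'd guess the indexes in question were spatial ones. A sketch of what that might have looked like, assuming PostGIS and a hypothetical `places` table:

```sql
-- GiST index over the geometry column; without it, every spatial
-- query has to scan all ~120M rows.
CREATE INDEX places_geom_gist ON places USING gist (geom);

-- A bounding-box search (roughly NYC here) that can use the index:
SELECT name
FROM places
WHERE geom && ST_MakeEnvelope(-74.05, 40.68, -73.85, 40.88, 4326);
```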
Nice! What optimizations have you put in place to support 150 million rows? Just some indexing, or other fancy stuff?
Mid-six-figure DB bill; let's estimate $500k. Divide that into 100 million rows (ignoring the rest, because DB provisioning is typically dominated by the needs of a few core tables): they get 200 rows per dollar.
Your table on a small VPS (which I concur is totally reasonable; I'm running something similar myself): let's say the VPS costs $40/mo × 12 = $480/yr. Divide that into 150 million rows: you get 312,500 rows per dollar.
I'd wager your server was faster under normal load too. But is it webscale? /s
There's waste, then there's "3 orders of magnitude" waste. The pain is self-inflicted. Unless you have actual requirements that warrant a complex distributed database, you should "just use postgres".
And just to calibrate everyone's expectations: I've seen a standard prod setup using open-source Postgres on AWS EC2 instances (1 primary, 2 replicas, 1 haproxy+pgbouncer box to load-balance queries) that cost ~$700k annually. This system was capable of handling 1.2 million rows inserted per second while simultaneously serving thousands of read queries/s from hundreds of internal apps across the enterprise. The cost effectiveness in their case came out to ~20k rows per dollar, lower than your VPS since the replicas and connection pooling eat into the budget. But still: two orders of magnitude more cost effective than the hosted distributed hotness.
Call me old-fashioned, but when record counts start reaching the 100 million range, it's usually an indication that your dataset is either too wide (consider sharding) or too deep (consider time-based archival) to fit into a monolithic schema. For context, I dealt with multiple systems that generated this volume of data between 2003 and 2013 (mostly capital markets, but also some govt/compliance work), with databases and hardware from that era, and we rarely had an issue that couldn't be solved by query optimization, caching, sharding, or archival, usually in that order.
Secondly, we did most of these things using SQL, Bash scripts, cron jobs, and some I/O logic built directly into the application code. They were robust enough to handle some extremely mission-critical systems (a failure could bring down a US primary market, and if it was bad enough, you'd hear about it on the news).
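The same pattern is still straightforward to reproduce today. A hedged sketch of time-based archival using modern Postgres partitioning, driven by a monthly cron job (all table names hypothetical):

```sql
-- Detach the oldest partition so it becomes a plain standalone
-- table, archive it out of band (pg_dump, COPY to cold storage,
-- etc.), then drop it.
ALTER TABLE trades DETACH PARTITION trades_2013_01;
-- ... archive trades_2013_01 here ...
DROP TABLE trades_2013_01;
```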