Hacker News

willvarfar · today at 6:39 AM · 3 replies

I had a great euphoric epiphany feeling today. Doesn't come along too often, will celebrate with a nice glass of wine :)

Am doing data engineering for some big data (yeah, big enough) and thinking about the efficiency of data enrichment. There's a classic trilemma with data enrichment: you can have good write efficiency, good read efficiency, or low storage cost. Pick two.

E.g. you have a 1TB table and you want to add a column that, say, will take 1GB to store.

You can create a new table that is the old 1TB plus the new 1GB column and then delete the old table, but this is write-inefficient and often breaks how normal data lake orchestration works.

You can create that new wide table and keep it alongside the old table, but this is both write-inefficient and expensive to store.

You can create a narrow companion table that has just a join key and 1GB of data. This is efficient to write and store, but inefficient to query when you force all users to do joins on read.
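Toy sketch of that third option, with made-up paths and column names, and plain parquet via pandas standing in for whatever the lake actually uses:

    import pandas as pd

    # Write side: only the join key plus the new column gets written,
    # so the write and the storage are ~1GB, not ~1TB.
    enrichment = pd.DataFrame({
        "row_id": [1, 2, 3],                 # join key shared with the base table
        "enriched_score": [0.9, 0.1, 0.5],   # the new column
    })
    enrichment.to_parquet("enrichment.parquet")

    # Read side: every consumer now pays for the join themselves.
    base = pd.read_parquet("base_table.parquet")   # the ~1TB table
    wide = base.merge(enrichment, on="row_id", how="left")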

And I've come up with a cunning fourth way where you write a narrow table and read a wide table, so it's literally the best of all worlds! Kinda staggering :) Still on a high.
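For comparison, the obvious off-the-shelf way to fake "write narrow, read wide" is a view over the companion join, but that still runs the join on every read, which is exactly the cost I'm trying to avoid. Something like this (names invented, DuckDB just as an example engine):

    import duckdb

    con = duckdb.connect("lake.duckdb")

    # Readers see a wide table without it ever being materialised...
    con.sql("""
        CREATE OR REPLACE VIEW wide_table AS
        SELECT b.*, e.enriched_score
        FROM base_table AS b
        LEFT JOIN enrichment AS e USING (row_id)
    """)

    # ...but the join still happens here, on every single read.
    con.sql("SELECT avg(enriched_score) FROM wide_table").show()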

Might actually be a conference paper, which is new territory for me. Let's see :)

/off dancing


Replies

Fazebooking · today at 8:54 AM

Sounds off to me tbh.

Where your table is stored shouldn't matter that much if you have proper indexes (which you need), and if you change anything your db is rebuilding the indexes anyway

nurettin · today at 6:53 AM

You mean you discovered parallel arrays?

anonu · today at 8:11 AM

look into vector databases. for most representations, a column is just another file on disk
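e.g. in a toy per-column layout (one numpy file per column, filenames made up; real formats like parquet pack column chunks into row groups inside a file, but the idea is similar), adding a column touches exactly one new file:

    import numpy as np

    # Existing column, already on disk.
    np.save("col_user_id.npy", np.arange(1_000_000))

    # Enrichment: one brand-new file, the old column files are untouched.
    np.save("col_enriched_score.npy", np.random.rand(1_000_000))

    scores = np.load("col_enriched_score.npy")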