
yencabulator today at 12:37 AM

Creating a new record batch for a single row is also a huge kludge leading to a lot of write amplification. At that point, you're better off storing rows than pretending it's columnar.
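For concreteness, here is a minimal sketch of what that looks like with the Rust arrow crate (the two-column schema is made up): even a one-row write constructs a full columnar RecordBatch, with one heap-allocated array per column, each holding a single value.

    use std::sync::Arc;

    use arrow::array::{ArrayRef, Int64Array, StringArray};
    use arrow::datatypes::{DataType, Field, Schema};
    use arrow::record_batch::RecordBatch;

    fn single_row_batch(id: i64, name: &str) -> Result<RecordBatch, arrow::error::ArrowError> {
        let schema = Arc::new(Schema::new(vec![
            Field::new("id", DataType::Int64, false),
            Field::new("name", DataType::Utf8, true),
        ]));
        // One full Arrow array per column, each carrying exactly one value:
        // this per-row batch overhead is the write amplification in question.
        let columns: Vec<ArrayRef> = vec![
            Arc::new(Int64Array::from(vec![id])),
            Arc::new(StringArray::from(vec![Some(name)])),
        ];
        RecordBatch::try_new(schema, columns)
    }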

I actually wrote a row storage format reusing Arrow data types (not Feather), just laying them out row-wise rather than columnar. The validity bits of the different columns are collected into a shared per-row bitmap, and fixed offsets within a record allow extracting any field in a zero-copy fashion. I store those rows in RocksDB, for now.
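A toy illustration of that layout (illustrative only; the actual implementation is in the links below): the shared validity bitmap sits at the front of the row, and fixed-width fields live at fixed offsets, so a field can be decoded straight from the stored bytes without a deserialization pass.

    // Hypothetical row layout for (id: i64, score: f64):
    //   byte 0       validity bitmap (bit 0 = id, bit 1 = score)
    //   bytes 1..9   id, little-endian
    //   bytes 9..17  score, little-endian
    const ID_OFFSET: usize = 1;
    const SCORE_OFFSET: usize = ID_OFFSET + 8;

    fn read_id(row: &[u8]) -> Option<i64> {
        // Check the field's bit in the shared per-row validity bitmap.
        if row[0] & 0b01 == 0 {
            return None;
        }
        // Fixed offset: the field is read directly from the raw bytes.
        Some(i64::from_le_bytes(row[ID_OFFSET..ID_OFFSET + 8].try_into().unwrap()))
    }

    fn read_score(row: &[u8]) -> Option<f64> {
        if row[0] & 0b10 == 0 {
            return None;
        }
        Some(f64::from_le_bytes(row[SCORE_OFFSET..SCORE_OFFSET + 8].try_into().unwrap()))
    }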

https://git.kantodb.com/kantodb/kantodb/src/branch/main/crat...

https://git.kantodb.com/kantodb/kantodb/src/branch/main/crat...

https://git.kantodb.com/kantodb/kantodb/src/branch/main/crat...


Replies

amluto today at 2:28 AM

> Creating a new record batch for a single row is also a huge kludge leading to a lot of write amplification.

Sure, except insofar as I didn’t want to pretend to be columnar. There just didn’t seem to be anything out there that met my (experimental) needs better. I wanted to stream out rows, event-sourcing style, and snarf them up in batches in a separate process into Parquet. Using Feather like it’s a row store can do this.
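Roughly, the producer side of that pipeline looks like this (a sketch with the Rust arrow crate; the schema and file name are made up, and Feather v2 is the Arrow IPC file format, which is what the IPC FileWriter emits): each event goes out as its own one-row batch, and a separate process can later read the file and compact the rows into Parquet.

    use std::fs::File;
    use std::sync::Arc;

    use arrow::array::{ArrayRef, Int64Array};
    use arrow::datatypes::{DataType, Field, Schema};
    use arrow::ipc::writer::FileWriter;
    use arrow::record_batch::RecordBatch;

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        let schema = Arc::new(Schema::new(vec![Field::new(
            "event_id",
            DataType::Int64,
            false,
        )]));
        let file = File::create("events.feather")?; // hypothetical path
        let mut writer = FileWriter::try_new(file, &schema)?;

        // Event-sourcing style: each event is appended as its own one-row batch.
        for event_id in 0..3i64 {
            let col: ArrayRef = Arc::new(Int64Array::from(vec![event_id]));
            let batch = RecordBatch::try_new(schema.clone(), vec![col])?;
            writer.write(&batch)?;
        }
        writer.finish()?;
        Ok(())
    }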

> kantodb

Neat project. I would seriously consider using that in a project of mine, especially now that LLMs can help out with the exceedingly tedious parts. (The current stack is regrettable, but a prompt like “keep exactly the same queries but change the API from X to Y” is well within current capabilities.)
