notepad0x90 • today at 4:05 PM

I don't fully agree with this, for large nested datasets and arrays.

Especially with arrays: what could be one line of JSON becomes, in a CSV, either a non-normalized array stored as a string in a single cell, or an expanded array with one value per cell, which creates $array_size rows.
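
A rough sketch of that trade-off, using only Python's standard `json` and `csv` modules. The record, its field names, and the `csv_text` helper are made up for illustration; they don't come from any real dataset or library.

```python
import csv
import io
import json

# Hypothetical record with a nested array.
record = {"id": 1, "tags": ["a", "b", "c"]}

# JSON: the array stays a nested value on a single line.
json_line = json.dumps(record)


def csv_text(rows, fieldnames):
    """Render a list of dicts as CSV text (illustrative helper)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# CSV option 1: keep the array non-normalized, as a string in a single cell.
packed = csv_text(
    [{"id": record["id"], "tags": json.dumps(record["tags"])}],
    ["id", "tags"],
)

# CSV option 2: expand the array, one row per element.
exploded = csv_text(
    [{"id": record["id"], "tag": tag} for tag in record["tags"]],
    ["id", "tag"],
)

print(json_line)
print(packed)
print(exploded)
```

Option 1 means every reader has to parse the cell a second time; option 2 repeats `id` on every row, so the row count grows with the array size.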

You can normalize data in just about any structured format, but columns aren't the end-all-be-all normalization format. I think pandas uses "frames".

Replies

llm_nerd • today at 5:39 PM

>but columns aren't the end-all-be-all normalization format. I think pandas uses "frames".

Pandas is column-oriented, as are basically all high-performance data libraries. Each column is a separate array of data. To get a "row" you take the nth item from each of the arrays.

And FWIW, column-oriented isn't considered normalization. It's a physical optimization that can yield enormous performance advantages for some classes of problems, but can cause a performance nightmare for other problems.

Data analytics loves column-oriented. CRUD-type stuff does not. And in the programming realm there are several ways to get Structures of Arrays (SoA) instead of the classic Arrays of Structures (AoS).
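
A minimal sketch of the two layouts in plain Python, to make the distinction concrete. The field names and the `get_row` helper are hypothetical, and this is not how pandas stores data internally; it only shows the shape of the idea.

```python
# Array of Structures (AoS): one record per element, i.e. row-oriented.
aos = [
    {"id": 1, "price": 9.5},
    {"id": 2, "price": 3.0},
    {"id": 3, "price": 7.25},
]

# Structure of Arrays (SoA): one array per field, i.e. column-oriented.
soa = {key: [row[key] for row in aos] for key in aos[0]}


def get_row(columns, n):
    """Rebuild a "row" by taking the nth item from each column array."""
    return {name: values[n] for name, values in columns.items()}


# An analytics-style scan only touches the one column it needs.
total_price = sum(soa["price"])

# A CRUD-style single-record read has to visit every column array.
second_row = get_row(soa, 1)

print(soa)
print(total_price)
print(second_row)
```

The column scan reads one contiguous array, while fetching or updating a single record touches every array, which is roughly why the two workloads prefer different layouts.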
