I ran into a practical limitation while working on ML feature engineering and multi-omics data.
At some point, the problem stops being “how many rows” and becomes “how many columns”. Thousands, then tens of thousands, sometimes more.
What I observed in practice:
- Standard SQL databases usually cap out around ~1,000–1,600 columns.
- Columnar formats like Parquet can handle width, but typically require Spark or Python pipelines.
- OLAP engines are fast, but tend to assume relatively narrow schemas.
- Feature stores often work around this by exploding data into joins or multiple tables.
At extreme width, metadata handling, query planning, and even SQL parsing become bottlenecks.
I experimented with a different approach:
- no joins
- no transactions
- columns distributed instead of rows
- SELECT as the primary operation
With this design, it’s possible to run native SQL selects on tables with hundreds of thousands to millions of columns, with predictable (sub-second) latency when accessing a subset of columns.
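The post does not describe the implementation, so the following is only a toy sketch of what "columns distributed instead of rows" could mean. Everything in it (`ColumnShardedTable`, `node_for`, the CRC32 placement, the in-memory dicts standing in for servers) is hypothetical and not the author's design. It shows the one property the post relies on: a select touches only the nodes that hold the requested columns.

```python
import zlib
from collections import defaultdict


def node_for(column, n_nodes):
    # Stable hash, so every client maps a column name to the same node.
    return zlib.crc32(column.encode("utf-8")) % n_nodes


class ColumnShardedTable:
    """Toy table whose columns (not rows) are spread across nodes."""

    def __init__(self, n_nodes):
        self.n_nodes = n_nodes
        # Each "node" is just a dict: column name -> list of values.
        self.nodes = [dict() for _ in range(n_nodes)]

    def insert_column(self, name, values):
        self.nodes[node_for(name, self.n_nodes)][name] = list(values)

    def select(self, names, row_limit):
        # Group the requested columns by the node that owns them.
        by_node = defaultdict(list)
        for name in names:
            by_node[node_for(name, self.n_nodes)].append(name)
        # One "request" per involved node; other nodes are never touched.
        fetched = {}
        for node_id, wanted in by_node.items():
            for name in wanted:
                fetched[name] = self.nodes[node_id][name][:row_limit]
        # Stitch the column slices back into rows, in the requested order.
        return list(zip(*(fetched[name] for name in names)))


table = ColumnShardedTable(n_nodes=2)
for i in range(10_000):
    table.insert_column(f"f{i}", [i * 10 + r for r in range(3)])

rows = table.select(["f7", "f42"], row_limit=2)
```

In this sketch the cost of a select depends on the number of columns requested, not on the total width of the table, which is the behaviour the latency claim above would need.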
On a small cluster (2 servers, AMD EPYC, 128 GB RAM each), rough numbers look like:
- creating a 1M-column table: ~6 minutes
- inserting a single column with 1M values: ~2 seconds
- selecting ~60 columns over ~5,000 rows: ~1 second
I’m curious how others here approach ultra-wide datasets. Have you seen architectures that work cleanly at this width without resorting to heavy ETL or complex joins?
With genomics, your data is probably write ~once, almost entirely numeric, and is most likely used for single-client offline analysis. This differs a lot from what most SQL databases are optimizing for.
My best experience has been ignoring SQL and using (sparse) matrix formats for the genomic data itself, possibly combined with some small metadata tables that can fit easily in existing solutions (often even in memory). Sparse matrix formats like CSC/CSR can store numeric data at ~12 bytes per non-zero entry, so a single one of your servers should handle 10B data points in RAM and another 10x that comfortably on a local SSD. Maybe no need to pay the cost of going distributed?
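To make the ~12 bytes per non-zero figure concrete, here is a rough sketch of the CSC layout and its size. The byte widths are assumptions (8-byte values, 4-byte row indices, 8-byte column pointers), and `to_csc` and `csc_bytes` are illustrative helpers, not part of any library.

```python
def csc_bytes(nnz, n_cols, value_bytes=8, index_bytes=4, ptr_bytes=8):
    """Approximate CSC size: values + row indices + column pointers."""
    return nnz * (value_bytes + index_bytes) + (n_cols + 1) * ptr_bytes


def to_csc(dense_rows):
    """Convert a small dense row-major matrix to (data, row_idx, col_ptr)."""
    n_cols = len(dense_rows[0]) if dense_rows else 0
    data, row_idx, col_ptr = [], [], [0]
    for c in range(n_cols):
        for r, row in enumerate(dense_rows):
            if row[c] != 0:
                data.append(row[c])   # the non-zero value
                row_idx.append(r)     # the row it came from
        col_ptr.append(len(data))     # where the next column starts
    return data, row_idx, col_ptr


data, row_idx, col_ptr = to_csc([
    [0, 5, 0],
    [7, 0, 0],
    [0, 0, 0],
    [0, 9, 2],
])

# 10B non-zeros spread over 1M columns, under the assumed byte widths.
total = csc_bytes(nnz=10_000_000_000, n_cols=1_000_000)
```

Under these assumptions the 10B-entry case comes to about 120 GB, almost all of it values and row indices; the column pointers add only about 8 MB even at 1M columns. Reading one column is a single contiguous slice `data[col_ptr[c]:col_ptr[c + 1]]`.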
Self plug: if you're in the single cell space, I wrote a paper on my project BPCells which has some storage format benchmarks up to a 60k column, 44M row RNA-seq matrix.
Is there really a market for these kinds of relational tables?
I created a system to support my custom object store where the metadata tags are stored within key-value stores. I can use them to create relational tables and query them just like conventional row stores used by many popular database engines.
My 'columnar store database' can handle many thousands of columns within a single table. So far, I have only tested it out to 10,000 columns, but it should handle many more.
I can get sub-second query times against it running on a single desktop. I haven't promoted this feature because no one I have talked to about it has had a compelling use for it.
Yeah, this is a hard problem, especially because standard SQL databases only partially implement the relational model, have no good recourse for dealing with relations-in-relations, and lack ways to build your own storage in user space (all things I dream of tackling).
I think one possible answer is to "compress" columns with custom datatypes. That may require touching the innards of the database (in PostgreSQL, for example, you need to do it in C), but it is a viable option in many cases: what you would otherwise express in JSON, for example, is often really a custom type that could be stored efficiently if there is a way to translate it to more primitive types. Once that is solved, the indexes will work.
The second option is to hide part of the join complexity with views.
ClickHouse and Scuba address this. The core idea is the data layout on disk only requires the scan to open files or otherwise access data for the columns referenced in that query.
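A minimal sketch of that idea, assuming the simplest possible layout: one binary file of float64 values per column. This is not how ClickHouse or Scuba actually store data; `ColumnDirTable` and the `.f64` naming are made up for illustration, and column names are assumed to be safe to use as file names.

```python
import os
import tempfile
from array import array

ITEM = array("d").itemsize  # bytes per stored float64


class ColumnDirTable:
    """Toy column store: one file per column inside a directory."""

    def __init__(self, root):
        self.root = root
        self.files_opened = 0  # counts reads, to show what a query touches

    def _path(self, name):
        return os.path.join(self.root, name + ".f64")

    def write_column(self, name, values):
        with open(self._path(name), "wb") as f:
            array("d", values).tofile(f)

    def select(self, names, limit=None):
        cols = []
        for name in names:
            path = self._path(name)
            n = os.path.getsize(path) // ITEM
            if limit is not None:
                n = min(n, limit)
            col = array("d")
            with open(path, "rb") as f:
                self.files_opened += 1
                col.fromfile(f, n)  # read only the first n values
            cols.append(col)
        return list(zip(*cols))


with tempfile.TemporaryDirectory() as root:
    table = ColumnDirTable(root)
    for i in range(200):
        table.write_column(f"c{i}", [float(i + r) for r in range(5)])
    rows = table.select(["c3", "c150"], limit=2)
```

The table has 200 columns, but the select opens exactly two files. Real engines add compression, granules or row groups, and metadata caching on top, and it is that metadata which tends to become the bottleneck at extreme width.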
What engine and data format were you using for your experiment?
You mention Parquet and Spark, but I’m wondering if you tried any of the “Lakehouse” formats that are basically Parquet plus a metadata layer (e.g. Iceberg). I’d probably at least give Trino or Presto a shot, although I suspect that you’ll have similar metadata issues with those engines.
Try StarRocks. I am totally not affiliated with them but I have investigated them deeply in the past.
That said, I have never seen 1 million columns.
> With this design, it’s possible to run native SQL selects on tables with hundreds of thousands to millions of columns, with predictable (sub-second) latency when accessing a subset of columns.
What is the design?
What are the columns and why are there so many of them? The standard approach is to explode into many tables and introduce joins as you said. Why don’t you want joins?
Exasol is another MPP database that easily handles super-wide tables, and does all the distribution across nodes for you.
It used to only be available for big enterprises, but now there is a totally free version you can try out: https://www.exasol.com/personal
Check out the Vortex file format (https://vortex.dev/). If you are interested in a distributed SQL engine, you can look at SpiralDB (https://spiraldb.com/); I haven’t used that one personally, but they created Vortex.
If you can drop the “distributed” part, then plug in DuckDB (https://duckdb.org/) and query Parquet (out of the box) or Vortex (https://duckdb.org/docs/stable/core_extensions/vortex.html) with it.