That looks interesting but it seems inefficient to put an LLM directly into the compilation pipeline...

catlifeonmars • yesterday at 2:57 AM

That looks interesting but it seems inefficient to put an LLM directly into the compilation pipeline, not to mention that it introduces nondeterministic behavior.

Replies

menaerus • yesterday at 6:47 AM

It has different limitations but inefficiency doesn't seem likely to be one of them. Did you read the Experimental Results section?

> Figure 2 shows the experimental results, and GenDB outperforms all baselines on every query in both benchmarks. On TPC-H, GenDB achieves a total execution time of 214 ms across five representative queries.

> This result is 2.8× faster than DuckDB (594 ms) and Umbra (590 ms), which are the two fastest baselines, and 11.2× faster than ClickHouse.

> On SEC-EDGAR, GenDB achieves 328 ms, which is 5.0× faster than DuckDB and 3.9× faster than Umbra.

> The performance gap increases with query complexity. For example, on TPC-H Q9, which is a five-way join with a LIKE filter, GenDB completes in 38 ms, which is 6.1× faster than DuckDB. GenDB uses iterative optimization with early stopping criteria.

> On TPC-H, Q6 reaches a near-optimal time of 17 ms at iteration 0 with zone-map pruning and a branchless scan, and does not require further optimization. In contrast, Q18 starts at 12,147 ms and decreases to 74 ms by iteration 1, which is a 163× improvement. This gain comes from replacing a cache-thrashing hash aggregation with an index-aware sequential scan.
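
For anyone unfamiliar with the two techniques named in that quote, here is a rough C++ sketch of zone-map pruning combined with a branchless scan. It is not the code GenDB generated. `ZoneMap`, `build_zone_map`, and `filtered_sum` are hypothetical names, and the single-column range filter is a simplification of what Q6 actually does. The idea: keep a min/max per block of rows, skip whole blocks that cannot match, and inside the surviving blocks turn the predicate into a 0/1 multiplier instead of an `if`.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-block summary: min and max key of each fixed-size block.
struct ZoneMap {
    std::size_t block_size;
    std::vector<int32_t> min_key;
    std::vector<int32_t> max_key;
};

// Build the zone map in one pass. Assumes block_size > 0.
ZoneMap build_zone_map(const std::vector<int32_t>& keys, std::size_t block_size) {
    ZoneMap zm{block_size, {}, {}};
    for (std::size_t start = 0; start < keys.size(); start += block_size) {
        const std::size_t end = std::min(start + block_size, keys.size());
        const auto mm = std::minmax_element(keys.begin() + start, keys.begin() + end);
        zm.min_key.push_back(*mm.first);
        zm.max_key.push_back(*mm.second);
    }
    return zm;
}

// Sum values[i] for rows with lo <= keys[i] < hi.
// Assumes keys and values have the same length.
int64_t filtered_sum(const std::vector<int32_t>& keys,
                     const std::vector<int32_t>& values,
                     const ZoneMap& zm, int32_t lo, int32_t hi,
                     std::size_t* blocks_scanned = nullptr) {
    int64_t sum = 0;
    std::size_t scanned = 0;
    for (std::size_t b = 0; b < zm.min_key.size(); ++b) {
        // Zone-map pruning: skip blocks whose [min, max] cannot overlap [lo, hi).
        if (zm.max_key[b] < lo || zm.min_key[b] >= hi) continue;
        ++scanned;
        const std::size_t start = b * zm.block_size;
        const std::size_t end = std::min(start + zm.block_size, keys.size());
        for (std::size_t i = start; i < end; ++i) {
            // Branchless scan: the predicate becomes a 0/1 multiplier,
            // so the inner loop has no data-dependent branch.
            const int64_t keep = static_cast<int64_t>(keys[i] >= lo) &
                                 static_cast<int64_t>(keys[i] < hi);
            sum += keep * values[i];
        }
    }
    if (blocks_scanned != nullptr) *blocks_scanned = scanned;
    return sum;
}
```

Pruning only pays off when the filtered column is roughly clustered (a date column loaded in order, for example), because otherwise every block's min/max range overlaps the predicate and nothing gets skipped.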

> On SEC-EDGAR, Q4 decreases from 1,410 ms to 106 ms over three iterations, which is a 13.3× improvement, and Q6 decreases from 1,121 ms to 88 ms over four iterations, which is a 12.7× improvement. In Q6, the optimizer gradually fuses scan, compact, and merge operations into a single OpenMP parallel region, which removes three thread-spawn overheads. By iteration 1, GenDB already outperforms all baselines
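
To make the "single OpenMP parallel region" point concrete, here is a minimal sketch of what fusing scan, compact, and merge can look like. Again, this is an illustration under my own assumptions, not GenDB's output. `fused_filter` is a hypothetical name and the predicate is a plain threshold. An unfused version would open one `#pragma omp parallel for` per stage and pay the fork/join cost each time. Here each thread scans its share of the input, compacts matches into a private buffer, and merges that buffer into the result, all inside one region.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns every element of input that is greater than threshold.
// With more than one thread the output order is not deterministic.
std::vector<int32_t> fused_filter(const std::vector<int32_t>& input, int32_t threshold) {
    std::vector<int32_t> merged;
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(input.size());

    // One parallel region covers all three stages.
    #pragma omp parallel
    {
        // Thread-private buffer, so the hot loop needs no synchronization.
        std::vector<int32_t> local;

        // Scan + compact: test each row and keep only the matches.
        // nowait lets a thread start merging as soon as its share is done.
        #pragma omp for nowait
        for (std::ptrdiff_t i = 0; i < n; ++i) {
            if (input[i] > threshold) local.push_back(input[i]);
        }

        // Merge: append the private buffer to the shared result, one thread at a time.
        #pragma omp critical
        merged.insert(merged.end(), local.begin(), local.end());
    }
    return merged;
}
```

Compiled without `-fopenmp` the pragmas are ignored and the function runs serially with the same result, which makes it easy to check correctness before measuring anything.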
