logoalt Hacker News

jasonwatkinspdxyesterday at 11:17 PM3 repliesview on HN

Optimal join order is NP-Hard.


Replies

Sesse__today at 7:56 AM

TBH that's not the hard part about it. N is the number of tables, and for most real queries, N < 20 and even 2^N (clique join, which almost never happens in practice) would be tractable if you didn't have so many other things entering the mix. Most queries are closer to chain joins, which have only O(n³) possible join orders (assuming dynamic programming). (Obviously as N grows, you'll need to add some heuristics to prune out “obviously” bad plans. There are many different approaches.)

The really hard problem is estimating the cost of each plan once you've generated it, which necessarily must happen by some sort of heuristics combined with statistics. In particular: If you want to join A, B and C, the cost of (A JOIN B) JOIN C versus A JOIN (B JOIN C) can differ by many orders of magnitude, depending on the size of the intermediate products. And both the cost effects and any misestimation tend to compound through the plan as you add more tables and predicates.

mawekitoday at 6:36 AM

In light of that, I am wondering why the article opted to go for "However, determining the optimal join order is far from trivial.", when there are hard results in literature.

I was also missing mentioning "sideways information passing", though some of methods are exactly that.

I am wondering whether the company consults literature or whether they fiddle about, mostly reinventing the wheel.

nine_ktoday at 12:49 AM

It must be equivalent to the knapsack problem, for which many faster close-to-optimal algorithms exist. Am I missing something?

show 1 reply