Optimal join order is NP-Hard.

jasonwatkinspdx • yesterday at 11:17 PM • 3 replies • view on HN

Replies

TBH that's not the hard part about it. N is the number of tables, and for most real queries, N < 20 and even 2^N (clique join, which almost never happens in practice) would be tractable if you didn't have so many other things entering the mix. Most queries are closer to chain joins, which have only O(n³) possible join orders (assuming dynamic programming). (Obviously as N grows, you'll need to add some heuristics to prune out “obviously” bad plans. There are many different approaches.)

The really hard problem is estimating the cost of each plan once you've generated it, which necessarily must happen by some sort of heuristics combined with statistics. In particular: If you want to join A, B and C, the cost of (A JOIN B) JOIN C versus A JOIN (B JOIN C) can differ by many orders of magnitude, depending on the size of the intermediate products. And both the cost effects and any misestimation tend to compound through the plan as you add more tables and predicates.

maweki • today at 6:36 AM

In light of that, I am wondering why the article opted to go for "However, determining the optimal join order is far from trivial.", when there are hard results in literature.

I was also missing mentioning "sideways information passing", though some of methods are exactly that.

I am wondering whether the company consults literature or whether they fiddle about, mostly reinventing the wheel.

nine_k • today at 12:49 AM

It must be equivalent to the knapsack problem, for which many faster close-to-optimal algorithms exist. Am I missing something?

➕ show 1 reply

alt Hacker News

Replies