More generally, there are algorithms for multi-way joins (with some theoretical guarantees), but in practice they tend to perform worse than a sequence of binary joins with a good implementation.
It's called cogroup in Spark and similar architectures.
It does a group-by to convert data into the format (key_col_1, ... key_col_n) -> [(other_col_1, ... other_col_n), ...]
This is useful and ergonomic in itself for lots of use-cases. A lot of Spark and similar pipelines do this just to make things easier to manipulate.
It's also especially useful if you cogroup each side before a join, which gives you the key column and two arrays of matching rows, one for each side of the join.
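Here's a minimal sketch of what that looks like with Spark's RDD API (PySpark); the dataset and column names are made up for illustration:

    from pyspark import SparkContext

    sc = SparkContext("local", "cogroup-sketch")

    # Two keyed datasets: key -> value on each side
    orders = sc.parallelize([("u1", "order-a"), ("u1", "order-b"), ("u2", "order-c")])
    payments = sc.parallelize([("u1", "pay-x"), ("u2", "pay-y"), ("u2", "pay-z")])

    # cogroup yields (key, (matching_left_rows, matching_right_rows)):
    # the key column plus one array of matching rows per side.
    for key, (left_rows, right_rows) in orders.cogroup(payments).collect():
        print(key, list(left_rows), list(right_rows))
    # e.g. u1 ['order-a', 'order-b'] ['pay-x']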
A quick search says it's called a "group join" in academia. I'm sure I've bumped into it under another name in other DB engines but can't remember right now.
One advantage of this is that it uses bounded memory: it doesn't actually iterate over the cartesian product of non-unique keys. In fact, the whole join can be done on pointers into the sides of the join, rather than shuffling and writing the values themselves.
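To make the bounded-memory point concrete, here's a toy single-machine sketch (plain Python, not Spark internals) of a hash-based group join that stores only row indices, i.e. pointers into each side, and leaves the product unexpanded:

    from collections import defaultdict

    def group_join(left, right, key):
        # left, right: lists of dict rows; key: the join column name.
        # Build key -> [row index] for each side; row values are never copied.
        left_idx, right_idx = defaultdict(list), defaultdict(list)
        for i, row in enumerate(left):
            left_idx[row[key]].append(i)
        for i, row in enumerate(right):
            right_idx[row[key]].append(i)
        # One grouped record per key (inner-join semantics); expanding a
        # key's cartesian product is deferred to whoever consumes this.
        for k in left_idx.keys() & right_idx.keys():
            yield k, left_idx[k], right_idx[k]

A key with m left rows and n right rows costs O(m + n) here, instead of the O(m * n) you'd pay by materialising the product up front.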
My understanding is that a lot of big data distributed query engines do this, at least in mixer nodes. Then the discussion becomes how late they actually expand the product - are they able to communicate the cogrouped format to the next step in the plan or must they flatten it? Etc.
(In SQL big data engines sometimes you do this optimisation explicitly e.g. doing SELECT key, ARRAY_AGG(value) FROM ... on each side before join. But things are nicer when it happens transparently under the hood and users get the speedup without the boilerplate and brittleness and fear that it is a deoptimisation when circumstances change in the future.)
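For comparison, here's roughly the same explicit trick in Spark's DataFrame API, where collect_list plays the role of ARRAY_AGG (the table and column names are made up):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    left = spark.createDataFrame([("u1", 1), ("u1", 2), ("u2", 3)], ["key", "val"])
    right = spark.createDataFrame([("u1", 10), ("u2", 20)], ["key", "val"])

    # Aggregate each side down to one row per key before joining, so the
    # join itself never multiplies out duplicate keys.
    left_g = left.groupBy("key").agg(F.collect_list("val").alias("left_vals"))
    right_g = right.groupBy("key").agg(F.collect_list("val").alias("right_vals"))
    joined = left_g.join(right_g, "key")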
Yeah it's pretty obscure, sorry.