perf: Convert inner joins to semi joins when equivalent#22652
perf: Convert inner joins to semi joins when equivalent#22652neilconway wants to merge 12 commits into
Conversation
|
run benchmarks tpcds |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing neilc/perf-fd-convert-to-semi-join (39eb398) to 2e7b8e1 (merge-base) diff using: tpcds File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagetpcds — base (merge-base)
tpcds — branch
File an issue against this benchmark runner |
| /// The columns that are "live" at a plan node, i.e., which of its output | ||
| /// columns are referenced by an ancestor node. Represented as a set of column | ||
| /// indices, relative to the node's schema. | ||
| type LiveColumns = HashSet<usize>; |
There was a problem hiding this comment.
- This is similar but not identical to the
RequiredIndicesdata structure used byOptimizeProjections.RequiredIndicescares about insertion order but we don't, so it seemed cleaner to use a different data structure here. - It would be better to use a bitmap than a
HashSet. We could do so by adding a dependency on a third-party bitmap implementation (e.g., https://github.com/petgraph/fixedbitset, which one of our indirect dependencies already pulls in). But the performance impact should be small, so I'm not sure it's worth adding the dep.
There was a problem hiding this comment.
You could potentially use arrow BooleanBufferBuilder as a mutable bitset: https://docs.rs/arrow/latest/arrow/array/struct.BooleanBufferBuilder.html
Not sure if that is actually faster or not
There was a problem hiding this comment.
BooleanBufferBuilder would be awkward for what we need (i.e., variably-sized mutable bitmap, iterating over just the set bits, etc.)
Now that LiveColumns is a proper type, we can always switch this later. I'm inclined to leave as-is for now, I don't think the performance difference will be large.
|
run benchmarks |
|
run benchmark tpch10 |
|
Benchmark for this request failed. Last 20 lines of output: Click to expandFile an issue against this benchmark runner |
|
Benchmark for this request failed. Last 20 lines of output: Click to expandFile an issue against this benchmark runner |
|
Benchmark for this request failed. Last 20 lines of output: Click to expandFile an issue against this benchmark runner |
|
@neilconway FYI quite some plans changed now 🚀 Can you take a look at the PR |
|
run benchmarks |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing neilc/perf-fd-convert-to-semi-join (6b1d38a) to 7843ab3 (merge-base) diff using: clickbench_partitioned File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing neilc/perf-fd-convert-to-semi-join (6b1d38a) to 7843ab3 (merge-base) diff using: tpcds File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing neilc/perf-fd-convert-to-semi-join (6b1d38a) to 7843ab3 (merge-base) diff using: tpch File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagetpch — base (merge-base)
tpch — branch
File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagetpcds — base (merge-base)
tpcds — branch
File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usageclickbench_partitioned — base (merge-base)
clickbench_partitioned — branch
File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagetpch10 — base (merge-base)
tpch10 — branch
File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagetpcds — base (merge-base)
tpcds — branch
File an issue against this benchmark runner |
|
run benchmark tpch10 |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing neilc/perf-fd-convert-to-semi-join (0e66e47) to 3c4034c (merge-base) diff using: tpch10 File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usageclickbench_partitioned — base (merge-base)
clickbench_partitioned — branch
File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagetpch10 — base (merge-base)
tpch10 — branch
File an issue against this benchmark runner |
|
Run benchmark tpcds |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing neilc/perf-fd-convert-to-semi-join (0e66e47) to 3c4034c (merge-base) diff using: tpcds File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagetpcds — base (merge-base)
tpcds — branch
File an issue against this benchmark runner |
|
Q14 / Q23 still look like they might be regressing? |
|
@Dandandan Q23 doesn't result in a plan change and I can't repro that locally; I think that one is just noise. The regression on Q14 does indeed repro. The plan should strictly be an improvement (just swapping inner joins for semi-joins); it regresses because (1) We could pass down a flag indicating that the join is unique on the join keys (case 2a described above), so we can skip the duplicate removal. This should get semi-joins back to parity with inner joins. I'm inclined to pursue (2), taking a look at that now. |
|
run benchmarks |
|
run benchmark tpch10 |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing neilc/perf-fd-convert-to-semi-join (227b2ac) to 04ef3c7 (merge-base) diff using: clickbench_partitioned File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing neilc/perf-fd-convert-to-semi-join (227b2ac) to 04ef3c7 (merge-base) diff using: tpcds File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing neilc/perf-fd-convert-to-semi-join (227b2ac) to 04ef3c7 (merge-base) diff using: tpch File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing neilc/perf-fd-convert-to-semi-join (227b2ac) to 04ef3c7 (merge-base) diff using: tpch10 File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagetpch — base (merge-base)
tpch — branch
File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagetpch10 — base (merge-base)
tpch10 — branch
File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagetpcds — base (merge-base)
tpcds — branch
File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usageclickbench_partitioned — base (merge-base)
clickbench_partitioned — branch
File an issue against this benchmark runner |
|
run benchmark tpch10 |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing neilc/perf-fd-convert-to-semi-join (227b2ac) to 04ef3c7 (merge-base) diff using: tpch10 File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagetpch10 — base (merge-base)
tpch10 — branch
File an issue against this benchmark runner |
|
For anyone following along, @neilconway and @Dandandan made Semi-join faster in So now we expect this rewrite to be better perofrming 🚀 |
|
The latest benchmark numbers look better overall. Looking at the TPC-H numbers, at least locally w/ SF10, it seems that Q22 improves because switching to a semi-join results in emitting different stats for the join output, and those stats result in avoiding some poor planner choices we made for operators above the semi-join. So while semi-joins are a bit easier to compute stats for than inner joins, the win there is a bit coincidental. Q13 and Q22 are noise (no plan change from this rewrite). I looked at TPC-DS Q14, it's interesting: locally, the rewrite regresses performance by ~15% with >= 2 partitions but improves performance by ~10% with a single partition. The reason seems to be that using a semi-join results in less flexibility for subsequent partition decision making, so the overall plan performs worse because it needs a repartition that the inner join plan didn't. I think we're okay to merge this as-is, but there's a bunch of room for further improvements:
|
Which issue does this PR close?
Rationale for this change
This PR extends the
EliminateJoinrewrite pass to replace inner joins with semi joins in some cases. An inner joinL ⋈ Rcan be rewritten to a left semi joinL ⋉ Rif two conditions hold:(And symmetrically with right semi joins.)
What changes are included in this PR?
for_each_referenced_indexhelper that is used by bothEliminateJoinandEliminateProjectionsLiveColumnstype to track the "live" (referenced by parent) columns of a plan nodeEliminateJoinAre these changes tested?
Yes; new tests added.
Are there any user-facing changes?
Some plan changes but no behavioral changes.