feat(queryopt): set simplification optimizer#3051
feat(queryopt): set simplification optimizer#3051jzelinskie wants to merge 20 commits intoauthzed:mainfrom
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #3051 +/- ##
==========================================
+ Coverage 75.52% 75.63% +0.12%
==========================================
Files 503 504 +1
Lines 61820 62045 +225
==========================================
+ Hits 46683 46923 +240
+ Misses 11722 11708 -14
+ Partials 3415 3414 -1 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
3e14dd6 to
6e45bdc
Compare
6e45bdc to
a6cf7a3
Compare
tstirrat15
left a comment
There was a problem hiding this comment.
LGTM but would like barak's eyes on it as well
| Pushes caveat evalution to the lowest point in the tree. | ||
| Cannot push through intersection arrows | ||
| `, | ||
| Priority: 20, |
There was a problem hiding this comment.
Do we have a means of constructing a list of optimizers at startup time rather than taking the registration approach? This smells like it's going to turn into z-index eventually.
There was a problem hiding this comment.
This is definitely a place that could use work. I think for now it'd probably help to just have a central list where optimizers are registered, similar how to we register gRPC middleware. If it ever got too confusing doing that, it'd make sense to explicitly add "before" and "after" properties to each optimizer and let the system compute the order based on those (kind like systemd startup ordering)
There was a problem hiding this comment.
@jzelinskie I'd say we do the before/after now: its going to rapidly become untenable, so I recommend as a followup PR
9ac6f21 to
6a4e842
Compare
5ec75be to
24e2fbd
Compare
| } | ||
| for _, factor := range intersectionFactors(x) { | ||
| if !slices.ContainsFunc(y.SubOutlines, func(c query.Outline) bool { | ||
| return query.OutlineCompare(factor, c) == 0 |
There was a problem hiding this comment.
Is this potentially expensive for an intersection at the top of a bunch of deep trees? Or are we expecting that the shapes we're working with here aren't going to be problematic in that way?
There was a problem hiding this comment.
I didn't evaluate the cost of outline compares when drafting this. I assumed they were cheap, but that's definitely not the case.
Here's what Claude thinks about this:
unionAbsorptioncomplexity at a single Union node:
eliminateRedundantChildrendoes O(n²) pair comparisons. For each(y, x)pair inshouldDrop:for _, factor := range intersectionFactors(x) { // O(k) factors of x if !slices.ContainsFunc(y.SubOutlines, factor.Equals) { // O(m) calls per factor return false } }Each
factor.Equalscall invokesOutlineCompare(outline.go:428), which recursively walks the entire subtree — O(T) where T is subtree size. There is no memoization or hash shortcut.So total cost at a Union node: O(n² × k × m × T)
If you have
Union[Intersection[A,B], Intersection[A,B,C], ...]where A, B, C are deep trees, checking whetherA.Equals(A)does a full tree walk each time — even comparing the same subtree to itself.The early exits in
OutlineCompare(line 432: type first, then args) help when subtrees differ near the root, but for the absorption case the trees being compared will often share a common prefix all the way down.The fix: Precompute a
Serialize()hash per node before running mutations, and use it as an O(1) equality fast path.Serialize()(outline.go:573) already exists andCanonicalKey.Hash()(line 53) wraps xxhash. The blocker is thatOutlineis passed by value with no cached hash field, andIDis only populated post-canonicalization.The cleanest option is adding a lazily-precomputed
serialHash uint64field toOutline, set during the pre-optimizationFillMissingNodeIDspass, and short-circuitingOutlineComparewith a hash comparison first.
I think this suggestion is a way bigger change that needs feedback from @barakmich.
| func unionComplementAbsorption(outline query.Outline) query.Outline { | ||
| return eliminateRedundantChildren(outline, query.UnionIteratorType, func(y, x query.Outline) bool { | ||
| return y.Type == query.ExclusionIteratorType && | ||
| len(y.SubOutlines) == 2 && |
There was a problem hiding this comment.
Is this not satisfied by construction? Or are we being defensive here?
There was a problem hiding this comment.
This was just being defensive in case of refactoring.
24e2fbd to
a5e257d
Compare
a5e257d to
e69e85f
Compare
… ⊆ A Extends exclusionSelfAnnihilation (A − A = ∅) to the full subset case: any exclusion node is annihilated when its minuend is subsumed by its subtrahend. This covers (A ∩ B) − A = ∅, (A − B) − A = ∅, and A − (A ∪ B) = ∅ by reusing the existing isSubsumedBy predicate, making the change a one-liner.
Covers every branch of NullPropagation: all iterator types (union, intersection, arrow, intersection arrow, exclusion, caveat, alias, recursive), the defensive len guards, ID preservation on null output, and leaf/unhandled types that pass through unchanged.
Extends the absorption-idempotency optimizer with four additional cardinality-free structural transformations: - flattenAssociativity: inlines nested same-type union/intersection children into their parent so that absorption rules see all peers at the same level. Without this, Union[A, Union[A∩B, C]] cannot be reduced to Union[A, C] because A and A∩B are never compared as siblings. - exclusionNullIdentity: simplifies A − ∅ = A by dropping the exclusion wrapper when the subtrahend is NullIteratorType. Placed in absorption.go rather than NullPropagation to avoid affecting the reachability-pruning optimizer, which calls NullPropagation directly. - exclusionLeftPruning: removes union children from the left side of an exclusion that are subsumed by the subtrahend, since those elements would be fully removed by the subtraction regardless. Generalizes to (A ∪ B) − Y = B − Y when A ⊆ Y; when all children are pruned the union is replaced with null and NullPropagation propagates ∅ − Y = ∅. - intersectionComplementAnnihilation: replaces an intersection with ∅ when it contains an exclusion child (A − C) and any sibling Y where Y ⊆ C, since elements of Y are inside C and elements of A − C are outside C, making the intersection empty.
absorptionIdempotency combined union idempotency, union absorption, and complement-absorption into one function. intersectionIdempotencyAbsorption combined intersection idempotency and absorption. Split both into one function per law: unionIdempotency, unionAbsorption, unionComplementAbsorption intersectionIdempotency, intersectionAbsorption Each function now has an inline predicate that expresses exactly its rule, making it straightforward to match the code to the law it implements. Also reorders functions to match mutation execution order (normalization, union rules, intersection rules, exclusion rules, helpers), standardizes all doc comments to the "Law: X" convention, and moves the caveat/arrow opacity note from absorptionIdempotency into isSubsumedBy and the init description where it actually applies.
The optimizer now implements ten set-theoretic laws spanning union, intersection, and exclusion operators. "absorption-idempotency" described only two of them; "set-simplification" covers the full scope.
…tion Adds four targeted tests hitting previously uncovered code paths: - flattenAssociativity single-survivor branch (len(newChildren)==1) - isSubsumedBy intersection case returning false ((A∩B)−C no-op) - unionAbsorption factor-absent no-op (Union[C, Intersection[A,B]])
e69e85f to
782828b
Compare
Add a test for the case-0 branch of exclusionLeftPruning — where all union children are subsumed by a superset subtrahend but the whole union is not directly subsumed (so annihilation doesn't fire first). Remove the unreachable outer keep[i] guard in eliminateRedundantChildren, which could never be false at the start of a loop iteration since keep[i] is only set false inside that same iteration's body.
Description
Adds a new query-plan optimizer, set-simplification, that eliminates structurally redundant nodes from outline trees using ten set-theoretic laws. The optimizer runs as a standard pass (registered in StandardOptimizations) and requires no cardinality or schema information — all laws hold unconditionally.
Laws implemented
Union
Intersection
Exclusion
Structural prerequisite
peers)
Implementation notes
The optimizer is structured as one function per law, all composed via MutateOutline in a single bottom-up pass. Subsumption tests reuse the existing isSubsumedBy predicate; no new structural predicates were introduced.
A − ∅ = A lives in the absorption mutation chain rather than in the shared NullPropagation utility. NullPropagation is also called by reachability-pruning, which intentionally leaves A − ∅ intact (the null subtrahend is a meaningful pruning artifact in that context).
The optimizer was initially named absorption-idempotency; it is renamed to set-simplification to reflect the full scope of rules it now covers.
Testing
Added unit tests for everything including those missing for NullPropagation.
LMK if I need to add any other kind of tests when adding optimizers
References