refactor(hash-aggr): Forward port the soft limit optimization to the new hash aggregation impl#22824

Merged
2010YOUY01 merged 3 commits into
apache:mainfrom
2010YOUY01:hash-aggr-soft-limit
Jun 10, 2026
@2010YOUY01 2010YOUY01 commented Jun 8, 2026

Contributor

Which issue does this PR close?

Rationale for this change

Part of rewriting hash aggregation into several dedicated streams.

In the first step, #22729, PartialHashAggregateStream and FinalHashAggregateStream were split out of the old GroupsHashAggregateStream, but both streams have only a basic implementation so far, without optimizations or extra features such as spilling.
* The migration is incremental, so the old implementation is unchanged; we plan to delete it once the migration is finished.

This PR forward ports the below optimization to the new implementation:

The optimizer part doesn't have to move; the ported changes are only inside the aggregate operator.

What changes are included in this PR?

Extends PartialHashAggregateStream and FinalHashAggregateStream to apply the optimization. See code comment at datafusion/physical-plan/src/aggregates/hash_aggregate.rs for the background.
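For readers unfamiliar with the idea, here is a minimal sketch of a soft limit on distinct grouping. It is not the code in this PR: the function name `distinct_with_soft_limit` and the use of `Vec<i64>` batches are hypothetical stand-ins, and it assumes the limit is checked only at batch boundaries, with exact trimming left to a downstream limit step.

```rust
use std::collections::HashSet;

/// Hypothetical illustration: collect distinct keys from `batches`, but stop
/// pulling input once at least `limit` groups have been seen.
/// Returns the groups in first-seen order and the number of batches consumed.
fn distinct_with_soft_limit(batches: &[Vec<i64>], limit: usize) -> (Vec<i64>, usize) {
    let mut seen: HashSet<i64> = HashSet::new();
    let mut groups: Vec<i64> = Vec::new();
    let mut batches_consumed = 0;

    for batch in batches {
        batches_consumed += 1;
        for &key in batch {
            // `insert` returns true only the first time a key is seen.
            if seen.insert(key) {
                groups.push(key);
            }
        }
        // The limit is "soft": it is checked after a whole batch, so the
        // group count may overshoot `limit`. The caller is assumed to trim.
        if groups.len() >= limit {
            break;
        }
    }
    (groups, batches_consumed)
}
```

With the batches `[1, 1, 2]`, `[2, 3, 4]`, `[5, 6, 7]` and a limit of 3, this sketch stops after the second batch and returns four groups, so the third batch is never read. The saving comes from skipping the remaining input, not from producing exactly `limit` rows.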

Are these changes tested?

Yes. The original tests from #8038 are only at the ExecutionPlan level, and they still pass after the change.
This PR also adds new test coverage: it checks explain analyze to ensure the implementation actually respects the soft limit at runtime.
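As a rough illustration of what a runtime check against plan metrics could look like, the sketch below pulls an `output_rows=<n>` value out of one line of plan text. The helper name `output_rows` and the exact line format are assumptions for illustration; the tests in this PR may read the metrics differently.

```rust
/// Hypothetical helper: extract the integer after `output_rows=` from one
/// line of plan text. Returns `None` if the key is missing or is not
/// followed by ASCII digits.
fn output_rows(line: &str) -> Option<u64> {
    const KEY: &str = "output_rows=";
    let start = line.find(KEY)? + KEY.len();
    let digits: String = line[start..]
        .chars()
        .take_while(|c| c.is_ascii_digit())
        .collect();
    digits.parse().ok()
}
```

A test could then compare the parsed value for the aggregate node with the row count it would produce without the limit, to confirm that the operator stopped early rather than only that the final result is correct.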

Are there any user-facing changes?

@github-actions github-actions Bot added core Core DataFusion crate physical-plan Changes to the physical-plan crate labels Jun 8, 2026

@alamb alamb left a comment

Contributor

Looks good to me -- thank you @2010YOUY01 and @ariel-miculas

};

#[derive(Debug)]
struct AggregateRuntimeMetric {

Contributor

A few comments here (explaining what the limit and output row fields are in particular) would, I think, make this test easier to read

Contributor Author

Thank you, makes sense to me. Addressed in f54bf19

}

#[tokio::test]
async fn limited_distinct_aggregate_uses_migrated_hash_streams() -> Result<()> {

Contributor

I would recommend a different term than "migrated", as once the migration is complete it will no longer be relevant

Perhaps something like "limited_distinct_aggregate_uses_partial_hash_stream" would be more future proof

@2010YOUY01 2010YOUY01 closed this Jun 9, 2026
@2010YOUY01 2010YOUY01 reopened this Jun 9, 2026
@2010YOUY01

Contributor Author

By estimation, there are around 5 PRs to go for this refactor.

Comment thread datafusion/core/tests/physical_optimizer/limited_distinct_aggregation.rs Outdated
…egation.rs

Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
@2010YOUY01 2010YOUY01 added this pull request to the merge queue Jun 10, 2026
Merged via the queue into apache:main with commit 40a6454 Jun 10, 2026
60 of 61 checks passed
@2010YOUY01 2010YOUY01 deleted the hash-aggr-soft-limit branch June 10, 2026 01:22
4 participants