perf: improve approx_distinct performance 100x when there are fewer distinct values with many groups by haohuaijin · Pull Request #22768 · apache/datafusion

haohuaijin · 2026-06-05T02:12:36Z

Which issue does this PR close?

Closes improve performance for apporx_distinct when each group do no have many distinct value #22767

Rationale for this change

approx_distinct is very slow with GROUP BY on high-cardinality keys.

On one file from the dataset described in #22767 (~3.9M rows, ~512K groups), the following query:

SELECT client_ip, approx_distinct(trace_id) AS cnt
FROM '*.parquet'
GROUP BY client_ip
ORDER BY cnt DESC LIMIT 10;

DataFusion: ~32.6s
DuckDB (approx_count_distinct): ~0.1s

The reason is that approx_distinct only implemented Accumulator, not GroupsAccumulator. So grouped queries fell back to GroupsAccumulatorAdapter, which allocates a full 16 KiB HyperLogLog per group (~8 GB for 512K groups) and re-slices the input per group on every batch — even though most groups only see a few distinct values.
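The ~8 GB figure follows directly from the per-group allocation. A minimal sketch of the arithmetic, assuming "512K" means 512 × 1024 groups (the function and constant names are illustrative, not taken from the DataFusion code):

```rust
/// Size of one dense HyperLogLog sketch, per the 16 KiB figure quoted above.
const DENSE_SKETCH_BYTES: u64 = 16 * 1024;

/// Total bytes allocated if every group eagerly gets its own dense sketch.
fn eager_dense_bytes(num_groups: u64) -> u64 {
    num_groups * DENSE_SKETCH_BYTES
}

/// Convert a byte count to GiB for readability.
fn to_gib(bytes: u64) -> f64 {
    bytes as f64 / (1024.0 * 1024.0 * 1024.0)
}
```

With `eager_dense_bytes(512 * 1024)` this gives 8 GiB of sketch memory, regardless of how few distinct values each group actually holds.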

What changes are included in this PR?

Add a dedicated GroupsAccumulator for approx_distinct that processes each batch in a single pass (no per-group slicing or dynamic dispatch).
Use an adaptive per-group sketch: keep a small list of hashes (sparse) and only switch to a dense 16 KiB HyperLogLog after 256 distinct values. This cuts memory and keeps the partial state small. The dense format stays compatible with the existing scalar accumulator.
Add count_from_hashes so small groups are estimated directly from their stored hashes, avoiding a 16 KiB alloc + scan per group at output time.
Hashing matches the existing per-type scalar accumulators, so results are unchanged. Boolean / small-int / Null keep using the old path.
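The changes above can be illustrated with a small, self-contained sketch. This is not the code from the PR: the names (`Sketch`, `insert_hash`, `update_batch`, `hash_value`), the use of `DefaultHasher`, the textbook register layout (top 14 bits as index), the exact promotion boundary, and counting sparse groups by list length are all assumptions made for illustration.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const P: u32 = 14; // assumed precision: 2^14 one-byte registers = 16 KiB
const NUM_REGISTERS: usize = 1 << P;
const SPARSE_LIMIT: usize = 256; // promote once a group exceeds this many distinct hashes

/// Per-group state: a short list of distinct hashes, or dense HLL registers.
enum Sketch {
    Sparse(Vec<u64>),
    Dense(Vec<u8>),
}

/// Hypothetical hashing helper; the PR reuses the existing per-type hashing instead.
fn hash_value<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

/// Textbook HLL register update: top P bits pick the register,
/// the rank is 1 + the number of leading zeros in the remaining bits.
fn set_register(regs: &mut [u8], hash: u64) {
    let index = (hash >> (64 - P)) as usize;
    let rank = ((hash << P).leading_zeros().min(64 - P) + 1) as u8;
    if rank > regs[index] {
        regs[index] = rank;
    }
}

/// Standard HLL estimate with linear counting for small cardinalities.
fn hll_estimate(regs: &[u8]) -> u64 {
    let m = regs.len() as f64;
    let alpha = 0.7213 / (1.0 + 1.079 / m);
    let sum: f64 = regs.iter().map(|&r| 2f64.powi(-(r as i32))).sum();
    let raw = alpha * m * m / sum;
    let zeros = regs.iter().filter(|&&r| r == 0).count() as f64;
    let est = if raw <= 2.5 * m && zeros > 0.0 { m * (m / zeros).ln() } else { raw };
    est.round() as u64
}

impl Sketch {
    fn new() -> Self {
        Sketch::Sparse(Vec::new())
    }

    fn insert_hash(&mut self, hash: u64) {
        let promote = match self {
            Sketch::Sparse(hashes) => {
                // Linear scan keeps the sketch simple; the list stays short.
                if !hashes.contains(&hash) {
                    hashes.push(hash);
                }
                hashes.len() > SPARSE_LIMIT
            }
            Sketch::Dense(regs) => {
                set_register(regs, hash);
                false
            }
        };
        if promote {
            self.promote();
        }
    }

    /// Replay the stored hashes into freshly allocated dense registers.
    fn promote(&mut self) {
        let hashes = match self {
            Sketch::Sparse(hashes) => std::mem::take(hashes),
            Sketch::Dense(_) => return,
        };
        let mut regs = vec![0u8; NUM_REGISTERS];
        for h in hashes {
            set_register(&mut regs, h);
        }
        *self = Sketch::Dense(regs);
    }

    /// Small groups are counted from their stored hashes without a dense allocation.
    fn estimate(&self) -> u64 {
        match self {
            Sketch::Sparse(hashes) => hashes.len() as u64,
            Sketch::Dense(regs) => hll_estimate(regs),
        }
    }
}

/// Single pass over a batch: row i belongs to group `group_indices[i]`.
fn update_batch(sketches: &mut Vec<Sketch>, group_indices: &[usize], hashes: &[u64], total_groups: usize) {
    if sketches.len() < total_groups {
        sketches.resize_with(total_groups, Sketch::new);
    }
    for (&group, &hash) in group_indices.iter().zip(hashes) {
        sketches[group].insert_hash(hash);
    }
}
```

In this sketch a group with a handful of distinct values costs a few dozen bytes rather than 16 KiB, and the per-batch work is one loop over the rows rather than one slice per group. The real `count_from_hashes` may compute the sparse estimate differently from the plain list length used here.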

Result on the query above: ~32.6s → ~0.12s (~270x, on par with DuckDB), with identical output.

Are these changes tested?

Yes.

New unit tests for the per-group sketch (sparse/dense, promotion, serialize/merge round-trip, merging groups, empty groups), checked against a dense-fold reference.
New aggregate.slt cases: grouped approx_distinct over Utf8, Utf8View, and Int32 (small groups are exact), null-only groups (= 0), and a sparse→dense case (2000 distinct/group, within HyperLogLog error).
Existing aggregate.slt and aggregate_skip_partial.slt still pass; clippy and fmt are clean.
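For a rough sense of what "within HyperLogLog error" means in the sparse→dense case, the usual asymptotic relative standard error of HyperLogLog is 1.04 / sqrt(m) for m registers. A minimal sketch, assuming the 16 KiB dense sketch holds 16384 one-byte registers (an assumption, not stated in this PR; the function names are illustrative):

```rust
/// Asymptotic relative standard error of a HyperLogLog with `num_registers` registers.
fn hll_standard_error(num_registers: usize) -> f64 {
    1.04 / (num_registers as f64).sqrt()
}

/// One-standard-error band around a true distinct count.
fn one_sigma_band(true_count: u64, num_registers: usize) -> (f64, f64) {
    let delta = true_count as f64 * hll_standard_error(num_registers);
    (true_count as f64 - delta, true_count as f64 + delta)
}
```

Under that assumption the relative error is about 0.8%, so for 2000 distinct values per group the one-sigma band is roughly 1984 to 2016. At such low cardinalities the small-range correction typically does better than this asymptotic figure.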

Are there any user-facing changes?

No API or result changes — only a large speedup for approx_distinct with GROUP BY on high-cardinality keys.

haohuaijin · 2026-06-05T02:15:50Z

~~I'm currently trying to submit a Parquet file or benchmark to reproduce the result~~
Added a benchmark in 9660fc0

benchmark result

╰─$ critcmp main new
group                                            main                                       22768
-----                                            ----                                       -----
approx_distinct_grouped/Int64 50000 groups       101.11  1723.0±22.12ms        ? ?/sec      1.00      17.0±0.25ms        ? ?/sec
approx_distinct_grouped/Utf8 50000 groups         96.08  1744.4±38.15ms        ? ?/sec      1.00      18.2±0.74ms        ? ?/sec
approx_distinct_grouped/Utf8View 50000 groups    101.45  1724.4±17.53ms        ? ?/sec      1.00      17.0±0.12ms        ? ?/sec

kosiew

@haohuaijin
Thanks for the optimization here. I think there is one nullable filter case that needs to be fixed before this can land. I also left a few smaller suggestions around consistency and malformed state handling.

2010YOUY01

Thank you. I think this design achieves a good balance between performance and simplicity.

My only concern is whether we handle groups with null values correctly (see comment); otherwise LGTM.

haohuaijin

Thanks for your reviews @kosiew @2010YOUY01, I applied all the suggestions in db8baf2

kosiew

@haohuaijin
This iteration looks 👍 to me

improve approx_distinct for small value

6030334

github-actions bot added the sqllogictest (SQL Logic Tests (.slt)) and functions (Changes to functions implementation) labels Jun 5, 2026

haohuaijin changed the title ~~improve approx_distinct for small value~~ perf: improve approx_distinct for small value Jun 5, 2026

haohuaijin changed the title ~~perf: improve approx_distinct for small value~~ perf: improve approx_distinct performance when there are fewer distinct values Jun 5, 2026

hengfeiyang approved these changes Jun 5, 2026

add benchmark

9660fc0

haohuaijin changed the title ~~perf: improve approx_distinct performance when there are fewer distinct values~~ perf: improve approx_distinct performance 100x when there are fewer distinct values Jun 5, 2026

haohuaijin changed the title ~~perf: improve approx_distinct performance 100x when there are fewer distinct values~~ perf: improve approx_distinct performance 100x when there are fewer distinct values with many groups Jun 5, 2026

haohuaijin added 2 commits June 5, 2026 11:25

update

5a22033

update test case

0d66853

kosiew requested changes Jun 5, 2026

Comment thread datafusion/functions-aggregate/src/approx_distinct.rs Outdated

Comment thread datafusion/functions-aggregate/src/approx_distinct.rs Outdated

Comment thread datafusion/functions-aggregate/src/approx_distinct.rs

Comment thread datafusion/functions-aggregate/src/approx_distinct.rs

2010YOUY01 reviewed Jun 5, 2026

apply suggestion

db8baf2

haohuaijin commented Jun 5, 2026

Comment thread datafusion/functions-aggregate/src/approx_distinct.rs

haohuaijin mentioned this pull request Jun 5, 2026

update merge_batch's docs for opt_filter in GroupsAccumulator #22775

Open

haohuaijin added 2 commits June 5, 2026 21:09

update test case

0b65bc9

make code better

7bffb02

kosiew approved these changes Jun 6, 2026

haohuaijin mentioned this pull request Jun 7, 2026

approx_distinct over-counts Utf8View because the hash strategy is chosen per batch instead of per value #22796

Open

2010YOUY01 approved these changes Jun 7, 2026

haohuaijin wants to merge 7 commits into apache:main from haohuaijin:approx-distinct-improve

4 participants
