perf: improve approx_distinct performance 100x when there are fewer distinct values with many groups#22768

Open
haohuaijin wants to merge 7 commits into apache:main from haohuaijin:approx-distinct-improve

@haohuaijin haohuaijin commented Jun 5, 2026

Which issue does this PR close?

Rationale for this change

approx_distinct is very slow with GROUP BY on high-cardinality keys.

On one file from the dataset described in #22767 (~3.9M rows, ~512K groups):

SELECT client_ip, approx_distinct(trace_id) AS cnt
FROM '*.parquet'
GROUP BY client_ip
ORDER BY cnt DESC LIMIT 10;
  • DataFusion: ~32.6s
  • DuckDB (approx_count_distinct): ~0.1s

The reason is that approx_distinct only implemented Accumulator, not GroupsAccumulator. So grouped queries fell back to GroupsAccumulatorAdapter, which allocates a full 16 KiB HyperLogLog per group (~8 GB for 512K groups) and re-slices the input per group on every batch — even though most groups only see a few distinct values.
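The ~8 GB figure follows from simple arithmetic, sketched below. The function names are hypothetical, "~512K" is assumed to mean 524,288 groups, and the sparse estimate counts only 8 bytes per stored hash and ignores container overhead.

```rust
/// Size of one dense HyperLogLog sketch, as quoted above (16 KiB).
const DENSE_SKETCH_BYTES: u64 = 16 * 1024;

/// Bytes needed when every group owns a full dense sketch.
/// (Hypothetical helper, for illustration only.)
fn dense_state_bytes(groups: u64) -> u64 {
    groups * DENSE_SKETCH_BYTES
}

/// Rough bytes needed when each group stores only a short list of
/// 64-bit hashes. Ignores Vec headers and allocator overhead.
fn sparse_state_bytes(groups: u64, hashes_per_group: u64) -> u64 {
    groups * hashes_per_group * 8
}
```

With 524,288 groups the dense layout needs 2^33 bytes (8 GiB), while a handful of hashes per group stays in the tens of MiB.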

What changes are included in this PR?

  • Add a dedicated GroupsAccumulator for approx_distinct that processes each batch in a single pass (no per-group slicing or dynamic dispatch).
  • Use an adaptive per-group sketch: keep a small list of hashes (sparse) and only switch to a dense 16 KiB HyperLogLog after 256 distinct values. This cuts memory and keeps the partial state small. The dense format stays compatible with the existing scalar accumulator.
  • Add count_from_hashes so small groups are estimated directly from their stored hashes, avoiding a 16 KiB alloc + scan per group at output time.
  • Hashing matches the existing per-type scalar accumulators, so results are unchanged. Boolean / small-int / Null keep using the old path.
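The idea in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `GroupSketch`, `set_register`, and `update_batch` are hypothetical names, the register layout (2^14 one-byte registers, index taken from the top 14 hash bits) is an assumption, and only the 256-value threshold and the 16 KiB dense size come from the description.

```rust
/// Assumed precision: 2^14 one-byte registers = 16 KiB per dense sketch.
const P: u32 = 14;
const NUM_REGISTERS: usize = 1 << P;
/// Distinct-hash count after which a group switches to the dense form.
const SPARSE_LIMIT: usize = 256;

/// Hypothetical per-group sketch: a short hash list or dense registers.
enum GroupSketch {
    Sparse(Vec<u64>),
    Dense(Box<[u8]>),
}

/// Generic HyperLogLog register update (layout is an assumption).
fn set_register(registers: &mut [u8], hash: u64) {
    let index = (hash >> (64 - P)) as usize; // top P bits pick the register
    let rest = hash << P; // remaining bits, moved to the top
    let rank = (rest.leading_zeros().min(64 - P) + 1) as u8;
    if rank > registers[index] {
        registers[index] = rank;
    }
}

impl GroupSketch {
    fn new() -> Self {
        GroupSketch::Sparse(Vec::new())
    }

    fn is_dense(&self) -> bool {
        matches!(self, GroupSketch::Dense(_))
    }

    /// Number of stored hashes, or None once the sketch is dense.
    fn sparse_len(&self) -> Option<usize> {
        match self {
            GroupSketch::Sparse(hashes) => Some(hashes.len()),
            GroupSketch::Dense(_) => None,
        }
    }

    fn insert(&mut self, hash: u64) {
        let promoted = match self {
            GroupSketch::Sparse(hashes) => {
                if !hashes.contains(&hash) {
                    hashes.push(hash);
                }
                if hashes.len() > SPARSE_LIMIT {
                    // Fold the stored hashes into freshly allocated registers.
                    let mut registers = vec![0u8; NUM_REGISTERS].into_boxed_slice();
                    for &h in hashes.iter() {
                        set_register(&mut registers, h);
                    }
                    Some(registers)
                } else {
                    None
                }
            }
            GroupSketch::Dense(registers) => {
                set_register(registers, hash);
                None
            }
        };
        if let Some(registers) = promoted {
            *self = GroupSketch::Dense(registers);
        }
    }
}

/// One pass over a batch: row i belongs to group_indices[i].
/// A None hash stands for a NULL input value and is skipped.
fn update_batch(
    sketches: &mut Vec<GroupSketch>,
    group_indices: &[usize],
    hashes: &[Option<u64>],
    total_groups: usize,
) {
    if sketches.len() < total_groups {
        sketches.resize_with(total_groups, GroupSketch::new);
    }
    for (&group, hash) in group_indices.iter().zip(hashes) {
        if let Some(h) = hash {
            sketches[group].insert(*h);
        }
    }
}
```

In this sketch a group that only ever sees NULLs keeps an empty sparse list, and the 16 KiB allocation is paid only by groups that exceed the threshold.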

Result on the query above: ~32.6s → ~0.12s (~270x, on par with DuckDB), with identical output.

Are these changes tested?

Yes.

  • New unit tests for the per-group sketch (sparse/dense, promotion, serialize/merge round-trip, merging groups, empty groups), checked against a dense-fold reference.
  • New aggregate.slt cases: grouped approx_distinct over Utf8, Utf8View, and Int32 (small groups are exact), null-only groups (= 0), and a sparse→dense case (2000 distinct/group, within HyperLogLog error).
  • Existing aggregate.slt and aggregate_skip_partial.slt still pass; clippy and fmt are clean.

Are there any user-facing changes?

No API or result changes — only a large speedup for approx_distinct with GROUP BY on high-cardinality keys.

@github-actions github-actions Bot added sqllogictest SQL Logic Tests (.slt) functions Changes to functions implementation labels Jun 5, 2026

haohuaijin commented Jun 5, 2026

I'm currently trying to submit a Parquet file or a benchmark to reproduce the result.
Added a benchmark in 9660fc0.

benchmark result

╰─$ critcmp main new
group                                            main                                       22768
-----                                            ----                                       -----
approx_distinct_grouped/Int64 50000 groups       101.11  1723.0±22.12ms        ? ?/sec      1.00     17.0±0.25ms        ? ?/sec
approx_distinct_grouped/Utf8 50000 groups         96.08  1744.4±38.15ms        ? ?/sec      1.00     18.2±0.74ms        ? ?/sec
approx_distinct_grouped/Utf8View 50000 groups    101.45  1724.4±17.53ms        ? ?/sec      1.00     17.0±0.12ms        ? ?/sec

@haohuaijin haohuaijin changed the title improve approx_distinct for small value perf: improve approx_distinct for small value Jun 5, 2026
@haohuaijin haohuaijin changed the title perf: improve approx_distinct for small value perf: improve approx_distinct performance when there are fewer distinct values Jun 5, 2026
@haohuaijin haohuaijin changed the title perf: improve approx_distinct performance when there are fewer distinct values perf: improve approx_distinct performance 100x when there are fewer distinct values Jun 5, 2026
@haohuaijin haohuaijin changed the title perf: improve approx_distinct performance 100x when there are fewer distinct values perf: improve approx_distinct performance 100x when there are fewer distinct values with many groups Jun 5, 2026
@kosiew kosiew left a comment

@haohuaijin
Thanks for the optimization here. I think there is one nullable filter case that needs to be fixed before this can land. I also left a few smaller suggestions around consistency and malformed state handling.

@2010YOUY01 2010YOUY01 left a comment

Thank you. I think this design achieves a good balance between performance and simplicity.

My only concern is whether we handle groups with null values correctly (see comment); otherwise LGTM.

@haohuaijin haohuaijin left a comment

Thanks for your reviews, @kosiew and @2010YOUY01. I applied all the suggestions in db8baf2.

@kosiew kosiew left a comment

@haohuaijin
This iteration looks 👍 to me

Development

Successfully merging this pull request may close these issues.

improve performance for apporx_distinct when each group do no have many distinct value

4 participants