perf: improve approx_distinct performance 100x when there are fewer distinct values with many groups #22768
haohuaijin wants to merge 7 commits into
@haohuaijin
Thanks for the optimization here. I think there is one nullable filter case that needs to be fixed before this can land. I also left a few smaller suggestions around consistency and malformed state handling.
2010YOUY01 left a comment
Thank you. I think this design achieves a good balance between performance and simplicity.
My only concern is whether we handle groups with null values correctly (see comment); otherwise LGTM.
haohuaijin left a comment
Thanks for your reviews, @kosiew and @2010YOUY01. I applied all the suggestions in db8baf2.
kosiew left a comment
@haohuaijin
This iteration looks 👍 to me
Which issue does this PR close?
approx_distinct when each group does not have many distinct values #22767
Rationale for this change
approx_distinct is very slow with GROUP BY on high-cardinality keys. On a dataset (~3.9M rows, ~512K groups; one file from the dataset described in #22767):
- DataFusion (approx_distinct): ~32.6s
- DuckDB (approx_count_distinct): ~0.1s
The reason is that approx_distinct only implemented Accumulator, not GroupsAccumulator, so grouped queries fell back to GroupsAccumulatorAdapter, which allocates a full 16 KiB HyperLogLog per group (~8 GB for 512K groups) and re-slices the input per group on every batch, even though most groups only see a few distinct values.
What changes are included in this PR?
- GroupsAccumulator for approx_distinct that processes each batch in a single pass (no per-group slicing or dynamic dispatch).
- count_from_hashes, so small groups are estimated directly from their stored hashes, avoiding a 16 KiB allocation and scan per group at output time.
- Null keeps using the old path.
Result on the query above: ~32.6s → ~0.12s (~270x, on par with DuckDB), with identical output.
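The idea behind the changes above can be illustrated with a minimal, standalone sketch. This is not the code in this PR: GroupState, SPARSE_LIMIT, update_batch, and example_counts are hypothetical names, the standard library's DefaultHasher stands in for whatever hasher DataFusion uses, and the sparse count here is simply the number of stored distinct hashes, which may differ from what count_from_hashes actually computes. It only shows the shape of the approach: one pass over the batch, a small list of hashes per group, and a 16 KiB register array allocated only once a group grows large.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const P: u32 = 14; // 2^14 one-byte registers = 16 KiB per dense group
const M: usize = 1 << P;
const SPARSE_LIMIT: usize = 64; // hypothetical upgrade threshold

enum GroupState {
    Sparse(Vec<u64>), // sorted, distinct hashes
    Dense(Box<[u8]>), // M HyperLogLog registers
}

fn hash_of<T: Hash>(v: &T) -> u64 {
    let mut h = DefaultHasher::new();
    v.hash(&mut h);
    h.finish()
}

fn add_to_registers(regs: &mut [u8], hash: u64) {
    let idx = (hash >> (64 - P)) as usize; // top P bits pick the register
    // Rank: 1-based position of the first set bit in the remaining 50 bits.
    let rank = ((hash << P).leading_zeros().min(64 - P) + 1) as u8;
    regs[idx] = regs[idx].max(rank);
}

fn estimate(regs: &[u8]) -> u64 {
    let m = M as f64;
    let zeros = regs.iter().filter(|&&r| r == 0).count();
    let sum: f64 = regs.iter().map(|&r| 2f64.powi(-(r as i32))).sum();
    let raw = 0.7213 / (1.0 + 1.079 / m) * m * m / sum;
    if raw <= 2.5 * m && zeros > 0 {
        // Small-range correction (linear counting).
        (m * (m / zeros as f64).ln()).round() as u64
    } else {
        raw.round() as u64
    }
}

impl GroupState {
    fn add(&mut self, hash: u64) {
        let upgraded = match self {
            GroupState::Sparse(hashes) => {
                if let Err(pos) = hashes.binary_search(&hash) {
                    hashes.insert(pos, hash);
                }
                if hashes.len() > SPARSE_LIMIT {
                    // Too many distinct values: pay for the 16 KiB array now.
                    let mut regs = vec![0u8; M].into_boxed_slice();
                    for &h in hashes.iter() {
                        add_to_registers(&mut regs, h);
                    }
                    Some(regs)
                } else {
                    None
                }
            }
            GroupState::Dense(regs) => {
                add_to_registers(regs, hash);
                None
            }
        };
        if let Some(regs) = upgraded {
            *self = GroupState::Dense(regs);
        }
    }

    fn count(&self) -> u64 {
        match self {
            GroupState::Sparse(hashes) => hashes.len() as u64, // no allocation
            GroupState::Dense(regs) => estimate(regs),
        }
    }
}

/// One pass over the batch: each row goes straight to its group's state.
fn update_batch<T: Hash>(
    states: &mut Vec<GroupState>,
    values: &[Option<T>],
    group_indices: &[usize],
    total_groups: usize,
) {
    while states.len() < total_groups {
        states.push(GroupState::Sparse(Vec::new()));
    }
    for (value, &g) in values.iter().zip(group_indices) {
        if let Some(v) = value {
            states[g].add(hash_of(v)); // nulls are skipped
        }
    }
}

/// Group 0 has two distinct values, group 1 only nulls, group 2 has 2000.
fn example_counts() -> Vec<u64> {
    let mut values: Vec<Option<u64>> = vec![Some(1), Some(2), Some(1), None, None];
    let mut groups: Vec<usize> = vec![0, 0, 0, 1, 1];
    for v in 0..2000u64 {
        values.push(Some(v));
        groups.push(2);
    }
    let mut states = Vec::new();
    update_batch(&mut states, &values, &groups, 3);
    states.iter().map(GroupState::count).collect()
}
```

In this sketch, groups 0 and 1 stay in the sparse form and never allocate registers, while group 2 crosses the threshold and is estimated from its registers, mirroring the small-group, null-only, and sparse→dense cases the tests describe.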
Are these changes tested?
Yes.
- aggregate.slt cases: grouped approx_distinct over Utf8, Utf8View, and Int32 (small groups are exact), null-only groups (= 0), and a sparse→dense case (2000 distinct/group, within HyperLogLog error).
- aggregate.slt and aggregate_skip_partial.slt still pass; clippy and fmt are clean.
Are there any user-facing changes?
No API or result changes, only a large speedup for approx_distinct with GROUP BY on high-cardinality keys.