Skip to content

feat(python/sedonadb): add DataFrame.group_by + GroupedDataFrame.agg#893

Merged
jiayuasu merged 1 commit into
apache:mainfrom
jiayuasu:feature/df-group-by
Jun 2, 2026
Merged

feat(python/sedonadb): add DataFrame.group_by + GroupedDataFrame.agg#893
jiayuasu merged 1 commit into
apache:mainfrom
jiayuasu:feature/df-group-by

Conversation

@jiayuasu
Copy link
Copy Markdown
Member

@jiayuasu jiayuasu commented Jun 1, 2026

Grouped aggregation on top of the registry-driven function dispatch (#885) and the global-aggregation binding (#887). Completes the aggregation track of Phase P2 (#791).

API

df.group_by("k").agg(total=sd.funcs.sum(sd.col("v")))

df.group_by("k1", "k2").agg(
    sd.funcs.sum(col("x")).alias("sum_x"),
    n=sd.funcs.count(col("y")),
)

df.group_by(col("x") + col("y")).agg(...)         # computed-Expr group key
df.group_by(col("k"), "other_key").agg(...)        # mixed str / Expr
  • df.group_by(*keys) — varargs of str | Expr. Strings auto-promote to col(name) (same pattern as sort). Empty keys → ValueError; non-str/non-Expr → TypeError.
  • Returns a new GroupedDataFrame — thin holder for the parent df + resolved group exprs.
  • GroupedDataFrame.agg(*exprs, **named_exprs) — same signature as DataFrame.agg, including the kwargs-as-alias shorthand.

Implementation

Pure Python — no Rust changes. The InternalDataFrame::aggregate(group_exprs, agg_exprs) binding from #887 already handles the grouped case; this PR just populates group_exprs when constructing the aggregation. The GroupedDataFrame intermediate is kept minimal (one method beyond __init__) so it stays a clean place to add convenience aggregates (count, size, etc.) later without polluting DataFrame.

Test plan

12 tests in tests/expr/test_dataframe_group_by.py:

  • Positive: single-key string; multi-key strings; col(k) Expr key; computed-Expr key (col("x") + col("y")); mixed str / Expr; positional + kwarg agg in one call.
  • Lazy return: df.group_by("k")GroupedDataFrame; .agg(...)DataFrame.
  • Errors: empty group_by()ValueError; bad-type key → TypeError; empty .agg()ValueError; non-Expr agg → TypeError.

Output assertions use pd.testing.assert_frame_equal after sorting (group-by output ordering isn't guaranteed).

Local: 12 unit + 23 doctests + ruff format + ruff check all clean.

Copy link
Copy Markdown
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

Comment on lines +30 to +32
def _sorted(df: pd.DataFrame, *by: str) -> pd.DataFrame:
# Group output is unordered. Sort to compare deterministically.
return df.sort_values(list(by)).reset_index(drop=True)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you inline the sort into the tests? .group_by().agg().sort() would be about as compact and make it easier to rearrange the tests later.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inlined in 6a20692_sorted helper is gone; each test that needed deterministic group ordering now chains .sort(...).to_pandas().reset_index(drop=True) directly.

Comment on lines +1276 to +1280
Produced by `DataFrame.group_by(...)`. The only public method is
`agg(...)`, which runs the aggregation and returns a new
`DataFrame` with one row per group. The class exists as a step in
the chain so that future convenience aggregates (e.g. `count()`,
`size()`) can land here without polluting `DataFrame`.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Produced by `DataFrame.group_by(...)`. The only public method is
`agg(...)`, which runs the aggregation and returns a new
`DataFrame` with one row per group. The class exists as a step in
the chain so that future convenience aggregates (e.g. `count()`,
`size()`) can land here without polluting `DataFrame`.
Produced by `DataFrame.group_by(...)`. The class exists as a step in
the chain to simplify aggregation expressions.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Applied verbatim in 6a20692.

@jiayuasu jiayuasu force-pushed the feature/df-group-by branch 2 times, most recently from cc14430 to 6a20692 Compare June 1, 2026 21:42
@jiayuasu jiayuasu marked this pull request as ready for review June 1, 2026 21:51
Comment on lines +18 to +21
# Tests for DataFrame.group_by(*keys).agg(*exprs, **named_exprs).
# Aggregate exprs come from `con.funcs.<name>(args)` via the function
# registry (#885); the Rust binding is shared with `df.agg`.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Tests for DataFrame.group_by(*keys).agg(*exprs, **named_exprs).
# Aggregate exprs come from `con.funcs.<name>(args)` via the function
# registry (#885); the Rust binding is shared with `df.agg`.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dropped in 39bdba4 — module-level meta comment removed (deleting just the trailing blank would have left it adjacent to imports); five .reset_index(drop=True) calls also gone since to_pandas() of a fresh sedona DataFrame already returns a default RangeIndex.

.agg(total=con.funcs.sum(col("v")))
.sort("k")
.to_pandas()
.reset_index(drop=True)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think these do anything since the fresh pandas df is not grouped? (With apologies if they're needed)

Suggested change
.reset_index(drop=True)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dropped in 39bdba4 — module-level meta comment removed (deleting just the trailing blank would have left it adjacent to imports); five .reset_index(drop=True) calls also gone since to_pandas() of a fresh sedona DataFrame already returns a default RangeIndex.

.agg(total=con.funcs.sum(col("v")))
.sort("k1", "k2")
.to_pandas()
.reset_index(drop=True)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
.reset_index(drop=True)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dropped in 39bdba4 — module-level meta comment removed (deleting just the trailing blank would have left it adjacent to imports); five .reset_index(drop=True) calls also gone since to_pandas() of a fresh sedona DataFrame already returns a default RangeIndex.

.agg(total=con.funcs.sum(col("v")))
.sort("k")
.to_pandas()
.reset_index(drop=True)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
.reset_index(drop=True)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dropped in 39bdba4 — module-level meta comment removed (deleting just the trailing blank would have left it adjacent to imports); five .reset_index(drop=True) calls also gone since to_pandas() of a fresh sedona DataFrame already returns a default RangeIndex.

.agg(n=con.funcs.count(col("x")))
.sort("xy")
.to_pandas()
.reset_index(drop=True)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
.reset_index(drop=True)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dropped in 39bdba4 — module-level meta comment removed (deleting just the trailing blank would have left it adjacent to imports); five .reset_index(drop=True) calls also gone since to_pandas() of a fresh sedona DataFrame already returns a default RangeIndex.

)
.sort("k")
.to_pandas()
.reset_index(drop=True)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
.reset_index(drop=True)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dropped in 39bdba4 — module-level meta comment removed (deleting just the trailing blank would have left it adjacent to imports); five .reset_index(drop=True) calls also gone since to_pandas() of a fresh sedona DataFrame already returns a default RangeIndex.

Grouped aggregation on top of the registry-driven function dispatch
(apache#885) and the global-aggregation binding (apache#887).

API:

    df.group_by("k").agg(total=sd.funcs.sum(sd.col("v")))
    df.group_by("k1", "k2").agg(
        sd.funcs.sum(col("x")).alias("sum_x"),
        n=sd.funcs.count(col("y")),
    )
    df.group_by(col("x") + col("y")).agg(...)
    df.group_by(col("k"), "other_key").agg(...)

- `df.group_by(*keys)` — varargs of `str | Expr`. Strings auto-promote
  to `col(name)`; arbitrary `Expr` values are accepted as computed
  group keys. Empty keys → ValueError; non-str/non-Expr → TypeError.
- Returns a new `GroupedDataFrame` — a thin holder for the parent df
  plus the resolved group exprs. Single method `.agg(*exprs,
  **named_exprs)` with the same shape as `DataFrame.agg`.

Pure Python — the Rust `InternalDataFrame::aggregate(group_exprs,
agg_exprs)` from apache#887 already handles the grouped case; this PR just
populates `group_exprs` when constructing the aggregation.

The `GroupedDataFrame` intermediate is kept minimal (one method
beyond `__init__`) so it stays a clean place to add convenience
aggregates (`count`, `size`, etc.) later without polluting
`DataFrame`.

Tests: 12 covering single/multi string keys, Expr keys, computed
Expr keys, mixed str/Expr, positional + kwarg agg, lazy return type,
and the empty/bad-type error paths for both `group_by` and its
`.agg`.
@jiayuasu jiayuasu force-pushed the feature/df-group-by branch from 6a20692 to 39bdba4 Compare June 2, 2026 03:23
@jiayuasu jiayuasu merged commit 817bbc7 into apache:main Jun 2, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants