feat(python/sedonadb): add DataFrame.group_by + GroupedDataFrame.agg by jiayuasu · Pull Request #893 · apache/sedona-db

jiayuasu · 2026-06-01T06:54:13Z

Grouped aggregation on top of the registry-driven function dispatch (#885) and the global-aggregation binding (#887). Completes the aggregation track of Phase P2 (#791).

API

df.group_by("k").agg(total=sd.funcs.sum(sd.col("v")))

df.group_by("k1", "k2").agg(
    sd.funcs.sum(col("x")).alias("sum_x"),
    n=sd.funcs.count(col("y")),
)

df.group_by(col("x") + col("y")).agg(...)         # computed-Expr group key
df.group_by(col("k"), "other_key").agg(...)        # mixed str / Expr

df.group_by(*keys) takes varargs of str | Expr. Strings auto-promote to col(name) (same pattern as sort). Empty keys → ValueError; non-str/non-Expr → TypeError.
It returns a new GroupedDataFrame, a thin holder for the parent df plus the resolved group exprs.
GroupedDataFrame.agg(*exprs, **named_exprs) has the same signature as DataFrame.agg, including the kwargs-as-alias shorthand.
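
The key-handling rules above can be sketched in a few lines of plain Python. This is only an illustration of the described behavior, not the code from this PR: `Expr`, `col`, and `resolve_group_keys` are hypothetical stand-ins defined here so the snippet runs without sedonadb.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Expr:
    """Hypothetical stand-in for the library's expression type."""

    text: str


def col(name: str) -> Expr:
    """Hypothetical stand-in for the column-reference constructor."""
    return Expr(name)


def resolve_group_keys(*keys):
    """Promote str keys to column expressions; reject empty or bad input."""
    if not keys:
        raise ValueError("group_by() requires at least one key")
    resolved = []
    for key in keys:
        if isinstance(key, str):
            # Strings auto-promote to a column reference.
            resolved.append(col(key))
        elif isinstance(key, Expr):
            resolved.append(key)
        else:
            raise TypeError(
                f"group key must be str or Expr, got {type(key).__name__}"
            )
    return resolved
```

With these stand-ins, `resolve_group_keys(col("k"), "other_key")` yields two `Expr` objects, which mirrors the mixed str / Expr case listed in the API examples.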

Implementation

Pure Python — no Rust changes. The InternalDataFrame::aggregate(group_exprs, agg_exprs) binding from #887 already handles the grouped case; this PR just populates group_exprs when constructing the aggregation. The GroupedDataFrame intermediate is kept minimal (one method beyond __init__) so it stays a clean place to add convenience aggregates (count, size, etc.) later without polluting DataFrame.
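
A rough sketch of the shape described here, under stated assumptions: `FakeInternal` stands in for the Rust-backed `InternalDataFrame` and only records what it is given, and `Expr` is a hypothetical placeholder for the real expression type. The actual class in this PR may differ in its details.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Expr:
    """Hypothetical stand-in for the library's expression type."""

    text: str
    name: Optional[str] = None

    def alias(self, name: str) -> "Expr":
        return replace(self, name=name)


class FakeInternal:
    """Hypothetical stand-in for the binding; just echoes its arguments."""

    def aggregate(self, group_exprs, agg_exprs):
        return {"group": list(group_exprs), "agg": list(agg_exprs)}


class GroupedDataFrame:
    """Thin holder: parent frame plus already-resolved group expressions."""

    def __init__(self, internal, group_exprs):
        self._internal = internal
        self._group_exprs = list(group_exprs)

    def agg(self, *exprs, **named_exprs):
        # Validate before aliasing so bad input raises TypeError, not AttributeError.
        for expr in (*exprs, *named_exprs.values()):
            if not isinstance(expr, Expr):
                raise TypeError(f"expected Expr, got {type(expr).__name__}")
        # kwargs-as-alias shorthand: n=expr behaves like expr.alias("n").
        agg_exprs = [*exprs, *(e.alias(n) for n, e in named_exprs.items())]
        if not agg_exprs:
            raise ValueError("agg() requires at least one expression")
        # The only difference from a global aggregation is a non-empty group list.
        return self._internal.aggregate(self._group_exprs, agg_exprs)
```

Keeping the holder this small means a later convenience method such as `count()` could be added as one more method that builds its own expression list and calls the same binding.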

Test plan

12 tests in tests/expr/test_dataframe_group_by.py:

Positive: single-key string; multi-key strings; col(k) Expr key; computed-Expr key (col("x") + col("y")); mixed str / Expr; positional + kwarg agg in one call.
Lazy return: df.group_by("k") → GroupedDataFrame; .agg(...) → DataFrame.
Errors: empty group_by() → ValueError; bad-type key → TypeError; empty .agg() → ValueError; non-Expr agg → TypeError.

Output assertions use pd.testing.assert_frame_equal after sorting (group-by output ordering isn't guaranteed).
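
The sort-then-compare pattern can be illustrated with plain pandas. The frames below are hand-built for the example and `normalized` is a hypothetical helper name; neither comes from the test file in this PR.

```python
import pandas as pd


def normalized(df: pd.DataFrame, by: list) -> pd.DataFrame:
    """Sort by the group keys and drop the old index so row order is stable."""
    return df.sort_values(by).reset_index(drop=True)


# Pretend this came back from a group-by in arbitrary row order.
actual = pd.DataFrame({"k": ["b", "a"], "total": [3, 7]})
expected = pd.DataFrame({"k": ["a", "b"], "total": [7, 3]})

# Raises AssertionError if the frames differ after normalization.
pd.testing.assert_frame_equal(normalized(actual, ["k"]), normalized(expected, ["k"]))
```

Without `reset_index(drop=True)`, the comparison would fail on the index even when the sorted rows match, because `sort_values` keeps the original index labels.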

Local: 12 unit + 23 doctests + ruff format + ruff check all clean.

paleolimbot

Thank you!

paleolimbot · 2026-06-01T15:08:31Z

+def _sorted(df: pd.DataFrame, *by: str) -> pd.DataFrame:
+    # Group output is unordered. Sort to compare deterministically.
+    return df.sort_values(list(by)).reset_index(drop=True)


Can you inline the sort into the tests? .group_by().agg().sort() would be about as compact and make it easier to rearrange the tests later.

Inlined in 6a20692 — _sorted helper is gone; each test that needed deterministic group ordering now chains .sort(...).to_pandas().reset_index(drop=True) directly.

paleolimbot · 2026-06-01T15:15:34Z

+    Produced by `DataFrame.group_by(...)`. The only public method is
+    `agg(...)`, which runs the aggregation and returns a new
+    `DataFrame` with one row per group. The class exists as a step in
+    the chain so that future convenience aggregates (e.g. `count()`,
+    `size()`) can land here without polluting `DataFrame`.


Suggested change

-    Produced by `DataFrame.group_by(...)`. The only public method is
-    `agg(...)`, which runs the aggregation and returns a new
-    `DataFrame` with one row per group. The class exists as a step in
-    the chain so that future convenience aggregates (e.g. `count()`,
-    `size()`) can land here without polluting `DataFrame`.
+    Produced by `DataFrame.group_by(...)`. The class exists as a step in
+    the chain to simplify aggregation expressions.

Applied verbatim in 6a20692.

paleolimbot · 2026-06-02T02:48:09Z

+# Tests for DataFrame.group_by(*keys).agg(*exprs, **named_exprs).
+# Aggregate exprs come from `con.funcs.<name>(args)` via the function
+# registry (#885); the Rust binding is shared with `df.agg`.
+


Suggested change

-# Tests for DataFrame.group_by(*keys).agg(*exprs, **named_exprs).
-# Aggregate exprs come from `con.funcs.<name>(args)` via the function
-# registry (#885); the Rust binding is shared with `df.agg`.