feat(python/sedonadb): add DataFrame.group_by + GroupedDataFrame.agg #893
| def _sorted(df: pd.DataFrame, *by: str) -> pd.DataFrame: | ||
| # Group output is unordered. Sort to compare deterministically. | ||
| return df.sort_values(list(by)).reset_index(drop=True) |
Can you inline the sort into the tests? .group_by().agg().sort() would be about as compact and make it easier to rearrange the tests later.
Inlined in 6a20692 — _sorted helper is gone; each test that needed deterministic group ordering now chains .sort(...).to_pandas().reset_index(drop=True) directly.
| Produced by `DataFrame.group_by(...)`. The only public method is | ||
| `agg(...)`, which runs the aggregation and returns a new | ||
| `DataFrame` with one row per group. The class exists as a step in | ||
| the chain so that future convenience aggregates (e.g. `count()`, | ||
| `size()`) can land here without polluting `DataFrame`. |
| Produced by `DataFrame.group_by(...)`. The class exists as a step in | |
| the chain to simplify aggregation expressions. |
Force-pushed from cc14430 to 6a20692.
| # Tests for DataFrame.group_by(*keys).agg(*exprs, **named_exprs). | ||
| # Aggregate exprs come from `con.funcs.<name>(args)` via the function | ||
| # registry (#885); the Rust binding is shared with `df.agg`. | ||
|
|
| # Tests for DataFrame.group_by(*keys).agg(*exprs, **named_exprs). | |
| # Aggregate exprs come from `con.funcs.<name>(args)` via the function | |
| # registry (#885); the Rust binding is shared with `df.agg`. |
Dropped in 39bdba4 — module-level meta comment removed (deleting just the trailing blank would have left it adjacent to imports); five .reset_index(drop=True) calls also gone since to_pandas() of a fresh sedona DataFrame already returns a default RangeIndex.
| .agg(total=con.funcs.sum(col("v"))) | ||
| .sort("k") | ||
| .to_pandas() | ||
| .reset_index(drop=True) |
I don't think these do anything since the fresh pandas df is not grouped? (With apologies if they're needed)
| .reset_index(drop=True) |
Dropped in 39bdba4 — module-level meta comment removed (deleting just the trailing blank would have left it adjacent to imports); five .reset_index(drop=True) calls also gone since to_pandas() of a fresh sedona DataFrame already returns a default RangeIndex.
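For anyone reading this thread later, here is a small pandas-only sketch of why the `.reset_index(drop=True)` calls were redundant. It assumes, as stated above, that `to_pandas()` hands back a frame with a default `RangeIndex`; the `fresh` frame below is only a stand-in for that result, not actual sedonadb output.

```python
import pandas as pd

# Stand-in for the result of .sort("k").to_pandas(): a freshly built frame.
fresh = pd.DataFrame({"k": ["a", "b", "c"], "total": [3, 1, 2]})

# A fresh frame already carries the default RangeIndex,
# so reset_index(drop=True) leaves it unchanged.
fresh_is_default = isinstance(fresh.index, pd.RangeIndex)
noop = fresh.reset_index(drop=True).equals(fresh)

# By contrast, sorting on the pandas side keeps the original row labels,
# which is why the old pandas-based _sorted helper needed the reset.
sorted_in_pandas = fresh.sort_values("total")
labels_after_sort = list(sorted_in_pandas.index)
restored = sorted_in_pandas.reset_index(drop=True)
```

Once the sort happens before `to_pandas()`, the pandas index is never shuffled, so there is nothing to reset.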
| .agg(total=con.funcs.sum(col("v"))) | ||
| .sort("k1", "k2") | ||
| .to_pandas() | ||
| .reset_index(drop=True) |
| .reset_index(drop=True) |
| .agg(total=con.funcs.sum(col("v"))) | ||
| .sort("k") | ||
| .to_pandas() | ||
| .reset_index(drop=True) |
| .reset_index(drop=True) |
| .agg(n=con.funcs.count(col("x"))) | ||
| .sort("xy") | ||
| .to_pandas() | ||
| .reset_index(drop=True) |
| .reset_index(drop=True) |
| ) | ||
| .sort("k") | ||
| .to_pandas() | ||
| .reset_index(drop=True) |
| .reset_index(drop=True) |
Grouped aggregation on top of the registry-driven function dispatch (apache#885) and the global-aggregation binding (apache#887).

API:

    df.group_by("k").agg(total=sd.funcs.sum(sd.col("v")))
    df.group_by("k1", "k2").agg(
        sd.funcs.sum(col("x")).alias("sum_x"),
        n=sd.funcs.count(col("y")),
    )
    df.group_by(col("x") + col("y")).agg(...)
    df.group_by(col("k"), "other_key").agg(...)

- `df.group_by(*keys)`: varargs of `str | Expr`. Strings auto-promote to `col(name)`; arbitrary `Expr` values are accepted as computed group keys. Empty keys → ValueError; non-str/non-Expr → TypeError.
- Returns a new `GroupedDataFrame`: a thin holder for the parent df plus the resolved group exprs. Single method `.agg(*exprs, **named_exprs)` with the same shape as `DataFrame.agg`.

Pure Python: the Rust `InternalDataFrame::aggregate(group_exprs, agg_exprs)` from apache#887 already handles the grouped case; this PR just populates `group_exprs` when constructing the aggregation. The `GroupedDataFrame` intermediate is kept minimal (one method beyond `__init__`) so it stays a clean place to add convenience aggregates (`count`, `size`, etc.) later without polluting `DataFrame`.

Tests: 12 covering single/multi string keys, Expr keys, computed Expr keys, mixed str/Expr, positional + kwarg agg, lazy return type, and the empty/bad-type error paths for both `group_by` and its `.agg`.
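The key-handling rules in the commit message (strings promoted to `col(name)`, empty keys rejected, other types rejected) can be illustrated with a minimal standalone sketch. `Expr`, `col`, and `resolve_group_keys` below are hypothetical stand-ins written for this illustration; they are not sedonadb's classes or the code in this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Expr:
    """Hypothetical stand-in for an expression; not sedonadb's Expr."""

    text: str


def col(name: str) -> Expr:
    """Hypothetical column reference."""
    return Expr(name)


def resolve_group_keys(*keys):
    """Normalize group keys: str becomes col(name), Expr passes through."""
    if not keys:
        raise ValueError("group_by() requires at least one key")
    resolved = []
    for key in keys:
        if isinstance(key, str):
            resolved.append(col(key))
        elif isinstance(key, Expr):
            resolved.append(key)
        else:
            raise TypeError(
                f"group key must be str or Expr, got {type(key).__name__}"
            )
    return resolved


# Mixed str / Expr keys resolve to a uniform list of expressions.
mixed = resolve_group_keys(col("k"), "other_key")
```

Resolving keys up front means the holder object only ever stores expressions, so the later `.agg(...)` step does not need to care how the keys were spelled.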
Force-pushed from 6a20692 to 39bdba4.
Grouped aggregation on top of the registry-driven function dispatch (#885) and the global-aggregation binding (#887). Completes the aggregation track of Phase P2 (#791).
API
- `df.group_by(*keys)`: varargs of `str | Expr`. Strings auto-promote to `col(name)` (same pattern as `sort`). Empty keys → `ValueError`; non-str/non-Expr → `TypeError`.
- `GroupedDataFrame`: thin holder for the parent df + resolved group exprs.
- `GroupedDataFrame.agg(*exprs, **named_exprs)`: same signature as `DataFrame.agg`, including the kwargs-as-alias shorthand.

Implementation
Pure Python, no Rust changes. The `InternalDataFrame::aggregate(group_exprs, agg_exprs)` binding from #887 already handles the grouped case; this PR just populates `group_exprs` when constructing the aggregation. The `GroupedDataFrame` intermediate is kept minimal (one method beyond `__init__`) so it stays a clean place to add convenience aggregates (`count`, `size`, etc.) later without polluting `DataFrame`.

Test plan
12 tests in `tests/expr/test_dataframe_group_by.py`:
- `col(k)` Expr key; computed-Expr key (`col("x") + col("y")`); mixed str / Expr; positional + kwarg agg in one call.
- `df.group_by("k")` → `GroupedDataFrame`; `.agg(...)` → `DataFrame`.
- `group_by()` → `ValueError`; bad-type key → `TypeError`; empty `.agg()` → `ValueError`; non-Expr agg → `TypeError`.

Output assertions use `pd.testing.assert_frame_equal` after sorting (group-by output ordering isn't guaranteed).

Local: 12 unit + 23 doctests + `ruff format` + `ruff check` all clean.
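The kwargs-as-alias shorthand and the `.agg` error paths listed above can be sketched in isolation as follows. This is only an illustration under stated assumptions: `Expr` and `collect_agg_exprs` are hypothetical names invented here, not the sedonadb implementation, and the real binding may order or validate arguments differently.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Expr:
    """Hypothetical stand-in for an aggregate expression; not sedonadb's Expr."""

    text: str
    name: Optional[str] = None

    def alias(self, name: str) -> "Expr":
        # Return a copy carrying the output column name.
        return replace(self, name=name)


def collect_agg_exprs(*exprs, **named_exprs):
    """Merge positional exprs with keyword exprs; each keyword acts as an alias."""
    candidates = [(None, e) for e in exprs] + list(named_exprs.items())
    if not candidates:
        raise ValueError("agg() requires at least one aggregate expression")
    merged = []
    for name, expr in candidates:
        if not isinstance(expr, Expr):
            raise TypeError(
                f"aggregate must be an Expr, got {type(expr).__name__}"
            )
        merged.append(expr if name is None else expr.alias(name))
    return merged


# Positional + kwarg aggregates in one call, as in the test plan.
combined = collect_agg_exprs(Expr("sum(x)").alias("sum_x"), n=Expr("count(y)"))
```

In this sketch `n=expr` is equivalent to passing `expr.alias("n")` positionally, with positional expressions placed ahead of keyword ones.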