Coordinate cleanup#436
Conversation
| paths: set[str] = set() | ||
| for node in tree.subtree: | ||
| prefix = node.path.rstrip("/") + "/" | ||
| for name in set(node.data_vars) | set(node.to_dataset(inherit=False).coords): |
There was a problem hiding this comment.
to get coordinate variables at each node and NOT include inherited coordinates we have to directly coerce to a dataset and specify no inheritance. This was the simplest (only?) way I could find to accomplish this. Otherwise we get coordinate variables at every single node instead of where they are originally.
| def drop_vars_by_path(tree: DataTree, var_paths: str | list[str] | set[str]) -> None: | ||
| """ | ||
| Drop variables from a DataTree using paths in the format '/group/var' or '/var' for root level | ||
| Drop variables *in place* from a DataTree using paths in the | ||
| format '/group/var' or '/var' for root level. | ||
|
|
||
| Parameters | ||
| ---------- | ||
| tree : DataTree | ||
| The input DataTree | ||
| var_paths : str or List[str] | ||
| var_paths : str or list[str] or set[str] | ||
| Paths to variables to drop in format '/group/var' or '/var' for root level | ||
| Examples: | ||
| - '/var1' # root level variable | ||
| - '/group1/var1' # variable in group1 | ||
| - '/group1/subgroup/var1' # variable in nested group | ||
|
|
||
| Returns | ||
| ------- | ||
| DataTree | ||
| Modified DataTree with variables dropped | ||
| """ | ||
| if isinstance(var_paths, str): | ||
| var_paths = [var_paths] | ||
|
|
||
| for path in var_paths: | ||
| # Split the path into group path and variable name | ||
| parts = path.strip("/").split("/") | ||
|
|
||
| if len(parts) == 1: | ||
| # Root level variable | ||
| var_name = parts[0] | ||
| # Modify the dataset in-place using xarray's drop_vars | ||
| tree.ds = tree.ds.drop_vars([var_name], errors="ignore") | ||
| else: | ||
| # Group variable | ||
| group_path = "/".join(parts[:-1]) | ||
| var_name = parts[-1] | ||
| try: | ||
| node = tree[group_path] | ||
| node.ds = node.ds.drop_vars([var_name], errors="ignore") | ||
| except KeyError: | ||
| pass | ||
|
|
||
| return tree | ||
| # guard for single string being passed | ||
| drop: set[str] = {var_paths} if isinstance(var_paths, str) else set(var_paths) | ||
|
|
||
| for node in tree.subtree: | ||
| prefix = node.path.rstrip("/") + "/" | ||
| to_drop = [name for name in node.variables if f"{prefix}{name}" in drop] | ||
| if to_drop: | ||
| node.dataset = node.dataset.drop_vars(to_drop, errors="ignore") |
There was a problem hiding this comment.
Function simplified greatly. Return value is dropped because it is working in place on reference to the tree and its underlying datasets are being overridden
| # add in root "/" to variable path if not present so that | ||
| # matching with `all_data_variables` is works correctly | ||
| normalized_variables = [item if item.startswith("/") else "/" + item for item in variables] | ||
|
|
There was a problem hiding this comment.
Since we are not flattening data anymore, I don't think there is a need to check for __ in place of /.
| keep_coords = coordinate_utils.collect_coordinate_variables(dataset, keep_variables) | ||
|
|
||
| keep_set = set(keep_variables) | keep_coords | ||
|
|
||
| drop_variables = all_data_variables - keep_set | ||
|
|
||
| datatree_subset.drop_vars_by_path(dataset, drop_variables) |
There was a problem hiding this comment.
Core logic for constructing set of variables to keep and drop which now accounts for coordinate variables.
| want_coord: set[str] | ||
|
|
||
|
|
||
| _test_table: list[VariableTestCase] = [ |
There was a problem hiding this comment.
simple table used to parameterize the test. Each entry has the input (file), wanted data variables, and wanted coordinate / dimension scales. A smaller subset of each file is requested to check that only required coordinate / dimension scale variables are included along side the variables requested.
| all_vars = get_vars_with_paths(out_tree) | ||
|
|
||
| subsetted_vars = get_all_variable_names_from_dtree(out_ds_tree) | ||
| subsetted_non_vars = get_non_variable_names_from_dtree(out_ds_tree) | ||
| # wanted variable should be a subset of all vars | ||
| assert case.want_var <= all_vars | ||
|
|
||
| assert set(subsetted_vars + subsetted_non_vars) == set(included_variables + non_vars) | ||
| # wanted coordinate vars should be a subset of all vars as well | ||
| assert case.want_coord <= all_vars | ||
|
|
||
| in_ds_tree.close() | ||
| out_ds_tree.close() | ||
| # and the symmetric difference of the variable super set and | ||
| # the union of data and coordinate vars should be an empty set | ||
| # indicating that nothing is present that is not expected to | ||
| # be present. E.g. extra dimension scale vars | ||
| assert (case.want_var | case.want_coord) ^ all_vars == set() |
There was a problem hiding this comment.
Test logic becomes simpler since we are not having to open the input data and pull variables from it to construct the test case. Allows us to also test for things like when sst_dtime is not included in the requested variables for modis is time still present in the output.
There was a problem hiding this comment.
Pull request overview
This PR addresses GitHub issue #435 by tightening variable subsetting so that only requested data variables and the coordinate/dimension-scale variables they actually depend on are preserved, preventing irrelevant dimension-scale variables from bloating outputs and/or being introduced into groups where they did not exist.
Changes:
- Add logic to discover and retain only the coordinate/dimension-scale variables required by the requested variables (and drop the rest).
- Refactor DataTree variable-path collection and in-place dropping to support the new cleanup flow.
- Update
test_specified_variablesto use a concrete expectation table andtmp_path, and update the changelog.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
podaac/subsetter/utils/coordinate_utils.py |
Adds ancestry search + coordinate dependency collection helpers to keep only required coordinate/dimension-scale variables. |
podaac/subsetter/subset.py |
Updates variable-subsetting logic to compute keep/drop sets using DataTree path sets and dependent coordinates. |
podaac/subsetter/datatree_subset.py |
Changes get_vars_with_paths to return a set[str] including coords, and makes drop_vars_by_path operate in-place. |
tests/test_subset/test_specified_variables.py |
Replaces prior heuristic test with explicit per-file expected data vars and required coordinate vars; uses tmp_path. |
CHANGELOG.md |
Adds a “Fixed” entry referencing issue #435. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
jamesfwood
left a comment
There was a problem hiding this comment.
@ocsmit looks good. you can merge this anytime.
Github Issue: #435
Description
See issue for more in-depth description, but tldr is that dimension scale variables not used by any of requested variables are being persisted and at times being added into groups which they originally not present. This can bloat output files with irrelevant variables.
Overview of work done
Core change is when constructing the set of variables to drop we collect the required dimensions for each requested variable, and we traverse back up the tree to find the associated dimension scale/coordinate variable if it is present. If any are present, then we added those variables to the set of variables to keep. We then add any other unnecessary dimension scale/coordinate variables to the set of variables to drop.
test_specified_variables was updated to use the pytest
tmp_pathfixture for output data and an actual test table was added for each of the data files to be more explicit about what the expected outcome of the subset is supposed to be. Each entry in the table includes the expected data variables (which form the subset request) and then the expected coordinate/dimension scale variables that are required by those data variables.Overview of verification done
All tests pass, tested with S5P which has more groups than I can count so makes for good test.
Overview of integration done
NA
PR checklist:
See Pull Request Review Checklist for pointers on reviewing this pull request