Skip to content

Coordinate cleanup#436

Open
ocsmit wants to merge 14 commits into
podaac:developfrom
ocsmit:coordinate-cleanup
Open

Coordinate cleanup#436
ocsmit wants to merge 14 commits into
podaac:developfrom
ocsmit:coordinate-cleanup

Conversation

@ocsmit

@ocsmit ocsmit commented Jun 4, 2026

Copy link
Copy Markdown
Collaborator

Github Issue: #435

Description

See issue for more in-depth description, but tldr is that dimension scale variables not used by any of requested variables are being persisted and at times being added into groups which they originally not present. This can bloat output files with irrelevant variables.

Overview of work done

Core change is when constructing the set of variables to drop we collect the required dimensions for each requested variable, and we traverse back up the tree to find the associated dimension scale/coordinate variable if it is present. If any are present, then we added those variables to the set of variables to keep. We then add any other unnecessary dimension scale/coordinate variables to the set of variables to drop.

test_specified_variables was updated to use the pytest tmp_path fixture for output data and an actual test table was added for each of the data files to be more explicit about what the expected outcome of the subset is supposed to be. Each entry in the table includes the expected data variables (which form the subset request) and then the expected coordinate/dimension scale variables that are required by those data variables.

Overview of verification done

All tests pass, tested with S5P which has more groups than I can count so makes for good test.

Overview of integration done

NA

PR checklist:

  • Linted
  • Updated unit tests
  • Updated changelog
  • [NA] Integration testing

See Pull Request Review Checklist for pointers on reviewing this pull request

@ocsmit ocsmit requested a review from sliu008 June 4, 2026 14:53
Comment thread podaac/subsetter/datatree_subset.py Outdated
paths: set[str] = set()
for node in tree.subtree:
prefix = node.path.rstrip("/") + "/"
for name in set(node.data_vars) | set(node.to_dataset(inherit=False).coords):

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to get coordinate variables at each node and NOT include inherited coordinates we have to directly coerce to a dataset and specify no inheritance. This was the simplest (only?) way I could find to accomplish this. Otherwise we get coordinate variables at every single node instead of where they are originally.

Comment thread podaac/subsetter/datatree_subset.py Outdated
Comment on lines +834 to +858
def drop_vars_by_path(tree: DataTree, var_paths: str | list[str] | set[str]) -> None:
"""
Drop variables from a DataTree using paths in the format '/group/var' or '/var' for root level
Drop variables *in place* from a DataTree using paths in the
format '/group/var' or '/var' for root level.

Parameters
----------
tree : DataTree
The input DataTree
var_paths : str or List[str]
var_paths : str or list[str] or set[str]
Paths to variables to drop in format '/group/var' or '/var' for root level
Examples:
- '/var1' # root level variable
- '/group1/var1' # variable in group1
- '/group1/subgroup/var1' # variable in nested group

Returns
-------
DataTree
Modified DataTree with variables dropped
"""
if isinstance(var_paths, str):
var_paths = [var_paths]

for path in var_paths:
# Split the path into group path and variable name
parts = path.strip("/").split("/")

if len(parts) == 1:
# Root level variable
var_name = parts[0]
# Modify the dataset in-place using xarray's drop_vars
tree.ds = tree.ds.drop_vars([var_name], errors="ignore")
else:
# Group variable
group_path = "/".join(parts[:-1])
var_name = parts[-1]
try:
node = tree[group_path]
node.ds = node.ds.drop_vars([var_name], errors="ignore")
except KeyError:
pass

return tree
# guard for single string being passed
drop: set[str] = {var_paths} if isinstance(var_paths, str) else set(var_paths)

for node in tree.subtree:
prefix = node.path.rstrip("/") + "/"
to_drop = [name for name in node.variables if f"{prefix}{name}" in drop]
if to_drop:
node.dataset = node.dataset.drop_vars(to_drop, errors="ignore")

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Function simplified greatly. Return value is dropped because it is working in place on reference to the tree and its underlying datasets are being overridden

# add in root "/" to variable path if not present so that
# matching with `all_data_variables` is works correctly
normalized_variables = [item if item.startswith("/") else "/" + item for item in variables]

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we are not flattening data anymore, I don't think there is a need to check for __ in place of /.

Comment thread podaac/subsetter/subset.py Outdated
Comment on lines +448 to +454
keep_coords = coordinate_utils.collect_coordinate_variables(dataset, keep_variables)

keep_set = set(keep_variables) | keep_coords

drop_variables = all_data_variables - keep_set

datatree_subset.drop_vars_by_path(dataset, drop_variables)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Core logic for constructing set of variables to keep and drop which now accounts for coordinate variables.

want_coord: set[str]


_test_table: list[VariableTestCase] = [

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

simple table used to parameterize the test. Each entry has the input (file), wanted data variables, and wanted coordinate / dimension scales. A smaller subset of each file is requested to check that only required coordinate / dimension scale variables are included along side the variables requested.

Comment on lines +161 to +173
all_vars = get_vars_with_paths(out_tree)

subsetted_vars = get_all_variable_names_from_dtree(out_ds_tree)
subsetted_non_vars = get_non_variable_names_from_dtree(out_ds_tree)
# wanted variable should be a subset of all vars
assert case.want_var <= all_vars

assert set(subsetted_vars + subsetted_non_vars) == set(included_variables + non_vars)
# wanted coordinate vars should be a subset of all vars as well
assert case.want_coord <= all_vars

in_ds_tree.close()
out_ds_tree.close()
# and the symmetric difference of the variable super set and
# the union of data and coordinate vars should be an empty set
# indicating that nothing is present that is not expected to
# be present. E.g. extra dimension scale vars
assert (case.want_var | case.want_coord) ^ all_vars == set()

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test logic becomes simpler since we are not having to open the input data and pull variables from it to construct the test case. Allows us to also test for things like when sst_dtime is not included in the requested variables for modis is time still present in the output.

@ocsmit ocsmit marked this pull request as ready for review June 4, 2026 15:05
@ocsmit ocsmit requested a review from jamesfwood June 4, 2026 15:05
@ocsmit ocsmit self-assigned this Jun 4, 2026
@jamesfwood jamesfwood requested a review from Copilot June 4, 2026 15:55

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses GitHub issue #435 by tightening variable subsetting so that only requested data variables and the coordinate/dimension-scale variables they actually depend on are preserved, preventing irrelevant dimension-scale variables from bloating outputs and/or being introduced into groups where they did not exist.

Changes:

  • Add logic to discover and retain only the coordinate/dimension-scale variables required by the requested variables (and drop the rest).
  • Refactor DataTree variable-path collection and in-place dropping to support the new cleanup flow.
  • Update test_specified_variables to use a concrete expectation table and tmp_path, and update the changelog.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
podaac/subsetter/utils/coordinate_utils.py Adds ancestry search + coordinate dependency collection helpers to keep only required coordinate/dimension-scale variables.
podaac/subsetter/subset.py Updates variable-subsetting logic to compute keep/drop sets using DataTree path sets and dependent coordinates.
podaac/subsetter/datatree_subset.py Changes get_vars_with_paths to return a set[str] including coords, and makes drop_vars_by_path operate in-place.
tests/test_subset/test_specified_variables.py Replaces prior heuristic test with explicit per-file expected data vars and required coordinate vars; uses tmp_path.
CHANGELOG.md Adds a “Fixed” entry referencing issue #435.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread podaac/subsetter/utils/coordinate_utils.py
Comment thread podaac/subsetter/utils/coordinate_utils.py
Comment thread podaac/subsetter/utils/coordinate_utils.py
Comment thread podaac/subsetter/subset.py Outdated
Comment thread CHANGELOG.md

@jamesfwood jamesfwood left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ocsmit looks good. you can merge this anytime.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants