Coordinate cleanup by ocsmit · Pull Request #436 · podaac/l2ss-py

ocsmit · 2026-06-04T14:53:16Z

Github Issue: #435

Description

See issue for more in-depth description, but tldr is that dimension scale variables not used by any of requested variables are being persisted and at times being added into groups which they originally not present. This can bloat output files with irrelevant variables.

Overview of work done

Core change is when constructing the set of variables to drop we collect the required dimensions for each requested variable, and we traverse back up the tree to find the associated dimension scale/coordinate variable if it is present. If any are present, then we added those variables to the set of variables to keep. We then add any other unnecessary dimension scale/coordinate variables to the set of variables to drop.

test_specified_variables was updated to use the pytest tmp_path fixture for output data and an actual test table was added for each of the data files to be more explicit about what the expected outcome of the subset is supposed to be. Each entry in the table includes the expected data variables (which form the subset request) and then the expected coordinate/dimension scale variables that are required by those data variables.

Overview of verification done

All tests pass, tested with S5P which has more groups than I can count so makes for good test.

Overview of integration done

NA

PR checklist:

Linted
Updated unit tests
Updated changelog
[NA] Integration testing

See Pull Request Review Checklist for pointers on reviewing this pull request

ocsmit · 2026-06-04T14:55:25Z

+    paths: set[str] = set()
+    for node in tree.subtree:
+        prefix = node.path.rstrip("/") + "/"
+        for name in set(node.data_vars) | set(node.to_dataset(inherit=False).coords):


to get coordinate variables at each node and NOT include inherited coordinates we have to directly coerce to a dataset and specify no inheritance. This was the simplest (only?) way I could find to accomplish this. Otherwise we get coordinate variables at every single node instead of where they are originally.

ocsmit · 2026-06-04T14:58:11Z

+def drop_vars_by_path(tree: DataTree, var_paths: str | list[str] | set[str]) -> None:
    """
-    Drop variables from a DataTree using paths in the format '/group/var' or '/var' for root level
+    Drop variables *in place* from a DataTree using paths in the
+    format '/group/var' or '/var' for root level.

    Parameters
    ----------
    tree : DataTree
        The input DataTree
-    var_paths : str or List[str]
+    var_paths : str or list[str] or set[str]
        Paths to variables to drop in format '/group/var' or '/var' for root level
        Examples:
            - '/var1'  # root level variable
            - '/group1/var1'  # variable in group1
            - '/group1/subgroup/var1'  # variable in nested group

-    Returns
-    -------
-    DataTree
-        Modified DataTree with variables dropped
    """
-    if isinstance(var_paths, str):
-        var_paths = [var_paths]
-
-    for path in var_paths:
-        # Split the path into group path and variable name
-        parts = path.strip("/").split("/")
-
-        if len(parts) == 1:
-            # Root level variable
-            var_name = parts[0]
-            # Modify the dataset in-place using xarray's drop_vars
-            tree.ds = tree.ds.drop_vars([var_name], errors="ignore")
-        else:
-            # Group variable
-            group_path = "/".join(parts[:-1])
-            var_name = parts[-1]
-            try:
-                node = tree[group_path]
-                node.ds = node.ds.drop_vars([var_name], errors="ignore")
-            except KeyError:
-                pass
-
-    return tree
+    # guard for single string being passed
+    drop: set[str] = {var_paths} if isinstance(var_paths, str) else set(var_paths)
+
+    for node in tree.subtree:
+        prefix = node.path.rstrip("/") + "/"
+        to_drop = [name for name in node.variables if f"{prefix}{name}" in drop]
+        if to_drop:
+            node.dataset = node.dataset.drop_vars(to_drop, errors="ignore")


Function simplified greatly. Return value is dropped because it is working in place on reference to the tree and its underlying datasets are being overridden

ocsmit · 2026-06-04T14:59:57Z

+            # add in root "/" to variable path if not present so that
+            # matching with `all_data_variables` is works correctly
+            normalized_variables = [item if item.startswith("/") else "/" + item for item in variables]



Since we are not flattening data anymore, I don't think there is a need to check for __ in place of /.

ocsmit · 2026-06-04T15:00:43Z

+            keep_coords = coordinate_utils.collect_coordinate_variables(dataset, keep_variables)
+
+            keep_set = set(keep_variables) | keep_coords
+
+            drop_variables = all_data_variables - keep_set
+
+            datatree_subset.drop_vars_by_path(dataset, drop_variables)


Core logic for constructing set of variables to keep and drop which now accounts for coordinate variables.

ocsmit · 2026-06-04T15:03:06Z

+    want_coord: set[str]
+
+
+_test_table: list[VariableTestCase] = [


simple table used to parameterize the test. Each entry has the input (file), wanted data variables, and wanted coordinate / dimension scales. A smaller subset of each file is requested to check that only required coordinate / dimension scale variables are included along side the variables requested.

ocsmit · 2026-06-04T15:05:04Z

+        all_vars = get_vars_with_paths(out_tree)

-    subsetted_vars = get_all_variable_names_from_dtree(out_ds_tree)
-    subsetted_non_vars = get_non_variable_names_from_dtree(out_ds_tree) 
+        # wanted variable should be a subset of all vars
+        assert case.want_var <= all_vars

-    assert set(subsetted_vars + subsetted_non_vars) == set(included_variables + non_vars)
+        # wanted coordinate vars should be a subset of all vars as well
+        assert case.want_coord <= all_vars

-    in_ds_tree.close()
-    out_ds_tree.close()
+        # and the symmetric difference of the variable super set and
+        # the union of data and coordinate vars should be an empty set
+        # indicating that nothing is present that is not expected to
+        # be present. E.g. extra dimension scale vars
+        assert (case.want_var | case.want_coord) ^ all_vars == set()


Test logic becomes simpler since we are not having to open the input data and pull variables from it to construct the test case. Allows us to also test for things like when sst_dtime is not included in the requested variables for modis is time still present in the output.

Copilot

Pull request overview

This PR addresses GitHub issue #435 by tightening variable subsetting so that only requested data variables and the coordinate/dimension-scale variables they actually depend on are preserved, preventing irrelevant dimension-scale variables from bloating outputs and/or being introduced into groups where they did not exist.

Changes:

Add logic to discover and retain only the coordinate/dimension-scale variables required by the requested variables (and drop the rest).
Refactor DataTree variable-path collection and in-place dropping to support the new cleanup flow.
Update test_specified_variables to use a concrete expectation table and tmp_path, and update the changelog.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
`podaac/subsetter/utils/coordinate_utils.py`	Adds ancestry search + coordinate dependency collection helpers to keep only required coordinate/dimension-scale variables.
`podaac/subsetter/subset.py`	Updates variable-subsetting logic to compute keep/drop sets using DataTree path sets and dependent coordinates.
`podaac/subsetter/datatree_subset.py`	Changes `get_vars_with_paths` to return a `set[str]` including coords, and makes `drop_vars_by_path` operate in-place.
`tests/test_subset/test_specified_variables.py`	Replaces prior heuristic test with explicit per-file expected data vars and required coordinate vars; uses `tmp_path`.
`CHANGELOG.md`	Adds a “Fixed” entry referencing issue #435.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

jamesfwood

@ocsmit looks good. you can merge this anytime.

ocsmit added 8 commits June 2, 2026 12:28

fix: drop coordinate/dimension variables which are not used

c94bd9a

refactor: lineage depricated in favor of parents

731b9f3

test: refactor variable tests for clarity and correctness

5a4ceef

style: remove bit of commented code

e50752b

refactor: uneeded anon func for getting vars. duplicated code

442305a

refactor: unguarded string coercion, simplify'

d18c715

refactor: set logic instead

fdf3a8d

chore: changelog

1103569

ocsmit requested a review from sliu008 June 4, 2026 14:53

ocsmit commented Jun 4, 2026

View reviewed changes

ocsmit marked this pull request as ready for review June 4, 2026 15:05

ocsmit requested a review from jamesfwood June 4, 2026 15:05

ocsmit self-assigned this Jun 4, 2026

jamesfwood requested a review from Copilot June 4, 2026 15:55

Copilot started reviewing on behalf of jamesfwood June 4, 2026 15:55 View session

Copilot AI reviewed Jun 4, 2026

View reviewed changes

Comment thread podaac/subsetter/utils/coordinate_utils.py

Comment thread podaac/subsetter/utils/coordinate_utils.py

Comment thread podaac/subsetter/utils/coordinate_utils.py

Comment thread podaac/subsetter/subset.py Outdated

Comment thread CHANGELOG.md

ocsmit added 6 commits June 4, 2026 14:15

fix: lineage and parents behavior is not the same

e84ba0f

refactor: handle keyerror in case of variable path not being present

4fc047a

chore: typo in doc string

27cb8a7

refactor: redundant recursion code, move variable utils, old normalize

1bc7640

fix: test for parent/lineage fix

06e117b

fix: test mock new func

36aefd3

sliu008 approved these changes Jun 11, 2026

View reviewed changes

jamesfwood approved these changes Jun 16, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Coordinate cleanup#436

Coordinate cleanup#436
ocsmit wants to merge 14 commits into
podaac:developfrom
ocsmit:coordinate-cleanup

ocsmit commented Jun 4, 2026

Uh oh!

ocsmit Jun 4, 2026

Uh oh!

ocsmit Jun 4, 2026

Uh oh!

ocsmit Jun 4, 2026

Uh oh!

ocsmit Jun 4, 2026

Uh oh!

ocsmit Jun 4, 2026

Uh oh!

ocsmit Jun 4, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jamesfwood left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

ocsmit commented Jun 4, 2026

Description

Overview of work done

Overview of verification done

Overview of integration done

PR checklist:

Uh oh!

ocsmit Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

ocsmit Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

ocsmit Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

ocsmit Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

ocsmit Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

ocsmit Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jamesfwood left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants