Land PR #99 (red-team follow-up) onto master by nick-gorman · Pull Request #100 · UNSW-CEEM/NEMOSIS

nick-gorman · 2026-05-27T23:25:27Z

Summary

PR #99 was merged with baseRefName: docs-start-end-time instead of master, so its four commits never landed on master. This PR re-targets the same polish-from-red-team branch directly at master.

Commits

Identical to PR #99:

f5b947c Install standard finalise pipeline for MARKET_PRICE_THRESHOLDS
02b4ee2 Validate user select_columns includes primary-key columns up-front
46db129 Apply strict-select-columns and filter-cols-not-in-data checks to static_table
896a56f Emit partial-coverage WARNING when fetch loop misses periods in user window

See #99 for the full rationale, the per-commit detail, and the live smoke results.

Why this PR exists

Discovered while cross-checking the draft release notes against master: bullets that depend on PR #99 (MARKET_PRICE_THRESHOLDS dedup, up-front PK validation, static-table select_columns validation, partial-coverage WARNING, filter-cols UserInputError actually raised) didn't match what was on master. Tracing back showed PR #99's base was set to the docs-start-end-time feature branch by mistake.

Test plan

uv run pytest tests/ passes (435 passed / 1 skipped on PR Red-team follow-up: dedup config, PK validation, coverage warning #99)
Confirm no new merge conflicts vs the master tip

🤖 Generated with Claude Code

`processing_info_maps.finalise["MARKET_PRICE_THRESHOLDS"]` was `None`, unique among the search_type="all" tables. Every other one ships `[most_recent_records_before_start_time, drop_duplicates_by_primary_key]` so a "snapshot as of start_time" query returns one row per PK pair. With `finalise=None`, every monthly archive's copy of every effective date passed straight through to the user. A multi-month query returned ~N copies of each row where N is the number of monthly files scanned — a 5-month live query returned 1269 rows for 16 distinct (EFFECTIVEDATE, VERSIONNO) pairs, the same byte-identical row repeating up to 114 times. Joining MARKET_PRICE_THRESHOLDS to DISPATCHPRICE on EFFECTIVEDATE produces cartesian-explosion bugs; `df.set_index('EFFECTIVEDATE')` blows up on the duplicate index; mean/aggregate calculations are heavily biased toward older effective dates because more monthly files have already shipped by then. `defaults.effective_date_group_col["MARKET_PRICE_THRESHOLDS"]` is already `[]`, which sends `most_recent_records_before_start_time` down the `.tail(1)` branch — exactly one "as-of" row from before start_time. `defaults.table_primary_keys["MARKET_PRICE_THRESHOLDS"]` is already `["EFFECTIVEDATE", "VERSIONNO"]`, so the dedup just works. The fix is to install the standard pipeline. Strengthens the existing test to assert the PK invariant — verified the strengthened test FAILS without the fix (every PK pair appears twice in the 2-month fixture) and PASSES with it. Full suite green: 428 passed, 1 skipped. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

For tables whose processing pipeline dedupes on PK, omitting a PK column from user-typed select_columns previously caused a bare pandas KeyError to escape from inside the finalise step: KeyError: Index(['VERSIONNO'], dtype='str') The traceback pointed at pandas internals and gave the user no clue that the actual fix is "add VERSIONNO back to your select_columns" — or just use the defaults, which always include the PK. Validate up-front in dynamic_data_compiler, cache_compiler, and static_table: * Dynamic-table dedup is detected by the presence of `query_wrappers.drop_duplicates_by_primary_key` in the table's finalise pipeline. * Static-table dedup is detected by the presence of `_finalise_excel_data` in the table's static_data_finaliser_map entry (it calls `drop_duplicates(primary_keys)` when the table is in `defaults.table_primary_keys`). In both cases, raise a clear UserInputError naming the missing PK column(s) and the suggested fix (add to select_columns, or pass select_columns=None to use defaults). Two regression tests in test_errors.py — one each for the dynamic (LOSSFACTORMODEL, PK = [EFFECTIVEDATE, INTERCONNECTORID, REGIONID, VERSIONNO]) and static (Generators and Scheduled Loads, PK = [DUID]) crash paths. 430 passed, 1 skipped (was 428). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…tic_table `dynamic_data_compiler` had two related guards added in 63edfee: * `_check_loaded_select_columns`: any user-typed select_columns column that didn't survive the file load is a hard error (catches typos like 'duid' for 'DUID' that previously returned a stub DataFrame with a WARNING). * The post-load filter_cols check at line 162 still constructed UserInputError without `raise` — a silent filter no-op that returned unfiltered data when the filter column was inherited from defaults but happened to be missing from this AEMO file vintage. `static_table` shared both shapes. The first wasn't covered at all — a typo in select_columns silently produced a stub DataFrame. The second had the same missing-`raise` bug at the static_table call site. Apply both fixes to static_table: * Capture `user_select_columns` before defaults reassignment. * Call `_check_loaded_select_columns` after column-selection + string-strip, before filtering. * Add `raise` in front of the previously-bare `UserInputError(...)` in the post-load filter_cols check. Fix the same missing-`raise` at the corresponding spot in `dynamic_data_compiler` while we're here — same bug pattern. With `_check_loaded_select_columns` in place, the missing-`raise` path is mostly unreachable, but defence-in-depth matters for the defaults-inherited-but-vintage-missing edge case. Tests: * `test_static_select_columns_typo_raises` — bogus column in select_columns raises UserInputError naming the missing column. * `test_static_filter_col_not_in_loaded_data_raises` — combined select-and-filter typo raises (caught by the earlier _check_loaded_select_columns). 432 passed, 1 skipped (was 430). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…window A multi-period query that partly succeeds and partly 404s previously had no aggregate signal — only N separate per-file warnings that aggregating users tend not to read. So asking for 16 months of INTERMITTENT_GEN_SCADA against AEMO's ~6-7-week Reports/Current/ retention silently returned ~10% of the requested range, with all downstream means/aggregates correspondingly wrong. Track per-period success in `_dynamic_data_fetch_loop`, but only count iterations whose data period actually overlaps the user's [start_time, end_time) window. After the loop, if some — but not all — of those in-window periods returned data, emit a single summary WARNING naming the table, the missing count, the coverage percentage, and the three common causes (retention window, not-yet-published month, pre-AEMO data). The in-window check (`_iteration_overlaps_window`) translates each date_gen iteration into its [period_start, period_end) extent — monthly for MMS, daily for current/scraped, 5-min for FCAS — and runs the standard strict half-open overlap test: period_start < user_end AND period_end > user_start This naturally excludes the two kinds of "behind-the-scenes" iterations the fetch loop makes that the user didn't directly ask for: * The 1-day buffer-back month / day before `start_time`. Its period sits flush against `user_start` from the left, so the strict `period_end > user_start` test fails and it's excluded. * The ~120 historical months a search_type='all' table scans to compute "snapshot as of start_time". Those periods are entirely before `user_start` and excluded. So a PARTICIPANT query for [2024-01-01, 2024-06-01) counts 5 in-window months (Jan–May) regardless of the ~115 historical-scan + buffer-back months the loop actually iterated. If all 5 return data → no warning. If one is missing → 4/5, warning fires for a real gap in the user's request. caching_mode skips the check: cache_compiler doesn't return a frame the caller could aggregate over, and the per-period warnings already appear in its log stream. The zero-data case is also skipped — the caller already raises NoDataToReturn upstream, and a duplicate WARNING would just be noise. Three regression tests in test_errors.py: * partial coverage (DISPATCHPRICE 2018-04 -> 2018-12 against a fixture that only has 2018-04 and 2018-05; 2/8 in-window months) emits exactly one Partial-coverage WARNING. * full coverage (single fixtured month) emits NO warning — prevents the signal from being lost to noise on every healthy query. * out-of-window 404s for a search_type='all' table emit NO warning, even when several scan months 404 — proves the in-window check correctly suppresses the historical-scan noise that a flat success-rate ratio would have false-positived on. 435 passed, 1 skipped (was 432). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

nick-gorman and others added 4 commits May 27, 2026 11:36

nick-gorman merged commit 1195345 into master May 27, 2026
32 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Land PR #99 (red-team follow-up) onto master#100

Land PR #99 (red-team follow-up) onto master#100
nick-gorman merged 4 commits into
masterfrom
polish-from-red-team

nick-gorman commented May 27, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

nick-gorman commented May 27, 2026

Summary

Commits

Why this PR exists

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant