Skip to content

Land PR #99 (red-team follow-up) onto master#100

Merged
nick-gorman merged 4 commits into
masterfrom
polish-from-red-team
May 27, 2026
Merged

Land PR #99 (red-team follow-up) onto master#100
nick-gorman merged 4 commits into
masterfrom
polish-from-red-team

Conversation

@nick-gorman
Copy link
Copy Markdown
Member

Summary

PR #99 was merged with baseRefName: docs-start-end-time instead of master, so its four commits never landed on master. This PR re-targets the same polish-from-red-team branch directly at master.

Commits

Identical to PR #99:

  • f5b947c Install standard finalise pipeline for MARKET_PRICE_THRESHOLDS
  • 02b4ee2 Validate user select_columns includes primary-key columns up-front
  • 46db129 Apply strict-select-columns and filter-cols-not-in-data checks to static_table
  • 896a56f Emit partial-coverage WARNING when fetch loop misses periods in user window

See #99 for the full rationale, the per-commit detail, and the live smoke results.

Why this PR exists

Discovered while cross-checking the draft release notes against master: bullets that depend on PR #99 (MARKET_PRICE_THRESHOLDS dedup, up-front PK validation, static-table select_columns validation, partial-coverage WARNING, filter-cols UserInputError actually raised) didn't match what was on master. Tracing back showed PR #99's base was set to the docs-start-end-time feature branch by mistake.

Test plan

🤖 Generated with Claude Code

nick-gorman and others added 4 commits May 27, 2026 11:36
`processing_info_maps.finalise["MARKET_PRICE_THRESHOLDS"]` was `None`,
unique among the search_type="all" tables. Every other one ships
`[most_recent_records_before_start_time, drop_duplicates_by_primary_key]`
so a "snapshot as of start_time" query returns one row per PK pair.

With `finalise=None`, every monthly archive's copy of every effective
date passed straight through to the user. A multi-month query
returned ~N copies of each row where N is the number of monthly files
scanned — a 5-month live query returned 1269 rows for 16 distinct
(EFFECTIVEDATE, VERSIONNO) pairs, the same byte-identical row
repeating up to 114 times. Joining MARKET_PRICE_THRESHOLDS to
DISPATCHPRICE on EFFECTIVEDATE produces cartesian-explosion bugs;
`df.set_index('EFFECTIVEDATE')` blows up on the duplicate index;
mean/aggregate calculations are heavily biased toward older effective
dates because more monthly files have already shipped by then.

`defaults.effective_date_group_col["MARKET_PRICE_THRESHOLDS"]` is
already `[]`, which sends `most_recent_records_before_start_time` down
the `.tail(1)` branch — exactly one "as-of" row from before
start_time. `defaults.table_primary_keys["MARKET_PRICE_THRESHOLDS"]`
is already `["EFFECTIVEDATE", "VERSIONNO"]`, so the dedup just works.
The fix is to install the standard pipeline.

Strengthens the existing test to assert the PK invariant — verified
the strengthened test FAILS without the fix (every PK pair appears
twice in the 2-month fixture) and PASSES with it. Full suite green:
428 passed, 1 skipped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
For tables whose processing pipeline dedupes on PK, omitting a PK
column from user-typed select_columns previously caused a bare pandas
KeyError to escape from inside the finalise step:

    KeyError: Index(['VERSIONNO'], dtype='str')

The traceback pointed at pandas internals and gave the user no clue
that the actual fix is "add VERSIONNO back to your select_columns" —
or just use the defaults, which always include the PK.

Validate up-front in dynamic_data_compiler, cache_compiler, and
static_table:

  * Dynamic-table dedup is detected by the presence of
    `query_wrappers.drop_duplicates_by_primary_key` in the table's
    finalise pipeline.
  * Static-table dedup is detected by the presence of
    `_finalise_excel_data` in the table's static_data_finaliser_map
    entry (it calls `drop_duplicates(primary_keys)` when the table is
    in `defaults.table_primary_keys`).

In both cases, raise a clear UserInputError naming the missing PK
column(s) and the suggested fix (add to select_columns, or pass
select_columns=None to use defaults).

Two regression tests in test_errors.py — one each for the dynamic
(LOSSFACTORMODEL, PK = [EFFECTIVEDATE, INTERCONNECTORID, REGIONID,
VERSIONNO]) and static (Generators and Scheduled Loads, PK = [DUID])
crash paths.

430 passed, 1 skipped (was 428).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tic_table

`dynamic_data_compiler` had two related guards added in 63edfee:

  * `_check_loaded_select_columns`: any user-typed select_columns
    column that didn't survive the file load is a hard error (catches
    typos like 'duid' for 'DUID' that previously returned a stub
    DataFrame with a WARNING).
  * The post-load filter_cols check at line 162 still constructed
    UserInputError without `raise` — a silent filter no-op that
    returned unfiltered data when the filter column was inherited
    from defaults but happened to be missing from this AEMO file
    vintage.

`static_table` shared both shapes. The first wasn't covered at all —
a typo in select_columns silently produced a stub DataFrame. The
second had the same missing-`raise` bug at the static_table call
site.

Apply both fixes to static_table:
  * Capture `user_select_columns` before defaults reassignment.
  * Call `_check_loaded_select_columns` after column-selection +
    string-strip, before filtering.
  * Add `raise` in front of the previously-bare `UserInputError(...)`
    in the post-load filter_cols check.

Fix the same missing-`raise` at the corresponding spot in
`dynamic_data_compiler` while we're here — same bug pattern. With
`_check_loaded_select_columns` in place, the missing-`raise` path is
mostly unreachable, but defence-in-depth matters for the
defaults-inherited-but-vintage-missing edge case.

Tests:
  * `test_static_select_columns_typo_raises` — bogus column in
    select_columns raises UserInputError naming the missing column.
  * `test_static_filter_col_not_in_loaded_data_raises` — combined
    select-and-filter typo raises (caught by the earlier
    _check_loaded_select_columns).

432 passed, 1 skipped (was 430).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…window

A multi-period query that partly succeeds and partly 404s previously
had no aggregate signal — only N separate per-file warnings that
aggregating users tend not to read. So asking for 16 months of
INTERMITTENT_GEN_SCADA against AEMO's ~6-7-week Reports/Current/
retention silently returned ~10% of the requested range, with all
downstream means/aggregates correspondingly wrong.

Track per-period success in `_dynamic_data_fetch_loop`, but only count
iterations whose data period actually overlaps the user's
[start_time, end_time) window. After the loop, if some — but not all
— of those in-window periods returned data, emit a single summary
WARNING naming the table, the missing count, the coverage
percentage, and the three common causes (retention window,
not-yet-published month, pre-AEMO data).

The in-window check (`_iteration_overlaps_window`) translates each
date_gen iteration into its [period_start, period_end) extent —
monthly for MMS, daily for current/scraped, 5-min for FCAS — and
runs the standard strict half-open overlap test:

    period_start < user_end AND period_end > user_start

This naturally excludes the two kinds of "behind-the-scenes" iterations
the fetch loop makes that the user didn't directly ask for:

  * The 1-day buffer-back month / day before `start_time`. Its period
    sits flush against `user_start` from the left, so the strict
    `period_end > user_start` test fails and it's excluded.
  * The ~120 historical months a search_type='all' table scans to
    compute "snapshot as of start_time". Those periods are entirely
    before `user_start` and excluded.

So a PARTICIPANT query for [2024-01-01, 2024-06-01) counts 5 in-window
months (Jan–May) regardless of the ~115 historical-scan + buffer-back
months the loop actually iterated. If all 5 return data → no warning.
If one is missing → 4/5, warning fires for a real gap in the user's
request.

caching_mode skips the check: cache_compiler doesn't return a frame
the caller could aggregate over, and the per-period warnings already
appear in its log stream. The zero-data case is also skipped — the
caller already raises NoDataToReturn upstream, and a duplicate
WARNING would just be noise.

Three regression tests in test_errors.py:
  * partial coverage (DISPATCHPRICE 2018-04 -> 2018-12 against a
    fixture that only has 2018-04 and 2018-05; 2/8 in-window months)
    emits exactly one Partial-coverage WARNING.
  * full coverage (single fixtured month) emits NO warning — prevents
    the signal from being lost to noise on every healthy query.
  * out-of-window 404s for a search_type='all' table emit NO warning,
    even when several scan months 404 — proves the in-window check
    correctly suppresses the historical-scan noise that a flat
    success-rate ratio would have false-positived on.

435 passed, 1 skipped (was 432).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@nick-gorman nick-gorman merged commit 1195345 into master May 27, 2026
32 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant