Polish from DISPATCHPRICE user-testing: input validation, log clarity, index reset#98
Merged
Conversation
Before this change, _download_data unconditionally logged "Downloading data for table X, year Y, month M" whenever the cache-hit check at the outer layer missed — even when the actual network fetch inside download_to_path short-circuited because the zip was already on disk (e.g. fformat='csv' with keep_csv=False, where the CSV is deleted after every read but the zip persists). Reading the log line, users reasonably conclude every call hammers AEMO when in fact only the local unzip is being repeated. Bubble the True/False "did a network fetch happen" signal that download_to_path already returns up through download_unzip_csv, the four _download_and_unpack_* helpers, and all six run* functions. The run* functions now return True on a real fetch, False on a zip reuse, None on failure (a warning has already been emitted downstream). _download_data reads that signal and picks the verb: "Downloading and extracting data" on a real fetch, "Extracting cached zip" on a reuse, nothing on failure. The first-call verb names both steps explicitly so the second message reads as a proper subset (network skipped) rather than a different operation. No existing test asserts on the prior implicit None return of the run* functions or the absent return of download_unzip_csv, so the contract change is backwards-compatible. tests/test_downloader.py still passes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Follow-up to exploratory user-testing of DISPATCHPRICE workflows. Five
small fixes, all in src/nemosis/data_fetch_methods.py (plus one loose
test trimmed to match the new contract).
Input validation (new UserInputErrors instead of silent bad behaviour):
- select_columns / filter_cols / filter_values: catch typos, length
mismatches, and missing pairs up front via new helpers
_validate_user_select_columns, _validate_filter_args, and a
_check_loaded_select_columns post-load membership check.
Previously select_columns=['SETTLEMENTDATE', 'rrp'] returned a stub
DataFrame, mismatched filter_cols/filter_values lengths silently
dropped filters, and filter_cols with no filter_values raised a bare
TypeError from inside zip().
- start_time / end_time: reject inverted (end < start) and zero-length
(end == start) windows. Previously both silently returned an empty
DataFrame after downloading a whole month for nothing. New helper
_validate_time_window wired into dynamic_data_compiler and
cache_compiler.
Polish:
- reset_index(drop=True) on dynamic_data_compiler and static_table
returns so callers can use .loc[0] / .iloc[0] without hitting the
underlying parquet row indices (e.g. 20162, 20167, ...).
Quieter logs:
- Plumb a user_select_columns flag through _dynamic_data_fetch_loop /
_perform_column_selection / _validate_select_columns. Defaults-only
column rejections (e.g. RAISE1SECRRP / LOWER1SECRRP on pre-1-sec-FCAS
vintages) now log at DEBUG instead of WARNING. User-typed missing
columns still WARNING + raise via _check_loaded_select_columns.
- Demote the redundant "Loading data from .../FILE.parquet failed."
WARNING that fires after a download 404 to DEBUG, with rewording
("No cached data file present at X (upstream download likely
404'd)."). The 404 itself is already announced by the downloader;
the old message read as local cache corruption.
Test:
- tests/.../test_interconnector_constraint.py: drop IMPORTLIMIT and
EXPORTLIMIT from select_columns. The test asserted neither, and these
columns aren't in the cached fixture parquet — the test was relying
on the silent-stub behaviour now being fixed.
All 428 tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
4 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Two commits' worth of polish that came out of an exploratory user-testing pass on DISPATCHPRICE workflows. No API breakage; the changes either convert previously-silent footguns into clear
UserInputErroror quiet WARNINGs that were either redundant or misleading.What's in here
d05a728— Distinguish "Downloading" from "Extracting cached zip" in INFO logsBefore: every call printed "Downloading data for table X" whether or not a network fetch actually happened. Now the verb reflects what the call did (Downloading and extracting vs Extracting cached zip), based on the bool already bubbled up from the download helpers.
63edfee— Tighten input validation and quiet redundant WARNINGsValidation (now raises
UserInputErrorinstead of going silent):select_columns=['SETTLEMENTDATE', 'rrp'](typo) returned a 1-column stub DataFrame with a WARNINGfilter_cols=['A','B']+filter_values=(['x'],)silently filtered by A only (zip-truncate)filter_cols=['A']with nofilter_valuesraised bareTypeError: 'NoneType' object is not iterableend_time < start_timesilently returned empty DataFramestart_time == end_timedownloaded a whole month for(0, N)rowsValidation happens up-front in
dynamic_data_compiler/cache_compiler/static_table, plus a post-load_check_loaded_select_columnsthat catches "user asked for a column that doesn't exist anywhere in their window."Polish:
reset_index(drop=True)on returned DataFrames sodf.loc[0]/df.iloc[0]work. Without this, rows carry the underlying parquet indices (e.g.20162, 20167, …).Quieter logs:
RAISE1SECRRP/LOWER1SECRRPon pre-1-sec-FCAS vintages) now log at DEBUG. They previously fired WARNING on every cache read — noise about NEMOSIS's own internal accounting, since the user never typed those columns. User-typed missing columns still WARNING + raise via the post-load check."Loading data from .../FILE.parquet failed."WARNING that fires after a 404 download is demoted to DEBUG with rewording. The 404 itself is already announced upstream; the old message read as local cache corruption.Test fix:
test_interconnector_constraint_returns_fixtured_rowswas asking forIMPORTLIMIT/EXPORTLIMITinselect_columnsbut not asserting anything about them, and the fixture parquet doesn't contain those columns — the test was relying on the silent-stub bug now being fixed. Trimmed those names.Test plan
uv run pytest tests/— 428 passed, 1 skipped, 1 unrelated pre-existing FutureWarningUserInputErrorpath against a live AEMO queryRAISE1SECRRPnoise reduction on pre-2024 DISPATCHPRICE and confirmed it's silent at INFO level🤖 Generated with Claude Code