fformat default → parquet, NEXT_DAY_DISPATCH URL fix, test fixture refresh, README docs #95

Merged: nick-gorman merged 6 commits into master from keep-zip-default-true on May 26, 2026

@nick-gorman
Member

Summary

Bulk PR consolidating six commits that have accumulated on this branch since the keep_zip flip (PR #94) merged. Three loosely-related threads:

Behaviour changes

  • fformat default flipped from feather to parquet (1f034a7). dynamic_data_compiler and cache_compiler now default to parquet; existing feather users opt in explicitly with fformat="feather". Parquet has better compression and broader downstream tooling support (Dask, Arrow, BigQuery loaders).

Bug fix (production)

  • NEXT_DAY_DISPATCHLOAD URL fix (6e65ea5). AEMO renamed the directory /Reports/Current/NEXT_DAY_DISPATCH/ to /Reports/Current/Next_Day_Dispatch/ (title case, in line with the other Reports/Current/ subdirs). The old URL now 404s; anyone fetching NEXT_DAY_DISPATCHLOAD against live AEMO was getting NoDataToReturn. Same shape as issue #74, "Monthly MMS dynamic table fetches fail from 2024-08 onward (PUBLIC_ARCHIVE#...) while older PUBLIC_DVD_* months still work" (%23 encoding): the local mock server's case-insensitive Windows filesystem masked the upstream change. Cache backward-compat: not affected. Only the parent directory was renamed; the file basenames (which NEMOSIS keys its cache on) are unchanged.

Test fixture maintenance

  • Add 2026-04 era (fdb0b60). 12 dynamic tables now cover April 2026 alongside their existing eras. Adds 24 auto-generated boundary cases via _boundaries.py. Fixtures for 2026-04 + the 2026-03 prev-month buffer downloaded and committed.
  • Bump recent era to 2026-05-15 (part of 6e65ea5). Previous 2026-03-15 was approaching the edge of AEMO's rolling current-data retention window. Old March fixtures removed, fresh May 14/15 fixtures committed.

Docs

  • README brought into line with current behaviour: documents the keep_csv default (False) and the keep_zip default (True) with a four-row keep_csv x keep_zip matrix of what ends up on disk, leads with parquet as the default fformat, and corrects the claim that the GUI needs keep_csv=True.

Test plan

  • Full offline suite passes: uv run pytest tests/ → 428 passed, 1 skipped (pre-existing ROOFTOP_PV_ACTUAL skip), 1 warning (pre-existing pandas FutureWarning).
  • Fixture build runs cleanly: uv run python tests/fixtures/build.py → no failures, 24 new MMS fixtures + 6 new scrape fixtures land where expected.
  • Live URL probe confirms NEXT_DAY_DISPATCH/ → 404, Next_Day_Dispatch/ → 200.
  • All other current_data_page_urls entries probed → still 200 (no other renames hiding).
  • CI passes on macOS/Linux (a case-sensitive filesystem such as Linux's would have caught the NEXT_DAY_DISPATCH casing bug earlier; macOS volumes are case-insensitive by default, so it is worth confirming the fixture rename took on both platforms).

Notes for the reviewer

  • The 6 commits cluster into three distinct concerns (behaviour change / production fix / fixture maintenance + docs). They're bundled per the "mega PR to ship today" decision; splitting into 3 separate PRs would be cleaner if the team prefers.
  • keep-zip-default-true is a recycled branch name: its original purpose (the keep_zip flip itself) shipped in PR #94 (Flip keep_zip default to True). This PR is the continuation work that landed on top.

🤖 Generated with Claude Code

nick-gorman and others added 6 commits May 26, 2026 09:48
Adds a new era key "2026-04" to tests/fixtures/spec.py and extends the
12 dynamic tables that already cover 2025-01 to also cover 2026-04 in
their `eras` lists:

  DISPATCHPRICE, DISPATCHLOAD, DISPATCH_UNIT_SCADA, DISPATCHREGIONSUM,
  DISPATCHINTERCONNECTORRES, DISPATCHCONSTRAINT, TRADINGPRICE,
  TRADINGINTERCONNECT, BIDPEROFFER_D, BIDDAYOFFER_D, MNSP_DAYOFFER,
  ROOFTOP_PV_ACTUAL.

Why: general recent-month coverage in the PUBLIC_ARCHIVE# format. Not
a year boundary and not tied to a known AEMO transition — just keeps
the matrix from drifting too far behind the live data we'd see on
nemweb today.

The boundary-test matrix in _boundaries.py reads from spec.DYNAMIC_TABLES
and auto-generates `at` and `before` cases per (table, era), so 24 new
test cases (12 tables x 2 flavours) flow through without any test-file
edits. Hand-written era tests (e.g. test_dispatch_price.py's `era_start`
parametrize) still only reference the older eras and are unchanged.
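
The auto-generation described above can be pictured with a small sketch. The shape of `DYNAMIC_TABLES`, the `FLAVOURS` tuple and the helper name `boundary_cases` are assumptions for illustration only; the real `spec.py` and `_boundaries.py` may be structured differently.

```python
from itertools import product

# Hypothetical stand-in for spec.DYNAMIC_TABLES: table name -> list of era keys.
DYNAMIC_TABLES = {
    "DISPATCHPRICE": ["2025-01", "2026-04"],
    "TRADINGPRICE": ["2025-01", "2026-04"],
}

# The two boundary flavours named in the commit message.
FLAVOURS = ("at", "before")


def boundary_cases(tables):
    """Return one (table, era, flavour) tuple per table, era and flavour."""
    return [
        (table, era, flavour)
        for table, eras in tables.items()
        for era, flavour in product(eras, FLAVOURS)
    ]


cases = boundary_cases(DYNAMIC_TABLES)
new_cases = [case for case in cases if case[1] == "2026-04"]
print(len(new_cases))  # 4 here (2 tables x 2 flavours); 12 tables would give 24
```

Because the cases are derived from the spec, adding an era key to a table's list is enough to grow the matrix; no test file needs to change.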

Fixtures for 2026-04 plus the 2026-03 prev-month buffer were downloaded
via `uv run python tests/fixtures/build.py` and committed alongside the
spec change. Full offline suite: 428 passed, 1 skipped (the pre-existing
ROOFTOP_PV_ACTUAL disjoint-archives skip, unrelated).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Brings the README into line with code already on master:

- `keep_csv` default flipped from True to False in PR #87 — the
  "Caching options" section still claimed True. Updated.
- `keep_zip` was added in PR #87 and its default flipped to True in
  45285d2 — was entirely undocumented in the README. Added a
  paragraph plus a 4-row keep_csv x keep_zip matrix showing what
  ends up on disk under each combination.
- "Using the default settings" paragraph now mentions the
  zip-retention behaviour and points readers at Caching options.
- Cache compiler description now mentions zip retention; "covert"
  typo fixed.
- "Accessing additional table columns" notes that with
  keep_zip=True (the default), a rebuild re-extracts from the
  cached zip rather than re-downloading from AEMO. Also fixed a
  small "If you using" -> "If you are using" grammar slip.

No behavioural change — docs only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`dynamic_data_compiler` and `cache_compiler` now default to
`fformat="parquet"` (previously `"feather"`).

Why: parquet has better compression characteristics and broader
interoperability with downstream tooling (Dask, Arrow, BigQuery
loaders, etc.) than feather, at a fairly small read/write
performance cost on the workloads NEMOSIS produces. Existing
feather users opt in explicitly via `fformat="feather"`.

Changes:

- src/nemosis/data_fetch_methods.py: flipped the default in
  `dynamic_data_compiler`, `cache_compiler`, and the private
  `_dynamic_data_fetch_loop` (the private one is belt-and-braces
  since both public callers always pass `fformat` explicitly).
- README.md: rewrote the "Using the default settings" paragraph,
  the "Caching options" section + matrix, and the "Cache compiler"
  intro to lead with parquet. Feather is now documented as the
  opt-in alternative.
- tests/end_to_end_table_tests/test_cache_compiler.py: five
  default-sensitive `*.feather` globs/pre-populated files updated
  to `*.parquet`. Includes a test rename
  (`test_existing_feather_means_no_csv_is_fetched`
  -> `test_existing_parquet_means_no_csv_is_fetched`) and a few
  docstring touch-ups.
- tests/end_to_end_table_tests/test_datetime_inputs.py: two
  default-sensitive `*.feather` globs updated to `*.parquet`.

`tests/test_errors.py` and `tests/end_to_end_table_tests/test_fformat_csv.py`
pass `fformat` explicitly and are unaffected. Full suite locally:
421 passed; 7 failures are pre-existing fixture gaps for the
2026-04 era (unrelated).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nt era to 2026-05-15

Two threads in one commit (this branch is targeting a single PR today):

1. Production URL fix
-------------------
AEMO renamed the parent directory for NEXT_DAY_DISPATCH files from
`/Reports/Current/NEXT_DAY_DISPATCH/` to `/Reports/Current/Next_Day_Dispatch/`
(title case — bringing it in line with the other Reports/Current/ subdirs
like Daily_Reports/ and Next_Day_Intermittent_Gen_Scada/). The old URL
now 404s. Anyone calling `dynamic_data_compiler(table_name="NEXT_DAY_DISPATCHLOAD", ...)`
against live AEMO was getting NoDataToReturn — the scraper hit the index
URL, got a 404, found no matching anchors, and the date generator ran out
empty.

Fix:
- src/nemosis/defaults.py — flip the path in `current_data_page_urls`.
- tests/fixtures/build.py — same flip in `SCRAPE_FILES` + `TABLE_SCRAPE_FILE`
  (build.py keeps its own hardcoded scrape paths, independent of defaults).
- tests/fixtures/data/Reports/Current/NEXT_DAY_DISPATCH/ — `git mv` to
  the new title-case name so the offline mock server serves the fixture
  tree at the URL NEMOSIS now requests.

Backward compat with pre-existing user caches: not affected. NEMOSIS's
cache filename comes from the ZIP basename (`PUBLIC_NEXT_DAY_DISPATCH_YYYYMMDD_*.zip`),
which AEMO didn't rename — only the parent directory moved. Existing cached
ZIPs keep matching; new downloads land under the same names.
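
A minimal sketch of why a directory rename leaves a basename-keyed cache untouched. The helper `cache_key`, the placeholder host and the numeric suffix in the file name are made up for illustration; this is not NEMOSIS's actual cache code.

```python
from posixpath import basename
from urllib.parse import urlsplit


def cache_key(url: str) -> str:
    """Hypothetical cache key: the ZIP basename, ignoring the parent directory."""
    return basename(urlsplit(url).path)


# Illustrative file name only; the host and trailing digits are placeholders.
FILE_NAME = "PUBLIC_NEXT_DAY_DISPATCH_20260515_0000000000000000.zip"
old_url = "https://example.invalid/Reports/Current/NEXT_DAY_DISPATCH/" + FILE_NAME
new_url = "https://example.invalid/Reports/Current/Next_Day_Dispatch/" + FILE_NAME

# Both URLs reduce to the same key, so a ZIP cached under the old
# directory name still matches a lookup made with the new one.
print(cache_key(old_url) == cache_key(new_url))  # True
```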

Verified live: GET `/Reports/Current/NEXT_DAY_DISPATCH/` returns 404;
GET `/Reports/Current/Next_Day_Dispatch/` returns 200. Probed the
other current-data URLs (Bidmove_Complete/, Daily_Reports/,
Next_Day_Intermittent_Gen_Scada/, Causer_Pays/) — all still 200, so this
is the only directory AEMO has renamed.
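
A probe along these lines could be scripted roughly as follows. This is a sketch using only the standard library; the helper names `probe_status` and `probe_all` are hypothetical, and the base URL is left as a parameter because the commit message does not state it.

```python
import urllib.error
import urllib.request


def probe_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for a GET on url (for example 200 or 404)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx responses; the code is still what we want.
        return err.code


def probe_all(base_url: str, paths):
    """Map each path to the status code it returns under base_url."""
    return {path: probe_status(base_url.rstrip("/") + path) for path in paths}
```

Calling `probe_all(base_url, ["/Reports/Current/NEXT_DAY_DISPATCH/", "/Reports/Current/Next_Day_Dispatch/"])` against the live site would be expected to show the 404 and 200 reported above. Connection failures other than HTTP errors are deliberately left to propagate.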

This is the same shape of bug as #74 (`%23` URL encoding): AEMO changed
something upstream, and the offline test suite couldn't detect it because
the local mock server's behaviour diverged from the real nemweb. In this
case Windows's case-insensitive filesystem masked the case mismatch
between NEMOSIS's request and the on-disk fixture path. A case-sensitive
filesystem (Linux CI; macOS volumes are case-insensitive by default) would
have surfaced this.
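
One way a mock server could avoid depending on the host filesystem's case rules is to compare each URL component against the directory listing itself. The sketch below is an assumption about how such a check might look; `resolve_case_sensitive` is a hypothetical helper, not part of the NEMOSIS test suite.

```python
import os
import tempfile


def resolve_case_sensitive(root: str, url_path: str):
    """Return the on-disk path for url_path, or None unless every component
    matches a directory entry with exactly the same casing."""
    current = root
    for part in (p for p in url_path.split("/") if p):
        # os.listdir reports names with their stored casing on
        # case-preserving filesystems, so an exact string match stays
        # strict even where the filesystem itself ignores case.
        if not os.path.isdir(current) or part not in os.listdir(current):
            return None
        current = os.path.join(current, part)
    return current


with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "Reports", "Current", "Next_Day_Dispatch"))
    print(resolve_case_sensitive(root, "/Reports/Current/Next_Day_Dispatch/") is not None)
    print(resolve_case_sensitive(root, "/Reports/Current/NEXT_DAY_DISPATCH/"))
```

A handler that returns 404 whenever this lookup yields `None` would behave the same on Windows, macOS and Linux, which is the property the offline suite was missing.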

2. Recent era shift to 2026-05-15
---------------------------------
`spec.ERAS["recent"]` was pinned to 2026-03-15, which is approaching the
edge of AEMO's rolling current-data retention. Bumped to 2026-05-15
(~10 days back from today) for a deeper buffer.

Side effects:
- The 6 stale March 14/15 fixtures (3 scrape tables x 2 days) are
  `git rm`'d in favour of fresh May 14/15 fixtures fetched via
  `uv run python tests/fixtures/build.py`.
- index.html files in each scrape dir are regenerated by build.py to
  list only the current fixtures.
- 3 test files (test_daily_region_summary, test_intermittent_gen_scada,
  test_next_day_dispatch_load) had hard-coded `2026/03/15` dates in their
  assertions — bumped to `2026/05/15`. Docstring file-name references
  (`20260314`, `20260315`) bumped too.
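
To make the index regeneration and the scraper's anchor matching concrete, here is a small sketch. The helpers `build_index`, `AnchorCollector` and `zip_links` are hypothetical and only illustrate the idea; the real `build.py` and the NEMOSIS scraper may work differently.

```python
from html import escape
from html.parser import HTMLParser


def build_index(file_names):
    """Render a minimal directory listing with one anchor per fixture file."""
    links = "\n".join(
        '<a href="{0}">{0}</a><br>'.format(escape(name, quote=True))
        for name in sorted(file_names)
    )
    return "<html><body>\n" + links + "\n</body></html>\n"


class AnchorCollector(HTMLParser):
    """Collect href values from <a> tags, roughly as a scraper might."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(value for key, value in attrs if key == "href" and value)


def zip_links(html_text):
    """Return the hrefs in html_text that point at .zip files."""
    parser = AnchorCollector()
    parser.feed(html_text)
    return [href for href in parser.hrefs if href.lower().endswith(".zip")]
```

An index that lists only the current fixtures yields exactly those links, while an empty or error page yields none, which mirrors the "found no matching anchors" failure described for the 404 case.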

Suite: 428 passed, 1 skipped (pre-existing ROOFTOP_PV_ACTUAL disjoint-
archives skip, unrelated), 1 warning (pre-existing pandas FutureWarning).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Caching options matrix row and the keep_csv=True example caption
both implied that the GUI reads the raw AEMO CSV from the cache and
therefore needs keep_csv=True to function. That's wrong:

The GUI calls dynamic_data_compiler via _dynamic_data_wrapper_for_gui
(method_map.py) with parse_data_types=False and no fformat/keep_csv
arguments. It reads the parquet/feather cache, not the raw CSV. The
CSV the GUI writes (gui.py:321) is an output artefact going to
save_location, separate from raw_data_location.

The keep_csv flag only controls whether the *raw AEMO CSV*
(extracted from the zip before parquet/feather conversion) is
retained. Nothing in the NEMOSIS distribution reads that file —
keep_csv=True is purely for external tools that want the original
CSV on disk.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The "for downstream tools that consume the raw CSV" annotation
attached to row 3 (keep_csv=True, keep_zip=False) implied that
this specific combination is the CSV-consumer use case. That's
wrong — the CSV-consumer rationale applies whenever keep_csv=True,
regardless of keep_zip. The example caption below the matrix
already covers the use case in prose, so the matrix just needs to
describe what's on disk.

Matrix is now symmetric: rows 2 and 4 carry the "leanest cache" /
"full raw retention" extreme markers; rows 1 and 3 are described
purely by contents.
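
The matrix can be summarised in a few lines of code. This sketch assumes `fformat` is parquet or feather and that the compiled cache file is always written; `cache_contents` is a hypothetical helper, and the row order is inferred from the commit messages rather than copied from the README.

```python
def cache_contents(keep_csv: bool, keep_zip: bool, fformat: str = "parquet"):
    """Return the kinds of file left on disk for one table and period.

    Assumes the compiled parquet/feather file is always kept; the raw
    CSV and the downloaded zip survive only when their flag is True.
    """
    kept = [fformat]
    if keep_csv:
        kept.append("csv")
    if keep_zip:
        kept.append("zip")
    return kept


# (keep_csv, keep_zip) per row; order inferred, not taken from the README.
ROWS = [(False, True), (False, False), (True, False), (True, True)]

for number, (keep_csv, keep_zip) in enumerate(ROWS, start=1):
    print(number, keep_csv, keep_zip, cache_contents(keep_csv, keep_zip))
```

Under these assumptions row 2 leaves only the compiled file (the leanest cache) and row 4 leaves all three kinds (full raw retention).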

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@nick-gorman nick-gorman merged commit 3ff5104 into master May 26, 2026
16 checks passed