UNSW-CEEM · nick-gorman · May 26, 2026 · May 25, 2026 · May 25, 2026 · May 25, 2026
diff --git a/README.md b/README.md
@@ -110,7 +110,7 @@ raw_data_cache = 'C:/Users/your_data_storage'
 price_data = dynamic_data_compiler(start_time, end_time, table, raw_data_cache)
 ```
 
-Using the default settings of `dynamic_data_compiler` will download CSV data from AEMO's NEMWeb portal and save it to the `raw_data_cache` directory. It will also create a feather file version of each CSV (feather files have a faster read time). Subsequent `dynamic_data_compiler` calls will check if any data in `raw_data_cache` matches the query and loads it. This means that subsequent `dynamic_data_compiler` will be faster so long as the cached data is available.
+Using the default settings of `dynamic_data_compiler` will download zip archives from AEMO's NEMWeb portal and save them to the `raw_data_cache` directory. It will then extract each CSV, create a parquet file version (parquet has excellent compression and good interop with downstream tools), and delete the extracted CSV. The downloaded zip is retained by default so that subsequent runs that need to rebuild the parquet (e.g. to add columns or switch format) do not have to re-download from AEMO. Subsequent `dynamic_data_compiler` calls will check if any data in `raw_data_cache` matches the query and load it from there. This means that subsequent `dynamic_data_compiler` calls will be faster so long as the cached data is available. See [Caching options](#caching-options) below to tune what stays on disk.
 
 A number of options are available to configure filtering (i.e. what data NEMOSIS returns as a pandas DataFrame) and caching.
 
@@ -162,46 +162,71 @@ unit_dispatch_data = dynamic_data_compiler(start_time, end_time, 'DISPATCHLOAD',
 
 ###### Caching options
 
-By default the options fformat='feather' and keep_csv=True are used.
+By default `dynamic_data_compiler` uses `fformat='parquet'`, `keep_csv=False`, and `keep_zip=True`. That is:
 
-If the option fformat='csv' is used then no feather files will be created, and all caching will be done using CSVs.
+- The original AEMO CSVs are extracted, converted to parquet, and then **deleted** (lean cache). Parquet has excellent compression characteristics and good compatibility with packages for handling large on-memory/cluster datasets (e.g. Dask) — this helps with local storage (especially for Causer Pays data) and file size for version control.
+- The downloaded AEMO archive **zips are kept** on disk so that rebuilding the cache or switching format later does not require re-downloading from AEMO (see [#56](https://github.com/UNSW-CEEM/NEMOSIS/issues/56)).
+
+You can mix and match these two flags to control what is retained:
+
+| `keep_csv` | `keep_zip` | What stays in `raw_data_cache` |
+|---|---|---|
+| `False` (default) | `True` (default) | parquet/feather + zip |
+| `False` | `False` | parquet/feather only — leanest cache |
+| `True` | `False` | parquet/feather + CSV |
+| `True` | `True` | parquet/feather + CSV + zip — full raw retention |
+
+For example, to keep neither the CSV nor the zip after the parquet file is written:
+
+```python
+price_data = dynamic_data_compiler(start_time, end_time, table, raw_data_cache,
+                                   keep_csv=False, keep_zip=False)
+```
+
+To keep the raw AEMO CSVs alongside the parquet files (e.g. for an external tool that consumes the original CSV directly):
+
+```python
+price_data = dynamic_data_compiler(start_time, end_time, table, raw_data_cache, keep_csv=True)
+```
+
+If the option `fformat='csv'` is used then no parquet files will be created, and all caching will be done using CSVs.
 
 ```python
 price_data = dynamic_data_compiler(start_time, end_time, table, raw_data_cache, fformat='csv')
 ```
 
-If you supply fformat='feather', the original AEMO CSVs will still be cached by default. To save disk space but still ensure your data will work with the API & GUI, use `keep_csv=False` in combination with `fformat='feather'` (which is the default option). This will delete the AEMO CSVs after the feather file is created.
+If the option `fformat='feather'` is provided then no parquet files will be created, and a feather file will be used instead. Feather may give faster read/write than parquet, at the cost of larger files on disk.
 
 ```python
-price_data = dynamic_data_compiler(start_time, end_time, table, raw_data_cache, keep_csv=False)
+price_data = dynamic_data_compiler(start_time, end_time, table, raw_data_cache, fformat='feather')
 ```
 
-If the option `fformat='parquet'` is provided then no feather files will be created, and a parquet file will be used instead.
-While feather might have faster read/write, parquet has excellent compression characteristics and good compatability with packages for handling large on-memory/cluster datasets (e.g. Dask). This helps with local storage (especially for Causer Pays data) and file size for version control.
+`keep_csv` and `keep_zip` are also accepted by `cache_compiler` with the same defaults and meaning.
 
 ##### Cache compiler
 
 This may be useful if you're using NEMOSIS to
 build a data cache, but then process the cache using other packages or applications. It is particularly useful because `cache_compiler` will infer the data types of the columns before saving to parquet or feather, thereby eliminating the need to type convert data that is obtained using `dynamic_data_compiler`.
 
-`cache_compiler` can be used to compile a cache of parquet or feather files. Parquet will likely be smaller, but feather can be read faster. `cache_compiler` will not run if it detects the appropriate files in the `raw_data_cache` directory. Otherwise, it will download CSVs, covert to the requested format and then delete the CSVs. It does not return any data, unlike `dynamic_data_compiler`.
+`cache_compiler` can be used to compile a cache of parquet or feather files (parquet by default; pass `fformat='feather'` to switch). `cache_compiler` will not run if it detects the appropriate files in the `raw_data_cache` directory. Otherwise, it will download zip archives from AEMO, extract the CSVs, convert them to the requested format, and then delete the CSVs. The downloaded zips are kept by default (`keep_zip=True`); pass `keep_zip=False` to delete them as well. `cache_compiler` does not return any data, unlike `dynamic_data_compiler`.
 
 The example below downloads parquet data into the cache.
 
 ```python
 from nemosis import cache_compiler
 
-cache_compiler(start_time, end_time, table, raw_data_cache, fformat='parquet')
+cache_compiler(start_time, end_time, table, raw_data_cache)
 ```
 
 ##### Accessing additional table columns
 
 By default NEMOSIS only includes a subset of an AEMO table's columns, the full set of columns are listed in the 
 [MMS Data Model Reports](https://visualisations.aemo.com.au/aemo/di-help/Content/Data_Model/MMS_Data_Model.htm), 
 or can be seen by inspecting the CSVs in the raw data cache. Users of the python interface can add additional 
-columns as shown below. If you using a feather or parquet based cache the rebuild option should be set to
-true so the additional columns are added to the cache files when they are rebuilt. This method of adding additional
-columns should also work with the `cache_compiler` function.
+columns as shown below. If you are using a feather or parquet based cache the rebuild option should be set to
+true so the additional columns are added to the cache files when they are rebuilt. With the default `keep_zip=True`,
+the rebuild re-extracts the CSV from the cached zip rather than re-downloading from AEMO. This method of adding
+additional columns should also work with the `cache_compiler` function.
 
 ```python
 from nemosis import dynamic_data_compiler

diff --git a/src/nemosis/data_fetch_methods.py b/src/nemosis/data_fetch_methods.py
@@ -23,7 +23,7 @@ def dynamic_data_compiler(
     select_columns=None,
     filter_cols=None,
     filter_values=None,
-    fformat="feather",
+    fformat="parquet",
     keep_csv=False,
     keep_zip=True,
     parse_data_types=True,
@@ -173,7 +173,7 @@ def cache_compiler(
     table_name,
     raw_data_location,
     select_columns=None,
-    fformat="feather",
+    fformat="parquet",
     rebuild=False,
     keep_csv=False,
     keep_zip=True,
@@ -596,7 +596,7 @@ def _dynamic_data_fetch_loop(
     raw_data_location,
     select_columns,
     date_filter,
-    fformat="feather",
+    fformat="parquet",
     keep_csv=False,
     keep_zip=True,
     caching_mode=False,

diff --git a/src/nemosis/defaults.py b/src/nemosis/defaults.py
@@ -144,7 +144,7 @@
 current_data_page_urls = {
     "BIDDING": "Reports/Current/Bidmove_Complete/",
     "DAILY_REGION_SUMMARY": "/Reports/Current/Daily_Reports/",
-    "NEXT_DAY_DISPATCHLOAD": "/Reports/Current/NEXT_DAY_DISPATCH/",
+    "NEXT_DAY_DISPATCHLOAD": "/Reports/Current/Next_Day_Dispatch/",
     "INTERMITTENT_GEN_SCADA": "/Reports/Current/Next_Day_Intermittent_Gen_Scada/"
 }
 

diff --git a/tests/end_to_end_table_tests/test_cache_compiler.py b/tests/end_to_end_table_tests/test_cache_compiler.py
@@ -75,7 +75,7 @@ def test_creates_cache_directory_when_missing(nemosis_fixture):
     )
 
     assert target.is_dir()
-    assert list(target.glob("*.feather")), "cache should be populated"
+    assert list(target.glob("*.parquet")), "cache should be populated"
 
 
 def test_raises_when_cache_path_is_a_file(nemosis_fixture):
@@ -93,9 +93,9 @@ def test_raises_when_cache_path_is_a_file(nemosis_fixture):
 
 
 def test_keep_csv_true_retains_csv(nemosis_fixture):
-    """When cache_compiler actually has to fetch (no existing feather, so
+    """When cache_compiler actually has to fetch (no existing parquet, so
     the code path that downloads + extracts a CSV runs), `keep_csv=True`
-    must leave the extracted CSV on disk alongside the feather.
+    must leave the extracted CSV on disk alongside the parquet.
     rebuild=True forces the fetch path so this test isn't dependent on
     tmp_path being empty at start.
 
@@ -115,7 +115,7 @@ def test_keep_csv_true_retains_csv(nemosis_fixture):
 
 def test_keep_csv_default_is_false(nemosis_fixture):
     """Default behaviour — keep_csv=False removes the extracted CSV
-    after the typed feather is written, leaving only the feather in
+    after the typed parquet is written, leaving only the parquet in
     the cache. Users who want raw retention opt in via keep_csv=True."""
     cache_compiler(
         start_time=START, end_time=END,
@@ -128,18 +128,18 @@ def test_keep_csv_default_is_false(nemosis_fixture):
     assert not csv_files, f"default keep_csv=False should remove CSV; found: {csv_files}"
 
 
-def test_existing_feather_means_no_csv_is_fetched(nemosis_fixture):
-    """If the feather is already in the cache, cache_compiler must take
+def test_existing_parquet_means_no_csv_is_fetched(nemosis_fixture):
+    """If the parquet is already in the cache, cache_compiler must take
     the "already compiled" short-circuit and not fetch a CSV — keep_csv
     is only about retaining a CSV we actually downloaded, not about
-    creating one out of thin air. Pre-populate empty feather files at
+    creating one out of thin air. Pre-populate empty parquet files at
     the expected filenames (in caching_mode the existence check skips
     the read), call cache_compiler without rebuild, and verify no CSV
     appeared."""
     # April + May because NEMOSIS uses a 1-day buffer-back, so a query
     # starting 2018-05-01 also scans the 2018-04 archive.
     for month in ("201804", "201805"):
-        (nemosis_fixture / f"PUBLIC_DVD_DISPATCHPRICE_{month}010000.feather").touch()
+        (nemosis_fixture / f"PUBLIC_DVD_DISPATCHPRICE_{month}010000.parquet").touch()
 
     cache_compiler(
         start_time=START, end_time=END,
@@ -151,7 +151,7 @@ def test_existing_feather_means_no_csv_is_fetched(nemosis_fixture):
     csv_files = list(nemosis_fixture.glob("*DISPATCHPRICE*.[Cc][Ss][Vv]"))
     assert not csv_files, (
         "keep_csv=True should NOT cause a CSV to be created when the "
-        "feather already exists — the CSV branch must not run at all"
+        "parquet already exists — the CSV branch must not run at all"
     )
 
 
@@ -204,7 +204,7 @@ def test_keep_zip_false_removes_zip(nemosis_fixture):
 
 def test_cached_zip_extracts_without_network(nemosis_fixture, monkeypatch):
     """The #56 benefit in action: with keep_zip=True on a first call,
-    a subsequent call that needs the same CSV but finds the feather
+    a subsequent call that needs the same CSV but finds the parquet
     missing should re-extract from the cached zip locally without
     hitting nemweb. Proves it by breaking the AEMO URL after the first
     call — if the second call tries to fetch, it would 404 against the
@@ -217,14 +217,14 @@ def test_cached_zip_extracts_without_network(nemosis_fixture, monkeypatch):
         rebuild=True,
         keep_zip=True,
     )
-    feather_files = list(nemosis_fixture.glob("*.feather"))
+    parquet_files = list(nemosis_fixture.glob("*.parquet"))
     zip_files = list(nemosis_fixture.glob("*.zip"))
-    assert feather_files and zip_files
+    assert parquet_files and zip_files
 
-    # Delete the feather so the loop has to call _download_data again,
+    # Delete the parquet so the loop has to call _download_data again,
     # but leave the zip in place so the lower-level download_to_path
     # short-circuits on the cached file.
-    for f in feather_files:
+    for f in parquet_files:
         f.unlink()
 
     # Point the URL at a dead address — any actual network call would fail.
@@ -239,8 +239,8 @@ def test_cached_zip_extracts_without_network(nemosis_fixture, monkeypatch):
         keep_zip=True,
     )
 
-    # Feather rebuilt from the cached zip
-    assert list(nemosis_fixture.glob("*.feather")), "cache should be rebuilt from cached zip"
+    # Parquet rebuilt from the cached zip
+    assert list(nemosis_fixture.glob("*.parquet")), "cache should be rebuilt from cached zip"
 
 
 @pytest.mark.parametrize("fformat", ["feather", "parquet"])

diff --git a/tests/end_to_end_table_tests/test_daily_region_summary.py b/tests/end_to_end_table_tests/test_daily_region_summary.py
@@ -17,8 +17,8 @@
 
 def test_recent_day_returns_rows_for_filtered_regions(nemosis_fixture):
     data = dynamic_data_compiler(
-        start_time="2026/03/15 00:00:00",
-        end_time="2026/03/15 01:00:00",
+        start_time="2026/05/15 00:00:00",
+        end_time="2026/05/15 01:00:00",
         table_name="DAILY_REGION_SUMMARY",
         raw_data_location=str(nemosis_fixture),
         select_columns=["SETTLEMENTDATE", "REGIONID", "TOTALDEMAND"],
@@ -38,31 +38,31 @@ def test_recent_day_returns_rows_for_filtered_regions(nemosis_fixture):
 
 def test_end_market_day_returns_0400_row_from_previous_file(nemosis_fixture):
     """[03:55, 04:00]: filter is start-exclusive end-inclusive, so only
-    the 04:00 row qualifies — and it lives in the `20260314` daily file,
-    not `20260315`. Non-empty proves buffer-back fired correctly."""
+    the 04:00 row qualifies — and it lives in the `20260514` daily file,
+    not `20260515`. Non-empty proves buffer-back fired correctly."""
     data = dynamic_data_compiler(
-        start_time="2026/03/15 03:55:00",
-        end_time="2026/03/15 04:00:00",
+        start_time="2026/05/15 03:55:00",
+        end_time="2026/05/15 04:00:00",
         table_name="DAILY_REGION_SUMMARY",
         raw_data_location=str(nemosis_fixture),
         select_columns=["SETTLEMENTDATE", "REGIONID", "TOTALDEMAND"],
     )
-    assert set(data["SETTLEMENTDATE"].unique()) == {pd.Timestamp("2026-03-15 04:00:00")}
+    assert set(data["SETTLEMENTDATE"].unique()) == {pd.Timestamp("2026-05-15 04:00:00")}
     assert set(data["REGIONID"]) == {"SA1", "NSW1"}
     assert not data.duplicated(["SETTLEMENTDATE", "REGIONID"]).any()
 
 
 def test_start_market_day_returns_0405_row_from_current_file(nemosis_fixture):
     """[04:00, 04:05]: start-exclusive means 04:00 is dropped; 04:05 is
-    the first row of the `20260315` daily file. Tests the stitch from
+    the first row of the `20260515` daily file. Tests the stitch from
     the consumer side — the current day's file must be fetched."""
     data = dynamic_data_compiler(
-        start_time="2026/03/15 04:00:00",
-        end_time="2026/03/15 04:05:00",
+        start_time="2026/05/15 04:00:00",
+        end_time="2026/05/15 04:05:00",
         table_name="DAILY_REGION_SUMMARY",
         raw_data_location=str(nemosis_fixture),
         select_columns=["SETTLEMENTDATE", "REGIONID", "TOTALDEMAND"],
     )
-    assert set(data["SETTLEMENTDATE"].unique()) == {pd.Timestamp("2026-03-15 04:05:00")}
+    assert set(data["SETTLEMENTDATE"].unique()) == {pd.Timestamp("2026-05-15 04:05:00")}
     assert set(data["REGIONID"]) == {"SA1", "NSW1"}
     assert not data.duplicated(["SETTLEMENTDATE", "REGIONID"]).any()
diff --git a/tests/end_to_end_table_tests/test_datetime_inputs.py b/tests/end_to_end_table_tests/test_datetime_inputs.py
@@ -72,13 +72,13 @@ def test_dynamic_data_compiler_accepts_date(nemosis_fixture):
 def test_cache_compiler_accepts_datetime(nemosis_fixture):
     """cache_compiler must accept the same input shapes as
     dynamic_data_compiler — just confirms no exception and that a
-    feather lands in the cache."""
+    parquet lands in the cache."""
     cache_compiler(
         start_time=START_DT, end_time=END_DT,
         table_name="DISPATCHPRICE",
         raw_data_location=str(nemosis_fixture),
     )
-    assert list(nemosis_fixture.glob("*DISPATCHPRICE*.feather")), (
+    assert list(nemosis_fixture.glob("*DISPATCHPRICE*.parquet")), (
         "cache should be populated when datetime inputs are supplied"
     )
 
@@ -90,7 +90,7 @@ def test_cache_compiler_accepts_date(nemosis_fixture):
         table_name="DISPATCHPRICE",
         raw_data_location=str(nemosis_fixture),
     )
-    assert list(nemosis_fixture.glob("*DISPATCHPRICE*.feather"))
+    assert list(nemosis_fixture.glob("*DISPATCHPRICE*.parquet"))
 
 
 # ---------------------------------------------------------------------------

diff --git a/tests/end_to_end_table_tests/test_intermittent_gen_scada.py b/tests/end_to_end_table_tests/test_intermittent_gen_scada.py
@@ -13,8 +13,8 @@
 
 def test_recent_day_returns_rows_for_fixtured_duids(nemosis_fixture):
     data = dynamic_data_compiler(
-        start_time="2026/03/15 00:00:00",
-        end_time="2026/03/15 01:00:00",
+        start_time="2026/05/15 00:00:00",
+        end_time="2026/05/15 01:00:00",
         table_name="INTERMITTENT_GEN_SCADA",
         raw_data_location=str(nemosis_fixture),
         select_columns=["RUN_DATETIME", "DUID", "SCADA_VALUE"],
@@ -32,27 +32,27 @@ def test_recent_day_returns_rows_for_fixtured_duids(nemosis_fixture):
 
 def test_end_market_day_returns_0400_row_from_previous_file(nemosis_fixture):
     data = dynamic_data_compiler(
-        start_time="2026/03/15 03:55:00",
-        end_time="2026/03/15 04:00:00",
+        start_time="2026/05/15 03:55:00",
+        end_time="2026/05/15 04:00:00",
         table_name="INTERMITTENT_GEN_SCADA",
         raw_data_location=str(nemosis_fixture),
         select_columns=["RUN_DATETIME", "DUID", "SCADA_TYPE", "SCADA_VALUE"],
     )
-    assert set(data["RUN_DATETIME"].unique()) == {pd.Timestamp("2026-03-15 04:00:00")}
+    assert set(data["RUN_DATETIME"].unique()) == {pd.Timestamp("2026-05-15 04:00:00")}
     assert set(data["DUID"]) == {"HDWF2"}
     assert set(data["SCADA_TYPE"]) == {"ELAV", "LOCL"}
     assert not data.duplicated(["RUN_DATETIME", "DUID", "SCADA_TYPE"]).any()
 
 
 def test_start_market_day_returns_0405_row_from_current_file(nemosis_fixture):
     data = dynamic_data_compiler(
-        start_time="2026/03/15 04:00:00",
-        end_time="2026/03/15 04:05:00",
+        start_time="2026/05/15 04:00:00",
+        end_time="2026/05/15 04:05:00",
         table_name="INTERMITTENT_GEN_SCADA",
         raw_data_location=str(nemosis_fixture),
         select_columns=["RUN_DATETIME", "DUID", "SCADA_TYPE", "SCADA_VALUE"],
     )
-    assert set(data["RUN_DATETIME"].unique()) == {pd.Timestamp("2026-03-15 04:05:00")}
+    assert set(data["RUN_DATETIME"].unique()) == {pd.Timestamp("2026-05-15 04:05:00")}
     assert set(data["DUID"]) == {"HDWF2"}
     assert set(data["SCADA_TYPE"]) == {"ELAV", "LOCL"}
     assert not data.duplicated(["RUN_DATETIME", "DUID", "SCADA_TYPE"]).any()