49 changes: 37 additions & 12 deletions README.md
Original file line number Diff line number Diff line change
@@ -110,7 +110,7 @@ raw_data_cache = 'C:/Users/your_data_storage'
price_data = dynamic_data_compiler(start_time, end_time, table, raw_data_cache)
```

Using the default settings of `dynamic_data_compiler` will download CSV data from AEMO's NEMWeb portal and save it to the `raw_data_cache` directory. It will also create a feather file version of each CSV (feather files have a faster read time). Subsequent `dynamic_data_compiler` calls will check if any data in `raw_data_cache` matches the query and loads it. This means that subsequent `dynamic_data_compiler` will be faster so long as the cached data is available.
Using the default settings of `dynamic_data_compiler` will download zip archives from AEMO's NEMWeb portal and save them to the `raw_data_cache` directory. It will then extract each CSV, create a parquet file version (parquet has excellent compression and good interop with downstream tools), and delete the extracted CSV. The downloaded zip is retained by default so that subsequent runs that need to rebuild the parquet (e.g. to add columns or switch format) do not have to re-download from AEMO. Subsequent `dynamic_data_compiler` calls will check if any data in `raw_data_cache` matches the query and load it from there. This means that subsequent `dynamic_data_compiler` calls will be faster so long as the cached data is available. See [Caching options](#caching-options) below to tune what stays on disk.

A number of options are available to configure filtering (i.e. what data NEMOSIS returns as a pandas DataFrame) and caching.

@@ -162,46 +162,71 @@ unit_dispatch_data = dynamic_data_compiler(start_time, end_time, 'DISPATCHLOAD',

###### Caching options

By default the options fformat='feather' and keep_csv=True are used.
By default `dynamic_data_compiler` uses `fformat='parquet'`, `keep_csv=False`, and `keep_zip=True`. That is:

If the option fformat='csv' is used then no feather files will be created, and all caching will be done using CSVs.
- The original AEMO CSVs are extracted, converted to parquet, and then **deleted** (lean cache). Parquet has excellent compression characteristics and good compatibility with packages for handling large on-memory/cluster datasets (e.g. Dask) — this helps with local storage (especially for Causer Pays data) and file size for version control.
- The downloaded AEMO archive **zips are kept** on disk so that rebuilding the cache or switching format later does not require re-downloading from AEMO (see [#56](https://github.com/UNSW-CEEM/NEMOSIS/issues/56)).

You can mix and match these two flags to control what is retained:

| `keep_csv` | `keep_zip` | What stays in `raw_data_cache` |
|---|---|---|
| `False` (default) | `True` (default) | parquet/feather + zip |
| `False` | `False` | parquet/feather only — leanest cache |
| `True` | `False` | parquet/feather + CSV |
| `True` | `True` | parquet/feather + CSV + zip — full raw retention |
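
The table above can be read as a small decision rule. The sketch below is illustrative only: `retained_artifacts` is a hypothetical helper, not part of NEMOSIS, and it assumes the parquet/feather formats the table covers.

```python
def retained_artifacts(keep_csv=False, keep_zip=True, fformat="parquet"):
    """Return the set of file kinds expected to remain in raw_data_cache.

    Hypothetical helper for illustration only; it is not part of NEMOSIS.
    Only the parquet and feather formats covered by the table are handled.
    """
    if fformat not in ("parquet", "feather"):
        raise ValueError("this sketch only covers fformat='parquet' or 'feather'")
    kept = {fformat}      # the compiled cache file is always retained
    if keep_csv:
        kept.add("csv")   # extracted AEMO CSV retained on request
    if keep_zip:
        kept.add("zip")   # downloaded AEMO archive retained by default
    return kept


# Print one line per row of the table.
for keep_csv in (False, True):
    for keep_zip in (True, False):
        print(keep_csv, keep_zip, sorted(retained_artifacts(keep_csv, keep_zip)))
```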

For example, to keep neither the CSV nor the zip after the parquet file is written:

```python
price_data = dynamic_data_compiler(start_time, end_time, table, raw_data_cache,
                                   keep_csv=False, keep_zip=False)
```

To keep the raw AEMO CSVs alongside the parquet files (e.g. for an external tool that consumes the original CSV directly):

```python
price_data = dynamic_data_compiler(start_time, end_time, table, raw_data_cache, keep_csv=True)
```

If the option `fformat='csv'` is used then no parquet files will be created, and all caching will be done using CSVs.

```python
price_data = dynamic_data_compiler(start_time, end_time, table, raw_data_cache, fformat='csv')
```

If you supply fformat='feather', the original AEMO CSVs will still be cached by default. To save disk space but still ensure your data will work with the API & GUI, use `keep_csv=False` in combination with `fformat='feather'` (which is the default option). This will delete the AEMO CSVs after the feather file is created.
If the option `fformat='feather'` is provided then no parquet files will be created, and a feather file will be used instead. Feather may give faster read/write than parquet, at the cost of larger files on disk.

```python
price_data = dynamic_data_compiler(start_time, end_time, table, raw_data_cache, keep_csv=False)
price_data = dynamic_data_compiler(start_time, end_time, table, raw_data_cache, fformat='feather')
```

If the option `fformat='parquet'` is provided then no feather files will be created, and a parquet file will be used instead.
While feather might have faster read/write, parquet has excellent compression characteristics and good compatability with packages for handling large on-memory/cluster datasets (e.g. Dask). This helps with local storage (especially for Causer Pays data) and file size for version control.
`keep_csv` and `keep_zip` are also accepted by `cache_compiler` with the same defaults and meaning.

##### Cache compiler

This may be useful if you're using NEMOSIS to
build a data cache, but then process the cache using other packages or applications. It is particularly useful because `cache_compiler` will infer the data types of the columns before saving to parquet or feather, thereby eliminating the need to type convert data that is obtained using `dynamic_data_compiler`.

`cache_compiler` can be used to compile a cache of parquet or feather files. Parquet will likely be smaller, but feather can be read faster. `cache_compiler` will not run if it detects the appropriate files in the `raw_data_cache` directory. Otherwise, it will download CSVs, covert to the requested format and then delete the CSVs. It does not return any data, unlike `dynamic_data_compiler`.
`cache_compiler` can be used to compile a cache of parquet or feather files (parquet by default; pass `fformat='feather'` to switch). `cache_compiler` will not run if it detects the appropriate files in the `raw_data_cache` directory. Otherwise, it will download zip archives from AEMO, extract the CSVs, convert them to the requested format, and then delete the CSVs. The downloaded zips are kept by default (`keep_zip=True`); pass `keep_zip=False` to delete them as well. `cache_compiler` does not return any data, unlike `dynamic_data_compiler`.

The example below downloads parquet data into the cache.

```python
from nemosis import cache_compiler

cache_compiler(start_time, end_time, table, raw_data_cache, fformat='parquet')
cache_compiler(start_time, end_time, table, raw_data_cache)
```

##### Accessing additional table columns

By default NEMOSIS only includes a subset of an AEMO table's columns, the full set of columns are listed in the
[MMS Data Model Reports](https://visualisations.aemo.com.au/aemo/di-help/Content/Data_Model/MMS_Data_Model.htm),
or can be seen by inspecting the CSVs in the raw data cache. Users of the python interface can add additional
columns as shown below. If you using a feather or parquet based cache the rebuild option should be set to
true so the additional columns are added to the cache files when they are rebuilt. This method of adding additional
columns should also work with the `cache_compiler` function.
columns as shown below. If you are using a feather or parquet based cache the rebuild option should be set to
true so the additional columns are added to the cache files when they are rebuilt. With the default `keep_zip=True`,
the rebuild re-extracts the CSV from the cached zip rather than re-downloading from AEMO. This method of adding
additional columns should also work with the `cache_compiler` function.

```python
from nemosis import dynamic_data_compiler
6 changes: 3 additions & 3 deletions src/nemosis/data_fetch_methods.py
@@ -23,7 +23,7 @@ def dynamic_data_compiler(
select_columns=None,
filter_cols=None,
filter_values=None,
fformat="feather",
fformat="parquet",
keep_csv=False,
keep_zip=True,
parse_data_types=True,
@@ -173,7 +173,7 @@ def cache_compiler(
table_name,
raw_data_location,
select_columns=None,
fformat="feather",
fformat="parquet",
rebuild=False,
keep_csv=False,
keep_zip=True,
@@ -596,7 +596,7 @@ def _dynamic_data_fetch_loop(
raw_data_location,
select_columns,
date_filter,
fformat="feather",
fformat="parquet",
keep_csv=False,
keep_zip=True,
caching_mode=False,
2 changes: 1 addition & 1 deletion src/nemosis/defaults.py
@@ -144,7 +144,7 @@
current_data_page_urls = {
"BIDDING": "Reports/Current/Bidmove_Complete/",
"DAILY_REGION_SUMMARY": "/Reports/Current/Daily_Reports/",
"NEXT_DAY_DISPATCHLOAD": "/Reports/Current/NEXT_DAY_DISPATCH/",
"NEXT_DAY_DISPATCHLOAD": "/Reports/Current/Next_Day_Dispatch/",
"INTERMITTENT_GEN_SCADA": "/Reports/Current/Next_Day_Intermittent_Gen_Scada/"
}

32 changes: 16 additions & 16 deletions tests/end_to_end_table_tests/test_cache_compiler.py
@@ -75,7 +75,7 @@ def test_creates_cache_directory_when_missing(nemosis_fixture):
)

assert target.is_dir()
assert list(target.glob("*.feather")), "cache should be populated"
assert list(target.glob("*.parquet")), "cache should be populated"


def test_raises_when_cache_path_is_a_file(nemosis_fixture):
@@ -93,9 +93,9 @@ def test_raises_when_cache_path_is_a_file(nemosis_fixture):


def test_keep_csv_true_retains_csv(nemosis_fixture):
"""When cache_compiler actually has to fetch (no existing feather, so
"""When cache_compiler actually has to fetch (no existing parquet, so
the code path that downloads + extracts a CSV runs), `keep_csv=True`
must leave the extracted CSV on disk alongside the feather.
must leave the extracted CSV on disk alongside the parquet.
rebuild=True forces the fetch path so this test isn't dependent on
tmp_path being empty at start.

@@ -115,7 +115,7 @@ def test_keep_csv_true_retains_csv(nemosis_fixture):

def test_keep_csv_default_is_false(nemosis_fixture):
"""Default behaviour — keep_csv=False removes the extracted CSV
after the typed feather is written, leaving only the feather in
after the typed parquet is written, leaving only the parquet in
the cache. Users who want raw retention opt in via keep_csv=True."""
cache_compiler(
start_time=START, end_time=END,
@@ -128,18 +128,18 @@ def test_keep_csv_default_is_false(nemosis_fixture):
assert not csv_files, f"default keep_csv=False should remove CSV; found: {csv_files}"


def test_existing_feather_means_no_csv_is_fetched(nemosis_fixture):
"""If the feather is already in the cache, cache_compiler must take
def test_existing_parquet_means_no_csv_is_fetched(nemosis_fixture):
"""If the parquet is already in the cache, cache_compiler must take
the "already compiled" short-circuit and not fetch a CSV — keep_csv
is only about retaining a CSV we actually downloaded, not about
creating one out of thin air. Pre-populate empty feather files at
creating one out of thin air. Pre-populate empty parquet files at
the expected filenames (in caching_mode the existence check skips
the read), call cache_compiler without rebuild, and verify no CSV
appeared."""
# April + May because NEMOSIS uses a 1-day buffer-back, so a query
# starting 2018-05-01 also scans the 2018-04 archive.
for month in ("201804", "201805"):
(nemosis_fixture / f"PUBLIC_DVD_DISPATCHPRICE_{month}010000.feather").touch()
(nemosis_fixture / f"PUBLIC_DVD_DISPATCHPRICE_{month}010000.parquet").touch()

cache_compiler(
start_time=START, end_time=END,
@@ -151,7 +151,7 @@ def test_existing_feather_means_no_csv_is_fetched(nemosis_fixture):
csv_files = list(nemosis_fixture.glob("*DISPATCHPRICE*.[Cc][Ss][Vv]"))
assert not csv_files, (
"keep_csv=True should NOT cause a CSV to be created when the "
"feather already exists — the CSV branch must not run at all"
"parquet already exists — the CSV branch must not run at all"
)


@@ -204,7 +204,7 @@ def test_keep_zip_false_removes_zip(nemosis_fixture):

def test_cached_zip_extracts_without_network(nemosis_fixture, monkeypatch):
"""The #56 benefit in action: with keep_zip=True on a first call,
a subsequent call that needs the same CSV but finds the feather
a subsequent call that needs the same CSV but finds the parquet
missing should re-extract from the cached zip locally without
hitting nemweb. Proves it by breaking the AEMO URL after the first
call — if the second call tries to fetch, it would 404 against the
@@ -217,14 +217,14 @@ def test_cached_zip_extracts_without_network(nemosis_fixture, monkeypatch):
rebuild=True,
keep_zip=True,
)
feather_files = list(nemosis_fixture.glob("*.feather"))
parquet_files = list(nemosis_fixture.glob("*.parquet"))
zip_files = list(nemosis_fixture.glob("*.zip"))
assert feather_files and zip_files
assert parquet_files and zip_files

# Delete the feather so the loop has to call _download_data again,
# Delete the parquet so the loop has to call _download_data again,
# but leave the zip in place so the lower-level download_to_path
# short-circuits on the cached file.
for f in feather_files:
for f in parquet_files:
f.unlink()

# Point the URL at a dead address — any actual network call would fail.
@@ -239,8 +239,8 @@ def test_cached_zip_extracts_without_network(nemosis_fixture, monkeypatch):
keep_zip=True,
)

# Feather rebuilt from the cached zip
assert list(nemosis_fixture.glob("*.feather")), "cache should be rebuilt from cached zip"
# Parquet rebuilt from the cached zip
assert list(nemosis_fixture.glob("*.parquet")), "cache should be rebuilt from cached zip"


@pytest.mark.parametrize("fformat", ["feather", "parquet"])
22 changes: 11 additions & 11 deletions tests/end_to_end_table_tests/test_daily_region_summary.py
@@ -17,8 +17,8 @@

def test_recent_day_returns_rows_for_filtered_regions(nemosis_fixture):
data = dynamic_data_compiler(
start_time="2026/03/15 00:00:00",
end_time="2026/03/15 01:00:00",
start_time="2026/05/15 00:00:00",
end_time="2026/05/15 01:00:00",
table_name="DAILY_REGION_SUMMARY",
raw_data_location=str(nemosis_fixture),
select_columns=["SETTLEMENTDATE", "REGIONID", "TOTALDEMAND"],
@@ -38,31 +38,31 @@ def test_recent_day_returns_rows_for_filtered_regions(nemosis_fixture):

def test_end_market_day_returns_0400_row_from_previous_file(nemosis_fixture):
"""[03:55, 04:00]: filter is start-exclusive end-inclusive, so only
the 04:00 row qualifies — and it lives in the `20260314` daily file,
not `20260315`. Non-empty proves buffer-back fired correctly."""
the 04:00 row qualifies — and it lives in the `20260514` daily file,
not `20260515`. Non-empty proves buffer-back fired correctly."""
data = dynamic_data_compiler(
start_time="2026/03/15 03:55:00",
end_time="2026/03/15 04:00:00",
start_time="2026/05/15 03:55:00",
end_time="2026/05/15 04:00:00",
table_name="DAILY_REGION_SUMMARY",
raw_data_location=str(nemosis_fixture),
select_columns=["SETTLEMENTDATE", "REGIONID", "TOTALDEMAND"],
)
assert set(data["SETTLEMENTDATE"].unique()) == {pd.Timestamp("2026-03-15 04:00:00")}
assert set(data["SETTLEMENTDATE"].unique()) == {pd.Timestamp("2026-05-15 04:00:00")}
assert set(data["REGIONID"]) == {"SA1", "NSW1"}
assert not data.duplicated(["SETTLEMENTDATE", "REGIONID"]).any()


def test_start_market_day_returns_0405_row_from_current_file(nemosis_fixture):
"""[04:00, 04:05]: start-exclusive means 04:00 is dropped; 04:05 is
the first row of the `20260315` daily file. Tests the stitch from
the first row of the `20260515` daily file. Tests the stitch from
the consumer side — the current day's file must be fetched."""
data = dynamic_data_compiler(
start_time="2026/03/15 04:00:00",
end_time="2026/03/15 04:05:00",
start_time="2026/05/15 04:00:00",
end_time="2026/05/15 04:05:00",
table_name="DAILY_REGION_SUMMARY",
raw_data_location=str(nemosis_fixture),
select_columns=["SETTLEMENTDATE", "REGIONID", "TOTALDEMAND"],
)
assert set(data["SETTLEMENTDATE"].unique()) == {pd.Timestamp("2026-03-15 04:05:00")}
assert set(data["SETTLEMENTDATE"].unique()) == {pd.Timestamp("2026-05-15 04:05:00")}
assert set(data["REGIONID"]) == {"SA1", "NSW1"}
assert not data.duplicated(["SETTLEMENTDATE", "REGIONID"]).any()
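
The two boundary tests above rely on a start-exclusive, end-inclusive time filter and on the 04:00 row living in the previous day's file. A minimal standard-library sketch of that behaviour follows. `daily_file_stamp` and `in_window` are hypothetical names, and the rule that a file stamped D covers interval ends in (D 04:00, D+1 04:00] is an assumption generalised from the test docstrings, not taken from the NEMOSIS source.

```python
from datetime import datetime, timedelta

MARKET_DAY_START = timedelta(hours=4)  # assumed: daily files roll over at 04:00


def daily_file_stamp(interval_end):
    """Hypothetical helper: date stamp of the daily file holding this row.

    Assumes a file stamped D covers interval ends in (D 04:00, D+1 04:00].
    """
    # Subtract one microsecond so that 04:00 exactly maps to the previous day.
    shifted = interval_end - MARKET_DAY_START - timedelta(microseconds=1)
    return shifted.strftime("%Y%m%d")


def in_window(interval_end, start, end):
    """Start-exclusive, end-inclusive membership test."""
    return start < interval_end <= end


rows = [
    datetime(2026, 5, 15, 3, 55),
    datetime(2026, 5, 15, 4, 0),
    datetime(2026, 5, 15, 4, 5),
]
start, end = datetime(2026, 5, 15, 3, 55), datetime(2026, 5, 15, 4, 0)
for ts in rows:
    if in_window(ts, start, end):
        # Only the 04:00 row qualifies, and it maps to the 20260514 file.
        print(ts, daily_file_stamp(ts))
```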
6 changes: 3 additions & 3 deletions tests/end_to_end_table_tests/test_datetime_inputs.py
@@ -72,13 +72,13 @@ def test_dynamic_data_compiler_accepts_date(nemosis_fixture):
def test_cache_compiler_accepts_datetime(nemosis_fixture):
"""cache_compiler must accept the same input shapes as
dynamic_data_compiler — just confirms no exception and that a
feather lands in the cache."""
parquet lands in the cache."""
cache_compiler(
start_time=START_DT, end_time=END_DT,
table_name="DISPATCHPRICE",
raw_data_location=str(nemosis_fixture),
)
assert list(nemosis_fixture.glob("*DISPATCHPRICE*.feather")), (
assert list(nemosis_fixture.glob("*DISPATCHPRICE*.parquet")), (
"cache should be populated when datetime inputs are supplied"
)

@@ -90,7 +90,7 @@ def test_cache_compiler_accepts_date(nemosis_fixture):
table_name="DISPATCHPRICE",
raw_data_location=str(nemosis_fixture),
)
assert list(nemosis_fixture.glob("*DISPATCHPRICE*.feather"))
assert list(nemosis_fixture.glob("*DISPATCHPRICE*.parquet"))


# ---------------------------------------------------------------------------
16 changes: 8 additions & 8 deletions tests/end_to_end_table_tests/test_intermittent_gen_scada.py
@@ -13,8 +13,8 @@

def test_recent_day_returns_rows_for_fixtured_duids(nemosis_fixture):
data = dynamic_data_compiler(
start_time="2026/03/15 00:00:00",
end_time="2026/03/15 01:00:00",
start_time="2026/05/15 00:00:00",
end_time="2026/05/15 01:00:00",
table_name="INTERMITTENT_GEN_SCADA",
raw_data_location=str(nemosis_fixture),
select_columns=["RUN_DATETIME", "DUID", "SCADA_VALUE"],
@@ -32,27 +32,27 @@ def test_recent_day_returns_rows_for_fixtured_duids(nemosis_fixture):

def test_end_market_day_returns_0400_row_from_previous_file(nemosis_fixture):
data = dynamic_data_compiler(
start_time="2026/03/15 03:55:00",
end_time="2026/03/15 04:00:00",
start_time="2026/05/15 03:55:00",
end_time="2026/05/15 04:00:00",
table_name="INTERMITTENT_GEN_SCADA",
raw_data_location=str(nemosis_fixture),
select_columns=["RUN_DATETIME", "DUID", "SCADA_TYPE", "SCADA_VALUE"],
)
assert set(data["RUN_DATETIME"].unique()) == {pd.Timestamp("2026-03-15 04:00:00")}
assert set(data["RUN_DATETIME"].unique()) == {pd.Timestamp("2026-05-15 04:00:00")}
assert set(data["DUID"]) == {"HDWF2"}
assert set(data["SCADA_TYPE"]) == {"ELAV", "LOCL"}
assert not data.duplicated(["RUN_DATETIME", "DUID", "SCADA_TYPE"]).any()


def test_start_market_day_returns_0405_row_from_current_file(nemosis_fixture):
data = dynamic_data_compiler(
start_time="2026/03/15 04:00:00",
end_time="2026/03/15 04:05:00",
start_time="2026/05/15 04:00:00",
end_time="2026/05/15 04:05:00",
table_name="INTERMITTENT_GEN_SCADA",
raw_data_location=str(nemosis_fixture),
select_columns=["RUN_DATETIME", "DUID", "SCADA_TYPE", "SCADA_VALUE"],
)
assert set(data["RUN_DATETIME"].unique()) == {pd.Timestamp("2026-03-15 04:05:00")}
assert set(data["RUN_DATETIME"].unique()) == {pd.Timestamp("2026-05-15 04:05:00")}
assert set(data["DUID"]) == {"HDWF2"}
assert set(data["SCADA_TYPE"]) == {"ELAV", "LOCL"}
assert not data.duplicated(["RUN_DATETIME", "DUID", "SCADA_TYPE"]).any()