MRB-650 maps simplified by jonasbhend · Pull Request #92 · MeteoSwiss/evalml

jonasbhend · 2026-01-07T10:38:09Z

Opt-in score maps for runs and baselines

Adds a pipeline that produces temporally aggregated BIAS/RMSE/MAE maps for both model runs and baselines. The work is computationally heavy, so it's gated behind a new --maps flag on evalml experiment.

What's new

CLI / config

evalml experiment <config> --maps triggers map plotting alongside the standard pipeline.
New optional metric_maps: block in config YAML controls params, leadtimes, metrics, regions, and seasons. Sensible defaults; existing configs work unchanged.
JSON schema regenerated.
The domains which were originally called "centraleurope" and "switzerland" are mostly the same. I suggest making domain "switzerland" much smaller, so that more spatial detail can be seen, especially in the complex topography of the alps.

Workflow

New target metric_maps_all, only built when --maps is set.
New rules verification_metrics_maps (runs, GRIB input) and verification_metrics_maps_baseline (baselines, zarr input).
New plot_summary_stat_maps / plot_summary_stat_maps_baseline produce the seasonal map plots.
Map plots live under results/{experiment}/metric_maps/{runs,baselines}/.
Removed obsolete spatial-data filtering in report_experiment_dashboard.py — no longer needed now that metric maps live in their own files instead of verif_aggregated.nc.

Script

workflow/scripts/verification_metric_maps.py is a single unified script handling both run (GRIB) and baseline (zarr) inputs via mutually exclusive --run_root / --baseline_root.
Streaming aggregation: BIAS/RMSE/MAE accumulators stream over init times; no per-init-time error fields are written to disk.
--reftimes flag restricts processing to the configured hindcast period (essential for baselines, whose zarr is a continuous archive).
Per-season stratification (DJF/MAM/JJA/SON/all).

Testing

End-to-end validated with the example configs.
A 40-init-time real-scale run completes in ~26 min wall-clock.

Deferred to follow-up PRs

GRIB loader per-call overhead (TODO marker added in data_input/__init__.py)
Wind direction / vector visualisation
Reorganisation of pre-existing scalar plots under plots/
Consolidation of load_fct_data_from_grib and load_state_from_grib
Same-grid short-circuit in map_forecast_to_truth
Nice country polygons for map plots
plot-every-pixel approach for the switzerland domain

Authors

Co-authored-by: Louis Frey louis.frey@meteoswiss.ch
Co-authored-by: Francesco Zanetta francesco.zanetta@meteoswiss.ch
Co-authored-by: Jonas Bhend jonas.bhend@meteoswiss.ch

…oard" This reverts commit cdefa16.

summary statistics. (No changes to code yet.)

of summary statistics.

For Bias, RMSE and MAE map plots.

compat.py

earthkit instead.

Francesco. Got a long way towards the png plots. Co-authored-by: Francesco Zanetta <francesco.zanetta@meteoswiss.ch>

variables.

properly working). Output written to .png now working.

detailed inspection of results at smaller spatial scale.

symmetric colour map for bias.

to see if all of it still works.

Louis-Frey · 2026-05-06T09:35:38Z

Fixed all issues remaining in my view and tested with the ICON example configs. After pinning the anemoi-inference version to 0.10.0, they run without error. The PR is ready. Please check again @dnerini @jonasbhend @frazane

Both loaders assumed >=2 lead times when disaggregating TOT_PREC, so the maps rules (which pass a single step) crashed: load_baseline_from_zarr with "fmin which has no identity" on the empty .diff(), and load_fct_data_from_grib with "not all values found in index 'lead_time'" when anemoi-inference omits step 0. Push step-0 augmentation into the loaders: fetch step 0 alongside the requested steps for cumulative-from-start params, synthesize it as 0 if the GRIB lacks it, and drop it from the output. The maps script's _preceding_step shim is no longer needed and is removed. Production callers (regular verification, plot rules) all start at step 0, so their behavior is unchanged.

Aligns the plot rules and script with the rest of the metric-maps naming convention. No logic change; only rule names, the script filename, and the log path are renamed. Output paths are unchanged.

# Conflicts: # src/verification/__init__.py # workflow/scripts/verification_metrics.py

Keep commented line for potential future interactive map plotting (e.g., design stuff like nicer country-border polygons, or plotting every pixel).

The on-the-fly wind-speed computation was a half-baked feature outside the scope of MRB-650 (which is about metric maps). Comprehensive wind-speed support — covering point verification, dashboard, and proper config-driven derivation — belongs in a separate focused PR. Map plots of SP_10M continue to work because the metric-maps pipeline derives wind speed independently in verification_metric_maps.py.

The "hard-code for the moment, can still make smarter later on" line was the same anti-pattern as the recently-removed proposal comments. Replace it (plus the surrounding rationale block) with a tight impersonal description of the colour-scheme choice.

jonasbhend

Hi @Louis-Frey. Wow this has come a long way, since I last had a look. Congrats!

I have a number of points, that I would like to discuss. More top-level ones are summarized below:

Naming
In the evaluation part, we use the terms scores and metrics, where metrics are independent of the ground truth (i.e. a 'statistic' of the forecast alone). This is not set in stone in the community, but I think it would help to keep the naming consistent at least throughout evalml.
Config
Instead of computing the scores for a specific lead time, I think we should be able to compute the scores for all the lead times as specified in the corresponding entries of the runs in the config. This would allow us to store results (.nc files) that are somewhat independent of the time slices selected for visualization. If the performance / time it takes to do compute all steps is really prohibitive, we should discuss it.
Precip-related code
Computing error components for all steps should also allow us to avoid issues with disaggregation (and the need to load adjacent time steps for accumulated quantities). I have the impression that there is quite a bunch of precip related code, that may become redundant.
Output paths
Currently the spatial scores are stored in files such as: output/data/baselines/ICON-CH1-EPS/metric_maps/T_2M_24.nc. I think at least for baselines, we need a different approach, as we run into problems when using the same baseline in different experiments (e.g. verification against analysis, verification against stations or low and high-resolution analysis data).
Adjacent work
There is a bunch of changes that are related to other ongoing work. Most notably probably the harmonization of data input across experiments and showcases. Therefore, I would strongly suggest to coordinate. What is not clear to me is, what functionality from the current data_input module is missing for this PR. If we manage to integrate all the necessary features in the data_input module, the harmonization is likely much easier.

@cosunae is working on a side-by-side visualization of maps for the showcases. Maybe this functionality could be leveraged to show side-by-side animations (plots) of maps of scores?

Addendum:
My heart bleeds when I see the manual iteration through reftime and summing up of components. I was really hoping we could avoid this (I know we can't currently). So kudos to you, @Louis-Frey, for implementing this nonetheless.

jonasbhend · 2026-05-08T12:23:55Z

+# ---------------------------------------------------------------------------
+
+
+def _open_zarr_component(root: Path, param: str) -> xr.DataArray:


Is this functionality that is unique to the maps application or could this be introduced in load_truth_data (in fact in load_analysis_from_zarr)?

Yes, ideally this should be handled in src/data_input. Since @clairemerker is refactoring data loading, I refrained from interfering and instead kept the code as-is, but added a TODO as a reminder to refactor later. See commit 7dfd8be.

jonasbhend · 2026-05-08T13:16:13Z

+    parser.add_argument(
+        "--baseline_root",
+        type=Path,
+        default=None,
+        help="Root directory of a baseline (e.g. /path/to/ICON-CH1-EPS), containing FCST<YY>.zarr files.",
+    )
+    parser.add_argument(
+        "--baseline_zarrs",
+        type=Path,
+        nargs="+",
+        default=None,
+        help="Explicit list of baseline zarr paths (used by Snakemake for dependency tracking).",
+    )


Do we really need both? baseline zarrs are not an output from a downstream rule, so baseline_root should be sufficient.

Good catch — agreed the two were redundant. The script now globs FCST*.zarr under --baseline_root for init-time discovery, and I've dropped --baseline_zarrs entirely. Since the discovered init times get filtered down to the configured reftimes anyway, picking up extra archive years via the glob is harmless.

One nuance: I kept baseline_zarrs as an input: in the rule (just not passed on the CLI). Snakemake dependency tracking applies to all declared inputs, not only rule-produced ones — so listing them still gives existence checks and re-runs the rule if the baseline archive changes. The script simply doesn't need to receive the list to get that. Added a comment in the rule documenting this.

Unfortunately, there are no zarr archives for baselines anymore since #164

Thanks for the hint. I will wait for @clairemerker's refactor/data-io to be merged into main before refactoring.

Louis-Frey · 2026-05-08T14:36:54Z

Hi @jonasbhend, thanks a lot for the feedback and the kudos :-) I will address your points next week, now I have to leave soon. Regarding the

manual iteration through reftime and summing up

I have to give credit to @frazane too, it was mainly his idea.

verification_metric_maps.py now keys error accumulators by (season, init_hour) instead of season only, producing netcdfs with a new init_hour dimension (integer hour, -999 = "all"; matches the convention in verification_aggregation.py). plot_metric_maps.mo.py + plot.smk + Snakefile wire init_hour through as a wildcard so per-init-hour plots can be requested via the new metric_maps.init_hours config field (default ["all"], unchanged behaviour). MetricMapsConfig + config.schema.json + the eight example configs gain the init_hours scaffolding. This lets future per-init-hour analyses re-aggregate from the existing netcdfs without re-reading GRIBs.

Per review feedback: BIAS/RMSE/MAE compare forecast to truth, so by evalml's convention they're scores (not metrics, which are statistics of the forecast alone). Aligns the new pipeline's naming with the existing scores/metrics distinction throughout the codebase. - Config section: metric_maps -> score_maps - Pydantic class: MetricMapsConfig -> ScoreMapsConfig - Snakefile target: metric_maps_all -> score_maps_all - Rules: verification_metrics_maps[_baseline] -> verification_score_maps[_baseline] - Plot rules: plot_metric_maps[_baseline] -> plot_score_maps[_baseline] - Scripts: verification_metric_maps.py, plot_metric_maps.mo.py -> ..._score_maps... - Output / log directories: metric_maps/ -> score_maps/ CLI flag (--maps) and colormap-key suffix (.map) kept as-is — they're already concise and unambiguous. The existing main-side rules (verification_metrics, verification_metrics_aggregation, verification_metrics_plot, etc.) also compute scores against truth but are named "metrics". So main itself doesn't follow the reviewer's stated convention. That's not in MRB-650's scope to fix — but worth noting in case the reviewer wants to address it broadly.

Allows users to opt into computing scores for every leadtime in the union of all configured runs and baselines, without enumerating them. Default remains the explicit list, so a casual `--maps` invocation won't accidentally trigger heavy computation at high temporal resolution. - ScoreMapsConfig.leadtimes accepts `List[int] | Literal["all"]`. - New `resolve_leadtimes(spec)` helper in common.smk expands "all" to the union of `steps` from RUN_CONFIGS and BASELINE_CONFIGS. - Snakefile score_maps_all uses the resolver; no other rules change. - Example configs updated to show both forms (explicit list active, "all" alternative as a comment).

The same baseline (e.g. ICON-CH1-EPS) is now scored against different truths in different experiments (analyses, stations, low/high-res). Inserting the truth label into the path prevents output collisions without forcing recomputation for each experiment that shares the same (baseline, truth) combination. - New baseline output path: data/baselines/{baseline_id}/{truth_label}/score_maps/... - Mirror change in plot_score_maps_baseline's verif_file input. - Runs side unchanged: run_id is already hash-based per experiment. - Schema regenerated (was stale from the prior `leadtimes: "all"` commit).

Consistency with the data path change in the previous commit. Two experiments verifying the same baseline against different truths would otherwise overwrite each other's logs.

Acknowledges duplication of ~80% of load_analysis_data_from_zarr in _open_zarr_component. Not consolidating here to avoid conflicts with the ongoing data_input refactor; the TODO points the right pointer at the eventual consolidation target.

The helpers it used to label (TOT_PREC step disaggregation) moved into load_fct_data_from_grib when that logic was centralised. The header was left behind as dead decoration.

verification_score_maps.py took both --baseline_root (to build per-reftime FCST<YY>.zarr paths) and --baseline_zarrs (an explicit list used only for init-time discovery). The list was redundant: the layout is fixed, so the script can glob FCST*.zarr under baseline_root itself, and discovered init times are filtered down to the configured reftimes anyway. - iter_baseline_init_times now globs baseline_root; drop --baseline_zarrs. - Keep baseline_zarrs as a Snakemake input: (dependency tracking re-runs the rule if the archive changes) but no longer pass it on the CLI.

The fallback derived the output path from --run_root/--baseline_root when --output was omitted. For baselines this would write into the read-only archive root (e.g. /store_new/mch/msopr/ml/ICON-CH1-EPS). Both workflow rules always pass --output explicitly, so the fallback was dead code; make --output required and remove it.

Compute STDE = sqrt(E[e²] - E[e]²) per (season, lead) from the existing error accumulators, clamping the variance term at 0 to avoid spurious NaNs from float round-off. Update docstrings so the metric is discoverable; the plotter is generic and needs no change.

Completes commit e9fbef9, which renamed the "...maps" containers but left the per-map score selector still called "metric". By evalml's convention BIAS/RMSE/MAE/STDE compare forecast to truth, so they are scores. Renames within the score_maps feature only: - ScoreMapsConfig.metrics -> scores (+ regenerated config.schema.json) - score_maps config key metrics: -> scores: in all config/*.yaml - Snakefile expand kwarg + plot.smk {metric} wildcard / --metric flag - plot_score_maps.mo.py: --score flag, variables, helper, title The main-side verification_metrics* pipeline is left as-is, consistent with e9fbef9's scope note. Computation and plotting of metric maps like min, max, mean is deferred to a follow-up PR.

Louis-Frey force-pushed the MRB-650-Maps-simplified branch from 2185fd6 to 9eb4643 Compare January 22, 2026 12:43

jonasbhend and others added 29 commits January 27, 2026 16:28

Simplified way of implementing fields

2b49526

Exclude spatial data from being plotted and included in dashboard

a456b1d

delete intermeidate verification files

0ec286f

Fix typo

28692d6

include score components for maps

f3dcf0d

Revert "Exclude spatial data from being plotted and included in dashb…

99dac52

…oard" This reverts commit cdefa16.

remove source dimension from scores

edcca5b

clean up

0ec8a0f

New rule and plotting file for plotting maps of

51f22a6

summary statistics. (No changes to code yet.)

Obvious changes to the new plotting rule for maps

366249d

of summary statistics.

Some more changes (preliminary, to be continued).

f531395

Further changes to plotting scripts.

32c7ecf

First version of colour maps finished.

c2ab645

For Bias, RMSE and MAE map plots.

Better comments in the colour map code.

27b91e2

Better Comments, some further changes to code.

1b2e670

Added back instances of lead time.

52dd1e7

New function for loading netCDF files added to

ddc5883

compat.py

Marimo app cell for loading data from .nc

35396a7

Remove .nc-loading function again, do it with

e8a92aa

earthkit instead.

All kinds of changes. Co-Development session with

06be6fa

Francesco. Got a long way towards the png plots. Co-authored-by: Francesco Zanetta <francesco.zanetta@meteoswiss.ch>

Generalized to the other non-trivial (non-wind)

61c728e

variables.

Some changes to plotting script.

761f8d9

Plotting region now dynamical (but not yet

1cd4dbf

properly working). Output written to .png now working.

Dynamic Regions now working.

cdb2ccd

Store results under experiment hash.

558689c

Introduced new domain "switzerland_small" for more

986a7ee

detailed inspection of results at smaller spatial scale.

Reverse Red-Blue colour maps for bias.

7f70fb4

Preliminary changes to plotting script for getting

50e899f

symmetric colour map for bias.

Temporarily changed plotting script back to original

a0aefc0

to see if all of it still works.

metric_maps config section for interpolator too.

413a1f6

Louis-Frey marked this pull request as ready for review May 6, 2026 09:17

Louis-Frey added 3 commits May 8, 2026 10:57

Rename plot_summary_stat_maps[_baseline] -> plot_metric_maps[_baseline]

5ed413b

Aligns the plot rules and script with the rest of the metric-maps naming convention. No logic change; only rule names, the script filename, and the log path are renamed. Output paths are unchanged.

Merge remote-tracking branch 'origin/main' into MRB-650-Maps-simplified

445020e

# Conflicts: # src/verification/__init__.py # workflow/scripts/verification_metrics.py

dnerini reviewed May 8, 2026

View reviewed changes

Comment thread .gitignore

Comment thread src/plotting/__init__.py Outdated

Comment thread workflow/rules/plot.smk Outdated

jonasbhend commented May 8, 2026

View reviewed changes

Comment thread workflow/scripts/verification_metrics.py Outdated

Louis-Frey added 8 commits May 8, 2026 15:03

Revert .gitignore changes to match main

4326de5

Revert changes to .gitignore

de72f2a

...

52e60bf

Comment out local map plotting.

9042795

Keep commented line for potential future interactive map plotting (e.g., design stuff like nicer country-border polygons, or plotting every pixel).

Skip unnecessary comment.

858f872

Drop another unnecessary comment.

1244b7c

jonasbhend commented May 8, 2026

View reviewed changes

Louis-Frey added 12 commits May 11, 2026 20:36

Apply ruff-format to metric_maps scripts

90ff476

Scope verification_score_maps_baseline log path by truth_label

7173cc2

Consistency with the data path change in the previous commit. Two experiments verifying the same baseline against different truths would otherwise overwrite each other's logs.

Drop empty "GRIB step helpers" section header

3cda778

The helpers it used to label (TOT_PREC step disaggregation) moved into load_fct_data_from_grib when that logic was centralised. The header was left behind as dead decoration.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MRB-650 maps simplified#92

MRB-650 maps simplified#92
jonasbhend wants to merge 120 commits into
mainfrom
MRB-650-Maps-simplified

jonasbhend commented Jan 7, 2026 •

edited by Louis-Frey

Loading

Uh oh!

Louis-Frey commented May 6, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jonasbhend left a comment

Uh oh!

jonasbhend May 8, 2026

Uh oh!

Louis-Frey Jun 4, 2026 •

edited

Loading

Uh oh!

Uh oh!

jonasbhend May 8, 2026

Uh oh!

Louis-Frey Jun 4, 2026

Uh oh!

jonasbhend Jun 4, 2026

Uh oh!

Louis-Frey Jun 4, 2026

Uh oh!

Uh oh!

Uh oh!

Louis-Frey commented May 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

		# ---------------------------------------------------------------------------


		def _open_zarr_component(root: Path, param: str) -> xr.DataArray:

Conversation

jonasbhend commented Jan 7, 2026 • edited by Louis-Frey Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Opt-in score maps for runs and baselines

What's new

Testing

Deferred to follow-up PRs

Authors

Uh oh!

Louis-Frey commented May 6, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jonasbhend left a comment

Choose a reason for hiding this comment

Uh oh!

jonasbhend May 8, 2026

Choose a reason for hiding this comment

Uh oh!

Louis-Frey Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

jonasbhend May 8, 2026

Choose a reason for hiding this comment

Uh oh!

Louis-Frey Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

jonasbhend Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

Louis-Frey Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Louis-Frey commented May 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

jonasbhend commented Jan 7, 2026 •

edited by Louis-Frey

Loading

Louis-Frey Jun 4, 2026 •

edited

Loading