MRB-665 Support for global models and remove meteodata-lab #90

Open
frazane wants to merge 25 commits into main from feat/support-global-models

@frazane
Contributor

@frazane frazane commented Dec 11, 2025

This PR introduces minimal support for global models in evalml, as we might want to evaluate small experiments on coarse global grids (e.g. o48 or o96).

Summary of changes

  • optionally disable setting the ECCODES_DEFINITION_PATH environment variable when running global models
  • added base anemoi-inference config for global models and an example evalml config
  • switch from meteodatalab to earthkit-data (with its xarray engine) to load GRIB files during verification
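A minimal sketch of the first item, assuming the behaviour is controlled by a boolean flag. The function name and both parameter names are hypothetical and not taken from the evalml code base; only the `ECCODES_DEFINITION_PATH` variable name comes from the description above.

```python
import os


def configure_eccodes_env(definitions_path, set_definition_path, env=None):
    """Set ECCODES_DEFINITION_PATH only when the (hypothetical) flag is on.

    `env` defaults to os.environ; passing a plain dict makes this testable.
    """
    env = os.environ if env is None else env
    if set_definition_path and definitions_path:
        env["ECCODES_DEFINITION_PATH"] = definitions_path
    # When the flag is off, any pre-existing value is left untouched.
    return env
```

For a global model, the flag would be switched off so that eccodes falls back to its default definitions.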

TODO:

  • Deal with the `_earthkit` field that earthkit adds to the xarray object. Newer versions of xarray crash in to_netcdf when the object carries fields that are not allowed. The solution is to always pop it. Pinning xarray is a temporary workaround, but it's not sustainable.
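A possible shape for the "always pop it" fix, as a sketch only. It assumes `_earthkit` is stored in the `attrs` of the dataset and/or its variables, which this thread does not confirm. The helper name `drop_attr` is hypothetical, and it relies only on the object exposing `.attrs` and a `.variables` mapping, as an `xarray.Dataset` does.

```python
def drop_attr(ds, key="_earthkit"):
    """Remove `key` from the dataset attrs and from every variable's attrs.

    Works in place and returns the same object for convenience.
    """
    ds.attrs.pop(key, None)
    for var in ds.variables.values():
        var.attrs.pop(key, None)
    return ds
```

It would be called right before writing, for example `drop_attr(ds).to_netcdf(path)`.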

@frazane frazane changed the title Support for local models Support for global models Dec 11, 2025
@dnerini dnerini changed the title Support for global models MRB-665 Support for global models Mar 12, 2026
@MicheleCattaneo

Should I produce a global forecaster model to review this? Or can an interpolator work?

Comment thread config/hourly-rollout-o96.yaml Outdated
@dnerini
Member

dnerini commented Apr 22, 2026

The experiment workflow works, e.g.:
image

I'm currently debugging the showcase workflow, but I suppose it would already make sense to have another review round here.

@dnerini dnerini marked this pull request as ready for review April 22, 2026 08:47
@dnerini dnerini requested review from MicheleCattaneo and removed request for MicheleCattaneo April 22, 2026 08:47
@dnerini
Member

dnerini commented Apr 23, 2026

@frazane can you also have a look pls? Supporting tp from the global GRIB files has been a bit of a pain and some of these fixes are really unpleasant...

params_sel = list(
{p for p in params} | {_COSMO_TO_IFS[p] for p in params if p in _COSMO_TO_IFS}
)
# Precipitation params don't have a step=0 field (accumulation is zero at
Member

I suppose this was previously handled by meteodata-lab?

Contributor

I guess this didn't occur previously because we haven't read from the global files at all. All our lam cutouts use COSMO names.
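For readers following the thread, the parameter expansion in the snippet above can be sketched as below. The mapping entries are an illustrative subset chosen for this example, not the actual `_COSMO_TO_IFS` table from the PR, and `expand_params` is a hypothetical helper name. Sorting the result is a choice made here for reproducible output; the original builds an unordered list from a set.

```python
# Illustrative subset only; the real _COSMO_TO_IFS mapping is not shown in this thread.
_COSMO_TO_IFS = {"T_2M": "2t", "U_10M": "10u", "V_10M": "10v", "TOT_PREC": "tp"}


def expand_params(params):
    """Return the requested names plus their IFS equivalents, where known."""
    ifs_names = {_COSMO_TO_IFS[p] for p in params if p in _COSMO_TO_IFS}
    return sorted(set(params) | ifs_names)
```

Selecting on both naming conventions lets the same call work on LAM cutouts with COSMO names and on global files with IFS names.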

@dnerini dnerini requested a review from clairemerker April 24, 2026 07:56
@MicheleCattaneo

MicheleCattaneo commented Apr 29, 2026

Is it possible that there is an issue with the verification at specific locations?
202402010000_T_2M_KLO

This is T_2M. I understand that ERA is 0.25 degrees, but there is not even a diurnal cycle here. What do you think? Another issue: the lines don't start from the same point, but this model uses ERA5 as IC (unless I have an issue in my evalml config?)

@dnerini dnerini changed the title MRB-665 Support for global models MRB-665 Support for global models and remove meteodata-lab May 8, 2026
Comment thread src/data_input/__init__.py Outdated
Comment on lines +149 to +157
prec_params = [p for p in params_sel if p in _PREC_PARAMS]
other_params = [p for p in params_sel if p not in _PREC_PARAMS]
fieldlist = ekd.from_source("file", files)
datasets = []
if other_params:
datasets.append(fieldlist.sel(param=other_params, step=steps).to_xarray(profile=profile))
if prec_params:
prec_steps = [s for s in steps if s > 0]
datasets.append(fieldlist.sel(param=prec_params, step=prec_steps).to_xarray(profile=profile))
Copy link
Copy Markdown
Contributor

in data_extract_baseline.py we use the following pattern:

    for lt in lead_times:
        lh = lt % 24
        ld = lt // 24
        filepath = file / "grib" / f"{gribname}{ld:02}{lh:02}0000_{run_id}"
        LOG.info(f"Extracting {filepath}.")
        fields = ekd.from_source("file", filepath)
        for field in fields:
            if field.metadata("shortName") in params:
                out.append(field)

That one looks slow to me, but I haven't profiled.
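As a small illustration of the file-name arithmetic in that pattern (not of its performance), the lead time in hours is split into days and hours and zero-padded. The helper name and the example values for `gribname` and `run_id` are made up for this sketch.

```python
def grib_file_name(gribname, lead_time_hours, run_id):
    """Build '<gribname><DD><HH>0000_<run_id>' from a lead time in hours."""
    days, hours = divmod(lead_time_hours, 24)
    return f"{gribname}{days:02}{hours:02}0000_{run_id}"


# A 30 h lead time maps to day 01, hour 06.
print(grib_file_name("fc", 30, "000"))  # fc01060000_000
```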

Copy link
Copy Markdown
Contributor

The above approach has the advantage that it does not hard-code expectations about which fields exist: if there were a precipitation field at step 0 (containing zeros or NA), it could still be read.
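One way to get the same property without looping over every field is to select on what is actually present rather than on a rule such as `step > 0`. The sketch below works on plain `(param, step)` tuples; how those pairs would be obtained from an earthkit field list is left open, and `select_available` is a hypothetical name.

```python
def select_available(available, params, steps):
    """Return the requested (param, step) pairs that exist in `available`.

    Missing combinations (e.g. precipitation at step 0) are skipped
    silently; present ones are kept, whatever the parameter.
    """
    wanted = {(p, s) for p in params for s in steps}
    return sorted(wanted & set(available))


available = [("2t", 0), ("2t", 6), ("tp", 6)]
print(select_available(available, ["2t", "tp"], [0, 6]))
```

If a file did contain `("tp", 0)`, it would be picked up with no code change.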

MicheleCattaneo and others added 7 commits May 19, 2026 18:35
These changes have not been reconciled with main. Therefore another PR
is necessary.

---------

Co-authored-by: Francesco Zanetta <62377868+frazane@users.noreply.github.com>
Co-authored-by: Claire Merker <34312518+clairemerker@users.noreply.github.com>
When running experiments with many initialization times, individual
initializations sometimes fail to run due to data problems. To deal
with this, a blacklist of initialization times to exclude from
experiments can be provided in the config.

## Summary of Changes
* adapted the config schema and updated the generation of the list of
initialization times
* provide one example config with a blacklisted initialization
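
The blacklist logic described in this commit can be sketched as follows. This is an assumption-laden illustration, not the evalml implementation: the function name, its parameters, and the inclusive end date are all choices made for the example.

```python
from datetime import datetime, timedelta


def init_times(start, end, step_hours, blacklist=()):
    """List initialization times from start to end (inclusive), skipping blacklisted ones."""
    excluded = set(blacklist)
    times = []
    current = start
    while current <= end:
        if current not in excluded:
            times.append(current)
        current += timedelta(hours=step_hours)
    return times


times = init_times(
    datetime(2024, 2, 1),
    datetime(2024, 2, 2),
    step_hours=12,
    blacklist=[datetime(2024, 2, 1, 12)],
)
print(times)  # the 2024-02-01 12:00 initialization is skipped
```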
@frazane
Contributor Author

frazane commented May 19, 2026

In order to use earthkit-data v1.0 release candidates, we need up-to-date eccodes definitions that have not yet been published by DWD, but that we have internally. The CI is failing because of this as it cannot install the definitions from a private repository.

@cosunae are there any updates from DWD?

@MicheleCattaneo

@frazane the issue with the _earthkit field is still there with the updated dependencies, I assume?

@cosunae
Contributor

cosunae commented May 21, 2026

DWD has released eccodes version 2.44 on its open data server: http://opendata.dwd.de/
We have a private copy of 2.47.

dnerini added a commit that referenced this pull request May 21, 2026
This PR adds the option to read ICON-CH1/2-EPS surface GRIB files
directly from the operational archive. It also removes the legacy zarr
reader for baselines and consequently all cosmo-based config files.

### Results

Quick test shows no difference in results between the existing zarr and
the new grib readers
<img width="1000" height="600" alt="image"
src="https://github.com/user-attachments/assets/dfa1e1b2-f6d1-4795-ac00-c181ea32b68d"
/>

Performance-wise, it doesn't seem to make a big difference, which I find
a bit odd, so I'll need to have a closer look.
```
2026-05-13 23:12:00,289 - data_input - INFO - Loading baseline forecasts from ICON GRIB archive...
2026-05-13 23:12:00,292 - data_input - INFO - Reading ICON archive from /store_new/mch/msopr/osm/ICON-CH1-EPS/FCST25/25030100_638
2026-05-13 23:12:39,291 - __main__ - INFO - Loaded forecast data in 39.007409 seconds
```
```
2026-05-13 23:12:00,284 - data_input - INFO - Loading baseline forecasts from zarr dataset...
2026-05-13 23:13:17,642 - __main__ - INFO - Loaded forecast data in 77.357972 seconds
```
<img width="648" height="236" alt="image"
src="https://github.com/user-attachments/assets/509c4cd3-4849-4e92-be8c-9cf939e397c5"
/>


### Open questions
- ~should we deprecate the baselines zarr instead?~ Done in 8619fea
- ~should we switch to earthkit v1?~ Out of scope
- ~should we deprecate the dependency on meteodata-lab?~ Out of scope,
already part of #90
- extend method to read extra variables, including from vertical levels?

### Follow-up PRs
-  extend method to read any arbitrary member
-  extend method to read the pre-computed median
-  extend method to compute ensemble mean

5 participants