Reconstruct valid_time from time and step dimensions#47
tcarion wants to merge 7 commits into JuliaGeo:main
Conversation
---
At the moment, this PR breaks this test in Rasters:

stack properties: Test Failed at /home/tricarion/.julia/dev/Rasters/test/sources/gribdatasets.jl:78
Expression: keys(eagerstack) == keys(lazystack) == (:z, :t)
Evaluated: (:z, :t, :valid_time) == (:z, :t, :valid_time) == (:z, :t)

It's because GRIBDatasets now adds a valid_time variable. Any thoughts on this @Alexander-Barth and @rafaqz ?

Details
julia> RasterStack("/home/tricarion/.julia/dev/GRIBDatasets/test/sample-data/era5-levels-members.grib")
┌ 120×61×2×10×4 RasterStack ┐
├───────────────────────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────── dims ┐
↓ X Mapped{Float64} [0.0, …, 357.0] ForwardOrdered Regular Points,
→ Y Mapped{Float64} [90.0, …, -90.0] ReverseOrdered Regular Points,
↗ Z Sampled{Int64} [500, 850] ForwardOrdered Regular Points,
⬔ number Sampled{Int64} [0, …, 9] ForwardOrdered Regular Points,
◩ Ti Sampled{Dates.DateTime} [Dates.DateTime("2017-01-01T00:00:00"), …, Dates.DateTime("2017-01-02T12:00:00")] ForwardOrdered Irregular Points
├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── layers ┤
:z eltype: Union{Missing, Float64} dims: X, Y, Z, number, Ti size: 120×61×2×10×4
:t eltype: Union{Missing, Float64} dims: X, Y, Z, number, Ti size: 120×61×2×10×4
:valid_time eltype: Union{Missing, Int64} dims: Ti size: 4
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── metadata ┤
Metadata{Rasters.GRIBsource} of Dict{String, Any} with 6 entries:
"edition" => "1"
"source" => "/home/tricarion/.julia/dev/GRIBDatasets/test/sample-data/era5-levels-members.grib"
"centreDescription" => "European Centre for Medium-Range Weather Forecasts"
"centre" => "ecmf"
"subCentre" => "0"
"Conventions" => "CF-1.7"
├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── raster ┤
missingval: missing
extent: Extent(X = (0.0, 357.0), Y = (-90.0, 90.0), Z = (500, 850), number = (0, 9), Ti = (Dates.DateTime("2017-01-01T00:00:00"), Dates.DateTime("2017-01-02T12:00:00")))
crs: EPSG:4326
mappedcrs: EPSG:4326
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
---
I'm kind of confused: why are you adding it in this case when there is no step dimension? Why not just where there is a step dimension? And just for me to understand: when there is a step dimension, does that mean it's a 2d lookup? What is this in terms of the CF standards? I'm updating Rasters to cover the whole CF spec besides interpolation and a few other details, so if it's CF it will be handled correctly whenever that's done, but I don't know what the point of your step dimension is here or what the convention is.
Because it is what python cfgrib does:

time (62)
Datatype: Dates.DateTime (Int64)
Dimensions: time
Attributes:
units = seconds since 1970-01-01T00:00:00
calendar = proleptic_gregorian
long_name = initial time of forecast
standard_name = forecast_reference_time
step (4)
Datatype: Int64 (Int64)
Dimensions: step
Attributes:
units = hours
long_name = time since forecast_reference_time
standard_name = forecast_period
valid_time (62 × 4)
Datatype: Dates.DateTime (Int64)
Dimensions: time × step
Attributes:
units = seconds since 1970-01-01T00:00:00
calendar = proleptic_gregorian
long_name = time
standard_name = time

So when there's no "step" dimension, the "valid_time" is actually the same as the "time" variable. Maybe the variable names should be changed from "time" to "forecast_reference_time" and from "valid_time" to "time" to make this clearer. The above example would become:

forecast_reference_time (62)
Datatype: Dates.DateTime (Int64)
Dimensions: time
Attributes:
units = seconds since 1970-01-01T00:00:00
calendar = proleptic_gregorian
long_name = initial time of forecast
standard_name = forecast_reference_time
forecast_period (4)
Datatype: Int64 (Int64)
Dimensions: step
Attributes:
units = hours
long_name = time since forecast_reference_time
standard_name = forecast_period
time (62 × 4)
Datatype: Dates.DateTime (Int64)
Dimensions: time × step
Attributes:
units = seconds since 1970-01-01T00:00:00
calendar = proleptic_gregorian
long_name = time
standard_name = time
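For concreteness, the 2-D time variable described here is just the outer sum of the forecast reference times and the forecast periods. A minimal sketch with hypothetical values (the variable names are illustrative, not GRIBDatasets API):

```julia
using Dates

# Hypothetical forecast reference times and forecast periods
reference_time = [DateTime(2017, 1, 1, 0), DateTime(2017, 1, 1, 12)]
forecast_period = [Hour(0), Hour(6), Hour(12), Hour(18)]

# 2-D valid time: valid_time[i, j] = reference_time[i] + forecast_period[j]
valid_time = reference_time .+ permutedims(forecast_period)
# 2×4 Matrix{DateTime}
```

Broadcasting a vector against a transposed vector gives the full (time × step) grid in one line; `Dates` defines `DateTime + Period` arithmetic directly.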
Yes, this implies the "valid_time" variable is 2-D. I didn't find any counter-argument in the CF conventions and, for what it's worth, an LLM told me this is the recommended approach in such a case. Can Rasters handle 2-D time lookups?

cfgrib time handling
In [1]: import cfgrib
...: import xarray as xr
...: data=xr.open_dataset('/home/tricarion/.julia/dev/GRIBDatasets/test/sample-data/era5-levels-members.grib',engine='cfgrib')
...: data
Out[1]:
<xarray.Dataset>
Dimensions: (number: 10, time: 4, isobaricInhPa: 2, latitude: 61,
longitude: 120)
Coordinates:
* number (number) int64 0 1 2 3 4 5 6 7 8 9
* time (time) datetime64[ns] 2017-01-01 ... 2017-01-02T12:00:00
step timedelta64[ns] ...
* isobaricInhPa (isobaricInhPa) float64 850.0 500.0
* latitude (latitude) float64 90.0 87.0 84.0 81.0 ... -84.0 -87.0 -90.0
* longitude (longitude) float64 0.0 3.0 6.0 9.0 ... 351.0 354.0 357.0
valid_time (time) datetime64[ns] ...
Data variables:
z (number, time, isobaricInhPa, latitude, longitude) float32 ...
t (number, time, isobaricInhPa, latitude, longitude) float32 ...
Attributes:
GRIB_edition: 1
GRIB_centre: ecmf
GRIB_centreDescription: European Centre for Medium-Range Weather Forecasts
GRIB_subCentre: 0
Conventions: CF-1.7
institution: European Centre for Medium-Range Weather Forecasts
history: 2025-09-02T14:55 GRIB to CDM+CF via cfgrib-0.9.1...
---
It would be good to be clear what all of this is for, independent of what an LLM thinks or another package does. To me, adding a new variable identical to an existing variable seems odd. How does it help anyone when there is no step dimension? Variable names mean very little in CF, so I don't understand the renaming. And on whether Rasters can handle 2d time: the question is what those dimensions mean. Why is the step dimension of length 4 in your example? What would those four values be for any value in the time dimension? In Rasters, users can query for a time series. Your data still only has one dimension for time, but a two-dimensional lookup. What would users query with? Usually this setup means the user provides two coordinate values to find one index along a dimension. An example is a sparse lookup.
---
Ok, so yes, it's exactly that: step is a dimension of msl here.

In terms of time variables, forecast_period is a very new definition; I haven't seen an example of this in the CF spec. If you have one, please post it. But this will just work as a coordinate dimension with the new Rasters CF handling (maybe? Actually I'm not sure how those step times are read, but we can fix it if not). valid_time seems to be a hack to put all that information into the time unit format that CF requires. But I'm not sure how we are meant to use it for anything practical. We really need to be discussing a paragraph in the standard.
---
There is a proposed amendment to CF to allow these step units: cf-convention/cf-conventions#166. Let's just handle them directly and not add the variable? If you have to, please don't add it when there is no step dimension. And please don't rename it to time or use time as the standard name; that's not what it is.
I try to compose with my limited time and knowledge of the CF conventions, trying to comply with what already exists. But at the same time, I don't want to follow it blindly; that's why I'm also asking your opinion :-) Thank you for the references to the CF conventions discussion, I didn't know about it. In the meantime, I also found these specifications in the standard name database:
In the current implementation of GRIBDatasets, the "time" dimension is the number of forecast reference times, and the "step" dimension is the number of forecast periods. The "valid_time" variable is the computed validity time of the forecast, i.e., the reference time + the forecast period. According to the specification above, the "time" standard name should be attributed to this validity time (or at least, this is my understanding; it is not very clear imo).

Even when there's no "step" dimension (or when step is a scalar dimension, which is btw not handled properly in GRIBDatasets), the "valid_time" is relevant because it still gives the time at which the forecast is valid. In the example in the dropdown below, there's a unique step=6 for multiple reference times, thus the "time" variable is 2024-01-01T00:00:00, 2024-01-02T00:00:00..., while the "valid_time" is 2024-01-01T06:00:00, 2024-01-02T06:00:00...

Another case is when there are multiple forecast reference times and multiple forecast periods, for example 2 forecasts with 4 steps as below. In this case, the "valid_time" variable can be 2D with dimensions (time, step). But I couldn't find any example of a 2D variable with standard name "time". It's also possible to flatten the variable and make it 1D with dimension (time x step). But I think a 2D array makes the combination of forecast reference time and forecast period clearer to the user.

I hope I managed to make the justification behind this "valid_time" variable clearer. If you still think it is not relevant in the case of a scalar step dimension, I would propose to just keep one "time" variable that actually contains the validity time values, since that is what the specifications above suggest.

Unique step example
julia> st = GRIBDataset("/home/tricarion/Documents/dev/GRIBDatasets/test_1step_crop.grib")
Dataset: /home/tricarion/Documents/dev/GRIBDatasets/test_1step_crop.grib
Group: /
Dimensions
lon = 3
lat = 3
time = 31
Variables
lon (3)
Datatype: Float64 (Float64)
Dimensions: lon
Attributes:
units = degrees_east
long_name = longitude
standard_name = longitude
lat (3)
Datatype: Float64 (Float64)
Dimensions: lat
Attributes:
units = degrees_north
long_name = latitude
standard_name = latitude
time (31)
Datatype: Dates.DateTime (Int64)
Dimensions: time
Attributes:
units = seconds since 1970-01-01T00:00:00
calendar = proleptic_gregorian
long_name = initial time of forecast
standard_name = forecast_reference_time
msl (3 × 3 × 31)
Datatype: Union{Missing, Float64} (Float64)
Dimensions: lon × lat × time
Attributes:
units = Pa
long_name = Mean sea level pressure
standard_name = air_pressure_at_mean_sea_level
valid_time (31)
Datatype: Dates.DateTime (Int64)
Dimensions: time
Attributes:
units = seconds since 1970-01-01T00:00:00
calendar = proleptic_gregorian
long_name = time
standard_name = time
Global attributes
edition = 2
source = /home/tricarion/Documents/dev/GRIBDatasets/test_1step_crop.grib
centreDescription = European Centre for Medium-Range Weather Forecasts
centre = ecmf
subCentre = 0
Conventions = CF-1.7
julia> st["time"][1], st["valid_time"][1]
(Dates.DateTime("2024-01-01T00:00:00"), Dates.DateTime("2024-01-01T06:00:00"))
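The scalar-step case shown in this dump reduces to shifting every reference time by one constant offset. A sketch with hypothetical values (variable names are illustrative):

```julia
using Dates

# Forecast reference times (standard_name = forecast_reference_time)
time = [DateTime(2024, 1, 1, 0), DateTime(2024, 1, 2, 0)]

# Scalar forecast period, e.g. a unique step of 6 hours
step = Hour(6)

# Validity time = reference time + forecast period
valid_time = time .+ step
# [2024-01-01T06:00:00, 2024-01-02T06:00:00]
```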
---
We really do follow the CF spec blindly around here, but I doubt adding this is in it. Where does it say anywhere to create a valid_time variable that did not exist? That is the key question. I also can't understand how valid_time helps when you don't have step. It adds no new information.
---
No: https://cfconventions.org/cf-conventions/cf-conventions.html This PR is not CF spec. I don't know what it is. I would really like to see some documented justification for it besides some other package having it. (My proposal is to close this PR.)
---
I have an alternate proposal to do instead of this PR: treat step variables as periods. The units are e.g. hours, and we also know they are periods; the standard name is forecast_period. So, in Julia/CommonDataModel.jl, we should read these variables as Periods like Hour/Day etc. Then step is just a regular dimension and would e.g. work in Rasters like any other dimension. Clean, keeps the original structure, no added variables.
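That proposal could look something like the following sketch. `decode_steps` and the units-to-constructor map are hypothetical (not an existing CommonDataModel.jl API); they only illustrate mapping the CF `units` attribute of a step variable onto a `Dates.Period` type:

```julia
using Dates

# Hypothetical map from a CF `units` attribute to a Dates.Period constructor
const PERIOD_UNITS = Dict(
    "seconds" => Second,
    "minutes" => Minute,
    "hours"   => Hour,
    "days"    => Day,
)

# Decode raw step values (as stored in the file) into Periods
decode_steps(raw::AbstractVector{<:Integer}, units::AbstractString) =
    PERIOD_UNITS[units].(raw)

steps = decode_steps([0, 6, 12, 18], "hours")
# 4-element Vector{Hour}
```

Once steps are Periods, `reference_time .+ steps'` directly yields validity times without any extra stored variable.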
I meant that I didn't want to blindly follow other packages or an LLM; I'm totally OK with blindly following the CF conventions ;-) I found a quite helpful link: https://confluence.ecmwf.int/display/COPSRV/Metadata+recommendations+for+encoding+NetCDF+products+based+on+CF+convention#MetadatarecommendationsforencodingNetCDFproductsbasedonCFconvention-3.3Timecoordinatesystems This gives a clear example of what I tried to explain in my previous comment (and how to implement the CF conventions I highlighted in the above screenshot). But I agree that the current variable naming in GRIBDatasets is confusing, and I propose to change it as follows:
This avoids a confusing duplication, and I also agree that reading the step variables as Periods is a good idea.
---
But there's still the remaining question of how to handle cases where neither dimension is scalar. In this case, I propose the following structure:

Dimensions
reftime = 2
fcstperiod = 3
time = 6
Variables
reftime (2)
Datatype: Dates.DateTime (Int64)
Dimensions: reftime
Values:
2-element Vector{DateTime}:
2024-01-01T00:00:00
2024-01-01T12:00:00
fcstperiod (3)
Datatype: Dates.Hour (Int64)
Dimensions: fcstperiod
Values:
3-element Vector{Hour}:
0 hours
6 hours
12 hours
time (6)
Datatype: Dates.DateTime (Int64)
Dimensions: time
Values:
6-element Vector{DateTime}:
2024-01-01T00:00:00
2024-01-01T06:00:00
2024-01-01T12:00:00
2024-01-01T12:00:00
2024-01-01T18:00:00
2024-01-02T00:00:00
Another question is whether the layer variables should be defined on the (reftime, fcstperiod) dimensions or on the flattened time dimension.
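The flattened time values listed above can be reproduced as the column-major `vec` of the outer sum, with the forecast period varying fastest within each reference time (hypothetical variable names matching the listing):

```julia
using Dates

reftime = [DateTime(2024, 1, 1, 0), DateTime(2024, 1, 1, 12)]
fcstperiod = [Hour(0), Hour(6), Hour(12)]

# 3×2 matrix M[i, j] = fcstperiod[i] + reftime[j]; vec flattens column-major,
# so fcstperiod varies fastest within each reference time
flat_time = vec(fcstperiod .+ permutedims(reftime))
# 00:00, 06:00, 12:00, 12:00, 18:00, 00:00 (+1 day)
```

Note that the flattened vector is not strictly increasing (12:00 appears twice), which is exactly the "overlapping lead time" ambiguity discussed below.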
---
I still don't get how this helps anyone more than my proposal. Rasters.jl will likely delete whatever change you make here when it loads GRIB and instead make step a proper Period lookup, as I mentioned. It's much cleaner and more fitting of how Rasters works. Another option is to fork this package to keep the old behavior. (Also, we still have zero actual CF standards documentation about this being a thing to do; that's not in the CF standards that you linked.)
---
Actually, making a keyword to the dataset constructor that keeps the original behavior is the only workable solution I can think of here that will avoid a fork being necessary. Either way a breaking version will be required.
---
There are actually several changes I want to make, which I will summarize here.

To me, this change is hardly debatable. Maybe I misunderstood something, but I'll need clear arguments to show me how.

If we drop change 4, the only noticeable change for Rasters will be the disappearance of the valid_time variable.
---
Maybe Figure 2 of this paper is useful (ignore the training set, and consider "current time" as "valid time"). So for a fixed valid time, there are potentially multiple fields (e.g. one with a forecast reference time of 2000-01-01T00 and a 12 h forecast period, and one with a forecast reference time of 2000-01-01T06 and a 6 h forecast period). In most cases only a single valid time is distributed (with the shortest lead time), but there are cases when you actually have the multiple lead times (for forecast verification, for example).

I was curious how ECMWF distributes their operational forecasts when you request netCDF files, and surprisingly the MARS server gave an error when multiple forecasts with overlapping lead times were requested (so I guess the problem is not that easy :-)).

For the case with multiple reference times and multiple forecast periods resulting in multiple estimates for the same valid time ("overlapping lead times"), I think it makes sense to have two time coordinates. The order of the dimensions can be discussed, but a Julia model would fill the temperature array in storage order, which is good for performance if we choose this order. However, as mentioned before, if dimension(forecast_period) = dimension(reference_time) = 1, it might be best to avoid this complexity for the users and use only 1D time variables.
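The "shortest lead time" selection described above could be sketched like this (hypothetical names; the 2-D valid_time is the outer sum of reference times and periods):

```julia
using Dates

reference_time = [DateTime(2000, 1, 1, 0), DateTime(2000, 1, 1, 6)]
forecast_period = [Hour(6), Hour(12)]

# valid_time[i, j] = reference_time[i] + forecast_period[j]
valid_time = reference_time .+ permutedims(forecast_period)

# Collect all (reference, step) index pairs matching a target valid time,
# then keep the pair with the shortest lead time
target = DateTime(2000, 1, 1, 12)
matches = [(i, j) for i in eachindex(reference_time),
                      j in eachindex(forecast_period)
           if valid_time[i, j] == target]
best = argmin(ij -> forecast_period[ij[2]], matches)
# best == (2, 1): reference 2000-01-01T06 with a 6 h lead time
```

Here two forecasts overlap at 2000-01-01T12 (00Z + 12 h and 06Z + 6 h), and the selection keeps the 6 h one.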
Starting from these standards, the European Centre for Medium-Range Weather Forecasts (ECMWF) shows at this link a practical example of how to encode forecast products to netCDF, following the CF standards. ECMWF is the institution that produces most of the GRIB files around here, as it is the format they use for archiving their data. So following their recommendations to encode these data to netCDF seems relevant to me.
---
Thanks @tcarion for the info. I am wondering if you know of a NetCDF example (or CDL) where there are multiple reference times and multiple forecast periods (with a time overlap, so that you have multiple slices corresponding to a single valid time)? I don't really know the constraints from Rasters.jl, but in many cases you do not want to treat different lead times equally. For example, if you compute a monthly average, you typically want to do so at a fixed lead time.

In case it is helpful: in CDM you can also create a view of a dataset. If you index with a scalar (here 1), then its dimension is dropped (analogous to arrays). For example (with other variables, just to show the mechanism):
---
I don't have a NetCDF file per se, but I looked at what cfgrib was doing:
In [1]: import cfgrib
...: import xarray as xr
...: xr.open_dataset('regular_ll_msl_with_steps.grib',engine='cfgrib')
Out[1]:
<xarray.Dataset>
Dimensions: (time: 62, step: 4, latitude: 3, longitude: 3)
Coordinates:
number int64 ...
* time (time) datetime64[ns] 2024-01-01 ... 2024-01-31T12:00:00
* step (step) timedelta64[ns] 00:00:00 06:00:00 12:00:00 18:00:00
meanSea float64 ...
* latitude (latitude) float64 55.0 54.5 54.0
* longitude (longitude) float64 3.0 3.5 4.0
valid_time (time, step) datetime64[ns] ...
Data variables:
msl (time, step, latitude, longitude) float32 ...

With the time variable:

In [3]: data.time
Out[3]:
<xarray.DataArray 'time' (time: 62)>
array(['2024-01-01T00:00:00.000000000', '2024-01-01T12:00:00.000000000',
...]
Coordinates:
number int64 ...
* time (time) datetime64[ns] 2024-01-01 ... 2024-01-31T12:00:00
meanSea float64 ...
Attributes:
long_name: initial time of forecast
standard_name: forecast_reference_time

and the valid_time variable:

In [5]: data.valid_time
Out[5]:
<xarray.DataArray 'valid_time' (time: 62, step: 4)>
[248 values with dtype=datetime64[ns]]
Coordinates:
number int64 ...
* time (time) datetime64[ns] 2024-01-01 ... 2024-01-31T12:00:00
* step (step) timedelta64[ns] 00:00:00 06:00:00 12:00:00 18:00:00
meanSea float64 ...
valid_time (time, step) datetime64[ns] ...
Attributes:
standard_name: time
long_name: time

The step variable:

In [6]: data.step
Out[6]:
<xarray.DataArray 'step' (step: 4)>
array([ 0, 21600000000000, 43200000000000, 64800000000000],
dtype='timedelta64[ns]')
Coordinates:
number int64 ...
* step (step) timedelta64[ns] 00:00:00 06:00:00 12:00:00 18:00:00
meanSea float64 ...
Attributes:
long_name: time since forecast_reference_time
standard_name: forecast_period
julia> NCDataset("/home/tricarion/Documents/dev/GRIBDatasets/regular_ll_msl_with_steps.nc")
Dataset: /home/tricarion/Documents/dev/GRIBDatasets/regular_ll_msl_with_steps.nc
Group: /
Dimensions
time = 248
lon = 3
lat = 3
Variables
time (248)
Datatype: Dates.DateTime (Float64)
Dimensions: time
Attributes:
standard_name = time
units = hours since 2024-1-1 00:00:00
calendar = proleptic_gregorian
axis = T
lon (3)
Datatype: Float64 (Float64)
Dimensions: lon
Attributes:
standard_name = longitude
long_name = longitude
units = degrees_east
axis = X
lat (3)
Datatype: Float64 (Float64)
Dimensions: lat
Attributes:
standard_name = latitude
long_name = latitude
units = degrees_north
axis = Y
msl (3 × 3 × 248)
Datatype: Float32 (Float32)
Dimensions: lon × lat × time
Attributes:
standard_name = air_pressure_at_mean_sea_level
long_name = Mean sea level pressure
units = Pa
param = 0.3.0
Global attributes
CDI = Climate Data Interface version 2.1.1 (https://mpimet.mpg.de/cdi)
Conventions = CF-1.6
institution = European Centre for Medium-Range Weather Forecasts
history = Tue Sep 16 15:23:21 2025: cdo -f nc copy regular_ll_msl_with_steps.grib regular_ll_msl_with_steps.nc
CDO = Climate Data Operators version 2.1.1 (https://mpimet.mpg.de/cdo)

So here the time dimension is flattened (248 = 62 × 4 values). Personally, I prefer what cfgrib does.
---
I also think that what cfgrib does here is more sensible.
I read the paragraph saying that this is not recommended. For some models, lon and lat can have multiple dimensions too (e.g. a moving grid).




Better handling of GRIB files with "time" and "step" dimensions. The "valid_time" coordinate variable is now constructed from these dimensions. This makes the handling of GRIB files more similar to what python cfgrib and xarray do.

It fixes issue #41
GRIBDataset output
Python output
CC: @LJ-Young