
Reconstruct valid_time from time and step dimensions#47

Open
tcarion wants to merge 7 commits into JuliaGeo:main from tcarion:issue41

Conversation

@tcarion (Member) commented Sep 2, 2025

This PR improves the handling of GRIB files with "time" and "step" dimensions: the "valid_time" coordinate variable is now constructed from these dimensions. This makes the handling of GRIB files more similar to what Python cfgrib and xarray do.

It fixes issue #41.
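As a sketch of what "constructed from these dimensions" means, the reconstruction can be expressed with NumPy-style broadcasting (Python used purely for illustration with hypothetical values; the actual implementation is in Julia):

```python
import numpy as np

# Hypothetical forecast reference times and forecast periods
time = np.array(["2024-01-01T00", "2024-01-01T12"], dtype="datetime64[h]")  # standard_name: forecast_reference_time
step = np.array([0, 6, 12, 18], dtype="timedelta64[h]")                     # standard_name: forecast_period

# valid_time[i, j] = time[i] + step[j], a 2-D coordinate over (time, step)
valid_time = time[:, None] + step[None, :]
print(valid_time.shape)  # (2, 4)
```

Each valid_time[i, j] is the reference time time[i] plus the forecast period step[j].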

GRIBDataset output

julia> ds = GRIBDataset("test/sample-data/regular_ll_msl_with_steps.grib")
Dataset: test/sample-data/regular_ll_msl_with_steps.grib
Group: /

Dimensions
   lon = 3
   lat = 3
   time = 62
   step = 4

Variables
  lon   (3)
    Datatype:    Float64 (Float64)
    Dimensions:  lon
    Attributes:
     units                = degrees_east
     long_name            = longitude
     standard_name        = longitude

  lat   (3)
    Datatype:    Float64 (Float64)
    Dimensions:  lat
    Attributes:
     units                = degrees_north
     long_name            = latitude
     standard_name        = latitude

  time   (62)
    Datatype:    DateTime (Int64)
    Dimensions:  time
    Attributes:
     units                = seconds since 1970-01-01T00:00:00
     calendar             = proleptic_gregorian
     long_name            = initial time of forecast
     standard_name        = forecast_reference_time

  step   (4)
    Datatype:    Int64 (Int64)
    Dimensions:  step
    Attributes:
     units                = hours
     long_name            = time since forecast_reference_time
     standard_name        = forecast_period

  msl   (3 × 3 × 62 × 4)
    Datatype:    Union{Missing, Float64} (Float64)
    Dimensions:  lon × lat × time × step
    Attributes:
     units                = Pa
     long_name            = Mean sea level pressure
     standard_name        = air_pressure_at_mean_sea_level

  valid_time   (62 × 4)
    Datatype:    DateTime (Int64)
    Dimensions:  time × step
    Attributes:
     units                = seconds since 1970-01-01T00:00:00
     calendar             = proleptic_gregorian
     long_name            = time
     standard_name        = time

Global attributes
  edition              = 2
  source               = test/sample-data/regular_ll_msl_with_steps.grib
  centreDescription    = European Centre for Medium-Range Weather Forecasts
  centre               = ecmf
  subCentre            = 0
  Conventions          = CF-1.7

Python output

In [3]: xr.open_dataset('test/sample-data/regular_ll_msl_with_steps.grib',engine='cfgrib')
Out[3]: 
<xarray.Dataset>
Dimensions:     (time: 62, step: 4, latitude: 3, longitude: 3)
Coordinates:
    number      int64 ...
  * time        (time) datetime64[ns] 2024-01-01 ... 2024-01-31T12:00:00
  * step        (step) timedelta64[ns] 00:00:00 06:00:00 12:00:00 18:00:00
    meanSea     float64 ...
  * latitude    (latitude) float64 55.0 54.5 54.0
  * longitude   (longitude) float64 3.0 3.5 4.0
    valid_time  (time, step) datetime64[ns] ...
Data variables:
    msl         (time, step, latitude, longitude) float32 ...
Attributes:
    GRIB_edition:            2
    GRIB_centre:             ecmf
    GRIB_centreDescription:  European Centre for Medium-Range Weather Forecasts
    GRIB_subCentre:          0
    Conventions:             CF-1.7
    institution:             European Centre for Medium-Range Weather Forecasts
    history:                 2025-09-02T11:16 GRIB to CDM+CF via cfgrib-0.9.1...

CC: @LJ-Young

@tcarion (Member, Author) commented Sep 2, 2025

At the moment, this PR breaks this test in Rasters:

stack properties: Test Failed at /home/tricarion/.julia/dev/Rasters/test/sources/gribdatasets.jl:78
  Expression: keys(eagerstack) == keys(lazystack) == (:z, :t)
   Evaluated: (:z, :t, :valid_time) == (:z, :t, :valid_time) == (:z, :t)

It's because GRIBDatasets now adds a valid_time coordinate variable even when there is no step dimension. In such cases, the valid_time variable is a mere duplicate of the time variable. This seems to be the behaviour of Python cfgrib, but it implies carrying a useless additional variable in many cases (see details below). I'm not sure that's relevant.

Any thoughts on this @Alexander-Barth and @rafaqz ?

Details

julia> RasterStack("/home/tricarion/.julia/dev/GRIBDatasets/test/sample-data/era5-levels-members.grib")
┌ 120×61×2×10×4 RasterStack ┐
├───────────────────────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────── dims ┐
   X Mapped{Float64} [0.0, …, 357.0] ForwardOrdered Regular Points,
   Y Mapped{Float64} [90.0, …, -90.0] ReverseOrdered Regular Points,
  ↗ Z Sampled{Int64} [500, 850] ForwardOrdered Regular Points,
  ⬔ number Sampled{Int64} [0, …, 9] ForwardOrdered Regular Points,
  ◩ Ti Sampled{Dates.DateTime} [Dates.DateTime("2017-01-01T00:00:00"), …, Dates.DateTime("2017-01-02T12:00:00")] ForwardOrdered Irregular Points
├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── layers ┤
  :z          eltype: Union{Missing, Float64} dims: X, Y, Z, number, Ti size: 120×61×2×10×4
  :t          eltype: Union{Missing, Float64} dims: X, Y, Z, number, Ti size: 120×61×2×10×4
  :valid_time eltype: Union{Missing, Int64} dims: Ti size: 4
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── metadata ┤
  Metadata{Rasters.GRIBsource} of Dict{String, Any} with 6 entries:
  "edition"           => "1"
  "source"            => "/home/tricarion/.julia/dev/GRIBDatasets/test/sample-data/era5-levels-members.grib"
  "centreDescription" => "European Centre for Medium-Range Weather Forecasts"
  "centre"            => "ecmf"
  "subCentre"         => "0"
  "Conventions"       => "CF-1.7"
├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── raster ┤
  missingval: missing
  extent: Extent(X = (0.0, 357.0), Y = (-90.0, 90.0), Z = (500, 850), number = (0, 9), Ti = (Dates.DateTime("2017-01-01T00:00:00"), Dates.DateTime("2017-01-02T12:00:00")))
  crs: EPSG:4326
  mappedcrs: EPSG:4326
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

@rafaqz (Member) commented Sep 2, 2025

I'm kind of confused: why are you adding it in this case, when there is no step dimension? Why not just where there is a step dimension?

And just for me to understand, when there is a step dimension does that mean it's a 2d lookup? What is this in terms of the CF standards?

I'm updating Rasters to cover the whole CF spec besides interpolation and a few other details, so if it's CF it will be handled correctly whenever that's done, but I don't know what the point of your step dimension is here or what the convention is.

@tcarion (Member, Author) commented Sep 2, 2025

I'm kind of confused, why are you adding it in this case when there is no step dimension, why not just where there is a step dimension?

Because it is what Python cfgrib does (see details below). As far as I understand, the "time" variable in GRIB files is actually the initial time of the forecast, the "step" variable is the number of hours after this reference time, and "valid_time" is the actual validity time of the forecast. In CF standard names, they would be respectively "forecast_reference_time", "forecast_period" and "time". These standard names are clarified in the variable attributes set by GRIBDatasets:

  time   (62)
    Datatype:    Dates.DateTime (Int64)
    Dimensions:  time
    Attributes:
     units                = seconds since 1970-01-01T00:00:00
     calendar             = proleptic_gregorian
     long_name            = initial time of forecast
     standard_name        = forecast_reference_time

  step   (4)
    Datatype:    Int64 (Int64)
    Dimensions:  step
    Attributes:
     units                = hours
     long_name            = time since forecast_reference_time
     standard_name        = forecast_period

  valid_time   (62 × 4)
    Datatype:    Dates.DateTime (Int64)
    Dimensions:  time × step
    Attributes:
     units                = seconds since 1970-01-01T00:00:00
     calendar             = proleptic_gregorian
     long_name            = time
     standard_name        = time

So when there's no "step" dimension, "valid_time" is actually the same as the "time" variable. Maybe the variables should be renamed, from "time" to "forecast_reference_time" and from "valid_time" to "time", to make this clearer. The above example would become:

  forecast_reference_time   (62)
    Datatype:    Dates.DateTime (Int64)
    Dimensions:  time
    Attributes:
     units                = seconds since 1970-01-01T00:00:00
     calendar             = proleptic_gregorian
     long_name            = initial time of forecast
     standard_name        = forecast_reference_time

  forecast_period   (4)
    Datatype:    Int64 (Int64)
    Dimensions:  step
    Attributes:
     units                = hours
     long_name            = time since forecast_reference_time
     standard_name        = forecast_period

  time   (62 × 4)
    Datatype:    Dates.DateTime (Int64)
    Dimensions:  time × step
    Attributes:
     units                = seconds since 1970-01-01T00:00:00
     calendar             = proleptic_gregorian
     long_name            = time
     standard_name        = time

And just for me to understand, when there is a step dimension does that mean it's a 2d lookup? What is this in terms of the CF standards?

Yes, this implies the "valid_time" variable is 2-D. I didn't find any counter-argument in the CF conventions, and, for what it's worth, an LLM told me this is the recommended approach in such a case. Can Rasters handle 2-D time lookups?

cfgrib time handling

In [1]: import cfgrib
   ...: import xarray as xr
   ...: data=xr.open_dataset('/home/tricarion/.julia/dev/GRIBDatasets/test/sample-data/era5-levels-members.grib',engine='cfgrib')
   ...: data
Out[1]: 
<xarray.Dataset>
Dimensions:        (number: 10, time: 4, isobaricInhPa: 2, latitude: 61,
                    longitude: 120)
Coordinates:
  * number         (number) int64 0 1 2 3 4 5 6 7 8 9
  * time           (time) datetime64[ns] 2017-01-01 ... 2017-01-02T12:00:00
    step           timedelta64[ns] ...
  * isobaricInhPa  (isobaricInhPa) float64 850.0 500.0
  * latitude       (latitude) float64 90.0 87.0 84.0 81.0 ... -84.0 -87.0 -90.0
  * longitude      (longitude) float64 0.0 3.0 6.0 9.0 ... 351.0 354.0 357.0
    valid_time     (time) datetime64[ns] ...
Data variables:
    z              (number, time, isobaricInhPa, latitude, longitude) float32 ...
    t              (number, time, isobaricInhPa, latitude, longitude) float32 ...
Attributes:
    GRIB_edition:            1
    GRIB_centre:             ecmf
    GRIB_centreDescription:  European Centre for Medium-Range Weather Forecasts
    GRIB_subCentre:          0
    Conventions:             CF-1.7
    institution:             European Centre for Medium-Range Weather Forecasts
    history:                 2025-09-02T14:55 GRIB to CDM+CF via cfgrib-0.9.1...

@rafaqz (Member) commented Sep 2, 2025

It would be good to be clear what all of this is for independent of what an LLM thinks or another package does.

To me, adding a new variable identical to an existing variable seems odd. How does it help anyone when there is no step dimension? Variable names mean very little in CF so I don't understand the renaming.

And on whether Rasters can handle 2-D time: the question is what those dimensions mean. Why is the step dimension of length 4 in your example? What would those four values be for any value in the time dimension?

In Rasters, users can query for a time series with At or Near selectors and a DateTime, to find data at that time, which is a vector of DateTime or similar.

Your data still only has one dimension for time, but a two dimensional lookup. What would users query with? Do users supply the step?

Usually this setup means the user provides two coordinate values to find one index along a dimension. An example is sparse X/Y coordinates compressed to points on a non-spatial index dimension. So X and Y values are coordinates for the index. But this doesn't seem like the same thing? Are valid_time and step coordinates for time?

@rafaqz (Member) commented Sep 2, 2025

Ok so yes it's exactly that, step is a dimension in msl here.

msl (time, step, latitude, longitude)

In terms of time variables, forecast_period is a very new definition; I haven't seen an example of this in the CF spec. If you have one, please post it.

But this will just work as a coordinate dimension with the new Rasters CF handling (maybe? I'm actually not sure how those step times are read, but we can fix it if not).

valid_time seems to be a hack to put all that information into the time unit format that CF requires. But I'm not sure how we are meant to use it for anything practical.

We really need to be discussing a paragraph in the standard

@rafaqz (Member) commented Sep 2, 2025

valid_time appears to be a hack to combine time and period into one array with correct CF units. There is no way for Rasters to sensibly handle it because that's not actually time. Handling step separately as a Period is much more true to the intention.

There is a proposed amendment to CF to allow these step units.

cf-convention/cf-conventions#166
Which is merged:
cf-convention/cf-conventions#538

Let's just handle them directly and not add the variable?

If you have to, please don't add it when there is no step dimension.

And please don't rename it to time or use time as the standard name, that's not what it is.

@tcarion (Member, Author) commented Sep 3, 2025

It would be good to be clear what all of this is for independent of what an LLM thinks or another package does.

I try to do my best with my limited time and knowledge of the CF conventions, trying to comply with what already exists. But at the same time, I don't want to follow it blindly; that's why I'm also asking your opinion :-)

Thank you for the references to the CF conventions discussion, I didn't know about it. In the meantime, I also found these specifications in the standard name database:

[screenshot: CF standard name table entries for forecast_period and forecast_reference_time]

In the current implementation of GRIBDatasets, the "time" dimension is the number of forecast reference times, and the "step" dimension is the number of forecast periods. The "valid_time" variable is the computed validity time of the forecast, i.e. the reference time + the forecast period. According to the specification above, the "time" standard name should be attributed to this validity time (or at least that's my understanding; it is not very clear imo).

Even when there's no "step" dimension (or when step is a scalar dimension, which is btw not handled properly in GRIBDatasets), "valid_time" is relevant because it still gives the time at which the forecast is valid. In the example in the dropdown below, there's a unique step=6 for multiple reference times, so the "time" variable is 2024-01-01T00:00:00, 2024-01-02T00:00:00..., while "valid_time" is 2024-01-01T06:00:00, 2024-01-02T06:00:00...

Another case is when there are multiple forecast reference times and multiple forecast periods, for example 2 forecasts with 4 steps as below:
[screenshot: example with 2 forecast reference times and 4 steps]

In this case, the "valid_time" variable can be 2-D with dimensions (time, step). But I couldn't find any example of a 2-D variable with standard name "time". It's also possible to flatten the variable and make it 1-D with dimension (time × step). But I think a 2-D array makes the combination of forecast reference time and forecast period clearer to the user.

I hope I managed to make the justification behind this "valid_time" variable clearer. If you still think it is not relevant in the case of a scalar step dimension, I would propose to just keep one "time" variable that actually contains the validity time values, since that is what the specifications above suggest.

Unique step example

julia> st = GRIBDataset("/home/tricarion/Documents/dev/GRIBDatasets/test_1step_crop.grib")
Dataset: /home/tricarion/Documents/dev/GRIBDatasets/test_1step_crop.grib
Group: /

Dimensions
   lon = 3
   lat = 3
   time = 31

Variables
  lon   (3)
    Datatype:    Float64 (Float64)
    Dimensions:  lon
    Attributes:
     units                = degrees_east
     long_name            = longitude
     standard_name        = longitude

  lat   (3)
    Datatype:    Float64 (Float64)
    Dimensions:  lat
    Attributes:
     units                = degrees_north
     long_name            = latitude
     standard_name        = latitude

  time   (31)
    Datatype:    Dates.DateTime (Int64)
    Dimensions:  time
    Attributes:
     units                = seconds since 1970-01-01T00:00:00
     calendar             = proleptic_gregorian
     long_name            = initial time of forecast
     standard_name        = forecast_reference_time

  msl   (3 × 3 × 31)
    Datatype:    Union{Missing, Float64} (Float64)
    Dimensions:  lon × lat × time
    Attributes:
     units                = Pa
     long_name            = Mean sea level pressure
     standard_name        = air_pressure_at_mean_sea_level

  valid_time   (31)
    Datatype:    Dates.DateTime (Int64)
    Dimensions:  time
    Attributes:
     units                = seconds since 1970-01-01T00:00:00
     calendar             = proleptic_gregorian
     long_name            = time
     standard_name        = time

Global attributes
  edition              = 2
  source               = /home/tricarion/Documents/dev/GRIBDatasets/test_1step_crop.grib
  centreDescription    = European Centre for Medium-Range Weather Forecasts
  centre               = ecmf
  subCentre            = 0
  Conventions          = CF-1.7


julia> st["time"][1], st["valid_time"][1]
(Dates.DateTime("2024-01-01T00:00:00"), Dates.DateTime("2024-01-01T06:00:00"))

@rafaqz (Member) commented Sep 3, 2025

We really do follow the CF spec blindly around here, but I doubt adding this valid_time variable is in it!!

The key question is: where does it say anywhere to create a valid_time variable that did not exist?

I also can't understand how valid_time helps when you don't have step. It adds no new information.

@rafaqz (Member) commented Sep 3, 2025

No, valid_time is not mentioned in the CF spec. Searching for it finds nothing:

https://cfconventions.org/cf-conventions/cf-conventions.html

This PR is not CF spec. I don't know what it is. I would really like to see some documented justification for it besides some other package having it.

(My proposal is to close this PR)

@rafaqz
Copy link
Copy Markdown
Member

rafaqz commented Sep 3, 2025

I have an alternate proposal to do instead of this PR: Treat step variables as Dates.Periods rather than Int as they are currently.

The units for step do not contain the word since. Thus they are explicitly not time. https://github.com/cf-convention/cf-conventions/pull/538/files#diff-1b7764bcb523097b45cd754f5de4d3b14b9fd07d0b27bc3396f051097c85b640R168-R172

But we also know they are periods, the standard name is forecast_period

So, in Julia/CommonDataModel.jl, we should read these variables as Periods like Hour/Day etc.

Then step is just a regular dimension and would e.g. work in Rasters like any other dimension. Clean, keeps the original structure, no added variables.
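A minimal sketch of that decoding, in Python for illustration (variable names are hypothetical; the actual proposal is to return Dates.Period values such as Hour in Julia):

```python
import numpy as np

# Raw step values as stored on disk, with attribute units = "hours" (hypothetical data)
step_raw = np.array([0, 6, 12, 18], dtype="int64")

# Decode through the units attribute into a duration type rather than a plain integer
units_map = {"hours": "h", "days": "D", "minutes": "m"}
step = step_raw.astype(f"timedelta64[{units_map['hours']}]")

print(step[1])  # 6 hours
```

Decoded this way, step values add directly to reference times instead of being opaque integers.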

@tcarion (Member, Author) commented Sep 4, 2025

We really do follow the CF spec blindly around here

I meant that I didn't want to follow blindly other packages or LLM, I'm totally ok with following blindly CF conventions ;-)

I found a quite helpful link: https://confluence.ecmwf.int/display/COPSRV/Metadata+recommendations+for+encoding+NetCDF+products+based+on+CF+convention#MetadatarecommendationsforencodingNetCDFproductsbasedonCFconvention-3.3Timecoordinatesystems

This gives a clear example of what I tried to explain in my previous comment (and how to implement the CF conventions I highlighted in the above screenshot).

The fact that you don't find valid_time in the convention is because CF is not prescriptive about the variable name, only about the standard name. And it explicitly says that the time standard name should be given to the coordinate representing the validity time of the forecast (cf. the above screenshot and the ECMWF recommendations in the link above), which is what this PR tries to do.

But I agree that the current variable naming in GRIBDatasets is confusing, and I propose to change it as follows:

[screenshot: proposed renaming of the GRIBDatasets variables]

This avoids the confusing valid_time variable name: it simply becomes time, containing the time at which the forecast is valid. And there will be a "new" variable reftime for the forecast reference time, which was wrongly the variable with standard_name=time before this PR.

And I also agree that step, renamed fcstperiod (the variable name chosen in the ECMWF example), should be a Dates.Period (indeed, that should probably be the responsibility of CDM).

@tcarion (Member, Author) commented Sep 4, 2025

But there's still the remaining question of how to handle cases where neither fcstperiod nor reftime is scalar. I couldn't find any example of this so far.

In this case, I propose that time be 1-dimensional with length length(reftime) × length(fcstperiod). For example:

Dimensions
   reftime = 2
   fcstperiod = 3
   time = 6

Variables
  reftime   (2)
    Datatype:    Dates.DateTime (Int64)
    Dimensions:  reftime
    Values:
     2-element Vector{DateTime}:
        2024-01-01T00:00:00
        2024-01-01T12:00:00

  fcstperiod (3)
    Datatype:    Dates.Hour (Int64)
    Dimensions:  fcstperiod
    Values:
    3-element Vector{Hour}:
        0 hours
        6 hours
        12 hours

  time   (6)
    Datatype:    Dates.DateTime (Int64)
    Dimensions:  time
    Values:
    6-element Vector{DateTime}:
        2024-01-01T00:00:00
        2024-01-01T06:00:00
        2024-01-01T12:00:00
        2024-01-01T12:00:00
        2024-01-01T18:00:00
        2024-01-02T00:00:00
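The listed values can be reproduced by broadcasting and flattening (NumPy used for illustration; the names follow the proposal above):

```python
import numpy as np

reftime = np.array(["2024-01-01T00", "2024-01-01T12"], dtype="datetime64[h]")
fcstperiod = np.array([0, 6, 12], dtype="timedelta64[h]")

# Flatten the (reftime, fcstperiod) grid into a 1-D time vector,
# reftime-major so the steps of one forecast stay adjacent
time = (reftime[:, None] + fcstperiod[None, :]).ravel()
print(list(time))
```

Note the repeated 2024-01-01T12:00:00 entry: the flattened time is not strictly monotonic when forecasts overlap.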

Another question is whether the layer variables should be defined on the time dimension or on reftime × fcstperiod.

@rafaqz (Member) commented Sep 6, 2025

I still don't get how this helps anyone more than my proposal.

Rasters.jl will likely delete whatever change you make here when it loads GRIB and instead make step a proper Period lookup, as I mentioned. It's much cleaner and fitting of how Rasters works.

Another option is to fork this package to keep the old behavior.

(Also, we still have zero actual CF standards documentation about this being a thing to do; that's not in the CF standards that you linked)

@rafaqz (Member) commented Sep 6, 2025

Actually, making a keyword argument for the dataset constructor that keeps the original behavior is the only workable solution I can think of here that will avoid a fork being necessary.

Either way a breaking version will be required.

@tcarion (Member, Author) commented Sep 8, 2025

There are actually several changes I want to make, that I will summarize here.

  1. Making the variable with standard_name=time the time calculated from the forecast reference time and the forecast period.
    Currently, GRIBDatasets sets the variable with standard_name=time to the forecast reference time, which is wrong by the CF convention:

The forecast reference time in NWP is the "data time", the time of the analysis from which the forecast was made. It is not the time for which the forecast is valid; the standard name of time should be used for that time.

    To me, this change is hardly debatable. Maybe I misunderstood something, but I'll need clear arguments to show me how.

  2. When there's a step key in the GRIB file (which is very common, as GRIB files are usually used for storing results of weather forecasts), creating a variable named fcstperiod containing these steps with attribute units=hours so it can be interpreted as Dates.Period by CDM or Rasters. The variable also has attribute standard_name=forecast_period, as stipulated in the standard names database (type "forecast_period" in the search field to check this).
    As far as I understand, we agree on this change.

  3. Adding a variable named reftime, containing the forecast reference time with attribute standard_name=forecast_reference_time, as suggested by the CF conventions in the standard names database.
    Again, this one seems hardly debatable, but I'm open to counter-arguments if any.

  4. In the case of non-scalar reftime and fcstperiod dimensions, constructing the time variable as explained in my previous comment.
    This one is clearly debatable, as I couldn't find explicit recommendations about such a case in the CF conventions. I'm totally ok with not doing it if you think it is not relevant. Then, GRIBDatasets should raise an error when trying to open such a file.

If we drop change 4, the only noticeable changes for Rasters will be the disappearance of the valid_time variable, the addition of a time variable, and two variables reftime and fcstperiod when needed (which I think is useful for the user, who wants to know this kind of information about the forecast).

@rafaqz (Member) commented Sep 10, 2025

  1. Please link to where this text comes from so everyone can read the context. It is not in the current CF standards or the working draft, so your bold text is somewhat misleading. I have a feeling you misunderstand something, but I can't find what it is without knowing where your ideas are coming from.
  2. We agree: forecast_period variables should be Dates.Period, which would make them actually useful.
  3. I think that adding additional variables that don't exist in the data on disk is debatable? But correctly labeling data that is forecast reference time as forecast reference time seems reasonable. Unless I really misunderstand how much you need to wrangle GRIB data here. For me to understand this, can you link somehow to how you know that specific GRIB data is the forecast reference time and not regular time, how these things are stored, etc.
  4. Exactly, it's not recommended anywhere, as I have been getting at. In fact, a 2-D variable called time is recommended against in the CF conventions:

We recommend that the name of a multidimensional coordinate variable should not match the name of any of its dimensions because that precludes supplying a coordinate variable for the dimension. This practice also avoids potential bugs in applications that determine coordinate variables by only checking for a name match between a dimension and a variable and not checking that the variable is one dimensional.

https://cfconventions.org/Data/cf-conventions/cf-conventions-1.12/cf-conventions.html#coordinate-system

@Alexander-Barth (Member)

Maybe Figure 2 of this paper is useful:
https://www.mdpi.com/2073-4433/11/3/300
[Figure 2 from the linked paper]

(ignore "training set", and consider "current time" as "valid time").

So for a fixed valid time, there are potentially multiple fields (e.g. one with forecast reference time 2000-01-01T00 and a 12 h forecast period, and one with forecast reference time 2000-01-01T06 and a 6 h forecast period). In most cases only a single valid time is distributed (the one with the shortest lead time), but there are cases where you actually have the multiple lead times (for forecast verification, for example).
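The overlap can be checked directly (NumPy used for illustration, with the values from the example above):

```python
import numpy as np

# Two different (reference time, forecast period) pairs...
a = np.datetime64("2000-01-01T00") + np.timedelta64(12, "h")
b = np.datetime64("2000-01-01T06") + np.timedelta64(6, "h")

# ...yield the same valid time
print(a == b)  # True: both fields are valid at 2000-01-01T12
```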

I was curious how ECMWF distributes their operational forecast when you request netCDF files, and surprisingly the MARS server gave an error when multiple forecasts with overlapping lead times were requested (so I guess the problem is not that easy :-)).

For the case with multiple reference times and multiple forecast periods resulting in multiple estimates for the same valid time ("overlapping lead times"), I think it makes sense to have two time coordinates:

temperature(lon, lat, z, forecast_period, reference_time)
valid_time(forecast_period, reference_time) (or just time)

The order of the dimensions can be discussed, but a Julia model would fill the temperature array in storage order, which is good for performance if we choose this order.

Flattening valid_time is also an option, but this would make indexing a bit more complicated. Most applications would want the "best" estimate (as the error tends to grow over time). In the example above this would be temperature[:,:,:,1,:] and valid_time[1,:].
There are also some statistics that are computed per forecast period (lead time), aggregating over all reference times, like in this paper.
Such analysis would benefit from not flattening valid_time.

However, as mentioned before, if dimension(forecast_period) = dimension(reference_time) = 1, it might be best to avoid this complexity for users and use only 1-D time variables.
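A sketch of that "best estimate" selection under the proposed layout temperature(lon, lat, z, forecast_period, reference_time), with hypothetical shapes and NumPy for illustration:

```python
import numpy as np

# Hypothetical dimension sizes
nlon, nlat, nz, nperiod, nref = 4, 3, 2, 3, 2
temperature = np.arange(nlon * nlat * nz * nperiod * nref, dtype=float).reshape(
    nlon, nlat, nz, nperiod, nref
)  # temperature(lon, lat, z, forecast_period, reference_time)

reference_time = np.array(["2024-01-01T00", "2024-01-01T12"], dtype="datetime64[h]")
forecast_period = np.array([0, 6, 12], dtype="timedelta64[h]")
valid_time = forecast_period[:, None] + reference_time[None, :]  # (forecast_period, reference_time)

# "Best" estimate: the shortest lead time
# (index 0 here; index 1 in Julia's 1-based indexing)
best = temperature[:, :, :, 0, :]  # drops the forecast_period dimension
best_time = valid_time[0, :]       # 1-D valid times of the best estimate
print(best.shape)  # (4, 3, 2, 2)
```

Fixing the forecast period reduces the 2-D valid_time to an ordinary 1-D time coordinate, which is the case downstream tools already handle.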

@tcarion (Member, Author) commented Sep 12, 2025

  1. It comes from the CF standard names table. If you type "forecast" in the search field, you'll find:

forecast_period: Forecast period is the time interval between the forecast reference time and the validity time. A period is an interval of time, or the time-period of an oscillation.
forecast_reference_time: The forecast reference time in NWP is the "data time", the time of the analysis from which the forecast was made. It is not the time for which the forecast is valid; the standard name of time should be used for that time.

     Starting from these standards, the European Centre for Medium-Range Weather Forecasts (ECMWF) shows on this link a practical example of how to encode forecast products to netCDF, following the CF standards. ECMWF is the institution that produces most of the GRIB files around here, as it is the format they use for archiving their data. So following their recommendations to encode these data to netCDF seems relevant to me.

  2. Alright, I think this should be done at the CommonDataModel level. I will need to make a PR at some point.
  3. GRIB files can be viewed as a pile of independent chunks of data (called messages), which are generally horizontally gridded values with metadata describing what they are. There's no such thing as variables in GRIB files; GRIBDatasets actually reads some specific metadata of all the messages to construct the dimensions and variables. But the validity and reference times are part of this metadata (see "dataTime" and "validityTime" here). I used the term "reconstruct" for this validity time in this PR because the legacy code from Cfgrib.jl was calculating it from the reference time and the step, but I now realize I might as well directly take the validityTime key from the GRIB messages. All of this to say that there's no creation of variables which don't exist on disk.
  4. I agree with the argument from the very clear explanation of @Alexander-Barth. I do think it is more convenient for the user to have a two-dimensional time variable in the case of multiple reference times and multiple forecast periods. But I also understand it can cause issues in Rasters. I propose that the default loading of such a dataset in GRIBDatasets raises an explanatory error, which can be bypassed with a specific keyword. With this keyword, it will build this 2-D time variable, as proposed here.

@Alexander-Barth (Member) commented Sep 12, 2025

Thanks @tcarion for the info. I am wondering if you know of a NetCDF example (or CDL) where there are multiple reference times and multiple forecast periods (with a time overlap, so that you have multiple slices corresponding to a single valid time)?
Sorry if it is in your link, but I could not find it.

I don't really know the constraints from Rasters.jl, but in many cases you do not want to treat different lead times equally. For example, if you compute a monthly average, you typically want to do so at a fixed lead time.

In case it is helpful: in CDM you can also create a view of a dataset. If you use a scalar index (here 1), then its dimension is dropped (analogous to arrays):

view(ds,time=1)

So even if valid_time is 2D, if you fix the lead time, then it becomes 1D.

For example (with another variable, just to show the mechanism):

julia> ds["Temperature"]
Temperature (151 × 76 × 2 × 2)
  Datatype:    Union{Missing, Float32} (Float32)
  Dimensions:  lon × lat × depth × time

julia> view(ds,time=1)["Temperature"]
Temperature (151 × 76 × 2)
  Datatype:    Union{Missing, Float32}
  Dimensions:  lon × lat × depth


julia> ndims(view(ds,time=1)["time"])
0
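The same mechanism can be illustrated with plain arrays: indexing with a scalar drops that dimension, so a 2-D valid_time becomes 1-D once the lead time is fixed. A small NumPy sketch of the array analogy (not CDM code):

```python
import numpy as np

# Toy 2-D valid_time with shape (time, step).
valid_time = np.arange(8).reshape(2, 4)

# Fixing the lead time with a scalar index drops the step dimension,
# analogous to how view(ds, time=1) drops the time dimension in CDM.
fixed_lead = valid_time[:, 0]

print(valid_time.ndim, fixed_lead.ndim)  # 2 1
```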

@tcarion
Member Author

tcarion commented Sep 17, 2025

I don't have a NetCDF file per se, but I looked at what cdo and Python cfgrib do when converting the GRIB file from the originating issue. A cropped version of these data is added to the test files with this PR here.

  1. With python cfgrib, it gives:
In [1]: import cfgrib
   ...: import xarray as xr
   ...: xr.open_dataset('regular_ll_msl_with_steps.grib',engine='cfgrib')
Out[1]: 
<xarray.Dataset>
Dimensions:     (time: 62, step: 4, latitude: 3, longitude: 3)
Coordinates:
    number      int64 ...
  * time        (time) datetime64[ns] 2024-01-01 ... 2024-01-31T12:00:00
  * step        (step) timedelta64[ns] 00:00:00 06:00:00 12:00:00 18:00:00
    meanSea     float64 ...
  * latitude    (latitude) float64 55.0 54.5 54.0
  * longitude   (longitude) float64 3.0 3.5 4.0
    valid_time  (time, step) datetime64[ns] ...
Data variables:
    msl         (time, step, latitude, longitude) float32 ...

With the time variable having standard_name=forecast_reference_time:

In [3]: data.time
Out[3]: 
<xarray.DataArray 'time' (time: 62)>
array(['2024-01-01T00:00:00.000000000', '2024-01-01T12:00:00.000000000',
...]
Coordinates:
    number   int64 ...
  * time     (time) datetime64[ns] 2024-01-01 ... 2024-01-31T12:00:00
    meanSea  float64 ...
Attributes:
    long_name:      initial time of forecast
    standard_name:  forecast_reference_time

and the valid_time variable being 2-D with time x step dimensions and standard_name=time.

In [5]: data.valid_time
Out[5]: 
<xarray.DataArray 'valid_time' (time: 62, step: 4)>
[248 values with dtype=datetime64[ns]]
Coordinates:
    number      int64 ...
  * time        (time) datetime64[ns] 2024-01-01 ... 2024-01-31T12:00:00
  * step        (step) timedelta64[ns] 00:00:00 06:00:00 12:00:00 18:00:00
    meanSea     float64 ...
    valid_time  (time, step) datetime64[ns] ...
Attributes:
    standard_name:  time
    long_name:      time

The step variable has standard_name=forecast_period.

In [6]: data.step
Out[6]: 
<xarray.DataArray 'step' (step: 4)>
array([             0, 21600000000000, 43200000000000, 64800000000000],
      dtype='timedelta64[ns]')
Coordinates:
    number   int64 ...
  * step     (step) timedelta64[ns] 00:00:00 06:00:00 12:00:00 18:00:00
    meanSea  float64 ...
Attributes:
    long_name:      time since forecast_reference_time
    standard_name:  forecast_period
  2. The cdo -f nc copy regular_ll_msl_with_steps.grib regular_ll_msl_with_steps.nc command produces the following netCDF file (read with NCDatasets):
julia> NCDataset("/home/tricarion/Documents/dev/GRIBDatasets/regular_ll_msl_with_steps.nc")
Dataset: /home/tricarion/Documents/dev/GRIBDatasets/regular_ll_msl_with_steps.nc
Group: /

Dimensions
   time = 248
   lon = 3
   lat = 3

Variables
  time   (248)
    Datatype:    Dates.DateTime (Float64)
    Dimensions:  time
    Attributes:
     standard_name        = time
     units                = hours since 2024-1-1 00:00:00
     calendar             = proleptic_gregorian
     axis                 = T

  lon   (3)
    Datatype:    Float64 (Float64)
    Dimensions:  lon
    Attributes:
     standard_name        = longitude
     long_name            = longitude
     units                = degrees_east
     axis                 = X

  lat   (3)
    Datatype:    Float64 (Float64)
    Dimensions:  lat
    Attributes:
     standard_name        = latitude
     long_name            = latitude
     units                = degrees_north
     axis                 = Y

  msl   (3 × 3 × 248)
    Datatype:    Float32 (Float32)
    Dimensions:  lon × lat × time
    Attributes:
     standard_name        = air_pressure_at_mean_sea_level
     long_name            = Mean sea level pressure
     units                = Pa
     param                = 0.3.0

Global attributes
  CDI                  = Climate Data Interface version 2.1.1 (https://mpimet.mpg.de/cdi)
  Conventions          = CF-1.6
  institution          = European Centre for Medium-Range Weather Forecasts
  history              = Tue Sep 16 15:23:21 2025: cdo -f nc copy regular_ll_msl_with_steps.grib regular_ll_msl_with_steps.nc
  CDO                  = Climate Data Operators version 2.1.1 (https://mpimet.mpg.de/cdo)

So here the time variable is flattened, and the step variable is not even kept.

Personally I prefer what cfgrib does, since more information is kept from the original data. Besides, it respects the CF conventions (except perhaps in giving standard_name=time to a 2-D variable, where it is not clear whether this is recommended or not). That's what I wanted to build on in this PR, perhaps renaming the variable time to reftime and step to fcstperiod as in the ECMWF example. There would then no longer be a variable called time, but instead reftime with standard_name=forecast_reference_time and valid_time with standard_name=time.
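To see where cdo's 248-element time axis comes from: it is every (reference time, step) combination flattened into one list, which is also why the pairing information is lost. A quick Python check, using the dimensions visible in the listings above (62 reference times, 4 steps; the 12-hourly and 6-hourly spacings are read off the xarray output):

```python
from datetime import datetime, timedelta

# 62 reference times every 12 h, 4 steps of 6 h, as in the sample file.
ref_times = [datetime(2024, 1, 1) + timedelta(hours=12 * i) for i in range(62)]
steps = [timedelta(hours=h) for h in (0, 6, 12, 18)]

# cdo-style flattening: one entry per (reference time, step) pair.
flat_time = [t + s for t in ref_times for s in steps]

print(len(flat_time))       # 248
print(len(set(flat_time)))  # fewer: overlapping valid times collapse
```

The duplicates (e.g. 2024-01-01 00:00 + 12 h equals 2024-01-01 12:00 + 0 h) show that a 1-D valid-time axis cannot recover which forecast produced each slice.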

@Alexander-Barth
Member

Alexander-Barth commented Sep 17, 2025

I think also that what cfgrib does here is more sensible.
The fact that cdo omits the variable step (which is quite crucial information in my opinion) makes me wonder whether it was written with this special case in mind.

We recommend that the name of a multidimensional coordinate variable should not match the name of any of its dimensions because that precludes supplying a coordinate variable for the dimension. This practice also avoids potential bugs in applications that determine coordinate variables by only checking for a name match between a dimension and a variable and not checking that the variable is one dimensional.

Reading this paragraph, my understanding is that the following is not recommended:

Dimensions
   time = 2
   fcstperiod = 3

Variables
  time   (3 × 2)
    Dimensions:  fcstperiod × time

A variable named time has the dimension time, yet the variable is 2-D.
But naming the dimension reftime avoids this:

Dimensions
   reftime = 2
   fcstperiod = 3

Variables
  time   (3 × 2)
    Dimensions:  fcstperiod × reftime

For some models, lon and lat can have multiple dimensions too (e.g. a moving grid).
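The "potential bugs" the CF text mentions are easy to reproduce: an application that detects coordinate variables by a name match alone will misclassify a 2-D time variable that sits on a time dimension. A hypothetical Python sketch (the dictionaries and helpers are illustrative, not from any real library):

```python
# Dimension sizes and variable -> dimensions mapping for the discouraged layout.
dims = {"time": 2, "fcstperiod": 3}
variables = {"time": ("fcstperiod", "time")}  # 2-D, yet named like its dimension

def naive_coord_vars(dims, variables):
    # Buggy heuristic the CF text warns about: checks only the name match.
    return [name for name in variables if name in dims]

def safe_coord_vars(dims, variables):
    # Also require the variable to be 1-D over its own dimension.
    return [name for name, vdims in variables.items()
            if name in dims and vdims == (name,)]

print(naive_coord_vars(dims, variables))  # ['time']  (wrongly treated as 1-D coordinate)
print(safe_coord_vars(dims, variables))   # []
```

Renaming the dimension to reftime, as in the second listing above, removes the name clash entirely, so even the naive heuristic behaves.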
