1. Introduction
Weather forecast postprocessing is a key component in many operational forecasting systems. When compared with in situ observations, numerical weather prediction (NWP) models produce systematically biased forecasts, partly because of phenomena that remain unresolved at finer scales. This motivates statistical postprocessing methods that correct these biases by training on past forecasting errors (Vannitsem et al. 2021). Earlier postprocessing models were used to generate probabilistic predictions from deterministic model outputs, notably for precipitation forecasting (Antolik 2000). With the advent of ensemble forecasts, this approach has received less attention in recent years. However, deterministic forecasts are still produced in many operational centers (Hamill 2021), typically at shorter lead times, for which they are more appropriate. Producing them can also be motivated by computational constraints when exploring different points of the ensemble size versus model resolution trade-off.
We propose a deterministic-to-probabilistic approach that recovers a deterministic model’s uncertainty by training on past observations using neural networks (NNs). Since it targets surface stations, it applies to any gridded NWP model, including machine learning weather prediction models (Lam et al. 2022; Pathak et al. 2022; Bi et al. 2023). Our approach requires no input describing forecast spread and infers all uncertainties from its training data.
This contribution arises in a dynamic context for weather forecast postprocessing (Vannitsem et al. 2021). A topic of active discussion is how the distribution is predicted, i.e., the model that determines the shape of the output distribution. Notable approaches include linear models (Gneiting et al. 2005), random forests (Taillardat et al. 2016), fully connected neural networks (Rasp and Lerch 2018), and convolutional neural networks (Veldkamp et al. 2021).
Another topic is how the forecast uncertainty is represented, that is, how the output of the predictive model is cast into a CDF. Deterministic postprocessing is being studied at shorter lead times (Hamill 2021). For probabilistic outputs, solutions revolve around parametric (Taillardat 2021; Demaeyer et al. 2023) and nonparametric (Bremnes 2020; Hewson and Pillosu 2021) methods (also called distribution-based and distribution-free methods, respectively). Among the nonparametric methods, Bremnes (2020) proposes a Bernstein quantile network (BQN) that models the quantile function of the probabilistic forecast as a Bernstein polynomial. This approach yields improvements over previous methods for surface wind speed fields while avoiding strong assumptions about the distribution of the predicted value. Numerous parametric and nonparametric methods were systematically reviewed for wind gust forecasting by Schulz and Lerch (2022b), using a modular framework that decouples how the distribution is predicted (i.e., using a linear model or a neural network) from how it is represented (i.e., using a normal distribution or quantile regression).
Finally, recent works have trained postprocessing models either separately for each lead time (Rasp and Lerch 2018; Bremnes 2020; Schulz and Lerch 2022b) or jointly in a single model (Bouallègue et al. 2023). In the latter, it was observed that the lead-time predictor was not always retained by a feature selection process for deterministic forecast postprocessing. As such, there are ongoing discussions regarding how to best condition a postprocessing model on the lead time.
The deterministic-to-probabilistic approach has seen renewed interest in recent work. Veldkamp et al. (2021) proposed a convolutional network to postprocess wind speed forecasts from a high-resolution deterministic model. Their prediction is done at a single lead time. Bouallègue et al. (2023) proposed a two-step postprocessing pipeline where a second model is trained to predict the residual error of the first model. The residual error model provides an estimation of the uncertainty of a deterministic forecast but does not express its probability distribution. Bremnes et al. (2023) applied a BQN to deterministic weather forecasts produced by deep neural networks (Bi et al. 2023), making a probabilistic forecast from a deterministic NWP. Their work studies only a quantile-based method and invites the evaluation of other approaches. Demaeyer et al. (2023) introduced the atmosphere network (ANET) which predicts a parametric forecast using a single network for all stations and lead times. It is compatible with variable member counts.
We extend these examinations by producing probabilistic forecasts based on a deterministic global model, with lead times of up to 10 days. This is done on a dataset built from the Global Deterministic Prediction System (GDPS) NWP model (Buehner et al. 2015), targeting METAR surface temperature observations in Canada and the United States (Herzmann 2001). We study neural network models for this purpose and compare them to linear models. The forecast uncertainty is expressed using either a normal distribution, a set of quantile values, or a quantile function built from Bernstein polynomials. The neural network models are trained jointly for all lead times, and we compare multiple strategies to condition them on the lead time, including a lead-time embedding that has proven useful in recent work (Espeholt et al. 2022). Finally, we evaluate the behavior of our postprocessing method under extreme temperatures and compare it to that of the NWP model.
The next section describes our experimental framework in more detail, including models, datasets, and evaluation methods. Section 3 contains our experiments that evaluate forecast performance, calibration, and its behavior under extreme events. This is followed by a discussion about the proposed postprocessing methods and our concluding remarks in section 4.
2. Methods
a. Data
We perform postprocessing of surface temperature fields over the operational output of the GDPS NWP model (Buehner et al. 2015; Meteorological Service of Canada 2019) and target observations from the METAR network. We use model outputs initialized at 0000 and 1200 UTC to perform postprocessing daily up to 10-day lead times.
For training, we use forecasts initialized from 1 January 2019 to 31 December 2020. This period contains a major update of the GDPS model in July 2019, when the model horizontal resolution was increased from 25 to 15 km (McTaggart-Cowan et al. 2019). We nonetheless retain the earlier forecasts, since removing them slightly degraded the validation score.
Forecasts initialized on the first 25 days of each month are used for the training itself, while the others are used for validation. This validation strategy is not fully independent because late lead times from the training set overlap with early lead times from the validation set. We choose this compromise to ensure good coverage of the seasonal cycle. Forecasts from 1 January 2021 to 30 November 2021 are used for testing. We note that our testing period for the GDPS dataset contains two extreme weather phenomena, a wave of extreme cold temperatures in Texas in February, and a heat wave over western North America in June and July.
We use 18 NWP-dependent predictors from the GDPS dataset, as well as seven NWP-independent predictors. Table 1 contains the full predictor list. The set of NWP-dependent predictors is relatively small due to constraints in accessing and storing operational outputs.
NWP-dependent and NWP-independent predictors used for postprocessing. ✓ = Always used; * = used unless specified otherwise.
All features are scaled stationwise using their mean and variance over the training set to improve normality. Before scaling, we apply a logarithm transformation to positive-valued variables (albedo, precipitation, and wind speed). The day-of-year predictor is provided twice, encoded with sine and cosine transforms, to represent its periodicity.
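As an illustration, here is a minimal sketch of this preprocessing, assuming the training features are held in a pandas DataFrame; the column names (station, day_of_year, etc.) and the use of log1p are our own choices, not taken from the operational implementation:

```python
import numpy as np
import pandas as pd

LOG_VARS = ["albedo", "precipitation", "wind_speed"]  # hypothetical column names

def preprocess(df, stats=None):
    """Log-transform positive variables, encode day of year, standardize stationwise."""
    df = df.copy()
    for var in LOG_VARS:
        df[var] = np.log1p(df[var])  # logarithm transform before scaling

    # Periodic encoding of the day-of-year predictor.
    angle = 2.0 * np.pi * df["day_of_year"] / 365.25
    df["doy_sin"], df["doy_cos"] = np.sin(angle), np.cos(angle)

    features = [c for c in df.columns if c not in ("station", "day_of_year")]
    if stats is None:  # fit the scaling statistics on the training set only
        stats = {"mean": df.groupby("station")[features].mean(),
                 "std": df.groupby("station")[features].std()}

    # Stationwise standardization using the training-set statistics.
    mean = stats["mean"].reindex(df["station"]).to_numpy()
    std = stats["std"].reindex(df["station"]).to_numpy()
    df[features] = (df[features].to_numpy() - mean) / std
    return df, stats
```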
We target observations from the METAR network harvested from the Iowa State University Environmental Mesonet (Herzmann 2001–2024). The dataset consists of observations from 1066 stations spread across North America. The NWP forecasts were interpolated to stations using the nearest grid point. Stations were selected based on data availability for the periods covered by the NWP forecasts. We removed 250 observations (∼0.01%) that reported temperatures more than 15 K away from their corresponding 24-h forecast after debiasing. The bias values were computed by averaging model errors over the training set, separately for each station, initialization time, lead time, and month. This process is meant to account for sensor errors.
b. Postprocessing models
We devise a naive probabilistic forecast as a baseline model. First, the NWP forecasts are debiased using the process described in section 2a. Second, the standard deviation of forecast errors is computed using the same aggregation (separately by station, initialization time, lead time, and month). Our baseline probabilistic forecast is a normal distribution centered around the debiased NWP forecast and scaled by the computed standard deviation.
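A minimal sketch of this baseline, assuming the forecast–observation pairs are held in a pandas DataFrame with hypothetical column names (station, init_time, lead_time, month, forecast, observation):

```python
import pandas as pd
from scipy import stats

GROUP_KEYS = ["station", "init_time", "lead_time", "month"]  # hypothetical column names

def fit_naive_baseline(train):
    """Compute per-group bias and error standard deviation on the training set."""
    train = train.assign(error=train["forecast"] - train["observation"])
    grouped = train.groupby(GROUP_KEYS)["error"]
    return pd.DataFrame({"bias": grouped.mean(), "sigma": grouped.std()})

def predict_naive(test, climatology):
    """Return a normal distribution centered on the debiased forecast."""
    joined = test.join(climatology, on=GROUP_KEYS)
    mu = joined["forecast"] - joined["bias"]  # debiased NWP forecast
    return stats.norm(loc=mu.to_numpy(), scale=joined["sigma"].to_numpy())
```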
Our other methods use machine learning to generate probabilistic forecasts from the NWP output. They are summarized in Table 2. We follow a decoupled approach (Schulz and Lerch 2022b) in which we distinguish how the forecast uncertainty is predicted from how it is represented. We study two predictive models: a linear model and an NN model. We study four uncertainty representations: a naive deterministic representation, a normal distribution, a set of quantiles, and a quantile function expressed as a Bernstein polynomial.
Summary table of the postprocessing models considered in this work.
This yields eight different models which are illustrated in Fig. 1. The models use NWP-dependent and NWP-independent predictive features x. These features are used by a predictive model to build a vector θ which determines the predictive distribution. One such vector is produced for each station and lead time. It defines the probabilistic forecast, for instance, by providing its distribution parameters or a set of quantile values. The length of θ changes according to which uncertainty representation is used.
(top) Our model architecture. Each forecast is made using one of two parameter prediction models and one of four uncertainty representations. (bottom left) The linear model predicts a parameter vector θ using one linear layer. (bottom right) The NN predicts a parameter vector θ by processing the predictors through an MLP. It has embeddings for station and lead time, which are added to the output of the first linear layer.
1) Predictive models
Our first parameter prediction model linearly maps the predictors x to θ. It has separate coefficients and biases for each station, initialization hour, and lead time.
Our second model is an NN model centered around a multilayer perceptron (MLP), as introduced for postprocessing by Rasp and Lerch (2018). The hidden layer size is kept constant throughout. The sigmoid linear unit (SiLU) activation function (Ramachandran et al. 2017) was used. Batch normalization layers were added after the linear layers, since we observed that they accelerate convergence during training.
It is common to train the NN model jointly for all stations and then condition it on the station with an embedding (Rasp and Lerch 2018). An embedding vector is learned for each station, with the same size as the hidden layers of the MLP. It is added to the network activations after the first linear layer, before the batch normalization and activation layers are applied. Adding it at that location is equivalent to concatenating one-hot encoded station-identity predictors to x, but the embedding has a smaller memory footprint because it avoids padding x with numerous, sparsely populated features, which facilitates implementation.
Since our model is trained for 10 lead times on the GDPS dataset, we also need to condition it on the lead time. We use two strategies: we add a predictor to x corresponding to the lead time (rescaled from 0 to 1), and we learn a lead-time embedding implemented in the same way as the station embedding. This lets the network not only adapt its postprocessing strategy to the lead time (e.g., by increasing uncertainty) but also identify correlation structures that are common across lead times. A lead-time embedding has previously been used for precipitation forecasting (Espeholt et al. 2022).
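A minimal PyTorch sketch of this architecture is given below. The hidden size, number of hidden layers, and layer ordering follow the description above; the class and argument names are ours, and details of the operational implementation may differ:

```python
import torch.nn as nn

class PostprocessingMLP(nn.Module):
    """MLP that conditions on station and lead time through additive embeddings."""

    def __init__(self, n_features, n_stations, n_lead_times, n_params,
                 hidden=256, n_hidden_layers=4):
        super().__init__()
        self.first = nn.Linear(n_features, hidden)
        # Embeddings are added to the activations of the first linear layer,
        # before batch normalization and the SiLU activation are applied.
        self.station_embedding = nn.Embedding(n_stations, hidden)
        self.lead_time_embedding = nn.Embedding(n_lead_times, hidden)

        blocks = []
        for _ in range(n_hidden_layers):
            blocks += [nn.BatchNorm1d(hidden), nn.SiLU(), nn.Linear(hidden, hidden)]
        self.hidden = nn.Sequential(*blocks)
        self.head = nn.Linear(hidden, n_params)  # outputs the parameter vector theta

    def forward(self, x, station_idx, lead_time_idx):
        h = self.first(x)
        h = h + self.station_embedding(station_idx) + self.lead_time_embedding(lead_time_idx)
        return self.head(self.hidden(h))
```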
2) Uncertainty representation
Four ways to represent the postprocessed forecast uncertainty are considered. Each uncertainty representation method accepts a parameter vector θ from one of the predictive models. Each method has a corresponding loss function so that the models can be trained against observations.
In the base case, we consider a one-parameter output and use it as a deterministic forecast. When the underlying predictive model is linear, we refer to this method as model output statistics (MOS) (Glahn and Lowry 1972). When it is a neural network, we refer to it as a deterministic NN (DNN). These models are trained using the root-mean-square error (RMSE) loss function.
When the uncertainty is represented by a normal distribution, θ contains its location and scale parameters, and the models are trained by minimizing the continuous ranked probability score (CRPS). We refer to the linear variant as ensemble model output statistics (EMOS; Gneiting et al. 2005) and to the NN variant as a distributional regression network (DRN; Rasp and Lerch 2018). The quantile regression approach consists of predicting θ as a set of quantile values at fixed probability levels, trained using the quantile (pinball) loss; we refer to the linear variant as linear quantile regression (LQR) and to the NN variant as a quantile regression network (QRN). Finally, the Bernstein quantile approach expresses the quantile function of the forecast as a Bernstein polynomial whose coefficients are given by θ (Bremnes 2020); its linear and NN variants are denoted LBQ and BQN, respectively.
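A sketch of the corresponding training losses is given below: the closed-form CRPS of a normal forecast (as in Gneiting et al. 2005) for the EMOS/DRN case, and the quantile (pinball) loss for the quantile-based case. Tensor shapes and function names are our own. For the BQN, the same pinball loss can be applied to the quantile function evaluated at a fixed set of probability levels, as in Bremnes (2020).

```python
import math
import torch

def crps_normal(mu, sigma, y):
    """Closed-form CRPS of a normal forecast N(mu, sigma) against observation y."""
    z = (y - mu) / sigma
    pdf = torch.exp(-0.5 * z**2) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1 + torch.erf(z / math.sqrt(2)))
    return (sigma * (z * (2 * cdf - 1) + 2 * pdf - 1 / math.sqrt(math.pi))).mean()

def pinball_loss(quantiles, y, taus):
    """Quantile (pinball) loss for predicted quantiles at probability levels taus."""
    # quantiles: (batch, n_quantiles), y: (batch, 1), taus: tensor of shape (n_quantiles,)
    diff = y - quantiles
    return torch.maximum(taus * diff, (taus - 1) * diff).mean()
```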
Bernstein polynomials have shape-preserving properties: a monotonic list of coefficients θj yields a monotonic function Q(τ), an expected property for quantile functions. We enforce this by sorting the coefficients θj after they are predicted. We observed that, without coefficient ordering, the BQN would sometimes converge to solutions that have good validation scores but poor calibration due to jagged quantile functions.
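For reference, and following Bremnes (2020), the quantile function represented by the BQN can be written as

\[
Q(\tau) = \sum_{j=0}^{d} \theta_j \binom{d}{j} \tau^{j} (1 - \tau)^{d - j}, \qquad \tau \in [0, 1],
\]

where d is the polynomial degree (16 in our configuration); sorting the coefficients enforces θ0 ≤ θ1 ≤ … ≤ θd, which guarantees that Q is nondecreasing.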
3) Combining predictions
NN methods converge to different models depending on their random weight initialization and the random composition of training batches. To account for this, NN-based postprocessing models are typically trained multiple times and their predictions are combined to create the final distribution (Schulz and Lerch 2022a). We follow this practice and train each NN five times. To combine predictions, we average the parameter vectors θ across models. For the DRN, this is equivalent to averaging distribution parameters, as is done by Rasp and Lerch (2018). For the QRN, this is equivalent to uniform-weight quantile averaging, also known as Vincentization (Schulz and Lerch 2022a). The same equivalence holds for the BQN, even though the values of θ represent polynomial coefficients rather than quantile values (Schulz and Lerch 2022b).
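The equivalence noted above follows from the fact that the predicted quantile function is linear in θ. Writing Qm for the quantile function of the m-th trained model (a notational convenience; M = 5 here),

\[
\frac{1}{M} \sum_{m=1}^{M} Q_m(\tau) = \sum_{j=0}^{d} \left( \frac{1}{M} \sum_{m=1}^{M} \theta_{j,m} \right) \binom{d}{j} \tau^{j} (1 - \tau)^{d - j},
\]

so averaging the coefficient vectors and averaging the quantile functions (Vincentization) produce the same combined forecast; the same argument applies directly to the QRN, whose θ entries are the quantile values themselves.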
c. Training
All models are implemented in PyTorch (Paszke et al. 2019), a scientific computation framework well suited to deep learning applications. They are trained for 100 epochs using the Adam optimizer (Kingma and Ba 2014) and the OneCycleLR training scheduler (Smith and Topin 2018). The maximal learning rate was 10⁻³ for the linear model and 5 × 10⁻⁴ for the NN, with a weight decay of 10⁻⁵ in both cases.
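A minimal sketch of this optimizer and scheduler configuration; the model, data loader, and epoch loop are placeholders, not the training code used in this work:

```python
import torch

def configure_training(model, train_loader, is_linear, epochs=100):
    """Adam + OneCycleLR setup used for both the linear and NN models."""
    max_lr = 1e-3 if is_linear else 5e-4
    optimizer = torch.optim.Adam(model.parameters(), lr=max_lr, weight_decay=1e-5)
    scheduler = torch.optim.lr_scheduler.OneCycleLR(
        optimizer,
        max_lr=max_lr,
        epochs=epochs,
        steps_per_epoch=len(train_loader),
    )
    return optimizer, scheduler  # scheduler.step() is called after every batch
```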
Each model was trained in four variants, corresponding to each of the uncertainty representations described in section 2b. We optimize variant-specific details for each model via a manual grid search (i.e., degree of Bernstein polynomial and number of predicted quantiles). The shape of the shared architecture (embedding size and number of fully connected layers) was kept fixed across the variants to facilitate model intercomparison.
d. Evaluation
Forecast performance is measured using the continuous ranked probability score (CRPS) and the associated skill score (CRPSS), computed against the naive probabilistic baseline described above. To measure forecast spread, we use the composite spread proposed by Bremnes (2019): for quantile forecasts, the smallest quantile interval widths are summed until they cover the desired probability range (here 80%), with a linear interpolation in the final interval to reach the desired probability more precisely. For normal distribution forecasts, this reduces to computing the distance between the 10th and 90th percentiles using the inverse CDF. Finally, we assess the calibration of the predicted distributions using rank histograms.
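A sketch of the composite spread computation for a quantile forecast, under our reading of the description above; it assumes the target coverage never exceeds the total probability mass spanned by the quantile levels:

```python
import numpy as np

def composite_spread(quantiles, taus, coverage=0.8):
    """Sum the narrowest quantile intervals until they cover the target probability."""
    widths = np.diff(np.sort(quantiles))  # widths of consecutive quantile intervals
    probs = np.diff(taus)                 # probability mass of each interval
    order = np.argsort(widths)            # narrowest intervals first
    widths, probs = widths[order], probs[order]

    cum_prob = np.cumsum(probs)
    last = np.searchsorted(cum_prob, coverage)  # first interval that reaches coverage
    spread = widths[:last].sum()
    covered = cum_prob[last - 1] if last > 0 else 0.0
    # Linear interpolation within the final interval to hit the coverage exactly.
    spread += widths[last] * (coverage - covered) / probs[last]
    return spread
```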
3. Experiments and results
a. Hyperparameter optimization
For the NN models, we choose four hidden layers with a size of 256, after observing diminishing returns on the validation set for larger architectures. We did not find strong interactions between the shared hyperparameters and those related to the uncertainty representation; e.g., the optimal embedding size performed similarly for the DRN, QRN, and BQN. Regarding the length of the θ vectors, we use Bernstein polynomials of degree 16 after observing no performance improvement for larger values. We use 32 quantiles in our quantile regression models for the same reason. The values considered were {8, 10, 12, 14, 16, 18} for the degree of the Bernstein polynomial and {16, 20, …, 40} for the number of predicted quantiles of the quantile regression model. The linear quantile methods (LQR and LBQ) were set to use the same length of θ as their NN counterparts so that they can be compared at equivalent forecast resolution.
Once these hyperparameters are set, our NN models have between 580 000 and 590 000 trainable parameters. The variability is due to the different number of output parameters θ for each model. For comparison, the EMOS model has 1 517 164 trainable parameters, more than the NN, because it has separate coefficients for each station, initialization time, and lead time.
b. Postprocessing performance
We ran a series of experiments to evaluate the performance of our postprocessing methods over the GDPS dataset. Table 3 shows performance metrics aggregated over all lead times and all stations. The NN models score better than their linear counterparts in all configurations. The linear model performs best with the normal distribution, while the NN model shows similar performance for all variants. We posit this is because the linear model does not have enough representation capability to correctly predict a large number of interrelated θ parameters. Among the NN models, the BQN obtains slightly better results for all evaluated metrics.
Postprocessing model metrics, aggregated over all stations and lead times. The CRPSS is computed against the naive probabilistic baseline. Bold values represent the best performing uncertainty representation for each predictive model.
Figure 2 shows metrics describing forecast behavior across lead times. The CRPS and spread increase with lead time, as expected. Forecasts from the naive probabilistic model have more spread throughout. The EMOS model successfully increases sharpness, but less so than the NN models. The QRN has the lowest spread among the NN models, although this could be due to a bias in the spread estimation related to its smaller number of quantiles.
Postprocessing model metrics. (left) CRPS by lead time. (right) Forecast sharpness as measured by a composite spread covering 80% of the predicted forecast distribution.
Model CRPSS and bias are reported in Fig. 3. The confidence intervals are computed using the paired bootstrapping procedure proposed by Hamill (1999), under the null hypothesis that the naive probabilistic model has the same statistic as the tested model. The resampling was performed 100 times. The CRPSS gains brought by the NN models are significantly larger than those of EMOS at early lead times; this difference shrinks to 2.5% at later time steps. Bias values are computed by comparing the observation to the mean quantile value for models that output quantiles, and to the forecast mean for the EMOS and DRN models. The NN models do not eliminate all biases at longer lead times, though the biases remain below 0.3 K in amplitude.
(left) CRPSS against the naive probabilistic model. (right) Bias of postprocessing models. The shaded areas represent a 5%–95% confidence interval.
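For illustration, a simplified paired bootstrap of a score difference is sketched below; it conveys the idea of resampling matched forecast cases but is not the exact procedure of Hamill (1999):

```python
import numpy as np

def paired_bootstrap_ci(score_model, score_baseline, n_resamples=100, seed=0):
    """Paired bootstrap of a mean score difference; returns the 5%-95% interval."""
    rng = np.random.default_rng(seed)
    n = len(score_model)
    diffs = np.empty(n_resamples)
    for i in range(n_resamples):
        idx = rng.integers(0, n, size=n)  # resample forecast cases with replacement
        diffs[i] = score_model[idx].mean() - score_baseline[idx].mean()
    return np.percentile(diffs, [5, 95])
```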
Figure 4 shows the CRPSS against the naive probabilistic model, aggregated by station. Larger increases in CRPSS are observed in the central and eastern United States. We posit that this reflects the location of some surface temperature biases in the underlying NWP model. The figure shows the results for the DRN. Similar figures for the BQN and QRN are available in the appendix.
CRPSS aggregated by station for all lead times. The model is DRN. The baseline model is the naive probabilistic model.
The forecast calibration is assessed using the rank histograms in Fig. 5. For the LBQ and BQN models, the 99 bins were merged in groups of three so that their rank histograms would be comparable to those of the quantile methods. For the methods producing normal distributions (naive, EMOS, and DRN), the inverse CDF was discretized into 33 bins for the same reason.
Postprocessed forecast rank histogram. For the normal uncertainty representation models (naive, EMOS, and DRN), bin boundaries with uniform probabilities were computed using the inverse CDF of the forecasts. For the LBQ and BQN models, bins were merged in groups of three to allow comparison with the other histograms.
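A sketch of how observation ranks can be computed against predicted quantiles and merged into a common number of bins; the merging rule is a simple proportional grouping chosen for illustration:

```python
import numpy as np

def observation_ranks(quantiles, observations):
    """Rank of each observation among its predicted quantile values (0 .. n_quantiles)."""
    # quantiles: (n_forecasts, n_quantiles), sorted along the last axis
    return (quantiles < observations[:, None]).sum(axis=1)

def rank_histogram(ranks, n_quantiles, n_bins):
    """Counts of observation ranks after merging consecutive ranks into n_bins groups."""
    n_ranks = n_quantiles + 1
    merged = ranks * n_bins // n_ranks
    return np.bincount(merged, minlength=n_bins)
```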
All models have a surplus of observations in the first and last bins, indicating that a fraction of the observations fall at the very edges of the forecast distribution. Other than these missed forecasts, the postprocessing models tend to flatten the central part of the rank histogram when compared to the naive model. The BQN shows interesting shapes at the edges. We postulate that they reflect which distributions can be expressed using a 16th-degree Bernstein polynomial. Even though our validation scores stopped improving for higher-degree polynomials, such polynomials may be worth investigating from a calibration perspective.
c. Behavior toward extremes
We evaluate the behavior of the NN models when NWP forecasts tend toward extreme lows and highs in Fig. 6. Predictions are aggregated according to the percentile of their corresponding forecast at a given station and grouped in bins of two percentiles. The metrics are computed for an initialization time of 0000 UTC and a lead time of 48 h.
Model metrics according to the NWP forecast percentile. Lead time 48 h. Initialization time 0000 UTC. The percentiles are computed stationwise. The forecasts are aggregated in bins of 2 percentiles. The metrics are computed over three periods: (left) full test set (all months), (middle) winter (January and February), and (right) summer (June and July).
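A sketch of this stratification for a single station, assuming aligned arrays of forecasts and per-forecast CRPS values; the empirical-percentile computation and bin handling are our own simplifications:

```python
import numpy as np

def crps_by_forecast_percentile(forecast, crps, n_bins=50):
    """Average CRPS in bins of two forecast percentiles (stationwise input)."""
    # Empirical percentile of each forecast within the station's forecast sample.
    ranks = forecast.argsort().argsort()
    percentile = 100.0 * ranks / (len(forecast) - 1)
    bins = np.minimum((percentile // 2).astype(int), n_bins - 1)  # bins of 2 percentiles
    sums = np.bincount(bins, weights=crps, minlength=n_bins)
    counts = np.bincount(bins, minlength=n_bins)
    return sums / np.maximum(counts, 1)
```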
The naive model has biases when the forecast percentile tends toward the extremes. This is an expected consequence of our forecast-based stratification (Bellier et al. 2017). Although the yearly CRPS curves are flat in the central percentiles, more specific evaluations reveal that all models have degraded predictive performance given very high and very low forecasts within a season. We observe a particularly strong degradation in CRPSS for low percentiles in winter. Upon inspection, these bins are disproportionately populated with forecasts valid from 12 to 17 February 2021. These dates are associated with unusually cold weather events in Texas, which are outside the training distribution.
d. Conditioning for lead time
Our NN models are trained jointly for all lead times, which implies they need to be conditioned for the lead time being predicted. We introduced a lead-time embedding in section 2 for that purpose, as well as the typical lead-time predictor. This section studies the effectiveness of these strategies and analyzes the representation learned by the embedding.
In Table 4, we compare the CRPS values obtained by each strategy on the full testing set. As baselines, we include results for nonconditioned models, as well as another strategy called partitioning, which consists of training separate models for each lead time. We call this strategy partitioning because it effectively splits the dataset into parts. We do not test the embedding on the linear model, because it already uses separate coefficients for each lead time.
Postprocessing model CRPS according to the conditioning strategy used for the lead time. The partition strategy uses a series of models trained on each time step individually. The predictor strategy adds a lead-time predictor. The embedding strategy injects a learned vector in the model input to represent the lead time. Bold values represent the best performing strategy for each postprocessing model.
In all cases, training the postprocessing model jointly for all lead times performed better than training separate models, provided the joint model is conditioned on lead time in one way or another. The lead-time embedding improves performance slightly across all models when compared to using only a predictor. The best results were obtained by using it together with the lead-time predictor, except for the QRN, where the embedding alone performed best. Our dataset does not include diurnal variations: every lead time corresponds to the same time of day. Models trained for multiple lead times per day may see more benefit from the embedding, which would then have the dual purpose of encoding lead time and time of day.
Figure 7 shows the performance of the different conditioning strategies according to the lead time. The confidence intervals are computed using the same bootstrapping strategy described in section 3b. Interestingly, the benefits of training a single model vary by lead time. The improvements are mostly observed around central lead times; at early and late lead times, the joint model performs similarly to or worse than lead-time-specific NNs. We envisage two explanations. The first is related to the statistics of the data: the NN converges to solutions that are well suited to intermediate uncertainties because they are the “mean” case in the dataset. The second is related to predictability. Since predictability is very high at early lead times, the postprocessing must rely heavily on the forecast, which could favor specialized models. Conversely, since predictability is very low at late lead times, it is difficult to do much better than a climatological model. The benefits of training postprocessing models jointly would thus be concentrated in the intermediate lead times.
CRPSS when training a NN for all lead times jointly. The baseline strategy is to train separate models for each lead time. The shaded areas represent 5%–95% confidence intervals.
Lead-time embedding self-similarity. Every row and column corresponds to a learned vector representing a given lead time. The cells represent the degree of similarity between two of these vectors, as measured by the cosine similarity. This embedding was obtained by training a DRN model.
e. Benefits of ensemble forecast for probabilistic postprocessing
The results from the previous subsections show that much of the uncertainty related to the forecast can be recovered statistically with a neural network. A follow-up question is to what extent ensemble members are useful in estimating the forecast distribution, given that it is postprocessed. To investigate this, we run an experiment on the ENS-10 dataset (Ashkboos et al. 2020) in which we progressively add ensemble members to the postprocessing model input. ENS-10 is a 10-member reforecast dataset built from outputs of an operational configuration of the ECMWF IFS model (cycles Cy43r1 and Cy45r1). It spans January 1998 to December 2017, with two forecasts per week. We postprocessed this NWP model for lead times of 1 and 2 days. More detailed information about the dataset is given in Table A1.
To input the NWP-dependent features from multiple ensemble members simultaneously, we applied the linear layer to each member individually and then averaged the resulting vectors. For the DRN model, the initial estimate
CRPSS gain related to adding ensemble members to the postprocessing model input. The dataset is ENS-10. The baseline is training postprocessing using only the control member. The shaded areas represent 5%–95% confidence intervals.
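A minimal PyTorch sketch of this member pooling; averaging after a shared linear layer makes the input permutation invariant and accommodates a variable number of members (class and argument names are ours):

```python
import torch.nn as nn

class MemberPooling(nn.Module):
    """Apply a shared linear layer to each ensemble member, then average over members."""

    def __init__(self, n_features, hidden=256):
        super().__init__()
        self.shared = nn.Linear(n_features, hidden)

    def forward(self, x):
        # x: (batch, n_members, n_features); the member dimension may vary.
        return self.shared(x).mean(dim=1)  # (batch, hidden)
```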
We observe that the impact of supplementary ensemble members increases with lead time. Furthermore, the marginal impact of each additional member seems to diminish. However, it is difficult to draw firm conclusions because our experimental setup is limited to a 48-h lead time. The importance of large ensembles is expected to increase with lead time, but this cannot be verified with the data available for this study.
4. Discussion and conclusions
This work studied probabilistic forecasts produced from a deterministic NWP model. It evaluated the deterministic-to-probabilistic strategy under parametric and quantile-based assumptions with eight combinations of predictive models and uncertainty representations. The best postprocessing models have a CRPSS of about 15% when compared against a naive probabilistic forecast and significantly outperform EMOS for all but the latest lead times. Under extreme conditions, the NN model performance decreases in a way that is expected from a statistical model.
The proposed methodology can be applied to any numerical prediction model for which a sufficient training dataset is available. We believe it has applications in domains that must rely on deterministic models but benefit from probabilistic decision-making, e.g., energy consumption forecasting.
To account for the large number of lead times to be encoded within one model, we introduced a lead-time embedding. This embedding brought modest improvements to the model CRPS. The learned representation has an intuitive post hoc interpretation. Training a postprocessing model jointly for all lead times improved performance overall. The improvements were concentrated around central lead times.
Our results with the NN show low sensitivity to the choice of uncertainty representation in terms of aggregated metrics. However, their calibration exhibits noticeable differences, especially at the tails of the distribution. This suggests the choice of an uncertainty representation should be evaluated with regard to its capacity to represent the full distribution.
While our approach is inexpensive, it still has caveats when compared to a postprocessed ensemble forecast. Our experiment on the ENS-10 dataset shows that adding ensemble members quickly brings important improvements to the CRPS, indicating that supplementary numerical simulations are a robust way to improve the marginal forecast at a station. Perhaps more importantly, our postprocessing strategy assumes independence across all stations and lead times. This breaks spatiotemporal consistency because it makes it impossible to sample from all stations and lead times in a way that is physically realizable (Schulz and Lerch 2022b). Such consistency is an important asset when considering extreme events in downstream applications of the weather forecast. This has been addressed notably with the Schaake shuffle (Clark et al. 2004; Shrestha et al. 2020) and ensemble copula coupling (Schefzik et al. 2013; Lakatos et al. 2023), but these methods assume that the correlation structure between variables can be recovered using historical observations or the NWP ensemble members, respectively. This could be impossible in the presence of unresolved local effects. As such, we identify avenues for future work both inside and outside the independence assumption.
a. Station- and lead-time-independent postprocessing
For station- and lead-time-independent forecasting, we identify two directions in which to extend our work. First, the benefits of training postprocessing models jointly for all lead times could be investigated further. Our experiments showed a tendency for NN postprocessing models to make improvements only at central lead times. We suggested two interpretations of this phenomenon. The first is related to dataset statistics: the model performs best at central lead times because they are at the center of the training distribution. The second is related to the predictability of the weather itself: postprocessing strategies are different enough at early and late lead times that specialized models do better. The latter interpretation makes intuitive sense: one adjusts a 1-day forecast differently than a 10-day forecast to account for predictability. Further experiments could be designed to identify which interpretation is correct. As far as encoding the lead time is concerned, our experiments with the lead-time embedding are moderately conclusive. They did bring modest improvements to the CRPSS, but not always in a statistically significant way. The NN models did not react in the same way to its introduction, with the DRN benefiting more from it than the others. This warrants further experimentation, notably where the forecast validity time changes with lead time. An embedding could let a neural network build a representation that efficiently blends the effects of lead time and time of day on postprocessing.
Second, our short experiment on the ENS-10 dataset showed that the CRPS is quickly and decisively improved by adding ensemble members at the input. This shows that supplementary numerical simulations are a robust way to improve the forecast. The improvement may become even larger with NN models whose architecture specifically leverages spread information from the ensemble. Our experiment was also limited in lead time, and it would be of interest to perform it on more operational models at longer time horizons. Since predictability declines with lead time, we expect the trend where later lead times benefit from more ensemble members to continue. Further experiments are required to determine its shape on larger time horizons.
b. Generative modeling for postprocessing
Other lines of research are available outside the station and lead-time independence assumptions. We believe that the generative modeling literature could help preserve spatiotemporal consistency for postprocessing at stations. In computer vision, generative neural networks have been successful in sampling consistently from large output spaces. Work has already been performed to that effect in postprocessing on grids (Dai and Hemri 2021) and in situ (Chen et al. 2024), showing that spatial correlations can be recovered. We expect similar methods could be used to represent temporal dependencies as well. Questions remain about how to implement this exactly for in situ postprocessing. We expect that a well-adapted architecture will need to encode the spatial relationships between stations and to include a generative component that is not subject to mode collapse and training instability, which are common concerns in generative modeling.
Acknowledgments.
This work was funded in part by Environment and Climate Change Canada, the Computer Research Institute of Montreal, and a Choose France Chair in AI grant from the French government. Experiments were carried out using HPC resources from GENCI-IDRIS (Grant AD011014334).
Data availability statement.
The operational archives used in this study are accessible on demand through the open data access program of the Meteorological Service of Canada (Meteorological Service of Canada 2019). The METAR observations used are freely available in Herzmann (2001–2024). The source code of the models used in this work is available at https://github.com/davidlandry93/pp2023/.
APPENDIX
Additional Results and ENS-10 Dataset Details
Figure A1 shows the spatial distribution of skill gain brought by postprocessing approaches. Tables A1 and A2 provide more details about the ENS-10 dataset.
Skill gain brought by postprocessing models against a debiased baseline. (top) BQN model. (bottom) QRN model.
ENS-10 dataset.
NWP-dependent and NWP-independent predictors used in the ENS-10 dataset (1000–10 hPa denotes vertical levels 1000, 925, 850, 700, 500, 400, 300, 200, 100, 50, and 10 hPa; ✓ = always used).
REFERENCES
Antolik, M. S., 2000: An overview of the National Weather Service’s centralized statistical quantitative precipitation forecasts. J. Hydrol., 239, 306–337, https://doi.org/10.1016/S0022-1694(00)00361-9.
Ashkboos, S., L. Huang, N. Dryden, T. Ben-Nun, P. Dueben, L. Gianinazzi, L. Kummer, and T. Hoefler, 2020: ENS-10: A dataset for ensemble post-processing. ETHZ Scalable Parallel Computing Laboratory Storage, accessed 13 July 2023, https://spclstorage.inf.ethz.ch/projects/deep-weather/ENS10/.
Bellier, J., I. Zin, and G. Bontron, 2017: Sample stratification in verification of ensemble forecasts of continuous scalar variables: Potential benefits and pitfalls. Mon. Wea. Rev., 145, 3529–3544, https://doi.org/10.1175/MWR-D-16-0487.1.
Bi, K., L. Xie, H. Zhang, X. Chen, X. Gu, and Q. Tian, 2023: Accurate medium-range global weather forecasting with 3D neural networks. Nature, 619, 533–538, https://doi.org/10.1038/s41586-023-06185-3.
Bouallègue, Z. B., F. Cooper, M. Chantry, P. Düben, P. Bechtold, and I. Sandu, 2023: Statistical modeling of 2-m temperature and 10-m wind speed forecast errors. Mon. Wea. Rev., 151, 897–911, https://doi.org/10.1175/MWR-D-22-0107.1.
Bremnes, J. B., 2004: Probabilistic forecasts of precipitation in terms of quantiles using NWP model output. Mon. Wea. Rev., 132, 338–347, https://doi.org/10.1175/1520-0493(2004)132<0338:PFOPIT>2.0.CO;2.
Bremnes, J. B., 2019: Constrained quantile regression splines for ensemble postprocessing. Mon. Wea. Rev., 147, 1769–1780, https://doi.org/10.1175/MWR-D-18-0420.1.
Bremnes, J. B., 2020: Ensemble postprocessing using quantile function regression based on neural networks and Bernstein polynomials. Mon. Wea. Rev., 148, 403–414, https://doi.org/10.1175/MWR-D-19-0227.1.
Bremnes, J. B., T. N. Nipen, and I. A. Seierstad, 2023: Evaluation of forecasts by a global data-driven weather model with and without probabilistic post-processing at Norwegian stations. arXiv, 2309.01247v1, https://doi.org/10.48550/arXiv.2309.01247.
Bröcker, J., 2012: Evaluating raw ensembles with the continuous ranked probability score. Quart. J. Roy. Meteor. Soc., 138, 1611–1617, https://doi.org/10.1002/qj.1891.
Buehner, M., and Coauthors, 2015: Implementation of deterministic weather forecasting systems based on ensemble–variational data assimilation at environment Canada. Part I: The global system. Mon. Wea. Rev., 143, 2532–2559, https://doi.org/10.1175/MWR-D-14-00354.1.
Cannon, A. J., 2018: Non-crossing nonlinear regression quantiles by monotone composite quantile regression neural network, with application to rainfall extremes. Stochastic Environ. Res. Risk Assess., 32, 3207–3225, https://doi.org/10.1007/s00477-018-1573-6.
Chen, J., T. Janke, F. Steinke, and S. Lerch, 2024: Generative machine learning methods for multivariate ensemble post-processing. Ann. Appl. Stat., 18, 159–183, https://doi.org/10.1214/23-AOAS1784.
Clark, M., S. Gangopadhyay, L. Hay, B. Rajagopalan, and R. Wilby, 2004: The Schaake shuffle: A method for reconstructing space–time variability in forecasted precipitation and temperature fields. J. Hydrometeor., 5, 243–262, https://doi.org/10.1175/1525-7541(2004)005<0243:TSSAMF>2.0.CO;2.
Dai, Y., and S. Hemri, 2021: Spatially coherent postprocessing of cloud cover ensemble forecasts. Mon. Wea. Rev., 149, 3923–3937, https://doi.org/10.1175/MWR-D-21-0046.1.
Demaeyer, J., and Coauthors, 2023: The EUPPBench postprocessing benchmark dataset v1.0. Earth Syst. Sci. Data, 15, 2635–2653, https://doi.org/10.5194/essd-15-2635-2023.
Espeholt, L., and Coauthors, 2022: Deep learning for twelve hour precipitation forecasts. Nat. Commun., 13, 5145, https://doi.org/10.1038/s41467-022-32483-x.
Finn, T. S., 2021: Self-attentive ensemble transformer: Representing ensemble interactions in neural networks for Earth system models. arXiv, 2106.13924v2, https://doi.org/10.48550/arXiv.2106.13924.
Glahn, H. R., and D. A. Lowry, 1972: The use of Model Output Statistics (MOS) in objective weather forecasting. J. Appl. Meteor., 11, 1203–1211, https://doi.org/10.1175/1520-0450(1972)011<1203:TUOMOS>2.0.CO;2.
Gneiting, T., and A. E. Raftery, 2007: Strictly proper scoring rules, prediction, and estimation. J. Amer. Stat. Assoc., 102, 359–378, https://doi.org/10.1198/016214506000001437.
Gneiting, T., A. E. Raftery, A. H. Westveld III, and T. Goldman, 2005: Calibrated probabilistic forecasting using ensemble model output statistics and minimum CRPS estimation. Mon. Wea. Rev., 133, 1098–1118, https://doi.org/10.1175/MWR2904.1.
Hamill, T. M., 1999: Hypothesis tests for evaluating numerical precipitation forecasts. Wea. Forecasting, 14, 155–167, https://doi.org/10.1175/1520-0434(1999)014<0155:HTFENP>2.0.CO;2.
Hamill, T. M., 2021: Comparing and combining deterministic surface temperature postprocessing methods over the United States. Mon. Wea. Rev., 149, 3289–3298, https://doi.org/10.1175/MWR-D-21-0027.1.
Herzmann, D., 2001: ASOS-AWOS-METAR data download. Iowa Environmental Mesonet, accessed 25 August 2023, https://mesonet.agron.iastate.edu/request/download.phtml.
Hewson, T. D., and F. M. Pillosu, 2021: A low-cost post-processing technique improves weather forecasts around the world. Commun. Earth Environ., 2, 132, https://doi.org/10.1038/s43247-021-00185-9.
Höhlein, K., B. Schulz, R. Westermann, and S. Lerch, 2024: Postprocessing of ensemble weather forecasts using permutation-invariant neural networks. Artif. Intell. Earth Syst., 3, e230070, https://doi.org/10.1175/AIES-D-23-0070.1.
Kingma, D. P., and J. Ba, 2014: Adam: A method for stochastic optimization. arXiv, 1412.6980v9, https://doi.org/10.48550/arXiv.1412.6980.
Lakatos, M., S. Lerch, S. Hemri, and S. Baran, 2023: Comparison of multivariate post-processing methods using global ECMWF ensemble forecasts. Quart. J. Roy. Meteor. Soc., 149, 856–877, https://doi.org/10.1002/qj.4436.
Lam, R., and Coauthors, 2022: GraphCast: Learning skillful medium-range global weather forecasting. arXiv, 2212.12794v2, https://doi.org/10.48550/arXiv.2212.12794.
McTaggart-Cowan, R., and Coauthors, 2019: Modernization of atmospheric physics parameterization in Canadian NWP. J. Adv. Model. Earth Syst., 11, 3593–3635, https://doi.org/10.1029/2019MS001781.
Meteorological Service of Canada, 2019: GDPS operational archive. Meteorological Service of Canada Open Data Access Program, accessed 1 November 2022, https://eccc-msc.github.io/open-data/cost-recovered/readme_en/.
Paszke, A., and Coauthors, 2019: PyTorch (version 2.0.1). The Linux Foundation, accessed 8 May 2023, https://pytorch.org.
Pathak, J., and Coauthors, 2022: FourCastNet: A global data-driven high-resolution weather model using adaptive Fourier neural operators. arXiv, 2202.11214v1, https://doi.org/10.48550/arXiv.2202.11214.
Ramachandran, P., B. Zoph, and Q. V. Le, 2017: Searching for activation functions. arXiv, 1710.05941v2, https://doi.org/10.48550/arXiv.1710.05941.
Rasp, S., and S. Lerch, 2018: Neural networks for postprocessing ensemble weather forecasts. Mon. Wea. Rev., 146, 3885–3900, https://doi.org/10.1175/MWR-D-18-0187.1.
Schefzik, R., T. L. Thorarinsdottir, and T. Gneiting, 2013: Uncertainty quantification in complex simulation models using ensemble copula coupling. Stat. Sci., 28, 616–640, https://doi.org/10.1214/13-STS443.
Schulz, B., and S. Lerch, 2022a: Aggregating distribution forecasts from deep ensembles. arXiv, 2204.02291v1, https://doi.org/10.48550/ARXIV.2204.02291.
Schulz, B., and S. Lerch, 2022b: Machine learning methods for postprocessing ensemble forecasts of wind gusts: A systematic comparison. Mon. Wea. Rev., 150, 235–257, https://doi.org/10.1175/MWR-D-21-0150.1.
Shrestha, D. L., D. E. Robertson, J. C. Bennett, and Q. J. Wang, 2020: Using the Schaake shuffle when calibrating ensemble means can be problematic. J. Hydrol., 587, 124991, https://doi.org/10.1016/j.jhydrol.2020.124991.
Smith, L. N., and N. Topin, 2018: Super-convergence: Very fast training of neural networks using large learning rates. arXiv, 1708.07120v3, https://doi.org/10.48550/arXiv.1708.07120.
Taillardat, M., 2021: Skewed and mixture of Gaussian distributions for ensemble postprocessing. Atmosphere, 12, 966, https://doi.org/10.3390/atmos12080966.
Taillardat, M., O. Mestre, M. Zamo, and P. Naveau, 2016: Calibrated ensemble forecasts using quantile regression forests and ensemble model output statistics. Mon. Wea. Rev., 144, 2375–2393, https://doi.org/10.1175/MWR-D-15-0260.1.
Vannitsem, S., and Coauthors, 2021: Statistical postprocessing for weather forecasts: Review, challenges, and avenues in a big data world. Bull. Amer. Meteor. Soc., 102, E681–E699, https://doi.org/10.1175/BAMS-D-19-0308.1.
Veldkamp, S., K. Whan, S. Dirksen, and M. Schmeits, 2021: Statistical postprocessing of wind speed forecasts using convolutional neural networks. Mon. Wea. Rev., 149, 1141–1152, https://doi.org/10.1175/MWR-D-20-0219.1.
Zamo, M., and P. Naveau, 2018: Estimation of the continuous ranked probability score with limited information and applications to ensemble weather forecasts. Math. Geosci., 50, 209–234, https://doi.org/10.1007/s11004-017-9709-7.