Leveraging Deterministic Weather Forecasts for In Situ Probabilistic Temperature Predictions via Deep Learning

David Landry, INRIA, Paris, France (https://orcid.org/0000-0001-5343-2235)

Anastase Charantonis, INRIA, Paris, France; ENSIIE, Evry, France; LAMME, Evry, France; LOCEAN/IPSL, Paris, France

Claire Monteleoni, INRIA, Paris, France; University of Colorado Boulder, Boulder, Colorado

Abstract

We propose a neural network approach to produce probabilistic weather forecasts from a deterministic numerical weather prediction. Our approach is applied to operational surface temperature outputs from the Global Deterministic Prediction System up to 10-day lead times, targeting METAR observations in Canada and the United States. We show how postprocessing performance is improved by training a single model for multiple lead times. Multiple strategies to condition the network for the lead time are studied, including a supplementary predictor and an embedding. The proposed model is evaluated for accuracy, spread, distribution calibration, and its behavior under extremes. The neural network approach decreases the continuous ranked probability score (CRPS) by 15% and has improved distribution calibration compared to a naive probabilistic model based on past forecast errors. Our approach increases the value of a deterministic forecast by adding information about the uncertainty, without incurring the cost of simulating multiple trajectories. It applies to any gridded forecast including the recent machine learning–based weather prediction models. It requires no information regarding forecast spread and can be trained to generate probabilistic predictions from any deterministic forecast.

Significance Statement

Weather is difficult to predict a long time in advance because we cannot measure the state of the atmosphere precisely enough. Consequently, it is common practice to run forecasts several times and look at the differences to evaluate how uncertain the prediction is. This process of running ensemble forecasts is expensive and consequently not always feasible. We propose a middle ground where we add uncertainty information to forecasts that were run only once, using artificial intelligence. Our method increases the value of these forecasts by adding information about the uncertainty without incurring the cost of multiple full simulations.

© 2024 American Meteorological Society. This published article is licensed under the terms of the default AMS reuse license. For information regarding reuse of this content and general copyright information, consult the AMS Copyright Policy (www.ametsoc.org/PUBSReuseLicenses).

Corresponding author: David Landry, david.landry@inria.fr

1. Introduction

Weather forecast postprocessing is a key component in many operational forecasting systems. When compared with in situ observations, numerical weather prediction (NWP) models produce systematically biased forecasts, partly due to the presence of unresolved phenomena at finer scales. This encourages the study of statistical postprocessing methods that correct these biases by training on past forecasting errors (Vannitsem et al. 2021). Earlier postprocessing models were used to generate probabilistic predictions from deterministic model outputs, notably for precipitation forecasting (Antolik 2000). With the advent of ensemble forecasts, this approach has received less attention in recent years. However, deterministic forecasts are still being produced in many operational centers (Hamill 2021). This is done at shorter lead times, for which deterministic forecasts are more appropriate. It can also be motivated by computational constraints when exploring different points along the trade-off between ensemble size and model resolution.

We propose a deterministic-to-probabilistic approach that recovers a deterministic model’s uncertainty by training on past observations using neural networks (NNs). Since it targets surface stations, it applies to any gridded NWP including machine learning weather prediction models (Lam et al. 2022; Pathak et al. 2022; Bi et al. 2023). Our approach requires no input regarding forecast spread and infers all uncertainties from its training.

This contribution arises in a dynamic context for weather forecast postprocessing (Vannitsem et al. 2021). A topic of active discussion is how the distribution is predicted, i.e., the model that determines the shape of the output distribution. Notable approaches include linear models (Gneiting et al. 2005), random forests (Taillardat et al. 2016), fully connected neural networks (Rasp and Lerch 2018), and convolutional neural networks (Veldkamp et al. 2021).

Another topic is how the forecast uncertainty is represented, that is, how the output of the predictive model is cast into a CDF. Deterministic postprocessing is being studied at shorter lead times (Hamill 2021). For probabilistic outputs, solutions revolve around parametric (Taillardat 2021; Demaeyer et al. 2023) and nonparametric (Bremnes 2020; Hewson and Pillosu 2021) methods (also called distribution-based and distribution-free methods, respectively). Among the nonparametric methods, Bremnes (2020) proposes a Bernstein quantile network (BQN) that models the quantile function of the probabilistic forecast as a Bernstein polynomial. This approach yields improvements over previous methods for surface wind speed fields while avoiding strong assumptions about the distribution of the predicted value. Numerous parametric and nonparametric methods were systematically reviewed for wind gust forecasting by Schulz and Lerch (2022b), using a modular framework that decouples how the distribution is predicted (i.e., using a linear model or a neural network) from how it is represented (i.e., using a normal distribution or quantile regression).

Finally, recent works have trained postprocessing models separately for all lead times (Rasp and Lerch 2018; Bremnes 2020; Schulz and Lerch 2022b) or jointly in a single model (Bouallègue et al. 2023). In the latter, it was observed that the lead-time predictor was not always retained by a feature selection process for deterministic forecast postprocessing. As such, there are discussions regarding how to best condition a postprocessing model for the lead time.

The deterministic-to-probabilistic approach has seen renewed interest in recent work. Veldkamp et al. (2021) proposed a convolutional network to postprocess wind speed forecasts from a high-resolution deterministic model. Their prediction is done at a single lead time. Bouallègue et al. (2023) proposed a two-step postprocessing pipeline where a second model is trained to predict the residual error of the first model. The residual error model provides an estimation of the uncertainty of a deterministic forecast but does not express its probability distribution. Bremnes et al. (2023) applied a BQN to deterministic weather forecasts produced by deep neural networks (Bi et al. 2023), making a probabilistic forecast from a deterministic NWP. Their work studies only a quantile-based method and invites the evaluation of other approaches. Demaeyer et al. (2023) introduced the atmosphere network (ANET) which predicts a parametric forecast using a single network for all stations and lead times. It is compatible with variable member counts.

We perform further examinations by producing probabilistic forecasts based on a deterministic global model, with up to 10-day lead times. This is done on a dataset built from the Global Deterministic Prediction System (GDPS) NWP model (Buehner et al. 2015), targeting METAR surface temperature observations in Canada and the United States (Herzmann 2001). We study neural network models for this purpose and compare them to linear models. The forecast uncertainty is expressed using either a normal distribution, a set of quantile values, or a quantile function built from Bernstein polynomials. The neural network models are trained jointly for all lead times. We compare multiple strategies to condition for it, including a lead-time embedding that has shown usefulness in recent work (Espeholt et al. 2022). Finally, we evaluate the behavior of our postprocessing method under extreme temperatures and compare its behavior to that of the NWP model.

The next section describes our experimental framework in more detail, including models, datasets, and evaluation methods. Section 3 contains our experiments that evaluate forecast performance, calibration, and its behavior under extreme events. This is followed by a discussion about the proposed postprocessing methods and our concluding remarks in section 4.

2. Methods

a. Data

We perform postprocessing of surface temperature fields over the operational output of the GDPS NWP model (Buehner et al. 2015; Meteorological Service of Canada 2019) and target observations from the METAR network. We use model outputs initialized at 0000 and 1200 UTC to perform postprocessing daily up to 10-day lead times.

For training, we use forecasts initialized from 1 January 2019 to 31 December 2020. This period contains a major update of the GDPS model in July 2019, where the model horizontal resolution was increased from 25 to 15 km (McTaggart-Cowan et al. 2019). We still retain the earlier forecasts, since removing them led to a slight decrease in the validation score.

Forecasts initialized on the first 25 days of each month are used for the training itself, while the others are used for validation. This validation strategy is not fully independent because late lead times from the training set overlap with early lead times from the validation set. We choose this compromise to ensure good coverage of the seasonal cycle. Forecasts from 1 January 2021 to 30 November 2021 are used for testing. We note that our testing period for the GDPS dataset contains two extreme weather phenomena, a wave of extreme cold temperatures in Texas in February, and a heat wave over western North America in June and July.

We use 18 NWP-dependent predictors from the GDPS dataset, as well as seven NWP-independent predictors. Table 1 contains the full predictor list. The set of NWP-dependent predictors is relatively small due to constraints in accessing and storing operational outputs.

Table 1. NWP-dependent and NWP-independent predictors used for postprocessing. = Always used; * = used unless specified otherwise.

All features are scaled using their mean and variance over the training set to improve normality. This is done stationwise. Before scaling, we apply a logarithm transformation on positively defined variables (albedo, precipitation, and wind speed). We provide the day-of-year predictor twice, encoded with sin and cos transforms, to represent periodicity.
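
To make these preprocessing steps concrete, here is a minimal Python/pandas sketch. The long-format layout and the column names (station, albedo, precip, wind_speed, day_of_year, t2m_nwp) are our own illustrative assumptions, and we use log1p rather than a plain logarithm to handle zero values, a detail the text does not specify.

```python
import numpy as np
import pandas as pd

def preprocess(df: pd.DataFrame, is_train: pd.Series) -> pd.DataFrame:
    """Sketch of the feature preprocessing; column names are illustrative."""
    df = df.copy()
    # Log transform on positively defined variables before scaling.
    for col in ["albedo", "precip", "wind_speed"]:
        df[col] = np.log1p(df[col])
    # Day of year encoded twice (sin and cos) to represent periodicity.
    angle = 2.0 * np.pi * df["day_of_year"] / 365.25
    df["doy_sin"], df["doy_cos"] = np.sin(angle), np.cos(angle)
    # Stationwise standardization using training-set statistics only.
    features = ["t2m_nwp", "albedo", "precip", "wind_speed"]
    stats = df.loc[is_train].groupby("station")[features].agg(["mean", "std"])
    for col in features:
        mu = df["station"].map(stats[(col, "mean")])
        sigma = df["station"].map(stats[(col, "std")])
        df[col] = (df[col] - mu) / sigma
    return df
```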

We target observations from the METAR network harvested from the Iowa State University Environmental Mesonet (Herzmann 2001–2024). The dataset consists of observations from 1066 stations spread across North America. The NWP forecasts were interpolated to stations using the nearest grid point. Stations were selected based on data availability for the periods covered by the NWP forecasts. We removed 250 observations (∼0.01%) that reported temperatures more than 15 K off their corresponding 24-h forecast after debiasing. The bias values were computed by averaging model errors over the training set, separately for each station, initialization time, lead time, and month. This process is meant to account for sensor errors.

b. Postprocessing models

We devise a naive probabilistic forecast as a baseline model. First, the NWP forecasts are debiased using the process described in section 2a. Second, the standard deviation of forecast errors is computed using the same aggregation (separately by station, initialization time, lead time, and month). Our baseline probabilistic forecast is a normal distribution centered around the debiased NWP forecast and scaled by the computed standard deviation.
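
As an illustration, this baseline can be fit with a simple group-by over the training errors. The long-format DataFrame and its column names (station, init_hour, lead_time, month, t2m_nwp, obs) are hypothetical, not the authors' code.

```python
import pandas as pd
import scipy.stats as st

KEYS = ["station", "init_hour", "lead_time", "month"]

def fit_naive_baseline(train: pd.DataFrame) -> pd.DataFrame:
    # Bias and standard deviation of past forecast errors per aggregation cell.
    errors = train.assign(error=train["t2m_nwp"] - train["obs"])
    return errors.groupby(KEYS)["error"].agg(bias="mean", sigma="std").reset_index()

def predict_naive(test: pd.DataFrame, stats: pd.DataFrame) -> pd.DataFrame:
    out = test.merge(stats, on=KEYS, how="left")
    # Normal distribution centered on the debiased NWP forecast.
    out["mu"] = out["t2m_nwp"] - out["bias"]
    # Example: 10th and 90th percentiles of the baseline forecast distribution.
    out["q10"] = st.norm.ppf(0.10, loc=out["mu"], scale=out["sigma"])
    out["q90"] = st.norm.ppf(0.90, loc=out["mu"], scale=out["sigma"])
    return out
```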

Our other methods use machine learning to generate probabilistic forecasts from NWP. They are summarized in Table 2. We follow a decoupled approach (Schulz and Lerch 2022b) where we distinguish how the forecast uncertainty is predicted from how it is represented. We study two predictive models: a linear model and a NN model. We study four uncertainty representations: a naive deterministic representation, a normal distribution, a set of quantiles, and a quantile function posed as a Bernstein polynomial.

Table 2. Summary table of the postprocessing models considered in this work.

This yields eight different models which are illustrated in Fig. 1. The models use NWP-dependent and NWP-independent predictive features x. These features are used by a predictive model to build a vector θ which determines the predictive distribution. One such vector is produced for each station and lead time. It defines the probabilistic forecast, for instance, by providing its distribution parameters or a set of quantile values. The length of θ changes according to which uncertainty representation is used.

Fig. 1. (top) Our model architecture. Each forecast is made using one of two parameter prediction models and one of four uncertainty representations. (bottom left) The linear model predicts a parameter vector θ using one linear layer. (bottom right) The NN predicts a parameter vector θ by processing the predictors through a MLP. It has embeddings for station and lead time, which are added to the output of the first linear layer.

1) Predictive models

Our first parameter prediction model linearly maps the predictors x to θ. It has separate coefficients and biases for each station, initialization hour, and lead time.

Our second model is a NN model centered around a multilayer perceptron (MLP), as introduced for postprocessing by Rasp and Lerch (2018). The hidden layer size is kept constant throughout. The sigmoid linear unit (SiLU) activation function (Ramachandran et al. 2017) was used. Batch norm layers were added after the linear layers since we observed they accelerate convergence during training.

It is common to train the NN model jointly for all stations and then condition for the station with an embedding (Rasp and Lerch 2018). An embedded station vector is learned for each station. It has the same size as the hidden layers in the MLP. It is added to the network activations after the first linear layer, but before the batch norm and activation layers are applied. Doing the addition in that location is equivalent to concatenating one-hot encoded predictors to x representing the station identity. Our embedding approach has a smaller memory footprint because it avoids padding x with numerous and sparsely populated features, which facilitates implementation.

Since our model is trained for 10 lead times on the GDPS dataset, we need to condition for the lead time as well. We use two strategies: We add a predictor to x corresponding to the lead time (rescaled from 0 to 1). We also learn a lead-time embedding implemented in the same way as the station embedding. This lets the network not only adapt its postprocessing strategy to the lead time (e.g., by increasing uncertainty) but also identify correlation structures that are common across lead times. The lead-time embedding was already proposed in precipitation forecasting (Espeholt et al. 2022).
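
A minimal PyTorch sketch of this predictive model follows, with the station and lead-time embeddings added to the activations after the first linear layer, before the batch norm and activation. The class name and exact layer ordering are our assumptions; the sizes follow section 3a (four hidden layers of size 256).

```python
import torch
import torch.nn as nn

class PostprocessingMLP(nn.Module):
    """Sketch of the NN predictive model with station and lead-time embeddings."""

    def __init__(self, n_features: int, n_stations: int, n_lead_times: int,
                 n_out: int, hidden: int = 256, n_hidden_layers: int = 4):
        super().__init__()
        self.first = nn.Linear(n_features, hidden)
        self.station_embedding = nn.Embedding(n_stations, hidden)
        self.lead_time_embedding = nn.Embedding(n_lead_times, hidden)
        blocks = []
        for _ in range(n_hidden_layers - 1):
            blocks += [nn.Linear(hidden, hidden), nn.BatchNorm1d(hidden), nn.SiLU()]
        self.trunk = nn.Sequential(nn.BatchNorm1d(hidden), nn.SiLU(), *blocks)
        self.head = nn.Linear(hidden, n_out)  # n_out depends on the uncertainty representation

    def forward(self, x, station_idx, lead_time_idx):
        h = self.first(x)
        # Embeddings are added before the batch norm and activation layers.
        h = h + self.station_embedding(station_idx) + self.lead_time_embedding(lead_time_idx)
        return self.head(self.trunk(h))
```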

2) Uncertainty representation

Four ways to represent the postprocessed forecast uncertainty are considered. Each uncertainty representation method accepts a parameter vector θ from one of the predictive models. Each method has a corresponding loss function so that the models can be trained against observations.

In the base case, we consider a one-parameter output and use it as a deterministic forecast. When the underlying predictive model is linear, we refer to this method as model output statistics (MOS) (Glahn and Lowry 1972). When it is a neural network, we refer to it as a deterministic NN (DNN). These models are trained using the root-mean-square error (RMSE) loss function.

Our second approach follows from ensemble MOS (EMOS) (Gneiting et al. 2005) and distribution regression networks (DRNs) (Rasp and Lerch 2018). Here, θ ∈ ℝ⁴ such that the postprocessed forecast Y_obs is

Y_\mathrm{obs} \sim \mathcal{N}\left\{\theta_1 x_\mathrm{nwp} + \theta_2,\ \exp\left[\theta_3 \log(\hat{\sigma}) + \theta_4\right]\right\},    (1)

where x_nwp is the raw NWP forecast for surface temperature and σ̂ is an initial guess for the forecast distribution standard deviation. The initial guess σ̂ would typically be the ensemble standard deviation if an ensemble forecast was being postprocessed. In the absence of an ensemble forecast, it is substituted with the standard deviation of forecast errors as defined for the naive probabilistic model (aggregated by station, initialization time, lead time, and month). We use the exponent and logarithm on the scale parameter to preserve positiveness. We fit this model using the continuous ranked probability score (CRPS) as a loss function. A derivation of the CRPS for normal distributions is provided in section 2d.
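
The sketch below illustrates how the four-parameter output θ could be mapped to the forecast mean and standard deviation of Eq. (1); tensor names and shapes are illustrative assumptions, not the authors' implementation.

```python
import torch

def drn_distribution(theta: torch.Tensor, x_nwp: torch.Tensor, sigma_hat: torch.Tensor):
    """Map theta (batch, 4) to the normal forecast of Eq. (1).
    x_nwp and sigma_hat have shape (batch,)."""
    mu = theta[:, 0] * x_nwp + theta[:, 1]
    # Exponent and logarithm on the scale parameter preserve positivity.
    sigma = torch.exp(theta[:, 2] * torch.log(sigma_hat) + theta[:, 3])
    return mu, sigma
```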

The quantile regression approach consists in predicting θ ∈ ℝⁿ containing the n quantile values at positions τ ∈ {1/(n+1), 2/(n+1), …, n/(n+1)}. We refer to this method as linear quantile regression (LQR) in its linear variant (Bremnes 2004) and quantile regression network (QRN) in its NN variant. We fit the quantiles using the quantile loss defined in section 2d. Neural quantile regression has been approached with architectures that enforce quantile monotonicity by construction (Cannon 2018). We adopt an empirical approach where the quantiles are reordered after each prediction. Note that the quantile locations described here are not CRPS optimal. They were chosen to train models that target a uniform rank histogram (Bröcker 2012) to simplify interpretation.
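
A sketch of the quantile (pinball) loss with the empirical reordering described above follows; function names and tensor shapes are illustrative.

```python
import torch

def quantile_loss(pred_quantiles: torch.Tensor, y: torch.Tensor, tau: torch.Tensor):
    """Pinball loss averaged over quantile levels.
    pred_quantiles: (batch, n); y: (batch,); tau: (n,)."""
    # Empirically enforce monotonicity by reordering the predicted quantiles.
    pred_quantiles, _ = torch.sort(pred_quantiles, dim=-1)
    diff = y.unsqueeze(-1) - pred_quantiles
    return torch.maximum(tau * diff, (tau - 1.0) * diff).mean()

# Quantile levels {1/(n+1), ..., n/(n+1)}, chosen to target a uniform rank histogram.
n = 32
tau = torch.arange(1, n + 1) / (n + 1)
```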

The Bernstein polynomial approach consists in learning a Bernstein polynomial that represents the forecast distribution quantile function (Bremnes 2020). We call these approaches linear Bernstein quantiles (LBQs) and BQN when the coefficients are predicted by a linear model and a NN, respectively. The quantile function Q(τ) is parameterized by a vector θ of coefficients θ_j such that

Q(\tau) = \sum_{j=0}^{d} \theta_j \binom{d}{j} \tau^{j} (1 - \tau)^{d-j},    (2)

where d is the degree of the Bernstein polynomial and \binom{d}{j} is the binomial coefficient. One such polynomial is produced for each forecast. To compute loss values from Q(τ), we sample it at n = 98 evenly spread locations such that τ ∈ {1/(n+1), 2/(n+1), …, n/(n+1)}. Then, the sampled values are evaluated using the quantile loss. In contrast with our LQR and QRN models, θ does not contain quantile values directly, but rather the coefficients that shape the quantile function.

Bernstein polynomials have shape-preserving properties: a monotonic sequence of coefficients θ_j yields a monotonic function Q(τ), an expected property for quantile functions. We enforce this by sorting the coefficients θ_j after they are predicted. We observed that without coefficient ordering, the BQN would sometimes converge to solutions that have good validation scores but poor calibration due to jagged quantile functions.
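
The following sketch evaluates Eq. (2) with sorted coefficients and samples it at the n = 98 levels used for the loss; the function name and tensor layout are our assumptions.

```python
import math
import torch

def bernstein_quantile_function(theta: torch.Tensor, tau: torch.Tensor) -> torch.Tensor:
    """Evaluate Q(tau) of Eq. (2). theta: (batch, d+1) coefficients; tau: (n,) levels."""
    # Sorting the coefficients guarantees a monotonically increasing Q(tau).
    theta, _ = torch.sort(theta, dim=-1)
    d = theta.shape[-1] - 1
    j = torch.arange(d + 1, dtype=theta.dtype)
    binom = torch.tensor([math.comb(d, k) for k in range(d + 1)], dtype=theta.dtype)
    # Bernstein basis evaluated at every level: shape (n, d+1).
    basis = binom * tau.unsqueeze(-1) ** j * (1.0 - tau.unsqueeze(-1)) ** (d - j)
    return theta @ basis.T  # predicted quantile values, shape (batch, n)

n = 98
tau = torch.arange(1, n + 1) / (n + 1)  # evenly spread sampling levels
```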

3) Combining predictions

NN methods converge to different models depending on their random weight initialization and the random composition of training batches. To account for this, NN-based postprocessing models are typically trained multiple times and their predictions are combined to create the final distribution (Schulz and Lerch 2022a). We follow this trend and train each NN five times. To combine predictions, we average the parameter vectors θ across models. For the DRN, this is equivalent to averaging distribution parameters as is done by Rasp and Lerch (2018). For the QRN, this is equivalent to uniform weight quantile averaging, also known as Vincentization (Schulz and Lerch 2022a). This is also equivalent for the BQN despite the fact that the values of θ represent polynomial coefficients instead of quantile values (Schulz and Lerch 2022b).
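
Combining the five runs then amounts to averaging their parameter vectors, as in the short sketch below, which reuses the hypothetical model interface sketched earlier in this section.

```python
import torch

def combine_predictions(models, x, station_idx, lead_time_idx) -> torch.Tensor:
    # Average the parameter vectors theta of the independently trained networks.
    thetas = [m(x, station_idx, lead_time_idx) for m in models]
    return torch.stack(thetas, dim=0).mean(dim=0)
```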

c. Training

All models are implemented in PyTorch (Paszke et al. 2019), a scientific computation framework well suited to deep learning applications. They are trained using the Adam optimizer (Kingma and Ba 2014). They are trained for 100 epochs with the OneCycleLR training scheduler (Smith and Topin 2018). The maximal learning rate was 10−3 for the linear model and 5 × 10−4 for the NN. A weight decay of 10−5 was used in both instances.
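
For illustration, the optimization setup could be wired as follows. The data loader contents, loss function, and batching scheme are assumptions; the learning rate shown is the NN value.

```python
import torch

def train(model, train_loader, loss_fn, epochs=100, max_lr=5e-4, weight_decay=1e-5):
    optimizer = torch.optim.Adam(model.parameters(), lr=max_lr, weight_decay=weight_decay)
    scheduler = torch.optim.lr_scheduler.OneCycleLR(
        optimizer, max_lr=max_lr, epochs=epochs, steps_per_epoch=len(train_loader)
    )
    for _ in range(epochs):
        for x, station_idx, lead_idx, y in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x, station_idx, lead_idx), y)
            loss.backward()
            optimizer.step()
            scheduler.step()  # the one-cycle schedule steps once per batch
    return model
```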

Each model was trained in four variants, corresponding to each of the uncertainty representations described in section 2b. We optimize variant-specific details for each model via a manual grid search (i.e., degree of Bernstein polynomial and number of predicted quantiles). The shape of the shared architecture (embedding size and number of fully connected layers) was kept fixed across the variants to facilitate model intercomparison.

d. Evaluation

To evaluate the quality of our probabilistic forecasts, we first use the CRPS metric (Gneiting and Raftery 2007). For EMOS and DRN, we relate an observation y to the predicted normal distribution with mean μ and standard deviation σ using its closed-form expression (Gneiting et al. 2005) where
\mathrm{CRPS}_\mathrm{norm} = \sigma \left\{ \frac{y - \mu}{\sigma} \left[ 2 F\!\left( \frac{y - \mu}{\sigma} \right) - 1 \right] + 2 f\!\left( \frac{y - \mu}{\sigma} \right) - \frac{1}{\sqrt{\pi}} \right\}.    (3)

Here, F and f are the CDF and the PDF of the standard normal distribution, respectively.
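
A direct PyTorch transcription of Eq. (3) is sketched below (function and variable names are ours).

```python
import math
import torch

def crps_normal(mu: torch.Tensor, sigma: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Closed-form CRPS of a N(mu, sigma) forecast against observation y, Eq. (3)."""
    z = (y - mu) / sigma
    std_normal = torch.distributions.Normal(torch.zeros_like(z), torch.ones_like(z))
    cdf, pdf = std_normal.cdf(z), torch.exp(std_normal.log_prob(z))
    return sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))
```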
For quantile-based forecasts, we interpret the quantile values as an ensemble forecast. In that situation, we compute the CRPS using
\mathrm{CRPS}_\mathrm{ens} = \frac{1}{n} \sum_{i=1}^{n} |e_i - y| - \frac{1}{2 n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} |e_i - e_j|,    (4)
for an n member ensemble forecast {e1, …, en}. Our work compares the performance of distributions for which we have closed-form CRPS expressions (EMOS and DRN) against distributions where it is approximated numerically (LBQ and BQN). Care must be taken in the choice of CRPS estimator because it should be made according to the properties of the ensemble forecast. Equation (4) is a better option here because the quantile values produced by our models are not exchangeable (Zamo and Naveau 2018). However, it is known to have biases for small numbers of quantiles, which should be kept in mind when analyzing our results. The CRPS can also be evaluated with respect to a reference method, in which case it becomes the continuous ranked probability skill score (CRPSS), defined as
\mathrm{CRPSS} = 1 - \frac{\mathrm{CRPS}_\mathrm{model}}{\mathrm{CRPS}_\mathrm{baseline}}.    (5)
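
Equations (4) and (5) can be transcribed as follows (a sketch; shapes and names are illustrative).

```python
import torch

def crps_ensemble(members: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Estimator of Eq. (4). members: (batch, n) quantile values treated as an
    ensemble; y: (batch,) observations."""
    n = members.shape[-1]
    term1 = (members - y.unsqueeze(-1)).abs().mean(dim=-1)
    pairwise = (members.unsqueeze(-1) - members.unsqueeze(-2)).abs()
    return term1 - pairwise.sum(dim=(-1, -2)) / (2.0 * n**2)

def crpss(crps_model: float, crps_baseline: float) -> float:
    """Skill score of Eq. (5) against a reference method."""
    return 1.0 - crps_model / crps_baseline
```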
We measure forecast accuracy at the tails using the quantile loss QL_τ defined as

\mathrm{QL}_\tau = \begin{cases} \tau\,(y - q_\tau), & q_\tau \le y \\ (1 - \tau)\,(q_\tau - y), & q_\tau > y, \end{cases}    (6)

where τ is the quantile level being evaluated, q_τ is the corresponding quantile value, and y the observation. For normal distribution forecasts, q_τ is computed exactly using the inverse CDF. For quantile-based approaches, we use a linear interpolation between the two nearest available quantiles.

To measure forecast uncertainty, the composite spread is used as proposed by Bremnes (2019). For quantile forecasts, the smallest quantile interval widths are summed until they cover the desired probability range (here 80%). A linear interpolation is used in the final interval to reach our desired probability interval more precisely. For normal distribution forecasts, this reduces to computing the distance between the 10th and 90th percentiles, as computed using the inverse CDF. Finally, we assess the calibration of the predicted distributions using rank histograms.
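
A sketch of one possible implementation of the composite spread for quantile forecasts follows; the assumption that each adjacent-quantile interval carries probability 1/(n + 1) and the exact interpolation rule are our reading of Bremnes (2019), not the authors' code.

```python
import numpy as np

def composite_spread(quantiles: np.ndarray, coverage: float = 0.8) -> float:
    """Sum the smallest adjacent-quantile interval widths until the desired
    probability coverage is reached, interpolating within the final interval.
    quantiles: sorted quantile values at levels {1/(n+1), ..., n/(n+1)}."""
    n = len(quantiles)
    widths = np.sort(np.diff(quantiles))  # adjacent interval widths, ascending
    prob_per_interval = 1.0 / (n + 1)     # probability mass of each interval
    spread, covered = 0.0, 0.0
    for width in widths:
        if covered + prob_per_interval >= coverage:
            # Take only the fraction of the last interval needed to reach coverage.
            return spread + width * (coverage - covered) / prob_per_interval
        spread += width
        covered += prob_per_interval
    return spread
```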

3. Experiments and results

a. Hyperparameter optimization

For the NN models, we choose four hidden layers with a size of 256, after observing diminishing returns on the validation set for larger architectures. We did not find strong interactions between shared parameters and parameters related to the uncertainty representation, i.e., the optimal embedding size performed similarly on the DRN, QRN, and BQN. Regarding the length of the θ vectors, we use Bernstein polynomials of degree 16 after observing no performance improvement for larger values. We use 32 quantiles in our quantile regression models for the same reason. The values considered were {8, 10, 12, 14, 16, 18} for the degree of the Bernstein polynomial and {16, 20, …, 40} for the number of predicted quantiles of the quantile regression model. The linear quantile methods (LQR and LBQ) were set to use the same length of θ as their NN counterpart to compare them at equivalent forecast resolution.

Once these hyperparameters are set, our NN models have between 580 000 and 590 000 trainable parameters. The variability is due to the different number of output parameters θ for each model. For comparison, the EMOS model has 1 517 164 trained parameters. It has more trained parameters than the NN because it has separate coefficients for each station, initialization time, and lead time.

b. Postprocessing performance

We ran a series of experiments to evaluate the performance of our postprocessing methods over the GDPS dataset. Table 3 shows performance metrics aggregated for all lead times and all stations. The NN models score better than their linear counterparts in all configurations. The linear model performs better with the normal distribution, while the NN model shows similar performance for all variants. We posit this is because the linear model does not have enough representation capability to correctly predict a large number of interrelated θ parameters. Among NN models, the BQN obtains slightly better results for all evaluated metrics.

Table 3. Postprocessing model metrics, aggregated over all stations and lead times. The CRPSS is computed against the naive probabilistic baseline. Bold values represent the best performing uncertainty representation for each predictive model.

Figure 2 shows metrics describing forecast behavior across lead times. The CRPS and spread increase with lead time, as expected. Forecasts from the naive probabilistic model have more spread throughout. The EMOS model is successful in increasing sharpness, but less so than the NN models. The QRN has the lowest spread among NN models, although this could be due to bias in the spread estimation related to its smaller number of quantiles.

Fig. 2. Postprocessing model metrics. (left) CRPS by lead time. (right) Forecast sharpness as measured by a composite spread covering 80% of the predicted forecast distribution.

Model CRPSS and bias are reported in Fig. 3. The confidence intervals are computed using the paired bootstrapping procedure proposed by Hamill (1999), under the null hypothesis that the naive probabilistic model has the same statistic as the tested model. The resampling was performed 100 times. The CRPSS gains brought by NN models are significantly larger than those of EMOS at early lead times. This difference reduces to 2.5% for later time steps. Bias values are computed by comparing the observation to the mean quantile value for models that output quantiles, while they are compared against the forecast mean for the EMOS and DRN models. The NN models do not eliminate all biases at longer lead times, though they remain under 0.3 K in amplitude.

Fig. 3. (left) CRPSS against the naive probabilistic model. (right) Bias of postprocessing models. The shaded areas represent a 5%–95% confidence interval.

Figure 4 shows the CRPSS against the naive probabilistic model, aggregated by station. Larger increases in CRPSS are observed in the central and eastern United States. We posit that this reflects the location of some surface temperature biases in the underlying NWP model. The figure shows the results for the DRN. Similar figures for the BQN and QRN are available in the appendix.

Fig. 4. CRPSS aggregated by station for all lead times. The model is DRN. The baseline model is the naive probabilistic model.

The forecast calibration is assessed using the rank histograms in Fig. 5. For the LBQ and BQN models, the 99 bins were merged in groups of three so that the rank histogram would be similar to that of the quantile methods. For the methods producing normal distributions (naive, EMOS, and DRN), their inverse CDF was discretized in 33 bins for the same reason.

Fig. 5. Postprocessed forecast rank histogram. For the normal uncertainty representation models (naive, EMOS, and DRN), bin boundaries with uniform probabilities were computed using the inverse CDF of the forecasts. For the LBQ and BQN models, bins were merged in groups of three to allow comparison with the other histograms.

All models have a surplus of observations in the first and last bins, indicating that a certain number of observations fall at the very edges of the forecast distribution. Other than these missed forecasts, the postprocessing models tend to flatten the central part of the rank histogram when compared to the naive model. The BQN shows interesting shapes at the edges. We postulate that they reflect what distributions can be expressed using a 16th-degree Bernstein polynomial. Even though our validation scores stopped improving for higher-degree polynomials, they may be worth investigating from a calibration perspective.

c. Behavior toward extremes

We evaluate the behavior of the NN models when NWP forecasts tend toward extreme lows and highs in Fig. 6. Predictions were aggregated according to the percentile of their corresponding forecast for a given station. They are grouped in bins of two percentiles. The metrics are computed for an initialization time of 0000 UTC and a lead time of 48 h.

Fig. 6. Model metrics according to the NWP forecast percentile. Lead time 48 h. Initialization time 0000 UTC. The percentiles are computed stationwise. The forecasts are aggregated in bins of 2 percentiles. The metrics are computed over three periods: (left) full test set (all months), (middle) winter (January and February), and (right) summer (June and July).

The naive model has biases when the forecast percentile tends to the extremes. This is an expected consequence of our forecast-based stratification (Bellier et al. 2017). Although the yearly curves for CRPS are flat in the central quantiles, more specific evaluations reveal that all models have degraded predictive performance given very high and low forecasts within a season. We observe a particularly strong degradation in CRPSS for low percentiles in winter. Upon inspection, these bins are disproportionately populated with forecasts valid from 12 to 17 February 2021. These dates are associated with unusually cold weather events in Texas which are outside the training distribution.

d. Conditioning for lead time

Our NN models are trained jointly for all lead times, which implies they need to be conditioned for the lead time being predicted. We introduced a lead-time embedding in section 2 for that purpose, as well as the typical lead-time predictor. This section studies the effectiveness of these strategies and analyzes the representation learned by the embedding.

In Table 4, we compare the CRPS values obtained by each strategy on the full testing set. As baselines, we include results for nonconditioned models, as well as another strategy called partitioning. It consists in training separate models for each lead time. We call this strategy partitioning because it effectively splits the dataset into parts. We do not test the embedding on the linear model, because it uses separate models for each lead time.

Table 4. Postprocessing model CRPS according to the conditioning strategy used for the lead time. The partition strategy uses a series of models trained on each time step individually. The predictor strategy adds a lead-time predictor. The embedding strategy injects a learned vector in the model input to represent the lead time. Bold values represent the best performing strategy for each postprocessing model.

In all cases, training the postprocessing model jointly for all lead times performed better than training separate models, given that the joint model is conditioned for lead time in one way or the other. The lead-time embedding improves performance slightly across all models when compared to using only a predictor. The best results were obtained by using it together with the lead-time predictor, except for the QRN where the embedding alone had the best performance. Our dataset did not include diurnal variations: each lead time points to the same time of day. Models trained for multiple lead times per day may see more benefits from the embedding, which would then have the dual purpose of encoding lead time and time of day.

Figure 7 shows the performance of different conditioning strategies according to the lead time. The confidence intervals are computed using the same bootstrapping strategy described in section 3b. Interestingly, the benefits of training a single model vary by lead time. The improvements are mostly observed around central lead times. Early and late lead times perform similarly to or worse than lead time-specific NNs. We envisage two explanations. The first is related to the statistics of the data: the NN converges to solutions that are well suited to intermediate uncertainties because they are the “mean” case in the dataset. The second is related to predictability. Since predictability is very high at early lead times, the postprocessing must rely heavily on the forecast, which could benefit specialized models. On the contrary, since predictability is very low at late lead times, it is difficult to do much better than a climatological model. The benefits of training postprocessing models jointly would thus be concentrated in the intermediate lead times.

Fig. 7. CRPSS when training a NN for all lead times jointly. The baseline strategy is to train separate models for each lead time. The shaded areas represent 5%–95% confidence intervals.

We study the embedding learned to represent lead time in Fig. 8. Our NN models learn 10 vectors v_i, each representing a lead time. Their mutual similarity is computed using the cosine similarity s, such that

s(v_i, v_j) = \frac{v_i \cdot v_j}{\lVert v_i \rVert \, \lVert v_j \rVert}.    (7)
The similarity matrix of the learned embedding vectors is dominated by its diagonal: nearby lead times are more similar to one another. Early lead times show more similarity to one another than later ones. We suggest this is due to a shift in postprocessing strategies, where early forecasts are strongly related to the underlying NWP model, while later forecasts are more concerned with uncertainty quantification and knowledge of statistical trends in the data.
Fig. 8. Lead-time embedding self-similarity. Every row and column corresponds to a learned vector representing a given lead time. The cells represent the degree of similarity between two of these vectors, as measured by the cosine similarity. This embedding was obtained by training a DRN model.

e. Benefits of ensemble forecast for probabilistic postprocessing

The results from the previous subsections show that much of the uncertainty related to the forecast can be recovered statistically with a neural network. A follow-up question is to what extent ensemble members are useful in estimating the forecast distribution, given that it is postprocessed. To investigate this, we run an experiment on the ENS-10 dataset (Ashkboos et al. 2020) where we progressively add ensemble members to the postprocessing model input. ENS-10 is a 10-member reforecast dataset built from outputs of an operational configuration of the ECMWF IFS model (cycles Cy43r1 and Cy45r1). It spans a period going from January 1998 to December 2017, making two forecasts a week. We postprocessed this NWP model for lead times of 1 and 2 days. More detailed information about the dataset is contained in Table A1.

To input the NWP-dependent features from multiple ensemble members simultaneously, we applied the linear layer to each member individually and then averaged the resulting vectors. For the DRN model, the initial estimate σ̂ was set to the standard deviation of the predicted value in the ensemble forecast when more than one member was available. There exist more elaborate ways of leveraging multiple ensemble members for postprocessing (Rasp and Lerch 2018; Finn 2021; Höhlein et al. 2024). Despite this, Fig. 9 shows how the forecast improves when adding ensemble members. The metric is the CRPSS computed against postprocessing only the control member. The confidence intervals are computed using paired bootstrapping against the one-member model. Our work is mostly concerned with the left side of the figure, where there is a sharp increase in performance when the second member is added, at all lead times. We conclude that multiple ensemble members are indeed useful in making a probabilistic forecast, even in the presence of probabilistic postprocessing. This conclusion is in line with previous work (Bremnes et al. 2023).

Fig. 9. CRPSS gain related to adding ensemble members to the postprocessing model input. The dataset is ENS-10. The baseline is training postprocessing using only the control member. The shaded areas represent 5%–95% confidence intervals.

We observe that the impact of supplementary ensemble members increases with lead time. Furthermore, the marginal benefit of each additional member appears to diminish. However, it is difficult to draw firm conclusions because our experimental setup is limited to a 48-h lead time. The importance of large ensembles is expected to increase with lead time, but this cannot be verified given the data available for this study.

4. Discussion and conclusions

This work studied probabilistic forecasts produced from a deterministic NWP model. It evaluated the deterministic-to-probabilistic strategy under parametric and quantile-based assumptions with eight combinations of predictive models and uncertainty representations. The best postprocessing models have a CRPSS of about 15% when compared against a naive probabilistic forecast and significantly outperform EMOS for all but the latest lead times. Under extreme conditions, the NN model performance decreases in a way that is expected from a statistical model.

The proposed methodology can be applied to any numerical prediction model for which a sufficient training dataset is available. We believe it has applications in domains that must rely on deterministic models but benefit from probabilistic decision-making, e.g., energy consumption forecasting.

To account for the large number of lead times to be encoded within one model, we introduced a lead-time embedding. This embedding brought modest improvements to the model CRPS. The learned representation has an intuitive post hoc interpretation. Training a postprocessing model jointly for all lead times improved performance overall. The improvements were concentrated around central lead times.

Our results with the NN show low sensitivity to the choice of uncertainty representation in terms of aggregated metrics. However, their calibration exhibits noticeable differences, especially at the tails of the distribution. This suggests the choice of an uncertainty representation should be evaluated with regard to its capacity to represent the full distribution.

While our approach is inexpensive, it still has caveats when compared to a postprocessed ensemble forecast. Our experiment on the ENS-10 dataset shows that adding ensemble members quickly brings important improvements to the CRPS, indicating that supplementary numerical simulations are a robust way to improve the marginal forecast at a station. Perhaps more importantly, our postprocessing strategy makes the independence assumption over all stations and lead times. This breaks spatiotemporal consistency because it makes it impossible to sample from all stations and lead times in a way that is physically realizable (Schulz and Lerch 2022b). Such consistency is an important asset when considering extreme events in downstream applications of the weather forecast. This has been addressed notably with Schaake shuffle (Clark et al. 2004; Shrestha et al. 2020) and ensemble copula coupling (Schefzik et al. 2013; Lakatos et al. 2023), but these methods assume that the correlation structure between variables can be recovered using historical observations or the NWP ensemble members, respectively. This could be impossible in the presence of unresolved local effects. As such, we identify avenues for future work inside and outside the independence assumption.

a. Station- and lead-time-independent postprocessing

For station- and lead-time-independent forecasting, we identify two directions to extend our work. First, the benefits of training postprocessing models jointly for all lead times could be investigated further. Our experiments showed a tendency for NN postprocessing models to make improvements only in central lead times. We suggested two interpretations of this phenomenon. The first is related to dataset statistics: the model performs best in central lead times because they are at the center of the training distribution. The second is related to the predictability of the weather itself: postprocessing strategies are different enough in early and late lead times that specialized models do better. The latter interpretation makes intuitive sense: one adjusts a 1-day forecast differently than a 10-day forecast to account for predictability. Further experiments could be designed to identify which interpretation is correct. As far as encoding the lead time is concerned, our experiments with the lead-time embedding are moderately conclusive. It did bring modest improvements to the CRPSS, but not always in a statistically significant way. The NN models did not react in the same way to its introduction, with the DRN benefitting more from it than the others. This warrants further experimentation, notably where forecast validity time changes with lead time. An embedding could let a neural network build a representation that efficiently blends the effects of lead time and time of day on postprocessing.

Second, our short experiment on the ENS-10 dataset showed that the CRPS is quickly and decisively improved by adding ensemble members at the input. This shows that supplementary numerical simulations are a robust way to improve the forecast. The improvement may become even larger with NN models whose architecture specifically leverages spread information from the ensemble. Our experiment was also limited in lead time, and it would be of interest to perform it on more operational models at longer time horizons. Since predictability declines with lead time, we expect the trend where later lead times benefit from more ensemble members to continue. Further experiments are required to determine its shape on larger time horizons.

b. Generative modeling for postprocessing

Other lines of research are available outside of the station and lead-time independence assumptions. We believe that the generative modeling literature could help preserve spatiotemporal consistency for postprocessing at stations. In computer vision, generative neural networks have been successful in sampling consistently from large output spaces. Work has already been performed to that effect in postprocessing, both on the grid (Dai and Hemri 2021) and in situ (Chen et al. 2024), showing that spatial correlations can be recovered. We expect similar methods could be used to represent temporal dependencies as well. Questions remain about how to implement this exactly for in situ postprocessing. We expect a well-adapted architecture will propose a way to encode the spatial relationship between stations, as well as a generative component that is not subject to mode collapse and training instability, which are common concerns in generative modeling.

Acknowledgments.

This work was funded in part by Environment and Climate Change Canada, the Computer Research Institute of Montreal, and a Choose France Chair in AI grant from the French government. Experiments were carried out using HPC resources from GENCI-IDRIS (Grant AD011014334).

Data availability statement.

The operational archives used in this study are accessible on demand through the open data access program of the Meteorological Service of Canada (Meteorological Service of Canada 2019). The METAR observations used are freely available in Herzmann (2001–2024). The source code of the models used in this work is available at https://github.com/davidlandry93/pp2023/.

APPENDIX


Figure A1 shows the spatial distribution of skill gain brought by postprocessing approaches. Tables A1 and A2 provide more details about the ENS-10 dataset.

Fig. A1. Skill gain brought by postprocessing models against a debiased baseline. (top) BQN model. (bottom) QRN model.

Table A1. ENS-10 dataset.

Table A2. NWP-dependent and NWP-independent predictors used in the ENS-10 dataset (1000–10 hPa denotes vertical levels 1000, 925, 850, 700, 500, 400, 300, 200, 100, 50, and 10 hPa; = always used).

REFERENCES

  • Antolik, M. S., 2000: An overview of the National Weather Service’s centralized statistical quantitative precipitation forecasts. J. Hydrol., 239, 306–337, https://doi.org/10.1016/S0022-1694(00)00361-9.
  • Ashkboos, S., L. Huang, N. Dryden, T. Ben-Nun, P. Dueben, L. Gianinazzi, L. Kummer, and T. Hoefler, 2020: ENS-10: A dataset for ensemble post-processing. ETHZ Scalable Parallel Computing Laboratory Storage, accessed 13 July 2023, https://spclstorage.inf.ethz.ch/projects/deep-weather/ENS10/.
  • Bellier, J., I. Zin, and G. Bontron, 2017: Sample stratification in verification of ensemble forecasts of continuous scalar variables: Potential benefits and pitfalls. Mon. Wea. Rev., 145, 3529–3544, https://doi.org/10.1175/MWR-D-16-0487.1.
  • Bi, K., L. Xie, H. Zhang, X. Chen, X. Gu, and Q. Tian, 2023: Accurate medium-range global weather forecasting with 3D neural networks. Nature, 619, 533–538, https://doi.org/10.1038/s41586-023-06185-3.
  • Bouallègue, Z. B., F. Cooper, M. Chantry, P. Düben, P. Bechtold, and I. Sandu, 2023: Statistical modeling of 2-m temperature and 10-m wind speed forecast errors. Mon. Wea. Rev., 151, 897–911, https://doi.org/10.1175/MWR-D-22-0107.1.
  • Bremnes, J. B., 2004: Probabilistic forecasts of precipitation in terms of quantiles using NWP model output. Mon. Wea. Rev., 132, 338–347, https://doi.org/10.1175/1520-0493(2004)132<0338:PFOPIT>2.0.CO;2.
  • Bremnes, J. B., 2019: Constrained quantile regression splines for ensemble postprocessing. Mon. Wea. Rev., 147, 1769–1780, https://doi.org/10.1175/MWR-D-18-0420.1.
  • Bremnes, J. B., 2020: Ensemble postprocessing using quantile function regression based on neural networks and Bernstein polynomials. Mon. Wea. Rev., 148, 403–414, https://doi.org/10.1175/MWR-D-19-0227.1.
  • Bremnes, J. B., T. N. Nipen, and I. A. Seierstad, 2023: Evaluation of forecasts by a global data-driven weather model with and without probabilistic post-processing at Norwegian stations. arXiv, 2309.01247v1, https://doi.org/10.48550/arXiv.2309.01247.
  • Bröcker, J., 2012: Evaluating raw ensembles with the continuous ranked probability score. Quart. J. Roy. Meteor. Soc., 138, 1611–1617, https://doi.org/10.1002/qj.1891.
  • Buehner, M., and Coauthors, 2015: Implementation of deterministic weather forecasting systems based on ensemble–variational data assimilation at Environment Canada. Part I: The global system. Mon. Wea. Rev., 143, 2532–2559, https://doi.org/10.1175/MWR-D-14-00354.1.
  • Cannon, A. J., 2018: Non-crossing nonlinear regression quantiles by monotone composite quantile regression neural network, with application to rainfall extremes. Stochastic Environ. Res. Risk Assess., 32, 3207–3225, https://doi.org/10.1007/s00477-018-1573-6.
  • Chen, J., T. Janke, F. Steinke, and S. Lerch, 2024: Generative machine learning methods for multivariate ensemble post-processing. Ann. Appl. Stat., 18, 159–183, https://doi.org/10.1214/23-AOAS1784.
  • Clark, M., S. Gangopadhyay, L. Hay, B. Rajagopalan, and R. Wilby, 2004: The Schaake shuffle: A method for reconstructing space–time variability in forecasted precipitation and temperature fields. J. Hydrometeor., 5, 243–262, https://doi.org/10.1175/1525-7541(2004)005<0243:TSSAMF>2.0.CO;2.
  • Dai, Y., and S. Hemri, 2021: Spatially coherent postprocessing of cloud cover ensemble forecasts. Mon. Wea. Rev., 149, 3923–3937, https://doi.org/10.1175/MWR-D-21-0046.1.
  • Demaeyer, J., and Coauthors, 2023: The EUPPBench postprocessing benchmark dataset v1.0. Earth Syst. Sci. Data, 15, 2635–2653, https://doi.org/10.5194/essd-15-2635-2023.
  • Espeholt, L., and Coauthors, 2022: Deep learning for twelve hour precipitation forecasts. Nat. Commun., 13, 5145, https://doi.org/10.1038/s41467-022-32483-x.
  • Finn, T. S., 2021: Self-attentive ensemble transformer: Representing ensemble interactions in neural networks for Earth system models. arXiv, 2106.13924v2, https://doi.org/10.48550/arXiv.2106.13924.
  • Glahn, H. R., and D. A. Lowry, 1972: The use of Model Output Statistics (MOS) in objective weather forecasting. J. Appl. Meteor., 11, 1203–1211, https://doi.org/10.1175/1520-0450(1972)011<1203:TUOMOS>2.0.CO;2.
  • Gneiting, T., and A. E. Raftery, 2007: Strictly proper scoring rules, prediction, and estimation. J. Amer. Stat. Assoc., 102, 359–378, https://doi.org/10.1198/016214506000001437.
  • Gneiting, T., A. E. Raftery, A. H. Westveld III, and T. Goldman, 2005: Calibrated probabilistic forecasting using ensemble model output statistics and minimum CRPS estimation. Mon. Wea. Rev., 133, 1098–1118, https://doi.org/10.1175/MWR2904.1.
  • Hamill, T. M., 1999: Hypothesis tests for evaluating numerical precipitation forecasts. Wea. Forecasting, 14, 155–167, https://doi.org/10.1175/1520-0434(1999)014<0155:HTFENP>2.0.CO;2.
  • Hamill, T. M., 2021: Comparing and combining deterministic surface temperature postprocessing methods over the United States. Mon. Wea. Rev., 149, 3289–3298, https://doi.org/10.1175/MWR-D-21-0027.1.
  • Herzmann, D., 2001: ASOS-AWOS-METAR data download. Iowa Environmental Mesonet, accessed 25 August 2023, https://mesonet.agron.iastate.edu/request/download.phtml.
  • Hewson, T. D., and F. M. Pillosu, 2021: A low-cost post-processing technique improves weather forecasts around the world. Commun. Earth Environ., 2, 132, https://doi.org/10.1038/s43247-021-00185-9.
  • Höhlein, K., B. Schulz, R. Westermann, and S. Lerch, 2024: Postprocessing of ensemble weather forecasts using permutation-invariant neural networks. Artif. Intell. Earth Syst., 3, e230070, https://doi.org/10.1175/AIES-D-23-0070.1.
  • Kingma, D. P., and J. Ba, 2014: Adam: A method for stochastic optimization. arXiv, 1412.6980v9, https://doi.org/10.48550/arXiv.1412.6980.
  • Lakatos, M., S. Lerch, S. Hemri, and S. Baran, 2023: Comparison of multivariate post-processing methods using global ECMWF ensemble forecasts. Quart. J. Roy. Meteor. Soc., 149, 856–877, https://doi.org/10.1002/qj.4436.
  • Lam, R., and Coauthors, 2022: GraphCast: Learning skillful medium-range global weather forecasting. arXiv, 2212.12794v2, https://doi.org/10.48550/arXiv.2212.12794.
  • McTaggart-Cowan, R., and Coauthors, 2019: Modernization of atmospheric physics parameterization in Canadian NWP. J. Adv. Model. Earth Syst., 11, 3593–3635, https://doi.org/10.1029/2019MS001781.
  • Meteorological Service of Canada, 2019: GDPS operational archive. Meteorological Service of Canada Open Data Access Program, accessed 1 November 2022, https://eccc-msc.github.io/open-data/cost-recovered/readme_en/.
  • Paszke, A., and Coauthors, 2019: PyTorch (version 2.0.1). The Linux Foundation, accessed 8 May 2023, https://pytorch.org.
  • Pathak, J., and Coauthors, 2022: FourCastNet: A global data-driven high-resolution weather model using adaptive Fourier neural operators. arXiv, 2202.11214v1, https://doi.org/10.48550/arXiv.2202.11214.
  • Ramachandran, P., B. Zoph, and Q. V. Le, 2017: Searching for activation functions. arXiv, 1710.05941v2, https://doi.org/10.48550/arXiv.1710.05941.
  • Rasp, S., and S. Lerch, 2018: Neural networks for postprocessing ensemble weather forecasts. Mon. Wea. Rev., 146, 3885–3900, https://doi.org/10.1175/MWR-D-18-0187.1.
  • Schefzik, R., T. L. Thorarinsdottir, and T. Gneiting, 2013: Uncertainty quantification in complex simulation models using ensemble copula coupling. Stat. Sci., 28, 616–640, https://doi.org/10.1214/13-STS443.
  • Schulz, B., and S. Lerch, 2022a: Aggregating distribution forecasts from deep ensembles. arXiv, 2204.02291v1, https://doi.org/10.48550/ARXIV.2204.02291.
  • Schulz, B., and S. Lerch, 2022b: Machine learning methods for postprocessing ensemble forecasts of wind gusts: A systematic comparison. Mon. Wea. Rev., 150, 235–257, https://doi.org/10.1175/MWR-D-21-0150.1.
  • Shrestha, D. L., D. E. Robertson, J. C. Bennett, and Q. J. Wang, 2020: Using the Schaake shuffle when calibrating ensemble means can be problematic. J. Hydrol., 587, 124991, https://doi.org/10.1016/j.jhydrol.2020.124991.
  • Smith, L. N., and N. Topin, 2018: Super-convergence: Very fast training of neural networks using large learning rates. arXiv, 1708.07120v3, https://doi.org/10.48550/arXiv.1708.07120.
  • Taillardat, M., 2021: Skewed and mixture of Gaussian distributions for ensemble postprocessing. Atmosphere, 12, 966, https://doi.org/10.3390/atmos12080966.
  • Taillardat, M., O. Mestre, M. Zamo, and P. Naveau, 2016: Calibrated ensemble forecasts using quantile regression forests and ensemble model output statistics. Mon. Wea. Rev., 144, 2375–2393, https://doi.org/10.1175/MWR-D-15-0260.1.
  • Vannitsem, S., and Coauthors, 2021: Statistical postprocessing for weather forecasts: Review, challenges, and avenues in a big data world. Bull. Amer. Meteor. Soc., 102, E681–E699, https://doi.org/10.1175/BAMS-D-19-0308.1.
  • Veldkamp, S., K. Whan, S. Dirksen, and M. Schmeits, 2021: Statistical postprocessing of wind speed forecasts using convolutional neural networks. Mon. Wea. Rev., 149, 1141–1152, https://doi.org/10.1175/MWR-D-20-0219.1.
  • Zamo, M., and P. Naveau, 2018: Estimation of the continuous ranked probability score with limited information and applications to ensemble weather forecasts. Math. Geosci., 50, 209–234, https://doi.org/10.1007/s11004-017-9709-7.
  • Fig. 1.

    (top) Our model architecture. Each forecast is made using one of two parameter prediction models and one of four uncertainty representations. (bottom left) The linear model predicts a parameter vector θ using one linear layer. (bottom right) The NN predicts a parameter vector θ by processing the predictors through an MLP. It has embeddings for station and lead time, which are added to the output of the first linear layer.
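    As an illustration of the NN branch sketched in Fig. 1, the snippet below combines predictors with station and lead-time embeddings in PyTorch. It is a minimal, hypothetical reconstruction: the layer widths, activation, and names are assumptions chosen for readability, not the exact configuration used in this work.

```python
# Hypothetical sketch of the NN parameter head in Fig. 1.
# Layer sizes, names, and the number of predictors are illustrative assumptions.
import torch
import torch.nn as nn

class ParameterHead(nn.Module):
    def __init__(self, n_predictors, n_stations, n_lead_times, n_params, hidden=64):
        super().__init__()
        self.first = nn.Linear(n_predictors, hidden)
        # Learned embeddings for station and lead time, added to the first layer output.
        self.station_emb = nn.Embedding(n_stations, hidden)
        self.lead_emb = nn.Embedding(n_lead_times, hidden)
        self.mlp = nn.Sequential(
            nn.SiLU(),                    # Swish/SiLU activation (Ramachandran et al. 2017)
            nn.Linear(hidden, hidden),
            nn.SiLU(),
            nn.Linear(hidden, n_params),  # outputs the distribution parameter vector theta
        )

    def forward(self, predictors, station_idx, lead_idx):
        h = self.first(predictors)
        h = h + self.station_emb(station_idx) + self.lead_emb(lead_idx)
        return self.mlp(h)

# Usage (shapes are illustrative):
# theta = ParameterHead(12, 2200, 20, 2)(predictors, station_idx, lead_idx)
```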

  • Fig. 2.

    Postprocessing model metrics. (left) CRPS by lead time. (right) Forecast sharpness as measured by a composite spread covering 80% of the predicted forecast distribution.

  • Fig. 3.

    (left) CRPSS against the naive probabilistic model. (right) Bias of postprocessing models. The shaded areas represent a 5%–95% confidence interval.
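    The skill score shown here and in Figs. 4, 7, and 9 is the standard continuous ranked probability skill score; the one-line sketch below states the convention that positive values indicate an improvement over the baseline.

```python
def crpss(crps_model: float, crps_baseline: float) -> float:
    """Continuous ranked probability skill score relative to a baseline forecast."""
    return 1.0 - crps_model / crps_baseline
```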

  • Fig. 4.

    CRPSS aggregated by station for all lead times. The model is DRN. The baseline model is the naive probabilistic model.

  • Fig. 5.

    Postprocessed forecast rank histogram. For the normal uncertainty representation models (naive, EMOS, and DRN), bin boundaries with uniform probabilities were computed using the inverse CDF of the forecasts. For the LBQ and BQN models, bins were merged in groups of three to allow comparison with the other histograms.
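    For readers who want to reproduce a rank histogram such as Fig. 5 from a normal predictive distribution, the sketch below computes equiprobable bin edges with the inverse CDF and counts where each observation falls. The bin count and array names are illustrative assumptions.

```python
# Hedged sketch of a rank histogram for normal predictive distributions.
import numpy as np
from scipy.stats import norm

def rank_histogram_normal(mu, sigma, obs, n_bins=21):
    """Count how often each of n_bins equiprobable bins contains the observation."""
    # Equiprobable bin edges from the inverse CDF of each forecast distribution.
    probs = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]   # interior probabilities
    edges = norm.ppf(probs[None, :], loc=mu[:, None], scale=sigma[:, None])
    # Rank of the observation = number of bin edges that fall below it.
    ranks = (obs[:, None] > edges).sum(axis=1)
    return np.bincount(ranks, minlength=n_bins)

# A flat histogram indicates a well-calibrated forecast distribution.
```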

  • Fig. 6.

    Model metrics as a function of the NWP forecast percentile, for a lead time of 48 h and an initialization time of 0000 UTC. The percentiles are computed stationwise, and the forecasts are aggregated in bins of two percentiles. The metrics are computed over three periods: (left) the full test set (all months), (middle) winter (January and February), and (right) summer (June and July).

  • Fig. 7.

    CRPSS when training an NN for all lead times jointly. The baseline strategy is to train a separate model for each lead time. The shaded areas represent 5%–95% confidence intervals.

  • Fig. 8.

    Lead-time embedding self-similarity. Each row and column corresponds to a learned vector representing a given lead time, and each cell shows the cosine similarity between the two corresponding vectors. This embedding was obtained by training a DRN model.
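    A similarity matrix like Fig. 8 takes only a few lines to compute once the lead-time embedding weights are available; the sketch below assumes the embedding is stored as a standard PyTorch nn.Embedding (one row per lead time), which is an assumption about the implementation rather than a documented detail.

```python
# Minimal sketch: cosine self-similarity of lead-time embedding vectors.
import torch
import torch.nn.functional as F

def embedding_self_similarity(weight: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between every pair of embedding vectors (rows of `weight`)."""
    unit = F.normalize(weight, dim=1)  # scale each embedding vector to unit norm
    return unit @ unit.T               # (n_lead_times, n_lead_times) similarity matrix

# Usage (names are hypothetical):
# sim = embedding_self_similarity(model.lead_emb.weight.detach())
```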

  • Fig. 9.

    CRPSS gain from adding ensemble members to the postprocessing model input. The dataset is ENS-10. The baseline is a postprocessing model trained using only the control member. The shaded areas represent 5%–95% confidence intervals.

  • Fig. A1.

    Skill gain brought by postprocessing models against a debiased baseline. (top) BQN model. (bottom) QRN model.
