Improving Medium-Range Ensemble Weather Forecasts with Hierarchical Ensemble Transformers

Zied Ben Bouallègue,a Jonathan A. Weyn,a,b Mariana C. A. Clare,a Jesper Dramsch,a Peter Dueben,a and Matthew Chantrya

a European Centre for Medium-Range Weather Forecasts, Reading, United Kingdom
b Microsoft, Redmond, Washington

Abstract

Statistical postprocessing of global ensemble weather forecasts is revisited by leveraging recent developments in machine learning. Verification of past forecasts is exploited to learn systematic deficiencies of numerical weather predictions in order to boost postprocessed forecast performance. Here, we introduce postprocessing of ensembles with transformers (PoET), a postprocessing approach based on hierarchical transformers. PoET has two major characteristics: 1) the postprocessing is applied directly to the ensemble members rather than to a predictive distribution or a functional of it, and 2) the method is ensemble-size agnostic in the sense that the number of ensemble members in training and inference mode can differ. The PoET output is a set of calibrated members that has the same size as the original ensemble but with improved reliability. Performance assessments show that PoET can bring up to 20% improvement in skill globally for 2-m temperature and 2% for precipitation forecasts and outperforms the simpler statistical member-by-member method, used here as a competitive benchmark. PoET is also applied to the ENS-10 benchmark dataset for ensemble postprocessing and, for most parameters, provides better results than the other deep learning solutions evaluated. Furthermore, because each ensemble member is calibrated separately, downstream applications should directly benefit from the improvement made on the ensemble forecast with postprocessing.

© 2024 American Meteorological Society. This published article is licensed under the terms of the default AMS reuse license. For information regarding reuse of this content and general copyright information, consult the AMS Copyright Policy (www.ametsoc.org/PUBSReuseLicenses).

Corresponding author: Zied Ben Bouallègue, zied.benbouallegue@ecmwf.com

1. Introduction

The chaotic nature of the atmosphere makes forecasting the weather a challenging and scientifically exciting task. With large and high-quality publicly available datasets (Hersbach et al. 2020), weather forecasting is becoming a new playing field for deep learning practitioners (Pathak et al. 2022; Bi et al. 2022; Lam et al. 2022). More traditionally, at national meteorological centers, weather forecasts are generated by numerical weather prediction (NWP) models that numerically solve physics-based equations. A Monte Carlo approach is followed to account for uncertainties: an ensemble of deterministic forecasts is run with variations in the initial conditions, the model parameterizations, and/or the numerical discretization. This ensemble approach, initially developed to explore the limits of deterministic forecasting, has now become the backbone of operational weather forecasting (Lewis 2005).

Practically, ensemble weather forecasts are a set of physically consistent weather scenarios that ideally capture the full range of possible outcomes given the information available at the start of the forecast (Leutbecher and Palmer 2008) and decision-making can be optimized using probabilistic forecasts derived from such an ensemble (Richardson 2000). One can assess not only the uncertainty of a weather variable at a given point in space and time, but also any joint probability distributions across the variables. This versatility is essential for ensemble prediction systems to support downstream applications with high societal relevance, such as flood forecasting or human heat stress forecasting (Magnusson et al. 2023).

However, as the output of an NWP model, an ensemble forecast is suboptimal in a statistical sense. On top of the limited-ensemble-size effect (Leutbecher 2019), systematic deficiencies such as model biases (defined as the averaged differences between forecasts and observations) and over- or underdispersion (too much or too little ensemble spread as measured by the standard deviation among the ensemble members) are common features of NWP ensemble forecasts (Haiden et al. 2021). Statistical postprocessing is proposed as a simple remedy, where past data are exploited to learn forecast errors and correct the current forecast accordingly.

A variety of postprocessing approaches have been used over the years, from simple bias correction to machine learning based methods (Vannitsem et al. 2021). Classically, postprocessing of ensemble forecasts is achieved either by assuming the form of the predictive probability distribution and optimizing its parameters (Gneiting et al. 2005; Raftery et al. 2005) or by correcting a limited set of quantiles of the predictive distribution (Taillardat et al. 2016; Ben Bouallègue 2017). Recently, a range of modern machine learning methods has been applied to ensemble postprocessing. For example, Rasp and Lerch (2018) trained a neural network to predict the mean and standard deviation of a normal distribution for 2-m temperature forecasting at stations in Germany, while Bremnes (2020) combined a neural network and Bernstein polynomials to generate quantile forecasts of wind speed at stations in Norway. For downstream applications based on such a postprocessed forecast, an additional postprocessing step is required to “reconstruct” forecast dependencies between variables or in time and space (Ben Bouallègue et al. 2016; Baran et al. 2020). However, more recent developments in deep learning, particularly transformers, promise to resolve this issue by using mechanisms such as attention (Vaswani et al. 2017) to maintain intervariable and interspatial dependencies.

In this work, we target the direct generation of a calibrated ensemble for potential use in downstream applications. The focus is on 2-m temperature and precipitation, which are variables of interest for many stakeholders. We propose a new approach for ensemble postprocessing: postprocessing of ensembles with transformers (PoET). Our machine learning framework, PoET, combines the self-attention ensemble transformer used for postprocessing of individual ensemble members in Finn (2021) with the U-Net architecture used for bias correction in Grönquist et al. (2021), leveraging the advantages of both for the first time in a postprocessing application. We compare this approach with the statistical member-by-member (MBM) method proposed by Van Schaeybroeck and Vannitsem (2015) that is simpler than PoET but effective. In its simplest form, this method consists of a bias correction and a spread scaling with respect to the ensemble mean. MBM has been successfully tested on time series of 2-m temperature ensemble forecasts and is now run operationally at the Royal Meteorological Institute of Belgium (Demaeyer et al. 2021).

Machine learning approaches for ensemble postprocessing rely on the availability of suitable datasets (Dueben et al. 2022). Here, we use reforecasts and reanalysis for training. In this context, reforecasts and reanalysis are praised for their consistency because they are generated from a single NWP model for long periods of validity time. In particular, the benefit of reforecasts for postprocessing has been demonstrated in pioneering works on 2-m temperature and precipitation forecasts at station locations by Hagedorn et al. (2008) and Hamill et al. (2008), respectively. Reforecasts are also becoming the cornerstones of benchmark datasets for postprocessing of weather forecasts (Ashkboos et al. 2022; Demaeyer et al. 2023; Grönquist et al. 2021). In this work, we continue this trend focusing on ensemble postprocessing of global gridded forecasts.

The remainder of this paper is organized as follows: section 2 introduces the dataset and methods investigated in this study, section 3 provides details about the implementation of MBM and PoET for the postprocessing of 2-m temperature and precipitation ensemble forecasts as well as a description of the verification process, section 4 provides illustrative examples of postprocessing in action, section 5 presents and discusses verification results, and section 6 concludes this paper.

2. Data and methods

a. Data

At the European Centre for Medium-Range Weather Forecasts (ECMWF), the reforecast dataset consists of 11 ensemble members (10 perturbed + 1 control) generated twice a week over the past 20 years (Vitart et al. 2019). In our experiments, the dataset comes from the operational Integrated Forecasting System (IFS) reforecasts produced in 2020, that is, with IFS cycles 46r1 and 47r1, the switch occurring in June 2020. Fields are on a 1° horizontal grid, and the focus is on lead times every 6 h up to 96 h. Reforecasts from 2000 to 2016 are used for training, while those from 2017 and 2018 are used for validation.

The postprocessing models are trained toward ERA5, the ECMWF reanalysis dataset (Hersbach et al. 2020). The target is the 2-m temperature reanalysis, while the short-range forecast at T + 6 h (aligned with the forecast validity time) is used as the target for precipitation to account for the spinup after data assimilation [for a comprehensive assessment of ERA5 daily precipitation, please refer to Lavers et al. (2022)].

For testing, we use the operational ensemble data from 2021, using two forecasts each week for 104 start dates in total, according to the ECMWF subseasonal-to-seasonal (S2S) model iterations. The operational ensemble has 51 members (50 perturbed members + 1 control member), but we apply postprocessing methods that are agnostic to ensemble size: they may be run in inference mode with a different ensemble size than used in training. The data from 2021 include model cycles Cy47r1, Cy47r2, and Cy47r3, switching in May and then October of 2021, respectively. Notably, the model upgrade in Cy47r2 included an increase to 137 vertical levels in the ensemble, an improvement that is not included in the training dataset. We are therefore directly testing our methodology for generalization across model cycles, an important property to reduce the maintenance required when operationalizing machine learning systems.

b. Statistical benchmark method for comparison

Neural networks are not the only methods that can be used to calibrate ensembles. There exist simpler statistical methods, which require less computational power and which are generally more “explainable.” In this work, we use the MBM approach detailed in Van Schaeybroeck and Vannitsem (2015) as a benchmark. MBM is a natural benchmark for PoET, which can itself be seen as a sophisticated member-by-member method. In addition, a comparison with state-of-the-art machine learning (ML) postprocessing techniques is discussed in the framework of the 10-ensemble-member (ENS-10) benchmark dataset (Ashkboos et al. 2022) in section 5c.

With MBM, a correction is applied to each ensemble member individually, with a component common to all members and a component that adjusts the deviation of a member with respect to the ensemble mean. Let us denote $\hat{x}_i$ the corrected forecast for the ith member of the ensemble. Formally, MBM consists of applying
$$\hat{x}_i = \alpha + \beta \bar{x} + \gamma (x_i - \bar{x}), \quad (1)$$
where $x_i$ is ensemble member i and $\bar{x}$ is the ensemble mean. The parameter α is the bias parameter that nudges the ensemble mean, β is the linear coefficient that scales the ensemble mean, and γ is the scaling parameter that adjusts the spread of the ensemble. Each parameter can be inspected separately to understand its respective contribution to the modifications of the forecasts.

In our application, the parameter optimization follows the approach combining the climatological reliability (CR) and weak ensemble reliability (WER) conditions (WER + CR), as defined in Van Schaeybroeck and Vannitsem (2015). WER + CR means that the estimated parameters are constrained to preserve two different reliability conditions: for bias-free forecasts, CR is defined as the equality of the forecast variability with the observation variability, while WER is defined as the agreement between the average ensemble variance and the mean squared forecast error. The analytical formulas used to compute the three MBM parameters are provided in appendix A. Note that other flavors of MBM exist (e.g., with score optimization), but they have been disregarded because of their prohibitive computational costs in our application. For example, the MBM approach based on the minimization (MIN) of the continuous ranked probability score (CRPS; the so-called CRPS MIN approach) is three orders of magnitude more computationally expensive. Furthermore, a test of the CRPS MIN approach on a subsample of the data showed no benefit in terms of scores compared with the method applied here.
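To make the estimation concrete, here is a minimal NumPy sketch of the WER + CR parameter estimation (appendix A) and of the member-by-member correction of Eq. (1). The array shapes and function name are illustrative assumptions; the operational implementation relies on the Climdyn/pythie package referenced in the data availability statement.

```python
import numpy as np

def mbm_wer_cr(x_train, y_train, x_new):
    """Estimate the MBM (WER + CR) parameters from training pairs and
    apply Eq. (1) to a new ensemble.

    x_train : (n_samples, n_members) training ensemble forecasts
    y_train : (n_samples,) verification (here, ERA5) values
    x_new   : (m_members,) ensemble to correct at inference time
    """
    xbar = x_train.mean(axis=1)                  # ensemble mean per sample
    rho = np.corrcoef(y_train, xbar)[0, 1]       # corr(verification, ens. mean)
    beta = rho * y_train.std() / xbar.std()      # regression slope
    alpha = y_train.mean() - beta * xbar.mean()  # bias parameter
    # Spread scaling from the reliability constraints (appendix A), with the
    # ensemble variance averaged over the training sample
    sigma_eps2 = x_train.var(axis=1).mean()
    gamma = np.sqrt(y_train.var() / sigma_eps2 * (1.0 - rho**2))
    # Eq. (1): correct each member individually; note that x_new may have a
    # different number of members than the training ensemble
    return alpha + beta * x_new.mean() + gamma * (x_new - x_new.mean())
```

Because the parameters act only on the ensemble mean and on the deviations from it, the same α, β, and γ estimated from the 11-member reforecasts can be applied to the 51-member operational ensemble.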

c. The ensemble transformer

Transformers are a class of neural networks that were designed for large natural language processing (NLP) models (Vaswani et al. 2017). The main advantage of transformers is the capability to process arbitrary lengths of sequences, drawing context from every part of the sequence, without the expensive sequential computations and potential saturating gradient issues of recurrent methods such as long short-term memory (LSTM; Hochreiter and Schmidhuber 1997). Deep learning transformer architectures are often structured as encoder-decoder networks, where the encoder blocks use a self-attention layer to compute correlations between all elements of an input sequence.

Self-attention uses a key–query–value construction that shares some similarities with the Kalman filters often used in meteorology for tasks such as data assimilation. The learnable parameters are weight matrices $W_l^K$, $W_l^Q$, and $W_l^V$ that encode the response to the input tensor $X_l$ at the lth layer, such that $K = W_l^K X_l$, $Q = W_l^Q X_l$, and $V = W_l^V X_l$ are the key, query, and value, respectively. Note that the term “self-attention” refers to the operation of each of these components on the same input latent state tensor $X_l$. The resulting scaled dot-product attention for layer l is
$$\mathrm{Attention}_l(Q_l, K_l, V_l) = \mathrm{softmax}\left(\frac{Q_l K_l^T}{\sqrt{d_k}}\right) V_l, \quad (2)$$
where $d_k$ is the dimensionality of the keys and queries. In implementation, the weights are 1 × 1 convolution layers (i.e., dense layers) with a small number of filters, or heads, which produce multiple attention maps, a process known as multihead attention.
While in NLP models the attention is computed along the dimension of the encoded input sequence, Finn (2021) proposed applying this methodology across the ensemble dimension of ensemble NWP forecasts. The strategy has plenty of appeal: by computing the similarity between ensemble members, one can dynamically integrate information from the complete ensemble to postprocess each individual member. Consider an input tensor of shape B × N × C × H × W, where the dimensions represent the batch size, the number of ensemble members, the number of feature channels (such as different atmospheric variables), and the height and width of the model grid, in latitudes and longitudes. The transformer is applied along the ensemble dimension of the forecasts, resulting in a reweighting tensor of shape B × K × N × N, where K is the number of attention heads. The weights are computed from the scaled dot product [Eq. (2)] over the channel, height, and width dimensions. This particular implementation makes the transformer scalable to different spatial input dimensions, from global to regional models, and is also agnostic to the number of ensemble members, meaning that the model can be trained on a limited number of ensemble members while inference is run on a larger ensemble. Taking inspiration from ensemble data assimilation, the attention layer splits the problem into a static and a dynamic formulation, where the static part is a linear combination of the input data, resulting in the value matrix V that is used in the attention layer in Eq. (2). The observations are used as the query matrix Q, and the key matrix K can be interpreted as the adjoint in data assimilation. The dynamic part then adds information from the other members to each member individually. A complete attention block uses residual connections that additively update the original member $Z_l$ in its latent space. This output is then projected with a final weight matrix $W_o$ back to the output space; hence, the attention block can be written as
$$Z_{l+1} = \sigma\left[Z_l + W_o T(Z_l)\right], \quad (3)$$
where Z is the data, T is the attention layer, $W_o$ is the linear projection of the residual output from the attention space to the data domain, and l is the index of the attention block.
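To fix ideas, the following PyTorch sketch implements attention across the ensemble dimension of a B × N × C × H × W tensor, with dot products taken over the flattened channel and spatial dimensions. It is a minimal sketch of the mechanism described above, not the PoET code: the class name and sizes are our assumptions, the key/query embeddings use plain 1 × 1 convolutions (PoET replaces them with strided 3 × 3 convolutions; see below), and the activation σ and layer normalization of Eq. (3) are omitted.

```python
import torch
import torch.nn as nn

class EnsembleAttention(nn.Module):
    """Self-attention over the ensemble dimension of a (B, N, C, H, W) tensor.

    Dot products are computed over the flattened (C, H, W) dimensions, so the
    attention map has shape (B, heads, N, N): the module is agnostic to the
    number of members N and to the spatial size of the grid.
    """

    def __init__(self, channels: int, heads: int = 4):
        super().__init__()
        self.heads = heads
        # 1 x 1 convolutions producing multihead key/query/value embeddings
        self.to_k = nn.Conv2d(channels, heads * channels, kernel_size=1)
        self.to_q = nn.Conv2d(channels, heads * channels, kernel_size=1)
        self.to_v = nn.Conv2d(channels, heads * channels, kernel_size=1)
        # Final projection W_o from attention space back to the data domain
        self.proj = nn.Conv2d(heads * channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, c, h, w = x.shape
        flat = x.reshape(b * n, c, h, w)
        # Per-member embeddings, flattened over (C, H, W) for the dot product
        k = self.to_k(flat).reshape(b, n, self.heads, -1)
        q = self.to_q(flat).reshape(b, n, self.heads, -1)
        v = self.to_v(flat).reshape(b, n, self.heads, c, h, w)
        d_k = k.shape[-1]
        # (B, heads, N, N) attention weights across the ensemble dimension, Eq. (2)
        attn = torch.einsum("bnhd,bmhd->bhnm", q, k) / d_k**0.5
        attn = attn.softmax(dim=-1)
        # Weighted combination of the value fields of all members
        out = torch.einsum("bhnm,bmhcxy->bnhcxy", attn, v)
        out = out.reshape(b * n, self.heads * c, h, w)
        # Residual update of each member in latent space, as in Eq. (3)
        return x + self.proj(out).reshape(b, n, c, h, w)
```

Because N only appears as an attention dimension, such a module can be trained on the 11-member reforecasts and run on the 51-member operational ensemble.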

PoET is an adaptation of the ensemble transformer of Finn (2021). The original model was trained on a much smaller dataset, with single input fields and a very coarse resolution of 5.625° in latitude and longitude. Our higher-resolution dataset results in a substantial increase in memory cost due to the dot products across the large dimensions C, H, and W. To manage this, we adapt the architecture and implement the transformer within a U-Net, as shown in Fig. 1. At each depth layer in the U-Net, following the embedding 2D convolution layers, we add a transformer block.1 Within the attention blocks, the convolution layers producing the key and query embeddings use a 3 × 3 convolution with a stride of 3 that serves to further reduce the dimensionality of the dot-product calculation. Skip connections after the attention blocks at each level of the U-Net allow transformed hidden states to pass through directly to the decoder at multiple spatial resolutions. Altogether, the PoET implementation reduces the memory footprint of the matrix multiplication operations within the transformer and enables the transformers to operate across different spatial scales. The layer normalization of the original ensemble transformer still operates at the full resolution of the grid but, unfortunately, does not allow the model to be run at a resolution different from that of the training data. Experiments omitting the layer normalization, or replacing it with another common technique, batch normalization, showed much worse performance. This observation cements the layer norm as an integral part of the transformer’s ability to correct forecast errors, likely because of its ability to capture local weather effects such as those of topography or land–sea differences.

Fig. 1. Schematic of the transformer U-Net architecture of PoET. The inset in the lower right is adapted from Finn (2021), used with permission. In the PoET adaptation, the 1 × 1 convolution shown by the yellow arrow in the inset is replaced by a 3 × 3 convolution with a stride of 3.

Finally, some more parameters of the PoET architecture are provided in appendix B for reference.

3. Experiments

a. PoET configuration

In our experiments, data for lead times every 6 h from 6 h up to a maximum of 96 h are used for training of the 2-m temperature model while, for precipitation, we start at 24 h to avoid the spinup. Because the lead time is not explicitly encoded in the model, it is possible to run inference for longer lead times.2

For the postprocessing of 2-m temperature forecasts, we include input features of 2-m temperature (T2M), temperature at 850 hPa (T850), geopotential height at 500 hPa (Z500), the u and υ components of wind at 700 hPa (U700 and V700), and total cloud cover (TCC). We additionally prescribe orography, a land–sea mask, and the top-of-atmosphere incoming solar radiation (insolation) as predictors. A model using a reduced feature set consisting of only T2M, T850, and Z500, plus the three prescribed variables, performed only slightly worse than the one trained on the full dataset.

For the postprocessing of precipitation forecasts, the input predictors are changed to total precipitation, convective precipitation, convective available potential energy, total cloud cover, total column water, sea surface temperature, temperature at 850 hPa, winds at 700 hPa, and geopotential at 500 hPa.

Apart from the selected predictors, the configuration of PoET is identical for the prediction of 2-m temperature and precipitation, with two exceptions. First, the normalization of total and convective precipitation is done with a shifted logarithmic transformation.3 This transformation is applied both to the predictor and to the predictand total precipitation. The second difference consists of using the kernel continuous ranked probability score (kCRPS) for precipitation instead of the Gaussian continuous ranked probability score (gCRPS) as a loss function, because the former makes no assumption about the distribution of the ensemble (the definitions are available in appendix C). We tested this formulation for 2-m temperature as well but observed little difference, the Gaussian approximation being appropriate for that variable.
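For illustration, a minimal sketch of the shifted logarithmic transformation of footnote 3 and its inverse; the normalization scale is an assumed placeholder, since the constant used in PoET is not specified here.

```python
import numpy as np

# Assumed scale (mm) turning precipitation into a dimensionless quantity
# before the shifted log; illustrative placeholder only
SCALE_MM = 1.0

def to_log_space(precip_mm):
    """Shifted logarithmic transformation, log(x + 1)."""
    return np.log1p(precip_mm / SCALE_MM)

def from_log_space(z):
    """Inverse transformation back to precipitation amounts."""
    return np.expm1(z) * SCALE_MM
```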

b. MBM configuration

The MBM parameters are estimated for each grid point and lead time separately. They also vary as a function of the time of the year in order to capture the seasonality of the forecast error. For this purpose, the training dataset differs for each forecast. We define a window centered around the forecast validity date and estimate the parameters using all training data within this time-of-the-year window. The suitable window size is different for the postprocessing of 2-m temperature and of precipitation. The window size is set to ±30 days for 2-m temperature and to ±60 days for precipitation for all lead times.
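As an illustration, the following sketch selects the training dates falling within such a time-of-the-year window, treating the day-of-year distance circularly; a minimal sketch under these assumptions, not the operational implementation.

```python
import numpy as np
import pandas as pd

def window_mask(train_dates, valid_date, half_width_days=30):
    """Boolean mask selecting training dates whose day of year falls within
    +/- half_width_days of the forecast validity date (wrapping around the
    end of the year); 30 for 2-m temperature, 60 for precipitation."""
    doy = pd.DatetimeIndex(train_dates).dayofyear.to_numpy()
    target = pd.Timestamp(valid_date).dayofyear
    dist = np.abs(doy - target)
    dist = np.minimum(dist, 365 - dist)  # circular day-of-year distance
    return dist <= half_width_days
```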

As for PoET, a shifted logarithmic transformation of the precipitation data is applied with MBM. Additionally, in inference mode, spurious precipitation is removed from MBM postprocessed precipitation fields and any correction leading to a change in precipitation value greater than 50 mm is rejected.

c. Verification process

We compare PoET, MBM, and raw forecasts in terms of their ability to predict 2-m temperature and precipitation up to 4 days in advance. Various aspects of the forecast performance are considered as described below. The results are presented in section 5, while the formal definitions of the verification metrics can be found in appendix C.

Bias and spread–skill relationships are used to assess the statistical consistency between forecast and verification. The bias is defined as the average difference between forecast and verification and a reliable forecast has a bias close to zero. The ensemble spread is defined as the standard deviation of the ensemble members with respect to the ensemble mean, while the ensemble mean error is defined as the root mean squared error of the ensemble mean. For a reliable ensemble forecast, the averaged ensemble spread should be close to the averaged ensemble mean error (Leutbecher and Palmer 2008; Fortin et al. 2014).

The CRPS is computed to assess the ensemble as a probabilistic forecast. Forecast performance in a multidimensional space is assessed using the energy score (ES), a generalization of the CRPS to the multivariate case (Gneiting and Raftery 2007). ES is applied over the time dimension, computed separately for each pair of consecutive time steps. Additionally, for precipitation, probability forecast performance for predefined events is assessed with the Brier score (BS; Brier 1950). We consider two precipitation events: 6-hourly precipitation exceeding 1 and 10 mm.

The relative skill of a forecast with respect to a reference forecast is estimated with the help of skill scores. In the following, we compute the continuous ranked probability skill score (CRPSS), the energy skill score (ESS), and the Brier skill score (BSS) using the raw ensemble forecast as a reference. A comparison of PoET and MBM postprocessed forecasts is also performed using the latter as a reference.

Scores and reliability metrics are computed at each grid point for all validity times and aggregated in time and/or space for plotting purposes. When aggregating scores in space (over the globe), we apply a weighting proportional to the cosine of the gridpoint latitude.
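A minimal sketch of this cosine-latitude weighting for a gridded score field; the function name is ours.

```python
import numpy as np

def global_mean(score, lat_deg):
    """Latitude-weighted global mean of an (n_lat, n_lon) score field,
    with weights proportional to the cosine of the gridpoint latitude."""
    w = np.cos(np.deg2rad(lat_deg))                 # (n_lat,) weights
    w2d = np.broadcast_to(w[:, None], score.shape)  # same weight along longitude
    return np.average(score, weights=w2d)
```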

4. Illustrative examples

a. PoET in action

Figure 2 shows the differences between each of the first three members of the raw ensemble and the corresponding PoET postprocessed forecasts at a lead time of 2 days. The ensemble mean change (top-left panel) mostly shows a warming of the forecast, except over Asia, which is consistent with the forecast bias discussed in the next section. The top row and leftmost column show the PoET forecast and the raw forecast, respectively. The remaining entries show the difference between the respective ensemble members once the ensemble mean change is removed. Along the diagonal, we see the change induced by PoET on each member. The off-diagonal entries show the difference between differing raw and PoET ensemble members. The larger amplitude in these off-diagonal plots, compared to the diagonal, indicates consistency between the input and output ensemble members; that is, the ensemble has not been reordered or dramatically shifted by postprocessing. A comparison of PoET-corrected forecasts with ERA5 fields in two extreme cases is provided below.

Fig. 2. Relative changes by PoET to a single date of the 2-m temperature forecast at day 2, valid on 6 Jan 2021. (left) The first three raw ensemble members. (top) The first three PoET-corrected ensemble members. Other entries show the differences between raw and PoET-corrected ensemble members when the ensemble mean difference has been removed. The ensemble mean difference is plotted in the top-left panel.

b. A 2-m temperature case study

At the end of June 2021, a heatwave hit the northwestern United States and Canada leading to new temperature records and devastating wildfires. The top panels in Fig. 3 compare the 3-day averaged maximum (0000 UTC) temperature predictions of MBM and PoET with the corresponding ERA5 reanalysis field. In this example, we average 2-m temperature forecasts over lead times 24, 48, and 72 h. One randomly selected ensemble member is shown to illustrate postprocessing in action on a single forecast. The bottom panels in Fig. 3 show the difference between ERA5 and the raw forecast as well as the corrections applied to the forecast with postprocessing. We check whether postprocessing compensates for errors in the raw forecast, that is, if Figs. 3e and 3f match Fig. 3d. Overall, there is a good correspondence between the raw forecast error and the postprocessing corrections for both MBM and PoET. For example, we note the correction of the cold bias over the continent. However, there is some spottiness visible in the PoET correction (Fig. 3f). Moreover, in both Figs. 3e and 3f, as expected, fine details in the error pattern are not accurately captured, due to factors such as the limited predictability of this extreme event.

Fig. 3. The 3-day averaged maximum temperature between 29 Jun and 1 Jul 2021: (a) ERA5, (b) MBM member 21, (c) PoET member 21, (d) difference between the raw forecast and ERA5 ($x_{\mathrm{raw}} - y$), (e) MBM correction ($x_{\mathrm{raw}} - x_{\mathrm{MBM}}$), and (f) PoET correction ($x_{\mathrm{raw}} - x_{\mathrm{PoET}}$) made to the raw forecast. In the case of postprocessing methods leading to a perfect deterministic forecast, (e) and (f) would match (d).

c. A total precipitation case study

In March 2021, Australia was affected by extreme rainfall. Sustained heavy rain led to flooding in the eastern part of the country, and large precipitation amounts were observed on the northern coast too. The top panels in Fig. 4 compare 3-day precipitation scenarios from MBM and PoET with the corresponding ERA5 precipitation field. The 3-day accumulated precipitation scenarios are derived from the postprocessed ensemble of 6-h accumulated precipitation forecasts, which are consistent scenarios in space and time. Here again, we randomly select one member for illustration purposes. The bottom panels in Fig. 4 show the difference between ERA5 and the raw precipitation forecast along with the corresponding postprocessing corrections. As in the 2-m temperature example, we check whether postprocessing compensates for raw forecast errors, that is, if Figs. 4e and 4f match Fig. 4d. The MBM correction has only some areas of consistency with the actual error, while the PoET correction tends to partially compensate for the raw forecast error along the north and west coasts, both over land and over the sea.

Fig. 4. As in Fig. 3, but for the 3-day accumulated precipitation between 18 and 21 Mar 2021.

5. Verification results

a. 2-m temperature results

To compare and assess the performance of MBM and PoET, we apply the verification metrics defined in section 3c (i.e., the CRPSS, spread and error, bias, and ESS) to the postprocessed forecasts. The results are first aggregated over the globe as a function of the forecast lead time. In Fig. 5, we see that both methods considerably improve the raw ensemble skill, with similar results in terms of CRPSS and ESS. PoET generates more skillful forecasts than MBM, but both methods improve on the raw forecast by ∼20% throughout the assessed lead times. Both methods also have a similar ability to reduce the bias.

Fig. 5. (a) CRPSS (the higher, the better), (b) spread and skill, (c) bias (optimal value zero), and (d) ESS (the higher, the better) of 2-m temperature for three ensemble forecasts: raw ensemble, MBM, and PoET. In (a), we also show results for PoET using only T2m as a predictor.

An additional experiment is run to disentangle the benefit of using the new ML-based method from the benefit of having more than one predictor as an input variable. Postprocessed forecasts are generated with PoET using only 2-m temperature as a predictor, as is the case when running MBM, and the results in terms of CRPSS are shown in Fig. 5a. The PoET machinery itself appears to contribute around two-thirds of the improvement over the benchmark method, while the remaining improvement can be attributed to the extra information available in the additional predictors. This result appears consistent across lead times.

The MBM approach seems better at maintaining spread–error parity, with PoET struggling at early lead times. Spread–skill diagrams, showing the error as a function of spread categories, reveal that aggregated scores must be interpreted carefully (see appendix D). Indeed, the uncertainty of PoET-corrected forecasts appears to reflect the potential forecast error more accurately than that of the MBM-corrected ones. Because compensating effects can be at play when averaging over all cases, we also compute the bias and the spread–error ratio at each grid point separately before averaging in absolute terms over the verification period (see appendix D). This approach reveals that the calibration underperformance of PoET compared with MBM is moderate and limited to the first lead times of the forecast. This result suggests a geographical disparity of the postprocessing impact, which is now further explored with maps of scores.

Figure 6 shows maps of bias for the raw data and the PoET-corrected forecasts. We focus on lead time day 4, and the results are aggregated at each grid point over all verification days. We clearly see a general decrease in the bias, with almost no remaining bias over the ocean. The remaining pockets of (generally positive) bias after postprocessing are mostly found over land, where the amplitude of the raw forecast bias is larger. A change of sign in the bias is interpreted as an indication that the general circulation patterns over the training period are not representative of the ones over the verification period. The broad structure of the bias is similar for MBM and for other lead times (not shown).

Fig. 6. Bias of 2-m temperature forecasts at lead time day 4 for (a) the raw ensemble and (b) PoET. An optimal forecast has a bias close to 0.

Figure 7 shows the gain in skill with PoET for the same lead time as in Fig. 6. CRPSS is computed using the raw ensemble as a baseline in Fig. 7a and MBM as a baseline in Fig. 7b. Figure 7a shows a widespread positive impact of PoET on the raw forecast skill, with a larger gain over land, where the raw forecast bias is generally more pronounced. A detrimental effect of postprocessing is observed in some regions (e.g., in South America, Africa, and Australia). These regions of negative skill score are also the ones where a bias is still present after postprocessing, as shown in Fig. 6b.

Fig. 7. CRPSS of the 2-m temperature PoET forecast with respect to (a) the raw ensemble and (b) MBM. Positive values indicate a skill improvement with PoET. Note the difference in scale between the two plots.

In Fig. 7b, there are very few areas where the CRPSS is less than zero, that is, areas where MBM forecasts have more skill than PoET. Improvements through PoET are fairly consistent across the globe, with no regions showing markedly larger gains from the neural network approach. Indeed, there is a strong agreement between the locations where MBM and PoET add value to the raw ensemble (not shown). Given that MBM learns a climatological correction for each grid point, this suggests that PoET has mostly reproduced this local climatological correction.

b. Total precipitation results

Postprocessing of precipitation forecasts is a more challenging task because of the form of the underlying forecast probability distribution with a point mass at 0 (the no-precipitation probability) and a skewness capturing the more extreme events. Postprocessing with MBM and PoET is tested with small changes to the configuration used for the postprocessing of 2-m temperature forecast (see section 3), and a similar set of plots is examined to assess the corrected forecast performance.

Figure 8 shows verification metrics aggregated globally as a function of the lead time. In contrast to the 2-m temperature results, the added benefit of either postprocessing approach is limited. With PoET, the skill improvement is approximately 2% in terms of CRPSS and ESS for the first several days of forecasting. With MBM, the skill improvement is ∼1% for most lead times. The gain in skill originates from improved performance in forecasting lower-intensity rather than higher-intensity events. Indeed, the BSS computed for two precipitation-exceedance thresholds, 1 and 10 mm, shows a larger skill score for the former (see Fig. D4 in appendix D).

Fig. 8. As in Fig. 5, but for total precipitation.

One explanation for the limited gain in skill with postprocessing is that the raw forecast is already well calibrated (see also Figs. D1c and D1d in appendix D). Also, PoET improves the averaged performance in Figs. 8a and 8d but seems to degrade both the spread–error ratio and the bias in Figs. 8b and 8c. This apparent paradox is explained by the large variations in forecast performance over the globe. A look at the mean absolute bias and mean absolute spread bias confirms that the bias and the spread–skill relationship are overall significantly improved with PoET when assessed locally (see Fig. D3 in appendix D). Similarly, the erratic spread correction with MBM at shorter lead times is not visible in Fig. D3b, suggesting an averaging artifact. Figure D3b also reveals that a point-by-point application of MBM does not seem appropriate to correct spread deficiencies at longer lead times.

Figure 9 provides another perspective on the bias by presenting maps of averaged values over all verification days. Here, the focus is on lead time day 4. The precipitation bias is reduced over land and the Maritime Continent. However, the bias in the tropical Pacific and Atlantic remains unchanged after postprocessing. We note that while positive biases are generally well corrected, negative biases are not.

Fig. 9. As in Fig. 6, but for total precipitation. A bias close to 0 is optimal. Regions with annual precipitation lower than 0.1 mm are masked in gray.

Finally, Figs. 10a and 10b show maps of CRPSS at day 4 for PoET using the raw forecast and MBM as a reference, respectively. PoET improves precipitation ensemble forecasts mainly over the tropics. Very localized degradation of the skill with respect to the raw forecast could be due to a too-short training sample. The benefit of using PoET rather than MBM appears predominantly in the tropics, but local positive skill scores are rather scattered. Alternate areas of positive and negative skill scores over the sea in the extratropics suggest that the two approaches are complementary.

Fig. 10. As in Fig. 7, but for total precipitation. Positive values indicate a gain in skill with PoET. Regions with annual precipitation lower than 0.1 mm are masked in gray.

c. ENS-10 comparison

During the preparation of this article, Ashkboos et al. (2022) produced a benchmark dataset for the postprocessing of ensemble weather forecasts (referred to as ENS-10 in the following). This framework is exploited to further test the capability of PoET and compare its performance with state-of-the-art ML postprocessing techniques.

The ENS-10 dataset is similar to the one used here, originating from the same model reforecast framework, but with several differences: the reforecast dataset was produced in 2018, the data are provided on a 0.5° grid, and the evaluation set comprises the last 2 years of the reforecast, meaning that the IFS configuration, including the ensemble size, is identical between training and testing. Note also that the verification differs, as a non-latitude-weighted CRPS is used as the performance metric.

To contribute to the benchmarking efforts, we train our model using the ENS-10 dataset and evaluate it following the same methodology. We reduce the data volume by training each model only on a chosen subset of the total ENS-10 variables. For Z500, we utilize all ENS-10 variables on this pressure level and use an equivalent approach for T850. For 2-m temperature (T2m), our model predictors are the variables at 500 and 850 hPa alongside the single-level variables 2-m temperature, skin temperature, sea surface temperature, mean sea level pressure, total cloud cover, and 10-m zonal and meridional wind. Also, the gridded data contain 720 points located at each pole that are unconstrained by the latitude-weighted training of PoET but contribute to the non-latitude-weighted evaluation. Therefore, we do not use PoET to correct these points but instead use the uncorrected raw forecast (the evaluation still includes these points to mirror the ENS-10 evaluation).

In terms of model architecture, we make a small change to the PoET model to incorporate the higher spatial resolution. We increase the depth of the U-Net structure by 1, putting a transformer block at its fourth level.

Table 1 contrasts the PoET scores with the raw forecast and the best benchmarks from Ashkboos et al. (2022). For almost all configurations, the PoET approach leads to a significant improvement over the ENS-10 benchmarks. In particular, for 2-m temperature, the CRPSS with PoET is considerably better than the results with the previous baselines. The improvement of ∼20% is similar to the gain measured with our dataset. For Z500 with a 5-member ensemble, the model improves on the raw output but fails to beat the LeNet approach (Ashkboos et al. 2022). We found no explanation for the limited success with this variable and ensemble member configuration. Similar results are obtained with the extreme event weighted continuous ranked probability score (EECRPS) introduced in Eq. (2) of Ashkboos et al. (2022).

Table 1. Global non-latitude-weighted mean CRPS and EECRPS on the ENS-10 test set (2016/17) for baseline models with 5 (5-ENS) and 10 (10-ENS) ensemble members. The first 5 members are used for the 5-ENS evaluation. The best results for each combination of score/ensemble/parameter are in bold.

6. Conclusions

This work shows how to efficiently transform an ensemble forecast into a calibrated ensemble forecast, that is, a set of physically consistent scenarios with improved statistical properties. We compare two methods: a machine learning method based on self-attention transformers (PoET) and a statistical method used as a benchmark (MBM). For both methods, each member is calibrated separately but with the aim of optimizing the ensemble properties as a whole. As a result, the postprocessed ensemble has the spatial, temporal, and intervariable coherence necessary to enable any downstream application. Also, both tested methods can be trained on a smaller reforecast dataset (here with 11 members) to effectively calibrate a much larger operational ensemble (here with 51 members), preserving intermember calibration.

Ensemble postprocessing is successfully applied to global gridded forecasts of 2-m temperature and precipitation, using ERA5 reanalysis as the ground truth. Our results show that both MBM and PoET can significantly improve the skill of the operational IFS ensemble. This improvement is achieved through a better calibration of the ensemble, both in terms of bias and spread–skill relationship. We note that PoET is better at the headline scores (CRPS, ES) but with some areas where MBM can locally outperform PoET. This latter point suggests that a combination of the two approaches could lead to further improvement of the forecast skill. Also, our case-study examples illustrate the ability of postprocessing to improve existing ensemble members.

The postprocessing gain is smaller for precipitation than for 2-m temperature. Indeed, the skill improvement of precipitation forecasts is relatively small in this application. This result contrasts with results obtained with downscaling approaches, where accounting for representativeness uncertainty can have a major impact on scores (Ben Bouallègue et al. 2020). In further work, we will consider how PoET could be applied to ungridded observations, which would require architectural changes.

Direct applications of the methodologies developed here include postprocessing for forecast verification and intercomparison purposes. For example, bias correction can be applied to better understand changes in CRPS results with new IFS model versions (Leutbecher and Haiden 2021). Also, postprocessing would be a necessary step for a fair comparison of NWP forecasts with statistically optimized (data driven) ones in forecasting competition frameworks (see, e.g., Rasp et al. 2020). Finally, the proposed methods could be trivially adapted to a higher-resolution version of the truth, which could pave the way to ensemble postprocessing of global gridded data for operational forecasting.

1. More than one transformer block can be used, similar to Finn (2021), but in our experiments this did not result in performance increases.

2. While not shown here, PoET remains skillful for longer lead times.

3. That is, log(x + 1), with x the precipitation amount normalized into a dimensionless quantity (similar to Lopez 2011).

Acknowledgments.

The authors thank Tobias Finn for many interesting discussions and the original idea of using transformer techniques in the ensemble dimension. We also thank two anonymous reviewers for their valuable suggestions to improve the quality of this paper. Peter Dueben, Matthew Chantry, and Jesper Dramsch gratefully acknowledge funding from the MAELSTROM EuroHPC-JU project (JU) under Grant 955513. The JU receives support from the European Union’s Horizon research and innovation program and United Kingdom, Germany, Italy, Luxembourg, Switzerland, and Norway. Peter Dueben gratefully acknowledges funding from the ESiWACE project funded under Horizon 2020 Grant 823988. Mariana Clare gratefully acknowledges funding by the European Union under the Destination Earth initiative. Finally, all authors acknowledge the use of computing resources from both the European Weather Cloud and Microsoft Azure.

Data availability statement.

It is currently difficult for the authors to share the data, as the hardware used is being replaced. The authors will, however, make the data available for download when the paper is published. The PoET source code is available at https://github.com/ecmwf-lab/poet, and the MBM parameter estimation relies on the Climdyn/pythie package (Demaeyer 2022).

APPENDIX A

MBM Parameter Optimization

The MBM parameters α and β are derived by solving an ordinary least squares regression. They are computed as
$$\alpha = \langle y \rangle - \beta \langle \bar{x} \rangle, \quad (A1)$$
with y the verification and where $\langle \cdot \rangle$ is an averaging operator applied over the training sample, and
$$\beta = \rho_{y\bar{x}} \frac{\sigma_y}{\sigma_{\bar{x}}}, \quad (A2)$$
where $\rho_{y\bar{x}}$ is the correlation between the verification and the ensemble mean, and $\sigma_y$ and $\sigma_{\bar{x}}$ are the standard deviations of the verification and of the ensemble mean, respectively. Imposing the reliability constraints and using maximum likelihood estimators, we have
$$\gamma^2 = \frac{\sigma_y^2}{\sigma_\epsilon^2}\left(1 - \rho_{y\bar{x}}^2\right), \quad (A3)$$
with $\sigma_\epsilon$ the ensemble spread (ensemble standard deviation).

APPENDIX B

Additional PoET Model Parameters

Table B1 shows some additional hyperparameters used for the PoET model. Only minimal searching was performed to obtain these values, as no significant improvements were observed.

Table B1. Additional hyperparameters for PoET.

APPENDIX C

Score Definitions

a. CRPS

Following Gneiting and Raftery (2007), the kernel CRPS is defined as
$$\mathrm{kCRPS}(x, y) = \frac{1}{M}\sum_{i=1}^{M}|x_i - y| - \frac{1}{2M^2}\sum_{i=1}^{M}\sum_{j=1}^{M}|x_i - x_j|, \quad (C1)$$
where x is an ensemble of M predictions and y is the outcome.

Gneiting et al. (2005) suggested a closed-form expression for the CRPS of a Gaussian distribution with mean μ and variance σ², defined as
$$\mathrm{gCRPS}[N(\mu, \sigma^2), y] = \frac{\sigma}{\sqrt{\pi}}\left\{\sqrt{\pi}\,\frac{y-\mu}{\sigma}\,\mathrm{erf}\left(\frac{y-\mu}{\sqrt{2}\sigma}\right) + \sqrt{2}\,\exp\left[-\frac{(y-\mu)^2}{2\sigma^2}\right] - 1\right\}, \quad (C2)$$
with
$$\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}}\int_0^x \exp(-t^2)\,dt. \quad (C3)$$
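As a reference, a direct NumPy/SciPy transcription of the two estimators above; the function names are ours.

```python
import numpy as np
from scipy.special import erf

def kcrps(x, y):
    """Kernel CRPS for an ensemble x of size M and a scalar outcome y."""
    m = len(x)
    term1 = np.abs(x - y).mean()
    term2 = np.abs(x[:, None] - x[None, :]).sum() / (2 * m**2)
    return term1 - term2

def gcrps(mu, sigma, y):
    """Closed-form CRPS of a Gaussian N(mu, sigma^2) for an outcome y."""
    z = (y - mu) / sigma
    return (sigma / np.sqrt(np.pi)) * (
        np.sqrt(np.pi) * z * erf(z / np.sqrt(2))
        + np.sqrt(2) * np.exp(-z**2 / 2)
        - 1.0
    )
```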

b. ES

ES is a generalization of the CRPS to the multivariate case (Gneiting and Raftery 2007). It is defined as
$$\mathrm{ES}(X, y) = \frac{1}{M}\sum_{j=1}^{M}\left\|X_j - y\right\| - \frac{1}{2M^2}\sum_{i=1}^{M}\sum_{j=1}^{M}\left\|X_i - X_j\right\|, \quad (C4)$$
where X and y are the ensemble forecast and the verification in the multivariate space, respectively.
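A matching NumPy sketch for an ensemble of multivariate forecasts; in section 3c, each multivariate sample would gather a pair of consecutive time steps.

```python
import numpy as np

def energy_score(X, y):
    """Energy score for an (M, d) ensemble X and a (d,) verification y."""
    m = X.shape[0]
    term1 = np.linalg.norm(X - y, axis=1).mean()
    diffs = X[:, None, :] - X[None, :, :]           # (M, M, d) pairwise differences
    term2 = np.linalg.norm(diffs, axis=-1).sum() / (2 * m**2)
    return term1 - term2
```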

c. Skill scores

Skill scores measure the relative performance of a given forecast f with respect to a reference forecast g. For a given score S, the skill score SS is computed as
$$\mathrm{SS} = 1 - \frac{\bar{S}_f}{\bar{S}_g}, \quad (C5)$$
with $\bar{S}_f$ and $\bar{S}_g$ the averaged score S for forecasts f and g, respectively.

APPENDIX D

Additional Plots

a. Reliability diagrams

For a reliable ensemble forecast, we expect consistency between ensemble spread and ensemble mean error. This can be checked with reliability plots as shown in Fig. D1. For a given forecast spread category (on the x axis), we check the consistency of the corresponding forecast error (RMSE on the y axis). Perfect reliability is indicated with a dashed diagonal line.
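A sketch of how such a diagram can be built: cases are binned by forecast spread (here into equally populated bins, an illustrative choice), and the RMSE of the ensemble mean is computed within each bin.

```python
import numpy as np

def spread_error_bins(spread, error, n_bins=10):
    """Bin cases by ensemble spread and return the mean spread and the RMSE
    of the ensemble mean within each bin, for a spread-error reliability plot."""
    edges = np.quantile(spread, np.linspace(0.0, 1.0, n_bins + 1))
    idx = np.clip(np.digitize(spread, edges[1:-1]), 0, n_bins - 1)
    mean_spread = np.array([spread[idx == b].mean() for b in range(n_bins)])
    rmse = np.array([np.sqrt((error[idx == b] ** 2).mean()) for b in range(n_bins)])
    return mean_spread, rmse
```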

Fig. D1. Reliability plots showing the spread–error relationship for 2-m temperature forecasts at lead time (a) day 2 and (b) day 4, and for precipitation forecasts at lead time (c) day 2 and (d) day 4. For each plot, a histogram shows the number of cases in each forecast uncertainty category.

b. Mean absolute bias and mean absolute spread bias

A bias computed over a heterogeneous sample can be misleading because of the compensating effects at play. As a complement to the averaged bias, we compute a mean absolute bias as follows:
$$\left\langle \left| \left\langle \bar{x} - y \right\rangle_{\mathrm{time}} \right| \right\rangle_{\mathrm{space}}, \quad (D1)$$
where $\langle \cdot \rangle_{\mathrm{time}}$ and $\langle \cdot \rangle_{\mathrm{space}}$ are the averaging operators over the verification period and the verification domain, respectively. The bias is first computed at each grid point over the verification period, and its absolute value is then averaged over the domain.
Similarly, a comparison of the ensemble spread with the ensemble mean error might not be meaningful if spread and error are averaged over a heterogeneous sample. Here, we suggest computing the ratio between spread and error at each grid point separately before centering around 0 and averaging in absolute terms over the domain. Formally, the mean absolute spread bias is computed as follows:
$$\left\langle \left| \frac{M+1}{M-1}\, \frac{\left\langle \frac{1}{M}\sum_{i=1}^{M}\left(x_i - \bar{x}\right)^2 \right\rangle_{\mathrm{time}}}{\left\langle \left(y - \bar{x}\right)^2 \right\rangle_{\mathrm{time}}} - 1 \right| \right\rangle_{\mathrm{space}}, \quad (D2)$$
where the factor (M + 1)/(M − 1) accounts for the limited ensemble size.
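A sketch of both diagnostics, assuming forecasts of shape (time, members, lat, lon) and a verification of shape (time, lat, lon); the latitude weighting of the spatial average is omitted for brevity.

```python
import numpy as np

def mean_abs_bias(x, y):
    """Spatial mean of the absolute time-mean bias of the ensemble mean."""
    bias = (x.mean(axis=1) - y).mean(axis=0)   # time-mean bias at each grid point
    return np.abs(bias).mean()                 # spatial mean of its absolute value

def mean_abs_spread_bias(x, y):
    """Spatial mean of the absolute deviation of the (size-corrected)
    spread-error variance ratio from 1, Eq. (D2)."""
    m = x.shape[1]
    xbar = x.mean(axis=1)
    ens_var = ((x - xbar[:, None]) ** 2).mean(axis=1)  # (1/M) sum_i (x_i - xbar)^2
    ratio = ((m + 1) / (m - 1)) * ens_var.mean(axis=0) / ((y - xbar) ** 2).mean(axis=0)
    return np.abs(ratio - 1.0).mean()
```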

The mean absolute bias and mean absolute spread bias for 2-m temperature and precipitation are shown in Figs. D2 and D3, respectively.

Fig. D2. (a) Mean absolute bias and (b) mean absolute spread bias for 2-m temperature forecasts. The closer to 0, the better.

Fig. D3. As in Fig. D2, but for total precipitation.

c. BSS

The BS is used to assess the performance of a binary probability forecast for a given event (Brier 1950). It is computed here to better understand the impact of postprocessing on the results in terms of CRPS. Indeed, the CRPS corresponds to the integral of the Brier score over all possible thresholds. Figure D4 shows the Brier skill score for two precipitation events defined as precipitation exceeding 1 and 10 mm in 6 h, respectively.
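For completeness, a minimal sketch of the BS, the mean squared difference between the forecast probability and the binary outcome; the function name is ours.

```python
import numpy as np

def brier_score(p, o):
    """Brier score for forecast probabilities p and binary outcomes o
    (1 if the event occurred, 0 otherwise); 0 is a perfect score."""
    return np.mean((np.asarray(p) - np.asarray(o)) ** 2)
```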

Fig. D4. BSS (the larger, the better) for two threshold-exceedance events, (a) 1 mm and (b) 10 mm, for three ensemble forecasts: the raw ensemble, MBM, and PoET.

REFERENCES

  • Ashkboos, S., L. Huang, N. Dryden, T. Ben-Nun, P. Dueben, L. Gianinazzi, L. Kummer, and T. Hoefler, 2022: ENS-10: A dataset for post-processing ensemble weather forecast. arXiv, 2206.14786v2, https://doi.org/10.48550/arXiv.2206.14786.
  • Baran, S., Á. Baran, F. Pappenberger, and Z. Ben Bouallègue, 2020: Statistical post-processing of heat index ensemble forecasts: Is there a royal road? Quart. J. Roy. Meteor. Soc., 146, 3416–3434, https://doi.org/10.1002/qj.3853.
  • Ben Bouallègue, Z., 2017: Statistical postprocessing of ensemble global radiation forecasts with penalized quantile regression. Meteor. Z., 26, 253–264, https://doi.org/10.1127/metz/2016/0748.
  • Ben Bouallègue, Z., T. Heppelmann, S. E. Theis, and P. Pinson, 2016: Generation of scenarios from calibrated ensemble forecasts with a dual-ensemble copula-coupling approach. Mon. Wea. Rev., 144, 4737–4750, https://doi.org/10.1175/MWR-D-15-0403.1.
  • Ben Bouallègue, Z., T. Haiden, N. J. Weber, T. M. Hamill, and D. S. Richardson, 2020: Accounting for representativeness in the verification of ensemble precipitation forecasts. Mon. Wea. Rev., 148, 2049–2062, https://doi.org/10.1175/MWR-D-19-0323.1.
  • Bi, K., L. Xie, H. Zhang, X. Chen, X. Gu, and Q. Tian, 2022: Pangu-Weather: A 3D high-resolution model for fast and accurate global weather forecast. arXiv, 2211.02556v1, https://doi.org/10.48550/ARXIV.2211.02556.
  • Bremnes, J. B., 2020: Ensemble postprocessing using quantile function regression based on neural networks and Bernstein polynomials. Mon. Wea. Rev., 148, 403–414, https://doi.org/10.1175/MWR-D-19-0227.1.
  • Brier, G. W., 1950: Verification of forecasts expressed in terms of probability. Mon. Wea. Rev., 78, 1–3, https://doi.org/10.1175/1520-0493(1950)078<0001:VOFEIT>2.0.CO;2.
  • Demaeyer, J., 2022: Climdyn/pythie: Version 0.1.0 alpha release. Zenodo, https://doi.org/10.5281/zenodo.7233538.
  • Demaeyer, J., S. Vannitsem, and B. Van Schaeybroeck, 2021: Statistical post-processing of ensemble forecasts at the Belgian met service. ECMWF Newsletter, No. 166, ECMWF, Reading, United Kingdom, 21–25, https://www.ecmwf.int/en/newsletter/166/meteorology/statistical-post-processing-ensemble-forecasts-belgian-met-service.
  • Demaeyer, J., and Coauthors, 2023: The EUPPBench postprocessing benchmark dataset v1.0. Earth Syst. Sci. Data, 15, 2635–2653, https://doi.org/10.5194/essd-15-2635-2023.
  • Dueben, P. D., M. G. Schultz, M. Chantry, D. J. Gagne II, D. M. Hall, and A. McGovern, 2022: Challenges and benchmark datasets for machine learning in the atmospheric sciences: Definition, status, and outlook. Artif. Intell. Earth Syst., 1, e210002, https://doi.org/10.1175/AIES-D-21-0002.1.
  • Finn, T. S., 2021: Self-attentive ensemble transformer: Representing ensemble interactions in neural networks for Earth system models. arXiv, 2106.13924v2, https://doi.org/10.48550/ARXIV.2106.13924.
  • Fortin, V., M. Abaza, F. Anctil, and R. Turcotte, 2014: Why should ensemble spread match the RMSE of the ensemble mean? J. Hydrometeor., 15, 1708–1713, https://doi.org/10.1175/JHM-D-14-0008.1.
  • Gneiting, T., and A. E. Raftery, 2007: Strictly proper scoring rules, prediction, and estimation. J. Amer. Stat. Assoc., 102, 359–378, https://doi.org/10.1198/016214506000001437.
  • Gneiting, T., A. E. Raftery, A. H. Westveld III, and T. Goldman, 2005: Calibrated probabilistic forecasting using ensemble model output statistics and minimum CRPS estimation. Mon. Wea. Rev., 133, 1098–1118, https://doi.org/10.1175/MWR2904.1.
  • Grönquist, P., C. Yao, T. Ben-Nun, N. Dryden, P. Dueben, S. Li, and T. Hoefler, 2021: Deep learning for post-processing ensemble weather forecasts. Philos. Trans. Roy. Soc., A379, 20200092, https://doi.org/10.1098/rsta.2020.0092.
  • Hagedorn, R., T. M. Hamill, and J. S. Whitaker, 2008: Probabilistic forecast calibration using ECMWF and GFS ensemble reforecasts. Part I: Two-meter temperatures. Mon. Wea. Rev., 136, 2608–2619, https://doi.org/10.1175/2007MWR2410.1.
  • Haiden, T., M. Janousek, F. Vitart, Z. Ben Bouallègue, L. Ferranti, and D. R. F. Prates, 2021: Evaluation of ECMWF forecasts, including 2021 upgrade. ECMWF Tech. Memo. 884, 56 pp., https://www.ecmwf.int/en/elibrary/81235-evaluation-ecmwf-forecasts-including-2021-upgrade.
  • Hamill, T. M., R. Hagedorn, and J. S. Whitaker, 2008: Probabilistic forecast calibration using ECMWF and GFS ensemble reforecasts. Part II: Precipitation. Mon. Wea. Rev., 136, 2620–2632, https://doi.org/10.1175/2007MWR2411.1.
  • Hersbach, H., and Coauthors, 2020: The ERA5 global reanalysis. Quart. J. Roy. Meteor. Soc., 146, 1999–2049, https://doi.org/10.1002/qj.3803.
  • Hochreiter, S., and J. Schmidhuber, 1997: Long short-term memory. Neural Comput., 9, 1735–1780, https://doi.org/10.1162/neco.1997.9.8.1735.
  • Lam, R., and Coauthors, 2022: GraphCast: Learning skillful medium-range global weather forecasting. arXiv, 2212.12794v2, https://doi.org/10.48550/ARXIV.2212.12794.
  • Lavers, D. A., A. Simmons, F. Vamborg, and M. J. Rodwell, 2022: An evaluation of ERA5 precipitation for climate monitoring. Quart. J. Roy. Meteor. Soc., 148, 3152–3165, https://doi.org/10.1002/qj.4351.
  • Leutbecher, M., 2019: Ensemble size: How suboptimal is less than infinity? Quart. J. Roy. Meteor. Soc., 145, 107–128, https://doi.org/10.1002/qj.3387.
  • Leutbecher, M., and T. N. Palmer, 2008: Ensemble forecasting. J. Comput. Phys., 227, 3515–3539, https://doi.org/10.1016/j.jcp.2007.02.014.
  • Leutbecher, M., and T. Haiden, 2021: Understanding changes of the continuous ranked probability score using a homogeneous Gaussian approximation. Quart. J. Roy. Meteor. Soc., 147, 425–442, https://doi.org/10.1002/qj.3926.
  • Lewis, J. M., 2005: Roots of ensemble forecasting. Mon. Wea. Rev., 133, 1865–1885, https://doi.org/10.1175/MWR2949.1.
  • Lopez, P., 2011: Direct 4D-Var assimilation of NCEP stage IV radar and gauge precipitation data at ECMWF. Mon. Wea. Rev., 139, 2098–2116, https://doi.org/10.1175/2010MWR3565.1.
  • Magnusson, L., C. Prudhomme, F. Di Giuseppe, C. Di Napoli, and F. Pappenberger, 2023: Operational multiscale predictions of hazardous events. Extreme Weather Forecasting, M. Astitha and E. Nikolopoulos, Eds., Elsevier, 87–129, https://doi.org/10.1016/B978-0-12-820124-4.00008-6.
  • Pathak, J., and Coauthors, 2022: FourCastNet: A global data-driven high-resolution weather model using adaptive Fourier neural operators. arXiv, 2202.11214v1, https://doi.org/10.48550/ARXIV.2202.11214.
  • Raftery, A. E., T. Gneiting, F. Balabdaoui, and M. Polakowski, 2005: Using Bayesian model averaging to calibrate forecast ensembles. Mon. Wea. Rev., 133, 1155–1174, https://doi.org/10.1175/MWR2906.1.

    • Search Google Scholar
    • Export Citation
  • Rasp, S., and S. Lerch, 2018: Neural networks for postprocessing ensemble weather forecasts. Mon. Wea. Rev., 146, 38853900, https://doi.org/10.1175/MWR-D-18-0187.1.

    • Search Google Scholar
    • Export Citation
  • Rasp, S., P. D. Dueben, S. Scher, J. A. Weyn, S. Mouatadid, and N. Thuerey, 2020: WeatherBench: A benchmark data set for data-driven weather forecasting. J. Adv. Model. Earth Syst., 12, e2020MS002203, https://doi.org/10.1029/2020MS002203.

    • Search Google Scholar
    • Export Citation
  • Richardson, D. S., 2000: Skill and relative economic value of the ECMWF ensemble prediction system. Quart. J. Roy. Meteor. Soc., 126, 649667, https://doi.org/10.1002/qj.49712656313.

    • Search Google Scholar
    • Export Citation
  • Taillardat, M., O. Mestre, M. Zamo, and P. Naveau, 2016: Calibrated ensemble forecasts using quantile regression forests and ensemble model output statistics. Mon. Wea. Rev., 144, 23752393, https://doi.org/10.1175/MWR-D-15-0260.1.

    • Search Google Scholar
    • Export Citation
  • Vannitsem, S., and Coauthors, 2021: Statistical postprocessing for weather forecasts: Review, challenges, and avenues in a big data world. Bull. Amer. Meteor. Soc., 102, E681E699, https://doi.org/10.1175/BAMS-D-19-0308.1.

    • Search Google Scholar
    • Export Citation
  • Van Schaeybroeck, B., and S. Vannitsem, 2015: Ensemble post-processing using member-by-member approaches: Theoretical aspects. Quart. J. Roy. Meteor. Soc., 141, 807818, https://doi.org/10.1002/qj.2397.

    • Search Google Scholar
    • Export Citation
  • Vaswani, A., N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, 2017: Attention is all you need. arXiv, 1706.03762v7, https://doi.org/10.48550/ARXIV.1706.03762.

  • Vitart, F., G. Balsamo, J.-R. Bidlot, S. Lang, and I. Tsonevsky, 2019: Use of ERA5 to initialize ensemble re-forecasts. ECMWF Tech. Memo. 841, 16 pp., https://www.ecmwf.int/sites/default/files/elibrary/2019/18872-use-era5-initialize-ensemble-re-forecasts.pdf.

Save
  • Ashkboos, S., L. Huang, N. Dryden, T. Ben-Nun, P. Dueben, L. Gianinazzi, L. Kummer, and T. Hoefler, 2022: ENS-10: A dataset for post-processing ensemble weather forecast. arXiv, 2206.14786v2, https://doi.org/10.48550/arXiv.2206.14786.

  • Baran, S., Á. Baran, F. Pappenberger, and Z. Ben Bouallègue, 2020: Statistical post-processing of heat index ensemble forecasts: Is there a royal road? Quart. J. Roy. Meteor. Soc., 146, 3416–3434, https://doi.org/10.1002/qj.3853.

  • Ben Bouallègue, Z., 2017: Statistical postprocessing of ensemble global radiation forecasts with penalized quantile regression. Meteor. Z., 26, 253–264, https://doi.org/10.1127/metz/2016/0748.

  • Ben Bouallègue, Z., T. Heppelmann, S. E. Theis, and P. Pinson, 2016: Generation of scenarios from calibrated ensemble forecasts with a dual-ensemble copula-coupling approach. Mon. Wea. Rev., 144, 4737–4750, https://doi.org/10.1175/MWR-D-15-0403.1.

  • Ben Bouallègue, Z., T. Haiden, N. J. Weber, T. M. Hamill, and D. S. Richardson, 2020: Accounting for representativeness in the verification of ensemble precipitation forecasts. Mon. Wea. Rev., 148, 2049–2062, https://doi.org/10.1175/MWR-D-19-0323.1.

  • Bi, K., L. Xie, H. Zhang, X. Chen, X. Gu, and Q. Tian, 2022: Pangu-Weather: A 3D high-resolution model for fast and accurate global weather forecast. arXiv, 2211.02556v1, https://doi.org/10.48550/ARXIV.2211.02556.

  • Bremnes, J. B., 2020: Ensemble postprocessing using quantile function regression based on neural networks and Bernstein polynomials. Mon. Wea. Rev., 148, 403–414, https://doi.org/10.1175/MWR-D-19-0227.1.

  • Brier, G. W., 1950: Verification of forecasts expressed in terms of probability. Mon. Wea. Rev., 78, 1–3, https://doi.org/10.1175/1520-0493(1950)078<0001:VOFEIT>2.0.CO;2.

  • Demaeyer, J., 2022: Climdyn/pythie: Version 0.1.0 alpha release. Zenodo, https://doi.org/10.5281/zenodo.7233538.

  • Demaeyer, J., S. Vannitsem, and B. Van Schaeybroeck, 2021: Statistical post-processing of ensemble forecasts at the Belgian met service. ECMWF Newsletter, No. 166, ECMWF, Reading, United Kingdom, 21–25, https://www.ecmwf.int/en/newsletter/166/meteorology/statistical-post-processing-ensemble-forecasts-belgian-met-service.

  • Demaeyer, J., and Coauthors, 2023: The EUPPBench postprocessing benchmark dataset v1.0. Earth Syst. Sci. Data, 15, 2635–2653, https://doi.org/10.5194/essd-15-2635-2023.

  • Dueben, P. D., M. G. Schultz, M. Chantry, D. J. Gagne II, D. M. Hall, and A. McGovern, 2022: Challenges and benchmark datasets for machine learning in the atmospheric sciences: Definition, status, and outlook. Artif. Intell. Earth Syst., 1, e210002, https://doi.org/10.1175/AIES-D-21-0002.1.

  • Finn, T. S., 2021: Self-attentive ensemble transformer: Representing ensemble interactions in neural networks for Earth system models. arXiv, 2106.13924v2, https://doi.org/10.48550/ARXIV.2106.13924.

  • Fortin, V., M. Abaza, F. Anctil, and R. Turcotte, 2014: Why should ensemble spread match the RMSE of the ensemble mean? J. Hydrometeor., 15, 1708–1713, https://doi.org/10.1175/JHM-D-14-0008.1.

  • Gneiting, T., and A. E. Raftery, 2007: Strictly proper scoring rules, prediction, and estimation. J. Amer. Stat. Assoc., 102, 359–378, https://doi.org/10.1198/016214506000001437.

  • Gneiting, T., A. E. Raftery, A. H. Westveld III, and T. Goldman, 2005: Calibrated probabilistic forecasting using ensemble model output statistics and minimum CRPS estimation. Mon. Wea. Rev., 133, 1098–1118, https://doi.org/10.1175/MWR2904.1.

  • Grönquist, P., C. Yao, T. Ben-Nun, N. Dryden, P. Dueben, S. Li, and T. Hoefler, 2021: Deep learning for post-processing ensemble weather forecasts. Philos. Trans. Roy. Soc. A, 379, 20200092, https://doi.org/10.1098/rsta.2020.0092.

  • Hagedorn, R., T. M. Hamill, and J. S. Whitaker, 2008: Probabilistic forecast calibration using ECMWF and GFS ensemble reforecasts. Part I: Two-meter temperatures. Mon. Wea. Rev., 136, 2608–2619, https://doi.org/10.1175/2007MWR2410.1.

  • Haiden, T., M. Janousek, F. Vitart, Z. Ben Bouallègue, L. Ferranti, and D. R. F. Prates, 2021: Evaluation of ECMWF forecasts, including 2021 upgrade. ECMWF Tech. Memo. 884, 56 pp., https://www.ecmwf.int/en/elibrary/81235-evaluation-ecmwf-forecasts-including-2021-upgrade.

  • Hamill, T. M., R. Hagedorn, and J. S. Whitaker, 2008: Probabilistic forecast calibration using ECMWF and GFS ensemble reforecasts. Part II: Precipitation. Mon. Wea. Rev., 136, 2620–2632, https://doi.org/10.1175/2007MWR2411.1.

  • Hersbach, H., and Coauthors, 2020: The ERA5 global reanalysis. Quart. J. Roy. Meteor. Soc., 146, 1999–2049, https://doi.org/10.1002/qj.3803.

  • Hochreiter, S., and J. Schmidhuber, 1997: Long short-term memory. Neural Comput., 9, 1735–1780, https://doi.org/10.1162/neco.1997.9.8.1735.

  • Lam, R., and Coauthors, 2022: GraphCast: Learning skillful medium-range global weather forecasting. arXiv, 2212.12794v2, https://doi.org/10.48550/ARXIV.2212.12794.

  • Lavers, D. A., A. Simmons, F. Vamborg, and M. J. Rodwell, 2022: An evaluation of ERA5 precipitation for climate monitoring. Quart. J. Roy. Meteor. Soc., 148, 3152–3165, https://doi.org/10.1002/qj.4351.

  • Leutbecher, M., 2019: Ensemble size: How suboptimal is less than infinity? Quart. J. Roy. Meteor. Soc., 145, 107–128, https://doi.org/10.1002/qj.3387.

  • Leutbecher, M., and T. N. Palmer, 2008: Ensemble forecasting. J. Comput. Phys., 227, 3515–3539, https://doi.org/10.1016/j.jcp.2007.02.014.

  • Leutbecher, M., and T. Haiden, 2021: Understanding changes of the continuous ranked probability score using a homogeneous Gaussian approximation. Quart. J. Roy. Meteor. Soc., 147, 425–442, https://doi.org/10.1002/qj.3926.

  • Lewis, J. M., 2005: Roots of ensemble forecasting. Mon. Wea. Rev., 133, 1865–1885, https://doi.org/10.1175/MWR2949.1.

  • Lopez, P., 2011: Direct 4D-Var assimilation of NCEP stage IV radar and gauge precipitation data at ECMWF. Mon. Wea. Rev., 139, 2098–2116, https://doi.org/10.1175/2010MWR3565.1.

  • Magnusson, L., C. Prudhomme, F. Di Giuseppe, C. Di Napoli, and F. Pappenberger, 2023: Operational multiscale predictions of hazardous events. Extreme Weather Forecasting, M. Astitha and E. Nikolopoulos, Eds., Elsevier, 87–129, https://doi.org/10.1016/B978-0-12-820124-4.00008-6.

  • Pathak, J., and Coauthors, 2022: FourCastNet: A global data-driven high-resolution weather model using adaptive Fourier neural operators. arXiv, 2202.11214v1, https://doi.org/10.48550/ARXIV.2202.11214.

  • Raftery, A. E., T. Gneiting, F. Balabdaoui, and M. Polakowski, 2005: Using Bayesian model averaging to calibrate forecast ensembles. Mon. Wea. Rev., 133, 1155–1174, https://doi.org/10.1175/MWR2906.1.

  • Rasp, S., and S. Lerch, 2018: Neural networks for postprocessing ensemble weather forecasts. Mon. Wea. Rev., 146, 3885–3900, https://doi.org/10.1175/MWR-D-18-0187.1.

  • Rasp, S., P. D. Dueben, S. Scher, J. A. Weyn, S. Mouatadid, and N. Thuerey, 2020: WeatherBench: A benchmark data set for data-driven weather forecasting. J. Adv. Model. Earth Syst., 12, e2020MS002203, https://doi.org/10.1029/2020MS002203.

  • Richardson, D. S., 2000: Skill and relative economic value of the ECMWF ensemble prediction system. Quart. J. Roy. Meteor. Soc., 126, 649–667, https://doi.org/10.1002/qj.49712656313.

  • Taillardat, M., O. Mestre, M. Zamo, and P. Naveau, 2016: Calibrated ensemble forecasts using quantile regression forests and ensemble model output statistics. Mon. Wea. Rev., 144, 2375–2393, https://doi.org/10.1175/MWR-D-15-0260.1.

  • Vannitsem, S., and Coauthors, 2021: Statistical postprocessing for weather forecasts: Review, challenges, and avenues in a big data world. Bull. Amer. Meteor. Soc., 102, E681–E699, https://doi.org/10.1175/BAMS-D-19-0308.1.

  • Van Schaeybroeck, B., and S. Vannitsem, 2015: Ensemble post-processing using member-by-member approaches: Theoretical aspects. Quart. J. Roy. Meteor. Soc., 141, 807–818, https://doi.org/10.1002/qj.2397.

  • Vaswani, A., N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, 2017: Attention is all you need. arXiv, 1706.03762v7, https://doi.org/10.48550/ARXIV.1706.03762.

  • Vitart, F., G. Balsamo, J.-R. Bidlot, S. Lang, and I. Tsonevsky, 2019: Use of ERA5 to initialize ensemble re-forecasts. ECMWF Tech. Memo. 841, 16 pp., https://www.ecmwf.int/sites/default/files/elibrary/2019/18872-use-era5-initialize-ensemble-re-forecasts.pdf.

  • Fig. 1.

    Schematic of the transformer U-Net architecture of PoET. The inset in the lower right is adapted from Finn (2021), used with permission. In the PoET adaptation, the 1 × 1 convolution shown by the yellow arrow in the inset is replaced by a 3 × 3 convolution with a stride of 3.

  • Fig. 2.

    Changes made by PoET to the 2-m temperature forecast at day 2 for a single date, valid on 6 Jan 2021. (left) The first three raw ensemble members. (top) The first three PoET-corrected ensemble members. The remaining entries show the differences between raw and PoET-corrected ensemble members after the ensemble-mean difference has been removed; the ensemble-mean difference itself is plotted in the top-left panel.

  • Fig. 3.

    The 3-day averaged maximum temperature between 29 Jun and 1 Jul 2021: (a) ERA5, (b) MBM member 21, (c) PoET member 21, (d) difference between the raw forecast and ERA5 (x_raw − y), (e) MBM correction (x_raw − x_MBM), and (f) PoET correction (x_raw − x_PoET) made to the raw forecast. If a postprocessing method produced a perfect deterministic forecast (i.e., x_post = y), its correction in (e) or (f) would match the error in (d).

  • Fig. 4.

    As in Fig. 3, but for 3-day accumulated precipitation between 18 and 21 Mar 2021.

  • Fig. 5.

    (a) CRPSS (the higher, the better), (b) spread and skill, (c) bias (optimal value zero), and (d) ESS (the higher, the better) of 2-m temperature for three ensemble forecasts: raw ensemble, MBM, and PoET. In (a), we also show results for PoET using only T2m as a predictor. (An illustrative computation of the verification metrics used in these figures is sketched after this figure list.)

  • Fig. 6.

    Bias of 2-m temperature forecasts at lead time day 4 for (a) the raw ensemble and (b) PoET. An optimal forecast has a bias close to 0.

  • Fig. 7.

    CRPSS of 2-m temperature PoET forecast with respect to (a) the raw ensemble and (b) MBM. Positive values indicate a skill improvement with PoET. Note the difference in scale between the two plots.

  • Fig. 8.

    As in Fig. 5, but for total precipitation.

  • Fig. 9.

    As in Fig. 6, but for total precipitation. A bias close to 0 is optimal. Regions with annual precipitation lower than 0.1 mm are masked in gray.

  • Fig. 10.

    As in Fig. 7, but for total precipitation. Positive values indicate a gain in skill with PoET. Regions with annual precipitation lower than 0.1 mm are masked in gray.

  • Fig. D1.

    Reliability plots showing the spread–error relationship for 2-m temperature forecasts at lead time (a) day 2 and (b) day 4, and for precipitation forecasts at lead time (c) day 2 and (d) day 4. For each plot, a histogram shows the number of cases in each forecast uncertainty category.

  • Fig. D2.

    (a) Mean absolute bias and (b) mean absolute spread bias for 2-m temperature forecasts. The closer to 0, the better.

  • Fig. D3.

    As in Fig. D2, but for total precipitation.

  • Fig. D4.

    BSS (the larger, the better) for two threshold-exceedance events, (a) 1 mm and (b) 10 mm, for three ensemble forecasts: the raw ensemble, the MBM-corrected, and the PoET-corrected forecasts.
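
The captions above refer to several standard ensemble verification measures: the CRPS-based skill score CRPSS (Figs. 5, 7, 8, and 10), the spread–error relationship (Fig. D1), and the Brier skill score for threshold-exceedance events (Fig. D4). As a minimal sketch, assuming single-case NumPy arrays rather than the paper's full verification pipeline, the code below illustrates one common way to compute these quantities; the function names and toy data are ours, not part of PoET, and an operational assessment would aggregate scores over many dates, grid points (with latitude weighting), and lead times, using an appropriate reference forecast for each skill score.

```python
import numpy as np

def ensemble_crps(members, obs):
    """CRPS of an M-member ensemble vs. a scalar observation, using the
    kernel form E|X - y| - 0.5 * E|X - X'| (Gneiting and Raftery 2007)."""
    x = np.asarray(members, dtype=float)
    term1 = np.mean(np.abs(x - obs))
    term2 = np.mean(np.abs(x[:, None] - x[None, :]))  # all member pairs
    return term1 - 0.5 * term2

def skill_score(score_fcst, score_ref):
    """Generic skill score (CRPSS, BSS): 1 - score / reference score."""
    return 1.0 - score_fcst / score_ref

def spread_and_error(members, obs):
    """Ensemble spread (std. dev. about the ensemble mean) and absolute
    error of the ensemble mean, the two quantities compared in a
    spread-error diagram such as Fig. D1."""
    x = np.asarray(members, dtype=float)
    return x.std(ddof=1), abs(x.mean() - obs)

def brier_score(members, obs, threshold):
    """Brier score (Brier 1950) for a threshold-exceedance event, with the
    forecast probability estimated as the member fraction above the
    threshold."""
    p = np.mean(np.asarray(members) > threshold)
    o = float(obs > threshold)
    return (p - o) ** 2

# Toy example: a hypothetical raw and postprocessed 10-member
# precipitation ensemble (mm) and a verifying analysis value.
raw = np.array([0.0, 0.2, 0.5, 0.8, 1.1, 1.5, 2.0, 2.6, 3.3, 4.0])
post = raw * 0.9 + 0.3  # stand-in for a calibrated ensemble
obs = 1.2

crpss = skill_score(ensemble_crps(post, obs), ensemble_crps(raw, obs))
bss = skill_score(brier_score(post, obs, 1.0), brier_score(raw, obs, 1.0))
spr, err = spread_and_error(raw, obs)
print(f"CRPSS vs raw: {crpss:.3f}, BSS vs raw: {bss:.3f}")
print(f"raw spread: {spr:.2f} mm, ensemble-mean error: {err:.2f} mm")
```

The kernel form of the CRPS is used here because it applies directly to a set of members, matching the member-based output of methods such as MBM and PoET; for skill scores, the paper's choice of reference (e.g., the raw ensemble or a climatological forecast) determines the baseline against which improvement is measured.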
