1. Introduction
The chaotic nature of the atmosphere makes forecasting the weather a challenging and scientifically exciting task. With large and high-quality publicly available datasets (Hersbach et al. 2020), weather forecasting is becoming a new playing field for deep learning practitioners (Pathak et al. 2022; Bi et al. 2022; Lam et al. 2022). More traditionally, at national meteorological centers, weather forecasts are generated by numerical weather prediction (NWP) models that numerically solve physics-based equations. A Monte Carlo approach is followed to account for uncertainties: an ensemble of deterministic forecasts is run with variations in the initial conditions, the model parameterizations, and/or the numerical discretization. This ensemble approach, initially developed to explore the limits of deterministic forecasting, has now become the backbone of operational weather forecasting (Lewis 2005).
Practically, ensemble weather forecasts are a set of physically consistent weather scenarios that ideally capture the full range of possible outcomes given the information available at the start of the forecast (Leutbecher and Palmer 2008), and decision-making can be optimized using probabilistic forecasts derived from such an ensemble (Richardson 2000). One can assess not only the uncertainty of a weather variable at a given point in space and time, but also any joint probability distribution across variables. This versatility is essential for ensemble prediction systems to support downstream applications with high societal relevance, such as flood forecasting or human heat stress forecasting (Magnusson et al. 2023).
However, as an output of an NWP model, an ensemble forecast is suboptimal in a statistical sense. On top of the limited ensemble size effect (Leutbecher 2019), systematic deficiencies like model biases (defined as the averaged differences between forecasts and observations) and over- or underdispersiveness (too much or too little ensemble spread as measured by the standard deviation among the ensemble members) are common features of any NWP ensemble forecast (Haiden et al. 2021). Statistical postprocessing offers a simple remedy: past data are exploited to learn forecast errors and correct the current forecast accordingly.
A variety of postprocessing approaches have been used over the years, from simple bias correction to machine learning based methods (Vannitsem et al. 2021). Classically, postprocessing of ensemble forecasts is achieved either by assuming the form of the predictive probability distribution and optimizing its parameters (Gneiting et al. 2005; Raftery et al. 2005) or by correcting a limited set of quantiles of the predictive distribution (Taillardat et al. 2016; Ben Bouallègue 2017). Recently, several modern machine learning methods have been applied to ensemble postprocessing. For example, Rasp and Lerch (2018) trained a neural network to predict the mean and standard deviation of a normal distribution for 2-m temperature forecasting at stations in Germany, while Bremnes (2020) combined a neural network and Bernstein polynomials to generate quantile forecasts of wind speed at stations in Norway. In the case of downstream applications based on such postprocessed forecasts, an additional postprocessing step is required to “reconstruct” forecast dependencies between variables or in time and space (Ben Bouallègue et al. 2016; Baran et al. 2020). However, more recent developments in deep learning, particularly transformers, promise to resolve this issue by using mechanisms such as attention (Vaswani et al. 2017) to maintain intervariable and interspatial dependencies.
In this work, we target the direct generation of a calibrated ensemble for potential use in downstream applications. The focus is on 2-m temperature and precipitation, which are variables of interest for many stakeholders. We propose a new approach for ensemble postprocessing: postprocessing of ensembles with transformers (PoET). Our machine learning framework, PoET, combines the self-attention ensemble transformer used for postprocessing of individual ensemble members in Finn (2021) with the U-Net architecture used for bias correction in Grönquist et al. (2021), leveraging the advantages of both for the first time in a postprocessing application. We compare this approach with the statistical member-by-member (MBM) method proposed by Van Schaeybroeck and Vannitsem (2015) that is simpler than PoET but effective. In its simplest form, this method consists of a bias correction and a spread scaling with respect to the ensemble mean. MBM has been successfully tested on time series of 2-m temperature ensemble forecasts and is now run operationally at the Royal Meteorological Institute of Belgium (Demaeyer et al. 2021).
Machine learning approaches for ensemble postprocessing rely on the availability of suitable datasets (Dueben et al. 2022). Here, we use reforecasts and a reanalysis for training. In this context, reforecasts and reanalyses are praised for their consistency because they are generated with a single NWP model version over long periods of time. In particular, the benefit of reforecasts for postprocessing has been demonstrated in pioneering works on 2-m temperature and precipitation forecasts at station locations by Hagedorn et al. (2008) and Hamill et al. (2008), respectively. Reforecasts are also becoming the cornerstones of benchmark datasets for the postprocessing of weather forecasts (Ashkboos et al. 2022; Demaeyer et al. 2023; Grönquist et al. 2021). In this work, we continue this trend, focusing on ensemble postprocessing of global gridded forecasts.
The remainder of this paper is organized as follows: section 2 introduces the dataset and methods investigated in this study, section 3 provides details about the implementation of MBM and PoET for the postprocessing of 2-m temperature and precipitation ensemble forecasts as well as a description of the verification process, section 4 provides illustrative examples of postprocessing in action, section 5 presents and discusses verification results, and section 6 concludes this paper.
2. Data and methods
a. Data
At the European Centre for Medium-Range Weather Forecasts (ECMWF), the reforecast dataset consists of 11 ensemble members (10 perturbed + 1 control) generated twice a week over the past 20 years (Vitart et al. 2019). In our experiments, the dataset comes from the operational Integrated Forecasting System (IFS) reforecasts produced in 2020, that is, with IFS cycles 46r1 and 47r1, the switch occurring in June 2020. Fields are on a 1° horizontal grid, and the focus is on lead times every 6 h up to 96 h. Reforecasts from 2000 to 2016 are used for training, while those in 2017 and 2018 are used for validation.
The postprocessing models are trained toward ERA5, the ECMWF reanalysis dataset (Hersbach et al. 2020). The target for 2-m temperature is the reanalysis itself, while the short-range forecast at T + 6 h (aligned with the forecast validity time) is used as the target for precipitation to account for the spinup after data assimilation [for a comprehensive assessment of ERA5 daily precipitation, please refer to Lavers et al. (2022)].
For testing, we use the operational ensemble data from 2021, using two forecasts each week, 104 start dates in total, following the ECMWF subseasonal-to-seasonal (S2S) schedule. The operational ensemble has 51 members (50 perturbed members + 1 control member), but we apply postprocessing methods that are agnostic to ensemble size: they may be run in inference mode with a different ensemble size than used in training. The data from 2021 include model cycles Cy47r1, Cy47r2, and Cy47r3, with switches in May and October 2021, respectively. Notably, the model upgrade in Cy47r2 included an increase to 137 vertical levels in the ensemble, an improvement that is not included in the training dataset. We are therefore directly testing our methodology for generalization across model cycles, an important property to reduce the maintenance required when operationalizing machine learning systems.
b. Statistical benchmark method for comparison
Neural networks are not the only methods that can be used to calibrate ensembles. There exist simpler statistical methods, which require less computational power and which are generally more “explainable.” In this work, we use the MBM approach detailed in Van Schaeybroeck and Vannitsem (2015) as a benchmark. MBM is a natural benchmark for PoET, which itself can be seen as a sophisticated member-by-member method. In addition, a comparison with state-of-the-art machine learning (ML) postprocessing techniques is discussed in the framework of the 10-ensemble-member (ENS-10) benchmark dataset (Ashkboos et al. 2022) in section 5c.
In our application, the parameter optimization follows the approach combining weak ensemble reliability (WER) and climatological reliability (CR) constraints (WER + CR), as defined in Van Schaeybroeck and Vannitsem (2015). This means that the estimated parameters are constrained to preserve two reliability conditions. For bias-free forecasts, CR is defined as the equality of the forecast variability with the observation variability, while WER is defined as the agreement between the average ensemble variance and the mean squared forecast error. The analytical formulas used to compute the three MBM parameters are provided in appendix A. Note that other flavors of MBM exist (e.g., with score optimization), but they have been disregarded because of their prohibitive computational costs in our application. For example, the MBM approach based on the minimization (MIN) of the continuous ranked probability score (CRPS; the so-called CRPS MIN approach) is three orders of magnitude more computationally expensive. Furthermore, a test of the CRPS MIN approach on a subsample of the data shows no benefit in terms of scores compared with the method applied here.
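As an illustration, the sketch below shows how such a member-by-member correction can be applied once the three parameters have been estimated (the estimation formulas themselves are given in appendix A). The function and parameter names are ours, and the values in the usage comment are placeholders, not the implementation used here.

```python
import numpy as np

def apply_mbm(members, alpha, beta, gamma):
    """Member-by-member correction: an adjustment of the ensemble mean plus a
    scaling of the deviations of each member from the ensemble mean.

    members : array (n_members, ...) of raw forecasts for one grid point
              and lead time
    alpha   : additive correction of the ensemble mean
    beta    : multiplicative correction of the ensemble mean
    gamma   : scaling factor applied to the deviations from the ensemble mean
    """
    ens_mean = members.mean(axis=0, keepdims=True)
    return alpha + beta * ens_mean + gamma * (members - ens_mean)

# Illustrative usage with placeholder parameter values:
# corrected = apply_mbm(raw_members, alpha=-0.4, beta=1.02, gamma=1.1)
```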
c. The ensemble transformer
Transformers are a class of neural networks that were originally designed for large natural language processing (NLP) models (Vaswani et al. 2017). The main advantage of transformers is their capability to process sequences of arbitrary length, drawing context from every part of the sequence, without the expensive sequential computations and potential saturating-gradient issues of recurrent methods such as long short-term memory (LSTM; Hochreiter and Schmidhuber 1997). Deep learning transformer architectures are often structured as encoder–decoder networks, where the encoder blocks use a self-attention layer to compute correlations between all elements of an input sequence.
PoET is an adaptation of the ensemble transformer of Finn (2021). The original model was trained on a much smaller dataset, with single input fields and a very coarse resolution of 5.625° in latitude and longitude. Our higher-resolution dataset results in a substantial increase in memory cost due to the dot products across the large channel (C), height (H), and width (W) dimensions. To manage this, we adapt the architecture and implement the transformer within a U-Net architecture, shown in Fig. 1. At each depth layer in the U-Net, following the embedding 2D convolution layers, we add a transformer block.1 Within the attention blocks, the convolution layers producing the key and query embeddings use a 3 × 3 convolution with a stride of 3, which further reduces the dimensionality of the dot-product calculation (a minimal illustrative sketch of such a block is given at the end of this section). Skip connections after the attention blocks at each level of the U-Net allow transformed hidden states to pass through directly to the decoder at multiple spatial resolutions. Altogether, the PoET implementation reduces the memory footprint of the matrix multiplication operations within the transformer and enables the transformer to operate across different spatial scales. The layer normalization of the original ensemble transformer still operates at the full resolution of the grid, which unfortunately does not allow the model to be run at a resolution different from that of the training data. Experiments omitting the layer normalization, or replacing it with another common technique, batch normalization, showed much worse performance. This observation cements the layer normalization as an integral part of the transformer’s ability to correct forecast errors, likely because of its ability to capture local weather effects such as those of topography or land–sea differences.
Schematic of the transformer U-Net architecture of PoET. The inset in the lower right is adapted from Finn (2021), used with permission. In the PoET adaptation, the 1 × 1 convolution shown by the yellow arrow in the inset is replaced by a 3 × 3 convolution with a stride of 3.
Finally, some more parameters of the PoET architecture are provided in appendix B for reference.
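To make the ensemble-dimension attention concrete, the sketch below gives a minimal PyTorch-style block in the spirit of the architecture described above: keys and queries are produced by 3 × 3 convolutions with a stride of 3, and attention weights are computed between ensemble members rather than between spatial locations. The class name, layer sizes, and exact wiring are illustrative assumptions, not the PoET implementation (see the data availability statement for the source code).

```python
import torch
import torch.nn as nn

class EnsembleSelfAttention(nn.Module):
    """Illustrative self-attention block acting across the ensemble dimension.
    Input hidden states have shape (batch, members, channels, height, width)."""

    def __init__(self, channels):
        super().__init__()
        # 3x3 convolutions with a stride of 3 shrink the key/query embeddings,
        # reducing the size of the member-by-member dot product
        self.to_key = nn.Conv2d(channels, channels, kernel_size=3, stride=3)
        self.to_query = nn.Conv2d(channels, channels, kernel_size=3, stride=3)
        self.to_value = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        b, m, c, h, w = x.shape
        flat = x.reshape(b * m, c, h, w)
        key = self.to_key(flat).reshape(b, m, -1)      # (b, m, d_reduced)
        query = self.to_query(flat).reshape(b, m, -1)  # (b, m, d_reduced)
        value = self.to_value(flat).reshape(b, m, -1)  # (b, m, c*h*w)
        # attention weights between ensemble members (m x m per sample)
        scores = query @ key.transpose(1, 2) / key.shape[-1] ** 0.5
        weights = torch.softmax(scores, dim=-1)
        mixed = (weights @ value).reshape(b, m, c, h, w)
        return x + mixed                               # residual connection
```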
3. Experiments
a. PoET configuration
In our experiments, data for lead times every 6 h from 6 h up to a maximum of 96 h are used for training of the 2-m temperature model while, for precipitation, we start at 24 h to avoid the spinup. Because the lead time is not explicitly encoded in the model, it is possible to run inference for longer lead times.2
For the postprocessing of 2-m temperature forecasts, we include input features of 2-m temperature (T2M), temperature at 850 hPa (T850), geopotential height at 500 hPa (Z500), the u and υ wind components at 700 hPa (U700 and V700), and total cloud cover (TCC). We also prescribe orography, a land–sea mask, and the top-of-atmosphere incoming solar radiation (insolation) as additional predictors. Another model using a reduced feature set, consisting of only T2M, T850, and Z500 plus the three prescribed variables, performed only slightly worse than the one trained on the full set of predictors.
For the postprocessing of precipitation forecasts, the input predictors are changed to total precipitation, convective precipitation, convective available potential energy, total cloud cover, total column water, sea surface temperature, temperature at 850 hPa, winds at 700 hPa and geopotential at 500 hPa.
Apart from the selected predictors, the configuration of PoET is identical for the prediction of 2-m temperature and precipitation, with two exceptions. First, the normalization of total and convective precipitation is done with a shifted logarithmic transformation.3 This transformation is applied to both the predictor and the predictand total precipitation. Second, the kernel continuous ranked probability score (kCRPS) is used as the loss function for precipitation instead of the Gaussian continuous ranked probability score (gCRPS), because the former makes no assumption about the distribution of the ensemble (the definitions are available in appendix C). We also tested this formulation for 2-m temperature but observed little difference, the Gaussian approximation being appropriate for that variable.
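A minimal sketch of these two precipitation-specific pieces is given below: the shifted logarithmic transform and an ensemble (kernel) CRPS estimator that makes no distributional assumption. The code follows the standard kernel form of the CRPS; it is not a copy of the PoET loss implementation, and the function names are ours.

```python
import torch

def kernel_crps(ensemble, obs):
    """Ensemble (kernel) CRPS averaged over all grid points; no assumption is
    made about the shape of the forecast distribution.
    ensemble: (members, ...) tensor; obs: (...) tensor."""
    skill = (ensemble - obs.unsqueeze(0)).abs().mean(dim=0)
    spread = (ensemble.unsqueeze(0) - ensemble.unsqueeze(1)).abs().mean(dim=(0, 1))
    return (skill - 0.5 * spread).mean()

def shifted_log(x):
    """Shifted logarithmic transform log(x + 1), applied to the normalized,
    dimensionless precipitation predictors and predictand (see footnote 3)."""
    return torch.log1p(x)
```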
b. MBM configuration
The MBM parameters are estimated for each grid point and lead time separately. They also vary as a function of the time of the year in order to capture the seasonality of the forecast error. For this purpose, the training dataset differs for each forecast. We define a window centered around the forecast validity date and estimate the parameters using all training data within this time-of-the-year window. The suitable window size is different for the postprocessing of 2-m temperature and of precipitation. The window size is set to ±30 days for 2-m temperature and to ±60 days for precipitation for all lead times.
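As a sketch, the time-of-the-year window can be implemented as a simple day-of-year filter with wraparound at the year boundary; the function below is illustrative, and the 365-day year is a simplification.

```python
import numpy as np

def in_window(train_doys, target_doy, half_width):
    """Return a mask selecting training dates whose day of year lies within
    +/- half_width days of the forecast validity date (circular distance)."""
    dist = np.abs(np.asarray(train_doys) - target_doy)
    dist = np.minimum(dist, 365 - dist)  # wrap around the year boundary
    return dist <= half_width

# e.g., +/-30 days for 2-m temperature, +/-60 days for precipitation:
# mask = in_window(reforecast_days_of_year, forecast_day_of_year, 30)
```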
As for PoET, a shifted logarithmic transformation of the precipitation data is applied with MBM. Additionally, in inference mode, spurious precipitation is removed from the MBM-postprocessed precipitation fields, and any correction leading to a change in precipitation value greater than 50 mm is rejected.
c. Verification process
We compare PoET, MBM, and raw forecasts in terms of their ability to predict 2-m temperature and precipitation up to 4 days in advance. Various aspects of the forecast performance are considered as described below. The results are presented in section 5, while the formal definitions of the verification metrics can be found in appendix C.
Bias and spread–skill relationships are used to assess the statistical consistency between forecast and verification. The bias is defined as the average difference between forecast and verification and a reliable forecast has a bias close to zero. The ensemble spread is defined as the standard deviation of the ensemble members with respect to the ensemble mean, while the ensemble mean error is defined as the root mean squared error of the ensemble mean. For a reliable ensemble forecast, the averaged ensemble spread should be close to the averaged ensemble mean error (Leutbecher and Palmer 2008; Fortin et al. 2014).
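The spread–skill diagnostic can be computed as sketched below (variable names are illustrative): the spread is the square root of the average ensemble variance, which is compared with the RMSE of the ensemble mean.

```python
import numpy as np

def spread_and_error(ensemble, truth):
    """Averaged ensemble spread and ensemble-mean RMSE over all grid points
    and dates; for a reliable ensemble the two quantities should be close.
    ensemble: (members, ...); truth: (...)"""
    # unbiased ensemble variance; the exact normalization is a minor detail
    spread = np.sqrt(ensemble.var(axis=0, ddof=1).mean())
    rmse = np.sqrt(((ensemble.mean(axis=0) - truth) ** 2).mean())
    return spread, rmse
```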
The CRPS is computed to assess the ensemble as a probabilistic forecast. Forecast performance in a multidimensional space is assessed using the energy score (ES), a generalization of the CRPS to the multivariate case (Gneiting and Raftery 2007). The ES is applied over the time dimension, computed separately for each pair of consecutive time steps. Additionally, for precipitation, probability forecast performance for predefined events is assessed with the Brier score (BS; Brier 1950). We consider two precipitation events: 6-hourly precipitation exceeding 1 and 10 mm.
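For reference, a sketch of the standard ensemble form of the energy score for one forecast case is given below; here the multivariate dimension (last axis) would contain, for example, the forecast values at two consecutive time steps. The function name is ours.

```python
import numpy as np

def energy_score(ensemble, obs):
    """Energy score for one multivariate forecast case (standard ensemble form).
    ensemble: (members, d) array; obs: (d,) array."""
    skill = np.linalg.norm(ensemble - obs, axis=-1).mean()
    pair_dists = np.linalg.norm(
        ensemble[:, None, :] - ensemble[None, :, :], axis=-1
    )
    return skill - 0.5 * pair_dists.mean()
```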
The relative skill of a forecast with respect to a reference forecast is estimated with the help of skill scores. In the following, we compute the continuous ranked probability skill score (CRPSS), the energy skill score (ESS), and the Brier skill score (BSS) using the raw ensemble forecast as a reference. A comparison of PoET and MBM postprocessed forecasts is also performed using the latter as a reference.
Scores and reliability metrics are computed at each grid point for all validity times and aggregated in time and/or space for plotting purposes. When aggregating scores in space (over the globe), we apply a weighting proportional to the cosine of the gridpoint latitude.
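The spatial aggregation can be sketched as follows (function name ours), with each grid point weighted by the cosine of its latitude.

```python
import numpy as np

def latitude_weighted_mean(score, lats):
    """Globally aggregate a gridded score of shape (n_lat, n_lon), weighting
    each grid point by the cosine of its latitude (in degrees)."""
    weights = np.cos(np.deg2rad(lats))
    return np.average(score, axis=0, weights=weights).mean()
```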
4. Illustrative examples
a. PoET in action
Figure 2 shows the differences between each of the first three members of the raw ensemble and the corresponding PoET postprocessed forecasts at a lead time of 2 days. The ensemble mean change (top-left panel) mostly shows a warming over the globe, except over Asia, which is consistent with the forecast bias discussed in the next section. The top row and leftmost column show the PoET forecasts and raw forecasts, respectively. The remaining entries show the difference between the respective ensemble members once the ensemble mean change is removed. Along the diagonal, we see the change induced by PoET on each member. The off-diagonal entries show the difference between nonmatching raw and PoET ensemble members. The larger amplitude in these off-diagonal plots, compared with the diagonal, indicates consistency between the input and output ensemble members; that is, the ensemble has not been reordered or dramatically shifted by postprocessing. A comparison of PoET-corrected forecasts with ERA5 fields in two extreme cases is provided below.
Relative changes by PoET to a single date of the 2-m temperature forecast at day 2, valid on 6 Jan 2021. (left) The first three raw ensemble members. (top) The first three PoET-corrected ensemble members. Other entries show the differences between raw and PoET-corrected ensemble members when the ensemble mean difference has been removed. The ensemble mean difference is plotted in the top-left panel.
b. A 2-m temperature case study
At the end of June 2021, a heatwave hit the northwestern United States and Canada leading to new temperature records and devastating wildfires. The top panels in Fig. 3 compare the 3-day averaged maximum (0000 UTC) temperature predictions of MBM and PoET with the corresponding ERA5 reanalysis field. In this example, we average 2-m temperature forecasts over lead times 24, 48, and 72 h. One randomly selected ensemble member is shown to illustrate postprocessing in action on a single forecast. The bottom panels in Fig. 3 show the difference between ERA5 and the raw forecast as well as the corrections applied to the forecast with postprocessing. We check whether postprocessing compensates for errors in the raw forecast, that is, if Figs. 3e and 3f match Fig. 3d. Overall, there is a good correspondence between the raw forecast error and the postprocessing corrections for both MBM and PoET. For example, we note the correction of the cold bias over the continent. However, there is some spottiness visible in the PoET correction (Fig. 3f). Moreover, in both Figs. 3e and 3f, as expected, fine details in the error pattern are not accurately captured, due to factors such as the limited predictability of this extreme event.
The 3-day averaged maximum temperature between 29 Jun and 1 Jul 2021: (a) ERA5, (b) MBM member 21, (c) PoET member 21, (d) difference between the raw forecast and ERA5 (x_raw − y), (e) MBM correction (x_raw − x_MBM), and (f) PoET correction (x_raw − x_PoET) made to the raw forecast. In the case of postprocessing methods leading to a perfect deterministic forecast, (e) and (f) would match (d).
c. A total precipitation case study
In March 2021, Australia was affected by extreme rainfall. Sustained heavy rain led to flooding in the eastern part of the country, and large precipitation amounts were also observed on the northern coast. The top panels in Fig. 4 compare 3-day precipitation scenarios from MBM and PoET with the corresponding ERA5 precipitation field. The 3-day accumulated precipitation scenarios are derived from the postprocessed ensemble of 6-h accumulated precipitation forecasts, which are consistent scenarios in space and time. Here again, we randomly select one member for illustration purposes. The bottom panels in Fig. 4 show the difference between ERA5 and the raw precipitation forecast along with the corresponding postprocessing corrections. As in the 2-m temperature example, we check whether postprocessing compensates for raw forecast errors, that is, whether Figs. 4e and 4f match Fig. 4d. The MBM correction shows only some areas of consistency with the actual error, while the PoET correction tends to partially compensate for the raw forecast error along the north and west coasts, both over land and over the sea.
As in Fig. 3, but for 3-day accumulated precipitation between 18 and 21 Mar 2021.
5. Verification results
a. 2-m temperature results
To compare and assess the performance of MBM and PoET, we apply the verification metrics defined in section 3c (i.e., the CRPSS, spread and error, bias, and ESS) to the postprocessed forecasts. The results are first aggregated over the globe as a function of the forecast lead time. In Fig. 5, we see that both methods considerably improve the raw ensemble skill, with similar results in terms of CRPSS and ESS. PoET generates more skillful forecasts than MBM, but both methods improve on the raw forecast by ∼20% throughout the assessed lead times. Both methods also have a similar ability to reduce the bias.
(a) CRPSS (the higher, the better), (b) spread and skill, (c) bias (optimal value zero), and (d) ESS (the higher, the better) of 2-m temperature for three ensemble forecasts: raw ensemble, MBM, and PoET. In (a), we also show results for PoET using only T2m as a predictor.
An additional experiment is run to disentangle the benefit of using the new ML-based method from the benefit of having more than one predictor as an input variable. Postprocessed forecasts are generated with PoET using 2-m temperature as the only predictor, as is done when running MBM, and the results in terms of CRPSS are shown in Fig. 5a. The use of the PoET machinery for postprocessing seems to contribute around two-thirds of the improvement over the benchmark method, while the remaining improvement can be attributed to the extra information available in the additional predictors. This result appears consistent over lead times.
The MBM approach seems better at maintaining spread–error parity, with PoET struggling at early lead times. Spread–skill diagrams, showing the error as a function of spread categories, reveal that aggregated scores must be interpreted carefully (see appendix D). Indeed, the uncertainty of PoET-corrected forecasts appears to reflect the potential forecast error more accurately than that of the MBM-corrected ones. Because compensating effects can be at play when averaging over all cases, we also compute the bias and spread–error ratio at each grid point separately before averaging their absolute values over the verification period (see appendix D). This approach reveals that the calibration underperformance of PoET compared with MBM is moderate and limited to the first lead times of the forecast. This result suggests a geographical disparity of the postprocessing impact, which is now further explored with maps of scores.
Figure 6 shows maps of bias for the raw data and the PoET-corrected forecasts. We focus on a lead time of day 4, and the results are aggregated at each grid point over all verification days. We clearly see a general decrease in the bias, with almost no remaining bias over the ocean. The remaining pockets of (generally positive) bias after postprocessing are mostly found over land, where the amplitude of the raw forecast bias is larger. A change of sign in the bias is interpreted as an indication that the general circulation patterns over the training period are not representative of the ones over the verification period. The broad structure of the bias is similar for MBM and for other lead times (not shown).
Bias of 2-m temperature forecasts at lead time day 4 for (a) the raw ensemble and (b) PoET. An optimal forecast has a bias close to 0.
Figure 7 shows the gain in skill with PoET for the same lead time as in Fig. 6. CRPSS is computed using the raw ensemble as a baseline in Fig. 7a and MBM as a baseline in Fig. 7b. Figure 7a shows a widespread positive impact of PoET on the raw forecast skill with a larger gain over land where the raw forecast bias is generally more pronounced. A detrimental effect of postprocessing is observed in some regions (e.g., in South America, Africa, and Australia). These regions of negative skill score are also the ones where a bias is still present after postprocessing as shown in Fig. 6b.
CRPSS of 2-m temperature PoET forecast with respect to (a) the raw ensemble and (b) MBM. Positive values indicate a skill improvement with PoET. Note the difference in scale between the two plots.
In Fig. 7b, there are very few areas where the CRPSS is less than zero, that is, areas where MBM forecasts have more skill than PoET. The improvements through PoET are fairly consistent across the globe, with no regions showing distinctly larger gains from the neural network approach. Indeed, there is a strong agreement between the locations where MBM and PoET add value to the raw ensemble (not shown). Given that MBM learns a climatological correction for each grid point, this suggests that PoET has mostly reproduced this local climatological correction.
b. Total precipitation results
Postprocessing of precipitation forecasts is a more challenging task because of the form of the underlying forecast probability distribution, with a point mass at 0 (the no-precipitation probability) and a strong positive skewness associated with the more extreme events. Postprocessing with MBM and PoET is tested with small changes to the configuration used for the postprocessing of 2-m temperature forecasts (see section 3), and a similar set of plots is examined to assess the corrected forecast performance.
Figure 8 shows verification metrics aggregated globally as a function of the lead time. In contrast to the 2-m temperature results, the added benefit of either postprocessing approach is limited. With PoET, the skill improvement is approximately 2% in terms of CRPSS and ESS for the first several days of the forecast. With MBM, the skill improvement is ∼1% for most lead times. The gain in skill originates from improved performance in forecasting lower-intensity rather than higher-intensity events. Indeed, the BSS computed for two precipitation exceedance thresholds, 1 and 10 mm, shows a larger skill score for the former (see Fig. D3 in appendix D).
As in Fig. 5, but for total precipitation.
One explanation for the limited gain in skill with postprocessing is that the raw forecast is already well calibrated (see also Figs. D1c and D1d in appendix D). Also, PoET improves the averaged performance in Figs. 8a and 8d but seems to degrade both the bias and the spread–error ratio in Figs. 8b and 8c. This apparent paradox is explained by the large variations in forecast performance over the globe. A look at the mean absolute bias and mean absolute spread bias confirms that the bias and the spread–skill relationship are overall significantly improved with PoET when assessed locally (see Fig. D2 in appendix D). Similarly, the erratic spread correction with MBM at shorter lead times is not visible in Fig. D2b, suggesting an averaging artifact. Figure D2b also reveals that a point-by-point application of MBM does not seem appropriate to correct spread deficiencies at longer lead times.
Figure 9 provides another perspective on the bias by presenting maps of averaged values over all verification days. Here, the focus is on lead time day 4. The precipitation bias is reduced over land and the Maritime Continent. However, the bias in the tropical Pacific and Atlantic remains unchanged after postprocessing. We note that while positive biases are generally well corrected, negative biases are not.
As in Fig. 6, but for total precipitation. A bias close to 0 is optimal. Regions with annual precipitation lower than 0.1 mm are masked in gray.
Finally, Figs. 10a and 10b show maps of CRPSS at day 4 for PoET using the raw forecast and MBM as references, respectively. PoET improves precipitation ensemble forecasts mainly over the tropics. Very localized degradations of the skill with respect to the raw forecast could be due to a too-short training sample. The benefit of using PoET rather than MBM appears predominantly in the tropics, but local positive skill scores are rather scattered. Alternating areas of positive and negative skill scores over the sea in the extratropics suggest that the two approaches are complementary.
As in Fig. 7, but for total precipitation. Positive values indicate a gain in skill with PoET. Regions with annual precipitation lower than 0.1 mm are masked in gray.
c. ENS-10 comparison
During the preparation of this article, Ashkboos et al. (2022) produced a benchmark dataset for the postprocessing of ensemble weather forecasts (referred to as ENS-10 in the following). This framework is exploited to further test the capability of PoET and compare its performance with state-of-the-art ML postprocessing techniques.
The ENS-10 dataset is similar to the one used here, originating from the same model reforecast framework, but with several differences. The reforecast dataset was produced in 2018, the data are provided on a 0.5° grid, and the evaluation set comprises the last 2 years of the reforecast, meaning that the IFS configuration is identical between training and testing, also in terms of ensemble size. Note that the verification also differs, as a non-latitude-weighted CRPS is used as the performance metric.
To contribute to the benchmarking efforts, we train our model using the ENS-10 dataset and evaluate it following the same methodology. We reduce the data volume by only training each model on a chosen subset of the total ENS-10 variables. For Z500, we utilize all ENS-10 variables on this pressure level and use an equivalent approach for T850. For 2-m temperature (T2m), our model predictors are the variables at 500 and 850 hPa alongside the single-level variables 2-m temperature, skin temperature, sea surface temperature, mean sea level pressure, total cloud cover, and 10-m zonal and meridional wind. Also, the gridded data contain 720 points located at each pole that are unconstrained by the latitude-weighted training with PoET but contribute to the non-latitude-weighted evaluation. Therefore, we do not use PoET to correct these points but instead use the uncorrected raw forecast (the evaluation still includes these points to mirror the ENS-10 evaluation).
In terms of model architecture, we make a small change to the PoET model to incorporate the higher spatial resolution. We increase the depth of the U-Net structure by 1, putting a transformer block at its fourth level.
Table 1 contrasts the PoET scores with the raw forecast and the best benchmarks from Ashkboos et al. (2022). For almost all configurations, the PoET approach leads to a significant improvement over the ENS-10 benchmarks. In particular, for 2-m temperature, the CRPSS with PoET is considerably better than the results with the previous baselines. The improvement of ∼20% is similar to the gain measured with our dataset. For Z500 with a 5-member ensemble, the model improves on the raw output but fails to beat the LeNet approach (Ashkboos et al. 2022). We found no explanation for the limited success with this variable and ensemble member configuration. Similar results are obtained with the extreme event weighted continuous ranked probability score (EECRPS) introduced in Eq. (2) in Ashkboos et al. (2022).
Global non-latitude-weighted mean CRPS and EECRPS on the ENS-10 test set (2016/17) for baseline models with 5 (5-ENS) and 10 (10-ENS) ensemble members. The first 5 members are used for the 5-ENS evaluation. The best results for each combination of score/ensemble/parameter are in bold.
6. Conclusions
This work shows how to efficiently transform an ensemble forecast into a calibrated ensemble forecast, that is, a set of physically consistent scenarios with improved statistical properties. We compare two methods: a machine learning method based on self-attention transformers (PoET) and a statistical method used as a benchmark (MBM). For both methods, each member is calibrated separately but with the aim of optimizing the ensemble properties as a whole. As a result, the postprocessed ensemble has the spatial, temporal, and intervariable coherence necessary to enable any downstream application. Also, both tested methods can be trained on a smaller reforecast dataset (here with 11 members) to effectively calibrate a much larger operational ensemble (here with 51 members), preserving intermember calibration.
Ensemble postprocessing is successfully applied to global gridded forecasts of 2-m temperature and precipitation, using ERA5 reanalysis as the ground truth. Our results show that both MBM and PoET can significantly improve the skill of the operational IFS ensemble. This improvement is achieved through a better calibration of the ensemble, both in terms of bias and spread–skill relationship. We note that PoET is better at the headline scores (CRPS, ES) but with some areas where MBM can locally outperform PoET. This latter point suggests that a combination of the two approaches could lead to further improvement of the forecast skill. Also, our case-study examples illustrate the ability of postprocessing to improve existing ensemble members.
The postprocessing gain is smaller for precipitation than for 2-m temperature. Indeed, the skill improvement of precipitation forecasts is relatively small in this application. This result contrasts with results obtained with downscaling approaches, where accounting for representativeness uncertainty can have a major impact on scores (Ben Bouallègue et al. 2020). In further work, we will consider how PoET could be applied to ungridded observations, which would require architectural changes.
Direct applications of the methodologies developed here include postprocessing for forecast verification and intercomparison purposes. For example, bias correction can be applied to better understand changes in CRPS results with new IFS model versions (Leutbecher and Haiden 2021). Also, postprocessing would be a necessary step for a fair comparison of NWP forecasts with statistically optimized (data driven) ones in forecasting competition frameworks (see, e.g., Rasp et al. 2020). Finally, the proposed methods could be trivially adapted to a higher-resolution version of the truth, which could pave the way to ensemble postprocessing of global gridded data for operational forecasting.
Footnote 1: More than one transformer block can be used, similar to Finn (2021), but in our experiments this did not result in performance increases.
Footnote 2: While not shown here, PoET remains skillful at longer lead times.
Footnote 3: The transformation is log(x + 1), where x is the precipitation amount normalized into a dimensionless quantity (similar to Lopez 2011).
Acknowledgments.
The authors thank Tobias Finn for many interesting discussions and the original idea of using transformer techniques in the ensemble dimension. We also thank two anonymous reviewers for their valuable suggestions to improve the quality of this paper. Peter Dueben, Matthew Chantry, and Jesper Dramsch gratefully acknowledge funding from the MAELSTROM EuroHPC-JU project (JU) under Grant 955513. The JU receives support from the European Union’s Horizon research and innovation program and United Kingdom, Germany, Italy, Luxembourg, Switzerland, and Norway. Peter Dueben gratefully acknowledges funding from the ESiWACE project funded under Horizon 2020 Grant 823988. Mariana Clare gratefully acknowledges funding by the European Union under the Destination Earth initiative. Finally, all authors acknowledge the use of computing resources from both the European Weather Cloud and Microsoft Azure.
Data availability statement.
It is currently difficult for the authors to share the data because the hardware used is being replaced. The authors will, however, make the data available for download when the paper is published. PoET source code is available at https://github.com/ecmwf-lab/poet, and the MBM parameter estimation relies on the Climdyn/pythie package (Demaeyer 2022).
APPENDIX A
MBM Parameters Optimization
APPENDIX B
Additional PoET Model Parameters
Table B1 shows some additional hyperparameters used for the PoET model. Minimal hyperparameter searching was performed, as no significant improvements were observed.
Additional hyperparameters for PoET.
APPENDIX C
Scores Definition
a. CRPS
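For an M-member ensemble x_1, …, x_M and verifying value y, the kernel form of the CRPS (the kCRPS used as a loss for precipitation) can be written in its standard form as

\[ \mathrm{kCRPS} = \frac{1}{M}\sum_{m=1}^{M} \lvert x_m - y \rvert - \frac{1}{2M^2}\sum_{m=1}^{M}\sum_{n=1}^{M} \lvert x_m - x_n \rvert . \]

The Gaussian form (gCRPS) assumes a normal predictive distribution with mean μ and standard deviation σ estimated from the ensemble and has the standard closed-form expression

\[ \mathrm{gCRPS} = \sigma \left\{ z\left[2\Phi(z) - 1\right] + 2\varphi(z) - \frac{1}{\sqrt{\pi}} \right\}, \qquad z = \frac{y - \mu}{\sigma}, \]

with Φ and φ the cumulative distribution function and probability density function of the standard normal distribution. These standard formulations (Gneiting et al. 2005; Gneiting and Raftery 2007) are given here for reference.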
b. ES
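The energy score generalizes the CRPS to multivariate forecasts. For ensemble member vectors x_1, …, x_M and verifying vector y (here, e.g., the same variable at two consecutive time steps), the standard ensemble form is

\[ \mathrm{ES} = \frac{1}{M}\sum_{m=1}^{M} \lVert \mathbf{x}_m - \mathbf{y} \rVert - \frac{1}{2M^2}\sum_{m=1}^{M}\sum_{n=1}^{M} \lVert \mathbf{x}_m - \mathbf{x}_n \rVert , \]

where ‖·‖ denotes the Euclidean norm (Gneiting and Raftery 2007).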
c. Skill scores
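A skill score relates the score S of a forecast to the score S_ref of a reference forecast,

\[ \mathrm{SS} = 1 - \frac{S}{S_{\mathrm{ref}}} , \]

so that positive values indicate an improvement over the reference. The CRPSS, ESS, and BSS are obtained with S set to the CRPS, ES, and BS, respectively; here the raw ensemble (or MBM) is used as the reference.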
APPENDIX D
Additional Plots
a. Reliability diagrams
For a reliable ensemble forecast, we expect consistency between ensemble spread and ensemble mean error. This can be checked with reliability plots as shown in Fig. D1. For a given forecast spread category (on the x axis), we check the consistency of the corresponding forecast error (RMSE on the y axis). Perfect reliability is indicated with a dashed diagonal line.
Reliability plots showing the spread–error relationship for 2-m temperature forecasts at lead time (a) day 2 and (b) day 4, and for precipitation forecasts at lead time (c) day 2 and (d) day 4. For each plot, a histogram shows the number of cases in each forecast uncertainty category.
b. Mean absolute bias and mean absolute spread bias
Mean absolute bias and mean absolute spread bias for 2-m temperature and precipitation are shown in Figs. D2 and D3, respectively.
(a) Mean absolute bias and (b) mean absolute spread bias for 2-m temperature forecasts. The closer to 0, the better.
As Fig. D2, but for total precipitation.
c. BSS
The BS is used to assess the performance of a binary probability forecast for a given event (Brier 1950). It is computed here to better understand the impact of postprocessing on the results in terms of CRPS. Indeed, the CRPS corresponds to the integral of the Brier score over all possible thresholds. Figure D4 shows the Brier skill score for two precipitation events, defined as precipitation exceeding 1 and 10 mm in 6 h, respectively.
BSS (the larger, the better) for two threshold-exceeding events: (a) 1 and (b) 10 mm for three ensemble forecasts: the raw ensemble, the MBM, and the PoET-corrected forecasts.
REFERENCES
Ashkboos, S., L. Huang, N. Dryden, T. Ben-Nun, P. Dueben, L. Gianinazzi, L. Kummer, and T. Hoefler, 2022: ENS-10: A dataset for post-processing ensemble weather forecast. arXiv, 2206.14786v2, https://doi.org/10.48550/arXiv.2206.14786.
Baran, S., Á. Baran, F. Pappenberger, and Z. Ben Bouallègue, 2020: Statistical post-processing of heat index ensemble forecasts: Is there a royal road? Quart. J. Roy. Meteor. Soc., 146, 3416–3434, https://doi.org/10.1002/qj.3853.
Ben Bouallègue, Z., 2017: Statistical postprocessing of ensemble global radiation forecasts with penalized quantile regression. Meteor. Z., 26, 253–264, https://doi.org/10.1127/metz/2016/0748.
Ben Bouallègue, Z., T. Heppelmann, S. E. Theis, and P. Pinson, 2016: Generation of scenarios from calibrated ensemble forecasts with a dual-ensemble copula-coupling approach. Mon. Wea. Rev., 144, 4737–4750, https://doi.org/10.1175/MWR-D-15-0403.1.
Ben Bouallègue, Z., T. Haiden, N. J. Weber, T. M. Hamill, and D. S. Richardson, 2020: Accounting for representativeness in the verification of ensemble precipitation forecasts. Mon. Wea. Rev., 148, 2049–2062, https://doi.org/10.1175/MWR-D-19-0323.1.
Bi, K., L. Xie, H. Zhang, X. Chen, X. Gu, and Q. Tian, 2022: Pangu-Weather: A 3D high-resolution model for fast and accurate global weather forecast. arXiv, 2211.02556v1, https://doi.org/10.48550/ARXIV.2211.02556.
Bremnes, J. B., 2020: Ensemble postprocessing using quantile function regression based on neural networks and Bernstein polynomials. Mon. Wea. Rev., 148, 403–414, https://doi.org/10.1175/MWR-D-19-0227.1.
Brier, G. W., 1950: Verification of forecasts expressed in terms of probability. Mon. Wea. Rev., 78, 1–3, https://doi.org/10.1175/1520-0493(1950)078<0001:VOFEIT>2.0.CO;2.
Demaeyer, J., 2022: Climdyn/pythie: Version 0.1.0 alpha release. Zenodo, https://doi.org/10.5281/zenodo.7233538.
Demaeyer, J., S. Vannitsem, and B. Van Schaeybroeck, 2021: Statistical post-processing of ensemble forecasts at the Belgian met service. ECMWF Newsletter, No. 166, ECMWF, Reading, United Kingdom, 21–25, https://www.ecmwf.int/en/newsletter/166/meteorology/statistical-post-processing-ensemble-forecasts-belgian-met-service.
Demaeyer, J., and Coauthors, 2023: The EUPPBench postprocessing benchmark dataset v1.0. Earth Syst. Sci. Data, 15, 2635–2653, https://doi.org/10.5194/essd-15-2635-2023.
Dueben, P. D., M. G. Schultz, M. Chantry, D. J. Gagne II, D. M. Hall, and A. McGovern, 2022: Challenges and benchmark datasets for machine learning in the atmospheric sciences: Definition, status, and outlook. Artif. Intell. Earth Syst., 1, e210002, https://doi.org/10.1175/AIES-D-21-0002.1.
Finn, T. S., 2021: Self-attentive ensemble transformer: Representing ensemble interactions in neural networks for Earth system models. arXiv, 2106.13924v2, https://doi.org/10.48550/ARXIV.2106.13924.
Fortin, V., M. Abaza, F. Anctil, and R. Turcotte, 2014: Why should ensemble spread match the RMSE of the ensemble mean? J. Hydrometeor., 15, 1708–1713, https://doi.org/10.1175/JHM-D-14-0008.1.
Gneiting, T., and A. E. Raftery, 2007: Strictly proper scoring rules, prediction, and estimation. J. Amer. Stat. Assoc., 102, 359–378, https://doi.org/10.1198/016214506000001437.
Gneiting, T., A. E. Raftery, A. H. Westveld III, and T. Goldman, 2005: Calibrated probabilistic forecasting using ensemble model output statistics and minimum CRPS estimation. Mon. Wea. Rev., 133, 1098–1118, https://doi.org/10.1175/MWR2904.1.
Grönquist, P., C. Yao, T. Ben-Nun, N. Dryden, P. Dueben, S. Li, and T. Hoefler, 2021: Deep learning for post-processing ensemble weather forecasts. Philos. Trans. Roy. Soc., A379, 20200092, https://doi.org/10.1098/rsta.2020.0092.
Hagedorn, R., T. M. Hamill, and J. S. Whitaker, 2008: Probabilistic forecast calibration using ECMWF and GFS ensemble reforecasts. Part I: Two-meter temperatures. Mon. Wea. Rev., 136, 2608–2619, https://doi.org/10.1175/2007MWR2410.1.
Haiden, T., M. Janousek, F. Vitart, Z. Ben Bouallègue, L. Ferranti, and D. R. F. Prates, 2021: Evaluation of ECMWF forecasts, including 2021 upgrade. ECMWF Tech. Memo. 884, 56 pp., https://www.ecmwf.int/en/elibrary/81235-evaluation-ecmwf-forecasts-including-2021-upgrade.
Hamill, T. M., R. Hagedorn, and J. S. Whitaker, 2008: Probabilistic forecast calibration using ECMWF and GFS ensemble reforecasts. Part II: Precipitation. Mon. Wea. Rev., 136, 2620–2632, https://doi.org/10.1175/2007MWR2411.1.
Hersbach, H., and Coauthors, 2020: The ERA5 global reanalysis. Quart. J. Roy. Meteor. Soc., 146, 1999–2049, https://doi.org/10.1002/qj.3803.
Hochreiter, S., and J. Schmidhuber, 1997: Long short-term memory. Neural Comput., 9, 1735–1780, https://doi.org/10.1162/neco.1997.9.8.1735.
Lam, R., and Coauthors, 2022: GraphCast: Learning skillful medium-range global weather forecasting. arXiv, 2212.12794v2, https://doi.org/10.48550/ARXIV.2212.12794.
Lavers, D. A., A. Simmons, F. Vamborg, and M. J. Rodwell, 2022: An evaluation of ERA5 precipitation for climate monitoring. Quart. J. Roy. Meteor. Soc., 148, 3152–3165, https://doi.org/10.1002/qj.4351.
Leutbecher, M., 2019: Ensemble size: How suboptimal is less than infinity? Quart. J. Roy. Meteor. Soc., 145, 107–128, https://doi.org/10.1002/qj.3387.
Leutbecher, M., and T. N. Palmer, 2008: Ensemble forecasting. J. Comput. Phys., 227, 3515–3539, https://doi.org/10.1016/j.jcp.2007.02.014.
Leutbecher, M., and T. Haiden, 2021: Understanding changes of the continuous ranked probability score using a homogeneous Gaussian approximation. Quart. J. Roy. Meteor. Soc., 147, 425–442, https://doi.org/10.1002/qj.3926.
Lewis, J. M., 2005: Roots of ensemble forecasting. Mon. Wea. Rev., 133, 1865–1885, https://doi.org/10.1175/MWR2949.1.
Lopez, P., 2011: Direct 4D-Var assimilation of NCEP stage IV radar and gauge precipitation data at ECMWF. Mon. Wea. Rev., 139, 2098–2116, https://doi.org/10.1175/2010MWR3565.1.
Magnusson, L., C. Prudhomme, F. Di Giuseppe, C. Di Napoli, and F. Pappenberger, 2023: Operational multiscale predictions of hazardous events. Extreme Weather Forecasting, M. Astitha and E. Nikolopoulos, Eds., Elsevier, 87–129, https://doi.org/10.1016/B978-0-12-820124-4.00008-6.
Pathak, J., and Coauthors, 2022: FourCastNet: A global data-driven high-resolution weather model using adaptive Fourier neural operators. arXiv, 2202.11214v1, https://doi.org/10.48550/ARXIV.2202.11214.
Raftery, A. E., T. Gneiting, F. Balabdaoui, and M. Polakowski, 2005: Using Bayesian model averaging to calibrate forecast ensembles. Mon. Wea. Rev., 133, 1155–1174, https://doi.org/10.1175/MWR2906.1.
Rasp, S., and S. Lerch, 2018: Neural networks for postprocessing ensemble weather forecasts. Mon. Wea. Rev., 146, 3885–3900, https://doi.org/10.1175/MWR-D-18-0187.1.
Rasp, S., P. D. Dueben, S. Scher, J. A. Weyn, S. Mouatadid, and N. Thuerey, 2020: WeatherBench: A benchmark data set for data-driven weather forecasting. J. Adv. Model. Earth Syst., 12, e2020MS002203, https://doi.org/10.1029/2020MS002203.
Richardson, D. S., 2000: Skill and relative economic value of the ECMWF ensemble prediction system. Quart. J. Roy. Meteor. Soc., 126, 649–667, https://doi.org/10.1002/qj.49712656313.
Taillardat, M., O. Mestre, M. Zamo, and P. Naveau, 2016: Calibrated ensemble forecasts using quantile regression forests and ensemble model output statistics. Mon. Wea. Rev., 144, 2375–2393, https://doi.org/10.1175/MWR-D-15-0260.1.
Vannitsem, S., and Coauthors, 2021: Statistical postprocessing for weather forecasts: Review, challenges, and avenues in a big data world. Bull. Amer. Meteor. Soc., 102, E681–E699, https://doi.org/10.1175/BAMS-D-19-0308.1.
Van Schaeybroeck, B., and S. Vannitsem, 2015: Ensemble post-processing using member-by-member approaches: Theoretical aspects. Quart. J. Roy. Meteor. Soc., 141, 807–818, https://doi.org/10.1002/qj.2397.
Vaswani, A., N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, 2017: Attention is all you need. arXiv, 1706.03762v7, https://doi.org/10.48550/ARXIV.1706.03762.
Vitart, F., G. Balsamo, J.-R. Bidlot, S. Lang, and I. Tsonevsky, 2019: Use of ERA5 to initialize ensemble re-forecasts. ECMWF Tech. Memo. 841, 16 pp., https://www.ecmwf.int/sites/default/files/elibrary/2019/18872-use-era5-initialize-ensemble-re-forecasts.pdf.