1. Introduction
Operational weather forecasting relies on numerical weather prediction (NWP) models. Since such models are subject to multiple sources of uncertainty, for example, in the initial conditions or the model parameterizations, a quantification of the forecast uncertainty is indispensable. To achieve this, NWP models generate a set of deterministic forecasts, so-called ensemble forecasts, based on different initial conditions and variations of the underlying physical models. Since these forecasts are subject to systematic errors such as biases and dispersion errors, statistical postprocessing is used to enhance their reliability (see, e.g., Vannitsem et al. 2018). Recently, machine learning (ML) approaches for statistical postprocessing have shown superior performance over classical methods. For instance, Rasp and Lerch (2018) propose a distribution regression network (DRN) that predicts the parameters of a temperature forecast distribution from a suitable family of parametric distributions. In subsequent work, Schulz and Lerch (2022b) found that shallow multilayer perceptrons (MLPs) with forecast distributions of different flexibility achieve state-of-the-art results in postprocessing wind gust ensemble forecasts.
An ensemble forecast consists of multiple separate member forecasts, which are generated by repeatedly running NWP simulations with different model parameterizations and initial conditions. Typically, the configurations of different runs are sampled randomly from an underlying distribution of plausible simulation conditions, obtained, for example, from uncertainty-aware data assimilation. The member forecasts can then be seen as identically distributed and interchangeable random samples from a distribution of possible future weather states. In this setting, statistical postprocessing of ensemble forecasts can be phrased as a prediction task on unordered predictor vectors and requires solutions that are tailored to match the predictor format. Specifically, member interchangeability demands that the predictions of a well-designed postprocessing system should not be affected by permutations, that is, shuffling, of the ensemble members. Systems that satisfy this requirement are called permutation invariant. Established postprocessing methods rely on basic summary statistics of the raw ensemble forecast to inform the estimation of the postprocessed distribution and are thus permutation invariant by design. However, especially in large ensembles, the details of the distribution may carry valuable information for postprocessing, and a more elaborate treatment of the inner structure of the raw forecast ensembles may help to improve forecast accuracy for example in ambiguous forecast situations, where summary-based methods fail to evaluate the likelihood of different weather patterns accurately.
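To make the invariance requirement concrete, the following minimal Python sketch (with purely hypothetical values) shows that a summary-based representation of an ensemble is unaffected by member shuffling, which is the property that the architectures discussed below must preserve.

```python
import numpy as np

rng = np.random.default_rng(0)
ens = rng.normal(loc=12.0, scale=1.5, size=20)   # one hypothetical 20-member ensemble

def summary_features(members):
    """Summary statistics are permutation invariant by construction."""
    return np.array([members.mean(), members.std(ddof=1), members.min(), members.max()])

shuffled = rng.permutation(ens)                  # reorder the members
assert np.allclose(summary_features(ens), summary_features(shuffled))
# Any representation that keys on member indices (e.g., "member 1 minus member 2")
# would change under such a shuffle and is therefore not permutation invariant.
```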
While studies have only recently started to explore how specialized model architectures can help to improve postprocessing (Bremnes 2020; Mlakar et al. 2023; Ben-Bouallegue et al. 2023), ML provides a variety of further approaches to enforcing permutation invariance in data-driven learning. Motivated by the success of permutation-invariant neural network (NN) architectures in representation learning, anomaly detection, or set classification (e.g., Ravanbakhsh et al. 2016; Zaheer et al. 2017; Lee et al. 2019; Sannai et al. 2019; Zhang et al. 2019), where the models profit from the ability to extract concise feature representations from unordered data, permutation-invariant NNs appear as promising candidates for improving ensemble postprocessing.
Contribution
In this study, we investigate the capabilities of different permutation-invariant NN architectures for univariate postprocessing of station predictions. We evaluate the proposed models on two exemplary stationwise postprocessing tasks with different characteristics. The ensemble-based network models are compared to classical methods and basic NNs, which operate only on ensemble summary statistics but are trained under identical predictor conditions otherwise. We further assess how much of the predictive information is carried within the details of the ensemble distribution, and how much of the model skill arises from other factors. To shed light on the sources of model skill, we propose an ensemble-oriented feature importance analysis and study the effect of ensemble-internal degrees of freedom using conditional feature permutation.
2. Related work
a. Statistical postprocessing of ensemble forecasts
One of the most popular methods for statistical postprocessing of ensemble forecasts is ensemble model output statistics (EMOS; Gneiting et al. 2005), which performs a distributional regression based on a suitable family of parametric distributions and summary statistics of the ensemble. Due to its simplicity, EMOS has been applied to a wide range of weather variables including temperature (Gneiting et al. 2005), wind gusts (Pantillon et al. 2018), precipitation (Scheuerer 2014), and solar radiation (Schulz et al. 2021). Following the simple statistical EMOS approach, the success of ML methods (Taillardat et al. 2016; Messner et al. 2017), which are able to incorporate additional information and learn more complex patterns, has motivated the use of modern NN methods. The first NN-based approaches were DRN (Rasp and Lerch 2018), an extension of the EMOS framework, and the Bernstein quantile network (BQN; Bremnes 2020), which provides a more flexible forecast distribution. In Schulz and Lerch (2022b), NN-based approaches were adapted toward the prediction of wind gusts and outperformed classical methods. Recently, research has shifted toward the use of more sophisticated network architectures. Examples include convolutional NNs that incorporate spatial NWP output fields (Scheuerer et al. 2020; Grönquist et al. 2021; Veldkamp et al. 2021; Horat and Lerch 2023), and generative models to produce multivariate forecast distributions (Dai and Hemri 2021; Chen et al. 2022).
Only recently, Mlakar et al. (2023) have proposed NN models that explicitly admit the use of ensemble-structured predictors by employing a dynamic attention mechanism. The resulting models perform best in the benchmark study of Demaeyer et al. (2023). Mlakar et al. (2023) address postprocessing with similar methods as this work, but do not focus explicitly on comparing different network design patterns for inference based on ensemble-valued predictors. In orthogonal work, Finn (2021) and Ben-Bouallegue et al. (2023) apply transformer-based NNs to ensemble postprocessing. In contrast to this study, both approaches focus on gridded predictor data, thus relying on different network architectures, and postprocess ensembles in a member-by-member fashion, whereas this work concentrates on distributional regression.
For a general review of statistical postprocessing of weather forecasts, we refer to Vannitsem et al. (2018); reviews of recent developments and challenges can be found in Vannitsem et al. (2021) and Haupt et al. (2021).
b. Neural network architectures for regression on set-structured data
From an ML perspective, postprocessing of ensemble forecasts can be phrased as a regression task on set-structured predictors. Multiple studies have demonstrated that dedicated permutation-invariant NN architectures can help to improve prediction quality and generalization in diverse learning problems (e.g., Vinyals et al. 2015; Lyle et al. 2020), thus motivating the exploration of permutation-invariant NNs also for ensemble postprocessing. Early works on permutation-invariant layers for NNs (Ravanbakhsh et al. 2016) and pooling-based permutation-invariant NNs (Edwards and Storkey 2016) were followed by the more comprehensive framework DeepSets (Zaheer et al. 2017), which encompasses some of the most common design patterns for ML inference on set-structured predictors. Due to its generality, DeepSets is selected as one of the representative learning approaches in this study and is further discussed in section 4a. Soelch et al. (2019) highlight that architectural improvements, such as the use of more expressive pooling functions, may enhance model performance, which we consider in the design of the model architectures for postprocessing.
An alternative approach to permutation-invariant inference has been proposed by Lee et al. (2019), who use (multihead) attention functions (Vaswani et al. 2017) for permutation-invariant inference on set-valued data. Attention-based NNs, also known as transformers, have proven powerful in a variety of computer vision tasks (e.g., Khan et al. 2022) as well as postprocessing (Finn 2021; Ben-Bouallegue et al. 2023, see their section 2a) and are thus considered as a second paradigm for building permutation-invariant NNs.
c. Machine learning explainability and feature importance
ML explainability has attracted substantial interest throughout the last decade (for recent surveys, see, e.g., Guidotti et al. 2018; Linardatos et al. 2021; Burkart and Huber 2021) and is increasingly adopted in the Earth-system sciences (e.g., Reichstein et al. 2019; Höhlein et al. 2020; Labe and Barnes 2021; Farokhmanesh et al. 2023) to gain understanding of the reasoning mechanisms behind ML-based inference approaches. The most relevant approaches for this work are based on permutation feature importance (PFI; Breiman 2001), which aims to assess the (relative) importance of different predictors for inference. In PFI, relevance scores are assigned to the predictors based on the accuracy loss after permuting the predictor values within the test dataset. PFI has been applied in postprocessing before (e.g., Rasp and Lerch 2018; Schulz and Lerch 2022b), with a focus on scalar-valued predictors. In this work, we propose a conditional PFI measure for ensemble-valued predictors, which allows attributing importance values to different aspects of the ensemble-internal variability. Conditional perturbation measures have been considered in earlier works (e.g., Strobl et al. 2008; Molnar et al. 2024), where the conditioning is used to evaluate the importance of specific predictors in the context of the remaining predictors. By contrast, our approach addresses specifically the distribution characteristics of the ensemble-valued predictors encountered in postprocessing.
3. Benchmark methods and forecast distributions
a. Assessing predictive performance
In addition to the continuous ranked probability score (CRPS; Matheson and Winkler 1976; Gneiting and Raftery 2007), we assess calibration based on the empirical coverage of prediction intervals (PIs) derived from the forecast distribution, and sharpness based on the corresponding PI length. Under the assumption of calibration, the observed coverage of a PI should match the nominal level, and a forecast is sharper the smaller the length of the PI. In line with Schulz and Lerch (2022b), we choose the PI level based on the size of the underlying ensemble. For an ensemble of size m, we use the central PI with nominal level (m − 1)/(m + 1), which corresponds to the probability that the observation falls within the ensemble range if the members and the observation are exchangeable.
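As a minimal illustrative sketch (not our evaluation code), the snippet below computes the nominal PI level matched to an m-member ensemble and the resulting empirical coverage and mean PI length, assuming for simplicity a non-truncated logistic forecast distribution; all numerical values are hypothetical.

```python
import numpy as np
from scipy.stats import logistic

def pi_level(n_members):
    """Nominal level of the central PI matched to an n-member ensemble: (n-1)/(n+1)."""
    return (n_members - 1) / (n_members + 1)   # 20 -> 90.48%, 11 -> 83.33%, 51 -> 96.15%

def pi_coverage_and_length(loc, scale, obs, n_members):
    """Empirical coverage (%) and mean length of central PIs of a logistic forecast."""
    level = pi_level(n_members)
    lower = logistic.ppf((1.0 - level) / 2.0, loc, scale)   # lower PI bound per case
    upper = logistic.ppf((1.0 + level) / 2.0, loc, scale)   # upper PI bound per case
    coverage = 100.0 * np.mean((obs >= lower) & (obs <= upper))
    return coverage, np.mean(upper - lower)

# hypothetical forecast parameters (per case) and verifying observations
loc = np.array([10.2, 8.7, 12.5]); scale = np.array([1.1, 0.9, 1.4])
obs = np.array([9.8, 11.4, 12.1])
print(pi_coverage_and_length(loc, scale, obs, n_members=20))
```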
Further, we qualitatively assess calibration based on (unified) probability integral transform (PIT) histograms (Gneiting and Katzfuss 2014; Vogel et al. 2018). While a flat histogram indicates that the forecasts are calibrated, systematic deviations indicate miscalibration. For more details on the evaluation of probabilistic forecasts, we refer to Gneiting and Katzfuss (2014).
b. Distributional regression with parametric forecast distributions (EMOS, DRN)
In this study, we consider postprocessing of the ensemble forecast for a real-valued random variable Y as a distributional regression task on ensemble-structured predictors. We focus on the case of stationwise forecasts, which are given as prediction vectors comprising the m ensemble member forecasts of the target variable, possibly complemented by further predictors. The goal is to estimate the conditional distribution of Y given the predictors by a parametric forecast distribution, whose parameters θ are obtained from the predictors via a link function g.
For EMOS, g is defined as a generalized affine-linear function of ensemble summary statistics, such as the ensemble mean or standard deviation, and provides only limited flexibility for distribution estimation. DRN (Rasp and Lerch 2018; Schulz and Lerch 2022b), in contrast, admits the data-driven estimation of arbitrary link functions using NNs, thus increasing the learning ability. The forecast distribution and the underlying proper scoring rule used for optimization are implementation choices, which are specified for each case study below.
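To make the DRN setup concrete, the following simplified PyTorch sketch maps ensemble summary statistics and a learned station embedding to the location and scale of a logistic forecast distribution and trains by minimizing the closed-form CRPS of the (non-truncated) logistic distribution. The case studies below use a zero-truncated logistic for DRN, and all layer sizes and data here are illustrative rather than the tuned values listed in appendix B.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DRN(nn.Module):
    """Simplified DRN: summary statistics + station embedding -> logistic parameters."""
    def __init__(self, n_features, n_stations, emb_dim=10, hidden=64):
        super().__init__()
        self.station_emb = nn.Embedding(n_stations, emb_dim)
        self.mlp = nn.Sequential(
            nn.Linear(n_features + emb_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),                 # raw location and raw scale
        )

    def forward(self, features, station_id):
        h = torch.cat([features, self.station_emb(station_id)], dim=-1)
        out = self.mlp(h)
        loc = out[..., 0]
        scale = F.softplus(out[..., 1]) + 1e-3    # enforce a positive scale parameter
        return loc, scale

def crps_logistic(loc, scale, obs):
    """Closed-form CRPS of a (non-truncated) logistic forecast distribution."""
    z = (obs - loc) / scale
    return scale * (z - 2.0 * F.logsigmoid(z) - 1.0)

# hypothetical training step
model = DRN(n_features=12, n_stations=175)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
feats = torch.randn(32, 12)
sid = torch.randint(0, 175, (32,))
y = 10.0 + torch.randn(32)
loc, scale = model(feats, sid)
loss = crps_logistic(loc, scale, y).mean()
opt.zero_grad(); loss.backward(); opt.step()
```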
c. Flexible distribution estimator (BQN)
Distributional regression methods based on a parametric forecast distribution are robust but lack flexibility, as they are bound to the parametric distribution family of choice. Typical choices of forecast distributions include the normal (Gneiting et al. 2005; Rasp and Lerch 2018), logistic (Schulz and Lerch 2022b), or generalized extreme value distribution (Lerch and Thorarinsdottir 2013; Scheuerer 2014). They all lack the ability to express multimodalities that are required, for example, when different weather patterns occur. Hence, methods that do not rely on parametric assumptions have been proposed in the postprocessing literature. Examples are the direct adjustment of the ensemble members (van Schaeybroeck and Vannitsem 2015) or quantile regression forests (Taillardat et al. 2016). BQN (Bremnes 2020) models the forecast distribution as a quantile function, which is represented as a linear combination of Bernstein basis polynomials of degree d. The NN predicts the coefficients of this expansion, and constraining the coefficients to be nondecreasing guarantees a valid, monotone quantile function.
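For illustration, the sketch below evaluates a Bernstein quantile function from a set of coefficients; as in Bremnes (2020), nondecreasing coefficients (obtained, e.g., as cumulative sums of nonnegative network outputs) yield a monotone quantile function. The degree and coefficient values are hypothetical. In training, the coefficients are typically estimated by minimizing the quantile loss averaged over a grid of probability levels.

```python
import numpy as np
from scipy.stats import binom

def bernstein_quantile(coeffs, tau):
    """Evaluate Q(tau) = sum_j a_j * B_{j,d}(tau) with Bernstein basis B_{j,d}."""
    d = len(coeffs) - 1
    j = np.arange(d + 1)
    # B_{j,d}(tau) = C(d, j) * tau^j * (1 - tau)^(d - j), evaluated for all tau at once
    basis = binom.pmf(j[None, :], d, tau[:, None])
    return basis @ coeffs

# Nondecreasing coefficients from cumulative sums of nonnegative increments (hypothetical)
increments = np.array([0.4, 0.3, 0.1, 0.5, 0.2, 0.3, 0.6, 0.2, 0.4])
coeffs = 8.0 + np.cumsum(increments)          # a_0 <= a_1 <= ... <= a_d
tau = np.linspace(0.01, 0.99, 99)
q = bernstein_quantile(coeffs, tau)
assert np.all(np.diff(q) >= 0)                # monotone, hence valid, quantile function
```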
d. Use of auxiliary predictors
In addition to the predictions of the target variable, most algorithms use auxiliary information to improve the prediction performance (see Table 1). We distinguish between ensemble-valued and scalar-valued predictors, where ensemble-valued predictors vary between different members and scalar-valued predictors do not. In the ensemble-valued case, we differentiate the primary prediction from auxiliary predictions of other meteorological variables. For either of these, postprocessing models can have access to the full set of ensemble values or only to summary statistics.
Predictor utilization by postprocessing methods. Methods used in this study are indicated by “ours.” Abbreviations: permutation-invariant (perm.-inv.); standard deviation (SD); station embedding (SE); station predictors and embedding (SP + SE).
Scalar-valued predictors refer to contextual information, such as station-specific coordinates and orography details (cf. Table 1, station predictors), as well as to temporal information, such as the day of the year. We consider only models that are trained on predictions for specific initialization and lead times, such that information about the diurnal cycle is not required. While most approaches include the scalar predictors explicitly as features in the regression process, EMOS takes advantage of categorical location and time information implicitly by fitting independent models for each station and month (Schulz and Lerch 2022b). BQN- and DRN-type models are trained separately for each lead time but employ a learned station embedding (Rasp and Lerch 2018; Schulz and Lerch 2022b) to share the same model between different station sites. Notably, the permutation-invariant models (cf. Table 1, permutation-invariant) considered in this study have access to the richest predictor pool. A complete list of model inputs on the parameter level can be found in Tables A1–A3 in appendix A.
4. Permutation-invariant neural network architectures
From the variety of permutation-invariant model architectures, we select two representative approaches, set pooling architectures and set transformers, which we adapt for distributional regression. Compared with the benchmark methods of section 3, the proposed networks replace the summary-based ensemble processing while the parameterization of the forecast distributions remains unchanged. A schematic comparison of both permutation-invariant architectures is shown in Fig. 1.
(a) Set pooling architecture, consisting of encoder and decoder MLPs, and (b) set transformer, featuring attention blocks and intermediate MLPs with residual connections. While the encoder–decoder architecture admits interactions between members only inside the pooling step, the set transformer admits information transfer between the members in each attention step.
a. Set pooling architectures
Set pooling architectures (Zaheer et al. 2017), also known as DeepSets, achieve permutation invariance via extraction and permutation-invariant summarization of learned latent features. The features are obtained by applying an encoder MLP to all ensemble members separately, followed by a permutation-invariant pooling function and a decoder MLP, which outputs the distribution parameters θ. Due to this division into encoding, pooling, and decoding, we use the names set pooling and encoder–decoder (ED) architecture synonymously.
In experiments, we considered different variants of ensemble summarization based on average and extremum pooling, as well as adaptive pooling functions based on an attention mechanism (Lee et al. 2019; Soelch et al. 2019), discussed below. Overall, we find that the pooling mechanism is of minor importance. Detailed comparisons are thus deferred to the supplemental materials and all subsequent experiments use attention-based pooling consistently.
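A minimal PyTorch sketch of the set pooling pattern is given below. For readability, it uses simple average pooling (whereas the experiments use attention-based pooling), and concatenating scalar-valued predictors to the pooled representation before the decoder is one plausible way to include them; all layer widths are illustrative.

```python
import torch
import torch.nn as nn

class SetPoolingNet(nn.Module):
    """Encoder-decoder (DeepSets-style) network: a shared per-member encoder,
    permutation-invariant pooling, and a decoder producing distribution parameters."""
    def __init__(self, n_member_feats, n_scalar_feats, n_params, hidden=64, latent=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_member_feats, hidden), nn.ReLU(),
            nn.Linear(hidden, latent),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent + n_scalar_feats, hidden), nn.ReLU(),
            nn.Linear(hidden, n_params),
        )

    def forward(self, members, scalars):
        # members: (batch, n_members, n_member_feats); scalars: (batch, n_scalar_feats)
        latent = self.encoder(members)          # encoder weights shared across members
        pooled = latent.mean(dim=1)             # permutation-invariant average pooling
        return self.decoder(torch.cat([pooled, scalars], dim=-1))

net = SetPoolingNet(n_member_feats=8, n_scalar_feats=5, n_params=2)
theta = net(torch.randn(4, 20, 8), torch.randn(4, 5))   # invariant to member order
```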
b. Set transformer
Set transformers (Lee et al. 2019) are NNs that model interactions between set members via self-attention. Attention is a form of nonlinear activation function, in which the relevance of the inputs is determined via a matching of input-specific key and query vectors. Multihead attention allows the model to attend to multiple key patterns in parallel (Vaswani et al. 2017). Lee et al. (2019) combine multihead attention with a memberwise NN to build a permutation-invariant set-attention block, from which a set transformer is constructed by stacking multiple instances. Set transformers apply straightforwardly to ensemble data and can exploit all aspects of the available ensemble dataset by allowing for information exchange between ensemble members early in the inference process. We construct a set transformer by using three set-attention blocks with eight attention heads (Vaswani et al. 2017; Lee et al. 2019). Each block comprises a separate MLP with two hidden layers. Additionally, the first set-attention block is preceded by a linear layer to align the channel number of the ensemble input with the hidden dimension of the set-attention blocks. To construct vector-valued predictions from set-valued inputs, Lee et al. (2019) propose attention-based pooling, in which the output query vectors are implemented as learnable parameters. After pooling, the final prediction θ is obtained by applying another two-layer MLP.
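The following PyTorch sketch outlines this pattern: stacked set-attention blocks in which members exchange information via multihead self-attention, followed by attention-based pooling with a learnable query (seed) vector and an output MLP. Layer widths and the memberwise MLP depth are simplified relative to the configuration described above.

```python
import torch
import torch.nn as nn

class SetAttentionBlock(nn.Module):
    """Self-attention over ensemble members followed by a memberwise MLP (residual)."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):                       # x: (batch, n_members, dim)
        h, _ = self.attn(x, x, x)               # members exchange information here
        x = x + h
        return x + self.mlp(x)

class SetTransformer(nn.Module):
    def __init__(self, n_member_feats, n_params, dim=64, n_blocks=3):
        super().__init__()
        self.embed = nn.Linear(n_member_feats, dim)        # align channel number
        self.blocks = nn.Sequential(*[SetAttentionBlock(dim) for _ in range(n_blocks)])
        self.seed = nn.Parameter(torch.randn(1, 1, dim))   # learnable pooling query
        self.pool = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, n_params))

    def forward(self, members):                 # members: (batch, n_members, n_member_feats)
        x = self.blocks(self.embed(members))
        query = self.seed.expand(members.shape[0], -1, -1)
        pooled, _ = self.pool(query, x, x)      # attention-based pooling -> (batch, 1, dim)
        return self.head(pooled.squeeze(1))     # distribution parameters theta

theta = SetTransformer(n_member_feats=8, n_params=2)(torch.randn(4, 20, 8))
```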
5. Data
We evaluate the performance of the proposed models in two postprocessing tasks using the datasets described in Table 2. An overview of the predictor and target variables is provided in appendix A.
a. Wind gust prediction in Germany
In the first case study, we employ our methods for stationwise postprocessing of wind gust forecasts using a dataset that has previously been used in Pantillon et al. (2018) and Schulz and Lerch (2022b). The ensemble forecasts are based on the COSMO ensemble prediction system (COSMO-DE; Baldauf et al. 2011) and consist of 20 member forecasts, simulated with a horizontal resolution of 2.8 km. The forecasts are initialized at 0000 UTC, and we consider the lead times 6, 12, and 18 h. Other than wind gusts, the dataset comprises ensemble forecasts of several meteorological variables, such as temperature, pressure, precipitation, and radiation. An overview of all predictors is shown in Table A1 (appendix A). The predictions are verified against observations measured at 175 stations of the German Weather Service [Deutscher Wetterdienst (DWD)]. Forecasts for the individual weather stations are obtained from the closest grid point. The time period of the forecast and observation data starts on 9 December 2010 and ends on 31 December 2016. The models use the data from 2010 to 2015 for model estimation, with 2010–14 serving as the training period and 2015 as the validation period. The forecasts are then verified in 2016. As in Schulz and Lerch (2022b), each lead time is processed separately.
As detailed in Schulz and Lerch (2022b), a minor caveat is caused by a nontrivial substructure of the forecast ensembles. The 20-member ensembles constitute a conglomerate of four subensembles, which are generated with slightly different model configurations. While this formally violates the assumption of statistical interchangeability of the members, the subensembles are sufficiently similar to justify the application of permutation-invariant models.
For the benchmark methods EMOS and DRN, we use the exact same forecasts as in Schulz and Lerch (2022b), both estimating the parameters of a truncated logistic distribution by minimizing the CRPS; see their section 3 for details. BQN is adapted as described in section 3 and Table 1.
b. Temperature forecasts from the EUPPBench dataset
In a second example, we postprocess ensemble forecasts of surface temperature using a subset of the EUPPBench postprocessing benchmark dataset (Demaeyer et al. 2023). EUPPBench provides paired forecast and observation data from two sets of samples. The first part consists of 20 years of reforecast data (1997–2016) from the Integrated Forecasting System (IFS) of the ECMWF with 11 ensemble members. Mimicking typical operational approaches, the reforecast dataset is used as training data, complemented by an additional two years (2017 and 2018) of 51-member forecasts as test data. EUPPBench comprises sample data from multiple European countries—Austria, Belgium, France, Germany, and the Netherlands—which are publicly accessible via the CliMetLab API (ECMWF 2013). Additional data for Switzerland can be requested from the Swiss Weather Service but is not used in this study. EUPPBench constitutes a comprehensive dataset of samples over a long time period. In contrast to the wind gust forecasts, the EUPPBench ensemble members are exchangeable, so that permutation-invariant model architectures are optimally suited.
Models are tested on the 51-member forecasts and, deviating from the EUPPBench convention, the last 4 years of the reforecast dataset are additionally held out as an independent test set of 11-member forecast samples. This allows us to assess the generalization capabilities of the ensemble-based postprocessing models on data equivalent to the training data, as well as on data with larger ensemble sizes. Furthermore, we use the full set of available surface- and pressure-level predictor variables, whereas the original EUPPBench task is restricted to using only surface temperature data. While this design choice hinders the direct comparison of the evaluation metrics in this paper with the original EUPPBench models, it enables a more comprehensive assessment of the relative benefits of using summary-based versus ensemble predictors. An overview of the predictors can be found in Table A2 (appendix A). From the pool of available forecast lead times, we select 24, 72, and 120 h for a closer analysis.
Unlike previous postprocessing applications for temperature (e.g., Gneiting et al. 2005; Rasp and Lerch 2018), we employ a zero-truncated logistic distribution as parametric forecast distribution for DRN, instead of a zero-truncated normal, as preliminary tests showed a slightly superior predictive performance of the logistic distribution (see supplemental material for details). The truncation at zero arises from the use of the Kelvin scale for measuring temperatures and allows the use of the same model configuration for both temperature and wind gust predictions. In particular, the EMOS and DRN benchmark approaches are identical for both datasets.
6. Performance evaluation
For each of the postprocessing methods, we generated a pool of 20 networks in each forecast scenario. To ensure a fair comparison to the benchmark methods, we follow the approach from Schulz and Lerch (2022a,b), who build an ensemble of 10 networks and combine the forecasts via quantile aggregation. Hence, we draw 10 members from the pool and repeat this procedure 50 times to quantify the uncertainty of sampling from the general pool. For each model variant and resample, we select as the final forecast the configuration that yields the lowest CRPS on the validation set. Details on hyperparameter settings are listed in appendix B and tuning procedures are discussed in the supplemental material. For both datasets, we compute the average CRPS, PI length, and PI coverage for the different forecast lead times based on the respective test datasets. The average is calculated over the resamples of the aggregated network ensembles. In what follows, we refer to pooling-based encoder–decoder (ED) models and set transformers (ST); the suffixes DRN and BQN indicate the parameterization of the forecast distribution. The model categories DRN and BQN without additional prefixes refer to the benchmark models based on summary statistics.
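For the aggregation step, the sketch below illustrates quantile aggregation (Vincentization) in the spirit of Schulz and Lerch (2022a): the quantile functions of the individual networks are averaged on a fixed grid of probability levels. The logistic parameterization, the grid resolution, and the numerical values are illustrative.

```python
import numpy as np
from scipy.stats import logistic

def aggregate_quantiles(locs, scales, levels):
    """Quantile aggregation: average the member networks' quantiles at fixed
    probability levels, yielding one combined quantile forecast."""
    # locs, scales: arrays of shape (n_networks,) with each network's parameters
    quantiles = logistic.ppf(levels[None, :], locs[:, None], scales[:, None])
    return quantiles.mean(axis=0)               # shape: (n_levels,)

levels = np.linspace(0.01, 0.99, 99)            # probability grid (illustrative)
locs = np.array([10.1, 10.4, 9.8])              # three hypothetical network forecasts
scales = np.array([1.0, 1.2, 0.9])
combined = aggregate_quantiles(locs, scales, levels)
```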
a. Wind gust forecasts
Table 3 shows the quantitative evaluation for lead times 6, 12, and 18 h. All permutation-invariant model architectures perform similarly to the DRN and BQN benchmarks and outperform both the EPS and conventional postprocessing via EMOS, thus achieving state-of-the-art performance for all lead times. Further, the PI lengths and coverages are similar to those of the benchmark methods with the same forecast distribution, indicating that the ensemble-based models achieve approximately the same level of sharpness as the benchmark networks while being well calibrated. Note that the underlying distribution type should be taken into account when comparing the sharpness of different postprocessing models based on the PI length, as the DRN and BQN forecast distributions exhibit different tail behavior, which affects the PI lengths for different nominal levels (see supplemental materials for details). A noticeable difference between the network classes is that the ED models result in sharper PIs than the ST models. This coincides with the empirical PI coverages of the methods in that wider PIs typically result in a higher coverage. Figure 2 shows the PIT histograms of the postprocessed forecasts. While DRN-type and BQN-type models differ from each other, the models within each type show very similar patterns. While all models are well calibrated, DRN-type models reveal limitations in the resolution of gusts in the lower segment of the distribution. BQN-type models all yield very uniform calibration histograms.
Mean CRPS (m s−1), PI length (m s−1), and PI coverage (%) of the postprocessing methods for the different lead times of the wind gust data (20-member ensemble, year 2016). Recall that the nominal level of the PIs is approximately 90.48%. The best-performing models (w.r.t. CRPS) are marked in bold.
PIT histograms of the postprocessing models for EUPPBench (left) 11-member reforecast and 51-member forecast ensembles and (right) 20-member wind gust forecasts.
b. EUPPBench surface temperature reforecasts
As shown in Table 4, both ED and ST models show significant advantages compared to the EPS and EMOS in terms of CRPS and PI length for the EUPPBench dataset. Differences between the network variants arise mainly due to the use of different forecast distribution types. Note that the lead times of the wind gust dataset are in the short range with a maximum of 18 h, whereas the lead times considered in the EUPPBench dataset range from one to five days. Hence, the differences between the lead times in the effects of postprocessing are more pronounced. For example, for a lead time of 120 h, the improvement of the network-based postprocessing methods over the conventional EMOS approach is much smaller than for shorter lead times. In particular, ST models perform the best for lead time 24 h and all newly proposed models result in the smallest CRPS for lead time 120 h. In terms of the PI length and coverage, we find that the ED and ST models tend to generate slightly sharper predictions. A more detailed discussion of the differences in the PI lengths due to the choice of the underlying distribution is provided in the supplemental material. The PIT histograms in Fig. 2 show that the BQN models struggle to set accurate upper and lower bounds for the predicted distribution, whereas DRN distributions do not show such issues. Instead, they face the problem that the tail is too heavy. Overall, all postprocessing methods result in calibrated forecasts, while the DRN forecasts appear slightly better calibrated than the BQN forecasts, yielding PIT histograms with a wavelike structure.
Mean CRPS (K), PI length (K), and PI coverage (%) of the postprocessing methods for the different lead times for the EUPPBench reforecast data (11-member ensemble, years 2013–16). Recall that the nominal level of the PIs is approximately 83.33%. The best-performing models (w.r.t. CRPS) are marked in bold.
c. Generalization to 51-member forecast ensembles
As before, postprocessing outperforms the EPS forecasts and results in calibrated and accurate forecasts (cf. Table 5 and Fig. 2). Notably, all models have been trained purely on 11-member reforecasts and are not fine-tuned to the 51-member forecast ensembles. The CRPS values are almost identical for all models except EMOS, across all lead times. The ST models again perform the best for the shortest lead time. For the DRN forecasts, we find that the ensemble-based networks tend to reduce the PI length, as it is smaller for all cases except for lead time 120 h. The corresponding PI coverages are closely connected to the length of the PIs and indicate that the PIs are too large for most postprocessing models, as the observed coverages are above the nominal level.
Mean CRPS (K), PI length (K), and PI coverage (%) of the postprocessing methods for the different lead times for EUPPBench forecast data (51-member ensemble, years 2017–18). Recall that the nominal level of the PIs is approximately 96.15%. The best-performing models (w.r.t. CRPS) are marked in bold.
The calibration of the methods is not as good as in the other case studies, as indicated by the PIT histograms in Fig. 2, which may be a consequence of the large learning rate used in training the models (cf. supplemental materials). All BQN forecasts have problems in the tails, where the lower and upper bound are too extreme, such that insufficiently many observations fall into the outer bins. DRN yields results similar to those for the reforecast data, with a too heavy-tailed forecast distribution, as indicated by the underpopulated last bin. The differences between the methods themselves are again minor. Still, all postprocessing methods generate reasonably well-calibrated forecasts. Overall, the ensemble-based models result in state-of-the-art performance for generalization on 51-member forecasts or offer advantages over the summary-based benchmark methods.
7. Analysis of predictor importance
We analyze how the different model types distill relevant information out of the ensemble predictors. For this, we propose an ensemble-oriented PFI analysis to assess which distribution properties of the ensemble-valued predictors have the most effect on the final prediction. In its original form, PFI (e.g., Breiman 2001; Rasp and Lerch 2018; Schulz and Lerch 2022b) is used to assign relevance scores to scalar-valued predictors by randomly shuffling the values of a single predictor across the dataset. While the idea of shuffling predictor samples translates identically from scalar-valued to ensemble-valued predictors, ensemble predictors possess internal degrees of freedom (DOFs), such as ensemble mean and ensemble range, which may affect the prediction differently. In addition to ensemble-internal DOFs, the perturbed predictor ensemble is embedded in the context of the remaining multivariate ensemble predictors, such that covariances, copulas or the rank order of the ensemble members may carry information. To account for such effects, we introduce a conditional permutation strategy that singles out the effects of different ensemble properties.
a. Importance of the ensemble information
Standard PFI perturbs a scalar-valued predictor by shuffling its values across the samples of the test dataset. For ensemble-valued predictors, we consider two generalizations of this shuffling operator. We refer to these as the fully random permutation, which shuffles the ensemble-valued predictor across the dataset without constraints, and the conditional permutation, which preserves selected summary statistics (such as the ensemble mean) of each original predictor ensemble and randomizes only the remaining ensemble-internal DOFs.
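The sketch below shows one plausible implementation of the two shuffling operators for a single ensemble-valued predictor; the exact form of both operators, the array shapes, and the variable names are our assumptions for illustration. The fully random variant destroys all ensemble structure, whereas the mean-conditional variant re-centers randomly drawn ensembles so that each case retains its original ensemble mean.

```python
import numpy as np

rng = np.random.default_rng(0)

def shuffle_fully_random(x):
    """Fully random permutation: shuffle all member values of one ensemble-valued
    predictor across samples and members, destroying the ensemble structure."""
    flat = rng.permutation(x.reshape(-1))
    return flat.reshape(x.shape)

def shuffle_conditional_on_mean(x):
    """Mean-conditional permutation (one plausible variant): assign each case the
    member values of a randomly drawn other case, re-centered so that the original
    ensemble mean is preserved and only the remaining DOFs are randomized."""
    donor = x[rng.permutation(x.shape[0])]
    return donor - donor.mean(axis=1, keepdims=True) + x.mean(axis=1, keepdims=True)

# x: (n_samples, n_members) values of one ensemble-valued predictor (hypothetical)
x = rng.normal(size=(1000, 20))
x_cond = shuffle_conditional_on_mean(x)
assert np.allclose(x_cond.mean(axis=1), x.mean(axis=1))   # means preserved per case
# The PFI score is then the change in mean CRPS when the model is evaluated on the
# perturbed predictor while all other predictors remain unchanged.
```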
b. Results
We compute PFI scores for all predictors of the summary-based and the permutation-invariant models in both case studies; the results are shown in Fig. 3.
(top) Permutation feature importance for summary-based networks and (bottom) permutation-invariant models for EUPPBench and wind gust postprocessing. Bar heights indicate the median of an ensemble of 20 separate models, the error bars depict the IQR. Predictors named “ens” in the top panels correspond to the primary predictors t2m and VMAX-10M, respectively. The suffix “sd” indicates the ensemble standard deviation of the predictor.
The accuracy of the wind gust models is dominated by VMAX-10M and supplemented by additional predictors with lower importance. Temperature-like predictors obtain similar or higher scores than, for example, winds at the 850- and 950-hPa pressure levels. Note that for each lead time, the importance highlights different temperature predictors, which may be attributed to the diurnal cycle. Similar arguments can explain the increasing importance of the shortwave radiation balance at the surface (ASOB-S) with increasing lead time. In a direct comparison of the model variants, we find that the differences between BQN-type and DRN-type models are very minor. However, ED-type models attribute higher importance to the most relevant predictors (VMAX-10M, T1000, T-2M), whereas ST-type models distribute the importance more evenly and use more diverse predictor information.
In the EUPPBench case, the models focus mainly on temperature-like predictors as well as surface radiation balances. Notably, for the summary-based models, mn2t6 and mx2t6 tend to be more important than the primary predictor t2m up to lead time 72 h. Since the diurnal cycle does not cause variations between the lead times here, differences in the predictor utilization must be due to the increasing uncertainty at longer lead times. The ensemble-based models rely relatively more strongly on the t2m predictor for the shorter lead time, whereas for longer lead times, the information utilization is more diverse. Qualitative differences between ED- and ST-type models are observed with respect to the humidity-related predictors tcw and tcwv. Only ST models recognize the value in these predictors, which may in part explain the different generalization properties of ED and ST models on the EUPPBench reforecast and forecast datasets.
Figures 4 and 5 investigate the importance of ensemble-internal DOFs of selected ensemble predictors for the permutation-invariant model architectures. For both datasets, we choose a set of representative high-importance predictors and display the DOF importance for the ensemble-based models. Corresponding figures for the remaining predictors are shown in the supplemental materials. For all predictors and lead times, we compute importance ratios, which quantify how much of the skill lost under fully random shuffling of a predictor is restored when the shuffling is conditioned on a given ensemble statistic.
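One way to formalize such a ratio (the notation here is illustrative) is

$$ r_{\mathrm{cond}} = \frac{\bar{S}_{\mathrm{full}} - \bar{S}_{\mathrm{cond}}}{\bar{S}_{\mathrm{full}} - \bar{S}_{\mathrm{ref}}}, $$

where $\bar{S}_{\mathrm{ref}}$ denotes the mean score of the unperturbed model, $\bar{S}_{\mathrm{full}}$ the mean score under fully random shuffling of the predictor, and $\bar{S}_{\mathrm{cond}}$ the mean score under shuffling conditioned on a given ensemble statistic. Under this reading, $r_{\mathrm{cond}} \approx 1$ indicates that conditioning on the statistic restores almost the full model skill, whereas $r_{\mathrm{cond}} \approx 0$ indicates that the statistic carries little of the predictor information.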
Importance of ensemble-internal DOFs for wind gust postprocessing. Bar charts show importance ratios for shuffling conditioned on different ensemble statistics of selected high-importance predictors; interaction plots show correlations between the corresponding perturbation patterns.
As in Fig. 4, but for temperature postprocessing.
For wind gust postprocessing (Fig. 4), the importance ratios suggest in many cases that most of the predictor information can be restored by conditioning the shuffling procedure on the ensemble mean. This is the case for T-1000, T-G, and FI850. The interaction plots suggest that the mean conditioning preserves information about extrema to a high degree, whereas ensemble range and higher-order statistic information are randomized. These findings are supported by observations in Schulz and Lerch (2022b), who note that omitting the standard deviation of the auxiliary ensemble predictors helps to improve the quality of the network predictions. Larger importance ratios of the scale-related and higher-order DOFs are observed, for example, for T-G at lead time 12 h and ASOB-S at lead time 6 h. However, these cases coincide with increased correlations between the respective perturbation patterns and the location-preserving perturbations, which show fractional skill ratios close to unity. This may be seen as an indication that the relevance of the remaining DOFs must in part be attributed to information overlap with the location-related DOFs. Note here that the strong information overlap between locationlike and scalelike metrics for ASOB-S predictors at 6-h lead time is again an artifact due to the diurnal cycle. At 6-h lead time, a substantial fraction of the ASOB-S predictor ensembles has zero mean and vanishing variance due to the lack of solar irradiation, which impacts the correlation values as well as the effectiveness of perturbations. WIND850 is a corner case, in which the mean conditioning restores substantial amounts of the model skill but fails to restore the unperturbed performance completely. This indicates that, while the ensemble mean is an important predictor, the remaining DOFs deliver complementary information that modulates the interpretation of the mean value. VMAX-10M, being the primary predictor, constitutes the only example for which no single conditioning statistic is sufficient to restore the unperturbed model skill, thus indicating that both ED- and ST-type models learn to attend to the details of the ensemble distribution.
In surface temperature postprocessing, t2m is the primary predictor and displays similar characteristics as VMAX-10M in the wind-gust study. The mean-conditional shuffling of t2m tends to become more effective in restoring the model skill with increasing lead time. This may be due to the decreasing reliability of the EPS-based predictor ensembles with increasing lead time. Similar patterns are observed also in the remaining predictors. While the model skill cannot be restored with mean-only conditioning for 24-h lead time, the mean appears to become more informative for longer lead times. The radiation parameter ssrd6 sticks out visually with high correlations between location-related predictors, which occurs due to the same reasons as for the ASOB-S parameter discussed before.
8. Discussion and conclusions
We have introduced permutation-invariant NN architectures for postprocessing ensemble forecasts by selecting two exemplary model families and adapting them to postprocessing. In two case studies, using datasets for wind gusts and surface temperature postprocessing, we evaluated the model performance and compared the permutation-invariant models against benchmark models from prior work. Our results show that permutation-invariant postprocessing networks achieve state-of-the-art performance in both applications. All permutation-invariant architectures outperform both the raw ensemble forecast and conventional postprocessing via EMOS by a large margin, but systematic differences between the (more complex) permutation-invariant models and existing NN-based solutions are very minor and can mostly be attributed to differences in the distribution parameterization. Qualitatively similar results were observed for extreme events in both case studies but are not shown explicitly in the interest of brevity.
Based on a subsequent assessment of the permutation importance of ensemble-internal DOFs, we have seen that for many auxiliary ensemble predictors, preserving information about the ensemble mean is sufficient to maintain almost the complete information about the postprocessing target, while more detailed information is required about the primary predictors. These findings are consistent with prior work and are more comprehensive due to the larger variety of summary statistics considered in the analysis.
A striking advantage of the permutation-invariant models lies in the generality of the approach, that is, the models possess the flexibility of attending to the important features in the predictor ensembles and the capability of identifying those during training (as shown in our feature analysis). As the added flexibility comes with a surplus of computational complexity, the benefits of the respective methods should be weighed carefully. In operational settings, it may be reasonable to consider permutation-invariant models, as proposed here, as a tool for identifying relevant aspects of the input data. The gained knowledge can then be used for data reduction and to train reduced models with a more favorable accuracy–complexity trade-off.
Despite this, the apparent similarity between the performance of the ensemble-based and summary-based models remains baffling and requires further clarification. Assuming skillful ensemble predictions, it seems reasonable, from a meteorological perspective, to expect that postprocessing models that operate on the entire ensemble can learn more complex patterns and relationships than models that operate on simple summary statistics. The lack of substantial improvements, as seen in this study, admits different explanations. One possibility would be that the available datasets are insufficient to establish statistically relevant connections between higher-order ensemble-internal patterns and the predicted variables. Problems could arise, for example, due to insufficient sample counts of the overall datasets or due to ensemble sizes being too low to provide reliable representations of the forecast distribution. Yet another reason could lie in the fact that the generation mechanisms underlying the NWP ensemble forecasts fail to achieve meaningful representations of such higher-order distribution information, which would raise follow-up questions regarding the design of future ensemble prediction systems. Given the impact and potential implications of the latter alternative, future work should examine the information content of raw ensemble predictions in more detail. The proposed permutation-invariant model architectures may help to achieve this, for example, by conducting postprocessing experiments with dynamical toy systems that are cheap to simulate and simple to understand, such that large datasets can be generated and evidence for the competing hypotheses can be weighed against each other.
Acknowledgments.
This research was funded by the subprojects B5 and C5 of the Transregional Collaborative Research Center SFB/TRR 165 “Waves to Weather” (www.wavestoweather.de) funded by the German Research Foundation (DFG). Sebastian Lerch gratefully acknowledges support by the Vector Stiftung through the Young Investigator Group “Artificial Intelligence for Probabilistic Weather Forecasting.” We thank two anonymous reviewers for their constructive comments.
Data availability statement.
The case study on surface temperature postprocessing is based on the EUPPBench dataset, which is publicly available. See Demaeyer et al. (2023) for details. The wind gust dataset is proprietary but can be obtained from the DWD for research purposes. Code with implementations of all methods is publicly available (Höhlein 2023).
APPENDIX A
Description of Predictors
The descriptions of the ensemble-valued predictor variables used in both case studies are shown in Tables A1 and A2 for wind gust and surface temperature postprocessing, respectively. The predictors listed in Table A3 are not ensemble-valued and are used equally in both case studies.
Description of meteorological parameters for wind-gust postprocessing (cf. Schulz and Lerch 2022b). Target variable: wind speed of gust (observations). Primary predictor: VMAX-10M (ensemble forecast).
Description of meteorological parameters for surface temperature postprocessing (EUPPBench, cf. Demaeyer et al. 2023). Target variable: t2m (observations). Primary predictor: t2m (ensemble forecast).
APPENDIX B
Model Hyperparameters
Table B1 displays the hyperparameter settings for all model configurations used in the experiments. For details about the hyperparameter tuning process, we refer to the supplemental material.
Hyperparameters for model experiments.
REFERENCES
Baldauf, M., A. Seifert, J. Förstner, D. Majewski, M. Raschendorfer, and T. Reinhardt, 2011: Operational convective-scale numerical weather prediction with the COSMO model: Description and sensitivities. Mon. Wea. Rev., 139, 3887–3905, https://doi.org/10.1175/MWR-D-10-05013.1.
Ben-Bouallegue, Z., J. A. Weyn, M. C. A. Clare, J. Dramsch, P. Dueben, and M. Chantry, 2023: Improving medium-range ensemble weather forecasts with hierarchical ensemble transformers. arXiv, 2303.17195v3, https://doi.org/10.48550/arXiv.2303.17195.
Breiman, L., 2001: Random forests. Mach. Learn., 45, 5–32, https://doi.org/10.1023/A:1010933404324.
Bremnes, J. B., 2020: Ensemble postprocessing using quantile function regression based on neural networks and Bernstein polynomials. Mon. Wea. Rev., 148, 403–414, https://doi.org/10.1175/MWR-D-19-0227.1.
Burkart, N., and M. F. Huber, 2021: A survey on the explainability of supervised machine learning. J. Artif. Intell. Res., 70, 245–317, https://doi.org/10.1613/jair.1.12228.
Chen, J., T. Janke, F. Steinke, and S. Lerch, 2022: Generative machine learning methods for multivariate ensemble post-processing. arXiv, 2211.01345v1, https://doi.org/10.48550/arXiv.2211.01345.
Dai, Y., and S. Hemri, 2021: Spatially coherent postprocessing of cloud cover ensemble forecasts. Mon. Wea. Rev., 149, 3923–3937, https://doi.org/10.1175/MWR-D-21-0046.1.
Demaeyer, J., and Coauthors, 2023: The EUPPBench postprocessing benchmark dataset v1.0. Earth Syst. Sci. Data, 15, 2635–2653, https://doi.org/10.5194/essd-15-2635-2023.
ECMWF, 2013: CliMetLab. GitHub, https://github.com/ecmwf/climetlab.
Edwards, H., and A. Storkey, 2016: Towards a neural statistician. arXiv, 1606.02185v2, https://doi.org/10.48550/arXiv.1606.02185.
Farokhmanesh, F., K. Höhlein, and R. Westermann, 2023: Deep learning–based parameter transfer in meteorological data. Artif. Intell. Earth Syst., 2, e220024, https://doi.org/10.1175/AIES-D-22-0024.1.
Finn, T. S., 2021: Self-attentive ensemble transformer: Representing ensemble interactions in neural networks for earth system models. arXiv, 2106.13924v2, https://doi.org/10.48550/arXiv.2106.13924.
Gneiting, T., and A. E. Raftery, 2007: Strictly proper scoring rules, prediction, and estimation. J. Amer. Stat. Assoc., 102, 359–378, https://doi.org/10.1198/016214506000001437.
Gneiting, T., and R. Ranjan, 2011: Comparing density forecasts using threshold- and quantile-weighted scoring rules. J. Bus. Econ. Stat., 29, 411–422, https://doi.org/10.1198/jbes.2010.08110.
Gneiting, T., and M. Katzfuss, 2014: Probabilistic forecasting. Annu. Rev. Stat. Appl., 1, 125–151, https://doi.org/10.1146/annurev-statistics-062713-085831.
Gneiting, T., A. E. Raftery, A. H. Westveld III, and T. Goldman, 2005: Calibrated probabilistic forecasting using ensemble model output statistics and minimum CRPS estimation. Mon. Wea. Rev., 133, 1098–1118, https://doi.org/10.1175/MWR2904.1.
Gneiting, T., F. Balabdaoui, and A. E. Raftery, 2007: Probabilistic forecasts, calibration and sharpness. J. Roy. Stat. Soc., 69B, 243–268, https://doi.org/10.1111/j.1467-9868.2007.00587.x.
Grönquist, P., C. Yao, T. Ben-Nun, N. Dryden, P. Dueben, S. Li, and T. Hoefler, 2021: Deep learning for post-processing ensemble weather forecasts. Philos. Trans. Roy. Soc., A379, 20200092, https://doi.org/10.1098/rsta.2020.0092.
Guidotti, R., A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, and D. Pedreschi, 2018: A survey of methods for explaining black box models. ACM Comput. Surv., 51 (5), 1–42, https://doi.org/10.1145/3236009.
Haupt, S. E., W. Chapman, S. V. Adams, C. Kirkwood, J. S. Hosking, N. H. Robinson, S. Lerch, and A. C. Subramanian, 2021: Towards implementing artificial intelligence post-processing in weather and climate: Proposed actions from the Oxford 2019 workshop. Philos. Trans. Roy. Soc., A379, 20200091, https://doi.org/10.1098/rsta.2020.0091.
Höhlein, K., 2023: Code “postprocessing of ensemble weather forecasts using permutation-invariant neural networks.” Zenodo, https://doi.org/10.5281/zenodo.8329345.
Höhlein, K., M. Kern, T. Hewson, and R. Westermann, 2020: A comparative study of convolutional neural network models for wind field downscaling. Meteor. Appl., 27, e1961, https://doi.org/10.1002/met.1961.
Horat, N., and S. Lerch, 2023: Deep learning for post-processing global probabilistic forecasts on sub-seasonal time scales. arXiv, 2306.15956v1, https://doi.org/10.48550/arXiv.2306.15956.
Jordan, A., F. Krüger, and S. Lerch, 2019: Evaluating probabilistic forecasts with scoringRules. J. Stat. Software, 90 (12), 1–37, https://doi.org/10.18637/jss.v090.i12.
Khan, S., M. Naseer, M. Hayat, S. W. Zamir, F. S. Khan, and M. Shah, 2022: Transformers in vision: A survey. ACM Comput. Surv., 54 (10s), 1–41, https://doi.org/10.1145/3505244.
Koenker, R., and G. Bassett Jr., 1978: Regression quantiles. Econometrica, 46, 33–50, https://doi.org/10.2307/1913643.
Labe, Z. M., and E. A. Barnes, 2021: Detecting climate signals using explainable AI with single-forcing large ensembles. J. Adv. Model. Earth Syst., 13, e2021MS002464, https://doi.org/10.1029/2021MS002464.
Lee, J., Y. Lee, J. Kim, A. Kosiorek, S. Choi, and Y. W. Teh, 2019: Set transformer: A framework for attention-based permutation-invariant neural networks. Proc. 36th Int. Conf. on Machine Learning, Long Beach, CA, PMLR, 3744–3753.
Lerch, S., and T. L. Thorarinsdottir, 2013: Comparison of non-homogeneous regression models for probabilistic wind speed forecasting. Tellus, 65A, 21206, https://doi.org/10.3402/tellusa.v65i0.21206.
Linardatos, P., V. Papastefanopoulos, and S. Kotsiantis, 2021: Explainable AI: A review of machine learning interpretability methods. Entropy, 23, 18, https://doi.org/10.3390/e23010018.
Lyle, C., M. van der Wilk, M. Kwiatkowska, Y. Gal, and B. Bloem-Reddy, 2020: On the benefits of invariance in neural networks. arXiv, 2005.00178v1, https://doi.org/10.48550/arXiv.2005.00178.
Matheson, J. E., and R. L. Winkler, 1976: Scoring rules for continuous probability distributions. Manage. Sci., 22, 1087–1096, https://doi.org/10.1287/mnsc.22.10.1087.
Messner, J. W., G. J. Mayr, and A. Zeileis, 2017: Nonhomogeneous boosting for predictor selection in ensemble postprocessing. Mon. Wea. Rev., 145, 137–147, https://doi.org/10.1175/MWR-D-16-0088.1.
Mlakar, P., J. Merše, and J. F. Pucer, 2023: Ensemble weather forecast post-processing with a flexible probabilistic neural network approach. arXiv, 2303.17610v3, https://doi.org/10.48550/arXiv.2303.17610.
Molnar, C., G. König, B. Bischl, and G. Casalicchio, 2024: Model-agnostic feature importance and effects with dependent features: A conditional subgroup approach. Data Min. Knowl. Discovery, https://doi.org/10.1007/s10618-022-00901-9, in press.
Pantillon, F., S. Lerch, P. Knippertz, and U. Corsmeier, 2018: Forecasting wind gusts in winter storms using a calibrated convection-permitting ensemble. Quart. J. Roy. Meteor. Soc., 144, 1864–1881, https://doi.org/10.1002/qj.3380.
Rasp, S., and S. Lerch, 2018: Neural networks for postprocessing ensemble weather forecasts. Mon. Wea. Rev., 146, 3885–3900, https://doi.org/10.1175/MWR-D-18-0187.1.
Ravanbakhsh, S., J. Schneider, and B. Poczos, 2016: Deep learning with sets and point clouds. arXiv, 1611.04500v3, https://doi.org/10.48550/arXiv.1611.04500.
Reichstein, M., G. Camps-Valls, B. Stevens, M. Jung, J. Denzler, N. Carvalhais, and Prabhat, 2019: Deep learning and process understanding for data-driven earth system science. Nature, 566, 195–204, https://doi.org/10.1038/s41586-019-0912-1.
Sannai, A., Y. Takai, and M. Cordonnier, 2019: Universal approximations of permutation invariant/equivariant functions by deep neural networks. arXiv, 1903.01939v3, https://doi.org/10.48550/arXiv.1903.01939.
Scheuerer, M., 2014: Probabilistic quantitative precipitation forecasting using ensemble model output statistics. Quart. J. Roy. Meteor. Soc., 140, 1086–1096, https://doi.org/10.1002/qj.2183.
Scheuerer, M., M. B. Switanek, R. P. Worsnop, and T. M. Hamill, 2020: Using artificial neural networks for generating probabilistic subseasonal precipitation forecasts over California. Mon. Wea. Rev., 148, 3489–3506, https://doi.org/10.1175/MWR-D-20-0096.1.
Schulz, B., and S. Lerch, 2022a: Aggregating distribution forecasts from deep ensembles. arXiv, 2204.02291v1, https://doi.org/10.48550/arXiv.2204.02291.
Schulz, B., and S. Lerch, 2022b: Machine learning methods for postprocessing ensemble forecasts of wind gusts: A systematic comparison. Mon. Wea. Rev., 150, 235–257, https://doi.org/10.1175/MWR-D-21-0150.1.
Schulz, B., M. E. Ayari, S. Lerch, and S. Baran, 2021: Post-processing numerical weather prediction ensembles for probabilistic solar irradiance forecasting. Sol. Energy, 220, 1016–1031, https://doi.org/10.1016/j.solener.2021.03.023.
Soelch, M., A. Akhundov, P. van der Smagt, and J. Bayer, 2019: On deep set learning and the choice of aggregations. Artificial Neural Networks and Machine Learning—ICANN 2019: Theoretical Neural Computation, I. Tetko et al., Eds., Lecture Notes in Computer Science, Vol. 11727, Springer, 444–457, https://doi.org/10.1007/978-3-030-30487-4_35.
Strobl, C., A.-L. Boulesteix, T. Kneib, T. Augustin, and A. Zeileis, 2008: Conditional variable importance for random forests. BMC Bioinf., 9, 307, https://doi.org/10.1186/1471-2105-9-307.
Taillardat, M., O. Mestre, M. Zamo, and P. Naveau, 2016: Calibrated ensemble forecasts using quantile regression forests and ensemble model output statistics. Mon. Wea. Rev., 144, 2375–2393, https://doi.org/10.1175/MWR-D-15-0260.1.
Vannitsem, S., D. S. Wilks, and J. W. Messner, 2018: Statistical Postprocessing of Ensemble Forecasts. Elsevier, 347 pp., https://doi.org/10.1016/C2016-0-03244-8.
Vannitsem, S., and Coauthors, 2021: Statistical postprocessing for weather forecasts: Review, challenges, and avenues in a big data world. Bull. Amer. Meteor. Soc., 102, E681–E699, https://doi.org/10.1175/BAMS-D-19-0308.1.
van Schaeybroeck, B., and S. Vannitsem, 2015: Ensemble post-processing using member-by-member approaches: Theoretical aspects. Quart. J. Roy. Meteor. Soc., 141, 807–818, https://doi.org/10.1002/qj.2397.
Vaswani, A., N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, 2017: Attention is all you need. Proc. 31st Int. Conf. on Neural Information Processing Systems, Long Beach, CA, Association for Computing Machinery, 6000–6010, https://dl.acm.org/doi/10.5555/3295222.3295349.
Veldkamp, S., K. Whan, S. Dirksen, and M. Schmeits, 2021: Statistical postprocessing of wind speed forecasts using convolutional neural networks. Mon. Wea. Rev., 149, 1141–1152, https://doi.org/10.1175/MWR-D-20-0219.1.
Vinyals, O., S. Bengio, and M. Kudlur, 2015: Order matters: Sequence to sequence for sets. arXiv, 1511.06391v4, https://doi.org/10.48550/arXiv.1511.06391.
Vogel, P., P. Knippertz, A. H. Fink, A. Schlueter, and T. Gneiting, 2018: Skill of global raw and postprocessed ensemble predictions of rainfall over northern tropical Africa. Wea. Forecasting, 33, 369–388, https://doi.org/10.1175/WAF-D-17-0127.1.
Zaheer, M., S. Kottur, S. Ravanbakhsh, B. Póczos, R. Salakhutdinov, and A. J. Smola, 2017: Deep sets. Proc. 31st Int. Conf. on Neural Information Processing Systems, Long Beach, CA, Association for Computing Machinery, 3394–3404, https://dl.acm.org/doi/10.5555/3294996.3295098.
Zhang, Y., J. Hare, and A. Prügel-Bennett, 2019: FSPool: Learning set representations with featurewise sort pooling. arXiv, 1906.02795v4, https://doi.org/10.48550/arXiv.1906.02795.