## 1. Introduction

Wind gusts are among the most significant natural hazards in central Europe. Accurate and reliable forecasts are therefore critically important to issue effective warnings and protect human life and property. However, wind gusts are a challenging meteorological target variable, as they are driven by small-scale processes and occur locally, so that their predictability is limited even for numerical weather prediction (NWP) models run at convection-permitting resolutions.

To quantify forecast uncertainty, most operationally used NWP models generate probabilistic predictions in the form of ensembles of deterministic forecasts that differ in initial conditions, boundary conditions, or model specifications. Despite substantial improvements over the past decades (Bauer et al. 2015), ensemble forecasts continue to exhibit systematic errors that require statistical postprocessing to achieve accurate and reliable probabilistic forecasts. Statistical postprocessing has therefore become an integral part of weather forecasting and standard practice in research and operations. We refer to Vannitsem et al. (2018) for a general introduction to statistical postprocessing and to Vannitsem et al. (2021) for an overview of recent developments.

The focus of our work is on statistical postprocessing of ensemble forecasts of wind gusts. Despite their importance for severe weather warnings, much recent work on ensemble postprocessing has instead focused on temperature and precipitation. Therefore, our overarching aim is to provide a comprehensive review and systematic comparison of statistical and machine learning methods for ensemble postprocessing specifically tailored to wind gusts.

Most commonly used postprocessing methods are distributional regression approaches where probabilistic forecasts are given by parametric probability distributions with parameters depending on summary statistics of the ensemble predictions of the target variable through suitable link functions. The two most prominent techniques are ensemble model output statistics (EMOS; Gneiting et al. 2005), where the forecast distribution is given by a single parametric distribution, and Bayesian model averaging (BMA; Raftery et al. 2005), where the probabilistic forecast takes the form of a weighted mixture distribution. Both methods tend to perform equally well, and the conceptually simpler EMOS approach has become widely used in practice (Vannitsem et al. 2021). In the past decade, a wide range of statistical postprocessing methods has been developed, including physically motivated techniques like member-by-member postprocessing (MBM; Van Schaeybroeck and Vannitsem 2015) or statistically principled approaches such as isotonic distributional regression (IDR; Henzi et al. 2021).

The use of modern machine learning methods for postprocessing has been the focus of much recent research interest (McGovern et al. 2017; Haupt et al. 2021; Vannitsem et al. 2021). By enabling the incorporation of additional predictor variables beyond ensemble forecasts of the target variable, these methods have the potential to overcome inherent limitations of traditional approaches. Examples include quantile regression forests (QRF; Taillardat et al. 2016) and the gradient-boosting extension of EMOS (EMOS-GB; Messner et al. 2017). Rasp and Lerch (2018) demonstrate the benefits of a neural network (NN)–based distributional regression approach, which we will refer to as distributional regression network (DRN). Their approach is an extension of the EMOS framework that enables nonlinear relationships between arbitrary predictor variables and forecast distribution parameters to be learned in a data-driven way. DRN has been extended toward more flexible, distribution-free approaches based on approximations of the quantile function (Bernstein quantile network, BQN; Bremnes 2020) or of quantile-based probabilities that are transformed to a full predictive distribution (Scheuerer et al. 2020).

Many of the postprocessing methods described above have been applied for wind speed prediction, but previous work on wind gusts is scarce. Our work is based on the study of Pantillon et al. (2018), one of the few exceptions, who use a simple EMOS model for postprocessing to investigate the predictability of wind gusts with a focus on European winter storms. They find that although postprocessing improves the overall predictive performance, it fails in cases that can be attributed to specific mesoscale structures and corresponding wind gust generation mechanisms. As a first step toward the development of more sophisticated methods, we adapt existing as well as novel techniques for statistical postprocessing of wind gusts and conduct a systematic comparison of their predictive performance. Our case study utilizes forecasts from the operational ensemble prediction system (EPS) of the Deutscher Wetterdienst (DWD; German weather service) run at convection-permitting resolution for the 2010–16 period and observations from 175 surface weather stations in Germany. In contrast to current research practice, in which postprocessing methods are often only compared with a small set of benchmark techniques, our work is, to the best of our knowledge, the first to systematically compare such a wide variety of postprocessing approaches.

The remainder of the paper is structured as follows. Section 2 introduces the data and notation, and section 3 introduces the various statistical postprocessing methods, whose predictive performance is evaluated in section 4. A meteorological interpretation of what the models have learned is presented in section 5. Section 6 concludes with a discussion.

R code (R Core Team 2021) with implementations of all methods is available online (https://github.com/benediktschulz/paper_pp_wind_gusts).

## 2. Data and notation

### a. Forecast and observation data

Our study is based on the same dataset as Pantillon et al. (2018) and we refer to their section 2.1 for a detailed description. The forecasts are generated by the EPS of the Consortium for Small-Scale Modeling operational forecast for Germany (COSMO-DE; Baldauf et al. 2011). The 20-member ensemble is based on initial and boundary conditions from four different global models paired with five sets of physical perturbations and run at a horizontal grid spacing of 2.8 km (Pantillon et al. 2018). In the following, we will refer to the four groups corresponding to the global models, which consist of five members each, as subensembles. We consider forecasts that are initialized at 0000 UTC with a range from 0 to 21 h. The data ranges from 9 December 2010, when the EPS started in preoperational mode, to the end of 2016, that is, a period of around 6 years.

In addition to wind gusts, ensemble forecasts of several other meteorological variables generated by the COSMO-DE-EPS are available. Table 1 gives an overview of the 61 meteorological variables as well as additional temporal and spatial predictors derived from station information. The forecasts are evaluated at 175 SYNOP stations in Germany operated by the DWD, for which hourly observations are available. For the comparison with station data, forecasts from the nearest grid point are taken.

Table 1. Overview of available predictors. For the meteorological variables, ensemble forecasts are available, with the term “500–1000 hPa” referring to the specific model levels at 500, 700, 850, 950, and 1000 hPa. Spatial predictors contain station-specific information.

### b. Training, testing, and validation

The postprocessing methods that will be presented in section 3 are trained on a set of past forecast–observation pairs to correct the systematic errors of the ensemble predictions. Even though many studies are based on rolling training windows consisting of the most recent days only, we will use a static training period. This is common practice in the operational use of postprocessing models (Hess 2020) and can be motivated by studies suggesting that using long archives of training data often leads to superior performance, irrespective of potential changes in the underlying NWP model or the meteorological conditions (Lang et al. 2020).

Therefore, we will use the period of 2010–15 as training set and 2016 as independent test set. The implementation of most of the methods requires the choice of a model architecture and the tuning of specific hyperparameters. To avoid overfitting in the model selection process, we further split the training set into the period of 2010–14, which is used for training, and use the year 2015 for validation purposes. After finalizing the choice of the most suitable model variant based on the validation period, the entire training period from 2010 to 2015 is used to fit that model for the final evaluation on the test set.

### c. Notation

Next, we will briefly introduce the notation used in the following. The weather variable of interest, that is, the speed of wind gusts, will be denoted by *y*, and ensemble forecasts of wind gusts are denoted by **x**. Note that *y* > 0 and that **x** = (*x*_{1}, …, *x*_{m}) is an *m*-dimensional vector, where *m* is the ensemble size and *x*_{i} is the *i*th ensemble member. The ensemble mean of **x** is denoted by x̄(**x**) and the ensemble standard deviation by *s*(**x**).

We will use the terms predictor and feature interchangeably to denote a predictor variable that is used as an input to a postprocessing model. For most meteorological variables, we will typically use the ensemble mean as predictor. If a method uses several predictors, we will refer to the vector including all predictors by **X** ∈ ℝ^{p}, where *p* is the number of predictors and *X*_{i} is the *i*th predictor. For a set of past observations, ensemble forecasts of wind gusts, and predictors, we will denote the variables by *y*_{j}, **x**_{j}, and **X**_{j}, where *j* = 1, …, *n* and *n* is the training set size.

## 3. Postprocessing methods

This section introduces the postprocessing methods that are systematically compared for ensemble forecasts of wind gusts. The postprocessing methods will be presented in three groups, starting with established, comparatively simple techniques rooted in statistics. We will proceed to machine learning approaches ranging from random forest and gradient boosting techniques up to methods based on NNs. A full description of all implementation details is deferred to appendix B. Note that for each of the postprocessing methods, we fit a separate model for each lead time based on training data only consisting of cases corresponding to that lead time.

### a. Basic techniques

As a first group of methods, we will review basic statistical postprocessing techniques, where the term basic refers to the fact that these methods solely use the ensemble forecasts of wind gusts as predictors. They are straightforward to implement and serve as benchmark methods for the more advanced approaches.

#### 1) Ensemble model output statistics (EMOS)

EMOS (Gneiting et al. 2005) assumes that, given the ensemble forecast **x**, the weather variable of interest *Y* follows a parametric distribution *F*_{θ} with parameter vector *θ* ∈ Θ, where Θ denotes the parameter space of *θ*. The distribution parameter *θ* is connected to the ensemble forecast via a link function *g* such that

*θ* = *g*(**x**).   (1)

Since wind gusts are a nonnegative quantity, we employ a logistic distribution left-truncated in zero as forecast distribution. If *μ* is the location and *σ* > 0 is the scale parameter, the logistic distribution left-truncated in zero is given by the CDF

*F*(*y*; *μ*, *σ*) = [Λ((*y* − *μ*)/*σ*) − Λ(−*μ*/*σ*)] / [1 − Λ(−*μ*/*σ*)],  *y* > 0,   (2)

where Λ(*z*) = [1 + exp(−*z*)]^{−1} denotes the CDF of the standard logistic distribution. The mode of the truncated distribution is max(*μ*, 0), and the truncation shifts probability mass toward values above the location parameter *μ*, which may still be negative. The parameters *μ* and *σ* are linked to the ensemble forecast **x** via

*μ* = *a* + *b* x̄(**x**),  *σ* = exp[*c* + *d* *s*(**x**)],   (3)

where x̄(**x**) and *s*(**x**) denote the ensemble mean and standard deviation, respectively. The coefficients *a*, *c*, *d* ∈ ℝ and *b* > 0 are estimated via optimum score estimation, that is, by minimizing a strictly proper scoring rule (Gneiting and Raftery 2007). For details on forecast evaluation, see appendix A. We here estimate the parameters by minimizing the continuous ranked probability score (CRPS), for which we observed similar results to maximum likelihood estimation (MLE). Analytical expressions of the CRPS and the corresponding gradient function of a truncated logistic distribution are available in the scoringRules package (Jordan et al. 2019). Note that we do not specifically account for the existence of subensembles in Eq. (3), since initial experiments suggested a degradation of predictive performance.
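The truncated logistic forecast distribution and the parameter link of Eqs. (2)–(3) can be sketched in a few lines. The paper's implementations are in R (via the scoringRules package); the following Python sketch assumes the affine location link and a log-linear scale link shown above, and the helper names are illustrative:

```python
import math

def logistic_cdf(z):
    """CDF of the standard logistic distribution, Lambda(z)."""
    return 1.0 / (1.0 + math.exp(-z))

def tlogis_cdf(y, mu, sigma):
    """CDF of a logistic distribution left-truncated in zero, cf. Eq. (2)."""
    if y <= 0:
        return 0.0
    f0 = logistic_cdf(-mu / sigma)      # probability mass below zero before truncation
    fy = logistic_cdf((y - mu) / sigma)
    return (fy - f0) / (1.0 - f0)

def emos_link(x_bar, s, a, b, c, d):
    """Link location and scale to ensemble mean and spread, cf. Eq. (3)."""
    mu = a + b * x_bar
    sigma = math.exp(c + d * s)         # exponential link keeps sigma > 0
    return mu, sigma
```

Minimum-CRPS estimation would then optimize (*a*, *b*, *c*, *d*) over the training set; scoringRules provides the required closed-form CRPS and gradient.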

The EMOS coefficients *a*, *b*, *c*, and *d* are estimated locally; that is, we estimate a separate model for each station in order to account for station-specific error characteristics of the ensemble predictions. In addition, we employ a seasonal training scheme where a training set consists of all forecast cases of the previous, current and next month with respect to the date of interest. This results in 12 different training sets for each station, one for each month, that enable an adaption to seasonal changes. In accordance with the results in Lang et al. (2020), this seasonal approach outperforms both a rolling training window as well as training on the entire set.
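The seasonal training scheme amounts to a simple month-window selection with a wrap-around at the turn of the year. A minimal sketch, with hypothetical helper names:

```python
def seasonal_months(month):
    """Previous, current, and next month, wrapping around the turn of the year."""
    prev = 12 if month == 1 else month - 1
    nxt = 1 if month == 12 else month + 1
    return {prev, month, nxt}

def seasonal_subset(cases, month):
    """Select all (month, case) training pairs inside the seasonal window."""
    window = seasonal_months(month)
    return [case for m, case in cases if m in window]
```

For each station, twelve such subsets (one per month) define the training sets used to fit the corresponding EMOS models.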

#### 2) Member-by-member postprocessing (MBM)

MBM postprocessing calibrates the ensemble by transforming each member individually: every member is shifted, and its distance to the ensemble mean is stretched,

*x̃*_{i} = *a* + *b* x̄(**x**) + *γ* [*x*_{i} − x̄(**x**)],  *i* = 1, …, *m*,   (4)

with the stretch coefficient *γ* scaling the distance to the ensemble mean. Our implementation of MBM postprocessing follows Van Schaeybroeck and Vannitsem (2015), who let the stretch coefficient *γ* depend on the ensemble mean difference

*δ*(**x**) = (1/*m*²) Σ_{i=1}^{m} Σ_{j=1}^{m} |*x*_{i} − *x*_{j}|

via *γ* = *c* + *d*/*δ*(**x**). The coefficients *a*, *b*, *c*, and *d* are estimated by minimizing the CRPS of the adjusted ensemble; the loss function thus corresponds to the mean CRPS of the empirical distribution of the transformed members over the training set. Using an index function *k*(*i*) = ⌈*i*/5⌉ ∈ {1, …, 4} for *i* = 1, …, 20, which identifies the subensemble to which the *i*th member belongs, we can incorporate the submodel structure by modifying Eq. (4) to

*x̃*_{i} = *a* + *b*_{k(i)} x̄_{k(i)}(**x**) + [*c* + *d*_{k(i)}/*δ*_{k(i)}(**x**)] [*x*_{i} − x̄_{k(i)}(**x**)],   (5)

where x̄_{k}(**x**) denotes the mean of the *k*th subensemble in **x**, *δ*_{k}(**x**) denotes the mean difference of the *k*th subensemble in **x**, *k* = 1, …, 4, and *a*, *b*_{1}, …, *b*_{4}, *c*, *d*_{1}, …, *d*_{4} ∈ ℝ. We also tested a variant with an individual coefficient *c* for each submodel, which performs equally well but is more complex.

The training is performed analogously to EMOS, utilizing a local and seasonal training scheme. In particular, accounting for potential seasonal changes via seasonal training substantially improved performance relative to using the entire available training set. A main advantage of MBM in comparison with all other approaches is that the rank correlation structure of the ensemble forecasts is preserved by postprocessing, since each member is transformed individually by the same linear transformation. MBM thus results in forecasts that are physically consistent over time and space and also different weather variables, even if MBM is applied for each component separately (Van Schaeybroeck and Vannitsem 2015; Schefzik 2017; Wilks 2018).
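The member-wise shift-and-stretch transformation, and the fact that it preserves the rank order of the members whenever the stretch coefficient is positive, can be illustrated as follows. This is a sketch assuming the form *γ* = *c* + *d*/*δ*(**x**) discussed above; the function name is ours:

```python
def mbm_transform(members, a, b, c, d):
    """Adjust each ensemble member individually, cf. Eq. (4):
    shift toward a + b * mean, stretch the spread by gamma = c + d / delta."""
    m = len(members)
    mean = sum(members) / m
    # ensemble mean difference delta: mean absolute pairwise difference
    delta = sum(abs(xi - xj) for xi in members for xj in members) / (m * m)
    gamma = c + d / delta
    return [a + b * mean + gamma * (x - mean) for x in members]
```

Because all members undergo the same linear map, a positive *γ* leaves the ordering (and hence the rank correlation structure) of the ensemble intact.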

#### 3) Isotonic distributional regression (IDR)

Henzi et al. (2021) propose IDR, a novel nonparametric regression technique that results in simple and flexible probabilistic forecasts because it depends on neither distributional assumptions nor prespecified transformations. Since it requires no parameter tuning and minimal implementation choices, it is an ideal generic benchmark in probabilistic forecasting tasks. IDR is built on the assumption of an isotonic relationship between the predictors and the target variable. In the univariate case with only one predictor, isotonicity is based on the linear ordering on the real line. When multiple predictors (such as multiple ensemble members) are given, the multivariate covariate space is equipped with a partial order. Under those order restrictions, a conditional distribution that is optimal with respect to a broad class of relevant loss functions including proper scoring rules is then estimated. Conceptually, IDR can be seen as a far-reaching generalization of widely used isotonic regression techniques that are based on the pool-adjacent-violators algorithm (de Leeuw et al. 2009). To the best of our knowledge, our work is the first application of IDR in a postprocessing context besides the case study on precipitation accumulation in Henzi et al. (2021) who find that IDR forecasts were competitive to EMOS and BMA.

The only implementation choice required for IDR is the selection of a partial order on the covariate space. Among the choices introduced in Henzi et al. (2021), the empirical stochastic order (SD) and the empirical increasing convex order are appropriate for the situation at hand when all ensemble members are used as predictors for the IDR model. We selected SD, which resulted in slightly better results on the validation data. We further considered an alternative model formulation where only the ensemble mean was used as predictor, which reduces to the special case of a less complex distributional (single) index model (Henzi et al. 2020), but did not improve predictive performance.

We implement IDR as a local model, treating each station separately since it is not obvious how to incorporate station-specific information into the model formulation. Given the limited amount of training data available, we further consider only the wind gust ensemble as predictor variable. Following suggestions of Henzi et al. (2021), we use subsample aggregation (subbagging) and apply IDR on 100 random subsamples half the size of the available training set. IDR is implemented using the isodistrreg package (Henzi et al. 2019).
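IDR itself relies on the isodistrreg machinery, but the flavor of order-restricted estimation can be conveyed by the classical pool-adjacent-violators algorithm (de Leeuw et al. 2009) for least-squares isotonic regression. This is a simplified stand-in to illustrate the idea of a monotone fit, not the IDR estimator:

```python
def pav(y):
    """Pool-adjacent-violators: least-squares nondecreasing fit to y.
    Each block stores (sum, count); adjacent blocks are merged while
    their means violate monotonicity."""
    blocks = []
    for v in y:
        blocks.append([v, 1])
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, n = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
    out = []
    for s, n in blocks:
        out.extend([s / n] * n)
    return out
```

IDR generalizes this principle from point estimates to full conditional distributions under a partial order on the covariate space.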

### b. Incorporating additional information via machine learning methods

The second group of postprocessing methods consists of benchmark machine learning methods that are able to incorporate additional predictor variables besides ensemble forecasts of wind gusts. As noted in section 2, ensemble forecasts of 60 additional variables are available. Including this additional information in the basic approaches introduced in section 3a is in principle possible, but far from straightforward, since one has to carefully select appropriate features and take measures to prevent overfitting. By contrast, the machine learning approaches presented in the following offer a practical way to include additional predictors in an automated, data-driven manner.

#### 1) Gradient-boosting extension of EMOS (EMOS-GB)

The gradient-boosting extension of EMOS proposed by Messner et al. (2017) allows for incorporating a large number of predictor variables by using the full predictor vector **X** instead of **x** in Eq. (1). Both distribution parameters are linked to the predictors via linear functions, and gradient boosting iteratively updates only the coefficient of the predictor that improves the current fit most, thereby performing an automatic selection of the most relevant predictors.

To ensure comparability with the basic EMOS approach, we employ a truncated logistic distribution for the probabilistic forecasts. The parameters are determined using MLE, which resulted in superior predictive performance in initial tests on the validation data relative to minimum CRPS estimation. We use the ensemble mean and standard deviation of all meteorological variables in Table 1 as inputs to the EMOS-GB model. Note that in contrast to the other advanced postprocessing methods introduced below, we here include the standard deviation of all variables as potential predictors since we found this to improve the predictive performance. Further, we include the cosine-transformed day of the year in order to adapt to seasonal changes, since seasonal training approaches as applied for EMOS and MBM led to numerically unstable estimation procedures and degraded forecast performance. Although spatial predictors can in principle be included in a similar fashion, we estimate EMOS-GB models locally since we found this approach to outperform a joint model for all stations by a large margin. Our implementation of EMOS-GB is based on the crch package (Messner et al. 2016).
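The core boosting mechanic, updating only the coefficient whose gradient is currently largest, can be sketched for a simpler squared-loss linear model. EMOS-GB itself boosts the truncated-logistic likelihood via the crch package; the names and the learning rate `nu` below are illustrative:

```python
def boost(X, y, steps=200, nu=0.1):
    """Coordinate-wise gradient boosting for a linear fit under squared loss.
    At each step only the most loss-reducing coefficient is updated,
    mimicking the automatic predictor selection of boosting."""
    p = len(X[0])
    n = len(y)
    beta = [0.0] * p
    for _ in range(steps):
        resid = [y[i] - sum(beta[j] * X[i][j] for j in range(p)) for i in range(n)]
        # mean residual correlation with each predictor (negative gradient direction)
        grads = [sum(resid[i] * X[i][j] for i in range(n)) / n for j in range(p)]
        jstar = max(range(p), key=lambda j: abs(grads[j]))
        beta[jstar] += nu * grads[jstar]
    return beta
```

Predictors that never carry the largest gradient keep a zero coefficient, which is how boosting discards irrelevant inputs.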

#### 2) Quantile regression forest (QRF)

QRF is a nonparametric, data-driven technique that relies on neither distributional assumptions, link functions, nor parameter estimation; it was first used in the context of postprocessing by Taillardat et al. (2016). Random forests are randomized ensembles of decision trees, which operate by splitting the predictor space to create an analog forecast (Breiman 1984). This is done iteratively by first finding a splitting criterion based on the predictor that best explains the variability within the training set, and then splitting the predictor space according to this criterion. This procedure is repeated on the resulting subsets until a stopping criterion is reached, thereby creating a partition of the predictor space. Following the decisions at the so-called nodes, one then obtains an analog forecast based on the training samples. Random forests create an ensemble of decision trees by considering only a randomly chosen subset of the training data for each tree and of the predictors at each node, aiming to reduce correlation between individual decision trees (Breiman 2001). QRF extends the random forest framework by performing a quantile regression that generates a probabilistic forecast (Meinshausen 2006). The QRF forecast thus approximates the forecast distribution by a set of quantile forecasts derived from the set of analog observations.
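The final forecasting step, turning the analog observations gathered from the trees' leaves into quantile forecasts, reduces to computing empirical quantiles. A sketch using a simple order-statistic definition (the function name is ours; ranger uses weighted variants):

```python
import math

def empirical_quantiles(analogs, probs):
    """Empirical quantiles of the analog observations collected from the
    leaves reached by a new forecast case: for level p, take the smallest
    order statistic whose empirical CDF value reaches p."""
    xs = sorted(analogs)
    n = len(xs)
    return [xs[min(n - 1, max(0, math.ceil(p * n) - 1))] for p in probs]
```

Evaluating this at, say, 99 equidistant levels yields the set of quantile forecasts that constitutes the QRF predictive distribution.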

In contrast to EMOS-GB, only the ensemble mean values of the additional meteorological variables are integrated as predictor variables, since we found that including the standard deviations as well led to more overdispersed forecasts and degraded forecast performance. A potential reason is given by the random selection of predictors at each node, which limits the automated selection of relevant predictors in that a decision based on a subset of irrelevant predictors only may lead to overfitting (Hastie et al. 2009, section 15.3.4).

Although spatial predictors can be incorporated into a global, joint QRF model for all stations generating calibrated forecasts, we found that the extant practice of implementing local QRF models separately at each station (Taillardat et al. 2016; Rasp and Lerch 2018) results in superior predictive performance and avoids the increased computational demand both in terms of required calculations and memory of a global QRF variant (Taillardat and Mestre 2020). Our implementation is based on the ranger package (Wright and Ziegler 2017).

### c. Neural network–based methods

Over the past decade, NNs have become ubiquitous in data-driven scientific disciplines and have in recent years been increasingly used in the postprocessing literature [see, e.g., Vannitsem et al. (2021) for a recent review]. NNs are universal function approximators for which a variety of highly complex extensions has been proposed. However, NN models often require large datasets and computational efforts, and are sometimes perceived to lack interpretability.

In the following, we will first present a third group of postprocessing methods based on NNs. Following the introduction of a general framework of our network-based postprocessing methods, we will introduce three model variants and discuss how to combine an ensemble of networks. In the interest of brevity, we will assume a basic familiarity with NNs and the underlying terminology. We refer to McGovern et al. (2019) for an accessible introduction in a meteorological context and to Goodfellow et al. (2016) for a detailed review.

#### 1) A framework for neural network–based postprocessing

The use of NNs in a distributional regression-based postprocessing context was first proposed in Rasp and Lerch (2018). Our framework for NN-based postprocessing builds on their approach and subsequent developments in Bremnes (2020), Scheuerer et al. (2020), and Veldkamp et al. (2021), among others. In particular, we propose three model variants that utilize a common basic NN architecture but differ in terms of the form that the probabilistic forecasts take, which governs both the output of the NN as well as the loss function used for parameter estimation. A graphical illustration of this framework is presented in Fig. 1.

The rise of artificial intelligence and NNs is closely connected to the increase in data availability and computing power, as these methods unfold their strengths when modeling complex nonlinear relations trained on large datasets. A main challenge in the case of postprocessing is to find a way to optimally utilize the entirety of available input data while preserving the inherent spatial and temporal information. We focus on building one network jointly for all stations at a given lead time, which we will refer to as locally adaptive joint network. For this purpose, Rasp and Lerch (2018) propose a station embedding, where a station identifier is mapped to a vector of latent features, which are then used as auxiliary input variables of the NN. The estimation of the embedding mapping is integrated into the overall training procedure and aims to model local characteristics implicitly, contrary to Lerch and Baran (2017) and Hamill et al. (2008), who apply a preliminary procedure to pool stations and grid points, respectively, that exhibit similar characteristics.

Our basic NN architecture consists of two hidden layers and a customized output. The training procedure is based on the adaptive moment estimation (Adam) algorithm (Kingma and Ba 2014), and the weights of the network are estimated on the training period of 2010–14 by optimizing a suitable loss function tailored to the desired output. We apply an early stopping strategy that stops the training process when the validation loss remains constant for a given number of epochs to prevent the model from overfitting. To account for the inherent uncertainty in the training of NN models via stochastic gradient descent methods and to improve predictive performance, we again follow Rasp and Lerch (2018) and create an ensemble of 10 networks for each variant, which we aggregate into a final forecast. The combination of the corresponding predictive distributions is discussed in section 3c(5) following the description of the three model variants.

To determine hyperparameters of the NN models such as the learning rate, the embedding dimension, or the number of nodes in a hidden layer, we perform a two-step, semiautomated hyperparameter tuning based on the validation set. In an automated procedure, we first find a small number of hyperparameter sets that perform best for an individual network; we then manually select the set that yields the best aggregated forecasts. Overall, the results are relatively robust to a wide range of tuning parameter choices, and we found that increasing the number of layers or the number of nodes in a layer did not improve predictive performance. Relative to the models used in Rasp and Lerch (2018), we increased the embedding dimension, and we used a softplus activation function. The exact configuration varies slightly across the three model variants introduced in the following. In addition to the station embedding, the spatial features in Table 1, and the temporal predictor, we found that including only the mean values of the meteorological predictors, but not the corresponding standard deviations, improved the predictive performance. These results are in line with those of QRF and of Rasp and Lerch (2018), who find that the standard deviations are only of minor importance for explaining and improving the NNs' predictions.

All NN models were implemented via the R interface to Keras (2.4.3; Allaire and Chollet 2020) built on TensorFlow (2.3.0; Allaire and Tang 2020).

#### 2) Distributional regression network (DRN)

Rasp and Lerch (2018) propose a postprocessing method based on NNs that extends the EMOS framework. A key component of the improvement is that instead of relying on prespecified link functions such as Eqs. (3) or (7) to connect input predictors to distribution parameters, an NN is used to automatically learn flexible, nonlinear relations in a data-driven way. The output of the NN is thus given by the forecast distribution parameters, and the weights of the network are learned by optimizing the CRPS, which performed on par with MLE.

We adapt DRN to wind gust forecasting by using a truncated logistic distribution. In contrast to the Gaussian predictive distribution used by Rasp and Lerch (2018), this leads to additional technical challenges due to the truncation, that is, the division by 1 − *F*(0; *μ*, *σ*), which induces numerical instabilities. To stabilize training, we enforce *μ* ≥ 0 by applying a softplus activation function in the output nodes, resulting in 1 − *F*(0; *μ*, *σ*) ≥ 0.5. Note that the mode of the truncated logistic distribution is given by max(*μ*, 0), and thus the reduction in flexibility can be seen as a restriction to positive modes. Given that negative location parameters tend to be only estimated for wind gusts of low intensity, which are in general of little interest, and that we noticed no effect on the predictive performance, the restrictions of the parameter space can be considered to be negligible.
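The softplus output activation that keeps both distribution parameters in their admissible ranges can be written down directly. A sketch (the numerically stable form avoids overflow for large inputs; the wrapper name is ours):

```python
import math

def softplus(z):
    """Numerically stable softplus: log(1 + exp(z)) > 0 for all z."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def drn_output(raw_mu, raw_sigma):
    """Map raw network outputs to valid parameters of the truncated logistic:
    softplus enforces mu >= 0 (so that 1 - F(0; mu, sigma) >= 0.5) and sigma > 0."""
    return softplus(raw_mu), softplus(raw_sigma)
```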

#### 3) Bernstein quantile network (BQN)

In the BQN approach proposed by Bremnes (2020), the quantile function of the forecast distribution is modeled by a Bernstein polynomial of degree *d* ∈ ℕ,

*Q*(*τ*) = Σ_{l=0}^{d} *α*_{l} *B*_{l,d}(*τ*),  *τ* ∈ [0, 1],

where *B*_{l,d} denotes the *l*th Bernstein basis polynomial of degree *d*. The key implementation choice is the degree *d*, with larger values leading to a more flexible forecast but also a larger estimation variance. Given *d*, the BQN forecast is fully defined by the *d* + 1 basis coefficients *α*_{0}, …, *α*_{d}, which are linked to the predictors via an NN. A critical requirement is that the coefficients are nondecreasing, that is, *α*_{0} ≤ ⋯ ≤ *α*_{d}, because this implies that the quantile function is also nondecreasing.^{1} Further, positive coefficients result in a positive support of the forecast distribution. In contrast to Bremnes (2020), we enforce these conditions by targeting the nonnegative increments Δ_{l} ≥ 0, *l* = 0, …, *d*, as network output, based on which the coefficients can be derived via the recursive formula *α*_{l} = *α*_{l−1} + Δ_{l} for *l* = 1, …, *d*, with *α*_{0} = Δ_{0}.

Because a closed-form expression of the CRPS or the logarithmic score (LS) of BQN forecasts is not readily available, and following Bremnes (2020), the network parameters are estimated based on the quantile loss (QL), which is a consistent scoring function for the corresponding quantile and whose integral over the unit interval is proportional to the CRPS (Gneiting and Ranjan 2011). We average the QL over 99 equidistant quantiles corresponding to steps of 1%. For the degree of the Bernstein polynomials, Bremnes (2020) considers a degree of 8. We found that increasing the degree to 12 resulted in better calibrated forecasts and improved predictive performance on the validation data. Again following Bremnes (2020), and in contrast to our implementation of DRN and the third variant introduced below, we use all 20 ensemble member forecasts of wind gusts, sorted with respect to the predicted speed, as input instead of the ensemble mean and standard deviation.
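Evaluating the Bernstein quantile function from nonnegative increments, and checking that it is nondecreasing in *τ*, can be sketched as follows (the function name is ours):

```python
from math import comb
from itertools import accumulate

def bernstein_quantile(increments, tau):
    """Quantile function Q(tau) = sum_l alpha_l * B_{l,d}(tau), with
    nondecreasing coefficients alpha obtained as cumulative sums of the
    nonnegative increments produced by the network."""
    alpha = list(accumulate(increments))      # alpha_l = alpha_{l-1} + Delta_l
    d = len(alpha) - 1
    return sum(a * comb(d, l) * tau**l * (1 - tau)**(d - l)
               for l, a in enumerate(alpha))
```

A convenient property of the Bernstein basis is that *Q*(0) = *α*_{0} and *Q*(1) = *α*_{d}, so the coefficient range directly bounds the forecast support.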

#### 4) Histogram estimation network (HEN)

The third network-based postprocessing method may be considered a universally applicable approach to probabilistic forecasting and is based on the idea of transforming the probabilistic forecasting problem into a classification task, one of the main applications of NNs. This is done by partitioning the observation range into distinct classes and assigning a probability to each of them. In mathematical terms, this is equivalent to assuming that the probabilistic forecast is given by a piecewise uniform distribution. The PDF of such a distribution is a piecewise constant function resembling a histogram; we thus refer to this approach as histogram estimation network (HEN). Variants of this approach have been used in a variety of disciplines and applications (e.g., Felder et al. 2018; Gasthaus et al. 2019; Li et al. 2021). For recent examples in the context of postprocessing, see Scheuerer et al. (2020) and Veldkamp et al. (2021).

Let *N* be the number of bins and *b*_{0} < ⋯ < *b*_{N} be the edges of the bins *I*_{l} = [*b*_{l−1}, *b*_{l}) with probabilities *p*_{l}, *l* = 1, …, *N*, where it holds that Σ_{l=1}^{N} *p*_{l} = 1. We assume that the observation *y* falls within the bin range, that is, *b*_{0} ≤ *y* < *b*_{N}, and denote the bin in which the observation falls by *I*_{l*(y)}. Note that the HEN forecast is fully defined by the binning scheme *b*_{0}, …, *b*_{N} and the corresponding bin probabilities.

Given a fixed number of bins specified as a hyperparameter, the bin edges and corresponding bin probabilities need to be determined. There exist several options for the output of the NN architecture to achieve this. The most flexible approach, for example implemented in Gasthaus et al. (2019), would be to obtain both the bin edges and the probabilities as output of the NN. We here instead follow a more parsimonious alternative and fix the bin edges, so that only the bin probabilities are determined by the NN, which can be interpreted as a probabilistic classification task.^{2} Minimizing the LS in Eq. (10) then reduces to minimizing the negative logarithm of the probability assigned to the bin in which the observation falls, that is, the categorical cross-entropy.

For our application to wind gusts, we found that the binning scheme is an essential factor; for example, a fine equidistant binning of length 0.5 m s^{−1} leads to physically inconsistent forecasts. Based on initial experiments on the validation data, we devise a data-driven binning scheme starting from one bin per unique observed value and merging bins to end up with a total number of 20. Details on this procedure are provided in appendix B. We apply a softmax function to the NN output nodes to ensure that the obtained probabilities sum to 1. The network parameters are estimated using the categorical cross-entropy.
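A piecewise-uniform (histogram) forecast built from softmax bin probabilities, together with its CDF, can be sketched as follows. The names are illustrative, and the fixed equidistant edges below stand in for the data-driven binning scheme described above:

```python
import math

def softmax(z):
    """Map raw network outputs to bin probabilities summing to 1."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def hen_cdf(y, edges, probs):
    """CDF of the piecewise-uniform forecast with bin edges
    b_0 < ... < b_N and bin probabilities p_1, ..., p_N."""
    if y <= edges[0]:
        return 0.0
    total = 0.0
    for lo, hi, p in zip(edges[:-1], edges[1:], probs):
        if y >= hi:
            total += p              # bin lies entirely below y
        else:
            total += p * (y - lo) / (hi - lo)   # uniform within the bin
            break
    return total
```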

#### 5) Combining predictive distributions

NN-based forecasting models are often run several times from randomly initialized weights and batches to produce an ensemble of predictions that accounts for the randomness of the training process based on stochastic gradient descent methods. We follow this principle and produce an ensemble of 10 models for each of the three variants of NN-based postprocessing introduced above, which raises the question of how the resulting ensembles of NN outputs should be aggregated into a single probabilistic forecast for each model. We here briefly discuss this aggregation step.

For DRN, the output of the ensemble of NN models is a collection of pairs of location and scale parameters of the forecast distributions. Following Rasp and Lerch (2018), we aggregate these forecasts by averaging the distribution parameters. To aggregate the ensemble of BQN predictions given by sets of coefficient values of the Bernstein polynomials, we follow Bremnes (2020) and average the individual coefficient values across the multiple runs. This is equivalent to an equally weighted averaging of quantile functions, also referred to as Vincentization (Genest 1992).

In the case of an ensemble of HEN models, the output is given by sets of bin probabilities. Instead of simply averaging the bin probabilities across the ensemble of model runs, which is known to yield underconfident forecasts that lack sharpness (Ranjan and Gneiting 2010; Gneiting and Ranjan 2013), we take a Vincentization approach. Since the quantile function is a piecewise linear function with edges depending on the bin probabilities, the predictions of the individual networks are subject to a different binning with respect to the quantile function. The average of those piecewise linear functions is again a piecewise linear function, where the edges are given by the union of all individual edges. This procedure leads to a smoothed final probabilistic forecast with a much finer binning than the individual model runs and eliminates the downside of fixed bin edges that may be too coarse and not capable of flexibly adapting to current weather conditions. An illustration of this effect is provided in Fig. 2.
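The aggregation of several HEN runs can be sketched as follows: each run yields a piecewise linear quantile function, and Vincentization averages these functions at each probability level. A minimal sketch with two hypothetical runs sharing the same fixed bin edges but different probabilities:

```python
import numpy as np

def vincentize(quantile_fns, levels):
    """Equally weighted average of quantile functions at the given levels."""
    qs = np.stack([qf(levels) for qf in quantile_fns])  # (n_models, n_levels)
    return qs.mean(axis=0)

def hen_quantile(edges, probs):
    """Quantile function of a histogram forecast: piecewise linear
    interpolation of the inverse CDF through (cumulative prob, bin edge)."""
    edges = np.asarray(edges, dtype=float)
    cum = np.concatenate([[0.0], np.cumsum(probs)])
    return lambda tau: np.interp(tau, cum, edges)

# Two hypothetical HEN runs with the same edges but different probabilities
q1 = hen_quantile([0, 5, 10, 15], [0.5, 0.3, 0.2])
q2 = hen_quantile([0, 5, 10, 15], [0.3, 0.4, 0.3])
levels = np.linspace(0.05, 0.95, 19)
combined = vincentize([q1, q2], levels)
print(np.all(np.diff(combined) >= 0))  # averaged quantile fn is nondecreasing
```

The averaged quantile function has breakpoints at the union of the individual breakpoints, which is the smoothing effect described above.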

(left) Predictive PDF of a single HEN forecast and (right) the corresponding aggregated forecast where the single forecast is combined with nine other HEN model runs via Vincentization.

Citation: Monthly Weather Review 150, 1; 10.1175/MWR-D-21-0150.1


### d. Summary and comparison of postprocessing methods

We close this section with a short discussion and overview of the general characteristics of the various postprocessing methods introduced above. Table 2 provides a summary of the postprocessing methods, which we divided into three groups. In general, the order in which the groups of methods were introduced coincides with the model complexity, as exemplified by the number of parameters to be estimated for the individual models. The most complex methods, the NN variants, also generally require the largest training sets. For the less complex models such as EMOS and MBM, less training data are required, as the improved performance of local and seasonal training sets demonstrates. Within each group, one distributional regression method yields the forecast in the form of a parametric distribution, here a truncated logistic distribution, which reduces forecast flexibility. The semi- and nonparametric alternatives are, however, more prone to overfitting. The model complexity is also directly related to the choice of predictors, which is displayed in Table 3, indicating that more complex models allow for the incorporation of additional features. Note that Tables B1 and B2 in the appendix present a summary of the final hyperparameter configurations of all methods.

Overview of the main characteristics of the different postprocessing methods. The number of parameters refers to one trained model instance. The number of models refers to the number of trained model instances per lead time. In case of the NN-based methods, the 10 trained model instances are aggregated to a final forecast [see section 3c(5)].

Overview of the predictors used in the different postprocessing methods. The column “statistics” comprises the use of any summary statistic derived from the wind gust ensemble (mean, standard deviation, or mean difference).

## 4. Evaluation

In this section, we evaluate the predictive performance of the postprocessing methods based on the test period that consists of all data from 2016. Since we considered forecasts from one initialization time only, systematic changes over the lead time are closely related to the diurnal cycle.

The evaluation is guided by the principle that a probabilistic forecast should aim to maximize sharpness, subject to calibration (Gneiting et al. 2007). A probabilistic forecast is said to be *calibrated* if it is statistically consistent with the observations; *sharpness* refers to the concentration of the forecast distribution and is a property of the forecast alone. Calibration and sharpness can simultaneously be assessed quantitatively with *proper scoring rules*, where lower scores indicate superior predictive performance (Gneiting and Raftery 2007). For an introduction to the evaluation methods and the underlying theory, see appendix A.
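As a concrete example of a proper scoring rule, the CRPS of an ensemble forecast can be estimated from the standard identity CRPS = E|X − y| − (1/2)E|X − X′|; the following is a minimal sketch, not the evaluation code used in the study:

```python
import numpy as np

def crps_ensemble(ens, y):
    """Sample CRPS of an ensemble forecast: E|X - y| - 0.5 * E|X - X'|."""
    ens = np.asarray(ens, dtype=float)
    term1 = np.abs(ens - y).mean()                       # E|X - y|
    term2 = np.abs(ens[:, None] - ens[None, :]).mean()   # E|X - X'|
    return term1 - 0.5 * term2

# A perfectly sharp forecast at the observation has CRPS 0
print(crps_ensemble([10.0, 10.0, 10.0], 10.0))  # 0.0
# A biased forecast is penalized (lower scores are better)
print(crps_ensemble([12.0, 13.0, 14.0], 10.0))
```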

### a. Predictive performance of the COSMO-DE-EPS and a climatological baseline

The predictive performance of the EPS is consistent with findings of previous studies on statistical postprocessing of ensemble forecasts in that the ensemble predictions are biased and strongly underdispersed, that is, not calibrated because of a lack of spread; see Fig. 3 for the corresponding verification rank histograms.

Verification rank histograms of (left) 0- and (center) 1–21-h forecasts of the COSMO-DE-EPS and (right) uPIT histogram of the EPC forecasts over all lead times for all stations. Coverage refers to a prediction interval with a nominal coverage corresponding to a 20-member ensemble (∼90.48%).


We here highlight two peculiarities of the EPS. The first is the so-called spinup effect (see, e.g., Kleczek et al. 2014), which refers to the time the numerical model requires to adapt to the initial and boundary conditions and to produce structures consistent with the model physics. This effect can be seen not only in the verification rank histograms in Fig. 3, where we observe a clear lack of ensemble spread in the 0-h forecasts within each of the four subensembles and only a small spread between them, but also in the ensemble range and the bias of the ensemble median prediction displayed in Fig. 4, where a sudden jump at the 1-h forecasts occurs.

Boxplots of the stationwise (left) mean bias of the ensemble median and (right) mean ensemble range of the EPS as functions of the lead time. The black line indicates the average over all samples.


The temporal development of the bias and ensemble range shown in Fig. 4 indicates another meteorological effect, the evening transition of the planetary boundary layer. When the sun sets, the surface and low-level air that have been heated over the course of the day cool down, and thermally driven turbulence ceases. This sometimes quite abrupt transition to calmer, more stable conditions strongly affects the near-surface wind fields, depending on the local conditions (see, e.g., Mahrt 2017). For lead times up to 18 h (corresponding to 1900 local time in winter and 2000 local time in summer), the ensemble range increases together with an improvement in calibration. However, at the evening transition, the overall bias increases and the calibration becomes worse for most stations, indicating increasing systematic errors. This could be related, for example, to the misrepresentation of the inertia of large eddies in the model or to errors in radiative transfer at low sun angles.

In addition to the raw ensemble predictions, we further consider a climatological reference forecast as a benchmark method. The extended probabilistic climatology (EPC; Vogel et al. 2018; Walz et al. 2021) is an ensemble based on past observations, considering only forecasts at the same time of the year. We create a separate climatology for each station and hour of the day that consists of past observations from the previous, current, and following month around the date of interest. The observational database reaches back to 2001; thus, EPC is built on 15 years of data. Not surprisingly, EPC is well calibrated (see Fig. 3). However, it shows a minor positive bias, which is likely due to the generally lower level of wind gusts observed in 2016 relative to the years on which EPC is based. Additional illustrations are provided in the online supplemental material.
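A construction of this kind can be sketched as follows; the column names and data layout are hypothetical, not those of the authors' database:

```python
import numpy as np
import pandas as pd

def epc_forecast(obs, station, hour, date):
    """Climatological ensemble from past observations at the same station and
    hour, restricted to the previous, current, and following month.

    obs: DataFrame with hypothetical columns station, hour, time, gust.
    """
    months = {((date.month - 1 + d) % 12) + 1 for d in (-1, 0, 1)}
    sel = obs[
        (obs["station"] == station)
        & (obs["hour"] == hour)
        & (obs["time"].dt.month.isin(months))
        & (obs["time"].dt.year < date.year)   # past observations only
    ]
    return sel["gust"].to_numpy()

# Toy data: one station, daily gust observations over several years
rng = np.random.default_rng(0)
times = pd.date_range("2001-01-01", "2015-12-31", freq="D")
obs = pd.DataFrame({
    "station": "A", "hour": 12, "time": times,
    "gust": rng.gamma(4.0, 2.5, len(times)),
})
ens = epc_forecast(obs, "A", 12, pd.Timestamp("2016-06-15"))
print(len(ens) > 0)  # nonempty climatological ensemble for May-July
```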

### b. Comparison of the postprocessing methods

Figure 5 shows probability integral transform (PIT) histograms for all postprocessing methods. All approaches substantially improve the calibration in comparison with the raw ensemble predictions and yield well calibrated forecasts, except for IDR, which results in underdispersed predictions. The PIT histograms of the parametric methods based on a truncated logistic distribution (EMOS, EMOS-GB, and DRN) all exhibit similar minor deviations from uniformity caused by a lower tail that is too heavy. The semi- and nonparametric methods MBM, QRF, BQN, and HEN are all slightly skewed to the left, in line with the histogram of EPC. Further, we observe a minor overdispersion for the QRF forecasts.
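As a brief illustration of the PIT, a calibrated forecast evaluates its predictive CDF at the observation and yields uniformly distributed values; the following is a minimal sketch using a (non-truncated) logistic forecast distribution for simplicity:

```python
import numpy as np
from scipy import stats

# PIT for a parametric forecast: evaluate the predictive CDF at the
# observation. Calibrated forecasts yield PIT values uniform on [0, 1].
rng = np.random.default_rng(1)
mu, sigma = 10.0, 2.0
y = rng.logistic(mu, sigma, size=10_000)       # observations drawn from the model
pit = stats.logistic.cdf(y, loc=mu, scale=sigma)
hist, _ = np.histogram(pit, bins=10, range=(0, 1))
print(hist / len(pit))   # roughly 0.1 in each bin for a calibrated forecast
```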

PIT histograms of all postprocessing methods, aggregated over all lead times and stations. Coverage refers to the empirical coverage of a prediction interval with a nominal coverage corresponding to a 20-member ensemble (∼90.48%).


Table 4 summarizes the values of proper scoring rules and other evaluation metrics to compare the overall predictive performance of all methods. While the ensemble predictions outperform the climatological benchmark method, all postprocessing approaches lead to substantial improvements. Among the different postprocessing methods, the three groups of approaches introduced in section 3 show systematic differences in their overall performance. In terms of the CRPS, the basic methods already improve the ensemble by around 26%–29%. Incorporating additional predictors via the machine learning methods further increases the skill, where the NN-based approaches—in particular, DRN and BQN—perform best. The mean absolute error (MAE) and root-mean-square error (RMSE) lead to analogous rankings, and all methods clearly reduce the bias of the EPS. Among the well-calibrated postprocessing methods, the NN-based methods yield the sharpest forecast distributions, followed by QRF, EMOS-GB, and the basic methods. Thus, we conclude that the gain in predictive performance is mainly based on an increase in sharpness. Overall, BQN results not only in the best coverage, but also in the sharpest prediction intervals.

Evaluation metrics for EPC, COSMO-DE-EPS, and all postprocessing methods averaged over all lead times and stations. The PI length and coverage refer to a prediction interval with a nominal coverage corresponding to a 20-member ensemble (∼90.48%). The best methods are indicated in boldface type.

We further consider the total computation time required for training the postprocessing models. However, note that a direct comparison of computation times is difficult because of differences in software packages and parallelization capabilities. Not surprisingly, the simple EMOS method was the fastest, requiring only 19 min. The network-based methods were not much slower than QRF and faster than EMOS-GB, which is based on almost twice as many predictors as the other advanced methods, approximately doubling its computational cost. MBM is an extreme outlier and required a total computation time of over 35 days, in particular because of our adaptations to the subensemble structure discussed in section 3a(2).

### c. Lead-time-specific results

To investigate the effects of the different lead times and hours of the day on the predictive performance, Fig. 6 shows various evaluation metrics as functions of the lead time. While the CRPS values and the improvements over the raw ensemble predictions (Figs. 6a,b) show some variations over the lead times, the overall rankings among the different methods and groups of approaches are consistent. In particular, the rankings of the individual postprocessing models remain relatively stable over the day.

(a) Mean CRPS, (b) skill in terms of the CRPS (CRPSS) with respect to the raw ensemble predictions, (c) mean bias, and (d) mean prediction interval length of the postprocessing methods as functions of the lead time, averaged over all stations. (e) The Brier skill score with respect to the climatological EPC forecast as function of the threshold, averaged over all lead times and stations, where the dashed vertical lines indicate the quantiles of the observed wind gusts at levels given at the top axis.


The spinup effect is clearly visible in that the mean bias drastically increases from the 0-h to the 1-h forecasts of the EPS and leads to a worse CRPS despite the increase in ensemble range (see Fig. 4). The postprocessed forecasts, however, successfully correct the biases induced in the spinup period while benefiting from the increase in ensemble range. Hence, the CRPS becomes smaller, and the skill with respect to the EPS is largest among all lead times. Although we improve the MBM forecast by incorporating the submodel structure, the adjusted ensemble forecasts are still subject to systematic deviations from calibration for lead times of 0 and 1 h. Additional illustrations are provided in the online supplemental material.

Following the first hours, a somewhat counterintuitive trend can be observed in that the predictive performance of the EPS improves up to a lead time of 10 h. This is in particular due to improvements in the spread of the ensemble over time. By contrast, the predictive performance of the climatological baseline model is affected more strongly by the diurnal cycle, since observed wind gusts and their variability tend to be higher during daytime. The performance of the postprocessing models is affected neither by the increased spread of the EPS nor by the larger gust observations; it slightly decreases over time until the evening transition. This is in line with wider prediction intervals that represent increasing uncertainty for longer lead times, while the mean bias and coverage are mostly unaffected.

This general trend changes with the evening transition at a lead time of around 18 h. The CRPS of the climatological reference model decreases owing to the better predictability of wind gusts. By contrast, the CRPS of the ensemble increases, again driven by an increase in bias and a decrease in spread that comes with a smaller coverage. The numerical model thus appears not to be fully capable of capturing the relevant physical effects and introduces systematic errors. The bias and coverage of the postprocessing methods do not change drastically, while their prediction intervals become smaller, in line with the more stable conditions at nighttime. Therefore, the CRPS of the postprocessing methods improves again.

To assess the forecast performance for extreme wind gust events, Fig. 6e shows the mean Brier skill score with respect to the climatological EPC forecast as a function of the threshold value, averaged over all stations and lead times. For larger threshold values, the EPS rapidly loses skill and does not provide better predictions than the climatology for thresholds above 25 m s^{−1}. By contrast, all postprocessing methods retain positive skill across all considered threshold values. The predictive performance decreases for very high threshold values above 30 m s^{−1}, in particular, for EMOS-GB and QRF. Note that the EPS and all postprocessing methods besides the analog-based QRF and IDR have negative skill scores for very small thresholds, but this is unlikely to be of relevance for any practical application.

### d. Station-specific results and statistical significance

We further investigate the station-specific performance of the different postprocessing models and, in particular, investigate whether the locally adaptive networks that are estimated jointly for all stations also outperform the locally estimated methods at the individual stations. Figure 7 shows a map of all observation stations indicating the station-specific best model and demonstrates that at 162 of the 175 stations a network-based method performs best. While none of the basic methods provides the best forecasts at any station, QRF or EMOS-GB perform best at the remaining 13 stations. Most of these stations are located in mountainous terrain or coastal regions that are likely subject to specific systematic errors, which might favor a location-specific modeling approach.

Best method at each station in terms of the CRPS, averaged over all lead times. The point sizes indicate the level of statistical significance of the observed CRPS differences relative to the methods from the other two groups over all lead times. Three point sizes are possible: the smallest indicates statistically significant differences in at most 90% of the performed tests, the middle size in more than 90% and up to 99%, and the largest in 100%, that is, all differences are statistically significant.


Last, we evaluate the statistical significance of the differences in the predictive performance in terms of the CRPS between the competing postprocessing methods. To that end, we perform Diebold–Mariano (DM) tests of equal predictive performance (Diebold and Mariano 1995) for each combination of station and lead time and apply a Benjamini and Hochberg (1995) procedure to account for potential temporal and spatial dependencies of forecast errors in this multiple testing setting. Mathematical details are provided in appendix A.
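These two steps can be sketched as follows; the DM statistic is shown in its simplest form (mean score difference divided by its standard error, compared with a standard normal distribution), a simplification of the setup described in appendix A:

```python
import numpy as np
from scipy import stats

def dm_test(s1, s2):
    """Diebold-Mariano test on two series of CRPS values; returns the
    one-sided p-value for the alternative that forecast 1 is better."""
    d = np.asarray(s1, dtype=float) - np.asarray(s2, dtype=float)
    t = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))
    return stats.norm.cdf(t)       # small p-value: forecast 1 has lower scores

def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg procedure; returns a boolean rejection mask."""
    p = np.asarray(pvals)
    order = np.argsort(p)
    thresh = alpha * np.arange(1, len(p) + 1) / len(p)
    below = p[order] <= thresh
    reject = np.zeros(len(p), dtype=bool)
    if below.any():
        kmax = np.max(np.nonzero(below))   # largest k with p_(k) <= alpha*k/m
        reject[order[: kmax + 1]] = True
    return reject

rng = np.random.default_rng(2)
# Hypothetical CRPS series: model 1 is genuinely better at most "stations"
pvals = [dm_test(rng.gamma(2, 1, 365) * 0.9, rng.gamma(2, 1, 365))
         for _ in range(20)]
print(benjamini_hochberg(pvals).mean())   # fraction of rejected tests
```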

We find that the observed score differences are statistically significant for a large fraction of stations and lead times; see Table 5. In particular, DRN and BQN significantly outperform the basic models at more than 97%, and QRF and EMOS-GB at more than 80%, of all combinations of stations and lead times. Among the locally estimated methods, QRF performs best but provides significant improvements over the NN-based methods in only around 5% of the cases.

Ratio of lead time–station combinations (%) where pairwise DM tests indicate statistically significant CRPS differences after applying a Benjamini–Hochberg procedure to account for multiple testing for a nominal level of *α* = 0.05 of the corresponding one-sided tests. The entry in the *i*th row and *j*th column indicates the ratio of cases where the null hypothesis of equal predictive performance of the corresponding one-sided DM test is rejected in favor of the model in the *i*th row when compared with the model in the *j*th column. The (*i*, *j*) and (*j*, *i*) entries sum to less than 100%; the remainder is the ratio of cases for which the score differences are not significant.

Final hyperparameter configurations of the methods presented in section 3. The NN-specific hyperparameters are displayed separately in Table B2.

Overview of the configuration of the individual networks in the NN-based methods.

To assess station-specific effects of the statistical significance of the score differences, the sizes of the points indicating the best models in Fig. 7 are scaled by the degree of statistical significance of the results when compared with all models from the two other groups of methods. For example, if DRN performs best at a station, the corresponding point size is determined by the proportion of rejections of the null hypothesis of equal predictive performance at that station when comparing DRN with EMOS, MBM, IDR, EMOS-GB, and QRF (but not the other NN-based models) for all lead times in a total of 5 × 22 DM tests. In general, if a locally estimated machine learning approach performed best at one station, the significance tends to be lower than when a network-based method performs best. The most significant differences between the groups of methods can be observed in central Germany, where most stations likely exhibit similar characteristics in contrast to coastal areas in northern Germany or mountainous regions in southern Germany.

## 5. Feature importance

The results presented in the previous section demonstrate that the use of additional features improves the predictive performance by a large margin. Here, we assess the effects of the different inputs on the model performance to gain insight into the importance of meteorological variables and better understand what the models have learned. Many techniques have been introduced to better interpret machine learning methods, in particular NNs (McGovern et al. 2019); we focus on approaches tailored to the machine learning methods at hand and assess the feature importance separately for each method.

### a. EMOS-GB and QRF

Since the second group of methods relies on locally estimated, separate models for each station, the importance of specific predictors will often vary across different locations and thus make an overall interpretation of the model predictions more involved.

In the case of EMOS-GB, we treat the location and scale parameters separately and consider a feature to be more important the larger the absolute value of the estimated coefficient is. In the interest of brevity, we here discuss some general properties only and refer to the online supplemental material for detailed results. Overall, the interpretation is challenging because of a large variation across stations, in particular during the spinup period. In general, the mean value of the wind gust predictions is selected as the most important predictor for the location parameter, followed by other wind-related predictors and the temporal information about the day of the year. For the scale parameter, the standard deviation of the ensemble predictions of wind gust is selected as the most important predictor, followed by the ensemble mean. Other meteorological predictors tend to only contribute relevant information for specific combinations of lead times and stations, parts of which might be physically inconsistent coefficient estimates due to random effects in the corresponding datasets. Selected examples are presented in the online supplemental material.

For QRF, we utilize an out-of-bag estimate of the feature importance based on the training set (Breiman 2001). The procedure is similar to what we apply for the NN-based models below but uses a different evaluation metric directly related to the algorithm for constructing individual decision trees; see Wright and Ziegler (2017) for details. Figure 8 shows the feature importance for selected predictor variables as a function of the lead time; additional results are available in the online supplemental material. Interestingly, the 10 most important predictors (two of which are included in Fig. 8) are variables that directly relate to different characteristics of the wind speed predictions from the ensemble. This can be explained by the specific structure of random forest models: since these predictor variables are highly correlated, they are likely to serve as replacements if others are not available in the random selection of candidate variables for individual splitting decisions. The standard deviation of the wind gust ensemble is only of minor importance during the spinup period. Besides the wind-related predictors from the EPS, the day of the year, the net shortwave radiation flux prediction, and the relative humidity prediction at 1000 hPa are selected as important predictors, particularly for longer lead times corresponding to later times of the day, potentially again indicating an effect of the evening transition. In particular, the shortwave radiation flux indicates the sensitivity of the wind around sunset to the maintenance of turbulence by surface heating, an effect not seen in the morning, when the boundary layer grows more gradually.

Median of stationwise feature importance for selected predictors (see Table 1) of the QRF model as functions of the lead time. The error bars indicate a bootstrapped 95% confidence interval of the median. Note the different scales of the vertical axes.


### b. Neural network–based methods

To investigate the feature importance for NNs, we follow Rasp and Lerch (2018) and use a permutation-based measure that is given by the decrease in terms of the CRPS in the test set when randomly permuting a single input feature, using the mean CRPS of the model based on unpermuted input features as reference. To eliminate the effect of the dependence of the forecast performance on the lead times, we calculate the relative permutation importance. Mathematical details are provided in appendix A.
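The permutation importance measure can be sketched as follows, with a toy point-forecast "model" standing in for the trained networks; the interface `model_crps(X, y)` is hypothetical:

```python
import numpy as np

def permutation_importance(model_crps, X, y, feature, rng):
    """Relative permutation importance: relative increase in mean CRPS after
    randomly permuting a single input feature. model_crps(X, y) is assumed to
    return the mean CRPS of the model on (X, y)."""
    base = model_crps(X, y)
    Xp = X.copy()
    rng.shuffle(Xp[:, feature])          # destroy the feature-target link
    return (model_crps(Xp, y) - base) / base

# Toy "model": the CRPS proxy is the mean absolute error of column 0 used as
# a point forecast, so permuting column 0 should matter and column 1 should not.
rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 2))
y = X[:, 0] + 0.1 * rng.normal(size=1000)
model_crps = lambda X, y: np.abs(X[:, 0] - y).mean()
imp0 = permutation_importance(model_crps, X, y, 0, rng)
imp1 = permutation_importance(model_crps, X, y, 1, rng)
print(imp0 > imp1)   # True: the informative feature has higher importance
```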

Figure 9 shows the relative permutation importance for selected input features and the three NN-based postprocessing methods; additional results are provided in the online supplemental material. There are only minor variations across the three NN approaches, with the wind gust ensemble forecasts providing the most important source of information. To ensure comparability of the three model variants, we here jointly permute the corresponding features of the ensemble predictions of wind gust (mean and standard deviation for DRN and HEN, and the sorted ensemble forecast for BQN). Further results for BQN available in the online supplemental material indicate that among the ensemble members sorted with respect to the predicted speed, the minimum and maximum value are the most important member predictions, followed by the ones indicating transitions between the groups of subensembles. Again, we find that the standard deviation of the wind gust ensemble forecasts is of no importance for DRN and HEN (not shown).

Relative permutation importance of selected predictors (see Table 1) for the three NN-based models dependent on the lead time. Note the different scales of the vertical axes. The abbreviation VMAX_all refers to the multipass permutation of the features derived from the wind gust ensemble. Different symbols indicate the three model variants (open circles: DRN, open triangles: BQN, open squares: HEN).


In addition to the wind gust ensemble predictions, the spatial features form the second most important group of predictors. The station ID (via embedding), altitude, and stationwise bias are the most relevant spatial features; their importance exhibits a diurnal trend that resembles the mean bias of the EPS forecasts, indicating that the spatial information becomes more relevant when the bias in the EPS is larger. Further, the day of the year and the net shortwave radiation flux at the surface provide relevant information that can be connected to the previously discussed evening transition as well as the diurnal cycle. Several temperature variables, in particular temperature at the ground level and in the lower levels of the atmosphere, constitute important predictors at different times of day; for example, the ground-level temperature is important for the first few lead times during the early morning.

## 6. Discussion and conclusions

We have conducted a comprehensive and systematic review and comparison of statistical and machine learning methods for postprocessing ensemble forecasts of wind gusts. The postprocessing methods can be divided into three groups of approaches of increasing complexity ranging from basic methods using only ensemble forecasts of wind gusts as predictors to benchmark machine learning methods and NN-based approaches. While all yield calibrated forecasts and are able to correct the systematic errors of the raw ensemble predictions, incorporating information from additional meteorological predictor variables leads to significant improvements in forecast skill. In particular, postprocessing methods based on NNs jointly estimating a single, locally adaptive model at all stations provide the best forecasts and significantly outperform benchmark methods from machine learning. The flexibility of the NN-based methods is exemplified by the proposed modular framework that nests three variants with different probabilistic forecast types obtained as output. While DRN forecasts are given by a parametric distribution, BQN performs a quantile function regression based on Bernstein polynomials and HEN approximates the predictive density via histograms. The analysis of feature importance for the advanced methods in section 5 illustrates that the machine learning techniques—in particular, the NN approaches—learn physically consistent relations. Overall, our results underpin the conjecture of Rasp and Lerch (2018), who argue that NN-based methods will provide valuable tools for many areas of statistical postprocessing and forecasting.

That said, there is no single best method for most practical applications, as all approaches have advantages but also shortcomings. Based on our experience from this particular study, Fig. 10 presents a subjective overview of key characteristics of the different methods, ranging from flexibility and forecast quality to complexity and interpretability. We suggest exploiting the full flexibility of NN-based approaches if various additional features and a large training set for model training and validation are available, as was the case here. If the dataset is of a smaller scale, for example, covering only a small set of stations, the results suggest that QRF and EMOS-GB may still be able to extract valuable information from the additional predictors. However, if only ensemble predictions of the target variable or a small set of training samples are available, simpler and more parsimonious methods will likely not perform substantially worse than the advanced machine learning techniques [see, e.g., the results in Baran and Baran (2021)].

Illustration of subjectively ranked key characteristics of the postprocessing methods presented in section 3 in the form of a radar chart. In each displayed dimension, entries closer to the center indicate lower degrees (e.g., of forecast quality). The color scheme distinguishes the three groups of methods, and the different line and point styles indicate different characteristics of the forecast distributions; e.g., solid lines indicate the use of a parametric forecast distribution. Flexibility here refers to the flexibility of the obtained forecast distribution, or the flexibility in terms of inputs that can be incorporated into the model. The component of model complexity is divided into the computational requirements in terms of data and computing resources, and the complexity of the model implementation in terms of available software and the required choices about model architecture and tuning parameters.


From an operational point of view, one major shortcoming of the postprocessing methods other than MBM is that they do not preserve spatial, temporal, or multivariable dependencies in the ensemble forecast. Results based on multivariate evaluation tools, which are provided in the online supplemental material, show that MBM outperforms EMOS and IDR. However, MBM is still inferior to the machine learning approaches that do not specifically account for multivariate dependencies, even when focusing on multivariate forecast evaluation. Several studies have investigated approaches that are able to reconstruct the correlation structure (e.g., Schefzik et al. 2013; Lerch et al. 2020) for univariate postprocessing methods. While these techniques require additional steps and therefore increase the complexity of the postprocessing framework, all methods considered here can serve as basic building blocks for such multivariate approaches. In addition, an interesting avenue for future work is to combine the MBM approach with NNs, which might make it possible to efficiently incorporate information from additional predictor variables while preserving the physical characteristics.

A related limitation of the postprocessing methods considered here is that they are not seamless in space and time as they rely on separate models for each lead time, and even each station in case of the basic approaches, as well as EMOS-GB and QRF. In practice, this may lead to physically inconsistent jumps in the forecast trajectories. To address this challenge, Keller et al. (2021) propose a global EMOS variant that is able to incorporate predictions from multiple NWP models in addition to spatial and temporal information. For the NN-based framework for postprocessing considered here, an alternative approach to obtain a joint model across all lead times would be to embed the temporal information in a similar manner as the spatial information.

A possible extension of the postprocessing methods presented in this work would be to apply them to the residuals of a linear model instead of the original target variable. This way, the postprocessing methods focus on learning error-dependent rather than scale-dependent relations, thus potentially compensating for the small number of high wind speeds within the dataset. In particular, QRF would then be able to extrapolate, and a more evenly distributed binning scheme for HEN would be obtained.

The postprocessing methods based on NNs provide a starting point for flexible extensions in future research. In particular, the rapid developments in the machine learning literature offer unprecedented ways to incorporate new sources of information into postprocessing models, including spatial information via convolutional neural networks (Scheuerer et al. 2020; Veldkamp et al. 2021), or temporal information via recurrent neural networks (Gasthaus et al. 2019). A particular challenge for weather prediction is given by the need to better incorporate physical information and constraints into the forecasting models. Physical information about large-scale weather conditions, or weather regimes, forms a particularly relevant example in the context of postprocessing (Rodwell et al. 2018), with recent studies demonstrating benefits of regime-dependent approaches (Allen et al. 2020, 2021). For wind gusts in European winter storms, Pantillon et al. (2018) found that a simple EMOS approach may substantially deteriorate forecast performance of the raw ensemble predictions during specific meteorological conditions. An important step for future work will be to investigate whether similar effects occur for the more complex approaches here, and to devise dynamical feature-based postprocessing methods that are better able to incorporate relevant domain knowledge by tailoring the model structure and estimation process.

If the coefficients are strictly increasing, the same holds for the quantile function: Omitting the conditioning on **X**, the derivative of Eq. (8) is given by *d* ∑_{*l*=0}^{*d*−1} (*α*_{*l*+1} − *α*_{*l*}) *B*_{*l*,*d*−1}(*τ*), which is positive for *τ* ∈ (0, 1) if the coefficients are strictly increasing, because the Bernstein polynomials are also positive on the open unit interval.
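This monotonicity property can be checked numerically. The Python sketch below assumes that the quantile function of Eq. (8) has the standard BQN form *Q*(*τ*) = ∑_{*l*=0}^{*d*} *α*_{*l*} *B*_{*l*,*d*}(*τ*) with Bernstein basis polynomials *B*_{*l*,*d*}; the coefficient values are hypothetical.

```python
import numpy as np
from math import comb

def bernstein_basis(l, d, tau):
    # B_{l,d}(tau) = C(d, l) * tau^l * (1 - tau)^(d - l)
    return comb(d, l) * tau**l * (1.0 - tau)**(d - l)

def bqn_quantile(alpha, tau):
    # Quantile function as a linear combination of Bernstein basis polynomials
    d = len(alpha) - 1
    return sum(a * bernstein_basis(l, d, tau) for l, a in enumerate(alpha))

# Hypothetical, strictly increasing coefficients (degree d = 4)
alpha = [1.0, 2.5, 3.0, 4.5, 7.0]

tau = np.linspace(0.01, 0.99, 99)
q = np.array([bqn_quantile(alpha, t) for t in tau])

# Strictly increasing coefficients yield a strictly increasing quantile function
assert np.all(np.diff(q) > 0)
```

Note that *Q*(0) = *α*_{0} and *Q*(1) = *α*_{*d*}, so the coefficients directly bound the support of the forecast distribution.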

Note that alternatively, it is also possible to fix the bin probabilities and determine the bin edges by the NN. This would be equivalent to estimating the quantiles of the forecast distribution at the levels defined by the prespecified probabilities, i.e., a quantile regression.

In general, the permutation importance can be calculated for any postprocessing method and any scoring rule. Here, we focus on the NN-based approaches and the CRPS.

## Acknowledgments.

The research leading to these results has been done within the project C5 “Dynamical feature-based ensemble postprocessing of wind gusts within European winter storms” of the Transregional Collaborative Research Center SFB/TRR 165 “Waves to Weather” funded by the German Research Foundation (DFG). Sebastian Lerch gratefully acknowledges support by the Vector Stiftung through the Young Investigator Group “Artificial Intelligence for Probabilistic Weather Forecasting.” We thank Reinhold Hess and Sebastian Trepte for providing the forecast and observation data and Robert Redl for assistance in data handling. We further thank Lea Eisenstein, Tilmann Gneiting, Alexander Jordan, and Peter Knippertz for helpful comments and discussions, as well as John Bremnes for helpful comments and providing code for the implementation of BQN. We are grateful to Michael Scheuerer, Maxime Taillardat, and one anonymous reviewer, whose constructive comments helped to improve an earlier version of this paper.

## Data availability statement.

The dataset is proprietary but can be obtained from the DWD for research purposes. Code with implementations of all methods is available online (https://github.com/benediktschulz/paper_pp_wind_gusts).

## APPENDIX A

### Forecast Evaluation

We here provide a summary of the methods used for forecast evaluation. In the following, we will refer to a probabilistic forecast by *F*, the random variable of the observation by *Y*, and a realization of *Y* by *y*, that is, an observed wind gust.

#### a. Calibration and sharpness

Qualitatively, the calibration of a probabilistic forecast can be assessed via histograms of the probability integral transform (PIT) *F*(*Y*), based on the idea that the PIT is standard uniformly distributed if the forecast is calibrated. When only a set of quantiles or the quantile function is given, we can calculate the unified PIT (uPIT), a generalized version of the PIT (Vogel et al. 2018). In case of ensemble forecasts, histograms of the rank of the observation in the ordered ensemble provide a similar illustration. Systematic deviations from uniformity indicate specific deficiencies such as biases or dispersion errors; for example, a U-shaped histogram indicates an *underdispersed* forecast and a hump-shaped histogram an *overdispersed* forecast. Calibration can also be assessed quantitatively via the empirical coverage of a prediction interval (PI). Given a (1 − *α*) × 100% PI, we can calculate the empirical coverage as the fraction of observations that fall within the corresponding PI. If the forecast is calibrated, (1 − *α*) × 100% of the observations are expected to fall within the range of the PI. In case of an ensemble of size *m*, the ensemble range is considered an (*m* − 1)/(*m* + 1) × 100% PI. Sharpness can be assessed based on the length of a PI: the shorter the PI, the more concentrated the forecast distribution, and thus the sharper the forecast.
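As an illustration of these diagnostics, the following Python sketch computes PIT values and the empirical coverage of a central prediction interval for a synthetic, perfectly calibrated Gaussian setting; all numbers are illustrative and unrelated to the paper's dataset.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic, perfectly calibrated setting: observations drawn from the forecast
n = 10_000
mu = rng.normal(size=n)              # forecast means
sigma = np.full(n, 1.0)              # forecast standard deviations
y = rng.normal(mu, sigma)            # observations

# PIT values F(Y); approximately standard uniform if the forecast is calibrated
pit = stats.norm.cdf(y, loc=mu, scale=sigma)

# Empirical coverage of the central (1 - alpha) * 100% prediction interval
alpha = 0.1
lo = stats.norm.ppf(alpha / 2, mu, sigma)
hi = stats.norm.ppf(1 - alpha / 2, mu, sigma)
coverage = np.mean((y >= lo) & (y <= hi))   # should be close to 0.9

# Mean PI length as a simple sharpness measure
sharpness = np.mean(hi - lo)
```

A histogram of `pit` would be flat here; an underdispersed forecast (too small `sigma`) would produce a U shape and a coverage below the nominal level.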

#### b. Proper scoring rules

A scoring rule *S* is a function that assigns a penalty *S*(*F*, *y*) to a forecast–observation tuple (*F*, *y*) (Gneiting and Raftery 2007). It is said to be *proper* relative to a class 𝓕 of probabilistic forecasts if the expected score is minimized by the true distribution of the observation, that is, if E_{*Y*∼*G*}[*S*(*G*, *Y*)] ≤ E_{*Y*∼*G*}[*S*(*F*, *Y*)] for all *F*, *G* ∈ 𝓕, and *strictly proper* if equality holds if and only if *F* = *G*.

The most popular proper scoring rule in the meteorological literature is the continuous ranked probability score (CRPS; Matheson and Winkler 1976),

CRPS(*F*, *y*) = ∫ [*F*(*z*) − 𝟙{*y* ≤ *z*}]² d*z*,

where 𝟙 denotes the indicator function, the integral is taken over the real line, and *F* is a forecast distribution with finite first moment. The CRPS is given in the same unit as the observation and generalizes to the absolute error in case of a deterministic forecast. The integral can be calculated analytically for a wide range of forecast distributions, e.g., for a truncated logistic distribution (see, e.g., Jordan et al. 2019). In the online supplemental material, we derive the CRPS of a piecewise uniform distribution based on Jordan (2016). Another popular strictly proper scoring rule is the log score (Good 1952) or ignorance score, LS(*F*, *y*) = −log *f*(*y*), where *F* is a probabilistic forecast with density *f*. Further, we use the quantile loss at level *τ* ∈ (0, 1) to evaluate a quantile forecast *q*_{*τ*}(*F*), and the median of the forecast distribution med(*F*) to check the bias of the forecast distribution.
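For an ensemble forecast, the CRPS of the empirical CDF admits the well-known closed form CRPS(*F*, *y*) = E|*X* − *y*| − ½ E|*X* − *X*′|, with *X*, *X*′ independent draws from *F*. A minimal Python sketch (independent of the scoringRules implementation used in the paper):

```python
import numpy as np

def crps_ensemble(x, y):
    # CRPS of the empirical CDF of ensemble x for a scalar observation y:
    # CRPS(F, y) = E|X - y| - 0.5 * E|X - X'|
    x = np.asarray(x, dtype=float)
    term1 = np.mean(np.abs(x - y))
    term2 = 0.5 * np.mean(np.abs(x[:, None] - x[None, :]))
    return term1 - term2

# For a single member the CRPS reduces to the absolute error
assert np.isclose(crps_ensemble([12.0], 10.0), 2.0)
```

The same estimator applies to a forecast given as a set of quantiles, which is how quantile-based forecasts are evaluated in appendix B.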

#### c. Optimum score estimation

Let *F*(**X**; **θ**) be a parametric forecast distribution dependent on the predictor variables **X** and a parameter (vector) **θ** ∈ Θ, where Θ is the parameter space. We can estimate the optimal parameter (vector) **θ** by minimizing the mean score of a strictly proper scoring rule *S*; that is,

**θ̂** = argmin_{**θ** ∈ Θ} (1/*n*) ∑_{*i*=1}^{*n*} *S*(*F*(**x**_{*i*}; **θ**), *y*_{*i*}),

where (**x**_{*i*}, *y*_{*i*}), *i* = 1, …, *n*, denotes a training set of size *n*. Note that minimizing the LS is equivalent to maximum likelihood estimation.
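A minimal illustration of optimum score estimation: the sketch below fits a simplified Gaussian EMOS-style model (the paper itself uses a truncated logistic distribution) by minimizing the mean of the closed-form Gaussian CRPS over a synthetic training set. The link functions and all data are assumptions for illustration only.

```python
import numpy as np
from scipy import optimize, stats

def crps_normal(mu, sigma, y):
    # Closed-form CRPS of a Gaussian forecast N(mu, sigma^2)
    z = (y - mu) / sigma
    return sigma * (z * (2 * stats.norm.cdf(z) - 1)
                    + 2 * stats.norm.pdf(z) - 1 / np.sqrt(np.pi))

def mean_crps(theta, ens_mean, ens_sd, y):
    # Assumed links: mu affine in the ensemble mean, sigma log-linear in the
    # ensemble spread (the exp link guarantees sigma > 0)
    a, b, c, d = theta
    mu = a + b * ens_mean
    sigma = np.exp(c + d * np.log(ens_sd))
    return np.mean(crps_normal(mu, sigma, y))

# Synthetic training set with known coefficients (a, b, c, d) = (1, 1, log 1.5, 1)
rng = np.random.default_rng(1)
n = 2000
ens_mean = rng.normal(10.0, 3.0, n)
ens_sd = rng.uniform(0.5, 2.0, n)
y = rng.normal(1.0 + ens_mean, 1.5 * ens_sd)

# Minimum-CRPS estimation of theta
res = optimize.minimize(mean_crps, x0=np.zeros(4),
                        args=(ens_mean, ens_sd, y), method="L-BFGS-B")
a_hat, b_hat, c_hat, d_hat = res.x   # should recover b close to 1, d close to 1
```

Replacing `mean_crps` by the mean log score would turn this into maximum likelihood estimation, as noted above.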

#### d. Statistical tests of equal predictive performance

Given a proper scoring rule *S*, forecasting methods are ranked by their average score S̄ = (1/*n*) ∑_{*i*=1}^{*n*} *S*(*F*_{*i*}, *y*_{*i*}) over *n* forecasts with corresponding observations. For two competing forecasts *F* and *G*, we can perform a statistical test of equal predictive performance via the Diebold–Mariano test (Diebold and Mariano 1995). If the forecast cases are independent, the corresponding test statistic is given by

*t*_{*n*} = √*n* (S̄_{*F*} − S̄_{*G*}) / σ̂_{*n*},

where σ̂_{*n*} is an estimate of the standard deviation of the score differences; method *F* is preferred if the test statistic is negative, and *G* if it is positive. In our case, we perform tests of statistical significance based on the CRPS for each station and lead time separately.

To account for multiple testing, we apply a Benjamini–Hochberg correction (Benjamini and Hochberg 1995) with level *α* = 0.05. Given the ordered *p* values *p*_{(1)}, …, *p*_{(*M*)} of *M* hypothesis tests, a threshold *p* value is determined via

*p** = max{*p*_{(*k*)} : *p*_{(*k*)} ≤ *kα*/*M*, *k* = 1, …, *M*},

and *p** is then used to decide whether the null hypotheses of the individual tests are rejected. For Table 5, we applied the Benjamini–Hochberg correction for each pair of methods separately, considering tests for each combination of location and lead time. For Fig. 7, we applied the correction for each location separately, considering the tests comparing the best method in terms of the CRPS with the methods from the other groups for all lead times.
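The two testing steps can be sketched as follows. The DM statistic assumes independent forecast cases as above, and the BH routine returns the indices of the rejected hypotheses; both are generic Python sketches, not the paper's implementation.

```python
import numpy as np
from scipy import stats

def diebold_mariano(s_f, s_g):
    # DM test statistic for independent forecast cases; negative values
    # favor method F, positive values favor method G
    d = np.asarray(s_f, float) - np.asarray(s_g, float)
    n = d.size
    t = np.sqrt(n) * d.mean() / d.std(ddof=1)
    p = 2 * stats.norm.sf(abs(t))          # two-sided p value
    return t, p

def benjamini_hochberg(p_values, alpha=0.05):
    # Indices of rejected null hypotheses under the BH correction:
    # reject all tests with p <= p* = max{p_(k) : p_(k) <= k * alpha / M}
    p = np.asarray(p_values, float)
    order = np.argsort(p)
    m = p.size
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    if not below.any():
        return np.array([], dtype=int)
    k = np.max(np.nonzero(below)[0])
    return np.sort(order[: k + 1])
```

For example, for *p* values (0.001, 0.02, 0.04, 0.9) at *α* = 0.05, the BH thresholds are (0.0125, 0.025, 0.0375, 0.05), so only the first two hypotheses are rejected.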

#### e. Permutation importance

To formally define the *permutation importance* (Rasp and Lerch 2018; McGovern et al. 2019), we use *ξ*_{*i*} to denote the *i*th predictor (*i* = 1, …, *p*), *F*(**X**_{·*j*}) (*j* = 1, …, *n*) to denote the probabilistic forecast generated by an NN-based postprocessing method^{A1} based on the *j*th sample of a test set of size *n*, and *π* to denote a random permutation of the set {1, …, *n*}. The permutation importance of *ξ*_{*i*} with respect to the test set **X** and permutation *π* is defined as the increase in the mean CRPS that results from permuting the values of *ξ*_{*i*} according to *π*; that is,

PI(*ξ*_{*i*}; *π*) = (1/*n*) ∑_{*j*=1}^{*n*} CRPS(*F*(**X**^{*π*}_{·*j*}), *y*_{*j*}) − (1/*n*) ∑_{*j*=1}^{*n*} CRPS(*F*(**X**_{·*j*}), *y*_{*j*}),   (A1)

where **X**^{*π*} denotes the test set with the values of *ξ*_{*i*} permuted according to *π*. In practice, we randomly permute the values of *ξ*_{*i*} in the test set, generate the forecasts based on the permuted set, calculate the associated CRPS, and calculate the difference to the CRPS of the original data. The larger the difference, the more detrimental is the effect of shuffling the feature on the forecast performance, and thus the more important the feature.

The same procedure can be applied jointly to a group of *K* ≤ *p* features Ξ corresponding to the indices *I* = {*i*_{1}, …, *i*_{*K*}}, which we refer to as multipass permutation importance (McGovern et al. 2019). In this case, we do not permute only one feature according to *π* but instead the entire set Ξ; the permutation importance PI(Ξ; *π*) is then calculated according to Eq. (A1). Last, we calculate a relative permutation importance by dividing the permutation importance by the mean CRPS of the model based on the unpermuted input features.
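The (multipass) permutation importance of Eq. (A1), normalized to a relative importance, can be sketched generically as follows. Here `score_fn` is a placeholder for the mean CRPS of a fitted model, and the toy example at the end is purely illustrative.

```python
import numpy as np

def relative_permutation_importance(score_fn, X, y, cols, rng):
    # Jointly permute the feature group `cols` with one shared permutation pi
    # and return the score increase of Eq. (A1), divided by the score of the
    # model based on the unpermuted inputs
    base = score_fn(X, y)
    perm = rng.permutation(len(X))
    Xp = X.copy()
    Xp[:, cols] = X[perm][:, cols]
    return (score_fn(Xp, y) - base) / base

# Toy example: a "model" whose score depends only on feature 0
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3))
y = X[:, 0] + rng.normal(0.0, 0.5, 500)
score_fn = lambda X_, y_: np.mean((X_[:, 0] - y_) ** 2)   # stand-in mean score

ri_used = relative_permutation_importance(score_fn, X, y, [0], rng)
ri_unused = relative_permutation_importance(score_fn, X, y, [1], rng)
# Permuting the informative feature degrades the score; the unused one does not
assert ri_used > ri_unused
```

Passing several column indices in `cols` corresponds to the multipass variant used for VMAX_all in Fig. 9.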

## APPENDIX B

### Implementation Details

We here provide additional details on the implementation of all methods. The corresponding R code is publicly available (https://github.com/benediktschulz/paper_pp_wind_gusts).

If the evaluation or parts of it are based on a set of quantiles, we generate 125 equidistant quantiles for each test sample. This number is chosen such that the median as well as the quantiles at the levels of a prediction interval with a nominal coverage corresponding to a 20-member ensemble (∼90.48%) are included and such that the forecast distribution is given by a sufficiently large number of quantiles. The quantiles are then evaluated analogously to an ensemble forecast.
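The stated properties pin down the levels: 125 equidistant quantile levels on (0, 1) are *τ*_{*i*} = *i*/126, which include the median (63/126) and the levels 1/21 and 20/21 bounding the PI that corresponds to the range of a 20-member ensemble. A small Python check (the explicit formula *i*/126 is inferred from these properties, not stated in the text):

```python
import numpy as np

# 125 equidistant quantile levels: tau_i = i / 126, i = 1, ..., 125
levels = np.arange(1, 126) / 126

# The median is one of the levels ...
assert 0.5 in levels

# ... as are the levels bounding the PI whose nominal coverage matches the
# range of a 20-member ensemble: (m - 1)/(m + 1) = 19/21, about 90.48%
lo, hi = 1 / 21, 20 / 21
assert np.isclose(levels, lo).any() and np.isclose(levels, hi).any()
```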

#### a. Basic methods

The implementation of EMOS and MBM is straightforward using built-in optimization routines in R in combination with the evaluation functionalities provided in the scoringRules package. We generally employed the gradient-based L-BFGS-B algorithm and, in case of nonconverged or failed optimizations, applied the Nelder–Mead algorithm in a second run. In the case of EMOS, we parameterized the parameter *b* via *d* and tested variants both with a single coefficient *c* and with coefficients *c*_{1}, …, *c*_{4} for the subensembles, but the latter resulted in a worse predictive performance that still exhibited systematic deviations in the histograms. The implementation of IDR relies on the isodistrreg package (Henzi et al. 2019); the associated hyperparameters are displayed in Table B1.

#### b. EMOS-GB and QRF

EMOS-GB and QRF are implemented via the crch (Messner et al. 2016) and ranger (Wright and Ziegler 2017) packages, respectively. The selected hyperparameter configurations are summarized in Table B1. The permutation importance of the QRF predictor variables on the training set is obtained from the ranger function.

#### c. NN-based methods (DRN, BQN, HEN)

Before fitting the NN models, each predictor variable was normalized by subtracting the mean value and dividing by the standard deviation, both computed on the training set excluding the validation period. The NN models are built via the keras package (Allaire and Chollet 2020). For hyperparameter tuning, we used the tfruns package (Allaire 2018) to find the best hyperparameter candidates in a single run, for which we then compared an ensemble of 10 corresponding network models. Prespecified hyperparameters related to the basic model structures are provided in Table B1; the hyperparameters selected with this procedure are displayed in Table B2.

Except for HEN, where we employ the categorical cross entropy, we manually implemented the loss functions for model training. For DRN, the implementation of the CRPS of the truncated logistic distribution makes use of the tfprobability extension (0.11.1; Keydana 2020), which includes distribution functions of the logistic distribution. For BQN, the quantile loss evaluated at a given set of quantile levels based on a linear combination of Bernstein polynomials is implemented manually.

For fitting the HEN model, the observations have to be transformed to categorical variables representing the bins, which are generated iteratively based on the training set. We start with one bin for each observed value (the observations take only a limited number of distinct values for reporting reasons) and repeatedly merge the bin that contains the smallest number of observations with the smaller of its two neighboring bins. We additionally put constraints on the bin widths: the first bin should have a length of at most 2 m s^{−1}, the last of at most 7 m s^{−1}, and the others of at most 5 m s^{−1}. In the aggregation procedure, the binning in terms of the probabilities is reduced to a minimal bin width of 0.01% for numerical reasons.
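The iterative binning described above can be sketched as follows. The interpretation that the "smaller" neighboring bin is the one with fewer observations, and the use of a target number of bins as stopping criterion, are assumptions of this simplified Python sketch.

```python
import numpy as np

def build_hen_bins(obs, n_bins, w_first=2.0, w_inner=5.0, w_last=7.0):
    # One degenerate bin [v, v] per distinct observed value to start with
    vals, cnts = np.unique(np.asarray(obs, dtype=float), return_counts=True)
    bins = [[v, v, int(c)] for v, c in zip(vals, cnts)]  # [lo, hi, count]

    while len(bins) > n_bins:
        i = min(range(len(bins)), key=lambda k: bins[k][2])  # fewest obs
        nbrs = [j for j in (i - 1, i + 1) if 0 <= j < len(bins)]
        nbrs.sort(key=lambda j: bins[j][2])        # prefer the smaller neighbor
        for j in nbrs:
            a, b = min(i, j), max(i, j)
            lo, hi = bins[a][0], bins[b][1]
            # Maximum-width constraints for first, interior, and last bins
            cap = w_first if a == 0 else (w_last if b == len(bins) - 1 else w_inner)
            if hi - lo <= cap:
                bins[a] = [lo, hi, bins[a][2] + bins[b][2]]
                del bins[b]
                break
        else:
            break                                  # no admissible merge left
    return bins

bins = build_hen_bins([0, 0, 1, 1, 1, 2, 3, 3, 3, 3, 5], n_bins=3)
assert len(bins) == 3 and sum(b[2] for b in bins) == 11
```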

## REFERENCES

Allaire, J. J., 2018: tfruns: Training run tools for ‘TensorFlow’, version 1.4. R package, https://cran.r-project.org/package=tfruns.

Allaire, J. J., and F. Chollet, 2020: keras: R interface to ‘Keras’, version 2.3.0.0. R package, https://cran.r-project.org/package=keras.

Allaire, J. J., and Y. Tang, 2020: tensorflow: R interface to ‘TensorFlow’, version 2.2.0. R package, https://cran.r-project.org/package=tensorflow.

Allen, S., C. A. Ferro, and F. Kwasniok, 2020: Recalibrating wind-speed forecasts using regime-dependent ensemble model output statistics. *Quart. J. Roy. Meteor. Soc.*, **146**, 2576–2596, https://doi.org/10.1002/qj.3806.

Allen, S., G. R. Evans, P. Buchanan, and F. Kwasniok, 2021: Incorporating the North Atlantic Oscillation into the post-processing of MOGREPS-G wind speed forecasts. *Quart. J. Roy. Meteor. Soc.*, **147**, 1403–1418, https://doi.org/10.1002/qj.3983.

Baldauf, M., A. Seifert, J. Förstner, D. Majewski, M. Raschendorfer, and T. Reinhardt, 2011: Operational convective-scale numerical weather prediction with the COSMO model: Description and sensitivities. *Mon. Wea. Rev.*, **139**, 3887–3905, https://doi.org/10.1175/MWR-D-10-05013.1.

Baran, S., and S. Lerch, 2015: Log-normal distribution based Ensemble Model Output Statistics models for probabilistic wind-speed forecasting. *Quart. J. Roy. Meteor. Soc.*, **141**, 2289–2299, https://doi.org/10.1002/qj.2521.

Baran, S., and S. Lerch, 2016: Mixture EMOS model for calibrating ensemble forecasts of wind speed. *Environmetrics*, **27**, 116–130, https://doi.org/10.1002/env.2380.

Baran, S., and S. Lerch, 2018: Combining predictive distributions for the statistical post-processing of ensemble forecasts. *Int. J. Forecasting*, **34**, 477–496, https://doi.org/10.1016/j.ijforecast.2018.01.005.

Baran, S., and A. Baran, 2021: Calibration of wind speed ensemble forecasts for power generation. arXiv, 15 pp., https://arxiv.org/abs/2104.14910.

Baran, S., P. Szokol, and M. Szabó, 2021: Truncated generalized extreme value distribution-based ensemble model output statistics model for calibration of wind speed ensemble forecasts. *Environmetrics*, **32**, e2678, https://doi.org/10.1002/env.2678.

Bauer, P., A. Thorpe, and G. Brunet, 2015: The quiet revolution of numerical weather prediction. *Nature*, **525**, 47–55, https://doi.org/10.1038/nature14956.

Benjamini, Y., and Y. Hochberg, 1995: Controlling the false discovery rate: A practical and powerful approach to multiple testing. *J. Roy. Stat. Soc.*, **57B**, 289–300, https://doi.org/10.1111/j.2517-6161.1995.tb02031.x.

Breiman, L., 1984: *Classification and Regression Trees.* Wadsworth International Group, 368 pp.

Breiman, L., 2001: Random forests. *Mach. Learn.*, **45**, 5–32, https://doi.org/10.1023/A:1010933404324.

Bremnes, J. B., 2020: Ensemble postprocessing using quantile function regression based on neural networks and Bernstein polynomials. *Mon. Wea. Rev.*, **148**, 403–414, https://doi.org/10.1175/MWR-D-19-0227.1.

de Leeuw, J., K. Hornik, and P. Mair, 2009: Isotone optimization in R: Pool-Adjacent-Violators Algorithm (PAVA) and active set methods. *J. Stat. Software*, **32**, 1–24, https://doi.org/10.18637/jss.v032.i05.

Diebold, F. X., and R. S. Mariano, 1995: Comparing predictive accuracy. *J. Bus. Econ. Stat.*, **13**, 253–263.

Felder, M., F. Sehnke, K. Ohnmeiß, L. Schröder, C. Junk, and A. Kaifel, 2018: Probabilistic short term wind power forecasts using deep neural networks with discrete target classes. *Adv. Geosci.*, **45**, 13–17, https://doi.org/10.5194/adgeo-45-13-2018.

Gasthaus, J., K. Benidis, Y. Wang, S. S. Rangapuram, D. Salinas, V. Flunkert, and T. Januschowski, 2019: Probabilistic forecasting with spline quantile function RNNs. *Proc. Mach. Learn. Res.*, **89**, 1901–1910, http://proceedings.mlr.press/v89/gasthaus19a.html.

Genest, C., 1992: Vincentization revisited. *Ann. Stat.*, **20**, 1137–1142, https://doi.org/10.1214/aos/1176348676.

Gneiting, T., 2011: Making and evaluating point forecasts. *J. Amer. Stat. Assoc.*, **106**, 746–762, https://doi.org/10.1198/jasa.2011.r10138.

Gneiting, T., and A. E. Raftery, 2007: Strictly proper scoring rules, prediction, and estimation. *J. Amer. Stat. Assoc.*, **102**, 359–378, https://doi.org/10.1198/016214506000001437.

Gneiting, T., and R. Ranjan, 2011: Comparing density forecasts using threshold and quantile-weighted scoring rules. *J. Bus. Econ. Stat.*, **29**, 411–422, https://doi.org/10.1198/jbes.2010.08110.

Gneiting, T., and R. Ranjan, 2013: Combining predictive distributions. *Electron. J. Stat.*, **7**, 1747–1782, https://doi.org/10.1214/13-EJS823.

Gneiting, T., A. E. Raftery, A. H. Westveld, and T. Goldman, 2005: Calibrated probabilistic forecasting using ensemble model output statistics and minimum CRPS estimation. *Mon. Wea. Rev.*, **133**, 1098–1118, https://doi.org/10.1175/MWR2904.1.

Gneiting, T., F. Balabdaoui, and A. E. Raftery, 2007: Probabilistic forecasts, calibration and sharpness. *J. Roy. Stat. Soc.*, **69B**, 243–268, https://doi.org/10.1111/j.1467-9868.2007.00587.x.

Good, I., 1952: Rational decisions. *J. Roy. Stat. Soc.*, **14B**, 107–114, https://doi.org/10.1111/j.2517-6161.1952.tb00104.x.

Goodfellow, I., Y. Bengio, and A. Courville, 2016: *Deep Learning.* MIT Press, 800 pp.

Hamill, T. M., R. Hagedorn, and J. S. Whitaker, 2008: Probabilistic forecast calibration using ECMWF and GFS ensemble reforecasts. Part II: Precipitation. *Mon. Wea. Rev.*, **136**, 2620–2632, https://doi.org/10.1175/2007MWR2411.1.

Hastie, T., R. Tibshirani, and J. Friedman, 2009: *The Elements of Statistical Learning.* Springer, 745 pp., https://doi.org/10.1007/978-0-387-84858-7.

Haupt, S. E., W. Chapman, S. V. Adams, C. Kirkwood, J. S. Hosking, N. H. Robinson, S. Lerch, and A. C. Subramanian, 2021: Towards implementing artificial intelligence post-processing in weather and climate: Proposed actions from the Oxford 2019 workshop. *Philos. Trans. Roy. Soc. London*, **A379**, 20200091, https://doi.org/10.1098/rsta.2020.0091.

Henzi, A., J. Ziegel, and T. Gneiting, 2019: isodistrreg: Isotonic Distributional Regression (IDR). GitHub, https://github.com/AlexanderHenzi/isodistrreg.

Henzi, A., G.-R. Kleger, and J. F. Ziegel, 2020: Distributional (single) index models. arXiv, 38 pp., https://arxiv.org/abs/2006.09219.

Henzi, A., J. F. Ziegel, and T. Gneiting, 2021: Isotonic distributional regression. *J. Roy. Stat. Soc.*, **B83**, 963–993, https://doi.org/10.1111/rssb.12450.

Hess, R., 2020: Statistical postprocessing of ensemble forecasts for severe weather at Deutscher Wetterdienst. *Nonlinear Processes Geophys.*, **27**, 473–487, https://doi.org/10.5194/npg-27-473-2020.

Jordan, A., 2016: Facets of forecast evaluation. Ph.D. thesis, Karlsruher Institut für Technologie, 112 pp., https://doi.org/10.5445/IR/1000063629.

Jordan, A., F. Krüger, and S. Lerch, 2019: Evaluating probabilistic forecasts with scoringRules. *J. Stat. Software*, **90**, 1–37, https://doi.org/10.18637/jss.v090.i12.

Keller, R., J. Rajczak, J. Bhend, C. Spirig, S. Hemri, M. A. Liniger, and H. Wernli, 2021: Seamless multi-model postprocessing for air temperature forecasts in complex topography. *Wea. Forecasting*, **36**, 1031–1042, https://doi.org/10.1175/WAF-D-20-0141.1.

Keydana, S., 2020: tfprobability: Interface to ‘TensorFlow Probability’, version 0.11.1. R package, https://cran.r-project.org/package=tfprobability.

Kingma, D. P., and J. L. Ba, 2014: Adam: A method for stochastic optimization. arXiv, 15 pp., https://arxiv.org/abs/1412.6980.

Kleczek, M. A., G.-J. Steeneveld, and A. A. M. Holtslag, 2014: Evaluation of the weather research and forecasting mesoscale model for GABLS3: Impact of boundary-layer schemes, boundary conditions and spin-up. *Bound.-Layer Meteor.*, **152**, 213–243, https://doi.org/10.1007/s10546-014-9925-3.

Lang, M. N., S. Lerch, G. J. Mayr, T. Simon, R. Stauffer, and A. Zeileis, 2020: Remember the past: A comparison of time-adaptive training schemes for non-homogeneous regression. *Nonlinear Processes Geophys.*, **27**, 23–34, https://doi.org/10.5194/npg-27-23-2020.

Lerch, S., and T. L. Thorarinsdottir, 2013: Comparison of non-homogeneous regression models for probabilistic wind speed forecasting. *Tellus*, **65A**, 21206, https://doi.org/10.3402/tellusa.v65i0.21206.

Lerch, S., and S. Baran, 2017: Similarity-based semilocal estimation of post-processing models. *J. Roy. Stat. Soc.*, **66C**, 29–51, https://doi.org/10.1111/rssc.12153.

Lerch, S., S. Baran, A. Möller, J. Groß, R. Schefzik, S. Hemri, and M. Graeter, 2020: Simulation-based comparison of multivariate ensemble post-processing methods. *Nonlinear Processes Geophys.*, **27**, 349–371, https://doi.org/10.5194/npg-27-349-2020.

Li, R., B. J. Reich, and H. D. Bondell, 2021: Deep distribution regression. *Comput. Stat. Data Anal.*, **159**, 107203, https://doi.org/10.1016/j.csda.2021.107203.

Mahrt, L., 2017: The near-surface evening transition. *Quart. J. Roy. Meteor. Soc.*, **143**, 2940–2948, https://doi.org/10.1002/qj.3153.

Matheson, J. E., and R. L. Winkler, 1976: Scoring rules for continuous probability distributions. *Manage. Sci.*, **22**, 1087–1096, https://doi.org/10.1287/mnsc.22.10.1087.

McGovern, A., K. L. Elmore, D. J. Gagne, S. E. Haupt, C. D. Karstens, R. Lagerquist, T. Smith, and J. K. Williams, 2017: Using artificial intelligence to improve real-time decision-making for high-impact weather. *Bull. Amer. Meteor. Soc.*, **98**, 2073–2090, https://doi.org/10.1175/BAMS-D-16-0123.1.

McGovern, A., R. Lagerquist, D. J. Gagne, G. E. Jergensen, K. L. Elmore, C. R. Homeyer, and T. Smith, 2019: Making the black box more transparent: Understanding the physical implications of machine learning. *Bull. Amer. Meteor. Soc.*, **100**, 2175–2199, https://doi.org/10.1175/BAMS-D-18-0195.1.

Meinshausen, N., 2006: Quantile regression forests. *J. Mach. Learn. Res.*, **7**, 983–999.

Messner, J. W., G. J. Mayr, and A. Zeileis, 2016: Heteroscedastic censored and truncated regression with crch. *R J.*, **8**, 173–181, https://doi.org/10.32614/RJ-2016-012.

Messner, J. W., G. J. Mayr, and A. Zeileis, 2017: Nonhomogeneous boosting for predictor selection in ensemble postprocessing. *Mon. Wea. Rev.*, **145**, 137–147, https://doi.org/10.1175/MWR-D-16-0088.1.

Pantillon, F., S. Lerch, P. Knippertz, and U. Corsmeier, 2018: Forecasting wind gusts in winter storms using a calibrated convection-permitting ensemble. *Quart. J. Roy. Meteor. Soc.*, **144**, 1864–1881, https://doi.org/10.1002/qj.3380.

Raftery, A. E., T. Gneiting, F. Balabdaoui, and M. Polakowski, 2005: Using Bayesian model averaging to calibrate forecast ensembles. *Mon. Wea. Rev.*, **133**, 1155–1174, https://doi.org/10.1175/MWR2906.1.

Ranjan, R., and T. Gneiting, 2010: Combining probability forecasts. *J. Roy. Stat. Soc.*, **72B**, 71–91, https://doi.org/10.1111/j.1467-9868.2009.00726.x.

Rasp, S., and S. Lerch, 2018: Neural networks for postprocessing ensemble weather forecasts. *Mon. Wea. Rev.*, **146**, 3885–3900, https://doi.org/10.1175/MWR-D-18-0187.1.

R Core Team, 2021: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, https://www.r-project.org/.

Rodwell, M. J., D. S. Richardson, D. B. Parsons, and H. Wernli, 2018: Flow-dependent reliability: A path to more skillful ensemble forecasts. *Bull. Amer. Meteor. Soc.*, **99**, 1015–1026, https://doi.org/10.1175/BAMS-D-17-0027.1.

Schefzik, R., 2017: Ensemble calibration with preserved correlations: Unifying and comparing ensemble copula coupling and member-by-member postprocessing.

,*Quart. J. Roy. Meteor. Soc.***143**, 999–1008, https://doi.org/10.1002/qj.2984.Schefzik, R., T. L. Thorarinsdottir, and T. Gneiting, 2013: Uncertainty quantification in complex simulation models using ensemble copula coupling.

,*Stat. Sci.***28**, 616–640, https://doi.org/10.1214/13-STS443.Scheuerer, M., 2014: Probabilistic quantitative precipitation forecasting using ensemble model output statistics.

,*Quart. J. Roy. Meteor. Soc.***140**, 1086–1096, https://doi.org/10.1002/qj.2183.Scheuerer, M., and D. Möller, 2015: Probabilistic wind speed forecasting on a grid based on ensemble model output statistics.

,*Ann. Appl. Stat.***9**, 1328–1349, https://doi.org/10.1214/15-AOAS843.Scheuerer, M., M. B. Switanek, R. P. Worsnop, and T. M. Hamill, 2020: Using artificial neural networks for generating probabilistic subseasonal precipitation forecasts over California.

,*Mon. Wea. Rev.***148**, 3489–3506, https://doi.org/10.1175/MWR-D-20-0096.1.Schulz, B., M. E. Ayari, S. Lerch, and S. Baran, 2021: Post-processing numerical weather prediction ensembles for probabilistic solar irradiance forecasting.

*Sol. Energy*,**220**, 1016–1031, https://doi.org/10.1016/j.solener.2021.03.023.Taillardat, M., and O. Mestre, 2020: From research to applications – Examples of operational ensemble post-processing in France using machine learning.

,*Nonlinear Processes Geophys.***27**, 329–347, https://doi.org/10.5194/npg-27-329-2020.Taillardat, M., O. Mestre, M. Zamo, and P. Naveau, 2016: Calibrated ensemble forecasts using quantile regression forests and ensemble model output statistics.

,*Mon. Wea. Rev.***144**, 2375–2393, https://doi.org/10.1175/MWR-D-15-0260.1.Thorarinsdottir, T. L., and T. Gneiting, 2010: Probabilistic forecasts of wind speed: Ensemble model output statistics by using heteroscedastic censored regression.

,*J. Roy. Stat. Soc.***173B**, 371–388, https://doi.org/10.1111/j.1467-985X.2009.00616.x.Vannitsem, S., D. S. Wilks, and J. Messner, 2018:

Elsevier, 364 pp.*Statistical Postprocessing of Ensemble Forecasts.*Vannitsem, S., and Coauthors, 2021: Statistical postprocessing for weather forecasts: Review, challenges, and avenues in a big data world.

,*Bull. Amer. Meteor. Soc.***102**, E681–E699, https://doi.org/10.1175/BAMS-D-19-0308.1.Van Schaeybroeck, B., and S. Vannitsem, 2015: Ensemble post-processing using member-by-member approaches: Theoretical aspects.

,*Quart. J. Roy. Meteor. Soc.***141**, 807–818, https://doi.org/10.1002/qj.2397.Veldkamp, S., K. Whan, S. Dirksen, and M. Schmeits, 2021: Statistical postprocessing of wind speed forecasts using convolutional neural networks.

,*Mon. Wea. Rev.***149**, 1141–1152, https://doi.org/10.1175/MWR-D-20-0219.1.Vogel, P., P. Knippertz, A. H. Fink, A. Schlueter, and T. Gneiting, 2018: Skill of global raw and postprocessed ensemble predictions of rainfall over northern tropical Africa.

,*Wea. Forecasting***33**, 369–388, https://doi.org/10.1175/WAF-D-17-0127.1.Walz, E.-M., M. Maranan, R. van der Linden, A. H. Fink, and P. Knippertz, 2021: An IMERG-based optimal extended probabilistic climatology (EPC) as a benchmark ensemble forecast for precipitation in the tropics and subtropics.

,*Wea. Forecasting***36**, 1561–1573, https://doi.org/10.1175/WAF-D-20-0233.1.Wang, J., and S. K. Ghosh, 2012: Shape restricted nonparametric regression with Bernstein polynomials.

,*Comput. Stat. Data Anal.***56**, 2729–2741, https://doi.org/10.1016/j.csda.2012.02.018.Wilks, D. S., 2016: “The stippling shows statistically significant grid points”: How research results are routinely overstated and overinterpreted, and what to do about it.

,*Bull. Amer. Meteor. Soc.***97**, 2263–2273, https://doi.org/10.1175/BAMS-D-15-00267.1.Wilks, D. S., 2018: Univariate ensemble postprocessing.

, S. Vannitsem, D. S. Wilks, and J. Messner, Eds., Elsevier, 49–89, https://doi.org/10.1016/B978-0-12-812372-0.00003-0.*Statistical Postprocessing of Ensemble Forecasts*Wright, M. N., and A. Ziegler, 2017: Ranger: A fast implementation of random forests for high dimensional data in C++ and R.

*J. Stat. Software*,**77**, 1–17, https://doi.org/10.18637/jss.v077.i01.