## Abstract

Ensemble forecasts are a valuable addition to deterministic wind forecasts since they allow the quantification of forecast uncertainties. To remove common deficiencies of ensemble forecasts such as biases and ensemble spread deficits, various postprocessing methods for the calibration of wind speed (univariate calibration) and wind vector (bivariate calibration) ensemble forecasts have been developed in recent years. The objective of this paper is to compare the performance of state-of-the-art calibration methods at distinct off- and onshore sites in central Europe. The aim is to identify calibration- and site-dependent improvements in forecast skill over uncalibrated 100-m ensemble forecasts from the ECMWF Ensemble Prediction System. The ensemble forecasts were evaluated at four onshore and three offshore measurement towers in central Europe at 100-m height for lead times up to 5 days. The results show that the recursive and adaptive wind vector calibration (AUV) outperforms calibration methods such as univariate ensemble model output statistics (EMOS), bivariate EMOS, variance deficit calibration, and ensemble copula coupling in terms of the root-mean-square error and continuous ranked probability score at almost all sites. It was found that exponential downweighting of past measurements in AUV contributes to higher forecast skill since similar downweighting approaches in the other calibration methods improved forecast skill. Proposing a bidimensional bias correction in bivariate EMOS similar to the approach taken in AUV yields bivariate EMOS skill at onshore sites that is similar to AUV skill. Deterministic and probabilistic improvements are usually much lower at offshore sites and increase with increasing complexity of the site characteristics since systematic forecast errors and ensemble underdispersion are larger at high-roughness sites.

## 1. Introduction

The economic operation of wind farms and the safe integration of wind power into the transmission grid require high-quality wind and power forecasts, which are mostly based on forecasts from numerical weather prediction models. Nowadays, many weather services provide end users with both deterministic and ensemble forecasts. While deterministic forecasts are single integrations of numerical weather prediction models and provide point forecasts for each forecast horizon, ensemble forecasts consist of multiple model integrations.

Ensemble forecasts and, more generally, probabilistic forecasts allow the quantification of forecast uncertainties and the definition of prediction risk indices, which can be of great value for wind energy applications: Information on uncertainty can support the decision-making process of the technical wind farm management or can be the basis for defining optimal trading strategies in liberalized electricity markets to reduce penalties related to regulation costs (Pinson et al. 2007, among others).

Ensemble forecasts directly taken as output from numerical weather prediction models are often subject to forecast errors and ensemble spread deficiencies (Hamill and Colucci 1997; Gneiting et al. 2005; and others). However, uncertainty information is only valuable for decision making if the ensemble spread, which is represented by the standard deviation of the ensemble members, is realistic. Thus, various postprocessing approaches have been proposed to remove forecast errors and spread deficiencies.

Univariate calibration techniques such as ensemble model output statistics (Thorarinsdottir and Gneiting 2010), Bayesian model averaging (Sloughter et al. 2010), or variance deficit calibration (Alessandrini et al. 2013) have been developed for the calibration of wind speed ensemble forecasts. The directional dependency of the wind farm efficiency also requires calibrated forecasts of the wind direction for the operation of wind farms. The correlation of the zonal and meridional wind components *u* and *υ* justifies the development of bivariate calibration approaches instead of focusing on wind speed and direction individually (Pinson 2012; Schuhen et al. 2012; Möller et al. 2013; Schefzik et al. 2013).

However, some of these calibration methods have never been (i) compared with other calibration methods, (ii) applied to measurements instead of model analyses, and (iii) evaluated at single off- and onshore sites to study site-dependent improvements relative to the uncalibrated wind ensemble forecasts. Furthermore, all aforementioned calibration methods were exclusively applied to ensemble forecasts of 10-m wind speed and direction, although ensemble forecasts close to hub height are important for wind energy applications.

Therefore, the objective of our paper is to provide a comparison of state-of-the-art calibration methods that are promising for wind energy applications and to apply these calibration methods to 100-m measurements at distinct off- and onshore sites. The measurement towers are located in regions of largely differing site characteristics to study not only calibration-dependent but also site-dependent improvements relative to the uncalibrated wind ensemble forecasts. We apply the calibration methods to 100-m wind ensemble forecasts of the Ensemble Prediction System of the European Centre for Medium-Range Weather Forecasts (ECMWF).

## 2. Data

### a. Tower measurements

To evaluate and compare the calibrated wind ensemble forecasts, wind measurements are selected from the onshore meteorological towers at Karlsruhe, Hamburg, Cabauw, and Falkenberg and the offshore research platforms Fino1, Fino2, and Fino3 (Fig. 1). The meteorological towers are located in regions with strongly differing site characteristics, which affect forecast errors (also see Drechsel et al. 2012) and, thus, the potential for improvements achieved by ensemble calibration as we show in sections 5 and 6.

The measurements from the different sites are available at a temporal resolution of 10 min. Quality control of the wind measurements is done following Jiménez et al. (2010). The quality control includes temporal consistency checks for abnormally low and high variations of wind speed and wind direction, plausibility checks for neighboring measurement levels if available, and exceedance checks for lower and upper thresholds.

The immediate surroundings of the 305-m tower near Hamburg are flat but inhomogeneous and characterized by a mix of rural development and, particularly to the west and north of the tower, urban and industrial areas, which lead to a high surface roughness (Brümmer et al. 2012). Available wind measurements are linearly interpolated to 100 m from the 50- and 110-m levels. The 200-m Karlsruhe tower, which records measurements at 100-m height, is located 10 km north of Karlsruhe in a forest in the Rhine Valley between the mountainous terrain of the Vosges Mountains and the Black Forest (Kalthoff and Vogel 1992). Thus, the Hamburg and Karlsruhe sites are both characterized by high surface roughness and are hereafter called *high-roughness sites*. The empirical wind direction distribution at the Karlsruhe tower has a distinct maximum for southwesterly wind directions (not shown), which is caused to a large extent by the southwest–northeast alignment of the valley and the surrounding mountains.

The measurement tower at Cabauw is surrounded by flat terrain with low vegetation and is located 50 km away from the North Sea (Baas et al. 2009). Measurements are linearly interpolated to 100 m from 80- and 140-m levels. The area around the 98-m Falkenberg tower is characterized by mixed farmland and forest vegetation (Beyrich and Adam 2007). Wind measurements are taken at 98 m.

At these onshore sites, the empirical wind speed distributions are unimodal and sharp (Fig. 2). For the time period June 2010 to December 2012, the 100-m average wind speed is lowest at the Karlsruhe tower, followed by Falkenberg, Hamburg, and Cabauw.

The offshore research mast Fino1 is about 45 km north of the island of Borkum in the North Sea (Neumann et al. 2004), Fino2 is about 33 km north of the island of Rügen in the Baltic Sea, and Fino3 about 80 km west of the island of Sylt in the North Sea. At Fino1 (Fino2) wind speed is measured at 100 m (102 m) and wind direction at 90 m (91 m). At Fino3, wind speed and direction are measured at 100 m. Wind speed and direction measurements at Fino1 were provided with corrections based on comparisons with an additional anemometer and a lidar measurement campaign (Westerhellweg et al. 2012).

At Fino1 and Fino3, prevailing southwesterly to westerly winds are influenced by homogeneous offshore conditions with the result that these sites experience marine atmospheric boundary layer characteristics. In near-coastal areas, onshore atmospheric boundary layer characteristics can persist above internal boundary layers that have marine boundary layer characteristics for long distances, particularly for stable thermal stratification (Garratt and Ryan 1989, among others). Since Fino2 is less than 40 km away from the coast, Fino2 can experience a mixture of marine and onshore boundary layers for prevailing westerly winds. The empirical wind speed distributions at the offshore sites Fino1, Fino2, and Fino3 are considerably wider than the onshore ones but also single peaked (Fig. 2).

### b. Ensemble forecasts

Several statistical evaluations of the global ECMWF Ensemble Prediction System indicate that the uncalibrated ECMWF ensemble forecasts are superior to forecasts from other global ensemble prediction systems (Buizza et al. 2005; Hagedorn et al. 2012; among others). We use instantaneous ECMWF wind vector and speed ensemble forecasts every 3 h at 100-m height as input to this study. We consider five days as the maximum forecast horizon since forecast days 1 to 5 are most relevant for wind energy applications. On 26 January 2010, the horizontal resolution of the ECMWF Ensemble Prediction System was changed to a spectral truncation at wavenumber 639 (Miller et al. 2010), which corresponds to 0.25° × 0.25°. The ensemble forecasts consist of 50 perturbed forecasts and one unperturbed control forecast that are generated twice daily, starting from 0000 and 1200 UTC (Leutbecher and Palmer 2008). Since January 2010, the ECMWF has archived the 100-m ensemble wind forecasts in addition to the 10-m winds. For this reason, we use 100-m ECMWF wind ensemble forecasts in this study from January 2010 onward. A horizontal bilinear interpolation of the forecasts to the geographic coordinates at each site is applied to obtain local forecasts.

## 3. Verification methods

As stated in Murphy and Winkler (1987), forecast verification is the investigation of the properties of the joint distribution of forecasts and measurements. A detailed description of univariate verification methods for ensemble forecasts is found in Jolliffe and Stephenson (2003) and Wilks (2011). Gneiting et al. (2008) discuss ensemble forecast verification for vector-valued quantities (such as wind vectors) of discrete ensemble forecasts or density forecasts. Diagnostic tools to evaluate ensemble forecasts check the reliability (also called calibration) and sharpness of the forecasts. Reliability refers to the statistical consistency between the ensemble forecasts and the measurements and sharpness to the concentration of the predictive distribution. While reliability is a joint property of the ensemble forecasts and the measurements, sharpness is a property of the forecasts only (Gneiting et al. 2007). As stated in Wilks (2011), “sharp forecasts will be accurate only if they also exhibit good reliability, or calibration, and an important goal is to maximize sharpness without sacrificing calibration.”

To assess the reliability of probabilistic forecasts of wind speed, we use the Talagrand diagram or rank histogram (Hamill and Colucci 1997; Talagrand et al. 1997), which is the histogram of aggregated verification ranks over a number of individual forecast cases. The analogs of the Talagrand diagram for multivariate quantities are the multivariate rank histogram and the minimum spanning tree histogram (Gneiting et al. 2008). In this study, we use the multivariate rank histogram. It can be interpreted analogously to the Talagrand histogram: An ensemble forecast is calibrated if the rank histogram is uniform. While a U-shaped histogram indicates underdispersive ensemble forecasts, a bell-shaped histogram indicates overdispersive ensemble forecasts.
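As an illustration of how verification ranks are aggregated into a Talagrand diagram, the following minimal Python sketch counts the rank of each measurement within its ensemble. The function names and the random tie-breaking rule are assumptions made for this example and are not taken from the cited references.

```python
import random


def verification_rank(members, obs, rng=random):
    """Rank of the measurement among the M ensemble members (1 .. M + 1).

    Ties between the measurement and members are broken at random
    (an assumption of this sketch).
    """
    below = sum(1 for x in members if x < obs)
    ties = sum(1 for x in members if x == obs)
    return below + 1 + rng.randint(0, ties)


def rank_histogram(forecasts, observations):
    """Aggregate verification ranks over all forecast cases."""
    m = len(forecasts[0])
    counts = [0] * (m + 1)
    for members, obs in zip(forecasts, observations):
        counts[verification_rank(members, obs) - 1] += 1
    return counts
```

A flat list of counts would indicate a calibrated ensemble, whereas large counts in the first and last bins would correspond to the U shape of an underdispersive ensemble.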

For the quantitative evaluation of univariate scalar quantities such as wind speed, the continuous ranked probability score (CRPS) is a proper scoring rule for the evaluation of ensemble forecasts (Hersbach 2000; Gneiting and Raftery 2007; among others). The CRPS, which is sensitive to the entire permissible range of the parameter of interest, considers both reliability and sharpness and is defined as

$$
\mathrm{CRPS}(F, y) = \int_{-\infty}^{\infty} \left[F(z) - \mathbf{1}\{y \le z\}\right]^{2}\,dz = \mathrm{E}_{F}|X - y| - \frac{1}{2}\,\mathrm{E}_{F}|X - X'| , \tag{1}
$$

where **1** is the indicator function, *X* and *X*′ are independent random variables with cumulative distribution function *F*, and *y* ∈ ℝ is the verifying value (Gneiting and Raftery 2007). For an ensemble forecast *F* = *F*_{ens} with ensemble members *x*_{1}, …, *x*_{M} ∈ ℝ, the continuous ranked probability score can be evaluated as

$$
\mathrm{CRPS}(F_{\mathrm{ens}}, y) = \frac{1}{M}\sum_{m=1}^{M}|x_{m} - y| - \frac{1}{2M^{2}}\sum_{m=1}^{M}\sum_{n=1}^{M}|x_{m} - x_{n}| . \tag{2}
$$
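The ensemble form of the CRPS, that is, the mean absolute error of the members minus half the mean absolute difference between all member pairs, can be sketched in a few lines of Python. The function name `crps_ensemble` is chosen for this illustration only.

```python
def crps_ensemble(members, obs):
    """CRPS of a discrete ensemble forecast for a scalar verifying value."""
    m = len(members)
    # Mean absolute difference between the members and the measurement.
    term_obs = sum(abs(x - obs) for x in members) / m
    # Half the mean absolute difference between all pairs of members.
    term_spread = sum(abs(xi - xj) for xi in members for xj in members) / (2 * m * m)
    return term_obs - term_spread
```

For a single-member ensemble the spread term vanishes and the score reduces to the absolute error, which is why the CRPS can be compared directly with deterministic error measures.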

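To evaluate both sharpness and calibration of wind vector forecasts, we employ the proper energy score (ES), which is the multivariate generalization of the continuous ranked probability score (Gneiting et al. 2008). We use the Euclidean version of the energy score for an ensemble forecast *F* = *F*_{ens} with ensemble members *x*_{1}, …, *x*_{M} ∈ ℝ^{2}, which is defined as

$$
\mathrm{ES}(F_{\mathrm{ens}}, y) = \frac{1}{M}\sum_{m=1}^{M}\|x_{m} - y\| - \frac{1}{2M^{2}}\sum_{m=1}^{M}\sum_{n=1}^{M}\|x_{m} - x_{n}\| , \tag{3}
$$

where ‖·‖ denotes the Euclidean norm in ℝ^{2}. For bivariate EMOS wind vector ensembles, we follow Schuhen et al. (2012) and replace the exact energy score (3) by the computationally efficient approximation of the energy score

$$
\mathrm{ES}(F, y) \approx \frac{1}{K}\sum_{i=1}^{K}\|x_{i} - y\| - \frac{1}{2(K-1)}\sum_{i=1}^{K-1}\|x_{i} - x_{i+1}\| , \tag{4}
$$

where *x*_{1}, …, *x*_{K} is a random sample from *F* and *K* denotes the sample size.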
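The following Python sketch contrasts the exact ensemble energy score with the sample-based approximation. It assumes that the approximation replaces the double sum over all member pairs by a single sum over consecutive pairs of a random sample; the function names are hypothetical.

```python
import math


def energy_score(members, obs):
    """Exact energy score of a discrete ensemble of wind vectors (u, v)."""
    m = len(members)
    term_obs = sum(math.dist(x, obs) for x in members) / m
    term_spread = sum(math.dist(xi, xj) for xi in members for xj in members) / (2 * m * m)
    return term_obs - term_spread


def energy_score_approx(sample, obs):
    """Approximate energy score from a random sample of size K >= 2.

    Assumption of this sketch: the spread term uses only consecutive
    pairs, so the sample must be in random (unsorted) order.
    """
    k = len(sample)
    term_obs = sum(math.dist(x, obs) for x in sample) / k
    term_spread = sum(math.dist(sample[i], sample[i + 1]) for i in range(k - 1)) / (2 * (k - 1))
    return term_obs - term_spread
```

The approximation needs *K* − 1 instead of *K*^{2} distance evaluations for the spread term, which is what makes large sample sizes affordable.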

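Over an evaluation set of *N* cases, the ES and CRPS values for each lead time are given by

$$
\mathrm{ES} = \frac{1}{N}\sum_{i=1}^{N}\mathrm{ES}(F_{i}, y_{i}), \qquad \mathrm{CRPS} = \frac{1}{N}\sum_{i=1}^{N}\mathrm{CRPS}(F_{i}, y_{i}) = \int_{-\infty}^{\infty}\mathrm{BS}(z)\,dz ,
$$

where

$$
\mathrm{BS}(z) = \frac{1}{N}\sum_{i=1}^{N}\left[F_{i}(z) - \mathbf{1}\{y_{i} \le z\}\right]^{2}
$$

denotes the Brier score for probability forecasts of the binary event at the threshold value *z* ∈ ℝ (Brier 1950).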

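To evaluate deterministic improvements of the ensemble mean *x̄*, we use the root-mean-square error in its univariate form

$$
\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\bar{x}_{i} - y_{i}\right)^{2}} ,
$$

where *x̄*_{i}, *y*_{i} ∈ ℝ, and its bivariate form

$$
\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left\|\bar{x}_{i} - y_{i}\right\|^{2}} ,
$$

where *x̄*_{i}, *y*_{i} ∈ ℝ^{2}. The root-mean-square error is chosen over the mean absolute error since it penalizes large errors in the model forecast more strongly. Following a study of Tambke et al. (2005), the RMSE can be decomposed into systematic forecast errors and phase errors:

$$
\mathrm{RMSE}^{2} = \mathrm{bias}^{2} + \mathrm{sdbias}^{2} + \mathrm{disp}^{2} .
$$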

The bias is part of the systematic errors and accounts for the difference between the mean values of the ensemble mean forecasts and measurements. The sdbias is the difference between the standard deviations of both. The calculation of disp involves the cross correlation between the forecast and measurements and describes forecast phase errors. The contribution of phase errors and biases to the RMSE of the ensemble mean is discussed in section 5b.
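The decomposition can be checked numerically. The sketch below assumes the common form in which the squared terms add up, RMSE^{2} = bias^{2} + sdbias^{2} + disp^{2}, with disp = [2*σ*_{f}*σ*_{o}(1 − *r*)]^{1/2}, where *σ*_{f} and *σ*_{o} are the population standard deviations of forecasts and measurements and *r* is their cross-correlation coefficient. The function name and the explicit expression for disp are assumptions of this illustration.

```python
import math


def rmse_decomposition(forecast, obs):
    """Return (rmse, bias, sdbias, disp) for paired scalar series."""
    n = len(forecast)
    mean_f = sum(forecast) / n
    mean_o = sum(obs) / n
    # Population (1/n) standard deviations, so that the identity is exact.
    sd_f = math.sqrt(sum((f - mean_f) ** 2 for f in forecast) / n)
    sd_o = math.sqrt(sum((o - mean_o) ** 2 for o in obs) / n)
    cov = sum((f - mean_f) * (o - mean_o) for f, o in zip(forecast, obs)) / n
    r = cov / (sd_f * sd_o)
    bias = mean_f - mean_o
    sdbias = sd_f - sd_o
    # Guard against tiny negative values from floating-point rounding.
    disp = math.sqrt(max(0.0, 2.0 * sd_f * sd_o * (1.0 - r)))
    rmse = math.sqrt(sum((f - o) ** 2 for f, o in zip(forecast, obs)) / n)
    return rmse, bias, sdbias, disp
```

Because disp depends on the correlation only, a calibration that merely shifts or rescales the ensemble mean can reduce bias and sdbias but leaves the phase error untouched.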

To test the statistical significance of improvements in the RMSE, CRPS, and Brier score (BS) achieved with the different calibration methods, we follow Efron (1979) and Bröcker and Smith (2007) and calculate confidence intervals with a resampling technique called the *bootstrap*. We repeat the bootstrap 1000 times and calculate 5%–95% confidence intervals from the resulting bootstrap samples. However, note that serial correlations of forecast errors can lead to an inflation of the confidence intervals of forecast verification statistics (Pinson et al. 2010; Wilks 2010). Wilks (2010) has investigated this effect for sampling distributions of the Brier score and showed that considering serial correlations can be important in certain forecast cases, but is far from trivial even in simple setups. In this study, we do not consider serial correlations of forecast errors since maximum serial correlations are below 0.25.
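A minimal sketch of such a bootstrap for the mean of a verification score is given below. It resamples the individual scores with replacement and therefore, like the procedure described above, ignores serial correlation. The function name, the fixed seed, and the simple order-statistic quantile rule are assumptions of this example.

```python
import random


def bootstrap_interval(scores, n_boot=1000, lower=0.05, upper=0.95, seed=0):
    """Bootstrap confidence interval for the mean of a list of scores."""
    rng = random.Random(seed)
    n = len(scores)
    means = []
    for _ in range(n_boot):
        # Draw n scores with replacement and store their mean.
        resample = [scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    # Order statistics closest to the requested quantile levels.
    return means[int(lower * (n_boot - 1))], means[int(upper * (n_boot - 1))]
```

Applied to the per-case CRPS values of two calibration methods, non-overlapping intervals would suggest a significant difference in the mean score.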

## 4. Calibration methods

### a. Generalities on wind ensemble calibration

Ensemble forecasts that are directly taken as output from numerical weather prediction models have deterministic biases and spread deficits. The spread deficit (also called underdispersion) is indicated by an ensemble standard deviation that is considerably lower than the root-mean-square error between the ensemble mean and measurements. To remove systematic forecast errors and ensemble spread deficits (i.e., to obtain reliable ensemble forecasts), various calibration methods for wind ensemble forecasts were developed in recent years.

Two state-of-the-art univariate calibration methods that yield full predictive distributions are ensemble Bayesian model averaging (BMA; Raftery et al. 2005; Sloughter et al. 2010) and ensemble model output statistics (EMOS; Gneiting et al. 2005; Thorarinsdottir and Gneiting 2010). Sloughter et al. (2010) propose an ensemble BMA implementation where the kernel of the mixture distribution is a gamma distribution. The EMOS predictive distribution of Thorarinsdottir and Gneiting (2010) is a single parametric distribution in which the kernel is truncated normal. We select EMOS over BMA since Thorarinsdottir and Gneiting (2010) showed that BMA yields calibrated forecasts of comparable predictive performance but at a higher computational cost. This is in agreement with results we obtained with the BMA calibration (not shown).

However, time trajectories of ensemble forecasts, also referred to as scenarios, are preferable to predictive densities. Ensemble forecasts with temporal dependence structures provide an extra level of information and are a prime input to a large class of stochastic optimization approaches where a representation of time interdependencies is required (Pinson 2013). Variance deficit calibration (VDC) as described in Alessandrini et al. (2013) is a univariate calibration method for wind speed ensembles that yields ensemble trajectories. Thus, we select VDC as a second univariate calibration method.

Instead of focusing on wind speed and direction individually, the bivariate calibration of wind ensemble forecasts aims at jointly considering the zonal and meridional *u* and *υ* wind components. Particularly in regions with specific wind regimes, the wind components appear to be significantly correlated (Pinson 2012, among others), which increases the need for bivariate approaches to calibrate wind ensemble forecasts. Although wind speed ensemble forecasts can be sufficient for managing the storm control of wind turbines for wind speed forecasts near cutoff, the wind farm efficiency can be wind direction dependent. Thus, the technical management of wind farms that are located in specific wind regimes requires both ensemble wind speed and direction forecasts that were calibrated with bivariate approaches.

Several bivariate calibration methods were developed in recent years. Schefzik et al. (2013) propose a multistage procedure called ensemble copula coupling (ECC): In terms of wind ensemble calibration, each wind component is calibrated with univariate approaches, a discrete sample is drawn from each univariate predictive distribution, and the sampled values are rearranged in the rank order structure of the raw ensemble. Alternatively, Möller et al. (2013) propose a similar multistage procedure in which the calibrated univariate predictive distributions for *u* and *υ* are tied together with a Gaussian copula. The bivariate calibration by Pinson (2012) [recursive and adaptive wind vector calibration (AUV)] can be viewed as a variant of ECC and is a computationally efficient recursive and adaptive approach that yields ensemble wind vector trajectories. Schuhen et al. (2012) propose a bivariate extension to EMOS (UVEMOS) where the calibrated probabilistic forecasts take the form of bivariate normal distributions, while Sloughter et al. (2013) develop a bivariate extension to BMA that results in a mixture of bivariate, power-transformed normal densities. We select AUV and UVEMOS as well as ECC combined with EMOS as further calibration methods for this study.

### b. Univariate approaches

#### 1) Variance deficit calibration

VDC is a straightforward technique for the calibration of ensemble wind speed forecasts. The first step of the variance deficit calibration is a transformation of the ensemble forecasts and measurements with a logit link function to carry out the calibration in a simplified framework and to approach normally distributed data (Alessandrini et al. 2013). To reduce the ensemble mean bias, Alessandrini et al. (2013) propose the usage of a neural network. In this study, we replace the neural network approach with a bias correction in which the bias is defined as the mean difference between the logit-transformed ensemble mean and measurements and is calculated for sliding training windows (see section 4d). The variance deficit coefficient is then defined as the ratio between the RMSE and the ensemble standard deviation (spread). The RMSE is calculated between the bias-corrected ensemble mean and the measurements. Both RMSE and spread are calculated over the training data given by the sliding training window. The variance deficit coefficient is then applied to the logit-transformed and bias-corrected wind speed ensemble to remove the spread deficit.
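The sequence of steps can be sketched as follows. This is an illustration under several assumptions that are not taken from Alessandrini et al. (2013): wind speeds are scaled into (0, 1) by a hypothetical maximum speed `V_MAX` before the logit transform, the spread over the training window is the square root of the mean ensemble variance, and all function names are invented for this example.

```python
import math

V_MAX = 40.0  # hypothetical scaling speed in m/s; speeds must lie strictly in (0, V_MAX)


def logit(v):
    p = v / V_MAX
    return math.log(p / (1.0 - p))


def inv_logit(z):
    return V_MAX / (1.0 + math.exp(-z))


def _mean(values):
    return sum(values) / len(values)


def fit_vdc(train_ens, train_obs):
    """Estimate bias and variance deficit coefficient from a training window."""
    z_ens = [[logit(v) for v in ens] for ens in train_ens]
    z_obs = [logit(v) for v in train_obs]
    means = [_mean(ens) for ens in z_ens]
    bias = _mean([m - o for m, o in zip(means, z_obs)])
    # RMSE of the bias-corrected ensemble mean in logit space.
    rmse = math.sqrt(_mean([(m - bias - o) ** 2 for m, o in zip(means, z_obs)]))
    # Spread: square root of the mean ensemble variance (assumption).
    spread = math.sqrt(_mean([_mean([(x - m) ** 2 for x in ens]) for ens, m in zip(z_ens, means)]))
    return bias, rmse / spread


def apply_vdc(ens, bias, coeff):
    """Shift the ensemble mean by the bias and dilate the members around it."""
    z = [logit(v) for v in ens]
    m = _mean(z)
    return [inv_logit((m - bias) + coeff * (x - m)) for x in z]
```

A coefficient larger than one indicates an underdispersive raw ensemble, in which case the members are pushed away from the bias-corrected mean.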

#### 2) Ensemble model output statistics

EMOS or nonhomogeneous Gaussian regression is a parametric regression model that was first introduced by Gneiting et al. (2005). The univariate EMOS method fits a normal distribution

$$
\mathcal{N}(\mu, \sigma^{2}) = \mathcal{N}\left(a + b_{1}x_{1} + \cdots + b_{M}x_{M},\; c + dS^{2}\right)
$$

around the ensemble members *x*_{1}, …, *x*_{M} ∈ ℝ. The quantity *σ*^{2} is the predictive variance, which is a linear function of the ensemble variance *S*^{2} with the coefficients *c* and *d*; *μ* is the bias-corrected predictive mean with regression coefficients *a* and *b*_{1}, …, *b*_{M}. Since the single members of the ECMWF Ensemble Prediction System with *M* = 51 ensemble members are exchangeable, each ensemble member is given the same weight *b*_{1} = *b*/51, …, *b*_{M} = *b*/51. The spread parameters *c* and *d* are constrained to be nonnegative. EMOS coefficients *a*, *b*, *c*, and *d* are fit with a training function based on the continuous ranked probability score. The minimization of the objective function

$$
\frac{1}{N}\sum_{i=1}^{N}\mathrm{CRPS}\left[\mathcal{N}(\mu_{i}, \sigma_{i}^{2}),\, y_{i}\right] ,
$$

that is, the mean CRPS over the *N* forecast–measurement pairs in the training data, is done using the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm as implemented in the R language (Gneiting et al. 2005).

To take the nonnegativity of wind speed into account, we follow Thorarinsdottir and Gneiting (2010) and use a truncated normal predictive distribution 𝒩^{0}(*μ*, *σ*^{2}) having a cutoff at zero. Recall that the variance deficit calibration accounted for the nonnegativity of wind speed by applying a logit transform to the ensemble members and measurements. EMOS then yields truncated normal probability density functions for wind speed.
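To illustrate the kind of training function that is minimized, the sketch below evaluates the mean CRPS of EMOS predictive distributions over a training set. For brevity it uses the well-known closed form of the CRPS for an untruncated normal distribution; the truncated normal used for wind speed has a different closed form that is not reproduced here. The function names, the use of the population variance for *S*^{2}, and the omission of the optimizer are assumptions of this example.

```python
import math
from statistics import NormalDist, mean, pvariance

_STD = NormalDist()  # standard normal distribution


def crps_normal(mu, sigma, y):
    """Closed-form CRPS of a normal predictive distribution N(mu, sigma^2)."""
    z = (y - mu) / sigma
    return sigma * (z * (2.0 * _STD.cdf(z) - 1.0) + 2.0 * _STD.pdf(z) - 1.0 / math.sqrt(math.pi))


def emos_objective(params, train_ens, train_obs):
    """Mean CRPS over the training pairs for coefficients (a, b, c, d).

    Requires c + d * S^2 > 0 for every training case.
    """
    a, b, c, d = params
    total = 0.0
    for ens, y in zip(train_ens, train_obs):
        mu = a + b * mean(ens)                      # equal member weights b / M
        sigma = math.sqrt(c + d * pvariance(ens))   # predictive standard deviation
        total += crps_normal(mu, sigma, y)
    return total / len(train_obs)
```

Any numerical optimizer can then be applied to `emos_objective`, with *c* and *d* kept nonnegative, for example by optimizing over their square roots.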

### c. Bivariate approaches

#### 1) Ensemble copula coupling with EMOS (ECCEMOS)

The ensemble copula coupling (ECC) as described by Schefzik et al. (2013) is a multistage procedure that can be applied to calibrate wind vector ensemble forecasts. The procedure (i) applies univariate postprocessing such as EMOS or ensemble BMA to each wind component *u* and *υ* of the wind ensemble forecast, (ii) draws discrete samples of the same size as the raw ensemble from each univariate calibrated predictive distribution, and (iii) reorders the sampled values in the rank order structure of the raw ensemble (Schefzik et al. 2013).

In this study, we apply EMOS as proposed by Gneiting et al. (2005) to fit a normal distribution 𝒩(*μ*, *σ*^{2}) around the *u* and *υ* ensemble forecasts individually. To obtain a discrete sample of size *M* = 51, we take equidistant quantiles from the cumulative distribution function. Schefzik et al. (2013) refer to this approach as ECC-Q, which outperforms other quantization approaches such as ECC-T. ECC-T fits a parametric, continuous cumulative distribution function to the raw ensemble distribution and then extracts the quantiles that correspond to the percentiles of the raw ensemble values (Schefzik et al. 2013).
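The quantization and reordering steps of ECC-Q for a single wind component can be sketched as follows. The quantile levels *m*/(*M* + 1) and the function name are assumptions of this illustration; the predictive mean and standard deviation would come from the EMOS fit.

```python
from statistics import NormalDist


def ecc_q(raw_members, mu, sigma):
    """Draw equidistant quantiles from N(mu, sigma^2) and reorder them
    according to the rank order of the raw ensemble members."""
    m = len(raw_members)
    dist = NormalDist(mu, sigma)
    # Equidistant quantiles in ascending order.
    quantiles = [dist.inv_cdf(k / (m + 1)) for k in range(1, m + 1)]
    # Indices of the raw members sorted from smallest to largest value.
    order = sorted(range(m), key=lambda i: raw_members[i])
    calibrated = [0.0] * m
    for rank, idx in enumerate(order):
        # The member with the rank-th smallest raw value gets the rank-th quantile.
        calibrated[idx] = quantiles[rank]
    return calibrated
```

Applying the function to the *u* and *υ* components separately keeps each member's rank in both components, so the dependence structure of the raw ensemble carries over to the calibrated wind vectors.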

#### 2) Recursive and adaptive wind vector calibration

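AUV by Pinson (2012) can be viewed as a variant of ECC-T in which the wind components are recursively estimated via bias correction and dilation in a bivariate normal framework 𝒩_{2}(**μ**, **Σ**) with the means **μ** = [*μ*_{u}, *μ*_{υ}]^{T} and the variance–covariance matrix **Σ**. The bivariate normal density function is given by

$$
f(\boldsymbol{\upsilon}) = \frac{1}{2\pi\sigma_{u}\sigma_{\upsilon}\sqrt{1-\rho_{u\upsilon}^{2}}} \exp\left\{-\frac{1}{2(1-\rho_{u\upsilon}^{2})}\left[\frac{(u-\mu_{u})^{2}}{\sigma_{u}^{2}} + \frac{(\upsilon-\mu_{\upsilon})^{2}}{\sigma_{\upsilon}^{2}} - \frac{2\rho_{u\upsilon}(u-\mu_{u})(\upsilon-\mu_{\upsilon})}{\sigma_{u}\sigma_{\upsilon}}\right]\right\} , \tag{12}
$$

where **υ** = [*u*, *υ*]^{T} is the wind vector, *μ*_{u} and *μ*_{υ} are the means along the *u* and *υ* components, *σ*_{u} and *σ*_{υ} are the respective standard deviations, and *ρ*_{uυ} is the correlation of the wind components.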

The method is a recursive maximum likelihood approach in which the model parameters at a given time *t* and lead time *k* minimize the objective function

$$
S_{t}(\boldsymbol{\theta}, \boldsymbol{\gamma}) = -\frac{1}{n_{\lambda}}\sum_{i \le t}\lambda^{t-i}\,\ln f(\boldsymbol{\upsilon}_{i}) , \tag{13}
$$

where *λ* ∈ (0, 1) is the forgetting factor, *n*_{λ} = (1 − *λ*)^{−1} is a normalization factor, and *f*(**υ**_{i}) is the bivariate normal density at the observed wind vector **υ**_{i} given the mean and variance model parameters **θ** and **γ**. These parameters determine the bias-correction and dilation factors that lead to the corrected ensemble means *μ*_{u} and *μ*_{υ} and standard deviations *σ*_{u} and *σ*_{υ}. Note that the dilation model uses only a univariate scaling along *u* and *υ* and that the proposed bias correction model is a bidimensional bias correction in the (*u*, *υ*) plane. The latter is not the case for the bivariate EMOS technique proposed by Schuhen et al. (2012).

The adaptive and recursive estimation of the bias correction and dilation factors in a maximum likelihood framework (13) is “equivalent to minimizing the logarithmic scoring rule known as ignorance” for bivariate probabilistic forecasts (also see Roulston and Smith 2002), while EMOS minimizes an objective function based on the continuous ranked probability score. The resulting calibrated ensemble consists of bias-corrected and dilated temporal ensemble trajectories rather than predictive probability density functions.

The speed of adaptivity is controlled by the exponential forgetting factor *λ*. This is advantageous since the model smoothly adapts to changes in wind dynamics. The parameter update requires only the last set of ensemble forecasts and measurements. Thus, the method is computationally cheap. Using a forgetting factor is similar to downweighting past forecast–measurement pairs. Temporally distant forecast–measurement pairs receive a smaller weight than recent pairs. The sliding training window approach (e.g., Thorarinsdottir and Gneiting 2010; Sloughter et al. 2010) gives the same weights to all forecast–measurement pairs in the sample.

#### 3) Bivariate EMOS

By design, bivariate AUV generates physically consistent ensemble trajectories with the same number of ensemble members and the same bivariate rank correlation coefficients as the raw ensemble (see discussion in Schuhen et al. 2012). In contrast, UVEMOS results in bivariate normal density forecasts 𝒩_{2}(**μ**, **Σ**) with the bivariate normal probability density function given by (12). The implementation of UVEMOS closely follows the proposal of Schuhen et al. (2012): First, the correlation coefficient *ρ*_{uυ} is determined in an offline correlation model that is fitted for each location in the test period separately and modeled as a trigonometric function of the ensemble mean wind direction *θ*:

The parameters *r*, *s*, and *ϕ* are estimated with weighted nonlinear least squares with the weights being proportional to the number of measurements in each wind direction sector. Second, the means *μ*_{u} and *μ*_{υ} are given by

$$
\mu_{u} = a_{u} + b_{u}\,\bar{u}, \qquad \mu_{\upsilon} = a_{\upsilon} + b_{\upsilon}\,\bar{\upsilon} ,
$$

where the parameters *a*_{u}, *a*_{υ}, *b*_{u}, and *b*_{υ} for the bias correction are estimated from the training data by standard linear least squares regression. The terms *ū* and *ῡ* denote the ensemble means of the *u* and *υ* components. To estimate the corrected ensemble variances *σ*_{u}^{2} and *σ*_{υ}^{2}, we employ a maximum likelihood framework where the optimization is performed with the BFGS algorithm as proposed by Schuhen et al. (2012).

### d. Implementation specifics

#### 1) Sampling

The uncalibrated ECMWF ensemble forecasts consist of *M* = 51 ensemble members. By design, AUV, VDC, and ECCEMOS ensembles are discrete ensemble forecasts with the same number of ensemble members as the uncalibrated ensemble forecasts. EMOS yields truncated normal probability density functions for wind speed. To recover a synthetic ensemble of realizations out of the predictive probability density function, we follow Gneiting et al. (2005) and sample *M* = 51 equidistant quantiles *x*_{1}, …, *x*_{M} ∈ ℝ from the cumulative distribution function at levels *m*/(*M* + 1) for *m* = 1, …, *M*.
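The recovery of a synthetic ensemble from a truncated normal predictive distribution can be sketched as follows. The sketch assumes that the truncated normal with cutoff at zero is obtained by renormalizing the probability mass of 𝒩(*μ*, *σ*^{2}) above zero, so that its quantile function can be expressed through the standard normal quantile function. The function name is hypothetical.

```python
from statistics import NormalDist

_STD = NormalDist()  # standard normal distribution


def truncated_normal_quantiles(mu, sigma, m=51):
    """Equidistant quantiles at levels k / (m + 1), k = 1..m, of a normal
    distribution N(mu, sigma^2) truncated at zero."""
    p0 = _STD.cdf(-mu / sigma)  # mass of the untruncated normal below zero
    members = []
    for k in range(1, m + 1):
        level = k / (m + 1)
        # Map the level into the retained part (p0, 1) of the untruncated CDF.
        members.append(mu + sigma * _STD.inv_cdf(p0 + level * (1.0 - p0)))
    return members
```

All sampled members are strictly positive by construction, so the synthetic ensemble respects the nonnegativity of wind speed.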

In case of UVEMOS, which yields bivariate normal probability density functions, we follow the proposal of Schuhen et al. (2012) and draw random samples from the bivariate distribution to yield discrete wind vector ensembles. For the rank histograms, we draw *M* = 51 random samples *x*_{1}, …, *x*_{M} ∈ ℝ^{2} from the distribution to facilitate the comparison with the other ensemble forecasts. For the calculation of the energy score, we draw a random sample *x*_{1}, …, *x*_{K} ∈ ℝ^{2} with *K* = 500, which requires us to use the computationally efficient approximation of the energy score [(4)]. A sample size of *K* = 500 is necessary to achieve a stable calculation of deterministic and probabilistic scores.

#### 2) Parameter estimation with sliding training windows

The parameters of EMOS, VDC, ECCEMOS, and UVEMOS are estimated using the sliding training window approach. Thus, the improvements of the deterministic and probabilistic scores at each site are dependent on the length of the sliding training windows, that is, the number of forecast–measurement pairs used in the calibration method. ECMWF forecasts initialized at 0000 and 1200 UTC, respectively, are calibrated using past forecasts initialized at the same time of the day. A larger training sample yields a better estimation of the parameters, while a shorter training sample adapts faster to changes in weather conditions.

Increasing the training period from 20 to 50 days substantially decreases the probabilistic day-ahead scores at all sites (Fig. 3), while a further increase to about 70 days yields slight additional improvements. Choosing the same number of training days at all sites is advantageous for simplicity reasons. The best compromise for the differing sites in terms of calibration improvements is found for 70 training days, which corresponds to 70 forecast–measurement pairs.

#### 3) Parameter estimation with exponential forgetting

The forgetting factor *λ* is the parameter in the recursive and adaptive wind vector calibration of Pinson (2012) that controls the exponential downweighting of past forecast–measurement pairs [(13)]. A small *λ* leads to a faster adaptation to changes in weather conditions. The sensitivity of the improvements is evaluated in terms of the energy score for forgetting factors ranging between *λ* = 0.980 and *λ* = 0.999 (Fig. 4). Choosing an even smaller forgetting factor leads to unstable results at some sites associated with the inversion of the covariance matrix (see discussion in Pinson 2012). Pinson (2012) suggests carrying out a sufficiently large number of iterations before updating the covariance matrix for the first time to avoid numerical instabilities. For forgetting factors smaller than 0.980, however, the increasing number of sufficient iterations requires dismissing a large part of the data in the evaluation period, which is not acceptable in this study because of the limited length of the time series.

The calibrated ensemble forecasts at Cabauw, Fino1, Fino2, and Fino3 are almost insensitive to changes in the forgetting factor within the depicted range, while the forecasts at the onshore sites with higher surface roughness indicate a degrading calibration performance with an increasing forgetting factor (Fig. 4). The energy score at Karlsruhe is particularly sensitive to the forgetting factor. Since observed winds near Karlsruhe are strongly forced by the complex topography of the Rhine Valley, which is poorly resolved in the ECMWF Ensemble Prediction System, the forecast error is strongly wind direction dependent. A faster adaptation of parameters to changes in wind patterns is advantageous particularly at this site. Therefore, we choose 0.980 as the optimal forgetting factor at all sites.

Since exponential downweighting of past forecast–measurement pairs appears advantageous because parameters adapt smoothly to changes in wind patterns, we also implement versions of EMOS, ECCEMOS, and UVEMOS where past measurements receive less weight. The weights for the exponential downweighting are defined as *w*_{i} = *λ*^{i −1}, where *i* = 1, …, *N* denotes the training index with *N* = 150 days. With increasing training index, the weights decrease exponentially, so that older training data are downweighted (Fig. 5). Considering more than 150 past training days for the exponential downweighting does not increase calibration performance.
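The weight definition can be written down directly. The following is a small sketch in which the function name `forgetting_weights` is hypothetical and index *i* = 1 is taken to be the most recent training day, in line with the description above.

```python
import numpy as np

def forgetting_weights(lam, n_days=150):
    """Weights w_i = lam**(i - 1) for training indices i = 1, ..., n_days."""
    i = np.arange(1, n_days + 1)
    return lam ** (i - 1)

w = forgetting_weights(0.970)
# w[0] is the weight of the most recent day, w[-1] that of the oldest day.
```

With *λ* = 0.970, the oldest of the 150 weights is roughly 0.01, so days further in the past would contribute very little to the fit.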

In the EMOS and ECCEMOS methods, the CRPS training function is weighted with *w*_{i}. Weighted UVEMOS is implemented by weighting the linear least squares regression model for the mean parameter estimation and the maximum likelihood function for the variance parameter estimation with the weights *w*_{i}. These modified methods are denoted as EMOS-W, ECCEMOS-W, and UVEMOS-W. Similarly, a VDC-W method is implemented by a weighted bias correction and variance deficit estimation. A sensitivity study that is not shown here indicated that the improvement of each method over its implementation without weighting is highest with an exponential forgetting factor of *λ* = 0.970 (Fig. 5).
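As an illustration of the weighted regression step, a weighted linear least squares fit can be obtained by scaling each row of the design matrix with the square root of its weight. This is a sketch under assumptions: the function name `weighted_linear_fit` and the single-predictor model *y* ≈ *a* + *b* *x* are chosen here for illustration and are not taken from the paper.

```python
import numpy as np

def weighted_linear_fit(x, y, w):
    """Minimize sum_i w_i * (y_i - a - b * x_i)**2 and return (a, b).

    x : predictor (e.g., ensemble mean of one wind component), shape (n,)
    y : measurements, shape (n,)
    w : nonnegative weights, shape (n,)
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    sw = np.sqrt(np.asarray(w, dtype=float))
    A = np.column_stack([np.ones_like(x), x])     # intercept and slope columns
    # Scaling rows by sqrt(w) turns the weighted problem into an ordinary one.
    coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return coef[0], coef[1]
```

Passing weights of the form *λ*^{i −1} makes the fit follow the most recent forecast–measurement pairs more closely than a fit with equal weights.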

#### 4) Independent adaptive wind vector calibration (AUV_IND)

As mentioned in section 4c, the adaptive and recursive wind vector calibration by Pinson (2012) consists of a bias-correction and a dilation model, where the bias-correction model is formulated as a bidimensional bias correction in the (*u*, *υ*) plane. In UVEMOS and ECCEMOS, by contrast, *u* and *υ* are bias corrected independently. To understand differences between the methods in more detail, we also implemented a one-dimensional bias-correction model in AUV (AUV_IND), in which the bias-correction model structure is degraded to bias correct *u* and *υ* independently.

Note that Pinson (2012) also analyzed the improvements of bivariate AUV over univariate AUV and found improvements of 1% on average. However, that adaptive bivariate calibration of ECMWF ensemble forecasts was carried out against ECMWF analyses. Calibration against measurements might lead to different results, as discussed in section 5.

#### 5) Bivariate EMOS with bidimensional bias correction (UVEMOS-W-EXT)

Furthermore, we propose a variant of the bivariate EMOS method of Schuhen et al. (2012) that extends the one-dimensional bias-correction model [(15)] to a bidimensional model:

The two additional parameters are estimated together with *a*_{u}, *b*_{u}, *a*_{υ}, and *b*_{υ} with a weighted linear least squares regression.
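For orientation, one form that such a bidimensional extension could take is sketched here. It assumes that the one-dimensional model (15) is linear in the ensemble mean of the respective component; the cross-term coefficients *c*_{u} and *c*_{υ}, as well as the symbols *μ* and the overbars, are placeholders introduced for this sketch and are not the paper's notation.

```latex
\begin{aligned}
\mu_{u} &= a_{u} + b_{u}\,\bar{u} + c_{u}\,\bar{\upsilon},\\
\mu_{\upsilon} &= a_{\upsilon} + b_{\upsilon}\,\bar{\upsilon} + c_{\upsilon}\,\bar{u}.
\end{aligned}
```

Setting *c*_{u} = *c*_{υ} = 0 in this sketch recovers an independent correction of each component.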

## 5. Results

In this section, we present the results of the comparison of the calibration methods at seven distinct off- and onshore tower sites to identify the most promising calibration methods at each site. We evaluate the calibrated and uncalibrated 100-m ECMWF wind ensemble forecasts for the test period from January 2011 to December 2012. Data from the training period were used to optimize the implementation specifics of each method, such as the number of training days or forgetting factor.

### a. Calibrated wind vector ensembles

Before we compare the calibrated ensembles in terms of wind speed, we evaluate the bivariate deterministic and probabilistic properties of AUV, UVEMOS, and ECCEMOS wind vector ensembles in terms of the bivariate RMSE (bRMSE), the energy score, and multivariate rank histograms.

#### 1) Deterministic verification

Improvements in the bRMSE achieved with the bivariate calibration methods describe the deterministic forecast skill of the ensemble mean of the wind vector ensemble. Forecast skill is the relative accuracy of the calibrated ensemble forecast with respect to the uncalibrated ensemble forecast. Day-ahead improvements (27–48 h) in the bRMSE are larger at high-roughness sites such as Karlsruhe and Hamburg since the ECMWF model does not adequately resolve small-scale changes of the surface roughness at these sites, which leads to larger biases (Table 1). At offshore sites, improvements are close to zero because of low forecast biases.
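The two quantities used here can be sketched as follows. The definition of the bRMSE as the root-mean-square Euclidean distance between the ensemble mean wind vector and the measured wind vector, and the definition of the improvement as the relative score reduction in percent, are assumptions of this sketch; the function names are hypothetical.

```python
import numpy as np

def brmse(ens_u, ens_v, obs_u, obs_v):
    """Bivariate RMSE of the ensemble mean wind vector.

    ens_u, ens_v : ensemble members, shape (n_times, n_members)
    obs_u, obs_v : measurements, shape (n_times,)
    """
    du = ens_u.mean(axis=1) - obs_u
    dv = ens_v.mean(axis=1) - obs_v
    return np.sqrt(np.mean(du**2 + dv**2))

def improvement_percent(score_raw, score_cal):
    """Relative reduction of a negatively oriented score, in percent."""
    return 100.0 * (score_raw - score_cal) / score_raw
```

For example, a raw bRMSE of 2.0 m s^{−1} and a calibrated bRMSE of 1.5 m s^{−1} would correspond to an improvement of 25%.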

The deterministic skill of AUV ensembles is higher than that of ECCEMOS and UVEMOS ensembles at all sites. The largest gap between the methods occurs at the high-roughness sites Karlsruhe and Hamburg. The advantage of AUV ensembles over UVEMOS ensembles can be attributed to both the adaptive nature of the AUV method and, in particular, the bidimensional bias-correction model. This is demonstrated by an increase in bRMSE improvement of up to 1% when past forecast–measurement pairs are exponentially downweighted (ECCEMOS-W and UVEMOS-W) and by a decrease of AUV skill when the bias-correction model structures of AUV are degraded to a one-dimensional framework (AUV_IND) (Table 1). Accordingly, extending the bias correction in the bivariate EMOS method to a bidimensional framework (UVEMOS-W-EXT) leads to a large improvement in bRMSE over UVEMOS-W of 7.2% (Karlsruhe) and 2.5% (Hamburg).

#### 2) Probabilistic verification

The bivariate reliability of the calibrated ensemble forecasts is assessed with day-ahead bivariate rank histograms (Fig. 6). The nonuniformity of the bivariate rank histograms for day-ahead forecast horizons indicates that the uncalibrated wind vector ensemble at Fino1, Fino2, and Karlsruhe is underdispersive. AUV, UVEMOS-W-EXT, and ECCEMOS-W correct the underdispersion of the raw ensemble almost entirely, which is shown by uniform rank histograms.

The probabilistic forecast skill is presented as day-ahead improvements in the energy score (Table 1). Differences between the probabilistic skill of the calibrated ensembles are similar to the differences in the deterministic forecast skill. Thus, the adaptive and recursive calibration of wind vector ensembles by Pinson (2012) yields higher forecast skill than ECCEMOS and UVEMOS ensembles. Differences emerge more clearly with increasing complexity of the site characteristics. Implementing both exponential downweighting of past training data and a bidimensional bias-correction model in UVEMOS leads to comparable skill of UVEMOS-W-EXT and AUV, although AUV skill is still highest at all sites.
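For a single forecast case, the energy score of an ensemble can be estimated with the standard sample formula: the mean distance between members and the measurement minus half the mean distance between all member pairs. The sketch below assumes this estimator; the function name `energy_score` and the array layout are choices made here.

```python
import numpy as np

def energy_score(ens, obs):
    """Sample energy score for one forecast case (lower is better).

    ens : ensemble members, shape (n_members, n_dims), e.g. columns (u, v)
    obs : measured vector, shape (n_dims,)
    """
    ens = np.asarray(ens, dtype=float)
    obs = np.asarray(obs, dtype=float)
    # Mean Euclidean distance between each member and the measurement.
    term_obs = np.mean(np.linalg.norm(ens - obs, axis=1))
    # Mean Euclidean distance over all ordered member pairs (i, j).
    diffs = ens[:, None, :] - ens[None, :, :]
    term_spread = np.mean(np.linalg.norm(diffs, axis=2))
    return term_obs - 0.5 * term_spread
```

Averaging this value over all forecast cases gives the score for a site and lead time; with one-dimensional input the same expression reduces to the sample CRPS.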

### b. Calibrated wind speed ensembles

After the discussion of the bivariate properties of ECCEMOS, bivariate EMOS, and AUV ensembles, the forecast verification is extended to calibrated wind speed ensemble forecasts, which allows a direct comparison of ensembles postprocessed with univariate and bivariate calibration methods. AUV and ECCEMOS-W wind vector ensemble forecasts, which consist of 51 members, are used to calculate AUV and ECCEMOS-W wind speed ensemble forecasts. In the case of UVEMOS-W and UVEMOS-W-EXT wind vector ensemble forecasts, we take the random wind vector samples and calculate wind speed samples: 51 random samples for the univariate rank histogram and 500 samples for calculations of the RMSE, CRPS, and BS.
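The conversion from wind vector samples to wind speed samples can be sketched as follows. The sketch assumes that the predictive distribution is a bivariate normal with a given mean vector and covariance matrix; the function name `wind_speed_samples`, its arguments, and the fixed seed are hypothetical.

```python
import numpy as np

def wind_speed_samples(mean_uv, cov_uv, n_samples, seed=0):
    """Draw (u, v) samples from a bivariate normal and return wind speeds.

    mean_uv   : predictive mean of (u, v), shape (2,)
    cov_uv    : predictive covariance of (u, v), shape (2, 2)
    n_samples : number of random samples, e.g. 51 or 500
    """
    rng = np.random.default_rng(seed)
    uv = rng.multivariate_normal(mean_uv, cov_uv, size=n_samples)
    # Wind speed is the Euclidean norm of the horizontal wind vector.
    return np.hypot(uv[:, 0], uv[:, 1])
```

The same norm applies to the 51 members of the AUV and ECCEMOS-W wind vector ensembles, where no random sampling is needed.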

In the following discussion we only consider the exponential downweighting implementations of ECCEMOS, UVEMOS, EMOS, and VDC since the downweighting of past forecasts/measurements leads to deterministic and probabilistic improvements of the calibrated wind speed ensembles of 0.5%–1.5% at almost all sites and for all forecast horizons (not shown).

#### 1) Deterministic verification

We evaluate the deterministic properties of the wind speed ensembles with the ensemble mean RMSE. The RMSE of the uncalibrated ECMWF ensemble increases with forecast horizon and is considerably larger at offshore sites than at onshore sites (Fig. 7). This is due to the higher wind speeds at offshore sites (Fig. 2), which lead to larger RMSE values. The offshore bias of the uncalibrated ensemble mean wind speed is mainly negative (underforecasting bias), while the onshore bias is positive (overforecasting bias) (the bias of representative sites is shown in Fig. 8).

At Fino1, Fino3, and Cabauw the bias is relatively low because of homogeneous offshore conditions at Fino1 and Fino3 and low-roughness conditions at Cabauw. These conditions are adequately resolved in the ECMWF numerical weather prediction model. As discussed in section 2a, the Baltic Sea site Fino2 can experience a mixture of marine and onshore atmospheric boundary layer conditions that might not be accurately represented in the ECMWF model. This might result in the larger underforecasting bias of the uncalibrated ensemble mean at Fino2 relative to Fino1 (Fig. 8). The high-roughness conditions around Karlsruhe and Hamburg induce large wind speed biases at these sites that are reduced by AUV and VDC-W and removed by EMOS-W and UVEMOS-W-EXT.

The main contribution to the RMSE comes from phase errors (Fig. 8). At offshore sites, the calibration methods are not able to reduce the nonlinear phase errors, while bivariate AUV, EMOS-W, ECCEMOS-W, UVEMOS-W, and UVEMOS-W-EXT considerably reduce the phase error at the Karlsruhe site. This phase error reduction is also observable at the Hamburg and Falkenberg sites. The correction of both bias and phase errors improves the RMSE of AUV ensembles by up to 17% (13%) at Hamburg (Karlsruhe) (Fig. 9).
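A common way to separate these error contributions is to split the squared RMSE into a squared bias, a squared difference of the standard deviations (amplitude error), and a dispersion term that depends on the correlation (phase error). The sketch below assumes this decomposition; the terms actually shown in Fig. 8 are defined elsewhere in the paper and may differ in detail, and the function name is hypothetical.

```python
import numpy as np

def rmse_decomposition(fc, ob):
    """Split the RMSE into bias, amplitude (sd_bias), and phase terms.

    fc : ensemble mean forecasts, shape (n,)
    ob : measurements, shape (n,)
    Satisfies rmse**2 == bias**2 + sd_bias**2 + phase**2.
    """
    fc = np.asarray(fc, dtype=float)
    ob = np.asarray(ob, dtype=float)
    err = fc - ob
    bias = err.mean()
    sd_f, sd_o = fc.std(), ob.std()          # population standard deviations
    r = np.corrcoef(fc, ob)[0, 1]            # forecast-measurement correlation
    sd_bias = sd_f - sd_o
    phase = np.sqrt(2.0 * sd_f * sd_o * (1.0 - r))
    rmse = np.sqrt(np.mean(err**2))
    return {"rmse": rmse, "bias": bias, "sd_bias": sd_bias, "phase": phase}
```

In this form, a linear bias correction can remove the bias term directly, whereas the phase term only shrinks if the calibration raises the correlation between forecasts and measurements.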

At these sites and at offshore sites, deterministic AUV skill is slightly higher than the skill achieved with the other methods. The bidimensional bias correction in UVEMOS significantly improves the RMSE over the one-dimensional bias correction, with the result that deterministic UVEMOS-W-EXT skill improves by up to 4%. At sites such as Fino1 and Cabauw, UVEMOS-W-EXT and UVEMOS-W perform worse than the other methods.

#### 2) Probabilistic verification

Ensemble calibration also improves the reliability of the wind speed ensembles by removing the ensemble underdispersion. The spread deficiencies are clearly expressed by the nonuniformity of the uncalibrated wind speed rank histograms at Fino1, Fino2, and Karlsruhe (Fig. 10). Asymmetries in the rank histograms at Fino2 and Karlsruhe indicate that forecast biases are present.

All calibration methods increase the spread of the ensemble forecasts leading to more reliable ensemble forecasts (Fig. 10). EMOS-W and VDC-W ensembles tend to be slightly underdispersive, while UVEMOS-W-EXT ensembles are slightly overdispersive as indicated by the bell-shaped rank histograms.
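A univariate rank histogram counts, for each forecast case, how many ensemble members lie below the measurement. The sketch below ignores ties, which is an assumption that is reasonable for continuous wind speeds; the function name `rank_histogram` is hypothetical.

```python
import numpy as np

def rank_histogram(ens, obs):
    """Counts of the measurement's rank within the ensemble.

    ens : ensemble members, shape (n_cases, n_members)
    obs : measurements, shape (n_cases,)
    Returns an array of length n_members + 1.
    """
    ens = np.asarray(ens, dtype=float)
    obs = np.asarray(obs, dtype=float)
    # Rank 0: measurement below all members; rank n_members: above all.
    ranks = np.sum(ens < obs[:, None], axis=1)
    return np.bincount(ranks, minlength=ens.shape[1] + 1)
```

A U-shaped histogram indicates underdispersion, a bell shape indicates overdispersion, and a flat histogram is consistent with a reliable ensemble.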

We evaluate quantitative improvements of the stochastic properties of the calibrated ensembles with the CRPS and the BS. The CRPS and BS of the uncalibrated ensemble forecasts are shown in Fig. 7. Improvements in the CRPS and BS are generally higher for intraday and day-ahead forecasts and then slowly decrease for larger forecast horizons (Figs. 11 and 12). Since spread deficits of the ECMWF wind ensemble forecasts are largest for shorter lead times, improvements are also larger for short lead times.

The general picture of the probabilistic differences between AUV, UVEMOS-W, VDC-W, EMOS-W, and ECCEMOS-W ensembles is similar to the discussion of the deterministic differences and is not repeated in detail. Intraday improvements in the CRPS reach up to 27%–28% (AUV and UVEMOS-W-EXT) and 23%–24% (EMOS-W and ECCEMOS-W) at the Hamburg site. The largest improvements achieved with AUV, EMOS-W, and UVEMOS-W-EXT are 21%–23% at the Karlsruhe site. Offshore improvements in the CRPS are smaller since systematic forecast errors and ensemble spread deficits are smaller than at onshore sites.

In Fig. 12, we evaluate the improvements in the BS for a wind speed threshold of 10 m s^{−1}. The threshold of 10 m s^{−1} roughly corresponds to the 90th percentile of observed wind speeds at the onshore sites. For even larger thresholds, the sample size of wind ensemble forecasts becomes smaller and bootstrap confidence intervals are larger, which means that differences between the calibrated ensembles become less significant.
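The Brier score for a threshold exceedance can be sketched as follows. The sketch assumes that the forecast probability is the fraction of ensemble members exceeding the threshold and that the event is a strict exceedance; the function name `brier_score` is hypothetical.

```python
import numpy as np

def brier_score(ens, obs, threshold=10.0):
    """Brier score for the event 'wind speed exceeds threshold'.

    ens       : wind speed ensemble, shape (n_cases, n_members), in m/s
    obs       : measured wind speed, shape (n_cases,), in m/s
    threshold : event threshold in m/s
    """
    ens = np.asarray(ens, dtype=float)
    obs = np.asarray(obs, dtype=float)
    prob = np.mean(ens > threshold, axis=1)      # forecast probability
    event = (obs > threshold).astype(float)      # 1 if the event occurred
    return np.mean((prob - event) ** 2)
```

Because the event occurs in only a small fraction of cases at high thresholds, the score rests on few event cases, which is in line with the wider bootstrap confidence intervals noted above.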

AUV considerably improves ensemble forecasts in high wind speed situations, by up to 40% (38%) at the Karlsruhe (Hamburg) site (Fig. 12). Although ECCEMOS-W, UVEMOS-W, and EMOS-W improvements reach up to 33%–39% at the high-roughness Karlsruhe site, AUV improvements in the BS are slightly higher. At offshore sites, improvements in the Brier score are not higher than 10% (Fino1 and Fino3) and 15% (Fino2) with no clear advantage of AUV.

## 6. Discussion and conclusions

The objective of this paper is to systematically compare state-of-the-art postprocessing methods for the calibration of 100-m wind ensemble forecasts and to identify site-dependent forecast improvements over the uncalibrated ensemble forecasts. For that, we calibrate and evaluate 100-m wind ensemble forecasts from the ECMWF Ensemble Prediction System with 100-m wind measurements at four onshore and three offshore meteorological towers in central Europe for forecast horizons of up to 5 days.

The calibration methods used for the comparison are ensemble model output statistics (EMOS; Thorarinsdottir and Gneiting 2010) and variance deficit calibration (VDC; Alessandrini et al. 2013) for wind speed ensembles and bivariate recursive and adaptive wind vector calibration (AUV; Pinson 2012), bivariate EMOS (UVEMOS; Schuhen et al. 2012), and bivariate ensemble copula coupling applied to wind vector forecasts (ECCEMOS; Schefzik et al. 2013). Univariate or bivariate Bayesian model averaging techniques are not considered because of their high computational costs and similar performance compared to EMOS approaches.

In terms of wind speed, bivariate AUV performs better than univariate EMOS and VDC and bivariate ECCEMOS and UVEMOS for deterministic (RMSE) and probabilistic (CRPS) scores at almost all sites. Comparing the bivariate properties of the wind vector ensembles, AUV ensembles have a higher forecast skill than UVEMOS and ECCEMOS ensembles at all sites.

The advantage of AUV ensembles can be attributed to the exponential downweighting of past measurements in AUV and to the specifics of the bidimensional bias-correction model employed in this method. Implementing exponential downweighting for EMOS, ECCEMOS, UVEMOS, and VDC increased the deterministic and probabilistic forecast skill by up to 1%. Thus, exponential downweighting has the slight advantage that parameters smoothly adapt to changes in weather conditions in comparison to the batch estimation in the sliding training window approach. Furthermore, introducing a bidimensional bias-correction model in the UVEMOS method (UVEMOS-W-EXT) instead of a one-dimensional bias correction in the (*u*, *υ*) plane leads to a considerable increase of UVEMOS forecast skill at onshore sites, with the result that UVEMOS onshore skill is close to AUV skill. Thus, the bidimensional view in the bias-correction model of AUV and UVEMOS appears highly advantageous for improving forecast skill. However, one advantage of AUV over bivariate EMOS is its recursive approach: only the last available set of forecast–measurement pairs is required to update the parameters, which lowers computational costs considerably.

The highest improvements in deterministic and probabilistic forecast skill are found for sites such as Karlsruhe and Hamburg. The complex mountainous terrain and forest around the Karlsruhe tower and the urban and industrial areas around the Hamburg tower are not accurately resolved in the ECMWF model, which leads to larger forecast biases and spread deficits of the raw ensemble at these sites. For this reason, ensemble calibration leads to the largest skill improvements at these sites.

Although the UVEMOS-W-EXT ensembles perform similarly to AUV ensembles at high-roughness onshore sites, forecast performance at offshore sites can even degrade. This result might indicate that the correlation coefficients, which are estimated offline by fitting a trigonometric function to the conditional correlation, show unreliable and physically unrealistic behavior at some sites (also see discussion in Schuhen et al. 2012).

The results from AUV and UVEMOS-W-EXT ensemble forecasts have shown that taking into account the correlations between the wind components can be highly important for improving deterministic and probabilistic forecast skill at certain sites. Since the wind farm efficiency is wind direction dependent and some wind farm operators already use ensemble wind forecasts to support the decision-making process of the technical wind farm management, the bivariate recursive and adaptive wind vector calibration is a promising and computationally cheap method for wind energy applications.

## Acknowledgments

The work presented in this study is funded by the national research project Baltic I (FKZ 0325215A, Federal Ministry for Environment, Nature Conservation and Nuclear Safety) and the Ministry for Education, Science and Culture of Lower Saxony. The authors thank the Karlsruhe Institute of Technology (KIT), the Royal Netherlands Meteorological Institute (KNMI), the German Weather Service (DWD), and the Meteorological Institute (MI) of the University of Hamburg for providing the wind measurements of the onshore meteorological measurement masts at Karlsruhe, Cabauw, Falkenberg, and Hamburg. The Project Management Jülich (PTJ) and the Federal Maritime and Hydrographic Agency (BSH) are acknowledged for providing measurements of the offshore research platforms Fino1, Fino2, and Fino3. Numerical weather prediction data are provided by ECMWF. The authors are grateful to Pierre Pinson, Luca delle Monache, Nina Schuhen, Jakob Messner, and Stefano Alessandrini for general discussions about ensemble calibration. Furthermore, we thank the reviewers and the editor for their valuable comments and suggestions.

## REFERENCES

*Boundary Layer Studies and Applications,* R. E. Munn, Ed., Springer, 17–40.

*Forecast Verification: A Practitioner’s Guide in Atmospheric Science.* Wiley, 240 pp.

*ECMWF Newsletter,* No. 124, ECMWF, Reading, United Kingdom, 10–16.

*Proc. ECMWF Workshop on Predictability,* Reading, United Kingdom, ECMWF,