## Abstract

This work evaluates the use of a WRF ensemble for short-term, probabilistic, hub-height wind speed forecasts in complex terrain. Testing for probabilistic-forecast improvements is conducted by increasing the number of planetary boundary layer schemes used in the ensemble. Additionally, several prescribed uncertainty models used to derive forecast probabilities based on knowledge of the error within a past training period are evaluated. A Gaussian uncertainty model provided calibrated wind speed forecasts at all wind farms tested. Attempts to scale the Gaussian distribution based on the ensemble mean or variance values did not result in further improvement of the probabilistic forecast performance. When using the Gaussian uncertainty model, a small-sized six-member ensemble showed equal skill to that of the full 48-member ensemble. A new uncertainty model called the *pq* distribution that better fits the ensemble wind forecast error distribution is introduced. Results indicate that the gross attributes (central tendency, spread, and symmetry) of the prescribed uncertainty model are more important than its exact shape.

## 1. Introduction

The renewable energy sector has grown rapidly as global economies aim to derive a larger portion of their electric generation from renewable sources such as wind and solar. One of the primary challenges with using renewable energy sources is integrating these sources into existing electric grids. Power from these sources is generated only when the wind blows or the sun shines, and cannot be controlled to the same extent as traditional sources (e.g., coal, natural gas, nuclear, hydro).

Wind forecasts are used to ensure that the electric supply and demand always remain in sync through the scheduling of energy reserve resources and market trading (Monteiro et al. 2009). Because winds are highly variable and forecasts imperfect, planners must schedule *spinning* and *standby* energy reserves to add flexibility to the electric grid. These reserves are traditional energy sources (e.g., natural gas and coal) that are scheduled ahead of time to dispatch additional power to the grid with as little as 5 min of notice (Monteiro et al. 2009). However, reserve scheduling is complicated by the spinup time required before power can be dispatched to the grid from these sources. Energy reserve spinup times can range from several hours to over a day (Ahlstrom et al. 2013). Thus, accurate forecasts of wind energy generation are required at short-term horizons (0–48 h) to determine how many reserves are needed, to schedule these reserves ahead of time, and to adjust already committed reserves in the event of forecast changes (Monteiro et al. 2009).

For market trading, energy traders bid excess resources to the market, with some deadlines as little as 30 min prior to scheduled delivery (Monteiro et al. 2009; Draxl et al. 2014). Penalties can exist in the event of an imbalance between the bid power and that actually delivered (Pinson et al. 2006).

Planners and market traders use wind forecasts based on numerical weather prediction (NWP) models to gauge future wind energy production. Such NWP models are the largest source of uncertainty in wind energy generation estimates (Monteiro et al. 2009). Forecast uncertainty arises from incorrect initial condition (IC) sources, incomplete knowledge of atmospheric physics (e.g., turbulence), and improper assimilation of observational data (Buizza et al. 2005). Better estimation of forecast uncertainty, as well as higher forecast accuracy, lead to more efficient energy planning, optimized market trading opportunities, and significant monetary savings (Pinson et al. 2006; Marquis et al. 2011; Mahoney et al. 2012; Wilczak et al. 2015).

Probabilistic weather forecasts are the primary tool used to gain insight into forecast uncertainty. Earlier forecasts used for wind energy planning were deterministic, meaning they did not provide information on forecast uncertainty (Pinson et al. 2006). However, in the last 10 years, probabilistic forecasts have begun to be adopted throughout the industry as planners started to recognize the value of quantifying forecast uncertainty. Monteiro et al. (2009), Giebel et al. (2011), and Zhang et al. (2014) provide extensive reviews of the history of probabilistic forecasts for wind energy.

Zhang et al. (2014) detail the two standard approaches for creating probabilistic forecasts for wind energy: parametric and nonparametric methods. Parametric methods prescribe a probability distribution of a particular shape that is typically representative of the past forecast error distribution. Such distributions are described by location and scale parameters and have the benefit of being computationally cheap (Zhang et al. 2014). Early methods dressed distributions around single deterministic forecast models, with several distributions being tested. Lange (2005) found that the forecast error distribution calculated from the numerical forecasts of the German Weather Service [Deutscher Wetterdienst (DWD)] is well dressed by a Gaussian distribution. That study reported forecast error distributions for 10-m wind speeds, not hub-height winds, and likely only for a location in flat terrain (the example location is confidential, and most of the stations tested were in flat terrain). When converted to wind power, forecast error distributions have been fit by Gaussian (Lange 2005), beta (Bludszuweit et al. 2008), and generalized logit-normal distributions (Pinson 2012).

Nonparametric approaches do not make a shape assumption. One example is an ensemble of NWP models. Ensemble forecast systems provide a distribution of possible forecast outcomes (Fig. 1a) and can be produced by NWP using a variety of methods, including 1) using multiple IC sources, 2) stochastically perturbing a single IC, 3) perturbing observations used during data assimilation for observation error, 4) varying model physical parameterization configurations, 5) varying NWP dynamical cores, 6) varying model grid lengths, 7) using stochastic kinetic energy backscatter (SKEBS), or 8) stochastically perturbing parameterization tendencies (SPPTs) (Stensrud et al. 2000; Grimit and Mass 2002; Buizza et al. 2005; Eckel and Mass 2005; McCollor and Stull 2008b; Candille 2009; Berner et al. 2009; Palmer et al. 2009).

An ensemble forecast distribution can be converted into a probability distribution (Fig. 1b). This is commonly done by using the empirical distribution of the ensemble members (Anderson 1996). Bayesian model averaging has also been successfully applied to wind forecast systems (Hoeting et al. 1999; Raftery et al. 2005; Sloughter et al. 2010; Courtney et al. 2013). Kernel density estimation is another nonparametric method that links one or more explanatory variables to forecast probability density functions (Juban et al. 2007). While such methods have been shown to work well, the latter two require large datasets and can become computationally expensive (Juban et al. 2007; Zhang et al. 2014).

A hybrid approach is to parameterize the moments of a prescribed probability distribution (such as Gaussian) as a function of the statistical properties of an ensemble [e.g., the ensemble mean, ensemble variance, and ensemble forecast error distribution; Gneiting et al. (2005); Nipen and Stull (2011)]. This method requires running computationally expensive ensembles, but does not require an extensive training dataset. Gneiting et al. (2005) refer to this method as ensemble model output statistics (EMOS).

Probabilistic forecasts are most useful when the forecast probability of an event matches its observed frequency of occurrence. Distributions from raw ensembles are often underdispersive (Stensrud et al. 2000; Eckel and Mass 2005; McCollor and Stull 2008b) and must undergo bias correction and probabilistic calibration (Buizza et al. 2005; Nipen and Stull 2011). When the forecast probabilities match the relative frequency of occurrence, a probabilistic forecast is *reliable* or *calibrated*. Calibrated probabilistic forecasts can be trusted as accurate estimates of forecast uncertainty. However, as Nipen and Stull (2011) discuss, even forecasts based on climatology can be probabilistically calibrated, so assessing calibration is not enough. One method that can be used to differentiate probabilistically calibrated forecasts is through evaluating forecast sharpness (Gneiting et al. 2005; Pinson et al. 2006; Juban et al. 2007). Forecast sharpness is a measure of the width of probabilistic spread, with a deterministic forecast being perfectly sharp (Juban et al. 2007). Pinson et al. (2006) suggest forecast calibration is the primary concern of probabilistic forecasts for wind energy, while any improvements in sharpness represent added value.

The use of ensembles in probabilistic forecasts for hub-height wind speeds is growing, but studies evaluating probabilistic skill are still relatively limited (Junk et al. 2015). Deppe et al. (2013) performed an ensemble study over the flat terrain of Iowa using version 3.1.1 of the Weather Research and Forecasting Model (WRF; Skamarock et al. 2008), but did not assess probabilistic forecast calibration. Deppe et al. (2013) used an ensemble of different planetary boundary layer (PBL) schemes and IC sources. They found a multi-PBL ensemble had a more accurate ensemble-mean forecast when compared with an ensemble formed from perturbed initial conditions, although the latter had larger ensemble variance. Therefore, a multi-PBL ensemble might be a suitable choice when using the EMOS method of generating probabilistic forecasts.

Since none of the existing PBL schemes available in WRF were designed for complex terrain, and because many wind features in complex terrain are dependent on PBL evolution, we posit that an ensemble containing several PBL schemes may be beneficial. The main differences between PBL schemes are the treatment of vertical mixing and the statistical order of turbulence closure (Stull 1988; Stensrud 2007; Deppe et al. 2013). Differences in PBL schemes could be expected to be larger in complex terrain than was found by Deppe et al. (2013) over flat terrain.

For complex terrain, better methods for producing calibrated, sharp probabilistic hub-height wind speed forecasts are needed. This study evaluates the performance of multi-PBL, multi-IC, multi-grid-length WRF ensembles at four wind farms in the mountainous terrain of British Columbia using the empirical ensemble distribution and EMOS methods. The eight PBL schemes used here are the Yonsei University scheme (YSU; Hong et al. 2006; Hu et al. 2013), the Asymmetric Convective Model version 2 (ACM2; Pleim 2007), the Medium-Range Forecast model (MRF; Hong and Pan 1996), the Mellor–Yamada–Janjić scheme (MYJ; Janjić 1994), the Mellor–Yamada–Nakanishi–Niino level 2.5 scheme (MYNN; Nakanishi and Niino 2006), the Quasi-Normal Scale Elimination method (QNSE; Sukoriansky et al. 2005), the University of Washington scheme (UW; Bretherton and Park 2009), and Grenier–Bretherton–McCaa scheme (GBM; Grenier and Bretherton 2001).

Further, this study provides insight into probabilistic forecasting techniques for hub-height wind speed forecasts using the WRF Model. It also adds knowledge about the shape of the hub-height wind speed forecast error distribution across complex terrain. This work is a follow-up to the work of Siuta et al. (2017), who evaluated the deterministic forecast performance of this dataset and the deterministic forecast sensitivity to the choice of PBL scheme, grid length, and IC source. The rest of this paper is organized as follows. Section 2 describes the research methodology used. Section 3 provides a discussion of the results. Finally, section 4 discusses our general conclusions and suggestions for future work.

## 2. Methodology

A year-long (June 2014–May 2015) study using the WRF Model (version 3.5.1) was conducted to evaluate the effectiveness of a multi-PBL, multi-IC, multi-grid-length ensemble forecast across complex terrain. Each day, three two-way nested meshes with 36-, 12-, and 4-km grid lengths (Fig. 2) produced 48 individual forecasts with hourly output covering a forecast horizon of 1 day at four wind farms in British Columbia. The 48 forecasts were created by running the domain setup in Fig. 2 with initial and boundary conditions from the U.S. National Centers for Environmental Prediction 0000 UTC 0.5° Global Forecast System (GFS) and 0000 UTC 32-km North American Mesoscale Model (NAM, grid 221), and with eight PBL schemes in the WRF-ARW dynamical core (Fig. 3), yielding 3 grid lengths × 2 ICs × 8 PBL schemes = 48 members. For the NAM, the 32-km grid is the only output available that extends far enough north to cover our WRF domains. Table 1 provides a summary of the WRF Model settings used.

Hub-height wind forecasts for each domain were created by using the nearest-neighbor grid cell in the horizontal and the closest model level to wind turbine hub height in the vertical. Vertical resolution in the boundary layer was increased with six levels in approximately the lowest 100 m. The authors recognize that horizontal and vertical interpolation methods may improve deterministic hub-height (roughly 50–150 m AGL) wind forecasts, but we do not apply them here to better isolate uncertainties in the boundary layer wind forecast due to PBL scheme choice.

Forecasts from each ensemble member, as well as hub-height wind speed observations, were used in the Component-Based Post-Processing System (COMPS; Nipen 2012) to postprocess and evaluate the forecast. Postprocessing is anything applied to the raw forecast for improvement, including bias removal, choice of uncertainty model, and any calibration done to adjust the uncertainty model. COMPS is a modular postprocessing and verification system, analogous to how WRF is modular for NWP. COMPS uses a series of namelists containing user-defined postprocessing options that are applied to the raw forecast input.

The four wind farms used in this study are located on mountain ridgetops, with elevations between 500 and 1000 m MSL. We use wind-farm-averaged nacelle (hub height) wind speed observations for postprocessing and model evaluation. These observations from the independent power producers are confidential and were provided to us by the local utility company, BC Hydro. BC Hydro, which purchases the wind power, performed quality control on the data. Wind-farm-averaged observations have been shown to better match power generation and output from NWP models compared to results of individual turbines or meteorological towers. Farm-averaged values remove subgrid-scale (intra–wind farm) variability not currently resolved by NWP models (Cutler et al. 2012). Wind farm names and locations remain anonymous in the figures shown here. However, statistics and aggregated results based on these observations are provided.

We evaluate several factors affecting the probabilistic hub-height wind speed forecast across complex terrain, including the effects of using multiple PBL schemes, applying bias correction, and the choice of the uncertainty model. These tests are summarized in Table 2 and are described in detail in the following subsections.

### a. Effects of the number of PBL schemes in the ensemble

This study is a follow-up to that by Siuta et al. (2017), who used the same dataset to examine the relationship between the deterministic hub-height wind speed forecast accuracy and the choice of PBL scheme, grid length, and IC source. The outcome of that study showed that the grid length and PBL scheme had the most influence on deterministic forecast accuracy, but the most influential factor varied by location, season, and time of day. Appendix A provides the mean absolute error (MAE) scores for the bias-corrected forecasts used in this prior study. Siuta et al. (2017) found the ACM2 PBL scheme to be the best performing of all the PBL schemes, and the 12-km option to be the best-performing grid, when averaged over all four wind farms over the entire year. However, the best PBL scheme (and grid length) differed for individual wind farms and seasons (see appendix A).

In this study we start with only the ACM2 PBL scheme (six total ensemble members), and then test for short-term probabilistic forecast improvements by adding PBL schemes one by one into the ensemble. We start with the ACM2 PBL scheme because of the prior results mentioned above.

### b. Bias-correction technique

We bias correct each individual ensemble member prior to forming the ensemble mean. Wind speeds can never be negative. To satisfy this condition, we use a degree-of-mass-balance (DMB) multiplicative bias-correction technique (Grubišić et al. 2005; McCollor and Stull 2008a; Bourdin et al. 2014) applied to each ensemble member. The current bias-correction factor is calculated from the ratio of past forecast and observation pairs, weighted by an *e*-folding time *τ*:

$$\mathrm{DMB}_t = \left(1 - \frac{1}{\tau}\right)\mathrm{DMB}_{t-1} + \frac{1}{\tau}\,\frac{\overline{F}_{t-1}}{\overline{O}_{t-1}} \tag{1}$$

Here, $\mathrm{DMB}_{t-1}$ is the bias-correction factor from the previous day, weighted by $(1 - 1/\tau)$. We used a *τ* of 30 days because it was found to be the optimum training period when averaged over all locations (appendix A). The ratio of the means of the previous day’s forecasts $\overline{F}_{t-1}$ and observations $\overline{O}_{t-1}$ is weighted by $1/\tau$. The result is that the influence of the most recent forecast–observation pairs decreases with an *e*-folding time of *τ*. This recursive bias correction minimizes the need to store an ever-growing dataset for training.

The factor $\mathrm{DMB}_t$ is calculated for each individual ensemble member and then applied to the raw ensemble member forecast $F_t$ to produce a bias-corrected ensemble member forecast $\hat{F}_t$:

$$\hat{F}_t = \frac{F_t}{\mathrm{DMB}_t} \tag{2}$$
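The recursive DMB correction described above can be sketched in a few lines of Python. This is a minimal illustration with synthetic numbers; the function names are ours and are not part of COMPS:

```python
def dmb_update(dmb_prev, fcst_prev, obs_prev, tau=30.0):
    """One daily DMB update: blend yesterday's factor, weighted by
    (1 - 1/tau), with yesterday's forecast/observation ratio, weighted
    by 1/tau (e-folding time tau in days)."""
    return (1.0 - 1.0 / tau) * dmb_prev + (1.0 / tau) * (fcst_prev / obs_prev)

def bias_correct(raw_fcst, dmb):
    """Multiplicative correction: divide the raw member forecast by DMB."""
    return raw_fcst / dmb

# Toy example: a member that is persistently ~20% too fast.
# The factor converges toward 1.2, and the corrected speed toward 10 m/s.
dmb = 1.0
for _ in range(200):
    dmb = dmb_update(dmb, fcst_prev=12.0, obs_prev=10.0)

corrected = bias_correct(12.0, dmb)
```

Because the correction is multiplicative, corrected wind speeds remain nonnegative whenever the raw forecast is, which is the property motivating the DMB choice.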

### c. Choice of uncertainty model

The method used to convert the distribution of raw or bias-corrected ensemble members to a probability distribution is called the uncertainty model. The uncertainty model must not assign any probabilities to wind speeds below zero.

We evaluate five uncertainty models, four of which are shown in Table 2. The first method, referred to as Raw (Table 2), uses the empirical distribution of ensemble members to assign a probability. This is a basic method that could be used by those producing probabilistic forecasts over areas where observations do not exist.

The second uncertainty model, referred to as GNS (Table 2), assumes the form of a Gaussian distribution dressed about the ensemble mean. This Gaussian forecast probability distribution is described at any time *t* by

$$P(x) = \frac{1}{\sqrt{2\pi\sigma_t^2}}\,\exp\!\left[-\frac{\left(x - \overline{\hat{F}}_t\right)^2}{2\sigma_t^2}\right] \tag{3}$$

Here, $\overline{\hat{F}}_t$ is the bias-corrected ensemble mean calculated from (1) and (2), where each member is equally weighted. The variance $\sigma_t^2$ is adaptively estimated from the square of the past forecast errors calculated from the ensemble mean, weighted by the same 30-day *e*-folding time used in (1).
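A minimal sketch of the GNS uncertainty model follows, assuming the same 30-day *e*-folding weighting is used for the adaptive error variance; names and numbers are illustrative only:

```python
import math

def update_variance(var_prev, err_prev, tau=30.0):
    """Adaptively estimate the Gaussian variance as an exponentially
    weighted average of past squared ensemble-mean errors."""
    return (1.0 - 1.0 / tau) * var_prev + (1.0 / tau) * err_prev ** 2

def gaussian_pdf(x, mean, var):
    """Forecast probability density dressed on the ensemble mean."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def gaussian_cdf(x, mean, var):
    """Cumulative probability (useful for PIT values and event thresholds)."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * var)))

# Toy example: a constant 2 m/s ensemble-mean error drives the
# adaptive variance toward 4 m^2 s^-2.
var = 1.0
for _ in range(200):
    var = update_variance(var, err_prev=2.0)

density_at_mean = gaussian_pdf(10.0, mean=10.0, var=var)
```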

The third uncertainty model, GSEV (Table 2), scales the variance of the Gaussian distribution based on the ensemble variance. For this method, variance is calculated using a linear regression of the form:

$$\sigma_t^2 = m x + b \tag{4}$$

In (4), *x* is the raw or bias-corrected ensemble variance, and the constants *m* and *b* are determined by regression against the square of the past errors (calculated from the bias-corrected ensemble mean). When used in this manner, the GSEV method could capitalize on theorized spread–skill relationships (Wilks 2011; Grimit and Mass 2002). Namely, when the ensemble variance is smaller (larger), the spread of the Gaussian distribution in (4) could be smaller (larger), assuming a relationship exists. Generally, linear correlation is used to quantify spread–skill relationships, with values over 0.6 considered strong relationships (Grimit and Mass 2007).

The fourth uncertainty model, GSEM (Table 2), scales the Gaussian distribution by the ensemble mean. For this method, *x* in (4) represents the ensemble mean. This method could allow the uncertainty model to have larger (smaller) spread when the ensemble-mean-forecast wind speed is high (low). For the GSEM and GSEV uncertainty models, the regressed variables in (4) (e.g., the ensemble mean or the ensemble variance, and the past squared errors) are also adaptively updated with an *e*-folding time of 30 days.
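Under the assumptions above, the GSEV and GSEM scalings reduce to an ordinary least-squares fit of past squared errors against a single predictor. A minimal sketch, with our own helper names and toy training data:

```python
def fit_linear(xs, ys):
    """Ordinary least squares for y = m*x + b, where y is the past squared
    error of the ensemble mean and x is the predictor (ensemble variance
    for GSEV, ensemble mean for GSEM)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    return m, mean_y - m * mean_x

def scaled_variance(x, m, b, floor=0.01):
    """Predicted Gaussian variance for the current forecast, kept positive."""
    return max(m * x + b, floor)

# Toy training set with a perfect spread-skill relation: sq_err = 1.5*var + 0.5
ens_var = [1.0, 2.0, 3.0, 4.0]
sq_err = [2.0, 3.5, 5.0, 6.5]
m, b = fit_linear(ens_var, sq_err)
sigma2 = scaled_variance(2.5, m, b)
```

The positivity floor is our own guard; a fitted line with negative intercept could otherwise predict a nonphysical negative variance for small ensemble spread.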

Figure 4 illustrates how these three Gaussian-based methods could differ for an idealized three-member ensemble. We posit that the GSEV and GSEM methods could allow for sharper probabilistic forecasts than the GNS method. For each of these distributions, we also test for probabilistic forecast improvements resulting from adding more PBL schemes to the ensemble and through ensemble member bias correction. An overview of the tests performed is given in Table 2.

A fifth uncertainty model, based on a new distribution (which we call the *pq* distribution), is also tested. It allows for optimization of kurtosis to better match the distribution of past forecast errors (shown later). Probabilistic forecast results derived by using this model did not improve on those from the Gaussian-based models. The authors felt it important to show the results of this experiment; the details are relegated to appendix B.

### d. Verification metrics

As our focus in this paper is on probabilistic forecasting, we concentrate our forecast evaluation on distributions-oriented verification metrics, which use the joint distribution of forecasts and observations (Wilks 2011).

These types of metrics provide advantages over deterministic measures when assessing probabilistic forecasts. Because they evaluate the full probabilistic forecast distribution, they provide insight into how well forecast uncertainty is represented by the forecast probability distribution. Accurate representation of forecast uncertainty allows for optimum decision-making and cost minimization (Pinson et al. 2006; Juban et al. 2007; McCollor and Stull 2008c).

As with deterministic verification, the choice of verification metrics should be decided based on the purpose of the forecast and the needs of the end user. Gneiting et al. (2005, 2007) and Juban et al. (2007) propose that ideal probabilistic forecasts achieve calibration while maximizing forecast sharpness (both are described in more detail later). For this we use the probability integral transform (PIT) histogram, reliability diagram, and continuously ranked probability score (CRPS). Skill scores are also used to quantify improvements in verification metrics of a test forecast configuration over that of a reference forecast (Wilks 2011).

Calibration is a measure of how well forecast probability represents the actual frequency of event occurrence (Wilks 2011; Nipen and Stull 2011; Gneiting et al. 2007, 2005). The PIT histogram provides a concise method of evaluating ensemble probabilistic calibration. PIT values are calculated by determining which forecast percentile matches the associated observation for individual forecast–observation pairs. A histogram is generated by counting the frequency of occurrence of observations in each forecast percentile bin. This histogram should be flat for a calibrated forecast, and the deviation from flatness can be used to gauge calibration, as was shown by Nipen and Stull (2011).

When the PIT histogram is not flat, it can indicate two things: 1) the forecast is under- or overdispersive (needs calibration) or 2) the forecast has bias. Both of these situations can be fixed by calibration methods that adjust the probability distribution, resulting in flatter PIT histograms.
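A PIT histogram can be computed directly from forecast cumulative distributions and their matched observations. A minimal sketch (helper names are ours):

```python
def pit_value(cdf, obs):
    """PIT value: the forecast cumulative probability at the observation."""
    return cdf(obs)

def pit_histogram(pit_values, n_bins=10):
    """Bin PIT values into equally spaced bins; flat counts indicate a
    calibrated forecast, a U shape underdispersion, a dome overdispersion."""
    counts = [0] * n_bins
    for p in pit_values:
        i = min(int(p * n_bins), n_bins - 1)  # guard against p == 1.0
        counts[i] += 1
    return counts

# A perfectly calibrated forecast yields uniform PIT values -> flat histogram.
pits = [(k + 0.5) / 100.0 for k in range(100)]
counts = pit_histogram(pits, n_bins=10)
```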

Gneiting et al. (2007) highlight that while the use of PIT histograms is prudent to address forecast calibration, it is hardly the only important aspect of a probabilistic forecast. They mention that forecasts that maximize sharpness, while maintaining calibration, are best. In addition, forecast calibration may not hold true over a subset of events. For wind energy applications, sharper probabilistic forecasts (when also probabilistically calibrated) result in more efficient energy reserve resource planning and market trading opportunities because uncertainty (typical error from the ensemble mean) in the forecast is reduced (Juban et al. 2007). Because even forecasts based on climatology can be probabilistically calibrated, forecasts must be sharper than climatology to be considered skillful.

We use the observations from a previous 15-day window to define climatology at each hour of the day. Our reasons for choosing a 15-day window are twofold: 1) we have limited observations available with only a year-long dataset, and 2) a shorter definition of climate (compared to one that averages over all days of the year) could better represent seasonality. We do not use any observations from the future to define climate, as doing so would make the climatological forecast skill falsely appear better than it is.
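A sketch of this trailing-window climatology, assuming a hypothetical record keyed by (day, hour); the data layout and names are illustrative:

```python
def trailing_climatology(obs, day, hour, window=15):
    """Climatological reference for `hour`: observations at that hour from
    the previous `window` days only (never from the future)."""
    return [obs[(d, hour)] for d in range(day - window, day) if (d, hour) in obs]

# Toy record: 20 days of hub-height wind observations at hour 12.
record = {(d, 12): 5.0 + 0.1 * d for d in range(20)}
clim = trailing_climatology(record, day=20, hour=12)
```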

Reliability diagrams allow us to assess forecast calibration for a subset of events given a threshold (Wilks 2011). For this study, we use wind speed thresholds of 5, 15, and 20 m s^{−1} to represent low winds, turbine-rated winds, and high winds, respectively. Turbine-rated speeds are those at which wind turbines start to generate maximum power output (i.e., increasing winds will no longer increase power output). This speed can differ between turbine models.

We insert a sharpness histogram within the reliability diagram. Sharpness is a function of only the forecast and is a measure of the width of a forecast probability distribution (Gneiting et al. 2007). Given a forecast threshold, it is a count of how many times this threshold value lands in each probability bin. Forecasts that most often place this threshold near either the 0th or 100th percentiles are said to be sharp, while forecasts that frequently place the forecast near the 50th percentile are not sharp. Narrower forecast probability distributions are sharper. While a sharper forecast is useful because it is more decisive, caution must be used to make sure that sharp forecasts are also probabilistically calibrated (i.e., a falsely confident forecast is not useful).

The CRPS is analogous to the MAE, but for probability. It is a measure of how well probability is assigned around each observed value (Hersbach 2000). A lower CRPS indicates a forecast that better assigns probability with respect to actual event occurrence (i.e., it assigns higher probability for the event). A perfect forecast will have a CRPS of 0. The CRPS reduces to the MAE for deterministic forecasts (the cumulative probability distribution takes the form of a step function), which makes it possible to compare ensemble and deterministic forecasts in a probabilistic sense. In addition, the CRPS provides a strictly proper scoring metric of assessing both forecast calibration and sharpness in a single score (Gneiting et al. 2005). Strictly proper scores promote honesty in the sense that forecasters cannot hedge their forecasts to achieve a better verification score (Wilks 2011). The CRPS skill score is calculated by comparing the CRPS score of any probabilistic forecast to that of a reference probabilistic forecast.
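One standard computational form of the ensemble CRPS is the energy form, E|X − y| − 0.5·E|X − X′|, which for a one-member "ensemble" reduces to the absolute error, matching the MAE analogy above. A minimal sketch (helper names are ours):

```python
def crps_ensemble(members, obs):
    """CRPS of an equally weighted ensemble via the energy form
    E|X - y| - 0.5 * E|X - X'|; lower is better, 0 is perfect."""
    n = len(members)
    mae_term = sum(abs(x - obs) for x in members) / n
    spread_term = sum(abs(x - y) for x in members for y in members) / (n * n)
    return mae_term - 0.5 * spread_term

def crps_skill_score(crps_fcst, crps_ref):
    """Fractional improvement of a test forecast over a reference
    (e.g., climatology); 1 is perfect, 0 is no improvement."""
    return 1.0 - crps_fcst / crps_ref

crps_det = crps_ensemble([10.0], 12.0)            # one member: absolute error
crps_ens = crps_ensemble([9.0, 10.0, 11.0], 10.0)
```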

Finally, we will refer to root-mean-square error (RMSE) as a deterministic measure of forecast accuracy and also relate it to probabilistic forecast sharpness, since RMSE is directly related to the spread of the Gaussian uncertainty model.

## 3. Results and discussion

### a. Raw ensemble distribution as a probability distribution

Tests R1–R8 (Table 2) are designed to quantify improvements in probabilistic calibration due to adding more PBL schemes to a short-range ensemble forecast across complex terrain. Starting with the best PBL scheme (ACM2, test R1), additional PBL schemes were added to the ensemble in tests R2–R8. Each additional PBL scheme added six total members to the ensemble, comprising three grid sizes and two ICs. R1–R8 used the binned, raw ensemble distribution as the forecast probability distribution. Tests RB1–RB8 followed the same method, but used ensemble member bias correction.

Figure 5 shows the improvement in probabilistic calibration resulting from adding up to eight PBL schemes (i.e., improvement over R1). For the raw ensemble distribution, forming a multi-PBL, multi-IC, and multigrid ensemble leads to large improvements in probabilistic calibration over an ensemble consisting of just a single PBL scheme (six members). For example, using two PBL schemes (12 members, R2; Table 2) results in 9% calibration improvement over using one PBL scheme (R1; Table 2). A statistically significant 19% improvement in calibration (relative to R1; Table 2) results from using all eight PBL schemes (R8; Table 2). Throughout this manuscript, statistical significance refers to passing a Student’s *t* test at the *p* = 0.05 significance level.

Once bias correction is applied, using all eight PBL schemes (RB8; Table 2) provides a 27% improvement in calibration (relative to RB1; Table 2). A 14% improvement is found when using only two PBL schemes (RB2; Table 2). Differences in forecast calibration between RB1 and RB8 are statistically significant. Additionally, improvements in calibration through the application of bias correction also pass statistical significance testing (e.g., comparing tests R1–R8 with RB1–RB8).

Even though R8 and RB8 were the best within their respective groups of experiments, their PIT histograms (Fig. 6) show that the binned, raw ensemble distribution is underdispersive, even with a 48-member ensemble. The probability forecast is too confident and events too often fall at the extremes of the distribution. To fix this, a better uncertainty model is needed, and we address this through our next set of tests.

### b. Prescribed probability distributions dressed on the ensemble mean

Tests G1–G8 (Table 2) use the GNS uncertainty model dressed about the raw ensemble mean with spread (variance) based on the past squared errors of the ensemble mean. Using this method with the best overall PBL scheme (G1), or all eight PBL schemes (G8), results in a biased forecast (Fig. 7). To remove the bias, we performed bias correction on each ensemble member during tests GB1–GB8. These also used the GNS uncertainty model, but centered the forecast probability distribution on the bias-corrected ensemble mean. The removal of the bias resulted in a nearly flat PIT histogram, indicating a calibrated forecast (GB1 and GB8 in Fig. 7). A separate calibration step is therefore not needed.

Next, we tested two additional Gaussian-based uncertainty models that allow the distribution to scale by either the ensemble variance or the ensemble mean. First, we scaled the distribution by the ensemble variance in tests GSV1–GSV8 (no ensemble member bias correction; Table 2) and GSVB1–GSVB8 (with ensemble member bias correction; Table 2). Tests GSV1–GSV8 and GSVB1–GSVB8 show nearly identical results to using a nonscaling Gaussian uncertainty model (GNS). PIT histograms for GSV1 and GSV8 show a biased probabilistic forecast (GSV1 and GSV8; Fig. 7) that results in a well-calibrated forecast once ensemble member bias correction is applied (GSVB1 and GSVB8; Fig. 7). Tests GSM1–GSM8 and GSMB1–GSMB8 yielded similar results (Fig. 7) for scaling the Gaussian distribution by the ensemble mean. When using any of the Gaussian-based uncertainty models, improvements in calibration through the use of the bias-corrected ensemble members passed significance testing.

Figure 8 shows the actual distributions of wind speed forecast errors with respect to the bias-corrected ensemble mean. Each histogram is for a different wind farm. By inspection, these distributions do not have a Gaussian shape. These distributions have more probability near the center (kurtosis = 3.5–6.6) than a Gaussian (kurtosis = 3), but they are nearly symmetric (skewness of less than 0.5, close to the Gaussian zero skewness). This engenders two questions: 1) what aspects of the Gaussian distribution allow it to work so well to produce calibrated probabilistic forecasts and 2) can we find a different theoretical distribution that is a better fit to the observed error distributions? We address question one next, and address question two in section 3f and in appendix B.

Perhaps all that is needed to produce calibrated forecasts is a distribution with central tendency, spread about the mean, and symmetry. The fact that the tails of the Gaussian distribution unphysically extend to ±∞ appears to be irrelevant to its ability to satisfactorily dress the ensemble mean. The next question, then, is how much spread the distribution needs.

First, we investigate whether larger forecast errors correspond to 1) greater spread among the raw ensemble members or 2) larger ensemble-mean forecast winds. To do this, we calculated the coefficient of determination *r*^{2} for the linear relationship between the squared error of the bias-corrected ensemble mean and either the bias-corrected ensemble variance or the bias-corrected ensemble mean. We found that the relationship between the bias-corrected ensemble variance and the squared error had *r*^{2} values of 0.027 or less. This means that only 2.7% of the error variance can be explained by the linear relationship between the ensemble variance and the squared-error magnitude (i.e., the spread–skill relationship does not apply at these four locations in complex terrain).
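A spread–skill check of this kind reduces to a linear regression between ensemble variance and squared error. A sketch with synthetic, deliberately unrelated series (illustrative only, not the paper's data):

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(2)
n = 8000

# Synthetic stand-ins: ensemble variance and squared ensemble-mean error
# drawn independently, mimicking the absence of a usable spread-skill link
ens_var = rng.gamma(shape=2.0, scale=1.0, size=n)
sq_err = rng.normal(0.0, 1.5, size=n) ** 2

fit = linregress(ens_var, sq_err)
print(f"r^2 = {fit.rvalue**2:.4f}")   # near zero: variance explains little of the error
```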

The corresponding analysis comparing the bias-corrected ensemble mean with the squared-error magnitude resulted in *r*^{2} values no larger than 0.053. The implication is that the Gaussian distributions used in GB1–GB8, GSVB1–GSVB8, and GSMB1–GSMB8 are similar, since the scaling relationships are weak. At some of the four locations, the differences between the GNS, GSV, and GSM distributions are statistically insignificant. Because of this, our subsequent analyses focus on the GNS distribution used in tests GB1–GB8.

An *r*^{2} value of 1 is not expected or feasible, as several studies have noted (Whitaker and Loughe 1998; Grimit and Mass 2007; Hopson 2014). Some of those studies, along with Wang and Bishop (2003), propose potential methods for extracting a stronger spread–skill relationship. Exploring those methods is beyond the scope of this study. Hopson (2014), however, concludes that in the case of a weak spread–skill relationship, an invariant spread derived from the ensemble-mean error may be used instead. We reach the same conclusion here.

Having addressed calibration, we now look at sharpness. For the GNS uncertainty model, spread is only a function of the past error of the bias-corrected ensemble mean [(3)]. The smaller the error (which can be measured by RMSE), the narrower the prescribed forecast probability distribution, and the sharper the forecast. We address this in the next section.
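In other words, the GNS spread is just the training-period RMSE of the bias-corrected ensemble mean. A minimal sketch (the helper name and toy numbers are ours, not the paper's):

```python
import numpy as np

def gns_spread(train_fcst, train_obs):
    """GNS spread: RMSE of the (bias corrected) ensemble mean over the training period."""
    err = np.asarray(train_fcst, dtype=float) - np.asarray(train_obs, dtype=float)
    return np.sqrt(np.mean(err ** 2))

# Smaller past error -> narrower dressed distribution -> sharper forecast
sigma_sharp = gns_spread([5.1, 7.9, 6.2], [5.0, 8.0, 6.0])
sigma_wide = gns_spread([3.0, 10.0, 4.0], [5.0, 8.0, 6.0])
print(sigma_sharp, sigma_wide)   # the first is much smaller
```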

### c. Use of ensembles to improve forecast sharpness

The benefits of using ensembles in the short-term forecast horizons have been debated, especially for near-surface variables like hub-height winds (Mass et al. 2002; Eckel and Mass 2005; Warner 2011). We approach this question by looking at the effects of an ensemble on forecast sharpness. When using the GNS uncertainty model, forecast systems that have lower RMSEs must have smaller probabilistic spread (improved sharpness).

Figure 9 shows the average RMSE for each ensemble member, as well as the six-member bias-corrected ensemble from the best PBL scheme (GB1; Table 2) and the full 48-member bias-corrected ensemble (GB8; Table 2). While probabilistic calibration could be achieved by dressing a Gaussian distribution around any of these *individual* ensemble members, the sharpest forecasts are produced by dressing the distribution on the ensemble means from GB1 or GB8, which have the lowest RMSEs. While we show that the best PBL ensemble (GB1) and full PBL ensemble (GB8) have equivalent sharpness, it is not possible to know which PBL scheme is best a priori unless it has been shown to consistently perform the best over an extended period. Nonetheless, these results indicate that larger ensembles do not necessarily perform better in short-term forecasts than smaller, selective ensembles. This is an important finding for short-term wind planning, as it reduces computational cost while still providing sharp, probabilistically calibrated forecasts.

### d. Probabilistic forecasts for wind events

While the bias-corrected ensemble forecasts show statistical calibration for the data used during the training period, they are not necessarily calibrated over subsets of this training data. Forecasts of rare events, which in our case are high-wind events, can be uncalibrated. To assess calibration over different wind speed subsets, we provide the reliability diagrams using event thresholds of 5, 15, and 20 m s^{−1} for tests RB1, RB8, GB1, and GB8, as well as for a GNS distribution dressed on the single-best bias-corrected ensemble member (ACM2 12-km GNS; not in Table 2) (Fig. 10).

RB1 and RB8 are underdispersive for an event threshold of 5 m s^{−1}, while GB1, GB8, and the ACM2 12-km GNS are calibrated. However, for the 15 and 20 m s^{−1} thresholds, all forecasts exhibit poor calibration, typically overforecasting the probabilities. The horizontal dashed line in Fig. 10 represents the climatological frequency of events above the given threshold. Out of a total sample of 32 442 observations, 20 455 (63%) are above 5 m s^{−1}, 603 (1.8%) are above 15 m s^{−1}, and 35 (0.1%) are above 20 m s^{−1}. For the 15 m s^{−1} threshold, forecasts have near-zero skill (close to the red dashed line in Fig. 10). For the 20 m s^{−1} threshold, the probabilistic forecast has no skill. The lack of probabilistic forecast calibration for high-wind events is likely due to these events composing a small percentage of the training dataset, and forecast biases for these events may differ from those of the bulk of the training dataset. Separate calibration processes for different thresholds may alleviate this issue.
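A reliability diagram is built by binning the forecast probabilities of the event and comparing each bin with the observed event frequency. The sketch below uses a synthetic, perfectly calibrated Gaussian-dressed forecast (all data illustrative, not the paper's):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
sigma, thresh = 1.5, 5.0

# Synthetic forecast means; "truth" drawn consistently with the dressed Gaussian
fcst = rng.gamma(2.0, 4.0, size=30000)
truth = fcst + rng.normal(0.0, sigma, size=fcst.size)

prob = 1.0 - norm.cdf(thresh, loc=fcst, scale=sigma)   # P(wind > 5 m/s)
event = truth > thresh

# Observed event frequency per forecast-probability bin
bins = np.linspace(0.0, 1.0, 11)
idx = np.clip(np.digitize(prob, bins) - 1, 0, 9)
for k in range(10):
    sel = idx == k
    if sel.sum() > 0:
        print(f"forecast {bins[k]:.1f}-{bins[k+1]:.1f}: observed {event[sel].mean():.2f}")
```

For a calibrated forecast the printed observed frequencies track the bin centers (the diagonal in Fig. 10); shrinking `sigma` in the dressing step, without shrinking the truth noise, reproduces the overconfident behavior of RB1 and RB8.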

Insets within the panels in Fig. 10 show sharpness diagrams for the event thresholds. These diagrams indicate how often the event threshold falls in each forecast probability bin. Sharper forecasts will more frequently give event probabilities closer to 0% or 100%, while less sharp forecasts give event probabilities closer to 50%. For all thresholds, we see that the raw bias-corrected ensemble distributions (RB1 and RB8) produce the sharpest forecasts. However, sharpness without calibration is worthless. The high sharpness is a result of the underdispersive, overconfident nature of the raw ensemble distributions. Each of the GNS distributions (GB1, GB8, and ACM2 12-km GNS in Fig. 10) produces sharp forecasts for each of these event thresholds.

### e. Summary of skill versus computation cost

As a summary of the forecast improvements obtained by using probabilistic-based forecasts, we plot the CRPS skill score, which allows us to compare deterministic and ensemble forecasts in a probabilistic sense (Fig. 11). We use the worst-performing deterministic forecast (the bias-corrected 36-km QNSE forecast) as the reference forecast to gauge short-term probabilistic forecast improvements. Using the best deterministic forecast (12-km ACM2) provides an 11% improvement in the CRPS, averaged over all stations. A six-member ensemble forecast based on the best PBL scheme using the raw ensemble distribution as the uncertainty model (RB1; Table 2) improves the CRPS by an additional 17% (or 28% over the worst deterministic forecast). Using the raw distribution with the full 48-member ensemble (RB8; Table 2) provides an additional improvement in CRPS of 3%, or a 31% improvement over the worst deterministic forecast.

Figure 11 shows the large improvement that is gained by dressing the ensemble mean with an uncertainty model. The bias-corrected ACM2 12-km GNS provides a better probabilistic forecast than that of RB8 (on average), at a fraction of the computational cost. CRPS is improved over RB8 by an additional 5%. Test GB1, which uses all the ACM2-based forecasts, provides the most CRPS skill, with a 41% improvement over the worst deterministic forecast, and an additional 5% improvement over the best deterministic forecast dressed by the GNS distribution. No further improvement is gained with a larger ensemble for this short-term forecast (GB8; Fig. 11), while computational cost increases. All ensemble forecasts in Fig. 11 (e.g., tests RB1, RB8, ACM2 12-km GNS, GB1, and GB8) have probabilistic skill (increased sharpness) over a climatology forecast while the two deterministic forecasts do not (the ACM2 12-km deterministic and the 36-km QNSE deterministic reference forecasts; not explicitly shown).
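CRPS for a Gaussian-dressed forecast has a closed form (Gneiting et al. 2005), which makes skill-score comparisons like those in Fig. 11 cheap to compute. A sketch with synthetic data follows; the 4 m s^{−1} reference spread is an arbitrary stand-in for a poor reference forecast, not one of the paper's tests.

```python
import numpy as np
from scipy.stats import norm

def crps_gaussian(y, mu, sigma):
    """Closed-form CRPS of a Gaussian forecast N(mu, sigma^2) at observation y."""
    z = (y - mu) / sigma
    return sigma * (z * (2.0 * norm.cdf(z) - 1.0) + 2.0 * norm.pdf(z) - 1.0 / np.sqrt(np.pi))

rng = np.random.default_rng(4)
mu = rng.gamma(2.0, 4.0, size=5000)            # forecast means
obs = mu + rng.normal(0.0, 1.0, size=mu.size)  # verifying observations

crps_sharp = crps_gaussian(obs, mu, 1.0).mean()   # well-matched spread
crps_ref = crps_gaussian(obs, mu, 4.0).mean()     # overly wide reference

crpss = 1.0 - crps_sharp / crps_ref               # CRPS skill score vs the reference
print(f"CRPSS = {crpss:.2f}")
```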

### f. Shape of the prescribed distribution

In section 3b, we suggested that the exact shape of the prescribed distribution used to dress the ensemble mean is less important than the distribution’s gross attributes: central tendency, spread, and symmetry. As a preliminary test of this hypothesis, we devise a new distribution, called the *pq* distribution. This distribution can better fit the shape of the observed wind forecast error distribution (Fig. 12). The *pq* distribution is symmetric and bounded; see appendix B for details.

When Fig. 11 was recalculated using CRPS from the best-fitting *pq* distributions, the results (not shown) were nearly identical to those from the GNS distribution. The lack of improvement in CRPS through the use of a better-fitted distribution (*pq*) is preliminary evidence that the distribution's gross attributes matter more than its exact shape. Although the wind forecast errors were nearly symmetric for the four wind farms studied here, there may be other situations where the forecast error distributions are asymmetric or even multimodal (i.e., not Gaussian; not *pq*), and other distributions would better describe the errors.

## 4. Conclusions and future work

We detailed methods employed to produce calibrated hub-height wind speed forecasts over complex terrain from a multi-PBL-scheme, multi-IC, multi-grid-length WRF ensemble. Tests evaluated the effects of having multiple PBL schemes in the ensemble, bias correction, and uncertainty model choice. Probabilistic forecast performance was evaluated based on improvements in the PIT histogram, reliability diagram, sharpness, CRPS, and RMSE.

For the binned raw ensemble distribution, increasing the number of PBL schemes available improved the forecast calibration. However, this distribution was underdispersive, and remained probabilistically uncalibrated even after removing the bias from each individual ensemble member.

To improve probabilistic calibration, we tested whether a prescribed uncertainty model could better represent forecast uncertainty. Three Gaussian-based uncertainty models and one *pq* model were evaluated. The first had no scaling; its spread was based on the past error of the ensemble mean. The second and third Gaussian-based uncertainty models attempted to scale the Gaussian distribution based on a linear regression of either the ensemble variance or the ensemble mean against the square of the past errors of the ensemble mean. Using any Gaussian uncertainty model without individual-member bias correction yielded biased PIT histograms. However, bias correcting each individual ensemble member resulted in probabilistically calibrated forecasts for all three Gaussian-based uncertainty models.

When using the prescribed uncertainty models, using additional PBL schemes did not result in improved probabilistic calibration. Therefore, large ensembles are not necessary to produce probabilistically calibrated forecasts. Instead, the addition of multiple PBL schemes improved sharpness. Although the RMSE (and thus sharpness) was similar when using the full 48-member ensemble and the reduced 6-member ensemble using only the ACM2 PBL scheme, the single best PBL scheme may not be known a priori. If the best PBL scheme is known, a selective ensemble can be used to save computation costs.

Additionally, the ensemble-mean RMSE, and thus sharpness, may be improved through the use of ICs produced by other agencies. Examples include the Canadian Global Deterministic Prediction System, the Fleet Numerical Meteorology and Oceanography Center Navy Global Environmental Model, the Met Office Unified Model, and the European Centre for Medium-Range Weather Forecasts Integrated Forecast System. We suggest testing these other sources because the GFS and the NAM share the same underlying data assimilation system, the Gridpoint Statistical Interpolation (Shao et al. 2016). Differences between ICs may be larger when the sources come from distinctly different agencies (those using different data assimilation techniques).

Our error analysis shows that the Gaussian distribution describes the forecast error well, but that forecast error has little relationship to either the ensemble mean or the ensemble variance. We also found that the gross attributes (central tendency, spread, symmetry) of a prescribed distribution are more important than its exact shape.

## Acknowledgments

The authors acknowledge BC Hydro (Grant 00089063), Mitacs (Grant IT03826), and the Natural Sciences and Engineering Research Council of Canada (Grant RGPIN) for providing the funding to make this extensive project possible. We also thank GDF SUEZ Energy North America, Altagas Ltd., Alterra Power Corp., and Capital Power for allowing us access to their data. In addition, we thank Magdalena Rucker and Doug McCollor for their contributions to this effort. Finally, we would like to acknowledge the three anonymous reviewers for their input, which greatly improved this manuscript.

### APPENDIX A

#### MAE Scores

Tables A1 and A2 provide the deterministic MAE scores for the bias-corrected forecasts used to identify the best-performing PBL scheme in a separate study (Siuta et al. 2017). Figure A1 quantifies the annual accuracy skill gain or loss [MAE skill score (MAESS)] at each individual wind farm site from using the DMB bias-correction scheme with a training period of 30 days. Averaged over all locations, bias correction always resulted in accuracy improvements (Fig. A1, purple bars). For sites 1–3, improvements in accuracy over the raw forecasts ranged from 0% to 12%. At site 4, MAE was improved by up to 56% (the 4-km grid using the MRF PBL scheme).
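For readers unfamiliar with DMB correction, a minimal sketch of a multiplicative degree-of-mass-balance scheme is below. This is our illustration of the general DMB idea (divide the raw forecast by the training-period forecast-to-observation ratio); the paper's exact weighting over the 30-day window is not reproduced.

```python
import numpy as np

def dmb_correct(raw_fcst, train_fcst, train_obs):
    """Multiplicative DMB correction: divide the raw forecast by the ratio of
    training-period forecast total to observed total."""
    dmb = np.sum(train_fcst) / np.sum(train_obs)
    return np.asarray(raw_fcst, dtype=float) / dmb

# 30-day training window with a persistent 20% overforecast
rng = np.random.default_rng(5)
obs = rng.gamma(2.0, 4.0, size=30)
fcst = 1.2 * obs
corrected = dmb_correct(fcst, fcst, obs)
print(np.allclose(corrected, obs))   # True: the multiplicative bias is removed
```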

Figure A2 shows the variation in the annual MAESS at all four wind farm sites averaged over all 48 bias-corrected forecasts. While we do not show training periods less than 5 days in length, we tested a training period of only 1 day and found MAESS to be significantly worse (−40% or worse); thus, it is not included in Fig. A2. We find that the benefits of a longer training period level out between 20 and 30 days.

### APPENDIX B

#### The *pq* Frequency Distribution

We introduce the meteorological community to a new frequency distribution that we devised for section 3. The distribution has a simple unnormalized one-sided form:

$$G(x) = (1 - x^p)^q, \quad 0 \le x \le 1, \tag{B1}$$

where *p* and *q* are the shape parameters, and the distribution is defined between 0 and 1. It can be made into a normalized (unit area) symmetric probability density *f* with arbitrary scaling parameter *S* and center-location parameter $x_0$:

$$f(x) = \frac{1}{2SA}\left(1 - \left|\frac{x - x_0}{S}\right|^p\right)^q, \quad x_0 - S \le x \le x_0 + S, \tag{B2}$$

where the area *A* under the unnormalized one-sided curve (B1) is

$$A = \frac{\Gamma(1 + 1/p)\,\Gamma(1 + q)}{\Gamma(1 + 1/p + q)}, \tag{B3}$$

and where $\Gamma$ is the gamma function. The *pq* distribution is similar, but not identical, to the Kumaraswamy (1980) distribution and the beta distribution (Jambunathan 1954).

The *pq* distribution has a theoretical mean absolute deviation (about the distribution center) of

$$\mathrm{MAD} = \frac{S\,\Gamma(2/p)\,\Gamma(1 + q)}{p\,A\,\Gamma(2/p + 1 + q)} \tag{B4}$$

and variance of

$$\sigma^2 = \frac{S^2\,\Gamma(3/p)\,\Gamma(1 + q)}{p\,A\,\Gamma(3/p + 1 + q)}. \tag{B5}$$

The fourth statistical moment is

$$m_4 = \frac{S^4\,\Gamma(5/p)\,\Gamma(1 + q)}{p\,A\,\Gamma(5/p + 1 + q)}, \tag{B6}$$

from which the kurtosis can be found as $K = m_4/\sigma^4$. Skewness is zero for this symmetric distribution. Sometimes a situation may exist where part of the distribution tail on one or both sides must be cut off. For this situation, numerically integrate evenly spaced samples from (B2) to get the area under the curve and the higher moments, instead of using the theoretical expressions in (B3)–(B6).
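As a numerical sanity check on this construction (assuming the one-sided form G(x) = (1 − x^p)^q; the parameter values below are arbitrary), the normalized density can be integrated directly:

```python
import numpy as np
from math import gamma

def pq_area(p, q):
    """One-sided area under G(x) = (1 - x**p)**q on [0, 1], via gamma functions."""
    return gamma(1 + 1 / p) * gamma(1 + q) / gamma(1 + 1 / p + q)

def pq_pdf(x, p, q, S=1.0, x0=0.0):
    """Normalized symmetric pq density with scale S and center x0."""
    u = np.abs((x - x0) / S)
    body = (1.0 - np.minimum(u, 1.0) ** p) ** q
    return np.where(u <= 1.0, body, 0.0) / (2.0 * S * pq_area(p, q))

p, q = 1.85, 5.0
x = np.linspace(-1.0, 1.0, 200001)
dx = x[1] - x[0]
area = np.sum(pq_pdf(x, p, q)) * dx   # should equal 1 (unit area)
print(round(area, 4))
```

The same numeric sums over evenly spaced samples reproduce the gamma-function moment expressions, which is exactly the fallback recommended when a tail is truncated.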

If you are given an observed distribution with scatter, then you can find the best-fit parameters (*p*, *q*, *S*, and the center location) for the *pq* distribution as follows. If you know a priori (or want to fix) one or more of the parameters (typically *S* and/or the center location), then do so, and find the remaining parameters by one of two methods. You can use the method of moments: first calculate the statistical moments of the observation data, and then use the Newton–Raphson or another root-finding algorithm to find the *pq*-distribution parameters that give the same statistical moments. You will need as many statistical moments as unknown parameters.

A second approach is brute force, which we used here to help us learn more about the sensitivity of the distribution to the choice of parameters. Namely, you loop through all reasonable values of *p* and *q*, and for each (*p*, *q*) combination find the mean absolute error between the *pq* distribution and the observed frequency distribution (after normalizing the observations to have unit area under its histogram). The best (*p*, *q*) combination is the one with minimum error. Figure B1 shows an example for one of the wind farm forecast error distributions.
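The brute-force search can be sketched as follows. Synthetic Gaussian errors stand in for a wind-farm histogram; the grids, the fixed S, and the center are illustrative choices, and the one-sided form G(x) = (1 − x^p)^q is assumed.

```python
import numpy as np
from math import gamma

def pq_pdf(x, p, q, S, x0):
    """Normalized symmetric pq density (one-sided form (1 - x**p)**q assumed)."""
    u = np.abs((x - x0) / S)
    A = gamma(1 + 1 / p) * gamma(1 + q) / gamma(1 + 1 / p + q)
    return np.where(u <= 1.0, (1.0 - np.minimum(u, 1.0) ** p) ** q, 0.0) / (2.0 * S * A)

rng = np.random.default_rng(6)
errors = rng.normal(0.0, 1.0, size=50000)            # stand-in forecast errors
dens, edges = np.histogram(errors, bins=41, range=(-5.0, 5.0), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])

best = (np.inf, None, None)
for p in np.arange(1.0, 3.001, 0.05):                # loop over (p, q) combinations
    for q in np.arange(1.0, 20.001, 0.25):
        mae = np.mean(np.abs(pq_pdf(centers, p, q, S=5.0, x0=0.0) - dens))
        if mae < best[0]:
            best = (mae, p, q)

print(f"best fit: p = {best[1]:.2f}, q = {best[2]:.2f}, MAE = {best[0]:.4f}")
```

Because the target here is Gaussian, the search settles near p ≈ 2, consistent with the paper's observation that the *pq* distribution approximates Gaussians for p near 1.85.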

While stepping through each (*p*, *q*) combination, we also produced a plot of the *pq* distribution MAD, variance, and kurtosis (Fig. B2). You can use this as a poor man's method of moments: if you know the variance and kurtosis (or the MAD and kurtosis) from your observation data, then you can use those values in Fig. B2 to find the approximate *p* and *q* values.

Although the *pq* distribution cannot fit skewed distributions, its versatility for symmetric distributions is remarkable. Figure B3 shows extremes that approximate a delta function, a linear ramp shape, and a near-uniform distribution. Figure B4 shows that *q* controls the slope of the tails, while *p* controls the shape of the center of the distribution. The bottom-right panel in Fig. B4 shows how the *pq* distribution can approximate Gaussian distributions, particularly for values of *p* ≈ 1.85.

## REFERENCES

*Proc. 2007 Power Tech Conf.*, Lausanne, Switzerland, IEEE, 683–688.

*Proc. 2006 Int. Conf. on Probabilistic Methods Applied to Power Systems*, Stockholm, Sweden, IEEE.

Stensrud, D. J., 2007: *Parameterization Schemes: Keys to Understanding Numerical Weather Prediction Models.* Cambridge University Press, 459 pp.

Stull, R. B., 1988: *An Introduction to Boundary Layer Meteorology.* Kluwer Academic, 666 pp.

Warner, T. T., 2011: *Numerical Weather and Climate Prediction.* Cambridge University Press, 526 pp.

Wilks, D. S., 2011: *Statistical Methods in the Atmospheric Sciences.* 3rd ed. Academic Press, 676 pp.

## Footnotes

© 2017 American Meteorological Society.

This article is licensed under a Creative Commons Attribution 4.0 license (http://creativecommons.org/licenses/by/4.0/).