## 1. Introduction

There are inherent limitations to forecasting a single realization of the future state of the atmosphere due to its chaotic nature. While numerical weather prediction (NWP) models have become more sophisticated in recent years and better represent and predict the atmospheric state, they remain limited by imperfect model numerics, imperfect parameterizations of unresolved physical processes, and the interpolation of input data from observations that are sparse relative to current model grid resolutions. In recognition of these difficulties, contemporary NWP uses ensembles of simulations. To account for known sources of model and analysis errors, members in these ensembles often differ by imposed initial conditions (ICs), lateral and lower boundary conditions (LBCs), model physics parameterization schemes, and even the choice of NWP modeling system. While the relationship between ensemble spread and forecast error is not strictly linear, Grimit and Mass (2007) state that with larger ensemble spread, there is a greater probability of the forecast errors being larger, and vice versa.

There are both inherent and practical limitations to ensemble forecasting as well. The first limitation is that NWP ensembles are computationally expensive to run. When faced with limited computing resources, trade-offs must be made when configuring the ensemble. How fine can the horizontal and vertical resolution be? How many members can there be in the ensemble? How big can the forecast domain be, and should it be nested in a larger and coarser outer domain? How long can the forecast duration be? All of these considerations are important, but they all compete for the same limited resources.

A second limitation is that ensemble forecast systems tend to be underdispersive; in other words, the ensemble variance does not represent the full range of typical forecast errors and thus requires calibration (Raftery et al. 2005). If an ensemble is perfectly calibrated, then the ensemble variance and ensemble-mean error variance will have a 1:1 ratio (e.g., Grimit and Mass 2007; Kolczynski et al. 2009, 2011). In other words, if a meteorological ensemble is underdispersive, then the forecast uncertainty cannot be properly diagnosed from the ensemble spread. Kolczynski et al. (2011) show, using a stochastic ensemble, that even perfectly constructed ensembles with fewer than hundreds of members will be underdispersive. Unfortunately, given current computing resources at most institutions, including operational centers, it is not practical to run an NWP ensemble with hundreds of members; therefore, steps must be taken to calibrate ensembles before forecast uncertainty can be properly assessed. In addition to these practical difficulties, deficiencies in ensemble construction could necessitate calibration no matter how many members are used. Ensemble calibration is typically accomplished by postprocessing the ensemble, which can be viewed as a form of ensemble forecast “dressing,” in which the error distribution of the ensemble is modified (dressed) with statistical estimates of the true error distribution (Roulston and Smith 2003).
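The 1:1 spread-skill ratio described above can be checked directly from forecast data. The following is a minimal sketch with synthetic data (variable and function names are our own, not part of this study's verification code): the ratio of mean ensemble variance to ensemble-mean error variance is near 1 for a calibrated ensemble and well below 1 for an underdispersive one.

```python
import numpy as np

def spread_error_ratio(forecasts, obs):
    """Ratio of mean ensemble variance to ensemble-mean error variance.

    forecasts: (n_members, n_cases) array; obs: (n_cases,) array.
    A perfectly calibrated ensemble yields a ratio near 1; an
    underdispersive ensemble yields a ratio below 1.
    """
    ens_var = forecasts.var(axis=0, ddof=1).mean()   # mean ensemble variance
    mean_err = forecasts.mean(axis=0) - obs          # ensemble-mean errors
    return ens_var / np.mean(mean_err ** 2)

rng = np.random.default_rng(0)
center = rng.normal(size=5000)                       # unknown true state per case
truth = center + rng.normal(size=5000)               # obs exchangeable with members
members = center + rng.normal(size=(20, 5000))       # calibrated: ratio near 1
too_narrow = center + 0.3 * rng.normal(size=(20, 5000))  # underdispersive
```

Note that the verifying observation must be statistically exchangeable with the members for the ratio to approach 1; drawing the members around the observation itself would make the ensemble-mean error shrink with member count and inflate the ratio.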

There are many applications for which NWP ensemble forecasting is useful. One of these applications is atmospheric transport and dispersion (AT&D) forecasting, as AT&D models such as the Second-Order Closure Integrated Puff (SCIPUFF) model (Sykes et al. 2004) are often driven by NWP ensemble model output and statistics, including wind variances, wind covariances, and a decorrelation length scale derived from the ensemble (e.g., Warner et al. 2002; Lee et al. 2009). Another application is wind power forecasting where uncertainty metrics are needed (Delle Monache et al. 2011; Liu et al. 2011). For instance, to obtain appropriate spread in concentration predictions from AT&D models there should be good spread in low-level wind direction and atmospheric boundary layer (ABL) depth, as these are two of the most important parameters affecting uncertainty in AT&D predictions (Lewellen and Sykes 1989). By “good spread” we mean an ensemble variance that provides a calibrated, reliable, and sharp forecast probability density function (PDF) (Eckel and Mass 2005). In this study we focus on obtaining good spread for ABL and low-level wind forecasts. Therefore, when evaluating the performance of ensemble configurations for low-level wind prediction, we verify against observations of 10-m winds and 2-m temperature. For other applications other parameters may be important.

It is not clear a priori how best to configure a useful ensemble for AT&D or wind energy applications, or if that particular configuration would be the most useful for other applications, such as quantitative precipitation forecasting. Therefore, some testing is necessary. There are a large number of possible choices of IC, LBC, and physics perturbations that can be included in an NWP ensemble for any given application. The primary aim of this study is to propose an objective methodology to “down-select,” or determine which subset of members should be included in an ensemble, assuming that it is impractical to include all the members, and then to determine if calibrating (dressing) the PDF with Bayesian model averaging provides an appropriately dispersed ensemble. Details about our ensemble and the verification data that we use are discussed in section 2.

Naturally, down-selection may alter the ensemble mean often used as the “best guess” forecast, as well as exacerbate the calibration problem. These issues result both because we will have fewer members and because, by preferentially choosing specific members, we further alter the original PDF of the ensemble. Therefore, for down-selection to be successful, we need to employ postprocessing methods that will help maintain an accurate best guess, and ideally calibrate the ensemble variance as well.

The first postprocessing method used is principal component analysis (PCA). In this study PCA is our primary method for defining candidate down-selected ensemble subsets. We discuss PCA further in section 3. The ensemble calibration method we use in this study is Bayesian model averaging (BMA). BMA is a statistical postprocessing method introduced to the atmospheric sciences by Raftery et al. (2003, 2005) and is used to correct for the underdispersive, and thus uncalibrated, nature of forecast ensembles. We discuss our BMA application in section 4 of this paper. Verification results demonstrating the performance of the down-selection and the calibration, as well as a comparison of the PCA down-selection to a random down-selection, are included in section 5. Finally, section 6 includes an overall summary of the study and lists avenues for future research.

## 2. Data

### a. Ensemble configuration

There are various approaches to ensemble initialization and configuration that are documented in the literature for limited-area ensembles. Each approach attempts to account for forecast uncertainty from various sources, including initial conditions, lateral boundary conditions, physical parameterizations, and model numerics. Even a brief survey of the literature reveals that there is currently no agreed-upon best strategy for configuring limited-area ensembles (e.g., Houtekamer et al. 1996; Stensrud et al. 2000; Warner et al. 2002; Eckel and Mass 2005; Fujita et al. 2007; Jones et al. 2007; Bowler et al. 2008; Clark et al. 2008; Du et al. 2009; Hacker et al. 2011).

While we recognize the likely importance of IC–LBC perturbations even in short-range mesoscale NWP, we choose to test our down-selection methodology on a physics ensemble. One reason is that if we were to down-select from a mixed IC–LBC-physics ensemble, then the results of that down-selection would be difficult to interpret physically. Our goal in outlining this down-selection method is to determine which physics members are most useful to include in a separate, longer-term ensemble that likely would include IC–LBC variability as well. Additionally, if we were to not vary the physics, and use only an IC–LBC random perturbation method that results in equally likely perturbations, then those members would be exchangeable and statistically indistinguishable (Fraley et al. 2010). Therefore, by removing the random signal of IC–LBC perturbations we expect to obtain clearer and more meaningful results.

Our 24-member physics ensemble is created with version 3.2 of the Advanced Research Weather Research and Forecasting Model (ARW-WRF; Skamarock et al. 2008). The microphysics and atmospheric radiation schemes (both longwave and shortwave) are the same for each ensemble member, but the land surface, surface layer, boundary layer, and cumulus scheme configuration varies for each member, as detailed in Table 1. There are 45 full vertical levels in each simulation, with the lowest full level at 24 m AGL, 9 full levels below 500 m AGL, 16 full levels below 1 km AGL, and 24 full levels below 2 km AGL. The model top is at 50 hPa. Such high vertical resolution in the lowest portions of the troposphere is chosen because this study focuses on processes occurring in the ABL. Two model domains are used. The coarse domain uses a horizontal grid spacing of 36 km and a time step of 180 s, while the nested domain uses 12-km grid spacing and a 60-s time step. The coarse domain encompasses the continental United States (CONUS), and the nested domain covers the Great Lakes, Ohio Valley, mid-Atlantic, and Northeast, as shown in Fig. 1. References and details for all the parameterization schemes are contained in Skamarock et al. (2008). It should also be noted that in this ensemble we used a slightly modified version of the Mellor–Yamada–Janjic (MYJ) ABL scheme. In this modified MYJ scheme the ABL depth diagnosis method is changed, and the background turbulent kinetic energy (TKE) is modified from 0.10 to 0.01 J kg^{−1}.

Physics parameterizations for the ARW-WRF ensemble members used in this study. WSM 5-class: WRF Single-Moment 5-Class; RRTMG: Rapid Radiative Transfer Model for GCMs; MM5: fifth-generation Pennsylvania State University–NCAR Mesoscale Model; and ACM2: asymmetric convective model, version 2.

For each month of June–August 2009, six forecast periods are randomly chosen, with three being initialized at 0000 UTC and three being initialized at 1200 UTC, to avoid biasing our results with the diurnal cycle. In total there are 18 forecast times chosen for this summer evaluation period, with a forecast period of 48 h in all cases. No data assimilation is used during the model integration for this study because we desire to simulate a forecasting system, rather than a hindcasting system. The LBCs for all 24 members in this study come from the 0.5° × 0.5° resolution Global Forecast System (GFS) forecast cycles initialized at each of the randomly chosen forecast times.

For the ICs the 0-h GFS analysis is blended with standard World Meteorological Organization (WMO) observations to provide a more accurate initial state. This blending is accomplished with the Obsgrid objective analysis software that is part of the WRF modeling system and developed by the National Center for Atmospheric Research (NCAR) (NCAR 2011, ch. 7). The objective analysis technique we use in Obsgrid is the Cressman scheme, which assigns a distance-weighted radius of influence to each observation (Cressman 1959). This radius of influence is also flow-dependent (i.e., not a simple circular radius of influence). With each successive scan that modifies the first-guess field in Obsgrid with the Cressman scheme, the observations are given lower weights (NCAR 2011).
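As an illustration of the Cressman weighting underlying Obsgrid, the sketch below performs a single successive-correction scan with the classic circular radius of influence, w = (R² − d²)/(R² + d²) for d < R (Cressman 1959). The actual Obsgrid scheme is flow-dependent and applies successive scans with decreasing weights; the function and variable names here are our own.

```python
import numpy as np

def cressman_scan(grid_x, grid_y, first_guess, obs_x, obs_y, obs_val, radius):
    """One Cressman successive-correction scan (simple circular radius).

    Corrects a first-guess field toward observations using the weight
    w = (R^2 - d^2) / (R^2 + d^2) for distances d < R. The innovation at
    each observation is taken against the nearest grid point for brevity.
    """
    gx, gy = np.meshgrid(grid_x, grid_y, indexing="ij")
    num = np.zeros_like(first_guess)
    den = np.zeros_like(first_guess)
    for x, y, val in zip(obs_x, obs_y, obs_val):
        i = int(np.abs(grid_x - x).argmin())      # nearest grid point to obs
        j = int(np.abs(grid_y - y).argmin())
        innov = val - first_guess[i, j]           # observation minus background
        d2 = (gx - x) ** 2 + (gy - y) ** 2
        w = np.where(d2 < radius**2, (radius**2 - d2) / (radius**2 + d2), 0.0)
        num += w * innov
        den += w
    return first_guess + np.where(den > 0, num / den, 0.0)

grid = np.arange(11.0)
background = np.zeros((11, 11))                   # uniform first guess
analysis = cressman_scan(grid, grid, background,
                         [3.0, 8.0], [5.0, 5.0], [2.0, -1.0], radius=3.0)
```

At each observation location the analysis matches the observation, while grid points outside every radius of influence retain the first guess.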

### b. Verification and quality control

To verify our WRF ensemble forecasts, we use standard WMO observations. These observations are quality-controlled against the WRF Preprocessing System (WPS)-interpolated GFS analysis fields, using the Obsgrid software described above. We implement some additional quality control checks, including rejecting any surface observations with a reported elevation higher than 600 hPa, and rejecting surface observations where the reported elevation differed from the model terrain by more than 200 m. We find that blending observations into the initial conditions with Obsgrid improved our verification scores.

Of the 18 forecasts created for the summer evaluation period, six are randomly chosen to be set aside for verification purposes. The remaining 12 forecasts are used as training data for both the down-selection process and for calibration purposes. Table 2 lists which forecast periods are used for the training period and which are used for the verification period. It would of course be preferable to train on more than 12 forecast periods, such as the 25 forecasts on which Raftery et al. (2005) train BMA for the ensemble they use in their study, but our main aim in this preliminary study is to propose and demonstrate a new ensemble down-selection methodology. Raftery et al. (2005) also note that the best length of a training period for BMA likely changes for different variables, regions, and ensemble configurations. In a future study we plan to explore the impact of longer training periods on the effectiveness of down-selection and calibration.

Randomly chosen forecast initialization times (yyyy–mm–dd_hh UTC) for the training set (italics) and verification set (bold).

Three different diagnosed quantities are used for verification: 10-m AGL zonal wind (*u*), 10-m AGL meridional (*υ*) wind, and 2-m AGL temperature (*T*).

Two standard verification metrics are computed: the root-mean-square error (RMSE) and the continuous ranked probability score (CRPS). The RMSE is defined as

$$\mathrm{RMSE} = \left[ \frac{1}{N} \sum_{i=1}^{N} \left( f_i - o_i \right)^2 \right]^{1/2},$$

where $o_i$ is the value of observation $i$, $f_i$ is the forecast value at the time and location of observation $i$, and $N$ is the total number of observations. The CRPS, which assesses both accuracy and sharpness, is defined as (Wilks 2006)

$$\mathrm{CRPS} = \frac{1}{N} \sum_{i=1}^{N} \int_{-\infty}^{\infty} \left[ F_i(x) - F_{o_i}(x) \right]^2 \, dx,$$

where $F_i(x)$ is the forecast cumulative distribution function (CDF) of variable $x$ at the time and location of observation $i$ and all other variables are as before. The CDF of the observation, $F_{o_i}(x)$, is a Heaviside function that steps from 0 to 1 at the observed value. For reference, previous studies of surface variables report RMSE values of a few m s^{−1} and a CRPS of 1.0–3.0 for surface wind speed (e.g., Gneiting et al. 2005, 2006; Kann et al. 2009; Fraley et al. 2010; Sloughter et al. 2010). Our results compare favorably to these previously reported values (section 5).
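Both metrics can be computed for an ensemble as sketched below. For the CRPS we use the kernel form CRPS = E|X − o| − ½E|X − X′|, which is algebraically equivalent to the integral definition when the forecast CDF is the empirical ensemble CDF; this is an illustration, not the verification code used in this study.

```python
import numpy as np

def rmse(forecast, obs):
    """Root-mean-square error of a deterministic forecast."""
    return np.sqrt(np.mean((forecast - obs) ** 2))

def crps_ensemble(members, obs):
    """Mean CRPS of an m-member ensemble over n scalar observations.

    Uses the kernel form CRPS = E|X - o| - 0.5 * E|X - X'|, which for an
    empirical ensemble CDF equals the integrated squared difference
    between the forecast CDF and the observation's Heaviside CDF.
    members: (m, n) array; obs: (n,) array.
    """
    term1 = np.mean(np.abs(members - obs[None, :]), axis=0)
    term2 = 0.5 * np.mean(
        np.abs(members[:, None, :] - members[None, :, :]), axis=(0, 1)
    )
    return float(np.mean(term1 - term2))
```

For a single-member "ensemble" the CRPS reduces to the mean absolute error, and a perfect deterministic ensemble scores zero, which makes the kernel form easy to sanity check.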

Rank histograms are another verification tool we use to assess the dispersion of the ensemble forecasts. For an ensemble containing *n*_{ens} members, verification rank histograms are created by binning each verifying observation within the *n*_{ens} + 1 member distribution. These histograms are frequently used to diagnose the bias and dispersion of ensembles, and ensembles that exhibit no biases and are neither underdispersive nor overdispersive have rank histograms that are approximately flat (Wilks 2006). We say “approximately flat” because recent research indicates that even a perfect ensemble with fewer than several hundred members will be slightly underdispersive at best (Kolczynski et al. 2011). We also use the continuous analog of the verification rank histogram, which is the probability integral transform (PIT) histogram. PIT histograms are defined by evenly spacing the bins throughout the forecast distribution (Raftery et al. 2005).
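The rank-histogram construction just described can be sketched in a few lines (synthetic data and our own function names; not this study's verification code): an exchangeable ensemble yields a roughly flat histogram, while an underdispersive one piles counts into the extreme bins.

```python
import numpy as np

def rank_histogram(members, obs):
    """Verification rank histogram counts for an n_ens-member ensemble.

    Each observation is ranked within the sorted member values, giving
    n_ens + 1 possible bins (ties ignored, as for continuous data).
    members: (n_ens, n) array; obs: (n,) array.
    """
    ranks = np.sum(members < obs, axis=0)                  # rank 0 .. n_ens
    return np.bincount(ranks, minlength=members.shape[0] + 1)

rng = np.random.default_rng(0)
obs = rng.normal(size=2000)
calibrated = rng.normal(size=(10, 2000))          # exchangeable with obs
underdispersive = 0.3 * rng.normal(size=(10, 2000))
flat = rank_histogram(calibrated, obs)            # roughly uniform
u_shaped = rank_histogram(underdispersive, obs)   # peaked at the extremes
```

A PIT histogram is the continuous analog: instead of ranking the observation among members, one bins the value of the forecast CDF evaluated at the observation.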

## 3. Ensemble member down-selection

### a. Principal component analysis

The goal of our ensemble down-selection technique is to remove redundant members and retain the members that contribute to the forecast accuracy and spread. Stated another way, we want to retain just enough ensemble members to span the uncertainty space. This goal arises both because computational resources are generally too limited to allow for very large ensembles, and because ensembles are most useful when each member samples a different portion of the actual PDF of the atmospheric state.

PCA is a method for reducing the dimensionality of a dataset **x** that is comprised of *K* variables *x*_{k}. PCA accomplishes this reduction by defining a new set of *N* variables, where *N* < *K*. If there are substantial correlations (i.e., redundant information) among the variables of the original dataset, then *N* << *K*. These new variables are called principal components (PCs); each PC is a linear combination of the variables from the original dataset, and together the PCs account for the greatest possible amount of the variance from the original dataset. Each of the *N* PCs is defined by one of the eigenvectors **e**_{n} of the covariance matrix of **x** (Wilks 2006):

$$\mathrm{PC}_n = \mathbf{e}_n^{\mathrm{T}} \mathbf{x} = \sum_{k=1}^{K} e_{n,k} x_k.$$

Thus, the principal components are ordered so that the first PC (PC_{1}) accounts for the largest variance in the original dataset; the corresponding eigenvalue, λ_{1}, equals that variance. Subsequent PCs (PC_{2}, PC_{3}, etc.) account for the largest remaining variance in the original dataset, subject to the condition that they are orthogonal to all previously defined PCs.

Our motivation for using PCA is that the PCs are inherently constructed to contain the largest amount of variance in the first *N* PCs. Thus, by selecting those PCs that account for a given amount of the variance, we are maximizing the variance explained by our chosen number of degrees of freedom, as represented here by the ensemble members. This implies that we are spanning our solution space with the smallest possible number of vectors to explain a chosen amount of the variance.
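The eigendecomposition and variance-based truncation described here can be sketched as follows (a minimal illustration with synthetic data; as noted below, this study performed the PCA in RapidMiner rather than with code like this):

```python
import numpy as np

def pca_truncate(errors, var_threshold=0.95):
    """PCA of an (n_samples, K) matrix of member forecast errors.

    Returns the loading matrix (eigenvectors of the error covariance,
    leading PC first), the eigenvalues, and the number N of leading PCs
    whose cumulative variance first reaches var_threshold.
    """
    cov = np.cov(errors, rowvar=False)           # K x K error covariance
    eigvals, eigvecs = np.linalg.eigh(cov)       # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]            # reorder: largest first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    cum_var = np.cumsum(eigvals) / eigvals.sum()
    n_keep = int(np.searchsorted(cum_var, var_threshold)) + 1
    return eigvecs, eigvals, n_keep

# Synthetic errors for 6 "members" driven by only 2 independent signals,
# so only a few PCs are needed to reach 95% of the variance:
rng = np.random.default_rng(1)
signals = rng.normal(size=(500, 2))
errors = signals @ rng.normal(size=(2, 6)) + 0.01 * rng.normal(size=(500, 6))
loadings, variances, n_keep = pca_truncate(errors)
```

Because the six columns are built from two latent signals, the leading two PCs capture nearly all of the error variance, which is exactly the redundancy that makes *N* << *K*.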

We use the freely available software program RapidMiner to perform the PCA in this study (Mierswa et al. 2006; http://www.rapid-i.com). A process for performing PCA is included with the RapidMiner distribution, along with processes for many other statistical postprocessing and data mining techniques. This makes RapidMiner a reasonable choice of software for our purposes.

The original dataset **x** on which we perform PCA is the forecast errors (forecast minus observation) of the ensemble, with *x*_{1}, … , *x*_{24} representing the forecast errors for each of our 24 ensemble members at every observation location over the course of the training period. Each PC therefore reveals which ensemble members contribute most to the error variance in each of our dataset’s 24 dimensions. We perform univariate PCA at each forecast lead time separately to isolate which members contribute most to the variability for each forecast variable at each lead time. We leave the comparison of down-selection via univariate PCA versus multivariate PCA for future work.

We then truncate the set of PCs so that the first several PCs that represent 95% of the variability of the data are used to represent the dataset for each forecast variable and lead time (Jolliffe 2002). Figure 2 shows a plot of the cumulative error variance for our PCs and where we truncate to maintain 95% of the error variance, shown as a dashed line for 2-m temperature at a forecast lead time of 24 h. The dotted line in Fig. 2 indicates where we truncate to maintain 90% of the cumulative error variance, used as an alternate threshold.

For the three different weather variables we verify against—10-m zonal (*u*) wind, 10-m meridional (*υ*) wind, and 2-m temperature (*T*), each at four different forecast lead times, 12, 24, 36, and 48 h—PCA identifies between four and seven PCs that account for 95% of the variability during the training period. After determining the appropriate number of PCs for each variable at each lead time, the next step is to determine which ensemble member contributed most to each of those highest ranking PCs. Recall from the discussion above that each PC is a linear combination of all the variables of the dataset, and that in this case, the ensemble members are the dataset variables. Each member therefore has its own weight for each PC. The ensemble member with the highest weight for a given PC is termed the top contributor to that PC.

The number of times each ensemble member is the top contributor to each of the highest-ranking PCs is tallied. Four candidate subsets of ensemble members are then defined:

- subset A: every ensemble member that was the top contributor at least once to the PCs that as a group account for 95% of cumulative forecast variability;
- subset B: every ensemble member that was the top contributor at least once to the PCs that each account for at least 2.0% of forecast variability;
- subset C: every ensemble member that was the top contributor at least once to the PCs that as a group account for 90% of cumulative forecast variability; and
- subset D: only the ensemble members that were most frequently the top contributor to the PCs that account for 95% of cumulative forecast variability (subset D is therefore a subset of subset A).

Details of which ensemble members are included in each candidate subset for the nested 12-km domain are listed in Table 3. The size and membership of each subset are unique to this particular domain configuration; the PCA method selects a different set and number of members for the 36-km domain, both over the full domain and over the 12-km subdomain. Ensemble performance thus depends on both the region covered and the resolution of the simulations, so the PCA down-selection would need to be performed anew if the ensemble setup or forecast domain changed, or if other users apply this down-selection method to other ensemble modeling systems.
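The tallying of top contributors and the construction of subsets A and D can be sketched as below (a toy illustration with a fabricated loading matrix; the function names are our own):

```python
import numpy as np
from collections import Counter

def tally_top_contributors(loadings, n_keep):
    """Count how often each member is the top contributor to a retained PC.

    loadings: matrix with one PC per column, leading PC first; the member
    (row) with the largest absolute loading on a PC is its top contributor.
    Returns a Counter mapping member index -> count over the n_keep PCs.
    """
    tops = np.abs(loadings[:, :n_keep]).argmax(axis=0)
    return Counter(tops.tolist())

def subset_a(tally):
    """Every member that was a top contributor at least once (cf. subset A)."""
    return sorted(tally)

def subset_d(tally):
    """Only the member(s) most frequently the top contributor (cf. subset D)."""
    best = max(tally.values())
    return sorted(m for m, count in tally.items() if count == best)

# Toy loadings for 4 members: PC1 and PC3 dominated by member 2, PC2 by member 0.
loadings = np.array([
    [0.1, 0.9, 0.2, 0.0],
    [0.2, 0.1, 0.3, 0.9],
    [0.9, 0.3, 0.8, 0.1],
    [0.3, 0.2, 0.1, 0.4],
])
tally = tally_top_contributors(loadings, n_keep=3)
```

With three PCs retained, member 2 is the top contributor twice and member 0 once, so subset A contains members 0 and 2 while subset D contains only member 2.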

Summary of which members were included in each candidate ensemble subset for the 12-km domain. See text for description of how each subset was defined.

There are a couple of potential caveats to the PCA down-selection method. Because we perform the PCA in observation space, there is no guarantee that the down-selected ensemble is the best choice over the entire model domain. It is also possible that some user needs, such as sampling extreme scenarios, will not be satisfied by a smaller ensemble.

### b. Correlation analysis

When down-selecting to a subset of ensemble members, it is desirable to exclude members that provide redundant information. One method that can shed light on whether certain members are providing redundant information is correlation analysis. We choose to correlate the forecast errors for each ensemble member. This process additionally allows us to interpret which physics schemes provide the most variability in the ensemble.

We compute correlations of the errors between all ensemble members for each forecast parameter at each forecast lead time. Error correlations for 24-h forecasts of 2-m temperature and 10-m zonal wind are shown in Figs. 3 and 4, respectively. The correlation matrices are symmetric, so half of each is color-coded as a visual aid, with warmer colors highlighting stronger correlations and cooler colors highlighting weaker correlations between the forecast errors of the ensemble members. The results discussed below are for the 12-km domain, but results are similar both for the same area at 36-km resolution and for the full 36-km domain, which gives us greater confidence when attributing physical explanations to our results.
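Computing such an error correlation matrix is straightforward; the sketch below (synthetic data, our own names) mimics the situation where two members share an error source, such as a common land surface scheme, while a third member errs independently.

```python
import numpy as np

def error_correlations(forecasts, obs):
    """Member-by-member correlation matrix of forecast errors.

    forecasts: (n_members, n) values of one variable at one lead time;
    obs: (n,) verifying observations. Returns a symmetric
    (n_members, n_members) matrix like those color-coded in Figs. 3 and 4.
    """
    return np.corrcoef(forecasts - obs)

# Members 0 and 1 share an error source (e.g., the same land surface
# scheme); member 2 has independent errors:
rng = np.random.default_rng(0)
obs = rng.normal(size=1000)
shared = rng.normal(size=1000)
m0 = obs + shared + 0.2 * rng.normal(size=1000)
m1 = obs + shared + 0.2 * rng.normal(size=1000)
m2 = obs + rng.normal(size=1000)
corr = error_correlations(np.vstack([m0, m1, m2]), obs)
```

The pair sharing the error source correlates strongly, while correlations with the independent member stay near zero, which is the signature used in the analysis above to group members by physics scheme.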

Some interesting patterns appear in the error correlations. For 2-m temperature forecasts, error correlations tend to be grouped according to which land surface model was used for each member (Fig. 3). For all members, the highest correlations are with all other members that share the same land surface scheme (see Table 1 for the physics configuration used for each ensemble member). The next-highest correlations are between pairs of land surface models. Members that used the thermal diffusion land surface scheme and the Pleim–Xu land surface model had highly correlated errors. The same is true for members that used the Noah and Rapid Update Cycle (RUC) land surface models, although those correlations are slightly weaker. The weakest correlations are between members that used either the Noah or RUC land surface models and those that use either the thermal diffusion or Pleim–Xu land surface schemes. These results indicate that the choice of land surface scheme has a substantial impact on 2-m temperature forecasts. This observation is due to the fact that different land surface models represent surface energy and moisture fluxes differently. This impacts surface temperature because processes in the surface layer are dominated by interactions with the land surface (Wyngaard 2010). Another pattern worth noting from Fig. 3 is that correlations are also quite high between members that only vary the cumulus scheme. This is not an entirely surprising result, as the cumulus scheme only has an indirect effect on the model surface temperature by producing precipitation and downdrafts. Thus, to achieve greater variability in 2-m temperature forecasts, ensembles should contain diversity in land surface schemes.

For 10-m wind forecasts, error correlations tend to be grouped according to the choice of boundary layer scheme (Fig. 4). Correlations are generally strongest between members that share the same boundary layer scheme, and especially between members that share both the same boundary layer and cumulus schemes. This result indicates that the choice of ABL scheme, and the cumulus scheme to a lesser extent, has a substantial impact on 10-m wind forecasts. This is due to the fact that the ABL schemes differ in how they model the dynamics and structure of the ABL. The cumulus scheme appears to have a secondary effect on 10-m wind forecast error correlations because low-level model winds are generally only affected by the cumulus scheme in regions where the model produces precipitation. Thus, to achieve greater variability in 10-m wind forecasts, ensembles should contain diversity in boundary layer schemes and, somewhat less importantly, cumulus schemes.

It is not sufficient to vary only one type of physics parameterization for a forecast application that requires an approximation of the uncertainty in both the dynamic and thermodynamic structure of the ABL. Our error correlation analysis demonstrates that variability in certain physical parameterizations affects forecast variables to varying degrees. For instance, the land surface scheme has a greater impact on 2-m temperature forecasts, and the boundary layer and cumulus schemes have a greater impact on 10-m wind forecasts.

Error correlation matrices like Figs. 3 and 4 can be used to perform ensemble down-selection. In this study, however, we use the error correlation matrices merely as evidence that our down-selected ensemble should contain a variety of physics schemes in order to span the uncertainty space and to interpret the importance of some of these factors.

## 4. Ensemble calibration

Once we have down-selected to a smaller number of ensemble members, we must “dress” the ensemble to statistically approximate the PDF of the forecast distribution (Roulston and Smith 2003). Our dressing, or calibration, technique is Bayesian model averaging (Raftery et al. 2003). BMA improves ensemble forecasts by estimating the best weights and parameters for each ensemble member to make a smooth PDF. Weights and parameters are trained to best match the observations during some training period, and are then applied to future forecasts to create a full PDF of the ensemble forecast. We choose BMA as our calibration technique because it is a widely used technique that performs well (e.g., Raftery et al. 2005; Bao et al. 2010; Fraley et al. 2010; Sloughter et al. 2010).

A first step in BMA determines the functional form of the posterior distributions of the ensemble. For temperature, we use a normal distribution, as in Raftery et al. (2005). For vector quantities selecting the distribution is more difficult because the vectors are described by two (or more) different scalar quantities that are related. Here we define the horizontal wind in terms of zonal (*u*) and meridional (*υ*) components using a normal distribution. We make the assumption that the observational errors of wind components are normally distributed because it is an adequate assumption for many variables (e.g., Seidman 1981; Houtekamer 1993; Dee and da Silva 1999). Other studies use an alternate approach of decomposing wind into speed and direction and apply BMA to speed (Sloughter et al. 2010) or direction (Bao et al. 2010) using other distributions. It is not yet clear if one method is superior to the other.

While earlier studies using BMA have focused on calibrating for specific forecast locations independently, we choose to calibrate on all locations simultaneously. This is important for our future applications, as we intend to apply BMA over an entire forecast region for insertion into an AT&D model or other grid applications, rather than forecasting at specific point locations. BMA has also been used to calibrate forecast fields across an entire model grid, rather than calibrating only at individual observation locations (Berrocal et al. 2007).

In the BMA package that we developed for this study, each of the three variables (2-m AGL *T*, 10-m AGL *u*, 10-m AGL *υ*) is calibrated independently. Each forecast lead time (12, 24, 36, and 48 h) is also calibrated separately. We apply a bias correction for each variable and each lead time during the training period by subtracting the average value of each ensemble member’s forecast errors (i.e., forecast − observation) from that member’s forecasts. For the 10-m wind components we do not apply any threshold for calm winds. In this study we perform only univariate BMA, which could yield calibrations that break physical relationships between variables. We expect, however, that by applying BMA over a region we maintain greater spatial consistency. In future studies we plan to explore multivariate BMA.
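The core of Gaussian BMA can be sketched as an expectation–maximization (EM) estimation of member weights and a spread parameter, in the spirit of Raftery et al. (2005). This is a deliberately simplified illustration, not our actual package: it assumes bias-corrected inputs and a single σ shared by all members.

```python
import numpy as np

def bma_gaussian(forecasts, obs, n_iter=100):
    """EM estimation of Gaussian BMA weights and a shared spread sigma.

    forecasts: (m, n) bias-corrected member forecasts over a training
    period; obs: (n,) verifying observations. The calibrated forecast
    PDF is then p(y) = sum_k w_k * N(y; f_k, sigma^2).
    """
    m, n = forecasts.shape
    w = np.full(m, 1.0 / m)                              # start with equal weights
    sigma = np.std(obs - forecasts.mean(axis=0)) + 1e-6  # initial spread guess
    for _ in range(n_iter):
        # E step: responsibility z[k, i] of member k for observation i
        dens = np.exp(-0.5 * ((obs - forecasts) / sigma) ** 2) / sigma
        z = w[:, None] * dens
        z /= z.sum(axis=0, keepdims=True) + 1e-300
        # M step: re-estimate weights and the shared variance
        w = z.mean(axis=1)
        sigma = np.sqrt(np.sum(z * (obs - forecasts) ** 2) / n)
    return w, sigma

# A skillful member versus a pure-noise member: EM should favor the former.
rng = np.random.default_rng(0)
obs = rng.normal(size=400)
good = obs + 0.3 * rng.normal(size=400)
noise = rng.normal(size=400)
w, sigma = bma_gaussian(np.vstack([good, noise]), obs)
```

Consistent with the caveat above, a member's weight reflects its usefulness within the mixture over the training period, not its standalone skill: a skillful member that duplicates another can still receive a small weight.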

BMA ensemble member weights are calculated for each forecast parameter and lead time during the training period on the 12-km domain. BMA weights cannot be directly interpreted as an indicator of model quality. The members with higher weights often do perform better over the training period, but some members may have lower weights if they are highly correlated with some of the other members but do not perform quite as well as those other members (Raftery et al. 2005).

The optimal BMA weights for 2-m temperature forecasts are displayed in a donut chart in Fig. 5. From that figure we see that the ensemble members that use the Noah land surface model generally have the highest weights early in the forecast period, followed by members that use the RUC land surface model. The members with the smallest BMA weights use the thermal diffusion and Pleim–Xu land surface schemes. As with the forecast error correlation results, this again indicates that the choice of land surface scheme has a large impact on 2-m temperature model predictions. Additionally, when combined with the results from Fig. 3 that members with Noah or RUC land surface models are more highly correlated with each other than with members that use other land surface models, these results imply that the Noah and RUC land surface models yield better 2-m temperature forecasts over the training period. It should also be noted that these patterns are most prevalent for 12- and 24-h forecasts, but the BMA weights tend to become more even at 48-h lead time. We suspect that this behavior arises from the increasing importance of LBC errors infiltrating the domain compared to uncertainty in land surface models and surface layer schemes at later lead times.

The optimal BMA weights for the 10-m zonal wind forecasts are displayed in Fig. 6. The ensemble members that use the ACM2 boundary layer scheme generally have higher weights than those with the MYJ boundary layer scheme. The members that use the Yonsei University (YSU) boundary layer scheme generally have the lowest weights. Again, as with the forecast error correlation results, this indicates that the choice of boundary layer scheme has a substantial impact on model predictions of 10-m winds.

## 5. Results

Rank histograms are used to evaluate the dispersion of the ensemble. Figures 7a and 8a show verification rank histograms for the 24-h forecasts of 2-m temperature and 10-m zonal wind for the equal-weighted ensemble, respectively, and the ensemble is clearly underdispersive. Figures 7b and 8b show PIT histograms for the 24-h forecasts of 2-m temperature and 10-m zonal wind for the BMA-weighted ensemble, respectively. By comparing Fig. 7b with Fig. 7a, and Fig. 8b with Fig. 8a, it is clear that the BMA-weighted ensemble is far less underdispersive. In other words, the BMA calibration is successful and improves the spread of the ensemble. It is possible that the rank histograms would have been more uniform if IC–LBC errors, potentially an important source of errors, had been accounted for in this ensemble.

To assess the value of the calibration provided by BMA, the equal-weighted and BMA-weighted ensembles on the 12-km domain are compared with the verification metrics RMSE and CRPS over the verification period.

When comparing different measurements or calculations of a particular value, the statistical significance of those differences should be evaluated whenever possible. When data populations are too small to calculate significance, bootstrap resampling of the data is a common approach (Efron 1979; Wilks 2006). Therefore, as a first attempt to ascertain whether our results are statistically significant, we compute RMSE and CRPS for each observation location (averaged over the duration of the verification period at each lead time for each variable) and then perform 1000 bootstrap samples with replacement on the verification metric populations to obtain means and standard deviations of those populations. For example, there are a total of 2148 valid observations of 2-m temperature at the 48-h lead time during the six-case verification period. An ensemble-averaged RMSE and CRPS are calculated for each of those observations. Bootstrapping allows us to randomly construct 1000 populations of 2148 entries each for both the RMSE and CRPS for 48-h forecasts of 2-m temperature. The population mean RMSE and CRPS are calculated for all 1000 bootstrapped populations of those statistics, and the means and standard deviations of those population means are then calculated and used to evaluate statistical significance. This process is followed for each variable and each lead time. The statistical test we use is the two-sample Student's *t* test with unequal variances [Wilks 2006, Eq. (5.8)]. If a difference is shown to be significant under the assumption of data independence, further testing is necessary when the data are correlated, because correlations reduce the effective sample size: the computed *p* values will be too small, and the significance will be overstated. Most of our results are significant under the assumption of data independence.
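The bootstrap-and-test procedure described above can be sketched as follows. The function names are ours, and the code is an illustrative reconstruction rather than the authors' implementation; it mirrors the text in resampling the per-station metric populations 1000 times and applying the unequal-variance two-sample test [Wilks 2006, Eq. (5.8)].

```python
import numpy as np

def bootstrap_mean_stats(values, n_boot=1000, seed=0):
    """Mean and standard deviation of n_boot bootstrap-resampled
    population means, for a per-station verification metric
    (e.g. RMSE or CRPS averaged over the verification period)."""
    values = np.asarray(values, dtype=float)
    rng = np.random.default_rng(seed)
    means = np.array([rng.choice(values, size=values.size, replace=True).mean()
                      for _ in range(n_boot)])
    return means.mean(), means.std(ddof=1)

def welch_z(x, y):
    """Two-sample Student's t statistic with unequal variances
    [Wilks 2006, Eq. (5.8)]. For large samples it is approximately
    N(0, 1), so |z| > 1.96 indicates significance at the 95% level,
    under the assumption that the data are independent."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    se2 = x.var(ddof=1) / x.size + y.var(ddof=1) / y.size
    return (x.mean() - y.mean()) / np.sqrt(se2)
```

In the study's setting, `values` would hold, for example, the 2148 station-averaged CRPS values at the 48-h lead time, and the test would compare the equal-weighted and BMA-weighted populations.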

Bootstrapping is only valid when the data being examined are independent, however. Meteorological data such as temperature and wind generally exhibit substantial spatial and/or temporal correlations and thus are not independent. One way to account for these correlations is block bootstrapping, in which geographic blocks of data are bootstrapped instead of individual data points (Hamill 1999; Wilks 2006). When evaluating the correlation of precipitation forecast error over a continental-sized domain, Hamill (1999) found significant spatial correlations between the blocks even when the domain was divided into just four large blocks. Our domain is even smaller than that of Hamill (1999), so we expect that block bootstrapping would not be valid for this study. Another approach is field significance testing, which calculates the statistical significance of the difference between any two potentially correlated fields (Livezey and Chen 1983; Elmore et al. 2006). Computation of field significance requires a moving blocks bootstrap of time series data at each observation location. Because each time series in this study would have only six data points (there are only six forecast periods in our verification dataset), we could not expect meaningful results from a moving blocks bootstrap, so we choose not to perform field significance testing. We therefore present a statistical significance analysis of our results under the assumption of data independence, with the important caveat that the computed *p* values are likely too small, and the significance of our results is therefore likely overstated.

To evaluate the performance of the ensemble as a deterministic forecast, we use RMSE to compare the BMA-weighted ensemble mean to the equal-weighted ensemble mean. For each forecast lead time and variable, the bootstrap-mean RMSE for the BMA-weighted ensemble differs only slightly from that for the equal-weighted ensemble (see Fig. 9). While these small differences are generally statistically significant at the 95% confidence level assuming data independence, they are not practically significant: the differences are too small to matter in practice.

We use CRPS to evaluate the overall probabilistic predictions of the ensemble over the 12-km domain for each forecast lead time and variable. The bootstrap-mean CRPS values for the equal-weighted and BMA-weighted ensembles are plotted in Fig. 10 for the full ensemble and each of the four candidate subsets defined by PCA (see Table 3). For all variables, lead times, and ensemble sizes, the BMA-weighted ensembles have bootstrap-mean CRPS values that are approximately 10% lower (better) than those for the corresponding equal-weighted ensembles. These CRPS differences are potentially significant: they are highly significant under the assumption of data independence.
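For reference, the CRPS of a finite (optionally weighted) ensemble can be computed with the standard kernel identity CRPS = E|X - y| - 0.5 E|X - X'|, where X and X' are independent draws from the forecast distribution. The sketch below is a generic illustration; the study's exact score computation for the dressed BMA PDF may differ in how the mixture is discretized.

```python
import numpy as np

def crps_ensemble(obs, ens, weights=None):
    """CRPS of one ensemble forecast against a scalar observation,
    via CRPS = E|X - y| - 0.5 * E|X - X'|.

    Equal weights give the usual empirical-ensemble CRPS; passing BMA
    weights approximates the score of the weighted (dressed) forecast.
    Lower CRPS is better.
    """
    ens = np.asarray(ens, dtype=float)
    if weights is None:
        w = np.full(ens.size, 1.0 / ens.size)
    else:
        w = np.asarray(weights, dtype=float)
        w = w / w.sum()
    term1 = np.sum(w * np.abs(ens - obs))
    term2 = 0.5 * np.sum(w[:, None] * w[None, :]
                         * np.abs(ens[:, None] - ens[None, :]))
    return term1 - term2

# A two-member ensemble at 0 and 2 verifying against obs = 1:
score = crps_ensemble(1.0, [0.0, 2.0])  # -> 0.5
```

Averaging this score over all observation locations gives the per-station populations that feed the bootstrap comparison above.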

The important question of how the PCA-defined subsets perform relative to the full ensemble and to each other is addressed next. To assess whether any ensemble performs conclusively better than another, we require that the bootstrap-mean CRPS values for at least 75% of the lead time–variable combinations exhibit a statistically significant improvement (at the 95% confidence level and assuming data independence) relative to the other ensemble. Under this definition of conclusive improvement and examining the BMA-weighted ensembles, neither subset A (20 members) nor subset B (14 members) is conclusively different from the full ensemble (24 members). Subset C (12 members) and subset D (5 members), however, are conclusively worse than the full ensemble. Additionally, subset B is conclusively better than subset C. We therefore conclude that subset B is the most useful subset, since it is the smallest subset ensemble that is not conclusively worse than the full ensemble. It is also not surprising that the CRPS is significantly poorer for the smallest ensembles, because only a few forecast-specific data points contribute to the forecast ensemble PDF.
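The conclusive-improvement rule can be expressed compactly. The sketch below is an illustrative encoding of the rule, not code from the study: it assumes one signed test statistic per lead time–variable combination (12 combinations here: four lead times by three variables), with negative values indicating that the first ensemble has the lower (better) CRPS.

```python
import numpy as np

def conclusively_better(z_scores, frac=0.75, z_crit=1.96):
    """Paper's 'conclusive improvement' rule: ensemble 1 is conclusively
    better than ensemble 2 if at least `frac` of the lead time-variable
    combinations show a significant CRPS improvement at the 95% level
    (assuming data independence). Negative z means ensemble 1 is better."""
    z = np.asarray(z_scores, dtype=float)
    return bool(np.mean(z < -z_crit) >= frac)

# Hypothetical example: 10 of 12 combinations significantly improved.
z = [-3.0] * 10 + [0.5, -1.0]
result = conclusively_better(z)  # -> True (10/12 >= 75%)
```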

To assess how the PCA down-selection method compares to random down-selection, we choose 100 random subsets of 14 members each, the same ensemble size as our subset B. We calculate the bootstrapped CRPS for each forecast lead time and variable combination in the same way as above for each of the random subsets, for both the equal-weighted and BMA-weighted ensembles. CRPS values for PCA subset B and the random subsets are then compared for statistically significant differences, although we recognize that the assumption of independence among the CRPS data is likely not met. Table 4 shows that for the equal-weighted ensemble forecasts, the CRPS for PCA subset B is statistically significantly better than the CRPS for the random subset ensembles more frequently than it is significantly worse. This result holds for every lead time–variable combination except 36-h *υ*-wind forecasts. Furthermore, the CRPS for subset B is significantly better than that of at least half of the random subsets for all lead times of *u*-wind and temperature forecasts, as well as for 12-h *υ*-wind forecasts. These findings indicate that the PCA down-selection method has some advantage over randomly down-selecting ensemble members, particularly for forecasts of 10-m *u*-wind and 2-m temperature. Interestingly, the advantage of the PCA method over the random method appears to shrink somewhat after calibrating the ensemble with BMA, as subset B does not outperform the random subsets as frequently (as evaluated by CRPS). While the PCA down-selection still adds value for *u*-wind and temperature forecasts after BMA calibration, the random subsets perform better than subset B more frequently for *υ*-wind forecasts at every lead time after BMA calibration.
The difference in performance of subset B relative to random subsets between the equal-weighted and BMA-weighted ensembles illustrates once again how forecast performance is improved by dressing the ensemble PDFs with BMA.

Table 4. For each lead time–variable combination, and for both the equal-weighted and BMA-weighted ensembles, the percentage of the 100 random subsets compared to which the CRPS for PCA subset B is significantly lower (better) or significantly higher (worse).

## 6. Discussion and conclusions

The main goal of this study is to propose an objective methodology to "down-select," or determine, which subset of members should be included in an ensemble used for forecasting applications in which short-range low-level wind prediction is of primary importance, and then to use Bayesian model averaging to "dress" the resulting PDF. The NWP dataset on which we demonstrate this methodology is a 24-member ARW-WRF physics ensemble run over 18 forecast periods during summer 2009. Our down-selection methodology centers on using principal component analysis to determine the ensemble members whose forecast errors contribute most to variability in the forecast. Using PCA, we define four candidate ensemble subsets of various sizes, ranging from 5 to 20 members, for the nested 12-km domain. Somewhat different candidate subsets are defined using the 36-km domain.

To improve forecast dispersion by statistically dressing the PDF, we then calibrate the full ensemble and each subset ensemble using Bayesian model averaging. The BMA-calibrated ensembles perform as well as or better than the equal-weighted, uncalibrated ensemble by several metrics computed for each forecast lead time (12, 24, 36, and 48 h) and forecast variable (2-m AGL temperature, 10-m AGL zonal wind component, and 10-m AGL meridional wind component). First, rank histograms indicate that the BMA-weighted ensemble is substantially less underdispersive than the equal-weighted ensemble. Second, the bootstrap-mean RMSE for the BMA-calibrated ensemble shows no change or a slight improvement relative to the equal-weighted ensemble for most lead time–variable combinations. Third, the bootstrap-mean CRPS for the BMA-weighted ensemble shows an improvement of about 10% compared to the equal-weighted ensemble. Therefore, we conclude that the BMA calibration substantially improves the quality of the ensemble forecasts.

Correlations of both forecast errors across the ensemble and BMA weights for all the ensemble members reveal that model predictions of 2-m temperature are greatly influenced by the choice of land surface model, and that model predictions of 10-m winds are greatly influenced by the choice of boundary layer scheme, and to a lesser extent by the choice of cumulus parameterization scheme. Therefore, we conclude that for any subset to have sufficient variability in the structure and dynamics of the ABL, it must include diversity in land surface and boundary layer schemes and should also have diversity in cumulus schemes.

Our final result addresses the relative performance of the four PCA-determined candidate ensemble subsets. Based on these results and the importance of including a diversity of land surface, boundary layer, and cumulus parameterization schemes, for this ensemble configuration we recommend subset B as an ensemble for applications concerned with low-level wind forecasting. Subset B contains a diversity of physics options and gives forecasts of similar quality to the full ensemble for just over half the computational cost (14 vs 24 members). Subset B is also the smallest candidate subset for which the CRPS values are not conclusively different from those of the full ensemble. Thus, as long as enough members are included to capture the primary sources of variability, down-selecting the ensemble using PCA and then dressing the reduced ensemble with BMA appears to be a successful strategy. Additionally, by the CRPS metric, the PCA down-selection method adds value over random down-selection, particularly for 10-m *u*-wind and 2-m temperature forecasts.

This study is preliminary, and we must acknowledge several caveats. First, it is unknown how these results depend on seasonality, since we only evaluated cases from a single season. The down-selection results differ somewhat between the two domains and resolutions; they may therefore differ by season as well. Second, when creating this ensemble we assumed that diversity in microphysics or radiation schemes would not add substantial variability to the ensemble; this assumption should be tested. To address both issues, we plan to explore the seasonal dependence of this down-selection method by creating a larger ARW-WRF ensemble that varies additional physics schemes over a four-season period. Third, while we calibrate each of the three forecast variables independently in this study, we recognize this may not be an ideal approach, particularly because the zonal and meridional wind fields are related; it is a logical initial approach that we hope to refine in future research. We also plan to explore other postprocessing techniques for down-selection, such as *k*-means clustering or self-organizing maps, and to investigate additional calibration techniques. Furthermore, we plan to verify ensemble predictions against other observations in the ABL, rather than just surface temperature and wind. Finally, we will test these techniques on a larger sample of case days, large enough to allow field significance testing, so that we can make better assessments of statistical significance than is possible in this study.

This study describes a way to use statistical postprocessing to down-select the members of an ensemble that contribute the most variability, then to dress the PDF of the ensemble using BMA. The combination of postprocessing ensemble forecasts with PCA and BMA over a short evaluation period provides an objective method for determining which subset ensemble members should be included in a longer-term forecast ensemble. This is particularly relevant given the nearly ubiquitous constraint on computational resources that most companies, universities, and research centers face. By dressing the ensemble PDF with BMA, we improve the ensemble dispersion as demonstrated in the PIT histograms and improve the ensemble performance as demonstrated by the CRPS values. While in this study we focused on down-selecting an ensemble for low-level wind forecasting applications, the same principles outlined here could be applied to other forecasting applications by verifying against the relevant variables.

## Acknowledgments

The authors thank David Stauffer, George Young, Aijun Deng, and Kerrie Schmehl of Penn State University, and Joel Peltier of Bechtel National, Inc., for helpful discussions and feedback during this study, Brian Reen of Penn State for making the modifications to the MYJ scheme, and Tressa Fowler and John Halley Gotway of NCAR for input on statistical significance issues. We also thank Chuck Ritter of the Penn State University Applied Research Lab for invaluable computational support, and Christian Pagé of MeteoCentre for providing meteorological observation data for verification. In addition, we thank two anonymous reviewers and the journal editor for their insightful comments that helped improve this manuscript during the peer review process. This study was partially sponsored by the Defense Threat Reduction Agency, contract DTRA01-03-D-0010, John Hannan, CIV, Contract Monitor. Authors Jared Lee and Tyler McCandless are also grateful for funding from the Penn State University Applied Research Lab Exploratory & Foundational Program to support their graduate studies during this study.

## REFERENCES

Bao, L., T. Gneiting, E. P. Grimit, P. Guttorp, and A. E. Raftery, 2010: Bias correction and Bayesian model averaging for ensemble forecasts of surface wind direction. *Mon. Wea. Rev.*, **138**, 1811–1821.

Berrocal, V. J., A. E. Raftery, and T. Gneiting, 2007: Combining spatial statistical and ensemble information in probabilistic weather forecasts. *Mon. Wea. Rev.*, **135**, 1386–1402.

Bowler, N. E., A. Arribas, K. R. Mylne, K. B. Robertson, and S. E. Beare, 2008: The MOGREPS short-range ensemble prediction system. *Quart. J. Roy. Meteor. Soc.*, **134**, 703–722.

Clark, A. J., W. A. Gallus Jr., and T.-C. Chen, 2008: Contributions of mixed physics versus perturbed initial/lateral boundary conditions to ensemble-based precipitation forecast skill. *Mon. Wea. Rev.*, **136**, 2140–2156.

Cressman, G. P., 1959: An operational objective analysis system. *Mon. Wea. Rev.*, **87**, 367–374.

Dee, D. P., and A. M. da Silva, 1999: Maximum-likelihood estimation of forecast and observation error covariance parameters. Part I: Methodology. *Mon. Wea. Rev.*, **127**, 1822–1834.

Delle Monache, L., A. Fournier, T. M. Hopson, Y. Liu, B. Mahoney, G. Roux, and T. Warner, 2011: Kalman filter, analog and wavelet postprocessing in the NCAR-Xcel operational wind-energy forecasting system. *Extended Abstracts, Second Conf. on Weather, Climate, and the New Energy Economy*, Seattle, WA, Amer. Meteor. Soc., 4C-2. [Available online at http://ams.confex.com/ams/91Annual/webprogram/Paper186510.html.]

Du, J., and Coauthors, 2009: Recent upgrade of NCEP short-range ensemble forecast (SREF) system. Preprints, *19th Conf. on Numerical Weather Prediction/23rd Conf. on Weather Analysis and Forecasting*, Omaha, NE, Amer. Meteor. Soc., 4A.4. [Available online at http://ams.confex.com/ams/23WAF19NWP/techprogram/paper_153264.htm.]

Eckel, F. A., and C. F. Mass, 2005: Aspects of effective mesoscale, short-range forecasting. *Wea. Forecasting*, **20**, 328–350.

Efron, B., 1979: Bootstrap methods: Another look at the jackknife. *Ann. Stat.*, **7**, 1–26.

Elmore, K. L., M. E. Baldwin, and D. M. Schultz, 2006: Field significance revisited: Spatial bias errors in forecasts as applied to the Eta Model. *Mon. Wea. Rev.*, **134**, 519–531.

Fraley, C., A. E. Raftery, and T. Gneiting, 2010: Calibrating multimodel forecast ensembles with exchangeable and missing members using Bayesian model averaging. *Mon. Wea. Rev.*, **138**, 190–202.

Fujita, T., D. J. Stensrud, and D. C. Dowell, 2007: Surface data assimilation using an ensemble Kalman filter approach with initial condition and model physics uncertainties. *Mon. Wea. Rev.*, **135**, 1846–1868.

Gneiting, T., A. E. Raftery, A. H. Westveld III, and T. Goldman, 2005: Calibrated probabilistic forecasting using ensemble model output statistics and minimum CRPS estimation. *Mon. Wea. Rev.*, **133**, 1098–1118.

Gneiting, T., K. Larson, K. Westrick, M. C. Genton, and E. Aldrich, 2006: Calibrated probabilistic forecasting at the Stateline Wind Energy Center: The regime-switching space-time method. *J. Amer. Stat. Assoc.*, **101**, 968–979.

Grimit, E. P., and C. F. Mass, 2007: Measuring the ensemble spread-error relationship with a probabilistic approach: Stochastic ensemble results. *Mon. Wea. Rev.*, **135**, 203–221.

Hacker, J. P., and Coauthors, 2011: The U.S. Air Force Weather Agency's mesoscale ensemble: Scientific description and performance results. *Tellus*, **63A**, 625–641.

Hamill, T. M., 1999: Hypothesis tests for evaluating numerical precipitation forecasts. *Wea. Forecasting*, **14**, 155–167.

Houtekamer, P. L., 1993: Global and local skill forecasts. *Mon. Wea. Rev.*, **121**, 1834–1846.

Houtekamer, P. L., L. Lefaivre, J. Derome, H. Ritchie, and H. L. Mitchell, 1996: A system simulation approach to ensemble prediction. *Mon. Wea. Rev.*, **124**, 1225–1242.

Jolliffe, I. T., 2002: *Principal Component Analysis*. 2nd ed. Springer, 487 pp.

Jones, M. S., B. A. Colle, and J. S. Tongue, 2007: Evaluation of a mesoscale short-range ensemble forecast system over the northeast United States. *Wea. Forecasting*, **22**, 36–55.

Kann, A., C. Wittmann, Y. Wang, and X. Ma, 2009: Calibrating 2-m temperature of limited-area ensemble forecasts using high-resolution analysis. *Mon. Wea. Rev.*, **137**, 3373–3387.

Kolczynski, W. C., D. R. Stauffer, S. E. Haupt, and A. Deng, 2009: Ensemble variance calibration for representing meteorological uncertainty for atmospheric transport and dispersion modeling. *J. Appl. Meteor. Climatol.*, **48**, 2001–2021.

Kolczynski, W. C., D. R. Stauffer, S. E. Haupt, N. S. Altman, and A. Deng, 2011: Investigation of ensemble variance as a measure of true forecast variance. *Mon. Wea. Rev.*, **139**, 3954–3963.

Lee, J. A., L. J. Peltier, S. E. Haupt, J. C. Wyngaard, D. R. Stauffer, and A. Deng, 2009: Improving SCIPUFF dispersion forecasts with NWP ensembles. *J. Appl. Meteor. Climatol.*, **48**, 2305–2319.

Lewellen, W. S., and R. I. Sykes, 1989: Meteorological data needs for modeling air quality uncertainties. *J. Atmos. Oceanic Technol.*, **6**, 759–768.

Liu, Y., and Coauthors, 2011: Wind energy forecasting with the NCAR RTFDDA and ensemble RTFDDA systems. *Extended Abstracts, Second Conf. on Weather, Climate, and the New Energy Economy*, Seattle, WA, Amer. Meteor. Soc., WC-2. [Available online at http://ams.confex.com/ams/91Annual/webprogram/Paper186591.html.]

Livezey, R. E., and W. Y. Chen, 1983: Statistical field significance and its determination by Monte Carlo techniques. *Mon. Wea. Rev.*, **111**, 46–59.

Mierswa, I., M. Wurst, R. Klinkenberg, M. Scholz, and T. Euler, 2006: YALE: Rapid prototyping for complex data mining tasks. *Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2006)*, T. Eliassi-Rad et al., Eds., ACM Press, 935–940.

NCAR, 2011: Weather Research and Forecasting ARW version 3 modeling system user's guide. NCAR Mesoscale and Microscale Meteorology Division, 362 pp. [Available online at http://www.mmm.ucar.edu/wrf/users/docs/user_guide_V3/contents.html.]

Raftery, A. E., F. Balabdaoui, T. Gneiting, and M. Polakowski, 2003: Using Bayesian model averaging to calibrate forecast ensembles. Tech. Rep. 440, Dept. of Statistics, University of Washington, 28 pp. [Available online at http://www.stat.washington.edu/research/reports/2003/tr440.pdf.]

Raftery, A. E., F. Balabdaoui, T. Gneiting, and M. Polakowski, 2005: Using Bayesian model averaging to calibrate forecast ensembles. *Mon. Wea. Rev.*, **133**, 1155–1174.

Roulston, M. S., and L. A. Smith, 2003: Combining dynamical and statistical ensembles. *Tellus*, **55A**, 16–30.

Seidman, A. N., 1981: Averaging techniques in long-range weather forecasting. *Mon. Wea. Rev.*, **109**, 1367–1379.

Skamarock, W. C., and Coauthors, 2008: A description of the Advanced Research WRF version 3. NCAR Tech. Note NCAR/TN-475+STR, 113 pp.

Sloughter, J. M., T. Gneiting, and A. E. Raftery, 2010: Probabilistic wind speed forecasting using ensembles and Bayesian model averaging. *J. Amer. Stat. Assoc.*, **105**, 25–35.

Stensrud, D. J., J.-W. Bao, and T. T. Warner, 2000: Using initial condition and model physics perturbations in short-range ensemble simulations of mesoscale convective systems. *Mon. Wea. Rev.*, **128**, 2077–2107.

Sykes, R. I., S. F. Parker, and D. S. Henn, 2004: SCIPUFF version 2.0, technical documentation. ARAP Tech. Rep. 727, Titan Corporation, Princeton, NJ, 284 pp.

Warner, T. T., R.-S. Sheu, J. F. Bowers, R. I. Sykes, G. C. Dodd, and D. S. Henn, 2002: Ensemble simulations with coupled atmospheric dynamic and dispersion models: Illustrating uncertainties in dosage simulations. *J. Appl. Meteor.*, **41**, 488–504.

Wilks, D. S., 2006: *Statistical Methods in the Atmospheric Sciences*. 2nd ed. Academic Press, 626 pp.

Witten, I. H., and E. Frank, 2005: *Data Mining: Practical Machine Learning Tools and Techniques*. 2nd ed. Morgan Kaufmann, 525 pp.

Wyngaard, J. C., 2010: *Turbulence in the Atmosphere*. Cambridge University Press, 393 pp.