## 1. Introduction

Ensemble calibration methods have been recognized for over a decade as a way to improve the skill of global and regional probabilistic forecasts. Such methods have been based on reliability diagrams (Zhu et al. 1996; Krzysztofowicz and Sigrest 1999; Toth et al. 2001; Atger 2003), rank histograms (Hamill and Colucci 1997, 1998; Eckel and Walters 1998), spread–skill relationships (Atger 1999), fitted distribution parameters (Wilks 2002), Bayesian model averaging (Kass and Raftery 1995; Raftery et al. 2005; Sloughter et al. 2007), Gaussian ensemble dressing (GED; Roulston and Smith 2003; Wang and Bishop 2005; Fortin et al. 2006), logistic regression (LR; Hamill et al. 2004, 2008), analog forecasts (Zhu and Toth 2000; Hamill and Whitaker 2006), artificial neural networks (Yuan et al. 2007, 2008), nonhomogeneous Gaussian regression (NGR; Gneiting et al. 2005; Hagedorn et al. 2008), and bias correction of individual members (Richardson 2001; Stensrud and Yussouf 2003; Eckel and Mass 2005). No single method is best for all applications (e.g., Wilks and Hamill 2007).

Calibration of convective-scale probabilistic forecasts is still relatively new. The convection-allowing ensembles produced at the National Oceanic and Atmospheric Administration (NOAA) Hazardous Weather Testbed (HWT) Spring Experiments (Xue et al. 2007, 2008, 2009, 2010; Kong et al. 2007, 2008, 2009, 2010) provided an opportunity to study ensemble calibration at a convection-allowing resolution. Schaffer et al. (2011, hereafter SCH11) examined variations on a two-parameter reliability-based method of ensemble precipitation forecast calibration using these data. The first goal of the present study is to compare the effectiveness of several calibration methods for improving the skill of convection-allowing probabilistic precipitation forecasts during the 2009 HWT Spring Experiment.

Grid point–based forecast and verification methods have limited applicability at convection-allowing resolution [e.g., see review by Gilleland et al. (2009)]. Neighborhood-based methods have therefore been proposed to derive probabilistic precipitation forecasts using multiple nearby grid points (e.g., Theis et al. 2005; Mittermaier 2007; Ebert 2009; Schwartz et al. 2010; Bouallegue et al. 2011). Object-based methods have also been proposed to evaluate features rather than grid point–based values (Ebert and McBride 2000; Done et al. 2004; Davis et al. 2006; Gallus 2010; Carley et al. 2011; Johnson et al. 2011a). Most of these studies have applied object-based methods to deterministic, rather than probabilistic, forecasts. A second goal of the present study is to propose a way of deriving object-based probabilistic forecasts, which, along with neighborhood-based forecasts, will be used to verify the convection-allowing ensemble from the 2009 HWT Spring Experiment.

Effective sampling of forecast errors, originating from model dynamics, physics, initial conditions (ICs) and lateral boundary conditions (LBCs), has been an active area of research for ensemble forecasting (Stensrud et al. 2000; Hou et al. 2001; Richardson 2001; Wandishin et al. 2001; Alhamed et al. 2002; Wang and Bishop 2003; Nutter et al. 2004; Wang et al. 2004; Yussouf et al. 2004; Eckel and Mass 2005; Gallus and Bresch 2006; Aligo et al. 2007; Johnson and Swinbank 2009; Palmer et al. 2009; Candille et al. 2010; Berner et al. 2011; Hacker et al. 2011). It has been shown in some early studies that properly designed multimodel ensembles can outperform single model ensembles at global scales and mesoscales. Such studies have only just begun for convective-scale ensembles. An object-based cluster analysis by Johnson et al. (2011b) found that precipitation forecasts from the convection-allowing ensemble produced by the Center for Analysis and Prediction of Storms (CAPS) for the 2009 HWT Spring Experiment systematically clustered based on model dynamic cores [e.g., the Advanced Research Weather Research and Forecasting Model (ARW-WRF) and WRF Nonhydrostatic Mesoscale Model (NMM)]. The third goal of the present study is to evaluate the impacts of model diversity on the skill of the neighborhood and object-based probabilistic precipitation forecasts and the dependence of such impacts on calibration for the convection-allowing ensemble.

In summary, this study has three main goals. First, probabilistic forecasts derived from both the neighborhood and newly proposed object-based methods for the convection-allowing ensemble from the 2009 HWT Spring Experiment are verified. Second, the effectiveness of different calibration methods is compared. Third, the probabilistic forecast skill of single- and multimodel subensembles is evaluated, before and after calibration, using both neighborhood and object-based methods. Section 2 describes the ensemble and methods of generating, verifying, and calibrating the probabilistic forecasts. The neighborhood and object-based results are presented in sections 3 and 4, respectively. Section 5 contains conclusions and a discussion.

## 2. Data and methods

### a. Ensemble and verification data

CAPS has generated experimental daily real-time, convection-allowing (4-km grid spacing) ensemble forecasts for the NOAA HWT Spring Experiments since 2007 (Xue et al. 2007, 2008, 2009, 2010; Kong et al. 2007, 2008, 2009, 2010). In this study, the 2009 ensemble forecasts are verified and calibrated. The 2009 ensemble contained 20 members: 10 with ARW-WRF (Skamarock et al. 2005), 8 with WRF-NMM (Janjic 2003), and 2 with the Advanced Regional Prediction System (ARPS; Xue et al. 2003). IC/LBC perturbations were obtained from National Centers for Environmental Prediction (NCEP) Short Range Ensemble Forecasts (SREF; Du et al. 2006) and physics schemes were varied as detailed in Table 1. All but three members used ARPS three-dimensional variational data assimilation (3DVAR) and cloud analysis package (Gao et al. 2004; Xue et al. 2003; Hu et al. 2006) to assimilate Weather Surveillance Radar-1988 Doppler (WSR-88D) radial velocity and reflectivity along with surface pressure, horizontal wind, potential temperature, and specific humidity from the Oklahoma Mesonet, surface aviation observation, and wind profiler networks. The integration domain and verification domain are shown in Fig. 1. Additional details of the ensemble data are described in Xue et al. (2009) and Kong et al. (2009). Quantitative precipitation estimates (QPEs) from the National Severe Storms Laboratory Q2 product (Zhang et al. 2005) are used as the verification data, referred to as observations. Twenty-six days of forecast and observation data between 30 April 2009 and 6 June 2009, initialized at 0000 UTC, are used.^{1}

Table 1. Details of the ensemble configuration, with columns showing the members, initial conditions (ICs), lateral boundary conditions (LBCs), whether radar data are assimilated (R), and which microphysics scheme (MP) was used with each member. Microphysics schemes used are Thompson (Thom.; Thompson et al. 2008), Ferrier (Ferr.; Ferrier 1994), WRF single-moment 6-class (WSM6; Hong et al. 2004), or Lin (Lin et al. 1983); planetary boundary layer schemes (PBL) are Mellor–Yamada–Janjic (MYJ; Janjic 1994), Yonsei University (YSU; Noh et al. 2003), or turbulent kinetic energy (TKE)-based (Xue et al. 2000); shortwave radiation schemes (SW) are Goddard (Tao et al. 2003), Dudhia (1989), or Geophysical Fluid Dynamics Laboratory (GFDL; Lacis and Hansen 1974); and land surface models (LSM) are Rapid Update Cycle (RUC; Benjamin et al. 2004) or NOAH (NCEP–Oregon State University–Air Force–NWS Office of Hydrology; Ek et al. 2003). NAMa and NAMf are the direct NCEP–North American Mesoscale Model (NAM) analysis and forecast, respectively, while the control (CN) IC has additional radar and mesoscale observations assimilated into the NAMa. Perturbations added to CN members to generate the ensemble of ICs, and LBCs for the SSEF forecasts are from NCEP SREF (Du et al. 2006). SREF members are labeled according to model dynamics: nmm members use WRF-NMM, em members use ARW-WRF (i.e., Eulerian mass core), etaKF members use Eta Model with Kain–Fritsch cumulus parameterization, and etaBMJ members use Eta Model with Betts–Miller–Janjic cumulus parameterization. * N1 refers to the first negative bred perturbation from each SREF model.

### b. Single-model and multimodel subensembles

To explore the impact of the model diversity on the probabilistic precipitation forecast skill, three 8-member subensembles denoted ARW, NMM, and MODEL are defined as follows. Subensemble NMM contains all eight NMM members, and subensemble ARW contains the eight ARW members with the same IC/LBC perturbations^{2} used in the NMM subensemble to emphasize model rather than IC/LBC differences (Table 1). A multimodel subensemble, called MODEL, is defined by randomly choosing four members from each of the ARW and NMM subgroups on each day while still using the same eight IC/LBC perturbations. Only two ARPS members are available, with the only difference between them being one with and one without radar data assimilation, so ARPS members are not included. All subensembles contain the same number of members (eight) to exclude the effect of ensemble size on forecast skill (e.g., Clark et al. 2011).
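The daily draw of the MODEL subensemble can be sketched as follows. This is a minimal illustration under one reading of the text, namely that each of the eight shared IC/LBC perturbations is used exactly once and that four of them are randomly assigned to ARW and the other four to NMM. The function name and the perturbation labels are hypothetical and are not taken from Table 1.

```python
import random

def draw_model_subensemble(perturbations, rng):
    """Assign half of the shared IC/LBC perturbations to ARW and half to NMM.

    ASSUMPTION: each perturbation is used exactly once per day, so the
    multimodel subensemble keeps the same eight perturbations as ARW and NMM.
    """
    arw_perts = set(rng.sample(perturbations, len(perturbations) // 2))
    return [("ARW" if p in arw_perts else "NMM", p) for p in perturbations]

# Hypothetical perturbation labels; a new draw would be made for each forecast day.
perturbations = ["p1", "p2", "p3", "p4", "p5", "p6", "p7", "p8"]
members_today = draw_model_subensemble(perturbations, random.Random(2009))
```

Passing an explicit `random.Random` instance keeps the daily draws reproducible.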

### c. Neighborhood-based probabilistic forecasts

Neighborhood ensemble probability (NEP; Theis et al. 2005; Schwartz et al. 2010) is used to forecast the probability that the accumulated precipitation at each grid point will exceed a threshold. NEP is the percentage of grid points from all ensemble member forecasts within a search radius that exceed the threshold. Thresholds of 2.54 mm (0.1 in.), 6.5 mm (0.256 in.), and 12.7 mm (0.5 in.) for both hourly and 6-hourly accumulations are chosen to include light, moderate, and heavy rainfall events.^{3} The accumulation periods are chosen to include both convective-scale and mesoscale time periods. A search radius of 48 km (12 grid points) is subjectively chosen to balance the competing effects of higher skill and loss of detail at increasing radii. The resulting radius of the neighborhoods is similar to that identified by Mittermaier and Roberts (2010), who used a more objective metric (see also Mittermaier et al. 2012). There is minimal sensitivity of relative skill between different subensembles over a range of search radii (not shown). An example NEP forecast for the probability of exceeding 6.5 mm h^{−1} accumulation is shown in Fig. 2d.
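The NEP definition above can be sketched in a few lines of pure Python. This is an illustrative implementation, not the code used in the study. It assumes a circular neighborhood measured in grid points, treats "exceed" as strictly greater than the threshold, and simply ignores neighborhood points that fall outside the grid; the boundary treatment in particular is an assumption.

```python
def neighborhood_ensemble_probability(members, threshold, radius):
    """Return the NEP field for a list of 2D accumulation fields.

    members: list of 2D lists, one accumulation field per ensemble member.
    threshold: accumulation threshold in the same units as the fields.
    radius: search radius in grid points (e.g., 12 for 48 km at 4-km spacing).
    """
    n_rows, n_cols = len(members[0]), len(members[0][0])
    r = int(radius)
    # Offsets of all grid points inside the circular search radius.
    offsets = [(di, dj)
               for di in range(-r, r + 1)
               for dj in range(-r, r + 1)
               if di * di + dj * dj <= radius * radius]
    nep = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for j in range(n_cols):
            hits = total = 0
            for di, dj in offsets:
                ii, jj = i + di, j + dj
                # ASSUMPTION: points outside the grid are skipped.
                if 0 <= ii < n_rows and 0 <= jj < n_cols:
                    for field in members:
                        total += 1
                        hits += field[ii][jj] > threshold
            nep[i][j] = hits / total
    return nep
```

With `radius=0` the neighborhood contains only the grid point itself, so the function returns the fraction of members exceeding the threshold at that point.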

Fig. 2. (a) An example of 1-h accumulated precipitation forecast from one ensemble member, (b) observed 1-h accumulated precipitation, (c) probability that each forecast object in (a) would occur, (d) NEP for 1-h accumulated precipitation to exceed 6.5 mm, (e) reference forecast corresponding to (c), and (f) reference forecast corresponding to (d). All forecasts are at the 24-h lead time and valid at 0000 UTC 14 May 2009.

Citation: Monthly Weather Review 140, 9; 10.1175/MWR-D-11-00356.1

### d. Object-based probabilistic forecasts

A method to generate object-based probabilistic forecasts is proposed in this section. The “method for object-based diagnostic evaluation” (MODE; Davis et al. 2006) is used to define objects as contiguous areas exceeding an accumulation threshold after smoothing with a four–grid point averaging radius. Johnson et al. (2011a) found that a 6.5-mm threshold resulted in objects that were most similar to the authors’ subjective interpretations of convective storms during multiple independent events from the 2009 Spring Experiment. The same threshold is therefore used in this study for both accumulation periods. Prespecified attributes describing each object, such as area, aspect ratio, and centroid location, are then calculated. Objects from different fields that have sufficiently similar attribute values, as measured by a quantity called “total interest” [see Eq. (A1) in the appendix], are matched. Further details on the MODE algorithm are provided in Davis et al. (2006). The parameters for identifying matching objects in this study are provided in the appendix.

An object-based method is proposed to forecast the probability that an object of interest will occur, or in other words will be matched by an observed object. The forecast objects of interest are obtained from a single ensemble member rather than the observations, which are not yet known at the time of the forecast. The forecast and observed objects constitute a match if the total interest between the forecast and observed objects exceeds a matching threshold (defined in the appendix). The forecast probability is generated for each forecast object as the fraction of ensemble members forecasting the object of interest. For example, if Fig. 3a represents the forecast objects of interest and Figs. 3b–h are the other member forecasts, then the black object in Fig. 3a would have a 12.5% probability of occurring and the dark gray object in Fig. 3a would have an 87.5% probability of occurring. These probabilities correspond to the number of members with a matching (i.e., same color) object and illustrate how the method retains storm-scale details while using probability to quantify uncertainty. The member defining the forecast objects is randomly chosen because results were not improved by attempts to select a “best” or “centroid” member (not shown). An example object-based probabilistic forecast is shown in Fig. 2c, where the forecast objects are determined by the forecast from one ensemble member shown in Fig. 2a. In summary, forecasts are generated for the probability that the similarity, measured by the total interest, between a forecast object and an observed object will exceed a threshold. This is directly analogous to the neighborhood method of forecasting the probability of exceeding a precipitation accumulation for a given location.
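The uncalibrated object-based probability can be sketched as a count of members that contain a matching object. The sketch below assumes that the total interest values have already been computed (e.g., by MODE) and that the member defining the object of interest counts as forecasting it, which is one reading consistent with the 12.5% and 87.5% values in the eight-member example. The function name is hypothetical, and the matching threshold of 0.6 is only a placeholder; the value used in the study is defined in the appendix.

```python
def object_probability(interest_by_member, match_threshold):
    """Fraction of ensemble members forecasting the object of interest.

    interest_by_member: one list per *other* ensemble member, holding the
        total interest between the object of interest and each object
        forecast by that member (an empty list if it forecasts no objects).
    match_threshold: total interest above which two objects are a match.
    """
    # ASSUMPTION: the member that defines the object counts as forecasting it.
    n_members = len(interest_by_member) + 1
    n_matching = 1 + sum(
        1 for interests in interest_by_member
        if any(value > match_threshold for value in interests)
    )
    return n_matching / n_members

# Eight-member example: no other member matches the first object,
# and six of the seven other members match the second object.
PLACEHOLDER_THRESHOLD = 0.6  # hypothetical value, not taken from the paper
p_rare = object_probability([[0.2], [], [0.4], [0.1], [], [0.3], [0.5]],
                            PLACEHOLDER_THRESHOLD)
p_common = object_probability([[0.9], [0.8], [0.7, 0.2], [0.95], [0.3], [0.75], [0.85]],
                              PLACEHOLDER_THRESHOLD)
```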

Fig. 3. A hypothetical ensemble of eight object-based forecasts, showing (a) the forecast objects of interest and (b)–(h) seven other members’ forecasts. Forecast objects in (b)–(h) are colored black if they match the black forecast object of interest, dark gray if they match the dark gray forecast object of interest, and light gray if they do not match any forecast objects of interest. Such an ensemble of forecasts would result in 12.5% (87.5%) uncalibrated forecast probability of the black (dark gray) forecast object occurring.

Citation: Monthly Weather Review 140, 9; 10.1175/MWR-D-11-00356.1

### e. Verification of probabilistic forecasts

Forecasts are verified using the Brier skill score (BSS), which has been defined and discussed by Brier (1950), Murphy (1973), and Wilks (2006), among others. For the grid point–based forecasts, the event being forecast is the exceedance of an accumulation threshold (e.g., SCH11). For the object-based forecasts the event being forecast is the exceedance of a total interest [Eq. (A1)] threshold.

BSS is defined relative to a reference probabilistic forecast. The grid point–based reference forecast is the climatological frequency of the observed event at the same grid point at the same time of day. The climatology is determined from the radar-derived QPE on 134 days from the 2007–10 HWT Spring Experiment periods. The reference forecast is specific to each grid point to avoid biased skill resulting from spatially inhomogeneous climatology (Hamill and Juras 2007). The object-based reference forecast is the percentage of days in the 134-day climatology that contain an object matching the forecast object for which the forecast probability is generated. Reference forecasts are obtained after excluding data from the day being forecast to avoid biasing the reference forecast with prior knowledge of the observation. Positive BSS indicates better forecasts than the reference forecast. Examples of neighborhood and object-based reference forecasts are shown in Figs. 2f and 2e, respectively.
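For concreteness, the score can be written out. The Brier score is the mean squared difference between the forecast probability and the binary outcome (1 if the event occurred, 0 otherwise), and BSS = 1 − BS/BS_ref, where BS_ref is the Brier score of the reference forecast. The sketch below is a minimal illustration with hypothetical function names; the sample values are made up and are not results from the study.

```python
def brier_score(probabilities, outcomes):
    """Mean squared difference between forecast probability and binary outcome."""
    return sum((p - o) ** 2 for p, o in zip(probabilities, outcomes)) / len(outcomes)

def brier_skill_score(forecast_probs, reference_probs, outcomes):
    """BSS relative to a reference forecast; positive means better than the reference."""
    bs = brier_score(forecast_probs, outcomes)
    bs_ref = brier_score(reference_probs, outcomes)
    return 1.0 - bs / bs_ref

# Made-up sample: four forecasts verified against a constant 50% reference.
example_bss = brier_skill_score([0.9, 0.1, 0.8, 0.2], [0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])
```

The function is undefined when the reference Brier score is zero, which would need separate handling in practice.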

Grid point–based forecasts are verified every hour for the first 6 h and every 3 h thereafter for 1-h accumulations and every 6 h for 6-h accumulations. Object-based forecasts are verified at 1-, 3-, 6-, 12-, 18-, 24-, and 30-h (6, 12, 18, 24, and 30 h) lead times for 1-h (6 h) accumulations. For the grid point–based forecasts, statistically significant differences in BSS at the 95% confidence level are determined by a one-tailed, paired Wilcoxon signed rank test (Wilks 2006) of the daily Brier scores (Hamill 1999).^{4} We follow Hamill (1999) to use each day as one sample for significance testing to avoid the challenges of extracting independent samples from a single day (Wang and Bishop 2005). For the object-based forecasts, statistical significance is assessed using permutation resampling (Hamill 1999) because some days have fewer objects being forecast than other days. To reduce the sensitivity of the object-based probabilistic verification to the choice of the member defining the forecast objects of interest, BSS is aggregated over 50 repetitions with the member randomly chosen for each forecast day and repetition.
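The idea behind resampling with each day as one sample can be illustrated with a paired sign-flip permutation test. This sketch is a generic version of that technique and is not claimed to reproduce the exact procedure of Hamill (1999). It assumes that both forecast systems are scored on the same objects each day, so that daily sums of squared probability errors can be compared directly, and it randomly swaps the two systems' labels day by day. The function name, the number of resamples, and the seed are illustrative choices.

```python
import random

def permutation_p_value(daily_sse_a, daily_sse_b, n_resamples=10000, seed=0):
    """One-sided p value for system A having a lower Brier score than system B.

    daily_sse_a, daily_sse_b: per-day sums of squared probability errors for
        the two systems. Days with more forecast objects naturally carry
        more weight because their sums are larger.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(daily_sse_a, daily_sse_b)]
    observed = sum(diffs)
    as_extreme = 0
    for _ in range(n_resamples):
        # Swapping the two systems on a given day flips the sign of that day's difference.
        resampled = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if resampled <= observed:
            as_extreme += 1
    return as_extreme / n_resamples
```

A small p value indicates that the observed advantage of system A would rarely arise if the two systems were interchangeable.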

### f. Calibration of probabilistic forecasts

Several calibration methods, described in the following subsections, are applied and compared for both the neighborhood and object-based probabilistic forecasts. To test the sensitivity of the calibrations to the length of the training period, both 10- and 25-day training periods are used. When 10 days of training are used, the training days are the 10 days immediately preceding the day of the forecast.^{5} For 25-day training, all days except the day of the forecast are used, following the standard cross-validation method (e.g., Wilks and Hamill 2007). The training data are taken from the entire forecast domain to provide a large number of samples. Unless otherwise stated, results are shown using the 10-day training period. Sensitivity of the results to the training period length is also discussed. Data from previous seasons are not used for training because of the unique ensemble configuration during the 2009 HWT Spring Experiment. Preliminary tests with 25 days of training data also did not show an improvement in skill from calibrating separately in three regions defined by geographical location or climatological observed frequency (not shown).

#### 1) Reliability-based method

Reliability-based calibration is applied by placing the forecast probability into a discrete bin, determining the observed frequency for forecasts in that bin during a training period, and using that observed frequency as the calibrated forecast probability (Zhu et al. 1996). Traditional ensemble-based probabilistic forecasts take discrete values corresponding to the number of ensemble members so the bins are implicitly defined by the ensemble size (e.g., Zhu et al. 1996). This method of assigning bins is used for the object-based probabilistic forecasts.

The number and size of bins are not obvious for the nearly continuous NEP. Therefore, 500 bins with an equal number of samples in each bin are used, based on the following observations. First, the 20-member ensemble has substantial within-bin variability of observed frequency when NEP forecasts are grouped into 21 bins. Second, the number of training samples decreases rapidly with increasing forecast probability if the bins are equally spaced (Fig. 4). The BSS is not significantly changed by doubling or halving the number of bins, but it is higher with 500 than with 21 bins (not shown).
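Reliability-based calibration with equally populated bins can be sketched as follows. The sketch sorts the training forecasts, splits them into bins holding roughly equal numbers of samples, and stores the observed frequency of each bin as the calibrated probability. The function names are hypothetical, and the handling of ties (many identical forecast values, such as zero NEP, spanning several bins) is simplified; a new forecast that ties with a bin edge is assigned to the first such bin.

```python
import bisect

def train_reliability_bins(train_probs, train_outcomes, n_bins):
    """Return (upper_edges, observed_freqs) for bins with roughly equal sample counts."""
    pairs = sorted(zip(train_probs, train_outcomes))
    n = len(pairs)
    upper_edges, observed_freqs = [], []
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:  # more bins than samples
            continue
        upper_edges.append(chunk[-1][0])
        observed_freqs.append(sum(outcome for _, outcome in chunk) / len(chunk))
    return upper_edges, observed_freqs

def calibrate(prob, upper_edges, observed_freqs):
    """Replace a forecast probability with the observed frequency of its bin."""
    idx = bisect.bisect_left(upper_edges, prob)
    # Forecasts above the largest training value fall in the last bin.
    return observed_freqs[min(idx, len(observed_freqs) - 1)]
```

In the setting described above, `n_bins` would be 500 for the NEP forecasts, with the training samples pooled over the entire domain and the training days.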

Fig. 4. Example of reliability diagram used for reliability-based calibration training for forecasts issued at 0000 UTC 13 May 2009 for the full 20-member ensemble at 12-h lead time for the 2.54 mm h^{−1} threshold. Diagonal dashed line indicates perfect reliability. Vertical dashed lines indicate 21 equally spaced bins of forecast probability. Inset histogram shows number of forecasts during training in bins with width of 0.01 forecast probability.

Citation: Monthly Weather Review 140, 9; 10.1175/MWR-D-11-00356.1

#### 2) Two-parameter reliability method

A two-parameter reliability-based calibration of neighborhood-based probabilistic forecasts has been proposed by SCH11. This method is similar to the reliability-based method described above, with two key differences. First, the uncalibrated forecasts are divided into as many evenly spaced bins as the number of ensemble members, instead of 500 unevenly spaced bins. Second, each of the bins is further divided into seven smaller bins based on the average forecast accumulation at grid points within the same search radius as used to calculate the NEP. For example, the 30%–35% forecast probability bin is further divided into seven bins corresponding to average accumulations of <0.01 in., [0.01 in., 0.05 in.), [0.05 in., 0.10 in.), [0.10 in., 0.25 in.), [0.25 in., 0.50 in.), [0.50 in., 1 in.), and ≥1 in., following SCH11. The second parameter, average forecast accumulation, is used because a threshold is expected to be more likely to be exceeded when the mean forecast accumulation is higher (SCH11).^{6}

An analogous two-parameter reliability-based method is used for the object-based probabilistic forecasts. Here, the forecast probability bins are further divided into three smaller bins based on the attribute of object area, instead of the average accumulation, because larger objects are expected to be more likely to be matched than smaller objects (Davis et al. 2006). Three bins are used instead of the seven used by SCH11 to ensure sufficient training samples in each bin. The bins contain objects smaller than 1600 km^{2}, between 1600 and 4000 km^{2}, and greater than 4000 km^{2} for an approximately even distribution of objects.
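The neighborhood-based version of the two-parameter lookup can be sketched as a table indexed by a probability bin and an accumulation bin. This is an illustration only. It assumes accumulation bin edges of 0.01, 0.05, 0.10, 0.25, 0.50, and 1 in. (seven bins in total) and as many evenly spaced probability bins as ensemble members; the function names are hypothetical. The object-based version would replace the accumulation edges with the two area edges (1600 and 4000 km^{2}).

```python
import bisect
from collections import defaultdict

# Edges (in.) separating the seven average-accumulation bins.
ACCUM_EDGES_IN = [0.01, 0.05, 0.10, 0.25, 0.50, 1.00]

def bin_index(prob, mean_accum_in, n_members):
    """Return (probability bin, accumulation bin) for one forecast."""
    prob_bin = min(int(prob * n_members), n_members - 1)  # prob == 1 goes in the last bin
    accum_bin = bisect.bisect_right(ACCUM_EDGES_IN, mean_accum_in)
    return prob_bin, accum_bin

def train_table(samples, n_members):
    """Observed event frequency in each occupied (probability, accumulation) bin.

    samples: iterable of (forecast probability, mean accumulation in inches,
        observed outcome as 0 or 1) from the training period.
    """
    counts = defaultdict(lambda: [0, 0])  # bin -> [events, forecasts]
    for prob, accum, observed in samples:
        cell = counts[bin_index(prob, accum, n_members)]
        cell[0] += observed
        cell[1] += 1
    return {key: events / total for key, (events, total) in counts.items()}
```

A calibrated forecast is then the table entry for the bin of the new forecast; bins with no training samples would need a fallback, which is not addressed in this sketch.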

#### 3) Logistic regression

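LR relates the forecast probability to the predictors through the standard logistic function,

$$P = \frac{1}{1 + \exp\left[-\left(\beta_0 + \sum_{i=1}^{N} \beta_i x_i\right)\right]},$$

where *P* is the forecast probability, *x*_{i} are the predictors, *N* is the number of predictors, and *β*_{i} (including the intercept *β*_{0}) are the fitted coefficients. (LR was implemented using software written by Alan Miller that is freely available online at http://jblevins.org/mirror/amiller/#logit.)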

For neighborhood-based forecasts, the mean and standard deviation of both accumulated precipitation and neighborhood probability (NP; i.e., NEP for a single member; Schwartz et al. 2010) are considered as potential predictors. Hamill et al. (2008) found improved performance of LR for a global ensemble by raising precipitation to the ¼ power to reduce the skew of the precipitation distribution and by including standard deviation as a second predictor. In our dataset as well, the power transformation of both accumulation and NP, applied before taking the mean, results in similar or higher skill than the untransformed predictors for all thresholds (Fig. 5). The addition of standard deviation as a second predictor generally has little impact on skill at early lead times and a positive impact on skill at later lead times. The decrease in skill resulting from adding the second predictor at the 12-h lead time for the 12.7 mm h^{−1} accumulation is not statistically significant. Since the mean and standard deviation of NP^{1/4} are either the most skillful or one of the most skillful sets of predictors for most lead times and thresholds, they are used for the comparison to other calibration methods.
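The selected predictor set can be sketched as follows: each member's NP is raised to the ¼ power, the mean and standard deviation are taken across members, and the result is passed through the standard logistic function P = 1/{1 + exp[−(β_0 + Σ β_i x_i)]}. The function names are hypothetical, the use of the population (rather than sample) standard deviation is an assumption, and the coefficients in the usage lines are placeholders, not fitted values from the study.

```python
import math
import statistics

def lr_predictors(member_np):
    """Mean and standard deviation across members of NP raised to the 1/4 power."""
    transformed = [value ** 0.25 for value in member_np]
    # ASSUMPTION: population standard deviation across the ensemble members.
    return statistics.mean(transformed), statistics.pstdev(transformed)

def lr_probability(predictors, beta0, betas):
    """Logistic-regression probability for fitted coefficients beta0 and betas."""
    z = beta0 + sum(b * x for b, x in zip(betas, predictors))
    return 1.0 / (1.0 + math.exp(-z))

# Placeholder coefficients for illustration only; real values come from training.
predictors = lr_predictors([0.0, 0.05, 0.20, 0.40])
probability = lr_probability(predictors, beta0=-3.0, betas=(4.0, 1.0))
```

The transformation is applied to each member before averaging, matching the order described above.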

Fig. 5. Brier skill score of LR forecasts for the (a) 2.54 mm h^{−1}, (b) 6.5 mm h^{−1}, (c) 12.7 mm h^{−1}, (d) 2.54 mm (6 h)^{−1}, (e) 6.5 mm (6 h)^{−1}, and (f) 12.7 mm (6 h)^{−1} thresholds using mean neighborhood probability (NP), mean NP to the 0.25 power (NP^¼), mean precipitation accumulation (accum.), mean accum. to the 0.25 power (accum.^¼), mean and standard deviation of NP to the 0.25 power (NP^¼ and stdev.), and mean and standard deviation of accum. to the 0.25 power (accum.^¼ and stdev.) as predictors.

Citation: Monthly Weather Review 140, 9; 10.1175/MWR-D-11-00356.1

For the object-based probabilistic forecasts, the number of members with a matching object (i.e., the uncalibrated forecast probability) and the attributes of the forecast object were considered as potential predictors. The number of ensemble members with a matching object and the natural logarithm of object area result in the highest BSS (Fig. 6) and they are therefore used for comparison to other calibration methods.

Fig. 6. Brier skill score of object-based forecasts calibrated with LR using the number of matching objects (num), the number of matching objects and the area of the deterministic forecast object (area), the number of matching objects and the natural logarithm of area (ln[area]), and the number of matching objects and the aspect ratio of the deterministic forecast object (aspect) as predictors for (a) 1-h and (b) 6-h accumulation periods.

Citation: Monthly Weather Review 140, 9; 10.1175/MWR-D-11-00356.1

#### 4) Cumulative distribution function–based bias adjustment

Calibration is also accomplished by adjusting for the systematic forecast bias of each member using the cumulative distribution function (CDF) of forecasts and observations during a training period (e.g., Wood et al. 2002; Hamill and Whitaker 2006; Voisin et al. 2010). These studies obtained a calibrated deterministic forecast by replacing each forecast value with the observed value that had the same cumulative probability as the uncalibrated forecast value during a training period. For example, if a forecast of 10 mm h^{−1} was the 95th percentile of all forecasts during training whereas the 95th percentile of the observations was only 8 mm h^{−1}, then a forecast of 10 mm h^{−1} is replaced with a forecast of 8 mm h^{−1}.
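The CDF-based adjustment described above is a form of quantile mapping, which can be sketched with empirical CDFs. The sketch finds the cumulative probability of a forecast value among the training forecasts and returns the training observation at the same cumulative probability. The function name is hypothetical, the nearest-rank quantile rule is an assumption (other interpolation choices are possible), and the training values in the usage lines are made up to mirror the 10 mm h^{−1} to 8 mm h^{−1} example.

```python
import bisect

def cdf_bias_adjust(value, train_forecasts, train_observations):
    """Map a forecast value to the observed value with the same cumulative probability."""
    fcst = sorted(train_forecasts)
    obs = sorted(train_observations)
    # Number of training forecasts less than or equal to the value.
    rank = bisect.bisect_right(fcst, value)
    # Nearest-rank position of the same cumulative probability in the observations
    # (integer ceiling of rank * len(obs) / len(fcst), converted to a 0-based index).
    idx = -(-rank * len(obs) // len(fcst)) - 1
    idx = min(max(idx, 0), len(obs) - 1)
    return obs[idx]

# Made-up training data: 10 is the 95th percentile of the forecasts,
# while the 95th percentile of the observations is 8.
train_fcst = [0.0] * 18 + [10.0, 20.0]
train_obs = [0.0] * 18 + [8.0, 12.0]
adjusted = cdf_bias_adjust(10.0, train_fcst, train_obs)
```

For the object-based forecasts, the same mapping would be applied to total interest values rather than to accumulated precipitation.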

For the object-based probabilistic forecasts, total interest [Eq. (A1)], instead of accumulated precipitation, is adjusted using this CDF method because the object-based forecasts are effectively predicting the total interest, rather than accumulated precipitation, to exceed a threshold. For the object-based forecasts the CDF of total interests between the forecast objects and observed objects is first calculated. The total interests between the forecast objects and objects of other ensemble members are then adjusted to have the same CDF.

## 3. Neighborhood-based results

Verification of the uncalibrated, calibrated, and subensemble neighborhood-based probabilistic forecasts is described in sections 3a, 3b, and 3c, respectively. Table 2 summarizes all the comparisons discussed in sections 3 and 4.

Table 2. Summary of the comparisons discussed in sections 3 and 4. The second column identifies the section containing further details, while the third through fifth columns indicate, respectively, whether neighborhood or object-based methods were applied, whether calibration was applied, and the length of the training period if calibration was applied.

### a. Verification

The skill of the uncalibrated ensemble is verified first by comparing the neighborhood ensemble probability (NEP) to the traditional ensemble probability (TRAD) in Fig. 7. Unlike the NEP, TRAD is derived at each grid point from the percentage of ensemble members forecasting precipitation greater than the threshold only at that grid point. The calibrated results in Fig. 7 are discussed in section 3b.
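
The contrast between the two probabilities can be illustrated with a short sketch. TRAD follows the definition given above. For the NEP, the sketch assumes the gridpoint probabilities are averaged over a square neighborhood. The neighborhood shape and size used in the study are defined in section 2 and are not reproduced here, so `half_width` and both function names are hypothetical.

```python
def trad_probability(members, threshold):
    """Fraction of members exceeding the threshold at each grid point."""
    n_members = len(members)
    rows, cols = len(members[0]), len(members[0][0])
    return [
        [sum(m[i][j] > threshold for m in members) / n_members for j in range(cols)]
        for i in range(rows)
    ]


def neighborhood_probability(members, threshold, half_width):
    """Average of the gridpoint probabilities over a square neighborhood (assumed shape)."""
    trad = trad_probability(members, threshold)
    rows, cols = len(trad), len(trad[0])
    nep = []
    for i in range(rows):
        row = []
        for j in range(cols):
            # Collect gridpoint probabilities inside the neighborhood, clipped at the domain edge.
            values = [
                trad[a][b]
                for a in range(max(0, i - half_width), min(rows, i + half_width + 1))
                for b in range(max(0, j - half_width), min(cols, j + half_width + 1))
            ]
            row.append(sum(values) / len(values))
        nep.append(row)
    return nep


# Two hypothetical members on a 3 x 3 grid, each with one rainy point (mm/h).
members = [
    [[0.0, 0.0, 0.0], [0.0, 5.0, 0.0], [0.0, 0.0, 0.0]],
    [[5.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
]
trad = trad_probability(members, 2.54)
nep = neighborhood_probability(members, 2.54, 1)
```

In this toy case TRAD is zero at the lower-right corner because no member exceeds the threshold exactly there, whereas the NEP is nonzero because a member forecasts precipitation at an adjacent point.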

Fig. 7. Brier skill score of traditional ensemble probability (TRAD; solid), NEP without calibration (NEP; dashed dense), NEP calibrated using the reliability-based method (NEPrel; dotted dense), logistic regression (LR; dashed sparse), SCH11 (dotted sparse), and NEP from bias adjusted members (NEPba; dashed–dotted) for (a) 2.54 mm h^{−1}, (b) 6.5 mm h^{−1}, (c) 12.7 mm h^{−1}, (d) 2.54 mm (6 h)^{−1}, (e) 6.5 mm (6 h)^{−1}, and (f) 12.7 mm (6 h)^{−1}. Note that (d)–(f) have a different vertical scale than (a)–(c) to facilitate comparison among the lines. Further comparison is facilitated with statistical significance tests in Table 3.

Probabilistic forecasts derived from NEP are more skillful than those derived from TRAD (Fig. 7). TRAD has less skill than the reference forecast (i.e., negative BSS) at many lead times for the 1-h accumulated precipitation forecasts, especially at 2–4- and 18–21-h lead times (Figs. 7a–c). The 6-h accumulation forecasts are more skillful than the 1-h accumulation forecasts, having positive skill except for 6- and 24-h lead times at the 12.7 mm (6 h)^{−1} threshold (Figs. 7d–f). The NEP improves the probabilistic forecast skill compared to TRAD at all forecast lead times considered, although the NEP skill remains negative for the least skillful lead times and thresholds. Even when the NEP forecast skill is greatest (e.g., the 6-h lead time in Fig. 7d), the skill remains less than 0.3. In summary, the uncalibrated SSEFs provide, at best, a reduction in error of less than 30% compared to purely climatological probabilistic forecasts, and no improvement at worst. Note that the reference forecast is determined at each grid point separately (section 2e), which results in lower skill of the SSEFs than if a domain-average value were used as the reference forecast (Hamill and Juras 2007). It was also found that the greater skill of NEP than TRAD resulted from an improved resolution component of the Brier score (not shown).
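
The interpretation of the skill values above follows from the standard definition of the Brier skill score, BSS = 1 − BS/BS_ref, so that a BSS of 0.3 corresponds to a 30% reduction in Brier score relative to the reference and a negative BSS means the forecast is worse than the reference. The sketch below illustrates this with hypothetical probabilities and outcomes; the reference construction used in the study (section 2e) is not reproduced.

```python
def brier_score(probabilities, outcomes):
    """Mean squared difference between forecast probabilities and binary outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probabilities, outcomes)) / len(outcomes)


def brier_skill_score(probabilities, reference_probabilities, outcomes):
    """Fractional reduction in Brier score relative to a reference forecast."""
    bs = brier_score(probabilities, outcomes)
    bs_ref = brier_score(reference_probabilities, outcomes)
    return 1.0 - bs / bs_ref


# Hypothetical sample: the event occurs once in four cases.
outcomes = [1, 0, 0, 0]
reference = [0.25, 0.25, 0.25, 0.25]  # climatological frequency as the reference

good = brier_skill_score([0.5, 0.0, 0.0, 0.0], reference, outcomes)  # positive skill
poor = brier_skill_score([0.0, 1.0, 0.0, 0.0], reference, outcomes)  # negative skill
```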

Uncalibrated forecasts show skill that generally follows a diurnal cycle (Fig. 7). There are skill maxima in the overnight and early morning hours (approximately 9–15- and 27–30-h lead times, valid at 0900–1500 and 0300–0600 UTC, respectively) and at the 1-h lead time for 1-h accumulation. There are skill minima during the afternoon for both accumulation periods (approximately 18–24-h lead times, valid 1800–0000 UTC), at the 6-h lead time for 6-h accumulation, and at the 2–4-h lead times for 1-h accumulation (Fig. 7). Both the afternoon skill minimum and the minimum at 2–4 h of forecast time correspond to the maxima in accumulated precipitation forecast bias of most members (not shown). The diurnal cycle is likely caused by both the time of day and the forecast initialization time (Yuan et al. 2007).

The uncalibrated probabilistic forecasts for 1-h accumulated precipitation have lower skill at the 2–4-h lead times than at the following afternoon skill minimum, especially using TRAD (Fig. 7). The more pronounced earlier minimum contrasts with similar biases at both lead times (not shown) and deterministic forecast skill that is similar or greater at the earlier lead time (Kong et al. 2009, their Fig. 8). The 2–4-h skill minimum is also present when only the members with radar data assimilation are used (not shown). It is hypothesized that the lower probabilistic forecast skill at the 2–4-h lead times is caused by ensemble underdispersion. Underdispersion may result from assimilation of the same radar data into all members without convective-scale IC perturbations. This hypothesis is supported by rank histograms (Hamill 2001), which are more U-shaped, suggesting likely underdispersion, at the 2–4-h lead times than at the 18–21-h lead times (not shown).
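
For readers unfamiliar with the diagnostic, the following sketch shows how a rank histogram is tallied and why a U shape indicates underdispersion. It is a simplified illustration with hypothetical data: ties between members and the observation (common for zero precipitation) are not randomized here, which a real application would need to handle.

```python
def rank_histogram(ensemble_forecasts, observations):
    """Count how often the observation falls in each rank among the sorted members."""
    n_members = len(ensemble_forecasts[0])
    counts = [0] * (n_members + 1)
    for members, obs in zip(ensemble_forecasts, observations):
        # Rank 0 means the observation is below all members; rank n_members means above all.
        rank = sum(1 for m in members if m < obs)
        counts[rank] += 1
    return counts


# Hypothetical underdispersive case: the members cluster tightly while the
# observations fall outside the ensemble range on both sides.
ensemble_forecasts = [[4.0, 5.0, 6.0]] * 4
observations = [1.0, 9.0, 2.0, 8.0]

print(rank_histogram(ensemble_forecasts, observations))  # [2, 0, 0, 2]
```

All counts land in the two outermost bins, which is the U shape associated with an ensemble whose spread is too small.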

### b. Calibration

Calibration generally improves the probabilistic forecast skill (Fig. 7). After calibration, the probabilistic forecasts for all accumulation thresholds have positive skill. Most of the skill increase occurs during the uncalibrated skill minima. As such, the diurnal cycle of the forecast skill is less pronounced after calibration. During the uncalibrated skill minima, the skill of the probabilistic forecasts calibrated by the various calibration methods is qualitatively similar in that differences in skill among calibration methods are smaller than differences in skill between calibrated and uncalibrated forecasts (Fig. 7). Interestingly, the majority of the improvement by LR, NEPrel, and SCH11 can be recovered by using the CDF-based bias adjustment (NEPba), which suggests that the skill improvement by the various calibration methods primarily results from correcting the forecast bias. Additional increases in skill are not obtained by first applying bias adjustment and then applying the ensemble probability-based calibrations (not shown). The Brier score is not further decomposed when the skill of the calibrated and uncalibrated probabilistic forecasts is compared because the resolution and the reliability components are sensitive to how the forecast probabilities are binned.

There are some deviations from the general calibration results discussed above. Probabilistic forecasts calibrated with NEPrel and LR are less skillful than uncalibrated NEP forecasts at the 12-h lead time for the 12.7 mm h^{−1} threshold (Fig. 7c). However, this result is not statistically significant (Table 3). At the 1-h lead time there is a statistically significant loss of skill compared to NEP at the 12.7 mm h^{−1} threshold for NEPrel and at all thresholds for SCH11 and NEPba. There are several instances of significantly higher skill for LR than SCH11 and/or NEPrel for 1-h accumulations (Table 3). However, there is only one instance of higher skill for SCH11 than LR and no instances of higher skill for NEPrel than LR. For 6-h accumulations, there is significantly higher skill for NEPrel than LR at the 12-h lead time for all thresholds, higher skill for NEPrel than SCH11 at the 24-h lead time for all thresholds, and higher skill for LR than SCH11 at the 24-h lead time for 6.5 and 12.7 mm (6 h)^{−1} thresholds (Table 3). In summary, LR tends to be most skillful for 1-h accumulations and NEPrel tends to be most skillful for 6-h accumulations. One hypothesis for this difference is that LR, by using a fitted function, is less susceptible to sampling noise while NEPrel can fit the underlying signal better than a prespecified functional form. It is notable that NEPba tends to be significantly less skillful than LR and NEPrel for early (∼2–4-h) lead times. This may be due to LR and NEPrel calibrating the underdispersion, in addition to bias.

Table 3. Forecast lead time at which the differences in skill in Fig. 7 are statistically significant. Bold entries indicate lead times when the skill of the forecast identified by the row is significantly higher than the skill of the forecast identified by the column. Italic entries indicate lead times when the skill of the forecast identified by the column is significantly higher than the skill of the forecast identified by the row. Empty cells have no statistical significance at any lead time.

For many thresholds and lead times NEPba, LR, and NEPrel do not show a significant increase of skill using the 25-day training period compared to using the 10-day period, indicating a general lack of sensitivity to a longer than 10-day training period (Fig. 8). SCH11 is the most sensitive calibration method to the length of the training period based on the magnitude and significance of the differences (Fig. 8). The larger sensitivity of SCH11 to the length of the training period is not surprising for two reasons. First, two parameters are used, unlike the one-parameter reliability-based method, NEPrel. Second, the discrete binning method of SCH11, like NEPrel, does not allow data from well-sampled regions of parameter space to affect the less well-sampled regions of parameter space, unlike LR, which uses a continuous equation [Eq. (1)]. Therefore the effective number of “parameters” in SCH11 is larger than LR because SCH11 fits the parameters separately for each bin.
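
The point about discrete binning can be made concrete with a sketch of a one-parameter reliability-based calibration. It assumes the common form in which each forecast probability is replaced by the observed event frequency of its probability bin during training. The exact NEPrel and SCH11 formulations are given in section 2 and in SCH11 and may differ, and the names `fit_reliability_table` and `calibrate` are hypothetical.

```python
def fit_reliability_table(train_probabilities, train_outcomes, n_bins=10):
    """Observed event frequency for each forecast-probability bin in the training sample."""
    hits = [0] * n_bins
    counts = [0] * n_bins
    for p, outcome in zip(train_probabilities, train_outcomes):
        b = min(int(p * n_bins), n_bins - 1)
        counts[b] += 1
        hits[b] += outcome
    # Each bin is fit only from its own cases; an empty bin falls back to its
    # center value (an assumption of this sketch).
    return [
        hits[b] / counts[b] if counts[b] else (b + 0.5) / n_bins
        for b in range(n_bins)
    ]


def calibrate(probability, table):
    """Replace a forecast probability with the training frequency of its bin."""
    n_bins = len(table)
    return table[min(int(probability * n_bins), n_bins - 1)]


# Hypothetical training sample with two bins: high probabilities verify 1 time in 4.
train_probabilities = [0.8, 0.8, 0.8, 0.8, 0.2, 0.2]
train_outcomes = [1, 0, 0, 0, 0, 0]
table = fit_reliability_table(train_probabilities, train_outcomes, n_bins=2)
```

Because every bin is estimated independently, a sparsely populated bin receives no information from its well-sampled neighbors. A two-parameter table multiplies the number of bins, which is consistent with the greater sensitivity of SCH11 to training length noted above.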

Fig. 8. Difference in Brier skill score of neighborhood-based probabilistic forecasts between 25 days and 10 days of training for ensemble calibration methods for (a) 2.54 mm h^{−1}, (b) 6.5 mm h^{−1}, (c) 12.7 mm h^{−1}, (d) 2.54 mm (6 h)^{−1}, (e) 6.5 mm (6 h)^{−1}, and (f) 12.7 mm (6 h)^{−1}. Markers indicate statistically significant difference between 10 and 25 days of training at the 95% confidence level.

### c. Verification and calibration of subensembles

Verification and calibration of single- and multimodel subensembles are of interest for the following reasons. First, there has been limited study of skill differences between single- and multimodel ensembles at convection-allowing resolution. Second, the impact of calibration on such differences has yet to be determined. Third, precipitation forecasts used in this study have been shown to systematically group into clusters with common model dynamics (Johnson et al. 2011b). NEP and NEPrel are used to evaluate the direct impact of calibration, although other calibration methods lead to similar conclusions (not shown).

Uncalibrated NEP forecasts from ARW and NMM, defined in section 2b, have significantly different skill, with ARW being the more skillful for most lead times and thresholds after the first two forecast hours (Fig. 9). NMM has significantly lower skill than MODEL after the first 2 h for all thresholds. ARW has significantly higher skill than MODEL for many lead times and thresholds, particularly during the first 18 h of forecast time. MODEL has no advantage compared to ARW except for the 6.5 mm h^{−1} threshold at the 30-h lead time, where MODEL is significantly more skillful. There are many lead times when NMM has negative skill and ARW has positive skill (e.g., 6–12-h lead times for 6.5 and 12.7 mm h^{−1} thresholds). All subensembles also have skill that follows a diurnal cycle similar to that of the full ensemble (Fig. 7). The fact that ARW is similarly or significantly more skillful than MODEL at most lead times and thresholds suggests that there is little advantage of the multimodel ensemble for these uncalibrated probabilistic forecasts. This is likely a result of the lower skill, but equal weight, of the NMM members.

Fig. 9. Brier skill score of uncalibrated NEP from single model (ARW and NMM) and multimodel (MODEL) subensembles for thresholds of (a) 2.54 mm h^{−1}, (b) 6.5 mm h^{−1}, (c) 12.7 mm h^{−1}, (d) 2.54 mm (6 h)^{−1}, (e) 6.5 mm (6 h)^{−1}, and (f) 12.7 mm (6 h)^{−1}. Statistically significant difference from MODEL is indicated by a square or triangle for NMM or ARW, respectively, and significant difference between ARW and NMM is indicated by an asterisk along the horizontal axis. Note the different vertical axis scale for (d)–(f) than in (a)–(c).

The greater difference between the first and second skill minima for NMM than ARW (Fig. 9) is related to differences in forecast bias. Both subensembles have similarly underdispersive rank histograms during the earlier skill minimum (not shown). ARW and NMM members have similar bias during the afternoon skill minimum while NMM members have greater bias than ARW members during the early skill minimum (not shown).

The calibration decreases the differences in the probabilistic forecast skill among ARW, NMM, and MODEL (Fig. 10). In contrast to the uncalibrated probabilistic forecast skill, the calibrated probabilistic forecast skill of ARW is generally only significantly greater than the NMM skill for the smaller threshold over longer accumulation periods (Figs. 10d,e). After calibration MODEL does not show an advantage over ARW and NMM except at later lead times. Compared to the uncalibrated probabilistic forecasts at these later lead times (>24 h), MODEL is now more frequently more skillful than ARW and NMM, which could be a result of an ability of a multimodel ensemble to sample more possible realizations than a single model ensemble when the predictability is reduced at later lead times. For example, MODEL has greater skill than ARW and NMM at the 30-h lead time for 2.54, 6.5, and 12.7 mm (6 h)^{−1}, 27–30-h lead times for 2.54 mm h^{−1}, and 30-h lead time for 6.5 mm h^{−1}. The greater skill of MODEL after calibration also suggests that the multimodel design is more effective with models of similar skill.

Fig. 10. As in Fig. 9, but for NEP forecasts calibrated using the reliability-based method.

## 4. Object-based results

### a. Verification

The uncalibrated object-based probabilistic forecast skill is first evaluated. Here, the event being forecast is an observed object matching a forecast object. The uncalibrated object-based probabilistic forecasts are less skillful than the reference forecasts except for the 1-h accumulation at 1-h lead time and the 6-h accumulation at 18- and 24-h lead times (see Fig. 11). The uncalibrated object-based probabilistic forecast skill is negative at more lead times than the uncalibrated NEP and TRAD forecasts for the 6.5-mm thresholds corresponding to the 6.5-mm threshold used to define objects (Figs. 7b,e). Such results are consistent with the expectation that forecasting the probability of an object with specific attributes is more difficult than forecasting the probability of precipitation.
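
The way a probability is attached to a forecast object can be sketched as follows. The sketch assumes that a member "matches" when at least one of its objects has a total interest with the forecast object at or above a matching threshold, and that the probability is the fraction of the other members that match. The threshold value, the choice of denominator, and the function name are assumptions for illustration; the total interest itself [Eq. (A1)] is taken as given.

```python
def object_probability(interest_by_member, match_threshold):
    """Fraction of members with at least one object matching the forecast object."""
    matches = sum(
        1
        for interests in interest_by_member
        if interests and max(interests) >= match_threshold
    )
    return matches / len(interest_by_member)


# Hypothetical total interests between one forecast object and the objects of
# four other members; an empty list means the member has no objects at all.
interest_by_member = [[0.9, 0.2], [0.4], [], [0.75]]

print(object_probability(interest_by_member, 0.7))  # 0.5
```

The event being verified is then whether an observed object matches the same forecast object.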

Fig. 11. Brier skill score of object-based probabilistic forecasts without calibration (uncal.), calibrated with logistic regression (LR), the reliability-based method (rel.), the two-parameter reliability-based method of SCH11 (SCH11), and the CDF-based bias adjustment (b.a.), for (a) 1-h and (b) 6-h accumulation periods.

Like the neighborhood-based uncalibrated probabilistic forecasts (e.g., Fig. 7b), the uncalibrated 1-h accumulation object-based probabilistic forecasts have a diurnal cycle with skill minima at 3- and 18-h lead times and skill maxima at 1- and 12-h lead times (Fig. 11a). The more pronounced skill minimum at the 3-h lead time, compared to the 18-h lead time, is associated with greater bias in forecast total interest at the 3-h lead time. Unlike the 1-h accumulation object-based forecasts (Fig. 11a), the 6-h accumulation object-based forecast skill does not show an afternoon skill minimum (Fig. 11b). Further diagnostics (not shown) suggest that the afternoon skill minimum for 1-h accumulation forecasts could be a reflection of errors in timing at the beginning of the afternoon convection maximum. Such timing errors may not be reflected over a longer 6-h accumulation period. The afternoon skill minimum is still seen in the corresponding 6-h accumulation neighborhood-based results (Fig. 7e). Further diagnostics suggested that there is a precipitation forecast bias maximum during the convection maximum and the neighborhood-based metric is more strongly sensitive to such amplitude bias than the object-based metric.

### b. Calibration

All calibration methods improve on the skill of the uncalibrated object-based probabilistic forecasts (Fig. 11). Unlike the neighborhood-based forecasts, the skill of object-based probabilistic forecasts calibrated using the CDF-based bias adjustment (b.a.) is significantly less than the skill of object-based forecasts calibrated with LR, SCH11, and NEPrel at most lead times (Table 4). Only LR, SCH11, and NEPrel result in skillful object-based probabilistic forecasts compared to the reference forecasts at all lead times for both accumulation periods (Fig. 11). At most lead times LR is also significantly more skillful than SCH11 and NEPrel (Table 4). Like the neighborhood-based forecasts, the object-based calibration results in the greatest skill increases during the periods of uncalibrated skill minima. Like the neighborhood-based forecasts, the SCH11 calibration is the most sensitive calibration to the length of the training period with significantly higher skill for the longer training period at more lead times (Fig. 12). The LR and b.a. calibration methods are the least sensitive to the training period length (Fig. 12). After calibration, the object-based forecasts now have skill similar to that of the calibrated neighborhood-based forecasts (Figs. 7b,e).

Fig. 12. Difference in Brier skill score of object-based probabilistic forecasts between 25 days and 10 days of training for ensemble calibration methods for (a) 1-h accumulations and (b) 6-h accumulations. Markers indicate statistically significant difference between 10 and 25 days of training at the 95% confidence level.

### c. Verification and calibration of subensembles

As discussed in section 1, Johnson et al. (2011b) showed systematic clustering into groups with common model dynamics using an object-based cluster analysis. This result motivates an evaluation of the skill of the single-model subensembles, ARW and NMM, and the multimodel subensemble, MODEL, in the context of object-based probabilistic forecasting. Since LR is the most skillful calibration (Fig. 11), it is used for the comparison of calibrated subensembles.

The skill of uncalibrated object-based probabilistic forecasts from the eight-member subensembles is negative except for 1-h accumulation at the 1-h lead time (Figs. 13a,b). Like the neighborhood-based forecasts (Fig. 9), the uncalibrated object-based probabilistic forecasts from ARW are significantly more skillful than those from NMM (Figs. 13a,b). There is no advantage of MODEL over ARW at any lead time for the uncalibrated object-based forecasts (Figs. 13a,b). The decrease in object-based probabilistic forecast skill resulting from decreasing the ensemble size from 20 to 8 members is larger for NMM than for ARW and MODEL. The diurnal cycles of skill are similar for the uncalibrated subensembles (Figs. 13a,b) and the uncalibrated full ensemble (Fig. 11). Like the neighborhood-based probabilistic forecasts, the skill minimum at the 3–6-h lead times is more pronounced for NMM than for ARW and MODEL. ARW and NMM have larger differences in object-based than neighborhood-based forecast skill, which could be due to exclusion of correct-null forecasts by design in the object-based probabilistic forecast method.

Fig. 13. Brier skill score of (a),(b) uncalibrated and (c),(d) calibrated object-based probabilistic forecasts from single model (ARW and NMM) and multimodel (MODEL) subensembles for (left) 1-h and (right) 6-h accumulations. Statistical significance is indicated as in Fig. 9.

The calibrated object-based probabilistic forecasts from the eight-member subensembles have positive skill at all lead times (Figs. 13c,d). Like the neighborhood-based probabilistic forecasts, the calibration reduces the differences in skill among the subensembles (Figs. 13c,d). Unlike the neighborhood-based probabilistic forecasts, there is no advantage of MODEL over ARW, before or after calibration, at any lead time.

## 5. Conclusions and discussion

In this study, neighborhood- and object-based probabilistic forecasts from a convection-allowing ensemble, produced by CAPS during the 2009 NOAA HWT Spring Experiment, are verified and calibrated. Logistic regression, one- and two-parameter reliability-based methods, and CDF-based bias adjustment are used for calibration. Newly proposed object-based probabilistic forecasts are derived from the percentage of ensemble members that match the forecast objects. Single- and multimodel subensembles are verified and calibrated to explore the effect of using multiple models (ARW-WRF and WRF-NMM).

Neighborhood-based probabilistic forecasts are considered first. The skill of the uncalibrated neighborhood-based probabilistic forecasts shows a diurnal cycle, with no skill compared to the reference forecast at the skill minima. Calibration primarily improves the neighborhood-based probabilistic forecast skill during the uncalibrated skill minima, resulting in skillful forecasts. During the uncalibrated skill minima the differences among calibration methods are smaller than the differences between calibrated and uncalibrated forecasts, suggesting little practical impact on skill of the choice of calibration method. Further significance tests show that logistic regression tends to be best for 1-h accumulations while one-parameter reliability-based calibration tends to be best for 6-h accumulations. The differences in skill between 10-day and 25-day training periods are generally small. The two-parameter reliability-based method of SCH11 is the most sensitive to training period length. An uncalibrated single-model ARW subensemble is significantly more skillful than an uncalibrated single-model NMM subensemble. An uncalibrated multimodel subensemble has skill between the single-model subensembles. Calibration reduces the differences in skill of the single-model subensembles. Except at lead times beyond 24 h, calibrated and uncalibrated multimodel subensembles are not more skillful than the better single-model subensemble. After calibration, there are more instances of the multimodel subensemble having greater skill than both single-model subensembles.

A method to generate object-based probabilistic forecasts is proposed. The uncalibrated object-based probabilistic forecasts have less skill than the uncalibrated neighborhood-based forecasts. Calibration increases the skill of the object-based forecasts at all lead times, especially, but not only, during the uncalibrated skill minima. Skillful forecasts are obtained at all lead times after calibration with logistic regression and the one- and two-parameter reliability-based calibrations, but not after CDF-based bias adjustment of total interest. Logistic regression (bias adjustment) is the most (least) skillful calibration for the object-based probabilistic forecasts. The effect of training period length on the object-based probabilistic forecasts is small, especially for the one-parameter reliability-based, LR, and CDF-based bias adjustment calibration methods. The uncalibrated object-based probabilistic forecasts from the single-model ARW subensemble are significantly more skillful than those from the single-model NMM subensemble. The uncalibrated multimodel subensemble has skill more similar to ARW than NMM. Object-based calibration reduces the differences in skill among the subensembles. Using a multimodel ensemble shows no advantage compared to the single-model subensembles both before and after calibration for the object-based probabilistic forecasts.

Both neighborhood- and object-based forecasts have the most pronounced skill minima at 2–4-h lead times. Our diagnostics suggest that such a skill minimum could be related to enhanced ensemble underdispersion. Another hypothesis is that suboptimal radar data assimilation requires the model to spin up during the first few hours. Future work is suggested to improve the initial ensembles, including the use of an advanced data assimilation method such as ensemble-based data assimilation (e.g., Wang et al. 2008a,b) to sample possible initial conditions at multiple scales.

Future study of methods to generate convective-scale probabilistic forecasts is still needed. Neighborhood-based forecasts for high precipitation thresholds over short accumulation periods may be better defined as the probability of occurrence within an area rather than at a specific grid point (e.g., SPC 2011). A limitation of object-based methods is the difficulty of quantifying null events (i.e., no object forecast or observed; e.g., Davis et al. 2009). An alternative method of generating the object-based probabilistic forecasts to include the null events was also considered in this study (not shown). For the alternative method a probability was generated for objects defined using all objects observed at a given time of day during the 2007–10 NOAA HWT Spring Experiments. This allows 0% probability forecasts and obviates the need to choose an ensemble member to define the forecast objects of interest. Skill was uniformly greater with the alternative method because it included null events that were skillfully predicted. The results for all the comparisons, however, remain the same as those presented.

This study is not a comprehensive comparison of all possible calibration methods. Future work on determining optimal calibration methods for convection-allowing ensembles is still needed. For example, Bayesian model averaging (e.g., Sloughter et al. 2007) accounts for differences in skill and nonindependence of members comprising an ensemble through unequal weighting of members. Preliminary experiments with unequal weighting did not show improvement for the application considered in this study. Future research should seek appropriate methods of generating calibrated forecast probabilities from unequally likely members for convective-scale forecasts.

Our results show that the forecasts can be effectively improved after calibration with 10 days of training. Only marginal improvements are found when increasing the training period to 25 days. This is likely because of the large training sample sizes resulting from including data from all grid points in the forecast. However, the length of training period may be a more important consideration when the weather regime is rapidly changing, such as during transitions from one season to another. Future work is needed to further study how to collect training data in subregions based on geographical location or climatological observed frequency and how to conduct calibration for unusual and rare events.

## Acknowledgments

The authors are grateful to NSSL for the QPE verification data and NCAR for making MODE source code available. This research was supported by NSF Grant AGS-1046081, and University of Oklahoma faculty start-up award 122-792100. The CAPS real-time forecasts were produced at the Pittsburgh Supercomputing Center (PSC) and the National Institute of Computational Science (NICS) at the University of Tennessee, and were mainly supported by the NOAA CSTAR program (NA17RJ1227). Some computing was also performed at the OU Supercomputing Center for Education & Research (OSCER) at the University of Oklahoma (OU). Fanyou Kong, Ming Xue, Kevin Thomas, Yunheng Wang, Keith Brewster, and Jidong Gao of CAPS are thanked for the production of the ensemble forecasts. We also thank three anonymous reviewers for providing comments and suggestions that greatly improved the manuscript.

## APPENDIX

### MODE Parameters for Object Matching

In the context of the HWT Spring Experiment we focus on attributes relevant for the specific application of severe weather forecasting, such as shape, size, and area, which can indicate storm morphology or mode (Johnson et al. 2011a). For 1-h accumulations, the same matching parameters are used as those described in Johnson et al. (2011a; see their appendix A). The attributes calculated for each object are the centroid location, area, aspect ratio, and orientation angle. For the 6-h accumulations, the authors’ subjective evaluations found that a different set of attributes gave better matches. Unlike for the 1-h accumulations, the attributes for 6-h accumulation objects are area, aspect ratio, and boundary location. Boundary location, defined at the edges of an object, rather than centroid location, defined at the center of mass of an object, is used for the longer accumulation period because there are more large objects in addition to many small objects. Two large objects can have a large distance between centroids and still partially overlap, while two small objects can subjectively appear far apart but have a smaller centroid distance.

The similarity of attributes from different objects is quantified with an interest value. The interest values for 1- and 6-h accumulations are shown in Figs. A1 and A2, respectively. As a given attribute becomes less similar between two objects, the interest value decreases (Figs. A1 and A2). For the 6-h accumulation aspect ratio interest value, the value in Fig. A2b is further multiplied by the area ratio. Thus, a pair of objects only gets “credit” for having a similar aspect ratio if the area is also similar. In Fig. A2c, an equivalent radius of each object is calculated as the radius of a circle with the same area as the object. The closest distance between the boundaries of two objects is then quantified in multiples of their average equivalent radius *R*_{eq}, which accounts for different subjective interpretations of a given displacement for large and small objects. Because the focus is on convective-scale storm modes, no attempt is made to match a group of small objects with a single large object.
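The equivalent-radius normalization described above can be illustrated with a short Python sketch. The function names and the example areas and distances are hypothetical and chosen only for illustration; units are arbitrary as long as distance and the square root of area share the same length unit.

```python
import math


def equivalent_radius(area):
    """Radius of a circle whose area equals the object's area."""
    return math.sqrt(area / math.pi)


def normalized_boundary_distance(boundary_distance, area_a, area_b):
    """Closest boundary distance in multiples of the average equivalent radius."""
    r_eq = 0.5 * (equivalent_radius(area_a) + equivalent_radius(area_b))
    return boundary_distance / r_eq


# Hypothetical example: the same 20-unit gap between two large objects
# (equivalent radius 100) and between two small objects (equivalent radius 10).
large_pair = normalized_boundary_distance(20.0, math.pi * 100.0**2, math.pi * 100.0**2)
small_pair = normalized_boundary_distance(20.0, math.pi * 10.0**2, math.pi * 10.0**2)

print(f"large objects: {large_pair:.2f} R_eq")
print(f"small objects: {small_pair:.2f} R_eq")
```

In this sketch the same physical gap is a small fraction of *R*_{eq} for the large pair and a multiple of *R*_{eq} for the small pair, which is the scale dependence that the normalization is meant to capture.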

Fig. A1. Functions mapping paired attribute values to interest values [*f* in Eq. (A1)] for 1-h accumulation objects for (a) area ratio, (b) centroid distance, (c) aspect ratio difference, and (d) angle difference.

Citation: Monthly Weather Review 140, 9; 10.1175/MWR-D-11-00356.1

Fig. A2. Functions mapping paired attribute values to interest values [*f* in Eq. (A1)] for 6-h accumulation objects for (a) area ratio, (b) aspect ratio difference, and (c) boundary distance.

The total interest *I* is calculated as a weighted average of the interest values *f* for each of the attributes (Davis et al. 2009):

$$I = \frac{\sum_{s=1}^{S} c_s w_s f_s}{\sum_{s=1}^{S} c_s w_s}, \tag{A1}$$

where *S* is the number of object attributes and *c*_{s} and *w*_{s} are the confidence and weight (defined below) assigned to the interest value of the *s*th attribute. Objects are considered to match if the total interest [Eq. (A1)] exceeds a threshold. Thresholds of 0.6 and 0.66 for 1-h and 6-h accumulations, respectively, are selected based on a subjective evaluation of the quality of matches (e.g., Fig. A3). Note that all parameters chosen are based on a variety of forecasts rather than a single case. The same thresholds are used to match forecast objects with observed objects and match forecast objects among ensemble members.
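As a minimal sketch of how the total interest and the matching thresholds quoted above could be applied, the Python snippet below computes a confidence- and weight-weighted average of attribute interest values and compares it with the 1-h and 6-h thresholds. The function names, the example interest, confidence, and weight values, and the choice to return zero when every confidence vanishes are illustrative assumptions; they are not taken from MODE or from the configuration used in this study.

```python
def total_interest(interest, confidence, weight):
    """Weighted average of attribute interests, with weights c_s * w_s."""
    if not (len(interest) == len(confidence) == len(weight)):
        raise ValueError("interest, confidence, and weight must have equal length")
    denominator = sum(c * w for c, w in zip(confidence, weight))
    if denominator == 0.0:
        # Assumption: with no usable attribute, treat the pair as unmatched.
        return 0.0
    numerator = sum(c * w * f for f, c, w in zip(interest, confidence, weight))
    return numerator / denominator


# Thresholds stated in the text for 1-h and 6-h accumulations.
MATCH_THRESHOLD = {1: 0.6, 6: 0.66}


def objects_match(interest_value, accumulation_hours):
    """Objects match when the total interest exceeds the threshold."""
    return interest_value > MATCH_THRESHOLD[accumulation_hours]


# Hypothetical attribute values for one pair of 1-h objects.
f = [0.9, 0.8, 0.5, 0.4]  # interest values
c = [1.0, 1.0, 0.5, 0.5]  # confidences
w = [2.0, 2.0, 1.0, 1.0]  # weights

pair_interest = total_interest(f, c, w)
print(f"total interest = {pair_interest:.2f}, match = {objects_match(pair_interest, 1)}")
```

Because the confidence multiplies the weight in both the numerator and the denominator, an attribute with low confidence has little influence on the total interest rather than pulling it toward zero.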

Fig. A3. Object-based 24-h forecasts valid at 0000 UTC 14 May 2009 from (a) 1-h accumulation forecast from ARW P4 member and (c) 6-h accumulation forecast from ARW P4 member. (b),(d) Corresponding [to (a) and (c), respectively] observed objects. Matched objects are shaded gray and unmatched objects are shaded black.

Each interest value is assigned a constant weight *w* and a variable confidence value *c* (Tables A1 and A2) in Eq. (A1), following Davis et al. (2006, 2009). For 1-h accumulations (Table A1), an equal weight of 2 is assigned to each of size (area ratio), location (centroid distance), and shape. The weight for shape is further divided between aspect ratio and orientation angle. Confidence for shape attributes is proportional to centroid distance interest (CDI) and area ratio (AR) interest. In other words, there is low confidence that objects with very different location and/or size represent the same precipitation system, so their shape is unimportant. Confidence for orientation angle is reduced for nearly circular objects. Confidence for centroid distance and area ratio is also reduced for objects that are different in size or far away, respectively. For 6-h accumulations (Table A2), the weights are equally assigned to location (boundary distance) and shape (divided into aspect ratio and area).
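The effect of a variable confidence on shape attributes can be sketched as follows for the 1-h case. The weights of 2 for area ratio and centroid distance come from the text; the even split of the shape weight (1 each for aspect ratio and orientation angle) and the product form of the shape confidence (CDI multiplied by AR interest) are assumptions made for illustration, and the additional confidence reductions mentioned above are omitted. All interest values are hypothetical.

```python
def shape_confidence(cd_interest, ar_interest):
    """Assumed form: confidence proportional to both CDI and AR interest."""
    return cd_interest * ar_interest


def weighted_interest(terms):
    """terms: list of (interest, confidence, weight) tuples."""
    denominator = sum(c * w for _, c, w in terms)
    if denominator == 0.0:
        return 0.0
    return sum(f * c * w for f, c, w in terms) / denominator


def one_hour_interest(ar_interest, cd_interest, aspect_interest, angle_interest):
    """Total interest for a pair of 1-h objects under the assumptions above."""
    c_shape = shape_confidence(cd_interest, ar_interest)
    terms = [
        (ar_interest, 1.0, 2.0),          # size, weight 2 (from the text)
        (cd_interest, 1.0, 2.0),          # location, weight 2 (from the text)
        (aspect_interest, c_shape, 1.0),  # shape weight split evenly (assumed)
        (angle_interest, c_shape, 1.0),
    ]
    return weighted_interest(terms)


# Nearby, similar-sized objects with poorly matching shapes.
near_pair = one_hour_interest(0.9, 0.9, 0.2, 0.2)
# Distant, differently sized objects with identical shapes.
far_pair = one_hour_interest(0.2, 0.2, 1.0, 1.0)

print(f"near pair: {near_pair:.3f}")
print(f"far pair:  {far_pair:.3f}")
```

Under these assumptions, identical shape does little to raise the total interest of the distant pair because the shape terms carry almost no confidence, which reflects the reasoning that shape is unimportant when location and size already differ greatly.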

Table A1. Attributes and parameter values used for fuzzy matching algorithm for 1-h accumulation objects (CD denotes centroid distance, CDI denotes centroid distance interest, AR denotes area ratio, and *T* denotes aspect ratio).

Table A2. Attributes and parameter values used for fuzzy matching algorithm for 6-h accumulation objects.

## REFERENCES

Alhamed, A., S. Lakshmivarahan, and D. J. Stensrud, 2002: Cluster analysis of multimodel ensemble data from SAMEX. *Mon. Wea. Rev.*, **130**, 226–256.

Aligo, E. A., W. A. Gallus, and M. Segal, 2007: Summer rainfall forecast spread in an ensemble initialized with different soil moisture analyses. *Wea. Forecasting*, **22**, 299–314.

Atger, F., 1999: The skill of ensemble prediction systems. *Mon. Wea. Rev.*, **127**, 1941–1953.

Atger, F., 2003: Spatial and interannual variability of the reliability of ensemble-based probabilistic forecasts: Consequences for calibration. *Mon. Wea. Rev.*, **131**, 1509–1523.

Benjamin, S. G., G. A. Grell, J. M. Brown, T. G. Smirnova, and R. Bleck, 2004: Mesoscale weather prediction with the RUC hybrid isentropic-terrain-following coordinate model. *Mon. Wea. Rev.*, **132**, 473–494.

Berner, J., S.-Y. Ha, J. P. Hacker, A. Fournier, and C. Snyder, 2011: Model uncertainty in a mesoscale ensemble prediction system: Stochastic versus multiphysics representations. *Mon. Wea. Rev.*, **139**, 1972–1995.

Bouallegue, Z. B., S. Theis, and C. Gebhardt, 2011: From verification results to probabilistic products: Spatial techniques applied to ensemble forecasting. Abstracts of the Fifth International Verification Methods Workshop, 1–7 December 2011, Melbourne, Australia, CAWCR Tech. Rep. 046, 4–5. [Available online at http://www.cawcr.gov.au/publications/technicalreports/CTR_046.pdf.]

Brier, G. W., 1950: Verification of forecasts expressed in terms of probability. *Mon. Wea. Rev.*, **78**, 1–3.

Candille, G., S. Beauregard, and N. Gagnon, 2010: Bias correction and multiensemble in the NAEFS context or how to get a “free calibration” through a multiensemble approach. *Mon. Wea. Rev.*, **138**, 4268–4281.

Carley, J. R., B. R. J. Schwedler, M. E. Baldwin, R. J. Trapp, J. Kwiatkowski, J. Logsdon, and S. J. Weiss, 2011: A proposed model-based methodology for feature-specific prediction for high-impact weather. *Wea. Forecasting*, **26**, 243–249.

Clark, A. J., and Coauthors, 2011: Probabilistic precipitation forecast skill as a function of ensemble size and spatial scale in a convection-allowing ensemble. *Mon. Wea. Rev.*, **139**, 1410–1418.

Davis, C., B. Brown, and R. Bullock, 2006: Object-based verification of precipitation forecasts. Part I: Methodology and application to mesoscale rain areas. *Mon. Wea. Rev.*, **134**, 1772–1784.

Davis, C., B. Brown, R. Bullock, and J. Halley-Gotway, 2009: The method for object-based diagnostic evaluation (MODE) applied to numerical forecasts from the 2005 NSSL/SPC Spring Program. *Wea. Forecasting*, **24**, 1252–1267.

Done, J., C. Davis, and M. Weisman, 2004: The next generation of NWP: Explicit forecasts of convection using the Weather Research and Forecast (WRF) model. *Atmos. Sci. Lett.*, **5**, 110–117.

Du, J., J. McQueen, G. DiMego, Z. Toth, D. Jovic, B. Zhou, and H. Chuang, 2006: New dimension of NCEP Short-Range Ensemble Forecasting (SREF) system: Inclusion of WRF members. *Proc. WMO Expert Team Meeting on Ensemble Prediction System*, Exeter, United Kingdom, WMO. [Available online at http://www.emc.ncep.noaa.gov/mmb/SREF/WMO06_full.pdf.]

Dudhia, J., 1989: Numerical study of convection observed during the winter monsoon experiment using a mesoscale two-dimensional model. *J. Atmos. Sci.*, **46**, 3077–3107.

Ebert, E., 2009: Neighborhood verification: A strategy for rewarding close forecasts. *Wea. Forecasting*, **24**, 1498–1510.

Ebert, E., and J. L. McBride, 2000: Verification of precipitation in weather systems: Determination of systematic errors. *J. Hydrol.*, **239**, 179–202.

Eckel, F. A., and M. K. Walters, 1998: Calibrated probabilistic quantitative precipitation forecasts based on the MRF ensemble. *Wea. Forecasting*, **13**, 1132–1147.

Eckel, F. A., and C. F. Mass, 2005: Aspects of effective mesoscale, short-range ensemble forecasting. *Wea. Forecasting*, **20**, 328–350.

Ek, M. B., K. E. Mitchell, Y. Lin, P. Grunmann, E. Rogers, G. Gayno, and V. Koren, 2003: Implementation of upgraded Noah land-surface model advances in the National Centers for Environmental Prediction operational mesoscale Eta model. *J. Geophys. Res.*, **108**, 8851, doi:10.1029/2002JD003296.

Ferrier, B. S., 1994: A double-moment multiple-phase four-class bulk ice scheme. Part I: Description. *J. Atmos. Sci.*, **51**, 249–280.

Fortin, V., A.-C. Favre, and M. Saïd, 2006: Probabilistic forecasting from ensemble prediction systems: Improving upon the best-member method by using a different weight and dressing kernel for each member. *Quart. J. Roy. Meteor. Soc.*, **132**, 1349–1369.

Gallus, W. A., Jr., 2010: Application of object-based verification techniques to ensemble precipitation forecasts. *Wea. Forecasting*, **25**, 144–158.

Gallus, W. A., Jr., and J. F. Bresch, 2006: Comparison of impacts of WRF dynamic core, physics package, and initial conditions on warm season rainfall forecasts. *Mon. Wea. Rev.*, **134**, 2632–2641.

Gao, J.-D., M. Xue, K. Brewster, and K. K. Droegemeier, 2004: A three-dimensional variational data analysis method with recursive filter for Doppler radars. *J. Atmos. Oceanic Technol.*, **21**, 457–469.

Gilleland, E., D. Ahijevych, B. G. Brown, B. Casati, and E. E. Ebert, 2009: Intercomparison of spatial forecast verification methods. *Wea. Forecasting*, **24**, 1416–1430.

Gneiting, T., A. E. Raftery, A. H. Westveld III, and T. Goldman, 2005: Calibrated probabilistic forecasting using ensemble model output statistics and minimum CRPS estimation. *Mon. Wea. Rev.*, **133**, 1098–1118.

Hacker, J. P., and Coauthors, 2011: The U.S. Air Force Weather Agency’s mesoscale ensemble: Scientific description and performance results. *Tellus*, **63A**, 1–17.

Hagedorn, R., T. M. Hamill, and J. S. Whitaker, 2008: Probabilistic forecast calibration using ECMWF and GFS ensemble reforecasts. Part I: Two-meter temperatures. *Mon. Wea. Rev.*, **136**, 2608–2619.

Hamill, T. M., 1999: Hypothesis tests for evaluating numerical precipitation forecasts. *Wea. Forecasting*, **14**, 155–167.

Hamill, T. M., 2001: Interpretation of rank histograms for verifying ensemble forecasts. *Mon. Wea. Rev.*, **129**, 550–560.

Hamill, T. M., and S. J. Colucci, 1997: Verification of Eta–RSM short-range ensemble forecasts. *Mon. Wea. Rev.*, **125**, 1312–1327.

Hamill, T. M., and S. J. Colucci, 1998: Evaluation of Eta–RSM ensemble probabilistic precipitation forecasts. *Mon. Wea. Rev.*, **126**, 711–724.

Hamill, T. M., and J. S. Whitaker, 2006: Probabilistic quantitative precipitation forecasts based on reforecast analogs: Theory and application. *Mon. Wea. Rev.*, **134**, 3209–3229.

Hamill, T. M., and J. Juras, 2007: Measuring forecast skill: Is it real skill or is it the varying climatology? *Quart. J. Roy. Meteor. Soc.*, **132**, 2905–2923.

Hamill, T. M., J. S. Whitaker, and X. Wei, 2004: Ensemble reforecasting: Improving medium-range forecast skill using retrospective forecasts. *Mon. Wea. Rev.*, **132**, 1434–1447.

Hamill, T. M., R. Hagedorn, and J. S. Whitaker, 2008: Probabilistic forecast calibration using ECMWF and GFS ensemble reforecasts. Part II: Precipitation. *Mon. Wea. Rev.*, **136**, 2620–2632.

Hong, S.-Y., J. Dudhia, and S.-H. Chen, 2004: A revised approach to ice microphysical processes for the bulk parameterization of clouds and precipitation. *Mon. Wea. Rev.*, **132**, 103–120.

Hou, D., E. Kalnay, and K. K. Droegemeier, 2001: Objective verification of the SAMEX ’98 ensemble forecasts. *Mon. Wea. Rev.*, **129**, 73–91.

Hu, M., M. Xue, and K. Brewster, 2006: 3DVAR and cloud analysis with WSR-88D level-II data for the prediction of Fort Worth tornadic thunderstorms. Part I: Cloud analysis and its impact. *Mon. Wea. Rev.*, **134**, 675–698.

Janjic, Z. I., 1994: The step-mountain eta coordinate model: Further developments of the convection, viscous sublayer, and turbulence closure schemes. *Mon. Wea. Rev.*, **122**, 927–945.

Janjic, Z. I., 2003: A nonhydrostatic model based on a new approach. *Meteor. Atmos. Phys.*, **82**, 271–285.

Johnson, A., X. Wang, F. Kong, and M. Xue, 2011a: Hierarchical cluster analysis of a convection-allowing ensemble during the Hazardous Weather Testbed 2009 Spring Experiment. Part I: Development of object-oriented cluster analysis method for precipitation fields. *Mon. Wea. Rev.*, **139**, 3673–3693.

Johnson, A., X. Wang, M. Xue, and F. Kong, 2011b: Hierarchical cluster analysis of a convection-allowing ensemble during the Hazardous Weather Testbed 2009 Spring Experiment. Part II: Season-long ensemble clustering and implication for optimal ensemble design. *Mon. Wea. Rev.*, **139**, 3694–3710.

Johnson, C., and R. Swinbank, 2009: Medium-range multi-model ensemble combination and calibration. *Quart. J. Roy. Meteor. Soc.*, **135**, 777–794.

Kass, R. E., and A. E. Raftery, 1995: Bayes factors. *J. Amer. Stat. Assoc.*, **90**, 773–795.

Kong, F., and Coauthors, 2007: Preliminary analysis on the real-time storm-scale ensemble forecasts produced as a part of the NOAA Hazardous Weather Testbed 2007 Spring Experiment. Preprints, *22nd Conf. on Weather Analysis and Forecasting/18th Conf. on Numerical Weather Prediction*, Park City, UT, Amer. Meteor. Soc., 3B.2. [Available online at https://ams.confex.com/ams/22WAF18NWP/techprogram/paper_124667.htm.]

Kong, F., and Coauthors, 2008: Real-time storm-scale ensemble forecasting during the 2008 Spring Experiment. Preprints, *24th Conf. on Severe Local Storms*, Savannah, GA, Amer. Meteor. Soc., 12.3. [Available online at https://ams.confex.com/ams/24SLS/techprogram/paper_141827.htm.]

Kong, F., and Coauthors, 2009: A real-time storm-scale ensemble forecast system: 2009 Spring Experiment. *Proc. 10th WRF Users’ Workshop*, Boulder, CO, NCAR, 3B.7.

Kong, F., and Coauthors, 2010: Evaluation of CAPS multi-model storm-scale ensemble forecast for the NOAA HWT 2010 Spring Experiment. Preprints, *25th Conf. on Severe Local Storms*, Denver, CO, Amer. Meteor. Soc., P4.18. [Available online at https://ams.confex.com/ams/25SLS/techprogram/paper_175822.htm.]

Krzysztofowicz, R., and A. A. Sigrest, 1999: Calibration of probabilistic quantitative precipitation forecasts. *Wea. Forecasting*, **14**, 427–442.

Lacis, A. A., and J. E. Hansen, 1974: A parameterization for the absorption of solar radiation in the earth’s atmosphere. *J. Atmos. Sci.*, **31**, 118–133.

Lin, Y., R. D. Farley, and H. D. Orville, 1983: Bulk parameterization of the snow field in a cloud model. *J. Climate Appl. Meteor.*, **22**, 1065–1092.

Mittermaier, M., 2007: Improving short-range high-resolution model precipitation forecast skill using time-lagged ensembles. *Quart. J. Roy. Meteor. Soc.*, **133**, 1487–1500.

Mittermaier, M., and N. Roberts, 2010: Intercomparison of spatial forecast verification methods: Identifying skillful spatial scales using the fractions skill score. *Wea. Forecasting*, **25**, 343–354.

Mittermaier, M., N. Roberts, and S. A. Thompson, 2012: A long-term assessment of precipitation forecast skill using the fractions skill score. *Meteor. Appl.*, doi:10.1002/met.296, in press.

Murphy, A. H., 1973: A new vector partition of the probability score. *J. Appl. Meteor.*, **12**, 595–600.

Noh, Y., W. G. Cheon, S. Y. Hong, and S. Raasch, 2003: Improvement of the K-profile model for the planetary boundary layer based on large eddy simulation data. *Bound.-Layer Meteor.*, **107**, 421–427.

Nutter, P., D. Stensrud, and M. Xue, 2004: Effects of coarsely resolved and temporally interpolated lateral boundary conditions on the dispersion of limited-area ensemble forecasts. *Mon. Wea. Rev.*, **132**, 2358–2377.

Palmer, T. N., R. Buizza, F. Doblas-Reyes, T. Jung, M. Leutbecher, G. J. Shutts, M. Steinheimer, and A. Weisheimer, 2009: Stochastic parametrization and model uncertainty. ECMWF Tech. Memo. 598, 42 pp. [Available online at http://www.ecmwf.int/publications/library/ecpublications/_pdf/tm/501-600/tm598.pdf.]

Raftery, A. E., T. Gneiting, F. Balabdaoui, and M. Polakowski, 2005: Using Bayesian model averaging to calibrate forecast ensembles. *Mon. Wea. Rev.*, **133**, 1155–1174.

Richardson, D. S., 2001: Ensembles using multiple models and analyses. *Quart. J. Roy. Meteor. Soc.*, **127**, 1847–1864.

Roulston, M. S., and L. A. Smith, 2003: Combining dynamical and statistical ensembles. *Tellus*, **55A**, 16–30.

Schaffer, C. J., W. A. Gallus, and M. Segal, 2011: Improving probabilistic ensemble forecasts of convection through the application of QPF–POP relationships. *Wea. Forecasting*, **26**, 319–336.

Schwartz, C. S., and Coauthors, 2010: Toward improved convection-allowing ensembles: Model physics sensitivities and optimizing probabilistic guidance with small ensemble membership. *Wea. Forecasting*, **25**, 263–280.

Skamarock, W. C., J. B. Klemp, J. Dudhia, D. O. Gill, D. M. Barker, W. Wang, and J. G. Powers, 2005: A description of the Advanced Research WRF version 2. NCAR Tech. Note NCAR/TN-468+STR, 88 pp. [Available from UCAR Communications, P.O. Box 3000, Boulder, CO 80307.]

Sloughter, J. M., A. E. Raftery, T. Gneiting, and C. Fraley, 2007: Probabilistic quantitative precipitation forecasting using Bayesian model averaging. *Mon. Wea. Rev.*, **135**, 3209–3220.

SPC, cited 2011: Probabilistic outlook information. NWS Storm Prediction Center. [Available online at http://www.spc.noaa.gov/products/outlook/probinfo.html.]

Stensrud, D. J., and N. Yussouf, 2003: Short-range ensemble predictions of 2-m temperature and dewpoint temperature over New England. *Mon. Wea. Rev.*, **131**, 2510–2524.

Stensrud, D. J., J. Bao, and T. T. Warner, 2000: Using initial conditions and model physics perturbations in short-range ensemble simulations of mesoscale convective systems. *Mon. Wea. Rev.*, **128**, 2077–2107.

Tao, W.-K., and Coauthors, 2003: Microphysics, radiation, and surface processes in the Goddard Cumulus Ensemble (GCE) model. *Meteor. Atmos. Phys.*, **82**, 97–137.

Theis, S. E., A. Hense, and U. Damrath, 2005: Probabilistic precipitation forecasts from a deterministic model: A pragmatic approach. *Meteor. Appl.*, **12**, 257–268.

Thompson, G., P. R. Field, R. M. Rasmussen, and W. D. Hall, 2008: Explicit forecasts of winter precipitation using an improved bulk microphysics scheme. Part II: Implementation of a new snow parameterization. *Mon. Wea. Rev.*, **136**, 5095–5115.

Toth, Z., Y. Zhu, and T. Marchok, 2001: The use of ensembles to identify forecasts with small and large uncertainty. *Wea. Forecasting*, **16**, 463–477.

Voisin, N., J. C. Schaake, and D. P. Lettenmaier, 2010: Calibration and downscaling methods for quantitative ensemble precipitation forecasts. *Wea. Forecasting*, **25**, 1603–1627.

Wandishin, M. S., S. L. Mullen, D. J. Stensrud, and H. E. Brooks, 2001: Evaluation of a short-range multimodel ensemble system. *Mon. Wea. Rev.*, **129**, 729–747.

Wang, X., and C. H. Bishop, 2003: A comparison of breeding and ensemble transform Kalman filter ensemble forecast schemes. *J. Atmos. Sci.*, **60**, 1140–1158.

Wang, X., and C. H. Bishop, 2005: Improvement of ensemble reliability with a new dressing kernel. *Quart. J. Roy. Meteor. Soc.*, **131**, 965–986.

Wang, X., C. H. Bishop, and S. J. Julier, 2004: Which is better, an ensemble of positive–negative pairs or a centered spherical simplex ensemble? *Mon. Wea. Rev.*, **132**, 1590–1605.

Wang, X., D. Barker, C. Snyder, and T. M. Hamill, 2008a: A hybrid ETKF-3DVAR data assimilation scheme for the WRF model. Part I: Observing system simulation experiment. *Mon. Wea. Rev.*, **136**, 5116–5131.

Wang, X., D. Barker, C. Snyder, and T. M. Hamill, 2008b: A hybrid ETKF-3DVAR data assimilation scheme for the WRF model. Part II: Real observation experiments. *Mon. Wea. Rev.*, **136**, 5132–5147.

Wilks, D. S., 2002: Smoothing forecast ensembles with fitted probability distributions. *Quart. J. Roy. Meteor. Soc.*, **128**, 2821–2836.

Wilks, D. S., 2006: *Statistical Methods in the Atmospheric Sciences: An Introduction*. 2nd ed. Academic Press, 467 pp.

Wilks, D. S., and T. M. Hamill, 2007: Comparison of ensemble-MOS methods using GFS reforecasts. *Mon. Wea. Rev.*, **135**, 2379–2390.