## 1. Introduction

The ensemble forecast, with its advantage in producing probabilistic forecasts, is widely used in current operational numerical weather prediction. However, most existing ensemble forecast systems suffer from bias and underdispersion, which undermine their value in direct application. To fully exploit the potential of ensemble forecasting, many sophisticated statistical postprocessing methods have been designed to correct biases and compensate for the underdispersion in raw ensemble forecasts, such as logistic regression (Hamill et al. 2004), ensemble dressing (Roulston and Smith 2003; Wang and Bishop 2005), nonhomogeneous Gaussian regression (Gneiting et al. 2005), and Bayesian model averaging (BMA; Raftery et al. 2005).

The BMA postprocessing model proposed by Raftery et al. (2005) produces a predictive probability density function (PDF) by weighted averaging among PDFs centered on the individual bias-corrected forecasts of ensemble members. It was first introduced for postprocessing variables with a normal distribution, such as surface temperature and sea level pressure (Raftery et al. 2005; Wilson et al. 2007), and was later extended to variables with nonnormal distributions. Sloughter et al. (2007) developed a BMA model for the probabilistic quantitative precipitation forecast (PQPF), in which the PDF of each ensemble member is represented as a mixture of a discrete probability at zero and a gamma-form probability distribution from zero to infinity. In addition, BMA models for variables such as visibility and wind vector have also been developed and have shown promising results (Chmielecki and Raftery 2011; Sloughter et al. 2013). Fraley et al. (2010) improved the robustness and computational efficiency of BMA by expanding its ability to accommodate exchangeable and missing ensemble members. It has been shown that training the BMA model on cases of similar conditions improves the performance of the BMA forecast. Kleiber et al. (2011) proposed a local BMA approach, in which the BMA model is trained only on cases with spatial features similar to those of the forecast case, which improved the performance of BMA substantially. A similar result was also found by Erickson et al. (2012), in which the BMA model trained on days with similar conditions performed better than when trained on consecutive recent days.

However, it remains controversial whether BMA stands among the most viable up-to-date postprocessing methods. Sloughter et al. (2007) demonstrated that BMA performed better than logistic regression in high-precipitation events when applied to the University of Washington mesoscale ensemble forecasts during 2003–04. Liu and Xie (2014) applied BMA to multimodel ensemble forecasts from The Observing System Research and Predictability Experiment Interactive Grand Global Ensemble dataset in the Huaihe basin of China, and also found that BMA outperformed logistic regression. In contrast, Wilks (2006) compared several statistical postprocessing methods, including logistic regression, ensemble dressing, nonhomogeneous Gaussian regression, and BMA, in an idealized Lorenz ’96 setting, and concluded that BMA performed less well than the other three methods, excluding it from the group of best-performing postprocessing methods. Schmeits and Kok (2010) applied BMA PQPF to a 20-yr European Centre for Medium-Range Weather Forecasts (ECMWF) ensemble reforecast dataset, and found that BMA became even less skillful than the raw ensemble forecasts from forecast day 3 onward.

One problem with the BMA model is that it is based on individually bias-corrected ensemble members. At a sufficiently long lead time, bias correction draws each ensemble member toward the climatology and reduces the ensemble spread. As a result, BMA places most of the probability around the climatology (Schmeits and Kok 2010). To deal with this problem, Schmeits and Kok (2010) suggested an additive bias correction (setting the slope of the linear regression equation to 1) to maintain proper spread among ensemble members, by which the performance of BMA was improved to a level comparable with extended logistic regression (Wilks 2009). A similar problem was also found for BMA and model output statistics (MOS) applied to individual ensemble members in Wilks (2006).

Any MOS inevitably converges to the climatology (Wilks 2011). However, the MOS of the ensemble mean forecast usually converges to the climatology more slowly than the MOSs of individual ensemble members, since the ensemble mean forecast extracts reliable information from all ensemble members and usually possesses better skill than any individual member, which makes the ensemble mean forecast a preferable predictor in postprocessing. All the postprocessing methods that outperformed BMA in Wilks (2006) used the ensemble mean forecast as the predictor. In other words, it is important for postprocessing to treat the ensemble as a whole, or to include as much information as possible on the whole ensemble. However, conventional BMA is unable to do this, since it estimates the PDF for each ensemble member individually.

Hamill and Colucci (1998) and Eckel and Walters (1998), in their rank histogram calibration studies, demonstrated that ensemble quantitative precipitation forecast (QPF) cases can be divided into different subsamples according to their ensemble spread, and these subsamples have different statistical properties from each other. Pooled together, these subsamples form a heterogeneous overall sample, which is difficult for a simple statistical model to interpret. A BMA model such as the one in Sloughter et al. (2007) is based on a combination of several simple statistical models (such as logistic regression and linear regression), and is probably unable to interpret such a heterogeneous sample either. The present study aimed to investigate the impact of a heterogeneous training sample on BMA performance, which, to the best of the authors’ knowledge, has not been previously addressed in the literature. To study the problem, a stratified sampling BMA approach was proposed; specifically, a BMA method based on training sample categorization, which shares similar features with the local BMA of Kleiber et al. (2011), but with a different categorization condition—the ensemble spread. It was expected that stratified sampling based on ensemble spread would implicitly represent more information on the whole ensemble in the BMA process, which may benefit calibration performance. Three PQPF postprocessing experiments were performed on an ensemble forecast dataset for the summer of 2010 in the northern China region: two BMA experiments with different sampling methods and one logistic regression experiment. In section 2, the statistical models used are introduced. Section 3 describes the dataset, experiment design, and verification method. Section 4 presents the results, and section 5 summarizes the study.

## 2. Statistical models

### a. BMA model

Following Sloughter et al. (2007), the BMA predictive PDF of *y*, the cube root of the precipitation quantity, given an ensemble of *K* forecasts is

$$p(y \mid f_1, \ldots, f_K) = \sum_{k=1}^{K} w_k\, h_k(y \mid f_k), \tag{1}$$

where *f* is the forecast of a particular ensemble member, *k* is the index of the ensemble member, *K* is the total number of the ensemble members, *w*_{k} is the posterior probability of the *k*th ensemble member being the best among all members, and *h*_{k}(*y* | *f*_{k}) is the conditional PDF of *y* given that *f*_{k} is the best among the ensemble. Cube root transformation on precipitation data was used to achieve better performance in logistic regression [Eq. (3)] and better fitting of the gamma distribution to precipitation observation [Eq. (4)], as in Sloughter et al. (2007). The predictive PDF of *y* is transformed to the PDF of the original precipitation amount (*y*^{3}) after BMA is done.

The conditional PDF of each ensemble member is a mixture of a discrete probability at zero and a gamma distribution over the positive amounts:

$$h_k(y \mid f_k) = P(y = 0 \mid f_k)\, I[y = 0] + P(y > 0 \mid f_k)\, g_k(y \mid f_k)\, I[y > 0], \tag{2}$$

where *I*[⋅] is a general indicator function that is equal to 1 if the condition in brackets holds, and 0 otherwise; *P*(*y* = 0 | *f*_{k}) is the probability of no precipitation given *f*_{k}; and *g*_{k}(*y* | *f*_{k}) is the conditional PDF of *y* given that *y* is positive for the *k*th ensemble member.

The probability of no precipitation is modeled by logistic regression:

$$\operatorname{logit} P(y = 0 \mid f_k) = a_{0,k} + a_{1,k} f_k^{1/3} + a_{2,k} \delta_k, \tag{3}$$

where δ_{k} is equal to 1 when *f*_{k} equals 0, and 0 otherwise; and *a*_{0,k}, *a*_{1,k}, and *a*_{2,k} are undetermined parameters, estimated member specifically by logistic regression with the cube root of the forecast as the predictor and a YES/NO precipitation analysis as the predictand.

The conditional PDF of the positive precipitation amount is a gamma distribution,

$$g_k(y \mid f_k) = \frac{y^{\alpha_k - 1} e^{-y/\beta_k}}{\beta_k^{\alpha_k} \Gamma(\alpha_k)}, \tag{4}$$

whose shape parameter α_{k} and scale parameter β_{k} depend on *f*_{k} by

$$\mu_k = \alpha_k \beta_k = b_{0,k} + b_{1,k} f_k^{1/3} \tag{5}$$

and

$$\sigma_k^2 = \alpha_k \beta_k^2 = c_0 + c_1 f_k. \tag{6}$$

The *b*_{0,k} and *b*_{1,k} are estimated by linear regression, with the cube root of the forecast as the predictor and the cube root of the analysis as the predictand in all nonzero analysis cases. The weights *w*_{k} and the variance parameters *c*_{0} and *c*_{1} are estimated by maximizing the log-likelihood function

$$\ell(w_1, \ldots, w_K; c_0, c_1) = \sum_{i=1}^{N} \log \left[ \sum_{k=1}^{K} w_k\, h_k(y_i \mid f_{k,i}) \right], \tag{7}$$

where *i* is the index of a particular case in the training sample, and *N* is the total number of training cases. The log-likelihood function is maximized by the expectation-maximization (EM) algorithm, as in Sloughter et al. (2007). The *c*_{0} and *c*_{1} are estimated numerically in every iteration of the EM algorithm.
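As an illustration, the BMA predictive density described above can be evaluated for positive amounts with the standard library alone. This is a sketch, not the paper's code; the function names are ours and any parameter values supplied to it would be hypothetical, not fitted values.

```python
import math


def gamma_pdf(y, shape, scale):
    """Gamma density [Eq. (4)] with shape alpha and scale beta."""
    return (y ** (shape - 1) * math.exp(-y / scale)
            / (scale ** shape * math.gamma(shape)))


def bma_pdf(y, f, w, a, b, c0, c1):
    """BMA predictive density of the cube-root amount y > 0 [Eqs. (1)-(6)].

    f : list of K raw member forecasts (mm); w : K BMA weights summing to 1;
    a : K rows of logistic coefficients [a0, a1, a2]; b : K rows of linear
    coefficients [b0, b1]; c0, c1 : variance parameters common to all members.
    """
    dens = 0.0
    for k in range(len(w)):
        # Eq. (3): probability of zero precipitation for member k
        delta = 1.0 if f[k] == 0.0 else 0.0
        logit = a[k][0] + a[k][1] * f[k] ** (1.0 / 3.0) + a[k][2] * delta
        p_zero = 1.0 / (1.0 + math.exp(-logit))
        # Eqs. (5)-(6): gamma mean and variance from the member forecast
        mu = b[k][0] + b[k][1] * f[k] ** (1.0 / 3.0)
        var = c0 + c1 * f[k]
        shape, scale = mu * mu / var, var / mu
        # Eqs. (1)-(2): weighted contribution of member k on y > 0
        dens += w[k] * (1.0 - p_zero) * gamma_pdf(y, shape, scale)
    return dens
```

The density integrates to the total probability of positive precipitation, with the remaining mass sitting as a point mass at zero.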

### b. Logistic regression model

The probability of the precipitation amount *y* not exceeding a given threshold *y*_{0} is modeled by logistic regression, with the cube roots of the ensemble mean forecast and of the threshold as predictors:

$$\operatorname{logit} P(y \le y_0 \mid \bar{f}) = \theta_1 + \theta_2 \bar{f}^{1/3} + \theta_3 y_0^{1/3}, \tag{8}$$

where $\bar{f}$ is the ensemble mean forecast, and *θ*_{1}, *θ*_{2}, and *θ*_{3} are undetermined parameters. The forecasted probability of precipitation exceeding the threshold *y*_{0} is then

$$P(y > y_0 \mid \bar{f}) = 1 - P(y \le y_0 \mid \bar{f}). \tag{9}$$
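A minimal sketch of evaluating such a logistic regression PQPF, assuming the cube roots of the ensemble mean forecast and of the threshold serve as the predictors; the θ values passed in are hypothetical, not fitted coefficients:

```python
import math


def exceedance_prob(ens_mean, threshold, theta1, theta2, theta3):
    """P(precip > threshold) from a fitted logistic regression.

    Assumes logit P(y <= threshold) is linear in the cube roots of the
    ensemble mean forecast and of the threshold (both in mm).
    """
    logit = theta1 + theta2 * ens_mean ** (1.0 / 3.0) + theta3 * threshold ** (1.0 / 3.0)
    p_below = 1.0 / (1.0 + math.exp(-logit))
    return 1.0 - p_below
```

With one fitted parameter set, the same equation yields a full set of exceedance probabilities over any range of thresholds, which is the usual motivation for including the threshold as a predictor.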

## 3. Data, experiment design, and verification method

### a. Data

#### 1) Raw ensemble forecast dataset

The raw ensemble forecasts used in this study were produced from a short-range ensemble forecast experiment over northern China (Fig. 1) from 1 July to 18 August 2010, with a total of 44 dates after 6 dates were omitted because of failure of the data archive (Zhu et al. 2013). A total of 12 rainfall events occurred in the model domain during July and August 2010, mainly caused by high-level troughs, shear lines, cold-core lows, and the subtropical high (Wang 2010; Zhao 2010). Compared to the climatology, the total precipitation amount of July and August 2010 had positive percentage anomalies (from 30% to 100%) in the southern and eastern parts of the model domain (Heilongjiang, Jilin, Liaoning, Shandong, and the southeast part of Hebei, China), and negative percentage anomalies (from 30% to 80%) in the northern and western parts of the domain (Inner Mongolia and the northwest part of Hebei).

The ensemble system was based on the Advanced Research version of the Weather Research and Forecasting (ARW-WRF) Model, version 3.2, and had 11 members. It was set on a 132 × 102 grid at 15-km grid spacing with 50 vertical levels, covering most of north and northeast China (Fig. 1). The control member and 10 perturbed members were driven by the National Centers for Environmental Prediction (NCEP) Global Forecast System and 10 randomly selected NCEP Global Ensemble Forecast System perturbed forecasts, respectively. An inflation factor of 2 was applied to perturbations in the initial conditions (ICs) and lateral boundary conditions (LBCs) to compensate for the underdispersion in ICs and LBCs (see Zhu et al. 2013). The ensemble members were set with different physics options, selected somewhat arbitrarily from those available in ARW-WRF with consideration of diversity and numerical stability. The specific physics options of the ensemble members are described in Table 1. The ensemble forecasts were initiated at 0000 UTC on each day, and integrated for 48 h. The ensemble dataset possessed relatively proper dispersion in the meteorological fields, but a wet bias in the precipitation forecasts. A detailed verification of the dataset can be found in Zhu et al. (2013). Since BMA becomes less skillful at longer lead times (Schmeits and Kok 2010), only the 24-h accumulated precipitation forecasts at the 48-h lead time were used in this study.

Physics options for the raw ensemble forecasts. Abbreviations are as follows: Betts–Miller–Janjic (BMJ), Grell–Devenyi ensemble scheme (GD), Kain–Fritsch (KF), land surface model (LSM), Monin–Obukhov (MO), Mellor–Yamada–Janjic (MYJ), planetary boundary layer (PBL), Rapid Radiative Transfer Model (RRTM), Rapid Update Cycle (RUC), WRF single-moment 3/5/6-class (WSM3/5/6), and Yonsei University (YSU). For details on the physical parameterization packages and references see Skamarock et al. (2008). The longwave radiation in all members was set to RRTM. The m00 is the control member.

#### 2) Verification dataset

The daily rainfall analysis data (0000–0000 UTC) issued by the China Meteorological Administration were used as verification data (Shen et al. 2010). The data were generated by optimal interpolation of the rain gauge data from 2400 stations based on climate background fields, covering the entire mainland of China at 0.25° × 0.25° resolution. The dataset had been verified during 1 June–27 July 2008. A total of 91% of the data had absolute errors less than 1.0 mm, and 98.94% of the data had absolute errors less than 3.0 mm (Shen et al. 2010). In this study, the analysis data were remapped to the 15-km model domain of the raw ensemble forecasts before verification. The model grid points beyond the Chinese border were assigned with missing values, and the number of grid points with valid values was 9254 for each verified day.

### b. Experiment design

Two BMA experiments with different sampling methods were performed in this study. The first experiment, as in Sloughter et al. (2007) and many other BMA studies, pooled all available cases at every grid point on all training days to form an overall training sample, and is referred to as conventional BMA (CBMA). The second experiment employed stratified sampling, in which the overall training sample (the same as that used by CBMA) was first divided into subsamples according to the ensemble spread, and the BMA parameters were then trained on each subsample separately. When forecasting, each forecast case was categorized by its ensemble spread, and used the parameters trained on the subsample to which it belonged. In other words, for each particular forecast case, the parameters were trained only on those cases with an ensemble spread similar to that of the forecast case. Stratified sampling was employed to avoid the possible impact of sample heterogeneity in CBMA. It was also expected to implicitly represent more information on the whole ensemble in the BMA model, which is likely beneficial for the performance of calibration. This second experiment is referred to as stratified sampling BMA (SSBMA). In this study, 10 subsample strata were set arbitrarily for the SSBMA (Table 2). In addition, a logistic regression (LR) experiment using the model in section 2b was performed on the same dataset. It used the same overall sampling as CBMA.
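The SSBMA training and lookup workflow can be sketched as follows. The spread bin edges here are hypothetical placeholders (the actual 10 strata are defined in Table 2), and `fit_bma` stands in for the EM estimation of section 2a:

```python
import statistics

# Hypothetical spread bin edges (mm), giving 10 strata; the paper's actual
# strata are listed in its Table 2.
SPREAD_EDGES = [1, 2, 4, 6, 9, 13, 18, 25, 40]


def stratum(member_forecasts):
    """Assign a case to a spread stratum by its ensemble standard deviation."""
    spread = statistics.pstdev(member_forecasts)
    for i, edge in enumerate(SPREAD_EDGES):
        if spread < edge:
            return i
    return len(SPREAD_EDGES)  # highest-spread stratum


def train_ssbma(training_cases, fit_bma):
    """Fit one BMA parameter set per stratum; fit_bma stands in for the EM fit."""
    buckets = {}
    for case in training_cases:
        buckets.setdefault(stratum(case["forecasts"]), []).append(case)
    return {s: fit_bma(cases) for s, cases in buckets.items()}


def forecast_params(params_by_stratum, member_forecasts):
    """Pick the parameters trained on the stratum the forecast case falls in."""
    return params_by_stratum[stratum(member_forecasts)]
```

CBMA corresponds to the degenerate case of a single stratum containing every training case.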

Subsamples in SSBMA initiated on 18 Jul 2010. The sample size of the overall sample is 397 922.

A cross-validation strategy was applied in all three experiments to guarantee a sufficient training sample size, especially for extreme rainfall events. Specifically, for a particular forecasting day, cases at all grid points on the other 43 days were used as the overall training sample. In SSBMA, this overall training sample was further divided into subsamples before the BMA training took place. Though cross validation is not suitable for operational use, it is able to examine the impact of sample heterogeneity on BMA, and compare the performances of the three postprocessing methods.
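The cross-validation strategy amounts to a leave-one-day-out loop; a schematic sketch in which `fit` and `score` stand in for BMA training and verification:

```python
def cross_validate(days, fit, score):
    """Leave-one-day-out cross validation: for each forecast day, train on
    all other days and verify on the held-out day."""
    results = {}
    for held_out in days:
        training = [d for d in days if d != held_out]
        results[held_out] = score(fit(training), held_out)
    return results
```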

### c. Verification method

The continuous ranked probability score (CRPS) was used to verify the overall performance of the probabilistic forecasts:

$$\mathrm{CRPS} = \frac{1}{N} \sum_{n=1}^{N} \int_{-\infty}^{\infty} \left[ \mathrm{FCDF}_n(x) - \mathrm{OCDF}_n(x) \right]^2 dx,$$

where FCDF(*x*) is the forecasted CDF, OCDF(*x*) is the observed CDF (changes from 0 to 1 at the observed value), *n* is the index of verified cases, and *N* is the total number of cases.

The Brier score (BS) was used to verify the probabilistic forecasts of precipitation exceeding given thresholds:

$$\mathrm{BS} = \frac{1}{N} \sum_{n=1}^{N} (P_n - O_n)^2,$$

where *P* is the forecasted probability of the event, and *O* is the observation of event occurrence (equal to 1 when the event occurs; otherwise, 0). The Brier skill score (BSS) measures the relative improvement of a forecast over a reference forecast, $\mathrm{BSS} = 1 - \mathrm{BS}/\mathrm{BS}_{\mathrm{ref}}$. For more details of the above verification scores, readers are referred to Wilks (2011).
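The CRPS and BS described above can be approximated numerically as follows; this is a generic sketch of the computation, not the paper's implementation:

```python
def crps_numeric(fcdf, obs, xs):
    """Approximate one case's CRPS integral with the midpoint rule on grid xs.

    fcdf is a callable forecast CDF; the observed CDF is the step 1[x >= obs].
    """
    total = 0.0
    for lo, hi in zip(xs[:-1], xs[1:]):
        x = 0.5 * (lo + hi)
        ocdf = 1.0 if x >= obs else 0.0
        total += (fcdf(x) - ocdf) ** 2 * (hi - lo)
    return total


def brier_score(probs, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)
```

For a uniform forecast CDF on [0, 1] and an observation at 0, the CRPS integral works out analytically to 1/3, which provides a quick sanity check of the grid approximation.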

## 4. Results

### a. Heavy rainfall case

#### 1) BMA training

The 48-h forecasts initiated on 18 July (valid from 0000 UTC 19 July to 0000 UTC 20 July) from the two BMA experiments were verified to reveal their differences in parameter estimation. This was a heavy rainfall event. At 0000 UTC 19 July 2010, the edge of the subtropical high was located over the Huanghai Sea (the southeast part of the domain), and a low pressure center was over southern Shanxi (bottom-left corner of the domain) at 500 hPa. The cold air to the west of the low pressure center interacted with the warm and wet air on the northwest edge of the subtropical high, causing massive heavy rainfall in the Huabei region, with 24-h accumulated precipitation exceeding 100 mm. Blocked by the stable subtropical high, the low pressure system moved northeastward, and extended the rainfall area to the northeast.

The subsamples in SSBMA showed fundamental differences in their sample sizes, which dropped quickly as the ensemble spread increased (Table 2). The sample size of subsample 10 was less than 1% of that of subsample 1, and less than 0.5% of the overall sample, demonstrating that the high spread cases, which often correlated with heavy rainfall events, accounted for only a trivial fraction of the overall training sample.

Figure 2 shows the conditional mean of the cube root of the analysis given a particular cube root of the forecast calculated for member m01 from all positive analysis cases in the overall sample, and subsamples 2, 4, 6, 8, and 10. The forecasts were rounded to 0.25 intervals when calculating the conditional means, and only the results with sample size larger than 20 are shown. It can be seen that the distribution of conditional means was concave in the overall sample, indicating a poor linear relationship between the cube root of the forecast and that of the analysis. Applying linear regression [Eq. (5)] on the overall sample led to problematic results, shown later in Fig. 3b. The subsamples possessed better linear relationships between the cube roots of the forecast and analysis, with the distribution of conditional means being closer to a straight line.
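The conditional-mean diagnostic described above can be sketched as follows; the 0.25 bin width and the minimum bin size of 20 follow the text, but the helper itself is ours:

```python
def conditional_means(forecasts, analyses, width=0.25, min_size=20):
    """Round cube-root forecasts to `width` intervals and return the mean
    cube-root analysis per bin, keeping only bins with at least min_size cases."""
    bins = {}
    for f, y in zip(forecasts, analyses):
        bins.setdefault(round(f / width) * width, []).append(y)
    return {b: sum(v) / len(v) for b, v in bins.items() if len(v) >= min_size}
```

Plotting the returned bin centers against the bin means reproduces the kind of curve shown in Fig. 2; a straight line indicates that the linear regression of Eq. (5) is appropriate for that sample.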

Figure 3 illustrates the regression curves of Eqs. (3) and (5) fitted for member m01 in CBMA and SSBMA. The curves are shown between [0, 7] (mm)^{1/3}, a range that covers the daily amounts of most rainfall events in northern China. The regression curves of Eq. (3) are discontinuous at 0 (mm)^{1/3} (dots in Fig. 3a), since *a*_{2,m01} takes effect in Eq. (3) only when the forecast is equal to 0 (mm)^{1/3}. In SSBMA, the subsamples possessed obviously distinct regression curves, and the curves rose and became flatter as the ensemble spread of the subsamples increased. More specifically, for low spread subsamples, such as subsample 1, a relatively complete S-shaped logistic curve is shown on the range (0, 7] (mm)^{1/3}, and the probability of precipitation (PoP) transitioned gradually from low to high. For high spread subsamples, such as subsample 9, the curve is relatively flat since only the upper part of the logistic curve falls in the range (0, 7] (mm)^{1/3}, and the PoP remained close to 1 regardless of the forecasted precipitation amount. Under this condition, cases of similar forecasted precipitation amount would be estimated with different PoP if they had different ensemble spreads. CBMA did not have such resolution. The single regression curve of CBMA presented the overall property of the overall sample, which, in this case, was similar to the curve of subsample 3. As a result, all cases used the same *a*_{0,k}, *a*_{1,k}, and *a*_{2,k} in CBMA even if they had widely different statistical properties.

The regression lines of Eq. (5) fitted for member m01 (Fig. 3b) illustrate different relationships between the cube root of the forecast and that of the analysis among the subsamples in SSBMA. Both *b*_{0,m01} (intercept) and *b*_{1,m01} (slope) increased across subsamples as the ensemble spread increased. Such differences in statistical properties among subsamples were also found for other members (not shown). The single regression line of CBMA was close to the regression lines of subsamples 4–8, representing the relationship in the overall sample. However, it was far from the regression line of subsample 10, indicating that most information on the high spread cases was filtered out in CBMA since they were just a few outliers. Therefore, forecasts with high ensemble spread were improperly bias corrected toward “best” estimations that were actually much lower than the true best estimations. Moreover, ensemble members showed different skill in high spread cases. More specifically, some members, such as m01 and m06, had larger slopes in subsample 10, indicating good correlation between the forecast and analysis, while some other members had rather flat regression lines that indicated poor forecast skill (not shown). Such a difference could not be recognized by CBMA, since ensemble members were similar in their performances after improper bias corrections.

The *c*_{0} and *c*_{1} estimated numerically in the EM algorithm were common for all ensemble members. The *c*_{1} was close to 0 in CBMA and all subsamples in SSBMA, showing that *c*_{0} of subsamples 1–10 were 0.27, 0.3, 0.34, 0.35, 0.4, 0.49, 0.65, 0.88, 0.98, and 0.78, respectively, showing large differences among subsamples. There was an increasing trend in *c*_{0} from subsample 1 to subsample 9. However, the *c*_{0} of subsample 10 was not in this increasing array. This is probably because, compared to subsamples 1–9, the *c*_{0} of subsample 10 was fitted for a different weighted combination of the ensemble members. As discussed in the next paragraph, SSBMA assigned the weights evenly to all ensemble members in subsamples 1–9, but most weights to two members (m01 and m06) in subsample 10. The CBMA estimated *c*_{0} at 0.36, similar to subsample 4, but relatively far away from subsamples with very low or high ensemble spread, implying that CBMA would overestimate (underestimate) the variance

In the EM algorithm, the ensemble members were assigned similar weights (*w*_{k}) of about 1/11 in CBMA (Fig. 4), as well as in subsamples 1–9 in SSBMA (not shown). However, members were given fundamentally different weights in subsample 10 in SSBMA, in which m01 and m06 were given high weights of about 0.3 and 0.4, respectively; m02 and m10 weights of around 0.1; and the others trivial weights close to 0. This demonstrates that the ensemble members had similar skill when the ensemble spread was low, while m01 and m06 were superior in cases of higher ensemble spread. Since the better performances of m01 and m06 were relatively persistent on all verified days (not shown), and given the sensitivity of QPF to model physics options, it is probably the physics configurations of m01 and m06 that gave them an advantage in high spread cases. It also demonstrates a merit of a mixed-physics setting in ensemble QPFs: some members help to improve the performance of the whole ensemble. With only one group of BMA weights, CBMA was incapable of representing the difference among the subsamples. Since high spread cases were outnumbered by low and medium spread cases in the overall sample, the weights estimated by CBMA mainly represented the performances of the ensemble members in low and medium spread cases.

As shown above, subsamples based on ensemble spread possessed different statistical properties, and all BMA parameters except *c*_{1} demonstrated sensitivity to subsamples, indicating that subsamples should be processed separately by BMA. Pooling these subsamples together formed an overall heterogeneous sample, difficult for the BMA model to interpret, and produced ill-fitted parameters. The impact of such parameters on BMA forecasting is illustrated in the following part.

#### 2) BMA forecasting

To examine the difference between the CBMA and SSBMA experiments in forecasting, the PoP forecasts and BMA percentile forecasts for 24-h accumulated precipitation at the 48-h lead time were generated for the 18 July 2010 case over the entire model domain. The PoP forecasts (Fig. 5) illustrate that the two experiments had a similar spatial pattern, but SSBMA produced a sharper forecast with larger gradients and more extreme values (0 and 1). The BMA 50th and 90th percentile forecasts (Fig. 6) also displayed a similar spatial pattern, but different sharpness between CBMA and SSBMA. For the 50th percentile forecasts, also known as the BMA deterministic forecast, the SSBMA forecast had a larger rainfall area and higher rainfall amount than CBMA, with more intense local maxima. Compared to the analysis, the SSBMA forecast captured the rainfall area reasonably well, while the CBMA forecast underestimated it. Both forecasts underestimated the precipitation magnitudes over the entire domain; the SSBMA forecast, however, was closer to the analysis. For the 90th percentile forecasts, also known as the BMA upper bound forecast, SSBMA forecasted a larger precipitation amount but a smaller rainfall area compared to CBMA. CBMA forecasted the precipitation amounts of the local maxima at about 50–70 mm, much weaker than the 100–150 mm values in the analysis. SSBMA forecasted a reasonable amount for the local maxima, with only a small displacement relative to the analysis.

To further illustrate the difference between CBMA and SSBMA, predictive PDFs were drawn from the two experiments at three model grid points (see Fig. 7). The PDFs at grid point 41.35°N, 117.94°E represented a low ensemble spread case from subsample 1, with an ensemble spread of 1.92 mm and the analysis at 0 mm (Figs. 7a,b). CBMA forecasted a nonprecipitation probability of 0.67, while SSBMA forecasted 0.81. This is consistent with Fig. 3a, which shows that CBMA tended to forecast higher PoP than SSBMA when the ensemble spread was low. CBMA forecasted a more widely distributed PDF than SSBMA, since the variance of the PDF was overestimated in CBMA. Both experiments enclosed the analysis within their 90th percentiles, but the forecast of SSBMA was sharper.

The PDFs at grid point 46.78°N, 117.64°E were of a medium ensemble spread case from subsample 7, with the analysis at 28 mm and an ensemble spread of 19.38 mm (Figs. 7c,d). For this case, CBMA forecasted a higher nonprecipitation probability (about 52%) and a more narrowly distributed conditional PDF given positive precipitation, with the 90th percentile at 19 mm, leaving the analysis outside. SSBMA forecasted a lower nonprecipitation probability and a more widely distributed conditional PDF, enclosing the analysis within its 90th percentile (at 43 mm).

The third pair of PDFs was of a heavy rainfall case at grid point 41.51°N, 111.21°E, which had the analysis at 127 mm and an ensemble spread of 109 mm (Figs. 7e,f). For this case, both experiments forecasted similar nonprecipitation probabilities. However, CBMA produced a left-shifted and more narrowly distributed conditional PDF than SSBMA, indicating that the raw forecast was wrongly corrected and the variance was underestimated in CBMA. As a result, CBMA had its 90th percentile at 66 mm, far from the analysis. SSBMA, despite also failing to enclose the analysis within its 90th percentile, gave a much better forecast at 107.7 mm.

As shown above, parameters estimated from the overall sample fitted less well for cases that deviated far from the main body of the overall sample. As a result, CBMA produced less sharp PDFs, and severely underforecasted the local rainfall maxima in the BMA percentile forecasts. SSBMA produced sharper forecasts, since it was able to estimate the parameters more accurately for each subsample by processing them separately. In other words, SSBMA had higher resolution in sampling and parameter estimation.

### b. Verification over the season

CBMA and SSBMA were verified in terms of CRPS on all grid points over all 44 days available from the seasonal dataset, and the statistics of the scores were calculated by bootstrapping (Fig. 8a). SSBMA showed better scores than CBMA. The CRPS was also calculated for those cases with analysis exceeding 50 mm alone to examine the performance of the two experiments in heavy rainfall events. In this category, SSBMA still outperformed CBMA (Fig. 8b).

The bootstrap median and 90% confidence interval of the BSS of the probabilistic forecasts for 24-h accumulated precipitation exceeding the thresholds of 1, 10, 20, 30, 40, and 50 mm were calculated separately for CBMA, SSBMA, and LR, using the raw ensemble forecasts as a reference (see Fig. 9). For BMA, the probabilistic forecast was generated from the area under the total PDF to the right of the threshold. For the raw ensemble, the probabilistic forecasts were generated by consensus voting among all ensemble members. Compared to the raw ensemble, CBMA had better skill at the 1-, 40-, and 50-mm thresholds, but comparable or worse skill at the 10-, 20-, and 30-mm thresholds. LR and SSBMA performed better than the raw ensemble and CBMA at all thresholds. SSBMA outscored LR at thresholds below 40 mm, but had similar scores at 50 mm.
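Both reference quantities used here are straightforward to compute; a sketch of the consensus-voting probability and the standard BSS definition:

```python
def raw_ensemble_prob(member_qpfs, threshold):
    """Consensus voting: fraction of ensemble members exceeding the threshold."""
    return sum(f > threshold for f in member_qpfs) / len(member_qpfs)


def brier_skill_score(bs, bs_ref):
    """BSS of a forecast relative to a reference (positive means improvement)."""
    return 1.0 - bs / bs_ref
```

With 11 members, consensus voting can only resolve probabilities in steps of 1/11, which is one reason postprocessed forecasts can outperform the raw ensemble in probabilistic scores.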

Reliability diagrams were computed for the probabilistic forecasts of 24-h accumulated precipitation exceeding 1, 20, and 50 mm from the raw ensemble forecasts, CBMA, SSBMA, and LR (see Fig. 10). The raw ensemble overestimated the probability at all three thresholds. CBMA, in contrast, underestimated the probability at all thresholds, and the bias became larger as the threshold increased. For the 50-mm threshold, CBMA forecasted probabilities between 0.05 and 0.1 for all verified cases. LR showed overforecasting at the 1- and 20-mm thresholds, but was calibrated relatively well at the 50-mm threshold. SSBMA performed the best among the four experiments at the 1- and 20-mm thresholds, and showed skill comparable to LR at 50 mm.

## 5. Summary

In this study, two BMA experiments were performed on 11-member, short-range ensemble QPFs in the northern China region from 1 July to 18 August 2010, to examine the impact of heterogeneous training samples on BMA performance. The first experiment used the overall training sample by pooling all available cases during the training period indiscriminately, as in most conventional BMA studies, while the other experiment used a new stratified sampling approach (i.e., SSBMA) that first divided training cases into subsamples by their ensemble spread, and then performed BMA on each subsample.

Subsamples of different ensemble spreads possess fundamental differences in sample size and statistical properties, and ought to be processed separately. Pooling all available cases together forms a heterogeneous overall sample, causing interpretation difficulty for the BMA model. The parameters estimated from the overall sample only present the overall property, which is different from individual subsamples, especially those that deviate far away from the main body of the overall sample.

BMA suffers the most from heterogeneity in the training sample in high ensemble spread cases (often correlated with heavy rainfall cases), since high spread subsamples tend to have very small sample sizes, and their properties differ widely from the main body of the overall sample. The CBMA approach filters out most information on the high spread cases, and produces ill-fitted parameters for them. As a result, CBMA severely underestimates the local maxima in BMA percentile forecasts and the probability in high-threshold PQPF.

The stratified sampling approach helps BMA to estimate parameters in a more accurate manner for different subsamples, and implicitly introduces information on the whole ensemble into the estimation of PDFs of individual members. As a result, SSBMA produces sharper predictive PDFs, PoP forecasts, and BMA percentile forecasts than CBMA. It remedies the underestimation of local maxima in BMA percentile forecasts and the probability in high-threshold PQPF in CBMA reasonably well.

Several metrics, including the CRPS, BSS, and reliability diagram, were verified for CBMA, SSBMA, LR, and the raw ensemble forecasts. CBMA showed no obvious advantage over the raw ensemble. SSBMA and LR outperformed the raw ensemble and CBMA in all verification metrics, while SSBMA had better skill than LR at lower-threshold PQPFs. This is probably because LR used the ensemble mean forecast as a predictor, and the ensemble mean forecast usually has a large wet bias in low-threshold QPF, which makes it more difficult for LR to discriminate different probabilities in low-threshold PQPF. An imperfect regression model might be another reason for the inferior performance of logistic regression in low-threshold PQPF in this study. However, soundly based conclusions on logistic regression can only be drawn from more detailed evaluation with a larger data sample in future research.

To guarantee a sufficient training sample size, especially for the high-ensemble-spread subsamples, this study used a cross-validation approach, which does not suit the needs of operational postprocessing. The feasibility of SSBMA in operational applications based purely on historical data will be tested with a larger dataset in the near future. Moreover, questions such as whether sample heterogeneity is a common issue in short-range ensemble PQPF calibration, and whether stratified sampling BMA is feasible for other forecast variables and other ensemble systems, also deserve future investigation.
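A cross-validation setup of the kind referred to here can be sketched as leave-one-block-out partitioning over the forecast days; the block length and the total number of days below are illustrative assumptions, not the study's exact configuration.

```python
import numpy as np

def leave_one_block_out(n_days, block_len):
    """Yield (train_idx, test_idx) pairs of day indices: each block of
    consecutive days is held out in turn and the model is trained on all
    remaining days, so every case is verified with parameters it never
    helped to estimate.
    """
    days = np.arange(n_days)
    for start in range(0, n_days, block_len):
        test = days[start:start + block_len]
        train = np.concatenate([days[:start], days[start + block_len:]])
        yield train, test

# Example: a hypothetical 62-day summer period split into weekly blocks.
folds = list(leave_one_block_out(62, 7))
```

Each fold trains on roughly 55 of the 62 days, which keeps even the small high-spread subsamples usable for parameter estimation, at the cost of using "future" data relative to the verified block.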

*Acknowledgments.* We thank three anonymous reviewers for their valuable comments, which helped to improve this manuscript. This research was jointly supported by a project of the National Natural Science Foundation of China (Grant 41305099) and the "Strategic Priority Research Program" of the Chinese Academy of Sciences (XDA05100300).

## REFERENCES

Chmielecki, R. M., and A. E. Raftery, 2011: Probabilistic visibility forecasting using Bayesian model averaging. *Mon. Wea. Rev.*, **139**, 1626–1636, doi:10.1175/2010MWR3516.1.

Eckel, F. A., and M. K. Walters, 1998: Calibrated probabilistic quantitative precipitation forecasts based on the MRF ensemble. *Wea. Forecasting*, **13**, 1132–1147, doi:10.1175/1520-0434(1998)013<1132:CPQPFB>2.0.CO;2.

Erickson, M. J., B. A. Colle, and J. J. Charney, 2012: Impact of bias-correction type and conditional training on Bayesian model averaging over the northeast United States. *Wea. Forecasting*, **27**, 1449–1469, doi:10.1175/WAF-D-11-00149.1.

Fraley, C., A. E. Raftery, and T. Gneiting, 2010: Calibrating multimodel forecast ensembles with exchangeable and missing members using Bayesian model averaging. *Mon. Wea. Rev.*, **138**, 190–202, doi:10.1175/2009MWR3046.1.

Gneiting, T., A. E. Raftery, A. H. Westveld, and T. Goldman, 2005: Calibrated probabilistic forecasting using ensemble model output statistics and minimum CRPS estimation. *Mon. Wea. Rev.*, **133**, 1098–1118, doi:10.1175/MWR2904.1.

Hamill, T. M., and S. J. Colucci, 1998: Evaluation of Eta–RSM ensemble probabilistic precipitation forecasts. *Mon. Wea. Rev.*, **126**, 711–724, doi:10.1175/1520-0493(1998)126<0711:EOEREP>2.0.CO;2.

Hamill, T. M., J. S. Whitaker, and X. Wei, 2004: Ensemble reforecasting: Improving medium-range forecast skill using retrospective forecasts. *Mon. Wea. Rev.*, **132**, 1434–1447, doi:10.1175/1520-0493(2004)132<1434:ERIMFS>2.0.CO;2.

Hamill, T. M., R. Hagedorn, and J. S. Whitaker, 2008: Probabilistic forecast calibration using ECMWF and GFS ensemble reforecasts. Part II: Precipitation. *Mon. Wea. Rev.*, **136**, 2620–2632, doi:10.1175/2007MWR2411.1.

Kleiber, W., A. E. Raftery, J. Baars, T. Gneiting, C. F. Mass, and E. Grimit, 2011: Locally calibrated probabilistic temperature forecasting using geostatistical model averaging and local Bayesian model averaging. *Mon. Wea. Rev.*, **139**, 2630–2649, doi:10.1175/2010MWR3511.1.

Liu, J., and Z. Xie, 2014: BMA probabilistic quantitative precipitation forecasting over the Huaihe basin using TIGGE multimodel ensemble forecasts. *Mon. Wea. Rev.*, **142**, 1542–1555, doi:10.1175/MWR-D-13-00031.1.

Raftery, A. E., T. Gneiting, F. Balabdaoui, and M. Polakowski, 2005: Using Bayesian model averaging to calibrate forecast ensembles. *Mon. Wea. Rev.*, **133**, 1155–1174, doi:10.1175/MWR2906.1.

Roulston, M. S., and L. A. Smith, 2003: Combining dynamical and statistical ensembles. *Tellus*, **55A**, 16–30, doi:10.1034/j.1600-0870.2003.201378.x.

Schmeits, M. J., and K. J. Kok, 2010: A comparison between raw ensemble output, (modified) Bayesian model averaging, and extended logistic regression using ECMWF ensemble precipitation reforecasts. *Mon. Wea. Rev.*, **138**, 4199–4211, doi:10.1175/2010MWR3285.1.

Shen, Y., M. Feng, H. Zhang, and F. Gao, 2010: Interpolation methods of China daily precipitation data (in Chinese). *J. Appl. Meteor. Sci.*, **21**, 279–286.

Skamarock, W. C., and Coauthors, 2008: A description of the Advanced Research WRF version 3. NCAR Tech. Note NCAR/TN-475+STR, 113 pp. [Available online at http://www.mmm.ucar.edu/wrf/users/docs/arw_v3_bw.pdf.]

Sloughter, J. M., A. E. Raftery, T. Gneiting, and C. Fraley, 2007: Probabilistic quantitative precipitation forecasting using Bayesian model averaging. *Mon. Wea. Rev.*, **135**, 3209–3220, doi:10.1175/MWR3441.1.

Sloughter, J. M., T. Gneiting, and A. E. Raftery, 2013: Probabilistic wind vector forecasting using ensembles and Bayesian model averaging. *Mon. Wea. Rev.*, **141**, 2107–2119, doi:10.1175/MWR-D-12-00002.1.

Wang, W., 2010: Analysis of the July 2010 atmospheric general circulation and weather (in Chinese). *Wea. Forecast Rev.*, **36** (10), 122–127.

Wang, X., and C. H. Bishop, 2005: Improvement of ensemble reliability with a new dressing kernel. *Quart. J. Roy. Meteor. Soc.*, **131**, 965–986, doi:10.1256/qj.04.120.

Wilks, D. S., 2006: Comparison of ensemble-MOS methods in the Lorenz '96 setting. *Meteor. Appl.*, **13**, 243–256, doi:10.1017/S1350482706002192.

Wilks, D. S., 2009: Extending logistic regression to provide full-probability-distribution MOS forecasts. *Meteor. Appl.*, **16**, 361–368, doi:10.1002/met.134.

Wilks, D. S., 2011: *Statistical Methods in the Atmospheric Sciences.* 3rd ed. Academic Press, 704 pp.

Wilson, L. J., S. Beauregard, A. E. Raftery, and R. Verret, 2007: Calibrated surface temperature forecasts from the Canadian ensemble prediction system using Bayesian model averaging. *Mon. Wea. Rev.*, **135**, 1364–1385, doi:10.1175/MWR3347.1.

Zhao, W., 2010: Analysis of the August 2010 atmospheric general circulation and weather (in Chinese). *Wea. Forecast Rev.*, **36** (11), 109–114.

Zhu, J., F. Kong, and H. Lei, 2013: A regional ensemble forecast system for stratiform precipitation events in the northern China region. Part II: Seasonal evaluation for summer 2010. *Adv. Atmos. Sci.*, **30**, 15–28, doi:10.1007/s00376-012-1043-x.