## 1. Introduction and background

Of all the atmospheric quantities forecast by numerical weather prediction (NWP) models, rainfall is one of the most difficult fields to predict accurately. The errors in quantitative precipitation forecasts (QPFs) can arise as a result of errors both in the observations and in the forecast model itself. Detailed knowledge of the atmospheric moisture and vertical motion fields is critical for predicting the location and amount of rainfall, but these are difficult quantities to observe accurately. Models cannot adequately represent the cloud dynamics and microphysical processes involved in rainfall generation, and therefore must resort to parameterizations.

To compensate for shortcomings in observing systems and model physics there has been a trend in recent years toward ensemble forecasting, the realization of a number of model integrations using perturbed initial conditions. Ensemble prediction systems (EPSs) have been extensively tested and used in operations at the European Centre for Medium-Range Weather Forecasts (ECMWF) (Molteni et al. 1996) and the U.S. National Centers for Environmental Prediction (NCEP) (Tracton and Kalnay 1993; Stensrud et al. 1999). Using this strategy one can estimate the probability of various events and possibly also the uncertainty associated with a particular forecast. In recent years there has been great interest in using ensembles to generate probabilistic rain forecasts at both short and medium ranges (Du et al. 1997; Eckel and Walters 1998; Buizza et al. 1999a; Mullen and Buizza 2001). The ensemble average has repeatedly been shown to give a more accurate forecast than a single realization of the forecast model (Leith 1974; Zhang and Krishnamurti 1997; Du et al. 1997; Buizza and Palmer 1998).

Studies have shown that the greater the number of ensemble members the greater the skill of the final ensemble prediction (Buizza and Palmer 1998), particularly when generating probabilistic forecasts. As a result some operational centers run large numbers of model integrations to produce their ensemble forecasts (Buizza et al. 1999a; Kalnay et al. 1998). This is very computationally expensive and lower-resolution versions of the models are generally employed. The implicit assumption with most single-model EPSs is that errors result primarily from uncertainties in the initial conditions. A drawback with this approach is that any biases present in the model itself will also be present in the ensemble and may require calibration (Hamill and Colucci 1997). The recent introduction of “stochastic physics” attempts to account for uncertainties in the model subgrid-scale processes (Buizza et al. 1999b).

Another approach that has been taken to address these issues is to combine forecasts from more than one NWP model. Evans et al. (2000) combined the ensemble members from the ECMWF and U.K. Met Office (UKMO) global models to produce a sort of “superensemble,” with better performance on average than was obtained by either ensemble prediction system alone. Hamill and Colucci (1997, 1998) similarly combined ensembles from the NCEP Eta model and regional spectral model to generate improved short-range forecasts of the probability of precipitation.

When a number of NWP forecasts are available, for example via the Global Telecommunications System (GTS), it is possible to construct a poor man's ensemble. In this case the ensemble is composed of output from different models and/or initial times, rather than a single model with perturbed initial conditions. Unlike EPSs that use singular vectors or breeding modes to generate optimal perturbations to the initial conditions, the poor man's ensemble samples the uncertainty in the initial conditions via the different observational data, assimilation, and initialization methods used by operational centers. The poor man's ensemble also samples the uncertainty in model formulation via the variety of model physical parameterizations, numerics, and resolutions.

Forecasters at the bench routinely use multiple models for guidance, knowing that different models are likely to pick up different nuances in the predicted weather. The poor man's ensemble is simply a formalization of this process. This multimodel approach, in addition to having negligible cost, avoids the problem of systematic bias that occurs when a single model is used. Because each model is run at its own NWP center at full resolution it makes a potentially more skillful contribution to the poor man's ensemble than the lower-resolution model runs in single-model EPSs. However, the biases of the component models may be less well understood (quantitatively, at least) than a single well-studied model.

Can a collection of NWP forecasts with a variety of model physics, data sources, and forecast abilities produce better predictions than a large ensemble generated from a single well-formulated NWP model? Atger (1999) and Ziehmann (2000) both compared the performance of the ECMWF ensemble prediction system with a poor man's ensemble consisting of four operational forecasts from the ECMWF, UKMO, Deutscher Wetterdienst (DWD), and NCEP global models. In most respects the 4-member poor man's ensemble outperformed the 51-member ECMWF EPS at forecasting 500-hPa height anomalies over Europe in summer and winter, and this superior performance extended for several days into the forecast period. Atger (1999) concluded that the factor responsible for the higher skill of the poor man's ensemble was the improved quality of the ensemble mean.

The *distribution* of forecasts in an ensemble provides a (limited) sample of possible weather scenarios. It contains more information than a single deterministic forecast and many researchers now believe that probabilistic forecasts are more useful than deterministic forecasts (Fritsch et al. 1998). Nevertheless, the deterministic forecast is the more traditional product and is still preferred by many forecasters and users who require a “best estimate” of the future weather. The *ensemble mean* represents a consensus deterministic forecast. It is best used in the short range (a few days) as it can give misleading results in the case of bifurcation of the ensemble forecasts.

Ensemble averaging of gridded NWP QPFs was done by Speer and Leslie (1997) and Du et al. (1997), who found that ensemble mean forecasts of precipitation were superior to individual model forecasts, both in terms of rainfall amount and distribution. Vislocky and Fritsch (1995) stressed the need to adopt a strategy of combining available forecast products rather than relying on a single most superior product. Recently Krishnamurti et al. (1999, 2000) demonstrated the success of this approach for both weather and seasonal climate forecasts, using poor man's ensembles (called superensembles in their reports) of three to eight models from operational and research centers. They used a multiple regression technique to determine optimal weights for combining the ensemble members, based on a training dataset. For an independent dataset the superensemble demonstrated a marked improvement over the individual models and the ensemble mean in root-mean-square errors of forecast monthly precipitation, daily and monthly 850-hPa meridional wind, and hurricane track and intensity.

McBride and Ebert (2000) recently examined the ability of several NWP models to forecast 24-h precipitation amounts over Australia. The differences in model formulation resulted in each model having its own relative strengths and weaknesses, with no model clearly superior to the others in all regions and seasons. It seemed likely, and initial tests confirmed, that the consensus (ensemble mean) precipitation forecast from these seven models would have lower errors than the individual models. It was less clear whether an ensemble of only seven models could provide useful probabilistic rain forecasts. The purpose of this paper is to examine how well a poor man's ensemble of daily QPFs from these seven operational models is able to predict the precipitation in Australia, both in a probabilistic and a deterministic sense.

The next section presents the forecast and verifying data. Section 3 describes the verification strategy that, in addition to the standard accuracy and skill measures, includes a method for evaluating the ability of QPFs to predict the location and magnitude of individual rain systems. Section 4 investigates the ability of the poor man's ensemble to produce useful forecasts of the probability of precipitation. The skill of the ensemble mean relative to that of the individual QPFs is described in section 5. Refinements and alternatives to the ensemble mean are explored in section 6, and section 7 shows an example of these ensemble forecasts for a heavy rain event over Australia. The dependence of the ensemble results on the number and spread of the members are investigated in sections 8 and 9. The paper finishes with a summary and conclusions.

## 2. Forecast and verification data

Starting in the mid-1990s the World Climate Research Programme (WCRP) Working Group on Numerical Experimentation (WGNE) began a program of verifying QPFs from operational NWP models over three countries for which high quality verification data were available, namely, the United States, Germany, and Australia (WCRP 1997). Participating operational centers agreed to make their NWP output available for this investigation. This study utilizes model QPFs over Australia collected as part of the WGNE study. The seven operational NWP models used here (Table 1) are the same as those verified by McBride and Ebert (2000), but the period of verification is extended to 28 months, from September 1997 through December 1999. The reader is referred to the official documentation of the relevant operational centers for descriptions of the models. In this study we use 24-h QPFs from the first two days of the model integrations initialized at 0000 UTC. About 7% of the QPFs were missing, usually due to archival failure. This study uses only days on which all seven models were present. All model QPFs were remapped onto a 1° latitude–longitude grid using bilinear interpolation or averaging as appropriate.

The verification data come from the Australian operational daily rainfall analysis (Weymouth et al. 1999). This analysis is based on rain gauge measurements from up to 2000 real-time synoptic and telegraphic stations and nearly 4000 cooperative network sites. The observations are made at 0900 local time (2200–0100 UTC depending on location and daylight saving time), corresponding approximately to 0000 UTC. The error caused by the timing mismatch is usually negligible compared to QPF errors. A three-pass successive corrections scheme is used to map the observations to a 0.25° latitude–longitude grid over Australia. These analyses are averaged onto the 1° grid corresponding to that of the regridded QPFs, yielding 608 grid points in common each day.

The poor man's ensemble will have maximum value if the component QPFs are independent. Because all of the models show skill in predicting the observed rainfall pattern there will naturally be significant correlation between the various model QPFs, but so long as these are not close to unity we can expect the combined product to be superior to the individual QPFs, on average. Table 2 shows the daily averaged spatial correlation coefficients between model QPFs. These values range between about 0.38 and 0.76, with a mean daily spatial correlation coefficient of 0.59 for the 24-h forecasts and 0.54 for the 48-h forecasts.
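The daily spatial correlation referred to here is simply the Pearson correlation between two gridded QPFs over their common grid points. A minimal pure-Python sketch (the function name and flattened-list representation are illustrative, not the authors' code):

```python
from math import sqrt

def spatial_correlation(qpf_a, qpf_b):
    """Pearson correlation between two gridded QPFs, each flattened to a
    list of values over the common grid points."""
    n = len(qpf_a)
    mean_a = sum(qpf_a) / n
    mean_b = sum(qpf_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(qpf_a, qpf_b))
    var_a = sum((a - mean_a) ** 2 for a in qpf_a)
    var_b = sum((b - mean_b) ** 2 for b in qpf_b)
    return cov / sqrt(var_a * var_b)
```

Averaging this quantity over all days and model pairs gives the mean daily spatial correlations quoted in Table 2.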

## 3. Verification strategy

Verification of probabilistic forecasts often begins with a reliability diagram, which plots the observed frequency of an event occurring, stratified according to the predicted probability of that event. In our case the probability of rain at a given grid point is estimated as the fraction of ensemble members predicting rain, and the number of probability categories is therefore equal to the number of ensemble members plus one (i.e., for seven QPFs the possible probabilities are *P* = 0/7, 1/7, … , 7/7). Perfect forecasts will produce a curve that lies along the 1:1 line in the reliability diagram; the proximity of the curve to the 1:1 line indicates the reliability of the probabilistic forecast. As pointed out by Stanski et al. (1989) the curve in the reliability diagram is analogous to the bias in continuous forecasts (ratio of forecast to observed event frequency). If the curve lies below the 1:1 line then the events are overforecast (predicted probabilities too high), while a curve lying above the 1:1 line indicates underforecasting of events.
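The reliability-diagram statistics can be tabulated directly from the ensemble probabilities. A small sketch, assuming probabilities arrive as multiples of 1/*n* and observations as 0/1 flags (names are illustrative):

```python
def reliability_table(probs, occurred, n_members=7):
    """Observed relative frequency of rain in each forecast-probability
    category, for plotting against the predicted probabilities.

    probs    : forecast probabilities, each a multiple of 1/n_members
    occurred : 1 if rain was observed at that grid point, 0 otherwise
    """
    counts = [0] * (n_members + 1)  # forecasts falling in each category
    events = [0] * (n_members + 1)  # observed occurrences per category
    for p, o in zip(probs, occurred):
        k = round(p * n_members)    # category index 0 .. n_members
        counts[k] += 1
        events[k] += o
    # None marks categories that were never forecast
    return [e / c if c else None for e, c in zip(events, counts)]
```

Plotting these observed frequencies against the category probabilities, and comparing with the 1:1 line, reproduces the diagnosis described above.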

The overall accuracy of the probabilistic forecasts can be summarized by the (half) Brier score, PS, the mean squared error of the probability forecasts:

$$\mathrm{PS} = \frac{1}{N}\sum_{i=1}^{N}(p_i - o_i)^2,$$

where *N* is the number of forecasts of a binary event, *p*_{i} is the forecast probability of occurrence for the *i*th forecast, and *o*_{i} is the observed occurrence (1 if the event occurred, 0 if not). A perfect score for PS is 0. The Brier skill score (BSS) is the improvement in PS with respect to the same score for a reference forecast, PS_{ref} (here climatology):

$$\mathrm{BSS} = \frac{\mathrm{PS}_{\mathrm{ref}} - \mathrm{PS}}{\mathrm{PS}_{\mathrm{ref}}}.$$
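Both scores reduce to a few lines of code when the forecasts are pooled in space and time; an illustrative pure-Python sketch (not the authors' implementation):

```python
def brier_score(probs, occurred):
    """Half Brier score PS: mean squared error of probability forecasts.
    probs are in [0, 1]; occurred entries are 1 (event) or 0 (no event)."""
    n = len(probs)
    return sum((p - o) ** 2 for p, o in zip(probs, occurred)) / n

def brier_skill_score(ps, ps_ref):
    """BSS: fractional improvement of PS over a reference score,
    e.g. that of a climatological forecast. Positive values beat the
    reference; BSS = 0 matches it."""
    return (ps_ref - ps) / ps_ref
```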

Another frequently used approach for verifying probabilistic forecasts is the relative operating characteristic (ROC), which comes from signal detection theory and was first proposed for meteorological use by Mason (1982). It makes use of the familiar contingency table of predicted versus observed rain and no rain events. Recall that the members of the contingency table are hits, *h,* the frequency of correctly forecast rain occurrences; misses, *m,* the frequency of rain occurrences that were incorrectly forecast; false alarms, *f*, the frequency of rain forecasts where no rain occurred; and nonevents, *z,* the frequency of nonrain occurrences correctly forecast. As set out here, *h* + *m* + *f* + *z* = 1. The rain threshold is usually set at some very small value such as 0.2 mm day^{−1} for point observations or 1 mm day^{−1} for gridded data to eliminate dew and insignificant rain amounts. The threshold can be progressively increased to verify rain exceeding various amounts (Mesinger 1996; Buizza et al. 1999a).

The ROC is a curve of the hit rate, *h*/(*h* + *m*), versus the false alarm rate, *f*/(*f* + *z*), stratified by forecast probability. For a given forecast probability *P,* all forecasts with probability less than *P* are considered forecasts of nonevents, while all forecasts with probability greater than or equal to *P* are considered forecasts of events. The hit rate and false alarm rate are calculated from the contingency table for that value of *P.* This process is repeated for values of *P* between 0 and 1 to produce the ROC curve.

The area under the ROC curve indicates the skill of the probabilistic forecast. The ROC for a perfect set of forecasts will lie along the left axis (false alarm rate = 0), then along the upper horizontal axis (hit rate = 1), with an associated area of 1.0. Forecasts with no skill lie along the 1:1 line (false alarms and hits equally likely) and yield an ROC area of 0.5. According to Buizza et al. (1999a) a probabilistic forecast is considered “useful” if its ROC area is at least 0.7, while a value of 0.8 or more signifies a “good” prediction.
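The ROC construction described above, thresholding the forecast probability, tabulating the contingency table, and integrating the resulting curve, can be sketched as follows (function names and the trapezoidal integration are illustrative choices):

```python
def roc_points(probs, occurred, n_members=7):
    """(false alarm rate, hit rate) pairs for probability thresholds
    P = 0, 1/n, ..., 1, plus one threshold above 1 so the curve
    reaches the origin."""
    points = []
    for k in range(n_members + 2):
        thr = k / n_members
        h = sum(1 for p, o in zip(probs, occurred) if p >= thr and o == 1)
        m = sum(1 for p, o in zip(probs, occurred) if p < thr and o == 1)
        f = sum(1 for p, o in zip(probs, occurred) if p >= thr and o == 0)
        z = sum(1 for p, o in zip(probs, occurred) if p < thr and o == 0)
        hit_rate = h / (h + m) if h + m else 0.0
        false_alarm_rate = f / (f + z) if f + z else 0.0
        points.append((false_alarm_rate, hit_rate))
    return points

def roc_area(points):
    """Trapezoidal area under the ROC curve."""
    pts = sorted(points)  # increasing false alarm rate
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))
```

For a perfect set of forecasts the points collapse onto the left and top axes and the area is 1.0; random forecasts give an area near 0.5, as stated above.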

Objective verification of deterministic QPFs usually involves computing a number of continuous and categorical statistics. In the former category are the forecast and observed rain area, mean and maximum intensity, and rain volume, as well as the mean absolute error, root-mean-square (rms) error, and the correlation coefficient between predicted and observed (or analyzed) rainfall. Commonly used categorical statistics for deterministic forecasts include the bias score, probability of detection, false alarm ratio, and equitable threat score. These categorical statistics are computed from results pooled in space and time, which gives values more representative of the entire dataset than daily averaged values. The bias score, BIAS = (*h* + *f*)/(*h* + *m*), measures the ratio of forecast to observed rainfall frequency, indicating whether the QPF under- or overforecasts the occurrence of rain. The probability of detection, POD = *h*/(*h* + *m*), is the accuracy of the QPF in forecasting rain occurrence. The false alarm ratio, FAR = *f*/(*h* + *f*), measures the fraction of rain forecasts for which rain did not occur. The equitable threat score, ETS = (*zh* − *fm*)/[(*f* + *m*) + (*zh* − *fm*)], is an improved version of the well-known threat score TS = *h*/(*h* + *m* + *f*) that takes into account the number of hits that would occur due to random chance. A perfect forecast would yield categorical statistics of BIAS = 1, POD = 1, FAR = 0, and ETS = 1.
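These categorical scores follow directly from the relative frequencies *h*, *m*, *f*, *z*; a minimal sketch:

```python
def categorical_stats(h, m, f, z):
    """BIAS, POD, FAR, and ETS from contingency-table relative
    frequencies, with h + m + f + z = 1 as in the text."""
    bias = (h + f) / (h + m)                 # forecast/observed frequency
    pod = h / (h + m)                        # probability of detection
    far = f / (h + f)                        # false alarm ratio
    ets = (z * h - f * m) / ((f + m) + (z * h - f * m))  # equitable threat
    return bias, pod, far, ets
```

Note that with the frequencies normalized to sum to 1, the numerator *zh* − *fm* equals *h* minus the number of hits expected by chance, which is how the ETS discounts random agreement.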

These verification statistics give a broad picture of the ability of the forecast to correctly predict the occurrence and amount of rainfall over the domain of interest. When the domain is large, as is the case for Australia, there are usually several different weather systems occurring simultaneously on any given day. Using domain-wide statistics alone it is often difficult to get a clear picture of where the forecast is going wrong. For example, imagine a scenario in which the model underforecasts the amount of rain falling in a weather system in one part of the domain, but overforecasts the rain falling in another system in a different part of the domain. Systematic errors of this nature would be difficult to detect because, although the rms errors and correlation coefficient would indicate that the model was in error, the mean rain area and volume might be approximately correct when averaged over the entire domain. The usual approach to addressing this problem is to divide the domain into smaller subdomains and verify the rain predicted within each subdomain (e.g., Junker et al. 1992; McBride and Ebert 2000).

Ebert and McBride (2000) recently proposed an alternative approach that focuses on weather systems, as opposed to separate regions. For each rain system they examined the union of the set of forecast and analyzed (observed) rain fields and defined this as a “contiguous rain area” (CRA). They found a threshold isohyet of 5 mm day^{−1} to be the most useful value for evaluating model 24-h NWP QPFs; other thresholds may be better suited to different space scales and timescales. A CRA is shown schematically in Fig. 1. The displacement of the forecast from the observed rain is indicated by the arrow. This displacement is estimated using pattern matching, where the forecast is horizontally translated over the observations until the total squared error is minimized over the set of all points in the CRA before and after translation. A CRA may contain only observed or only forecast rain; in this case the displacement is meaningless. The displacement distance is a useful verification quantity since perhaps the most important aspect of a good rainfall forecast is a correct forecast of its location.
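The pattern-matching step can be illustrated with a brute-force search over integer grid shifts. This is a simplified sketch of the idea; the actual CRA method restricts the error sum to the points in the CRA before and after translation, whereas this toy version sums over a rectangular window:

```python
def best_shift(forecast, observed, max_shift=3):
    """Integer (di, dj) grid shift of the forecast that minimizes the
    total squared error against the observations.
    forecast, observed : 2-D lists of equal shape (rain amounts, mm/day)
    """
    ni, nj = len(observed), len(observed[0])
    best, best_err = (0, 0), float("inf")
    for di in range(-max_shift, max_shift + 1):
        for dj in range(-max_shift, max_shift + 1):
            err = 0.0
            for i in range(ni):
                for j in range(nj):
                    si, sj = i - di, j - dj  # source cell of shifted fcst
                    fval = (forecast[si][sj]
                            if 0 <= si < ni and 0 <= sj < nj else 0.0)
                    err += (fval - observed[i][j]) ** 2
            if err < best_err:
                best_err, best = err, (di, dj)
    return best
```

For example, if the forecast places a rain maximum one grid cell northwest of where it was observed, the search returns the one-cell shift that aligns the two patterns.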

The determination of location error is similar in concept to the distortion representation of forecast errors proposed by Hoffman et al. (1995) and demonstrated by Du et al. (2000). In their approach, as in the CRA approach, the forecast error is a combination of errors in location, mean amplitude, and residual small-scale variation. If the displacement and amplitude errors are assumed to be constant over the domain, then one can estimate the location error using pattern matching and the amplitude error by computing the ratio of the forecast and observed fields. They found that similar results could be obtained by minimizing the rms difference or maximizing the correlation coefficient between the forecast and observations. The CRA method differs from that of Hoffman et al. (1995) primarily in the definition of the domain. The latter uses a rectangular domain, whereas Ebert and McBride (2000) confine their domain to only those grid points including the storm of interest in order to exclude all unrelated rainfall.

In order to obtain reliable estimates of pattern displacement in cases where the observed rain field is incomplete (e.g., when a rain system crosses the coast), Ebert and McBride (2000) found that it was necessary to have a sufficiently large area of observed rain (≥20 grid points) and the pattern correlation of the translated forecast with the observations must exceed 0 at the 95% significance level. Only CRAs satisfying these conditions have been included in the calculation of model displacement errors. Further, to eliminate insignificant storms CRAs were required to have a total rain volume of at least 0.25 km^{3} (∼10 mm rain over a 1.5° square area).

Ebert and McBride (2000) also applied the notion of hits, misses, and false alarms to the correct forecast of rain *events,* represented by CRAs in the daily rainfall verification. A contingency table for event verification is shown in Fig. 2. A hit constitutes a good rain forecast in which both the location and magnitude of the rain system are well predicted. The precise specifications of “well predicted” are arbitrary, and would depend on the space scales and timescales of the forecast. Ebert and McBride considered that a good daily rain forecast would have a CRA displacement of less than one effective radius of the observed system (but not more than 500 km). They considered a good forecast of rain magnitude to have a difference of no more than one rain category between the forecast and observed maximum rain rates, where the categories were given as 1–2, 2–5, 5–10, 10–25, 25–50, 50–100, 100–150, 150–200, and >200 mm day^{−1}. The *event hit rate* measures the frequency of correctly forecast events and is an intuitive measure of a model's ability to make useful rain forecasts.
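The event classification can be sketched as below. This is a simplified version that ignores the additional over/underestimate cells of Fig. 2; the category bounds follow the list above, and the function name is illustrative:

```python
def classify_event(displacement_km, effective_radius_km, fcst_max, obs_max):
    """Classify a CRA event: the location is good if the displacement is
    less than the effective radius of the observed system (capped at
    500 km); the magnitude is good if the forecast and observed maximum
    rain rates fall within one category of each other."""
    bounds = [1, 2, 5, 10, 25, 50, 100, 150, 200]  # category lower bounds

    def category(x):
        return sum(1 for b in bounds if x >= b)

    good_location = displacement_km < min(effective_radius_km, 500.0)
    good_magnitude = abs(category(fcst_max) - category(obs_max)) <= 1
    if good_location and good_magnitude:
        return "hit"
    if good_magnitude:
        return "missed location"
    return "missed magnitude"  # over- or underestimated event
```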

## 4. Probabilistic rain forecasts

Recent studies on using ensembles to predict the probability of precipitation used fairly large ensembles (≥25 members) in order to try to represent the full range of likely scenarios and derive better estimates of rainfall probability (Du et al. 1997; Buizza et al. 1999a). In this section we examine whether our poor man's ensemble of seven members contains useful information for estimating the probability of rainfall.

The POP was estimated at each grid point as the proportion of model QPFs predicting rain at or above a given threshold. No attempt was made to give greater weight to models with greater skill or to calibrate the POP forecasts (e.g., Hamill and Colucci 1997). The rain threshold was varied from 1 to 50 mm day^{−1} to determine the ensemble's predictive ability for progressively higher rainfall.
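The POP estimate itself is a one-liner; a sketch, assuming one QPF value per model at the grid point:

```python
def pop(qpfs, threshold=1.0):
    """Probability of precipitation at a grid point: the fraction of
    ensemble members forecasting rain at or above the threshold
    (mm/day). No weighting or calibration is applied."""
    return sum(1 for q in qpfs if q >= threshold) / len(qpfs)
```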

Figure 3 shows a reliability diagram for the 24- and 48-h poor man's ensembles, for a rain threshold of 1 mm day^{−1}. Both ensembles had a tendency to overforecast rain frequency, with the ensemble of 24-h forecasts showing greater reliability. The ensemble of 48-h forecasts had a strong tendency to overpredict rain frequency at high probabilities but underpredict rain at low probabilities, indicating lower predictive skill as the curve tends toward a horizontal line.

The Brier skill score is plotted as a function of increasing rain threshold in Fig. 4. The BSS was computed using all forecasts pooled in space and time to ensure greater stability, particularly at high thresholds. In practice the daily averaged values of BSS are a few percent lower than the pooled values. The poor man's ensemble of 24-h forecasts has predictive skill for daily rainfall up to and exceeding 50 mm day^{−1}. The Brier skill score is greatest when the rain threshold is lowest (all rain included) and decreases as progressively higher rain rates are isolated. The ensemble of 48-h forecasts had much less predictive skill and was unable to beat climatology (BSS = 0).

It is interesting to compare the probabilistic skill of the poor man's ensemble with that of an ensemble prediction system. Following the example of Mullen and Buizza (2001), probabilistic forecasts from the 51-member ECMWF EPS initialized at 1200 UTC were verified for the summer and winter seasons over Australia. Table 3 shows the Brier skill scores for both systems. During the warm season the poor man's 24-h ensemble achieved BSS values of 0.22 for a 1 mm day^{−1} rain threshold and 0.10 for rain exceeding 10 mm day^{−1}. The ECMWF 36-h ensemble did not perform as well, giving BSS values of 0.12 and −0.09 for 1 and 10 mm day^{−1} thresholds, respectively. These values are lower than those reported by Mullen and Buizza (2001) for the U.S. warm season. However, by the second day of the forecast the skill of the poor man's ensemble was less than climatology, while the scores for the ECMWF EPS were similar to its day one values. Cool season Brier skill scores over Australia were higher for both ensembles. For 1 and 10 mm day^{−1} thresholds the BSSs were 0.50 and 0.44 for the 24-h poor man's ensemble and 0.36 and 0.27 for the 36-h ECMWF EPS. Although we cannot verify 24-h forecasts from the ECMWF EPS, our results, and also those of Mullen and Buizza (2001), suggest that they are usually slightly poorer than the 36- and 60-h forecasts due to insufficient spread of the ensemble. The poor man's ensemble of independent model QPFs appears to have greater probabilistic skill *for 1-day forecasts* than the larger ensemble of QPFs from the ECMWF EPS.

The mean daily ROC area is plotted in Fig. 5. To give an idea of the variability in skill among daily forecasts the vertical bars accompanying the 24-h ensemble values indicate ± one standard deviation about the mean (daily standard deviations were similar for the 48-h ensemble). Using a value of 0.8 as a standard for a good probabilistic forecast, Fig. 5 shows that on a daily basis the 24-h poor man's ensemble usually meets this standard for all but the heaviest rainfall occurrences. The 48-h ensemble has useful skill only for the lower rain rates.

## 5. Verification of individual and ensemble mean QPFs

Domain-wide and CRA rainfall verification of the model QPFs was first performed to establish some baseline statistics against which ensemble QPFs can later be compared. Figure 6 gives verification statistics over the Australian domain for seven model QPFs and the ensemble mean QPF, denoted AVG. The first three diagrams show predicted and observed rain area, rain volume, and mean rain intensity. The models show varying abilities to reproduce the mean values of the observations. However, most models overestimate rain area and volume and underestimate mean rain intensity. As expected, averaging of several rain forecasts has the effect of increasing the rain area and reducing the mean rain intensity. The rain volume predicted by the ensemble mean is equal to the average of all of the model values and is greater than the observed value.

For the first 24-h period the mean daily model rms error averaged 4.3 mm day^{−1}, or about two-thirds of the mean rain intensity. For the 24-h AVG forecast the mean rms error was 3.4 mm day^{−1}. Theoretically, the error variance of the ensemble mean of *n* independent, equally likely, forecasts should be 0.5(1 + *n*^{−1}) times the mean error variance of the forecasts (Leith 1974). Taking the square root, we can expect the rms error of the ensemble average to have a magnitude of about 75% of the mean model rms error when seven models are combined. The rms error for the AVG forecast is in good agreement with the theoretically predicted value of 3.2 mm day^{−1}.
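Leith's (1974) factor can be checked numerically; for *n* = 7 the expected ratio of ensemble-mean rms error to mean member rms error is the square root of 0.5(1 + 1/7), about 0.76, consistent with the 3.2 mm day^{−1} quoted above:

```python
from math import sqrt

def ensemble_mean_rms_factor(n):
    """Leith (1974): the error variance of the mean of n independent,
    equally likely forecasts is 0.5 * (1 + 1/n) times the mean member
    error variance; the rms ratio is its square root."""
    return sqrt(0.5 * (1.0 + 1.0 / n))

factor = ensemble_mean_rms_factor(7)  # about 0.756 for seven members
expected_rms = 4.3 * factor           # about 3.2 mm/day
```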

Figures 6e and 6f show improvements in the correlation coefficient and equitable threat score for the ensemble mean as compared to the individual models, for both the 24- and 48-h forecast periods. To assess the significance of the improvement in equitable threat score, Hamill (1999) recommended a resampling methodology that tests the hypothesis that the difference in ETS between competing forecasts is zero. Using this methodology the ETS for the ensemble mean was found to be significantly better than all of the individual model ETS values at the 0.01 significance level.

CRA verification statistics are shown in Fig. 7. The number of CRAs verified for each model ranged from 242 to 298, as the success of the individual models in forecasting particular storms also varied. A total of 323 CRAs were verified for the 24-h ensemble mean forecast and 291 for the 48-h ensemble mean. Because each model forecast a slightly different set of storms, it was not possible to use Hamill's (1999) resampling methodology to measure the statistical significance of the CRA results. Instead, the uncertainties in the mean CRA verification statistics were estimated using a bootstrapping method: 20% of the daily results were withdrawn at random and the mean statistics recomputed, with the standard deviation of the mean verification statistics then computed over 30 independent withdrawals. The uncertainties estimated in this manner were less than 3% of the mean values.
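The withdrawal procedure described here can be sketched as follows (the random seed and function name are illustrative):

```python
import random

def withdrawal_uncertainty(daily_stats, frac=0.2, n_draws=30, seed=0):
    """Standard deviation of the mean statistic when `frac` of the
    daily values are withdrawn at random, repeated n_draws times."""
    rng = random.Random(seed)
    n_keep = round(len(daily_stats) * (1.0 - frac))
    means = []
    for _ in range(n_draws):
        kept = rng.sample(daily_stats, n_keep)  # withdraw frac of days
        means.append(sum(kept) / len(kept))
    mu = sum(means) / len(means)
    var = sum((m - mu) ** 2 for m in means) / len(means)
    return var ** 0.5
```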

The average displacement of the forecast rain patterns from the observed locations was roughly 225–300 km for the 24-h forecasts and 260–320 km for the 48-h forecasts. Only one of the models produced maximum rain rates that were within 20% of the observed mean value of 56 mm day^{−1}. These attempts to predict realistic maximum rain rates were costly in terms of rms errors since it is difficult to predict correctly where rain maxima will occur. It can be seen in Figs. 6c and 6d that the most conservative QPFs also had the lowest rms errors.

The CRA verification for the AVG QPF shows substantial improvements over the individual QPFs. Notably, the average displacement distance of the AVG QPF was 178 km, approximately 30% less than the mean displacement distance for individual QPFs. The AVG QPF quite clearly provides a much improved estimate of the rainfall location.

The model hit rates for rain events were 0.59–0.68 for the first 24-h period and somewhat smaller, 0.53–0.62, for the second day. The event hit rate for AVG was higher than for any of the individual model forecasts, with values of 0.75 and 0.62 for the 24- and 48-h forecasts, respectively. This improvement comes about almost entirely through a reduction of location errors. For the model 24-h forecasts an average of 26% of CRAs had appropriate rain intensity but fell into the category of “missed locations” (see Fig. 2). The frequency of this type of error was only 11% for the ensemble mean forecast. On the other hand, the frequency of significantly underestimated events averaged 8% for the models and 12% for 24-h AVG forecasts.

## 6. Alternatives to the ensemble mean

The previous section demonstrated that the simplest combination of model QPFs, the arithmetic mean, yields a forecast rain field that is more accurate than any of the individual estimates. This section explores several alternatives to the simple ensemble mean in an effort to further improve the deterministic forecast. This question has received little attention in the literature, for two reasons. First, the majority of ensemble studies have focused on the prediction of 500-hPa height fields, where the ensemble mean has a stronger resemblance to actual height fields than is the case for rainfall. Second, most studies of precipitation ensembles have concentrated on probabilistic forecasts (Eckel and Walters 1998; Buizza et al. 1999a; Mullen and Buizza 2001). Du et al. (1997) verified ensemble mean precipitation forecasts but did not attempt to correct them in any way. It would be particularly desirable to correct for the large bias in rain area and underestimation of maximum rainfall that result from the averaging process.

### a. Weighted averaging

It was seen in Figs. 6 and 7 that some models had more skill than others at predicting rainfall. It is likely that giving greater weight to the more skillful forecasts will result in a more accurate ensemble mean. This concept was demonstrated by Van den Dool and Rukhovets (1994), who showed that better extended-range forecasts of 500-hPa height fields could be made when multiple forecasts were optimally weighted. Krishnamurti et al. (1999, 2000) used the same approach for 850-hPa meridional wind forecasts, but their optimization was carried out at every grid point. In their multiple linear regression formulation the weights were derived so as to minimize the rms error of the ensemble mean, accounting for the error variances of the individual forecasts as well as their interforecast covariances. Following the example of Van den Dool and Rukhovets (1994), monthly weighting coefficients were derived for the model QPFs from the previous two months of verification results (to account for changing seasonal behavior). The results indeed gave greater weight to the more accurate models, but some models were assigned negative weight. This approach was rejected as it produced some areas of negative rainfall.

Another approach used frequently in the combination of gridded rainfall analyses is weighting according to the inverse of the expected error variance (Xie and Arkin 1996; Huffman et al. 1997). Using this strategy a weighted average QPF, denoted AVG_{w}, was produced by weighting each model QPF by the inverse of its mean domain-wide error variance over the prior 2-month period. The relative weights for the seven models ranged between 0.07 and 0.22.
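The inverse-error-variance weighting can be sketched as follows (a minimal version; function names and array shapes are ours):

```python
import numpy as np

def inverse_variance_weights(error_variances):
    """Weights proportional to 1 / error variance, normalized to sum to 1.

    error_variances: per-model mean squared QPF error over a prior
    training period (here, hypothetically, the previous two months).
    """
    w = 1.0 / np.asarray(error_variances, dtype=float)
    return w / w.sum()

def weighted_mean_qpf(qpfs, weights):
    """Weighted ensemble mean; qpfs has shape (n_models, ny, nx)."""
    return np.tensordot(weights, qpfs, axes=1)
```

A model with twice the error variance of another thus receives half its weight, matching the modest 0.07–0.22 range of relative weights found for the seven models.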

Table 4 gives the verification results for this experiment as well as the other variations on the ensemble mean. It can be seen that AVG_{w} performed almost the same as the ensemble mean. The rain area and volume were slightly different but the remainder of the statistics were nearly identical. The reason for the failure of weighted averaging to produce a better combined QPF appears to lie in the difficulty of predicting the performance of the models relative to each other on a given day, based on prior performance. The 2-month mean model rms error is not representative of the highly variable daily accuracy. Even weights based on the *prior day's* performance, an intuitive strategy practiced by forecasters, did not yield better results.

### b. Median forecasts

In producing a deterministic forecast of precipitation from an ensemble of QPFs there is an additional consideration that does not occur with other fields such as geopotential height or temperature, and that relates to the bimodal nature of the rainfall distribution. Observed daily rainfall is either zero (the usual case at most locations) or distributed lognormally when rain occurs; forecast rainfall shows a similar distribution. Therefore, the ensemble of precipitation forecasts at a grid point would not be expected to resemble a Gaussian distribution, particularly when some members predict rain and others predict none. In the case of a bimodal forecast rain distribution, one can argue that the ensemble mean is not representative of the ensemble, and that the median value may be more appropriate since it is sampled from the rain distribution.

The ensemble median precipitation, denoted MED, is determined independently for every grid point. The median should provide results that are qualitatively similar to the ensemble mean, but without the excess area of low rain rates that results from averaging a minority of positive rain rates with several “zero” forecasts. A disadvantage of the ensemble median forecast is that at grid points where a minority of models predict no rain, the resulting values will be selected from the low side of the (nonzero) rainfall distribution.

To address this issue one can use polling to determine forecast rain occurrence, where rain is forecast only where at least half of the ensemble members forecast rain (i.e., median rain rate of at least 1 mm day^{−1}). The rain rate is then estimated as the mean of those ensemble members that forecast rain. This “majority rules” (MAJ) forecast will have rain area identical to the median forecast, but its intensities will usually be greater since only nonzero rain rates are used to estimate the ensemble rain rate.
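The MED and MAJ forecasts can be sketched at the grid-point level (a minimal version; the array conventions are ours):

```python
import numpy as np

RAIN_THRESHOLD = 1.0  # mm/day; "rain occurrence" threshold used in the text

def median_forecast(qpfs):
    """MED: grid-point median of all ensemble members.
    qpfs has shape (n_models, ny, nx)."""
    return np.median(qpfs, axis=0)

def majority_rules_forecast(qpfs, thresh=RAIN_THRESHOLD):
    """MAJ: rain only where at least half the members forecast rain
    (median >= thresh); the rate is the mean of the raining members only."""
    med = np.median(qpfs, axis=0)
    raining = qpfs >= thresh
    n_rain = raining.sum(axis=0)
    rain_sum = np.where(raining, qpfs, 0.0).sum(axis=0)
    # mean over raining members, guarding against division by zero
    return np.where((med >= thresh) & (n_rain > 0),
                    rain_sum / np.maximum(n_rain, 1), 0.0)
```

By construction the two forecasts rain at exactly the same grid points, but MAJ assigns higher rates because the zero forecasts are excluded from its average.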

Verification of MED and MAJ showed little difference between the two except for rain intensity and volume. MED indeed had small bias in rain area, but its rain volume and intensity were both much lower than observed. The mean rain rates of MAJ were close to the observed values but the rain volume was too great. The correlation coefficients and ETSs were similar to those of the other ensemble products, while the CRA verification statistics resembled those of the simple ensemble mean. These results show that a median approach gives a good delineation of rain boundaries but does not correct the underestimated maximum rain intensities and therefore produces no improvement in the event hit rate.

### c. Bias reduction

It was shown that averaging several QPFs results in a rain field with the rain maxima located more accurately (on average) than in any individual QPF, but with markedly reduced maximum rain rates and unrealistically large rain extent. It would be desirable to transform the rain rates to produce a more realistic rain field with reduced rain area and intensified maximum rain rates. One way to do this is to eliminate the rain at grid points with the lowest rain accumulations (corresponding to the periphery of the rain systems), then scale the remaining rain rates such that the adjusted rain system boundaries again have low rain rates, the highest rain rates are increased, and the total rain volume is conserved.

The forecast rain area, *A,* is assumed to be too great by a bias factor, *b,* while the volume of rainfall in the combined forecast, *V,* is assumed to be approximately correct. If *f*(*r*) represents the frequency distribution of forecast rainfall rates, *r,* over the domain, then the area and volume of forecast rainfall can be written as

$$A = \int_0^{r_{\max}} f(r)\,dr, \qquad V = \int_0^{r_{\max}} r\,f(r)\,dr, \quad (5)$$

where *r*_{max} is the maximum rain rate in the ensemble mean forecast. The rain area is corrected for the expected bias by dividing by the factor *b,*

$$A_t = \frac{A}{b} = \int_{r_t}^{r_{\max}} f(r)\,dr, \quad (6)$$

where *A*_{t} is the bias-corrected rain area and *r*_{t} is a threshold rain rate that, when set as the lower boundary of the integral in (5), yields the corrected rain area. The excess rain area is eliminated by setting all rain rates less than *r*_{t} to zero.

A linear transformation of the remaining rain rates, *r*′ = *a*(*r* − *r*_{t}), where *a* has a value greater than 1, preserves the relative shape of the rain rate distribution beyond *r* = *r*_{t}. The rain volume can be rewritten as

$$V = \int_{r_t}^{r_{\max}} a\,(r - r_t)\,f(r)\,dr, \quad (7)$$

which can be solved for *a.* Conservation of rain volume implies that the new maximum rain rate, *r*′_{max} = *a*(*r*_{max} − *r*_{t}), is greater than *r*_{max}.

This transformation depends on knowledge of the area bias, *b,* in the combined QPF. Figure 8 shows the monthly averaged bias scores for the 24-h ensemble mean. There are two regimes visible in the time series. In the cool season (June–September) the rain in Australia falls mainly in the southern part of the continent and is associated with large-scale synoptic systems. The models are reasonably skilled at predicting rain associated with these large systems, and the bias of the combined product is only about 1.1. During the warmer seasons (October–May) a greater proportion of rain falls in convective systems, with intense tropical rain systems characterizing the northern Australian summer. There is much greater model disagreement on the magnitude and location of warm season rain, with the result that the bias in the ensemble mean is about 1.6.
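The area thresholding and volume-conserving rescaling of (5)–(7) can be sketched numerically on a gridded field. This is a minimal discrete version; the function name and the handling of ties at the threshold are ours:

```python
import numpy as np

def bias_correct(mean_qpf, b, rain_thresh=1.0):
    """Shrink the rain area of an ensemble-mean field by the bias
    factor b, then linearly rescale the surviving rates so that the
    total rain volume is conserved (discrete analog of Eqs. 5-7)."""
    r = np.asarray(mean_qpf, dtype=float)
    raining = r >= rain_thresh
    area = raining.sum()
    volume = r[raining].sum()
    # threshold r_t such that the corrected area is approximately A / b
    target_area = max(int(round(area / b)), 1)
    r_t = np.sort(r[raining])[::-1][target_area - 1]  # target_area-th largest rate
    keep = r >= r_t
    # rescale r' = a (r - r_t), with a chosen to conserve rain volume
    a = volume / (r[keep] - r_t).sum()
    return np.where(keep, a * (r - r_t), 0.0)
```

Because the kept rates are measured from *r*_{t} before rescaling, the maximum rate necessarily increases when *b* > 1, as the text describes.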

These bias factors were used in (5)–(7) to create a bias-corrected ensemble mean, AVG_{t}, where the subscript t indicates that the rain rates have been transformed. Verification of the AVG_{t} product (Table 4) shows that the bias reduction indeed corrected the overestimation of the rain area. The equitable threat score was higher for AVG_{t} than for AVG due to a reduction in false alarms. However, because AVG_{t} is constrained to conserve rain volume and the volume was overestimated by AVG, the resulting mean rain intensity was too great, also contributing to an increase in rms error. The mean CRA displacement distance was greater for AVG_{t} than for AVG, with the result that the event hit rate decreased to 0.71 for the 24-h forecast and 0.59 for the 48-h forecast.

### d. Probability matching

Probability matching is an approach that can be used to blend data types with different spatial and temporal properties, where usually one data type gives a better spatial representation while the other data type has greater accuracy. The method works by setting the probability distribution function (PDF) of the less accurate data equal to that of the more accurate data. Examples are the blending of radar and rain gauge observations (Rosenfeld et al. 1993), or geostationary and polar satellite rain estimates (Anagnostou et al. 1999). In our case, we hypothesize that the most likely spatial representation of the rain field is given by the ensemble mean, while the best frequency distribution of rain rates is given by the ensemble of model QPFs.

To obtain the probability matched forecast, PM, the model rain rate PDF is first calculated by pooling the forecast rain rates for all *n* models for the entire domain, ranking them in order of greatest to smallest, and keeping every *n*th value. The rain rates in the ensemble mean forecast are similarly ranked from greatest to smallest, with the location of each value stored along with its rank. The grid point with the highest rain rate in the ensemble mean rain field is reassigned to the highest value in the model rain rate distribution, and so on. The rain area is constrained not to exceed the mean rain area predicted by the models.
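The ranking-and-reassignment step can be sketched as follows. This is a minimal version on flattened grids (the function name is ours), and it omits the additional rain-area cap mentioned in the text:

```python
import numpy as np

def probability_match(mean_qpf, model_qpfs):
    """Reassign ensemble-mean rain rates using the pooled model
    rain-rate distribution, keeping every n-th ranked value.

    mean_qpf: (n_points,) ensemble-mean rates; model_qpfs: (n, n_points).
    """
    n = model_qpfs.shape[0]
    # pooled model rates, ranked greatest to smallest; every n-th value
    # gives a distribution with exactly one value per grid point
    pooled = np.sort(model_qpfs.ravel())[::-1][::n]
    # grid points ranked by ensemble-mean rate, greatest to smallest
    order = np.argsort(mean_qpf)[::-1]
    out = np.empty_like(np.asarray(mean_qpf, dtype=float))
    out[order] = pooled
    return out
```

The output field has the spatial pattern of the ensemble mean but the amplitude distribution of the pooled model QPFs, which is why its maximum rates are far more realistic than those of AVG.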

Table 4 shows that the reduction in the rain area led to improved estimates of rain volume as well as an improved equitable threat score when compared to AVG. However, the location errors were greater for PM than for AVG. Because PM samples from the entire distribution of rain rates (and none of the models systematically overestimated maximum rain rate), its maximum rain rates were closer to the observed values than any of the other ensemble products. As a result, the frequency of underestimated events was only 5% and the event hit rate was 0.81 for 24-h forecasts and 0.70 for 48-h forecasts. The PM had the highest event hit rates of all ensemble products and therefore is probably the most useful deterministic ensemble rainfall forecast for forecasters.

## 7. An example of an ensemble forecast

Figure 9 shows an example of 24-h QPFs for 21 April 1998 from the seven operational models, the ensemble mean (AVG), the weighted ensemble mean (AVG_{w}), the median forecast (MED), the bias-corrected ensemble mean (AVG_{t}), and the probability matched forecast (PM). POP estimates for rainfall exceeding 1 and 20 mm day^{−1} are shown in Fig. 10. The verifying rainfall analysis is shown in Fig. 11. On this day an extensive area of heavy rain extended from northwestern to southeastern Australia associated with a “northwest cloudband” (ascent of warm moist air from the northwest, often bringing precipitation to the interior of Australia). The maximum analyzed rainfall was 83 mm day^{−1}.

All of the models correctly predicted the extensive rain area, but there was large variation in its predicted position and intensity. For example, the ECMWF model forecast the heavy rain to the north, while the UKMO model predicted heavy rain to the southeast of where it actually occurred. The USAVN model overestimated the heavy rain area, while the GASP and JMA models underestimated it (see Table 1 for definitions). In this case, as with most rain systems, the poor man's ensemble combines the individual forecasts with their various errors into probabilistic and deterministic rain forecasts that give more useful information than any single QPF.

The various deterministic ensemble forecasts generally gave a better representation of the location and shape of the rain field than any of the models. The ensemble mean (AVG and AVG_{w}) and median (MED) forecasts slightly overestimated the rain area and did not capture the area of rain ≥50 mm day^{−1}. The bias-corrected mean (AVG_{t}) slightly underestimated the rain area for this case but succeeded in predicting some rain over 50 mm day^{−1} in the correct location. The rain rate transformation used to derive AVG_{t} eliminated the area of lighter rain over Tasmania and the south coast of the continent. The probability matched forecast (PM) gave the best prediction of the size, shape, and location of the heavy rain region, as well as the light rain in the southeast.

The map of probability of precipitation for rain exceeding 1 mm day^{−1}, POP_{1} (Fig. 10a), shows an extensive area where the probability exceeds 0.50. This is slightly broader than the area of rain ≥1 mm day^{−1} in the analysis (Fig. 11). The observed region of heavy rain is found inside the area of POP_{1} = 1.0. In Fig. 10b the contour of POP_{20} = 0.50 provides an excellent estimate of the area of rain exceeding 20 mm day^{−1} in the analysis, and most of the heaviest rain is found inside the POP_{20} = 0.75 contour.

## 8. Dependence on number of members

The results shown thus far have made use of all seven models to create the ensemble QPFs. The skill and accuracy of the ensemble forecast would be expected to increase asymptotically to some limiting value as the number of input QPFs is increased, assuming that the accuracy of the QPFs being added is similar to those already included. In this section we investigate how many members are needed in the ensemble forecast in order to achieve close to the maximum skill and accuracy.

Two experiments were conducted to address this question. In the first, the number of 24-h QPFs in the ensemble mean was increased from one to seven with the members selected randomly. Ten realizations were run each day to guarantee stable verification statistics.
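The random-selection experiment can be sketched as follows (a minimal version verifying only the rms error; names and shapes are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_subensemble_rmse(qpfs, obs, k, n_trials=10):
    """RMS error of a k-member ensemble mean, averaged over n_trials
    random draws of members without replacement.

    qpfs: (n_models, n_points) forecasts; obs: (n_points,) analysis.
    """
    n = qpfs.shape[0]
    errs = []
    for _ in range(n_trials):
        members = rng.choice(n, size=k, replace=False)
        mean = qpfs[members].mean(axis=0)
        errs.append(np.sqrt(np.mean((mean - obs) ** 2)))
    return float(np.mean(errs))
```

Sweeping *k* from 1 to *n* with this kind of routine traces out the asymptotic error curves of Fig. 12.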

In the second experiment the 24-h QPFs were systematically added to the ensemble according to their overall skill at forecasting the occurrence of rain, as measured by the equitable threat score (Fig. 6f). In this case the “ensemble” of one member was identical to the model with the highest overall ETS, the ensemble of two members comprised the two models with the highest overall ETS scores, and so on. (It was seen in section 6a that producing an optimal combination of QPFs is not a trivial task. An ensemble mean whose members are selected according to ETS values is itself likely to have high values of ETS, but this does not guarantee that other verification statistics will be optimal. Different results would occur if the selection of ensemble members were based on the smallest rms errors or the highest event hit rates, for example.)

Figure 12 shows the dependence of selected ensemble mean verification results on the number of members in the ensemble. With random selection there is a decrease in the rms error and a monotonic increase in the ETS with increasing number of members. These quantities appear to asymptotically approach limiting values and are already close to those values when seven QPFs are included in the ensemble. These results agree with the findings of Du et al. (1997), who showed that an ensemble mean of 8–10 QPFs (from a single model) yielded 90% of the improvement in rms error that would be obtained by a 25-member ensemble. The CRA mean displacement and event hit rate also improve rapidly with increasing numbers of ensemble members. The CRA results were best for a six-member ensemble, although the differences between the six- and seven-member ensemble results are within the uncertainty of the verification.

The dependence of POP forecasts on the number of members in the ensemble is shown in Fig. 13. The upper panel shows the Brier skill score as a function of rain rate threshold, for ensembles varying in size from one to seven members, selected randomly. The lower panel shows the same for members selected by overall skill. It is seen in Fig. 13a that one randomly selected QPF does not provide a useful POP forecast, but an ensemble as small as two members has skill for light rainfall. An ensemble size of five shows skill out to a rain threshold of 50 mm day^{−1}. The BSS curves become closer together for progressively larger ensembles. An exponential fit to the BSS values at the basic rain threshold of 1 mm day^{−1} suggests an asymptotic limit of about 0.40. Eighty percent of this potential skill is achieved for an ensemble of seven members.

When the membership in the ensemble is increased in a more intelligent manner, that is, by starting with the most skilled models, the verification statistics approach their limiting values more quickly. This means that a smaller number of QPFs are needed to approach maximum skill and accuracy, provided that those QPFs are chosen well. Figure 12 shows that the increase in accuracy with increasing number of members is not monotonic when models are added systematically, as the particular characteristics of each model influence the ensemble mean characteristics. The ETS is nearly constant for ensembles of between two and six members and peaks for five members. The CRA displacement distance was best when the number of skill-selected members was five, while the event hit rate was greatest when all seven QPFs were included. The Brier skill score of the two-member skill-selected ensemble (Fig. 13b) is roughly the same as that of the three-member randomly chosen ensemble (Fig. 13a), and 75% of the potential skill for POP forecasts can be achieved with four well-chosen ensemble members. The BSS peaks for six skill-selected members.

These results show that while the ensemble size is small, the addition of QPFs to the ensemble increases its skill and accuracy. However, a point can be reached where the addition of QPFs with progressively poorer skill may actually degrade the ensemble product. In other words, more is not necessarily better!

## 9. Relationship between spread and skill

The spread of the ensemble, or dispersion, describes the breadth of the range of forecasts made by the EPS. Intuitively, the larger the spread, the greater the uncertainty of the forecast [although the results of Whitaker and Loughe (1998) and others suggest that this may be true only for extreme events]. For a good ensemble prediction the “truth” lies somewhere within the cloud of forecasts given by the ensemble members. Much of the research on ensemble prediction systems has concentrated on determining the perturbations to the initial conditions that will result in a dispersion that is large enough to capture the true conditions but not so large as to be useless for forecasting (Toth and Kalnay 1993; Buizza and Palmer 1995). A primary reason for the large size of the ECMWF and NCEP EPSs is the desire to capture most or all of the important growing modes in the dynamical system in order to get the best possible estimate of the forecast probability distribution function.

Hamill and Colucci (1998) found that the spread of the ensemble of QPFs from the Eta–regional spectral model was not a good predictor of forecast uncertainty. In this section we investigate whether the spread of the multimodel ensemble is better able to predict the forecast uncertainty. If the spread is related to forecast uncertainty then when the spread is small (low uncertainty) the ensemble mean should have high skill, and when the spread is large (high uncertainty) then the ensemble mean should have lower average skill (although some ensemble mean forecasts may be close to the observations due purely to chance). The usual measure of the strength of this relationship is the spread–skill correlation.

Two different approaches are investigated for estimating the spread–skill correlation, one that uses forecasts at grid points and one that uses the positions of the rain systems. First, following Ziehmann (2000) the spread of the ensemble at a grid point is calculated as the mean squared difference between the model QPFs and their ensemble mean, while the skill is the squared difference between the ensemble mean and the verifying rainfall analysis. The spread–skill correlation was calculated from the spread and skill values pooled in space and time. For the 24-h ensemble the spread–skill correlation was 0.31, while for the 48-h ensemble the correlation was 0.22. These values are too low to be very useful for predicting forecast uncertainty at grid points using the ensemble spread.
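The grid-point spread and skill measures just defined can be sketched as:

```python
import numpy as np

def spread_skill_correlation(qpfs, obs):
    """Grid-point spread-skill correlation following Ziehmann (2000):
    spread = mean squared difference of the members from their mean,
    skill  = squared difference of the ensemble mean from the analysis,
    correlated over all points pooled together.

    qpfs: (n_models, n_points); obs: (n_points,).
    """
    mean = qpfs.mean(axis=0)
    spread = ((qpfs - mean) ** 2).mean(axis=0).ravel()
    skill = ((mean - obs) ** 2).ravel()
    return float(np.corrcoef(spread, skill)[0, 1])
```

In the study this pooled correlation was 0.31 (24 h) and 0.22 (48 h), far below the value of 1.0 that a perfectly spread-calibrated ensemble would approach.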

The second technique investigates whether the spread in the positions of the forecast rain systems is related to the accuracy of the location of the rain system in the ensemble mean, and it parallels the work of Stensrud et al. (1999) for ensemble cyclone forecasts. For all CRAs in which there were at least two model QPFs predicting the same rain system (determined by matching the location and magnitude of the observed rain fields) the spread was calculated as the average distance between the locations of the forecast rain systems and their mean location, and the skill was the displacement error of the ensemble mean forecast from the observation. In this case the spread–skill correlation was 0.32 for the 24-h ensemble and 0.31 for the 48-h ensemble. Again, these correlations are too low to be practically useful, indicating that the spread in the location of rain system forecasts is not a good measure of the skill of the ensemble mean forecast.

Because rain events can be predicted or missed by the models, another factor to consider is the number of models predicting a particular event. Intuitively, the more models that predict a rain event, the greater the expected reliability of the ensemble forecast. Events captured by all of the models would tend to be strong, synoptically forced systems that are arguably easier to forecast. Using the CRA approach one can compute the frequency of the observed rain pattern being located within the cloud of model QPFs. This “cloud” can be represented by a polygon whose vertices are the displacements of the individual forecasts from the observed position. In our case the polygon would have between three and seven vertices, depending on how many models predicted the rain event. Another quantity of interest to forecasters is how frequently the actual maximum rain rate is within the range of the ensemble predictions.
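Testing whether the observed position lies inside the polygon of forecast displacements reduces to a standard point-in-polygon check, sketched below with ray casting. The vertices are assumed to be given in boundary order (in practice one would take the convex hull of the displacement vectors), and the observed position corresponds to the origin:

```python
def point_in_polygon(pt, vertices):
    """Ray-casting test: is pt inside the polygon whose vertices are
    the (dx, dy) displacements of the individual forecasts from the
    observed rain position? The observed position is pt = (0, 0)."""
    x, y = pt
    inside = False
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        # count crossings of a horizontal ray extending to the right of pt
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside
```

Applying this test to every matched CRA, stratified by the number of members predicting the event, yields the envelopment frequencies of Table 5.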

Table 5 shows the frequency of the observed rain being located within the cloud of ensemble members and the observed maximum rain rate falling within the range of ensemble predictions, as a function of the number of members predicting the rain system. The frequency of the ensemble enclosing the true rain location depends strongly on the number of ensemble members predicting that event. When only three 24-h QPFs predicted a rain event its observed location was found within the ensemble cloud 19% of the time, increasing to 80% for events predicted by all seven models. The ensemble size had less impact on the frequency of the observed maximum rain rate being within the range of the ensemble predictions, increasing from about 40% when three models predicted the event to about 60% when all seven models predicted the event. This is partly because all of the models tended to underestimate the maximum rain rate (see Fig. 7b). Only a small fraction of the events predicted by three models had both the position and magnitude of the rain enveloped by the ensemble, while more than half of the events predicted by all seven models at 24 h were enveloped by the ensemble rain locations and magnitudes. Interestingly, the frequencies of the 48-h forecasts enclosing the rain location and magnitude were only slightly less than for the 24-h forecasts (note, however, that there were fewer systems captured by the ensemble).

It appears that the number of models predicting a rain system gives a good indication of the likelihood of the ensemble to envelop the location and magnitude of that event. Further, when all members of the poor man's ensemble predict a given rain system there is a very high likelihood that it will be located within the cloud of the ensemble members and a reasonable likelihood that the maximum rain rate will also be within the range of the ensemble predictions.

## 10. Summary and conclusions

This study has demonstrated that a poor man's ensemble of seven independent operational NWP QPFs is capable of providing useful information on the probability of precipitation, as well as the magnitude and location of rain events. Twenty-eight months of 24- and 48-h model QPFs for rainfall over Australia were combined to produce probabilistic and deterministic ensemble precipitation forecasts at 1° spatial resolution. These were verified against daily analyses of rain gauge observations.

The probabilistic forecasts were evaluated using the Brier skill score and the relative operating characteristic. A poor man's ensemble of seven 24-h QPFs had useful predictive skill for rainfall up to and exceeding 50 mm day^{−1}. POP forecasts from an ensemble of 48-h forecasts showed skill only for low rain rates; for heavier rain they did not perform better than climatology. Although we do not have ensembles for forecast periods beyond 48 h, we suspect that the usefulness of POP forecasts from the poor man's ensemble may be limited to the short term, 1–2 days. This is not very surprising given the difficulty experienced by NWP models in making accurate rain forecasts.

For both forecast periods the ensemble mean (AVG) was more accurate and skillful than the individual models according to most measures. As expected, the averaging increased the spatial extent and reduced the rain intensity relative to the individual model forecasts, creating a smoother looking field. The mean rms error was 20% lower than the mean value for the model QPFs and was usually better than the best individual performer each day. The ETS values were also better for the ensemble mean because the improved rain detection outweighed the increase in the number of false alarms.

Several new measures were used to evaluate the ability of the models and the ensemble mean to predict individual *rain events.* In this study a rain event was defined as a contiguous rain area (CRA) in the forecast and/or observations enclosed by an isohyet of 5 mm day^{−1}. The error in the location of the forecast rain was determined by pattern matching, minimizing the total squared difference between the forecast and observed rain patterns. The mean location error for 24-h model forecasts was about 260 km, increasing to about 300 km for 48-h forecasts. The ensemble mean had 30% smaller location errors, 178 and 226 km for 24- and 48-h forecasts, respectively. The mean displacement of the AVG forecast was still substantial because the model QPFs were often displaced in similar directions. It is worth noting, however, that the ensemble mean of the 48-h QPFs had smaller location errors than all but one of the model QPFs at 24 h.

A forecast for a rain event was considered a “hit” if its displacement was small and its maximum rain rate was close to the observed value (see section 3 for details). The event hit rate for model 24-h QPFs ranged between 0.59 and 0.68, and increased to 0.75 for their ensemble mean. The improvement in event skill for the ensemble mean resulted entirely from an improved consensus location of the rain pattern. Twelve percent of ensemble mean forecasts had good rain location but strongly underestimated the maximum intensity.

Four alternatives to the ensemble mean were investigated in an effort to improve the deterministic forecast. The first used weighted averaging to give greater weight to the models with better performance, determined by their mean rms errors over the prior two-month period. Although weighted averaging has proved beneficial in other applications it failed to improve the mean QPF from the poor man's ensemble. The large variability in relative model accuracy and skill from day to day precluded an accurate determination of the weights, with the result that inappropriate weights were often applied. This failure of weighted averaging to improve the ensemble mean will apply to any ensemble in which the relative skill or accuracy of the members is not consistent or predictable.

The second strategy used the ensemble median instead of the mean to determine the rain area, then estimated the rain rate as either the median of all ensemble members (MED) or the “majority rules” mean of all nonzero ensemble members (MAJ). This strategy produced good forecasts of rain occurrence, but the median rain rates were systematically too low, resulting in poorer forecasts for rain systems. The majority rules approach used more appropriate rain rates, resulting in better rain intensity and volume.

A third strategy simultaneously reduced the bias in rain area and increased the magnitude of the maximum rain rate of the ensemble mean by transforming the rain rates. Using a cool season bias of 1.1 and a warm season bias of 1.6, a bias-corrected ensemble mean, AVG_{t}, was generated with rain area and intensity that were much closer to the observed values. Although the rms error for AVG_{t} increased relative to AVG, the equitable threat score improved as the rate of false alarms was reduced. Bias-correcting the mean field did not produce improvements in the CRA verification statistics because the mean displacement distance was larger for this forecast.

The fourth, and most successful, strategy used probability matching (PM) to reassign the rain rates in the ensemble mean using the rain rate distribution of the component QPFs. The PM forecast benefited from the improved rain location given by the ensemble mean, with more realistic rain rates taken directly from the model QPFs. In particular, the maximum rain rates were much higher than those of the other deterministic ensemble forecasts, and the bias in rain area was significantly reduced. The resulting event hit rate was 0.81 for 24-h forecasts and 0.70 for 48-h forecasts, making this the most useful of the deterministic products.

The dependence of the POP and ensemble mean results on the number of members included in the ensemble was investigated using the 24-h model QPFs. When ensemble members were selected randomly the Brier skill score and the domain verification statistics (rms error and ETS) improved monotonically with increasing ensemble size, with values close to their asymptotic limits for an ensemble size of seven. A similar result was also found by Buizza and Palmer (1998) for ensemble prediction of 500-hPa geopotential height anomalies. Interestingly, the CRA verification statistics peaked for an ensemble size of six. Although the differences in the displacement and event hit rate for six- and seven-member ensembles are not statistically significant, it is possible that further addition of ensemble members may blur, rather than clarify, the QPFs for rain events. Further testing with a larger ensemble is needed to confirm this.

When the ensemble members were selected in order of overall skill, with the model with the highest ETS being selected first, the verification statistics improved more rapidly with ensemble size and were not necessarily optimal for the seven-member ensemble. *This implies that an optimum selection of ensemble members is possible, beyond which the further addition of less skilled QPFs degrades the ensemble product.* The size and membership of the “optimum” ensemble depends on what verification statistic is used to measure success. The results for the deterministic forecasts (Fig. 12) would recommend an ensemble of five models, while the most useful probabilistic 24-h forecasts came from a six-member ensemble (Fig. 13b). We have not explored all possible combinations of ensemble members, and it is possible that other ensemble configurations could outperform the ones verified in this study. For example, Fig. 6f showed that some models had higher ETS scores at 48 h than other models had at 24 h. It is likely that a skill-selected ensemble selected from the larger set of 24- and 48-h QPFs would perform better than the best of the 24-h skill-selected ensembles.

The spread–skill correlations for the poor man's ensemble were computed both at grid points (the conventional method) and for the predicted location of rain systems. Both approaches yielded values of roughly 0.3 for 24- and 48-h ensemble forecasts, indicating that there is little ability to predict the uncertainty of the forecast from the size of the spread alone. However, the number of models capturing a rain event appears to be a good predictor of the ensemble's likelihood of enveloping the location and magnitude of the rain system. Rain events that were predicted by all seven models at 24 h had their locations enclosed by the ensemble cloud in 80% of the cases, and both their locations and magnitudes enclosed by the ensemble in 53% of the cases.

The poor man's ensemble was more successful than the much larger ECMWF ensemble prediction system in predicting the probability of rainfall in the first 24 h. This was due mainly to the larger spread of the poor man's ensemble. A more detailed comparison of these two approaches for predicting rainfall amount is the subject of a future study.

The advantages of the multimodel approach are as follows.

The poor man's ensemble is essentially cost-free. It does not require intensive computing resources to perform multiple model integrations; rather, it can easily be assembled from NWP forecasts that are readily available in-house and over the GTS. As improvements are made to observing systems, assimilation methods, model physics, and computing capabilities, the resulting improvements in individual model QPFs will have an immediate beneficial impact on the poor man's ensemble predictions.

The poor man's ensemble samples uncertainties in both the initial conditions and model formulation through the variation of input data and analysis and forecast methodologies of its component members. It is therefore less prone to systematic biases and errors that lead to underdispersive behavior in single-model EPSs. This removes the need to derive calibrations for the POP forecasts to compensate for the biases (Eckel and Walters 1998). Model QPFs that contribute negatively to the overall skill of ensemble forecasts can easily be removed.

The skill of the poor man's ensemble in predicting precipitation distribution and amount appears to be limited to the first one or two days of the forecast. The larger ensemble prediction systems clearly provide better QPFs and POPs beyond the short term. The most skillful and accurate forecasts are likely to come from a multimodel “superensemble” that combines the best features of the large EPSs and the poor man's ensemble (e.g., Evans et al. 2000).

## Acknowledgments

The author wishes to thank François Lalaurette at ECMWF for suggesting the median precipitation as an alternative to the ensemble mean. Helpful comments and suggestions from the two anonymous reviewers and from Frank Woodcock, Bill Bourke, and Kamal Puri at the Bureau of Meteorology resulted in an improved manuscript. Roberto Buizza kindly provided the ECMWF EPS rainfall forecasts for Australia. Provision by ECMWF and DWD of their NWP model results via FTP is gratefully acknowledged.

## REFERENCES

Anagnostou, E. N., A. J. Negri, and R. F. Adler, 1999: A satellite infrared technique for diurnal rainfall variability studies. *J. Geophys. Res.*, **104**, 31477–31488.

Atger, F., 1999: The skill of ensemble prediction systems. *Mon. Wea. Rev.*, **127**, 1941–1953.

Buizza, R., and T. N. Palmer, 1995: The singular vector structure of the atmospheric general circulation. *J. Atmos. Sci.*, **52**, 1434–1456.

Buizza, R., and T. N. Palmer, 1998: Impact of ensemble size on ensemble prediction. *Mon. Wea. Rev.*, **126**, 2503–2518.

Buizza, R., A. Hollingsworth, F. Lalaurette, and A. Ghelli, 1999a: Probabilistic predictions of precipitation using the ECMWF Ensemble Prediction System. *Wea. Forecasting*, **14**, 168–189.

Buizza, R., M. Miller, and T. N. Palmer, 1999b: Stochastic representation of model uncertainties in the ECMWF Ensemble Prediction System. *Quart. J. Roy. Meteor. Soc.*, **125**, 2887–2908.

Du, J., S. L. Mullen, and F. Sanders, 1997: Short-range ensemble forecasting of quantitative precipitation. *Mon. Wea. Rev.*, **125**, 2427–2459.

Du, J., S. L. Mullen, and F. Sanders, 2000: Removal of distortion error from an ensemble forecast. *Mon. Wea. Rev.*, **128**, 3347–3351.

Ebert, E. E., and J. L. McBride, 2000: Verification of precipitation in weather systems: Determination of systematic errors. *J. Hydrol.*, **239**, 179–202.

Eckel, F. A., and M. K. Walters, 1998: Calibrated probabilistic quantitative precipitation forecasts based on the MRF ensemble. *Wea. Forecasting*, **13**, 1132–1147.

Evans, R. E., M. S. J. Harrison, R. J. Graham, and K. R. Mylne, 2000: Joint medium-range ensembles from the Met. Office and ECMWF systems. *Mon. Wea. Rev.*, **128**, 3104–3127.

Fritsch, J. M., and Coauthors, 1998: Quantitative precipitation forecasting: Report of the Eighth Prospectus Development Team, U.S. Weather Research Program. *Bull. Amer. Meteor. Soc.*, **79**, 285–299.

Hamill, T. M., 1999: Hypothesis tests for evaluating numerical precipitation forecasts. *Wea. Forecasting*, **14**, 155–167.

Hamill, T. M., and S. J. Colucci, 1997: Verification of Eta-RSM short-range ensemble forecasts. *Mon. Wea. Rev.*, **125**, 1312–1327.

Hamill, T. M., and S. J. Colucci, 1998: Evaluation of Eta-RSM ensemble probabilistic precipitation forecasts. *Mon. Wea. Rev.*, **126**, 711–724.

Hoffman, R. N., Z. Liu, J-F. Louis, and C. Grassotti, 1995: Distortion representation of forecast errors. *Mon. Wea. Rev.*, **123**, 2758–2770.

Huffman, G. J., and Coauthors, 1997: The Global Precipitation Climatology Project (GPCP) Combined Precipitation Data Set. *Bull. Amer. Meteor. Soc.*, **78**, 5–20.

Junker, N. W., J. E. Hoke, B. E. Sullivan, K. F. Brill, and F. J. Hughes, 1992: Seasonal and geographic variations in quantitative precipitation prediction by NMC's Nested-Grid Model and Medium-Range Forecast Model. *Wea. Forecasting*, **7**, 410–429.

Kalnay, E., S. J. Lord, and R. D. McPherson, 1998: Maturity of operational numerical weather prediction: Medium range. *Bull. Amer. Meteor. Soc.*, **79**, 2753–2769.

Krishnamurti, T. N., C. M. Kishtawal, T. E. LaRow, D. R. Bachiochi, Z. Zhang, C. E. Williford, S. Gadgil, and S. Surendran, 1999: Improved weather and seasonal climate forecasts from multimodel superensemble. *Science*, **285**, 1548–1550.

Krishnamurti, T. N., C. M. Kishtawal, Z. Zhang, T. LaRow, D. Bachiochi, E. Williford, S. Gadgil, and S. Surendran, 2000: Multimodel ensemble forecasts for weather and seasonal climate. *J. Climate*, **13**, 4196–4216.

Leith, C. E., 1974: Theoretical skill of Monte Carlo forecasts. *Mon. Wea. Rev.*, **102**, 409–418.

Mason, I., 1982: A model for assessment of weather forecasts. *Austr. Meteor. Mag.*, **30**, 291–303.

McBride, J. L., and E. E. Ebert, 2000: Verification of quantitative precipitation forecasts from operational numerical weather prediction models over Australia. *Wea. Forecasting*, **15**, 103–121.

Mesinger, F., 1996: Improvements in quantitative precipitation forecasts with the Eta regional model at the National Centers for Environmental Prediction: The 48-km upgrade. *Bull. Amer. Meteor. Soc.*, **77**, 2637–2649.

Molteni, F., R. Buizza, T. N. Palmer, and T. Petroliagis, 1996: The ECMWF Ensemble Prediction System: Methodology and validation. *Quart. J. Roy. Meteor. Soc.*, **122**, 73–119.

Mullen, S. L., and R. Buizza, 2001: Quantitative precipitation forecasts over the United States by the ECMWF Ensemble Prediction System. *Mon. Wea. Rev.*, **129**, 638–663.

Rosenfeld, D., D. B. Wolff, and D. Atlas, 1993: General probability-matched relations between radar reflectivity and rain rate. *J. Appl. Meteor.*, **32**, 50–72.

Speer, M. S., and L. M. Leslie, 1997: An example of the utility of ensemble rainfall forecasting. *Austr. Meteor. Mag.*, **46**, 75–78.

Stanski, H. R., L. J. Wilson, and W. R. Burrows, 1989: Survey of common verification methods in meteorology. World Weather Watch Tech. Report 8, WMO/TD No. 358, 114 pp.

Stensrud, D. J., H. E. Brooks, J. Du, M. S. Tracton, and E. Rogers, 1999: Using ensembles for short-range forecasting. *Mon. Wea. Rev.*, **127**, 433–446.

Toth, Z., and E. Kalnay, 1993: Ensemble forecasting at NMC: The generation of perturbations. *Bull. Amer. Meteor. Soc.*, **74**, 2317–2330.

Tracton, S., and E. Kalnay, 1993: Ensemble forecasting at NMC: Operational implementation. *Wea. Forecasting*, **8**, 379–398.

van den Dool, H. M., and L. Rukhovets, 1994: On the weights for an ensemble-averaged 6–10-day forecast. *Wea. Forecasting*, **9**, 457–465.

Vislocky, R. L., and J. M. Fritsch, 1995: Improved model output statistics forecasts through model consensus. *Bull. Amer. Meteor. Soc.*, **76**, 1157–1164.

WCRP, 1997: Report of twelfth session of the CAS/JSC Working Group on Numerical Experimentation (Japan Meteorological Agency, Tokyo, Japan, 28–31 October 1996). WMO/TD No. 813, 31 pp.

Weymouth, G., G. A. Mills, D. Jones, E. E. Ebert, and M. J. Manton, 1999: A continental-scale daily rainfall analysis system. *Austr. Meteor. Mag.*, **48**, 169–179.

Whitaker, J. S., and A. F. Loughe, 1998: The relationship between ensemble spread and ensemble mean skill. *Mon. Wea. Rev.*, **126**, 3292–3302.

Xie, P., and P. A. Arkin, 1996: Analyses of global monthly precipitation using gauge observations, satellite estimates, and numerical model predictions. *J. Climate*, **9**, 840–858.

Zhang, Z., and T. N. Krishnamurti, 1997: Ensemble forecasting of hurricane tracks. *Bull. Amer. Meteor. Soc.*, **78**, 2785–2795.

Ziehmann, C., 2000: Comparison of a single-model EPS with a multi-model ensemble consisting of a few operational models. *Tellus*, **52A**, 280–299.

Table 1. NWP models for which 24- and 48-h quantitative precipitation forecasts were verified. The asterisk signifies that the grid resolution of the model output received at the Bureau of Meteorology via the GTS is coarser than the true resolution of the model.

Table 2. Mean daily spatial correlation coefficient between model rainfall forecasts. The upper part of the table gives the correlations between 24-h forecasts, while the lower part corresponds to 48-h forecasts.

Table 3. Seasonal Brier skill scores for the poor man's ensemble and the ECMWF ensemble prediction system for (a) the warm season (Dec–Feb) and (b) the cool season (Jun–Aug).

Table 4. Domain and CRA verification statistics for several ensemble QPF products for (a) 24-h forecasts and (b) 48-h forecasts. Refer to sections 5 and 6 for descriptions of the ensemble products.

Table 5. Frequency of observed rain being located within the cloud of ensemble members, and of the observed maximum rain rate falling within the range of ensemble predictions, as a function of the number of members predicting the rain system.