## 1. Introduction

Seasonal climate forecasts are necessarily probabilistic, and forecast information is most completely characterized by a probability density function (pdf). Estimation of the forecast pdf is required to measure predictability and to issue accurate forecasts. For reliable forecasts, the difference between the climatological and forecast pdfs represents predictability, and several measures of this difference have been developed to quantify predictability (Kleeman 2002; DelSole 2004; Tippett et al. 2004; DelSole and Tippett 2007). Quantile probabilities are the probabilities assigned to quantile-delimited categories and provide a coarse-grained description of the forecast and climatological pdfs, which is appropriate for ensembles with relatively few members. The International Research Institute for Climate and Society (IRI) issues seasonal forecasts of precipitation and temperature in the form of tercile-based categorical probabilities (hereafter called tercile probabilities), that is, the probability of the below-normal, normal, and above-normal categories (Barnston et al. 2003). Forecasts that differ from equal-odds probabilities, to the extent that they are reliable, are indications of predictability in the climate system. Accurate estimation of quantile probabilities is important both for quantifying seasonal predictability and for making climate forecasts.

In single-tier seasonal climate forecasts, initial conditions of the ocean–land–atmosphere system are the source of predictability, and ensembles of coupled model forecasts provide samples of the model atmosphere–land–ocean system evolution consistent with the initial conditions, their uncertainty, and the internal variability of the coupled model. In two-tier seasonal forecasts, ensembles of atmospheric general circulation models (GCMs) provide samples of equally likely model atmospheric responses to a particular configuration of sea surface temperature (SST). Tercile probabilities must be estimated from finite ensembles in either system. A simple nonparametric estimate of the tercile probabilities is the fraction of ensemble members in each category. Alternatively, the entire forecast pdf including tercile probabilities can be estimated by modeling the ensemble as a sample from an analytical pdf with adjustable parameters for mean, spread, shape, etc. Here we use a Gaussian distribution described by its mean and variance. The counting method has the advantage of making no assumptions about the form of the forecast pdf. Both approaches are affected by sampling error due to finite ensemble size, though to different degrees. This paper is about the impact of sampling error on parametric and nonparametric estimates of simulated and forecast tercile probabilities for seasonal precipitation totals. We analyze precipitation because of its societal importance and because, even on seasonal time scales, its distribution is farther from being Gaussian, and hence more challenging to describe, than quantities like temperature and geopotential height, which have been previously examined.
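The two estimators contrasted above can be sketched in a few lines. This is only an illustration under stated assumptions: the ensemble values are made up, the climatology is taken to be standard normal, and the function names are hypothetical rather than taken from any forecast system.

```python
from statistics import NormalDist, mean, stdev

def counting_terciles(ensemble, lower, upper):
    """Fraction of members below, between, and above the tercile boundaries."""
    n = len(ensemble)
    below = sum(x < lower for x in ensemble) / n
    above = sum(x > upper for x in ensemble) / n
    return below, 1.0 - below - above, above

def gaussian_fit_terciles(ensemble, lower, upper):
    """Integrate a Gaussian fitted to the ensemble between the boundaries."""
    fit = NormalDist(mean(ensemble), stdev(ensemble))
    below = fit.cdf(lower)
    above = 1.0 - fit.cdf(upper)
    return below, 1.0 - below - above, above

# Assumed standardized climatology: boundaries near -0.43 and +0.43.
clim = NormalDist(0.0, 1.0)
lower, upper = clim.inv_cdf(1 / 3), clim.inv_cdf(2 / 3)
ensemble = [-1.2, -0.6, -0.5, -0.1, 0.0, 0.2, 0.3, 0.7, 0.9, 1.4]  # made-up values

print(counting_terciles(ensemble, lower, upper))
print(gaussian_fit_terciles(ensemble, lower, upper))
```

The counting estimate can only take values that are multiples of 1/N, whereas the Gaussian fit varies continuously with the ensemble mean and spread.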

In this paper we present analytical descriptions of the accuracy of the counting and Gaussian tercile probability estimators. These analytical results facilitate the comparison of the counting and Gaussian estimates and show how the accuracy of the estimators increases as ensemble size and predictability level increase. The analytical results support previous empirical results showing the advantage of the parametric estimators. Wilks (2002) found that modeling numerical weather prediction ensembles with Gaussian or Gaussian mixture distributions gave more accurate estimations of quantile values than counting, especially for quantiles near the extremes of the distribution. Kharin and Zwiers (2003) used Monte Carlo simulations to show that a Gaussian fit estimate was more accurate than counting for Gaussian distributed forecast variables.

We show how the accuracy of the tercile probability estimates affects the ranked probability skill score (RPSS). The RPSS is a multicategory generalization of the two-category Brier skill score. Richardson (2001) found that finite ensemble size had an adverse effect on the Brier skill score, with low-skill regions being more negatively affected by small ensemble size. Changes in ensemble size that cause only modest changes in Brier skill score can lead to large changes in economic value implied by a simple cost–loss decision model, particularly for extreme events (Richardson 2001).

Accurate estimation of tercile probabilities from GCM ensembles does not ensure a skillful simulation or forecast if there are systematic errors in the GCM pdf. Calibration of model probabilities is needed to account for model deficiencies and produce reliable climate forecasts (Robertson et al. 2004). We expect that forecast skill would be improved by reducing sampling error in the GCM probabilities that are inputs to both the calibration system and the procedure to estimate calibration parameters. We investigate the roles of sampling and model error using a 79-member ensemble of GCM simulations of seasonal precipitation made with observed SST; we examine the impact of reducing sampling error on the skill of the simulations with and without calibration. Additionally, we use the GCM data to assess the importance of some simplifying assumptions used in the calculation of the analytical results by comparing the analytical results with empirical ones obtained by subsampling from the large ensemble of GCM simulations.

An important predictability issue relevant to parametric estimation of tercile probabilities is the relative roles of the forecast mean and variance in determining predictability (Kleeman 2002). Since predictability is a measure of the difference between forecast and climatological distributions, identifying the parameters associated with predictability also identifies the parameters that are useful for estimating tercile probabilities. For instance, if the predictability of a system is due to only the changes in the forecast mean, then the forecast mean should also be useful for estimating tercile probabilities. One approach to this question is to identify the parameters that give the most skillful forecast probabilities (Buizza and Palmer 1998; Atger 1999). Kharin and Zwiers (2003) showed that the Brier skill score of hindcasts of 700-mb temperature and 500-mb height was improved when probabilities were estimated from a Gaussian distribution with constant variance as compared with counting; fitting a Gaussian distribution with time-varying variance gave inferior results. Hamill et al. (2004) used a generalized linear model (GLM; logistic regression) to estimate forecast tercile probabilities of 6–10 day and week-2 surface temperature and precipitation and found that the ensemble variance was not a useful predictor of tercile probabilities. In addition to looking at skill, we examine the relative importance of the forecast mean and variance for predictability in the perfect model setting by asking whether including ensemble variance in the Gaussian estimate and the GLM estimate reduces sampling error.

The paper is organized as follows. The GCM and observation data are described in section 2. In section 3, we derive some theoretical results about the relative size of the error of the counting and fitting estimates and about the effect of sampling error on the ranked probability skill score. The GLM is also introduced and related to Gaussian fitting. In section 4, we compare the analytical results with empirical GCM-based ones and include effects of model error. A summary and conclusions are given in section 5.

## 2. Data

*λ*. Positive skewness is the usual non-Gaussian aspect of precipitation and requires a choice of *λ* < 1. The value of *λ* is found by maximizing the log-likelihood function. Figure 1 shows the geographical distribution of the values of *λ*, which is an indication of the deviation of the data from Gaussianity; we only allow a few values of *λ*, namely, 0, 1/4, 1/3, 1/2, and 1. The log function and small values of the exponent tend to be selected in dry regions. This is consistent with Sardeshmukh et al. (2000), who found that monthly precipitation in reanalysis and in a GCM was significantly non-Gaussian mainly in regions of mean tropospheric descent.
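The selection of *λ* from a short list of candidates can be sketched as follows. The sketch assumes the standard Box–Cox power transformation and its Gaussian profile log-likelihood; the function names and the synthetic data are hypothetical and are not the GCM output used in the paper.

```python
import math
from statistics import NormalDist

def box_cox(x, lam):
    """Box-Cox power transformation; lam = 0 is the log transform."""
    return math.log(x) if lam == 0 else (x ** lam - 1.0) / lam

def profile_log_likelihood(data, lam):
    """Gaussian log-likelihood of the transformed data, up to a constant."""
    n = len(data)
    y = [box_cox(x, lam) for x in data]
    ybar = sum(y) / n
    var = sum((v - ybar) ** 2 for v in y) / n
    jacobian = (lam - 1.0) * sum(math.log(x) for x in data)
    return -0.5 * n * math.log(var) + jacobian

def select_lambda(data, candidates=(0.0, 0.25, 1 / 3, 0.5, 1.0)):
    """Pick the candidate exponent with the largest profile log-likelihood."""
    return max(candidates, key=lambda lam: profile_log_likelihood(data, lam))

# Synthetic log-normal "seasonal totals": exp of evenly spaced normal quantiles.
z = [NormalDist().inv_cdf((i + 0.5) / 200) for i in range(200)]
totals = [math.exp(v) for v in z]
print(select_lambda(totals))
```

For strongly skewed (log-normal) data the log transform is selected, in line with the tendency noted above for dry regions.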

The precipitation observations used to evaluate model skill and to calibrate model output come from the extended New et al. (2000) gridded dataset of monthly precipitation for the period 1950 to 1998, interpolated to the T42 model grid.

## 3. Theoretical considerations

### a. Variance of the counting estimate

The *counting* estimate $p_N$ of a tercile probability is the fraction $n/N$, where $N$ is the ensemble size and $n$ is the number of ensemble members in the tercile category. The binomial distribution $P_p(n \mid N)$, where $p$ is the tercile probability, gives the probability of there being exactly $n$ members in the category. The expected number of members in the tercile category is

$$\langle n \rangle = \sum_{n=0}^{N} n\, P_p(n \mid N) = Np ,$$

where $\langle \cdot \rangle$ denotes expectation. Consequently, the expected value of the counting estimate $p_N$ is the probability $p$, and the counting estimate is unbiased. However, having a limited ensemble size generally causes any single realization of $p_N$ to differ from $p$. The variance of the counting estimate $p_N$ is

$$\langle (p_N - p)^2 \rangle = \frac{1}{N^2}\langle (n - Np)^2 \rangle = \frac{(1 - p)\,p}{N} , \tag{3}$$

where we use the fact that the variance of the binomial distribution is $N(1 - p)p$. The relation in (3) shows that the error variance of the counting estimate is inversely proportional to the ensemble size.
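The unbiasedness of the counting estimate and the 1/N behavior of its variance can be checked by summing directly over the binomial distribution. The following is a minimal sketch; the function names are illustrative only.

```python
from math import comb

def binomial_pmf(n, N, p):
    """Probability of exactly n of N members falling in the category."""
    return comb(N, n) * p ** n * (1.0 - p) ** (N - n)

def counting_estimate_moments(N, p):
    """Exact mean and variance of p_N = n/N under the binomial distribution."""
    mean = sum((n / N) * binomial_pmf(n, N, p) for n in range(N + 1))
    var = sum((n / N - p) ** 2 * binomial_pmf(n, N, p) for n in range(N + 1))
    return mean, var

for N in (10, 20, 40):
    mean, var = counting_estimate_moments(N, 1 / 3)
    # Compare the summed variance with (1 - p) p / N = (2/9) / N.
    print(N, round(mean, 6), round(var, 6), round((2 / 9) / N, 6))
```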

Since the counting estimate $p_N$ is not normally distributed or even symmetric for $p \neq 0.5$ (for instance, the distribution of sampling error necessarily has a positive skew when the true probability $p$ is close to zero), it is not immediately apparent whether its variance is a useful measure. However, the binomial distribution becomes approximately normal for large $N$. Figure 2 shows that the standard deviation gives a good estimate of the 16th and 84th percentiles of $p_N$ for $p = 1/3$ and modest values of $N$. In this case, the counting estimate variance is $(2/9)/N$. The percentiles are obtained by inverting the cumulative distribution function of the sample error. Since the binomial cumulative distribution is discrete, we show the smallest value at which it exceeds 0.16 and 0.84. Figure 2 also shows that for modest-sized ensembles ($N > 20$) the standard deviation is fairly insensitive to incremental changes in ensemble size; increasing the ensemble size by a factor of 4 is necessary to reduce the standard deviation by a factor of 2.
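The percentile calculation described above can be sketched by accumulating the binomial distribution until it exceeds the requested level. The helper name is hypothetical, and the printed values are sampling errors relative to $p = 1/3$, shown next to the standard deviation for comparison.

```python
from math import comb, sqrt

def binomial_percentile(N, p, level):
    """Smallest n/N at which the binomial cumulative distribution exceeds level."""
    cdf = 0.0
    for n in range(N + 1):
        cdf += comb(N, n) * p ** n * (1.0 - p) ** (N - n)
        if cdf > level:
            return n / N
    return 1.0

p = 1 / 3
for N in (10, 20, 40, 80):
    sd = sqrt(p * (1 - p) / N)
    low = binomial_percentile(N, p, 0.16)
    high = binomial_percentile(N, p, 0.84)
    print(N, round(low - p, 3), round(high - p, 3), round(sd, 3))
```

Going from 20 to 80 members halves the standard deviation, which is the factor-of-4 rule stated above.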

The counting estimate variance in (3) also depends on the tercile probability $p$. The extent to which the forecast probability differs from the climatological value of 1/3 is an indication of predictability, with larger deviations indicating more predictability. Intuitively, we expect regions and seasons with more predictability to suffer less from sampling error on average since enhanced predictability implies more reproducibility among ensemble members. In fact, when the forecast distribution is Gaussian with mean $\mu_f$ and variance $\sigma_f^2$, the variance of the counting estimate of the below-normal category probability is (see the appendix for details)

$$\langle (p_N - p)^2 \rangle = \frac{1}{4N}\left[ 1 - \mathrm{erf}^2\!\left( \frac{x_b - \mu_f}{\sqrt{2}\,\sigma_f} \right) \right] , \tag{4}$$

where $x_b$ is the left tercile boundary and erf denotes the error function. Since the absolute value of the error function approaches unity when the absolute value of its argument is large, the counting estimate variance is small when the ensemble mean is large or the ensemble variance is small. Assuming that the forecast variance $\sigma_f^2$ is constant and averaging (4) over forecasts gives that the average variance is approximately (see the appendix for details)

$$\langle (p_N - p)^2 \rangle \approx \frac{1}{N}\left( \frac{0.264409}{\sqrt{1 + S^2}} - 0.0421868 \right) , \tag{5}$$

where $S^2$ is the usual signal-to-noise ratio [see (A4); Kleeman and Moore (1999); Sardeshmukh et al. (2000)]. When there is no skill, $S = 0$, $p = 1/3$, and the average variance is $(2/9)/N$. The signal-to-noise ratio is related to correlation skill, with $r = S/\sqrt{1 + S^2}$.
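The dependence of the counting variance on predictability can be explored numerically. For a Gaussian forecast the below-normal probability is $p = \Phi[(x_b - \mu_f)/\sigma_f]$, so $p(1 - p) = \{1 - \mathrm{erf}^2[(x_b - \mu_f)/(\sqrt{2}\sigma_f)]\}/4$. The sketch below averages this over forecast means by simple quadrature, assuming a unit-variance climatology with signal variance $S^2/(1 + S^2)$ and noise variance $1/(1 + S^2)$. The closed-form comparison uses the two constants quoted for the convergence factor in section 4a; its square-root dependence on $1 + S^2$ is an assumption of this sketch.

```python
from math import erf, exp, pi, sqrt
from statistics import NormalDist

X0 = NormalDist().inv_cdf(1 / 3)  # left tercile of a unit-variance climatology

def counting_variance(mu_f, sigma_f, N, x_b=X0):
    """Counting variance p(1 - p)/N for one Gaussian forecast."""
    return (1.0 - erf((x_b - mu_f) / (sqrt(2.0) * sigma_f)) ** 2) / (4.0 * N)

def average_counting_variance(S2, N, steps=4001):
    """Average over forecast means mu_f ~ N(0, sigma_s^2) by the trapezoid rule."""
    sigma_f = sqrt(1.0 / (1.0 + S2))  # noise standard deviation
    sigma_s = sqrt(S2 / (1.0 + S2))   # signal standard deviation
    if sigma_s == 0.0:
        return counting_variance(0.0, sigma_f, N)
    lo, hi = -8.0 * sigma_s, 8.0 * sigma_s
    h = (hi - lo) / (steps - 1)
    total = 0.0
    for k in range(steps):
        mu = lo + k * h
        weight = 0.5 if k in (0, steps - 1) else 1.0
        pdf = exp(-0.5 * (mu / sigma_s) ** 2) / (sigma_s * sqrt(2.0 * pi))
        total += weight * pdf * counting_variance(mu, sigma_f, N)
    return total * h

def approx_average_variance(S2, N):
    """Assumed closed form built from the constants quoted in section 4a."""
    return (0.264409 / sqrt(1.0 + S2) - 0.0421868) / N

for S2 in (0.0, 0.25, 1.0):
    print(S2, average_counting_variance(S2, 1), approx_average_variance(S2, 1))
```

The two columns agree exactly at $S^2 = 0$ and drift apart slowly as the signal-to-noise ratio grows, which is the regime where the approximation is expected to be least accurate.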

### b. Variance of the Gaussian fit estimate

The *Gaussian fit* estimate $g_N$ of the tercile probabilities is found by fitting the $N$-member ensemble with a Gaussian distribution and integrating the distribution between the climatological tercile boundaries (Kharin and Zwiers 2003). The Gaussian fit estimate has two sources of error: (i) the non-Gaussianity of the forecast distribution from which the ensemble is sampled and (ii) sampling error in the estimates of mean and variance due to limited ensemble size. The first source of error is problem dependent, and we will quantify its impact empirically for the case of GCM-simulated seasonal precipitation. The variance of the Gaussian fit estimate can be quantified analytically for Gaussian distributed variables. When the forecast distribution is Gaussian with mean $\mu_f$ and known variance $\sigma_f^2$, the variance of the Gaussian fit estimate of the below-normal category probability is approximately (see the appendix for details)

$$\langle (g_N - p)^2 \rangle \approx \frac{1}{2\pi N} \exp\!\left[ -\frac{(x_b - \mu_f)^2}{\sigma_f^2} \right] ,$$

where $x_b$ is the left tercile boundary. The average (over forecasts) variance of the Gaussian fit tercile probability is approximately (see the appendix for details)

$$\langle (g_N - p)^2 \rangle \approx \frac{1}{2\pi N}\, \frac{1}{\sqrt{1 + 2S^2}} \exp\!\left[ -\frac{(1 + S^2)\, x_0^2}{1 + 2S^2} \right] , \tag{8}$$

where $x_0 = \Phi^{-1}(1/3) \approx -0.4307$ and $\Phi$ is the normal cumulative distribution function. Comparing this value with the counting estimate variance in (5) shows that the Gaussian fit estimate has smaller variance for all values of $S^2$, with its advantage over the counting estimate increasing slightly as the signal-to-noise ratio increases to levels exceeding unity.

In the limit of no predictability ($S = 0$), the average variance of the Gaussian fit is

$$\langle (g_N - p)^2 \rangle \approx \frac{e^{-x_0^2}}{2\pi N} \approx \frac{0.1322}{N} ,$$

which is about 40% smaller than the counting estimate variance $(2/9)/N$ ($S = 0$). The inverse dependence of the variances on ensemble size means that modest decreases in variance are equivalent to substantial increases in ensemble size. For instance, the variance of a Gaussian fit estimate with ensemble size 24, the simulation ensemble size used for IRI forecast calibration (Robertson et al. 2004), is equivalent to that of a counting estimate with ensemble size 40. The results in (3) and (8) also allow us to compare the variances of counting and Gaussian fit estimates of other quantile probabilities for the case $S = 0$ by appropriately modifying the definition of the category boundary $x_0$. For instance, to estimate the median, $x_0 = 0$, and the variance of the Gaussian estimate is about 36% smaller than that of the counting estimate; in the case of the 10th and 90th percentiles, $x_0 = \Phi^{-1}(1/10) \approx -1.2816$ and the variance of the Gaussian estimated probability is about 66% smaller than that of the counting estimate. The accuracy of the approximation in (8) for higher quantiles depends on the ensemble size being sufficiently large.
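The percentages quoted above for the no-skill case can be reproduced with a short calculation. The sketch assumes that, for $S = 0$ and a Gaussian fit with known variance, the fit variance is $\exp(-x_0^2)/(2\pi N)$ (the first-order result obtained by propagating the sampling error of the ensemble mean) and that the counting variance is $q(1 - q)/N$ for a category of probability $q$. The function name is illustrative.

```python
from math import exp, pi
from statistics import NormalDist

def variance_ratio_no_skill(quantile):
    """Gaussian-fit variance divided by counting variance when S = 0."""
    x0 = NormalDist().inv_cdf(quantile)
    gaussian = exp(-x0 ** 2) / (2.0 * pi)    # multiplies 1/N
    counting = quantile * (1.0 - quantile)   # multiplies 1/N, from (3)
    return gaussian / counting

for q in (1 / 3, 0.5, 0.1):
    ratio = variance_ratio_no_skill(q)
    print(f"quantile {q:.3f}: Gaussian fit variance is {100 * (1 - ratio):.1f}% smaller")

# Counting ensemble size with the same variance as a 24-member Gaussian fit.
print(24 / variance_ratio_no_skill(1 / 3))
```

Because both variances scale as 1/N, the ratio converts directly into an equivalent ensemble size.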

### c. Estimates from generalized linear models

A generalized linear model (GLM) describes the relationship between a response probability $p$, here the tercile probability, and some set of explanatory variables $y_i$, as for instance the GCM ensemble mean and variance (McCullagh and Nelder 1989). Suppose the probability $p$ depends on the response $R$, which is the linear combination

$$R = \sum_i a_i y_i + b$$

of the explanatory variables for some coefficients $a_i$ and a constant term $b$. The response $R$ generally takes on all numerical values while the probability $p$ is bounded between zero and one. The GLM approach introduces a function $g(p)$ that maps the unit interval onto the entire real line and studies the model

$$g(p) = \sum_i a_i y_i + b .$$

The parameters $a_i$ and the constant $b$ are found by maximum likelihood estimation. Here, the GLMs are developed with the ensemble mean (standardized) and ensemble standard deviation as explanatory variables and $p$ given by the counting estimate. This procedure is different from that used by Hamill et al. (2004), where the GLM was developed using observations. The procedure here has the potential to reduce sampling error, not systematic model error.

There are a number of commonly used choices for the function $g(p)$, including the logit function, which leads to logistic regression (McCullagh and Nelder 1989; Hamill et al. 2004). Here we use the probit function, which is the inverse of the normal cumulative distribution function $\Phi$; that is, we define

$$g(p) = \Phi^{-1}(p) .$$

The logit and probit functions are almost linearly related over the interval $0.1 \le p \le 0.9$ (McCullagh and Nelder 1989). The assumption of the GLM method is that $g(p)$ is linearly related to the explanatory variables: here the ensemble mean and standard deviation. When the forecast distribution is Gaussian with constant variance, $g(p)$ is indeed linearly related to the ensemble mean and this assumption is exactly satisfied. To see this, suppose that the forecast ensemble has mean $\mu_f$ and variance $\sigma_f^2$. Then the probability $p$ of the below-normal category is

$$p = \Phi\!\left( \frac{x_b - \mu_f}{\sigma_f} \right) ,$$

where $x_b$ is the left tercile of the climatological distribution, and

$$g(p) = \Phi^{-1}(p) = \frac{x_b}{\sigma_f} - \frac{\mu_f}{\sigma_f} ,$$

which is linear in $\mu_f$ when $\sigma_f$ is constant.

We show an example with synthetic data to give some indication of the robustness of the GLM estimate when the population that the ensemble represents does not have a Gaussian distribution. We take the forecast pdf to be a gamma distribution with shape and scale parameters (2, 1). The pdf is asymmetric and has a positive skew (see Fig. 3a). Samples are taken from this distribution and the probability of the below-normal category is estimated by counting, Gaussian fit, and GLM; the Gaussian fit assumes constant known variance, and the GLM uses the ensemble mean as an explanatory variable. Interestingly, the rms error of both the GLM and Gaussian fit estimates is smaller than that of counting for modest ensemble size (Fig. 3b). As the ensemble size increases further, counting becomes a better estimate than the Gaussian fit. For all ensemble sizes, the performance of the GLM estimate is better than the Gaussian fit.

Other experiments (not shown) compare the counting, Gaussian fit, and GLM estimates when the ensemble is Gaussian with nonconstant variance. The GLM estimate with ensemble mean and variance as explanatory variables and the two-parameter Gaussian fit have smaller error than counting and the one-parameter models (for large enough ensemble size) as expected.
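A one-predictor probit GLM of the kind described in this section can be fitted by Fisher scoring (iteratively reweighted least squares). The following is a minimal sketch, not the implementation used for the results: the function name, the fixed iteration count, and the synthetic inputs are assumptions. The responses are counting fractions from $N$-member ensembles, so the binomial likelihood weights each case by $N$.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def fit_probit_glm(means, fractions, n_members, iterations=25):
    """Fit Phi^{-1}(p) = a * mean + b by Fisher scoring."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        sw = swx = swxx = swz = swxz = 0.0
        for x, frac in zip(means, fractions):
            eta = a * x + b
            prob = min(max(STD_NORMAL.cdf(eta), 1e-10), 1.0 - 1e-10)
            dens = max(STD_NORMAL.pdf(eta), 1e-10)
            weight = n_members * dens ** 2 / (prob * (1.0 - prob))
            z = eta + (frac - prob) / dens  # working response
            sw += weight
            swx += weight * x
            swxx += weight * x * x
            swz += weight * z
            swxz += weight * x * z
        det = sw * swxx - swx ** 2
        # Weighted least squares solution of z = a * x + b.
        a = (sw * swxz - swx * swz) / det
        b = (swxx * swz - swx * swxz) / det
    return a, b

# Synthetic check: Gaussian forecasts with constant spread sigma_f.
sigma_f = 0.8
x_b = STD_NORMAL.inv_cdf(1 / 3)
means = [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5]
fractions = [STD_NORMAL.cdf((x_b - m) / sigma_f) for m in means]
a, b = fit_probit_glm(means, fractions, n_members=20)
print(a, -1 / sigma_f)
print(b, x_b / sigma_f)
```

With exact Gaussian probabilities as input, the fitted slope and intercept match $-1/\sigma_f$ and $x_b/\sigma_f$, which illustrates the equivalence with the constant-variance Gaussian fit noted above.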

### d. Ranked probability skill score

The ranked probability score (RPS) of a set of tercile probability forecasts is

$$\mathrm{RPS} = \frac{1}{M} \sum_{i=1}^{M} \sum_{j=1}^{3} \left( F_{i,j} - O_{i,j} \right)^2 ,$$

where $M$ is the number of forecasts, $F_{i,j}$ ($O_{i,j}$) is the cumulative distribution function of the $i$th forecast (observation) of the $j$th category. The observation “distribution” is defined to be one for the observed category and zero otherwise. This definition means that $F_{i,1} = P_{i,B}$, $F_{i,2} = P_{i,B} + P_{i,N}$, where $P_{i,B}$ ($P_{i,N}$) is the probability of the below normal (near normal) category for the $i$th forecast. The terms containing above-normal probabilities ($j = 3$) vanish, since $F_{i,3} = O_{i,3} = 1$.
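The definition can be made concrete with a short function that accumulates squared differences of the cumulative forecast and observation distributions. The sketch assumes the score is averaged over forecasts and not normalized by the number of categories, which is consistent with the no-skill value of 4/9 quoted later in this section; the function name and category encoding are illustrative.

```python
def ranked_probability_score(forecasts, observed):
    """Average RPS of tercile forecasts.

    forecasts: list of (P_B, P_N, P_A) tuples.
    observed: list of category indices (0 = below, 1 = near, 2 = above normal).
    """
    total = 0.0
    for probs, category in zip(forecasts, observed):
        cum_f = 0.0
        for j in range(3):
            cum_f += probs[j]
            cum_o = 1.0 if category <= j else 0.0
            total += (cum_f - cum_o) ** 2
    return total / len(forecasts)

# A climatological forecast verified against one observation in each category.
clim = (1 / 3, 1 / 3, 1 / 3)
print(ranked_probability_score([clim, clim, clim], [0, 1, 2]))
```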

For a single forecast, we drop the $i$ subscript. Let $O_B$, $O_N$, and $O_A$ be the probabilities that the verifying observation falls into the below-, near-, and above-normal categories, respectively. That is, $O_B$, $O_N$, and $O_A$ collectively represent the uncertainty of the climate state, not due to instrument error but due to the limited predictability of the climate system. In the case of equal odds, $O_B = O_N = O_A = 1/3$, there is no predictability, while a shift away from equal odds represents predictability. These probabilities are not directly measurable since only a single realization of nature is available. The expected (with respect to the observations) RPS of a particular forecast is the sum of the RPS for each possible category of observation multiplied by its likelihood:

$$\langle \mathrm{RPS} \rangle = O_B \left[ (P_B - 1)^2 + P_A^2 \right] + O_N \left[ P_B^2 + P_A^2 \right] + O_A \left[ P_B^2 + (P_A - 1)^2 \right] . \tag{16}$$

The expected RPS of a forecast whose probabilities are correct, $P_B = O_B$ and $P_A = O_A$ ($\mathrm{RPS}_{\mathrm{perfect}}$), is

$$\mathrm{RPS}_{\mathrm{perfect}} = O_B (1 - O_B) + O_A (1 - O_A) .$$

Thus $\mathrm{RPS}_{\mathrm{perfect}}$ is small for large probability shifts. The quantity $\mathrm{RPS}_{\mathrm{perfect}}$ is a perfect model measure of potential probabilistic skill analogous to the signal-to-noise ratio, which determines the correlation skill of a model to predict itself. The quantity $\mathrm{RPS}_{\mathrm{perfect}}$ has the same form as the *uncertainty* term in the decomposition by Murphy (1973) of the Brier score. However, the uncertainty term in the decomposition by Murphy (1973) is the score of the climatological forecast averaged over forecasts, while $\mathrm{RPS}_{\mathrm{perfect}}$ is the expected score of a correct probability forecast averaged over realizations of the observations. Both quantities measure the variability of the observations with respect to their expected frequency. When the forecast distribution is Gaussian, $\mathrm{RPS}_{\mathrm{perfect}}$ is simply related to the forecast mean $\mu_f$ and variance $\sigma_f^2$ by

$$\mathrm{RPS}_{\mathrm{perfect}} = \frac{1}{2} - \frac{1}{4}\,\mathrm{erf}^2\!\left( \frac{x_b - \mu_f}{\sqrt{2}\,\sigma_f} \right) - \frac{1}{4}\,\mathrm{erf}^2\!\left( \frac{x_a - \mu_f}{\sqrt{2}\,\sigma_f} \right) ,$$

where $x_b$ and $x_a$ are the left and right tercile boundaries. This relation gives $\mathrm{RPS}_{\mathrm{perfect}} = 0$ in the limit of $\sigma_f = 0$ (deterministic forecast) and elucidates the empirical relation between probability skill and mean forecast found by Kumar et al. (2001).

Figure 4 shows $\mathrm{RPS}_{\mathrm{perfect}}$ for the 79-member ECHAM4.5 GCM-simulated precipitation data. This is a perfect model measure of potential probabilistic skill, with small values of $\mathrm{RPS}_{\mathrm{perfect}}$ showing that the GCM has skill in the sense of reproducibility with respect to itself. Skills are highest at low latitudes, consistent with our knowledge that tropical precipitation is most influenced by SST. Perfect model RPS values are close to the no-skill limit of 4/9 in much of the extratropics. The RPSS is defined using the RPS and a reference forecast defined to have zero skill, here climatology:

$$\mathrm{RPSS} = 1 - \frac{\mathrm{RPS}}{\mathrm{RPS}_{\mathrm{clim}}} ,$$

where $\mathrm{RPS}_{\mathrm{clim}}$ is the RPS of the climatological forecast. The expected RPS of a climatological forecast is found by substituting $P_B = P_N = P_A = 1/3$ into (16), which gives

$$\langle \mathrm{RPS}_{\mathrm{clim}} \rangle = \frac{2}{9} + \frac{1}{3}\left( O_B + O_A \right) .$$

Figure 4 also shows $\mathrm{RPSS}_{\mathrm{perfect}} \equiv 1 - \mathrm{RPS}_{\mathrm{perfect}}/\mathrm{RPS}_{\mathrm{clim}}$ for the GCM-simulated precipitation data. Even under the perfect model assumption, the RPSS exceeds 0.1 in few regions.
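The perfect model quantities can be evaluated directly from the observation probabilities. The sketch below writes the expected RPS as the likelihood-weighted sum over the three possible observed categories, using $F_1 = P_B$ and $F_2 = 1 - P_A$; the function names and the example probabilities are illustrative.

```python
def rps_expected(p_b, p_a, o_b, o_a):
    """Expected RPS of forecast (p_b, p_a) when the observation has odds (o_b, o_a)."""
    o_n = 1.0 - o_b - o_a
    return (o_b * ((p_b - 1.0) ** 2 + p_a ** 2)
            + o_n * (p_b ** 2 + p_a ** 2)
            + o_a * (p_b ** 2 + (p_a - 1.0) ** 2))

def rps_perfect(o_b, o_a):
    """Expected RPS when the forecast probabilities equal the true odds."""
    return rps_expected(o_b, o_a, o_b, o_a)

def rpss_perfect(o_b, o_a):
    """Perfect model RPSS relative to the climatological forecast."""
    return 1.0 - rps_perfect(o_b, o_a) / rps_expected(1 / 3, 1 / 3, o_b, o_a)

print(rps_perfect(1 / 3, 1 / 3))   # no-skill limit
print(rpss_perfect(0.5, 0.2))      # a fairly strong shift toward below normal
```

Even a shift of the below-normal probability from 1/3 to 1/2 yields only a modest perfect model RPSS, which helps explain why large RPSS values are rare.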

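Now suppose that the forecast probabilities contain sampling error, so that $P_B = O_B + \epsilon_B$ and $P_A = O_A + \epsilon_A$, where $\epsilon_B$ and $\epsilon_A$ represent error due to finite ensemble size. If each of the forecast probabilities is unbiased and $\langle \epsilon_B \rangle = \langle \epsilon_A \rangle = 0$, then substituting into (16) and averaging over realizations of the ensemble gives

$$\langle \mathrm{RPS} \rangle = \mathrm{RPS}_{\mathrm{perfect}} + \langle \epsilon_B^2 \rangle + \langle \epsilon_A^2 \rangle .$$

For the counting estimate, (3) gives $\langle \epsilon_B^2 \rangle = O_B(1 - O_B)/N$ and $\langle \epsilon_A^2 \rangle = O_A(1 - O_A)/N$, so that $\langle \mathrm{RPS} \rangle = (1 + 1/N)\,\mathrm{RPS}_{\mathrm{perfect}}$ and

$$\mathrm{RPSS} = 1 - \left( 1 + \frac{1}{N} \right) \left( 1 - \mathrm{RPSS}_{\mathrm{perfect}} \right) . \tag{24}$$

If the probability estimate has sampling variance that is smaller than that of the counting estimate by a factor $\alpha$, as does, for example, the Gaussian fit estimate, then

$$\mathrm{RPSS} = 1 - \left( 1 + \frac{\alpha}{N} \right) \left( 1 - \mathrm{RPSS}_{\mathrm{perfect}} \right) ,$$

where $\alpha < 1$.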
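The inflation of the expected RPS by counting error can be verified exactly for small ensembles by enumerating every possible split of the $N$ members among the three categories. The sketch assumes members are drawn independently with the same probabilities as the observation (the perfect model setting); the function name is illustrative.

```python
from math import comb

def expected_rps_counting(o_b, o_a, N):
    """Exact expected RPS of counting forecasts over all multinomial outcomes."""
    o_n = 1.0 - o_b - o_a
    total = 0.0
    for n_b in range(N + 1):
        for n_a in range(N + 1 - n_b):
            n_n = N - n_b - n_a
            prob = (comb(N, n_b) * comb(N - n_b, n_a)
                    * o_b ** n_b * o_a ** n_a * o_n ** n_n)
            p_b, p_a = n_b / N, n_a / N
            rps = (o_b * ((p_b - 1.0) ** 2 + p_a ** 2)
                   + o_n * (p_b ** 2 + p_a ** 2)
                   + o_a * (p_b ** 2 + (p_a - 1.0) ** 2))
            total += prob * rps
    return total

o_b, o_a, N = 0.5, 0.2, 10
perfect = o_b * (1 - o_b) + o_a * (1 - o_a)
print(expected_rps_counting(o_b, o_a, N), (1 + 1 / N) * perfect)
```

The enumerated value equals $(1 + 1/N)$ times the perfect model RPS, so halving the sampling variance has the same effect on the expected score as doubling the ensemble size.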

## 4. Estimates of GCM-simulated seasonal precipitation tercile probability

### a. Variance of the counting estimate

To estimate the variance of the counting estimate from the GCM data, we draw two independent samples of size $N$ (without replacement) from the ensemble of GCM simulations and compute two counting estimate probabilities denoted $p_N$ and $p'_N$; the ensemble size of 79 and independence requirement limits the maximum value of $N$ to 39. The expected value of the square of the difference between the two counting estimates $p_N$ and $p'_N$ is twice the variance of the counting estimate since

$$\langle (p_N - p'_N)^2 \rangle = \langle (p_N - p)^2 \rangle + \langle (p - p'_N)^2 \rangle = 2 \langle (p_N - p)^2 \rangle , \tag{26}$$

where $(p_N - p)$ and $(p - p'_N)$ are uncorrelated. The averages in (26) are with respect to time and realizations (1000) of the two independent samples.
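The subsampling procedure can be sketched for a single forecast as follows. The 79 "members" here are synthetic (evenly spaced normal quantiles), the function name is hypothetical, and the average is over random draws only, not over years as in the calculation described above.

```python
import random
from statistics import NormalDist

def subsample_counting_variance(ensemble, boundary, N, realizations=2000, seed=0):
    """Half the mean squared difference of two counting estimates
    computed from disjoint N-member subsamples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(realizations):
        draw = rng.sample(ensemble, 2 * N)  # without replacement
        p1 = sum(x < boundary for x in draw[:N]) / N
        p2 = sum(x < boundary for x in draw[N:]) / N
        total += (p1 - p2) ** 2
    return 0.5 * total / realizations

# Synthetic 79-member ensemble with about one third of members below the boundary.
members = [NormalDist().inv_cdf((i + 0.5) / 79) for i in range(79)]
x_b = NormalDist().inv_cdf(1 / 3)
estimate = subsample_counting_variance(members, x_b, N=20)
print(estimate, (2 / 9) / 20)
```

The two halves of each draw are disjoint, which is why $N$ cannot exceed 39 for a 79-member ensemble.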

We expect especially close agreement between the subsampling calculations and the analytical results of (5) in regions where there is little predictability and the signal-to-noise ratio *S*^{2} is small, since, for *S*^{2} = 0, the analytical result is exact. In regions where the signal-to-noise ratio is not zero, though generally fairly small, we expect that the average counting variance still decreases as 1/*N*. However, there is no guarantee that the Gaussian approximation will provide an adequate description of the actual behavior of the GCM data.

Figure 5 shows that in the land gridpoint average the variance of the counting estimate is very well described by the analytical result in (5), with the difference from the analytical result being on the order of a few percent for the below-normal category probability and less than one percent for the above-normal category probability. The accuracy difference between the below- and above-normal categories may be due to the below-normal category being more affected by non-Gaussian behavior. Figure 6a shows the spatial variation of the convergence factor $-0.0421868 + 0.264409/\sqrt{1 + S^2}$; the variance of the counting estimate for a given value of $N$ is obtained by dividing by $N$.

### b. Error of counting, Gaussian fit, and GLM estimators

To compare the counting, Gaussian fit, and GLM estimators, the *error* variance of the estimators must be computed. The error is not known because the true probability is not known exactly. Therefore each method is compared to a common baseline as follows. Each method is applied to an ensemble of size $N$ ($N$ = 5, 10, 20, 30, 39) to produce an estimate $q_N$. This estimate is then compared to the counting estimate $p_{40}$ computed from an independent set of 40 ensemble members. This counting estimate $p_{40}$ serves as a common unbiased baseline. The variance of the difference of these two estimates has contributions from the $N$-member estimate $q_N$ and the 40-member counting estimate. The variance of the difference can be decomposed into error variance contributions from $q_N$ and $p_{40}$:

$$\langle (q_N - p_{40})^2 \rangle = \langle (q_N - p)^2 \rangle + \langle (p - p_{40})^2 \rangle ,$$

where the cross term vanishes because the independent, unbiased baseline $p_{40}$ is used. Therefore the error variance of the estimate $q_N$ is

$$\langle (q_N - p)^2 \rangle = \langle (q_N - p_{40})^2 \rangle - \langle (p - p_{40})^2 \rangle .$$

We show $\langle (q_N - p)^2 \rangle$ rather than $\langle (q_N - p_{40})^2 \rangle$ so as to give a sense of the magnitude of the sampling error rather than the difference with the baseline estimate. Results are averaged over time and realizations (100) of the $N$-member estimate and the 40-member counting estimate.

We begin by examining the land gridpoint average of the sampling error of the three methods. Figure 7a shows the gridpoint-averaged rms error of the tercile probability estimates as a function of ensemble size. The variance of the counting estimate is well described by theory (Fig. 7a) and is larger than that of the parametric estimates. The one-parameter GLM and constant variance Gaussian fit have similar rms error for larger ensemble sizes; the GLM estimate is slightly better for very small ensemble sizes. While the magnitude of the error reduction due to using the parametric estimates is modest, the savings in computational cost compared to the equivalent ensemble size is significant.

The single parameter estimates, that is, the constant variance Gaussian fit and the GLM based on the ensemble mean, have smaller rms error than the estimates based on ensemble mean and variance (Fig. 7b). The advantage of the single parameter estimates is greatest for smaller ensemble sizes. This result is important because it shows that attempting to account for changes in variance, even in the perfect model setting where ensemble size is the only source of error, does not improve estimates of the tercile probabilities for the range of ensemble sizes considered here (Kharin and Zwiers 2003). The sensitivity of the tercile probabilities to changes in variance is, of course, problem specific.

Figure 8 shows the spatial features of the rms error of the below-normal tercile probability estimates for ensemble size 20. Using a Gaussian with constant variance or a GLM based on the ensemble mean has error that is, on average, less than counting; the average performances of the Gaussian fit and the GLM are similar. In a few dry regions, especially in Africa, the error from the parametric estimates is larger. This problem with the parametric estimates in the dry regions is reduced when a Box–Cox transformation is applied to the data (not shown), and overall error levels are slightly reduced as well. The spatial features of rms error when the variance of the Gaussian is estimated and when the mean and standard deviation are used in the GLM are similar to those in Fig. 8, but the overall error levels are slightly higher.

### c. RPSS

In the previous section we evaluated the three probability estimation methods in the perfect model setting, applying the estimators to small ensembles and asking how well they reproduce the probabilities from the large ensemble. We now compare the three probability estimation methods in an imperfect model setting by computing their RPSS using observations. We expect the reduction in sampling error to result in improved RPSS, but we cannot know beforehand the extent to which model error confounds or offsets the reduction in sampling error. Figure 9 shows maps of RPSS for ensemble size 20 for the counting, Gaussian fit, and GLM estimates. The results are averaged over 100 random selections of the 20-member ensemble from the full 79-member ensemble. The overall skill of the Gaussian fit and GLM estimate is similar and both are generally larger than that of the counting estimate.

Figure 10 shows the fraction of points with positive RPSS as a function of ensemble size. Again results are averaged over 100 random draws of each ensemble size except for *N* = 79 when the entire ensemble is used. The parametrically estimated probabilities lead to more grid points with positive RPSS. The Gaussian fit and GLM have similar skill levels, with the GLM estimate having larger RPSS for the smallest ensemble sizes and the Gaussian fit being slightly better for larger ensemble sizes. It is useful to interpret the increases in RPSS statistics in terms of effective ensemble size. For instance, applying the Gaussian fit estimator to a 24-member ensemble gives RPSS statistics that are on average comparable to those of the counting estimator applied to an ensemble size of about 39. Although all methods show improvement as ensemble size increases, it is interesting to ask to what extent the improvement in RPSS due to increasing ensemble size predicted by (24) is impacted by the presence of model error. For a realistic approximation of the RPSS in the limit of infinite ensemble size, we compute the RPSS for *N* = 1 and solve (24) for RPSS_{perfect}; we expect that in this case sampling error dominates model error and the relation in (24) holds approximately. Then we use (24) to compute the gridpoint-averaged RPSS for other values of *N*; the theory curve in Fig. 10 shows these values. In the absence of model error, the count and theory curves of RPSS in Fig. 10 would be the same. However, we see that the effect of model error is such that the curves are close for *N* = 5 and *N* = 10 and diverge for larger ensemble sizes, with the actual increase in RPSS being lower than that predicted by (24).
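The construction of the theory curve can be sketched as follows. The sketch assumes that (24) has the form RPSS = 1 − (1 + α/N)(1 − RPSS_{perfect}), with α = 1 for counting, which is what results when the sampling variance of the counting estimate is added to the expected RPS. The single-member RPSS value used below is made up for illustration.

```python
def rpss_finite_ensemble(rpss_perfect, N, alpha=1.0):
    """Expected RPSS of an N-member ensemble; alpha = 1 is the counting estimate."""
    return 1.0 - (1.0 + alpha / N) * (1.0 - rpss_perfect)

def rpss_perfect_from_single_member(rpss_1):
    """Invert the relation at N = 1 to infer the infinite-ensemble RPSS."""
    return 1.0 - (1.0 - rpss_1) / 2.0

rpss_1 = -0.9  # hypothetical gridpoint-averaged RPSS of single-member forecasts
limit = rpss_perfect_from_single_member(rpss_1)
for N in (5, 10, 20, 39, 79):
    print(N, round(rpss_finite_ensemble(limit, N), 4))
```

Single-member forecasts are heavily penalized because the sampling term doubles the expected RPS, so a strongly negative RPSS at *N* = 1 can still correspond to a slightly positive limit for large ensembles.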

The presence of model error means that some calibration of the model output with observations is needed. The GCM ensemble tends to be overconfident and calibration tempers this. To see if reducing sampling error still has a noticeable impact after calibration, we use a simple version of Bayesian weighting (Rajagopalan et al. 2002; Robertson et al. 2004). In the method, the calibrated probability is a weighted average of the GCM probability and the climatology probability (1/3). The weights are chosen to maximize the likelihood of the observations. There is cross-validation in the sense that the weights are computed with a particular ensemble of size *N*, and the RPSS is computed by applying those weights to a different ensemble of the same size and then comparing the result with observations. The calibrated counting–estimated probabilities still have slightly negative RPSS in some areas (Fig. 11a), but the overall amount of positive RPSS is increased compared to the uncalibrated simulations (cf. with Fig. 9a); the ensemble size is 20 and results are averaged over 100 realizations. The calibrated Gaussian and GLM probabilities have modestly higher overall RPSS than the calibrated counting estimates with noticeable improvement in skillful areas like southern Africa (Figs. 11b,c). We note that a simpler calibration method based on a Gaussian fit with the variance determined by the correlation between ensemble mean and observations, as in Tippett et al. (2005), rather than ensemble spread, performs nearly as well as the Gaussian fit with Bayesian calibration.
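The calibration step described above can be sketched as a one-parameter likelihood maximization. This is only an illustration of the idea stated in the text: the grid search, the single weight shared by all categories, and the toy data are assumptions of the sketch, not details of the Bayesian weighting scheme of Rajagopalan et al. (2002).

```python
import math

def calibrate(prob, weight):
    """Weighted average of a model probability and the climatological 1/3."""
    return weight * prob + (1.0 - weight) / 3.0

def fit_weight(model_probs, observed, grid=101):
    """Weight in [0, 1] maximizing the log-likelihood of the observed
    categories (0 = below, 1 = near, 2 = above normal)."""
    def log_likelihood(w):
        return sum(math.log(max(calibrate(p[k], w), 1e-12))
                   for p, k in zip(model_probs, observed))
    return max((i / (grid - 1) for i in range(grid)), key=log_likelihood)

# Toy overconfident model: always 80% below normal, verified only half the time.
model_probs = [(0.8, 0.1, 0.1)] * 10
observed = [0, 0, 0, 0, 0, 1, 1, 1, 2, 2]
w = fit_weight(model_probs, observed)
print(w, calibrate(0.8, w))
```

In this toy case the fitted weight pulls the below-normal probability back toward the observed frequency of one half, which is the tempering of overconfidence described above.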

It is interesting to look at examples of the probabilities given by the counting and Gaussian fit estimate to see how the spatial distributions of probabilities may differ in appearance. Figure 12 shows uncalibrated tercile probabilities from DJF 1996 (ENSO neutral) and 1998 (strong El Niño). Counting and Gaussian probabilities appear similar, with Gaussian probabilities appearing spatially smoother.

## 5. Summary and conclusions

Here we have explored how the accuracy of tercile category probability estimates is related to ensemble size and the chosen probability estimation technique. The counting estimate, which uses the fraction of ensemble members that fall in the tercile category, is attractive because it is simple and places no restrictions on the form of the ensemble distribution. The error variance of the counting estimate is a function of the ensemble size and tercile category probability. For Gaussian variables, the tercile category probability is a function of the ensemble mean and variance. Therefore, for Gaussian variables, the counting estimate variance for an individual forecast depends on ensemble size, mean, and variance; the average (over forecasts) counting estimate variance depends on ensemble size and the signal-to-noise ratio. An alternative to the counting estimate is the Gaussian fit estimate, which computes tercile probabilities from a Gaussian distribution with parameters estimated from the forecast ensemble. As with the counting estimate, the variance of the Gaussian fit tercile probabilities is shown to be a function of the ensemble size and the ensemble mean and variance, and the average variance depends on ensemble size and the signal-to-noise ratio. When the variables are indeed Gaussian, the error variance of the Gaussian fit estimate is smaller than that of the counting estimate by approximately 40% in the limit of small signal. The advantage of the Gaussian fit over the counting estimate is equivalent to a fairly substantial increase in ensemble size. However, this advantage depends on the forecast distribution being well described by a Gaussian distribution. Generalized linear models (GLMs) provide a parametric estimate of the tercile probabilities using a nonlinear regression with the ensemble mean and possibly the ensemble variance as predictors. The GLM estimator does not explicitly assume a distribution but, as implemented here, is equivalent to the Gaussian fit estimate in some circumstances.
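The difference between the two estimators can be illustrated with a small Monte Carlo experiment. The following Python sketch is illustrative only and is not the code used in this study. It assumes a zero-signal Gaussian forecast with unit variance, so that the true below-normal probability is 1/3, and the function names `counting_estimate`, `gaussian_fit_estimate`, and `mean_squared_errors` are hypothetical.

```python
import random
from statistics import NormalDist, fmean, stdev

# Left tercile boundary of a standard normal distribution, about -0.4307.
X0 = NormalDist().inv_cdf(1.0 / 3.0)


def counting_estimate(ensemble, boundary):
    """Fraction of ensemble members that fall below the boundary."""
    return sum(member < boundary for member in ensemble) / len(ensemble)


def gaussian_fit_estimate(ensemble, boundary):
    """Below-normal probability from a Gaussian fitted to the ensemble."""
    fitted = NormalDist(fmean(ensemble), stdev(ensemble))
    return fitted.cdf(boundary)


def mean_squared_errors(n_members=10, n_trials=20000, seed=0):
    """Monte Carlo squared error of both estimators (assumed true p = 1/3)."""
    rng = random.Random(seed)
    true_p = 1.0 / 3.0
    count_err = 0.0
    gauss_err = 0.0
    for _ in range(n_trials):
        # Zero-signal case: every ensemble is drawn from the climatology N(0, 1).
        ensemble = [rng.gauss(0.0, 1.0) for _ in range(n_members)]
        count_err += (counting_estimate(ensemble, X0) - true_p) ** 2
        gauss_err += (gaussian_fit_estimate(ensemble, X0) - true_p) ** 2
    return count_err / n_trials, gauss_err / n_trials


if __name__ == "__main__":
    count_mse, gauss_mse = mean_squared_errors()
    print(f"counting: {count_mse:.5f}  Gaussian fit: {gauss_mse:.5f}")
```

With 10 members, the counting error should be close to (2/9)/10, and the Gaussian fit error should be noticeably smaller. Because this sketch also estimates the variance from the ensemble, its advantage is somewhat less than in the known-variance limit.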

The accuracy of the tercile probability estimates affects probability forecast skill measures such as the commonly used ranked probability skill score (RPSS). Reducing the variance of the tercile probability estimate is shown to increase the RPSS. We examined this connection in the perfect model setting used extensively in predictability studies in which the “observations” are assumed to be indistinguishable from an arbitrary ensemble member. We find the expected RPSS in terms of the above- and below-normal tercile probabilities and, for Gaussian variables, in terms of the ensemble mean and variance. Finite ensemble size degrades the expected RPSS, conceptually similar to the way that finite ensemble size reduces the expected correlation (Sardeshmukh et al. 2000; Richardson 2001).
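For readers unfamiliar with the score, the following Python sketch computes the ranked probability score (RPS) and the RPSS relative to equal-odds climatology for tercile forecasts. It uses the standard cumulative-probability definition of the RPS without normalization by the number of categories, which does not affect the RPSS. The function names `rps` and `rpss` and the category encoding (0 = below normal, 1 = normal, 2 = above normal) are assumptions made for this illustration.

```python
def rps(probs, obs_category):
    """Ranked probability score of one categorical forecast.

    probs: forecast probabilities ordered from lowest to highest category.
    obs_category: index of the category that was observed.
    """
    cum_forecast = 0.0
    cum_observed = 0.0
    score = 0.0
    for k, p in enumerate(probs):
        cum_forecast += p
        cum_observed += 1.0 if k == obs_category else 0.0
        score += (cum_forecast - cum_observed) ** 2
    return score


def rpss(forecasts, observations, reference=(1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0)):
    """Skill score 1 - RPS/RPS_ref, summed over all forecast-observation pairs."""
    rps_forecast = sum(rps(f, o) for f, o in zip(forecasts, observations))
    rps_reference = sum(rps(reference, o) for o in observations)
    return 1.0 - rps_forecast / rps_reference


if __name__ == "__main__":
    forecasts = [(0.5, 0.3, 0.2), (0.2, 0.3, 0.5), (0.3, 0.4, 0.3)]
    observations = [0, 2, 1]
    print(f"RPSS = {rpss(forecasts, observations):.3f}")
```

A forecast that always issues the equal-odds probabilities has an RPSS of zero in this formulation, and a forecast that always assigns probability one to the observed category has an RPSS of one.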

Many of the analytical results are obtained assuming that the ensemble variables have a Gaussian distribution. We test the robustness of these findings using simulated seasonal precipitation from an ensemble of GCM integrations forced by observed SST, subsampling from the full ensemble to estimate sampling error. We find that the theoretical results give a good description of the average variance of the counting estimate, particularly in a spatially averaged sense. This means that the theoretical scalings can be used in practice to understand how sampling error depends on ensemble size and level of predictability. Although the GCM-simulated precipitation departs somewhat from being Gaussian, the Gaussian fit estimate had smaller error than the counting estimate. The behavior of the GLM estimate is similar to that of the Gaussian fit estimate. The parametric estimators based on ensemble mean had the best performance; adding ensemble variance as a parameter did not reduce error. This means that with the moderate ensemble sizes typically used, differences between the forecast tercile probabilities and the equal-odds probabilities are due essentially to shifts of the forecast mean away from its climatological value rather than to changes in variance. Since differences between the forecast tercile probabilities and the equal-odds probabilities are a measure of predictability, this result means that predictability in the GCM is due to changes in ensemble mean rather than changes in spread. This result is consistent with Tippett et al. (2004), who found that differences between forecast and climatological GCM seasonal precipitation distributions as measured by relative entropy were basically due to changes in the mean rather than changes in the variance.
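The way a shift of the forecast mean alone moves the tercile probabilities away from equal odds can be made concrete with a short calculation. The Python sketch below assumes Gaussian forecasts with a fixed forecast variance and uses the relation between the climatological tercile boundary and the signal-to-noise ratio given in the appendix. The function name `tercile_probabilities` and its arguments are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()
# Left tercile boundary of a standard normal distribution, about -0.4307.
X0 = STD_NORMAL.inv_cdf(1.0 / 3.0)


def tercile_probabilities(mean_shift, s2=0.0):
    """Below-, near-, and above-normal probabilities for a Gaussian forecast.

    mean_shift: forecast mean divided by the forecast standard deviation.
    s2: signal-to-noise ratio S^2, which sets the climatological tercile
        boundaries, in units of the forecast standard deviation, to
        +/- X0 * sqrt(1 + S^2).
    """
    left = X0 * sqrt(1.0 + s2)
    right = -left
    below = STD_NORMAL.cdf(left - mean_shift)
    above = 1.0 - STD_NORMAL.cdf(right - mean_shift)
    return below, 1.0 - below - above, above


if __name__ == "__main__":
    for shift in (0.0, 0.25, 0.5, 1.0):
        below, near, above = tercile_probabilities(shift, s2=0.2)
        print(f"shift={shift:4.2f}  below={below:.3f}  near={near:.3f}  above={above:.3f}")
```

In this idealized setting, a positive mean shift lowers the below-normal probability and raises the above-normal probability even though the forecast spread never changes.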

The reduced sampling error of the Gaussian fit and GLM is shown to translate into better simulation skill when the tercile probabilities are compared to actual observations. Examining the dependence of the RPSS on ensemble size shows that, although RPSS increases with ensemble size, model error limits the rate of improvement compared to the ideal case. Calibration improves RPSS, regardless of the probability estimator used. However, estimators with larger sampling error retain their disadvantage in RPSS even after calibration. The application of the Gaussian fit estimator to specific years shows that the parametric fit achieves its advantages while also producing probabilities that are spatially smoother than those estimated by counting.

In summary, our main conclusion is that carefully applied parametric estimators provide noticeably more accurate tercile probabilities than do counting estimates. This conclusion is completely rigorous for variables with Gaussian statistics. We find that for variables that deviate modestly from Gaussianity, such as seasonal precipitation totals, the error of the Gaussian fit tercile probabilities is smaller than that of the counting estimates. More substantial deviation from Gaussianity may be treated by transforming the data or using the related GLM approach.

## Acknowledgments

We thank Lisa Goddard and Simon Mason for stimulating discussions and Benno Blumenthal for the IRI Data Library. GCM integrations were performed by David DeWitt, Shuhua Li, and Lisa Goddard with computer resources provided in part by the NCAR CSL. Comments from two anonymous reviewers greatly improved the clarity of this paper. IRI is supported by its sponsors and NOAA Office of Global Programs Grant NA07GP0213. The views expressed herein are those of the authors and do not necessarily reflect the views of NOAA or any of its subagencies.

## REFERENCES

Atger, F., 1999: The skill of ensemble prediction systems. *Mon. Wea. Rev.*, **127**, 1941–1953.

Barnston, A. G., S. J. Mason, L. Goddard, D. G. Dewitt, and S. E. Zebiak, 2003: Multimodel ensembling in seasonal climate forecasting at IRI. *Bull. Amer. Meteor. Soc.*, **84**, 1783–1796.

Buizza, R., and T. N. Palmer, 1998: Impact of ensemble size on ensemble prediction. *Mon. Wea. Rev.*, **126**, 2503–2518.

DelSole, T., 2004: Predictability and information theory. Part I: Measures of predictability. *J. Atmos. Sci.*, **61**, 2425–2440.

DelSole, T., and M. K. Tippett, 2007: Predictability, information theory, and stochastic models. *Rev. Geophys.*, in press.

Epstein, E. S., 1969: A scoring system for probability forecasts of ranked categories. *J. Appl. Meteor.*, **8**, 985–987.

Hamill, T. M., J. S. Whitaker, and X. Wei, 2004: Ensemble reforecasting: Improving medium-range forecast skill using retrospective forecasts. *Mon. Wea. Rev.*, **132**, 1434–1447.

Kharin, V. V., and F. W. Zwiers, 2003: Improved seasonal probability forecasts. *J. Climate*, **16**, 1684–1701.

Kleeman, R., 2002: Measuring dynamical prediction utility using relative entropy. *J. Atmos. Sci.*, **59**, 2057–2072.

Kleeman, R., and A. M. Moore, 1999: A new method for determining the reliability of dynamical ENSO predictions. *Mon. Wea. Rev.*, **127**, 694–705.

Kumar, A., A. G. Barnston, and M. P. Hoerling, 2001: Seasonal predictions, probabilistic verifications, and ensemble size. *J. Climate*, **14**, 1671–1676.

McCullagh, P., and J. A. Nelder, 1989: *Generalized Linear Models*. Chapman and Hall, 387 pp.

Murphy, A. H., 1973: A new vector partition of the probability score. *J. Appl. Meteor.*, **12**, 595–600.

New, M., M. Hulme, and P. Jones, 2000: Representing twentieth-century space–time climate variability. Part II: Development of 1901–96 monthly grids of terrestrial surface climate. *J. Climate*, **13**, 2217–2238.

Rajagopalan, B., U. Lall, and S. E. Zebiak, 2002: Categorical climate forecasts through regularization and optimal combination of multiple GCM ensembles. *Mon. Wea. Rev.*, **130**, 1792–1811.

Richardson, D. S., 2001: Measures of skill and value of ensemble prediction systems, their interrelationship and the effect of ensemble size. *Quart. J. Roy. Meteor. Soc.*, **127**, 2473–2489.

Robertson, A. W., U. Lall, S. E. Zebiak, and L. Goddard, 2004: Improved combination of multiple atmospheric GCM ensembles for seasonal prediction. *Mon. Wea. Rev.*, **132**, 2732–2744.

Roeckner, E., and Coauthors, 1996: The atmospheric general circulation model ECHAM-4: Model description and simulation of present-day climate. Max Planck Institute for Meteorology Tech. Rep. 218, 90 pp.

Sardeshmukh, P. D., G. P. Compo, and C. Penland, 2000: Changes of probability associated with El Niño. *J. Climate*, **13**, 4268–4286.

Tippett, M. K., R. Kleeman, and Y. Tang, 2004: Measuring the potential utility of seasonal climate predictions. *Geophys. Res. Lett.*, **31**, L22201, doi:10.1029/2004GL021575.

Tippett, M. K., L. Goddard, and A. G. Barnston, 2005: Statistical–dynamical seasonal forecasts of central-southwest Asian winter precipitation. *J. Climate*, **18**, 1831–1843.

Wilks, D. S., 2002: Smoothing forecast ensembles with fitted probability distributions. *Quart. J. Roy. Meteor. Soc.*, **128**, 2821–2836.

## APPENDIX

### Error in Estimating Tercile Probabilities

#### Variance of the counting estimate

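The variance of the counting estimate $p_N$ is $(p - p^2)/N$. When the forecast precipitation anomaly $f$ has a Gaussian distribution with mean $\mu_f$ and variance $\sigma_f^2$, the probability $p$ of the below-normal category is

$$p = \Phi\left(\frac{x_b - \mu_f}{\sigma_f}\right), \tag{A1}$$

where $\Phi$ is the cumulative distribution function of the standard normal distribution and $x_b$ is the left tercile boundary of the climatological distribution. In this case, the counting estimate variance depends on the forecast mean and variance through

$$\frac{p - p^2}{N} = \frac{1}{N}\left[\Phi\left(\frac{x_b - \mu_f}{\sigma_f}\right) - \Phi^2\left(\frac{x_b - \mu_f}{\sigma_f}\right)\right].$$

Suppose that the precipitation anomaly $x$ is joint normally distributed with mean zero and variance $\sigma_x^2$. In this paper, the forecast $f$ is the precipitation anomaly $x$ conditioned on the SST. In this case, the left tercile boundary $x_b$ of the climatological pdf is $\sigma_x x_0$, where $x_0 = \Phi^{-1}(1/3) \approx -0.4307$ is the left tercile boundary of a mean zero normal distribution with unit variance. Averaging $x^2$ over all forecasts gives

$$\sigma_x^2 = \langle \mu_f^2 \rangle + \sigma_f^2,$$

which decomposes the climatological variance $\sigma_x^2$ into signal and noise contributions. We denote the signal variance $\langle \mu_f^2 \rangle$ by $\sigma_s^2$ and define the signal-to-noise ratio by $\sigma_s^2 = \sigma_f^2 S^2$, so that $\sigma_x^2 = \sigma_f^2 (1 + S^2)$. To average the counting estimate variance over forecasts for a given $S^2$, we introduce the variable $\mu = \mu_f/\sigma_f$ and use the fact that $x_b/\sigma_f = x_0 \sqrt{1 + S^2}$ to obtain

$$\langle p - p^2 \rangle = \frac{1}{\sqrt{2\pi}\, S} \int_{-\infty}^{\infty} \left[\Phi\left(x_0\sqrt{1 + S^2} - \mu\right) - \Phi^2\left(x_0\sqrt{1 + S^2} - \mu\right)\right] e^{-\mu^2/(2 S^2)}\, d\mu, \tag{A6}$$

which gives $\langle p - p^2 \rangle$ as a function of the signal-to-noise ratio $S^2$. Numerical evaluation of the integral in (A6) suggests that we express this dependence using a new parameter $g \equiv (1 + S^2)^{-1/2}$:

[Display equation missing in source.]

The dependence on $g$ is expanded in powers of $(g - 1)$ about $g = 1$, corresponding to the signal-to-noise ratio $S^2$ being zero. The first term is found from

[Display equation missing in source.]

The resulting first-order approximation of $\langle p - p^2 \rangle$ is

[Display equation (A11) missing in source.]

for small $S^2$. A second-order [in powers of $(g - 1)$] approximation is more accurate for larger values of $S$ and is given by

[Display equation missing in source.]

Since $S^2$ is fairly small for seasonal forecasts, we will use the approximation in (A11).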
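The average $\langle p - p^2 \rangle$ can also be evaluated numerically, which gives an independent check on any analytical approximation. The Python sketch below is an illustration rather than the authors' calculation. It assumes that the standardized forecast mean $\mu = \mu_f/\sigma_f$ is Gaussian with mean zero and variance $S^2$, that the below-normal probability is $p = \Phi(x_0\sqrt{1 + S^2} - \mu)$, and that a simple trapezoid rule is accurate enough. The function name `mean_counting_factor` and its arguments are hypothetical.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

PHI = NormalDist().cdf
# Left tercile boundary of a standard normal distribution, about -0.4307.
X0 = NormalDist().inv_cdf(1.0 / 3.0)


def mean_counting_factor(s2, n_points=4001, half_width=8.0):
    """Trapezoid-rule estimate of <p - p^2> for signal-to-noise ratio s2.

    Dividing the result by the ensemble size N gives the average
    counting estimate variance under the stated Gaussian assumptions.
    """
    if s2 == 0.0:
        # No signal: every forecast has p = 1/3, so p - p^2 = 2/9.
        p = PHI(X0)
        return p - p * p
    s = sqrt(s2)
    boundary = X0 * sqrt(1.0 + s2)  # x_b / sigma_f
    lower = -half_width * s
    step = 2.0 * half_width * s / (n_points - 1)
    total = 0.0
    for i in range(n_points):
        mu = lower + i * step
        p = PHI(boundary - mu)
        density = exp(-mu * mu / (2.0 * s2)) / (sqrt(2.0 * pi) * s)
        end_weight = 0.5 if i in (0, n_points - 1) else 1.0
        total += end_weight * (p - p * p) * density
    return total * step


if __name__ == "__main__":
    for s2 in (0.0, 0.1, 0.25, 0.5, 1.0):
        print(f"S^2 = {s2:4.2f}   <p - p^2> = {mean_counting_factor(s2):.4f}")
```

The value at $S^2 = 0$ is 2/9, and the average decreases as the signal-to-noise ratio grows, because forecasts with a stronger signal have tercile probabilities farther from 1/3.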

#### Error of the Gaussian fit estimate

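We fit an $N$-member forecast ensemble with a Gaussian distribution, using its sample mean $m_f$ and sample variance $s_f^2$ defined by

$$m_f = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad s_f^2 = \frac{1}{N-1}\sum_{i=1}^{N} (x_i - m_f)^2,$$

where $x_i$ denotes the value of the $i$th member of the ensemble. Based on this information and using (A1), the Gaussian fit estimate $g_N$ of the probability of the below-normal category is

$$g_N = \Phi\left(\frac{x_b - m_f}{s_f}\right).$$

We first consider the case of no signal, in which the forecast mean $\mu_f$ is zero and the true tercile probability is 1/3 for all forecasts. Also, the forecast variance $\sigma_f^2$ is equal to the climatological variance $\sigma_x^2$ and does not have to be estimated from the ensemble. In this case, the squared error of the Gaussian fit estimate is

$$\left[\Phi\left(\frac{x_b - m_f}{\sigma_x}\right) - \frac{1}{3}\right]^2 = \frac{e^{-x_0^2}}{2\pi}\,\frac{m_f^2}{\sigma_x^2} + O(m_f^3),$$

where we have expanded in powers of $m_f$ and used the fact that $x_b = \sigma_x x_0$. The term $O(m_f^3)$ is small and can be neglected for sufficiently large ensemble size $N$; neglecting the higher-order terms leads to an underestimate in the final result of about 3.6% for $N = 10$. Since $\langle m_f^2 \rangle = \sigma_x^2/N$, the average (over forecasts) variance of the Gaussian fit tercile probability is

$$\left\langle \left(g_N - \tfrac{1}{3}\right)^2 \right\rangle \approx \frac{e^{-x_0^2}}{2\pi N}.$$

Next, we suppose that $\sigma_f$ is constant and known. This means that there is predictability due to changes in forecast mean but not due to changes in forecast variance. The squared error of the Gaussian fit probability estimate is

$$(g_N - p)^2 = \left[\Phi\left(\frac{x_b - m_f}{\sigma_f}\right) - \Phi\left(\frac{x_b - \mu_f}{\sigma_f}\right)\right]^2.$$

Expanding in powers of $(m_f - \mu_f)$ about $m_f = \mu_f$ gives that the squared error is

$$\frac{1}{2\pi \sigma_f^2} \exp\left[-\frac{(x_b - \mu_f)^2}{\sigma_f^2}\right] (m_f - \mu_f)^2 + O\left((m_f - \mu_f)^3\right).$$

Since $\langle (\mu_f - m_f)^2 \rangle = \sigma_f^2/N$, the average (over realizations of the ensemble) of the squared error of the Gaussian fit is

$$\frac{1}{2\pi N} \exp\left[-\frac{(x_b - \mu_f)^2}{\sigma_f^2}\right].$$

Averaging over $\mu_f$ with mean zero and variance $\sigma_s^2$ gives that the average variance of the Gaussian fit tercile probability is

$$\frac{1}{2\pi N \sqrt{1 + 2S^2}} \exp\left[-\frac{x_0^2 (1 + S^2)}{1 + 2S^2}\right],$$

where we have used $x_b^2/\sigma_f^2 = x_0^2 (1 + S^2)$.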
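The size of the Gaussian fit advantage can be checked with a few lines of arithmetic. The Python sketch below is illustrative and rests on stated assumptions: a known forecast variance, a first-order expansion of the squared error, and a Gaussian-distributed forecast mean, which together give an average error variance of $\exp[-x_0^2(1 + S^2)/(1 + 2S^2)]/(2\pi N\sqrt{1 + 2S^2})$. The zero-signal counting variance is $(2/9)/N$. All function names are hypothetical.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

# Left tercile boundary of a standard normal distribution, about -0.4307.
X0 = NormalDist().inv_cdf(1.0 / 3.0)


def gaussian_fit_variance(n_members, s2=0.0):
    """First-order average error variance of the known-variance Gaussian fit."""
    exponent = -X0**2 * (1.0 + s2) / (1.0 + 2.0 * s2)
    return exp(exponent) / (2.0 * pi * n_members * sqrt(1.0 + 2.0 * s2))


def counting_variance_no_signal(n_members):
    """Counting estimate variance (p - p^2)/N with p = 1/3."""
    return (2.0 / 9.0) / n_members


def equivalent_counting_size(n_members):
    """Ensemble size at which counting matches the zero-signal Gaussian fit."""
    return n_members * counting_variance_no_signal(1) / gaussian_fit_variance(1)


if __name__ == "__main__":
    n = 10
    ratio = gaussian_fit_variance(n) / counting_variance_no_signal(n)
    print(f"variance ratio at S^2 = 0: {ratio:.3f}")
    print(f"equivalent counting ensemble size: {equivalent_counting_size(n):.1f}")
```

Under these assumptions, the variance ratio at zero signal is about 0.6, which corresponds to the roughly 40% reduction quoted in the summary, and a 10-member Gaussian fit matches a counting estimate from about 17 members. Estimating the variance from the ensemble and retaining higher-order terms would reduce this advantage somewhat.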