## 1. Introduction

Climate forecasts are associated with uncertainty because of the stochastic nature of the climate system. The level of uncertainty can be conveyed in a quantitative way by using probabilities (Murphy 1977; Leith 1973; Zwiers 1996; Kharin and Zwiers 2001). Owing to their ability to quantify the uncertainty, probabilistic forecasts are of potentially greater value to decision makers than deterministic forecasts (Thompson 1962; Murphy 1977; Krzysztofowicz 1983).

The level of uncertainty can be quantified in several ways. For example, it can be estimated subjectively by expert assessment. It can also be derived from the confidence intervals of statistical forecasts. For model forecasts, the level of uncertainty can be derived from an ensemble of model forecasts (Murphy and Winkler 1987; Tracton and Kalnay 1993; Vislocky and Fritsch 1995; Wilks 1995; Doblas-Reyes et al. 2000; Palmer et al. 2000, 2004).

A single-model ensemble approach is now widely used for climate forecasting. Averaging over a single-model ensemble of forecasts initiated from different initial conditions reduces the climate noise in a forecast of the mean climate state. A multimodel ensemble technique is a relatively recent contribution to climate forecasting. It can reduce the systematic and random error variances associated with the errors in individual model formulations (Doblas-Reyes et al. 2000; Fritsch et al. 2000; Stephenson and Doblas-Reyes 2000; Kharin and Zwiers 2002; Peng et al. 2002; Palmer et al. 2004).

Probabilistic interpretation of the forecasts from an ensemble prediction system is conducted by a range of operational centers, which produce seasonal forecasts. A probabilistic forecast is formulated in terms of the probabilities of a set of categories. Boundaries of the categories are defined based on the climatological probability distribution function (pdf) constructed using hindcast data. The forecast probability of each category is estimated as a portion of the cumulative probability of a forecast sample associated with this category (e.g., Kharin and Zwiers 2003).

The most commonly used method in operational practice employs a single-model ensemble (WMO 2007, 2008). Multimodel ensemble prediction systems are exploited by a few operational prediction centers, with different multimodel combination methods being applied. For instance, the International Research Institute for Climate and Society (IRI) multimodel ensemble prediction system consists of six models, within which individual forecasts are combined with weights based on past model forecast performance (Robertson et al. 2004). However, estimation of robust weights requires sufficiently long hindcast series. Given the existing short series, the only practical way of employing multimodel prediction remains a combination of individual model forecasts that is not weighted by past skill (Kharin and Zwiers 2002; Peng et al. 2002; Hagedorn et al. 2005).

In both operational and research practice using a multimodel combination, the most commonly used method is pooling, in which the model weights do not depend on past individual model forecast skill. That is, all of the (bias-corrected) ensemble members from all the participating models are pooled into a single sample with equal weights (e.g., Hagedorn et al. 2005; Weigel et al. 2008). Such an approach is used operationally by the Meteorological Service of Canada (MSC). MSC uses four models, with each model producing 10 ensemble members in both the forecast and (yearly) hindcast datasets (WMO 2007).
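
As an illustrative sketch (not the operational code), pooling with equal member weights can be expressed as follows; the function name and the sample values are hypothetical:

```python
def pooled_tercile_probabilities(model_ensembles, lower, upper):
    """Pool bias-corrected ensemble members from all models with equal
    member weights and estimate the tercile-category probabilities as
    the fraction of pooled members falling in each category."""
    pooled = [x for members in model_ensembles for x in members]
    n = len(pooled)
    p_bn = sum(x < lower for x in pooled) / n  # below normal
    p_an = sum(x > upper for x in pooled) / n  # above normal
    p_nn = 1.0 - p_bn - p_an                   # near normal
    return p_bn, p_nn, p_an

# With unequal ensemble sizes, pooling implicitly weights each model
# in proportion to its number of members.
probs = pooled_tercile_probabilities(
    [[-1.2, -0.4, 0.1], [0.2, 0.5, 0.9, 1.4]], lower=-0.43, upper=0.43)
```

Note that the second model here contributes four of the seven pooled members and therefore carries the larger implicit weight.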

It should be noted that the differences in the individual model ensemble sizes do not restrict the use of pooling. In this case the model weights become proportional to their ensemble sizes (Robertson et al. 2004; Weigel et al. 2008). The main restriction on the use of pooling arises from the method of estimation of forecast probabilities, which implies that the climatological pdf is constructed on the basis of hindcast data and the forecast pdf is constructed on the basis of forecast data. This usage of both hindcast and forecast datasets requires consistency between the model weights in the hindcast and the forecast datasets. Otherwise, it appears that the climatological pdf is dominated by one subset of models, which have the larger ensembles in the hindcast, while the forecast pdf is dominated by another subset of models, which have the larger ensembles in the forecast.

In such a situation, it is reasonable to estimate a probabilistic forecast for each model separately and then to combine the forecasts obtained. For example, the European Multimodel Seasonal to Interannual Prediction (EUROSIP) system, consisting of three models, each having 41 ensemble members in the forecast and 11–15 ensemble members in (yearly) hindcast datasets, produces a multimodel forecast that is an average of the individual model probabilistic forecasts (information online at www.ecmwf.int/products/forecasts/seasonal/forecast/forecast_charts/eurosip_doc.htm). It should be noted that the forecast ensemble sizes of all three models are equal (41 members) and the simple average with equal weights is the most reasonable method of multimodel combination.

The Asia–Pacific Economic Cooperation (APEC) Climate Center (APCC) dataset has inconsistencies between the model weights in the hindcast and forecast datasets, with the individual model ensembles differing substantially in size. EUROSIP suggests a solution for the special case in which the model forecast ensembles are equal in size. However, to the best of our knowledge, the question of the probabilistic forecast based on a multimodel combination not calibrated on past skill, featuring both inconsistencies between the model weights in the hindcast and forecast datasets and substantial differences in the individual model ensemble sizes, has not been discussed in the literature. This was the motivation for the present study, which provided the basis for the multimodel ensemble global prediction method applied in APCC operational practice.

In this paper, we discuss the solutions for the above-mentioned problems we encountered while developing a probabilistic multimodel ensemble prediction system (PMME). In section 2, we present the datasets used in this study. Two main issues, multimodel combination and Gaussian approximation for precipitation, are discussed in sections 3 and 4. We introduce the operational PMME method in section 5, while an assessment of historical and real-time forecasts is presented in section 6. Section 7 gives a brief summary and discussion.

## 2. Data

The APCC concentrates international efforts in multimodel ensemble seasonal forecasting. Fifteen operational and research institutions from eight APEC member economies (Table 1) routinely provide the APCC with ensembles of dynamical seasonal predictions. These model outputs come in the form of ensembles of global forecast fields with their original model spatial resolutions. The time resolution is 1 month. The APCC is provided with data for only the forecast period that covers the second, third, and fourth months of the model integration. The lead time of the APCC seasonal forecasts is 1 month.

Individual model ensembles vary from 7 to 31 members in the forecast datasets and from 5 to 24 members in hindcast datasets. This variety, particularly the inconsistency between the individual model weights in the forecast and hindcast datasets and their associated restrictions on the multimodel combination methods, motivated the presented study. Each institution provides its specific set of forecast variables, with most of the forecasts including mean sea level pressure, precipitation, temperature at 850 hPa (hereafter, temperature), geopotential height at 500 hPa (Z500), zonal and meridional wind components at 850 and 200 hPa, etc.

The APCC issues operational probabilistic forecasts for precipitation, temperature, and Z500. During the preprocessing stage, the model outputs are interpolated into 2.5° latitude × 2.5° longitude grids and 3-month (seasonal) average fields are estimated. These ensembles of seasonally averaged fields are used in the APCC probabilistic multimodel ensemble prediction system as input data.

The individual model hindcasts cover different periods, from 12 to 31 yr in length, within the time interval from 1969 to 2005. For the development and verification of the method of multimodel combination, a sufficiently long overlap of the model hindcasts from a sufficiently large number of models is needed. Meanwhile, increasing the number of models reduces the overlapping period and vice versa. As a compromise between the number of participating models and the length of the overlapping period, the study was performed on the basis of seven models that provide the 21 yr (1983–2003) of overlapping hindcasts necessary for verification.

In this paper, the probabilistic forecast method developed at the APCC is discussed in detail for predictions of summer (June–August) precipitation and wintertime (December–February) temperature. The summary hindcast verification scores and assessments of real-time predictions are shown for all three forecast variables. For verification, we use precipitation data from the Climate Prediction Center Merged Analysis of Precipitation (Xie and Arkin 1997) dataset, as well as temperature and Z500 data from the National Centers for Environmental Prediction–National Center for Atmospheric Research reanalysis (Kanamitsu et al. 2002).

## 3. Multimodel combination

Two approaches are possible for developing a probabilistic multimodel ensemble forecast on the basis of a set of model ensembles. The first approach is pooling (e.g., Barnston et al. 2003; Doblas-Reyes et al. 2005), and the second is to separately compute a probabilistic forecast for each individual model and then combine them.

The second approach can be formalized with the total probability formula:

$$P(E_j) = \sum_{i=1}^{M} P(E_j \mid \mathrm{mdl}_i)\,P(\mathrm{mdl}_i), \qquad (1)$$

where *P* is a forecast probability, *E*_{j} is the *j* event [i.e., either above normal (AN), near normal (NN), or below normal (BN)], mdl_{i} is the *i* model, and *M* is the number of models. In this equation, *P*(*E*_{j}|mdl_{i}) is a forecast probability of the event *E*_{j} conditioned on the *i* model (i.e., the *i*-model forecast of the *j* event), and *P*(mdl_{i}) is an unconditional probability of the *i* model, which is a model weight in this context.

The choice of the model weights depends upon the ratio between 1) the standard errors of the individual model ensemble means, which represent the 68% confidence intervals of the sampling errors (Särndal et al. 1992) of individual models associated with the model ensemble spread, and 2) the difference between individual model forecasts caused by both the differences in model formulations and sampling errors. If the difference between the individual model forecasts is comparable to or less than the model standard errors, the optimal model weights are inversely proportional to the squared standard error of each individual model forecast (Taylor 1997). Alternatively, if the difference between the model forecasts is much larger than the model standard errors, one can neglect the standard errors and combine the model forecasts with equal weights.
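
A minimal sketch of the inverse-squared-standard-error weighting described above; the function name and sample values are illustrative:

```python
def inverse_squared_error_weights(spreads, sizes):
    """Model weights inversely proportional to the squared standard error
    of each model ensemble mean, eps_mu**2 = sigma**2 / n."""
    inv = [n / sigma ** 2 for sigma, n in zip(spreads, sizes)]
    total = sum(inv)
    return [w / total for w in inv]

# Two models with equal ensemble spread: the larger ensemble gets
# the larger weight.
weights = inverse_squared_error_weights(spreads=[1.0, 1.0], sizes=[20, 5])
```

With equal spreads the weights reduce to the ratio of the ensemble sizes, which is the pooling limit discussed in section 1.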

The sampling error of an individual model forecast is represented by the standard error of the model ensemble mean,

$$\varepsilon_\mu = \frac{\sigma}{\sqrt{n}}, \qquad (2)$$

where *σ* is the standard deviation of the model spread and *n* is the model ensemble size. The difference between the model ensemble means is represented by an unbiased intermodel standard deviation of the individual model ensemble means:

$$\sigma_{mm} = \sqrt{\frac{1}{M-1}\sum_{i=1}^{M}\left(\mu_i - \mu_{mm}\right)^2},$$

where *M* is the number of models, *μ*_{i} is the *i*-model ensemble mean, and *μ*_{mm} is the mean of the individual model ensemble means. For each model, we computed the yearly and 21-yr mean ratios (*R*) between ɛ_{μ} and *σ*_{mm}, with the mean ratio being estimated as

$$R = \left\langle \frac{\varepsilon_\mu}{\sigma_{mm}} \right\rangle,$$

where the brackets 〈···〉 denote the average over 21 yr. It should be noted that *R* can exceed one because the sampling error in the numerator is represented by the standard error, that is, its 68% confidence interval, which depends only upon the ensemble spread and size; meanwhile, in the denominator, the intermodel standard deviation is itself subject to that very sampling error.
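
The yearly ratio described above can be sketched as follows; the helper name and toy ensembles are illustrative, and the 21-yr averaging is omitted:

```python
import math
from statistics import mean, stdev

def yearly_ratio_R(ensembles):
    """For one year, the ratio of each model's standard error of the
    ensemble mean (sigma / sqrt(n)) to the unbiased intermodel standard
    deviation of the individual model ensemble means."""
    means = [mean(e) for e in ensembles]
    sigma_mm = stdev(means)  # unbiased, with M - 1 in the denominator
    return [stdev(e) / math.sqrt(len(e)) / sigma_mm for e in ensembles]

# Two toy 3-member ensembles with small spread but well-separated means
# give small R values: the intermodel difference dominates.
ratios = yearly_ratio_R([[0.0, 1.0, 2.0], [10.0, 11.0, 12.0]])
```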

Examples of the spatial distribution of the 21-yr mean *R* values for the 20- and 5-ensemble-member models participating in the multimodel ensemble of seven models with ensemble sizes varying from 5 to 20 are shown in Figs. 1a–d. An analysis of the maps of *R* shows that 1) for both temperature and precipitation the *R* values for the 5-ensemble-member model essentially exceed those for the 20-ensemble-member model; 2) the *R* values for precipitation are larger than those for temperature; 3) for both precipitation and temperature in the tropical Pacific the standard errors can be neglected, that is, the model forecasts may be combined with equal weights; and 4) in the extratropics, particularly over the Northern Hemisphere continents, the standard errors become comparable with the intermodel standard deviation or even exceed it (for the 5-ensemble-member model). This means that in the extratropics, when choosing the individual model weights for combining models of different ensemble sizes, the standard errors cannot be neglected.

Along with spatial variability, *R* features substantial temporal variability (Figs. 1e and 1f). There is a large area, mainly in the extratropics, where the number of years with a yearly *R* > 1 is between 7 and 14 out of 21 yr. That is, the number of years when the standard errors are larger than the intermodel standard deviation is comparable with the number of years when they are much smaller.

Thus, this study shows that for the globe as a whole neither the intermodel difference nor the standard error can be treated as prevailing; that is, neither of the above-suggested weights is appropriate for a method of global multimodel forecasting. Such uncertainty suggests that for a global forecast method it is reasonable to choose a compromise approach to the model weights. We have suggested the geometric mean of the alternative weight values; that is, we assign model weights that are inversely proportional to the maximum error in forecast probability associated with the standard error.

The maximum error in forecast probability, ɛ_{P}, is related to the standard error of the mean defined in Eq. (2) as

$$\varepsilon_P = \int_{\mu - \varepsilon_\mu/2}^{\mu + \varepsilon_\mu/2} f(X)\,dX, \qquad (3)$$

where *f*(*X*) is a Gaussian pdf,

$$f(X) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left[-\frac{(X-\mu)^2}{2\sigma^2}\right], \qquad (4)$$

and |*X* − *μ*| < ɛ_{μ}/2. For the standard errors associated with ensemble sizes varying from 5 to 31, the exponent term in Eq. (4) ranges within the interval from 0.98 to 0.9999 and can be treated as a constant. Equation (3) becomes

$$\varepsilon_P \approx \frac{\varepsilon_\mu}{\sigma\sqrt{2\pi}} = \frac{1}{\sqrt{2\pi n}}. \qquad (5)$$

Therefore, for each individual model forecast, we assign a weight proportional to the square root of the model ensemble size, *n*_{i}. Taking into account that the model weights must sum to one, the final forecast formula for each *j* event is

$$P(E_j) = \frac{\sum_{i=1}^{M}\sqrt{n_i}\,P(E_j \mid \mathrm{mdl}_i)}{\sum_{i=1}^{M}\sqrt{n_i}}. \qquad (6)$$
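
The square-root-of-ensemble-size combination in Eq. (6) can be sketched as follows, with hypothetical probabilities and ensemble sizes:

```python
import math

def pmme_probability(cond_probs, sizes):
    """Combine individual-model forecast probabilities of one category
    with weights proportional to the square root of each model's
    ensemble size, normalized to sum to one."""
    w = [math.sqrt(n) for n in sizes]
    return sum(wi * p for wi, p in zip(w, cond_probs)) / sum(w)

# Hypothetical AN-category forecasts from models with 4 and 16 members:
# the 16-member model gets twice, not four times, the weight.
p_an = pmme_probability(cond_probs=[0.6, 0.3], sizes=[4, 16])
```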

## 4. Examination of Gaussian pdf for precipitation

An operational method should be technologically simple and not time consuming. In accordance with this requirement, PMME is based on a Gaussian approximation of both the climatological and forecast pdfs. This approximation is appropriate for temperature and Z500, whose pdfs are close to Gaussian, whereas the precipitation pdf is closer to a gamma distribution (Wilks 1995). Approximation of the precipitation pdf with a Gaussian pdf may result in overestimation (underestimation) of the probability of the BN (AN) category due to differences in the shapes of the Gaussian and gamma pdfs.

We have performed the Anderson–Darling goodness-of-fit test (A–D test) for normality of the precipitation data (Stephens 1974). The A–D test is more sensitive to deviations in the tails of the distribution than the Kolmogorov–Smirnov test (K–S test). In contrast to the K–S test, the A–D test allows for adjustment for the sample size and can be used with small samples. However, its critical values are tabulated for only a few theoretical distributions; in particular, a table of critical values is available for Gaussian pdfs but not for gamma pdfs. Results from the A–D test for observations at each grid point are shown in Fig. 2a. The grid points where the precipitation pdfs significantly differ from the Gaussian are randomly scattered over the globe, with regions of contiguous grid points of significant difference being confined to southern and northern Africa, southwestern Asia, Australia, the eastern equatorial Pacific, and Antarctica. It is interesting to note that these regions tend to be confined to the areas of low precipitation. Meanwhile, for most of the globe, the empirical summer precipitation pdf does not differ significantly from the Gaussian, given the 21-yr series size and the 5% significance level.

We have also assessed the differences between the forecast probabilities predicted based on Gaussian and gamma pdfs. The maximum difference through 21 hindcast years between precipitation forecast probabilities for the AN and BN categories estimated by using Gaussian and gamma pdfs is shown in Figs. 2b and 2c. Throughout most of the globe, the maximum difference does not exceed 10%, while the areas where the maximum difference exceeds 10% are mainly confined to comparatively small areas within the regions of low precipitation (shading in Fig. 2 shows the regions where cumulative summer precipitation is less than 50 mm).

In summary, both the A–D test and an examination of the differences in forecast probabilities based on the different pdfs show that the Gaussian approximation of the precipitation pdf is appropriate for most of the globe. In addition, the possible difference in the forecast probability due to the use of a Gaussian pdf instead of a gamma one does not exceed 10% over nearly the whole globe, except for a few low-precipitation areas. Hence, we use the Gaussian approximation of the precipitation pdf.

## 5. APCC probabilistic multimodel ensemble operational forecast

The APCC operational seasonal forecasts are issued in the form of tercile-based categorical probabilities (hereafter, tercile probabilities), that is, the probability of the BN, NN, and AN categories, with respect to climatology. In this study, similar to many other studies (e.g., Kharin and Zwiers 2003; Boer 2005), a Gaussian approximation was applied to estimate tercile probabilities.

The APCC forecast procedure consists of two stages. In the first stage, the individual model probabilistic forecasts are estimated. The lower (*x*_{b}) and upper (*x*_{a}) terciles are estimated as *x*_{b} = *μ* − 0.43*σ* and *x*_{a} = *μ* + 0.43*σ*, respectively, with *μ* and *σ* being the mean and standard deviation of the hindcast sample. The forecast probability of each category is estimated as the portion of the cumulative probability of the forecast sample associated with this category. In the second stage, the individual model probabilistic forecasts for each category are combined using Eq. (6).
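
The first stage can be sketched as follows, assuming Gaussian climatological and forecast pdfs; the ±0.43*σ* offsets are the Gaussian tercile boundaries, and all names and values are illustrative:

```python
import math

def gaussian_cdf(x, mu, sigma):
    """Cumulative distribution function of a Gaussian pdf."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def tercile_probabilities(mu_c, sigma_c, mu_f, sigma_f):
    """Tercile-category probabilities of a Gaussian forecast pdf
    (mu_f, sigma_f) evaluated against the tercile boundaries
    mu_c -/+ 0.43 sigma_c of the Gaussian climatological pdf."""
    x_b = mu_c - 0.43 * sigma_c  # lower tercile
    x_a = mu_c + 0.43 * sigma_c  # upper tercile
    p_bn = gaussian_cdf(x_b, mu_f, sigma_f)
    p_an = 1.0 - gaussian_cdf(x_a, mu_f, sigma_f)
    return p_bn, 1.0 - p_bn - p_an, p_an

# A forecast identical to climatology gives near-equal tercile probabilities.
probs = tercile_probabilities(0.0, 1.0, 0.0, 1.0)
```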

An example of a probabilistic seasonal forecast issued by the APCC for temperature is shown in Fig. 3. It shows the forecast probability of each category separately, as well as a combined map based on the three category probabilities. The combined map (Fig. 3d) shows the regions where one of the categories dominates in the corresponding color, while the regions where the forecast pdf does not significantly differ from the climatological one are left uncolored.

The significance of the difference between the forecast and climatological pdfs is assessed with the chi-square (*χ*^{2}) test. We estimate the statistic as

$$\chi^2 = \sum_{j=1}^{3}\frac{n\left[P(E_j) - 0.333\right]^2}{0.333}, \qquad (7)$$

where *n* is the sum of the ensemble sizes of the individual models, *P*(*E*_{j}) is the forecast probability of the *j* event, and 0.333 is the expected (climatological) probability of all three equiprobable categories.

Under the null hypothesis, which corresponds to no significant difference from the climatological probability distribution, this statistic has a *χ*^{2} probability distribution with two degrees of freedom. If the null hypothesis is rejected at the 5% significance level, the largest forecast categorical probability is marked in the combined map with its respective color. It is worth noting that the threshold probabilities associated with this test are very close to those estimated as the 95% confidence interval of climatological probability based on binomial probability distribution.
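
This test can be sketched as follows; the function name and sample probabilities are illustrative, and 5.991 is the tabulated 5% critical value of *χ*^{2} with two degrees of freedom:

```python
def chi_square_category_test(probs, n, critical=5.991):
    """Chi-square statistic for the departure of the three forecast
    category probabilities from the equiprobable climatological value
    of 1/3; 'critical' is the 5% threshold for two degrees of freedom."""
    third = 1.0 / 3.0
    stat = sum(n * (p - third) ** 2 / third for p in probs)
    return stat, stat > critical

# A sharp forecast from 50 pooled members differs from climatology;
# an exactly climatological forecast does not.
stat, differs = chi_square_category_test([0.6, 0.25, 0.15], n=50)
```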

## 6. Prediction assessment

Following the recommendations of the World Meteorological Organization Standardized Verification System for Long-Range Forecasts (SVS-LRF; WMO 2002), we assess the skill of the forecasts by means of the reliability (attributes) diagram (Murphy 1973; Murphy and Winkler 1977; Atger 2003, 2004; Jolliffe and Stephenson 2003; Wilks 1995) and the relative operating characteristics (Swets 1973; Mason 1982; Mason and Graham 1999). To assess the skill of the forecasts with respect to the climatological forecast in each category, we use the Brier skill score (BSS). Detailed descriptions of these verification methods and explanations of an interpretation of the respective verification scores are given in Wilks (1995).

In this study, all the verification scores were computed on the 21-yr cross-validated hindcasts from 1983 to 2003 for the globe, the tropics (20°S–20°N), and the northern extratropics (20°–90°N). The statistical significance of the obtained verification scores has been assessed using a Monte Carlo approach. We randomly scrambled the forecast fields in the time domain 500 times. For each of the 500 Monte Carlo trials, we estimated all the verification scores for the grid points and the aggregated scores for the regions. Then, we assessed the significance as the probability that the verification scores obtained from the randomly scrambled forecasts exceed those obtained from the original sequence of forecasts.
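
The Monte Carlo assessment can be sketched as follows; the score function and series are illustrative:

```python
import random

def monte_carlo_p_value(score_fn, forecasts, observations,
                        trials=500, seed=1):
    """Significance of a verification score: the fraction of time-scrambled
    forecast series scoring at least as well as the original pairing."""
    actual = score_fn(forecasts, observations)
    rng = random.Random(seed)
    shuffled = list(forecasts)
    exceed = 0
    for _ in range(trials):
        rng.shuffle(shuffled)  # scramble the forecasts in the time domain
        if score_fn(shuffled, observations) >= actual:
            exceed += 1
    return exceed / trials

# A perfect pairing scored by negative absolute error is highly significant.
score = lambda f, o: -sum(abs(a - b) for a, b in zip(f, o))
p_value = monte_carlo_p_value(score, list(range(10)), list(range(10)))
```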

### a. The skill of PMME

#### 1) Historical forecasts

##### (i) Reliability (attributes) diagram

The reliability (attributes) diagram is based on the decomposition of the Brier score (BS) into reliability, resolution, and uncertainty terms over *K* bins:

$$\mathrm{BS} = \frac{1}{N}\sum_{k=1}^{K} n_k\left(f_k - \bar{o}_k\right)^2 - \frac{1}{N}\sum_{k=1}^{K} n_k\left(\bar{o}_k - \bar{o}\right)^2 + \bar{o}\left(1 - \bar{o}\right),$$

where BS is the Brier score, *f*_{k} is the forecast probability of an event in the *k* bin, *o* equals one if the event occurs and zero otherwise, *ō*_{k} is the mean observed frequency in the *k* bin, *ō* is the overall mean observed frequency, *N* is the total number of forecasts, and *n*_{k} is the number of forecasts falling in the *k* bin. The first term of the decomposed BS, reliability, characterizes the agreement between the forecast probability and the mean observed frequency; the second term, resolution, shows the ability of the forecast to resolve a set of sample events into subsets with characteristically different frequencies; and the third term represents uncertainty.
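
The decomposition can be sketched as follows, assuming forecasts within a bin share a common probability value; names and sample values are illustrative:

```python
from statistics import mean

def brier_decomposition(f, o, K=10):
    """Decompose the Brier score into reliability, resolution, and
    uncertainty over K forecast-probability bins (Murphy 1973).
    f: forecast probabilities in [0, 1]; o: 1 if the event occurred, else 0."""
    N = len(f)
    o_bar = mean(o)
    bins = [[] for _ in range(K)]
    for fi, oi in zip(f, o):
        bins[min(int(fi * K), K - 1)].append((fi, oi))
    rel = res = 0.0
    for members in bins:
        if not members:
            continue
        n_k = len(members)
        f_k = mean(fi for fi, _ in members)  # mean forecast in the bin
        o_k = mean(oi for _, oi in members)  # observed frequency in the bin
        rel += n_k * (f_k - o_k) ** 2
        res += n_k * (o_k - o_bar) ** 2
    return rel / N, res / N, o_bar * (1.0 - o_bar)

rel, res, unc = brier_decomposition([0.2, 0.2, 0.8, 0.8], [0, 1, 1, 1])
# BS = reliability - resolution + uncertainty
```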

Regionally aggregated reliability diagrams for PMME with corresponding frequency histograms are shown in Figs. 4 and 5 as the curves with closed circles. These diagrams display the relative frequency of an observed event against the forecast probability of this event for the bins into which the forecasts are grouped. The diagonal connecting the points (0,0) and (1,1) represents perfect resolution and reliability. The line with a relative frequency value of 0.333, the "no resolution" line, corresponds to the performance of random forecasts. The bisector of the angle between these two lines, the line where reliability is equal to resolution, represents the performance of the climatological forecast. The closer the curve to the diagonal, the higher the skill of the forecast. Deviations from the diagonal give the conditional bias. If the curve is below (above) the diagonal, it indicates overforecasting (underforecasting). The slope of the curve shows the resolution: the flatter the curve, the lower the resolution of the forecast. If the curve is above (below) the bisector in the range of forecast probabilities above (below) the climatological probability (*P* = 0.333), forecasts are skillful as compared with climatology. If the curve appears between the bisector and the no-resolution line, forecasts have no skill with respect to climatology; however, they outperform random guessing.

The frequency histograms at the bottom of the plots show the relative frequencies of forecasts with respect to forecast probabilities and characterize the sharpness of the forecast system. The higher the sharpness, the larger the relative frequency of the high- and low-probability forecasts, those close to 0.0 and 1.0. The regionally aggregated BSS values of the PMME are shown in the top-left corner of the plots. Positive values of the BSS document that the PMME outperforms climatological forecasts.

For temperature (Fig. 4), the PMME forecast curve deviations from the diagonal are quite small and the slope is close to the diagonal one. For the AN and BN categories, the PMME forecasts outperform climatological forecasts for all three analyzed regions. The skill of the PMME forecasts for precipitation (Fig. 5) is less than that for temperature. It is skillful in comparison with random forecasts. However, there is a distinct deviation of the curve downward in the range of forecast probabilities 0.3–0.6, especially for the northern extratropics. It should also be noted that for both temperature and precipitation, forecasts of the category NN (not shown) are less skillful than the climatological forecasts, although they outperform both single-model and random forecasts.

The reliability diagrams and, particularly, the frequency histograms expose the main shortcoming of the PMME as compared with single models. It is the tendency to produce forecasts close to climatology, especially for precipitation; meanwhile, forecasts corresponding to high-probability bins are issued comparatively rarely. However, it is worth noting that low sharpness is an inherent shortcoming of most systems based on averaging. On the other hand, such averaging provides an apparent advantage, since it improves both the reliability and resolution. Moreover, in the APCC method, averaging improves the skill of the forecasts in the extreme and near-extreme bins because these forecasts are supported by most of the models participating in the PMME.

Another shortcoming of the PMME is a moderate, though systematic, bias toward overforecasting of the BN category for both temperature and precipitation. If the bias appeared only for precipitation, it could be explained by the use of the Gaussian approximation of the precipitation pdf. However, such a supposition does not explain the bias in the temperature forecasts. Thus, the only explanation resides in the individual model forecasts combined in the ensemble. The combination of the individual model forecasts in the PMME substantially improves both the reliability and resolution of the forecasts. However, even the combined forecasts cannot be assessed as perfect.

According to a number of studies, the skill of multimodel ensemble forecasts is higher than that of single models (e.g., Robertson et al. 2004; Hagedorn et al. 2005). Our verification results, as expected, also support the superiority of the multimodel approach. The reliability diagrams (Figs. 4 and 5) clearly document that when the models are combined in the PMME, the performance of the ensemble is considerably higher than that of the individual models operating separately, with substantial improvement in both the reliability and resolution of the forecasts.

##### (ii) Relative operating characteristic score

As a summary statistic representing the skill of a probabilistic forecast system, WMO recommends the relative operating characteristic score (WMO 2002). The relative operating characteristic score (ROCS) is equal to 1 for a perfect forecast and 0.5 for a no-skill forecast. The spatial distribution of the ROCS for temperature and precipitation is shown in Fig. 6. The wintertime temperature forecast is apparently skillful for most of the globe (Figs. 6a and 6c). For the AN and BN categories, the highly skillful forecasts with ROCS exceeding 0.7 extend well into the extratropics, and in the tropical regions ROCS mainly exceeds 0.8, with ROCS above 0.9 spanning the tropical eastern Pacific. For the NN category (not shown), the ROCS value is less than that for the two other categories by about 0.1–0.2, and statistically significant ROCS values are confined to the tropical region.
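
For reference, the area under the ROC curve can be sketched by trapezoidal integration over probability thresholds; the function name, threshold set, and sample values are illustrative:

```python
def roc_score(f, o, thresholds=None):
    """Area under the relative operating characteristic curve: hit rate
    against false-alarm rate as the probability threshold for issuing a
    'yes' forecast is lowered; 1 is perfect, 0.5 is no skill."""
    if thresholds is None:
        thresholds = [k / 10.0 for k in range(11)]
    events = sum(o)
    non_events = len(o) - events
    points = [(0.0, 0.0)]
    for t in sorted(thresholds, reverse=True):
        hits = sum(1 for fi, oi in zip(f, o) if fi >= t and oi == 1)
        false_alarms = sum(1 for fi, oi in zip(f, o) if fi >= t and oi == 0)
        points.append((false_alarms / non_events, hits / events))
    points.append((1.0, 1.0))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0  # trapezoidal integration
    return area

perfect = roc_score([0.9, 0.8, 0.1, 0.2], [1, 1, 0, 0])
no_skill = roc_score([0.5, 0.5, 0.5, 0.5], [1, 1, 0, 0])
```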

Forecasts of precipitation are highly skillful only in the tropical regions, the South Pacific convergence zone, and the South Pacific. In the extratropical regions, the distribution of the skillful forecast grid points is quite patchy (Figs. 6b and 6d). The higher skill of temperature prediction in comparison with precipitation has been reported in many studies (e.g., Mason et al. 1999; Peng et al. 2000; Derome et al. 2001). A multimodel ensemble improves the forecast skill as compared with a single-model approach for both temperature and precipitation; however, the skill of the precipitation forecasts remains modest.

In this paper, the detailed forecast performance assessment by means of reliability diagrams is shown for only wintertime temperature and summer precipitation. A summary overview of prediction skill for temperature, precipitation, and Z500 in terms of ROCS is shown in Table 2. Table 2 documents that seasonal forecasts of Z500 clearly outperform those of both precipitation and temperature for all the seasons for the globe and the tropics. For the northern extratropics, the skill levels of the Z500 and temperature forecasts are comparable. The poorest forecasts are for precipitation, particularly for summer precipitation in the northern extratropics. It is interesting to note that for this region precipitation forecasts for winter are notably more skillful than those for summer because wintertime precipitation is mainly caused by circulation anomalies for which prediction is comparatively skillful. It should also be noted that all the aggregated ROCSs are statistically significant at the 99% confidence level according to the Monte Carlo test with 500 random trials.

#### 2) Real-time operational forecasts

The PMME has been used at the APCC as an operational seasonal probabilistic prediction system since May 2006. This 2-yr period is too short to collect a sufficient number of real-time forecasts to obtain reliable quantitative estimates and draw well-grounded conclusions. Therefore, the results of the assessments of real-time forecasts shown below should be treated as preliminary.

The ROCS values for real-time and historical forecasts of precipitation, temperature, and Z500 for the globe for the summer and winter seasons are shown in Fig. 7. The ROCS values for the real-time forecasts are above 0.5, so the forecasts may be considered to be skillful. The skill of historical forecasts for all the variables features strong interannual variability. The ROCS values for real-time forecasts lie within the range of variability of the skill of historical forecasts.

The state of the climate system in a particular year considerably affects the predictability and, consequently, the skill. In particular, the highest skill for all three variables is found for the seasonal forecast for the winter of 1997/98, the winter of an extreme El Niño and the strongest boundary forcing. It is interesting to note that the skill of the real-time forecasts for both winters (2006/07 and 2007/08) appears to be higher than for both summers (2006 and 2007). A reasonable explanation may reside in the state of the climate system during these two years. There was a weak El Niño (La Niña) during the 2006/07 (2007/08) winter. A weak La Niña was also observed during the 2005/06 winter. Therefore, the wintertime forcing was well defined and eminently predictable in late October–early November, when the model integrations were performed. The summers of 2006 and 2007 were characterized by atmospheric processes associated with transitions between weak El Niño–Southern Oscillation phases, with weak, poorly predictable forcing, which resulted in the lower skill of the forecasts.

However, this explanation should be treated only as a hypothesis: over just 2 yr, the difference between the skill of the forecasts for summer and winter may arise by chance. Thus, the main conclusion from the assessment is that the real-time forecasts are skillful and their skill is within the range of the interannual variability of the skill of the historical forecasts.

### b. The skill of the different methods of multimodel combination

Three possible methods of multimodel combination have been analyzed as candidates for the APCC operational method. Along with PMME, we have tested the two methods discussed in section 3: the multimodel combination with equal weights (PMMEE) and that with weights inversely proportional to the squared error in forecast probability associated with sampling error (PMMEN). A description of the model weights is given in Table 3.

The reliability diagrams for PMME, PMMEE, and PMMEN are shown in Figs. 8 and 9. For both temperature and precipitation, the curves lie within each other's confidence intervals, which implies that the skill levels of the three methods are quite similar. Indeed, the BSS values are quite comparable, with a minor advantage for PMMEE for temperature and a similarly minor advantage for PMMEN for precipitation. Their frequency histograms are also similar; therefore, the shortcoming of low sharpness is common to all three methods.
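As a reminder of how the BSS values compared above are defined, the following is a minimal sketch of the Brier skill score relative to a climatological forecast (Brier 1950; Murphy 1973); for tercile categories the climatological probability is 1/3, and the sample probabilities and outcomes below are invented for illustration.

```python
import numpy as np

def brier_skill_score(prob, occurred, p_clim=1.0 / 3.0):
    """Brier skill score relative to a constant climatological forecast.
    Positive values mean the probabilistic forecast beats climatology."""
    prob = np.asarray(prob, dtype=float)
    o = np.asarray(occurred, dtype=float)
    bs = np.mean((prob - o) ** 2)         # Brier score of the forecasts
    bs_clim = np.mean((p_clim - o) ** 2)  # Brier score of climatology
    return 1.0 - bs / bs_clim

probs = np.array([0.6, 0.2, 0.7, 0.1])
obs   = np.array([1,   0,   1,   0  ])
print(brier_skill_score(probs, obs) > 0)  # prints True
```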

The ROCS values computed for PMMEE and PMMEN differ by no more than 0.01 from those for PMME shown in Table 2. Thus, the skill levels of PMMEE, PMMEN, and PMME are in general comparable. None of these methods can be rated the best or the worst in terms of skill scores aggregated over the large regions (several thousand grid points) and 21 yr. However, the forecasts produced by the three methods do differ: the difference between the forecast probabilities produced by PMMEE and PMMEN reaches 15%. When issuing real-time forecasts, it cannot be known beforehand which of these two methods will be the best in a particular year at a particular grid point. The PMME forecast probabilities lie between the probabilities predicted by the alternative methods, and in the comparison with PMMEE and PMMEN it was never the poorest. It is therefore the optimal choice for the operational method for global probabilistic forecasts.

Two other possible methods of multimodel combination have also been tested, with weights inversely proportional to the squared sampling error and to the ensemble variance, respectively. However, these methods feature strong overforecasting. Their skill appears to be much poorer than that of PMME, PMMEE, and PMMEN for both temperature and precipitation, with the skill of the precipitation forecasts being poorer than the climatological one for all three regions. We briefly discuss the possible causes of their failure in section 7.

## 7. Summary and discussion

A probabilistic multimodel ensemble prediction system has been developed to provide seasonal forecasts at the APCC, and its 21-yr cross-validated hindcasts (1983–2003) and real-time forecasts have been assessed in terms of the skill scores recommended by the WMO SVS-LRF. The system is based on a multimodel ensemble that is not calibrated on past skill and on a parametric Gaussian fitting method for the estimation of tercile-based categorical probabilities.

In this study, we have addressed the probabilistic interpretation of forecasts from this multimodel ensemble, which combines a large number of models whose ensembles differ essentially in size and whose weights in the forecast and hindcast datasets are inconsistent. This problem is new in seasonal probabilistic forecasting; so far, solutions have been suggested only for special cases. In particular, when the model ensembles differ in size but the model weights in the forecast and hindcast datasets are consistent, pooling may be used (Hagedorn et al. 2005; Weigel et al. 2008). On the other hand, if the model weights in the forecast and hindcast datasets are inconsistent but the model forecast ensemble sizes are equal, averaging of the individual model forecasts may be used. However, no solution had yet been suggested for a multimodel ensemble that features both an essential difference between the model ensemble sizes and an inconsistency between the model weights in the hindcast and forecast datasets. We solve the problem by estimating forecast probabilities separately for the individual models and by combining the predicted forecast probabilities using the total probability formula. In this combination, we apply model weights that are proportional to the square root of the ensemble size.
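The combination step can be sketched as follows; the model tercile probabilities and ensemble sizes below are hypothetical, and the operational PMME code differs in detail.

```python
import numpy as np

# Hypothetical tercile probabilities (BN, NN, AN) from three models at one
# grid point; each row sums to 1. The ensemble sizes are also invented.
model_probs = np.array([[0.20, 0.30, 0.50],
                        [0.30, 0.35, 0.35],
                        [0.15, 0.30, 0.55]])
ens_size = np.array([24, 10, 6])

# PMME weights: proportional to the square root of the ensemble size, i.e.,
# inversely proportional to the sampling error in forecast probability.
w = np.sqrt(ens_size)
w = w / w.sum()

# Total probability formula: the multimodel probability of each category is
# the weighted sum of the individual model probabilities for that category.
pmme = w @ model_probs
print(pmme.round(3))  # the three category probabilities sum to 1
```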

Our initial consideration was confined to two alternative methods of multimodel combination, discussed in the paper as PMMEE (equal weights) and PMMEN (weights inversely proportional to the squared error in forecast probability associated with the standard error). The first method was considered the most logical choice given the absence of any robust information on historical individual model skill. The basis for the second method was provided by previous studies (e.g., Robertson et al. 2004; Hagedorn et al. 2005) documenting that an increase in the ensemble size of a single model (and the associated reduction of the standard error) improves its performance.

However, the preparatory study has shown that neither of the two alternative methods is preferable as a global method. Therefore, a geometric mean between PMMEE and PMMEN has been suggested. This method, PMME (weights inversely proportional to the error in forecast probability associated with the standard error), has become the APCC's global operational method. It is worth noting that this error in forecast probability does not depend on the ensemble variance; it is related only to the ensemble size. In particular, for the PMME method, each model weight is proportional to the square root of the model ensemble size.

The performance of the three methods is comparable, with a slight advantage for PMMEE for temperature and, similarly, a slight advantage for PMMEN for precipitation. This matches the results of the preparatory test, which show that, in general, the ratio between the standard errors and the intermodel standard deviation is lower for temperature and larger for precipitation. Comparison of the PMMEE and PMMEN forecast assessments reveals that the slightly lower sharpness of PMMEN is associated with an improvement in the reliability of the forecasts of poorly predictable precipitation; on the other hand, for the more predictable temperature, PMMEN does not benefit from correctly predicted high probabilities of events because of its lower sharpness. In contrast, the slight advantage in sharpness of PMMEE is associated with a tendency toward overforecasting and the lower reliability of its precipitation forecasts. Compared with the other two methods, PMME is optimal: it slightly outperforms PMMEE for precipitation and slightly outperforms PMMEN for temperature, its forecast probabilities lie between the probabilities predicted by the competing methods, and it is never the poorest of the three.

Two additional methods, with weights inversely proportional to the squared standard error and to the ensemble variance, respectively, have also been tested. They show much poorer skill than the three competing methods and are not discussed further in this paper. It is worth noting, however, that both of these methods include the inverse ensemble variance in the model weights, and the explanation of their failure resides in the enlargement of the weights of the models with underdispersed (overconfident) ensembles. This results in overforecasting by the multimodel predictions, which is strongly penalized by the probabilistic skill scores (Weigel et al. 2008).

Another issue addressed in this paper is the use of a Gaussian approximation for precipitation pdfs. To examine its feasibility, the Anderson–Darling test has been employed, and the differences in the forecast probabilities based on the different pdfs have been examined. The results indicate that, for precipitation, the application of the Gaussian pdf is appropriate for most of the globe. In addition, the maximum difference between the forecast probabilities based on the Gaussian and gamma pdfs over the 21 hindcast years does not exceed 10% for the globe in general, with the exception of a few low-precipitation areas. That is, forecast probabilities based on Gaussian pdfs are comparable to those based on gamma pdfs, while the Gaussian-approximation-based method is simpler to implement and less time consuming than the alternatives (e.g., the gamma fitting method and the nonparametric approach), which is important in operational practice.
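The comparison of Gaussian- and gamma-based tercile probabilities can be illustrated with a small sketch; the ensemble values and tercile boundaries below are synthetic, and SciPy is assumed for the distribution fits.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic ensemble of seasonal precipitation forecasts at one grid point.
ensemble = rng.gamma(shape=4.0, scale=25.0, size=50)
# Hindcast-based tercile boundaries (illustrative values, in mm).
t1, t2 = 70.0, 120.0

def tercile_probs(cdf):
    """Category probabilities (BN, NN, AN) from a cumulative distribution."""
    p_bn = cdf(t1)
    p_nn = cdf(t2) - p_bn
    return p_bn, p_nn, 1.0 - p_bn - p_nn

# Gaussian fit from the ensemble moments.
mu, sigma = ensemble.mean(), ensemble.std(ddof=1)
p_gauss = tercile_probs(lambda x: stats.norm.cdf(x, mu, sigma))

# Gamma fit (location fixed at zero) for comparison.
a, _, scale = stats.gamma.fit(ensemble, floc=0.0)
p_gamma = tercile_probs(lambda x: stats.gamma.cdf(x, a, scale=scale))

# The two sets of category probabilities are typically close away from
# very dry regions.
print(np.max(np.abs(np.array(p_gauss) - np.array(p_gamma))))
```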

Verification of the historical forecasts has been performed. For the AN and BN categories, the PMME temperature forecasts outperform the climatological forecasts for all three considered regions (globe, tropics, and northern extratropics). Precipitation forecasts are not as skillful as those for temperature; however, they are definitely more skillful than random guessing in the extratropics and than climatological forecasts in the tropics. For both temperature and precipitation, as expected, the forecast assessments document the apparent superiority of the PMME forecasts over those of the individual models, with marked improvements in both reliability and resolution.

It is worth noting that, for both temperature and precipitation, forecasts of the NN category are substantially less skillful than those of the other categories. Similar results have been reported in many studies and were explained by Van den Dool and Toth (1991) by the narrowness of the NN category in terms of the forecast variable.

The study performed here has exposed the main shortcoming of PMME and the other tested methods: their low sharpness compared with single models. This is an inherent feature of systems based on averaging. However, it is precisely this averaging that implicitly provides the mutual cancellation of the errors in the pdf parameters of the individual model forecasts and thereby improves the performance of the system (Fritsch et al. 2000).

The PMME scheme described in this paper has been developed for APCC operations and has been providing seasonal tercile-based categorical forecasts of precipitation, temperature, and Z500 since May 2006. The assessment of the real-time forecasts reveals that they are skillful and that their skill lies within the range of the interannual variability of the skill of the historical forecasts. Further development of the method will involve its specialization for particular variables and regions, along with experiments on its calibration according to the skill of past forecasts on the basis of Bayes's theorem.

## Acknowledgments

We are grateful to the institutions participating in the APCC multimodel ensemble operational system for providing the hindcast experiment data. Our special thanks go to the three anonymous reviewers, whose very useful comments and recommendations helped us to considerably improve the paper. We are very thankful to Dr. Peter Mayes of the New Jersey Department of Environmental Protection for his kind help in editing the manuscript. This research has been supported by the Korea Meteorological Administration (KMA).

## REFERENCES

Atger, F., 2003: Spatial and interannual variability of the reliability of ensemble-based probabilistic forecasts: Consequences for calibration. *Mon. Wea. Rev.*, **131**, 1509–1523.

Atger, F., 2004: Estimation of the reliability of ensemble based probabilistic forecasts. *Quart. J. Roy. Meteor. Soc.*, **130**, 627–646.

Barnston, A. G., S. Mason, L. Goddard, D. G. DeWitt, and S. E. Zebiak, 2003: Increased automation and use of multimodel ensembling in seasonal climate forecasting at the IRI. *Bull. Amer. Meteor. Soc.*, **84**, 1783–1796.

Boer, G. J., 2005: An evolving seasonal forecasting system using Bayes’ theorem. *Atmos.–Ocean*, **43**, 129–143.

Brier, G. W., 1950: Verification of forecasts expressed in terms of probability. *Mon. Wea. Rev.*, **78**, 1–3.

Derome, J., and Coauthors, 2001: Seasonal predictions based on two dynamic models. *Atmos.–Ocean*, **39**, 485–501.

Doblas-Reyes, F. J., M. Déqué, and J.-P. Piedelièvre, 2000: Multi-model spread and probabilistic seasonal forecasts in PROVOST. *Quart. J. Roy. Meteor. Soc.*, **126**, 2069–2088.

Doblas-Reyes, F. J., R. Hagedorn, and T. N. Palmer, 2005: The rationale behind the success of multi-model ensembles in seasonal forecasting—II. Calibration and combination. *Tellus*, **57A**, 234–252.

Fritsch, J. M., J. Hilliker, J. Ross, and R. L. Vislocky, 2000: Model consensus. *Wea. Forecasting*, **15**, 571–582.

Hagedorn, R., F. J. Doblas-Reyes, and T. N. Palmer, 2005: The rationale behind the success of multi-model ensembles in seasonal forecasting—I. Basic concept. *Tellus*, **57A**, 219–233.

Jolliffe, I. T., and D. B. Stephenson, 2003: *Forecast Verification*. John Wiley and Sons, 240 pp.

Kanamitsu, M., W. Ebisuzaki, J. Woollen, S. K. Yang, J. J. Hnilo, M. Fiorino, and G. Potter, 2002: NCEP–DOE AMIP-II reanalysis. *Bull. Amer. Meteor. Soc.*, **83**, 1631–1643.

Kharin, V. V., and F. W. Zwiers, 2001: Skill as a function of time scale in ensembles of seasonal hindcasts. *Climate Dyn.*, **17**, 127–141.

Kharin, V. V., and F. W. Zwiers, 2002: Climate predictions with multimodel ensembles. *J. Climate*, **15**, 793–799.

Kharin, V. V., and F. W. Zwiers, 2003: Improved seasonal probability forecasts. *J. Climate*, **16**, 1684–1701.

Krzysztofowicz, R., 1983: Why should a forecaster and a decision maker use Bayes theorem. *Water Resour. Res.*, **19**, 327–336.

Leith, C. E., 1973: The standard error of time-average estimates of climatic means. *J. Appl. Meteor.*, **12**, 1066–1069.

Mason, I., 1982: A model for assessment of weather forecasts. *Aust. Meteor. Mag.*, **30**, 291–303.

Mason, S. J., and N. E. Graham, 1999: Conditional probabilities, relative operating characteristics, and relative operating levels. *Wea. Forecasting*, **14**, 713–725.

Mason, S. J., L. Goddard, N. E. Graham, E. Yulaeva, L. Sun, and P. A. Arkin, 1999: The IRI Seasonal Climate Prediction System and the 1997/98 El Niño event. *Bull. Amer. Meteor. Soc.*, **80**, 1853–1873.

Murphy, A. H., 1973: A new vector partition of the probability score. *J. Appl. Meteor.*, **12**, 595–600.

Murphy, A. H., 1977: The value of climatological, categorical, and probabilistic forecasts in the cost–loss ratio situation. *Mon. Wea. Rev.*, **105**, 803–816.

Murphy, A. H., and R. L. Winkler, 1977: Reliability of subjective probability forecasts of precipitation and temperature. *Appl. Stat.*, **26**, 61–78.

Murphy, A. H., and R. L. Winkler, 1987: A general framework for forecast verification. *Mon. Wea. Rev.*, **115**, 1330–1338.

Palmer, T. N., C. Brankovic, and D. S. Richardson, 2000: A probability and decision-model analysis of PROVOST seasonal multi-model ensemble integrations. *Quart. J. Roy. Meteor. Soc.*, **126**, 2013–2034.

Palmer, T. N., and Coauthors, 2004: Development of a European Multimodel Ensemble System for Seasonal to Interannual prediction (DEMETER). *Bull. Amer. Meteor. Soc.*, **85**, 853–872.

Peng, P., A. Kumar, A. G. Barnston, and L. Goddard, 2000: Simulation skills of the SST-forced global climate variability of the NCEP–MRF9 and Scripps–MPI ECHAM3 models. *J. Climate*, **13**, 3657–3679.

Peng, P., A. Kumar, H. van den Dool, and A. G. Barnston, 2002: An analysis of multimodel ensemble predictions for seasonal climate anomalies. *J. Geophys. Res.*, **107**, 4710, doi:10.1029/2002JD002712.

Robertson, A. W., U. Lall, S. E. Zebiak, and L. Goddard, 2004: Improved combination of multiple atmospheric GCM ensembles for seasonal prediction. *Mon. Wea. Rev.*, **132**, 2732–2744.

Särndal, C.-E., B. Swensson, and J. Wretman, 1992: *Model Assisted Survey Sampling*. Springer-Verlag, 712 pp.

Stephens, M. A., 1974: EDF statistics for goodness of fit and some comparisons. *J. Amer. Stat. Assoc.*, **69**, 730–737.

Stephenson, D. B., and F. J. Doblas-Reyes, 2000: Statistical methods for interpreting Monte Carlo ensemble forecasts. *Tellus*, **52A**, 300–322.

Swets, J. A., 1973: The relative operating characteristic in psychology. *Science*, **182**, 990–1000.

Taylor, J. R., 1997: *An Introduction to Error Analysis: The Study of Uncertainties in Physical Measurements*. 2nd ed. University Science Books, 327 pp.

Thompson, J. C., 1962: Economic gains from scientific advances and operational improvements in meteorological prediction. *J. Appl. Meteor.*, **1**, 13–17.

Tracton, M. S., and E. Kalnay, 1993: Operational ensemble prediction at the National Meteorological Center: Practical aspects. *Wea. Forecasting*, **8**, 379–398.

Van den Dool, H., and Z. Toth, 1991: Why do forecasts for “near normal” often fail? *Wea. Forecasting*, **6**, 76–85.

Vislocky, R. L., and J. M. Fritsch, 1995: Improved model output statistics forecasts through model consensus. *Bull. Amer. Meteor. Soc.*, **76**, 1157–1164.

Weigel, A. P., M. A. Liniger, and C. Appenzeller, 2008: Can multi-model combination really enhance the prediction skill of probabilistic ensemble forecasts? *Quart. J. Roy. Meteor. Soc.*, **134**, 241–260.

Wilks, D. S., 1995: *Statistical Methods in the Atmospheric Sciences: An Introduction*. Academic Press, 467 pp.

WMO, 2002: Standardised Verification System (SVS) for Long-Range Forecasts (LRF). New attachment II-9 to the Manual on the GDPS, Vol. 1, WMO-No. 485, 24 pp.

WMO, cited 2007: Report of WMO/KMA Workshop of Global Producing Centres on Lead Centre for Long-Range Forecast Multi-Model Ensemble Prediction. [Available online at http://www.wmo.int/pages/prog/www/DPFS/Reports/Wshop-LCLRFMME_Busan.doc.]

WMO, cited 2008: Meeting of the Expert Team on Extended and Long-Range Forecasting. [Available online at http://www.wmo.int/pages/prog/www/DPFS/Reports/ET-ELRF_Beijing2008.doc.]

Xie, P., and P. A. Arkin, 1997: Global precipitation: A 17-year monthly analysis based on gauge observations, satellite estimates, and numerical model outputs. *Bull. Amer. Meteor. Soc.*, **78**, 2539–2558.

Zwiers, F. W., 1996: Interannual variability and predictability in an ensemble of AMIP climate simulations conducted with the CCC GCM2. *Climate Dyn.*, **12**, 825–848.

Table 1. Participating models in the APCC seasonal forecasts.

Table 2. Regionally aggregated ROCSs for the PMME forecasts of precipitation (PREC), temperature at 850 hPa (T850), and Z500 for the winter and summer seasons. All aggregated ROCSs are statistically significant at the 99% confidence level.

Table 3. Competitive methods of multimodel combination.


* Current affiliation: Hydrometeorological Research Center of Russia, Moscow, Russia.