1. Introduction
Ensemble prediction systems seek to represent the sensitivity of the inherently chaotic climate system to small changes in initial conditions by initializing the model with a range of perturbed initial states that span analysis uncertainty (Tracton and Kalnay 1993; Palmer et al. 2005). Model errors arising from the parameterization of physical processes, the effects of unresolved scales, or imperfect boundary conditions produce forecasts that are overconfident (unreliable) because the forecast uncertainties are underestimated (Weigel et al. 2009). The probability distribution of an overconfident ensemble system has too little spread (Weigel et al. 2009), so that high forecast probabilities are issued more often than is warranted by the fraction of occasions on which the event is subsequently observed. A perfectly reliable prediction system, on the other hand, shows consistency between the forecast probabilities and the corresponding mean observed frequency of occurrence (Wilks 1995). Reliable climate forecasts are especially important when they are used for subsequent applications, such as driving hydrological or agricultural production models.
Seasonal rainfall predictions for Australia from the Predictive Ocean Atmosphere Model for Australia (POAMA), version P15b, a coupled model forecast system run operationally at the Australian Bureau of Meteorology from 2002 until 2011 (details of the POAMA forecast system are provided in section 2), are overconfident and only moderately reliable, even in the austral spring season when forecast accuracy is highest (Lim et al. 2011). Accurate prediction of Australian seasonal rainfall, at least for lead times of one to two months, results from the capability of P15b to predict El Niño–Southern Oscillation (ENSO; Wang et al. 2008; Hendon et al. 2009) and the Indian Ocean dipole (IOD; Zhao and Hendon 2009), together with their teleconnections to Australian climate, which are dominant influences on Australian climate (McBride and Nicholls 1983; Nicholls 1989; Drosdowsky and Chambers 2001; Lim et al. 2009). Despite the demonstrable skill of P15b seasonal rainfall forecasts, as measured for instance by correlation using the ensemble mean, the overconfidence and lack of reliability impede the uptake of these forecasts for practical applications and for public issuance by the National Climate Centre at the Australian Bureau of Meteorology.
The lack of forecast spread (overconfidence) arising from model errors can be ameliorated by incorporating stochastic parameterizations, using different model versions (e.g., perturbed physics or different parameterizations), or using multimodel ensembles (e.g., Palmer et al. 2000; Doblas-Reyes et al. 2000; Palmer et al. 2004; Hagedorn et al. 2005; Weigel et al. 2009). A new version of the POAMA forecast system addresses the low reliability of P15b forecasts by making use of three slightly different versions of the model. Although the three versions are more similar than dissimilar, the resulting improvement in reliability is demonstrated in section 4.
We also compare the forecast accuracy and reliability of the new POAMA system with the forecasts provided by the Ensemble-Based Predictions of Climate Changes and their Impacts (ENSEMBLES) project (Hewitt and Griggs 2004; Weisheimer et al. 2009; Alessandri et al. 2011b), in order to better assess progress of the POAMA system and to indicate the potential benefit for prediction of Australian climate of participating in an operational multimodel seasonal forecast system. The ENSEMBLES project provides seasonal hindcasts from a number of coupled model seasonal forecast systems spanning a range of horizontal and vertical resolutions (all typically higher than POAMA's). The archive of hindcasts from ENSEMBLES thus provides an opportunity to assess how common the deficiency of low reliability is for predicting regional rainfall across Australia. As part of this archive, hindcasts are available from coupled models developed at the European Centre for Medium-Range Weather Forecasts (ECMWF), Météo-France (MF), and the Met Office (UKMO). These three centers contribute to the European multimodel Seasonal to Interannual Prediction project (EUROSIP), an operational multimodel seasonal forecast system launched in late 2005 that combines ensemble forecasts from the current versions of the coupled seasonal forecast models from ECMWF, UKMO, and MF. Although the model versions used for the ENSEMBLES project are not necessarily the same as those used in EUROSIP, evaluation of the ENSEMBLES hindcasts from these three models gives not only an indication of the relative skill of the POAMA system but also the potential benefit of including POAMA in a multimodel ensemble that is achievable in real time.
The paper is organized as follows. The POAMA and ENSEMBLES hindcast data are described in section 2. The methods for assessing forecast accuracy and reliability are outlined in section 3. In section 4 we compare the accuracy and reliability of forecasts for Australian seasonal rainfall from the new version of POAMA (P24) with those from the older version (P15b) in order to track progress of the POAMA system. We also compare with the hindcasts from the ECMWF, MF, and UKMO models archived as part of the ENSEMBLES project in order to assess common successes and problems. In section 5, the benefit for improved reliability of including POAMA version P24 in a multimodel ensemble with the ECMWF, UKMO, and MF models is assessed, indicating the potential of a real-time multimodel system that includes POAMA. Conclusions are provided in section 6.
2. Forecast systems, hindcasts, and verification data
POAMA is the Australian Bureau of Meteorology's coupled model seasonal prediction system. The newest version, which recently became operational, is P24. This system builds on the earlier version P15b (e.g., Zhao and Hendon 2009; Hudson et al. 2011) that was run operationally at the Bureau of Meteorology from 2002 until early 2012. One major difference between P15b and P24 is an improved ocean data assimilation system. The new system, the POAMA Ensemble Ocean Data Assimilation System (PEODAS; see Yin et al. 2011 for a comprehensive description), is an approximate form of an ensemble Kalman filter. PEODAS assimilates salinity as well as temperature data, and provides multivariate model updates and explicit estimates of state-dependent background error covariances. As a result, the ocean analyses used for initial conditions are improved relative to those from the P15b ocean analysis system (Yin et al. 2011; Wang et al. 2011). This is expected to improve the prediction of ocean temperatures and therefore of Australian rainfall (e.g., Alessandri et al. 2010, 2011a).
Version P24 also combines 10 ensemble members from each of three slightly different model versions into a 30-member ensemble, which is aimed at alleviating the unreliable forecasts from P15b. The three versions of P24 are referred to as P24a, P24b, and P24c. Version P24c has the same standard atmospheric physics options as P15b, but is initialized using the improved ocean analyses from the new PEODAS assimilation system. P24a uses a different parameterization of shallow convection compared to P24c that results in less mean state drift, while P24b uses an explicit ocean–atmosphere flux adjustment to control the mean state. More details are provided in Wang et al. (2011).
Nine-month hindcasts are initialized on the first of each month for both P15b and P24. For P24, perturbed ocean initial conditions are provided by PEODAS (Yin et al. 2011). A single atmospheric initial condition, provided by the Atmosphere and Land Initialisation scheme (ALI; Hudson et al. 2011), is used for all 10 members. For P15b, a single ocean initial condition was provided by the old temperature-only ocean assimilation scheme, while perturbed atmospheric initial conditions were obtained from analyses at successively earlier 6-h intervals. The ensemble size for P15b is 10, while the total ensemble size for P24 is 30 (10 from each of the three versions).
This paper focuses on verification of predictions for the four main seasons in the period 1980–2005, at lead times of one and four months. A lead time of one month corresponds to the verification season starting one month after the hindcast initialization date. Note that the three monthly forecasts averaged to form a seasonal mean all come from the same initialization date.
The ENSEMBLES project is a collaboration of around 80 institutions investigating the benefits of ensemble prediction for seasonal to decadal forecasts (Hewitt and Griggs 2004; Weisheimer et al. 2009; Alessandri et al. 2011b). Seasonal hindcasts were conducted with coupled forecast models from ECMWF, MF, UKMO, the Euro-Mediterranean Centre for Climate Change Istituto Nazionale di Geofisica e Vulcanologia (CMCC-INGV), and the Leibniz Institute of Marine Sciences at Kiel University (IFM-GEOMAR). Seven-month hindcasts are available for 1960–2005, for four start dates (the first of February, May, August, and November). The ECMWF Integrated Forecast System (IFS) is similar to the Seasonal Forecast System 3 that was used operationally at ECMWF from March 2007 through 2011. Similarly, the UKMO Hadley Centre Global Environmental Model version 2 (HadGEM2) and the MF Action de Recherche Petite Echelle Grande Echelle/Océan Parallélisé (ARPEGE/OPA) systems are similar to their current operational seasonal forecast systems, which also contribute to the EUROSIP project. These particular ENSEMBLES models are therefore indicative of what is available from EUROSIP in real time, although the number of ensemble members in real time is typically larger than in the ENSEMBLES archive.
Similar to POAMA, each of these three systems is initialized from best estimates of the observed atmosphere and ocean states, with an ensemble of nine initial conditions based on three different ocean analyses (Weisheimer et al. 2009). The system details are summarized in Table 1. Table 1 also shows that the ocean model resolutions of the three ENSEMBLES models and the two POAMA models are similar, but that the atmospheric resolution varies substantially between the models: the ECMWF model has the highest, with an effective horizontal resolution of about 100 km (T159) and 63 vertical levels, and the POAMA models have the lowest, with a resolution of about 250 km (T47) and only 17 vertical levels.
Table 1. Details of models.
Seasonal rainfall forecasts across Australia are verified against the National Climate Centre's (NCC) gridded monthly analysis (Jones and Weymouth 1997). These analyses are on a 0.25° × 0.25° longitude–latitude grid spanning 10°–44.5°S, 112°–156.25°E and are based on an interpolation of available rain gauge data across Australia. Seasonal mean rainfall forecasts and observations were all interpolated, using bilinear interpolation, to the POAMA 2.5° grid over Australia before analysis. The land and ocean mask files were also interpolated to the POAMA grid.
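As an illustration of this preprocessing step (not code from the original study; the data values and exact grid definitions below are synthetic), bilinear regridding to a coarser grid can be sketched with xarray as follows:

```python
import numpy as np
import xarray as xr

# Hypothetical 0.25-degree analysis over Australia (synthetic values);
# the real input would be the NCC gridded monthly rainfall analysis.
lat_hi = np.arange(-44.5, -9.99, 0.25)
lon_hi = np.arange(112.0, 156.26, 0.25)
obs = xr.DataArray(
    np.random.rand(lat_hi.size, lon_hi.size),
    coords={"lat": lat_hi, "lon": lon_hi},
    dims=("lat", "lon"),
    name="rainfall",
)

# Target 2.5-degree (POAMA-like) grid; method="linear" on a regular
# lat-lon grid performs bilinear interpolation (requires scipy).
lat_lo = np.arange(-42.5, -9.9, 2.5)
lon_lo = np.arange(112.5, 155.1, 2.5)
obs_coarse = obs.interp(lat=lat_lo, lon=lon_lo, method="linear")
print(obs_coarse.shape)
```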
3. Methods
a. Accuracy score for hindcast assessment
To calculate a probabilistic forecast for the two categories of above- and below-median rainfall, the model hindcasts are compared to their own cross-validated median value (i.e., the medians are derived in a leave-one-out fashion). The probabilistic forecast is the fraction of ensemble members that indicate the occurrence of the event. The verifying observations are likewise compared to their own cross-validated median value. A hit is scored when above-median rainfall is forecast with a probability greater than 50% and above-median rainfall was observed; a correct negative is scored when above-median rainfall is forecast with a probability less than 50% and the observed rainfall was below median. This accuracy score, also known as the proportion correct (Finley 1884), is the total number of hits and correct negatives divided by the total number of forecasts (Wilks 1995).
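A minimal sketch of this accuracy calculation at a single grid point and season is given below. It is illustrative only: the function name and array shapes are assumptions, and the cross-validated model median is taken here over the pooled members of all other years.

```python
import numpy as np

def proportion_correct(ens, obs):
    """Proportion correct (Finley 1884) for above/below-median forecasts
    at one grid point and season.

    ens: array (n_years, n_members) of hindcast seasonal rainfall
    obs: array (n_years,) of verifying observed seasonal rainfall
    """
    n = len(obs)
    correct = 0
    for t in range(n):
        others = np.delete(np.arange(n), t)
        ens_med = np.median(ens[others])     # cross-validated model median
        obs_med = np.median(obs[others])     # cross-validated observed median
        p_above = np.mean(ens[t] > ens_med)  # fraction of members above median
        fc_above = p_above > 0.5             # most likely category
        # NB: an exact 50% forecast counts as "below" here; the member
        # subsampling described below avoids such ties altogether.
        ob_above = obs[t] > obs_med
        correct += int(fc_above == ob_above)
    return correct / n
```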
Because of the even number of ensemble members in the POAMA hindcasts, there is often a situation where half the ensemble members indicate a wetter-than-median forecast and half indicate a drier-than-median forecast. This corresponds to exactly a 50% probability of above-median rainfall and a 50% probability of below-median rainfall. As the observed rainfall will only fall into a single category, the choice of how to score such a forecast can affect the skill measure in comparison with other forecast systems: if a hit is awarded for an equal-probability forecast no matter which category is observed, the skill is biased high relative to not awarding a hit in this situation. As the ECMWF, UKMO, and MF models have nine ensemble members, an equal-probability forecast never occurs for them. Therefore, to avoid exactly 50% forecasts in POAMA, a subset of nine ensemble members was used to create the forecasts for each version of the model: P15b, P24a, P24b, and P24c. As the original ensemble members are created from a set of perturbed initial conditions, probabilistic or deterministic forecasts resulting from a random choice of the nine members without replacement are averaged over 100 Monte Carlo runs, to ensure the prediction is not biased by any particular choice of members.
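The subsampling procedure might be sketched as follows (illustrative names; the threshold would be the cross-validated median described above):

```python
import numpy as np

rng = np.random.default_rng(0)

def subsampled_probability(members, threshold, n_sub=9, n_runs=100):
    """Probability of above-median rainfall from an odd-sized subsample:
    draw n_sub of the members without replacement, repeat over Monte
    Carlo runs, and average the resulting probabilities, so that an
    exact 50% forecast cannot occur in any individual draw."""
    probs = [np.mean(rng.choice(members, size=n_sub, replace=False) > threshold)
             for _ in range(n_runs)]
    return float(np.mean(probs))
```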
This dichotomous score, which only considers the most likely outcome, does not consider the spread of the ensemble members. Therefore a more complete verification of the probabilistic forecasts requires the assessment of additional forecast attributes. Care needs to be taken when comparing probabilistic forecasts from a range of systems because of biases associated with small ensemble sizes or the binning of forecasts, and there is little agreement in the literature on the best skill score (e.g., Müller et al. 2005). We therefore present a number of measures of desirable forecast attributes in order to compare seasonal forecast systems for Australian rainfall, and use the accuracy score described above as one component of this evaluation.
b. Reliability, resolution, and sharpness
In order for a dynamical forecast to be valuable and practically applicable, it must be reliable as well as accurate. A reliable forecast predicts an event with a probability that corresponds to the frequency with which the event is observed when considered over many forecasts. An accurate probabilistic forecast has resolution and sharpness, as well as reliability.
The reliability, resolution, and sharpness of forecasts can be represented on an attributes diagram. For reliability, the relative observed frequency of an event is plotted against the forecast probability, which is divided into a number of bins (e.g., Wilks 1995). This computation requires pooling forecasts across seasons, lead times, and/or locations in order to increase the sample size. A reliable forecast will lie along the diagonal 1:1 line, so that the event is observed to occur on a fraction of occasions equal to the probability with which it was forecast. If the data points lie along the horizontal at a relative observed fraction equal to the average of all observations, then climatology is observed no matter what was forecast, and the system has no resolution. In the attributes diagrams presented here, the size of the data point in each frequency bin is proportional to the number of forecasts that fall in that bin. Sharpness is the tendency of the forecast to issue probabilities away from climatology; therefore, large data points near 0% and 100% indicate sharp forecasts. If the data points are grouped around a forecast of 50%, the forecast cannot be discriminated from a climatological forecast, even though it may be perfectly reliable.
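As a sketch of how the diagram's ingredients are computed (an illustrative function, not from the paper):

```python
import numpy as np

def attributes_diagram_points(p_fc, o_event, n_bins=10):
    """Ingredients of an attributes diagram: for each of n_bins equally
    spaced probability bins, return the bin-average forecast probability
    (x position), the relative observed frequency (y position), and the
    number of forecasts in the bin (point size)."""
    p_fc = np.asarray(p_fc, dtype=float)
    o_event = np.asarray(o_event, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(p_fc, edges) - 1, 0, n_bins - 1)
    x, y, size = [], [], []
    for k in range(n_bins):
        sel = idx == k
        if not sel.any():
            continue
        x.append(p_fc[sel].mean())      # average forecast, not bin center
        y.append(o_event[sel].mean())   # observed relative frequency
        size.append(int(sel.sum()))
    return np.array(x), np.array(y), np.array(size)
```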
The reliability and resolution attributes can be quantified through the decomposition of the Brier score (BS; Murphy 1973). For $N$ forecast–observation pairs sorted into $K$ probability bins, with $n_k$ forecasts in bin $k$, $\bar{p}_k$ the mean forecast probability in bin $k$, $\bar{o}_k$ the relative observed frequency of the event in bin $k$, and $\bar{o}$ the climatological frequency of the event,

$$\mathrm{BS} = \underbrace{\frac{1}{N}\sum_{k=1}^{K} n_k(\bar{p}_k - \bar{o}_k)^2}_{\mathrm{REL}} - \underbrace{\frac{1}{N}\sum_{k=1}^{K} n_k(\bar{o}_k - \bar{o})^2}_{\mathrm{RES}} + \underbrace{\bar{o}(1 - \bar{o})}_{\mathrm{UNC}}. \tag{1}$$
The first term on the right-hand side of Eq. (1) is the reliability error (REL). A perfectly reliable system has a reliability error of zero, indicating a 1:1 correspondence between the forecast probability and the relative observed frequency. The second term is the resolution (RES) and measures the ability of the forecast to differentiate from a climatological observation: a forecast has good resolution if the observed outcome differs as the forecast probability changes. Although the forecast probability does not enter the resolution term directly, it is inherent in the relative observed frequency through the binning. The third term is the uncertainty (UNC), which is the Brier score of the sample climatology forecast and is independent of the forecast system. For the above/below-median forecasts considered here the event occurs 50% of the time; therefore, the uncertainty term takes its maximum possible value.
Because the forecast probabilities are grouped into bins, the decomposition in Eq. (1) is exact only up to two within-bin terms (Stephenson et al. 2008): the within-bin variance of the forecasts, $\mathrm{WBV} = \frac{1}{N}\sum_{k}\sum_{i\in k}(p_i - \bar{p}_k)^2$, and (twice) the within-bin covariance between forecasts and observations, $\mathrm{WBC} = \frac{2}{N}\sum_{k}\sum_{i\in k}(p_i - \bar{p}_k)(o_i - \bar{o}_k)$. We therefore assess resolution using the generalized resolution of Stephenson et al. (2008),

$$\mathrm{GRES} = \mathrm{RES} - \mathrm{WBV} + \mathrm{WBC}, \tag{2}$$

for which $\mathrm{BS} = \mathrm{REL} - \mathrm{GRES} + \mathrm{UNC}$ holds exactly. The Brier score is also sensitive to ensemble size: with a finite number of members, even a perfectly reliable underlying system yields a positive reliability term (Müller et al. 2005; Weigel et al. 2007).
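A sketch of the full decomposition, including the within-bin terms of Eq. (2), is given below (illustrative code following the definitions above, not the authors' implementation):

```python
import numpy as np

def brier_decomposition(p_fc, o_event, n_bins=10):
    """Brier score decomposition [Eq. (1)] with the within-bin terms
    that define the generalized resolution [Eq. (2)].

    p_fc: 1D array of forecast probabilities in [0, 1]
    o_event: 1D binary array (1 if above-median rainfall was observed)
    """
    p_fc = np.asarray(p_fc, dtype=float)
    o_event = np.asarray(o_event, dtype=float)
    N = p_fc.size
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(p_fc, edges) - 1, 0, n_bins - 1)
    obar = o_event.mean()
    rel = res = wbv = wbc = 0.0
    for k in range(n_bins):
        sel = idx == k
        n_k = sel.sum()
        if n_k == 0:
            continue
        p_k = p_fc[sel].mean()                 # bin-mean forecast probability
        o_k = o_event[sel].mean()              # relative observed frequency
        rel += n_k * (p_k - o_k) ** 2
        res += n_k * (o_k - obar) ** 2
        wbv += ((p_fc[sel] - p_k) ** 2).sum()  # within-bin forecast variance
        wbc += 2.0 * ((p_fc[sel] - p_k) * (o_event[sel] - o_k)).sum()
    rel, res, wbv, wbc = rel / N, res / N, wbv / N, wbc / N
    unc = obar * (1.0 - obar)
    gres = res - wbv + wbc                     # Eq. (2)
    bs = np.mean((p_fc - o_event) ** 2)
    # For binary o_event, BS = REL - GRES + UNC holds to rounding error.
    return {"BS": bs, "REL": rel, "RES": res, "GRES": gres, "UNC": unc}
```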
The differing number of ensemble members likewise results in inaccuracies in a skill score that compares the BS to a reference forecast, such as climatology. A debiased version of the Brier skill score artificially introduces a sampling error to the reference forecast to account for the intrinsic unreliability of a finite number of ensemble members (Müller et al. 2005; Weigel et al. 2007). However, this correction is not independent of ensemble size when the forecasts are overconfident (Tippett 2008), as in the cases presented here. Therefore we choose not to use the Brier skill score to compare the skill of the independent center forecasts with the multimodel ensemble, but instead focus on the decomposed terms to assess the reliability and resolution attributes.
c. Multimodel ensemble combination
Multimodel ensemble combination has been assessed and compared previously (Doblas-Reyes et al. 2000; Palmer et al. 2000, 2004; Hagedorn et al. 2005; Weigel et al. 2009; Lim et al. 2011) and has been shown to improve forecast reliability even when the contributing models are overconfident (Weigel et al. 2009). To reap benefit from a multimodel ensemble approach, the contributing forecast models must have some comparable level of skill (Weigel et al. 2010) but differing strengths and weaknesses (i.e., independent errors), so that which model is most skillful varies with season and lead time. A multimodel ensemble is a more consistently skillful system than the contributing individual models because of the cancellation of uncorrelated forecast error (Bohn et al. 2010; Hagedorn et al. 2005). The more independent and overconfident the single models are, the more likely they are to increase the skill of the multimodel system (Alessandri et al. 2011b). It is not possible for the multimodel ensemble to perform worse than all of the individual models, as the additional information over the worst model can only improve the prediction (Hagedorn et al. 2005). Deciding that a model is consistently less skillful than the other models could be a basis for excluding it from a multimodel ensemble, but this is a subjective assessment. It is also possible to weight the models by their respective skill, but the limited number of hindcasts currently available for seasonal forecasting makes it difficult to assess their relative merit (Weigel et al. 2010). Multimodel ensembles have been shown to benefit from the additional information of the contributing models beyond the expected benefit of an increase in ensemble size, which tends to saturate at about 30 members (Hagedorn et al. 2005; Palmer et al. 2004). A multimodel ensemble benefits from the strengths of a range of independent models; however, issues common to the models will not be resolved without further development of the parameterization of physical processes.
A multimodel ensemble combines the ensemble members from a number of independent models into a single prediction. For the grand ensemble mean across all models, anomalies are first calculated relative to the climatology of each individual model (which is a function of start month and lead time), and the anomalies are then averaged to determine the multimodel ensemble mean.
For probabilistic forecasts, all ensemble members are pooled with equal weight, with each member assessed against the cross-validated median of its own model, so that the multimodel forecast probability of above-median rainfall is the fraction of the pooled members predicting that outcome.
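As an illustrative sketch of these two combination steps (array layouts and function names are assumptions, not the authors' code):

```python
import numpy as np

def multimodel_probability(model_members, model_medians):
    """Equal-weight pooling for the MME probabilistic forecast: each
    member is compared with the cross-validated median of its own model,
    and the forecast probability is the fraction of pooled members
    predicting above-median rainfall."""
    votes = [np.asarray(m) > med for m, med in zip(model_members, model_medians)]
    return float(np.concatenate(votes).mean())

def multimodel_ensemble_mean(model_members, model_clims):
    """Grand ensemble mean: anomalies are taken relative to each model's
    own climatology (a function of start month and lead time) before
    the pooled average is formed."""
    anoms = [np.asarray(m) - c for m, c in zip(model_members, model_clims)]
    return float(np.mean(np.concatenate(anoms)))
```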
4. Comparative skill
a. Accuracy score
We begin by examining the forecast accuracy of the two POAMA versions, and also compare with the three ENSEMBLES models from ECMWF, UKMO, and MF. Figure 1 shows the accuracy scores for prediction of above/below-median rainfall across Australia from the individual models at a lead time of one month. The accuracy score at each grid point is calculated as in section 3a from the 26 yr of cross-validated forecasts (1980–2005). An accuracy score greater than 50% is considered a skillful forecast and is represented by the green and blue shades. The Australian continent-averaged accuracy score is shown in Figs. 2a and 3a, for lead times of one and four months, respectively.
Fig. 1. Accuracy score for above/below-median seasonal rainfall for the P15b, P24, ECMWF, UKMO, and MF models, and the multimodel ensemble using the P24, ECMWF, UKMO, and MF models (MME; 54 members total). Lead time is one month. An accuracy score greater than 50%, as indicated by green and blue shades, is considered skillful.
Fig. 2. (a) Australian continent-averaged accuracy score for above/below-median seasonal rainfall from the P15b, P24, ECMWF, UKMO, and MF models, and the multimodel ensemble using the P24, ECMWF, UKMO, and MF models (MME; 54 members total). (b) Australian continent-averaged mean reliability error. (c) Australian continent-averaged generalized mean resolution. Lead time is one month. Note that for accuracy and resolution a high score is desirable, but for reliability error a low score is desirable.
Fig. 3. As in Fig. 2, but lead time is four months.
For all seasons, forecasts from the newest version of POAMA (P24) are more accurate than those from the older version P15b, which is a good outcome of the development of the POAMA system. This improvement may stem from the improved ocean initial conditions that lead to better predictions of tropical SSTs (e.g., Wang et al. 2011), from the use of three versions of the model, and from the larger ensemble size. Improved prediction of tropical SST anomalies enables more accurate prediction of the teleconnections that drive precipitation variability (e.g., Lim et al. 2011). Improved model physics, as provided in P24a, and reduced systematic error, as provided in P24b, will also enable more skillful predictions of Australian rainfall. See Wang et al. (2011) for an overview of the improved skill of the new operational system.
At the longer lead time of four months, P24 is seen to have a lower average accuracy score than P15b in winter and spring, despite performing better in summer and autumn (Fig. 3a). While the effect of the improvement due to the initial conditions provided by PEODAS is expected to decrease with lead time (Wang et al. 2011; see also Alessandri et al. 2010), the newer version of POAMA has a more damped ENSO than the previous version (Wang et al. 2011). Hence, the strength of the teleconnections of El Niño to Australian rainfall will be weakened at longer lead times with P24 compared to P15b, thereby resulting in a faster reduction of skill in winter and spring when ENSO impacts on rainfall are the greatest.
Comparison of POAMA with the ENSEMBLES models in Fig. 1 reveals that P24 is competitive with the ENSEMBLES systems. Interestingly, no one model stands out for all seasons. In general, the seasons in which the individual models show the most accuracy for predicting rainfall are austral autumn [March–May (MAM)] and spring [September–November (SON)]. In MAM, the models are typically more accurate in the northwest of the continent, except for P24, which is accurate in the southeast. The ECMWF model has the highest accuracy in winter [June–August (JJA)], although P24 is also reasonably skillful in this season. In SON, all models are accurate in the center and east of the continent, where ENSO/IOD impacts are strong (Nicholls 1989; Risbey et al. 2009). SON is also the only season in which forecast accuracy remains high at longer lead times (Fig. 3a), again likely due to the strong impact of ENSO on Australian climate in this season and the good predictability of ENSO in boreal spring. Austral summer [December–February (DJF)] is typically the least skillful season for all models except the UKMO model, which has high accuracy in Western Australia. Interestingly, the models that show high accuracy at the short lead time are not necessarily the most accurate at the longer lead times. A possible explanation is that higher skill at a short lead time may reflect improved initial conditions (Alves et al. 2004; Alessandri et al. 2010, 2011a), whereas higher skill at a longer lead time reflects reduced systematic model error, for instance in the teleconnection of El Niño to Australian rainfall (e.g., Lim et al. 2011).
b. Reliability and resolution
Figure 4 shows the attributes diagrams for the individual models for forecasts pooled over all four seasons at a lead time of one month. Here we assess reliability for the probability of above-median rainfall using all land points over Australia. Combining all seasons allows the consistency of the models to be more readily assessed and gives a better indication of the value of the predictions, as operationally it is desirable to have reliable forecasts at all times and locations. The size of each data point corresponds to the fraction of forecasts in that probability interval, and the 10 forecast probability bins are equally spaced between 0% and 100%. The attributes diagrams show data points corresponding to the average of the forecasts in each bin rather than the central bin value, so that a reliable forecast will lie on the solid 1:1 line (e.g., Brocker and Smith 2007).
Fig. 4. Attributes diagrams for predictions of above-median rainfall for the individual models and the multimodel ensemble using the P24, ECMWF, UKMO, and MF models (MME; 54 members total), for all Australian continental grid points with the four seasons combined. Lead time is one month. The y axis is the relative observed frequency and the x axis is the forecast probability (data points correspond to the average of forecasts within each probability bin). The solid line shows perfect reliability. The dashed line is the no-skill line, which borders the shaded area indicating skillful forecasts. The size of each data point is proportional to the fraction of forecasts in that probability bin.
The unreliability of the P15b forecasts, as discussed in section 1, is readily apparent. The forecasts are overconfident and have poor to moderate reliability, as they have a shallower gradient than the perfectly reliable (diagonal solid) line. It is pleasing to see that the newer version P24 has much improved reliability over P15b. The ensemble of forecasts from P24 is, in fact, more reliable than that of any of the ENSEMBLES models considered here. Forecasts from the ECMWF model, the highest-resolution model considered here, show moderate reliability and are the best of the ENSEMBLES models. The improved reliability of P24 over P15b may stem from improved forecast accuracy due to better predictions of tropical SSTs (e.g., Wang et al. 2011) resulting from the improved initial conditions provided by PEODAS, but the improved ensemble generation strategy (more members, drawn from three versions of the model) likely played a primary role.
To further diagnose the reliability, Figs. 2b and 3b show the seasonality of the mean reliability error [Eq. (1)] of Australian rainfall forecasts at lead times of one and four months, respectively. A smaller mean reliability error indicates a more reliable model, as it measures how far the data points in an attributes diagram lie from the diagonal. P24 shows improved reliability over P15b in all seasons and at both lead times (Figs. 2b and 3b). Examining all of the models, the annual variation of the mean reliability error tracks the inverse of the annual variation in the continent-averaged accuracy score; a more accurate model therefore tends to be more reliable. The seasons in which the models are most reliable and accurate are MAM and SON for both lead times. In winter (JJA) at the 1-month lead time, two models have relatively low reliability error while three have relatively high reliability error; the spread among models is reduced in SON. The ECMWF model is the most reliable model at a 1-month lead time in JJA but is relatively unreliable at a 4-month lead time, at which the UKMO model is more reliable. This mirrors the accuracy score in section 4a, where the models with the highest skill at the short lead time were often not the most accurate at the longer lead time. All models lack reliability in summer (DJF) at both lead times.
The seasonality of the generalized mean resolution of the models [Eq. (2)] is shown in Figs. 2c and 3c for lead times of one and four months, respectively. P24 provides increased mean resolution over P15b, although this benefit diminishes at the longer lead time when forecast accuracy is lower. The mean resolution of P24 is comparable to that of the individual ENSEMBLES models. In general, the annual variation of the mean resolution is similar to that of the mean reliability error and the accuracy score. P24 does not have the highest resolution of the models, despite its higher accuracy and reliability. This results from the distribution of forecasts issued over all seasons, whereby P24 issues more near-climatological forecasts and fewer emphatic (high or low probability) forecasts, which probably reflects the use of three versions of the POAMA model in the P24 ensemble.
In summary, overconfidence and a lack of reliability in regional rainfall forecasts are common to the selection of dynamical coupled models considered here, although a substantial improvement in reliability was achieved in going from P15b to P24. Interestingly, despite having the lowest atmospheric model resolution, the P24 system, which has the largest ensemble size, produces the most reliable forecasts. In comparison with the three ENSEMBLES systems, each system shows strengths for particular seasons or lead times. This variation in skill and reliability of the individual models across regions, seasons, and lead times indicates the potential for improvement with further model development. It also indicates that a multimodel ensemble combining the universally overconfident ECMWF, UKMO, MF, and P24 forecasts should reduce the uncorrelated error, resulting in a more consistently accurate and reliable prediction system than the individual models.
5. Multimodel ensembles
a. Accuracy score
To assess the benefit of combining the strengths and weaknesses of the individual models discussed in section 4, and to demonstrate what might be feasible by contributing P24 to a multimodel ensemble, the ENSEMBLES hindcasts from ECMWF, UKMO, and MF were combined with the ensemble of forecasts from P24 into a 54-member multimodel ensemble (MME), according to the method described in section 3c. The spatial accuracy scores for the multimodel ensemble are shown in the bottom row of Fig. 1 for a lead time of one month. The continental average of the accuracy score for the MME is shown in Figs. 2a and 3a for lead times of one and four months, respectively. Regions of high accuracy for the MME, such as the southeast of the continent in MAM, the center and east in SON, and the west in DJF, correspond to areas where the individual contributing models have high skill. Importantly, however, the MME is more consistently accurate across all seasons and areas than each of the individual models.
Nonetheless, the MME is not always the most accurate (e.g., P24 and ECMWF outperform the MME for winter rainfall; see Fig. 2a), but the MME is more accurate than these models in the other three seasons. This illustrates the benefit of a multimodel ensemble: the combination of the strengths of independent models and the cancellation of uncorrelated error. At the longer lead time, the contributing models have similar skill, so there is less of an increase in the accuracy score and its spatial coverage from combining them (Fig. 3a). This also suggests that the errors of the different models at the longer lead time are more highly correlated, as the models suffer from similar inabilities to predict rainfall a season ahead.
b. Correlation of the ensemble mean
The correlation of the multimodel ensemble mean rainfall anomaly with observations is not improved in the multimodel ensemble compared to the individual models (not shown). Although correlation of the ensemble-mean prediction is a deterministic assessment of forecast skill, it shows patterns similar to the accuracy scores determined from probabilistic forecasts based on the individual members. This is consistent with the findings of Alessandri et al. (2011b) for surface air temperatures in the tropics. For instance, the correlation of the ensemble mean from P24 with observations shows regions of strong correlation (greater than 0.6) in MAM and SON in the same areas where the accuracy score is highest in Fig. 1. Strong negative correlation of the ensemble mean from P24 (less than −0.7) is seen in the south of the continent in JJA, corresponding to regions of low accuracy (less than 25% correct) in Fig. 1. This region of negative correlation using the ensemble mean stems from systematic model error in simulating the key teleconnection of El Niño to Australian rainfall (Lim et al. 2009). Some of this bias is due to climate drift, which motivated the inclusion of the mean-state bias-corrected version (P24b), although correcting the mean state did not completely eliminate the biased teleconnection (Lim et al. 2009).
In regions with high accuracy scores for the multimodel ensemble, the magnitude of the correlation using the multimodel ensemble mean is not increased by the combination of the individual models. A multimodel ensemble does not necessarily improve the deterministic aspect of the forecast. Combining individual models that show forecast biases into a multimodel ensemble results in a greater reduction of forecast error, as measured by the root-mean-square error of the ensemble mean, than for models that do not exhibit a forecasting bias (Weigel et al. 2008; Bohn et al. 2010). As we account for bias between the models and forecasts by comparing them to their respective climatologies or median values, the main benefit in combining predictions here comes from the cancellation of uncorrelated error and the increased spread of probabilistic forecasts (Bohn et al. 2010; Hagedorn et al. 2005).
c. Reliability and resolution
At a lead time of one month, the MME has lower mean reliability error than the contributing models (Fig. 4). This improvement is consistent across all seasons (Fig. 2b), and for the longer lead time (Fig. 3b), even when some of the individual models show poor reliability. Improved reliability is the greatest improvement of the multimodel ensemble system over the contributing models.
The seasonality of the generalized mean resolution over Australia of the MME at lead times of one and four months is shown in Figs. 2c and 3c, respectively. The MME has the highest resolution except in summer (DJF) and at the longer lead time. However, these are times when the contributing models also have low resolution; it is therefore not unexpected that the multimodel ensemble shows similarly low resolution.
d. Benefit of combining P24 with European models
In the previous section, the full MME of the ECMWF, UKMO, and MF models with P24 was assessed for increases in skill relative to the individual contributing models. Here we investigate the benefit that comes specifically from the inclusion of P24, which is an independent model with a different initialization scheme. A reduced-model multimodel ensemble of the ECMWF, UKMO, and MF models is used as a representation of the current EUROSIP collaboration. The inclusion of P24 with these models in the full MME is expected to increase the skill, owing to the additional information and the cancellation of uncorrelated errors from an independent system. In general, forecast skill also increases with the number of ensemble members because of the broader range of possible outcomes sampled; the increase typically saturates at about 30 members (Palmer et al. 2004).
To investigate the major source of the benefit of contributing P24 to the MME beyond the reduced-model multimodel, a reduced-member multimodel was also formed using a smaller subset of ensemble members from all models. For each of the six contributing models (P24a, P24b, P24c, ECMWF, UKMO, and MF), five ensemble members were randomly selected without replacement, giving a total of 30 members. This was repeated for 100 Monte Carlo runs and the resulting probabilistic forecasts were averaged. The reduced-member multimodel thus has 30 ensemble members, and the reduced-model multimodel has 27, comparable in number to P24.
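The reduced-member construction might be sketched as follows (illustrative only; six arrays of member values and their cross-validated medians are assumed inputs):

```python
import numpy as np

rng = np.random.default_rng(42)

def reduced_member_probability(model_members, model_medians,
                               n_per_model=5, n_runs=100):
    """Reduced-member multimodel: draw n_per_model members from each of
    the six contributing models (P24a, P24b, P24c, ECMWF, UKMO, MF)
    without replacement, pool them (30 members), and average the
    resulting above-median probabilities over Monte Carlo runs."""
    probs = []
    for _ in range(n_runs):
        pooled = [rng.choice(m, size=n_per_model, replace=False) > med
                  for m, med in zip(model_members, model_medians)]
        probs.append(np.concatenate(pooled).mean())
    return float(np.mean(probs))
```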
The attributes diagrams for the reduced-model and reduced-member multimodels are shown in Fig. 5, along with the result from P24 repeated from Fig. 4. The reduced-model multimodel shows similar resolution and sharpness to the P24 model. It has similar overall reliability to P24, but is more reliable at forecast probabilities less than 10% and greater than 80% (i.e., the data points for the highest and lowest probabilities are closer to the solid diagonal line). P24 is more reliable in the central probability range.
Fig. 5. Attributes diagrams for all four seasons combined, at a lead time of one month, for P24 (repeated from Fig. 4; 27 members total), the reduced-model multimodel ensemble using the ECMWF, UKMO, and MF models only (27 members total), and the reduced-member multimodel ensemble of the P24, ECMWF, UKMO, and MF models (30 members total).
The reduced-member multimodel attributes diagram shows higher reliability than P24 or the reduced-model multimodel, and is consistent with the MME attributes diagram shown in Fig. 4. This suggests that it is the addition of the independent information of P24 and the cancellation of uncorrelated error that increases reliability in the multimodel, and there is minimal additional benefit due to the increased number of ensemble members beyond 30 members.
A significant increase in the accuracy score cannot be measured for the full MME or the reduced-member multimodel over the reduced-model multimodel, despite some indication of higher skill in JJA and SON when P24 has higher accuracy. The benefit of including the additional independent information of P24 in the multimodel ensemble is therefore seen primarily in the increase in reliability, beyond the increase in the number of ensemble members. This suggests that combining different versions of a model (or including stochastic parameterizations) at a single operational center (as done with P24) would benefit regional rainfall predictions more than a single version with an increased number of ensemble members. Furthermore, these results indicate that there is benefit in adding P24 to the models from ECMWF, MF, and UKMO to increase the reliability and consistency of accurate regional rainfall forecasts.
6. Conclusions
Hindcasts of seasonal mean rainfall for Australia were assessed from two versions of the POAMA seasonal forecast system: P15b and the newer P24. Substantial progress has been achieved with P24 in reducing the reliability errors that were prominent in P15b. P24 also shows higher accuracy in all seasons at the short lead time. This improvement is attributed to the improved ensemble generation strategy (more members and the use of three model versions) and improved ocean initial conditions that result in better forecasts of tropical SSTs. Nonetheless, forecasts from P24 still suffer from a lack of reliability.
Comparison with forecasts from three ENSEMBLES models indicates that low reliability of regional rainfall forecasts is a common problem among contemporary coupled dynamical forecast models, even though the forecasts have demonstrable skill as measured by forecast accuracy. The individual models assessed in this study all showed similar accuracy, but POAMA version P24 demonstrated higher reliability. Interestingly, P24 achieves higher reliability despite having the lowest atmospheric resolution of the models considered here. However, P24 uses a larger ensemble than is available from the ENSEMBLES archive and makes use of three slightly different versions of the model. These results suggest that a single operational center may deliver more reliable and skillful systems by perturbing its model as well as increasing the number of ensemble members.
P24 also has the potential to improve the reliability of a multimodel ensemble based on the EUROSIP system. The combination of individual, independent models into a multimodel ensemble, in order to increase the reliability and accuracy of climate predictions, benefits from the cancellation of uncorrelated model error and increased spread. The multimodel ensemble of P24, ECMWF, UKMO, and MF models showed consistently higher reliability than the individual models. The multimodel ensemble that includes P24 also showed higher reliability compared to a reduced-model multimodel based on just the three ENSEMBLES models. The increase in reliability of the multimodel ensemble including P24 is due in part to reduced model error rather than just an increase in ensemble size. These results indicate that there is benefit in adding POAMA version P24 to a real-time multimodel ensemble such as is available from the EUROSIP project in order to increase the reliability and consistency of accurate regional rainfall forecasts in real-time systems.
Acknowledgments
Support was provided from the Water Information Research and Development Alliance (WIRADA) Project 4.2 “Improved climate predictions at hydrologically-relevant time and space scales from the POAMA seasonal climate forecasts” and from the South Eastern Australia Climate Initiative (http://www.seaci.org). ENSEMBLES was funded by the EU FP6 Integrated Project ENSEMBLES (505539), whose support is gratefully acknowledged. We also thank the reviewers for their constructive comments on an earlier version of the manuscript.
REFERENCES
Alessandri, A., A. Borrelli, S. Masina, A. Cherchi, S. Gualdi, A. Navarra, P. Di Pietro, and A. F. Carril, 2010: The INGV–CMCC seasonal prediction system: Improved ocean initial conditions. Mon. Wea. Rev., 138, 2930–2952.
Alessandri, A., A. Borrelli, S. Gualdi, E. Scoccimarro, and S. Masina, 2011a: Tropical cyclone count forecasting using a dynamical seasonal prediction system: Sensitivity to improved ocean initialization. J. Climate, 24, 2963–2982.
Alessandri, A., A. Borrelli, A. Navarra, A. Arribas, M. Déqué, P. Rogel, and A. Weisheimer, 2011b: Evaluation of probabilistic quality and value of ENSEMBLES multimodel seasonal forecasts: Comparison with DEMETER. Mon. Wea. Rev., 139, 581–607.
Alves, O., M. A. Balmaseda, D. Anderson, and T. Stockdale, 2004: Sensitivity of dynamical seasonal forecasts to ocean initial conditions. Quart. J. Roy. Meteor. Soc., 130, 647–667.
Balmaseda, M. A., A. Vidard, and D. L. T. Anderson, 2008: The ECMWF ocean analysis system: ORA-S3. Mon. Wea. Rev., 136, 3018–3034.
Bohn, T. J., M. Y. Sonessa, and D. P. Lettenmaier, 2010: Seasonal hydrologic forecasting: Do multimodel ensemble averages always yield improvements in forecast skill? J. Hydrometeor., 11, 1358–1372.
Brocker, J., and L. A. Smith, 2007: Increasing the reliability of reliability diagrams. Wea. Forecasting, 22, 651–661.
Collins, W. J., and Coauthors, 2008: Evaluation of the HadGEM2 model. Hadley Centre Tech. Note 74, Met Office Hadley Centre, Exeter, United Kingdom, 47 pp.
Colman, R., and Coauthors, 2005: BMRC atmospheric model (BAM) version 3.0: Comparison with mean climatology. BMRC Research Rep. 108, 32 pp.
Daget, N., A. T. Weaver, and M. A. Balmaseda, 2009: Ensemble estimation of background-error variances in a three-dimensional variational data assimilation system for the global ocean. Quart. J. Roy. Meteor. Soc., 135, 1071–1094.
Déqué, M., C. Dreveton, A. Braun, and D. Cariolle, 1994: The ARPEGE/IFS atmosphere model: A contribution to the French community climate modelling. Climate Dyn., 10, 249–266.
Doblas-Reyes, F. J., M. Déqué, and J.-P. Piedelievre, 2000: Multimodel spread and probabilistic seasonal forecasts in PROVOST. Quart. J. Roy. Meteor. Soc., 126, 2069–2087.
Drosdowsky, W., and L. E. Chambers, 2001: Near-global sea surface temperature anomalies as predictors of Australian seasonal rainfall. J. Climate, 14, 1677–1687.
Finley, J. P., 1884: Tornado prediction. Amer. Meteor. J., 1, 85–88.
Hagedorn, R., F. J. Doblas-Reyes, and T. N. Palmer, 2005: The rationale behind the success of multimodel ensembles in seasonal forecasting—I. Basic concept. Tellus, 57A, 219–233.
Hendon, H. H., E.-P. Lim, G. Wang, O. Alves, and D. Hudson, 2009: Prospects for predicting two flavors of El Niño. Geophys. Res. Lett., 36, L19713, doi:10.1029/2009GL040100.
Hewitt, C. D., and D. J. Griggs, 2004: Ensembles-based predictions of climate changes and their impacts. ENSEMBLES Tech. Rep. 1, 5 pp. [Available online at http://ensembles-eu.metoffice.com/tech_reports.html.]
Hudson, D., O. Alves, H. H. Hendon, and G. Wang, 2011: The impact of atmospheric initialisation on seasonal prediction of tropical Pacific SST. Climate Dyn., 36, 1155–1171.
Johnson, C., and R. Swinbank, 2009: Medium-range multimodel ensemble combination and calibration. Quart. J. Roy. Meteor. Soc., 135, 777–794.
Jones, D. A., and G. Weymouth, 1997: An Australian monthly rainfall dataset. Australian Bureau of Meteorology Tech. Rep. 70, 19 pp.
Lim, E.-P., H. H. Hendon, D. Hudson, G. Wang, and O. Alves, 2009: Dynamical forecast of inter-El Niño variations of tropical SST and Australian spring rainfall. Mon. Wea. Rev., 137, 3796–3810.
Lim, E.-P., H. H. Hendon, D. L. T. Anderson, A. Charles, and O. Alves, 2011: Dynamical, statistical-dynamical and multimodel ensemble forecasts of Australian spring season rainfall. Mon. Wea. Rev., 139, 958–975.
Madec, G., P. Delecluse, M. Imbard, and C. Levy, 1998: OPA8.1 Ocean General Circulation Model reference manual. IPSL Tech. Note 11, Institut Pierre Simon Laplace, Paris, France, 97 pp.
McBride, J. L., and N. Nicholls, 1983: Seasonal relationships between Australian rainfall and the southern oscillation. Mon. Wea. Rev., 111, 1998–2004.
Müller, W. A., C. Appenzeller, F. J. Doblas-Reyes, and M. A. Liniger, 2005: A debiased ranked probability skill score to evaluate probabilistic ensemble forecasts with small ensemble sizes. J. Climate, 18, 1513–1523.
Murphy, A. H., 1973: A new vector partition of the probability score. J. Appl. Meteor., 12, 595–600.
Nicholls, N., 1989: Sea surface temperatures and Australian winter rainfall. J. Climate, 2, 965–973.
Palmer, T. N., Č. Branković, and D. S. Richardson, 2000: A probability and decision-model analysis of PROVOST seasonal multi-model ensemble integrations. Quart. J. Roy. Meteor. Soc., 126, 2013–2033.
Palmer, T. N., and Coauthors, 2004: Development of a European multimodel ensemble for seasonal-to-interannual prediction (DEMETER). Bull. Amer. Meteor. Soc., 85, 853–872.
Palmer, T. N., G. J. Shutts, R. Hagedorn, F. J. Doblas-Reyes, T. Jung, and M. Leutbecher, 2005: Representing model uncertainty in weather and climate prediction. Annu. Rev. Earth Planet. Sci., 33, 163–193.
Risbey, J. S., M. J. Pook, P. C. McIntosh, W. C. Wheeler, and H. H. Hendon, 2009: On the remote drivers of rainfall variability in Australia. Mon. Wea. Rev., 137, 3233–3253.
Salas Mélia, D., 2002: A global coupled sea ice-ocean model. Ocean Modell., 4, 137–172.
Schiller, A., J. S. Godfrey, P. C. McIntosh, G. Meyers, N. R. Smith, O. Alves, G. Wang, and R. Fiedler, 2002: A new version of the Australian community ocean model for seasonal climate prediction. CSIRO Marine Research Rep. 240, 82 pp.
Smith, N. R., J. E. Blomley, and G. Meyers, 1991: A univariate statistical interpolation scheme for subsurface thermal analyses in the tropical oceans. Prog. Oceanogr., 28, 219–256.
Stephenson, D. B., C. A. S. Coelho, and I. T. Jolliffe, 2008: Two extra components in the Brier score decomposition. Wea. Forecasting, 23, 752–757.
Stockdale, T. N., and Coauthors, 2011: ECMWF seasonal forecast system 3 and its prediction of sea surface temperature. Climate Dyn., 37, 455–471.
Tippett, M. K., 2008: Comments on “The discrete Brier and ranked probability skill scores.” Mon. Wea. Rev., 136, 3629–3633.
Tracton, M. S., and E. Kalnay, 1993: Operational ensemble prediction at the National Meteorological Center: Practical aspects. Wea. Forecasting, 8, 379–398.
Valcke, S., L. Terray, and A. Piacentini, 2000: OASIS 2.4, Ocean atmosphere sea ice soil: User’s guide. Tech. Rep. TR/CMGC/00/10, CERFACS, Toulouse, France, 85 pp.
Wang, G., O. Alves, D. Hudson, H. H. Hendon, G. Liu, and F. Tseitkin, 2008: SST skill assessment from the new POAMA-1.5 system. BMRC Research Letter 8, 2–6.
Wang, G., D. Hudson, Y. Yin, O. Alves, H. H. Hendon, S. Langford, G. Liu, and F. Tseitkin, 2011: POAMA-2 SST skill assessment and beyond. CAWCR Research Letter 6, 40–46.
Weigel, A. P., M. A. Liniger, and C. Appenzeller, 2007: The discrete Brier and ranked probability skill scores. Mon. Wea. Rev., 135, 118–124.
Weigel, A. P., M. A. Liniger, and C. Appenzeller, 2008: Can multi-model combination really enhance the prediction skill of probabilistic ensemble forecasts? Quart. J. Roy. Meteor. Soc., 134, 241–260.
Weigel, A. P., M. A. Liniger, and C. Appenzeller, 2009: Seasonal ensemble forecasts: Are recalibrated single models better than multimodels? Mon. Wea. Rev., 137, 1460–1479.
Weigel, A. P., R. Knutti, M. A. Liniger, and C. Appenzeller, 2010: Risks of model weighting in multimodel climate projections. J. Climate, 23, 4175–4191.
Weisheimer, A., and Coauthors, 2009: ENSEMBLES: A new multimodel ensemble for seasonal-to-annual predictions—Skill and progress beyond DEMETER in forecasting tropical Pacific SSTs. Geophys. Res. Lett., 36, L21711, doi:10.1029/2009GL040896.
Wilks, D. S., 1995: Statistical Methods in the Atmospheric Sciences. Academic Press, 467 pp.
Wolff, J. E., E. Maier-Reimer, and S. Legutke, 1997: The Hamburg ocean primitive equation model. Tech. Rep. 13, Deutsches Klimarechenzentrum, Hamburg, Germany, 58 pp.
Yin, Y., O. Alves, and P. R. Oke, 2011: An ensemble ocean data assimilation system for seasonal prediction. Mon. Wea. Rev., 139, 786–808.
Zhao, M., and H. H. Hendon, 2009: Representation and prediction for the Indian Ocean dipole in the POAMA seasonal forecast model. Quart. J. Roy. Meteor. Soc., 135, 337–352.