This paper examines the quality of seasonal probabilistic forecasts of near-global temperature and precipitation issued by the International Research Institute for Climate and Society (IRI) from late 1997 through 2008, using mainly a two-tiered multimodel dynamical prediction system. Skill levels, while modest when globally averaged, depend markedly on season and location and average higher in the tropics than extratropics. To first order, seasons and regions of useful skill correspond to known direct effects as well as remote teleconnections from anomalies of tropical sea surface temperature in the Pacific Ocean (e.g., ENSO related) and in other tropical basins. This result is consistent with previous skill assessments by IRI and others and suggests skill levels beneficial to informed clients making climate risk management decisions for specific applications. Skill levels for temperature are generally higher, and less seasonally and regionally dependent, than those for precipitation, partly because of correct forecasts of enhanced probabilities for above-normal temperatures associated with warming trends. However, underforecasting of above-normal temperatures suggests that the dynamical forecast system could be improved through inclusion of time-varying greenhouse gas concentrations. Skills of the objective multimodel probability forecasts, used as the primary basis for the final forecaster-modified issued forecasts, are comparable to those of the final forecasts, but their probabilistic reliability is somewhat weaker. Automated recalibration of the multimodel output should permit improvements to their reliability, allowing them to be issued as is. IRI is currently developing single-tier prediction components.
The International Research Institute for Climate and Society (IRI) began issuing seasonal forecasts of near-global climate in October 1997, using a two-tiered dynamically based multimodel prediction system (Mason et al. 1999). The forecasts are probabilistic with respect to the occurrence of three climatologically equiprobable categories of seasonal total precipitation and mean temperature—below, near, and above normal as defined by the 30-yr base period in use at the time. The forecasts were issued quarterly for the two upcoming consecutive 3-month periods from October 1997 until June 2001, after which time they were issued monthly for the same two lead times, but additionally for the two intermediate overlapping 3-month periods. For most of the history of the IRI’s forecasts, they have been issued approximately one-half month prior to the beginning of the first 3-month forecast period.1 We define the lead time as the time between issuance and the start of the targeted period; since June of 2001, forecasts were issued at 0.5-, 1.5-, 2.5-, and 3.5-month lead times.
An evaluation of the performance of IRI’s seasonal forecasts from 1997 through 2001 was presented in Goddard et al. (2003). Forecast skills were found to be positive for the seasons and regions known to have intrinsic predictability, aided in particular by the strong ENSO events from mid-1997 through mid-2000. In this paper forecast skills during the longer period of late 1997 through 2008 are described.
In section 2 the methodology used to produce IRI’s precipitation and temperature forecasts, and the forecast format, are outlined. In section 3 the verification data and procedures are defined, and in section 4 skills of the IRI’s forecasts are examined by region and season and in the context of the ENSO state and a strong multidecadal warming trend. Skills of the issued forecasts are compared with those of the objective guidance of the numerical prediction tools. Section 5 provides a summary and suggests possible improvements for IRI’s climate forecasts.
2. Climate prediction methodology
The IRI’s prediction methodology has been primarily dynamical, using a two-tiered system (Bengtsson et al. 1993) in which a set of SST predictions is first established, and then a set of atmospheric general circulation models (AGCMs), each consisting of multiple ensemble runs, is forced by the set of predicted SSTs (Mason et al. 1999). The use of multiple SST scenarios accounts for uncertainty in the SST predictions, yielding more realistic levels of uncertainty in the temperature and precipitation forecasts than would be produced from a single (but imperfect) SST scenario.
a. SST prediction
The precise details of the method of deriving the SST predictions have evolved during the 11 years of IRI’s forecasts, but use of both persisted SST anomalies and one or more scenarios of evolving SST predictions based on a combination of dynamical and statistical models have been consistent features. In all scenarios, the SST forecasts in the extratropics (outside of 30°N–25°S) are damped persistence of the mean anomalies observed the previous month (added to the forecast season’s climatology), with an e-folding time of 3 months (Mason et al. 1999). In the tropics, multimodel, mainly dynamical SST forecasts are used for the Pacific—the basin having the best known physics and model forecast consistency—while statistical and dynamical forecasts are combined for the Indian and Atlantic Oceans. In the non-Pacific tropical basins, during seasons having little apparent SST predictive skill, damped persistence of the SST anomalies observed in the most recent month are used, but with a lower damping rate than applied in the extratropics. For seasons having greater apparent skill, canonical correlation analysis (CCA; Glahn 1968; Barnett and Preisendorfer 1987) models are used in the Indian Ocean (Mason et al. 1999) and tropical Atlantic Ocean (Repelli and Nobre 2004).
A separate scenario of globally persisted SST anomaly, consisting of undamped anomalous SST observed the previous month added to the climatology of the months being forecast, is used out to 4 months for the IRI’s shortest lead time forecasts. For the nonpersisted, evolving SST anomaly predictions, the AGCMs are run out to 7 months.
The methods used to develop the SST forecasts are detailed in Table 1. For the evolving SST forecasts, three versions of the forecast SST anomalies have been used. In the first version, used through May 2004, a single deemed best estimated forecast SST scenario was used for tropical Pacific SST, which was that of the National Centers for Environmental Prediction (NCEP) coupled model (Ji et al. 1998). Beginning June 2004, three separate tropical Pacific scenarios were used: NCEP’s more recently developed global coupled model [Climate Forecast System (CFS); Saha et al. 2006], the Lamont-Doherty Earth Observatory (LDEO) intermediate coupled model version 5 (Chen et al. 2004), and the constructed analog (CA) statistical model (Van den Dool 1994, 2007; Van den Dool et al. 2003). One-third of the ensemble members of each of the AGCMS was forced by each SST scenario. This multiscenario design (Li et al. 2008) was believed to better represent the uncertainty expressed by the spread of the ensemble mean SST forecasts among the three models, whose forecast ENSO states often differed considerably. In the tropical Atlantic and Indian Oceans, a single scenario was used, consisting of the average of the CFS and CA ensemble mean forecasts.
Use of multiple SST scenarios was refined further in a third version starting in May 2007, noting that sometimes the ensemble mean tropical Pacific forecasts of the three models agreed closely, while at other times they differed greatly. The degree of disagreement is not believed to be significantly related to actual forecast uncertainty (e.g., Kharin and Zwiers 2002; Tippett et al. 2007). To ensure more approximately comparable scenario differences from year to year for the same forecast start month and lead time, the three scenarios were derived based on the historical error of the 3-way superensemble mean of the models, for hindcasts using observed SSTs over the global tropics. The preferred structures of the error field were found using principal components analysis (PCA) on the multimodel mean SST hindcast error. The three scenarios then used are 1) the 3-way multimodel ensemble mean SST forecast itself (with mean biases removed, and that mean 2) plus and 3) minus the first PC of the historical error. The PC accounts for roughly 40% of the model error variance, and its spatial pattern for most start and lead times is related largely, but not exclusively, to ENSO.
b. AGCMs for climate prediction
In the second tier of IRI’s prediction system, several AGCMs are forced by the set of predicted SSTs. The initial states of the AGCMs are not based on observed atmospheric or land surface conditions but are taken from ongoing updates to long AGCM simulations forced by observed SSTs. Because the earliest predicted period begins 3–4 weeks after the time of the forecast integrations, use of observed atmospheric initial conditions is not considered critical. However, the lack of observed land surface initial conditions (soil moisture, snow cover) may slightly degrade the forecasts because their effects can continue for longer than one month. The initial conditions used, differing among ensemble members, are characteristic of the respective model, region, and time of year, and the probability distribution of possible atmospheric states is spanned across members, constrained to be consistent only with the prescribed SST boundary conditions.
The number and specific set of AGCMs, and their forcing by the SST predictions, have evolved over the 11 yr of forecasting (Table 2). Three AGCMs with T42 spectral horizontal resolution (∼2.8° latitude–longitude) were used from late 1997 to early 2001, after which additional or replacement AGCMs were used (Barnston et al. 2003). Seven AGCMs have been used from late 2004 through 2008, providing a total of 144 (68) ensemble members forced by evolving (persisted) SST. The National Aeronautics and Space Administration (NASA), Center for Ocean–Land–Atmosphere Studies (COLA), Geophysical Fluid Dynamics Laboratory (GFDL), and Scripps models have highest horizontal resolution [T62 spectral (∼2.0°) or 2.5° × 2.0° gridded]. The European Center for Medium-Range Weather Forecasts–Deutsches Klimarechenzentrum: Hamburg Model (ECHAM) and National Center for Atmospheric Research (NCAR) Community Climate Model (CCM) have been run at IRI, while the other models have been run at their home institutions using IRI’s SST boundary conditions and graciously sent monthly to IRI to contribute to the forecasts. All model outputs are expressed with respect to their own climatologies (e.g., mean and terciles) based on multidecadal simulations using observed SST.
The climatological base period used as the reference frame for forecasts and observations was 1961–90 from 1997 until June 2001, 1969–982 from July 2001 through 2002, and 1971–2000 from January 2003 to present.
Forecasts issued through early 2001 were developed largely from the ECHAM3.6, CCM3.2, and NCEP–Medium-Range Forecast (MRF9) AGCMs, whose forecasts were combined subjectively by the forecasters using various model validation statistics (Mason et al. 1999; Goddard et al. 2003). Forecast formation was further guided by empirical probabilistic composites based on relative frequencies of occurrence of tercile-based categories keyed to past ENSO episodes (Mason and Goddard 2001). Beginning in mid-2001 the process of merging the AGCM predictions into a final forecast was automated (Barnston et al. 2003). Two multimodel ensembling methods were used: a Bayesian method (Rajagopalan et al. 2002; Robertson et al. 2004) and a canonical variate method (Mason and Mimmack 2002), and the two forecast results were averaged. In both methods, individual model weighting varies by grid point and forecast target season, governed by the models’ historical skills over an approximately 50-yr period when forced by observed SST fields. Use of this model weighting formulation, to be discussed in the context of the skill results in section 4f, is not ideal because observed SST is not available for the target periods in the real-time forecast setting.
c. Final forecast
Even with the more automated system implemented in 2001, final minor subjective modification of the objective forecasts by the forecasters has continued. This modification has consisted largely of overall damping of probabilities toward climatology—more at high than at low latitudes, and in particular for inordinately strong regional probability shifts. Light spatial smoothing has also been done to reduce noise. Other modifications include selected spatial model output statistics (MOS) corrections of systematic errors of the individual AGCMs for precipitation for specific regions using CCA (Ndiaye et al. 2009; Tippett et al. 2003; Landman and Goddard 2002); a nudging toward reduction (enhancement) of probabilities for below-normal (above normal) temperature, partly in response to a diagnostic verification of IRI’s forecasts during 1997–2000 (Wilks and Godfrey 2002); and making the forecasts more consistent with those of other meteorological centers or regional climate outlook forums.
3. Data and methods
Consistent datasets of observed global temperature and precipitation are required to calibrate the model forecasts and to verify the forecasts. For temperature, the 2° gridded global Climate Anomaly Monitoring System (CAMS) dataset from National Oceanic and Atmospheric Administration (NOAA) (Ropelewski et al. 1985) is used. For precipitation, the Climate Prediction Center (CPC) Merged Analysis of Precipitation (CMAP; Xie and Arkin 1997) for data from 1979 onward and the data from the Climate Research Unit (CRU) of the University of East Anglia for 1961–78 (New et al. 2000; Mitchell and Jones 2005) are used. Tests for relative biases during the overlap period indicate minor biases in mean and biases in variance (CRU data having lower variance). The latter biases slightly affect the terciles when the 1961–90 climatology period was used but have little effect for the two later base periods.
Here we use the ranked probability skill score (RPSS; Epstein 1969), a likelihood score (Aldrich 1997), and a generalization of the relative operating characteristics (ROC) curve (Mason 1982) to the three forecast categories collectively (Mason and Weigel 2009). For additional diagnostic understanding, we apply reliability analysis (Murphy 1973).
The RPSS (Epstein 1969; Wilks 2006), an extension of the Brier skill score (Brier 1950) to more than two categories, begins with computation of the ranked probability score (RPS). RPS is the squared probability error, cumulative across the forecast categories in ascending rank order, between the categorical forecast probability and the corresponding observed “probability” (100% probability assigned to the observed category, 0% otherwise). Higher RPS indicates larger forecast error. RPSS is positive when the RPS of the forecasts is less than that of a chosen reference forecast (here, the climatology forecast of one-third probability for each category).
The likelihood score (Aldrich 1997) is based on the product, over all forecasts in a set of n forecasts, of the probabilities forecast for the actually observed category. The nth root of this product is taken, yielding an intuitively meaningful geometric mean probability assigned to the correct category. The likelihood score is closely related to the ignorance score (Roulston and Smith 2002), linked to information theory, and derived scores such as return ratio (Hagedorn and Smith 2008; Tippett and Barnston 2008). A likelihood skill score (LSS) compares the likelihood score for the forecast with that of a reference forecast (here, climatology), assigning zero skill if it equals the reference. A differing feature between LSS and RPSS is that LSS is based on the probability assigned only to the category later observed (the locality property; Brocker and Smith 2007), ignoring probabilities for the other categories; RPSS uses the probabilities forecast for all three categories and gives greater credit when high probabilities are assigned to a category adjacent to that observed versus a more distant category. Both RPSS and LSS are used to verify the IRI forecasts in part to assess the extent to which the simpler LSS provides information about forecast quality similar to (or as fully as) the more comprehensive and widely used RPSS.
Another probabilistic verification measure is an extension of the ROC area to include all forecast categories collectively (Mason and Weigel 2009). This generalized ROC score (GROC) is the proportion of all available pairs of observations of differing categories whose probability forecasts are discriminated in the correct direction. With a possible range of 0%–100%, a 50% rate of correct discrimination is expected by chance. The calculations of RPSS, LSS, and GROC are summarized in Table 3. RPSS and LSS can be calculated for individual forecasts, permitting a time series of forecast skills. By contrast, GROC is calculated only for a set of forecasts, of which at least two must have differing observational outcomes.
All three of RPSS, LSS, and GROC are proper (Winkler and Murphy 1968; Brocker and Smith 2007)—that is, they cannot be enhanced by making the forecast probabilities different from those believed true by the forecasters (“hedging”; Murphy and Epstein 1967). A difference between GROC on the one hand, and RPSS and LSS on the other, is that the goodness of the calibration of the probabilities matters to the latter two scores, while it is essentially irrelevant to GROC. GROC evaluates purely discrimination ability within the forecast sample at hand, without penalty for overall or conditional biases in the probability values. The other side of the same coin is that GROC does not reward correct forecasts of overall shifts of climate in the forecast sample relative to a longer period of reference such as a warmed climate relative to a past 30-yr period. The above characteristics are similar to those of the temporal correlation coefficient for verification of deterministic forecasts and in contrast to the mean squared error or Heidke skill scores (Barnston 1992)—the latter two, like RPSS and LSS, being calibration sensitive.
Limited ensemble sizes restrict forecast probabilities to finite numbers of possible values, creating small offsets from asymptotic probabilities coming from theoretically infinite ensemble sizes. These offsets slightly decrease RPSS (Weigel et al. 2007a,b), whose climatology reference RPS remains “perfect.” Impacts on LSS and GROC are smaller. Adjustments for these biases are not conducted here, given IRI’s fairly large ensemble size.
In addition to the above verification measures, the probabilistic reliability of the forecasts is diagnosed using attributes diagrams (Murphy 1973; Wilks 2006). These show the correspondence of the full range of issued forecast probabilities and their associated relative frequency of observed occurrence, revealing forecast characteristics such as probabilistic bias, forecast over- (under-) confidence, and forecast sharpness.
The temporal variability of the performance of IRI’s forecasts is shown by time series of a verification measure (RPSS or LSS) for a given lead time averaged over the globe, the tropics, or specific regions. These scores are averaged over the scores for each grid square, area weighted by the cosine of their latitude. Additionally, the geographical distribution of the forecast quality is shown by computing the measures for each grid square over all forecasts or a subset of forecasts (e.g., for a given season and/or lead time). We show the performance of the issued forecasts as well as the objective multi-AGCM output used as the primary guiding tool. First, however, we consider the quality of the SST forecasts.
a. SST forecast skill
Favorable performance of the climate forecasts in a two-tiered design depends on performance in critical aspects of the SST forecasts—particularly the ENSO state and the SST anomaly patterns in the tropical Atlantic and Indian Oceans. Figure 1 shows the spatial distribution of temporal correlation between tropical SST forecasts and observations at 0.5- and 3.5-month lead times, and Fig. 2 shows time series of the forecasts and observations averaged over several key rectangular areas. Figure 2 (top) shows the performance of the SST forecasts in capturing the ENSO state as represented by the Niño-3.4 SST index (Barnston et al. 1997) at 0.5- and 3.5-month lead times. Forecasts and observations correlate 0.88 and 0.75 at 0.5- and 3.5-month lead times, respectively, indicating useful skill in anticipating the ENSO-related SST. Omitting the strong El Niño through the first half of 1998, these correlations drop to 0.86 and 0.73, suggesting that the skill did not depend heavily on this one episode.
SST forecasts in the Indian and tropical Atlantic Oceans (Figs. 1 and 2) were comparatively less skillful, and skills differ little from persistence-based forecasts (Table 4). These lower skills are consistent with the weaker inherent predictability of SST in the non-Pacific tropical ocean basins (Goddard et al. 2001; Stockdale et al. 2006). Although interannual variability of tropical SSTs outside of the central and eastern Pacific is small (Table 4), anomaly patterns in these oceanic regions are believed key to enhanced likelihoods for specific climate anomalies (e.g., Chang et al. 2006). For example, in parts of tropical and subtropical Africa, Asia, and South America, climate anomalies are related to a zonal dipole in the Indian Ocean (Saji et al. 1999; Goddard and Graham 1999), an El Niño–like structure in the equatorial Atlantic (Zebiak 1993), and meridional gradients in the tropical Atlantic (Ward and Folland 1991; Enfield et al. 1999; Servain et al. 1999). Both tropical Indian and Atlantic Ocean SSTs appear sensitive to exogenous, and sometimes extratropical, phenomena that may have little inherent predictability (Kushnir et al. 2006). While this may be true for the tropical Pacific as well, the Pacific has better defined, slower, and stronger internal dynamics that frequently outweighs exogenous influences.
b. Temporal variability of climate forecast skill
Figure 3 shows time series of RPSS averaged over the near-global and tropical (25°N–25°S) land areas for forecasts for each of the four lead times (0.5, 1.5, 2.5, and 3.5 months) from the period October–December (OND) 1997 to December–February (DJF) 2008/09 for precipitation and temperature. Forecasts of climatological probabilities are included. The proportions of land area coverage by nonclimatology forecasts for the globe, tropics, and extratropics (Table 5) indicate highest proportions of nonclimatology forecasts issued for the tropics, for temperature, and for shorter lead times. Nonclimatology forecasts are somewhat more prevalent in forecasts for which ENSO extremes were expected than otherwise and for boreal autumn and winter than other seasons because of greater confidence in the forecast ENSO state for those seasons.
Forecast skill over the 11-yr period has been strongly related to ENSO variability (Fig. 3). Correlations between the absolute value of the Niño-3.4 SST anomaly and tropical RPSS for precipitation are 0.54, 0.44, 0.40, and 0.43 for 0.5-, 1.5-, 2.5-, and 3.5-month lead precipitation forecasts, respectively. (Corresponding Spearman rank correlations are 0.44, 0.44, 0.41, and 0.38.) Figure 4 (left) shows the effect of the ENSO state on RPSS for 0.5-month lead tropical precipitation forecasts as a function of lag time between the season of the ENSO state and that of the climate forecast target. Despite modest average skill levels, a simultaneous positive relationship with both phases of ENSO is noted (consistent with results in Goddard and Dilley 2005), El Niño being associated with greater skill than La Niña. Figures 3 and 4 show near-zero precipitation skills during ENSO-neutral periods in both extratropics and tropics, which is comparable to Livezey and Timofeyeva (2008), who identified ENSO variability as virtually the sole source of seasonal precipitation forecast skill for the United States.
The time series of RPSS for IRI’s temperature forecasts (Fig. 3) show higher average levels than those of precipitation forecasts. Temperature skill is related to ENSO state but differently than precipitation: skill is highest near the end of, and shortly following, El Niño events, and lowest with the same timing for La Niña events. Figure 4 (right) shows the effect of the ENSO state on RPSS for 0.5-month lead tropical temperature forecasts as a function of lag time between the SST and the climate forecast target. The greatest impact of ENSO on RPSS occurs 4 months following the ENSO peak for both ENSO phases. This influence on forecast skill is attributable to a delayed temperature response in both tropics and extratropics (Kumar and Hoerling 2003), which was earlier documented in the context of the atmospheric bridge (Lau and Nath 1996; Alexander et al. 2002) and strongly exemplified in the response to the 1997/98 El Niño (Kumar et al. 2001).
One reason for the comparatively higher overall temperature forecast skill is that the skill receives a substantial contribution from correctly forecasting increased probabilities of above-normal temperature related to global warming. This warming is partially reproduced by the AGCMs, forced by SSTs that reflect part of the global warming signal, although the climate change signal is largely lost in the SST forecasts, and consequently in the AGCM responses, after the first few months in models using fixed (and now outdated) greenhouse gas settings (Doblas-Reyes et al. 2006; Liniger et al. 2007). Therefore, the warming signal is further captured by the forecasters who make additional subjective probabilistic adjustments toward warmth. The climate change component of skill is much weaker for precipitation, whose trends are generally smaller and may be of either sign, depending on location and season. Seasonal temperature is subject to well established probabilistic shifts related to ENSO (e.g., Halpert and Ropelewski 1992), providing a source of interannual predictability largely independent of the warming trend. Although the geographical distribution of ENSO’s effects on temperature differs from that associated with global warming, there are similarities between effects of El Niño and global warming, particularly in the tropics. Consequently, in a tropical average sense, El Niño tends to amplify the effects of global warming, yielding increased confidence in forecasts of above-normal temperature, while La Niña tends to weaken or cancel global warming effects, resulting in a smaller net signal, greater forecast uncertainty, and lower skill.
These ideas appear substantiated by Figs. 3 and 4, showing highest (lowest) temperature skills during and after El Niño (La Niña) events, particularly in the tropics. Correlations between the Niño-3.4 SST anomaly and the tropical RPSS 4 months later are high: 0.80, 0.77, 0.76, and 0.72 for 0.5-, 1.5-, 2.5-, and 3.5-month lead time forecasts, respectively. (Corresponding Spearman rank correlations are 0.79, 0.78, 0.78, and 0.74.) In their evaluation of IRI’s forecasts during 1997–2001, Goddard et al. (2003) concluded that empirical ENSO probabilistic composites were not helpful for IRI’s seasonal temperature forecasts because the La Niña conditions during the majority of the period led to increased forecast probabilities for below-normal temperature, while above-normal temperatures continued to predominate in the observations.
c. Seasonality and geographical distribution of climate forecast skill
Figure 5 shows the geographical distribution of RPSS over the globe for all seasons for precipitation and temperature at 0.5-month and 3.5-month lead times. Relatively high temperature skill is noted in much of the tropics and in some extratropical regions. Temperature skill decreases, but does not disappear, with lead time. Skill for precipitation is lower than that for temperature but is also generally highest in the tropics. While precipitation skill averaged over all seasons does not disappear at 3.5-month lead, it decays more quickly with lead time than temperature, proportionally with respect to its initial level, in regions having highest season-specific skills (not shown). Skill is generally greater for temperature than for precipitation partly because of the more pervasive and unidirectional manifestation of climate change (and generally correct forecasts for such) in temperature than precipitation. To first order, the global warming component of temperature forecast skill pervades all seasons, all lead times, and most regions and is proportionately most prominent in the tropics where interannual and internal variability are generally weakest. (Warming is greater in the extratropics in degrees Celsius but is outweighed by still greater amounts of interannual variability.)
Because RPSS for precipitation is below −0.01 over nearly as much area of the globe as it is above 0.01 at 0.5 and 3.5-month leads (Fig. 5), one reasonably might question the field significance of the skill result (Livezey and Chen 1983). Monte Carlo tests were conducted in which the years were shuffled 5000 times, while the ordering of the months within a given year remained intact to represent the effective sampling time for an ENSO cycle. The global mean RPSS for the shuffled data never attained the level of the actual verification (at 0.006) at 0.5-month lead and exceeded it (at 0.003) in 1 out of 5000 trials at 3.5-month lead. Because field significance is strong, one can trust not only the existence of real global mean skill but also the general features of the skill’s geographical distribution.
The geographical distribution of mean RPSS for precipitation forecasts at 0.5-month lead is shown in Fig. 6 for January–March (JFM), April–June (AMJ), July–September (JAS), and OND. Skills are often related to the seasonal cycle of rainfall itself, which in the tropics–subtropics maximizes with the local summer monsoon season or with the twice-yearly passage of the ITCZ near the equator. Skill is highest in Indonesia, eastern equatorial Africa, and southeastern South America during the last few months of the calendar year; in portions of southern Africa from November to March; and in India and the Sahel from June through September. Skill is concentrated in the seasons and regions having known responses to ENSO (Ropelewski and Halpert 1987, 1989, 1996; Mason and Goddard 2001) as well as some additional areas in response to SST anomalies outside the tropical Pacific (e.g., the African Sahel and Guinea coast, and Northeast Brazil in response to the tropical Atlantic).
The seasonal cycles of precipitation forecast skill for several regions having well-defined monsoon seasons and/or ENSO-related responses are shown for all 12 running 3-month seasons in Fig. 7, using RPSS and GROC. Skill in the Philippines is highest in late boreal winter following the usual peaking of ENSO episodes and minimal in boreal summer during the southwest Asian monsoon. Indonesia and the western tropical Pacific islands show maximum skill in late boreal autumn, when ENSO episodes often mature and the ITCZ migrates through from north to south. Skill in the African Sahel peaks during the late boreal summer rainy season, and analogous behavior holds for austral summer in southern Africa. In eastern equatorial Africa, skill peaks in late boreal autumn (short rainy season) but is low during the boreal spring (long rainy season), as established through ENSO responses empirically (Ropelewski and Halpert 1987; Mason and Goddard 2001) and physically in the context of the intermediary role of the Indian Ocean SST in the short rainy season (Goddard and Graham 1999). In the southern United States, Mexico, and the Caribbean, skills peak during boreal winter, during a dry season, following known teleconnections to ENSO (Ropelewski and Halpert 1987, 1989; Mason and Goddard 2001).
While there are no large differences between the skill pictures painted by RPSS and GROC beyond their differing scaling, a tendency for fewer cases of negative skill is noted in GROC (<0.5) than in RPSS (e.g., during boreal summer in western tropical Pacific islands, RPSS < 0 while GROC > 0.5). This is likely due to the presence of discrimination in the forecasts, such as cases of above-normal rainfall being forecast with higher probabilities of above normal than cases lacking above-normal rainfall, even if the probability values have systematic biases. Unconditional or conditional (rainfall dependent) forecast biases, such as most probabilities for above normal being too low, would be penalized in RPSS, counteracting credit given for the probability for above normal being relatively higher for cases of observed above-normal rainfall than for other cases; GROC would reflect such discriminative ability, unhidden by biases. Hence, Fig. 7 tells us that the precipitation forecast probabilities may not have been optimally calibrated; this will be examined below in a reliability analysis.
The geographical distribution of mean RPSS for temperature forecasts at 0.5-month lead time are shown in Fig. 8 for JFM, AMJ, JAS, and OND. The skill patterns of the four seasons do not differ greatly. The JFM season features the greatest spatial extremes of skill, with highest tropical skill and most widespread negative extratropical skill. The seasonal cycle of temperature forecast skill for selected large regions (Fig. 9) shows smaller, subtler seasonal dependence than that of precipitation skill. This difference likely exists because much of the temperature skill is related to correctly forecast probability shifts toward above normal because of global warming, which is largely independent of season. Temperature also lacks the tropical seasonal migratory cycles found in precipitation (e.g., ITCZ, monsoons). Skill in many regions (e.g., northern South America, Indonesia) is slightly higher in boreal winter than summer, which is likely related to seasonality of the ENSO cycle: El Niño (La Niña) tends to warm (cool) the tropical atmosphere most predictably and strongly near and following its mature phase late in the calendar year.
The differences in skill shown by RPSS and GROC (Fig. 9) indicate fewer cases of negative RPSS than GROC (<0.5), for example, late boreal summer skills for Africa. This difference is due to the credit given by RPSS for correctly elevated probabilities for above-normal temperature caused by global warming, even in the absence of correct year-to-year discrimination among probability values within the 11-yr period. Because of the reference forecast used in RPSS (a climatology based on a completed 30-yr period), forecasts uniformly tending toward above-normal temperature earn credit in RPSS but not in GROC unless, additionally, the forecast warm tendency differs interannually in phase with the observations. Figure 7 (precipitation) and Fig. 9 (temperature) provide opposing examples of how attributes other than discrimination hurt or help RPSS, respectively, without affecting GROC. A summary comparison of skill results using the RPSS, LSS, and GROC measures is provided in Table 6 for each of the four forecast lead times for the globe, tropics, and extratropics.
d. Probabilistic reliability
Reliability plots for precipitation and temperature forecasts over the globe and in the tropics over the 11-yr forecast period, aggregated for all area-weighted grid points and seasons, are shown in Fig. 10 by tercile category. Precipitation reliability appears favorable, with very slight overconfidence for above- and below-normal precipitation. There is slight underforecasting of below-normal rainfall: the 11-yr verification period was slightly drier than the 30-yr climatological base period [e.g., in tropics, below- (above-) normal precipitation occurred in 36% (31%) of cases], while the mean forecast probabilities remained close to 33%. The frequency of issuance of given forecast probabilities (lower subpanels of Fig. 10) shows a majority of climatological probabilities forecasts both globally and for the tropics and nonclimatology forecast probabilities deviating mainly within 10% of climatology. The near-unity slope of the reliability curve indicates that this lack of forecast sharpness is necessary, given the considerable forecast uncertainty consistent with the known generally limited skill levels. A summary of some diagnostic attributes of the reliability analysis for the tropical precipitation forecasts is shown in Table 7a. Noted are reasonable slopes for the above- and below-normal categories and low sharpness (SDf column).
For temperature (Fig. 10, bottom), the confidence component of the reliability is favorable, with reliability curve slopes near unity for above-normal temperature and slightly less for below normal for global and tropical domains. However, despite mean forecast probability shifts toward above-normal temperatures [mean probabilities issued for above (below) normal in tropics of 43% (23%)], above-normal temperatures were markedly underforecast [above (below) normal observed in 68% (10%) of cases]. This degree of imbalance in the observations with respect to the climatology reflects the large magnitude of low-frequency variability, including specifically a global warming signal.
The magnitude by which temperatures during the 11-yr study period averaged higher than those during the warmest of the 30-yr climatological base period used for IRI’s forecasts (1971–2000) is illustrated in Fig. 11a, which shows the spatial distribution of the percentile of the 1998–2008 median temperature within the 1971–2000 climatologies, seasonally aggregated. Positive shifts are large: the 11-yr medians of 15% of the land grid points attain ≥95 percentile rank within the 1971–2000 observations, and the medians of 1.5% of the grid points are higher than all 30 years in the 1971–2000 period.3 No grid points attain ≤5 percentile rank with respect to either 30-yr period, and most of the few grid points ranking below the median for 1971–2000 are near coastlines, restrained by the more slowly changing ocean temperatures. Throughout this period, during which the SST forecast models and AGCMs still used fixed, outdated greenhouse gas concentration settings, the IRI forecasts may have kept better pace with the warming trend if they had allowed an empirical tool known as optimal climate normals (OCN; Huang et al. 1996) to influence the forecasts.4 Figure 11b shows the spatial distribution of the percentile of 1997–2008 median precipitation within the 1971–2000 climatology. The direction of shift from the climatology is geographically dependent for precipitation, with roughly equal areas trending drier as trending wetter.
Reliability plots for global temperature and precipitation (Fig. 10, left) show similar slopes, and slightly milder unconditional biases, compared with those for tropical forecasts. The global plots show lower forecast sharpness, consistent with known lower average signal-to-noise ratios (and expected predictive skill levels) in the extratropics (Shukla and Kinter 2006; Kumar et al. 2007; Peng et al. 2000).
A diagnostic evaluation of IRI’s seasonal climate forecasts for the 1997–2000 period (Wilks and Godfrey 2002) found IRI’s 0.5-month lead temperature forecasts somewhat overconfident, precipitation forecasts with appropriate confidence in the tropics but overconfidence in the extratropics, and substantial overforecasting of below-normal temperatures with a gross preponderance of above-normal observed temperature but only a slight mean tilt toward above-normal forecast temperature.5 The negative RPSS values seen over a large region in extratropical latitudes (Figs. 5 and 6) suggest that even the weak shifts in precipitation probabilities have continued to exceed those warranted in the extratropics, and that nonclimatology forecasts should be uncommon in the extratropics. Mean tropical forecast probabilities for above-normal temperature were somewhat higher during the 11-yr forecast period than during 1998–2000 [partly in response to Wilks and Godfrey (2002) and because of forecasters’ increasing confidence in forecasting a continuation of the global warming signal], while the relative frequency of observations in the above-normal category continued at the same high level (roughly two-thirds of cases) over the 11-yr period as during 1998–2000. The result was a slightly less severe, but still very substantial, cool bias.
e. Separation of interannual and low-frequency skill
The performance of probabilistic forecasts is more fully described by verification measures aimed at different attributes than by a single measure (Wilks 2006). For example, RPSS and LSS can be negative because of imperfect calibration even when the forecasts have potential information value (Hsu and Murphy 1986; Mason 2004), while GROC, being virtually insensitive to calibration problems (e.g., mean or conditional forecast biases), may show positive results for the same set of forecasts. Among the RPSS, LSS, GROC scores, and reliability diagrams, multidimensional diagnostics are formed for the forecasts.
A comparison among the geographical patterns of RPSS, LSS, and GROC is shown for IRI’s forecasts of precipitation and temperature for the JFM season in Figs. 12 and 13, respectively. Comparisons for other seasons (not shown) are similar. RPSS and LSS have very similar patterns, including locations having zero skill, LSS averaging one-half to one-third of RPSS in magnitude. That RPSS is affected by probabilities assigned to categories that do not verify, while LSS is not, is probably not a significant factor in the score differences for IRI’s forecasts, which never have grossly non-Gaussian (e.g., bimodal) probabilities that would enable probabilities given to nonverifying categories to be important. The global spatial correlation between RPSS and LSS is 0.9 or greater for all seasons, while that between either of them and GROC is approximately 0.6. Thus, at least for IRI forecast skill, RPSS and LSS appear largely redundant, and either of them could be used alone without material loss of information. For precipitation (Fig. 12), GROC skill shows skill patterns roughly similar to those of RPSS and LSS, but with somewhat less area of negative (<0.5) skill.6 This suggests that the proportion of correct discrimination among the varying probabilities forecast for the tercile categories is favorable in JFM in the relatively high skill regions (e.g., Philippines—east Australia, Pacific islands). Because forecast uncertainty is considerable (the most likely category often having only 0.40–0.50 probability, as warranted for good reliability), RPSS and LSS have weak magnitudes in these skillful regions. Additionally, the mild bias of over- (under-) forecasting above (below) normal further decreases RPSS and LSS but not GROC (Fig. 10 and Table 7a); these probabilistic features may cause RPSS and LSS to be negative.
A different picture is presented for temperature (Fig. 13). The patterns themselves, while roughly similar, differ in that there is a greater area of negative skill for GROC than for RPSS and LSS. This is caused by the recognition, albeit too weak, of the dominance of the above-normal category in the forecasts that is rewarded in RPSS and LSS but not in GROC, where only correct discrimination among the sampled cases is credited. Figure 13 shows that, while discrimination among the temperature forecasts within the 11-yr forecast sample was better for temperature than for precipitation in the tropics, it was generally low for both temperature and precipitation outside of the tropics. The level of discrimination for the outer categories for tropical temperature is indicated by ROC scores in the low to middle 0.60s (Table 7), compared with the upper 0.50s for tropical precipitation.
To summarize, the GROC helps to distinguish forecast skill related solely to discrimination of mainly interannual variability within the 11-yr forecast period, as opposed to such discrimination combined with skill in correctly predicting overall 11-yr mean probability shift with respect to the 30-yr climatological base period(s), as for example, that associated with climate change. GROC shows minor differences from RPSS and LSS (allowing for scaling differences) for precipitation (Fig. 12), being more favorable because it is not penalized for the small wet bias. For temperature (Fig. 13), GROC shows smaller areas of positive skill than RPSS and LSS because the climate shift in temperature was large enough that even partial recognition of it in IRI’s probability forecasts (Fig. 10 and Table 7) was credited in RPSS and LSS but not GROC. However, even for discrimination alone, performance is seen to be stronger for temperature than for precipitation in the tropics.
f. Skill of objective multimodel predictions; comparison with issued forecasts
A comparison of the skill of IRI’s issued forecasts with that of the objective multimodel ensemble forecasts indicates the value of the human modification to the raw model output. The objective output comes from several AGCMs, each roughly calibrated to its mean and terciles, forced by multiple SST scenarios, and weighted using two multimodel ensemble algorithms. Ideally, the objective probabilistic model output should be capable of being the final forecast product, but expert judgment has further influenced the issued forecasts. As discussed earlier, subjective modifications include a general weakening of probability anomalies, more specific weakening of excessively sharp forecasts, spatial smoothing, spatial MOS corrections for selected regions/seasons for precipitation, shifting of temperature probabilities toward “above normal,” and adjustments toward forecasts issued by other producing centers. Weakening of forecast probability anomalies is done because the model weighting scheme in the multimodel combination is based on historical skills using observed rather than predicted SST.7 Probabilistic shifts toward above normal for temperature are done because the models do not fully capture global warming, even with warmer SSTs forcing them, because of constant and outdated prescribed model greenhouse gas concentrations.
Figure 14 shows the geographical distribution of RPSS for the multimodel precipitation and temperature predictions at 0.5-month lead for JFM and JAS. While a comparison of the precipitation skills to those of the actually issued forecasts (Figs. 6a,c and 3 and Tables 7a,b) indicates similar skills for both variables, differences are discernible. For both seasons, spatially noisier skill patterns are seen for the multimodel than for the issued precipitation forecasts. Global and tropical mean skills for JFM and JAS precipitation are very slightly higher for the issued than for the multimodel forecasts, and inspection of the two RPSS fields suggests that this may be largely due to the smoothing and weakening of predicted deviations from climatological probabilities. The level of spatial noise in the multimodel forecasts is greater for precipitation than for temperature (Gong et al. 2003), requiring more forecaster smoothing to optimize skill. The same comparison using the GROC score (not shown) leads to a qualitatively similar conclusion, except with even smaller skill differences between the two forecast sets, likely because of approximately equal levels of basic discrimination in both forecast versions, but better calibration in issued than multimodel forecasts.
With relatively small trend components in precipitation, interpretation of results is related mainly to interannual variability. The reliability plot for all-season tropical precipitation multimodel ensemble predictions (Fig. 15, comparable to Fig. 10 for issued forecasts) shows somewhat shallower slopes, indicative of greater overconfidence in the multimodel ensemble predictions. Tables 7a,b provide attributes of reliability and skill for the 0.5-month lead tropical precipitation multimodel forecasts and issued all-season forecasts. The issued probabilities are more conservative and have very slightly higher skills by most of the verification measures. The mean squared departures of the reliability curve from the ideal 45° line (“reliability” column in Table 7) are greater, and exceed those of the climatology forecast reference by greater amounts, for multimodel ensemble forecasts than for issued forecasts for all three categories. A somewhat higher resolution is also seen in the issued than the multimodel forecasts and is closely related to small increases in discrimination (indicated by greater GROC and ROC by individual category), which is believed to be due to the forecasters’ additional spatial smoothing and modifications resulting from the selected regional spatial MOS corrections.
The right side of Fig. 14 shows the geographical distribution of RPSS for multimodel temperature predictions at 0.5-month lead for JFM and JAS. Comparison with skill for the corresponding issued forecasts (Figs. 8a,c and 3) shows for both seasons skill patterns of roughly similar mean level, but spatially noisier, in the multimodel forecasts. However, global and tropical mean RPSS are just slightly higher in the multimodel than the issued forecasts (Figs. 3 and 14 and Tables 7c,d), mainly because of higher scores in those tropical regions where skill is highest in both forecast versions. In regions of low extratropical skill, issued forecasts have milder negative RPSS than multimodel forecasts. A comparison between objective and finally issued forecasts using GROC (not shown) indicates no edge in performance of the multimodel ensemble forecasts over the issued forecasts. A likely explanation is that the forecaster modifications change the calibration of the forecasts (Fig. 15 versus Fig. 10; Tables 7c,d) but have little effect on the basic discrimination present in the multimodel ensemble forecasts. However, the combination of decreasing the confidence and weakly adjusting for the underforecasting of above-normal temperature, while leaving GROC unchanged as expected, slightly reduced RPSS. Although the forecasters’ adjustments for the cool bias increased RPSS, their reduction of perceived overconfidence had a larger negative impact on RPSS by weakening the highest probabilities for above normal. While RPSS and LSS are slightly lower for the issued than multimodel ensemble temperature forecasts, the GROC, the ROC for individual categories, and the resolution components (Tables 7c,d) suggest slight improvements in discrimination in the issued forecasts and improved slopes in the reliability curves.
In summary, overconfidence in the multimodel forecasts was corrected in the issued forecasts for both precipitation and temperature, but the objective forecasts better captured the strong warming trend than the issued forecasts. The issued forecasts featured equal to slightly improved resolution/discrimination compared with the model output, but the damping of above-normal probabilities to correct for overconfidence contributed to its underforecasting.
The IRI has issued seasonal probabilistic forecasts of near-global temperature and precipitation for 11 years since late 1997, using mainly a two-tiered, dynamically based prediction system where a set of SST prediction scenarios is made, which then serve as prescribed lower boundary conditions for integrations of ensembles from a set of AGCMs. Forecasts have been issued monthly, for four upcoming running 3-month periods, for most of the IRI’s history. Seven AGCMs have been used since 2004, whereby forecast ensembles numbering well over 100 members are postprocessed and merged into final probability forecasts.
The skill of the forecasts ranges from near zero to moderate, depending on season and location. Skills for temperature average higher, are less seasonally and regionally dependent, and decay more slowly with lead time than skills for precipitation. Temperature skills benefit from correct forecasts of continuation of a strong tendency for above-average temperatures (relative to a completed 30-yr base period) associated with global warming. Although ENSO remains a source of temperature forecast skill, warming trends have rivaled ENSO effects as a skill source during the forecast period. Skills for precipitation, by contrast, do not benefit appreciably from a trend component because precipitation trends are weaker and vary in direction depending on season and location. Hence, precipitation skill is based mainly on correctly discriminated effects of interannual fluctuations involving ENSO and SST anomalies outside the tropical Pacific.
Forecast skills are higher in the tropics than extratropics for both temperature and precipitation. This is consistent with the higher signal-to-noise ratios at low latitudes documented for troposphere geopotential height (e.g., Shukla and Kinter 2006; Kumar et al. 2007) and in associated surface climate (e.g., Rowell 1998; Peng et al. 2000). While the spatial pattern of temperature forecast skill shows a weak annual cycle, that of precipitation is more strongly seasonally dependent, roughly following both the annual cycle of low-latitude monsoon rainfall and teleconnections to large-scale tropical SST anomalies—particularly ENSO. At midlatitudes, positive precipitation skill, while not prevalent, is found in regions and seasons having successfully modeled ENSO and non-ENSO tropical SST teleconnections. The skill results found here are consistent with skill evaluations by other forecast-producing centers and with theoretical predictability studies. Skill levels in specific seasons and locations could benefit users who understand the probabilistic aspects of seasonal climate forecasts sufficiently for prudent decision making for their application.
Over a period as brief as 11 yr, the variability in the amplitude of ENSO extremes is likely to govern forecast skill more strongly than incremental improvements in models or forecast methodology. Hence, Fig. 3 indicates highest forecast skills during the 1997/98 El Niño at the beginning of the period, despite the fact that the simplest SST prediction scheme and the fewest AGCMs were used at that time.
Skills of the objective multimodel probability forecasts, used as the primary basis for the final issued forecasts, are comparable to those of the final forecasts, but they are somewhat overconfident. This is believed to be due in part to the development of the multimodel superensembling process using individual AGCM skills when the AGCMs are forced by observed rather than predicted SST. Thus, while the relative weighting among the models may be well estimated, their collective weighting, and resultant departures from climatological forecast probabilities, are overestimated.
The verification diagnostics challenge the suitability of using completed 30-yr periods to define the current temperature climatology from which to form anomalies or quantile-based category boundaries, given the strong nonstationarity. Reasons to consider alternative climatological reference frames include the severely shifted categorical frequencies of current observations, forecasts reflecting a mixture of time scales that may be confusing to stakeholders, and the greater challenge in conducting meaningful verification. Observational alternatives to estimation of the current year’s temperature climate might include use of an annually updated OCN-based climatology (Huang et al. 1996) or, at higher risk, a linear trend fit of the observations in recent decades (e.g., Livezey et al. 2007); a dynamical approach might consist of a deemed optimal superensemble of regionally specific Intergovernmental Panel on Climate Change (IPCC) model forecasts (e.g., Tebaldi et al. 2005; Greene et al. 2006; Furrer et al. 2007; Christensen et al. 2007; Gleckler et al. 2008) averaged over a period centered on the current year. The above options all carry uncertainties beyond those of a stationary climate, as they contain predictive components.
The aspects of IRI’s prediction system in greatest need of improvement or further development are 1) postprocessing: use of systematic spatial MOS corrections for individual AGCMs, specific to the season and lead time, before superensembling; 2) incorporation of time-varying greenhouse gas settings in SST forecast models and in AGCMs; and 3) movement toward a partially or totally single-tiered prediction system. Implementation of 1) occurred in late 2009, and progress on 3) is under way.
It is difficult to compare the operational predictive skill of IRI’s forecast system with that of other systems such as a single-tiered dynamical system [e.g., Palmer et al. 2004 (and references therein); Graham et al. 2005; Saha et al. 2006; Kug et al. 2008; Wang et al. 2008] or a purely empirical system (Van den Dool 2007). Improvement in ENSO prediction has obvious value toward improvement of climate prediction, and the potential predictability of ENSO is an open question but believed not fully realized (Chen and Cane 2008). Expansion of available data from the Argo (Schmid et al. 2007), Prediction and Research Moored Array in the Tropical Atlantic (PIRATA) (Bourlès et al. 2008), and Research Moored Array for African–Asian–Australian Monsoon Analysis and Prediction (RAMA) (McPhaden et al. 2009) systems is expected to result in more fully realizable predictive skill for SST in tropical oceans outside of the Pacific. Improved modeling of the ocean–atmosphere system, through better representation of physical processes, should increase skill toward the theoretical limit and reduce the need for postprocessing and forecaster intervention.
This work was funded by a grant/cooperative agreement from the National Oceanic and Atmospheric Administration (NA07GP0123 and NA050AR4311004). The views expressed are those of the authors and do not necessarily reflect the views of NOAA or its subagencies. The authors appreciate the thoughtful and constructive comments of three anonymous reviewers. The monthly forecast AGCM integrations done by partner institutions (NASA GSFC, COLA, Queensland Climate Change Centre of Excellence, GFDL, Scripps/ECPC) have been invaluable contributions to IRI’s forecasts. Acknowledged are scientific contributions by Nicholas Graham for the original system design, Stephen Zebiak, Balaji Rajagopalan, Upmanu Lall, and Andrew Robertson for multimodel superensembling, and Michael Tippett for targeted spatial AGCM MOS corrections. Competent production support was provided by Mary Tyree, Martin Olivera, John del Corral, Jack Ritchie, Sara Barone, and Bin Li.
* Current affiliation: Swiss Re Financial Services Corporation, New York, New York
Corresponding author address: Anthony G. Barnston, International Research Institute for Climate and Society, 61 Route 9W, P.O. Box 1000, Columbia University, Palisades, NY 10964-8000. Email: firstname.lastname@example.org
Prior to 2000, forecasts were issued during the first week of the first predicted period.
This nonstandard base period was used because of delays in updating of the global climate observational data used.
For the 1961–90 climatology used during the first few years of IRI forecasts, these figures increase to 30% and 8.6%, respectively.
OCN forecasts the mean temperature anomaly observed over the most recent 10 years for the season and location in question and would forecast a pattern qualitatively similar to that of Fig. 10a, but season specific.
A similarly strong cool bias was also noted in the climate forecasts of NOAA Climate Prediction Center during 1995–98 (Wilks 2000).
The larger spatial variation seen in GROC within its range of 0 to 1 than those seen in RPSS and LSS within their ranges exists, in part, because GROC measures mainly just one attribute of performance (discrimination), while RPSS and LSS measure performance in discrimination and in other attributes. Excellent (or very poor) performance is less likely to occur in net over several attributes than in just one.