Analyses of the relative prediction skills of NOAA’s Climate Forecast System versions 1 and 2 (CFSv1 and CFSv2, respectively), and the NOAA/Climate Prediction Center’s (CPC) operational seasonal outlook, are conducted over the 15-yr common period of 1995–2009. The analyses are applied to predictions of seasonal mean surface temperature and total precipitation over the conterminous United States for the shortest and most commonly used lead time of 0.5 months. The assessments include both categorical and probabilistic verification diagnostics: their seasonalities, spatial distributions, and probabilistic reliability. Attribution of skill to specific physical sources is attempted when possible. Motivations for the analyses are to document improvements in skill between two generations of NOAA’s dynamical seasonal prediction system and to inform the forecast producers, but more importantly the user community, of the skill of the CFS model now in use (CFSv2) to help guide the users’ decision-making processes. The CFSv2 model is found to deliver generally higher mean predictive skill than CFSv1. This result is strongest for surface temperature predictions, and may be related to the use of time-evolving CO2 concentration in CFSv2, in contrast to a fixed (and now outdated) concentration used in CFSv1. CFSv2, and especially CFSv1, exhibit more forecast “overconfidence” than the official seasonal outlooks, even though the CFSv2 hindcasts have outperformed the outlooks more than half of the time. Results justify the greater weight given to CFSv2 in developing the final outlooks than given to previous dynamical input tools (e.g., CFSv1) and indicate that CFSv2 should be of greater interest to users.
1. Introduction
Operational probabilistic seasonal outlooks for surface mean temperature and total precipitation over the United States have been routinely issued by the Climate Prediction Center (CPC) since December 1994 and now span more than a 15-yr period. The seasonal outlook information is disseminated to the user community in several formats—for example, as maps of the probability of the most likely tercile-based category (O’Lenic et al. 2008) or as probabilities of exceedance (Barnston et al. 2000). The outlooks are released in the middle of the calendar month, the first target season being the upcoming 3-month period (e.g., for the outlook released in mid-April, it is for the average climate for May–July). Beginning with the first target season, outlooks are issued for 13 running seasons, extending to a year in advance. The CPC’s forecast system is described in O’Lenic et al. (2008), and its forecast skill is discussed in Livezey and Timofeyeva (2008) and Peng et al. (2012).
Here, we focus on the outlooks for the shortest (0.5 month) lead time, as these are most frequently used, and their skill and other attributes are compared with those of the two most recent versions of the dynamical model used as an input tool for their formulation. The skills of some nondynamical input tools for CPC’s outlooks are highlighted in Peng et al. (2012).
Between 2004 and 2011 the operational Climate Forecast System (CFS) coupled model, hereafter called CFSv1, for version 1 (Saha et al. 2006), served as the dynamical model input, while in April 2011 the second version, CFSv2 (Saha et al. 2012, manuscript submitted to J. Climate, hereafter Sv2), was implemented. In this study we compare the skills of the hindcasts of the CFSv1 and CFSv2 models with those of CPC’s real-time, operational climate outlooks of seasonal surface temperature and precipitation over the United States. We assume that the dynamical model hindcasts approximate the forecasts that would have been made and used by the forecasters in real time had they been available. The skill assessments are intended to be useful for the producers of the seasonal outlooks, confirming the characteristics of the dynamical forecasts seen in the postprocessing and weighting stage of developing the probabilistic seasonal outlook. The assessments are also expected to be of interest to the public—particularly users of dynamical seasonal forecasts for decision-making processes. For such users, these skill results place the CFSv2 within the context of the improved quality of the predictions relative to the CFSv1. In this paper we attempt, where possible, to provide some understanding of the physical basis for the skill improvements.
2. Data and methods
Hindcast data for the CFSv1 model are available during 1982–2009, the CFSv2 model during 1982–2011, and the real-time CPC seasonal outlooks during 1995–2011. The common period of 1995–2009 is used for this study. The CFSv1 and CFSv2 hindcasts span months 1–9, but here we consider just the first 3 months averaged together to form the first season. The spatial resolution of the model predictions is T62 (~2°) for CFSv1 and T126 (~1°) for CFSv2, while the resolution for the CPC forecasts analyzed is a 2° × 2° grid. For our analyses, the CFS hindcasts are interpolated onto the 2° × 2° grid of the CPC forecasts.
The CPC’s seasonal outlooks are made using a combination of empirical and dynamical prediction tools (O’Lenic et al. 2008). The empirical methods, based purely on historical data, include canonical correlation analysis (Barnston 1994), optimum climate normals (OCN; Huang et al. 1996), a regression tool (Unger 1995), and an ensemble canonical correlation analysis (Mo 2003). Dynamical prediction methods employ comprehensive general circulation models initialized from the observed states of the ocean, land surface, and atmosphere. The dynamical input tool used for the CPC’s forecasts has been the National Centers for Environmental Prediction’s (NCEP) dynamical model at the given time, beginning with the partly coupled Medium-Range Forecast model (MRF9) from 1995 to 2000 (Ji et al. 1994), an improved partly coupled system from 2000 to 2004 (Kanamitsu et al. 2002a), then the fully coupled CFSv1 from 2004 to 2011 (Saha et al. 2006), and CFSv2 from 2011 onward (Sv2).
The CPC seasonal outlooks for U.S. surface temperature and precipitation are probabilistic, describing shifts in the probabilities away from their climatological values (of one-third) for three equiprobable tercile-based categories (below, near, and above normal). The category boundaries are defined using observed data over the most recently completed 30-yr base period covering three regular decades (e.g., 1981–2010 at the time of this writing). When the forecast tools indicate little or no shift in probability away from ⅓ for the three categories, or present conflicting information, the forecasters issue a forecast of “equal chances” (exactly ⅓ probability) for all categories, implying that the forecast is no better than the climatological information. By contrast, the individual tool forecasts often indicate at least some slight deviations from the climatology even when little or no demonstrable predictive skill exists historically—this being particularly true for the model-based forecasts, where due to the finite ensemble size and associated sampling errors, the random variability (noise) rarely cancels itself out to exactly zero.
The verification data, interpolated onto a 2° × 2° latitude–longitude grid, come from a CPC analysis (Ropelewski et al. 1985). The tercile boundaries for seasonal mean temperature and total precipitation for forecasts issued prior to May 2001 are based on the observed data for the 1961–90 period, while forecasts issued from May 2001 and later are based on the 1971–2000 period. These base periods are applied also to the categorical assignments of the corresponding observations.
The dynamical model predictions consist of a moderately large ensemble of forecasts, each starting from a different initial analysis, and each resulting in a different realization of the predicted seasonal mean, together defining a predicted probability distribution. Some basic characteristics of the CFSv1 and CFSv2 model versions are shown in Table 1. Apart from changes in the models and forecast resolutions, a notable difference between CFSv1 and CFSv2 is the treatment of CO2. In CFSv2 the CO2 concentration evolves with time: the initial concentration is specified as the observed CO2 value at the beginning of the forecast and is then held fixed during the subsequent 9-month forecast. For CFSv1, on the other hand, the CO2 value is held fixed at the observed 1988 concentration, and therefore CO2 concentrations before (after) 1988 are over- (under-) estimated. The importance of correct specification of CO2 concentration for seasonal forecasts has been documented in earlier studies (Doblas-Reyes et al. 2006; Cai et al. 2009). Another notable difference between CFSv1 and CFSv2 is the initial conditions. For CFSv2, initial conditions are from the Climate Forecast System Reanalysis (CFSR; Saha et al. 2010), while for CFSv1, they are from NCEP–Department of Energy (DOE) Reanalysis-2 (R-2; Kanamitsu et al. 2002b). It has been documented by Saha et al. (2010) that the atmospheric analysis (and, therefore, the initial conditions) based on the CFSR is much better than that based on the R-2.
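As an illustration of how an ensemble of seasonal means can define tercile probabilities, the following is a minimal sketch that assumes simple member counting against tercile boundaries taken from a base-period sample. The text does not state how the CFS probabilities were actually derived, so the counting rule and the function names `tercile_bounds` and `ensemble_probabilities` are hypothetical.

```python
def tercile_bounds(base_period_values):
    """Lower and upper tercile boundaries of a base-period sample.

    Uses linear interpolation between order statistics (an assumed
    convention; other quantile definitions give slightly different values).
    """
    s = sorted(base_period_values)
    n = len(s)

    def quantile(p):
        pos = p * (n - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        return s[lo] + (pos - lo) * (s[hi] - s[lo])

    return quantile(1.0 / 3.0), quantile(2.0 / 3.0)


def ensemble_probabilities(members, lower, upper):
    """Fraction of ensemble members in the below, near, and above normal categories."""
    n = len(members)
    below = sum(1 for m in members if m < lower)
    above = sum(1 for m in members if m > upper)
    return below / n, (n - below - above) / n, above / n


# Hypothetical five-member ensemble of seasonal mean anomalies.
probs = ensemble_probabilities([-1.0, 0.0, 0.5, 2.0, 3.0], lower=-0.5, upper=1.0)
print(probs)
```

With a finite ensemble the counted fractions rarely equal exactly one-third each, which is consistent with the remark above that model-based forecasts seldom produce an exact "equal chances" outcome.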
Verification of the seasonal outlooks is done using both categorical and probabilistic measures. For categorical verification, the Heidke skill score (HSS; O’Lenic et al. 2008) is used. The HSS tallies the number of “hits” (cases in which the category having the highest forecast probability matches the later observed category), and compares this number to that expected by chance alone. The HSS is a scaled measure of the percentage improvement in skill relative to a set of random forecasts, or climatology (equal chances) forecasts, and is defined as

$$\mathrm{HSS} = 100 \times \frac{c - e}{t - e},$$
where c is the number of cases (here, grid points) with hits, t is the total number of grid points in the outlook, and e is the number of grid points expected to be correct by chance (and equals t/3 for the tercile-based categorization). In the case of more than one category sharing the highest forecast probability, a hit is divided if one of them is later observed. Hence, in the case of the “equal chances” forecast (⅓ probability for each category), there is a three-way tie and a ⅓ hit is tallied, equaling the hit rate expected by chance and contributing to an HSS of 0. For a two-way tie, a ½ hit is tallied, contributing to positive skill more weakly than a nontied hit.
The HSS can be computed in two ways: 1) for all points and 2) for only points where the outlook differs from the “equal chances” forecast of ⅓ for each category. Although CPC’s operational outlooks are often verified in the second way, here we rely on the HSS applied to all points for direct comparability to the CFS forecasts (for which forecasts at all grid points are available and equal chances is seldom a forecast outcome). The HSS ranges from −50 (for no hits) to 100 (all hits).
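The all-points HSS with split hits for tied categories can be sketched as follows. This is a minimal illustration under stated assumptions, not CPC's operational code: the function names are hypothetical, each forecast is a (below, near, above) probability triple, each observation is a category index 0, 1, or 2, and a small tolerance is assumed when testing for tied probabilities.

```python
def hit_credit(probs, observed, tol=1e-9):
    """Fractional hit for one grid point.

    Categories tied for the highest probability share the hit, so an
    equal-chances forecast earns 1/3 and a two-way tie earns 1/2.
    """
    top = max(probs)
    tied = [k for k, p in enumerate(probs) if abs(p - top) < tol]
    return 1.0 / len(tied) if observed in tied else 0.0


def heidke_skill_score(forecasts, observations):
    """HSS = 100 * (c - e) / (t - e), with e = t / 3 for tercile categories."""
    t = len(forecasts)
    c = sum(hit_credit(p, o) for p, o in zip(forecasts, observations))
    e = t / 3.0
    return 100.0 * (c - e) / (t - e)


# Three grid points whose most likely category is never observed.
misses = [(0.2, 0.3, 0.5)] * 3
print(heidke_skill_score(misses, [0, 0, 1]))
```

Under these assumptions a set of forecasts with no hits gives c = 0 and a score of −50, while all hits give 100, matching the range stated above.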
As a measure of categorical skill, the HSS disregards the magnitude of the outlook probabilities that indicate the forecast confidence level. To verify the probabilistic information in the seasonal outlooks, the ranked probability skill score (RPSS) is used (Epstein 1969). The RPSS is a measure of the squared distance between the forecast and the observed cumulative probabilities; a more detailed discussion is found in Kumar et al. (2001a), Goddard et al. (2003), and Wilks (2006). As an additional probabilistic skill assessment tool—one that does not penalize for systematic errors—we also use relative operating characteristics (ROC; Mason 1982). Finally, we use reliability diagrams (Murphy 1973; Wilks 2006) to compare the forecast probabilities against their corresponding frequencies of observed occurrence, and to examine the frequency distribution of the issued probabilities (sharpness) and other characteristics of the forecasts.
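The ranked probability score underlying the RPSS can be sketched as below. The aggregation shown (one minus the ratio of summed forecast RPS to summed climatological RPS) is one common convention and is an assumption here, since the text does not specify how the scores were pooled. The function names are hypothetical.

```python
CLIMATOLOGY = (1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0)


def rps(probs, observed):
    """Sum of squared differences between cumulative forecast and observed probabilities."""
    cum_forecast = 0.0
    cum_observed = 0.0
    score = 0.0
    for k, p in enumerate(probs):
        cum_forecast += p
        cum_observed += 1.0 if k == observed else 0.0
        score += (cum_forecast - cum_observed) ** 2
    return score


def rpss(forecasts, observations):
    """Skill relative to climatology: 1 is perfect, 0 matches climatology, below 0 is worse."""
    rps_forecast = sum(rps(p, o) for p, o in zip(forecasts, observations))
    rps_climatology = sum(rps(CLIMATOLOGY, o) for o in observations)
    return 1.0 - rps_forecast / rps_climatology


# With "above normal" observed (index 2), a confident miss on the adjacent
# category is penalized less than a confident miss on the opposite category.
print(rps((0.0, 0.8, 0.2), 2), rps((0.8, 0.0, 0.2), 2))
```

Because the errors are squared, a confidently wrong forecast lowers the score by more than an equally confident correct forecast raises it, which is why overconfidence tends to reduce RPSS.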
Skill measures for CPC’s forecasts and the CFS models are computed either as time series (where the score is averaged over the spatial domain for each season in time), or as spatial maps (where the score is averaged over the entire period of record for each grid point, and the geographical distribution of skill is displayed). The time series of skill is expected to show variations of skill in response to known predictors, such as the ENSO phenomenon; the spatial map describes the spatial distribution of skill, and may show known ENSO teleconnection patterns during relevant seasons. Both dimensions of skill are expected to be seasonally dependent.
3. Results
We describe several aspects of the relative performances of the predictions of the two CFS model versions and the CPC’s official forecasts of temperature and precipitation.
a. Temporal variation and seasonal cycle of spatially averaged HSS
A comparison among the time series, smoothed using a 13-season running average, of the HSS for the 0.5-month lead time for the CFSv1 and CFSv2 model predictions, and the official CPC outlooks for seasonal mean temperature is shown in Fig. 1 (top left). The HSS for CFSv2 is higher than that for CFSv1 throughout most of the 15-yr period. During 1995–2000 the two CFS versions appear well correlated, but this correlation decreases thereafter. The average HSS of CPC’s outlooks over all of the continental United States slightly exceeds that of CFSv1, but is exceeded to a greater extent by that of CFSv2. Prior to 2007, CFSv2 is rarely outperformed by the CPC’s outlooks.
The same comparison applied to total precipitation is shown in Fig. 1 (top right). Overall, precipitation skill is considerably lower than temperature skill. As noted for temperature, HSS is higher for CFSv2 than CFSv1, and the skills of the two CFS models are strongly correlated until 2000. However, in contrast to the temperature result, the precipitation forecasts from CPC underperform those from both CFS versions, although those from CFSv1 are only slightly more skillful.
The seasonal cycle of the skills of the temperature forecasts from the three sources (bottom-left panel in Fig. 1) features a bimodal distribution for CFSv2 and the CPC’s forecasts with peaks in northern winter and summer and minima in late spring and late fall. An exception to this pattern is seen for CFSv2 during December–February (DJF), where its lowest skill for the year is noted. The skill pattern of CFSv1 stands out as being different from CFSv2 and CPC in its lack of a summer skill peak, with highest skill during winter where it outperforms CFSv2 for DJF but falls far short of CFSv2 during the other portions of the cold season. The lack of a summer skill peak in CFSv1 may be related to a problem in initializing soil moisture in the time of year when it matters most (Wang et al. 2010).
The seasonal cycle of the skill of precipitation forecasts (Fig. 1, bottom right), aside from the generally lower level than that of temperature, differs also in the lack of a summer peak. Except for some minor skill fluctuations over short portions of the seasonal cycle, the patterns are essentially unimodal, peaking during winter. This is likely due to robust ENSO teleconnections in the United States being mainly limited to winter (Livezey and Timofeyeva 2008). While winter is the season of maximum HSS for both temperature and precipitation, the winter HSS is approximately 0.20 for temperature, but only 0.10–0.15 for precipitation.
To help explain why temperature prediction skill is highly coherent between CFSv1 and CFSv2 from 1995 to 2000 and less so thereafter, Fig. 2 (top) shows the average seasonal U.S. temperature anomaly (relative to the 1971–2000 climatology) of the two model versions and the observations over the study period. During 1995–2000 the mean temperature predictions of the two model versions are highly similar. After 2000, the average predicted temperature of CFSv2 begins exceeding that of CFSv1, and by a generally increasing margin with time over the remainder of the study period. This trend in mean predicted temperature appears also in the percentage of forecasts having the highest probability for the above normal category (Fig. 2, bottom). In both plots, the CFSv1 has lower values than the observations much of the time, while CFSv2 maintains values averaging closer to those observed. However, we note that a direct comparison of the percentage of grid points and the mean temperature itself is complicated by the way the tercile membership is defined in the models versus the observations, and by the confidence level of the models, to be discussed below.
The increasingly warmer predictions of CFSv2 than CFSv1 may be due in part to the specification of evolving CO2 in CFSv2, and this difference in warmth may be a reason for the decreasing similarity between the temperature skill levels of the two models after 2000. While the skill levels of the two model versions are strongly correlated up to year 2000, that of CFSv2 is always higher than that of CFSv1, likely owing also to model and initial condition improvements other than the CO2 specification.
The possibility that differing forecasts of mean U.S. temperature between the two model versions lead to differing skill levels between them is supported by a particularly large difference in mean temperature from 2005 to mid-2007 (Fig. 2), and a correspondingly large skill difference during the same period (Fig. 1, top left). The observed mean temperature during this period was better approximated by CFSv2 than CFSv1 (Fig. 2, top). A fixed and outdated (from late 1980s) CO2 specification has been documented to lead to a lack of maintenance of trend in model forecasts of surface temperature over ocean and land (Doblas-Reyes et al. 2006; Cai et al. 2009). The inverse relationship between the absolute error in mean temperature and the skill suggests that the generally increasing cold bias in CFSv1, possibly related to its fixed CO2 concentration, may partly account for its lower skill than CFSv2 not only during the 2005–08 period but more generally during the latter half of the study period. The brief period of 2005–08, however, is insufficient for a robust conclusion, and this conclusion is also tempered by the analysis of Wang et al. (2010), who showed that over the same period due to trends in the specification of initial moisture, CFSv1 consistently predicted a cooler summer.
Another feature of interest in the temperature predictions is the relatively lower HSS obtained by the official CPC forecasts than by either CFS model version during the protracted La Niña of 1998–2000 (Fig. 1, top left), with a focal point of skill difference in northern winter–spring 1998/99. One of CPC’s empirical forecast tools, used only during ENSO episodes, is ENSO composites, representing historical teleconnections over the United States for El Niño or La Niña. While this tool has continued to be helpful for seasonal precipitation forecasts, it has failed in at least two major instances for temperature forecasts. As shown in Goddard et al. (2003), the IRI’s temperature forecasts for the globe were degraded by the use of global La Niña teleconnections during the 1998–2000 La Niña episode. A possible explanation for this failure of temperature teleconnections is that the global warming signal and the global El Niño teleconnection pattern, while differing in many respects, have very roughly similar spatial patterns, particularly at low latitudes (Barnston et al. 2010; see their Fig. 4), such that the effects of La Niña and global warming somewhat oppose one another. In midlatitudes, the La Niña temperature teleconnection for below normal temperature in the northwestern and north-central United States, for example, is compromised by the global warming signal.
Another reason for the ineffectiveness of ENSO temperature teleconnections during the 1–2 yr following significant El Niño events is a marked delay in the midlatitude atmospheric response to the El Niño (Kumar and Hoerling 2003) that is not incorporated in simultaneous ENSO composites, but can be simulated by models, as seen in the case of the 1998–2000 La Niña that immediately followed the very strong 1997–98 El Niño (Kumar et al. 2001b; Hoerling et al. 2001). Another, more general, shortcoming of El Niño or La Niña teleconnections is that they may not be effective for noncanonical ENSO events. For example, Goddard et al. (2006) showed that CPC’s temperature forecasts in the eastern United States during the 2002–03 El Niño, based substantially on El Niño teleconnections, failed because the event was of the type limited largely to the central Pacific, a case in which the best dynamical model predictions indicated a cold U.S. east coast and fared better than the teleconnection guidance. In both examples above, dynamical model guidance, particularly from a high-performing model such as CFSv2, delivered better predictions (Fig. 1, top left) than forecasts guided heavily by generic teleconnection guidance.
b. Geographical distribution of temporally averaged HSS
The geographical distribution of time-average skill for temperature for each of the three forecast products for all seasons combined is shown in Fig. 3 (top row). The distribution is somewhat uniform across the United States, with only a few exceptions: for example, a skill maximum in the southwestern region, more so for CFSv2 than CFSv1, and to the greatest extent for CPC’s forecasts. This region has been found to have a strong warming trend over recent decades (see right column of Fig. 2 in Livezey and Timofeyeva 2008), tending to be in the above normal category during many of the years for most of the seasonal cycle. This southwestern warming trend has allowed the OCN statistical tool to have a high level of skill in that region (see middle row of Fig. 16 in Peng et al. 2012), and the better performance of CFSv2 than CFSv1 in this region suggests that CFSv2 reproduces temperature trend features better than CFSv1. The improved reproduction of the spatial pattern of the observed upward temperature trend in the southwest United States in CFSv2 may be a result of CFSv2’s time-evolving CO2 concentration. This possibility is supported by the progressively increasing exceedance of the predicted mean U.S. temperature of CFSv2 over that of CFSv1 after year 2000, while being approximately equal before that year (Fig. 2); however, contributions from other factors, for example, issues with specification of soil moisture in the CFSv1 (Wang et al. 2010), cannot be discounted. While skill differences appear most clearly in the southwest, CFSv2 tended to outperform CFSv1 over much of the United States. It is interesting that both CFS versions have had a weak relative skill maximum in the far northeastern United States, but CPC’s forecasts lack this feature, suggesting that it was missing in some of the nondynamical tools.
Seasonally stratified temperature skill distributions are shown for northern summer and winter in the middle and bottom rows of Fig. 3, respectively. These maps generally recapitulate the findings for the year as a whole, except that the skill levels of the three forecast products are most similar in winter (CFSv2 being only slightly better than CFSv1), and most different in summer with CFSv2 strongly outperforming CFSv1. CPC’s summer forecasts average roughly the same as those of CFSv2, but with skill in the southwest strongly dominating at the expense of skill in the Midwest and the northern Great Plains.
The spatial distribution of time-averaged precipitation skill for all seasons (Fig. 4, top row) is also somewhat uniform across the United States, but suggests a relative minimum in parts of the Rockies and the central or northern Midwest. The skill of CFSv1 is the most spatially uniform among the three forecast sources. The distribution of skill in CPC’s forecasts features a “skill hole” covering parts of the Rockies, eastern Great Plains, and the Midwest. Thus, its overall skill level is the lowest of the three, with relatively strong skill limited to the southern tier of states. Seasonally stratified precipitation skill distributions for northern summer and winter (middle and bottom rows of Fig. 4, respectively) show higher overall skill in winter than summer for all three forecast sources. During winter the southern tier of states has relatively the highest precipitation skill, undoubtedly due to the winter ENSO precipitation teleconnection pattern (e.g., see Fig. 17, bottom row, in Peng et al. 2012). In summer, when precipitation skill is lower due to less well defined, smaller-scale synoptic and mesoscale variability, geographical preferences are not great for the two CFS forecasts; the CPC forecast, however, has noticeably lower skill than either CFS version in much of the interior of the country, with skill areas limited to the Northwest, the northern Great Plains, New England, and the Southeast.
c. Probabilistic verification
Skill comparisons among the three forecast products for seasonal temperature and precipitation are repeated using the probabilistically sensitive RPSS verification measure. Here, in contrast to the HSS, the probability values matter, as there are strong rewards (penalties) for high probabilities given if the category in question is (or is not) observed. Additionally, the forecast probabilities for the categories not observed are also taken into account: for example, the score is less low when a category adjacent to the observed category has a high forecast probability than when a nonadjacent category has a high probability.
Time series of RPSS indicate relatively poor probabilistic temperature verification for CFSv1 compared with CFSv2 (Fig. 5, top), and CPC forecast performance just slightly lower than that of CFSv2. However, while CFSv2 and CPC have a similar average RPSS (0.043 and 0.030, respectively), CFSv2 has much greater temporal variations of skill than CPC. This outcome may be related to the fact that the CPC forecasts, derived from multiple input tools, are generally more probabilistically conservative than the forecasts of any individual tool such as the CFS prediction, which itself may be “overconfident” (to be discussed below).
Time series of RPSS for precipitation forecasts (Fig. 5, bottom) are noticeably lower than those for temperature. Again, CFSv1 has the lowest average performance with RPSS of −0.047. CFSv2 performs slightly better, with −0.013, while CPC’s forecasts have still just slightly higher RPSSs, averaging 0.003. This ordering of performance among the three forecast products differs from that found for HSS noted above, for a reason to be discussed below. Similar to the temperature forecasts, the two CFS forecasts have greater variations of skill over time than CPC, perhaps due to the greater variability of a single tool than a product combining multiple tools. Another possible source of the large temporal variability of CFS forecast skill, to be illustrated below, is the overconfidence of its probability forecasts. Such overconfidence reduces RPSS, as severely incorrect forecasts result in more strongly negative RPSSs than excellent forecasts result in positive RPSSs. (Recall that RPSS is based on squared probability errors.) This difference in the characteristics of the HSS and RPSS is likely the main reason for the CPC forecasts to outperform CFSv2 forecasts in RPSS, but not HSS.
Figure 6 (top row) shows the geographical distribution of RPSS for temperature for all times of the year for the three forecast systems, as shown in Fig. 3 (top row) for HSS. The spatial patterns of skill are roughly similar to those for HSS, as for example the skill maximum in the Southwest that is noticeably stronger in CFSv2 than CFSv1, underscoring the better reproduction of warming trends in CFSv2. However, regions of relatively low positive HSS tend to have nonpositive RPSS. This disappearance of skill is seen most noticeably in CFSv1. A reason for positive HSS but nonpositive RPSS is that the forecast probabilities deviate farther from climatology than warranted by the expected observed frequency of occurrence for the category in question. Such probabilistic overconfidence results in RPSS penalties that can outweigh credit for true skill residing in the forecasts, while HSS is insensitive to an incorrect level of probabilistic confidence. The spatial distribution of RPSS for precipitation forecasts (Fig. 6, bottom row) shows the same feature relative to HSS as is seen in temperature. Because skill is lower for precipitation than temperature, the maps of RPSS are scant in positive skill, particularly for CFSv1. CFSv2 and CPC forecasts fare better, with close to half of the area of the United States retaining positive RPSS. While CFSv2 shows the highest local RPSS maximum of the three forecast sources—located in the southeastern United States—CPC’s forecasts have the greatest proportion of area with positive skill.
The statistical significance of the RPSS of CFSv1, CFSv2, and the difference in RPSS between CFSv1 and CFSv2 is evaluated using a Monte Carlo approach. For the individual model skills, the dates of the forecasts are randomly permuted 10 000 times to develop a null distribution of skill against which the rank of the actual skill is determined. In shuffling the dates, not only do the gridpoint data for the individual forecast and observed maps for a given season remain together, but blocks of five consecutive running 3-month periods also remain intact. The natural spatial correlation and the temporal autocorrelation of the forecast and observed anomalies are thus preserved, ensuring a realistic approximation to the spatial and temporal degrees of freedom in the U.S. climate. If the actual skill is at or above the 95th percentile of the null distribution, one-sided significance at the 5% level is achieved. The first two columns in Fig. 7 show the spatial distribution of locally significant skill for CFSv1 and CFSv2 for forecasts of temperature (top row) and precipitation (bottom row). For temperature, CFSv1 is locally significant at 24% of the grid points, while CFSv2 is significant at 66%, as considerable additional area over much of the United States becomes significant. For precipitation the coverages are 22% and 36%, respectively, and there is a noticeable increase in coverage only in parts of the southern and eastern United States with CFSv2.
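The date-shuffling test described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' code: each element of `forecasts` and `observations` is assumed to be one whole map (so grid points stay together), `skill_fn` is any user-supplied score for which higher is better, and the function names and default seed are hypothetical.

```python
import random


def block_permuted_indices(n_times, block_len, rng):
    """Shuffle time indices in intact blocks of consecutive seasons."""
    blocks = [list(range(start, min(start + block_len, n_times)))
              for start in range(0, n_times, block_len)]
    rng.shuffle(blocks)
    return [i for block in blocks for i in block]


def permutation_p_value(forecasts, observations, skill_fn,
                        n_trials=10000, block_len=5, seed=0):
    """One-sided p-value: fraction of shuffled-date skills at or above the actual skill."""
    rng = random.Random(seed)
    actual = skill_fn(forecasts, observations)
    at_or_above = 0
    for _ in range(n_trials):
        idx = block_permuted_indices(len(forecasts), block_len, rng)
        shuffled = [forecasts[i] for i in idx]  # observations keep their true dates
        if skill_fn(shuffled, observations) >= actual:
            at_or_above += 1
    return at_or_above / n_trials
```

Applying the same routine with the U.S.-average score as `skill_fn` gives a field-significance estimate, while applying it with the score at a single grid point gives local significance. A p-value at or below 0.05 corresponds to the actual skill lying at or above the 95th percentile of the null distribution.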
To compute the field significance (Livezey and Chen 1983) of the RPSS for each model version, the average of the RPSS over the United States is used as the test statistic. In this case the average RPSS is significant at <0.0001 (i.e., none of the 10 000 randomized trials produces an average RPSS as high as the actual one) for precipitation for either model. For temperature, field significance is <0.0001 for CFSv2 and 0.0002 for CFSv1. These very strong field significances imply that some trust may be placed in the patterns of local significance shown in Fig. 7, and that even with the improved CFSv2 model there remain challenges in predicting temperature in some portions of the United States (e.g., parts of the southwest and southeast), and precipitation over most of the area outside of the southern tier and Eastern Seaboard—at least when averaging over all seasons.
For significance tests for the difference between the RPSS of CFSv1 and CFSv2, rather than randomizing the dates of the forecasts in the Monte Carlo tests, the model identity (CFSv1 versus CFSv2) is randomized over time as described in Hamill (1999). Again, 10 000 trials are performed, and model identity is set to be the same in blocks of five consecutive running seasons to preserve the natural temporal autocorrelation. Maps of the spatial distribution of the resulting local significance of the skill difference are shown in the last column of Fig. 7. For temperature (top row), significant differences in RPSS are achieved at 56% of the locations, failing in the midsection and some corners of the country. For precipitation (bottom row in Fig. 7), significant differences appear in only ¼ of the locations. The field significance of the spatially averaged RPSS differences are <0.0001 for both temperature and precipitation, lending strong support for the reality of the spatial patterns of local significance. In particular, the result for temperature is suggestive of a meaningful and spatially extensive improvement in RPSS of CFSv2 over CFSv1.
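The model-identity randomization can be sketched in a simplified form. The sketch assumes the test statistic is the difference of time-mean scores built from per-season score series for the two models, which is a simplification of recomputing RPSS for each relabeled set. The function name and arguments are hypothetical, and the details of Hamill (1999) are not reproduced here.

```python
import random


def identity_swap_p_value(scores_a, scores_b, n_trials=10000, block_len=5, seed=0):
    """One-sided p-value for mean(scores_b) - mean(scores_a) > 0.

    In each trial the two model labels are either kept or swapped, with the
    same choice applied to every season within a block of consecutive seasons.
    """
    rng = random.Random(seed)
    n = len(scores_a)
    actual = (sum(scores_b) - sum(scores_a)) / n
    at_or_above = 0
    for _ in range(n_trials):
        total = 0.0
        for start in range(0, n, block_len):
            sign = 1.0 if rng.random() < 0.5 else -1.0  # -1 swaps the labels
            for i in range(start, min(start + block_len, n)):
                total += sign * (scores_b[i] - scores_a[i])
        if total / n >= actual:
            at_or_above += 1
    return at_or_above / n_trials
```

Swapping labels block by block keeps the pairing of the two models on each date, so differences that arise only from a few persistent episodes are less likely to appear significant.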
d. ROC analysis
ROC curves and ROC area have been used as a verification tool for probability forecasts for several decades (Mason 1982). The ROC area is the area under the curve of hit rate versus false-alarm rate on a ROC plot. Such a plot shows the cumulative hit rate against cumulative false-alarm rate for progressively decreasing forecast probabilities for an event to occur (here, an event is the occurrence of one of the tercile-based categories). A favorable ROC plot would show a higher hit rate than false-alarm rate for cases having the highest probabilities for the event, but a decreasing hit rate and increasing false alarm rate as forecasts with lower probabilities are added into the cumulative tally. [When probability is low, such as 0.10, a false alarm (meaning that the event does not occur) is more desirable than a hit.] With a possible range of 0%–100%, a 50% rate of correct discrimination (appearing as a ROC area of 0.5) is expected by chance and reflects 0 forecast skill (Mason and Graham 2002). ROC measures discrimination alone, without penalty for poor probability calibration; hence, even if probability values are all too high or too low, or vary too much or too little, ROC areas show skill (i.e., are >0.5) if higher probabilities are followed by higher rates of occurrence than lower probabilities.
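As an illustration of how a ROC curve and its area are accumulated, the following minimal sketch steps through progressively decreasing probability thresholds and integrates hit rate against false-alarm rate with the trapezoidal rule. The function name `roc_curve_and_area` and the choice of thresholds (every distinct issued probability) are assumptions made for the example, not a description of the exact procedure used in this study.

```python
import numpy as np

def roc_curve_and_area(prob, occurred):
    """Cumulative hit rate vs. false-alarm rate for one event category.

    prob:     forecast probabilities of the event (any shape, flattened).
    occurred: booleans, True where the event was subsequently observed.
    Returns (false_alarm_rate, hit_rate, area).
    """
    prob = np.asarray(prob, dtype=float).ravel()
    occurred = np.asarray(occurred, dtype=bool).ravel()
    n_event = occurred.sum()
    n_nonevent = (~occurred).sum()
    if n_event == 0 or n_nonevent == 0:
        raise ValueError("need at least one event and one non-event")

    # Distinct issued probabilities, from highest to lowest.
    thresholds = np.unique(prob)[::-1]

    hit = [0.0]
    false_alarm = [0.0]
    for t in thresholds:
        warned = prob >= t  # all forecasts at or above this probability
        hit.append((warned & occurred).sum() / n_event)
        false_alarm.append((warned & ~occurred).sum() / n_nonevent)
    hit.append(1.0)
    false_alarm.append(1.0)

    hit = np.array(hit)
    false_alarm = np.array(false_alarm)
    # Trapezoidal area under the curve of hit rate vs. false-alarm rate.
    area = float(np.sum((false_alarm[1:] - false_alarm[:-1])
                        * (hit[1:] + hit[:-1]) / 2.0))
    return false_alarm, hit, area
```

In this sketch, a set of forecasts that all share one probability produces a single step from (0, 0) to (1, 1) and hence an area of 0.5, while forecasts that rank every event above every non-event produce an area of 1.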
ROC curves and areas for temperature forecasts are shown in Fig. 8 for CFSv1, CFSv2, and the CPC forecasts for the above normal category (left column) and below normal category (right column). They reveal higher ordinal probabilistic discrimination on the part of CFSv2 than CFSv1 for both categories, and lowest discrimination ability for the CPC forecasts. The more favorable result for CFSv2 than CFSv1 is not surprising in light of its higher scores on most of the other metrics examined, which also measure discrimination in combination with other attributes. The lower result for the forecasts of CPC than those of CFSv1 does not parallel results for other metrics, and is partly explainable on the basis of the way in which probabilities are assigned in CPC’s forecasts. A fairly large proportion of CPC forecasts are for equal chances for any of the three categories (the climatology forecast). These forecasts, which participate in the ROC computation, contribute to zero skill. The probabilities predicted by the two CFS model versions are not plagued with this “special” and frequently occurring forecast type, and their probability bins progress across their range with a more naturally shaped distribution, even if that range is wider than it should be, due to overconfidence. (Again, note that inappropriate confidence—or any systematic probabilistic bias—does not degrade ROC.)
A consequence of the sizeable proportion of CPC’s forecasts sharing the same probability (⅓), and thus not discriminated from one another, is a reduction in the ROC area. An implication is that even the small probability preferences given by the CFS model versions when predictability is low tend to be correct at a rate slightly higher than that expected by chance—an advantage denied the CPC climatology forecasts. Although CPC’s relatively poor ROC areas can be viewed as an artifact of its forecast format with the frequently issued climatology forecast, this outcome may also suggest that CPC could issue more informative forecasts if it attempted to discriminate within cases having little to no forecast signal. For example, CPC could indicate areas deviating from climatology more weakly than currently required.4
ROC curves and areas for precipitation forecasts are shown in Fig. 9 for the above normal category (left column) and below normal category (right column). Their areas are smaller than their counterparts for temperature, in keeping with the above-noted relatively lower skill for precipitation. Nonetheless, consistent with findings for temperature, the ROC areas show higher probabilistic discrimination for CFSv2 than CFSv1 for both categories, and the lowest discrimination ability for the CPC forecasts. More probabilistic skill in CFSv2 than CFSv1 is again consistent with its higher scores on the other metrics of precipitation skill. The lower result for the CPC forecasts, as discussed above for the temperature results, is likely related to CPC’s forecast format in which a large percentage of forecasts are climatology forecasts, creating a lack of probabilistic discrimination for a proportion of the forecasts that is considerably greater for precipitation than for temperature (Peng et al. 2012).
e. Reliability analysis
We next assess the reliability, overall biases, and sharpness of the probabilistic seasonal temperature and precipitation forecasts of the CPC and the two CFS forecasts. Reliability is a measure of the correspondence between the forecast probabilities and their subsequent observed relative frequencies, spanning the full range of issued forecast probabilities. Perfect reliability would be achieved, for example, if the above normal category were assigned a probability of 40% in 200 instances (considering all grid points for all forecasts over the 15 yr), and the later observed seasonal mean anomalies were in the above normal category in exactly 80 (i.e., 40%) of those instances.
Reliability diagrams for temperature forecasts for the above and below normal categories are shown for the two CFS model versions and the CPC’s forecast in Fig. 10 (left column) as the red and blue curves, respectively. For each category, forecasts are binned for various probability intervals spanning from the lowest to the highest forecast probability (x axis), and are compared to their corresponding observed relative frequencies of occurrence (y axis). The plots in the insets in Fig. 10 show the percentages of cases in which forecasts having probabilities in each of the probability bins were issued. The diagonal line (y = x) represents perfect reliability. The binning of the probabilities for the CFS model versions is straightforward, while for CPC’s forecasts it is less obvious because only the probability of the dominant category is shown in CPC’s forecast maps. However, the probabilities of the two categories not shown are defined according to rules provided by CPC. These rules, provided in Peng et al. (2012), stipulate how the probabilities of the nondominant categories change (e.g., decrease) as the probability of the dominant category increases by given amounts. The reliability analysis includes forecasts for both positive and negative probability anomalies for the above and below normal categories, as well as the cases of zero probability anomaly (the climatology forecasts). Because enhanced probabilities for the near normal category are rarely issued, and because the expected skill for this category is known to be low (Van den Dool and Toth 1991), the reliability analysis does not include this category.
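The binning behind a reliability diagram and its inset can be sketched as follows. This is a minimal illustration that assumes ten equal-width probability bins by default and pools all grid points and forecasts into flat arrays; the function name `reliability_table` and the default bin edges are hypothetical and need not match the bins used for Fig. 10.

```python
import numpy as np

def reliability_table(prob, occurred, bin_edges=None):
    """Bin forecast probabilities and compare them with observed frequencies.

    prob:      forecast probabilities for one category (any shape, flattened).
    occurred:  1/0 (or True/False) where that category was observed.
    bin_edges: increasing edges from 0 to 1; defaults to ten 0.1-wide bins
               (an assumption made for this sketch).
    Returns (mean_prob, obs_freq, share) per bin; empty bins give NaN in the
    first two arrays.
    """
    prob = np.asarray(prob, dtype=float).ravel()
    occurred = np.asarray(occurred, dtype=float).ravel()
    if bin_edges is None:
        bin_edges = np.linspace(0.0, 1.0, 11)
    bin_edges = np.asarray(bin_edges, dtype=float)
    n_bins = bin_edges.size - 1

    # Using only interior edges places prob == 1.0 in the last bin.
    idx = np.digitize(prob, bin_edges[1:-1])

    count = np.bincount(idx, minlength=n_bins)
    sum_prob = np.bincount(idx, weights=prob, minlength=n_bins)
    sum_obs = np.bincount(idx, weights=occurred, minlength=n_bins)

    with np.errstate(invalid="ignore", divide="ignore"):
        mean_prob = sum_prob / count   # x axis of the reliability curve
        obs_freq = sum_obs / count     # y axis of the reliability curve
    share = count / count.sum()        # inset: fraction of cases per bin
    return mean_prob, obs_freq, share
```

For instance, 200 forecasts of 0.40 followed by 80 occurrences yield a point on the diagonal, with a mean forecast probability and an observed relative frequency of 0.40 in that bin.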
Figure 10 reveals several noteworthy forecast characteristics for temperature. First, all three forecast products show a general bias of underpredicting above normal temperature and overpredicting below normal. This bias occurs because there has been a pervasive warm tendency over the study period, as the average observed relative frequency of warm (cold) conditions is 0.47 (0.23), while the three sets of forecasts have considerably smaller preferences for warm over cold conditions. The failure to reproduce the extent of the warmer observed recent climate is indicated by the general offset of the entire reliability curves from the ideal 45° line such that the curve for the upper tercile category is positioned too high in the plot, while the curve for the lower category is positioned too low. These general discrepancies are summarized by the mean forecast probabilities and observed frequencies (see text within Fig. 10). The enhanced observed warmth, associated with the positive temperature trend related to global warming trends, is reproduced best by CFSv2, to a nearly equal extent by CPC’s forecasts, and the least by CFSv1. These forecast versus observation characteristics are also captured by contingency tables showing the percentages of observed occurrence of each of the three categories, given each of the forecast categories having the highest probability (Table 2). In these tables, a comparison between the marginal percentages (shown to the right of, and below, the main 3 × 3 matrix) reveals imbalances among frequencies of observed categories, and the extent to which they are reproduced in the forecasts.
Another forecast attribute noted in the reliability diagram is the confidence levels of the forecasts. The CPC’s forecasts have the most appropriate level of probabilistic confidence of the three forecast sources, as the slopes of its reliability curves are closer to unity than those of the predictions of the two model versions, whose curves have shallower slopes. Slopes of <1 indicate forecast overconfidence, in which forecast deviations from climatology are larger than the corresponding deviations of the observed relative frequencies. For example, CFSv1 has made forecasts having probabilities of 0.85 (the center of the 0.8–0.9 probability bin) for both above normal and below normal temperature, while the corresponding observed relative frequencies have been lower.
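One way to put a number on the slope of a reliability curve is a least squares fit of observed relative frequency on mean forecast probability, with each bin weighted by the number of forecasts it contains. The sketch below takes that approach as an assumption; the function name `reliability_slope` is hypothetical, and the slopes discussed here may have been judged from the plotted curves rather than computed this way.

```python
import numpy as np

def reliability_slope(mean_prob, obs_freq, count):
    """Count-weighted least squares slope of a reliability curve.

    mean_prob: mean forecast probability in each bin.
    obs_freq:  observed relative frequency in each bin.
    count:     number of forecasts in each bin (used as weights).
    A slope below 1 indicates overconfidence, a slope above 1 underconfidence.
    """
    p = np.asarray(mean_prob, dtype=float)
    o = np.asarray(obs_freq, dtype=float)
    w = np.asarray(count, dtype=float)

    keep = w > 0  # drop empty bins, whose entries may be NaN
    p, o, w = p[keep], o[keep], w[keep]

    p_bar = np.average(p, weights=w)
    o_bar = np.average(o, weights=w)
    spread = np.sum(w * (p - p_bar) ** 2)
    if spread == 0:
        raise ValueError("all forecasts fall at one probability; slope undefined")
    return float(np.sum(w * (p - p_bar) * (o - o_bar)) / spread)
```

With this measure, forecasts whose observed frequencies depart from climatology by only half as much as the issued probabilities would return a slope of 0.5.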
A third forecast characteristic is forecast sharpness, revealed in the forecast frequency inset plots. Sharpness represents the extent and frequency with which the forecasts deviate from the climatological probability. Figure 10 (left) shows that CPC’s temperature forecasts are much less sharp than the CFS model versions, whose forecasts span a considerably larger range of probabilities. However, the greater sharpness of the CFS model forecasts appears unjustified, as the reliability diagram indicates forecast overconfidence—particularly for CFSv1.
When predictive skill exists, favorable diagnostics on all three of the above-mentioned attributes (lack of overall bias, lack of over- or underconfidence, and the closely related optimum degree of sharpness) are revealed in the contingency table (Table 2), which would show largest percentage frequencies along the diagonal of the 3 × 3 matrix, and lower frequencies off the diagonal, especially in the lower-left and upper-right cells that represent two-category errors. When fundamental predictive skill is low, the contingency table is not expected to have markedly enhanced frequencies along the diagonal. In that case, the reliability curve inset histogram should indicate low sharpness, and the reliability curves appropriate forecast confidence (a slope near unity) along a limited range of issued probabilities. In the case of the seasonal temperature predictions, some skill clearly exists and the sum of the percentages along the diagonal exceeds 33.3% for all three forecast systems (39%, 44%, and 41% for CFSv1, CFSv2, and CPC, respectively).5 The greater diagonal sum for CFSv2 than CFSv1 is largely a result of the better reproduction of observations in the above normal category, with a 21% (30%) categorical match in CFSv1 (CFSv2). This finding is undoubtedly due to a better reflection of the warming trend in CFSv2 than CFSv1, possibly related to the assignment of a time-evolving carbon dioxide concentration in CFSv2 but not in CFSv1.
Reliability curves for precipitation forecasts are shown for the two CFS model versions and the CPC’s forecast in the right column of Fig. 10. In contrast to temperature, a trend related to climate change has been small, with only a slight shift toward above normal precipitation during the study period, a shift that was not reproduced in the predictions of the two CFS versions or in the CPC forecasts. Similar to findings for temperature, the reliability curves indicate overconfidence in the predictions of both CFS model versions (particularly CFSv1), and a more appropriate level of confidence in CPC’s forecasts. Also consistent with the temperature results, the CFS model versions exhibit greater forecast sharpness than CPC. Because forecast skill is generally lower for precipitation than for temperature (reflected in the HSS, RPSS, and ROC skill results), sharpness would be expected to be lower for precipitation than for temperature; this is indeed shown to be the case for all three forecast sources. Consistent with the temperature results, the relatively high sharpness of the two model versions’ predictions is unjustified, as shown by the resulting probabilistic overconfidence. Although CPC’s forecasts may seem to deviate from climatology too infrequently and by too little to be useful for applications, the slope of its reliability curves indicates that such weak forecast probabilities are appropriate, as the forecast probabilities tend to match the subsequent relative frequency of occurrence.
Contingency tables for the precipitation forecasts (Table 3) reflect the lower skill levels than are found for temperature, with the diagonal entries summing to 37%, 38%, and 35% for CFSv1, CFSv2, and CPC, respectively. These sums are only slightly greater than 33%.
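The relation between the diagonal sum of a contingency table and a Heidke-type skill score can be sketched as follows. The sketch assumes the standard form in which the percentage of hits expected by chance is 100 divided by the number of categories (33.3% for terciles) and the score is scaled so that 0 is chance and 100 is perfect; the function name `hss_from_diagonal` is hypothetical, and the HSS values reported elsewhere may differ if they were computed over a different subset of forecasts.

```python
def hss_from_diagonal(diagonal_percent_sum, n_categories=3):
    """Heidke-type skill score (0 = chance, 100 = perfect) from the sum of
    the diagonal percentages of a contingency table.

    Assumes every category is equally likely by chance, so the expected
    percentage of categorical hits is 100 / n_categories.
    """
    expected = 100.0 / n_categories
    return 100.0 * (diagonal_percent_sum - expected) / (100.0 - expected)
```

Under this assumption, diagonal sums of 39%, 44%, and 41% would correspond to scores of about 8.5, 16, and 11.5, and sums of 37%, 38%, and 35% to about 5.5, 7, and 2.5.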
The performances of the CFSv1 and CFSv2 dynamical models, used as input to NOAA/CPC’s seasonal climate outlooks, are compared with those of CPC’s outlooks themselves over the 1995–2009 period. Within the 15-yr study period, the CPC forecasters had the benefit of forecasts from CFSv1 since 2004, and did not yet have CFSv2, which became operational in April 2011. Nonetheless, hindcasts are used for evaluation of both CFS model versions, while real-time forecasts are used for the CPC forecasts.
The predictive skill of CFSv2 clearly improved over that of CFSv1 by most verification metrics, over most U.S. locations and over most seasons. This is undoubtedly attributable to its improved physics and the better data assimilation–analysis methods available by the time of its implementation in 2011. Another improvement was the inclusion of time-evolving greenhouse gas concentrations in CFSv2, a feature lacking in CFSv1, which used a fixed and outdated (late 1980s) concentration that has been found to leave its forecasts of surface temperature over ocean and land without the observed trend. This last improvement may be the cause of CFSv2’s notably lower discrepancy than that of CFSv1 between the mean forecast probability and the corresponding observed relative frequency of occurrence of the above normal temperature category (Figs. 2 and 10, Table 2). CFSv2 is also seen to exhibit a lower degree of probabilistic overconfidence than CFSv1.
The average performance of the CPC’s seasonal outlooks has been somewhat better than that of CFSv1 alone, but in many cases it is slightly lower than that of CFSv2. The fact that CFSv2 has higher average skill than CPC’s outlooks for each of the HSS, RPSS, and ROC metrics justifies the higher weight now given to CFSv2 than had been given to CFSv1 and its predecessor dynamical model versions, and to the empirical tools that have been essentially constant throughout the 15-yr study period. A specific example in which CFSv2 delivered better U.S. temperature forecasts than CFSv1, and both model versions better than the CPC’s official forecast, was during the 1998–2000 La Niña when the empirical teleconnections failed due to the delayed midlatitude warmth following the very strong 1997–98 El Niño—a warming captured effectively by CFSv2.
Regarding the level of probabilistic confidence, while CPC’s outlooks have been more appropriate (i.e., not sharper than warranted) than either CFS model version, CFSv2 is less overconfident than CFSv1 and, therefore, requires less postprocessing to become probabilistically reliable while retaining the good level of discrimination as measured here using ROC.
We thank Michael Halpert, David Unger, and three anonymous reviewers for their constructive comments and suggestions. Dr. Wanqiu Wang provided the data for model hindcasts.
OCN forecasts the mean anomaly over the last 10 (15) yr for temperature (precipitation), specific to the location and season being forecast, and represents a low-frequency persistence, or trend, in the recent approximate decade of observed data.
Due to the fairly limited latitude range of most of the mainland United States, latitude-dependent weighting for the 232 grid squares is not employed.
Currently, one category usually must attain at least 40% probability for an area of nonclimatology to be shown, although locations surrounding the probability maximum may gradually decline toward 33.3%.
The HSS can be computed from the sum of the diagonal percentages, as these represent the percentage of categorical hits.