References

  • Alessandri, A., and A. Navarra, 2008: On the coupling between vegetation and rainfall inter-annual anomalies: Possible contributions to seasonal rainfall predictability over land areas. Geophys. Res. Lett., 35, L02718, doi:10.1029/2007GL032415.
  • Alessandri, A., A. Borrelli, S. Masina, P. Di Pietro, A. Carril, A. Cherchi, S. Gualdi, and A. Navarra, 2010: The INGV-CMCC seasonal prediction system: Improved ocean initial conditions. Mon. Wea. Rev., 138, 2930–2952.
  • Alves, O., M. Balmaseda, D. Anderson, and T. Stockdale, 2004: Sensitivity of dynamical seasonal forecasts to ocean initial conditions. Quart. J. Roy. Meteor. Soc., 130, 647–667.
  • Balmaseda, M., D. Anderson, and A. Vidard, 2007: Impact of Argo on analyses of the global ocean. Geophys. Res. Lett., 34, L16605, doi:10.1029/2007GL030452.
  • Doblas-Reyes, F., R. Hagedorn, and T. Palmer, 2005: The rationale behind the success of multi-model ensembles in seasonal forecasting. II. Calibration and combination. Tellus, 57A, 234–252.
  • Hagedorn, R., F. J. Doblas-Reyes, and T. N. Palmer, 2005: The rationale behind the success of multi-model ensembles in seasonal forecasting. I. Basic concept. Tellus, 57A, 219–233.
  • Ingleby, B., and M. Huddleston, 2007: Quality control of ocean temperature and salinity profiles: Historical and real-time data. J. Mar. Syst., 65, 158–175.
  • Kang, I.-S., and J. H. Yoo, 2006: Examination of multi-model ensemble seasonal prediction methods using a simple climate system. Climate Dyn., 26, 285–294, doi:10.1007/s00382-005-0074-8.
  • Koster, R. D., and Coauthors, 2010: Contribution of land surface initialization to subseasonal forecast skill: First results from a multi-model experiment. Geophys. Res. Lett., 37, L02402, doi:10.1029/2009GL041677.
  • Krishnamurti, T., C. Kishtawal, T. LaRow, D. Bachiochi, Z. Zhang, C. E. Williford, S. Gadgil, and S. Surendran, 1999: Improved weather and seasonal climate forecasts from multimodel superensemble. Science, 285, 1548–1550.
  • Lee, J.-Y., and Coauthors, 2010: How are seasonal prediction skills related to models' performance on mean state and annual cycle? Climate Dyn., 35, 267–283, doi:10.1007/s00382-010-0857-4.
  • Murphy, A., 1993: What is a good forecast? An essay on the nature of goodness in weather forecasting. Wea. Forecasting, 8, 281–293.
  • Palmer, T., and Coauthors, 2004: Development of a European Multimodel Ensemble System for Seasonal to Interannual Prediction (DEMETER). Bull. Amer. Meteor. Soc., 85, 853–872.
  • Park, Y., R. Buizza, and M. Leutbecher, 2008: TIGGE: Preliminary results on comparing and combining ensembles. Quart. J. Roy. Meteor. Soc., 134, 2029–2050.
  • Rajagopalan, B., U. Lall, and S. Zebiak, 2002: Categorical climate forecasts through regularization and optimal combination of multiple GCM ensembles. Mon. Wea. Rev., 130, 1792–1811.
  • Richardson, D., 2000: Skill and relative economic value of the ECMWF ensemble prediction system. Quart. J. Roy. Meteor. Soc., 126, 649–668.
  • Richardson, D., 2003: Economic value and skill. Forecast Verification: A Practitioner's Guide in Atmospheric Science, I. Jolliffe and D. Stephenson, Eds., Wiley, 165–187.
  • Richardson, D., 2006: Predictability and economic value. Predictability of Weather and Climate, T. Palmer and R. Hagedorn, Eds., Cambridge University Press, 628–644.
  • Rosati, A., K. Miyakoda, and R. Gudgel, 1997: The impact of ocean initial conditions on ENSO forecasting with a coupled model. Mon. Wea. Rev., 125, 754–772.
  • Shukla, J., and J. C. Kinter, 2006: Predictability of seasonal climate variations: A pedagogical review. Predictability of Weather and Climate, T. Palmer and R. Hagedorn, Eds., Cambridge University Press, 306–341.
  • Shukla, J., R. Hagedorn, B. Hoskins, J. Kinter, J. Marotzke, M. Miller, T. N. Palmer, and J. Slingo, 2009: Revolution in climate prediction is both necessary and possible: A declaration at the World Modelling Summit for Climate Prediction. Bull. Amer. Meteor. Soc., 90, 16–19.
  • Timmermann, R., H. Goosse, G. Madec, T. Fichefet, C. Ethe, and V. Duliere, 2005: On the representation of high latitude processes in the ORCA-LIM global coupled sea ice–ocean model. Ocean Modell., 8, 175–201.
  • Toth, Z., O. Talagrand, and Y. Zhu, 2006: The attributes of forecast systems: A general framework for the evaluation and calibration of weather forecasts. Predictability of Weather and Climate, T. Palmer and R. Hagedorn, Eds., Cambridge University Press, 584–595.
  • Uppala, S., and Coauthors, 2005: The ERA-40 re-analysis. Quart. J. Roy. Meteor. Soc., 131, 2961–3012.
  • Wang, B., and Coauthors, 2009: Advance and prospectus of seasonal prediction: Assessment of the APCC/CliPAS 14-model ensemble retrospective seasonal prediction (1980–2004). Climate Dyn., 33, 93–117.
  • Weigel, A., M. Liniger, and C. Appenzeller, 2008: Can multi-model combination really enhance the prediction skill of probabilistic ensemble forecasts? Quart. J. Roy. Meteor. Soc., 134, 241–260.
  • Weisheimer, A., and Coauthors, 2009: ENSEMBLES: A new multi-model ensemble for seasonal-to-annual predictions—Skill and progress beyond DEMETER in forecasting tropical Pacific SSTs. Geophys. Res. Lett., 36, L21711, doi:10.1029/2009GL040896.
  • Wilks, D., 2006: Statistical Methods in the Atmospheric Sciences. 2nd ed. Academic Press, 630 pp.

Figures

Fig. 1. Sharpness, discrimination distance, RelSS, and ResSS over the (a) tropics (25°S–25°N) and (b) northern midlatitudes (30°–75°N) in the ENSEMBLES (dark) and the DEMETER (light) multimodels. The forecast events are surface air temperature falling below the lower (left bars) and above the upper (right bars) climatology terciles. Land and ocean grid points of forecast months 2–4 for all four start dates of the period 1980–2001 are considered. Asterisks indicate values significantly enhanced compared to the other experiment (5% level; Monte Carlo method).

Fig. 2. (a),(b) Discrimination diagrams and (c),(d) reliability diagrams for the forecasts started in May. Each grid point of the forecast months 2–4 seasonal mean surface air temperature anomalies over the tropics (25°S–25°N, 0°–360°) is used, and ERA-40 data are taken as reference. (a),(c) The dichotomous event of temperature falling below the lower tercile of the sample climatological distribution and (b),(d) the case of temperature exceeding the upper tercile. The histograms in the reliability diagrams indicate the refinement distribution. DEMETER is in red and ENSEMBLES in blue. Discrimination distance, RelSS, and ResSS values are reported, and an asterisk indicates a value significantly higher than in the other experiment at the 5% level (Monte Carlo method).

Fig. 3. As in Fig. 2, but for August starting dates.

Fig. 4. Spatial distribution of the sharpness attribute of the probabilistic forecasts for above-normal surface air temperature conditions. (a),(b) ENSEMBLES. (c),(d) DEMETER. (e),(f) ENSEMBLES minus DEMETER difference in sharpness; shaded are the areas of increase (red) and decrease (blue) in ENSEMBLES that passed a significance test at the 10% level. (a),(c),(e) 1 May and (b),(d),(f) 1 Aug start dates.

Fig. 5. As in Fig. 4, but for discrimination.

Fig. 6. As in Fig. 4, but for resolution.

Fig. 7. As in Fig. 4, but for reliability.

Fig. 8. Ensemble mean forecasts vs ERA-40 surface air temperature anomalies: point-by-point correlations of months 2–4 of the predictions. ENSEMBLES forecasts with starting dates (a) 1 May and (b) 1 Aug. (c),(d) As in (a),(b), but for DEMETER forecasts. (e),(f) ENSEMBLES minus DEMETER difference in correlations, for May and August, respectively (contour interval 0.2). Shaded are the areas of increase (red) and decrease (blue) in ENSEMBLES that passed a significance test at the 10% level.

Fig. 9. As in Fig. 2, but for the northern midlatitudes (30°–75°N, 0°–360°).

Fig. 10. As in Fig. 2, but for the northern midlatitudes (30°–75°N, 0°–360°) and August starting dates.

Fig. 11. As in Fig. 1, but for the average of the single-model performances.

Fig. 12. Multimodel minus single-model averaged performances: sharpness, discrimination distance, RelSS, and ResSS over the (a) tropics (25°S–25°N) and (b) northern midlatitudes (30°–75°N) in ENSEMBLES (dark) and DEMETER (light). The forecast events are surface air temperature falling below the lower (left bars) and above the upper (right bars) climatology terciles. Land and ocean grid points of forecast months 2–4 for all four start dates of the period 1980–2001 are considered.

Fig. 13. Potential economic value of the ENSEMBLES and DEMETER forecasts over the tropics (25°S–25°N) as a function of the C/L ratio for the prediction of surface temperature being (left) below the lower tercile and (right) above the upper tercile of the sample climatology. Calibrated forecasts using the discrimination information for ENSEMBLES (solid) and DEMETER (dashed), and using the reliability information for ENSEMBLES (dash–dots) and DEMETER (dots). (a),(b) 1 Feb; (c),(d) 1 May; (e),(f) 1 Aug; (g),(h) 1 Nov starting dates.

Fig. 14. As in Fig. 13, but for the northern midlatitudes (30°–75°N, 0°–360°).


Evaluation of Probabilistic Quality and Value of the ENSEMBLES Multimodel Seasonal Forecasts: Comparison with DEMETER

  • 1 Centro Euro-Mediterraneo per i Cambiamenti Climatici, Bologna, Italy
  • 2 Met Office, Exeter, United Kingdom
  • 3 Météo France, CNRS/GAME, Toulouse, France
  • 4 CERFACS/URA1875, Toulouse, France
  • 5 European Centre for Medium-Range Weather Forecasts, Reading, United Kingdom

Abstract

The performance of the new multimodel seasonal prediction system developed in the framework of the European Commission FP7 project ENSEMBLE-based predictions of climate changes and their impacts (ENSEMBLES) is compared with the results from the previous project [i.e., Development of a European Multimodel Ensemble System for Seasonal-to-Interannual Prediction (DEMETER)]. The comparison is carried out over the five seasonal prediction systems (SPSs) that participated in both projects. Since DEMETER, the contributing SPSs have improved in all aspects, with the main advancements including increased resolution; better representation of subgrid physical processes; land, sea ice, and greenhouse gas boundary forcing; and more widespread use of assimilation for ocean initialization.

The ENSEMBLES results show an overall enhancement for the prediction of anomalous surface temperature conditions. However, the improvement is quite small, with considerable space–time variations. In the tropics, ENSEMBLES systematically improves the sharpness and discrimination attributes of the forecasts. Enhancements of the ENSEMBLES resolution attribute are also reported in the tropics for the forecasts started 1 February, 1 May, and 1 November. Our results indicate that, in ENSEMBLES, an increased portion of the prediction signal from the single models effectively contributes to amplifying the multimodel forecast skill. On the other hand, a worsening of the multimodel calibration over the tropics compared to DEMETER is shown.

Significant changes are also shown in the northern midlatitudes, where the ENSEMBLES multimodel discrimination, resolution, and reliability improve for the February, May, and November starting dates. However, the ENSEMBLES multimodel shows a reduced capability to amplify the performance with respect to the contributing single models for the forecasts started in February, May, and August. This is at least partly due to the reduced overconfidence of the ENSEMBLES single models with respect to their DEMETER counterparts.

Provided that they are suitably calibrated beforehand, the ENSEMBLES multimodel forecasts are shown to represent a step forward in the potential economic value they can supply. A warning for potential users concerns the need for calibration, owing to the degraded tropical reliability compared to DEMETER. In addition, the superiority of recalibrating the ENSEMBLES predictions through the discrimination information is shown.

Concerning the forecasts started in August, ENSEMBLES exhibits mixed results over both the tropics and the northern midlatitudes. In this case, the increased potential predictability compared to DEMETER appears to be balanced by the reduced independence of the SPSs contributing to ENSEMBLES. Consequently, for the August start dates no clear advantage of using one multimodel system instead of the other is evident.

Corresponding author address: Andrea Alessandri, Centro Euro-Mediterraneo per i Cambiamenti Climatici, Via Aldo Moro 44, 40127 Bologna, Italy. Email: andrea.alessandri@enea.it


1. Introduction

Multimodel ensembles (MMEs) are powerful tools in dynamical climate prediction as they account for the uncertainties related to single-model errors (Park et al. 2008; Palmer et al. 2004; Wang et al. 2009). On seasonal time scales, it has been found that errors in MMEs are smaller than the average of the single-model errors (Palmer et al. 2004; Hagedorn et al. 2005), while the overconfidence of single-model ensembles is reduced by the better-dispersed MMEs, which broaden the sampling of future climate trajectories (Weigel et al. 2008). Multimodels derive their performance from the skill of the contributing models, so that MME skill is generally proportional to the mean skill of the individual models (Kang and Yoo 2006). However, the relation between single-model average skill and MME skill is not linear, and the multimodel performance is superior to the average of the single-model ensembles (SMEs). As explained in Hagedorn et al. (2005), this is mainly attributable to error cancellations and to the nonlinearity of the skill metrics applied. Mutual independence of the contributing models is a prerequisite for error cancellations (Hagedorn et al. 2005) and for skill amplification to occur (Kang and Yoo 2006). Weigel et al. (2008) showed that multimodels act by reducing the overconfidence of the single models (i.e., by gradually widening the ensemble spread and moving the ensemble mean toward truth without reducing the potential predictability). It follows that the ability of MMEs to gain more skill than the SME average strongly depends on the degree of overconfidence of the contributing SMEs. The higher the independence and the overconfidence of the contributing SMEs, the larger the potential benefit of using an MME.

Progress in modeling calls for careful verification of the corresponding forecast enhancements, as well as for evaluation of the calibration requirements of the predictions (Toth et al. 2006). On a fundamental level, all forecast verification procedures involve investigating the properties of the joint distribution of forecasts and observations (Wilks 2006). However, there can be many aspects of model performance and differing views of what constitutes a good forecast (e.g., Murphy 1993; Wilks 2006). As a consequence, a broad range of verification metrics is usually needed to analyze and compare forecast quality.

To be useful for decision making, seasonal climate predictions need to be probabilistic, and the capability of probability forecasts to provide valuable information to end users needs to be assessed (e.g., Richardson 2006). At the decision-making level, probability forecasts are judged by their potential economic value. This notion of value is conceptually different from the notion of skill in the meteorological sense: the potential economic value cannot be assessed by analyzing meteorological variables alone, because it also depends on the user's economic parameters.

The completeness of forecast verification procedures is of particular relevance for the subsequent calibration practices aimed at providing corrected data to end users (Toth et al. 2006). Traditionally, calibration is pursued by statistically correcting the reliability attribute (Toth et al. 2006). However, other prediction attributes can also be used to recalibrate the forecasts. For instance, the capability of the retrospective forecasts to discriminate past observed dichotomous events can be used effectively to calculate posterior probabilities of future events given the forecast results (Wilks 2006; Alessandri et al. 2010).

Weisheimer et al. (2009) presented the new global MME reforecast dataset for seasonal and annual time scales created within the European Commission (EC) FP7 project ENSEMBLE-based predictions of climate changes and their impacts (ENSEMBLES). They reported improvements of the ENSEMBLES MME, in terms of tropical Pacific systematic error and accuracy of the seasonal predictions, with respect to the previous Development of a European Multimodel Ensemble System for Seasonal-to-Interannual Prediction (DEMETER) MME (developed during the EC FP5 project DEMETER; Palmer et al. 2004). In the present study we report a first evaluation of the ENSEMBLES system for the verification attributes that characterize the probabilistic prediction quality and the potential economic value of the forecasts (Wilks 2006; Richardson 2006). Taking the DEMETER MME as the reference, the assessment has been carried out at the global scale, with particular focus on macroareas such as the tropical band (25°S–25°N) and the extratropics (30°–75°N/S). In section 2, the basic features of the MME forecasts being compared are summarized and the applied verification measures of prediction quality are described. The forecast quality comparison in the tropics is presented in section 3, while section 4 reports the results over the extratropics. Section 5 compares ENSEMBLES and DEMETER in terms of the MME performance gain over the respective single-model averaged skill. In section 6 we evaluate the forecasts' potential economic value and compare the reliability and discrimination approaches to calibration in terms of the benefits to end users. Finally, section 7 closes the study with a discussion and a summary of conclusions.

2. Method

The quality measures described in section 2a are used to compare the new ENSEMBLES MME (Weisheimer et al. 2009) with the previous version developed during the European Union (EU) project DEMETER (Palmer et al. 2004). We compare ENSEMBLES and DEMETER considering the respective versions of the five prediction systems that participated in both projects (see section 2b) over the overlapping retrospective forecast period (1980–2001). For each year, we use 1-month lead seasonal mean forecasts (averages over target months 2–4) starting on 1 February, 1 May, 1 August, and 1 November.

In this study we concentrate on forecasting below-normal (i.e., below the lower tercile of the sample distribution; E−) and above-normal (i.e., above the upper tercile of the sample distribution; E+) surface air temperature. The 40-yr European Centre for Medium-Range Weather Forecasts (ECMWF) Re-Analysis (ERA-40; Uppala et al. 2005) is used as the reference for verification of the forecasts. The sample terciles are estimated, for each grid point and each starting date, separately in ENSEMBLES and in DEMETER. All calculations are performed using the forecast anomalies, computed for each contributing model by removing the corresponding climatology from the original ensemble forecasts. A similar process is applied to the ERA-40 verification data.
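As a concrete illustration of this setup, the following Python/NumPy sketch derives the probability forecasts and observed outcomes for the two tercile events at a single grid point and start date (the function name and array layout are illustrative assumptions, not part of either prediction system):

```python
import numpy as np

def tercile_events(fcst_anom, obs_anom):
    """Probability forecasts and binary outcomes for the tercile events.

    fcst_anom: (nyears, nmembers) ensemble anomalies at one grid point and
    start date; obs_anom: (nyears,) verifying (e.g., ERA-40) anomalies.
    Sample terciles are estimated from the respective distributions.
    """
    f_lo, f_up = np.percentile(fcst_anom, [100 / 3, 200 / 3])
    o_lo, o_up = np.percentile(obs_anom, [100 / 3, 200 / 3])
    p_minus = (fcst_anom < f_lo).mean(axis=1)  # P(E-): members below lower tercile
    p_plus = (fcst_anom > f_up).mean(axis=1)   # P(E+): members above upper tercile
    o_minus = (obs_anom < o_lo).astype(int)    # observed E- occurrences
    o_plus = (obs_anom > o_up).astype(int)     # observed E+ occurrences
    return p_minus, o_minus, p_plus, o_plus
```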

a. Verification procedure

Following Wilks (2006), we consider a range of different attributes in order to characterize forecast quality. We compare the sharpness, discrimination, reliability, and resolution attributes of the forecasts (Wilks 2006). The potential economic value of the calibrated forecasts is further assessed through the simple cost-loss decision model (CLM; Richardson 2003). The quality measures are briefly defined below; details on the forecast quality skill scores and on the CLM used in this work are provided in sections 2a(1) and 2a(2), respectively.

  • Reliability (or calibration) pertains to the conditional distribution of observations given the forecasts. It characterizes the correspondence of the forecasts to the average observation for specific values of the predictions. A positively oriented reliability skill score (RelSS) can be defined, with 1.0 as the maximum value [see section 2a(1) for details].

  • Sharpness characterizes the unconditional distribution of the predictions (often reported as the forecast refinement distribution) and is an attribute of the forecasts alone (i.e., with no regard to the corresponding observations). A measure of sharpness is the standard deviation of the refinement distribution [Sh; see section 2a(1) for details], which relates to the signal of the predictions. A large spread implies that different forecasts are issued relatively frequently, and so the forecasts have the potential to discern a broad range of conditions (sharp forecasts).

  • Resolution measures the degree to which the forecasts sort the corresponding conditional observations into groups that are different from each other. Unlike reliability (which pertains to the correspondence between the conditional observations and the forecasts themselves), resolution refers to differences among the averaged conditional observations given different values of the forecasts. A resolution skill score (ResSS) with zero as the minimum value can be defined [see section 2a(1) for details].

  • Discrimination pertains to the conditional distribution of forecasts given the observations. It characterizes the ability of a forecasting system to produce different forecasts for those occasions having different realized outcomes of the predictand. A scalar attribute in the form of discrimination distance d can be defined [see section 2a(1) for details]. Here d = 1 indicates perfect forecasts while d = 0 means no discrimination at all.

  • Potential economic value measures the economic saving a user can make by using the forecasts, compared with having only climatological information. Here we use the relative value of the forecasts [hereinafter Value; see section 2a(2) for details], obtained by comparing the saving with the maximum attainable saving that would result from perfect deterministic forecasts (Value = 1).

1) Skill measures

After discretizing the probability forecasts to a finite set of values (yi; which can take any of the I values y1, y2, … yI; i = 1, … , I), the joint distribution of the forecasts and dichotomous observations (oj; occurrence j = 1; no occurrence j = 0) can be denoted by
$$p(y_i, o_j) = \Pr\{y_i, o_j\}, \qquad i = 1, \dots, I;\; j = 0, 1. \tag{1}$$
From the definition of conditional probability, one possible way to factorize the joint distribution is the calibration-refinement factorization:
$$p(y_i, o_j) = p(o_j \mid y_i)\,p(y_i), \qquad i = 1, \dots, I;\; j = 0, 1. \tag{2}$$
One part of this factorization consists of the set of I conditional distributions p(oj|yi), which specify how often each possible observed event occurred when the particular forecast yi was issued, that is, how well calibrated (or reliable) each forecast yi is. A scalar measure of reliability can be defined as follows (Wilks 2006):
$$\mathrm{Rel} = \frac{1}{n}\sum_{i=1}^{I} N_i \left[y_i - p(o_1 \mid y_i)\right]^2, \tag{3}$$
where Ni is the number of times each forecast yi is used in the collection of forecasts being verified and n is the total number of forecast–event pairs. Forecast calibration can be simply pursued by statistically correcting the reliability attribute. That is, it is performed by replacing the issued forecast probabilities with the conditional frequency distribution of the observations that followed the retrospective forecasts (Toth et al. 2006).
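A minimal sketch of this reliability-based calibration, under the assumption of a ten-bin discretization of the issued probabilities (the function name and binning are illustrative, not the systems' actual implementation):

```python
import numpy as np

def reliability_calibrate(p_train, o_train, p_new, bins=np.linspace(0, 1, 11)):
    """Replace each issued probability with the conditional frequency of the
    observed event that followed that forecast value in the training sample."""
    yi_train = np.digitize(p_train, bins[1:-1])  # discretize to the values y_i
    yi_new = np.digitize(p_new, bins[1:-1])
    calibrated = np.empty_like(p_new, dtype=float)
    for i in np.unique(yi_new):
        followed = o_train[yi_train == i]        # observations after forecast y_i
        # fall back to the base rate if y_i was never issued in training
        calibrated[yi_new == i] = followed.mean() if followed.size else o_train.mean()
    return calibrated
```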
The other part of this factorization [Eq. (2)] is the marginal (or refinement) distribution [p(yi)], which characterizes the averaged signal of the predictions and defines the sharpness attribute of the forecasts. A measure of sharpness can be defined as the standard deviation of the refinement distribution:
$$\mathrm{Sh} = \left[\sum_{i=1}^{I} p(y_i)\,(y_i - \bar{y})^2\right]^{1/2}, \qquad \bar{y} = \sum_{i=1}^{I} p(y_i)\,y_i. \tag{4}$$
Another attribute of the forecasts that is related to the calibration-refinement factorization is the resolution. Resolution refers to the differences between the conditional averages of the observations for different values of the forecasts:
$$\mathrm{Res} = \frac{1}{n}\sum_{i=1}^{I} N_i \left[p(o_1 \mid y_i) - \bar{o}\right]^2, \tag{5}$$
where $\bar{o}$ is the climatological relative frequency of the event.
As described in Wilks (2006), the reliability, the resolution, plus an “uncertainty” term can be used to define the Brier score (BS), a scalar measure of the overall accuracy:
$$\mathrm{BS} = \mathrm{Rel} - \mathrm{Res} + \mathrm{Unc}, \qquad \mathrm{Unc} = \bar{o}(1 - \bar{o}). \tag{6}$$
By using the climatological forecasts as the reference, a Brier skill score (BSS) is often computed as follows (Wilks 2006):
$$\mathrm{BSS} = 1 - \frac{\mathrm{BS}}{\mathrm{BS}_{\mathrm{clim}}} = \frac{\mathrm{Res} - \mathrm{Rel}}{\mathrm{Unc}}, \tag{7}$$
where $\mathrm{BS}_{\mathrm{clim}} = \mathrm{Unc}$ is the Brier score of the climatological reference forecasts.
For convenience, and by analogy with the BSS, positively oriented RelSS and ResSS can be defined as
$$\mathrm{RelSS} = 1 - \frac{\mathrm{Rel}}{\mathrm{Unc}}, \tag{8}$$

$$\mathrm{ResSS} = \frac{\mathrm{Res}}{\mathrm{Unc}}. \tag{9}$$
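The quantities in Eqs. (3)–(9) can be computed directly from a set of forecast-event pairs. A sketch under the same assumed ten-bin discretization (illustrative only):

```python
import numpy as np

def rel_res_skill_scores(p, o, bins=np.linspace(0, 1, 11)):
    """RelSS and ResSS of Eqs. (8)-(9) from probabilities p and binary o."""
    yi = np.digitize(p, bins[1:-1])       # index of the allowable value y_i
    n = len(p)
    obar = o.mean()                       # climatological event frequency
    rel = res = 0.0
    for i in range(len(bins) - 1):
        sel = yi == i
        Ni = sel.sum()                    # N_i: times forecast y_i was issued
        if Ni == 0:
            continue
        y_i = p[sel].mean()               # representative forecast value
        o_cond = o[sel].mean()            # conditional frequency p(o_1 | y_i)
        rel += Ni * (y_i - o_cond) ** 2   # term of Eq. (3)
        res += Ni * (o_cond - obar) ** 2  # term of Eq. (5)
    unc = obar * (1.0 - obar)             # uncertainty term of Eq. (6)
    return 1.0 - (rel / n) / unc, (res / n) / unc  # RelSS, ResSS
```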
The other factorization of p(yi, oj) is the likelihood-base rate factorization:
$$p(y_i, o_j) = p(y_i \mid o_j)\,p(o_j), \qquad i = 1, \dots, I;\; j = 0, 1, \tag{10}$$
where the conditional distributions p(yi|oj) express the likelihood that each of the allowable forecast values yi would have been issued in advance of each of the observed dichotomous events oj (occurrence j = 1; no occurrence j = 0). The conditional likelihood distributions p(yi|oj) are directly indicative of how well a set of forecasts is able to discriminate among the events oj. The larger the separation of the two likelihood distributions, the higher the discrimination performance of the predictions, which can be expressed as a scalar attribute in the form of the discrimination distance d. Following Wilks (2006), it is defined as the difference between the means of the two likelihood distributions (μy|oj):
$$d = \left|\mu_{y \mid o_1} - \mu_{y \mid o_0}\right|, \qquad \mu_{y \mid o_j} = \sum_{i=1}^{I} y_i\,p(y_i \mid o_j). \tag{11}$$
As pointed out by Wilks (2006), the above-mentioned characteristics of the two likelihood distributions can be used effectively to recalibrate the probability forecasts (Alessandri et al. 2010).
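A sketch of the discrimination distance and of a Bayes-rule recalibration built on the binned likelihoods follows; it illustrates the general idea rather than the exact procedure of Alessandri et al. (2010):

```python
import numpy as np

def discrimination_distance(p, o):
    """Eq. (11): separation between the means of the two likelihood distributions."""
    return abs(p[o == 1].mean() - p[o == 0].mean())

def discrimination_calibrate(p_train, o_train, p_new, bins=np.linspace(0, 1, 11)):
    """Posterior event probability given the issued forecast, via Bayes' rule:
    p(o1|yi) = p(yi|o1) p(o1) / [p(yi|o1) p(o1) + p(yi|o0) p(o0)]."""
    yi_train = np.digitize(p_train, bins[1:-1])
    yi_new = np.digitize(p_new, bins[1:-1])
    base = o_train.mean()                            # base rate p(o1)
    posterior = np.full(p_new.shape, base, dtype=float)
    for i in np.unique(yi_new):
        lik1 = np.mean(yi_train[o_train == 1] == i)  # p(y_i | o1)
        lik0 = np.mean(yi_train[o_train == 0] == i)  # p(y_i | o0)
        denom = lik1 * base + lik0 * (1.0 - base)
        if denom > 0:
            posterior[yi_new == i] = lik1 * base / denom
    return posterior
```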

2) Potential economic value

The potential economic value of the forecasts is assessed through the simple CLM. As described in Richardson (2003, 2006), the CLM considers a decision maker who is sensitive to a specific adverse event, such as E− or E+, and who aims to minimize the overall expense. If the event occurs and the user has not taken any preventive action, a financial loss L is suffered. Alternatively, the user can take action at a cost C that protects against this potential loss. Assuming the user knows the climatological frequency ō of the event but has no additional forecast information, the optimal strategy is either to always or to never protect, depending on which gives the lower overall expense. This gives a baseline against which improvements from using forecast information can be evaluated:
$$E_{\mathrm{C}} = \min(C, \bar{o}L). \tag{12}$$
Another reference point is provided by the expense associated with perfect deterministic forecast information; the user would only protect if the event is going to occur and the average expense is
$$E_{\mathrm{P}} = \bar{o}\,C. \tag{13}$$
Deterministic forecasts give simple yes/no predictions of the event occurrence, and the performance of such a forecasting system can be summarized in a contingency table (Table 1). The average expense of using the deterministic forecasts is obtained by combining the relative frequencies of the contingency table with the corresponding expenses:
$$E_{\mathrm{F}} = (a + b)\,C + c\,L, \tag{14}$$
where $a$, $b$, and $c$ are the relative frequencies of hits, false alarms, and misses from Table 1.
A relative value of the forecasts can be defined by comparing the saving the user makes with respect to $E_{\mathrm{C}}$ with the maximum possible saving that could be made with perfect deterministic forecasts:
$$\mathrm{Value} = \frac{E_{\mathrm{C}} - E_{\mathrm{F}}}{E_{\mathrm{C}} - E_{\mathrm{P}}}. \tag{15}$$
This can easily be extended to the probability forecasts used in this study. In this case, the user must decide whether the probability of adverse conditions is high enough for protective action to be taken. The user sets a probability threshold pt and takes action if the forecast probability exceeds that threshold. It can be shown that, for reliable probability forecasts, the optimal choice is pt = C/L (Richardson 2000). It follows that each user should take action when the forecast probability of the event exceeds their own cost-loss ratio.
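Under these definitions, the Value curve of Eq. (15) reduces to a few lines; the following sketch expresses all expenses in units of L and uses an illustrative function name:

```python
import numpy as np

def relative_value(p, o, cost_loss):
    """Eq. (15) for a user with ratio C/L = cost_loss who takes protective
    action when the forecast probability exceeds the threshold p_t = C/L."""
    obar = o.mean()
    e_clim = min(cost_loss, obar)     # Eq. (12), in units of L
    e_perf = obar * cost_loss         # Eq. (13)
    act = p > cost_loss               # decision rule p_t = C/L
    a = np.mean(act & (o == 1))       # hit frequency
    b = np.mean(act & (o == 0))       # false-alarm frequency
    c = np.mean(~act & (o == 1))      # miss frequency
    e_fcst = (a + b) * cost_loss + c  # Eq. (14)
    denom = e_clim - e_perf
    return (e_clim - e_fcst) / denom if denom > 0 else 0.0
```

Sweeping cost_loss across (0, 1) reproduces Value curves of the kind shown later in Figs. 13 and 14.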

b. The multimodels

The new ENSEMBLES MME was developed in the framework of the European Commission FP7 project ENSEMBLES (Weisheimer et al. 2009). The ENSEMBLES and DEMETER MMEs compared in this study include the respective versions of the five prediction systems that participated in both projects (see Table 2 for the list of contributing institutions). As reported in Weisheimer et al. (2009), the ENSEMBLES MME has improved in all aspects compared with DEMETER. The main modifications include increased horizontal and vertical resolution, a better representation of subgrid physical processes, land and sea ice, the inclusion of interannual variability in the greenhouse gas forcing, and a more widespread use of assimilation for constructing ocean initial conditions based on the EN3 quality-controlled in situ temperature and salinity profiles (Ingleby and Huddleston 2007). The same greenhouse gas boundary forcing was used by all the ENSEMBLES systems (details on the data used as boundary forcing can be found online at http://www.ecmwf.int/research/EU_projects/ENSEMBLES/exp_setup/index.html). Table 3 summarizes the main information on the five seasonal prediction systems (SPSs) for both ENSEMBLES and DEMETER.

Each of the individual model ensembles participating in either ENSEMBLES or DEMETER consists of nine ensemble members, giving rise to the two 45-member MMEs compared in this study. For the construction of both the ENSEMBLES and the DEMETER MME, we use the simplest approach of applying equal weights to all contributing models and ensemble members (Krishnamurti et al. 1999; Hagedorn et al. 2005). More complex methods of optimally combining the single-model output have been described (Krishnamurti et al. 1999; Rajagopalan et al. 2002; Doblas-Reyes et al. 2005), but their application is beyond the scope of the present comparison.
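With equal weights, the MME probability forecast is simply the frequency of pooled ensemble members beyond the event threshold; a minimal sketch under that assumption:

```python
import numpy as np

def mme_probability(model_anoms, threshold, below=True):
    """Equal-weight MME probability from a list of per-model ensemble anomaly
    arrays, each of shape (nyears, nmembers), e.g., five models x nine members."""
    pool = np.concatenate(model_anoms, axis=1)  # pooled 45-member ensemble
    hits = pool < threshold if below else pool > threshold
    return hits.mean(axis=1)                    # member fraction => probability
```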

3. Tropics

The comparison of the forecast attributes between ENSEMBLES and DEMETER over the tropics (all land and ocean grid points considered) displays a small but systematic increase of sharpness and discrimination in ENSEMBLES (Fig. 1a). For both E− and E+, the enhanced sharpness and discrimination over the whole tropics are significant at the 5% level (Monte Carlo method, 10 000 repetitions) for all starting dates. The increased sharpness indicates a tendency of the predictions to fall preferentially outside normal conditions, thus displaying an increased forecast signal (Wilks 2006). This can be clearly seen for the 1 May and 1 August start dates in the refinement distributions of the reliability diagrams in Figs. 2c,d and 3c,d. This result is consistent with Weisheimer et al. (2009), where the spread/RMSE ratio revealed a marked tendency in DEMETER toward low forecast signal-to-noise ratios over the tropical Pacific, a tendency not present in ENSEMBLES. The spatial distribution of the signal improvement with respect to DEMETER is shown by comparing the global maps of the sharpness evaluated at each grid point. Figure 4 reports the results of the probabilistic forecasts for E+ started in May (left panels) and August (right panels). For all start dates (only May E+ and August E+ are reported in Fig. 4 for brevity), ENSEMBLES (Figs. 4a,b) increases the sharpness compared to DEMETER (Figs. 4c,d) over wide areas of the tropical oceans. The results for below-normal conditions (E−) are similar to those for the above-normal cases (not shown). As shown in Figs. 4e,f, the significant (10% level, bootstrap method) differences between ENSEMBLES and DEMETER are mostly concentrated over the central and eastern tropical Pacific. The improvement over the ocean also affects the land, increasing the signal over the surrounding continental areas (see Fig. 4). The coincidence of increased sharpness with enhanced discrimination in ENSEMBLES over the tropics (Fig. 1a) indicates that the higher signal allows an improved capability, compared with DEMETER, to discriminate among the considered dichotomous events. This can be further seen in the discrimination diagrams in Figs. 2a,b and 3a,b (only the 1 May and 1 August cases are reported for brevity). ENSEMBLES shows increased separation between the conditional probability distributions of the forecasts given the occurrence [p(yi|o1)] and the absence [p(yi|o0)] of the event, leading to an improved ability of the forecasts to discriminate both E+ and E− compared to DEMETER. The global maps of the discrimination for the E+ forecasts started on 1 May and 1 August are shown in Fig. 5. The comparison between Figs. 4 and 5 shows that the regions with increased signal also tend to improve in discrimination. However, this tendency is less pronounced for the August start date (Figs. 4b,d,f and 5b,d,f), which also displays regions where discrimination is reduced despite the increased signal. In particular, for August E− the difference between ENSEMBLES and DEMETER discrimination shows large negative areas over the central and western Pacific, where increased sharpness is nonetheless found (not shown).

Resolution is also improved in ENSEMBLES, showing a significant increase for February (E+), May (both E+ and E−), and November (both E− and E+; Fig. 1a). As reported in Fig. 6 (left panels) for the May start date (with similar results for February and November, not reported for brevity), ENSEMBLES shows increased resolution over large areas of the Pacific and Indian Oceans. Notably, unlike the other start dates, the resolution for August does not improve in ENSEMBLES, which displays results very similar to DEMETER when the whole tropical belt is considered. However, the spatial distribution of the differences in resolution between ENSEMBLES and DEMETER for August (Fig. 6f) displays quite a patchy pattern, with positive differences in some regions that tend to be compensated by negative values in other areas. For instance, DEMETER displays increased resolution over the central and southern tropical Pacific, whereas ENSEMBLES tends to be better over the western Pacific–Maritime Continent and the Indian Ocean.

The reliability attribute, considering the tropical belt as a whole, appears to be almost systematically worsened compared to DEMETER, with the exception of a slight but significant improvement in ENSEMBLES for the May E+ predictions (Fig. 1a; see also Figs. 4a,c,e for the global maps). Conversely, DEMETER shows a significantly better RelSS for the forecasts started in February (both E− and E+), August (both E− and E+), and November (E−). The reliability diagrams in Figs. 2c,d and 3c,d (only the 1 May and 1 August cases are reported for brevity) display, for both DEMETER and ENSEMBLES, a tendency toward overconfident forecasts. However, in many cases DEMETER more closely approaches the perfect-reliability 1:1 line in the diagrams, corresponding to less overconfident forecasts than ENSEMBLES. This is particularly true for the forecasts started on 1 August, which show a worsening of reliability in ENSEMBLES (Figs. 3c,d). In fact, the comparison of the global maps of reliability for the August start date (Fig. 7f) shows that DEMETER improves reliability over many areas of the tropical Pacific, Indian, and Atlantic Oceans as well as over the surrounding continents. The August start date exhibits very mixed results across the different forecast attributes; the reader is referred to the analysis in section 5 for some insight into this. Here it is only observed that the forecasts started on 1 August show the largest sea surface temperature (SST) biases over the tropical Pacific of all the start dates, which may in part limit the potential for improvement at this start date. In fact, as shown by several studies (e.g., Lee et al. 2010), the ability of dynamical models to simulate the SST mean state strongly affects the prediction skill for interannual SST anomalies. Considering SST over the Niño-3.4 region (5°S–5°N, 170°–120°W), ENSEMBLES decreases the RMSE of the 1-month lead seasonal mean (months 2–4) forecast climatology by 20% (0.36 in ENSEMBLES vs 0.45 in DEMETER) for February, by 6% (0.30 vs 0.32) for May, by 7% (0.75 vs 0.81) for August, and by 4% (0.42 vs 0.44) for November. Even though ENSEMBLES reduces the bias compared to DEMETER, for the 1 August start date the errors are still more than twice those of the 1 February and 1 May forecasts, and exceed those of the 1 November forecasts by more than two-thirds.

In both ENSEMBLES and DEMETER, the skill in the forecast quality measures tends to be concentrated over the tropical Pacific, from where it radiates toward the whole tropical belt and the extratropics (see Figs. 4–7 for the May E+ and August E+ cases). The highest performance is, however, confined to the ocean, while land areas show lower sharpness, discrimination, and resolution. Nevertheless, as summarized in Table 4, the comparison of ENSEMBLES and DEMETER over tropical land shows improvements similar to those over the ocean, indicating the link between the continents and the sea. In particular, sharpness, discrimination, and resolution are significantly enhanced over land in ENSEMBLES, following the improvements over the ocean. A few exceptions to this behavior are noted: for instance, the ENSEMBLES results for the 1 August forecasts improve significantly over land, while over the ocean a significant worsening is evident (Table 4). Figure 8 shows the spatial distribution of the temporal correlations (TCs) between the MME mean forecasts and observations (only the 1 May and 1 August start dates are reported). Interestingly, there is quite a spatial similarity between the TC maps and some of the probabilistic forecast quality attributes. In particular, like discrimination, resolution, and sharpness, the TCs show the best skill over the tropical Pacific and decreasing performance in the rest of the tropical belt and toward the extratropics. Discrimination shows a particularly strong relationship with TC: the spatial correlation between discrimination and TC, averaged over E− and E+ and over all four starting dates, is 0.44. In contrast, for resolution, sharpness, and reliability the spatial correlation with TC decreases to 0.36, 0.35, and 0.20, respectively. These values are computed for ENSEMBLES, but the results for DEMETER do not differ significantly. The quite high spatial correlation between discrimination and TC indicates that the ability of the probabilistic forecasts to discriminate above-normal and below-normal conditions is connected to the skill of the MME mean "deterministic" prediction of interannual anomalies. In fact, the tropical regions with significantly increased TC in ENSEMBLES (see Figs. 8e,f for the May and August start dates) also tend to correspond to increased discrimination (Figs. 5e,f). The connection is particularly strong over the tropical Pacific for the February, May, and November forecasts (only the 1 May start date is reported in the figures for brevity). A similar behavior can also be seen for resolution (see Figs. 6e,f), even if the connection appears less clear.

Table 5 reports the comparison of the verification attributes for the three tropical ocean basins. The results show that the tropical Atlantic behaves differently from the other tropical oceans. There, the increased ENSEMBLES sharpness is not followed by any significant enhancement of discrimination for the forecasts started in August and November. For these starting dates, the increased signal of the predictions is accompanied only by a significantly lower RelSS (see also Fig. 7, right panels, showing the August global maps), indicating a worsened correspondence with the observations. This result resembles what was found in Alessandri et al. (2010), where it was shown that the use of subsurface data assimilation for ocean initialization in the Centro Euro-Mediterraneo per i Cambiamenti Climatici–Istituto Nazionale di Geofisica e Vulcanologia (CMCC-INGV) SPS tends to degrade the prediction skill over the tropical Atlantic. For the CMCC-INGV SPS, subsurface assimilation induces very drastic corrections in the tropical Atlantic. These corrections drive the coupled model too far from the state it would otherwise attain there, leading to a negative impact on the forecasts (Alessandri et al. 2010). Since ENSEMBLES makes more widespread use of in situ temperature and salinity profile assimilation (see section 2b), this result suggests that a skill worsening over the tropical Atlantic compared to DEMETER, similar to that found for the CMCC-INGV SPS, may apply to ENSEMBLES as a whole.

4. Extratropics

The forecast quality outside the tropics (Fig. 1b reports the results for the northern midlatitudes, 30°–75°N) tends to decrease in comparison with the intertropical regions (Fig. 1a; see also Figs. 4–7 for the 1 May and 1 August start date global maps). In particular, for both ENSEMBLES and DEMETER, a considerable reduction in sharpness, discrimination, ResSS, and RelSS is shown. Despite the generally reduced performance with respect to the tropics, ENSEMBLES nonetheless improves in the northern midlatitudes compared to DEMETER. Considering the northern midlatitude band as a whole, a significant (5% level) enhancement of discrimination, reliability, and resolution is observed in ENSEMBLES for the February, May, and November starting dates (Fig. 1b; see also the reliability and discrimination diagrams for May and August reported in Figs. 9 and 10, respectively). In contrast, the southern midlatitudes (30°–75°S) are little affected, with some improvements in ENSEMBLES for the May starting dates but a worsening for August compared to DEMETER (not shown).

Focusing on the northern midlatitudes, the forecasts started in February are widely affected, with improvements for both E− and E+ (Fig. 1b). The May start dates significantly improve the predictions mostly for E− (Figs. 1b and 9) whereas, conversely, in November the forecasts improve only for above-normal conditions (E+; Fig. 1b). Remarkably, the August starting dates show a significant difference only for E+ reliability, for which DEMETER appears to outperform ENSEMBLES (Figs. 1b and 10). Table 6 summarizes the comparison between ENSEMBLES and DEMETER over the Euro-Atlantic (35°–65°N, 80°W–40°E) and the Pacific–North American (PNA; 40°–65°N, 150°E–60°W) regions. ENSEMBLES significantly improves the reliability, resolution, and discrimination for the February and May (E−) starting dates in both the Euro-Atlantic and PNA domains. The largest enhancements are found over the Euro-Atlantic region, where the E+ November forecasts and the reliability of the August predictions (both E− and E+) are also significantly improved compared to DEMETER. Note, however, that the reliability of the August forecasts improves over the Euro-Atlantic region in ENSEMBLES mostly to the north of 50°N over the Atlantic Ocean and to the east of 20°E over the continent; south of 50°N over the ocean and over western Europe, a reduction in reliability compared with DEMETER is found (see Figs. 7e,f). In contrast, the PNA winter prediction shows better skill in DEMETER in terms of sharpness and discrimination (5% significance verified for both E− and E+). The PNA reliability and resolution also appear to be significantly worse in ENSEMBLES.

As reported in Table 4, land areas over the northern midlatitudes tend to improve in terms of sharpness, resolution, and discrimination by amounts very similar to the oceans for the February, May, and November start dates. However, as in the tropics, the skill over the ocean is better than over land. The only exception is reliability, which over land is higher than over the ocean for the 1 February, 1 May, and 1 November forecasts; for the same start dates, ENSEMBLES shows a significant overall enhancement compared to DEMETER (Fig. 1b). The fact that ENSEMBLES improves the northern-latitude calibration and performance for boreal winter, spring, and early summer only (Fig. 1b) suggests that the enhancements might be in part related to the better representation of sea ice dynamics and of the coupled atmosphere–ocean–land–ice variability compared to DEMETER (see section 2b; Weisheimer et al. 2009; Wang et al. 2009). In fact, given the relatively reduced extent of the northern polar ice cap during late summer and early autumn (e.g., Timmermann et al. 2005), the forecasts started in August are expected to be less influenced by sea ice.

The relationship over the northern midlatitudes between the spatial distribution of probabilistic forecast quality (Figs. 4–7) and the spatial distribution of TC (Fig. 8) differs from that in the tropics (see section 3). The spatial correlation of TC with discrimination reduces to 0.17 (vs 0.44 over the tropics; the reported correlations are averages over E− and E+ and over the four starting dates), with resolution to 0.15 (vs 0.36), and with sharpness to 0.20 (vs 0.35). Furthermore, we find a slightly negative spatial correlation between TC and reliability (−0.09, whereas over the tropics it was 0.20). In fact, comparing Figs. 7a,c (Figs. 7b,d) with Figs. 8a,c (Figs. 8b,d) shows that relatively low (high) TC tends to correspond to relatively high (low) reliability. Similarly, TC increases and decreases in ENSEMBLES (Figs. 8e,f) are in some cases associated with reliability reductions and increases, respectively (Figs. 7e,f). This suggests that reliability increases (reductions) compared to DEMETER may, in a few areas, be more related to a decreased (increased) signal-to-noise ratio (note that perfect reliability can be attained by always issuing forecasts corresponding to the climatological distribution) than to genuinely higher (lower) skill.

5. Multimodel versus averaged single-models skill

It is not possible to isolate specific reasons for the skill differences between ENSEMBLES and DEMETER illustrated in the previous sections, because the coupled model systems have undergone a number of changes to their complexity, physics, resolution, and initialization (section 2b; see also Weisheimer et al. 2009), and no experiments are available to test all these changes separately. However, as previously discussed, the performance of an MME comes from the skill of the contributing models, and MMEs tend to amplify the performance nonlinearly in comparison with the single-model average. Specifically, MMEs gain more skill compared to the average of the contributing SMEs with increasing mutual independence of the SMEs and with increasing overconfidence of the SMEs (Hagedorn et al. 2005; Kang and Yoo 2006; Weigel et al. 2008). In this section, we evaluate the MME performance amplification with respect to the average of the SMEs (hereinafter aveSME) and try to gain some insight into the degree of independence and overconfidence of the SMEs composing the ENSEMBLES MME. To this aim, we compare ENSEMBLES and DEMETER in terms of both the aveSME skill and the difference between the MMEs and their respective aveSME performances.

As reported in Fig. 11a, the aveSME performance over the tropics is enhanced in ENSEMBLES mostly in terms of discrimination. In fact, with the exception of the forecasts started on 1 February for E+, the discrimination distance increases significantly in all cases (Fig. 11a). Furthermore, improvements in resolution are also observed in the 1 May forecasts for both E+ and E− and in the 1 August and 1 November forecasts for E−. Over the northern midlatitudes (Fig. 11b), ENSEMBLES significantly improves the aveSME discrimination and resolution of the 1 February and 1 May forecasts. Resolution is also improved to some extent for the forecasts started in August and November (significance verified only for E+). Interestingly, the forecasts started in February and May also show significant improvements of the aveSME reliability in ENSEMBLES.

The differences between the MME and the respective aveSME performances are reported in Fig. 12. As already discussed, multimodels act by reducing the overconfidence of the single models (i.e., by widening the ensemble spread while possibly moving the ensemble mean toward truth). The increase of the reliability of the MMEs compared to the aveSME values in Fig. 12 documents the effectiveness of the MMEs in reducing the SMEs' overconfidence. ENSEMBLES shows a tendency to increase the reliability less, suggesting in some cases a reduced independence of the contributing SMEs in comparison with DEMETER. In particular, in both the tropics (Fig. 12a) and the northern midlatitudes (Fig. 12b), the 1 August starting dates display little reliability difference in ENSEMBLES, indicating a small degree of independence of the SMEs for this start date. Figure 11 shows that the ENSEMBLES aveSME has better reliability and resolution for August, while in the MMEs the same quality attributes tend to be better in DEMETER (Fig. 1). This clearly indicates that the single-model forecasts initiated on 1 August are more dependent on each other in ENSEMBLES than in DEMETER.

Considering the tropics, the ENSEMBLES MME displays a smaller reduction of sharpness relative to its aveSME than DEMETER does (Fig. 12a). This indicates that an increased portion of the ENSEMBLES SME signal is retained and eventually nonlinearly amplified in the MME. The enhanced signal in ENSEMBLES increases the potential predictability and may eventually contribute to the effective skill of the ENSEMBLES MME, depending on whether or not the signal increase coincides with a better correspondence to observations. This, of course, depends on the degree of independence of the SMEs, which allows for error reductions and the consequent skill amplification (Hagedorn et al. 2005). The comparison between ENSEMBLES and DEMETER in Fig. 12a shows that the association of the relative differences in sharpness with those in discrimination/resolution indeed confirms that the signal increase in the ENSEMBLES MME effectively enhances the forecast skill for the 1 February, 1 May, and 1 November start dates. On the contrary, for August the relatively small reduction in sharpness in ENSEMBLES is not accompanied by a corresponding increase in discrimination and/or resolution with respect to DEMETER. In this case, as previously argued from the reliability and resolution differences, some reduction in the independence of the SMEs contributing to ENSEMBLES is apparent.

The large signal reduction in the DEMETER MME compared to its aveSME appears to be the main cause of the difference in sharpness over the tropics between the ENSEMBLES and DEMETER MMEs documented in section 3 (see Fig. 1). In fact, the aveSME sharpness displays only very small differences between DEMETER and ENSEMBLES (Fig. 11a). This suggests that in ENSEMBLES a new source of signal may be acting to consistently reduce the errors and increase the skill contribution of each SME. A possible candidate for this source of signal is the wider use, in ENSEMBLES, of subsurface assimilation for ocean initialization (see section 2b). In fact, several studies have shown that ocean assimilation can contribute to obtaining skillful forecasts (e.g., Rosati et al. 1997; Alves et al. 2004; Balmaseda et al. 2007; Alessandri et al. 2010). In particular, Alessandri et al. (2010) studied the effect of subsurface assimilation in the CMCC-INGV SPS and found impacts quite similar to those observed here: an increased signal-to-noise ratio and enhanced discrimination of the surface temperature forecasts over the tropics as a consequence of using subsurface ocean assimilation for initialization.

Over the northern midlatitudes the aveSME reliability in ENSEMBLES is relatively high compared to DEMETER, and the increase is significant for the February and May (E+) starting dates (Fig. 11). This corresponds to a reduced overconfidence of the SMEs contributing to ENSEMBLES compared to those participating in DEMETER. Consistent with the findings of Weigel et al. (2008), this explains the reduced capability of the ENSEMBLES MME to enhance the 1 February and 1 May forecasts with respect to the aveSME performance.

6. Potential economic value of the forecasts

Figure 13 shows the Value of the forecasts over the tropics as a function of the cost-loss ratio (C/L) for both E− (left panels) and E+ (right panels). Results are reported for the DEMETER and ENSEMBLES forecasts calibrated through the reliability as well as the discrimination attribute (hereinafter ENSEMBLES_REL, ENSEMBLES_DISC, DEMETER_REL, and DEMETER_DISC). Here, the calibration coefficients are computed in a "one-year-out cross-validation mode" (Wilks 2006): for each year being verified, the target year is excluded from the computation of the coefficients.
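
A minimal sketch of this one-year-out cross-validation is given below; the `fit` and `apply_coeffs` callables are placeholders for whichever reliability- or discrimination-based calibration scheme is used, and their signatures are assumptions for illustration:

```python
import numpy as np

def one_year_out_calibration(probs, obs, years, fit, apply_coeffs):
    """Calibrate each year's forecasts with coefficients estimated from
    all the other years (one-year-out cross validation)."""
    probs, obs, years = map(np.asarray, (probs, obs, years))
    calibrated = np.empty(probs.shape, dtype=float)
    for year in np.unique(years):
        held_out = years == year
        coeffs = fit(probs[~held_out], obs[~held_out])   # target year excluded
        calibrated[held_out] = apply_coeffs(probs[held_out], coeffs)
    return calibrated
```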

Overall, ENSEMBLES displays higher Values than DEMETER for February (Figs. 13a,b). Even larger increases are found for the May (Figs. 13c,d) and November start dates (Figs. 13g,h). In contrast, consistent with the analysis in sections 3 and 5, the Values for August are affected very little, and both ENSEMBLES_DISC and ENSEMBLES_REL display Values very close to their DEMETER counterparts (Figs. 13e,f). These results are summarized in Table 7, which reports the Values averaged over the whole C/L range together with those of the uncalibrated MMEs. Table 7 shows that the averaged Values of the forecasts are always positive over the tropics. It also shows that the Value of the ENSEMBLES forecasts over the tropics consistently outperforms DEMETER only after the calibration procedures are applied. In fact, the Value of the uncalibrated ENSEMBLES predictions is lower than that of uncalibrated DEMETER for the August (E− and E+), February (E+), and November (E−) starting dates. For the same starting dates, section 3 reported a worsened reliability for ENSEMBLES, in itself indicating an increased need for calibration. Interestingly, Table 7 (see also Fig. 13) makes evident that, compared with the uncalibrated predictions, the ENSEMBLES forecasts calibrated through discrimination systematically add more Value than the forecasts calibrated through the reliability attribute.
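
For readers wishing to reproduce curves of this kind, the sketch below computes the cost-loss Value from the hit rate, false-alarm rate, and climatological frequency defined in Table 1, following the standard model of Richardson (2000). Thresholding the probability forecast at p >= C/L is one common convention and is an assumption here, not necessarily the exact procedure used in the paper:

```python
import numpy as np

def potential_value(hit_rate, false_alarm_rate, obar, cl):
    """Potential economic value for a user with cost-loss ratio `cl`;
    `obar` is the climatological event frequency. Mean expenses are
    expressed per unit loss."""
    e_climate = np.minimum(cl, obar)              # best climatology-only action
    e_forecast = (hit_rate * obar + false_alarm_rate * (1.0 - obar)) * cl \
        + (1.0 - hit_rate) * obar                 # protect on "yes"; lose on misses
    e_perfect = obar * cl                         # protect only when event occurs
    return (e_climate - e_forecast) / (e_climate - e_perfect)

def value_curve(probs, obs, cl_values):
    """Sweep C/L, acting whenever the forecast probability reaches C/L,
    to trace curves like those of Figs. 13 and 14."""
    probs, obs = np.asarray(probs), np.asarray(obs, bool)
    obar = obs.mean()
    values = []
    for cl in cl_values:
        act = probs >= cl
        hit = (act & obs).sum() / max(obs.sum(), 1)
        far = (act & ~obs).sum() / max((~obs).sum(), 1)
        values.append(potential_value(hit, far, obar, cl))
    return np.array(values)
```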

The Values of the calibrated forecasts over the northern midlatitudes (Fig. 14) are considerably smaller than over the tropics (Fig. 13). The maximum Value (obtained for users with C/L = o) ranges between 0.1 and 0.2 of the Value attainable from perfect forecasts, a considerable reduction with respect to the tropics, where the maximum Values are between 0.3 and 0.4. However, the Values averaged over the whole C/L range, reported in Table 8, are still always positive over the northern midlatitudes once a calibration is applied. The comparison with DEMETER shows noticeable enhancements of ENSEMBLES in the northern midlatitudes for the February (Figs. 14a,b), May (Figs. 14c,d), and November (Figs. 14g,h) starting dates. Again, for August the two MMEs show very similar results. For the forecasts started in February, May, and November, ENSEMBLES_DISC outperforms all the others: with respect to DEMETER_DISC, it increases the averaged Value by at least 15% and by up to 200% (Table 8). By comparison, the reliability-calibrated forecasts improve less in ENSEMBLES, and in two cases ENSEMBLES_REL even displays a lower averaged Value than DEMETER_REL (Table 8).

7. Conclusions

The new ENSEMBLES multimodel ensemble (MME) seasonal prediction system systematically enhances the sharpness and the discrimination over the tropical band for the prediction of anomalous seasonal temperature events (verified at the 5% significance level). Improvements in the resolution attribute are also reported in the tropics, whereas ENSEMBLES shows a significant worsening of the tropical reliability.

ENSEMBLES increases the prediction signal over the tropics compared to DEMETER, so that the forecasts tend to fall preferentially outside normal conditions, enhancing the predictions' confidence. This result confirms the reduction, in ENSEMBLES, of the tendency toward a too-small signal-to-noise ratio that characterizes DEMETER (Weisheimer et al. 2009). The concurrent increase in discrimination distance, illustrated in this work, indicates that the enhanced ENSEMBLES signal indeed contributes to an improved capability to discriminate among the considered dichotomous events, leading to a better correspondence to observations. We showed that the enhanced discrimination of above- and below-normal conditions by the probabilistic predictions is spatially connected to the improvement of the MME mean "deterministic" forecasts of the interannual anomalies, in particular over the tropical Pacific.

It is impossible to isolate specific reasons for the improvements reported here in ENSEMBLES, because the coupled model systems have undergone a number of changes in complexity, physics, resolution, and initialization, and experiments testing these changes separately are not available. However, we tried to gain some insight into the basic differences by comparing ENSEMBLES and DEMETER in terms of both the averaged skill of the contributing single-model ensembles (SMEs) and the difference between the MMEs and their averaged SME performances. Compared to DEMETER, a larger portion of the SMEs' signal actually contributes to the ENSEMBLES potential predictability over the tropics for all the considered start dates. The enhanced signal is shown to considerably amplify the discrimination as well as the resolution of the ENSEMBLES forecasts started 1 February, 1 May, and 1 November. Notably, the ENSEMBLES MME forecasts started in August, even though potentially less affected by the "spring predictability barrier," show a reduced capability to amplify the skill coming from the SMEs compared to the other starting dates. It is suggested that the larger bias over the tropical Pacific for the forecasts started 1 August may contribute to limiting the performance improvements for this start date. In fact, even though ENSEMBLES reduces the bias for the 1 August start date compared to DEMETER, the RMSE over the Niño-3 region is still more than twice that of the 1 February and 1 May forecasts and exceeds that of the 1 November forecasts by about two-thirds. As shown by several studies (e.g., Lee et al. 2010), the ability of dynamical models to simulate the SST mean state closely affects the prediction skill for the interannual SST anomalies. Furthermore, for August, we showed that the single-model forecasts are more dependent on each other in ENSEMBLES than in DEMETER: for this start date, the ENSEMBLES SME average has better reliability and resolution, whereas in the MMEs the same quality attributes tend to be better in DEMETER. The increased signal of the ENSEMBLES MME compared to DEMETER is suggested to follow, at least in part, from the wider use of ocean subsurface assimilation. By improving the initialization of the ocean subsurface, assimilation is expected to provide the SMEs contributing to ENSEMBLES with a consistent source of signal that, once combined in the MME, possibly acts to increase confidence, reduce errors, and increase skill.

The tropical Atlantic behaves differently from the other tropical ocean sectors. Over this basin, ENSEMBLES shows a worsening of the forecasts started in August and November: for these starting dates, the increased signal of the predictions leads, on average, to a worse match with the corresponding observations. Again, this is suggested to be related to the increased use of subsurface assimilation in ENSEMBLES. In fact, several previous studies have documented a tendency toward degraded prediction skill over the Atlantic when assimilation is used (e.g., Alessandri et al. 2010).

The ENSEMBLES predictions over the extratropics are also noticeably improved compared to DEMETER. In particular, ENSEMBLES significantly (5% level) improves the discrimination, resolution, and reliability in the northern midlatitudes for the February, May, and November (above-normal conditions only) starting dates. The largest improvements are found in the Euro-Atlantic sector, whereas the Pacific–North American domain shows mixed results, especially for the prediction of boreal winters. Despite these improvements, the ENSEMBLES MME shows a reduced capability to amplify the performance with respect to the contributing SMEs for the northern-midlatitude forecasts started in February, May, and August. For February and May, this follows at least in part from the reduced overconfidence of the contributing SMEs compared with their DEMETER counterparts. For the August starting dates, similar to what is found over the tropics, a marked reduction in the independence of the SMEs participating in the ENSEMBLES MME compared to those contributing to DEMETER is shown. On the other hand, the fact that ENSEMBLES improves the northern-midlatitude calibration and performance only for boreal winter, spring, and early summer suggests that the enhancement might be partly related to its better treatment of sea ice dynamics compared to DEMETER. In fact, given the relatively reduced extent of the northern polar ice cap during late summer and early autumn (e.g., Timmermann et al. 2005), the forecasts started in August are expected to be less influenced.

Our results show that the skill in the probabilistic forecast quality attributes tends to be concentrated over the tropical Pacific, confirming that much of the skill of present dynamical seasonal climate forecasts comes from their ability to predict ENSO (e.g., Weisheimer et al. 2009). From the tropical Pacific the skill tends to radiate toward the whole tropical belt and the extratropics, although the largest performance tends to remain confined over the ocean. Nevertheless, our analysis shows that ENSEMBLES enhances the skill over land areas compared to DEMETER by amounts comparable to the ocean improvements. In particular, sharpness, discrimination, and resolution are significantly enhanced over land, following the improvements over the ocean.

Dynamical weather and climate prediction is challenging: progress over the last 30 years has come not from drastic breakthroughs but from slow, incremental advances and a great deal of hard work (Shukla and Kinter 2006). This study has evidenced significant progress in ENSEMBLES over both the tropics and the northern midlatitudes. Furthermore, the fact that much of the improvement in ENSEMBLES is found for the forecasts started in winter and spring indicates that the new MME partially mitigates the limit to predictability associated with the "spring predictability barrier." However, the improvements are small, and the MME performance appears primarily limited by the modest skill of the contributing single models. Increased effort is needed to improve the models as well as the coupled-model initialization. In particular, there are expectations for improvements over land (Alessandri and Navarra 2008; Wang et al. 2009; Koster et al. 2010), where performance tends to be lower than over ocean areas. Land initialization applied to seasonal prediction appears to be at a relatively early stage of development (Wang et al. 2009). It follows that enhanced initialization strategies and a proper representation of the uncertainty in the initial conditions might considerably improve probabilistic forecasts over land in the coming years. However, as suggested by Shukla et al. (2009), it is possible that fundamental enhancements can only be achieved with substantially higher-resolution models than are currently available.

We believe this paper provides important information to all users and potential users of seasonal predictions. First, we demonstrate that they can obtain positive potential economic value (i.e., Value) from the newly developed ENSEMBLES dataset: after calibration, whether through discrimination or reliability, the averaged Value of the forecasts is always positive over both the tropics and the northern midlatitudes. Furthermore, we show that the Value of the ENSEMBLES forecasts started on 1 February, 1 May, and 1 November increases compared to DEMETER over both the tropics and the northern extratropics, provided the forecasts are suitably calibrated. In this regard, we caution that over the tropics ENSEMBLES consistently outperforms DEMETER only after calibration is applied. The comparison of the calibration strategies applied to both the DEMETER and ENSEMBLES MMEs shows that the ENSEMBLES forecasts calibrated through discrimination outperform all other choices: for the February, May, and November starting dates, ENSEMBLES calibrated through discrimination displays the largest Values over both the tropics and the northern midlatitudes. For the forecasts started on 1 August, ENSEMBLES and DEMETER show very similar Values, and no clear advantage of using one MME system instead of the other could be evidenced.

Acknowledgments

Thanks to the CMCC staff, and special thanks to P. Di Pietro, S. Masina, B. Fogli, A. Cherchi, F. J. Doblas-Reyes, and N. Keenlyside for help and inspiring discussions. We are grateful to the anonymous reviewers, whose comments greatly improved the quality of the manuscript. The data were produced with funding from the EU FP6 Integrated Project ENSEMBLES (Contract 505539) and made available through the ECMWF central archive and the ENSEMBLES public data dissemination. This work was partially supported by the project "Climate Change Assessment in Small Pacific Islands States," funded by the Italian Ministry for Environment, Land, and Sea.

REFERENCES

  • Alessandri, A., and A. Navarra, 2008: On the coupling between vegetation and rainfall inter-annual anomalies: Possible contributions to seasonal rainfall predictability over land areas. Geophys. Res. Lett., 35, L02718, doi:10.1029/2007GL032415.

  • Alessandri, A., A. Borrelli, S. Masina, P. Di Pietro, A. Carril, A. Cherchi, S. Gualdi, and A. Navarra, 2010: The INGV-CMCC seasonal prediction system: Improved ocean initial conditions. Mon. Wea. Rev., 138, 2930–2952.

  • Alves, O., M. Balmaseda, D. Anderson, and T. Stockdale, 2004: Sensitivity of dynamical seasonal forecasts to ocean initial conditions. Quart. J. Roy. Meteor. Soc., 130, 647–667.

  • Balmaseda, M., D. Anderson, and A. Vidard, 2007: Impact of ARGO on analyses of the global ocean. Geophys. Res. Lett., 34, L16605, doi:10.1029/2007GL030452.

  • Doblas-Reyes, F., R. Hagedorn, and T. Palmer, 2005: The rationale behind the success of multi-model ensembles in seasonal forecasting. II. Calibration and combination. Tellus, 57A, 234–252.

  • Hagedorn, R., F. J. Doblas-Reyes, and T. N. Palmer, 2005: The rationale behind the success of multi-model ensembles in seasonal forecasting. I. Basic concept. Tellus, 57A, 219–233.

  • Ingleby, B., and M. Huddleston, 2007: Quality control of ocean temperature and salinity profiles: Historical and real-time data. J. Mar. Syst., 65, 158–175.

  • Kang, I.-S., and J. H. Yoo, 2006: Examination of multi-model ensemble seasonal prediction methods using a simple climate system. Climate Dyn., 26, 285–294, doi:10.1007/s00382-005-0074-8.

  • Koster, R. D., and Coauthors, 2010: Contribution of land surface initialization to subseasonal forecast skill: First results from a multi-model experiment. Geophys. Res. Lett., 37, L02402, doi:10.1029/2009GL041677.

  • Krishnamurti, T., C. Kishtawal, T. LaRow, D. Bachiochi, Z. Zhang, C. E. Williford, S. Gadgil, and S. Surendran, 1999: Improved weather and seasonal climate forecasts from multimodel superensemble. Science, 285, 1548–1550.

  • Lee, J.-Y., and Coauthors, 2010: How are seasonal prediction skills related to models’ performance on mean state and annual cycle? Climate Dyn., 35, 267–283, doi:10.1007/s00382-010-0857-4.

  • Murphy, A., 1993: What is a good forecast? An essay on the nature of goodness in weather forecasting. Wea. Forecasting, 8, 281–293.

  • Palmer, T., and Coauthors, 2004: Development of a European Multimodel Ensemble System for Seasonal to Interannual Prediction (DEMETER). Bull. Amer. Meteor. Soc., 85, 853–872.

  • Park, Y., R. Buizza, and M. Leutbecher, 2008: TIGGE: Preliminary results on comparing and combining ensembles. Quart. J. Roy. Meteor. Soc., 134, 2029–2050.

  • Rajagopalan, B., U. Lall, and S. Zebiak, 2002: Categorical climate forecasts through regularization and optimal combination of multiple GCM ensembles. Mon. Wea. Rev., 130, 1792–1811.

  • Richardson, D., 2000: Skill and relative economic value of the ECMWF ensemble prediction system. Quart. J. Roy. Meteor. Soc., 126, 649–668.

  • Richardson, D., 2003: Economic value and skill. Forecast Verification: A Practitioner’s Guide in Atmospheric Science, I. Jolliffe and D. Stephenson, Eds., Wiley, 165–187.

  • Richardson, D., 2006: Predictability and economic value. Predictability of Weather and Climate, T. Palmer and R. Hagedorn, Eds., Cambridge University Press, 628–644.

  • Rosati, A., K. Miyakoda, and R. Gudgel, 1997: The impact of ocean initial conditions on ENSO forecasting with a coupled model. Mon. Wea. Rev., 125, 754–772.

  • Shukla, J., and J. C. Kinter, 2006: Predictability of seasonal climate variations: A pedagogical review. Predictability of Weather and Climate, T. Palmer and R. Hagedorn, Eds., Cambridge University Press, 306–341.

  • Shukla, J., R. Hagedorn, B. Hoskins, J. Kinter, J. Marotzke, M. Miller, T. N. Palmer, and J. Slingo, 2009: Revolution in climate prediction is both necessary and possible: A declaration at the world modeling summit for climate prediction. Bull. Amer. Meteor. Soc., 90, 16–19.

  • Timmermann, R., H. Goosse, G. Madec, T. Fichefet, C. Ethe, and V. Duliere, 2005: On the representation of high latitude processes in the ORCA-LIM global coupled sea ice–ocean model. Ocean Modell., 8, 175–201.

  • Toth, Z., O. Talagrand, and Y. Zhu, 2006: The attributes of forecast systems: A general framework for the evaluation and calibration of weather forecasts. Predictability of Weather and Climate, T. Palmer and R. Hagedorn, Eds., Cambridge University Press, 584–595.

  • Uppala, S., and Coauthors, 2005: The ERA-40 re-analysis. Quart. J. Roy. Meteor. Soc., 131, 2961–3012.

  • Wang, B., and Coauthors, 2009: Advance and prospectus of seasonal prediction: Assessment of the APCC/CliPAS 14-model ensemble retrospective seasonal prediction (1980–2004). Climate Dyn., 33, 93–117.

  • Weigel, A., M. Liniger, and C. Appenzeller, 2008: Can multi-model combination really enhance the prediction skill of probabilistic ensemble forecasts? Quart. J. Roy. Meteor. Soc., 134, 241–260.

  • Weisheimer, A., and Coauthors, 2009: ENSEMBLES: A new multi-model ensemble for seasonal-to-annual predictions—Skill and progress beyond DEMETER in forecasting tropical Pacific SSTs. Geophys. Res. Lett., 36, L21711, doi:10.1029/2009GL040896.

  • Wilks, D., 2006: Statistical Methods in the Atmospheric Sciences. 2nd ed. Academic Press, 630 pp.

Fig. 1. Sharpness, discrimination distance, RelSS, and ResSS over the (a) tropics (25°S–25°N) and (b) northern midlatitudes (30°–75°N) in the ENSEMBLES (dark) and DEMETER (light) multimodels. The forecast events are surface air temperature falling below the lower (left bars) and above the upper (right bars) climatology terciles. Land and ocean grid points of forecast months 2–4 for all four start dates of the period 1980–2001 are considered. Asterisks indicate that the value is significantly enhanced compared to the other experiment (5% level; Monte Carlo method).

Fig. 2. (a),(b) Discrimination diagrams and (c),(d) reliability diagrams for the forecasts started in May. Each grid point of the months 2–4 seasonal-mean surface air temperature anomaly forecasts over the tropics (25°S–25°N, 0°–360°) is used, with ERA-40 data taken as the reference. (a),(c) The dichotomous event of temperature falling below the lower tercile of the sample climatological distribution; (b),(d) the case of temperature exceeding the upper tercile. The histograms in the reliability diagrams indicate the refinement distribution. DEMETER is in red and ENSEMBLES in blue. Discrimination distance, RelSS, and ResSS values are reported, and an asterisk indicates that the value is significantly higher compared to the other experiment at the 5% level (Monte Carlo method).

Fig. 3. As in Fig. 2, but for August starting dates.

Fig. 4. Spatial distribution of the sharpness attribute of the probabilistic forecasts for above-normal surface air temperature conditions. (a),(b) ENSEMBLES. (c),(d) DEMETER. (e),(f) ENSEMBLES minus DEMETER difference in sharpness; shaded are the areas of increase (red) and decrease (blue) in ENSEMBLES that passed a significance test at the 10% level. (a),(c),(e) 1 May and (b),(d),(f) 1 Aug start dates.

Fig. 5. As in Fig. 4, but for discrimination.

Fig. 6. As in Fig. 4, but for resolution.

Fig. 7. As in Fig. 4, but for reliability.

Fig. 8. Ensemble mean forecasts vs ERA-40 surface air temperature anomalies: point-by-point correlations of months 2–4 of the predictions. ENSEMBLES forecasts with starting dates (a) 1 May and (b) 1 Aug. (c),(d) As in (a),(b), but for DEMETER forecasts. (e),(f) ENSEMBLES minus DEMETER difference in correlations, for May and August respectively (contour interval of 0.2). Shaded are the areas of increase (red) and decrease (blue) in ENSEMBLES that passed a significance test at the 10% level.

Fig. 9. As in Fig. 2, but for the northern midlatitudes (30°–75°N, 0°–360°).

Fig. 10. As in Fig. 2, but for the northern midlatitudes (30°–75°N, 0°–360°) and August starting dates.

Fig. 11. As in Fig. 1, but for the average of the single-model performances.

Fig. 12. Multimodel minus single-model averaged performances: sharpness, discrimination distance, RelSS, and ResSS over the (a) tropics (25°S–25°N) and (b) northern midlatitudes (30°–75°N) in ENSEMBLES (dark) and DEMETER (light). The forecast events are surface air temperature falling below the lower (left bars) and above the upper (right bars) climatology terciles. Land and ocean grid points of forecast months 2–4 for all four start dates of the period 1980–2001 are considered.

Fig. 13. Potential economic value of the ENSEMBLES and DEMETER forecasts over the tropics (25°S–25°N) as a function of the C/L ratio for the prediction of surface temperature being (left) below the lower tercile and (right) above the upper tercile of the sample climatology. Forecasts calibrated using the discrimination information for ENSEMBLES (solid) and DEMETER (dashed), and using the reliability information for ENSEMBLES (dash–dots) and DEMETER (dots). (a),(b) 1 Feb; (c),(d) 1 May; (e),(f) 1 Aug; (g),(h) 1 Nov starting dates.

Fig. 14. As in Fig. 13, but for the northern midlatitudes (30°–75°N, 0°–360°).

Table 1. Contingency table for a deterministic forecast of a specified event over a set of cases, showing the fraction of occasions for each combination of forecast and outcome. The corresponding costs C and losses L associated with the different actions and outcomes in the cost-loss model are also reported. Note that a + c = o and that b + d = 1 − o.

Table 2. Institutions contributing to the new ENSEMBLES multimodel. Note that CMCC-INGV and IFM-GEOMAR are evolutions of the respective systems developed during DEMETER at the Istituto Nazionale di Geofisica e Vulcanologia (INGV) and at the Max-Planck-Institut für Meteorologie (MPI).

Table 3. Overview of the seasonal prediction systems contributing to the ENSEMBLES and DEMETER multimodel ensembles. See also Weisheimer et al. (2009) and Palmer et al. (2004) for further details.

Table 4. DEMETER (DEM) vs ENSEMBLES (ENS) discrimination distance (d), RelSS, ResSS, and sharpness (Sh) for land areas only and ocean areas only. The probability forecasts for surface air temperature above the upper tercile (E+) and below the lower tercile (E−) of the sample climatological distributions are reported. Bold values are significantly increased compared to the other experiment at the 5% significance level.

Table 5. As in Table 4, but for the tropical Pacific, tropical Indian, and tropical Atlantic Oceans only.

Table 6. As in Table 4, but for the Pacific–North American (PNA) and Euro-Atlantic regions.

Table 7. Potential economic value of the ENSEMBLES and DEMETER forecasts over the tropics (25°S–25°N) averaged over the whole C/L range. Uncalibrated forecasts as well as forecasts calibrated through reliability and through discrimination are reported, for surface air temperature above the upper (E+) and below the lower (E−) climatology terciles.

Table 8. As in Table 7, but for the northern midlatitudes (30°–75°N, 0°–360°).