This study describes an attempt to overcome the underdispersive nature of single-model ensembles (SMEs). As an Indo–U.S. collaboration designed to improve the prediction capabilities of models over the Indian monsoon region, the Climate Forecast System (CFS) model framework, developed at the National Centers for Environmental Prediction (NCEP-CFSv2), is selected. This article describes a multimodel ensemble prediction system, using a suite of different variants of the CFSv2 model to increase the spread without relying on very different codes or potentially inferior models. The SMEs are generated not only by perturbing the initial condition, but also by using different resolutions, parameters, and coupling configurations of the same model (CFS and its atmosphere component, the Global Forecast System). Each of these configurations was created to address the role of different physical mechanisms known to influence error growth on the 10–20-day time scale. Last, the multimodel consensus forecast is developed, which includes ensemble-based uncertainty estimates. Statistical skill of this CFS-based Grand Ensemble Prediction System (CGEPS) is better than the best participating SME configuration, because increased ensemble spread reduces overconfidence errors.
Accurate predictions of weather phenomena by numerical models are limited by the uncertainties in the initial atmospheric, oceanic (sea surface temperature and sea ice), land surface, and soil moisture conditions along with inaccurate representations of true physical processes. Uncertainties in the initial conditions (ICs) are bracketed by creating IC ensembles, using various methods as surveyed in Buizza and Palmer (1995), for example. One simple method is to slightly perturb the original atmospheric or oceanic states, as described in Abhilash et al. (2013, 2014a). Uncertainties arising from model imperfections are more challenging to represent. In an a posteriori approach, ensemble predictions from different models are pooled to produce a final forecast probability distribution. In any multimodel approach, the independent skills of the participating models are combined in a judicious manner to reinforce the total skill of the multimodel ensemble (MME) mean (or other statistics). One example of such an MME combination approach using historically trained regression (i.e., linear combinations of the inputs) is documented in Krishnamurti et al. (1999) and Weigel et al. (2008). Such a multimodel superensemble forecast provides a better deterministic forecast than any individual model or any of its single-model ensemble (SME) subsets (Weigel et al. 2008). However, such a “best estimate” use of superensembles does not address statistical skill issues like the spread–error relationship (SER).
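The historically trained regression combination cited above can be illustrated with a minimal sketch. The pure-Python example below fits least-squares weights for two hypothetical models over a training (hindcast) period and applies them to new forecasts; the function names and the two-model setup are our own illustration of the idea, not the implementation of Krishnamurti et al. (1999).

```python
# Sketch of a regression-trained multimodel combination: weights are
# fitted by least squares on a training period, then applied to new
# forecasts.  Two hypothetical models; the 2x2 normal equations are
# solved with Cramer's rule.  Illustrative only.

def train_weights(m1, m2, obs):
    """Least-squares weights w1, w2 minimizing sum (obs - w1*m1 - w2*m2)**2."""
    a11 = sum(x * x for x in m1)
    a12 = sum(x * y for x, y in zip(m1, m2))
    a22 = sum(y * y for y in m2)
    b1 = sum(x * o for x, o in zip(m1, obs))
    b2 = sum(y * o for y, o in zip(m2, obs))
    det = a11 * a22 - a12 * a12
    return (b1 * a22 - b2 * a12) / det, (a11 * b2 - a12 * b1) / det

# Synthetic training data: "observations" are a known blend of the models,
# so the regression should recover the blending weights.
m1 = [1.0, 2.0, 3.0, 4.0, 5.0]
m2 = [2.0, 1.0, 4.0, 3.0, 6.0]
obs = [0.7 * x + 0.3 * y for x, y in zip(m1, m2)]
w1, w2 = train_weights(m1, m2, obs)

def superensemble(f1, f2):
    """Combine two new forecasts with the trained weights."""
    return w1 * f1 + w2 * f2
```

In practice the regression is trained per grid point and lead time on the hindcast record, but the algebra is the same.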
Forecasts are more useful if uncertainty is quantified. In the extended range (ER) especially, beyond the weather scale (2–3 weeks), a single deterministic rainfall forecast is not sufficient: the user community also should be given probabilistic forecasts that quantify the uncertainty. One chronic problem in numerical ensembles is underdispersion, leading to overconfidence errors in the resulting statistical forecast distributions. Single-model ensemble prediction systems (EPSs) are especially prone to underdispersion, since they do not represent all sources of forecast error (Buizza et al. 2008). Some single-model approaches for producing more-diverse inputs to the ensemble product generator include stochastic postprocessing [e.g., “Bayesian model averaging” method of Raftery et al. (2005)] or stochastic parameterization (Buizza et al. 1999; Palmer 2001). But true MME approaches may better represent physical uncertainties and increase the ensemble spread accordingly. Seeking a robust operational approach for ER monsoon forecasts, this study considers a partial MME approach.
Employing a broad-based MME in real-time forecasts would require reliable, coordinated multi-institutional effort (Sahai et al. 2013; Borah et al. 2015; Prasad et al. 2014). While frameworks like the National Monsoon Mission (http://www.tropmet.res.in/monsoon/) might foster such collaborations in the long run, the current study reports on an MME derived from multiversion, multiresolution, and two-tier extended-range forecasts computed at one center. Results are based on the National Centers for Environmental Prediction’s (NCEP) Climate Forecast System (CFS), version 2. CFSv2 is a coupled model whose atmosphere component, the Global Forecast System (GFS), has been subsequently improved in ways not yet propagated to the coupled system. The Indian adaptation of the NCEP CFS and its operational reliability for extended-range forecasting have already been reported upon (Sahai et al. 2013; Borah et al. 2015). The goal here is to demonstrate and quantify improvements in the MME forecasts as compared with the CFS-based SME approach (Sahai et al. 2013) in the extended range.
For this study we run three different versions of CFS/GFS to generate three sets of SMEs and then pool them to form the MME, an approach to ensemble generation described by Houtekamer et al. (1996). The versions of CFS/GFS are diverse in resolution as well as in their model physics. Since extended-range skill stems from both slowly varying lower boundary effects and from the model’s memory of the initial atmospheric state, our three component SMEs were devised with both predictability sources in mind. Specifically, the MME includes both ocean-initialized coupled CFS integrations with slightly different atmospheric initial conditions, and “two tier” integrations with stand-alone atmospheric GFS forced by bias-corrected CFS-forecast sea surface temperature (SST), a setup hereinafter called GFSbc. The combined forecast system addresses the prime goal of the National Monsoon Mission, as initiated by the government of India (http://www.tropmet.res.in/National%20Monsoon%20Mission-121-Page), which is to improve the monsoon prediction capabilities of the CFS system.
Although these three SMEs show similar prediction skill and their errors saturate at about the same lead time of around 25 days, there are many instances where the three models disagree in predicting particular events, such as the amplitude and phase of monsoon intraseasonal oscillation (MISO) propagation. Even though the SMEs show similar skill on average, there are years when any one of the configurations is the most or least skillful (Abhilash et al. 2013), motivating this so-called CFS-based Grand Ensemble Prediction System (CGEPS), for ER prediction (ERP) of active/break spells of Indian summer monsoon rainfall. Another important operational reason to combine SMEs into an MME (rather than disseminating each variant separately) is to clarify communication to the users, with a single consensus forecast and associated uncertainties and reliability estimates.
This paper’s main objective is to demonstrate the operational efficacy of CGEPS in the extended-range rainfall forecast during the summer monsoon season. Section 2 describes the model and methodology; section 3 presents the results. Section 3a presents the skill during the hindcast period, section 3b presents the MISO prediction skill and spread–error relationship, and section 3c presents the probabilistic prediction skill. Section 4 gives our conclusions.
2. Model and methodology
We have used the latest version¹ of NCEP’s CFSv2 (Saha et al. 2014). The atmospheric component (GFS) is coupled to an ocean model, a sea ice model, and a land surface model. For its ocean component, the CFSv2 uses the GFDL Modular Ocean Model, version 4p0d (MOM4; Griffies et al. 2004). The initial conditions have been prepared from a coupled data assimilation system (CDAS) with T574L64 resolution atmospheric assimilation and MOM4-based oceanic assimilation, a real-time extension of the CFSR (Saha et al. 2010).
The stand-alone GFS with slightly different physics options is forced with daily bias-corrected forecast SSTs from CFSv2. Bias correction involves subtracting a climatology of bias as a function of calendar day and lead time, with Optimum Interpolation Sea Surface Temperature observations as the reference. We denote this two-tier forecast as GFSbc, with “bc” indicating bias-corrected boundary conditions. For more model and experimental details and the skill of GFSbc, CFST126, and CFST382, see Abhilash et al. (2013, 2014a,b) and Sahai et al. (2015), respectively.
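The lead-dependent SST bias correction described above can be sketched as follows. The structure (a bias climatology keyed by calendar day and lead time, estimated over the hindcast years and subtracted from each new forecast) follows the description in the text; the function names and data layout are our own illustration.

```python
# Sketch of the SST bias correction: the mean forecast-minus-observation
# error, as a function of calendar day and lead time over the hindcast
# years, is subtracted from each new forecast SST.  Illustrative only.

def bias_climatology(hindcasts, observations):
    """hindcasts[(day, lead)]  -> forecast SSTs over the hindcast years;
    observations[(day, lead)] -> matching observed (OISST) values."""
    bias = {}
    for key, fcsts in hindcasts.items():
        errs = [f - o for f, o in zip(fcsts, observations[key])]
        bias[key] = sum(errs) / len(errs)
    return bias

def correct(forecast_sst, day, lead, bias):
    """Apply the day- and lead-dependent correction to one forecast value."""
    return forecast_sst - bias[(day, lead)]

# Toy example: a constant +0.5 K warm bias at (calendar day 137, lead 3).
hind = {(137, 3): [300.5, 301.5, 299.5]}
obsv = {(137, 3): [300.0, 301.0, 299.0]}
b = bias_climatology(hind, obsv)
sst_bc = correct(302.5, 137, 3, b)
```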
Based on performance experience, and aiming to maximize the operational skill for our available computer resources, we have chosen to pool these variants: 11 members of CFST126 (~100 km), 11 members of CFST382 (~38 km), and 21 members of GFSbc. A schematic diagram of the CGEPS generation is shown in Fig. 1. The model integrations run to generate the ensemble members are initiated with updated initial conditions every 5 days from 16 May through 28 September during the summer monsoon season. The forecast consensus is given by averaging the 43 ensemble members. The skill (defined as the correlation between the observed and the forecast rainfall anomalies from a 10-yr hindcast climatology) is evaluated during the hindcast period of 2001–12 summer monsoon seasons. Skill is computed by averaging the forecast on a pentad (5-day mean) scale. For example, if the forecast starts from 16 May IC, the pentad 1 (P1) lead-time forecast corresponds to the forecast averaged over the period 17–21 May, the pentad 2 (P2) lead-time forecast corresponds to the forecast averaged over the period 22–26 May, and so on. The pentad averaging is done since the daily prediction of summer monsoon rainfall by dynamical models is not skillful beyond 10 days.
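The pentad averaging and anomaly-correlation skill just described can be sketched in a few lines; this is a minimal pure-Python illustration (the function names are ours), not the operational scoring code.

```python
# Sketch of the pentad-mean skill computation: daily rainfall is
# averaged into non-overlapping 5-day pentads, and skill is the
# correlation between forecast and observed pentad anomalies.
import math

def pentad_means(daily):
    """Non-overlapping 5-day means; P1 = days 1-5 after the IC, etc."""
    return [sum(daily[i:i + 5]) / 5.0 for i in range(0, len(daily) - 4, 5)]

def correlation(x, y):
    """Pearson correlation between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Ten daily forecast values -> two pentad (P1, P2) means.
daily_fcst = [2.0, 4.0, 6.0, 8.0, 10.0, 1.0, 3.0, 5.0, 7.0, 9.0]
p = pentad_means(daily_fcst)
```

In the actual evaluation the anomalies are taken with respect to the hindcast climatology before the correlation is computed.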
To assess the performance of the method, various skill estimates are calculated for several boxed regions over India. These skill metrics and scores include temporal correlation and root-mean-square error (RMSE) evaluated at the pentad scale, MISO-index prediction skill and its spread–error relationship, the relative operating characteristic (ROC) curve, the area under the ROC curve (AUC), Brier skill score (BSS), and resolution and reliability (detailed in section 3 below). Rainfall data for all scoring are taken from the Tropical Rainfall Measuring Mission (TRMM) gauge merged rainfall dataset (Mitra et al. 2009) obtained from the India Meteorological Department.
Probabilistic forecasts from ensemble members at different lead times are validated with respect to predefined categories (terciles) based on observations. The observed rainfall is classified into above normal (AN), near normal (NN), and below normal (BN) categories using the tercile method: the climatological distribution of rainfall is divided into its lower, middle, and upper thirds, so that each category has an equal climatological probability of 33.3%. To determine the tercile ranges, the observed rainfall values were ranked, and the AN, NN, and BN categories were separated by the values one-third and two-thirds of the way through the ranked list. In a similar manner the three categories are defined for the three SMEs based on the model hindcast climatology (2001–12). Counting the proportion of a model’s ensemble members falling into each tercile then yields the forecast probability for each category. These categorical forecasts are central to the probabilistic prediction of the Indian summer monsoon (ISM). This tercile classification approach is applied over approximately homogeneous regions, selected on the basis of the climatology and variance of rainfall over the Indian landmass. For the monsoon zone of India (MZI), as discussed in Borah et al. (2013), central India (CEI), northeast India (NEI), northwest India (NWI), and south peninsular India (SPI), the percentage departure of rainfall from the long-term pentad climatology is evaluated in the next section.
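The tercile classification and the member-counting step can be sketched as follows. The cut points here come from `statistics.quantiles`; the operational thresholds may be derived slightly differently from the ranked hindcast record, and all names are illustrative.

```python
# Sketch of the tercile classification: boundaries come from the
# climatological sample, and categorical probabilities are the
# fraction of ensemble members falling in each category.
from statistics import quantiles

def tercile_bounds(clim):
    """Cut points at the 1/3 and 2/3 quantiles of the climatology."""
    lower, upper = quantiles(clim, n=3)
    return lower, upper

def category(value, lower, upper):
    if value < lower:
        return "BN"                          # below normal
    return "AN" if value > upper else "NN"   # above / near normal

def forecast_probs(members, lower, upper):
    """Forecast probability of each category from the member counts."""
    n = len(members)
    counts = {"BN": 0, "NN": 0, "AN": 0}
    for m in members:
        counts[category(m, lower, upper)] += 1
    return {k: c / n for k, c in counts.items()}

clim = list(range(1, 10))                    # toy climatology: 1..9
lo, up = tercile_bounds(clim)                # cut points near 3.33 and 6.67
probs = forecast_probs([1, 2, 4, 5, 5, 7, 8, 9, 9], lo, up)
```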
3. Results

The climatological seasonal [June–September (JJAS)] mean bias of the rainfall forecast over the Indian region at P1, P2, P3, and P4 lead times from the three variants of CFS (i.e., CFST126, GFSbc, and CFST382), along with the MME, is shown in Fig. 2 from left to right. The top panel in Fig. 2 shows the observed seasonal (JJAS) mean climatological rainfall. In general, the dry bias over the Indian region increases with lead time. The serious overestimation (wet bias) of rainfall over the Arabian Sea and western equatorial Indian Ocean and the underestimation (dry bias) over the Indian land region are common to most climate models (e.g., Sperber et al. 2013; Kim et al. 2012; Bush et al. 2014). The overestimation of rainfall over the Arabian Sea is reduced in GFSbc when forced with bias-corrected forecast SST. However, the dry bias over the Indian landmass remains an unresolved problem in all the models. Near-zero values off the west coast mark the boundary between the wet and dry biases in model rainfall, which are large along the west coast of India. Figure 2 also shows that, among all variants of CFS, CFST382 has the smallest climatological June–September precipitation biases in the P1–P4 lead forecasts over the Indian land region, followed by the MME. The wet bias over the adjoining oceanic regions is considerably reduced in the MME at all pentad leads. Since the MME is a simple average of all variants, and GFSbc contributes the most ensemble members, the climatological bias in the MME over the oceanic and central Indian regions can be attributed largely to GFSbc. Sahai et al. (2015) show that the drastic reduction of climatological biases in the P1–P4 lead forecasts in CFST382 over the Indian land region is not necessarily reflected in the ERP skill of CFST382.
In contrast, in the MME, the reduction in climatological biases over the Indian land region (though smaller than that of CFST382) has translated into useful ERP skill, as evident from Table 1, which shows the spatial correlations of forecast rainfall at different pentad lead times (P1–P4). The MME retains a large spatial correlation (above 0.85) out to a four-pentad lead time. Further improvements are discussed in the following sections.
a. Hindcast skill
Hindcast skill of the MME from CGEPS is analyzed for pentad-averaged rainfall over five homogeneous regions of India (Fig. 3a). Skill, in terms of the correlation coefficient (CC) between the predicted and observed pentad-mean rainfall series, is computed for the MME and for all individual models over the 2001–12 hindcast period at P1–P4 lead times. In most cases the MME shows the highest CC of all the models (Figs. 3b–f), for all lead pentads and for all homogeneous regions except CEI and the core monsoon zone of India (MZI; Rajeevan et al. 2010), for which GFSbc is comparable to or marginally better than the MME at longer leads (P3 and P4). In summary, the deterministic forecast skill of the MME is appreciably improved relative to all individual models at all pentad leads.
b. MISO prediction skill and spread–error relationship
We compare the prediction skill of large-scale MISO by computing the MISO indices following Suhas et al. (2012) and Borah et al. (2013), which define the eight-phase northward-propagating MISO rainfall anomalies from the Indian Ocean region to the foothills of the Himalayas. This is an empirical orthogonal function (EOF) analysis similar to that of Wheeler and Hendon (2004) except that the EOF analysis is performed on an extended data matrix. Like those authors’ verification measures proposed to compute the Madden–Julian oscillation (MJO) prediction skill, we have also calculated the bivariate CC and RMSE (Lin et al. 2008; Gottschalck et al. 2010; Rashid et al. 2011) between the predicted and observed MISO indices. The MISO indices (MISO1 and MISO2) are the principal components computed by projecting the MME forecast and observation onto the extended EOFs.
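The bivariate verification measures can be written compactly. The sketch below assumes the observed and forecast (MISO1, MISO2) index pairs are already available as sequences at a fixed lead time; it follows the form used for MJO indices in Lin et al. (2008), with names of our own choosing.

```python
# Sketch of the bivariate correlation and RMSE used to verify the
# (MISO1, MISO2) index pair.  a1, a2 = observed indices; f1, f2 =
# forecast indices, all at one fixed lead time.  Illustrative only.
import math

def bivariate_cc(a1, a2, f1, f2):
    num = sum(x * u + y * v for x, y, u, v in zip(a1, a2, f1, f2))
    d1 = math.sqrt(sum(x * x + y * y for x, y in zip(a1, a2)))
    d2 = math.sqrt(sum(u * u + v * v for u, v in zip(f1, f2)))
    return num / (d1 * d2)

def bivariate_rmse(a1, a2, f1, f2):
    n = len(a1)
    return math.sqrt(sum((x - u) ** 2 + (y - v) ** 2
                         for x, y, u, v in zip(a1, a2, f1, f2)) / n)

# Sanity check: a perfect forecast gives CC = 1 and RMSE = 0.
a1, a2 = [0.5, 1.0, -0.5], [1.0, 0.0, -1.0]
cc = bivariate_cc(a1, a2, a1, a2)
err = bivariate_rmse(a1, a2, a1, a2)
```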
Figure 4a shows that the limit of useful prediction (as measured by a threshold CC of 0.5) for MME is up to 20 days and is comparable to the best participating model (GFSbc). The forecast error saturates at longer leads near 25 days for both MME and GFSbc (Fig. 4b). Up to 15 days, both MME and GFSbc show similar forecast skill; then, the forecast error grows slightly faster for GFSbc relative to the MME. In summary, the CGEPS MME shows better MISO prediction skill than does the best SME.
Reliability and utility of an EPS are generally assessed by comparing the ensemble spread with the RMSE. Spread is evaluated as the standard deviation of the ensemble members about their ensemble mean; both the spread and the RMSE of the MME are computed from all 43 ensemble members. Figure 4b shows the RMSE (solid line) and the spread (dotted line) for each SME and the MME as a function of lead time. Ideally, the spread should be close to the RMSE, and both should be as small as possible at every lead time. It is evident from Fig. 4b that each individual SME has lower spread (overconfidence) and higher RMSE at long lead times, indicating a poorer SER. The MME, however, shows considerable improvement in the SER, contributed mostly by the increased spread (i.e., a reduced overconfidence penalty) at longer lead times, without compromising skill.
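The spread and RMSE curves of Fig. 4b correspond to the following computation, sketched here with invented names and a toy two-member, two-case ensemble; the operational evaluation aggregates over many forecast cases and lead times.

```python
# Sketch of the spread-error comparison: for each forecast case,
# accumulate the squared ensemble-mean error and the member variance
# about the ensemble mean, then report RMSE and mean spread.
import math

def spread_and_rmse(ens, obs):
    """ens: list of cases, each a list of member forecasts;
    obs: one verifying value per case.  Returns (spread, rmse)."""
    sq_err, sq_spread, n = 0.0, 0.0, len(obs)
    for members, o in zip(ens, obs):
        m = sum(members) / len(members)          # ensemble mean
        sq_err += (m - o) ** 2                   # ensemble-mean error
        sq_spread += sum((x - m) ** 2 for x in members) / len(members)
    return math.sqrt(sq_spread / n), math.sqrt(sq_err / n)

ens = [[1.0, 3.0], [2.0, 4.0]]
obs = [2.0, 2.0]
spread, rmse = spread_and_rmse(ens, obs)
```

A well-calibrated (non-overconfident) ensemble would show `spread` close to `rmse` at each lead.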
To better illustrate the impact of this increased spread on the prediction of MISO, an example is given in Figs. 4c–f. The panels show a scatterplot of the MISO1 and MISO2 indices, to evaluate the ability to forecast the MISO propagation in [MISO1, MISO2] phase space from a typical break state starting from 24 August 2012. While the spread of the MISO indices (green shading) does not encompass the verification curve (blue) for the SMEs (Figs. 4c–e), the MME spread does include it (phases 1–3 in Fig. 4f). The MME thus improves the SER not only in a statistical sense but also for individual events, as seen here for the MISO amplitude of this particular case.
c. Skill of probabilistic forecast
Since deterministic predictions of daily weather features by dynamical models are not accurate beyond 10 days, probabilistic forecasts from ensembles add extra value in ERP. Probabilistic categorical prediction skill of the EPS is evaluated by computing the ROC score, measured as the AUC (Sahai et al. 2008), and the BSS (Wilks 2005). The ROC diagram represents the ability of a set of probabilistic forecasts to discriminate a dichotomous event, relating the hit rate to the false alarm rate (Kharin and Zwiers 2003). In a ROC curve, closer clustering of the probability values indicates less spread among the ensemble members. The ROC curves for the below normal category at P3 lead are shown in Fig. 5a for different probability thresholds. The ROC curve for the MME (black curve) shows better discrimination than those of the individual participating SMEs. Figure 5b shows the AUC for all three categories and all lead pentads. The AUC for the MME is always better than or similar to that of the best participating SME, for all lead pentads and all three categories; it is highest for the BN and AN categories. Thus, with the present MME system, the BN and AN categories are well predicted and clearly discriminated.
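The ROC/AUC computation behind Fig. 5 can be sketched as follows: hit rate and false-alarm rate are evaluated at a set of probability thresholds, and the area under the resulting curve is found by the trapezoidal rule. This is a minimal illustration with invented names, not the verification package used for the paper.

```python
# Sketch of a ROC curve and its AUC from categorical probability
# forecasts and binary event observations.  Illustrative only.

def roc_points(probs, occurred, thresholds):
    """(false-alarm rate, hit rate) at each probability threshold."""
    events = sum(occurred)
    non_events = len(occurred) - events
    pts = []
    for t in thresholds:
        hits = sum(1 for p, o in zip(probs, occurred) if p >= t and o)
        fas = sum(1 for p, o in zip(probs, occurred) if p >= t and not o)
        pts.append((fas / non_events, hits / events))
    # Anchor the curve at the corners and sort along the FAR axis.
    return sorted(pts + [(0.0, 0.0), (1.0, 1.0)])

def auc(points):
    """Trapezoidal area under the sorted ROC points."""
    return sum(0.5 * (y1 + y2) * (x2 - x1)
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Toy case with perfect discrimination: high probabilities where the
# event occurred, low where it did not -> AUC of 1.
pts = roc_points([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0],
                 [i / 10 for i in range(11)])
area = auc(pts)
```

An AUC of 0.5 corresponds to no discrimination (the diagonal), which is why values well above 0.5 for the BN and AN categories indicate useful categorical skill.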
Figure 6a shows the BSS of the categorical probabilistic forecasts. The BSS measures the skill of the forecasts relative to a reference forecast, such as climatology or persistence (Wilks 2005). Again the MME is skillful (positive BSS) for the below and above normal categories, whereas for the individual SMEs the BSS is negative for all lead pentads except the first. The BSS for the near-normal category is very low or even negative at all lead pentads for all SMEs as well as the MME, consistent with the ROC analysis discussed in the previous paragraph. The skill of categorical prediction for the near-normal category is lower because the width of the categories for predictands with bell-shaped distributions is narrow near the center: it is easier for the verifying observation to escape from the closed middle category than from the open-ended outer categories (Van den Dool and Toth 1991).
Is running a cheaper variant (e.g., GFSbc) with a larger number of ensemble members necessarily a good strategy? The BSS plots show why not: even the best-performing, cheapest-to-run SME (GFSbc) cannot compete with the MME. The yellow bars (MME21) in Fig. 6a show the BSS of a 21-member (3 × 7) MME obtained by randomly selecting seven ensemble members from each of the three participating variants in CGEPS. This reduced MME is still better than the 21-member GFSbc (green bars) for all three categories. Thus the improvement in probabilistic prediction skill does not come from simply running more instances of the single most skillful model but from combining the three variants of the model. As long as the SME is underdispersive, the MME approach is a better choice for MISO forecasting than running the best SME with a larger ensemble.
The Brier score can be decomposed into reliability, resolution, and uncertainty terms. The performance of the EPS is further evaluated by comparing its two main attributes: reliability and resolution. Statistical reliability measures the degree to which the sample of forecasts is statistically indistinguishable from the corresponding observations, whereas statistical resolution reflects the system’s ability to distinguish between different forecast events. For a good EPS, the reliability term should be close to zero and the resolution should be high at all lead times. From Fig. 6b it is seen that at shorter lead times the resolution is the main contributor to positive BSSs. For all SMEs the reliability term exceeds the resolution after the P1 lead, resulting in negative BSSs. In the case of the MME, however, the reliability term remains close to zero and below the resolution for the below normal and above normal categories, resulting in positive BSSs. Thus the MME performs better than any of its three participating SMEs. For the near-normal category, the reliability of the MME remains more or less constant and slightly above the resolution at all leads beyond the first pentad. It is evident from the above analyses that the three-category probabilistic forecasts from the MME outperform all SMEs, and this better skill flows from greater spread and thus a better SER.
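The decomposition underlying Fig. 6b is the standard partition of the Brier score (Murphy 1973) into BS = reliability − resolution + uncertainty. The sketch below groups forecasts by their discrete probability values, which makes the identity exact; the binning and names are our own illustration, not the paper's scoring code.

```python
# Sketch of the Brier score decomposition:
#   BS = reliability - resolution + uncertainty.
# probs: forecast probabilities of the event; occurred: binary
# event indicators (1 = event observed).  Illustrative only.

def brier_decomposition(probs, occurred):
    n = len(probs)
    obar = sum(occurred) / n                  # climatological frequency
    groups = {}                               # outcomes grouped by forecast value
    for p, o in zip(probs, occurred):
        groups.setdefault(p, []).append(o)
    rel = sum(len(os) * (p - sum(os) / len(os)) ** 2
              for p, os in groups.items()) / n
    res = sum(len(os) * (sum(os) / len(os) - obar) ** 2
              for os in groups.values()) / n
    unc = obar * (1.0 - obar)
    return rel, res, unc

probs = [0.2, 0.2, 0.8, 0.8]
occurred = [0, 1, 1, 1]
rel, res, unc = brier_decomposition(probs, occurred)
bs = sum((p - o) ** 2 for p, o in zip(probs, occurred)) / len(probs)
# The identity bs == rel - res + unc holds for this exact grouping.
```

A positive BSS against climatology requires the resolution term to exceed the reliability term, which is exactly the behavior the MME achieves for the BN and AN categories.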
4. Conclusions

The prediction of monsoon weather and the MISO in the extended range is a challenging problem for the operational monsoon meteorologist. The CGEPS MME is found to provide multiple benefits: by encompassing the errors in both the ICs and the forecast model physics, it provides better probability forecasts from the users’ perspective (measured by increased reliability). We conclude that deterministic prediction of the MISO using the CGEPS MME is operationally acceptable, although its deterministic skill has not yet been thoroughly compared with that of other operational prediction systems, such as the ECMWF Variable Resolution Ensemble Prediction System (VarEPS), which is reported to outperform CFSv2 for MJO forecasts (Kim et al. 2014).
Part of the overconfidence penalty involved in SMEs is overcome in the MME, which improves the spread–error relationship, so that the MME approach adds value to both the deterministic and probabilistic forecasts. The results encourage us to pursue a comprehensive adoption of this broadened NCEP-CFSv2 CGEPS MME framework for the operational extended-range prediction of the Indian summer monsoon. Based on the results above, further diversification efforts such as a perturbed physics-parameter approach would probably further enhance the ensemble spread and hence may improve the CGEPS’s skill in these metrics, so long as the diversity does not involve bringing in models of such poor quality that their performance degrades the system. In this setting, physical experimentation can easily be justified within an operationally oriented computing workflow, an auspicious circumstance for sustained efforts by which the National Monsoon Mission’s goals can drive improvements in tropical physics for these much-used global models.
Acknowledgments. IITM is fully supported by the Ministry of Earth Sciences, government of India, New Delhi. SS thanks the Council for Scientific and Industrial Research (CSIR), New Delhi for a research fellowship. BEM gratefully acknowledges National Monsoon Mission financial support given by the Earth System Science Organization, Ministry of Earth Sciences, government of India (Grant/Project MM/SERP/Univ_Miami_USA/2013/INT-1/002).
¹ As of 2014. Version numbering will soon be instituted in the CFS.