An assessment is made of the modes of interannual variability in the seasonal mean summer and winter Southern Hemisphere (SH) 500-hPa geopotential height in the twentieth century in models from the Coupled Model Intercomparison Project (CMIP) phase 5 (CMIP5) dataset. Modes of variability of both the slow (signal) and intraseasonal (noise) components in the CMIP5 models are evaluated against those estimated from reanalysis data. There is general improvement in the leading modes of the slow (signal) component in CMIP5 models compared with the CMIP phase 3 (CMIP3) dataset. The largest improvement is in the spatial structures of the modes related to El Niño–Southern Oscillation variability in SH summer. An overall score metric is significantly higher for CMIP5 than for CMIP3 in both seasons. The leading modes in the intraseasonal noise component are generally well reproduced in CMIP5 models, and there are few differences from CMIP3. A new total overall score metric is used to rank the CMIP5 models over both seasons. Weighting the seasons by the relative spread of overall scores is shown to be suitable for generating multimodel ensembles for further analysis of interannual variability. In multimodel ensembles, it is found that an ensemble of size 5 or 6 is sufficient in SH summer to reproduce the dominant modes well. In contrast, about 13 models are typically required in SH winter. It is shown that the selected models must individually reproduce the leading modes of the slow component well.
A key requirement for understanding the observed and projected changes in the global climate system due to changing greenhouse gas concentrations is the need to separate the “signal” (i.e., the response to radiative forcing) from the “noise,” in this case internal variability. Often such a signal is estimated using multimodel ensembles (MMEs) formed by combining models from the Coupled Model Intercomparison Project phase 3 or phase 5 (CMIP3 or CMIP5) datasets (e.g., Meehl et al. 2007b; Collins et al. 2013, and references therein). Alternatively, large ensembles from single models have been generated by perturbing either the initial conditions (e.g., Selten et al. 2004; Kirtman et al. 2011; Deser et al. 2012) or the model parameterizations, often known as perturbed physics ensembles (e.g., McSweeney et al. 2012). Different radiative forcings have been applied to atmospheric general circulation models (AGCMs) (e.g., Folland et al. 2002; Deser and Phillips 2009) or coupled atmosphere–ocean general circulation models (CGCMs); for example, the tier 1 and tier 2 CMIP5 experiments described by Taylor et al. (2012).
Of particular interest are changes in regional surface climate, such as rainfall or surface air temperature. The large-scale atmospheric circulation acts as a pathway from global-scale radiative forcing to the surface regional-scale climate change response. As such, the ability of CGCMs to represent coherent patterns, or modes, of interannual variability in the large-scale atmospheric circulation is an important consideration in the use of such models to understand how these modes might change in the future. Of particular importance is the extent to which these modes respond to, or reflect the trend in, observed and projected changes in radiative forcing due to changing greenhouse gas concentrations.
It is possible to separate the signal and noise in the interannual variability of the seasonal mean of a climate variable. Such methods are based on the premise that the seasonal mean can be considered as a random variable, with a seasonal “population” mean and departures from that mean (Zheng and Frederiksen 2004; Frederiksen and Zheng 2007b). Zheng and Frederiksen (2004) referred to these two components as the slow and intraseasonal components of the seasonal mean. By also considering a seasonal mean as consisting of externally and internally forced components (e.g., Rowell et al. 1995), the slow component can be further defined as consisting of slow-external and slow-internal components (Zheng and Frederiksen 1999; Zheng et al. 2004, 2009). For a CGCM ensemble, the interannual variability of the slow-external component is related to changes in radiative forcing (Grainger et al. 2013).
Using monthly mean anomalies, Zheng and Frederiksen (2004) estimated the covariance matrices for the slow and intraseasonal components. This allows empirical orthogonal functions (EOFs) to be calculated, which represent the modes of interannual variability. Frederiksen and Zheng (2007a) applied this method to estimate the dominant modes of interannual variability in Southern Hemisphere (SH) 500-hPa geopotential height for December–February (DJF; summer) and June–August (JJA; winter) using the National Centers for Environmental Prediction (NCEP) reanalysis dataset (Kalnay et al. 1996). While it is well known that statistical modes of variability are not the same as dynamical or physical modes (e.g., Monahan et al. 2009, and references therein), Frederiksen and Zheng (2007a) found that their modes of variability had spatial structures similar to many dynamical modes.
The leading mode of the slow component, in this context the signal, in both seasons reflects high-latitude variability. The next two modes are related to El Niño–Southern Oscillation (ENSO) variability, while the fourth mode is similar to South Pacific wave modes. The intraseasonal noise component was also analyzed, as it explains a significant fraction of the variability of the extratropical atmospheric circulation (Zheng et al. 2000, 2004). The leading mode of the intraseasonal component has a zonally symmetric annular structure resembling the southern annular mode (SAM). The secondary modes have midlatitude wave-4 (DJF) or wave-3 (JJA) patterns. These typically reflect the impact on the seasonal mean of persistent anticyclones, or are related to internal instability of the atmospheric flow.
The modes of variability of the slow and intraseasonal components were later found to be qualitatively well reproduced in AGCMs (Zheng et al. 2009; Grainger et al. 2011b). Subsequently, Grainger et al. (2013) applied the method of Zheng and Frederiksen (2004) to CGCMs from the CMIP3 dataset (Meehl et al. 2007a) and attempted to quantify how well they reproduced the modes of variability of the SH atmospheric circulation. The leading modes of the intraseasonal component were found to be generally well reproduced. However, the leading modes of the slow component were less well reproduced, which they associated with known deficiencies in the mean state or variability of CMIP3 models. Using a four-model MME, they analyzed the projected changes in interannual variability and found that these were dominated by the response to projected changes in the radiative forcing in the CMIP3 climate change scenarios.
There has been much recent discussion surrounding model selection and the use of MMEs in projection studies (e.g., Knutti et al. 2010, 2013; Weigel et al. 2010; McSweeney et al. 2012; Yokohata et al. 2013). Grainger et al. (2013) selected their CMIP3 models based on two requirements. The first was that the models reproduced reasonably well in both seasons the leading modes of variability of the slow component. This was taken to mean those models whose slow component modes did not reflect known dynamical deficiencies. Their second requirement was that the CMIP3 models were not flux-adjusted and had time-varying stratospheric ozone. This was to ensure that their MME isolated the climate change signal due to changes in projected anthropogenic greenhouse gases.
Grainger et al. (2013) were unable to determine whether their four-model MME was of an appropriate size required to best represent the SH atmospheric circulation modes of interannual variability. However, they did show that it reproduced the slow component modes better than any individual CMIP3 model. In CMIP3 models, the effect of ensemble size on variability has been examined in passing (e.g., Pierce et al. 2009; Knutti et al. 2010). Pennell and Reichler (2011) systematically estimated the number of “effective” CMIP3 models, which they defined as the amount of statistically independent information in the dataset, for 35 climate variables. Yokohata et al. (2013) calculated the effective degrees of freedom for nine climate variables in several MMEs including CMIP5.
The primary aim of this paper is to assess how well the CMIP5 models reproduce the leading modes of variability of the slow component of the SH atmospheric circulation in the twentieth century. A comparison is made with the modes of interannual variability found in our earlier study of CMIP3 models. CMIP5 models will then be ranked on the basis of how well they perform in SH summer and winter. This will be used to give an estimate of the optimum size of CMIP5 MMEs for SH atmospheric circulation interannual variability.
The outline of this paper is as follows. The data and methods used are described in sections 2 and 3 respectively. The modes of interannual variability in reanalysis data are described in section 4, and CMIP5 models are assessed in section 5. The selection of CMIP5 models for MMEs is examined in section 6. Conclusions and discussion are given in section 7.
a. Reanalysis data
Monthly mean 500-hPa geopotential height is used to represent the SH atmospheric circulation in summer (DJF) and winter (JJA). Data have been obtained from the Twentieth Century Reanalysis (20CR) project (Compo et al. 2011) for the period 1951–2000 (i.e., 50 seasons). For consistency with the earlier studies, the original 2° × 2° data are mapped onto a 2.5° × 2.5° grid. This is further subsampled to 5° × 5° and thinned toward the South Pole, as described in Frederiksen and Zheng (2007a), so that the data are approximately weighted by area. For the modes of variability of the slow component we are also interested in their relationship with global SST. SST data are obtained from the Hadley Centre Sea Ice and Sea Surface Temperature dataset version 1 (HadISST1; Rayner et al. 2003) and subsampled onto a 2° × 2° grid.
b. CMIP5 data
Monthly mean SH 500-hPa geopotential height data for DJF and JJA have been obtained for 50 seasons from the CMIP5 historical experiment (Taylor et al. 2012) from 45 models, summarized in Table 1 (including expansions of model names). In 41 models, the period 1951–2000 is used. The respective periods (see Table 1) for CESM1-WACCM, EC-EARTH, and HadGEM2-CC allow an ensemble estimate to be made. CanCM4 historical data are only available from January 1961 to December 2005, so the period 1962–2005 is used. Grainger et al. (2013) found that the ensemble estimate performed better than individual realizations in all CMIP3 model ensembles in DJF, and in the majority in JJA.
We have chosen to exclude MIROC-ESM-CHEM, as this model is essentially the MIROC-ESM model with coupled atmospheric chemistry (e.g., Eyring et al. 2013). Analogously, we only consider the GISS-E2-H and GISS-E2-R physics version 1 (p1) ensembles and not their physics versions 2 and 3 (p2 and p3) ensembles with added atmospheric chemistry [also described in Eyring et al. (2013)]. A preliminary investigation of these three models (not shown) found that there are few differences in the modes of variability between their historical prescribed ozone and coupled atmospheric chemistry experiments.
For most CMIP5 models, SST is obtained from the ocean surface temperature variable (tos). Where this is unavailable, the atmospheric surface skin temperature (ts) is used instead. While the two variables will be different, primarily under sea ice, there is minimal impact when used to examine the relationship between SST and the modes of variability of the slow component.
To compare the CMIP5 model modes of variability with the 20CR, all data must be mapped onto the same grid (e.g., Grainger et al. 2008). CMIP5 500-hPa geopotential height data are mapped onto the 20CR 2.5° × 2.5° grid, while SST data are mapped onto the HadISST 2° × 2° grid. For all data, anomalies are calculated by subtracting the climatological monthly mean from the time series at each grid point.
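As a concrete illustration of the last step, the anomaly calculation can be sketched as follows (a minimal example with synthetic data; the array layout and function name are illustrative only, not a CMIP5 convention):

```python
import numpy as np

def monthly_anomalies(x):
    """Subtract the climatological monthly mean at each grid point.

    x: array of monthly means with shape (nyears, 12, nlat, nlon).
    Returns anomalies of the same shape.
    """
    clim = x.mean(axis=0, keepdims=True)  # one climatology per calendar month
    return x - clim

# A field consisting of a pure annual cycle has zero anomalies.
cycle = np.cos(2.0 * np.pi * np.arange(12) / 12.0)[None, :, None, None]
x = np.broadcast_to(cycle, (50, 12, 4, 8)).copy()
anom = monthly_anomalies(x)
```

Removing the climatology separately for each calendar month ensures that the annual cycle does not contaminate the interannual covariance estimates that follow.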
a. Modes of variability
The principles and assumptions used by the methodology have been detailed in previous papers (e.g., Frederiksen and Zheng 2007b, and references therein). Here, a brief summary is given. Consider a climate variable (e.g., geopotential height at a grid point location) from which the annual cycle has been removed. A time series of monthly mean anomalies x is then conceptualized as

xsym = βy + δsy + εsym, (1)

where y = (1, …, Y) is the year index in a sample of Y years, m = (1, 2, 3) is the month index within a season, and s = (1, …, S) is the realization index in an ensemble of size S. Also, βy is the slow-external component of the seasonal mean, related in CGCMs to changes in radiative forcing; δsy is the slow-internal component, related to slowly varying internal dynamics (Zheng et al. 2009; Grainger et al. 2013); and εsym is the residual monthly departure of xsym from the slow-external and slow-internal components. The slow component of the seasonal mean is defined as the sum of the slow-external and slow-internal components, that is,

μsy = βy + δsy. (2)
The seasonal mean can then be written as

xsyo = μsy + εsyo, (3)

where the subscript o denotes the average over an index of s, y, or m. Note that μsy is associated with variability on slow-varying time scales (i.e., the signal), while εsyo is associated with variability within the season (i.e., the intraseasonal noise component) (Zheng and Frederiksen 2004).
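The decomposition above can be demonstrated with a small synthetic example (the component variances here are invented purely for illustration): generate a slow-external part common to all realizations, a slow-internal part varying across realizations, and monthly noise; the seasonal mean then splits exactly into the slow component and the intraseasonal component.

```python
import numpy as np

# Synthetic illustration of the conceptual model x_sym = beta_y + delta_sy + eps_sym.
rng = np.random.default_rng(1)
Y, S, M = 50, 3, 3                    # years, realizations, months per season
beta = rng.normal(size=(Y, 1, 1))     # slow-external: common to all realizations
delta = rng.normal(size=(Y, S, 1))    # slow-internal: varies with realization
eps = rng.normal(size=(Y, S, M))      # intraseasonal residual
x = beta + delta + eps                # monthly anomalies

mu = (beta + delta)[:, :, 0]          # slow component mu_sy of the seasonal mean
x_seas = x.mean(axis=2)               # seasonal mean x_syo
eps_bar = eps.mean(axis=2)            # intraseasonal component eps_syo
# By construction, x_seas equals mu + eps_bar exactly.
```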
Given this conceptual model, covariance matrices for the components of the seasonal mean climate field can be estimated using time series of monthly mean anomalies (Zheng and Frederiksen 2004; Zheng et al. 2009). Since the components have different degrees of freedom (Rowell et al. 1995), the seasonal mean covariance should be estimated as the sum of the external and internal covariability, that is,
where i1 and i2 are any pair from the set of i = (1, …, I) geographical locations, and V̂ denotes an estimated covariance. The total internal covariance is estimated by
and the covariance of the slow-external component by
is the ensemble mean seasonal mean covariance. It is important to note that for single realizations, or for reanalysis data, S ≡ 1 and the external and internal covariance cannot be estimated separately. In this case, the seasonal mean covariance is estimated from the sample covariance (e.g., Zheng and Frederiksen 2004), that is,
Zheng and Frederiksen (2004) showed that the covariance of the intraseasonal component can be estimated by
where the terms on the right-hand side are the monthly moments between x at i1 and i2. Given Eq. (2), the covariance of the slow component can be estimated as a residual, that is,

V̂μ(i1, i2) = V̂(i1, i2) − V̂ε(i1, i2), (9)

where V̂(i1, i2) is the seasonal mean covariance and V̂ε(i1, i2) is the intraseasonal covariance of Eq. (8).
As written, Eq. (9) assumes that μsy and εsyo are statistically independent. However, even if this is not the case, the residual estimate may still be better related to the covariance of the slow component than the total covariance (Zheng and Frederiksen 2004). The spatial truncation method of Zheng and Frederiksen (2004) is applied to the covariance matrices of the total internal and intraseasonal components. Covariance matrices are adjusted so that they are positive semidefinite using the method of Grainger et al. (2008).
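The residual step, and the need for a positive semidefinite adjustment, can be sketched as follows. The eigenvalue clipping used here is a generic way of enforcing positive semidefiniteness; it only stands in for the adjustment of Grainger et al. (2008), whose details are not reproduced.

```python
import numpy as np

def slow_covariance(total_cov, intra_cov):
    """Residual estimate of the slow-component covariance [cf. Eq. (9)]."""
    return total_cov - intra_cov

def make_psd(cov):
    """Force a symmetric matrix to be positive semidefinite by zeroing
    negative eigenvalues (a generic stand-in for Grainger et al. 2008)."""
    vals, vecs = np.linalg.eigh(cov)
    return (vecs * np.clip(vals, 0.0, None)) @ vecs.T

# A small example in which the raw residual is indefinite.
total = np.array([[2.0, 0.5], [0.5, 1.0]])
intra = np.array([[1.0, 0.9], [0.9, 1.1]])
resid = slow_covariance(total, intra)
adjusted = make_psd(resid)
```

Without such an adjustment, the residual matrix can have negative eigenvalues, which would make the subsequent EOF variances uninterpretable.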
The modes of interannual variability for each component are estimated using empirical orthogonal function (EOF) analysis. In this paper, we shall refer to the patterns obtained from EOF analysis of the I × I covariance matrix of the intraseasonal noise component as intraseasonal modes (I modes). Similarly, the patterns obtained from EOF analysis of the slow component (the signal) covariance matrix will be referred to as slow modes (S modes).
We also define an associated time series with each mode as the projection of the monthly anomalies xsym at i = 1, …, I onto that mode. The cross-covariance between the slow components of an S mode associated time series and another climate field (e.g., SST) can be estimated through further application of Eqs. (3)–(9) [see Grainger et al. (2011a) for the derivation]. Here, we shall refer to the estimated cross-covariances between the S mode associated time series and SST as the slow SST–height covariances.
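The EOF calculation and the associated time series can be sketched as follows (synthetic anomalies are used here; in the paper the input is one of the component covariance matrices estimated above):

```python
import numpy as np

def eof_modes(cov, nmodes):
    """Leading EOFs of an I x I covariance matrix.

    Returns (patterns, variances): `patterns` has shape (nmodes, I) and
    `variances` holds the corresponding eigenvalues, largest first.
    """
    vals, vecs = np.linalg.eigh(cov)         # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:nmodes]  # pick the largest
    return vecs[:, order].T, vals[order]

# Synthetic anomalies at I = 6 locations.
rng = np.random.default_rng(2)
anom = rng.normal(size=(200, 6))
cov = np.cov(anom, rowvar=False)
patterns, variances = eof_modes(cov, 3)

# Associated time series: projection of the anomalies onto a mode.
ts = anom @ patterns[0]
```

The eigenvalues are the estimated variances of the modes, and the projection gives the associated time series used for the slow SST–height covariances.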
b. Model assessment
Grainger et al. (2013) detailed a method for selecting the model S modes that best match the reanalysis S modes. In brief, a candidate permutation of model S modes is objectively selected that maximizes the pattern correlations of the S modes and their slow SST–height covariances. Where ambiguous modes are found, these are subjectively inspected to determine the final one-to-one best-match selection. For each reanalysis S mode, how well it is reproduced in a model is scored by
where R is the pattern correlation, over a specified region, between the model and reanalysis S modes (the absolute value is used since the sign of an EOF is arbitrary); RSST is the pattern correlation, over a specified region, between the model and reanalysis S-mode SST–height covariances; the estimated variances (i.e., the eigenvalues) of the model and reanalysis S modes are also compared; and Mμ is the score, with a value between 0 and 1. Having selected the best matches, an overall score for each season is defined as
where sss denotes a season (i.e., DJF or JJA) and Mμ(n) is the score [Eq. (10)] of seasonal S mode n.
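The pattern correlations entering the score can be computed as a centered, optionally area-weighted, pattern correlation. This is a generic sketch; the exact regional masking and weighting used by Grainger et al. (2013) are not reproduced here.

```python
import numpy as np

def pattern_correlation(a, b, weights=None):
    """Centered pattern correlation between two fields on the same grid.

    `weights` can carry area weights; the caller takes the absolute value
    where the sign of an EOF is arbitrary.
    """
    a, b = np.ravel(a).astype(float), np.ravel(b).astype(float)
    w = np.ones_like(a) if weights is None else np.ravel(weights)
    a = a - np.average(a, weights=w)
    b = b - np.average(b, weights=w)
    num = np.average(a * b, weights=w)
    den = np.sqrt(np.average(a * a, weights=w) * np.average(b * b, weights=w))
    return num / den

field = np.arange(12.0).reshape(3, 4)
```

A field correlated with its own sign reversal gives −1, which is why the absolute value is taken when comparing EOFs.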
4. Reanalysis modes of variability
The dominant S modes and I modes of SH 500-hPa geopotential height from reanalysis data for the second half of the twentieth century were first analyzed by Frederiksen and Zheng (2007a), with several follow-up studies (e.g., Zheng et al. 2009; Grainger et al. 2011b, 2013). To provide context to the CMIP5 results, we summarize here the main findings of these earlier studies.
The leading three S modes of 20CR SH 500-hPa geopotential height in DJF and JJA for the period 1951–2000 are shown in Fig. 1. The leading S mode in both seasons represents high-latitude variability with a SAM-like structure. There is a protrusion into the South Pacific, particularly in JJA (Fig. 1c), that has been seen in other reanalysis studies (e.g., Fogt et al. 2011). S modes 2 and 3 in both seasons represent ENSO variability, evident in the slow SST–height covariances (Figs. 1b,d). The S modes have spatial structures (Figs. 1a,c) similar to the Pacific–South American modes found in other reanalysis studies (e.g., Mo and Higgins 1998; Mo 2000) and closely resemble the 20CR SH 500-hPa geopotential height anomaly composites (not shown) over the ENSO warm- and cold-event years designated by Kiladis and Mo (1998).
The leading four I modes for 20CR SH 500-hPa geopotential height for DJF and JJA for the period 1951–2000 are shown in Fig. 2. I mode 1 in both seasons has an annular structure resembling SAM, while the secondary I modes have midlatitude wave-4 (DJF) or wave-3 (JJA) patterns, also seen in other reanalysis studies (e.g., Kidson 1999, and references therein). At the phase shown, I mode 2 in both seasons has its largest positive loadings in the South Pacific Ocean, and its spatial structure resembles the persistent anticyclone cluster patterns of Renwick (2005). Although I modes 3 and 4 have similar eigenvalues, and are degenerate using the criteria of North et al. (1982), separate dynamical modes are identifiable in the study of Frederiksen and Frederiksen (1993). I mode 3 in both seasons has a spatial structure that reflects blocking patterns, while I mode 4 has a spatial structure resembling that of internally generated wave disturbances.
It is our view that the modes shown in Figs. 1 and 2 are representative of the modes of interannual variability of Southern Hemisphere atmospheric circulation in the second half of the twentieth century. We base this on the following findings:
The modes are similar across reanalysis datasets. We also analyzed the NCEP reanalysis dataset (Kalnay et al. 1996) for the period 1951–2000 and 40-yr European Centre for Medium-Range Weather Forecasts (ECMWF) Re-Analysis (ERA-40; Uppala et al. 2005) for the period 1959–2001. Except for DJF S mode 3, pattern correlations range from 0.95 to 0.70 with respect to the 20CR S modes (Fig. 1) and from 0.98 to 0.59 with respect to the 20CR I modes (Fig. 2). The associated variances and percentage explained for the NCEP and ERA-40 datasets (not shown) are generally consistent with those for 20CR (see Figs. 1 and 2).
The modes are similar across subperiods. Although reanalysis datasets should be used with caution at high southern latitudes prior to the assimilation of satellite data (Bromwich and Fogt 2004), previous studies (Kidson 1999; Frederiksen and Zheng 2007a) have shown consistency of EOFs of NCEP SH atmospheric circulation interannual variability when pre- and post-1979 periods are considered with respect to the full analysis period. When the modes (not shown) for the 20CR 1951–75 and 1976–2000 periods are compared with respect to the 1951–2000 period, with one exception pattern correlations are in the range 0.97–0.57.
The modes are similar throughout the troposphere. Inspection of the geopotential height field at other levels (not shown) found that all are essentially barotropic between 850 and 300 hPa, with pattern correlations ranging from 0.99 to 0.69. At 200 hPa, pattern correlations are at least 0.64 except for JJA S modes 2 and 3.
Grainger et al. (2013) found consistency of behavior between CMIP3 reproduction of the 20CR S modes and other studies examining SAM and ENSO variability. A brief examination (described later in section 5a) indicates similar behavior in CMIP5 models. Therefore we would expect that a poor pattern correlation for a 500-hPa geopotential height S mode and/or its slow SST–height covariance would translate into poor pattern correlations in other relevant climate fields.
Model scores [Eq. (10)] will obviously vary with respect to choice of reanalysis dataset, analysis period, or climate field. However, our main interest lies in whether models reproduce the circulation modes of interannual variability “reasonably well”—that is, whether or not there are any clear deficiencies in the dominant modes. We do not expect that the relative ranking of CMIP5 models will be materially affected by our chosen reference field.
5. CMIP5 model assessment
The leading S modes and I modes of SH 500-hPa geopotential height in CMIP5 models are estimated for the periods given in Table 1 using ensembles of all realizations. Modes are evaluated against the 20CR S modes and I modes (section 4) and results compared with the CMIP3 ensemble estimates of Grainger et al. (2013).
a. Slow component modes
Figure 3 shows, for the 20CR S modes in Fig. 1, aggregates over CMIP3 and CMIP5 models of the individual diagnostics in Eq. (10). Pattern correlations for the S-modes |R| and slow SST–height covariances RSST are calculated over the same regions as for Grainger et al. (2013). The value of |R| is calculated over the entire analysis domain for all S modes and seasons; RSST is calculated over 60°–30°S and 60°S–20°N for all longitudes and over 30°S–20°N, 90°E–70°W for S modes 1–3 respectively in both DJF and JJA. For convenience, the estimated variance of the model S mode is shown in Fig. 3 as the standard deviation relative to 20CR, that is,
σ* = (V̂model / V̂20CR)^(1/2),

where the values of 20CR are shown in Fig. 1.
In DJF, there are improvements in the CMIP5 models compared to CMIP3 in the spatial structure of both the S modes (|R|; Fig. 3a) and the slow SST–height covariances (RSST; Fig. 3b), as indicated by increased median pattern correlations and often a reduction in the interquartile range (IQR). To check whether differences in the medians are statistically significant, we consider whether or not their confidence intervals overlap (Tukey 1977). Confidence intervals (not shown) for the CMIP5 and CMIP3 dataset medians, defined in terms of the IQR and the sample size N, do not overlap for the DJF S-mode-2 pattern correlation and the DJF S-mode-3 slow SST–height covariance pattern correlation, indicating that the difference in medians for those diagnostics exceeds the threshold for statistical significance. There are smaller differences in DJF relative standard deviation (σ*; Fig. 3c). There are clear improvements in CMIP5 in the score (Mμ; Fig. 3d) for all three DJF S modes, although only for S mode 3 does the difference in the medians exceed the threshold for statistical significance. However, for S mode 2, the leading mode related to ENSO variability, almost 75% of the CMIP5 models have scores that exceed the CMIP3 upper quartile.
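The overlap test can be sketched as follows. The half-width factor of 1.58 × IQR/√N used below is the usual notched box plot convention and is an assumption of this sketch, not a value taken from the text.

```python
import numpy as np

def median_ci(sample, factor=1.58):
    """Approximate confidence interval for a median: median +/- factor *
    IQR / sqrt(N). The factor 1.58 is the common notched box plot
    convention and is assumed here."""
    sample = np.asarray(sample, dtype=float)
    med = np.median(sample)
    q1, q3 = np.percentile(sample, [25, 75])
    half = factor * (q3 - q1) / np.sqrt(sample.size)
    return med - half, med + half

def medians_differ(a, b):
    """True when the two median confidence intervals do not overlap."""
    lo_a, hi_a = median_ci(a)
    lo_b, hi_b = median_ci(b)
    return hi_a < lo_b or hi_b < lo_a
```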
In JJA, the distribution of results, as indicated by the whiskers and the IQR, is generally similar for CMIP3 and CMIP5 models across all diagnostics and S modes (Figs. 3e–h). While the CMIP5 median is generally higher than the CMIP3 median, the difference only exceeds the threshold for statistical significance for the S-mode-1 slow SST–height covariance pattern correlations (Fig. 3f). The most notable improvement in JJA score (Fig. 3h) is for S mode 2, where a small number of CMIP5 models exceed the entire CMIP3 dataset.
Grainger et al. (2013) found that on an individual basis, how well or poorly a CMIP3 model reproduced the 20CR S modes was consistent with other studies of SAM and ENSO variability. Two examples using CMIP5 are given here. First, in reanalysis datasets trends in SAM indices may be associated with large estimated variances for S mode 1 (Grainger et al. 2013). For DJF S mode 1 (Fig. 3c), the two CMIP5 outliers are IPSL-CM5A-LR and FGOALS-s2. Purich et al. (2013) found that the SAM trends in these two models for SH autumn (March–May) exceed those of reanalysis data by a factor of 2. Second, the western Pacific warm pool affects ENSO dynamics (e.g., Picaut et al. 1996) and was examined in 19 CMIP5 models by Brown et al. (2014). There is good correspondence between the models found by Brown et al. (2014) to have the most skillful oceanic simulation (most notably CCSM4 and NorESM1-M) or poorer ocean dynamics (e.g., CSIRO Mk3.6.0, GISS-E2-R, and INM-CM4.0) and their individual diagnostics here (not shown) for S modes 2 and 3.
b. Intraseasonal component modes
The 20CR I modes are also assessed in the CMIP3 and CMIP5 models, since the interannual variability associated with the intraseasonal noise component comprises about 50% of the total seasonal mean variance in the SH extratropics (Zheng et al. 2000, 2004). The method for selecting the model best match to the 20CR I modes is analogous to that for the S modes (section 3b), although only the I-mode pattern correlations |R| and estimated variances are considered [full details are in Grainger et al. (2013)].
Figure 4 shows the I-mode pattern correlation |R| and relative standard deviation σ* aggregated for CMIP3 and CMIP5 models for the leading four 20CR I modes shown in Fig. 2. There is little difference in the distribution of the diagnostics in the CMIP3 and CMIP5 models for any of the I modes. The difference in the medians only exceeds the threshold for statistical significance for σ* of DJF I mode 4 (Fig. 4b). Pattern correlations are slightly higher in CMIP5 than in CMIP3 models for DJF I modes (Fig. 4a), but slightly lower for JJA I modes (Fig. 4c). There is a general increase in relative standard deviations from CMIP3 to CMIP5 in both seasons (Figs. 4b,d). The 20CR I modes are generally well reproduced in the CMIP5 models in both seasons. Since a one-to-one best match is used, both 20CR I modes 3 and 4 are reproduced, albeit with a preference for 20CR I mode 4 over I mode 3, as indicated by the higher relative standard deviations. Inspection of individual model diagnostics (not shown) finds no clear systematic differences in the CMIP5 dataset.
c. Model overall score
The overall scores for DJF and JJA [Eq. (11)] aggregated for CMIP3 and CMIP5 models are shown in Fig. 5. There is clear improvement in CMIP5 models for both seasons, with about 75% exceeding the CMIP3 upper quartile in DJF, and 50% in JJA. Differences in the medians exceed the threshold for statistical significance in both seasons.
As suggested by Fig. 5, almost all individual CMIP3 and CMIP5 models have much higher overall scores in DJF than in JJA. Inspection of Fig. 3 indicates that this is mostly due to higher pattern correlations for S modes 1 and 2 in DJF than in JJA. These modes are more zonally asymmetric in the 20CR in JJA than in DJF (cf. Figs. 1c and 1a). While models may reproduce the modes reasonably well in JJA, any errors in location will be magnified when the pattern correlation is calculated. Another reason for the poorer model performance in JJA may be that the SH atmospheric circulation is more weakly constrained in winter than in summer by radiative forcing (through stratospheric ozone; e.g., Gillett and Thompson 2003) or SST (i.e., ENSO; e.g., L’Heureux and Thompson 2006).
Table 2 gives the overall score [Eq. (11)] for DJF and JJA for each CMIP5 model. Grainger et al. (2013) argued that 5 of 23 CMIP3 models (22%) had S modes unlikely to represent known deficiencies in the model dynamics, which can be taken to mean that they reproduce the 20CR S modes reasonably well. Therefore an independent threshold for CMIP5 models is whether their overall score exceeds the CMIP3 dataset upper quartile for a given season. A majority of the available CMIP5 models (27 of 39) exceed this threshold in DJF (Table 2), but just under half (18 of 39) do so in JJA. There is little consistent behavior across both seasons, with only a handful of models being particularly good (e.g., MPI-ESM-LR and NorESM1-M) or poor (e.g., INM-CM4.0).
6. CMIP5 model selection for multimodel ensembles
One use for model metrics is to provide a relative rank of their performance. Since the atmospheric circulation acts as a pathway for regional impacts of global-scale changes in radiative forcing, it may be desirable to select models that best reproduce atmospheric circulation variability. Since we are mainly interested in the implications for projection studies, we will not rank six models for which no representative concentration pathway (RCP) 8.5 experiment data are available. They are CanCM4, CESM1-FASTCHEM, FGOALS-s2, HadCM3, MIROC4h, and MPI-ESM-P. Although FGOALS-s2 was used in section 5, historical and RCP experiment data were withdrawn from the CMIP5 archives in early 2013.
The 39 available models will be used here to construct MMEs in order to determine the optimum size and model ranking for analysis of SH atmospheric circulation interannual variability. Only one realization per model is used, and each realization is given equal weight within any ensemble. The realization used, and the criterion for selecting it, are indicated in Table 2.
a. Optimum multimodel ensemble size
To estimate the optimum MME size, we construct MMEs of increasing size by adding realizations in order determined by the CMIP5 model rank. The question is how to define a rank based on modes in both seasons. Only the S modes need to be considered, as CMIP5 models reproduce the 20CR I modes generally well, and with no systematic differences. Therefore we define a weighted total overall score as
where γsss is the weight for season sss (e.g., DJF). Ideally, the total overall score should rank the models such that a suitably sized MME obtained from the highest ranked models should itself have the highest overall score in each season. Additionally, MMEs of the same size using lower ranked models should have decreasing overall scores. We will initially use the weights γDJF = 1 and γJJA = 2.2 to estimate the optimum MME size. They correspond to the relative spread of overall scores in each season, as indicated by the ratio of the standard deviations normalized with respect to the seasonal mean CMIP5 overall score (see Table 2). The model rank based on these weights is given in Table 2.
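The ranking by total overall score can be sketched as follows, assuming the total overall score is the weighted sum of the seasonal overall scores with the weights given above; the model names and score values below are invented for illustration and do not correspond to Table 2.

```python
# Hypothetical seasonal overall scores for four models (invented values).
scores = {
    "model_a": {"DJF": 0.62, "JJA": 0.40},
    "model_b": {"DJF": 0.55, "JJA": 0.48},
    "model_c": {"DJF": 0.70, "JJA": 0.30},
    "model_d": {"DJF": 0.50, "JJA": 0.45},
}
weights = {"DJF": 1.0, "JJA": 2.2}  # relative spread of overall scores

def total_overall_score(model_scores, weights):
    """Weighted total overall score, assumed here to be a weighted sum
    of the seasonal overall scores."""
    return sum(weights[s] * model_scores[s] for s in weights)

# Rank models from highest to lowest total overall score.
ranked = sorted(scores, key=lambda m: total_overall_score(scores[m], weights),
                reverse=True)
```

With these weights, a model that is mediocre in DJF but relatively strong in JJA can outrank a model that excels only in DJF, which is the intended effect of weighting by the seasonal spread.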
Figure 6 shows the overall score for DJF and JJA in MMEs of increasing size, using the model rank. A threshold for whether it is worthwhile to use an MME is that the overall score exceeds that of the best individual model in each season. This is shown in Fig. 6 as a solid line, with the values given in Table 2. In DJF (Fig. 6a), this threshold is exceeded at S = 3. The MME overall score reaches 0.65 at S = 5, with only small changes with increasing size. The behavior is more complex in JJA (Fig. 6b). The overall score threshold is not exceeded until S = 11, although the MMEs of size S = 2 and S = 8 are close. The score then peaks at S = 13 before generally declining as more models are added.
The individual diagnostics [Eq. (10)] for each S mode are shown in Fig. 7 for MMEs of increasing size. For DJF (Figs. 7a–c), all diagnostics except the S-mode-1 relative standard deviation (Fig. 7c) show relatively little change after about S = 6. As previously noted, IPSL-CM5A-LR is an outlier among the CMIP5 models, and its inclusion at S = 11 results in the sharp jump seen in Fig. 7c. In contrast, for JJA (Figs. 7d–f) only the S-mode-2 and S-mode-3 slow SST–height covariance pattern correlations (Fig. 7e) show consistent behavior across the MMEs. Other diagnostic values generally do not stabilize until between about S = 9 and S = 12. Inspection of the MME S modes (not shown) confirms this, with large variations in the structure and nature of the modes until about S = 9, and only incremental changes in structure thereafter. The decrease in the MME JJA overall score after S = 13 is largely attributable to a steady decline in S-mode pattern correlations (Fig. 7d), although there are larger variations for S mode 3 between S = 13 and S = 23. Smaller decreases occur for large MMEs because of decreasing estimated variances (Fig. 7f).
Figure 6 suggests that the optimum CMIP5 MME size is seasonally dependent. For DJF, as few as 5 or 6 models may be sufficient. In contrast, in JJA around 11–13 models are more likely to be required. Larger MMEs may have poorer performance resulting from the addition of models that fail to reach the “reproduce reasonably well” threshold.
As mentioned above, the choice of weights should result in same-sized MMEs having decreasing overall scores as lower-ranked models are used. To check this, the 39 available models were formed into MMEs of successively lower model ranks with sizes of S = 6, 8, and 13. Figure 8 shows the MME overall scores for DJF and JJA, grouped by ensemble size. MMEs in each case are denoted by letter, with ensemble A containing the top ranked models. In DJF (Fig. 8a), ensemble A has the highest overall score in all three cases. Overall scores, except for S = 8 ensemble D, are monotonically decreasing. In JJA (Fig. 8b), the difference between MME overall scores is clear cut when the optimum size (S = 13) is used. However, differences become smaller at S = 8, and ensemble A drops below the threshold for usefulness. At S = 6, ensemble B now has the highest overall score. This result seems to depend on the inclusion of CESM1-BGC, which is ranked 7th. The key point, however, is that for JJA, results for MMEs smaller than the optimum size are highly sensitive to sampling error, in this case the models used.
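The partitioning into successive same-sized ensembles (A containing the top-ranked models, B the next, and so on) can be sketched as follows; the generic model names are placeholders for the ranked list in Table 2:

```python
# Partition a ranked model list into successive, non-overlapping
# ensembles of fixed size S, labeled A, B, C, ... with ensemble A
# holding the highest-ranked models.
from string import ascii_uppercase

def successive_ensembles(ranked_models, size):
    """Return {'A': top `size` models, 'B': next `size`, ...}."""
    groups = {}
    for start in range(0, len(ranked_models) - size + 1, size):
        label = ascii_uppercase[start // size]
        groups[label] = ranked_models[start:start + size]
    return groups

# 39 ranked models, as in the CMIP5 dataset used here (names hypothetical).
ranked = [f"model{r:02d}" for r in range(1, 40)]
groups = successive_ensembles(ranked, size=13)
print(list(groups))     # → ['A', 'B', 'C']
print(groups["A"][:2])  # → ['model01', 'model02']
```

With S = 13 the 39 models yield exactly three ensembles, matching the S = 13 case in Fig. 8; any leftover models short of a full ensemble are simply dropped.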
One possible way to overcome sampling error in JJA is to form MMEs using only one model from each center (Table 1). This is plausible, since model similarities may lead to oversampling, or loss of statistical independence (Knutti et al. 2013). Results for S = 8 with this added restriction are shown at the right of Fig. 8. There are only small changes in DJF (Fig. 8a) compared with the original S = 8 (middle group). However, in JJA (Fig. 8b) there are clear improvements in overall scores for ensembles A–D, with ensemble A now exceeding the threshold for usefulness.
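The one-model-per-center restriction amounts to a filter applied to the ranked list before forming ensembles; a minimal sketch is below, with a hypothetical center mapping standing in for Table 1:

```python
# Keep only the highest-ranked model from each modeling center before
# forming an ensemble, to reduce oversampling of closely related models.
def one_per_center(ranked_models, center_of):
    """Filter a best-first ranked list to one model per center."""
    seen, kept = set(), []
    for model in ranked_models:       # traversed in rank order
        center = center_of[model]
        if center not in seen:        # first (i.e., best) model from this center
            seen.add(center)
            kept.append(model)
    return kept

# Hypothetical model-to-center mapping (placeholder for Table 1).
center_of = {"m1": "X", "m2": "X", "m3": "Y", "m4": "Z"}
print(one_per_center(["m1", "m2", "m3", "m4"], center_of))  # → ['m1', 'm3', 'm4']
```

Because the traversal is in rank order, each center contributes its best-performing model, after which the first S survivors form the restricted ensemble.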
Figure 8 also highlights the sensitivity of MME performance to the condition that the individual models reproduce the 20CR S modes reasonably well. In DJF, only MMEs of the lowest ranked models and small size (e.g., S = 6, 8) have overall scores less than the best single model. Presumably this is because of the generally good performance of CMIP5 models in DJF, where only a few models have overall scores much less than the reproduce-reasonably-well threshold (Table 2). However, the requirement is much stronger in JJA. This is most clearly seen in the S = 13 case. Only ensemble A exceeds the threshold for usefulness, although the ensemble B overall score does exceed all individual models except CCSM4. Ensemble C does not reach the “reproduce reasonably well” threshold.
Weighting overall scores by the normalized standard deviation appears most likely to produce MMEs with the desired behavior. Large deviations from these weights (e.g., γJJA ≤ 1, or zero weight for one season) typically result in rankings for which MME overall scores (not shown) do not decrease monotonically in both seasons. Smaller, plausible changes in γJJA may affect the details of the results shown in Figs. 6–8, but not the key conclusions.
Finally, the use of periods other than 1951–2000 in four models (Table 1) also does not materially affect our results. This is consistent with the different periods in the reanalysis datasets (section 4). For example, the period 1962–2005 was also analyzed in the top 10 ranked models (Table 2), with near-zero difference in the mean total overall score. Although we consider ensemble estimates to best reflect the model performance, there is no difference between the CESM1-WACCM and EC-EARTH ensembles (section 5) and their respective single realizations (not shown) for the 1951–2000 period. A somewhat larger change occurs in HadGEM2-CC, where the single realization for the period 1951–2000 performs much worse than the ensemble estimate, particularly in DJF.
7. Conclusions and discussion
In this paper, the interannual variability of the seasonal mean SH 500-hPa geopotential height has been estimated for separate slowly varying (signal) and intraseasonal (noise) components. Coherent patterns, or modes, of interannual variability for these components for summer (DJF) and winter (JJA) were estimated for the CMIP5 dataset historical experiment. These were compared against those estimated from 20CR data for the period 1951–2000 using the assessment method developed by Grainger et al. (2013). Our key findings are as follows:
There are clear improvements in the CMIP5 dataset over CMIP3 in the reproduction of the leading three modes of variability in the slow component (the signal) of the 20CR data for both DJF and JJA. The largest improvement is in the spatial structures of the modes related to ENSO variability in DJF and their slow SST–height covariances.
These improvements translate to improvements in the overall score in the CMIP5 dataset over CMIP3 in both seasons. Differences in the medians in both seasons exceed the threshold for statistical significance.
There is little difference between the CMIP3 and CMIP5 datasets in their reproduction of the leading four modes of variability in the intraseasonal noise component of the 20CR data. The intraseasonal modes are generally well reproduced across the CMIP5 models in both seasons, with no systematic differences between models.
For obtaining multimodel ensembles for further analysis of SH atmospheric circulation interannual variability, a ranking based on the weighted mean of overall scores can be used. Weights using the relative spread of overall scores in each season are the most suitable for this purpose.
An estimate of the optimum MME size is obtained by generating MMEs of increasing size based on their rank. In DJF, as few as 5 or 6 models are required. In contrast, for JJA best performance is found for a MME of 13 models. However, by considering a greater spread of models (e.g., from different modeling centers), adequately performing MMEs from 8 models are possible.
It is necessary that an optimally sized MME consists of models that individually reproduce “reasonably well” the modes of variability. That is, their slow modes of variability generally do not reflect known model deficiencies. In DJF, the majority of models have overall scores exceeding an independent threshold. Consequently, unless a few poorly performing models are combined, almost all suitably sized MMEs will do better than the best individual model. In contrast, fewer than half of the available CMIP5 models exceed the independent threshold in JJA, and performance there depends strongly on the MME consisting of the highest-ranked models.
The focus of this paper has been on how well the CMIP5 models reproduce the modes of variability of the slow component (i.e., the signal) of seasonal mean SH 500-hPa geopotential height in the twentieth century. We believe that this is a necessary first step in identifying those models that are best suited for further studies of interannual variability, particularly those involving climate projections. We have also provided a qualitative estimate of the optimum size of MMEs for best analyzing interannual variability of SH atmospheric circulation. Using a diverse range of the best models may strike a balance between reliably reproducing the current climate, projecting future changes, and maintaining effective degrees of freedom among MME members (e.g., Knutti et al. 2010; Pennell and Reichler 2011; Yokohata et al. 2013). However, there is limited evidence that selecting only the so-called best models results in different climate projections (e.g., Pitman and Perkins 2008; Pierce et al. 2009), and it does not guarantee that the MME will behave the same as individual models (e.g., Knutti et al. 2010).
It is important to note that model performance in reproducing atmospheric circulation interannual variability should not be the only basis for model selection for projection and/or teleconnection studies. An understanding of model climatological mean state biases may also be required. For example, CMIP5 models generally have the well-known tropical Pacific cold tongue bias, and they show a range of behaviors in west Pacific warm pool SST (Brown et al. 2014). Whether this is consistent with model atmospheric feedback [e.g., as analyzed by Bellenger et al. (2014)] may determine the reliability of projected changes in the coupled covariability of Australian rainfall and SH atmospheric circulation. Similarly, there is no clear relationship between key diagnostics of radiative forcing (e.g., equilibrium climate sensitivity) and modeled changes in historical surface air temperature (e.g., Forster et al. 2013). Again, this may affect the climate change signal (i.e., the slow-external component of interannual variability) found in the CMIP5 climate change experiments.
In ongoing work, we will analyze projected changes in the modes of interannual variability and the pathway from global-scale radiative forcing to surface regional-scale climate change. MMEs of the best CMIP5 models identified here will be used, taking into consideration the points discussed above. The increased number of CMIP5 models, and their general improvement over CMIP3, should provide a strong basis for reliably estimating the projected changes in interannual variability.
We acknowledge the World Climate Research Programme’s Working Group on Coupled Modelling, which is responsible for CMIP, and the climate modeling groups (listed in Table 1) for producing and making available their model output. For CMIP, the U.S. Department of Energy’s Program for Climate Model Diagnosis and Intercomparison provides coordinating support and led development of software infrastructure in partnership with the Global Organization for Earth System Science Portals. We acknowledge the resources and support of the National Computational Infrastructure at the Australian National University, and the efforts of data downloaders in maintaining the CMIP5 data at the Australian Earth Systems Grid node. J. Sisson provided invaluable assistance in preprocessing the CMIP5 data. S. Grainger is supported by the Australian Government Department of the Environment through the Australian Climate Change Science Program. X. Zheng is supported by the 2010CB951604 National Program on Key Basic Research Project of China, and the 2013BAC05B04 Key Technologies Research and Development Program of China. Comments from A. Moise, S. Osbrough, and three anonymous reviewers helped to improve this paper.