We compare the performance of several modes of variability across six U.S. climate modeling groups, with a focus on identifying robust improvements in recent models [including those participating in phase 6 of the Coupled Model Intercomparison Project (CMIP6)] compared to previous versions. In particular, we examine the representation of the Madden–Julian oscillation (MJO), El Niño–Southern Oscillation (ENSO), the Pacific decadal oscillation (PDO), the quasi-biennial oscillation (QBO) in the tropical stratosphere, and the dominant modes of extratropical variability, including the southern annular mode (SAM), the northern annular mode (NAM) [and the closely related North Atlantic Oscillation (NAO)], and the Pacific–North American pattern (PNA). Where feasible, we explore the processes driving these improvements through the use of “intermediary” experiments that utilize model versions between CMIP3/5 and CMIP6 as well as targeted sensitivity experiments in which individual modeling parameters are altered. We find clear and systematic improvements in the MJO and QBO and in the teleconnection patterns associated with the PDO and ENSO. Some gains arise from better process representation, while others (e.g., the QBO) arise from higher resolution that allows for a greater range of interactions. Our results demonstrate that the incremental development processes in multiple climate model groups lead to more realistic simulations over time.
Around the world, and certainly in the United States, climate and weather model groups have been upgrading their codes for operational purposes and/or for contributions to new international projects [such as phase 6 of the Coupled Model Intercomparison Project (CMIP6); Eyring et al. 2016]. Preliminary analysis of these new model versions—both published (Del Genio et al. 2015; Rind et al. 2014; Golaz et al. 2019; Danabasoglu et al. 2020) and unpublished—has shown some remarkable increases in the fidelity of representation of important modes of variability. Most notably, representations of the Madden–Julian oscillation (MJO), the quasi-biennial oscillation (QBO), and patterns associated with El Niño–Southern Oscillation (ENSO) have greatly improved relative to model versions of only a few years ago.
This raises important scientific questions: what were the processes involved in this increase in skill? What is the balance between increases in vertical or horizontal resolution versus new physics or better tuning? Can we better predict the impact of climate change on these modes or interactions between them? The salience of these questions is increased by the upcoming IPCC Sixth Assessment Report, which will report on model evaluations and projections in 2021.
This paper reports on an in-depth comparison across all six U.S. climate modeling centers (Table 1). Compared to broader comparisons across the CMIP archive, this study has the advantage that we are able to dig deeper into intermediate versions of models that have not been included in CMIP6 and to encompass two groups that do not contribute to CMIP because they are focused on shorter prediction windows (weather to subseasonal variability). Intermediate model versions are analyzed for select modes for which the simulation duration is of sufficient length to characterize the mode in question. Within these analyses, we focus on 1) using consistent and robust diagnostics across all modes and models (atmospheric and coupled), 2) attempting to track down the reasons for model skill improvements, and 3) identifying continuing and persistent systematic biases.
The inclusion of dynamic variability in the climate system has been the goal of general circulation modeling since the beginning (e.g., Phillips 1956; Manabe and Bryan 1969; Hansen et al. 1983). Some modes, which are dependent on the largest-scale features of the synoptic circulation, such as the North Atlantic Oscillation (NAO) [or the closely related northern annular mode (NAM)] and the southern annular mode (SAM), have been represented in all models. For other modes of variability, however, it has long been recognized that they rely on wave motions or specific climate features that were not resolvable using the horizontal and vertical discretizations and/or configurations achievable in early generations of models.
Developments since then, and particularly within the CMIP process, have clearly identified necessary (though not sufficient) requirements for models to realistically exhibit specific modes of variability. An obvious example is the simulation of ENSO, which, at minimum, requires sufficient resolution to resolve the equatorial Kelvin waveguide in the Pacific Ocean (Kang and An 1998). Models whose ocean components lack sufficient resolution will still exhibit tropical variability, but the magnitude and transient structure of that variability will not be realistic (Russell et al. 1995; Clement et al. 2011). Similarly, the capacity to produce a QBO relies on sufficient vertical resolution in the lower stratosphere (~500 m) (Geller et al. 2016).
For some modes though, resolution plays little role. For example, the inability of models to produce an MJO had long been a puzzle until the early successes of Inness et al. (2003), among others. For this mode, the key issues revolved around simulating the processes of convection and the tropical boundary layer sufficiently well to prevent excessive mixing (e.g., Kim et al. 2012). This is similar to representations of the Pacific decadal oscillation (PDO), whose status remains ambiguous—is it the decadal expression of ENSO, or something driven independently? Or is it mainly a statistical description (Newman et al. 2016)? There are no obvious resolution-based reasons to expect (or not) the presence of a realistic PDO, and yet model representations have historically been very diverse.
The lack of any obvious barriers to simulations of these last two modes has led some to speculate that new physics or radical new approaches might be needed to improve their representation in models. Meanwhile, the “normal” development of general circulation models (GCMs) (in the Kuhnian sense) has proceeded apace. The extent to which significant improvements have been made will be a testament (or not) to our increasing understanding of the climate system.
There have been far too many modes of variability identified in the literature for our analysis to be comprehensive, so our focus will be on the principal, well-recognized modes that have been robustly identified in the modern climate record. In the tropics, this includes coupled modes like ENSO, the PDO, and the MJO as well as the primarily atmospheric QBO in the tropical stratosphere. In the extratropics, this includes the NAO/NAM, SAM, and Pacific–North American (PNA) patterns. While we consider all seasons, we focus primarily on those during which these modes are dominant [i.e., December–February (DJF) for the NAO/NAM, PNA, and ENSO; and June–August (JJA) and DJF for the SAM].
Note that, while here we distinguish between the NAM and the NAO, there have been different perspectives on their relationship [see Thompson et al. (2003) and references therein]. Various studies suggest they are indistinguishable (Wallace 2000; Feldstein and Franzke 2006; Dai and Tan 2017), connected (e.g., Thompson and Wallace 1998; Gong et al. 2002; Cohen and Barlow 2005; Cohen et al. 2005; Stephenson et al. 2006; Rivière and Drouard 2015; Song 2019), independent (e.g., Ambaum et al. 2001; Deser 2000), or mixed by season (Rogers and McHugh 2002). In this paper, however, we diagnose the NAM and NAO separately in order to provide both regional and hemispheric perspectives on northern extratropical variability. For ENSO and the PDO, in addition to the spatial patterns and teleconnections we also consider spectral behavior.
2. Models and analysis
We primarily use coupled atmosphere–ocean simulations together with a few historical atmosphere-only (AMIP) (Gates et al. 1999) simulations. For those models submitting to CMIP6 [e.g., Community Earth System Model (CESM), ModelE, Energy Exascale Earth System Model (E3SM)] these experiments correspond to the “Historical” experiment (Eyring et al. 2016). Other simulations analyzed were obtained directly from modeling groups [e.g., the Goddard Earth Observing System (GEOS) and Global Ensemble Forecast System (GEFS) subseasonal forecasts]. The type and number of ensemble members per model submission considered here vary as described in more detail below (and in Table 3). We examine both current versions of the model as well as prior versions and selected development versions when available and relevant.
a. Model descriptions
The salient details for the models used in this study (e.g., model components, resolution, parameterizations) are summarized in Table 2. Here we briefly describe the models utilized in this study, directing readers who seek full details to the references described herein.
Four versions of the CESM were used in this analysis: CESM1 [using the Community Atmosphere Model (CAM), version 5], CESM1 [using the Whole Atmosphere Community Climate Model (WACCM), version 5], CESM2 (using CAM6), and CESM2 (using WACCM6), which are documented in Hurrell et al. (2013), Mills et al. (2017), Danabasoglu et al. (2019), and Gettelman et al. (2019), respectively. The U.S. Department of Energy (DOE) E3SM, version 1 (v1) (Golaz et al. 2019), branched from the CESM1 model, but has evolved significantly (Rasch et al. 2019; Xie et al. 2018). Five versions of the NASA Goddard Institute for Space Studies (GISS) ModelE were included as well, two from CMIP5 (E2-R and E2-H; Schmidt et al. 2014) and three CMIP6 versions [E2.1-G and E2.1-H (Kelley et al. 2019) and E2.2-G (Rind et al. 2020)]. The -G (-R in CMIP5) and -H indicate two different ocean models. Finally, three Geophysical Fluid Dynamics Laboratory (GFDL) models were used: CM3 (Griffies et al. 2011; Donner et al. 2011), CM4 (Held et al. 2019; Zhao et al. 2018a,b), and ESM4 (Dunne et al. 2019, manuscript submitted to J. Adv. Model. Earth Syst.). We also reference simulations from the suite of U.S. models that were used in CMIP3 (Meehl et al. 2007) as a baseline for the changes seen in later CMIP iterations.
In addition to the CMIP models, we also consider an ensemble of 10 free-running integrations produced by the NASA Global Modeling and Assimilation Office (GMAO) using GEOS, version 5 (GEOS-5; Molod et al. 2015), and an ensemble of forecasts from two operational subseasonal forecasting modeling groups: the GMAO subseasonal 45-day long forecasts (Molod et al. 2020) and the GEFS Subseasonal Experiment (SubX) forecasts from the National Centers for Environmental Prediction (NCEP) (Zhu et al. 2018). The GMAO forecasts are fully coupled, whereas the GEFS are uncoupled.
While each modeling center has different development targets, we note a few relevant developments common to models considered in this study. First, most models, particularly those participating in CMIP6, have increased the height of the model top as well as the vertical resolution. This appears to play a critical role in the fidelity of the QBO (Geller et al. 2016; see section 3d). Second, there have been improvements to the models’ representation of gravity wave drag, which in turn have also improved simulation of the QBO. These have come by way of improved parameterizations [e.g., CESM2(WACCM6), GISS E2.2, CM4] and improved tuning of existing parameterizations (e.g., E3SMv1-MODGWD; i.e., E3SMv1 with modified gravity waves). Third, the treatment of shallow convection has improved in E3SM, CESM2, GISS E2.1, and CM4, including new parameterizations and tunings. These improvements positively impact the simulated MJO (section 3a).
The specific experiments submitted by each modeling center are summarized in Table 3. For most of the CMIP models, some number of ensemble members of the “Historical” experiment are analyzed. For the GMAO ensemble (hereafter M2AMIP) we have considered a 10-member ensemble of AMIP simulations initialized from meteorological fields from different dates in November 1979 using identical sea surface temperatures and sea ice concentrations. The 45-day NASA-GMAO subseasonal forecasts were initialized from MERRA-2 (atmosphere) and the GMAO S2S-2.0 ocean analysis (ocean), respectively. An ensemble of four forecasts was initialized at 5-day intervals for years 1999–2016. For the GEFS forecasts, an 11-member ensemble of 35-day-long forecasts was used for all years spanning 1999–2016, each of which consists of 1 control and 10 perturbed members.
b. Intermediary model version simulations
We also incorporate analysis of simulations using “intermediary” model versions that were developed between CMIP3/5 and CMIP6, as well as after an initial CMIP6 submission. While they were not originally designed as individual sensitivity experiments, we have found that these simulations contribute toward our understanding of the physical processes responsible for improved representations of different modes of variability across models.
For our analysis of the MJO, we incorporate historical coupled simulations produced using a version of GISS ModelE that represents an intermediary model tag between the CMIP5 Model E2 (Schmidt et al. 2014) and the CMIP6 Model E2.1 (Kelley et al. 2019). In this model version (hereafter GISS ModelE2MJO) the main differences relative to E2 include 1) an increase in one of the parameterization’s plume entrainment rates and 2) an option to allow for reevaporation above the cloud base. The impact of these changes on AMIP simulations of the MJO was documented in Kim et al. (2013).
To identify mechanisms for improved simulations of the QBO we include historical simulations produced using a version of E3SMv1 (hereafter E3SMv1-MODGWD) in which the parameterized convectively generated gravity waves were altered as described in Richter et al. (2019, hereafter R19). In particular, two changes were made: 1) the convective fraction relating the tropospheric heating rate within convective cells to the GCM gridbox-averaged heating rate was increased from 5% to 8% and 2) the efficiency with which convection generates gravity waves was decreased from a default value of 0.4 in E3SMv1 to 0.35 in E3SMv1-MODGWD. R19 showed that these two changes have significant impacts on the QBO in that model.
To further understand the influence of model tuning of clouds in simulating both tropical and extratropical coupled modes of variability, a sensitivity experiment conducted using CESM2 is considered, referred to here as CESM2-gamma. CESM2 utilizes the Cloud Layers Unified by Binormals (CLUBB) shallow turbulence scheme. In CESM2-gamma, only the gamma coefficient, which has been identified as a critical parameter for low cloud feedback responses to climate change (Gettelman et al. 2019), is modified from the official CESM2 version. Specifically, gamma controls the width of the vertical velocity probability distribution function and exercises a strong influence over low-cloud cover.
c. Analysis tools and observational products
The observational products and analysis measures we use are summarized in Table 4. In particular, the tropical and extratropical modes of variability examined in this study are evaluated using both the Climate Variability Diagnostics Package (CVDP; Phillips et al. 2014) and the PCMDI Metrics Package (PMP; Gleckler et al. 2016). The extratropical modes are evaluated using two approaches. The first is conventional empirical orthogonal function (EOF) analysis, in which, for example, EOF1 in the observations is compared to EOF1 from each of the models (Stoner et al. 2009; Phillips et al. 2014). The second is the common basis function (CBF) approach, in which model anomalies are projected onto the observed EOF to obtain the CBF principal component (CBF PC; Lee et al. 2019); the model mode spatial pattern is then obtained by regressing the CBF PC back onto the model anomalies. We use both methods because, in the conventional approach, mode swapping may preclude the relevant model mode from being compared to the observations. We find, however, that the relative performance of the models is typically consistent across the two methodologies, although, as reported in Lee et al. (2019), the models tend to appear more skillful under the CBF method than under the standard EOF approach.
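The CBF projection and back-regression described above can be summarized in a few lines. The following is a minimal sketch, not the PMP implementation itself, assuming anomaly fields flattened to a (time, space) array; the names (`cbf_pattern`, `model_anoms`, `obs_eof`) are hypothetical:

```python
import numpy as np

def cbf_pattern(model_anoms, obs_eof):
    """Common basis function (CBF) sketch, after Lee et al. (2019).

    model_anoms : (time, space) model anomaly field
    obs_eof     : (space,) observed EOF loading pattern
    Returns the standardized CBF PC and the model mode spatial pattern.
    """
    # Project model anomalies onto the observed EOF -> CBF principal component
    pc = model_anoms @ obs_eof
    pc = (pc - pc.mean()) / pc.std()          # standardize the CBF PC
    # Regress the CBF PC back onto the model anomalies -> spatial pattern
    pattern = (model_anoms.T @ pc) / len(pc)  # regression coefficient per point
    return pc, pattern
```

Because the projection uses the observed EOF as the basis, the recovered pattern cannot "swap" with a higher-order model mode, which is the motivation given above for using the CBF approach alongside conventional EOFs.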
The period of analysis for the extratropical modes is 1900–2005 for models and observations, for which we use both the ERA twentieth-century reanalysis (ERA20C; Poli et al. 2016) and the NOAA twentieth-century reanalysis (20CR; Compo et al. 2008) for years 1900–78 and ERA-Interim (Dee et al. 2011) from 1979 to the present. The one exception is the SAM, which we evaluate over the period 1956–2005, since there are substantial differences during the earlier part of the twentieth century among various observed and reanalyzed datasets (Lee et al. 2019). Furthermore, model skill for the extratropical modes is illustrated using Taylor diagrams (Taylor 2001), in which the radial distance from the origin is the spatial standard deviation normalized by the observed standard deviation, the distance between the observed reference point and the model point is the centered root-mean-square error (RMSE), and the azimuthal angle is the pattern correlation between the model and the reference observations. The full suite of Taylor diagrams across all modes and seasons is too large to present in the main text but can be found online (see data availability statement).
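The three statistics that place a model on a Taylor diagram satisfy a law-of-cosines relation, E'&#178; = &#963;m&#178; + &#963;o&#178; &#8722; 2&#963;m&#963;oR, which is what allows them to share one diagram. A minimal sketch of their computation for flattened spatial fields (the function name `taylor_stats` is hypothetical):

```python
import numpy as np

def taylor_stats(model, obs):
    """Normalized Taylor diagram statistics (Taylor 2001) for 1D-flattened fields.

    Returns (normalized std dev, pattern correlation, normalized centered RMSE).
    """
    m, o = model - model.mean(), obs - obs.mean()    # center both fields
    sigma_m, sigma_o = m.std(), o.std()
    corr = (m * o).mean() / (sigma_m * sigma_o)      # pattern correlation
    crmse = np.sqrt(((m - o) ** 2).mean())           # centered RMSE
    return sigma_m / sigma_o, corr, crmse / sigma_o  # normalize by observed std
```

With these normalized values the relation above reduces to E'&#178; = s&#178; + 1 &#8722; 2sR, so a model at radius 1.0 with correlation 1.0 has zero centered RMSE, i.e., it sits on the observed reference point.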
The MJO analysis is predominantly based on diagnostics performed on daily data of precipitation and 850-hPa winds over the period from 1999 (when credible gridded precipitation observations begin) up to the present. The procedure follows the metrics described by Jiang et al. (2015), which are summarized here for completeness. Lag regressions of precipitation and zonal winds are obtained by projecting the daily fields onto a time-varying index of 20–100-day filtered precipitation averaged over 85°–95°E and 5°N/S. Time–longitude diagrams of the regression fields are obtained by averaging each lag day over latitudes spanning 15°N/S. Pattern correlations with the observations are obtained by correlating the time–longitude diagrams of the models analyzed herein with precipitation from the 3B42 daily product from the Tropical Rainfall Measuring Mission (TRMM) (Iguchi et al. 2000) and the 850-hPa zonal winds from ERA5 (Hersbach and Dee 2016). A similar pattern correlation analysis is performed for the wavenumber–frequency representation of the fields, in which the signal strength S is defined as the ratio of the difference between the power spectrum P and the red noise spectrum R to the power spectrum itself [S = (P − R)/P] (Clark et al. 2020). The calculation of the power spectrum follows that of Wheeler and Kiladis (1999), and the red noise spectrum follows the procedure of Masunaga (2007), following the guidelines outlined by Waliser et al. (2009). The east–west power ratio is calculated following the procedure outlined by Sperber and Kim (2012) as the ratio between the power spectra of eastward- and westward-propagating zonal wavenumbers 1–5 at time scales between 20 and 100 days.
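The east–west power ratio reduces to summing wavenumber–frequency power over two quadrants of a 2D Fourier transform. The sketch below illustrates the idea only (it omits the tapering, segmenting, and red-noise steps of Wheeler and Kiladis 1999); the function name `east_west_ratio` and the sign convention for "eastward" are assumptions tied to NumPy's FFT conventions:

```python
import numpy as np

def east_west_ratio(precip, dt_days=1.0):
    """East/west power ratio sketch (in the spirit of Sperber and Kim 2012).

    precip : (time, lon) equatorial-belt-averaged anomaly field.
    Returns the ratio of eastward to westward power for zonal wavenumbers
    1-5 and periods of 20-100 days.
    """
    nt, nx = precip.shape
    spec = np.abs(np.fft.fft2(precip)) ** 2 / (nt * nx)   # raw power spectrum
    freq = np.fft.fftfreq(nt, d=dt_days)                  # cycles per day
    wn = np.fft.fftfreq(nx) * nx                          # zonal wavenumber
    F, K = np.meshgrid(freq, wn, indexing="ij")
    # MJO band: periods 20-100 days, wavenumbers 1-5
    band = ((np.abs(F) >= 1/100) & (np.abs(F) <= 1/20)
            & (np.abs(K) >= 1) & (np.abs(K) <= 5))
    # With np.fft.fft2 over (time, lon), a wave cos(kx - wt) (eastward for
    # k, w > 0) puts its power where frequency and wavenumber have opposite
    # signs, so F*K < 0 is eastward under this convention.
    east = spec[band & (F * K < 0)].sum()
    west = spec[band & (F * K > 0)].sum()
    return east / west
```

For observed precipitation the ratio is roughly 2–3, and the rightward shift of the CMIP6 models in Fig. 2 corresponds to this quantity moving toward the observed value.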
The MJO forecast skill of the two subseasonal forecast ensembles is estimated by comparing real-time multivariate MJO (RMM) indices derived from each forecast model with an RMM index obtained from ERA5 data. The RMM index for ERA5 is obtained following Wheeler and Hendon (2004) as the combined EOF of OLR, u200, and u850. For each field, the mean and seasonal cycle are removed, and the fields are averaged over the 15°N/S latitude belt and normalized by their zonally averaged variance before they are combined into a single vector. EOF analysis is then performed on this vector of combined fields. The EOFs from ERA5 are projected onto the GEFS and GEOS-S2S-2 OLR, u200, and u850 anomalies to obtain their respective RMM time series. This method follows those outlined by Gottschalck et al. (2010) and Vitart (2017). Bivariate correlations are calculated for the two datasets following the method described by Gottschalck et al. (2010) but extended to include correlations corresponding to each calendar month as in Molod et al. (2020). We create subsets of the RMM indices for each calendar month, and bivariate correlations are made based on each forecast day and calendar month.
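The bivariate correlation of Gottschalck et al. (2010) treats (RMM1, RMM2) as a two-component vector at each time and correlates the observed and forecast vectors at a fixed lead time. A minimal sketch (the name `bivariate_corr` and the (time, 2) array layout are assumptions):

```python
import numpy as np

def bivariate_corr(rmm_obs, rmm_fcst):
    """Bivariate MJO correlation (after Gottschalck et al. 2010) at one lead.

    rmm_obs, rmm_fcst : (time, 2) arrays of (RMM1, RMM2) pairs.
    """
    # Sum of dot products of the 2D RMM vectors at each time...
    num = np.sum(rmm_obs[:, 0] * rmm_fcst[:, 0] + rmm_obs[:, 1] * rmm_fcst[:, 1])
    # ...normalized by the amplitudes of the two series
    den = np.sqrt(np.sum(rmm_obs ** 2)) * np.sqrt(np.sum(rmm_fcst ** 2))
    return num / den
```

A perfect forecast gives 1.0; a forecast whose MJO phase is rotated 90° from the observations gives 0, which is why this measure penalizes phase (propagation) errors as well as amplitude errors. The value 0.5 used as the skill threshold in Fig. 3 corresponds to the conventional cutoff in the subseasonal forecasting literature.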
Finally, evaluations of the QBO across the models are based primarily on diagnostics derived from zonal and monthly averaged zonal wind output, available for some models only at 10, 30, 50, 70, and 100 hPa, using the metrics outlined in Schenzinger et al. (2017). (For MERRA-2, M2AMIP, and the GISS ModelE simulations, the native vertical resolution output was used.) Comparisons of models over the period 1980–2016 are made against MERRA-2, which exhibits a realistic QBO compared to observations in terms of its zonal winds, mean meridional circulation, and associated ozone changes (Coy et al. 2016). The lack of stratospheric data available at higher temporal resolution in the models prevents us from doing as rigorous an evaluation as has been done in the recent Stratosphere–Troposphere Processes and Their Role in Climate (SPARC) Quasi-Biennial Oscillation initiative (QBOi) (Butchart et al. 2018), for which the six-hourly output required both to calculate the transformed Eulerian mean circulation and to compare equatorial wave spectra was available. Nonetheless, the data analyzed here do provide some insight into the state of the QBO and its representation across the models considered in this study.
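Among the Schenzinger et al. (2017) metrics, the simplest is the dominant QBO period, which can be estimated from a monthly-mean equatorial zonal-wind series via its power spectrum. A sketch under that assumption (the function name `qbo_period_months` is hypothetical; the published metrics are more elaborate):

```python
import numpy as np

def qbo_period_months(u_eq):
    """Estimate the dominant QBO period (months) from monthly-mean
    equatorial zonal wind at a single stratospheric level (e.g., 30 hPa)."""
    u = np.asarray(u_eq, dtype=float)
    u = u - u.mean()                          # remove the time mean
    spec = np.abs(np.fft.rfft(u)) ** 2        # one-sided power spectrum
    freq = np.fft.rfftfreq(len(u), d=1.0)     # cycles per month
    k = np.argmax(spec[1:]) + 1               # skip the zero-frequency bin
    return 1.0 / freq[k]
```

Observed 30-hPa winds give a value near 28 months; models with insufficient vertical resolution or weak gravity wave forcing typically show either no spectral peak in this band or a peak at the annual cycle instead.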
3. Results
In this section we describe the fidelity of each climate mode separately and the improvement (or lack thereof) from CMIP3/5 to CMIP6. When improvements are found we also use intermediary model version experiments in order to understand the drivers of changes in model performance.
a. Madden–Julian oscillation
The results of our MJO analysis for the U.S. climate models are shown in Figs. 1 and 2. As a guideline, in Fig. 2 the closer the individual points in the scatter are to the gray star, the closer the simulation is to the observations (here TRMM for precipitation and ERA5 for the zonal winds). Overall, the five models from CMIP6 considered here exhibit enhanced eastward propagation compared to the CMIP5 models. The amplitudes of the MJO-related winds and precipitation are also improved in CMIP6 models relative to observations. When assessing individual models, the improvements are even clearer. For example, CESM2 exhibits stronger wind anomalies compared to CESM1 as well as more coherent eastward propagation (Figs. 1c,d). Similar results are seen for the other models, both in terms of precipitation and zonal wind (see data availability statement for details).
The east–west (EW) power ratio is shown in Figs. 2c and 2d. When compared to the CMIP5 models, the CMIP6 models exhibit an increased EW ratio that compares more closely with observations, manifest as a rightward shift in Fig. 2 in both precipitation and wind. To showcase this change in signal, Figs. 1a and 1b compare the signal strength of precipitation between GFDL’s CMIP5 CM3 model (Fig. 1a) and CMIP6 CM4 model (Fig. 1b). The darker shading for eastward-propagating zonal wavenumbers 1–5 and time scales ranging from 20 to 100 days is clearly evident in CM4.
The models considered not only exhibit closer agreement with the TRMM measurements and ERA5 data but also show an improved space–time spectrum of all waves. This can be seen by considering the y axis in Figs. 2a and 2b, which shows the pattern correlation of the signal strength of the individual models with respect to the observations. For the spectrum in precipitation, it is clear that all CMIP6 models exhibit an increased correlation relative to their corresponding CMIP5 versions. A less distinct but nonetheless positive improvement is also observed for the spectrum of zonal winds.
For the two subseasonal forecast ensembles analyzed in this study we examine forecast skill by calculating monthly bivariate correlation coefficients with respect to RMM indices derived from ERA5 (Fig. 3). Overall, the GEOS-S2S-2 and GEFS ensembles exhibit qualitatively similar correlations. In particular, the correlation in both models decays to 0.5 near forecast day 20, consistent with the results presented in Pegion et al. (2019) and Kim et al. (2019). Some differences are observed, however, when analyzing individual months. GEOS-S2S-2 exhibits a slower decorrelation time during late boreal summer (JAS), with correlations near 0.4 up to 40 days during August, consistent with the findings in Molod et al. (2020). On the other hand, the decorrelation time is faster during November and February, when correlations are below 0.3 at forecast day 25. In contrast to the GEOS-S2S-2 forecasts, the GEFS ensemble exhibits similar decorrelation times for nearly every month, with the notable exception of late summer (JAS), when it decorrelates faster.
Intermediary model version experiments
To understand improvements in the representation of the MJO, we include results from an intermediary version of GISS ModelE (denoted GISS ModelE2MJO) that represents a development version between the CMIP5 E2 submission and the CMIP6 E2.1 submission. This version (yellow squares in Fig. 2) shows significant improvements over the original GISS E2 model owing to several modifications to the convective parameterization (see section 2b for details), which yield a convection scheme that is more sensitive to environmental relative humidity and a more humid mean state, both of which have been shown to be consistent with improved MJO simulation (Kim et al. 2012; Del Genio et al. 2012). This result is also consistent with Zhao et al. (2018a), who show that transient variability in the tropics increases when the rate of cumulus lateral mixing and convective rain reevaporation are increased in an entirely different model [GFDL’s atmospheric general circulation model, version 4.0 (AM4)], suggesting that the mechanism for MJO improvement demonstrated here is not specific to ModelE.
b. Extratropical modes
Collective skill assessments of numerous extratropical modes of variability have been discussed widely in the literature (e.g., Stoner et al. 2009; Phillips et al. 2014; Lee et al. 2019). Here we analyze the SAM (Figs. 4a,b), the NAM (Fig. 4c), the PNA (Fig. 4d), the NAO (Fig. 4e), and the PDO (Fig. 4f). Our analyses of the SAM, NAM, NAO, and PNA are based on seasonally averaged sea level pressure anomalies, with a focus on the dominant (winter) season; for the SAM we also consider the (DJF) summer season, since its interannual variability is nearly identical to that occurring during JJA. For the PDO we use monthly anomalies of sea surface temperature.
For the case of the SAM during JJA (Fig. 5a), in all U.S. models the SAM appears to have been better represented in CMIP3 than in CMIP5 and CMIP6. An evaluation of the SAM during DJF (Fig. 5b), however, shows the opposite, with most CMIP6 models, the exception being E3SMv1, outperforming earlier Model Intercomparison Project (MIP) versions. Thus, while the SAM exhibits some of the most pronounced skill improvement compared to the other extratropical modes, this improvement is only realized during DJF and does not apply more generally to other seasons.
Consideration of the U.S. modeling groups one at a time affords a somewhat clearer indication that skill has improved since CMIP3. [Note that, despite its incorporation of major changes in physics (Golaz et al. 2019), the E3SMv1 model is included in the discussion among the NCAR models.] In particular, for the NAM during DJF, the GFDL models show improved skill in CMIP6 compared to previous MIPs (Fig. 5c). For the NCAR and DOE models (not shown), the Taylor diagrams indicate that E3SMv1 and CESM1 perform best, with the remaining CMIP6 models tending to be better than the other CMIP5 and CMIP3 model versions. For the GISS models (Fig. 5d), the CMIP5 version performed best.
Of the extratropical modes analyzed, the PNA exhibits the largest diversity in skill across the entire ensemble of MIPs and models (not shown). CMIP3 was especially problematic, with three of the models having higher-order EOFs with markedly better skill than their corresponding EOF1. This indicates that these models were not properly simulating the hierarchy of observed EOFs due to mode swapping. Among the GFDL models (Fig. 5e), the CMIP6 simulations lie closest to the 1.0 reference line, indicating that their interannual variability (and thus pattern amplitude) is consistent with observations, although GFDL-ESM4 has a smaller pattern correlation and larger RMSE than GFDL CM4. For GISS (Fig. 5f), the CMIP5 models performed best, with the CMIP3 (CMIP6) models underestimating (overestimating) the interannual variability and pattern amplitude. For the NCAR and DOE models (not shown), E3SMv1 and CESM1(CAM5) are most skillful, with the other CMIP6 models being more skillful than the other CMIP5 or CMIP3 models.
Finally, of the modes analyzed, the NAO is best simulated overall, with pattern correlations of ~0.95 and RMSE values less than 0.5 hPa (Fig. 5g). Collectively, CMIP6 has smaller skill dispersion than CMIP5 or CMIP3, with models tending to be located closer to the 1.00 reference line. For GFDL, CMIP6 is marginally more skillful than the other MIPs, while CMIP6 GISS E2.1-G overestimates the pattern amplitude compared to the CMIP5 and CMIP3 versions of the model. For the NCAR family of models (Fig. 5h), most of the CMIP5 models [especially CESM1(CAM5)] and E3SMv1 marginally outperform their CMIP6 and CMIP3 counterparts.
To summarize: only for the SAM during DJF do we see collective improvement from CMIP3 to CMIP6 in the representation of extratropical coupled modes among all models. Otherwise, the intermodel scatter is large, spanning the limited and varied number of ensemble members across the MIP generations. Our conclusion is that, for the extratropical modes, the skill improvement from CMIP3 to CMIP6 is highly mode and season dependent. Taylor diagrams for additional modes and seasons are available online (see data availability statement).
The relevant sensitivity experiment for the extratropical atmospheric modes is the CESM2-gamma historical simulation described in section 2b. Results for this simulation are shown in the panels of Fig. 5 that include the family of NCAR models (Figs. 5a,b,g,h). For the most part, the differences between the CESM2 and CESM2-gamma comparisons with reference data are nominal and could be due to the limited sample (only one realization of CESM2-gamma is included). One exception is the SAM during JJA, for which this sensitivity simulation appears to be an outlier in both pattern and amplitude. This aside, however, the performance of the extratropical atmospheric modes does not appear to be clearly sensitive to the gamma coefficient in the CLUBB shallow turbulence scheme as applied in CESM2.
c. Tropical coupled modes
1) El Niño–Southern Oscillation
Composites of ENSO events are derived from both ERA20C and the CMIP models (Fig. 6) using normalized detrended December Niño-3.4 time series that are smoothed with a binomial filter, selecting all years in which anomalies exceed +1 standard deviation (El Niño) and all years in which anomalies fall below −1 standard deviation (La Niña). Mean model biases (Fig. 6b) indicate a systematic weakness in the ENSO pattern in models, as they are negatively correlated with the observed pattern of ENSO (Fig. 6a), which is characterized by negative pressure anomalies in the eastern tropical Pacific Ocean that extend to higher latitudes over the northeastern Pacific, northern Atlantic, and Southern Ocean (Sarachik and Cane 2010) and by positive pressure anomalies over the western tropical and subtropical Pacific Ocean. There are also biases arising from a westward displacement of simulated anomalies (discussed further below), as evident in the negative biases in the western Pacific Ocean and a dipole of biases in the northern Pacific Ocean (Fig. 6b).
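The event-selection step just described (detrend, binomially smooth, normalize, threshold at ±1 standard deviation) can be sketched as follows. This is an illustration under stated assumptions, not the CVDP code; the function name `enso_event_years`, the linear detrending, and the 1-2-1 filter as the binomial smoother are assumptions:

```python
import numpy as np

def enso_event_years(nino34_dec, years):
    """Select El Nino / La Nina composite years from December Nino-3.4 values.

    nino34_dec : one December value per year; years : matching year labels.
    """
    x = np.asarray(nino34_dec, dtype=float)
    yrs = np.asarray(years, dtype=float)
    x = x - np.polyval(np.polyfit(yrs, x, 1), yrs)       # linear detrend
    x = np.convolve(x, [0.25, 0.5, 0.25], mode="same")   # 1-2-1 binomial filter
    x = x / x.std()                                      # normalize
    el_nino = [y for y, v in zip(years, x) if v > 1.0]   # warm events
    la_nina = [y for y, v in zip(years, x) if v < -1.0]  # cold events
    return el_nino, la_nina
```

The composite patterns in Fig. 6 are then simply the averages of the anomaly fields over the selected warm and cold years, in observations and in each model separately.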
Decomposing the models’ bias patterns using EOFs reveals that the leading EOF, which explains a substantial amount of intermodel variance (28%), correlates strongly with the composite pattern (Fig. 6c), particularly near the Aleutian low (Butler et al. 2014). The second EOF, which also explains a significant amount of variance (17%), is characterized by negative values in the western Pacific Ocean and positive values in the Arctic (Fig. 6d). Negative loadings of both EOFs are found in the U.S. CMIP6 models that most closely agree with the observations. In particular, the best U.S. model is characterized by a pattern that strongly resembles the observations in both hemispheres and all ocean basins, although its spatial variance is somewhat stronger (Fig. 6f). In contrast, the model that least agrees with the observations is characterized by weaker than observed teleconnections, both within and outside of the tropics, and exhibits large-scale teleconnections that are opposite in sign to those observed in some regions (i.e., the North Atlantic Ocean) (Fig. 6e).
The observed transient evolution of El Niño (Fig. 7) shows a gradual warming of the tropical Pacific Ocean from the date line to the western coast of the Americas in year 0, reaching a peak near 2 K in December and transitioning, on average, to cooler-than-normal conditions of about −0.1 K one year later. The mean model bias (Fig. 7b) is characterized by warm anomalies that extend too far westward (identified previously in Fig. 6b) and occur too late in the seasonal cycle, as evidenced by a broad band of positive SST biases in late spring of year 1. The leading EOF of model bias for the Niño Hovmöller plots (Fig. 7c), which explains a large fraction of intermodel variance (45%), strongly correlates with the observed composite, indicating a systematic overestimation by many models of the time–longitude structure.
The second bias EOF, which explains 18% of the variance, is characterized by strong warm anomalies in year 1, suggesting a failure to adequately transition to La Niña conditions over time in some models (Fig. 7d). In the best model (Fig. 7e) and consistent with observations (Fig. 7a), anomalies intensify during year 0, are located primarily east of the date line with cool anomalies to their west, and peak in December, after which they transition rapidly to cold anomalies in spring of year 1. In contrast, in the poorer performing model (Fig. 7f), positive anomalies also grow gradually through year 0 but extend well into year 1. A large-scale transition to cool Pacific SSTs in year 2 is simulated in the poorer scoring model, but to an extent that is weaker than observed. That said, even the poorer scoring CMIP6 U.S. model is considerably more skillful than many other models included in the CMIP archives (not shown).
The Taylor diagram of the PDO, derived from the leading EOF of North Pacific (20°–70°N) SST anomalies, suggests that the RMSE has been reduced and pattern correlation increased from CMIP3 to CMIP5 to CMIP6 (Fig. 8). Among CMIP5 and CMIP6 versions of both the GFDL and GISS models, the representation of the PDO has improved and among the NCAR models, CMIP6 CESM2 has the largest pattern correlation and smallest RMSE values compared to either CMIP5 or CMIP3 versions. The overall tendency, therefore, is for CMIP6 models to have larger pattern correlations and smaller RMSE than their corresponding CMIP3 and CMIP5 versions, a result that also holds using the common basis function approach (not shown).
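The PDO index evaluated here is defined as the leading EOF of North Pacific (20°–70°N) SST anomalies. A minimal sketch of such an EOF calculation via SVD is given below; the sqrt(cos-latitude) area weighting and all variable names are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np

def leading_eof(sst_anom, lats):
    """Leading EOF of SST anomalies with shape (time, lat, lon).

    A minimal sketch, assuming sqrt(cos(lat)) area weighting and an SVD
    of the (time x space) anomaly matrix. Returns the spatial pattern,
    the principal component, and the explained-variance fraction.
    """
    nt, ny, nx = sst_anom.shape
    w = np.sqrt(np.cos(np.deg2rad(lats)))[None, :, None]
    x = (sst_anom * w).reshape(nt, ny * nx)
    x = x - x.mean(axis=0)                      # remove the time mean
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    pc1 = u[:, 0] * s[0]                        # leading principal component
    eof1 = vt[0].reshape(ny, nx) / w[0]         # unweighted spatial pattern
    frac = s[0] ** 2 / np.sum(s ** 2)           # explained-variance fraction
    return eof1, pc1, frac
```

In practice the domain would first be subset to the North Pacific and the global mean SST signal removed, details omitted here for brevity.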
Regressions of the principal component time series of the PDO against SSTs can be used to quantify the global teleconnection patterns associated with the PDO (Phillips et al. 2014), which consist of a zonal dipole of anomalies in the North Pacific centered near 40°N that resembles the structure of El Niño (Fig. 9a). As for El Niño, the mean model bias (not shown) negatively correlates with the observed pattern, indicating an overall weakness of the PDO in the models. In addition, the leading EOF pattern of the intermodel PDO bias (Fig. 9c), which is associated with connections between the northern and tropical Pacific Ocean, also positively correlates with the mean bias across models, suggesting that, while on average the simulated PDO connections with the tropics are systematically underestimated, there is also considerable variation across models.
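The teleconnection maps described above amount to a gridpoint-by-gridpoint regression of field anomalies onto the PC time series. A minimal sketch follows, assuming a unit-variance PC so that the slope carries the field's units per standard deviation of the index; the function name is hypothetical.

```python
import numpy as np

def regression_map(pc, field):
    """Regress a field (time, lat, lon) onto a PC time series.

    Sketch of the teleconnection-map construction: normalize the PC to
    unit variance, then compute the covariance with the field anomalies
    at each grid point, which equals the regression slope.
    """
    pc = (pc - pc.mean()) / pc.std()            # unit-variance predictor
    anom = field - field.mean(axis=0)           # field anomalies
    # Covariance over time at each grid point = regression slope.
    return np.tensordot(pc, anom, axes=(0, 0)) / len(pc)
```

The same operation, applied to SLP with a Niño-3.4 index, yields the ENSO teleconnection maps discussed later (Fig. 11).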
This last point is especially evident when comparing the PC weightings corresponding to the leading EOFs for the CMIP5 and CMIP6 versions of U.S. models (Fig. 9b), where the origin may be interpreted as the CMIP mean model bias, and observations are shown in red. In particular, significant improvement from CMIP5 to CMIP6 model versions is reflected by the relative proximity of PC weightings for CMIP6 (closed circles) versus CMIP5 model versions (open circles) to the observed range. Most improved is the GISS model (green), which had among the most positive leading EOF1 PCs in CMIP5, yet in CMIP6 is among the closest to the observed range. Note that the reduction of bias in EOF 1 in the U.S. climate models and inconsistent changes in EOF 2 mirror the more general changes from CMIP5 to CMIP6 seen in other climate models (Fasullo et al. 2020).
To illustrate this improvement directly, the simulated PDO regression patterns are shown for the CMIP5 (Fig. 9e) and CMIP6 (Fig. 9f) versions of the GISS model. In CMIP5, significant teleconnections were largely confined to the North Pacific Ocean, with minimal structure in the other ocean basins and in stark contrast to the observed pattern. In CMIP6, teleconnections to remote regions have expanded considerably, particularly in the tropics, with a strong spatial correlation with those observed, though biases in the tropical meridional structure remain. Correctly resolving these teleconnections has broad relevance to associated attribution and prediction applications and represents a major step forward, despite some persistence of biases. While the processes that contribute to the improvement remain to be understood, it is noteworthy that patterns in the North Pacific have not improved dramatically across model versions (Fig. 8), suggesting other origins for the teleconnection improvement.
3) ENSO and PDO spectra
Another key property of both ENSO and the PDO is the spectra of their indices, which carry considerable socioeconomic implications related to the frequency, intensity, and persistence of droughts, floods, and other impacts (Dilley and Heyman 1995). In general, simulated ENSO variability is too strong except at high frequencies (periods < 2.5 years) (Fig. 10a), where many models underestimate the severity of El Niño–La Niña transitions (also shown in Fig. 7b). Between 2.5 and 6 years, all U.S. climate models produce on average more power than is observed. In the 6–10-yr band, the average observed power is reduced from the 2.5–6-yr band but remains large, with the E3SM range on par with the observed estimates. For periods greater than 10 years, the observed power is again less than that for 2.5–6-yr periods and the agreement between the models and observations is closer, with the exception of CESM2(CAM6) and CESM2(WACCM6). A systematic increase in power from CMIP5 (coincident black lines) to CMIP6 model versions is apparent in all CMIP6 U.S. models except CESM, where the version 1 and version 2 ranges overlap. The general increase in ENSO power across U.S. CMIP6 model versions for periods greater than 2.5 years relative to their CMIP5 counterparts is consistent with the broader increase in power from CMIP5 to CMIP6 models overall (Fasullo et al. 2020).
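The band-by-band comparison above can be sketched as follows. The estimator (a raw periodogram of the monthly index) and the handling of band edges are illustrative assumptions; the paper's spectral estimator may differ.

```python
import numpy as np

def band_power(index, dt_years=1.0 / 12.0,
               bands=((0.0, 2.5), (2.5, 6.0), (6.0, 10.0), (10.0, np.inf))):
    """Integrate the periodogram of a monthly climate index over period bands.

    Sketch of a band-integrated spectral comparison (e.g., <2.5-yr,
    2.5-6-yr, 6-10-yr, >10-yr bands, as in the text).
    """
    n = len(index)
    freq = np.fft.rfftfreq(n, d=dt_years)               # cycles per year
    power = np.abs(np.fft.rfft(index - np.mean(index))) ** 2 / n
    out = {}
    for lo, hi in bands:
        # Periods in (lo, hi] years correspond to frequencies in [1/hi, 1/lo).
        fmin = 0.0 if np.isinf(hi) else 1.0 / hi
        fmax = np.inf if lo == 0.0 else 1.0 / lo
        sel = (freq > 0.0) & (freq >= fmin) & (freq < fmax)
        out[(lo, hi)] = float(power[sel].sum())
    return out
```

Applied to an ensemble of Niño-3.4 or PDO index realizations, the spread of each band-integrated value across members gives the model ranges compared against observations in Fig. 10.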
Model–observation agreement is generally closer for the PDO spectra (Fig. 10b) than for ENSO. Models systematically underestimate the observed estimates in the less-than-2.5-yr band, where there is little power; in the 2.5–6-yr band, the models exhibit generally good agreement with the observations, albeit with large internal variability. In the 6–10-yr band, observational estimates again increase and fall generally within the model ranges, with WACCM6 perhaps biased high, although more ensemble members are needed to more accurately represent the ensemble spread in that model. At low frequencies (periods > 10 years), power in the PDO is larger still, and good general agreement exists between the observed and simulated estimates.
As seen in Fig. 11, the gamma parameter in the CESM2 sensitivity experiments exerts an important influence on ENSO teleconnections. The spatial character of ENSO–SLP teleconnections (similar to Fig. 6, although here estimated using Niño-3.4 regression) is shown for ERA-20C (Fig. 11a) and for the CESM2-gamma sensitivity experiments (Fig. 11b), along with the raw regressions for each model. In CESM2-gamma, many of the canonical biases worsen relative to CESM2, including a weakening of the overall pattern in most locations, manifest as negative anomalies in the northeast Pacific Ocean and North Atlantic Ocean. They also include a westward shift of ENSO variance, as evidenced by negative differences in the central Pacific Ocean (Fig. 11b). The pattern correlation with observations also decreases, from 0.93 for CESM2 to 0.88 for CESM2-gamma. Together these biases are analogous to PC1 and PC2 in our multimodel analysis (Figs. 6c,d) and demonstrate a basic sensitivity of ENSO teleconnections to unobserved cloud parameters that are typically tuned.
We evaluate the QBO only among the current (CMIP6) generation of models (Table 1). This is because, unlike the other modes, the QBO was not consistently represented in the majority of models participating in previous CMIP intercomparisons (i.e., only 5 of 47 CMIP5 models had anything resembling a QBO; Butchart et al. 2018). We also include in our analysis not only historical coupled runs but also historical AMIP runs, which were not included in the previous discussions that, by comparison, focused on modes requiring atmosphere–ocean coupling. Specifically, these include the results from the GEOS M2AMIP ensemble as well as the GISS E2.2-G AMIP historical runs (Table 3).
1) QBO period
The QBO is first depicted in terms of the evolution of the equatorial winds, averaged over 5°S–5°N, over the course of the observational period 1980–2015 (or up to the years for which model output was available, depending on the simulation) (Fig. 12). MERRA-2 exhibits the characteristic downward propagation of oscillating zonal wind anomalies, also featured clearly in all of the other models except CESM2(CAM6) and GISS E2.1. This is not surprising, given that the latter models have relatively low vertical resolution (Table 2). Hereafter, therefore, our focus is on further quantifying various aspects of the QBO in all models except CESM2(CAM6) and GISS E2.1. We also exclude from our analysis the results from CESM1(WACCM5), since the QBO was imposed in that model.
As in Schenzinger et al. (2017, hereafter SC17), we begin by calculating the Fourier transform of the equatorial zonal-mean zonal wind; hmax is then defined as the height at which the sum of the squares of the Fourier amplitudes for periods between 26 and 30 months maximizes. For MERRA-2 this occurs at 20 hPa, which is consistent with the values of hmax quoted in Coy et al. (2016), as well as SC17, albeit for MERRA in the latter case. Comparison of hmax among the simulations shows good consistency with MERRA-2 to within ±15 hPa, with hmax occurring above 20 hPa for some models (e.g., M2AMIP, some members of GISS Model E2.2) and below 20 hPa for others [e.g., E3SMv1 and E3SMv1-MODGWD, CESM1(WACCM)] (Fig. 13a). The one exception is E3SMv1, for which hmax spans 30–50 hPa. While this is somewhat at odds with Bushell et al. (2020), who showed that the QBO in the QBOi models is generally shifted upward compared to reanalyses, it is important to note that we are considering a much smaller ensemble of models than the 13 considered in that study. Furthermore, offline comparisons of hmax in MERRA-2 calculated using the native vertical grid of MERRA-2 (20 hPa) versus the coarser grid at which the modes of variability (MoV) model output was available (30 hPa) demonstrate that hmax does exhibit some sensitivity to vertical resolution. While this implies that some caution is needed when interpreting the sense of the MoV models' bias in hmax, as we discuss below, other measures of the QBO (e.g., amplitude, period) are less sensitive.
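The hmax recipe above translates directly into code. The sketch below assumes monthly-mean winds on a (time, level) grid; the function and variable names are hypothetical, and the original analysis may differ in detail (e.g., windowing).

```python
import numpy as np

def qbo_hmax(u, levels, dt_months=1.0, band=(26.0, 30.0)):
    """Level of maximum QBO-band spectral power.

    u: equatorial zonal-mean zonal wind, shape (time, level), monthly means.
    Following the recipe in the text: at each level, sum the squared
    Fourier amplitudes whose periods fall between 26 and 30 months; hmax
    is the level at which that sum maximizes.
    """
    nt = u.shape[0]
    freq = np.fft.rfftfreq(nt, d=dt_months)             # cycles per month
    amp2 = np.abs(np.fft.rfft(u - u.mean(axis=0), axis=0)) ** 2
    period = np.full_like(freq, np.inf)                 # period of each harmonic
    period[1:] = 1.0 / freq[1:]
    sel = (period >= band[0]) & (period <= band[1])
    score = amp2[sel].sum(axis=0)                       # QBO-band power per level
    return levels[np.argmax(score)]
```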
Time series of the equatorial zonal winds at hmax are used to identify the time between every other phase peak. For MERRA-2 the average period over all cycles is 28.2 months, with a minimum (maximum) period of 22 (36) months (Fig. 13b). This is in excellent agreement with equatorial radiosonde-based estimates, which are also slightly above 28 months (Baldwin et al. 2001). The mean periods in the examined models generally agree very well with MERRA-2, particularly for the M2AMIP ensemble, in which the QBO period for all 10 members ranges between 27 and 29.1 months. By comparison, the QBO period is not as well captured in E3SMv1, in which, for some members, the oscillation is almost twice as fast as in MERRA-2.
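A sketch of this period estimate follows. We assume that timing "every other phase peak" is equivalent to measuring the spacing between successive easterly-to-westerly onsets, each of which spans one full QBO cycle; this equivalence is our reading of the text, not a stated method.

```python
import numpy as np

def qbo_period(u, dt_months=1.0):
    """Mean, min, and max QBO period from a wind time series at hmax.

    Sketch: locate easterly-to-westerly zero crossings and take the
    spacing between successive onsets as the cycle length.
    """
    u = np.asarray(u, dtype=float)
    onsets = np.where((u[:-1] < 0) & (u[1:] >= 0))[0]   # E->W transitions
    cycles = np.diff(onsets) * dt_months                # months per full cycle
    return cycles.mean(), cycles.min(), cycles.max()
```

A real analysis would first low-pass filter the winds so that short-lived fluctuations near the zero line do not register as spurious onsets.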
We explore this last point further by contrasting the results from E3SMv1 with those from E3SMv1-MODGWD. In response to two separate changes to the convective gravity wave drag (GWD)—to the amplitude of the momentum flux phase-speed spectrum (which in that model is proportional to a tunable convective fraction per GCM grid cell) and to the efficiency with which convection generates gravity waves—there is a significant improvement in the QBO period exhibited in E3SMv1-MODGWD (Fig. 13b). Given that R19 tuned that simulation to obtain a credible QBO period, this result is not surprising. Nonetheless, it is consistent in spirit with the development decisions that were made in tuning similar aspects of the convective component of the nonorographic gravity wave drag scheme in GISS E2.2 (Rind et al. 2020) and in previous versions of that model (Rind et al. 2014, hereafter RJ14).
2) QBO amplitude
The models considered here exhibit more disagreement with MERRA-2 in the amplitude of the QBO (Fig. 13c) than in its period, which likely reflects the priorities governing how GWD schemes are tuned in models (i.e., first to produce a credible period and only thereafter other aspects of the QBO). The best agreement with MERRA-2, in which the amplitude is ~45 m s−1, is exhibited by GISS E2.2 and M2AMIP; in nearly all the other models the amplitude of the QBO is underestimated, consistent with the QBOi models (Bushell et al. 2020). Further decomposition of the QBO amplitude into easterly and westerly components (Figs. 13d–g) shows that this low-amplitude bias is most often associated with an underestimate of the easterly component of the QBO (Figs. 13d,e). By comparison, the amplitude of the westerly phase of the QBO is better represented in the models (Figs. 13f,g).
It is interesting to ask whether the differences in QBO amplitude exhibited among the models in the historical runs are comparable in magnitude to the differences in forecast skill between the two subseasonal forecast ensembles considered in this study. Figure 14 compares RMSE values in the equatorial zonal winds between the GEOS-S2S-2 and NOAA GEFS ensemble forecasts, relative to MERRA-2. With the exception of 100 hPa, where GEFS performs slightly better, the S2S forecast errors are smaller throughout the stratosphere, especially above 30 hPa. Comparisons of the vertical resolution and model top of GEFS with those of the underlying GEOS GCM used to produce the S2S forecasts indicate that the GEFS top is lower (0.2 vs 0.01 hPa) and that GEFS has fewer vertical levels, with fewer of them placed in the lower stratosphere/upper troposphere (Table 2). At the same time, other factors may also contribute to the differences, including the fact that the stochastic perturbation applied to each GEFS forecast member varies with height. In particular, in the stratosphere the weight of the stochastic perturbation decreases from 1 at 100 hPa to 0 at pressures at and above 25 hPa, which could contribute to the variable performance of the GEFS ensemble-mean forecast at different stratospheric levels. This, in addition to the use of MERRA-2 as the reference against which all RMSE values have been calculated, may provide additional reasons for the relatively weaker QBO skill in the GEFS forecasts. Therefore, while the vertical resolution and model-top differences between the models are consistent with the sources driving differences in model performance among the historical runs, more investigation is needed to understand whether, and how, skill on the climatic time scales relevant to CMIP translates to subseasonal time scales. A more rigorous and in-depth presentation of the state of the QBO in the GEOS-S2S-2 forecasts is in preparation (L. Coy, NASA Global Modeling and Assimilation Office, 2019, personal communication).
3) Intermediary model version experiments
Comparisons between pairs of models show that the QBO period depends sensitively on which aspects of the nonorographic gravity wave drag are altered. In particular, in both GISS E2.1 versus E2.2 and E3SMv1 versus E3SMv1-MODGWD, changes to the efficiency with which convection generates gravity waves resulted in significant improvements in the QBO period relative to MERRA-2.
By comparison, our analysis does not reveal any systematic changes that clearly improve the representation of the amplitude of the QBO. In particular, the QBO amplitude in E3SMv1-MODGWD is approximately the same as in E3SMv1 (Fig. 13c), albeit with some differences, depending on QBO phase (Fig. 13d). This can also be seen in the perhaps surprising result that, despite their good representation of the overall mean QBO amplitude, all M2AMIP members consistently underestimate (overestimate) the easterly (westerly) QBO amplitude (Figs. 13d–g). This is interesting because the M2AMIP ensemble was generated using the exact same version of GEOS that was used to produce MERRA-2. This demonstrates that, while changes in the nonorographic GWD parameterizations may suffice in terms of improving aspects like QBO period, they may not be sufficient for constraining other aspects like QBO amplitude. For that, assimilation of observed fields (as in MERRA-2) can counteract underlying free-running biases in the models (Geller et al. 2016).
We have presented a comprehensive assessment of the performance of U.S. climate models with respect to multiple modes of variability. Overall, we show that for many modes (though not all), improvements in model skill over time are impressive and a testament to advances in the representation of key processes. In addition to improved representations of the MJO and QBO [which have been reported in previous studies (Kim et al. 2013; RJ14; Danabasoglu et al. 2020)], the overall improvement in ENSO and the PDO in recent CMIP6 models is remarkable (Figs. 15a,b). At the same time, however, there is no clear improvement in the representation of the NAM and possibly a degradation of skill in the SAM (Figs. 15c,d), although the correlations were already very high.
We can distinguish between two kinds of improvement exhibited among most of the modes considered in this study: those that rely on a threshold of model representation that is crossed at a distinct moment in model development, and those that rely on more gradual, collective improvements in processes. As an example of the first, the ability of GISS E2.2, CESM2(WACCM6), and E3SMv1-MODGWD to produce a realistic QBO signal is predominantly a function of increased vertical resolution in the lower stratosphere and of sufficiently complex spectra of parameterized gravity waves that are tied to the underlying physics in the models (e.g., convection, shear). Models without either do not have a QBO worth discussing, while those with both have at least the possibility of being tuned toward a realistic amplitude and period. In this latter group, the QBO period is much easier to tune than the amplitude (Geller et al. 2016), which is consistently underestimated. While data assimilation can remedy this bias (evident in comparing M2AMIP with MERRA-2), it remains a challenge for future development.
Improvement in the simulation of coupled and extratropical modes falls into the second category, likely attributable to gradual improvement of the base climate and of a range of relevant processes. Evidence of improved fidelity across generations is apparent in some cases (e.g., the amplitude of the SAM in DJF) but less clear in others (the NAM, NAO, PNA, and the SAM during JJA). While the limited and varied number of samples hampers definitive statements that can be generalized across models, our analysis nevertheless suggests that progress has been made in some areas, most notably for ENSO and the PDO. Improvements seen in the MJO also belong to this second category, although they are much clearer than for the extratropical modes. The drivers of these improvements also appear to be much better understood, being related to consistent approaches to treating rain reevaporation within convective parameterizations.
Finally, while the results from our analysis suggest a clear progression in model fidelity in a climate context, it is not clear whether, and how, this improved performance translates to skill in subseasonal forecasting. Our limited analysis comparing two subseasonal forecast groups suggests that the factors contributing to improved QBO performance in the climate context may also improve skill on subseasonal time scales. However, owing to the limited number of subseasonal models considered in this study, our analysis is not conclusive. As more forecast systems become available in parallel with new CMIP6 models, it will become easier to address this question.
This work was funded from NASA MAP, DOE, and NOAA, and arises from the 2019 U.S. Climate Modeling Summit in Washington, D.C., cochaired by Steven Pawson and Gavin Schmidt. Climate modeling at GISS and the GMAO is supported by the NASA Modeling, Analysis and Prediction program, and resources supporting this work were provided by the NASA High-End Computing (HEC) Program through the NASA Center for Climate Simulation (NCCS) at Goddard Space Flight Center. This research was also supported as part of the Energy Exascale Earth System Model (E3SM) project and Regional and Global Climate Modeling Program, both funded by the U.S. Department of Energy, Office of Science, Office of Biological and Environmental Research. Work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. All presented material relevant to the CESM project, which is supported primarily by the National Science Foundation (NSF), is based upon work supported by the National Center for Atmospheric Research, which is a major facility sponsored by the NSF under Cooperative Agreement 1852977. Portions of this study were supported by the Regional and Global Model Analysis (RGMA) component of the Earth and Environmental System Modeling Program of the U.S. Department of Energy’s Office of Biological and Environmental Research (BER) Cooperative Agreement DE-FC02-97ER62402. Computing and data storage resources, including the Cheyenne supercomputer (doi:10.5065/D6RX99HX), were provided by the Computational and Information Systems Laboratory (CISL) at NCAR. The GEFS SubX forecasts were partially supported through NWS OSTI and NOAA’s Climate Program Office (CPO) Modeling, Analysis, Predictions, and Projections (MAPP) program. 
We also thank the GFDL model development team, the leadership of NOAA/GFDL for their efforts and support in developing CM4/ESM4 as well as the many GFDL scientists and engineers who conducted the CM4/ESM4 runs and made the data available at ESGF. We acknowledge the modeling groups, PCMDI and the WCRP’s Working Group on Coupled Modelling (WGCM) for their roles in making available the CMIP multimodel datasets. Support of this dataset is provided by the Office of Science, U.S. Department of Energy. We thank the three reviewers whose comments helped greatly increase the clarity of the manuscript.