1. Introduction
The development of increasingly complex Earth system models (ESMs) has provided new opportunities for forecasting beyond the weather regime, including intraseasonal, seasonal, and interannual time scales, as well as research associated with projections of climate change (Meehl et al. 2005; Taylor et al. 2012; Eyring et al. 2016). Consequently, the diagnosis and evaluation of climate models have become increasingly complex (Randall et al. 2007; Flato et al. 2013). This has been facilitated by the development of coordinated model intercomparisons beginning in the early 1990s (Gates 1992) and the public availability of standardized model output, in conjunction with the advent of reanalysis, which provides a long record of validation data extending back before the satellite era. As with weather forecast models, which typically undergo substantial upgrades in physics and dynamics on a 1–2-yr time scale, it is important to have standard performance tests for climate models to assess whether statistically significant improvement has been achieved across generations of models. This effort has facilitated an exponential growth in the community of scientists involved with model development, diagnosis, and evaluation (e.g., Gleckler et al. 2008, 2016; Eyring et al. 2020; Fasullo 2020; Bock et al. 2020).
Toward this goal, in this study we evaluate the fidelity of climatological modes of variability, which are often defined by the leading patterns of empirical orthogonal function (EOF) analysis and represent substantial portions of climatic interannual variance in different geographical domains. In particular, the variability modes over extratropical regions, such as the northern annular mode (NAM), the North Atlantic Oscillation (NAO), the Pacific–North America pattern (PNA), the southern annular mode (SAM), the Pacific decadal oscillation (PDO), the North Pacific Oscillation (NPO), and the North Pacific Gyre Oscillation (NPGO), are known to play important roles in both natural ecosystems and human societies. For example, these variability modes influence cyclone or storm tracks (e.g., Hanna et al. 2008; Reboita et al. 2009; Sung et al. 2014; Feser et al. 2015), the frequency and intensity of rainfall (e.g., Archambault et al. 2008; Durkee et al. 2008; Zhang et al. 2021), extreme weather (e.g., Thompson and Wallace 2001; Hurrell and Deser 2010), air quality over certain regions (e.g., Jerez et al. 2013), the robustness of ozone holes (e.g., Perlwitz et al. 2008; Son et al. 2008; Fogt et al. 2009), marine ecosystems (e.g., Joh and Di Lorenzo 2017), and fishery catches (e.g., Mantua et al. 1997; Szuwalski et al. 2020), among others. [For additional background information, please see chapter 5 of the Fourth National Climate Assessment (Perlwitz et al. 2017).]
The extent to which general circulation models (GCMs) realistically capture observed variability is therefore important for each of the aforementioned applications. While the majority of previous literature has focused on evaluating one or two variability modes in one or two generations of coupled GCMs (CGCMs; in which atmospheric and ocean models are coupled), several studies have advanced approaches to examine the extratropical modes collectively. Stoner et al. (2009) diagnosed the reproducibility of the leading EOFs in the earlier generation of CGCMs in phase 3 of the Coupled Model Intercomparison Project (CMIP3; Meehl et al. 2005). Phillips et al. (2014) describe a software package (the Climate Variability Diagnostics Package, or CVDP) that diagnoses the leading EOFs of variability modes simulated by CGCMs and was initially tested in evaluating CMIP5 (Taylor et al. 2012) models. Data from the more recent CMIP6 (Eyring et al. 2016) simulations, now published by the Earth System Grid Federation (ESGF; Williams et al. 2016), have enabled performance comparison across multiple generations of CGCMs. Fasullo et al. (2020) examined tropical and extratropical variability modes simulated by CGCMs from three generations of climate models (CMIP3, CMIP5, and CMIP6) using the CVDP. They found clear improvement in the simulated spatial pattern of the PDO by CMIP6 models, but no substantial improvement in the extratropical atmospheric modes. Orbe et al. (2020) analyzed CGCMs from the six U.S. climate modeling groups that have participated in multiple generations of CMIP, and found overall improvement in the PDO with the recent CMIP6-participating U.S. models. However, skill improvement from CMIP3 to CMIP6 for the atmospheric modes is less clear and highly mode and season dependent, which is consistent with the findings of Fasullo et al. (2020).
One challenge in evaluating modes of variability using the traditional EOF approach is the frequent need for EOF “swapping” (applying this remedy is hereafter referred to as the EOF-swap method), which occurs when a higher-order EOF of a model corresponds more closely to a lower-order observationally based EOF, or vice versa. Lee et al. (2019a) pointed out that this issue can significantly affect conclusions drawn about model performance when the higher- and lower-order EOFs are not well separated, as determined by applying the test of North et al. (1982), and found it to be particularly pronounced for the PNA. To circumvent this issue, they adopted a common basis function (CBF) approach, which projects the leading observed EOF onto model anomalies. A CBF-type approach has been used for various applications: not only for the extratropical modes of variability (Lee et al. 2019a,b; Jun et al. 2020; Sung et al. 2021), but previously also for the Madden–Julian oscillation (MJO; Sperber 2004; Sperber et al. 2005; Sperber and Annamalai 2008; Gottschalck et al. 2010; Sperber et al. 2013), the PDO (Bonfils and Santer 2011), ENSO (Bonfils et al. 2015), and formal detection and attribution (e.g., Santer et al. 2007). Using CBFs reduces reliance on the statistical construct of EOFs, in addition to circumventing the swapping issue.
Each of these three methods—the CBF, the traditional EOF, and the EOF with swapping—addresses a different question. The CBF method targets the question “How well are the phenomena represented by the leading observed EOF captured by a model?” without regard to any other variability in the model or in observations. The traditional approach asks “Does the model correctly identify in its leading EOF the phenomena represented by the leading EOF identified in observations?” The EOF-swap method allows for the possibility that there is not a one-to-one correspondence in the ordering of the leading EOFs, so as to address the question “How well does one of the model’s EOFs compare with the observed EOF?” For example, Lee et al. (2019a) and other works have demonstrated that in some situations when EOFs are not well separated, the second simulated EOF may correspond more closely to the observed leading EOF.
Our objective is to gauge the evolution of performance across multiple generations of CMIP, focusing on extratropical modes of variability. We highlight performance changes in both pattern and amplitude, complementing a recent study that focused mainly on pattern comparisons (Fasullo et al. 2020). We do so by extending the analysis of Lee et al. (2019a) across multiple generations of CMIP. We emphasize results from the CBF method, but we also compare with results from the traditional and swapped EOF approaches. For the extratropical atmospheric modes, we apply our methodology both to CGCMs and to uncoupled GCMs (i.e., the AMIP simulations of CMIP), to explore whether it might be possible to test and improve extratropical model performance during development of the atmospheric model, prior to coupling. We also briefly explore possible reasons for the overall performance changes that we have identified.
The rest of the paper is organized as follows. In section 2, we describe the observational data and model simulations used in this study, and provide a brief description of the methodology we use to compare successive generations of models. In section 3, we compare simulated modes of variability from CMIP3, CMIP5, and CMIP6 GCMs, focusing on spatial patterns (section 3a) and mode amplitudes (section 3b). We also compare results from the coupled GCMs with the uncoupled GCMs (section 3c), and from the different methodological approaches: CBF versus the traditional EOFs (section 3d). In addition we examine the potential impact of changes in model configuration, specifically resolution, and linkages to the simulated mean state (section 3e). We conclude in section 4 with a summary of our results and a discussion of our primary findings.
2. Data and methods
a. Model simulations
In this study we include all coupled and uncoupled (i.e., Historical and AMIP) simulations from CMIP3, CMIP5, and CMIP6 (Meehl et al. 2005; Taylor et al. 2012; Eyring et al. 2016) that were available at the time the analysis was performed (January 2021). From the coupled experiments we use model data for most of the twentieth century and the early twenty-first century. In CMIP5 and CMIP6 this period is referred to as the “Historical” period, while in CMIP3 it was identified as the “Climate of the 20th Century” (20c3m) experiment. We evaluate the simulation period 1900–2005 in CMIP5 and CMIP6, and 1900–99 in the shorter CMIP3 runs. For simplicity, we refer to the coupled experiments from all CMIP generations as Historical experiments. We analyze 128 models across the CMIPs using 851 Historical simulations (i.e., 72, 216, and 563 simulations from 22 CMIP3, 50 CMIP5, and 56 CMIP6 models, respectively) and 90 AMIP models using 298 simulations (i.e., 25, 95, and 178 AMIP simulations from 12 CMIP3, 31 CMIP5, and 47 CMIP6 models, respectively). For each model, we first compute the performance metrics for each ensemble member and then average the results across all available ensemble members to reduce the sampling uncertainty. Lists of the models and the number of available ensemble members for each model are given in Table 1. Further information on the CMIP models can be found in the overview of Model Intercomparison Projects on the PCMDI website (https://pcmdi.llnl.gov/mips/).
Summary of the CMIP3, CMIP5, and CMIP6 models and their ensemble member simulations used in this study. The analysis is performed on all simulations that were available at the time the analysis was performed (January 2021). For each model we first compute the performance metrics for each ensemble member and then average the results across all ensemble members from each experiment (either Historical or AMIP) to reduce the sampling uncertainty. An asterisk (*) at the end of a CMIP6 model name denotes a high-top model (model top layer above 1 hPa). Further information on the models can be found in the overview of Model Intercomparison Projects on the PCMDI website (https://pcmdi.llnl.gov/mips/).
b. Reference data
As the default reference dataset for evaluating models, we employ monthly averaged sea level pressure from the National Oceanic and Atmospheric Administration (NOAA) Cooperative Institute for Research in Environmental Sciences (CIRES) Twentieth Century Reanalysis (20CR) version 2 (Compo et al. 2006, 2011) and monthly averaged sea surface temperature (SST) from the Met Office Hadley Centre Sea Ice and Sea Surface Temperature dataset (HadISST) version 1.1 (Rayner et al. 2003). The 20CR is constrained by observations of surface pressure and was run at a horizontal resolution of T62 with 28 vertical levels; its monthly mean sea level pressure data are available on a global 2° × 2° latitude–longitude grid. HadISSTv1.1 was used as the boundary forcing for the 20CR and provides monthly means on a 1° × 1° latitude–longitude grid. The period 1900–2005 is used to isolate each of the aforementioned variability modes from the reference datasets, except for the SAM. In our previous study, we found a noticeable discrepancy in the SAM between different reanalysis datasets in the first half of the twentieth century (Lee et al. 2019a), which is consistent with the findings of Pohl and Fauchereau (2012) and Gerber and Martineau (2018). We therefore use a more recent, shorter period (1955–2005), over which the twentieth-century reanalyses are much more consistent in the Southern Hemisphere, for defining the SAM from the reference dataset. Throughout the paper we routinely refer to observations, but strictly speaking we are often referring to our use of the above observationally constrained reanalyses as reference datasets.
While we use the 20CR and HadISSTv1.1 as our default reference dataset throughout the analysis, it is important to identify how the choice of reference data might impact our findings. We therefore employ an alternative reference dataset combining the Twentieth Century Reanalysis from the European Centre for Medium-Range Weather Forecasts (ERA-20C; Poli et al. 2016) and HadISSTv2.1 (note that HadISSTv2.1 was prescribed when generating the ERA-20C). In our previous study we showed that the 20CR and ERA-20C yield sea level pressure variability that is largely consistent with that from more comprehensive, shorter-record reanalyses emphasizing the satellite era (see section 2 of Lee et al. 2019a). In this paper, we revisit this comparison between the two twentieth-century reanalyses (i.e., 20CR and ERA-20C) and the ERA5 reanalysis (Hersbach et al. 2020) to determine what impact the selection of reference data may have on our conclusions.
c. Methodology
We employ a suite of well-established summary statistics that objectively gauge the consistency between models and observations to evaluate how well climate models simulate five observed atmospheric modes (NAM, NAO, PNA, SAM, and NPO) and two sea surface temperature (SST)-based modes (PDO and NPGO). Using the PCMDI Metrics Package (PMP; Gleckler et al. 2008, 2016), we evaluate these modes by employing the CBF method (Lee et al. 2019a) and compare these results to the traditional EOF analysis. We use seasonal anomalies of monthly sea level pressure for the atmospheric modes and monthly SST anomalies for the SST-based modes. We then subtract the area-weighted mean over the EOF domain at each time step, following the approach of Bonfils and Santer (2011) and others, which largely removes the long-term anthropogenic trend from the data. Details of the methodologies and the definitions of each mode and its associated domain follow those of Lee et al. (2019a). In this study we refer to specific physical phenomena as “modes,” whereas EOFs are linear statistical constructs that help us simplify and isolate key characteristics of particular modes. We note there are limitations to the physical interpretation of EOFs (e.g., North 1984; Hannachi et al. 2007; Monahan et al. 2009).
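As an illustration of this preprocessing step, the sketch below removes the area-weighted mean over the domain from an anomaly field at each time step. It is a minimal example with assumed array shapes and a hypothetical function name (not the PMP implementation), using cosine-of-latitude weights as a proxy for grid-cell area and omitting missing-data handling.

```python
import numpy as np

def remove_domain_mean(anom, lat):
    """Subtract the area-weighted mean over the EOF domain at each time step.

    anom : ndarray (time, lat, lon) of seasonal (or monthly) anomalies
    lat  : 1-D array of latitudes (degrees) spanning the domain
    Shapes and names are illustrative only.
    """
    w = np.cos(np.deg2rad(lat))[None, :, None]          # area weights, shape (1, lat, 1)
    weights = np.broadcast_to(w, anom.shape)
    domain_mean = (anom * weights).sum(axis=(1, 2)) / weights.sum(axis=(1, 2))
    return anom - domain_mean[:, None, None]
```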
In the CBF method (Lee et al. 2019a), an EOF pattern from the observation-based reference dataset is projected onto the model’s anomaly space, which yields a CBF principal component (PC) time series. A linear regression is then computed at each model grid point between the CBF PC time series and the temporal anomalies of the model field, producing the model’s CBF pattern. This method identifies variability in the model that is similar to the observed mode, and then determines the extent to which the model’s full pattern of variability associated with the mode matches the observed pattern. Both the CBF pattern and the conventional EOF pattern analyses stem from linear mathematical frameworks that yield consistent results when a given model performs well. For example, if the model perfectly simulates the observed EOFs, the CBF pattern will be identical to the model’s EOF pattern. In a reasonably realistic simulation, the model’s CBF pattern should be similar to the observed EOF pattern, but not necessarily identical; differences between them reflect both differences in the amplitude of variability associated with the mode and structural differences in the simulated spatial pattern. Compared to the traditional EOF approach, the CBF approach avoids the arbitrary sign of EOFs, EOF swapping, and the splitting of an observed EOF across multiple model EOFs. We note that, if the observed EOFs are distorted and thus uncertain due to sampling, the CBF technique could penalize models unreasonably for not reproducing this distortion. To minimize this risk, we limit our selection to variability modes whose EOFs are well separated in the observed field, by the test of North et al. (1982), to ensure the projected pattern is reliable. As discussed in section 2b, we also selected the period 1955–2005 for defining the SAM from the reference dataset, owing to the noticeable discrepancy in the SAM between different reanalysis datasets in the first half of the twentieth century, whereas the period 1900–2005 was used for defining the other modes. We conducted a sensitivity study of the results to the choice of observational datasets and verified the consistency of the observed EOFs for the selected modes. Interested readers are directed to our previous study, Lee et al. (2019a), for further details on the methodology.
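The following sketch outlines the two CBF steps described above: projecting the observed EOF onto the model anomalies to obtain a CBF PC time series, and regressing the model anomalies on that time series at each grid point to obtain the model’s CBF pattern. The array shapes, area weighting, and projection normalization are assumptions made for illustration and do not reproduce the PMP code in detail.

```python
import numpy as np

def cbf_pattern(model_anom, obs_eof, lat):
    """Minimal CBF sketch (not the PMP implementation).

    model_anom : ndarray (time, lat, lon), model anomalies with the domain mean removed
    obs_eof    : ndarray (lat, lon), leading EOF pattern from the reference dataset
    Returns the CBF PC time series and the CBF (regression) pattern.
    """
    w = np.cos(np.deg2rad(lat))[:, None]                 # area weights, shape (lat, 1)
    # 1) Project the observed EOF onto the model anomalies -> CBF PC time series
    pc = np.nansum(model_anom * obs_eof * w, axis=(1, 2)) / np.nansum(obs_eof**2 * w)
    # 2) Regress the model anomalies on the PC at each grid point -> CBF pattern
    pc_c = pc - pc.mean()
    anom_c = model_anom - model_anom.mean(axis=0)
    pattern = np.einsum('t,tij->ij', pc_c, anom_c) / np.sum(pc_c**2)
    return pc, pattern
```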
We define two metrics in this study: one for the spatial pattern and another for the mode amplitude. First, for the spatial pattern, we standardize the CBF and observed EOF patterns to unit variance by normalizing each by its own spatial standard deviation. We then calculate the centered root-mean-square error (RMSE) of the normalized CBF pattern against the normalized observed EOF pattern as a metric of the models’ skill in representing the spatial pattern of each mode, which is discussed in section 3a. Second, for the mode amplitude, for each model and mode the standard deviation of the simulated CBF principal component (PC) time series (i.e., obtained by projecting the model anomalies onto the observed EOF) is divided by the standard deviation of the observed PC (i.e., the PC time series associated with the observed EOF), which is discussed in section 3b. Both the spatial pattern and mode amplitude metrics are averaged across all available ensemble members for each model.
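A compact sketch of the two metrics follows; area weighting is omitted for brevity, and the function and variable names are illustrative rather than those of the PMP.

```python
import numpy as np

def pattern_rmse(cbf_pat, obs_pat):
    """Centered RMSE between patterns standardized to unit spatial variance."""
    m = (cbf_pat - cbf_pat.mean()) / cbf_pat.std()
    o = (obs_pat - obs_pat.mean()) / obs_pat.std()
    return np.sqrt(np.mean((m - o) ** 2))

def amplitude_ratio(cbf_pc, obs_pc):
    """Mode amplitude metric: std of the model CBF PC over std of the observed PC."""
    return np.std(cbf_pc) / np.std(obs_pc)
```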
For statistical testing, we perform two-sided t tests at the 5% significance level when assessing differences in skill metrics across MIP generations, and for linear regression fits when assessing whether there is a relationship in skill between the Historical and AMIP simulations. In many cases the reported significance level is considerably more stringent (e.g., p ≪ 0.05). We note this because we assume that each model is independent when calculating the degrees of freedom, whereas for a given CMIP generation multiple model versions from a given center may not be independent, and commonality may also extend across models from different centers and across CMIP generations.
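As an example, the comparison of mean skill between two generations could be tested as sketched below. Treating each model’s ensemble-mean RMSE as an independent sample follows the assumption stated above, and the use of Welch’s (unequal-variance) t test is a choice made for this sketch rather than a statement of the exact test configuration used in the paper.

```python
import numpy as np
from scipy import stats

def generation_improved(rmse_new, rmse_old, alpha=0.05):
    """Two-sided t test on per-model RMSE values from two CMIP generations.

    rmse_new, rmse_old : 1-D arrays, one ensemble-mean RMSE per model.
    Returns the p value and whether the newer generation's mean error is
    significantly smaller at the given level.
    """
    t_stat, p_value = stats.ttest_ind(rmse_new, rmse_old, equal_var=False)
    improved = (np.mean(rmse_new) < np.mean(rmse_old)) and (p_value < alpha)
    return p_value, improved
```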
3. Results
In evaluating modes of variability across the CMIP3, CMIP5, and CMIP6 generations we decompose skill into spatial and amplitude components. Using the CBF approach, the spatial skill is discussed in section 3a and the amplitude skill in section 3b. In section 3c, we compare results from coupled (i.e., Historical) and uncoupled (AMIP) simulations. In section 3d, we discuss the sensitivity of our results to the choice of evaluation method, including the traditional EOF, the EOF-swap, and the CBF approaches. We then explore potential drivers of the performance change in section 3e.
a. Spatial skill
We first examine the patterns of variability for each mode. For each model and season, the mode patterns are generated using the CBF method (Lee et al. 2019a). The spatial patterns are standardized to unit variance before calculating the centered root-mean-square error (RMSE); this metric therefore assesses the skill of the models in representing the spatial pattern of each mode (see section 2c). To summarize improvement in model simulation of modes of variability, we calculate pattern metrics (i.e., the centered RMSE in this case) for all available CMIP3, CMIP5, and CMIP6 models and display them in a portrait plot (Gleckler et al. 2008). The portrait plot has been widely used to compare the relative performance of Earth system models (e.g., Gleckler et al. 2008; Sillmann et al. 2013; Bellenger et al. 2014; Flato et al. 2013; Lee et al. 2019a; Cannon 2020; Kim et al. 2020; Planton et al. 2020). Figures 1a and 1b show portrait plots of the centered RMSE averaged over all available ensemble members for each Historical and AMIP model, respectively. In the portrait plots, the centered RMSE in each column is normalized by the median centered RMSE of that column, separately for each reference dataset. Thus, a value of −0.5 indicates that the error is 50% smaller than the median (or typical) model error, and a value of 0.5 indicates that the error is 50% larger than the median error. From top to bottom are the skill scores for CMIP3, CMIP5, and CMIP6, with gray horizontal bars separating the CMIP generations. For both the Historical and AMIP simulations there is a clear visual indication of larger errors (more reddish shading) in CMIP3, with CMIP6 tending to exhibit smaller errors (more bluish shading).
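The median-relative normalization described above can be expressed as in the short sketch below, inferred from the definition of the −0.5/+0.5 values; the function name is hypothetical.

```python
import numpy as np

def median_relative_error(rmse_per_model):
    """Normalize a column of centered RMSE values by the median across models:
    -0.5 means the error is 50% smaller than the median model error,
    +0.5 means it is 50% larger."""
    rmse = np.asarray(rmse_per_model, dtype=float)
    med = np.median(rmse)
    return (rmse - med) / med
```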
In Fig. 1 we can also readily assess how our choice of reference data impacts the overall results via the two triangles that together make up each square. For the sea level pressure–based modes (SAM, NAM, NAO, NPO, and PNA), the upper-left triangle shows model results relative to 20CR, whereas the lower-right triangle shows results relative to ERA-20C. For the SST-based modes (NPGO and PDO), results are shown relative to HadISSTv1.1 (upper-left triangle) and HadISSTv2.1 (lower-right triangle). Presenting the results in this way demonstrates that our overall conclusions are not very sensitive to the selection of reference data for either the Historical or AMIP simulations. The reference datasets used in Fig. 1 are twentieth-century reanalyses with a record length that matches that of the model simulations. As in Lee et al. (2019a), we have also examined the impact of using a more advanced reanalysis system (i.e., ERA5) and found a close correspondence to 20CR (figure not shown). Here we emphasize 20CR and ERA-20C as our reference datasets because their longer record length corresponds to that of the CMIP Historical simulations.
Results of statistical tests evaluating the significance of apparent spatial skill improvement, based on the centered/normalized RMSE calculated against the default reference dataset (20CR/HadISSTv1.1), are given in Fig. 2. The two-sided t test is applied at the 5% significance level. The box-and-whisker plots show the mean (symbol) and median (horizontal line), with the box spanning the 25th to 75th percentiles and the whiskers the 5th to 95th percentiles. A thicker box outline indicates a statistically significant improvement in the mean relative to the immediately preceding CMIP generation, and a filled symbol for the mean in the CMIP6 box indicates that CMIP6 is improved relative to CMIP3. Figures 2a and 2b show the RMSE averaged over all seasons, except for the Historical PDO and NPGO, which are based on monthly SST anomalies. In Fig. 2a for the Historical simulations, CMIP6 shows statistically significant improvement over CMIP5 and/or CMIP3 for all spatial patterns associated with EOF1 and EOF2 of the observations. The CMIP6 collection of models outperforms the CMIP3 ensemble for all seven modes, and it outperforms CMIP5 for five of the seven modes. This consistent gain in pattern performance across different modes of variability is noteworthy. For the NPO and NPGO, which are based on EOF2 of the observations, CMIP6 and CMIP5 skill are indistinguishable (Fig. 2a), although CMIP6 outperforms CMIP3. Comparing the atmospheric modes in the all-season Historical results (Fig. 2a) with the AMIP results (Fig. 2b) shows that the relative performance of the atmospheric modes is consistent between the two experiments [e.g., NAO (NAM) has the smallest (largest) centered RMSE]. For AMIP, in comparing CMIP generations, CMIP6 outperforms CMIP3 for all atmospheric modes, but its skill is indistinguishable from CMIP5 for four of the five modes; the exception is the PNA, for which CMIP6 is also superior to CMIP5. The AMIP simulations of CMIP5 outperform those of CMIP3 for all modes except the PNA.
We next discuss the seasonality of pattern skill. For the winter season, when the modes are strongest, Figs. 2c and 2d indicate that the relative skill of the modes is consistent between the Historical and AMIP simulations [e.g., NAO (SAM) has the smallest (largest) centered RMSE]. For the Historical simulations, the CMIP6 models significantly outperform the CMIP3 or CMIP5 models for NAM, NAO, and PNA (Fig. 2c). The significant improvement in the simulation of NAM and NAO from CMIP3 to CMIP5 and CMIP6 is also evident in the AMIP simulations (Fig. 2d). The correspondence between Historical and AMIP simulations also occurs in spring (Figs. 2e,f), as well as in summer and fall (not shown). Compared to the all-season result (Figs. 2a,b), the seasonal evaluation (Figs. 2c–f, and results for summer and fall) shows that significant improvement across CMIP generations varies by mode and season. Repeating the analysis in Fig. 2 using the alternative reference dataset, ERA-20C/HadISSTv2.1 (figures not shown), leads to similar conclusions (i.e., an overall improvement in the newer CMIP models), indicating that the evolution of performance across CMIPs from the spatial pattern perspective is not very sensitive to the choice of reference dataset.
In Fig. 3 we summarize the boxplots of Fig. 2 by noting where we find significant improvement from CMIP5 to CMIP6 (orange circles), from CMIP3 to CMIP6 (blue squares), and from CMIP3 to CMIP5 (open blue triangles). For example, the first three rows of each table in Figs. 3a and 3b (i.e., All season, winter, and spring) summarize Figs. 2a, 2c, and 2e for Historical and Figs. 2b, 2d, and 2f for AMIP, respectively. There are 20 of 27 cases (15 of 22 when excluding “All season” for the atmospheric modes) from Historical (Fig. 3a) and 16 of 25 cases (11 of 20 when excluding “All season” for the atmospheric modes) from AMIP (Fig. 3b) for which CMIP6 shows statistically significant improvement over either CMIP5 or CMIP3 (highlighted in yellow in the tables). It is also notable that for Historical there are more cases showing improvement from CMIP5 to CMIP6 (orange circles) in Fig. 3a than in Fig. 3b, whereas there are more cases showing improvement from CMIP3 to CMIP5 (open blue triangles) in Fig. 3b than in Fig. 3a. This indicates that in many cases of the Historical runs, improvement has been robust between CMIP5 and CMIP6 but tends to be indistinguishable between CMIP3 and CMIP5, whereas in many cases of the AMIP runs, improvement has been robust from CMIP3 to CMIP5 but generally indistinguishable between CMIP5 and CMIP6.
b. Amplitude skill
The amplitudes of observed and simulated mode variability in the CMIP3, CMIP5, and CMIP6 Historical and AMIP simulations are shown in Figs. 4a and 4b, respectively. For AMIP, only the atmospheric modes are shown because SSTs are specified boundary conditions. For each model and mode, the amplitude skill metric is defined as the standard deviation of the simulated CBF principal component (PC) time series (i.e., obtained by projecting the model anomalies onto the observed EOF) divided by the standard deviation of the observed PC (i.e., the PC time series associated with the observed EOF), as described in section 2c. Here, the standard deviation of the nonscaled PC represents the square root of the space–time variance of the mode. These ratios are averaged across all available ensemble members for each model. Both the Historical and AMIP simulations show mixed results. While some models, modes, and seasons agree with the observed variability (ratio close to 1.0, shown as greenish colors), some systematic errors in amplitude persist across all CMIP generations for particular modes and seasons (e.g., columns where reddish colors are dominant). The systematic amplitude overestimates tend to occur during the postdominant season (e.g., SAM SON, NAM MAM, PNA MAM), as seen by the vertical reddish banding, which is consistent with the CMIP5 Historical result of Lee et al. (2019a). Additionally, the NPO, characterized by EOF2 of the observations, is especially problematic throughout the nondominant seasons.
In Fig. 4 (as in Fig. 1), the sensitivity of the skill metric to the use of different reference datasets is seen by comparing the colors of the two triangles in each box. The upper-left triangle shows model results relative to the default reference dataset (20CR/HadISSTv1.1), whereas the lower-right triangle shows results relative to the alternative reference dataset (ERA-20C/HadISSTv2.1). On visual inspection, while many boxes are consistent for both the Historical and AMIP simulations, the agreement between counterpart triangles is not as robust as that for the spatial pattern in Fig. 1. For example, some columns show systematic discrepancies, with a tendency toward greater underestimation when diagnosed against the alternative reference (i.e., more bluish colors in the lower triangles), such as NAM JJA and PNA JJA, and a tendency toward greater overestimation when diagnosed against the default reference (i.e., more reddish colors in the upper triangles), such as NPO MAM. This indicates considerable observational uncertainty for those modes in their off seasons, when the signal is weaker than in the dominant season (i.e., winter).
To summarize the information in the amplitude portrait plot of Fig. 4, we provide a coarser-grained measure of skill in Fig. 5 using our default reference dataset, where we show the proportion of simulations that fall into three different amplitude categories: Too Weak (<0.8), About Right (0.8–1.2), and Too Strong (>1.2). The results in Figs. 5a–e are for winter for the Historical simulations, the season when the atmospheric modes are dominant, with Fig. 5f being the histogram proportions for the PDO, which is based on monthly anomalies. Qualitatively, for all modes, including the NPGO (not shown), the largest proportion of models fall into the About Right category, with smaller proportions in the two extreme categories. In AMIP (not shown), the atmospheric modes show similar behavior except for CMIP5 NAM, which has a greater proportion of models in the Too Strong category compared to the About Right category.
To evaluate whether there has been a significant change in skill across MIP generations, we test whether the proportions within each category are significantly different using the Z-score test for proportions, where the null hypothesis is that the difference between the two proportions is zero [see Glen (2020) for details]. Within each category we evaluate the differences between CMIP6 and CMIP5, CMIP6 and CMIP3, and CMIP5 and CMIP3 using a two-sided test. The resulting p values from each of the tests are given in the table inset in each subpanel, with differences taken to be significant for p < 0.05.
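A sketch of a two-proportion Z test of this kind is given below, using the pooled-proportion standard error; the category counts are hypothetical inputs, and an equivalent test is also available as proportions_ztest in the statsmodels package.

```python
import numpy as np
from scipy.stats import norm

def two_proportion_ztest(k1, n1, k2, n2):
    """Two-sided Z test of the null hypothesis that two proportions are equal,
    e.g., the fraction of CMIP6 vs. CMIP5 models in the About Right category.

    k1, n1 : models in the category and total models for the first generation
    k2, n2 : same for the second generation
    """
    p1, p2 = k1 / n1, k2 / n2
    p_pool = (k1 + k2) / (n1 + n2)                      # pooled proportion under H0
    se = np.sqrt(p_pool * (1.0 - p_pool) * (1.0 / n1 + 1.0 / n2))
    z = (p1 - p2) / se
    p_value = 2.0 * (1.0 - norm.cdf(abs(z)))
    return z, p_value
```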
During winter for SAM, NAM, NAO, and NPO (Figs. 5a,b,c,e), the majority of models are classified in the About Right category. For these modes the proportions of models within each category are statistically indistinguishable across generations, indicating that there has been no significant improvement in amplitude skill. However, for PNA DJF (Fig. 5d), the PDO (Fig. 5f), and the NPGO (not shown), there has been a statistically significant reduction in the proportion of models in the Too Weak category between successive CMIP generations, and an increase in the proportion in the About Right category. With only a small number of models in the Too Strong category, these results indicate an improvement in the amplitude of these three modes. For the NPGO (not shown), CMIP6 and CMIP5 outperform CMIP3 in the Too Weak and About Right categories.
Figure 6 confirms the postdominant season amplitude overestimate for SAM (Fig. 6a), CMIP6 NAM (Fig. 6b), CMIP6 PNA (Fig. 6d), and NPO (Fig. 6e). Compared to winter (Fig. 5), during spring there is a greater proportion of models in the Too Strong category and a lower proportion of models in the About Right category. Degradation in CMIP6 compared to CMIP5 and CMIP3 occurs for NAM and PNA (Figs. 6b,d) where there is a significant decrease in the proportion of models in the About Right category and a significant increase in the proportion of models in the Too Strong category. Also, there are few instances of the models being Too Weak during spring. Similar results are found during summer and fall for the NPO for all CMIPs (not shown).
Repeating the analysis in Figs. 5 and 6 but using the alternative reference dataset, ERA-20C/HadISSTv2.1 (figures not shown), leads to a similar conclusion, confirming that overall improvement in the amplitude in the newer CMIP models is not clearly evident.
c. Skill comparison between Historical and AMIP
From the analysis in sections 3a and 3b, we find a strong tendency for AMIP skill to be commensurate with Historical skill, based on the modes for which CMIP6 has better skill than CMIP5 and/or CMIP3 (yellow shading in Fig. 3). Regarding the spatial pattern, for example, in winter, where the CMIP6 improvement is significant (Figs. 2c,d and 3), the Historical and AMIP simulations agree in 2 of 3 cases (NAM and NAO). Similarly, significant agreement occurs for 2 of 4 modes in spring (Figs. 2e,f and 3), 2 of 3 modes in summer, and 3 of 4 modes in fall (Fig. 3). Overall, for a given atmospheric mode and season, in most cases (10 of 14) a significant skill improvement in CMIP6 relative to either CMIP5 or CMIP3 occurs in both the Historical and AMIP simulations, suggesting that improving skill in AMIP will in most cases translate to improved skill in the Historical simulations. Considering the mode amplitude, the relative partitioning and MIP-by-MIP performance within each amplitude category (i.e., Too Weak, About Right, and Too Strong) found for the Historical simulations in spring (Fig. 6) and in other seasons (not shown) are qualitatively consistent with those of the AMIP simulations (not shown). This suggests there may be a relationship between the fidelity of the Historical and AMIP simulations.
To confirm this, we examine the relationship in relative performance across atmospheric modes between the Historical and AMIP simulations (Fig. 7). For each mode and season, linear regression indicates a statistically significant relationship (p < 0.01) in the multimodel ensemble between the Historical and AMIP simulations. The closest correspondence between Historical and AMIP simulations occurs where the regression slope is close to 1 and the correlation is large. If one’s goal is to monitor or improve the simulation of extratropical atmospheric modes of variability during model development, our linear regression results suggest that in many cases it is possible to work in AMIP mode with a high degree of confidence that the skill will translate to the coupled model. This may be particularly useful when trying to improve a model with excessive amplitude (e.g., ratios greater than 1.5 in Fig. 4). These results may also suggest that the uncertainties in the atmospheric modes arise mainly from atmospheric interactions and mechanisms, while uncertainties in SST variability and ocean processes are less influential.
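The regression underlying Fig. 7 can be reproduced in essence with an ordinary least-squares fit per mode and season, as in the sketch below (one skill value per model; variable and function names are illustrative).

```python
from scipy.stats import linregress

def hist_amip_fit(amip_skill, hist_skill):
    """Fit Historical skill against AMIP skill for one mode and season.
    A slope near 1, a high correlation, and a small p value indicate that
    AMIP skill translates closely to the coupled (Historical) runs."""
    fit = linregress(amip_skill, hist_skill)
    return fit.slope, fit.rvalue, fit.pvalue
```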
d. Comparison of evaluation methods
In sections 3a–c we used the CBF approach (Lee et al. 2019a) for the skill assessment across CMIP generations. These modes of variability have also been evaluated by applying traditional EOF analysis directly to each of the models (e.g., Stoner et al. 2009; Phillips et al. 2014; Orbe et al. 2020; Fasullo et al. 2020). Lee et al. (2019a), among others, have pointed out that in some cases model EOF1 corresponds not to the observed EOF1 but to one of the secondary observed EOFs; in such cases EOF swapping is needed for a meaningful comparison. However, the swapping strategy is itself problematic, since different methodologies for selecting the relevant model EOF can yield different results (Lee et al. 2019a). As the traditional EOF and the CBF approaches essentially ask different scientific questions, as discussed in section 1, we believe it is instructive to routinely apply both. In this section we examine to what extent our conclusions may be sensitive to the approach used.
In Fig. 8 we illustrate that, irrespective of the EOF method used for validation, we come to similar broad conclusions regarding skill across MIP generations. The figure shows Taylor diagrams (Taylor 2001) for Historical PNA DJF using three different approaches: the EOF analysis, the EOF analysis with swapping, and the CBF analysis. The Taylor diagrams graphically combine mode errors associated with pattern and amplitude via the pattern correlation, centered RMSE, and standard deviation. Because this includes the overall variance, here we use the nonscaled spatial pattern of the EOF or CBF for the Taylor diagrams, as in Fig. 9 of Lee et al. (2019a). Additionally, the standard deviation of each model is normalized by that of the reference dataset, to be consistent with the amplitude metric. The PNA was chosen for display because it is a mode for which the need for swapping frequently arises; in this case, CMIP6 is most skillful in terms of spatial pattern (Fig. 2c) and mode amplitude (Fig. 5d). Comparing Figs. 8a and 8b indicates that, for all generations of CMIP, much of the discrepancy in skill when comparing observed EOF1 and model EOF1 is due to the need for EOF swapping. Figures 8a and 8b also show that less swapping is required for the CMIP6 models than for the CMIP3 and CMIP5 models. As seen by comparing the three panels of Fig. 8, the dispersion in pattern skill within each generation of CMIP is dramatically reduced when the CBF approach is used. This result is expected because the CBF approach, by construction, is meant to determine how well the models capture the observed leading EOF pattern (Lee et al. 2019a), whereas the EOF and EOF-swap methods capture the models’ own modes of variability (see section 1). Although there is a noticeable difference in the distribution of metrics on the Taylor diagrams between the EOF-swap and CBF approaches (cf. Figs. 8b,c), each method suggests an improvement (relative to the reference dataset) for CMIP6 compared to CMIP3 and/or CMIP5 (also shown in Fig. 2c and Figs. S2c,d in the online supplemental material). Overall, the results for all methodologies are consistent in that CMIP5 and/or CMIP3 have larger dispersion in skill than CMIP6. While the individual models are not identified in Fig. 8 (for conciseness), we provide this information online, along with a full suite of Taylor diagrams for all modes and seasons (see the data availability statement).
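The quantities plotted on the Taylor diagrams can be computed as in the unweighted sketch below, in which, as described above, the model standard deviation is normalized by that of the reference pattern; names and shapes are illustrative.

```python
import numpy as np

def taylor_statistics(model_pat, ref_pat):
    """Pattern correlation, normalized standard deviation, and centered RMS
    difference between a (nonscaled) model CBF/EOF pattern and the reference
    EOF pattern (area weighting omitted for brevity)."""
    m = model_pat - model_pat.mean()
    r = ref_pat - ref_pat.mean()
    corr = np.sum(m * r) / np.sqrt(np.sum(m**2) * np.sum(r**2))
    std_norm = m.std() / r.std()                               # model std / reference std
    crmsd = np.sqrt(np.mean((m / r.std() - r / r.std()) ** 2))  # centered RMS difference, normalized
    return corr, std_norm, crmsd
```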
We further examine the sensitivity of the results to the selected analysis method by focusing on models developed at three modeling centers. In Fig. 9, as an example, we quantify the PNA DJF or NAO DJF performance of each of these models when using the traditional EOF (circle), the EOF-swap (square), and the CBF (star) approaches. The CMIP3 and CMIP5 versions of GISS-E-H exhibit the EOF swapping issue, as illustrated by the nonoverlapping circles and squares in Fig. 9a. Interestingly, the CMIP6 version of the GISS-E-H models (GISS-E2-1-H) does not have an EOF swapping issue (the green circle and square overlap). This means that the dominant EOF in the CMIP6 version of the model now corresponds more closely to the dominant observed mode than in its earlier versions, which is an improvement for this model. Comparing markers of the same shape gives an indication of relative performance across CMIPs. As expected from Fig. 8, results differ (in their location on the diagram) between the EOF and CBF methods; however, the bottom-line conclusions—improvement or not across generations—are not very sensitive to the method selected. Conclusions become more consistent between the different methods when there are fewer EOF swapping cases, such as the GFDL-CM models for NAO DJF (Fig. 9b), the CESM-WACCM models for PNA DJF (Fig. 9c), and many others (again, the full suite of Taylor diagrams across all modes and seasons is accessible on the PCMDI website; see the data availability statement).
We find many consistencies between the conclusions of the CBF and EOF methods regarding relative model performance and changes in performance across CMIP generations, although the skill score values themselves differ between the methods. From the analysis in the online supplemental material, the improved pattern-based performance in CMIP6 (compared to CMIP3 or CMIP5) is also evident in many cases with the EOF and EOF-swap approaches. Regarding the amplitude of variability, the EOF and EOF-swap approaches place more models in the Too Strong category than the CBF. The prevalence of amplitude overestimation remains, in particular for the postdominant seasons and for the NPO, regardless of the methodology selected. There are, however, notable exceptions, and we find it instructive to evaluate models by collectively examining results from the three methods. We have used the CBF method as a default because of its previously identified advantages (Lee et al. 2019a). In this study we have identified an important additional advantage—application of the CBF method more readily yields statistically significant performance changes than the traditional EOF approach, in large part owing to a reduced spread of CBF-based results associated with intrinsic variability (i.e., across multiple realizations of the same model).
e. Potential drivers of performance change
In sections 3a–3d we have quantified performance changes in the simulated extratropical modes of variability across recent CMIP generations. While this, in conjunction with the comparison of results from the coupled (Historical) and uncoupled (AMIP) simulations, is the primary objective of our study, here we consider possible reasons for the improvement in the spatial pattern of the modes. We examine the potential impact of changes in model configuration and also consider possible linkages to the simulated mean state.
We first examine the impact of changes in horizontal resolution. We focus on 16 models from the CMIP5 and CMIP6 Historical experiments, representing eight pairs of models that are confirmed to be the same model differing primarily in horizontal resolution. In some cases, minor adjustments may have been made to gravity wave drag or other parameters when increasing resolution, but otherwise the models in each pair are fundamentally the same. In Fig. 10, for each pair the performance change between the lower-resolution (circles) and higher-resolution (closed squares) models is highlighted for the dominant season of the atmospheric modes. The error measure used here is the centered RMSE of the CBF spatial patterns compared with the 20CR. Arrows pointing to the left therefore indicate that higher-resolution models outperform lower-resolution models, and arrows pointing to the right indicate a degradation in performance. The results for the atmospheric modes are inconsistent, with no indication of systematic performance changes associated with increased resolution. Similar conclusions hold for the SST-based modes (i.e., PDO and NPGO; figures not shown) and for the amplitude of all modes (figures not shown). We thus find no clear evidence that horizontal resolution explains the improvements we have identified in the pattern of the simulated extratropical modes in CMIP6.
In the vertical, increased resolution is in some cases not limited to improved representation of the stratosphere, but can also include a higher model top (e.g., in so-called “high-top” models). We partitioned the CMIP6 Historical models into 30 high-top models, whose top layers are above the 1-hPa level (Table 1), and the remaining 26 models, identified as low-top models. In Fig. 11 we find some cases where the high-top models significantly outperform the low-top models, including NAM in MAM, NAM in JJA, NAO in MAM, and SAM in all seasons but SON. This result suggests that the height of the model top may influence tropospheric extratropical variability. However, it is important to note that high- and low-top models may differ in more than just their vertical domain.
Similar to the analysis in Fig. 11, we also compared two groups of CMIP5 models, partitioned into those with and without the indirect aerosol effect, following the grouping strategy of Chylek et al. (2016). The CMIP5 models were selected for this grouping because the two groups are roughly evenly populated, whereas most CMIP6 models include the indirect aerosol effect and most CMIP3 models do not. We do not find robust evidence that including the indirect aerosol effect contributes to the improvement of the extratropical modes of variability, except in a few cases: the MAM and SON seasons for SAM (Fig. S6).
To consider possible linkages to mean-state fidelity, we selected 13 pairs of Historical-experiment models whose CMIP5 and CMIP6 versions differ only by development changes between these two phases of CMIP. These CMIP5 and CMIP6 pairs are ACCESS1.0 and ACCESS-CM2, CESM1-CAM5 and CESM2, CESM1-WACCM and CESM2-WACCM, CanESM2 and CanESM5, FGOALS-g2 and FGOALS-g3, FIO-ESM and FIO-ESM-2-0, GISS-E2-R and GISS-E2-1-G, GISS-E2-H and GISS-E2-1-H, GFDL CM3 and GFDL-CM4, IPSL-CM5A-LR and IPSL-CM6A-LR, MIROC5 and MIROC6, MPI-ESM-LR and MPI-ESM1-2-LR, and MRI-ESM1 and MRI-ESM2-0. Other models are either new entries into CMIP or models that are generally regarded as distinctly different from the others. A closer examination of model dependence is beyond the scope of this study, but its importance is noted in the discussion section.
The position and shape of extratropical modes are modulated to some degree by the mean-state pattern of sea level pressure and the jet (Barnes and Polvani 2013). To examine this possibility, we compare the NAM DJF spatial pattern error (i.e., the centered RMSE from Fig. 1a) to errors in the seasonal-mean climatology and the stationary wave pattern. We focus on the NAM because this is where we identified the most robust performance improvements from CMIP5 to CMIP6, as shown in Fig. 3. The DJF seasonal-mean climatology was taken from one realization per model (it differs little from one realization to another), and its RMSE was calculated over the same domain as the NAM against the 1980–2005 ERA5 sea level pressure field. The RMSE for the stationary wave pattern was calculated similarly, but after removing the zonal mean from the mean state. Figures 12a and 12b show no statistically significant evidence that the mean-state pattern explains the improvement of the NAM pattern in DJF or in other seasons (figures not shown), suggesting that the performance changes in the spatial characteristics of the NAM cannot, to first order, be explained by changes in the mean state. In the Southern Hemisphere, improvement in the mean-state circumpolar extratropical atmospheric circulation is one obvious candidate for influencing the spatial pattern of SAM, and Bracegirdle et al. (2020) have identified such improvement in the mean state of the CMIP6 models. To examine the possible impact of the mean-state circulation on SAM, in Fig. 12c the magnitude of the jet latitude index (JLI) bias derived from the seasonal mean state is compared to the centered RMSE metric of the SAM (from Fig. 1a) for the SON season. Following Bracegirdle et al. (2020), the JLI is defined as the latitude of the maximum of the seasonal- and zonal-mean wind at 850 hPa, and the magnitude of the JLI bias is the absolute difference between the model and ERA5 JLI. Figure 12c shows a statistically significant relationship between the jet bias and the SAM pattern error, indicating that improvement in the jet position in the newer CMIP6 versions may have contributed to the improvement in the SAM pattern. Consistent with this, the latitude of the jet maximum tends to correlate with the latitude of the maximum in the SAM pattern, as shown in Fig. 12d. The correlation between the maximum jet and SAM pattern latitudes varies seasonally: it is strongest in DJF (r = 0.95), weaker in MAM (r = 0.75), lowest in JJA (r = 0.54; figure not shown), and recovers to r = 0.73 in SON (Fig. 12d). This suggests that other seasonal factors may also contribute to the fidelity of the SAM pattern. Repeating the above analysis of the NAM and SAM using all models in the CMIP archive (i.e., not just the limited set with a documented connection between the model versions used in CMIP5 and CMIP6), we reach conclusions similar to those in Fig. 12 (not shown).
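A coarse sketch of the jet latitude index calculation is given below, assuming the 850-hPa zonal wind as the input field and taking the latitude of the grid-point maximum (Bracegirdle et al. 2020 refine the maximum location further, which is omitted here); the function names are hypothetical.

```python
import numpy as np

def jet_latitude_index(u850, lat):
    """Latitude of the maximum of the seasonal- and zonal-mean 850-hPa wind.

    u850 : ndarray (time, lat, lon), 850-hPa zonal wind for one season,
           restricted to the Southern Hemisphere extratropics
    lat  : 1-D latitude array matching u850's second axis
    """
    seasonal_zonal_mean = np.nanmean(u850, axis=(0, 2))   # average over time and longitude
    return lat[np.nanargmax(seasonal_zonal_mean)]

def jli_bias(u850_model, u850_ref, lat):
    """Magnitude of the JLI bias: |model JLI - reference JLI|.
    Assumes both fields are on the same latitude grid."""
    return abs(jet_latitude_index(u850_model, lat) - jet_latitude_index(u850_ref, lat))
```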
4. Summary and discussion
In this study, for the CMIP3, CMIP5, and CMIP6 generations of models we evaluate the fidelity of the spatiotemporal characteristics of extratropical atmospheric and ocean–atmosphere modes of variability (NAM, NAO, PNA, SAM, PDO, NPO, and NPGO) defined by the leading EOF patterns identified in observations. We have applied multiple analysis methods, using the CBF approach as a baseline and comparing our conclusions to those derived from more traditional EOF-based approaches. We have applied our benchmarking of the atmospheric modes to both coupled (Historical) and uncoupled (AMIP) simulations. We have further examined several possible explanations for our findings.
Our results indicate significant improvement in the simulated spatial patterns of some individual modes in the newer generation(s) of models. From the analysis of spatial pattern error using the CBF approach, we find improvement in CMIP6 compared to CMIP5 or CMIP3 for most modes and seasons, with a few exceptions. We also find that the improvement in the spatial pattern is not sensitive to the selection of the reference dataset, which is supported by the findings of Gerber and Martineau (2018) that the annular modes are consistent across different reanalyses except for the SAM in the presatellite era. The evidence for the improvement is, however, somewhat sensitive to the analysis methodology: improvements in CMIP6 are more robust with the CBF approach than with the EOF-based approaches, which have wider dispersion in skill. This, in addition to a different selection of models, particularly in CMIP6, may explain the discrepancies between our results and those of a similar study by Fasullo et al. (2020), whose examination includes the PDO and the DJF season of SAM, NAM, and NAO using the EOF approach.
In contrast to the spatial pattern, however, demonstrable improvement is not evident in the mode amplitude. The skill improvement in the representation of mode amplitude across CMIP generations is not very robust and is highly dependent on the mode and season. Statistically meaningful improvement of amplitude skill in CMIP6 compared to CMIP5 and/or CMIP3 is found only in a few cases (e.g., PNA DJF, PDO, and NPGO) from the categorical assessment of the proportion of models with Too Weak, About Right, and Too Strong variability. For the dominant season (i.e., winter), most CMIP5 and CMIP6 models are in the About Right category. However, there is no evidence of improvement in the systematic overestimation during the postdominant season (i.e., spring) across the CMIP generations, with a reduced proportion of the newer models in the About Right category and an excessive proportion in the Too Strong category, regardless of which reference dataset is used. One possibility is that this overestimation is related to the systematic bias in the annular mode decorrelation time scale of the CMIP3 models, whose seasonal cycle is delayed (Gerber et al. 2008). Simpson et al. (2013b) identified that the simulated SAM in the austral summer season is commonly too persistent in CMIP5 because of a deficient planetary wave feedback, which may also be related to our findings for SAM. Further research is needed to understand the underlying cause of the systematic overestimation of the amplitude in the postdominant season.
To reduce sampling uncertainty and gauge the range of structural errors in current models, we have included in our analysis all historical simulations currently available in the CMIP archive. We have not, however, considered structural dependencies (common building blocks such as atmospheric or ocean model components, or physical parameterizations) that may exist between some models which would reduce the degrees of freedom and thus impact the reported significance level of the statistical tests. Quantifying the impact of cross-model dependency (e.g., Jun et al. 2008; Masson and Knutti 2011; Pennell and Reichler 2011; Knutti et al. 2013) can be challenging, and it may be especially difficult in the assessment of modes of variability.
Another potentially interesting application of our metrics could involve quantifying changes or persistence in extratropical variability modes late in the twenty-first century. For example, following Haszpra et al. (2020) for the NAM, it would be useful to determine whether there is a significant change in intermodel spread across modes. Barnes and Polvani (2013) demonstrated a robust response of the eddy-driven jets to climate change in the CMIP5 multimodel ensemble, which consequently influences the extratropical variability modes. These and other topics related to future changes in extratropical variability should build on the foundation of present-day comparisons with observations.
The relative performance of models across CMIPs is typically consistent when assessed using the different methodological approaches (the traditional EOF, EOF with swapping, and the CBF), but there are notable exceptions, such as the NPO. The pattern-based outcomes (i.e., improvement or not) from the EOF approach for the DJF season of NAM, SAM, and NAO, and for the PDO, are largely consistent with the findings of Fasullo et al. (2020), but not identical. Fasullo et al. (2020) noted a substantial improvement in the simulated PDO but only a slight and not definitive improvement in the DJF season of NAM, NAO, and SAM in the CMIP6 models compared to GCMs in the earlier CMIPs. We find that CMIP6 models outperform CMIP3 models for SAM DJF (i.e., austral summer) and for NAO DJF (i.e., boreal winter), and that there is no significant improvement in NAM DJF. Direct comparison between our results and those of Fasullo et al. (2020) is, however, limited by notable differences in the pool of models used in each study [e.g., a total of 55 models from CMIP3, CMIP5, and CMIP6 in Fasullo et al. (2020), whereas herein we examine 123 different models along with all of their currently available ensemble members].
We have applied the CBF method (Lee et al. 2019a) as our default performance test to quantify how well observed extratropical modes are represented by models. We used this method as our default because it rectifies several deficiencies in the traditional comparison of EOFs, which we also apply in the present study as a complementary analysis. Through the course of our analysis, however, we identified an additional benefit of the CBF method—the CBF method more readily yields statistically significant performance changes than the traditional EOF approach, in large part owing to a reduced spread of CBF-based results associated with intrinsic variability (i.e., across multiple realizations of the same model). This is particularly useful when the goal is to distinguish between the strengths and weaknesses of different models or to quantify performance changes as we have done here. Nonetheless, we do advocate applying both the CBF and traditional EOF methods routinely as they target complementary questions as described in section 1.
While our emphasis is on performance change across groups of models from different generations, knowing how individual models are performing is also informative. For readers interested in the representation of spatial patterns and amplitude time series of individual modes and seasons obtained from each model, our project website provides interactive portrait plots that allow navigation from our metrics to the underlying maps and time series from which they were derived (see the data availability statement section).
For the atmospheric modes, we find a statistically significant amplitude-skill relationship between the coupled/ESM Historical simulations and the corresponding AMIP simulations for each mode and season. This suggests that successfully rectifying the overestimated amplitude in the postdominant season in an atmosphere-only (AMIP) configuration may also lead to improvement in the more computationally and labor-intensive coupled ocean–atmosphere simulations.
We have identified some clues as to how the pattern improvement of certain modes can be interpreted; however, it is difficult to identify direct drivers of cross-generational performance improvement given the variety of model configurations across the different CMIP generations.
We found no clear evidence that horizontal resolution or the incorporation of indirect aerosol effects explains the improvements we have identified in the simulated patterns of the extratropical modes in CMIP6.
We showed that there are some modes and seasons for which high-top models (model top above 1 hPa) outperform the rest of the CMIP6 models, in particular for most seasons of the SAM. This result suggests that the height of the model top likely influences tropospheric extratropical variability. Inclusion of the full stratosphere may have contributed to the fidelity of the SAM through interactions between stratospheric ozone and the Southern Hemisphere tropospheric circulation (e.g., Son et al. 2010; Ivanciu et al. 2021), but further analysis is needed to demonstrate the connection. We note, however, that high- and low-top models may differ in more than just their vertical domain, so we cannot rule out the possibility that other physical aspects of the models have also contributed to the improvement.
We also have shown that better representation of the Southern Hemisphere (SH) jet may have contributed to improvement in the SAM’s spatial pattern, which is consistent with earlier studies showing a relationship between biases in the jet’s mean latitude and the SAM time scale (e.g., Barnes and Hartmann 2010). In contrast, no clear relationship between the NAM and the mean state or stationary waves has been identified, suggesting that the improvement in the NAM may result from a combination of multiple processes rather than being primarily influenced by the mean field over the Northern Hemisphere (NH). Our result is consistent with the findings of Son et al. (2021) that the influence of the synoptic environment on the annular modes is more straightforward in the SH than in the NH.
Indeed, our findings support the conclusion of a recent study (Orbe et al. 2020) that the performance changes in the extratropical modes of variability are likely attributable to gradual improvement of the base climate and a range of relevant processes. Our examination of potential drivers of the improvement in CMIP6 has emphasized spatial patterns because that is where we have identified significant improvement. More work is needed to better understand the persistent systematic errors in the amplitude of the extratropical modes.
Other recent studies, summarized below, have compared CMIP6 with earlier model generations, and some of the characteristics analyzed in those studies may affect the fidelity of the extratropical modes of variability. These and other studies examine targeted aspects of the circulation as simulated across generations of CMIP; further research is needed to investigate their possible connections to performance on the extratropical variability modes.
In the Northern Hemisphere, Fu et al. (2020) showed an improvement in the East Asian westerly jet, and Oudar et al. (2020) found that biases in the wintertime midlatitude atmospheric circulation have been reduced. Schiemann et al. (2020) reported that the frequency and persistence of atmospheric blocking are better simulated in CMIP6 than in CMIP5. Priestley et al. (2020) noted that the storm-track biases of CMIP5 persist in CMIP6 with little improvement. Rao et al. (2020) found that most CMIP5 and CMIP6 models struggle to correctly capture the interaction between the quasi-biennial oscillation (QBO) and the tropospheric NAM. Luo et al. (2021) suggested that many CMIP5 and CMIP6 models are limited in reproducing the full strength of the observed atmosphere–sea ice connection during Arctic summer.
In the Southern Hemisphere, Bracegirdle et al. (2020) identified a reduction in the equatorward bias of the annual-mean westerly jet in CMIP6, but no clear improvement in the representation of the Amundsen Sea low, which is known to account for a large fraction of variability over the Southern Hemisphere (Hosking et al. 2013, 2016; Turner et al. 2012; Raphael et al. 2016). Beadling et al. (2020) showed improvement in the surface wind stress, but the upper Southern Ocean remains biased warm and fresh, and Antarctic sea ice extent remains poorly represented. Roach et al. (2020) reported reduced intermodel spread in the representation of sea ice area and improved spatial distribution of sea ice in CMIP6 compared to CMIP5.
Additional insight into the simulation of extratropical variability, and how it may be improved, is perhaps best explored within a single-model framework in which experimentation is feasible. For example, Kawatani et al. (2019) showed that advanced representations of the stratosphere (i.e., increased vertical resolution and a raised model top) significantly affect tropospheric circulations in their MIROC model simulations. Simpson et al. (2013a,b) found that a deficiency in the SH planetary wave feedback resulted in the SAM being too persistent in the Canadian Middle Atmosphere Model. Simpson et al. (2020) compared the CESM2 model family of CMIP6 to its CMIP5 predecessor, CESM1, and found improvements in various aspects, including storm tracks, stationary waves, the NAM, and the NAO, which may have arisen from improved physical parameterizations and the resulting mean climate. In any event, as discussed above, our results suggest that prescribed-SST experiments are suitable for examining many of the baseline characteristics of simulated atmospheric extratropical modes, making them more tractable to evaluate and potentially improve during the model development process.
Acknowledgments
We sincerely thank the two anonymous reviewers, Dr. Ron L. Miller of NASA GISS, and the editor Dr. Isla R. Simpson for their constructive comments and suggestions on an earlier draft of the manuscript. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. The efforts of the authors were supported by the Regional and Global Model Analysis (RGMA) program of the U.S. Department of Energy’s Office of Science. We acknowledge the World Climate Research Programme’s Working Group on Coupled Modelling, which is responsible for CMIP, and we thank the climate modeling groups for producing and making available their model output, the Earth System Grid Federation (ESGF) for archiving the output and providing access, and the multiple funding agencies who support CMIP and ESGF. The U.S. Department of Energy’s Program for Climate Model Diagnosis and Intercomparison (PCMDI) provides coordinating support and led development of software infrastructure for CMIP. We thank Stephen Po-Chedley, Paul Durack, Sasha Ames, Jeff Painter, Chris Mauzey, and Cameron Harr at LLNL for maintaining the CMIP database at PCMDI. The Twentieth Century Reanalysis (20CR) Project dataset is supported by the U.S. Department of Energy, Office of Science Innovative and Novel Computational Impact on Theory and Experiment (DOE INCITE) program, and Office of Biological and Environmental Research (BER), and by the National Oceanic and Atmospheric Administration (NOAA) Climate Program Office. The HadISST data are © British Crown Copyright, Met Office, 2020, provided under a Non-Commercial Government Licence (http://www.nationalarchives.gov.uk/doc/non-commercial-government-licence/version/2/).
The authors declare no conflict of interests relevant to this study. This document was prepared as an account of work sponsored by an agency of the U.S. government. Neither the U.S. government nor Lawrence Livermore National Security, LLC, nor any of their employees makes any warranty, expressed or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the U.S. government or Lawrence Livermore National Security, LLC. The views and opinions of authors expressed herein do not necessarily state or reflect those of the U.S. government or Lawrence Livermore National Security, LLC, and shall not be used for advertising or product endorsement purposes.
Data availability statement
All of the data used in this study are publicly available. The CMIP data are available on the ESGF at https://esgf-node.llnl.gov. The Twentieth Century Reanalysis (20CR) data are provided by the NOAA/Earth System Research Laboratory (ESRL)/Physical Sciences Division (PSD) at http://www.esrl.noaa.gov/psd/. The ERA-20C and ERA-5 datasets are available through ECMWF’s website at https://www.ecmwf.int/en/forecasts/datasets/browse-reanalysis-datasets. The HadISST data are provided by the U.K. Met Office at their website at https://www.metoffice.gov.uk/hadobs/hadisst/. The analysis script used in this study is implemented in the PCMDI Metrics Package (PMP; https://github.com/PCMDI/pcmdi_metrics). Interactive Portrait Plots with Dive Down information (i.e., spatial map and amplitude time series of different extratropical variability modes obtained from individual models for each mode and season) are available through the U.S. Department of Energy’s Coordinated Model Evaluation Capabilities (CMEC) project website at https://cmec.llnl.gov, in the variability modes section at https://cmec.llnl.gov/results/variability_modes/. We are working toward releasing the statistics generated from this study through the website as well; in the meantime, results are also available upon request to the authors. We also provide additional Taylor diagrams with individual models identified, on the PCMDI website at https://pcmdi.llnl.gov/research/metrics/variability_modes/taylordiagram.
REFERENCES
Archambault, H. M., L. F. Bosart, D. Keyser, and A. R. Aiyyer, 2008: Influence of large-scale flow regimes on cool-season precipitation in the northeastern United States. Mon. Wea. Rev., 136, 2945–2963, https://doi.org/10.1175/2007MWR2308.1.
Barnes, E. A., and D. L. Hartmann, 2010: Testing a theory for the effect of latitude on the persistence of eddy-driven jets using CMIP3 simulations. Geophys. Res. Lett., 37, L15801, https://doi.org/10.1029/2010GL044144.
Barnes, E. A., and L. Polvani, 2013: Response of the midlatitude jets, and of their variability, to increased greenhouse gases in the CMIP5 models. J. Climate, 26, 7117–7135, https://doi.org/10.1175/JCLI-D-12-00536.1.
Beadling, R. L., and Coauthors, 2020: Representation of Southern Ocean properties across Coupled Model Intercomparison Project generations: CMIP3 to CMIP6. J. Climate, 33, 6555–6581, https://doi.org/10.1175/JCLI-D-19-0970.1.
Bellenger, H., É. Guilyardi, J. Leloup, M. Lengaigne, and J. Vialard, 2014: ENSO representation in climate models: From CMIP3 to CMIP5. Climate Dyn., 42, 1999–2018, https://doi.org/10.1007/s00382-013-1783-z.
Bock, L., and Coauthors, 2020: Quantifying progress across different CMIP phases with the ESMValTool. J. Geophys. Res., 125, e2019JD032321, https://doi.org/10.1029/2019JD032321.
Bonfils, C., and B. D. Santer, 2011: Investigating the possibility of a human component in various Pacific decadal oscillation indices. Climate Dyn., 37, 1457–1468, https://doi.org/10.1007/s00382-010-0920-1.
Bonfils, C., B. D. Santer, T. J. Phillips, K. Marvel, L. R. Leung, C. Doutriaux, and A. Capotondi, 2015: Relative contributions of mean-state shifts and ENSO-driven variability to precipitation changes in a warming climate. J. Climate, 28, 9997–10 013, https://doi.org/10.1175/JCLI-D-15-0341.1.
Bracegirdle, T. J., C. R. Holmes, J. S. Hosking, G. J. Marshall, M. Osman, M. Patterson, and T. Rackow, 2020: Improvements in circumpolar Southern Hemisphere extratropical atmospheric circulation in CMIP6 compared to CMIP5. Earth Space Sci., 7, e2019EA001065, https://doi.org/10.1029/2019EA001065.
Cannon, A. J., 2020: Reductions in daily continental-scale atmospheric circulation biases between generations of global climate models: CMIP5 to CMIP6. Environ. Res. Lett., 15, 064006, https://doi.org/10.1088/1748-9326/ab7e4f.
Chylek, P., T. J. Vogelsang, J. D. Klett, N. Hengartner, D. Higdon, G. Lesins, and M. K. Dubey, 2016: Indirect aerosol effect increases CMIP5 models’ projected Arctic warming. J. Climate, 29, 1417–1428, https://doi.org/10.1175/JCLI-D-15-0362.1.
Compo, G. P., J. S. Whitaker, and P. D. Sardeshmukh, 2006: Feasibility of a 100-year reanalysis using only surface pressure data. Bull. Amer. Meteor. Soc., 87, 175–190, https://doi.org/10.1175/BAMS-87-2-175.
Compo, G. P., and Coauthors, 2011: The Twentieth Century Reanalysis Project. Quart. J. Roy. Meteor. Soc., 137 (654), 1–28, https://doi.org/10.1002/qj.776.
Durkee, J. D., J. D. Frye, C. M. Fuhrmann, M. C. Lacke, H. G. Jeong, and T. L. Mote, 2008: Effects of the North Atlantic Oscillation on precipitation-type frequency and distribution in the eastern United States. Theor. Appl. Climatol., 94, 51–65, https://doi.org/10.1007/s00704-007-0345-x.
Eyring, V., S. Bony, G. A. Meehl, C. A. Senior, B. Stevens, R. J. Stouffer, and K. E. Taylor, 2016: Overview of the Coupled Model Intercomparison Project Phase 6 (CMIP6) experimental design and organization. Geosci. Model Dev., 9, 1937–1958, https://doi.org/10.5194/gmd-9-1937-2016.
Eyring, V., and Coauthors, 2020: ESMValTool v2.0—An extended set of large-scale diagnostics for quasi-operational and comprehensive evaluation of Earth system models in CMIP. Geosci. Model Dev., 13, 3383–3438, https://doi.org/10.5194/gmd-13-3383-2020.
Fasullo, J. T., 2020: Evaluating simulated climate patterns from the CMIP archives using satellite and reanalysis datasets using the Climate Model Assessment Tool (CMATv1). Geosci. Model Dev., 13, 3627–3642, https://doi.org/10.5194/gmd-13-3627-2020.
Fasullo, J. T., A. S. Phillips, and C. Deser, 2020: Evaluation of leading modes of climate variability in the CMIP archives. J. Climate, 33, 5527–5545, https://doi.org/10.1175/JCLI-D-19-1024.1.
Feser, F., M. Barcikowska, O. Krueger, F. Schenk, R. Weisse, and L. Xia, 2015: Storminess over the North Atlantic and northwestern Europe—A review. Quart. J. Roy. Meteor. Soc., 141, 350–382, https://doi.org/10.1002/qj.2364.
Flato, G., and Coauthors, 2013: Evaluation of climate models. Climate Change 2013: The Physical Science Basis, T. F. Stocker et al., Eds., Cambridge University Press, 741–866.
Fogt, R. L., J. Perlwitz, S. Pawson, and M. A. Olsen, 2009: Intra-annual relationships between polar ozone and the SAM. Geophys. Res. Lett., 36, L04707, https://doi.org/10.1029/2008GL036627.
Fu, Y., Z. Lin, and T. Wang, 2020: Simulated relationship between wintertime ENSO and East Asian summer rainfall: From CMIP3 to CMIP6. Adv. Atmos. Sci., 38, 221–236, https://doi.org/10.1007/s00376-020-0147-y.
Gates, W. L., 1992: AMIP: The Atmospheric Model Intercomparison Project. Bull. Amer. Meteor. Soc., 73, 1962–1970, https://doi.org/10.1175/1520-0477(1992)073<1962:ATAMIP>2.0.CO;2.
Gerber, E. P., and P. Martineau, 2018: Quantifying the variability of the annular modes: reanalysis uncertainty vs. sampling uncertainty. Atmos. Chem. Phys., 18, 17 099–17 117, https://doi.org/10.5194/acp-18-17099-2018.
Gerber, E. P., L. M. Polvani, and D. Ancukiewicz, 2008: Annular mode time scales in the Intergovernmental Panel on Climate Change Fourth Assessment Report models. Geophys. Res. Lett., 35, L22707, https://doi.org/10.1029/2008GL035712.
Gleckler, P. J., K. E. Taylor, and C. Doutriaux, 2008: Performance metrics for climate models. J. Geophys. Res., 113, D06104, https://doi.org/10.1029/2007JD008972.
Gleckler, P. J., C. Doutriaux, P. J. Durack, K. E. Taylor, Y. Zhang, D. N. Williams, E. Mason, and J. Servonnat, 2016: A more powerful reality test for climate models. Eos, Trans. Amer. Geophys. Union, 97, https://eos.org/science-updates/a-more-powerful-reality-test-for-climate-models.
Glen, S., 2020: Z test: Definition & two proportion Z-test. From StatisticsHowTo.com: Elementary Statistics for the rest of us! Accessed 28 September 2020, https://www.statisticshowto.com/z-test/.
Gottschalck, J., and Coauthors, 2010: A framework for assessing operational Madden–Julian oscillation forecasts: A CLIVAR MJO working group project. Bull. Amer. Meteor. Soc., 91, 1247–1258, https://doi.org/10.1175/2010BAMS2816.1.
Hanna, E., J. Cappelen, R. Allan, T. Jónsson, F. Le Blancq, T. Lillington, and K. Hickey, 2008: New insights into North European and North Atlantic surface pressure variability, storminess, and related climatic change since 1830. J. Climate, 21, 6739–6766, https://doi.org/10.1175/2008JCLI2296.1.
Hannachi, A., I. T. Jolliffe, and D. B. Stephenson, 2007: Empirical orthogonal functions and related techniques in atmospheric science: A review. Int. J. Climatol., 27, 1119–1152, https://doi.org/10.1002/joc.1499.
Haszpra, T., D. Topál, and M. Herein, 2020: On the time evolution of the Arctic Oscillation and related wintertime phenomena under different forcing scenarios in an ensemble approach. J. Climate, 33, 3107–3124, https://doi.org/10.1175/JCLI-D-19-0004.1.
Hersbach, H., and Coauthors, 2020: The ERA5 global reanalysis. Quart. J. Roy. Meteor. Soc., 146, 1999–2049, https://doi.org/10.1002/qj.3803.
Hosking, J. S., A. Orr, G. J. Marshall, J. Turner, and T. Phillips, 2013: The influence of the Amundsen–Bellingshausen Seas low on the climate of West Antarctica and its representation in coupled climate model simulations. J. Climate, 26, 6633–6648, https://doi.org/10.1175/JCLI-D-12-00813.1.
Hosking, J. S., A. Orr, T. J. Bracegirdle, and J. Turner, 2016: Future circulation changes off West Antarctica: Sensitivity of the Amundsen Sea low to projected anthropogenic forcing. Geophys. Res. Lett., 43, 367–376, https://doi.org/10.1002/2015GL067143.
Hurrell, J. W., and C. Deser, 2010: North Atlantic climate variability: The role of the North Atlantic Oscillation. J. Mar. Syst., 79, 231–244, https://doi.org/10.1016/j.jmarsys.2009.11.002.
Ivanciu, I., K. Matthes, S. Wahl, J. Harlaß, and A. Biastoch, 2021: Effects of prescribed CMIP6 ozone on simulating the Southern Hemisphere atmospheric circulation response to ozone depletion. Atmos. Chem. Phys., 21, 5777–5806, https://doi.org/10.5194/acp-21-5777-2021.
Jerez, S., P. Jimenez-Guerrero, J. P. Montávez, and R. M. Trigo, 2013: Impact of the North Atlantic Oscillation on European aerosol ground levels through local processes: A seasonal model-based assessment using fixed anthropogenic emissions. Atmos. Chem. Phys., 13, 11 195–11 207, https://doi.org/10.5194/acp-13-11195-2013.
Joh, Y., and E. Di Lorenzo, 2017: Increasing coupling between NPGO and PDO leads to prolonged marine heatwaves in the northeast Pacific. Geophys. Res. Lett., 44, 11 663–11 671, https://doi.org/10.1002/2017GL075930.
Jun, M., R. Knutti, and D. W. Nychka, 2008: Spatial analysis to quantify numerical model bias and dependence: How many climate models are there? J. Amer. Stat. Assoc., 103, 934–947, https://doi.org/10.1198/016214507000001265.
Jun, S.-Y., J.-H. Kim, J. Choi, S.-J. Kim, B.-M. Kim, and S.-I. An, 2020: The internal origin of the west–east asymmetry of Antarctic climate change. Sci. Adv., 6, eaaz1490, https://doi.org/10.1126/sciadv.aaz1490.
Kawatani, Y., K. Hamilton, L. J. Gray, S. M. Osprey, S. Watanabe, and Y. Yamashita, 2019: The effects of a well-resolved stratosphere on the simulated boreal winter circulation in a climate model. J. Atmos. Sci., 76, 1203–1226, https://doi.org/10.1175/JAS-D-18-0206.1.
Kim, Y. H., S. K. Min, X. Zhang, J. Sillmann, and M. Sandstad, 2020: Evaluation of the CMIP6 multi-model ensemble for climate extreme indices. Wea. Climate Extremes, 29, 100269, https://doi.org/10.1016/j.wace.2020.100269.
Knutti, R., D. Masson, and A. Gettelman, 2013: Climate model genealogy: Generation CMIP5 and how we got there. Geophys. Res. Lett., 40, 1194–1199, https://doi.org/10.1002/grl.50256.
Lee, J., K. R. Sperber, P. J. Gleckler, C. J. Bonfils, and K. E. Taylor, 2019a: Quantifying the agreement between observed and simulated extratropical modes of interannual variability. Climate Dyn., 52, 4057–4089, https://doi.org/10.1007/s00382-018-4355-4.
Lee, J., Y. Xue, F. De Sales, I. Diallo, L. Marx, M. Ek, K. R. Sperber, and P. J. Gleckler, 2019b: Evaluation of multi-decadal UCLA-CFSv2 simulation and impact of interactive atmospheric–ocean feedback on global and regional variability. Climate Dyn., 52, 3683–3707, https://doi.org/10.1007/s00382-018-4351-8.
Luo, R., Q. Ding, Z. Wu, I. Baxter, M. Bushuk, Y. Huang, and X. Dong, 2021: Summertime atmosphere–sea ice coupling in the Arctic simulated by CMIP5/6 models: Importance of large-scale circulation. Climate Dyn., 56, 1467–1485, https://doi.org/10.1007/s00382-020-05543-5.
Mantua, N. J., S. R. Hare, Y. Zhang, J. M. Wallace, and R. C. Francis, 1997: A Pacific interdecadal climate oscillation with impacts on salmon production. Bull. Amer. Meteor. Soc., 78, 1069–1080, https://doi.org/10.1175/1520-0477(1997)078<1069:APICOW>2.0.CO;2.
Masson, D., and R. Knutti, 2011: Climate model genealogy. Geophys. Res. Lett., 38, L08703, https://doi.org/10.1029/2011GL046864.
Meehl, G. A., C. Covey, B. McAvaney, M. Latif, and R. J. Stouffer, 2005: Overview of the Coupled Model Intercomparison Project. Bull. Amer. Meteor. Soc., 86, 89–93, https://doi.org/10.1175/BAMS-86-1-89.
Monahan, A. H., J. C. Fyfe, M. H. P. Ambaum, D. B. Stephenson, and G. R. North, 2009: Empirical orthogonal functions: The medium is the message. J. Climate, 22, 6501–6514, https://doi.org/10.1175/2009JCLI3062.1.
North, G. R., 1984: Empirical orthogonal functions and normal modes. J. Atmos. Sci., 41, 879–887, https://doi.org/10.1175/1520-0469(1984)041<0879:EOFANM>2.0.CO;2.
North, G. R., T. L. Bell, R. F. Cahalan, and F. J. Moeng, 1982: Sampling errors in the estimation of empirical orthogonal functions. Mon. Wea. Rev., 110, 699–706, https://doi.org/10.1175/1520-0493(1982)110<0699:SEITEO>2.0.CO;2.
Orbe, C., and Coauthors, 2020: Representation of modes of variability in six U.S. climate models. J. Climate, 33, 7591–7617, https://doi.org/10.1175/JCLI-D-19-0956.1.
Oudar, T., J. Cattiaux, and H. Douville, 2020: Drivers of the northern extratropical eddy-driven jet change in CMIP5 and CMIP6 models. Geophys. Res. Lett., 47, e2019GL086695, https://doi.org/10.1029/2019GL086695.
Pennell, C., and T. Reichler, 2011: On the effective number of climate models. J. Climate, 24, 2358–2367, https://doi.org/10.1175/2010JCLI3814.1.
Perlwitz, J., S. Pawson, R. L. Fogt, J. E. Nielsen, and W. D. Neff, 2008: Impact of stratospheric ozone hole recovery on Antarctic climate. Geophys. Res. Lett., 35, L08714, https://doi.org/10.1029/2008GL033317.
Perlwitz, J., T. Knutson, and J. Kossin, 2017: Large-scale circulation and climate variability. Climate Science Special Report: Fourth National Climate Assessment, Vol. I, D. J. Wuebbles et al., Eds., U.S. Global Change Research Program, 161–184.
Phillips, A. S., C. Deser, and J. Fasullo, 2014: Evaluating modes of variability in climate models. Eos, Trans. Amer. Geophys. Union, 95, 453–455, https://doi.org/10.1002/2014EO490002.
Planton, Y. Y., and Coauthors, 2020: Evaluating climate models with the CLIVAR 2020 ENSO metrics package. Bull. Amer. Meteor. Soc., 102, E193–E217, https://doi.org/10.1175/BAMS-D-19-0337.1.
Pohl, B., and N. Fauchereau, 2012: The southern annular mode seen through weather regimes. J. Climate, 25, 3336–3354, https://doi.org/10.1175/JCLI-D-11-00160.1.
Poli, P., and Coauthors, 2016: ERA-20C: An atmospheric reanalysis of the twentieth century. J. Climate, 29, 4083–4097, https://doi.org/10.1175/JCLI-D-15-0556.1.
Priestley, M. D. K., D. Ackerley, J. L. Catto