This study proposes an integrated diagnostic framework based on atmospheric circulation regime spatial patterns and frequencies of occurrence to facilitate the identification of model systematic errors across multiple time scales. To illustrate the approach, three sets of 32-yr-long simulations are analyzed for northeastern North America and for the March–May season using the Geophysical Fluid Dynamics Laboratory’s Low Ocean–Atmosphere Resolution (LOAR) and Forecast-Oriented Low Ocean Resolution (FLOR) coupled models; the main difference between these two models is the horizontal resolution of the atmospheric model used. Regime-dependent biases are explored in the light of different atmospheric horizontal resolutions and under different nudging approaches. It is found that both models exhibit a fair representation of the observed circulation regime spatial patterns and frequencies of occurrence, although some biases are present independently of the horizontal resolution or the nudging approach and are associated with a misrepresentation of troughs centered north of the Great Lakes and deep coastal troughs. Moreover, the intraseasonal occurrence of certain model regimes is delayed with respect to observations. On the other hand, interexperiment differences in the mean frequencies of occurrence of the simulated weather types, and their variability across multiple time scales, tend to be negligible. This result suggests that low-resolution models could be of potential use to diagnose and predict physical variables via their simulated weather type characteristics.
Global circulation models (GCMs) are essential tools for both scientific research (e.g., Delworth et al. 2012; Taylor et al. 2012) and the development of climate services (Vaughan and Dessai 2014) at different time scales. Since, by definition,1 models are not a perfect representation of reality, different types of errors must be accounted for before GCMs can be used as a fair depiction of the real world.
Common approaches to diagnose systematic errors involve the computation of metrics aimed at providing an overall summary of the performance of the model in reproducing the particular variables of interest in the study, normally tied to specific spatial and temporal scales. For example, Gleckler et al. (2008) used different metrics to assess the climatological behavior of GCMs, Covey et al. (2016) focused on model evaluation of the rainfall diurnal cycle, and several studies have analyzed the present-time rainfall characteristics in state-of-the-art climate change simulations (see, e.g., Reichler and Kim 2008; Delworth et al. 2012; Ryu and Hayhoe 2014; van der Wiel et al. 2016). Moreover, the number of open source packages to evaluate the fidelity of GCMs compared to observations (e.g., Phillips et al. 2014; Mason and Tippet 2016; Gleckler et al. 2016) is rapidly increasing. Nonetheless, the evaluation of the goodness of a model is not always tied to the understanding of the physical processes that are correctly represented, distorted, or even absent in the model world. As the physical mechanisms are more often than not related to interactions taking place at multiple time and spatial scales (e.g., Nakamura et al. 2013; Robertson et al. 2015; Muñoz et al. 2015, 2016), cross-scale model diagnostic tools are not only desirable but required.
A complementary alternative to the common diagnostic approach mentioned above can be achieved via a nonlinear dynamical system perspective (Lorenz 1963; Palmer 1999). Thus, typical questions like “How well does this model represent the observed seasonal rainfall patterns?” are framed as part of the more general question “Can the model adequately represent the available states of the system?” (in a suitably coarse-grained phase space; see Ghil and Robertson 2002). Since only certain physical states can be accessed by the real-world system under study, as in statistical or quantum mechanics, the rationale is that models that cannot faithfully reproduce those states should not be expected to provide, for example, the right rainfall or temperature patterns for the correct reasons.
How to identify the available states? Theoretically, they are related to the concept of multiple flow equilibria (Lorenz 1969; Charney and DeVore 1979; Reinhold and Pierrehumbert 1982) and quasi-stationary regimes or metastable fixed points capable of attracting the chaotic trajectory of the system; they can be recognized in a state (or phase) space as regions that are visited more frequently—or, equivalently, regions surrounding a local density maximum. The identification of these clusters in the state space poses a nontrivial statistical problem (Stephenson et al. 2004; Christensen et al. 2015), and in general they correspond to proxies of the available states of the system, and not the states themselves. From a practical perspective, in the atmosphere those clusters are typically associated with recurrent daily circulation types or “weather types” (WTs), which have been widely studied in the literature (Lorenz 1969; Charney and DeVore 1979; Reinhold and Pierrehumbert 1982; Lorenz 2006; Vautard 1990; Michelangeli et al. 1995; Robertson and Ghil 1999; Moron et al. 2008a,b; Johnson and Feldstein 2010; Riddle et al. 2013), as they can lead to both important positive and negative socioeconomic impacts.
Formally speaking, weather types are statistical constructs associated with recurrent circulation configurations that can be used to study weather regimes, a more physical concept that often requires additional conditions [e.g., the average time spent on each regime must be long compared to other oscillations in the system, as in Lorenz (2006)]. Having stated that formal difference, for the purposes of this work both concepts are considered largely interchangeable. If correctly defined, these weather types can be understood as “building blocks” or some sort of “alphabet” that can be used to describe all the physically acceptable events of the system (Muñoz et al. 2015, 2016). This is shown schematically in Fig. 1; the events (e.g., rainy days) can be explained by the occurrence of individual weather types (letters) or particular WT sequences/transitions (words).
The underlying hypothesis used here is that climate variability across time scales can be described in terms of the frequency of occurrence of weather types at the different scales, with external (or internal) forcings taking the form of shifts in the residence time of the system in the different basins of attractions of the state space. This idea is related to the fluctuation–dissipation theorem in climate (Leith 1975), which relates the mean response to small perturbations of a nonlinear dynamical system to fluctuations in the unforced system. As indicated by Leith (1975), the nonlinear transfer of energy and enstrophy in the climate system implies sources of these variables in some scales of motion and dissipation in others; hence, considering the present cross-time scale diagnostics approach can help to understand the problem better.
Beyond the theoretical interest of associating weather types with physically acceptable states of the system, the present approach is useful in those cases in which it is possible to identify circulation-dependent systematic errors in climate models; a weather-type rectification (correction) has the potential to improve the related physical fields and events in the simulations.
A regime approach has been explored to study general biases and uncertainty in climate models at specific time scales (e.g., Palmer and Weisheimer 2011; Perez et al. 2014; Christensen et al. 2015). Since it is possible to analyze the behavior of the circulation regimes over a wide range of different time scales—for example, by aggregating their frequency of occurrence at subseasonal, seasonal, interannual and longer time scales—and since they can be associated with physical processes occurring across time scales (Moron et al. 2015; Muñoz et al. 2015, 2016), it is proposed here that the weather-typing approach provides a natural and unified framework for cross-time-scale diagnostics of GCMs. Although the scheme is deemed especially useful when considering seamless prediction systems (Hoskins 2013), it can be also employed to analyze causes of bias at any particular time scale. Furthermore, the same set of weather types can be used to diagnose a wide range of variables in a physically consistent way, thus shedding light on the causes of biases rather than focusing on the bias itself.
The goal of the present study is to illustrate the proposed approach using simulations produced by two coupled GCMs developed by the Geophysical Fluid Dynamics Laboratory (GFDL), with physical configurations designed to reproduce key aspects of the observed climate variability. For the sake of the illustration, the study is only focused on rainfall during the March–May (MAM) season for northeastern North America (NENA). The MAM atmospheric circulation regimes in this region are relatively well understood, especially in relation to flood events in the Ohio and Mississippi river basins (Nakamura et al. 2013; Robertson et al. 2015).
The rest of the paper is organized as follows. After introducing the datasets and summarizing the methods in the next section, observed and modeled rainfall climatologies are analyzed in section 4. The weather-type diagnostics are presented in section 5 for daily, subseasonal, interannual, and “decadal” time scales, while section 6 deals with possible sources of bias and an overall discussion of the results. Finally, the concluding remarks are presented in section 7.
Both observations and model outputs were used in the present study. These are described in the following subsections. The time period considered in all datasets is MAM 1981–2012.
Observed rainfall fields were obtained from the gauge-based NOAA–NCEP–CPC Unified Precipitation gridded dataset (Chen et al. 2008). This product has daily temporal resolution and a spatial resolution of 1° × 1°.
The analysis of the observed daily circulation over NENA was derived from the NCEP–NCAR Reanalysis Project (version 2, or NNRPv2) 500-hPa geopotential data on a 2.5° × 2.5° grid (Kalnay et al. 1996; Kistler et al. 2001), and also from the Modern-Era Retrospective Analysis for Research and Applications (MERRA) for the same variable on a 0.5° × 0.5° grid (Rienecker et al. 2011). These two datasets were used to investigate the impact of horizontal resolution and reanalysis methods in the identification of the weather types.
b. Model data and experimental design
This study explores the impact of (a) horizontal resolution and (b) Newtonian relaxation toward observed fields on model biases associated with the variability of circulation regimes at multiple time scales.
The three sets of climate simulations used are 32-yr long, with 5 members each, produced by two kindred coupled GCMs developed by the Geophysical Fluid Dynamics Laboratory. They share the same ocean, land, and ice model components inherited from the GFDL coupled models version 2.1, CM2.1 (Delworth et al. 2006), and version 2.5, CM2.5 (Delworth et al. 2012).
The models used in this study are the Low Ocean–Atmosphere Resolution (LOAR; van der Wiel et al. 2016), and the Forecast-Oriented Low Ocean Resolution (FLOR; Vecchi et al. 2014). Unlike CM2.1 and CM2.5, LOAR and FLOR use essentially the same atmospheric model component, based on a finite-volume dynamical core on a cubed sphere (Putman and Lin 2007), integrating along 32 vertical levels and with horizontal resolutions of 2° × 2° (C48) for LOAR and 0.5° × 0.5° (C180) for FLOR. The dynamical time step is modified to match the individual model’s atmospheric resolution, and the atmospheric physics for both models is similar to that in CM2.5 (Delworth et al. 2012; Vecchi et al. 2014). The ocean model component is the Modular Ocean Model (MOM) version 5 (Griffies 2012), at 1° × 1° and configured as reported in Vecchi et al. (2014). The land surface model is the Land Model version 3 (LM3; Milly et al. 2014), with the same horizontal resolution as the atmospheric model component. The sea ice model component is the Sea Ice Simulator (SIS), having three vertical layers, one snow and two ice, and five thickness categories, as reported in Delworth et al. (2006) and references therein. The time resolution for all model output is daily.
A typical way to explore how well uncoupled GCMs reproduce the observed natural variability is to force them with observed SSTs [e.g., à la the Atmospheric Model Intercomparison Project (AMIP) (Gates et al. 1999)]. Although this approach is, by definition, not possible for coupled GCMs, a common alternative is to use a Newtonian nudging to relax certain model fields, such as SSTs, toward observations (e.g., Rosati et al. 1997). This method has been shown useful to reproduce key aspects of the natural variability of the climate system and to decrease model biases [e.g., Luo et al. (2008), and references therein]. Thus, all three experiments used in the study were nudged to observed fields as follows.
In two sets of numerical experiments, LOARsst and FLORsst, the models’ SSTs were relaxed to the daily-interpolated observed HadISST product (Rayner et al. 2003), SSTO, using a restoring time scale days, such that
where represents the partial time derivative of the field, and the coupled model tendency of the field. These two experiments were used to analyze the impact of horizontal resolution on model biases.
Since nudging only SSTs is often inadequate (Fujii et al. 2009), as the variability in the climate system is not only controlled by sea surface temperatures, for the third set of numerical experiments, called FLORsst+strat, MERRA’s 6-hourly stratospheric temperatures and winds above 100 hPa were nudged in addition to the SST nudging described above. The same ansatz presented in Eq. (1) was used but with a relaxation time hours, and a tapering factor γ such that it is one for model layers above 50 hPa (i.e., 1, 4, 8, 14, 21, 30, and 41 hPa), and it is zero for layers below 100 hPa; from 50 to 100 hPa, γ varies linearly from one to zero, thus transitioning in a smooth way from the stratosphere to the troposphere. This same stratospheric nudging approach has been used recently by Jia et al. (2017) to show the role of the extratropical stratosphere as an important source of predictability at interannual scale. Both FLORsst and FLORsst+strat were used in this study to explore the role of nudging in model biases.
Here the general methodology is described. The diagnostic approach per se is described in section 5.
a. Cluster analysis
To identify the proxies of the available states of the system, daily circulation regimes were determined performing a k-means analysis (e.g., Robertson and Ghil 1999) on observed and modeled 500-hPa geopotential height anomalies, each at its own spatial resolution to avoid the possibility of spurious results induced by the spatial interpolation process.
The k-means technique is a partitioning method that classifies all days into a predefined number of clusters, minimizing the intracluster sum of variance, , for a partition (the Voronoi diagram generated by the classification process):
where , k is the total number of clusters in the solution, denotes the different clusters in the partition, and represents the Euclidean line element (distance) between the cluster’s maps X and centroid . In this work, no areal weighting was necessary, as the selected domain is relatively small and it does not extend into high latitudes (30°–50°N; more details below).
This procedure assigns only one cluster to each day on record. Daily geopotential data were first projected onto its six leading empirical orthogonal functions (same number for all datasets), accounting for at least 95% of the total observed variance. No additional time filtering was applied to the data before clustering, thus retaining the annual seasonal cycle and interannual, subseasonal, and synoptic weather time scales for diagnosis. No projection between the modeled and observed leading empirical orthogonal functions was performed (i.e., the modeled WTs were directly obtained using exactly the same procedure as for the observed ones). As with the avoidance of any kind of spatial interpolation before the k-means analysis, this was done in order to evaluate each model’s weather types in an unbiased way.
Two approaches were tested to define the 500-hPa geopotential anomalies (departures from the long-term mean) that characterize each weather type. In one, the anomalies of all members were concatenated along the time coordinate before the k-means process was performed, as in Muñoz et al. (2016). In the second approach, the classification of the 500-hPa geopotential anomalies was performed for each member independently, then computing the ensemble mean of the same weather type across members. The latter method was used in the present study as it respects the chronology and transition probabilities of the weather types in the presence of nonstationary data.
The classifiability index (CI; Michelangeli et al. 1995) measures the similarity of partitions of the data, and permits the identification of the minimum number of clusters required to achieve a good classification. Classification performance is measured via the pattern (or anomaly) correlation coefficient between pairs of cluster centroids and in partitions and , respectively, which allows the matrix to be computed, with elements given by
where the sums consider all the M grid points in the domain, and the grid point deviations are computed as usual:
where . The maximal value of the ith row of the matrix ρ in Eq. (3) represents the cluster in that best fits the ith cluster in . If the two partitions are identical, of course, .
In this study, a total of partitions are used, obtained from 100 different initial random seeds of the algorithm; the single partition that matches most closely the remaining ones is then selected. This similarity is computed by averaging the values of , and, following Michelangeli et al. (1995),
Clustering solutions within the range of k = 2–10 were explored for the NENA geographical domain, bounded by 30°–50°N and 105°–69°W. A k-means seven-cluster solution was found to yield a statistically significant value of the classifiability index , compared to a red-noise background modeled as a first-order Markov process having the same covariance at lag 0 and 1 days as the atmospheric data (Michelangeli et al. 1995); therefore the solution, which is the same one reported by Robertson et al. (2015), was selected for further analysis.
This set of clusters, or weather types, can be interpreted as a set of geopotential regimes that typify the daily variability, and are considered to be the so-called building blocks or letters mentioned in the introduction. Other cluster solutions were explored, verifying that the general features of the circulation regimes presented below are robust and sufficient for the purposes of this study.
b. Similarity metrics
There are multiple metrics to evaluate how well a model reproduces observations (e.g., Mason and Stephenson 2008; Jolliffe and Stephenson 2012); selection among them depends on the attribute or type of question to be addressed. For the sake of simplicity, this work focuses on the similarity of spatial patterns and temporal behavior between the simulated and observed weather types. Two metrics are chosen for that purpose: pattern correlation and the scatter index. As indicated before, the k-means analysis is performed on the native grid of the models and the observations; however, when comparing results, all calculations were always carried out in a common low resolution grid of 2.5° × 2.5° (the one of the NNRPv2 dataset).
The pattern correlation ρ is computed following Eq. (3), but considering the fact that the partitions now correspond to those modeled and observed.
The scatter index ξ is defined here as the root-mean-square error operator of the simulated and observed frequencies of occurrence of the weather types, normalized by the weather type mean frequency (e.g., Perez et al. 2014). The frequency of occurrence is defined as the proportion of time that each weather type occurs at each time scale window of interest, averaged across ensemble members for the case of model simulations (see an example below). Two versions of the scatter index are used in this work, one to evaluate the mean frequency state, , and another for the temporal variability at different time scale windows , :
where and denote the simulated and observed frequencies of the weather types, and σ denotes the corresponding sample standard deviation. The different time scales (or time scale windows) are presented in section 5. A model with a perfect representation of the observed frequency of occurrence satisfies , and in consequence its representation of the variability in the frequency of occurrence is also perfect . In general that is not the case, and each index provides information on these two statistics.
To evaluate the scatter index at a particular time scale implies performing the corresponding calculations on the time window defined for that purpose. To illustrate this idea, consider as an example the overall evaluation of a model representation of the observed weather type frequencies during the window defined as 16–31 May, for all seasons in the period 1981–2012. The scatter indices are then computed considering the observed and simulated frequencies of occurrence of weather types for those thirty-one 16-day-long windows, using Eqs. (7) and (8), where the vertical bars indicate that the indices are to be evaluated for the particular time scale (window) .
Once the scatter indices have been computed for the different time scales of interest, they can be presented using boxplots to also convey information about their uncertainties in the models; for example, if all members in a model perfectly agree on the value of a scatter index for a particular time scale, the corresponding box in the boxplot will collapse to an horizontal line. More details about these boxplots are discussed in section 5.
To analyze persistence of the weather types, the empirical probability density functions were computed for each dataset and regime. To compare between the different numerical experiments and the reanalysis, the Euclidean distance between each pair of the corresponding Weibull distribution parameters, α (scale) and β (shape), was used to assess similarity. The Weibull distribution
is commonly used in survival or duration analysis, and provided better fit than other less flexible distributions, like the exponential one (which corresponds to the special case of ).
4. Northeastern North America’s rainfall climatology
It is a common practice to evaluate the fidelity of a model compared to observations through the analysis of the climatological behavior of the variable of interest. This section uses the MAM rainfall climatology of the region under study (30°–50°N and 105°–69°W; see Fig. 2) to illustrate some key ideas of the present diagnostic approach.
The observed rainfall climatology (Fig. 2a) exhibits a clear precipitation gradient with higher amounts ( mm day−1) in a wide region covering approximately 30°–40°N and 97°–84°W (most parts of Alabama, Mississippi, Louisiana, Arkansas, Tennessee, southern Missouri, and southwestern Kentucky), with basically no precipitation along the western border of NENA (Texas, Oklahoma, Kansas, Nebraska, the Dakotas, and southern Ontario), 0.5–1 mm day−1 to the north (southern Ontario and Quebec) and 2–3.5 mm day−1 along the eastern coast (from northern Florida to Maine).
A visual inspection of the models’ climatology (Figs. 2b–d) clearly shows some important biases in both magnitudes and spatial distribution of rainfall, with simulations at higher resolution (Figs. 2c,d) exhibiting some improvement with respect to the low-resolution one (Fig. 2b).
What are the causes of such biases? Typically, issues involving model resolution or physical parameterizations (specially for rainfall) are frequently identified as the sources of bias—and indeed higher horizontal and vertical resolution, as well as better physics, tends to improve the representation of the observed fields. Nonetheless, it is also possible that the resolution and parameterizations are fit for the purpose, and something else is failing. Whatever the causes, a complementary approach consisting of the diagnosis of the physical mechanisms conducive to precipitation—rather than how well the precipitation itself is represented in the model—is also useful and most often required.
Rainfall variability in NENA is dominated by synoptic-scale circulation patterns (Archambault et al. 2008; Nakamura et al. 2013; Robertson et al. 2015; Roller et al. 2016) that are, in turn, modulated by well-known climate modes at multiple time scales like El Niño–Southern Oscillation (ENSO), the North Atlantic Oscillation (NAO), the Pacific–North American pattern (PNA), or the Madden–Julian oscillation (MJO; e.g., Ropelewski and Halpert 1986, 1987; Barnston and Livezey 1987; Wheeler and Hendon 2004).
Although some studies have found an inconclusive link between ENSO and NENA’s climate (e.g., Ropelewski and Halpert 1986), others indicate that this mode can modulate the frequency of low pressure systems and storms in the region (Trenberth and Caron 2000; Frankoski and DeGaetano 2011). Similarly, pressure and geopotential height anomalies associated with the NAO modify the frequency of occurrence of nor’easters, as well as storm track locations over the North Atlantic basin via changes to the position and orientation of the North Atlantic jet stream (Jones and Davis 1995; Archambault et al. 2008). There is also evidence of links between PNA and NENA’s rainfall anomalies (Leathers et al. 1991; Notaro et al. 2006), and between MJO’s phases 5–7 and precipitation rate in the region (Becker et al. 2011; Zhou et al. 2012).
The effect of these and other climate modes conducive to rainfall in NENA can be studied through their influence in the occurrence, persistence, and evolution of daily weather types. For example, Roller et al. (2016) analyzed NENA’s wintertime mean and extreme precipitation, storm tracks, and teleconnections using a set of five WTs defined in terms of 850-hPa winds; similar studies have been conducted in other parts of the world for different seasons (Moron et al. 2008a,b, 2012, 2015; Muñoz et al. 2015, 2016).
The rainfall climatology (Fig. 2) discussed in this section can thus be understood in terms of the (non)linear contribution of the different mechanisms represented by persistence and transitions of the region’s daily circulation regimes (WTs, or atmospheric states). A misrepresentation of the physical interactions involved, even with perfect rainfall parameterizations, can generate bias not only in the climatological precipitation field but also in higher-order statistics of the rainfall variability at different time scales. The same can be said about other variables physically associated with the behavior of those WTs.
Hence, the next section focuses on diagnosing how well the models represent the observed behavior of the WTs for different time scales.
5. Cross-time-scale diagnostics
This section first discusses the circulation regimes obtained for the observations and experiments, before turning to different tools to diagnose the weather types evolution at daily to decadal scales. Although the method is general and can be used for longer time scales, this study is constrained by the availability of longer observed rainfall records.
The computed set of seven NENA WTs found to best typify the daily circulation regimes for MAM is presented in Fig. 3 plotted over the entire hemisphere to better identify wavelike patterns, and in Fig. 4 for a more regional analysis of the weather types. The regimes with the highest observed frequency of occurrence are located to the left and the less frequently occurring to the right (see top row in those figures; the model regimes were ordered to follow each observed pattern). Equation (3) was used to identify the “model-equivalent WT” and physical interpretation of the hemispheric patterns (Fig. 3) was used in those cases in which the pattern correlation coefficients provided too similar values for two given regimes.
Two different reanalysis products were used to analyze the robustness of the method identifying the observed WTs, and the dependence on horizontal resolution (NNRPv2 at ~2.5° vs MERRA at ~0.5°). Tables 1 and 2 show a comparison of the anomaly correlation coefficient between WTs using NNRPv2 and MERRA as a reference, respectively. The results are indeed consistent across reanalyses and resolutions.
The observed weather types (top two rows of Figs. 3 and 4) exhibit synoptic-scale meridionally elongated dipolar wave patterns (WT1, WT3, WT5, WT6), and monopole/dipole patterns that are zonally elongated (WT2, WT4, WT7). Weather regimes that are predominantly associated with positive geopotential anomalies (WT1 and WT2) tend to be more frequent—and more persistent, as will be shown later—than the more transient regimes (WT4–WT7), which show spatial structures with dominant negative geopotential anomalies (see Figs. 3 and 4). Although this general separation between more persistent and more transient circulation regimes is present in the simulations, the ordering in those cases—which reflects their mean frequencies—is not exactly the same as in the observations.
As it will be shown in section 5b, there are preferred sequences of states whose pattern evolution and typical time scales suggest eastward propagation of baroclinic waves (e.g., state transitions ; top rows in Fig. 3). Generally speaking, in NENA the MAM daily circulation regimes tend to be associated with continental ridges west of the Great Lakes (WT1; top rows in Fig. 4), continental ridges over the Great Lakes (WT2), northeastern seaboard ridges (WT3), troughs centered north of the Great Lakes (WT4), deep coastal land troughs (WT5), southeastern seaboard ridges (WT6), and deep northeastern troughs (WT7). The seaboard ridges (WT3 and WT6) have been shown to be associated with extreme rainfall events in the Ohio River basin; these two circulation types tend to occur more often during La Niña years, and phase 5 of the MJO also modulates their seasonal frequency, with lower occurrence for WT3 and higher presence for WT6 [for additional details see Robertson et al. (2015)].
There is no simple answer to the question of what set of experiments best represents the observed weather types, as the answer depends on the basis for the comparison. The shapes, magnitudes, locations, tilts, and frequencies of the weather types are characteristics to consider. On average, both LOAR and FLOR models do a good job reproducing the observed circulation regimes, although some important biases are present (Figs. 3 and 4; Tables 1 and 2). For example, the spatial features of WT1, the most frequently observed, are well represented in all experiments, but the ensemble-mean frequency of occurrence is about 30% too low in FLORsst. The observed dipolar configuration in WT3 is basically absent in the FLORsst+strat ensemble mean for that regime, and for WT4 all experiments exhibit southwardly shifted and widely enhanced negative geopotential anomalies with respect to the observations. WT5 appears severely tilted (about ) in the simulations, arguably due to a misrepresentation and location of the weak positive geopotential anomaly observed north of 50°N near and over the Great Lakes (Fig. 4). FLOR tends to show positive geopotential anomalies along the eastern coast of the United States in WT6, especially in the FLORsst+strat experiment, which also exhibits important frequency biases for that regime. Finally, although the simulated WT7 spatial patterns are very well represented (see high correlations in Tables 1 and 2), all experiments overestimate its frequency of occurrence.
Since there are no striking differences in the weather type representations in terms of the resolution of the reanalyses, the NNRPv2 product is selected as the reference in all following discussions about circulation regimes characteristics, unless otherwise indicated.
The persistence of these circulation patterns and their transitions control, both in the real world and in the simulations, aspects like the moisture that is advected into NENA, and thus this weather-within-climate approach is normally used to better understand the physical mechanisms behind the occurrence or not of (extremely) rainy days in a region (Robertson and Ghil 1999; Moron et al. 2008a, 2013, 2015; Muñoz et al. 2015, 2016). As discussed in section 4, the systematic errors in the simulated weather regimes can help to explain the biases in other variables, as for example in the rainfall field and its climatological behavior. To illustrate this idea, Fig. 5 shows the average daily rainfall regimes (RR), defined by compositing rainfall values associated with each weather type.
As expected, FLOR tends to simulate better the rainfall field than LOAR at a local level; however, at a regional scale, there are not statistically significant differences (; two-sample Student’s t test) in the spatial correlation coefficients between the three experiments. Table 3 shows the anomaly correlation coefficients between the rainfall regimes in the models and observations. Overall, the most significant biases appear in the daily rainfall regimes associated with the weather types showing greater bias over NENA (Fig. 5), that is, WT4 and WT5. Errors in magnitude and spatial distribution of the rainfall anomalies are related to misrepresented atmospheric circulations and moisture fluxes in WT4 in all experiments (not shown), exhibiting positive rainfall anomalies along the coast as a result of the deformed and displaced trough discussed above (Fig. 4). In spite of the tilt that WT5 exhibits in the simulations, the associated daily rainfall patterns for NENA are not too biased, except along the northeastern coast where the anomalies should have opposite sign; this is attributed in part to the fact that the circulation and moisture fluxes over NENA are similar to the observed one, even when that is not the case at a larger scale. Likewise, although WT3 is the circulation regime with the worst large-scale anomaly correlation coefficients (e.g., Table 1), its representation of the associated physical mechanisms controlling rainfall over NENA provides a very good depiction of the observed rainfall regime RR3. Having this regime well represented in the simulations is important to adequately represent extreme rainfall events over the Ohio River basin (Robertson et al. 2015).
Biases both in the spatial patterns (involving shape, magnitudes, tilt, location) and the frequency of occurrence of the simulated weather types contribute to biases in rainfall and other variables at different time scales. In studies involving a large number of years (e.g., 100 years), two different k-means solutions should be computed to analyze, for example, whether the climate change signal is modifying the weather types’ spatial patterns or the dimensionality of the phase space. For studies like the present one, however, which involves just three decades, the spatial patterns are normally assumed to be constant. This allows for the analysis of the weather types’ temporal variability at different time scales in terms of changes in their frequency of occurrence at those time scales.
A possible way to summarize how well the simulations reproduce both the mean frequency of occurrence and its variability across multiple time scales is through the use of the corresponding scatter indices [see Eqs. (7) and (8)], presented in Fig. 6 for two particular subseasonal windows (20–30 March and 4–14 May; further discussed in section 5c), for the interannual variability considering all MAM seasons in the 1981–2012 period (section 5d), and for the first and last decades in the same period of years (section 5e). As with the case of the spatial patterns (e.g., Table 1), the analysis of the frequencies of occurrence indicates that no particular model or experiment can be overall considered the best one across all the time scales in study. Certainly, there are differences in terms of the median and dispersion values at particular scales, with higher errors and uncertainties in the subseasonal case; nonetheless, within-scale differences in the medians are normally negligible. Furthermore, compared to LOARsst, FLORsst tends to have the same or a lower dispersion for both scatter indices (Fig. 6).
a. Klee diagrams
Klee diagrams, shown in Fig. 7, are a way to represent the temporal evolution of the available states of the system (Muñoz et al. 2015). These diagrams consist of a simple matrix plot sketching the daily evolution of weather types for the entire period under study. They are equivalent to the representation of the Viterbi state sequences, except for the fact that no Viterbi algorithm or hidden Markov model (e.g., Robertson et al. 2006) is involved in the process. A Klee diagram is the basis for analyzing daily transitions and temporal variability at subseasonal, seasonal, decadal, and longer time scales. Moreover, it has been used to define subseasonal-to-seasonal states for forecast purposes, using hybrid dynamical–statistical models (Muñoz et al. 2016).
If a simulation has exactly the same Klee diagram as the one computed from the observations, then it has a perfect representation of the observed weather types’ evolution across time scales, independently of how well the model represents their spatial patterns; of course, even if the models were perfect, the Klee diagrams would not be exactly the same due to atmospheric chaos. Nonetheless, the idea is conceptually useful, and helps to define several ways to diagnose similarity in the temporal evolution and associated statistics. Moreover, having simulated Klee diagrams that exhibit similar characteristics to observations is also of practical use because it is then possible to use a combination of modeled frequencies of occurrence and observed spatial patterns to represent the evolution of the circulation regimes at a given time scale. In such cases, this approach has important implications for prediction, an idea that is briefly discussed in section 6, and that will be explored in more detail elsewhere.
Overall, the observed and simulated Klee diagrams for NENA show high similarity (Fig. 7). Since it is not possible to have an ensemble mean of Klee diagrams because they represent categories and not real numbers, the analysis must be performed on a member-per-member basis. Visual inspection suggests that all experiments exhibit a dominance of WT1 and WT2 toward the end of the season, and of WT6 and WT7 during the first half, all consistent with the observations and consistent with the predominance of negative height anomalies in March and positive ones in May within the MAM season (Fig. 4). However, some biases are apparent, like the presence in LOARsst and FLORsst+strat of too many days with WT3 at the end of the season, which, for example, could provide false alarms in those cases in which WT3 is related to floods in the Ohio River basin.
Different statistics can easily be computed from the Klee diagrams to help diagnose and compare dynamical models. Some examples—like daily transition probabilities, mean durations (persistence) for each weather type, and other summaries to analyze the evolution of the circulation regimes at different time scales—are discussed in the following subsections.
b. Daily transitions and typical durations
Daily transition matrices are commonly used to characterize persistence and preferred state transitions. Their diagonal sketches the persistence probabilities for each weather type, and the off-diagonal elements represent the conditional probabilities of transition to a particular regime (along the horizontal axes of the matrix; “posterior WT”) given that a different weather type (along the vertical axes; “prior WT”) occurred on the previous day.
Daily transitions in NENA, presented in Fig. 8, are dominated by persistence, with continental ridges over the Great Lakes (WT2) being the most probable regime to persist, especially during May (Fig. 7). The most frequently observed statistically significant transitions (; see Vautard et al. 1990) suggest circuits like or , that can be associated with the propagation of baroclinic waves over NENA, but also transitions leading to highly persistent states, like , associated with anomalously high moisture fluxes from the Gulf of Mexico and the Caribbean conducive to rainy days along the eastern coast, followed by days with a high pressure system stationed in the northern part of the region, typically conducive to low-level divergence and no rain (see Figs. 4, 5, and 7).
All simulations for NENA adequately represent the fact that the persistence probabilities are significantly higher than nonself transitions. As expected, some biases exist; for example, the persistence of WT4 in the LOARsst and FLORsst experiments (see Fig. 8) is overestimated, and there are consistent biases in all experiments for transitions like , , , , and . But in general, all models tend to capture most of the observed transition probabilities, with differences always less than ±15%; moreover, no statistically significant differences (; two-sample Student’s t test) exist between the average transition probabilities of the experiments.
It is also important to evaluate if the simulations fairly represent the typical persistence of the weather types. Figure 9 shows relative frequency histograms and their corresponding Weibull distribution fit for the most common durations; for comparison, Table 4 presents the values of the Weibull distribution parameters α and β, which measure the scale (or characteristic life) and the shape (or slope) of the distribution, respectively.
This analysis indicates that WT1–WT3 (predominant in May) tend to persist more than the other weather types (as expected from the visual inspection of the observations’ Klee diagram; Fig. 7), WT4 tends to transition faster than the other regimes, and WT5–WT7 exhibit similar duration probability density functions (PDFs), all these facts in agreement with the observations.
As with the spatial patterns of the weather types between models and observations, there is not a unique set of experiments that is consistently better than the others (i.e., for which all the regimes adequately represent the observed persistence distributions); nonetheless, FLORsst tends to have better self-transition statistics for WT2, and WT4–WT6 (Table 4; bold numbers indicate experiments with the best representation of self-transitions, or persistence, for each WT). Overall, although the main characteristics of the duration PDFs are captured well by the experiments (e.g., the fact that can be well modeled by a Weibull distribution, higher persistence in WT1–WT3), certain features are not. It is possible to identify WT5 and WT6 as the regimes with the worst representation in terms of persistence in all simulations (Table 4); for example, WT5 overestimates the number of “early” transitions by at least a factor of 1.5. As complementary information, Table 5 summarizes the average persistence of each WT in the NNRPv2 and the numerical experiments; results are consistent with the discussion presented above and Table 4.
c. Subseasonal evolution
The typical subseasonal evolution of the weather types can be analyzed using their “climatological” frequency of occurrence as computed by the regime’s appearance for each calendar day, after an 11-day moving average is applied to the frequency of occurrence time series in order to filter out the shortest time scales. This metric, hereafter referred to as “subseasonality” to avoid the cacophonic phrase “subseasonal seasonality,” is shown in Fig. 10. Clearly, MAM is a transition season between boreal winter and summer that could be characterized by low occurrence of ridge configurations (WT1–WT3) during March, and their dominance during May.
Generally speaking, the sketched subseasonal evolution of the weather types in the numerical experiments is similar to the observations, although there is an overall delayed subseasonality in several weather types (cf. the slopes of the curves in Figs. 10b–d with the ones in Fig. 10a). Biases are present both in the subseasonal frequency of occurrence of some weather types and in their actual evolution. For example, the frequency of WT4 is underestimated in all experiments, especially after calendar day 50 (Fig. 10); the model-equivalent WT5–WT7 tend to have more or less the same relative frequency of occurrence in April than in March, which is inconsistent with observations; WT1 has its peak of occurrence typically between calendar days 65 and 75 (4–14 May), but this maximum appears about 10 days earlier in FLORsst+strat.
Because of the clear differences between the beginning and the end of the season, two periods were considered for further analysis: 20–30 March and 4–14 May (see black boxes in Fig. 10).
Errors in the median values of the simulated frequencies of occurrence of weather types, as measured by the scatter index (Fig. 6a), tend to be similar between the two periods under consideration, although with higher dispersion during the second half of the season, which is mostly due to the misrepresentation of the subseasonality of WT4–WT6. On the other hand, errors in the standard deviations of the occurrences are more clearly discriminated (Fig. 6b), being similar in all three experiments but with higher values for the 4–14 May period. For the end-of-March section under study, the median values for the standard deviations of the simulated frequencies are similar to the other time scales analyzed in this work (interannual and decadal), although LOARsst exhibits higher and similar dispersions at subseasonal scale than at the other scales.
d. Interannual variability
Analysis of the ensemble-mean interannual evolution of the frequency of weather types, shown in Fig. 11, indicates that the highest root-mean-square errors, presented in Table 6, occur for patterns associated with troughs north of the Great Lakes (WT4), especially for LOARsst (~11.7 days per season, compared to ~10.0 and ~9.8 days for FLORsst and FLORsst+strat, respectively). On the other hand, WT3 and WT7—northeastern seaboard ridges and deep troughs—have the best average representation of the interannual variability in all experiments, with slightly lower errors for LOARsst and FLORsst+strat (~5.1 days for both WT3 and WT7 in both experiments, compared to ~6 days for the same weather types in FLORsst). Nonetheless, on average, there are no statistically significant differences (, using a two-sample Student’s t test) between the experiments, with an overall median error of 7.3 days, which is lower than 10% of the total days in the season. This result is consistent with the scatter index values for the mean frequency of occurrence at interannual scales (Fig. 6a).
The variability in the frequencies—measured by the associated standard deviation—is, again, very similar between the experiments. Moreover, the fact that there is relatively low dispersion in the variability scatter index for almost all time scales (Fig. 6b) implies agreement between the different members in each numerical experiment, suggesting low model uncertainty in the reported values of this parameter. Nonetheless, overall Fig. 11 suggests that the experiments do not seem to capture well the observed interannual variability.
e. Decadal differences
There are not enough years to formally study interdecadal variability, and thus only the differences in the frequency of occurrence of weather types were analyzed for the first and last decades, after smoothing the frequency of occurrence time series using an 11-yr moving average, to pick up the long-time-scale signals of interest in the present analysis.
Differences in medians and dispersions of the mean frequency of occurrence are negligible for the two decades (Fig. 6a), although dispersion in the simulated mean frequencies of occurrence are definitively higher in the FLOR experiments than in LOARsst for the 1981–90 decade. FLORsst and FLORsst+strat show slightly lower medians for the scatter index of the standard deviations for the 2003–12 decade, although with higher dispersions than LOARsst. Overall, both decades show the same ranges and values for the statistics considered.
The approach presented here offers tailored diagnostics to understand possible sources of model biases at multiple time scales. The basic idea is that the inherent biases at the synoptic or low-frequency variability scale in models could be rectified at larger scales, according to how climate drivers excite the observed weather types differentially on longer subseasonal-to-seasonal and seasonal-to-decadal scales. A variety of diagnostic metrics were explored to identify model errors at several time scales. Nonetheless, putting together the big picture provided by the different metrics is in general not a trivial task.
The analysis performed indicates that, at large scale, the simulations have trouble representing the observed spatial pattern associated with WT3 (Table 2), although at regional scale—over NENA—the spatial configuration of the northeastern seaboard ridges is actually good enough to provide a fair representation of the observed rainfall regimes (Table 3), suggesting that the observed physical mechanisms that control rainfall in that case (e.g., moisture and heat transport from the Gulf of Mexico; Fig. 4) are present in the simulations. This has important implications for flood prediction in the Ohio River basin (Nakamura et al. 2013; Robertson et al. 2015).
In contrast, the lowest rainfall pattern correlations are obtained for the regime associated with troughs north of the Great Lakes (RR4, WT4; see Table 3), attributed earlier to the southward displacement of the geopotential anomaly with respect to the observed pattern, and its enhancement over most of North America. This is truly the worst circulation regime simulated in the experiments. Not only is the spatial pattern poorly simulated, but also the depiction of the observed temporal variability across all time scales is the worst of all WTs. These issues are hypothesized here to be part of the same pathology, and they could be related to problems in the models’ rendition of tropical–extratropical interactions, as the seasonal frequency of occurrence of WT4 is significantly correlated with the Niño-3.4 SST index (not shown), a link that will be treated—along with other teleconnection indices and circulation regimes—in a future paper, following the methodology discussed in Muñoz et al. (2015).
The other weather regimes have a fair representation of the geopotential anomalies and rainfall patterns over NENA, although this is not necessarily true at the continental or hemispheric scale.
Altogether, there are not significant differences in the average performance of LOAR and FLOR in terms of the reproduction of the spatial patterns and temporal variability of the observed weather types, although a certain experiment can be better than the others when considering particular characteristics (e.g., mean frequency at interannual scale, or the persistence of WT4; see section 5). In this work, the question of which model is better is really a question of what nudging approach performs better and what is the impact of horizontal resolution.
Nudging both SST and stratospheric fields did not consistently improve the representation of the weather types in the model, and indeed in some cases provided worse results than the SST-only nudging experiment (FLORsst). It is possible that some stratosphere–troposphere and ocean–atmosphere interactions are not being well simulated, and that some improvement can be achieved if the vertical resolution in the model is increased to adequately account for the stratospheric processes. This is a matter of future research.
As indicated earlier, no significant improvement was found when increasing horizontal resolution, either in the reanalysis products or in the model experiments (LOARsst and FLORsst). This is attributed to the fact that synoptic-scale 500-hPa geopotential height anomalies do not really require high resolution in order to reproduce the key physical mechanisms associated with, for example, propagation of Rossby waves that perturbs circulation patterns, or the moisture advection conducive to rainfall; in addition, the topography in NENA is not tall or complex enough as to negatively impact the low-resolution model. Yet, high horizontal resolution could be important in other regions of the world.
Although high spatial resolution does not seem to be necessary to satisfactorily reproduce observed weather types, high-resolution models like FLOR have the advantage of providing physically related variables like rainfall or surface temperature at a resolution preferred by decision-makers and for the provision of climate services (Vaughan and Dessai 2014). Nonetheless, it is possible to exploit the fair representation of weather type characteristics by faster coupled models like LOAR to “reconstruct” fields of interest (e.g., the rainfall climatology in NENA).
Since the rainfall climatology can be computed in terms of a linear combination of rainfall regimes and the total frequency of occurrence of those regimes ,
for regimes, if the model has an adequate representation of only one of these fields, then observations could be used to compute the other factor in the linear combination. For example, the rainfall regimes of the experiments analyzed in this work (ModRR) could be used in conjunction with the observed frequencies of occurrences (ObsF) to reconstruct the observed rainfall climatology, as shown in Figs. 12a and 12b. For other approaches that can be used for prediction purposes, see Moron et al. (2008b).
As expected, the experiments with FLOR provide better spatial rainfall patterns than LOARsst because of their higher resolution. Although there are biases, the use of the observed frequency of occurrences has improved the original simulated rainfall climatologies (Figs. 2b–d). Further improvement is obtained if the observed rainfall regimes (ObsRR) are used in conjunction with the modeled frequencies (ModF; Fig. 12c). With a satisfactory bias-correction method, this approach has the potential to provide relatively economic—at least from a computational point of view—diagnostic products and forecasts.
7. Concluding remarks
This work discussed a new diagnostic framework to evaluate the performance of models across multiple time scales, based on their representation of the observed spatial and temporal variability of weather types. Under this nonlinear system dynamics perspective, “good” models are those that correctly reproduce the observed characteristics of the weather types at multiple time scales.
The framework takes advantage of the weather-typing decomposition, in terms of the spatial patterns of the circulation regimes and their temporal evolution, to analyze model performance at multiple time scales focusing on the evaluation of tailored statistics like daily transition probabilities, weather type mean durations, and subseasonal, interannual, and longer-term frequencies of occurrence. Furthermore, since the circulation regimes are normally linked to concrete climate modes, they can also be used to diagnose model biases from a physical perspective, like deformations or displacements of particular geopotential height configurations that control the occurrence of rainfall in a region of the world.
To illustrate how the diagnostic approach works, it was applied to three different sets of numerical experiments using Geophysical Fluid Dynamics Laboratory coupled circulation models. The simulations tend to represent well the location, shape, and magnitude of daily circulation regimes and associated rainfall patterns, although some important biases were reported and discussed. Further research is being conducted to perform an in-depth analysis of possible tropical–extratropical interactions that might not be well represented by the models.
Finally, the present framework can also be used for model intercomparison, and can be applied to uncoupled and regional models.
The authors are grateful to Tony Barnston and Simon Mason (IRI) and Nathaniel Johnson and Ángel Adames (GFDL) for discussions about different aspects of the paper. ÁGM was supported by National Oceanic and Atmospheric Administration’s Oceanic and Atmospheric Research, under the auspices of the National Earth System Prediction Capability. AWR was supported by NOAA Next Generation Global Prediction System (NGGPS) project Grant NA16NWS4680014. This paper is dedicated to Eneko Muñoz and Cathy Vaughan (ÁGM’s nena).
The word model is derived from Latin modulus, a “measure” or “standard” of something, which evolved to today’s idea of “likeness made to scale,” for example in clay or wax, or via mathematical or numerical expressions.