## Abstract

A method to incorporate synoptic eddies into the diagnosis of circulation regimes using cluster analysis is illustrated using boreal winter reanalyses of the National Centers of Environmental Prediction (hereafter observations) over the Pacific–North American region. The motivation is to include the configuration of the high-frequency (periods less than 10 days) transients as well as the low-frequency (periods greater than 10 days) flow explicitly into the definition of the regimes.

Principle component analysis is applied to the low-frequency 200-hPa height field, and also to the low-frequency “envelope” modulations of the rms of high-frequency meridional velocity at 200 hPa. A maximum covariance analysis of the height and envelope fields, carried out using the appropriate principal components, defines three modes as explaining most of the covariance. This defines the minimum dimensionality of the space in which to apply *k*-means cluster analysis to the covariance coefficients. Clusters found using this method agree with results of the previous work.

Significance is assessed by comparing cluster analyses with results from synthetic datasets that have the same spectral amplitudes (but random phases) of seasonal means and, separately, intraseasonal fluctuations as do the original observed time series. This procedure ensures that the synthetic series have similar autocovariance structures to the observations. Building on earlier work, the clusters obtained are newly tested to be highly significant without the need for quasi-stationary prefiltering.

## 1. Introduction

The analysis of low-frequency atmospheric variability within the framework of circulation regimes has proved quite useful in terms of relating synoptic behavior and extreme surface anomalies to the large-scale flow (Ayrault et al. 1995; Arnott et al. 2004; Cassou et al. 2005; Stan and Straus 2007). Methods that have been used to identify such regimes include cluster analysis (Michelangeli et al. 1995), modeling the probability distribution function in terms of additive Gaussian structures (Smyth et al. 1999), and statistical models based on neural networks (Monahan et al. 2001), although this last method has been questioned (Christiansen 2005).

However, the vast majority of observational studies identifying regimes have one feature in common: the analysis is carried out on time series of a single variable (often geopotential height or streamfunction), often filtered to remove synoptic time-scale variability, and expressed in a reduced dimensional representation [usually principal components (PCs)].

Theoretical studies, on the other hand, have shown the importance of baroclinic-scale transient feedback on the large scale as a mechanism to help support circulation regimes (Reinhold and Pierrehumbert 1982; Vautard and Legras 1988). Observational studies have confirmed the feedback between baroclinic waves and low-frequency structures (Nakamura and Wallace 1990; Nakamura et al. 1997).

The purpose of this article is to introduce a method to identify circulation regimes that includes baroclinic-scale properties. This is accomplished in the context of the work of Straus et al. (2007, hereafter SCM), in which circulation regimes were associated with significant clustering of low-frequency filtered height maps in a state space defined by PCs.

This paper uses the same reanalysis dataset as in SCM, but incorporates the baroclinic eddy properties to help define a minimum state space dimension. In addition we use an improved method of assessing significance, involving a generalization of the “random phase” approximation of Christiansen (2007).

## 2. Data and analysis

Time series of once-daily geopotential height and meridional wind at 200 hPa were obtained from the reanalyses of the National Centers for Environmental Prediction (NCEP) for the 152-day period starting 21 November, for each of the 54 winters 1948/49 through 2001/02. An estimated climatological annual cycle was subtracted (as in SCM), so that the resulting series retain both interannual and intraseasonal variability. A digital filter was used to divide the time series of each variable into two components: a “low frequency” component containing all periods longer than about 10 days, and a “high frequency” component (retaining periods shorter than 10 days), representative of baroclinic eddies. The digital filter is similar to that used by, for example, Blackmon (1976), who use a filter that retains periods of 2–6 days. Here the high-frequency category extends up to periods of 10 days, since this filter is meant to capture not only the growth and propagation of baroclinic waves, but their life cycle as well (Simmons and Hoskins 1978). We denote the high-frequency components by primes.

Daily time series of the rms of high-frequency meridional wind were computed as

The modulation of *υ*_{rms} series on time scales longer than 10 days was obtained by retaining only the low-frequency component. The use of such an “envelope” functions is given (e.g., in Nakamura and Wallace 1990). Since the meridional wind is dominated by the longitudinal derivative of the streamfunction, it emphasizes the higher zonal wavenumbers.

Note that we need to filter twice: once to obtain the high-frequency fields appearing on the right-hand side of Eq. (1) and a second time to low-pass filter the resulting rms velocity. Since each application of the filter results is a loss of 15 days of the record at the beginning and at the end, the high-frequency fields are available for 120 days starting 7 December, but the envelope fields are available only for the 92 days starting 21 December for each winter.

A principal component (or empirical orthogonal function) analysis was carried out on the 92-day time series of low-frequency 200-hPa geopotential height, over the domain 20°–90°N, 150°E–30°W, denoted by the vector **Z**, and, separately on the low-frequency envelope of *υ*_{rms}, denoted by the vector **E**. The spectrum of explained variance for **Z** is fairly red, with the leading three modes explaining 22%, 16%, and 12% of the total space–time variance, respectively, while the leading 10 modes together explain 84%. In contrast, the spectrum of the **E** modes is somewhat less red, with the leading three modes explaining about 15%, 10%, and 9% of the total variance, and the leading 10 modes explaining 59%.

The covariance of the envelope field with the low-frequency flow is assessed with the use of maximum covariance analysis (MCA),^{1} applied to the PCs of **Z** and **E**. This analysis identifies linear combinations of PCs that maximize the squared covariance between the two fields. For more details on this and related multivariate analysis techniques, see Bretherton et al. (1992) and Wallace et al. (1992).

The results of the MCA calculations depend on the truncation (number of PCs) of the input fields. Table 1 gives a number of statistics as a function of PC truncation: the total squared covariance explained, the squared covariance explained by each of the first three modes, the percent of total squared covariance explained, and the percentage of the truncated variance of **Z** explained by each mode. In addition, the correlations between the leading PCs of **Z** and **E** with the corresponding leading MCA time series are shown. The correlations for the first mode are highly significant, indicating that the MCA analysis is not overly dominated by the **Z** spectrum, which is the redder of the two.^{2}

One robust result from Table 1 is that the first 3 MCA modes dominate at all truncations, explaining at least 75% of the total squared covariance. At least five PCs are required, however, to represent these three modes in a consistent fashion.

Figures 1 –3 compare the MCA patterns of **Z** and **E** for truncations of 5 and 10 PCs for the leading three modes. All the patterns shown are heterogeneous correlation functions (Bretherton et al. 1992), so that the maps of **E** give the covariances of this envelope field with the (standardized) **Z** time series associated with the given mode, while the height patterns give the covariance of **Z** with the (standardized) **E** time series associated with the given mode.

The leading MCA pattern shows a strong high in the North Pacific and a trough over northwestern North America (Figs. 1a,c). The associated northward shift in the storm track (seen in the envelope of *υ*_{rms} in Figs. 1b,d) is consistent with the shift in the jet implied by Figs. 1a,c.

There is agreement in the overall pattern between the two representations based on a different truncation. This is also true for the second mode (Fig. 2), which shows a North Atlantic mode combined with a low over western North America. Here we see that the weakened westerlies (anomalous easterlies) in the North Atlantic are consistent with the reduced storm track (envelope) activity, while farther south the situation is reversed. In the Pacific, the low over western North America (high in the North Pacific) is also associated with consistent storm track changes.

The third MCA pattern shows a sort of asymmetric annular mode, with enhanced height gradients in the Pacific (and a low over Alaska), consistent with enhanced storm track activity in this region. This configuration holds also in the North Atlantic, although the height gradient and envelope function anomalies are fairly weak in the five PC truncation.

## 3. Cluster analysis and significance

### a. Truncation and algorithm

We apply the cluster analysis directly to the set of **Z** time series associated with the first *M* MCA modes, where the dimension *M* is to be chosen. (Each atmospheric state is represented by a point in a *M*-dimensional space.) The cluster analysis consists of a partitioning algorithm that, given a prespecified number *k* of clusters, iteratively assigns all points to one of *k* groups (clusters) so that the ratio of the within-cluster variance to the total variance is minimized (Michelangeli et al. 1995). No quasi-stationary filtering was applied (see SCM). Once each state is identified as belonging to a particular cluster, corresponding composite maps of **Z** and **E** are obtained using the full original gridpoint representation of the filtered datasets. The composites are presented as anomalies from the full climatology for all fields.

To apply the cluster analysis we must first decide the dimension *M* and the number *N* of PCs (i.e., the truncation) to be used to define the MCA modes. The results of the previous section suggest that only three distinct MCA modes explain the covariance of the storm track envelope field with the low-frequency height field. A minimum truncation of 5 PCs for each field is necessary to capture these MCA modes, suggesting that the truncation *N* be at least 5. We find consistent cluster analysis results for a range of truncations *N* ≥ 5 and dimension *M* ≥ 3.

Figures 4 –7 compare the composite fields of **Z** and **E** for 4 clusters (*k* = 4) for the choices of (*N* = 5 and *M* = 3), (*N* = 5 and *M* = 5), (*N* = 10 and *M* = 5), and (*N* = 10 and *M* = 10). The height composites are indicated by the contours, the envelope *υ*_{rms} by the shading. The Pacific trough, Alaska ridge, and Arctic high patterns of SCM show reasonably good agreement across the various choices of truncation and dimension, with the envelope field consistent with the height gradient anomalies.

### b. Significance

Cluster analysis seeks to identify regions in the reduced space whose probability density is higher than would be expected from a suitable background (a multinormal distribution). Although PCs are linearly uncorrelated (orthogonal in time), a “significant” cluster implies that they are not statistically independent. Significance testing is carried out by generating a large number of independent synthetic time series for each PC. Using the approach we describe below, we are assured that each synthetic time series approximately captures the temporal correlation structure of the corresponding real PC, yet they are statistically independent.

This approach, designed for clusters computed from PCs, would not be appropriate if applied directly to an MCA-based cluster calculation, since the MCA time series for different modes are *not* linearly uncorrelated. However, if synthetic PCs are generated as described below, and the MCA and cluster analysis repeated using these PCs, the results are appropriate for assessing significance.

Christiansen (2007) pointed out that the temporal correlation structure of any time series can be captured in a synthetic time series by computing the Fourier harmonic coefficients for all frequencies, retaining the original amplitudes, but randomizing the phases for each frequency. This random-phase approach preserves the spectrum. Since (for an infinite time series) the spectrum is the Fourier transform of the temporal autocovariance function, the synthetic time series should retain the original temporal correlation information.

This approach needs modification when applied to winter daily data, consisting of discontinuous segments. Write the principal components for **Z**, *P*_{y,d} (where the index *y* gives the year and the index *d* gives the day), as

where 1 ≤ *y* ≤ 54, 1 ≤ *d* ≤ 92, the seasonal mean *S* depends only on year *y*, and the series *F* has zero mean over all *d* for any *y*. The random-phase approach is implemented on the series *S* and *F* separately. Here *F*_{y,d} represents a sample (of size 54) of daily time series of zero mean; the spectrum is computed for each sample and then averaged over all available samples. A random-phase surrogate *F̂ _{d}* is computed by summing the harmonic coefficients computed using the specified amplitudes with random phases. The series

*S*, however, consists of only one sample of 54 seasonal means. To avoid sampling errors, its spectrum is approximated by a white-noise spectrum, whose constant value is just the average over all frequencies. A surrogate

*Ŝ*is constructed again using random phases.

_{y}A synthetic time series corresponding to *P*_{y,d} is formed using a randomly chosen surrogate *Ŝ _{y}* to which we add 54 randomly chosen surrogates

*F̂*. This is repeated for each of the PCs to form a synthetic dataset of the same size as the original observed dataset. The cluster analysis is repeated for each of 100 such datasets; the number of datasets for which the variance ratio is less than the observed (synthetic data more clustered than observed data), once divided by 100, gives the significance level.

_{d}Significance results are shown in Table 2 for a range of *k* and a range of *M*, the number of MCA modes used in the clustering. In each case the truncation *N* = *M*. For 3 or 4 clusters, the corresponding p value is 0.03 or below^{3} for all *M*. Note that this significance test does not measure the degree to which the underlying distribution departs from a multinormal one (Christiansen 2007).

## 4. Conclusions and discussion

The main conclusions of this article are the following:

For wintertime in the Pacific–North America region, three MCA modes capture over 75% of the squared covariance between the low-frequency 200-hPa height field and the envelope fields describing the low-frequency variation of the rms of high-frequency meridional wind. This result is robust to changes in the PC truncation of the fields in the analysis as long as at least 5 PCs are retained.

Cluster analysis carried out for a range of PC truncations and number of MCA modes identifies essentially the same preferred patterns identified in SCM (for the same observational dataset) on the basis of the low-frequency flow. These states are thus characterized by a strong eddy–mean flow relationship.

An analysis of significance improved over that of SCM shows that 3 or 4 clusters are significant with a

*p*value of 0.03 or less, independent of the number of modes kept in the cluster analysis. This is based on the full input dataset. The degree to which the underlying distribution departs from a Gaussian in not measured in this analysis.

For the region and season presented here, inclusion of the transient eddy configuration did not substantially modify the preferred patterns suggested by the original cluster analysis using principal components. However in general this will not be the case.

## Acknowledgments

The author would like to acknowledge extensive discussions with Dr. Tim DelSole. This work was supported by the National Science Foundation under Grant ATM-03-32910, the National Aeronautics and Space Administration under Grant NNG-04-GG46G, and the National Oceanic and Atmospheric Administration under Grant NA04-OAR-4310034.

## REFERENCES

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

## Footnotes

*Corresponding author address:* David M. Straus, Center for Ocean–Land–Atmosphere Studies, 4041 Powder Mill Rd., Suite 302, Calverton, MD 20705. Email: straus@cola.iges.org

^{1}

MCA is often called singular value decomposition, which describes the mathematical technique used to maximize covariance between two sets of fields.

^{2}

If the **Z** field were to dominate the MCA, which may occur if the envelope field were to have a completely white spectrum, the matrix that describes the rotation of the **Z** PCs to give the new MCA coordinates would look very much like the identity matrix, at least for the lowest modes (upper-left-hand corner). We have confirmed that this is not the case here, so that the rotation matrix looks very distinct from the identity matrix.