## Abstract

This paper demonstrates that an operational forecast model can skillfully predict week-3–4 averages of temperature and precipitation over the contiguous United States. This skill is demonstrated at the gridpoint level (about 1° × 1°) by decomposing temperature and precipitation anomalies in terms of an orthogonal set of patterns that can be ordered by a measure of length scale and then showing that many of the resulting components are predictable and can be predicted in observations with statistically significant skill. The statistical significance of predictability and skill are assessed using a permutation test that accounts for serial correlation. Skill is detected based on correlation measures but not based on mean square error measures, indicating that an amplitude correction is necessary for skill. The statistical characteristics of predictability are further clarified by finding linear combinations of components that maximize predictability. The forecast model analyzed here is version 2 of the Climate Forecast System (CFSv2), and the variables considered are temperature and precipitation over the contiguous United States during January and July. A 4-day lagged ensemble, comprising 16 ensemble members, is used. The most predictable components of winter temperature and precipitation are related to ENSO, and other predictable components of winter precipitation are shown to be related to the Madden–Julian oscillation. These results establish a scientific basis for making week-3–4 weather and climate predictions.

## 1. Introduction

Operational weather forecasts are skillful out to 7–10 days (Simmons and Hollingsworth 2002), and operational seasonal forecasts are skillful out to 3–8 months (depending on season and model; Barnston et al. 2012), but there is relatively limited evidence that forecasts are skillful in the intermediate 3–4-week range (Newman et al. 2003; Pegion and Sardeshmukh 2011; Wang et al. 2014). If skillful forecasts in the 3–4-week range existed, they would have significant social and economic value because many management decisions in agriculture, food security, water resources, and disaster risk are made on this time scale. However, most studies that claim predictability in the 3–4-week range identify this skill in the tropics (Li and Robertson 2015), in upper-level quantities like geopotential height fields (Pegion and Sardeshmukh 2011), or in certain global climate indices (Wang et al. 2014), whereas the skill of midlatitude land surface quantities like 2-m temperature or precipitation tend to be negligible (Li and Robertson 2015). Johnson et al. (2013) develop an empirical model for predicting North American 2-m temperature out to 4 weeks based on a linear trend and statistical relations with the Madden–Julian oscillation (MJO) and El Niño–Southern Oscillation (ENSO) and find that this empirical model has skill in certain regions and phases of the MJO. This paper will show that an operational forecast model makes skillful predictions of week-3–4 average temperature and precipitation over the contiguous United States (CONUS).

Predictability of temperature and precipitation depends very much on the spatial and temporal scale under consideration. Beyond weather time scales (e.g., 7–10 days), it is widely accepted that only large-scale spatial structures are predictable. Accordingly, we propose a novel approach to investigating subseasonal predictability using a set of spatial patterns that can be ordered by length scale. We will show that week-3–4 averages of time series corresponding to many of these spatial patterns can be skillfully predicted by a state-of-the-art prediction model. In addition, we find linear combinations of these time series that maximize predictability and show that many of these predictable components can be predicted with skill.

## 2. Data

The computations performed in this study are strongly constrained by the availability of forecasts; hence, it is helpful to discuss data issues first. We analyze retrospective forecasts, called “hindcasts,” from version 2 of the Climate Forecast System (CFSv2; Saha et al. 2014). The CFSv2 is a coupled atmosphere–ocean–land–ice model and is initialized based on analysis products for the atmosphere, ocean, land, and sea ice. The hindcasts under investigation were initialized at 0000, 0600, 1200, and 1800 UTC of each day over the 12-yr period from January 1999 to December 2010. Although these hindcasts were integrated out to 45 days, only the 2-week mean of weeks 3–4 were considered. Only one hindcast per initialization time is available, so a lagged-ensemble approach is employed whereby an average of forecasts initialized at different times but verifying at the same time were used. In general, skill increases with the size of the lagged ensemble until it saturates around 4 days (as shown in section 4). Accordingly, we consider hindcasts based on a 4-day lagged ensemble, which contains 16 members, derived from four hindcasts per day. To be clear, the 4-day lagged ensemble is computed from hindcasts that are initialized at or before time *t* and that verify from times through days (inclusive). We consider hindcasts of temperature and precipitation over CONUS initialized only in January and July (i.e., boreal winter and summer).

For verification, the 2-week mean temperature is compared to estimates from the NCEP–NCAR reanalysis (Kistler et al. 2001). Similarly, hindcasts of daily precipitation were verified relative to the Climate Prediction Center (CPC) unified gauge-based analysis (Chen et al. 2008).

Climatologies of daily temperature and precipitation are quite noisy and require significant smoothing. No significant dependence of hindcast climatology on lead time was detected, so the model climatology for each calendar day was estimated by averaging all hindcasts verifying on the same day and over all lead times. In addition, the daily climatology was fit to a second-order polynomial over the 76-day period starting from the first of each month. Various checks and visual comparisons were made to ensure that the estimated climatologies were reasonable.

MJO indices are computed from CFSv2 hindcasts in the manner of Trenary et al. (2017). Specifically, the familiar real-time multivariate MJO indices (RMM1 and RMM2) of Wheeler and Hendon (2004) were derived from an EOF analysis of observations, and then the resulting EOF patterns were projected on model variables. In contrast to the standard approach, a 120-day running mean was not subtracted from the indices; hence, our MJO indices include interannual variability.

## 3. Methods

This section describes our methods for 1) defining an orthogonal set of large-scale patterns, 2) quantifying predictability and skill, and 3) finding patterns that maximize predictability and skill.

### a. Eigenvectors of the Laplacian operator

We project temperature and precipitation fields onto the eigenvectors of the Laplacian operator over CONUS. Laplacian eigenvectors provide a convenient orthogonal basis set that can be ordered by a measure of length scale. Special cases of Laplacian eigenvectors include Fourier series and spherical harmonics, which are used routinely to decompose time series by time scale and spatial structures by length scale, respectively. Eigenvectors of the Laplacian operator over CONUS were obtained using a Green’s function method described in DelSole and Tippett (2015), which should be consulted for details (codes are available upon request). The resulting spatial patterns are orthogonal with respect to an area-weighted inner product and ordered such that the first corresponds to a spatially uniform pattern over the domain (i.e., the largest spatial scale that fits in the domain), and subsequent patterns correspond to dipoles, tripoles, quadrupoles, and so forth of decreasing length scale. These vectors depend only on the geometry of the domain and therefore are data independent, in contrast to empirical orthogonal functions (EOFs). Thus, a single set of spatial patterns are used to analyze different variables and seasons.

Laplacian eigenvectors 2–10 over CONUS are shown in Fig. 1. The first eigenvector is not shown because it equals a constant over the whole domain. The second and third eigenvectors measure the east–west and north–south gradients, respectively. The next two eigenvectors correspond to a tripole and quadrupole, and so on. The percent variance of observed 2-week means explained by the first 20 Laplacian eigenvectors is shown in Fig. 2; similar percentages are found in the model (not shown). As expected, the explained variance tends to decrease with decreasing spatial scale.

### b. Measure of predictability

Predictability refers to the degree to which a variable *in a model* is predictable by that model. As such, predictability is an inherent property of a model that can be measured independently of observations. The standard approach to measuring predictability is to consider an ensemble of predictions initialized at equally likely states of the system. Although the CFSv2 reforecast dataset does not have multiple ensemble members for the same initial condition day (i.e., a “burst” ensemble), an ensemble can be approximated by grouping hindcasts initialized 6 h apart and that verify on the same day. The resulting ensemble often is called a lagged ensemble (Hoffman and Kalnay 1983). Let denote the forecast anomaly initialized at time and verifying at time *t*, where *τ* is the lead time; time is measured in units of days. If *E* is the ensemble size, then the mean of the lagged ensemble is defined as follows:

where arises because hindcasts were initialized 6 h apart (i.e., 1/4 of a day apart).

If a variable is not predictable, then the ensemble members would be independent and the expected variance of the ensemble mean would be times the expected variance of the climatological distribution . The standard test for this hypothesis is analysis of variance (ANOVA; Rowell 1998). To test the null hypothesis of no predictability, ANOVA uses the statistic

where is an estimate of the variance of ensemble means (i.e., “signal”), given by

and is an estimate of the variance about the ensemble means (i.e., “noise”), given by

If the noise perturbations are independent and identically distributed Gaussian random variables, then *F* follows an F distribution with and degrees of freedom, which can be used to test significance. Unfortunately, the independence assumption is unrealistic for forecasts initialized a few days apart because large-scale fields tend to be serially correlated on daily time scales. Therefore, the standard hypothesis test is not appropriate for subseasonal forecasts. We propose a block permutation test for deciding predictability. Specifically, under the null hypothesis of no predictability, the forecasts would be exchangeable in the sense that each value of *F* obtained from a permutation of (independent) samples is equally likely. Accordingly, we construct a permuted ensemble by drawing forecasts from random years. Importantly, the entire sequence of forecasts within a year are drawn, ensuring that the serial correlation across consecutive days is preserved. This sampling is tantamount to randomly permuting (or “shuffling”) the years assigned to the forecasts. The statistic *F* is computed for the permuted ensemble, and this procedure is repeated many times (i.e., 10 000 times). The rank of the *F* obtained from the unpermuted ensemble is evaluated relative to the values of *F* for the permuted ensembles. Under the hypothesis of exchangeability, the rank is uniformly distributed. The actual lagged ensemble is said to be predictable if the observed value of *F* exceeds the 95% percentile of the *F* values obtained from permuted samples.

### c. Measure of skill

Skill refers to the degree to which a forecast predicts the observed variable. Two standard measures of skill are mean square error and correlation *ρ*. Significance tests for skill based on mean square error have been discussed by DelSole and Tippett (2014), while those based on correlation are standard. Unfortunately, these tests are not appropriate for forecasts initialized at daily intervals because of the serial correlation mentioned above. We again apply a permutation method in which the year labels for the observations are randomly permuted. By selecting the entire sequence of observations within a year, the serial correlation between observations on daily time scales is preserved. After shuffling the year labels for the observations, the correlation coefficient between forecasts and shuffled observations can be computed. This procedure is repeated many times (i.e., 10 000 times) to build up an empirical distribution for the correlation under the null hypothesis of independence. The 95th percentile of the resulting samples then defines the 5% significance threshold value for the correlation coefficient.

### d. Predictable component analysis

In some cases, none of the time series for the Laplacian eigenvectors can be predicted with skill. However, this result does not prove that there is no skill, because it is possible that some linear combination of eigenvectors can be predicted with skill. To test this possibility, we find the linear combination of eigenvectors that maximize the predictability measure *F* in (2). This procedure is formally equivalent to predictable component analysis (see DelSole and Tippett 2007 for a review). We briefly review this procedure to clarify its application in our particular situation. Let the weights of the linear combination be such that the quantity being forecast is the following:

where is the forecast anomaly for the *m*th Laplacian eigenvector. If the weights are collected into the vector , then the predictability error measure (2) can be written equivalently as follows:

where and are covariance matrices for the ensemble mean and residuals about the ensemble mean, respectively, defined as

and

To find an extremum, we differentiate (6) with respect to :

where *λ* is the value of *F* for the linear combination defined by the weights . If is positive definite, which is typically true when the number of Laplacian eigenvectors is much smaller than the sample size, then the derivative vanishes when satisfies the generalized eigenvalue problem:

It can be proven that if the eigenvalues (and corresponding eigenvectors) are ordered in descending order, then the first eigenvector maximizes *F*, the second maximizes *F* subject to being uncorrelated with the first eigenvector (in a sense defined shortly), and so on. Moreover, the eigenvalues give the corresponding maximized *F* values. These solutions define the predictable components, the first of which will be called the “most predictable component.” Each eigenvector can be substituted in (5) to define the time series associated with that component. Because covariance matrices are symmetric, the resulting time series for different components are uncorrelated. The spatial structure of the predictable component is obtained from regression. The regression coefficient between the predictable component time series in (5) and the *m*th Laplacian eigenvector is

The Laplacian eigenvectors are then summed using weights specified in the vector . Note that a regression coefficient can be computed for the *m*th Laplacian eigenvector even if that vector was not included in the optimization procedure discussed above (e.g., when ). We use Laplacian eigenvectors to construct the spatial pattern. This choice effectively imposes a prescribed level of spatial smoothing for the regression pattern.

Note that the above procedure yields a complete set of predictable components for each lead time *τ*. This lead time dependence is sensible because predictability is characterized by different patterns at different time scales. An alternative approach is to characterize predictability over all time scales, which can be done by maximizing a measure of predictability integrated over all lead times. This approach is called average predictability time (APT; DelSole and Tippett 2009) analysis. APT analysis is not used here because we want to demonstrate the existence of predictability specifically for the week-3–4 forecasts. Although APT analysis can find predictable components on subseasonal time scales, testing the hypothesis of predictability on subseasonal time scales is not straightforward because the integral includes the short weather lead times that are predictable. By applying predictable component analysis for only one lead time, subseasonal predictability can be tested in isolation from predictability on other time scales.

The sampling distribution of the maximized *F* values (i.e., the eigenvalues) under the null hypothesis of no predictability can be estimated using a permutation technique similar to that described above, in which the label for years assigned to forecasts are randomly permuted. The only extra step is that instead of drawing a single variable, an entire *M*-dimensional vector is drawn, corresponding to the amplitudes of the *M* Laplacian eigenvectors for the relevant forecast. Again, an essential element of the technique is to draw the entire sequence of forecasts within a year for the *M* eigenvectors, which preserves the serial correlation on daily time scales. After generating a mock ensemble forecast dataset comprising *T* time steps and *E* ensemble members, the covariance matrices are computed and the generalized eigenvalue problem in (11) is solved. This process is repeated many times (i.e., 10 000 times) to build up an empirical distribution for the eigenvalues.

## 4. Results

The correlation skill of 4-day lagged ensembles of temperature and precipitation of week-3–4 hindcasts over CONUS during January and July is shown in Fig. 3. Statistically insignificant values at the 5% level (according to the permutation test) are masked out. The figure shows that winter temperature and precipitation and summer temperature are skillfully predicted by the CFSv2 over a third to a half of the area of CONUS. Summer precipitation shows effectively no skill (e.g., the number of positive and negative correlations are approximately equal). Although some negative correlations are statistically significant in a local sense, we do not believe them to be field significant.

Our goal is to diagnose the predictability and skill shown in Fig. 3 in terms of large-scale spatial structures. The predictability and skill of individual Laplacian eigenvectors of January temperature as a function of ensemble size is shown in Fig. 4. Qualitatively similar results are obtained for other variables and time periods. Not surprisingly, predictability decreases with ensemble size because each additional member is initialized farther from the target and therefore contains more noise. The signal-to-noise ratio (SNR) decreases by a factor of 2–3 from a 12-h to a 4-day lagged ensemble. In contrast, the skill tends to increase with ensemble size, provided the skill is sufficiently large.

The predictability of week-3–4 temperature and precipitation CFSv2 hindcasts projected onto individual Laplacian eigenvectors is shown in Fig. 5. Predictability is quantified by the SNR , where *F* is defined in (2). We use , which is equivalent to analyzing differences between hindcasts initialized 6 h apart, as done in weather prediction studies (Simmons and Hollingsworth 2002). The figure reveals that several spatial structures of winter temperature and precipitation and summer temperature are predictable. Summer precipitation also is predictable but for fewer spatial structures. In general, temperature is more predictable than precipitation, and winter is more predictable than summer. Note, however, that precipitation is more predictable than temperature for certain components during winter (e.g., eighth and ninth components).

Although the above results demonstrate week-3–4 predictability, this result does not necessarily imply that the associated hindcasts are skillful (i.e., that the hindcasts can predict observed anomalies with skill). In most cases, mean square error shows no significant skill. Accordingly, we consider skill based on correlation, which is invariant to linear transformations of the forecast and thus does not penalize biases or errors in forecast amplitude. The skills of the hindcasts based on a 4-day lagged ensemble are shown in Fig. 6. The figure shows that many spatial structures of winter temperature and precipitation and summer temperature can be predicted with skill by -week hindcasts. The fact that skill exists for correlation but not for mean square error suggests that an amplitude correction is necessary for skill. Only one spatial structure (i.e., the 19th) of summer precipitation has skill exceeding the relevant significance level, but it is unlikely that it would be significant after the multiple comparisons required to identify it are taken into account. Thus, we conclude that large-scale week- winter temperature and precipitation and summer temperature can be predicted with skill but find little evidence that large-scale, summer precipitation can be predicted with skill at weeks .

Although no individual Laplacian eigenvector has significant skill for summer precipitation, this result does not necessarily imply that summer precipitation cannot be predicted with skill. In particular, it is possible that some linear combination of eigenvectors can be predicted with skill. To test this possibility, we apply predictable component analysis to find linear combinations of Laplacian eigenvectors that maximize predictability. A critical step in this procedure is selecting the number of eigenvectors. This step is tantamount to a model selection problem and is one of the most challenging problems in statistics (Fukunaga 1990; Hastie et al. 2003; Taylor and Tibshirani 2015). Fortunately, we have found that our results are not sensitive to the precise number of eigenvectors in the range of eigenvectors (we did not look beyond 20). To further validate our results, we have partitioned the years into two parts, a training sample in which the most predictable components are identified and a verification sample onto which the predictable components are projected and used as an independent test of predictability. We find that the predictability and skill in the verification sample tends to saturate after about 9 eigenvectors and remains nearly the same (or even grows) by 20 eigenvectors (not shown). The time series and associated regression pattern corresponding to individual predictable components are virtually independent of the number of eigenvectors greater than five or so. For subsequent calculations, we use the same number of eigenvectors (viz., nine) for all months and variables.

The maximized signal-to-noise ratios for CFSv2 week-3–4 hindcasts are shown in Fig. 7. As above, we use ensemble size corresponding to differences between hindcasts initialized 6 h apart. The shaded area shows the 95% confidence intervals for no predictability. The results suggest that all components are predictable (because all results lie outside the shaded confidence region). However, precipitation components near the trailing end tend to be only marginally significant. There is a “kink” in the signal-to-noise spectra at one or two components, indicating predictability significantly greater than the background significance threshold.

The regression map between the most predictable component time series and relevant field is shown in Fig. 8. The winter temperature and precipitation patterns are similar to the observed ENSO teleconnection patterns derived from monthly means (Yang and DelSole 2012), suggesting that CFSv2 week-3–4 predictability arises from El Niño/La Niña events. The summer temperature pattern also bears some resemblance to model–ENSO teleconnection patterns (e.g., compare to Fig. 7 of Wang et al. 2012), but the correspondence to the summer precipitation pattern is weak.

The skills of the predictable components are shown in Fig. 9. The figure shows that the most predictable components have skill at weeks 3–4 for winter temperature and precipitation and summer temperature. In contrast, the most predictable component of summer precipitation has no significant skill (it is too small to appear in the figure). About two to three predictable components of winter temperature and precipitation and summer temperature have skill. Confidence intervals for the correlation skills overlap (not shown), indicating that the correlations cannot be distinguished. It follows that the ranking according to skill cannot be determined based on the available data. Thus, the fact that the most predictable component is not the most skillful is not necessarily meaningful.

To gain insight into the nature of the predictability and skill, we show in Fig. 10 time series of the most predictable components. These time series confirm that secular trends are small. In addition, for the components with the most skill, the time series exhibit relatively large jumps between years but relatively small fluctuations within a year. This feature suggests that the predictability comes from predicting the overall mean during the month rather than predicting variations within the month. To test this possibility, the forecasts within a month were decomposed into the sum of two terms, a monthly mean plus an anomaly relative to the monthly mean, and then the correlation skill of these two components were computed separately. The result, shown in Fig. 11, shows that skill associated with the monthly mean often dominates. Moreover, skill in predicting the anomalies rarely exceeds 0.35, whereas skill in predicting monthly means frequently exceeds 0.35.

Given that predictability appears to be dominated by the monthly mean component, it is reasonable to explore relations with other variables by computing correlations between monthly mean quantities. The simultaneous squared correlation between each predictable component in CFSv2 and the Niño-3.4 index is shown in Fig. 12a. We call this measure *R*^{2} because it corresponds to the coefficient of determination of a regression model for predicting the component based on Niño-3.4. Because the Niño-3.4 index is persistent on weekly time scales, its value in the model is very close to its initial value, which in turn is close to the observed value. Thus, these correlations measure the ENSO teleconnections in the model. We see that the most predictable components of winter temperature and precipitation in CFSv2 are highly correlated with ENSO.

In addition to ENSO, the MJO often is cited as a phenomenon that may give rise to subseasonal predictability (Vitart 2014). To explore this, we compute the coefficient of determination between the predictable component and the RMM1 and RMM2 indices defined in Wheeler and Hendon (2004). These indices were computed from daily CFSv2 fields, then averaged over week-3–4 hindcasts, and then averaged over the month so that a correlation could be computed using only monthly values. The coefficient of determination is the correlation between the predictable component and the best linear combination of RMM1 and RMM2 and measures the fraction of variance of the predictable component that can be predicted from the MJO indices. The values computed from monthly mean quantities are shown in Fig. 12b and reveal that the most predictable component of winter temperature and precipitation are significantly correlated with MJO activity.

It is well known that ENSO and MJO activity tend to be correlated. This correlation confounds the interpretation of pairwise correlations. To clarify the relations further, we quantify the degree of relation after one of the indices has been regressed out. A convenient measure of the degree of relation between *X* and *Y*, after *Z* has been removed, is

where is the sum square error of a regression prediction of *Y* based on *X*, and is the sum square error of a regression prediction of *Y* based on *X* and *Z*. The constant term is understood to be included in all regression models. The quantity lies between 0 and 1 and can be interpreted as the fraction of variance of *Y* explained by *Z* after the linear relation with *X* has been removed from all variables. In the case of ENSO after MJO has been removed (Fig. 12c), only the leading predictable component of winter precipitation shows a significant relation with ENSO. In contrast, the leading component of winter temperature has a significant correlation with ENSO (see Fig. 12a), but not after the MJO has been removed (which has an *R*^{2} of about 0.4, just below the significance threshold; see Fig. 12c). This result does not necessarily mean the leading component of winter temperature is unrelated to ENSO, but rather that a correlation could exist but the sample size (i.e., 12 yr) may be too small to detect it. In the case of MJO after ENSO has been removed, shown in Fig. 12d, the third and fourth predictable components of winter precipitation show a significant relation with MJO.

For completeness, we note a similar analysis was performed using the North Atlantic Oscillation (NAO) index and MJO indices. We find that the correlations with the predictable components are marginally significant but that these correlations become insignificant when MJO has been regressed out (not shown).

## 5. Conclusions

This paper shows that an operational forecast model skillfully predicts week-3–4 temperature and precipitation over the contiguous United States. This skill can be identified at the gridpoint level (about 1° × 1°) and by projecting data onto an orthogonal set of large-scale CONUS patterns (as derived from the eigenvectors of the Laplacian operator). An important aspect of this identification is a permutation significance test that accounts for serial correlation on daily time scales. Skill is detected based on correlation measures but not based on mean square error measures, indicating that an amplitude correction is necessary for skill. Our results differ from those of Li and Robertson (2015) perhaps because we analyzed weeks together rather than separately and only one month at a time was analyzed.

Winter temperature and precipitation tend to have more predictability than their summer counterparts, with summer precipitation having the weakest predictability of all quantities considered in this paper. In addition, the most predictable components were identified by finding linear combinations of Laplacian eigenvectors that maximize signal-to-noise ratio. The results of this maximization procedure clarify the spatial structure of the predictable variability. The most predictable component during winter effectively represents the model’s ENSO teleconnection pattern. Some predictable components of winter precipitation are associated with MJO activity. The skill of the predictable components is dominated by the skill in predicting the mean value during a month rather than from predicting anomalies relative to the monthly mean. By explicitly identifying patterns in an operational forecast model that are predictable on subseasonal time scales and demonstrating that these patterns can be predicted with skill in observations, the above results provide a scientific basis for week-3–4 predictions.

## Acknowledgments

We thank two reviewers and the editor Joseph Barsugli for helpful comments that led to improved clarity in the final paper. This research was supported primarily by the National Oceanic and Atmospheric Administration, under the Climate Test Bed program (NA10OAR4310264) and the MAPP program (NA14OAR4310184). Additional support was provided by the National Science Foundation (AGS-1338427), National Aeronautics and Space Administration (NNX14AM19G), and the National Oceanic and Atmospheric Administration (NA14OAR4310160). The views expressed herein are those of the authors and do not necessarily reflect the views of these agencies.

## REFERENCES

*An Introduction to Statistical Pattern Recognition*. 2nd ed. Academic Press, 591 pp.

*Elements of Statistical Learning*. Corrected ed. Springer, 552 pp.

## Footnotes

© 2017 American Meteorological Society. For information regarding reuse of this content and general copyright information, consult the AMS Copyright Policy (www.ametsoc.org/PUBSReuseLicenses).