1. Introduction
A general prediction problem is to find the best estimate of a quantity y given a related quantity x. We refer to vectors y and x as the predictand and predictor, respectively. Examples of typical earth science prediction problems are as follows: x is the current sea surface temperature and y is its future state (Penland and Magorian 1993); x is a prescribed CO2 concentration and y is global surface temperature (Krueger and Von Storch 2011); x is a large-scale climate feature and y is an associated small-scale climate feature (Robertson et al. 2012). In principle, the probability distribution of y for a particular value of predictor x = x0 (the conditional distribution) can be computed from physical laws or estimated from data. In either case, the mean of that distribution (the conditional mean) is the best forecast in the sense of minimizing the expected squared error. When x and y have a joint Gaussian distribution, the best forecast, as well as its uncertainty, is given by linear regression (LR).
The idea of conditional averaging is also found in the constructed analog (CA) method (Van den Dool 1994, 2006), a statistical forecast method that has been applied in a variety of geophysical problems (e.g., Van den Dool et al. 2003; Maurer and Hidalgo 2008; Hawkins et al. 2011). A prediction yCA is made for a particular value of the predictor x = x0 by searching through historical data for values of y corresponding to values of x that are close to x0, so-called analogs. The CA method expresses the current predictor state x0 as a weighted linear combination of past states and makes a prediction by applying those same weights to the corresponding values of y, an averaging procedure reminiscent of the conditional mean. The CA has previously been described as differing from LR in two fundamental ways. First, it has been claimed that by making no assumption of a linear relation between predictor and predictand, CA captures nonlinearity. Second, it has been claimed that since CA is not based on minimizing the mean squared error of the predictions, there is no danger of overfitting. Here we show that typical implementations of CA do not have these properties, and, in fact, CA forecasts are identical to LR forecasts.
The paper is organized as follows. In section 2 we review the least squares problems that arise in the formulations of LR and CA, and use the matrix pseudoinverse to show that simple (without predictor truncation or regularization) implementations of the two methods give identical forecasts. In section 3, we identify situations where the simple implementation overfits the data and show that a recommended CA implementation is the same as principal component regression. In section 4, we show that another common CA implementation corresponds to ridge regression. In section 5, we show that LR and CA predictions of the Niño-3.4 index are identical and may have large variance even at long leads. In section 6, we present and illustrate some nonlinear regression methods that follow naturally from modifications to CA. A summary and discussion are given in section 7.
2. Linear regression, constructed analogs, and pseudoinverses
We use the following matrix notation for the training data. Let
















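The equivalence of the simple CA and LR forecasts can be checked numerically. The sketch below is a minimal illustration with random data, assuming (as is conventional) that the training predictors and predictands are stored as the columns of matrices X and Y; the dimensions and variable names are illustrative, not from the paper. The CA weights come from the pseudoinverse of X, and the identity with LR is immediate from associativity of the matrix products.

```python
import numpy as np

rng = np.random.default_rng(0)

# Training data: n samples stored as columns (illustrative dimensions).
nx, ny, n = 5, 3, 40          # predictor dim, predictand dim, sample size
X = rng.standard_normal((nx, n))
Y = rng.standard_normal((ny, n))
x0 = rng.standard_normal(nx)  # new predictor state

# Constructed analog: express x0 as a weighted combination of past
# predictor states (least squares weights via the pseudoinverse), then
# apply the same weights to the corresponding past predictands.
w = np.linalg.pinv(X) @ x0
y_ca = Y @ w

# Linear regression: least squares operator mapping x to y.
B = Y @ np.linalg.pinv(X)
y_lr = B @ x0

# The two forecasts agree to machine precision:
# Y (X^+ x0) = (Y X^+) x0.
print(np.allclose(y_ca, y_lr))
```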
3. Connection to principal component regression































The choice of the number of PCs to use in the calculation of the CA weights has exactly the same effect on the forecast as the choice of the number of PCs to use in PCR. In both cases, using too many PCs leads to overfitting.
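This equivalence, too, is easy to verify numerically. In the following sketch (random data, illustrative dimensions), the CA weights are computed in a truncated EOF space obtained from the SVD of the predictor matrix, and the forecast is compared with PCR based on the same leading PCs.

```python
import numpy as np

rng = np.random.default_rng(1)
nx, ny, n, k = 8, 2, 30, 4    # dims, sample size, number of EOFs retained
X = rng.standard_normal((nx, n))
Y = rng.standard_normal((ny, n))
x0 = rng.standard_normal(nx)

# EOFs of the predictor data from the SVD: columns of U are EOFs,
# and the rows of Vt scaled by s are the PC time series.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k]

# CA weights computed in the truncated EOF space.
w = Vtk.T @ ((Uk.T @ x0) / sk)
y_ca = Y @ w

# PCR: regress Y on the k leading PC time series Z, then forecast
# from the projection of x0 onto the retained EOFs.
Z = Uk.T @ X                          # k x n PC time series
B = Y @ Z.T @ np.linalg.inv(Z @ Z.T)  # least squares PCR coefficients
y_pcr = B @ (Uk.T @ x0)

print(np.allclose(y_ca, y_pcr))
```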
4. Connection to ridge regression









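The correspondence between ridging in the CA weight calculation and ridge regression can likewise be checked numerically. In this sketch (random data, an illustrative ridge parameter), the penalized CA weights give the same forecast as the ridge regression operator; the identity follows from the push-through relation (XᵀX + λI)⁻¹Xᵀ = Xᵀ(XXᵀ + λI)⁻¹.

```python
import numpy as np

rng = np.random.default_rng(2)
nx, ny, n = 6, 2, 25
lam = 0.5                     # ridge parameter (illustrative value)
X = rng.standard_normal((nx, n))
Y = rng.standard_normal((ny, n))
x0 = rng.standard_normal(nx)

# CA with ridging: penalize the norm of the analog weights.
w = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ x0)
y_ca = Y @ w

# Ridge regression: penalize the norm of the regression coefficients.
B = Y @ X.T @ np.linalg.inv(X @ X.T + lam * np.eye(nx))
y_rr = B @ x0

# Identical by the push-through identity
#   (X'X + lam I)^(-1) X' = X' (XX' + lam I)^(-1).
print(np.allclose(y_ca, y_rr))
```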
5. Example: Niño-3.4 prediction
A typical application of CA and LR is the prediction of the Niño-3.4 index (Van den Dool 2006). We consider forecasts made at the beginning of July and take as predictors the gridded April–June sea surface temperature (SST) anomaly in the region from 40°S to 40°N from the extended reconstructed SST (ERSST) dataset, version 3b (Smith and Reynolds 2004). The historical data used to form
Figure 1 shows that CA and PCR forecasts based on the same number of EOFs are identical. Forecasts based on different numbers of EOFs, however, can differ greatly: forecasts using 10 EOFs show little variability, while those using 25 or more show considerable variability. This particular set of forecasts verifies well against observations out to a lead of nearly two years. The skill of forecasts made in July for the following March–May (lead 8) was computed for the period 1955–2003, both in-sample using the entire dataset and using leave-one-out cross validation (CV) applied to the LR coefficients and CA weights; the PCs were computed from the full dataset. The CV skill of the 10-EOF forecasts is the highest, and as the number of EOFs increases, the resulting forecasts have lower CV skill and greater variance (Table 1). In contrast, the in-sample correlation increases with the number of EOFs, and the in-sample ratio of forecast to climatological variance equals the square of the in-sample correlation. The variance of the cross-validated forecasts exceeds the climatological variance when 25 or more EOFs are used; the reason is that both the in-sample explained variance and the variance of the regression coefficient estimates, which are increasing functions of the number of predictors, contribute to the variance of the cross-validated forecasts. The behavior of the CV forecasts, especially those using more than 10 EOFs, is symptomatic of overfitting: the in-sample skill is substantially greater than the CV skill, and the forecast variance is too large to be consistent with the modest CV skill.
Constructed analog (CA) and principal component regression (PCR) forecasts along with observations (obs) of the three-month-average Niño-3.4 index. Forecasts are made at the beginning of July and extend through April–June of 2007. The numbers in the legend indicate the number of EOFs retained.
Citation: Monthly Weather Review 141, 7; 10.1175/MWR-D-12-00223.1
Skill and ratio of forecast to climatological variance of in-sample and leave-one-out cross-validated (CV) forecasts made at the beginning of July for the following March–May average (lead 8) of the Niño-3.4 index during the period 1955–2003.
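The qualitative behavior in Table 1 can be reproduced with a synthetic example. In the sketch below (entirely synthetic data, not the Niño-3.4 dataset), only the first of many candidate predictors carries signal; as more noise predictors are added, the in-sample correlation necessarily rises (nested least squares fits), while the leave-one-out cross-validated skill degrades, the signature of overfitting.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50                                  # sample size (synthetic)
X = rng.standard_normal((n, 40))        # candidate predictors
y = X[:, 0] + rng.standard_normal(n)    # only the first predictor matters

def fit(Xtr, ytr, Xte):
    """Least squares forecast at Xte (data assumed centered)."""
    b, *_ = np.linalg.lstsq(Xtr, ytr, rcond=None)
    return Xte @ b

yc = y - y.mean()
for p in (2, 10, 40):
    Xp = X[:, :p] - X[:, :p].mean(0)    # centered predictors
    # In-sample correlation: nondecreasing as predictors are added.
    r_in = np.corrcoef(fit(Xp, yc, Xp), yc)[0, 1]
    # Leave-one-out cross-validated forecasts.
    y_cv = np.array([fit(np.delete(Xp, i, 0), np.delete(yc, i), Xp[i])
                     for i in range(n)])
    r_cv = np.corrcoef(y_cv, yc)[0, 1]
    print(f"p={p:2d}  in-sample r={r_in:.2f}  CV r={r_cv:.2f}")
```

(The centering here uses the full sample for simplicity; a strict cross validation would re-center within each training fold.)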
6. Nonlinear CA














(a) Data (plus signs) generated by (28) fit by linear regression (LR)–constructed analog (CA), k-nearest neighbors (KNN), Gaussian kernel smoother (GKS), and local linear regression (LLR). The “truth” curve is the expected value of y given x. (b) The CA, KNN, GKS, and LLR weights for x0 = −0.5. The LLR weights are divided by 4 for display purposes.
The CA, KNN, GKS, and LLR weights for x0 = −0.5 are quite different, as shown in Fig. 2b. The weights of all four methods sum to one because of the intercept term. A clear feature of the CA weights is that they are a linear function of the data values and have no maximum near x0; as discussed earlier, this behavior is general. The KNN weights are zero except at the five data points nearest x0, where they equal ⅕. The GKS weights are largest near x0 and decay to zero as the distance from x0 increases. The LLR weights are locally linear near x0, with values that go to zero far from x0.
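The weight vectors of the four methods can be computed explicitly in a one-dimensional sketch. The generating function of (28) is not reproduced here, so a stand-in nonlinear truth is used; the bandwidth, neighbor count, and data are illustrative. Note that the CA/LR weights are linear in the data values, whereas the KNN, GKS, and LLR weights concentrate near x0, and all four sum to one.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
x = rng.uniform(-2, 2, n)
# Stand-in nonlinear truth (the paper's generating Eq. (28) is not shown here).
y = np.sin(np.pi * x / 2) + 0.3 * rng.standard_normal(n)
x0 = -0.5

# LR/CA weights (simple regression with intercept): linear in the x values.
xm = x.mean()
w_ca = 1.0 / n + (x0 - xm) * (x - xm) / np.sum((x - xm) ** 2)

# KNN weights: 1/k on the k points nearest x0, zero elsewhere.
k = 5
w_knn = np.zeros(n)
w_knn[np.argsort(np.abs(x - x0))[:k]] = 1.0 / k

# Gaussian kernel smoother weights: peak at x0, decay with distance.
h = 0.3                                    # bandwidth (illustrative)
Kw = np.exp(-0.5 * ((x - x0) / h) ** 2)
w_gks = Kw / Kw.sum()

# Local linear regression weights ("equivalent kernel"): fit a weighted
# line near x0 and read off the coefficients multiplying each y value.
A = np.column_stack([np.ones(n), x - x0])  # local design matrix
w_llr = np.linalg.solve(A.T @ (Kw[:, None] * A),
                        A.T * Kw)[0]       # first row: weights for yhat(x0)

# Each weight vector sums to one (intercept term); the weighted sums
# of y are the corresponding forecasts at x0.
for w in (w_ca, w_knn, w_gks, w_llr):
    print(round(w.sum(), 6), float(w @ y))
```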
7. Summary and discussion
While the constructed analog (CA) statistical forecast method has previously been described as having properties that are distinct from those of linear regression (LR; Van den Dool 2006), we have shown here that, with comparable treatment of the data, CA and LR produce identical forecasts, and therefore the properties of CA are the same as those of LR. In particular, CA forecasts are linear functions of the predictors and subject to overfitting. When EOF truncation is used in the CA calculation, the resulting forecast is the same as that given by principal component regression (PCR) based on the same EOFs. Likewise, using ridging in the calculation of CA weights results in the same forecast as does ridge regression.
These results were illustrated in an example where sea surface temperature was used to predict the Niño-3.4 index. The CA and PCR forecasts based on the same number of PCs are identical. When many PCs were used, the forecasts show high variance, even at long leads, but low cross-validated skill, a symptom of overfitting. The equivalence between LR and CA depends on the precise definition of the weights. Allowing the weights to depend nonlinearly on the data leads naturally to generalizations of CA such as kernel smoothers and local linear regression, which we have illustrated with an example.
In practice, LR forecasts are observed to differ from CA forecasts, and forecasts from different implementations of LR also differ from each other. For instance, LR-based statistical forecasts of ENSO, including CA, have quite different properties (Barnston et al. 2012). Use of distinct datasets may explain some of these differences. However, it must be recognized that many linear regression forecasts, with significant variations in skill, can be constructed from a given dataset of predictors and predictands. There are two primary sources of this variety. First, the predictors or predictands can be truncated and the regression developed on the truncated data. Principal component analysis and canonical correlation analysis are commonly used methods for truncating the data that enter an LR. The resulting forecasts depend on the truncation choices, as illustrated here in the Niño-3.4 example, where the forecasts depend strongly on the number of principal components retained as predictors. Linear inverse models and autoregressive methods usually project both the predictors and predictands onto EOFs (DelSole and Chang 2003); CA generally projects only the predictors, leading to different forecasts. Second, there are a variety of methods for estimating the LR coefficients. In addition to the classic least squares method, there are shrinkage methods such as ridge regression and the lasso (Hastie et al. 2009). CA implementations often use ridging; PCR does not, again leading to different forecasts. Appropriate choices of data truncation and coefficient estimation method are key to developing a skillful LR forecast.
Acknowledgments
The authors thank Huug van den Dool for his generous and helpful comments, and two anonymous reviewers for their useful suggestions. MKT is supported by grants from the National Oceanic and Atmospheric Administration (Grants NA05OAR4311004 and NA08OAR4320912) and the Office of Naval Research (Grant N00014-12-1-0911). TD gratefully acknowledges support from grants from the NSF (Grant 0830068), the National Oceanic and Atmospheric Administration (Grant NA09OAR4310058), and the National Aeronautics and Space Administration (Grant NNX09AN50G). The views expressed herein are those of the authors and do not necessarily reflect the views of NOAA or any of its subagencies.
REFERENCES
Barnston, A. G., M. K. Tippett, M. L. L'Heureux, S. Li, and D. G. DeWitt, 2012: Skill of real-time seasonal ENSO model predictions during 2002–2011. Is our capability increasing? Bull. Amer. Meteor. Soc., 93, 631–651.
DelSole, T., 2007: A Bayesian framework for multimodel regression. J. Climate, 20, 2810–2826.
DelSole, T., and P. Chang, 2003: Predictable component analysis, canonical correlation analysis, and autoregressive models. J. Atmos. Sci., 60, 409–416.
Golub, G. H., and C. F. Van Loan, 1996: Matrix Computations. 3rd ed. The Johns Hopkins University Press, 694 pp.
Hansen, P., 1998: Rank-Deficient and Discrete Ill-Posed Problems: Numerical Aspects of Linear Inversion. Society for Industrial and Applied Mathematics, 247 pp.
Hastie, T., R. Tibshirani, and J. Friedman, 2009: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 768 pp.
Hawkins, E., J. Robson, R. Sutton, D. Smith, and N. Keenlyside, 2011: Evaluating the potential for statistical decadal predictions of sea surface temperatures with a perfect model approach. Climate Dyn., 37, 2495–2509.
Kaplan, A., M. A. Cane, Y. Kushnir, A. C. Clement, M. B. Blumenthal, and B. Rajagopalan, 1998: Analyses of global sea surface temperature 1856–1991. J. Geophys. Res., 103 (C9), 18 567–18 589.
Krueger, O., and J.-S. Von Storch, 2011: A simple empirical model for decadal climate prediction. J. Climate, 24, 1276–1283.
Maurer, E. P., and H. G. Hidalgo, 2008: Utility of daily vs. monthly large-scale climate data: An intercomparison of two statistical downscaling methods. Hydrol. Earth Syst. Sci., 12, 551–563.
Penland, C., and T. Magorian, 1993: Prediction of Niño-3 sea surface temperatures using linear inverse modeling. J. Climate, 6, 1067–1076.
Robertson, A. W., J.-H. Qian, M. K. Tippett, V. Moron, and A. Lucero, 2012: Downscaling of seasonal rainfall over the Philippines: Dynamical versus statistical approaches. Mon. Wea. Rev., 140, 1204–1218.
Smith, T. M., and R. W. Reynolds, 2004: Improved extended reconstruction of SST (1854–1997). J. Climate, 17, 2466–2477.
Van den Dool, H., 1994: Searching for analogues, how long must we wait? Tellus, 46A, 314–324.
Van den Dool, H., 2006: Empirical Methods in Short-Term Climate Prediction. Oxford University Press, 240 pp.
Van den Dool, H., J. Huang, and Y. Fan, 2003: Performance and analysis of the constructed analogue method applied to U.S. soil moisture over 1981–2001. J. Geophys. Res., 108, 8617, doi:10.1029/2002JD003114.