## Abstract

There are a variety of multivariate statistical methods for analyzing the relations between two datasets. Two commonly used methods are canonical correlation analysis (CCA) and maximum covariance analysis (MCA), which find the projections of the data onto coupled patterns with maximum correlation and covariance, respectively. These projections are often used in linear prediction models. Redundancy analysis and principal predictor analysis construct projections that maximize the explained variance and the sum of squared correlations of regression models. This paper shows that the above pattern methods are equivalent to different diagonalizations of the regression between the two datasets. The different diagonalizations are computed using the singular value decomposition of the regression matrix developed using data that are suitably transformed for each method. This common framework for the pattern methods permits easy comparison of their properties. Principal component regression is shown to be a special case of CCA-based regression. A commonly used linear prediction model constructed from MCA patterns does not give a least squares estimate since correlations among MCA predictors are neglected. A variation, denoted least squares estimate (LSE)-MCA, is suggested that uses the same patterns but minimizes squared error. Since the different pattern methods correspond to diagonalizations of the same regression matrix, they all produce the same regression model when a complete set of patterns is used. Different prediction models are obtained when an incomplete set of patterns is used, with each method optimizing different properties of the regression. Some key points are illustrated in two idealized examples, and the methods are applied to statistical downscaling of rainfall over the northeast of Brazil.

## 1. Introduction

Multivariate statistical methods are used to analyze observational and model data, to make statistical forecasts, and to calibrate or correct dynamical forecasts. Some of the most commonly used methods include principal component analysis (PCA), maximum covariance analysis (MCA), and canonical correlation analysis (CCA; e.g., Bretherton et al. 1992). PCA is usually applied to a single dataset, finding the projections [empirical orthogonal functions (EOFs)] or components that explain the most variance. Methods such as CCA and MCA work with two datasets, finding projections that optimize some measure of linear association between the two datasets: CCA selects components of each dataset so as to maximize their correlation; MCA does likewise, except maximizing covariance. A common application of these methods is the construction of linear prediction models based on the identified, and often physically meaningful, coupled patterns.

Redundancy analysis (RDA) and principal predictor analysis (PPA) are pattern methods specifically tailored for use in linear regression models and, unlike CCA and MCA, are asymmetric in their treatment of the two datasets, identifying one dataset as the predictor and the other as the predictand. RDA selects predictor components that maximize explained variance (von Storch and Zwiers 1999; Wang and Zwiers 2001). PPA selects predictor components that maximize the sum of squared correlations (Thacker 1999). Another commonly used pattern regression method is principal component regression (PCR; e.g., Yu et al. 1997) in which PCA is applied to the predictor field and then a multiple linear regression is developed between the EOF coefficients or principal components (PCs) and each predictand element individually.

The purpose of this paper is to elucidate the connection between methods for finding coupled patterns and multivariate regression. A key element is the use of the singular value decomposition (SVD) to analyze the matrix of regression coefficients. The SVD reveals the structure of the regression by finding orthogonal transformations that diagonalize the regression. The singular values are the regression coefficients of the diagonalized regression. The regression is invariant with respect to linear transformations of the data (as long as the predictor transformation is invertible) in the sense that the regression matrix is transformed in the same way as the data. However, the SVD of the regression matrix is not invariant since, after a linear transformation of the data, the transformations that diagonalize the regression are generally no longer orthogonal. Therefore applying the SVD to regression matrices developed with different transformations of the data yields distinct diagonalizations of the regression. Furthermore, these distinct diagonalizations diagnose different properties of the regression as measured by the singular values. For instance, previous work has shown that when the data are expressed in the basis of its principal components, the regression matrix reduces to the cross-covariance matrix and its SVD corresponds to CCA with the singular values being the canonical correlations (Bretherton et al. 1992; DelSole and Chang 2003). Here we extend this idea and show that MCA, RDA, and PPA are equivalent to SVDs of the regression developed using data that are transformed in a distinct manner for each method. The connection between the pattern methods and multivariate regression provides a common framework that is useful for understanding and comparing the pattern methods, as well as for computation.

The paper is organized as follows: in section 2, we examine in a univariate regression how, with appropriate linear transformations of the data, the regression coefficient measures correlation, explained variance, or covariance. In section 3, we examine the behavior of multivariate regression when linear transformations are applied to the data. In section 4, we analyze the multivariate regression and obtain the coupled pattern methods as singular vectors of a transformed regression. We discuss reduced-rank regression in section 5. Some of the key issues are illustrated with idealized examples in section 6. The methods are compared in a statistical downscaling example in section 7. Section 8 gives a summary and conclusions.

## 2. Univariate linear regression

In the case of a single predictand and a single predictor, an estimate *ŷ* of the predictand *y* based on the predictor *x* is given by the linear regression

*ŷ* = *ax*,

where the regression coefficient *a* is

*a* = 〈*xy*〉/〈*x*^{2}〉,

and 〈 〉 denotes expectation; we take *x* and *y* to be anomaly values, that is, deviations from their respective means, and thus the regression equation contains no constant term. The regression coefficient can be manipulated to obtain quantities such as correlation, explained variance, and covariance, which measure aspects of the linear relation between predictor and predictand. Specifically,

correlation = 〈*xy*〉/(〈*x*^{2}〉〈*y*^{2}〉)^{1/2},  explained variance = 〈*xy*〉^{2}/〈*x*^{2}〉,  covariance = 〈*xy*〉.

Here “explained variance” means the variance of *y* explained by the regression, *not* the fraction of variance, which is the square of the correlation. The difference of the variance of *y* and the explained variance is the error variance of the regression. Since the linear regression minimizes squared error, it maximizes explained variance.

The regression coefficient *a* changes in a simple way when a linear scaling is applied to the variables. Suppose that new variables are defined by *x*′ = *lx* and *y*′ = *my*, where *l* and *m* are scalars and *l* ≠ 0. The regression equation for the new variables is

*ŷ*′ = *a*′*x*′,

where the new regression coefficient *a*′ is related to the original regression coefficient by

*a*′ = (*m*/*l*)*a*.

Combining Eqs. (3) and (5) shows that particular choices of *l* and *m* lead to the transformed regression coefficient *a*′ having the following interpretations:

- when both variables are normalized to have unit variance, *x*′ = *x*/〈*x*^{2}〉^{1/2} and *y*′ = *y*/〈*y*^{2}〉^{1/2}, the regression coefficient *a*′ is the correlation between *x* and *y*;
- when *x* alone is normalized to have unit variance, *x*′ = *x*/〈*x*^{2}〉^{1/2}, the magnitude of the regression coefficient |*a*′| is the square root of the variance explained by *x*; and
- when *x* is normalized by its variance, *x*′ = *x*/〈*x*^{2}〉, the regression coefficient *a*′ is the covariance between *x* and *y*.
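These three rescalings are easy to check numerically. The sketch below (our own code and variable names, not from the paper; assumes NumPy and synthetic anomaly data) estimates the regression coefficient and verifies each interpretation:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x = rng.standard_normal(n)
y = 0.5 * x + rng.standard_normal(n)
x -= x.mean()                        # work with anomalies, as in the text
y -= y.mean()

def reg_coef(x, y):
    """Univariate least squares coefficient a = <xy>/<x^2>."""
    return np.mean(x * y) / np.mean(x**2)

a = reg_coef(x, y)
sx, sy = x.std(), y.std()

a_corr = reg_coef(x / sx, y / sy)    # both standardized -> correlation
a_sqrt = reg_coef(x / sx, y)         # x standardized -> sqrt(explained variance)
a_cov = reg_coef(x / np.var(x), y)   # x scaled by its variance -> covariance
```

Comparing `a_corr` with the sample correlation and `a_cov` with the sample covariance confirms the scalar identities above.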

The connection between transformations of the data and the interpretation of the regression coefficient is simple but not particularly useful in the scalar case. The univariate regression does, however, indicate that rescaling of the data, while changing the value and interpretation of the regression coefficient, does not fundamentally change the regression; the rescalings of the data are simply applied to the least squares estimate (LSE) and the regression coefficient. This concept is generalized to the multivariate case in section 3, and in section 4 we present the appropriate multivariate generalizations of these data transformations that lead to regression coefficients that measure correlation, explained variance, or covariance—the same quantities that arise in methods for finding coupled patterns.

## 3. Multivariate linear regression

Suppose that the multivariate predictand **y** is linearly related to the multivariate predictor **x**, where **x** and **y** are anomaly fields; we use the convention that **x** and **y** are column vectors. The least squares estimate **ŷ** of the predictand is given by linear regression as

**ŷ** = 〈**yx**^{T}〉〈**xx**^{T}〉^{−1}**x**,

where T and −1 denote transpose and matrix inverse, respectively. Typically the expectations are computed from data using sample averages. The predictor data matrix 𝗫 is the matrix whose *i*th column is the *i*th sample of the predictor **x**; the number of rows of 𝗫 is equal to the dimension of **x**, and the number of columns of 𝗫 is equal to the number of samples. Likewise the predictand data matrix 𝗬 is the matrix whose *i*th column is the *i*th sample of the predictand **y**. Then

**ŷ** = (𝗬𝗫^{T})(𝗫𝗫^{T})^{−1}**x** = 𝗔**x**,

where the least squares regression coefficient matrix is defined as 𝗔 ≡ (𝗬𝗫^{T})(𝗫𝗫^{T})^{−1}.^{1}
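As an illustrative sketch of this convention (our own code, assuming NumPy; variables in rows, samples in columns), the regression matrix can be computed directly from the data matrices:

```python
import numpy as np

rng = np.random.default_rng(1)
n_x, n_y, n_samples = 5, 3, 200

# columns are samples; rows are predictor/predictand elements (anomalies)
X = rng.standard_normal((n_x, n_samples))
Y = rng.standard_normal((n_y, n_x)) @ X + 0.1 * rng.standard_normal((n_y, n_samples))
X -= X.mean(axis=1, keepdims=True)
Y -= Y.mean(axis=1, keepdims=True)

# least squares regression matrix A = (Y X^T)(X X^T)^{-1}
A = (Y @ X.T) @ np.linalg.inv(X @ X.T)
Y_hat = A @ X                        # least squares estimate of the predictand
```

The same matrix is obtained from a generic least squares solver, which is the preferred route in practice when 𝗫𝗫^{T} is poorly conditioned.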

As in the univariate case of Eq. (5), linear transformations of the data lead to transformation of the regression matrix. Suppose we introduce new variables **y**′ = 𝗠**y** and **x**′ = 𝗟**x**, where 𝗟 and 𝗠 are matrices. The regression matrix 𝗔′ relating the transformed variables is

𝗔′ = (𝗠𝗬𝗫^{T}𝗟^{T})(𝗟𝗫𝗫^{T}𝗟^{T})^{−1}.

If, additionally, 𝗟 is invertible then the transformed regression matrix has the simple form

𝗔′ = 𝗠𝗔𝗟^{−1},

analogous to the univariate case in Eq. (5). This relation provides several pieces of useful information. First, when the transformation 𝗟 of the predictor is invertible, the least squares estimate **ŷ**′ of **y**′ is

**ŷ**′ = 𝗔′**x**′ = 𝗠𝗔𝗟^{−1}𝗟**x** = 𝗠**ŷ**,

which means that the least squares estimate using the transformed data is just the transformation of the original least squares estimate. Rescaling the data or expressing it in another basis does not fundamentally change the regression so long as the transformation 𝗟 of predictor data is invertible.
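A quick numerical check of this invariance (our own sketch, assuming NumPy; random square matrices are almost surely invertible):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((4, 300))
Y = rng.standard_normal((2, 4)) @ X + 0.2 * rng.standard_normal((2, 300))

def reg_matrix(Y, X):
    """Least squares regression matrix A = (Y X^T)(X X^T)^{-1}."""
    return (Y @ X.T) @ np.linalg.inv(X @ X.T)

A = reg_matrix(Y, X)

L = rng.standard_normal((4, 4))      # generic (hence invertible) predictor transform
M = rng.standard_normal((2, 2))      # predictand transform
A_prime = reg_matrix(M @ Y, L @ X)   # regression between transformed variables
```

Here `A_prime` equals 𝗠𝗔𝗟^{−1}, so the least squares estimate of the transformed predictand is the transform of the original estimate.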

The transformation 𝗟 of the predictor data is not invertible when 𝗟**x** = 0 for some **x** ≠ 0, which means that the transform 𝗟 has the effect of reducing the number of predictors. Reducing the number of predictors is often desirable when the dimension of the predictor is large compared to the number of available samples. When the number of predictors is large compared to the number of samples, the sample covariance matrix 𝗫𝗫^{T} is ill conditioned or even singular, and reducing the number of predictors regularizes the regression problem by making it have a unique solution that is not overly sensitive to the data. The number of predictors is often reduced using PCA, which finds the components of the data that explain the most variance, although other projections may be used as well (DelSole and Shukla 2006). Reducing the set of predictors to some smaller number of PCs is called *prefiltering* in the context of CCA (Bretherton et al. 1992). In contrast to the general case of singular transformations of the predictor data, when 𝗟 is the prefiltering transformation that maps the data onto a subset of its PCs, the regression developed with the prefiltered data is the same as the original regression applied to the prefiltered data (see the appendix).

The goal when selecting the number of predictors is a skillful model. However, the data used to estimate the regression coefficients are not directly useful for determining the skill of the regression.^{2} For instance, if the dimension of the predictor **x** exceeds the number of samples and prefiltering is done using the maximum number of PCs, the in-sample error is 0 since 𝗬 = 𝗔𝗫. However, since such a regression completely fits the data, including its random components, we expect it to suffer from *overfitting* and have poor skill on independent data. Regression models with fewer predictors are more likely to represent the actual relationships, avoid overfitting, and better predict out-of-sample data. To choose the number of predictors that optimizes the out-of-sample skill of the regression, the data can be split into two segments with the regression coefficients estimated using one segment and the number of predictors chosen to optimize the skill in the independent segment. This procedure does, however, give an overly optimistic estimate of skill due to selection bias (Zucchini 2000), and the skill of the selected model should ideally be estimated on a third independent set of data. In what follows, we assume that the number of predictors has been reduced so that the number of predictors is less than the number of samples, and the predictor covariance matrix is invertible.

Another important consequence of the relation in Eq. (8) follows from noting that the error variance ‖**y**′ − **ŷ**′‖^{2} = (**y**′ − **ŷ**′)^{T}(**y**′ − **ŷ**′) of the transformed variable is minimized, and that (**y**′ − **ŷ**′)^{T}(**y**′ − **ŷ**′) = (**y** − **ŷ**)^{T}(𝗠^{T}𝗠)(**y** − **ŷ**). Therefore, not only is the sum of squared error (**y** − **ŷ**)^{T}(**y** − **ŷ**) minimized, but so is the positive semidefinite quadratic function of the error (**y** − **ŷ**)^{T}(𝗠^{T}𝗠)(**y** − **ŷ**). Changing the weighting of the error elements does not result in an estimate that is different from the least squares estimate, and in fact, some of the weights can be set to 0 since 𝗠 is not required to be invertible. For instance, choosing 𝗠 = **e**^{T}_{i}, where **e**_{i} is the *i*th column of the identity matrix, means that the least squares estimate minimizes ‖**e**^{T}_{i}(**y** − **ŷ**)‖^{2} = (*y*_{i} − *ŷ*_{i})^{2}, which is the error of the *i*th element of the predictand. Therefore regression minimizes not only the total error variance but the error variance of each element separately. Consequently, the regression estimate developed using all the elements of the predictand simultaneously is the same as the one developed with individual elements of the predictand separately. However, for questions of inference, such as testing hypotheses about the regression coefficients, the multivariate character of the problem cannot be neglected, and correlations between parameters must be considered.

This last property of regression aids the interpretation of PCR. In PCR, regressions are developed between predictor PCs and each of the predictands individually. The above conclusion means that PCR is the same as developing the regression between all of the predictands and the PCs simultaneously. This shows a connection with CCA since a CCA-based regression model with EOF prefiltering of the predictor (and no other truncation) is the same as multiple linear regression between the predictor PCs and the predictand (Glahn 1968; DelSole and Chang 2003). Therefore PCR is the same as a CCA-based regression model with EOF prefiltering of the predictor and no other truncation such as prefiltering of the predictand.
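This equivalence is easy to verify numerically. In the sketch below (our own code, assuming NumPy), the jointly developed regression matrix matches the one assembled from separate single-predictand regressions:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((6, 500))
Y = rng.standard_normal((4, 500))

XXT_inv = np.linalg.inv(X @ X.T)

# regression developed with all predictand elements simultaneously
A_joint = (Y @ X.T) @ XXT_inv

# separate regression for each predictand element, stacked row by row
A_separate = np.vstack(
    [(Y[i : i + 1] @ X.T) @ XXT_inv for i in range(Y.shape[0])]
)
```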

## 4. Analysis of the regression matrix

We now show that transforming the multivariate data in ways suggested by the univariate case allows us to interpret the regression coefficients as correlation, variance explained, standardized explained variance, or covariance of the original data. The SVD of the transformed regression matrix diagonalizes the regression and identifies projections of the data that maximize these measures. These projections are the same that are used in methods for finding coupled patterns.

### a. Correlation

In the univariate case, normalizing the predictor and predictand by their standard deviation makes the regression coefficient equal to the correlation between predictor and predictand. The appropriate multivariate generalization is to multiply the variables by the inverse of the matrix square root^{3} of their covariance:

**x**′ = (𝗫𝗫^{T})^{−1/2}**x**,  **y**′ = (𝗬𝗬^{T})^{−1/2}**y**.

The appearance of the inverse of the predictand covariance indicates that it may be necessary to prefilter the predictand as well as the predictor. The matrix square root is not uniquely defined; postmultiplication of a matrix square root by an orthogonal matrix gives another matrix square root. A convenient choice for the matrix square root of the inverse covariance is the transformation that replaces the data with its PC time series (normalized to have unit variance); that is, **x**′ and **y**′ are the normalized principal component scores. Such a transformation is sometimes called a *whitening* transformation (DelSole and Tippett 2007) since the transformed data are uncorrelated and have unit variance,

𝗫′𝗫′^{T} = 𝗜 and 𝗬′𝗬′^{T} = 𝗜,

where 𝗜 is the identity matrix. The regression matrix for predicting **y**′ from **x**′ is

𝗔′ = (𝗬′𝗫′^{T})(𝗫′𝗫′^{T})^{−1} = 𝗬′𝗫′^{T},

since 𝗫′𝗫′^{T} = 𝗜. The (*i*, *j*)th element of 𝗔′ is the correlation between the *i*th element of **y**′ and the *j*th element of **x**′, denoted *y*′_{i} and *x*′_{j}, respectively, since

and the elements of **x**′ and **y**′ have unit variance.

Instead of looking at the correlations between individual elements of **x**′ and **y**′, we can examine the correlation of one-dimensional projections of the data. Projecting the transformed predictand and predictor data onto the *weight vectors* **u** and **v**, respectively, gives the time series

which from Eq. (12) have unit variance. The correlation between the time series of the projections is

where we use the definition of 𝗔′ from Eq. (13). This ratio is maximized when **u** and **v** are, respectively, the left and right leading singular vectors of 𝗔′ (Golub and Van Loan 1996). The SVD of 𝗔′ is defined to be

where 𝗨 and 𝗩 are square orthogonal matrices and 𝗦 is a matrix with nonnegative diagonal entries *s*_{i} ordered from largest to smallest; the columns of 𝗨 and 𝗩 form complete, orthogonal bases for the predictand and predictor, respectively. The singular vectors **u**_{i} and **v**_{i} are the *i*th columns of 𝗨 and 𝗩 and satisfy

where *k* is the smaller of row and column dimensions of 𝗔′. Therefore *s*_{1} = **u**^{T}_{1}𝗔′**v**_{1} is the largest possible correlation between projections of the data. The next largest singular value *s*_{2} = **u**^{T}_{2}𝗔′**v**_{2} is the largest possible correlation between projections of the data subject to the constraint that the projections be orthogonal to the first ones, that is, the constraint that **u**^{T}_{2}**u**_{1} = **v**^{T}_{2}**v**_{1} = 0. This orthogonality constraint has the consequence that the associated time series are uncorrelated because

Likewise, subsequent singular values are the maximum correlation subject to the constraint that the projections are orthogonal (time series are uncorrelated) to previous ones.

The weight vectors for the untransformed variables are the columns of the matrices 𝗤_{x} and 𝗤_{y} defined so that the projection of the untransformed variables is equal to the projection of the transformed variables,

𝗤^{T}_{x}**x** = 𝗩^{T}**x**′ and 𝗤^{T}_{y}**y** = 𝗨^{T}**y**′.

Using Eq. (11) gives 𝗤_{y} = (𝗬𝗬^{T})^{−1/2}𝗨 and 𝗤_{x} = (𝗫𝗫^{T})^{−1/2}𝗩. Although the weight vectors for the transformed variables are orthogonal, the weight vectors for the untransformed variables are not; more precisely, they are orthogonal with respect to a different norm since 𝗤^{T}_{y}(𝗬𝗬^{T})^{−1}𝗤_{y} = 𝗜 and 𝗤^{T}_{x}(𝗫𝗫^{T})^{−1}𝗤_{x} = 𝗜. The data can be expressed as *patterns* that multiply the time series of the projections. The pattern vectors differ from the weight vectors since the weight vectors are not orthogonal. The matrices 𝗣_{x} and 𝗣_{y} of pattern vectors are found by solving

𝗣_{x}(𝗤^{T}_{x}𝗫) = 𝗫 and 𝗣_{y}(𝗤^{T}_{y}𝗬) = 𝗬,

which gives 𝗣_{x} = (𝗫𝗫^{T})^{1/2}𝗩 and 𝗣_{y} = (𝗬𝗬^{T})^{1/2}𝗨; these patterns solve Eq. (21) in a least squares sense when an incomplete set of projections is used. The pattern and weight vectors are orthogonal to each other since 𝗣^{T}_{x}𝗤_{x} = 𝗜 and 𝗣^{T}_{y}𝗤_{y} = 𝗜.

The above analysis defines the decomposition of the data into patterns whose time series have the maximum correlation subject to the constraint that subsequent predictor and predictand time series be uncorrelated. This decomposition is CCA, with the columns of 𝗤_{x} (𝗤_{y}) being the predictor (predictand) weight vectors, the columns of 𝗣_{x} (𝗣_{y}) the predictor (predictand) patterns, and the diagonal elements of 𝗦 the canonical correlations [DelSole and Chang (2003); see the appendix of this paper for a derivation of the usual CCA equations]. Using the relation between 𝗔 and 𝗔′ in Eq. (9), the regression matrix can be written simply in terms of the weight vectors and patterns as

𝗔 = 𝗣_{y}𝗦𝗤^{T}_{x}.

The above relation shows that CCA diagonalizes the regression. Since 𝗤^{T}_{x}𝗣_{x} = 𝗜, 𝗔𝗣_{x} = 𝗣_{y}𝗦, and predictor patterns are mapped to predictand patterns scaled by their correlation. The decomposition of 𝗔 in Eq. (22) is not the usual SVD of 𝗔 since 𝗣_{y} and 𝗤_{x} are not orthogonal matrices, but it can be interpreted as an SVD of 𝗔 with the usual vector norms replaced by the norms implied by the whitening transformations.^{4}
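The CCA construction above can be sketched numerically as follows (our own code and variable names, assuming NumPy; the whitening transforms are symmetric inverse matrix square roots computed by eigendecomposition):

```python
import numpy as np

def inv_sqrt(C):
    """Symmetric inverse matrix square root via eigendecomposition."""
    w, V = np.linalg.eigh(C)
    return V @ np.diag(w**-0.5) @ V.T

rng = np.random.default_rng(4)
X = rng.standard_normal((5, 1000))
Y = rng.standard_normal((3, 5)) @ X + rng.standard_normal((3, 1000))
X -= X.mean(axis=1, keepdims=True)
Y -= Y.mean(axis=1, keepdims=True)

Wx, Wy = inv_sqrt(X @ X.T), inv_sqrt(Y @ Y.T)   # whitening transforms
Xp, Yp = Wx @ X, Wy @ Y                          # X' X'^T = I, Y' Y'^T = I

A_prime = Yp @ Xp.T              # regression matrix of whitened variables
U, s, Vt = np.linalg.svd(A_prime)

# the singular values are the canonical correlations; the leading pair of
# projections attains the largest correlation s[0]
tx = Vt[0] @ Xp
ty = U[:, 0] @ Yp

Qx = Wx @ Vt.T                   # weight vectors for the untransformed predictor
Py = np.linalg.inv(Wy) @ U       # predictand patterns
S = np.zeros_like(A_prime)
np.fill_diagonal(S, s)           # A = Py S Qx^T recovers the regression matrix
```

Reassembling `Py @ S @ Qx.T` reproduces the full least squares regression matrix, illustrating that CCA is a diagonalization of the regression rather than a different model.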

### b. Variance explained

In the univariate case, normalizing the predictor by its standard deviation and leaving the predictand unchanged makes the regression coefficient equal the square root of the explained variance. The appropriate generalization to the multivariate problem is to apply the whitening transformation to the predictor as in Eq. (11),

and to leave the predictand unchanged. The regression matrix relating **x**′ and **y** is

since 𝗫′𝗫′^{T} = 𝗜. Proceeding as in Eq. (14) of the previous section shows that the absolute value of the (*i*, *j*)th entry of the transformed regression matrix 𝗔′ is the square root of the variance explained by the regression between *y*_{i} and *x*′_{j}. The square root of the variance explained by a regression between projections using the weight vectors **u** and **v** of the predictand and predictor data is

This ratio is maximized when **u** and **v** are, respectively, the leading left and right singular vectors of 𝗔′. Therefore *s*^{2}_{1} = (**u**^{T}_{1}𝗔′**v**_{1})^{2} is the maximum variance explained by a single predictor. Conversely, 〈**y**^{T}**y**〉 − *s*^{2}_{1} is the minimum error variance of a regression that uses a single predictor. The variance explained using the first two pairs of singular vectors is *s*^{2}_{1} + *s*^{2}_{2}, and the minimum error variance when two predictors are used is 〈**y**^{T}**y**〉 − *s*^{2}_{1} − *s*^{2}_{2}. The variances add since the predictor projections are uncorrelated, a consequence of the fact that **v**^{T}_{1}𝗫′𝗫′^{T}**v**_{2} = **v**^{T}_{1}**v**_{2} = 0. The predictand projection time series are correlated but the predictand weight (and pattern) vectors are orthogonal since **u**^{T}_{1}**u**_{2} = 0. This decomposition of the data is called RDA (von Storch and Zwiers 1999; Wang and Zwiers 2001). Additional details of the weight and pattern vectors are given in Table 1. The RDA patterns diagonalize the regression with the diagonal elements measuring the square root of the variance explained by each predictor pattern. A related method is Empirical Orthogonal Teleconnection 2 (EOT2), which finds the predictor element, rather than the linear combination of predictor elements, that explains the maximum predictand variance (Van den Dool 2006). Subsequent uncorrelated EOT2 components are computed iteratively by finding the predictor element at each step that explains the most residual variance.
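A minimal numerical sketch of RDA along these lines (our own code, assuming NumPy; only the predictor is whitened):

```python
import numpy as np

def inv_sqrt(C):
    """Symmetric inverse matrix square root via eigendecomposition."""
    w, V = np.linalg.eigh(C)
    return V @ np.diag(w**-0.5) @ V.T

rng = np.random.default_rng(5)
X = rng.standard_normal((4, 800))
Y = rng.standard_normal((3, 4)) @ X + 0.5 * rng.standard_normal((3, 800))
X -= X.mean(axis=1, keepdims=True)
Y -= Y.mean(axis=1, keepdims=True)

Xp = inv_sqrt(X @ X.T) @ X       # whiten the predictor only
A_prime = Y @ Xp.T
U, s, Vt = np.linalg.svd(A_prime)

# time series of the leading predictor projection (normalized so t @ t = 1)
t = Vt[0] @ Xp
coef = Y @ t                     # regression of each predictand element on t
explained = coef @ coef          # total predictand variance explained by t
```

By construction, the leading singular value squared equals the variance explained by the single best predictor projection.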

### c. Explained standardized variance

If the variances of the predictands are highly disparate, *standardization,* that is, normalizing each predictand by its standard deviation, may be appropriate. Applying RDA to standardized predictands finds the projections that maximize the explained standardized variance. Explicitly, we use the transformations

where the notation Diag 𝗬𝗬^{T} means the diagonal matrix whose diagonal elements are the same as those of 𝗬𝗬^{T}; the elements of Diag 𝗬𝗬^{T} are the predictand variances, and **y**′ is **y** with each element divided by its standard deviation. This transformation of the predictand normalizes each predictand to have unit variance as in CCA, but unlike CCA, the transformed predictands remain correlated. The transformed regression matrix is

The (*i*, *j*)th element of 𝗔′ is the correlation between *y*_{i} and *x*′_{j} since

The absolute value of this quantity is also the square root of the fraction of the variance of *y*_{i} explained by *x*′_{j}, that is, the square root of the explained standardized variance. Paralleling the interpretation of CCA and RDA, we project the transformed data onto the vectors **u** and **v**. The square root of the standardized explained variance of the regression between the projections is

The **u** and **v** that maximize this ratio are the leading singular vectors of 𝗔′.

The explained standardized variance is the sum of the explained fraction of variance for each predictand. On the other hand, the explained fraction of variance for each predictand is the square of the correlation between the prediction and the predictand. Therefore, maximizing the explained standardized variance is the same as maximizing the sum of squared correlations between predictand and prediction.

We call this decomposition of the data PPA after Thacker (1999), who focused on the predictor patterns, which he called principal predictors and characterized as maximizing the sum of squared correlations between the predictor patterns and the predictand data. Like CCA and RDA, the PPA predictor projections are uncorrelated because of the use of the whitening transformation. However, the predictand projections are neither uncorrelated nor orthogonal. Additional details of the weight and pattern vectors are given in Table 1. PPA provides a diagonalization of the regression with the diagonal elements measuring the square root of the explained standardized variance for each pattern pair.
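The PPA construction can be sketched similarly (our own code, assuming NumPy; the predictor is whitened and each predictand element is standardized):

```python
import numpy as np

def inv_sqrt(C):
    """Symmetric inverse matrix square root via eigendecomposition."""
    w, V = np.linalg.eigh(C)
    return V @ np.diag(w**-0.5) @ V.T

rng = np.random.default_rng(6)
X = rng.standard_normal((4, 800))
Y = rng.standard_normal((3, 4)) @ X + 0.5 * rng.standard_normal((3, 800))
X -= X.mean(axis=1, keepdims=True)
Y -= Y.mean(axis=1, keepdims=True)

Xp = inv_sqrt(X @ X.T) @ X                     # whitened predictor
Yp = np.diag(np.diag(Y @ Y.T) ** -0.5) @ Y     # each predictand scaled to unit norm
A_prime = Yp @ Xp.T
U, s, Vt = np.linalg.svd(A_prime)

t = Vt[0] @ Xp                 # leading principal predictor time series
corrs = Yp @ t                 # correlation of each predictand element with t
```

The squared leading singular value equals the sum of squared correlations between the predictand elements and the leading principal predictor time series.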

### d. Covariance

In the univariate problem, normalizing the predictor by its variance makes the regression coefficient equal to the covariance. To generalize to the multivariate problem we multiply the predictor by the inverse of its covariance,

**x**′ = (𝗫𝗫^{T})^{−1}**x**,

and do not transform **y**. The regression matrix for predicting **y** from **x**′ is

𝗔′ = (𝗬𝗫′^{T})(𝗫′𝗫′^{T})^{−1} = 𝗬𝗫^{T}.

The (*i*, *j*)th element of 𝗔′ is the covariance between *y*_{i} and *x*_{j}. The covariance between projections of the predictand and predictor data in the directions **u** and **v**, respectively, is

This ratio is maximized when **u** and **v** are the left and right leading singular vectors of 𝗔′.^{5} This decomposition of the data is maximum covariance analysis (MCA), sometimes referred to as SVD; we use the name MCA to distinguish between the coupled pattern method based on the SVD of the cross covariance and the general SVD matrix procedure (von Storch and Zwiers 1999).

Writing the regression matrix 𝗔 using the MCA projections gives

𝗔 = 𝗨𝗦𝗩^{T}(𝗫𝗫^{T})^{−1},

where 𝗨𝗦𝗩^{T} is the SVD of 𝗬𝗫^{T}.^{6} The *i*th MCA predictand projection is uncorrelated with the *j*th MCA predictor projection for *i* ≠ *j* since the matrix 𝗨^{T}𝗬𝗫^{T}𝗩 only has nonzero elements on its diagonal. However, this does not mean that a regression for the *i*th MCA predictand projection should only include the *i*th MCA predictor projection. To show this, we express the predictor data as 𝗫 = 𝗩𝗕, where the rows of 𝗕 contain the time series of the projection of the predictors onto 𝗩, and 𝗕 is given by 𝗕 = 𝗩^{T}𝗫. Substituting this representation of the predictor data into Eq. (33) gives

𝗔 = 𝗨𝗦(𝗕𝗕^{T})^{−1}𝗩^{T}.

This form is similar to usual MCA-based linear models. However, usually 𝗕𝗕^{T} is replaced by the diagonal matrix whose first *n*_{y} diagonal entries are the same as those of 𝗕𝗕^{T} (the variances of the MCA time series) and whose remaining diagonal entries are zero (e.g., Widmann et al. 2003); *n*_{y} is the dimension of the predictand. This approximation means that correlations between MCA modes are neglected, and the resulting estimate is not generally an LSE. Therefore, we call the method using the regression matrix in Eq. (34) LSE-MCA since it uses the projections that maximize covariance like MCA but *is* a least squares estimate. Like MCA, LSE-MCA requires no EOF prefiltering. Widmann (2005) also noted that the usual MCA-based linear models do not agree with CCA-based regression and multiple linear regression, even when the predictand is a scalar. When the predictand is a scalar, the usual MCA-based linear model uses a single SVD mode as predictor, truncating the predictor data. LSE-MCA, on the other hand, does not truncate the data and reproduces the least squares estimate. The usual MCA-based linear model is the same as LSE-MCA when the predictor and predictand dimensions are the same and 𝗕𝗕^{T} is indeed diagonal. The matrix 𝗕𝗕^{T} is diagonal when the MCA modes 𝗩 also happen to be EOFs of the predictor or when the predictors are uncorrelated with equal variance and the covariance matrix 𝗫𝗫^{T} is proportional to the identity matrix; the latter condition is true when, for instance, the predictors are whitened variables. Feddersen et al. (1999) used MCA projections in a least squares estimate but with an implementation that additionally required the solution to be found by numerical optimization. MCA is similar to *partial least squares* (PLS) regression (Wold et al. 1984; Boulesteix and Strimmer 2007) in that the components maximize covariance, and the first PLS component is the same as the first MCA component. However, subsequent components differ because PLS components are uncorrelated.
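The distinction between LSE-MCA and the usual MCA-based model can be illustrated numerically (our own sketch, assuming NumPy): both use the same MCA projections, but the usual model replaces 𝗕𝗕^{T} by its diagonal, discarding correlations between MCA modes.

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.standard_normal((5, 600))
Y = rng.standard_normal((3, 5)) @ X + rng.standard_normal((3, 600))
X -= X.mean(axis=1, keepdims=True)
Y -= Y.mean(axis=1, keepdims=True)

# MCA: SVD of the (unnormalized) cross covariance
U, s, Vt = np.linalg.svd(Y @ X.T)
B = Vt @ X                          # time series of the MCA predictor projections
S = np.zeros((Y.shape[0], X.shape[0]))
np.fill_diagonal(S, s)

# LSE-MCA: the full least squares regression expressed in the MCA basis
A_lse = U @ S @ np.linalg.inv(B @ B.T) @ Vt

# usual MCA-based model: B B^T replaced by its diagonal, so correlations
# between MCA modes are neglected
D_inv = np.diag(1.0 / np.diag(B @ B.T))
A_mca = U @ S @ D_inv @ Vt
```

With generic data the MCA time series are mutually correlated, so `A_mca` differs from the least squares regression while `A_lse` reproduces it exactly.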

## 5. Reduced-rank regressions

We have shown that the regression matrix can be decomposed into patterns that optimize selected quantities including correlation, explained variance, explained standardized variance, and covariance. These decompositions help diagnose properties of the regression by expressing the data in bases so that the regression matrix is diagonal. As shown in section 3, the use of different bases does not fundamentally change the regression as long as the bases are complete and there is no truncation of the data. Therefore, all the methods give the same prediction model as multiple regression when a complete set of patterns are used. However, the regression is changed when a partial set of patterns is used, effectively truncating the data used to develop the regression. Such a simplification of the regression may be desirable since it reduces the number of predictors, and hence the number of parameters that must be estimated from the data. We expect that regressions that use too many predictors will have poor skill on independent data due to overfitting and sampling error.

Reducing the number of patterns used in the regression is somewhat different from prefiltering, which reduces the number of predictors or predictands without necessarily considering joint relations between predictor and predictand. Decomposition of the regression into pairs of patterns produces measures of the strength of the relation between the patterns; for instance, CCA gives the correlation between the time series of the patterns. Therefore, it is reasonable to retain those pairs of patterns that represent the strongest relations and discard the rest. Since overfitting may exaggerate the in-sample relationship, validation of the relation on independent data is useful for deciding which pairs of patterns to retain. Often cross-validated skill is the basis for selecting the patterns to keep in the regression. However, as mentioned earlier in the context of EOF prefiltering, the cross-validated skill used to select the model will give an overly optimistic estimate of performance on independent data due to selection bias.

Since the pattern pairs are found by computing the SVD of the transformed regression matrix 𝗔′, restricting the patterns used in the regression is the same as replacing 𝗔′ by a truncated SVD; that is, the regression matrix 𝗔′ = 𝗨𝗦𝗩^{T} is replaced with Ȧ′ = 𝗨Ṡ𝗩^{T}, where the first *r* diagonal elements of Ṡ are the same as those of 𝗦 and the rest are 0; the patterns and weights do not change. The resulting truncated regression matrix Ȧ = 𝗣_{y}Ṡ𝗤^{T}_{x} retains *r* pairs of patterns and is the rank-*r* regression that optimizes the quantity measured by the SVD. In particular, depending on the method, the rank-*r* regression may optimize mutual information (CCA),^{7} explained variance (RDA), the sum of squared correlations (PPA), or the sum of covariances (LSE-MCA).
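As an illustration (our own numerical sketch with hypothetical variable names, using the CCA transform), the rank-*r* regression can be formed by zeroing all but the leading *r* singular values of the transformed regression matrix and mapping back to the original coordinates:

```python
import numpy as np

rng = np.random.default_rng(1)
nx, ny, n = 5, 4, 300
X = rng.standard_normal((nx, n))
Y = 0.5 * rng.standard_normal((ny, nx)) @ X + rng.standard_normal((ny, n))

def msqrt_inv(C):
    """Inverse symmetric square root of a positive definite matrix."""
    w, V = np.linalg.eigh(C)
    return V @ np.diag(w ** -0.5) @ V.T

Sy_inv, Sx_inv = msqrt_inv(Y @ Y.T), msqrt_inv(X @ X.T)
Aprime = Sy_inv @ (Y @ X.T) @ Sx_inv          # CCA transform of the regression
U, s, Vt = np.linalg.svd(Aprime, full_matrices=False)

r = 2
s_dot = np.where(np.arange(s.size) < r, s, 0.0)  # keep only the leading r values
Aprime_r = U @ np.diag(s_dot) @ Vt

# Map back to the original coordinates: the rank-r regression matrix.
A_r = np.linalg.inv(Sy_inv) @ Aprime_r @ Sx_inv
assert np.linalg.matrix_rank(A_r) == r
```

Because the back transforms are invertible, zeroing singular values of 𝗔′ reduces the rank of the regression matrix in the original coordinates as well.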

The patterns obtained in each method are generally different, and for a given value of *r*, the rank-*r* regression will be different for each method. Therefore, the different pattern methods produce different regressions when the regression is truncated. The motives of the user or the nature of the problem may indicate that one pattern method is preferable over another. For instance, CCA can select patterns with large correlation but small explained variance. In this case, RDA might be preferable as it maximizes explained variance. Similarly, LSE-MCA, by maximizing covariance, may select patterns with large variance but not necessarily high correlation. In this case, CCA might be preferable. The optimization of mutual information makes CCA attractive from the viewpoint of predictability since mutual information is a predictability measure with many attractive properties (DelSole and Tippett 2007).

A final point regarding these truncated regressions: since the truncation is applied to a diagonal regression matrix, the truncated regression is identical to the regression developed using only the data projected onto the retained patterns.

## 6. Two idealized examples

### a. MCA and LSE-MCA

We now consider a simple example that illustrates the difference between the commonly used MCA linear model and LSE-MCA. We take **x** and **y** each to have two elements. Suppose that 𝗬𝗫^{T} = 𝗜 and the MCA modes are the columns of the identity matrix. Then from Eq. (33) the regression matrix is simply

The commonly used approximation of the regression matrix is the diagonal matrix

We take the predictor covariance to have the form

where *θ* is the angle between the MCA modes and predictor EOFs, and the predictor EOFs have variance of 1 and *σ*^{2}. The angle *θ* is important because MCA and LSE-MCA are the same when the MCA modes are predictor EOFs, that is, when *θ* = 0. Additionally, suppose the predictand covariance has the similar structure

These choices for the covariances imply that the CCA weight vectors are the same as the EOFs and that the canonical correlations are 0.8 and 0.5 [see Eq. (A5) of the appendix].

The error variance of the regression is

where we use the facts that 𝗬𝗫^{T} = 𝗜 and 𝗔 = (𝗫𝗫^{T})^{−1}. The error variance of the MCA model is

Therefore the error variance of the MCA linear model relative to that of the LSE-MCA regression is

This error variance is governed by *θ* and *σ*. Figure 1 shows the error of the MCA linear model relative to that of the LSE-MCA regression as a function of *θ* and *σ*. When *θ* = 0, the MCA modes are also EOFs of the predictors, and there is no difference between the methods. Increasing *θ* increases the error of the MCA linear model. When *σ* = 1, the methods are the same since 𝗫𝗫^{T} = 𝗜 and again the MCA modes are the same as the predictor EOFs. As *σ* decreases, the relative error of the MCA linear model increases.
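This behavior can be checked numerically from the population covariances. In the sketch below (our own construction; the helpers `rot` and `error_excess` are hypothetical names), 𝗬𝗫^{T} = 𝗜 so that the MCA modes are the identity columns with unit singular values, and the extra error variance of the diagonal MCA model over the least squares LSE-MCA matrix is computed directly:

```python
import numpy as np

def rot(t):
    """2x2 rotation matrix through angle t."""
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s], [s, c]])

def error_excess(theta, sigma):
    """Extra error variance of the diagonal MCA model over LSE-MCA.
    With YX^T = I, the MCA modes are the identity columns and both
    singular values equal 1."""
    R = rot(theta)
    Cxx = R @ np.diag([1.0, sigma**2]) @ R.T   # predictor covariance
    A_lse = np.linalg.inv(Cxx)                 # LSE-MCA: A = (XX^T)^{-1}
    # Common MCA model: one coefficient per mode pair, s_k / var(q_k^T x).
    A_mca = np.diag([1.0 / Cxx[0, 0], 1.0 / Cxx[1, 1]])
    D = A_mca - A_lse
    return np.trace(D @ Cxx @ D.T)             # always >= 0

print(error_excess(0.0, 0.5))         # modes are predictor EOFs: no penalty
print(error_excess(np.pi / 6, 0.5))   # penalty appears once theta > 0
print(error_excess(np.pi / 6, 1.0))   # sigma = 1 makes Cxx = I: no penalty
```

The excess error is the quadratic form tr[(𝗔_{MCA} − 𝗔)𝗫𝗫^{T}(𝗔_{MCA} − 𝗔)^{T}], which vanishes exactly when *θ* = 0 or *σ* = 1, in agreement with the discussion above.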

### b. CCA, LSE-MCA, and RDA

We now present a simple example to illustrate some issues regarding the truncation of the regression as discussed in section 5. We construct a two-dimensional, diagonal example where the correlations of the two elements are specified and examine the error of rank-1 regressions as the variance of one of the elements is varied. In particular, suppose that

The first and second elements are uncorrelated and have variance 1 and *σ*^{2}, respectively, for both **x** and **y**. Note that *σ*^{2} may or may not exceed 1. Suppose that 𝗬𝗫^{T} is diagonal and given by

so that *c*_{1} and *c*_{2} are the canonical correlations; *c*_{1} ≥ *c*_{2}. The regression matrix is

and the regression error variance is

The rank-1 CCA regression selects the part of the system with highest correlation, which is the first element, regardless of *σ.* We now show that when the first element has little variance, the regression based on the leading CCA pattern does not minimize squared error. The rank-1 CCA regression matrix is

The error variance of the rank-1 CCA regression relative to the full regression is

On the other hand, LSE-MCA selects the part of the system with highest covariance. Examination of Eq. (43) shows that for *σ* < (*c*_{1}/*c*_{2})^{1/2}, the first element has the highest covariance and the rank-1 LSE-MCA regression matrix is the same as the rank-1 CCA regression matrix. For *σ* > (*c*_{1}/*c*_{2})^{1/2}, the second element has the highest covariance and the rank-1 LSE-MCA regression matrix is

and the error variance relative to the full regression is

Comparing Eq. (47) with Eq. (49) shows that the error variance of the rank-1 LSE-MCA regression is larger than that of the rank-1 CCA regression when (*c*_{1}/*c*_{2})^{1/2} ≤ *σ* ≤ *c*_{1}/*c*_{2} and smaller when *σ* ≥ *c*_{1}/*c*_{2}. This result agrees with the intuition that if *σ* is small, we expect the squared error to be minimized by the rank-1 regression matrix accounting for the first element, which has the highest correlation. On the other hand, if *σ* is sufficiently large, then the rank-1 regression should be based on the second element.

Figure 2 shows the squared error of the rank-1 regressions as a function of *σ* for *c*_{1} = 0.8 and *c*_{2} = 0.5. There are three regimes. For *σ* ≤ (*c*_{1}/*c*_{2})^{1/2} ≈ 1.26, the LSE-MCA and CCA rank-1 regressions are the same. For (*c*_{1}/*c*_{2})^{1/2} ≤ *σ* ≤ *c*_{1}/*c*_{2}, the error of the LSE-MCA rank-1 regression is greater than that of the rank-1 CCA regression because LSE-MCA selects the second element, which explains more covariance; however, the second element has lower correlation, and the resulting regression has higher rms error. For *σ* ≥ *c*_{1}/*c*_{2} = 1.6, the error of the rank-1 CCA regression is larger than that of the rank-1 LSE-MCA regression because the large value of *σ* dominates the rms error. In this simple two-dimensional example, the RDA rank-1 regression coincides with either the CCA or the LSE-MCA rank-1 regression, depending on which has smaller rms error. In general, RDA-based regression is distinct from, and has smaller rms error than, CCA- or MCA-based regressions of the same rank.
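The three regimes can be verified directly from the population covariances. The sketch below (our own illustration; `rank1_extra_error` is a hypothetical helper name) computes the error variance of each rank-1 choice in excess of the full regression:

```python
import numpy as np

def rank1_extra_error(sigma, c1=0.8, c2=0.5):
    """Error variance above the full regression for each rank-1 choice."""
    Cxx = np.diag([1.0, sigma**2])
    Cyx = np.diag([c1, c2 * sigma**2])    # c1 >= c2 are the canonical correlations
    A = Cyx @ np.linalg.inv(Cxx)          # full regression: diag(c1, c2)
    # Rank-1 CCA keeps the element with highest correlation (always the first);
    # rank-1 LSE-MCA keeps the element with highest covariance.
    keep_mca = 0 if Cyx[0, 0] >= Cyx[1, 1] else 1
    extra = {}
    for name, keep in [("CCA", 0), ("LSE-MCA", keep_mca)]:
        Ar = np.zeros((2, 2))
        Ar[keep, keep] = A[keep, keep]
        D = Ar - A
        extra[name] = np.trace(D @ Cxx @ D.T)
    return extra

for sigma in [1.0, 1.4, 2.0]:   # one value in each regime for c1=0.8, c2=0.5
    print(sigma, rank1_extra_error(sigma))
```

At *σ* = 1 the two rank-1 regressions coincide; at *σ* = 1.4 the LSE-MCA choice has larger excess error; at *σ* = 2 the CCA choice does, matching the three regimes of Fig. 2.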

## 7. Example: Statistical downscaling

General circulation models (GCMs) often have relatively coarse horizontal spatial resolution. Information about smaller scales can sometimes be extracted from the coarse-scale GCM output by forming a regression between GCM output and observations (Widmann et al. 2003). Such a regression can also be used to remove systematic model errors (Feddersen et al. 1999). We apply this procedure to the ensemble mean of a 24-member set of ECHAM4.5 (Roeckner et al. 1996) T42 ensemble simulations of March–May (1950–2000) precipitation over the northeast of Brazil (13°S–1°N, 55°–35°W). With T42 model resolution corresponding to a 2.8° × 2.8° spatial grid, the model domain contains 63 grid points. During this time of the year, precipitation over the northeast of Brazil is closely related to sea surface temperature (SST), and the GCM forced with observed SST skillfully reproduces some aspects of seasonal precipitation interannual variability. Observational data are taken from a gridded (0.5° × 0.5°) rainfall observation dataset (New et al. 2000). Leave-one-out cross validation is used to select the level of EOF prefiltering^{8} as well as the number of patterns retained in the regression; the truncations for each method are chosen to maximize the sum over grid points of those cross-validated correlations greater than 0.3. Results using rms error as a truncation metric are similar, though rms error tends to select lower-dimensional models, as has been noted generally (Browne 2000).
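The leave-one-out procedure used here can be sketched generically as follows (an illustration, not the authors' code; `loocv_corr`, `fit`, and `predict` are hypothetical names, and plain least squares stands in for the pattern regressions):

```python
import numpy as np

def loocv_corr(X, Y, fit, predict):
    """Leave-one-out cross-validated per-gridpoint correlations.
    X, Y are variables-by-samples arrays; fit/predict wrap any regression."""
    n = X.shape[1]
    Yhat = np.empty_like(Y)
    for k in range(n):
        keep = np.r_[0:k, k + 1:n]            # withhold sample k
        model = fit(X[:, keep], Y[:, keep])
        Yhat[:, k] = predict(model, X[:, k])  # predict the withheld sample
    Ya = Y - Y.mean(axis=1, keepdims=True)
    Ha = Yhat - Yhat.mean(axis=1, keepdims=True)
    return (Ya * Ha).sum(1) / np.sqrt((Ya**2).sum(1) * (Ha**2).sum(1))

# Demo with plain least squares standing in for the pattern regression.
rng = np.random.default_rng(5)
X = rng.standard_normal((3, 40))
Y = 0.7 * X[:2] + 0.5 * rng.standard_normal((2, 40))
fit = lambda Xt, Yt: Yt @ Xt.T @ np.linalg.inv(Xt @ Xt.T)
predict = lambda A, x: A @ x
r = loocv_corr(X, Y, fit, predict)            # one correlation per predictand
```

In the downscaling application, the truncation of each pattern method would be varied and the truncation maximizing a summary of these cross-validated correlations retained.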

The correlation map for a cross-validated univariate per-gridpoint regression between the gridded observations and the GCM output interpolated to the observation grid is shown in Fig. 3a. Although there is a large region with correlations greater than 0.5, the gridpoint regression is limited by its neglect of spatial correlation information. Figures 3b–g show the correlation maps of regressions based on PCR, CCA, RDA, MCA, LSE-MCA, and PPA patterns, respectively. All of the spatial pattern regression methods show overall improvement compared to the gridpoint regression and are fairly similar to each other. Their similarity may reflect that there seem to be only one or two meaningful modes in the regression, which are captured by all the methods. The CCA regression uses five predictor EOFs and two predictand EOFs to form a rank-2 regression; RDA and PPA use rank-1 regressions based on five and three predictor EOFs, respectively.

The best overall correlation skill is obtained with CCA. Although CCA is expected to perform better than PCR, since PCR is the special case of CCA with an untruncated predictand, there is no particular reason to expect CCA to outperform LSE-MCA or RDA in general. The differences in skill are mostly statistically insignificant. Both MCA and LSE-MCA use the same four modes that maximize covariance. Although we expect LSE-MCA to perform better than MCA, since MCA neglects correlations between predictors, the impact of sampling error on the performance of the methods is unknown; one could imagine the error from poorly estimated correlations among the predictors outweighing the benefit of accounting for them. In any case, in this example LSE-MCA does outperform MCA, which has the worst performance of the pattern regression methods. The regression with the smallest cross-validated rms error is RDA. The rank-1 CCA regression (not shown) has slightly lower overall correlation than the rank-2 CCA regression but lower cross-validated rms error, in fact lower than that of the RDA regression.

## 8. Summary and conclusions

Two commonly used linear methods for finding coupled patterns in two datasets are canonical correlation analysis (CCA) and maximum covariance analysis (MCA), which find projections of the data having maximum correlation and covariance, respectively. Such methods are useful for diagnosing relations between variables and constructing linear prediction models. Pattern methods like redundancy analysis (RDA) and principal predictor analysis (PPA) were developed specifically for use in linear prediction models and maximize explained variance and the sum of squared correlations, respectively. In this paper we show that these methods diagonalize the regression and are singular value decompositions (SVDs) of the matrix of regression coefficients for data transformed suitably for each respective method.

The essential character of the regression does not change when linear transformations are applied to data, as long as the transformation of the predictors is invertible. One consequence of the invariance of the regression is that regression-based prediction minimizes not only the sum of squared errors but any positive semidefinite quadratic function of the error. This fact implies that the regressions developed with each predictand individually will give the same predictions as the regression developed with all the predictands simultaneously. Consequently, principal component regression (PCR) in which regressions are developed between predictor PCs and individual predictands gives the same prediction model as does the regression developed between the set of predictands and the predictor PCs simultaneously, which in turn is the same as CCA with EOF prefiltering of the predictor and no other truncations.
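The predictand-by-predictand equivalence noted above is easy to verify numerically (a minimal sketch with synthetic data, not code from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
nx, ny, n = 3, 4, 100
X = rng.standard_normal((nx, n))   # predictors, variables-by-samples
Y = rng.standard_normal((ny, n))   # predictands

Cxx_inv = np.linalg.inv(X @ X.T)
A_joint = Y @ X.T @ Cxx_inv        # all predictands regressed at once

# One separate regression per predictand, stacked row by row.
A_rows = np.vstack([Y[[i]] @ X.T @ Cxx_inv for i in range(ny)])
assert np.allclose(A_joint, A_rows)
```

Each row of the least squares regression matrix depends only on its own predictand, which is why fitting predictands individually or simultaneously gives identical predictions.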

Although the regression is invariant under linear transformations of the data, the meaning of the regression coefficients changes depending on the transformation of the data. This connection between the interpretation of the regression coefficients and transformation of the data is readily apparent in the univariate case where differing normalizations of the data determine whether the regression coefficient measures correlation, explained variance, or covariance. Analogous transformations in the multivariate case lead to the regression matrix having coefficients that measure the same quantities. The whitening transformation in which the data are replaced by its normalized PCs is the multivariate generalization of normalizing a variable by its standard deviation.

The structure of the regression matrix is revealed by the SVD, which finds orthogonal bases so that the regression matrix is diagonal. Depending on the transformation applied to the data, the singular values measure correlation, explained variance, explained standardized variance, or covariance. The singular vectors identify the projections of the data that optimize these quantities and correspond to the methods CCA, RDA, PPA, and MCA, respectively. The SVD of a transformed regression can also be interpreted as the SVD of the untransformed regression with particular choices of norm for the predictor and predictand (Ehrendorfer and Tribbia 1997).

A common method for constructing a linear prediction model from MCA patterns does not produce a least squares estimate since correlations between MCA predictors are neglected. A variation, LSE-MCA, uses the same MCA patterns, which maximize covariance, but minimizes squared error. There are special cases in which MCA and LSE-MCA are the same, such as when the predictor and predictand dimensions are the same and the MCA patterns are also EOFs of the predictor. In general, as illustrated in a two-dimensional example, the MCA linear model has larger rms error than LSE-MCA. In practice, where sampling error plays a role, the MCA linear model may gain some benefit from neglecting poorly estimated correlations among the predictors. However, in statistical downscaling of GCM-simulated rainfall over the northeast of Brazil, the MCA model performed slightly worse than the other pattern methods.

Since the different coupled pattern methods correspond to decompositions of the same regression matrix, they all produce the same prediction model when a complete set of patterns is used. The choice of pattern method is important to the regression model when the SVD is truncated, that is, when an incomplete set of patterns is used. The regression model obtained by retaining only the first *r* pairs of patterns is the rank-*r* regression that maximizes mutual information, explained variance, explained standardized variance, and covariance for CCA, RDA, PPA, and LSE-MCA, respectively. We illustrate in a two-dimensional example that the RDA rank-1 regression is the rank-1 regression that minimizes rms error while the rank-1 regressions based on CCA or MCA patterns generally do not.

The difference between reduced-rank regressions based on the different methods depends on the difference between the subspaces spanned by the retained patterns of each method, not differences between individual patterns. For instance, although the first *r* RDA patterns (assuming *r* > 1) may be different from the first *r* CCA patterns, if they collectively span the same subspace, regressions based on them will be identical. This fact may help in understanding why all the methods produce linear models with comparable skill in the statistical downscaling example.

The derivation of the pattern methods in the regression framework makes it easy to compare the methods and is useful for computation. A practical benefit of this approach is that an algorithm or computational method developed for one method is easily adapted for the other methods by transforming the data. For instance, Table 2 shows that all the methods can be expressed as MCA applied to transformed data.

An important issue that has not been examined closely here is the role of sampling error. The finite number of samples causes sampling error to affect all the methods, such that the underlying covariances are imperfectly known. EOF prefiltering is only one method for limiting the covariances to information that can be robustly estimated. Ridge methods are another approach to treat this problem (Vinod 1976; Hastie et al. 1995).

## Acknowledgments

We thank two anonymous reviewers for their useful comments and suggestions. We thank Benno Blumenthal for the IRI Data Library. This paper was funded in part by a grant/cooperative agreement from the National Oceanic and Atmospheric Administration (NOAA), contract NA07GP0213 with the Trustees of Columbia University. The views expressed herein are those of the authors and do not necessarily reflect the views of NOAA or any of its subagencies. The second author was supported by the National Science Foundation (ATM0332910), National Aeronautics and Space Administration (NNG04GG46G), and the National Oceanographic and Atmospheric Administration (NA04OAR4310034).

## REFERENCES


### APPENDIX

#### EOF Prefiltering and CCA Equations

##### EOF prefiltering

Let 𝗟 = 𝗭^{T}, where 𝗭 is a matrix whose columns contain some but not all of the orthogonal eigenvectors of the predictor covariance matrix 𝗫𝗫^{T}. Then 𝗫𝗫^{T}𝗭 = 𝗭**Λ**, where **Λ** is a diagonal matrix containing the corresponding eigenvalues. The regression matrix relating **y** and **x**′ = 𝗟**x** is

The projection 𝗣 that projects the predictor data onto the space spanned by the columns of 𝗭 is 𝗣 = 𝗭𝗭^{T}. Applying the original regression to the projected data is the same as the regression with the transformed data because
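This equivalence can be checked numerically; in the sketch below (our own illustration with synthetic data), the regression on the EOF-filtered predictors and the original full regression applied to the projected data give identical predictions:

```python
import numpy as np

rng = np.random.default_rng(3)
nx, ny, n, r = 5, 2, 200, 3
X = rng.standard_normal((nx, n))
Y = rng.standard_normal((ny, n))

w, V = np.linalg.eigh(X @ X.T)
Z = V[:, -r:]                       # leading r eigenvectors (EOFs) of XX^T
L = Z.T                             # transformation x' = L x

Xp = L @ X                          # EOF-filtered predictor PCs
A_t = Y @ Xp.T @ np.linalg.inv(Xp @ Xp.T)    # regression on transformed data

A_orig = Y @ X.T @ np.linalg.inv(X @ X.T)    # regression on the full predictors
P = Z @ Z.T                                  # projection onto the retained EOFs

# Identical predictions: A_orig applied to the projected data equals
# the transformed regression applied to the filtered PCs.
assert np.allclose(A_orig @ (P @ X), A_t @ Xp)
```

The equality follows because (𝗫𝗫^{T})^{−1}𝗭 = 𝗭**Λ**^{−1}, so post-multiplying the full regression by the projection reproduces the filtered regression.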

##### Alternative form for CCA

The usual CCA equations for the predictand weights are obtained as follows. First, from Eq. (17), 𝗔′𝗔′^{T} = 𝗨𝗦𝗦^{T}𝗨^{T}, which means that 𝗨 is the matrix of eigenvectors of 𝗔′𝗔′^{T}. The eigenvalues and eigenvectors of 𝗔′𝗔′^{T} are found by solving the eigenvalue problem 𝗔′𝗔′^{T}**u** = *s*^{2}**u**, or in terms of the weight **q**_{y} = (𝗬𝗬^{T})^{−1/2}**u**,

Then using the definition of 𝗔′ in Eq. (13),

The eigenvalue problem in Eq. (A3) is

which is Eq. (14.11) of von Storch and Zwiers (1999). The usual CCA equations for the predictor weights follow similarly from 𝗔′^{T}𝗔′ = 𝗩𝗦^{T}𝗦𝗩^{T}.
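As a numerical check (our own sketch, not code from the paper), the canonical correlations obtained as singular values of the whitened regression matrix 𝗔′ match the square roots of the eigenvalues of the classical CCA eigenproblem for the predictand weights:

```python
import numpy as np

rng = np.random.default_rng(4)
nx, ny, n = 4, 3, 500
X = rng.standard_normal((nx, n))
Y = rng.standard_normal((ny, n)) + 0.3 * rng.standard_normal((ny, nx)) @ X

def msqrt_inv(C):
    """Inverse symmetric square root of a positive definite matrix."""
    w, V = np.linalg.eigh(C)
    return V @ np.diag(w ** -0.5) @ V.T

# Canonical correlations as singular values of the whitened matrix A'.
Aprime = msqrt_inv(Y @ Y.T) @ (Y @ X.T) @ msqrt_inv(X @ X.T)
s = np.linalg.svd(Aprime, compute_uv=False)

# Same values from the classical eigenproblem for the predictand weights,
# (YY^T)^{-1} YX^T (XX^T)^{-1} XY^T q_y = s^2 q_y.
M = np.linalg.inv(Y @ Y.T) @ (Y @ X.T) @ np.linalg.inv(X @ X.T) @ (X @ Y.T)
evals = np.sort(np.linalg.eigvals(M).real)[::-1]
assert np.allclose(np.sort(s)[::-1] ** 2, evals)
```

The eigenvalues of the nonsymmetric matrix above are real because it is similar to the symmetric positive semidefinite matrix 𝗔′𝗔′^{T}.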

## Footnotes

*Corresponding author address:* M. K. Tippett, International Research Institute for Climate and Society, The Earth Institute of Columbia University, Lamont Campus/61 Route 9W, Palisades, NY 10964. Email: tippett@iri.columbia.edu

^{1}

We use the convention that the data in 𝗫 and 𝗬 are normalized by *n* − 1, where *n* is the number of samples. This convention simplifies the notation by making 〈**xx**^{T}〉 = 𝗫𝗫^{T} and 〈**yx**^{T}〉 = 𝗬𝗫^{T}. The matrix of regression coefficients is independent of *n*.

^{2}

There are in-sample estimates of the out-of-sample error such as Akaike’s information criterion (AIC) and the Bayesian information criterion (BIC) that take into account the number of predictors (Akaike 1973; Schwarz 1978).

^{3}

A matrix square root of the positive definite matrix 𝗣 is 𝗭 if 𝗭𝗭^{T} = 𝗣.

^{4}

The dependence of the SVD on choice of norm is well known in ensemble forecasting where the SVD is sometimes used to generate initial perturbations (Ehrendorfer and Tribbia 1997).

^{5}

The SVD of the cross-covariance matrix also arises in the solution of the orthogonal Procrustes problem (Gower and Dijksterhuis 2004).

^{6}

Forming 𝗬𝗫^{T} is impractical and unnecessary when the predictor and predictand dimensions are large compared to the number of samples. Instead, MCA can be applied to the covariance of the unnormalized predictor and predictand PCs since the SVD is invariant under orthogonal transformations. The dimensions in the SVD calculation are thus determined by the number of PCs rather than the predictor and predictand dimensions.

^{7}

This is a consequence of the facts that (i) mutual information of normally distributed variables is an increasing function of correlation alone and (ii) the mutual information of a sum of independent variables is the sum of their mutual information.

^{8}

Predictor and predictand are prefiltered in CCA. Only the predictor is prefiltered in RDA and PPA. No prefiltering is used with MCA or LSE-MCA.