## Abstract

The Earth’s ice–ocean–atmosphere system exhibits nonlinear responses, such as the difference in the magnitude of the atmospheric response to positive or negative ocean or sea ice anomalies. Two classes of methods that have previously been used to investigate the nonlinear dependence between climate fields are kernel methods and neural network methods. In this paper, a third methodology is introduced: gradient-based kernel dimension reduction. Gradient-based kernel methods are an extension of conventional kernel methods, but gradient-based methods focus on the directional derivatives of the regression surface between two fields. Specifically, a new gradient-based method is developed here: gradient kernel canonical correlation analysis (gKCCA). In gKCCA, the canonical directions maximize the directional derivatives between the predictor field and the response field, while the canonical components of the response field maximize the correlation with a nonlinear augmentation of the predictor canonical components. Gradient-based kernel methods have several advantages: their components can be directly related to the original fields (unlike in conventional kernel methods), and the projection vectors are determined by analytical solution (unlike in neural networks). Here gKCCA is applied to the question of nonlinear coupling between high-frequency (2–3 years) and low-frequency (4–6 years) modes in the Pacific Ocean. The leading gKCCA subspace shows a significant nonlinear coupling between the low-pass and high-pass fields. The paper also shows that the results of gKCCA are robust to different levels of noise and different kernel functions.

## 1. Introduction

The identification of coupled patterns is important to many areas of climate research, including teleconnection and attribution studies (Smoliak et al. 2015), seasonal climate forecasting (Wilks 2008), climate field reconstruction using proxy records (Smerdon et al. 2011), and linear inverse modeling (DelSole and Tippett 2008). While early research focused on the linear dependence between two climate fields (Bretherton et al. 1992; Tippett et al. 2008), other studies have also recognized the importance of nonlinear dependence (Ramírez et al. 2006; Cannon and Hsieh 2008; Ortiz Beviá et al. 2010; Evans et al. 2014).

Two main methods have previously been used to investigate the nonlinear coupling between two fields: neural net canonical correlation analysis (NNCCA) (Hsieh 2001; Wu et al. 2005; Cannon and Hsieh 2008) and kernel canonical correlation analysis (KCCA) (Arenas-García et al. 2013). In NNCCA, parametric mapping functions are used to link one-dimensional projections of the climate fields, while in KCCA, the climate fields are abstractly mapped into a higher-dimensional space (or feature space) and linear canonical correlation analysis (CCA) is performed in the feature space. KCCA has not been widely used in the climate sciences (perhaps for the reason below), but Lima et al. (2009) used the related method of kernel principal component analysis (KPCA). Both NNCCA and KCCA have issues. For KCCA (and KPCA), the time components are not linear combinations of the predictor field or the response field, which can make the KCCA (KPCA) components difficult to interpret. For NNCCA, the projection weights are found by optimization methods, and NNCCA typically relies on iterative decomposition (i.e., the components are extracted one at a time), so NNCCA is computationally expensive. Also, NNCCA is not appropriate for high-dimensional data and hence is typically applied to a low-dimensional principal component subspace of the original data fields. In both methods, the kernel (or mapping) functions are generally arbitrarily chosen.

The main aims of this study are to develop a new method for investigating nonlinear dependence between two climate fields and to illustrate the method by examples. The new method, gradient-based kernel canonical correlation analysis (gKCCA), builds on the ideas of Fukumizu and Leng (2014), who introduced gradient-based kernel regression with a univariate response variable. The paper describes a gradient-based kernel method for coupled fields (section 2), along with its connections to CCA (sections 2a, 2b, and 2d) and to the linear model (section 2c). Cross validation for nonlinear models is discussed in section 3. Studies of the tropical Pacific Ocean have suggested that the system oscillates at preferred frequencies, such as annual, quasi-biennial (QB), and quasi-quadrennial (QQ) frequencies (Bejarano and Jin 2008), but the nature, and the spatial structure, of the coupling between these modes remain unclear. In section 4, the gKCCA method is used to investigate nonlinear coupling between modes with different frequencies in the Pacific Ocean. Finally, the sensitivity of the gKCCA method to different levels of noise and different kernel functions is investigated (section 5).

## 2. Methods

### a. Overview

Given two different space–time anomaly fields, it is often useful to ask whether a small number of coupled patterns link the two fields. Canonical correlation analysis is a popular method for investigating linear dependence between two fields. CCA is based on the following model:

where and are a pair of canonical spatial patterns, and are a pair of canonical time series, the single subscript *j* in , for example, denotes the *j*th column vector of matrix , except in the case of , which is an identity matrix of rank *d*. The prime symbol denotes matrix transposition, and and are the predicted (reconstructed) fields. The term is an anomaly field, is a matrix of projection vectors, is an anomaly field, is a matrix of projection vectors, and is a matrix, where (see Table 1 for notation hereinafter). The canonical correlations *ρ* are the diagonal elements of the matrix .

CCA is a useful method for two reasons. The first is that the canonical time components are marginal subspaces, or one-dimensional projections of the fields. The advantage of this is that CCA models a multidimensional regression surface by the simple addition of some marginal subspaces; for example,

where the joint subscript , for example, denotes the element at row *k* and column *d* in matrix . Each component forms one edge of a hyperplane, and the hyperplane is modeled as the sum of linear functions that make up its edges (hence the term marginal functions or marginal subspaces). Modeling a high-dimensional function is difficult because of the “curse of dimensionality”; that is, points are always sparse in high dimensions (e.g., Scott 2015), but it is less difficult to model a high-dimensional function by using marginal functions because marginal functions are low-dimensional projections of the high-dimensional function. The second reason that CCA is useful is that CCA seeks component pairs that have a high correlation; that is, CCA seeks to find projection matrices and that maximize a particular criterion:

where is the criterion to be maximized. This focus on correlation means that CCA can find coupled patterns that do not explain a large proportion of the variability in the predictor or response fields, and hence CCA can find signals in noisy fields.
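To make the linear baseline concrete, the following minimal sketch (not part of the original study) computes CCA via a singular value decomposition of the whitened cross-covariance matrix; the variable names and the simulated fields are illustrative assumptions:

```python
import numpy as np

def cca(X, Y, d=2):
    """Minimal linear CCA via SVD of the whitened cross-covariance.

    X : (n, p) and Y : (n, q) centered anomaly fields.
    Returns projection matrices A (p, d) and B (q, d) and the
    leading canonical correlations rho (d,).
    """
    n = X.shape[0]

    def inv_sqrt(C):
        # Inverse principal square root (whitening transform).
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    Cxx, Cyy, Cxy = X.T @ X / n, Y.T @ Y / n, X.T @ Y / n
    Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)
    # SVD of the whitened cross-covariance gives the canonical structure.
    U, s, Vt = np.linalg.svd(Wx @ Cxy @ Wy)
    A = Wx @ U[:, :d]        # canonical weights for X
    B = Wy @ Vt.T[:, :d]     # canonical weights for Y
    return A, B, s[:d]

# Example: two noisy fields sharing one linear signal.
rng = np.random.default_rng(0)
t = rng.standard_normal(500)
X = np.outer(t, rng.standard_normal(5)) + 0.5 * rng.standard_normal((500, 5))
Y = np.outer(t, rng.standard_normal(4)) + 0.5 * rng.standard_normal((500, 4))
X -= X.mean(0); Y -= Y.mean(0)
A, B, rho = cca(X, Y, d=2)
```

The singular values are the sample canonical correlations, and the columns of the two projection matrices give the paired canonical weights.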

The main disadvantage of CCA is its focus on linear correlation, which means it may miss structures that have a nonlinear dependence. Thus it would be desirable to have a method in which the advantages of CCA are adapted to the nonlinear setting. Such a method might aim to model a nonlinear regression surface as follows:

subject to and . It also might aim to maximize the following criterion of interest:

Equation (4) is a nonlinear extension of Eq. (1a): the functions form the edges of a manifold (a curved sheet). The main advantage of this particular nonlinear extension, as with linear CCA, is to model the sheet as the sum of functions that describe the edges of the sheet:

This model covers a subset of all possible nonlinear functions, but further extensions are possible, for example, by including interaction terms such as (which will be reported elsewhere). Hereinafter, it is assumed that is a locally differentiable function (i.e., gradient vectors can be estimated at points) and that one point in -space maps onto one point in -space.

The gKCCA method has two steps:

Find , the subspace of that is linearly or nonlinearly related to .

Model the functions and find , by performing CCA on and a nonlinear augmentation of .

Step 1 and step 2 are explained in sections 2b and 2d, respectively.

### b. The gradient field and its eigenvectors (or estimating the projection directions)

This section explains how to do step 1 and illustrates it using a simulated example. Some extra details of the method are found in appendix A and supplementary sections 1–3, and cross-references are given as necessary.

In the example here, a nonlinear sheet is simulated with a flat edge and a curvy edge , and an error is added to the sheet:

where is a vector, , where is a multivariate uniform random distribution spanning the given range, and is a row vector of , . So in this example, is a matrix, is a matrix, and is an vector. The terms and were also centered and scaled to unit variance. Thus and form a curved sheet in a three-dimensional space (Fig. 1a). Of course, it is possible to use linear CCA to try to model this surface as a linear sheet (a plane). But if the fitted CCA model was used to predict , then it would be apparent that the linear sheet model fails to properly predict .

So, given only and , the aim here is to find the directional vector or vectors that capture the nonlinearity in the curved sheet, as a set of one-dimensional projections. In the example in Eq. (7), all the nonlinearity can be captured in one direction . The first question to be answered is how to find those directions.

One solution is to find the directions that maximize the gradient of given . Supplementary section 1 contains further information about gradients and directional derivatives, but a less technical narrative follows. Assuming that gradients can be calculated at various points of a function, there should be directions that maximize the mean (absolute or squared) gradient of the projected points along that direction. For any linear or nonlinear regression surface, the directions in which has the greatest influence on must be associated with the steepest absolute gradients, while the directions in which has the least influence on must be associated with the least absolute gradients. The advantage of working with gradients is that the solution can be cast as an eigenvalue problem with an analytical solution (supplementary section 1). No neural networks are necessary, and the analytical solution is computationally efficient.

To find the directions in which the average absolute gradient is the steepest, an estimate of the gradient field is required. The gradient field is made up of the gradient vectors of the regression surface at each point. Estimating the gradient field seems to require an estimate of the regression function itself, which is not known a priori. However, there is an elegant way to estimate the gradient field, without explicitly calculating the regression function, by making use of kernel feature spaces.

A feature space is a higher-dimensional space into which points are mapped, with mapping functions . Supplementary section 2 provides an example of a feature space. In the feature space, the number of points *n* is the same as in the input space, but the number of variables *p* can be much greater than in the input space. For some feature spaces, the inner product of the vector of mapping functions is a kernel function; that is, (see appendix A). If this is the case, then instead of writing , we can write , where , and and are matrices of linear coefficients (see supplementary section 3). That is, by using a kernel function, a nonlinear regression model can be calculated without explicitly calculating . Similarly, the gradient vectors of the regression surface can be estimated by using the gradient vectors of the kernel function :

where is a gradient matrix, is a matrix, and is a matrix of linear coefficients (as in supplementary section 3).
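The gradient-estimation step can be sketched as follows. This illustrative fragment (a sketch, not the authors' code) fits a Gaussian-kernel ridge regression and evaluates the gradient of the fitted surface at each sample point; the ridge form of the regularization and all parameter values are assumptions:

```python
import numpy as np

def gaussian_kernel(X, sigma):
    """Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma**2))

def kernel_gradient_field(X, y, sigma, eps=1e-3):
    """Gradient vectors of a kernel ridge regression surface.

    f(x) = sum_j alpha_j k(x, x_j), so
    grad f(x_i) = sum_j alpha_j (x_j - x_i) / sigma^2 * k(x_i, x_j).
    The ridge form (K + n*eps*I) is an assumption of this sketch.
    """
    n = X.shape[0]
    K = gaussian_kernel(X, sigma)
    alpha = np.linalg.solve(K + n * eps * np.eye(n), y)
    G = np.empty_like(X)
    for i in range(n):
        w = alpha * K[i]                         # alpha_j * k(x_i, x_j)
        G[i] = (w[:, None] * (X - X[i])).sum(0) / sigma**2
    return G

# Sanity check: for a linear surface y = x_1, the estimated
# gradients should point along the first coordinate axis.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = X[:, 0]
G = kernel_gradient_field(X, y, sigma=0.5)   # sigma fixed for illustration
```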

An example of a kernel function that will be used in this paper is the well-known Gaussian kernel:

$$k(\mathbf{x}_i, \mathbf{x}_j) = \exp\!\left(-\frac{\lVert \mathbf{x}_i - \mathbf{x}_j \rVert^2}{2\sigma^2}\right).$$

Its gradient vector is

$$\frac{\partial k(\mathbf{x}_i, \mathbf{x}_j)}{\partial \mathbf{x}_i} = -\frac{\mathbf{x}_i - \mathbf{x}_j}{\sigma^2}\, k(\mathbf{x}_i, \mathbf{x}_j).$$
When the Gaussian kernel is used in Eq. (8), the gradient field at a particular point is modeled as a weighted sum of multivariate Gaussian kernels centered on the other points. The smoothness of the gradient field thus depends on the parameter *σ* [Eqs. (9) and (10)]. It is convenient to reexpress this parameter in terms of the median Euclidean distance between all the points in the predictor field:

$$\sigma = c \,\operatorname{median}\{\lVert \mathbf{x}_i - \mathbf{x}_j \rVert : 1 \le i < j \le n\},$$

where $c > 0$ is a scaling constant.

The gradient field for the example [Eq. (7)], calculated using Eq. (8) for two different values of *σ*, is shown in Fig. 1b. For , in the regions of steepest gradient, the gradient vectors are generally aligned along the same axial direction. For , the gradient vectors are more scattered, because the Gaussian kernel is tighter and thus is more susceptible to random fluctuations in the field, caused by both and the random spacing of points. In the other extreme (i.e., for large values of ), the modeled sheet becomes a hyperplane; that is, the modeled surface is the same as the linear CCA model.

Having estimated the gradient field [Eq. (8)], the estimation of the projection directions is straightforward. The directions in which the gradient field is steepest are simply the leading eigenvectors of the average of the outer product of the gradient vectors at each point:

$$\mathbf{M} = \frac{1}{n}\sum_{i=1}^{n} \nabla \hat{f}(\mathbf{x}_i)\, \nabla \hat{f}(\mathbf{x}_i)'.$$

The projection directions are then the leading eigenvectors of $\mathbf{M}$ (if the predictor field was whitened prior to analysis, the directions must be transformed back accordingly). The diagonal elements of the eigenvalue matrix of $\mathbf{M}$ are equal to the mean-squared directional derivatives in the direction of each unit eigenvector.

Both linear CCA and gKCCA [Eqs. (8)–(12)] were used to estimate for the example [Eq. (7)], and the leading direction for each method is shown in Fig. 1c. Using , gKCCA accurately estimates the direction (Fig. 1b, left plot) and the nonlinear function (Fig. 1c, left) that forms an edge of the curved sheet (Fig. 1a). Linear CCA, though, finds the direction for which and have the strongest linear correlation (Fig. 1c, right), but in this case the linear correlation is simply a correlation between two groups of points (low and high points). Further, the gKCCA model is more capable than the CCA model of predicting from , because for CCA the predicted values will lie on a linear sheet, not a curved sheet.
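The two-stage estimate of the gradient-maximizing directions can be illustrated end to end. The following sketch (illustrative only; the true direction, the tanh link, and all parameter values are invented for the demonstration) simulates a single-index surface, estimates the gradient field with a Gaussian kernel, forms the averaged outer product of the gradients, and takes its leading eigenvector:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
X = rng.uniform(-2, 2, size=(n, 2))
a_true = np.array([3.0, 4.0]) / 5.0        # hypothetical true direction
y = np.tanh(X @ a_true) + 0.05 * rng.standard_normal(n)

# --- step 1: gradient field of the kernel regression surface ---
sq = np.sum(X**2, axis=1)
d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
# Median-distance heuristic for sigma (scaling constant c = 1).
sigma = np.median(np.sqrt(np.maximum(d2[np.triu_indices(n, 1)], 0.0)))
K = np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma**2))
alpha = np.linalg.solve(K + n * 1e-3 * np.eye(n), y)
G = np.array([((alpha * K[i])[:, None] * (X - X[i])).sum(0) / sigma**2
              for i in range(n)])

# --- step 2: eigenvectors of the averaged outer product M ---
M = G.T @ G / n
evals, evecs = np.linalg.eigh(M)
a_hat = evecs[:, -1]                       # leading eigenvector
a_hat *= np.sign(a_hat @ a_true)           # resolve the sign ambiguity
```

The leading eigenvector of the averaged outer-product matrix recovers the direction along which the simulated surface varies.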

### c. The linear model in kernel form

This subsection shows how the linear model and its gradient vector can be expressed in kernel form, in order to provide an analytical example of the methods above. The well-known linear function can be expressed as follows:

Its gradient vector (or gradient matrix, in the case that ) at all points can be expressed as follows:

where *λ* is a ridge parameter as in ridge regression (see, e.g., Cannon 2009). The effect of the ridge parameter is that higher values of the parameter will give more weight to the principal components of the predictor matrix that have large eigenvalues, leading to less overfitting of the response.

Now the linear model and its gradient vector will be reexpressed in kernel form, using the linear kernel. The linear kernel is

and its gradient vector is

where .

The term is a Gram matrix for where the elements of are in the linear case , and *ε* is a ridge parameter (which has the same effect as *λ* above). The gradient vector of the linear model [Eq. (14)] can now be reexpressed as follows:

where .

The advantage of Eqs. (17) and (18) is that these equations can be generalized to the nonlinear case (supplementary section 3), and the gradient vectors of many nonlinear can be calculated as a linear combination of the kernel gradient vectors:

Different kernel functions describe different nonlinear feature spaces (see examples in appendix A). For the Gaussian kernel function, the columns of [i.e., ] each contain a Gaussian function, centered on and spread over all the other points. Equation (19a) states the function is a linear combination of those Gaussians. Equation (19b) states that is a linear combination of the gradients of those Gaussians.
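The equivalence of the primal and kernel (dual) forms of the ridge-regularized linear model can be checked numerically. In this sketch (illustrative data; the exact ridge parameterization is an assumption), the fitted values from the primal ridge solution coincide with those from the linear-kernel dual solution:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 3
X = rng.standard_normal((n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(n)
lam = 0.1

# Primal ridge fit: beta = (X'X + lam I)^(-1) X'y.
beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
f_primal = X @ beta

# Dual (kernel) form with the linear kernel K = XX':
# f = K (K + lam I)^(-1) y, by the identity
# X (X'X + lam I)^(-1) X' = XX' (XX' + lam I)^(-1).
K = X @ X.T
f_dual = K @ np.linalg.solve(K + lam * np.eye(n), y)
```

The two fitted-value vectors agree to machine precision, which is the sense in which the linear model "in kernel form" is the same model.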

### d. Modeling the marginal functions and estimating the response directions

Section 2b showed how to estimate the unit directions that maximize the mean-squared directional derivative of the regression surface. The directions were obtained by modeling the surface using a kernel function [Eqs. (8) and (12)], but this provides only the directions, and not the marginal functions, which need to be modeled separately. In addition, in the example in section 2b, the response was a vector, and so it was not necessary to find a corresponding response subspace. This subsection shows how to both model the marginal functions and estimate the response directions in a single step, by combining CCA with basis functions.

For linear CCA, it is well known that the projection matrices $\mathbf{A}$ and $\mathbf{B}$ can be found via two eigenvalue problems:

$$\mathbf{\Sigma}_{xx}^{-1}\mathbf{\Sigma}_{xy}\mathbf{\Sigma}_{yy}^{-1}\mathbf{\Sigma}_{yx}\,\mathbf{A} = \mathbf{A}\mathbf{P}^2,$$

$$\mathbf{\Sigma}_{yy}^{-1}\mathbf{\Sigma}_{yx}\mathbf{\Sigma}_{xx}^{-1}\mathbf{\Sigma}_{xy}\,\mathbf{B} = \mathbf{B}\mathbf{P}^2,$$

where the $\mathbf{\Sigma}$ terms are the covariance and cross-covariance matrices of the two fields estimated using a specific model, and $\mathbf{P}$ is the diagonal matrix of canonical correlations. For linear CCA, for example, the cross-covariance is estimated using the linear model [Eq. (13)].

For the model in Eq. (4), exactly the same formulation as in Eq. (20b) can be used to estimate , by using a nonlinear expansion of to calculate . To do this, it is assumed that can be approximated using a set of basis functions :

where is the number of basis functions.

There are several possible basis functions that would be suitable here, including piecewise linear (appendix B), piecewise polynomial, and orthogonal polynomial basis functions. For any particular application, the number of basis functions should be chosen according to the degree of nonlinearity that the researcher is interested in. The example below shows that the basis functions need only broadly approximate the marginal functions in order to successfully estimate the response directions.

For gKCCA, the estimation of consists of expanding each by the basis functions to produce a basis matrix :

The term is then estimated as

and substituted into Eq. (20b). Then , where is the leading eigenvector of , for each basis matrix . This ensures that the columns of are paired with the columns of .

Here the example in section 2b is extended as follows:

where and are both matrices, denotes numbers sampled from a normal random distribution, and is an orthonormal matrix. The matrices and are the same as in the example in section 2b. Given only and the new , was estimated using the method in section 2b. To estimate , was calculated using piecewise linear basis functions (appendix B). Piecewise linear basis functions have at least two advantages: they are computationally simple, and the number of degrees of freedom (the number of segments) of the piecewise linear basis matrix can be used to control the linearity (or nonlinearity) of the solution.

Figure 2 shows plotted against for six different values of . Also plotted against is the piecewise linear model (PWLM), which “guides” the estimation of , for each . This piecewise linear model is in fact from the application of CCA to and . Figure 2 shows that if the PWLM is even a rough approximation of (e.g., for ), then the corresponding subspace in that is correlated with can be correctly estimated. For all , the PWLM also finds the expected . The difference between linear CCA and gKCCA is that linear CCA assumes that is linearly correlated with , but gKCCA allows to be correlated with . Supplementary section 4 discusses symmetry in CCA and gKCCA.
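The basis-expansion step can be sketched as follows. This hypothetical example (the hinge-function construction, knot placement, and simulated response pattern are all assumptions, not the paper's appendix B implementation) expands a known predictor component in a piecewise linear basis and recovers the response weight vector by CCA between the response field and the basis matrix:

```python
import numpy as np

def pwl_basis(u, n_knots=4):
    """Centered piecewise linear (hinge) basis matrix for a 1-D
    component u: columns [u, (u - t1)_+, ..., (u - tk)_+]."""
    knots = np.quantile(u, np.linspace(0, 1, n_knots + 2)[1:-1])
    B = np.column_stack([u] + [np.maximum(u - t, 0.0) for t in knots])
    return B - B.mean(0)

rng = np.random.default_rng(3)
n = 400
u = rng.uniform(-2, 2, n)                  # a known predictor component
b_true = np.array([0.6, 0.8])              # hypothetical response pattern
Y = np.outer(u**2, b_true) + 0.1 * rng.standard_normal((n, 2))
Y -= Y.mean(0)

# CCA between the response field Y and the basis expansion of u:
# the leading response weight vector pairs Y with a piecewise
# linear function of u.
Phi = pwl_basis(u)

def inv_sqrt(C):
    w, V = np.linalg.eigh(C)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

Cyy, Cpp, Cyp = Y.T @ Y / n, Phi.T @ Phi / n, Y.T @ Phi / n
Wy, Wp = inv_sqrt(Cyy), inv_sqrt(Cpp)
U, s, Vt = np.linalg.svd(Wy @ Cyp @ Wp)
b_hat = Wy @ U[:, 0]
b_hat /= np.linalg.norm(b_hat)
b_hat *= np.sign(b_hat @ b_true)           # resolve the sign ambiguity
```

Even though the basis is only piecewise linear, it approximates the quadratic marginal function well enough to recover the response direction.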

### e. Explained variance and prediction

Once and have been estimated, the proportion of the total variance of each field explained by each component can be calculated as follows (as in linear CCA):

Also, once and have been estimated it is easy to apply any nonlinear smoothing regression model to the marginal functions to build a predictive model for gKCCA.

## 3. Cross validation

### Cross-validation metrics for linear and nonlinear models

The gKCCA method has two main parameters that need to be estimated: σ [Eq. (10) in Eq. (19b)] and *ε* [Eq. (19b)], for a given .

Various implementations of linear CCA contain parameters that need to be estimated; for example, regularized CCA adds ridge parameters to the covariance matrices of the predictor and response fields (Lim et al. 2012). Such parameters are typically estimated using cross validation. The main step of cross validation is, for a particular CCA model with a given set of parameters, to construct the matrices and using a number of training and test sets:

where refer to the test sets and refer to the training sets, refers to a matrix of column means, and refers to the projection matrix calculated for the respective training set. The skill of a particular CCA model can then be assessed using a cross-validation metric, such as a measure of the linear association between the cross-validated subspaces (e.g., Wilks' lambda, the Hotelling–Lawley trace, or the Pillai–Bartlett trace). These metrics can be calculated on a selected number of dimensions of the cross-validated components, and this number of dimensions is referred to here as the metric dimension. Note that the metric dimension is required only for the calculation of the cross-validation metric and is separate from *d* [e.g., Eq. (4)], which is the number of model dimensions. That is, the metric dimension is a metric parameter and *d* is a model parameter, and the two need not be equal. For example, we could cross-validate a CCA model using only the first dimension but could later find that there is a meaningful association in higher dimensions, after calculating the full CCA model using the cross-validated parameters. Since the metric dimension is a metric parameter (and not a model parameter), it should be chosen in the same way as the choice of metric (e.g., Wilks' lambda, the Hotelling–Lawley trace, and the Pillai–Bartlett trace are metrics of linear association): the choice should be based on the user's intuition about what they are attempting to solve.

For gKCCA, measures of linear association between and are not sufficient because and may be nonlinearly dependent. Various measures of nonlinear association between subspaces have been suggested including entropy-based and distance-correlation-based statistics (e.g., Martínez-Gómez et al. 2014). One difficulty with these statistics is that they measure the nonlinear association between the total subspaces of and and thus are independent of the rotation of any particular subspace (e.g., ). The aim here is to use a metric that captures the nonlinear association between and , for each . This can be achieved by expanding each using a set of orthogonal basis functions, as in section 2d. The parameter is again the number of basis functions.

The actual nonlinear cross-validation metric is the adjusted multiple coefficient of determination (also known as the squared multiple correlation coefficient) from an augmented linear model (ALM):

where is a -length vector of regression coefficients, and is the sum of the squares of the subscripted vector (or matrix). For the ALM, the response variable [left-hand side of Eq. (27a)] is formed by vectorizing the columns of , while the matrices are basis function matrices created for the column of [Eq. (22)]. Since the response variable of this ALM is a vector, the adjusted multiple coefficient of determination is a scalar value [Eq. (27b)]. Note the adjusted multiple correlation coefficient is different from the simple multiple correlation coefficient: the adjusted coefficient takes into account the degrees of freedom of the ALM.
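A reduced form of this metric can be sketched in a few lines. The example below is illustrative only: it uses a simple linear-plus-quadratic basis rather than the paper's piecewise linear ALM, but it shows the key behavior, namely that for a quadratic association the adjusted coefficient of determination is near zero for a linear basis and high once the basis can represent the curvature:

```python
import numpy as np

def adjusted_r2(Phi, v):
    """Adjusted coefficient of determination of a least-squares fit
    of the (centered) response v on the basis matrix Phi (m columns)."""
    n, m = Phi.shape
    beta, *_ = np.linalg.lstsq(Phi, v, rcond=None)
    resid = v - Phi @ beta
    r2 = 1.0 - np.sum(resid**2) / np.sum((v - v.mean())**2)
    # Penalize for the m fitted degrees of freedom.
    return 1.0 - (1.0 - r2) * (n - 1) / (n - m - 1)

rng = np.random.default_rng(4)
u = rng.uniform(-1, 1, 200)
v = u**2 + 0.1 * rng.standard_normal(200)
v -= v.mean()

# Quadratic association: a linear basis scores near zero, while a
# basis with a squared term captures it.
Phi_lin = np.column_stack([u - u.mean()])
Phi_nl = np.column_stack([u - u.mean(), u**2 - np.mean(u**2)])
r2_lin = adjusted_r2(Phi_lin, v)
r2_nl = adjusted_r2(Phi_nl, v)
```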

## 4. Nonlinearly coupled oscillations in the Pacific Ocean

### a. Introduction

In this section, gKCCA is used to investigate nonlinear coupling in the Pacific Ocean.

It is likely that ENSO has multiple modes of variability, each with their own characteristic time scale, that can coexist (Barnett 1991). This aspect of ENSO has been investigated by several researchers using the Zebiak–Cane (ZC) model (Yang et al. 1997; Roulston and Neelin 2003; Bejarano and Jin 2008). For example, Bejarano and Jin (2008) ran the full ZC model over a wide range of values of a two-dimensional parameter space, the parameters being the upper-ocean layer thickness and the percentage wind stress. The ZC model exhibited three main regimes:

A quasi-quadrennial (QQ) regime, which dominated for deep upper-layer thickness (150 m) and high scaled wind stress

A quasi-biennial (QB) regime, which dominated for shallow upper-layer thickness (130 m) and high scaled wind stress

A mixed regime, where the QQ and QB modes coexisted, for average upper-layer thickness (140 m) and average scaled wind stress

Thus Bejarano and Jin (2008) demonstrated that ENSO may consist of two oscillatory modes, which can coexist, but their study did not examine the interaction between the modes.

Earlier work on the interaction between the QQ and QB modes was conducted by Barnett (1991), Yang et al. (1997), and Roulston and Neelin (2003). Using bicoherence to analyze global SST data (1950–87), Barnett (1991) showed that the QQ and QB modes were quadratically coupled, while the QB mode and the annual mode were not quadratically coupled. Yang et al. (1997) and Roulston and Neelin (2003) performed experiments with a linearized version and a lite nonlinear version of the ZC model, respectively. In the full ZC model, SST is a nonlinear function of thermocline depth, and atmospheric heating is a nonlinear function of surface-wind convergence, but in the Roulston and Neelin model only the former function is nonlinear (hence lite nonlinear version). Both Yang et al. (1997) and Roulston and Neelin (2003) performed their model experiments without a seasonal cycle. Yang et al. (1997) found that the most unstable mode was a quasi-biennial oscillation, and the linearized model produced no quasi-quadrennial oscillation, suggesting that the QQ mode might arise from a nonlinear interaction with the QB mode. In contrast, Roulston and Neelin (2003) found that the QB mode was stable and only became oscillatory as a result of nonlinear coupling to an unstable QQ mode. Also, the nonlinearity in their model did not produce a quasi-biennial spectral peak without the presence of a stable separate QB mode. That is, the QB spectral peak was not simply a subharmonic of the QQ mode.

In another study, Monahan and Dai (2004) considered the following statistical model, which is rewritten here in terms of the notation used in this manuscript:

where here is the tropical Pacific SST anomaly field, and is a matrix of spatial patterns. This model basically says that there are two coupled modes in the tropical Pacific, with spatial patterns and and time components and , which Monahan and Dai (2004) refer to as the antisymmetric component and the symmetric component, respectively. Monahan and Dai (2004) fitted this model by assuming that was the first EOF of , and then was determined using standard linear techniques. In this model, since the time component of the second mode peaks twice as fast as the first mode, in unison with the extreme positive and negative anomalies of the first mode, the two modes represented by spatial patterns and are nonlinearly coupled. Monahan and Dai (2004) found that this model explained some variability in the observed SST field, but the same coupling was generally absent in modeled SST fields from global climate models, and hence the mechanisms of mode coupling were not investigated.

### b. Data and analysis

The main data used in this study are from the Extended Reconstructed Sea Surface Temperature (ERSST) dataset (Huang et al. 2016; NOAA 2017), which spans the years 1854 to the present and has a spatial resolution of 2° × 2°. Monthly data were extracted for the region 125° to 295°E and −20° to 20°N, for the years 1945–2009. To minimize edge effects in the bandpass filtering (see below), the raw time series (1854–2016) were first filtered, and then the years 1945–2009 were obtained from the filtered fields. The aim here is to investigate the nature of the coupling between the QB and QQ components of the observational SST field, so the SST field was separated into a predictor field and a response field by using two Butterworth bandpass filters of order 3. The predictor field was produced by using a 4–6-yr bandpass filter, and the response field was produced by using a 2–3-yr bandpass filter. Henceforth, the predictor field and response field are also referred to as the low-pass field and the high-pass field, respectively. This is a good example for illustrating the gKCCA method because we would not expect any linear correlation between two fields containing nonoverlapping frequency bands. Three preprocessing steps were applied prior to analysis: the predictor and response fields were centered and scaled to unit variance, weighted by the square root of cos(latitude), and whitened using singular value decomposition (Chung and Nigam 1999). In the latter step, because *p* and *q* were large, only the components that collectively explained up to 85% of the total field variance were kept. The gKCCA method was also applied to predictor and response fields that retained 80% and 90% of the total field variance, but in both cases the cross-validation metric was less than when 85% of the total field variance was retained (see section 4c).
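The bandpass separation can be sketched with standard tools. The fragment below is illustrative: the filter order of 3 follows the text, but the zero-phase (forward-backward) filtering choice and the synthetic monthly series are assumptions of this sketch.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def bandpass(ts, low_yr, high_yr, order=3):
    """Butterworth bandpass of a monthly series, passing periods
    between low_yr and high_yr years (zero-phase filtering)."""
    # Convert periods (years) to frequencies (cycles per month).
    f_lo = 1.0 / (high_yr * 12.0)   # slowest passed frequency
    f_hi = 1.0 / (low_yr * 12.0)    # fastest passed frequency
    sos = butter(order, [f_lo, f_hi], btype="bandpass", fs=1.0,
                 output="sos")
    return sosfiltfilt(sos, ts)

# Synthetic monthly series with 5-yr and 2.5-yr oscillations.
t = np.arange(12 * 160)             # 160 years of monthly samples
x = np.sin(2 * np.pi * t / 60) + 0.5 * np.sin(2 * np.pi * t / 30)
low = bandpass(x, 4, 6)    # keeps the 5-yr (60-month) component
high = bandpass(x, 2, 3)   # keeps the 2.5-yr (30-month) component
```

Filtering the full record first and then extracting the analysis years, as in the text, avoids contaminating the analysis period with the filter's edge transients.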

The coupling between the and fields was investigated using both regularized linear CCA (regCCA) and gKCCA. For regCCA, the covariance matrix of each field was regularized, and the regCCA model was estimated using singular value decomposition:

where and are the projection matrices for the whitened and fields. Thus the regCCA model has two parameters to be estimated: and . Similarly, the gKCCA model has two parameters to be estimated: *ε* and σ.

For both regCCA and gKCCA the model parameters were estimated using gridded parameter values and leave-half-out cross-validation, with the cross-validation metric . For regCCA, a grid of six values for both and were used: . These values were also used for *ε* in gKCCA. For σ in gKCCA, 10 different values were used: , where is the median of pairwise Euclidean distances of the data [Eq. (11)]. Also for gKCCA, piecewise linear basis functions were used in Eq. (23) and Eq. (27).

In addition, the two other parameters and were set based on the problem at hand here. For regCCA and gKCCA, the parameter [in Eq. (27)] was set at and respectively, which means that the cross-validation metric provides information about linear and piecewise linear association. For gKCCA, the rationale for setting was that it is potentially quadratic (or V shaped) interactions that are of interest here. For both regCCA and gKCCA, the metric parameter was set at so that is a measure of association between and . The rationale for setting was that a preliminary analysis suggested that there were no statistically significant subspaces beyond the first gKCCA subspace. Considering and together, for regCCA, is a measure of linear correlation between and (because and ). For gKCCA, is a measure of nonlinear (or piecewise linear) correlation between and (because and ).

Further for gKCCA, after cross-validating the parameters *σ* and *ε*, a multivariate phase randomization test (Schreiber and Schmitz 2000) was performed to test if the coupling between the components and was statistically significant. In the phase randomization test, null hypothesis surrogates were generated by randomizing the Fourier phases of the predictor and response fields. This means that the time series in the randomized fields have the same Fourier spectra as the time series in the predictor and response fields, but with different phases. A thousand surrogate pairs of fields were generated, whitened, and gKCCA was applied using the cross-validated estimates of the parameters σ and *ε*. The nonlinear coupling between fields was tested using a time-domain version of the bicoherence statistic *b*:

$$b = \operatorname{corr}(\mathbf{u} \circ \mathbf{u},\; \mathbf{v}),$$

where $\mathbf{u}$ and $\mathbf{v}$ are the paired predictor and response components and $\circ$ denotes the Hadamard product. Here the bicoherence is a linear correlation between the squared predictor component and the response component, or a measure of the quadratic phase coupling between the dominant frequencies in the two components. The multivariate phase randomization test is essentially asking how likely it is that the nonlinear coupling between the observed fields could be the result of nonlinear alignment between random fields that oscillate at the same frequencies as the observed fields. This test is more general than the test performed by Barnett (1991), which modeled the leading field components as a second-order autoregressive process. The multivariate phase randomization test contains no assumption about the order of the autoregressive process.
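The core of the surrogate generation can be sketched for a single time series. This fragment is illustrative: the paper's test randomizes multivariate fields, and the treatment of the DC and Nyquist bins here is one common convention rather than a detail taken from the study.

```python
import numpy as np

def phase_randomize(x, rng):
    """Surrogate series with the same amplitude spectrum as x but
    random Fourier phases (Schreiber-and-Schmitz-style)."""
    n = len(x)
    F = np.fft.rfft(x)
    phases = rng.uniform(0.0, 2.0 * np.pi, len(F))
    F_new = np.abs(F) * np.exp(1j * phases)
    F_new[0] = F[0]            # keep the mean (DC bin) unchanged
    if n % 2 == 0:
        F_new[-1] = F[-1]      # keep the Nyquist bin real
    return np.fft.irfft(F_new, n)

rng = np.random.default_rng(5)
t = np.arange(1024)
x = np.sin(2 * np.pi * t / 60) + 0.3 * rng.standard_normal(1024)
x_surr = phase_randomize(x, rng)

# The surrogate preserves the power spectrum but scrambles any
# phase coupling between frequencies.
spec_x = np.abs(np.fft.rfft(x))
spec_s = np.abs(np.fft.rfft(x_surr))
```

Repeating this for each column of the whitened predictor and response fields yields one surrogate pair for the null distribution of the bicoherence statistic.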

### c. Results

For regCCA, the cross-validated parameter estimates were and . For gKCCA, the cross-validated parameter estimates were and . The term was higher for gKCCA (0.23) than for regCCA (~0.00), suggesting that the coupling between and is nonlinear rather than linear. Note that for gKCCA, was 0.21, 0.23, and 0.14 when retaining 80%, 85%, and 90% of the total field variance, respectively. The corresponding dimensionality of was 3, 3, and 4 components, while the dimensionality of was 4, 6, and 8 components. Thus, in this example, the main decrease in is associated with the increasing dimensionality of and the consequent overfitting of the model in the leave-half-out cross-validation. The generally low values suggest that the nonlinear coupling between the high-pass and low-pass fields is not particularly strong.

Figure 3 shows the correlation between the paired components and , and the associated spatial patterns, for the leading two regCCA subspaces. As expected, the linear correlations here are not significant (Fig. 3, left column), so there is no interaction between the spatial patterns.

Table 2 contains the results of the multivariate phase randomization test applied to the leading three gKCCA subspaces. The bicoherence of the first component pair is significantly different from the null hypothesis bicoherence distribution, while the bicoherences of the second and third subspace are not significantly different. This demonstrates that the quadratic phase coupling between the and fields identified in the leading gKCCA subspace is not simply the result of random alignment between two fields that oscillate at the same frequency as and .

Figure 4 shows the correlation between the paired components and and the spatial patterns for the leading two gKCCA subspaces. For , the correlation between and shows a general quadratic relationship (Fig. 4, top left). Components and explain 47% and 30% of the variance in the original low-pass and high-pass SST fields, respectively (Fig. 4, top row). The positive phase of shows negative anomalies in the equatorial east and central Pacific, and horseshoe-shaped positive anomalies in the west Pacific. The positive phase of shows positive anomalies in the equatorial central Pacific. Figure 5 (top row) shows that and oscillate with periods of 5 and ~2.5 years, respectively. The quadratic relationship between and means that positive phases of are associated with both positive and negative phases of , while negative phases of are associated with neutral phases of (Fig. 5, top row). Here the high-frequency mode has a stronger pattern in the central Pacific than in the east Pacific (Fig. 4, top right), which contrasts with the high-frequency (symmetric) mode of Monahan and Dai (2004). In Barnett (1991), the high-frequency (QB) mode exhibited a strong pattern in both the central and east Pacific. In these other studies, the low-frequency and high-frequency patterns (which are nonlinearly related) were estimated in very different ways. Here these patterns have been extracted by maximizing a general measure of association (the squared directional derivative) between two fields. The mechanics of this nonlinear coupling will be investigated in a future study using global climate models.

The second pair of gKCCA components and explain 34% and 7% of the variance in the original low-pass and high-pass SST fields, respectively (Fig. 4, bottom row). The gKCCA method has attempted to find a subspace in which the components and are nonlinearly related, but this subspace is not statistically significant (Table 2).

## 5. Discussion

In this section, the gKCCA method is further investigated with respect to the effect of noise on the estimation of the parameters σ and *ε* and the application of other kernels. These aspects are explored using a third simulated example.

In this example, a quadratically coupled pair of fields were simulated as follows:

where (chosen randomly for and ), *t* is the time in decimal years (with monthly resolution), and is the *z*-score operator. Both the and fields were simulated with dimensions (i.e., ). In the two experiments that follow, the gKCCA method was used to estimate the leading subspace, for three different values of , which corresponds to a signal-to-noise ratio for simulated . The gKCCA parameters were estimated using leave-half-out cross validation applied to gridded parameter sets (see below for details).
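Since the simulation equation above depends on symbols defined there, the following Python sketch illustrates the general setup with a stand-in pair of quadratically coupled fields; the sinusoidal signal, grid size, and SNR value are illustrative choices, not the paper's exact settings:

```python
import numpy as np

rng = np.random.default_rng(1)
nt, nx = 240, 20                       # months x grid points (illustrative sizes)
t = np.arange(nt) / 12.0               # time in decimal years, monthly resolution

def zscore(a):
    """Column-wise z-score operator."""
    return (a - a.mean(0)) / a.std(0)

u = np.sin(2 * np.pi * t / 5.0)        # hypothetical low-frequency signal
v = u ** 2 - (u ** 2).mean()           # quadratically coupled response signal

snr = 2.0                              # signal-to-noise ratio (varied in the text)
wx = rng.standard_normal(nx)           # random spatial loadings
wy = rng.standard_normal(nx)
X = zscore(np.outer(u, wx) + rng.standard_normal((nt, nx)) / snr)
Y = zscore(np.outer(v, wy) + rng.standard_normal((nt, nx)) / snr)

# the coupling is quadratic, so the linear correlation between the underlying
# signals is near zero even though they are deterministically linked
print(float(np.corrcoef(u, v)[0, 1]))
```

A linear method such as regCCA would find essentially nothing in such a pair of fields, which is what makes this a useful test case for gKCCA.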

In the first experiment, the gKCCA method was applied using a Gaussian kernel. The parameter , which is calculated for the predictor field , was the same for all values of SNR. The parameters and *ε* were estimated using gridded parameter values (the same as in section 4b). The leading subspace, for the three different values of SNR, is shown in Fig. 6 (top row): the quadratic relationship between and is visible in all simulations. The parameter *ε* was the same for all three values of SNR. The parameter did not vary systematically for the different values of SNR, suggesting that the different values of *σ* were simply due to the randomness in the simulations. This example suggests that, for the Gaussian kernel, the parameters *σ* and *ε* are insensitive to the level of noise in .

In the second experiment, the gKCCA method was applied using a polynomial kernel. Unlike the Gaussian kernel, the feature space of the polynomial kernel is finite dimensional (appendix A). The polynomial kernel and its derivative are

where *k* is the power of the polynomial. Thus for this experiment there are also two parameters to be estimated: *ε* [Eq. (19b)] and *k*. The parameter *ε* was estimated using the same gridded parameter set as above. The parameter *k* was estimated using the set . Given that the above simulation contains quadratic coupling, a value of is expected. The leading subspace, for the three values of SNR, is shown in Fig. 6 (bottom row). Again, the quadratic relationship between and is visible in all simulations. The cross-validated parameter *k* was the same for all three values of SNR. The parameter *ε* took values of and for values of and , respectively. Each eigenfunction of the polynomial kernel matrix will contain all the monomials of a particular polynomial (this stems from the polynomial feature space; appendix A). This means that quadratic features will be distributed through all the eigenfunctions, and hence heavy regularization (a high value of *ε*) is not expected when using the polynomial kernel. In this example, using or with the polynomial kernel produces little to no difference in the estimated leading subspace.
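The analytical derivative of the polynomial kernel can be checked numerically. The sketch below assumes the common form k(x, y) = (1 + x·y)^k (the constant term is an assumption, as the exact form appears in the equation above) and verifies the gradient against central finite differences:

```python
import numpy as np

def poly_kernel(x, y, k=2):
    """Polynomial kernel (1 + x.y)^k; the unit constant term is an assumption."""
    return (1.0 + x @ y) ** k

def poly_kernel_grad_x(x, y, k=2):
    """Gradient of the kernel with respect to x: k (1 + x.y)^(k-1) y."""
    return k * (1.0 + x @ y) ** (k - 1) * y

rng = np.random.default_rng(2)
x, y = rng.standard_normal(4), rng.standard_normal(4)

# central finite-difference check of the analytical gradient
eps = 1e-6
fd = np.array([(poly_kernel(x + eps * e, y) - poly_kernel(x - eps * e, y)) / (2 * eps)
               for e in np.eye(4)])
print(np.allclose(fd, poly_kernel_grad_x(x, y), atol=1e-5))  # True
```

It is these kernel derivatives, evaluated on the data, that supply the directional-derivative information on which the gradient-based method rests.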

The investigation of nonlinearities in climate science has been hindered by the lack of methods that are relatively easy to use and computationally efficient. There are many areas in climate science where gKCCA could be applied, and two further examples are given here. The first is in the area of climate reconstructions using climate proxies, such as tree rings. Tree rings carry a climate signal that is both nonlinear and multivariate (Evans et al. 2014). It would be interesting to test how well gKCCA reconstructs temperature and rainfall variation in the context of pseudoproxy experiments with nonlinear proxy models, such as the VS-Lite (Vaganov-Shashkin-Lite) tree-ring model (Tolwinski-Ward et al. 2011). VS-Lite is an example of a proxy model that defines a nonlinear function between climate variables (temperature, precipitation) and the climate proxy (tree rings). The gKCCA method should be able to find such a nonlinear function. The second example is in the area of predictability. In the Lorenz model, for example, the state variables at one point in time are nonlinearly dependent on the state at past times. This nonlinear predictability has been investigated using a neural network (Cannon 2006), and it would be interesting to see how gKCCA performs.

The gKCCA method can also be developed further. For example, it is possible that there could be nonlinear coupling between fields with propagating signals. Complex CCA (i.e., CCA with complex vectors) can be used to investigate linear coupling between such fields (Schreier 2008), so complex gKCCA needs to be developed in order to investigate nonlinear coupling between fields with propagating signals.

## 6. Conclusions

Gradient-based kernel dimension reduction relies on two powerful mathematical ideas: (i) the gradient vector in an input space is a linear combination of the gradient vectors in a feature space, and (ii) a kernel function can be used to implicitly describe a feature space, without explicitly computing the mapping functions . Two methods that have been previously suggested for investigating nonlinear dependence between two fields are KCCA and NNCCA. KCCA uses the second idea above but only recovers components of the feature space. In contrast, gKCCA uses the feature space to recover components of the input space.

This paper has introduced the gKCCA method and investigated several aspects of the method, including nonlinear cross validation, sensitivity to noise, and the application of different kernel functions. The gKCCA method has at least two model parameters: the Gram matrix regularization parameter *ε* and any parameters belonging to the kernel function, such as the Gaussian *σ* or the polynomial power *k*. To estimate these parameters, nonlinear cross validation was performed using the adjusted multiple coefficient of determination from an augmented linear model as the cross-validation metric. Experiments with different levels of noise and different kernels show that gKCCA is able to recover the underlying nonlinearity between two fields. The application of gKCCA to observed SST data (ERSST) shows a significant quadratic coupling between the low-pass (4–6 years) field and the high-pass (2–3 years) field for the leading gKCCA subspace. Unlike in other studies of nonlinear interactions, the high-frequency pattern found here has stronger anomalies in the central Pacific than in the east Pacific; however, those studies estimated their patterns using different methods. Future studies of these components using climate models are needed.

Further investigations of the application of gKCCA to nonlinear questions in climate science are needed. Further methodological experiments, including the application of a wider range of kernel functions (e.g., Laplacian, periodic, and Matern kernels), are also required. Finally, methodological developments are possible, such as the development of gKCCA for fields with nonlinear coupling between propagating components.

## Acknowledgments

Source code for gKCCA is available in my Julia package CoupledFields (https://github.com/Mattriks/CoupledFields). I thank several anonymous reviewers for their useful suggestions, which helped improve the manuscript.

### APPENDIX A

#### Factorizing the Polynomial and Gaussian Kernel

First it is necessary to factorize the polynomial kernel , where here and are row vectors of :

where partition is a set of integers such that and , and is the feature space vector of the polynomial kernel. The length of is .
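The factorization can be illustrated numerically for p = 2 dimensions and polynomial power k = 2: the implicitly evaluated kernel equals the inner product of explicit monomial feature vectors. This is a minimal Python check, with the square roots of the multinomial weights written out by hand:

```python
import numpy as np

def phi2(x):
    """Explicit feature map for the degree-2 polynomial kernel (1 + x.y)^2
    in p = 2 dimensions; each monomial carries the square root of its
    multinomial weight."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

rng = np.random.default_rng(3)
x, y = rng.standard_normal(2), rng.standard_normal(2)

lhs = (1.0 + x @ y) ** 2      # kernel evaluated implicitly
rhs = phi2(x) @ phi2(y)       # inner product in the explicit feature space
print(np.isclose(lhs, rhs))   # True
```

The feature vector here has length 6, consistent with the binomial count of monomials for p = 2 and k = 2.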

The Gaussian kernel is a function of two linear kernels and the sum of all polynomial kernels:

because . Note that (without the subscript *k*) is the feature space vector of the Gaussian kernel. The sum of polynomial kernels in Eq. (A2c) means that the Gaussian kernel feature space vector in Eq. (A2e) contains the monomials of all polynomial kernel feature space vectors. That is, the Gaussian kernel feature space vector is of infinite length.

All the above shows (for the polynomial and Gaussian kernels) that a kernel matrix of a field , is the Gram matrix of a feature space . That is, .

### APPENDIX B

#### Piecewise Linear Basis Matrix

Piecewise linear basis matrix can be expressed as follows:

where are the elements of (expressed here as if the elements were sorted into ascending order for convenience; note that for practical calculation, sorting of rows is not required), is a -length vector of breakpoints, and is the degrees of freedom. The rows of change with each breakpoint (i.e., , , and so forth). The number of degrees of freedom of the piecewise linear basis is controlled by the number of breakpoints; for example, for the breakpoints are , where refers to the *k*% quantile of . Note that if , then and . Each column of represents a segment of a piecewise linear model.
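A minimal Python sketch of such a basis matrix follows, assuming a generic hinge-function construction with breakpoints at equally spaced quantiles; the paper's exact parameterization of the segments may differ in detail:

```python
import numpy as np

def piecewise_linear_basis(x, df=3):
    """Piecewise linear basis matrix with df columns.

    Column 1 is x itself; column j+1 is the hinge (x - b_j)_+ for
    breakpoints b_j placed at equally spaced quantiles of x, so that
    each column represents a segment of a piecewise linear model.
    With df = 1 the basis reduces to x (no breakpoints).
    """
    qs = np.linspace(0, 100, df + 1)[1:-1]        # interior quantile levels
    breaks = np.percentile(x, qs)                  # df - 1 breakpoints
    cols = [x] + [np.maximum(x - b, 0.0) for b in breaks]
    return np.column_stack(cols)

x = np.linspace(-1, 1, 101)
B = piecewise_linear_basis(x, df=3)
print(B.shape)  # (101, 3)
```

Note that no sorting of x is required: each hinge column is computed elementwise, in whatever order the rows arrive.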


## Footnotes

Supplemental information related to this paper is available at the Journals Online website: http://dx.doi.org/10.1175/JCLI-D-16-0563.s1.

© 2017 American Meteorological Society. For information regarding reuse of this content and general copyright information, consult the AMS Copyright Policy (www.ametsoc.org/PUBSReuseLicenses).