## Abstract

The present paper introduces and illustrates methodological developments intended for so-called optimal fingerprinting methods, which are of frequent use in detection and attribution studies. These methods used to involve three independent steps: preliminary reduction of the dimension of the data, estimation of the covariance associated to internal climate variability, and, finally, linear regression inference with associated uncertainty assessment. It is argued that such a compartmentalized treatment presents several issues; an integrated method is thus introduced to address them. The suggested approach is based on a single-piece statistical model that represents both linear regression and control runs. The unknown covariance is treated as a nuisance parameter that is eliminated by integration. This allows for the introduction of regularization assumptions. Point estimates and confidence intervals follow from the integrated likelihood. Further, it is shown that preliminary dimension reduction is not required for implementability and that computational issues associated to using the raw, high-dimensional, spatiotemporal data can be resolved quite easily. Results on simulated data show improved performance compared to existing methods w.r.t. both estimation error and accuracy of confidence intervals and also highlight the need for further improvements regarding the latter. The method is illustrated on twentieth-century precipitation and surface temperature, suggesting a potentially high informational benefit of using the raw, nondimension-reduced data in detection and attribution (D&A), provided model error is appropriately built into the inference.

## 1. Context and motivations

An important goal of climate research is to determine the causes of past global warming in general and the responsibility of human activity in particular; this question has thus emerged as a research topic known as detection and attribution (D&A). Over the past two decades, the results produced in this area have been instrumental in delivering far-reaching statements and in raising awareness of anthropogenic climate change [IPCC’s Fourth Assessment Report (Hegerl et al. 2007a)]. From a methodological standpoint, D&A studies involve a range of methods (e.g., correlation comparison, regression-based decomposition, and Bayesian inference) but are frequently based on linear regression methods referred to as optimal fingerprinting, whereby an observed climate change is regarded as a linear combination of *p* externally forced signals added to an internal climate variability term (Hegerl and Zwiers 2011). The regressors —or fingerprints—consist here of spatial, temporal, or space–time patterns of response to external forcings as anticipated by one or several climate models, whereas internal climate variability *ε* is treated as a multivariate Gaussian noise with mean zero and covariance . Practically, the results of the inference and, in particular, the magnitude of the uncertainty range on the regression coefficients ** β**, determine whether these fingerprints are present in the observations and whether or not the observed change is attributable to a given cause.

Over the years, this approach has been formulated by using a suite of multivariate linear regression models that were progressively adapted. The standard general linear model and its usual inference treatment [i.e., ordinary least squares (OLS)] was first introduced explicitly by Hegerl et al. (1996) and was then refined by Allen and Tett (1999):

The expression of the maximum likelihood estimator (MLE) of ** β** in this model is given by

where the so-called *precision matrix*, defined as the inverse of the covariance , is here assumed to be known. In Allen and Stott (2003), a further advance was introduced by recognizing that the true forcing responses are actually not known to certainty. Instead, we have access to , which is equal to plus a remnant noise originating from the internal variability, which is present in climate model simulations even after averaging:

The column vectors of are independent and identically distributed (iid) according to a centered multivariate Gaussian with covariance , where *m* is the size of the ensemble. The so-called total least squares (TLS) inference scheme used in this case (Van Huffel and Vandewalle 1994) also assumes the precision matrix to be known and yields a closed-form expression for ** β** as a function , , and , which is a generalization of Eq. (2) and is detailed further in section 2.

In any of the above two linear regression inferences, the knowledge of the covariance matrix associated to internal climate variability and of the precision matrix is critical for inference on ** β**. In optimal fingerprinting, is evaluated prior to—and independently from—the regression procedure, based on a sample of

*r*simulated, unforced variations , are referred to as “control runs.” A trivial option to estimate out of would be to use the so-called

*empirical covariance*:

but this estimate is known to perform poorly here, because control runs are costly, and thus the size of the sample *r* is often small as compared to the dimension of the data *n*. Further, has rank *r* and is thus noninvertible when , impeding a straightforward inversion of into an estimator of the precision matrix. The classic and perhaps most common solution to this problem in D&A to date was proposed by Hegerl et al. (1996) and consists in using a truncated pseudoinverse of as an estimator of the precision matrix, defined by

where is the diagonalized factorization of with its matrix of eigenvectors; where with is its matrix of eigenvalues; and where only the *k* leading eigenvectors of the empirical covariance are retained (). Note that in Eq. (5), as in the remainder of this article, denotes an estimator of the precision matrix , as opposed to the inverse of an estimator of the covariance matrix . This approach is generally referred to as *covariance truncation*. It is straightforward to show that, from the perspective of linear regression, covariance truncation is equivalent to projecting the data onto the *k* leading eigenvectors of and then to scaling each obtained new coordinate based on the inverse square root of the corresponding eigenvalue. The choice of an optimal truncation order *k* in this approach is recognized to be difficult and remains partly arbitrary thus far but has been shown to significantly influence inference results. This motivated Ribes et al. (2009, 2012) and Ribes and Terray (2012) to estimate and based on a more recent estimator, the so-called Ledoit and Wolf (LW) estimator (Ledoit and Wolf 2004). The latter consists of a linear shrinkage toward a target matrix proportional to the identity with an optimal shrinkage intensity defined in order to minimize the quadratic error of estimation of . It was initially introduced in the context of an application in finance (Ledoit and Wolf 2003) and has been successfully used for numerous and diverse applications since then. It is defined by

where and . The estimator of the precision matrix can this time be obtained by straightforward inversion of , which is now invertible by construction. It is straightforward to show that this approach is equivalent to regularizing the eigenvalues of by scaling them toward their mean *λ* [i.e., ] while keeping the eigenvectors unchanged.

For uncertainty analysis, the confidence intervals around that are readily available in the above regression models under a known, fixed covariance [e.g., in the OLS model] would be an option. However, it is usually recognized that since is not actually a known constant but an estimate, applying such formulas straightforwardly may yield erroneous uncertainty assessment. To circumvent this issue, in any two above approaches, control runs are split beforehand into two samples of size and , respectively, where and are arbitrarily chosen under the constraint . Equations (5) or (6) are then applied on both samples to derive two independent estimates of and . The first estimate is used for inference of the value of ** β**, whereas the second one is used for uncertainty analysis on

**. For illustration, in an OLS context, this approach yields**

*β*On the other hand, the dimension *n* of raw simulations and observations is usually considered to be too high for a straightforward application of the above procedures to the raw data. This reserve is justified in particular by the fact that covariance and precision matrix estimates are anticipated to perform poorly when *n* gets large while *r* remains fixed (i.e., *r* is smaller and smaller relative to *n*). This situation leads most authors to opt for an initial dimension reduction step, which is preliminary to the above-described treatments. The reduction of dimension from a few millions to a few hundreds typically is performed based on a variety of methods. Projecting the raw data onto an arbitrarily chosen number of leading spherical harmonics can be viewed as the most commonplace method thus far for global studies. The implications of the choices related to the reduction method on the reliability of inferences are not well understood in general, although Ribes et al. (2012) have begun to study some aspects of the quality of inferences.

To summarize, usual optimal fingerprinting procedures can be described generically by the following four steps:

Reduction of the dimension of the raw data , , and

.*ε*Split of

into two subsamples; derivation of two independent estimates of the covariance and precision matrices.*ε*Inference of the value of

based on the first estimate.*β*Inference of the confidence intervals around based on the second estimate.

While the approaches mentioned above have successfully produced numerous D&A results thus far [e.g., Stott et al. 2004; Zhang et al. 2007; Hegerl et al. 2007; Min et al. 2008, 2011], it is argued that they share an important shortcoming at a structural level, thus creating an opportunity for potential improvement. This shortcoming lies in our view in the sequential and disconnected nature of the treatment: the four above-mentioned steps certainly have their logic with respect to the end purpose of performing inference on ** β**, but they are disconnected from one another in a rather troublesome way from a theoretical prospective (Fig. 1a). Indeed, the theorems usually invoked in statistics to guarantee the theoretical soundness of an inferential treatment—whether w.r.t. the performance of the resulting estimators or on the analysis of their uncertainty—are, in general, established and hold only for single statistical models. They do not hold a priori for sequential application of separate models and estimators, unless validity could be established explicitly. The reason for this is quite intuitive: by imposing independent, nonintegrated steps in which a given step’s outcome is the input for the next step, it makes it quite difficult to keep track of the flow of uncertainty as well as potential biases between quantities of interest within a rigorous mathematical framework. There are thus few guarantees that the above estimators are optimal, nor that the confidence intervals are actually correct—the latter aspect being of particular concern in a D&A context, because the level of certainty accompanying most statements delivered by D&A studies is, in itself, an important component thereof.

The objective of this paper is to propose an approach to the D&A problem that avoids a sequential analysis approach and thereby attempts to improve the quantification of uncertainty and the reliability of inference. The general features of our proposal are presented in section 2a; section 2b formalizes the approach in detail for the most simple situation of interest, that of the linear regression model of Eq. (1); section 2c explains how and why the approach can be extended almost identically to the more complex errors-in-variables model of Eq. (3); section 2d briefly treats the case wherein, for each individual forcing, the full ensemble of available runs, instead of the single ensemble-average run, is used for inference. Next, section 3 shows that, by virtue of a classic linear algebra simplification, dealing with high-dimensional variables in the context of this model does not imply greater computational complexity than the dimension reduction initial step in the existing sequential approach. Section 4 shows results of the procedure on simulated data and compares its performance to existing ones. Section 5 applies the procedure on twentieth-century data. Section 6 discusses results and concludes.

## 2. Integrated optimal fingerprinting models and inference

### a. General approach

Our proposal consists in restoring unity in the structural formulation underpinning optimal fingerprinting methods—hence the proposed name of “integrated optimal fingerprinting.” It is argued that optimal fingerprinting can indeed be formulated and treated quite easily within a single statistical model (Fig. 1b). In the latter model, all data to be used for inference are dealt with simultaneously: simulated response runs , control runs ** ε**, and climate observations —all of them in their raw, high-dimensional form—share the same status as observations from a statistical modeling prospective. On the other hand, the model deals similarly with the unknown quantities (i.e., the coefficients

**, the covariance , and the smooth responses ), which are all treated as parameters. Nevertheless, we introduce a key distinction between those parameters by recognizing that in the present linear regression context, the end purpose is to infer the coefficients**

*β***, not the covariance , nor the deterministic responses to forcings . The latter two parameters are thus treated here as nuisance parameters (i.e., parameters that are not of immediate interest but must still be accounted for in inference).**

*β*The standard way to deal with nuisance parameters in statistics is first to eliminate them from the complete likelihood of the model before proceeding as usual with likelihood maximization and uncertainty analysis on the remaining parameters of interest. Eliminating nuisance parameters has long been a central problem in statistical inference and has been extensively studied in many different contexts (Basu 1977). There exist several approaches for elimination. A first strategy consists in deriving the so-called *profile* likelihood by maximizing the complete likelihood function over the entire domain of values of the nuisance parameter for every value of the parameter of interest. However, this straightforward elimination procedure has its limitations: the complete likelihood may, for instance, not be bounded, in which case a profile likelihood cannot be obtained. A second strategy, which may, in particular, be suitable when such an issue arises, consists of integrating out the nuisance parameters in the complete likelihood instead of maximizing them out, in order to obtain the so-called *integrated* likelihood. This approach is described extensively, for instance, in Berger et al. (1999). As discussed therein, a uniform integration of the complete likelihood over the entire domain is most straightforward, but an interesting alternative option consists in introducing a prior distribution on the nuisance parameters in order to constrain their values beforehand, which allows for the introduction of some regularization. If any, the parameters of the prior distribution may themselves be subsequently evaluated by maximizing the integrated likelihood. In this case, eliminating the nuisance parameters comes at the cost of introducing new parameters associated to the prior. Yet this trade-off may be beneficial and may reduce the number of degrees of freedom in case the dimension of the former exceeds that of the latter, as will be shown below.

Let us consider for illustration the most frequently used linear regression model with iid residuals, which corresponds to the above setting of Eq. (1) under the particular case , where is unknown. This model is classically solved by treating the unknown variance of errors as a nuisance parameter and by eliminating it from the likelihood function following the first strategy above: the complete likelihood is maximized in for any ** β**, thus obtaining a point estimate and the profile likelihood . The latter profile likelihood is subsequently maximized in

**to obtain a point estimate, and profile likelihood ratios are derived for uncertainty analysis on**

*β***. But in the more general case of interest here, where is entirely unknown, the matrix is not of full rank and is, of course, not an acceptable covariance estimate. As long as , this problem remains even when the**

*β**r*control runs are built into the inference. It is thus often not possible to obtain a profile likelihood in this case. Instead, our proposal consists in eliminating the nuisance parameter by integration of the complete likelihood and to introduce a parameterized prior on in doing so. The latter corresponds to a general strategy referred to as shrinkage estimation (e.g., Chen 1979; Ledoit and Wolf 2003; Hannart and Naveau 2014), which is used in D&A by the regularized optimal fingerprinting (ROF) method of Ribes et al. (2009). The prior aims at weakly constraining the structure of based on a general preset target . As discussed in Hannart and Naveau (2014) and in the above references on shrinkage estimation, the latter constraint is justified by, and aims to compensate for, the huge informational imbalance that exists between an unknown matrix to be evaluated out of an matrix of known data, but seeks to do so without restrictively imposing a predefined structure on . Instead, an unspecified degree of resemblance between and the predefined structure is merely assumed by using an appropriate prior on . The level of resemblance is parameterized with , and its value is inferred from observations. Inference can then be handled by standard maximization of the integrated likelihood, which here yields closed formula for estimators and confidence intervals that are easy and fast to implement.

### b. Ordinary least squares model

As sketched above, the proposed model treats similarly and as observations; ** β** and

*α*as unknown parameters, the values of which need to be estimated; as a nuisance parameter that influences all observations but the estimation of which is not of interest per se; and and as known, fixed constants. This integrated formulation is represented in the structural chart of Fig. 1b, in contrast to the usual sequential formulation shown in Fig. 1a. As a starting point, the joint distribution of observations is derived. The former can be factorized into

Next, both individual terms in the right-hand side of Eq. (8) are derived. First, the structural assumption inherent to the linear regression model of Eq. (1) implies

On the other hand, is assumed to be an iid sample of vectors distributed according to a multivariate Gaussian PDF with mean zero and covariance matrix . The probability density function (PDF) of is thus

where is the empirical covariance matrix. Combining Eqs. (9) and (10), the complete likelihood of the model is

The conjugate distribution on , the so-called inverse Wishart distribution, is chosen for convenience as the prior distribution on :

where and are two hyperparameters, and is the multivariate Gamma function. For further convenience, a different parameterization of the above inverse Wishart distribution is introduced by using instead of , where metaparameter is more directly interpretable as the mean of the distribution, whereas, on the other hand, metaparameter drives the variance of the elements of with variance zero for and infinite variance for ( appendix A).

Next, the nuisance parameter is integrated out in Eq. (8) to obtain the integrated likelihood , which no longer depends on but only on the parameter of interest ** β** and on the regularization parameter

*α*:

This integration can be performed exactly because of conjugacy ( appendix B). The following expression is obtained after some algebra:

up to a constant that does not depend on * β,* which will be clarified below, and where the covariance matrix , defined by

has been introduced. Even if estimating is not of primary interest here, the matrix is an intermediate quantity interpretable as a linear shrinkage estimator of the covariance, where *α* is the shrinkage intensity and is the shrinkage target. By maximizing the likelihood of Eq. (14) in ** β** and

*α*, the desired MLE of

**is obtained:**

*β*where the expression of the integrated likelihood is given by

For clarity, the function and the intermediate quantities and are introduced in Eq. (17). The maximization of for based on the closed-form expression in Eq. (17) may be obtained by application of any maximization procedure [for instance, the Newton–Raphson algorithm (Press et al. 2007)].

It should be noted that the expression of in Eq. (16) is identical to that of the standard OLS estimator under a model in which the estimated covariance is treated as if it was known and constant (i.e., sequential approach). It is also similar to the expression of the estimator of Ribes et al. (2012) under the special case and . Nevertheless, the similarity with Ribes et al. (2012) is limited to the aspect of the expression in Eq. (16). Indeed, in practice, as will be shown in section 4, the obtained values of and differ, yielding different estimates of ** β** even in the special case . Further, Eq. (16) is applicable with any target (i.e., not restrictively to a target proportional to the identity). Finally, the confidence intervals obtained in our approach differ substantially, in both theory and practice, from those obtained under the approach of Ribes et al. (2012), as is now detailed.

In the present integrated model, the usual profile likelihood-based method of Venzon and Moolgavkar (1988) can be applied straightforwardly to derive confidence intervals around . For a given confidence level *η* (0.95, say) the former approach consists in solving in ** β**. After some algebra, it can be shown to be equivalent to solving in

**, , where the transformed confidence threshold level is defined by**

*β*where the approximation holds for *n* large. Under a model in which the estimated covariance is treated as if it were known and constant [i.e., sequential approach, as in Ribes et al. (2012)], one obtains confidence intervals by solving . Confidence intervals obtained under this assumption would therefore differ from those obtained under our approach, since in general. Further, the relative position of and is driven (approximatively for large *n*) by the position of w.r.t. 1. Therefore, the confidence intervals in our approach will be larger than those obtained by assuming that the estimated covariance is known when the model fit is bad and, conversely, smaller when the model fit is good. In both cases, the gap in confidence intervals between both approaches will be emphasized when and/or are small. Figures 2a and 2b show confidence regions obtained under both situations for .

### c. Total least squares model

This approach is now extended to the errors in variables introduced by Allen and Stott (2003) and used in various studies (Lott et al. 2013), which is often called the total least squares model. The following PDF prevails:

where is now treated as an observed version of the actual design matrix that contains noise. In this case, is not known; thus, it is treated as a nuisance parameter. This means that

and the PDF of each column vector of is assumed to be distributed according to a Gaussian PDF with mean and variance , where denotes the size of the ensemble that was averaged to obtain each :

The same sequence as in model (8) is then followed in order to obtain an integrated likelihood in ** β**. This means that is first integrated out in Eq. (20) in order to obtain a marginalized likelihood. Next, it is concentrated in

**and**

*β**α*by maximizing in , as would be done in the standard model with known covariance. In contrast with , the nuisance parameter is thus not integrated out: it is instead eliminated by mean of maximization to obtain the so-called profile likelihood (Basu 1977). In applying this sequence, the integration in can be performed exactly, as well as the maximization in after some algebra ( appendix C). These two steps yield an expression for the likelihood that is quite similar to Eq. (14):

up to a constant that depends only on *α*, and where the scaling factor has been introduced for convenience. The MLEs of *α* and ** β** are this time

Note that , defined to be the value maximizing , differs from Eq. (16) because of the presence of the factor , which depends on ** β** and synthesizes the influence of the errors-in-variables assumption on the model likelihood. Yet it also has a convenient closed-form expression, , obtained as the whitened TLS estimator under covariance . It is briefly recalled for completeness that, for any , is a

*p*-dimensional vector, and is a scalar such that the -dimensional vector is the unitary eigenvector associated to the smallest eigenvalue of the positive definite matrix . Therefore, as in section 2a, the expression for

**matches exactly that obtained under a TLS model in which the estimated covariance would be treated as if it were known and constant (i.e., sequential approach).**

*β*Finally, the expression for the concentrated likelihood is given by

where and . The concentrated likelihood is thus slightly modified as compared to Eq. (17).

### d. All runs total least squares model

In practice, matrix in the above model was obtained by averaging an ensemble of *m* matrices of simulated responses (i.e., ). Such averaging reduces variance by a factor of *m*, but some worthwhile information regarding is lost in the process. To make the most of this information and integrate it into the inference, the model is reformulated by accounting explicitly for instead of its average :

where are now treated as *m* observed, independently noise-contaminated versions of the unknown design matrix , and are thus assumed to be iid Gaussian with mean and covariance . This means that variations among ensemble members are assumed to be due to internal variability only and that models are assumed to correctly simulate the latter, which are two strong assumptions. After some calculations, can be simplified into

where is the sample covariance matrix of the ensemble of simulated response runs, and denotes the *i*th column vector of matrix .

An interesting consequence of the factorization in Eq. (26) is that our modified model is actually formally identical to the model of section 2b if substituting and *r* by and *r* + *pm*, respectively. Practically, this means that, after subtraction of the ensemble mean, each obtained response run must be treated identically to an internal variability control run. This simple calculation shows that the new model is equivalent to, and basically boils down to, a potentially significant increase in the size of the control ensemble from *r* to *r* + *pm*, which will improve the performance of statistical inference.

## 3. Implementation in large dimension

For small values of *n*—say in the order of a few hundred—computational aspects are not an issue: basic routines of a standard software package run on a desktop computer can cope with a straightforward implementation of Eqs. (14)–(23). But the straightforward approach is no longer suitable for large values of *n*, because it would imply inverting and/or diagonalizing matrices, thus requiring large computational resources. The purpose of this section is to show that, under a rather general assumption, by merely reformulating our equations through some algebraic manipulation, it is possible to drastically reduce their computational complexity.

The main assumption here is that the inverse square root of the target matrix , can be computed at a very low computational cost no matter what *n* is. This assumption is not very restrictive because, by definition, a target matrix is a general structure that is likely to have a sufficiently simple mathematical expression (e.g., a diagonalized expression, or see, for instance, section 5) to allow for a simple derivation of its inverse square root. Note that it clearly holds for the default target matrix proportional to the identity matrix. Under this assumption, the following transformed variables are derived as a starting point:

Next, the following six dimension-reduced statistics are derived:

After implementation of Eqs. (27) and (28), the dimension of the obtained statistics is merely driven by *r* and *p* but no longer by *n*. If our end result could be obtained based on these elementary statistics only, then the large dimensionality issue would be resolved.

At this stage, it is important to note that all the calculations required for the computation of and can be expressed based on four basic quantities: , , , and that all depend on *α*. Our goal thus boils down to deriving closed expressions for these four quantities as functions of *α* and the six elementary statistics of Eq. (28) only. For this purpose, let us consider the matrix and its diagonal factorization , where is orthonormal and is diagonal. Because has rank *r*, has *r* positive eigenvalues that are grouped under the vector **w**, with corresponding eigenvectors grouped under the matrix . Similarly, denotes an matrix of unitary eigenvectors corresponding to the eigenvalue zero: that is, . A classical simplification for obtaining **w** and without diagonalizing the matrix is to diagonalize instead the matrix . Indeed, it can immediately be shown that both have the same *r* positive eigenvalues and that, if denotes its orthonormal matrix of eigenvectors, then , with and a diagonal matrix used for normalization. On the other hand, the computation of would be a daunting task for large *n*, but fortunately—as will be clarified in Eq. (29)— is of interest, rather than . The former quantity can be obtained easily by as a consequence of the fact that . With the above notation and properties

where and are two diagonal matrices that are fast to compute for any α. Finally, some algebra yields

which is the desired reduced-form expression. The other quantities of interest are obtained similarly:

Therefore, under a rather general setting and after some initial calculations, the complexity of the procedure is reduced from *O*(*n*) down to *O*(*r*), thus unlocking the possibility of treating very large datasets on a desktop computer and without any need for a preliminary, independent dimension reduction.

## 4. Numerical simulations

This section illustrates and evaluates the performance of our integrated procedure by applying it to simulated values of , , and . The initial use of simulated rather than real data aims at verifying that our inference procedure performs correctly, a goal that requires the actual values of ** β** to be known. It also allows performance comparison with existing procedures.

### a. Description of the setup

Our idealized data simulations are based to a large extent on the numerical experiments setup of Ribes et al. (2012) and aim at realistically replicating a global dataset of twentieth-century temperature. Four pools of experiments were generated by combining the two models introduced in section 2 (i.e., OLS and TLS) together with two distinct sets of assumptions w.r.t. the size and structure of the covariance matrix . Regarding the latter, a first set of assumptions stick exactly to those proposed by Ribes et al. (2012) by using *n* = 275 and a covariance matrix estimated beforehand by these authors out of various CMIP5 control runs after a preliminary dimension reduction of the data based on spherical harmonics. In contrast to this classical dimension-reduced situation (DR), our second set of assumptions attempts to represent the non-classical situation described in section 3, where a raw, nondimension-reduced dataset (NDR) is treated directly. For this purpose, it is assumed that all variables are obtained on a regular rectangular 100 × 100 spatial grid (i.e., *n* = 10 000 grid points with spatial coordinates ). Note that, differently from the above DR setup, which is spatiotemporal, the temporal dimension of fingerprints is thus not considered in the NDR setup for simplicity; the variables of interest may correspond, say, to temperature change over the twentieth century at every location . The covariance matrix is given a spatial structure that is defined in the following way. Consider the isotropic exponential covariance matrix defined by , with parameter such that , as well as its diagonalized form. To obtain a spatial structure that is simultaneously less smooth and more general than , its eigenvalues are modified by simply multiplying them using *n* i.i.d. random factors taken from the uniform distribution on [0.5, 1.5], while maintaining its eigenvectors unchanged. All covariance structures obtained using this simple procedure share with the feature of concentrating a large fraction of the correlation in the vicinity of any given grid point while, at the same time, allowing for a wide variety of shapes that strongly differ from . In particular, in strong contrast with , the covariance matrices derived in this way are nonisotropic (Fig. 3c). They also appear to be able to represent such a complex structure as, for instance, a remote teleconnection dipole (Fig. 3d).

The matrix of unobserved regressors under the TLS model was obtained by simulating each of its *n* rows independently from a *p*-variate Gaussian distribution with fixed correlation and fixed variance , where is chosen to obtain a realistic signal-to-noise ratio. The matrix of noise-contaminated, observed regressors was then obtained from after simulating the *p* columns of independently from an *n*-variate Gaussian distribution with covariance , where the ensemble size *m* is assumed to be equal to five. The vector of observations **y** was obtained from after simulating from an variate Gaussian distribution with covariance and . Simulations of and under the OLS model were obtained as a particular case of the TLS simulations setup (i.e., using and ). Finally, under both OLS and TLS models, the matrix of control runs was obtained by simulating the *r* runs independently from an variate Gaussian distribution with covariance . The critical influence of the number of available control runs *r* on inference performance was assessed by repeating the entire process for 10 values of *r* ranging from to .

Implementing our integrated optimal fingerprinting (IOF) procedure requires the use of a target matrix , which describes the information available beforehand on the general structure of . Four different targets , , , and were tested for each simulation, yielding four different estimates in each case. The identity matrix scaled to [i.e., ] was our default target. It allows for a direct comparison to Ribes et al. (2012), which is also based on this generic target. Yet the performance improvement achieved under a more specifically relevant target is also of interest. In the first, DR case with , is a random perturbation of obtained using the inverse Wishart distribution. In the second, NDR case with *n* = 10 000, is the isotropic exponential covariance matrix . In both cases, two intermediate targets, for , with and , were introduced.

For each of the 40 experiments described above (=2 models × 2 setups × 10 values of *r*), simulations were generated. Inference was performed on each simulation using our IOF procedure successively with the four targets defined above. For the sake of comparison, inference was also performed based on the shrinkage procedure of Ribes et al. (2012) (referred to as ROF in Figs. 4–8) as well as the so-called *oracle* procedure (e.g., Ledoit and Wolf 2004). The oracle procedure consists in setting the value of *α* to , where is obtained by minimizing the estimation error on ** β** viewed as a function of

*α*. The oracle estimator is thus, of course, not implementable in a real application where the true value of the parameter is not known. Its interest is rather that it defines an upper bound on the performance of any shrinkage estimator (i.e., ROF or IOF) and is thereby a useful benchmark.

The performance of each estimator of ** β** was evaluated based on average mean squared error (MSE) , where denotes the simulation number, normalized to the mean squared error of the estimator obtained under known covariance. The accuracy of confidence intervals was evaluated by comparing the empirical frequency of the confidence regions, including the actual

**, to its theoretical value of 90%.**

*β*### b. Results

As a first general remark, it appears that, in all cases and rather expectedly so, the MSE of the estimator of ** β** is a monotonically decreasing function of

*r*that gets steeper as

*r*gets smaller and finally diverges to infinity as . Concerning the OLS model (Figs. 4, 5), the MSE of our procedure under the target (IOF0) is indistinguishable from that of the ROF procedure, both being also indistinguishable from the MSE of the oracle estimator (Figs. 4a, 5a). This means that both ROF and IOF0 are able to come up with the best possible value of

*α*. The accuracy of confidence intervals is excellent in all three cases in the DR case, with the empirical frequency matching the theoretical value 90% (Fig. 4b). However, a pronounced discrepancy prevails w.r.t. the accuracy of confidence intervals in the NDR case: while the accuracy of IOF0 remains excellent, that of ROF degrades to about a 10% negative mismatch for small values of

*r*(i.e., overconfidence) worsening to as high as a 40% mismatch for (Fig. 4b). Although it was not computed, further worsening appears to be likely to occur for larger values of

*r*given the consistently decreasing shape of the empirical frequency curve for ROF.

On the other hand, introducing more relevant targets shows a clear benefit in an IOF context. The MSE indeed sharply decreases when moving from IOF0 to IOF1, IOF2, and IOF3 (Figs. 4c, 5c). The performance gain appears to be higher when *r* is small; indeed, the better information on included by the improved target naturally has a stronger impact in this situation, since less information is then required from control runs. When an improved target is used and *r* is small, the inference procedure is thus able to capture it by yielding a higher weight *α* given to the target (not shown). The MSE decrease is particularly sharp in the NDR case: performance nearing perfection (i.e., nearing performance under known ) is obtained even for small values of *r* and even though the target differs substantially from in terms of Frobenius distance (Fig. 3b). The accuracy of confidence intervals remains excellent in IOF1, IOF2, and IOF3 (Figs. 4d, 5d), which means that the decrease in MSE is accurately captured by a decrease in the amplitude of the confidence region, on average.

The above findings obtained under the OLS model are maintained under the TLS model (Figs. 6, 7), except for a few changes that are highlighted. First, note that TLS results in lower estimation performance level as compared to OLS for all experiments and for all procedures. This holds here, not because TLS as a method performs worse in general, but because the design matrix is simulated with no noise under the OLS setting, whereas it is simulated with noise in the TLS setting. The gap in performance between OLS and TLS thus corresponds to the effect on estimation performance of noise in the design matrix when the correct method is used. Second, the accuracy of confidence intervals is strongly degraded for both ROF and IOF under the DR case (Figs. 6b,d). Under the NDR case, the degradation also occurs for ROF but not for IOF which maintains perfect accuracy (Figs. 7b,d). Third, the MSE benefit of using more relevant targets appears to be more pronounced under the TLS model than under the OLS model. It also appears to have a benefit in terms of confidence intervals in the DR case, as their accuracy is partially restored by using more relevant targets (Fig. 6d).

## 5. Illustration

### a. Data

The purpose of this section is to illustrate the method by applying it in the context of a D&A study based on actual observations and GCM model simulations. The study focuses on two variables: surface temperature and precipitation, considered over the 1905–2005 period and over the entire surface of Earth, when observations are available. While numerous previous D&A studies are already available based on these variables (e.g., Hegerl et al. 2007b; Zhang et al. 2007; Noake et al. 2012; Polson et al. 2013; Lott et al. 2013; Marvel and Bonfils 2013), a novelty of this illustration is to highlight the effect on D&A results of using straightforwardly the variables of interest in their raw temporal and spatial form, and thus in potentially high dimension, without any prior processing for dimension reduction. Second, the benefit of more relevant and more constraining targets for the covariance matrix of internal variability than the identity matrix is enhanced.

Observations were obtained, on the one hand, from the HADCRUT4 monthly temperature dataset (Morice et al. 2012) over the period 1901–2000 and, on the other hand, from a combination of the GPCP v2.2 monthly precipitation dataset (Adler et al. 2003) over the period 1979–2000 with the GPCC v6 dataset (Schneider et al. 2013) over the period 1901–78. GCM model simulations were obtained from the IPSL CM5A-LR (Dufresne et al. 2012), the runs of which were downloaded from the CMIP5 database. An ensemble of runs consisting of two sets of forcings (i.e., ) was used: the natural set of forcings (NAT) and the anthropogenic set of forcings (ANT), for which three runs are available in each case over the period of interest (i.e., ) and from which an ensemble average was derived. On the other hand, a single preindustrial control run of 1000 yr is available and was thus split into 10 individual control runs of 100 yr. Next, following section 2d, the six residual runs consisting of the difference between individual response runs and the corresponding ensemble average were used as if they were additional control runs (i.e., ).

Precipitation units in simulations were converted from kilograms per meter squared per second to millimeters per month to match the units of the observational dataset (scaling factor of 2.63 × 10^{6}). Both precipitation and temperature in both observations and simulations were converted to anomalies by subtracting the time average over the reference period 1961–90.

The data were averaged temporally and spatially using time–space windows of different resolutions. The commonplace temporal resolution of 10 yr was used for all windows, and five different spatial resolutions on a latitude per longitude grid: 1 × 1, 2 × 3, 5 × 10, 10 × 18, and 21 × 36 corresponding to grid cells of size 180° × 360°, 90° × 120°, 36° × 36°, 18° × 20°, and 8.5° × 10° and to theoretical values of *n* of 10, 60, 500, 1800, and 7560. Averaging over windows was performed for both observations and simulations by using only values for which observations were nonmissing for a like-to-like comparison between observations and simulations. Windows with more than 80% of missing observations were suppressed, yielding final values of *n* of 10, 60, 470, 1530, and 5700 for temperature and 10, 30, 270, 790, and 2960 for precipitation.

### b. Target matrix

The main hypothesis required for implementation of the described method consists in the structure of the target . The following assumption is used:

where is a correlation matrix of size reflecting the temporal dependence and is assumed to be exponential with parameter ; is a correlation matrix of size reflecting the spatial dependence and is assumed to be exponential with parameter ; is the diagonal matrix of variances at each location *s*; and denotes the Kronecker product. The latter Kronecker product simply reflects the fact that it is assumed that the dependence in time, described by , is independent of the location *s* and, reciprocally, that the dependence in space, described by , is independent of time *t*. Note that the Kronecker product does not commute: Eq. (32) implies that the data is ordered first by date and second by location (i.e., the space–time data vectors of size consist of spatial vectors of size , which are concatenated by date). The assumption of time–space independence has an important practical implication for implementation, as can be obtained from

which requires us to diagonalize two matrices of size and instead of one matrix of size . The parametric form assumed for is required to estimate beforehand the local variances and the exponential parameters driving the range of the space–time dependence for temperature and precipitation. Local variances were obtained directly from the empirical covariance by for . With the local variances known, the exponential parameters were obtained by minimizing the mean squared error viewed as a function of (Fig. 8). Finally, the method was also implemented on two particular cases of Eq. (32) for the sake of comparison: (i) the diagonal target obtained by setting correlation matrices and to identity matrices with the same local variances; and (ii) the default target with . By order of increasing complexity, the three targets defined above are denoted herein as , , and .

### c. Model error

A well-known limitation of model Eq. (19) consists in assuming that the error terms and in the model Eq. (3) only originate from internal variability. This assumption means that the climate model used to simulate the responses to individual forcings is assumed to be perfect, that the model correctly simulates internal variability, and that observations are free of measurement error. When using real observations and simulations, these assumptions can be challenged. These missing error terms are likely to affect the performance of the estimation of ** β** and accuracy of confidence intervals, because parameter

**may then adjust to compensate for the missing error instead of fitting patterns to observations**

*β**y*, hence yielding values that are physically unrealistic.

While this problem is in our view an important matter in D&A, addressing it completely would imply substantially modifying our model assumptions and, in particular, introducing covariance matrices describing model and observational error. Such modifications would, in turn, raise inference difficulties that are beyond the scope of this article. The latter difficulties are nevertheless discussed further in section 6, but for the sake of this illustration, we merely attempted to address this limitation in a simplified manner by introducing a slight modification to our model that does not affect its implementability. For this purpose, the covariance matrix associated with is simply inflated by a factor in order to reflect additional error terms. This inherently assumes that those extra error terms share the same covariance structure up to a scaling factor, which is not a particularly realistic assumption but is preferable in our view to letting ** β** adjust for missing errors instead, as mentioned above. The value of

*τ*is chosen so that the inflated covariance provides as good a fit as possible to the vector of residuals with . This approach yields

where accordingly maximizes the integrated likelihood of Eq. (24) under . Then expressions of the likelihood, of the MLEs and , and of their corresponding confidence intervals in model Eq. (19) must hence be modified to include this new parameter. After some calculations, this boils down to changing into in Eqs. (22) and (23); into in Eq. (24); and *m* to in the calculation of and . Note that, as expected, all estimators are thus left unchanged when using . On the other hand, it is interesting to notice that, from the point of view of , introducing such an error inflation factor is equivalent to artificially inflating the size of the ensemble *m* by the same factor . This is consistent with the results of Ribes and Terray (2012), who found that, in the context of a D&A study based on temperature data, artificially increasing the ensemble size in the TLS model led to improved estimates of ** β**. The scaled error TLS model introduced above sheds light on this finding, as it shows that artificially increasing the ensemble size inherently corresponds to accounting for some missing error.

### d. Results

The proposed method yields estimates of that remain between 0 and 1 for all five resolutions, all three targets, and for both variables (Fig. 9. Estimated values of are positive except occasionally at high resolution for both variables; they depart from one in a more pronounced way in this case. Confidence intervals are much larger for precipitation than for temperature. Confidence intervals are much larger for than for . Confidence intervals exclude the value 0 for , allowing detection to be performed, in nearly all situations. This holds as well for for target or globally averaged variables but otherwise not.

For both forcings and both variables, the amplitude of confidence intervals is found to decrease as resolution increases (Fig. 10a).

The choice of the target does not appear to have a strong impact on the amplitude of confidence intervals on ** β** but it does appear to influence significantly the value of its point estimate. In any case, the goodness of fit of the model—measured herein by —is directly related to the complexity of the target: it reaches its maximum for target and its minimum for target (Fig. 10c).

Estimates of the covariance inflation parameter *τ* range between 1.6 and 2.1 for temperature and between 1.9 and 6.5 for precipitation (Fig. 10b). In both cases, the minimum value is reached at low resolution 2 × 3. It then increases as resolution gets higher, which means that missing errors (i.e., model and observational error) are more important at higher resolutions. Introducing an extra parameter necessarily improves goodness of fit by construction. Yet, in order to evaluate whether or not the gain in goodness of fit justifies the introduction of the extra parameter under scrutiny, the classic Bayes information criterion (BIC; Schwarz 1978) can be used. In this present case, this implies that the parameter *τ* is worth including whenever the gain in goodness of fit is greater than , or . The latter condition is met for all targets, variables, and resolutions. Further, Fig. 10d shows that steeply increases with resolution. This finding is not surprising, since missing errors also do, thus amplifying the benefit of including an error inflation parameter.

## 6. Conclusions

An integrated optimal fingerprinting method has been introduced. In contrast with existing approaches, the method treats within one statistical model all the available data (i.e., climate observations, model response simulations, and control simulations) in a spatiotemporal, high-dimensional format. This means that no reduction of the dimension of the data is required initially. The critical covariance estimation step is built into the integrated inference procedure, thereby offering a more coherent treatment of all uncertainty sources. Inference of the regression coefficients vector ** β** is obtained under closed-form expressions, which are implementable at a low computational cost even in high dimension.

A key idea underpinning the method consists in regularizing the covariance by mean of a hierarchical formulation. Closed-form expressions can then be obtained because of a convenient choice of conjugate distributions (i.e., Gaussian and inverse Wishart distributions), yielding a so-called linear shrinkage estimator toward the predefined target structure . One strength of the method as compared to that of Ribes et al. (2012) is that the target can be of any structure, thus opening access to a wide range of structures relevant for regularization, especially in a spatiotemporal context, where numerous covariance models are available. A second distinctive advantage is that the uncertainty associated with the latter linear shrinkage estimation of the covariance is consistently built into the inference of the parameter of interest ** β** and of its confidence intervals, which indeed was the initial purpose of using an integrated model.

The hierarchical formulation of our model is actually equivalent in a Bayesian context to introducing an informative, conjugate prior distribution for and a noninformative, improper, uniform prior distribution for ** β** and . Rephrased in Bayesian terms, the proposed maximum likelihood estimator of

**corresponds to the mode of its marginal posterior distribution; the confidence region corresponds to the quantiles of the same distribution, while estimation of**

*β**α*is based on a so-called

*empirical Bayesian*approach, meaning that a hyperparameter is chosen so as to maximize the marginal probability of the data. One interesting aspect of this approach is that it may be extended quite naturally by using informative priors on

**and as well as on , thereby introducing further regularization into D&A studies. One motivation to introduce such further regularization on**

*β***and is the disturbing fact that, in their current form and inclusive of the present method, TLS inferences often yield physically unrealistic values of both the coefficients**

*β***, which are sometimes inferred at values well above one or below zero, and of the patterns , which are often inferred to be less smooth than . It can be hoped that such a comprehensive regularization approach based on introducing informative priors, may yield more physically meaningful inference results. In any case, our proposal represents an effort to move toward that direction.**

*β*One limitation of our method is that it is relies on the assumption that covariance matrices are proportional. To our knowledge, this limitation is actually shared by all optimal fingerprinting methods and studies to date, except those of Huntingford et al. (2006), who relaxed this assumption for the first time in the context of D&A, and Hannart et al. (2014), who proposed an inference procedure to deal with this situation. When this assumption is relaxed (i.e., in the more general errors-in-variables regression model, where the covariance associated with the column vectors of is not proportional to ) closed-form expressions are no longer valid, and likelihood maximization becomes mathematically intractable. Yet the proposed hierarchical model formulation and the regularization it allows remain, in our view, relevant in this case as well. Solving for the same model extended to this more general situation would thus be useful in our view and would imply designing an ad hoc numerical inference procedure. The latter might be inspired by the scheme of Hannart et al. (2014). The assumption that the noise is Gaussian is also a limitation of the method when applied at high resolution, because we can no longer rely on spatial averaging for the Gaussian assumption to hold.

The proposed method is found to outperform alternative inference methods of Allen and Tett (1999), Allen and Stott (2003), and Ribes et al. (2012) based on a simulation study w.r.t. both the error of estimation of ** β** and the accuracy of the confidence intervals. In particular, the gap in performance with Ribes et al. (2012) is found to be significant even when the same target is used (i.e., ) and becomes wider when a more specific target is used relevantly or when the number of control runs

*r*is small. However, despite these gains in performance, simulation results show that the method still yields imperfect confidence intervals in the TLS case and needs to be further refined to solve this issue.

Finally, the method was illustrated by applying it to global temperature and precipitation observational datasets at several increasing spatial resolutions, yielding estimates of ** β** of increasing precision (uncertainty ~). Thus, a novelty of this study is to highlight the fact that dimension reduction may result in a loss of worthwhile information from a D&A standpoint. From this perspective, it appears to be essential to build climate model error into the inference, as the former tends to increase at higher resolution. It also appears to be relevant to focus further on the design of D&A tools, which, like the present method, are able to handle high-dimensional data, in an attempt to make the most of available observations to detect and attribute climate change.

## Acknowledgments

I gratefully acknowledge Aurélien Ribes and Philippe Naveau for stimulating discussions while working on this study. I would also like to thank Francis Zwiers and an anonymous reviewer for many suggestions that resulted in a substantially improved manuscript. This work has been supported by the LEFE-INSU Grant MESDA and the ANR (Agence Nationale de la Recherche) Grant DADA.

### APPENDIX A

#### Inverse Wishart Parameterization Choice

The inverse Wishart distribution is usually parameterized with and :

The following bijective parameter change is used:

with and . With this new choice of metaparameters, the expression of the inverse Wishart PDF becomes

where the normalizing constant is omitted for simplicity. Metaparameter can now be interpreted straightforwardly as the distribution mean, and metaparameter *α* drives the magnitude of the variance of the elements of :

with zero variance for and infinite variance for .

### APPENDIX B

#### Integrated Likelihood in the OLS Model

As a straightforward consequence of Eq. (A1),

for any . Going further, the so-called normalizing constant is denoted On the other hand, combining Eqs. (9), (10), and (12) yields

with . The above expression in can be integrated by applying Eq. (B1) to obtain the following:

where the constant has been dropped. Next, some rearrangements of the determinant of are performed based on the parameter change of Eq. (A2):

Sylvester’s determinant theorem (i.e., for any two rectangular matrices and of same dimension) yields

### APPENDIX C

#### Integrated Likelihood in the TLS Model

Let us denote the negative log likelihood in of the standard TLS model with known covariance , defined by

An explicit maximization in of the right-hand term of Eq. (C1) can be performed (detailed calculations are omitted). This yields the following expression of the integrated likelihood in ** β**:

with .

Next, the same steps that are applied in appendix B to the likelihood of the OLS model are applied to the complete likelihood of the TLS model given in Eq. (19). After integration in and some simplifications, the following reduced expression of the marginal likelihood in is obtained:

where is an matrix. Maximizing in is equivalent to maximizing in under the constraint , with , where denotes the first column vector of , and denotes an matrix consisting of column vectors of .

Introducing and the vectorial Lagrange multiplier , the corresponding Lagrangian function can be written as

The first-order condition implies

Introducing Eq. (C5) into the constraint yields

which is the expression given in the main text.

## REFERENCES

*Climate Change 2007: The Physical Science Basis*, S. Solomon et al., Eds., Cambridge University Press, 663–746.

*Numerical Recipes: The Art of Scientific Computing*. 3rd ed. Cambridge University Press, 1256 pp.

*The Total Least Squares Problem: Computational Aspects and Analysis*. SIAM, 288 pp., doi:.