## Abstract

Statistical principles underlying “fingerprint” methods for detecting a climate change signal above natural climate variations and attributing the potential signal to specific anthropogenic forcings are discussed. The climate change problem is introduced through an exposition of statistical issues in modeling the climate signal and natural climate variability. The fingerprint approach is shown to be analogous to optimal hypothesis testing procedures from the classical statistics literature. The statistical formulation of the fingerprint scheme suggests new insights into the implementation of the techniques for climate change studies. In particular, the statistical testing ideas are exploited to introduce alternative procedures within the fingerprint model for attribution of climate change and to shed light on practical issues in applying the fingerprint detection strategies.

## 1. Introduction

Predictions of climate change due to human-induced increases in greenhouse gas and aerosol concentrations have been an ongoing arena for debate and discussion. A major difficulty in early detection of changes resulting from anthropogenic forcing of the climate system is that the natural climate variability overwhelms the climate change signal in observed data. A number of schemes based on fingerprint methodologies have been developed to overcome this inherent problem [Bell (1982), (1986); Hasselmann (1979), (1993), (1997); Madden and Ramanathan (1980); and North et al. (1995); see Hegerl and North (1997) for a review and comparison of these approaches and Solow (1991) for discussion of some related statistical issues].

The basic idea behind fingerprint procedures is to express the climate data in terms of low-dimensional signal patterns. The intent is that these patterns describe and summarize defining characteristics of the signal. The fingerprints are constructed to maximize the signal-to-noise ratio in the observed climate data. Fingerprints thus provide an optimal dimension reduction of the full climate system, enhancing climate change detection strategies (Hasselmann 1993).

The fingerprint approach, as introduced by Hasselmann (1979), entails a mathematical procedure for optimally detecting a climate change signal above the background natural climate variability noise. We will describe how such schemes are representable as optimal hypothesis testing procedures from the classical statistics literature. The formal statistical hypothesis testing framework provides insights into the implementations of the fingerprint detection strategies. We also discuss the statistical development of procedures used in attribution analyses. In particular, we can draw on the statistical literature to suggest alternative attribution methodologies.

Section 2 motivates the discussion of the statistical principles for climate change studies through a development of statistical models for climate change analysis. An understanding of the statistical issues in modeling the climate system is necessary for the statistical exposition of fingerprint models and methods. Sections 3a and 3b review statistical hypothesis testing and comparative experiments for detecting a climate change signal and assessing causal relationships (i.e., attribution). Section 4a reviews the fingerprint method as presented by Hasselmann (1993) and formulates the problem as a traditional most powerful statistical hypothesis test. Sections 4b and 4c compare and contrast current attribution methods with statistical approaches and offer some new suggestions. Section 5 considers practical issues of detecting climate change signals via fingerprints and discusses ideas for incorporating alternative statistical procedures in the detection schemes.

We recognize that the literature and current consensus regarding climate change are based on extensive and sophisticated scientific reasoning, of which only a portion is attributable to the results of statistical testing (IPCC 1996). However, since statistical testing does play a crucial role in these studies, delineation of the choice of appropriate methods and the interpretation of results is warranted.

## 2. Statistical formulations

The detection of a climate change signal and attribution of the potential signal to specific forcings requires understanding of the variation in the observed data, namely the climate change signal and natural climate variability. In this section, we discuss issues in modeling these two components of the climate change problem.

Let **Ψ** = {Ψ_{a}} denote an observed climate state where the index *a* = (**v**, **x**, *t*) represents the dependence of each component on climate variables **v** (e.g., pressure, humidity, etc.), spatial coordinates **x**, and time *t.* Dependence on location, time, and selected variables is suppressed. Define *f*_{A} to be a quantitative measure of anthropogenic forcings; define *f*_{E} to be other *external* forcings, such as solar input, volcanoes, etc.

### a. Models

The discussion of modeling relevant to the climate change issue is motivated by two concerns: 1) statistical tools are best derived and applied in light of the stochastic models believed to underlie the problem; and 2) sources of our uncertainty, regarding both “nature” and our analyses, must be identified to assess results and suggest actions. In particular, we seek formal answers to the question “What is ‘climate variability’?” In the following sections we describe the modeling of the climate state **Ψ** from several viewpoints of the physical structure of the climate system and the associated uncertainties. We make no claim that these models are novel; for example, see Hasselmann (1976).

#### 1) Deterministic view

We may consider the climate state

$$\boldsymbol{\Psi} = \mu(i, f_A, f_E), \tag{1}$$

where *i* represents a quantification of Earth’s *ensemble* membership (“initial condition”), and *μ* is some deterministic function, namely, the trajectory of an integration of a physical model.

#### 2) Ensemble-stochastic view

Even if one accepts the determinism in Eq. (1), uncertainty arises due to lack of complete knowledge of *i.* Hence, we consider an *ensemble-stochastic* view in which *i* is viewed as a random quantity, say with density function *Q.* Specifically, consider a “signal model”

$$\boldsymbol{\Psi}(i, f_A, f_E) = \mu(\cdot, f_A, f_E) + \boldsymbol{\Psi}_e(i, f_A, f_E), \tag{2}$$

where the ensemble average *μ*(·, *f*_{A}, *f*_{E}) is defined by

$$\mu(\cdot, f_A, f_E) = \int \mu(i, f_A, f_E)\, Q(i)\, di.$$

Note that this corresponds to the usual ensemble averaging technique common in fluid dynamics. Also, in general, ensemble averages depend on *Q.*
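To make the ensemble-averaging idea concrete, consider the following minimal numerical sketch. It is ours, not from the paper: a toy recursion stands in for the deterministic trajectory *μ*(*i, f*_{A}, *f*_{E}), and *Q* is taken to be a standard normal density for the initial condition *i*; the ensemble average is then approximated by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(0)

def trajectory(i, f_A, f_E, n_steps=100):
    """Toy stand-in for mu(i, f_A, f_E): a damped recursion whose
    drift depends on the (scalar) forcings.  Purely illustrative."""
    psi = np.empty(n_steps)
    psi[0] = i
    for t in range(1, n_steps):
        psi[t] = 0.9 * psi[t - 1] + 0.1 * (f_A + f_E)
    return psi

# Ensemble average over initial conditions i ~ Q (here Q = N(0, 1)):
ensemble = np.array([trajectory(i, f_A=1.0, f_E=0.5)
                     for i in rng.normal(size=2000)])
mu_bar = ensemble.mean(axis=0)   # Monte Carlo estimate of mu(., f_A, f_E)
```

The ensemble mean forgets the initial condition as the damping acts, converging toward the forced equilibrium; a different choice of *Q* would, in general, yield a different ensemble average, as noted above.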

#### 3) Operational-stochastic view

Aspects of Eq. (1) beyond *i* are also unknown; in general, *μ, f*_{A}, and *f*_{E} are all unknown. The most expedient approach is to simply replace *μ*(·, *f*_{A}, *f*_{E}) by *μ̂*(·, *f̂*_{A}, *f̂*_{E}), where “hats” denote estimates. However, this approach does not readily enable us to account for our uncertainty in the estimated quantities. To do so, we may also employ probabilistic representations of these uncertainties. An *operational-stochastic* approach endows all unknown quantities with probability distributions. (This view coincides with the Bayesian approach to statistics.)

Another source of error in this context accrues from the need to approximate *μ* via numerical computation. Even with no other uncertainties, nonlinearities in the physical models suggest numerical problems. These uncertainties can also be managed via probabilistic modeling.

To summarize these uncertainties, we consider a “signal model”

$$\boldsymbol{\Psi} = \hat{\mu}(\cdot, \hat{f}_A, \hat{f}_E) + \mathbf{O}(i, f_A, f_E),$$

where **O**(*i, f*_{A}, *f*_{E}) represents “error.” The specification of a probability distribution for **O** is based on the sources of uncertainty described above.

Depending on the goal of an analysis, we might need to combine the operational- and ensemble-stochastic views. Consider the model

$$\boldsymbol{\Psi} = \hat{\mu}(\cdot, \hat{f}_A, \hat{f}_E) + \boldsymbol{\Psi}_e(i, f_A, f_E) + \mathbf{O}(i, f_A, f_E),$$

where **Ψ**_{e} is intended to account for ensemble variation and **O** for the sources of operational uncertainty. To handle this model we must produce not only distributions for the two error vectors but also their joint distribution (i.e., the dependence structure between the errors).

The issue of the meaning of probability warrants a comment here. We are liberal in our use of probability in this discussion. In particular, probability statements need not only hinge on the ensemble or frequency interpretation. Rather, we are prepared to model uncertain quantities as being amenable to probability. This view is in concert with the Bayesian approach to statistics (Berger 1985; Bernardo and Smith 1994).

#### 4) Intrinsic-stochastic view

By an *intrinsically stochastic* view we mean that **Ψ** is thought of as random with some probability distribution, say *P.* The key is that we would make this assumption even conditional on *i, f*_{A}, and *f*_{E}. A standard notation is to write

$$\boldsymbol{\Psi} \mid i, f_A, f_E \;\sim\; P(\cdot \mid i, f_A, f_E). \tag{5}$$

The notion is that no implementable form of the laws of physics permits us to envision a usable function *μ* that could completely determine **Ψ**. Nevertheless, we can entertain a “signal plus noise” model of the form

$$\boldsymbol{\Psi} = \mu(i, f_A, f_E) + \mathbf{N}(i, f_A, f_E),$$

where *μ* is based on an appropriate physical model and **N** represents unexplained noise. Note that **N** is not interpreted in the same fashion as the **Ψ**_{e}(*i, f*_{A}, *f*_{E}) indicated in Eq. (2). The question of whether the **N** and the **O** actually represent different quantities may be mainly philosophical to some. In a sense we agree, though we also believe there is value in raising the two views.

For the model to be well defined, distributional assumptions for the **N** process, conditional on *i, f*_{A}, and *f*_{E}, are needed. Typically, it is assumed that the expected value of **N** is zero. Note that this is an *assumption* in general, rather than a consequence of definition, as it is for anomalies from an ensemble average such as the **Ψ**_{e}’s defined in Eq. (2).

As before, we might consider combining these various interpretations as

$$\boldsymbol{\Psi} = \hat{\mu}(\cdot, \hat{f}_A, \hat{f}_E) + \boldsymbol{\Psi}_e(i, f_A, f_E) + \mathbf{O}(i, f_A, f_E) + \mathbf{N}(i, f_A, f_E).$$

### b. Statements of the climate change problem

Perhaps the most immediately relevant version of the climate change problem is the question, “Is **Ψ** significantly different than it would have been had *f*_{A} been zero?” Two clarifications merit emphasis. First, “significantly different” is intended to involve the practical, physical importance of variations in **Ψ**, as opposed to the notion of “statistical significance.” Second, in this query we are concerned with *this* ensemble member, as opposed to the anticipated ensemble-average impacts. In the notation of the previous section, we would study the quantity

$$\boldsymbol{\Psi}(i, f_A, f_E) - \boldsymbol{\Psi}(i, 0, f_E),$$

if available, though practically, we would hope to study

$$\boldsymbol{\Psi}(i, f_A, f_E) - \hat{\mu}(\cdot, 0, \hat{f}_E).$$
These statements of the issue are problematic for most classical statistical analyses, as we have but one realization under this Earth’s initial conditions. Thus, we cannot produce the distributions for the errors **O** and **N**. Assuming adoption of the model (2), more accessible statements of the climate change problem involve ensemble means: “Is *μ*(·, *f*_{A}, *f*_{E}) significantly different from *μ*(·, 0, *f*_{E})?” or, equivalently, is

$$\mu(\cdot, f_A, f_E) - \mu(\cdot, 0, f_E)$$

significantly different from zero? Though more tractable than testing **Ψ** [or equivalently **Ψ**_{e}(*i, f*_{A}, *f*_{E})] directly, the issues of 1) practical versus statistical significance and 2) uncertainties in *μ, f*_{A}, and *f*_{E} arise.

In an intrinsically stochastic formulation, a natural statement of the problem is to ask whether or not *P*(·|*i, f*_{A}, *f*_{E}) and *P*(·|*i,* 0, *f*_{E}) [see (5)] tend to produce significantly different climates. Oddly, in this statement the actual data is not relevant, if both candidate probability distributions are known. That is, we are not necessarily interested in classifying **Ψ** as coming from one of these two distributions, unless we first can argue that the distributions really are practically different. Of course, in our setting, neither distribution is known.

Most climate change studies focus on changing signals, though climatologists also recognize that changes in *variability* are also possible and important. Though amenable to statistical analysis, this issue will not be treated here.

### c. Other points

In viewing **Ψ** as the “true” values of climate variables, we should recognize that our observational data is not **Ψ** itself, but rather some version with *measurement error.* This additional source of error is often ignored in climate change studies or implicitly absorbed into error terms such as **Ψ**_{e}, **O**, and **N** defined earlier. However, measurement errors can have quite different structures than these other errors. As one example, we might expect nonstationarity in measurement errors over time, as technologies changed.

The usual formulation of the climate change problem found in the literature falls into the ensemble-stochastic view of section 2a(2). Specifically, the climate change problem is viewed from a signal-plus-noise standpoint with signal *μ*(·, *f*_{A}, *f*_{E}) and noise **Ψ**_{e}(*i, f*_{A}, *f*_{E}). The problem is then divided into two parts: 1) *detection* of a signal and 2) subsequent *attribution* of a detected signal to anthropogenic sources. In the remainder of this paper we focus on the statistical formulation of the usual techniques for detecting and attributing climate change. Thus, though we briefly mention alternative statistical strategies that exploit the other models presented in section 2a, we leave discussion and development of the operational- and intrinsic-stochastic views to future work.

## 3. Statistical analyses

### a. Detection: Hypothesis testing

The detection of a climate change signal may be viewed as the statistical testing of a hypothesis. To explore this issue, we review the formal hypothesis testing framework as viewed by statisticians. Both the conceptual interpretation of statistical tests and some of the formal theory leading to criteria for selecting test procedures are discussed. These ideas and formalities have their origins in the work of R. A. Fisher, J. Neyman, E. S. Pearson, etc. See Fisher (1990) and Lehmann (1993) for in-depth discussions and additional references, and Lehmann (1986) for an extended presentation of the formal theory. Casella and Berger (1990) provides an introductory, textbook presentation of statistical testing.

Fisher’s main view begins with the translation of a scientific theory into a statement concerning the probability distribution of some observable data. Let *X* represent some random variable or observable. Suppose that the probability distribution of *X,* say *f,* is known up to some parameter, *θ.* Common notation is that *X* ∼ *f*(*x* | *θ*), where *θ* is unknown but lies in some set Θ (*θ* ∈ Θ). Fisher assumed that the scientific hypothesis to be tested corresponds to a specific value of the unknown parameter, say *θ*_{0}. This value generates the *null hypothesis,* typically denoted by *H*_{0}: *θ* = *θ*_{0}. To test for “significance” in the data, Fisher devised the p-value: the probability, assuming the null hypothesis is true, of observing data at least as unfavorable to *H*_{0} as the data actually observed. The logic here is indirect; the suggestion is that small p-values are evidence against the null. It is important to note that the p-value is *not* the probability of the null given the observed data; that quantity plays no role in classical testing.

Fisher’s notion of testing is that it can be used to provide quantified evidence against hypotheses. The intention is that if the p-value is very tiny (0.05 has become a common cutoff, though some suggest 0.01 as more appropriate), the scientific hypothesis corresponding to *H*_{0} is untenable and warrants replacement. On the other hand, one cannot produce quantifiable evidence in favor of the null. However, Fisher noted that if one repeatedly tests a particular null and never achieves strong evidence against it, then one would continue to entertain that null hypothesis as a plausible theory.

A concept of data being “unfavorable to *H*_{0}” is required in the definition of the p-value. That is, we must have some notion of an alternative hypothesis in mind to implement the procedure. In other words, formal statistical tests do not indicate evidence about the null in a vacuum; rather, the quantifications involve comparisons against alternatives. To clarify, suppose that our statistical model is that *X* ∼ *N*(*θ,* 1), which is shorthand for “*X* has a normal (Gaussian) distribution with mean *θ* and variance 1.” Some theory yields *H*_{0}: *θ* = 0. We then observe *X* = *x*_{d}. (Assume that *x*_{d} > 0.) The p-value calculation says we should find the probability that a *N*(0, 1) random variable is as unfavorable to the null as *x*_{d}. If the alternatives of interest are simply that *θ* ≠ 0, the usual p-value is the probability that |*X*| ≥ *x*_{d}, where *X* ∼ *N*(0, 1). However, if the plausible alternatives are that *θ* > 0, the p-value is the probability that *X* ≥ *x*_{d}, where *X* ∼ *N*(0, 1). In general this is no small matter; in this case the two p-values differ by a factor of 2.
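The two-sided versus one-sided calculation above can be carried out directly; the following sketch uses only the standard normal cumulative distribution function (via `math.erfc`) and an illustrative observed value *x*_{d} = 1.96 of our choosing.

```python
import math

def phi(x):
    """Standard normal CDF, Phi(x) = Pr(X <= x) for X ~ N(0, 1)."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

x_d = 1.96  # illustrative observed value of X ~ N(theta, 1)

# Alternatives theta != 0: p-value = Pr(|X| >= x_d) under H0
p_two = 2.0 * (1.0 - phi(x_d))

# Alternatives theta > 0: p-value = Pr(X >= x_d) under H0
p_one = 1.0 - phi(x_d)
```

With *x*_{d} = 1.96, the two-sided p-value is about 0.05 and the one-sided p-value about 0.025, exactly the factor-of-2 difference noted above.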

J. Neyman and E. S. Pearson sought a more formal approach to testing. The notion is that a scientific theory partitions the set Θ into two sets, say Θ_{0} and Θ_{a}. We seek a test of the competing null and alternative hypotheses: *H*_{0}: *θ* ∈ Θ_{0} versus *H*_{a}: *θ* ∈ Θ_{a}. We can either 1) reject *H*_{0}, in favor of *H*_{a}, or 2) fail to reject *H*_{0}. (Failing to reject *H*_{0} does not necessarily mean we accept *H*_{0}. Statisticians and textbooks vary on this point.) Since there are two possible actions, there are two kinds of errors. 1) Type I error: rejection of *H*_{0} when it is in fact true, and 2) Type II error: failing to reject (or accepting) *H*_{0} when it is false.

A *test* is a rule for picking an action, based on the data. We choose a subset known as the *rejection region,* say *R,* of the set of all possible values of the data *X.* Should the observed data *x* ∈ *R,* we reject *H*_{0}; otherwise, we fail to reject *H*_{0}. The criteria for choosing *R* are based on probabilities of error. First, define the *power function* of the test *R* to be the probability, Pr_{θ}(*X* ∈ *R*), of rejection for a fixed value of *θ.* Intuitively, we wish to make the power small for *θ* ∈ Θ_{0} and large for *θ* ∈ Θ_{a}. However, there is clearly no unequivocal best test. (We could simply always reject, ensuring no error if *θ* ∈ Θ_{a}, or never reject, ensuring no error whenever *θ* ∈ Θ_{0}.) To make progress, the suggestion is to essentially fix one of the error probabilities. Namely, consider all tests with the property that the probability of Type I error is at most some preassigned, small number *α* [that is, Pr_{θ}(*X* ∈ *R*) ≤ *α* for all *θ* ∈ Θ_{0}]. Among all such tests, we then try to find *R* making the power large for *θ* ∈ Θ_{a}. Should there exist an *R* that makes the power as large as possible for every *θ* ∈ Θ_{a}, we say it is a uniformly most powerful (UMP), level *α* test.
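The power function is easy to compute in the running *X* ∼ *N*(*θ,* 1) example. For the one-sided test of *H*_{0}: *θ* ≤ 0 with rejection region *R* = {*x* : *x* ≥ *z*}, a short sketch (ours; the critical value 1.6449, the 0.95 quantile of *N*(0, 1), is hard-coded):

```python
import math

def phi(x):
    """Standard normal CDF."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

alpha = 0.05
z_crit = 1.6449  # 0.95 quantile of N(0, 1)

def power(theta):
    """Pr_theta(X in R) for R = {x : x >= z_crit}, with X ~ N(theta, 1)."""
    return 1.0 - phi(z_crit - theta)
```

At the boundary of the null (*θ* = 0) the power equals the level *α*, and it increases monotonically toward 1 as *θ* moves into the alternative, which is exactly the behavior sought when choosing *R*.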

In many cases, UMP tests fail to exist. In such cases it is common to further restrict the class of tests. One such restriction is to consider those level *α* tests for which the power outside the null hypothesis is greater than *α*; that is, Pr_{θ}(*X* ∈ *R*) ≥ *α* for all *θ* ∈ Θ_{a}. Such tests are said to be unbiased. Within this class one then searches for tests that maximize the power for all *θ* ∈ Θ_{a}. Such tests are uniformly most powerful unbiased, level *α* tests.

A second restriction often suggested is to consider those *α* level tests that display certain invariance properties. That is, tests that behave symmetrically under certain classes of transformations of the statistical formulation. For example, important classes of transformations involve “change of units” for physical measurements. Once an invariance theory is formulated, one can then search for tests maximizing power within the class of invariant *α* level tests. Further detail is beyond the scope of this paper.

Note that in both the Fisherian and Neyman–Pearson formulations, the two hypotheses are treated asymmetrically. The null hypothesis plays a special role in each case. (This issue is particularly important in developing techniques for attribution versus detection.) A fully decision theoretic approach, formulated early in the work of A. Wald, generally permits asymmetric treatment of actions. The key is that probabilities of errors are not necessarily the appropriate quantities for control and optimization. Rather, the (expected) losses associated with particular actions are minimized. Such procedures are particularly important in the study of remediation.

Finally, we mention the Bayesian approach. In this approach to statistics, all unknown quantities, including parameters, are viewed as random variables. In particular, the unknown, now “random,” *θ* is endowed with a prior probability distribution. We may compute the prior probability of *H*_{0} based on this distribution. The theory of probability provides a mechanism (Bayes’ theorem) for updating the distribution of *θ* conditional on the observed data, yielding the posterior distribution for *θ.* The Bayesian testing solution is then based on the posterior probabilities of the null and alternative hypotheses. We will pursue this avenue for climate change detection elsewhere; also, see Leroy (1998). For general discussion of both decision theory and Bayesian analyses, see Berger (1985).
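A minimal sketch of the Bayesian calculation, for the same *X* ∼ *N*(*θ,* 1) setting: take prior probability 1/2 on *H*_{0}: *θ* = 0 and, under the alternative, a *N*(0, *τ*²) prior on *θ* (both choices are ours, purely for illustration), so the marginal of *X* under the alternative is *N*(0, 1 + *τ*²).

```python
import math

def normal_pdf(x, var):
    """Density of N(0, var) at x."""
    return math.exp(-x * x / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def posterior_prob_null(x, tau2=4.0, prior_null=0.5):
    """Posterior probability of H0: theta = 0 given X = x, where
    X ~ N(theta, 1) and theta ~ N(0, tau2) under the alternative."""
    m0 = normal_pdf(x, 1.0)          # density of x under H0
    m1 = normal_pdf(x, 1.0 + tau2)   # marginal density of x under H_a
    return prior_null * m0 / (prior_null * m0 + (1.0 - prior_null) * m1)
```

Note that the posterior probability of the null is exactly the quantity the p-value is *not*: for example, at *x* = 1.96 (a two-sided p-value of 0.05) this posterior probability is roughly 0.32 under these priors, illustrating how differently the two approaches can quantify evidence.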

### b. Attribution: Statistical issues

Intuitively, the key to attribution of a phenomenon is the explanation of the observed results in terms of *causes,* coupled with the elimination of other potential causes. For the classical statistician, attribution is argued, though never unequivocally proven, from data obtained in *randomized* and *controlled* experiments. “Control” refers to the fact that experimental units are treated fairly under competing causes and differences are compared. “Randomization” is used to determine which units receive which treatments to (hopefully) guard against unforeseen potential causes and biases (Fisher 1990). Such approaches are not practical for climate change. First, the earth cannot serve as its own control. Historical records are subject to errors. Also, we believe a variety of climate changes, not anthropogenically induced, have occurred. Second, we cannot run small-scale, surrogate Earths under various conditions in a laboratory; the earth system is believed to be far too complex for relevant features to be captured by toys (Trenberth 1997). Hence, climatologists rely on large numerical models to provide controls, at best, or at least some useful information about the anticipated behavior of the twentieth-century climate without anthropogenic forcing.

From a foundational point of view, genuine attribution of climate change is impossible. Hence, statistical arguments are needed to provide scientific explanation of our understanding of climate change issues. The role of statistics in attempting to establish causation is a rich and fascinating issue. For brevity we only offer the following references: Holland (1986) and Good (1983).

## 4. Statistics and fingerprinting

### a. Detection as a hypothesis test

In order to formulate the fingerprint detection scheme as a statistical hypothesis test, we need to review and describe the model from which fingerprints are derived. We will utilize the notation and ideas of Hasselmann (1993), denoted *H* from here onward. The climate vector **Ψ** is assumed to be the sum of the signal **Ψ**^{S} and internal climate noise **Ψ̃**, so that

$$\boldsymbol{\Psi} = \boldsymbol{\Psi}^S + \tilde{\boldsymbol{\Psi}}. \tag{11}$$

The natural variation **Ψ̃** is assumed to be a random *n*-vector with mean zero and dispersion (covariance matrix) **C**(**x**, *t*). The space–time lagged covariances modeled through the matrix **C** = **C**(**x**, *t*) are presumed known for the moment. It may seem that we have changed notation from that in section 2. However, this notation emphasizes the fact that one can choose which of *μ* or *μ̂* to assign as **Ψ**^{S}, as well as which combination of **Ψ**_{e}, **O**, and **N** to assign to **Ψ̃**. Of course, how one prescribes **C** depends critically on these choices.

In practice, observations and signals are typically defined as anomalies from some baseline state, **Ψ**^{S}_{0}. Here, **Ψ**^{S}_{0} may represent the climatology of the system, the climate state at a specific time, or a time series of the climatic variables under given “control” conditions. For example, in studying global trends in temperature over time, Santer et al. (1996) defined the baseline state as the average temperatures observed in the period 1880–1920. Alternatively, the vector **Ψ**^{S}_{0} may be defined as the expected temperature trend in a system where greenhouse gas and aerosol concentrations are not increasing. Of course, this control state can only be estimated, typically using climate model output. In each of these approaches to formulating a baseline, a level of uncertainty [corresponding to the use of estimated (hatted) quantities in section 2] is introduced though seldom quantified.

Interest lies in studying the observed climate state and expected signal as they differ from the baseline. To this end, redefine the “data” **Ψ** and signal **Ψ**^{S} to be anomalies from **Ψ**^{S}_{0}. If no climate signal exists above that present in the natural climate variability, we would expect the state of the system **Ψ**, on average, to be no different from the baseline state **Ψ**^{S}_{0}; that is, **Ψ**^{S} will be zero. Hence, **Ψ**^{S} is the climate signal above and beyond the natural variability **Ψ̃**.

Assume we have a set of *p* expected signal patterns represented by the *n*-vectors **g**_{1}, . . . , **g**_{p}. These patterns may define climate change induced by increases in greenhouse gas concentrations, aerosol concentrations, regional climate changes, or other human activities expected to impact the climate system. These vectors are assumed known, as generated by climate models or expert opinion. Assume that the signal vector **Ψ**^{S} lies in the space spanned by the vectors **g**_{1}, . . . , **g**_{p}; that is,

$$\boldsymbol{\Psi}^S = \sum_{i=1}^{p} a_i \mathbf{g}_i \tag{12}$$

for some unknown set of coefficients *a*_{1}, . . . , *a*_{p}. In matrix notation we write

$$\boldsymbol{\Psi}^S = \mathbf{G}\mathbf{a}, \tag{13}$$

where **G** is an *n* × *p* matrix with columns corresponding to the vectors **g**_{1}, . . . , **g**_{p} and **a** is a *p*-vector containing the coefficients of the linear combination in Eq. (12).

The fingerprint methodology of *H* reduces **Ψ** to a sequence of *p* low-dimensional detectors *d*_{i} through the application of linear filters, or fingerprints, **f**_{i}; that is, *d*_{i} = **f**^{T}_{i}**Ψ**, *i* = 1, . . . , *p,* where the superscript T denotes transpose. Let **d**^{S} = (*d*^{S}_{1}, . . . , *d*^{S}_{p})^{T} = (**f**^{T}_{1}**Ψ**^{S}, . . . , **f**^{T}_{p}**Ψ**^{S})^{T}. The goal, then, is to determine the statistical significance of the optimally detected signal. This optimally detected signal is constructed by maximizing the distance between the signal and natural climate variability, or signal-to-noise ratio, over all possible fingerprints. More specifically, the fingerprints **f**_{i} are chosen to maximize the quantity

$$\rho^2(\mathbf{d}^S) = (\mathbf{d}^S)^T \mathbf{D}^{-1} \mathbf{d}^S,$$

where the (*i, j*)th element of **D** is

$$D_{ij} = \mathbf{f}_i^T \mathbf{C}\, \mathbf{f}_j.$$

Here, *H* shows that if the forced signal **Ψ**^{S} lies in a space spanned by known model prediction pattern vectors **g**_{1}, . . . , **g**_{p}, then the solution to this maximization problem is **f**^{\*}_{i} = **C**^{−1}**g**_{i}. The statistical significance of the optimally detected signal is then determined through the statistic *ρ*^{2}(**d**\*), where the vector **d**\* is given by

$$\mathbf{d}^* = \mathbf{G}^T \mathbf{C}^{-1} \boldsymbol{\Psi}. \tag{14}$$

There is a direct relationship between the optimal fingerprint method of *H* and most powerful tests of coefficients in a corresponding regression problem. To provide a perspective to this view, first note that the cornerstone of the approach is to make inferences about the coefficients *a*_{1}, . . . , *a*_{p}. In settings in which the patterns are estimated from model runs, it would seem that these *a*_{i}’s are also determined. However, we believe the view is that the models are capable of capturing the structure and patterns of climate change, reflected by the **g**_{i}, but not the precise magnitudes as would be implied by both the *a*_{i}’s and the **g**_{i}’s taken as a group. Rather, the key is to treat the patterns as essentially correct, and then decide whether or not the patterns appear to be present in observational data by regressing the data on the patterns, that is, estimating the *a*_{i}’s from the data.

To formally relate the signal detection problem to a statistical multiple regression problem, note that combining Eqs. (11) and (13) implies

$$\boldsymbol{\Psi} = \mathbf{G}\mathbf{a} + \tilde{\boldsymbol{\Psi}}.$$

In the statistical regression literature, **G** is termed the design matrix and **a** the regression coefficients. Next, optimal estimates of the signal **Ψ**^{S} are obtained from the generalized least squares estimates (Weisberg 1985, section 4a) of the regression coefficients, denoted **â**. Namely,

$$\hat{\mathbf{a}} = (\mathbf{G}^T \mathbf{C}^{-1} \mathbf{G})^{-1} \mathbf{G}^T \mathbf{C}^{-1} \boldsymbol{\Psi}$$

and

$$\hat{\boldsymbol{\Psi}}^S = \mathbf{G}\hat{\mathbf{a}}.$$

The estimate **â** is rightly viewed as the Gauss–Markov optimal procedure; note too that it is simply a rotation of the projected fingerprints **d**\* given in Eq. (14).
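The relationship between the optimal detectors and the generalized least squares estimate can be sketched numerically. The following toy example (ours: random patterns, a diagonal **C**, and coefficients of our choosing) forms **d**\* = **G**^{T}**C**^{−1}**Ψ** and recovers **â** by solving (**G**^{T}**C**^{−1}**G**)**â** = **d**\*:

```python
import numpy as np

rng = np.random.default_rng(1)

n, p = 50, 2
G = rng.normal(size=(n, p))                    # assumed signal patterns g_1, g_2
c_diag = rng.uniform(0.5, 2.0, size=n)
C = np.diag(c_diag)                            # noise dispersion (diagonal for simplicity)
a_true = np.array([1.0, -0.5])                 # illustrative "true" coefficients
psi = G @ a_true + rng.normal(size=n) * np.sqrt(c_diag)   # Psi = G a + noise

Cinv = np.linalg.inv(C)
A = G.T @ Cinv @ G                  # p x p matrix G^T C^{-1} G
d_star = G.T @ Cinv @ psi           # optimal detectors d* = G^T C^{-1} Psi
a_hat = np.linalg.solve(A, d_star)  # GLS estimate: a rotation of d*
```

The linear map (**G**^{T}**C**^{−1}**G**)^{−1} applied to **d**\* is exactly the “rotation” referred to above, so fingerprinting and generalized least squares produce equivalent inferences.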

Statistical inference can be based on the fact that, under Gaussian assumptions, **â** has a multivariate normal distribution:

$$\hat{\mathbf{a}} \sim N_p\!\left(\mathbf{a},\; (\mathbf{G}^T \mathbf{C}^{-1} \mathbf{G})^{-1}\right).$$

For example, consider testing the hypothesis of no signal in *H,* which may be equivalently written as

$$H_0\!: \mathbf{a} = \mathbf{0}. \tag{17}$$

If the dispersion matrix **C** is known, the uniformly most powerful invariant test of Eq. (17) rejects *H*_{0} if

$$T = \hat{\mathbf{a}}^T (\mathbf{G}^T \mathbf{C}^{-1} \mathbf{G})\, \hat{\mathbf{a}} > \chi_p^2(1 - \alpha), \tag{18}$$

where *χ*^{2}_{p}(1 − *α*) is the critical value at level *α* based on the chi-square distribution with *p* degrees of freedom (see Lehmann 1986, section 8.7 or Seber 1984). The restriction to invariant tests arises here because no UMP test exists.

Notice the test statistic *T* in Eq. (18) is equivalent to the statistic utilized to assess the significance of the optimally detected signal in *H.* If **f**\* = **C**^{−1}**G** denotes the optimal fingerprint,

$$T = \rho^2(\mathbf{d}^*) = \boldsymbol{\Psi}^T \mathbf{C}^{-1}\mathbf{G}\, (\mathbf{G}^T \mathbf{C}^{-1} \mathbf{G})^{-1}\, \mathbf{G}^T \mathbf{C}^{-1} \boldsymbol{\Psi}.$$

In the single pattern problem (*p* = 1),

$$\boldsymbol{\Psi}^S = a\,\mathbf{g}_1,$$

where *a* is an unknown constant. It can be shown that the test in Eq. (18) is also a UMP unbiased test (Lehmann 1986, sections 5.8 and 7.7 or Seber 1984).
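The detection test with known **C** can be written as a short routine. This is a sketch of ours with hand-picked patterns and data; the chi-square critical value for *p* = 2 patterns at *α* = 0.05, approximately 5.991, is hard-coded rather than computed.

```python
import numpy as np

def detection_test(psi, G, C, chi2_crit):
    """Test H0: a = 0 via T = a_hat^T (G^T C^{-1} G) a_hat;
    reject when T exceeds the chi-square critical value."""
    Cinv = np.linalg.inv(C)
    A = G.T @ Cinv @ G
    a_hat = np.linalg.solve(A, G.T @ Cinv @ psi)
    T = float(a_hat @ A @ a_hat)
    return T, T > chi2_crit

# p = 2 patterns, alpha = 0.05: chi-square(2) critical value ~ 5.991
G = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
C = np.eye(3)
T1, rej1 = detection_test(np.array([2.0, 2.0, 4.0]), G, C, 5.991)  # strong signal
T0, rej0 = detection_test(np.zeros(3), G, C, 5.991)                # no signal
```

Data built from a strong multiple of the patterns yields a large *T* and a rejection, while data with no signal yields *T* = 0 and no rejection, as Eq. (18) prescribes.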

#### 1) Unknown **C**

If the dispersion **C** is unknown, *H* suggests estimating the covariance structure from historical data independent of the observations **Ψ**. Let **Y**_{1}, . . . , **Y**_{m} denote the historical data and **Ȳ** denote the mean of the historical sample. If each **Y**_{i} is identically distributed *N*(**τ**, **C**) with unknown mean **τ**, then the maximum likelihood estimator

$$\hat{\mathbf{C}} = \frac{1}{m} \sum_{i=1}^{m} (\mathbf{Y}_i - \bar{\mathbf{Y}})(\mathbf{Y}_i - \bar{\mathbf{Y}})^T$$

may be utilized as an estimate for the dispersion **C**. Substituting this estimate into the test statistic in Eq. (18) suggests that a test of Eq. (17) would reject the null hypothesis when

$$T_Y = \boldsymbol{\Psi}^T \hat{\mathbf{C}}^{-1}\mathbf{G}\, (\mathbf{G}^T \hat{\mathbf{C}}^{-1} \mathbf{G})^{-1}\, \mathbf{G}^T \hat{\mathbf{C}}^{-1} \boldsymbol{\Psi}$$

is large. Unfortunately, the determination of a “large” *T*_{Y} is not as easily accomplished as in the known-dispersion case. However, if *m* is large, use of the *χ*^{2} critical value as in Eq. (18) may be reasonable.
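The maximum likelihood estimator of **C** is a one-line computation. The sketch below (ours) generates synthetic “historical” fields with a known dispersion so the estimate can be checked; note that *m* must be large relative to *n* for **Ĉ** to be well conditioned enough to invert in *T*_{Y}.

```python
import numpy as np

rng = np.random.default_rng(2)

def estimate_C(Y):
    """MLE of the dispersion from an m x n matrix of historical fields:
    C_hat = (1/m) * sum_i (Y_i - Ybar)(Y_i - Ybar)^T."""
    m = Y.shape[0]
    resid = Y - Y.mean(axis=0)
    return resid.T @ resid / m

n, m = 5, 500                      # m >> n keeps C_hat well conditioned
Y = 2.0 * rng.normal(size=(m, n))  # synthetic historical runs; true C = 4 I
C_hat = estimate_C(Y)
```

With *m* = 500 samples of a 5-dimensional field, **Ĉ** lands close to the true dispersion; in realistic fingerprint studies *n* is enormous relative to the available control-run length, which is precisely why the determination of a “large” *T*_{Y} is delicate.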

#### 2) Fingerprinting as regression

The regression formulation described above allows for the implementation of a sequential fingerprint detection scheme when the signal patterns **g**_{i}, *i* = 1, . . . , *p* (which may represent changes in climate due to carbon dioxide, aerosols, regional variables, etc.), form a hierarchical set: that is, when the climatologist wishes to assess the significance of the patterns in a prescribed order of importance, for example, carbon dioxide first, followed by aerosols, and so on. More specifically, multipattern signal detection is typically performed as follows. Initially test for the appearance of the pattern **g**_{1} in the data. If this pattern is detected, add the second pattern **g**_{2}, and so on, until a pattern is no longer detected in the observations. The multipattern detection is thus reduced to a sequence of tests on univariate detectors.

This procedure is analogous to a forward stepwise routine for building a regression model (Weisberg 1985, chapter 8). Let *a*_{i} denote the regression coefficient corresponding to pattern **g**_{i}. We sequentially test if each *a*_{i} is zero given all other signal patterns **g**_{1}, . . . , **g**_{i−1} are in the regression model. Continue adding patterns to the regression equation until the first acceptance of a null hypothesis. Hence, rather than testing the hypothesis in Eq. (17) that all regression coefficients *a*_{1}, . . . , *a*_{p} are zero, test each regression coefficient separately and sequentially.
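The forward-stepwise scheme can be sketched as follows. This is our illustration, not a published implementation: at each step the newest coefficient is tested with a z-statistic formed from its generalized least squares standard error, given the patterns already in the model.

```python
import numpy as np

def sequential_detection(psi, G, C, z_crit=1.96):
    """Forward-stepwise detection sketch: add patterns g_1, g_2, ... in
    the prescribed order; at each step test whether the newest coefficient
    is nonzero given the patterns already included, and stop at the first
    acceptance of a null hypothesis.  Returns 1-based detected indices."""
    Cinv = np.linalg.inv(C)
    detected = []
    for k in range(1, G.shape[1] + 1):
        Gk = G[:, :k]
        A = Gk.T @ Cinv @ Gk
        a_hat = np.linalg.solve(A, Gk.T @ Cinv @ psi)
        se = np.sqrt(np.linalg.inv(A)[k - 1, k - 1])  # std. error of newest a_k
        if abs(a_hat[-1]) / se <= z_crit:
            break                                     # first acceptance: stop
        detected.append(k)
    return detected

# Two orthogonal patterns; the data contain only the first one.
G = np.column_stack([np.ones(4), np.array([1.0, -1.0, 1.0, -1.0])])
psi = 10.0 * G[:, 0]
```

Here the first pattern is detected and the procedure stops upon failing to detect the second. The cautions below about ordering and overstated significance apply directly to this kind of routine.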

Providing these analogies between fingerprinting and statistical regression analysis is not merely an academic exercise. Rather it enables transfer of technology as well as cautions. First, the interpretation of parameters in multiple regression can be problematic. While the notation is casual, the meaning of *a*_{1} in the univariate model *a*_{1}**g**_{1} is not the same as that in the model *a*_{1}**g**_{1} + *a*_{2}**g**_{2}. In the second case, *a*_{1} is roughly the rate of change in the signal in varying the first pattern while holding the second pattern constant. In some not entirely pathological examples, a significant, say positive *a*_{1} based on the first model may become insignificant or even change sign when **g**_{2} is added to the model. More generally, relationships among the patterns can create difficulties in interpretation and analysis. The issue is known as multicollinearity in statistics (Weisberg 1985, chapter 8). Some concerns can be mitigated by regressing on appropriately defined empirical orthogonal functions of the patterns. (This is known as principal components regression in statistics.) However, this creates difficulties in interpretation of the derived “patterns.” Related problems arise in defining a “carbon dioxide plus aerosols” pattern. Interactions among these forcings generally suggest that this pattern is not simply obtained from the pattern for carbon dioxide and the pattern for aerosols.
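A two-dimensional toy illustration of the sign change, with artificial patterns chosen purely for arithmetic clarity: the data are built with a negative weight on **g**_{1}, yet the univariate coefficient comes out positive.

```python
import numpy as np

# two correlated artificial "patterns" and data with a negative weight on g1
g1 = np.array([1.0, 0.0])
g2 = np.array([1.0, 1.0])
psi = -0.5 * g1 + 1.0 * g2             # = (0.5, 1.0)

# univariate fit on g1 alone: the coefficient appears positive
a1_alone = (g1 @ psi) / (g1 @ g1)

# joint fit on (g1, g2): the g1 coefficient recovers its true negative value
G = np.column_stack([g1, g2])
a_joint = np.linalg.solve(G.T @ G, G.T @ psi)
```

Here `a1_alone` is 0.5 while the joint fit returns −0.5 for the first coefficient: the two models answer different questions about **g**_{1}.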

A second issue involves the above sequential, forward stepwise analysis. In general, the results strongly depend upon the order in which patterns are added. Further, the advertised *α* levels for the sequence of tests are typically not valid. Indeed, these procedures are notorious for overstating the overall statistical significance of results.

### b. Attribution

Suppose the detection test of the previous section does indeed reject the null hypothesis of no climate change signal above baseline. We cannot immediately attribute the perceived signal to any of the patterns **g**_{1}, . . . , **g**_{p}. Though pure causal relationships cannot be obtained in this context, one can test whether the forcing patterns **G** are consistent with the observations.

Hasselmann (1997) pursues this line of inquiry, roughly as follows. Climate model data is used to construct an estimate of the expected signal amplitude **â**_{M} under a selected forcing specification; the subscript *M* indicates a model-based derivation. An estimate **â**_{obs} can be constructed from the climate observations. A comparison of **â**_{M} and **â**_{obs} will determine if the observed climate signal is consistent with the expected, model-based signal.

The uncertainty associated with estimating the climate change signal via **â**_{M} and **â**_{obs} is derived from the model-based and natural climate variability, respectively. To this end, Hasselmann (1997) assumes

**â**_{M} ∼ *N*(**a**_{M}, **C**_{M})

and

**â**_{obs} ∼ *N*(**a**, **C**_{obs}),

where **a**_{M} and **a** are the true coefficients and **C**_{M} and **C**_{obs} denote the variability in the estimates. Note that

**C**_{obs} = (**G**^{T}**C**^{−1}**G**)^{−1},

where **C** is the dispersion matrix of the natural climate variability as discussed earlier.

A test of consistency between the model-predicted and observed climate signals analyzes the hypothesis

*H*_{0}: **a** = **a**_{M}. (20)

The normality assumptions and an independence supposition yield the distribution

**â**_{obs} − **â**_{M} ∼ *N*(**0**, **C**_{obs} + **C**_{M})

under the null hypothesis. The null hypothesis is rejected if

(**â**_{obs} − **â**_{M})^{T}(**C**_{obs} + **C**_{M})^{−1}(**â**_{obs} − **â**_{M}) > *χ*^{2}_{1−α},

where *χ*^{2}_{1−α} is the critical value at level *α* based on the chi-square distribution with *p* degrees of freedom. If this test fails to reject, evidence for consistency is claimed.

If **C** and **C**_{M} are unknown, one can construct estimates **Ĉ** and **Ĉ**_{M} via climate model control runs. In particular, **C** can be estimated from climate model output in which anthropogenic influences on the climate system are negligible. (One could also estimate **C** based on data as in the earlier detection analysis.) Here, **C**_{M} can be estimated from repeated runs of the climate model utilized to construct **â**_{M}. The consistency test is then based on Hotelling’s *T*-distribution or a variant thereof (analogous to the detection test derived in section 4a). See Hegerl et al. (1997) for an application. Note that direct substitution of estimated covariances into tests derived under known covariance assumptions is best viewed as an approximation. See section 5 for more discussion.
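A minimal sketch of the consistency test under known dispersions; the amplitudes and covariance matrices below are hypothetical, and the quadratic-form statistic is the one discussed above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
p = 2                                   # number of signal patterns

a_true = np.array([1.0, 0.5])           # hypothetical common signal amplitude
C_obs = 0.2 * np.eye(p)                 # variability of the observational estimate
C_M = 0.1 * np.eye(p)                   # variability of the model-based estimate

# consistent case: both estimates center on the same true amplitude
a_obs_hat = rng.multivariate_normal(a_true, C_obs)
a_M_hat = rng.multivariate_normal(a_true, C_M)

# chi-square consistency statistic for H0: a = a_M
d = a_obs_hat - a_M_hat
T = d @ np.linalg.solve(C_obs + C_M, d)
consistent = T <= stats.chi2.ppf(0.95, df=p)   # fail to reject -> "consistency"
```

As the text notes, failing to reject here carries no controlled error rate for the claim of consistency; that deficiency motivates the equivalence formulation of the next subsection.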

These consistency tests have several statistical deficiencies. First, as mentioned in section 3, some statisticians do not accept the notion of demonstrating evidence for a null hypothesis. That is, the consistency hypothesis, **a** = **a**_{M}, cannot be “accepted” in a Fisherian convention. Even if one relaxes this convention, it is difficult to ascribe an error rate or measure of confidence if one accepts the null. While we know what it means to reject a null hypothesis at some *α* level, this does not provide a valuable error rate if we accept. The error of interest (Type II, as described in section 3) is that of incorrectly accepting. The probability of this error (one minus the “power”) depends on the true value of **a**, which is unknown.

Second, the consistency test is performed upon rejecting the null hypothesis of the detection test in Eq. (17). This two-step procedure complicates a correct calculation of error probabilities in the consistency test. Specifically, such calculation must account for the probability of incorrectly detecting a climate change when one does not exist and the probability of correctly detecting a climate change when one does exist as well as the probability of incorrectly rejecting the null hypothesis in Eq. (20).

### c. Alternative procedures for attribution

The search for consistency described in section 4b is analogous to demonstrations of *equivalence* in the statistics literature. For example, biostatisticians are often confronted with the problem of determining if generic drugs are equivalent to a brand name in that they become available at the drug action site at approximately the same rate and concentration. Government regulations require the proof of bioequivalence before a generic drug may be marketed. (See Berger and Hsu 1996.)

The test for consistency is essentially one of geoequivalence: testing if the observed and model-predicted climate change signals are equivalent. We can thus utilize the established bioequivalence testing procedures for a test of geoequivalence. These procedures are derived from a test of the hypothesis

*H*_{0}: **a** ≠ **a**_{M} versus *H*_{1}: **a** = **a**_{M}. (21)

Notice the null hypothesis from the consistency test (20) appears in the alternative here.

An easily applicable test of (21) was developed by Brown et al. (1995). Let **θ** = **a** − **a**_{M}, **θ̂** = **â**_{obs} − **â**_{M}, and **Σ** = **C**_{obs} + **C**_{M}. Under the assumption that

**θ̂** ∼ *N*(**θ**, **Σ**),

we have evidence against the null hypothesis and in favor of geoequivalence at level *α* if the (confidence) set

does not contain zero. Here, *z*_{α} is the *α* critical value from a standard normal distribution. These confidence sets are optimal in the sense that they minimize the expected volume of the set among all confidence sets with coverage probability 1 − *α* (Brown et al. 1995).
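For a single pattern (*p* = 1), a simpler and widely used relative of the optimal sets of Brown et al. (1995) is the classical two one-sided tests (TOST) procedure from the bioequivalence literature: declare equivalence within a margin *d* only if both one-sided nulls are rejected. The margin and the variance below are hypothetical choices, and this is a sketch of TOST, not of the Brown et al. set itself.

```python
from scipy import stats

def tost_equivalence(theta_hat, sigma, d, alpha=0.05):
    """Two one-sided tests: evidence for |theta| < d at level alpha if both
    one-sided nulls (theta <= -d and theta >= d) are rejected."""
    z = stats.norm.ppf(1 - alpha)
    lower_ok = (theta_hat + d) / sigma > z     # reject theta <= -d
    upper_ok = (d - theta_hat) / sigma > z     # reject theta >= d
    return bool(lower_ok and upper_ok)

# small estimated discrepancy, tight uncertainty, margin d = 0.5 (hypothetical)
equiv = tost_equivalence(theta_hat=0.05, sigma=0.1, d=0.5)
```

Unlike the consistency test, a declaration of geoequivalence here carries the stated level *α* as its error rate.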

If **Σ** is unknown, approximations can again be based on independent estimates of the required dispersion matrices.

Alternatively, the so-called simple versus simple hypothesis testing formulation poses the detection–attribution problem as a single hypothesis test. Recall we have estimates of both **a** and **a**_{M} as derived from observed and climate model data, respectively. The test of the hypothesis

*H*_{0}: **a** = **0** versus *H*_{1}: **a** = **a**_{M} (23)

arguably considers detection and consistency in one analysis. If the null hypothesis is rejected, one might claim evidence that a signal exists and is consistent with the climate model predicted signal, analogous to the conclusions sought by Hasselmann (1997).

For **C** assumed known, the most powerful *α* level test is known to be “reject *H*_{0}” if

*T*_{M} = (**Ga**_{M})^{T}**C**^{−1}**Ψ**/[(**Ga**_{M})^{T}**C**^{−1}**Ga**_{M}]^{1/2} ≥ *z*_{α}, (24)

where *z*_{α} is the appropriate critical value based on a standard normal distribution (e.g., *z*_{.05} = 1.645).

This simple versus simple view admits an enhancement. The test in Eq. (24) is actually UMP for testing the hypotheses

*H*_{0}: **a** = **0** versus *H*_{1}: **a** = *d***a**_{M}, *d* > 0, (25)

where *d* is unspecified. (Note that any specific value assigned to *d* > 0 would cancel in the calculation of *T*_{M}.) That is, the magnitude of **a**_{M} is not important; rather, the test looks in the direction indicated by **a**_{M}. See Hegerl et al. (1997) for discussion and implementations.

## 5. Practical considerations and discussion

### a. Covariance estimation

It is well-recognized in the climate change literature that the estimation of dispersion matrices (e.g., **C**) is a difficulty regardless of one’s view of the detection–attribution problems. Misrepresentation of the correlation structure may lead to incorrect conclusions (Hegerl and North 1997). Hasselmann (1997) suggests that dispersion matrices “can be estimated from observations or model simulations with sufficient accuracy for a meaningful signal-to-noise analysis.” However, we believe that making “sufficient accuracy” precise may be difficult.

Often, a long control run from a climate model with no anthropogenic forcing is utilized to estimate **C**. Under assumptions of separability between the temporal and spatial correlations, the estimation procedure is tractable (see, for example, Hasselmann 1993; Hegerl et al. 1997; and Santer et al. 1995a for applications). However, improved climate model representations of the natural climate variability as well as reasonable statistical techniques for estimating **C** are necessary to realize the full potential of the fingerprint detection scheme (Hegerl et al. 1997).

Use of observations, as in section 4a(1), rather than model output may well appear to be more readily defensible than relying on models only. However, observations potentially contain variations and spatial dependences not entirely attributable to true natural climate variability. First, they contain measurement errors and biases. Further, data such as **Ψ** and the **Y**_{i}’s in section 4a(1) are typically produced using objective analysis type procedures. Such procedures can induce spatial structure due to human averaging that need not reflect nature. A second potential issue involves the choice of baselines. Recall that **Ψ** is actually an anomaly from a particular baseline (denoted by **Ψ**^{S}_{0} in section 4). Clearly, if we use the procedure in section 4a(1), it is important that the resulting estimate of **C** is not biased by differences between **Ψ**^{S}_{0} and **Y**.

Even if all of the potential bias issues raised above are judged to have negligible effects, results based on the use of estimates as if they are “true” parameter values must be viewed as approximations. The problem of course is gauging the quality of approximation. The problem is not easy. For example, recall the key development in section 4 involved the estimate **â** = (**G**^{T}**C**^{−1}**G**)^{−1}**G**^{T}**C**^{−1}**Ψ** and its distribution given in Eq. (16). If we simply replace **C** by an unbiased estimate [as in section 4a(1)], the resulting estimate of **a** is neither unbiased nor normally distributed. While this is not severe as *m* → ∞, gauging the impact for a particular *m* is not trivial. As noted in Hasselmann (1997), simulation can be a valuable tool in this process. Of course, simulation is merely an alternative approximation, useful primarily when the cost in performing massive simulation experiments is small.
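A small Monte Carlo sketch of this gauging problem: the nominal 5% detection test is run repeatedly under the null with **C** replaced by an estimate from *m* control samples, and the empirical rejection rate is recorded. All dimensions and sample sizes are hypothetical; with *m* small relative to the field dimension, the empirical size often departs from the nominal level.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, p, m, reps = 5, 1, 12, 2000          # small m relative to n (hypothetical)
G = rng.normal(size=(n, p))
crit = stats.chi2.ppf(0.95, df=p)

rejections = 0
for _ in range(reps):
    Y = rng.normal(size=(m, n))         # control run, true C = I
    C_hat = np.cov(Y, rowvar=False)     # plug-in covariance estimate
    Ci = np.linalg.inv(C_hat)
    psi = rng.normal(size=n)            # data under the null
    T = psi @ Ci @ G @ np.linalg.solve(G.T @ Ci @ G, G.T @ Ci @ psi)
    rejections += T > crit

empirical_size = rejections / reps      # compare with the nominal 0.05
```

Repeating such an experiment over a grid of *m* values gives a direct, if approximate, picture of how fast the plug-in test approaches its nominal behavior.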

### b. Correlation based detection

A common implementation of the fingerprint notion is to detect a climate change signal by studying some measure of dependence, for example, correlation between the expected, climate model constructed anthropogenic signal and climate observations. [Barnett (1986) provides the first application of this technique. For further applications, see Karoly et al. (1994), Santer et al. (1996), Santer et al. (1995b), Tett et al. (1996), and Wigley and Santer (1990).] The fingerprint in these applications is a time series of the correlation-like statistics measuring the temporal change in similarity between the expected (under some specification of forcing) climate signal and the climate observations.

One motivation for such studies is a relationship between these techniques and the methodology presented in section 4a. More generally, regression and correlation analyses are related. Suppose that for each fixed time epoch, we let an *s*-vector **g** be a model-derived estimate of the climate signal change under some forcing specification. (*s* denotes the number of spatial grid points considered.) That is, **g** represents spatial departures from a mean model behavior. Assume that for each time period *t,* the *s*-vector of climate observations **Ψ**_{t} (actually, the departures from the spatial mean of the observations, which are in turn anomalies from some baseline) follows the model

Paralleling section 4a, assume that for each *t*, **Ψ̃**_{t} ∼ *N*(**0**, **C**_{s}), where **C**_{s} is an *s* × *s* covariance matrix, and *a* is an unknown regression coefficient. Hence, **Ψ**_{t} is a snapshot of a particular climate variable anomaly in space at one point in time.

Consider (dependence on time is suppressed) the generalized least squares estimate of *a* at time *t*,

*â*_{t} = (**g**^{T}**C**^{−1}_{s}**g**)^{−1}**g**^{T}**C**^{−1}_{s}**Ψ**_{t}.

Under the above assumptions, we have that each *Z*_{t} = (**g**^{T}**C**^{−1}_{s}**g**)^{1/2}*â*_{t} is normally distributed with mean (**g**^{T}**C**^{−1}_{s}**g**)^{1/2}*a* and variance one, and can be used to test the hypothesis

*H*_{0}: *a* = 0

at that time.
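A sketch of this single-time detector, with a hypothetical pattern and grid and **C**_{s} taken known:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
s = 30                                  # spatial grid points (hypothetical)
g = rng.normal(size=s)                  # model-derived signal pattern
C_s = np.eye(s)                         # spatial covariance of natural variability

psi_t = 0.4 * g + rng.normal(size=s)    # observed field at time t, signal a = 0.4

Ci_g = np.linalg.solve(C_s, g)          # C_s^-1 g
a_hat = (Ci_g @ psi_t) / (g @ Ci_g)     # GLS estimate of a at time t
Z_t = np.sqrt(g @ Ci_g) * a_hat         # ~ N((g' C_s^-1 g)^{1/2} a, 1)

reject = abs(Z_t) > stats.norm.ppf(0.975)   # two-sided test of H0: a = 0
```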

We remark that in some cases authors have not adjusted for **C**_{s}, and consider the correlation statistic

*r*_{t} = **g**^{T}**Ψ**_{t}/(‖**g**‖ ‖**Ψ**_{t}‖).

If **C**_{s} is an identity matrix, it can be shown that testing the null *H*_{0}: *a* = 0 is equivalent to testing the null that the “true” correlation between the observations and the model predictions is zero, based on *r*_{t}. However, two critical points arise. First, the suggestion that **C**_{s} is the identity seems untenable, leading us to recommend testing based on the statistic *Z*_{t} rather than *r*_{t}.

Second, a more foundational concern arises. Formally, genuine correlation is measured between two random quantities by definition. That is, a true correlation being tested actually means that one is assuming that **g** is a realization of some stochastic process. But here’s the rub: if we endow the model with a stochastic background, then the spatial covariance matrix of **g** should also enter into the analysis. (This object also arose in the geoequivalence test.) See Smith (1997) for discussion and suggestions for handling both covariance structures.

Current correlation-like schemes measure similarity between climate observations and the expected climate signal by studying the trend of *r*_{t} over time. The idea is analogous to allowing the regression coefficient *a* in Eq. (26) to vary in time, denoted say *a*(*t*). Again, such trend analysis should take account of the spatial correlations (both of the data and model results) as well as potential temporal covariance structures.

### c. Interpretation of testing results

Given the scientific and societal importance of climate change studies, it is worthwhile to place the value of classical statistical hypothesis testing results in perspective. Hegerl and North (1997), Hegerl et al. (1997), and Hegerl et al. (1996) suggest that the hypothesis test in Eq. (17) and its cousins are conservative in the sense that a bad “guess” for **G** “will not wrongly detect change where change does not exist” (Hegerl and North 1997). A statistical defense for this claim requires that an error rate be appended. That is, if the null hypothesis is true, incorrectly rejecting it and claiming evidence for change is an event of probability *α.* This claim is indeed correct no matter how well or poorly the alternatives have been constructed.

Of course, this claim hinges on the assumptions under which the probability calculation is done (e.g., normality, etc.). However, more subtle concerns may arise. The claim associates no climate change with the null hypothesis holding exactly. It need not carry over if one thinks of the test as an approximation to a test of a fuzzy null hypothesis that **a** ≈ **0**. This is particularly problematic in that we may have little confidence in the meaning of precise, nonzero values of the *a*_{i}’s. Hence, quantifying the level of approximation desired in the fuzzy null is difficult.

To clarify potential sources of concern, we consider a very special example: a test of the hypotheses given in Eq. (23) in the case of a single pattern (*p* = 1) where *a*_{M} > 0 and assuming that **C** = **I**. It can be shown that the rejection rule reduces to

*T*_{M} = **g**^{T}**Ψ**/(**g**^{T}**g**)^{1/2} ≥ *z*_{α}.

The sign of *a*_{M} determines the test, but the value of *a*_{M} is irrelevant. The condition *T*_{M} = *z*_{α} generates a hyperplane in the data space perpendicular to the vector **g**.

Consider a two-dimensional data example. Suppose **g** = (1 3)^{T} and *α* = .05 (*z*_{.05} = 1.645). The rejection hyperplane is the line

**Ψ**_{1} + 3**Ψ**_{2} = 1.645(10)^{1/2} ≈ 5.20.

This line intersects the line generated by **g** = (1 3)^{T} at the point (0.52, 1.56). Hence, for observed data (0.525, 1.575), one would claim detection at an *α* = 0.05 error rate. However, suppose our computer model had suggested the value *a*_{M} = 1.05. We may not be comfortable with a claim of detection since the actual data are compatible neither with the null hypothesis nor with the suggested alternative. Intuitively, these data leave us “50–50” on which hypothesis we believe, or make us question the validity of either hypothesis or our other assumptions. The situation is particularly disturbing if *a*_{M} is much larger than 1.05, since the data are more compatible with the null (which is being rejected) than with the entire model output (which is being accepted). This concern is amplified in cases for which the data fall in the rejection region yet are far away from the anticipated alternative **g**. Such concerns may well be a motivation for the consistency test of Hasselmann (1997) discussed earlier.
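The arithmetic of this example can be verified directly, using the reduced statistic derived above:

```python
import numpy as np

g = np.array([1.0, 3.0])
z_crit = 1.645                           # alpha = 0.05, one-sided

def T_M(psi):
    """Reduced statistic g' psi / (g' g)^{1/2} for C = I, p = 1."""
    return (g @ psi) / np.sqrt(g @ g)

# intersection of the rejection line T_M = z_crit with the line through g
s = z_crit / np.sqrt(g @ g)
boundary_point = s * g                   # approximately (0.52, 1.56)

psi = np.array([0.525, 1.575])           # the observed data of the example
detect = T_M(psi) >= z_crit              # detection is claimed for these data
```

The observed point sits just beyond the rejection boundary, which is why a detection claim feels uncomfortable when *a*_{M} places the alternative far up the **g** direction.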

Parallel comments apply to the detection test in Eq. (17). The rejection region defined in Eq. (18) is the complement of an ellipsoid centered at **0**. The orientation of this ellipsoid is determined by the **g**_{i}. For highly eccentric ellipsoids, there may be substantial probability of rejection at values of **a** that might well be viewed as part of a fuzzy null.

In part the general concern we are raising is one of practical versus statistical significance. The classical tests described in this article are not designed to be useful in assessing practical significance of results. That is, a “detected” signal need not be an important signal. Similarly, meaningful changes may not be readily detectable (i.e., the climate signals exist in regions where our tests have low power). It bears repeating that the significance discovered in testing refers to data, not to the quantities tested. For example, rejection of the null *H*_{0}: **a** = **0** in Eq. (17) does not mean that we have evidence that **a** is significantly different from zero in terms of climatic behavior.

Another viewpoint is that the sort of tests reviewed here tend to overstate the evidence against the null hypothesis. For further discussion, see Berger and Delampady (1987) and Berger and Sellke (1987).

## Acknowledgments

This work was supported by NCAR’s Geophysical Statistics Project, funded by NSF Grant #DMS-9312686. We are grateful to J. Berger, A. Dempster, G. Hegerl, R. Madden, and T. Wigley for valuable discussions regarding climate change studies. Two referees provided extremely valuable comments, criticisms, and suggestions leading to substantial improvements in this presentation.

## REFERENCES

## Footnotes

*Corresponding author address:* Dr. Richard A. Levine, Division of Statistics, Statistical Laboratory, University of California, Davis, Davis, CA 95616.

Email: levine@wald.ucdavis.edu