## 1. Introduction

The statistical analysis of climate data typically involves the estimation of quantities such as a mean value or a regression coefficient from a trend analysis. Such estimates are viewed as incomplete without an accompanying uncertainty estimate that can then be used to test a hypothesis. For example, in model intercomparison projects, an estimate bracketed by error bars may be presented for each model. In climate change studies, a trend estimate may be presented for competing datasets, or for models and observations. When the error bars for the different estimates overlap, it is presumed that the quantities do not differ in a statistically significant way. Unfortunately, as demonstrated by Schenker and Gentleman (2001, hereafter SG), this is in general an erroneous presumption.

A natural question to ask at this point is “How serious is the problem of misapplication of error bars?” To address this question, the author searched through a recent year’s worth of issues from the *Journal of Climate*. A similar but less exhaustive search was made of the Third Assessment Report (TAR) of the Intergovernmental Panel on Climate Change (IPCC; Houghton et al. 2001). Instances of inappropriate use of error bars were found in both the *Journal of Climate* and the TAR.^{1} Given that these are two of the most respected publications in the field of climate, some clarification on the use of error bars seems warranted.

Before beginning discussion of the problem, it is worth reviewing some of the types of error bars that are commonly presented, along with related terminology. Error bars may represent the standard deviation, the standard error, the confidence interval, or some other measure of error or uncertainty. While the standard deviation is a measure of dispersion of *individual observations* about their mean, the standard error is the standard deviation of a derived *statistic*, such as the mean, regression coefficient, correlation coefficient, etc. A confidence interval can be constructed about a sample statistic such that it contains the true population statistic with a specified probability^{2} and thus can be used in hypothesis testing. This note is relevant to error bars that represent confidence intervals for hypothesis testing.
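To make these distinctions concrete, the three quantities can be computed for a small sample. The data below are invented purely for illustration; only the relationships among the quantities matter.

```python
import math

# Hypothetical sample of 25 temperature anomalies (K); values are illustrative only.
sample = [0.8, 1.1, 0.9, 1.3, 0.7, 1.0, 1.2, 0.6, 1.4, 0.9,
          1.0, 1.1, 0.8, 1.2, 0.9, 1.3, 0.7, 1.0, 1.1, 0.8,
          1.2, 0.9, 1.0, 1.1, 0.9]
n = len(sample)
mean = sum(sample) / n

# Standard deviation: dispersion of individual observations about their mean.
sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))

# Standard error: the standard deviation of a derived statistic (here, the mean).
se = sd / math.sqrt(n)

# Approximate 95% confidence interval for the population mean (Gaussian limit).
ci = (mean - 1.96 * se, mean + 1.96 * se)
print(f"mean={mean:.3f}  sd={sd:.3f}  se={se:.3f}  CI=({ci[0]:.3f}, {ci[1]:.3f})")
```

Note that the standard error shrinks with sample size while the standard deviation does not; conflating the two is a separate, equally common, error-bar pitfall.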

## 2. The nature of the problem

The misperception regarding the use of error bars may arise because of a fundamental difference between one-sample and two- (or multi) sample testing. For a Gaussian-distributed variate, when only one quantity is estimated, a one-sample test (such as a Student’s *t* test) may be performed. The null hypothesis would be that the estimated quantity is equal to some constant (e.g., that an anomaly is zero). In the one-sample case, application of a *t* test is equivalent to placing error bars about the quantity to see if it overlaps with the hypothesized value. However, when the interest is in comparing estimated values from two different samples, use of error bars about each estimate, looking for overlap, is not equivalent to application of a two-sample *t* test.

Suppose, for example, that based on a single sample of data generated by a climate model, a 6-K rise in temperature is found to occur over some interval, and suppose that the standard error of this estimate is 2 K. Assuming that the temperatures are drawn from a Gaussian-distributed population, the hypothesis that the true change is zero can be assessed via a one-sample Student’s *t* test. Employing the standard formula yields a *t* value of 3, which, except for a very small sample size, results in rejection of the null hypothesis of no change in temperature with a high level of confidence. Alternatively, if one had preselected the same confidence level by placing error bars at 3 times the standard error about the estimated mean, it would have been found that the interval just intersects zero.
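The arithmetic of this example is simple enough to sketch in a few lines; the 6-K change and 2-K standard error are the hypothetical values from the text.

```python
# Hypothetical values from the example: a 6-K estimated change with a 2-K standard error.
change, se, hypothesized = 6.0, 2.0, 0.0

# One-sample t statistic: distance of the estimate from the hypothesized value,
# measured in units of its standard error.
t = (change - hypothesized) / se
print(t)  # 3.0

# Equivalent error-bar view: bars of half-width t * SE about the estimate
# just touch the hypothesized value of zero.
lower, upper = change - t * se, change + t * se
print(lower)  # 0.0
```

This equivalence between the test statistic and the error-bar picture holds only in the one-sample case, which is the point developed next.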

In contrast, the two-sample case is fundamentally different in that, in general, looking for overlap between two sets of error bars is not equivalent to the appropriate *t* test. Examples based on the data in Table 1 are used to illustrate the nature of the problem. Suppose we have finite samples of values of some quantity from both observed data and from a general circulation model (GCM). Estimates of the mean (*X̄*_{s}) of the sample values can be made along with the uncertainty [standard error (SE)] of the estimated means. As is common practice, a confidence interval about the estimated means can be constructed by taking ± twice the standard error. The intervals given in Table 1 are displayed graphically in Fig. 1 in the form of error bars.

For example 1, three sets of error bars are shown on the left side of Fig. 1 for the observations (O), GCM (G), and their difference (D). In this example, the observations and GCM have equal standard deviations. It can be seen that there is considerable overlap of the error bars from the observations and GCM. In such a case, a researcher would typically conclude erroneously that there is no statistically significant difference between their respective means. An alternate approach to the same problem is to apply a two-sample *t* test. Such a test has been applied and the corresponding error bars about the difference of the means (D) do not include zero. Based on this test, the same researcher would conclude that there is a statistically significant difference between the means.

The reason for this apparent paradox can be understood by considering the relationship between the SE of the mean of the *individual* samples (observations and GCM) to that of the SE of the *difference* of their means. The crucial factor is that in the case of the two-sample *t* test, the SE of the difference is estimated by “pooling” the variances from the two different samples. It should be noted that while the two-sample *t* test is well founded in statistical theory, the use of overlapping error bars in the two-sample case is not.
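The paradox can be reproduced numerically. The sketch below uses invented means and equal standard errors in the spirit of example 1; the specific values are illustrative, not the actual Table 1 entries.

```python
import math

# Hypothetical means and standard errors patterned after example 1 in the text
# (equal standard errors); the numbers are invented for illustration.
mean_obs, se_obs = 10.0, 1.0   # observations
mean_gcm, se_gcm = 7.0, 1.0    # GCM
c = 2.0                        # number of standard errors per error bar
diff = abs(mean_obs - mean_gcm)

# Overlap criterion: declare significance only if the gap between the means
# exceeds the sum of the two error-bar half-widths.
overlap_significant = diff > c * se_obs + c * se_gcm

# Two-sample criterion: pool the variances, SE3 = sqrt(SE1^2 + SE2^2).
se_diff = math.sqrt(se_obs**2 + se_gcm**2)
ttest_significant = diff > c * se_diff

print(overlap_significant, ttest_significant)  # False True
```

The two criteria disagree: the error bars overlap (no significance declared), while the pooled two-sample criterion finds a significant difference at the same nominal level.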

The criterion for declaring a significant difference implied by the overlapping error bar approach can be written as

$$\left| \overline{X}_1 - \overline{X}_2 \right| > c\,\mathrm{SE}_1 + c\,\mathrm{SE}_2, \quad (1)$$

where *X̄*_{1}(*X̄*_{2}) is the mean of the observations (GCM), SE_{1}(SE_{2}) is the standard error of the observations (GCM), and *c* is a constant that determines the level of confidence. The two terms on the rhs of (1) represent the distances from their respective means to the end of the whisker (i.e., half the length of the confidence intervals). In example 1, *c* = 2 since the confidence intervals represent ± two standard errors. Note that throughout this paper no distinction is made between population and sample parameters; it should be understood that estimates of various population parameters from an available finite sample are being used.

Application of the two-sample *t* test to example 1 results in a different requirement for declaration of significance:

$$\left| \overline{X}_1 - \overline{X}_2 \right| > c\,\mathrm{SE}_3, \quad (2)$$

where SE_{3} is the standard error of the difference of the means. The distinction between the two approaches lies in the relationship between the individual standard errors (SE_{1} and SE_{2}) and the standard error based on pooling the individual variances (SE_{3}). Under some simplifying assumptions, this relationship, presented by SG along with some theoretical underpinnings and derived algebraically in the appendix, is

$$\mathrm{SE}_3 = \sqrt{\mathrm{SE}_1^2 + \mathrm{SE}_2^2}. \quad (3)$$

To gain insight, it is instructive to use the geometric analogy given by SG involving a right triangle. The lengths of the sides are given by SE_{1} and SE_{2}, while the length of the hypotenuse is given by SE_{3}. Thus, (3) is simply an expression of the Pythagorean relationship. It can be reasoned that for the same difference in means [the left-hand sides of (1) and (2)] it will be more difficult to declare significance using (1) than (2). This is true because the rhs of (1) represents the sums of the lengths of the sides of a right triangle whereas the rhs of (2) represents the length of the hypotenuse; the latter will always be less than or equal to the former.

The amount of disparity between the two approaches depends on the ratio of the lengths of the sides of the triangle, which is equivalent to the ratio of the standard errors of the two samples. Example 1 was concocted to have equality in this regard and illustrates the case of maximum disparity. It is easy to demonstrate (SG) that when SE_{1} = SE_{2}, the ratio of the rhs of (1) to the rhs of (2) takes on its maximum value of $\sqrt{2} \approx 1.41$.

At the other extreme, when the SE from one sample is much larger than the other, (1) and (2) approach equivalency. Geometrically this occurs as one side of the right triangle approaches zero length, in which case the remaining side is also the hypotenuse. Example 2 illustrates this situation in that the variance from the observations is much larger than that from the GCM. As seen on the right side of Fig. 1, the error bars from observations and GCM just overlap and the error bars about the difference almost touch the zero line. Unfortunately, such large disparities in variance are not commonly encountered in practice.
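The dependence of the disparity on the ratio of the two standard errors can be tabulated directly; the sweep below uses arbitrary ratios for illustration.

```python
import math

# Ratio of the overlap threshold (SE1 + SE2) to the two-sample threshold
# sqrt(SE1^2 + SE2^2), as a function of the disparity between the two SEs.
def threshold_ratio(se1, se2):
    return (se1 + se2) / math.sqrt(se1**2 + se2**2)

for r in (1.0, 2.0, 5.0, 100.0):
    print(f"SE2/SE1 = {r:>6}: ratio = {threshold_ratio(1.0, r):.4f}")

# Equal standard errors give the maximum ratio, sqrt(2) ~ 1.414; as one SE
# dwarfs the other, the ratio approaches 1 and the two criteria coincide.
```

The conservative bias of the overlap method is thus largest (about 41% too stringent a threshold) for equal standard errors and vanishes only in the limit of extreme disparity.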

## 3. Conclusions

In summary, this note has addressed the common practice of placing error bars about the means from two distinct samples. While this practice lends itself to an appealing graphical presentation, it can often lead to an erroneous conclusion as to whether the means differ in a statistically significant manner. In particular, this approach has a conservative bias: sometimes no significant difference is found when one in fact exists. This bias is not constant; it varies with the relative magnitudes of the sampling errors in the two samples, reaching its maximum when the sampling variability of the two samples is comparable.

While this note has dealt with testing the difference between means, the same cautions apply to other statistical quantities. Practitioners should opt for an appropriate two-sample test suited to the parameter of interest, or for a multiple-comparisons test when more than two samples are involved. While the error bar method is tempting, it is not grounded in statistical theory when more than one sample is involved.

## Acknowledgments

Gabriel Lau and Keith Dixon kindly provided comments on an earlier draft of this manuscript. Neville Nicholls, editor David Stephenson, and an anonymous reviewer provided useful suggestions during the journal review.

## REFERENCES

Houghton, J. T., Y. Ding, D. J. Griggs, M. Noguer, P. J. van der Linden, X. Dai, K. Maskell, and C. A. Johnson, 2001: *Climate Change 2001: The Scientific Basis*. Cambridge University Press, 881 pp.

Schenker, N., and J. Gentleman, 2001: On judging the significance of differences by examining the overlap between confidence intervals. *Amer. Stat.*, **55**, 182–186.

Zar, J., 1996: *Biostatistical Analysis*. 3d ed. Prentice Hall, 662 pp.

## APPENDIX

### Derivation of Relationship between Standard Errors

The relationship expressed by (3), which is associated with a two-sample Student’s *t* test, is derived here by invoking some simplifying assumptions as per SG. This relationship relates the individual standard errors of the means from two samples (SE_{1} and SE_{2}) with the standard error of the difference of their means (SE_{3}). The derivation utilizes a number of standard statistical equations [(A1)–(A6) and (A10)–(A11)] associated with the two-sample *t* test. These are available from any of a number of introductory statistics texts, for example, Zar (1996).

The pooled estimate of the variance (*S*_{p}^{2}) from the two samples is

$$S_p^2 = \frac{\mathrm{SS}_1 + \mathrm{SS}_2}{\upsilon_1 + \upsilon_2}, \quad (\mathrm{A1})$$

where SS_{1}(SS_{2}) is the corrected sum of squares from the first (second) sample and *υ*_{1}(*υ*_{2}) is the degrees of freedom for the first (second) sample such that

$$\upsilon_1 = n_1 - 1, \quad (\mathrm{A2})$$
$$\upsilon_2 = n_2 - 1, \quad (\mathrm{A3})$$

where *n*_{1}(*n*_{2}) is the sample size of the first (second) sample and where the variances of the two samples are defined by

$$s_1^2 = \frac{\mathrm{SS}_1}{\upsilon_1}, \quad (\mathrm{A4})$$
$$s_2^2 = \frac{\mathrm{SS}_2}{\upsilon_2}. \quad (\mathrm{A5})$$

The standard error of the difference of the means is given by

$$\mathrm{SE}_3 = \left[ S_p^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right) \right]^{1/2}. \quad (\mathrm{A6})$$

Substituting (A1)–(A5) into (A6) yields

$$\mathrm{SE}_3 = \left[ \frac{\upsilon_1 s_1^2 + \upsilon_2 s_2^2}{\upsilon_1 + \upsilon_2} \left( \frac{1}{n_1} + \frac{1}{n_2} \right) \right]^{1/2}. \quad (\mathrm{A7})$$

The first simplifying assumption is that of large sample sizes, that is, *n*_{1} and *n*_{2} → ∞. In this limit *υ*_{1} → *n*_{1} and *υ*_{2} → *n*_{2} so that after algebraic cancellation (A7) becomes

$$\mathrm{SE}_3 = \left[ \frac{n_1 s_1^2 + n_2 s_2^2}{n_1 + n_2} \left( \frac{1}{n_1} + \frac{1}{n_2} \right) \right]^{1/2}. \quad (\mathrm{A8})$$

The second simplifying assumption is that of equal sample sizes (*n*_{1} = *n*_{2} = *n*) so that (A8) becomes

$$\mathrm{SE}_3 = \left( \frac{s_1^2 + s_2^2}{n} \right)^{1/2}. \quad (\mathrm{A9})$$

Since the standard errors of the individual means are

$$\mathrm{SE}_1 = \frac{s_1}{\sqrt{n_1}}, \quad (\mathrm{A10})$$
$$\mathrm{SE}_2 = \frac{s_2}{\sqrt{n_2}}, \quad (\mathrm{A11})$$

substitution of (A10) and (A11) into (A9) yields the desired relationship:

$$\mathrm{SE}_3 = \left( \mathrm{SE}_1^2 + \mathrm{SE}_2^2 \right)^{1/2}. \quad (\mathrm{A12})$$

The relationship expressed by (A12) is useful for gaining insight as to the difference between the use of individual error bars (based on SE_{1} and SE_{2}) from the two samples and application of the two-sample *t* test (based on SE_{3}). Although it is based on some simplifying assumptions, yielding a less complex relationship for illustrative purposes, the principle that it expresses, namely the distinction between the two approaches, is more generally applicable.
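As a sanity check on the derivation, the pooled standard error of the difference can be compared numerically with the Pythagorean form for simulated samples; the sample sizes and distributions below are arbitrary choices. For equal sample sizes the degrees-of-freedom weights cancel and the two expressions agree exactly, consistent with the algebra above.

```python
import math
import random

# Arbitrary simulated samples: equal sizes, different variances.
random.seed(1)
n = 10_000
x1 = [random.gauss(0.0, 2.0) for _ in range(n)]   # sample 1, sd ~ 2
x2 = [random.gauss(1.0, 3.0) for _ in range(n)]   # sample 2, sd ~ 3

def sample_var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

s1_sq, s2_sq = sample_var(x1), sample_var(x2)
se1, se2 = math.sqrt(s1_sq / n), math.sqrt(s2_sq / n)

# Exact pooled standard error of the difference of means, per (A1) and (A6).
sp_sq = ((n - 1) * s1_sq + (n - 1) * s2_sq) / (2 * n - 2)
se3_pooled = math.sqrt(sp_sq * (1 / n + 1 / n))

# Pythagorean form (A12).
se3_pythag = math.sqrt(se1**2 + se2**2)
print(se3_pooled, se3_pythag)
```

The agreement is to machine precision here because the sample sizes are equal; with unequal sizes the two expressions differ, and (A12) is only the approximation derived above.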

Table 1. Some statistical quantities from two examples involving hypothetical samples of data from observations and a GCM. Given are the sample size (*n*), the mean of the sample (*X̄*_{s}), the standard error of the mean (SE), and the confidence interval. On the third line (difference) for each example are given the difference of the means (observations minus GCM), the standard error of the difference of the means, and a confidence interval based on a two-sample Student’s *t* test. The intervals have been constructed using ± twice the standard error about the mean; the factor of 2 is close to the value of 1.96, which in the limiting case of an infinite sample size corresponds to a 95% confidence interval. Note that while these examples have been constructed to produce “round numbers,” the concepts that they illustrate are not dependent on either the particular values or the sample sizes.

^{1}

In the *Journal of Climate* there were 17 articles in which error bars or related measures were presented in figures or tables and used in a two-sample setting. In only two of these articles did the authors correctly apply a two-sample test. In the majority of the other 15 articles, inferences were drawn inappropriately from the error bars; in a few cases the usage was ambiguous, perhaps leading the reader to an inappropriate inference. Although it seemed that conclusions would not change in about a third of the offending articles, for the remainder, conclusions would change in some instances; the extent of change is difficult to determine given the large number of individual cases involved. In the TAR, although proper usage was noted in quite a few cases, four cases of inappropriate use were found.

^{2}

A confidence interval is often a scaled version of the standard error. For example, assuming a Gaussian distribution, a 95% confidence interval for the population mean is constructed by extension of ± 1.96 standard errors about the sample mean.