## 1. Introduction

In meteorology, oceanography, and other fields, it is often necessary to check whether two quantities *x* and *y* are linearly related. Usually this involves the calculation of a correlation coefficient and a regression coefficient. Both coefficients are useful: the sample correlation coefficient measures how well the two variables are linearly related, while the regression coefficient estimates the slope of the linear relationship between them.

In practice, “noise” reduces the correlation coefficient and affects the accuracy of the regression coefficient. Noise can be due to measurement error or to different physical processes in *x* and *y* that affect the linearizing process common to both. While measurement error can sometimes be estimated for each variable, error due to possible physical influences in the real data is often not known. So there is a large class of regression problems in meteorology, oceanography, and other fields in which the signal-to-noise ratio in both variables is unknown.

Our aim is to estimate, from a sample of *M* points, the true linear regression coefficient between two random variables *X* and *Y* that are both subject to random noise. Mathematically, we want to find the regression coefficient *α* in the linear relationship

$$Y=\alpha X\tag{1}$$

given the observations

$$x=X+\varepsilon\tag{2}$$

and

$$y=Y+\delta.\tag{3}$$

In (2) and (3), the error or noise terms *ɛ* and *δ* are taken to be normally distributed zero-mean random variables that are independent of each other and of *X*.

The ordinary least squares (OLS) regression coefficient of *y* on *x* is

$$\alpha_{\rm OLS}=\frac{\sum_{i=1}^{M}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{M}(x_i-\bar{x})^{2}},\tag{4}$$

where the sums are taken over the *M* data pairs *x* and *y*,

$$s_x=\left[\frac{1}{M-1}\sum_{i=1}^{M}(x_i-\bar{x})^{2}\right]^{1/2}\tag{6}$$

is the sample standard deviation of *x*, and *x̄* and *ȳ* are the sample means of *x* and *y*, respectively. Since *X*, *ɛ*, and *δ* are all independent of each other, as *M* → ∞ the sample covariance of *x* and *y* approaches *ασ*_{X}^{2} and the sample variance of *x* approaches *σ*_{X}^{2} + *σ*_{ɛ}^{2}, where *σ*_{X} and *σ*_{ɛ} are the (true) standard deviations of *X* and *ɛ*. Thus it follows from (4) that as *M* → ∞,

$$\alpha_{\rm OLS}\rightarrow\frac{\alpha}{1+n_x},\tag{9}$$

where the noise-to-signal ratio *n*_{x} is defined as

$$n_x=\frac{\sigma_\varepsilon^{2}}{\sigma_X^{2}}.$$

Therefore, whenever *x* has nonzero noise and hence *n*_{x} > 0, (9) implies the well-known result that *α*_{OLS} underestimates the magnitude of *α* even when the number of observations *M* is infinite.

If *x* has zero noise, then *n*_{x} = 0 and an unbiased estimate of *α* from the regression of *y* on *x* is available; conversely, if *y* has zero noise (*δ* = 0), then a regression of *x* on *y* yields an unbiased estimate for *α*^{−1}. Confidence intervals in either case can then be determined in the finite-*M* case using a Student's *t* distribution with *M* − 2 degrees of freedom. However, when both *x* and *y* have nonzero noise, an unbiased estimate of *α* or *α*^{−1} is not available and confidence intervals for the true regression coefficient are unknown.

Many methods (Ricker 1973; Jolicoeur 1975; Riggs et al. 1978; McArdle 1988, 2003; Frost and Thompson 2000) have been devised to obtain a best estimate for the true regression coefficient when both variables contain noise, but confidence intervals for that coefficient have remained unavailable when the noise in both variables is unknown.

In this paper, we overcome this difficulty by noting that when the relative size of the noise for each variable is unknown, we have no basis for choosing between the variables. Consequently, we must assume that the noise in each variable is equally likely, subject to the constraint that the sample correlation coefficient for the given set of data is known. This equal-likelihood noise assumption, subject to the known sample correlation coefficient, enables us to determine the confidence intervals explicitly in the limiting large-*M* case and numerically in the finite-*M* case.

The rest of the paper is organized as follows: in the next section we discuss an unbiased estimate for the true regression coefficient when nothing is known about the noise in each variable. Then, in section 3, we derive the probability density function for the true regression coefficient in the limiting large-*M* case. Confidence intervals for the finite-*M* case are found numerically in section 4, and an example of their use is then given in section 5. A final section 6 contains some concluding remarks.

## 2. An unbiased estimate for the true regression coefficient

Consider the problem of estimating the true regression coefficient between *x* and *y* when nothing is known or assumed about the noise in *x* and *y* except for the constraint provided by the sample correlation coefficient. One estimate for the true regression coefficient *α* of *y* on *x* is *f*_{1}*α*_{OLS}, where the factor *f*_{1} is used to correct the bias in *α*_{OLS}. Similarly, a factor *f*_{2} can be used to correct the bias in the OLS regression of *x* on *y* to obtain the estimate

$$f_2\,\frac{r\,s_x}{s_y}$$

for *α*^{−1}, where *s*_{y} is the sample standard deviation of *y* defined analogously to *s*_{x} in (6). But note that there is no basis for distinguishing *f*_{1} and *f*_{2}. In both regression cases there are the same number of points *M* and the same sample correlation coefficient. Also, in both cases we are carrying out the same procedure, namely, using a factor to scale the ordinary least squares regression coefficient so that it gives an unbiased estimate of the true regression coefficient. Therefore, the factors *f*_{1} and *f*_{2} should be the same and we write *f*_{1} = *f*_{2} = *f*.

Since the two scaled coefficients are then estimates of *α* and *α*^{−1}, and we have no way of distinguishing these estimates, we must have

$$f\,\alpha_{\rm OLS}=\left(f\,\frac{r\,s_x}{s_y}\right)^{-1}.$$

Because *α*_{OLS} = *rs*_{y}/*s*_{x}, this gives *f* = |*r*|^{−1} and hence the estimate

$$\alpha_{\rm GMR}=\operatorname{sgn}(r)\,\frac{s_y}{s_x}\tag{13}$$

for the true regression coefficient of *y* on *x*. In the oceanography literature this estimate is known as neutral regression (Garrett and Petrie 1981); in the statistics literature it is called the geometric mean regression (GMR) because its magnitude is the geometric mean of the magnitude of the OLS coefficient of *y* on *x* and the magnitude of the reciprocal of the OLS coefficient of *x* on *y* (Sprent and Dolby 1980; Barker et al. 1988).

Note that if *x* is normalized by *s*_{x} and *y* is normalized by *s*_{y}, the GMR line is the one for which the *perpendicular* distance from the regression line is minimized rather than the vertical distance as in an ordinary least squares fit. A principal component analysis of the normalized variables also has its principal axis along the GMR line.
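The relationship between the GMR estimate and the two OLS fits can be illustrated numerically. The minimal sketch below uses synthetic data generated from the model (1)–(3); the true slope *α* = 2 and the noise levels are arbitrary assumptions chosen for illustration.

```python
import numpy as np

# Synthetic data from the model (1)-(3); alpha = 2 and the noise levels
# are arbitrary choices for illustration.
rng = np.random.default_rng(1)
X = rng.standard_normal(500)
x = X + 0.4 * rng.standard_normal(500)        # x = X + eps         [Eq. (2)]
y = 2.0 * X + 0.8 * rng.standard_normal(500)  # y = alpha*X + delta [Eqs. (1), (3)]

r = np.corrcoef(x, y)[0, 1]
s_x = np.std(x, ddof=1)
s_y = np.std(y, ddof=1)

a_ols_yx = r * s_y / s_x        # OLS slope of y on x (biased toward zero)
a_ols_xy = r * s_x / s_y        # OLS slope of x on y (biased estimate of 1/alpha)
a_gmr = np.sign(r) * s_y / s_x  # geometric mean regression slope   [Eq. (13)]

# |a_gmr| is the geometric mean of the y-on-x OLS slope and the
# reciprocal of the x-on-y OLS slope.
print(a_ols_yx, a_gmr, 1.0 / a_ols_xy)
```

Because |*r*| < 1, the *y*-on-*x* OLS slope always lies below *α*_{GMR}, which in turn lies below the reciprocal of the *x*-on-*y* OLS slope.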

## 3. The probability density function for *α*/*α*_{GMR} for the limiting large-*M* case

Given *M* data pairs, we can calculate both a correlation coefficient *r* and a GMR coefficient *α*_{GMR}. As *M* → ∞, it follows from (2) and (3) and the mutual independence of *X*, *ɛ*, and *δ* that

$$r\rightarrow\frac{\alpha\sigma_X^{2}}{\left[\left(\sigma_X^{2}+\sigma_\varepsilon^{2}\right)\left(\alpha^{2}\sigma_X^{2}+\sigma_\delta^{2}\right)\right]^{1/2}}.$$

By dividing both numerator and denominator by |*α*|*σ*_{X}^{2}, this limit can be written

$$r\rightarrow\frac{\operatorname{sgn}(\alpha)}{\left[(1+n_x)(1+n_y)\right]^{1/2}},\qquad n_y=\frac{\sigma_\delta^{2}}{\alpha^{2}\sigma_X^{2}},\tag{16}$$

where |*α*|*σ*_{X} is the amplitude of *Y* in *y* [see (1) and (3)], and so, analogous to the noise-to-signal ratio *n*_{x}, the parameter *n*_{y} in (16) is the noise-to-signal ratio for *y*. Squaring *r* gives, from (16), the constraint

$$(1+n_x)(1+n_y)=r^{-2}\tag{18}$$

on *n*_{x} and *n*_{y} since *r* can be determined from the data.

Since *n*_{x} ≥ 0 and *n*_{y} ≥ 0, it follows from (18) that *n*_{x} and *n*_{y} both vary between 0 and *r*^{−2} − 1; geometrically, the points (*n*_{x}, *n*_{y}) lie along a hyperbolic curve (see Fig. 1). Because we know nothing about the relative sizes of *n*_{x} and *n*_{y}, we assume that the points (*n*_{x}, *n*_{y}) are uniformly distributed along the curve between the limiting values at *A*(*n*_{x} = 0, *n*_{y} = *r*^{−2} − 1) and *B*(*n*_{x} = *r*^{−2} − 1, *n*_{y} = 0). If *l* denotes arc length and *l*_{AB} is the length of the curve *AB*, then the required uniformly distributed probability density function (pdf) is

$$p(l)=\frac{1}{l_{AB}},\tag{20}$$

since this takes the same value at every point on the curve *AB* and satisfies

$$\int_0^{l_{AB}}p\,dl=1.$$

Note that other assumptions about the distribution of *n*_{x} and *n*_{y} along the curve (18) are not justifiable. For example, if we were to assume that *n*_{x} is uniformly distributed, then because of the hyperbolic form of (18), *n*_{y} would not be uniformly distributed. Such asymmetric distributions would be inappropriate, since we have no way of distinguishing the noise in *x* and *y*; all we have is the knowledge that the noise pair (*n*_{x}, *n*_{y}) lies somewhere along the curve defined by (18) and illustrated in Fig. 1.

From (18) we can find *n*_{y} in terms of *n*_{x} and *r*,

$$n_y=\frac{r^{-2}}{1+n_x}-1,$$

and hence calculate *dn*_{y}/*dn*_{x} to obtain, from (21), the pdf for *n*_{x}:

$$p(n_x)=\frac{1}{l_{AB}}\left[1+\frac{r^{-4}}{(1+n_x)^{4}}\right]^{1/2}.\tag{22}$$

From (9), in the large-*M* limit *α* = *α*_{OLS}(1 + *n*_{x}), and from (13),

$$\zeta\equiv\frac{\alpha}{\alpha_{\rm GMR}}=\left(\frac{1+n_x}{1+n_y}\right)^{1/2}=|r|\,(1+n_x),\tag{24}$$

where we have introduced *ζ* in (24) for notational convenience. Changing variables in (22) to *ζ*, we have *dn*_{x}/*dζ* = |*r*|^{−1} and, from (24), *r*^{−4}(1 + *n*_{x})^{−4} = *ζ*^{−4}. Thus,

$$g(\zeta)=\frac{\left(1+\zeta^{-4}\right)^{1/2}}{l_{AB}\,|r|},\qquad |r|\le\zeta\le|r|^{-1}.\tag{26}$$

Since *g* in (26) is the pdf for *ζ* = *α*/*α*_{GMR}, it describes the distribution of the true regression coefficient *α* given a knowledge of the parameters *r* and *α*_{GMR}, which we can calculate from the data.
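As a numerical check on (26), the sketch below computes *l*_{AB} by summing polyline segments along the curve (18) and integrates *g* by the trapezoidal rule; the value *r* = 0.6 is an arbitrary assumption. The integral of *g* over its full range should be 1, and the integral up to *ζ* = 1 should be 0.5 by the symmetry of the curve about *n*_{x} = *n*_{y}.

```python
import numpy as np

r = 0.6          # an assumed sample correlation coefficient
c = r**-2 - 1.0  # upper limit of n_x (and n_y) on the curve (18)

# Arc length l_AB of the hyperbola (1 + n_x)(1 + n_y) = r**-2,
# approximated by summing straight polyline segments.
nx = np.linspace(0.0, c, 400_001)
ny = r**-2 / (1.0 + nx) - 1.0
l_AB = np.hypot(np.diff(nx), np.diff(ny)).sum()

def g(zeta):
    """pdf of zeta = alpha/alpha_GMR on |r| <= zeta <= 1/|r| [Eq. (26)]."""
    return np.sqrt(1.0 + zeta**-4.0) / (l_AB * abs(r))

def trapezoid(f, a, b, n=400_001):
    """Trapezoidal-rule integral of f over [a, b]."""
    t = np.linspace(a, b, n)
    v = f(t)
    return ((v[1:] + v[:-1]) * 0.5 * (t[1] - t[0])).sum()

total = trapezoid(g, abs(r), 1.0 / abs(r))  # normalization of the pdf
below_gmr = trapezoid(g, abs(r), 1.0)       # probability that alpha <= alpha_GMR
```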

Consider the behavior of *g* for various values of *r*. It follows from (24) and the limits 0 ≤ *n*_{x} ≤ *r*^{−2} − 1 that *g* is nonzero only for |*r*| ≤ *ζ* ≤ |*r*|^{−1}. Viewed as a function of *ζ*, *g* has a long tail for small |*r*|; that is, the confidence intervals are wide, because for small |*r*| not much of the variance is explained by the regression fit. From (20), the median of *g* occurs when the point (*n*_{x}, *n*_{y}) lies halfway along the curve *AB* (see Fig. 1). By symmetry this corresponds to *n*_{x} = *n*_{y}, or, by (18), to (1 + *n*_{x})|*r*| = 1. From (24) this is equivalent to *ζ* = 1; in other words, the median estimate of *α* is *α*_{GMR}. At the lower limit *ζ* = |*r*|, *g* is a maximum, since *g* is a monotonically decreasing function of *ζ*.

The curve *AB* in Fig. 1 of length *l*_{AB} depends on *r*^{2}. As *r*^{2} → 1, the nonzero coordinates of the endpoints *A*(0, *r*^{−2} − 1) and *B*(*r*^{−2} − 1, 0) approach 0 and so the length of the curve shrinks to zero. Figure 3 shows *l*_{AB} as a function of *r*^{2}. Note that if we approximate *l*_{AB} using a straight line between *A* and *B*, then *l*_{AB} is the length of the hypotenuse of a right-angled isosceles triangle with equal sides *r*^{−2} − 1, and so

$$l_{AB}\approx\sqrt{2}\,\left(r^{-2}-1\right).$$

Figures 1 and 3 show that this approximation becomes increasingly accurate as *r*^{2} → 1 over the range 0 < *r*^{2} ≤ 1. Thus (26) can also be written approximately, with increasing accuracy as *r*^{2} → 1, as

$$g(\zeta)\approx\frac{\left(1+\zeta^{-4}\right)^{1/2}}{\sqrt{2}\,|r|\left(r^{-2}-1\right)}.$$
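The chord approximation is easy to check numerically. A minimal sketch (the two *r*^{2} values are arbitrary choices):

```python
import numpy as np

def arc_and_chord(r2, npts=200_001):
    """Arc length of the curve AB for a given r**2, by polyline sum, together
    with the straight-line (chord) approximation sqrt(2)*(r**-2 - 1)."""
    c = 1.0 / r2 - 1.0
    nx = np.linspace(0.0, c, npts)
    ny = (1.0 / r2) / (1.0 + nx) - 1.0
    arc = np.hypot(np.diff(nx), np.diff(ny)).sum()
    chord = np.sqrt(2.0) * c
    return arc, chord

arc_lo, chord_lo = arc_and_chord(0.3)  # small r**2: long, strongly curved arc
arc_hi, chord_hi = arc_and_chord(0.9)  # r**2 near 1: short, nearly straight arc
```

The chord is always a slight underestimate of the arc length, and the relative error shrinks as *r*^{2} → 1.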

While it is useful to have obtained results for the limiting large-*M* case, in practice we are usually faced with the problem of determining *α* given a finite number *M* of hard-won data points; as in the large-*M* case, all we know from the data are the finite-*M* estimates *r* and *α*_{GMR}. In the next section, we show how to calculate the relevant finite-*M* pdf and corresponding confidence intervals numerically.

## 4. Confidence intervals for the true regression coefficient for finite *M*

For a given number of points *M* and a given *r*^{2}, the required confidence intervals can be found by Monte Carlo simulation. It is convenient to work with the normalized observations *x*_{*} = *x*/*σ*_{x} and *y*_{*} = *y*/*σ*_{y}. The sample correlation coefficient for *x*_{*} and *y*_{*} is the same as that for *x* and *y*, while the sample regression coefficient of *y*_{*} on *x*_{*} is the sample regression coefficient of *y* on *x* multiplied by *σ*_{x}/*σ*_{y}. The normalized observations can be written in terms of *n*_{x}, *n*_{y}, and the zero-mean, unit-variance variables

$$X_*=\frac{X}{\sigma_X},\qquad \varepsilon_*=\frac{\varepsilon}{\sigma_\varepsilon},\qquad \delta_*=\frac{\delta}{\sigma_\delta},$$

where *σ*_{δ} is the (true) standard deviation of *δ*. To see that *x*_{*} and *y*_{*} can indeed be written in this way, first note that *σ*_{x} = *σ*_{X}(1 + *n*_{x})^{1/2} and *σ*_{y} = |*α*|*σ*_{X}(1 + *n*_{y})^{1/2}. Substituting these expressions, together with (2) and (3), into the definitions of *x*_{*} and *y*_{*} yields

$$x_*=\frac{X_*+\sqrt{n_x}\,\varepsilon_*}{\sqrt{1+n_x}}\tag{43}$$

and

$$y_*=\operatorname{sgn}(\alpha)\,\frac{X_*+\sqrt{n_y}\,\delta_*}{\sqrt{1+n_y}}.\tag{44}$$
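A quick consistency check of (43) and (44): for assumed noise-to-signal ratios, the starred variables should have unit variance, and for a large sample their correlation should be near the large-*M* limit (16). The values of *n*_{x}, *n*_{y}, and the sample size below are arbitrary assumptions, with sgn(*α*) = +1.

```python
import numpy as np

rng = np.random.default_rng(2)
M = 200_000           # large sample so the finite-M estimate is near its limit
n_x, n_y = 0.5, 0.3   # assumed noise-to-signal ratios

X_s = rng.standard_normal(M)    # X_*
eps_s = rng.standard_normal(M)  # epsilon_*
dlt_s = rng.standard_normal(M)  # delta_*

x_star = (X_s + np.sqrt(n_x) * eps_s) / np.sqrt(1.0 + n_x)  # Eq. (43)
y_star = (X_s + np.sqrt(n_y) * dlt_s) / np.sqrt(1.0 + n_y)  # Eq. (44), sgn(alpha)=+1

r_emp = np.corrcoef(x_star, y_star)[0, 1]
r_theory = 1.0 / np.sqrt((1.0 + n_x) * (1.0 + n_y))         # large-M limit, Eq. (16)
```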

We found the required confidence intervals numerically from (43) and (44), sampling the zero-mean, unit-variance variables *X*_{*}, *ɛ*_{*}, and *δ*_{*} from independent normal distributions. The numerical calculations proceed by first obtaining a random *r* from a uniform distribution over the interval [−1, 1]. For that *r* we then randomly sample a point (*n*_{x}, *n*_{y}) from a uniform distribution along the curve *AB* corresponding to *r*^{2}. Now that we have *n*_{x} and *n*_{y}, we can use the independent normal distributions of *X*_{*}, *ɛ*_{*}, and *δ*_{*} to obtain *M* random samples of *x*_{*} and *y*_{*} from (43) and (44) and hence obtain the correlation coefficient and the GMR coefficient for the sample *x*_{*} and *y*_{*}. Because we know *r* for this sample, we also know sgn(*α*). Repeating this procedure many times and sorting the results according to the sample correlation coefficient yields, for each sample correlation coefficient and each *M*, the distribution of *α*/*α*_{GMR}, from which we can determine bounds *L* and *U* for confidence intervals for the true regression coefficient *α*. For example, for a 95% confidence interval we can find *L* and *U* such that

$$\operatorname{Prob}\left(L\le\frac{\alpha}{\alpha_{\rm GMR}}\le U\right)=0.95.\tag{45}$$

Once *L* and *U* are known, then for a given sample (*x*, *y*) of *M* points we can calculate *r* and *α*_{GMR} and hence, from (45), the confidence interval (*Lα*_{GMR}, *Uα*_{GMR}) for *α*.

Since we have an analytical solution for the limiting large-*M* case, it is of interest to see how large *M* has to be for the finite-*M* results to approach that limit. Plotting *L* and *U* as a function of *M* shows that for small *M* it is only when *r*^{2} is large that the finite-*M* case resembles the large-*M* limit; as *r*^{2} decreases, *U* increases more rapidly than *L* decreases. In the finite case, plots end when *L* = 0, because for smaller *r*^{2} the sign of *α* is not known with 95% confidence.
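The Monte Carlo procedure above can be sketched in a few lines. This is a simplified, illustrative version: instead of drawing the population correlation uniformly from [−1, 1] and sorting results by the sample correlation as described above, it fixes the population correlation at the observed value and keeps only realizations whose sample correlation is close to it. The values *r* = 0.9, *M* = 26, the tolerance, and the trial count are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_noise_pairs(r2, size, ngrid=2001):
    """Sample (n_x, n_y) uniformly by arc length along the curve
    (1 + n_x)(1 + n_y) = 1/r2 [Eq. (18)]."""
    c = 1.0 / r2 - 1.0
    nx = np.linspace(0.0, c, ngrid)
    ny = (1.0 / r2) / (1.0 + nx) - 1.0
    l = np.concatenate([[0.0], np.cumsum(np.hypot(np.diff(nx), np.diff(ny)))])
    u = rng.uniform(0.0, l[-1], size)
    nx_s = np.interp(u, l, nx)                 # invert cumulative arc length
    ny_s = (1.0 / r2) / (1.0 + nx_s) - 1.0
    return nx_s, ny_s

def simulate_LU(M, r_obs, tol=0.03, trials=4000, conf=0.95):
    """Monte Carlo bounds L, U with Prob(L <= alpha/alpha_GMR <= U) = conf."""
    ratios = []
    nx_all, ny_all = sample_noise_pairs(r_obs**2, trials)
    for nx, ny in zip(nx_all, ny_all):
        Xs = rng.standard_normal(M)
        x = (Xs + np.sqrt(nx) * rng.standard_normal(M)) / np.sqrt(1.0 + nx)  # (43)
        y = (Xs + np.sqrt(ny) * rng.standard_normal(M)) / np.sqrt(1.0 + ny)  # (44)
        r_s = np.corrcoef(x, y)[0, 1]
        if abs(abs(r_s) - abs(r_obs)) > tol:
            continue  # keep realizations whose sample correlation matches r_obs
        a_gmr = np.sign(r_s) * np.std(y, ddof=1) / np.std(x, ddof=1)  # Eq. (13)
        a_true = np.sqrt((1.0 + nx) / (1.0 + ny))  # true slope for this realization
        ratios.append(a_true / a_gmr)
    q = 0.5 * (1.0 - conf)
    return np.quantile(ratios, q), np.quantile(ratios, 1.0 - q)

L, U = simulate_LU(M=26, r_obs=0.9)
```

With thousands of retained realizations, the empirical quantiles bracket 1, consistent with *α*_{GMR} being the median estimate of *α*.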

## 5. An example

Tables in the supplement give *L* and *U* corresponding to the 80%, 90%, 95%, and 99% confidence intervals for various *M* and *r*^{2}. As an example of their use, consider the following application. According to theory, the difference between the monthly western equatorial Pacific pressure anomaly *p*(*o*, *t*) and the eastern equatorial Pacific pressure anomaly *p*(*L*, *t*) should be related to the zonal integral of the eastward equatorial wind stress anomaly *τ* by

$$h(L)\left[p(o,t)-p(L,t)\right]=\int_{0}^{L}\tau\,d\xi.\tag{46}$$

In (46), *t* is the time in months, *ξ* = 0 refers to the western equatorial Pacific boundary, *ξ* = *L* to the eastern equatorial Pacific boundary, and *h*(*L*) is a measure of the height of the turbulent atmospheric boundary layer in the eastern equatorial Pacific. Note that both monthly time series on the left- and right-hand sides of (46) contain noise. This noise is due to the measurement error in the thousands of ship-of-opportunity wind and pressure observations, the measurement error in the station pressure measurements at Darwin, Australia, and their adjustment to the equator, and the noise due to theoretical approximations in the derivation of (46). The noise-to-signal ratios *n*_{x} and *n*_{y} are thus nonzero and unknown, and this must be taken into account when estimating *h*(*L*) by regression.

Bunge and Clarke (2009) tested the validity of (46) using monthly pressure and wind stress data from 1978 to 2003 inclusive. Since the data are autocorrelated, the number of degrees of freedom in the data is not the number of months of data, but rather the number of years of data, because El Niño time series can be thought of as independent 12-month segments [see, e.g., Fig. 2.14 of Clarke (2008)]. As there are 26 yr of data, *M* = 26. Also, from the correlation of the time series in (46) and their standard deviations, Bunge and Clarke obtained the sample values of *r* and *α*_{GMR}.

Appropriate values of *L* and *U* for *α* can be obtained for *M* = 26 by linearly interpolating the values for *M* = 25 and *M* = 35 in the supplement. For example, the 95% entry for *L* for *M* = 25 is given in Table S17, while the corresponding value from Table S18 for the *M* = 35 case is 0.8149. Linear interpolation gives *L* = 0.7875 for *M* = 26, and a similar analysis gives *U* = 1.2706. This implies, from the discussion associated with (45), that the 95% confidence interval for the true value of *α* = *h*(*L*) is as given in (47).
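The interpolation step can be written compactly. In this sketch the Table S17 entry for *M* = 25 is replaced by a hypothetical stand-in value of 0.7800; 0.8149 is the quoted *M* = 35 entry from Table S18.

```python
import numpy as np

M_tab = [25, 35]
L_tab = [0.7800, 0.8149]  # 0.7800 is a hypothetical stand-in for the Table S17 entry

# Linear interpolation of the tabulated 95% lower factor L to M = 26.
L_26 = np.interp(26, M_tab, L_tab)
```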

Note that the above confidence interval takes into account both the finite number of points *M* and our uncertainty about the noise. By comparison, the standard ordinary least squares regression of the left-hand side of (46) on the right-hand side gives a 95% confidence interval (138 m, 204 m). This interval is smaller than that in (47), but it is a confidence interval for the biased coefficient *α*_{OLS} rather than the true regression coefficient *α*. The confidence interval for the true regression coefficient is larger because, in addition to the finite-*M* limitation, it takes into account the unknown noise.

## 6. Concluding remarks

Two reviewers’ comments made us think that we should point out here the difference between linear prediction and our goal of estimating the true regression coefficient. In linear prediction, the aim is to predict *y* from *x* linearly. In that case, the ordinary least squares regression coefficient is appropriate, since it predicts *y* as best it can (in the least squares sense) given past data. When the noise is large, *r*^{2} is small [see (18)], and such a linear predictor will only explain a small percentage of the variance; this results in all predictions being near the mean. Because of this, in some prediction circles the regression fit is “inflated” so that the regression estimates have the same variance as *y* (i.e., effectively the ordinary least squares coefficient is replaced by sgn(*r*)*s*_{y}/*s*_{x} = *α*_{GMR}), even though inflation increases the mean squared error of the prediction of *y*.

The preceding is related to, but separate from, our goal to find the true regression coefficient *α* between the variables *Y* and *X* given noisy realizations *y* and *x* as stated in (2) and (3). In our case, if it is known that the “noise”-to-signal ratio in at least one of the variables *x* and *y* (say *x*) is small, then the ordinary least squares regression coefficient is nearly unbiased and can be used [see (9) with small *n*_{x}]. But when neither noise-to-signal ratio is known to be small, the ordinary least squares regression coefficient is biased, and the confidence intervals calculated for *α*_{OLS} are not confidence intervals for *α*, the coefficient we really want. When the noise-to-signal ratios are unknown, confidence intervals for *α* can instead be obtained as described here, using only the number *M* of independent samples and, from the data, the standard deviations of each variable and the correlation coefficient between them.

We thank F. Huffer for helpful comments on an early version of our paper and the National Science Foundation for financial support (Grants OCE-0850749 and OCE-1155257).

## REFERENCES

Barker, F., Y. C. Soh, and R. J. Evans, 1988: Properties of the geometric mean functional relationship. *Biometrics*, **44**, 279–281.

Bunge, L., and A. J. Clarke, 2009: A verified estimation of the El Niño index Niño-3.4 since 1877. *J. Climate*, **22**, 3979–3992.

Clarke, A. J., 2008: *An Introduction to the Dynamics of El Niño & the Southern Oscillation.* Academic Press, 324 pp.

Emery, W. J., and R. E. Thomson, 2001: *Data Analysis Methods in Physical Oceanography.* 2nd rev. ed. Elsevier, 638 pp.

Frost, C., and S. G. Thompson, 2000: Correcting for regression dilution bias: Comparison of methods for a single predictor variable. *J. Roy. Stat. Soc.*, **A163**, 173–189.

Garrett, C., and B. Petrie, 1981: Dynamical aspects of the flow through the Strait of Belle Isle. *J. Phys. Oceanogr.*, **11**, 376–393.

Jolicoeur, P., 1975: Linear regressions in fisheries research: Some comments. *J. Fish. Res. Board Canada*, **32**, 1491–1494.

Kendall, M. G., and A. Stuart, 1973: *The Advanced Theory of Statistics.* 3rd ed. Vol. 2, Griffin, 723 pp.

McArdle, B. H., 1988: The structural relationship: Regression in biology. *Can. J. Zool.*, **66**, 2329–2339.

McArdle, B. H., 2003: Lines, models, and errors: Regression in the field. *Limnol. Oceanogr.*, **48**, 1363–1366.

Ricker, W. E., 1973: Linear regressions in fishery research. *J. Fish. Res. Board Canada*, **30**, 409–434.

Ricker, W. E., 1975: A note concerning Professor Jolicoeur’s comments. *J. Fish. Res. Board Canada*, **32**, 1494–1498.

Riggs, D. S., J. A. Guarnieri, and S. Addelman, 1978: Fitting straight lines when both variables are subject to error. *Life Sci.*, **22**, 1305–1360.

Sokal, R. R., and F. J. Rohlf, 1995: *Biometry: The Principles and Practice of Statistics in Biological Research.* 3rd ed. W. H. Freeman and Co., 887 pp.

Sprent, P., and G. R. Dolby, 1980: The geometric mean functional relationship. *Biometrics*, **36**, 547–550.