## 1. Introduction

A homogeneous climatic time series is one in which variations through time are caused solely by variations in weather and climate (Conrad and Pollak 1950). Homogeneous time series are required to make meaningful assessments of climate change. Consequently, much effort has been expended in the detection of and adjustment for data inhomogeneities in temperature records by researchers concerned with analyzing climatic trends and variability (e.g., Vincent 1998; Alexandersson and Moberg 1997; Easterling et al. 1996; Easterling and Peterson 1995; Karl and Williams 1987; WMO 1966). Inhomogeneities in a climatic data time series can be caused by changes in instrumentation, observing practices, the method used to calculate mean daily temperature, station location, and environmental conditions surrounding the observation station (Peterson et al. 1998; Karl et al. 1988; Baker 1975). Such changes can result in sharp discontinuities in a time series when, for example, different instrumentation is installed or a station is moved. Other changes can lead to a trend as in the case of urbanization or instrument drift.

Techniques for detecting errors in daily data that do not deal comprehensively with inhomogeneities are often used operationally and are characterized by a narrow time window (e.g., one month for daily temperature and precipitation observations). Procedures such as those described by Reek et al. (1992) check for internal consistency errors arising from, for example, transposition of numbers in manual recording of observations (e.g., 35 is recorded as 53) or the transposition of maximum and minimum temperatures. Internal consistency errors, which we will call class A errors, are typically detected early in a data quality assurance process. A second class (class B) of data error can result from electronic noise in measurements, defective communications, sensor degradation prior to replacement, lightning strikes, the systematic attribution of observations to the wrong calendar day (a process often called “shifting”), and a host of other often unexplainable causes. A scheme called Areal Edit and its graphical user interface, Geographical Edit and Analysis, were developed at the National Climatic Data Center (NCDC) and are used operationally to detect class B errors in observations from the National Weather Service (NWS) U.S. Cooperative Observer (COOP) Network (Guttman and Quayle 1990; Reek and Owen 1990). Detection of class B (“system” or “random”) errors is based on neighbor comparisons, since erroneous observations may pass internal consistency checks but deviate in unrealistic ways from neighboring stations.

Detection of inhomogeneities, the third class (class C) of error, usually requires a much longer data record. Testing is often conducted through an analysis of annual or seasonal means, generally requiring many years of data, including a modest length of record (e.g., five years) following the initiation of an inhomogeneity (Easterling and Peterson 1995; Karl and Williams 1987). The motivation for our work, therefore, was the need to develop a methodology that could detect potential inhomogeneities in, comparatively speaking, “near–real time,” that is, through an analysis of daily observations with the same operational time perspective as the quality control procedures cited above (i.e., a daily time series 1 or 2 months in length). The new methodology is the subject of this paper. We use the qualifier “potential” because it may be difficult to definitively state that an inhomogeneity has occurred with a short record length. The goal, however, is to provide early warning of a *possible* inhomogeneity to station field managers so that corrective measures, if needed, can be taken in a timely manner. With early detection, continued corruption of the climate record can be prevented. Whereas inhomogeneity testing of historical time series of temperature typically includes appropriately adjusting the record, the methodology described in this paper involves only the detection of inhomogeneities. We will show that the methodology for detection of class C errors also finds temperature errors that belong to class B, defined above.

Organization of the paper is as follows. In section 2 we describe the statistical model used for identifying an unreasonable deviation of daily maximum and daily minimum temperatures at a candidate station from neighboring stations. The model involves evaluation of an entire month of daily data as opposed to the specific examination of individual days. In section 3 we review the criteria used to select neighbors for comparison with each candidate station. An evaluation of the model assumption is presented in section 4. In section 5, the design of a statistical test based on cross correlation is described and in section 6, a second statistical test, based on the presence of autocorrelation in the data model, is discussed. In section 7, application of the tests at the NCDC is discussed, and the summary and conclusion are given in section 8.

## 2. Random model for temperature difference

Our approach was to create a monthly time series such that under normal conditions—that is, when there is no data problem—the series is white noise, and under abnormal conditions—when there is a potential data problem—it is not white noise. An appropriate white noise test then can be applied to the data.

The data model is the difference in standardized daily temperature departures between a candidate station *c* and a neighbor *n*. Mathematically, the model is

$$D_{ci,ni} = \frac{T_{ci} - \overline{T}_{c}}{S_{c}} - \frac{T_{ni} - \overline{T}_{n}}{S_{n}}, \qquad (1)$$

where *T*_{ci} and *T*_{ni} are the daily maximum or minimum temperatures at candidate *c* and neighbor *n* on the *i*th day of the month; *T̄*_{c} and *T̄*_{n} are the monthly mean maximum or minimum temperatures for the current data month at the candidate and neighbor *n*; and *S*_{c} and *S*_{n} are the standard deviations of maximum or minimum temperatures for the current data month at the candidate and neighbor.

Multiple difference series, which we call D-series, are created, each comprising a time series of daily differences between the same candidate and one of the several neighbors (from 3 to 6). Since the method is a neighbor-check scheme, the principal assumption is that the candidate station and its neighbors are exposed to the same underlying mesoscale weather regimes during the month. If that is the case, then, in the absence of a data problem, the D-series represents random differences between standardized daily departures from the measurement sites. Systematic differences between stations are removed by standardization at each station. The monthly mean is subtracted to remove, for example, systematic temperature differences due to elevation differences. The daily departures from the current monthly mean are scaled by dividing by the standard deviation, which can be important, for example, when comparing stations with somewhat different daily variability. Ideally, what remains after subtraction of two standardized daily temperature time series are random microscale differences given by the D-series in Eq. (1).
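The standardization described above can be sketched in a few lines. This is a minimal illustration of Eq. (1), not operational code: the function name is ours, and the use of the population standard deviation (rather than the sample standard deviation) is an assumption.

```python
from statistics import mean, pstdev

def d_series(cand, neigh):
    """Standardized daily departure differences (Eq. 1) for one month.

    cand, neigh: equal-length lists of daily maximum (or minimum)
    temperatures at the candidate and one neighbor. The population
    standard deviation is used here as an assumption.
    """
    mc, mn = mean(cand), mean(neigh)        # current-month means
    sc, sn = pstdev(cand), pstdev(neigh)    # current-month std devs
    return [(c - mc) / sc - (n - mn) / sn for c, n in zip(cand, neigh)]
```

Because each series is standardized before differencing, a constant offset between stations (e.g., an elevation effect) drops out entirely.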

## 3. Neighbor selection

A candidate's neighbors must share a similar time of observation to conform to the assumption that the two stations are tracking one another due to common mesoscale influences. If observation times differ, systematic temperature differences can creep into the D-series as a maximum or minimum temperature occurring at nearly the same time at two nearby stations may be associated with different calendar days at a morning observer versus an afternoon observer. Our requirement that candidate and neighbor station groups share a common observation time (e.g., morning, afternoon, or midnight) can increase the average distance between the NWS COOP stations that are compared. The consequence is that large station distances can redden the D-series as the mesoscale signal at a candidate and its neighbors becomes out of phase at large between-station distances. Fortunately, this was not found to be the case in practice for most of the United States since NWS COOP station density is high enough for a sufficiently large pool of neighbors to be found usually within 100 km.

While standardization of the temperature series should remove systematic effects of elevation differences on temperature, existing neighbor selection rules at the NCDC, in addition to stratifying station comparison groups into morning, afternoon, and midnight readers, also pair stations that have an elevation difference no greater than an empirically defined threshold. This maximum elevation difference is scaled according to the elevation of the candidate station. In this way, a somewhat greater candidate–neighbor elevation difference is permitted at high elevation stations than at low elevation stations. Further, selection of neighbors across topographical barriers is not permitted and attempts are made to pair coastal candidate stations with coastal neighbors. In addition to these empirically based selection criteria, state climatologists from western states have provided NCDC with lists of suitable neighbors for candidate stations in their states, based on their experience. This list of neighbors, as well as those neighbors meeting all selection criteria, may be superseded by an NCDC “override” list of neighbors that seem appropriate but are known through experience to be poor choices for a given candidate station.

## 4. Verification of model assumption

Two statistical assessments were conducted to evaluate the assumption behind the random model for the D-series when applied to observed daily time series: a test for the number of runs of like signs in the D-series, and an evaluation of the distribution of the lag-1 autocorrelation coefficient for a large sample of D-series. The nonparametric runs test (Ostle 1963, p. 470) was selected since no particular underlying probability distribution (e.g., Gaussian) is required. For a random time series with zero autocorrelation the population distribution of runs of like sign has been derived; if there are too few or too many runs, the null hypothesis of random differences is rejected. Too few runs might indicate the presence of a trend or a sudden change in the level of temperature difference at some point during the month being examined. Too many runs might indicate a temperature difference too high one day, too low the second day, too high the third day, etc. Situations in which there would be too many runs were evaluated to be unlikely, so a one-sided test was conducted for too few runs only. For 30- or 31-day months, the number of runs at the 5% level of significance is 11. Thus, if there are fewer than 11 runs, a flag is raised.
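The runs count and the one-sided flag can be sketched as follows. The function names are ours, grouping zero differences with the positive sign is an assumption, and the critical value of 11 for a 30- or 31-day month follows the text.

```python
def count_runs(ds):
    """Count runs of like sign in a D-series. Zeros are grouped with
    the positive sign here (an assumption about tie handling)."""
    signs = [1 if x >= 0 else -1 for x in ds]
    # One run to start, plus one more for every sign change.
    return 1 + sum(1 for a, b in zip(signs, signs[1:]) if a != b)

def runs_flag(ds, critical=11):
    """One-sided runs test: flag only when there are too few runs
    (fewer than 11 at the 5% level for a 30- or 31-day month)."""
    return count_runs(ds) < critical
```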

The runs test was applied to published NWS COOP maximum and minimum temperature data for stations located in 10 Midwestern states for observations recorded in January and July 1997. The test was also applied to a sample of stations in the more heterogeneous terrain of Colorado. Each of the approximately 1000 Midwestern candidate stations and 150 Colorado stations was paired with up to five neighboring stations to form D-series. In both station samples, the results of the runs tests indicated that the occurrence of series flagged for too few runs was close to 7%, slightly higher than the expected 5%. Three primary reasons account for this slightly higher than expected percentage. First, the assessment of the random model was applied to observed data that include all undetected data errors. The presence of data errors increases the likelihood of fewer than expected sign changes in the D-series. Second, simulations show that using values recorded to the nearest whole degree inflates the number of runs test flags by as much as 1% or 2%, in spite of using standardized departures. Third, there are meteorological situations in which the assumptions behind the random model are not met for a given candidate and neighbor pool (see section 5). Each reason leads to a slight reddening of the D-series.

An alternative to the runs test that also demonstrates a reddening of the random data model in practice is to compare the distribution of lag-1 (1-day) autocorrelation coefficients for observed D-series to the same distribution for simulated white noise. For a population of white noise D-series, the distribution should be centered on zero and have a standard deviation *σ* equal to 1/(*n*)^{1/2}; with 30 or 31 daily observations for a month, *σ* is approximately 0.18. Figure 1 shows a comparison of two observed lag-1 autocorrelation coefficient distributions with the same distribution for 10 000 simulated white noise D-series. The observed distributions include calculations using both maximum and minimum temperatures and are shifted somewhat to the right with respect to the random distribution. The distribution compiled from exclusively Midwestern stations has a mean of 0.05 and a standard deviation of about 0.19. For a sample including all NWS COOP stations, the mean lag-1 autocorrelation coefficient of the D-series is about 0.07 with a standard deviation of 0.20. From this comparison it is clear that, in practice, the D-series only approximate white noise. Accordingly, any thresholds for the magnitude of correlation coefficients used in error detection described later in this paper were based on the *observed* distributions and not on white noise. On the basis of comparison of a number of observed lag-1 autocorrelation coefficient distributions for different areas and time periods, it was concluded that there is little variation in the distribution by season or region, implying that the same threshold could be used everywhere at all times.
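The lag-1 coefficient and the white noise sampling check described above can be sketched briefly; the function name is illustrative, and returning 0 for a constant series is our convention.

```python
from math import sqrt
from statistics import mean

def lag1_autocorr(x):
    """Lag-1 (1-day) autocorrelation coefficient of a daily series."""
    m = mean(x)
    den = sum((a - m) ** 2 for a in x)
    if den == 0:          # constant series: define the coefficient as 0
        return 0.0
    return sum((a - m) * (b - m) for a, b in zip(x, x[1:])) / den

# For white noise of length n, the sampling standard deviation of the
# coefficient is roughly 1/sqrt(n); for a 31-day month that is ~0.18.
sigma_31 = 1 / sqrt(31)
```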

## 5. Cross-correlation test

The methodology we have developed to detect inhomogeneities (class C errors) comprises two correlation tests: a cross-correlation test discussed in this section, and a lag-1 (1-day) autocorrelation test described in section 6. In fact, experience shows that the cross-correlation test more commonly detects a wide variety of system (class B) errors than inhomogeneities, but examples of detecting both classes of error will be shown. The lag-1 autocorrelation test is designed primarily to detect inhomogeneities.

The basis for the cross-correlation test is that a high value of the correlation coefficient between two D-series (candidate and one neighbor, and candidate and another neighbor) indicates that the candidate is significantly unlike its neighbors. When a sufficient number of correlation coefficients (a candidate station with 5 neighbors yields 10 coefficients) exceeds a critical value based on the frequency distribution of coefficients, a flag is raised for the candidate station for the given month. Numerous data problems have been uncovered by this test, examples of which will be shown.

Figures 2 and 3 illustrate, through a contrived example, why two D-series are highly correlated when there is a data problem with the candidate station. Figure 2a is a plot of the daily maximum temperatures for July 1997 at Mount Carroll, Illinois, and three of its available neighboring stations. There was no evidence of any data problem during this month. Figure 2b depicts the time series of standardized daily departure differences (D-series) between the candidate station (*c*), Mount Carroll, and the three surrounding neighbors (*n*1, *n*2, *n*3) calculated using Eq. (1). The random nature of the time series is evident and can be seen also in the scatterplots between the combinations of D-series that are shown in Fig. 2c. The same temperature time series are shown in Fig. 3 with one change: the daily temperature series at the candidate station has been shifted for two consecutive days (as depicted by the thick dashed line). In other words, the maximum temperature recorded on the morning of 21 July is now attributed to 20 July; likewise, the observation for 22 July is attributed to 21 July (the time series from 1 to 19 July and from 22 to 31 July are unaltered). The alteration of the temperature series mimics the practice of shifting in manual observations whereby an observer, a “shifter,” attributes an observation to its perceived day of occurrence rather than to the day that it was recorded (see Reek et al. 1992). Note that the days with the shift have large differences of departure (Fig. 3b) and that these differences are all of the same sign. The implication of the time series containing these departures is that any two D-series will be highly correlated. This is apparent from the scatterplots shown in Fig. 3c. In this example, all three combinations of D-series are highly correlated (i.e., Dcn1 vs Dcn2, Dcn1 vs Dcn3, Dcn2 vs Dcn3), a consequence of the heavy influence of the shifted days on the D-series values.

An actual example of a shifter (class B error) is shown in Fig. 4a for Dwight, Illinois, in which observations beginning on 18 July were shifted for the remainder of the month. The D-series are shown in Fig. 4b, where it can be seen that the series track each other closely beginning on the day of the shift. The associated scatterplots of all pairs of D-series, four of which are shown in Fig. 4c, illustrate the correlation between D-series with many correlation coefficients in excess of 0.9. Although this illustration is for a 13-day period, it is more common to observe an entire month of observations systematically shifted.

To gain insight into the distribution of cross-correlation coefficients for random data, we simulated monthly time series of standardized white noise D-series, calculated the correlation coefficients, and plotted their frequency distribution. As shown in Fig. 5, the distribution for white noise is slightly asymmetric about 0.5, the mean correlation coefficient. The mean value of 0.5 arises when the correlation between any two stations (*c,* *n*1, *n*2, etc.) forming the D-series is approximately equal (see the appendix). Next we calculated the frequency distribution for observed D-series, also shown in Fig. 5. While its standard deviation is larger than for white noise, the mean is also close to 0.5.
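The 0.5 mean for white noise can be reproduced with a small seeded simulation. This sketch assumes each station's standardized series is independent unit-variance noise, a simplification of the model; because *D*_{cn1} and *D*_{cn2} share the candidate term, their correlation averages about 0.5.

```python
import random
from itertools import repeat
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

random.seed(0)
coeffs = []
for _ in range(2000):
    # Independent standardized 31-day series: candidate c, neighbors n1, n2.
    c, n1, n2 = ([random.gauss(0, 1) for _ in range(31)] for _ in range(3))
    d1 = [a - b for a, b in zip(c, n1)]  # the two D-series share the
    d2 = [a - b for a, b in zip(c, n2)]  # candidate term c
    coeffs.append(pearson(d1, d2))
```

Under this model the shared candidate term contributes half the variance of each D-series, so the expected cross correlation is 0.5, consistent with the appendix result cited above.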

Given the range of these distributions, it is possible to draw a high cross-correlation coefficient by chance, both in the simulated and observed D-series. However, even though each series contains a common term, the cross correlations themselves were found to be essentially independent of one another. That is, if the correlation between *D*_{cn1} and *D*_{cn2} is high, there is little preference for the correlation between *D*_{cn2} and *D*_{cn3} to be high or low. Because of this, it becomes increasingly unlikely to draw multiple occurrences of high correlation coefficients between error free, random D-series even given the initial draw of a high correlation coefficient. The independence of the cross correlations was determined using simulated D-series and by examining the marginal distribution composed of the remaining companion coefficients when the magnitude of a coefficient exceeded the value corresponding to a significance level of 5% (one tail).

The probability *α* of drawing at random *k* or more companion correlation coefficients greater than or equal to a certain magnitude can be calculated using the binomial distribution according to

$$\alpha = \sum_{r=k}^{M} \binom{M}{r} p^{r} (1-p)^{M-r}, \qquad (2)$$

where *M* is the number of companion correlation pairs, *p* is the probability that a randomly selected correlation coefficient will be greater than a chosen magnitude, and *r* is the dummy variable of summation. The probability of drawing one *or more* correlation coefficients with *p* = 0.05 and *M* = 10 (corresponding to 5 neighbors) is about 40% but drops off quickly for multiple high correlation coefficients. Using the same parameters as an example, the probability *α* is only about 1% for drawing three or more coefficients with values equivalent to the 5% probability level.
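The binomial tail probability just described can be computed directly; a minimal sketch (the function name is ours):

```python
from math import comb

def alpha(k, M, p):
    """Probability of drawing k or more of M companion correlation
    coefficients beyond a threshold that a single coefficient exceeds
    with probability p (a binomial tail sum)."""
    return sum(comb(M, r) * p**r * (1 - p)**(M - r) for r in range(k, M + 1))
```

With *p* = 0.05 and *M* = 10 this gives about 0.40 for *k* = 1 and about 0.01 for *k* = 3, matching the values quoted above.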

The cross-correlation test, therefore, is based on the probability of obtaining multiple occurrences of high correlation coefficients among pairs of random, companion D-series. Since the probability of multiple occurrences is low in error-free data but high when the candidate station series contains errors, multiple occurrences, when present, are likely due to a data problem. Although the range in the distribution of observed correlation coefficients shown in Fig. 5 is substantially different than that for white noise generated D-series, probabilities of multiple occurrences still can be derived using the binomial distribution by basing the threshold magnitude of the correlation coefficient for a given significance level on the *observed* distribution of correlation coefficients. With *p* = 0.05, the threshold value of the correlation coefficients is about 0.82. A comparison also was made of the frequency distribution of correlation coefficients between D-series from Midwestern (MW) stations and a distribution from stations in Colorado (CO) and adjacent states. The purpose was to determine whether the distribution for mountainous regions was similar to that in more homogeneous terrain. In fact, the distributions are nearly indistinguishable, with means of 0.43 and 0.44 for MW and CO, respectively, and a 5% significance level of about 0.825 for both samples. The similar significance level suggests that the same threshold can be used with reasonable confidence across the country.

In the application of the cross-correlation test, a minimum of three and as many as six neighbors are selected for each candidate station. With three neighbors, the number of D-series correlation pairs, *M*, is 3; with four neighbors it is 6; with five, 10; and with six, 15. The probability *α* of obtaining 2 (3 neighbors), 3 (4 neighbors), or 4 or more (5 or 6 neighbors) correlation coefficients with *p* = 0.05 is used as a threshold for error evaluation. When the threshold is met or exceeded, a flag is raised. The value of *α* varies somewhat depending on the number of neighbors, but in any case is less than 0.01. In the shifter example shown in Fig. 4, all 10 possible D-series correlation pairs exceeded the critical value (*p* = 0.05) of 0.825.
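This flagging logic can be sketched as follows. The 0.825 threshold and the per-neighbor-count requirements (2, 3, or 4 coefficients) follow the text; the function names are illustrative.

```python
from itertools import combinations
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def cross_corr_flag(d_series_list, threshold=0.825, k_required=4):
    """Flag a candidate month when k_required or more of the pairwise
    correlations among its D-series meet the empirical 5% threshold
    (0.825 per the text). Per the text, k_required is 2 for 3 neighbors,
    3 for 4 neighbors, and 4 for 5 or 6 neighbors."""
    coeffs = [pearson(a, b) for a, b in combinations(d_series_list, 2)]
    return sum(c >= threshold for c in coeffs) >= k_required
```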

Another example of a cross-correlation test flag is shown in Fig. 6. The July 1997 maximum temperature time series at Kaskaskia, Illinois, along with five of its neighboring stations, are shown in Fig. 6a. The five D-series are shown in Fig. 6b. In this case, 3 out of 10 D-series correlation pairs were significant at the 5% level threshold of 0.825. Four of the 10 D-series combinations are shown as scatterplots in Fig. 6c. If the data were error-free, the probability of finding three or more correlation coefficients significant at the 5% level would be about 1%.

The flagging of a number of temperature time series for 1997 at Kaskaskia, Illinois, together with the fact that the mean maximum temperature at Kaskaskia was 4.6°C higher than the average of its neighbors in July of that year, prompted a historical look into the temperatures at this station. Given that there is no adequate physiographic reason to explain a difference of this magnitude, the excessively high temperatures at the candidate station were suspect. Figure 7 is a plot of the monthly mean July maximum temperatures at Kaskaskia and its three nearest neighboring stations. Beginning in the early 1980s, July maximum temperatures at this station warmed relative to its neighbors. Although not in the station history file, we have been informed by G. W. Goodge (1999, personal communication) that the surface in the vicinity of the instruments was paved. The Kaskaskia temperature time series is an example of a class C error (inhomogeneity).

A third example of a correlation test flag is shown in Fig. 8. Figure 8a is the time series for the Cooperative Observer station at Toledo Blade, Ohio; the D-series are shown in Fig. 8b and four selected scatterplots between D-series are shown in Fig. 8c. In this example, 15 of 15 possible cross-correlation coefficients exceeded the threshold value of 0.825. It is clear from this example and the examples shown in Figs. 4 and 6 that when a candidate's time series is unlike its neighbors, points in a scatterplot between any two D-series will align themselves along the diagonal running from the lower left to the upper right. If, on the other hand, the time series of a neighbor is unlike that of the candidate and other neighbors, values in the scatterplot between D-series, one of which contains the offending neighbor, will lie along one of the two axes rather than along the diagonal and a low correlation coefficient will be obtained. Consequently, when a critical number of high correlation coefficients occur, the problem clearly lies with the candidate and not with the neighbors. Since all neighbors also become candidates, if there is a problem in a neighbor's series, it will be discovered when it becomes a candidate. The Toledo Blade temperature series is an example of a class B or system error.

## 6. Lag-1 autocorrelation test

The principal function of the lag-1 autocorrelation test is to detect inhomogeneities (class C errors). The basis of this test follows from the observation that when a step change in temperature appears sometime during the month in the time series of the candidate station, the magnitude of the lag-1 (1-day) autocorrelation coefficient of each D-series increases; that is, the time series is reddened. This occurs because, in the process of standardization, the magnitude of the step change is systematically redistributed throughout the daily time series. Thus, if the step change is positive, each D-series is decreased for the period prior to the introduction of the step change, and increased thereafter. The reverse is true for a negative step change. The larger the step change and the more closely stations track one another day-to-day, the more visible will be the impact of a step change. The presence of a linear trend at the candidate station has a similar effect.

The reddening of a D-series in the presence of a step change or trend can be illustrated through the use of a fabricated example. Figure 9a shows the daily maximum temperature series during January 1997 for the NWS COOP station at Medford, Wisconsin, and its neighboring stations. There were no apparent data errors present in the time series at this station during that month. Figure 9b shows the five D-series and a solid line that connects the average of each day. The lag-1 autocorrelation coefficient for the average D-series is 0.19. A step change of 2.2°C (4°F) was added to the maximum temperature time series on the 16th of the month and is shown in Fig. 10a. Compared to the original D-series shown in Fig. 9b, the impact of the step change on the D-series can be readily seen in Fig. 10b. The autocorrelation in each D-series has increased substantially. The lag-1 (1-day) autocorrelation coefficient of the time series formed from the average of all D-series increased from 0.19 to 0.75, the latter clearly significant at the 1% level for the observed data distribution shown in Fig. 1. Although each D-series is reddened with the introduction of a step change, we have found the lag-1 autocorrelation coefficient determined using the daily *average* of each D-series (shown in Fig. 10b) to provide a more sensitive test statistic than checking for high lag-1 coefficient values among the individual coefficients.

While it may seem that a 2.2°C step change is quite large and therefore easily detectable, the magnitude of such a step change relative to the daily temperature variance for a month can be quite small. In fact, visual inspection of a temperature plot such as shown in Fig. 10a would not lead to any suspicion that a step change has occurred. Therefore, lag-1 autocorrelation test flags can occur for step changes that cannot be visibly identified. In addition, since we are viewing only one month of data, we are unable to reliably estimate the magnitude of any potential step change. Consequently, some kind of verification using historical data is in order. It also should be pointed out that because the temperature time series are standardized, if a step change occurred prior to the month being tested and a constant bias is present throughout the entire current month, neither the autocorrelation test nor the cross-correlation test will detect the error. Both tests are useful for detecting errors in the month that they occur or in cases where a bias, if present throughout the month, is not constant across all days as in the example shown in Fig. 6.

The extent to which an average D-series is reddened is shown in Fig. 11. The thin solid line is the distribution of lag-1 autocorrelation coefficients of the average D-series for maximum and minimum temperatures at about 1000 candidate stations in the Midwest and is the same as the corresponding distribution in Fig. 1. Distributions of the lag-1 autocorrelation coefficients for the same candidate stations, with step changes of 1.1°C (2°F) and 2.2°C (4°F) added individually to each candidate station at the middle of the data month, are shown by the broken and heavy solid lines, respectively. The distribution of lag-1 autocorrelation coefficients is increasingly shifted to the right as the magnitude of the step change increases.

The lag-1 autocorrelation test is based on the probability level associated with the magnitude of the lag-1 (1-day) autocorrelation; the higher the coefficient, the greater the likelihood that there is a problem at the candidate station. We set the critical value for the magnitude of the lag-1 autocorrelation coefficient for testing to be equivalent to the 1% level or about 0.56. In addition, since the impact of a step change on the lag-1 autocorrelation coefficient is greatest when the step change occurs in the middle of the time series, in practice, we use a 30-day running window across two months of daily data and calculate the coefficient for each of the windows. If the coefficient exceeds the critical value in any one of the time windows, the station is flagged. An example of a station flagged by the autocorrelation test is shown in Fig. 12. The lag-1 autocorrelation coefficient for the average D-series formed by the candidate (Monahans, TX) and its neighbors was 0.82, which represents a significance level of about 0.03%.
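The running-window version of the test might be sketched as follows. The 30-day window and the 0.56 critical value follow the text; the function names and the zero-variance handling are our assumptions.

```python
from statistics import mean

def lag1_autocorr(x):
    """Lag-1 autocorrelation coefficient (0 for a constant series,
    by our convention)."""
    m = mean(x)
    den = sum((a - m) ** 2 for a in x)
    if den == 0:
        return 0.0
    return sum((a - m) * (b - m) for a, b in zip(x, x[1:])) / den

def autocorr_flag(avg_d_series, window=30, critical=0.56):
    """Slide a 30-day window across two months of the average D-series
    and flag if the lag-1 coefficient exceeds the 1% critical value
    (about 0.56, per the text) in any window."""
    return any(
        lag1_autocorr(avg_d_series[i:i + window]) > critical
        for i in range(len(avg_d_series) - window + 1)
    )
```

Sliding the window keeps sensitivity high regardless of where in the two-month span a step change occurs, since the coefficient responds most strongly when the step falls near the middle of a window.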

## 7. Application at the National Climatic Data Center

A quality assurance system has been in place for a number of years at the NCDC to detect internal consistency (class A) and system (class B) errors. However, the present scheme (see section 1) for detecting system errors is based on exceedance of an empirically defined difference threshold between daily temperature departures at a candidate station and those at its neighbors. Because the threshold is fixed, some system errors pass undetected. This can occur especially when the variance in daily temperatures is relatively small throughout a data month and also is generally true when there is a data inhomogeneity. Although the magnitude of the difference threshold can be reduced to improve error detection, doing so can increase the number of Type I errors (false positives) to a level that rapidly becomes unworkable. The correlation tests, in addition to detecting potential inhomogeneities, improve the detection of system errors without a significant increase in Type I errors. Currently, the two correlation tests are run at the end of a data month after the data have been digitized from the observer forms and undergone review by existing internal consistency and system error checks.

Table 1 shows monthly station error totals from the approximately 5700 temperature observing stations during the first 8 months of operation. The number of cross-correlation test flags varied from 88 to 140 stations in a given month while the number of lag-1 autocorrelation test flags ranged from 49 to 174. On rare occasions a station's time series may raise both a cross-correlation and a lag-1 autocorrelation test flag. The combined number of flags represents roughly between 1% and 3% of the total number of reporting stations. When a station flag is raised, the data for that month are manually reviewed by validators who assess the validity of the error. Based on these assessments, about 80% to more than 90% of the cross-correlation flags in a given month have been evaluated as true data errors. Many of these flags reflect “shifters” that have not been detected by the Areal Edit component of the system. In contrast, certain meteorological conditions can result in lag-1 autocorrelation flags and consequently the success ratio for this test is lower; that is, the number of false positives is relatively higher. Nevertheless, the lag-1 autocorrelation test has proven capable of identifying steplike changes and other problems such as indicated in Fig. 12 and is valuable as an early warning of potential inhomogeneities. When an error is detected, an existing data value substitution scheme provides a replacement value, if possible, but the original observation is also retained.

## 8. Summary and conclusions

We have described the development of two statistical error detection tests useful in the quality assurance of daily temperature observations and have shown examples of their application. The first, the cross-correlation test, was designed to detect inhomogeneities but in practice more commonly finds a variety of other data errors. The second, the lag-1 (1-day) autocorrelation test, is used primarily to discover data inhomogeneities. Both tests evaluate standardized difference series (D-series) formed from observations at a candidate station and each of three or more neighboring stations. Under the assumption that the candidate and its neighbors are exposed to the same mesoscale weather conditions, the D-series approximate white noise when the data are error free but depart from white noise when errors are present at the candidate station. The cross-correlation and lag-1 autocorrelation tests check for white noise and thereby identify potential problems in the candidate station's time series.

The two correlation tests are now used operationally at the NCDC and have been added to existing quality assurance procedures (e.g., Reek et al. 1992), although they could stand alone as a general error detection method. Moreover, even though the examples shown were from the NWS Cooperative Observer Network, we conclude that the two tests are appropriate for the quality assurance of temperature observations from any observation network whose station density is sufficient to provide an adequate neighbor selection pool. In fact, the tests are now also being used to detect possible errors in the National Weather Service's Automated Surface Observing System (ASOS). The tests were developed specifically to detect potential inhomogeneities in an operational time frame so that corrective measures can be taken soon after a problem is discovered. If a candidate station is flagged with a potential inhomogeneity or other error, an effort to discover its cause is undertaken. Establishing that an inhomogeneity in fact exists, and assessing its magnitude, requires analyzing time series much longer than one or two months. Determining the magnitude of an inhomogeneity from time series of monthly means, rather than from monthly time series of daily data, constitutes the next step in operational quality assurance and will be described in a forthcoming paper.

## Acknowledgments

The authors wish to thank Dr. Nathaniel Guttman for his thoughtful discussion regarding the distribution of cross-correlation coefficients and for his general review of this paper. We would also like to thank Mr. Thomas Reek and Mr. William Angel for help in incorporating the tests into the operational quality assurance procedures at the National Climatic Data Center.

## REFERENCES

Alexandersson, H., and A. Moberg, 1997: Homogenization of Swedish temperature data. Part I: Homogeneity test for linear trends. *Int. J. Climatol.*, **17**, 25–34.

Baker, D. G., 1975: Effect of observation time on mean temperature estimation. *J. Appl. Meteor.*, **14**, 471–476.

Conrad, V., and C. Pollak, 1950: *Methods in Climatology*. Harvard University Press, 459 pp.

Easterling, D. R., and T. C. Peterson, 1995: A new method for detecting and adjusting for undocumented discontinuities in climatological time series. *Int. J. Climatol.*, **15**, 369–377.

——, ——, and T. R. Karl, 1996: On the development and use of homogenized climate data sets. *J. Climate*, **9**, 1429–1434.

Guttman, N. B., and R. G. Quayle, 1990: A review of cooperative temperature data validation. *J. Atmos. Oceanic Technol.*, **7**, 334–339.

Karl, T. R., and C. N. Williams Jr., 1987: An approach to adjusting climatological time series for discontinuous inhomogeneities. *J. Climate Appl. Meteor.*, **26**, 1744–1763.

——, H. F. Diaz, and G. Kukla, 1988: Urbanization: Its detection and effect in the United States climate record. *J. Climate*, **1**, 1099–1123.

Ostle, B., 1963: *Statistics in Research*. 2d ed. Iowa State University Press, 585 pp.

Peterson, T. C., and Coauthors, 1998: Homogeneity adjustments of *in situ* atmospheric climate data: A review. *Int. J. Climatol.*, **18**, 1493–1517.

Reek, T., and T. Owen, 1991: GEA documentation and user's guide. Internal NCDC Doc., 30 pp.

——, S. R. Doty, and T. W. Owen, 1992: A deterministic approach to the validation of historical daily temperature and precipitation data from the Cooperative Network. *Bull. Amer. Meteor. Soc.*, **73**, 753–762.

Vincent, L. A., 1998: A technique for the identification of inhomogeneities in Canadian temperature series. *J. Climate*, **11**, 1094–1104.

WMO, 1966: Climatic change. WMO Tech. Note 79, 83 pp.

## APPENDIX

### Correlation between Two Random Time Series That Have a Common Term

The random time series *c*, *n*1, and *n*2 are standardized departures as defined by Eq. (2). At the outset we can expect that *D*_{c,n1} will be positively correlated with *D*_{c,n2} because of the common term *c*. In the derivation that follows, Cov, *E*, Var, and Corr are the respective mathematical covariance, expectation, variance, and correlation operators traditionally employed in statistical theory.

Consider the difference series *D*_{c,n1} and *D*_{c,n2}, in which *c*, *n*1, and *n*2 are treated as stationary random variables. Thus the day-of-month subscripts shown in Eqs. (1) and (2) are not needed:

$$D_{c,n_1} = c - n_1, \tag{A1}$$

$$D_{c,n_2} = c - n_2. \tag{A2}$$

Because *c*, *n*1, and *n*2 are standardized, each has zero mean and unit variance. The covariance of the two D-series is therefore

$$\mathrm{Cov}(D_{c,n_1}, D_{c,n_2}) = \mathrm{Var}(c) - \mathrm{Cov}(c, n_2) - \mathrm{Cov}(c, n_1) + \mathrm{Cov}(n_1, n_2) \tag{A3}$$

$$= 1 - \rho_{c,n_1} - \rho_{c,n_2} + \rho_{n_1,n_2}, \tag{A4}$$

where *ρ* is the correlation between the random variables shown in each pair of subscripts. Similarly, the variances of the D-series are

$$\mathrm{Var}(D_{c,n_1}) = 2(1 - \rho_{c,n_1}), \tag{A5}$$

$$\mathrm{Var}(D_{c,n_2}) = 2(1 - \rho_{c,n_2}). \tag{A6}$$

The correlation between the two D-series follows:

$$\mathrm{Corr}(D_{c,n_1}, D_{c,n_2}) = \frac{1 - \rho_{c,n_1} - \rho_{c,n_2} + \rho_{n_1,n_2}}{2\sqrt{(1 - \rho_{c,n_1})(1 - \rho_{c,n_2})}}. \tag{A7}$$

Under the assumption that the candidate and its neighbors experience the same mesoscale weather, the three pairwise correlations are approximately equal:

$$\rho_{c,n_1} \approx \rho_{c,n_2} \approx \rho_{n_1,n_2} \approx \rho. \tag{A8}$$

Substituting Eq. (A8) into Eq. (A7) gives

$$\mathrm{Corr}(D_{c,n_1}, D_{c,n_2}) \approx \frac{1 - 2\rho + \rho}{2(1 - \rho)} = \frac{1}{2}. \tag{A9}$$

If the approximations in Eq. (A8) are equalities, then Eq. (A9) is also an equality.
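The one-half result can be checked numerically. A Monte Carlo sketch, assuming the D-series are simple differences of standardized variables and that the three pairwise correlations share a common value ρ as in Eq. (A8):

```python
import numpy as np

rng = np.random.default_rng(42)
rho, n = 0.6, 200_000

# Construct c, n1, n2 with unit variance and pairwise correlation rho
# by mixing a shared mesoscale component with independent station noise.
shared = rng.normal(size=n)
def make_series():
    return np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * rng.normal(size=n)

c, n1, n2 = make_series(), make_series(), make_series()
d1, d2 = c - n1, c - n2

# The common term c induces a correlation near 0.5 between the two
# D-series, independent of the chosen value of rho.
r = float(np.corrcoef(d1, d2)[0, 1])
```

Repeating the experiment with other values of `rho` leaves `r` near 0.5, which is the point of Eq. (A9): the expected cross correlation between error-free D-series does not depend on how strongly the stations covary.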

Table 1. Total number of stations flagged by the correlation tests, by month (out of approximately 5700 reporting stations).