## 1. Introduction

The quality assurance (QA) procedures discussed herein were developed and applied to the data systems of the National Oceanic and Atmospheric Administration's Regional Climate Centers. The National Climatic Data Center (NCDC) began semiautomated review of the data validation for the cooperative climatological stations in 1982 (Guttman and Quayle 1990). Although NCDC's validation process became somewhat automated, many data continue to be inspected manually (Guttman et al. 1988).

Generally, there are two categories of tests: those that use data from a single site (Meek and Hatfield 1994) and those that use data from multiple sites. The second type compares a station's data against those of neighboring stations (Hubbard 2001; Reek et al. 1992). Statistical decisions play a large role in quality control efforts, but increasingly rules are introduced that depend upon the physical system involved. Examples are the testing of hourly solar radiation against the clear-sky envelope (Allen 1996; Geiger et al. 2002) and the use of soil heat diffusion theory to determine soil temperature validity (Hu et al. 2002). It is now recognized that QA works best as a seamless process linking the staff who operate the quality control software at a centralized location, where data are ingested, with the technicians in the field (Hubbard 2001; Shafer et al. 2000).

Quality assurance software consists of procedures or rules against which data are tested. Each procedure will either accept the datum as being true or reject the datum and label it as an outlier. This hypothesis (*H*_{o}) testing of each datum and the statistical decision to accept the datum or to note it as an outlier can have the outcomes shown in Table 1. If the datum is valid and is accepted as such or the datum is invalid and rejected, the QA procedure is working appropriately. When the datum is valid and is rejected by QA, a Type I error is committed. If the datum is not valid but is accepted by QA, a Type II error is committed.

Take the simple case of testing a variable against limits. Suppose the hypothesis is that a datum for a measured variable is valid only if it lies within ±3 standard deviations (*σ*) of the mean (*μ*). Then, assuming a normal distribution, the expectation is that *H*_{o} will be accepted 99.73% of the time with no error. Values that lie beyond *μ* ± 3*σ* will be rejected, and a Type I error results whenever a valid value is encountered beyond these limits. In these cases (*H*_{o} is rejected when the value is actually valid) the expectation is that a Type I error will be made 0.27% of the time, assuming for this discussion that the data have no errant values. If a “true” value is replaced with an “errant” value, the hypothesis will properly be rejected only if the “errant” value falls outside the range *μ* ± 3*σ*. Otherwise the value would be accepted when the hypothesis is actually false (the value is not valid), and this would lead to a Type II error. In this simple example, narrowing the limits against which the data values are tested will produce more Type I errors and fewer Type II errors, while widening the limits leads to fewer Type I errors and more Type II errors. For QA software, study is necessary to achieve a balance wherein one reduces the Type II errors (marks more “errant” data as having failed the test) while not increasing Type I errors to the point where valid extremes are brought into question. Because Type I errors cannot be avoided, it is prudent for data managers to always keep the original measured values regardless of the quality testing results.
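The dependence of the expected Type I rate on the width of the limits can be illustrated with a short calculation. The sketch below assumes, as in the example above, normally distributed data with no errant values; the function name `type_i_rate` is introduced here only for illustration.

```python
import math

def type_i_rate(f: float) -> float:
    """Two-sided tail probability P(|Z| > f) for a standard normal variable Z."""
    return 1.0 - math.erf(f / math.sqrt(2.0))

# Expected share of valid data rejected when the limits are mu +/- f*sigma.
for f in (2.0, 2.5, 3.0, 3.5):
    print(f"f = {f:.1f}: expected Type I rate = {100 * type_i_rate(f):.2f}%")
```

For limits of ±3 standard deviations this gives about 0.27%, in agreement with the figure quoted above; narrower limits raise the rate and wider limits lower it.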

In this manuscript we point to three major contributions. The first is the explicit treatment of Type I and Type II errors in the evaluation of the performance of quality control procedures to provide a basis for intercomparison of procedures. The second is to illustrate how the selection of parameters in the quality control process can be tailored to individual needs in regions or subregions of a widespread network. Finally, we introduce a new spatial regression test that uses a subset of the neighboring stations that provide the “best fit” to the target station. The spatial regression weighted estimate has characteristics that make it possible to build statistical confidence intervals for testing data at the target station.

## 2. Data and methodology

The tests performed in this study were conducted for six stations. These six stations are part of the cooperative weather observer network (TD3200 dataset at NCDC). Table 2 shows the location, the average annual maximum and minimum temperatures, the annual total precipitation, and the elevation of each station. The stations were chosen to represent different climate regimes. Crete, Nebraska, and Dickinson, North Dakota, are two sites in the High Plains where the latter is cooler and drier and at a higher elevation. Fort Myers and Key West, Florida, are both warm sites located in the vicinity of the Gulf of Mexico, although the latter is completely surrounded by water and the former is on the west side of the Florida peninsula. Tucson, Arizona, has a warm and dry climate, while Yellowstone Lake, Wyoming, has a cooler climate. Both Tucson and Yellowstone Lake are located in more complex terrain (deeper ridges and valleys) than the other sites (flatter terrain). The elevation range is from near zero at Key West to nearly 2400 m at Yellowstone Lake.

This study uses four procedures. Three tests are tuned to the prevailing climate: seasonal thresholds, seasonal rate of change, and seasonal persistence. The thresholds and limits for these tests are related to station climatology at the monthly level (period 1971–2000) as compared to previous efforts, which mainly used one set of limits for a variable, regardless of time of year (Shafer et al. 2000; Hubbard 2001). The fourth test is a spatial comparison, using linear regression to estimate confidence intervals for the station in question. Only valid (nonmissing) data are exposed to the tests described below.

The *“upper and lower” threshold* test checks whether a given variable (e.g., daily maximum temperature) falls in a specific range for the month in question. This test has been in use for some time. Where relatively new stations are involved, the threshold test is often employed by considering the climate extremes for the area (Shafer et al. 2000). When the limits are determined from the statistics of the distribution, it has been called the sigma test (Guttman et al. 1988). The threshold test for variable *x* is

$$\bar{x} - f\sigma_{x} \le x \le \bar{x} + f\sigma_{x}, \tag{1}$$

where $\bar{x}$ is the mean and *σ*_{x} is the standard deviation of the daily values (e.g., daily maximum values) for the month in question. The variable *x* may represent maximum temperature, minimum temperature, or rainfall. An analysis was performed on the data (1971–2000) to determine the relationship between the “percent of data passing” the test and various values of *f*. This procedure allows an informed choice regarding how many data points will be flagged in the natural datastream. If the datastream contained no errors, the values not passing would be Type I errors. In operational use, the data so flagged as potential Type I errors will be considered suspect and subjected to further manual checking, so a realistic determination of *f* is critical to project staff requirements. Graphs were developed to display the potential Type I errors versus *f* for the threshold test.
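The threshold test can be sketched in a few lines. The example below is a minimal illustration, not the software used in the study: it flags any value lying more than *f* standard deviations from the monthly mean. The function name `threshold_flags` and the short list of July temperatures are hypothetical stand-ins for a station's 1971–2000 climatology.

```python
from statistics import mean, stdev

def threshold_flags(values, climatology, f=3.0):
    """Return True for each value outside mean +/- f*sigma of the climatology."""
    x_bar = mean(climatology)
    sigma_x = stdev(climatology)
    lower, upper = x_bar - f * sigma_x, x_bar + f * sigma_x
    return [not (lower <= x <= upper) for x in values]

# Hypothetical July daily maximum temperatures (deg C) standing in for 1971-2000.
july_tmax = [30.1, 31.4, 29.8, 33.0, 32.2, 28.9, 30.7, 31.9, 34.1, 29.5]
print(threshold_flags([31.0, 45.0, 12.0], july_tmax, f=3.0))
```

With these made-up numbers the first value passes and the other two are flagged. In practice the mean and standard deviation would be computed separately for each calendar month.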

The *step change (SC)* test checks whether the change in consecutive values of the variable falls within the climatologically expected lower and upper limits on daily rate of change for the month in question. In this case the step is defined as the difference between the values on days *i* and *i* − 1, that is, *x*_{i} = *d*_{i} − *d*_{i−1}. Using this definition of *x* and calculating the associated mean and variance allows Eq. (1) to be used again, and an analysis of the data (1971–2000) determines the relationship between *f* and the potential Type I errors for the SC test.
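A minimal sketch of the step change test follows, assuming the limits are the mean of the climatological day-to-day differences plus or minus *f* times their standard deviation. The function name `step_change_flags` and all data values are hypothetical.

```python
from statistics import mean, stdev

def step_change_flags(daily, climatological_steps, f=3.0):
    """Flag day i when the step d[i] - d[i-1] lies outside mean +/- f*sigma."""
    step_mean = mean(climatological_steps)
    step_sigma = stdev(climatological_steps)
    lower, upper = step_mean - f * step_sigma, step_mean + f * step_sigma
    steps = [daily[i] - daily[i - 1] for i in range(1, len(daily))]
    # The first day has no predecessor, so it is never flagged here.
    return [False] + [not (lower <= s <= upper) for s in steps]

# Hypothetical day-to-day changes (deg C) representing the month's climatology.
clim_steps = [1.2, -0.8, 2.5, -3.1, 0.4, -1.6, 2.0, -0.9, 1.1, -0.7]
daily_tmax = [30.0, 31.0, 45.0, 31.5, 30.8]
print(step_change_flags(daily_tmax, clim_steps))
```

In this example the single spike of 45.0 produces two flagged steps, the jump up and the return down, so the valid value on the following day is flagged as well. This is one reason flagged values are passed on for manual checking rather than discarded.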

The *persistence* test checks the variability of the measurements. When a sensor fails it will often report a constant value; thus the standard deviation (*σ*) will become smaller, and if the sensor is out for an entire reporting period, *σ* will be zero. In other cases the instrument may work intermittently and produce reasonable values interspersed with zero values, thereby greatly increasing the variability for the period. Thus, when the variability is too high or too low the data should be flagged for further checking. The first step is to calculate the standard deviation from daily values for each month (*j*) and year (*k*) of the 30-yr record, *σ*_{jk}. Then the mean standard deviation for each month, $\bar{\sigma}_{j}$, is calculated by averaging *σ*_{jk} over the years. Likewise, the standard deviation of these monthly values (*σ*_{jk}) over all years, *σ*_{jσ}, is calculated. The persistence test compares the standard deviation for the time period being tested (*σ*) to the limits expected as follows:

$$\bar{\sigma}_{j} - f\sigma_{j\sigma} \le \sigma \le \bar{\sigma}_{j} + f\sigma_{j\sigma}.$$

The period under consideration passes the persistence test if the above relation holds for the specified value of *f*.
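The persistence test can be sketched as follows, reading the expected limits as the mean of the monthly standard deviations plus or minus *f* times their standard deviation across years. The function name `persistence_passes` and the data are hypothetical and serve only to illustrate the comparison.

```python
from statistics import mean, stdev

def persistence_passes(period_values, monthly_sigmas, f=3.0):
    """Check that the period's standard deviation lies within the expected limits.

    monthly_sigmas holds sigma_jk for one calendar month j, one entry per year k.
    """
    sigma_bar_j = mean(monthly_sigmas)      # mean of sigma_jk over the years
    sigma_j_sigma = stdev(monthly_sigmas)   # spread of sigma_jk over the years
    sigma = stdev(period_values)            # variability of the period under test
    lower = sigma_bar_j - f * sigma_j_sigma
    upper = sigma_bar_j + f * sigma_j_sigma
    return lower <= sigma <= upper

# Hypothetical sigma_jk values (deg C) for one month across several years.
sigmas_july = [3.1, 2.8, 3.4, 2.9, 3.3, 3.0, 2.7, 3.2]
stuck_sensor = [25.0] * 30                                  # constant reading
normal_month = [25.0 + 1.5 * ((i % 7) - 3) for i in range(30)]
print(persistence_passes(stuck_sensor, sigmas_july))
print(persistence_passes(normal_month, sigmas_july))
```

A stuck sensor gives a standard deviation of zero, which falls below the lower limit, so the period fails; the month with ordinary variability passes.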

The *spatial weighted regression* test checks that the variable falls inside the confidence interval formed from estimates based on the *N* “best fit” neighboring stations during a time period of length *n*; for this study *n* was taken as 24 and *N* was set to 5. The surrounding stations were selected by specifying a radius around the station and finding those stations with the closest statistical agreement to the target station. The radius was taken as 50 km for all stations with the exception of Key West (150 km) and Yellowstone Lake (100 km). These latter stations are in areas of lower station density, thus prompting larger radii. Additional requirements for station selection were that the variable to be tested is one of the variables measured at the candidate site and that the data for that variable span the period to be tested. A station that otherwise qualifies could also be eliminated from consideration if more than half of the data are missing for the time span (more than 12 missing days).

Estimates *x*_{lt} are derived by use of the coefficients from linear regression, so for any time *t* and for each surrounding station (*y*_{lt}) an estimate is formed:

$$x_{l} = a_{l} + b_{l}\,y_{l}.$$

A weighted estimate (*x*′) is obtained by utilizing the standard error of estimate (*s*) for each of the linear regressions (also known as root-mean-square error) in the weighting process. The surrounding stations are ranked according to the magnitude of the standard error of estimate, and the *N* stations with the lowest *s* values are used in the weighting process; in this case *N* is taken as 5.

[Eq. (3), which defines the weighted estimate *x*′, is not recoverable from the source text.]

This new approach provides more weight to the stations that have the best fit with the target station. Because the stations used in (3) are a subset of the neighboring stations we maintain that the estimate is not an areal average but a spatial regression weighted estimate. Care must be taken to preserve the correct sign on *x*′. The weighted standard error of estimate (*s*′) is calculated from the standard errors of the individual regressions.

[The equation defining *s*′ is not recoverable from the source text.]

This approach differs from inverse distance weighting in that the standard error of estimate has a statistical distribution; therefore, confidence intervals can be calculated on the basis of *s*′, and the station value (*x*) can be tested to determine whether or not it falls within the confidence intervals:

$$x' - f s' \le x \le x' + f s'.$$

As *f* is increased, the number of potential Type I errors decreases. Unlike distance weighting techniques, this approach does not assume that the best station to compare against is the closest station but instead looks to the relationships between the actual station data to settle which stations should be used to make the estimates and what weighting these stations should receive.
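The overall flow of the spatial regression test (regress the target on each neighbor, rank the neighbors by standard error of estimate, keep the best *N*, form a weighted estimate, and test the datum against a confidence interval) can be sketched as below. The weighting equations are not reproduced in the text above, so the inverse-variance (1/*s*²) weights used here for both the estimate and *s*′ are an assumption and may differ from the authors' formulas. All function names and data are hypothetical.

```python
from statistics import mean

def fit_line(y, x):
    """Least-squares fit x ~ a + b*y; returns (a, b, s), s = standard error of estimate."""
    y_bar, x_bar = mean(y), mean(x)
    sxy = sum((yi - y_bar) * (xi - x_bar) for yi, xi in zip(y, x))
    syy = sum((yi - y_bar) ** 2 for yi in y)
    b = sxy / syy
    a = x_bar - b * y_bar
    sse = sum((xi - (a + b * yi)) ** 2 for yi, xi in zip(y, x))
    return a, b, (sse / (len(x) - 2)) ** 0.5

def spatial_test(x_obs, target_hist, neighbor_hist, neighbor_now, f=3.0, n_best=5):
    """Return (passes, estimate, s_prime) for one target datum.

    ASSUMPTION: inverse-variance weights; a perfect fit (s = 0) is not handled.
    """
    fits = []
    for hist, y_now in zip(neighbor_hist, neighbor_now):
        a, b, s = fit_line(hist, target_hist)
        fits.append((s, a + b * y_now))       # (standard error, estimate x_l)
    best = sorted(fits)[:n_best]              # lowest standard error first
    weights = [1.0 / s ** 2 for s, _ in best]
    estimate = sum(w * x_l for w, (_, x_l) in zip(weights, best)) / sum(weights)
    s_prime = (len(best) / sum(weights)) ** 0.5
    return abs(x_obs - estimate) <= f * s_prime, estimate, s_prime

# Hypothetical daily maximum temperatures (deg C) over a short common period.
target = [20.0, 22.0, 19.0, 24.0, 21.0, 23.0, 18.0, 25.0]
neighbors = [
    [20.3, 21.8, 19.2, 23.7, 21.4, 22.9, 18.1, 24.8],   # close statistical fit
    [18.0, 21.0, 17.5, 23.0, 18.5, 22.5, 15.0, 24.0],   # looser fit
    [21.0, 20.0, 22.0, 23.0, 19.0, 25.0, 20.0, 22.0],   # poor fit, dropped
]
today = [22.1, 20.0, 21.5]                               # neighbor values on test day
print(spatial_test(22.0, target, neighbors, today, n_best=2))
print(spatial_test(26.0, target, neighbors, today, n_best=2))
```

Ranking by standard error rather than by distance means the poorly correlated third neighbor never enters the estimate, regardless of how near it is to the target.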

Using the above methodology, the rate of error detection can be preselected. The reader should note that the results are presented in terms of the fraction of data flagged against the range of *f* values (defined above) rather than selecting one *f* value on an arbitrary basis. This type of analysis makes it possible to select the specific *f* values for stations in differing climate regimes that would keep the Type I error rate uniform across the country. For illustration, suppose the goal is to select *f* values that keep the potential Type I errors to about 2%. A representative set of stations and years can be preanalyzed prior to QC to determine the *f* values appropriate to achieve this goal.
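One way such a preanalysis could be carried out for the threshold test is to scan a grid of *f* values and keep the smallest one whose flagged fraction does not exceed the target. The sketch below is only an illustration under that assumption; the function names, the grid of *f* values, and the synthetic Gaussian sample are hypothetical and stand in for a station's historical record.

```python
import random
from statistics import mean, stdev

def fraction_flagged(values, f):
    """Fraction of values outside mean +/- f*sigma of the same sample."""
    x_bar, sigma = mean(values), stdev(values)
    return sum(abs(x - x_bar) > f * sigma for x in values) / len(values)

def smallest_f_for_target(values, target=0.02, f_grid=None):
    """Smallest f on the grid whose flagged fraction does not exceed the target."""
    if f_grid is None:
        f_grid = [round(0.1 * k, 1) for k in range(5, 61)]   # 0.5, 0.6, ..., 6.0
    for f in f_grid:
        if fraction_flagged(values, f) <= target:
            return f
    return None

# Synthetic stand-in for a station record; real use would read historical data.
rng = random.Random(42)
sample = [rng.gauss(20.0, 5.0) for _ in range(3000)]
f_star = smallest_f_for_target(sample, target=0.02)
print(f_star, fraction_flagged(sample, f_star))
```

Repeating the scan per station and per month would yield station-specific *f* values that hold the flagged fraction near a common target.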

To document the performance of the various procedures in a controlled situation, a set of “seeded” errors was introduced into the datasets, and the performance of the various procedures in catching these errors was recorded. By a random process, 2% of the dates were selected to receive a “seeded” error. For each selected date a random number, *r*, was selected using a random number generator operating on a uniform distribution with a mean of zero and a range of ±3.5. This number was then multiplied by the standard deviation of the variable in question to obtain the error magnitude:

$$E_{ix} = \sigma_{x}\,r_{i},$$

where *E*_{ix} is the error introduced for variable *x* on date *i*, expressed as a multiple *r*_{i} of the standard deviation (*σ*_{x}). The selection of 3.5 ensures that the tests include cases that are close to the extremes of the period 1971–2000. The results of running the procedures on the “modified” dataset were cataloged for those days on which errors were introduced. The fraction of errors caught by each procedure was compared across the range of error magnitudes introduced.
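The seeding procedure can be sketched as follows. The sketch assumes that the error is added to the original value on each selected date; the function name `seed_errors`, the fixed random seed, and the constant test series are hypothetical choices made for reproducibility.

```python
import random

def seed_errors(values, sigma_x, fraction=0.02, r_max=3.5, rng_seed=0):
    """Return (seeded_values, errors); errors maps index i to E_i = sigma_x * r_i."""
    rng = random.Random(rng_seed)
    n_seed = max(1, round(fraction * len(values)))
    chosen = rng.sample(range(len(values)), n_seed)        # dates receiving an error
    errors = {i: sigma_x * rng.uniform(-r_max, r_max) for i in chosen}
    # ASSUMPTION: the error is added to the original value; the input is not modified.
    seeded = [v + errors.get(i, 0.0) for i, v in enumerate(values)]
    return seeded, errors

original = [20.0] * 365
seeded, errors = seed_errors(original, sigma_x=4.0)
print(len(errors), max(abs(e) for e in errors.values()))
```

Keeping the `errors` mapping alongside the modified series makes it straightforward to tabulate, for each test, the share of seeded dates that were flagged as a function of *r*.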

## 3. Results

### a. Type I errors

It is important to examine the number of potential Type I errors that would occur when using the specified procedures with various *f* factors. The general shape of the relationship between *f* and the fraction of data flagged is shown in Figs. 1a–d. Although we show the fraction on a log scale, the results resemble those of Eischeid et al. (1995), whose work dealt with monthly data and the interquartile range. The result for the threshold analysis at Tucson indicates that approximately 2% of the data would be flagged for maximum and minimum temperature if *f* values of 2.4 and 2.3 are used, respectively. For precipitation, 2% of the data were flagged in the threshold test for an *f* value of 1.13. These results are shown in Fig. 1a. Similar figures for Tucson are shown for the step change test (Fig. 1b), the persistence test (Fig. 1c), and the spatial test (Fig. 1d). Other stations show similar relationships between “fraction of data flagged” and *f*. Zero is not shown on the vertical axes of Figs. 1a–d, but where the curves have an endpoint inside the box, there were no values flagged by the test beyond that point (e.g., for the persistence test there were no minimum temperature values flagged beyond *f* of about 3.5).

An across-site comparison is shown for an *f* value of 3 in all the procedures in Tables 3–5. For maximum temperature (Table 3), *f* = 3 would flag less than 2% of the data, with the exception of the spatial regression tests at Dickinson and Yellowstone Lake. For minimum temperature (Table 4), *f* = 3 would flag less than 2% of the data, except at Yellowstone. For precipitation (Table 5), the step change test is not implemented because of the discontinuous nature of precipitation. For precipitation the value of *f* = 3 resulted in less than 2% of the data being flagged for the threshold and persistence tests. In the case of the spatial test anywhere from 5% to 7% of the data were still flagged at *f* = 6.

These results show that it will be possible to select dynamic *f* values for each station and season that will result in a specific but quasi-fixed rate of Type I error generation (say 2% or 0.5%) across the nation.

On first glance these error detection rates may not look stellar, but it should be recognized (see below) that the worst errors are being caught, and it is only those errors that are down in the range comparable to sensor accuracy that are slipping through undetected.

### b. Type II errors

The results of the seeding analysis are presented in terms of the percentage of errors that were correctly identified as a function of the size of the error. The percentage of errors not identified (Type II) is 100 minus the percentage of correctly identified errors. An example for maximum temperature at Crete is given in Table 6. This result is typical of the other sites. None of the tests was able to identify the smallest errors (−0.5 < *r* < 0.5); however, each of the tests became more successful in identifying errors as the magnitude of the error increased. The seeding approach is not a realistic assessment of the persistence test, because seeding one errant value during a period of 30 days is not likely to move the variability outside the acceptable limits. The spatial regression test identified the most errors (75%), followed by the step change (SC) at 35%, the threshold at 19%, and persistence at less than 2%. All tests combined identified 75% of the errors introduced. In the case of the other sites, as in this case, most of the errors identified by the other three tests were a subset of those identified by the spatial test.

The analysis shown in Table 6 was repeated for maximum and minimum temperatures and precipitation and for other locations with similar results. The combined performances of the four tests are indicated in Figs. 2a–c. For maximum temperature, the best performance (70%–80%) was noted at the plains sites (Crete and Dickinson) and the desert site (Tucson). The performance at the island (Key West) and shore (Fort Myers) sites as well as the site with low station density (Yellowstone Lake) was somewhat less (50%–60%).

For minimum temperature (Fig. 2b) the combined performance was about 60% except for the island site. For precipitation (Fig. 2c) the site with complex terrain and the island location gave poorer performance (30%–40%), while the other sites all came in at 40%–50%.

In each case, the spatial regression technique was responsible for identifying the majority of the errors found in the combined analysis.

## 4. Discussion and conclusions

It is essential to test the performance and capability of quality control procedures. In this study, the relative performance of quality control tests varied modestly with climate type and significantly with the variable tested. Seeded errors closer to zero were not detected by quality control tests; but, as the magnitude of error increased, so did the effectiveness of the quality control procedures. For large errors (comparable to *f* values >2.0), spatial regression was able to flag 100% of “seeded” errors. Continuous variables, such as temperature, produce fewer Type I errors under the test procedures used. The trade-off between Type I and Type II errors is very evident with the precipitation variable. Noncontinuous variables, such as precipitation, will produce more Type I errors, especially in the spatial test. The spatial test does, however, offer a means of reducing the Type II errors. Although daily precipitation is known to follow a gamma distribution, it was included in these tests to give a reference point. The authors intend to focus on alternative QA procedures in follow-up studies, especially additional tests that recognize the non-normal distribution of precipitation. Future work should also include comparison of the techniques set forth here to nonparametric techniques suggested by Lanzante (1996). Pattern recognition seems to have a role to play as well, in that the Type I errors often appear in geographical groupings according to the location and passage of synoptic systems. An effective implementation of pattern recognition has potential to greatly reduce the number of Type I errors made in quality control and assurance.

## REFERENCES

Allen, R. G., 1996: Assessing integrity of weather data for reference evapotranspiration estimation. *J. Irrig. Drainage Eng.*, **122**, 97–106.

Eischeid, J. K., C. B. Baker, T. Karl, and H. F. Diaz, 1995: The quality control of long-term climatological data using objective data analysis. *J. Appl. Meteor.*, **34**, 2787–2795.

Geiger, M., L. Diabate, L. Menard, and L. Wald, 2002: A web service for controlling the quality of measurements of global solar irradiation. *Solar Energy*, **73**, 475–480.

Guttman, N. B., and R. G. Quayle, 1990: A review of cooperative temperature data validation. *J. Atmos. Oceanic Technol.*, **7**, 334–339.

Guttman, N. B., C. Karl, T. Reek, and V. Shuler, 1988: Measuring the performance of data validators. *Bull. Amer. Meteor. Soc.*, **69**, 1448–1452.

Hu, Q., S. Feng, and G. Schaefer, 2002: Quality control for USDA NRCS SM-ST network soil temperatures: A method and dataset. *J. Appl. Meteor.*, **41**, 607–619.

Hubbard, K. G., 2001: Multiple station quality control procedures. Automated weather stations for applications in agriculture and water resources management. World Meteorological Organization Tech. Doc. AGM-3 WMO/TD No. 1074, 133–136.

Lanzante, J. R., 1996: Lag relationships involving tropical sea surface temperatures. *J. Climate*, **9**, 2568–2578.

Meek, D. W., and J. L. Hatfield, 1994: Data quality checking for single station meteorological databases. *Agric. For. Meteor.*, **69**, 85–109.

Reek, T., S. R. Doty, and T. W. Owen, 1992: A deterministic approach to the validation of historical daily temperature and precipitation data from the Cooperative Network. *Bull. Amer. Meteor. Soc.*, **73**, 753–765.

Shafer, M. A., C. A. Fiebrich, D. S. Arndt, S. E. Fredrickson, and T. W. Hughes, 2000: Quality assurance procedures in the Oklahoma Mesonet. *J. Atmos. Oceanic Technol.*, **17**, 474–494.

Table 1. The error classification in testing of a quality assurance hypothesis.

Table 2. The location and climate of weather stations included in this study.

Table 3. The fraction (Type I) of maximum temperature data flagged (%) at *f* = 3 for each test and each site.

Table 4. The fraction (Type I) of minimum temperature data flagged (%) at *f* = 3 for each test and each site.

Table 5. The fraction (Type I) of precipitation data flagged (%) at *f* = 3 for each test and each site.

Table 6. The percentage of seeded errors flagged by each test as a function of error magnitude (*rσ*) for maximum temperature at Crete, NE.