  • Andersson, E., and H. Järvinen, 1999: Variational quality control. Quart. J. Roy. Meteor. Soc., 125, 697–722.
  • Baker, N. L., 1992: Quality control for the navy operational atmospheric database. Wea. Forecasting, 7, 250–261.
  • Barber, C. B., D. P. Dobkin, and H. Huhdanpaa, 1996: The Quickhull algorithm for convex hulls. ACM Trans. Math. Software, 22, 469–483.
  • Barnes, S. L., 1964: A technique for maximizing details in numerical weather map analysis. J. Appl. Meteor., 3, 396–409.
  • Feng, S., Q. Hu, and W. Qian, 2004: Quality control of daily meteorological data in China, 1951–2000: A new dataset. Int. J. Climatol., 24, 853–870.
  • Fiebrich, C. A., and K. C. Crawford, 2001: The impact of unique meteorological phenomena detected by the Oklahoma mesonet and ARS micronet on automated quality control. Bull. Amer. Meteor. Soc., 82, 2173–2187.
  • Gandin, L. S., 1988: Complex quality control of meteorological observations. Mon. Wea. Rev., 116, 1137–1156.
  • Hubbard, K. G., and J. You, 2005: Sensitivity analysis of quality assurance using the spatial regression approach—A case study of the maximum/minimum air temperature. J. Atmos. Oceanic Technol., 22, 1520–1530.
  • Ingleby, N. B., and A. C. Lorenc, 1993: Bayesian quality control using multivariate normal distributions. Quart. J. Roy. Meteor. Soc., 119, 1195–1225.
  • Jolliffe, I., and D. Stephenson, 2003: Forecast Verification: A Practitioner’s Guide in Atmospheric Science. John Wiley and Sons, 240 pp.
  • Lorenc, A. C., 1981: A global three-dimensional multivariate statistical interpolation scheme. Mon. Wea. Rev., 109, 701–721.
  • Lorenc, A. C., and O. Hammon, 1988: Objective quality control of observations using Bayesian methods: Theory, and a practical implementation. Quart. J. Roy. Meteor. Soc., 114, 515–543.
  • Pöttschacher, W., R. Steinacker, and M. Dorninger, 1996: VERA - a high resolution analysis scheme for the atmosphere over complex terrain. MAP Newsletter, Vol. 5, Mesoscale Alpine Programme Office, Zurich, Switzerland, 64–65.
  • Reek, T., S. R. Doty, and T. W. Owen, 1992: A deterministic approach to the validation of historical daily temperature and precipitation data from the cooperative network. Bull. Amer. Meteor. Soc., 73, 753–762.
  • Shafer, M. A., C. A. Fiebrich, D. S. Arndt, S. E. Fredrickson, and T. W. Hughes, 2000: Quality assurance procedures in the Oklahoma mesonetwork. J. Atmos. Oceanic Technol., 17, 474–494.
  • Steinacker, R., C. Häberli, and W. Pöttschacher, 2000: A transparent method for the analysis and quality evaluation of irregularly distributed and noisy observational data. Mon. Wea. Rev., 128, 2303–2316.
  • Steinacker, R., and Coauthors, 2006: A mesoscale data analysis and downscaling method over complex terrain. Mon. Wea. Rev., 134, 2758–2771.
  • Wade, C. G., 1987: A quality control program for surface mesometeorological data. J. Atmos. Oceanic Technol., 4, 435–453.
  • Wilks, D., 1995: Statistical Methods in the Atmospheric Sciences. Academic Press, 467 pp.
  • WMO, 2008: Guide to Meteorological Instruments and Methods of Observation. 7th ed. WMO-8, World Meteorological Organization, Geneva, Switzerland, 681 pp.

Fig. 1. Simplified example of finding natural neighbors for a subset of nine stations. Station n = 3 is considered to be the main station (pentagram); stations 2, 3, 4, 7, and 9 are primary natural neighbors (squares); and all stations excluding station 8 are secondary stations. The darker gray area corresponds to the first-level neighborhood and the whole gray area to the second-order neighborhood.

Fig. 2. Delaunay triangulation (lines) applied to the station distribution (black dots) from 1200 UTC 29 Aug 2009. Attention should be paid to the inhomogeneous density of the observational network.

Fig. 3. Alignment of the local subsets of grid points around the station positions of the two-dimensional example shown in Fig. 1. Whereas the latter are symbolized by gray circles, grid points that constitute the second partial derivatives with respect to single and multiple variables are marked by black and light gray squares, respectively.

Fig. 4. An idealized two-dimensional example for a regular station distribution with uniform values except for an artificial spike added to the centered station. (a) The uncorrected observation field with its central peak. (b) The analyzed field, corrected by the unweighted deviations ΔΨ. Note that not only is the centered station corrected but its neighbors are also moved in the opposite direction. (c) The analyzed field, corrected by the weighted deviations ΔΨw. The weighting reduces the counter-swinging that affects the error-free neighbors.

Fig. 5. Example of an idealized station distribution with 20 stations where the two closely connected stations (1a and 1b; gray diamonds) are combined into a fictive cluster station (station 1; black square). The gray lines are the triangulation lines of the original station distribution and the black lines connect natural neighbors after the so-called clustering. The connections between stations that are not part of the cluster (uninvolved stations; black dots) remain the same.

Fig. 6. Visualization of the idealized example explained in the text and listed in Table 1. The (a) observation and (b) analysis fields after application of the QC procedure without cluster treatment. (c) The modified observation field where cluster members are replaced by the fictive cluster station. (d) The final result of the QC procedure with clustering. Comparing (b) and (d), the advantages of cluster treatment become obvious.

Fig. 7. One-dimensional example of VERA-QC applied to 15 stations. The value of station 8 is an outlier. Observations are marked by black dots, results after applying unweighted deviations by light gray diamonds, and those computed with weighted deviations by dark gray squares.

Fig. 8. As in Fig. 7, but where the values of stations 7–9 exceed the others.

Fig. 9. (left) Station distributions and (right) observations as well as corrected observations for (a) one-, (b) two-, and (c) three-dimensional examples. Black dots mark the centered stations, dark gray dots the stations of the inner layers, and light gray dots those of the outer layers. Error-affected stations are symbolized by dark pentagrams. Station numbers in the one- and two-dimensional station distributions correspond to the station numbers along the x axes in the bar plots. For three dimensions, only the bars for the primary neighbors of the erroneous stations are displayed. Observations, results corrected with unweighted deviations, and results corrected with weighted deviations are visualized by white bars with black edges, light gray bars, and dark gray bars, respectively.

Fig. 10. (a) Real station distribution in the surroundings of Vienna where black dots symbolize the unclustered stations and the pentagram shows a station with a gross error located in the city of Vienna. The white squares mark cluster members that are combined into fictive cluster stations (gray circles). The original stations are connected by black triangulation lines and those after the cluster treatment by gray lines. (b) The reference field, which is composed of a field with the constant value 1013 hPa and a southwest-to-northeast gradient. Note that this function was evaluated at the station positions and, as in (c)–(f), interpolated to a regular 4-km grid with the help of VERA. Whereas in (c) only random errors with a variance of 1 hPa were added to the observations, (d) shows the simulated observation field with random and additional gross errors. (e) The corrected observation field and (f) the weighted deviations.

Fig. 11. Example of a simulated pressure field (sum of three wave patterns and thermally induced pressure perturbations) and the positions of coastal (squares), lowland (circles), and alpine (diamonds) stations centered over northern Italy, including parts of the Mediterranean Sea and most of the Alps.

Fig. 12. Cumulative frequency distribution as a function of the absolute difference between artificial errors and deviations for the VERA-QC process (solid line), the QC process using SR (dashed line), and the QC process using ID (dash–dot line). The higher the line, the more artificial errors are recognized correctly. For this test series, 100 simulated gross error-free observation fields were used.

Fig. 13. (a) Layout and contingency tables for (b) VERA-QC and (c) the QC using SR processes; the latter two are based on the decisions for 332 stations and 100 time steps. In the top-left corners of (b) and (c), the computed values of ETS and HSS are listed. Both QC methods detect gross errors (GEs) very well but the VERA-QC approach is found to deliver a slightly better level of performance.


Data Quality Control Based on Self-Consistency

  • 1 Department of Meteorology and Geophysics, University of Vienna, Vienna, Austria

Abstract

When conducting meteorological measurements, one is always confronted with a wide variety of error types and with the decision of how to correct the data for further use, if necessary. The selection of an adequate quality control (QC) procedure out of a wide range of methodologies depends on the properties of the observed parameter, such as spatial or temporal consistency. The intended data application (e.g., model-independent data analysis) and the availability of prior knowledge also have to be taken into account. The self-consistent and model-independent QC process presented here makes use of the spatial and temporal consistency of meteorological parameters. It is applicable to measurements featuring a high degree of autocorrelation with regard to the resolution of the observational network in space and time. The QC procedure can be expressed mathematically as an optimization problem minimizing the curvature of the analyzed field. This results in a matrix equation that can be solved directly, without iterating to convergence. Based on the resulting deviations and, if applied, on their impacts on the cost function, station values are accepted, corrected, or identified as outliers and hence dismissed. Furthermore, the method is able to handle complicated station distributions, such as clustered stations or inhomogeneous station densities. This QC method is an appropriate tool not only for case studies but also for model validation, and it has proven itself as a preprocessing tool for operational meso- and micrometeorological analyses.

Corresponding author address: Reinhold Steinacker, Dept. of Meteorology and Geophysics, Althanstrasse 14, 1090 Vienna, Austria. E-mail: reinhold.steinacker@univie.ac.at

1. Introduction

A continually increasing number of meteorological observation sites is producing larger and larger amounts of data. Meteorologists can only benefit from this extensive quantity of measurements if the data quality meets the requirements implied by the intended applications. On the one hand, high quality long-term observational data are essential for identifying climate changes or for validating climate model simulations (Feng et al. 2004). On the other hand, quality controlled real-time data are fundamental for nowcasting and model validation and furthermore are used to provide proper initial conditions for numerical weather prediction (Ingleby and Lorenc 1993).

Quality control (QC) of meteorological data is a relatively young discipline. Until the early stages of numerical weather prediction (NWP), little attention was paid to data quality, and the QC process was considered to be an unglamorous task (Gandin 1988). During the second half of the twentieth century, the progress of NWP models brought about the recognition of the importance of QC. The manual inspection of observations was followed by simple QC algorithms that used empirically tested adjustments (Lorenc and Hammon 1988). Increasing computer power enabled the development of more complex QC methods, which are summarized in section 2 in order to classify the QC method presented in this paper. Nowadays, QC is not only an essential part of the acquisition, transmission, and processing of observational data, but it is also strongly recommended by guides from the World Meteorological Organization in order to achieve a certain standard regarding the international exchange of data (WMO 2008).

This article presents a new QC method developed at the Department of Meteorology and Geophysics at the University of Vienna, the Vienna Enhanced Resolution Analysis Quality Control (VERA-QC). In contrast to other approaches, it requires neither prior knowledge nor prognostic model information. Thus, VERA-QC is especially suited for model validation. The name VERA-QC is derived from its application as a preprocessing tool for the department’s operational analysis of basic meteorological parameters (VERA), which is carried out on an hourly basis (Pöttschacher et al. 1996). In section 2 an overview of the types of errors and current common QC methods is given to provide a basis for the classification of the QC procedure presented in this paper. Section 3 describes the VERA-QC method in detail, points out common problems and difficulties associated with this QC approach, and offers solutions, accompanied by idealized two-dimensional examples. Section 4 offers one-, two-, and three-dimensional examples with artificial observations for simple and more complex station distributions to present the properties characterizing the VERA-QC method. A comparison of the performances of VERA-QC and two other QC methods is carried out in section 5. Section 6 presents our conclusions and offers an outlook on further planned developments and applications.

2. Error types and QC methods

a. Error types

Before summarizing the most common QC procedures, one should be aware of their purpose, which is to recognize errors and to decide how to cope with them. Measurements are naturally affected by different kinds of errors. Although there is no standardized classification scheme, the so-called observational errors are usually divided into the following four main types:

  • Random errors—These errors arise because an instrument can only give an approximation of nature. Furthermore, variations in other parameters can influence the measurements. Because they result from a great number of independent factors, these errors can, by the central limit theorem, be treated as Gaussian-distributed random numbers with zero mean (Gandin 1988).
  • Systematic errors—In addition to the white noise, systematic errors occur mainly due to calibration errors or long-term drifts of sensors. Because of their usual persistence in time and their asymmetric distribution around zero, they add a bias to the measured parameter.
  • Micrometeorological errors—Spatial and temporal dimensions of meteorological phenomena cover a wide range of scales. Phenomena belonging to smaller scales than resolvable by the observational network result in micrometeorological errors. Although a measured value can be considered to be correct, such small-scale perturbations result in a misleading analysis because of the nonrepresentativeness of the single observation. Depending on the cause, such as subscale meteorological effects (e.g., urban heat islands) or meteorological noise (e.g., random subscale effects caused by turbulence), micrometeorological errors can be of systematic, as well as random, nature (Steinacker et al. 2000).
  • Gross errors—The most attention is paid to the so-called large or gross errors. This type of error is characterized by its rare occurrence and its large magnitude, and therefore it does not follow the Gaussian distribution law. Gross errors have strong effects on analyses and forecasts and are caused by the malfunctioning of measurement devices and by mistakes happening during data transmission, reception, and processing. A detailed overview of these errors is given by Gandin (1988).
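The differences between these error types can be illustrated with synthetic data. The following is a minimal sketch, not part of the original study: the function name `simulate_observations`, the constant 1013-hPa field, and all magnitudes (noise level, bias, gross error size and positions) are hypothetical values chosen for illustration. Micrometeorological errors are not modeled.

```python
import numpy as np

def simulate_observations(truth, sigma=0.5, bias=0.0,
                          gross_idx=(), gross_size=15.0, seed=0):
    """Add random, systematic, and gross errors to an error-free series."""
    rng = np.random.default_rng(seed)
    truth = np.asarray(truth, dtype=float)
    # Random errors: zero-mean Gaussian noise.
    random_err = rng.normal(0.0, sigma, size=truth.shape)
    # Systematic error: constant offset (e.g., a calibration error).
    systematic_err = np.full(truth.shape, bias)
    # Gross errors: rare and large, placed at the chosen positions.
    gross_err = np.zeros(truth.shape)
    gross_err[list(gross_idx)] = gross_size
    return truth + random_err + systematic_err + gross_err

# Hypothetical error-free field: constant pressure of 1013 hPa.
truth = np.full(1000, 1013.0)
obs = simulate_observations(truth, sigma=0.5, bias=1.0, gross_idx=[10, 500])

err = obs - truth
clean = np.delete(err, [10, 500])   # errors without the two gross errors
print("mean (bias estimate):", round(float(clean.mean()), 2))
print("std (noise estimate):", round(float(clean.std()), 2))
print("gross errors:", err[[10, 500]])
```

In this sketch the systematic error shows up as a nonzero mean of the residuals, the random error as their spread, and the gross errors as isolated values far outside that spread.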

b. Common QC methods

Since the beginning of weather analysis and prediction, meteorologists have been analyzing weather charts and simultaneously checking the quality of the synoptic observations. Based on this visual inspection, an observation is retained or rejected. Because of the increasing amount of data and the need for real-time initial fields for NWP, the time-consuming human component of the QC procedure has become an intractable task. Nevertheless, some data centers still consider human inspection to be an important part of the QC procedure (Shafer et al. 2000).

The highest significance can be assigned to the automatic QC. Depending on the existence of additional information, the spatiotemporal distribution of the observation data, and the intended application, an appropriate QC method can be chosen out of a wide variety of different QC procedures. As further information, we consider climate data, background fields based on earlier forecasts, and a priori knowledge, such as error statistics. An adequate QC method also depends on the dimension of the available data involving different levels of redundancy. It is self-evident that observations embedded in a dense spatiotemporal monitoring network allow more complex QC techniques than do single-station time series or isolated atmospheric soundings. According to the intended application, such as the creation of climate databases, model-independent analyses, or the calculation of first-guess fields, some QC methods are inapplicable. For example, an analysis for model validation should not be based on data controlled by a QC algorithm that makes use of a background field depending on the same model (Steinacker et al. 2000).

In the following we try to classify the members of the versatile family of QC methods. First of all, one can distinguish between three different types of outputs. The simpler QC algorithms can only accept or reject an observation, whereas more sophisticated ones are able to suggest corrections or give the probability of gross errors. Second, a distinguishing feature can be found in the relationship between the QC algorithm and the analysis or forecast. The great majority of QC techniques are stand-alone applications, while some others are partly or completely incorporated into the analysis or forecast algorithms. A third criterion is the use of previous knowledge, such as statistical limits and background fields based on short-term forecasts. Additionally, some QC procedures consider each single observation separately, whereas enhanced ones take into account data from neighboring stations. This allows us to take advantage of spatial continuity, a property satisfied by most meteorological parameters. In practice, more than one parameter may be measured at a station site. It is therefore natural to check for internal consistency and to ensure that the data fulfill physical constraints (e.g., the hydrostatic relation), which constitutes a further category of QC methods.

Below we give an overview of the most established QC methods.

  • Limit checks—Regardless of the actual weather situation, there are physical limits for each parameter that can never be exceeded (e.g., negative precipitation values do not exist). WMO (2008) contains detailed lists of physical limits for several parameters. A more enhanced check examines the local daily and annual weather conditions and compares an observation against climatological limits.
    • Almost every QC application begins with these plausible value checks, which make it possible to identify gross errors and to discard the observation immediately. Especially when establishing a climatological dataset, this method makes a significant contribution to assure its quality (Feng et al. 2004; Baker 1992).
  • Temporal consistency checks—The availability of time-resolved observations allows us to check whether the instrument is stuck at a particular reading (Shafer et al. 2000) or whether the tendency takes values that are implausible compared to climatological time series. These so-called persistence checks and step change tests are, like the limit check, fundamental components of QC processes. Fiebrich and Crawford (2001) describe these tests in detail and list thresholds for the maximum allowed steps and minimum required standard deviations for various parameters.
  • Internal consistency checks—Normally, more than one meteorological parameter is measured at an observing station at the same time. Some of these parameters are physically related and the internal consistency check tests if values of related parameters are free of contradictions. An example would be checking the dewpoint temperature Td and air temperature T for the relation Td ≤ T. If time series of one parameter are available, further tests such as Tmin ≤ Tmax are possible (Reek et al. 1992). A more complex internal consistency check makes use of physical constraints, such as the hydrostatic or the geostrophic relationship, by computing both sides of the relevant equation independently and comparing these results (Baker 1992).
  • Spatial consistency checks—Considering meteorological phenomena of scales exceeding the one resolved by the observational network, one can expect the related parameters to be distributed smoothly and therefore to feature a high degree of autocorrelation. The redundancy of parameters, such as mean sea level pressure or potential temperature, allows us to compare the usually similar values with each other, which helps to detect outliers. The so-called consistency or buddy-check approach calculates the difference (residual) between a measured value and the expected value at the position of the station in question. This expected value is determined by an analysis that takes into account a certain number of adjacent stations, excluding the considered observation. The possible functional relationships between the influencing values and the value to be interpolated result in algorithms of different complexity. The following methods are widely used:
    • – Inverse distance interpolation (ID)—The interpolated value is determined by the weighted sum of the surrounding station values that are located within a certain radius of the target station. The weighting function is derived from the inverse of the distance between the target and the surrounding station. This simple approach is described, for example, by Wade (1987). A more advanced way of weighting the surrounding stations according to their distances was introduced by Barnes (1964).
    • – Polynomial interpolation—Another approach for computing the interpolated value is to find a polynomial of order n that fits the measured values in the surroundings of the observation in question. A polynomial of order zero represents the simplest possible case, which means comparing the value of the target station to those of the surroundings. Nowadays, higher-order polynomial functions or splines (piecewise composed polynomials) are used. To obtain a smoothed field, it is possible to formulate a (cubic) spline interpolation as an optimization problem regarding a minimal roughness or curvature.
    • – Spatial regression (SR)—Instead of using only a weighting function depending on the distance, a more sophisticated approach is to assign the weights according to the correlation between the station of interest and the neighboring stations. These weights are based on the root-mean-square-error (RMSE) between the previous observations at the target and at the surrounding stations. Linear regression between the target station and each of the surrounding stations is performed, and results in an estimated value for the station of interest. Together with the correlation, a confidence interval for the observed value can be defined. If the observed value lies outside the boundaries of this interval, it is suspected to be erroneous. In a case study concerning Tmin and Tmax, carried out by Hubbard and You (2005), SR proved to be superior to ID. The SR has the advantage of not automatically assigning the highest weight to the nearest station, for example, when considering a coastal station that is more comparable to another coastal station farther away than to a mountain station in close proximity.
      • All these methods have in common that their results allow us to compute residuals that measure the quality of the tested station value. Depending on the residuals, this value can be accepted, corrected, or dismissed.
  • Optimum interpolation (OI)—As in the previously mentioned spatial consistency checks, the result of OI consists of an estimated value, to which the observation is compared. In addition, SR and OI both make use of statistical information. OI requires the computation of a background field and two error covariance matrices based on observational and background data. A significant difference between OI and other spatial consistency checks is the possibility of analyzing isolated stations. The influence of the surrounding stations is limited to, but not required for, the computation of the background field. A modified version of OI was used by Lorenc (1981) to check data quality.
  • Bayesian quality control (BQC)—In contrast to all previously described methods, BQC gives the probability that an observation is affected by a gross error. This method is based on Bayes’s theorem, a mathematical formalism that allows the computation of the gross error probability. For this purpose, the observation and background values and their error distributions, as well as the a priori estimate of the gross error probability, are required. This formalism is implemented in two ways: either the posterior probability for gross errors is calculated for each observation or it is computed simultaneously for a combination of stations. A detailed description of different approaches to BQC can be found in Lorenc and Hammon (1988) or in Ingleby and Lorenc (1993). The obvious advantage of this method is that the computed gross error probability represents a natural criterion for rejecting or accepting an observed value.
  • Variational quality control (VarQC)—Leading operational numerical weather prediction centers are using a variational approach [e.g., four-dimensional variational data assimilation (4D-Var)] for the analysis and forecast of atmospheric parameters. The variational approach is based on the minimization of a cost function that is basically composed of the observation and background fields and their error covariances. This approach makes it possible to incorporate the quality control procedure within the analysis itself by weighting the observation term in the cost function according to the gross error probability. It is natural to apply Bayes’s theorem to determine this probability, as is implemented at the European Centre for Medium-Range Weather Forecasts (ECMWF) (Andersson and Järvinen 1999).
  • Complex quality control (CQC)—Most of the described QC mechanisms do not exclude each other, which makes it possible to carry out several of these processes consecutively. As a consequence, there are several flags and residuals, each proposing to correct, reject, or retain an observation. These flags and residuals have to be combined into a single proposal, which is done by a so-called decision making algorithm (DMA). The successive application of the QC components and of the DMA constitutes the CQC (Gandin 1988).
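Several of the simpler checks listed above can be sketched in a few lines. The code below is an illustrative sketch only: the function names, thresholds, station coordinates, and values are hypothetical, and the inverse-distance estimate is a generic formulation rather than the specific scheme of any of the cited papers.

```python
import numpy as np

def limit_check(value, lower, upper):
    """Plausible-value check against physical or climatological limits."""
    return lower <= value <= upper

def step_check(series, max_step):
    """True where the change from the previous reading is within max_step."""
    series = np.asarray(series, dtype=float)
    ok = np.ones(series.shape, dtype=bool)
    ok[1:] = np.abs(np.diff(series)) <= max_step
    return ok

def persistence_check(series, min_std):
    """False if the series varies too little (instrument possibly stuck)."""
    return float(np.std(series)) >= min_std

def internal_check(t, td):
    """Dewpoint temperature must not exceed air temperature (Td <= T)."""
    return td <= t

def idw_residual(xy, values, target, radius, power=1.0):
    """Observed value minus the inverse-distance estimate at the target.

    Only stations within `radius` are used; the target itself (and any
    station at exactly the same position) is excluded from the estimate.
    """
    xy = np.asarray(xy, dtype=float)
    values = np.asarray(values, dtype=float)
    dist = np.linalg.norm(xy - xy[target], axis=1)
    mask = (dist > 0.0) & (dist <= radius)
    if not mask.any():
        raise ValueError("no neighboring stations within radius")
    weights = 1.0 / dist[mask] ** power
    estimate = np.sum(weights * values[mask]) / np.sum(weights)
    return values[target] - estimate

# Hypothetical example: a central station surrounded by four neighbors.
xy = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1)]
values = [25.0, 10.0, 10.0, 10.0, 10.0]
residual = idw_residual(xy, values, target=0, radius=1.5)
steps_ok = step_check([10.0, 10.5, 25.0, 25.2], max_step=5.0)
print(residual, steps_ok)
```

A residual that is large compared with a chosen threshold would mark the central station as suspect; how the individual results are combined into one decision is the task of a decision making algorithm as described under CQC.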

c. Classification of VERA-QC compared to the previously presented methods

The quality control procedure presented in this paper combines elements of some of the previously mentioned methods; moreover, it adds completely new components. Needless to say, simple controls, such as limit checks, climatological checks, and single-station internal and temporal consistency checks, are applied first. Observations passing these tests are checked for their spatial and, if required, also for their spatiotemporal consistency, which is the main focus of the presented QC procedure. The spatial consistency is checked by a variational approach that minimizes the curvature of the analyzed field. The VERA-QC method can partially be considered a complex QC method in that it (i) recognizes gross errors in a first iteration, (ii) flags them, (iii) excludes stations with these errors, and (iv) repeats the procedure.
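The recognize, flag, exclude, and repeat cycle can be sketched as follows. This is an illustrative sketch under strong assumptions and not the VERA-QC implementation: the spatial estimate is a plain leave-one-out mean of the remaining stations (a stand-in for the variational check), only the worst station is flagged per pass, and the function name, values, and threshold are hypothetical.

```python
import numpy as np

def iterative_gross_error_flags(values, threshold, max_iter=10):
    """Flag the worst station per pass, exclude it, and repeat."""
    values = np.asarray(values, dtype=float)
    flagged = np.zeros(values.size, dtype=bool)
    for _ in range(max_iter):
        active = np.flatnonzero(~flagged)
        if active.size < 3:
            break
        # Stand-in estimate: mean of all other stations still in use.
        total = values[active].sum()
        estimate = (total - values[active]) / (active.size - 1)
        residual = np.abs(values[active] - estimate)
        worst = int(np.argmax(residual))
        if residual[worst] <= threshold:
            break                       # nothing left to flag
        flagged[active[worst]] = True   # exclude from the next pass
    return flagged

# Hypothetical nearly constant field with two gross errors.
values = [10.0, 10.2, 9.9, 30.0, 10.1, -5.0, 10.0]
flags = iterative_gross_error_flags(values, threshold=3.0)
print(flags)
```

Excluding a flagged station before the next pass keeps its erroneous value from distorting the estimates for the remaining stations.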

3. Methodology

This section describes the spatial consistency check that is carried out after the data pass the simple QC checks mentioned above. Assuming an observation network with a sufficiently high density, meteorological parameters can be considered to be smoothly distributed. The precipitation field caused by a rain shower, for example, becomes coherent and smooth if the spacing of the observational network is much smaller than the extent of the rain shower and the temporal resolution is much higher than its duration. Hence, data from conventional synoptic networks do not allow a QC check of convective precipitation; however, they do allow the QC of, for example, extratropical pressure systems. Naturally, measurements are erroneous and in general lead to an observation field that is rougher than the idealized error-free field. The roughness or smoothness of these fields can be expressed mathematically in terms of a cost function that consists of the integral of the squared curvature over the controlled domain. Minimizing this cost function subject to certain constraints leads to an optimization problem that is solved by a variational approach. As a result, we obtain deviations that are proposals for correcting the measurements. By applying these deviations to the observed values, a field meeting the requirement of minimal curvature is obtained. The following subsections present all the steps, such as the definition of the cost function and the discretization of the curvature and its derivatives, as well as the solution of the resulting matrix equation.
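The statement that erroneous measurements make the observation field rougher can be illustrated in one dimension. In this minimal sketch, which is an assumption-laden simplification rather than the discretization used by the method, the stations are equally spaced with unit spacing, the curvature is approximated by second differences, and the function name `roughness` and the field values are hypothetical.

```python
import numpy as np

def roughness(field):
    """Sum of squared second differences (unit station spacing).

    Discrete stand-in for the integral of the squared curvature.
    """
    second_diff = np.diff(np.asarray(field, dtype=float), n=2)
    return float(np.sum(second_diff ** 2))

# Error-free field: a linear pressure gradient has zero curvature.
smooth = np.linspace(1010.0, 1014.0, 9)

# Same field with an artificial 2-hPa spike at the central station.
spiked = smooth.copy()
spiked[4] += 2.0

print(roughness(smooth), roughness(spiked))
```

A single spike of size s changes three neighboring second differences by s, -2s, and s, so it raises this cost by 6s², which is why isolated errors stand out clearly in a curvature-based cost function.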

a. Cost function of the variational approach

Considering a meteorological parameter Ψ, we define the error-free analysis field Ψa and its curvature K(Ψa), as well as the observation field Ψo. Furthermore, we define the cost function J as the sum of the squared curvature at all N(G) grid points of the discretized analysis field as follows:
$$J = \sum_{n=1}^{N(G)} \left[K(\Psi_a)_n\right]^2 \qquad (1)$$
Note that Eq. (1) is the discretized form of the thin plate spline approach. The cost function J takes a minimal value if the curvature of the analysis field is minimal as well. As explained in more detail below, the analysis field is approximated in terms of the known observation field. The squared curvature in Eq. (1) can be expressed as the sum of all possible combinations of squared second-order spatial or temporal derivatives. Usually, the curvature would be computed on a regular grid but, because of limited computational power, we calculate it only at the station positions. For a two-dimensional (x, y) example, the squared curvature can be written as
$$\left[K(\Psi)\right]^2 = \left(\frac{\partial^2 \Psi}{\partial x^2}\right)^2 + 2\left(\frac{\partial^2 \Psi}{\partial x\,\partial y}\right)^2 + \left(\frac{\partial^2 \Psi}{\partial y^2}\right)^2 \qquad (2)$$
A compact and more general version of Eq. (2) for D dimensions is
$$\left[K(\Psi)\right]^2 = \sum_{d_1=1}^{D} \sum_{d_2=1}^{D} \left(\frac{\partial^2 \Psi}{\partial d_1\,\partial d_2}\right)^2 \qquad (3)$$
where d1 and d2 stand for the spatial coordinates or the time (e.g., in four dimensions: d1 = x, y, z, t; d2 = x, y, z, t). Because the analysis field and its curvature at any point n are unknown, the curvature of the analysis field is approximated by a first-order Taylor series around the curvature of the observed field Ψo:
$$K(\Psi_a)_n \approx K(\Psi_o)_n + \sum_{E} \frac{\partial K(\Psi_o)_n}{\partial \Psi_E}\,\left(\Psi_a - \Psi_o\right)_E \qquad (4)$$
The subscript E denotes not only the station n but also the neighboring stations that are allowed to be erroneous. This is a special feature in contrast to the above-described consistency checks, in which usually only the station in question is considered to be erroneous. In Eq. (4), the only unknown variables are Ψa and the so-called deviations (Ψa − Ψo) = ΔΨ. To compute these deviations, one has to combine Eqs. (1) and (4), differentiate the cost function J with respect to all deviations ΔΨn, and solve the resulting equation system for these deviations.

b. Finding natural neighbors

To specify which neighboring stations are allowed to contain errors [subscript E in Eq. (4)], we have to define three terms:

  • Main station: The cost function in Eq. (1) consists of as many terms as there are stations in the considered domain. Each station of the domain is regarded in turn as the center of a local neighborhood and is then called the main station. As an example, station n = 3 in Fig. 1, marked by the pentagram, was selected as the local main station.
    Fig. 1.

    Simplified example of finding natural neighbors for a subset of nine stations. Station n = 3 is considered to be the main station (pentagram); stations 2, 3, 4, 7, and 9 are primary natural neighbors (squares); and all stations excluding station 8 are secondary stations. The darker gray area corresponds to the first-level neighborhood and the whole gray area to the second-order neighborhood.

    Citation: Monthly Weather Review 139, 12; 10.1175/MWR-D-10-05024.1

  • Primary stations: All nearest natural neighbors of the current main station, together with the main station itself, are denoted as primary stations. In Fig. 1, these are stations 2, 3, 4, 7, and 9.
  • Secondary stations: All nearest natural neighbors of the pth primary station, together with the primary station itself, are named secondary stations. In particular, the secondary neighborhood comprises the whole subset consisting of a main station, all its nearest neighbors, and their adjacent stations in turn. In Fig. 1, all stations of the domain excluding station 8 are secondary neighbors.

The distinction between primary and secondary neighborhoods is necessary because their stations influence the cost function J in two different ways. Stations in the primary neighborhood are located next to the main station and their values are allowed to vary. Mathematically, this is expressed in the Taylor series expansion of Eq. (4), where the primary stations are denoted by the subscript E. Stations in the wider secondary neighborhood are only used to compute the n = 1, 2, … , N curvature terms in the same equation.

An appropriate method of finding natural neighbors is the so-called Delaunay triangulation (Barber et al. 1996). The principle of this method in two dimensions is to connect three points at a time in such a way that no other point can be found in the circumcircle of the so-composed triangle. This concept can be expanded easily to three or more dimensions by replacing triangles with tetrahedrons and circumcircles with circumspheres and higher-dimensional analogs.
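The empty-circumcircle principle and the two neighborhood levels can be sketched in a few lines of Python. This is a brute-force illustration for a handful of stations, not the Quickhull algorithm of Barber et al. (1996); the five station coordinates and the function names (`delaunay_triangles`, `natural_neighbors`, `neighborhoods`) are hypothetical, and the stations are assumed to be in general position (no four on a common circle).

```python
from itertools import combinations


def orient(a, b, c):
    """Twice the signed area of triangle abc (> 0 if counterclockwise)."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def in_circumcircle(a, b, c, d):
    """True if point d lies strictly inside the circumcircle of triangle abc."""
    ax, ay = a[0] - d[0], a[1] - d[1]
    bx, by = b[0] - d[0], b[1] - d[1]
    cx, cy = c[0] - d[0], c[1] - d[1]
    det = ((ax * ax + ay * ay) * (bx * cy - cx * by)
           - (bx * bx + by * by) * (ax * cy - cx * ay)
           + (cx * cx + cy * cy) * (ax * by - bx * ay))
    # The sign of det depends on the orientation of abc, so normalize it.
    return det * orient(a, b, c) > 0


def delaunay_triangles(points):
    """Brute force: keep every non-degenerate triple with an empty circumcircle."""
    n = len(points)
    triangles = []
    for i, j, k in combinations(range(n), 3):
        a, b, c = points[i], points[j], points[k]
        if orient(a, b, c) == 0:
            continue  # collinear stations span no triangle
        others = (points[m] for m in range(n) if m not in (i, j, k))
        if not any(in_circumcircle(a, b, c, d) for d in others):
            triangles.append((i, j, k))
    return triangles


def natural_neighbors(triangles, n):
    """Stations that share a triangle edge are natural neighbors."""
    neighbors = {i: set() for i in range(n)}
    for triangle in triangles:
        for i in triangle:
            neighbors[i].update(j for j in triangle if j != i)
    return neighbors


def neighborhoods(neighbors, main):
    """Primary: main station plus its neighbors. Secondary: plus their neighbors."""
    primary = {main} | neighbors[main]
    secondary = set(primary)
    for p in primary:
        secondary |= neighbors[p]
    return primary, secondary


# Hypothetical station positions (arbitrary units).
stations = [(0, 0), (4, 0), (2, 3), (2, 1), (7, 1)]
triangles = delaunay_triangles(stations)
neighbors = natural_neighbors(triangles, len(stations))
primary, secondary = neighborhoods(neighbors, main=4)
print(triangles)
print(sorted(primary), sorted(secondary))
```

For these five stations, main station 4 has the primary stations 1, 2, and 4, while its secondary neighborhood already covers the whole domain. The brute-force search scales poorly and is meant only to make the definition concrete; for real networks a library triangulation would be the obvious choice.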

In comparison to Fig. 1, which shows the simple case of a homogeneous station distribution, Fig. 2 illustrates the stations of the European surface synoptic observation (SYNOP) network that reported at 1200 UTC 29 August 2009. One can observe national differences in the density of available measurements, and that very distant or very close stations are sometimes connected, which requires the special treatment described in section 3g. At this point an advantage of the presented method becomes obvious: whereas some spatial consistency checks, such as the inverse distance (ID) method, demand the definition of a radius of influence around the current main station, VERA-QC offers a natural way of selecting influencing neighbors according to the local station density. This is done with the help of the Delaunay triangulation. Nevertheless, it is reasonable to define an upper limit for the allowed distance between neighboring stations. To avoid fixed thresholds and to take into account the local station density, the upper limit is defined as a multiple of the mean distance in the considered subdomain.
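A density-dependent upper limit of this kind might be applied to the triangulation edges as in the following sketch. The function name `limit_neighbor_distance`, the multiple `factor=2.0`, and the four station positions are assumptions chosen for illustration; the text does not state the multiple that is actually used.

```python
import math


def limit_neighbor_distance(points, edges, factor):
    """Drop triangulation edges longer than factor times the mean edge length."""
    lengths = [
        math.hypot(points[i][0] - points[j][0], points[i][1] - points[j][1])
        for i, j in edges
    ]
    limit = factor * sum(lengths) / len(lengths)
    kept = [edge for edge, length in zip(edges, lengths) if length <= limit]
    return kept, limit


# Hypothetical subdomain: three closely spaced stations and one remote station.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (12.0, 0.0)]
edges = [(0, 1), (1, 2), (2, 3)]
kept, limit = limit_neighbor_distance(points, edges, factor=2.0)
print(kept, limit)
```

Because the limit scales with the mean spacing of the subdomain, the same factor can be used in dense and sparse parts of a network.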

Fig. 2.

Delaunay triangulation (lines) applied to the station distribution (black dots) from 1200 UTC 29 Aug 2009. Attention should be paid to the inhomogeneous density of the observational network.

It should be mentioned that before the triangulation procedure is carried out, stations with an obviously low correlation to the surroundings (e.g., a mountain station surrounded by valley stations) are excluded from the QC procedure and any further analysis. This approach is comparable to the conventional procedure of excluding mountain stations when analyzing sea level pressure fields.

c. Specification of the cost function

With the primary and secondary neighborhoods defined, and with the help of Eq. (4), Eq. (1) can be specified as
e5
where the subscript m counts from 1 to and both d1 and d2 count from 1 to D; furthermore, and . The unknowns sought are the deviations ΔΨn that minimize the cost function J, which requires differentiating J with respect to all deviations. This leads to a system of equations:
e6

d. Solution of the optimization problem

The equations in Eq. (6) can be grouped into a matrix equation of the form
e7
with i and j as row and column indices, each ranging from 1 to . The matrix Ai,j and the right-hand side bi of the matrix Eq. (7) can be expressed as
e8
e9
where Fi,j is a flag matrix with Fi,j = 1 if stations i and j are natural primary neighbors and Fi,j = 0 otherwise.

Considering a real observation network such as the one shown in Fig. 2, the number of equations in the linear but coupled system of Eq. (6) typically exceeds 1000. The numerical solution of such a system demands considerable computational power. By using the concept of sparse matrices, whose elements are predominantly zeros, the large system of Eq. (6) can nevertheless be solved without difficulty.

e. Discretization of the curvature and its derivatives

Mathematically, the curvatures and their derivatives are defined at all points of the D-dimensional domain. In the course of the discretization, one has to select a finite number of homogeneously distributed points at which these variables are evaluated. Station positions are usually not distributed regularly, so a homogeneous grid has to be defined. To reduce the number of grid points (which would otherwise be computationally expensive) and to adjust the gridpoint density to the inhomogeneous station distribution, subsets of the required grid points are placed around the individual stations as shown in Fig. 3.

Fig. 3.

Alignment of the local subsets of grid points around the station positions of the two-dimensional example shown in Fig. 1. Whereas the latter are symbolized by gray circles, grid points that constitute the second partial derivatives with respect to single and multiple variables are marked by black and light gray squares, respectively.

To compute the curvature of the observational field, the measured values have to be interpolated to the local subsets of grid points. Because of its advantage in being computationally inexpensive and straightforward, the following inverse distance algorithm is applied as an interpolation method:
e10
In Eq. (10), Ψn denotes the unknown field values at all n grid points in the secondary neighborhood around the actual main station, and Ψs are the observed values in the same subdomain. The distances between the positions of points n and s are abbreviated by dn,s; α and β are parameters that control the impacts of more distant observations and the degree of smoothing, respectively. This interpolation is carried out for all terms in Eq. (5).
With the observations interpolated to the grid points, they can be used in the discretized versions of the second-order derivatives. For example, the curvature with respect to the y direction evaluated for an arbitrary station s can be expressed as
$$\left.\frac{\partial^2 \Psi}{\partial y^2}\right|_{s} \approx \frac{\Psi(x_s, y_s + \Delta y) - 2\,\Psi(x_s, y_s) + \Psi(x_s, y_s - \Delta y)}{\Delta y^2} \qquad (11)$$
where (xs, ys) denotes the coordinates of the station s and Δy is the distance between two adjacent local grid points in the y direction.
The last discretization concerns the derivative of the curvature, written here as K, with respect to the station values Ψp:
$$\frac{\partial K}{\partial \Psi_p} \approx \frac{K(\Psi_1, \ldots, \Psi_p + \Delta\Psi_p, \ldots, \Psi_N) - K(\Psi_1, \ldots, \Psi_p, \ldots, \Psi_N)}{\Delta\Psi_p} \qquad (12)$$
The two terms in the numerator on the right-hand side of Eq. (12) differ only by the increment ΔΨp that is added to the station value Ψp in the argument list of the first term. In practice, this increment is added to the station value during the interpolation [Eq. (10)] to the grid points that are needed to compute the curvatures.
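The three discretization steps (interpolation to local grid points, finite-difference curvature, and the derivative of the curvature with respect to a station value) can be strung together as in the sketch below. The inverse distance weight 1/(d^α + β) is only an assumed form in which α controls the influence of distant observations and β the smoothing; it is not taken from Eq. (10). The station layout, the parameter values, and all function names are likewise hypothetical.

```python
import math


def idw(x, y, stations, values, alpha=2.0, beta=0.1):
    """Inverse distance interpolation; the weight form 1/(d**alpha + beta) is assumed."""
    weights = [
        1.0 / (math.hypot(x - sx, y - sy) ** alpha + beta) for sx, sy in stations
    ]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def curvature_yy(s, stations, values, dy):
    """Centered second difference in y at station s, using interpolated values."""
    xs, ys = stations[s]
    upper = idw(xs, ys + dy, stations, values)
    center = idw(xs, ys, stations, values)
    lower = idw(xs, ys - dy, stations, values)
    return (upper - 2.0 * center + lower) / dy ** 2


def curvature_sensitivity(s, p, stations, values, dy, increment):
    """Forward difference of the curvature at s with respect to the value at station p."""
    perturbed = list(values)
    perturbed[p] += increment  # the increment enters through the interpolation
    base = curvature_yy(s, stations, values, dy)
    return (curvature_yy(s, stations, perturbed, dy) - base) / increment


# Hypothetical layout: a central station with a spike and four neighbors.
stations = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]
values = [1.0, 0.0, 0.0, 0.0, 0.0]
k = curvature_yy(0, stations, values, dy=0.5)
dk = curvature_sensitivity(0, 0, stations, values, dy=0.5, increment=0.1)
print(k, dk)
```

Because this interpolation is linear in the station values, the finite-difference sensitivity does not depend on the size of the increment, which is convenient when choosing ΔΨp.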

By applying these concepts to the observations Ψo and inserting the discretized derivatives into Eqs. (8) and (9), it is possible to solve Eq. (7) by matrix inversion.

f. Distinguishing different deviations

As a result of the solution of Eq. (7), one obtains the deviations ΔΨ = Ψa − Ψo. At this point it has to be decided whether an observation is accepted, corrected by the deviation, or dismissed. This decision is made for one station after another and depends not only on the value of the deviation itself, but also on its impact in correcting the observation. This impact can be expressed as the degree to which the cost function would be reduced if the deviation were applied to the observed value. The reduction of the cost function with regard to the current station of interest is obtained by computing the curvature after applying the deviation to the observation and comparing it to the curvature of the uncorrected observation field:
e13
As indicated in the previous equation, the curvatures needed to compute the cost function are evaluated for all stations that are secondary neighbors of the current main station. Without knowledge of the impact that a deviation has on the cost function, it is not possible to decide whether the considered measurement or one of its neighbors is erroneous.

Considering the idealized example of a domain with only one erroneous station (see Fig. 4a), significant deviations are calculated not only for the erroneous station, but also for its neighbors (Fig. 4b). By computing the cost function reductions for all these stations, it is possible to identify the station affected by errors. This station would yield the largest cost function reduction, whereas the cost function reductions for all other stations would be significantly lower. For this reason, the cost function reduction serves as a weighting factor for the original deviations ΔΨ. As shown in Fig. 4c, this modification of the deviations avoids the propagation of errors to stations that are considered to be error free.
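Why the cost function reduction discriminates between an erroneous station and its neighbors can be seen in a one-dimensional toy calculation. The sketch below is not the variational solution of Eq. (7): as a stand-in deviation it simply moves each station to the mean of its two neighbors, it uses the cost over the whole line rather than over the secondary neighborhood, and it defines the reduction as the relative decrease 1 − J_after/J_before. All three choices are assumptions made for illustration.

```python
def cost(field):
    """Sum of squared second differences at the interior stations (unit spacing)."""
    return sum(
        (field[i - 1] - 2.0 * field[i] + field[i + 1]) ** 2
        for i in range(1, len(field) - 1)
    )


def cost_reductions(field):
    """Relative cost reduction when one interior station at a time is corrected."""
    base = cost(field)  # must be nonzero, i.e., the field must not be perfectly smooth
    deviations, reductions = {}, {}
    for i in range(1, len(field) - 1):
        # Stand-in deviation (assumption): move station i to the mean of its neighbors.
        deviation = 0.5 * (field[i - 1] + field[i + 1]) - field[i]
        corrected = list(field)
        corrected[i] += deviation
        deviations[i] = deviation
        reductions[i] = 1.0 - cost(corrected) / base
    return deviations, reductions


observations = [0.0] * 15
observations[7] = 1.0  # single artificial spike at the central station
deviations, reductions = cost_reductions(observations)
weighted = {i: reductions[i] * deviations[i] for i in deviations}
print(reductions[6], reductions[7], reductions[8])
```

In this toy setting, correcting the spike itself removes the entire cost (reduction 1), whereas "correcting" a direct neighbor removes less than half of it, so weighting by the reduction shrinks the neighbors' deviations relative to that of the erroneous station.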

Fig. 4.

An idealized two-dimensional example for a regular station distribution with uniform values except for an artificial spike added to the centered station. (a) The uncorrected observation field with its central peak. (b) The analyzed field, corrected by the unweighted deviations ΔΨ. Note that not only is the centered station corrected but its neighbors are also moved in the opposite direction. (c) The analyzed field, corrected by the weighted deviations ΔΨw. Weighting with the cost function reduction reduces the counter-swinging that affects the error-free neighbors.

The weighted deviations, ΔΨw, constitute the basis for further decisions and lead to the following decision tree:

  • Gross error: The observation is assumed to have a gross error if the cost function reduction exceeds a user-defined threshold and if, at the same time, the weighted deviation exceeds a user-defined multiple of the median of all the weighted deviations. The latter condition prevents very small deviations from being identified as outliers merely because they reduce the local curvature significantly. Stations with gross errors are excluded from further consideration by this decision-making algorithm.
  • No gross error: If at least one of the two above-mentioned criteria is not fulfilled, the observation is retained and the following two cases for handling the deviation are possible.
  1. The weighted deviation is applied: If the weighted deviation exceeds a user-defined absolute threshold, it is applied.
  2. The weighted deviation is not applied: In the opposite case, when the threshold is not reached, the error is regarded as being randomly distributed and the observation is accepted without correction.

The mentioned thresholds are set by experience and may be changed depending on user-defined requirements similar to the choice of a numerical filter technique.
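The decision tree can be written down compactly as below. The three threshold values, the use of absolute values when comparing deviations with the median, and the function name `classify` are assumptions for illustration only; the text leaves the thresholds to the user.

```python
from statistics import median


def classify(weighted_deviations, cost_reductions, reduction_threshold,
             median_multiple, absolute_threshold):
    """Return 'gross error', 'apply', or 'accept' for every station."""
    # Assumption: magnitudes are compared, so the median is taken over |deviation|.
    typical = median(abs(d) for d in weighted_deviations)
    decisions = []
    for deviation, reduction in zip(weighted_deviations, cost_reductions):
        if (reduction > reduction_threshold
                and abs(deviation) > median_multiple * typical):
            decisions.append("gross error")  # station is excluded, QC is repeated
        elif abs(deviation) > absolute_threshold:
            decisions.append("apply")  # observation corrected by the deviation
        else:
            decisions.append("accept")  # treated as random error, left unchanged
    return decisions


# Hypothetical weighted deviations (hPa) and relative cost function reductions.
weighted = [0.1, -0.2, 0.05, 5.0, 0.15]
reductions = [0.2, 0.9, 0.1, 0.95, 0.3]
decisions = classify(weighted, reductions, reduction_threshold=0.5,
                     median_multiple=5.0, absolute_threshold=0.18)
print(decisions)
```

The second station illustrates the purpose of the median criterion: its deviation reduces the local curvature strongly but is too small to count as an outlier, so it is corrected rather than rejected.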

As soon as a gross error is detected, the whole QC procedure is repeated after discarding the gross-error-affected observations. Otherwise, the former neighbors of outliers would retain their misleadingly large deviations. The artificially induced error in the example shown in Fig. 4 would be identified as a gross error and would have no further influence on the computation of the analyzed field.

It should be mentioned that the station density and the station distribution, as well as the chosen interpolation method, have an effect on the interpolated absolute value at the grid points around each station (cf. Fig. 3) and, as a consequence, also on the curvature. Nevertheless, the VERA-QC concept depends only on the relative change in curvature reduction, which is not very sensitive to the chosen interpolation method.

g. Consideration of clustered stations

In general, the weighting with the cost function reduction offers a good method for identifying erroneous stations. Still, there are some special station configurations in which an error from one station is propagated to another nearby station. This problem occurs if the distance between two or more stations is much smaller than the average distance between all stations in the considered subdomain. In this case, the curvature is minimized not only by partially correcting the erroneous station, but also by adding deviations of opposite sign and comparable magnitude to the neighboring station(s). Although mathematically consistent, this outcome is not the desired result.

With the help of an idealized example, supported by Figs. 5 and 6, as well as Table 1, the problem of so-called clustered stations and its solution are described.

Fig. 5.

Example of an idealized station distribution with 20 stations where the two closely connected stations (1a and 1b; gray diamonds) are combined with a fictive cluster (station 1; black square). The gray lines are the triangulation lines of the original station distribution and the black lines connect natural neighbors after the so-called clustering. The connections between stations that are not part of the cluster (uninvolved stations; black dots) remain the same.

Fig. 6.

Visualization of the idealized example explained in the text and listed in Table 1. The (a) observation and (b) analysis fields after application of the QC procedure without cluster treatment. (c) The modified observation field where cluster members are replaced by the fictive cluster station. (d) The final result of the QC procedure with clustering. Comparing (b) and (d), the advantages of cluster treatment become obvious.

Table 1.

Observation values (obs), deviations, and final results of the QC procedure with (row 8) and without (row 2) cluster treatment for an idealized example with 20 regularly distributed stations, as shown in Fig. 5. See text for further explanation.

If the distance between neighboring stations falls below a certain percentage of the median of all station distances in the considered subdomain, these affected stations are combined to one fictive cluster station. In Fig. 5, stations 1a and 1b are recognized to be clustered stations and, as a consequence, are combined to the fictive cluster station 1. Note that for better visibility the displayed distance between these two stations has been increased.

Suppose a flat observation field where all observations exhibit the constant value zero except for station 1b, which is affected by an artificial error of the magnitude 1 (see Table 1, row 1). This observation field is presented in Fig. 6a. The application of the QC procedure without the special cluster treatment would lead to opposite deviations for stations 1a and 1b that are comparable in their absolute values (Table 1, row 2). As one can see in Fig. 6b, this special constellation would result in a reduction of the error of only approximately 50% and in adding a significant virtual error to the neighboring station 1a, which was assumed to be error free.

To handle this problem the following steps are carried out:

  • After identifying the cluster members, these stations are replaced by a virtual station, whose coordinates are computed by taking the mean of the original stations' coordinates. The station value of the fictive cluster station is derived by taking a weighted mean of the individual cluster member values, with the inverse of the number of primary natural neighbors serving as the weighting factor. The modified observation field is shown in Fig. 6c and Table 1 (row 3).
  • The VERA-QC procedure is applied to the new station distribution and as a result the weighted deviations are computed (Table 1, row 4).
  • These weighted deviations for the cluster stations are transferred to the member stations (Table 1, row 5) and are applied to their observation values.
  • In the next step, the QC procedure is repeated for all stations with the former cluster members, modified as described in the previous point. The resulting deviations are listed in Table 1 (row 6).
  • Finally, the weighted deviations of both QC iterations are accumulated. All uninvolved stations receive the weighted deviations from the second iteration (Table 1, row 7). Note that a weighted deviation is only applied if the considered station value is not detected as a gross error and the threshold described in section 3f is exceeded. Comparing the results of the QC process without and with clustering (Figs. 6b and 6d, respectively; Table 1, rows 2 and 8), one can see the positive effects of cluster treatment.

Metaphorically speaking, the weighting method described in the first point has the consequence that errors are propagated to cluster members that are better embedded in the station network. The higher number of primary neighbors enables an error to be detected and reduced more easily.
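The cluster recognition and the construction of the fictive cluster station in the first step above can be sketched as follows. The fraction of the median distance (`fraction=0.5`), the neighbor counts, and the function names are hypothetical; only the rules themselves (distance below a percentage of the median, mean coordinates, weights equal to the inverse number of primary natural neighbors) follow the text.

```python
from statistics import median


def is_clustered(distance, all_distances, fraction):
    """True if a station spacing falls below fraction times the median spacing."""
    return distance < fraction * median(all_distances)


def merge_cluster(coords, values, n_primary):
    """Fictive cluster station: mean position, inverse-neighbor-count weighted value."""
    n = len(coords)
    x = sum(c[0] for c in coords) / n
    y = sum(c[1] for c in coords) / n
    weights = [1.0 / k for k in n_primary]  # fewer primary neighbors, larger weight
    value = sum(w * v for w, v in zip(weights, values)) / sum(weights)
    return (x, y), value


# Hypothetical cluster: station 1a (error free) and station 1b (error of magnitude 1).
spacings = [1.0, 1.0, 0.2, 1.1, 0.9]
print(is_clustered(0.2, spacings, fraction=0.5))
position, value = merge_cluster(
    coords=[(0.0, 0.0), (0.2, 0.0)],
    values=[0.0, 1.0],
    n_primary=[4, 6],
)
print(position, value)
```

With these numbers the fictive station sits midway between the two members and takes the value 0.4 rather than the plain mean 0.5, because the member with more primary neighbors contributes with the smaller weight.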

4. Examples

On the basis of some selected analytical examples, the properties and specific features of the described VERA-QC are presented in this section. Starting with the simplest case of a one-dimensional station distribution, we first consider observations with one central outlier, followed by three central stations with equal values differing from the rest, and finally an example with two separated outliers. The last example is compared to analogs with a two-dimensional and a three-dimensional station distribution, each also affected by two errors, in order to show the positive effects of the increased number of neighbors in a higher-dimensional observation network. The section concludes with a more complex example featuring gross and random errors as well as clustered stations. The application of this QC procedure to realistic observation fields is beyond the scope of this paper; we plan to present it in a following paper with a focus on operational applications and case studies.

a. One dimension with one central outlier

Referring to the two-dimensional example visualized in Fig. 4 with one central outlier, Fig. 7 presents the one-dimensional equivalent. The values of all 15 stations (black dots) are equal except for that of the centered station, which has been given an artificial error of magnitude 1. In addition to the original station values, Fig. 7 shows the observations corrected by the unweighted (light gray diamonds) and weighted (dark gray squares) deviations. In comparison to the two-dimensional example shown in Fig. 4, there are some similarities but also a significant difference. In both cases, applying the unweighted deviations leads to counter-swinging in the surroundings of the erroneous station, which can be avoided by weighting the deviations with the reduction of the cost function. Whereas in one dimension a station can have at most two nearest neighbors, in higher dimensions this number is generally larger. As a consequence, an error can be detected and corrected more easily, and the influence on the surrounding stations is reduced considerably. Comparing the two cases, one can see that in the two-dimensional example the error is reduced to approximately a third of the error in the one-dimensional case.

Fig. 7.

One-dimensional example of VERA-QC applied to 15 stations. The value of station 8 is an outlier. Observations are marked by black dots, results after applying unweighted deviations by light gray diamonds, and those computed with weighted deviations by dark gray squares.

b. One dimension with centered signal

That the VERA-QC method is not just a smoothing algorithm is illustrated by the example presented in Fig. 8. In contrast to the previous example, the values of the three centered stations lie beyond the range of the others. Those values should be interpreted as a signal rather than as a group of outliers. On the one hand, it is unlikely that these three neighbors are all affected by gross errors; on the other hand, this QC method is based on the assumption that gross errors are rare, as pointed out in section 2a. Applying the unweighted deviations would lead to a smoothed analysis field in which the observation values are reduced by approximately 50% and all other station values are affected as well. This undesired averaging of the signal can be avoided by applying the weighted deviations. As a result, the magnitude of the signal is maintained, and only the sharp contrast between the three central stations and the surroundings is softened slightly. Moreover, one can see that the mean value is not preserved, which is a further special property of VERA-QC.

Fig. 8.

As in Fig. 7, but where the values of stations 7–9 exceed the others.

c. One, two, and three dimensions, each with two errors

That errors are detected more easily, and therefore corrected to a higher degree, in higher dimensions is shown by analyzing three comparable station distributions in one, two, and three dimensions. Moreover, the troublesome effect of counter-swinging in the surroundings of outliers is reduced by increasing the number of spatial dimensions and/or including the temporal dimension. With the help of three symmetrical station distributions, differing in the number of dimensions (arranged in Figs. 9a–c from top to bottom on the left-hand side), these two effects are visualized by comparing the corrected observations on the right-hand side of the same figure. The three examples have in common that a centered station (black dot) is surrounded by two layers of stations (inner layer, dark gray dots; outer layer, light gray dots). Two artificial errors (black pentagrams) of magnitude 1 are imposed, one on a station of the first layer and one on a station of the second layer. The corresponding bar plots on the right-hand side present the observations (white bars with black edges), as well as the corrected observations based on the unweighted and weighted deviations (light and dark gray bars, respectively). Apart from the two above-mentioned effects (note that both the errors and the counter-swinging are reduced by approximately half when a further dimension is added), the influence of the erroneous station's position with respect to the boundaries of the domain is illustrated. This can be seen by comparing the bars corresponding to the erroneous stations (observation value 1). The left bars in each panel belong to the stations located in the inner layer, which show a higher detection and correction rate for the errors, and the right bars to those of the outer layer, with correspondingly lower corrections. The reason for the unequal efficiency lies in the different numbers of natural neighbors, which are naturally lower at the boundary of a domain.

Fig. 9.

(left) Station distributions and (right) observations as well as corrected observations for (a) one-, (b) two-, and (c) three-dimensional examples. Black dots mark the centered stations, dark gray dots the stations of the inner layers, and light gray dots those of the outer layers. Error-affected stations are symbolized by dark pentagrams. Station numbers in the one- and two-dimensional station distributions correspond to the station numbers along the x axes in the bar plots. For three dimensions, only the bars for the primary neighbors of the erroneous stations are displayed. Observations, results corrected with unweighted deviations, and results corrected with weighted deviations are visualized by white bars with black edges, light gray bars, and dark gray bars, respectively.

d. Real two-dimensional station distribution with artificial observations

A real mesoscale station distribution (black dots) in the area surrounding Vienna, Austria (white lines represent borders), with some clustered stations (white squares) and one gross error (pentagram), is shown in Fig. 10a. The observations represent an artificial mean sea level pressure field characterized by a southwest-to-northeast gradient; the same setup could equally represent a potential or equivalent potential temperature field. Figure 10b shows the interpolation of these station values onto a regular grid. To simulate realistic observations, Gaussian random errors with a variance of 1 hPa² and one gross error were added to the reference field. Figure 10c illustrates the pressure field with random errors and Fig. 10d the simulated observation field (with the additional gross error) to which VERA-QC was applied. The observations have been interpolated onto a regular grid with the help of VERA, a high-resolution analysis scheme based on the thin plate spline method. The given analytic pressure field is disturbed not only by random errors and by a negative gross error in Vienna, as mentioned above, but also by another outlier in the southwest of the domain that is part of a cluster. One may ask why the gross error in Vienna is not part of a cluster even though the distances between the adjacent stations are quite short. As mentioned in section 3g, cluster recognition depends on the station distribution in the local subdomain. To correct the observations, the steps of the VERA-QC process described in detail in section 3 are carried out. After the first iteration with cluster treatment and gross error recognition, the gross error stations are rejected, and in the second iteration the weighted deviations are computed and applied to the observed values. The resulting field can be seen in Fig. 10e, where in the greater area of Vienna the defined analytic gradient in the pressure field is restored and is influenced only by the random fluctuations. Moreover, the outlier in the mentioned cluster is largely corrected without discarding the affected station. This result can also be seen in Fig. 10f, which presents the field of the weighted deviations (the difference between the fields shown in Figs. 10d and 10e). Except for the two error-affected stations, all stations require only minor corrections, whose application depends on the user-defined thresholds described in section 3f. This complex example demonstrates the advantage of VERA-QC: it does not smooth outliers and their surroundings but rather maintains the resolvable patterns of the observation field.

Fig. 10.

(a) Real station distribution in the surroundings of Vienna, where black dots symbolize the unclustered stations and the pentagram shows a station with a gross error located in the city of Vienna. The white squares mark cluster members that are combined into fictive cluster stations (gray circles). The original stations are connected by black triangulation lines and those after the cluster treatment by gray lines. (b) The reference field, which is composed of a field with the constant value 1013 hPa and a southwest-to-northeast gradient. Note that this function was evaluated at the station positions and, as in (c)–(f), interpolated to a regular 4-km grid with the help of VERA. Whereas in (c) only random errors with a variance of 1 hPa² were added to the observations, (d) shows the simulated observation field with random and additional gross errors. (e) The corrected observation field and (f) the weighted deviations.

It should be mentioned that VERA-QC is written in Matlab and is able to run on a Linux server as well as on a Windows PC. The QC procedure (including data reading and writing) of one parameter for a domain containing approximately 1000 stations takes about 10 s. Therefore, it is an appropriate preprocessing tool for every kind of further analysis.

5. VERA-QC in comparison with two common QC methods

In section 2, an attempt was made to classify the VERA-QC procedure with respect to the fundamental approaches of widely used QC methods. In the following, the performance of VERA-QC is compared to that of two commonly used spatial consistency checks, which are based on inverse distance (ID) and spatial regression (SR) interpolation. This is done by applying the QC processes to a series of artificial observation fields with seeded errors that were designed to be as realistic as possible. In contrast to real fields, artificial observations offer the advantage that the “truth,” as well as the random and gross errors, is known.

a. Observation field

To simulate an observation field presenting all possible realistic difficulties, such as different station densities, alpine and coastal influences, and the lack of smoothness, a domain including parts of the Mediterranean Sea and most of the Alps was selected. The positions of 332 stations were taken from the World Meteorological Organization (WMO) Global Telecommunications System (GTS) station list and divided into three types based on their location, namely whether they were coastal, lowland, or alpine stations.

As a meteorological parameter, the mean sea level pressure was chosen. It is simulated by a composition of three two-dimensional wave patterns [wavelengths λ = (400, 600, 800) km, amplitudes varying randomly between hPa and completely random phases] and thermally induced pressure perturbations depending on differential sensible heat fluxes and the effects of reduced air volumes above valleys. The latter are derived from the so-called idealized thermal pressure fingerprint, which is based on the two above-mentioned terrain-influenced effects and is described in Steinacker et al. (2006). The weights of the fingerprints are computed to simulate typical daily variations, allowing effects like heat lows and cold highs with maximal amplitudes of hPa. One realization of the simulated pressure signal is illustrated in Fig. 11.

Fig. 11.

Example of a simulated pressure field (sum of three wave patterns and thermally induced pressure perturbations) and the positions of coastal (squares), lowland (circles), and alpine (diamonds) stations centered over northern Italy, including parts of the Mediterranean Sea and most of the Alps.

The values of the observation field are interpolated to the station positions and Gaussian-distributed random errors (mean μ = 0 hPa; standard deviation σ = ⅓ hPa) are added. Optionally, gross errors with a mean value of μ = 15 hPa, a standard deviation of σ = 2 hPa, and a random sign can be included.
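The error seeding described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' Matlab implementation: the function name `seed_errors`, the parameter `gross_fraction`, and the reading of "random sign" as a ±1 factor applied to a magnitude drawn from N(15, 2) are hypothetical choices. The fraction of affected stations is left as a parameter; the gross error experiment in section 5c uses 2%.

```python
import numpy as np

def seed_errors(truth, rng, sigma_random=1.0 / 3.0,
                gross_fraction=0.0, gross_mean=15.0, gross_std=2.0):
    """Add random and optional gross errors (hPa) to a 'true' station field.

    Returns the simulated observations, the random errors that were added,
    and a boolean mask marking stations that received a gross error.
    """
    truth = np.asarray(truth, dtype=float)
    # Gaussian random errors with zero mean and sigma = 1/3 hPa.
    random_err = rng.normal(0.0, sigma_random, size=truth.shape)
    # Choose the stations hit by a gross error (assumed: independent draws).
    is_gross = rng.random(truth.shape) < gross_fraction
    # Gross error magnitude ~ N(15, 2) hPa with a random sign (assumed reading).
    magnitude = rng.normal(gross_mean, gross_std, size=truth.shape)
    sign = rng.choice([-1.0, 1.0], size=truth.shape)
    gross_err = np.where(is_gross, sign * magnitude, 0.0)
    return truth + random_err + gross_err, random_err, is_gross

# Example: 332 stations on a constant 1013-hPa field, 2% gross errors.
rng = np.random.default_rng(42)
obs, random_err, is_gross = seed_errors(np.full(332, 1013.0), rng,
                                        gross_fraction=0.02)
```

Returning the added errors alongside the observations keeps the "truth" available for the later comparison with the deviations proposed by each QC method.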

b. Methods

Since the spatial consistency checks using ID and SR are straightforward, these QC methods were chosen for the comparison. Their formulation is well known and can be found, for example, in Hubbard and You (2005), whose detailed formalism was followed in implementing the two methods. The required settings were optimized for the simulated observation fields and for the given station distribution:

  • For both ID and SR the influence radius r was set to r = 100 km in order to enable the allocation of neighbors in less dense regions in the south of the domain.
  • The minimally required coefficient of determination R² to select influencing stations for the SR was optimized to handle gross errors and was assigned a value of R² = 0.5.
  • Concerning SR, the recognition of gross errors also requires the definition of a confidence interval. The parameter f as a multiple of the weighted standard error of the estimate controls the width of this interval and regulates the strictness regarding gross errors. An ideal value was found to be f = 8.4.

In VERA-QC the minimal cost function reduction regarding the gross error recognition was set to . The user-defined multiple of the median of all weighted deviations that needs to be exceeded for identifying a gross error was attributed a value of 500 (for the definition of these conditions, see section 3f).
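To make the ID-based consistency check more concrete, the following Python sketch estimates the value at one station from its neighbors within the influence radius r = 100 km and returns the deviation of the observation from that estimate. It is a minimal illustration, not the formalism of Hubbard and You (2005): the function name `id_deviation`, the planar coordinates in kilometers, and the weighting exponent `power=2` are assumptions made here.

```python
import numpy as np

def id_deviation(xy_km, values, target, radius_km=100.0, power=2.0):
    """Return (estimate, deviation) for station `target` using ID weighting."""
    xy_km = np.asarray(xy_km, dtype=float)
    values = np.asarray(values, dtype=float)
    # Planar distances from the target station to all stations (km).
    dist = np.linalg.norm(xy_km - xy_km[target], axis=1)
    # Neighbors lie inside the influence radius; dist > 0 excludes the
    # target itself (and any station at exactly the same position).
    neighbors = (dist > 0.0) & (dist <= radius_km)
    if not neighbors.any():
        return np.nan, np.nan  # no neighbor available, no estimate
    weights = 1.0 / dist[neighbors] ** power
    estimate = np.sum(weights * values[neighbors]) / np.sum(weights)
    return estimate, values[target] - estimate

# Example: a station reporting 1028 hPa surrounded by four neighbors
# at 10 km reporting 1013 hPa; a sixth station at 300 km is ignored.
xy = [[0, 0], [10, 0], [-10, 0], [0, 10], [0, -10], [300, 0]]
p = [1028.0, 1013.0, 1013.0, 1013.0, 1013.0, 900.0]
estimate, deviation = id_deviation(xy, p, target=0)
```

The fixed radius in this sketch is what makes the check sensitive to inhomogeneous station distributions: a station with few or no neighbors inside 100 km receives a poorly supported estimate or none at all.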

c. Results

The performance of the three different QC methods is evaluated regarding two criteria, namely the recognition of random errors and the detection of gross errors.

1) Recognition of random errors

As a measure for comparing the performance levels of the three QC methods, the difference between the added artificial random errors and the deviations suggested by the QC procedures for a simulated observation field free of gross errors is used. First, 100 time steps were simulated in order to compute the required correlations between the station values for the QC option using SR interpolation. After that, another 100 simulations were carried out to collect the differences between the proposed deviations and the added random errors. The sorted and cumulated differences for the three methods are illustrated in Fig. 12. One can see that VERA-QC, in comparison to the two other methods, generally produces smaller differences between the (known) artificial random errors and the deviations (computed by the QCs). Regarding this criterion, VERA-QC is preferable. The more enhanced QC choice using SR interpolation delivers, as expected, somewhat better results than that using ID interpolation.

Fig. 12.

Cumulative frequency distribution as a function of the absolute difference between artificial errors and deviations for the VERA-QC process (solid line), the QC process using SR (dashed line), and the QC process using ID (dash–dot line). The higher the line, the more artificial errors are recognized correctly. For this test series, 100 simulated gross error-free observation fields were used.

Additionally, the statistical measures root-mean-square error (RMSE) and mean absolute error (MAE) are computed for the above-mentioned differences. In Table 2, their values are summarized for all stations together and also separately for the three station types (coastal, lowlands, and alpine). The RMSE and MAE values confirm that VERA-QC generates the smallest errors and a QC process using SR is superior to one using ID. The artificial observation field around alpine stations features the highest variability. Thus, one might expect that a QC procedure is less efficient in recognizing errors in this mountainous region. Nevertheless, the high station density in the alpine area compensates for the difficulty caused by the local high variability. Naturally, the performance of a QC method using ID or SR degrades in coastal areas. This is due to the extremely inhomogeneous station distribution that is characteristic of these regions and to a constant radius of influence when choosing neighboring stations. In contrast to QCs using ID or SR, the VERA-QC can adapt automatically to varying station densities and can treat coastal stations as well as others embedded in a more homogeneous station distribution. This aspect is reflected in the hardly varying values of the statistical measures for coastal, lowland, and alpine stations.
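The two measures can be computed per station type as in the following Python sketch. It assumes that the added random errors and the deviations suggested by a QC method are available as arrays with the same sign convention, together with a station-type label for each entry; the function names and the sample values are hypothetical and are not taken from Table 2.

```python
import numpy as np

def rmse(diff):
    """Root-mean-square error of an array of differences."""
    return float(np.sqrt(np.mean(np.square(diff))))

def mae(diff):
    """Mean absolute error of an array of differences."""
    return float(np.mean(np.abs(diff)))

def scores_by_type(added_errors, suggested_deviations, station_type):
    """Return {label: (RMSE, MAE)} for all stations and for each station type."""
    diff = np.asarray(suggested_deviations, float) - np.asarray(added_errors, float)
    station_type = np.asarray(station_type)
    scores = {"all": (rmse(diff), mae(diff))}
    for label in np.unique(station_type):
        subset = diff[station_type == label]
        scores[str(label)] = (rmse(subset), mae(subset))
    return scores

# Made-up values for four stations (hPa), for illustration only.
added = [0.2, -0.4, 0.1, 0.3]
suggested = [0.1, -0.1, 0.1, 0.5]
types = ["coastal", "coastal", "alpine", "lowland"]
scores = scores_by_type(added, suggested, types)
```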

Table 2.

Combined presentation of the RMSE and MAE for the three different QC methods. The given numerical values refer to all stations (all), as well as to station subsets (coastal, lowland, and alpine). Both statistical measures are based on the differences between added artificial random errors and suggested deviations. See the text for further explanations.

2) Detection of gross errors

The performance of a QC procedure can also be evaluated considering its ability to recognize gross errors. This ability is quantified with the help of contingency tables and skill scores. These evaluations are carried out for the VERA-QC and the QC methods based on SR because both offer an advanced criterion for the recognition of gross errors.

Figure 13a shows the layout of the selected contingency table, which is taken from Wilks (1995) and Jolliffe and Stephenson (2003). The definitions of the skill scores used, the equitable threat score (ETS) and the Heidke skill score (HSS), are also given in these books.

Fig. 13.

(a) Layout and contingency tables for (b) VERA-QC and (c) the QC using SR processes; the latter two are based on the decisions for 332 stations and 100 time steps. In the top-left corners of (b) and (c), the computed values of ETS and HSS are listed. Both QC methods detect gross errors (GEs) very well but the VERA-QC approach is found to deliver a slightly better level of performance.

Compared to the large number of observations, the occurrence of a gross error is a rare event. For this reason the ETS, which is designed specifically for the verification of rare events, was chosen as the evaluation parameter for the contingency table. In addition, the values of the equally adequate but probably better known HSS were computed.
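Both scores follow directly from the four cells of a 2 × 2 contingency table. The Python sketch below uses the standard definitions as given, for example, in Wilks (1995); the function names and the example counts are hypothetical and do not reproduce the values in Fig. 13.

```python
def equitable_threat_score(hits, false_alarms, misses, correct_negatives):
    """ETS = (a - a_r) / (a + b + c - a_r), with a_r the hits expected by chance."""
    n = hits + false_alarms + misses + correct_negatives
    hits_random = (hits + false_alarms) * (hits + misses) / n
    return (hits - hits_random) / (hits + false_alarms + misses - hits_random)

def heidke_skill_score(hits, false_alarms, misses, correct_negatives):
    """HSS = 2(ad - bc) / [(a + c)(c + d) + (a + b)(b + d)]."""
    a, b, c, d = hits, false_alarms, misses, correct_negatives
    return 2.0 * (a * d - b * c) / ((a + c) * (c + d) + (a + b) * (b + d))

# Illustrative counts only: 332 stations x 100 time steps = 33 200 decisions,
# of which 2% (664) carry a gross error.
table = dict(hits=650, false_alarms=10, misses=14, correct_negatives=32526)
ets = equitable_threat_score(**table)
hss = heidke_skill_score(**table)
```

The two scores are not independent: for any 2 × 2 table, HSS = 2 ETS / (1 + ETS), so they always rank two QC methods in the same order.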

As before, a time series of 100 simulated observation fields was generated but with the difference that 2% of all stations were affected by gross errors. The results of these simulations are summarized in Figs. 13b and 13c. As the achieved values of the statistical measures ETS and HSS are close to one, the performance levels of both QC methods are found to be very convincing. Note that the number of observations identified as false alarms by VERA-QC is slightly smaller. This implies that fewer measurements are rejected by mistake and the information offered by these stations stays available, which is especially important in regions with a less dense station distribution.

In contrast to a QC process based on SR, VERA-QC does not require any a priori knowledge, such as correlations between the observations. From our point of view this presents a considerable advantage.

6. Conclusions and outlook

In this article a new QC method based on self-consistency, called VERA-QC, has been outlined and compared to common QC procedures. As demonstrated, the advantages of VERA-QC lie in its model independence and in its ability to check spatial and temporal consistency at the same time. Moreover, it is applicable to large domains: first, because it uses the computationally inexpensive concept of sparse matrices and, second, because the large and unknown number of iterations required by some other QC methods is replaced by at most one additional repetition of the QC algorithm. VERA-QC also adapts automatically to different densities of observation networks by using the concept of natural neighborhoods and by considering physically implied covariances. It is therefore suitable for controlling data acquired in field studies covering microscale phenomena, as well as GTS data from a more coarsely resolved observation network covering a whole continent. Compared to two other QC schemes, VERA-QC has shown a higher degree of efficiency in detecting erroneous values.

Although the VERA-QC method presented here has delivered an optimal level of performance as a preprocessing tool for the hourly VERA analyses, some improvements are still planned. In the real-time application of VERA-QC, the unweighted and weighted deviations are stored, which makes it possible to evaluate them statistically. On this basis, VERA-QC is intended to check the representativeness of stations and to detect biased stations. As a logical next step, a bias correction of real data can be introduced. Another possible improvement is to extend the method toward a multivariate approach in which wind and pressure gradients could be treated simultaneously. This would affect the mathematical core of the VERA-QC procedure, so the cost function would have to be reformulated to include physical constraints.

Complementing the herein-presented basic principles of the VERA-QC approach, a further publication is in preparation. The results of the operational implementation and the outcomes of its applications in field studies will be discussed.

Acknowledgments

Thanks are due to the Austrian Science Fund (Fonds zur Förderung der wissenschaftlichen Forschung, FWF; P19658) and to the Austrian Research Funding Association (Die Österreichische Forschungsförderungsgesellschaft, FFG; project 818110) for partial financial support of this work.

REFERENCES

  • Andersson, E., and H. Järvinen, 1999: Variational quality control. Quart. J. Roy. Meteor. Soc., 125, 697–722.

  • Baker, N. L., 1992: Quality control for the navy operational atmospheric database. Wea. Forecasting, 7, 250–261.

  • Barber, C. B., D. P. Dobkin, and H. Huhdanpaa, 1996: The Quickhull algorithm for convex hulls. ACM Trans. Math. Software, 22, 469–483.

  • Barnes, S. L., 1964: A technique for maximizing details in numerical weather map analysis. J. Appl. Meteor., 3, 396–409.

  • Feng, S., Q. Hu, and W. Qian, 2004: Quality control of daily meteorological data in China, 1951–2000: A new dataset. Int. J. Climatol., 24, 853–870.

  • Fiebrich, C. A., and K. C. Crawford, 2001: The impact of unique meteorological phenomena detected by the Oklahoma mesonet and ARS micronet on automated quality control. Bull. Amer. Meteor. Soc., 82, 2173–2187.

  • Gandin, L. S., 1988: Complex quality control of meteorological observations. Mon. Wea. Rev., 116, 1137–1156.

  • Hubbard, K. G., and J. You, 2005: Sensitivity analysis of quality assurance using the spatial regression approach—A case study of the maximum/minimum air temperature. J. Atmos. Oceanic Technol., 22, 1520–1530.

  • Ingleby, N. B., and A. C. Lorenc, 1993: Bayesian quality control using multivariate normal distributions. Quart. J. Roy. Meteor. Soc., 119, 1195–1225.

  • Jolliffe, I., and D. Stephenson, 2003: Forecast Verification: A Practitioner’s Guide in Atmospheric Science. John Wiley and Sons, 240 pp.

  • Lorenc, A. C., 1981: A global three-dimensional multivariate statistical interpolation scheme. Mon. Wea. Rev., 109, 701–721.

  • Lorenc, A. C., and O. Hammon, 1988: Objective quality control of observations using Bayesian methods: Theory, and a practical implementation. Quart. J. Roy. Meteor. Soc., 114, 515–543.

  • Pöttschacher, W., R. Steinacker, and M. Dorninger, 1996: VERA - a high resolution analysis scheme for the atmosphere over complex terrain. MAP Newsletter, Vol. 5, Mesoscale Alpine Programme Office, Zurich, Switzerland, 64–65.

  • Reek, T., S. R. Doty, and T. W. Owen, 1992: A deterministic approach to the validation of historical daily temperature and precipitation data from the cooperative network. Bull. Amer. Meteor. Soc., 73, 753–762.

  • Shafer, M. A., C. A. Fiebrich, D. S. Arndt, S. E. Fredrickson, and T. W. Hughes, 2000: Quality assurance procedures in the Oklahoma mesonetwork. J. Atmos. Oceanic Technol., 17, 474–494.

  • Steinacker, R., C. Häberli, and W. Pöttschacher, 2000: A transparent method for the analysis and quality evaluation of irregularly distributed and noisy observational data. Mon. Wea. Rev., 128, 2303–2316.

  • Steinacker, R., and Coauthors, 2006: A mesoscale data analysis and downscaling method over complex terrain. Mon. Wea. Rev., 134, 2758–2771.

  • Wade, C. G., 1987: A quality control program for surface mesometeorological data. J. Atmos. Oceanic Technol., 4, 435–453.

  • Wilks, D., 1995: Statistical Methods in the Atmospheric Sciences. Academic Press, 467 pp.

  • WMO, 2008: Guide to Meteorological Instruments and Methods of Observation. 7th ed. WMO-8, World Meteorological Organization, Geneva, Switzerland, 681 pp.