1. Introduction
Meteorological observations are important for a wide variety of applications, such as in numerical weather prediction (NWP) for data assimilation (Daley 1993; Kalnay 2003; Bormann et al. 2019) and verification (Jolliffe and Stephenson 2012; Wilks 2011), derivation of reanalyses (Dee et al. 2011; Kobayashi et al. 2015; Hersbach et al. 2020), agricultural meteorology (Stigter et al. 2000; WMO 2010), and research in atmospheric and climate sciences.
Traditional meteorological observational networks operated by national meteorological services are usually designed for the observation of synoptic atmospheric conditions. Because of this, they are not able to fully capture smaller-scale atmospheric phenomena, such as convective rainfall (Kidd et al. 2017) and hailstorms (Clark et al. 2018). Furthermore, the resolution of NWP models is steadily increasing, with subkilometer models now in use (Hagelin et al. 2014; Vionnet et al. 2015; Valkonen et al. 2020), demanding denser observational networks for both initialization and verification. Traditional observational networks, however, do not reflect this increasing need for higher-resolution observations. Meteorological observations from unconventional sources, on the other hand, are becoming more prevalent. These observations have the potential to further improve numerical weather forecasts and bring added value to the atmospheric sciences (Muller et al. 2015; Zheng et al. 2018).
Traditional meteorological observations include both remotely sensed and in situ observations. Unconventional observations refer to data obtained from sources other than the traditional ones. Crowdsourced observations are a type of unconventional observation and refer to data obtained from devices owned by citizens (Howe 2006; Muller et al. 2015). This category of unconventional observations includes a number of sources, such as smartphones and private weather stations.
Crowdsourced observations have been used for a wide variety of applications, such as assimilation of pressure from smartphones (McNicholas and Mass 2018; Hintz et al. 2019), quantification of the urban heat island (Overeem et al. 2013; Meier et al. 2017; Napoly et al. 2018), monitoring of urban rainfall (Overeem et al. 2011; de Vos et al. 2017), and postprocessing of NWP model output (Nipen et al. 2020). The main strength of crowdsourced observations is their high spatial and temporal resolution, whereas the main disadvantage is that the uncertainty of individual observations is often higher compared to observations from traditional sources (Bell et al. 2015). Furthermore, weather stations from crowdsourcing networks do not have to comply with the regulations set by the World Meteorological Organization (WMO; WMO 2018a). This can lead to poor station siting, such as placement next to buildings or other obstacles. Quality control (QC) is therefore an essential step for the use of crowdsourced meteorological observations (Muller et al. 2015).
A wide range of QC methods exists. Vejen et al. (2002) discuss the operational QC methods used in the Nordic countries, and Shafer et al. (2000) present the QC methods used for the Oklahoma Mesonet. Standard methods focus on single-station tests for finding erroneous observations, such as plausibility checks, rate of change checks, and persistence checks (Zahumenský 2004; Fiebrich et al. 2010). However, these types of tests are not able to capture errors related to, e.g., poor station siting or consistent instrument biases. To detect observations affected by these types of errors, different QC methods are needed. Manual QC checks, i.e., QC performed by humans, are not an option for dense networks, both because of the vast quantities of data available and because the observations are required in near–real time. Fully automatic QC methods are therefore needed. The WMO guidelines on surface station data quality control and quality assurance for climate applications (WMO No. 1269; WMO 2021) recommend the implementation of several QC tests for weather data. Among those tests is a category of statistical tests based on spatial QC methods that are classified as either recommended or optional. Spatial QC methods rely on observations from nearby stations and can therefore be used to find erroneous observations related to, e.g., poor siting choices. The guide devotes a paragraph to dealing with site differences, stressing the importance of using a dense observational network so that the comparison is made between observations that are observing the same atmospheric phenomenon.
To evaluate the impact of different thresholds for spatial QC parameters, a parameter must be varied in a consistent manner and its effect on error detection analyzed. A framework for optimizing and tuning spatial QC methods is therefore needed. There is, however, no specific procedure for the optimization of spatial QC parameters in the WMO guidelines, and there is very little information in the literature on how to choose the thresholds for spatial QC parameters. Hubbard et al. (2005) analyzed the tuning of four different methods for QC of temperature and precipitation for six stations, using the fraction of flagged observations to represent the potential type I error (i.e., valid observations flagged as erroneous). They used a method of so-called error seeding, i.e., the introduction of artificial errors by perturbation of the observations, where the fraction of artificial errors detected gave an idea of the type II error (i.e., erroneous observations not detected by the QC method).
The objective of this paper is twofold. The first is the presentation of a framework for tuning a spatial QC test for a dense network of temperature observations. Here we adopt a method of introducing known artificial errors in order to evaluate the performance of the spatial QC method for different thresholds. The method is evaluated using crowdsourced observations from a dense network of weather stations. The second objective is the investigation of the benefits of including observations from a network of crowdsourced observations in the QC of another network. The main driver for this is the fact that most meteorological observational networks are quite sparse, which makes spatial QC harder. If the vast number of crowdsourced observations available could be included, spatial QC methods could be used to their full potential. Here we combine two observational networks and evaluate the performance of the spatial QC method through the tuning framework presented in the first part. The main obstacles and how to handle them are investigated. The focus here is on temperature observations, but similar analyses could be made for other meteorological observations, such as relative humidity and precipitation.
2. Observational data
Meteorological observations used in this study were obtained from two networks of crowdsourced observations. In the first part of the study, observations from FieldSense, a manufacturer of private weather stations, are used. The FieldSense stations are mainly developed for agricultural purposes and the network is therefore densest in regions with high agricultural activity. Since the FieldSense network is mostly rural and the focus is on agricultural applications, most stations are placed on, or in the vicinity of, fields. Placement of stations next to buildings is therefore less frequent compared to an urban network. Furthermore, radiative errors due to solar heating are reduced since the temperature sensor has a naturally ventilated radiation shield. However, due to poor ventilation, radiative errors might still occur for naturally ventilated shields, especially in conditions of weak wind and during daytime when shortwave radiation is at a maximum (Nakamura and Mahrt 2005).
In the second part of the study, observations from Netatmo, another manufacturer of private weather stations, are used in combination with the FieldSense observations. The coverage of the Netatmo network is densest in urban and more well-populated areas, such as cities. Due to the urban nature of the Netatmo network, stations are usually placed next to buildings, which affects the representativity of these observations. Furthermore, the temperature sensor does not have a radiation shield. Studies have found significantly higher temperatures during daytime for stations that lack a proper radiation shield and are placed in direct sunlight (Bell et al. 2015; Meier et al. 2017). The Netatmo installation guidelines recommend placing the sensor out of direct sunlight in order to minimize the solar heating effect, but in urban areas it is difficult to find optimal locations.
The Netatmo network is global whereas the FieldSense network is mostly concentrated in Denmark, where a majority of stations are located. The study area of interest is therefore Denmark, where both networks have good coverage. Observational data for FieldSense and Netatmo stations in Denmark were extracted for the year 2020. The geographical distribution of FieldSense stations used in this study is shown in Fig. 1. Currently, there are more than 1000 FieldSense stations in Denmark. The network therefore outnumbers the official network of WMO-compliant stations operated by the Danish Meteorological Institute (DMI) by a factor of more than 10. The Netatmo network, on the other hand, is even denser than the FieldSense network and vastly outnumbers both the FieldSense and DMI networks.
As is common for stations in crowdsourced meteorological networks, the FieldSense stations lack metadata about their elevation. This information was obtained from the Danish elevation model (Danish Agency for Data Supply and Efficiency 2021). For the Netatmo data, the accompanying elevation metadata were used unless they deviated significantly from the Global Multiresolution Terrain Elevation Data 2010 (GMTED2010; Danielson and Gesch 2010), in which case the elevation from the digital elevation model was used.
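As an illustration of this elevation check, the following minimal sketch keeps the station-provided elevation unless it deviates from the DEM value by more than a tolerance; the function name and the 50 m tolerance are illustrative assumptions, not values taken from this study.

```python
import numpy as np

def choose_elevation(meta_elev_m, dem_elev_m, max_diff_m=50.0):
    """Keep the metadata elevation unless it deviates from the DEM elevation
    (e.g., GMTED2010) by more than max_diff_m, in which case the DEM value
    is used. The 50 m tolerance is an illustrative assumption."""
    meta_elev_m = np.asarray(meta_elev_m, dtype=float)
    dem_elev_m = np.asarray(dem_elev_m, dtype=float)
    use_meta = np.abs(meta_elev_m - dem_elev_m) <= max_diff_m
    return np.where(use_meta, meta_elev_m, dem_elev_m)
```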
To obtain datasets that are as clean as possible for the evaluation of the tuning of the spatial QC method, temporal single-station checks are used to exclude obviously erroneous observations. A plausible range check, a rate of change check, and a persistence check are used for both the FieldSense and Netatmo observations. Furthermore, a check to identify and exclude stations placed indoors is performed for both datasets. For the FieldSense stations, which measure additional meteorological parameters such as wind speed, relative humidity, and illuminance, an indoors test was constructed based on these additional parameters. For the Netatmo stations, a buddy check was applied, in which the temperature of the current station is compared with that of neighboring stations. In addition, only stations with observational data for at least 1 month were used.
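The temporal single-station checks can be summarized as in the following sketch, which flags values outside a plausible range, large jumps between consecutive observations, and long runs of constant values. All thresholds shown are placeholders, not the values applied to the FieldSense and Netatmo data.

```python
import numpy as np

def single_station_checks(temp_c, t_min=-50.0, t_max=50.0,
                          max_step_c=10.0, persist_steps=12):
    """Flag suspect values in an hourly temperature series (deg C).
    Thresholds are illustrative placeholders."""
    temp_c = np.asarray(temp_c, dtype=float)
    flags = np.zeros(temp_c.shape, dtype=bool)

    # Plausible range check
    flags |= (temp_c < t_min) | (temp_c > t_max)

    # Rate of change check: large jumps between consecutive observations
    flags |= np.abs(np.diff(temp_c, prepend=temp_c[0])) > max_step_c

    # Persistence check: a long run of identical values suggests a stuck sensor
    for i in range(persist_steps, temp_c.size):
        window = temp_c[i - persist_steps:i + 1]
        if np.all(window == window[0]):
            flags[i - persist_steps:i + 1] = True

    return flags
```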
3. A framework for tuning a spatial QC method
The comparison between nearby observations is at the core of any spatial QC scheme. Therefore, when implementing spatial QC methods, the representativeness error of an observation (Ingleby and Lorenc 1993; Lussana et al. 2010) must be taken into account as an additional source of uncertainty, which adds to the instrumental uncertainty. For our purposes, it is worthwhile mentioning that typical instrumental uncertainties for thermometers are on the order of ±0.1°C (WMO 2018a), while the uncertainty associated with representativeness errors is one order of magnitude larger, around ±1°C (e.g., Fig. 8 of Lussana et al. 2019), and occasionally larger in very data-sparse regions. The most important and critical decision that a spatial QC method must make is the distinction between gross measurement errors (Gandin 1988) and large representativeness errors. The first type of error makes the observations uncorrelated with the actual atmospheric state, such that any QC procedure should flag them, while large representativeness errors can be allowed to pass the QC when small-scale processes are important for the application under consideration.
To select the optimal parameter settings for a QC method its performance needs to be assessed for a range of thresholds. Here we present a framework for tuning a spatial QC method for a dense network of meteorological observations. The method used is the introduction of known artificial errors by perturbation of the observations, which has previously been used in the assessment of QC methods for climate applications (Hubbard et al. 2005). This framework can be used for tuning different QC methods and here we show results for a selected spatial QC method referred to as a spatial consistency test (SCT).
The SCT (Lussana et al. 2010) evaluates a given observation against an expected value calculated from neighboring stations within a predefined radius. First, a vertical temperature profile is fitted using the observations and their elevations, not including the observation in question. Optimal interpolation (OI) is then used to adjust the temperature locally and obtain an expected value for the given observation, as well as the error variances of both the expected and observed values. This adjustment depends on the station density, which results in a stricter test for data-dense regions compared to data-sparse regions. The SCT then compares the squared deviation between the given observation and the expected value with the estimated error variances, and if this ratio is greater than a predefined threshold, the observation is flagged. For a more detailed description of the SCT the reader is referred to Lussana et al. (2010). The SCT used here is available from the open-source software package TITAN, used for automatic quality control of meteorological observations (www.github.com/metno/titanlib; Båserud et al. 2020), developed at the Norwegian Meteorological Institute.
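The decision rule behind the SCT can be illustrated with the simplified sketch below. It replaces the vertical profile fit and the full OI step with a Gaussian-weighted neighborhood mean and spread, so it is a schematic of the idea rather than the titanlib implementation; all parameter values and the error floor are illustrative assumptions.

```python
import numpy as np

def sct_schematic(x_km, y_km, values, outer_radius_km=10.0,
                  h_scale_km=5.0, sigma=4.0, min_neighbors=5):
    """Flag observations deviating from a neighborhood-based expected value by
    more than `sigma` estimated standard deviations (schematic only)."""
    x = np.asarray(x_km, dtype=float)
    y = np.asarray(y_km, dtype=float)
    values = np.asarray(values, dtype=float)
    flags = np.zeros(values.size, dtype=bool)
    for i in range(values.size):
        d = np.hypot(x - x[i], y - y[i])
        nb = (d > 0) & (d < outer_radius_km)
        if nb.sum() < min_neighbors:
            continue  # too few neighbors for a meaningful spatial comparison
        w = np.exp(-0.5 * (d[nb] / h_scale_km) ** 2)
        expected = np.sum(w * values[nb]) / np.sum(w)
        spread = np.sqrt(np.sum(w * (values[nb] - expected) ** 2) / np.sum(w))
        spread = max(spread, 0.5)  # illustrative floor for instrumental + representativeness error
        if abs(values[i] - expected) > sigma * spread:
            flags[i] = True
    return flags
```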
The SCT has several parameters that need to be defined, such as the radius within which to find neighboring observations, the minimum horizontal and vertical decorrelation length scales, and the upper and lower thresholds used for flagging observations. Here, we have chosen to simultaneously tune the upper and lower thresholds (σ+ and σ−), which represent the number of standard deviations above and below the expected value an observation is allowed to deviate before being flagged. The other parameters of the SCT remain fixed throughout the tuning analysis; however, these can be tuned in the same manner as the threshold values. Table 1 lists the values of the fixed parameters used in the SCT. It should be noted that the vertical decorrelation length scale and the minimum elevation difference were chosen based on the fact that Denmark is very flat and we therefore do not expect large vertical differences. The SCT is, however, also suited for applications in mountainous regions, where the variation of temperature with altitude is significant. In fact, it was developed in the Alps, as detailed in Lussana et al. (2010). To obtain an optimal performance for mountainous regions, some of the SCT parameters, such as the vertical decorrelation length scale, need to be more carefully chosen, and the method proposed here can be used to tune them.
Table 1. Fixed parameters of the SCT.
To tune the performance of the SCT for different σ+ and σ− values, the SCT is run and evaluated for thresholds in the range 0.5–10. The two thresholds are tuned simultaneously by fixing their relative values such that they are always equal. This means that positive and negative deviations are penalized equally. The tuning parameter of the SCT can therefore be denoted σ, where σ = σ+ = σ−.
The framework for tuning the parameters of a spatial QC method for a dense network of meteorological observations presented here is based on perturbation of the observations. More specifically, the observation undergoing the QC is perturbed by a known artificial error, in order to simulate the occurrence of gross errors. Consider Fig. 2, which shows the temperature time series for a given station and its closest neighbors for 25 July 2020. The temperature observation at 0600 UTC has been perturbed by 3°C and can be seen to lie outside the temperature range of the neighboring stations. When applying the SCT to the 0600 UTC observation an expected value for the observation will be calculated using its neighbors. This expected value is then compared to the value of the perturbed observation. The observation will be flagged if the perturbed value deviates from the expected value by more than the specified threshold, which for the SCT is given by the σ parameter.
To obtain robust statistics, each FieldSense observation is perturbed by a fixed error and is then quality controlled using the SCT. This procedure is performed individually for each observation, one at a time. Each observation is perturbed by artificial errors of order ±1°, 2°, 3°, and 5°C, which represent typical values on the border between representativeness errors and gross errors for real-time applications dealing with the automatic production of weather forecasts (Nipen et al. 2020). This way the performance of the SCT can be evaluated separately for small and large errors.
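The seeding procedure can be sketched as follows. Here `run_sct` stands for a call to the SCT (for instance through the titanlib package mentioned above) that returns a boolean flag per observation; its exact signature is not assumed. Each observation is perturbed in turn, and the fraction of seeded errors detected is recorded for every σ threshold.

```python
import numpy as np

def error_seeding(values, run_sct, sigmas, error_c=3.0):
    """For each sigma threshold, perturb each observation (one at a time) by a
    fixed artificial error and record the fraction of seeded errors detected."""
    values = np.asarray(values, dtype=float)
    detection_rate = {}
    for sigma in sigmas:
        hits = 0
        for i in range(values.size):
            perturbed = values.copy()
            perturbed[i] += error_c                      # introduce a known gross error
            hits += bool(run_sct(perturbed, sigma)[i])   # was the seeded error flagged?
        detection_rate[sigma] = hits / values.size
    return detection_rate

# Example threshold sweep corresponding to the range used in the study:
# sigmas = np.arange(0.5, 10.5, 0.5)
```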
4. The ROC curve as an evaluation measure
The SCT is performed for each artificial error (the “perturbed” scenarios) as well as an “unperturbed” scenario, where the SCT is run on the original data. The test results are evaluated by comparing the outcomes of the SCT for the “unperturbed” and “perturbed” scenarios to the corresponding truths. In the “unperturbed” scenario all observations are considered good, which gives two possible outcomes: a correct rejection (also referred to as a true negative), which occurs when an observation correctly passes the SCT, or a false alarm (also referred to as a false positive or type I error), which occurs when the SCT incorrectly flags an observation as erroneous. In the “perturbed” scenarios, on the other hand, all observations have been perturbed and are therefore labeled as erroneous (i.e., affected by gross error). In this case, two possible outcomes also exist: a hit (also referred to as a true positive), which occurs when the SCT correctly flags the observation as erroneous, or a miss (also referred to as a false negative or a type II error), which occurs when an observation incorrectly passes the SCT. The results from the “unperturbed” and the “perturbed” scenarios can be combined into a contingency table, which summarizes the outcomes of the SCT. A wealth of performance measures have been derived from the contingency table (Wilks 2011). Here we use the receiver operating characteristic (ROC) curve (Hanley and McNeil 1982), which is a graph relating the false alarm rate and the hit rate for different σ thresholds (see Fig. 3 for an example). The false alarm rate (also called the false positive rate) relates the false alarms to the total number of good observations, and the hit rate (also called the true positive rate) relates the hits to the total number of erroneous observations. A perfect score is obtained for a hit rate of 1 and a false alarm rate of 0, meaning that all erroneous observations have been detected without introducing any false alarms. This corresponds to the top-left corner of the ROC diagram. The diagonal line in Fig. 3 represents the “random guess” or “pure chance” scenario, i.e., the performance one expects when flipping a coin to decide whether an observation should pass or fail the SCT; a fair coin gives both a hit rate and a false alarm rate of 0.5, which lies on this line.
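The two rates that define a point on the ROC curve follow directly from the contingency-table counts, as sketched below: the counts from the “perturbed” scenario give the hit rate and those from the “unperturbed” scenario give the false alarm rate.

```python
def roc_point(hits, misses, false_alarms, correct_rejections):
    """Hit rate and false alarm rate for one sigma threshold.
    hits/misses come from the "perturbed" scenario (all observations erroneous);
    false_alarms/correct_rejections come from the "unperturbed" scenario."""
    hit_rate = hits / (hits + misses)
    false_alarm_rate = false_alarms / (false_alarms + correct_rejections)
    return false_alarm_rate, hit_rate

# The ROC curve is traced by evaluating these two rates for every sigma threshold.
```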
As expressed explicitly by Eq. (3), there is a direct link between the chosen value of m and our prior knowledge on (i) the observational network P(G), and (ii) the desired characteristics for our application, as specified through the ratio of the two cost functions c(FA)/c(M).
There are no universally recognized values for P(G); nonetheless, the scientific literature gives some indications. The WMO (2018b) guide provides guidance for the WMO Integrated Global Observing System (WIGOS), and in its Annex 1 the performance targets for surface synoptic land stations are defined. For temperature, the minimum requirement to ensure that data of a single station are useful is that the number of gross errors during 1 month should not exceed 15% of all observations. Analogously, in the report by Prates et al. (2019) on the WIGOS data quality monitoring at ECMWF, a station measuring 2-m temperature is flagged as problematic when the percentage of gross errors over 1 month is larger than 15%. The producers of the HadCRUT5 dataset of monthly averaged near-surface temperatures (Morice et al. 2021) found that the percentage of gross errors in observational global temperature datasets usually is much smaller than 15% and on the order of 1% or less (Simmons et al. 2004; Brohan et al. 2006). Van den Besselaar et al. (2012) found that their QC procedures applied to daily time series of minimum and maximum temperatures over an observational pan-European dataset flagged around 0.1% of observations for minimum temperature and 1% for maximum temperature. In the case of observations from citizen stations used in the production chain of automatic weather forecasts, Nipen et al. (2020) found that flagging around 20% of the observations every hour was necessary to obtain a significant improvement in the temperature forecast for the next 6 h. In conclusion, for an observational network of traditional stations a reasonable range for P(G) is 0.1%–15%, while for a network of crowdsourced observations P(G) can be increased up to 20%.
The ratio c(FA)/c(M) for a specific application can be set by considering the expected frequencies of false alarms and misses. For instance, suppose that an observational network provides 1000 observations per hour and P(G) is set to 10%, which implies that, on average, 100 erroneous and 900 good observations are expected every hour. In the examples used in the following sections, a miss is assumed to be 6 times as costly as a false alarm, i.e., a cost ratio c(FA)/c(M) of 1:6.
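Since the derivation of Eq. (3) is not repeated here, the sketch below uses one common formulation of an expected cost per erroneous observation (cf. Provost and Fawcett 2001): misses weighted by their cost, plus false alarms weighted by their cost and by the number of good observations per erroneous one. The normalization may differ from that of Eq. (3), so this function is an illustrative assumption rather than a reproduction of the paper's formula.

```python
def expected_cost(false_alarm_rate, hit_rate,
                  p_gross=0.10, cost_miss=6.0, cost_false_alarm=1.0):
    """Illustrative expected cost per erroneous observation for one ROC point.
    Defaults correspond to P(G) = 10% and a 1:6 cost ratio c(FA)/c(M)."""
    miss_rate = 1.0 - hit_rate
    good_per_bad = (1.0 - p_gross) / p_gross
    return cost_miss * miss_rate + cost_false_alarm * good_per_bad * false_alarm_rate

# The optimal sigma threshold is the one whose ROC point minimizes this cost.
```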
5. Optimizing σ for the FieldSense network
The SCT is run for all FieldSense observations from 0000 and 1200 UTC for the year 2020, one at a time, for each artificial error, ±1°, 2°, 3°, and 5°C, as well as an “unperturbed” scenario, where the SCT is run on the original data. The resulting ROC curves for each error magnitude are shown in Fig. 4. For the following analysis we will assume a cost ratio of 1:6 for c(FA)/c(M), as in the example in section 4.
The performance of the SCT, as well as the optimal σ setting, which is based on the expected cost, varies with the magnitude of the error. The worst performance is seen for an error of magnitude 1°C. This is expected since the SCT has a harder time distinguishing such a gross error from a good observation, especially in regions where the local temperature variability is high. For an error of 1°C, the lowest cost, which according to Eq. (3) is 1.13, is obtained using the lowest σ threshold of 0.5. This means that an observation is only allowed to deviate by 0.5 standard deviations from the expected value before it is flagged. Such a strict test is needed because the simulated error is so small that many erroneous observations would risk going unflagged if a larger threshold were used.
Both the performance of the SCT and the value of the optimal σ threshold increase, whereas the cost decreases, with increasing error magnitude. For errors of magnitude 2°, 3°, and 5°C the cost decreases by 28% (0.81), 56% (0.50), and 80% (0.23), respectively, and the corresponding cost optima are found for σ thresholds of 2, 3.5, and 5. For an error of 5°C the SCT detects most errors without introducing many false alarms, even for a lenient test (i.e., very high σ thresholds). Since the optimal σ threshold depends on the magnitude of the errors, the user needs to have a good understanding of the distribution of expected errors in order to select the optimal σ threshold.
We saw that for the network used here, different σ thresholds are optimal depending on the magnitude of the error. What else affects the choice of the optimal threshold? Does it depend on the time of day and the density of the network? To investigate this, the performance of the SCT is evaluated separately for daytime (1200 UTC) and nighttime (0000 UTC) data for an artificial error of ±3°C (Fig. 5a). The performance of the SCT is better for daytime data, where a higher hit rate is obtained for the same false alarm rate compared to nighttime data. Furthermore, the lowest cost is 34% lower for daytime data (0.39) compared to nighttime data (0.59). The cost optimum for nighttime data is found for a σ value of 3, whereas 4 gives the best results for daytime data. The difference in performance between daytime and nighttime data is thought to be related to the larger spread of temperatures seen for nighttime data. A possible explanation for this is the existence of sharp surface-based temperature inversions. A larger spread would make it more difficult for the SCT to catch gross errors, which might explain the higher cost and the need for a lower σ threshold. These results are likely specific to the network used here and might differ depending on the characteristics of the network used. For an urban network, for example, daytime temperatures are expected to have a larger spread due to siting issues and radiative errors, which is expected to result in a better performance of the SCT for nighttime data. The optimal settings for the σ threshold are also expected to be reversed, with a lower threshold for daytime data and a higher threshold for nighttime data.
To evaluate the results for different network densities, each FieldSense station was classified as located in either a data-dense or a data-sparse region. For this, the integral data influence (IDI; Uboldi et al. 2008; Horel and Dong 2010) was used. IDI is a measure of the observation density, where the IDI at a given location is based on the influence of neighboring stations through the correlation functions used in the OI algorithm. The results are shown in Fig. 5b. The performance of the SCT is, as expected, better for stations located in a data-dense region than for stations located in a data-sparse region. The higher the station density, the more likely it is that a comparison is made between observations that are observing the same atmospheric phenomenon. This makes it easier for spatial QC methods such as the SCT to detect errors. The expected cost, given Eq. (3) and assuming that a miss is 6 times as expensive as a false alarm, is 0.73 for stations in data-sparse regions and 0.47 for stations in data-dense regions. The associated σ thresholds are 2 and 3.5, respectively. This means that if the density of a network is increased, the optimal σ threshold is expected to increase, whereas the cost is expected to decrease. The results presented here are for an error of magnitude 3°C; however, similar results were obtained for the other error magnitudes.
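A minimal sketch of the IDI computation is given below, following the OI-based idea of Uboldi et al. (2008): pseudo-observations of 1 over a background of 0 are analyzed, and the analysis value at a target point is its IDI. The Gaussian horizontal correlation function and the value of eps2 (ratio of observation to background error variance) are illustrative assumptions.

```python
import numpy as np

def idi(x_obs_km, y_obs_km, x_tgt_km, y_tgt_km, h_scale_km=10.0, eps2=0.5):
    """Integral data influence at target points given station locations (sketch).
    IDI close to 1 indicates a data-dense region, close to 0 a data-sparse one."""
    xo, yo = np.asarray(x_obs_km, float), np.asarray(y_obs_km, float)
    xt, yt = np.asarray(x_tgt_km, float), np.asarray(y_tgt_km, float)
    d_oo = np.hypot(xo[:, None] - xo[None, :], yo[:, None] - yo[None, :])
    d_to = np.hypot(xt[:, None] - xo[None, :], yt[:, None] - yo[None, :])
    s = np.exp(-0.5 * (d_oo / h_scale_km) ** 2)   # station-station correlations
    g = np.exp(-0.5 * (d_to / h_scale_km) ** 2)   # target-station correlations
    weights = np.linalg.solve(s + eps2 * np.eye(xo.size), np.ones(xo.size))
    return g @ weights
```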
6. Added value of using the Netatmo network
As we saw, the optimal σ threshold and the expected cost depend on the density of the network. Sparser networks tend to require lower thresholds in order to catch more erroneous observations. Furthermore, the performance of the SCT decreases (the cost increases) with increasing sparsity of the network. The FieldSense network is a very dense network for which spatial QC of observations is well suited. But what about networks that are not as dense? It is not often that a network this dense is available, so how could the performance of the spatial QC be increased for a sparser network? To obtain a better performance for the spatial QC method, would there be any benefit to using a dense network of crowdsourced observations, such as the Netatmo network, in the QC of the sparser network?
This brings us to the second objective of this paper, which is the investigation of the benefits of adding a dense network of crowdsourced observations in the QC of another network. Here we investigate the benefits of adding Netatmo observations, which come from a mostly urban but denser network, in the QC of the more rural, but less dense, FieldSense network. The problem with adding one network in the QC of another is that spatial QC methods such as the SCT will not work properly if there exists a net temperature difference between the two station networks. Because the FieldSense network is of a more rural character whereas the Netatmo network can be classified as an urban network, a net temperature difference is expected. Furthermore, the Netatmo stations are unshielded, which means that the measurements will be affected by solar heating when they are placed in direct sunlight. The FieldSense stations, on the other hand, have a naturally ventilated radiation shield and are therefore less prone to radiative errors. These two differences, rural versus urban and shielded versus unshielded, are expected to contribute to a net temperature difference between the two station networks.
Figure 6 shows the average FieldSense and Netatmo temperatures as a function of time of day and month. Overall, the average Netatmo temperature is higher than the average FieldSense temperature, showing a net positive Netatmo temperature difference relative to FieldSense. The temperature as a function of time of day shows the largest differences between the two networks during afternoon hours (1200–1600 UTC) and nighttime hours (2000–0400 UTC). The smallest differences are seen during morning hours (0500–1000 UTC), which is the time of day when the temperature increases most rapidly. The FieldSense network has a faster thermal response compared to the Netatmo network, for which the temperature increases more slowly and therefore also peaks later in the day. The monthly average temperatures show an overall colder FieldSense temperature, with the difference from the Netatmo temperature being largest during summer months and smallest during winter months. Several studies have found that insufficient shielding of the temperature sensor results in positive biases of up to several degrees, especially in conditions of calm winds and large shortwave radiation (Hubbard et al. 2004; Böhm et al. 2010; Nakamura and Mahrt 2005; Bell et al. 2015). The observed daytime difference between Netatmo and FieldSense stations, as well as the larger difference during summer months, is therefore most likely linked to the lack of a radiation shield for the Netatmo stations, which results in higher temperatures due to solar heating effects. Land-cover differences are most likely the reason for the nighttime difference, as well as the overall monthly difference. Urban areas are associated with overall higher temperatures, an effect known as the urban heat island (UHI) effect, which has been extensively documented (Steeneveld et al. 2011; Fenner et al. 2014; Chapman et al. 2017).
It is therefore necessary to bias correct the Netatmo observations in order to make use of them in the QC of the FieldSense observations. The objective of this study is not to propose a new method for bias correcting observations from crowdsourced networks and we therefore make use of a very simple method. Here, the Netatmo temperatures are bias corrected using the monthly average Netatmo minus FieldSense temperature difference. This is not a perfect bias correction method and better and more complex methods can be developed. As the FieldSense network is of a more rural character, whereas the Netatmo stations are mostly located in urban areas, the UHI most likely contributes to the observed net temperature difference between the station networks. To deal more effectively with the expected temperature difference due to the UHI, stations can be classified according to land use (e.g., rural, urban) and the bias correction method can make use of this classification. However, for our purposes the simple method applied gives satisfactory results, as will be shown.
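A sketch of this simple correction is given below, assuming the two networks' temperatures are available as pandas Series indexed by time; the function and variable names are illustrative.

```python
import pandas as pd

def monthly_bias_correct(netatmo_temp, fieldsense_temp):
    """Shift Netatmo temperatures by the monthly mean Netatmo-minus-FieldSense
    difference (simple bias correction sketch). Both inputs are assumed to be
    pandas Series with a DatetimeIndex."""
    monthly_bias = (netatmo_temp.groupby(netatmo_temp.index.month).mean()
                    - fieldsense_temp.groupby(fieldsense_temp.index.month).mean())
    correction = netatmo_temp.index.month.map(monthly_bias).to_numpy()
    return netatmo_temp - correction
```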
Figure 5b showed that the performance of the SCT decreases with decreasing station density. We therefore expect the Netatmo observations to bring added value in regions where the FieldSense network is relatively sparse. Figure 7 shows two examples: one for a FieldSense station located in a data-dense region, where the closest FieldSense and Netatmo stations are 2.2 and 1 km away, respectively, and one for a FieldSense station where the closest neighboring FieldSense and Netatmo stations are 16.4 and 1.6 km away, respectively.
In the case where the FieldSense station is located in a data-dense region, the neighboring FieldSense stations are seen to have temperatures similar to that of the station in question. Adding the Netatmo stations in this case is not expected to have a positive impact on the outcome of the SCT. It might even lead to a worse outcome due to the higher variability seen among Netatmo stations for daytime data. In the second case, where the FieldSense station in question is located in a data-sparse region, the neighboring FieldSense stations all exhibit colder temperatures. The neighboring FieldSense stations are located farther from the coast and therefore do not have the same maritime influence as the FieldSense station in question. The closer Netatmo stations, on the other hand, have (bias-corrected) temperatures that more accurately represent the region around the FieldSense station under consideration. In this example, adding the Netatmo observations in the QC of the FieldSense station is expected to have a positive impact on the outcome of the SCT. The impact of adding the Netatmo network is therefore investigated for FieldSense stations located in data-dense versus data-sparse regions. The results for an artificial error of ±3°C are shown in Fig. 8.
Using the uncorrected Netatmo observations results in a significantly worse performance of the SCT, for both the data-dense and the data-sparse scenario. Without bias correction, the cost optimum is found for σ thresholds of 0.5 and 3.0 for stations in data-sparse and data-dense regions, respectively, with associated costs of 0.92 and 0.83. After bias correction, on the other hand, the expected cost is lower for both network densities: 0.54 and 0.61 for stations located in data-dense and data-sparse regions, respectively. It is therefore clear that a bias correction is needed in order to remove the net temperature difference between the Netatmo and FieldSense networks. Even a simple bias correction method, such as the one applied in this study, is enough to give satisfactory results.
The focus will now be on the difference between the FieldSense-only dataset and the dataset that also includes the bias-corrected Netatmo observations. In the data-dense scenario, a worse performance is seen for higher σ thresholds and lower false alarm rates when adding the Netatmo observations. The variability of the Netatmo temperatures is expected to be higher compared to the FieldSense temperatures, especially for daytime data, where solar heating effects are expected to contribute to a larger spread. The higher σ thresholds are therefore expected to let more erroneous observations pass the SCT. This results in more misses and hence a lower hit rate for a given false alarm rate, which is exactly what can be seen in Fig. 8. For lower thresholds, on the other hand, the addition of the Netatmo observations neither increases nor decreases the performance of the SCT. The cost optimum for the FieldSense-only dataset in the data-dense scenario is found for a σ threshold of 3.5, with an associated cost of 0.47. Including the bias-corrected Netatmo observations results in a higher optimal σ threshold of 4. Furthermore, the cost increases by 14% (from 0.47 to 0.54). Hence, the inclusion of Netatmo observations does not benefit the QC in the data-dense scenario, as was expected.
For the data-sparse scenario, on the other hand, the QC can be seen to benefit from the use of Netatmo observations. For the FieldSense-only dataset, the lowest cost is associated with a σ threshold of 2 (as noted previously), whereas the optimal threshold when including the Netatmo observations is 3.5. Furthermore, the cost decreases by 16% (from 0.73 to 0.61) when the Netatmo observations are added. Hence, by including Netatmo observations in the QC of stations located in data-sparse regions, the hit rate is increased and the false alarm rate is decreased, resulting in a clear improvement of the SCT.
7. The effect of having undetected errors
The framework presented here, where artificial errors are introduced in order to tune the settings of a spatial QC method, is built upon the assumption that the dataset used only contains good observations and that all gross errors have been excluded. This is usually a good approximation, since gross errors are much rarer than valid observations in meteorological networks, as discussed in section 4. The datasets used in this study have been subjected to an initial QC; however, it is possible that some erroneous observations have not been detected. As a consequence, the “unperturbed” scenarios used in sections 4 and 5 likely include a small fraction of gross errors, which we erroneously consider as good observations in our experiments.
If a gross error passed the initial QC it would be included in the “unperturbed” scenario as a good observation. If this observation passes the SCT it would be labeled as a correct rejection, based on the assumption that it is a valid observation. However, in reality it should have been considered a miss since the observation is in fact erroneous. In the other case, where the SCT flags the observation, it would be labeled as a false alarm, when in reality it should be considered a hit. Having gross errors in the “unperturbed” scenario will therefore affect the analysis and it will not give an accurate estimate of the SCT’s false alarm and hit rate.
Having errors in the “unperturbed” scenario has an impact on the uncertainty of our results. To investigate the sensitivity of the framework with respect to undetected gross errors we simulate a dataset that contains 10% undetected errors. This error sensitivity experiment is performed using the FieldSense only dataset. The simulated dataset is created by adding an artificial error, here chosen as ±3°C, to 10% of the FieldSense stations from the original dataset, randomly selected. These observations, however, are still labeled as good since we want to simulate the effect of having undetected gross errors in our dataset. In reality, the new simulated dataset likely contains more than 10% undetected gross errors, since these artificially added errors are on top of any erroneous observations that might have gone undetected in the initial QC for the original dataset.
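The construction of this simulated dataset can be sketched as follows; the random selection and the ±3°C perturbation follow the description above, while the function and variable names are illustrative.

```python
import numpy as np

def seed_undetected_errors(values, fraction=0.10, error_c=3.0, seed=0):
    """Perturb a random `fraction` of the observations by +/- error_c while still
    labeling them as good, simulating gross errors missed by the initial QC."""
    rng = np.random.default_rng(seed)
    simulated = np.asarray(values, dtype=float).copy()
    n_bad = int(round(fraction * simulated.size))
    idx = rng.choice(simulated.size, size=n_bad, replace=False)
    simulated[idx] += rng.choice([-1.0, 1.0], size=n_bad) * error_c
    labels_good = np.ones(simulated.size, dtype=bool)  # all still labeled as good
    return simulated, labels_good
```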
Using this new dataset, that now contains 10% undetected gross errors, we apply the framework outlined in section 3, where we perturb each observation in the simulated dataset by an artificial error of ±3°C, as done in section 5. The SCT is run for each observation, one at a time, for this new “perturbed” scenario. In addition, the SCT is run for an “unperturbed” scenario, in which the SCT is run on the simulated dataset (which contains 10% undetected gross errors) without perturbing any of the observations. The result from this new experiment, i.e., the outcome of the SCT for both the “perturbed” and the “unperturbed” scenario for the simulated dataset, is evaluated using the hit rate and false alarm rate in order to construct a ROC curve. Figure 9 shows the results from applying the tuning algorithm of section 3 for both the original dataset and this new simulated dataset that contains 10% undetected errors.
For the simulated dataset, which contains 10% undetected errors of magnitude 3°C, the ROC curve is displaced toward the lower-right corner compared to the ROC curve of the original dataset. This means that for the same σ threshold the SCT results in a higher false alarm rate and a lower hit rate for the simulated dataset compared to the original dataset. Furthermore, the optimal settings differ between the two datasets. For this application we again assume that a miss is 6 times as expensive as a false alarm. The cost optimum and the associated σ threshold can be found by minimizing the cost defined by Eq. (3). For the original dataset, the minimum cost is 0.50 and is obtained for a σ of 3.5. For the simulated dataset containing 10% undetected errors, on the other hand, the best σ threshold is 2.5, with an associated cost of 0.70. Hence, the effect of having 10% undetected gross errors in the dataset is a shift of the optimal σ from 3.5 to 2.5, which results in an increase of the cost by 40%. For the original dataset, approximately the same cost as the minimum cost for the simulated dataset (0.70) is obtained when using a σ threshold of 5, as can be seen from the isoperformance lines in Fig. 9. Furthermore, the maximum cost, obtained using a σ threshold of 10, is 2.4 and 4.4 for the original and simulated datasets, respectively. This shows that having undetected gross errors in the dataset increases the overall cost, both the minimum and the maximum.
Both the increase in cost and the shift toward a lower σ threshold when the dataset contains undetected gross errors are something users should be aware of when using this method to tune the settings of a spatial QC method. The shift and increase, however, depend on both the number of undetected errors and the magnitude of these errors. The more undetected errors there are, the larger the shift and cost increase, and the larger the undetected errors are, the more likely they are to be flagged as false alarms in the “unperturbed” scenario (when they should be labeled as hits). Hence, the effect of having larger undetected gross errors is to increase the false alarm rate and therefore decrease the hit rate.
If the network to be used for the tuning is deemed to be too unreliable, i.e., the optimal σ threshold is unrealistically small such that a large number of observations is flagged, an alternative would be to use a trusted high-quality reference network, such as the official networks of WMO-compliant stations operated by national meteorological services. The trusted reference network can then be perturbed and quality controlled using the other network and an optimal σ can be obtained.
8. Conclusions
There is no specific procedure from the WMO on how to optimize parameters of a spatial QC method and there is very little information in the literature on how to choose the thresholds for spatial QC parameters. In this study we present a framework for tuning a spatial QC method for a dense network of meteorological observations. The framework is based on the perturbation of observations through the addition of artificial errors. Using a “perturbed” and an “unperturbed” scenario, the hit and false alarm rate can be calculated and the results can be visualized in a ROC diagram. A cost function, based on the relative cost of having a miss versus a false alarm, for optimizing the spatial QC method is proposed. This has the merit of making the user’s prior knowledge on the observational network and the characteristics of its application explicit. The parameter is then tuned such that the cost function is optimized. In the case of an unreliable network, a trusted high-quality reference network can be used in combination with the other network in order to minimize the effect of having undetected gross errors.
Some of the results obtained here are likely specific to the network used in this study, such as a better performance of the SCT and a higher optimal σ threshold for daytime data. Other results are more generic and are applicable to other dense networks. One such result is a better performance (lower cost) and a higher optimal σ for larger gross errors. To choose a σ threshold the user therefore needs to have a good understanding of the expected error distribution, since the optimal threshold depends on the magnitude of the errors.
Furthermore, it was seen that the SCT works better in data-dense regions and also that sparse networks require lower σ thresholds and therefore a stricter test. To improve upon the spatial QC for a sparse network, crowdsourced observational data from another network were included. Here we showed that it is crucial to bias correct the additional observations in order to remove any possible net temperature differences, especially when the networks are characterized by different land-cover classes and differ in their sensitivity to radiative errors. After the net temperature difference had been removed, it was shown that including the crowdsourced observations in the QC of a sparse network led to a better performance of the SCT and that a higher σ threshold could be used.
The framework is based on the assumption that the dataset used to tune the spatial QC method does not contain any erroneous observations. The effect of having undetected gross errors in the dataset is to increase the cost and shift the optimal σ toward lower thresholds. This is something the user should be aware of when using the proposed framework for tuning the parameters of a spatial QC method.
Future work includes investigating differences in correlations between observations from different land-cover classes (e.g., urban, rural), as it was shown that there is a net temperature difference between the rural and urban networks used in this study. This information can then be used to enhance the spatial QC method and possibly also contribute to the optimization of the QC method for different land-cover classes. Another research direction could be the investigation of weather-dependent QC parameters. As the weather influences the correlations between observations, the optimal parameter settings would depend on the weather situation. The effect of using variable QC parameters could be investigated by classifying the weather into different classes and performing the procedure proposed here for each class, which could yield weather-dependent QC parameters and possibly enhance the performance of the QC method.
Acknowledgments.
This study was partially funded by Innovation Fund Denmark, through Grant 8053-00242B.
Data availability statement.
Observational data were provided by FieldSense and Netatmo from their respective networks of private weather stations. Due to their proprietary nature, these data cannot be made openly available.
REFERENCES
Båserud, L., C. Lussana, T. N. Nipen, I. A. Seierstad, L. Oram, and T. Aspelien, 2020: TITAN automatic spatial quality control of meteorological in-situ observations. Adv. Sci. Res., 17, 153–163, https://doi.org/10.5194/asr-17-153-2020.
Bell, S., D. Cornford, and L. Bastin, 2015: How good are citizen weather stations? Addressing a biased opinion. Weather, 70, 75–84, https://doi.org/10.1002/wea.2316.
Böhm, R., P. D. Jones, J. Hiebl, D. Frank, M. Brunetti, and M. Maugeri, 2010: The early instrumental warm-bias: A solution for long central European temperature series 1760–2007. Climatic Change, 101, 41–67, https://doi.org/10.1007/s10584-009-9649-4.
Bormann, N., H. Lawrence, and J. Faranan, 2019: Global observing system experiments in the ECMWF assimilation system. ECMWF Tech Memo. 839, 26 pp., https://doi.org/10.21957/sr184iyz.
Brohan, P., J. J. Kennedy, I. Harris, S. F. B. Tett, and P. D. Jones, 2006: Uncertainty estimates in regional and global observed temperature changes: A new data set from 1850. J. Geophys. Res., 111, D12106, https://doi.org/10.1029/2005JD006548.
Chapman, L., C. Bell, and S. Bell, 2017: Can the crowdsourcing data paradigm take atmospheric science to a new level? A case study of the urban heat island of London quantified using Netatmo weather stations. Int. J. Climatol., 37, 3597–3605, https://doi.org/10.1002/joc.4940.
Clark, M. R., J. D. Webb, and P. J. Kirk, 2018: Fine-scale analysis of a severe hailstorm using crowd-sourced and conventional observations. Meteor. Appl., 25, 472–492, https://doi.org/10.1002/met.1715.
Daley, R., 1993: Atmospheric Data Analysis. Cambridge University Press, 457 pp.
Danielson, J., and D. Gesch, 2010: Global Multi-Resolution Terrain Elevation Data 2010 (GMTED2010). USGS Open-File Rep. 2011-1073, 26 pp.
Danish Agency for Data Supply and Efficiency, 2021: The Danish elevation model. Accessed 14 December 2021, https://eng.sdfe.dk/products-and-services/the-danish-elevation-model-dk-dem.
Dee, D. P., and Coauthors, 2011: The ERA-Interim reanalysis: Configuration and performance of the data assimilation system. Quart. J. Roy. Meteor. Soc., 137, 553–597, https://doi.org/10.1002/qj.828.
de Vos, L., H. Leijnse, A. Overeem, and R. Uijlenhoet, 2017: The potential of urban rainfall monitoring with crowdsourced automatic weather stations in Amsterdam. Hydrol. Earth Syst. Sci., 21, 765–777, https://doi.org/10.5194/hess-21-765-2017.
Fenner, D., F. Meier, D. Scherer, and A. Polze, 2014: Spatial and temporal air temperature variability in Berlin, Germany, during the years 2001–2010. Urban Climate, 10, 308–331, https://doi.org/10.1016/j.uclim.2014.02.004.
Fiebrich, C. A., C. R. Morgan, A. G. McCombs, P. K. Hall Jr., and R. A. McPherson, 2010: Quality assurance procedures for mesoscale meteorological data. J. Atmos. Oceanic Technol., 27, 1565–1582, https://doi.org/10.1175/2010JTECHA1433.1.
Gandin, L. S., 1988: Complex quality control of meteorological observations. Mon. Wea. Rev., 116, 1137–1156, https://doi.org/10.1175/1520-0493(1988)116<1137:CQCOMO>2.0.CO;2.
Hagelin, S., L. Auger, P. Brovelli, and O. Dupont, 2014: Nowcasting with the AROME model: First results from the high-resolution AROME airport. Wea. Forecasting, 29, 773–787, https://doi.org/10.1175/WAF-D-13-00083.1.
Hanley, J. A., and B. J. McNeil, 1982: The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143, 29–36, https://doi.org/10.1148/radiology.143.1.7063747.
Hersbach, H., and Coauthors, 2020: The ERA5 global reanalysis. Quart. J. Roy. Meteor. Soc., 146, 1999–2049, https://doi.org/10.1002/qj.3803.
Hintz, K. S., H. Vedel, and E. Kaas, 2019: Collecting and processing of barometric data from smartphones for potential use in numerical weather prediction data assimilation. Meteor. Appl., 26, 733–746, https://doi.org/10.1002/met.1805.
Horel, J. D., and X. Dong, 2010: An evaluation of the distribution of Remote Automated Weather Stations (RAWS). J. Appl. Meteor. Climatol., 49, 1563–1578, https://doi.org/10.1175/2010JAMC2397.1.
Howe, J., 2006: Crowdsourcing: A definition. Accessed 14 December 2021, https://crowdsourcing.typepad.com/cs/2006/06/crowdsourcing_a.html.
Hubbard, K. G., X. Lin, C. B. Baker, and B. Sun, 2004: Air temperature comparison between the MMTS and the USCRN temperature systems. J. Atmos. Oceanic Technol., 21, 1590–1597, https://doi.org/10.1175/1520-0426(2004)021<1590:ATCBTM>2.0.CO;2.
Hubbard, K. G., S. Goddard, W. D. Sorensen, N. Wells, and T. T. Osugi, 2005: Performance of quality assurance procedures for an applied climate information system. J. Atmos. Oceanic Technol., 22, 105–112, https://doi.org/10.1175/JTECH-1657.1.
Ingleby, N. B., and A. C. Lorenc, 1993: Bayesian quality control using multivariate normal distributions. Quart. J. Roy. Meteor. Soc., 119, 1195–1225, https://doi.org/10.1002/qj.49711951316.
Jolliffe, I. T., and D. B. Stephenson, 2012: Forecast Verification: A Practitioner’s Guide in Atmospheric Science. John Wiley and Sons, 254 pp.
Kalnay, E., 2003: Atmospheric Modeling, Data Assimilation and Predictability. Cambridge University Press, 341 pp.
Kidd, C., A. Becker, G. J. Huffman, C. L. Muller, P. Joe, G. Skofronick-Jackson, and D. B. Kirschbaum, 2017: So, how much of the Earth’s surface is covered by rain gauges? Bull. Amer. Meteor. Soc., 98, 69–78, https://doi.org/10.1175/BAMS-D-14-00283.1.
Kobayashi, S., and Coauthors, 2015: The JRA-55 reanalysis: General specifications and basic characteristics. J. Meteor. Soc. Japan, 93, 5–48, https://doi.org/10.2151/jmsj.2015-001.
Lussana, C., F. Uboldi, and M. R. Salvati, 2010: A spatial consistency test for surface observations from mesoscale meteorological networks. Quart. J. Roy. Meteor. Soc., 136, 1075–1088, https://doi.org/10.1002/qj.622.
Lussana, C., O. E. Tveito, A. Dobler, and K. Tunheim, 2019: seNorge_2018, daily precipitation, and temperature datasets over Norway. Earth Syst. Sci. Data, 11, 1531–1551, https://doi.org/10.5194/essd-11-1531-2019.
McNicholas, C., and C. F. Mass, 2018: Impacts of assimilating smartphone pressure observations on forecast skill during two case studies in the Pacific Northwest. Wea. Forecasting, 33, 1375–1396, https://doi.org/10.1175/WAF-D-18-0085.1.
Meier, F., D. Fenner, T. Grassmann, M. Otto, and D. Scherer, 2017: Crowdsourcing air temperature from citizen weather stations for urban climate research. Urban Climate, 19, 170–191, https://doi.org/10.1016/j.uclim.2017.01.006.
Morice, C. P., and Coauthors, 2021: An updated assessment of near-surface temperature change from 1850: The HadCRUT5 data set. J. Geophys. Res. Atmos., 126, e2019JD032361, https://doi.org/10.1029/2019JD032361.
Muller, C., L. Chapman, S. Johnston, C. Kidd, S. Illingworth, G. Foody, A. Overeem, and R. Leigh, 2015: Crowdsourcing for climate and atmospheric sciences: Current status and future potential. Int. J. Climatol., 35, 3185–3203, https://doi.org/10.1002/joc.4210.
Nakamura, R., and L. Mahrt, 2005: Air temperature measurement errors in naturally ventilated radiation shields. J. Atmos. Oceanic Technol., 22, 1046–1058, https://doi.org/10.1175/JTECH1762.1.
Napoly, A., T. Grassmann, F. Meier, and D. Fenner, 2018: Development and application of a statistically-based quality control for crowdsourced air temperature data. Front. Earth Sci., 6, 118, https://doi.org/10.3389/feart.2018.00118.
Nipen, T. N., I. A. Seierstad, C. Lussana, J. Kristiansen, and Ø. Hov, 2020: Adopting citizen observations in operational weather prediction. Bull. Amer. Meteor. Soc., 101, E43–E57, https://doi.org/10.1175/BAMS-D-18-0237.1.
Overeem, A., H. Leijnse, and R. Uijlenhoet, 2011: Measuring urban rainfall using microwave links from commercial cellular communication networks. Water Resour. Res., 47, W12505, https://doi.org/10.1029/2010WR010350.
Overeem, A., J. R. Robinson, H. Leijnse, G.-J. Steeneveld, B. P. Horn, and R. Uijlenhoet, 2013: Crowdsourcing urban air temperatures from smartphone battery temperatures. Geophys. Res. Lett., 40, 4081–4085, https://doi.org/10.1002/grl.50786.
Prates, C., E. Andersson, and T. Haiden, 2019: WIGOS data quality monitoring system at ECMWF. ECMWF Tech. Memo. 850, 27 pp., https://doi.org/10.21957/3kdegl7kz.
Provost, F., and T. Fawcett, 2001: Robust classification for imprecise environments. Mach. Learn., 42, 203–231, https://doi.org/10.1023/A:1007601015854.
Shafer, M. A., C. A. Fiebrich, D. S. Arndt, S. E. Fredrickson, and T. W. Hughes, 2000: Quality assurance procedures in the Oklahoma Mesonetwork. J. Atmos. Oceanic Technol., 17, 474–494, https://doi.org/10.1175/1520-0426(2000)017<0474:QAPITO>2.0.CO;2.
Simmons, A. J., and Coauthors, 2004: Comparison of trends and low-frequency variability in CRU, ERA-40, and NCEP/NCAR analyses of surface air temperature. J. Geophys. Res., 109, D24115, https://doi.org/10.1029/2004JD005306.
Steeneveld, G.-J., S. Koopmans, B. Heusinkveld, L. Van Hove, and A. Holtslag, 2011: Quantifying urban heat island effects and human comfort for cities of variable size and urban morphology in the Netherlands. J. Geophys. Res., 116, D20129, https://doi.org/10.1029/2011JD015988.
Stigter, C., M. Sivakumar, and D. Rijks, 2000: Agrometeorology in the 21st century: Workshop summary and recommendations on needs and perspectives. Agric. For. Meteor., 103, 209–227, https://doi.org/10.1016/S0168-1923(00)00113-1.
Uboldi, F., C. Lussana, and M. Salvati, 2008: Three-dimensional spatial interpolation of surface meteorological observations from high-resolution local networks. Meteor. Appl., 15, 331–345, https://doi.org/10.1002/met.76.
Valkonen, T., and Coauthors, 2020: Evaluation of a sub-kilometre NWP system in an Arctic fjord-valley system in winter. Tellus, 72A, 1–21, https://doi.org/10.1080/16000870.2020.1838181.
van den Besselaar, E. J. M., A. M. G. Klein Tank, G. van der Schrier, and P. D. Jones, 2012: Synoptic messages to extend climate data records. J. Geophys. Res., 117, D07101, https://doi.org/10.1029/2011JD016687.
Vejen, F., and Coauthors, 2002: Quality control of meteorological observations: Automatic methods used in the Nordic countries. KLIMA Tech. Rep. 8/2002, 111 pp.
Vionnet, V., S. Bélair, C. Girard, and A. Plante, 2015: Wintertime subkilometer numerical forecasts of near-surface variables in the Canadian Rocky Mountains. Mon. Wea. Rev., 143, 666–686, https://doi.org/10.1175/MWR-D-14-00128.1.
Wilks, D. S., 2011: Statistical Methods in the Atmospheric Sciences. 3rd ed. Elsevier, 676 pp.
WMO, 2010: Guide to Agricultural Meteorological Practices (GAMP). WMO Doc. 134, 799 pp.
WMO, 2018a: Guide to meteorological instruments and methods of observation. WMO Doc. 8, 573 pp.
WMO, 2018b: Technical guidelines for regional WIGOS centres on the WIGOS Data Quality Monitoring System. WMO Doc. 1224, 40 pp.
WMO, 2021: Guidelines on surface station data quality control and quality assurance for climate applications. WMO Doc. 1269, 54 pp.
Zahumenský, I., 2004: Guidelines on quality control procedures for data from automatic weather stations. WMO Doc., 10 pp.
Zheng, F., and Coauthors, 2018: Crowdsourcing methods for data collection in geophysics: State of the art, issues, and future directions. Rev. Geophys., 56, 698–740, https://doi.org/10.1029/2018RG000616.