A quality control (QC) process has been developed and implemented on an observational database of surface wind speed and direction in northeastern North America. The database combines data from 526 land stations and buoys spread across eastern Canada and five adjacent northeastern U.S. states. It merges the observations of three different institutions spanning from 1953 to 2010. The quality of these initial data varies among source institutions. The current QC process is divided into two parts. Part I, described herein, is focused on issues related to data management: issues stemming from data transcription and collection; differences in measurement units and recording times; detection of sequences of duplicated data; unification of calm and true north criteria for wind direction; and detection of physically unrealistic data measurements. As a result, about 0.1% of wind speed and wind direction records have been identified as erroneous and deleted. The most widespread error type is related to duplications within the same station, but the error type that accounts for the most erroneous data is duplications among different sites. Additionally, the process of data compilation and standardization has had an impact on more than 90% of the records. A companion paper (Part II) deals with a group of errors that are conceptually different, and is focused on detecting measurement errors that relate to temporal consistency and biases in wind speed and direction.
Any analysis that makes use of meteorological or climatological observations relies on the quality of the data. Errors in measurements occur at any stage during the process of data manipulation, starting with the recording, storage, and distribution of data, until they reach the end users. Therefore, it is important to establish procedures to ensure the quality of the observations. Errors can be classified into three different types (Gandin 1988): random, systematic, and rough errors. Random errors are unavoidably inherent to all data, independent of the measured value and follow a zero-centered normal distribution. Systematic errors are distributed asymmetrically, usually persist in time and have multiple origins (e.g., instrument bias, calibration drifts, exposure problems; Wade 1987). These errors can be easily mistaken for random errors unless there is a priori information about them. Last, the malfunctioning of measuring devices and mistakes during data processing, transmission, and reception (Gandin 1988) can lead to the third type of error, the so-called rough (or large) error. The majority of the rough errors are caused by the malfunctioning of measuring devices or are communication related—introduced when the data are recorded, pass through, or emerge from communication channels. Although usually only a very small part of all the data are affected, the distortion caused by rough errors can be large enough to greatly impact subsequent analyses.
The procedures and protocols targeting the flagging and elimination, or eventual correction, of those systematic and rough errors are traditionally known as quality control (QC; e.g., DeGaetano 1997) or quality assurance (QA; e.g., Shafer et al. 2000) procedures. The use of both terms is frequent in the literature, often with the same meaning (e.g., Meek and Hatfield 1994; Eischeid et al. 1995; Graybeal et al. 2004; Wan et al. 2007; Lawrimore et al. 2011), which can lead to an ambiguous situation. Resorting to the Guide to Meteorological Instruments and Methods of Observation (WMO 2008), the difference, however, is clear: QA is the framework designed to prevent errors along the various meteorological activities, and QC is focused on the procedures to detect them. Therefore, we will refer to the analysis presented herein as QC.
QC processes can be designed to fulfill different goals. Some evaluate operational (real time) data by focusing on single stations (Meek and Hatfield 1994) or involving station networks (Wade 1987; Gandin 1988; DeGaetano 1997; Shafer et al. 2000; Fiebrich et al. 2010). Others focus their attention on assuring the quality of previously compiled historical databases (Graybeal et al. 2004; Jiménez et al. 2010b) that, in turn, may have been previously subjected to a QC process. Many quality protocols address several meteorological variables at the same time and are able to exploit cross information from each parameter (Gandin 1988; Meek and Hatfield 1994; Shafer et al. 2000; Fiebrich et al. 2010; Dunn et al. 2016), while in other cases they specifically focus on one parameter, which is often temperature or precipitation (Gandin 1988; Eischeid et al. 2000; González-Rouco et al. 2001; Lanzante et al. 2003; Lawrimore et al. 2011). Comparatively few studies address the detection of erroneous data and their correction or suppression in wind variables (DeGaetano 1997; Graybeal 2006; Jiménez et al. 2010b). Detection or correction protocols usually involve a battery of tests or checks, each of them focused on a specific potential problem. In some of these tests, a comparison with a neighbor or reference station is essential. In other cases the tests are carried out individually for each station.
The limit or plausibility checks search for individual measurements outside of a certain physical or statistical admissible range of values (e.g., Meek and Hatfield 1994; Graybeal et al. 2004; Lawrimore et al. 2011; Woodruff et al. 2011; Dunn et al. 2016). Temporal consistency checks account for excessive variability or unrealistic steady behaviors (e.g., DeGaetano 1997; Jiménez et al. 2010b). Internal consistency checks cross compare multiple variable types or a variable type from redundant sensors (e.g., Shafer et al. 2000; Graybeal et al. 2004). Spatial checks evaluate the records of a site in relation to those obtained at some neighbor location (e.g., Barnes 1964; Gandin 1988; DeGaetano 1997; Hubbard et al. 2005; Durre et al. 2010; Steinacker et al. 2011). Duplication error checks identify segments that could be artificially duplicated within a station’s lifetime or between different sites (e.g., Kunkel et al. 1998; Guttman 2002; Durre et al. 2010; Jiménez et al. 2010b; Lawrimore et al. 2011; Dunn et al. 2016). Finally, typographical error checks look for errors related to human mistakes made when the observations were recorded on paper and later transcribed during digitization efforts (e.g., DeGaetano 1997; Kunkel et al. 1998; Guttman 2002; Graybeal et al. 2004; Dunn et al. 2016).
In addition to these tests, which mainly deal with rough errors, there are also different procedures focused on the detection of systematic errors or biases (e.g., Klink 1999; Begert et al. 2003; Thomas et al. 2005; Jiménez et al. 2010b; Wan et al. 2010). These problems tend to affect longer time intervals than those discussed above. When changes are documented, corrections can be straightforward, as in the standardization of known measurement height changes (Klink 1999; Thomas et al. 2005). These methods straddle the often fuzzy border between QC and data homogenization procedures, which are focused on the detection (and eventual correction) of artificial breaks in long-term means, standard deviations, or trends (e.g., Alexandersson 1986; González-Rouco et al. 2001; Begert et al. 2003; Wan et al. 2010).
At the stage of implementing corrections, some studies treat the aforementioned checks independently and thus decide about the quality of the data at the end of each test that is applied sequentially (DeGaetano 1997; Jiménez et al. 2010b; Lawrimore et al. 2011). Other studies use the so-called complex procedures: flagging the data after each test and making the final decision based on the results of all tests (Gandin 1988; Meek and Hatfield 1994; Eischeid et al. 1995; Shafer et al. 2000; Graybeal et al. 2004; Wan et al. 2007). Once the data are flagged, they can be eliminated or corrected through near- or fully automatic processes (Gandin 1988; Dunn et al. 2016) or with the help of human intervention (Wan et al. 2007; Jiménez et al. 2010b; Lawrimore et al. 2011). The data can also be merely flagged (Meek and Hatfield 1994; Eischeid et al. 1995; DeGaetano 1997; Shafer et al. 2000; Graybeal et al. 2004; Durre et al. 2010; Dunn et al. 2016), leaving the ultimate decision regarding corrections/removal to the end user.
Many of the tests cited above seek to detect either rough or systematic errors that can be produced at different moments between generating and archiving meteorological information. These types of erroneous records are in general of a local nature and are in principle not related to the institutional data sources. We will refer to these types of errors broadly as measurement errors.
On the other hand, during the operation and management of meteorological networks, the institutions in charge often adopt a set of criteria that assure the internal coherence of their data regarding the way the variables are measured and postprocessed (e.g., WMO 2008; MSC 2013). The criteria can, however, differ from one institution to another and pose challenges when unifying data from different sources into a common database. Additionally, there are errors, generally related to data manipulation, that can systematically affect multiple series that originate from a common source (e.g., duplication errors). These cases will be regarded as issues related to data storage and management.
The present work summarizes a QC process applied to a historical data compilation of onshore and offshore surface wind observations across the east coast of Canada and the northeastern United States. The observations have been collected from three different sources. The sources were selected on the basis of their availability for this study. The combined time span of the records covers almost 60 years with varying time resolutions; uneven measurement units; and changing measuring procedures, instrumentation, and heights. The level of QC procedures applied to the series prior to our compilation can be very different (Thomas and Swail 2011; MSC 2013). Therefore, the potential number of existing errors could be high and may have a nonnegligible impact on any future analysis.
The large-scale dynamics favor the transit of cyclones of tropical origin over the region of interest during the summer season (Landsea 2007) and even more intense extratropical cyclones during winter (Hart and Evans 2001; Plante et al. 2015). Such extratropical cyclones are frequently responsible for extreme weather events (Richards and Abuamer 2007; Cheng 2014). A large coastal perimeter and complex orography pose challenges for downscaling strategies oriented to the understanding of wind variability at a range of time scales, from intra- and interannual to long-term trends. So far, this area has received relatively little attention (e.g., Cheng et al. 2008, 2012; Martinez et al. 2013), and future analyses of the database provided here may focus on regional wind variability and trends, as has been done for other regions (e.g., Najac et al. 2009; Jiménez et al. 2010a; García-Bustamante et al. 2012; Pryor and Barthelmie 2014). This may be of scientific and societal relevance, as the government of Canada has shown a growing interest in building wind farms on the peninsula of Nova Scotia and in its annexed provinces (e.g., Hughes et al. 2006; Hughes 2007; Hughes and Chaudhry 2011). For the veracity of these analyses, however, it is paramount to handle observational databases in which the quality of different sources is brought to a common ground so that the data can later be used with confidence regardless of their provenance.
The objective of this work is to analyze and improve the quality of a set of surface wind data across northeastern North America obtained from a variety of sources and ultimately to develop a database useful for the analysis of surface wind variability. This study is divided into two parts. The goal of this first part is to analyze the occurrence of the various issues related to data management errors and their impact. Some of the issues treated herein have been discussed in previous works, addressing, for instance, eventual site relocations (e.g., Vautard et al. 2010), duplication errors (e.g., Dunn et al. 2016), or checks related to physical limits (e.g., Durre et al. 2010), among others. Conversely, some of the tests used in this work are new, like those targeting site relocations or duplication errors, and can be useful workarounds for situations where metadata are not available (e.g., duplication errors). For each test, a description of the type of problem is provided, together with a report on the statistics of occurrence in space and time, and other details, such as the data source. This helps to illustrate the different factors that can contribute to the appearance and occurrence of management errors. Although the specifics of some of the developed tests (especially during the compilation) have been tailored to the data sources used, the issues presented herein are nonetheless common to many different kinds of datasets, and most of the described procedures can be applied broadly.
The second part of this study (Lucio-Eceiza et al. 2017, hereafter Part II) is focused on measurement errors; the procedures presented therein are of universal applicability, as these errors are independent of the dataset. As with Part I, attention will be paid to illustrating the dependencies on the occurrence of errors. In both parts, an evaluation of the impact of errors on the statistics of the data is provided.
The remainder of the present paper is structured as follows. Section 2 describes the observational database. Section 3 describes the methodologies of the QC process for issues related to data management. Section 4 provides an account of the results obtained during each step of the QC applied herein. The impact of the suppressed data is discussed in section 5. The conclusions and some discussion are provided in section 6.
2. Observational wind data
The QC described herein focuses on a surface wind database that integrates 526 stations distributed across northeastern North America (WNENA). WNENA is the result of an aggregation of three different datasets (Fig. 1a), each one provided by a different institution: Environment Canada [EC; now known as Environment and Climate Change Canada (ECCC)], the Department of Fisheries and Oceans Canada Integrated Science Data Management division (DFO), and the operational global surface observations (NCEP ADP OGSO 1980, 2004) archived at the National Center for Atmospheric Research (NCAR). WNENA has an uneven distribution of stations, with higher spatial density across the southern area and along the coast, and with lower density northward and inland. The database spans over 60 years of hourly, 3- and 6-hourly recorded measurements using a variety of time references (Fig. 1b). Only simultaneous valid data pairs of both wind direction and speed are kept. Additionally, only sites with valuable data from the climatological perspective were selected, keeping those that had a good representation of at least one annual cycle or partial information over more than one season. For land stations, only those that had at least one year with >90% of nonmissing records or 3 years with >50% of nonmissing records were selected. For moored buoys the conditions were less stringent, as these are more prone to having data gaps (Thomas and Swail 2011), specifically those from the Great Lakes, which are only seasonally operated during ice-free months (B. Bradshaw 2009, personal communication). Only buoys that had at least one year with >85% of nonmissing records or 2 years with four operating months were kept. These conditions reduced the initial size of the database from ~700 sites to the actual number of 526 and accounted for an approximate loss of 2 400 000 pairs of records. 
As these numbers are the result of an initial decision to remove potentially problematic sites or sites of lower value previous to the compilation (section 3a), these numbers have not been included in the statistics and results described herein.
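The minimum-length criterion for land stations can be illustrated with a short sketch. It assumes that the fraction of nonmissing records has already been computed for each year, and that the three years with >50% coverage need not be consecutive (the text does not say either way). The function name and input layout are hypothetical and are not taken from the processing code used for WNENA.

```python
def keep_land_station(yearly_coverage):
    """Return True if a land station meets the minimum-length criterion.

    yearly_coverage: dict mapping year -> fraction (0 to 1) of nonmissing
    records in that year (hypothetical input layout).

    Rule from the text: at least one year with >90% of nonmissing records,
    or at least 3 years with >50% of nonmissing records.
    Assumption: the 3 years do not have to be consecutive.
    """
    fractions = list(yearly_coverage.values())
    one_good_year = any(f > 0.90 for f in fractions)
    three_fair_years = sum(f > 0.50 for f in fractions) >= 3
    return one_good_year or three_fair_years
```

The buoy criterion would follow the same pattern with its own thresholds (>85% in one year, or 2 years with four operating months).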
EC is the primary source for data with an original size of more than 400 sites; after the minimum length constraint, the number of sites was reduced to 343 land stations distributed across the east coast of Canada (see Table 1) and encompassing the provinces of New Brunswick (40 sites), Newfoundland (48) and Labrador (16), Nova Scotia (66), Prince Edward Island (19), and Quebec (154). The data have been gathered from HLY01 (hourly weather) and HLY15 (wind) ASCII individual files. These sites have been, to various degrees, previously quality controlled in both real-time and delayed mode by Environment Canada (MSC 2013). The files were acquired in subsequent batches in May 2008, February 2009, and March 2009. The series span from 1 January 1953, with 44 sites available, to 4 March 2009, with 193 sites available (see Fig. 1c).
A database of this spatiotemporal extension makes use of a great variety of anemometer types and averaging methods through time (Table 2), both automatic and manually operated, with this being a source of potential data issues. The most used anemometer types are the U2A (HLY01 and HLY15) and 45B (HLY15) for manned stations, and in recent times 78D digital automatic systems that incorporate U2A equipment. The original measurements for U2A and 45B are 2-min averages ending at the time of recording and are reported to the nearest nautical mile (1.852 km) per hour since 1996. Prior to that date, 1-min averages to the nearest land mile (1.609 km) per hour were used. The 78D system, in turn, provides averages ranging between 2 and 10 min (Richards and Abuamer 2007; Wan et al. 2010). All the data have been provided in kilometers per hour. The direction has been recorded at 8 (HLY15), 16 (HLY01), or 36 (HLY01, HLY15) points of the compass, with the transition from 16 to 36 points taking place at the end of 1970 (Environment and Climate Change Canada 2017). The records with 36 points of the compass are provided to the closest decagrade (0–36), while those given in 8 (16) points store their measurements in alternate intervals of four or five (two or three) decagrades (MSC 2013). The standard measuring height should follow, in theory, the international convention of 10 m (WMO 1950, 1969, 1983, 2008; MSC 2013). However, in practice many sites have experienced changes through time, particularly in the 1950s and 1960s, when it was not rare to install the instrumentation on rooftops to attain better exposure (Klink 1999; Wan et al. 2010). Only after the 1970s can the heights be considered with certain confidence at the standard 10-m height [Wan et al. 2010; see Table 3; for more information see Part II, section 4b(1)].
The records are given in local standard time (LST), which usually matches with province boundaries (Fig. 1b): eastern time zone (ETZ) at coordinated universal time − 5 h (UTC-5 or eastern standard time, red), Atlantic time zone (ATZ) at UTC-4 [Atlantic standard time (AST), orange], and Newfoundland time zone (NTZ, purple) at UTC-3.5, although the observations were made at 30 min past the hour, thus at AST. The data have been archived at hourly time resolution for most of the cases, although a few sites have been reporting the data at 3-hourly resolution [1960s–1980s; e.g., sites 704470 (Manicouagan) and 8200640 (Canso)] or synoptic intervals [until 1960s; e.g., 7043000 (Harrington Harbour) and 8401000 (Cape Race)] and some even only during daylight hours [e.g., 7052605 (Gaspe) and 705C2G9 (Îles de la Madeleine)].
The DFO dataset, archived by the Fisheries and Oceans Canada Integrated Science Data Management (ISDM) division (http://www.meds-sdmm.dfo-mpo.gc.ca/isdm-gdsi/waves-vagues/index-eng.htm), consisted originally of 22 moored weather buoys from Environment Canada covering the east coast of Canada (10) and the Canadian Great Lakes (12) that after the first step of the compilation (section 3a) resulted in 40 fixed positions (Table 1). The meteorological raw data were gathered from individual CSV files and had not received any quality control (Thomas and Swail 2011). The files had a single flag that applied to wave data and sometimes indicated whether the buoys were at the right position. The data corresponding to buoys flagged as being off position, adrift, or in dock under repair, or at the end of each measuring season have not been considered. The data were accessed during June 2008 and consequently the time span ranges from 2 December 1988 to 25 June 2008 (see Fig. 1c). The data are provided at different heights depending on the hull type of the buoy (Table 3). Some time series have been recorded using different hulls and thus at different heights. In buoys with two anemometers, the highest one is usually considered as the primary source of information (the primary channel), while the second one is used as a backup when the first one is faulty. The historical MSC buoy status reports (available on the aforementioned ISDM website) give information on the channel used for data transmission, which corresponds to the highest anemometer by default. The time series of each buoy has been constructed by combining the information of both channels, either by choosing the transmitted channel or by visually rejecting the channel with erroneous data when the metadata were not available (e.g., before 22 June 1997). The periods when the sensors were unserviceable are also indicated in the metadata and were removed from our series.
The records are 10-min-average samples ending at the time of their recording. Most of the measurements have been performed with R. M. Young anemometers, although since 2007 Vaisala WS425 ultrasonic anemometers have also been installed in secondary positions (Table 2; Thomas et al. 2005; Thomas and Swail 2011), with wind speed and direction recorded in meters per second and degrees, respectively. The data were provided in UTC, mostly in hourly resolution but also in 3-hourly resolution [e.g., 44139 (Banquereau Bank)] and then rounded to the closest hour at the collection. The reported time in the CSV files corresponds to the end of the wave sample, which for east coast buoys (including the Gulf of St. Lawrence) occurs 45 min before the end of the meteorological sample. The reporting times were delayed accordingly to match the meteorological measurements. The reported times for the Great Lakes are given at the end of the meteorological measurements and were left as they were provided (AXYS Environmental Consulting Ltd. 1996; M. Ouellet 2015, personal communication).
NCAR provided 143 additional series. From an original set of ~700 NCAR sites located in the region, only those longer than a year and located farther than 0.05° of any nearby EC station were chosen, both to improve the density of sites across eastern Canada and to introduce some information across the southern part of our area of interest. Ninety-one new sites across eastern Canada, involving stations in Nova Scotia (2), Nunavut (2), Ontario (78), and Quebec (9), and 52 sites across adjacent lands in the United States, including the states of Maine (14), Massachusetts (8), New Hampshire (9), New York (18) and Vermont (3), were added (Fig. 1a; Table 1). The dataset combines data from synoptic observations (SYNOP), aviation routine weather reports (METARs), Automated Weather Observing Systems (AWOS), and Automated Surface Observing Systems (ASOS), transmitted by the Global Telecommunication System (GTS) and stored in ds464.0 [office note (ON) 124 format] and ds461.0 (WMO BUFR format) databases. The data were downloaded on 1 January 2010. The series span from 1 January 1978 to 31 December 2009 (see Fig. 1c). Following the recommendations from NCAR, only ds461.0 was used since April 2000. There is no evidence of any QC process applied by NCEP to either ds461.0 or ds464.0 land surface wind data. The sampling resolution varies from 1–2 to 10 min before the hour. The data were recorded in knots for wind speed and degrees with a resolution of 36 compass points for wind direction, and were provided at UTC mainly in hourly, 3-hourly, and synoptic resolution, rounded to the closest hour during collection.
Being a compilation of both Canadian and U.S. sites, the measurements have been carried out with a great variety of anemometers and sampling techniques. For example, for the sites located in Canada the anemometers are likely to be of the 45B/U2A type, while the ASOS sites in the United States are equipped with the Belfort F420 series (Table 2; Nadolski 1998), although they have been transitioning to the Vaisala NWS 425 ice-free wind sensor (IFWS) ultrasonic anemometers since late 2005 (NOAA 2003; Schmitt IV 2009). The anemometer heights at these sites, although theoretically at 10 m (WMO 1969, 1983, 2008), in reality may have varied considerably (Wieringa 1980; Klink 1999; Pryor et al. 2009), and only for ASOS data from the mid-1990s onward can a 10-m height be assumed with certain reliability (Table 3; Nadolski 1998).
3. Quality control methodology
The QC that has been applied is structured into six phases that deal with the detection of various issues in data quality (numbered in Fig. 2): 1) compilation; 2) duplication errors; 3) physical consistency in the ranges of recorded values; 4) temporal consistency, regarding abnormally high/low variability in the time series; 5) detection of long-term biases; and 6) removal of isolated records. The first three phases deal with issues often related to data recording and management. The issues discussed in the compilation phase are divided into two steps. The first one is related to the way the information is stored in the different datasets. The second step is related to issues that arise at the moment of the compilation of data from different sources and involves the unification of criteria due to different institutional practices. The latter step is also the case with the consistency in values phase regarding redefinitions like those of true north and calms. The duplication errors and consistency in values phases are mostly related to data management issues, although instrumental faults can also influence the consistency in values phase. The last three phases (phases 4–6) deal with measurement errors related to instrumental problems, like untrustworthy performance, calibration, siting, changes in exposure of the surrounding environment, or others. This manuscript describes the issues related to data management (phases 1–3 in Part I, Fig. 2), while measurement errors will be addressed in Part II (phases 4–6 in Fig. 2).
The QC process follows a sequential structure designed to minimize potential overlapping between the various different phases. Most of the checks are common for both wind speed and direction, although some of them specifically address only one of the variables. The steps outlined in this manuscript (Part I) are designed to remove all the data regarded as erroneous after each phase, where the elimination of a speed or direction record implies the loss of the pair of both variables. However, in Part II the erroneous records will only be left flagged without further removal. Some of the procedures, as discussed at the introduction of each test within these papers, are to some extent based upon those developed in Jiménez et al. (2010b). However, improvements to them and many new steps have also been introduced herein. This section describes the first three phases, while the presentation of results and the illustration of specific cases will be addressed in the next section. Likewise, section 3 in Part II will deal with the last three phases in Fig. 2.
a. Phase 1: Compilation
The compilation phase (phase 1 in Fig. 2) is divided into two steps. In the first step a series of procedures was independently applied to each separate data source in order to detect errors during the data transcription or collection process (see typographical error checks, section 1). Tests were run to detect and correct measurements out of chronological order (Guttman 2002; Graybeal et al. 2004) and dates that have been entered/stored more than once (Guttman 2002). From February 2001 to July 2002, the ds461.0 dataset stored wind speed data erroneously, an issue that is extensively reported on its documentation page. Wind speed records from raw SYNOP reports, received via the GTS, were incorrectly converted when the ADP BUFR records and files were created. Numerous raw SYNOP wind speed reports were assumed to be in units of meters per second when they were actually in knots. This error spread over various sites, and the data were patched with a corrected batch.
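The ordering and repeated-date tests mentioned at the start of this step can be sketched as below. This is only an illustration under simple assumptions: time stamps are given as sortable values (e.g., ISO-formatted strings or `datetime` objects), and the function name is hypothetical.

```python
from collections import Counter


def check_timestamps(timestamps):
    """Locate records out of chronological order and repeated dates.

    Returns a tuple (out_of_order, repeated):
      out_of_order: indices whose time stamp is earlier than the previous one
      repeated:     sorted list of time stamps stored more than once
    """
    out_of_order = [i for i in range(1, len(timestamps))
                    if timestamps[i] < timestamps[i - 1]]
    counts = Counter(timestamps)
    repeated = sorted(t for t, c in counts.items() if c > 1)
    return out_of_order, repeated
```

How a flagged record is then corrected (reordering, or choosing among conflicting entries for the same date) is a separate decision that this sketch does not attempt.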
Additionally, displacements of stations have also been taken into account in all datasets. The DFO buoys suffer changes in their moored position from time to time. Each buoy time series has been split into several parts, each one corresponding to the periods of stable positions after a displacement took place. Some NCAR sites also show displacements (Vautard et al. 2010), albeit for different reasons. This happens because the code identifiers of the stations are reused each time a station ceases to exist or is moved (e.g., to a different site within the same airport) and leads to cases where a code may combine data from different locations through time. To identify when a relocation could appreciably affect the wind behavior, the percentage of change of mean wind speeds was calculated for subsequent periods before and after each location changepoint reported in the data files. The changes were compared to changes experienced in randomly selected dates. This allowed for identifying the range of change that can be ascribed to natural variability and for identifying the shifts that produced changes that are comparably too large.
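A minimal sketch of this relocation check follows. The window length, the number of random dates, and the use of a 95th-percentile threshold are assumptions made for illustration; the text does not specify them. Random dates whose windows would straddle the reported changepoint are excluded so that the reference distribution reflects natural variability only, which is also a design choice of this sketch. Function names are hypothetical.

```python
import random
from statistics import mean


def pct_change(speeds, idx, window):
    """Percentage change of the mean speed between the `window` records
    before index `idx` and the `window` records starting at `idx`.
    Raises ZeroDivisionError if the earlier period has zero mean speed."""
    before = mean(speeds[idx - window:idx])
    after = mean(speeds[idx:idx + window])
    return 100.0 * (after - before) / before


def relocation_stands_out(speeds, change_idx, window,
                          n_random=500, quantile=0.95, seed=0):
    """Compare the change at a reported relocation with changes at random dates.

    Returns True when the absolute percentage change at `change_idx` exceeds
    the chosen quantile of absolute changes at randomly selected dates.
    """
    rng = random.Random(seed)
    # Random dates far enough from the changepoint that neither window crosses it.
    candidates = [i for i in range(window, len(speeds) - window + 1)
                  if abs(i - change_idx) >= window]
    null = sorted(abs(pct_change(speeds, rng.choice(candidates), window))
                  for _ in range(n_random))
    threshold = null[int(quantile * (n_random - 1))]
    return abs(pct_change(speeds, change_idx, window)) > threshold
```

For a series whose mean speed jumps from 5 to 8 at the changepoint, `pct_change` gives 60%, far outside the spread obtained at random dates, so the relocation would be flagged.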
The second step of the compilation phase deals with the standardization of the diversity of measurement units, formats, and dates described in section 2 to a common frame (e.g., Haylock et al. 2008; Durre et al. 2010). Wind speed has been set to meters per second for all datasets and wind direction to degrees. The recording time of all the sites has been set to UTC with the help of a metadata file provided by EC that contained the LST of all the EC stations. The information contained in the metadata has been independently validated through a pairwise comparison between each EC (target) site and its neighbors via mean square error (MSE). The comparison was carried out by shifting the target site up to 5 h forward and backward with respect to its pair, and looking for the time lag with minimum MSE between them, a procedure similar to that followed by Haylock et al. (2008). Section 4a describes the data that have been modified at this stage.
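The lag search used to validate the time metadata can be sketched as follows. The sketch assumes two hourly series on the same nominal time axis, with missing records stored as `None`; function names are hypothetical, and the handling of gaps is an assumption rather than a description of the actual implementation.

```python
def mse_at_lag(target, neighbor, lag):
    """MSE between `target` shifted by `lag` steps and `neighbor`,
    computed over the overlapping, nonmissing pairs only."""
    pairs = []
    for i, n in enumerate(neighbor):
        j = i - lag  # target record that lands on neighbor index i after the shift
        if 0 <= j < len(target) and target[j] is not None and n is not None:
            pairs.append((target[j], n))
    if not pairs:
        return float("inf")
    return sum((t - n) ** 2 for t, n in pairs) / len(pairs)


def best_lag(target, neighbor, max_lag=5):
    """Lag in [-max_lag, max_lag] that minimizes the MSE."""
    return min(range(-max_lag, max_lag + 1),
               key=lambda lag: mse_at_lag(target, neighbor, lag))
```

A best lag that agrees with the offset implied by the metadata supports the recorded time zone; a different lag points to a possible time reference problem at the target site.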
b. Phase 2: Duplication errors
The tests performed during this second phase (Fig. 2) identify periods of data that might have been accidentally duplicated during data retrieval, transmission, and archival (Kunkel et al. 1998; Durre et al. 2010; Jiménez et al. 2010b; Lawrimore et al. 2011; Dunn et al. 2016). These errors can take place within the same series (intrasite duplications) or from the accidental transfer of data from one series to another (intersite duplications). The checks have been applied first to a single time series and then to target intersite duplications. Both cases are handled in a similar manner.
The initial phase of the test locates any data chain of any length that has been repeated in any other period within the same time series. For intersite duplications, the detection is done for chains that are repeated between site pairs and at any time. The intersite process is conducted systematically by comparing each site with every other site in the database; that is, the process is repeated n(n − 1)/2 times, where n is the number of sites. Chains of constant measurements have not been considered here (see temporal consistency in Part II). The vast majority of the detected repetitions are presumably attributable to natural reasons, since factors like persistence and low precision in meteorological records potentially enhance the probability of occurrence of random repetitions. Therefore, it is not straightforward to discriminate when a repeated chain has been artificially misplaced and should be eliminated. To handle this issue, we analyze the distribution of repeated chains in WNENA, which allows us to estimate the frequency of occurrence of the repetitions depending on their length. The probability (frequency of occurrence) of a repeated chain tends to diminish with the length of the repetition (this will be illustrated in section 4b). On the other hand, the intrasite erroneous duplications tend to share common date features, as they are primarily caused by the resubmission of data from a prior reporting period under a time stamp of the current reporting period (e.g., Lawrimore et al. 2011). Similarly, the intersite erroneous duplications are often caused by one misfiled report under two different stations (e.g., Dunn et al. 2016) and are expected to occur simultaneously in time. For this reason, we flag only duplications between similar dates and only for chains with lengths that show a high percentage of duplications with common dates. The flagged subset of repetitions is located at the tail of the distribution and corresponds to the longest cases. 
There may be erroneous cases among the shorter chains, but their impact over the data quality is arguably smaller than that of the large ones. None of the other chains will be inspected unless some additional information is provided.
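The search for repeated chains can be illustrated with a minimal Python sketch. It is not the authors' implementation (the QC code was written in shell and Fortran); the function names, the fixed window length, and the use of `None` for missing data are assumptions made for illustration only. Constant chains are skipped, as in the text, and the pair count follows from comparing each site with every other site.

```python
from collections import defaultdict


def find_repeated_chains(values, min_len):
    """Return (i, j) start-index pairs, i < j, whose min_len-long windows
    are identical and do not overlap. Hypothetical helper."""
    seen = defaultdict(list)   # window content -> earlier start indices
    pairs = []
    for start in range(len(values) - min_len + 1):
        window = tuple(values[start:start + min_len])
        if None in window:            # skip windows with missing data
            continue
        if len(set(window)) == 1:     # constant chains are not considered
            continue
        for earlier in seen[window]:
            if start - earlier >= min_len:   # require non-overlapping copies
                pairs.append((earlier, start))
        seen[window].append(start)
    return pairs


def n_site_pairs(n_sites):
    """Number of unordered site pairs to compare in the intersite search."""
    return n_sites * (n_sites - 1) // 2
```

A window-based search like this only reports repetitions of a fixed length; extending each match to its maximal length, and applying the same idea across two different series, would be needed to reproduce the full intra- and intersite search.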
All the flagged repeated chains are subjected to a final inspection before any corrective decision is taken. The duplicated data chains from each of the two different time intervals at a given site (intrasite case) or from each of the two sites involved (intersite case) are compared with data from neighboring stations via Pearson correlation coefficient whenever this is possible. For wind direction sequences, directional statistics are applied to the correlation (Mardia and Jupp 2009). If this comparison provides enough evidence, the correct data interval will be identified and preserved, and the erroneous one erased. Otherwise, both data intervals will be removed. For intersite duplications the comparison has been extended to other time intervals and neighbor sites to identify whether the flagged repetitions can be attributed to a meteorological/natural origin.
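For wind direction, an ordinary Pearson coefficient is misleading near the 0°/360° wrap. The text does not state which directional statistic was used, so the following sketch assumes one widely used circular correlation coefficient (the Jammalamadaka–SenGupta form, based on sines of deviations from the circular mean). The function names are hypothetical.

```python
import math


def circular_mean(deg):
    """Circular mean of angles given in degrees; result in radians."""
    s = sum(math.sin(math.radians(d)) for d in deg)
    c = sum(math.cos(math.radians(d)) for d in deg)
    return math.atan2(s, c)


def circular_correlation(a_deg, b_deg):
    """Circular correlation between two equally long direction sequences
    (degrees). Assumed statistic, not necessarily the one used in the paper."""
    mean_a = circular_mean(a_deg)
    mean_b = circular_mean(b_deg)
    dev_a = [math.sin(math.radians(x) - mean_a) for x in a_deg]
    dev_b = [math.sin(math.radians(y) - mean_b) for y in b_deg]
    num = sum(x * y for x, y in zip(dev_a, dev_b))
    den = math.sqrt(sum(x * x for x in dev_a) * sum(y * y for y in dev_b))
    return num / den
```

With this definition, a neighbor whose direction differs from the target by a constant rotation still yields a coefficient of 1, which is the desired behavior when comparing sites with slightly different exposure.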
c. Phase 3: Consistency in values
The purpose of phase 3 (Fig. 2) is twofold: 1) to unify the criteria to consistently define calm and true north values in the database, and 2) to identify unrealistic observations within each time series.
The original data sources did not use a common criterion for wind direction in calm (wind speed = 0) situations and also in true north conditions, when wind speed is different from zero. Therefore, wind direction has been herein set to match the criteria established in DeGaetano (1997): 0° for calm cases and 360° for true north cases.
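The unified convention can be written as a small rule. The sketch below is illustrative only: the function name is hypothetical, and it assumes that a noncalm record coded as 0° in a source dataset denotes true north.

```python
def unify_direction(speed, direction):
    """Apply the 0 deg (calm) / 360 deg (true north) convention to one record.
    Hypothetical helper; missing values (None) are passed through unchanged."""
    if speed is None or direction is None:
        return direction
    if speed == 0:
        return 0            # calm: direction set to 0 deg
    if direction in (0, 360):
        return 360          # noncalm wind from true north: 360 deg
    return direction
```

Only the two special values are touched, so any out-of-range direction is left as is for the later range check.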
Unrealistic measurements are those that fall outside of some defined recording range. The range can be derived from statistics calculated at different time scales, from extreme events based on historical records (e.g., Graybeal et al. 2004; Dunn et al. 2016) or from the limits given by the specifications of the sensor (e.g., Meek and Hatfield 1994). In our case the limits are intended to be consistent with the limited metadata information of the instruments used in the observational networks (Table 2). Wind direction records that fall beyond the 0°–360° range are unrealistic and removed (DeGaetano 1997). In a similar way, negative wind speed values are discarded. Unfortunately, the limited availability of metadata hinders the establishment of an upper wind speed limit for all the sites and throughout their whole lifetimes. For that reason we establish the threshold at 100 m s−1, which corresponds to the highest speed limit from our documented anemometers. This limit is well above the recorded windiest event in our area, registered at the Mount Washington observatory on 12 April 1934: a 5-min wind speed of 84 m s−1 with wind speed gusts peaking at 103.3 m s−1 (Krause and Flood 1997). It also allows for recording of the transit of tropical/extratropical cyclonic events (e.g., Hart and Evans 2001) and the high wind tornadic events that cross our region of interest (e.g., Etkin et al. 2001). See section 4c for more details.
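A range check of this kind reduces to a few comparisons. The sketch below assumes speeds in meters per second and directions in degrees after standardization; the constant and function names are hypothetical, and the limits are those described above.

```python
MAX_SPEED_MS = 100.0      # assumed upper wind speed limit, in m s-1
MIN_DIR_DEG = 0.0         # calm convention
MAX_DIR_DEG = 360.0       # true north convention


def is_unrealistic(speed, direction):
    """True if either value of a speed/direction pair is out of range."""
    bad_speed = speed < 0 or speed > MAX_SPEED_MS
    bad_direction = direction < MIN_DIR_DEG or direction > MAX_DIR_DEG
    return bad_speed or bad_direction
```

Miscoded missing values such as 99 999 for speed or 990° for direction are caught by the same test without any special casing.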
This section reports on the results of the first three phases of the QC process by showing the spatial and temporal distributions of each error type and illustrating each error type with some specific examples. A schematic description of each test is listed in Table 4. The number of affected records in each phase is presented in Table 5. The numbers in columns 2 and 3 correspond strictly to the data affected by either wind speed or wind direction, and the percentages are given with respect to the initial amount of data (53 956 328 records). The totals in column 4 refer to the affected wind speed and wind direction pairs (107 912 656 records), since the elimination of a speed or direction record implies herein the loss of the pair of both variables.
a. Phase 1: Compilation
The checks applied are aimed at detecting changes in the time sequence of data and repetitions of record entries for a given time step (section 3a). The compiled series did not show any cases of measurements out of chronological order. However, many repeated record entries for the same dates were detected, and they affected exclusively the NCAR dataset. According to the NCAR documentation page, these duplications may happen, for instance, when station METARs fall on the same time as SYNOP reports and are archived twice. Figure 3a shows the spatial distribution of the affected stations (116 out of 143), some of them with nearly 8000 repeated entries, totaling 258 321 records. However, only 2261 of these cases, belonging to 43 sites, involved various entries with different wind speed or direction values (Fig. 3a, color bar). In the cases with various entries containing the same observations, only one entry was kept. For the entries that presented different observations, the date was set as missing.
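The handling of repeated entries can be sketched as follows. This is an illustration rather than the original code: the record layout `(timestamp, speed, direction)` and the function name are assumptions, and "set as missing" is interpreted here as replacing both observations with `None`.

```python
def resolve_duplicate_entries(records):
    """Collapse repeated entries that share a timestamp.

    records: iterable of (timestamp, speed, direction) tuples.
    Identical repeats are reduced to one entry; conflicting repeats
    are kept as a single entry with missing (None) observations.
    """
    grouped = {}
    for timestamp, speed, direction in records:
        grouped.setdefault(timestamp, set()).add((speed, direction))

    cleaned = []
    for timestamp in sorted(grouped):        # also restores chronological order
        observations = grouped[timestamp]
        if len(observations) == 1:
            speed, direction = next(iter(observations))
            cleaned.append((timestamp, speed, direction))
        else:
            cleaned.append((timestamp, None, None))
    return cleaned
```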
Additionally, a unit conversion issue related to decoding/encoding the ds461.0 wind speed data was also amended. The data were erroneously stored as meters per second instead of knots. This error affected 41 of our sites, with a total of 1 003 991 patched records, shown in Fig. 3b with regular triangles. Figure 3c shows an example for a site located in Trenton (Ontario).
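The conversion factor involved in such a patch is exact, since one knot is one nautical mile (1852 m) per hour. The sketch below shows only the conversion itself; the helper names are hypothetical, and how the affected ds461.0 records were identified is not covered.

```python
KNOT_IN_MS = 1852.0 / 3600.0   # one knot expressed in m s-1 (about 0.5144)


def knots_to_ms(speed_knots):
    """Convert a wind speed from knots to meters per second."""
    return speed_knots * KNOT_IN_MS


def ms_to_knots(speed_ms):
    """Convert a wind speed from meters per second to knots."""
    return speed_ms / KNOT_IN_MS
```

A series whose values are roughly twice those of its neighbors, or half of them, is the typical signature of a mislabeled knots/meters-per-second conversion.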
Regarding displacements, the DFO moored buoys have been split constituting 40 independent buoy series of stable positions (section 3a): 23 series for the east coast of Canada (from the initial 10) and 17 for the Canadian Great Lakes (from the initial 12). In the case of NCAR stations, 30 out of the 143 sites showed relocations. Most of the relocations were by less than 3 km (Fig. 3d) and entailed wind speed mean shifts below 10% (gray line). This range is comparable to shifts calculated from randomly selected periods with no reported relocations (blue dots) that can be attributed to natural variability. The displacements showing larger ratios or those that took place over distances above 3 km were more thoroughly analyzed. For the four cases showing this condition, the last period after the change was removed. In total 84 062 records were erased. The affected sites are shown in Fig. 3b with inverted triangles. Figure 3e shows an example of a displacement of 1.38 km of station 71432 located in Port Weller (Ontario).
Regarding the standardization step, Fig. 1a shows the spatial distribution of the measurement units for wind speed. All records from EC (km h−1 and tens of degrees) and NCAR (knots) were standardized to meters per second and degrees (92.27% of the database; Table 5). The recording times of WNENA were uniformly set to UTC and thus this change affected all EC stations (86.4% of the data). This step was performed using the metadata file provided by EC, and the results were independently validated through a pairwise comparison with neighbor sites via MSE. An example of the comparison method is shown in Fig. 1d between a site located in Île Charron (Quebec) recording at LST and three neighbor stations that had been previously set to UTC with the same method. The lowest MSE values correspond to UTC-5 (ETZ), as expected. In most cases the stations of EC (Fig. 1b; see legend for symbols) followed their geographical time zone (Fig. 1b, shaded areas). Nevertheless, some notable exceptions were detected. For example, four stations reported at a different time reference than their expected time zone: one of them operated in EST while officially belonging to Labrador (ATZ); three other sites used central European standard time (CEST, at UTC + 1) while belonging to Quebec (ETZ). Additionally, three stations, all corresponding to Parc National du Fjord-du-Saguenay (Quebec) and located along the Saguenay River, were found changing their time zone throughout their period of operation (stars in Fig. 1b), from AST to EST, as in the example for a site located in Grand lac des Îles (Quebec) shown in Fig. 1e.
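The MSE-based validation of the time reference can be sketched as a search over candidate offsets. This is a simplified illustration, assuming gap-free hourly series held in plain lists and a single neighbor; the function name and the sign convention (offset = local time minus UTC, in hours) are assumptions.

```python
def mse_by_offset(target_local, neighbor_utc, offsets):
    """Mean squared error between a locally timed series and a UTC neighbor
    for each candidate offset (local - UTC, in hours).

    target_local[i] is the value at local hour i; neighbor_utc[j] is the
    value at UTC hour j. Local hour i corresponds to UTC hour i - offset.
    """
    result = {}
    for offset in offsets:
        squared = []
        for i, value in enumerate(target_local):
            j = i - offset
            if 0 <= j < len(neighbor_utc):
                squared.append((value - neighbor_utc[j]) ** 2)
        if squared:
            result[offset] = sum(squared) / len(squared)
    return result


def best_offset(target_local, neighbor_utc, offsets):
    """Candidate offset with the lowest MSE."""
    scores = mse_by_offset(target_local, neighbor_utc, offsets)
    return min(scores, key=scores.get)
```

For a station recording in eastern standard time, the minimum would be expected at an offset of −5 h, in line with the UTC-5 result of Fig. 1d. Running the search over consecutive subperiods would expose stations that changed their time reference during operation.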
b. Phase 2: Duplication errors
The search of inter- and intrasite duplicated chains has been undertaken with periods of 12 h and longer to allow for a minimum of 6-hourly data chains of at least three values. Around intrasite and almost intersite duplicated chains of various lengths have been handled. Figures 4a,b illustrate as an example the absolute frequency distribution of intrasite repetitions for wind direction (green) and of intersite repetitions for wind speed (red) with their length. Similar results are obtained for the corresponding wind speed and wind direction variables for intra- and intersite situations (not shown). The distribution of the percentages of repeated chains with date commonalities with respect to the total number of chains is presented in blue. Here, the dates are regarded as similar if there is some coincidence in the sequence of hour, day, or month. These percentages remain low for short chains but grow steeply for longer chains, eventually reaching 100%. While a low percentage of repetitions with similar dates is inevitable due to pure chance, an increasing percentage is indicative of the occurrence of suspicious duplications. For operational purposes, especially regarding intersite duplications, the threshold is set at 50% (shaded area). Nevertheless, and as it will be shown below, shorter chains are also evaluated when enough evidence is provided.
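The selection of suspicious chain lengths can be sketched as follows. The input layout, the function name, and the use of "greater than or equal to" at the 50% threshold are assumptions; the test for date similarity itself is taken as given.

```python
from collections import defaultdict


def flag_lengths(chains, threshold_pct=50.0):
    """Percentage of repeated chains with common dates, per chain length.

    chains: iterable of (length, has_common_date) pairs, one per repeated chain.
    Returns (percentages, flagged_lengths), where flagged_lengths are the
    lengths whose percentage reaches the threshold.
    """
    total = defaultdict(int)
    common = defaultdict(int)
    for length, has_common_date in chains:
        total[length] += 1
        if has_common_date:
            common[length] += 1

    percentages = {n: 100.0 * common[n] / total[n] for n in total}
    flagged = sorted(n for n, pct in percentages.items() if pct >= threshold_pct)
    return percentages, flagged
```

Short chains are numerous and mostly coincidental, so their percentage stays low; only the sparse, long chains at the tail of the distribution pass the threshold.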
A total of 16 (17) sites were affected by erroneous intrasite wind direction (wind speed) duplications, marked as regular (inverted) triangles in Fig. 4c, affecting nine buoys and eight EC land sites and involving a total of 5640 records (0.01% of the database). The buoys were affected by a systematic simultaneous intrasite-related failure (not shown) that duplicated approximately one day for both variables (15–16 November 2004; Fig. 4c, triangles with magenta edges). It was a general failure that also affected their backup channels (see section 2). This is the only event that led us to search for duplications shorter than the flagged minimum length, where three of the nine buoys were identified. This systematic failure accounts for half of the total number of stations affected by this type of issue. On the other hand, the longest erroneous detected intrasite duplication, belonging to a site in Fredericton (New Brunswick; Fig. 4e), expanded over a whole month simultaneously copying speed (Fig. 4e) and direction (not shown) data. Short duplicated chains were interspersed with data intervals with an apparently realistic behavior. The comparisons with neighboring stations allowed for identifying the correct data sequence in two of the erroneous intrasite cases.
A total of 1689 candidate intersite chains (976 wind direction, 713 wind speed) were flagged for later evaluation. These duplications correspond to only 9 (19) site pairs that involve 15 (28) sites for wind direction (speed) across the approximate 64 000 (38 000) site pairs that share any number of repeated chains. A comparison with neighbor stations and at different periods allowed us to identify the duplications caused by similar meteorological behaviors, which were spared. This is the case, among others, of eight sites located on Prince Edward Island (PEI) with a few sporadic duplications lasting around a day that occurred either simultaneously or with a difference of 1–2 h (not shown). PEI is a territory with a gently rolling landscape in which the highest point of land is located at only 152 m above sea level, which favors similar undisrupted wind flows all over the area. After the analysis, only duplications corresponding to four different sites were considered erroneous (Fig. 4d), one from NCAR and three from EC, totaling 138 864 (0.13%) records. The comparison with neighbors allowed us to identify the site that inherited the duplicated data in each case. The longest duplicated period corresponds to a site in Villeroy (Quebec, NCAR; Fig. 4f) that duplicates wind direction data of two other nearby sites (EC) for almost seven consecutive years. The differing institutional calm definition (see section 3c) resulted in the detection of fragmented chains instead of a continuous long chain. Duplications in speed were not detected, probably due to successive unit conversions of wind speed before our compilation process, presumably at the retrieval by NCAR for the ds464.0/ds461.0 set. These two sites, both in Lemieux (Quebec) and separated 500 m from each other, are located 17 km apart from Villeroy. 
This was the only cross-source duplication we detected, but it is nevertheless a reminder of the care that is needed when merging information from different sources (Dunn et al. 2016).
c. Phase 3: Consistency in values
The new direction criteria for calms consist of assigning 0° for the wind direction when the wind speed is 0 and the wind direction for the so-called true north is 360° for noncalm wind speeds. This decision affected most of the stations belonging to the datasets of DFO and NCAR. As we can see in Fig. 5a, the buoys did not show a settled convention for calms until the late 1990s (blue dots), when the direction was assigned to 0° (red dots). Most of the NCAR sites follow the 360° convention for direction (orange dots), except for a group of 13 U.S. sites with International Air Transport Association (IATA) identifiers, which ended before 1994. On the other hand, all NCAR stations followed the same criteria as EC regarding true north (not shown) and did not require modifications. Similar to the calms, none of the 40 DFO sites showed a consistent criterion through their whole lifetime for true north and had to be modified as well. In total, 539 510 wind direction records (1.02%) were changed (Table 5).
The removal of unrealistic wind speed records is hampered, as noted in section 3c, by the lack of extensive metadata on the use of different anemometer types and their variety of operational ranges (Table 2). This makes the establishment of a confident upper limit for wind speed elusive. As can be seen in Fig. 5b, some of these operational instrumental limits (dark blue vertical bars) fall within the tails of the wind speed distribution (blue bars). However, the wind speed records within this range bear physical realism. During the summer season, the extratropical cyclones of tropical origin (Landsea 2007) induce very high winds over the region. A comparison between the approximate wind speeds during the cyclonic events for 1954–2010 derived from the Canadian Hurricane Centre (CHC, gray bars) and those recorded by WNENA (red) is also shown in Fig. 5b. The information about the cyclonic events and their approximate wind speeds has been constructed from the storm-track images and the complementary information provided by the tropical cyclone season summaries (http://www.ec.gc.ca/ouragans-hurricanes/default.asp?lang=en&n=23B1454D-1). The midlatitude storms are an even larger contributor to extreme winds, much larger than hurricanes and with wind speeds that match or exceed hurricane intensity (Richards and Abuamer 2007). These storms usually occur during winter and are responsible for the majority of the extreme winds occurring in our area of interest, as shown in Fig. 5c. Data from the sites located in Mount Washington (New Hampshire), with a mean of ~15 m s−1, have been excluded from the plot due to their naturally occurring very high wind records at this elevation (1910 m). Some local phenomena may also cause winds of this strength, such as the closely related suetes (on the north side of Cape Breton Island) and the wreckhouse winds (on the southern side of Newfoundland), and also the westerlies along the Labrador coast.
Finally, the tornadoes that cross one of the two “tornado alleys” of Canada, extending east of the Great Lakes from southern Ontario through southwestern Quebec and western New Brunswick, are another contributor to high wind speeds. Most of the tornadoes have maximum wind speeds under 50 m s−1, and they occur between May and September, with a peak activity in June (Newark 1981). A limit of 100 m s−1 (gray shading in Fig. 5b) is adopted herein to single out unrealistic events. This is conservative enough to allow for winds corresponding to real extreme phenomena. A total of 288 wind speed records (<0.01%; Table 5) belonging to nine stations exceed this threshold. All the removed data belong to the NCAR dataset (Fig. 6a). Of these data, 96% correspond to a miscoded missing value (99 999 m s−1) that appears in six of the stations. An example is shown for a site located in Mount Washington (Fig. 6b).
Regarding wind direction, 181 unrealistic records (<0.01%, Table 5) corresponding to 14 stations were detected, all belonging to the NCAR dataset (Fig. 6a). An example is shown in Fig. 6c, corresponding to a site located at Toronto Pearson International Airport (Ontario; NCAR), with a record of 990° that probably corresponds to a miscoded missing value.
This section summarizes the extent of the issues related to data management. The impact of the modifications on the statistics of the sites (mean wind speed and direction, standard deviation, kurtosis, and skewness) will be presented in Part II for the whole quality control process. Many sites have suffered profound modifications during the compilation processes described herein. For example, 43 NCAR sites (out of 143) were affected by duplicated entries that were erased from the sites; 41 NCAR sites presented unit conversion issues that had to be corrected; and four sites showed relocation with significant changes in the behavior of the time series, which implied the removal of the shorter location in each instance. Figure 7a shows the most relevant issue at each of the affected 70 NCAR sites in terms of the largest amount of modified data. The whole DFO dataset suffered periodically from buoy relocations that involved the refurbishing of the original dataset into a more manageable one composed of static locations. Regarding EC, all the sites had to be transformed from their LST dates to UTC. Finally, the datasets as a whole had to be standardized in their measurement units and in their true north and calm criteria.
The duplication errors and consistency in values phases had, with a few exceptions, a lesser impact on the sites than the compilation, but as they address QC issues that are commonly treated in other works, they are presented separately in Fig. 7. Figures 7b,c show the type of errors with the largest implication in terms of deleted data at each site both for wind speed (Fig. 7b) and direction (Fig. 7c). Only 43 (8%) sites have been affected by one or more of the three analyzed error types, 29 in the case of wind speed, and 34 for wind direction. The most widespread error for wind speed (direction) records is related to intrasite data sequence duplications (in purple) and affected 17 (16) sites, half of them caused during a simultaneous failure that affected nine buoys (stars) over the course of a day. The unrealistic measurements (yellow), which affected exclusively the NCAR dataset (triangles) with nine sites regarding wind speed (14 for direction), involved few records, in many cases as a result of a miscoding of missing values. The intersite duplications (pink) affected only three sites regarding wind speed (four for wind direction), but they involved the total suppression of one NCAR site (site CWVY, Fig. 4f).
Figure 7d shows the total accumulated percentage of deleted data. From the 43 sites (out of 526), 24 were barely affected with less than 0.01% of erroneous records and 17 with percentages ranging from 0.1% to 1%. One NCAR site presented errors in more than 1% of its data, all of them related to unrealistic speeds; another one, the aforementioned site CWVY, had all its records removed. The impact of the tests on the wind speed distribution can be seen in Fig. 7e: the distribution before phases 2 and 3 is presented in red and after them in blue. As a result of the application of these first three phases, the maximum wind speed values have been restricted to 100 m s−1. Additionally, the wind speed distribution has undergone very minor changes, especially for speeds below 5 m s−1 and above 40 m s−1, although due to the logarithmic scale only the effects on the latter can be appreciated.
This text describes the first part of a QC procedure designed to identify and correct erroneous records of surface wind speed and direction observations. It covers the first phases of the compilation of a database and the subsequent QC tests focused on data management issues. This database, with records ranging from hourly to synoptic time resolution and a maximum temporal span of 58 years (1953–2010) for its longest sites, consists of 486 land sites and 40 buoys located over the area of northeastern North America (WNENA). It has been constructed from three datasets chosen for their availability and convenience. Therefore, the initial data may have been previously exposed to different levels of QC testing. The largest subset consists of 343 sites that stem from EC and that have been previously subjected to real-time and delayed mode QC procedures by the institution. A subset of 40 buoys has been obtained from DFO in raw format. Finally, 143 raw sites have been extracted from the ds461.0 and ds464.0 NCAR datasets. The NCAR and DFO datasets have implied a much bigger initial processing effort than EC as can be seen in section 2, and the tests related to the compilation can be seen in section 3a. The NCAR dataset presented duplicated entries with different data values that were suppressed (Table 5), records with erroneous unit conversion due to faulty data decoding that were corrected, and relocations with a noticeable effect on the wind speed behavior that resulted in the suppression of records. Regarding DFO buoys, the time series of the sites were constructed by piecing together data from the two measuring channels at the hull. The selected data at each moment belonged to the primary channel according to the metadata files. Off-position records were also suppressed. The measuring times of the east coast buoys, corresponding to wave samplings, were modified to match the meteorological recording times.
Finally, the periodical relocations of the hulls implied the split of the initial 22 buoys into 40 stable artificial sites. Regarding the standardization process, the records from the EC dataset, in LST, had to be translated to UTC via neighbor comparison and with the assistance of metadata files. Additionally, the three datasets were set to common wind speed and direction units, and the definitions of calm and true north were also unified.
It is worth noting that some potentially useful data sources were not included in the compilation phase as result of a lack of knowledge at the time. For instance, regarding the United States, additional data can be acquired via NOAA’s National Centers for Environmental Information (NCEI; https://www.ncei.noaa.gov). Data of moored buoys, on the other hand, can be additionally retrieved from EC’s ship-format reports, archived by the International Comprehensive Ocean–Atmosphere Data set (ICOADS; available online at http://icoads.noaa.gov). The data obtained from national climate archive organizations offer the advantage of having been subjected to some level of quality control in delayed mode and are more likely to be accompanied with metadata information. Regarding the Canadian stations, although the data are commonly shared in LST format, there is also the possibility of acquiring them nowadays in UTC format by request. These datasets should save some of the painstaking steps taken during the compilation, but they might pose new unknown challenges. Future developments of the WNENA will hopefully integrate these additional sources of information.
Phases 2 and 3, which are more general in nature than phase 1, had a lesser impact on the database, as only 0.13% of the records were removed. The approach we developed to detect artificial intra- and intersite duplications allowed us to identify sequences that ranged from years down to single days, much shorter than the intervals targeted in other similar works. Phase 2 entailed the majority of errors and the suppression of a complete site. A possible drawback of the method is that, depending on the region or size of the database, limited sensor precision and orographic similarities such as those we encountered in PEI could boost the number of naturally occurring short intersite duplications (~1-day length) to a very large number. For extremely large databases, a convenient additional strategy would be to focus the efforts directly on very long chains (1 week, 1 month…) and leave the shorter ones flagged as suspicious. Finally, the NCAR dataset showed some minor issues with measurements out of physical range, likely as a result of the management of missing values (e.g., 99 999 m s−1 for wind speed and 990° for wind direction).
The procedures described herein are focused on the establishment of a manageable, internally consistent and spatially well-characterized database composed of climatologically relevant sites. For instance, the tests devoted to the chronological sorting and the detection of duplicated dates ensure the temporal coherence that is indispensable in all the subsequent tests applied both in Part I and Part II and any data analysis in general. The data completeness criteria discriminate sites with climatological value. The procedures that identify internal site displacements ensure that the stored time series do not merge spurious information belonging to different locations. These procedures are subsequently supplemented with the tests devoted to the detection of erroneously duplicated data. Finally, the detection of unrealistic data removes clearly impossible records, in contrast to the flagging process carried out during the detection of improbable measurements in Part II. In general the issues dealt with during Part I have a comparatively lower impact on the number of affected data and wind statistics than those demonstrated in Part II. However, they are crucial for the phases dealing with measurement errors described in Part II.
EELE was supported by the Agreement of Cooperation 4164281 between the UCM and St. Francis Xavier University, and projects CGL2014-59644-R and PCIN-2014-017-C07-06 of the MINECO (Spain). Funding for 4164281 was provided by the Natural Sciences and Engineering Research Council of Canada (NSERC DG 140576948), the Canada Research Chairs Program (CRC 230687), and the Atlantic Innovation Fund (AIF-ACOA). HB holds a Canada Research Chair in Climate Dynamics. JN and JFGR were supported by projects PCIN-2014-017-C07-03, PCIN-2014-017-C07-06, CGL2011-29677-C02-01 and CGL2011-29677-C02-02 of the MINECO (Spain). This research has been conducted under the Joint Research Unit between UCM and CIEMAT, by the Collaboration Agreement 7158/2016. We wish to thank the people of Environment and Climate Change Canada, Department of Fisheries and Oceans Canada, and National Center for Atmospheric Research for providing us with the original data used in this study and for their kindness in responding to all the questions that arose during the development of this work and the review process. Special thanks to Gérard Morin and Hui Wan for the metadata from the EC sites; Bruce Bradshaw, Mathieu Ouellet and Bridget Thomas for information regarding moored buoys; and Douglas Schuster for information regarding the ds461.0 and ds464.0 datasets. We thank J. Álvarez-Solas, A. Hidalgo, and P. A. Jiménez for the helpful discussions. Finally, we would also like to thank the reviewers for the many suggestions and useful information they offered us.
Note: A first version of this database will be made available to the public. The QC procedures in this manuscript have been developed using Linux shell scripting and Fortran programming. Potential users interested in having the code are invited to contact the corresponding author.
This article has a companion article which can be found at http://journals.ametsoc.org/doi/abs/10.1175/JTECH-D-16-0205.1