A wind profiler network with a total of 65 profiling radar systems was operated by the China Meteorological Observation Center (MOC) of the China Meteorological Administration (CMA) until July 2015. In this study, a quality control procedure is constructed to incorporate the profiler data from the wind-profiling network into the local data assimilation and forecasting systems. The procedure applies a blacklisting check that removes stations with gross errors and an outlier check that rejects data with large deviations from the background. As opposed to the biweight method, which has been commonly implemented in outlier elimination for univariate observations, the outlier elimination method is developed based on the iterated reweighted minimum covariance determinant (IRMCD) for multivariate observations, such as wind profiler data. A quality control experiment is performed separately for subsets containing profiler data tagged with/without rain flags in parallel every 0000 and 1200 UTC from 20 June to 30 September 2015. The results show that with quality control, the frequency distributions of the differences between the observations and the model background meet the requirements of a Gaussian distribution for data assimilation. A further intensive assessment of each quality control step reveals that the stations rejected by the blacklisting had poor data quality and that the IRMCD rejects outliers in a robust and physically reasonable manner. Detailed comparisons between the IRMCD and the biweight method are performed, and the IRMCD is demonstrated to be more efficient and more comprehensive regarding the dataset used in this study.
Wind profiler radar detects the scattering of electromagnetic waves caused by atmospheric turbulence and is capable of measuring horizontal wind vectors with a high temporal resolution of approximately 5 min and a vertical resolution ranging from tens to hundreds of meters. The automated, continuous, and real-time vertical profiles might partially fill the gaps in the current upper-air observation system, specifically the inadequate spatial and temporal distributions of conventional observations. The use of wind profiler radar observation data (profiler data) in daily operational weather analysis has been demonstrated, particularly in the detection and analysis of weather structures, by issuing convective outlooks and watches (St-James and Laroche 2005; Benjamin et al. 2004). Furthermore, the added value of the profiler data has been demonstrated by a series of assimilation applications in many countries. The profiler data can have a positive impact on very short-range (3–6 h) forecasts over a central U.S. domain that includes most of the profiler sites and the region immediately downwind of the profiler observations (Benjamin et al. 2004). Severe-weather cases and a winter test period showed that the NOAA wind profiler network in the central United States can improve short-range (3–12 h) forecasts (Benjamin et al. 2004). Ishihara et al. (2006) found that the presence of a dense profiler network positively contributes to weather prediction in Japan. The use of profiler data at ECMWF improves wind forecasts (Bouttier 2001; Andersson and Garcia-Mendez 2002; Lopez and Bauer 2007).
In China, the China Meteorological Observation Center (MOC) of the China Meteorological Administration (CMA) operated 65 wind-profiling radar systems until July 2015. Of the 65 radar units, 50 were PBL profilers and 15 were troposphere profilers. Because most of the profiling radar systems were located in the Beijing–Tianjin–Hebei region, the Yangtze River delta, and the Pearl River delta, three small-area preoperational networks were constructed in these regions (Fig. 1). The composition of the profiler network was complex because the radar systems were of different types and made by various manufacturers, and the final profiler products (such as the hourly average profiler data) were distributed to end users after multilevel checks at the instrumental and signal-processing stages, including station-level control (filter smoothing, noise-level estimation, and multipeak processing) and provincial and national control (weather-type identification, uniform averaging, rationality check, and multitime smoothing processing) by the MOC/CMA.
Based on comparisons with radiosonde and NCEP Final Analyses (FNL), the hourly average horizontal profiler winds were generally of good quality and reliable (Dong and Wu 2014; Dong et al. 2014). The profiler data have also been applied in mesoscale weather analysis and severe-weather monitoring by the local weather service. However, in China, the use of these profiler observations in local NWP studies is still restricted. One important reason for this restriction is that the error characteristics of the profiler data in terms of assimilation are still unclear, and no quality control procedure tailored to data assimilation is available yet.
In variational data assimilation, quality control is one of the most important components for removing bad observations and ensuring the validity of the data assimilation. There are several sources of error in meteorological observations, including instrumental errors, representativeness errors, and gross errors caused by instrumental and telecommunication failures (Xu et al. 2013). Data assimilation systems, such as three-/four-dimensional variational systems, are highly sensitive to these erroneous observations and are inclined to converge toward outliers (Lorenc 1986). Observations that either are unreliable or contain observation errors that cannot meet the requirements of data assimilation must be identified and eliminated prior to assimilation (St-James and Laroche 2005; Lorenc 1986). Therefore, another purpose of quality control in assimilation is to ensure that the statistical errors of the measurements are consistent with the typical error features assumed in the assimilation approach.
In a small number of NWP studies, a biweight quality control procedure has been implemented to remove outliers that deviate greatly from the background fields, especially for univariate observations such as radiance, the entire column-integrated GPS precipitable water, and surface temperature (Li and Zou 2013; Xu et al. 2013). However, in applications to multivariate observations with multiple layers, such as profiler data, the biweight process consumes large amounts of time and resources because the procedure is performed repeatedly, layer by layer, for multiple variables (the u and υ wind components).
The minimum covariance determinant (MCD) estimator developed in 1984 (Rousseeuw 1984; Rousseeuw 1985) was one of the first affine-equivariant and highly robust multivariate outlier detection rules in applied robust statistics. Since the introduction of the computationally efficient fast MCD algorithm of Rousseeuw and Van Driessen (1999), MCD has been applied in numerous fields, such as medicine, finance, image analysis, and chemistry.
This paper describes a quality control procedure that can be implemented prior to the assimilation of the profiler data into NWP models and that includes blacklisting and background quality control, which are regularly used in other quality control research (St-James and Laroche 2005; Lee and Kawai 2011; Ishihara et al. 2006; Lambert et al. 2003). Specifically, an outlier detection rule based on iterated reweighted minimum covariance determinant (IRMCD; Cerioli 2010), an extension of MCD, is applied to identify the outliers of the observation-minus-background (OMB) values for the u and υ components simultaneously, and the method is demonstrated to be effective and efficient.
The remainder of this paper is organized into six sections. A description of the IRMCD method is presented first in section 2, and the variational and forecasting system and profiler observation datasets are briefly introduced in section 3. In section 4, the two-step quality control schemes are presented, including the establishment of the station blacklist and the elimination of outliers. Statistical characteristics and further inspection of the rejected data are discussed in section 5, together with the results of the quality control experiments. In section 6, a detailed comparison between IRMCD and the biweight check is presented, followed by the conclusions and a discussion.
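A sample dataset containing n vector observations with p dimensions can be represented as
$$\mathbf{Y} = (\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_n)^{\mathrm{T}}, \tag{1}$$
with $\mathbf{y}_i = (y_{i1}, y_{i2}, \ldots, y_{ip})^{\mathrm{T}}$ for the ith observation.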
Clearly, the mean vector $\boldsymbol{\mu}$ and the covariance matrix $\boldsymbol{\Sigma}$ can be estimated, but they can be contaminated if outliers exist in $\mathbf{Y}$. In applied robust statistics, given robust estimators of $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$, the outliers in $\mathbf{Y}$ can be detected if large distances from the robust fit are found by comparing the squared robust distance $d_i^2$ of each observation with $\chi^2_{p,1-\alpha}$, that is, the $1-\alpha$ quantile of the $\chi^2_p$ distribution, where $\alpha = 0.025$ is commonly accepted (Rousseeuw and Van Driessen 1999; Hubert and Debruyne 2010; Cerioli 2010). The IRMCD is a highly robust estimation method developed based on the reweighted version of the MCD estimator of Hawkins and Olive (1999) and Rousseeuw and Van Driessen (1999). As described by Cerioli (2010), the IRMCD rule for outlier detection for finite samples is executed as follows.
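- Step 1. For dataset $\mathbf{Y}$, the MCD subset is the subset of h (where $n/2 \le h < n$) observations whose covariance matrix has the minimal determinant. The value of h is that yielding the maximum possible breakdown point, which is $h = \lfloor (n+p+1)/2 \rfloor$, where $\lfloor \cdot \rfloor$ denotes the integer part (Cerioli 2010). Therefore, the average and the covariance matrix of the MCD subset are
$$\hat{\boldsymbol{\mu}}_{\mathrm{MCD}} = \frac{1}{h}\sum_{i \in \mathrm{MCD}} \mathbf{y}_i, \qquad \hat{\boldsymbol{\Sigma}}_{\mathrm{MCD}} = \frac{k_{\mathrm{MCD}}}{h-1}\sum_{i \in \mathrm{MCD}} (\mathbf{y}_i - \hat{\boldsymbol{\mu}}_{\mathrm{MCD}})(\mathbf{y}_i - \hat{\boldsymbol{\mu}}_{\mathrm{MCD}})^{\mathrm{T}}.$$
- Step 2. The squared robust distance of each observation from the MCD fit is computed and converted into a weight:
$$d^2_{i(\mathrm{MCD})} = (\mathbf{y}_i - \hat{\boldsymbol{\mu}}_{\mathrm{MCD}})^{\mathrm{T}}\, \hat{\boldsymbol{\Sigma}}_{\mathrm{MCD}}^{-1}\, (\mathbf{y}_i - \hat{\boldsymbol{\mu}}_{\mathrm{MCD}}), \qquad w_i = \begin{cases} 0, & d^2_{i(\mathrm{MCD})} > D_{p,n}, \\ 1, & \text{otherwise}, \end{cases}$$
where $D_{p,n}$ is the cutoff applied to the first-stage distances (Cerioli 2010).
- Step 3. The reweighted MCD (RMCD) estimators and the corresponding squared robust distances are computed as
$$\hat{\boldsymbol{\mu}}_{\mathrm{RMCD}} = \frac{1}{m}\sum_{i=1}^{n} w_i \mathbf{y}_i, \qquad \hat{\boldsymbol{\Sigma}}_{\mathrm{RMCD}} = \frac{k_{\mathrm{RMCD}}}{m-1}\sum_{i=1}^{n} w_i (\mathbf{y}_i - \hat{\boldsymbol{\mu}}_{\mathrm{RMCD}})(\mathbf{y}_i - \hat{\boldsymbol{\mu}}_{\mathrm{RMCD}})^{\mathrm{T}},$$
$$d^2_{i(\mathrm{RMCD})} = (\mathbf{y}_i - \hat{\boldsymbol{\mu}}_{\mathrm{RMCD}})^{\mathrm{T}}\, \hat{\boldsymbol{\Sigma}}_{\mathrm{RMCD}}^{-1}\, (\mathbf{y}_i - \hat{\boldsymbol{\mu}}_{\mathrm{RMCD}}),$$
where $m = \sum_{i=1}^{n} w_i$, and $k_{\mathrm{MCD}}$ and $k_{\mathrm{RMCD}}$ are the proportionality constants obtained by Cerioli (2010).
- Step 4. In Cerioli (2010), instead of the $\chi^2_p$ distribution, $d^2_{i(\mathrm{RMCD})}$ is tested using the following distribution (referred to as the reference distribution for convenience):
$$d^2_{i(\mathrm{RMCD})} \sim \begin{cases} \dfrac{(m-1)^2}{m}\,\mathrm{Beta}\!\left(\dfrac{p}{2}, \dfrac{m-p-1}{2}\right), & w_i = 1, \\[2ex] \dfrac{m+1}{m}\,\dfrac{(m-1)p}{m-p}\,F_{p,\,m-p}, & w_i = 0. \end{cases}$$
Let $q_{i,1-\gamma}$ denote the $1-\gamma$ quantile of the reference distribution for the ith observation, with $\gamma = 1 - (1-\alpha)^{1/n}$. If
$$d^2_{i(\mathrm{RMCD})} \le q_{i,1-\gamma} \quad \text{for all } i = 1, 2, \ldots, n, \tag{10}$$
then the sample dataset does not have outliers.
- Step 5. Otherwise, an “iterated step” is applied to detect the outliers by testing each observation using $q_{i,1-\alpha}$ as a cutoff with a nominal size of $\alpha$. If $d^2_{i(\mathrm{RMCD})} > q_{i,1-\alpha}$, then the observation vector $\mathbf{y}_i$ is finally declared as an outlier.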
Following the steps above, with the preset nominal size $\alpha$, the outliers of the multivariate dataset $\mathbf{Y}$ can be detected. In our study, each $\mathbf{Y}$ can be treated as the observation data sample at a specific profiler station. Additional implementation details are presented in section 4b.
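The idea behind the MCD fit and the distance test can be illustrated with a minimal pure-Python sketch for the two-dimensional case (p = 2). It is not the IRMCD rule itself and not the implementation used in this study. It assumes an approximate MCD search by concentration steps from random three-point starts (in the spirit of the fast MCD algorithm), a Gaussian consistency factor written in closed form for p = 2, no reweighting step, and the plain $\chi^2_2$ cutoff $-2\ln\alpha$ in place of the reference distribution of Cerioli (2010). All function names are hypothetical.

```python
import math
import random

def det2(cov):
    """Determinant of a symmetric 2x2 matrix stored as (sxx, sxy, syy)."""
    sxx, sxy, syy = cov
    return sxx * syy - sxy * sxy

def mean_cov(points):
    """Sample mean and covariance (sxx, sxy, syy) of a list of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def sq_dist(point, mean, cov):
    """Squared Mahalanobis distance using the closed-form 2x2 inverse."""
    sxx, sxy, syy = cov
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    return (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det2(cov)

def concentrate(points, subset, h, max_steps=50):
    """Repeatedly refit on the h points closest to the current fit."""
    for _ in range(max_steps):
        mean, cov = mean_cov(subset)
        if det2(cov) <= 1e-12:
            return None  # degenerate (collinear) subset, discard this start
        closest = sorted(points, key=lambda q: sq_dist(q, mean, cov))[:h]
        if sorted(closest) == sorted(subset):
            return mean, cov  # subset no longer changes
        subset = closest
    mean, cov = mean_cov(subset)
    return (mean, cov) if det2(cov) > 1e-12 else None

def mcd_outliers_2d(points, alpha=0.025, n_starts=100, seed=0):
    """Flag points whose robust distance exceeds the chi-square(2) cutoff."""
    rng = random.Random(seed)
    n, p = len(points), 2
    h = (n + p + 1) // 2  # subset size with maximum breakdown point
    fits = [concentrate(points, rng.sample(points, p + 1), h)
            for _ in range(n_starts)]
    fits = [f for f in fits if f is not None]
    mean, cov = min(fits, key=lambda f: det2(f[1]))  # smallest determinant
    # Gaussian consistency factor (h/n) / P(chi2_4 <= chi2_2 quantile at h/n),
    # written with the closed-form chi-square CDFs available for p = 2.
    q = -2.0 * math.log(1.0 - h / n)
    factor = (h / n) / (1.0 - math.exp(-q / 2.0) * (1.0 + q / 2.0))
    cov = tuple(factor * s for s in cov)
    cutoff = -2.0 * math.log(alpha)  # (1 - alpha) quantile of chi2 with 2 dof
    return [sq_dist(pt, mean, cov) > cutoff for pt in points]
```

Applied to a list of (u, υ) OMB pairs, `mcd_outliers_2d` returns one Boolean flag per pair. Because the sketch omits the small-sample corrections, the reweighting, and the test of the no-outlier hypothesis, it tends to flag somewhat more clean points than the nominal size suggests, which is the behavior the IRMCD rule is designed to control.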
3. Data and model system
a. Variational assimilation and forecasting system
An updated version of the Beijing Rapid Updated Cycling (BJRUC) system (Chen et al. 2009) was developed in 2011. The forecasting and data assimilation components of the preoperational BJRUC system are based on WRF (Skamarock et al. 2008), and WRF-VAR data assimilation and forecasting run independently in the two forecasted domains with resolutions of 9 and 3 km, respectively. Each day, Domain1, with a 9-km resolution (Fig. 1), performs two forecasts of 72 h at 0000 and 1200 UTC, starting from the GFS analysis for the initial conditions. The forecast results are supplied to users for short-range forecast references. The model physics configuration is as follows: WSM6 microphysics parameterization (Hong and Lim 2006), Kain–Fritsch cumulus parameterization scheme (Kain 2004) for 9-km domains and no cumulus for the 3-km domain, Yonsei University (YSU) PBL parameterization (Hong et al. 2006), RRTM longwave radiation scheme (Mlawer et al. 1997), and Dudhia shortwave radiation scheme (Dudhia 1989). The data sources used by the WRF Model data assimilation system (WRFDA) for Domain1 include conventional and intensive surface and upper-air radiosonde reports, aviation routine weather reports, and ship and buoy reports obtained via the Global Telecommunication System (GTS). The background error statistics (BES) files for each nested domain were derived independently based on the National Meteorological Center (NMC) method.
In this study, the daily archived analysis of Domain1 is used as the model background, against which quality control experiments are performed on the profiler data.
The quality control experiment is performed every 0000 and 1200 UTC—that is, the operational initial time of BJRUC Domain1—during the period from 20 June to 30 September 2015. The total sample size is 178 290 and contains all profiler records observed by the 65 wind profiler radar systems during this period. For the sake of simplicity, we ignored diurnal variation, and the results from both 0000 and 1200 UTC are merged together for further statistical analysis and comparison.
With precipitation, the wind-profiling radar systems might track the falling hydrometeors rather than the air, which could lead to large errors in the derived wind vectors. For this reason, the profiler data should be treated more carefully if “contaminated” by precipitation. Therefore, each hourly average profiler observation record is tagged “with rain” or “without rain” after multilevel signal-processing quality control by MOC/CMA. Accordingly, under most circumstances in this study, the profiler dataset is divided into two individual subsets, with rain (R) and without rain (NR), containing 77 959 and 100 331 samples, respectively, and the two subsets are subjected to the same comparison and statistical processes.
4. Quality control routine
Because of our “zero tolerance” attitude toward erroneous data in the assimilation application, a two-step quality control procedure (QC1 and QC2) composed of blacklisting and outlier elimination is designed to discard as much of the suspect profiler data as possible.
a. QC1: Blacklisting
Considering the complex composition of the wind profiler network, the first step in quality control is removal of suspect profiling stations whose data quality has been demonstrated to be generally unreliable. A profiling station is removed using a simple technique in which the correlation coefficient (cc) between the observations and background is compared with a predefined cc threshold, that is, when the cc of a station is smaller than a critical value, the station is considered problematic and is therefore moved to the blacklist.
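For the kth station ($k = 1, 2, \ldots, K$, where K is the number of stations) in both subsets R and NR, the correlation coefficient between the observed u/υ component and its counterpart derived from the background is calculated as follows:
$$cc_k = \frac{\sum_{i} \left(y^{o}_{i} - \overline{y^{o}}\right)\left[H_i(\mathbf{x}^{b}) - \overline{H(\mathbf{x}^{b})}\right]}{\sqrt{\sum_{i} \left(y^{o}_{i} - \overline{y^{o}}\right)^{2} \sum_{i} \left[H_i(\mathbf{x}^{b}) - \overline{H(\mathbf{x}^{b})}\right]^{2}}},$$
where i indicates every available profiler record at the station, $y^{o}_{i}$ is the observed u/υ component, $H_i$ is the observation operator corresponding to the ith observation location, $\mathbf{x}^{b}$ is the u/υ wind component field derived from the background, and $\overline{y^{o}}$ and $\overline{H(\mathbf{x}^{b})}$ are the averages of the observations and corresponding background-derived values, respectively. The correlation coefficients of the u/υ component for every profiling station are computed between all profiler observations and the archived analysis of BJRUC Domain1 available for the period from 20 June to 30 September 2015.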
Figure 2 shows the vertical profiles of the root-mean-square error (RMSE) of the u/υ component in response to different critical thresholds of the correlation coefficients from all stations for both subsets. For subset NR, when the correlation coefficient threshold increases to 0.6, the RMSE of the u/υ winds of the corresponding samples below a height of 6 km is approximately 2.2 m s−1, which is close to a reasonable observation error for wind profilers (St-James and Laroche 2005). Accordingly, 0.6 is adopted as the correlation coefficient threshold above which the profiler records without a rain flag at a station are considered generally reliable. Similarly, the correlation coefficient threshold for stations tagged with a rain flag is set to 0.4.
In performing QC1, for each station tagged with “with rain” or “without rain,” if the correlation coefficient between the wind profile observation and the background is below its corresponding predefined threshold (0.6 for without rain and 0.4 for with rain), the station is added to the blacklist. All observation records from this blacklisted station are excluded from the subsequent quality control and assimilation procedures.
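As an illustration of the QC1 logic, the following minimal sketch computes the correlation coefficient between observed and background-derived values for each station and returns the stations that fall below the subset threshold. The thresholds (0.6 for NR, 0.4 for R) are those stated above; the function names, the data layout, the treatment of a single wind component at a time, and the choice to assign a correlation of 0 to a station with constant values are assumptions made for this example only.

```python
import math

# Thresholds from the text: 0.6 without rain (NR), 0.4 with rain (R).
CC_THRESHOLD = {"NR": 0.6, "R": 0.4}

def correlation(obs, bkg):
    """Pearson correlation between observations and background values."""
    n = len(obs)
    mean_o = sum(obs) / n
    mean_b = sum(bkg) / n
    cov = sum((o - mean_o) * (b - mean_b) for o, b in zip(obs, bkg))
    var_o = sum((o - mean_o) ** 2 for o in obs)
    var_b = sum((b - mean_b) ** 2 for b in bkg)
    if var_o == 0.0 or var_b == 0.0:
        # Assumption: a constant series (e.g., all 0 m/s) is treated as
        # uncorrelated with the background.
        return 0.0
    return cov / math.sqrt(var_o * var_b)

def blacklist(stations, subset):
    """Return the sorted ids of stations whose cc is below the threshold.

    stations: dict mapping station id -> (observed values, background values)
              for one wind component (hypothetical layout).
    subset:   "NR" or "R".
    """
    threshold = CC_THRESHOLD[subset]
    return sorted(sid for sid, (obs, bkg) in stations.items()
                  if correlation(obs, bkg) < threshold)
```

For example, a station reporting only 0 m s−1 values is blacklisted in both subsets, whereas a station with a correlation coefficient of 0.5 is blacklisted in NR but retained in R.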
b. QC2: Outlier rejection method based on IRMCD
Based on IRMCD, QC2 is implemented to remove the profiler data points that are detected as outliers for every profiling station that has passed QC1. The procedure executes the steps described in section 2. For dataset $\mathbf{Y}_k$ with n observations at the kth station, each observation vector $\mathbf{y}_i$ in Eq. (1) can be indicated by
$$\mathbf{y}_i = (\delta u_i, \delta\upsilon_i)^{\mathrm{T}}, \quad i = 1, 2, \ldots, n,$$
where $\delta u_i$ and $\delta\upsilon_i$ are the ith u and υ deviations, respectively, of the observation relative to the model background (OMB). The IRMCD procedure is computed by the function cerioli2010.irmcd.test of the R package CerioliOutlierDetection (Green 2016). A fixed nominal size $\alpha = 0.025$, a value used in many studies (Hubert et al. 2008; Willems et al. 2009), indicates that we expect 2.5% of the samples in the dataset to be identified as outliers. Accordingly, $\gamma$ is calculated in step 4, and, for example, $\gamma \approx 2.5 \times 10^{-5}$ with $n = 1000$ indicates that the probability of falsely detecting outliers from a clean dataset containing 1000 samples is 0.0025%. Let $1-\gamma$ be the quantile level of the reference distribution in the function cerioli2010.irmcd.test to test condition (10). If no squared robust distance exceeds this quantile, then the dataset is deemed uncontaminated by outliers. Otherwise, let $\alpha$ act as the significance level in the function cerioli2010.irmcd.test to detect the outliers in each observation in $\mathbf{Y}_k$.
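The relation between the nominal size and the size of each individual test can be checked with a short calculation. The sketch below assumes that the per-observation size in step 4 takes the form $\gamma = 1 - (1-\alpha)^{1/n}$, as in Cerioli (2010); the function name is hypothetical.

```python
def per_observation_size(alpha, n):
    """Size of each individual test such that n independent tests
    have an overall size of alpha."""
    return 1.0 - (1.0 - alpha) ** (1.0 / n)

# Nominal size used in this study and an example sample size of 1000.
gamma = per_observation_size(0.025, 1000)
print(f"gamma = {gamma:.3e}")
```

With $\alpha = 0.025$ and $n = 1000$, the result is about $2.5 \times 10^{-5}$, that is, roughly 0.0025%, consistent with the value quoted above.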
c. Summary of quality control implementation steps
In a real-time or operational environment, the profiler stations are divided into R and NR datasets, depending on the rain flags at the observation time. The QC steps are performed sequentially station by station for each group as follows.
QC1: If the correlation coefficient between the profiler observations and the background exceeds the threshold, then the station passes QC1; if the threshold is not exceeded, then all observations from this station are rejected and the station appears on the blacklist.
QC2: For a station that has passed QC1, IRMCD is used to reject outliers.
In general, we expect that blacklisting reflects the updated status of every profiling station. Moreover, to ensure the validity and statistical significance of the IRMCD, the sample size should not be too small. Therefore, the abovementioned quality control process applied to the profiler data is a dynamic, time-sliding procedure for both blacklisting and outlier elimination. For every profiler station, the sample dataset always contains the latest one month of profiler observations, obtained by adding the newly arrived data and removing the oldest data. QC1 rejects the stations that deviate significantly from the background, and QC2 rejects the outliers if the station has passed QC1. After both procedures have been applied, the profiler data are assumed to be of good quality for further data assimilation.
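The time-sliding sample described above can be sketched as a per-station buffer that appends new records and drops those older than the window. The sketch assumes a window of 30 days as a stand-in for "one month" and records that arrive in time order; the class and attribute names are hypothetical.

```python
from collections import deque
from datetime import datetime, timedelta

class StationWindow:
    """Sliding sample of OMB records for one profiler station."""

    def __init__(self, days=30):
        # Assumption: "one month" is approximated by a fixed 30-day span.
        self.span = timedelta(days=days)
        self.records = deque()

    def add(self, time, omb_u, omb_v):
        """Append a new record and discard records outside the window."""
        self.records.append((time, omb_u, omb_v))
        while self.records and self.records[0][0] <= time - self.span:
            self.records.popleft()

    def sample(self):
        """Return the (omb_u, omb_v) pairs currently in the window."""
        return [(u, v) for _, u, v in self.records]
```

Each time a new analysis becomes available, the station's record is added and the resulting sample is passed to QC1 and QC2, so that both the blacklist and the outlier test reflect only recent data.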
5. Quality control results
In this section, the results of the two-step quality control procedure for the dataset described in section 3 are evaluated and compared in additional detail. The difference between the profiler observation and model analysis background can be calculated by
$$\mathrm{OMB}_i = y^{o}_{i} - H_i(\mathbf{x}^{b}),$$
where $i = 1, 2, \ldots, n$; n is the total number of profiler observations; $y^{o}_{i}$ is the horizontal u/υ wind component (which has been converted from the ith profiler’s observed wind direction and speed); $H_i$ is the observation operator corresponding to the ith observation location; and $\mathbf{x}^{b}$ comprises the u/υ wind components from the BJRUC Domain1 (9 km) analysis at the corresponding time. For convenience, the difference between the profiler observation and model background is indicated by the OMB difference, and the statistical features for the u/υ components are examined to evaluate the performance of the quality control scheme.
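The conversion from observed wind speed and direction to u/υ components and the subsequent OMB calculation can be sketched as follows. The sketch assumes the meteorological convention for wind direction (the direction from which the wind blows, in degrees clockwise from north) and background components that have already been interpolated to the observation location by the observation operator; the function names are hypothetical.

```python
import math

def wind_to_uv(speed, direction_deg):
    """Convert wind speed and meteorological direction to (u, v)."""
    rad = math.radians(direction_deg)
    u = -speed * math.sin(rad)  # eastward component
    v = -speed * math.cos(rad)  # northward component
    return u, v

def omb(obs_speed, obs_direction_deg, bkg_u, bkg_v):
    """Observation-minus-background differences for the u and v components.

    bkg_u, bkg_v: background components at the observation location
                  (assumed to be already interpolated).
    """
    u, v = wind_to_uv(obs_speed, obs_direction_deg)
    return u - bkg_u, v - bkg_v
```

For instance, a 10 m s−1 westerly wind (direction 270°) corresponds to u = 10 m s−1 and υ = 0 m s−1, so a background of (8, 1) m s−1 gives OMB differences of about (2, −1) m s−1.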
a. Statistical features
For subset NR, histograms of the OMB differences before the QC (RAW), after QC1, and after QC2 are illustrated in Fig. 3.
For the u/υ components, the frequency density distributions of RAW are similar to, but not strictly, Gaussian. The large central density peak and the additional data located on the left tail indicate the existence of outliers. More precisely, large discrepancies exist at both ends of the quantile–quantile (Q–Q) scatterplots. After the two QC steps, the standard deviation gradually decreases and, particularly after QC2, the Q–Q scatterplot nearly collapses onto a straight line, indicating that most of the outliers have been eliminated and that the frequency distribution of the OMB differences is much closer to the standard normal distribution.
To quantitatively describe the features of the statistical distribution of OMB, the skewness and excess kurtosis coefficients are calculated as
$$\mathrm{skewness} = \frac{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^{3}}{\left[\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^{2}\right]^{3/2}}, \qquad \mathrm{kurtosis} = \frac{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^{4}}{\left[\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^{2}\right]^{2}} - 3,$$
where n is the sample size, $x_i$ is the ith OMB difference value, and $\bar{x}$ is the sample mean. We note that the skewness and excess kurtosis of every Gaussian distribution are equal to 0.0. As shown in Table 1, the excess kurtosis values of RAW for the u/υ winds for the NR subset are 10.16/6.79, and the final values are 0.20/0.22, indicating that the significant outlier problem has been sufficiently resolved by quality control. The skewness of RAW also indicates that the distributions are skewed (−1.13/−0.61). After the QC, the values are reduced to −0.17/−0.002, and the distributions are more symmetric.
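The two coefficients can be computed with a few lines of code. The sketch below assumes moment-based definitions normalized by n (the exact normalization used for Table 1 is not stated here); the function name is hypothetical.

```python
def skewness_and_excess_kurtosis(values):
    """Moment-based skewness and excess kurtosis of a sample."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return skewness, excess_kurtosis
```

A symmetric sample gives a skewness of 0, whereas adding a single large negative OMB value produces a negative skewness and a positive excess kurtosis, which is the signature of the RAW distributions described above.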
Subset R has a smaller sample size and a larger OMB standard deviation than NR, indicating that its data quality is less reliable than that of NR. However, the quality control procedure works well and yields results similar to those of NR, as shown in Fig. 4. It should also be noted that for both subsets, a final bias correction is still necessary due to the nonzero skewness, especially for the u component.
b. Rejected data
After the QC1 step, 11 and 9 profiling stations are rejected for the NR and R subsets, respectively, and those with the worst cc values can be easily identified from the box-and-whisker plots (Fig. 5). Further inspection of the observations reveals that the rejected profiling stations do have poor data quality, which could be ascribed to various factors. For example, a large number of 0 m s−1 observations are present throughout the troposphere (Fig. 6a), which could be due to instrumental or communication failures. In addition, for unknown reasons, at selected profiling stations it is difficult to establish a reasonable correlation with the model background (Figs. 6b–d). As stated above, the blacklist is always dynamically updated, and the information about the rejected stations is fed back to MOC/CMA for further confirmation.
The QC2 stage identifies approximately 5.57% and 8.52% of the total number of profiler data from the NR and R subsets, respectively, that passed QC1 as outliers. This rejection rate is quite reasonable. Taking advantage of IRMCD as a multivariate outlier detection rule, the outliers are detected in the multidimensional space simultaneously. In this study, the multidimensional space is the two-dimensional plane consisting of $\delta u$ and $\delta\upsilon$, the u/υ OMB components. To obtain a general estimate of the reliability of the IRMCD outlier detection, scatterplots of $(\delta u, \delta\upsilon)$ with respect to the observation height for selected profiling stations are displayed in Fig. 7. The outliers are detected as points far from the cores of the circle/ellipse-shaped “clouds” of nonoutliers. As an example of a typical tropospheric profiling radar, station 54406 is able to detect horizontal winds up to a height of 13 km. As depicted in Figs. 7a,c, the outliers at this station mostly show large discrepancies from the majority. Additionally, a large number of the outliers (73% of the outliers in sample NR) are located in the upper air above a height of 8 km, which reflects the fact that the upper-air profiler observations that usually display large RMSEs for heights greater than 8 km (shown in Fig. 2) are also prone to rejection as outliers. For the enhanced planetary boundary layer profiling radar (for example, station 58737), most outliers (85% of the outliers in sample NR) are located at heights lower than 4 km. Moreover, certain observations with a small $\delta u$ or $\delta\upsilon$ are also declared as outliers. Because IRMCD is a method for handling multivariate problems, the observation is declared as an outlier for the u/υ winds even if it is an outlier in only one dimension of the vector.
As an example, the 10-day time series of the observed and background wind profiler data and their corresponding scatterplots for the OMB u/υ components are shown in Fig. 8. The profiler data from station 54406 are generally reliable because most of the observations are comparable to the model background (Figs. 8a,b,d,e), but the identified outliers shown in red had large wind speeds or differences in direction from their counterparts in the model background. According to the OMB u/υ scatterplots (Figs. 8c,f), most of the outliers are located above 8000 m, with large OMB u/υ values that exceed 10 m s−1. These points are isolated from the majority of the nonoutlier samples. These results indicate that the IRMCD can reject outliers in a robust and physically reasonable manner.
6. Comparison between IRMCD and biweight method
a. Biweight method
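For a univariate dataset containing n observations $X = (x_1, x_2, \ldots, x_n)$, the biweight check method can be used to remove the data that deviate from the biweight mean by more than several times the biweight standard deviation (Lanzante 1996). The procedure can be described as follows:
- Step 1. For every $x_i$, its weight coefficient $w_i$ via the median M and median absolute deviation (MAD) is calculated as
$$w_i = \frac{x_i - M}{c \times \mathrm{MAD}},$$
where c is a censoring parameter (7.5 in Lanzante 1996), and $w_i$ is set to 1 if $|w_i| > 1$.
- Step 2. The biweight mean $\bar{X}_{bw}$ and the biweight standard deviation (BSD) are calculated as
$$\bar{X}_{bw} = M + \frac{\sum_{i=1}^{n} (x_i - M)(1 - w_i^{2})^{2}}{\sum_{i=1}^{n} (1 - w_i^{2})^{2}}, \qquad \mathrm{BSD} = \frac{\left[n \sum_{i=1}^{n} (x_i - M)^{2} (1 - w_i^{2})^{4}\right]^{1/2}}{\left|\sum_{i=1}^{n} (1 - w_i^{2})(1 - 5 w_i^{2})\right|}.$$
- Step 3. For every $x_i$, its Z score $Z_i = (x_i - \bar{X}_{bw})/\mathrm{BSD}$ can be calculated and compared with a predefined critical value $Z_c$ [which is set to 4 in Zou and Zeng (2006)]; if $|Z_i| > Z_c$, then the observation is treated as an outlier.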
Following the steps above, for wind profiler observations, the biweight check is applied to the u/υ components from each specific profiler station, and an observation is discarded if its u or υ component is declared to be an outlier.
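The biweight check for one wind component can be sketched as follows. The sketch assumes the biweight mean and biweight standard deviation in the form given by Lanzante (1996), with a censoring parameter c = 7.5, and uses the critical value of 4 mentioned above as the default; the function name is hypothetical.

```python
import math
import statistics

def biweight_outliers(values, c=7.5, z_crit=4.0):
    """Flag values whose biweight Z score exceeds z_crit in magnitude."""
    n = len(values)
    med = statistics.median(values)
    mad = statistics.median([abs(x - med) for x in values])
    if mad == 0.0:
        raise ValueError("MAD is zero; biweight weights are undefined")
    # Weights; values far from the median are censored (|w| capped at 1),
    # which gives them zero influence on the estimates below.
    w = [max(-1.0, min(1.0, (x - med) / (c * mad))) for x in values]
    num = sum((x - med) * (1.0 - wi ** 2) ** 2 for x, wi in zip(values, w))
    den = sum((1.0 - wi ** 2) ** 2 for wi in w)
    bw_mean = med + num / den
    bsd_num = math.sqrt(
        n * sum((x - med) ** 2 * (1.0 - wi ** 2) ** 4
                for x, wi in zip(values, w)))
    bsd_den = abs(sum((1.0 - wi ** 2) * (1.0 - 5.0 * wi ** 2) for wi in w))
    bsd = bsd_num / bsd_den
    return [abs((x - bw_mean) / bsd) > z_crit for x in values]
```

For wind profiler data, the function would be applied to the u and υ OMB values of a station separately, and an observation would be discarded if either of its two flags is set.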
b. Comparison of outlier identification methods and results
In general, IRMCD and the biweight check perform essentially equivalent tasks: given a robust mean and standard deviation, the outliers in vector dataset Y are identified by their large distances from the robust fit. However, the two methods differ in three aspects. 1) In the biweight check, Y is required to be a univariate dataset. When applied to multivariate observations such as wind data, the outlier checking needs to be performed for the u/υ components separately. On the other hand, as a multivariate outlier detection method, IRMCD can be applied directly to a multivariate dataset Y; that is, the outliers of the u/υ components can be detected simultaneously for wind profiler observations, which makes IRMCD more efficient. 2) Their robust means and standard deviations are calculated in different ways, as are their identification rules. In IRMCD, the outliers are identified by comparing the squared robust distance to a reference distribution whose shape parameters vary with the dataset to which IRMCD is applied. In the biweight check, a predetermined multiple of the biweight standard deviation is set as the cutoff for identification.
To demonstrate the impacts of these differences on the efficiency and performance in outlier detection, QC2 stages based on IRMCD and the biweight check are performed for every profiling station that has passed QC1. For a quantitative comparison of the two methods, a series of outlier detection experiments are performed with all cutoff parameters within certain intervals (e.g., from 0.0001 to 0.1 by 0.0001 for IRMCD and from 4 to 1.5 by −0.1 for biweight) to obtain the best outlier detection results in terms of the combination of the best skewness and excess kurtosis.
From the side-by-side comparison of the box plots shown in Fig. 9, it is clear that the two methods have reasonable and comparable results in outlier detection, but the full range and the range occupied by the box parts of the best skewness and excess kurtosis for IRMCD converge more closely toward 0. Given its somewhat larger rejection rates (4.11% for the NR subset and 6.74% for the R subset), IRMCD may yield a distribution closer to Gaussian at the cost of a few more rejected observations.
As an example, the colored clusters shown in Fig. 10 correspond to whether the points are outliers in terms of the two-dimensional OMB of the u/υ components for all the stations. For subset NR, most of the outlying points, which clearly deviate from the core observations, are identified as outliers by both methods, corresponding to their common rejection rate of 3.29%. However, apart from those overlapping outliers, each of the methods can detect extra outliers that cannot be identified by the other method. For example, the outliers detected only by IRMCD are mainly spread across the four quadrants where both u/υ components have large OMB values, while those detected only by the biweight method are concentrated close to the axes, where one of the u/υ OMB components is close to 0. One way to explain this behavioral difference is that in IRMCD, the squared robust distances are measured and compared with the reference distribution in a multidimensional space, so the method tends to identify those points with larger deviations from the majority of nonoutliers as outliers in all dimensions simultaneously rather than in the projection of only one dimension, as the biweight method does. From this point of view, IRMCD is superior to the biweight method.
The third difference is that IRMCD has a mechanism to prevent false positives. In IRMCD, the fourth step in the test [Eq. (10)] is specifically designed to prevent false positives in any good dataset (Cerioli 2010) because false positives are a clear shortcoming of the conventional MCD rules. Without step 4, IRMCD is equivalent to the normal finite sample reweighted MCD, and the direct execution of the fifth step leads to false discoveries of outliers in a perfect dataset. Both the conventional MCD and the biweight check suffer from this deficiency: even for a perfect dataset, some outliers are falsely detected, because the robust distances or the BSD can always be calculated and compared with the cutoff.
To demonstrate the ability of “antifalse outlier detection” of IRMCD, ideal outlier detection experiments are performed again for a “perfect” dataset comprising the intersecting parts of the nonoutliers from station 54406 illustrated in Fig. 10a. The dataset can be regarded as “uncontaminated,” as all possible outliers have already been eliminated by the two methods, and theoretically no more outliers should be detected by any method. Some extra outliers can be falsely detected by the biweight method, as shown in Fig. 11a. Additionally, without the fourth step of IRMCD, several points are identified as potential outliers (Fig. 11b). With the antifalse outlier detection step (step 4), which tests the squared robust distances of those potential outliers against the $1-\gamma$ quantile of the reference distribution as the cutoff, where $\gamma = 1 - (1-\alpha)^{1/n}$, IRMCD declares the dataset to be free of outliers (Fig. 11c). Nevertheless, the real observation data are far from perfect. Under most circumstances, we do not need to consider the possibility of the null hypothesis of no outliers suggested by IRMCD, but strictly speaking this point is also an essential difference between IRMCD and the other outlier detection methods.
7. Summary and discussion
In China, the MOC/CMA operated a wind profiler network with a total of 65 profiling radar systems until July 2015. Although its hourly average horizontal profiler wind products are generally of good quality and reliable, their use in meteorological NWP studies in China is still highly restricted because the error characteristics of the profiler data in terms of assimilation are still unclear, and no quality control procedure for wind profiler data tailored to meet the requirements of future assimilation is available.
To incorporate the profiler data from the operational profiling network into the local data assimilation and forecasting system (BJRUC), a two-step quality control procedure is proposed in this study. Specifically, instead of the biweight method (which has been widely implemented in outlier elimination for univariate observations such as radiance, GPS integrated precipitable water vapor (IPW), and conventional surface data), we developed a new method based on the IRMCD for outlier elimination in multivariate observations of profiler data. With the IRMCD as the outlier elimination scheme, a quality control procedure is constructed that consists of a blacklisting check that removes the stations with gross errors and an outlier check that rejects data with large deviations from the background.
A quality control experiment is performed on the profiler data compared with the analysis of BJRUC Domain1 at every 0000 and 1200 UTC during the period from 20 June to 30 September 2015. From the results, we observe that with the application of this quality control, the frequency distributions of the differences between the observations and the model background become more Gaussian-like and meet the requirements for a Gaussian distribution for the data assimilation. A further intensive assessment of each quality control step reveals that the stations rejected by blacklisting have poor data quality and that IRMCD can reject outliers in a robust and physically reasonable manner.
Additional emphasis is placed on how to handle the profiler data that are assumed to be affected by precipitation in the quality control procedure. Parallel experimental results show that although the subset with rain has a larger RMSE and standard deviation relative to the model background, its quality is still sufficiently reliable for further use in assimilation. However, it is necessary to apply quality control and to perform future assimilation using a separate correlation coefficient threshold and observation error for the subset without rain because of a difference in data quality relative to the model background.
Three main aspects emerge as differences when the IRMCD is compared to the biweight method: 1) IRMCD is a multidimensional outlier detection rule, while the biweight method can be applied only to a univariate dataset. 2) In IRMCD, the outliers are identified by comparing the squared robust distance to a reference distribution whose shape parameters vary with the dataset to which IRMCD is applied. In the biweight method, a predetermined multiple of the biweight standard deviation is set as the cutoff for identification. 3) Considering the null hypothesis of no outliers, IRMCD has a mechanism to prevent false positives in any clean dataset. Parallel outlier identification experiments reveal that in terms of the combination of the best skewness and excess kurtosis that can be achieved by the two methods, IRMCD may yield a distribution closer to Gaussian and tends to identify points with larger deviations from the majority of nonoutliers as outliers in all dimensions simultaneously. From these points of view, IRMCD can be considered superior to the biweight method.
This work is a preliminary implementation of IRMCD in outlier detection in wind profiler observations. An impact experiment using wind profiler data in local NWP models in China is being carried out, and the results will be presented. Additionally, extended work on IRMCD is under consideration, including tests of more distributions as references and the use of IRMCD for cross validation for multisource observations.
This research was sponsored by the 973 Program (2013CB430102).