Standard meteorological model performance evaluation (sMPE) can be insufficient in determining “fitness” for air quality modeling. An sMPE compares predictions of meteorological variables with community-based thresholds. Conceptually, these thresholds measure the model’s capability to represent mesoscale features that cause variability in air pollution. A method that instead examines features could provide a better estimate of fitness. This work compares measures of fitness from sMPE analysis with a feature-based MPE (fMPE). Meteorological simulations for Bogotá, Colombia, using the Weather Research and Forecasting (WRF) Model provide an ideal case study that highlights the importance of fMPE. Bogotá is particularly interesting because the complex topography presents challenges for WRF in sMPE. A cluster analysis identified four dominant meteorological features associated with air quality driven by wind patterns. The model predictions are able to pass several sMPE thresholds but show poor performance for wind direction. The base simulation can be improved with alternative surface characterization datasets for terrain, soil classification, and land use. Despite doubling the number of days with acceptable specific humidity, overall acceptability was never more than 10%. By comparison, an fMPE showed that predictions were able to reproduce the air-quality-relevant features on 38.4% of the days. The fMPE is based on features derived from an observational cluster analysis that have clear relationships with air quality, which suggests that reproducing those features will indicate better air quality model performance. An fMPE may be particularly useful for high-resolution modeling (1 km or less) when finescale variability can cause poor sMPE performance even when the general pattern that drives air pollution is well reproduced.
Meteorological model performance is critical for successful air quality modeling and necessary for regulatory purposes. The standard model performance evaluation (sMPE) techniques for judging model fidelity are based on statistical thresholds developed by the community from various regional studies such as Emery et al. (2001). However, these types of evaluations are likely to show failure for high-resolution modeling in regions of complex topography. Despite sMPE failure, a meteorological model may be able to successfully pass a feature-based model performance evaluation (fMPE) by reproducing features that are most important for air quality. A region such as Bogotá, Colombia, which has emerging air quality issues, is understudied with low quality surface characterization datasets, and complex topography provides an interesting case study for comparing an sMPE of a meteorological model with an fMPE.
As a result of rapid economic and population growth, air pollution in Bogotá is far above local and international standards. Particulate matter on the order of ~10 μm or less (PM10) frequently exceeds Colombia’s daily (100 μg m−3) and annual standards (50 μg m−3), which are less restrictive than the World Health Organization (WHO) recommended daily (50 μg m−3) and annual standards (20 μg m−3) (SDA 2013b; WHO 2006). Ozone (O3) also consistently exceeds the national extremely strict local standard of 80 μg m−3 (approximately 41 ppb), which will likely be revised soon to be higher. Epidemiological studies have shown significant relationships between both of these pollutants and respiratory/cardiopulmonary diseases and this link has long been known (Dockery et al. 1993). Additionally, studies specific to Bogotá highlight the negative health impacts of air pollution (Arciniegas et al. 2006; Rodríguez et al. 2006; Bell et al. 2011; Pachón et al. 2013). With the aim of ameliorating the situation, Bogotá’s environmental agency, the Secretaría Distrital de Ambiente (SDA), has formulated a 10-yr air pollution abatement plan requiring an air quality modeling system to test the efficacy of regulatory strategies (SDA 2010). The first requirement of the air quality modeling system is accurate meteorological predictions.
Bogotá has a unique combination of geography and seasonal conditions. Bogotá is on the western slope of the Andes Mountains at 4.5°N and 74°W at an approximate elevation of 2500 m above mean sea level (MSL), making it the highest metropolitan area with more than 10 million people. Bogotá’s seasonal meteorology consists of four alternating periods of rainy and dry weather. The dry seasons are January–March and July–September, and the rainy seasons are April–June and October–December. Winds tend to be strongest in Bogotá in the morning between 0600 and 1100 local time (LT) and in the afternoon between noon and 1800 LT (SDA 2013b).
The complex topography of Bogotá is likely to lead to poor performance in wind speed and direction prediction, which are important components in air quality. In Bogotá, SDA (2013b) showed that diurnal cycles of wind speed and direction are associated with high concentrations of PM10. In regions of complex topography like Bogotá, studies consistently find poor wind direction skill for a standard MPE (Gebhart et al. 2014; Reboredo et al. 2015). Gebhart et al. (2014) used WRF in the Rocky Mountains and concluded that carefully considered physics options did not substantially improve model performance. Reboredo et al. (2015) tested physics configurations with WRF in Bogotá. They modeled eight single days and selected an optimal configuration with marginal differences or improvements from the other parameterizations. All physics parameterizations tested by Reboredo et al. (2015) performed poorly for wind direction. These studies demonstrate the need to focus on other aspects of modeling outside of the physics configurations, such as surface characterization inputs to improve model performance and demonstrate the inability of meteorological models to pass an sMPE in regions of complex topography.
Previous meteorological modeling studies in other understudied regions like Bogotá suggest sMPE enhancements are possible through improving surface characterization. Understudied regions are generally outside of the United States and Europe, where WRF is most commonly applied, and the default datasets such as the U.S. Geological Survey (USGS) Global 30 arc-s elevation (GTOPO30) topography, land-use, and soil classification data are based on sparse or old observations. However, meteorological models can be improved using alternate inputs for topography (Teixeira et al. 2014), soil categorization (Chang et al. 2014; Cheng and Steenburgh 2005), and land-use classification (Yucel 2006).
There is a clear need for meteorological modeling in Bogotá that addresses the challenges of surface characterization and evaluation within the context of complex topography. This study has three main steps: 1) application of a cluster analysis to characterize the meteorological features in Bogotá that results from the complex topography and their importance for pollutant transport, 2) development of a meteorological simulation using the best available physics options and surface characterization datasets, and 3) comparison of the simulations’ ability to pass an sMPE with its ability to reproduce the meteorological features found in the first step using an fMPE.
The cluster analysis reveals that important features for air quality in Bogotá are diurnal local flow patterns. For each cluster, we characterize case studies to illustrate the role of mountains in producing synoptic–thermal interactions that drive local patterns. Initial efforts to simulate local patterns showed poor sMPE, and our results show that improving surface characterization datasets has limited capability for improvement. We propose and implement an alternative fMPE and our results show that simulations are substantially more likely to reproduce critical features than to pass sMPE thresholds. The meteorological features correlate with air pollution regimes, thus suggesting that an sMPE may not correlate with air quality skill.
This manuscript is organized into a methods section (section 2), three results sections (sections 3–5), and our conclusions (section 6). Section 2 describes methods for identifying important meteorological features in the observations with a cluster analysis, WRF surface characterization inputs and physics configurations, the sMPE, and the fMPE. Section 3 details the meteorological features from the observational cluster analysis and their relation to air quality conditions. Section 4 presents model evaluation results from the sMPE and section 5 discusses the ability of the model to reproduce the important meteorological features with an fMPE. The conclusions are presented in section 6. To our knowledge, there have not been any studies that use a cluster analysis to identify observed meteorological features and test the sensitivity of meteorological models to surface characterization, and, then, compare an sMPE with an fMPE for a region similar to Bogotá.
Meteorological features that are influential for air quality in Bogotá are identified using a k-means cluster analysis. We then simulate the meteorological conditions in Bogotá using WRF (version 3.5.1; Skamarock et al. 2008) and evaluate the results with an sMPE. Alternative surface characterization input datasets are implemented to assess their impacts on model performance with the goal of improving sMPE. Finally, we use a feature-based fMPE to evaluate WRF’s skill at reproducing the features and their frequencies.
a. Identifying observed meteorological features
A cluster analysis can be used to identify features, such as local flow patterns, that are important for air quality (Davis et al. 1998; Huang and Smith 1999; Thompson et al. 2001; Austin et al. 2015). In this study, a k-means cluster analysis similar to that of Davis et al. (1998) is performed. The k-means method separates data into a number of clusters k on the basis of similarity in selected meteorological parameters.
The k-means clustering algorithm uses as input the number of clusters to generate k and a set of observations to cluster. The algorithm returns a set of k centroids, where all points in one of the k clusters are closest to one of the centroids. Being closest implies that the absolute error of the points in the cluster is minimized with the centroid it is assigned relative to any other centroid. To start, two samples of k centroids are selected from the dataset at random. The first set was seeded for consistency. Then, if they are unequal (i.e., have not converged) one of the samples of k centroids is used to make k clusters. Clusters are made such that each value in the dataset is assigned to a centroid that it is closest to. Then, the centroids are adjusted to be the mean of each respective cluster. The new adjusted centroids are compared with the previous set of centroids for the next iteration, if adjusted centroids have changed, then iterations continue until assignments of clusters and centroids no longer change.
The number of clusters k is chosen based on the “gap statistic” developed by Tibshirani et al. (2001). The statistic compares the “compactness” of clusters with a random distribution of data within the dataset’s bounds. The appropriate number of clusters is chosen when the measure of compactness is largest relative to the random distribution’s compactness.
Several meteorological variables available from the Red de Monitoreo de Calidad del Aire de Bogotá (RMCAB) observational dataset (SDA 2013a) were selected for use in the cluster analysis. Among the variables examined were temperature, pressure, mixing height, specific humidity, wind speed, and wind direction at each monitor. Pressure was averaged over all monitors to represent larger-scale processes. The RMCAB dataset has 11 stations that provide hourly data for the variables of interest (Fig. 1).
By far the most important variables for clustering were morning and afternoon wind speed and direction (SDA 2013b). All variables were tested but only wind speed and direction were impactful for clustering. Other variables such as temperature, mixing height, solar radiation, specific humidity, and domain-averaged pressure were not impactful for clustering because they do not exhibit nearly as much variability as wind speed and direction in Bogotá. SDA (2013b) reported that winds in Bogotá are only considered to have significant impacts on variability between the hours of 0600–1100 and 1200–1600 LT. Winds at other hours of the day are calm and not considered significant. Therefore, for the cluster analysis, winds were taken as averages of the two intervals of 0600–1100 and 1200–1600 LT.
The k clusters are developed from a year (2012) of observations from RMCAB. The k-means technique requires valid data for all “rows,” where a row contains morning and afternoon wind speed and wind direction for all monitors. Monitors with data unavailable for more than 10% of days in 2012 were excluded. The threshold of 10% was chosen to optimize both the number of stations available for clustering, and an acceptable percent of times that would be included in the cluster analysis. A 10% threshold yields 6 of the 11 stations that meet the threshold (Fig. 1) on 80% of all possible observation days in 2012 for clustering.
Cluster analysis creates k clusters that have distinct meteorological features, which would ideally correspond to air pollution regimes. Each day and its concentrations of PM10 and O3 are assigned to the clusters to determine which are associated with poor air quality. The air quality conditions associated with each cluster are identified as distinctly different using a Mann–Whitney rank sum test (Mann and Whitney 1947).
b. Model domain configuration and simulation periods
WRF was run for Bogotá using nested domain simulations with boundary and initial conditions from the National Centers for Environmental Prediction Final Operational Global Analysis (FNL; NCEP 2000). The domains (Fig. 1) are centered over Bogotá and use a 3-to-1 parent-to-child nesting ratio. The outermost domain (d01) has a 27-km resolution and encompasses all of Colombia, much of the surrounding region, and the Caribbean Sea while using a spectral nudging approach. The first nested domain (d02), with 9-km resolution, represents much of Colombia while minimizing boundary placement on complex terrain. The second nested domain (d03), with a 3-km resolution, encompasses the Colombian department (similar to a U.S state) of Cundinamarca. The innermost domain (d04) has a 1-km resolution, is 64 km × 64 km and is centered over Bogotá. Vertically, the simulations have 30 layers for all domains with the height of the first layer being 55 m. The vertical structure is also included in Fig. 1. Data were output from WRF hourly for all domains.
We simulate two 25-day periods in 2012 when high-pollution episodes occurred for a dry season and a rainy season (50 total simulation days). The periods spanned 7 February–3 March and 30 September–24 October and are referred to hereafter as the February and October periods. Each 25-day period was broken up into five overlapping 5.5-day segments. The first 12 h of each segment (prior to the start date at local time) were discarded as spinup.
c. Physics configurations
Multiple physics configurations were evaluated for their influences on model performance and two were selected for evaluation with improved surface inputs. A physics configuration is a combination of WRF’s physics schemes for radiation, cumulus convection, microphysics, the planetary boundary layer (PBL), and land surface model (LSM) (full details can be seen in appendix A). Two of the configurations were chosen for the sMPE and evaluation of surface characterization inputs based on the PBL schemes. PBL schemes are significant because of their influence on humidity. The first configuration chosen is based on the default WRF configuration (DF), and the second configuration is based on Nielsen-Gammon (2001), which was used in the Houston–Galveston, Texas, region (NG). The DF configuration uses the Yonsei University (YSU) PBL scheme, which has a known improvement from the MRF PBL scheme used by the NG configuration. The YSU PBL scheme represents the PBL height better than does the MRF PBL scheme used by the NG configuration because of the explicit treatment of entrainment (Hong et al. 2006; Cohen et al. 2015).
d. Surface characterization inputs for WRF
The default surface characterization datasets are highly accurate for use in the United States but may return subpar results outside of the country. The WRF Preprocessing System (WPS) allows the user to select surface characterization datasets (e.g., topography, land use, and soil classification) to be used by WRF. WPS then prepares data for each domain. We tested three surface characterization datasets that are alternatives to the standards for influence on model skill: topography, soil categorization, and land use. These datasets are tested with the two physics configurations: DF and NG. We also test the propagation of soil temperature and moisture from each 5-day segment to segment. For each surface characteristic, we demonstrate the difference between the standard WRF dataset (at the highest available resolution) and an alternative data source described further below.
Bogotá has complex topographic features, which need to be sufficiently resolved in order to yield optimal meteorology. The standard USGS GTOPO30 topography interpolated by WPS to the selected domain has deficiencies for Bogotá. Figure 2a shows that the GTOPO30 topography has a local depression in the northwest portion of the d04 domain. Local researchers (Rincon and Rojas 2014, 2015) note that the actual terrain is much flatter in this region, and the depression is likely the result of interpolation from sparse data. Alternatively, we use the Advance Spaceborne Thermal Emission and Reflection Radiometer (ASTER) topographic dataset (Abrams 2000). ASTER has a 30-m resolution and has been more accurate than the GTOPO30 in select regions outside the United States (Nikolakopoulos et al. 2004). To implement ASTER for WRF, we converted original GeoTIFF-formatted ASTER files to the necessary Band Interleaved by Line (.bil) format for compatibility with WPS. Figure 2b shows that topography derived from the ASTER dataset does not have the artificial depression in the northwest and has a higher mean height and stronger spatial gradients. For ASTER simulations, ASTER replaces GTOPO30 for all domains except d01, where ASTER’s resolution does not lead to significant changes.
2) Land use
Land use has been impactful for atmospheric state variables including wind, temperature, and humidity (Yucel 2006). An image of Bogotá shown in Fig. 3a, generated using Google Earth, displays the urban area of the city with a generally gray color, with surrounding vegetation. The default USGS land-use dataset in Bogotá (Fig. 3b) does not cover enough of the urban area with the urban category and poorly represents the city’s surrounding vegetation. We tested simulations using WRF’s alternative available land-use dataset, the Moderate Resolution Imaging Spectroradiometer (MODIS) (Boston University 2012), as seen in Fig. 3c, which has a higher resolution and can be more accurate outside of the United States (Cheng and Steenburgh 2005). MODIS provides a better spatial representation of the urban area of Bogotá and the forested regions surrounding the city. These two datasets are juxtaposed in Figs. 3b and 3c with a joint color scale representative of both by combining similar categories.
3) Soil classification
The standard USGS soil classification dataset for WRF is poorly resolved in Bogotá. Figure 4a shows that the USGS has a distinctly blocky pattern and only two categories. As an alternative, we used the Harmonized World Soil Database (HWSD) with 1-km resolution (Batjes 2009). HWSD has soil data in percentages of soil type. Since WRF requires soil classification data in USGS classifications, we convert the HWSD percentages of clay, silt, and sand to the proper classifications in USGS format. Figure 4b shows that HWSD has a moderately better representation of d04 with four classifications and a geographic pattern that more closely resembles the mountainous east, central plateau, and agricultural west.
4) Soil state propagation
Soil classification and land use indirectly affect meteorological modeling skill through soil moisture and temperature, but the cumulative effect that takes time to develop requires special treatment. The cumulative effects on soil moisture and temperature are typically lost during each 5-day segment reinitialization. This may be important for HWSD simulations, which have a different soil class dataset than NCEP FNL. As an alternative to reinitialization and similar to a previous study over the United States (Gilliam and Pleim 2010), we try propagating soil moisture and temperature data from the end of a 5-day segment to the start of the following segment’s spinup period to allow for cumulative deviation from the NCEP (2000) initial conditions. The initial conditions for WRF are shown in Fig. 5a for the third segment in February following the spinup, and by contrast the propagated soil moisture for the same time is shown in Fig. 5b using the HWSD soil classification. Using propagation shows a clear difference in moisture particularly over the mountainous east of d04, likely accounting for more runoff over time.
5) Surface characterization summary
With all combinations of terrain, soil classification, soil moisture propagation, and land use, there are a total of 16 characterizations, as seen in Table 1. There are eight simulations for each standard surface characterization and eight simulations for each of the alternative datasets.
e. Standard model performance evaluation
The sMPE can be used to determine if the model is suitable for regulatory air quality modeling and to understand the importance of applying improved surface characterization datasets. However, an sMPE will not be able to test if WRF can recreate the features important for air quality as determined by the cluster analysis. Each model simulation is assessed using benchmarks for “acceptability” in air quality applications. “Acceptability” is characterized based on a synthesis of the broader community’s mesoscale sMPE. Tesche et al. (2001) compiled MPEs from mesoscale meteorological simulations used for regulatory applications. Emery et al. (2001) added recommendations about benchmarks for statistics applied to individual days. Because of Bogotá’s complex terrain, we combine the recommended evaluation thresholds with those from McNally (2009). We use the proposed daily benchmarks, shown in Table 2, for the simulations.
Model performance statistics are compared with benchmarks for four important variables: wind speed at 10 m above ground (WS10), wind direction at 10 m (WD10), 2-m temperature (T2), and 2-m specific humidity (Q2). Model variables are output in WRF as specified in Skamarock et al. (2008). For each variable, a combination of root-mean-square error (RMSE), mean bias (MB), mean error (ME), and index of agreement (IOA) are used. Wind direction is only evaluated when observational wind speeds are greater than 0.5 m s−1. We examine the number of “acceptable” days, where a day is considered acceptable if all metrics for a variable meet the benchmark on that day. Furthermore, we evaluate joint acceptability, which occurs when all variables’ metrics meet their respective benchmarks on a given day. Since WD10 error is a highly difficult metric, we also consider “joint acceptability” for all variables without it. Not including WD10 error places strong emphasis on WD10 bias, which is the more important wind direction metric for air quality because it indicates whether pollutants will arrive at a monitor as opposed to when they will arrive. For regulatory standards, air quality variables are typically averaged over several hours and/or daily maxima are used, placing further emphasis on the importance of a low bias for WD10.
The simulations were evaluated against local observational data from RMCAB for the two innermost domains and against Meteorological Assimilation Data Ingest System (MADIS; NCEP 2014) stations for the outer two domains. Three stations had specific variables excluded from the MPE based on analysis from the RMCAB quality assurance team. The San Cristobal (SC) station’s wind data were excluded for the February period and the Ministerio de Ambiente, Vivien da y Desarrollo Territorial (MAVDT) station’s wind direction was excluded for October because of unreliable data. Also, we excluded Guaymaral’s specific humidity as a variable because, although this site is located in a rural region of Bogotá, it is within 30 m of a large paved area, which, according to Bailey (2000), is inappropriate for specific humidity. Vertical performance is also used to test WRF’s ability to accurately resolve mesoscale vertical structure. Comparisons were made between WRF and radiosonde (“sonde”) data collected at the El Dorado International Airport (SKBO) daily at 1200 UTC.
f. Feature-based model performance evaluation
Cluster analyses can be used to help identify important features in modeling studies, and evaluations based on these features can show similar levels of performance to threshold-based analyses (Dormann et al. 2013). After using a cluster analysis to identify the important meteorological features from the observational dataset (see section 2a) that are predictors of air quality, we determine if the model is able to reproduce those features. This feature-based analysis, fMPE, is potentially more viable than an sMPE, particularly for new modeling locations. At such locations, surface datasets may not be very accurate, and if there is no immediate opportunity for improvement, it is likely that a simulation will fail an sMPE. Instead, an fMPE is more likely to demonstrate that a model is applicable for air quality modeling. The model’s ability to reproduce the features is evaluated based on the accuracy of simulated day assignments of the clusters.
Prediction skill is nominally tested by rejecting accuracy as random chance. The null hypothesis is that every day, each cluster has an equal chance of occurrence, making the total number of occurrences of each of the k clusters in a period equally likely. The assumption is that model skill is not driving cluster assignment. The null hypothesis is evaluated using a Bernoulli test (binomial test) of overall assignment accuracy (rejected if p < 0.05). If the null hypothesis is rejected, the model nominally reproduces the accuracy of the clusters and, therefore, the air quality tendency.
3. Observational cluster analysis
First, the cluster analysis is used to reveal the meteorological features, such as diurnal wind patterns, that lead to certain air quality conditions. The “gap statistic” (described in section 2a) helped determine that the optimal number of clusters to be considered, k, should be 4. While the clustering method was not based on air quality, the flow patterns associated with clusters did show a relation to air quality conditions, particularly PM10. The clusters’ probability density functions (PDFs) showing the distributions of PM10 concentrations can be seen in Fig. 6. The clusters will hereafter be referred to as the low cluster (L), the midlow cluster (mL), the midhigh cluster (mH), and the high cluster (H), each with an average daily PM10 concentration from observations of 37, 43, 50, and 61 μg m−3, respectively. The O3 concentrations are similar across clusters and low because of strong NOx titration, which leads to consistently low O3 in the region. A Mann–Whitney rank sum test was used to determine whether the air quality conditions associated with each cluster were distinct. The test reports a p value for each cluster’s relation to all other clusters that needs to be less than 0.05. The maximum p value was 0.02, demonstrating that unique air quality conditions are associated with each cluster.
Figure 7 shows the combination of morning and afternoon wind patterns for each cluster along with associated PM10 and O3 concentrations. Arrows outlined in black were included in the cluster analysis and arrows not outlined were not included in the clustering because of intermittent data availability such that the minimum data availability requirement (see section 2a) was not met. The arrows from intermittently available stations are shown even though they were not used in the cluster analysis. The low PM10 cluster L shows strong morning and afternoon winds flowing generally toward the northwest and away from the mountains, likely driven by synoptic winds. The mL cluster has weaker winds that flow either west or north, both directions being away from the mountains and likely not impacted by synoptic-scale flow. The mH cluster shows several stations with winds flowing upslope toward the mountains in the east during the afternoon and downslope and away from the mountains during the morning, indicating a thermally driven pattern. The H cluster’s winds are generally strong, indicating synoptic flow, and would transport pollutants directly toward the mountains in the east.
The concentrations associated with the wind patterns suggest that the mountains in the region play a significant role in pollutant entrapment similar to other studies in regions of complex topography with mountains on one side of the region (Lu and Turco 1995; Yao et al. 2014). The L cluster has strong winds flowing downslope; therefore, pollutants are likely flushed out of the region, leading to lower concentrations. When winds are flowing toward the mountains, as in the H cluster, there is likely to be transport blocking along with air parcels vertically oscillating on the upstream side of the mountain, which would prevent long-range transport and likely lead to the higher concentrations (Yao et al. 2014). This vertical oscillation can occur because the mountain range is sufficiently long and perpendicular to the flow.
The L and H wind patterns occur when synoptic wind patterns are dominant, while the mH and mL clusters tend to occur when winds are either thermally driven or weak and synoptic flow is not a major factor. When there is strong synoptic flow, it is typical that pollutant concentrations are flushed out and tend to be lower. This is true for the L cluster when winds are easterly, but when the winds are westerly, for the H cluster, the highest concentrations tend to occur. This is indicative of the pollutant trapping. When synoptic flow is not a factor, winds are thermally driven and there tends to be morning winds downslope from the mountains and afternoon winds upslope (SDA 2013b), as seen for the mH cluster.
To demonstrate these effects on pollutant transport, we present a case study of an H day (17 February 2012), an mH day (11 October 2012), and an L day (2 March 2012). Figures 8a–c shows the 500-hPa geopotential height (GHT) from the NCEP FNL dataset gridded onto d02 and averaged between 0700 and 1300 LT, which leads to the mesoscale winds shown in Figs. 8d–f, and the diurnal winds in Figs. 8g–i. The H day has a high just south of Bogotá along with synoptic winds flowing from the west. The diurnal pattern shows weak morning winds, with strong afternoon winds flowing toward the mountains. The PM10 concentrations on this day remain high at an average of 86.4 μg m−3 indicating pollutants did not exit the region and were trapped. On the mH day, there is not a significant system over Bogotá, and surrounding winds are weak and scattered, leading to thermally driven local winds. The mH day winds are downslope in the morning and upslope in the afternoon. The average PM10 concentration for the mH day was 66.9 μg m−3, which is still high because of the afternoon westerly winds but not as high as the H day. For the L day, synoptic winds are flowing from the east and lead to diurnal local winds flowing generally easterly. The average concentration of PM10 is 33.4 μg m−3.
Based on the case study and conditions from the cluster analysis, it can be concluded that in a region such as Bogotá, with mountains that are tallest on one side, wind speed and wind direction are dominant factors determining air quality conditions.
4. Standard model performance evaluation results
The model was run for 35 total simulations (five physics configurations; 16 inputs for two physics configurations). The full evaluation for all configurations is presented in appendix B for the February and October periods with the two configurations most influenced by surface characterization changes analyzed here. Outer domain analysis (for d01, d02, and d03) is also shown in appendix C and was similar for all simulations; therefore, this study focuses on performance in d04.
a. Standard model performance evaluation of physics options
The DF and NG physics configurations had common levels of performances for most sMPE criteria, with Q2 as a notable exception. Both configurations performed very well for T2, moderately for WS10 and Q2, and very poorly for WD10. Vertically, model performance is weakest and most variable near the surface, and stronger and more consistent aloft. A full evaluation (see appendix B), including vertical analysis (see appendix D), is presented later.
For WS10, WRF performs fairly well for the DF and NG configurations, with 29 and 28 acceptable performance days out of 50. The model performed poorly for WD10, particularly ME. The DF and NG configurations only have between 4 and 3 days accepted out of 50, respectively, and overall WD10 ME’s of 66.6° and 66.3° However, the overall MB is acceptable. Similar to d04, WD10’s error performance is also poor for coarser-resolution domains d03 and d02, since both domains also feature of complex topography. WD10’s performance for d01 is stronger since it is less impacted by topography.
The variable Q2 is the one that is most susceptible to surface characterization changes because of the PBL schemes used. For Q2, NG yields an acceptable overall MB; however, the number of acceptable days was only 17 out of 50. DF’s overall MB for Q2 does not meet the benchmark and had even fewer individual acceptable days, with 13 out of 50. Figure 9 shows the Q2 and PBL diurnal profiles before and after surface characterization data improvements. In Fig. 9a, the dashed lines demonstrate that prior to improving the surface characterization datasets by using ASTER and HWSD, the DF configuration and to a lesser extent the NG configuration overpredict Q2 in the morning, including the crucial hours of 0600–1000 LT. During these hours the PBL is low, but emissions are high, leading to high pollutant concentrations. In the afternoon the DF configuration nearly captures an afternoon hump feature and then overpredicts later in the day. The NG simulation underpredicts in the afternoon. Since these simulations use the USGS GTOPO30 terrain, which has the artificial bowl, it is expected that there would be too much moisture at the monitor locations as a result of water tending to accumulate on the western side of the city inaccurately. Observational data have shown that moisture actually accumulates more on the eastern side of the city, closer to the mountains (SDA 2013b).
When there is excess moisture, energy is lost to latent heat and therefore less energy is available for vertical mixing, leading to lower PBL heights (Pleim and Xiu 1995; Sogachev et al. 2002; Findell and Eltahir 2003). However, if the PBL scheme tends to produce an overly deep PBL, the moisture content can be artificially lowered, which happens for NG. The fact that NG underpredicts Q2 speaks to this weakness in the PBL scheme. Analyses of PBL schemes have shown that, relative to other PBL schemes, MRF tends to overdeepen the PBL (Cohen et al. 2015; Hong et al. 2006; Mass et al. 2002). NG’s PBL daily profile is deeper than DF’s, as shown by the dashed lines in Fig. 9b. The combination of model terrain and PBL scheme leads to a lower bias for specific humidity because of the terrain but for an undesired reason. Improved terrain should improve the performance for DF, but not for NG.
b. Surface characterization impacts on sMPE
Using the NG and DF configurations, we ran all combinations of surface characterizations and saw distinct sensitivity for Q2 performance and little sensitivity for NG relative to those run with DF. Tables in appendix E summarize the effects of incrementally changing the surface characterizations on sMPE.
With more accurate terrain, DF’s Q2 sMPE performance improves. This is because DF uses the YSU PBL scheme, which has been shown to better represent the PBL height for reasons explained in section 2c. With the improvement in terrain from the ASTER data, surface moisture does not tend to accumulate and lead to excess latent heat fluxes, leaving more energy available for vertical mixing and a deeper PBL. As seen in Fig. 9b, both the DF and NG PBL heights increase with applying ASTER and HWSD. However, the NG PBL height is likely too high, while the DF PBL height is more accurate and Q2 performance is improved.
Figures 10a and 10b summarize the improvements, without WD10 error, in number of acceptable days from each surface characterization. Each box represents eight simulations with one of the surface characterizations changes. The NG configuration did not significantly change number of acceptable days out of 50 with any inputs (8–11 days without WD10). Alternatively, the addition of ASTER for the DF configuration shows strong improvement in acceptability without WD10 error as seen in Fig. 10a, caused by the significant improvement in Q2 acceptability. Following suit, the addition of HWSD leads to some more improvement. However, adding MODIS land use and soil propagation does not significantly change performance for DF or NG.
Despite the improvements gained for specific humidity performance, wind direction performance remains unacceptable for the majority of days with only 4 out of 50 days having acceptable wind direction performance for the sMPE.
5. Feature-based model performance evaluation (fMPE)
Although the model does not meet sMPE thresholds for wind direction on a majority of simulated days, the model is more likely to reproduce important meteorological features for air quality on any given day and therefore more frequently pass an fMPE. Table 3 shows the model’s ability to reproduce the frequency and accuracy of the clusters. Based on data availability requirements for clustering the observations (see methods section 2a), the February period has 15 days and October has 24 days for a total of 39 days available for analysis. The number of occurrences of each type of cluster is shown in the table for both February and October for the DF configuration with improved surface characterization data. The H cluster occurs more frequently in the model (26 days) than the observations (16 days). In addition, the number of occurrences of the mH cluster is underpredicted by the model with 4 days, as compared with the 13 days observed. This happens because mH days tend to be classified as H days and indicates that the model is predicting more occurrences of high pollution meteorology days characterized by strong synoptically driven westerly winds, and underpredicting weaker thermally driven westerly winds. However, the combined number of days between the mH and H clusters is very similar between the observations and the simulation implying that the air quality model will have a similar number of above average pollution days as the observations. Also, L days are mainly misrepresented as mL days, and the observations and model show 9 and 10 days, respectively, of combined mL and L days. The fact that the model can produce a similar distribution of the clusters is a positive trait for air quality modeling purposes and may be more important than having low bias or error.
The accuracy of cluster assignments on a given day is found to be statistically significant (p < 0.05) using a binomial test for the combined periods. The p value for the simulation is 0.00135, and rejects the null hypothesis that every day, each cluster has an equal chance of occurrence. Furthermore, the model is able to accurately predict the cluster assignment for 15 out of the 39 days available for analysis, about 38.4% of the days. This is a much higher success rate than found with the sMPE that showed the model could only pass an sMPE 4 days out 50 (4%). The fMPE therefore demonstrates that the model can reproduce important features for air quality much more frequently than it can pass an sMPE, yielding significantly more days that can be used for air quality modeling.
In this study, we used a cluster analysis to identify meteorological features that were related to air quality in Bogotá. The driving variables for air quality were wind speed and direction and the cluster analysis showed that there were four clusters that have distinct air quality features associated with them. On low pollution days, winds were generally strongly flowing from the mountainous east leading to flushing of pollutants. Medium levels of pollution tend to occur on days when winds are thermally driven and are weaker. Pollution is higher on days when thermal winds drive transport downslope from the east in the morning and upslope from the west in the afternoon. On the highest pollution days, winds tend to be synoptically driven and flowing from the west toward the mountains. This transports pollutants to the mountains and leads to trapping as demonstrated by a case study.
Two methods were then used to evaluate WRF for air quality modeling in Bogotá, Colombia with differing results. An sMPE using community based benchmarks for T2, WS, WD, and Q2 performance showed that WRF could accurately simulate T2, but performance was mediocre for WS and Q2 and poor for WD for two types of physics configurations, DF and NG. Improvements were then made to surface characterization datasets for terrain, soil classification, and land use. With the YSU PBL scheme used by DF, Q2 performance was able to dramatically improve. The MRF PBL scheme used by the NG configuration led to an overly deep PBL relative to the DF configuration. At first a deeper PBL was favorable for Q2 performance because the terrain and soil classification datasets led to inaccurate soil moisture accumulation. The overaccumulation of soil moisture leads to lower PBLs and overpredictions of Q2. With the more accurate ASTER terrain dataset, and HWSD soil classification, excess moisture accumulation is alleviated and both configurations PBLs were raised. This led to a better representation of Q2 by the DF configuration, but not the NG configuration. With the improved surface representation the DF configuration achieved much higher performance on the sMPE for Q2. However, WD error performance was not improved with surface characterization data. In fact on the majority of days, WRF fails the sMPE for wind direction, suggesting that the sMPE does not demonstrate if WRF could reproduce the features important for air quality found by the cluster analysis.
We then ran the cluster analysis on the simulation using the DF configuration with updated surface characterization and compared the predicted clusters with the observed clusters for an fMPE. The fMPE showed that WRF could reproduce features associated with air quality at a similar frequency and distribution to the observations. WRF demonstrated a similar number of days for the high and low clusters associated with air quality. A binomial test revealed that the accuracy was statistically significant for the simulation with a p value lower than 0.05. The fMPE showed that there were significantly more days (38.4%) when WRF could accurately simulate the features important for air quality than could pass an sMPE (<10%).
We therefore demonstrated that two evaluation techniques lead to competing conclusions on WRF’s applicability to regulatory air quality modeling. WRF was able to pass an sMPE for most variables but frequently failed for specific humidity and systematically failed for wind direction. An fMPE, however, revealed that WRF could reproduce important features of wind patterns that are associated with air quality conditions. Since sMPEs often fail in regions of complex topography, the fMPE can be another way to demonstrate a model’s suitability for air quality modeling. Future work will be needed to evaluate the predictive capacity of sMPE and fMPE for air quality performance. If fMPE is a better predictor of air quality modeling performance, it may be used as an alternative to sMPE. This will be most useful in settings where traditional sMPE benchmarks are not available or produce misleading results.
This project was funded by the Secretaría de Ambiente de Bogotá under Contract 1467 de 2013 with Universidad de La Salle. The viewpoint of the authors does not necessarily reflect the official thoughts of the SDA. We thank all of the people and organizations that provided information or advice for this project.
All Physics Configurations Used
Optimal physics configurations vary by region, period, and resolution and it is common practice to test multiple configurations (Borge et al. 2008; Ruiz et al. 2010; Carvalho et al. 2012; Evans et al. 2012). The configurations are referred to by the following acronyms: DF, AC, NG, GP, and BL. The DF configuration employs the default WRF options. Each of the other configurations was chosen based on previous studies. The AC configuration was chosen based on performance around Bogotá (Kumar and Rojas 2015). The NG configuration was chosen based on performance in a coastal environment, the Houston–Galveston region (Nielsen-Gammon 2001). The GP configuration was selected based on model skill over a range of continental conditions, including complex topography (Gilliam and Pleim 2010). The BL configuration was selected based on skill in Bogotá (Reboredo et al. 2015) and uses the Bougeault–Lacarrère (BL) PBL scheme. The full details of each physics option and configuration are summarized in Table A1 and Table A2. These five configurations were evaluated with default WRF inputs for the d04 domain.
Full Model Performance Evaluation of Physics Options
The five physics configurations show common performance for most evaluation criteria (e.g., RMSE, MB, ME for each meteorological variable). Table B1 shows that all configurations performed well for T2, moderately for WS10 and Q2, and very poorly for WD10. Figure B1 shows an average daily temperature profile for the configurations and RMCAB observations. All configurations can capture Bogotá’s diurnal temperature pattern. However, GP overpredicts T2 in the afternoon and underpredicts it in the morning, which is likely caused by the propagation of soil moisture used by GP, which can lead to the misallocation of soil moisture because of poorly represented terrain. Table B1 shows that there is little variation in T2 performance between configurations. The BL and AC configurations meet all of the benchmarks on 48 out of 50 days. DF meets all benchmarks on 49 out of 50 days. NG and GP meet the benchmark on 45 and 43 days, respectively, out of 50. Also, all configurations yield acceptable IOA and ME for Q2. Only NG and GP have unacceptable MB results for Q2. Wind speed performance is decent, with all configurations meeting the IOA and MB benchmarks. Last, all configurations have five or fewer acceptable days for WD. Considering all variables, the configurations only have between zero and two acceptable days.
Outer Domain Analysis
Tables C1–C3 show the outer domain model performance evaluation for d01, d02, and d03, respectively, for each physics configuration for 15 representative days in February chosen based on air quality. There is a noticeable degradation in WD10 performance from d01 to d02, and then consistently poor performance for d02, d03, and d04. The reason for this is the sudden presence of complex topography. As shown in Fig. 1, much of d02 is mountainous, and d03 is completely in the mountains, while complex topography does not cover the majority d01.
Figure D1 shows vertical performance for WRF physics configurations for temperature MAE and MB, specific humidity MAE and MB, wind direction MAE and MB, and wind speed RMSE and MAE. Figures D1a,b show that the configurations have similar temperature error and bias profiles, and all meet the criteria throughout. For specific humidity, the NG configuration shows a lower bias than do the other configurations and a worse error. Wind direction MB is generally within the criteria throughout the vertical, with worse performance near the surface as seen using the ground monitors. Wind speed bias and RMSE tend to be worse at higher altitudes, as the absolute wind speeds increase substantially. The similarities in vertical performance suggest that surface-layer performance is most important for model differences and air quality.
Full Analysis of Surface Characterization Inputs
We show the complete analysis of the surface characterization simulations. Tables E1 and E2 show the 50-day statistics for the combined periods for all surface characterizations for the DF and NG configurations, respectively.