1. Introduction
Despite often being underestimated, fog exerts significant impacts on both built environments and natural ecosystems. The reduced horizontal visibility resulting from fog formation has led to numerous traffic accidents and disruptions in aviation systems, resulting in casualties and substantial economic losses (Forthun et al. 2006; Bari and Bergot 2018; Gultepe et al. 2019). The socioeconomic toll of fog risks can be comparable to that caused by tropical cyclones (Gultepe et al. 2007). Moreover, the formation of fog influences air quality, the land surface energy budget, power production, agricultural outputs, and even human health (Köhler et al. 2017; Decesari et al. 2017). In water-scarce regions, fog can become a potential opportunity to alleviate water stress, as it provides a cloudlike source of water in contact with land surfaces (Qiao et al. 2020; Runyan et al. 2019).
Predicting fog events, however, remains an immense challenge. Parameterizing the microphysical processes introduces deep uncertainties, due to many associated factors leading to high natural variability (Melhauser and Zhang 2012; Steeneveld and de Bode 2018). In hyperresolution weather modeling, a slight change in soil thermal conductivity could introduce significant biases to fog onset time (Smith et al. 2021). The inherent complexity and chaotic nature of atmospheric processes often result in poor performance in predicting fog events compared to precipitation forecasting skills (Zhou et al. 2012). Even weather models with promising fog forecasting performance could exhibit a false alarm ratio of approximately 30% (Fernández-González et al. 2019). Still, improper parameterization and unsuitable modeling resolutions can lead to deep uncertainty in liquid water content of the atmospheric boundary layer (Gultepe et al. 2017).
To overcome the challenges in numerical weather simulations, data-driven machine learning algorithms have shown promising performance in recent years. While empirical methods initially faced challenges in their infancy (Koziara et al. 1983), advancements in algorithms and increased computing power have empowered them to effectively categorize or regress visibility data with atmospheric feature variables (Bartoková et al. 2015; Durán-Rosal et al. 2018; Miao et al. 2020). Cornejo-Bueno et al. (2017), for instance, successfully forecasted low visibility events using support vector regression, neural networks, and Gaussian process algorithms, achieving over 98% accuracy in fog classification.
More recently, Castillo-Botón et al. (2022) have comprehensively shown the ability of ensemble models based on decision trees, neural networks, support vector machines, and other techniques to capture local atmospheric patterns associated with fog formation. In a case study by Miao et al. (2020), a long short-term memory network was also found to be effective in predicting fog with a classification approach. Bari and Ouagabi (2020) and Pathak et al. (2018) have demonstrated the effectiveness of combining outputs from a dynamical weather model with machine learning techniques. Postprocessing of weather model outputs with a deep learning algorithm outperformed a high-resolution ensemble forecast system in capturing sea fog episodes (Kamangir et al. 2021).
Nonetheless, despite the promising performance, machine learning algorithms have often been trained at point locations for relatively short periods of time (e.g., Castillo-Botón et al. 2022; Negishi and Kusaka 2022). This limitation prompts significant questions regarding the generalizability of these models outside their training environments. While numerical models can provide spatial information on foggy conditions, albeit with some uncertainty (e.g., Smith et al. 2021), the effectiveness of applying machine learning models to untrained locations remains an area of uncertainty. Creating a comprehensive map of fog-related risks within a region is crucial for proactive risk management. However, machine learning models have often been evaluated primarily in high-data availability point locations such as airports (e.g., Castillo-Botón et al. 2022; Guijo-Rubio et al. 2018).
In practical applications, important hydrological variables, such as river flows and latent heat flux, have seen successful upscaling to ungauged areas through analogous machine learning algorithms (e.g., Zeng et al. 2020; Xu et al. 2018). In contrast to traditional weather networks, which exhibit uneven spatial distribution, advanced reanalysis systems (e.g., Muñoz-Sabater et al. 2021) consistently provide reliable atmospheric data both in space and time. Data-driven models trained on reanalysis datasets have demonstrated promising performance in predicting hydrological processes at a global scale (e.g., Jung et al. 2019; Ghiggi et al. 2019). Notably, this common approach has not yet been employed to identify foggy weather conditions, even within a regional context.
In this study, we addressed these questions with a twofold approach. First, we evaluated five decision-tree-based ensemble models in predicting fog occurrences within a South Korean region, specifically focusing on untrained locations and time periods, while considering the diversity of fog generation mechanisms. Second, we applied a common workflow that extends point-scale observations to ungauged areas using high-resolution reanalysis data. The potential of reanalysis data was assessed to establish a fog frequency map within a specified area using the point-scale training of foggy weather conditions.
2. Materials and methods
a. Study area, point observations, and preliminary analysis
The study area encompasses the southwestern region of the Korean Peninsula, specifically located within 126.5°–128°E, 35.1°–36.2°N. This area is under a semihumid temperate climate characterized by distinct rainy summer seasons. During the summer, southerly winds carry moisture from the North Pacific Ocean, replenishing soil moisture and increasing surface evaporation. Conversely, the Siberian high brings dry and cold winter seasons, which limits the moisture supply to land surfaces. In spring, low humidity and reduced latent heat flux can lead to significant diurnal fluctuations in air temperature (Suh et al. 2009). Over the past decades, the mean air temperature in the study area has ranged within 10°–13°C, while the mean annual precipitation has been around 1300 mm a−1, with approximately 60% of the total rainfall occurring during the summer season.
To establish relationships between fog occurrences and atmospheric conditions, we collected data at 16 stations from the automated synoptic observations system (ASOS) operated by the Korea Meteorological Administration (KMA). The ASOS stations were equipped with visibility meters capable of measuring the intensity of light scattered by haze or fog, enabling the objective detection of fog occurrences during both daytime and nighttime hours (Oh and Suh 2020). The gauging stations are located at various elevations, ranging from 12.2 to 406.9 m above mean sea level (MSL). The collected data include hourly visibility records from 2017 to 2022, as well as concurrent measurements of air temperature, dewpoint temperature, skin temperature, precipitation, wind speed, wind direction, solar radiation, and air pressure.
Fog events, as per the definition of the World Meteorological Organization (https://cloudatlas.wmo.int/en/fog-compared-with-mist.html), are characterized by horizontal visibility falling below 1 km. We classified a day as foggy if there is at least one instance of such low visibility. Instances with a precipitation rate exceeding 14 mm h−1 were disregarded since heavy rainfall can significantly reduce visibility (Belorid et al. 2015). In line with Lee and Suh (2019), we also excluded the low-visibility instances with relative humidity below 88%, as they could be influenced by other factors such as severe dust.
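The labeling rules above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the record layout (date string, visibility in km, precipitation rate in mm h−1, relative humidity in %) and the function names are hypothetical, and hours with heavy rain are simply treated here as nonfog.

```python
from collections import defaultdict

VIS_THRESHOLD_KM = 1.0   # WMO fog definition: horizontal visibility below 1 km
MAX_PRECIP_MM_H = 14.0   # hours with heavier rain are disregarded
MIN_RH_PERCENT = 88.0    # drier low-visibility hours are excluded (e.g., dust)

def is_fog_hour(visibility_km, precip_mm_h, rh_percent):
    """Return True if one hourly record counts as fog under the rules above."""
    if precip_mm_h > MAX_PRECIP_MM_H:
        return False
    if rh_percent < MIN_RH_PERCENT:
        return False
    return visibility_km < VIS_THRESHOLD_KM

def foggy_days(records):
    """records: iterable of (date_str, visibility_km, precip_mm_h, rh_percent).

    A day is foggy if at least one of its hourly records is a fog hour.
    """
    flags = defaultdict(bool)
    for day, vis, precip, rh in records:
        flags[day] = flags[day] or is_fog_hour(vis, precip, rh)
    return sorted(day for day, foggy in flags.items() if foggy)
```

With this layout, the number of foggy days per season is the length of the returned list after filtering the records by month.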
Figure 1 illustrates the average number of foggy days in each of the four seasons during 2017–22. The mean foggy days at 16 gauging locations varied widely, ranging from 9.3 to 70.8 days yr−1. In high elevations, fog frequency tends to increase during the fall season, which can be attributed to the moistening of land surfaces by monsoonal rainfalls and the increasing nocturnal temperature drops. The enhanced moisture supply and cooler nighttime temperatures create favorable conditions for condensation. In contrast, fog formation is noticeably constrained in urban areas (stations 146 and 156). The heat island effect, combined with extensive paved land covers, leads to a large difference between air and dewpoint temperatures, thereby limiting air condensation (Gautam and Singh 2018). In coastal regions with lower elevations, fog frequency does not exhibit a strong seasonal pattern, suggesting that fog events are generated not only by radiative cooling, but also by frontal rainfalls and moist air advected from sea surfaces (Belorid et al. 2015).
We categorized the 16 stations based on their elevation and proximity to the coast. The stations at elevations higher than 150 m MSL were classified as mountainous sites (IDs: 238, 244, 248, 254, 264, 284, and 289), while those located within 10 km of the coast (IDs: 140, 243, and 252) were labeled as coastal sites. The remaining 6 stations (IDs: 146, 156, 172, 245, 247, and 251) were classified as inland sites. Subsequently, we analyzed fog events from 2017 to 2022 at each station using the decision-tree-based classification framework by Tardif and Rasmussen (2007) (Fig. 2). Radiative cooling was the predominant fog generation mechanism at mountainous stations, whereas coastal and inland stations had more diverse mechanisms with larger contributions of precipitation and advection fog. The coastal and inland sites also tended to have somewhat higher proportions of unidentified fog generation mechanisms.
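The station grouping described at the start of the preceding paragraph amounts to a two-step rule. The sketch below is illustrative only: the function name is hypothetical, and checking elevation before coastal distance is an assumption, since the text does not say how a high-elevation station near the coast would be handled.

```python
def classify_station(elevation_m, coast_distance_km):
    """Assign a station type from elevation (m MSL) and distance to the coast (km).

    Assumption: the elevation test takes precedence over the coastal test.
    """
    if elevation_m > 150.0:
        return "mountainous"
    if coast_distance_km <= 10.0:
        return "coastal"
    return "inland"
```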
Figure 3 compares the annual, monthly, and diurnal variations in the frequency of fog events between the ASOS stations. Except for urban sites specifically distinguished in Fig. 3, there were declining trends in the annual foggy days. This can be attributed to multiple factors, including rapid urbanization (Belorid et al. 2015), rising air temperature and declining relative humidity (Shin et al. 2021), and improved air quality (Bhardwaj et al. 2019).
The seasonal patterns depicted in Fig. 3b indicate that fog occurrences in mountainous areas increase during the fall season, likely due to radiative cooling over moist surfaces. On the other hand, in coastal sites, fog formation is frequent during early spring and summer, being influenced by synoptic weather systems and monsoon effects (Lee et al. 2010). The rise in fog cases in March is associated with the high pressure on southwestern sea surfaces, which leads to warm moist westerly winds (Lee and Suh 2019). In urbanized sites, fog formation appears to be seasonally delayed, occurring more frequently in winter rather than in fall.
The diurnal variations (Fig. 3c) reveal that fog typically forms after sunset and dissipates as the surface is warmed by solar radiation. This suggests that radiative cooling is a dominant fog formation mechanism in the study area as shown in Fig. 2. However, frontal and advection fog could occasionally reduce visibility during daytime hours at coastal stations (Belorid et al. 2015; Lee et al. 2010).
b. Reanalysis data and spatial downscaling
To capture the spatial variations of surface atmospheric conditions, which are not fully represented by the point-scale ASOS data, we collected state-of-the-art reanalysis data from the ERA5-Land (Muñoz-Sabater et al. 2021) and ERA5 archives (Hersbach et al. 2020) provided by the European Centre for Medium-Range Weather Forecasts (https://cds.climate.copernicus.eu; last access: 25 August 2023). The advanced reanalysis data offer a comprehensive and consistent description of atmospheric changes in space and time. The collected variables from the ERA5-Land archive include hourly 2-m air temperature, 2-m dewpoint temperature, skin temperature, total precipitation, surface sensible heat flux, 10-m U and V wind speed components, and surface pressure at a resolution of 0.1° × 0.1°. While soil moisture contributes significantly to fog formation (Adhikari and Wang 2020; Li et al. 2021), we assumed that aggregating precipitation data could effectively represent surface moisture availability. We also incorporated low-level cloud cover and cloud base height datasets in the ERA5 archive. They serve as indicators of low visibility, as highlighted by Taszarek et al. (2020).
Since the ERA5-Land data at 0.1° × 0.1° may be coarse to represent local-scale nonlinear fog formation processes (Tapiador et al. 2019), we employed the cokriging method to further refine them. Cokriging is a geostatistical interpolation method that preserves the spatial autocorrelation of the raw data while incorporating their covariance to external variables. Following the workflow developed by Kusch and Davy (2022), we disaggregated the reanalysis data to a higher resolution of 0.025° × 0.025°.
The collected reanalysis data were downscaled using the related covariates at each time step. It is known that the temperatures are strongly correlated with elevation (Clements et al. 2003; Willmott and Matsuura 1995), and precipitation also tends to increase with elevation, although with uncertainty (Daly et al. 2008). Solar radiation and air pressure also have physical relationships with latitude and elevation (Allen et al. 1998). Wind speed and soil moisture, which have weaker correlations with geographical covariates, were interpolated using simple kriging to preserve autocorrelations only. The maximum number of neighboring values for the spatial downscaling was determined to be 16 through interactive testing to minimize errors. The refined data by cokriging appeared to better capture temperature variations along the elevation profile (Fig. S1 in the online supplemental material).
c. Machine learning models and feature variables
To classify binary fog occurrences at 16 gauging locations with the ASOS and the downscaled reanalysis data, we used five tree-based ensemble methods: random forest (RF; Breiman 2001), adaptive boosting (ADB; Freund and Schapire 1997), extreme gradient boosting (XGB; Chen and Guestrin 2016), light gradient boosting (LGB; Ke et al. 2017), and categorical boosting (CTB; Prokhorenkova et al. 2018). These models use decision trees (Breiman 2017) as base learners, which divide the feature space into distinct classes at each node to maximize subset purity. By combining predictions from multiple trees through a plurality voting mechanism, the ensemble models achieve higher accuracy and robustness compared to individual classification trees susceptible to training data variations.
Among the ensemble models, RF is the simplest technique, generating multiple decision trees trained on random subsets of the dataset. It introduces diversity by randomly selecting different features, depths, and tree structures, enhancing predictability through increased variation. ADB, XGB, LGB, and CTB are boosting techniques that aim to minimize misclassifications through sequential training. ADB starts with a basic classifier using a single feature split and iteratively adds weak classifiers, assigning higher weights to previously misclassified instances to improve overall accuracy. XGB constructs a series of decision trees in a cascade, reducing prediction errors by minimizing negative gradients of the loss function, thereby adjusting subsequent trees to rectify misclassifications. To avoid overfitting, XGB incorporates a regularization term that penalizes complex tree models.
LGB improves upon XGB by employing a leafwise structure for the regularization term, enhancing training efficiency on heavy datasets. By building trees in a leafwise manner instead of levelwise, LGB reduces the number of required splits and accelerates the training process while maintaining comparable performance. CTB is specifically designed to handle categorical features within gradient boosting. It excels at efficiently utilizing categorical variables by employing the ordering principle, seamlessly incorporating them into the boosting framework and resulting in improved performance and accuracy.
When training the five ensemble models, we transformed the ASOS and reanalysis datasets to account for temporal changes and the accumulation of incremental data. For instance, while a rainfall generation process may limit fog formation, an increased moisture supply can lead to a frontal fog event. In such cases, cumulative rainfall depths may become a better predictor of fog formation than the incremental rainfall depth. Table 1 summarizes the feature variables used in the training process. The air temperature and dewpoint depression measure the overall energy status and the moisture content of the air. The surface-air temperature difference and the surface sensible heat flux can indicate the atmospheric stability. The 1-h temperature change is the rate of radiative cooling, and the 24-h total precipitation can be associated with frontal fog events. The wind speed can influence the dispersion and mixing of moisture, and the wind direction may reflect effects of nearby moisture sources. The month of year and the hour of day can be significant features because fog frequency varies with both season and time of day (Fig. 3). Since the low-level cloud cover and cloud base height datasets from the ERA5 archive were produced at a coarse resolution of 0.25° × 0.25°, we applied bilinear interpolation to align them with the resolution of the downscaled ERA5-Land data.
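The derived features described above can be illustrated with a short sketch. The exact definitions belong to Table 1 and are not reproduced here, so the dictionary keys, the function name, and the sign conventions (dewpoint depression as air minus dewpoint temperature, surface-air difference as skin minus air temperature, wind direction as the meteorological "from" direction computed from U and V) are assumptions made for illustration.

```python
import math

def derive_features(hours):
    """hours: list of dicts ordered in time, one per hour, with keys
    't_air', 't_dew', 't_skin' (deg C), 'precip' (mm), 'u', 'v' (m/s),
    'month', and 'hour'. Returns one feature dict per hour from index 1 on,
    because the 1-h temperature change needs the previous record.
    """
    features = []
    for i in range(1, len(hours)):
        now, prev = hours[i], hours[i - 1]
        window = hours[max(0, i - 23): i + 1]  # up to the last 24 hourly records
        features.append({
            "t_air": now["t_air"],
            "dewpoint_depression": now["t_air"] - now["t_dew"],
            "surface_air_diff": now["t_skin"] - now["t_air"],
            "dT_1h": now["t_air"] - prev["t_air"],
            "p24": sum(h["precip"] for h in window),
            "wind_speed": math.hypot(now["u"], now["v"]),
            # Direction the wind blows from, in degrees clockwise from north.
            "wind_dir": (180.0 + math.degrees(math.atan2(now["u"], now["v"]))) % 360.0,
            "month": now["month"],
            "hour": now["hour"],
        })
    return features
```

Near the start of a series the 24-h precipitation sum covers fewer than 24 records; a stricter implementation might drop those rows instead.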
Table 1. Summary of feature variables used in training and evaluation.
d. Training, prediction, and evaluation
We divided the 6-yr dataset into two distinct segments: a training set spanning the years from 2017 to 2021 and a separate evaluation set for the year 2022. Instead of the common random splitting, we opted to preserve the temporal autocorrelation of atmospheric conditions. The tree-based ensemble classifiers were trained using the dataset from the training period for a chosen type of stations among mountainous, inland, and coastal sites. The trained models were then tested by predicting fog cases using feature data for the year 2022 at the same type of stations. To assess the models’ performance at the other types of stations not included in their training, all the feature data from 2017 to 2022 were used to predict fog cases.
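The splitting scheme above can be sketched as follows. This is a hypothetical helper, assuming each sample is a dict carrying a 'year' and a station 'type'; it returns the training set, the same-type evaluation set for 2022, and the full-period set for the station types that were not used in training.

```python
def split_samples(samples, trained_type, last_training_year=2021):
    """Split samples by period and station type instead of at random."""
    train, test_same_type, test_other_types = [], [], []
    for s in samples:
        if s["type"] != trained_type:
            test_other_types.append(s)   # untrained station types: all years
        elif s["year"] <= last_training_year:
            train.append(s)              # trained type, 2017-2021
        else:
            test_same_type.append(s)     # trained type, 2022
    return train, test_same_type, test_other_types
```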
To address the significant data imbalance issue where only a small portion (∼5% in fall) of the visibility data were labeled as fog events, we employed the synthetic minority oversampling technique (SMOTE; Chawla et al. 2002). As in Castillo-Botón et al. (2022), we generated additional synthetic fog labels by oversampling the existing fog instances within the training dataset. Subsequently, the XGB, CTB, and RF models were trained using a configuration consisting of 200 trees, a maximum depth of 10, and a learning rate of 0.2. The number of leaves in each tree was set to 64 for the LGB models, and the simple two-leaf tree was used as the base learner of the ADB model. Through trial and error, we found that increasing those hyperparameter values did not improve the models’ performance.
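To make the oversampling step concrete, the core idea of SMOTE (Chawla et al. 2002) is to place synthetic minority samples on line segments between an existing minority sample and one of its nearest minority neighbors. The sketch below is a simplified pure-Python illustration with hypothetical names; it is not the implementation used in the study, and it ignores practical details such as feature scaling and categorical features.

```python
import random

def smote_sketch(minority, n_synthetic, k=5, seed=0):
    """Generate synthetic minority samples by neighbor interpolation.

    minority: list of equal-length numeric tuples (at least two samples).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance.
        others = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )
        neighbor = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbor)))
    return synthetic
```

Only the training set would be oversampled in this way; the evaluation data keep their original class balance.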
Despite the advancements in the ERA5 and ERA5-Land reanalysis systems, it is important to acknowledge the potential presence of unidentified error sources, which could affect the predictive performance of the machine learning algorithms. We observed instances where the reanalysis relative humidity data dropped below 88%, while the ASOS data indicated near saturation (Fig. S2). These cases accounted for approximately 18% of fog occurrences at the ASOS stations. To mitigate potential issues caused by deficiencies in the reanalysis systems, we opted to exclude such cases in training and evaluation of the ensemble classifiers with the downscaled reanalysis data.
3. Results and discussion
a. Analysis of major feature variables
Figure 4 compares the relationships between observed fog probability and four major feature variables across the three station types in fall (September–November), when radiation fog dominates in mountainous sites. To quantify the fog probabilities, we counted fog cases within uniform bins for each feature variable. As expected, the dewpoint depression is a crucial factor for fog formation. As it approached zero (i.e., relative humidity approached 100%), fog probability increased exponentially in the mountainous and coastal sites. However, in the inland sites, the same dewpoint depression resulted in fewer fog occurrences. In fog cases, the 99th percentile of dewpoint depression was 1.4°C in mountainous sites, 1.4°C in inland locations, and 1.8°C in coastal areas. The higher dewpoint depression in coastal sites can be attributed partly to the gradual formation and dissipation of advection fogs during daytime hours when air temperatures are elevated.
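The binning procedure mentioned above (counting fog cases within uniform bins of a feature variable) can be sketched as follows. The function name, the bin width, and the choice to key each bin by its lower edge are assumptions for illustration.

```python
def binned_fog_probability(values, fog_flags, bin_width, start=0.0):
    """Return {bin_lower_edge: fog fraction} for uniform bins of one feature."""
    counts = {}
    for value, fog in zip(values, fog_flags):
        idx = int((value - start) // bin_width)
        total, fog_n = counts.get(idx, (0, 0))
        counts[idx] = (total + 1, fog_n + int(bool(fog)))
    return {
        start + idx * bin_width: fog_n / total
        for idx, (total, fog_n) in sorted(counts.items())
    }
```

Applied to dewpoint depression with a narrow bin width, the resulting fractions would trace a curve of the kind described for Fig. 4a; bins with few samples give noisy estimates and may need to be masked.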
In mountainous areas, during fog events, wind speeds were predominantly below 1 m s−1, underscoring the importance of low turbulence for the formation of radiation fog (Fig. 4b). While advection fog rarely formed in coastal sites with relatively high wind speed over 3 m s−1, fog probability tended to decrease with increasing wind speed at the gauging stations. Figure 4c also suggests that the atmospheric inversion may not always be a prerequisite for fog formation, especially in mountainous sites, where most fog events occurred with positive differences between the surface and air temperatures. While rainfall events appeared to lead to reduced visibility more frequently than nonprecipitation conditions, there was no clear correlation between 24-h precipitation depth and fog probability.
Aligning the downscaled reanalysis data with the fog occurrences at ASOS stations, we could observe that fog probability decreases with dewpoint depression of the reanalysis data (Fig. 5a). However, the fog probabilities were generally lower, and their rate of decrease was more gradual than in the ASOS observations. A similar gradual decrease in fog probability was also found with the reanalysis wind speed data (Fig. 5b), possibly posing challenges for the machine learning models in accurately distinguishing fog from nonfog cases. Contrary to the ASOS observations, fog cases were often observed when the reanalysis sensible heat flux data were negative (indicating atmospheric inversion), especially at inland sites (Fig. 4c). As shown by the ASOS feature data, fog probabilities were higher when the ERA5-Land P24 was positive, although there was no distinct correlation between the two.
b. Training and evaluation using the ASOS data
To begin, we trained the five tree-based classifiers using ASOS data from mountainous sites during the fall seasons (September–November) of 2017–21. These machine learning algorithms almost perfectly categorized fog and nonfog cases during the training period. We then applied the trained models to predict fog occurrences at the same mountainous sites for the fall season of 2022. Fog occurrences in inland and coastal sites were also predicted for the fall seasons of 2017–22, which were independent of the training process. Figure 6 illustrates that the machine learning algorithms performed better in capturing fog occurrences at mountainous sites than in inland and coastal areas. The F1 scores of the five classifiers were often higher than 0.5 at the mountainous sites, whereas they were notably lower at inland and coastal locations, mostly falling below 0.3.
The machine learning models’ poor performance at inland and coastal locations was not solely due to their training on mountainous-site data. We exclusively trained the models on inland and coastal sites for spring seasons (March–May), a period when fog events were more frequent in coastal areas than in mountainous sites. The F1 scores at inland and coastal sites often remained lower than those at mountainous sites, where the models were not trained (Fig. 7). This suggests that the ability of the tree-based classifiers to distinguish between binary fog cases might be influenced more by the presence of a predominant fog generation mechanism rather than the frequency of fog events.
Furthermore, when we trained the five models using the spring-season data from mountainous sites, their performance significantly improved at these trained locations, even though fog events were much rarer than in fall seasons. The models trained at mountainous sites still performed poorly in inland and coastal locations (Fig. S3). This finding also supports the view that the models’ predictability might be influenced by the presence of a dominant fog generation mechanism. Such dominance may lead to clearer and more distinct decision boundaries for the models’ decision trees. In contrast, while spring-season fog frequency in coastal and inland locations was comparable to that in mountainous sites, the machine learning models did not show robust performance even at the trained locations. The diverse fog generation mechanisms may introduce a higher degree of uncertainty into their decision boundaries.
The evaluation metrics in Figs. 6 and 7 fell short compared to previous machine learning classifications, such as the comprehensive evaluation by Castillo-Botón et al. (2022). In their research, fog episodes were successfully predicted at a single location in Spain with a half-hour lead, and the consequent F1 scores were higher than 0.7 with boosting algorithms and RF. The lower performance in this study may be in part attributed to the categorical purity and severe data imbalance of the trained datasets. Castillo-Botón et al. (2022) conducted their point-scale evaluations only with a 23-month dataset, while directly using visibility data as a feature variable to predict fog occurrences. Notably, around 30% of their visibility observations were below 1 km, indicating a substantial presence of fog events. On the other hand, fog cases in the training dataset of our study were less than 5%. This large data imbalance presents a challenge for the five models to effectively learn the minority class. In addition, our study involved multiple sites, which added complexity to the models’ classification.
As expected, the five models primarily categorized atmospheric conditions based on the degree of atmospheric saturation, timing of radiative cooling, and the presence of turbulence. Table 2 summarizes the importance of feature variables assessed through information gain. When predicting fall-season fog events, dewpoint depression was of the highest importance for RF, followed by the hour of day and wind speed. In contrast, the ADB model considered the rate of air temperature change as the primary feature, followed by dewpoint depression and the hour of day. XGB placed greater emphasis on dewpoint depression compared to RF, while LGB and CTB appeared to assign more evenly distributed importance to the feature variables compared to the other models.
Table 2. Relative importance of the feature variables of the five ensemble classifiers trained by the ASOS data at mountainous sites for fall seasons. The importance was evaluated by the information gain.
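As a reminder of what an information-gain-based importance measures, the sketch below computes the entropy-based gain of a single threshold split on one feature. It is a textbook illustration with hypothetical names, not the importance routine of any of the five libraries, whose gain definitions differ in detail (e.g., loss reduction in gradient boosting).

```python
import math

def entropy(labels):
    """Shannon entropy (bits) of a list of binary labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(1 for y in labels if y) / n
    return -sum(q * math.log2(q) for q in (p, 1.0 - p) if q > 0.0)

def information_gain(values, labels, threshold):
    """Entropy reduction obtained by splitting at value <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted
```

A feature's importance in a tree ensemble is then typically an aggregate of such gains over all splits that use the feature.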
c. Spatial upscaling of fog frequency using reanalysis data
When the five models were trained with the downscaled reanalysis dataset at mountainous stations, their predictive performance was generally lower than when they were trained by ASOS data (Fig. 8). The models’ FAR values increased and the F1 scores decreased at individual stations. Even after the screening of fog cases with low relative humidity, the observed fog cases could be classified with positive errors in the reanalysis dewpoint depression. This discrepancy in labeling may obscure the models’ decision boundaries, likely contributing to the reduced fog detectability and increased false positives.
The RF model, trained with the reanalysis data, exhibited relatively high true skill statistic (TSS) values in several inland and coastal sites. However, this came at the cost of a significant increase in FAR, leading to a considerable decrease in their F1 scores. In contrast, the three gradient boosting algorithms experienced declines in TSS, but with less FAR compared to RF at mountainous sites. The differences in FAR between RF and the boosting algorithms at mountainous sites imply that type-I errors were not reduced solely through bagging or random selection of feature variables, suggesting the need for a technique to effectively minimize misclassifications.
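The three scores discussed here can be computed from a binary confusion matrix. The formulas below are the standard definitions and are assumed rather than quoted from the study: FAR is taken as the false alarm ratio FP/(TP + FP), TSS as the hit rate minus the false alarm rate, and F1 as the harmonic mean of precision and recall. The function name is hypothetical.

```python
def fog_scores(observed, predicted):
    """Return F1, FAR, and TSS for two equal-length sequences of 0/1 labels."""
    pairs = list(zip(observed, predicted))
    tp = sum(1 for o, p in pairs if o and p)
    fp = sum(1 for o, p in pairs if not o and p)
    fn = sum(1 for o, p in pairs if o and not p)
    tn = sum(1 for o, p in pairs if not o and not p)

    def ratio(num, den):
        return num / den if den else 0.0  # guard against empty categories

    return {
        "F1": ratio(2 * tp, 2 * tp + fp + fn),
        "FAR": ratio(fp, tp + fp),
        "TSS": ratio(tp, tp + fn) - ratio(fp, fp + tn),
    }
```

Because nonfog hours dominate the data, many false alarms change the false alarm rate only slightly, so TSS can stay high while FAR rises and F1 falls, which matches the RF behavior described above.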
With the downscaled reanalysis data, we observed shifts in the importance of feature variables among the five models compared to the ASOS data (Table S1). The timing information gained increased importance in ADB, LGB, and CTB, while RF and XGB continued to prioritize dewpoint depression as the most important feature for categorizing fog cases. Despite the availability of direct low visibility indicators in the reanalysis dataset (e.g., low cloud cover and cloud base height), they did not play a more significant role than air humidity, timing, and surface turbulence information in the modeling process. Similar to the deficiency of numerical weather models discussed by Gultepe et al. (2017), the advanced ERA5-Land reanalysis system may also be susceptible to uncertainty sources related to the liquid water content of the atmospheric boundary layer.
Some of the TSS values displayed in Fig. 8 were comparable to those from kindred machine learning studies (e.g., Bari and Ouagabi 2020; Bartoková et al. 2015). However, the high FAR values across the locations suggest that the five models might tend to overestimate fog frequency when applied in different locations. Even when the five models were retrained with the dataset from every site for 2017–21, the high FAR values decreased only marginally (Fig. S4). Even though the availability of trained models and high-resolution data offers the potential to extend fog predictions to areas without visibility meters, a notable number of type-I errors were difficult to avoid, necessitating a bias correction process.
Figure 9a confirms the overestimated fog frequency for the fall season of 2022. The fog frequencies were estimated by the models trained on the full 2017–21 dataset from every monitoring site. Although the Pearson correlation coefficients (r) between observed and predicted foggy days were significant (0.39–0.81), the machine learning classifiers frequently generated unrealistically high numbers of foggy days, surpassing a third of the entire fall season. To make the trained models practically useful, the issue of overestimation driven by high FAR can be addressed by adjusting their voting thresholds. By default, the ensemble classifiers classify a fog instance when 50% of their decision trees indicate a fog case. However, raising this voting threshold for each model can help reproduce the mean foggy days of the observed locations.
For instance, we found that the RF model could match the observed mean foggy days at 16 sites (11.9 days) by considering a fog instance when 176 trees (instead of 100) out of 200 indicate fog. Similarly, the ADB, XGB, LGB, and CTB models reproduced the observed mean foggy days when 102, 122, 148, and 152 trees indicated fog, respectively. A higher voting threshold was required to address a higher FAR. Adjusting the voting thresholds led the five classifiers to improved predictions at the gauged locations, although some errors persisted (Fig. 9b). The postprocessing steps increased the Pearson r between observed and predicted foggy days to a range of 0.72–0.81.
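The threshold adjustment described above can be sketched as a simple search. This is an illustrative reconstruction under stated assumptions, not the study's procedure: each site is represented by a hypothetical list holding, for every day of the season, the maximum number of trees voting for fog in any hour of that day, and the search returns the smallest vote count whose predicted mean foggy days is closest to the observed mean.

```python
def mean_foggy_days(daily_max_votes_by_site, min_votes):
    """Mean number of days per site with at least min_votes fog votes."""
    totals = [
        sum(1 for votes in daily_votes if votes >= min_votes)
        for daily_votes in daily_max_votes_by_site.values()
    ]
    return sum(totals) / len(totals)

def calibrate_vote_threshold(daily_max_votes_by_site, observed_mean_days, n_trees=200):
    """Search vote counts from the default majority (n_trees // 2) upward."""
    best_votes, best_error = None, float("inf")
    for min_votes in range(n_trees // 2, n_trees + 1):
        error = abs(mean_foggy_days(daily_max_votes_by_site, min_votes) - observed_mean_days)
        if error < best_error:  # strict '<' keeps the smallest matching threshold
            best_votes, best_error = min_votes, error
    return best_votes
```

For classifiers that expose a fog probability rather than literal tree votes, the same search can be run over a probability cutoff between 0.5 and 1.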
Figure 10 illustrates the spatial distributions of mean foggy days across the study area for the fall season of 2022 predicted using the adjusted classifiers with ERA5-Land data. Without the postprocessing, fog frequencies were substantially overestimated, particularly when employing the RF model, which exhibited the highest FAR (Fig. 11). As expected, the trained classifiers produced consistently high fog frequencies in mountainous areas, where radiative cooling is a common occurrence, and surface water availability tends to be higher during the fall season. The RF model emphasized particularly high fog probability at mountain summits, while the ADB indicated relatively low frequency at the same locations. The XGB, LGB, and CTB models produced similar patterns that appeared to align with the observed fog frequency (Fig. 1c). All five models predicted low fog frequency around the urban stations, likely reflecting the influence of heat island effects.
Although the estimated foggy days seem to be within a reasonable range of 16–30 days at high elevations, it is important to acknowledge that the estimated fog frequencies at lower elevations could be subject to large uncertainties, given the low model performance at inland and coastal sites. Notably, the RF model had difficulty in capturing the higher fog frequency in western coastal sites compared to inland areas.
d. Discussion
Predicting fog episodes with traditional weather models has long been a challenge for several reasons. These models have struggled to resolve microphysical processes and parameterize them effectively (Gultepe et al. 2017; Koračin et al. 2014). Their coarse spatial resolution hinders their ability to accurately represent the turbulent atmospheric boundary layer (Gultepe et al. 2017). The inherent chaotic nature of atmospheric processes adds further complexity to fog prediction (Lorenz 1969). In response, modelers and practitioners have been exploring new methods to address these challenges and improve fog prediction.
Despite the previous success of tree-based classifiers in point-scale applications (e.g., Castillo-Botón et al. 2022), this study highlights that the complexity in fog generation mechanisms can pose challenges and limit the applicability of such classifiers. Our performance evaluation revealed that the five machine learning models performed better at mountainous sites, where radiation fog is the predominant fog generation mechanism. Surprisingly, even when originally trained on data from coastal and island sites for spring seasons, during which fog was less frequent at mountainous sites, the classifiers performed better in the mountainous area. This suggests that the quality and distinctiveness of the training data can outweigh the sheer quantity of data when it comes to classifying foggy conditions. While some cases in this study showed comparable performance to prior modeling studies, they were unlikely to achieve good performance without high purity of the original training datasets.
The availability of high-resolution atmospheric data and point-scale visibility observations offers potential for extending the trained relationship between fog occurrences and weather conditions to ungauged locations. However, potential error sources associated with the reanalysis system could significantly reduce the predictive skill of machine learning models, leading to biased fog frequency estimates and necessitating postprocessing steps. The findings from this study using ERA5-Land data suggest that rigorous training at specific point locations might not guarantee acceptable performance in fog prediction, in contrast to previous successes in hydrological upscaling of energy and water fluxes (e.g., Jung et al. 2019; Ghiggi et al. 2021). Classifiers trained with reanalysis datasets at locations with diverse fog generation mechanisms should be applied with great caution.
In addition, a substantial challenge arose from the severe data imbalance between fog and nonfog cases in the training dataset. Given that fog instances accounted for less than 5% of the data, the tree-based classifiers would encounter difficulties in effectively learning the patterns associated with infrequent fog occurrences. Although oversampling techniques were applied to address this data imbalance, achieving acceptable predictive skill became increasingly challenging as fog occurrences diminished over time. To address this issue, one can draw inspiration from the success of three-dimensional deep learning approaches applied to sea surface temperature for marine fog prediction (Kamangir et al. 2021). Similar applications can be explored and tested for land surfaces, moving beyond simple point-scale learning of foggy weather conditions.
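To make the imbalance issue concrete, the following minimal sketch generates synthetic fog samples by interpolating between neighboring minority-class samples, in the spirit of SMOTE (Chawla et al. 2002). The oversampling technique actually used is not detailed in this section, so the choice of a SMOTE-style scheme is an assumption, and the function name, feature pairs (relative humidity in %, wind speed in m/s), and sample counts are hypothetical.

```python
import math
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create `n_new` synthetic minority samples by interpolating between
    a randomly chosen sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by Euclidean distance to the base.
        others = sorted((p for p in minority if p is not base),
                        key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        t = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + t * (n - b)
                               for b, n in zip(base, neighbour)))
    return synthetic


# Toy data (made-up): 3 fog samples versus 60 nonfog samples (< 5% fog).
fog_samples = [(98.0, 0.5), (96.0, 1.0), (99.0, 0.3)]
n_nonfog = 60
new_fog = smote_like(fog_samples, n_new=n_nonfog - len(fog_samples))
balanced_fog_count = len(fog_samples) + len(new_fog)
print(balanced_fog_count, n_nonfog)
```

Because every synthetic point lies on a segment between two existing fog samples, this kind of oversampling balances the class counts without adding information beyond the observed fog cases, which is consistent with the difficulty noted above when fog occurrences become very rare.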
4. Conclusions
The adverse effects of fog episodes on natural and built environments emphasize the need for proactive risk management strategies, particularly in areas where visibility information is not readily accessible. While employing a machine learning model trained on observed data from a specific location can be a viable solution, the challenge lies in ensuring the temporal and spatial transferability of such an empirical model. In this study, conducted within a South Korean domain and using a downscaled reanalysis dataset, we aimed to address this challenge.
Here, we showed that the tree-based ensemble classifiers trained with synoptic-scale observations exhibited varying performance levels. They performed better in locations where a dominant fog generation mechanism prevailed. However, when fog generation mechanisms became diverse, the tree-based models appeared to struggle with determining their decision boundaries. The overall performance of the classifiers in the study area fell short when compared to previous studies employing similar machine learning models. This might be attributed to higher complexity in fog generation mechanisms and more severe data imbalance in our case study.
When using the downscaled reanalysis data, the five machine learning models performed worse than when they were trained with more direct observations. The potential errors in the reanalysis data could introduce significant uncertainty to the models’ decision boundaries, resulting in overestimation of fog frequency across the study area and necessitating additional postprocessing steps. While postprocessing measures may help reduce overestimation, caution is needed when interpreting the resultant frequency estimates, particularly in locations characterized by diverse fog generation mechanisms.
In summary, spatial upscaling of machine learning fog prediction models presents significant challenges due to the diversity of fog formation mechanisms, severe data imbalances, and potential errors in reanalysis data. Addressing these challenges may require innovative approaches beyond simple point-scale learning of foggy weather conditions in future studies.
Acknowledgments.
This study was supported by the Korea Agency for Infrastructure Technology Advancement (KAIA) grant funded by the Ministry of Land, Infrastructure and Transport (Grant RS-2021-KA162665). We declare no conflict of interest.
Data availability statement.
The point-scale atmospheric feature datasets and visibility observations are publicly available at https://data.kma.go.kr/cmmn/main.do, and the ERA5-Land data are available at https://cds.climate.copernicus.eu. The GMTED2010 digitized elevation data are freely available at https://www.usgs.gov/coastal-changes-and-impacts/gmted2010. For training the decision-tree-based classifiers, we used the scikit-learn Python package (https://scikit-learn.org/stable/index.html).
REFERENCES
Adhikari, B., and L. Wang, 2020: The potential contribution of soil moisture to fog formation in the Namib Desert. J. Hydrol., 591, 125326, https://doi.org/10.1016/j.jhydrol.2020.125326.
Allen, R. G., L. S. Pereira, D. Raes, and M. Smith, 1998: Crop evapotranspiration: Guidelines for computing crop water requirements. FAO Irrigation and Drainage Paper 56, 333 pp., http://www.climasouth.eu/sites/default/files/FAO%2056.pdf.
Bari, D., and T. Bergot, 2018: Influence of environmental conditions on forecasting of an advection-radiation fog: A case study from the Casablanca region, Morocco. Aerosol Air Qual. Res., 18, 62–78, https://doi.org/10.4209/aaqr.2016.11.0520.
Bari, D., and A. Ouagabi, 2020: Machine-learning regression applied to diagnose horizontal visibility from mesoscale NWP model forecasts. SN Appl. Sci., 2, 556, https://doi.org/10.1007/s42452-020-2327-x.
Bartoková, I., A. Bott, J. Bartok, and M. Gera, 2015: Fog prediction for road traffic safety in a coastal desert region: Improvement of nowcasting skills by the machine-learning approach. Bound.-Layer Meteor., 157, 501–516, https://doi.org/10.1007/s10546-015-0069-x.
Belorid, M., C. B. Lee, J.-C. Kim, and T.-H. Cheon, 2015: Distribution and long-term trends in various fog types over South Korea. Theor. Appl. Climatol., 122, 699–710, https://doi.org/10.1007/s00704-014-1321-x.
Bhardwaj, P., S. J. Ki, Y. H. Kim, J. H. Woo, C. K. Song, S. Y. Park, and C. H. Song, 2019: Recent changes of trans-boundary air pollution over the Yellow Sea: Implications for future air quality in South Korea. Environ. Pollut., 247, 401–409, https://doi.org/10.1016/j.envpol.2019.01.048.
Breiman, L., 2001: Random forests. Mach. Learn., 45, 5–32, https://doi.org/10.1023/A:1010933404324.
Breiman, L., 2017: Classification and Regression Trees. Routledge, 368 pp., https://doi.org/10.1201/9781315139470.
Castillo-Botón, C., D. Casillas-Pérez, C. Casanova-Mateo, S. Ghimire, E. Cerro-Prada, P. A. Gutierrez, R. C. Deo, and S. Salcedo-Sanz, 2022: Machine learning regression and classification methods for fog events prediction. Atmos. Res., 272, 106157, https://doi.org/10.1016/j.atmosres.2022.106157.
Chawla, N. V., K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, 2002: SMOTE: Synthetic Minority Over-sampling Technique. J. Artif. Intell. Res., 16, 321–357, https://doi.org/10.1613/jair.953.
Chen, T., and C. Guestrin, 2016: XGBoost: A scalable tree boosting system. Proc. 22nd ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, San Francisco, CA, Association for Computing Machinery, 785–794, https://www.kdd.org/kdd2016/subtopic/view/xgboost-a-scalable-tree-boosting-system.
Clements, C. B., C. D. Whiteman, and J. D. Horel, 2003: Cold-air-pool structure and evolution in a mountain basin: Peter Sinks, Utah. J. Appl. Meteor., 42, 752–768, https://doi.org/10.1175/1520-0450(2003)042<0752:CSAEIA>2.0.CO;2.
Cornejo-Bueno, L., C. Casanova-Mateo, J. Sanz-Justo, E. Cerro-Prada, and S. Salcedo-Sanz, 2017: Efficient prediction of low-visibility events at airports using machine-learning regression. Bound.-Layer Meteor., 165, 349–370, https://doi.org/10.1007/s10546-017-0276-8.
Daly, C., M. Halbleib, J. I. Smith, W. P. Gibson, M. K. Doggett, G. H. Taylor, J. Curtis, and P. P. Pasteris, 2008: Physiographically sensitive mapping of climatological temperature and precipitation across the conterminous United States. Int. J. Climatol., 28, 2031–2064, https://doi.org/10.1002/joc.1688.
Decesari, S., M. H. Sowlat, S. Hasheminassab, S. Sandrini, S. Gilardoni, M. C. Facchini, S. Fuzzi, and C. Sioutas, 2017: Enhanced toxicity of aerosol in fog conditions in the Po Valley, Italy. Atmos. Chem. Phys., 17, 7721–7731, https://doi.org/10.5194/acp-17-7721-2017.
Durán-Rosal, A. M., J. C. Fernández, C. Casanova-Mateo, J. Sanz-Justo, S. Salcedo-Sanz, and C. Hervás-Martínez, 2018: Efficient fog prediction with multi-objective evolutionary neural networks. Appl. Soft Comput., 70, 347–358, https://doi.org/10.1016/j.asoc.2018.05.035.
Fernández-González, S., P. Bolgiani, J. Fernández-Villares, P. González, A. García-Gil, J. C. Suárez, and A. Merino, 2019: Forecasting of poor visibility episodes in the vicinity of Tenerife Norte Airport. Atmos. Res., 223, 49–59, https://doi.org/10.1016/j.atmosres.2019.03.012.
Forthun, G. M., M. B. Johnson, W. G. Schmitz, J. Blume, and R. J. Caldwell, 2006: Trends in fog frequency and duration in the southeast United States. Phys. Geogr., 27, 206–222, https://doi.org/10.2747/0272-3646.27.3.206.
Freund, Y., and R. E. Schapire, 1997: A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci., 55, 119–139, https://doi.org/10.1006/jcss.1997.1504.
Gautam, R., and M. K. Singh, 2018: Urban heat island over Delhi punches holes in widespread fog in the Indo-Gangetic Plains. Geophys. Res. Lett., 45, 1114–1121, https://doi.org/10.1002/2017GL076794.
Ghiggi, G., V. Humphrey, S. I. Seneviratne, and L. Gudmundsson, 2019: GRUN: An observation-based global gridded runoff dataset from 1902 to 2014. Earth Syst. Sci. Data, 11, 1655–1674, https://doi.org/10.5194/essd-11-1655-2019.
Ghiggi, G., V. Humphrey, S. I. Seneviratne, and L. Gudmundsson, 2021: G-RUN ENSEMBLE: A multi-forcing observation-based global runoff reanalysis. Water Resour. Res., 57, e2020WR028787, https://doi.org/10.1029/2020WR028787.
Guijo-Rubio, D., P. A. Gutiérrez, C. Casanova-Mateo, J. Sanz-Justo, S. Salcedo-Sanz, and C. Hervás-Martínez, 2018: Prediction of low-visibility events due to fog using ordinal classification. Atmos. Res., 214, 64–73, https://doi.org/10.1016/j.atmosres.2018.07.017.
Gultepe, I., and Coauthors, 2007: Fog research: A review of past achievements and future perspectives. Pure Appl. Geophys., 164, 1121–1159, https://doi.org/10.1007/s00024-007-0211-x.
Gultepe, I., J. A. Milbrandt, and B. Zhou, 2017: Marine fog: A review on microphysics and visibility prediction. Marine Fog: Challenges and Advancements in Observations, Modeling, and Forecasting, D. Koračin and C. E. Dorman, Eds., Springer, 345–394.
Gultepe, I., and Coauthors, 2019: A review of high impact weather for aviation meteorology. Pure Appl. Geophys., 176, 1869–1921, https://doi.org/10.1007/s00024-019-02168-6.
Hersbach, H., and Coauthors, 2020: The ERA5 global reanalysis. Quart. J. Roy. Meteor. Soc., 146, 1999–2049, https://doi.org/10.1002/qj.3803.
Jung, M., and Coauthors, 2019: The FLUXCOM ensemble of global land-atmosphere energy fluxes. Sci. Data, 6, 74, https://doi.org/10.1038/s41597-019-0076-8.
Kamangir, H., W. Collins, P. Tissot, S. A. King, H. T. H. Dinh, N. Durham, and J. Rizzo, 2021: FogNet: A multiscale 3D CNN with double-branch dense block and attention mechanism for fog prediction. Mach. Learn. Appl., 5, 100038, https://doi.org/10.1016/j.mlwa.2021.100038.
Ke, G., Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T.-Y. Liu, 2017: LightGBM: A highly efficient gradient boosting decision tree. NIPS’17: Proc. 31st Int. Conf. on Neural Information Processing Systems, Long Beach, CA, Association for Computing Machinery, 3149–3157, https://dl.acm.org/doi/10.5555/3294996.3295074.
Köhler, C., and Coauthors, 2017: Critical weather situations for renewable energies—Part B: Low stratus risk for solar power. Renewable Energy, 101, 794–803, https://doi.org/10.1016/j.renene.2016.09.002.
Koračin, D., C. E. Dorman, J. M. Lewis, J. G. Hudson, E. M. Wilcox, and A. Torregrosa, 2014: Marine fog: A review. Atmos. Res., 143, 142–175, https://doi.org/10.1016/j.atmosres.2013.12.012.
Koziara, M. C., R. J. Renard, and W. J. Thompson, 1983: Estimating marine fog probability using a model output statistics scheme. Mon. Wea. Rev., 111, 2333–2340, https://doi.org/10.1175/1520-0493(1983)111<2333:EMFPUA>2.0.CO;2.
Kusch, E., and R. Davy, 2022: KrigR—A tool for downloading and statistically downscaling climate reanalysis data. Environ. Res. Lett., 17, 024005, https://doi.org/10.1088/1748-9326/ac48b3.
Lee, H.-K., and M.-S. Suh, 2019: Objective classification of fog type and analysis of fog characteristics using visibility meter and satellite observation data over South Korea. Atmosphere, 29, 639–658, https://doi.org/10.14191/Atmos.2019.29.5.639.
Lee, Y. H., J.-S. Lee, S. K. Park, D.-E. Chang, and H.-S. Lee, 2010: Temporal and spatial characteristics of fog occurrence over the Korean Peninsula. J. Geophys. Res., 115, D14117, https://doi.org/10.1029/2009JD012284.
Li, Y., F. Aemisegger, A. Riedl, N. Buchmann, and W. Eugster, 2021: The role of dew and radiation fog inputs in the local water cycling of a temperate grassland during dry spells in central Europe. Hydrol. Earth Syst. Sci., 25, 2617–2648, https://doi.org/10.5194/hess-25-2617-2021.
Lorenz, E. N., 1969: The predictability of a flow which possesses many scales of motion. Tellus, 21A, 289–307, https://doi.org/10.3402/tellusa.v21i3.10086.
Melhauser, C., and F. Zhang, 2012: Practical and intrinsic predictability of severe and convective weather at the mesoscales. J. Atmos. Sci., 69, 3350–3371, https://doi.org/10.1175/JAS-D-11-0315.1.
Miao, K.-C., T.-T. Han, Y.-Q. Yao, H. Lu, P. Chen, B. Wang, and J. Zhang, 2020: Application of LSTM for short term fog forecasting based on meteorological elements. Neurocomputing, 408, 285–291, https://doi.org/10.1016/j.neucom.2019.12.129.
Muñoz-Sabater, J., and Coauthors, 2021: ERA5-Land: A state-of-the-art global reanalysis dataset for land applications. Earth Syst. Sci. Data, 13, 4349–4383, https://doi.org/10.5194/essd-13-4349-2021.
Myers, D. E., 1982: Matrix formulation of co-kriging. J. Int. Assoc. Math. Geol., 14, 249–257, https://doi.org/10.1007/BF01032887.
Negishi, M., and H. Kusaka, 2022: Development of statistical and machine learning models to predict the occurrence of radiation fog in Japan. Meteor. Appl., 29, e2048, https://doi.org/10.1002/met.2048.
Oh, Y.-J., and M.-S. Suh, 2020: Development of quality control method for visibility data based on the characteristics of visibility data. Korean J. Remote Sens., 36, 707–723, https://doi.org/10.7780/kjrs.2020.36.5.1.5.
Pathak, J., A. Wikner, R. Fussell, S. Chandra, B. R. Hunt, M. Girvan, and E. Ott, 2018: Hybrid forecasting of chaotic processes: Using machine learning in conjunction with a knowledge-based model. Chaos, 28, 041101, https://doi.org/10.1063/1.5028373.
Prokhorenkova, L., G. Gusev, A. Vorobev, A. V. Dorogush, and A. Gulin, 2018: CatBoost: Unbiased boosting with categorical features. NIPS’18: Proc. 32nd Int. Conf. on Neural Information Processing Systems, Montréal, Canada, Association for Computing Machinery, 6637–6647, https://proceedings.neurips.cc/paper_files/paper/2018/file/14491b756b3a51daac41c24863285549-Paper.pdf.
Qiao, N., L. Zhang, C. Huang, W. Jiao, G. Maggs-Kölling, E. Marais, and L. Wang, 2020: Satellite observed positive impacts of fog on vegetation. Geophys. Res. Lett., 47, e2020GL088428, https://doi.org/10.1029/2020GL088428.
Runyan, C., L. Wang, D. Lawrence, and P. D’Odorico, 2019: Ecohydrological controls on the deposition of non-rainfall water, N, and P to dryland ecosystems. Dryland Ecohydrology, P. D’Odorico, A. Porporato, and C. W. Runyan, Eds., Springer, 121–137.
Shin, J.-Y., K. R. Kim, J. Kim, and S. Kim, 2021: Long‐term trend and variability of surface humidity from 1973 to 2018 in South Korea. Int. J. Climatol., 41, 4215–4235, https://doi.org/10.1002/joc.7068.
Smith, D. K. E., I. A. Renfrew, S. R. Dorling, J. D. Price, and I. A. Boutle, 2021: Sub-km scale numerical weather prediction model simulations of radiation fog. Quart. J. Roy. Meteor. Soc., 147, 746–763, https://doi.org/10.1002/qj.3943.
Steeneveld, G.-J., and M. de Bode, 2018: Unravelling the relative roles of physical processes in modelling the life cycle of a warm radiation fog. Quart. J. Roy. Meteor. Soc., 144, 1539–1554, https://doi.org/10.1002/qj.3300.
Suh, M.-S., S.-K. Hong, and J.-H. Kang, 2009: Characteristics of seasonal mean diurnal temperature range and their causes over South Korea. Atmosphere, 19, 155–168.
Tapiador, F. J., J.-L. Sanchez, and E. García-Ortega, 2019: Empirical values and assumptions in the microphysics of numerical models. Atmos. Res., 215, 214–238, https://doi.org/10.1016/j.atmosres.2018.09.010.
Tardif, R., and R. M. Rasmussen, 2007: Event-based climatology and typology of fog in the New York City region. J. Appl. Meteor. Climatol., 46, 1141–1168, https://doi.org/10.1175/JAM2516.1.
Taszarek, M., S. Kendzierski, and N. Pilguj, 2020: Hazardous weather affecting European airports: Climatological estimates of situations with limited visibility, thunderstorm, low-level wind shear and snowfall from ERA5. Wea. Climate Extremes, 28, 100243, https://doi.org/10.1016/j.wace.2020.100243.
Willmott, C. J., and K. Matsuura, 1995: Smart interpolation of annually averaged air temperature in the United States. J. Appl. Meteor., 34, 2577–2586, https://doi.org/10.1175/1520-0450(1995)034<2577:SIOAAA>2.0.CO;2.
Xu, T., and Coauthors, 2018: Evaluating different machine learning methods for upscaling evapotranspiration from flux towers to the regional scale. J. Geophys. Res. Atmos., 123, 8674–8690, https://doi.org/10.1029/2018JD028447.
Zeng, J., T. Matsunaga, Z.-H. Tan, N. Saigusa, T. Shirai, Y. Tang, S. Peng, and Y. Fukuda, 2020: Global terrestrial carbon fluxes of 1999–2019 estimated by upscaling eddy covariance data with a random forest. Sci. Data, 7, 313, https://doi.org/10.1038/s41597-020-00653-5.
Zhou, B., J. Du, I. Gultepe, and G. Dimego, 2012: Forecast of low visibility and fog from NCEP: Current status and efforts. Pure Appl. Geophys., 169, 895–909, https://doi.org/10.1007/s00024-011-0327-x.