1. Introduction
Radiation fog is a weather phenomenon whose visibility obstruction adversely affects business and private sectors (Fu et al. 2010; Bartok and Gera 2012). Although it is the most well-studied fog type, it is still not fully understood because of its complexity (Gultepe et al. 2007; Stolaki et al. 2015). Fog events are short, isolated episodes that are severely underrepresented in a long time series and whose variables (e.g., visibility, liquid water content) can vary with fog intensity (Bendix 2002; Kutty et al. 2019). Fog frequency can also vary from season to season (Durán-Rosal et al. 2018). The difficulty in forecasting this rare phenomenon lies in the large number of highly dynamic variables that influence the formation, development, and dissipation of fog (Gultepe et al. 2007; Pérez-Díaz et al. 2017). Radiation fog occurs particularly frequently on spring, autumn, and winter nights when the dewpoint is reached through strong radiative cooling in connection with high pressure areas (Bendix 2002; Pérez-Díaz et al. 2017). By definition, fog is present once visibility drops below 1 km (American Meteorological Society 2012; German Weather Service 2021). The intensity of the visibility reduction is controlled by the extent of the increase in droplet size and concentration (Thies et al. 2017; Pérez-Díaz et al. 2017). Additionally, fog has a strong influence on the energy balance (Price 2019). Physical and thermodynamical properties such as the strength of cooling or the intensity of turbulent mixing of the air can be crucial for the development of fog (Steeneveld and de Bode 2018; Price 2019). The chemical composition of the aerosol particles involved in fog formation and katabatic flows caused by orography are further factors that influence the formation of radiation fog (Gultepe et al. 2007; Pérez-Díaz et al. 2017). Radiation fog usually dissipates through the increase in air temperature after sunrise (Gultepe et al. 2007; Price 2019). At a certain point, the air is no longer saturated and can therefore absorb more water vapor, leading to the evaporation of fog droplets. A decrease in supersaturation can also be achieved by the entrainment of dry air or by drizzle formation and deposition. Either way, once the air is undersaturated, evaporation begins and the fog event ends. Thus, despite many years of research, it is still difficult to provide an accurate forecast (Pérez-Díaz et al. 2017; Bergot and Lestringant 2019; Price 2019).
Since all the described fog-influencing variables interact with each other but do not react linearly (Jacobs et al. 2007), changes in one parameter lead to a chain of complex interactions at high vertical and temporal resolution, making accurate radiation fog forecasts difficult. One approach is to use numerical weather prediction (NWP) models (Zhang et al. 2020; Smith et al. 2021). However, NWP models need to simulate the relevant atmospheric processes correctly in order to make a good forecast (Román-Cascón et al. 2019). The depth of knowledge needed for an accurate forecast has not yet been achieved (Bergot et al. 2005; Stolaki et al. 2015; Steeneveld and de Bode 2018), and, furthermore, this requirement comes with high computational cost (Krasnopolsky and Fox-Rabinovitz 2006). The generation of a forecast by an NWP model takes up to several hours and thus only partially covers the period of nowcasting (Urbich et al. 2020), which, according to the definition of the German Weather Service, is a very short-term forecast period of up to 6 h. While NWP has currently reached its limits in fog forecasting (Steeneveld et al. 2015; Román-Cascón et al. 2019; Zhang et al. 2020), machine learning (ML) models could offer an alternative by covering the nowcasting period or could supplement existing methods as part of an ensemble forecast. Because of the existing lack of knowledge about the interactions of the fog variables, an ML approach is chosen for this study that also covers the short-range forecasting gap of NWP models. ML models are able to address nonlinear interactions of variables automatically, without complete physical knowledge of the specific intervariable relations. Furthermore, a trained ML model offers predictions at short computation times (Fabbian et al. 2007; Palvanov and Cho 2019). However, ML models require a very precise evaluation because of the nontransparent nature of model prediction. Otherwise, there is a higher risk that the ML model adopts misguided priorities as its forecasting strategy under the cover of a good forecast (Harris 2014; Lapuschkin et al. 2019). Especially the training can be error-prone: for example, if autocorrelated training data are used (Scher and Messori 2018), the subsequent evaluation of the scores can be misleading. The variables used in fog forecasting are autocorrelated, and if autocorrelation is present, this information can be used by the model across multiple consecutive points in time (Carvajal et al. 2018; Pan et al. 2019). That can lead to an information leakage between chronologically consecutive training, test, and validation datasets, resulting in datasets that are not independent. This is why the split of the data during training and the split for the validation after training can influence the scores.
One commonly used method for evaluating ML models in radiation fog forecasting is to randomly draw data points or small sequences from the dataset (e.g., Zazzaro et al. 2015; Cornejo-Bueno et al. 2017; Ortega et al. 2019; Miao et al. 2020). This approach (cross validation; see Pedregosa et al. 2011) is prone to overestimating performance because of the long-term autocorrelation found in meteorological data (Pérez et al. 2004). Since subsequent data points in strongly autocorrelated datasets are often similar, the randomly drawn subsets are not independent, and the model performance can be overestimated. Therefore, a cross-validation (CV) method adapted to the requirements of time series should be chosen.
Furthermore, the sole evaluation of model performance, even if done with regard to autocorrelation, might not be sufficient to ensure a successful training, since the result depends on the intrinsic model performance as well as on the overall complexity of the dataset used for training. Thus, a well-suited model may seem to perform poorly on a particularly complex dataset, whereas a poor model may achieve good forecasts on a less complex dataset. For this reason, a standard for the dataset needs to be defined against which the quality of the prediction can be measured (Nurmi 2003). A baseline model is one solution to define such a standard. In the context of ML, a baseline model implements a trivial prediction strategy that the ML model of interest could also adopt (Vislocky and Fritsch 1997) when neglecting the underlying dynamics of the dataset. This approach therefore establishes a threshold for fog forecasting on the dataset used that allows for a meaningful evaluation, unveils weaknesses, and makes studies more comparable. However, most fog forecasting studies lack such a comparison of a trained model with a baseline model (e.g., Colabone et al. 2015; Herman and Schumacher 2016; Dewi and Harsa 2020; Miao et al. 2020).
Furthermore, several studies disregard the chronology inherent in fog forecasting (e.g., Fabbian et al. 2007; Kneringer et al. 2019; Cornejo-Bueno et al. 2020). In an operational fog forecasting application, all incoming data points are more recent than all the data points used to train the model. Therefore, to simulate the performance of an operational fog forecasting model, the test data points should always be the most recent data points. While this might be of minor influence, it is still a point worth considering.
For a good fog forecast it is of particular importance to correctly forecast the formation and dissipation of fog, that is, the transitions from fog-free to fog conditions and vice versa. Many publications concerned with fog forecasting use a pointwise classification-based method, meaning that classes (e.g., fog/nonfog) are forecast for each time point instead of continuous values. However, most publications concerned with pointwise classification-based fog forecast approaches evaluate their model performance only on the proportion of correctly classified data points (overall performance) (e.g., Dutta and Chaudhuri 2015; Bartoková et al. 2015; Guijo-Rubio et al. 2018; Cornejo-Bueno et al. 2020). By evaluating only the overall performance, the ability of the model to forecast the transitions, which usually constitute only a small proportion of data points, is mostly neglected. Yet, if the proportion of transitions is small, the model can miss all transitions, that is, simulate persistence behavior, and still achieve a good score. This can lead to fog forecasting models that perform well in the overall evaluation but fail to forecast transitions; fog forecasting is very vulnerable to this problem because there are few transition points. In addition, commonly used meteorological scores (Roebber 2009; Lagerquist et al. 2017) such as F1, the Heidke skill score, or the false alarm rate are based on a confusion matrix, which considers the false negative (FN), false positive (FP), true negative (TN), and true positive (TP) predicted points. In this case, negative represents the nonfog class and positive represents the fog class. Missing the fog formation means a false negative forecast; a missed fog dissipation corresponds to a false positive forecast. If the proportion of these missed transitions is small, the performance score will still be high. For this reason, common meteorological scores would not indicate the number of missed fog formations or dissipations. In short, a model performing well at predicting the future state is not necessarily able to predict transitions. Therefore, the forecasting of fog formation and dissipation requires special attention.
The aim of this study is to demonstrate the previously explained weaknesses in pointwise classification-based fog forecasting using the example of the ML algorithm extreme gradient boosting (XGB). Based on the forecast of the XGB model, it will be demonstrated that an ML model can show high scores in short-term fog forecasting but still fail in operational application. For this purpose, an application-oriented training and evaluation framework called the expanding window approach (EWA) has been implemented. It considers the abovementioned aspects, namely, a model training that accounts for autocorrelation and an assessment that identifies weaknesses not yet comprehensively considered in either the training or the evaluation of pointwise fog forecasting models. The above issues are addressed as follows:
- Two common CV methods are contrasted with the EWA. The influence of different CV methods and of neglecting the temporal order of the time series is investigated by comparing the CV methods' overall performance scores.
- Different meteorological scores are compared for the overall performance to investigate the problem of confusion matrix-based scores when the model simulates persistence behavior.
- The forecasting performance for the transitions is evaluated separately. It is tested whether, despite high overall scores, a fog forecasting model can still fail in forecasting fog formation and dissipation. This is examined using the XGB model trained with the EWA.
- Three baseline models, namely, a persistence model (PM), a climatology model (CM), and a logistic regression (LR) model, are implemented. The LR model is used to put the XGB model performance in context; the PM and the CM are used to put both the XGB and the LR model performance in context.
The article is structured as follows. Section 2 gives an overview of the variables used for fog forecasting and the two predicted classes (fog and nonfog). Section 3 explains the different data split variants and their combination into two workflows as well as the scoring and baseline methods. Section 4 contains the results and their discussion. The conclusion follows in section 5.
2. Data
The variables used for this study (Table 1) were observed at the ground truth and profiling station located in Linden-Leihgestern, Germany (50°31′58.67″N; 8°41′3.85″E). They were selected because of their fog-influencing and fog-characterizing properties (Gultepe et al. 2007; Haeffelin et al. 2010; Cornejo-Bueno et al. 2017; Steeneveld and de Bode 2018; Price 2019). Linden-Leihgestern is characterized by one of the highest fog occurrences in Germany (Bendix 2002) with a predominance of continental radiation fog. Continuous and comprehensive measurement of different variables at this site provides an excellent basis for radiation fog prediction and its analysis (Maier et al. 2012, 2013; Egli et al. 2015).
Measured and calculated variables used for model training. The variable air pressure provided by HLNUG was measured at the same location as the LCRS variables. The LCRS variables were averaged from 1-min resolution to 15-min resolution. The HLNUG variable air pressure was interpolated from 30-min resolution to 15-min resolution. Here, t2m = temperature inversion measured at 2 m, t10m = temperature inversion measured at 10 m, t = present time point, and t−3h is the time point 3 h in the past from present time point t.
The data used for the study (Table 1) were collected between 2011 and 2019 by the Laboratory for Climatology and Remote Sensing (LCRS) and the Hessian Agency for Nature Conservation, Environment and Geology (HLNUG). For training and evaluation, the months September to April were chosen, because they represent the radiation fog season. Short data gaps within the time series of each variable were filled by a Gaussian-based interpolation with attention to the variables’ autocorrelation. If the gap of at least one variable was close to or larger than the autocorrelation period in Table 2, the patchy time period was completely discarded.
Autocorrelation of the variables used in this study for fog forecasting. The affected period describes, for each variable, for how many minutes at least 80% of the information of one time point persists in the following time points. The last column displays the absolute number of affected consecutive data points for the 15-min-resolution dataset used in this study.
The fog and nonfog classifications are based on the 1-min visibility dataset. Classifying at this higher resolution before resampling prevents short-term strong fluctuations in visibility from causing a 15-min data point to be classified as a nonfog point through averaging, even though the majority of points within the interval belong to the fog category. Data points with a visibility equal to or less than 1 km were classified as fog points, the rest as nonfog points. To account for visibility fluctuations that do not represent the absence of fog, every short-term fluctuation with a visibility greater than 1 km within a fog event was reclassified as a fog point.
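A minimal pandas sketch of this labeling step, assuming a 1-min visibility series in meters with a datetime index; the tolerance for what counts as a short-term fluctuation (here 15 min) is an assumption, since the text does not state one:

```python
import pandas as pd

def label_fog(vis_1min: pd.Series, max_gap_min: int = 15) -> pd.Series:
    """Label 1-min visibility (m) as fog (1) or nonfog (0)."""
    fog = (vis_1min <= 1000).astype(int)        # fog if visibility <= 1 km
    run_id = (fog != fog.shift()).cumsum()      # number consecutive runs of one state
    runs = list(fog.groupby(run_id))
    # Runs alternate between states, so every interior clear run sits between
    # two fog runs, i.e., inside a fog event; flip short ones back to fog.
    for i in range(1, len(runs) - 1):
        _, run = runs[i]
        if run.iloc[0] == 0 and len(run) <= max_gap_min:
            fog.loc[run.index] = 1
    return fog
```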
Then the variables were resampled from their original resolution (see Table 1) to 15-min intervals and scaled using scikit-learn's RobustScaler. After preprocessing, a total of 90 138 data points, of which 7.55% were classified as fog points, were available, containing 303 complete fog events. The transitions amount to 2.5%, 4.36%, 5.87%, and 7.17% for the prediction periods of 60, 120, 180, and 240 min, respectively.
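The resampling and scaling step could then look as follows; `df_vars` (the variable table at original resolution) and `fog_1min` (labels from the sketch above) are hypothetical names, and the majority rule for the 15-min label follows the intent described above. In a strictly leakage-free setting the scaler would be fit on the training split only:

```python
import pandas as pd
from sklearn.preprocessing import RobustScaler

X = df_vars.resample("15min").mean()                        # 1-min -> 15-min means
y = (fog_1min.resample("15min").mean() >= 0.5).astype(int)  # majority of 1-min labels

scaler = RobustScaler()  # centers on the median, scales by the IQR (robust to outliers)
X_scaled = pd.DataFrame(scaler.fit_transform(X), index=X.index, columns=X.columns)
```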
Since the fog class accounts for only 7.55% of the data points, the dataset is heavily class imbalanced. If this imbalance between fog and nonfog points is ignored, training can be dominated by the nonfog class, which is usually reflected in a poor prediction of the underrepresented class (Abd Elrahman and Abraham 2013). Nevertheless, we chose not to resample the dataset, to preserve the fog climatology; the advantage is that the model can learn the dynamics and frequency of each class as they occur in reality. However, to keep the model from neglecting the underrepresented fog class, the false prediction of fog points was penalized twice as hard as the false prediction of nonfog points. Thus, the algorithm is forced to achieve a better prediction of the underrepresented fog class and can no longer disregard it. The penalty was implemented using the scale_pos_weight parameter of the XGBoost package and the class_weight parameter of sklearn.linear_model.LogisticRegression, respectively.
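In code, the penalty corresponds to the following settings; the factor 2 is taken from the text, and all other parameters are left at their defaults:

```python
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression

# A false prediction of the fog class (positive class, label 1) is penalized
# twice as hard as a false prediction of the nonfog class.
xgb_model = XGBClassifier(scale_pos_weight=2)
lr_model = LogisticRegression(class_weight={0: 1, 1: 2}, max_iter=1000)
```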
3. Methods
Several reasonable and independent data splits must be carried out: into a training set for model fitting, a validation set for internal CV, and a test set for model assessment, that is, external CV (Raykar and Saha 2015). Therefore, a nested CV was implemented. Initially, the data are split into k pairs of training and testing datasets (external CV). Then, within the kth training fold, the data are split into N training and validation pairs (internal CV). The internal CV is used for hyperparameter optimization (it determines the best model averaged over the N pairs). After model optimization, the model is retrained on the kth training fold and evaluated on the kth testing dataset. The external CV results are averaged to provide the final estimate of model performance.
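Schematically, the nested CV can be written as the following loop; `outer_splits`, `inner_splits`, `make_model`, and `param_grid` are placeholders for the split methods, model constructor, and hyperparameter grid described below and in the appendix:

```python
import numpy as np
from sklearn.metrics import f1_score

# Schematic nested CV. `outer_splits` yields k train/test index pairs;
# `inner_splits(train_idx)` yields N fit/validation position pairs within
# the kth training fold.
outer_scores = []
for train_idx, test_idx in outer_splits:                       # external CV
    best_params, best_score = None, -np.inf
    X_tr, y_tr = X[train_idx], y[train_idx]
    for params in param_grid:                                  # hyperparameter search
        scores = [f1_score(y_tr[val],
                           make_model(params).fit(X_tr[fit], y_tr[fit]).predict(X_tr[val]))
                  for fit, val in inner_splits(train_idx)]     # internal CV
        if np.mean(scores) > best_score:                       # best model averaged over N pairs
            best_params, best_score = params, np.mean(scores)
    model = make_model(best_params).fit(X_tr, y_tr)            # retrain on kth training fold
    outer_scores.append(f1_score(y[test_idx], model.predict(X[test_idx])))
final_estimate = np.mean(outer_scores)                         # averaged external CV score
```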
In fog forecasting the variables are autocorrelated (see Table 2). A CV method that does not take autocorrelation into account may result in overfitting during model training, performance overestimation during model testing, or even performance loss in operational mode. Furthermore, for the training and evaluation results to be transferable to an operational model, the temporal configuration of training and testing should match the operational mode.
To verify this statement, distinct split methods that consider the temporal order of the data points in different ways were implemented for nested CV. The chronological data split and the blockwise data split were implemented for external CV; StratifiedShuffleSplit, kFold, and TimeSeriesSplit were implemented for internal CV. The split method TimeSeriesSplit is part of the EWA and is integrated in workflow 1, which also includes the evaluation of the transitions. The second workflow combines StratifiedShuffleSplit and kFold CV. Details are given in the following subsections; for information about the hyperparameter set that was used, see the appendix.
a. Training and test data split
Two split variants were implemented for the division into training and test datasets (Fig. 1) to investigate the influence of temporal order and autocorrelation on the evaluation scores. The first is the chronological data split, which maintains the temporal order of the data points at all times by separating the data into sequential time blocks. As a result, the data points of the training block were always older than the data points of the test block in each of the five data splits. To vary the training and test datasets and thus test the reliability of the results, this split was performed five times, each time with shifted proportions: the ratio of training to test dataset shifts with each iteration step from 70%/30% to 90%/10%, based on a total of 90 138 data points.
Overview of the applied training and evaluation variants V1 and V2 in workflow 1 (Fig. 2, below) and workflow 2 (Fig. 3, below). Each split is based on a dataset of 90 138 data points.
The second split into training and test data is the blockwise data split, which divides the data into 10 equally sized blocks. Two of the 10 blocks served as test blocks, so the ratio of training to test dataset was always 80%/20%. The training and test blocks were rotated with each iteration, and each of the 10 blocks was used only once for testing.
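Both external split variants can be expressed as generators of train/test index pairs; `N` is the dataset size given above, and the intermediate proportions of the chronological split are an assumption consistent with five iterations between 70%/30% and 90%/10%:

```python
import numpy as np

N = 90138  # total number of 15-min data points

def chronological_splits(n, fractions=(0.70, 0.75, 0.80, 0.85, 0.90)):
    """CDS: training data are always older than test data; the 5% steps are
    an assumption consistent with five iterations from 70/30 to 90/10."""
    for f in fractions:
        cut = int(n * f)
        yield np.arange(cut), np.arange(cut, n)

def blockwise_splits(n, n_blocks=10, test_blocks=2):
    """BDS: ten equal blocks, two rotating test blocks per iteration (80/20),
    so each block is used exactly once for testing."""
    blocks = np.array_split(np.arange(n), n_blocks)
    for i in range(0, n_blocks, test_blocks):
        test = np.concatenate(blocks[i:i + test_blocks])
        train = np.concatenate(blocks[:i] + blocks[i + test_blocks:])
        yield train, test
```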
b. Internal CV split methods
1) StratifiedShuffleSplit
The hold-out method (Yadav and Shukla 2016) divides the dataset by randomly drawing individual points. With this method, even the split of small datasets can be varied for training and evaluation. However, underlying temporal dynamics in the data are largely retained across the resulting subsets. In the case of fog data, this constitutes a weakness for the formation of a training and test dataset, since the variables often have a strong autocorrelation, as can be seen in Table 2 for the variables used in this study. Consequently, drawing isolated points from a dataset does not generate an independent test dataset, because of the high autocorrelation between subsequent points in many variables. To calculate the scores of such an approach, the StratifiedShuffleSplit from the scikit-learn library was used. This method shuffles the data points and produces randomized training and validation blocks in which the ratio between the fog and nonfog classes remains the same (Fig. 1). The ratio between training and validation dataset is always 80/20.
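In scikit-learn this reads as follows; the number of internal splits (here five) is an assumption:

```python
from sklearn.model_selection import StratifiedShuffleSplit

# Randomized 80/20 draws that keep the fog/nonfog ratio identical in both sets.
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
for fit_idx, val_idx in sss.split(X_train, y_train):
    pass  # fit on fit_idx, validate on val_idx
```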
2) kFold
Another common method is the kFold approach (Yadav and Shukla 2016). The dataset was divided into five equally sized blocks for training and evaluation (Fig. 1). The split into training and validation datasets was shifted in time to achieve variation between the datasets, and the chronology of the data points was kept. Therefore, the information leakage of the variables between the training and validation datasets should be reduced relative to the StratifiedShuffleSplit. However, the scikit-learn documentation notes (evaluating estimator performance; see Pedregosa et al. 2011) that kFold should only be used with independent and identically distributed data; otherwise, there may be a problem with autocorrelation. Since this is not the case for a fog time series, kFold should not be the preferred method of choice; otherwise, the model training takes place on a poorer basis than would actually be available from the data at hand.
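The unshuffled variant used here can be sketched as:

```python
from sklearn.model_selection import KFold

# shuffle=False keeps the chronological order: each validation fold is one
# contiguous block that rotates through the series.
kf = KFold(n_splits=5, shuffle=False)
for fit_idx, val_idx in kf.split(X_train):
    pass  # fit on fit_idx, validate on val_idx
```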
If the operational forecasting mode is to be simulated in a study, the validation and test datasets should also always lie temporally after the training data. If the temporal order between these sets is altered, the operational mode is only simulated if one proves that data points from the past behave like future data. Randomizing data points or rotating entire blocks can oversimplify the prediction problem, since the model no longer needs to predict future data but rather test data points that lie between training data points. The task then changes to a certain extent from extrapolation to interpolation and contradicts the task of the operational model. Neither kFold nor StratifiedShuffleSplit takes this property of time series data into account. Consequently, results from such a validation cannot necessarily be transferred one-to-one to an operational model.
3) Time series split
Neither the StratifiedShuffleSplit nor the kFold CV meets the desired requirements of a model training that accounts for autocorrelation and keeps the temporal order of the data points unchanged. Therefore, to meet these requirements, the time series cross-validator TimeSeriesSplit was implemented for internal CV.
The training dataset was divided into six chronological time blocks of equal size (Fig. 1) to form training and validation sets. In each training round the validation set was more recent than the training data and corresponds to 1/6 of the respective training dataset. With each training round the training dataset grew by 1/6, while the validation set, fixed in size, moved forward in time in direct succession to the training set. TimeSeriesSplit thus maintains the chronological order of the data points at each iteration step and simulates a growing training dataset.
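In scikit-learn, n_splits=5 produces exactly the six equally sized blocks described above:

```python
from sklearn.model_selection import TimeSeriesSplit

# Round 1 trains on block 1 and validates on block 2; round 5 trains on
# blocks 1-5 and validates on block 6. The validation block is always the
# most recent data, and the training window grows by one block per round.
tss = TimeSeriesSplit(n_splits=5)
for fit_idx, val_idx in tss.split(X_train):
    pass  # fit on fit_idx, validate on val_idx
```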
c. Workflow 1—Expanding window approach
To meet the requirements for an application-oriented training with minimal influence of autocorrelation, the chronological data split and the TimeSeriesSplit method were combined into the EWA (Fig. 2). The five training datasets generated from the chronological data split were used to train five separate XGB and five separate LR models for each forecasting period. Averaging the forecasts of the five models yields the internal and external CV scores, respectively.
Workflow 1: applied EWA for training and evaluation of the XGB fog forecasting model including the two baseline models (PM and CM) for the evaluation of the XGB model’s internal and external CV as well as the evaluation of the fog transitions.
During training, the TimeSeriesSplit method was applied for the internal CV to minimize the influence of autocorrelation not only during external CV but also during model training. After each model training, the external CV was performed using the relevant test dataset. This training and test procedure was repeated for five iterations. During model training, internal CV scores are generated based on the validation dataset; during external CV, the external CV scores are generated based on the test dataset. These scores are shown in the overall evaluation. An LR model was trained and evaluated in the same way to put the results of the XGB model into context.
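A single EWA iteration could be assembled as follows; `cut` marks one of the five chronological train/test boundaries, and the parameter grid shown is illustrative only (the actual grid is given in the appendix):

```python
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from xgboost import XGBClassifier

X_tr, y_tr, X_te, y_te = X[:cut], y[:cut], X[cut:], y[cut:]  # chronological split

search = GridSearchCV(
    XGBClassifier(scale_pos_weight=2),
    param_grid={"max_depth": [3, 6], "n_estimators": [100, 300]},  # illustrative
    cv=TimeSeriesSplit(n_splits=5),  # internal CV keeps the temporal order
    scoring="f1",
    refit=True,                      # retrain the best model on the full training fold
)
search.fit(X_tr, y_tr)
internal_f1 = search.best_score_                     # internal CV score
external_f1 = f1_score(y_te, search.predict(X_te))   # external CV score
```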
Using the trained XGB model, the forecast performance for fog formation and dissipation, that is, the transitions, is also evaluated based on the test dataset in a second evaluation step. The evaluation of the transitions will complement the results of the overall evaluation.
The EWA has several advantages. A growing dataset is simulated, as would occur in the application of an operational model, and the temporal order and dynamics of the data points are maintained. The effect of autocorrelation between training, validation, and test sets is kept to a minimum, since the block size is larger than the autocorrelation length of the variables. Thus, the independence of validation and test data is ensured at both levels of CV. The EWA is also applicable to many different types of ML models; in this way the strengths of different ML models could be combined (Ahmed et al. 2010) into an ensemble forecast (Herman and Schumacher 2016). In addition, the EWA offers a flexible, changing data split during XGB model training that can achieve a good balance for hyperparameter selection. The result of the model training is strongly dependent on the choice of split size (Larsen and Goutte 1999; Reitermanova 2010; Yadav and Shukla 2016). However, it is impossible to know in advance which data split is the optimal one, as the training success depends on many factors (Gu et al. 2016). One focal parameter is the size of the training dataset in terms of the model's ability to generalize, although models trained on small as well as on large datasets can be prone to overfitting (Dos Santos et al. 2009; Jabbar and Khan 2015). Since the training window grows and the best average hyperparameter set is chosen in the internal CV, the EWA maintains considerable flexibility, as the data split varies at both levels.
d. Workflow 2—Commonly used split methods
To enable a comparison of the influence of the different split methods and data shuffling on the training and evaluation, a second workflow was implemented. The second workflow (Fig. 3) includes the chronological data split (CDS) and the blockwise data split (BDS). The training datasets from these two split methods were used to train five XGB models with kFold CV and five XGB models via StratifiedShuffleSplit CV for each prediction period. The average of the forecasts from each of the five models results in the internal and external CV score, respectively.
Workflow 2: applied training and evaluation variants (V1 and V2) of the XGB and LR fog forecasting models including PM for the evaluation of the XGB model’s internal and external CV.
The training run and the determination of the internal and external CV scores are identical to workflow 1. The difference between the two workflows is that the EWA (workflow 1) maintains the chronological order of the data points in both internal and external CV. In workflow 2 the CDS maintains the temporal order of the data points for external CV, but for internal CV the temporal order is changed by the methods StratifiedShuffleSplit and kFold. The BDS changes the temporal order during both external and internal CV.
To summarize, the CDS takes autocorrelation as well as chronology into account during external CV but not during internal CV. Autocorrelation and chronology are taken into account least by the BDS. The influence of neglecting chronology and autocorrelation is therefore estimated to be highest for training with the BDS, slightly lower for training with the CDS from workflow 2, and lowest for training with the EWA.
e. XGB model
The EWA shown above was applied to a deterministic XGB model to forecast the two states fog and nonfog based on the variables shown in Table 1. Tree-based models like XGB link the given variables repeatedly over several trees in different configurations and thus perform the fog forecast; in this way, the important interactions that occur during radiation fog can be considered. XGB is a promising algorithm for forecasting (Chen and Guestrin 2016) that outperforms other tree-based models such as random forests and can in some cases even provide predictions as good as those of neural networks (Sheridan et al. 2016; Kumari and Toshniwal 2021). For the training of the XGB model, the Python package XGBoost was used.
f. LR model
The LR model is intended to provide a baseline for the prediction of the XGB model. The LR model is supposed to show that the explained weaknesses in an ML-based forecast are not related to the XGB algorithm but can be a general problem when using ML models. The LR model is used to perform a binary classification. For this purpose, the input values are mapped to a range between 0 and 1 according to their probabilities of belonging to one of the two classes. Values smaller than 0.5 are assigned to one class, and values larger than 0.5 are assigned to the other class. The function LogisticRegression from sklearn.linear_model has been used.
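The probability mapping and the 0.5 threshold correspond to the following sketch (equivalent to calling predict directly); X_tr, y_tr, and X_te stand for the training and test data of the current split:

```python
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(class_weight={0: 1, 1: 2}, max_iter=1000)
lr.fit(X_tr, y_tr)
p_fog = lr.predict_proba(X_te)[:, 1]  # probability of belonging to the fog class
y_pred = (p_fog > 0.5).astype(int)    # 0.5 decision threshold
```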
g. Baseline models
Two baseline models were implemented (Figs. 2 and 3) to identify weaknesses in both the training and the evaluation of the XGB model. The first baseline model is called PM and is used to evaluate the overall fog forecast, meaning how many data points were correctly predicted. The PM repeatedly predicts the input data point (see Fig. 4); that is, it pretends that the weather in the future will be the same as the current weather (Vislocky and Fritsch 1997). As long as no change in the fog state occurs, the PM will always make a correct prediction. However, as soon as a transition from fog to nonfog, or vice versa, appears, the PM cannot provide a correct prediction. The PM represents a simple solution that the XGB model could learn if it neglects the underlying dynamics of the data and only memorizes the data points.
Functionality of PM. The forecasting period in this example is 1 h, i.e., one step into the future. Prediction tpred of the PM is based on each data point t of the validation set during training or the test set during overall evaluation, respectively. The PM misses any transition from fog to nonfog and vice versa because it predicts the model input as output. Consequently, there is a mismatch between model input and model output at every transition point. Every other point is predicted correctly.
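As a sketch, the PM for a 60-min horizon on 15-min data reduces to a shift of the label series; `y` is assumed to be the fog/nonfog labels as a NumPy array:

```python
import numpy as np

steps = 4                    # 60-min horizon at 15-min resolution (4 steps ahead)
y_pred = y[:-steps].copy()   # PM forecast for t + 4 steps is simply the state at t
y_true = y[steps:]           # observed state 60 min later
# Every disagreement between y_pred and y_true is, by construction, a transition
# that the PM missed; all other points are predicted correctly.
n_missed = int(np.sum(y_pred != y_true))
```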
The F1 scores (see section 3h) of the PM are obtained from the validation dataset during internal CV and the test dataset during external CV (Figs. 2 and 3). As for the XGB scores, the mean value of the F1 score was calculated from the five external and five internal CVs, respectively, for each prediction period.
The internal CV score is necessary to assess whether the XGB model has learned more than memorizing data points. The score for external CV tells whether the XGB model can generalize on the test dataset. Therefore, the PM provides a threshold value that must be exceeded at least for the internal CV to tell whether the model training worked at all, and the learning effect was greater than simply passing on data points (“the weather stays the same”).
In addition to evaluating the overall XGB model performance, it is also important in fog forecasting to examine specifically the XGB forecast of fog formation and fog dissipation, that is, the transitions between fog and nonfog conditions. Since predicting these transitions is the most important requirement placed on the fog forecasting model, this second evaluation step is necessary. It checks whether the times of fog formation and dissipation are predicted correctly and thus whether the model would really hold up as an operational forecast system.
Since the PM cannot forecast transitions by definition, it does not provide a sensible baseline for the evaluation of the transitions. Therefore, another baseline model, named CM, was implemented. The CM forecasts the future state based only on the distribution of states present in the training dataset, without considering any other information. In this study we used the averaged state distribution across the five training splits (Fig. 2) as the basis for the CM. The fog state is predicted with a probability of 7%, as this corresponds to its share of the total training dataset (Fig. 5); the same applies to the nonfog state, with a forecasting probability of 93%. For our data this means that a real fog point has a 7% probability of being classified as a fog point and a 93% probability of being classified as a nonfog point, and the same is true for a real nonfog point. This way the average confusion matrix can be directly calculated from the state distribution in the training dataset.
Derivation of CM scores to evaluate the XGB model. According to the proportions of fog and nonfog points in the training dataset, the probabilistic prediction for each class is determined.
Since the CM forecast only depends on the state distributions in the training dataset the average confusion matrix is the same for both transitions (Fig. 5). This means that for our data fog to nonfog transitions will be correctly forecast in 93% of the cases and in turn nonfog-to-fog transitions are correctly predicted 7% of the time. These values serve as a baseline for the performance of the XGB model with regard to the correct forecast of the transitions. In other words, if the XGB model outperforms the CM the XGB model offers a better forecast than always predicting the climatological event frequency.
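A sketch of how the CM baseline values follow from the class shares alone (7%/93% in our training data):

```python
import numpy as np

p_fog = 0.07  # fog share of the training dataset; the nonfog share is 1 - p_fog

# The CM predicts fog with probability p_fog irrespective of the true state, so:
p_correct_formation = p_fog          # nonfog -> fog must be forecast as fog:    7%
p_correct_dissipation = 1 - p_fog    # fog -> nonfog must be forecast as nonfog: 93%

def expected_confusion(p_fog: float, n_fog: int, n_nonfog: int) -> np.ndarray:
    """Expected confusion matrix [[TN, FP], [FN, TP]] of the CM."""
    return np.array([
        [n_nonfog * (1 - p_fog), n_nonfog * p_fog],
        [n_fog * (1 - p_fog),    n_fog * p_fog],
    ])
```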
h. Scoring
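The overall scores reported below are based on the confusion matrix introduced in section 1, with fog as the positive class. The primary metric is the F1 score, which, as a reminder of its standard definition, is the harmonic mean of precision, TP/(TP + FP), and recall, TP/(TP + FN), and can be written as F1 = 2TP/(2TP + FP + FN). A perfect forecast yields an F1 score of 1. For comparison, additional confusion matrix-based scores such as the Heidke skill score and the true skill score are shown in Fig. 8.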
4. Results and discussion
a. Overall evaluation
When it comes to evaluating a model, it is necessary to check whether the model can successfully generalize (internal CV ≈ external CV) and whether its performance is at least on par with a baseline solution. Figure 6 shows the overall evaluation of the forecasts of the XGB model and the LR model. Four forecast periods (60, 120, 180, and 240 min) and three performance score groups are displayed. The performance score groups result from the different split methods in workflow 1 (EWA) and workflow 2 (CDS and BDS).
Overall evaluation: (a) the internal (diamonds) and external (circles) CV F1 scores of the forecast of the XGB and LR models. The forecasting period ranges from 60 to 240 min. The scores are grouped by forecasting period and the expanding window approach EWA (workflow 1), the CDS with subsequent training using kFold and StratifiedShuffleSplit (workflow 2—variant 1) and the BDS with subsequent training using kFold and StratifiedShuffleSplit (workflow 2—variant 2). The black symbols represent the PM that serves as a baseline for the internal and external CV. (b) Internal CV and (c) external CV, showing the percent deviation of the variant’s evaluation scores from the EWA XGB evaluation scores in (a).
Overall, the performance decreases with the length of the prediction period, with the BDS displaying the seemingly highest performance. The F1 scores of the XGB and LR models drop from around 0.8 for the 60-min forecast to around 0.45 for the 240-min forecast. Looking at the internal CV scores (diamonds) of the XGB and LR models, the models score similar to or above the respective baseline regardless of performance score group and prediction period. Furthermore, for the XGB model the external CV scores (circles) are similar to the internal CV scores (diamonds) and at least on par with the baseline, indicating the model's ability to generalize. The LR model, in contrast, fails to outperform the baseline for the EWA and CDS methods in most cases.
Comparing the performance evaluation groups EWA and CDS shows that the internal CV scores for CDS are higher than for EWA (e.g., up to an 18.41% difference; Fig. 6b). For both groups, the external CV yields similar results for the different split methods regardless of prediction period. Since the external CV scores are more indicative of the model's performance in an operational test case, this shows that the EWA CV results offer a more realistic performance evaluation. The performance of CDS is overestimated more by the StratifiedShuffleSplit than by kFold-based CV, indicating an influence of the autocorrelation present in the data. These discrepancies, both between the EWA XGB and CDS groups and within the CDS group, grow larger as the length of the prediction period increases.
BDS produced the highest internal and external CV scores in comparison with EWA and CDS: up to 16.41% higher for internal CV (Fig. 6b) and 12.51% higher for external CV (Fig. 6c) relative to EWA XGB. However, it is important to note that the interpretation of this result is different. The external CV scores of EWA and CDS measure the performance of the model in a manner that resembles the actual operational case. The performance scores of BDS measure the performance on a test dataset that shares more similar internal statistics (fog frequency, duration, etc.) with the training dataset, because both are randomly drawn. The similar baseline performance between the internal and external scores of BDS is a further indication of this. However, looking at EWA and CDS, and in particular at the baseline performance in these groups, the assumption that new data have the same internal statistics as the training data might not necessarily hold true, at least in the short term. Furthermore, the aforementioned autocorrelation-caused overestimation of model performance by the StratifiedShuffleSplit is also present in BDS.
Considering the high F1 score of the PM, especially in the very short-term forecast, the necessity of a baseline becomes clear. The internal and external CV of the PM show an F1 score of about 0.80 for the 1-h forecast. The high score of the PM, which simply outputs the input data points, leaves doubts about the functionality of the XGB model. Since the PM misses every transition (fog formation and dissipation) in the time series, it is a nonfunctional model when applied to fog forecasting. However, this fact is not apparent from the total score alone, which makes a baseline necessary for interpreting pointwise predictions. This inability to predict transitions could also apply to the XGB model. Therefore, if an evaluation score is not set in relation to the baseline of a model that implements a simple solution, no statement can be made about its forecast power.
Other studies (McCandless et al. 2016; Dietz et al. 2019; Kneringer et al. 2019) also show that persistence models can have extremely low error values in the short-term forecast range, especially in the 1–2-h range. Conversely, if no baseline is used, a fog forecast model could erroneously suggest high forecast power despite actually mimicking persistence behavior. As shown in McCandless et al. (2016), a baseline model also reveals whether the dataset is large enough and the minimum variable set is met for the model to learn successfully. Since the error values of a persistence model can be particularly low in the short-term forecast, it is of great importance to benchmark against a baseline model for a thorough assessment in terms of overall model performance, dataset size, and variable selection. Furthermore, to properly evaluate the forecast power of the XGB model, it is interesting to see how the model performs in predicting the transitions, since the transitions are the main target of a good fog forecast.
b. Transitions
Figure 7 shows the proportion of correctly forecast fog formation and dissipation data points of the XGB model in comparison with the CM and PM for forecasting periods of 60, 120, 180, and 240 min during external CV. In the case of fog formation, the CM stays constant at 7%, whereas the forecast correctness of the XGB model increases slightly with progressing prediction period. For fog dissipation, the CM consistently predicts 93% of the data points correctly, whereas the XGB model shows a sharp increase in correctly forecast data points from 15% to 50% as the prediction period increases.
Validation for the transitions of the XGB, CM, and PM models.
Despite the high scores of the overall model in the 1–2-h forecast (Fig. 6), the XGB model as well as the PM fail at predicting the transitions in Fig. 7. In both the prediction of fog formation and that of fog dissipation, the XGB model performs worse than guessing. Yet this fact is not evident from the overall score, because transitions occur very rarely in fog prediction and are not treated as a separate class during either training or evaluation. In general, fog points are scarce in relation to the total dataset; the percentage of transitions is even much lower. In the entire dataset, 303 complete fog events occur, and the proportion of transitions relative to the total dataset is thus only 2.5% for the 60-min forecast period. This low proportion means that a high overall score can be achieved in the evaluation even though the model merely copies points based on distinctive variables and neglects underlying dynamics.
When comparing the overall model scores (Fig. 6) and the scores of the transitions (Fig. 7), it also becomes clear that the baseline must be adapted to the respective requirement to represent a legitimate threshold value. The PM, which is a realistic baseline for evaluating the overall model, cannot predict transitions.
This shortcoming is not due to the F1 score itself. This score is very well suited for the task, since both false alarms and oversights are penalized. Instead, the problem lies in the question that this score is supposed to answer. As can be seen in Fig. 8, every score based on overall true/false positive/negative values, such as the Heidke skill score (Heidke 1926) or even the true skill score (Bloomfield et al. 2012; Miao et al. 2020), would suffer from this circumstance. All scores imply a good overall performance in the very short-term forecast although only persistence behavior is simulated.
Comparison of external CV scores of the prediction of the XGB for different meteorological scores.
The same hindrance that applies to the PM in terms of evaluating the transitions applies to the CM in terms of evaluating the overall forecast of the XGB model. Relative to the overall dataset, the fraction of fog points is small: fog would be predicted with a probability of about 7%, depending on the composition of the training and test dataset. The CM, and therefore the baseline, would have a very low F1 score of 0.13. Consequently, this kind of baseline is not suited to the question of how well the overall XGB model is working. When looking at the transitions, however, the requirement to outperform guessing increases with this baseline. Thus, to really make a statement about how well the XGB model is working, a second baseline is necessary.
c. Differences in forecasting period performance of transitions
Another conspicuous feature of the transitions (Fig. 7) becomes apparent as the prediction period is extended. In the 60-min range, the prediction of the transitions is weak for both classes. This is an indication that the XGB model largely ignores transitions for short forecast periods. However, from the 120-min forecast on, both the fog formation and the dissipation forecast improve. In particular, the prediction of fog dissipation improves substantially the further in the future the time point to be predicted is located. This progression cannot be explained superficially by guessing, which could randomly hit more transitions because of an increasing decoupling from the current fog state. On the one hand, an increase of this magnitude is not found in the prediction of fog formation, and it is very unlikely that only the dissipation phase would benefit from such an influence. This argument is also invalidated by looking at the constant state in Fig. 9. The constant state represents all data points where no transition needs to be predicted within the forecast period, so, for example, in Fig. 4 from data point t2 to
Proportion of correctly predicted fog and nonfog points for which no change in form of fog formation or dissipation should be forecast (i.e., fog point should predict fog point and nonfog point should predict nonfog point for a correct prediction).
Another reason for the improved prediction of the transitions is the increase in the number of transition points as the prediction period is extended. Figure 10 shows this in simplified form for a time series with 1-h resolution. When predicting 1 h into the future, there is one input data point that could correctly predict a transition; the data points before and after fall under the constant state category. For the 2-h forecast on the same dataset, there are two input data points predicting a transition, since not only the last fog data point but also the second-to-last one predicts a transition. With each extension of the prediction period, the number of transitions to be predicted increases for both fog formation and fog dissipation, and so do the chances of correctly predicting a transition. On the other hand, the chances of achieving a correct prediction by simply assuming persistence decrease, which is reflected in a declining total performance over time (Fig. 6). However, this is a small influence when looking at the transitions: if it were predominant, the fog formation forecast would also improve much more with forecasting time.
Increase of transition points by extending the prediction period on the example of fog dissipation and a dataset time resolution of 1 h.
The question arises as to why dissipation is consistently predicted better than fog formation. This is most likely due to the lower fog frequency. Neither upsampling nor downsampling was performed for training, which would have either increased the number of fog points by duplication or decreased the number of nonfog points by deletion; this preserves the natural statistics of the data. Nevertheless, to keep the model's focus on fog events, incorrectly predicted fog data points were penalized much more heavily than nonfog data points during training. An advantage of this method is that the model can learn the fog frequency, because the sampling is kept as it is, and still not neglect underlying dynamics, because of the strict penalization. With the natural statistics preserved, this frequency distribution could have a particularly strong influence on the prediction of fog dissipation. The average length of a fog event is neglected in this setting because the model has no information about such temporal relationships. Based solely on the fog frequency, a fog data point has only a small probability of still being a fog data point at the predicted time; the probability is much higher that the next time step will again be a nonfog data point, since nonfog points are much more frequent in the dataset. The same applies in a similar way to the prediction of fog formation: there is a much stronger tendency for nonfog data points to remain nonfog data points, and thus fog formation is more likely to be missed. Analogous to the large increase in the dissipation forecast, the frequency distribution is most likely also the decisive reason for the only small increase in the prediction quality of the fog formation. The higher transition probability from fog to nonfog as well as the lower transition probability from nonfog to fog was also evidenced by Cornejo-Bueno et al. (2020). But even though the unchanged ratio of the two classes apparently complicates the prediction of fog formation, it is a considerable gain for the correct prediction of dissipation. Therefore, the prediction of fog formation and duration will be improved with supplementary information in a further study.
5. Conclusions
The aim of this study was to demonstrate weaknesses in pointwise classification-based fog forecasting using the example of the ML algorithm XGB. The results of this study are not limited to the XGB algorithm but are applicable in general, since they are solely based on properties inherent to a fog forecasting dataset and its evaluation. It was shown that the evaluation results strongly depend on the split method. In particular, randomization (StratifiedShuffleSplit) leads to an overestimation of the model performance on independent data. Furthermore, the EWA (workflow 1) produced very similar results to kFold and StratifiedShuffleSplit (workflow 2—CDS) on the test data, although according to the internal CV the performance of the EWA-trained model seemed worse. This can be seen as an advantage, since the difference between internal and external CV is particularly small for EWA: EWA does not overestimate the training performance. Furthermore, when comparing kFold and EWA, training and evaluation by EWA simulates reality more accurately, since the chronology of the data is predefined in operational use (test data, i.e., data to be predicted, are more recent than training data). Since no performance loss was detected in our analysis, EWA can be recommended as a more application-oriented alternative to the more commonly used kFold.
For the external CV, however, the preference between EWA and kFold has to be decided on a case-by-case basis. EWA provides an evaluation of a realistic deployment situation but can also lead to a particularly hard or particularly easy test dataset; this is especially true for smaller datasets. The kFold-based split into test and training sets for external CV provides more homogeneity between training and test dataset (seen in the baseline performance; Fig. 4). However, a case can be made that this is not representative of an actual use case, since training and test data do not adhere to the natural chronological order. Depending on the specific goal, this might or might not be a concern. StratifiedShuffleSplit or similar approaches (e.g., leave-many-out) should not be employed, since they can lead to an inflation of the model performance scores.
Most publications concerned with pointwise classification-based fog forecast approaches evaluate their model performance only on the proportion of correctly classified data points. While this is an accurate performance statistic for the way ML models were developed, the interpretation of those scores is not necessarily clear. It was shown why evaluating the model's performance only on the proportion of correctly classified data points (overall evaluation) is not sufficient: such an evaluation can be misleading and has little significance for the functionality of the fog forecasting model. Therefore, on the one hand, we recommend, especially for fog prediction, performing a further evaluation of the transitions, as this predictive ability is the most important criterion for a good fog prediction model. An alternative to our presented approach is to include the transitions directly in the model training as a third class. Either way, an application-adapted assessment from the two viewpoints of the overall model and of the transitions reveals potential model weaknesses and is thus an important step in improving fog forecasting. On the other hand, the need for baselines in each evaluation step was shown. With a baseline, a threshold is set against the model performance with which the prediction quality of the model can be measured as a function of the dataset. Such baselines can target weaknesses of the model and make studies on different datasets comparable. ML model interpretability methods could also provide additional insight into how well a model is working and therefore help to improve weak points.
The capabilities of a model depend on its ability to generalize. This in turn is reliant on several factors, such as the task to be performed by the model, the given dataset, the optimal data split, and the hyperparameters (Larsen and Goutte 1999). The dataset used in this study can be seen as representative of fog datasets in general with regard to the described problems since the statistics for fog datasets are always similar—many nonfog points versus few fog points. Also, the classes in the dataset are very static, meaning fog points are followed by fog points and nonfog points are followed by nonfog points.
Considering the complexity of radiation fog, high expectations are placed on operational fog prediction models. The aforementioned points illustrate that the proposed training and evaluation scheme provides a good basis for fog prediction. In future studies we will focus on employing this scheme to improve pointwise fog forecasting for nowcasting.
Acknowledgments.
This study is funded by the German Research Foundation through the project “Forecasting Radiation Fog by Combining Station and Satellite Data Using Machine Learning” (FOG-ML) (TH1531/7-1 and BE 1780/54-1). We also thank the Hessian Agency for Nature Conservation, Environment and Geology for providing the air pressure data measured at the site in Linden, Germany.
Data availability statement.
The data access for the Laboratory for Climatology and Remote Sensing is restricted until the project “Forecasting Radiation Fog by Combining Station and Satellite Data Using Machine Learning” (FOG-ML) is completed. The data of the Hessian Agency for Nature Conservation, Environment and Geology are publicly available (https://www.hlnug.de/messwerte/datenportal).
APPENDIX
Information about Python Library Versions and Parameter Settings
Table A1 presents an overview of the versions used, and Tables A2 and A3 present an overview of the parameters that were used.
Overview of the software and libraries used for model training and evaluation.
Chosen hyperparameter values of XGBoost for the 60-, 120-, 180-, and 240-min forecasts and the GridSearch value set used for model training.
Parameters of the logistic regression; class_weight means that a wrong forecast of the fog class is penalized twice as hard as a wrong forecast of the nonfog class.
REFERENCES
Abd Elrahman, S. M., and A. Abraham, 2013: A review of class imbalance problem. J. Network Innovative Comput., 1, 332–340.
Ahmed, N. K., A. F. Atiya, N. E. Gayar, and H. El-Shishiny, 2010: An empirical comparison of machine learning models for time series forecasting. Econom. Rev., 29, 594–621, https://doi.org/10.1080/07474938.2010.481556.
American Meteorological Society, 2012: Fog. Glossary of Meteorology, https://glossary.ametsoc.org/wiki/Fog.
Bartok, J., and M. Gera, 2012: Data mining for fog prediction and low clouds detection. Comput. Inf., 31, 1441–1464.
Bartoková, I., A. Bott, J. Bartok, and M. Gera, 2015: Fog prediction for road traffic safety in a coastal desert region: Improvement of nowcasting skills by the machine-learning approach. Bound.-Layer Meteor., 157, 501–516, https://doi.org/10.1007/s10546-015-0069-x.
Bendix, J., 2002: A satellite-based climatology of fog and low-level stratus in Germany and adjacent areas. Atmos. Res., 64, 3–18, https://doi.org/10.1016/S0169-8095(02)00075-3.
Bergot, T., and R. Lestringant, 2019: On the predictability of radiation fog formation in a mesoscale model: A case study in heterogeneous terrain. Atmosphere, 10, 165, https://doi.org/10.3390/atmos10040165.
Bergot, T., D. Carrer, J. Noilhan, and P. Bougeault, 2005: Improved site-specific numerical prediction of fog and low clouds: A feasibility study. Wea. Forecasting, 20, 627–646, https://doi.org/10.1175/WAF873.1.
Bloomfield, D. S., P. A. Higgins, R. T. J. McAteer, and P. T. Gallagher, 2012: Toward reliable benchmarking of solar flare forecasting methods. Astrophys. J., 747, L41, https://doi.org/10.1088/2041-8205/747/2/L41.
Carvajal, T. M., K. M. Viacrusis, L. F. T. Hernandez, H. T. Ho, D. M. Amalin, and K. Watanabe, 2018: Machine learning methods reveal the temporal pattern of dengue incidence using meteorological factors in Metropolitan Manila, Philippines. BMC Infect. Dis., 18, 183, https://doi.org/10.1186/s12879-018-3066-0.
Chen, T., and C. Guestrin, 2016: XGBoost: A scalable tree boosting system. Proc. 22nd ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, New York, NY, Association for Computing Machinery, 785–794, https://doi.org/10.1145/2939672.2939785.
Colabone, R. O., A. L. Ferrari, F. A. S. Vecchia, and A. R. B. Tech, 2015: Application of artificial neural networks for fog forecast. J. Aerosp. Technol. Manage., 7, 240–246, https://doi.org/10.5028/jatm.v7i2.446.
Cornejo-Bueno, L., C. Casanova-Mateo, J. Sanz-Justo, E. Cerro-Prada, and S. Salcedo-Sanz, 2017: Efficient prediction of low-visibility events at airports using machine-learning regression. Bound.-Layer Meteor., 165, 349–370, https://doi.org/10.1007/s10546-017-0276-8.
Cornejo-Bueno, S., D. Casillas-Pérez, L. Cornejo-Bueno, M. I. Chidean, A. J. Caamaño, J. Sanz-Justo, C. Casanova-Mateo, and S. Salcedo-Sanz, 2020: Persistence analysis and prediction of low-visibility events at Valladolid Airport, Spain. Symmetry, 12, 1045, https://doi.org/10.3390/sym12061045.
Dewi, R., and H. Harsa, 2020: Fog prediction using artificial intelligence: A case study in Wamena Airport. J. Phys.: Conf. Ser., 1528, 012021, https://doi.org/10.1088/1742-6596/1528/1/012021.
Dietz, S. J., P. Kneringer, G. J. Mayr, and A. Zeileis, 2019: Low-visibility forecasts for different flight planning horizons using tree-based boosting models. Adv. Stat. Climatol. Meteor. Oceanogr., 5, 101–114, https://doi.org/10.5194/ascmo-5-101-2019.
Dos Santos, E. M., R. Sabourin, and P. Maupin, 2009: Overfitting cautious selection of classifier ensembles with genetic algorithms. Inf. Fusion, 10, 150–162, https://doi.org/10.1016/j.inffus.2008.11.003.
Durán-Rosal, A. M., J. C. Fernández, C. Casanova-Mateo, J. Sanz-Justo, S. Salcedo-Sanz, and C. Hervás-Martínez, 2018: Efficient fog prediction with multi-objective evolutionary neural networks. Appl. Soft Comput., 70, 347–358, https://doi.org/10.1016/j.asoc.2018.05.035.
Dutta, D., and S. Chaudhuri, 2015: Nowcasting visibility during wintertime fog over the airport of a metropolis of India: Decision tree algorithm and artificial neural network approach. Nat. Hazards, 75, 1349–1368, https://doi.org/10.1007/s11069-014-1388-9.
Egli, S., F. Maier, J. Bendix, and B. Thies, 2015: Vertical distribution of microphysical properties in radiation fogs—A case study. Atmos. Res., 151, 130–145, https://doi.org/10.1016/j.atmosres.2014.05.027.
Fabbian, D., R. De Dear, and S. Lellyett, 2007: Application of artificial neural network forecasts to predict fog at Canberra International Airport. Wea. Forecasting, 22, 372–381, https://doi.org/10.1175/WAF980.1.
Fu, G., P. Li, J. G. Crompton, J. Guo, S. Gao, and S. Zhang, 2010: An observational and modeling study of a sea fog event over the Yellow Sea on 1 August 2003. Meteor. Atmos. Phys., 107, 149–159, https://doi.org/10.1007/s00703-010-0073-0.
German Weather Service, 2021: Nebel (fog). DWD Weather and Climate Dictionary, https://www.dwd.de/DE/service/lexikon/Functions/glossar.html?lv2=101812&lv3=101882.
Górnicki, K., R. Winiczenko, A. Kaleta, and A. Choińska, 2017: Evaluation of models for the dew point temperature determination. Tech. Sci., 20, 241–257, https://doi.org/10.31648/ts.5425.
Gu, Y., B. K. Wylie, S. P. Boyte, J. Picotte, D. M. Howard, K. Smith, and K. J. Nelson, 2016: An optimal sample data usage strategy to minimize overfitting and underfitting effects in regression tree models based on remotely-sensed data. Remote Sens., 8, 943, https://doi.org/10.3390/rs8110943.
Guijo-Rubio, D., P. A. Gutiérrez, C. Casanova-Mateo, J. Sanz-Justo, S. Salcedo-Sanz, and C. Hervás-Martínez, 2018: Prediction of low-visibility events due to fog using ordinal classification. Atmos. Res., 214, 64–73, https://doi.org/10.1016/j.atmosres.2018.07.017.
Gultepe, I., and Coauthors, 2007: Fog research: A review of past achievements and future perspectives. Pure Appl. Geophys., 164, 1121–1159, https://doi.org/10.1007/s00024-007-0211-x.
Haeffelin, M., and Coauthors, 2010: ParisFog: Shedding new light on fog physical processes. Bull. Amer. Meteor. Soc., 91, 767–783, https://doi.org/10.1175/2009BAMS2671.1.
Harris, C., 2014: Potential pitfalls in exploration and production applications of machine learning. SPE Western North American and Rocky Mountain Joint Meeting, Denver, CO, Society of Petroleum Engineers, SPE-169523-MS, https://doi.org/10.2118/169523-MS.
Heidke, P., 1926: Berechnung des Erfolges und der Güte der Windstärkevorhersagen im Sturmwarnungsdienst (Calculation of the success and quality of the wind force forecasts in the storm warning service). Geogr. Ann., 8, 301–349, https://doi.org/10.1080/20014422.1926.11881138.
Herman, G. R., and R. S. Schumacher, 2016: Using reforecasts to improve forecasting of fog and visibility for aviation. Wea. Forecasting, 31, 467–482, https://doi.org/10.1175/WAF-D-15-0108.1.
Jabbar, H., and R. Z. Khan, 2015: Methods to avoid over-fitting and under-fitting in supervised machine learning (comparative study). Computer Science, Communication and Instrumentation Devices, Research Publishing Services, 163–172.
Jacobs, W., V. Nietosvaara, A. Bott, J. Bendix, J. Cermak, M. Silas, and I. Gultepe, 2007: Short range forecasting methods of fog visibility and low clouds. Earth System Science and Environmental Management Final Rep. COST-722 Action, 61 pp.
Kneringer, P., S. J. Dietz, G. J. Mayr, and A. Zeileis, 2019: Probabilistic nowcasting of low-visibility procedure states at Vienna International Airport during cold season. Pure Appl. Geophys., 176, 2165–2177, https://doi.org/10.1007/s00024-018-1863-4.
Krasnopolsky, V. M., and M. S. Fox-Rabinovitz, 2006: Complex hybrid models combining deterministic and machine learning components for numerical climate modeling and weather prediction. Neural Networks, 19, 122–134, https://doi.org/10.1016/j.neunet.2006.01.002.
Kumari, P., and D. Toshniwal, 2021: Extreme gradient boosting and deep neural network based ensemble learning approach to forecast hourly solar irradiance. J. Cleaner Prod., 279, 123285, https://doi.org/10.1016/j.jclepro.2020.123285.
Kutty, S. G., G. Agnihotri, A. P. Dimri, and I. Gultepe, 2019: Fog occurrence and associated meteorological factors over Kempegowda International Airport, India. Pure Appl. Geophys., 176, 2179–2190, https://doi.org/10.1007/s00024-018-1882-1.
Lagerquist, R., A. McGovern, and T. Smith, 2017: Machine learning for real-time prediction of damaging straight-line convective wind. Wea. Forecasting, 32, 2175–2193, https://doi.org/10.1175/WAF-D-17-0038.1.
Lapuschkin, S., S. Wäldchen, A. Binder, G. Montavon, W. Samek, and K.-R. Müller, 2019: Unmasking Clever Hans predictors and assessing what machines really learn. Nat. Commun., 10, 1096, https://doi.org/10.1038/s41467-019-08987-4.
Larsen, J., and C. Goutte, 1999: On optimal data split for generalization estimation and model selection. Neural Networks for Signal Processing IX: Proc. 1999 IEEE Signal Processing Society Workshop, Madison, WI, IEEE, 225–234, https://doi.org/10.1109/NNSP.1999.788141.
Maier, F., J. Bendix, and B. Thies, 2012: Simulating Z–LWC relations in natural fogs with radiative transfer calculations for future application to a cloud radar profiler. Pure Appl. Geophys., 169, 793–807, https://doi.org/10.1007/s00024-011-0332-0.
Maier, F., J. Bendix, and B. Thies, 2013: Development and application of a method for the objective differentiation of fog life cycle phases. Tellus, 65B, 19971, https://doi.org/10.3402/tellusb.v65i0.19971.
McCandless, T., G. S. Young, S. Haupt, and L. Hinkelman, 2016: Regime-dependent short-range solar irradiance forecasting. J. Appl. Meteor. Climatol., 55, 1599–1613, https://doi.org/10.1175/JAMC-D-15-0354.1.
Miao, K., T. Han, Y. Yao, H. Lu, P. Chen, B. Wang, and J. Zhang, 2020: Application of LSTM for short term fog forecasting based on meteorological elements. Neurocomputing, 408, 285–291, https://doi.org/10.1016/j.neucom.2019.12.129.
Nurmi, P., 2003: Recommendations on the verification of local weather forecasts. ECMWF Tech. Memo. 430, 19 pp., https://www.ecmwf.int/en/elibrary/11401-recommendations-verification-local-weather-forecasts.
Ortega, L., L. D. Otero, and C. Otero, 2019: Application of machine learning algorithms for visibility classification. 2019 IEEE Int. Systems Conf. (SysCon), Orlando, FL, IEEE, 1–5, https://doi.org/10.1109/SYSCON.2019.8836910.
Palvanov, A., and Y. I. Cho, 2019: VisNet: Deep convolutional neural networks for forecasting atmospheric visibility. Sensors, 19, 1343, https://doi.org/10.3390/s19061343.
Pan, C., J. Tan, D. Feng, and Y. Li, 2019: Very short-term solar generation forecasting based on LSTM with temporal attention mechanism. 2019 IEEE Fifth Int. Conf. on Computer and Communications (ICCC), Chengdu, China, IEEE, 267–271, https://doi.org/10.1109/ICCC47050.2019.9064298.
Pedregosa, F., and Coauthors, 2011: Scikit-learn: Machine learning in Python. J. Mach. Learn. Res., 12, 2825–2830.
Pérez, I. A., M. García, M. L. Sánchez, and B. de Torre, 2004: Autocorrelation analysis of meteorological data from a RASS sodar. J. Appl. Meteor., 43, 1213–1223, https://doi.org/10.1175/1520-0450(2004)043<1213:AAOMDF>2.0.CO;2.
Pérez-Díaz, J. L., and Coauthors, 2017: Fogs: Physical basis, characteristic properties, and impacts on the environment and human health. Water, 9, 807, https://doi.org/10.3390/w9100807.
Price, J. D., 2019: On the formation and development of radiation fog: An observational study. Bound.-Layer Meteor., 172, 167–197, https://doi.org/10.1007/s10546-019-00444-5.
Raykar, V. C., and A. Saha, 2015: Data split strategies for evolving predictive models. Machine Learning and Knowledge Discovery in Databases: ECML PKDD 2015, A. Appice et al., Eds., Lecture Notes in Computer Science, Vol. 9284, Springer, 3–19, https://doi.org/10.1007/978-3-319-23528-8_1.
Reitermanova, Z., 2010: Data splitting. Mathematics and Computer Sciences, Part I, WDS’10 Proceedings of Contributed Papers, J. Safrankova and J. Pavlu, Eds., Matfyzpress, 31–36.
Roebber, P. J., 2009: Visualizing multiple measures of forecast quality. Wea. Forecasting, 24, 601–608, https://doi.org/10.1175/2008WAF2222159.1.
Román-Cascón, C., C. Yagüe, G.-J. Steeneveld, G. Morales, J. A. Arrillaga, M. Sastre, and G. Maqueda, 2019: Radiation and cloud-base lowering fog events: Observational analysis and evaluation of WRF and HARMONIE. Atmos. Res., 229, 190–207, https://doi.org/10.1016/j.atmosres.2019.06.018.
Scher, S., and G. Messori, 2018: Predicting weather forecast uncertainty with machine learning. Quart. J. Roy. Meteor. Soc., 144, 2830–2841, https://doi.org/10.1002/qj.3410.
Sheridan, R. P., W. M. Wang, A. Liaw, J. Ma, and E. M. Gifford, 2016: Extreme gradient boosting as a method for quantitative structure–activity relationships. J. Chem. Inf. Model., 56, 2353–2360, https://doi.org/10.1021/acs.jcim.6b00591.
Smith, D. K. E., I. A. Renfrew, S. R. Dorling, J. D. Price, and I. A. Boutle, 2021: Sub-km scale numerical weather prediction model simulations of radiation fog. Quart. J. Roy. Meteor. Soc., 147, 746–763, https://doi.org/10.1002/qj.3943.
Steeneveld, G. J., and M. de Bode, 2018: Unravelling the relative roles of physical processes in modelling the life cycle of a warm radiation fog. Quart. J. Roy. Meteor. Soc., 144, 1539–1554, https://doi.org/10.1002/qj.3300.
Steeneveld, G. J., R. J. Ronda, and A. A. M. Holtslag, 2015: The challenge of forecasting the onset and development of radiation fog using mesoscale atmospheric models. Bound.-Layer Meteor., 154, 265–289, https://doi.org/10.1007/s10546-014-9973-8.
Stolaki, S., M. Haeffelin, C. Lac, J.-C. Dupont, T. Elias, and V. Masson, 2015: Influence of aerosols on the life cycle of a radiation fog event. A numerical and observational study. Atmos. Res., 151, 146–161, https://doi.org/10.1016/j.atmosres.2014.04.013.
Thies, B., S. Egli, and J. Bendix, 2017: The influence of drop size distributions on the relationship between liquid water content and radar reflectivity in radiation fogs. Atmosphere, 8, 142, https://doi.org/10.3390/atmos8080142.
Urbich, I., J. Bendix, and R. Müller, 2020: Development of a seamless forecast for solar radiation using ANAKLIM++. Remote Sens., 12, 3672, https://doi.org/10.3390/rs12213672.
Vislocky, R. L., and J. M. Fritsch, 1997: An automated, observations-based system for short-term prediction of ceiling and visibility. Wea. Forecasting, 12, 31–43, https://doi.org/10.1175/1520-0434(1997)012<0031:AAOBSF>2.0.CO;2.
Yadav, S., and S. Shukla, 2016: Analysis of k-fold cross-validation over hold-out validation on colossal datasets for quality classification. 2016 IEEE Sixth Int. Conf. on Advanced Computing (IACC), Bhimavaram, India, IEEE, 78–83, https://doi.org/10.1109/IACC.2016.25.
Zazzaro, G., G. Romano, P. Mercogliano, V. Rillo, and S. Kauczok, 2015: Short range fog forecasting by applying data mining techniques: Three different temporal resolution models for fog nowcasting on CDG airport. 2015 IEEE Metrology for Aerospace (MetroAeroSpace), Benevento, Italy, IEEE, 448–453, https://doi.org/10.1109/MetroAeroSpace.2015.7180699.
Zhang, F., B. Ren, C. Dou, and C. Wang, 2020: Numerical simulation of near-surface atmospheric conditions during a radiation fog over the complex terrain. IOP Conf. Ser.: Earth Environ. Sci., 555, 012093, https://doi.org/10.1088/1755-1315/555/1/012093.