Selection and Optimization of a Machine Learning Algorithm for Antarctic Blowing Snow Diagnosis Using MERRA-2 Reanalysis

Surendra Bhatta, Climate and Radiation Laboratory, NASA Goddard Space Flight Center, Greenbelt, Maryland; Goddard Earth Sciences Technology and Research II, Morgan State University, Baltimore, Maryland (https://orcid.org/0000-0002-3454-1062)

and

Yuekui Yang, Climate and Radiation Laboratory, NASA Goddard Space Flight Center, Greenbelt, Maryland

Open access

Abstract

Accurate identification of blowing snow (BLSN) is crucial for studying Antarctic surface mass balance (SMB). Previous research has demonstrated the potential of diagnosing BLSN properties using Modern-Era Retrospective analysis for Research and Applications, version 2 (MERRA-2), meteorological parameters. The goal of this study is to develop a machine learning model that is best suited for operational production of Antarctic BLSN occurrence diagnoses. To achieve this objective, a comprehensive framework begins by training with BLSN observations from CALIPSO. First, the optimal input features, consisting of 21 meteorological parameters from the MERRA-2 reanalysis, are selected through a feature selection process. Second, a comparison of five machine learning algorithms reveals that the extreme gradient boosting (XGBoost) algorithm outperforms the others, leading to its selection for model development. Third, extensive tests validate the BLSN diagnoses, with the best performance achieved by individual monthly models. The results show that the monthly models, trained with XGBoost and the 21 selected MERRA-2 parameters by aggregating data over a decade for each month spanning 2007–16, are well suited for long-term Antarctic BLSN diagnosis. This enables the generation of hourly, grid-based MERRA-2 BLSN data from 1980 to the present. The model’s robustness is demonstrated by training it on data excluding 2009 and then successfully diagnosing BLSN for 2009. Even for the challenging winter-to-summer transition month of October, the model achieved ∼92% accuracy and an F1 score of ∼73%. Results are even better in the winter months, from April to September. Moreover, the models’ explainability is showcased through Shapley additive explanations (SHAP) values and local interpretable model-agnostic explanation (LIME) analysis.

For information regarding reuse of this content and general copyright information, consult the AMS Copyright Policy (www.ametsoc.org/PUBSReuseLicenses).

Corresponding author: Surendra Bhatta, surendra.bhatta@nasa.gov


1. Introduction

Blowing snow is the lofting of snow particles from the surface by the wind (Mellor 1965). In populated areas, blowing snow (BLSN) is an inclement weather condition that impacts people’s daily activities by significantly reducing visibility (Taylor 1998). In the polar regions, BLSN plays an important role in the radiation budget and cryosphere surface mass balance (SMB) (Yang et al. 2014; Palm et al. 2011; Schmidt 1982). The Antarctic SMB is the net balance between the gain and loss of snow by various processes; it regulates, and helps in estimating, sea level rise (WCRP Global Sea Level Budget Group 2018). Blowing snow is a significant component of the SMB, contributing through both sublimation and transport (e.g., Palm et al. 2017; Scarchilli et al. 2010; Pomeroy and Jones 1996). Therefore, an accurate assessment of blowing snow is crucial for precisely estimating its contribution to the overall SMB. However, few ground-based observations of blowing snow frequency exist for Antarctica because of its extreme weather conditions. Some studies provide statistics for a year or occasionally several years, while others offer data for only a few months. The analysis by Mahesh et al. (2003) of surface data collected at the South Pole over multiple years implied that blowing snow occurred 40%–50% of the time during the winter. The analysis by Mann et al. (2000) of measurements from Halley Station (75.6°S, 26.7°W) during the 1991 austral winter showed that blowing snow occurred between 27% and 37% of the time (June, July, and August).

Studies have used wind speed thresholds to determine BLSN occurrence (Schmidt 1980). However, Li and Pomeroy (1997) showed that no single meteorological threshold can be established to mark the initiation of BLSN. The driving force of the wind and the resistance of the surface snow (influenced by precipitation) are functions of topography, vegetation, and snow crystal form, and wind speed, air temperature, and snow age alone are not sufficient for decisive thresholds on BLSN initiation (Li and Pomeroy 1997). Distinguishing BLSN simply by conventional thresholds on specific parameters is therefore sometimes questionable; instead, BLSN is the outcome of multiple interacting atmospheric (meteorological) parameters.

To diagnose BLSN with a higher degree of accuracy using meteorological parameters, a more comprehensive algorithm is needed. Recently, Yang et al. (2023) showed that a random forest (RF) machine learning (ML) model achieved promising results over the Antarctic continent. However, their study is essentially a proof of concept; extensive research is still needed to build a model suitable for operational use in producing long-term data. Our goal is to diagnose hourly BLSN properties over the Antarctic region using the Modern-Era Retrospective analysis for Research and Applications, version 2 (MERRA-2), reanalysis and to build a long-term BLSN dataset. Toward that goal, this paper extends the work of Yang et al. (2023) in the following aspects. First, the current work develops a framework for the selection and optimization of the machine learning model, which enables the idea proposed by Yang et al. (2023) to become fully operational. The procedure involves research not addressed by them, such as systematic comparison of models, recursive feature elimination with cross validation for input selection, multiyear data combination for better modeling, and model result explainability. Second, the method in Yang et al. (2023) targets only the land ice part of the Antarctic continent, while this paper extends it to include the sea ice region, which is important as demonstrated by previous work (e.g., Chung et al. 2011; Leonard and Maksym 2011; Huang and Jaeglé 2017).

Cloud–Aerosol Lidar and Infrared Pathfinder Satellite Observations (CALIPSO) and Ice, Cloud, and Land Elevation Satellite-2 (ICESat-2) (Palm et al. 2011, 2021) are the sources of BLSN datasets at the continental scale, and both employ spaceborne lidar-based methods. Temporally, these datasets cover only 2006 to the present: CALIPSO spans roughly a decade, and ICESat-2 a few years. Spatially, they have large gaps, since the ground tracks are limited to a single pixel in width (several hundred meters). This study fills the spatial gap by extending the coverage to the latitudinal span of 60°–90°S and delivering hourly data on the MERRA-2 grid. The model is intended to be applied to the entire MERRA-2 dataset to generate a blowing snow data record from 1980 to the present. Once approved, the data will be publicly available through data.nasa.gov. This BLSN dataset will serve as an important resource for the community in studying Antarctic SMB and climate.

The remainder of the paper presents the steps taken to build the final ML model for BLSN diagnosis with MERRA-2 data, including machine learning algorithm selection, input feature selection, feature importance analysis, hyperparameter tuning, training data range determination, model performance testing, and explainability.

2. Training data, input features, and machine learning algorithm selection and description

a. Training data preparation and input feature selection

Previous studies have demonstrated that CALIPSO BLSN data are of high quality and can be used as the “truth” for machine learning training (Palm et al. 2018b; Yang et al. 2021, 2023). However, as a spaceborne lidar platform, CALIPSO observations are limited to a width of one pixel (∼333 m), and the satellite surface tracks only reach 82°S due to the orbit inclination. In addition, the CALIPSO mission ended on 1 August 2023. The good news is that another spaceborne lidar mission, ICESat-2, has been providing BLSN data since its launch in 2018 (Palm et al. 2021). In the future, we plan to incorporate ICESat-2 data for validation, comparison, and further improvements.

For this study, BLSN properties derived from CALIPSO observations are used as the truth for training. As discussed by Palm et al. (2011), the CALIPSO blowing snow detection algorithm utilizes the lidar attenuated backscatter profiles, a high-resolution digital elevation model (DEM), and 10-m wind speed from global numerical models. The algorithm searches the backscatter signal to locate the ground and then works upward to identify whether a BLSN layer exists using a threshold method. The resulting CALIPSO BLSN product is provided at 333-m resolution, flagging the detection of blowing snow. The CALIPSO BLSN dataset spans all months from 2007 to 2016.

Each CALIPSO pixel is classified into one of three categories: BLSN, non-BLSN, or unknown. For that purpose, the CALIPSO level 2 cloud layer product, which provides cloud information at 1-km resolution, is also used (Winker et al. 2009). The pixel classification synthesizes the 1-km cloud layer product with the 333-m BLSN product. Specifically, if BLSN is detected, the pixel is classified as BLSN; if the surface is detected by the laser but no BLSN is observed, the pixel is classified as non-BLSN; the remaining cases, in which no surface signal is detected, are classified as unknown.

The input features to the machine learning model are from the Global Modeling and Assimilation Office (GMAO) MERRA-2 reanalysis (Gelaro et al. 2017; Bosilovich et al. 2015). CALIPSO pixels are collocated with the MERRA-2 variables within a 30-min time difference. To align with the MERRA-2 grid, the classified CALIPSO pixels are regrouped into 50-km groups, and the pixels within each group are combined to represent the group as a whole. If more than 50% of the regrouped pixels belong to the BLSN class, the group is labeled BLSN; conversely, if more than 80% of the pixels are clear (non-BLSN), the group is labeled non-BLSN. Groups that meet neither criterion are categorized as unknown and are not included in the training dataset. These group labels are then merged to construct the training and testing datasets. The analysis is centered on the combined data over Antarctica at latitudes south of 60°S. For a comprehensive representation of the entire process, refer to Fig. 1, which presents a detailed flowchart.
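As a concrete illustration of the labeling rule above, the following minimal Python sketch assigns a group label from per-pixel classes. The >50% BLSN and >80% clear thresholds follow the text, but the integer coding and function name are illustrative assumptions, not the authors’ production code.

```python
import numpy as np

def label_group(pixel_classes):
    """Assign a group-level label from 333-m CALIPSO pixel classes.

    Assumed coding: 1 = BLSN, 0 = clear (non-BLSN), -1 = unknown.
    """
    pixel_classes = np.asarray(pixel_classes)
    frac_blsn = np.mean(pixel_classes == 1)    # fraction of BLSN pixels
    frac_clear = np.mean(pixel_classes == 0)   # fraction of clear pixels
    if frac_blsn > 0.5:
        return "BLSN"
    if frac_clear > 0.8:
        return "non-BLSN"
    return "UNKNOWN"  # excluded from the training dataset

# Example: a 150-pixel group that is 60% BLSN
group = np.r_[np.ones(90), np.zeros(50), -np.ones(10)]
print(label_group(group))  # -> BLSN
```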

Fig. 1. Flowchart illustrating the complete process, from training the model to data production, starting with the blending of the CALIPSO BLSN and cloud data with MERRA-2 data. Subsequent steps include feature and model selection, culminating in deployment of the model.

The process creates the truth values by blending the two CALIPSO datasets (cloud + BLSN); these truth values are then merged with the MERRA-2 data to form the complete training dataset.

At the outset, the dataset comprised a total of 28 potential MERRA-2 variables, as listed in Table 1, encompassing a combination of calculated and direct model output parameters. They included surface pressure (PS), specific humidity (Qv2M) at 2 m, temperature (T2M) at 2 m, and eastward (U10M) and northward (V10M) wind at 10 m from the “Hourly, Instantaneous, Single-Level, Assimilation” collection. The corresponding variables at the lowest four levels (layer 71–layer 68) were also selected every 3 h from the “Assimilated Meteorological Fields” dataset. MERRA-2 uses a hybrid vertical coordinate system (eta levels), so the above-ground heights of the model levels vary with surface pressure. In Antarctica, the first model level is typically 45–60 m above ground, and the second level ranges from 120 to 170 m. The 3-hourly data are interpolated onto a 1-h interval. In addition to these variables, we extracted geopotential height from the “Constant Model Parameters” and “Assimilation Surface Flux Diagnostics” datasets. Furthermore, snowfall data are extracted from the “Surface Flux Diagnostics” dataset and used to calculate the snow age (Sa). The temperature gradient (Tg) is computed from the temperature at 2 m and the lowest two height levels. These 28 variables collectively form a comprehensive set that contributes to our analysis and modeling process.
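The 3-hourly model-level fields must be brought onto the hourly time axis of the single-level fields. A minimal sketch of this step follows, assuming simple linear interpolation in time (the paper does not specify the interpolation method); the field name and values are illustrative.

```python
import numpy as np

# Assumed example: one day of a 3-hourly model-level temperature (K).
hours_3h = np.arange(0, 24, 3)        # 0, 3, ..., 21 UTC
t_3h = np.array([245.0, 246.0, 248.0, 251.0, 250.0, 247.0, 246.0, 245.0])

# Interpolate onto the hourly axis of the single-level fields.
hours_1h = np.arange(0, 22)           # hourly points within the 3-h span
t_1h = np.interp(hours_1h, hours_3h, t_3h)
```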

Table 1. Preliminary selection of potential MERRA-2 variables for feature selection in the model selection phase, aiming to find the combination of variables that yields the best F1-weighted score (71–68 are the model layer numbers).

In this study, recursive feature elimination with cross validation (RFECV) was employed to select the ideal combination of input features. This method identifies the most appropriate combination of variables by iteratively removing the weakest (least significant) feature based on a cross-validation score (Bengfort and Bilbro 2019).

For analysis, October is a suitable month for model setup and training, as it represents the seasonal crossover in Antarctica and provides a diverse distribution of data. Therefore, October 2010 is selected as the representative baseline month for following the steps outlined in Fig. 1, which include the feature and algorithm selection process. The number of features in the model, along with its cross-validated test score and variability, was determined using the RFECV visualizer, as depicted in Fig. 2. Figure 2 displays the results obtained with extreme gradient boosting (XGBoost) as the representative algorithm, starting from the list of 28 variables in Table 1 and using fivefold cross validation with the F1-weighted score.
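A minimal sketch of this RFECV step with scikit-learn and XGBoost is given below; the feature matrix X (the 28 candidate variables of Table 1) and the labels y are assumed to come from the collocated training dataset, and default XGBoost settings are assumed. (The text uses the Yellowbrick RFECV visualizer, which wraps the same scikit-learn machinery.)

```python
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
from xgboost import XGBClassifier

# X: DataFrame of the 28 candidate MERRA-2 variables (Table 1); y: labels
# (1 = BLSN, 0 = non-BLSN) from the collocated training data (assumed).
selector = RFECV(
    estimator=XGBClassifier(),            # representative algorithm
    step=1,                               # drop one weakest feature per pass
    cv=StratifiedKFold(n_splits=5),       # fivefold cross validation
    scoring="f1_weighted",                # F1-weighted score, as in Fig. 2
)
selector.fit(X, y)
selected_features = X.columns[selector.support_]  # e.g., the 21 retained variables
```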

Fig. 2. The RFECV feature selection method identifies the number of features with the highest cross-validation score (vertical line), while the shaded area represents the standard deviation. The number of features and the corresponding F1-weighted score are displayed for the representative month of October 2010, considering the list of 28 variables from Table 1.

The feature selection process is crucial, and slight variations may occur when selecting features based on accuracy with different training data and algorithms. Our tests under various scenarios revealed a trade-off between different performance metrics and computation speeds. For this study, we followed the outcomes of RFECV from the XGBoost algorithm, which resulted in 21 selected variables for October 2010. It is important to highlight that the model’s performance as shown in Fig. 2 remains largely unaffected by the addition or removal of a few variables for this particular month. However, the chosen combination of variables demonstrates satisfactory performance metrics across various months. Consequently, we proceeded to evaluate different machine learning models using the 21 selected variables.

Figure 3 offers a detailed view of the Z scores, quartiles, medians, and interquartile ranges for the 21 selected variables, separated into BLSN and non-BLSN, excluding potential outliers.

Fig. 3. A boxplot with whiskers displays the 21 selected input features of the October 2010 training data and their distributions for BLSN/non-BLSN, with the vertical axis in standardized (Z score) values. The upper and lower boundaries of each box represent the upper quartile (Q3) and lower quartile (Q1), respectively. The center lines indicate the median, while the whisker caps extend to the maximum and minimum values for each variable, showing the variation in the data.

These boxplots visualize the distribution of each variable for BLSN and non-BLSN instances in the October 2010 training data. Notably, lower surface pressure, specific humidity, and temperature often coincide with BLSN occurrences, while higher wind speed is consistently associated with BLSN. However, caution is advised in drawing immediate conclusions from this analysis, as the machine learning model provides a more comprehensive understanding of each variable’s role in identifying BLSN.

b. Machine learning algorithm selection

The MERRA-2–CALIPSO combined data are divided into training and validation sets using an 80%–20% split: 80% of the data are used for training, while the remaining 20% are reserved for testing, with a probability threshold of 50%. To assess the effectiveness of the ML models, several metrics and statistics are computed on the test data. These metrics serve as indicators of the model’s performance and its ability to classify instances correctly. We assessed the confusion matrix derived from the model test. The matrix showcases the counts of true positives (TP) and true negatives (TN), which signify accurate categorizations, along with the counts of false positives (FP) and false negatives (FN), which represent incorrect classifications (Chicco and Jurman 2020; Sokolova et al. 2006). By understanding these components, the predictive capabilities of the model can be assessed and areas for improvement identified. The metrics include accuracy, precision, recall, F1 score, and the area under the receiver operating characteristic (ROC) curve (AUC). The accuracy (Ac) metric is the proportion of correctly classified instances to the total number of cases:
\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}. \quad (1)
In situations of class imbalance, accuracy can be misleading. Therefore, additional metrics like precision, recall, and the F1 score are crucial for accurate model performance evaluation. Precision (Pr) is the ratio of true-positive predictions to all positive predictions, measuring the model’s ability to correctly identify instances of a specific class:
\text{precision} = \frac{TP}{TP + FP}. \quad (2)
Recall (Re), also known as sensitivity or true-positive rate, is the ratio of correctly predicted positive instances to all actual positive instances. It indicates the model’s effectiveness in capturing and identifying all instances of a specific class:
\text{recall} = \text{TPR} = \frac{TP}{TP + FN}. \quad (3)
The F1 score is the harmonic mean of precision and recall. It provides a balanced measure that considers both precision and recall, giving equal importance to both metrics. The F1 score is particularly useful when there is an imbalance between the classes in the dataset:
\text{F1 score} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}. \quad (4)
The false-positive rate (FPR) measures the proportion of incorrect positive predictions among all actual negative instances. It evaluates how often the model falsely predicts the positive class:
\text{FPR} = \frac{FP}{TN + FP}. \quad (5)
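The metrics above can be computed directly from the confusion matrix; a minimal scikit-learn sketch follows, where y_test, y_pred, and y_prob (the predicted BLSN probability) are assumed to come from the 80%–20% split described above.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

# y_test, y_pred, y_prob: assumed from the 80%-20% split described above.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("accuracy ", accuracy_score(y_test, y_pred))   # Eq. (1)
print("precision", precision_score(y_test, y_pred))  # Eq. (2)
print("recall   ", recall_score(y_test, y_pred))     # Eq. (3)
print("F1 score ", f1_score(y_test, y_pred))         # Eq. (4)
print("FPR      ", fp / (tn + fp))                   # Eq. (5)
print("AUC      ", roc_auc_score(y_test, y_prob))
```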

The selection of the most suitable machine learning model was based on metrics relevant to the dataset. Several well-known classifiers, including Gaussian naive Bayes, random forest, K-nearest neighbors, decision tree, and extreme gradient boosting, were evaluated using default hyperparameters from sklearn and the 21 selected variables as input features. Performance metrics assessed their effectiveness; while RF showed the highest accuracy in Fig. 4a, factors like class imbalance were also considered. The precision, recall, and F1 scores in Fig. 4a show that XGBoost outperformed the others.
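A minimal sketch of this five-model comparison is shown below; hyperparameters are library defaults, and X_train, X_test, y_train, and y_test are assumed from the 80%–20% split.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

# X_train, X_test, y_train, y_test: assumed 80%-20% split of the 21 inputs.
models = {
    "Decision tree": DecisionTreeClassifier(),
    "RF": RandomForestClassifier(),
    "Gaussian NB": GaussianNB(),
    "KNN": KNeighborsClassifier(),
    "XGBoost": XGBClassifier(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(name, accuracy_score(y_test, y_pred), f1_score(y_test, y_pred))
```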

Fig. 4. Classification ML model comparison for October 2010 using the 21 selected variables: (a) accuracy, precision, recall, and F1 scores for decision tree, RF, Gaussian NB, KNN, and XGBoost; (b) AUC–ROC curves for each of the corresponding models, illustrating the most suitable model for the classification problem.

We also employed the ROC curve and AUC to assess the performance and suitability of the classification models. The ROC curves, presented in Fig. 4b, demonstrate optimal performance by XGBoost, representing the highest sensitivity and specificity. The highest AUC value, ∼0.97, also indicates that XGBoost is an ideal classifier. The selection process, based on the ROC–AUC values and classification metrics, identified XGBoost as the most promising candidate.

All model results and performance are presented in section 3, along with analysis of the effects of input selection and hyperparameter tuning.

c. XGBoost algorithm

XGBoost, introduced by Chen and Guestrin (2016), is a gradient-boosted decision tree algorithm that builds decision trees sequentially to improve performance. It has shown continuous growth and progress, gaining attention as a scalable machine learning technique that outperforms existing classifiers. It uses regularized tree boosting to prevent overfitting and a parallelized implementation for efficient data processing. Boosting combines weak learners to create strong learners, reducing training errors and improving model performance. Unlike bagging, which builds independent models in parallel and combines them with equal weight, boosting builds models sequentially, focusing on instances misclassified by previous models, and is often more effective at reducing bias (Chen and Guestrin 2016).

Table 2 displays the selected features from the MERRA-2 dataset, labeled f0, f1, f2, …, fN, along with instances denoted Ai, Bi, Ci, …, Ei, where the maximum value of i is N, the total number of input features. The table illustrates the training process of the XGBoost algorithm using one feature for simplicity. For a binary class, the initial diagnosis is set at 0.5 by default; this prediction represents the average of the target class distribution. Class 0 is non-BLSN, and class 1 is BLSN.

Table 2. Sample of three MERRA-2 features with binary class labels to illustrate the XGBoost training process, with the residuals for each tree included.

After the initial diagnosis is set at 0.5 for each instance, intermediate residuals (R1, R2, R3, …, Rp) are generated during the training process of the XGBoost algorithm. For each model number p, the corresponding residuals (rpA, rpB, rpC, etc.) are calculated by measuring the error between the truth values and the predicted values for each instance. These residuals capture the errors made by the models. For instance, for the U10M feature with an assumed threshold of 7.0 m s−1, events C and D in Table 2 have residuals of 0.5, while the remaining events A, B, and E have residuals of −0.5. Based on the obtained Residual1, a decision tree is built with a threshold of 7.0 m s−1 for U10M, as shown in Fig. 5. This process continues for V10M and T2M, creating decision trees for each feature. In the XGBoost training process, each tree considers the residuals from the previous trees, gradually improving the model’s predictions.

Fig. 5. Tree demonstration of XGBoost classification, starting from the base model and the initial thresholds, to illustrate the training process.

The loss function in the XGBoost algorithm quantifies the difference between predicted values and true labels. It helps the algorithm prioritize instances where the base model makes incorrect predictions. In classification tasks, the loss function is typically based on probabilities; common loss functions like cross entropy and hinge loss compare predicted probabilities with true labels, enabling the model to learn and enhance its predictions during training. The negative gradient of the loss function is computed using the residuals. The similarity score (SS) is then determined from the squared sum of the residuals and the previous probabilities. Equation (6) gives the SS, where Residual1 is obtained from Table 2, P represents the previous probability (0.5 for each event), and λ is a regularization parameter that reduces the similarity score, aiding in tree pruning and mitigating overfitting of the model:
\text{SS} = \frac{\left(\sum \text{residuals}\right)^{2}}{\sum\left[P(1 - P)\right] + \lambda}. \quad (6)
The gain value is calculated as shown in Eq. (7) and is obtained by subtracting the similarity score of the root node from the sum of the similarity scores of the left and right child nodes (leaves):
\text{gain} = \text{SS}_{\text{leaf1}} + \text{SS}_{\text{leaf2}} - \text{SS}_{\text{root}}. \quad (7)

The model compares similarity values and uses the threshold with the highest gain value to make class predictions. It identifies errors and constructs new trees to improve the predictions. In the training process of the XGBoost algorithm, multiple models are generated with different gain values. The process continues until all instances are correctly classified or until the maximum number of trees is reached. This iterative approach allows the algorithm to continually improve and refine its predictions, leading to better overall performance and accuracy (Chen and Guestrin 2016).
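As a worked numeric check of Eqs. (6) and (7) with the Table 2 residuals, the short sketch below computes the gain for the U10M split at 7.0 m s−1. Using λ = 0 here is an illustrative simplification (XGBoost’s default regularization is λ = 1), and the previous probability is 0.5 for every event.

```python
def similarity(residuals, p=0.5, lam=0.0):
    """Eq. (6): squared sum of residuals over the sum of P(1 - P) plus lambda."""
    return sum(residuals) ** 2 / (len(residuals) * p * (1 - p) + lam)

root = [-0.5, -0.5, 0.5, 0.5, -0.5]   # residuals of events A, B, C, D, E
left = [-0.5, -0.5, -0.5]             # U10M < 7.0 m/s: events A, B, E
right = [0.5, 0.5]                    # U10M >= 7.0 m/s: events C, D

# Eq. (7): gain = SS_leaf1 + SS_leaf2 - SS_root = 3.0 + 2.0 - 0.2 = 4.8
gain = similarity(left) + similarity(right) - similarity(root)
print(gain)
```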

3. Optimization and evaluation of the model

a. ML performance and metrics evaluation

To further evaluate the classification metrics, XGBoost was tested over Antarctica in January, July, and October of 2009 and 2010 to cover the summer, winter, and transitional seasons, encountering varied conditions such as polar night, daylight, and differing BLSN frequencies.

Hyperparameters are crucial for controlling the complexity and performance of ML algorithms (Chen and Guestrin 2016). We fine-tuned the algorithm using a randomized search with 10-fold cross validation to achieve optimal hyperparameters. This approach efficiently explored various parameter combinations, allowing us to identify the configuration yielding the best performance. We further refined the XGBoost classification algorithm to enhance stability and performance. We set “n_thread = −1” for parallel processing, enhancing computational efficiency. The “learning_rate = 0.1” parameter controls the step size for updating decision trees at each iteration, ensuring a balanced trade-off between accuracy and speed. We chose “n_estimators = 576” to set the number of trees, which is crucial for complex problems while avoiding overfitting. The “max_depth = 7” setting controls the tree depth, affecting the model’s complexity and generalization. Additionally, “min_child_weight = 6” sets the sample-splitting threshold, capturing meaningful patterns in the data (XGBoost 2022).
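A minimal sketch of this tuning step is given below; the search ranges and n_iter are illustrative assumptions, while the final values quoted above (learning rate 0.1, 576 trees, depth 7, minimum child weight 6) are from the text.

```python
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

param_distributions = {                    # illustrative search ranges
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
    "n_estimators": range(100, 1000),
    "max_depth": range(3, 11),
    "min_child_weight": range(1, 10),
}
search = RandomizedSearchCV(
    XGBClassifier(n_jobs=-1),              # parallel processing (the text's n_thread = -1)
    param_distributions=param_distributions,
    n_iter=50,                             # number of sampled configurations (assumed)
    cv=10,                                 # 10-fold cross validation, as in the text
    scoring="f1",
)
search.fit(X_train, y_train)
best_model = search.best_estimator_
```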

The month-by-month model is trained using an 80% split of the data for training and 20% for testing, evaluated independently for each month of each year. Table 3 presents the evaluated metrics for the benchmark months of 2009 and 2010, including accuracy, precision, recall, F1 score, and FPR. We note that the dataset is significantly skewed toward non-BLSN instances, with BLSN occurrences varying from 1% to 34%. To better train the model, we upsampled the underrepresented class (BLSN) by randomly sampling with replacement from the available samples, as sketched below. Upsampling increases the prevalence of true cases in the training process, enhancing the model’s robustness in capturing various patterns. We also conducted a sensitivity test by changing the threshold value for binary classification from the default of 0.5 to other values; the classification metrics consistently produced similar results.
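A minimal sketch of the upsampling step with sklearn.utils.resample follows; the DataFrame name df and the binary label column “blsn” are illustrative assumptions.

```python
import pandas as pd
from sklearn.utils import resample

# df: assumed training DataFrame with a binary "blsn" label column.
majority = df[df["blsn"] == 0]
minority = df[df["blsn"] == 1]
minority_up = resample(minority, replace=True,      # sample with replacement
                       n_samples=len(majority), random_state=42)
df_balanced = pd.concat([majority, minority_up])
```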

Table 3. ML performance on the 20% testing data from January, July, and October of 2009 and 2010.

In January, the results demonstrate high accuracy scores of 0.994 and 0.995 for 2009 and 2010, respectively, with acceptable F1 scores of 0.679 and 0.733 and a low FPR. These months are dominated by non-BLSN instances, leaving limited training data for BLSN cases (less than 1%). As a result, a few false classifications of BLSN as non-BLSN and vice versa have a relatively minor impact on accuracy, but these misclassifications do affect the precision and recall of BLSN cases, leading to lower F1 scores.

Conversely, July (a winter month), with higher BLSN frequencies in the training data, shows notably good accuracy (0.919 and 0.933 for 2009 and 2010, respectively), accompanied by an increase in F1 scores over January (0.884 and 0.898 for 2009 and 2010, respectively) and a low FPR for BLSN classification, as presented in Table 3. October, the transitional month from winter to summer, presents the greatest challenge and opportunity in developing and testing a model. Consequently, it serves as the ideal month for training and evaluating the model and shows good performance metrics for October 2009 and 2010. These numbers demonstrate an improvement over the metrics reported by Yang et al. (2023), who reported an F1 score of 0.772 for October 2010 (their Table 2); in contrast, this model achieves an improved F1 score of 0.821 for the same period. Their analysis is limited to land ice regions, while this study includes both land and sea ice; therefore, the two studies are not entirely comparable. Despite very low BLSN counts in certain months of the training data, as indicated in Table 3, the model metrics underscore the effectiveness of a well-trained model.

Figures 6a and 6b depict the BLSN frequency distributions for October 2009 and 2010: (i) frequency maps from all CALIPSO observations (both collocated with MERRA-2 and noncollocated) and (ii) the modeled BLSN frequencies over land ice and sea ice. The modeled classification frequency maps reveal several hotspots of BLSN frequency that closely align with the CALIPSO truth over Antarctica. The ML model’s effectiveness in diagnosing BLSN is well demonstrated in the work of Yang et al. (2023), particularly when compared with collocated MODIS events (see their Figs. 4a and 4b).

Fig. 6. XGBoost monthly trained model classification maps over Antarctica for October (a) 2009 and (b) 2010, with (i) the CALIPSO truth classification frequency and (ii) the MERRA-2 predicted BLSN classification frequency, normalized to 1.

Applying the same algorithm, we diagnosed BLSN/non-BLSN occurrences for each October from 2007 to 2016. The modeled frequency maps show agreement with the truth values, effectively capturing most of the BLSN occurrences. These results align with the findings of Palm et al. (2018a) and Yang et al. (2023). The model exhibited consistently high accuracy across these October months, with an F1 score of up to 86.4% for October 2009. This performance over multiple years validates its reliability in identifying BLSN occurrences in Antarctica. However, the CALIPSO BLSN determination relies on signals that penetrate all the way to the surface; hence, there are no data when thick clouds or storms are present. In addition, CALIPSO tracks are limited, not covering the entirety of Antarctica in each hour, unlike the MERRA-2 grid, which consistently spans all of Antarctica every hour. As a result, the frequency maps in Figs. 6a and 6b (i) for CALIPSO truth values are underrepresented, potentially leading to differences between the modeling results and the CALIPSO observations. Nonetheless, when only the collocated data are compared, the modeling results are very close to the observations. The models are applied to every MERRA-2 data point, regardless of sky conditions, under the assumption that the meteorological conditions (wind, temperature, pressure, etc.) for BLSN occurrence do not change between clear and cloudy skies. We note a caveat: since the models are trained with clear-sky data points, this may result in uncertainties when diagnosing BLSN under clouds.

Although the model demonstrates good performance on the 20% test data, its reliability decreases when applied to entirely new data. For example, using a model trained on October 2010 to diagnose October 2009 leads to a decline in performance: the F1 score drops to 0.569, although it still maintains a decent accuracy score of 0.904 and less than 10% FPR. This pattern emerges when using a model trained on other years (e.g., 2011–12) to diagnose the same months in 2009, or the same months in different years. Table 4 displays some of these statistics for October 2009, where the F1 score is poor, at 0.488 and 0.624. The combination of good accuracy and a good F1 score is not consistently achieved for any month when using a model trained on different months or years. This illustrates the limitation of the approach of Yang et al. (2023) for generating long-term data: when using their models trained on single months (October 2010–12) to diagnose outside those months (October 2009), the F1 score drops severely to 0.48, 0.31, and 0.54, respectively.

Table 4. ML classification for 2009 using models trained on data from 2010 to 2012 and a combined model trained on October 2010–16 data, for the 21 variables with XGBoost.

A likely reason for the poor performance on months outside the training period is insufficiently diverse training data. The 21 variables (e.g., pressure, wind components, temperature) exhibit notable distribution disparities in the same months across different years, making it difficult for the model to capture the pattern (Bhatta and Yang 2023). For instance, January data from 2009 to 2016 fluctuate, highlighting the dataset’s complexity, which can degrade the model’s average performance on external months. We note that uncertainties in the MERRA-2 fields and the CALIPSO retrievals can also affect the results. Additionally, unbalanced training classes can impact model performance: in some cases, the accuracy score may be high, but the F1 score can turn out low because of heavily biased classes, even though we applied upsampling.

As mentioned above, the truth values were heavily skewed, with only about ∼12% of the samples labeled as positive (BLSN) in the training data. To address this class imbalance, we employed two main strategies. First, we augmented the minority class by enlarging the sample size of the positive (BLSN) class using upsampling. Second, we augmented the training data by combining the October 2010 data with the corresponding months’ data from other years. This approach led to improvements in the algorithm’s performance as we added the October data from subsequent years to the training set.

Table 4 shows that the model trained on cumulative October data from 2010 to 2016 outperformed the individual month-trained models when tested on October 2009. The combined model achieved an accuracy score of ∼0.92 and an improved F1 score of ∼0.73, exceeding the ∼0.59 F1 score that would have been achieved with the approach of Yang et al. (2023) while being comparable in accuracy. This highlights the superiority of the cumulative approach in achieving higher diagnostic accuracy and a more balanced F1 score compared to the individual month-based models. For the month of July, training the model on combined data from 2010 through 2016 yielded even better results: it achieved a notable F1 score of 0.80 and an accuracy of 0.86 for July 2009. These scores improved further to 0.82 and 0.88, respectively, for 2010, using July data from 2009 and 2011–16.

Additional tests were conducted for July and October using data from 2007 to 2016. This involved combining data from nine years, with 2007 and 2009 each reserved in turn as unobserved test data for separate evaluations. The classification metrics were computed on the excluded data, further demonstrating the effectiveness of using multiple years of data to develop a robust model. Figures 7a–d present the learning curves that illustrate the model’s performance on the testing data from October and July of 2009 and 2007.

Fig. 7. Learning curves of the XGBoost classifier on testing data for the benchmark months: (a) October 2009, (b) October 2007, (c) July 2009, and (d) July 2007. The curves depict the classifier’s performance metrics, including accuracy, precision, recall, and F1 score, along with the standard error for each metric. The classifier is trained incrementally starting from 2007 and tested on 2007 and 2009, excluding the entire test data from each benchmark-month training phase.

The curves illustrate the diagnostic performance across various training data sizes, evaluating metrics such as accuracy, recall, precision, and F1 score. Peak values are obtained when the model is trained with the maximum amount of training data. Specifically, Fig. 7a represents the model tested on October 2009 data while being trained on data spanning October 2007–16, excluding 2009. This setup shows that the best F1 and accuracy scores are achieved when the model is trained with the full accumulated dataset. Similar patterns are observed for the other test and training months, as shown in Fig. 7b for October 2007, Fig. 7c for July 2009, and Fig. 7d for July 2007. The consistent results across different months indicate the reliability and generalizability of the predictive models, reinforcing the effectiveness and robustness of the combined-months approach. As the learning curves progress and flatten, the model’s performance has been optimized, producing similar metrics across the multiple tests. This flattening also suggests that the model has learned sufficiently from the available data, resulting in stable and consistent metrics with minimal additional gains from further data. However, balancing metrics in machine learning often means compromising among them: notably, in Figs. 7a–c, without combining data from multiple years, the recall would have dropped, so accepting a slight dip in precision yields a better enhancement in recall. An even more important factor is the amount of training data: Table 3 shows BLSN counts for January 2009 and 2010 of only 177 and 120, respectively, barely reaching 1%. This insufficiency hampers effective model training; therefore, including data spanning 10 years creates a more conducive environment for training during those months.

Stability is crucial for diagnosing BLSN in months lacking truth values, ensuring the model’s applicability beyond the training period. These findings, supported by extensive testing, led to the preparation of final models aimed at diagnosing BLSN occurrences from the 1980s to the present. Since actual CALIPSO BLSN values are available only from 2007 to 2016 across all 12 months, our training and testing were limited to this period. Thus, the final BLSN classification model consists of 12 separate models, one for each month, trained on CALIPSO data from 2007 to 2016, yielding promising results. For instance, the decadal October-trained model (80%–20% split) demonstrated impressive performance, with an accuracy of 0.95, recall of 0.71, precision of 0.83, an F1 score of 0.77, and an FPR of 0.066. Similarly, the July-trained model across the same 10-yr period exhibited an accuracy of 0.91, recall of 0.86, precision of 0.86, an F1 score of 0.86, and an FPR of 0.1.

The following sections are designed to investigate the ML feature contribution and model explainability.

b. ML feature importance and explanations

Feature importance ranking is a critically important tool for interpreting the relationship between the input and target variables, and ML explanations are crucial for understanding model decision-making. Incorporating these aspects enhances transparency and trust in AI systems. Exploring feature importance, for example with XGBoost’s plot_importance function, helps determine rankings based on criteria like “weight,” “gain,” or “cover”; however, the rankings can vary with the criterion chosen. A more comprehensive approach based on Shapley additive explanations (SHAP) values offers a consistent interpretation of feature contributions (Lundberg and Lee 2017). Shapley (1953) introduced Shapley values in game theory to ensure fair benefit allocation among players in cooperative games. SHAP values assess the impact of a feature by comparing predictions with and without it, providing a fair and consistent evaluation over all potential feature subsets. This method allows a thorough understanding of the effects of the MERRA-2 variables on model training.

Let us consider a particular scenario when a specific data point from the MERRA-2 feature list is used to generate SHAP values. The first step is to establish a baseline value or reference point for each feature, typically the mean or median value across the entire dataset. This serves as the starting point for calculating SHAP values. Using a trained machine learning model, a diagnosis is made for the selected instance, producing either a probability or a class label to indicate whether the occurrence is related to BLSN or non-BLSN events. When determining SHAP values for a specific feature (such as U10M), it is important to consider how its value affects the prediction in relation to its baseline value. The SHAP value for each feature is calculated as the average of the differences between predictions made with and without “U10M” in all potential feature combinations. Based on the computed SHAP values, positive values indicate that a feature pushes the diagnosis in the direction of BLSN events, while negative SHAP values indicate a push in the opposite direction. When SHAP values are close to zero, the feature’s influence is neutral, or it has little to no impact on the prediction.
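For a trained tree-ensemble model, these SHAP values can be computed exactly with the shap package’s TreeExplainer; a minimal sketch follows, where best_model and X_test are the assumed trained XGBoost classifier and test feature matrix from the steps above.

```python
import shap

# best_model: trained XGBoost classifier; X_test: test feature matrix
# (both assumed from the steps above).
explainer = shap.TreeExplainer(best_model)
shap_values = explainer.shap_values(X_test)   # one value per feature per sample
shap.summary_plot(shap_values, X_test)        # beeswarm ranking, as in Fig. 8
```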

Using SHAP values, it is feasible to fully comprehend the beneficial and detrimental effects of the MERRA-2 variables on the model’s training. Mathematically, the SHAP value of the ith feature for input x is given by Eq. (8):
\phi_{i}(f, x) = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(|N| - |S| - 1)!}{|N|!} \left[ f(x_{S \cup \{x_i\}}) - f(x_S) \right]. \quad (8)
Here, ϕi(f, x) is the SHAP value for feature i in the context of model f and input x, and [f(x_{S∪{x_i}}) − f(x_S)] represents the marginal contribution, i.e., the difference in the diagnosis when the ith feature is included compared with when it is excluded. The weighting factor |S|!(|N| − |S| − 1)!/|N|! accounts for the impact of different groups of features while considering the total number of features available: N is the full feature set, and S is a specific subset of features, excluding feature i, for which the Shapley value is being calculated; the weights are determined by the number of ways each coalition can be formed. This factor considers the different ways features can be combined and the sizes of those combinations (Lundberg and Lee 2017).
Adopting the above equation for a scenario with three selected MERRA-2 features, the SHAP value for U10M can be expressed as
\phi_{\text{U10M}}(f, x) = \sum_{S \subseteq \{\text{V10M},\,\text{T2M}\}} \frac{|S|!\,(3 - |S| - 1)!}{3!} \left[ f(x_{S \cup \{\text{U10M}\}}) - f(x_S) \right]. \quad (9)
In Eq. (9), N = {U10M, V10M, T2M} is the set of all features, |S| is the size of subset S, and xS represents the input restricted to the features in S. Similarly, f(xS) is the model’s diagnosis when considering only the features in xS, and f(xS ∪ {U10M}) is the model’s diagnosis when considering both the features in xS and the U10M feature.

The equation calculates the difference in model predictions when the U10M feature is included in addition to the features in subset S, compared with when it is excluded, averaged over all possible subsets S that do not contain U10M.

Using the SHAP values explained above, we can measure the overall relevance of the features, considering their rankings during model training. Figure 8a displays the SHAP-based rankings of the input variables for the BLSN detection model: the 2-m temperature (T2M), northward wind (V10M), surface pressure (PS), and eastward wind (U10M) are the most dominant factors in determining BLSN/non-BLSN. In Fig. 8b, each row represents a feature and shows dots indicating its significance for individual samples. The color of each dot represents the feature’s value, while the x axis indicates its impact (SHAP value) on the model’s prediction; denser clusters of dots indicate more prevalent feature values. From the visualization, low values of T2M contribute positively to the model’s BLSN prediction, while high values have a negative impact. This reflects that the high-BLSN-frequency regions are in East Antarctica, where the temperature is low and the wind speed is high.

Fig. 8. The XGBoost classification model with tuned hyperparameters for October: (a) SHAP value–based feature importance ranks in descending order of contribution and (b) the corresponding SHAP value contributions of the MERRA-2 features.

Local interpretable model-agnostic explanation (LIME) helps us understand how a machine learning model predicts outcomes from input features (Ribeiro et al. 2016). MERRA-2 features like U10M, V10M, and T2M interact in complex ways, making the model’s decision process appear to be a black box. LIME provides local explanations by approximating the trained model with a simpler explainer model, identifying the key features influencing predictions for specific data points. Applying LIME to the MERRA-2 features reveals how the model uses wind speed, temperature, and other data, aiding in interpreting and trusting the model’s decisions under given conditions.

In our model, as a post hoc analysis, the LIME explanation helps us understand which inputs influenced the decision process and how. The LIME plot for the selected scenario shows that surface pressure and 2-m specific humidity are two strong indicators of non-BLSN events, as shown in Fig. 9. On the other hand, BLSN events strongly correlate with other variables such as geopotential height, temperature, and wind speed, as indicated by the LIME plot, feature importance, and SHAP value plots in Figs. 8a and 8b. In summary, LIME helps us understand how different variables contribute to the model’s prediction of BLSN and non-BLSN events, providing valuable insight into the factors that influence the model’s decisions for individual cases.
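A minimal sketch of generating such a LIME explanation for a single grid point follows; best_model, X_train, and X_test are assumed from the earlier steps, and num_features is an illustrative choice.

```python
from lime.lime_tabular import LimeTabularExplainer

# best_model, X_train, X_test: assumed from the earlier steps.
explainer = LimeTabularExplainer(
    X_train.values,
    feature_names=list(X_train.columns),
    class_names=["non-BLSN", "BLSN"],
    mode="classification",
)
exp = explainer.explain_instance(
    X_test.values[0],                 # one grid point/time to explain
    best_model.predict_proba,         # probability function of the trained model
    num_features=10,                  # top contributing variables (illustrative)
)
print(exp.as_list())                  # ranked (condition, weight) pairs, as in Fig. 9b
```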

Fig. 9. LIME plot: (a) the selected case values in default MERRA-2 units, rounded to two decimals [e.g., surface pressure (hPa), temperature (K), wind speed (m s−1), and specific humidity (kg kg−1)], in the lower box, and the probability of BLSN/non-BLSN in the upper box; (b) the variables, their thresholds, and the contribution ranks influencing the probability in (a).

In Fig. 9a, the model’s probabilities for each hypothesis and their positive and negative contributions are illustrated. For the selected instance, the model correctly predicts BLSN with a probability of 56% (above the 50% threshold for classification), while non-BLSN is predicted with a probability of 44%. The colors orange and blue in Figs. 9a and 9b represent positive and negative contributions to the model’s determination, respectively. These analyses help explain each prediction and understand the variables influencing BLSN or non-BLSN events. LIME is a useful tool for evaluating each case and understanding the confidence level of predictions based on their probability fractions. By setting a threshold, we can express the model’s diagnosis quality and confidence. In this example, the 56% probability of BLSN, only slightly higher than non-BLSN (44%), suggests a relatively narrow margin in favor of BLSN. This additional feature can be added to the MERRA-2 BLSN data as a quality check flag to express the confidence level in determination.

4. Summary

To gain insights into Antarctica’s surface mass balance (SMB), it is crucial to precisely detect blowing snow (BLSN) occurrences. This work builds on Yang et al. (2023) and extends their findings. Through systematic comparison of models, recursive feature elimination with cross validation for input feature selection, multiyear data combination for better modeling, and model result explainability, the current work fully operationalizes the idea proposed by Yang et al. (2023). In addition, this paper expands the BLSN diagnosis to include both the sea ice and land ice regions, whereas Yang et al. (2023) included only land ice. The procedural steps for model selection and the resulting outcomes are systematically presented. These models take advantage of the meteorological parameters of the Modern-Era Retrospective analysis for Research and Applications, version 2 (MERRA-2): key information from MERRA-2 as input and truth values from CALIPSO were integrated within a comprehensive framework to train the model.

The study utilized the recursive feature elimination with cross validation (RFECV) feature selection method to identify the most suitable set of input features from the MERRA-2 reanalysis. In the identification and diagnosis of BLSN occurrences in Antarctica, the combination of 21 crucial meteorological parameters, such as wind speed and direction, temperature, pressure, humidity, and snow age, had a substantial impact. Among five machine learning algorithms compared, extreme gradient boosting (XGBoost) demonstrated the best performance, leading to its selection for model construction.

The study conducted rigorous testing to ensure the accuracy of the BLSN diagnosis results. The best outcomes were achieved by utilizing separate models for each month. These final monthly models, developed with the XGBoost algorithm and trained on CALIPSO observations from 2007 to 2016, can identify Antarctic BLSN occurrences with high accuracy and precision. The models’ robustness was evaluated by withholding one year’s observations for each month as unseen test data while training on the remaining 2007–16 data; the models consistently demonstrated high BLSN diagnosis performance on each test set. Moreover, integrating multiple years of the same month further enhanced the models’ precision. Notably, for the season transition month of October 2009, the model achieved an accuracy above 90% and an F1 score above 70%, and the performance was even higher for the winter month of July. Thus, the adopted approach effectively addresses the challenges in BLSN diagnosis, yielding accurate and reliable results with an average accuracy of approximately 95% and an F1 score of around 82% for the combined 10-yr monthly models. The models developed here will be applied to the MERRA-2 dataset to generate hourly, grid-based MERRA-2 BLSN data from 1980 to the present.

This model and the obtained results have the potential for regional expansion to the Arctic and other parts of the globe to enhance BLSN diagnosis, although further testing and validation are required. Furthermore, the results facilitate the diagnosis of BLSN heights and optical depth and contribute to a deeper understanding of the Antarctic surface mass balance.

Acknowledgments.

Funding support for this research is from NASA’s Modeling, Analysis, and Prediction (MAP) and CloudSat/CALIPSO Science programs, both managed by David Considine. We thank the editor and three anonymous reviewers for their valuable and constructive reviews.

Data availability statement.

All data used in this study are publicly accessible. The CALIPSO lidar level 2 cloud layer product, version 4.20, can be directly downloaded from the provided link (https://opendap.larc.nasa.gov/opendap/CALIPSO/LID_L2_01kmCLay-Standard-V4-20/contents.html). Likewise, the CALIPSO lidar level 2 Antarctic blowing snow product, version 1.00, is also available for direct download through the link (https://opendap.larc.nasa.gov/opendap/CALIPSO/LID_L2_BlowingSnow_Antarctica-Standard-V1-00/contents.html). The MERRA-2 data can be accessed online (https://disc.gsfc.nasa.gov/datasets?project=MERRA-2) and are managed by NASA’s Goddard Earth Sciences (GES) Data and Information Services Center (DISC). The model’s predicted sample data are available at https://github.com/surenbhatta2/Antarctica_MERRA2_BLSN_data. Future updates and additional data links will also be provided here. In the meantime, the authors are happy to provide additional data upon request.

REFERENCES

  • Bengfort, B., and R. Bilbro, 2019: Yellowbrick: Visualizing the Scikit-learn model selection process. J. Open Source Software, 4, 1075, https://doi.org/10.21105/joss.01075.

  • Bhatta, S., and Y. Yang, 2023: Reconstructing PM2.5 data record for the Kathmandu valley using a machine learning model. Atmosphere, 14, 1073, https://doi.org/10.3390/atmos14071073.

  • Bosilovich, M. G., R. Lucchesi, and M. Suarez, 2015: MERRA-2: File Specification. GMAO Office Note 9 (version 1.0), 73 pp., https://ntrs.nasa.gov/api/citations/20150019760/downloads/20150019760.pdf.

  • Chen, T., and C. Guestrin, 2016: XGBoost: A scalable tree boosting system. Proc. 22nd ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, San Francisco, CA, Association for Computing Machinery, 785–794, https://dl.acm.org/doi/10.1145/2939672.2939785.

  • Chicco, D., and G. Jurman, 2020: The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics, 21, 6, https://doi.org/10.1186/s12864-019-6413-7.

  • Chung, Y.-C., S. Bélair, and J. Mailhot, 2011: Blowing snow on Arctic sea ice: Results from an improved sea ice–snow–blowing snow coupled system. J. Hydrometeor., 12, 678–689, https://doi.org/10.1175/2011JHM1293.1.

  • Gelaro, R., and Coauthors, 2017: The Modern-Era Retrospective Analysis for Research and Applications, version 2 (MERRA-2). J. Climate, 30, 5419–5454, https://doi.org/10.1175/JCLI-D-16-0758.1.

  • Huang, J., and L. Jaeglé, 2017: Wintertime enhancements of sea salt aerosol in polar regions consistent with a sea ice source from blowing snow. Atmos. Chem. Phys., 17, 3699–3712, https://doi.org/10.5194/acp-17-3699-2017.

  • Leonard, K. C., and T. Maksym, 2011: The importance of wind-blown snow redistribution to snow accumulation on Bellingshausen Sea ice. Ann. Glaciol., 52, 271–278, https://doi.org/10.3189/172756411795931651.

  • Li, L., and J. W. Pomeroy, 1997: Probability of occurrence of blowing snow. J. Geophys. Res., 102, 21 955–21 964, https://doi.org/10.1029/97JD01522.

  • Lundberg, S. M., and S.-I. Lee, 2017: A unified approach to interpreting model predictions. Proc. 31st Int. Conf. on Neural Information Processing Systems, Long Beach, CA, Association for Computing Machinery, 4768–4777, https://dl.acm.org/doi/10.5555/3295222.3295230.

  • Mahesh, A., R. Eager, J. R. Campbell, and J. D. Spinhirne, 2003: Observations of blowing snow at the South Pole. J. Geophys. Res., 108, 4707, https://doi.org/10.1029/2002JD003327.

  • Mann, G. W., P. S. Anderson, and S. D. Mobbs, 2000: Profile measurements of blowing snow at Halley, Antarctica. J. Geophys. Res., 105, 24 491–24 508, https://doi.org/10.1029/2000JD900247.

  • Mellor, M., 1965: Blowing snow. CRREL Monograph, Part III, Section A3c, U.S. Army Corps of Engineers, Cold Regions Research and Engineering Laboratory, 79 pp.

  • Palm, S. P., Y. Yang, J. D. Spinhirne, and A. Marshak, 2011: Satellite remote sensing of blowing snow properties over Antarctica. J. Geophys. Res., 116, D16123, https://doi.org/10.1029/2011JD015828.

  • Palm, S. P., V. Kayetha, Y. Yang, and R. Pauly, 2017: Blowing snow sublimation and transport over Antarctica from 11 years of CALIPSO observations. Cryosphere, 11, 2555–2569, https://doi.org/10.5194/tc-11-2555-2017.

  • Palm, S. P., V. Kayetha, and Y. Yang, 2018a: Toward a satellite-derived climatology of blowing snow over Antarctica. J. Geophys. Res. Atmos., 123, 10 301–10 313, https://doi.org/10.1029/2018JD028632.

  • Palm, S. P., Y. Yang, V. Kayetha, and J. P. Nicolas, 2018b: Insight into the thermodynamic structure of blowing-snow layers in Antarctica from dropsonde and CALIPSO measurements. J. Appl. Meteor. Climatol., 57, 2733–2748, https://doi.org/10.1175/JAMC-D-18-0082.1.

  • Palm, S. P., Y. Yang, U. Herzfeld, D. Hancock, A. Hayes, P. Selmer, W. Hart, and D. Hlavka, 2021: ICESat-2 atmospheric channel description, data processing and first results. Earth Space Sci., 8, e2020EA001470, https://doi.org/10.1029/2020EA001470.

  • Pomeroy, J. W., and H. G. Jones, 1996: Wind-blown snow: Sublimation, transport and changes to polar snow. Chemical Exchange between the Atmosphere and Polar Snow, E. W. Wolff and R. C. Bales, Eds., NATO ASI Series, Vol. 43, Springer, 453–489.

  • Ribeiro, M. T., S. Singh, and C. Guestrin, 2016: “Why Should I Trust You?”: Explaining the predictions of any classifier. KDD’16: Proc. 22nd ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, New York, NY, Association for Computing Machinery, 1135–1144, https://dl.acm.org/doi/10.1145/2939672.2939778.

  • Scarchilli, C., M. Frezzotti, P. Grigioni, L. De Silvestri, L. Agnoletto, and S. Dolci, 2010: Extraordinary blowing snow transport events in East Antarctica. Climate Dyn., 34, 1195–1206, https://doi.org/10.1007/s00382-009-0601-0.

  • Schmidt, R. A., 1980: Threshold wind-speeds and elastic impact in snow transport. J. Glaciol., 26, 453–467, https://doi.org/10.3189/S0022143000010972.

  • Schmidt, R. A., 1982: Properties of blowing snow. Rev. Geophys., 20, 39–44, https://doi.org/10.1029/RG020i001p00039.

  • Shapley, L. S., 1953: A value for n-person games. Contributions to the Theory of Games, Vol. II, H. W. Kuhn and A. W. Tucker, Eds., Princeton University Press, 307–317.

  • Sokolova, M., N. Japkowicz, and S. Szpakowicz, 2006: Beyond accuracy, F-Score and ROC: A family of discriminant measures for performance evaluation. AI 2006: Advances in Artificial Intelligence, A. Sattar and B. Kang, Eds., Lecture Notes in Computer Science, Vol. 4304, Springer, 1015–1021.

  • Taylor, P., 1998: The thermodynamic effects of sublimating, blowing snow in the atmospheric boundary layer. Bound.-Layer Meteor., 89, 251–283, https://doi.org/10.1023/A:1001712111718.

  • WCRP Global Sea Level Budget Group, 2018: Global sea-level budget 1993–present. Earth Syst. Sci. Data, 10, 1551–1590, https://doi.org/10.5194/essd-10-1551-2018.

  • Winker, D. M., M. A. Vaughan, A. Omar, Y. Hu, K. A. Powell, Z. Liu, W. H. Hunt, and S. A. Young, 2009: Overview of the CALIPSO mission and CALIOP data processing algorithms. J. Atmos. Oceanic Technol., 26, 2310–2323, https://doi.org/10.1175/2009JTECHA1281.1.

  • XGBoost, 2022: XGBoost Parameters — XGBoost 1.7.5 documentation. Accessed 29 April 2023, https://xgboost.readthedocs.io/en/stable/parameter.html.

  • Yang, Y., S. P. Palm, A. Marshak, D. L. Wu, H. Yu, and Q. Fu, 2014: First satellite-detected perturbations of outgoing longwave radiation associated with blowing snow events over Antarctica. Geophys. Res. Lett., 41, 730–735, https://doi.org/10.1002/2013GL058932.

  • Yang, Y., A. Anderson, D. Kiv, J. Germann, M. Fuchs, S. Palm, and T. Wang, 2021: Study of Antarctic blowing snow storms using MODIS and CALIOP observations with a machine learning model. Earth Space Sci., 8, e2020EA001310, https://doi.org/10.1029/2020EA001310.

  • Yang, Y., D. Kiv, S. Bhatta, M. Ganeshan, X. Lu, and S. Palm, 2023: Diagnosis of Antarctic blowing snow properties using MERRA-2 reanalysis with a machine learning model. J. Appl. Meteor. Climatol., 62, 1055–1068, https://doi.org/10.1175/JAMC-D-23-0004.1.
  • Fig. 1.

Flowchart illustrating the complete process, from training the model to data production, starting with the blending of CALIPSO BLSN and cloud layer data with MERRA-2 data. Subsequent steps include feature and model selection, culminating in deployment of the model.

  • Fig. 2.

Results of the RFECV feature selection method: the vertical line marks the number of features with the highest cross-validation score, and the shaded area represents the standard deviation. The number of features and the corresponding F1-weighted score are displayed for the representative month of October 2010, considering the list of 28 variables from Table 1. (A minimal code sketch of this step follows the figure list.)

  • Fig. 3.

Box-and-whisker plots of the 21 selected input features of the October 2010 training data, showing their distributions for BLSN and non-BLSN cases with standardized (z score) values on the vertical axis. The upper and lower boundaries of each box represent the upper (Q3) and lower (Q1) quartiles, respectively; the center lines indicate the median, and the whisker caps extend to the maximum and minimum values for each variable, showing the variation in the data.

  • Fig. 4.

Comparison of classification ML models for October 2010 using the 21 selected variables: (a) accuracy, precision, recall, and F1 scores for decision tree, RF, Gaussian NB, KNN, and XGBoost; (b) AUC–ROC curves for the corresponding models, illustrating the most suitable model for the classification problem.

  • Fig. 5.

Tree diagram of the XGBoost classifier, starting from the base model and showing the split thresholds, to illustrate the trained model.

  • Fig. 6.

Classification maps over Antarctica from the monthly trained XGBoost model for October (a) 2009 and (b) 2010, showing (i) the CALIPSO truth classification frequency and (ii) the MERRA-2-predicted BLSN classification frequency, each normalized to 1.

  • Fig. 7.

Learning curves of the XGBoost classifier on testing data for the benchmark months: (a) October 2009, (b) October 2007, (c) July 2009, and (d) July 2007. The curves depict the classifier’s performance metrics (accuracy, precision, recall, and F1 score) along with the standard error for each metric. The classifier is trained incrementally beginning in 2007 and tested on 2007 and 2009, with the entire test dataset for each benchmark month excluded from the training phase.

  • Fig. 8.

The XGBoost classification ML model with adjusted hyperparameters for October: (a) SHAP value–based feature importances, ranked in descending order of their contribution; (b) corresponding SHAP value contributions of the MERRA-2 variables. (A code sketch of the SHAP and LIME analyses follows the figure list.)

  • Fig. 9.

LIME plot: (a) the selected case values, in default MERRA-2 units rounded to two decimal places [e.g., surface pressure (hPa), temperature (K), wind speed (m s−1), and specific humidity (kg kg−1)], in the lower box, and the probability of BLSN/non-BLSN in the upper box; (b) the variables, their thresholds, and the contribution ranks influencing the probability in (a).
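To make the feature selection step of Fig. 2 concrete, here is a minimal sketch of recursive feature elimination with cross-validation (RFECV) using scikit-learn; the input file and column names are hypothetical placeholders:

```python
# Minimal sketch of the RFECV step illustrated in Fig. 2 (hypothetical inputs).
import pandas as pd
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
from xgboost import XGBClassifier

df = pd.read_parquet("october_2010_collocated.parquet")   # hypothetical file
feature_cols = [c for c in df.columns if c != "blsn_flag"]
X, y = df[feature_cols], df["blsn_flag"]

selector = RFECV(
    estimator=XGBClassifier(n_estimators=100),
    step=1,                      # eliminate one feature per iteration
    cv=StratifiedKFold(5),
    scoring="f1_weighted",       # the score plotted in Fig. 2
)
selector.fit(X, y)
print("optimal number of features:", selector.n_features_)
print("selected:", [f for f, keep in zip(feature_cols, selector.support_) if keep])
```

Similarly, the following is a minimal sketch of the explainability analyses behind Figs. 8 and 9, under the same hypothetical inputs; it is an illustrative setup, not the exact configuration used to produce the figures:

```python
# Minimal sketch of the SHAP (Fig. 8) and LIME (Fig. 9) analyses
# (hypothetical inputs; X and y as defined in the RFECV sketch above).
import shap
from lime.lime_tabular import LimeTabularExplainer
from xgboost import XGBClassifier

model = XGBClassifier(n_estimators=100).fit(X, y)

# SHAP (Fig. 8): global feature-importance ranking from TreeExplainer.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X, plot_type="bar")

# LIME (Fig. 9): local explanation for a single diagnosed case.
lime_explainer = LimeTabularExplainer(
    X.values,
    feature_names=list(X.columns),
    class_names=["non-BLSN", "BLSN"],
    mode="classification",
)
case = lime_explainer.explain_instance(X.values[0], model.predict_proba,
                                       num_features=10)
print(case.as_list())   # (feature threshold, contribution) pairs, as in Fig. 9b
```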
