Abstract
Satellite low-Earth-orbiting (LEO) and geostationary (GEO) imager estimates of cloud-top pressure (CTP) have many applications in both operations and in studying long-term variations in cloud properties. Recently, machine learning (ML) approaches have shown improvement upon physically based algorithms. However, ML approaches, and especially neural networks, can suffer from a lack of interpretability, making it difficult to understand what information is most useful for accurate predictions of cloud properties. We trained several neural networks to estimate CTP from the infrared channels of the Visible Infrared Imaging Radiometer Suite (VIIRS) and the Advanced Baseline Imager (ABI). The main focus of this work is assessing the relative importance of each instrument’s infrared channels in neural networks trained to estimate CTP. We use several ML explainability methods to offer different perspectives on feature importance. These methods show many differences in the relative feature importance depending on the exact method used, but most agree on a few points. Overall, the 8.4- and 8.6-μm channels appear to be the most useful for CTP estimation on ABI and VIIRS, respectively, with other native infrared window channels and the 13.3-μm channel playing a moderate role. Furthermore, we find that the neural networks learn relationships that may account for properties of clouds such as opacity and cloud-top phase that otherwise complicate the estimation of CTP.
Significance Statement
Model interpretability is an important consideration for transitioning machine learning models to operations. This work applies several explainability methods in an attempt to understand what information is most important for estimating the pressure level at the top of a cloud from satellite imagers in a neural network model. We observe much disagreement between approaches, which motivates further work in this area, but we find agreement on the importance of channels in the infrared window region around 8.6 and 10–12 μm, informing future cloud property algorithm development. We also find some evidence suggesting that these neural networks are able to learn physically relevant variability in radiation measurements related to key cloud properties.
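The abstract does not list the specific explainability methods used; as one common example, the sketch below (Python, illustrative only) shows permutation feature importance, which scores each input channel by how much shuffling it degrades a trained model's error. The `model` and `metric` objects are assumed placeholders, not the study's actual network or loss.

```python
import numpy as np

def permutation_importance(model, X, y, metric, n_repeats=5, seed=None):
    """Score each feature (e.g., an IR channel) by the increase in error when
    that feature's values are shuffled across samples, breaking its link to y."""
    rng = np.random.default_rng(seed)
    baseline = metric(y, model.predict(X))
    scores = np.zeros((X.shape[1], n_repeats))
    for j in range(X.shape[1]):          # loop over input channels
        for r in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            scores[j, r] = metric(y, model.predict(X_perm)) - baseline
    return scores.mean(axis=1)           # larger mean degradation = more important channel
```

Other explainability methods (gradient-based attributions, SHAP, layerwise relevance propagation) can rank channels differently, which is consistent with the disagreement between approaches noted above.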
Abstract
A major challenge for food security worldwide is the large interannual variability of crop yield, and climate change is expected to further exacerbate this volatility. Accurate prediction of the crop response to climate variability and change is critical for short-term management and long-term planning in multiple sectors. In this study, using maize in the U.S. Corn Belt as an example, we train and validate multiple machine learning (ML) models predicting crop yield based on meteorological variables and soil properties using the leave-one-year-out approach, and compare their performance with that of a widely used process-based crop model (PBM). Our proposed long short-term memory model with attention (LSTMatt) outperforms other ML models (including other variations of LSTM developed in this study) and explains 73% of the spatiotemporal variance of the observed maize yield, in contrast to 16% explained by the regionally calibrated PBM; the magnitude of yield prediction errors in LSTMatt is about one-third of that in the PBM. When applied to the extreme drought year 2012 that has no counterpart in the training data, the LSTMatt performance drops but still shows an advantage over the PBM. Findings from this study suggest great potential for out-of-sample application of the LSTMatt model to predict crop yield under a changing climate.
Significance Statement
Changing climate is expected to exacerbate extreme weather events, thus affecting global food security. Accurate estimation and prediction of crop productivity under extremes are crucial for long-term agricultural decision-making and climate adaptation planning. Here we seek to improve crop yield prediction from meteorological features and soil properties using machine learning approaches. Our long short-term memory (LSTM) model with attention and shortcut connection explains 73% of the spatiotemporal variance of the observed maize yield in the U.S. Corn Belt and outperforms a widely used process-based crop model even in an extreme drought year when meteorological conditions are significantly different from the training data. Our findings suggest great potential for out-of-sample application of the LSTM model to predict crop yield under a changing climate.
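As a minimal sketch of the leave-one-year-out protocol described above (illustrative only; the `years` array of per-sample calendar years is an assumed input), each year is held out in turn so the model is always evaluated on a year it never saw during training:

```python
import numpy as np

def leave_one_year_out_splits(years):
    """Yield (held_out_year, train_idx, test_idx): every calendar year serves
    once as the test set while all remaining years form the training set."""
    years = np.asarray(years)
    for held_out in np.unique(years):
        test_idx = np.flatnonzero(years == held_out)
        train_idx = np.flatnonzero(years != held_out)
        yield held_out, train_idx, test_idx
```

Under this split, an extreme year such as 2012 has no drought analog in its training fold, which is exactly the out-of-sample test emphasized above.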
Abstract
Deep learning models are developed for high-resolution quantitative precipitation nowcasting (QPN) in Taiwan up to 3 h ahead. Many recent works aim to accurately predict relatively rare high-rainfall events with the help of deep learning. This rarity is often addressed by formulations that reweight the rare events. However, these formulations often carry a side effect of producing blurry rain-map nowcasts that overpredict in low-rainfall regions. Such nowcasts are visually less trustworthy and practically less useful for forecasters. We fix the trust issue by introducing a discriminator that encourages the model to generate realistic rain maps without sacrificing the predictive accuracy of rainfall extremes. Moreover, with consecutive attention across different hours, we extend the nowcasting time frame from the typical 1 h to 3 h to further address the needs of socioeconomic, weather-dependent decision-making. By combining the discriminator and the attention techniques, the proposed model based on the convolutional recurrent neural network is trained with a dataset containing radar reflectivity and rain rates at a granularity of 10 min and predicts the hourly accumulated rainfall in the next 3 h. Model performance is evaluated from both statistical and case-study perspectives. Statistical verification shows that the new model outperforms the current operational QPN techniques. Case studies further show that the model can capture the motion of rainbands in a frontal case and also provide an effective warning of urban-area torrential rainfall in an afternoon-thunderstorm case, implying that deep learning has great potential and is useful in 0–3-h nowcasting.
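The exact loss formulation is not given in the abstract; the sketch below is a generic illustration (not the authors' implementation) of the two ingredients it describes: a rainfall-weighted regression term that reweights rare heavy-rain pixels, and an adversarial term from a discriminator that rewards realistic-looking nowcasts. The thresholds, weights, and `disc_score_on_pred` probabilities are assumed placeholders.

```python
import numpy as np

def weighted_mse(y_true, y_pred, thresholds=(2.0, 5.0, 10.0), weights=(1.0, 2.0, 5.0, 10.0)):
    """Rainfall-weighted MSE: errors at heavier observed rain rates are penalized
    more, counteracting the rarity of high-rainfall pixels."""
    w = np.full(y_true.shape, weights[0], dtype=float)
    for t, wt in zip(thresholds, weights[1:]):
        w = np.where(y_true >= t, wt, w)
    return float(np.mean(w * (y_true - y_pred) ** 2))

def generator_loss(y_true, y_pred, disc_score_on_pred, adv_weight=0.01):
    """Composite generator loss: weighted regression term plus an adversarial term
    that is small when the discriminator judges the nowcast to be realistic."""
    adversarial = -np.mean(np.log(disc_score_on_pred + 1e-7))
    return weighted_mse(y_true, y_pred) + adv_weight * adversarial
```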
Abstract
The ability to find and recognize patterns in high-dimensional geophysical data is fundamental to climate science and critical for meaningful interpretation of weather and climate processes. Archetypal analysis (AA) is one technique that has recently gained traction in the geophysical science community for its ability to find patterns based on extreme conditions. While traditional empirical orthogonal function (EOF) analysis can reveal patterns based on data covariance, AA seeks patterns from the points located at the edges of the data distribution. The utility of any objective pattern method depends on the properties of the data to which it is applied and the choices made in implementing the method. Given the relative novelty of the application of AA in geophysics, it is important to develop experience in applying the method. We provide an assessment of the method, implementation, sensitivity, and interpretation of AA with respect to geophysical data. As an example for demonstration, we apply AA to a 39-yr sea surface temperature (SST) reanalysis dataset. We show that the decisions made to implement AA can significantly affect the interpretation of results, but also, in the case of SST, that the analysis is exceptionally robust under both spatial and temporal coarse graining.
Significance Statement
Archetypal analysis (AA), when applied to geophysical fields, is a technique designed to find typical configurations or modes in underlying data. This technique is relatively new to the geophysical science community and has been shown to be beneficial to the interpretation of extreme modes of the climate system. The identification of extreme modes of variability and their expression in day-to-day weather or state of the climate at longer time scales may help in elucidating the interplay between major teleconnection drivers and their evolution in a changing climate. The purpose of this work is to bring together a comprehensive report of the AA methodology using an SST reanalysis for demonstration. It is shown that the AA results are significantly affected by each implementation decision, but also can be resilient to spatiotemporal averaging. Any application of AA should provide a clear documentation of the choices made in applying the method.
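For readers unfamiliar with AA, its standard optimization problem can be stated compactly as follows (generic notation, not specific to the SST application above). With data matrix $\mathbf{X} \in \mathbb{R}^{n \times p}$ ($n$ time samples of a $p$-point field) and $k$ archetypes,

$$
\min_{\mathbf{A},\,\mathbf{B}} \;\bigl\lVert \mathbf{X} - \mathbf{A}\mathbf{B}\mathbf{X} \bigr\rVert_F^2
\quad \text{subject to} \quad
a_{ij} \ge 0,\; \sum_{j=1}^{k} a_{ij} = 1, \qquad
b_{jl} \ge 0,\; \sum_{l=1}^{n} b_{jl} = 1,
$$

where the archetypes $\mathbf{Z} = \mathbf{B}\mathbf{X}$ are convex combinations of observed states (hence they sit near the edges of the data distribution), and $\mathbf{A}$ reconstructs each sample as a convex combination of archetypes. The implementation decisions referred to above include, for example, the number of archetypes $k$, the preprocessing of $\mathbf{X}$ (anomalies, weighting), and the initialization and stopping criteria of the alternating optimization.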
Abstract
Droplet-level interactions in clouds are often parameterized by a modified gamma distribution fitted to a “global” droplet size distribution. Do “local” droplet size distributions of relevance to microphysical processes look like these average distributions? This paper describes an algorithm to search and classify characteristic size distributions within a cloud. The approach combines hypothesis testing, specifically, the Kolmogorov–Smirnov (KS) test, and a widely used class of machine learning algorithms for identifying clusters of samples with similar properties: density-based spatial clustering of applications with noise (DBSCAN) is used as the specific example for illustration. The two-sample KS test does not presume any specific distribution, is parameter free, and avoids biases from binning. Importantly, the number of clusters is not an input parameter of the DBSCAN-type algorithms but is independently determined in an unsupervised fashion. As implemented, it works on an abstract space from the KS test results, and hence spatial correlation is not required for a cluster. The method is explored using data obtained from the Holographic Detector for Clouds (HOLODEC) deployed during the Aerosol and Cloud Experiments in the Eastern North Atlantic (ACE-ENA) field campaign. The algorithm identifies evidence of the existence of clusters of nearly identical local size distributions. It is found that cloud segments have as few as one and as many as seven characteristic size distributions. To validate the algorithm’s robustness, it is tested on a synthetic dataset and successfully identifies the predefined distributions at plausible noise levels. The algorithm is general and is expected to be useful in other applications, such as remote sensing of cloud and rain properties.
Significance Statement
A typical cloud can have billions of drops spread over tens or hundreds of kilometers in space. Keeping track of the sizes, positions, and interactions of all of these droplets is impractical, and, as such, information about the relative abundance of large and small drops is typically quantified with a “size distribution.” Droplets in a cloud interact locally, however, so this work is motivated by the question of whether the cloud droplet size distribution is different in different parts of a cloud. A new method, based on hypothesis testing and machine learning, determines how many different size distributions are contained in a given cloud. This is important because the size distribution describes processes such as cloud droplet growth and light transmission through clouds.
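A minimal sketch of the pipeline described above, assuming `samples` is a list of 1D arrays of droplet sizes from local cloud segments (SciPy and scikit-learn calls shown; the paper's abstract-space construction and parameter choices may differ):

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.cluster import DBSCAN

def cluster_local_distributions(samples, eps=0.3, min_samples=3):
    """Group local droplet-size samples whose empirical distributions are
    statistically indistinguishable, using pairwise two-sample KS distances."""
    n = len(samples)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            stat = ks_2samp(samples[i], samples[j]).statistic  # KS distance, no binning
            dist[i, j] = dist[j, i] = stat
    labels = DBSCAN(eps=eps, min_samples=min_samples, metric="precomputed").fit_predict(dist)
    return labels  # -1 marks "noise" samples that join no cluster
```

The number of distinct non-noise labels is the number of characteristic size distributions, which, as noted above, is not specified in advance.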
Abstract
Benchmark datasets and benchmark problems have been a key aspect for the success of modern machine learning applications in many scientific domains. Consequently, an active discussion about benchmarks for applications of machine learning has also started in the atmospheric sciences. Such benchmarks allow for the comparison of machine learning tools and approaches in a quantitative way and enable a separation of concerns for domain and machine learning scientists. However, a clear definition of benchmark datasets for weather and climate applications is missing, with the result that many domain scientists are confused. In this paper, we equip the domain of atmospheric sciences with a recipe for building proper benchmark datasets, present a (nonexclusive) list of domain-specific challenges for machine learning, and elaborate on where and what benchmark datasets will be needed to tackle these challenges. We hope that the creation of benchmark datasets will help the machine learning efforts in atmospheric sciences to be more coherent, and, at the same time, target the efforts of machine learning scientists and experts of high-performance computing to the most imminent challenges in atmospheric sciences. We focus on benchmarks for atmospheric sciences (weather, climate, and air-quality applications). However, many aspects of this paper will also hold for other areas of the Earth system sciences or are at least transferable.
Significance Statement
Machine learning is the study of computer algorithms that learn automatically from data. Atmospheric sciences have started to explore sophisticated machine learning techniques and the community is making rapid progress on the uptake of new methods for a large number of application areas. This paper provides a clear definition of so-called benchmark datasets for weather and climate applications that help to share data and machine learning solutions between research groups to reduce time spent in data processing, to generate synergies between groups, and to make tool developments more targeted and comparable. Furthermore, a list of benchmark datasets that will be needed to tackle important challenges for the use of machine learning in atmospheric sciences is provided.
Abstract
Sea surface slope (SSS) responds to oceanic processes and other environmental parameters. This study aims to identify the parameters that influence SSS variability. We use SSS calculated from multiyear satellite altimeter observations and focus on small resolvable scales in the 30–100-km wavelength band. First, we revisit the correlation of mesoscale ocean variability with seafloor roughness as a function of depth, as proposed by Gille et al. Our results confirm that in shallow water there is a statistically significant positive correlation between rough bathymetry and surface variability, whereas the opposite is true in the deep ocean. In the next step, we assemble 27 features as input variables to fit the SSS with a linear regression model and a boosted trees regression model, and then we make predictions. Model performance metrics for the linear regression model are R² = 0.381 and mean square error = 0.010 μrad². For the boosted trees model, R² = 0.563 and mean square error = 0.007 μrad². Using the hold-out data, we identify the most important influencing factors to be the distance to the nearest thermocline boundary, significant wave height, mean dynamic topography gradient, and M2 tidal speed. However, there are individual regions, for example, the Amazon outflow, that cannot be predicted by our model, suggesting that these regions are governed by processes that are not represented in our input features. The results highlight both the value of machine learning and its shortcomings in identifying mechanisms governing oceanic phenomena.
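As a generic illustration of the two-model comparison above (the study's exact boosted-trees library and hyperparameters are not stated here, so scikit-learn defaults are used as placeholders), X is the 27-feature matrix and y the band-limited SSS variability:

```python
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

def fit_and_score(X, y, random_state=0):
    """Fit a linear baseline and a boosted-trees regressor on the same feature
    matrix and report hold-out R^2 and MSE for each."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=random_state)
    results = {}
    for name, model in [("linear", LinearRegression()),
                        ("boosted_trees", GradientBoostingRegressor(random_state=random_state))]:
        model.fit(X_tr, y_tr)
        pred = model.predict(X_te)
        results[name] = {"R2": r2_score(y_te, pred), "MSE": mean_squared_error(y_te, pred)}
    return results
```

A ranking of influencing factors such as the one reported above can then be obtained, for example, from the fitted tree ensemble's `feature_importances_` attribute or from permutation importance on the hold-out data.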
Abstract
We develop and demonstrate a new interpretable deep learning model specifically designed for image analysis in Earth system science applications. The neural network is designed to be inherently interpretable, rather than explained via post hoc methods. This is achieved by training the network to identify parts of training images that act as prototypes for correctly classifying unseen images. The new network architecture extends the interpretable prototype architecture of a previous study in computer science to incorporate absolute location. This is useful for Earth system science where images are typically the result of physics-based processes, and the information is often geolocated. Although the network is constrained to only learn via similarities to a small number of learned prototypes, it can be trained to exhibit only a minimal reduction in accuracy relative to noninterpretable architectures. We apply the new model to two Earth science use cases: a synthetic dataset that loosely represents atmospheric high and low pressure systems, and atmospheric reanalysis fields to identify the state of tropical convective activity associated with the Madden–Julian oscillation. In both cases, we demonstrate that considering absolute location greatly improves testing accuracies when compared with a location-agnostic method. Furthermore, the network architecture identifies specific historical dates that capture multivariate, prototypical behavior of tropical climate variability.
Significance Statement
Machine learning models are incredibly powerful predictors but are often opaque “black boxes.” The how-and-why the model makes its predictions is inscrutable—the model is not interpretable. We introduce a new machine learning model specifically designed for image analysis in Earth system science applications. The model is designed to be inherently interpretable and extends previous work in computer science to incorporate location information. This is important because images in Earth system science are typically the result of physics-based processes, and the information is often map based. We demonstrate its use for two Earth science use cases and show that the interpretable network exhibits only a small reduction in accuracy relative to black-box models.
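As a loose conceptual sketch (not the authors' architecture) of location-aware prototype scoring: each learned prototype is compared with the latent feature map only at its own fixed grid location, so both pattern and absolute position inform the similarity. The array shapes and the similarity form below are assumptions.

```python
import numpy as np

def prototype_similarities(feature_map, prototypes, locations):
    """feature_map: (H, W, C) latent image; prototypes: list of (h, w, C) arrays;
    locations: list of (row, col) anchors. Returns one similarity per prototype,
    larger when the patch at that prototype's own location closely matches it."""
    sims = []
    for proto, (i, j) in zip(prototypes, locations):
        patch = feature_map[i:i + proto.shape[0], j:j + proto.shape[1], :]
        d2 = np.sum((patch - proto) ** 2)
        sims.append(np.log((d2 + 1.0) / (d2 + 1e-4)))  # ProtoPNet-style similarity score
    return np.array(sims)
```

A final linear layer acting on these similarities then yields the class prediction, so every decision can be traced back to "which prototype, where, and how similar."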
Abstract
Large inaccuracies still exist in predicting fog formation, dissipation, and duration. To address these deficiencies, machine learning (ML) algorithms are increasingly used in nowcasting in addition to numerical fog forecasts because of their computational speed and their ability to learn the nonlinear interactions between the variables. Although a powerful tool, ML models require precise training and thorough evaluation to prevent misinterpretation of the scores. In addition, a fog dataset’s temporal order and the autocorrelation of the variables must be considered. Therefore, pitfalls related to classification-based ML in fog forecasting will be demonstrated in this study by using an XGBoost fog forecasting model. By also using two baseline models that simulate guessing and persistence behavior, we have established two independent evaluation thresholds allowing a more meaningful grading of the ML model’s performance. It will be shown that, despite high validation scores, the model could still fail in operational application. If persistence behavior is simulated, commonly used scores are insufficient to measure the performance. That will be demonstrated through a separate analysis of fog formation and dissipation, because these are crucial for a good fog forecast. We also show that commonly used blockwise and leave-many-out cross-validation methods might inflate the validation scores and are therefore less suitable than a temporally ordered expanding window split. The presented approach provides an evaluation score that closely mimics not only the performance on the training and test dataset but also the operational model’s fog forecasting abilities.
Significance Statement
This study points out current pitfalls in the training and evaluation of pointwise radiation fog forecasting with machine learning algorithms. The objective of this study is to raise awareness of 1) consideration of the time stability of variables (autocorrelation) during training and evaluation, 2) the necessity of evaluating the performance of a fog forecasting model in direct comparison with an independent performance threshold (baseline model) that evaluates whether the fog forecasting model is better than guessing, and 3) the fact that prediction of fog formation and dissipation must be evaluated separately because a model that misses all of these transitions can still achieve high performance in the commonly used overall evaluation.
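A minimal sketch of the temporally ordered expanding-window split recommended above (illustrative; the fold count and minimum training fraction are assumptions): the training window only grows forward in time, and each validation block lies strictly after it, so autocorrelated fog episodes cannot leak from validation back into training.

```python
import numpy as np

def expanding_window_splits(n_samples, n_splits=5, min_train_frac=0.3):
    """Yield (train_idx, val_idx) pairs for time-ordered data: training indices
    always precede validation indices, and the training window expands each fold."""
    start = int(n_samples * min_train_frac)
    fold_edges = np.linspace(start, n_samples, n_splits + 1, dtype=int)
    for k in range(n_splits):
        train_idx = np.arange(0, fold_edges[k])
        val_idx = np.arange(fold_edges[k], fold_edges[k + 1])
        yield train_idx, val_idx
```

Blockwise or leave-many-out splits, by contrast, can place samples from the same persistent fog event in both training and validation, which is one way the inflated scores noted above can arise.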
Abstract
Bow echoes (BEs) are bow-shaped lines of convective cells that are often associated with swaths of damaging straight-line winds and small tornadoes. This paper describes a convolutional neural network (CNN) able to detect BEs directly from French kilometer-scale model outputs in order to facilitate and accelerate the operational forecasting of BEs. The detections are based only on the maximum pseudoreflectivity field predictor (“pseudo” because it is expressed in mm h⁻¹ and not in dBZ). A preprocessing of the training database is carried out in order to reduce imbalance issues between the two classes (inside or outside bow echoes). A sensitivity analysis of the CNN to a set of hyperparameters is performed. The selected CNN configuration has a hit rate of 86% and a false alarm rate of 39%. The strengths and weaknesses of this CNN are then emphasized with an object-oriented evaluation. The largest BE pseudoreflectivities are correctly detected by the CNN, which tends to underestimate the size of BEs. Detected BE objects have wind gusts similar to those of the hand-labeled BEs. Most of the time, false alarm objects and missed objects are rather small (e.g., <1500 km²). Based on a cooperation with forecasters, synthesis plots are proposed that summarize the BE detections in French kilometer-scale models. A subjective evaluation of the CNN’s performance is also reported. The overall positive feedback from forecasters is in good agreement with the object-oriented evaluation. Forecasters perceive these products as relevant and potentially useful to handle the large amount of available data from numerical weather prediction models.
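For reference, the hit rate and false-alarm rate quoted above can be computed from binary detection and hand-labeled masks as in the sketch below (one common convention, shown here pixelwise; the paper's object-oriented evaluation differs, and the exact definition used there is an assumption):

```python
import numpy as np

def detection_scores(pred_mask, truth_mask):
    """pred_mask, truth_mask: boolean arrays (detected / hand-labeled bow echo).
    Returns the hit rate (probability of detection) and the false-alarm ratio."""
    hits = np.sum(pred_mask & truth_mask)
    misses = np.sum(~pred_mask & truth_mask)
    false_alarms = np.sum(pred_mask & ~truth_mask)
    pod = hits / (hits + misses)
    far = false_alarms / (hits + false_alarms)
    return pod, far
```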