Assessing the Impact of Biased Target Variables on Machine Learning Models of Severe Hail

Montgomery L. Flora
a Cooperative Institute for Severe and High-Impact Weather Research and Operations, University of Oklahoma, Norman, Oklahoma
b NOAA/OAR/National Severe Storms Laboratory, Norman, Oklahoma

Patrick Skinner
a Cooperative Institute for Severe and High-Impact Weather Research and Operations, University of Oklahoma, Norman, Oklahoma
b NOAA/OAR/National Severe Storms Laboratory, Norman, Oklahoma
c School of Meteorology, University of Oklahoma, Norman, Oklahoma

Corey K. Potvin
b NOAA/OAR/National Severe Storms Laboratory, Norman, Oklahoma
a Cooperative Institute for Severe and High-Impact Weather Research and Operations, University of Oklahoma, Norman, Oklahoma
c School of Meteorology, University of Oklahoma, Norman, Oklahoma

Brian Matilla
a Cooperative Institute for Severe and High-Impact Weather Research and Operations, University of Oklahoma, Norman, Oklahoma
b NOAA/OAR/National Severe Storms Laboratory, Norman, Oklahoma

Anthony Reinhart
b NOAA/OAR/National Severe Storms Laboratory, Norman, Oklahoma


Abstract

This study examines the implications of using traditional local storm reports (LSRs) versus radar-derived Multi-Radar Multi-Sensor (MRMS) maximum estimated size of hail (MESH) as classification target variables for training and evaluating machine learning (ML) models that predict severe hail events. Using input data from the NSSL Warn-on-Forecast System (WoFS), we explore how the LSR and MESH severe hail climatologies compare in WoFS and how model performance varies with the choice of target variable for training and testing. Regardless of the training target variable, all ML models performed better when evaluated on MESH. The improved performance of the LSR-trained model on MESH was attributed to MESH better capturing nighttime events, which reduced spurious false alarms compared to evaluating on LSRs only. However, the best model for a given target variable was the one trained on that target variable. For example, when evaluating on LSRs, the LSR-trained model performed best. This has operational significance, as MESH-trained models may underperform LSR-trained models if the verification target is LSRs. We attribute the better MESH scores to MESH being more spatially and temporally consistent with WoFS than LSRs are. Nevertheless, whether either approach better predicts severe hail occurrence is still to be determined. Lastly, combining MESH and LSRs did not significantly improve model performance, which may be because each dataset has unique error sources that do not cancel out. Ultimately, the main goal of this study is to shed light on the broader implications of data choice in the training and verification of ML models.

© 2025 American Meteorological Society. This is an Author Accepted Manuscript distributed under the terms of the default AMS reuse license. For information regarding reuse and general copyright information, consult the AMS Copyright Policy (www.ametsoc.org/PUBSReuseLicenses).

Corresponding author: Montgomery Flora, monte.flora@noaa.gov
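
The train/evaluate comparison described in the abstract can be pictured as a small grid: each model (named by the target it was trained on) is scored against each verification target. The sketch below is only an illustration under stated assumptions, not the authors' code. The critical success index (CSI) and the 0.5 probability threshold are assumed here for simplicity, the function names are hypothetical, and the numbers are made-up toy values rather than results from the study.

```python
def contingency(probs, labels, threshold=0.5):
    """Count hits, misses, and false alarms for thresholded probabilities."""
    hits = misses = false_alarms = 0
    for p, y in zip(probs, labels):
        forecast_yes = p >= threshold
        if forecast_yes and y:
            hits += 1
        elif forecast_yes and not y:
            false_alarms += 1
        elif y:
            misses += 1
    return hits, misses, false_alarms


def csi(probs, labels, threshold=0.5):
    """Critical success index: hits / (hits + misses + false alarms)."""
    hits, misses, false_alarms = contingency(probs, labels, threshold)
    denom = hits + misses + false_alarms
    return hits / denom if denom else float("nan")


def cross_target_scores(predictions, targets, threshold=0.5):
    """Score every model against every verification target."""
    return {
        (model_name, target_name): csi(probs, labels, threshold)
        for model_name, probs in predictions.items()
        for target_name, labels in targets.items()
    }


# Toy data (invented for illustration): six forecast samples.
# 1 means a severe hail event was recorded by that data source.
targets = {
    "LSR": [1, 0, 0, 1, 0, 0],
    "MESH": [1, 1, 0, 1, 0, 0],
}
# Hypothetical forecast probabilities from two models.
predictions = {
    "LSR-trained": [0.9, 0.6, 0.1, 0.7, 0.2, 0.1],
    "MESH-trained": [0.8, 0.7, 0.2, 0.6, 0.6, 0.1],
}

scores = cross_target_scores(predictions, targets)
for (model_name, target_name), value in sorted(scores.items()):
    print(f"{model_name} evaluated on {target_name}: CSI = {value:.2f}")
```

Reading the grid by column (one verification target at a time) shows which model is best for that target, while reading it by row shows how a single model's score changes when only the verification dataset changes.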
