Abstract
This study examines the implications of using traditional local storm reports (LSRs) versus the radar-derived Multi-Radar Multi-Sensor (MRMS) maximum estimated size of hail (MESH) as classification target variables for training and evaluating machine learning (ML) models that predict severe hail events. Using input data from the NSSL Warn-on-Forecast System (WoFS), we explore how the LSR and MESH severe hail climatologies compare in WoFS and how model performance varies with the choice of target variable for training and testing. Regardless of the training target variable, all ML models performed better when evaluated against MESH. The improved performance of the LSR-trained model on MESH was attributed to MESH better capturing nighttime events, which reduced spurious false alarms relative to evaluating against LSRs only. However, the best model for a given target variable was the one trained on that target variable. For example, when evaluating against LSRs, the LSR-trained model performed best. This has operational significance because MESH-trained models may underperform LSR-trained models when LSRs are the evaluation target variable. We attribute the better MESH scores to MESH being more spatially and temporally consistent with WoFS than LSRs are. Nevertheless, which approach better predicts severe hail occurrence remains to be determined. Lastly, combining MESH and LSRs did not significantly improve model performance, possibly because each dataset has unique error sources that do not cancel out. Ultimately, this study aims to shed light on the broader implications of data choice in the training and verification of ML models.
© 2025 American Meteorological Society. This is an Author Accepted Manuscript distributed under the terms of the default AMS reuse license. For information regarding reuse and general copyright information, consult the AMS Copyright Policy (www.ametsoc.org/PUBSReuseLicenses).