An Assessment of How Domain Experts Evaluate Machine Learning in Operational Meteorology

David R. Harrison (Cooperative Institute for Severe and High-Impact Weather Research and Operations, The University of Oklahoma, Norman, Oklahoma; School of Meteorology, The University of Oklahoma, Norman, Oklahoma; and NOAA/NWS/Storm Prediction Center, Norman, Oklahoma), https://orcid.org/0000-0002-8796-8035

Amy McGovern (School of Meteorology, The University of Oklahoma, Norman, Oklahoma; School of Computer Science, The University of Oklahoma, Norman, Oklahoma; and NSF AI Institute for Research on Trustworthy AI in Weather, Climate, and Coastal Oceanography, The University of Oklahoma, Norman, Oklahoma)

Christopher D. Karstens (NOAA/NWS/Storm Prediction Center, Norman, Oklahoma; and School of Meteorology, The University of Oklahoma, Norman, Oklahoma)

Ann Bostrom (School of Public Policy and Governance, University of Washington, Seattle, Washington; and NSF AI Institute for Research on Trustworthy AI in Weather, Climate, and Coastal Oceanography, The University of Oklahoma, Norman, Oklahoma)

Julie L. Demuth (NCAR/Mesoscale and Microscale Meteorology Laboratory, Boulder, Colorado; and NSF AI Institute for Research on Trustworthy AI in Weather, Climate, and Coastal Oceanography, The University of Oklahoma, Norman, Oklahoma)

Israel L. Jirak (NOAA/NWS/Storm Prediction Center, Norman, Oklahoma)

Patrick T. Marsh (NOAA/NWS/Storm Prediction Center, Norman, Oklahoma)

Abstract

As an increasing number of machine learning (ML) products enter the research-to-operations (R2O) pipeline, researchers have anecdotally noted a perceived hesitancy by operational forecasters to adopt this relatively new technology. One explanation often cited in the literature is that this hesitancy derives from the complex and opaque nature of ML methods. Because modern ML models are trained to solve tasks by optimizing a potentially complex combination of mathematical weights, thresholds, and nonlinear cost functions, it can be difficult to determine how these models reach a solution from their given input. However, it remains unclear to what degree a model’s transparency may influence a forecaster’s decision to use that model, or whether that influence differs between ML and more traditional (i.e., non-ML) methods. To address this question, a survey was offered to forecaster and researcher participants attending the 2021 NOAA Hazardous Weather Testbed (HWT) Spring Forecasting Experiment (SFE), with questions about how participants subjectively perceive ML products and compare them to more traditionally derived products. Results from this study revealed few differences in how participants evaluated ML products compared to other types of guidance. However, comparing responses among operational forecasters, researchers, and academics exposed notable differences in which factors the three groups considered most important for determining the operational success of a new forecast product. These results support the need for increased collaboration between the operational and research communities.
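To make the opacity concern concrete, the minimal sketch below (hypothetical Python, with synthetic data and illustrative predictor names; it is not code or data from the study) trains a random forest whose reasoning is spread across thousands of learned split thresholds and therefore cannot practically be read off directly, then applies permutation importance, one common post hoc technique for summarizing how such a model reaches its solution from its inputs.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.inspection import permutation_importance

    # Synthetic convective predictors (illustrative only): CAPE (J kg-1),
    # 0-6-km bulk shear (m s-1), and LCL height (m).
    rng = np.random.default_rng(42)
    n = 2000
    X = np.column_stack([
        rng.gamma(2.0, 800.0, n),      # CAPE
        rng.normal(18.0, 6.0, n),      # bulk shear
        rng.normal(1200.0, 400.0, n),  # LCL height
    ])
    # Toy "severe weather" label driven by CAPE and shear; LCL is a distractor.
    y = ((X[:, 0] > 1500.0) & (X[:, 1] > 15.0)).astype(int)

    # The fitted forest encodes its decision rules in thousands of split
    # thresholds across 200 trees -- opaque if inspected directly.
    model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

    # Permutation importance: shuffle one predictor at a time and measure
    # the loss of skill, a post hoc summary of what the model relies on.
    result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
    for name, score in zip(["CAPE", "shear", "LCL"], result.importances_mean):
        print(f"{name}: {score:.3f}")

Post hoc summaries of this kind are one way developers can expose model behavior to end users, consistent with the emphasis on explainable product behavior noted in the significance statement below.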

Significance Statement

Participants in the 2021 Hazardous Weather Testbed Spring Forecasting Experiment were surveyed to assess how machine learning products are perceived and evaluated in operational settings. The results revealed little difference in how machine learning products are evaluated compared to more traditional methods but emphasized the need for explainable product behavior and comprehensive end-user training.

© 2025 American Meteorological Society. This published article is licensed under the terms of the default AMS reuse license. For information regarding reuse of this content and general copyright information, consult the AMS Copyright Policy (www.ametsoc.org/PUBSReuseLicenses).

Corresponding author: David R. Harrison, david.harrison@noaa.gov

Supplementary Materials

    • Supplemental Materials (ZIP 0.5753 MB)