1. Introduction
Climate variability and weather phenomena that cause extremes can significantly stress natural and human resources and result in costly disasters that claim lives (Thornton et al. 2014; Strader and Ashley 2015; Hu et al. 2018; Seneviratne et al. 2021; Ebi et al. 2021). According to the American Meteorological Society (AMS), climate variability refers to “temporal variations of the atmosphere–ocean system around a mean state” (American Meteorological Society 2022), and thus typically comprises phenomena of subseasonal (minimum of 2 weeks) or longer time scales (Fig. 1), which can have natural (e.g., volcanic eruptions and orbital cycles that impact insolation) and/or anthropogenic sources [e.g., greenhouse gases (GHG) and sulfate/biomass burning aerosols; Ghil 2002; Salinger 2005]. Weather phenomena that can cause extremes include, for example, tropical cyclones (TCs), atmospheric rivers (ARs), and mesoscale convective systems (MCSs; Fig. 1). While the frequency, intensity, and spatial extent of weather features that cause extremes can also be influenced by multiscale modes of climate variability (Ropelewski and Halpert 1986; Molina et al. 2018; Kim et al. 2019; Lin et al. 2020), we purposefully delineate the two to aid in describing the challenges that remain in their understanding. For example, it is not clear whether climate variability is sufficiently captured in Earth system models (ESMs) because of limited observational records. Moreover, ESM grid spacing for climate time scales (i.e., future projections) is usually too coarse (∼100 km) to capture fine-scale properties of weather phenomena that lead to extremes (∼50 km or less; see, e.g., Li et al. 2013), especially considering that the effective resolution of numerical weather prediction (NWP) models and contemporary climate models is several times coarser than the actual grid spacing due to the effects of diffusion (Skamarock 2004; Klaver et al. 2020). This paper is not about the stresses on natural and human resources themselves, but about the machine learning methods and tools used to study those stresses.

Fig. 1. Time scales of example weather and climate phenomena. Listed example phenomena include severe convective storms (SCSs), mesoscale convective systems (MCSs), atmospheric rivers (ARs), tropical cyclones (TCs), the Madden–Julian oscillation (MJO), El Niño–Southern Oscillation (ENSO), Pacific decadal variability (PDV), Atlantic multidecadal variability (AMV), the Atlantic meridional overturning circulation (AMOC), and greenhouse gases (GHG). The figure is adapted from Meehl et al. (2021).
Across the broader scientific community, rapid advances in machine learning (ML; Haupt et al. 2022), and deep learning (DL; LeCun et al. 2015) in particular, have inspired researchers to consider how these tools might enable science advances that previously would have been unattainable. Since DL is a subset of ML, we use ML to refer to both herein for simplicity. For climate variability and weather, the appeal of ML stems in part from its ability to model complex nonlinear systems and in part from recent algorithmic and computational advances [e.g., graphics processing units (GPUs)] that have improved and accelerated ML model training. ML advances have enabled scientists and software engineers to probe difficult questions that span climate variability and weather, and have been a driving force behind recent national and international workshops (e.g., Chantry et al. 2021). In particular, the Earth and Environmental Systems Science Division and Advanced Scientific Computing Research of the U.S. Department of Energy Office of Science held a 6-week, large community workshop during the autumn of 2021 to foster research coordination across numerous disciplines and professional sectors (Hickmon et al. 2022). This 6-week workshop included an extended session on climate variability and weather phenomena that can cause extremes. One of the planned outcomes included a synthesis of recent progress and key challenges identified during the extended session, which motivated a more comprehensive review of research pertaining to ML applications for climate variability and weather phenomena, contained herein.
This paper targets two primary audiences: 1) domain scientists conducting research on climate variability and weather phenomena that can cause extremes (CV&E) who may not yet have incorporated ML into their research and 2) researchers with some ML experience seeking to broaden their knowledge of ML applications in the field of CV&E. The goals of this paper are to
- help domain scientists understand the range of ML tools that already exist and have been applied to CV&E, and thus avoid “reinventing the wheel”;
- discuss challenges that exist in certain applications of ML (and why off-the-shelf ML tools may not always be appropriate for CV&E); and
- outline future directions in ML and CV&E that we believe will have the most impact in advancing the field.
2. Overview of ML uses and challenges in CV&E
Here we discuss how ML has been recently used in CV&E and highlight some remaining challenges specific to ML for CV&E. ML applications in CV&E can fall within certain stages of the ESM workflow, which we delineate as initializing the ESM, running the ESM, and postprocessing ESM output (Fig. 2). However, recent studies focusing on extreme weather prediction using large high-resolution observations (on meter to kilometer scales) do not fit neatly within these stages of the ESM workflow. These openly available large observational datasets, such as those from radar (e.g., Multi-Radar Multi-Sensor; Smith et al. 2016; Zhang et al. 2016) and satellite instruments (e.g., NOAA GOES Advanced Baseline Imager; Schmit et al. 2017), have made it possible to advance extreme weather prediction using ML (e.g., Lagerquist et al. 2020; Lee et al. 2021). Other topics that have relevance to CV&E but are not discussed within this article, such as representation of subgrid processes (e.g., parameterizations), can be found in the associated workshop report (Hickmon et al. 2022).

Fig. 2. Example machine learning applications across the typical Earth system modeling workflow include input data and ESM initialization (e.g., Sha et al. 2021; Buizza et al. 2022; Bird et al. 2023), running the ESM (e.g., Gettelman et al. 2021; Weyn et al. 2021; Schevenhoven and Carrassi 2022), and ESM output (e.g., Kim et al. 2021; Barnes et al. 2019; O’Brien et al. 2022; Prabhat et al. 2021).
a. Sources of predictability for modes of climate variability
There are several climate variability phenomena with strong global influence, such as El Niño–Southern Oscillation (ENSO; Rasmusson and Carpenter 1982; Trenberth 1997) and the Madden–Julian oscillation (MJO; Madden and Julian 1971, 1972), and prediction of these phenomena and associated teleconnections has seen substantial progress in recent years with ML (Table 1; e.g., Ham et al. 2019; Mayer and Barnes 2021; Martin et al. 2022; Gordon and Barnes 2022). For example, Ham et al. (2019) trained a convolutional neural network to predict ENSO at lead times beyond 1 yr with skill that exceeds state-of-the-art dynamical forecast models. Mayer and Barnes (2021) predicted the sign of geopotential height anomalies on subseasonal time scales (i.e., from 2 weeks to 2 months) using an artificial neural network (ANN). These studies also went beyond generating skillful predictions by using explainable artificial intelligence (XAI) methods, such as layerwise relevance propagation (Toms et al. 2020), to explore the sources of predictability that trained neural networks (NNs) identified as resulting in good forecast performance (McGovern et al. 2019). ML can also be used to assess the representation of modes of climate variability in large climate simulations (e.g., Maher et al. 2022) and to quantify their associated teleconnections in a causal framework (McGraw and Barnes 2018; Runge et al. 2019; Kretschmer et al. 2021). Causal methods include two overarching categories: causal discovery, in which models that illustrate cause-and-effect relationships are created (e.g., causal graphs), and causal inference, in which the impacts of modifying a system on its cause-and-effect relationships are explored (Yao et al. 2021; Nogueira et al. 2022).
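As a minimal illustration of the attribution workflow described above, the sketch below computes a gradient × input relevance map for a hypothetical network that maps a sea surface temperature anomaly field to an ENSO-like index. The grid size, architecture, and (untrained) weights are placeholders, and gradient × input is only a simplified stand-in for the XAI methods (e.g., layerwise relevance propagation) used in the cited studies.

```python
# Hedged sketch (not from the cited studies): gradient x input attribution for a
# hypothetical ANN that maps a flattened SST anomaly map to an ENSO-like index.
# A real application would use a trained network and reanalysis/ESM fields.
import torch
import torch.nn as nn

n_lat, n_lon = 24, 72                       # illustrative tropical Pacific grid
model = nn.Sequential(                      # placeholder ANN; weights are untrained
    nn.Linear(n_lat * n_lon, 64), nn.ReLU(),
    nn.Linear(64, 1),
)

sst_anom = torch.randn(1, n_lat * n_lon, requires_grad=True)  # one input sample
pred = model(sst_anom)                      # predicted index for this sample
pred.sum().backward()                       # gradient of the prediction w.r.t. inputs

relevance = (sst_anom.grad * sst_anom).detach().reshape(n_lat, n_lon)
# Positive (negative) values mark grid cells that pushed the prediction up (down);
# maps like this are what XAI studies inspect to infer sources of predictability.
print(relevance.shape, float(relevance.abs().max()))
```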
Table 1. Synthesis of recent ML-based advances for sources of predictability for modes of climate variability.


Despite progress, numerous challenges remain. For example, it remains unclear whether climate variability is sufficiently captured in ESMs due to artificial trends in observational data (e.g., improvements in observing networks over time), data gaps, and scale differences among observational products and ESMs. Uncertainty and consistency across XAI methods remain underexplored but are highly relevant to knowledge discovery (Fayaz et al. 2022). For example, Mamalakis et al. (2022) illustrated that inconsistent results can be gleaned from different XAI methods, suggesting that a variant of layerwise relevance propagation and inputs multiplied by their respective gradients produced the correct XAI results for their study as compared with other methods. Thus, identifying potential failure modes of XAI, and uncertainty quantification pertaining to different types of XAI methods, are both crucial to establish confidence levels in XAI output and determine whether ML predictions are “right for the right reasons.” Collinearity among input variables also presents challenges for XAI attribution studies, limiting our ability to quantify variable importance or contributions to ML model predictions. This issue is particularly acute in the Earth sciences, where many physical variables are interrelated. For example, Kamangir et al. (2022) showed that their input variables, wind and temperature, were correlated, which could affect the fidelity of their attribution results. Variants of the permutation feature importance test have been used to assess sensitivity to collinearity (e.g., the multipass forward algorithm, single-pass backward algorithm, and multipass backward algorithm), although they are computationally expensive (Lagerquist 2020). Additionally, while XAI can reveal correlations between input features and outputs, the statistical adage applies: “correlation does not imply causation.” Currently, many statistical and explainable ML methods do not consider causality and, if used to infer predictability sources, may result in spurious or overconfident sources of predictability for modes of climate variability and for their teleconnections to regional climate or weather extremes. Thus, sources of predictability need to be better quantified using causal frameworks (e.g., Kretschmer et al. 2021). Human biases can also leak into the scientific process when using ML for knowledge discovery, which can lead to the discarding of new information that may conflict with past established work (Molnar et al. 2021; Molina et al. 2021; Kumar and Sharma 2022).
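To make the permutation idea above concrete, the sketch below runs a single-pass permutation importance test on synthetic data with two deliberately collinear predictors; the "model" is a trivial least squares fit standing in for any trained ML model, and the multipass forward/backward variants cited above repeat this shuffling iteratively.

```python
# Hedged sketch of single-pass permutation feature importance on synthetic data
# with two deliberately correlated predictors (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x1 = rng.normal(size=n)                     # e.g., a "temperature" predictor
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)    # e.g., a "wind" predictor, collinear with x1
x3 = rng.normal(size=n)                     # an unrelated predictor
X = np.column_stack([x1, x2, x3])
y = 2.0 * x1 + 0.5 * x3 + 0.1 * rng.normal(size=n)

coefs, *_ = np.linalg.lstsq(X, y, rcond=None)   # stand-in for any trained ML model
predict = lambda A: A @ coefs
base_mse = np.mean((predict(X) - y) ** 2)

for j, name in enumerate(["x1", "x2", "x3"]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])        # break the link between feature j and y
    mse = np.mean((predict(Xp) - y) ** 2)
    print(f"{name}: importance (MSE increase) = {mse - base_mse:.3f}")
# With collinear x1 and x2, importance can be split or shifted between them,
# which is exactly the attribution ambiguity discussed above.
```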
b. Feature detection
Over the last decade, there has been a considerable increase in research associated with the detection of mesoscale and synoptic-scale weather phenomena (i.e., features) that can cause extremes, such as tropical cyclones (e.g., Walsh et al. 2015), atmospheric rivers (e.g., O’Brien et al. 2022), and mesoscale convective systems (e.g., Feng et al. 2021). The spatial extent of the mesoscale is usually defined as a few kilometers to several hundred kilometers, in contrast to the spatial extent of the synoptic scale, which is usually defined as a few hundred to several thousand kilometers. The majority of feature detection research to date uses objective, heuristic (i.e., threshold-based) detection algorithms; however, in the past several years, ML-based approaches have appeared (see a summary in Table 2).
Table 2. Synthesis of recent ML-based advances for feature detection in CV&E. In addition to NN, ANN, and CNN, which are defined in the main text, several other abbreviations are used for brevity: recurrent NN (RNN), residual NN (ResNet), generative adversarial network (GAN), long short-term memory network (LSTM), and U-Net (a type of CNN; Ronneberger et al. 2015).


For example, Liu et al. (2016) trained a convolutional neural network (CNN) to detect tropical cyclones, atmospheric rivers, and fronts in ESM output, using a combination of heuristic detection algorithm outputs and hand-labeled data for training. Lagerquist et al. (2019) likewise used a CNN to detect fronts in reanalysis data, using human-labeled fronts as the training dataset. Prabhat et al. (2021) built on the work of Liu et al. (2016) and trained a deep CNN to detect atmospheric rivers and tropical cyclones in a climate simulation using a large dataset of expert-labeled fields. Quinting and Grams (2022) trained a deep CNN to detect warm conveyor belts in NWP output, using an analysis of Lagrangian tracer advection as the training dataset, showing that the model works on a range of input datasets (Quinting et al. 2022). There have also been several attempts to use unsupervised approaches to detect weather phenomena. Radić et al. (2015) used self-organizing maps to characterize changes in different types of extreme atmospheric rivers in climate model projections. Rupe et al. (2019a,b) demonstrated that computational mechanics (specifically, ϵ machines) can objectively and automatically detect eddy-like features in fluid simulations. There have also been some efforts to combine heuristic detection approaches with optimization based on expert-labeled datasets: a form of statistical–heuristic machine learning that has been used for tropical cyclones (Zarzycki and Ullrich 2017) and for atmospheric rivers (O’Brien et al. 2020). Among the advances outlined in this paragraph, detection of localized extremes at the microscale (such as hail and tornadoes) that pose significant societal danger remains a gap in feature detection work at climate scales for several reasons: (i) insufficient spatial resolution to explicitly capture these events (e.g., ERA5 has ∼32-km horizontal grid spacing; Hersbach et al. 2020), (ii) lack of labels globally (i.e., in situ observations to use as predictands), and (iii) limitations in label quality (e.g., hail reports are clustered on roads; Allen and Tippett 2015). However, severe weather hazard prediction and probabilistic guidance using ML have made strides, having been recently adopted by the National Weather Service and included in experimental NWP (e.g., the NOAA Warn-on-Forecast; Flora et al. 2021; Clark and Loken 2022). Extreme weather prediction is discussed further in section 2c.
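The supervised, pixelwise-labeling setup behind several of the studies above can be sketched in a few dozen lines: a fully convolutional network maps an input field to a per-pixel probability of "feature" membership and is trained against labeled masks. The data, labels, and tiny architecture below are synthetic placeholders, not ClimateNet or any of the cited models.

```python
# Hedged sketch (placeholder data and architecture, not ClimateNet itself): a tiny
# fully convolutional network trained to produce a pixelwise mask of a "feature"
# (e.g., an AR-like filament) from a single 2D input field.
import torch
import torch.nn as nn

torch.manual_seed(0)
n_samples, H, W = 32, 64, 64
fields = torch.randn(n_samples, 1, H, W)                # stand-in for IVT or similar
masks = (fields > 1.5).float()                          # crude synthetic "expert labels"

net = nn.Sequential(                                    # small fully convolutional net
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, kernel_size=1),                    # per-pixel logit
)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for epoch in range(20):                                 # short illustrative training loop
    opt.zero_grad()
    loss = loss_fn(net(fields), masks)
    loss.backward()
    opt.step()

with torch.no_grad():
    detected = (torch.sigmoid(net(fields)) > 0.5).float()
    iou = (detected * masks).sum() / ((detected + masks).clamp(max=1).sum() + 1e-6)
print(f"final loss {loss.item():.3f}, IoU on training fields {iou.item():.3f}")
```

Real feature-detection models replace the placeholder network with deeper architectures (e.g., U-Nets) and the synthetic masks with expert or heuristic labels, but the training loop has the same structure.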
ML approaches to detect or track weather phenomena have not yet been adopted widely, which may be partly due to research questions that still need to be addressed. Key questions relate to transferability across ESMs and the lack of interpretability of results, for example: Will an ML model trained to detect features in one specific region perform well in another? Will an ML model trained to detect features in the present climate perform well in a different (e.g., future) climate? And will an ML model trained to detect features on a specific dataset or ESM perform well on different ESMs? Uncertainty in detecting weather phenomena also stems in part from ambiguities in the definitions of these phenomena as provided by domain experts. For example, while much effort was placed in creating a formal definition for “atmospheric rivers” that the scientific community could agree upon (Ralph et al. 2018), ambiguities regarding how to delineate the MJO from atmospheric rivers (Toride and Hakim 2022) and how to define Antarctic atmospheric rivers (Pohl et al. 2021) persist. Uncertainty in the definition of weather phenomena can also arise during transitional stages (e.g., tropical cyclone to extratropical cyclone). With that said, by employing techniques from explainable ML (Mamalakis et al. 2022) and considering how ML methods identify features, these studies could yield insights that would allow us to better disambiguate these features. Additionally, a lack of training and evaluation data challenges many areas of ML research in Earth system prediction. Some weather phenomena have extensive publicly available databases of expert labels (e.g., IBTrACS for tropical cyclones; Knapp et al. 2010), whereas other phenomena have no existing standardized datasets (e.g., the global monsoon or Antarctic atmospheric rivers; Geen et al. 2020; Wille et al. 2021). This lack of labeled data is a significant barrier to training supervised algorithms and inhibits rigorous evaluation of unsupervised methods. There is a major opportunity for the academic community to engage students systematically in the development of such databases. Initiatives such as ClimateNet (Prabhat et al. 2021), which led to the development of a web interface for labeling tropical cyclones and atmospheric rivers in climate snapshots, will be invaluable for enabling communities of climate experts to develop such datasets for other climate variability and weather phenomena.
c. Extreme weather and climate prediction and precursors
There has been recent success in the ML-based prediction of weather and climate extremes, along with the identification of associated precursors (Table 3; e.g., Herman and Schumacher 2018; Sønderby et al. 2020; Ravuri et al. 2021; Frame et al. 2022). At the weather time scale, for example, Gagne et al. (2019) used a convolutional neural network to predict hail with 1-h lead time from convection-permitting (3-km horizontal grid spacing) model output and then identified associated precursors using ML explainability tools. Permutation feature importance and saliency maps (McGovern et al. 2019) yielded insights into what the convolutional neural network learned for skillful hail prediction, including storm morphology (e.g., supercell) and environmental parameters that contribute to storm intensification and hail growth (Gagne et al. 2019). At short, medium, and subseasonal time scales, Lopez-Gomez et al. (2023) trained convolutional neural networks with custom loss functions to predict surface temperature and extreme heat waves with lead times of 1–28 days on a coarse global grid (about 200 km). On longer climate time scales (decades to centuries), Dagon et al. (2022) detected weather fronts (e.g., cold, warm, occluded, and stationary fronts) using a convolutional neural network and then quantified their association with extreme precipitation over North America along with projected changes in a future climate; thus, feature detection was used to extract precursors to extreme precipitation events. Skillful ML-based prediction of weather and climate has garnered broad interest from the private sector as well, although not necessarily focused on extremes. Recent examples include a Fourier forecasting neural network (FourCastNet) for short- to medium-range predictions (Pathak et al. 2022), an autoregressive model using graph neural networks (GraphCast) with 10-day lead time (Lam et al. 2022), a 3D Earth-Specific Transformer (3DEST) model (Pangu-Weather) with 1-h to 1-week lead times (Bi et al. 2022), and ClimaX, based on transformers with encoding and aggregation blocks, for both weather and climate time scales (Nguyen et al. 2023).
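The "custom loss function" idea can be illustrated with a generic tail-weighted error, in which samples above a high quantile of the target are penalized more heavily; the specific losses used in the cited studies differ, and the data below are synthetic placeholders.

```python
# Hedged sketch of a tail-weighted loss: errors on samples above a high quantile
# of the target are upweighted so a model is penalized more for missing extremes.
# Illustrative only; the cited studies define their own loss functions.
import numpy as np

rng = np.random.default_rng(1)
y_true = rng.gamma(shape=2.0, scale=3.0, size=5000)      # skewed target, e.g., a heat index
y_pred = y_true + rng.normal(scale=2.0, size=y_true.size)

threshold = np.quantile(y_true, 0.95)                     # define "extreme" as the top 5%
weights = np.where(y_true >= threshold, 10.0, 1.0)        # upweight extreme samples

mse = np.mean((y_pred - y_true) ** 2)
weighted_mse = np.sum(weights * (y_pred - y_true) ** 2) / np.sum(weights)
print(f"plain MSE {mse:.2f}, tail-weighted MSE {weighted_mse:.2f}")
# Used as a training loss, the weighted version trades some skill on ordinary
# days for better fidelity in the tail that matters for extremes.
```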
Table 3. Synthesis of recent ML-based advances for prediction and precursors in CV&E. Here, XGBoost is extreme gradient boosting.


There are challenges that remain in the prediction and precursor identification of extremes on weather and climate time scales. Extremes by definition are rare events, and therefore separation of signal from noise in extremes statistics remains demanding due to limited observational records. The limited observational record also presents challenges to understanding covariability and compounding extremes (e.g., Mukherjee et al. 2020), in addition to long-term trends of extremes. The issue of limited samples of extremes can also manifest as class imbalance, where the minority class may represent the extreme phenomena and be undersampled as compared with the majority class(es). Several studies have overcome class sample size issues within extreme weather prediction or classification as follows: (i) undersampling of the majority class (e.g., Lagerquist et al. 2019), (ii) robust model and/or hyperparameter grid search to find ML architectures that can overcome class imbalance limitations (e.g., Mostajabi et al. 2019; Kamangir et al. 2022), (iii) use of custom class-weighted loss functions (e.g., Ebert-Uphoff et al. 2021), and (iv) demonstrating that the respective ML task could be performed skillfully despite severe class imbalance (e.g., Flora et al. 2021; Molina et al. 2021). Characterization of extreme events (e.g., extreme precipitation and floods) also needs to be improved, given that different predefined thresholds used to identify extremes can result in very different properties and disparate research results (Pendergrass 2018; Leung et al. 2022). Despite the features identified to date, precursors to extreme events are still poorly understood and require further identification and characterization. For example, while ML-based detection of tropical cyclones has seen much progress (e.g., Prabhat et al. 2021), predicting which tropical disturbances (e.g., African easterly waves) will intensify into tropical cyclones necessitates further exploration (e.g., Núñez Ocasio et al. 2021).
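Two of the imbalance strategies listed above, majority-class undersampling and inverse-frequency class weighting, can be sketched as follows; the event rate, feature shift, and "climatological" probability model are all placeholders chosen only to make the mechanics visible.

```python
# Hedged sketch of imbalance strategies (i) and (iii) above, on synthetic labels:
# undersample the majority (non-event) class, and weight the loss by inverse
# class frequency. Numbers here are placeholders.
import numpy as np

rng = np.random.default_rng(2)
n = 10000
y = (rng.random(n) < 0.02).astype(int)          # 2% "extreme event" minority class
X = rng.normal(size=(n, 5)) + y[:, None]        # events shifted in feature space

# (i) undersampling: keep all events, subsample an equal number of non-events
event_idx = np.flatnonzero(y == 1)
nonevent_idx = rng.choice(np.flatnonzero(y == 0), size=event_idx.size, replace=False)
balanced_idx = np.concatenate([event_idx, nonevent_idx])
X_bal, y_bal = X[balanced_idx], y[balanced_idx]
print("balanced training set class counts:", np.bincount(y_bal))

# (iii) class-weighted cross-entropy: weight each sample by 1 / class frequency
class_freq = np.bincount(y) / n
sample_weights = 1.0 / class_freq[y]
p_event = np.full(n, 0.02)                      # a trivial climatological "model"
bce = -(y * np.log(p_event) + (1 - y) * np.log(1 - p_event))
print("unweighted loss:", bce.mean(),
      "class-weighted loss:", np.average(bce, weights=sample_weights))
```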
d. Observation–model integration
The use of ML for observation–model integration (e.g., ModEx approach1) is motivated by the need to improve the fidelity of ESMs using observations, perform reduced-order modeling, overcome data input/output (I/O) challenges at exascale, and improve assimilation of observations into modeling systems (e.g., Geer 2021; Schevenhoven and Carrassi 2022). Some progress has been recently made on these topics (see Table 4), and examples include the reconstruction of missing climate observations (Kadow et al. 2020), generation of a synthetic MJO-index time series from a past climate (Dasgupta et al. 2020), ESM emulation and parameter estimation (e.g., Lu et al. 2018; Dagon et al. 2020), and the blending of ML and data assimilation via a Bayesian perspective for specific tasks, such as parameterizations (e.g., Bocquet et al. 2020; Brajard et al. 2021; Cleary et al. 2021; Buizza et al. 2022).
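The emulation-and-calibration workflow referenced above can be reduced to three steps: run the expensive model at a handful of parameter settings, fit a cheap surrogate, and then search the surrogate against an observational target. The sketch below uses a one-line toy_esm function as a stand-in for a real simulation and is not the specific algorithm of any cited study.

```python
# Hedged sketch of emulator-based parameter estimation: a cheap surrogate is fit
# to a few expensive "ESM" runs and then searched against an observed target.
# toy_esm and the observed_metric value are hypothetical placeholders.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def toy_esm(entrainment):
    """Stand-in for an expensive simulation: maps a parameter to a climate metric."""
    return 2.0 * np.sin(3.0 * entrainment) + entrainment ** 2

rng = np.random.default_rng(3)
train_params = rng.uniform(0.0, 2.0, size=12).reshape(-1, 1)   # a dozen "ESM runs"
train_metric = toy_esm(train_params).ravel()

emulator = GaussianProcessRegressor().fit(train_params, train_metric)

observed_metric = 1.3                                          # pretend observational target
grid = np.linspace(0.0, 2.0, 500).reshape(-1, 1)
best = grid[np.argmin(np.abs(emulator.predict(grid) - observed_metric))]
print(f"calibrated parameter ~ {best[0]:.3f}")
```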
Table 4. Synthesis of recent ML-based advances for observation–model integration in CV&E.


Verification and validation of climate model simulations (i.e., a class of ESMs used to create projections of Earth’s future climate) using observations are challenging, particularly for longer time scales (e.g., decadal prediction; Meehl et al. 2021), future projections, and rare events. Challenges related to verification and validation of climate models are due in part to limited or missing data, which is a barrier to subsequent development. Observations are often inhomogeneous across space and time, which presents a challenge when gap filling observations and model output. ML could help determine where to deploy observations by evaluating how much uncertainty can be reduced with a new data point (Andersson et al. 2022) or by projecting where more observations will be needed in a future climate (Bessenbacher et al. 2023). A risk of overparameterization when integrating observations into models also exists, and thus there is a need for methods that indicate when observation and model agreement is suboptimal. Uncertainty quantification sits at the intersection of observations and model simulations and is an important tool for assessing confidence in predictive capability, but it is often lacking and challenging to compute. Data are also limited over critical climate regions that contain potential tipping points, such as the Arctic (e.g., Bennett et al. 2022) and Antarctic (e.g., DuVivier et al. 2023). On weather time scales, data latency, which is the length of time between the measurement of an observation and its transmission to a forecasting center, can pose a limit to model skill; earlier incorporation of data as they are collected is needed (e.g., Casey and Cucurull 2022; Slivinski et al. 2022). Dynamic approaches to filtering can help address issues with temporal lags in the assimilation of data (Restrepo 2017; Foster and Restrepo 2022). Additionally, stronger communication and collaboration among observationalists, theoreticians, and process and climate modelers can help further advance our physical understanding, tool development, and the fidelity of ESMs [as demonstrated by Legg et al. (2009)]. Moreover, a heavy emphasis has been placed on the development of data-driven ML methods, which may not obey underlying physical principles of Earth systems nor generalize well to unseen events (Reichstein et al. 2019; Kashinath et al. 2021); more development of physics-informed ML methods is needed in observation–model integration, but with consideration of their possible failure modes (e.g., Krishnapriyan et al. 2021; Wang et al. 2022).
e. Downscaling and bias correction
A spatial refinement and/or bias correction of climate models (i.e., a class of ESMs used to create projections of Earth’s future climate) is warranted due to their coarse horizontal grid spacing (∼100 km), which contributes to inaccuracies in representing subgrid-scale Earth system component processes and interactions (e.g., land–atmosphere). Bias correction of ESMs has seen progress in recent years (see Table 5), with corrections learned from data applied after the ESM has been run (i.e., offline) or while the ESM is run (i.e., online). Applications include offline bias correction of weather observations for initial state quality control (e.g., Sha et al. 2021), offline bias correction of subseasonal climate oscillation predictions, such as the MJO (e.g., Kim et al. 2021), and online bias correction of global coarse-grid (i.e., not convection-permitting) weather and climate models (e.g., Watt-Meyer et al. 2021; Bretherton et al. 2022; Clark et al. 2022). There is also some success in the development of ML models to generate higher-resolution output over regional domains (e.g., Baño-Medina et al. 2020) and in the downscaling of climate model outputs (e.g., Vandal et al. 2017; Pan et al. 2021). However, most of these efforts are currently limited to snapshot modeling or downscaling (e.g., Sha et al. 2020a), with fewer studies focusing on downscaling in both space and time (e.g., Jiang et al. 2020; Serifi et al. 2021).
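For readers new to the offline-correction setup, empirical quantile mapping is a common statistical baseline against which the ML methods in Table 5 are often compared: each modeled value is replaced by the observed value at the same quantile. The sketch below uses synthetic series and is a baseline illustration, not the method of any one cited study.

```python
# Hedged sketch of empirical quantile mapping for offline bias correction:
# each modeled value is mapped onto the observed distribution via matched
# quantiles. Synthetic series only; cited studies use their own (often ML) maps.
import numpy as np

rng = np.random.default_rng(4)
obs = rng.gamma(shape=2.0, scale=4.0, size=5000)                 # "observed" daily precipitation
model = 0.7 * rng.gamma(shape=2.0, scale=4.0, size=5000) + 1.0   # biased ESM output

quantiles = np.linspace(0.01, 0.99, 99)
model_q = np.quantile(model, quantiles)
obs_q = np.quantile(obs, quantiles)

def quantile_map(x):
    """Map model values onto the observed distribution via the quantile curves."""
    return np.interp(x, model_q, obs_q)

corrected = quantile_map(model)
print("means   obs / raw model / corrected:",
      round(obs.mean(), 2), round(model.mean(), 2), round(corrected.mean(), 2))
print("95th pct obs / raw model / corrected:",
      round(np.quantile(obs, 0.95), 2), round(np.quantile(model, 0.95), 2),
      round(np.quantile(corrected, 0.95), 2))
```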
Table 5. Synthesis of recent ML-based advances for downscaling and bias correction in CV&E.


ML models can break physical laws if constraints or equations are not incorporated during the training process (Beucler et al. 2020; Kashinath et al. 2021). For example, a data-driven approach may not capture the statistical tails of subgrid-scale physics, which can lead to inaccurate downscaling or bias correction and to numerical instabilities if subsequently run online in an ESM (e.g., Gettelman et al. 2021). Yuval et al. (2021) achieve numerical stability by using an artificial neural network to predict fluxes rather than tendencies, which serves as a physical constraint in the respective ML-based parameterization. Open-source microscale (1 km or less) benchmarking datasets at high temporal frequencies (e.g., 1 h or less) are also very limited or lacking (e.g., for clouds and aerosols); such datasets could be used for standardized evaluations, albeit with a large computational cost at the global scale. There are also challenges in the selection and availability of reasonable priors for uncertainty quantification from a Bayesian perspective, partly because priors can be highly subjective and partly because many computational scientists do not receive formal training in Bayesian methods as part of their standard curricula (Hoegh 2020). Future advancements in downscaling and bias correction using ML should consider both data-driven and physics-based approaches (e.g., de Wolff et al. 2021), downscaling in both space and time, and blending traditional domain science methods (e.g., fluid mechanics) with ML to fully exploit the strengths of multiple data science approaches. Sufficiently long and high-resolution (e.g., convection-permitting) data streams would be vital to exploiting ML as a tool to emulate physical models representing weather and climate extremes.
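Why predicting fluxes acts as a constraint can be seen in a few lines: if tendencies are diagnosed as flux divergences, their column integral is fixed by the boundary fluxes alone, regardless of errors in the ML-predicted fluxes. The sketch below is a schematic illustration of that bookkeeping, not the actual parameterization of Yuval et al. (2021); layer thicknesses and "predicted" fluxes are placeholders.

```python
# Schematic illustration (not the cited parameterization) of why predicting
# fluxes enforces a conservation constraint: tendencies diagnosed as flux
# divergences integrate to the boundary fluxes regardless of the ML errors.
import numpy as np

rng = np.random.default_rng(5)
n_layers = 20
dz = np.full(n_layers, 500.0)                      # layer thickness (m), uniform here

ml_flux = rng.normal(size=n_layers + 1)            # "ML-predicted" fluxes at interfaces
ml_flux[0] = 0.0                                   # e.g., zero flux at model top
tendency = -(ml_flux[1:] - ml_flux[:-1]) / dz      # tendency = -d(flux)/dz per layer

column_integral = np.sum(tendency * dz)            # collapses to -(F_bottom - F_top)
print(column_integral, -(ml_flux[-1] - ml_flux[0]))   # identical up to rounding
# Predicting tendencies directly offers no such guarantee, which is one source
# of the drift and instability mentioned above.
```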
3. Prospects for future contributions to CV&E with ML
Previously we detailed ML applications and challenges in CV&E (for a summary, see Fig. 3). Next, we outline research directions and outstanding problems that span CV&E and were identified as priorities (also summarized in Table 6).

Fig. 3. Synthesis of challenges in CV&E and/or possible needs that ML can address (the full text is given in section 2).
Table 6. Summary of ML-based priorities for CV&E.


a. ML-detected features and model catalogs
Massive catalogs from feature trackers, such as the Atmospheric River Tracking Method Intercomparison Project (ARTMIP) catalog (Shields et al. 2018), already exist in a consistent format and with uniform standards, and could serve as an example for cataloging existing ML-detected features for benchmarking or research (Dueben et al. 2022). Already-trained ML models for feature detection, such as ClimateNet (Prabhat et al. 2021), can also be cataloged, enabling the prioritization of other features that have received comparatively limited attention (e.g., monsoon depressions).
b. Robustness assessment
Further development and benchmarking of XAI methods (e.g., Mamalakis et al. 2022), assessment of the transferability of trained ML models from one dataset to another (e.g., Dagon et al. 2022), and assessment of whether extracted knowledge is consistent across models and data products, especially in a nonstationary system (e.g., new water and energy pathways due to a changing climate), were all identified as research priorities. Assessing and ensuring the robustness of trained ML models to adversarial data is also a research priority due to security and societal implications.
c. Beyond detection of extremes
ML for feature detection has been successful (e.g., Prabhat et al. 2021), and research priorities should lie in gaining a better understanding of the characteristics of extremes (such as their inception, e.g., cyclogenesis) within both observations and ESMs. Distinguishing extremes from outliers that may be due to erroneous data also remains difficult and is a priority (e.g., anomaly detection; Sinha et al. 2020).
d. Uncertainty for teleconnection pathways
The teleconnections of ENSO and other climate modes have complex and varying pathways (e.g., Mehmood et al. 2022; Molina et al. 2022), and thus a clear understanding of these pathways is important, including identification of ESM-specific signals and those that apply in observations. ML could be used to quantify uncertainty (Haynes et al. 2023), such as with large ML-based ensembles (e.g., Weyn et al. 2021), ML model hierarchies, and models with uncertainty built in (e.g., Barnes and Barnes 2021a,b).
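One generic way to "build uncertainty in" is to have a network output a mean and a variance for each sample and train it with a Gaussian negative log-likelihood; the sketch below shows that idea on synthetic heteroscedastic data and is not the specific architecture of Barnes and Barnes (2021a,b) or the other cited work.

```python
# Hedged sketch of a network with uncertainty built in: it outputs a mean and a
# log-variance per sample and is trained with a Gaussian negative log-likelihood.
# Generic illustration on synthetic data, not the cited architectures.
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(2000, 4)
y = x[:, :1] + 0.5 * torch.randn(2000, 1) * (1.0 + x[:, 1:2].abs())  # heteroscedastic noise

net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))   # outputs [mean, log_var]
opt = torch.optim.Adam(net.parameters(), lr=1e-2)

for epoch in range(200):
    opt.zero_grad()
    out = net(x)
    mean, log_var = out[:, :1], out[:, 1:]
    nll = 0.5 * (log_var + (y - mean) ** 2 / log_var.exp()).mean()   # Gaussian NLL
    nll.backward()
    opt.step()

with torch.no_grad():
    out = net(x[:5])
    print("predicted means: ", out[:, 0].numpy().round(2))
    print("predicted sigmas:", out[:, 1].exp().sqrt().numpy().round(2))
```

The predicted standard deviation then accompanies each forecast, giving a per-sample confidence estimate that can be propagated to teleconnection pathways or downstream impacts.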
e. Leveraging existing ML for precursor discovery
Already existing and/or trained ML models, such as artificial neural networks and linear inverse models, present opportunities to discover new signals in the Earth system and should be further used with XAI to interrogate data (e.g., Shin et al. 2020; Toms et al. 2020). Causality should be further incorporated into ML workflows to help with the quantification of physical drivers and the assessment of robust sources of predictability (e.g., Kretschmer et al. 2021), while also considering their spatiotemporal nonstationarity.
f. Gap filling and synthetic data creation
ML also presents opportunities for filling observational gaps and/or combining observational datasets to leverage their complementary strengths (e.g., transfer learning). The use of ML for the creation of pseudo-observational data is another opportunity, which can include extending observational products into time periods when remote sensing and other observational technologies were unavailable (e.g., Dasgupta et al. 2020). ML can also be used to identify regions that need more observations (e.g., Bessenbacher et al. 2023), such as with techniques borrowed from adaptive sampling (Fer et al. 2018) and active learning (Ren et al. 2021).
g. Improving subgrid-scale representation and ESM development
Climate models cannot resolve or accurately represent processes associated with extremes given their relatively coarse grid spacing (∼100 km); thus, better understanding their connections to extremes using, for example, ML-based downscaling, inverse modeling, or symbolic relationships is an additional research priority. Incorporating ML during model development in order to diagnose errors, rather than applying it after the fact, was identified as a research priority, along with implementing ML to identify numerical model deficiencies (e.g., Bayesian calibration; Cleary et al. 2021).
h. Benchmarking metrics for CV&E
Metrics are scalar measures of skill that are used in model development to benchmark performance. Various existing metrics, including those based on generalized extreme value theory (e.g., Majumdar et al. 2020; Tabari 2021), do not describe climate extremes well, partly because of heavily skewed distributions; thus, ML could potentially be used to expand distributions or create more useful metrics that better capture process complexities and nonlinearities. Development of metrics that measure more than ML model performance in prediction tasks, such as scalability across hardware and software systems, in addition to trustworthiness (McGovern et al. 2022), was also identified as a priority.
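One simple, existing building block for tail-aware benchmarking is to compute standard contingency-table scores only for exceedances of a high percentile; the sketch below evaluates a synthetic forecast on events above the 95th percentile of the observations. This is an illustration of conventional categorical scores applied to the tail, not a new metric proposed by the cited studies.

```python
# Hedged sketch of tail-focused categorical verification: define "events" as
# values above the 95th percentile of the observations and score the forecast
# with standard contingency-table metrics (POD, FAR, CSI). Synthetic data only.
import numpy as np

rng = np.random.default_rng(6)
obs = rng.gamma(shape=2.0, scale=3.0, size=10000)
fcst = obs + rng.normal(scale=3.0, size=obs.size)        # imperfect forecast

threshold = np.quantile(obs, 0.95)
obs_event = obs >= threshold
fcst_event = fcst >= threshold

hits = np.sum(obs_event & fcst_event)
misses = np.sum(obs_event & ~fcst_event)
false_alarms = np.sum(~obs_event & fcst_event)

pod = hits / (hits + misses)                             # probability of detection
far = false_alarms / (hits + false_alarms)               # false alarm ratio
csi = hits / (hits + misses + false_alarms)              # critical success index
print(f"POD {pod:.2f}  FAR {far:.2f}  CSI {csi:.2f}")
```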
i. Data compression tools and online learning techniques
Data compression tools present an opportunity to store large datasets, enabling subsequent training of data-hungry ML models (Pinard et al. 2020). ML methods could also help assess how much compression is possible with minimal data loss and, in the case of a reduced set of numerical model output variables, help with variable reconstruction. Online learning techniques, such as learning during a numerical model simulation (Partee et al. 2022), also present opportunities for observation–model integration and for reducing the need for large model outputs.
4. Conclusions
The major motivation driving the use of ML for climate variability and weather extremes is related to the ability of ML to learn nonlinear, complex relationships in large datasets and across multiple variables, which has enabled improved predictive skill across time scales (e.g., Gagne et al. 2019; Ravuri et al. 2021) and process understanding (e.g., Toms et al. 2020; Kretschmer et al. 2021). Recent applications of ML for climate variability and extremes include uncovering sources of predictability for modes of climate variability (e.g., Toms et al. 2020; Mayer and Barnes 2021), feature detection (e.g., Prabhat et al. 2021; Dagon et al. 2022), identifying extreme weather precursors and improving their prediction (e.g., Gagne et al. 2019; Huang et al. 2021), observation–model integration (e.g., Dasgupta et al. 2020; Kadow et al. 2020), and downscaling and bias correction (e.g., Sha et al. 2020b; Kim et al. 2021). However, numerous algorithmic and statistical challenges must be addressed and overcome to robustly and reliably answer pressing questions with ML within these subject areas. Challenges include nonstationarity within the climate system (e.g., Beucler et al. 2020; Molina et al. 2021), uncertainty quantification associated with ML explainability (e.g., Abdar et al. 2021; Mamalakis et al. 2022), and limited sampling of extremes (e.g., Sillmann et al. 2017), among many others discussed throughout this article.
A long-term vision (e.g., during the next 10 years) for climate variability and weather extremes includes the further advancement and application of ML that is unaware of human-imposed labels (e.g., unsupervised learning; Celebi and Aydin 2016), which allows for the relaxation of a priori criteria and can potentially lead to the discovery of new modes of climate variability, climate signals, and sources of predictability. In a changing climate, extracting coherent structures from model output will require robust strategies that can separate long-term forcing signals from short-term variability. Efficient detection and attribution of compound and cascading extremes, robust to a changing climate, and the creation/emulation of factual and counterfactual scenarios that incorporate uncertainty for climate change attribution (e.g., Easterling et al. 2016) will become increasingly important, particularly as applications extend into socioeconomic impacts (Perkins-Kirkpatrick et al. 2022). Overcoming observational limitations could be achieved using transfer operators with climate models to map a history of observations to predictions, potentially helping to generate ultrahigh-resolution simulations (e.g., 1 km or less) of extreme phenomena (tornadoes, lightning, etc.) at some future time. Critically, continuing to build trust between researchers, operational scientists, and the public is also of utmost importance, by increasing ML explainability and enabling robust identification of reasons for prediction failures (McGovern et al. 2022).
Beyond climate variability and weather extremes, collaboration and codevelopment among computer scientists, domain experts, and software engineers are important for making visionary changes and pushing scientific boundaries. Training and development of the current and future workforce, particularly the experienced workforce with an already clear understanding of Earth science data, and strong partnerships between laboratories and universities that allow for cross-pollination of ideas and training of students were all seen as imperative. A focus on stakeholder and end-user engagement is also emphasized for ML and predictive analytics. High-risk/high-reward research was also identified as critical, with needed support from funding agencies and the broader scientific community to target creativity and risk-taking in research and to create effective and transformative change.
1 Iterative model-experiment (ModEx) approach (https://ess.science.energy.gov/modex/).
Acknowledgments.
We acknowledge the following for their contributions as active session participants: Steven Klein (LLNL), Scott Collis (ANL), Alex Cannon (Environment and Climate Change Canada), Nicola Maher (CIRES/CU Boulder), Da Yang (UC Davis, LBNL), John Allen (Central Michigan University), Manos Anagnostou (University of Connecticut), Jim Ang (PNNL), Marian Anghel (LANL), Kenneth Ball (Geometric Data Analytics), Karthik Balaguru (PNNL), Antara Banerjee (CIRES, NOAA/CSL), Elizabeth Barnes (Colorado State University), Carolyn Begeman (LANL), Emily Bercos-Hickey (LBNL), Celine Bonfils (LLNL), Andrew Bradley (Sandia National Laboratories), Antonietta Capotondi (CIRES, NOAA/PSL), William Chapman (Scripps Institution of Oceanography), Nan Chen (University of Wisconsin-Madison), Paul J. Durack (LLNL), Zhe Feng (PNNL), Andrew Geiss (PNNL), Carlo Grazian (ANL), Gary Geernaert (DOE), Huanping Huang (LBNL), Whitney Huang (Clemson University), Brian Hunt (University of Maryland), Robert Jacob (ANL), Renu Joseph (DOE), Karthik Kashinath (Nvidia), Grace E Kim (Booz Allen Hamilton), Gu Lianhong (ORNL), Jian Lu (PNNL), Valerio Lucarini (University of Reading), Hsi-Yen Ma (LLNL), Kirsten Mayer (Colorado State University), Amy McGovern (University of Oklahoma), Gerald Meehl (NCAR), Balu Nadiga (LANL), Christina Patricola (Iowa State University), Steve Penny (CIRES, NOAA/PSL), Stephen Po-Chedley (LLNL), Philip Rasch (PNNL), Deeksha Rastogi (ORNL), Adam Rupe (LANL), Christine Shields (NCAR), Jeff Stehr (DOE), Aneesh Subramanian (CU Boulder), Bob Vallario (DOE), Charuleka Varadharajan (LBNL), Hailong Wang (PNNL), Jiali Wang (ANL), Michael Wehner (LBNL), Brian White (UNC Chapel Hill, UC Berkeley), and Mark Zelinka (LLNL). This material is based upon work supported by the U.S. Department of Energy (DOE), Office of Science, Office of Biological and Environmental Research (BER), Regional and Global Model Analysis (RGMA) component of the Earth and Environmental System Modeling Program under Award DE-SC0022070 and National Science Foundation IA 1947282. The National Center for Atmospheric Research is sponsored by the National Science Foundation. A portion of this work was performed under the auspices of the U.S. DOE by Lawrence Livermore National Laboratory (LLNL) under Contract DE-AC52-07NA27344. Author Anderson was supported by LLNL Laboratory Directed Research and Development project 22-SI-008. This paper has been authored by UT-Battelle, LLC, under Contract DE-AC05-00OR22725 with the U.S. Department of Energy. The publisher, by accepting the article for publication, acknowledges that the U.S. government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of the paper, or allow others to do so, for U.S. government purposes. The DOE will provide public access to these results in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).
Data availability statement.
No datasets were generated or analyzed during the current study.
REFERENCES
Abdar, M., and Coauthors, 2021: A review of uncertainty quantification in deep learning: Techniques, applications and challenges. Inf. Fusion, 76, 243–297, https://doi.org/10.1016/j.inffus.2021.05.008.
Allen, J. T., and M. K. Tippett, 2015: The characteristics of United States hail reports: 1955–2014. Electron. J. Severe Storms Meteor., 10 (3), https://doi.org/10.55599/ejssm.v10i3.60.
American Meteorological Society, 2022: Climate variability. Glossary of Meteorology, https://glossary.ametsoc.org/wiki/Climate_variability.
Anderson, G. J., and D. D. Lucas, 2018: Machine learning predictions of a multiresolution climate model ensemble. Geophys. Res. Lett., 45, 4273–4280, https://doi.org/10.1029/2018GL077049.
Andersson, T. R., and Coauthors, 2022: Active learning with convolutional Gaussian neural processes for environmental sensor placement. arXiv, 2211.10381v1, https://doi.org/10.48550/arXiv.2211.10381.
Baño-Medina, J., R. Manzanas, and J. M. Gutierrez, 2020: Configuration and intercomparison of deep learning neural models for statistical downscaling. Geosci. Model Dev., 13, 2109–2124, https://doi.org/10.5194/gmd-13-2109-2020.
Barnes, E. A., and R. J. Barnes, 2021a: Controlled abstention neural networks for identifying skillful predictions for classification problems. J. Adv. Model. Earth Syst., 13, e2021MS002573, https://doi.org/10.1029/2021MS002573.
Barnes, E. A., and R. J. Barnes, 2021b: Controlled abstention neural networks for identifying skillful predictions for regression problems. J. Adv. Model. Earth Syst., 13, e2021MS002575, https://doi.org/10.1029/2021MS002575.
Barnes, E. A., J. W. Hurrell, I. Ebert-Uphoff, C. Anderson, and D. Anderson, 2019: Viewing forced climate patterns through an AI lens. Geophys. Res. Lett., 46, 13 389–13 398, https://doi.org/10.1029/2019GL084944.
Bennett, K. E., and Coauthors, 2022: Spatial patterns of snow distribution in the sub-Arctic. Cryosphere, 16, 3269–3293, https://doi.org/10.5194/tc-16-3269-2022.
Bessenbacher, V., L. Gudmundsson, and S. I. Seneviratne, 2023: Optimizing soil moisture station networks for future climates. Geophys. Res. Lett., 50, e2022GL101667, https://doi.org/10.1029/2022GL101667.
Beucler, T., M. Pritchard, P. Gentine, and S. Rasp, 2020: Towards physically-consistent, data-driven models of convection. 2020 IEEE Int. Geoscience and Remote Sensing Symp., Waikoloa, HI, IEEE, 3987–3990, https://doi.org/10.1109/IGARSS39084.2020.9324569.
Bi, K., L. Xie, H. Zhang, X. Chen, X. Gu, and Q. Tian, 2022: Pangu-weather: A 3D high-resolution model for fast and accurate global weather forecast. arXiv, 2211.02556v1, https://doi.org/10.48550/arXiv.2211.02556.
Biard, J. C., and K. E. Kunkel, 2019: Automated detection of weather fronts using a deep learning neural network. Adv. Stat. Climatol. Meteor. Oceanogr., 5, 147–160, https://doi.org/10.5194/ascmo-5-147-2019.
Bird, L., M. Walker, G. Bodeker, I. Campbell, G. Liu, S. J. Sam, J. Lewis, and S. Rosier, 2023: Deep learning for stochastic precipitation generation—Deep SPG v1.0. Geosci. Model Dev., 16, 3785–3808, https://doi.org/10.5194/gmd-16-3785-2023.
Bocquet, M., J. Brajard, A. Carrassi, and L. Bertino, 2020: Bayesian inference of chaotic dynamics by merging data assimilation, machine learning and expectation-maximization. arXiv, 2001.06270v2, https://doi.org/10.48550/arXiv.2001.06270.
Brajard, J., A. Carrassi, M. Bocquet, and L. Bertino, 2021: Combining data assimilation and machine learning to infer unresolved scale parametrization. Philos. Trans. Roy. Soc., A379, 20200086, https://doi.org/10.1098/rsta.2020.0086.
Bretherton, C. S., and Coauthors, 2022: Correcting coarse-grid weather and climate models by machine learning from global storm-resolving simulations. J. Adv. Model. Earth Syst., 14, e2021MS002794, https://doi.org/10.1029/2021MS002794.
Brodu, N., and J. P. Crutchfield, 2022: Discovering causal structure with reproducing-kernel Hilbert space ϵ-machines. Chaos, 32, 023103, https://doi.org/10.1063/5.0062829.
Buizza, C., and Coauthors, 2022: Data learning: Integrating data assimilation and machine learning. J. Comput. Sci., 58, 101525, https://doi.org/10.1016/j.jocs.2021.101525.
Casey, S. P. F., and L. Cucurull, 2022: The impact of data latency on operational global weather forecasting. Wea. Forecasting, 37, 1211–1220, https://doi.org/10.1175/WAF-D-21-0149.1.
Celebi, M. E., and K. Aydin, 2016: Unsupervised Learning Algorithms. Springer, 558 pp.
Chantry, M., H. Christensen, P. Dueben, and T. Palmer, 2021: Opportunities and challenges for machine learning in weather and climate modelling: Hard, medium and soft AI. Philos. Trans. Roy. Soc., A379, 20200083, https://doi.org/10.1098/rsta.2020.0083.
Chattopadhyay, A., E. Nabizadeh, and P. Hassanzadeh, 2020: Analog forecasting of extreme-causing weather patterns using deep learning. J. Adv. Model. Earth Syst., 12, e2019MS001958, https://doi.org/10.1029/2019MS001958.
Clark, A. J., and E. D. Loken, 2022: Machine learning–derived severe weather probabilities from a warn-on-forecast system. Wea. Forecasting, 37, 1721–1740, https://doi.org/10.1175/WAF-D-22-0056.1.
Clark, S. K., and Coauthors, 2022: Correcting a 200 km resolution climate model in multiple climates by machine learning from 25 km resolution simulations. J. Adv. Model. Earth Syst., 14, e2022MS003219, https://doi.org/10.1029/2022MS003219.
Cleary, E., A. Garbuno-Inigo, S. Lan, T. Schneider, and A. M. Stuart, 2021: Calibrate, emulate, sample. J. Comput. Phys., 424, 109716, https://doi.org/10.1016/j.jcp.2020.109716.