Distilling Machine Learning’s Added Value: Pareto Fronts in Atmospheric Applications

Tom Beucler, Faculty of Geosciences and Environment, University of Lausanne, Lausanne, Switzerland; Expertise Center for Climate Extremes, University of Lausanne, Lausanne, Switzerland (https://orcid.org/0000-0002-5731-1040)

Arthur Grundner, Deutsches Zentrum für Luft- und Raumfahrt, Institut für Physik der Atmosphäre, Oberpfaffenhofen, Germany

Sara Shamekh, Courant Institute of Mathematical Sciences, New York University, New York, New York

Peter Ukkonen, Department of Physics, University of Oxford, Oxford, United Kingdom

Matthew Chantry, European Centre for Medium-Range Weather Forecasts, Reading, United Kingdom

Ryan Lagerquist, Cooperative Institute for Research in the Atmosphere, Colorado State University, Fort Collins, Colorado; NOAA/Global Systems Laboratory, Boulder, Colorado


Abstract

The added value of machine learning for weather and climate applications is measurable through performance metrics, but explaining it remains challenging, particularly for large deep learning models. Inspired by climate model hierarchies, we propose that a full hierarchy of Pareto-optimal models, defined within an appropriately determined error-complexity plane, can guide model development and help understand the models’ added value. We demonstrate the use of Pareto fronts in atmospheric physics through three sample applications, with hierarchies ranging from semiempirical models with minimal parameters (simplest) to deep learning algorithms (most complex). First, in cloud cover parameterization, we find that neural networks identify nonlinear relationships between cloud cover and its thermodynamic environment and assimilate previously neglected features such as vertical gradients in relative humidity that improve the representation of low cloud cover. This added value is condensed into a ten-parameter equation that rivals deep learning models. Second, we establish a machine learning model hierarchy for emulating shortwave radiative transfer, distilling the importance of bidirectional vertical connectivity for accurately representing absorption and scattering, especially for multiple cloud layers. Third, we emphasize the importance of convective organization information when modeling the relationship between tropical precipitation and its surrounding environment. We discuss the added value of temporal memory when high-resolution spatial information is unavailable, with implications for precipitation parameterization. Therefore, by comparing data-driven models directly with existing schemes using Pareto optimality, we promote process understanding by hierarchically unveiling system complexity, with the hope of improving the trustworthiness of machine learning models in atmospheric applications.

Significance Statement

As machine learning permeates the geosciences, it becomes urgent to identify improvements in process representation and prediction that challenge our understanding of the underlying system. We show that Pareto-optimal hierarchies transparently distill the added value of new algorithms in three atmospheric physics applications, providing a timely complement to post hoc explainable artificial intelligence tools.

© 2025 American Meteorological Society. This published article is licensed under the terms of the default AMS reuse license. For information regarding reuse of this content and general copyright information, consult the AMS Copyright Policy (www.ametsoc.org/PUBSReuseLicenses).

Corresponding author: Tom Beucler, tom.beucler@unil.ch


1. Introduction

Will recent advancements in machine learning (ML) lead to enduring new knowledge in atmospheric science? While the added value of ML for weather and climate applications can be measured using performance metrics, it often remains challenging to understand. Taking advancements in data-driven, medium-range weather forecasting (Ben-Bouallegue et al. 2023) as an example, increasing reliance on complex architectures makes state-of-the-art models difficult to interpret.

Weyn et al. (2019, 2020) used convolutional neural networks (CNNs) with approximately 200k and 700k learned parameters to produce global forecasts that outperformed climatology, persistence, and a low-resolution numerical weather prediction (NWP) model for lead times shorter than 1 week. Following the success of early approaches, Rasp et al. (2020) developed a benchmark dataset for data-driven weather forecasting, which facilitated the objective assessment of rapid developments (e.g., Clare et al. 2021; Scher and Messori 2021). Since then, for data-driven, medium-range forecasting, Rasp and Thuerey (2021) trained a ≈6.3M-parameter deep residual CNN, Keisler (2022) trained a ≈6.7M-parameter graph neural network (GNN), and Pathak et al. (2022) trained a ≈7M-parameter emulator combining transformers with Fourier neural operators. Recently, deep learning models started rivaling state-of-the-art, high-resolution, deterministic NWP models: Lam et al. (2022) via combined GNNs totaling ≈37M parameters, Bi et al. (2022) via a ≈256M-parameter Earth-specific transformer, and Lang et al. (2024) via a ≈256M-parameter graph and transformer model. It is hard to pinpoint what makes these models so successful, even with modern explainable artificial intelligence (XAI; Buhrmester et al. 2021; Das and Rad 2020) tools, given that XAI requires certain assumptions to be satisfied (Mamalakis et al. 2022, 2023) and involves choosing which samples to investigate, which is challenging for large models and datasets.

The growing complexity of data-driven models for weather applications shares similarities with the development of general circulation models (GCMs) that followed the first comprehensive assessment of global climate change due to carbon dioxide (Charney et al. 1979). Unlike data-driven weather prediction, where reducing forecast errors could warrant increased complexity, GCMs have traditionally been created not simply to project but also to comprehend climate changes (Held 2005; Balaji et al. 2022). This implies that any additional complexity in an Earth system model should be well justified, motivating climate model hierarchies that aim to connect our fundamental understanding with model prediction (e.g., Mansfield et al. 2023; Robertson and Ghil 2000; Bony et al. 2013; Jeevanjee et al. 2017; Maher et al. 2019; Balaji 2021).

Inspired by climate model hierarchies, we here show that modern optimization tools help systematically generate data-driven model hierarchies to model and understand climate processes for which we have reliable data. These hierarchies can 1) guide the development of data-driven models that optimally balance simplicity and accuracy and 2) unveil the role of each complexity unit, furthering process understanding by distilling the added value of ML for the atmospheric application of interest.

In this study, we showcase the advantages of considering a hierarchy of models with varying error and complexity, as opposed to focusing on a single fitted model (Fisher et al. 2019). After formulating Pareto-optimal model hierarchies (section 2) and categorizing the added value of ML into four categories (section 3), we apply our approach to three atmospheric processes relevant for weather and climate predictions (section 4) to distill the added value of recently developed deep learning frameworks before concluding (section 5).

2. Pareto-optimal model hierarchies

a. Pareto optimality

In multiobjective optimization, Pareto optimality represents a solution set that cannot be improved upon in one criterion without worsening another criterion (e.g., Censor 1977; Miettinen 1999). The first step is to define a set of $n$ real-valued model evaluation metrics $\mathcal{E} = \{E_i\}_{i=1}^{n}$ that we wish to minimize (e.g., error and complexity). We call a model $\mathcal{M}_{\mathrm{opt}}$ Pareto optimal with respect to these metrics and with respect to a model family if there is no model in that family that strictly outperforms $\mathcal{M}_{\mathrm{opt}}$ in one metric while maintaining at least the same performance in all other metrics (e.g., Lin et al. 2019). The Pareto front (PF) is the set of all Pareto-optimal models, which can be defined using logical statements:
$$\mathrm{PF}_{\mathcal{E}} = \left\{ \mathcal{M}_{\mathrm{opt}} \;\middle|\; \nexists\, \mathcal{M}\ \text{s.t.}\ \begin{cases} \forall i,\ E_i(\mathcal{M}) \le E_i(\mathcal{M}_{\mathrm{opt}}) \\ \exists j,\ E_j(\mathcal{M}) < E_j(\mathcal{M}_{\mathrm{opt}}) \end{cases} \right\}. \tag{1}$$

Intuitively, when we select a model from the Pareto front, any attempt to switch to a different model would mean sacrificing the quality of at least one evaluation metric. Conversely, a model that can be replaced without compromising any evaluation metric is described as Pareto dominated. Henceforth, we derive Pareto fronts empirically from the available data and the subset of models $\mathcal{M}$ considered. These empirical Pareto fronts are denoted as $\mathrm{PF}_{\mathcal{E},\mathcal{M}}$, acknowledging that the “true Pareto front” $\mathrm{PF}_{\mathcal{E}}$ is generally intractable. In practice, we often seek to balance evaluation metrics measuring error and complexity, which we discuss in the next subsections.
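To make this construction concrete, the following minimal sketch (illustrative code written for this discussion, not taken from the studies below) extracts an empirical Pareto front from a list of (complexity, error) pairs, both to be minimized:

```python
import numpy as np

def pareto_front(metrics):
    """Return indices of Pareto-optimal rows of an (n_models, n_metrics)
    array, where every metric is to be minimized."""
    metrics = np.asarray(metrics, dtype=float)
    optimal = []
    for i, m in enumerate(metrics):
        # m is dominated if another model is <= in all metrics, < in one
        others = np.delete(metrics, i, axis=0)
        dominated = np.any(
            np.all(others <= m, axis=1) & np.any(others < m, axis=1))
        if not dominated:
            optimal.append(i)
    return optimal

# Hypothetical (complexity, error) pairs for five candidate models
models = [(8, 0.25), (10, 0.22), (9345, 0.085), (54786, 0.06), (20000, 0.09)]
print(pareto_front(models))  # -> [0, 1, 2, 3]; the last model is dominated
```

Any model whose index is returned cannot be replaced by another candidate without degrading at least one metric.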

b. Error

We emphasize the importance of holistic evaluation, which employs several error metrics with different behaviors: traditional regression or classification metrics, distributional distances, spectral metrics, probabilistic scoring rules, reliability diagrams (Haynes et al. 2023), causal evaluation (Nowack et al. 2020), etc. To facilitate the use of Pareto-optimal hierarchies, we recommend prioritizing proper scores, whose expectation is optimal if and only if the model represents the true target distribution (Bröcker 2009). For simplicity’s sake, we will employ the mean-square error (MSE) and its square root (RMSE) as our primary error metrics for our study’s applications. We make this choice because MSE is a proper score for deterministic predictions that can be efficiently optimized, while recognizing MSE’s inherent limitations for nonnormally distributed targets.

c. Complexity

To our knowledge, there are no universally accepted metrics for quantifying the complexity of data-driven models within the geosciences. In statistical learning, various complexity metrics, such as Rademacher complexity (e.g., Bartlett et al. 2005), rely on dataset characteristics, whereas others, like the Vapnik–Chervonenkis (VC) dimension (Vapnik and Chervonenkis 2015), solely depend on algorithmic attributes. Here, we predominantly focus on two metrics that can be readily calculated from model attributes: the number of trainable parameters $N_{\mathrm{param}}$ and the number of features $N_{\mathrm{features}}$. We choose these metrics due to their simplicity and versatility across the broad spectrum of models considered in our study (Balaji et al. 2017). As we will confirm empirically in section 4, $N_{\mathrm{param}}$ and $N_{\mathrm{features}}$ can be used as (very) approximate proxies for generalizability, as models with fewer trainable parameters and features with more stable distributions tend to result in superior generalizability. We defer the exploration of additional complexity metrics, such as the number of floating point operations (FLOPs) or multiply–accumulate operations (MACs), to future work.
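As an illustration of how readily $N_{\mathrm{param}}$ can be computed from model attributes alone, the sketch below counts the trainable parameters of a fully connected network analytically; the widths are hypothetical, and deep learning frameworks expose equivalent counters (e.g., Keras’s model.count_params()):

```python
def mlp_param_count(n_features, hidden_widths, n_outputs=1):
    """Trainable parameters of a fully connected network:
    weights plus biases, layer by layer."""
    sizes = [n_features, *hidden_widths, n_outputs]
    return sum(sizes[i] * sizes[i + 1] + sizes[i + 1]
               for i in range(len(sizes) - 1))

# Hypothetical small MLP: 24 features, three 64-neuron hidden layers
print(mlp_param_count(24, [64, 64, 64]))  # -> 9985
```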

Equipped with these definitions, we can now ask: Why, along the Pareto front in a well-defined error-complexity plane, does increasing complexity result in better performance? In the following section, we hence leverage Pareto optimality to distill machine learning’s added value.

3. The distillable value of machine learning

Amid rapid progress in optimization, machine learning architectures, and data handling, it can be challenging to distinguish long-lasting progress in modeling from small improvements in model error. We postulate that progress is more likely to be replicable if we can explain how added model complexity improves performance in simple terms. Based on this postulate, we propose simple definitions to categorize a model’s added value in the geosciences.

In geoscientific applications, we often work with features $X$ that are functions of space, discretized in $N_x$ spatial locations $x$, and of time, discretized in $N_t$ time steps $t$. We consider a model $\mathcal{M}$ predicting a target vector $Y$ from a multivariate spatiotemporal field $X_{x,t}$ such that
$$Y = \mathcal{M}[X_{x,t}]. \tag{2}$$

We define a model $\mathcal{M}$ as having added value with respect to a set of evaluation metrics $\{E_i\}_{i=1}^{n}$ and a baseline model $\hat{\mathcal{M}}$ if the following holds: When applied to representative out-of-sample data, $\mathcal{M}$ is Pareto optimal, while the baseline model $\hat{\mathcal{M}}$ is not. Note that improving the Pareto front is necessary but not sufficient: A model predicting zero everywhere does not generally add value compared to other models solely because it is simpler.

Equation (2) suggests four degrees of freedom for a model’s added value: 1) improving the function $\mathcal{M}$ without changing the features $X_{x,t}$ (functional representation), 2) improving the model through $X$ (feature assimilation), 3) improving the model through $x$ (spatial connectivity), and 4) improving the model through $t$ (temporal connectivity). We depict these four nonmutually exclusive categories in Fig. 1 and rigorously define them below.

Fig. 1.

Exploring PFs (sets of Pareto-optimal models) within a complexity-error plane highlights ML’s added value. Crosses in step 1 denote existing models. Algorithms such as deep learning allow for the creation of efficient, low-error, albeit complex models (step 2). Knowledge distillation, through methods such as equation discovery, aims to explain error reduction, resulting in simpler, low-error models (step 3) and long-lasting scientific progress. For atmospheric applications, we propose four categories to classify this added value: functional representation, feature assimilation, spatial connectivity, and temporal connectivity.


a. Functional representation

A fundamental aspect a model can improve is the functional representation of the observed relationship between a set of baseline features $\hat{X}$ (fixed features used to benchmark performance) and the target $Y$. We deem $\mathcal{M}$ to improve the functional representation of a baseline $\hat{\mathcal{M}}$ with respect to $\mathcal{E}$ and $\mathcal{M}$ if and only if
$$\hat{\mathcal{M}}[\hat{X}] \notin \mathrm{PF}_{\mathcal{E},\mathcal{M}}, \quad \mathcal{M}[\hat{X}] \in \mathrm{PF}_{\mathcal{E},\mathcal{M}}. \tag{3}$$

This improvement may stem from algorithms leading to better fits (e.g., gradient boosting instead of decision trees), improved optimization (e.g., the Adam optimizer and its variants instead of traditional stochastic gradient descent; Kingma and Ba 2014), improved parsimony (e.g., by decreasing the number of trainable parameters via hyperparameter tuning), or enforced constraints (e.g., positive concentrations and precipitation). Improvements in functional representation are readily visualized via partial dependence plots or their variants (marginal plots to avoid unlikely data instances, accumulated local effects to account for feature correlation, etc.; Molnar 2020). Note that Eq. (3) also captures improvements in probabilistic modeling by generalizing $\mathcal{M}$ to a stochastic mapping and including probabilistic scores (Gneiting and Raftery 2007; Haynes et al. 2023) in $\mathcal{E}$. From an atmospheric science perspective, improving functional representation helps identify nonlinear regimes, model extremes, and describe how sensitive the prediction is to different features. However, it can be challenging to faithfully describe these sensitivities if key features are missing from the baseline set of features $\hat{X}$.
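As a hedged illustration of such visualizations, scikit-learn exposes partial dependence plots directly; the regressor and data below are placeholders standing in for any fitted model:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import PartialDependenceDisplay

# Fit any regressor on placeholder data, then plot how its learned
# functional representation responds to two selected features.
X, y = make_regression(n_samples=500, n_features=5, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X, y)
PartialDependenceDisplay.from_estimator(model, X, features=[0, 1])
```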

b. Feature assimilation

This motivates discussing the ability of a model to extract relevant information from a new feature set, which we refer to as feature assimilation and define as follows:
$$\hat{\mathcal{M}}[\hat{X}] \notin \mathrm{PF}_{\mathcal{E},\mathcal{M}}, \quad \mathcal{M}[X] \in \mathrm{PF}_{\mathcal{E},\mathcal{M}}, \tag{4}$$
where we emphasize the difference between the baseline ($\hat{X}$) and new ($X$) feature sets. This improvement may stem from previously unconsidered features (e.g., variables whose quality has increased in recent datasets), the ability of models to handle features whose information could not be extracted in past attempts (e.g., the ability of deep learning to extract nonlinear relationships), or the discovery of a more compact feature set that facilitates learning or improves generalizability. The assimilation of new features may improve interpretability (e.g., if some features simplify the target’s functional representation or if there are fewer features to consider), physical consistency (e.g., if the new feature set improves generalizability across regimes or consistency with physical laws), and predictability. While the features’ spatial and temporal discretizations could technically be considered part of feature selection, we choose to discuss them separately in the following subsections, given the central role of space and time in geoscientific applications.

c. Spatial connectivity

To formalize a model $\mathcal{M}$’s improved ability to leverage the features’ spatial information, we distinguish the spatial locations $\hat{x}$ used to discretize the baseline $\hat{\mathcal{M}}$’s features from the spatial locations $x$ used to discretize $\mathcal{M}$’s features:
$$\hat{\mathcal{M}}[\hat{X}_{\hat{x},\hat{t}}] \notin \mathrm{PF}_{\mathcal{E},\mathcal{M}}, \quad \mathcal{M}[\hat{X}_{x,\hat{t}}] \in \mathrm{PF}_{\mathcal{E},\mathcal{M}}. \tag{5}$$

This improvement may stem from the ability to

  1) handle features at different spatial resolutions (e.g., via improved preprocessing or handling of data). In atmospheric science, this can help consider multiscale interactions and accommodate data from various Earth system models.
  2) hierarchically process spatially adjacent data (e.g., via convolutional layers). In atmospheric science, this acknowledges the high correlation between spatial neighbors due to, e.g., small-scale mixing.
  3) capture long-range spatial dependencies (e.g., via self-attention mechanisms or a graph structure), such as teleconnections in the Earth system.

d. Temporal connectivity

To formalize a model $\mathcal{M}$’s improved ability to leverage the features’ temporal information, we distinguish the time steps $\hat{t}$ used to discretize the baseline $\hat{\mathcal{M}}$’s features from the time steps $t$ used to discretize $\mathcal{M}$’s features:
$$\hat{\mathcal{M}}[\hat{X}_{\hat{x},\hat{t}}] \notin \mathrm{PF}_{\mathcal{E},\mathcal{M}}, \quad \mathcal{M}[\hat{X}_{\hat{x},t}] \in \mathrm{PF}_{\mathcal{E},\mathcal{M}}. \tag{6}$$

This improvement may stem from the ability to

  1) handle features at different temporal resolutions (e.g., via improved preprocessing or handling of data). This can help consider multiple time scales and accommodate data from various Earth system models.
  2) process consecutive time steps (e.g., via recurrent layers). This acknowledges the high temporal autocorrelation of Earth system data, which is a property of the underlying dynamical system.
  3) capture long-term temporal dependencies (e.g., via gating or self-attention mechanisms) and cyclic patterns (e.g., via data adjustments and temporal Fourier transforms), such as the diurnal and seasonal cycles.

4. Atmospheric physics application cases

This section demonstrates that Pareto optimality guides model development and improves process understanding through three realistic atmospheric modeling case studies. Each case includes machine learning prototypes with demonstrated performance from previous studies, along with Pareto-optimal models newly trained for this study. The first case study emphasizes functional representation and feature assimilation, the second focuses on spatial connectivity, and the last compares spatial and temporal connectivity.

a. Cloud cover parameterization for climate modeling

1) Motivation

The incorrect representation of cloud processes in current Earth system models, with a grid spacing of approximately 50–100 km (Arias et al. 2021), significantly contributes to structural uncertainty in long-term climate projections (Bony et al. 2015; Sherwood et al. 2014). Cloud cover parameterization, which maps environmental conditions at the grid scale to the fraction of the grid cell occupied by clouds, directly affects radiative transfer and microphysical conversion rates, influencing the model’s energy balance and water species concentrations.

Although “storm-resolving” simulations with grid spacing below ≈5 km do not explicitly resolve clouds and their associated microphysical processes (Morrison et al. 2020), they significantly reduce uncertainty in the interaction between storms and planetary-scale dynamics by explicitly simulating deep convection (Stevens et al. 2020). However, their large computational cost prohibits their routine use for ensemble projections (Schneider et al. 2017). Machine learning can learn the storm-scale behavior of clouds from short, storm-resolving simulations, potentially improving coarser Earth system models through data-driven parameterizations (Gentine et al. 2021).

This case study aims to understand the improvement gained from the higher-fidelity representation of storms and clouds. As illustrated in Fig. 2, we demonstrate that this knowledge can be symbolically distilled into an analytic equation that rivals the performance of deep learning.

Fig. 2.

Pareto-optimal model hierarchies quantify the added value of ML for cloud cover parameterization. ML better captures the relationship between cloud cover and its thermodynamic environment and assimilates features like vertical humidity gradients. (left) We progressively improve traditional baselines via polynomial regression (red, orange, and yellow crosses), significantly decrease error using NNs (pink and purple crosses), and finally distill the added value of these NNs symbolically (green crosses). (right) Both the NN (orange line) and its distilled symbolic representation (green line) better represent the functional relationship between cloud cover and its environment, aligning more closely across temperatures with the reference storm-resolving simulation (blue dots) than the Sundqvist scheme (red line) used in the ICON Earth system model. “Cold” and “Hot” refer to the validation set’s first and last temperature octiles. Additionally, ML models assimilate multiple features absent in existing baselines, including vertical humidity gradients. The smaller discrepancy between the 5-feature scheme (“SFS5”) and the reference (“REF”), compared to the 4-feature scheme (“SFS4”), demonstrates the improved representation of the time-averaged low cloud cover in regions such as the southeast Pacific, thereby reducing biases in current cloud cover schemes that plague the global radiative budget.


2) Setup

We follow the setup described in Grundner et al. (2024), to which readers are referred for details. Fields are coarse grained from storm-resolving Icosahedral Nonhydrostatic (ICON) simulations (Giorgetta et al. 2018) conducted as part of the Dynamics of the Atmospheric General Circulation Modeled on Nonhydrostatic Domains (DYAMOND) intercomparison project (Stevens et al. 2019; Duras et al. 2021). The original simulations use a horizontal grid spacing of ≈2.5 km and 58 vertical layers below 21 km (the maximum altitude with clouds in the dataset). They are coarse grained to a typical climate model resolution of ≈80 km horizontally and 27 vertical layers, converting the binarized, high-resolution condensate field (1 if the cloud condensate mixing ratio exceeds 10−6 kg kg−1 and 0 otherwise) into a fractional area cloud cover C (unitless).

To prevent strong correlations between the training and validation sets, the union of the “DYAMOND Summer” (10 August–10 September 2016) and “DYAMOND Winter” (30 January–29 February 2020) datasets was partitioned into six consecutive temporal segments. The second segment (approximately 21 August–1 September 2016) and the fifth segment (approximately 9–19 February 2020) form the validation set. For all models, excluding traditional methods, the features are standardized to have a mean of zero and a standard deviation of one within the training set.

Once coarse grained and preprocessed, we aim to map the environmental conditions $X$ on vertical levels, indexed by the background terrain-following height grid $z$, to the cloud cover $C$ on the same vertical levels. The variables $X$ include the horizontal wind speed $U$ (m s−1), specific humidity $q_v$ (kg kg−1), liquid water mixing ratio $q_l$ (kg kg−1), ice mixing ratio $q_i$ (kg kg−1), temperature $T$ (K), pressure $p$ (Pa), and relative humidity (RH). Except for the “nonlocal NN” in the next subsection, we simplify the mapping to a “vertically quasilocal” one, where cloud cover at a given level depends only on the atmospheric variables $X$ at the same level and their first- and second-order derivatives with respect to $z$. The term $X$ also includes geometric height $z$ (m) and surface variables: land fraction $\sigma_f$ (%) and surface pressure $p_s$ (Pa). In summary, we approximate some mapping:
$$\left(X, \frac{dX}{dz}, \frac{d^2 X}{dz^2}, z, \sigma_f, p_s\right) \in \mathbb{R}^{3\times 7 + 3} \mapsto C \in [0, 1], \tag{7}$$
using a hierarchy of machine learning models. In the following subsections, we will show that Pareto-optimal hierarchies not only facilitate data-driven model development, allowing for the comparison of simple baselines with neural networks, but also promote process understanding.

3) Model hierarchy

We start with the Sundqvist baseline (red star in Fig. 2; Sundqvist et al. 1989), the standard cloud cover parameterization in ICON. The Sundqvist parameterization implemented in ICON represents cloud cover as a monotonically increasing function of relative humidity, provided it exceeds a critical threshold $\mathrm{RH}_{\mathrm{crit}}$ (Roeckner et al. 1996):
$$\mathrm{RH}_{\mathrm{crit}} \overset{\mathrm{def}}{=} \mathrm{RH}_{\mathrm{top}} + (\mathrm{RH}_{\mathrm{surf}} - \mathrm{RH}_{\mathrm{top}})\exp\left[1 - (p_s/p)^n\right], \tag{8}$$
where our implementation includes four tunable parameters, $\{\mathrm{RH}_{\mathrm{surf}}, \mathrm{RH}_{\mathrm{top}}, \mathrm{RH}_{\mathrm{sat}}, n\}$, with values listed in appendix A. When $\mathrm{RH} > \mathrm{RH}_{\mathrm{crit}}$, cloud cover is given by the model
$$\mathcal{M}_{\mathrm{Sundqvist}}: (p, p_s, \mathrm{RH}) \mapsto C_{\mathrm{Sundqvist}}, \tag{9}$$
whose output is
$$C_{\mathrm{Sundqvist}} \overset{\mathrm{def}}{=} 1 - \sqrt{\frac{\min\{\mathrm{RH}, \mathrm{RH}_{\mathrm{sat}}\} - \mathrm{RH}_{\mathrm{sat}}}{\mathrm{RH}_{\mathrm{crit}} - \mathrm{RH}_{\mathrm{sat}}}}. \tag{10}$$

To account for marine stratocumulus (low) clouds (Mauritsen et al. 2019), we use two different sets of four parameters for land and sea, using a land fraction threshold of 0.5. The Sundqvist scheme is parsimonious with only eight trainable parameters, but it performs poorly against high-resolution data, with RMSE values as large as 25% despite having been retuned on our training set.
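For concreteness, here is a minimal NumPy sketch of Eqs. (8)–(10); the parameter values are illustrative placeholders rather than the tuned land/sea values listed in appendix A:

```python
import numpy as np

# Illustrative placeholder parameters (the paper's tuned land/sea values
# live in its appendix A)
RH_SURF, RH_TOP, RH_SAT, N_EXP = 0.75, 0.55, 1.0, 2.0

def sundqvist_cloud_cover(rh, p, p_surf):
    """Sundqvist et al. (1989) cloud cover as implemented in ICON
    (Roeckner et al. 1996); all inputs may be NumPy arrays."""
    rh_crit = RH_TOP + (RH_SURF - RH_TOP) * np.exp(1.0 - (p_surf / p) ** N_EXP)
    rh_eff = np.minimum(rh, RH_SAT)
    frac = np.clip((rh_eff - RH_SAT) / (rh_crit - RH_SAT), 0.0, 1.0)
    # Cloud cover rises from 0 at RH = RH_crit to 1 at RH = RH_sat
    return np.where(rh > rh_crit, 1.0 - np.sqrt(frac), 0.0)
```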

Hypothesizing that the Sundqvist scheme’s large error is due to its lack of features, we test the effect of adding features one by one. For that purpose, we apply forward sequential feature selection, which greedily adds features using a cross-validated score (here MSE), to a standard multiple linear regression model that includes polynomial combinations of all available features, up to a maximum degree of 3.
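One way to implement this selection is sketched below with scikit-learn, under the assumption that X holds the standardized candidate features and y the coarse-grained cloud cover (placeholder arrays are used here):

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Placeholder arrays standing in for the standardized coarse-grained
# predictors and the cloud cover target
X = np.random.rand(1000, 8)
y = np.random.rand(1000)

# Expand into all polynomial combinations up to degree 3, then greedily
# add the terms that most reduce cross-validated MSE
poly = PolynomialFeatures(degree=3, include_bias=False)
X_poly = poly.fit_transform(X)

sfs = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=4,          # e.g., the four-feature linear model
    direction="forward",
    scoring="neg_mean_squared_error",
    cv=5,
)
sfs.fit(X_poly, y)
print(poly.get_feature_names_out()[sfs.get_support()])
```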

We find that linearly including temperature as a feature alongside relative humidity is enough to outperform the Sundqvist baseline, providing our simplest example of feature assimilation:
$$\mathcal{M}_{\mathrm{Sundqvist}}[\mathrm{RH}] \notin \mathrm{PF}_{\mathcal{E},\mathcal{M}}, \quad \mathcal{M}_{\mathrm{Linear2F}}[\mathrm{RH}, T] \in \mathrm{PF}_{\mathcal{E},\mathcal{M}}, \tag{11}$$
where $\mathcal{M}_{\mathrm{Linear2F}}$ is a two-feature linear regression whose output $C_{\mathrm{Linear2F}}$ is
$$\frac{C_{\mathrm{Linear2F}}}{\alpha_{2F}} = 1 + \frac{\mathrm{RH}}{\mathrm{RH}_{2F}} - \frac{T}{\Delta T_{2F}}, \tag{12}$$
with parameters $\alpha_{2F}$, $\mathrm{RH}_{2F}$, and $\Delta T_{2F}$ listed in appendix A. This feature assimilation echoes a well-known result: Accurate cloud cover parameterization requires temperature in addition to relative humidity.
Adding more features to linear (red crosses), quadratic (orange crosses), and cubic (yellow crosses) polynomials further reduces error. Moreover, incorporating a simple physical constraint,
$$q_l + q_i = 0 \implies C = 0, \tag{13}$$
i.e., that the absence of condensates implies no clouds, further improves the results. Note that this constraint can also be derived from data using a binary decision tree. Accounting for this constraint, polynomial models achieve MSEs less than half that of the Sundqvist scheme (see Fig. 2’s lower-error, higher-complexity polynomial models). Focusing on the Pareto-optimal four-feature linear and quadratic models, their outputs are
$$\frac{C_{\mathrm{Linear4F}}}{\alpha_1} = 1 + \frac{\mathrm{RH}}{\mathrm{RH}_1} - \frac{T}{\Delta T_1} + \frac{q_i}{q_1} - H_1 \frac{d\mathrm{RH}}{dz}, \tag{14}$$
$$\frac{C_{\mathrm{quadratic}}}{\alpha_2} = 1 + \frac{\mathrm{RH}}{\mathrm{RH}_{2,1}} - \frac{T}{\Delta T_{2,1}} + \frac{q_i}{q_{2,1}} + \frac{q_l}{q_{2,2}} - \frac{q_i q_l}{q_{2,3}^2} + H_2 \left(1 + \frac{\mathrm{RH}}{\mathrm{RH}_{2,2}} - \frac{T}{\Delta T_{2,2}}\right) \frac{d\mathrm{RH}}{dz}, \tag{15}$$
with parameters listed in appendix A. Analyzing this first Pareto front unveils the role of each added feature in better representing cloud cover. First, it highlights the role of cloud condensates, both liquid and ice, in accurately describing cloud cover. Second, among all possible vertical gradients, Pareto-optimal models select the negative gradient of relative humidity to depict the role of inversions in increasing cloud cover, which is particularly important for low clouds, including marine stratocumulus. However, upon closer inspection, these contributions are not always physically consistent. While the four-feature linear model correctly captures the increase in cloud cover with ice, it incorrectly decreases cloud cover when there is humidity above. Conversely, the quadratic model accurately captures the inversion’s role when temperature and humidity are fixed to their training-set means, but it wrongly assumes that the inversions’ effect on cloudiness depends on humidity and temperature. Moreover, the model’s decrease in cloudiness with increasing mixed-phase cloud condensates suggests that its $q_i q_l$ term is not an interpretable sensitivity to liquid and ice but rather a bias-correction term relying on these variables.
Combining these insights with the Pareto optimality of the simplified “Wang” scheme (gray star in Fig. 2; Wang et al. 2023, based on the scheme of Xu and Randall 1996), which outputs
$$C_{\mathrm{Wang}} = \min\left\{1, \left[1 - e^{-\alpha_{\mathrm{Wang}}(q_l + q_i)}\right] \times \mathrm{RH}^{\beta_{\mathrm{Wang}}}\right\}, \tag{16}$$
with ($\alpha_{\mathrm{Wang}}$, $\beta_{\mathrm{Wang}}$) listed in appendix A, suggests that the dependence of cloud cover on condensates is fundamentally nonlinear. This justifies the use of deep learning to quickly explore which features can be combined to improve the nonlinear, functional representation of cloud cover.

We start with the baseline neural network from Grundner et al. (2022): a three-layer-deep, 64-neuron-wide multilayer perceptron with batch normalization after the second hidden layer. Hyperparameters were optimized using the Sherpa Python library (Hertel et al. 2020). This NN estimates the target cloud cover with high fidelity (RMSE = 8.5%), but at the cost of increased complexity, as it has a total of 9345 trainable parameters. Note that the models from Grundner et al. (2022) were not designed to minimize the number of trainable parameters, so we do not overly focus on this complexity metric. The RMSE can be further lowered to 6% with vertically nonlocal NNs, which map the entire atmospheric column of inputs to the entire column of outputs without inductive bias. However, this small error gain is deemed insufficient given the increased complexity cost.
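A hedged Keras sketch of this baseline architecture follows; the input dimension and activation function are assumptions, and the exact configuration is documented in Grundner et al. (2022):

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_cloud_cover_nn(n_features=24):
    """Three 64-neuron hidden layers with batch normalization after the
    second hidden layer; activation and feature count are assumptions."""
    return tf.keras.Sequential([
        layers.Input(shape=(n_features,)),
        layers.Dense(64, activation="tanh"),
        layers.Dense(64, activation="tanh"),
        layers.BatchNormalization(),   # after the second hidden layer
        layers.Dense(64, activation="tanh"),
        layers.Dense(1),               # cloud cover, clipped to [0, 1] downstream
    ])

model = build_cloud_cover_nn()
model.compile(optimizer="adam", loss="mse")
```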

Instead, we make the vertically quasilocal assumption [Eq. (7)] and deploy a hierarchy of Pareto-optimal NNs, with features selected sequentially using cross-validated MSE. The five most informative features for NNs are
$$\mathrm{RH}, \quad q_i, \quad q_l, \quad T, \quad \frac{d\mathrm{RH}}{dz}. \tag{17}$$
Unlike features selected by polynomial models, these can be nonlinearly combined to yield high-quality predictions, as shown in the right panels of Fig. 2. First, NNs improve the functional representation of cloud cover by accurately modeling the sharp increase in cloud cover above a temperature-dependent relative humidity threshold—a highly nonlinear, bivariate behavior that simple schemes struggle to capture. Formally,
$$\mathcal{M}_{\mathrm{Sundqvist}}[X] \notin \mathrm{PF}_{\mathcal{E},\mathcal{M}}, \quad \mathcal{M}_{\mathrm{NN}}[X] \in \mathrm{PF}_{\mathcal{E},\mathcal{M}}, \tag{18}$$
where $\mathcal{M}_{\mathrm{NN}}$ is a Pareto-optimal NN at the bottom right of the (complexity, error) plane. Second, by incorporating vertical relative humidity gradients, NNs can capture stratocumulus decks off the coasts of regions like California, Peru/Chile, and Namibia/Angola. This improvement is especially visible when comparing the error map of an NN using the first four features from Eq. (17) to that of an NN additionally incorporating $d\mathrm{RH}/dz$ (Fig. 2, bottom-right panel).

While indicative of how accurately cloud cover can be parameterized, such improvements are often insufficient to be considered “discoveries” as they remain hard to explain, even with post hoc explanation tools (Figs. 8 and 9 of Grundner et al. 2022). Therefore, improvements in functional representation and feature assimilation need to be further distilled into sparse models that scientists can readily interpret.

4) Symbolic distillation and equation discovery

For this purpose, we use symbolic regression libraries, which optimize both the parameters and structure of an analytical model within a space of expressions. Symbolic regression yields expressions with transparent out-of-distribution behavior (asymptotics, periodicity, etc.) (La Cava et al. 2021), making them well suited for high-stakes societal applications (Rudin 2019) and the empirical distillation of natural laws (Schmidt and Lipson 2009). To avoid overly restricting the analytical form of the distilled equation, genetic programming is used. Genetic programming evolves a population of mathematical expressions using methods such as selection, crossover, and mutation to improve a fitness function (Koza 1994). Given that genetic programming scales poorly with the number of features (Petersen et al. 2019), our NN feature selection results are used to restrict our features to those listed in Eq. (17). Using NN results is appropriate since no assumption is made about the type of equation to be discovered.
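A minimal sketch of this genetic-programming workflow with the PySR library introduced below; the operator set, iteration budget, and size cap are illustrative rather than the study’s exact settings, and the arrays are placeholders:

```python
import numpy as np
from pysr import PySRRegressor

# Placeholder data standing in for the five NN-selected features of
# Eq. (17) and the coarse-grained cloud cover target
X = np.random.rand(1000, 5)
y = np.random.rand(1000)

model = PySRRegressor(
    niterations=40,                        # illustrative budget
    binary_operators=["+", "-", "*", "/"],
    unary_operators=["exp"],
    maxsize=30,                            # cap expression complexity
)
model.fit(X, y)       # evolves a population of candidate equations
print(model.sympy())  # best expression on PySR's own accuracy-complexity front
```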

The Gene-Pool Optimal Mixing Evolutionary Algorithm for Genetic Programming (GP-GOMEA) (light green crosses in Fig. 2; Virgolin et al. 2021) and Python Symbolic Regression (PySR) (dark green crosses in Fig. 2; Cranmer 2023) libraries were chosen for their ease of use and high relative performance compared to 12 other recent libraries (La Cava et al. 2021). They yielded over 500 closed-form equations for cloud cover, from which the nine most physically consistent and lowest-error fits were retained with their outputs always clipped to [0, 1] (see Grundner et al. 2024 for details). By physically interpreting the learned parameters, the output of the Pareto-optimal PySR model may be written as
$$C_{\mathrm{PySR}} = \underbrace{\mathcal{I}_1(\mathrm{RH}, T)}_{\text{Humidity/temperature}} + \underbrace{\mathcal{I}_2\!\left(\frac{d\mathrm{RH}}{dz}\right)}_{\text{Inversion}} + \underbrace{\mathcal{I}_3(q_l, q_i)}_{\text{Condensates}}, \tag{19}$$
where the first term $\mathcal{I}_1$ may be interpreted as a sparse, third-order Taylor expansion around the training-mean relative humidity ($\overline{\mathrm{RH}} = 0.60$) and temperature ($\overline{T} = 257\ \mathrm{K}$):
$$\mathcal{I}_1 = \overline{C} + \left(\frac{\partial C}{\partial \mathrm{RH}}\right)_{\overline{\mathrm{RH}},\overline{T}} (\mathrm{RH} - \overline{\mathrm{RH}}) + \left(\frac{\partial C}{\partial T}\right)_{\overline{\mathrm{RH}},\overline{T}} (T - \overline{T}) + \frac{1}{2}\left(\frac{\partial^2 C}{\partial \mathrm{RH}^2}\right)_{\overline{\mathrm{RH}},\overline{T}} (\mathrm{RH} - \overline{\mathrm{RH}})^2 + \frac{1}{2}\left(\frac{\partial^3 C}{\partial \mathrm{RH}\, \partial T^2}\right)_{\overline{\mathrm{RH}},\overline{T}} (T - \overline{T})^2 (\mathrm{RH} - \overline{\mathrm{RH}}). \tag{20}$$
Dominant Taylor series expansion terms are expected when discovering closed-form, subgrid-scale parameterizations (Jakhar et al. 2024), but as PySR is based on genetic programming, we find two more surprising terms:
$$\mathcal{I}_2 = H_{\mathrm{PySR}}^3 \left[\frac{d\mathrm{RH}}{dz} - \frac{3}{2}\left(\frac{d\mathrm{RH}}{dz}\right)_{\max C}\right] \left(\frac{d\mathrm{RH}}{dz}\right)^2, \tag{21}$$
where $H_{\mathrm{PySR}} \approx 585\ \mathrm{m}$ can be interpreted as a characteristic height for low-cloud humidity gradients, and $(d\mathrm{RH}/dz)_{\max C}$ ($\approx -2\ \mathrm{km}^{-1}$) is the value of the relative humidity gradient that maximizes cloud cover at the inversion level. The last term is
$$\mathcal{I}_3 = 1 - \epsilon_{\mathrm{PySR}} \times \frac{1}{1 + \frac{2}{\epsilon_{\mathrm{PySR}}}\left(\lambda_l q_l + \lambda_i q_i\right)}, \tag{22}$$
which is a monotonically increasing function of the condensates’ concentrations, whose trainable parameters are provided in appendix A. Consistently, the Pareto-optimal GP-GOMEA equation has a term $C_q$ that sharply increases as liquid or ice concentrations exceed 0:
$$\frac{C_{\mathrm{GOMEA}}}{\alpha_G} = 1 + \underbrace{\beta_G\, e^{\mathrm{RH}/\mathrm{RH}_G}}_{\text{Humidity}} + \underbrace{C_q(q_l, q_i)}_{\text{Condensates}}, \tag{23}$$
$$C_q = \gamma_G \ln\!\left[\epsilon_G + \frac{q_i}{q_{G,i}} + \delta_G \left(e^{q_l/q_{G,l}} + 1\right)\right] - \frac{q_l}{q_{G,l}}, \tag{24}$$
where the trainable parameters are listed in appendix A.

In addition to being easily transferable thanks to their low number of trainable parameters, the added value of these equations is transparent: The improved functional representation is explicit [see Eq. (20)], and the assimilation of new features is interpretable [see Eq. (21)]. Finally, scientific discovery may arise through the unexpected aspects of these equations that are robust across models, such as the difference between how cloud cover reacts to an increase in environmental liquid versus ice content. Indeed, at high resolution, cloud cover will become 1 as soon as condensates exceed a small threshold (here 10−6 kg kg−1), independently of the water’s phase. Then, how come cloud cover is more sensitive to ice than liquid in Eq. (22) ($\lambda_i \approx 3.8\,\lambda_l$), and $C_q$ increases much faster with ice than liquid (for $q_i + q_l \ll 1$) in Eq. (24)?

These are in fact emergent properties of the subgrid distributions of liquid and ice (Grundner et al. 2024): As large values of cloud ice are rarely observed, larger spatial averages of cloud ice at coarse resolution mean that many more high-resolution pixels contain low values of cloud ice compared to the liquid case, resulting in higher cloudiness for a given spatially averaged condensate value. By assuming an exponential distribution for the subgrid liquid and ice content, we can even interpret $\lambda_l$ and $\lambda_i$ as the rates of the respective exponential distributions. This allows us to hypothesize that the nonlinear relationship between condensates and cloud cover is not scale invariant and requires separate treatments of liquid and ice, with implications for the interaction between microphysical processes and the radiative budget. While analyzing the feature importance of liquid and ice in neural networks could have suggested this difference, it would have been difficult to fully explain it and bridge spatial scales without distillation, confirming the importance of treating deep learning as only a first, and not final, step toward knowledge discovery.

b. Shortwave radiative transfer emulation to accelerate numerical weather prediction

1) Motivation

The energy transfer resulting from the interaction between electromagnetic radiation and the atmosphere, known as radiative transfer, is costly to simulate accurately. Line-by-line calculations of gaseous absorption at each electromagnetic wavelength (Clough et al. 1992) are too expensive for routine weather and climate models. Instead, models often use the correlated-k method (Mlawer et al. 1997), which groups absorption coefficients in a cumulative probability space to speed up radiative transfer calculations without significantly compromising accuracy. However, even the correlated-k method imposes a high computational burden (Veerman et al. 2021), forcing most simulations to reduce the temporal and spatial resolution of radiative transfer calculations, which can degrade prediction quality (Morcrette 2000; Hogan and Bozzo 2018).

This challenge has driven the development of ML emulators for radiative transfer in numerical weather prediction since the 1990s (Cheruy et al. 1996; Chevallier et al. 1998, 2000). ML architectures have become more sophisticated (e.g., Belochitski and Krasnopolsky 2021; Kim and Song 2022; Ukkonen 2022), but the primary goal remains to emulate the original radiation scheme as faithfully as possible. This allows the reduced inference cost of the ML model, once trained, to be leveraged for running the atmospheric model coupled with the emulator, enabling less expensive and more frequent radiative transfer calculations.

This case study examines how ML architectural designs impact the reliability of shortwave radiative transfer (covering solar radiation and wavelengths of 0.23–5.85 µm). As shown in Fig. 3, physics-informed architectures that closely mimic the vertical bidirectionality of radiative transfer are Pareto optimal, rivaling the performance of deep learning models with 10 times more trainable parameters.

Fig. 3.

Pareto-optimal model hierarchies guide the development of progressively tailored architectures for emulating shortwave radiative transfer. (a) Error vs complexity on a logarithmic scale for the simple clear-sky cases dominated by absorption; (b) error vs complexity for cases with multilayer cloud, including both liquid and ice, where multiple scattering complicates radiative transfer. Convolutional NNs (CNNs; red crosses) with small kernels, MLPs (orange crosses) that ignore the vertical dimension, and the simple linear baseline (light pink star) give credible results in the clear-sky case. However, they fail in the more complex case, which requires U-Net architectures (dark pink and purple crosses) to fully capture nonlocal radiative transfer. The vertical invariance of the two-stream radiative transfer equations suggests a bidirectional RNN (green star) architecture, which rivals the skill of U-Nets with a fraction of their trainable parameters.


2) Setup

We follow the setup described in Lagerquist et al. (2023) and emulate the full behavior (gas optics and radiative transfer solver) of the shortwave Rapid Radiative Transfer Model (Mlawer et al. 1997) in the context of weather predictions made by version 16 of the Global Forecast System with 0.25° horizontal spacing. In addition to the realistic geography setup of Lagerquist et al. (2021), the ML models are trained with global data on the 127-level native pressure–sigma grid (henceforth referred to as η) with synthetic information about aerosols, trace gases, and hydrometeors’ particle size distribution. We use data from most days between 1 September 2018 and 23 December 2020, with forecast lead times of 6, 12, 18, 24, 30, and 36 h. In each forecast, 4000 grid points are randomly sampled on the global grid. The training set comprises 873 086 samples from 237 days, the validation set comprises 479 806 samples from 126 days, and the test set comprises 472 456 samples from 120 days. Features $X$ can broadly be separated into 22 vertical profiles tied to the grid $X_\eta$ (listed in appendix B) and four scalars with one value per atmospheric column: solar zenith angle ζ (°), surface albedo $\alpha_s$, aerosol single-scattering albedo $\omega_0$, and aerosol asymmetry parameter $g$. For convenience, the four scalars are converted into profiles by repeating their values at all 127 levels. We target the shortwave heating rate’s vertical profile $(dT/dt)_{\mathrm{SW}}$ (K day−1), thus approximating a high-dimensional mapping:
$$(X_\eta, \zeta, \alpha_s, \omega_0, g) \in \mathbb{R}^{(22+4)\times 127} \mapsto \left(\frac{dT}{dt}\right)_{\mathrm{SW}} \in \mathbb{R}^{1\times 127}, \tag{25}$$
via an ML model hierarchy described in the next subsection.

3) Model hierarchy

We deploy an ML model hierarchy using the same features X to isolate the added value of spatial connectivity along the vertical coordinate η. For all models, cropping and zero-padding layers set the top-of-atmosphere (TOA; ∼78 km above ground level) heating rate to zero, consistent with the RRTM being emulated. Readers interested in the exact ML model architecture are referred to Lagerquist et al. (2023).

We begin with a linear regression model (pink star), where the input layer is reshaped into a flattened vector and processed through linear layers without activation functions or pooling. This model lacks inherent spatial connectivity, treating all features (where one “feature” = one atmospheric variable at one vertical level) independently. Similarly, the multilayer perceptrons (MLPs; orange crosses) flatten inputs, ignoring vertical connectivity, but process them with varying degrees of nonlinearity, controlled by two hyperparameters: depth (number of dense layers) and width (number of neurons per layer). We conduct a grid search with depth varying from {1, 2, 3, 4, 5, 6} layers and width varying from {64, 128, 256, 512, 1024, 2048} neurons, resulting in 36 MLPs. The activation function for every hidden layer (i.e., every layer except the output) is the leaky rectified linear unit (ReLU; Maas et al. 2013) with a negative slope α = 0.2.

To incorporate vertical relationships between different levels, we train CNNs (red crosses) with 1D convolutions along the vertical axis. We conduct a grid search with depth varying from {1, 2, 3, 4, 5, 6} convolutional layers and number of channels in the first convolutional layer varying from {2, 4, 8, 16, 32, 64}. After the first layer, we double the number of channels in each successive convolutional layer, up to a maximum of 64. The CNN’s kernel size is restricted to three vertical levels, to enforce strong local connectivity.
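A hedged Keras sketch of one member of this CNN family follows; the feature count and per-level output head are assumptions consistent with the setup above:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_vertical_cnn(depth, first_channels, n_levels=127, n_features=26):
    """1D convolutions along the vertical axis with kernel size 3 and
    channel doubling capped at 64, mirroring the grid search above."""
    inputs = tf.keras.Input(shape=(n_levels, n_features))
    x = inputs
    channels = first_channels
    for _ in range(depth):
        x = layers.Conv1D(channels, kernel_size=3, padding="same")(x)
        x = layers.LeakyReLU(0.2)(x)          # negative slope of 0.2
        channels = min(channels * 2, 64)
    outputs = layers.Conv1D(1, kernel_size=1)(x)  # heating rate at each level
    return tf.keras.Model(inputs, outputs)

model = build_vertical_cnn(depth=3, first_channels=16)
model.compile(optimizer="adam", loss="mse")
```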

The U-Net architecture (pink crosses; Ronneberger et al. 2015) builds on the CNN by incorporating skip connections that preserve high-resolution spatial information and an expansive path almost symmetric to the contracting path. Both of these model components improve the reconstruction of full-resolution data (here, a 127-level profile of heating rates). We conduct a grid search with depth (number of downsampling operations) varying from {3, 4, 5} and number of channels in the first convolutional layer varying from {2, 4, 8, 16, 32, 64}, using the same channel-doubling rule as for CNNs. Finally, the U-Net++ architecture (purple crosses; Zhou et al. 2020) enhances the U-Net’s skip pathways with nested dense convolutional blocks, potentially capturing more of the original spatial information and facilitating optimization. Our hyperparameters for U-Net++ are the same as for the U-Net.

4) Distilling radiative transfer’s bidirectionality

To better understand the added value of each architecture in representing shortwave absorption and scattering, we extract samples from our test set to form two distinct regimes. The simple, “clear-sky” regime (Fig. 3a) consists of profiles with no cloud (column-integrated liquid water path = column-integrated ice water path = 0 g m−2), an oblique sun angle (zenith angle > 60°), and little water vapor (column-maximum specific humidity < 2 g kg−1). This restricts shortwave radiative transfer to gaseous absorption of solar radiation throughout the atmospheric column, well described by a simple exponential attenuation model such as Beer’s law (Liou 2002). In contrast, the complex, “multicloud” regime (Fig. 3b) includes profiles with more than one cloud layer and at least one mixed-phase cloud layer. For this purpose, a “cloud layer” is defined as a set of consecutive vertical levels, such that every level has a total water content (liquid + ice) > 0 g m−3 and the layer has a height-integrated total water path (liquid + ice) ≥ 10 g m−2. A mixed-phase cloud layer meets the above criteria plus two additional ones: Both the height-integrated liquid water path and ice water path must be >0 g m−2. This regime is challenging to model as shortwave radiation is absorbed and scattered by both liquid and ice clouds, making the bidirectionality of radiative transfer essential to capture. Out of the 472 456 test set profiles, 14 226 are in the clear-sky regime and 13 263 are in the multicloud regime.
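A minimal sketch of the two regime masks, assuming per-profile diagnostic arrays with the names and units below:

```python
import numpy as np

def clear_sky_mask(lwp, iwp, zenith_deg, q_max):
    """No cloud (zero liquid/ice water paths, g m-2), oblique sun
    (zenith angle > 60 deg), little vapor (max humidity < 2 g kg-1)."""
    return (lwp == 0.0) & (iwp == 0.0) & (zenith_deg > 60.0) & (q_max < 2.0)

def multicloud_mask(n_cloud_layers, n_mixed_phase_layers):
    """More than one cloud layer, at least one of them mixed phase."""
    return (n_cloud_layers > 1) & (n_mixed_phase_layers >= 1)

# Example with stand-in diagnostics for three profiles
lwp = np.array([0.0, 15.0, 40.0])
iwp = np.array([0.0, 12.0, 0.0])
print(clear_sky_mask(lwp, iwp, np.array([70.0, 30.0, 65.0]),
                     np.array([1.5, 8.0, 1.0])))  # -> [True, False, False]
```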

The first surprising result is the poor performance of CNNs: While Pareto optimal for low complexity, with our simplest CNN having as few as 12 630 trainable parameters, none of the CNNs achieve an MSE below 0.95 K² day⁻², even in the clear-sky case. In contrast, MLPs systematically outperform our linear baseline model, showcasing the importance of nonlinearity. MLPs also outperform CNNs because they allow for nonlocal vertical connectivity. However, without information on which vertical levels are closest, MLPs connect every level together, resulting in high complexity: Increasing the number of trainable parameters by 100× does not even halve the MSE in the clear-sky case. This lack of inductive bias prevents generalization to more complex cases such as the multicloud regime, where the MSE climbs from ≈0.2–0.4 to ≈0.7–1.0 K² day⁻².

By ordering the features’ vertical levels, both the U-Net and U-Net++ achieve lower errors with fewer trainable parameters. Most Pareto-optimal models are in the U-Net++ family, suggesting that simple skip connections are insufficient; this highlights the nonlocality of shortwave radiative transfer, especially in the multicloud case. Fortunately, while shortwave radiation may instantaneously propagate information (e.g., the presence of mixed-phase clouds) throughout the entire atmospheric column, its transfer is approximately governed by bidirectional equations invariant in space. The two-stream equation holds exactly for the target Rapid Radiative Transfer Model data:
$$\frac{d}{d\eta}\begin{pmatrix} F^+ \\ F^- \end{pmatrix} = \mathcal{R}(F^+, F^-, X_\eta), \tag{26}$$
where $F^+$ and $F^-$ are the upward and downward radiative fluxes (W m−2) and the function $\mathcal{R}$ governs how these fluxes change with the vertical coordinate η, allowing us to update fluxes using neighboring fluxes only. As noted in Ukkonen and Chantry (2024, manuscript submitted to J. Adv. Model. Earth Syst.), it is the vertical invariance of Eq. (26) that suggests we may treat radiative transfer with a network processing the vertical sequences of $F^+$ and $F^-$ in both directions with the same update rule at every level. Akin to standard atmospheric models, the target heating rate is then calculated by multiplying the vertical gradient of the net flux (downward minus upward) by the ratio of the gravitational acceleration at Earth’s surface $g$ to the specific heat capacity of dry air at constant pressure $c_p$:
$$\left(\frac{dT}{dt}\right)_{\mathrm{SW}} = \frac{g}{c_p} \frac{d\eta}{dp} \frac{d(F^- - F^+)}{d\eta}. \tag{27}$$

Following Ukkonen (2022), we divide the shortwave radiative fluxes ($F^+$, $F^-$) by the incoming TOA downward flux $F_{\mathrm{TOA}}$, to implicitly inform the NN that the sun is the original energy source. At each vertical level η, we then target the resulting ratios $F^+/F_{\mathrm{TOA}}$ and $F^-/F_{\mathrm{TOA}}$ with a bidirectional recurrent NN (RNN). The RNN has two 64-unit gated recurrent unit layers (one forward and one backward), which are concatenated before a dense layer for predicting the fluxes, resulting in only 54 786 trainable parameters. The RNN includes a multiplication layer with $F_{\mathrm{TOA}}$ and a custom layer to directly calculate shortwave heating rates, maintaining the same optimization objective as the other models. We find that the resulting RNN (green star) is clearly Pareto optimal, competing with the U-Net++ and achieving MSEs below 0.1 K² day⁻², even in the complex, multicloud case. Our results highlight the advantages of physically constraining ML solutions using robust knowledge of the underlying system, extending the aquaplanet findings of Bertoli et al. (2023, manuscript submitted to J. Adv. Model. Earth Syst.) to an Earth-like, operational setting, and paving the way toward fully hybrid physics–ML emulators of shortwave radiation (Schneiderman 2024).
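A hedged Keras sketch of this bidirectional-RNN design follows; the layer sizes match the text, but the flux-to-heating-rate conversion uses placeholder constants and a constant-pressure-thickness finite difference instead of the study’s exact discretization:

```python
import tensorflow as tf
from tensorflow.keras import layers

N_LEVELS, N_FEATURES = 127, 26
G_OVER_CP = 9.81 / 1004.0   # gravitational acceleration over dry-air c_p
DP_PA = 100.0               # placeholder constant layer thickness (Pa)

profile_in = tf.keras.Input(shape=(N_LEVELS, N_FEATURES))
f_toa_in = tf.keras.Input(shape=(1,))   # incoming TOA downward flux

# Forward and backward GRUs apply the same update rule at every level,
# mirroring the vertical invariance of the two-stream equations
h = layers.Bidirectional(layers.GRU(64, return_sequences=True))(profile_in)
flux_ratios = layers.Dense(2, activation="sigmoid")(h)  # (F+, F-)/F_TOA

# Recover physical fluxes by multiplying with F_TOA (broadcast over levels)
fluxes = layers.Lambda(
    lambda t: t[0] * tf.reshape(t[1], (-1, 1, 1)))([flux_ratios, f_toa_in])

def fluxes_to_heating(f):
    """Finite-difference analog of Eq. (27): net-flux gradient times g/cp."""
    net = f[..., 1] - f[..., 0]   # downward minus upward flux
    return G_OVER_CP * (net[:, 1:] - net[:, :-1]) / DP_PA

heating = layers.Lambda(fluxes_to_heating)(fluxes)
model = tf.keras.Model([profile_in, f_toa_in], heating)
model.compile(optimizer="adam", loss="mse")  # optimize heating rates directly
```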

c. Tropical precipitation and convective organization

1) Motivation

Accurately representing precipitation processes in tropical regions enhances global Earth system models and forecasting tools, particularly for water management and flood risk in a changing climate (Seneviratne et al. 2021). Due to computational limitations, achieving horizontal resolutions below 25–50 km is challenging, which hinders the representation of subgrid processes causing precipitation (Stevens et al. 2020). These processes include tropical convection and its complex organization patterns (Bao et al. 2024), typically modeled using semiempirical parameterizations that rely on a coarse representation of the thermodynamic environment and often exclude memory effects (Colin et al. 2019). These parameterizations generally approximate a mapping $\langle X \rangle_{t_0} \mapsto \langle P \rangle_{t_0}$, where at the time of interest $t_0$, both the spatially resolved environmental predictors $X$ and the precipitation are averaged over the coarse grid cell:
$$\langle X \rangle_{t_0} = \frac{1}{|\text{Grid cell}|} \int_{\text{Grid cell}} X_{t_0,x}\, dx, \tag{28}$$
where $x$ represents a continuous, bidimensional horizontal coordinate. This spatial averaging removes convective organization information below the coarse grid’s spatial scale. Using storm-resolving simulations as a reference, Shamekh et al. (2023) showed that this blurs the precipitation signal, adding irreducible uncertainty, especially for large values of precipitable water $\langle Q \rangle$. Given that high-resolution information is typically inaccessible in a parameterization context, we ask in this case study: How much of this lost spatial granularity can we recover with temporal memory?

2) Setup

To address this question, we build upon the framework of Shamekh et al. (2023), coarse-graining global simulations with the System for Atmospheric Modeling (Khairoutdinov et al. 2022), conducted as part of the DYAMOND project, from their native, high-resolution horizontal grid (4.34-km spacing at the equator) to a low-resolution grid (138.9-km spacing at the equator) representative of a coarse Earth system model. Over the tropical ocean (20°S–20°N), we map six environmental variables $X$—precipitable water (mm), 2-m specific humidity (kg kg−1), 2-m temperature (K), surface sensible and latent heat fluxes (W m−2), and sea surface temperature (K)—to 15-min precipitation rates $\langle P \rangle$ (mm h−1). We use three setups at a given low-resolution location $x_{\mathrm{LR}}$ encompassing the high-resolution grid locations $x_{\mathrm{HR}}$:
$$\text{Baseline:}\quad \langle X \rangle_{t_0}^{x_{\mathrm{LR}}} \in \mathbb{R}^{6} \mapsto \langle P \rangle_{t_0}^{x_{\mathrm{LR}}} \in \mathbb{R}^{+}, \tag{29}$$
$$\text{Spatial granularity:}\quad \left(\langle X \rangle_{t_0}^{x_{\mathrm{LR}}}, Q'_{t_0, x_{\mathrm{HR}}}\right) \in \mathbb{R}^{6 + 32 \times 32} \mapsto \langle P \rangle_{t_0}^{x_{\mathrm{LR}}} \in \mathbb{R}^{+}, \tag{30}$$
$$\text{Temporal memory:}\quad \langle X \rangle_{t_{\mathrm{Memory}}}^{x_{\mathrm{LR}}} \in \mathbb{R}^{6 \times \mathrm{card}(t_{\mathrm{Memory}})} \mapsto \langle P \rangle_{t_0}^{x_{\mathrm{LR}}} \in \mathbb{R}^{+}, \tag{31}$$
where $Q'$ is the precipitable water’s anomaly with respect to its gridcell mean $\langle Q \rangle$, and memory is accounted for through up to two time steps in the past, in which case $t_{\mathrm{Memory}} = \{t_0, t_0 - 15\ \mathrm{min}, t_0 - 30\ \mathrm{min}\}$. Samples are extracted from 10 days of the simulation after spinup, with 6 days for training, 2 for validation, and 2 for testing. To focus on precipitation intensity rather than triggering, samples with precipitation values below 0.05 mm h−1 are discarded, resulting in a total of ≈10⁸ samples.
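A minimal NumPy sketch of the coarse-graining of Eq. (28) and of the anomaly $Q'$ fed to the encoder–decoder models below (grid sizes are illustrative):

```python
import numpy as np

def coarse_grain(field_hr, block=32):
    """Average an (Ny, Nx) high-resolution field over non-overlapping
    block x block patches, following Eq. (28)."""
    ny, nx = field_hr.shape
    assert ny % block == 0 and nx % block == 0
    return field_hr.reshape(ny // block, block,
                            nx // block, block).mean(axis=(1, 3))

# Stand-in precipitable water field on a high-resolution grid
Q_hr = np.random.rand(128, 128)
Q_lr = coarse_grain(Q_hr)                           # <Q>: gridcell means
Q_anom = Q_hr - np.kron(Q_lr, np.ones((32, 32)))    # Q': subgrid anomaly
```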

3) Model hierarchy

As illustrated in Fig. 4, we design three categories of NNs to learn the three mappings given by Eqs. (29)–(31). First, our baseline NNs (orange crosses) are simple MLPs with five layers of different widths, resulting in a number of trainable parameters ranging from ≈30 to 600 × 10³ and never achieving MSEs below 0.4 mm² h⁻². The NNs using spatial granularity (purple crosses) additionally encode the high-resolution precipitable water anomaly field through an encoder–decoder structure with a small bottleneck representing the convective organization as a bidimensional latent variable, optionally regularized via data augmentation applied to $Q'$. MSEs are below 0.04 mm² h⁻² for the deepest encoder–decoder architectures, which yield high reconstruction quality through more than 1M trainable parameters. The NNs leveraging temporal memory instead of spatial resolution (green crosses) use the last or last two previous time steps to achieve competitive MSEs, all below 0.15 mm² h⁻². These “Memory-NNs” are built using two types of architectures: On the one hand, we flatten previous time steps and feed them to MLPs conditioned on nonzero current precipitation, which ignores temporal connectivity. On the other hand, we use CNN-based temporal models that preserve temporal ordering through 1D convolutional layers, achieving MSEs around 0.1 mm² h⁻² with ≈10 times fewer trainable parameters than the “High-Res Inputs-NNs” that use storm-scale information. Overall, our results showcase the added value of spatial and temporal connectivities, as $\mathcal{M}_{\mathrm{MLP}}[\langle X \rangle_{t_0}^{x_{\mathrm{LR}}}] \notin \mathrm{PF}_{\mathcal{E},\mathcal{M}}$, while
MCNN[XtMemoryxLR]PF,M,
MED[(Xt0xLRQt0,xHR)]PF,M.
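To make the encoder–decoder category of Eq. (30) concrete, the following minimal PyTorch sketch compresses the 32 × 32 precipitable-water anomaly into a bidimensional latent variable before combining it with the six coarse predictors. The layer sizes and the Softplus output are illustrative assumptions; the architectures trained in this study are deeper and tuned.

# Minimal sketch, assuming inputs x_coarse (batch, 6) and q_anom (batch, 1, 32, 32).
import torch
import torch.nn as nn

class PrecipEncoderDecoder(nn.Module):
    def __init__(self, n_coarse: int = 6, latent_dim: int = 2):
        super().__init__()
        self.encoder = nn.Sequential(  # Q' (1, 32, 32) -> "org" (2,)
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),   # 16 x 16
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # 8 x 8
            nn.Flatten(),
            nn.Linear(32 * 8 * 8, latent_dim),
        )
        self.decoder = nn.Sequential(  # "org" -> reconstructed Q'
            nn.Linear(latent_dim, 32 * 8 * 8), nn.ReLU(),
            nn.Unflatten(1, (32, 8, 8)),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1),
        )
        self.precip_head = nn.Sequential(  # [X, org] -> <P> >= 0
            nn.Linear(n_coarse + latent_dim, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Softplus(),  # keep precipitation positive
        )

    def forward(self, x_coarse, q_anom):
        org = self.encoder(q_anom)
        q_rec = self.decoder(org)  # used in a reconstruction loss term
        precip = self.precip_head(torch.cat([x_coarse, org], dim=-1))
        return precip, q_rec, org

# p, q_rec, org = PrecipEncoderDecoder()(torch.randn(4, 6), torch.randn(4, 1, 32, 32))

Training such a model would combine the precipitation error with a reconstruction loss on the decoded anomaly; it is the reconstruction term that forces the bottleneck to summarize convective organization rather than discard it.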
Fig. 4.

Pareto-optimal model hierarchies underscore the importance of storm-resolving information in elucidating the relationship between precipitation and its surrounding environment, while also quantifying the recoverability of this information from the coarse environment’s time series. (left) NNs leveraging high-resolution spatial data (purple crosses) clearly outperform NNs that use only coarse inputs (orange crosses). However, this performance gap is largely mitigated when the coarse inputs’ past time steps are included (green crosses). (right) Processing the precipitable water field at a resolution of Δx ≈ 5 km yields coefficients of determination R² ≈ 0.9, clearly surpassing the R² ≈ 0.5 attained by our best NN using fields at the coarse Δx ≈ 10² km horizontal resolution. This performance gap is partially closed by incorporating two past time steps along with the current time step, resulting in R² ≈ 0.7. This suggests a partial equivalence of the environment’s spatial and temporal connectivities in predicting precipitation.


4) Mitigating low spatial resolution with memory

While it may be unsurprising that leveraging spatiotemporal information decreases error, the competitive MSEs obtained with temporal memory but without high-resolution spatial information are promising for improving parameterizations. We therefore ask: Why does temporal memory recover such a large portion of the lost spatial granularity?

First, we can ask: Why would temporal memory of the coarse environment $\mathbf{X}_{t_{\mathrm{Memory}}}^{x_{\mathrm{LR}}}$ be relevant on its own, given that at a given time $t_0$, the precipitation $\langle P \rangle_{t_0}$ can be exactly diagnosed from the high-resolution environment $\mathbf{X}_{t_0, x_{\mathrm{HR}}}$? The informativeness of high spatial resolution is confirmed by the fact that our lowest-error model is $\mathcal{M}_{\mathrm{ED}}$. As written in Eq. (30), $\mathcal{M}_{\mathrm{ED}}$ decomposes the high-resolution information into a low-resolution component and the inputs’ spatial anomaly (denoted by primes). Using this decomposition, we can write the following causal graph:
$$\begin{gathered} \mathbf{X}_{t_0 - \Delta t}^{x_{\mathrm{LR}}} \rightarrow \mathbf{X}_{t_0}^{x_{\mathrm{LR}}} \leftarrow \mathbf{X}'_{t_0 - \Delta t, x_{\mathrm{HR}}},\\ \mathbf{X}_{t_0 - \Delta t}^{x_{\mathrm{LR}}} \rightarrow \mathbf{X}'_{t_0, x_{\mathrm{HR}}} \leftarrow \mathbf{X}'_{t_0 - \Delta t, x_{\mathrm{HR}}},\\ \mathbf{X}'_{t_0, x_{\mathrm{HR}}} \rightarrow \langle P \rangle_{t_0} \leftarrow \mathbf{X}_{t_0}^{x_{\mathrm{LR}}}, \end{gathered} \tag{34}$$
where we posit that, for a small enough time step Δt, the combination of the high-resolution anomaly and the low-resolution information suffices to diagnose precipitation and step the entire system forward in time from $t_0 - \Delta t$ to $t_0$. Formally, this is a Markovian assumption, which eliminates the need for temporal memory. Under this assumption, the precipitation $\langle P \rangle_{t_0}$ is independent of $\mathbf{X}_{t_0 - \Delta t}^{x_{\mathrm{LR}}}$ and of previous time steps when conditioned on the current environment $(\mathbf{X}'_{t_0, x_{\mathrm{HR}}}, \mathbf{X}_{t_0}^{x_{\mathrm{LR}}})$. Using the bottom two lines of Eq. (34), we see that suppressing access to high-resolution information activates the following causal path from $\mathbf{X}_{t_0 - \Delta t}^{x_{\mathrm{LR}}}$ to $\langle P \rangle_{t_0}$ by introducing a dependency:
$$\mathbf{X}_{t_0 - \Delta t}^{x_{\mathrm{LR}}} \longrightarrow \mathbf{X}'_{t_0, x_{\mathrm{HR}}} \longrightarrow \langle P \rangle_{t_0}.$$

This induces a conditional dependence of the precipitation $\langle P \rangle_{t_0}$ on the past low-resolution environment $\mathbf{X}_{t_0 - \Delta t}^{x_{\mathrm{LR}}}$ by removing d-separation through the exclusion of the current high-resolution anomalies $\mathbf{X}'_{t_0, x_{\mathrm{HR}}}$ from the conditioning set (Pearl et al. 2000). Therefore, temporal memory becomes necessary in the absence of high-resolution information, consistent with the general result of the Mori–Zwanzig decomposition for dynamical systems (Lucarini and Chekroun 2023). This explains the Pareto optimality of the “simple-memory NN” $\mathcal{M}_{\mathrm{CNN}}$, which considers low-resolution inputs and their past time steps.
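This conditioning argument can be checked numerically. The toy linear-Gaussian simulation below, with arbitrary illustrative coefficients and scalar stand-ins for each node of Eq. (34), verifies that the partial correlation between precipitation and the past coarse state vanishes when conditioning on the full current state, but not when the high-resolution anomaly is withheld:

# Toy simulation of the causal graph in Eq. (34); all coefficients are
# arbitrary illustrations, not values estimated from the simulations.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x_lr_past = rng.normal(size=n)  # X_lr(t0 - dt)
x_hr_past = rng.normal(size=n)  # X'_hr(t0 - dt)
x_lr_now = 0.7 * x_lr_past + 0.5 * x_hr_past + 0.3 * rng.normal(size=n)
x_hr_now = 0.6 * x_lr_past - 0.4 * x_hr_past + 0.3 * rng.normal(size=n)
precip = 0.8 * x_hr_now + 0.5 * x_lr_now + 0.2 * rng.normal(size=n)

def partial_corr(a, b, conditioning):
    """Correlation of a and b after linearly regressing out `conditioning`."""
    z = np.column_stack([np.ones(n)] + conditioning)
    res_a = a - z @ np.linalg.lstsq(z, a, rcond=None)[0]
    res_b = b - z @ np.linalg.lstsq(z, b, rcond=None)[0]
    return np.corrcoef(res_a, res_b)[0, 1]

# Conditioning on the full current state d-separates the past from precipitation:
print(partial_corr(precip, x_lr_past, [x_lr_now, x_hr_now]))  # ~ 0
# Withholding the high-resolution anomaly activates the path through X'_hr(t0):
print(partial_corr(precip, x_lr_past, [x_lr_now]))            # clearly nonzero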

Now, we can ask: How much information is lost in the process of replacing the high-resolution spatial anomaly with temporal memory of the low-resolution spatial means? The answer depends on the gap between the low and high resolutions. There is no difference if these resolutions are equal, while we do not expect low-resolution memory to help at all if the low-resolution grid is so coarse that very different high-resolution situations become indistinguishable. For the resolutions considered here, the encoder–decoder architecture (purple crosses) achieves high performance with a bottleneck whose dimension is only two. This suggests that the high-resolution anomaly field can be represented by a bidimensional (latent) variable, which we denote org, for the purpose of modeling precipitation. The term org encapsulates the aspects of convective organization that explain why precipitation can be different (stochastic) for the same low-resolution inputs (Moseley et al. 2016). This simplifies the question to how well we can represent org from the time series of low-resolution inputs. The encouraging results of our models leveraging temporal memory, along with the finding of Shamekh et al. (2023) that org is accurately predicted by an autoregressive model informed by the coarse-scale environment’s past time steps, suggest a largely positive answer. This reassuringly confirms that Earth system models with limited spatial resolutions can realistically represent coarse-scale precipitation as long as efforts to improve precipitation parameterization continue (Schneider et al. 2024). In hybrid modeling applications, where the NN itself computes precipitation at previous time steps, error propagation may reduce the informativeness of temporal memory—a limitation worth exploring in future work. More broadly, our results confirm that Pareto-optimal model hierarchies are useful in empirically establishing the partial equivalence between temporal memory and spatial granularity. This has practical applications for multiscale dynamical systems that are not self-similar, where ergodicity cannot be used to deterministically infer detailed spatial information from coarse time-series data.
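For readers who want a concrete picture of the temporal-memory mapping of Eq. (31), a minimal sketch of a 1D-convolutional memory model follows; the widths, kernel sizes, and Softplus output are illustrative assumptions rather than the exact configuration trained here:

# Minimal sketch: a 1D convolution over the card(t_Memory) = 3 time steps
# of the six coarse predictors, preserving temporal ordering.
import torch
import torch.nn as nn

class MemoryCNN(nn.Module):
    def __init__(self, n_features: int = 6, n_steps: int = 3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_features, 32, kernel_size=2), nn.ReLU(),  # mixes adjacent steps
            nn.Conv1d(32, 32, kernel_size=2), nn.ReLU(),          # -> single time slot
            nn.Flatten(),
        )
        self.head = nn.Sequential(nn.Linear(32, 32), nn.ReLU(),
                                  nn.Linear(32, 1), nn.Softplus())

    def forward(self, x):  # x: (batch, n_features, n_steps), ordered in time
        return self.head(self.conv(x))

model = MemoryCNN()
precip = model(torch.randn(8, 6, 3))  # batch of 8 samples -> shape (8, 1)

Unlike an MLP applied to flattened time steps, the convolution only ever combines temporally adjacent states, which is one way to encode the temporal connectivity highlighted by the Pareto front.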

5. Conclusions

In this study, we demonstrated that Pareto-optimal model hierarchies within a well-defined (complexity, error) plane not only guide the development of data-driven models—from simple equations to sophisticated neural networks—but also promote process understanding. While these hierarchies are derived empirically based on the models and data at hand, they serve as practical approximations for exploring trade-offs between error and complexity. To distill knowledge from these hierarchies, we propose a multistep approach: 1) use models along the Pareto front to hierarchically investigate the added value that leads to incremental error decreases from the simplest to the most complex Pareto-optimal model; 2) generate hypotheses and propose models tailored to the system of interest based on this added value; and 3) once the models are sparse enough to be interpretable, reveal the system’s underlying properties that previous theories or models may have overlooked. Beyond knowledge discovery, such hierarchies promote interpretability by explaining the added value of models that may initially seem overly complex and sustainability by optimizing models’ computational costs once their added value is justified.

We have chosen three weather and climate applications in realistic-geography settings to showcase the potential of machine learning in bridging fundamental scientific discovery and growing operational requirements. Each case demonstrates a different nature of discovery: data-driven equation discovery for cloud cover parameterization; physics-guided architectures, informed by the available Pareto-optimal solution, that reflect the bidirectionality and vertical invariance of shortwave radiative transfer; and spatial information hidden within time series of the coarse environment when diagnosing tropical precipitation. However, all three insights required retaining the full family of trained models and might have been overlooked had we focused on a single optimal model.

In all three cases, neural networks proved particularly advantageous because they can quickly explore large datasets and generate hypotheses about problems that are, or appear, high dimensional. We nonetheless emphasize that within this framework, deep learning is an integral part of the knowledge generation process, not its ultimate goal. Simpler models that rival the accuracy of deep neural networks, though initially intractable, may emerge once the necessary functional representations, features, and spatiotemporal connectivities are distilled. From this perspective, a rekindled interest in multiobjective optimization and hierarchical thinking would open the door to extracting new, human-interpretable scientific knowledge from ever-expanding geoscientific data, while paving the way for the adoption of machine learning in operational applications by fostering informed trust in its predictions.

Acknowledgments.

T. Beucler acknowledges partial funding from the Swiss State Secretariat for Education, Research and Innovation (SERI) for the Horizon Europe project AI4PEX (Grant agreement: 101137682). A. Grundner acknowledges funding by the European Research Council (ERC) Synergy Grant “Understanding and Modeling the Earth System with Machine Learning (USMILE)” under the Horizon 2020 research and innovation program (Grant agreement 855187). R. Lagerquist acknowledges the support from the NOAA Global Systems Laboratory, Cooperative Institute for Research in the Atmosphere, and NOAA Award NA19OAR4320073. S. Shamekh acknowledges the support provided by Schmidt Sciences, LLC, and NSF funding from the LEAP Science and Technology Center Grant (2019625-STC).

Data availability statement.

The reduced data and code necessary to reproduce this manuscript’s figures are available in the following GitHub repository: https://github.com/tbeucler/2024_Pareto_Distillation. The release for this manuscript is archived via Zenodo using the following DOI: https://zenodo.org/records/13217736. The cloud cover schemes and analysis code can be found at https://github.com/EyringMLClimateGroup/grundner23james_EquationDiscovery_CloudCover/tree/main and are preserved at https://doi.org/10.5281/zenodo.8220333. The coarse‐grained model output used to train and evaluate the data-driven models amounts to several TB and can be reconstructed with the scripts provided in the GitHub repository. This work used resources of the Deutsches Klimarechenzentrum (DKRZ) granted by its Scientific Steering Committee (WLA) under project ID bd1179. For radiative transfer, we use the ml4rt Python library (https://github.com/thunderhoser/ml4rt), with version 4.0.1 preserved at https://doi.org/10.5281/zenodo.13160776. The dataset for radiative transfer is stored at https://doi.org/10.5281/zenodo.13159877. The BRNN architecture used can be created with the file peter_brnn_architecture.py; all other architectures can be created with the scripts named pareto2024_make_*_templates.py. The precipitation example uses “Precip-org” (https://github.com/Sshamekh/Precip-org), a Python repository managed by Sara Shamekh. DYAMOND data management was provided by the DKRZ and supported through the projects ESiWACE and ESiWACE2. The full data are available on the DKRZ HPC system through the DYAMOND initiative (https://www.esiwace.eu/services/dyamond-initiative).

APPENDIX A

Calibrated Parameters for Cloud Cover Parameterization

In this appendix, we provide the values of the calibrated parameters in this manuscript’s equations.

For the Sundqvist scheme [Eq. (10)], RHsurf = 0.55, RHtop = 0.01, RHsat = 0.9/0.95 (over land/sea), and n = 2.12.

For the two-feature linear regression [Eq. (11)], α2F = 1.027, RH2F = 0.827, and ΔT2F = 198.6 K.

For the four-feature linear regression [Eq. (14)], α1 = 1.861, RH1 = 1.898, ΔT1 = 265.9 K, q1 = 0.3775 g kg−1, and H1 = 104.3 m.

For the quadratic model [Eq. (15)], α2 = 2.070, RH2,1 = 2.021, ΔT2,1 = 259.7 K, q2,1 = 0.4741 g kg−1, q2,2 = 3.161 g kg−1, q2,3 = 0.1034 g kg−1, H2 = 665.6 m, RH2,2 = 2.428, and ΔT2,2 = 206.0 K.

For the Wang scheme, αWang = 0.9105 and βWang = 914 × 10³.

For the PySR equation, $\overline{C} = 0.4435$, $(\partial C / \partial \mathrm{RH})_{\overline{\mathrm{RH}}, \overline{T}} = 1.159$, $(\partial C / \partial T)_{\overline{\mathrm{RH}}, \overline{T}} = 0.0145\ \mathrm{K}^{-1}$, $(\partial^{2} C / \partial \mathrm{RH}^{2})_{\overline{\mathrm{RH}}, \overline{T}} = 4.06$, $(\partial^{3} C / \partial \mathrm{RH}\, \partial T^{2})_{\overline{\mathrm{RH}}, \overline{T}} = 1.32 \times 10^{-3}\ \mathrm{K}^{-2}$, $H_{\mathrm{PySR}} = 585\ \mathrm{m}$, $(\partial \mathrm{RH} / \partial z)_{\mathrm{max}\, C} = 2\ \mathrm{km}^{-1}$, $\epsilon_{\mathrm{PySR}} = 1.06$, $\lambda_{l} = 3.845 \times 10^{5}$, and $\lambda_{i} = 1.448 \times 10^{6}$.

For the GP-GOMEA equation, $\alpha_{G} = 66$, $\beta_{G} = 1.36 \times 10^{-4}$, $\mathrm{RH}_{G} = 11.5\%$, $\gamma_{G} = 0.194$, $q_{G,i} = 4.367\ \mathrm{mg\ kg^{-1}}$, $\delta_{G} = 0.774$, $q_{G,l}^{+} = 88.05\ \mathrm{mg\ kg^{-1}}$, $q_{G,l}^{-} = 5.61\ \mathrm{mg\ kg^{-1}}$, and $\epsilon_{G} \to 0$ (zero cloud cover is assigned in the absence of condensates, the model having been calibrated only with condensates present).

APPENDIX B

Features for Shortwave Radiative Transfer Emulation

The 22 features Xη with a vertical profile are temperature T (K), pressure p (Pa), specific humidity q (kg kg−1), RH (%), liquid water content (LWC) (kg m−3), ice water content (IWC) (kg m−3), downward and upward liquid water paths LWP↓ and LWP↑ (kg m−2), downward and upward ice water paths IWP↓ and IWP↑ (kg m−2), downward and upward water vapor paths WVP↓ and WVP↑ (kg m−2), ozone (O3) mixing ratio (kg kg−1), height above ground level z (m), height layer thickness Δz (m), pressure layer thickness Δp (Pa), liquid effective radius rl (m), ice effective radius ri (m), nitrous oxide (N2O) concentration (ppmv), methane (CH4) concentration (ppmv), carbon dioxide (CO2) concentration (ppmv), and aerosol extinction (m−1).
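As an illustration of how the path features relate to the content features, the sketch below computes downward and upward liquid water paths from an LWC profile. The top-of-atmosphere-first level ordering and the simple rectangle-rule integral are assumptions of this illustration, not a description of the original preprocessing:

# Hedged sketch: liquid water paths (kg m-2) from LWC (kg m-3) and layer
# thickness dz (m), with levels assumed ordered from the top of the
# atmosphere downward.
import numpy as np

def water_paths(lwc: np.ndarray, dz: np.ndarray):
    """Downward/upward liquid water paths at each level."""
    layer_lwp = lwc * dz                       # per-layer water path
    lwp_down = np.cumsum(layer_lwp)            # mass above each level
    lwp_up = np.cumsum(layer_lwp[::-1])[::-1]  # mass below each level
    return lwp_down, lwp_up

The ice water and water vapor paths would follow the same pattern with IWC and vapor density in place of LWC.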

REFERENCES

  • Arias, P., and Coauthors, 2021: Climate Change 2021: The Physical Science Basis. Cambridge University Press, 2391 pp.

  • Balaji, V., 2021: Climbing down Charney’s ladder: Machine learning and the post-Dennard era of computational climate science. Philos. Trans. Roy. Soc., A379, 20200085, https://doi.org/10.1098/rsta.2020.0085.

  • Balaji, V., and Coauthors, 2017: CPMIP: Measurements of real computational performance of Earth system models in CMIP6. Geosci. Model Dev., 10, 19–34, https://doi.org/10.5194/gmd-10-19-2017.

  • Balaji, V., F. Couvreux, J. Deshayes, J. Gautrais, F. Hourdin, and C. Rio, 2022: Are general circulation models obsolete? Proc. Natl. Acad. Sci. USA, 119, e2202075119, https://doi.org/10.1073/pnas.2202075119.

  • Bao, J., B. Stevens, L. Kluft, and C. Muller, 2024: Intensification of daily tropical precipitation extremes from more organized convection. Sci. Adv., 10, eadj6801, https://doi.org/10.1126/sciadv.adj6801.

  • Bartlett, P. L., O. Bousquet, and S. Mendelson, 2005: Local Rademacher complexities. arXiv, math/0508275v1, https://doi.org/10.48550/arXiv.math/0508275.

  • Belochitski, A., and V. Krasnopolsky, 2021: Robustness of neural network emulations of radiative transfer parameterizations in a state-of-the-art general circulation model. Geosci. Model Dev., 14, 7425–7437, https://doi.org/10.5194/gmd-14-7425-2021.

  • Ben-Bouallegue, Z., and Coauthors, 2023: The rise of data-driven weather forecasting. arXiv, 2307.10128v2, https://doi.org/10.48550/arXiv.2307.10128.

  • Bertoli, G., F. Ozdemir, S. Schemm, and F. Perez-Cruz, 2023: Revisiting machine learning approaches for short- and longwave radiation inference in weather and climate models, part I: Offline performance. ESS Open Archive, https://doi.org/10.22541/essoar.169109567.78839949/v1.

  • Bi, K., L. Xie, H. Zhang, X. Chen, X. Gu, and Q. Tian, 2022: Pangu-Weather: A 3D high-resolution model for fast and accurate global weather forecast. arXiv, 2211.02556v1, https://doi.org/10.48550/arXiv.2211.02556.

  • Bony, S., and Coauthors, 2013: Carbon dioxide and climate: Perspectives on a scientific assessment. Climate Science for Serving Society: Research, Modeling and Prediction Priorities, G. Asrar and J. Hurrell, Eds., Springer, 391–413.

  • Bony, S., and Coauthors, 2015: Clouds, circulation and climate sensitivity. Nat. Geosci., 8, 261–268, https://doi.org/10.1038/ngeo2398.

  • Bröcker, J., 2009: Reliability, sufficiency, and the decomposition of proper scores. Quart. J. Roy. Meteor. Soc., 135, 1512–1519, https://doi.org/10.1002/qj.456.

  • Buhrmester, V., D. Münch, and M. Arens, 2021: Analysis of explainers of black box deep neural networks for computer vision: A survey. Mach. Learn. Knowl. Extr., 3, 966–989, https://doi.org/10.3390/make3040048.

  • Censor, Y., 1977: Pareto optimality in multiobjective problems. Appl. Math. Optim., 4, 41–59, https://doi.org/10.1007/BF01442131.

  • Charney, J. G., and Coauthors, 1979: Carbon dioxide and climate: A scientific assessment. National Academy of Sciences Rep., 18 pp., https://geosci.uchicago.edu/∼archer/warming_papers/charney.1979.report.pdf.

  • Cheruy, F., F. Chevallier, J.-J. Morcrette, N. A. Scott, and A. Chédin, 1996: Une méthode utilisant les techniques neuronales pour le calcul rapide de la distribution verticale du bilan radiatif thermique terrestre. C. R. Acad. Sci. II, 322, 665–672.

  • Chevallier, F., F. Chéruy, N. A. Scott, and A. Chédin, 1998: A neural network approach for a fast and accurate computation of a longwave radiative budget. J. Appl. Meteor., 37, 1385–1397, https://doi.org/10.1175/1520-0450(1998)037<1385:ANNAFA>2.0.CO;2.

  • Chevallier, F., J.-J. Morcrette, F. Chéruy, and N. Scott, 2000: Use of a neural-network-based long-wave radiative-transfer scheme in the ECMWF atmospheric model. Quart. J. Roy. Meteor. Soc., 126, 761–776, https://doi.org/10.1002/qj.49712656318.

  • Clare, M. C., O. Jamil, and C. J. Morcrette, 2021: Combining distribution-based neural networks to predict weather forecast probabilities. Quart. J. Roy. Meteor. Soc., 147, 4337–4357, https://doi.org/10.1002/qj.4180.

  • Clough, S. A., M. J. Iacono, and J.-L. Moncet, 1992: Line-by-line calculations of atmospheric fluxes and cooling rates: Application to water vapor. J. Geophys. Res., 97, 15 761–15 785, https://doi.org/10.1029/92JD01419.

  • Colin, M., S. Sherwood, O. Geoffroy, S. Bony, and D. Fuchs, 2019: Identifying the sources of convective memory in cloud-resolving simulations. J. Atmos. Sci., 76, 947–962, https://doi.org/10.1175/JAS-D-18-0036.1.

  • Cranmer, M., 2023: Interpretable machine learning for science with PySR and SymbolicRegression.jl. arXiv, 2305.01582v3, https://doi.org/10.48550/arXiv.2305.01582.

  • Das, A., and P. Rad, 2020: Opportunities and challenges in Explainable Artificial Intelligence (XAI): A survey. arXiv, 2006.11371v2, https://doi.org/10.48550/arXiv.2006.11371.

  • Duras, J., F. Ziemen, and D. Klocke, 2021: The DYAMOND winter data collection. EGU General Assembly 2021, Online, European Geosciences Union, Abstracts EGU21-4687.

  • Fisher, A., C. Rudin, and F. Dominici, 2019: All models are wrong, but many are useful: Learning a variable’s importance by studying an entire class of prediction models simultaneously. J. Mach. Learn. Res., 20, 1–81.

  • Gentine, P., V. Eyring, and T. Beucler, 2021: Deep learning for the parametrization of subgrid processes in climate models. Deep Learning for the Earth Sciences: A Comprehensive Approach to Remote Sensing, Climate Science, and Geosciences, G. Camps-Valls et al., Eds., Wiley, 307–314.

  • Giorgetta, M. A., and Coauthors, 2018: ICON-A, the atmosphere component of the ICON Earth system model: I. Model description. J. Adv. Model. Earth Syst., 10, 1613–1637, https://doi.org/10.1029/2017MS001242.

  • Gneiting, T., and A. E. Raftery, 2007: Strictly proper scoring rules, prediction, and estimation. J. Amer. Stat. Assoc., 102, 359–378, https://doi.org/10.1198/016214506000001437.

  • Grundner, A., T. Beucler, P. Gentine, F. Iglesias-Suarez, M. A. Giorgetta, and V. Eyring, 2022: Deep learning based cloud cover parameterization for ICON. J. Adv. Model. Earth Syst., 14, e2021MS002959, https://doi.org/10.1029/2021MS002959.

  • Grundner, A., T. Beucler, P. Gentine, and V. Eyring, 2024: Data-driven equation discovery of a cloud cover parameterization. J. Adv. Model. Earth Syst., 16, e2023MS003763, https://doi.org/10.1029/2023MS003763.

  • Haynes, K., R. Lagerquist, M. McGraw, K. Musgrave, and I. Ebert-Uphoff, 2023: Creating and evaluating uncertainty estimates with neural networks for environmental-science applications. Artif. Intell. Earth Syst., 2, e220061, https://doi.org/10.1175/AIES-D-22-0061.1.

  • Held, I. M., 2005: The gap between simulation and understanding in climate modeling. Bull. Amer. Meteor. Soc., 86, 1609–1614, https://doi.org/10.1175/BAMS-86-11-1609.

  • Hertel, L., J. Collado, P. Sadowski, J. Ott, and P. Baldi, 2020: Sherpa: Robust hyperparameter optimization for machine learning. SoftwareX, 12, 100591, https://doi.org/10.1016/j.softx.2020.100591.

  • Hogan, R. J., and A. Bozzo, 2018: A flexible and efficient radiation scheme for the ECMWF model. J. Adv. Model. Earth Syst., 10, 1990–2008, https://doi.org/10.1029/2018MS001364.

  • Jakhar, K., Y. Guan, R. Mojgani, A. Chattopadhyay, and P. Hassanzadeh, 2024: Learning closed-form equations for subgrid-scale closures from high-fidelity data: Promises and challenges. J. Adv. Model. Earth Syst., 16, e2023MS003874, https://doi.org/10.1029/2023MS003874.

  • Jeevanjee, N., P. Hassanzadeh, S. Hill, and A. Sheshadri, 2017: A perspective on climate model hierarchies. J. Adv. Model. Earth Syst., 9, 1760–1771, https://doi.org/10.1002/2017MS001038.

  • Keisler, R., 2022: Forecasting global weather with graph neural networks. arXiv, 2202.07575v1, https://doi.org/10.48550/arXiv.2202.07575.

  • Khairoutdinov, M. F., P. N. Blossey, and C. S. Bretherton, 2022: Global system for atmospheric modeling: Model description and preliminary results. J. Adv. Model. Earth Syst., 14, e2021MS002968, https://doi.org/10.1029/2021MS002968.

  • Kim, P. S., and H.-J. Song, 2022: Usefulness of automatic hyperparameter optimization in developing radiation emulator in a numerical weather prediction model. Atmosphere, 13, 721, https://doi.org/10.3390/atmos13050721.

  • Kingma, D. P., and J. Ba, 2014: Adam: A method for stochastic optimization. arXiv, 1412.6980v9, https://doi.org/10.48550/arXiv.1412.6980.

  • Koza, J. R., 1994: Genetic programming as a means for programming computers by natural selection. Stat. Comput., 4, 87–112, https://doi.org/10.1007/BF00175355.

  • La Cava, W., B. Burlacu, M. Virgolin, M. Kommenda, P. Orzechowski, F. O. de França, Y. Jin, and J. H. Moore, 2021: Contemporary symbolic regression methods and their relative performance. 35th Conf. on Neural Information Processing Systems (NeurIPS 2021) Track on Datasets and Benchmarks, Online, Curran Associates Inc., 1–16, https://cavalab.org/assets/papers/La%20Cava%20et%20al.%20-%202021%20-%20Contemporary%20Symbolic%20Regression%20Methods%20and%20their.pdf.

  • Lagerquist, R., D. Turner, I. Ebert-Uphoff, J. Stewart, and V. Hagerty, 2021: Using deep learning to emulate and accelerate a radiative transfer model. J. Atmos. Oceanic Technol., 38, 1673–1696, https://doi.org/10.1175/JTECH-D-21-0007.1.

  • Lagerquist, R., D. D. Turner, I. Ebert-Uphoff, and J. Q. Stewart, 2023: Estimating full longwave and shortwave radiative transfer with neural networks of varying complexity. J. Atmos. Oceanic Technol., 40, 1407–1432, https://doi.org/10.1175/JTECH-D-23-0012.1.

  • Lam, R., and Coauthors, 2022: GraphCast: Learning skillful medium-range global weather forecasting. arXiv, 2212.12794v2, https://doi.org/10.48550/arXiv.2212.12794.

  • Lang, S., and Coauthors, 2024: AIFS—ECMWF’s data-driven forecasting system. arXiv, 2406.01465v2, https://doi.org/10.48550/arXiv.2406.01465.

  • Lin, X., H.-L. Zhen, Z. Li, Q.-F. Zhang, and S. Kwong, 2019: Pareto multi-task learning. 33rd Conf. on Neural Information Processing Systems (NeurIPS 2019), Vancouver, British Columbia, Canada, Curran Associates Inc., 12060–12070, https://proceedings.neurips.cc/paper_files/paper/2019/file/685bfde03eb646c27ed565881917c71c-Paper.pdf.

  • Liou, K.-N., 2002: An Introduction to Atmospheric Radiation. Vol. 84. Elsevier, 583 pp.

  • Lucarini, V., and M. D. Chekroun, 2023: Theoretical tools for understanding the climate crisis from Hasselmann’s programme and beyond. Nat. Rev. Phys., 5, 744–765, https://doi.org/10.1038/s42254-023-00650-8.

  • Maas, A. L., A. Y. Hannun, and A. Y. Ng, 2013: Rectifier nonlinearities improve neural network acoustic models. ICML 2013 Workshop on Deep Learning for Audio, Speech and Language Processing, Atlanta, GA, International Machine Learning Society, https://robotics.stanford.edu/amaas/papers/relu hybrid icml2013final.pdf.

  • Maher, P., and Coauthors, 2019: Model hierarchies for understanding atmospheric circulation. Rev. Geophys., 57, 250–280, https://doi.org/10.1029/2018RG000607.

  • Mamalakis, A., E. A. Barnes, and I. Ebert-Uphoff, 2022: Investigating the fidelity of explainable artificial intelligence methods for applications of convolutional neural networks in geoscience. Artif. Intell. Earth Syst., 1, e220012, https://doi.org/10.1175/AIES-D-22-0012.1.

  • Mamalakis, A., E. A. Barnes, and I. Ebert-Uphoff, 2023: Carefully choose the baseline: Lessons learned from applying XAI attribution methods for regression tasks in geoscience. Artif. Intell. Earth Syst., 2, e220058, https://doi.org/10.1175/AIES-D-22-0058.1.

  • Mansfield, L. A., A. Gupta, A. C. Burnett, B. Green, C. Wilka, and A. Sheshadri, 2023: Updates on model hierarchies for understanding and simulating the climate system: A focus on data-informed methods and climate change impacts. J. Adv. Model. Earth Syst., 15, e2023MS003715, https://doi.org/10.1029/2023MS003715.

  • Mauritsen, T., and Coauthors, 2019: Developments in the MPI-M Earth System Model version 1.2 (MPI-ESM1.2) and its response to increasing CO2. J. Adv. Model. Earth Syst., 11, 998–1038, https://doi.org/10.1029/2018MS001400.

  • Miettinen, K., 1999: Nonlinear Multiobjective Optimization. Vol. 12. Springer, 298 pp.

  • Mlawer, E. J., S. J. Taubman, P. D. Brown, M. J. Iacono, and S. A. Clough, 1997: Radiative transfer for inhomogeneous atmospheres: RRTM, a validated correlated-k model for the longwave. J. Geophys. Res., 102, 16 663–16 682, https://doi.org/10.1029/97JD00237.

  • Molnar, C., 2020: Interpretable Machine Learning. Lulu.com, 320 pp.

  • Morcrette, J.-J., 2000: On the effects of the temporal and spatial sampling of radiation fields on the ECMWF forecasts and analyses. Mon. Wea. Rev., 128, 876–887, https://doi.org/10.1175/1520-0493(2000)128<0876:OTEOTT>2.0.CO;2.

  • Morrison, H., and Coauthors, 2020: Confronting the challenge of modeling cloud and precipitation microphysics. J. Adv. Model. Earth Syst., 12, e2019MS001689, https://doi.org/10.1029/2019MS001689.

  • Moseley, C., C. Hohenegger, P. Berg, and J. O. Haerter, 2016: Intensification of convective extremes driven by cloud–cloud interaction. Nat. Geosci., 9, 748–752, https://doi.org/10.1038/ngeo2789.

  • Nowack, P., J. Runge, V. Eyring, and J. D. Haigh, 2020: Causal networks for climate model evaluation and constrained projections. Nat. Commun., 11, 1415, https://doi.org/10.1038/s41467-020-15195-y.

  • Pathak, J., and Coauthors, 2022: FourCastNet: A global data-driven high-resolution weather model using adaptive Fourier neural operators. arXiv, 2202.11214v1, https://doi.org/10.48550/arXiv.2202.11214.

  • Pearl, J., and Coauthors, 2000: Causality: Models, Reasoning and Inference. Vol. 19. Cambridge University Press, 384 pp.

  • Petersen, B. K., M. Landajuela, T. N. Mundhenk, C. P. Santiago, S. K. Kim, and J. T. Kim, 2019: Deep symbolic regression: Recovering mathematical expressions from data via risk-seeking policy gradients. arXiv, 1912.04871v4, https://doi.org/10.48550/arXiv.1912.04871.

  • Rasp, S., and N. Thuerey, 2021: Data-driven medium-range weather prediction with a ResNet pretrained on climate simulations: A new model for WeatherBench. J. Adv. Model. Earth Syst., 13, e2020MS002405, https://doi.org/10.1029/2020MS002405.

  • Rasp, S., P. D. Dueben, S. Scher, J. A. Weyn, S. Mouatadid, and N. Thuerey, 2020: WeatherBench: A benchmark data set for data-driven weather forecasting. J. Adv. Model. Earth Syst., 12, e2020MS002203, https://doi.org/10.1029/2020MS002203.

  • Robertson, A. W., and M. Ghil, 2000: Solving problems with GCMs: General circulation models and their role in the climate modeling hierarchy. International Geophysics, Vol. 70, Academic Press, 40 pp.

  • Roeckner, E., and Coauthors, 1996: The atmospheric general circulation model ECHAM-4: Model description and simulation of present-day climate. Max-Planck-Institut für Meteorologie Rep. 218, 94 pp., https://pure.mpg.de/rest/items/item_1781494/component/file_1786328/content.

  • Ronneberger, O., P. Fischer, and T. Brox, 2015: U-Net: Convolutional networks for biomedical image segmentation. Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015, N. Navab et al., Eds., Springer, 234–241.

  • Rudin, C., 2019: Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell., 1, 206–215, https://doi.org/10.1038/s42256-019-0048-x.

  • Scher, S., and G. Messori, 2021: Ensemble methods for neural network-based weather forecasts. J. Adv. Model. Earth Syst., 13, e2020MS002331, https://doi.org/10.1029/2020MS002331.

  • Schmidt, M., and H. Lipson, 2009: Distilling free-form natural laws from experimental data. Science, 324, 81–85, https://doi.org/10.1126/science.1165893.

  • Schneider, T., J. Teixeira, C. S. Bretherton, F. Brient, K. G. Pressel, C. Schär, and A. P. Siebesma, 2017: Climate goals and computing the future of clouds. Nat. Climate Change, 7, 3–5, https://doi.org/10.1038/nclimate3190.

  • Schneider, T., L. R. Leung, and R. C. J. Wills, 2024: Opinion: Optimizing climate models with process knowledge, resolution, and artificial intelligence. Atmos. Chem. Phys., 24, 7041–7062, https://doi.org/10.5194/acp-24-7041-2024.

  • Schneiderman, H., 2024: Incorporation of physical equations into a neural network structure for predicting shortwave radiative heat transfer. 23rd Conf. on Artificial Intelligence for Environmental Science, Baltimore, MD, Amer. Meteor. Soc., J5A.5, https://ams.confex.com/ams/104ANNUAL/webprogram/Paper431981.html.

  • Seneviratne, S. I., and Coauthors, 2021: Weather and climate extreme events in a changing climate. Climate Change 2021: The Physical Science Basis, V. Masson-Delmotte et al., Eds., Cambridge University Press, 1513–1766, https://doi.org/10.1017/9781009157896.013.

  • Shamekh, S., K. D. Lamb, Y. Huang, and P. Gentine, 2023: Implicit learning of convective organization explains precipitation stochasticity. Proc. Natl. Acad. Sci. USA, 120, e2216158120, https://doi.org/10.1073/pnas.2216158120.

  • Sherwood, S. C., S. Bony, and J.-L. Dufresne, 2014: Spread in model climate sensitivity traced to atmospheric convective mixing. Nature, 505, 37–42, https://doi.org/10.1038/nature12829.

  • Stevens, B., and Coauthors, 2019: DYAMOND: The DYnamics of the Atmospheric general circulation Modeled on Non-hydrostatic Domains. Prog. Earth Planet. Sci., 6, 61, https://doi.org/10.1186/s40645-019-0304-z.

  • Stevens, B., and Coauthors, 2020: The added value of large-eddy and storm-resolving models for simulating clouds and precipitation. J. Meteor. Soc. Japan, 98, 395–435, https://doi.org/10.2151/jmsj.2020-021.

  • Sundqvist, H., E. Berge, and J. E. Kristjánsson, 1989: Condensation and cloud parameterization studies with a mesoscale numerical weather prediction model. Mon. Wea. Rev., 117, 1641–1657, https://doi.org/10.1175/1520-0493(1989)117%3C1641:CACPSW%3E2.0.CO;2.

  • Ukkonen, P., 2022: Exploring pathways to more accurate machine learning emulation of atmospheric radiative transfer. J. Adv. Model. Earth Syst., 14, e2021MS002875, https://doi.org/10.1029/2021MS002875.

  • Ukkonen, P., and M. Chantry, 2024: Representing sub-grid processes in weather and climate models via sequence learning. ESS Open Archive, https://doi.org/10.22541/essoar.172098075.51621106/v1.

  • Vapnik, V. N., and A. Y. Chervonenkis, 2015: On the uniform convergence of relative frequencies of events to their probabilities. Measures of Complexity: Festschrift for Alexey Chervonenkis, V. Vovk, H. Papadopoulos, and A. Gammerman, Eds., Springer, 11–30.

  • Veerman, M. A., R. Pincus, R. Stoffer, C. M. Van Leeuwen, D. Podareanu, and C. C. Van Heerwaarden, 2021: Predicting atmospheric optical properties for radiative transfer computations using neural networks. Philos. Trans. Roy. Soc., A379, 20200095, https://doi.org/10.1098/rsta.2020.0095.

  • Virgolin, M., T. Alderliesten, C. Witteveen, and P. A. N. Bosman, 2021: Improving model-based genetic programming for symbolic regression of small expressions. Evol. Comput., 29, 211–237, https://doi.org/10.1162/evco_a_00278.

  • Wang, Y., S. Yang, G. Chen, Q. Bao, and J. Li, 2023: Evaluating two diagnostic schemes of cloud-fraction parameterization using the CloudSat data. Atmos. Res., 282, 106510, https://doi.org/10.1016/j.atmosres.2022.106510.

  • Weyn, J. A., D. R. Durran, and R. Caruana, 2019: Can machines learn to predict weather? Using deep learning to predict gridded 500-hPa geopotential height from historical weather data. J. Adv. Model. Earth Syst., 11, 2680–2693, https://doi.org/10.1029/2019MS001705.

  • Weyn, J. A., D. R. Durran, and R. Caruana, 2020: Improving data-driven global weather prediction using deep convolutional neural networks on a cubed sphere. J. Adv. Model. Earth Syst., 12, e2020MS002109, https://doi.org/10.1029/2020MS002109.

  • Xu, K.-M., and D. A. Randall, 1996: A semiempirical cloudiness parameterization for use in climate models. J. Atmos. Sci., 53, 3084–3102, https://doi.org/10.1175/1520-0469(1996)053<3084:ASCPFU>2.0.CO;2.

  • Zhou, Z., M. M. R. Siddiquee, N. Tajbakhsh, and J. Liang, 2020: UNet++: Redesigning skip connections to exploit multiscale features in image segmentation. IEEE Trans. Med. Imaging, 39, 1856–1867, https://doi.org/10.1109/TMI.2019.2959609.
Save
  • Arias, P., and Coauthors, 2021: Climate Change 2021: The Physical Science Basis. Cambridge University Press, 2391 pp.

  • Balaji, V., 2021: Climbing down Charney’s ladder: Machine learning and the post-Dennard era of computational climate science. Philos. Trans. Roy. Soc., A379, 20200085, https://doi.org/10.1098/rsta.2020.0085.

    • Search Google Scholar
    • Export Citation
  • Balaji, V., and Coauthors, 2017: CPMIP: Measurements of real computational performance of Earth system models in CMIP6. Geosci. Model Dev., 10, 1934, https://doi.org/10.5194/gmd-10-19-2017.

    • Search Google Scholar
    • Export Citation
  • Balaji, V., F. Couvreux, J. Deshayes, J. Gautrais, F. Hourdin, and C. Rio, 2022: Are general circulation models obsolete? Proc. Natl. Acad. Sci. USA, 119, e2202075119, https://doi.org/10.1073/pnas.2202075119.

    • Search Google Scholar
    • Export Citation
  • Bao, J., B. Stevens, L. Kluft, and C. Muller, 2024: Intensification of daily tropical precipitation extremes from more organized convection. Sci. Adv., 10, eadj6801, https://doi.org/10.1126/sciadv.adj6801.

    • Search Google Scholar
    • Export Citation
  • Bartlett, P. L., O. Bousquet, and S. Mendelson, 2005: Local rademacher complexities. arXiv, math/0508275v1, https://doi.org/10.48550/arXiv.math/0508275.

    • Search Google Scholar
    • Export Citation
  • Belochitski, A., and V. Krasnopolsky, 2021: Robustness of neural network emulations of radiative transfer parameterizations in a state-of-the-art general circulation model. Geosci. Model Dev., 14, 74257437, https://doi.org/10.5194/gmd-14-7425-2021.

    • Search Google Scholar
    • Export Citation
  • Ben-Bouallegue, Z., and Coauthors, 2023: The rise of data-driven weather forecasting. arXiv, 2307.10128v2, https://doi.org/10.48550/arXiv.2307.10128.

    • Search Google Scholar
    • Export Citation
  • Bertoli, G., F. Ozdemir, S. Schemm, and F. Perez-Cruz, 2023: Revisiting machine learning approaches for short- and longwave radiation inference in weather and climate models, part I: Offline performance. ESS Open Archive, https://doi.org/10.22541/essoar.169109567.78839949/v1

    • Search Google Scholar
    • Export Citation
  • Bi, K., L. Xie, H. Zhang, X. Chen, X. Gu, and Q. Tian, 2022: Pangu-weather: A 3d high-resolution model for fast and accurate global weather forecast. arXiv, 2211.02556v1, https://doi.org/10.48550/arXiv.2211.02556.

    • Search Google Scholar
    • Export Citation
  • Bony, S., and Coauthors, 2013: Carbon dioxide and climate: Perspectives on a scientific assessment. Climate Science for Serving Society: Research, Modeling and Prediction Priorities, G. Asrar and J. Hurrell, Eds., Springer, 391413.

    • Search Google Scholar
    • Export Citation
  • Bony, S., and Coauthors, 2015: Clouds, circulation and climate sensitivity. Nat. Geosci., 8, 261268, https://doi.org/10.1038/ngeo2398.

    • Search Google Scholar
    • Export Citation
  • Bröcker, J., 2009: Reliability, sufficiency, and the decomposition of proper scores. Quart. J. Roy. Meteor. Soc., 135, 15121519, https://doi.org/10.1002/qj.456.

    • Search Google Scholar
    • Export Citation
  • Buhrmester, V., D. Münch, and M. Arens, 2021: Analysis of explainers of black box deep neural networks for computer vision: A survey. Mach. Learn. Knowl. Extr., 3, 966989, https://doi.org/10.3390/make3040048.

    • Search Google Scholar
    • Export Citation
  • Censor, Y., 1977: Pareto optimality in multiobjective problems. Appl. Math. Optim., 4, 4159, https://doi.org/10.1007/BF01442131.

  • Charney, J. G., and Coauthors, 1979: Carbon dioxide and climate: A scientific assessment. National Academy of Sciences Rep., 18 pp., https://geosci.uchicago.edu/∼archer/warming_papers/charney.1979.report.pdf.

    • Search Google Scholar
    • Export Citation
  • Cheruy, F., F. Chevallier, J.-J. Morcrette, N. A. Scott, and A. Chédin, 1996: Une méthode utilisant les techniques neuronales pour le calcul rapide de la distribution verticale du bilan radiatif thermique terrestre. C. R. Acad. Sci. II, 322, 665672.

    • Search Google Scholar
    • Export Citation
  • Chevallier, F., F. Chéruy, N. A. Scott, and A. Chédin, 1998: A neural network approach for a fast and accurate computation of a longwave radiative budget. J. Appl. Meteor., 37, 13851397, https://doi.org/10.1175/1520-0450(1998)037<1385:ANNAFA>2.0.CO;2.

    • Search Google Scholar
    • Export Citation
  • Chevallier, F., J.-J. Morcrette, F. Chéruy, and N. Scott, 2000: Use of a neural‐network‐based long‐wave radiative‐transfer scheme in the ECMWF atmospheric model. Quart. J. Roy. Meteor. Soc., 126, 761776, https://doi.org/10.1002/qj.49712656318.

    • Search Google Scholar
    • Export Citation
  • Clare, M. C., O. Jamil, and C. J. Morcrette, 2021: Combining distribution‐based neural networks to predict weather forecast probabilities. Quart. J. Roy. Meteor. Soc., 147, 43374357, https://doi.org/10.1002/qj.4180.

    • Search Google Scholar
    • Export Citation
  • Clough, S. A., M. J. Iacono, and J.-L. Moncet, 1992: Line‐by‐line calculations of atmospheric fluxes and cooling rates: Application to water vapor. J. Geophys. Res., 97, 15 76115 785, https://doi.org/10.1029/92JD01419.

    • Search Google Scholar
    • Export Citation
  • Colin, M., S. Sherwood, O. Geoffroy, S. Bony, and D. Fuchs, 2019: Identifying the sources of convective memory in cloud-resolving simulations. J. Atmos. Sci., 76, 947962, https://doi.org/10.1175/JAS-D-18-0036.1.

    • Search Google Scholar
    • Export Citation
  • Cranmer, M., 2023: Interpretable machine learning for science with PySR and symbolicregression.jl. arXiv, 2305.01582v3, https://doi.org/10.48550/arXiv.2305.01582.

    • Search Google Scholar
    • Export Citation
  • Das, A., and P. Rad, 2020: Opportunities and challenges in Explainable Artificial Intelligence (XAI): A survey. arXiv, 2006.11371v2, https://doi.org/10.48550/arXiv.2006.11371.

    • Search Google Scholar
    • Export Citation
  • Duras, J., F. Ziemen, and D. Klocke, 2021: The Dyamond winter data collection. EGU General Assembly 2021, Online, European Geosciences Union, Abstracts EGU21-4687.

    • Search Google Scholar
    • Export Citation
  • Fisher, A., C. Rudin, and F. Dominici, 2019: All models are wrong, but many are useful: Learning a variable’s importance by studying an entire class of prediction models simultaneously. J. Mach. Learn. Res., 20, 181.

    • Search Google Scholar
    • Export Citation
  • Gentine, P., V. Eyring, and T. Beucler, 2021: Deep learning for the parametrization of subgrid processes in climate models. Deep Learning for the Earth Sciences: A Comprehensive Approach to Remote Sensing, Climate Science, and Geosciences, G. Camps-Valls et al., Eds., Wiley, 307314.

    • Search Google Scholar
    • Export Citation
  • Giorgetta, M. A., and Coauthors, 2018: ICON‐A, the atmosphere component of the ICON Earth system model: I. Model description. J. Adv. Model. Earth Syst., 10, 16131637, https://doi.org/10.1029/2017MS001242.

    • Search Google Scholar
    • Export Citation
  • Gneiting, T., and A. E. Raftery, 2007: Strictly proper scoring rules, prediction, and estimation. J. Amer. Stat. Assoc., 102, 359378, https://doi.org/10.1198/016214506000001437.

    • Search Google Scholar
    • Export Citation
  • Grundner, A., T. Beucler, P. Gentine, F. Iglesias-Suarez, M. A. Giorgetta, and V. Eyring, 2022: Deep learning based cloud cover parameterization for ICON. J. Adv. Model. Earth Syst., 14, e2021MS002959, https://doi.org/10.1029/2021MS002959.

    • Search Google Scholar
    • Export Citation
  • Grundner, A., T. Beucler, P. Gentine, and V. Eyring, 2024: Data‐driven equation discovery of a cloud cover parameterization. J. Adv. Model. Earth Syst., 16, e2023MS003763, https://doi.org/10.1029/2023MS003763.

    • Search Google Scholar
    • Export Citation
  • Haynes, K., R. Lagerquist, M. McGraw, K. Musgrave, and I. Ebert-Uphoff, 2023: Creating and evaluating uncertainty estimates with neural networks for environmental-science applications. Artif. Intell. Earth Syst., 2, 220061, https://doi.org/10.1175/AIES-D-22-0061.1.

    • Search Google Scholar
    • Export Citation
  • Held, I. M., 2005: The gap between simulation and understanding in climate modeling. Bull. Amer. Meteor. Soc., 86, 16091614, https://doi.org/10.1175/BAMS-86-11-1609.

    • Search Google Scholar
    • Export Citation
  • Hertel, L., J. Collado, P. Sadowski, J. Ott, and P. Baldi, 2020: Sherpa: Robust hyperparameter optimization for machine learning. SoftwareX, 12, 100591, https://doi.org/10.1016/j.softx.2020.100591.

    • Search Google Scholar
    • Export Citation
  • Hogan, R. J., and A. Bozzo, 2018: A flexible and efficient radiation scheme for the ECMWF model. J. Adv. Model. Earth Syst., 10, 19902008, https://doi.org/10.1029/2018MS001364.

    • Search Google Scholar
    • Export Citation
  • Jakhar, K., Y. Guan, R. Mojgani, A. Chattopadhyay, and P. Hassanzadeh, 2024: Learning closed-form equations for subgrid-scale closures from high-fidelity data: Promises and challenges. J. Adv. Model. Earth Syst., 16, e2023MS003874, https://doi.org/10.1029/2023MS003874.

    • Search Google Scholar
    • Export Citation
  • Jeevanjee, N., P. Hassanzadeh, S. Hill, and A. Sheshadri, 2017: A perspective on climate model hierarchies. J. Adv. Model. Earth Syst., 9, 17601771, https://doi.org/10.1002/2017MS001038.

    • Search Google Scholar
    • Export Citation
  • Keisler, R., 2022: Forecasting global weather with graph neural networks. arXiv, 2202.07575v1, https://doi.org/10.48550/arXiv.2202.07575.

    • Search Google Scholar
    • Export Citation
  • Khairoutdinov, M. F., P. N. Blossey, and C. S. Bretherton, 2022: Global system for atmospheric modeling: Model description and preliminary results. J. Adv. Model. Earth Syst., 14, e2021MS002968, https://doi.org/10.1029/2021MS002968.

    • Search Google Scholar
    • Export Citation
  • Kim, P. S., and H.-J. Song, 2022: Usefulness of automatic hyperparameter optimization in developing radiation emulator in a numerical weather prediction model. Atmosphere, 13, 721, https://doi.org/10.3390/atmos13050721.

    • Search Google Scholar
    • Export Citation
  • Kingma, D. P., and J. Ba, 2014: Adam: A method for stochastic optimization. arXiv, 1412.6980v9, https://doi.org/10.48550/arXiv.1412.6980.

    • Search Google Scholar
    • Export Citation
  • Koza, J. R., 1994: Genetic programming as a means for programming computers by natural selection. Stat. Comput., 4, 87112, https://doi.org/10.1007/BF00175355.

    • Search Google Scholar
    • Export Citation
  • La Cava, W., B. Burlacu, M. Virgolin, M. Kommenda, P. Orzechowski, F. O. de França, Y. Jin, and J. H. Moore, 2021: Contemporary symbolic regression methods and their relative performance. 35th Conf. on Neural Information Processing Systems (NeurIPS 2021) Track on Datasets and Benchmarks, Online, Curran Associates Inc., 116, https://cavalab.org/assets/papers/La%20Cava%20et%20al.%20-%202021%20-%20Contemporary%20Symbolic%20Regression%20Methods%20and%20their.pdf.

    • Search Google Scholar
    • Export Citation
  • Lagerquist, R., D. Turner, I. Ebert-Uphoff, J. Stewart, and V. Hagerty, 2021: Using deep learning to emulate and accelerate a radiative transfer model. J. Atmos. Oceanic Technol., 38, 16731696, https://doi.org/10.1175/JTECH-D-21-0007.1.

    • Search Google Scholar
    • Export Citation
  • Lagerquist, R., D. D. Turner, I. Ebert-Uphoff, and J. Q. Stewart, 2023: Estimating full longwave and shortwave radiative transfer with neural networks of varying complexity. J. Atmos. Oceanic Technol., 40, 14071432, https://doi.org/10.1175/JTECH-D-23-0012.1.

    • Search Google Scholar
    • Export Citation
  • Lam, R., and Coauthors, 2022: GraphCast: Learning skillful medium-range global weather forecasting. arXiv, 2212.12794v2, https://doi.org/10.48550/arXiv.2212.12794.

    • Search Google Scholar
    • Export Citation
  • Lang, S., and Coauthors, 2024: AIFS—ECMWF’S data-driven forecasting system. arXiv, 2406.01465v2, https://doi.org/10.48550/arXiv.2406.01465.

    • Search Google Scholar
    • Export Citation
  • Lin, X., H.-L. Zhen, Z. Li, Q.-F. Zhang, and S. Kwong, 2019: Pareto multi-task learning. 33rd Conf. on Neural Information Processing Systems (NeurIPS 2019), Vancouver, British Columbia, Canada, Curran Associates Inc., 12060–12070, https://proceedings.neurips.cc/paper_files/paper/2019/file/685bfde03eb646c27ed565881917c71c-Paper.pdf.

    • Search Google Scholar
    • Export Citation
  • Liou, K.-N., 2002: An Introduction to Atmospheric Radiation. Vol. 84. Elsevier, 583 pp.

  • Lucarini, V., and M. D. Chekroun, 2023: Theoretical tools for understanding the climate crisis from Hasselmann’s programme and beyond. Nat. Rev. Phys., 5, 744765, https://doi.org/10.1038/s42254-023-00650-8.

    • Search Google Scholar
    • Export Citation
  • Maas, A. L., A. Y. Hannun, and A. Y. Ng, 2013: Rectifier nonlinearities improve neural network acoustic models. ICML 2013 Workshop on Deep Learning for Audio, Speech and Language Processing, Atlanta, GA, International Machine Learning Society, https://robotics.stanford.edu/amaas/papers/relu hybrid icml2013final.pdf.

    • Search Google Scholar
    • Export Citation
  • Maher, P., and Coauthors, 2019: Model hierarchies for understanding atmospheric circulation. Rev. Geophys., 57, 250280, https://doi.org/10.1029/2018RG000607.

    • Search Google Scholar
    • Export Citation
  • Mamalakis, A., E. A. Barnes, and I. Ebert-Uphoff, 2022: Investigating the fidelity of explainable artificial intelligence methods for applications of convolutional neural networks in geoscience. Artif. Intell. Earth Syst., 1, e220012, https://doi.org/10.1175/AIES-D-22-0012.1.

    • Search Google Scholar
    • Export Citation
  • Mamalakis, A., E. A. Barnes, and I. Ebert-Uphoff, 2023: Carefully choose the baseline: Lessons learned from applying XAI attribution methods for regression tasks in geoscience. Artif. Intell. Earth Syst., 2, e220058, https://doi.org/10.1175/AIES-D-22-0058.1.

    • Search Google Scholar
    • Export Citation
  • Mansfield, L. A., A. Gupta, A. C. Burnett, B. Green, C. Wilka, and A. Sheshadri, 2023: Updates on model hierarchies for understanding and simulating the climate system: A focus on data‐informed methods and climate change impacts. J. Adv. Model. Earth Syst., 15, e2023MS003715, https://doi.org/10.1029/2023MS003715.

    • Search Google Scholar
    • Export Citation
  • Mauritsen, T., and Coauthors, 2019: Developments in the MPI-M Earth System Model version 1.2 (MPI-ESM1.2) and its response to increasing CO2. J. Adv. Model. Earth Syst., 11, 9981038, https://doi.org/10.1029/2018MS001400.

    • Search Google Scholar
    • Export Citation
  • Miettinen, K., 1999: Nonlinear Multiobjective Optimization. Vol. 12. Springer, 298 pp.

  • Mlawer, E. J., S. J. Taubman, P. D. Brown, M. J. Iacono, and S. A. Clough, 1997: Radiative transfer for inhomogeneous atmospheres: RRTM, a validated correlated‐k model for the longwave. J. Geophys. Res., 102, 16 66316 682, https://doi.org/10.1029/97JD00237.

    • Search Google Scholar
    • Export Citation
  • Molnar, C., 2020: Interpretable Machine Learning. Lulu.com, 320 pp.

  • Morcrette, J.-J., 2000: On the effects of the temporal and spatial sampling of radiation fields on the ECMWF forecasts and analyses. Mon. Wea. Rev., 128, 876887, https://doi.org/10.1175/1520-0493(2000)128<0876:OTEOTT>2.0.CO;2.

    • Search Google Scholar
    • Export Citation
  • Morrison, H., and Coauthors, 2020: Confronting the challenge of modeling cloud and precipitation microphysics. J. Adv. Model. Earth Syst., 12, e2019MS001689, https://doi.org/10.1029/2019MS001689.

    • Search Google Scholar
    • Export Citation
  • Moseley, C., C. Hohenegger, P. Berg, and J. O. Haerter, 2016: Intensification of convective extremes driven by cloud–cloud interaction. Nat. Geosci., 9, 748752, https://doi.org/10.1038/ngeo2789.

    • Search Google Scholar
    • Export Citation
  • Nowack, P., J. Runge, V. Eyring, and J. D. Haigh, 2020: Causal networks for climate model evaluation and constrained projections. Nat. Commun., 11, 1415, https://doi.org/10.1038/s41467-020-15195-y.

    • Search Google Scholar
    • Export Citation
  • Pathak, J., and Coauthors, 2022: Fourcastnet: A global data-driven high-resolution weather model using adaptive fourier neural operators. arXiv, 2202.11214v1, https://doi.org/10.48550/arXiv.2202.11214.

    • Search Google Scholar
    • Export Citation
  • Pearl, J., and Coauthors, 2000: Causality: Models, Reasoning and Inference. Vol. 19. Cambridge University Press, 384 pp.

  • Petersen, B. K., M. Landajuela, T. N. Mundhenk, C. P. Santiago, S. K. Kim, and J. T. Kim, 2019: Deep symbolic regression: Recovering mathematical expressions from data via risk-seeking policy gradients. arXiv, 1912.04871v4, https://doi.org/10.48550/arXiv.1912.04871.

    • Search Google Scholar
    • Export Citation
  • Rasp, S., and N. Thuerey, 2021: Data‐driven medium‐range weather prediction with a Resnet pretrained on climate simulations: A new model for WeatherBench. J. Adv. Model. Earth Syst., 13, e2020MS002405, https://doi.org/10.1029/2020MS002405.

    • Search Google Scholar
    • Export Citation
  • Rasp, S., P. D. Dueben, S. Scher, J. A. Weyn, S. Mouatadid, and N. Thuerey, 2020: WeatherBench: A benchmark data set for data‐driven weather forecasting. J. Adv. Model. Earth Syst., 12, e2020MS002203, https://doi.org/10.1029/2020MS002203.

    • Search Google Scholar
    • Export Citation
  • Robertson, A. W., and M. Ghil, 2000: Solving problems with GCMS: General circulation models and their role in the climate modeling hierarchy. International Geophysics, Vol. 70, Academic Press, 40 pp.

    • Search Google Scholar
    • Export Citation
  • Roeckner, E., and Coauthors, 1996: The atmospheric general circulation model ECHAM-4: Model description and simulation of present-day climate. Max-Planck-Institut für Meteorologie Rep. 218, 94 pp., https://pure.mpg.de/rest/items/item_1781494/component/file_1786328/content.

    • Search Google Scholar
    • Export Citation
  • Ronneberger, O., P. Fischer, and T. Brox, 2015: U-net: Convolutional networks for biomedical image segmentation. Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015, N. Navab et al., Eds., Springer, 234241.

    • Search Google Scholar
    • Export Citation
  • Rudin, C., 2019: Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell., 1, 206215, https://doi.org/10.1038/s42256-019-0048-x.

    • Search Google Scholar
    • Export Citation
  • Scher, S., and G. Messori, 2021: Ensemble methods for neural network‐based weather forecasts. J. Adv. Model. Earth Syst., 13, e2020MS002331, https://doi.org/10.1029/2020MS002331.

    • Search Google Scholar
    • Export Citation
  • Schmidt, M., and H. Lipson, 2009: Distilling free-form natural laws from experimental data. Science, 324, 8185, https://doi.org/10.1126/science.1165893.

  • Schneider, T., J. Teixeira, C. S. Bretherton, F. Brient, K. G. Pressel, C. Schär, and A. P. Siebesma, 2017: Climate goals and computing the future of clouds. Nat. Climate Change, 7, 3–5, https://doi.org/10.1038/nclimate3190.

  • Schneider, T., L. R. Leung, and R. C. J. Wills, 2024: Opinion: Optimizing climate models with process knowledge, resolution, and artificial intelligence. Atmos. Chem. Phys., 24, 7041–7062, https://doi.org/10.5194/acp-24-7041-2024.

  • Schneiderman, H., 2024: Incorporation of physical equations into a neural network structure for predicting shortwave radiative heat transfer. 23rd Conf. on Artificial Intelligence for Environmental Science, Baltimore, MD, Amer. Meteor. Soc., J5A.5, https://ams.confex.com/ams/104ANNUAL/webprogram/Paper431981.html.

  • Seneviratne, S. I., and Coauthors, 2021: Weather and climate extreme events in a changing climate. Climate Change 2021: The Physical Science Basis, V. Masson-Delmotte et al., Eds., Cambridge University Press, 1513–1766, https://doi.org/10.1017/9781009157896.013.

  • Shamekh, S., K. D. Lamb, Y. Huang, and P. Gentine, 2023: Implicit learning of convective organization explains precipitation stochasticity. Proc. Natl. Acad. Sci. USA, 120, e2216158120, https://doi.org/10.1073/pnas.2216158120.

  • Sherwood, S. C., S. Bony, and J.-L. Dufresne, 2014: Spread in model climate sensitivity traced to atmospheric convective mixing. Nature, 505, 37–42, https://doi.org/10.1038/nature12829.

  • Stevens, B., and Coauthors, 2019: DYAMOND: The DYnamics of the Atmospheric general circulation Modeled On Non-hydrostatic Domains. Prog. Earth Planet. Sci., 6, 61, https://doi.org/10.1186/s40645-019-0304-z.

  • Stevens, B., and Coauthors, 2020: The added value of large-eddy and storm-resolving models for simulating clouds and precipitation. J. Meteor. Soc. Japan, 98, 395–435, https://doi.org/10.2151/jmsj.2020-021.

  • Sundqvist, H., E. Berge, and J. E. Kristjánsson, 1989: Condensation and cloud parameterization studies with a mesoscale numerical weather prediction model. Mon. Wea. Rev., 117, 1641–1657, https://doi.org/10.1175/1520-0493(1989)117<1641:CACPSW>2.0.CO;2.

  • Ukkonen, P., 2022: Exploring pathways to more accurate machine learning emulation of atmospheric radiative transfer. J. Adv. Model. Earth Syst., 14, e2021MS002875, https://doi.org/10.1029/2021MS002875.

  • Ukkonen, P., and M. Chantry, 2024: Representing sub-grid processes in weather and climate models via sequence learning. ESS Open Archive, https://doi.org/10.22541/essoar.172098075.51621106/v1.

  • Vapnik, V. N., and A. Y. Chervonenkis, 2015: On the uniform convergence of relative frequencies of events to their probabilities. Measures of Complexity: Festschrift for Alexey Chervonenkis, V. Vovk, H. Papadopoulos, and A. Gammerman, Eds., Springer, 11–30.

  • Veerman, M. A., R. Pincus, R. Stoffer, C. M. Van Leeuwen, D. Podareanu, and C. C. Van Heerwaarden, 2021: Predicting atmospheric optical properties for radiative transfer computations using neural networks. Philos. Trans. Roy. Soc., A379, 20200095, https://doi.org/10.1098/rsta.2020.0095.

  • Virgolin, M., T. Alderliesten, C. Witteveen, and P. A. N. Bosman, 2021: Improving model-based genetic programming for symbolic regression of small expressions. Evol. Comput., 29, 211–237, https://doi.org/10.1162/evco_a_00278.

  • Wang, Y., S. Yang, G. Chen, Q. Bao, and J. Li, 2023: Evaluating two diagnostic schemes of cloud-fraction parameterization using the CloudSat data. Atmos. Res., 282, 106510, https://doi.org/10.1016/j.atmosres.2022.106510.

  • Weyn, J. A., D. R. Durran, and R. Caruana, 2019: Can machines learn to predict weather? Using deep learning to predict gridded 500-hPa geopotential height from historical weather data. J. Adv. Model. Earth Syst., 11, 2680–2693, https://doi.org/10.1029/2019MS001705.

  • Weyn, J. A., D. R. Durran, and R. Caruana, 2020: Improving data-driven global weather prediction using deep convolutional neural networks on a cubed sphere. J. Adv. Model. Earth Syst., 12, e2020MS002109, https://doi.org/10.1029/2020MS002109.

  • Xu, K.-M., and D. A. Randall, 1996: A semiempirical cloudiness parameterization for use in climate models. J. Atmos. Sci., 53, 3084–3102, https://doi.org/10.1175/1520-0469(1996)053<3084:ASCPFU>2.0.CO;2.

  • Zhou, Z., M. M. R. Siddiquee, N. Tajbakhsh, and J. Liang, 2020: UNet++: Redesigning skip connections to exploit multiscale features in image segmentation. IEEE Trans. Med. Imaging, 39, 1856–1867, https://doi.org/10.1109/TMI.2019.2959609.

  • Fig. 1.

    Exploring PFs (sets of Pareto-optimal models) within a complexity-error plane highlights ML’s added value. Crosses in step 1 denote existing models. Algorithms such as deep learning allow for the creation of efficient, low-error, albeit complex models (step 2). Knowledge distillation, through methods such as equation discovery, aims to explain error reduction, resulting in simpler, low-error models (step 3) and long-lasting scientific progress. For atmospheric applications, we propose four categories to classify this added value: functional representation, feature assimilation, spatial connectivity, and temporal connectivity.
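
To make the Pareto-front construction in Fig. 1 concrete, the sketch below extracts the Pareto-optimal subset from a set of (complexity, error) pairs: a model belongs to the front if no other model is both simpler and more accurate. The model scores are hypothetical placeholders, not values from the applications discussed here.

```python
import numpy as np

def pareto_front(complexity, error):
    """Indices of Pareto-optimal models: those not dominated by any
    model with both lower complexity and lower error."""
    c = np.asarray(complexity, dtype=float)
    e = np.asarray(error, dtype=float)
    front, best_error = [], np.inf
    for i in np.argsort(c):      # sweep from simplest to most complex
        if e[i] < best_error:    # must beat every simpler model's error
            front.append(i)
            best_error = e[i]
    return front

# Hypothetical hierarchy: (trainable parameters, validation error)
params = [2, 10, 1e3, 1e5, 1e7]
errors = [0.9, 0.5, 0.6, 0.3, 0.25]
print(pareto_front(params, errors))  # [0, 1, 3, 4]; model 2 is dominated
```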

  • Fig. 2.

    Pareto-optimal model hierarchies quantify the added value of ML for cloud cover parameterization. ML better captures the relationship between cloud cover and its thermodynamic environment and assimilates features like vertical humidity gradients. (left) We progressively improve traditional baselines via polynomial regression (red, orange, and yellow crosses), significantly decrease error using NNs (pink and purple crosses), and finally distill the added value of these NNs symbolically (green crosses). (right) Both the NN (orange line) and its distilled symbolic representation (green line) better represent the functional relationship between cloud cover and its environment, aligning more closely across temperatures with the reference storm-resolving simulation (blue dots) than the Sundqvist scheme (red line) used in the ICON Earth system model. “Cold” and “Hot” refer to the validation set’s first and last temperature octiles. Additionally, ML models assimilate multiple features absent in existing baselines, including vertical humidity gradients. The smaller discrepancy between the 5-feature scheme (“SFS5”) and the reference (“REF”), compared to the 4-feature scheme (“SFS4”), demonstrates the improved representation of the time-averaged low cloud cover in regions such as the southeast Pacific, thereby reducing biases in current cloud cover schemes that plague the global radiative budget.
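
As a reference point for the baseline in Fig. 2, the sketch below implements the commonly quoted relative-humidity form of the Sundqvist et al. (1989) diagnostic, in which cloud cover rises from zero once relative humidity exceeds a critical value. The critical humidity used here is an illustrative placeholder, not the tuned value of ICON or any particular model.

```python
import numpy as np

def sundqvist_cloud_fraction(rh, rh_crit=0.8):
    """Diagnostic cloud fraction from relative humidity (0-1), in the
    commonly quoted Sundqvist et al. (1989) form; rh_crit = 0.8 is an
    illustrative placeholder, not a tuned model parameter."""
    rh = np.clip(np.asarray(rh, dtype=float), 0.0, 1.0)
    frac = 1.0 - np.sqrt(np.clip((1.0 - rh) / (1.0 - rh_crit), 0.0, 1.0))
    return np.where(rh > rh_crit, frac, 0.0)

print(sundqvist_cloud_fraction([0.5, 0.85, 0.95, 1.0]))
# approximately [0.  0.134  0.5  1.]
```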

  • Fig. 3.

    Pareto-optimal model hierarchies guide the development of progressively tailored architectures for emulating shortwave radiative transfer. (a) Error vs complexity on a logarithmic scale for the simple clear-sky cases dominated by absorption; (b) error vs complexity for cases with multilayer cloud, including both liquid and ice, where multiple scattering complicates radiative transfer. Convolutional NNs (CNNs; red crosses) with small kernels, MLPs (orange crosses) that ignore the vertical dimension, and the simple linear baseline (light pink star) give credible results in the clear-sky case. However, they fail in the more complex case, which requires U-Net architectures (dark pink and purple crosses) to fully capture nonlocal radiative transfer. The vertical invariance of the two-stream radiative transfer equations suggests a bidirectional RNN (green star) architecture, which rivals the skill of U-Nets with a fraction of their trainable parameters.
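
The bidirectional recurrence highlighted in Fig. 3 can be sketched as follows: one pass sweeps down the column and one sweeps up, mirroring the downward and upward fluxes of the two-stream equations, so each level sees the whole column at a small parameter cost. The GRU cell, layer sizes, and feature counts below are illustrative assumptions, not the configuration evaluated in the figure.

```python
import torch
import torch.nn as nn

class BiRNNColumnEmulator(nn.Module):
    """Minimal sketch of a bidirectional RNN over vertical levels for
    column-wise emulation; all sizes are illustrative."""
    def __init__(self, n_features=10, n_hidden=64, n_outputs=2):
        super().__init__()
        # Two recurrences, one per vertical direction (down and up).
        self.rnn = nn.GRU(n_features, n_hidden,
                          batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * n_hidden, n_outputs)  # e.g., up/down flux

    def forward(self, x):      # x: (batch, levels, features)
        h, _ = self.rnn(x)     # (batch, levels, 2 * n_hidden)
        return self.head(h)    # one prediction per vertical level

model = BiRNNColumnEmulator()
out = model(torch.randn(8, 60, 10))  # 8 columns, 60 levels, 10 features
print(out.shape)                     # torch.Size([8, 60, 2])
```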

  • Fig. 4.

    Pareto-optimal model hierarchies underscore the importance of storm-resolving information in elucidating the relationship between precipitation and its surrounding environment, while also quantifying the recoverability of this information from the coarse environment’s time series. (left) NNs leveraging high-resolution spatial data (purple crosses) clearly outperform NNs that use only coarse inputs (orange crosses). However, this performance gap is largely mitigated when the coarse inputs’ past time steps are included (green crosses). (right) Processing the precipitable water field at a resolution of Δx ≈ 5 km yields coefficients of determination R² ≈ 0.9, clearly surpassing the R² ≈ 0.5 attained by our best NN using fields at the coarse Δx ≈ 10² km horizontal resolution. This performance gap is partially closed by incorporating two past time steps along with the current time step, resulting in R² ≈ 0.7. This suggests a partial equivalence of the environment’s spatial and temporal connectivities in predicting precipitation.
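
The temporal-memory experiment in Fig. 4 amounts to concatenating each coarse-resolution sample with its past time steps before feeding it to the NN. The sketch below shows one such input construction; the (time, features) array layout and feature counts are assumptions for illustration.

```python
import numpy as np

def add_temporal_memory(coarse, n_past=2):
    """Concatenate each time step's coarse predictors with the n_past
    preceding steps (illustrative layout: coarse is (time, features))."""
    t, f = coarse.shape
    rows = [np.concatenate([coarse[i - k] for k in range(n_past + 1)])
            for i in range(n_past, t)]   # order: t, t-1, ..., t-n_past
    return np.stack(rows)                # (t - n_past, (n_past+1) * f)

x = np.random.rand(100, 5)           # 100 time steps, 5 coarse predictors
print(add_temporal_memory(x).shape)  # (98, 15): current + two past steps
```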
