1. Introduction
Will recent advancements in machine learning (ML) lead to enduring new knowledge in atmospheric science? While the added value of ML for weather and climate applications can be measured using performance metrics, it often remains challenging to understand where that value comes from. Taking advancements in data-driven, medium-range weather forecasting (Ben-Bouallegue et al. 2023) as an example, the increasing reliance on complex architectures makes state-of-the-art models difficult to interpret.
Weyn et al. (2019, 2020) used convolutional neural networks (CNNs) with approximately 200k and 700k learned parameters to produce global forecasts that outperformed climatology, persistence, and a low-resolution numerical weather prediction (NWP) model for lead times shorter than 1 week. Following the success of early approaches, Rasp et al. (2020) developed a benchmark dataset for data-driven weather forecasting, which facilitated the objective assessment of rapid developments (e.g., Clare et al. 2021; Scher and Messori 2021). Since then, for data-driven, medium-range forecasting, Rasp and Thuerey (2021) trained a ≈ 6.3M-parameter deep residual CNN, Keisler (2022) trained a ≈ 6.7M-parameter graph neural network (GNN), and Pathak et al. (2022) trained a ≈ 7M-parameter emulator combining transformers with Fourier neural operators. Recently, deep learning models started rivaling state-of-the-art, high-resolution, deterministic NWP models: Lam et al. (2022) via combined GNNs totaling ≈ 37M parameters, Bi et al. (2022) via a ≈ 256M-parameter Earth-specific transformer, and Lang et al. (2024) via a ≈ 256M-parameter graph and transformer model. It is hard to pinpoint what makes these models so successful, even with modern explainable artificial intelligence (XAI; Buhrmester et al. 2021; Das and Rad 2020) tools, given that XAI requires certain assumptions to be satisfied (Mamalakis et al. 2022, 2023) and involves choosing which samples to investigate, which is challenging for large models and datasets.
The growing complexity of data-driven models for weather applications shares similarities with the development of general circulation models (GCMs) that followed the first comprehensive assessment of global climate change due to carbon dioxide (Charney et al. 1979). Unlike data-driven weather prediction, where reducing forecast errors could warrant increased complexity, GCMs have traditionally been created not simply to project but also to comprehend climate change (Held 2005; Balaji et al. 2022). This implies that any additional complexity in an Earth system model should be well justified, motivating climate model hierarchies that aim to connect our fundamental understanding with model prediction (e.g., Mansfield et al. 2023; Robertson and Ghil 2000; Bony et al. 2013; Jeevanjee et al. 2017; Maher et al. 2019; Balaji 2021).
Inspired by climate model hierarchies, we here show that modern optimization tools help systematically generate data-driven model hierarchies to model and understand climate processes for which we have reliable data. These hierarchies can 1) guide the development of data-driven models that optimally balance simplicity and accuracy and 2) unveil the role of each complexity unit, furthering process understanding by distilling the added value of ML for the atmospheric application of interest.
In this study, we showcase the advantages of considering a hierarchy of models with varying error and complexity, as opposed to focusing on a single fitted model (Fisher et al. 2019). After formulating Pareto-optimal model hierarchies (section 2) and categorizing the added value of ML into four categories (section 3), we apply our approach to three atmospheric processes relevant for weather and climate predictions (section 4) to distill the added value of recently developed deep learning frameworks before concluding (section 5).
2. Pareto-optimal model hierarchies
a. Pareto optimality
Intuitively, when we select a model from the Pareto front, any attempt to switch to a different model would mean sacrificing the quality of at least one evaluation metric. Conversely, a model that can be replaced without compromising any evaluation metric is described as Pareto dominated. Henceforth, we derive Pareto fronts empirically from the available data and the subset of models M considered. These empirical Pareto fronts are denoted as PFs in the remainder of this study.
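For readers who want to reproduce this construction, the following minimal Python sketch (ours, for illustration; the study's released code is linked in the data availability statement) extracts an empirical Pareto front from paired complexity and error values:

```python
import numpy as np

def pareto_front(complexity, error):
    """Indices of Pareto-optimal models: a model is kept unless another model
    is at least as good on both axes and strictly better on at least one."""
    complexity, error = np.asarray(complexity, float), np.asarray(error, float)
    optimal = []
    for i in range(len(complexity)):
        dominated = np.any(
            (complexity <= complexity[i]) & (error <= error[i])
            & ((complexity < complexity[i]) | (error < error[i])))
        if not dominated:
            optimal.append(i)
    return optimal

# Three models: the 20-parameter model is dominated by the 10-parameter one.
print(pareto_front([10, 20, 100], [0.5, 0.5, 0.1]))  # -> [0, 2]
```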
b. Error
We emphasize the importance of holistic evaluation, which employs several error metrics with different behaviors: traditional regression or classification metrics, distributional distances, spectral metrics, probabilistic scoring rules, reliability diagrams (Haynes et al. 2023), causal evaluation (Nowack et al. 2020), etc. To facilitate the use of Pareto-optimal hierarchies, we recommend prioritizing proper scores, whose expectation is optimal if and only if the model represents the true target distribution (Bröcker 2009). For simplicity’s sake, we will employ the mean-square error (MSE) and its square root (RMSE) as our primary error metrics for our study’s applications. We make this choice because MSE is a proper score for deterministic predictions that can be efficiently optimized, while recognizing MSE’s inherent limitations for nonnormally distributed targets.
c. Complexity
To our knowledge, there are no universally accepted metrics for quantifying the complexity of data-driven models within the geosciences. In statistical learning, various complexity metrics, such as Rademacher complexity (e.g., Bartlett et al. 2005), rely on dataset characteristics, whereas others, like the Vapnik–Chervonenkis (VC) dimension (Vapnik and Chervonenkis 2015), solely depend on algorithmic attributes. Here, we predominantly focus on two metrics that can be readily calculated from model attributes: the number of trainable parameters Nparam and the number of features Nfeatures. We choose these metrics due to their simplicity and versatility across the broad spectrum of models considered in our study (Balaji et al. 2017). As we will confirm empirically in section 4, Nparam and Nfeatures can serve as (very) approximate proxies for generalizability: models with fewer trainable parameters, and whose features have more stable distributions, tend to generalize better. We defer the exploration of additional complexity metrics, such as the number of floating point operations (FLOPs) or multiply–accumulate operations (MACs), to future work.
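As a concrete illustration of the first metric, Nparam for a dense multilayer perceptron can be computed directly from layer sizes. In the sketch below, the 12-feature configuration is our assumption, chosen because it reproduces the 9345-parameter count of the cloud cover baseline in section 4a; the actual feature count of that model may differ.

```python
def mlp_param_count(n_in, widths=(64, 64, 64), n_out=1, bn_after=2):
    """Trainable parameters of a dense MLP: weights + biases per layer, plus
    2 trainable parameters (scale, shift) per neuron of one batch-norm layer."""
    sizes = (n_in, *widths, n_out)
    dense = sum(a * b + b for a, b in zip(sizes, sizes[1:]))
    batch_norm = 2 * widths[bn_after - 1] if bn_after else 0
    return dense + batch_norm

print(mlp_param_count(12))  # -> 9345 (assuming 12 input features)
```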
Equipped with these definitions, we can now ask: Why, along the Pareto front in a well-defined error-complexity plane, does increasing complexity result in better performance? In the following section, we leverage Pareto optimality to answer this question and distill machine learning’s added value.
3. The distillable value of machine learning
Amid rapid progress in optimization, machine learning architectures, and data handling, it can be challenging to distinguish long-lasting progress in modeling from small improvements in model error. We postulate that progress is more likely to be replicable if we can explain how added model complexity improves performance in simple terms. Based on this postulate, we propose simple definitions to categorize a model’s added value in the geosciences.
We define a model M as having added value with respect to a set of evaluation metrics if it is Pareto optimal for those metrics within the set of considered models.
Equation (2) suggests four degrees of freedom for a model’s added value: 1) improving the function M without changing the features X_{x,t} (functional representation), 2) improving the model through the features X (feature assimilation), 3) improving the model through the spatial index x (spatial connectivity), and 4) improving the model through the temporal index t (temporal connectivity). We depict these four nonmutually exclusive categories in Fig. 1 and rigorously define them below.
Fig. 1. Exploring PFs (sets of Pareto-optimal models) within a complexity-error plane highlights ML’s added value. Crosses in step 1 denote existing models. Algorithms such as deep learning allow for the creation of efficient, low-error, albeit complex models (step 2). Knowledge distillation, through methods such as equation discovery, aims to explain error reduction, resulting in simpler, low-error models (step 3) and long-lasting scientific progress. For atmospheric applications, we propose four categories to classify this added value: functional representation, feature assimilation, spatial connectivity, and temporal connectivity.
a. Functional representation
This improvement may stem from algorithms leading to better fits (e.g., gradient boosting instead of decision trees), improved optimization (e.g., the Adam optimizer and its variants instead of traditional stochastic gradient descent; Kingma and Ba 2014), improved parsimony (e.g., by decreasing the number of trainable parameters via hyperparameter tuning), or enforced constraints (e.g., positive concentrations and precipitation). Improvements in functional representation are readily visualized via partial dependence plots or their variants (marginal plots to avoid unlikely data instances, accumulated local effects to account for feature correlation, etc.; Molnar 2020). Note that Eq. (3) also captures improvements in probabilistic modeling by generalizing M to a stochastic mapping and including probabilistic scores (Gneiting and Raftery 2007; Haynes et al. 2023) in the set of evaluation metrics.
b. Feature assimilation
c. Spatial connectivity
This improvement may stem from the ability to
- 1) handle features at different spatial resolutions (e.g., via improved preprocessing or handling of data). In atmospheric science, this can help consider multiscale interaction and accommodate data from various Earth system models.
- 2) hierarchically process spatially adjacent data (e.g., via convolutional layers). In atmospheric science, this acknowledges the high correlation between spatial neighbors due to, e.g., small-scale mixing.
- 3) capture long-range spatial dependencies (e.g., via self-attention mechanisms or a graph structure), such as teleconnections in the Earth system.
d. Temporal connectivity
This improvement may stem from the ability to
- 1) handle features at different temporal resolutions (e.g., via improved preprocessing or handling of data). This can help consider multiple time scales and accommodate data from various Earth system models.
- 2) process consecutive time steps (e.g., via recurrent layers). This acknowledges the high temporal autocorrelation of Earth system data, which is a property of the underlying dynamical system.
- 3) capture long-term temporal dependencies (e.g., via gating or self-attention mechanisms) and cyclic patterns (e.g., via data adjustments and temporal Fourier transforms) such as the diurnal and seasonal cycles.
4. Atmospheric physics application cases
This section demonstrates that Pareto optimality guides model development and improves process understanding through three realistic atmospheric modeling case studies. Each case includes machine learning prototypes with demonstrated performance from previous studies, along with Pareto-optimal models newly trained for this study. The first case study emphasizes functional representation and feature assimilation, the second focuses on spatial connectivity, and the last compares spatial and temporal connectivity.
a. Cloud cover parameterization for climate modeling
1) Motivation
The incorrect representation of cloud processes in current Earth system models, with a grid spacing of approximately 50–100 km (Arias et al. 2021), significantly contributes to structural uncertainty in long-term climate projections (Bony et al. 2015; Sherwood et al. 2014). Cloud cover parameterization, which maps environmental conditions at the grid scale to the fraction of the grid cell occupied by clouds, directly affects radiative transfer and microphysical conversion rates, influencing the model’s energy balance and water species concentrations.
Although “storm-resolving” simulations with grid spacing below ≈5 km do not explicitly resolve clouds and their associated microphysical processes (Morrison et al. 2020), they significantly reduce uncertainty in the interaction between storms and planetary-scale dynamics by explicitly simulating deep convection (Stevens et al. 2020). However, their large computational cost prohibits their routine use for ensemble projections (Schneider et al. 2017). Machine learning can learn the storm-scale behavior of clouds from short, storm-resolving simulations, potentially improving coarser Earth system models through data-driven parameterizations (Gentine et al. 2021).
This case study aims to understand the improvement gained from the higher-fidelity representation of storms and clouds. As illustrated in Fig. 2, we demonstrate that this knowledge can be symbolically distilled into an analytic equation that rivals the performance of deep learning.
Fig. 2. Pareto-optimal model hierarchies quantify the added value of ML for cloud cover parameterization. ML better captures the relationship between cloud cover and its thermodynamic environment and assimilates features like vertical humidity gradients. (left) We progressively improve traditional baselines via polynomial regression (red, orange, and yellow crosses), significantly decrease error using NNs (pink and purple crosses), and finally distill the added value of these NNs symbolically (green crosses). (right) Both the NN (orange line) and its distilled symbolic representation (green line) better represent the functional relationship between cloud cover and its environment, aligning more closely across temperatures with the reference storm-resolving simulation (blue dots) than the Sundqvist scheme (red line) used in the ICON Earth system model. “Cold” and “Hot” refer to the validation set’s first and last temperature octiles. Additionally, ML models assimilate multiple features absent in existing baselines, including vertical humidity gradients. The smaller discrepancy between the 5-feature scheme (“SFS5”) and the reference (“REF”), compared to the 4-feature scheme (“SFS4”), demonstrates the improved representation of the time-averaged low cloud cover in regions such as the southeast Pacific, thereby reducing biases in current cloud cover schemes that plague the global radiative budget.
2) Setup
We follow the setup described in Grundner et al. (2024), to which readers are referred for details. Fields are coarse grained from storm-resolving Icosahedral Nonhydrostatic (ICON) simulations (Giorgetta et al. 2018) conducted as part of the Dynamics of the Atmospheric General Circulation Modeled on Nonhydrostatic Domains (DYAMOND) intercomparison project (Stevens et al. 2019; Duras et al. 2021). The original simulations use a horizontal grid spacing of ≈2.5 km and 58 vertical layers below 21 km (the maximum altitude with clouds in the dataset). They are coarse grained to a typical climate model resolution of ≈80 km horizontally and 27 vertical layers, converting the binarized, high-resolution condensate field (1 if the cloud condensate mixing ratio exceeds 10−6 kg kg−1 and 0 otherwise) into a fractional area cloud cover ranging from 0% to 100%.
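The binarize-then-average step can be illustrated with a short numpy sketch; note that this toy version block-averages a regular 2D grid, whereas the actual coarse graining operates on ICON's icosahedral mesh and also remaps vertically:

```python
import numpy as np

def coarse_cloud_cover(q_condensate, block=32, threshold=1e-6):
    """Binarize a high-resolution condensate field (kg/kg) at the given
    threshold, then block-average to a fractional cloud cover in [0, 1]."""
    cloudy = (q_condensate > threshold).astype(float)
    ny, nx = cloudy.shape
    return cloudy.reshape(ny // block, block, nx // block, block).mean(axis=(1, 3))

hi_res = np.random.default_rng(0).lognormal(mean=-16, sigma=2, size=(128, 128))
print(coarse_cloud_cover(hi_res).shape)  # (4, 4) coarse cells, values in [0, 1]
```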
To prevent strong correlations between the training and validation sets, the union of the “DYAMOND Summer” (10 August–10 September 2016) and “DYAMOND Winter” (30 January–29 February 2020) datasets was partitioned into six consecutive temporal segments. The second segment (approximately 21 August–1 September 2016) and the fifth segment (approximately 9–19 February 2020) form the validation set. For all models, excluding traditional methods, the features are standardized to have a mean of zero and a standard deviation of one within the training set.
3) Model hierarchy
To account for marine stratocumulus (low) clouds (Mauritsen et al. 2019), we use two different sets of four parameters over land and sea, distinguished by a land fraction threshold of 0.5. The Sundqvist scheme is parsimonious with only eight trainable parameters, but it performs poorly against high-resolution data, with RMSE values as large as 25%, despite having been retuned on our training set.
Hypothesizing that the Sundqvist scheme’s large error is due to its lack of features, we test the effect of adding features one by one. For that purpose, we apply forward sequential feature selection, which greedily adds features using a cross-validated score (here MSE), to a standard multiple linear regression model that includes polynomial combinations of all available features, up to a maximum degree of 3.
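This procedure maps directly onto standard tools; a minimal scikit-learn sketch (with synthetic stand-in data rather than the DYAMOND features) is:

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))      # stand-ins for RH, temperature, etc.
y = 2 * X[:, 0] + X[:, 1] * X[:, 2] ** 2 + rng.normal(scale=0.1, size=1000)

poly = PolynomialFeatures(degree=3, include_bias=False)
X_poly = poly.fit_transform(X)      # all monomials up to degree 3

sfs = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=5,
    direction="forward", scoring="neg_mean_squared_error", cv=5)
sfs.fit(X_poly, y)
print(np.array(poly.get_feature_names_out())[sfs.get_support()])
```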
We start with the baseline neural network from Grundner et al. (2022): a three-layer-deep, 64-neuron-wide multilayer perceptron with batch normalization after the second hidden layer. Hyperparameters were optimized using the Sherpa Python library (Hertel et al. 2020). This NN estimates the target cloud cover with high fidelity (RMSE = 8.5%), but at the cost of increased complexity, as it has a total of 9345 trainable parameters. Note that the models from Grundner et al. (2022) were not designed to minimize the number of trainable parameters, so we do not overly focus on this complexity metric. The RMSE can be further lowered to 6% with vertically nonlocal NNs, which map the entire atmospheric column of inputs to the entire column of outputs without inductive bias. However, this small error gain is deemed insufficient given the increased complexity cost.
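For concreteness, the baseline architecture described above can be sketched as follows; the activation choice and output handling are our assumptions, not the tuned hyperparameters of Grundner et al. (2022):

```python
import tensorflow as tf

def cloud_cover_mlp(n_features=12):  # 12 features is an illustrative assumption
    inp = tf.keras.Input(shape=(n_features,))
    x = tf.keras.layers.Dense(64, activation="tanh")(inp)
    x = tf.keras.layers.Dense(64)(x)
    x = tf.keras.layers.BatchNormalization()(x)  # after the second hidden layer
    x = tf.keras.layers.Activation("tanh")(x)
    x = tf.keras.layers.Dense(64, activation="tanh")(x)
    out = tf.keras.layers.Dense(1)(x)  # cloud cover (%), clipped in postprocessing
    return tf.keras.Model(inp, out)

cloud_cover_mlp().summary()  # ~9.3k trainable parameters for 12 input features
```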
While indicative of how accurately cloud cover can be parameterized, such improvements are often insufficient to be considered “discoveries” as they remain hard to explain, even with post hoc explanation tools (Figs. 8 and 9 of Grundner et al. 2022). Therefore, improvements in functional representation and feature assimilation need to be further distilled into sparse models that scientists can readily interpret.
4) Symbolic distillation and equation discovery
For this purpose, we use symbolic regression libraries, which optimize both the parameters and structure of an analytical model within a space of expressions. Symbolic regression yields expressions with transparent out-of-distribution behavior (asymptotics, periodicity, etc.) (La Cava et al. 2021), making them well suited for high-stakes societal applications (Rudin 2019) and the empirical distillation of natural laws (Schmidt and Lipson 2009). To avoid overly restricting the analytical form of the distilled equation, genetic programming is used. Genetic programming evolves a population of mathematical expressions using methods such as selection, crossover, and mutation to improve a fitness function (Koza 1994). Given that genetic programming scales poorly with the number of features (Petersen et al. 2019), our NN feature selection results are used to restrict our features to those listed in Eq. (17). Using NN results is appropriate since no assumption is made about the type of equation to be discovered.
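As an example of such a library, PySR (Cranmer 2023) evolves expressions by genetic programming and, fittingly, reports its own Pareto front of accuracy vs expression size. The sketch below uses synthetic stand-in data and placeholder hyperparameters rather than the study's actual configuration:

```python
import numpy as np
from pysr import PySRRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))   # stand-ins for the five NN-selected features
y = np.clip(50 + 30 * X[:, 0] - 10 * X[:, 1] ** 2, 0, 100)  # fake cloud cover (%)

model = PySRRegressor(
    niterations=40,                          # generations of evolution
    binary_operators=["+", "-", "*", "/"],
    unary_operators=["exp", "square"],
    maxsize=25,                              # cap on expression complexity
)
model.fit(X, y)
print(model.sympy())  # best symbolic fit on PySR's accuracy-complexity front
```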
In addition to being easily transferable thanks to their low number of trainable parameters, the added value of these equations is transparent: The improved functional representation is explicit [see Eq. (20)], and the assimilation of new features is interpretable [see Eq. (21)]. Finally, scientific discovery may arise through the unexpected aspects of these equations that are robust across models, such as the difference between how cloud cover reacts to an increase in environmental liquid versus ice content. Indeed, at high resolution, cloud cover will become 1 as soon as condensates exceed a small threshold (here 10−6 kg kg−1) independently of the water’s phase. Then, why is cloud cover more sensitive to ice than to liquid in Eq. (22)?
These are in fact emerging properties of the subgrid distributions of liquid and ice (Grundner et al. 2024): As large values of cloud ice are rarely observed, larger spatial averages of cloud ice at coarse resolution mean that many more high-resolution pixels contain low values of cloud ice compared to the liquid case, resulting in higher cloudiness for a given spatially averaged condensate value. By assuming an exponential distribution for the subgrid liquid and ice content, we can even interpret this asymmetry analytically.
b. Shortwave radiative transfer emulation to accelerate numerical weather prediction
1) Motivation
The energy transfer resulting from the interaction between electromagnetic radiation and the atmosphere, known as radiative transfer, is costly to simulate accurately. Line-by-line calculations of gaseous absorption at each electromagnetic wavelength (Clough et al. 1992) are too expensive for routine use in weather and climate models. Instead, models often use the correlated-k method (Mlawer et al. 1997), which groups absorption coefficients in a cumulative probability space to speed up radiative transfer calculations without significantly compromising accuracy. However, even the correlated-k method imposes a high computational burden (Veerman et al. 2021), forcing most simulations to reduce the temporal and spatial resolution of radiative transfer calculations, which can degrade prediction quality (Morcrette 2000; Hogan and Bozzo 2018).
This challenge has driven the development of ML emulators for radiative transfer in numerical weather prediction since the 1990s (Cheruy et al. 1996; Chevallier et al. 1998, 2000). ML architectures have become more sophisticated (e.g., Belochitski and Krasnopolsky 2021; Kim and Song 2022; Ukkonen 2022), but the primary goal remains to emulate the original radiation scheme as faithfully as possible. This allows the reduced inference cost of the ML model, once trained, to be leveraged for running the atmospheric model coupled with the emulator, enabling less expensive and more frequent radiative transfer calculations.
This case study examines how ML architectural designs impact the reliability of shortwave radiative transfer (covering solar radiation at wavelengths of 0.23–5.85 µm). As shown in Fig. 3, physics-informed architectures that closely mimic the vertical bidirectionality of radiative transfer are Pareto optimal, rivaling the performance of deep learning models with 10 times more trainable parameters.
Fig. 3. Pareto-optimal model hierarchies guide the development of progressively tailored architectures for emulating shortwave radiative transfer. (a) Error vs complexity on a logarithmic scale for the simple clear-sky cases dominated by absorption; (b) error vs complexity for cases with multilayer cloud, including both liquid and ice, where multiple scattering complicates radiative transfer. Convolutional NNs (CNNs; red crosses) with small kernels, MLPs (orange crosses) that ignore the vertical dimension, and the simple linear baseline (light pink star) give credible results in the clear-sky case. However, they fail in the more complex case, which requires U-Net architectures (dark pink and purple crosses) to fully capture nonlocal radiative transfer. The vertical invariance of the two-stream radiative transfer equations suggests a bidirectional RNN (green star) architecture, which rivals the skill of U-Nets with a fraction of their trainable parameters.
2) Setup
3) Model hierarchy
We deploy an ML model hierarchy using the same features X to isolate the added value of spatial connectivity along the vertical coordinate η. For all models, cropping and zero-padding layers set the top-of-atmosphere (TOA; ∼78 km above ground level) heating rate to zero, consistent with the Rapid Radiative Transfer Model (RRTM) being emulated. Readers interested in the exact ML model architecture are referred to Lagerquist et al. (2023).
We begin with a linear regression model (pink star), where the input layer is reshaped into a flattened vector and processed through linear layers without activation functions or pooling. This model lacks inherent spatial connectivity, treating all features (where one “feature” = one atmospheric variable at one vertical level) independently. Similarly, the multilayer perceptrons (MLPs; orange crosses) flatten inputs, ignoring vertical connectivity, but process them with varying degrees of nonlinearity, controlled by two hyperparameters: depth (number of dense layers) and width (number of neurons per layer). We conduct a grid search with depth varying over {1, 2, 3, 4, 5, 6} layers and width varying over {64, 128, 256, 512, 1024, 2048} neurons, resulting in 36 MLPs. The activation function for every hidden layer (i.e., every layer except the output) is the leaky rectified linear unit (ReLU; Maas et al. 2013) with a negative slope α = 0.2.
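The resulting 36-member slice of the hierarchy can be generated programmatically; the input and output sizes below are illustrative (22 profile variables on 127 levels, flattened), not the exact feature count of the study:

```python
import itertools
import tensorflow as tf

def build_mlp(depth, width, n_in=22 * 127, n_out=127):
    inp = tf.keras.Input(shape=(n_in,))     # flattened profile features
    x = inp
    for _ in range(depth):
        x = tf.keras.layers.Dense(width)(x)
        x = tf.keras.layers.LeakyReLU(0.2)(x)   # leaky ReLU with alpha = 0.2
    out = tf.keras.layers.Dense(n_out)(x)   # heating-rate profile (K/day)
    return tf.keras.Model(inp, out)

grid = itertools.product([1, 2, 3, 4, 5, 6], [64, 128, 256, 512, 1024, 2048])
mlps = {(d, w): build_mlp(d, w) for d, w in grid}  # the 36 MLPs
```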
To incorporate vertical relationships between different levels, we train CNNs (red crosses) with 1D convolutions along the vertical axis. We conduct a grid search with depth varying over {1, 2, 3, 4, 5, 6} convolutional layers and the number of channels in the first convolutional layer varying over {2, 4, 8, 16, 32, 64}. After the first layer, we double the number of channels in each successive convolutional layer, up to a maximum of 64. The CNN’s kernel size is restricted to three vertical levels to enforce strong local connectivity.
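A corresponding sketch of one CNN in this grid (again with illustrative input sizes, and omitting the flux outputs of the full models) is:

```python
import tensorflow as tf

def build_cnn(depth, first_channels, n_levels=127, n_vars=22):
    inp = tf.keras.Input(shape=(n_levels, n_vars))
    x, channels = inp, first_channels
    for _ in range(depth):
        # kernel_size=3 enforces strong local connectivity along the vertical
        x = tf.keras.layers.Conv1D(channels, kernel_size=3, padding="same")(x)
        x = tf.keras.layers.LeakyReLU(0.2)(x)
        channels = min(2 * channels, 64)    # channel-doubling rule, capped at 64
    out = tf.keras.layers.Conv1D(1, kernel_size=1)(x)  # heating rate per level
    return tf.keras.Model(inp, out)

build_cnn(depth=3, first_channels=8).summary()
```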
The U-Net architecture (pink crosses; Ronneberger et al. 2015) builds on the CNN by incorporating skip connections that preserve high-resolution spatial information and an expansive path almost symmetric to the contracting path. Both of these model components improve the reconstruction of full-resolution data (here, a 127-level profile of heating rates). We conduct a grid search with depth (number of downsampling operations) varying over {3, 4, 5} and the number of channels in the first convolutional layer varying over {2, 4, 8, 16, 32, 64}, using the same channel-doubling rule as for CNNs. Finally, the U-Net++ architecture (purple crosses; Zhou et al. 2020) enhances the U-Net’s skip pathways with nested dense convolutional blocks, potentially capturing more of the original spatial information and facilitating optimization. Our hyperparameters for U-Net++ are the same as for the U-Net.
4) Distilling radiative transfer’s bidirectionality
To better understand the added value of each architecture in representing shortwave absorption and scattering, we extract samples from our test set to form two distinct regimes. The simple, “clear-sky” regime (Fig. 3a) consists of profiles with no cloud (column-integrated liquid water path = column-integrated ice water path = 0 g m−2), an oblique sun angle (zenith angle > 60°), and little water vapor (column-maximum specific humidity < 2 g kg−1). This restricts shortwave radiative transfer to gaseous absorption of solar radiation throughout the atmospheric column, well described by a simple exponential attenuation model such as Beer’s law (Liou 2002). In contrast, the complex, “multicloud” regime (Fig. 3b) includes profiles with more than one cloud layer and at least one mixed-phase cloud layer. For this purpose, a “cloud layer” is defined as a set of consecutive vertical levels, such that every level has a total water content (liquid + ice) > 0 g m−3 and the layer has a height-integrated total water path (liquid + ice) ≥ 10 g m−2. A mixed-phase cloud layer meets the above criteria plus two additional ones: Both the height-integrated liquid water path and ice water path must be >0 g m−2. This regime is challenging to model as shortwave radiation is absorbed and scattered by both liquid and ice clouds, making the bidirectionality of radiative transfer essential to capture. Out of the 472 456 test set profiles, 14 226 are in the clear-sky regime and 13 263 are in the multicloud regime.
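The layer-counting rule behind these regime definitions can be written compactly; the sketch below (ours) processes one profile at a time:

```python
import numpy as np

def count_cloud_layers(twc, lwc, iwc, dz, min_path=10.0):
    """Count cloud layers and mixed-phase cloud layers in one profile.
    A layer is a maximal run of consecutive levels with total water content
    > 0 g/m^3 whose height-integrated path is >= 10 g/m^2; mixed phase
    additionally requires positive integrated liquid AND ice paths."""
    padded = np.concatenate(([0], (twc > 0).astype(int), [0]))
    edges = np.flatnonzero(np.diff(padded))   # run starts (even) and stops (odd)
    n_layers = n_mixed = 0
    for start, stop in zip(edges[::2], edges[1::2]):
        run = slice(start, stop)
        if np.sum(twc[run] * dz[run]) >= min_path:
            n_layers += 1
            if np.sum(lwc[run] * dz[run]) > 0 and np.sum(iwc[run] * dz[run]) > 0:
                n_mixed += 1
    return n_layers, n_mixed

# "Multicloud" regime: more than one layer and at least one mixed-phase layer.
```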
The first surprising result is the poor performance of CNNs: While Pareto-optimal for low complexity, with our simplest CNN having as few as 12 630 trainable parameters, none of the CNNs achieve an MSE below 0.95 K2 day−2, even in the clear-sky case. In contrast, MLPs systematically outperform our linear baseline model, showcasing the importance of nonlinearity. MLPs also outperform CNNs because they allow for nonlocal vertical connectivity. However, without information on which vertical levels are closest, MLPs connect every level together, resulting in high complexity: Increasing the number of trainable parameters by 100× does not even halve the MSE in the clear-sky case. This lack of inductive bias prevents generalization to more complex cases such as the multicloud regime, where the MSE climbs from ≈0.2–0.4 to ≈0.7–1.0 K2 day−2.
Following Ukkonen (2022), we divide the shortwave radiative fluxes by the incoming solar flux at the TOA, making the targets dimensionless.
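The vertical invariance and bidirectionality discussed above suggest sharing weights across levels while marching both down and up the column, which is exactly what a bidirectional recurrent layer does. A minimal sketch (ours; the tuned architecture resides in peter_brnn_architecture.py of the ml4rt repository) is:

```python
import tensorflow as tf

def build_brnn(n_levels=127, n_vars=22, units=64):
    """Bidirectional RNN over the vertical axis: one GRU pass downward (like
    the direct solar beam) and one upward (like reflected radiation), with
    weights shared across levels. Sizes are illustrative assumptions."""
    inp = tf.keras.Input(shape=(n_levels, n_vars))
    x = tf.keras.layers.Bidirectional(
        tf.keras.layers.GRU(units, return_sequences=True))(inp)
    out = tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(1))(x)
    return tf.keras.Model(inp, out)  # per-level heating rates

build_brnn().summary()  # a fraction of the parameters of a comparable U-Net
```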
c. Tropical precipitation and convective organization
1) Motivation
2) Setup
3) Model hierarchy
Fig. 4. Pareto-optimal model hierarchies underscore the importance of storm-resolving information in elucidating the relationship between precipitation and its surrounding environment, while also quantifying the recoverability of this information from the coarse environment’s time series. (left) NNs leveraging high-resolution spatial data (purple crosses) clearly outperform NNs that use only coarse inputs (orange crosses). However, this performance gap is largely mitigated when the coarse inputs’ past time steps are included (green crosses). (right) Processing the precipitable water field at a resolution of Δx ≈ 5 km yields coefficients of determination R2 ≈ 0.9, clearly surpassing the R2 ≈ 0.5 attained by our best NN using fields at the coarse Δx ≈ 100-km horizontal resolution. This performance gap is partially closed by incorporating two past time steps along with the current time step, resulting in R2 ≈ 0.7. This suggests a partial equivalence of the environment’s spatial and temporal connectivities in predicting precipitation.
4) Mitigating low spatial resolution with memory
While it may be unsurprising that leveraging spatiotemporal information decreases error, the competitive MSEs obtained with temporal memory but without high-resolution spatial information are promising for improving parameterizations. We ask: How can we explain that temporal memory helps recover a large portion of the spatial granularity information?
This induces a conditional dependence of precipitation
Now, we can ask: How much information is lost in the process of replacing the high-resolution spatial anomaly with temporal memory of the low-resolution spatial means? The answer depends on the distance between low and high resolutions. There is no difference if these resolutions are equal, while we do not expect low-resolution memory to help at all if the low-resolution grid is coarse to the point that we cannot distinguish very different high-resolution situations. For the resolutions considered here, the encoder–decoder architecture (purple crosses) achieves high performance with a bottleneck whose dimension is only two. This suggests that the high-resolution anomaly field can be represented by a bidimensional (latent) variable, which we denote org, for the purpose of modeling precipitation. The term org encapsulates the aspects of convective organization that explain why precipitation can be different (stochastic) for the same low-resolution inputs (Moseley et al. 2016). This simplifies the question to how well we can represent org from the time series of low-resolution inputs. The encouraging results of our models leveraging temporal memory, along with the finding of Shamekh et al. (2023) that org is accurately predicted by an autoregressive model informed by the coarse-scale environment’s past time steps, suggest a mostly positive response. This reassuringly confirms that Earth system models with limited spatial resolutions can realistically represent coarse-scale precipitation as long as efforts to improve precipitation parameterization continue (Schneider et al. 2024). In hybrid modeling applications, where the NN itself computes precipitation at previous time steps, error propagation may reduce the informativeness of temporal memory—a limitation worth exploring in future work. More broadly, our results confirm that Pareto-optimal model hierarchies are useful in empirically establishing the partial equivalence between temporal memory and spatial granularity. This has practical applications for multiscale dynamical systems that are not self-similar, where ergodicity cannot be used to deterministically infer detailed spatial information from coarse time-series data.
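To make the bottleneck idea explicit, a minimal sketch of such an encoder-decoder (ours; all layer sizes and input dimensions are illustrative assumptions, not the architecture of Shamekh et al. 2023) is:

```python
import tensorflow as tf

def precip_with_org(n_coarse=5, n_pixels=400, latent_dim=2):
    """Encode the high-resolution precipitable-water anomaly patch into a
    2D latent variable 'org', then combine it with coarse-scale means to
    diagnose precipitation."""
    coarse = tf.keras.Input(shape=(n_coarse,), name="coarse_means")
    anomaly = tf.keras.Input(shape=(n_pixels,), name="hires_anomaly")
    h = tf.keras.layers.Dense(128, activation="relu")(anomaly)
    org = tf.keras.layers.Dense(latent_dim, name="org")(h)   # the 2D bottleneck
    x = tf.keras.layers.Concatenate()([coarse, org])
    x = tf.keras.layers.Dense(64, activation="relu")(x)
    precip = tf.keras.layers.Dense(1, activation="softplus")(x)  # nonnegative
    return tf.keras.Model([coarse, anomaly], precip)
```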
5. Conclusions
In this study, we demonstrated that Pareto-optimal model hierarchies within a well-defined (complexity, error) plane not only guide the development of data-driven models—from simple equations to sophisticated neural networks—but also promote process understanding. While these hierarchies are derived empirically based on the models and data at hand, they serve as practical approximations for exploring trade-offs between error and complexity. To distill knowledge from these hierarchies, we propose a multistep approach: 1) use models along the Pareto front to hierarchically investigate the added value that leads to incremental error decreases from the simplest to the most complex Pareto-optimal model; 2) generate hypotheses and propose models tailored to the system of interest based on this added value; and 3) once the models are sparse enough to be interpretable, reveal the system’s underlying properties that previous theories or models may have overlooked. Beyond knowledge discovery, such hierarchies promote interpretability by explaining the added value of models that may initially seem overly complex and sustainability by optimizing models’ computational costs once their added value is justified.
We have chosen three weather and climate applications in realistic geography settings to showcase the potential of machine learning in bridging fundamental scientific discovery and increasing operational requirements. Each case demonstrates a different kind of discovery: data-driven equation discovery for cloud cover parameterization; physics-guided architectures, informed by the available Pareto-optimal solutions, that reflect the bidirectionality and vertical invariance of shortwave radiative transfer; and spatial information hidden within the time series of the coarse environment when diagnosing tropical precipitation. However, all three insights required retaining the full family of trained models and might have been overlooked had we focused on a single optimal model.
In all three cases, neural networks proved particularly advantageous as they can quickly explore large datasets and generate hypotheses about problems that are, or appear, high dimensional. We nonetheless emphasize that within this framework, deep learning is an integral part but not the ultimate goal of the knowledge generation process. Simpler models rivaling the accuracy of deep neural networks, although initially intractable, may emerge once the necessary functional representations, features, and spatiotemporal connectivities are distilled. From this perspective, a rekindled interest in multiobjective optimization and hierarchical thinking would open the door to extracting new, human-interpretable scientific knowledge from ever-expanding geoscientific data, while paving the way for the adoption of machine learning in operational applications by fostering informed trust in their predictions.
Acknowledgments.
T. Beucler acknowledges partial funding from the Swiss State Secretariat for Education, Research and Innovation (SERI) for the Horizon Europe project AI4PEX (Grant agreement: 101137682). A. Grundner acknowledges funding by the European Research Council (ERC) Synergy Grant “Understanding and Modeling the Earth System with Machine Learning (USMILE)” under the Horizon 2020 research and innovation program (Grant agreement 855187). R. Lagerquist acknowledges the support from the NOAA Global Systems Laboratory, Cooperative Institute for Research in the Atmosphere, and NOAA Award NA19OAR4320073. S. Shamekh acknowledges the support provided by Schmidt Sciences, LLC, and NSF funding from the LEAP Science and Technology Center Grant (2019625-STC).
Data availability statement.
The reduced data and code necessary to reproduce this manuscript’s figures are available in the following GitHub repository: https://github.com/tbeucler/2024_Pareto_Distillation. The release for this manuscript is archived via Zenodo using the following DOI: https://zenodo.org/records/13217736. The cloud cover schemes and analysis code can be found at https://github.com/EyringMLClimateGroup/grundner23james_EquationDiscovery_CloudCover/tree/main and are preserved at https://doi.org/10.5281/zenodo.8220333. The coarse‐grained model output used to train and evaluate the data-driven models amounts to several TB and can be reconstructed with the scripts provided in the GitHub repository. This work used resources of the Deutsches Klimarechenzentrum (DKRZ) granted by its Scientific Steering Committee (WLA) under project ID bd1179. For radiative transfer, we use the ml4rt Python library (https://github.com/thunderhoser/ml4rt), with version 4.0.1 preserved at https://doi.org/10.5281/zenodo.13160776. The dataset for radiative transfer is stored at https://doi.org/10.5281/zenodo.13159877. The BRNN architecture used can be created with the file peter_brnn_architecture.py; all other architectures can be created with the scripts named pareto2024_make_*_templates.py. The precipitation example uses “Precip-org” (https://github.com/Sshamekh/Precip-org), a Python repository managed by Sara Shamekh. DYAMOND data management was provided by the DKRZ and supported through the projects ESiWACE and ESiWACE2. The full data are available on the DKRZ HPC system through the DYAMOND initiative (https://www.esiwace.eu/services/dyamond-initiative).
APPENDIX A
Calibrated Parameters for Cloud Cover Parameterization
In this appendix, we provide the values of the calibrated parameters in this manuscript’s equations.
For the Sundqvist scheme [Eq. (10)], RHsurf = 0.55, RHtop = 0.01, RHsat = 0.9/0.95 (over land/sea), and n = 2.12.
For the two-feature linear regression [Eq. (11)], α2F = 1.027, RH2F = 0.827, and ΔT2F = 198.6 K.
For the four-feature linear regression [Eq. (14)], α1 = 1.861, RH1 = 1.898, ΔT1 = 265.9 K, q1 = 0.3775 g kg−1, and H1 = 104.3 m.
For the quadratic model [Eq. (15)], α2 = 2.070, RH2,1 = 2.021, ΔT2,1 = 259.7 K, q2,1 = 0.4741 g kg−1, q2,2 = 3.161 g kg−1, q2,3 = 0.1034 g kg−1, H2 = 665.6 m, RH2,2 = 2.428, and ΔT2,2 = 206.0 K.
For the Wang scheme, αWang = 0.9105 and βWang = 914 × 10³.
For the PySR equation, HPySR = 585 m.
For the GP-GOMEA equation, αG = 66, βG = 1.36 × 10−4, RHG = 11.5%, γG = 0.194, qG,i = 4.367 mg kg−1, and δG = 0.774.
APPENDIX B
Features for Shortwave Radiative Transfer Emulation
The 22 features Xη with a vertical profile are temperature T (K), pressure p (Pa), specific humidity q (kg kg−1), RH (%), liquid water content (LWC) (kg m−3), ice water content (IWC) (kg m−3), downward and upward liquid water path LWP↓ and LWP↑ (kg m−2), downward and upward ice water path IWP↓ and IWP↑ (kg m−2), downward and upward water vapor path WVP↓ and WVP↑ (kg m−2), ozone (O3) mixing ratio (kg kg−1), height above ground level z (m), height layer thickness Δz (m), pressure layer thickness Δp (Pa), liquid effective radius
REFERENCES
Arias, P., and Coauthors, 2021: Climate Change 2021: The Physical Science Basis. Cambridge University Press, 2391 pp.
Balaji, V., 2021: Climbing down Charney’s ladder: Machine learning and the post-Dennard era of computational climate science. Philos. Trans. Roy. Soc., A379, 20200085, https://doi.org/10.1098/rsta.2020.0085.
Balaji, V., and Coauthors, 2017: CPMIP: Measurements of real computational performance of Earth system models in CMIP6. Geosci. Model Dev., 10, 19–34, https://doi.org/10.5194/gmd-10-19-2017.
Balaji, V., F. Couvreux, J. Deshayes, J. Gautrais, F. Hourdin, and C. Rio, 2022: Are general circulation models obsolete? Proc. Natl. Acad. Sci. USA, 119, e2202075119, https://doi.org/10.1073/pnas.2202075119.
Bao, J., B. Stevens, L. Kluft, and C. Muller, 2024: Intensification of daily tropical precipitation extremes from more organized convection. Sci. Adv., 10, eadj6801, https://doi.org/10.1126/sciadv.adj6801.
Bartlett, P. L., O. Bousquet, and S. Mendelson, 2005: Local Rademacher complexities. arXiv, math/0508275v1, https://doi.org/10.48550/arXiv.math/0508275.
Belochitski, A., and V. Krasnopolsky, 2021: Robustness of neural network emulations of radiative transfer parameterizations in a state-of-the-art general circulation model. Geosci. Model Dev., 14, 7425–7437, https://doi.org/10.5194/gmd-14-7425-2021.
Ben-Bouallegue, Z., and Coauthors, 2023: The rise of data-driven weather forecasting. arXiv, 2307.10128v2, https://doi.org/10.48550/arXiv.2307.10128.
Bertoli, G., F. Ozdemir, S. Schemm, and F. Perez-Cruz, 2023: Revisiting machine learning approaches for short- and longwave radiation inference in weather and climate models, part I: Offline performance. ESS Open Archive, https://doi.org/10.22541/essoar.169109567.78839949/v1
Bi, K., L. Xie, H. Zhang, X. Chen, X. Gu, and Q. Tian, 2022: Pangu-Weather: A 3D high-resolution model for fast and accurate global weather forecast. arXiv, 2211.02556v1, https://doi.org/10.48550/arXiv.2211.02556.
Bony, S., and Coauthors, 2013: Carbon dioxide and climate: Perspectives on a scientific assessment. Climate Science for Serving Society: Research, Modeling and Prediction Priorities, G. Asrar and J. Hurrell, Eds., Springer, 391–413.
Bony, S., and Coauthors, 2015: Clouds, circulation and climate sensitivity. Nat. Geosci., 8, 261–268, https://doi.org/10.1038/ngeo2398.
Bröcker, J., 2009: Reliability, sufficiency, and the decomposition of proper scores. Quart. J. Roy. Meteor. Soc., 135, 1512–1519, https://doi.org/10.1002/qj.456.
Buhrmester, V., D. Münch, and M. Arens, 2021: Analysis of explainers of black box deep neural networks for computer vision: A survey. Mach. Learn. Knowl. Extr., 3, 966–989, https://doi.org/10.3390/make3040048.
Censor, Y., 1977: Pareto optimality in multiobjective problems. Appl. Math. Optim., 4, 41–59, https://doi.org/10.1007/BF01442131.
Charney, J. G., and Coauthors, 1979: Carbon dioxide and climate: A scientific assessment. National Academy of Sciences Rep., 18 pp., https://geosci.uchicago.edu/∼archer/warming_papers/charney.1979.report.pdf.
Cheruy, F., F. Chevallier, J.-J. Morcrette, N. A. Scott, and A. Chédin, 1996: Une méthode utilisant les techniques neuronales pour le calcul rapide de la distribution verticale du bilan radiatif thermique terrestre. C. R. Acad. Sci. II, 322, 665–672.
Chevallier, F., F. Chéruy, N. A. Scott, and A. Chédin, 1998: A neural network approach for a fast and accurate computation of a longwave radiative budget. J. Appl. Meteor., 37, 1385–1397, https://doi.org/10.1175/1520-0450(1998)037<1385:ANNAFA>2.0.CO;2.
Chevallier, F., J.-J. Morcrette, F. Chéruy, and N. Scott, 2000: Use of a neural‐network‐based long‐wave radiative‐transfer scheme in the ECMWF atmospheric model. Quart. J. Roy. Meteor. Soc., 126, 761–776, https://doi.org/10.1002/qj.49712656318.
Clare, M. C., O. Jamil, and C. J. Morcrette, 2021: Combining distribution‐based neural networks to predict weather forecast probabilities. Quart. J. Roy. Meteor. Soc., 147, 4337–4357, https://doi.org/10.1002/qj.4180.
Clough, S. A., M. J. Iacono, and J.-L. Moncet, 1992: Line‐by‐line calculations of atmospheric fluxes and cooling rates: Application to water vapor. J. Geophys. Res., 97, 15 761–15 785, https://doi.org/10.1029/92JD01419.
Colin, M., S. Sherwood, O. Geoffroy, S. Bony, and D. Fuchs, 2019: Identifying the sources of convective memory in cloud-resolving simulations. J. Atmos. Sci., 76, 947–962, https://doi.org/10.1175/JAS-D-18-0036.1.
Cranmer, M., 2023: Interpretable machine learning for science with PySR and symbolicregression.jl. arXiv, 2305.01582v3, https://doi.org/10.48550/arXiv.2305.01582.
Das, A., and P. Rad, 2020: Opportunities and challenges in Explainable Artificial Intelligence (XAI): A survey. arXiv, 2006.11371v2, https://doi.org/10.48550/arXiv.2006.11371.
Duras, J., F. Ziemen, and D. Klocke, 2021: The DYAMOND Winter data collection. EGU General Assembly 2021, Online, European Geosciences Union, Abstracts EGU21-4687.
Fisher, A., C. Rudin, and F. Dominici, 2019: All models are wrong, but many are useful: Learning a variable’s importance by studying an entire class of prediction models simultaneously. J. Mach. Learn. Res., 20, 1–81.
Gentine, P., V. Eyring, and T. Beucler, 2021: Deep learning for the parametrization of subgrid processes in climate models. Deep Learning for the Earth Sciences: A Comprehensive Approach to Remote Sensing, Climate Science, and Geosciences, G. Camps-Valls et al., Eds., Wiley, 307–314.
Giorgetta, M. A., and Coauthors, 2018: ICON‐A, the atmosphere component of the ICON Earth system model: I. Model description. J. Adv. Model. Earth Syst., 10, 1613–1637, https://doi.org/10.1029/2017MS001242.
Gneiting, T., and A. E. Raftery, 2007: Strictly proper scoring rules, prediction, and estimation. J. Amer. Stat. Assoc., 102, 359–378, https://doi.org/10.1198/016214506000001437.
Grundner, A., T. Beucler, P. Gentine, F. Iglesias-Suarez, M. A. Giorgetta, and V. Eyring, 2022: Deep learning based cloud cover parameterization for ICON. J. Adv. Model. Earth Syst., 14, e2021MS002959, https://doi.org/10.1029/2021MS002959.
Grundner, A., T. Beucler, P. Gentine, and V. Eyring, 2024: Data‐driven equation discovery of a cloud cover parameterization. J. Adv. Model. Earth Syst., 16, e2023MS003763, https://doi.org/10.1029/2023MS003763.
Haynes, K., R. Lagerquist, M. McGraw, K. Musgrave, and I. Ebert-Uphoff, 2023: Creating and evaluating uncertainty estimates with neural networks for environmental-science applications. Artif. Intell. Earth Syst., 2, 220061, https://doi.org/10.1175/AIES-D-22-0061.1.
Held, I. M., 2005: The gap between simulation and understanding in climate modeling. Bull. Amer. Meteor. Soc., 86, 1609–1614, https://doi.org/10.1175/BAMS-86-11-1609.
Hertel, L., J. Collado, P. Sadowski, J. Ott, and P. Baldi, 2020: Sherpa: Robust hyperparameter optimization for machine learning. SoftwareX, 12, 100591, https://doi.org/10.1016/j.softx.2020.100591.
Hogan, R. J., and A. Bozzo, 2018: A flexible and efficient radiation scheme for the ECMWF model. J. Adv. Model. Earth Syst., 10, 1990–2008, https://doi.org/10.1029/2018MS001364.
Jakhar, K., Y. Guan, R. Mojgani, A. Chattopadhyay, and P. Hassanzadeh, 2024: Learning closed-form equations for subgrid-scale closures from high-fidelity data: Promises and challenges. J. Adv. Model. Earth Syst., 16, e2023MS003874, https://doi.org/10.1029/2023MS003874.
Jeevanjee, N., P. Hassanzadeh, S. Hill, and A. Sheshadri, 2017: A perspective on climate model hierarchies. J. Adv. Model. Earth Syst., 9, 1760–1771, https://doi.org/10.1002/2017MS001038.
Keisler, R., 2022: Forecasting global weather with graph neural networks. arXiv, 2202.07575v1, https://doi.org/10.48550/arXiv.2202.07575.
Khairoutdinov, M. F., P. N. Blossey, and C. S. Bretherton, 2022: Global system for atmospheric modeling: Model description and preliminary results. J. Adv. Model. Earth Syst., 14, e2021MS002968, https://doi.org/10.1029/2021MS002968.
Kim, P. S., and H.-J. Song, 2022: Usefulness of automatic hyperparameter optimization in developing radiation emulator in a numerical weather prediction model. Atmosphere, 13, 721, https://doi.org/10.3390/atmos13050721.
Kingma, D. P., and J. Ba, 2014: Adam: A method for stochastic optimization. arXiv, 1412.6980v9, https://doi.org/10.48550/arXiv.1412.6980.
Koza, J. R., 1994: Genetic programming as a means for programming computers by natural selection. Stat. Comput., 4, 87–112, https://doi.org/10.1007/BF00175355.
La Cava, W., B. Burlacu, M. Virgolin, M. Kommenda, P. Orzechowski, F. O. de França, Y. Jin, and J. H. Moore, 2021: Contemporary symbolic regression methods and their relative performance. 35th Conf. on Neural Information Processing Systems (NeurIPS 2021) Track on Datasets and Benchmarks, Online, Curran Associates Inc., 1–16, https://cavalab.org/assets/papers/La%20Cava%20et%20al.%20-%202021%20-%20Contemporary%20Symbolic%20Regression%20Methods%20and%20their.pdf.
Lagerquist, R., D. Turner, I. Ebert-Uphoff, J. Stewart, and V. Hagerty, 2021: Using deep learning to emulate and accelerate a radiative transfer model. J. Atmos. Oceanic Technol., 38, 1673–1696, https://doi.org/10.1175/JTECH-D-21-0007.1.
Lagerquist, R., D. D. Turner, I. Ebert-Uphoff, and J. Q. Stewart, 2023: Estimating full longwave and shortwave radiative transfer with neural networks of varying complexity. J. Atmos. Oceanic Technol., 40, 1407–1432, https://doi.org/10.1175/JTECH-D-23-0012.1.
Lam, R., and Coauthors, 2022: GraphCast: Learning skillful medium-range global weather forecasting. arXiv, 2212.12794v2, https://doi.org/10.48550/arXiv.2212.12794.
Lang, S., and Coauthors, 2024: AIFS—ECMWF’S data-driven forecasting system. arXiv, 2406.01465v2, https://doi.org/10.48550/arXiv.2406.01465.
Lin, X., H.-L. Zhen, Z. Li, Q.-F. Zhang, and S. Kwong, 2019: Pareto multi-task learning. 33rd Conf. on Neural Information Processing Systems (NeurIPS 2019), Vancouver, British Columbia, Canada, Curran Associates Inc., 12060–12070, https://proceedings.neurips.cc/paper_files/paper/2019/file/685bfde03eb646c27ed565881917c71c-Paper.pdf.
Liou, K.-N., 2002: An Introduction to Atmospheric Radiation. Vol. 84. Elsevier, 583 pp.
Lucarini, V., and M. D. Chekroun, 2023: Theoretical tools for understanding the climate crisis from Hasselmann’s programme and beyond. Nat. Rev. Phys., 5, 744–765, https://doi.org/10.1038/s42254-023-00650-8.
Maas, A. L., A. Y. Hannun, and A. Y. Ng, 2013: Rectifier nonlinearities improve neural network acoustic models. ICML 2013 Workshop on Deep Learning for Audio, Speech and Language Processing, Atlanta, GA, International Machine Learning Society, https://robotics.stanford.edu/~amaas/papers/relu_hybrid_icml2013_final.pdf.
Maher, P., and Coauthors, 2019: Model hierarchies for understanding atmospheric circulation. Rev. Geophys., 57, 250–280, https://doi.org/10.1029/2018RG000607.
Mamalakis, A., E. A. Barnes, and I. Ebert-Uphoff, 2022: Investigating the fidelity of explainable artificial intelligence methods for applications of convolutional neural networks in geoscience. Artif. Intell. Earth Syst., 1, e220012, https://doi.org/10.1175/AIES-D-22-0012.1.
Mamalakis, A., E. A. Barnes, and I. Ebert-Uphoff, 2023: Carefully choose the baseline: Lessons learned from applying XAI attribution methods for regression tasks in geoscience. Artif. Intell. Earth Syst., 2, e220058, https://doi.org/10.1175/AIES-D-22-0058.1.
Mansfield, L. A., A. Gupta, A. C. Burnett, B. Green, C. Wilka, and A. Sheshadri, 2023: Updates on model hierarchies for understanding and simulating the climate system: A focus on data‐informed methods and climate change impacts. J. Adv. Model. Earth Syst., 15, e2023MS003715, https://doi.org/10.1029/2023MS003715.
Mauritsen, T., and Coauthors, 2019: Developments in the MPI-M Earth System Model version 1.2 (MPI-ESM1.2) and its response to increasing CO2. J. Adv. Model. Earth Syst., 11, 998–1038, https://doi.org/10.1029/2018MS001400.
Miettinen, K., 1999: Nonlinear Multiobjective Optimization. Vol. 12. Springer, 298 pp.
Mlawer, E. J., S. J. Taubman, P. D. Brown, M. J. Iacono, and S. A. Clough, 1997: Radiative transfer for inhomogeneous atmospheres: RRTM, a validated correlated‐k model for the longwave. J. Geophys. Res., 102, 16 663–16 682, https://doi.org/10.1029/97JD00237.
Molnar, C., 2020: Interpretable Machine Learning. Lulu.com, 320 pp.
Morcrette, J.-J., 2000: On the effects of the temporal and spatial sampling of radiation fields on the ECMWF forecasts and analyses. Mon. Wea. Rev., 128, 876–887, https://doi.org/10.1175/1520-0493(2000)128<0876:OTEOTT>2.0.CO;2.
Morrison, H., and Coauthors, 2020: Confronting the challenge of modeling cloud and precipitation microphysics. J. Adv. Model. Earth Syst., 12, e2019MS001689, https://doi.org/10.1029/2019MS001689.
Moseley, C., C. Hohenegger, P. Berg, and J. O. Haerter, 2016: Intensification of convective extremes driven by cloud–cloud interaction. Nat. Geosci., 9, 748–752, https://doi.org/10.1038/ngeo2789.
Nowack, P., J. Runge, V. Eyring, and J. D. Haigh, 2020: Causal networks for climate model evaluation and constrained projections. Nat. Commun., 11, 1415, https://doi.org/10.1038/s41467-020-15195-y.
Pathak, J., and Coauthors, 2022: FourCastNet: A global data-driven high-resolution weather model using adaptive Fourier neural operators. arXiv, 2202.11214v1, https://doi.org/10.48550/arXiv.2202.11214.
Pearl, J., 2000: Causality: Models, Reasoning and Inference. Cambridge University Press, 384 pp.
Petersen, B. K., M. Landajuela, T. N. Mundhenk, C. P. Santiago, S. K. Kim, and J. T. Kim, 2019: Deep symbolic regression: Recovering mathematical expressions from data via risk-seeking policy gradients. arXiv, 1912.04871v4, https://doi.org/10.48550/arXiv.1912.04871.
Rasp, S., and N. Thuerey, 2021: Data‐driven medium‐range weather prediction with a Resnet pretrained on climate simulations: A new model for WeatherBench. J. Adv. Model. Earth Syst., 13, e2020MS002405, https://doi.org/10.1029/2020MS002405.
Rasp, S., P. D. Dueben, S. Scher, J. A. Weyn, S. Mouatadid, and N. Thuerey, 2020: WeatherBench: A benchmark data set for data‐driven weather forecasting. J. Adv. Model. Earth Syst., 12, e2020MS002203, https://doi.org/10.1029/2020MS002203.
Robertson, A. W., and M. Ghil, 2000: Solving problems with GCMs: General circulation models and their role in the climate modeling hierarchy. International Geophysics, Vol. 70, Academic Press, 40 pp.
Roeckner, E., and Coauthors, 1996: The atmospheric general circulation model ECHAM-4: Model description and simulation of present-day climate. Max-Planck-Institut für Meteorologie Rep. 218, 94 pp., https://pure.mpg.de/rest/items/item_1781494/component/file_1786328/content.
Ronneberger, O., P. Fischer, and T. Brox, 2015: U-net: Convolutional networks for biomedical image segmentation. Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015, N. Navab et al., Eds., Springer, 234–241.
Rudin, C., 2019: Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell., 1, 206–215, https://doi.org/10.1038/s42256-019-0048-x.
Scher, S., and G. Messori, 2021: Ensemble methods for neural network‐based weather forecasts. J. Adv. Model. Earth Syst., 13, e2020MS002331, https://doi.org/10.1029/2020MS002331.
Schmidt, M., and H. Lipson, 2009: Distilling free-form natural laws from experimental data. Science, 324, 81–85, https://doi.org/10.1126/science.1165893.
Schneider, T., J. Teixeira, C. S. Bretherton, F. Brient, K. G. Pressel, C. Schär, and A. P. Siebesma, 2017: Climate goals and computing the future of clouds. Nat. Climate Change, 7, 3–5, https://doi.org/10.1038/nclimate3190.
Schneider, T., L. R. Leung, and R. C. J. Wills, 2024: Opinion: Optimizing climate models with process knowledge, resolution, and artificial intelligence. Atmos. Chem. Phys., 24, 7041–7062, https://doi.org/10.5194/acp-24-7041-2024.
Schneiderman, H., 2024: Incorporation of physical equations into a neural network structure for predicting shortwave radiative heat transfer. 23rd Conf. on Artificial Intelligence for Environmental Science, Baltimore, MD, Amer. Meteor. Soc, J5A.5, https://ams.confex.com/ams/104ANNUAL/webprogram/Paper431981.html.
Seneviratne, S. I., and Coauthors, 2021: Weather and climate extreme events in a changing climate. Climate Change 2021: The Physical Science Basis, V. Masson-Delmotte et al., Eds., Cambridge University Press, 1513–1766, https://doi.org/10.1017/9781009157896.013.
Shamekh, S., K. D. Lamb, Y. Huang, and P. Gentine, 2023: Implicit learning of convective organization explains precipitation stochasticity. Proc. Natl. Acad. Sci. USA, 120, e2216158120, https://doi.org/10.1073/pnas.2216158120.
Sherwood, S. C., S. Bony, and J.-L. Dufresne, 2014: Spread in model climate sensitivity traced to atmospheric convective mixing. Nature, 505, 37–42, https://doi.org/10.1038/nature12829.
Stevens, B., and Coauthors, 2019: DYAMOND: The DYnamics of the Atmospheric general circulation Modeled on Non-hydrostatic Domains. Prog. Earth Planet. Sci., 6, 61, https://doi.org/10.1186/s40645-019-0304-z.
Stevens, B., and Coauthors, 2020: The added value of large-eddy and storm-resolving models for simulating clouds and precipitation. J. Meteor. Soc. Japan, 98, 395–435, https://doi.org/10.2151/jmsj.2020-021.
Sundqvist, H., E. Berge, and J. E. Kristjánsson, 1989: Condensation and cloud parameterization studies with a mesoscale numerical weather prediction model. Mon. Wea. Rev., 117, 1641–1657, https://doi.org/10.1175/1520-0493(1989)117%3C1641:CACPSW%3E2.0.CO;2.
Ukkonen, P., 2022: Exploring pathways to more accurate machine learning emulation of atmospheric radiative transfer. J. Adv. Model. Earth Syst., 14, e2021MS002875, https://doi.org/10.1029/2021MS002875.
Ukkonen, P., and M. Chantry, 2024: Representing sub-grid processes in weather and climate models via sequence learning. ESS Open Archive, https://doi.org/10.22541/essoar.172098075.51621106/v1.
Vapnik, V. N., and A. Y. Chervonenkis, 2015: On the uniform convergence of relative frequencies of events to their probabilities. Measures of Complexity: Festschrift for Alexey Chervonenkis, V. Vovk, H. Papadopoulos, and A. Gammerman, Eds., Springer, 11–30.
Veerman, M. A., R. Pincus, R. Stoffer, C. M. Van Leeuwen, D. Podareanu, and C. C. Van Heerwaarden, 2021: Predicting atmospheric optical properties for radiative transfer computations using neural networks. Philos. Trans. Roy. Soc., A379, 20200095, https://doi.org/10.1098/rsta.2020.0095.
Virgolin, M., T. Alderliesten, C. Witteveen, and P. A. N. Bosman, 2021: Improving model-based genetic programming for symbolic regression of small expressions. Evol. Comput., 29, 211–237, https://doi.org/10.1162/evco_a_00278.
Wang, Y., S. Yang, G. Chen, Q. Bao, and J. Li, 2023: Evaluating two diagnostic schemes of cloud-fraction parameterization using the CloudSat data. Atmos. Res., 282, 106510, https://doi.org/10.1016/j.atmosres.2022.106510.
Weyn, J. A., D. R. Durran, and R. Caruana, 2019: Can machines learn to predict weather? Using deep learning to predict gridded 500‐hPa geopotential height from historical weather data. J. Adv. Model. Earth Syst., 11, 2680–2693, https://doi.org/10.1029/2019MS001705.
Weyn, J. A., D. R. Durran, and R. Caruana, 2020: Improving data‐driven global weather prediction using deep convolutional neural networks on a cubed sphere. J. Adv. Model. Earth Syst., 12, e2020MS002109, https://doi.org/10.1029/2020MS002109.
Xu, K.-M., and D. A. Randall, 1996: A semiempirical cloudiness parameterization for use in climate models. J. Atmos. Sci., 53, 3084–3102, https://doi.org/10.1175/1520-0469(1996)053<3084:ASCPFU>2.0.CO;2.
Zhou, Z., M. M. R. Siddiquee, N. Tajbakhsh, and J. Liang, 2020: UNet++: Redesigning skip connections to exploit multiscale features in image segmentation. IEEE Trans. Med. Imaging, 39, 1856–1867, https://doi.org/10.1109/TMI.2019.2959609.