1. Introduction
The U.S. Department of Energy (DOE) recently concluded a workshop on Artificial Intelligence for Earth System Predictability (AI4ESP; Hickmon et al. 2022). This workshop was hosted by the DOE’s Office of Science, Biological and Environmental Research (BER) and Advanced Scientific Computing Research (ASCR) Programs (all acronyms used in this paper are collected and defined in the appendix). Across 17 sessions, researchers from around the world discussed how artificial intelligence (AI) could enhance Earth system predictability across field, laboratory, modeling, and analysis activities (Hoffman et al. 2017, Fig. 1.3). The primary focus of the discussion was on using AI to transform BER’s model-experimentation (ModEx) integration (Chambers et al. 2012, p. 93).
Traditionally, the ModEx paradigm (Hoffman et al. 2017, section 1) integrates observations, experiments, and measurements performed in the field or laboratory with conceptual/process models in an iterative fashion. Recent advances in AI have shown promise for improving the efficiency of this traditional ModEx loop (Tsai et al. 2021; Cromwell et al. 2021; Mudunuru et al. 2022). Such an AI transformation of the ModEx loop is needed to efficiently and accurately integrate the DOE’s observational capabilities and platforms,1 process models and software infrastructure,2 and computational hardware.3 However, achieving this AI-enabled ModEx vision requires significant advancements in codesign and associated AI architectures (Germann et al. 2013; Zhang et al. 2019; Beckman et al. 2020; Descour et al. 2021; Bringmann et al. 2021). Codesign, as described by Pao (2011), Barrett et al. (2013), and Germann (2021), refers to a computer system design process in which scientific problem requirements influence architecture design and technology, and in which technology constraints inform the formulation and design of algorithms and software. Codesign holistically weighs fundamental trade-offs among 1) hardware and architecture, 2) software stacks, 3) numerical methods and algorithms, and 4) science applications. This paper provides perspectives on the AI architectures and codesign approaches needed to develop AI-enabled ModEx for Earth system predictability. These perspectives include codesigning computational and storage infrastructure for automated ML feature engineering and model selection, and integrating sensors, process models, and ML methods for efficient data assimilation. We also provide futuristic system ideas on codesigning frameworks and platforms to enable the BER community to accelerate the application of AI architectures in the ModEx life cycle.
The outline of our paper is as follows: Section 2 presents the state of the science on AI architectures and codesign as discussed by AI4ESP workshop participants. Section 3 presents four futuristic system concepts, and section 4 discusses the grand challenges in realizing them, along with near-, middle-, and long-term goals to overcome these challenges. Section 5 provides perspectives on potential research that will provide synergy with other AI4ESP workshop sessions. Conclusions are drawn in section 6.
2. State of the science
In this section, we describe the state of the science on AI architectures and codesign. The foci are the computing resources and DOE user facilities used to capture and curate data, develop advanced AI/ML models, and perform inference for quantifying and improving the predictability of Earth system modeling and simulation.
a. DOE’s high-performance computing user facilities
Over the past few decades, DOE has invested hundreds of millions of dollars in developing high-performance computing (HPC) user facilities (Stevens et al. 2020; Vetter et al. 2018; Heroux et al. 2020). DOE’s investments toward exascale computing include the Leadership Computing Facilities (LCFs) at Argonne National Laboratory [Argonne Leadership Computing Facility (ALCF); e.g., Aurora] and Oak Ridge National Laboratory [Oak Ridge Leadership Computing Facility (OLCF); e.g., Frontier], as well as the National Energy Research Scientific Computing Center (NERSC; e.g., Perlmutter). The LCFs provide the computational science community with world-class computing capability for breakthrough science and engineering. Frontier is ranked the fastest supercomputing system on the November 2022 TOP500 list (Prometeus 2022). The latest generations of DOE’s leadership-class computing facilities are based on integrating central processing unit (CPU) and graphics processing unit (GPU) processors into heterogeneous systems. Concurrently, DOE’s Biological and Environmental Research Program has invested substantial resources in state-of-the-art scientific models (U.S. DOE 2022; Lichtner et al. 2020; Coon et al. 2022), including the flagship Energy Exascale Earth System Model (E3SM; U.S. DOE 2022), which is specifically designed to target efficient utilization of exascale supercomputers. These HPC resources have significantly improved model predictability in various areas, including Earth system modeling and subsurface flow and transport (e.g., E3SM, PFLOTRAN). As part of the DOE’s Exascale Computing Project (ECP), a selected subset of Earth science applications (Steefel 2022; Taylor 2022) focused firmly on model development for the exascale era. Furthermore, efforts such as the E3SM multiscale modeling framework (E3SM-MMF) subproject (Taylor 2022) under ECP included targeted codesign activities, including strong engagements with vendors on early architecture evaluation and algorithm design. Experience from such efforts indicates the need to expand to AI architecture codesign and to increase coverage of Earth science applications. These advancements enabled energy-efficient performance on GPUs while leveraging the commercial drivers for GPU-based AI/ML performance.
With the slowing of Moore’s law (Eeckhout 2017; Theis and Wong 2017), the computing community has recognized an increased need for architectural specialization. Hence, the next generation of HPC systems is likely to incorporate increased heterogeneity beyond the current hybrid CPU and GPU designs. The DOE’s efforts in AI for Science (Baker et al. 2019; Stevens et al. 2020) are exploring capabilities that provide a foundation for integrating HPC applications (e.g., ALCF’s AI testbeds; Argonne Leadership Computing Facility 2022) with data science and AI/ML frameworks.
b. Cloud computing
Cloud providers4 have user-friendly tools to run AI/ML workloads. However, AI/ML tool capabilities and user interfaces differ among providers, which makes it difficult to achieve interoperability in a federation of clouds (Chouhan et al. 2020; Rosa et al. 2021; Saxena et al. 2021). While specific Earth system model (ESM) data are presently stored on cloud storage systems (Xu et al. 2019), the data stores are associated with a patchwork of individual groups and projects, lacking a federated view. Cloud providers can presently accommodate petabytes to exabytes of data storage. Commercial cloud costs are based on accessing, computing on, or analyzing the data and can become prohibitively expensive if data transmission into or out of the cloud is frequent. Commercial AI/ML cloud infrastructure and services are predominately motivated by text and image data, and cloud providers have demonstrated AI at scale for these applications. For example, the largest AI-based natural language processing (NLP) models (e.g., for sequence data analysis), approaching 1 trillion parameters, have been demonstrated on Selene (Chen et al. 2019), the ninth fastest supercomputing system on the November 2022 TOP500 list (Prometeus 2022). Workflow services exist on the cloud for specific applications, including many AI/ML methods, and the building blocks are available on cloud platforms to create more complex workflows. However, ESM workflows that combine external data sources or coordinate with HPC simulations efficiently and accurately do not currently exist, and computer science expertise is required to create such workflows in a form suitable for domain scientists (Chen et al. 2017; Bauer et al. 2021).
c. Edge computing
Recently, AI methods for pattern classification, anomaly detection, unsupervised learning for data compression, inference at the edge, and continuous learning with streaming sensor data have gained considerable traction in the ESM community (Beckman et al. 2020; Talsma et al. 2022). This advancement was made possible by the rapid forward deployment of AI models on intelligent computing devices such as the Raspberry Pi/Shake, Nvidia Jetson Nano, Google Coral Dev Board, and Intel Neural Compute Stick connected to sensors (Catlett et al. 2017, 2020; Mudunuru et al. 2021). The integration of edge computing with smart sensors (e.g., AI@SensorEdge) has many distinct deployment scenarios, including processing National Oceanic and Atmospheric Administration (NOAA) and National Aeronautics and Space Administration (NASA) Earth-observing satellite imagery at the edge, in space or at dedicated ground stations, to control DOE’s Atmospheric Radiation Measurement (ARM) or Environmental Molecular Sciences Laboratory (EMSL) user facility instruments (Beckman et al. 2020). We can also integrate edge computing with the diverse collection of distributed sensors that collect observations and measurements for the DOE’s ARM user facility. Adaptive sensors with embedded hardware accelerators are now emerging (e.g., Waggle and PurpleAir) (Beckman et al. 2016; Stavroulas et al. 2020; Barkjohn et al. 2021). New concepts for distributed applications are also under development, such as geomorphic computing, where weather research and forecasting models are distributed, federated, and able to adapt dynamically to the environment (Daepp et al. 2022).
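To make the edge-inference idea concrete, the following minimal Python sketch flags anomalies in a streaming sensor feed with a rolling z-score, the kind of lightweight test that fits on a Raspberry Pi-class device; the window length, threshold, and the `read_sensor`/`transmit_alert` stubs are illustrative assumptions rather than part of any deployed ARM or Waggle workflow.

```python
from collections import deque
import math

WINDOW = 256        # samples kept in device memory (assumed)
THRESHOLD = 4.0     # z-score above which a reading is flagged (assumed)

def make_detector(window=WINDOW, threshold=THRESHOLD):
    """Return a closure that scores one streaming sample at a time."""
    buf = deque(maxlen=window)

    def score(x):
        # Warm-up: collect enough history before scoring.
        if len(buf) < window:
            buf.append(x)
            return False
        mean = sum(buf) / len(buf)
        var = sum((v - mean) ** 2 for v in buf) / len(buf)
        z = abs(x - mean) / math.sqrt(var + 1e-12)
        buf.append(x)
        return z > threshold

    return score

# Usage with hypothetical device-side helpers:
# detect = make_detector()
# for reading in read_sensor():          # e.g., river stage or CO2 concentration
#     if detect(reading):
#         transmit_alert(reading)        # only anomalies leave the device
```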
3. Future system concepts
In this section, we describe several plausible future system concepts that participants discussed in the AI4ESP workshop breakout sessions. The focus was on the evolution of DOE’s Leadership Computing Facility systems for HPC and AI. These large-scale heterogeneous computing systems provide a foundation for advancing AI architectures and codesign using HPC. Moreover, these future concepts have the potential to provide a radically different approach to future Earth system modeling and AI-enabled ModEx.
a. Centralized large-scale HPC concept
The baseline system concept is the future evolution of large-scale HPC and cloud computing systems. This next step will extend postexascale architectures beyond the first generation of DOE’s heterogeneous systems integrating CPUs and GPUs. As the HPC and cloud computing communities increasingly rely on hardware specialization to improve performance, codesign approaches will support the development of accelerators (Lie 2021; Reuther et al. 2021; Cortés et al. 2021) for frequently used kernels in scientific modeling and AI/ML methods. New specialized accelerators may arise to support additional data science capabilities such as uncertainty quantification, streaming analytics, or graph analysis (Halappanavar et al. 2021; Acer et al. 2021). These future large-scale computing systems with extreme heterogeneity must be codesigned to support the increased computational and dataset sizes associated with Earth science predictability and scientific machine reasoning (Yang et al. 2016; Zhang et al. 2020; Yu et al. 2022).
b. Edge sensors with centralized HPC/cloud resources concept
In the second system concept, environmental data are recorded from a broad collection of point (Christensen and Blanco Chia 2017; Winter et al. 2021) and distributed sensors (e.g., fiber optics) (Lindsey et al. 2019) spread across the globe. These advanced sensors are designed to monitor specific items of interest (e.g., river flow, nutrients, temperature, chemical concentration, light) and to communicate these data back to a centralized location (Beckman et al. 2020). At this centralized facility, large HPC or cloud computing environments will process the incoming data streams for integration into online simulations of extreme weather events, climate, hydrology, and their impacts on Earth systems.
We could utilize AI/ML capabilities within this system concept at multiple points. First, the velocity of sensor data coming into the system will potentially overrun the capabilities of even the largest data processing centers. Hence, such a volume of data is unlikely to fit in memory or even in temporary storage resources (such as file systems or object stores). Advanced AI/ML models could be trained and tailored to summarize or select relevant features from the incoming data streams. Such an encoding or feature selection process would significantly reduce the amount of data that needs to be kept and integrated into ongoing simulations. Another possibility is for AI/ML models to identify anomalies or precursors (Yuan et al. 2019) in the incoming data streams that might suggest areas of interest on which simulations should focus, for instance, the start of a hurricane or a high likelihood of significant rain-on-snow events or wildfires. Other examples include where to place a Geostationary Operational Environmental Satellites (GOES) floater and where to scan phased-array radars for faster, more spatially focused sensing.
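As a hedged sketch of such a summarization stage, the example below uses scikit-learn’s IncrementalPCA to learn a low-dimensional code for incoming sensor batches without holding the full stream in memory; the channel count, number of retained components, and synthetic data are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

N_FEATURES = 64     # channels per sensor record (assumed)
N_COMPONENTS = 8    # size of the transmitted summary (assumed)

ipca = IncrementalPCA(n_components=N_COMPONENTS)

def reduce_batch(batch: np.ndarray) -> np.ndarray:
    """Update the streaming PCA model and return compressed codes.

    batch: array of shape (n_samples, N_FEATURES) from the data stream.
    """
    ipca.partial_fit(batch)          # incremental update; no full dataset in memory
    return ipca.transform(batch)     # (n_samples, N_COMPONENTS) summary to keep

# Illustration with synthetic data standing in for a sensor stream.
rng = np.random.default_rng(0)
for _ in range(10):
    batch = rng.normal(size=(512, N_FEATURES))
    codes = reduce_batch(batch)

recon = ipca.inverse_transform(codes)    # check how much information was kept
err = np.mean((recon - batch) ** 2)
print(f"kept {N_COMPONENTS}/{N_FEATURES} dims, mean reconstruction error {err:.3f}")
```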
Because of the distributed nature and inhospitable environments (e.g., remote locations, extreme temperatures, or extreme pressures) where sensors may need to be placed or roam, it is unlikely that a reliable data stream will reach the centralized location for all possible inputs. One common use case is the intelligent city scenario for studying urban science. Figure 1 is a notional depiction of various deployed sensors, computing, and data storage capabilities (Zhu et al. 2021). AI/ML models could be used in such an environment to fill measurement gaps and present a more consistent view of observational data to a future simulation run on a large compute resource. Moreover, to understand and predict urban air mobility (e.g., drone deliveries and air taxis), a distributed sensor network coupled with edge computing and AI is needed for block-level monitoring and forecasting of eddies.
Fig. 1. A smart city scenario with a large number of sites for fixed sensor deployments that measure temperature, wind profile, CO2 concentration, precipitation, and so on, plus a variety of mobile devices that can intermittently augment the collection of measurement and observation data. An urban setting will support advanced wireless communications such as 5G and eventually 6G to understand the interactions between cities and climate. This figure was developed by the Advanced Wireless Communications Laboratory at Pacific Northwest National Laboratory.
c. Federated processing from the edge to the data center concept
The third potential system design extends the second concept by leveraging much more processing in or near the distributed sensor network. The sensor data can be processed directly on the sensor itself or on a nearby edge server (e.g., fog computing) whose processing elements receive a small collection of sensor streams (Stevens et al. 2020, chapter 15). Local processing stations can then send their raw or locally processed data to a centralized HPC and/or cloud resource for inclusion in simulation models and centralized AI/ML models, as in the first system concept.
The advantage of this approach is that data down-selection and feature extraction can be performed locally, significantly reducing the volume of data that must be transmitted to a centralized resource. Assuming that a sufficiently performant local network among sensors can be established, process model parameters and partial results, perhaps even AI/ML model updates, can be exchanged within a locale, allowing for a genuinely federated design aspect. Initially, this concept takes advantage of existing gateways and local area networks serving sensors in the field. Through codesign collaborations, it is possible to expand that service to include application/sensor-specific processing to filter, analyze, compress, encrypt, and unify multiple sensor streams transmitting measurements through the wireless network.
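A minimal sketch of how such locally computed model updates could be combined follows, using the standard federated-averaging idea of weighting each site’s parameters by its local sample count; the layer shapes, number of sites, and sample counts are invented for illustration and are not tied to any specific AI4ESP deployment.

```python
import numpy as np

def federated_average(site_weights, site_counts):
    """Combine per-site model parameters into one global model.

    site_weights: list of per-site parameter lists (one np.ndarray per layer).
    site_counts:  number of local training samples at each site.
    """
    total = float(sum(site_counts))
    n_layers = len(site_weights[0])
    global_weights = []
    for layer in range(n_layers):
        # Weighted sum of this layer's parameters across all sites.
        acc = np.zeros_like(site_weights[0][layer])
        for w, n in zip(site_weights, site_counts):
            acc += (n / total) * w[layer]
        global_weights.append(acc)
    return global_weights

# Toy example: three edge sites, each holding a two-layer model.
rng = np.random.default_rng(1)
sites = [[rng.normal(size=(4, 4)), rng.normal(size=(4,))] for _ in range(3)]
counts = [1200, 300, 4500]           # local observations per site (assumed)
global_model = federated_average(sites, counts)
```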
d. Dynamic and adaptive federated processing concept
The last system concept builds on the previous three by adding feedback and control paths within distributed networks of sensor-local resources (Di Lorenzo et al. 2021; Charles et al. 2021). Local control offers lower-latency decision-making to dynamically control what information is observed, measured, recorded, and relayed by the sensor network (Morell and Alba 2022). Such a design has powerful implications: by dynamically controlling sensors online, simulations of Earth’s weather and climate can essentially focus sensor inputs on specific quantities or geographic locations of interest. Examples might include regions where severe weather events are expected or cases in which climate scientists identify where specific information is needed to help improve the quality of their models. This concept also expands to multiple HPC and/or cloud data centers for federated AI/ML modeling. AI/ML models can play a crucial part in this system by performing continuous, autonomous online inspection of evolving simulations or recorded data to identify areas of data insufficiency or statistical weakness. Furthermore, a dynamic and adaptive system may be able to carefully obtain and select data to improve the quality of its training, reducing the need for vast, potentially intractable datasets to be collected over long periods (Catlett et al. 2017).
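One hedged illustration of such data selection is sketched below: an ensemble of surrogate forecasts is used to rank candidate observation sites by predictive spread, and only the highest-uncertainty sites are tasked in a given cycle. The ensemble size, grid, and tasking budget are assumptions.

```python
import numpy as np

def rank_sites_by_uncertainty(ensemble_predictions, budget):
    """Pick the candidate sites where an ensemble disagrees the most.

    ensemble_predictions: array (n_members, n_sites) of surrogate forecasts.
    budget: number of sensor tasks that can be issued this cycle.
    Returns indices of the sites to observe next.
    """
    spread = ensemble_predictions.std(axis=0)      # per-site predictive spread
    return np.argsort(spread)[::-1][:budget]

# Toy example: 5 ensemble members forecasting 100 grid cells, 10 tasks allowed.
rng = np.random.default_rng(2)
preds = rng.normal(size=(5, 100)) * rng.uniform(0.1, 2.0, size=100)
targets = rank_sites_by_uncertainty(preds, budget=10)
print("task sensors at cells:", targets)
```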
4. Grand challenges
The system concepts that integrate federated processing are beyond the capabilities of affordable technologies today. They will require significant investment both in foundational technology systems and in codesign programs. Such synergy among climate scientists, mathematicians, AI/ML experts, computer scientists, and hardware engineers is needed to balance the competing performance, energy, cost, and security challenges associated with AI-enabled ModEx. The following sections describe technical challenges that will arise in the following areas: 1) programmability and usability, 2) data movement, 3) energy efficiency, and 4) privacy and security of data.
a. Programmability and usability
The current and near-term challenge is integrating scientific modeling and simulation applications with AI/ML methods. This drives the need to integrate Earth system HPC applications written in C/C++ and/or Fortran with AI/ML methods that use Python-based ML frameworks (Ott et al. 2020). Programming models are under development to support the convergence of applications and workflows onto heterogeneous computing systems. Many AI/ML architectures provide hardware support for reduced or mixed precision, and tools will be required to analyze which specific model components can use these capabilities. We must also create protocols and tools for Earth system predictability (ESP) data sharing and data federation on the cloud. The usability challenge is managing the complexity of mapping converged application workloads to future heterogeneous computing architectures that integrate specialized hardware accelerators with commodity CPU/GPU/TPU processors.
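As an illustration of the kind of per-component analysis such tools would automate, the sketch below compares a toy surrogate network’s output in full precision and under PyTorch’s bfloat16 autocast; the network, input sizes, and error tolerance are assumptions, not a prescription for any particular Earth system component.

```python
import torch

# Toy stand-in for one component of a hybrid ESM/ML workload.
surrogate = torch.nn.Sequential(
    torch.nn.Linear(128, 256),
    torch.nn.GELU(),
    torch.nn.Linear(256, 64),
)

x = torch.randn(1024, 128)

# Reference result in float32.
with torch.no_grad():
    ref = surrogate(x)

# Same component evaluated under bfloat16 autocast on the CPU.
with torch.no_grad(), torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    reduced = surrogate(x)

rel_err = ((ref - reduced.float()).abs().max() / ref.abs().max()).item()
print(f"max relative error under bfloat16: {rel_err:.2e}")
# A component would be marked safe for reduced precision only if this error
# stays below an application-specific tolerance (assumed here, e.g., 1e-2).
```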
Domain scientists are interested in exploring the capabilities of new heterogeneous advanced architecture computing systems. Interfacing with sensors and AI analytics at the edge will allow domain scientists to extract the actionable information needed for improved modeling of disturbances and extreme events. This type of codesign is needed for most ESP applications. For example, watershed science, hydrology, ecohydrology, climate variability and extremes, aerosols and clouds, and atmospheric modeling are cross-cutting themes where AI@SensorEdge has the highest impact. Codesign approaches that interface with distributed sensor networks will allow us to 1) collect reliable and relevant watershed data under disturbances, 2) monitor land–atmosphere–coastal interactions by embedding intelligence on the ARM instruments, 3) understand wildfire events and their impact on ecosystems in near–real time, and 4) assess critical infrastructure impacted by extreme events (e.g., see Hickmon et al. 2022, chapter 9). Popular codesign examples include sustainable urban systems (Webb et al. 2018), sociotechnical systems corresponding to Earth-observation data (Barbier et al. 2024), and sensor placement (Huadong 2016).
Still, there are challenges in understanding how to map AI4ESP workflows to the diverse collection of computing system options. It is essential to understand how AI/ML capabilities originally developed for generic commercial workloads may or may not apply to ESP hybrid modeling applications or to observation and measurement capabilities. From centralized large-scale modeling and training to edge computing inference and federated learning, new challenges arise for the composition and distribution of applications, algorithms, and methods. This is an important opportunity for the AI4ESP community to develop a new generation of proxy applications and benchmarks for modeling and observation capabilities. For example, AI-enabled codesign would allow us to emulate and deploy DOE codes such as PFLOTRAN, ATS, and E3SM at the sensor edge, empowering ARM instruments and EMSL user facilities. The focus should be on facilitating communication and codesign collaborations among hardware designers, system software developers, algorithm developers, and domain scientists.
b. Data movement
The expected volume of data associated with a complete, coordinated Earth sensor capability will be unprecedented. Not only will such a network generate a previously unimaginable quantity and diversity of data, but the computing and network load for processing, transmitting, and storing this volume will be orders of magnitude higher than any system available today. Data movement costs in terms of energy and latency motivate the interest in federating and distributing computing across the AI4ESP scientific ecosystem. AI/ML technologies could help reduce such volumes by identifying patterns and anomalies and by summarizing subvolumes. Significant investment in AI/ML approaches will be required to ensure that the modeling capabilities are compatible with and efficient for the types of data being recorded, especially where these deviate from commercial photo or video workloads. Technologies that may assist in energy-efficient data transfers include silicon photonic networks, satellite-based communications, and wide-area 5G- or 6G-like communication networks that enable sensors to communicate over short to medium distances without physical wiring (Beckman et al. 2020). On the storage side, cloud technologies such as high-performance, large-volume data object stores could likely provide a capability to address increased sparse data storage volumes, although this would pose a significant cost barrier at current commercial cloud pricing. We may also use AI/ML to enable innovative compression techniques on Earth system data to increase information density without increasing storage costs. Additionally, DOE HPC centers could incorporate concepts and methods from cloud storage systems into future parallel file and storage systems to move gradually toward such a capability. These HPC centers allow data storage and connectivity with repositories such as the Environmental Systems Science Data Infrastructure for a Virtual Ecosystem (ESS-DIVE; Agarwal et al. 2022; Velliquette et al. 2021). This HPC-to-ESS-DIVE connectivity allows raw and curated data to be stored for long periods of time. This data storage strategy benefits the DOE community when new sensor data are collected, curated, and interfaced with existing data repositories.
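A simple sketch of trading precision for volume before transmission is shown below: a gridded float32 field is truncated to float16 and then packed losslessly with zlib, and the resulting compression ratio and worst-case error are reported. The field size and acceptable error are assumptions; production systems would more likely rely on purpose-built scientific compressors.

```python
import zlib
import numpy as np

rng = np.random.default_rng(3)
# Synthetic stand-in for a gridded field (e.g., a 2-m temperature snapshot, assumed).
field = rng.normal(loc=280.0, scale=5.0, size=(720, 1440)).astype(np.float32)

def compress(field32: np.ndarray) -> bytes:
    """Lossy truncation to float16 followed by lossless zlib packing."""
    return zlib.compress(field32.astype(np.float16).tobytes(), level=6)

def decompress(blob: bytes, shape) -> np.ndarray:
    data = np.frombuffer(zlib.decompress(blob), dtype=np.float16)
    return data.reshape(shape).astype(np.float32)

blob = compress(field)
restored = decompress(blob, field.shape)

ratio = field.nbytes / len(blob)
max_err = np.max(np.abs(field - restored))
print(f"compression ratio {ratio:.1f}x, max absolute error {max_err:.3f}")
```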
c. Energy efficiency
Large-scale networks with integrated sensors, federated processing, and wide-area communication networks to handle data transmissions will likely be very expensive in terms of energy consumption. While this was a lower-priority focus for exascale computing, data processing and communication remain power expensive. Codesign has the potential to improve this situation through the use of novel materials, devices, and processing techniques (e.g., neuromorphic-based accelerators to analyze images and video). However, significant investment will still be required in foundational technologies if large-scale, power-efficient sensing networks are to be realized. Codesign to balance performance and energy efficiency will also address how the modeling, machine learning, uncertainty quantification, and other streaming analytics capabilities are partitioned across the ESM scientific ecosystem. Such codesign, integrating DOE’s heterogeneous HPC systems with cloud computing, edge servers, and sensors with IoT devices, will transform the ModEx loop.
d. Privacy and security of data
As Earth system modeling becomes increasingly integrated with a distributed network of observations and perhaps federated processing capabilities, the quality, accuracy, and robustness of the information flowing through such a sensor network will become more critical. The information must also be secured if the results generated from modeling and measurement capabilities are used to support high-consequence national or international scientific policy decisions. The implications of potential data tampering or nefarious modification are clear, as a national or international resource for accurate scientific prediction could be severely affected. Data privacy concerns are particularly valid in a data acquisition system where individual human subject images or videos may be captured, or their behavior discerned from the data. An example includes sensor capabilities that could identify patterns in human systems data (e.g., in citizen science or urban environments). Codesign has a potential role in this space: by including security experts in cyber-physical designs from the outset, secure data transmission and processing can be integrated as a first-class concern rather than as a last, software-derived additional layer. In addition, data privacy may be afforded if local artifacts associated with specific individuals can be aggregated into a larger, federated model with individual patterns obfuscated or redacted in the complete model of the system.
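One common building block for such obfuscation, sketched below under illustrative settings, is to clip each site’s model update and add calibrated Gaussian noise before it enters the federated aggregate so that no individual record dominates the shared model; the clipping norm and noise scale shown are placeholders and would need a formal privacy analysis in practice.

```python
import numpy as np

def privatize_update(update, clip_norm=1.0, noise_std=0.1, rng=None):
    """Clip a local model update and add Gaussian noise before sharing it.

    update: 1D array of parameter deltas computed at one site.
    clip_norm, noise_std: illustrative values; real deployments derive them
    from a target privacy budget.
    """
    rng = rng or np.random.default_rng()
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip_norm / (norm + 1e-12))
    return clipped + rng.normal(scale=noise_std, size=update.shape)

# Toy example: a site's raw update is obfuscated before aggregation.
rng = np.random.default_rng(4)
local_update = rng.normal(size=1000)
shared_update = privatize_update(local_update, rng=rng)
```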
5. Synergy with other AI4ESP workshop sessions
In this section, we provide visionary perspectives on future ideas and potential research in synergy with other workshop sessions. Table 1 summarizes this synergy with short-term (<5 yr), medium-term (5 yr), and long-term (10 yr) goals. The focus is on how AI architectures and codesign approaches relate to the integrative water cycle and associated water cycle extremes. The categories below come from the AI4ESP workshop themes:
- Atmospheric modeling—There is a need for advancing the modeling of subgrid physics across scales and guiding or automating process model calibration. This includes 1) codesign approaches for parameterization and knowledge transfer across scales and 2) AI infrastructure for datasets, software, testing, validation, and training workflows for efficient model calibration.
- Land modeling—AI architectures for efficient transfer of information between land and atmospheric models are needed. These include 1) subgrid parameterizations to capture the full complexity within a grid, 2) capturing heterogeneity utilizing LCFs, and 3) addressing observational gaps using advanced AI architectures (e.g., transformers).
- Hydrology—Advanced AI architectures are needed for parameter estimation, downscaling, and imputation to improve data products (a minimal imputation sketch follows this list). Model-data codesign approaches are needed to identify how many and what types of observations are required to reach a desired process model performance without actual measurements being available. This includes 5G or other high-speed networking or software pipelines that can accelerate the transfer of information between field instrumentation and process models for near-real-time sampling decisions.
- Watershed science—Codesign approaches are needed to better understand 1) the quality of collected data, 2) the predictability of a watershed’s response (e.g., the evolution of microbial activity) under disturbances and long-term perturbations using process-based models (e.g., PFLOTRAN), 3) when, how, and where to collect data (e.g., wildfires, flooding, drought events), and 4) how to deal with large data volumes.
- Ecohydrology—Advanced AI architectures are needed for developing new data products and benchmark datasets across spatial scales, from microbial and leaf scales to watershed and continental scales. Novel codesign approaches that build and collect the labeled Earth science data needed for process models, and open-source these data to the BER community, would facilitate rapid testing of existing AI/ML methods.
- Aerosols and clouds—Codesign approaches that can extract valuable information or identify indicator patterns of forced changes and emergent properties of the actual and simulated climate system are essential. Future system concepts that can develop databases for indicator patterns (e.g., nucleation of ice or particles, snow formation) and emergent properties provide a path toward knowledge discovery and reveal missing mechanisms that must be incorporated in process models.
- Coastal dynamics, oceans, and ice—Advanced AI architectures are needed that can improve 1) the standardization and merging of disparate datasets and 2) scale awareness and dependency in process models (e.g., capturing coastal, ocean, and cryosphere processes across scales and from sparse datasets).
- Climate variability and extremes—Codesign approaches for climate variability, signal identification, and sources of predictability are essential. These include AI architectures to detect signatures and features corresponding to tropical cyclones, fronts, atmospheric rivers, hailstones, tornadoes, and ice storms.
- Human systems and dynamics—Codesign approaches are needed that can provide a better understanding of human and Earth systems. For example, advancements in AI architectures are needed to gain better insights into urban prediction and long-term urban policy in response to extreme events.
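As noted in the hydrology item above, a minimal, hedged sketch of ML-based gap filling is shown below: a random-forest regressor trained on co-located meteorological variables imputes missing streamflow values in a synthetic record. The variables, gap fraction, and model choice are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(5)

# Synthetic stand-ins: precipitation, temperature, soil moisture -> streamflow.
n = 5000
predictors = rng.normal(size=(n, 3))
streamflow = 2.0 * predictors[:, 0] - 0.5 * predictors[:, 1] + rng.normal(scale=0.2, size=n)

# Mark 20% of the streamflow record as missing (sensor outages, QA/QC gaps).
missing = rng.random(n) < 0.2
observed = ~missing

# Train on timesteps where streamflow was observed, then fill the gaps.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(predictors[observed], streamflow[observed])
filled = streamflow.copy()
filled[missing] = model.predict(predictors[missing])

rmse = np.sqrt(np.mean((filled[missing] - streamflow[missing]) ** 2))
print(f"imputation RMSE on held-out gaps: {rmse:.3f}")
```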
Table 1. Short-, medium-, and long-term goals needed to overcome the grand challenges discussed in section 4. Gradual progress on these specific goals will allow us to advance the future system concepts needed for improving Earth system predictability.
6. Conclusions
In this perspective paper, we have described the need for codesign approaches for the efficient and accurate integration of process models and observations for improved Earth system predictability. The current state of the science and existing HPC facilities provide a starting point for addressing the grand challenges of the model-experimentation loop. Future system concepts that connect edge sensors to intelligent computing devices and, subsequently, to the process models that reside in fog/cloud/exascale infrastructure are needed to transform the ModEx life cycle. The near- to long-term goals outlined here will allow us to develop AI architectures and codesign approaches based on these future system concepts. Community integration and effort between domain and computational experts will allow us to transform how we model the integrative water cycle and associated water cycle extremes.
1 Popular BER observational capabilities include ARM (ARM 2022) and EMSL (Pacific Northwest National Laboratory 2022).
2 State-of-the-art DOE-funded, open-source, and massively parallel multiphysics codes include PFLOTRAN (Lichtner et al. 2020), ATS (Coon et al. 2022), and E3SM (Taylor 2022).
3 ASCR-funded computational infrastructure and scientific user facilities include ALCF (Argonne National Laboratory 2022), NERSC (NERSC 2022), and OLCF (Oak Ridge National Laboratory 2022).
4 Popular providers include Amazon (2022), Google (2022), and Microsoft (2022).
Acknowledgments.
The authors acknowledge all of the efforts made as part of the Artificial Intelligence for Earth System Predictability (AI4ESP) workshop. Author Mudunuru acknowledges the support from the Environmental Molecular Sciences Laboratory, a U.S. Department of Energy (DOE) Office of Science User Facility sponsored by the Biological and Environmental Research program under Contract DE-AC05-76RL01830. Author Jones’s research was supported as part of the Energy Exascale Earth System Model (E3SM) project, funded by the DOE Office of Science, and Office of Biological and Environmental Research. Authors Sreepathi and Norman’s research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the DOE Office of Science and the National Nuclear Security Administration. Author Gokhale’s research was performed under the auspices of the DOE by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. Authors Mudunuru, Ang, and Halappanavar acknowledge the contributions of Johnathan Cree and Elena Peterson at the Pacific Northwest National Laboratory’s (PNNL) Advanced Wireless Communications Laboratory, who developed the figure in this paper. This paper has been authored by the PNNL, operated by Battelle Memorial Institute for the DOE under Contract DE-AC05-76RL01830. This paper has also been authored by Oak Ridge National Laboratory, operated by UT-Battelle, LLC, under Contract DE-AC05-00OR22725 with the DOE. The U.S. government retains and the publisher, by accepting the article for publication, acknowledges that the U.S. government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this paper or allow others to do so, for U.S. government purposes. DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan. The authors thank the reviewers whose feedback helped to substantially improve the paper.
Data availability statement.
No data were developed in this paper.
APPENDIX
Acronyms
AI4ESP | Artificial Intelligence for Earth System Predictability
AI | Artificial intelligence
ALCF | Argonne Leadership Computing Facility
ARM | Atmospheric Radiation Measurement Climate Research Facility
ASCR | Advanced Scientific Computing Research
ATS | Advanced Terrestrial Simulator
AWS | Amazon Web Services
BER | Biological and Environmental Research
CPU | Central processing unit
DOE | U.S. Department of Energy
E3SM | Energy Exascale Earth System Model
ESM | Earth system model
ESP | Earth system predictability
EMSL | Environmental Molecular Sciences Laboratory
GCP | Google Cloud Platform
GOES | Geostationary Operational Environmental Satellites
GPU | Graphics processing unit
HPC | High-performance computing
IoT | Internet of Things
LCF | Leadership Computing Facility
ModEx | Model experimentation
ML | Machine learning
NASA | National Aeronautics and Space Administration
NERSC | National Energy Research Scientific Computing Center
NLP | Natural language processing
NOAA | National Oceanic and Atmospheric Administration
OLCF | Oak Ridge Leadership Computing Facility
TPU | Tensor processing unit
UQ | Uncertainty quantification
REFERENCES
Acer, S., and Coauthors, 2021: Exagraph: Graph and combinatorial methods for enabling exascale applications. Int. J. High Perform. Comput. Appl., 35, 553–571, https://doi.org/10.1177/10943420211029299.
Agarwal, D., S. Cholia, V. C. Hendrix, R. Crystal-Ornelas, C. Snavely, J. Damerow, and C. Varadharajan, 2022: ESS-DIVE reporting format for dataset package metadata. ESS-DIVE, accessed 12 September 2023, https://doi.org/10.15485/1866026.
Amazon, 2022: Amazon Web Services. Amazon, accessed 10 November 2022, https://aws.amazon.com/about-aws/.
Argonne Leadership Computing Facility, 2022: ALCF AI testbed. ANL, accessed 10 November 2022, https://www.alcf.anl.gov/alcf-ai-testbed.
Argonne National Laboratory, 2022: Argonne Leadership Computing Facility. ANL, accessed 10 November 2022, https://www.alcf.anl.gov/.
ARM, 2022: Atmospheric Radiation Measurement Climate Research Facility. ARM, accessed 10 November 2022, https://www.arm.gov/.
Baker, N., and Coauthors, 2019: Workshop report on basic research needs for scientific machine learning: Core technologies for artificial intelligence. U.S. DOE Office of Science Tech. Rep., 109 pp., https://doi.org/10.2172/1478744.
Barbier, R., S. B. Yahia, P. Le Masson, and B. Weil, 2024: Co-design for novelty anchoring into multiple socio-technical systems in transitions: The case of earth observation data. IEEE Trans. Eng. Manage., https://doi.org/10.1109/TEM.2022.3184248, in press.
Barkjohn, K. K., B. Gantt, and A. L. Clements, 2021: Development and application of a United States-wide correction for PM2.5 data collected with the PurpleAir sensor. Atmos. Meas. Tech., 14, 4617–4637, https://doi.org/10.5194/amt-14-4617-2021.
Barrett, R. F., and Coauthors, 2013: On the role of co-design in high-performance computing. Transition of HPC towards Exascale Computing, E. H. D’Hollander et al., Eds., IOS Press, 141–155.
Bauer, P., P. D. Dueben, T. Hoefler, T. Quintino, T. C. Schulthess, and N. P. Wedi, 2021: The digital revolution of Earth-system science. Nat. Comput. Sci., 1, 104–113, https://doi.org/10.1038/s43588-021-00023-0.
Beckman, P., R. Sankaran, C. Catlett, N. Ferrier, R. Jacob, and M. Papka, 2016: Waggle: An open sensor platform for edge computing. 2016 IEEE SENSORS, Orlando, FL, IEEE, 1–3, https://doi.org/10.1109/ICSENS.2016.7808975.
Beckman, P., and Coauthors, 2020: 5G enabled energy innovation: Advanced wireless networks for science (workshop report). ANL Tech. Rep., 57 pp., https://doi.org/10.2172/1606538.
Bringmann, O., and Coauthors, 2021: Automated HW/SW co-design for edge AI: State, challenges and steps ahead: Special session paper. 2021 Int. Conf. on Hardware/Software Codesign and System Synthesis (CODES + ISSS), Austin, TX, IEEE, 11–20, https://ieeexplore.ieee.org/document/9603364/references#references.
Catlett, C., P. Beckman, N. Ferrier, H. Nusbaum, M. E. Papka, M. G. Berman, and R. Sankaran, 2020: Measuring cities with software-defined sensors. J. Sci. Comput., 1, 14–27, https://doi.org/10.23919/JSC.2020.0003.
Catlett, C., P. Beckman, R. Sankaran, and K. K. Galvin, 2017: Array of things: A scientific research instrument in the public way: Platform design and early lessons learned. Proc. Second Int. Workshop on Science of Smart City Operations and Platforms Engineering, Pittsburgh, PA, Association for Computing Machinery, 26–33, https://doi.org/10.1145/3063386.3063771.
Chambers, J., R. Fisher, J. Hall, R. J. Norby, S. C. Wofsy, and D. Stover, 2012: Research priorities for tropical ecosystems under climate change workshop report. U.S. DOE Office of Science Tech. Rep., 136 pp., https://ess.science.energy.gov/wp-content/uploads/2020/12/NGEE-Tropics3webHR.pdf.
Charles, Z., Z. Garrett, Z. Huo, S. Shmulyian, and V. Smith, 2021: On large-cohort training for federated learning. Adv. Neural Inf. Process. Syst., 34, 20 461–20 475.
Chen, K. M., E. M. Cofer, J. Zhou, and O. G. Troyanskaya, 2019: Selene: A PyTorch-based deep learning library for sequence data. Nat. Methods, 16, 315–318, https://doi.org/10.1038/s41592-019-0360-8.
Chen, X., X. Huang, C. Jiao, M. G. Flanner, T. Raeker, and B. Palen, 2017: Running climate model on a commercial cloud computing environment: A case study using Community Earth System Model (CESM) on Amazon AWS. Comput. Geosci., 98, 21–25, https://doi.org/10.1016/j.cageo.2016.09.014.
Chouhan, L., P. Bansal, B. Lauhny, and Y. Chaudhary, 2020: A survey on cloud federation architecture and challenges. Social Networking and Computational Intelligence, Springer, 51–65.
Christensen, B. C., and J. F. Blanco Chia, 2017: Raspberry shake—A world-wide citizen seismograph network. 2017 Fall Meeting, San Francisco, CA, Amer. Geophys. Union, Abstract S11A-0560, https://ui.adsabs.harvard.edu/abs/2017AGUFM.S11A0560C/abstract.
Coon, E. T., and Coauthors, 2022: The Advanced Terrestrial Simulator. GitHub, accessed 10 November 2022, https://github.com/amanzi/ats.
Cortés, U., U. Moya, and M. Valero, 2021: When Sally met Harry or when AI met HPC. Supercomput. Front. Innovations, 8, 4–7, https://doi.org/10.14529/jsfi210101.
Cromwell, E., P. Shuai, P. Jiang, E. T. Coon, S. L. Painter, J. D. Moulton, Y. Lin, and X. Chen, 2021: Estimating watershed subsurface permeability from stream discharge data using deep neural networks. Front. Earth Sci., 9, 613011, https://doi.org/10.3389/feart.2021.613011.
Daepp, M. I. G., and Coauthors, 2022: Eclipse: An end-to-end platform for low-cost, hyperlocal environmental sensing in cities. 21st ACM/IEEE Int. Conf. on Information Proc. in Sensor Networks (IPSN), Milano, Italy, IEEE, 28–40, https://doi.org/10.1109/IPSN54338.2022.00010.
Descour, M., J. Tsao, D. Stracuzzi, A. Wakeland, D. Schultz, W. Smith, and J. Weeks, 2021: AI-enhanced co-design for next-generation microelectronics: Innovating innovation (workshop report). SNL Tech. Rep., 137 pp., https://doi.org/10.2172/1845383.
Di Lorenzo, P., C. Battiloro, M. Merluzzi, and S. Barbarossa, 2021: Dynamic resource optimization for adaptive federated learning at the wireless network edge. ICASSP 2021-2021 IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, IEEE, 4910–4914, https://doi.org/10.1109/ICASSP39728.2021.9414832.
Eeckhout, L., 2017: Is Moore’s law slowing down? What’s next? IEEE Micro, 37, 4–5, https://doi.org/10.1109/MM.2017.3211123.
Germann, T. C., 2021: Co-design in the Exascale Computing Project. Int. J. High Perform. Comput. Appl., 35, 503–507, https://doi.org/10.1177/10943420211059380.
Germann, T. C., A. L. McPherson, J. F. Belak, and D. F. Richards, 2013: Exascale Co-design Center for Materials in Extreme Environments (ExMatEx) annual report – year 2. LLNL Tech. Rep., 18 pp., https://doi.org/10.2172/1116965.
Google, 2022: Google Cloud Platform. Accessed 10 November 2022, https://cloud.google.com/.
Halappanavar, M., M. Minutoli, and S. Ghosh, 2021: Graph analytics in the exascale era. Proc. 18th ACM Int. Conf. on Computing Frontiers, Online, Association for Computing Machinery, 209 pp., https://doi.org/10.1145/3457388.3459984.
Heroux, M. A., L. C. McInnes, R. Thakur, J. S. Vetter, X. S. Li, J. Aherns, T. Munson, and K. Mohror, 2020: ECP software technology capability assessment report. ORNL Tech. Rep., 216 pp., https://doi.org/10.2172/1760096.
Hickmon, N. L., C. Varadharajan, F. M. Hoffman, S. Collis, and H. M. Wainwright, 2022: Artificial Intelligence for Earth System Predictability (AI4ESP) workshop report. ANL Tech. Rep., 413 pp., https://doi.org/10.2172/1888810.
Hoffman, F. M., and Coauthors, 2017: 2016 International Land Model Benchmarking (ILAMB) workshop report. U.S. DOE Office of Science Tech. Rep., 172 pp., https://doi.org/10.2172/1330803.
Huadong, G., 2016: Digital earth and future earth. Int. J. Digital Earth, 9 (1), 1–2, https://doi.org/10.1080/17538947.2015.1135667.
Lichtner, P. C., and Coauthors, 2020: PFLOTRAN: A massively parallel reactive flow and transport model for describing subsurface processes. http://www.pflotran.org.
Lie, S., 2021: Multi-million core, multi-wafer AI cluster. 2021 IEEE Hot Chips 33 Symp. (HCS), Palo Alto, CA, IEEE, 1–41, https://doi.org/10.1109/HCS52781.2021.9567153.
Lindsey, N. J., T. C. Dawe, and J. B. Ajo-Franklin, 2019: Illuminating seafloor faults and ocean dynamics with dark fiber distributed acoustic sensing. Science, 366, 1103–1107, https://doi.org/10.1126/science.aay5881.
Microsoft, 2022: Microsoft Azure: Cloud computing services. Microsoft, accessed 10 November 2022, https://azure.microsoft.com/en-us/.
Morell, J. Á., and E. Alba, 2022: Dynamic and adaptive fault-tolerant asynchronous federated learning using volunteer edge devices. Future Gener. Comput. Syst., 133, 53–67, https://doi.org/10.1016/j.future.2022.02.024.
Mudunuru, M. K., and Coauthors, 2021: EdgeAI: How to use AI to collect reliable and relevant watershed data. AI4ESP Tech. Rep., 6 pp., https://doi.org/10.2172/1769700.
Mudunuru, M. K., K. Son, P. Jiang, G. Hammond, and X. Chen, 2022: Scalable deep learning for watershed model calibration. Front. Earth Sci., 10, 1026479, https://doi.org/10.3389/feart.2022.1026479.
NERSC, 2022: National Energy Research Scientific Computing Center. U.S. DOE, accessed 10 November 2022, https://www.nersc.gov/.
Oak Ridge National Laboratory, 2022: Oak Ridge Leadership Computing Facility. OLCF, accessed 10 November 2022, https://www.olcf.ornl.gov/.
Ott, J., M. Pritchard, N. Best, E. Linstead, M. Curcic, and P. Baldi, 2020: A Fortran-Keras deep learning bridge for scientific computing. Sci. Program., 2020, 8888811, https://doi.org/10.1155/2020/8888811.
Pacific Northwest National Laboratory, 2022: The Environmental Molecular Sciences Laboratory. U.S. DOE, accessed 10 November 2022, https://www.emsl.pnnl.gov/.
Pao, K., 2011: Co-design and you: Why should mathematicians care about exascale computing. U.S. Department of Energy, 21 pp., https://www.csm.ornl.gov/workshops/applmath11/documents/talks/Pao_CoDesign.pdf.pdf.
Prometeus, 2022: The 59th edition of the TOP500 revealed the Frontier system to be the first true exascale machine with an HPL score of 1.102 Exaflop per second. Top 500 list, accessed 10 November 2022, https://www.top500.org/lists/top500/2022/11/.
Reuther, A., P. Michaleas, M. Jones, V. Gadepally, S. Samsi, and J. Kepner, 2021: AI accelerator survey and trends. 2021 IEEE High Performance Extreme Computing Conf. (HPEC), Waltham, MA, IEEE, 1–9, https://doi.org/10.1109/HPEC49654.2021.9622867.
Rosa, M. J. F., C. G. Ralha, M. Holanda, and A. P. F. Araujo, 2021: Computational resource and cost prediction service for scientific workflows in federated clouds. Future Gener. Comput. Syst., 125, 844–858, https://doi.org/10.1016/j.future.2021.07.030.
Saxena, D., R. Gupta, and A. K. Singh, 2021: A survey and comparative study on multi-cloud architectures: Emerging issues and challenges for cloud federation. arXiv, 2108.12831v1, https://doi.org/10.48550/arXiv.2108.12831.
Stavroulas, I., and Coauthors, 2020: Field evaluation of low-cost PM sensors (Purple Air PA-II) under variable urban air quality conditions, in Greece. Atmosphere, 11, 926, https://doi.org/10.3390/atmos11090926.
Steefel, C., 2022: Subsurface: An Exascale subsurface simulator of coupled flow, transport, reactions, and mechanics. ECP, accessed 10 November 2022, https://www.exascaleproject.org/research-project/subsurface/.
Stevens, R., V. Taylor, J. Nichols, A. B. Maccabe, K. Yelick, and D. Brown, 2020: AI for science: Report on the Department of Energy (DOE) town halls on artificial intelligence (AI) for science. ANL Tech. Rep., 224 pp., https://doi.org/10.2172/1604756.
Talsma, C., K. C. Solander, M. K. Mudunuru, B. Crawford, and M. Powell, 2022: Frost prediction using machine learning and deep neural network models for use on IoT sensors. SSRN, 4032447, 33 pp., https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4032447.
Taylor, M., 2022: E3SM multiscale modeling framework. ECP, accessed 10 November 2022, https://www.exascaleproject.org/research-project/e3sm-mmf/.
Theis, T. N., and H.-S. P. Wong, 2017: The end of Moore’s Law: A new beginning for information technology. Comput. Sci. Eng., 19, 41–50, https://doi.org/10.1109/MCSE.2017.29.
Tsai, W.-P., D. Feng, M. Pan, H. Beck, K. Lawson, Y. Yang, J. Liu, and C. Shen, 2021: From calibration to parameter learning: Harnessing the scaling effects of big data in geoscientific modeling. Nat. Commun., 12, 5988, https://doi.org/10.1038/s41467-021-26107-z.
U.S. DOE, 2022: The Energy Exascale Earth System Model. U.S. DOE Office of Science, accessed 10 November 2022, https://e3sm.org/.
Velliquette, T., J. Welch, M. Crow, R. Devarakonda, S. Heinz, and R. Crystal-Ornelas, 2021: ESS-DIVE reporting format for file-level metadata. ESS-DIVE, accessed 12 September 2023, https://www.osti.gov/biblio/1734840.
Vetter, J. S., and Coauthors, 2018: Extreme heterogeneity 2018—Productive computational science in the era of extreme heterogeneity: Report for DOE ASCR workshop on extreme heterogeneity. U.S. DOE Office of Science Tech. Rep., 83 pp., https://doi.org/10.2172/1473756.
Webb, R., and Coauthors, 2018: Sustainable urban systems: Co-design and framing for transformation. Ambio, 47, 57–77, https://doi.org/10.1007/s13280-017-0934-6.
Winter, K., D. Lombardi, A. Diaz-Moreno, and R. Bainbridge, 2021: Monitoring icequakes in East Antarctica with the raspberry shake. Seismol. Res. Lett., 92, 2736–2747, https://doi.org/10.1785/0220200483.
Xu, H., W. Wei, J. Dennis, and K. Paul, 2019: Using cloud-friendly data format in earth system models. 2019 Fall Meeting, San Francisco, CA, Amer. Geophys. Union, Abstract IN13C-0728, https://ui.adsabs.harvard.edu/abs/2019AGUFMIN13C0728X/abstract.
Yang, C., and Coauthors, 2016: 10M-core scalable fully-implicit solver for nonhydrostatic atmospheric dynamics. SC’16: Proc. Int. Conf. for High Performance Computing, Networking, Storage and Analysis, Salt Lake City, UT, IEEE, 57–68, https://doi.org/10.1109/SC.2016.5.
Yu, Y., and Coauthors, 2022: Characterizing uncertainties of Earth system modeling with heterogeneous many-core architecture computing. Geosci. Model Dev., 15, 6695–6708, https://doi.org/10.5194/gmd-15-6695-2022.
Yuan, B., and Coauthors, 2019: Using machine learning to discern eruption in noisy environments: A case study using CO2-driven cold-water geyser in Chimayó, New Mexico. Seismol. Res. Lett., 90, 591–603, https://doi.org/10.1785/0220180306.
Zhang, S., and Coauthors, 2020: Optimizing high-resolution community earth system model on a heterogeneous many-core supercomputing platform. Geosci. Model Dev., 13, 4809–4829, https://doi.org/10.5194/gmd-13-4809-2020.
Zhang, X., W. Jiang, Y. Shi, and J. Hu, 2019: When neural architecture search meets hardware implementation: From hardware awareness to co-design. 2019 IEEE Computer Society Annual Symp. on VLSI (ISVLSI), Miami, FL, IEEE, 25–30, https://doi.org/10.1109/ISVLSI.2019.00014.
Zhu, T., J. Shen, and E. R. Martin, 2021: Sensing earth and environment dynamics by telecommunication fiber-optic sensors: An urban experiment in Pennsylvania, USA. Solid Earth, 12, 219–235, https://doi.org/10.5194/se-12-219-2021.