Cloud computing provides infrastructure, platforms, and software as a service to individuals, companies, and governments around the world with diverse offerings that reflect the diversity of the end users. Out of the five essential characteristics identified by Mell and Grance (2011) that differentiate cloud computing, the key for numerical weather prediction (NWP) is elasticity: the ability to provision resources when they are necessary and release them as soon as they are no longer required. Strict time limits, time-varying demand, and significant computational requirements make tropical cyclone (TC) ensembles an ideal test case for us to stretch the limits of cloud elasticity. In doing so, we demonstrated what we believe is the first real-time tropical cyclone ensemble forecast system that runs in the cloud.
Tropical cyclone prediction challenges
Horrific images of hurricane damage that dominate news media each year eclipse a record of triumphs and tradeoffs. The accuracy of storm track and storm intensity predictions derived from tropical cyclone numerical models continues to improve year over year (Cangialosi 2021) and represents a success story of scientific progress. These improvements are the product of cutting-edge research transitioned from laboratories to operational centers, transforming scientific advances into better information for decision-makers and forecasters responsible for providing actionable guidance for the public.
While storm track forecasts made rapid progress in the first decade of the twenty-first century, intensity forecasts improved at a slower pace (DeMaria et al. 2014). Further improvements to intensity forecasts over the last decade accompanied significant advances in guidance from high-resolution regional models. One approach, used by the U.S. Navy’s COAMPS-TC (Doyle et al. 2014) and the U.S. National Oceanic and Atmospheric Administration’s (NOAA) HWRF (Tallapragada et al. 2014) and Hurricanes in a Multi-Scale Ocean-Coupled Non-Hydrostatic Model (HMON; Mehra et al. 2018) forecasting systems, relies on regional models with moving nests centered on the storm to capture fine-scale processes near the storm, with larger grid spacing elsewhere. This method can be challenging to schedule in operations: rather than a traditional fixed production schedule (like a global model feeding predefined regional models), tropical cyclone models represent a dynamic load with significant compute peaks and valleys. Storms form, evolve, and dissipate. Some days have several active storms, requiring considerable resources; some days have none.
Uncertainty quantification further exacerbates the dynamic loading on a production schedule. Despite myriad advances, numerical models cannot perfectly capture the initial state and time evolution of a chaotic atmosphere. Ensemble forecasting techniques have been used operationally for uncertainty quantification in global models since the early 1990s (Toth and Kalnay 1993), but their use in tropical cyclone intensity forecasts has emerged only recently, with the first operational high-resolution1 TC ensemble system going live in 2020 (Komaromi et al. 2021). Understanding the potential variability in a forecast is a powerful tool with a hefty cost: the computational expense scales linearly with the number of ensemble members. This scaling constrains how many storm forecasts can run simultaneously. For a deterministic forecast, adding several new storms may not weigh down the schedule, but with a 20-member ensemble, adding two new storms requires 40 times the resources of a single deterministic storm forecast.
Since its inception in the mid-twentieth century (Charney et al. 1950), the accuracy of numerical weather prediction has been constrained by finite computational resources, and TC modeling is no different: more storms to process simultaneously; more grid points for predicting rapid intensification; and more ensemble members to provide decision-makers with improved uncertainty estimates to inform public health and safety decisions. Even as computing technology rapidly advances and enables new modeling capabilities, one thing remains the same: a forecast that arrives too late is useless. This timeliness requirement, particularly for real-time scenarios, drives compromises between advancing modeling technology and the computers it runs on. We set out to evaluate whether cloud elasticity could reduce the compromises between what is scientifically possible and what is technically feasible for operations and still get forecasts out on time.
NWP in the cloud
Given the decades-long tension between computing demand and computing supply in NWP, it is no surprise that the community has investigated methods to address this gap beyond “buy larger computers.” Initially, approaches focused on connecting multiple on-premise environments using web services to execute forecasts and process data for atmospheric analysis and prediction (Droegemeier et al. 2005). Now, those efforts have naturally evolved to leverage cloud computing for NWP applications including, but not limited to, opportunities for education, data dissemination, research collaboration, and public engagement (Vance et al. 2016).
More than 10 years have passed since one of the first studies of cloud computing for geophysical modeling, in which Evangelinos and Hill (2008) ran the MIT GCM on Amazon Web Services. Powers et al. (2021) provide a recent overview of experiences running atmospheric models in the cloud. To place our TC ensemble demonstration in context, we view the body of literature as divided between cloud feasibility and cloud suitability.
Feasibility studies focus on the capabilities of cloud computing. Can a cloud platform effectively execute a model? A modeling workflow? How do the performance results from the cloud compare to on-premise systems? While atmospheric model components can be used as part of a larger benchmarking suite (He et al. 2010; Jackson et al. 2010), most previous efforts focus on a standalone model. The Weather Research and Forecasting (WRF) Model (Skamarock et al. 2008) has been popular for analysis in the cloud (Krishnan et al. 2014; Monteiro et al. 2014; Molthan et al. 2015; Siuta et al. 2016), with performance that varied relative to on-premise systems. Considering performance requires further scrutiny of what is good enough: Arevalo and Whitcomb (2019) showed that although the Navy Global Environmental Model (NAVGEM; Hogan et al. 2014) ran successfully in the Amazon cloud, the infrastructure could not meet operational timing requirements until the introduction of Amazon’s Elastic Fabric Adapter (EFA) for internode communication.
Suitability studies focus on cloud value propositions and how they apply to a specific application. Zhuang et al. (2019) leveraged public datasets and reproducible deployments to improve accessibility and remove roadblocks to collaborative research with a cloud-based end-to-end workflow for the GEOS-Chem model. Molthan et al. (2015) automated a WRF-based modeling system in the Amazon cloud for supporting NASA science as well as providing access to simulations for organizations in developing countries requiring regional weather forecasts. Siuta et al. (2016) found their cloud provider (Google Cloud Platform) cost-competitive with purchasing on-premise hardware to support real-time model runs. Powers et al. (2021) updated WRF to better support the cloud and demonstrated advantages of a curated cloud environment for the WRF community, including facilitating tutorials and classroom use. With the exception of Siuta et al. (2016), the focus for suitability studies was less on cost and more on the new opportunities allowed by a cloud computing model.
Suitability and feasibility questions come together when considering real-time applications. McKenna (2016) implemented an operational system for generating atmosphere, ocean, and wave forecasts in Dubai using the Amazon cloud. Molthan et al. (2015) were able to use their automated framework to execute a once-daily WRF domain over Central America and the Caribbean. Siuta et al. (2016) (and later, Chui et al. 2019) describe a real-time regional forecasting system at the University of British Columbia using WRF. Most recently, Etherton et al. (2020) implemented data pipelines and virtual computing to cold-start FV3GFS2 global forecasts.
Pushing the limits
With the literature from global and regional real-time forecasts and performance data suggesting that cloud platforms could run models fast enough for real time, we set out to further push the limits of cloud capability. Two primary limitations remained: problem size and dynamic demand.
The relatively small problem sizes in the literature are consistent with the dominance of WRF and regional modeling as the primary cloud use case. We found that the feasibility and suitability studies used varied core counts, but rarely more than several hundred cores. The outliers were Powers et al. (2021), who document WRF tests as large as 3,600 cores, and Etherton et al. (2020), who used between 5,600 and 8,400 cores for their FV3GFS global forecasts.3 Larger virtual clusters also take more time to provision on demand. Our goal was to break the five-digit core-count barrier if that was required to produce the ensemble forecast on time.
Dynamic demand was the greater unknown. Our forecast system cycled every 6 h and each cycle could have a different number of storms. Using an ensemble multiplied the compute impact of adding a new storm, which provided a dynamic load to test cloud elasticity within our sponsor-provided Microsoft Azure cloud environment.
Configuration
Our COAMPS-TC ensemble setup is very close to that of Komaromi et al. (2021) with minor infrastructure modifications for workflow orchestration. Each forecast uses four compute nodes (details for node configuration can be found in appendix A) for three computational nests with grid spacings of 36, 12, and 4 km. The outer 36-km nest is fixed, with its location determined by the ocean basin (e.g., western North Pacific, eastern North Pacific, or Atlantic basin); the initial storm position places the 12- and 4-km nests within the outer domain. The 12- and 4-km nests move with the storm throughout the forecast, allowing higher-resolution grid spacing where it is important to resolve the small-scale dynamical and physical processes required to forecast TC intensity. After the forecasts complete, the tracker (Marchok 2021) extracts storm position and intensity at each forecast lead time. The ensemble consists of 21 individual forecasts for each storm: an unperturbed control member and 20 members with perturbed initial and boundary conditions. Once all ensemble members have completed, visualization software uses the resulting tracker files to produce plots of predicted location and intensity.
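To make the resource arithmetic concrete, the following sketch (a hypothetical helper, not part of our production code) computes the peak node demand if every member of every active storm ran concurrently, using the 21-member and four-nodes-per-member figures described above.

```python
# Hypothetical sketch of peak compute-node demand for concurrent TC ensembles,
# assuming the configuration described in the text: 21 members per storm
# (1 control + 20 perturbed) and 4 compute nodes per member forecast.

MEMBERS_PER_STORM = 21
NODES_PER_MEMBER = 4

def peak_node_demand(active_storms: int) -> int:
    """Nodes needed if all members of all active storms run at the same time.
    In practice members queue, so the autoscaler provisions only what the
    scheduler actually needs at any moment."""
    return active_storms * MEMBERS_PER_STORM * NODES_PER_MEMBER

if __name__ == "__main__":
    for storms in range(1, 5):
        print(f"{storms} active storm(s): up to {peak_node_demand(storms)} nodes")
    # Four simultaneous storms (our observed maximum) could demand up to
    # 336 nodes, which an elastic cluster only has to hold while the
    # forecasts are actually running.
```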
The COAMPS-TC model requires global model data for two purposes: initial and boundary conditions for the outer nest, and vortex initialization for the inner moving nests. Komaromi et al. (2021) validated their results using GFS data for both purposes. Our previous experience has found that a hybrid approach using NAVGEM data for the outer nest initial and boundary conditions and GFS data for the vortex initialization provides competitive results (particularly for track forecasts). We adopted this hybrid approach for our cloud investigation for two reasons. First, executing the NAVGEM model in the cloud meant that we could limit data transfers into the cloud to operationally produced global analyses only (rather than the full forecast), increasing resilience to disruptions. Second, we could finish the NAVGEM forecast before the GFS forecast completed, giving us additional time to finish the COAMPS-TC forecasts before the 6-h deadline.
The NAVGEM global model used the current operational configuration: 19-km grid spacing and 60 vertical levels with a model top above 70 km. A postprocessing suite interpolated variables required to drive the TC model (e.g., sea level pressure, wind, temperature, and humidity) from the computational reduced-Gaussian grid and hybrid sigma–pressure levels to a regular half-degree latitude–longitude grid and constant pressure levels for ingest downstream. We executed the global model on 11 compute nodes.
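The vertical part of that postprocessing step amounts to interpolating each model column from hybrid sigma–pressure levels to constant pressure levels. The NumPy sketch below illustrates the idea (linear interpolation in log-pressure for a single, made-up column); the operational postprocessing suite is more sophisticated and also handles the horizontal regridding from the reduced-Gaussian grid.

```python
import numpy as np

def column_to_pressure_levels(p_model, values, p_target):
    """Interpolate one model column to constant pressure levels.

    p_model : 1D pressures (Pa) on model levels, ordered top to bottom.
    values  : 1D field values (e.g., temperature) on those levels.
    p_target: 1D output pressure levels (Pa).

    Interpolation is linear in log(p); targets outside the column are
    returned as NaN rather than extrapolated.
    """
    logp = np.log(np.asarray(p_model, dtype=float))    # increasing downward
    logt = np.log(np.asarray(p_target, dtype=float))
    return np.interp(logt, logp, values, left=np.nan, right=np.nan)

# Made-up temperature profile interpolated to three standard levels.
p_model = np.array([10.0, 100.0, 300.0, 500.0, 700.0, 850.0, 1000.0]) * 100.0  # Pa
temp_k = np.array([230.0, 200.0, 240.0, 255.0, 275.0, 285.0, 295.0])
print(column_to_pressure_levels(p_model, temp_k, np.array([250.0, 500.0, 925.0]) * 100.0))
```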
Our use of global and regional models with a TC focus meant we dealt with a diverse upstream data dependence. The NAVGEM forecasts required initial conditions from Fleet Numerical Meteorology and Oceanography Center (FNMOC). The COAMPS-TC initialization required initial conditions from NCEP’s Global Forecasting System (GFS) as well as files from National Hurricane Center (NHC) and Joint Typhoon Warning Center (JTWC) with storm name, location, and intensity information. We were able to obtain files from NCEP, NHC, and JTWC via a data pull operation initiated within our cloud environment. NAVGEM initial conditions were not available on the public Internet and required a data push to cloud storage from Navy supercomputers.
Orchestration
The complexity of managing multiple upstream data dependencies, a global model, and a varying number of storms with 21 model runs each demanded an orchestration solution. We used the Cylc workflow manager (Oliver et al. 2018, 2019), executing on a virtual host in the Azure environment, to manage the individual workflows that made up the forecast system (Fig. 1). Cylc’s ability to transition seamlessly from retrospective runs to real-time runs, to handle task failures and retries, and to present a single interface for viewing workflow status allowed us to build a data-driven system that could automatically handle errors and simplify monitoring.
Data acquisition consisted of two workflows: acquiring GFS initial conditions from NCEP, and acquiring TC information files from JTWC and NHC. The NAVGEM global model had its own workflow. Each storm used a standalone instance of a generalized ensemble TC model workflow. A management workflow analyzed the TC files from NHC and JTWC to compare the active storms with active workflows. If a new storm formed, the management workflow kicked off a new storm workflow. If a storm dissipated, the management workflow would stop that particular storm workflow.
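A minimal sketch of the management workflow's storm-reconciliation step appears below. The storm IDs, suite-naming convention, and Cylc 7-style commands ("cylc run" and "cylc stop") are illustrative; the actual task performed additional bookkeeping inside its Cylc workflow.

```python
import subprocess

def reconcile_storm_workflows(active_storms, running_workflows):
    """Start a per-storm ensemble workflow for each newly formed storm and
    stop the workflow for each storm that has dissipated.

    active_storms     : set of storm IDs parsed from the NHC/JTWC files
                        (illustrative IDs such as 'al132020').
    running_workflows : set of storm IDs that already have a workflow running.
    """
    for storm in sorted(active_storms - running_workflows):
        # New storm: launch its ensemble workflow (Cylc 7 syntax assumed).
        subprocess.run(["cylc", "run", f"tc-ensemble-{storm}"], check=True)
    for storm in sorted(running_workflows - active_storms):
        # Dissipated storm: shut its workflow down.
        subprocess.run(["cylc", "stop", f"tc-ensemble-{storm}"], check=True)

# Example: two storms are active; one previously running storm has dissipated.
reconcile_storm_workflows({"al132020", "al142020"}, {"al122020", "al132020"})
```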
To coordinate between multiple workflows, we focused on data-driven (rather than time-driven) orchestration. For example, NAVGEM initial conditions arrived via an external data push of analysis files into Azure: we used built-in Azure tools (Event Grid and Event Hubs) to connect a “restart file arrived” event to the Cylc workflow controlling the NAVGEM forecasts, triggering the portion of the workflow that required the analysis file. Only data acquisition tasks pulling from external servers required an explicit schedule. All other tasks in the various workflows used similar Azure event techniques and Cylc dependency management. This data-driven approach allowed tasks to begin as soon as their dependencies were satisfied, increasing parallelism and scalability across distributed workflows. Defining workflows based on data and task dependencies also allowed multiple cycles (e.g., 0000 and 0600 UTC) to execute simultaneously if the real-time system fell behind.
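One way to wire this pattern up is sketched below, assuming the azure-eventhub Python SDK on the listening side and Cylc 7's external-trigger mechanism on the workflow side; the connection string, hub name, workflow name, and trigger message are placeholders rather than our actual configuration.

```python
import subprocess
from azure.eventhub import EventHubConsumerClient  # pip install azure-eventhub

CONN_STR = "<event-hub-connection-string>"   # placeholder
HUB_NAME = "navgem-restart-events"           # placeholder

def on_event(partition_context, event):
    """Forward a 'restart file arrived' event to the NAVGEM Cylc workflow."""
    cycle_point = event.body_as_str()        # e.g., a cycle identifier
    # Cylc 7 external trigger: tasks that depend on this trigger become
    # runnable once a matching message and event ID arrive.
    subprocess.run(
        ["cylc", "ext-trigger", "navgem-forecast",
         "navgem restart file arrived", cycle_point],
        check=True,
    )
    partition_context.update_checkpoint(event)

client = EventHubConsumerClient.from_connection_string(
    CONN_STR, consumer_group="$Default", eventhub_name=HUB_NAME)
with client:
    client.receive(on_event=on_event, starting_position="-1")  # blocks
```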
We used Microsoft CycleCloud to schedule multinode jobs on a virtual cluster4 to execute the modeling workflows. The CycleCloud autoscaler provisioned compute nodes on demand: if the cluster was idle, and a job arrived requiring 11 nodes, an agent would spin up 11 nodes and the job would run. If no work was left in the job queue, the idle nodes would be deallocated and the cluster would shrink again (see Fig. 2). CycleCloud also managed a second cluster with high-performance storage virtual machines (VMs), providing a Lustre parallel filesystem to the compute nodes for improved parallel input/output (I/O) performance. For users, the CycleCloud clusters looked like on-premise systems, reducing demands on our team for specialized porting to the cloud environment.
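Because the autoscaler sizes the cluster from the PBS queue, workflow tasks only had to submit jobs with the right resource requests and the nodes followed. A sketch of such a submission is below, assuming PBSPro select syntax; the script name, job naming, and per-node core counts are illustrative.

```python
import subprocess

def submit_member(storm: str, member: int, nodes: int = 4, ncpus: int = 120) -> str:
    """Submit one ensemble-member forecast to PBSPro and return the job ID.

    The select statement requests `nodes` whole VMs with `ncpus` cores each;
    CycleCloud sees the queued demand and provisions matching VMs if none
    are idle.
    """
    select = f"select={nodes}:ncpus={ncpus}:mpiprocs={ncpus}"
    result = subprocess.run(
        ["qsub", "-N", f"{storm}-m{member:02d}", "-l", select,
         "run_coamps_tc_member.sh"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()

# Example: queue all 21 members for one storm and let the cluster grow to match.
job_ids = [submit_member("al132020", member) for member in range(21)]
```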
Our dynamic cluster was how we stress-tested cloud elasticity. Without elasticity, Fig. 2 would show a solid rectangle of always-on compute nodes. Our requirement for high-performance computation meant we were using some of the most expensive instances the cloud vendor offered, so minimizing idle time was paramount to keeping the solution cost-effective.
Error handling
Parallel forecasts rely on multiple compute nodes, each with their own disks, network connections, operating systems, and software libraries. We had considered challenges in deploying models and orchestration, but had not given enough attention to error handling. Of all the project components, we spent the most unplanned time developing a robust error-handling system. Although it was frustrating at times, developing and automating error checks was a key enabler of ultimate success.
Over the several months we performed forecasts in the cloud, we encountered several common error scenarios and built automated error checking routines so the problems could resolve themselves. These checks became especially important for cycles in the middle of the night when nobody was watching. The modeling applications we used are robust thanks to years in operations, so our automated system focused on issues arising during node provisioning and application startup.
Guided by the principle of least privilege (Saltzer and Schroeder 1975), we separated error identification (which could run as an unprivileged user) from error resolution (which required an administrator-level privileged user) by using Azure Event Hubs and Cylc custom error messages. A test script ran at the beginning of each parallel job checking for the most common failures [message passing interface (MPI) problems, missing filesystem mounts, and interconnect failures in the underlying hardware]: any failures would abort the job with a custom error message showing the failure type and the failed node. The Cylc error handling mechanism packaged the failure message and sent it to an Event Hub, allowing a downstream process with elevated privileges to receive the message and mark the node offline in the job scheduler. Once offline, the node would be immediately terminated and a new VM could be provisioned in its place. While the action to address errors was the same in all cases, the custom error message provided visibility into the frequency of different error types and paved the way for expanded error handling capabilities in the future.
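A condensed sketch of the two halves of this pattern is shown below: an unprivileged preflight check that aborts the job with a typed message, and a privileged handler that marks the offending node offline in PBS. The specific checks, message format, and commands are representative of the approach rather than a copy of our scripts, and the Event Hub transport between the two halves follows the same pattern shown earlier.

```python
import shutil
import socket
import subprocess
import sys

# ---- Unprivileged side: runs at the start of every parallel job. ----------
def preflight_checks() -> None:
    """Abort the job with a '<TYPE>:<node>' message if this node is unhealthy.

    The checks (scratch filesystem mount, InfiniBand device, MPI launcher)
    are representative; Cylc's failure handler forwards the message to an
    Event Hub for the privileged listener.
    """
    node = socket.gethostname()
    checks = {
        "LUSTRE_MOUNT": lambda: shutil.disk_usage("/lustre").total > 0,
        "INFINIBAND": lambda: subprocess.run(
            ["ibstat"], capture_output=True).returncode == 0,
        "MPI_LAUNCHER": lambda: shutil.which("mpirun") is not None,
    }
    for name, check in checks.items():
        try:
            ok = check()
        except OSError:
            ok = False
        if not ok:
            sys.exit(f"{name}:{node}")

# ---- Privileged side: reacts to the forwarded message. --------------------
def take_node_offline(message: str) -> None:
    """Parse '<TYPE>:<node>' and mark the node offline so the scheduler stops
    placing work on it; the VM is then terminated and replaced."""
    failure_type, node = message.split(":", 1)
    print(f"marking {node} offline after {failure_type} failure")
    subprocess.run(["pbsnodes", "-o", node], check=True)
```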
Once the initial software ports were complete and the full production pipeline was established, developing the automated error handling required the most time. The checks were straightforward to write, but depended on labor-intensive error identification and root cause analysis. Our efforts paid off with a self-healing system that kept node provisioning errors from stalling our cycling runs—even while we slept.
Dissemination
Our initial goal was to set up the data ingest feeds, modeling workflows, and product generation to evaluate if the cloud could handle the dynamic load and execute the forecast pipeline to have products ready by the deadline. Two weeks after we started production, with Hurricanes Laura and Marco making their way through the Gulf of Mexico, the project management team wanted to use the twin TC scenario as a chance to share our forecast demonstration with a wider audience. We needed a method to rapidly disseminate products, and turned to a publicly available platform that allowed us to post images via an application programming interface (API): Twitter.
Huang and Cervone (2016) discuss possibilities of social media for hazardous event notification. Our focus was on social media as an event-driven publish–subscribe dissemination method. Dissemination via Twitter allowed automated posts as products became available, with built-in abilities to share links and images for multiple storms. Although not as full-featured as a purpose-built dissemination portal, the effort to set up social media product distribution proved much simpler than full-fledged web portal development. Eventually, a persistent web presence would enable other notification technologies (such as RSS5) and allow us more control over results presentation.
Once the TC models and trackers completed, the resulting ensemble track and intensity data provided input to several product generation scripts. Products included maps showing the predicted track and intensity of each ensemble member (Fig. 3), rapid intensification probabilities, probabilistic intensity forecasts (including how many ensemble members dissipated the storm), and the ensemble-mean track with ellipses showing the ensemble-based position uncertainty at each forecast time. Once the images arrived in long-term storage, our existing event-routing mechanisms triggered an Azure Logic App that would use image metadata we set (such as the storm name) to generate the text and image content of a tweet. This automation method had a secondary benefit: because images posted shortly after generation, the Twitter stream allowed us to see if a run was behind schedule.
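A sketch of the hand-off point is below, assuming the azure-storage-blob Python SDK and placeholder names: the product script tags each image with storm metadata as it uploads it, and the resulting storage event is what ultimately drives the Logic App that composes the post.

```python
from pathlib import Path
from azure.storage.blob import BlobServiceClient  # pip install azure-storage-blob

CONN_STR = "<storage-account-connection-string>"  # placeholder
CONTAINER = "tc-products"                         # placeholder

def publish_product(image_path: str, storm_name: str, init_time: str) -> None:
    """Upload a product image with the metadata the downstream Logic App
    reads (storm name and initialization time) to build the post text."""
    service = BlobServiceClient.from_connection_string(CONN_STR)
    blob = service.get_blob_client(CONTAINER, Path(image_path).name)
    with open(image_path, "rb") as data:
        blob.upload_blob(
            data,
            overwrite=True,
            metadata={"storm": storm_name, "init_time": init_time},
        )

# Example: publish the ensemble track map for one cycle.
publish_product("laura_tracks_2020082600.png", "Laura", "2020-08-26T00:00Z")
```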
Performance
Niño-Mora (2019) classifies NWP as a “firm real-time application” where forecasts that fail to complete by a deadline quickly lose all value. This is especially true for TC forecasts, which should be ready by 6 h after their initialization time to be incorporated into an operational forecaster’s work product. Given these constraints, forecast timeliness was our primary success metric. We computed the production delay as the difference between forecast initialization time and the wall-clock time when the track files arrived in their cloud storage account. For example, if track forecasts for all members initialized at 0000 UTC were available at 0830 UTC, the production delay would be 8.5 h. We computed delay on a storm-by-storm basis.
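The metric itself is simple; a sketch of the computation, assuming UTC timestamps for both the initialization time and the track-file arrival time, follows.

```python
from datetime import datetime, timezone

def production_delay_hours(init_time: datetime, tracks_arrived: datetime) -> float:
    """Production delay in hours: wall-clock time from forecast initialization
    to the arrival of the last member's track file in cloud storage."""
    return (tracks_arrived - init_time).total_seconds() / 3600.0

# Example from the text: a 0000 UTC cycle whose track files landed at 0830 UTC.
init = datetime(2020, 8, 26, 0, 0, tzinfo=timezone.utc)
done = datetime(2020, 8, 26, 8, 30, tzinfo=timezone.utc)
delay = production_delay_hours(init, done)   # 8.5
on_time = delay <= 6.0                       # False for this example
```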
We produced forecasts every 6 h for cycles starting at 0000 UTC 1 August 2020 and concluding 18 September 2020 (after Hurricane Sally made landfall). We began cycling on 8 August 2020, a week behind real time, and relied on cloud elasticity and Cylc’s ability to execute multiple cycles of the modeling workflows simultaneously (where dependencies allowed) to catch up.
Delays for the entire processing period (Fig. 4) exceeded 6 h for large swaths of the period: out of 427 forecasts, 98 (∼23%) completed in under 6 h. Periods where we focused on production instead of troubleshooting demonstrated the capability to achieve on-time product delivery. There was no correlation between significant delays and the number of active storms, and we were able to execute four active 21-member TC ensembles simultaneously. If we consider only particular storms (Laura and Sally, Fig. 5), the on-time performance better reflects what is possible with a dynamic cloud-based approach. Laura included both catch-up and tuning time, finishing on time in 7 out of 32 forecasts (∼22%) and nearly on time (i.e., in under 6.5 h) in 16 out of 32 forecasts (50%). Laura also held the record for the fastest forecast, with tracks available 4 h 49 min after initialization time. Sally finished on time for 15 out of 16 forecasts (∼94%). On multiple occasions during our demonstration, our larger-than-operational ensemble was available before the deterministic forecasts from other prediction systems (e.g., NRL CTCX) were finished.
A postlandfall analysis of Hurricane Laura forecasts showed that we predicted the timing and location of landfall remarkably well, placing the storm’s landfall on the Louisiana border 72 h in advance (Fig. 6). Our track also agrees with the well-verified Naval Research Laboratory CTCX system (Doyle et al. 2020), which was the best model for intensity in the Atlantic in 2019 and second-best model for 5-day track (Masters 2020). Although we lack the samples to draw broad conclusions, these comparisons demonstrated that the system was performing acceptably.6
What is ahead
Recognition of the computational challenge posed by tropical cyclone models is not new. Duran-Limon et al. (2016) discussed using clouds or grids to facilitate on-demand runs of WRF with hurricanes as a motivating example. Huang and Cervone (2016) predicted that cloud computing could provide the compute infrastructure for operational prediction during a crisis as well as a high-performance, flexible, and resilient infrastructure for hazardous event forecasting. We showed that our implementation of an event-driven cyberinfrastructure in Microsoft Azure is capable of meeting operational deadlines even in the face of dynamic demands.
Molthan et al. (2015) focused on the opportunities from cloud-based infrastructure-as-a-service (IaaS). IaaS focuses on provisioning virtual machines as the computational backbone of a cloud solution. For our demonstration, we made use of IaaS resources (the virtual machines required to execute the forecasts), but our demonstration work would not have been possible without platform-as-a-service (PaaS) products that provided vendor-managed capability without the need to deploy VMs and install software. PaaS was a force multiplier for us, allowing a small team without dedicated IT support to provision the resources we needed to plumb an end-to-end forecast pipeline in a few clicks rather than an involved installation and configuration process. When using Microsoft PaaS products, we deliberately limited ourselves to offerings that were available in higher-security clouds and that had on-premise software equivalents. Leveraging PaaS products let us experiment, iterate, and get to a minimum viable product faster.
We are interested in expanding this work further and hardening it for potential operational transition. As we encountered issues (particularly in the CycleCloud autoscaler), we provided feedback to the vendor for future software upgrades and anticipate that future work should not encounter the same problems. We also look to address capabilities that were cut because of time constraints: in particular, automated deployment leveraging infrastructure-as-code7 and more rigorous cost modeling. We look to the progression of Chui et al. (2019) following Siuta et al. (2016): ensure a solution is technically capable of meeting requirements, then optimize for cost. We identified several approaches to save costs, and will need to test those cost controls and their impact on forecast timeliness.
Tropical cyclone ensembles are an ideal candidate for cloud computing, where elastic resource provisioning handles the fluctuating demand. We are interested in opportunities for on-demand modeling system pipelines to refine our deployment process and open avenues for increased reliability, larger ensembles, and wider dissemination from more resilient operational systems. Improved capability can lead to improved planning and broader awareness, ultimately improving our ability to protect lives and property.
1 We use “high-resolution” here to mean convection-allowing grid spacing, typically ∼4 km or smaller, following Skamarock (2004).
2 FV3GFS (Chen et al. 2019; Zhou et al. 2019) is NOAA’s current operational global forecast model.
3 Both Etherton et al. (2020) and Powers et al. (2021) provide core counts for single jobs, with all cores working together. Ensembles like ours are embarrassingly parallel: although our total core counts are often higher, the individual forecast jobs are smaller.
4 Appendix A provides a more detailed description of the computational environment.
5 Really Simple Syndication (www.rssboard.org/rss-specification): a web content syndication format that allows a server to publish updates using a standard format.
6 Appendix B includes a more detailed analysis of available track data.
7 Infrastructure-as-code (IaC) defines a cloud environment (e.g., network configuration, virtual machines, or storage resources) as text data (often in a vendor-specific format like Amazon’s CloudFormation or Azure Resource Manager templates). Using IaC allows for repeatable and shareable cloud deployments.
8 Microsoft subject matter experts recommended this limitation to ensure acceptable InfiniBand performance.
9 Here “intensity” is defined as the maximum sustained wind at 10-m elevation within a tropical cyclone, using an averaging time of 1 min. This definition is consistent with that used operationally by NHC and JTWC.
Acknowledgments.
We thank Fleet Numerical Meteorology and Oceanography Center for providing project scope, oversight, and funding support; Navy Commercial Cloud Services for cloud funding; Naval Facilities Information Technology Command for cloud brokerage services; and Microsoft Corporation (especially Steve Roach, Jerry Morey, Brent Siglar, and the behind-the-scenes experts they brought in to help us troubleshoot issues, Zach Yearsley, and Christopher De Felippo) for technical support. Bill Gourgey, John Michalakes, and three anonymous reviewers provided thoughtful comments that greatly improved the manuscript. The use or mention of specific commercial products within this study does not represent an official endorsement by the authors or their affiliated organizations. COAMPS-TC is a registered trademark of the Naval Research Laboratory.
Data availability statement.
Storm track, intensity, and latency data for the forecasts generated during this study are openly available at https://doi.org/10.5281/zenodo.4673553.
Appendix A: Computational configuration
We completed and deployed our demonstration in a commercial cloud environment, but we constrained our design choices based on several factors. First, security: we chose vendor-provided solutions that were available in higher-security environments. Next, portability: we chose vendor-provided solutions that had equivalents in other clouds and on-premise systems. For example, Azure Blob storage has conceptual equivalents in Amazon S3 or Ceph (Weil et al. 2006). The system presented here was implemented in Azure, but a similar system could be deployed to and tested in other cloud environments: Table A1 maps between specific product names and concepts.
Table A1. Description of the products used in the demonstration and a map between proprietary product names and product functionality. This mapping can be used to identify similar products from other vendors and to preserve a record of our configuration if products are renamed or discontinued in the future.
We built a virtual compute cluster analogous to on-premise systems to limit project scope and reduce the time spent porting software and workflows. We accessed the cluster through a Windows virtual machine, which in turn was accessed via the Azure Bastion service in a browser rather than directly from the Internet (Fig. A1). This isolation previewed access controls that would be used in a more restrictive environment.
We used Microsoft CycleCloud for cluster provisioning with existing Microsoft-provided templates to construct and link together a Lustre storage cluster and a compute cluster using the open-source PBSPro scheduler. The CycleCloud server ran on a single Standard E4s v3 instance (see Table A2 for additional details on all instance types) running CentOS 7.8.2003 with access from the Bastion host. We used CycleCloud 7 with several updates during the demonstration (driven in part by our testing), finishing at 7.9.8-1407. For the Lustre scratch filesystem, one metadata server (MDS) node and six object storage server (OSS) nodes used Standard L16s_v2 virtual machines. The L16s-series VMs provide NVMe storage, which was ideal for the bursty I/O pattern of the numerical models and supported our time-to-solution constraints. However, stopping the storage VMs would result in data loss. For maintenance and our initial tests (where we would repeatedly provision and tear down the clusters), we relied on Lustre Hierarchical Storage Management (HSM) to archive the contents of the Lustre filesystem to Azure Blob Storage. The Lustre version used was 2.12. For static data that persisted across cluster restarts, we used Azure NetApp Files (ANF) to store source code, static datasets, library files, and executables. The ANF volume was mounted on all nodes of the compute cluster.
Table A2. Instance types used for the demonstration, showing the number of virtual cores (vCPUs), the memory available, and each instance type’s role in the demonstration.
The compute cluster was provisioned and managed through CycleCloud and consisted of a single Standard D12v2 VM login node, with compute nodes defined by two VM Scale Sets (VMSS). The first, an AMD-based Standard HB120rsv2 VMSS, was reserved for larger parallel jobs. The second VMSS, composed of Intel-based Standard HC44rs VMs, handled computational jobs with lower resource requirements, such as output generation after a successful forecast. This division ensured that VM autoscaling, determined by feedback from the PBS queue to the CycleCloud provisioning system, remained below the configured VM limit.8 The interconnect between HB120rsv2 nodes used 200 Gb/s HDR InfiniBand.
Appendix B: TC forecast performance
Our primary goal was to assess whether a cloud computing platform could effectively support the dynamic computational resource requirements for TC ensemble forecasting. As a secondary goal, we wanted to ensure that the forecasts we were producing were skillful enough to give us confidence that our cloud implementation was correct. We did not expect (and did not observe) sensitivity to running in the cloud versus an on-premise computer, but the many data pathways and configurations required for moving from one computational environment to another present ample opportunity for misconfiguration and incorrect results. The results presented in this appendix also allow us to analyze in more detail the hybrid initialization technique described in this paper (using the NAVGEM analysis/forecasts for outer nest initial/boundary conditions and the GFS analysis for vortex initialization).
First, we compare the deterministic forecast performance of the ensemble mean with respect to the unperturbed ensemble control member. Model predictions for a common set of cases are validated against the “best track” analyses of TC position and intensity produced by NHC and JTWC, using only cases in which the best track indicates the storm is a TC at both the initial time and the forecast valid time. The ensemble-mean forecast is defined only if at least 17 of the 21 ensemble members are present; the number of ensemble members is not fixed across lead times because the storm dissipates in some members. The validated sample consists of 298 forecasts from 20 tropical cyclones that occurred in the Atlantic, eastern North Pacific, and western North Pacific basins during August and September of 2020.
Performance statistics for track accuracy (as measured by mean absolute error, MAE) and intensity9 accuracy (MAE) and bias (as measured by mean error, ME) are shown in Fig. B1. The track MAE for the ensemble mean is very close to that of the control member except at late lead times when the sample size is quite limited. For intensity, the accuracy of the ensemble-mean forecast is consistently superior to that of the control member, across all lead times. The ensemble-mean intensity MAE is about 10% lower than that of the control member between 6 and 72 h, with even larger improvements at later lead times. The mean error of the ensemble-mean intensity forecasts tends to be slightly lower than that of the control member (i.e., ensemble-mean intensity forecast is slightly weaker than that of the control member, on average). These results are all broadly consistent with those shown in Komaromi et al. (2021) for the operational COAMPS-TC ensemble configuration.
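For concreteness, a NumPy sketch of this verification bookkeeping is given below: the ensemble-mean intensity with the minimum-member rule, plus MAE and ME over a verification sample. It is illustrative only; the actual verification also computes track error as a great-circle distance and enforces a homogeneous sample across models.

```python
import numpy as np

MIN_MEMBERS = 17   # ensemble mean defined only if >=17 of 21 members remain

def ensemble_mean_intensity(member_vmax):
    """Mean intensity across surviving members at one lead time, or NaN when
    too few members still carry the storm (member_vmax holds NaN for members
    that have dissipated it)."""
    member_vmax = np.asarray(member_vmax, dtype=float)
    alive = np.isfinite(member_vmax)
    if alive.sum() < MIN_MEMBERS:
        return np.nan
    return member_vmax[alive].mean()

def mae_and_me(forecast, best_track):
    """Mean absolute error and mean error (bias) over a verification sample,
    skipping cases where either value is missing."""
    f = np.asarray(forecast, dtype=float)
    o = np.asarray(best_track, dtype=float)
    valid = np.isfinite(f) & np.isfinite(o)
    err = f[valid] - o[valid]
    return float(np.abs(err).mean()), float(err.mean())

# Example: three verified intensity forecasts (kt) against best track.
mae, me = mae_and_me([95.0, 110.0, np.nan], [100.0, 105.0, 90.0])
```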
We also examined the track and intensity forecast performance statistics for the ensemble mean with respect to real-time deterministic CTCX (COAMPS-TC with GFS as the parent global model) run by NRL and operational deterministic COTC (COAMPS-TC with NAVGEM as the parent global model) run by FNMOC (Fig. B2). Given the limited scope of the ensemble forecast sample and configuration differences between versions of COAMPS-TC used by the ensemble and those run deterministically in real time in 2020, the purpose of this comparison is to evaluate if the ensemble mean is competitive in terms of accuracy with these rigorously tested deterministic COAMPS-TC configurations. The track forecast accuracy of the ensemble mean lies between that of CTCX (the best performer) and COTC (Fig. B2a). Track forecast performance in COAMPS-TC is highly sensitive to the global model utilized for initial condition information, so it is reasonable that the track accuracy for the ensemble mean, which uses a hybrid approach to initializing the model with GFS and NAVGEM employed in different parts of the domain, falls between that of CTCX and COTC. As for intensity forecast accuracy, the ensemble mean is superior to both CTCX and COTC for this sample, though our experience with the 2020 real-time CTCX configuration is such that we would expect it to be more competitive with the ensemble mean in a larger sample of cases. Nonetheless, the overall conclusion of this exercise is that the COAMPS-TC ensemble demonstration described in this paper produced ensemble-mean track and intensity forecasts that are competitive with or superior to those of deterministic COAMPS-TC, as run in real time in 2020.
References
Arevalo, D. , and T. R. Whitcomb , 2019: NAVGEM on the cloud: Computational evaluation of commercial cloud HPC with a global atmospheric model. 73rd HPC User Forum, Lemont, IL, Argonne National Laboratory, www.hpcuserforum.com/presentations/argonne2019/Arevalo_DeVine_Argonne.pdf.
Cangialosi, J. P. , 2021: 2020 hurricane season. National Hurricane Center Forecast Verification Rep., 77 pp., www.nhc.noaa.gov/verification/pdfs/Verification_2020.pdf.
Charney, J. G. , R. Fjörtoft , and J. von Neumann , 1950: Numerical integration of the barotropic vorticity equation. Tellus, 2, 237–254, https://doi.org/10.3402/tellusa.v2i4.8607.
Chen, J. H. , and Coauthors, 2019: Advancements in hurricane prediction with NOAA’s next‐generation forecast system. Geophys. Res. Lett., 46, 4495–4501, https://doi.org/10.1029/2019GL082410.
Chui, T. C. Y. , D. Siuta , G. West , H. Modzelewski , R. Schigas , and R. Stull , 2019: On producing reliable and affordable numerical weather forecasts on public cloud-computing infrastructure. J. Atmos. Oceanic Technol., 36, 491–509, https://doi.org/10.1175/JTECH-D-18-0142.1.
DeMaria, M. , C. R. Sampson , J. A. Knaff , and K. D. Musgrave , 2014: Is tropical cyclone intensity guidance improving? Bull. Amer. Meteor. Soc., 95, 387–398, https://doi.org/10.1175/BAMS-D-12-00240.1.
Doyle, J. , and Coauthors, 2014: Tropical cyclone prediction using COAMPS-TC. Oceanography, 27, 104–115, https://doi.org/10.5670/oceanog.2014.72.
Doyle, J. , and Coauthors, 2020: Recent progress and challenges in tropical cyclone intensity prediction using COAMPS-TC. Tropical Meteorology and Tropical Cyclones Symp., Boston, MA, Amer. Meteor. Soc., 1.1., https://ams.confex.com/ams/2020Annual/meetingapp.cgi/Paper/363334.
Droegemeier, K. K. , and Coauthors, 2005: Service-oriented environments for dynamically interacting with mesoscale weather. Comput. Sci. Eng., 8, 12–29, https://doi.org/10.1109/MCSE.2005.124.
Duran-Limon, H. A. , J. Flores-Contreras , N. Parlavantzas , M. Zhao , and A. Meulenert-Peña , 2016: Efficient execution of the WRF model and other HPC applications in the cloud. Earth Sci. Inform., 9, 365–382, https://doi.org/10.1007/s12145-016-0253-7.
Etherton, B. , S. Cecelski , C. Cassidy , P. Sellers , R. Haas , and T. Hartman , 2020: NWP in the cloud – Successes and challenges. MultiCore 10, Boulder, CO, NCAR, https://drive.google.com/file/d/1QV164-PnF9UOHasrELRuAJZMNCK0SvjW/view.
Evangelinos, C. , and C. N. Hill , 2008: Cloud computing for parallel scientific HPC applications: Feasibility of running coupled atmosphere-ocean climate models on Amazon’s EC2. First Workshop on Cloud Computing and Its Applications (CCA-08), Chicago, IL, Argonne National Laboratory, 2–34.
He, Q. , S. Zhou , B. Kobler , D. Duffy , and T. McGlynn , 2010: Case study for running HPC applications in public clouds. Proc. 19th ACM Int. Symp. on High Performance Distributed Computing, Chicago, IL, Association for Computing Machinery, 395–401.
Hogan, T. , and Coauthors, 2014: The Navy Global Environmental Model. Oceanography, 27, 116–125, https://doi.org/10.5670/oceanog.2014.73.
Huang, Q. , and G. Cervone , 2016: Usage of social media and cloud computing during natural hazards. Cloud Computing in Ocean and Atmospheric Sciences, T. C. Vance et al., Eds., Academic Press, 297–324, https://doi.org/10.1016/B978-0-12-803192-6.00015-3.
Jackson, K. R. , L. Ramakrishnan , K. Muriki , S. Canon , S. Cholia , J. Shalf , H. J. Wasserman , and N. J. Wright , 2010: Performance analysis of high performance computing applications on the Amazon Web Services cloud. 2010 IEEE Second Int. Conf. on Cloud Computing Technology and Science, Indianapolis, IN, IEEE, 159–168, https://doi.org/10.1109/CloudCom.2010.69.
Komaromi, W. A. , P. A. Reinecke , J. D. Doyle , and J. R. Moskaitis , 2021: The Naval Research Laboratory’s Coupled Ocean–Atmosphere Mesoscale Prediction System-Tropical Cyclone ensemble (COAMPS-TC ensemble). Wea. Forecasting, 36, 499–517, https://doi.org/10.1175/WAF-D-20-0038.1.
Krishnan, S. P. T. , B. Veeravalli , V. H. Krishna , and W. C. Sheng , 2014: Performance characterisation and evaluation of WRF model on cloud and HPC architectures. 2014 IEEE Int. Conf. on High Performance Computing and Communications/6th Int. Symp. on Cyberspace Safety and Security/11th Int. Conf. on Embedded Software and Systems (HPCC, CSS, ICESS), Paris, France, IEEE, https://doi.org/10.1109/HPCC.2014.218.
Marchok, T. , 2021: Important factors in the tracking of tropical cyclones in operational models. J. Appl. Meteor. Climatol., 60, 1265–1284, https://doi.org/10.1175/JAMC-D-20-0175.1.
Masters, J. , 2020: The most reliable hurricane models, according to their 2019 performance. Yale Climate Connections, 20 August, accessed 6 April 2021, https://yaleclimateconnections.org/2020/08/the-most-reliable-hurricane-models-according-to-their-2019-performance/.
McKenna, B. , 2016: Dubai operational forecasting system in Amazon cloud. Cloud Computing in Ocean and Atmospheric Sciences, T. C. Vance et al., Eds., Academic Press, 325–345, https://doi.org/10.1016/B978-0-12-803192-6.00016-5.
Mehra, A. , V. Tallapragada , Z. Zhang , B. Liu , L. Zhu , W. Wang , and H.-S. Kim , 2018: Advancing the state of the art in operational tropical cyclone forecasting at NCEP. Trop. Cyclone Res. Rev., 7, 51–56, https://doi.org/10.6057/2018TCRR01.06.
Mell, P. , and T. Grance , 2011: The NIST definition of cloud computing. NIST Special Publ. 800-145, 3 pp., https://doi.org/10.6028/NIST.SP.800-145.
Molthan, A. L. , J. L. Case , J. Venner , R. Schroeder , M. R. Checchi , B. T. Zavodsky , A. Limaye , and R. G. O’Brien , 2015: Clouds in the cloud: Weather forecasts and applications within cloud computing environments. Bull. Amer. Meteor. Soc., 96, 1369–1379, https://doi.org/10.1175/BAMS-D-14-00013.1.
Monteiro, A. , C. Teixeira , and J. S. Pinto , 2014: Migrating HPC applications to the cloud. Eighth Iberian Grid Infrastructure Conference, Aveiro, Portugal.
Niño-Mora, J. , 2019: Resource allocation and routing in parallel multi-server queues with abandonments for cloud profit maximization. Comput. Oper. Res., 103, 221–236, https://doi.org/10.1016/j.cor.2018.11.012.
Oliver, H. , M. Shin , and O. Sanders , 2018: Cylc: A workflow engine for cycling systems. J. Open Source Software, 3, 737, https://doi.org/10.21105/joss.00737.
Oliver, H. , and Coauthors, 2019: Workflow automation for cycling systems. Comput. Sci. Eng., 21, 7–21, https://doi.org/10.1109/MCSE.2019.2906593.
Powers, J. G. , K. K. Werner , D. O. Gill , Y.-L. Lin , and R. S. Schumacher , 2021: Cloud computing efforts for the weather research and forecasting model. Bull. Amer. Meteor. Soc., 102, E1261–E1274, https://doi.org/10.1175/BAMS-D-20-0219.1.
Saltzer, J. H. , and M. D. Schroeder , 1975: The protection of information in computer systems. Proc. IEEE, 63, 1278–1308, https://doi.org/10.1109/PROC.1975.9939.
Siuta, D. , G. West , H. Modzelewski , R. Schigas , and R. Stull , 2016: Viability of cloud computing for real-time numerical weather prediction. Wea. Forecasting, 31, 1985–1996, https://doi.org/10.1175/WAF-D-16-0075.1.
Skamarock, W. C. , 2004: Evaluating mesoscale NWP models using kinetic energy spectra. Mon. Wea. Rev., 132, 3019–3031, https://doi.org/10.1175/MWR2830.1.
Skamarock, W. C. , and Coauthors, 2008: A description of the Advanced Research WRF version 3. NCAR Tech. Note NCAR/TN-475+STR, 113 pp., https://doi.org/10.5065/D68S4MVH.
Tallapragada, V. , C. Kieu , Y. Kwon , S. Trahan , Q. Liu , Z. Zhang , and I.-H. Kwon , 2014: Evaluation of storm structure from the operational HWRF during 2012 implementation. Mon. Wea. Rev., 142, 4308–4325, https://doi.org/10.1175/MWR-D-13-00010.1.
Toth, Z. , and E. Kalnay , 1993: Ensemble forecasting at NMC: The generation of perturbations. Bull. Amer. Meteor. Soc., 74, 2317–2330, https://doi.org/10.1175/1520-0477(1993)074<2317:EFANTG>2.0.CO;2.
Vance, T. C. , N. Merati , C. Yang , and M. Yuan , Eds., 2016: Cloud Computing in Ocean and Atmospheric Sciences. Academic Press, Inc., 415 pp., https://doi.org/10.1016/C2014-0-04015-4.
Weil, S. A. , S. A. Brandt , E. L. Miller , D. D. E. Long , and C. Maltzahn , 2006: Ceph: A scalable, high-performance distributed file system. Proc. Seventh Symp. on Operating Systems Design and Implementation, Seattle, WA, USENIX Association, 307–320.
Zhou, L. , S.-J. Lin , J.-H. Chen , L. M. Harris , X. Chen , and S. L. Rees , 2019: Toward convective-scale prediction within the next generation global prediction system. Bull. Amer. Meteor. Soc., 100, 1225–1243, https://doi.org/10.1175/BAMS-D-17-0246.1.
Zhuang, J. W. , D. J. Jacob , J. F. Gaya , R. M. Yantosca , E. W. Lundgren , M. P. Sulprizio , and S. D. Eastham , 2019: Enabling immediate access to Earth science models through cloud computing: Application to the GEOS-Chem model. Bull. Amer. Meteor. Soc., 100, 1943–1960, https://doi.org/10.1175/BAMS-D-18-0243.1.