Cloud computing offers new opportunities to the scientific community through cloud-deployed software, data-sharing and collaboration tools, and the use of cloud-based computing infrastructure to support data processing and model simulations. This article provides a review of cloud terminology of possible interest to the meteorological community, and focuses specifically on the use of infrastructure as a service (IaaS) concepts to provide a platform for regional numerical weather prediction. Special emphasis is given to developing countries that may have limited access to traditional supercomputing facilities. Amazon Elastic Compute Cloud (EC2) resources were used in an IaaS capacity to provide regional weather simulations with costs ranging from $40 to $75 per 48-h forecast, depending upon the configuration. Simulations provided a reasonable depiction of sensible weather elements and precipitation when compared against typical validation data available over Central America and the Caribbean.
Cloud computing activities are rapidly expanding within the public and private sectors and provide new capabilities for supporting numerical weather prediction and other applications.
Recently, cloud computing has assumed a greater role in the daily lives of many scientists and engineers, following alongside the growing capabilities of smartphones, other mobile devices, and workstations. Cloud computing has improved the efficiency of data storage, delivery, and dissemination across multiple platforms and applications, allowing easier collaboration and data sharing throughout the scientific community. In our personal lives, cloud computing supports the widespread availability and distribution of audio and video content, such as streaming music and movies available through a growing number of commercial providers. Cloud resources serve as the backbone to countless web pages and web-driven applications used by the general public, including data processing and distribution systems that disseminate key weather forecasting, severe weather warning, and climate information. This article reviews basic definitions of cloud computing terminology and describes cloud-based regional weather forecast modeling and spinoff applications to support capacity-building activities in developing countries.
Although the phrase “in the cloud” is commonplace in advertising for new software applications, formal methods have been established for defining cloud resources and their applications. The National Institute of Standards and Technology (NIST; Mell and Grance 2011) defined cloud computing as “a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.” NIST identified the essential characteristics of cloud computing, which include capabilities that can be requested automatically through traditional platforms such as mobile phones or desktop workstations, the sharing of and access to pooled resources regardless of geographic proximity, rapid scalability of those resources to match end-user demand, and metered usage that ensures transparent cost accounting for resources used in a given application. Cloud environments are further subcategorized based upon their intended use and availability. Private clouds are those used and maintained by a single organization, whereas public clouds are available to the general public and maintained by another entity. Private and public clouds operate under three primary service models. Software as a service (SaaS) provides software capabilities, primarily through web-based interfaces, with management of necessary computing resources supported by the cloud provider. Often, SaaS applications are provided on a pay-per-use system, such as web-based teleconferencing (e.g., GoToMeeting) and database or customer management applications (e.g., Salesforce). Platform as a service (PaaS) provides a maintained software and networking platform where paying users can build and refine their applications (e.g., Google’s App Engine).
Infrastructure as a service (IaaS), the focus of applications described in this article, provides basic computing hardware and storage, as well as an operating system, through a series of virtual machines (VMs) that the user can manipulate to execute their desired tasks.
The availability of IaaS within public and private clouds permits a wide variety of hardware configurations and storage options to support numerous applications. For example, the Amazon Elastic Compute Cloud (EC2) offers VMs with predetermined configurations of operating systems, and computing and storage resources. These resources can be allocated from physical hardware located in several domestic and international locations, ranging from the east (northern Virginia) or west (Oregon and Northern California) coasts of the United States to the European Union (Ireland), South America (São Paulo, Brazil), and the Asia Pacific region (Singapore; Tokyo, Japan; and Sydney, Australia). VMs on EC2 range from “micro” configurations comprising a small number of CPUs and storage options to “extra large” configurations with as many as 32 CPUs and terabytes of disk space. VM options are also targeted to specific applications by offering cluster compute, memory, or storage-optimized options, or the recent addition of VMs powered by graphics processing units. When created, a VM can be assigned one of several operating systems, and the resulting system is referred to as an instance. Costs for instances are billed at an hourly rate corresponding to the class of the underlying VM and generally scale with increases in computing resources. Because charges accrue for as long as a VM is running, its timely spinup, use, and termination must be carefully managed to minimize cost. Termination of a VM leads to the loss of all data on its disks, unless results are first moved to a more permanent cloud storage service. Other options for pausing and resuming an instance are available, depending upon the instance configuration. Changes made to the EC2 instance and VM configuration can be preserved and cloned through use of a machine image.
In the EC2 environment, Amazon Machine Images (AMIs) produce a snapshot of the VM software configuration and allow for later restoration by incorporating a saved image into a fresh VM. The same AMI can be launched onto multiple VMs to clone a given configuration, allowing for rapid scalability of an application.
To support data-intensive applications, the Amazon Simple Storage Service (S3) provides large-capacity object storage for datasets, as needed, subject to an additional, monthly billing rate. Storage space on S3 can be used to create data objects that are accessible by VMs through network transfers rather than mounted volume storage, allowing for the push or pull of data to associated VMs. Increased availability of cloud-based storage solutions such as S3, collocated processing of data by VMs, and sharing of computing configurations via AMIs may offer new opportunities for scientists to share access to and processing of large data volumes, reducing the need for the push–pull of content used in collaborative research. Whereas disks in the VM are lost upon termination, data pushed to S3 or configurations stored as an AMI allow for persistence. Since both the storage of AMIs and the usage of S3 are typically cheaper than uptime costs of a VM, their combined usage allows users to retain their progress without paying for the indefinite uptime of a VM.
The performance of the EC2 environment has been compared to more traditional high performance computing (HPC) systems by Jackson et al. (2010). Some limitations within the EC2 environment at that time included varying latency among compute instances of unknown proximity, unknown competition of multiple users operating VMs on the same hardware, variability in hardware assigned to each virtual machine, and occasional outages of various EC2 components. Recently, the EC2 environment has added cluster compute instances that address some of these concerns by improving network efficiency and hardware homogeneity for high performance computing applications.
Cloud-based support for IaaS naturally aligns with concepts of high performance computing that are used within the meteorological community and, specifically, numerical weather prediction (NWP). High performance computing systems supporting NWP are typically composed of a head node that manages the model source code, and execution of the code across a larger number of computing nodes. In traditional applications, these nodes are composed of physical hardware with a static configuration, but similar systems can be configured in a cloud environment. For example, VMs can be used to construct a head node and supporting compute nodes by managing their configurations through the development and maintenance of machine images. The entire system can be launched through a series of scripts that commission the appropriate VMs, establish necessary network configurations, and prepare the system for the execution of a given application. This study focuses on the application of VMs, AMIs, and cluster compute instances to provide a regional modeling framework in scenarios where access to larger HPC resources is limited.
IAAS IN SUPPORT OF CLOUD-BASED NUMERICAL WEATHER PREDICTION.
Researchers at the National Aeronautics and Space Administration’s (NASA’s) Marshall Space Flight Center and Ames Research Center (ARC) collaborated on the use of the Weather Research and Forecasting (WRF) Model within private and public cloud environments. The team obtained the then-latest version of the National Oceanic and Atmospheric Administration/National Weather Service (NOAA/NWS) Science and Training Resource Center (STRC) Environmental Modeling System (EMS; Rozumalski 2014), which comprises a set of data download and processing scripts alongside precompiled binaries of the WRF Model. The STRC-EMS also includes automatic subsetting of larger-domain NWP fields to reduce the amount of data required for download, and options to include unique, higher-resolution land surface and sea surface initial conditions provided by NASA’s Short-term Prediction Research and Transition (SPoRT) Center (Jedlovec 2013). The STRC-EMS was first used to explore and understand the feasibility of near-real-time numerical weather prediction within a NASA private cloud environment hosted at ARC. The use of a private cloud initially served as a software development “sandbox,” where the team experimented with varying configurations of VMs to understand model scalability and runtime efficiency, identify networking bottlenecks, and develop scripting capabilities to deploy a near-real-time modeling system. These exercises aligned with the internal goals of the ARC team by supporting a cloud computing initiative led by the NASA Office of the Chief Information Officer. Outcomes included benchmarks of performance in public and private cloud environments and a broader understanding of science user requirements for cloud access and utilization. 
As a result, the team developed a fully scripted capability that allowed for the launch of an STRC-EMS modeling system, including execution of a modeling domain, postprocessing of the resulting simulation, and dissemination of the data to a specific end user.
Once established, the scripted system was transitioned to the Amazon EC2 environment. Scripting included options to allow for the selection of the EC2 geographic region and the assignment of VM types for the file system and compute instances. Administration of the system is performed through a free “micro” instance currently provided to all EC2 accounts. At a predetermined time each day, the administrative instance executes a script that constructs the modeling system by provisioning the necessary cluster compute VMs, applies the necessary AMIs, and establishes the required networking environment (Fig. 1). Once this is complete, specifics of the model domain and configuration are obtained from Amazon S3, as the configuration for a given region is static and must be retained after computing VMs are terminated. Once the domain is in place, automated STRC-EMS scripts obtain the necessary initial and boundary conditions by subsetting them from the National Centers for Environmental Prediction (NCEP)/Environmental Modeling Center’s (EMC’s) Global Forecast System, along with higher-resolution sea surface temperatures provided by the NASA SPoRT Center. In total, this represents around 20 MB of data acquired from external Internet sources, but inbound data transfers were not charged to EC2 users at the time of this study. STRC-EMS scripts perform the necessary preprocessing, initialization of larger-scale simulation data, execution of the WRF Model, and postprocessing to predefined gridded binary (GRIB) output fields and desired vertical coordinates. Postprocessed GRIB outputs are pushed to the end user via FTP for further use and distribution by the recipient. Outbound data volumes are highly customizable within the STRC-EMS and user costs depend upon the total monthly volume of data transferred out of the EC2 environment. Data transfer speeds depend upon the network performance between EC2 and the end user.
In environments where bandwidth is limited, forecast output and decision aids could be produced in a graphical format, such as hosting a website or web-mapping service using additional EC2-based resources. Once the data have arrived, the end user can incorporate the output within their decision support system or use the data to drive additional applications. When the postprocessing and data dissemination processes have completed, the script terminates the model simulation and decommissions all VMs composing the head and compute instances, ending VM charges against the EC2 user account. Then, with the administration node still intact, the system remains idle until the next forecast period.
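The daily cycle described above can be sketched in Python. This is a minimal illustration of the control flow only and is not the scripting used in this study: the `FakeCloud` class, its `launch` and `terminate` methods, the step functions, and `run_forecast_cycle` are all hypothetical names standing in for real provisioning and STRC-EMS calls. The point of the sketch is that compute VMs are decommissioned even if a step fails, so that hourly charges stop.

```python
# Hypothetical sketch of one daily forecast cycle. FakeCloud stands in for a
# real cloud API and only records which VMs were launched and terminated.

class FakeCloud:
    def __init__(self):
        self.running = set()
        self.log = []

    def launch(self, name):
        self.running.add(name)
        self.log.append(f"launch {name}")
        return name

    def terminate(self, name):
        self.running.discard(name)
        self.log.append(f"terminate {name}")


# Placeholder steps (assumed names); real steps would call the modeling system.
def fetch_domain_config(): pass
def fetch_initial_conditions(): pass
def run_model(): pass
def postprocess(): pass
def push_to_user(): pass


def run_forecast_cycle(cloud, n_compute, steps):
    """Provision a head node and compute nodes, run each step in order,
    and always terminate the VMs afterwards."""
    nodes = []
    try:
        # Record each VM as soon as it launches so that a partial launch
        # is still cleaned up in the finally block.
        nodes.append(cloud.launch("head"))
        for i in range(n_compute):
            nodes.append(cloud.launch(f"compute-{i}"))
        for step in steps:
            step()
            cloud.log.append(f"done {step.__name__}")
    finally:
        for node in nodes:
            cloud.terminate(node)


if __name__ == "__main__":
    cloud = FakeCloud()
    run_forecast_cycle(
        cloud,
        n_compute=2,
        steps=[fetch_domain_config, fetch_initial_conditions,
               run_model, postprocess, push_to_user],
    )
    print("\n".join(cloud.log))
```

In this sketch the administrative instance is the process that calls `run_forecast_cycle` once per day; it is the only component that stays up between forecasts.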
COST AND PERFORMANCE METRICS.
A fundamental difference between IaaS and the use of traditional (noncloud) resources is the cost model within the EC2 or other cloud environments. Local systems are purchased at a fixed, known cost, followed by costs for administrative support and utilities needed to provide power and cooling. This often occurs as a large upfront cost for a hardware purchase, followed by smaller, longer-term, and incremental costs for depreciation, maintenance, and upgrades. In the cloud environment, each component shown in Fig. 1 represents a metered resource, similar to a household utility. For example, VMs are charged at an hourly rate, and storage on S3 is charged at a rate based upon total monthly usage. Other charges are incurred for certain types of network support, storage of AMIs, and in- or outbound data transfers. VM pricing also offers a choice between on-demand usage, which is charged as it is consumed, and larger, upfront purchases of compute time. Herein, discussions focus on costs associated with on-demand usage, but Amazon describes additional cost models and their perceived advantages on its EC2 pricing web page (Amazon Web Services 2014).
Since the costs of a cloud-deployed system depend heavily upon the time period when the system is active, optimizing the system requires trade-offs between acceptable data latency, data delivery, model resolution, complexity of simulated processes, and cost. Current pricing models charge a full hourly rate for any fractional hour of CPU time that is used; therefore, even small increases in performance efficiency that reduce fractional hour usage can produce significant cost savings. To understand system performance, short-term (6 h) model simulations were executed in the Amazon EC2 West (Oregon) region over a varying number of compute instances and during different times of day. Characteristics of the model simulation are listed in Table 1 with domain coverage shown in Fig. 2. This configuration represents a plausible, regional configuration supporting multiple nested domains with cloud-permitting resolutions that employ all scales of physical parameterizations (e.g., cumulus to single-moment bulk microphysics, land surface, boundary layer, and radiation) available within the STRC-EMS. A 6-h integration period was selected to include an appropriate model spinup time. In this interval, processing time and memory requirements for precipitation and other parameterized processes are fully engaged, resulting in a consistent number of real-time minutes that elapse between each hour of forecast output. For example, when performance stabilizes to produce an hour of forecast output every N minutes, the CPU time for a longer forecast can be reasonably estimated by multiplying the number of forecast hours by N minutes, converting to hours, and rounding any fractional hour up to a full hour.
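The estimate described above amounts to a few lines of arithmetic, sketched here in Python. The function name `estimate_vm_cost` and the input values in the example (4.5 minutes per forecast hour, eight instances, and a rate of $2.00 per instance-hour) are illustrative assumptions and are not measurements or prices from this study.

```python
import math


def estimate_vm_cost(minutes_per_forecast_hour, forecast_hours,
                     n_instances, hourly_rate_usd):
    """Scale a stabilized per-forecast-hour runtime to a full forecast.

    Returns (wall_clock_hours, billed_hours, cost_usd). Any fractional
    hour is rounded up because a full hourly rate is charged for it.
    """
    wall_clock_hours = minutes_per_forecast_hour * forecast_hours / 60.0
    billed_hours = math.ceil(wall_clock_hours)
    cost_usd = billed_hours * n_instances * hourly_rate_usd
    return wall_clock_hours, billed_hours, cost_usd


if __name__ == "__main__":
    # Hypothetical inputs: N = 4.5 min per forecast hour, 48-h forecast,
    # 8 instances at an assumed $2.00 per instance-hour.
    wall, billed, cost = estimate_vm_cost(4.5, 48, 8, 2.00)
    print(f"wall clock {wall:.1f} h, billed {billed} h, cost ${cost:.2f}")
```

With these assumed inputs the run takes 3.6 wall-clock hours but is billed as 4 hours per instance, which shows why trimming a run below an hour boundary can matter more than the raw speedup suggests.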
The demonstration case using a single-model configuration represents a deterministic approach, which may offer limited predictability for some synoptic or mesoscale events (e.g., severe weather, split flow aloft, cutoff lows, or blocking highs) versus multimodel ensembles that better characterize forecast uncertainty. If a regional simulation is not perceived as being valuable for an expected event, cloud-based simulations could be disabled to reduce user cost. Where ensemble approaches are desired, organizations could make use of multiple cloud-based forecast instances to generate an ensemble of simulations with varying initial and boundary conditions, physics configurations, or perturbations, and perform any necessary postprocessing after results are available. Each ensemble member would result in varied performance metrics due to differences in parameterization requirements and processing speeds, and additional costs incurred for any postprocessing performed with cloud resources. Since VM costs are based upon hourly rates, costs should scale approximately with expected differences in runtime performance: schemes that require an additional 10% of runtime would add roughly 10% in cost, subject to the rounding of fractional hours to full billed hours. Cloud-based use of ensembles is plausible, but a thorough analysis of cost and performance among the full range of WRF physics packages was beyond the scope of this study.
Given the stable performance in a 6-h period and fixed hourly rates in the on-demand pricing model, VM costs were estimated by scaling runtime to a 48-h simulation. VM costs of a single simulation ranged from around $40 to $75 across the system configurations examined (Fig. 3). Billing information for testing and development indicated that VM usage composed more than 95% of simulation costs. To determine configuration efficiency, runtime performance was assessed by dividing the time required to perform the simulation by the hours simulated. Using this metric, efficiency ranged from 7% to 13% of real time (lower is better), which required 3–6.3 wall-clock hours to complete a 48-h simulation, depending upon VM usage.
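The efficiency metric defined above is a simple ratio, sketched below in Python. The function names are assumptions introduced for illustration; the only inputs taken from the text are the 48-h forecast length, the 7% and 13% efficiency bounds, and the 6.3-h upper runtime.

```python
def runtime_efficiency(wall_clock_hours, simulated_hours):
    """Fraction of real time needed to run the simulation (lower is better)."""
    return wall_clock_hours / simulated_hours


def wall_clock_for(efficiency, simulated_hours):
    """Invert the metric: wall-clock hours implied by a given efficiency."""
    return efficiency * simulated_hours


if __name__ == "__main__":
    print(f"{runtime_efficiency(6.3, 48):.1%}")      # slowest reported runtime
    print(f"{wall_clock_for(0.07, 48):.2f} h")       # fastest reported efficiency
    print(f"{wall_clock_for(0.13, 48):.2f} h")       # slowest reported efficiency
```

Applying the 7% and 13% bounds to a 48-h forecast gives roughly 3.4 and 6.2 wall-clock hours, close to the range quoted in the text once rounding of the percentages is taken into account.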
The range in configuration and cost allows an end user to select an appropriate level of computational efficiency that meets data delivery and cost goals. However, it is important to note the diminishing marginal utility that occurred when simulations were executed with more than eight cluster compute instances (128 virtual CPUs). In other words, further increases in the number of compute instances (VMs) did not translate into increased performance and cost efficiency for the tests conducted on the Amazon EC2. Limits on internal network performance are believed to be responsible for this deficiency, consistent with scalability issues identified by Jackson et al. (2010). Whereas high performance computing systems operate with high-speed internal networks (e.g., InfiniBand) and are fully optimized to support multiprocessing applications, the EC2 environment used in these experiments did not support higher-speed networking capabilities. Internal networking limitations appear to degrade model performance beyond an eight-node configuration and, to a lesser extent, input–output processes of writing model results to disk. However, an eight-node configuration provides the highest runtime efficiency available among the tested configurations, cutting simulation runtime by more than half compared to a two-node run, with a commensurate increase in cost. EC2 and other providers are constantly improving network access and improved performance is expected for future tests and applications.
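The trade-off between delivery time and cost can be expressed as a small selection rule, sketched here in Python: among the configurations that finish before a delivery deadline, choose the one with the lowest billed cost. The function name, the list of node counts and runtimes, and the $2.00 hourly rate are invented for illustration and are not the benchmark results shown in Fig. 3.

```python
import math


def cheapest_within_deadline(configs, deadline_hours, hourly_rate_usd):
    """Pick the lowest-cost configuration that meets the deadline.

    configs is a list of (n_instances, wall_clock_hours) pairs. Returns
    (n_instances, wall_clock_hours, cost_usd), or None if none qualifies.
    """
    best = None
    for n_instances, wall_clock_hours in configs:
        if wall_clock_hours > deadline_hours:
            continue
        # Fractional hours are billed as full hours on every instance.
        cost = math.ceil(wall_clock_hours) * n_instances * hourly_rate_usd
        if best is None or cost < best[2]:
            best = (n_instances, wall_clock_hours, cost)
    return best


if __name__ == "__main__":
    # Hypothetical timings showing little gain beyond eight instances.
    timings = [(2, 6.3), (4, 4.2), (8, 3.0), (16, 3.1)]
    print(cheapest_within_deadline(timings, deadline_hours=5.0, hourly_rate_usd=2.0))
    print(cheapest_within_deadline(timings, deadline_hours=3.5, hourly_rate_usd=2.0))
```

In this made-up example a 16-instance run is never selected, because it costs more than the eight-instance run without finishing sooner, which mirrors the diminishing returns described above.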
IAAS IN SUPPORT OF CAPACITY BUILDING AND REGIONAL FORECASTING APPLICATIONS.
Cloud computing can offer additional resources to support regional weather forecast modeling, either to support on-demand capacity in response to specific events when supplemental data are necessary, or as daily simulations in regions not currently served by higher-resolution regional models. Daily, regional simulations at improved spatial resolutions can better depict and predict mesoscale features unique to an area, such as topographically induced wind flows, the diurnal cycle of precipitation, and greater spatial detail in temperature and moisture. Future growth of public IaaS environments will likely include InfiniBand or other high-speed interconnect capabilities to improve efficiency, and other optimizations of simulation execution may further improve performance. Access to cloud-generated, regional weather forecasts can support other simulation systems, such as prediction of air quality or dispersion models, weather forecast support during local disasters, or hydrological and empirical models used to predict landslide hazards in areas of complex topography (Kirschbaum et al. 2009).
The NASA SERVIR Project (NASA 2014), a joint effort between the NASA Applied Sciences Program and the U.S. Agency for International Development (USAID), is exploring many of these capabilities. SERVIR provides support for Earth observations and applications to improve environmental decision making within developing countries, focusing on activities relevant to regional hubs around the world. Capacity-building efforts within SERVIR focus on the transition of new capabilities to end users. In this mindset, a simulation system on EC2 can be transitioned to operations over Central America and the Caribbean or other regions of interest by migrating automation scripts and associated machine images to Amazon accounts managed by organizations interested in regional NWP. In this mode, they can establish their own modeling configurations without having to maintain the larger infrastructure and support staff that would be required for a full-fledged HPC facility. Cloud computing allows for continuity of operations during a disaster because the compute resources are geographically distant from the affected area and can continue unaffected. In turn, multiple EC2 host regions allow for options to continue model execution during periods when disasters may affect the operation of cloud resources in a given area. Finally, as SERVIR interests expand, other EC2 regions can be incorporated to support local NWP efforts and spinoff applications. For example, EC2 regions in Europe and Asia can be leveraged to support partnering, regional hubs that are in closer proximity to cloud resources to reduce latency while maintaining forecasting capabilities.
These hubs can continue to assume responsibility for operating simulations in a cloud environment, supporting their local decision making at costs that are reduced in comparison to establishing, maintaining, and staffing a high performance computing facility.
REGIONAL MODELING AND VERIFICATION OVER CENTRAL AMERICA AND THE CARIBBEAN.
SERVIR previously established a regional hub and partnerships within Central America. In support of SERVIR activities, the NASA Ames and Marshall team established a regional modeling application using IaaS concepts in private (NASA OpenStack) and public (Amazon EC2) cloud computing environments (Fig. 1). The system has been running in real time over Central America and the Caribbean since 2013 using the configuration and domain summarized in Table 1 and Fig. 2. The WRF Model is initialized once daily at 0600 UTC and run for 48 h. The selection of the nonsynoptic time of 0600 UTC was based upon expected model runtime and desired delivery of outputs to support decision-making activities, but can be changed easily by the user for other applications. This cloud-hosted simulation provides guidance to SERVIR end users in Central America and the Caribbean, and can be used for a variety of applications, such as assessing heavy rainfall and landslide threats, air quality modeling, and resolving strong mountain gap winds.
As an example of such an application, an NWP simulation of a mountain gap wind event is shown in Fig. 4. These strong gap winds, known regionally as tehuantepecers (e.g., Schultz et al. 1997), occur in the Chivela Pass and the Gulf of Tehuantepec when a cold front and attendant, broad region of high pressure move across the area (Fig. 4a). Tehuantepecers pose a significant threat to marine interests in the Gulf of Tehuantepec as shown by forecasts from the Tropical Analysis and Forecasting Branch (TAFB; Fig. 4b), but more readily available, coarser global models may underestimate their intensity. In this example, cloud-hosted, regional NWP simulations provided a reasonable depiction of the strongest gap winds (Fig. 4c) when compared to wind speeds estimated by the WindSat instrument aboard the Coriolis satellite (Fig. 4d). Accurate representation is key, as these gap wind events can reach hurricane-force strength in some extreme situations.
Regional model guidance is only valuable for local forecasting and spinoff applications if it consistently reproduces the observed weather. Quantitative model verification statistics were also provided to SERVIR so that decision makers can understand the quality of these WRF forecasts. Outside of the cloud environment, automated verification scripts are provided by the NASA SPoRT team (Zavodsky et al. 2014) for execution of the NCAR Model Evaluation Tools (MET; Brown et al. 2009). Model error statistics were generated over the WRF Model domains shown in Fig. 2 using point observations available in the NCEP Global Data Assimilation System. Model precipitation skill scores were produced using quantitative precipitation estimates from the NCEP Climate Prediction Center morphing technique product (CMORPH; Joyce et al. 2004). Figure 5 summarizes error statistics on WRF Model domain D01 during November 2013 calculated in the verification region shown in Fig. 2. Results demonstrate the skill that can be achieved by utilizing cloud computing resources in combination with other modeling and postprocessing tools. During this period, the WRF Model exhibited a slight cool, moist bias throughout the daily 48-h forecasts, particularly during the afternoon and early nighttime hours (forecast hours 12–24 and 36–48; Figs. 5a,b). A consistent high wind speed bias of about 1–2 m s−1 during November 2013 occurred in the WRF Model configuration (Fig. 5c) during most forecast hours. Forecasters can use this knowledge of systematic bias to adjust their expectations of the forecast temperature, dewpoint temperature, and winds, compensating for the biases shown in the statistics. The 3-h accumulated precipitation validation, given by the Heidke skill score (HSS) in Fig. 5d, indicates that the skill at lower precipitation thresholds (5 and 10 mm) slowly declined as a function of forecast hour.
For the higher precipitation threshold (25 mm), the skill was considerably lower but did not change appreciably by forecast hour. These types of verification results can be used by decision makers to better understand the skill and limitations associated with NWP forecasts of precipitation events, particularly extreme events that are most likely to produce flooding and landslides.
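For readers unfamiliar with the statistics in Fig. 5, the mean bias and the Heidke skill score can be written out in a few lines of Python. This is a sketch of the standard definitions only; the verification in this study was performed with MET, and the function names and sample values below are hypothetical.

```python
def mean_bias(forecast, observed):
    """Mean of (forecast - observed); negative values indicate a low bias."""
    return sum(f - o for f, o in zip(forecast, observed)) / len(forecast)


def contingency(forecast, observed, threshold):
    """2 x 2 counts for exceeding a threshold: hits, false alarms,
    misses, and correct negatives."""
    hits = false_alarms = misses = correct_negatives = 0
    for f, o in zip(forecast, observed):
        if f >= threshold and o >= threshold:
            hits += 1
        elif f >= threshold:
            false_alarms += 1
        elif o >= threshold:
            misses += 1
        else:
            correct_negatives += 1
    return hits, false_alarms, misses, correct_negatives


def heidke_skill_score(hits, false_alarms, misses, correct_negatives):
    """HSS: 1 is perfect, 0 is no better than chance, negative is worse."""
    a, b, c, d = hits, false_alarms, misses, correct_negatives
    denom = (a + c) * (c + d) + (a + b) * (b + d)
    if denom == 0:
        return float("nan")  # undefined when only one category occurs
    return 2.0 * (a * d - b * c) / denom


if __name__ == "__main__":
    # Hypothetical 3-h precipitation totals (mm) at four locations.
    fcst = [12.0, 3.0, 7.0, 0.0]
    obs = [11.0, 6.0, 2.0, 1.0]
    print(mean_bias(fcst, obs))
    print(heidke_skill_score(*contingency(fcst, obs, threshold=5.0)))
```

Because heavier events are rarer, the counts of hits at a 25-mm threshold are small, which is one reason skill scores at high thresholds tend to be lower and noisier than those at 5 or 10 mm.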
SUMMARY AND FUTURE WORK.
The continued expansion of private and public cloud computing environments offers new opportunities within the scientific community by providing rapidly provisioned, on-demand computing capabilities to support a variety of data processing, simulation, and other end-user applications. This study demonstrated the application of IaaS concepts in support of regional weather forecasts comparable to those produced by many operational centers, but hosted within cloud environments, extending NWP capabilities to entities that may not have access to local HPC infrastructure. By enabling regional weather predictions on public IaaS systems, modeling capabilities developed by the meteorological community can be executed on geographically distant resources in a timely fashion and within a trade space that accommodates end-user requirements for cost and computational efficiency.
Many new opportunities may be enabled through the clever use of cloud computing resources. For example, regional, higher-resolution weather predictions can be used to improve forecast accuracy for phenomena including convective storms and their precipitation, or terrain-dependent hazards such as those shown in the tehuantepecer wind example. The availability of higher-resolution, regional weather forecasts also permits other spinoffs that enable additional applications, such as using convection-permitting simulations to predict higher thunderstorm rain rates that may contribute to landslides within susceptible areas, or the use of simulation output to drive local dispersion models to address local issues related to agricultural burning, air quality, or hazardous emissions. Furthermore, the on-demand nature of cloud computing allows for simulations to be requested in response to significant events or local disasters if the end-user community cannot afford daily simulations. Cloud computing and IaaS-driven applications will provide new opportunities for establishing data processing and simulation capabilities within end-user communities that do not have immediate access to a fully established supercomputing facility.
The authors thank Karen Petraska of the NASA Office of the Chief Information Officer for her support in fostering the collaboration involving her office, NASA Ames, and NASA Marshall targeted at leveraging the Amazon EC2 environment as well as Tsengdar Lee, High End Computing Program manager at NASA Headquarters, for support to the SPoRT and SERVIR teams fostering experiments in operating NWP models in private and public cloud environments. The authors thank three anonymous reviewers for their comments and guidance that improved the final manuscript. The use or mention of a specific commercial product within this study does not represent an official endorsement by the authors or their affiliated organizations.