The National Oceanic and Atmospheric Administration’s (NOAA) Big Data Partnership (BDP) was established in April 2015 through cooperative research agreements between NOAA and selected commercial and academic partners. The BDP is investigating how the value inherent in NOAA’s data may be leveraged to broaden their utilization through modern cloud infrastructures and advanced “big data” techniques. NOAA’s Next Generation Weather Radar (NEXRAD) data were identified as an ideal candidate for such collaborative efforts. NEXRAD Level II data are valuable yet challenging to utilize in their entirety, and recent advances in weather radar science can be applied to both the archived and real-time data streams. NOAA’s National Centers for Environmental Information (NCEI) transferred the complete NEXRAD Level II historical archive, originating in 1991, through North Carolina State University’s Cooperative Institute for Climate and Satellites (CICS-NC) to interested BDP collaborators. Amazon Web Services (AWS) has received and made freely available the complete archived Level II data through its AWS platform. AWS then partnered with Unidata/University Corporation for Atmospheric Research (UCAR) to establish a real-time NEXRAD feed, thereby providing on-demand dissemination of both archived and current data seamlessly through the same access mechanism by October 2015. To organize, verify, and utilize the NEXRAD data on its platform, AWS further partnered with the Climate Corporation. This collective effort among federal government, private industry, and academia has already realized a number of new and novel applications that employ NOAA’s NEXRAD data, at no net cost to the U.S. taxpayer. The volume of accessed NEXRAD data, including this new AWS platform service, has increased by 130%, while the amount of data delivered by NOAA/NCEI has decreased by 50%.
NOAA’s Big Data Partnership has facilitated unprecedented access to NEXRAD real-time and archive data, enabling cloud computing that is accessible, efficient, and innovative.
The National Oceanic and Atmospheric Administration’s (NOAA) Big Data Partnership (BDP) was announced in April 2015 as five identical but separate multiyear cooperative research and development agreements (CRADAs) between NOAA and Amazon Web Services (AWS), Google, IBM, Microsoft, and the Open Commons Consortium (OCC) (collectively referred to as the “collaborators”; Office of Public Affairs 2015). The BDP is essentially a 3–5-yr business experiment to determine to what extent the value inherent in NOAA’s weather, ocean, climate, fisheries, ecosystem, and other environmental data can underwrite the costs of commercial cloud storage and dissemination of those data. The research activity is also investigating whether the availability of NOAA’s data on modern cloud infrastructure may provide for wider or more complex applications of NOAA’s data and result in new business opportunities for U.S. industry.
Rather than simply pushing NOAA’s considerable environmental data holdings (tens of petabytes of data and thousands of datasets) to the BDP collaborators, NOAA and the collaborators have moved deliberately on specific datasets of known value, whereby the collaborators engage NOAA to determine which datasets they wish to be transferred to their infrastructure. Discussions among NOAA subject matter experts, the collaborators, and interested third-party partners of the collaborators have taken place to identify the datasets of interest and their possible applications and business cases and to work out the technical aspects of the data and their delivery. If NOAA expends any special effort (that is, effort beyond normal data services and operations) to make data available for a single collaborator’s interests, those data are also made available to all collaborators to maintain a fair and level “playing field,” without disclosing to others the possible applications discussed privately with NOAA.
As of summer 2016, many datasets have been discussed and investigated, but only the Weather Surveillance Radar-1988 Doppler (WSR-88D) [commonly called Next Generation Weather Radar (NEXRAD)] Level II data activity has reached a full or operational state at any of the collaborators’ sites and is the subject of this paper. Over 270 terabytes (TB) (180 million files) of the 1991–2016 historical archived data in the original binary volume scan format were transferred from NOAA to AWS, Google, Microsoft, and OCC (2015–16 only) during the fall of 2015. (IBM elected not to take part in this NEXRAD data activity.) The data from the tape-based archive systems at the National Centers for Environmental Information (NCEI) in Asheville, North Carolina, were sent to the cloud infrastructure at these collaborators through the Cooperative Institute for Climate and Satellites at North Carolina State University (CICS-NC). The use of CICS-NC infrastructure and personnel as a data distribution mechanism to the collaborators minimized the loads on NCEI operational systems and networks and thus limited any service impacts to NOAA’s customers of NCEI’s existing data services.
THE NEXRAD LEVEL II ARCHIVE.
NEXRAD is the third major iteration of civilian weather surveillance radars deployed in the United States. The WSR-88D follows the WSR-57 and WSR-74 and is a cooperative project with the Federal Aviation Administration (FAA) and Department of Defense (DOD). The deployment of WSR-88D radars began in 1991 and was completed in 1997, with 121 NOAA, 26 DOD, and 12 FAA radars operational (Crum and Alberty 1993; Crum et al. 1998). There are currently 160 WSR-88D operational weather radars in the network.
Out of 362 high-impact publicly managed observation systems assessed in the 2014 National Plan for Civil Earth Observations [assembled by J. Holdren and the White House Office of Science and Technology Policy (Holdren 2014)], NEXRAD ranked second. For comparison, the global positioning system (GPS) ranked first, while NOAA’s Geostationary Operational Environmental Satellite (GOES) system ranked third. NEXRAD is instrumental in the analysis and early detection of tornadoes, rain, ice, snow, flash floods, hail, and destructive wind. The aforementioned National Plan for Civil Earth Observations describes a “national ground-based network of weather radars [that] supports weather forecasting and warning services. These systems detect precipitation and wind and contribute to severe-weather and flash-flood warnings, air-traffic safety, flow control for air traffic, resource protection at military bases, and management of water, agriculture, forests, and snow removal. The network impacts nine SBAs [Societal Benefit Areas], with ‘highest’ impact in the Transportation SBA and ‘very high’ impact in the Weather and Energy and Mineral Resources SBAs” (Holdren 2014, 30–31).
NEXRAD creates three processing levels of data. Level I are the raw signal data, extremely voluminous and not currently archived. Level II data are the result of the signal processing of Level I, resulting in the three meteorological base quantities of reflectivity, velocity, and spectrum width, as well as the three dual-polarization base quantities of differential reflectivity, correlation coefficient, and differential phase, recorded at each elevation angle in the volume scan (Fig. 1). The application of computer algorithms running at the radar sites derives products, called Level III, from the Level II data, including precipitation estimates, hail and tornado detection, and others, totaling over 75 products in all (NWS 1991).
NEXRAD Level II data have been archived at NCEI [formerly the National Climatic Data Center (NCDC)] since 1994 (Kelleher et al. 2007). The initial archive process in 1994 consisted of the National Weather Service (NWS) mailing 8-mm tapes to NCDC, which were manually copied and compressed to different tapes for long-term storage, then sent back to the NWS for reuse. In 2000, NCDC began use of a tape robotics system for the long-term storage of tapes. This allowed the automated retrieval of tapes and files, although an order for several days of data from several sites still took several days to weeks to complete. The Collaborative Radar Acquisition Field Test (CRAFT) project eliminated the manual data archiving process based on 8-mm tapes and transitioned to an all-electronic transmission of radar data from the NWS to NCDC. CRAFT became operational in 2004, improving the percentage of data archived from 56% to 98%, reducing the time to disseminate 20 GB of data from hundreds of hours to minutes and yielding a cost savings of $400,000 per year in eliminated maintenance and support for the 8-mm tape system (Kelleher et al. 2007). Further enhancements and optimizations to the tape robotics and access system reduced data order times, although data volumes continued to increase greatly due to “Super-Res” data resolution improvements in 2008 and the dual-polarization network upgrades in 2012 (Istok et al. 2009).
FROM YEARS TO DAYS: A DETAILED CASE STUDY OF PROCESSING THE ARCHIVE.
Because weather radar is the only operational tool that provides observations of severe weather producing thunderstorms on a fine-enough temporal and spatial resolution (with refreshes happening every 3–5 min and at spatial resolution of 0.25–1 km), weather radar is essential to enable reliable diagnosis and warning of severe weather (Joe et al. 2012; Crum and Alberty 1993). However, weather radar echoes are subject to significant quality-control issues, such as anomalous propagation, biological echoes, and ground clutter (Lakshmanan et al. 2007, 2014), in addition to geometry constraints such as beam blockage. It is essential that weather radar data from a single radar site are supplemented by other sources of information. Therefore, it is useful to carry out radar data processing outside of the radar collection system, where it is easier to combine multiple radar and multiple data sources, including satellite, model, and in situ (Smith et al. 2016). Ever since the first Internet dissemination of Level II (high resolution) radar data in real time (Droegemeier et al. 2002), there has been substantial research carried out on applications that employ these high-resolution data to provide guidance to severe weather forecasters.
The Warning Decision Support System–Integrated Information (WDSS-II; Lakshmanan et al. 2007) integrated several novel ideas on automated diagnosis, detection, and nowcasting of severe weather phenomena. These included machine-learning methods of identifying (Lakshmanan et al. 2009) and tracking storms (Lakshmanan and Smith 2010), depicting rotational shear (Miller et al. 2013; Lakshmanan et al. 2013), fusing multiple sensors (Lakshmanan and Humphrey 2014), data mining storm attributes (Lakshmanan and Smith 2009), disseminating to virtual globes (Smith and Lakshmanan 2009), polarimetric quality control (QC; Lakshmanan et al. 2014), and importance testing (Lakshmanan et al. 2015). One common aspect to all of these algorithms is that these methods were extremely data intensive in terms of training.
The Climate Corporation (“Climate”), a division of the Monsanto Company, aims to help all the world’s farmers sustainably increase their productivity through the use of digital tools. The integrated Climate FieldView digital agriculture platform provides farmers with a comprehensive, connected suite of digital tools. Seamless field data collection, advanced agronomic modeling, and local weather monitoring are brought together into simple mobile and web software solutions. Climate FieldView Prime, Climate FieldView Plus, and Climate FieldView Pro give farmers a deeper understanding of their fields so they can make more informed operating decisions to optimize yields, maximize efficiency, and reduce risk. For more information, please visit www.climate.com or follow the company on Twitter @climatecorp.
Climate has developed state-of-the-science, industry-leading agronomic advisory capabilities that are supported by a vast data science infrastructure. Accurate estimates of precipitation are a critical component to Climate’s agronomic advisors. As such, Climate partnered with Amazon, in collaboration with NOAA, to provide a use case for providing archive and real-time NEXRAD data on AWS as part of the BDP. Climate’s ongoing research efforts require access and processing of large amounts of NEXRAD radar data, which have historically been hard to obtain from NOAA (NWS 1991).
Climate performs most, if not all, research in the cloud using AWS (EC2 and S3). Having the NEXRAD archive as well as the real-time radar data stored in the same architecture has allowed Climate to streamline its research and development process. Prior to the BDP, Climate had to request data, wait for the tape robot to pull the tape, and then download and process the data. That data flow was prone to errors in the downloaded data and resulted in numerous starts and stops for Climate’s research pipeline (Fig. SB1). With the addition of NEXRAD data to AWS, the Climate research pipeline is more compact, projects are several months shorter, and evaluations can be done on larger datasets (Fig. SB2).
The data being available on S3 has benefits to AWS and NOAA:
AWS: Climate does not pay storage costs but does pay for compute cycles on EC2. As such, with more data available, larger instances and more compute cycles are needed.
NOAA: Climate uncovered a long-standing bug in NOAA’s archive of the NEXRAD data that, without Climate and AWS, would likely have continued to go unnoticed. The discovery improved the overall data quality, and ultimately NOAA data can be more widely used.
The temporal and spatial resolution of weather radar is critical for precipitation measurements. A rain gauge network provides point measurements and cannot be scaled to large spatial areas because of logistical issues surrounding maintenance and measurement errors. Operational rain gauge networks do not have sufficient coverage and resolution to meet the needs of flash floods and mudslides that have small time and spatial scales while the WSR-88D network does (Zhang et al. 2012). However, because weather radar can only detect clouds aloft, it is necessary to fuse radar and gauge values to create realistic estimates of rainfall on the ground. Polarimetric weather radars (which the U.S. NEXRAD network was upgraded to in 2013) provide more accurate estimates of rainfall because of their ability to differentiate between different types of hydrometeors (Ryzhkov et al. 2005). In addition, radar geometry problems in areas of complex terrain necessitate the use of satellite information. Thus, radar-only methods of estimating rainfall such as Fulton et al. (1998) have been superseded by multisensor methods such as those of Zhang et al. (2011).
While data-driven methods are more accurate and reliable than the heuristic methods employed before in these domains, availability of data and the lack of elasticity in the computational power significantly slowed down both research and the operationalization of that research. For example, it took Cintineo et al. (2012) several months to carry out the necessary data processing to create a hail climatology. Expanding that work into a multiyear reanalysis by Ortega et al. (2015) took years to complete the computing while carefully managing data ingest and upload to the computational cluster. In addition, hardware and processes failed, and some of the computation would have to be repeated before the data could be transferred, and because this was a manual step, it was also prone to human error.
The public cloud resolves both the issue of data ingest/transfer and elasticity of compute nodes. Thus, when the Climate Corporation was building its new precipitation product (Lakshmanan 2016a), it was able to take advantage of collocated data and processing to fine-tune the complete precipitation pipeline to achieve errors that were significantly lower than those of NWS’s offerings. This was because it was possible to carry out parameter optimizations over very large datasets in a matter of days, not years, by simply having the data always available (on cloud storage) and spinning up a large number of nodes on the compute cluster (Lakshmanan 2016b).
The data management efforts outlined by Cintineo et al. (2012) and Ortega et al. (2015) included a large effort to QC the integrity of the data downloaded from NCEI, which is typical of any large-scale data reprocessing effort requiring a large copy of data from the NCEI archive. This data management QC includes checking for partially transferred files, comparing the lists of cached files to the authoritative copy at NCEI, and reordering and redownloading any corrupt or missing files. This QC is not necessary when using the cloud copy on AWS, or other BDP cloud providers, because NCEI and the collaborators have already completed the QC during the initial transfer. This is another clear benefit of using cloud technology for the BDP. The cloud storage eliminates complexities of massive data transfers and QC. The data are already organized and prepared for processing within the cloud platform.
PREVIOUS BARRIERS TO PROCESSING ENTIRE ARCHIVE.
The NOAA BDP discussed the possibility for transferring the NEXRAD Level II data archive with the collaborators starting in April 2015. Given the potential for new applications and an active user community, as discussed earlier, it had many desirable qualities for commercial and federal interests. While the NEXRAD data were publicly available from NOAA’s NCEI, they were relatively slow to order in bulk and difficult to use for large-scale analysis due to their unwieldy size, exceeding 270 TB as of October 2015. NCEI organizes the archive data storage on tapes to optimize preservation and access to isolated events, such as specific storms, hurricanes, and tornadoes. If a user desired a long time series or large spatial area, data from many tapes would need to be ordered, staged to NCEI web servers, and downloaded. Cost-free public web access to the NCEI tape archive is restricted to a maximum order size of 250 GB to accommodate limited bandwidth and web server saturation. Many previous attempts by users to download and reprocess large amounts of data for nationwide or climatological studies from NCEI had ended in frustration. NCEI offers an offline order fulfillment option, but this option is still limited to 0.5 TB per day, with a cost of $753 per TB required to pull and transfer data off the tapes. The NEXRAD Level II archive, requiring 270 TB in October 2015, would have cost $203,310 and taken 540 days to transfer to a single customer via an offline order. Because of these obstacles, the utilization of the entire NEXRAD archive as a whole was, to the authors’ knowledge, never before accomplished. A new scalable cloud-based approach would enable much faster direct access to the data across space and time.
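The offline-order estimate above follows directly from the stated archive size, price, and daily limit. The arithmetic can be reproduced with a minimal sketch; the function and parameter names are illustrative only and do not come from NCEI.

```python
# Back-of-the-envelope check of the offline-order estimate quoted in the text.
# Inputs are the figures stated above; the function name is illustrative.

def offline_order_estimate(archive_tb, cost_per_tb_usd, tb_per_day):
    """Return (total cost in USD, transfer time in days) for an offline order."""
    total_cost = archive_tb * cost_per_tb_usd
    days = archive_tb / tb_per_day
    return total_cost, days


cost, days = offline_order_estimate(archive_tb=270, cost_per_tb_usd=753, tb_per_day=0.5)
print(f"cost: ${cost:,.0f}, duration: {days:.0f} days")
# cost: $203,310, duration: 540 days
```

At 0.5 TB per day, the transfer time alone is roughly a year and a half for a single customer, before any reprocessing begins.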
COPYING THE ARCHIVE TO THE CLOUD.
NOAA’s NCEI faced several challenges as the opportunity to copy the large NOAA dataset of the historical NEXRAD Level II archive to the cloud emerged. Could the data transfer complete in a reasonable time? Would such a data transfer impact NCEI’s operations, infrastructure, and day-to-day data users? To address these challenges, several optimizations to the data transfer process, along with the collaboration with the CICS-NC, were required to achieve the data transfer within a reasonable time and to minimize impact on NCEI operational systems. CICS-NC was instrumental in staging and transferring the hundreds of terabytes of NEXRAD data from NOAA to the BDP cloud collaborators. In total, CICS-NC transferred over 1 petabyte of data in support of the BDP.
While the entire archive had never been extracted, a significant portion (2001–12) had been copied to NOAA’s National Severe Storms Laboratory (NSSL) and CICS-NC as part of the Multi-Year Reanalysis of Remotely Sensed Storms (MYRORSS) and precipitation reanalysis of the National Mosaic and Multi-Sensor Quantitative Precipitation Estimation (QPE) (NMQ)/next generation QPE (Q2) project (Ortega et al. 2015; Howard et al. 2012). This created an opportunity to jump-start the data transfer from NOAA to the collaborators, as that earlier data transfer had taken three years to manage, QC, and complete, and a significant portion (100 TB) of the dataset was readily accessible at CICS-NC when the BDP began. As NCEI extracted data from tape at a much slower rate of 1–4 TB per day, CICS-NC immediately transferred the existing data on CICS-NC’s storage systems to the BDP collaborators at rates exceeding 10 TB per day.
The BDP utilized the existing NCEI data access and queueing applications to manage the requests for the remaining Level II data. The historical NEXRAD data reside on a tape robotics archive at NCEI. New enhancements optimized the data requests to maximize parallel processing and grouped the tape requests to maximize efficiency. The data transfer from tape utilized an in-memory storage cache. A data transfer application routinely checked this cache for new files and transferred files to CICS-NC over multiple connections. Following the successful transfer of each file from the NCEI system, the application logged the transfer and removed the file from the cache.
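The poll, transfer, log, and remove cycle described above can be sketched as follows. This is an illustration under stated assumptions, not the NCEI application: the cache is modeled as an ordinary directory, `send_file` is a hypothetical caller-supplied function that raises an exception on failure, and the multiple parallel connections mentioned in the text are not modeled.

```python
import logging
from pathlib import Path

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("cache_transfer")


def drain_cache(cache_dir, send_file):
    """Transfer every file currently in cache_dir, then remove it from the cache.

    send_file is a hypothetical callable that raises on failure.
    Returns the names of the files that were transferred successfully.
    """
    transferred = []
    for path in sorted(Path(cache_dir).iterdir()):
        if not path.is_file():
            continue
        try:
            send_file(path)
        except Exception as exc:
            # Leave the file in the cache so that the next poll retries it.
            log.warning("transfer failed for %s: %s", path.name, exc)
            continue
        log.info("transferred %s (%d bytes)", path.name, path.stat().st_size)
        path.unlink()  # remove only after a confirmed transfer
        transferred.append(path.name)
    return transferred
```

Calling `drain_cache` on a timer gives the routine check described in the text. Because a file is deleted only after its transfer succeeds, a failed transfer is retried on the next pass rather than lost.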
The collaborators each use different technology to optimize access to their respective cloud environments. CICS-NC developed custom transfer processes for the AWS, Google, Microsoft, and OCC’s respective architectures. Each collaborator’s cloud had different storage architectures, transfer protocols, and error resolution. CICS-NC utilized a Python utility named “boto” to efficiently upload data to AWS, Google, and OCC (Boto 2016). Conversely, CICS-NC used the Microsoft Azure Python library for uploads to the Microsoft cloud. The use of multiple parallel connections yielded maximum transfer rates of 44 TB per day from CICS-NC to Google, along with a combined 22 TB per day simultaneously to Microsoft and AWS (Fig. 2).
In total, NOAA and CICS-NC transferred more than 270 TB and 180 million files to each of AWS, Microsoft, and Google, with the OCC receiving 80 TB of data from 2015 and 2016. Verifying the successful transfer of data and continued synchronization with the NCEI archive was critical to the success of the project. NCEI maintains an inventory of all files in the NCEI archive. This inventory includes checksum and file size information as well as a list of file names and sizes of the contents of tar files. The archived tar files bundle individual volume scan files together for efficient storage on the NCEI tape archive system. NCEI exports and publishes the inventory on the NCEI website daily, which CICS-NC and the collaborators use to verify the file transfers along with the existence and size of the individual volume scan files (NWS 1991).
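A verification pass of this kind can be sketched as a comparison of a local copy against inventory entries. The inventory layout used here (a mapping from file name to size and MD5 checksum) and the choice of MD5 are assumptions made for illustration; the text does not specify the format of the published NCEI inventory or its checksum algorithm.

```python
import hashlib
from pathlib import Path


def md5_of(path, block_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in blocks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_against_inventory(local_dir, inventory):
    """Classify each inventory entry as ok, missing, or mismatched.

    inventory: {file name: (size in bytes, md5 hex digest)}, an assumed layout.
    """
    report = {"ok": [], "missing": [], "size_mismatch": [], "checksum_mismatch": []}
    for name, (size, checksum) in sorted(inventory.items()):
        path = Path(local_dir) / name
        if not path.is_file():
            report["missing"].append(name)
        elif path.stat().st_size != size:
            report["size_mismatch"].append(name)  # e.g., a partial transfer
        elif md5_of(path) != checksum:
            report["checksum_mismatch"].append(name)
        else:
            report["ok"].append(name)
    return report
```

The size check runs before the checksum so that partially transferred files are caught without reading their contents; only files of the expected size are hashed.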
The NEXRAD data on AWS were the first to be publicly accessible by the BDP collaborators, in October 2015. The service and the usage patterns of the NOAA NEXRAD data on AWS are described below.
AMAZON WEB SERVICES.
Sharing data on AWS.
Cloud technology enables scalable storage with instant access via simple application program interfaces (APIs). The partnership with commercial cloud providers, such as AWS, now enables instant access to NEXRAD data. Users can download the files to their own computers or the Amazon elastic computing environment. Utilizing the Amazon elastic computing environment presents a new option for a fully cloud-based data processing workflow, avoiding the traditionally lengthy ordering process, management of physical storage media, and high labor costs to facilitate sharing. The NEXRAD data stored and shared on AWS can be readily processed and analyzed with a variety of tools that are also available through the AWS cloud. Both experimentation and production use cases can work in parallel against the same copy of data.
Sharing data with Amazon S3.
Amazon S3 is the object storage and publishing service provided by Amazon Web Services used for the NEXRAD data in the BDP. It is designed to be scalable and flexible, securely storing high data rates and large files while sharing high-demand data via http or https. This allows data curators to provide broad access to a variety of users around the world and enables sharing with organizations and consortiums that interact with the data curator and its peers, all from one centralized copy of the data.
NEXRAD on AWS makes both real-time and 25 years of archival NEXRAD data available on Amazon S3. Real-time data “chunks” and individual volume scan files are available in two separate buckets. For NEXRAD on AWS, a bucket represents a unit of storage, which has defined permissions and contains files as well as metadata describing the files. The real-time chunks are individually compressed sections of the greater volume scan, which flow into the real-time bucket with minimal latency from NEXRAD sites. The chunks are then assembled into volume scan files and added to the archive bucket within seconds or minutes of production. This creates a continuously updated, near-real-time archive of volume scan files.
The transfer of the NEXRAD data is powered by Unidata’s Local Data Manager (LDM) software (Davis and Rew 1994). The LDM software provides real-time, event-driven distribution of data and is the core software infrastructure behind Unidata’s Internet Data Distribution (IDD) of real-time weather data. The NEXRAD Level II data are fed into a single instance of a program, written in Python, that simultaneously feeds data to S3 for both real-time and archive access. This program makes extensive use of asynchronous computing capabilities available in recent Python versions, allowing the single program to handle the large bandwidth of data arriving in real time.
The IDD NEXRAD Level II feed contains the real-time chunks of radar data, containing up to 120 radials of data (and some optional metadata). Upon the receipt of one of these chunks of data by the LDM, it is fed into the Python program. This program uploads the chunk to the real-time S3 bucket for immediate access. This chunk is also queued, awaiting the receipt of all chunks for a given volume scan; upon arrival of a volume’s last chunk, all chunks are assembled (in proper order) to form a complete volume. The program is structured to handle any dropouts that occur in the transfer system; this includes missing chunks, as well as chunks arriving out of order. The program also ensures that data are archived, even if a volume’s final chunk does not arrive.
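The queue-and-assemble logic described above can be illustrated with a small sketch. It is not the Unidata program: the class and method names are hypothetical, chunk sequence numbers are assumed to start at 1, byte concatenation stands in for the real volume file format, and the rule of waiting for late chunks after the final chunk has arrived is a design assumption. The `flush` method models archiving a volume whose final chunk never arrives.

```python
class VolumeAssembler:
    """Collect chunks per volume scan and join them in sequence order."""

    def __init__(self):
        self._chunks = {}  # volume id -> {sequence number: bytes}
        self._last = {}    # volume id -> sequence number of the final chunk

    def add_chunk(self, volume_id, seq, data, is_last=False):
        """Store one chunk; return the assembled volume once complete, else None."""
        self._chunks.setdefault(volume_id, {})[seq] = data
        if is_last:
            self._last[volume_id] = seq
        last = self._last.get(volume_id)
        have = self._chunks[volume_id]
        if last is not None and all(n in have for n in range(1, last + 1)):
            return self.flush(volume_id)
        return None

    def flush(self, volume_id):
        """Assemble whatever has arrived, in order, and forget the volume."""
        chunks = self._chunks.pop(volume_id, {})
        self._last.pop(volume_id, None)
        return b"".join(chunks[n] for n in sorted(chunks))
```

Keying the stored chunks by sequence number makes out-of-order arrival harmless, and calling `flush` after a timeout ensures that an incomplete volume is still written out with the chunks that did arrive.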
Accessing the archive and real-time data.
AWS publishes each real-time NEXRAD data packet and volume scan file as its own object on Amazon S3, available over http to anyone anywhere in the world. Standard file organization and naming structure create a representational state transfer (REST), or “RESTful,” interface to the data that allows users to retrieve data programmatically based on radar location and time of creation. The data file naming conventions are described in the appendix.
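As an illustration of retrieval by radar location and time, the sketch below builds the URL of an S3 listing request for one site and one day. The bucket endpoint and the year/month/day/site key layout are assumptions made here for illustration; the authoritative naming conventions are those given in the appendix. No network request is made.

```python
from datetime import date
from urllib.parse import urlencode

# Assumed bucket endpoint, used for illustration only.
BUCKET_URL = "https://noaa-nexrad-level2.s3.amazonaws.com"


def day_prefix(site, day):
    """Key prefix for one radar site and one UTC day (assumed layout)."""
    return f"{day:%Y/%m/%d}/{site.upper()}/"


def listing_url(site, day):
    """URL of an S3 ListObjectsV2 request restricted to that prefix."""
    query = urlencode({"list-type": "2", "prefix": day_prefix(site, day)})
    return f"{BUCKET_URL}/?{query}"


print(day_prefix("ktlx", date(2015, 5, 1)))
print(listing_url("ktlx", date(2015, 5, 1)))
```

Because the site and date are encoded in the object key, a single prefix query returns every volume scan for that radar and day, with no separate catalog lookup.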
INTEGRATIONS OF THE BDP IN EXISTING APPLICATIONS.
The existing users of both AWS infrastructure and NEXRAD data have a well-developed suite of tools used to manage, visualize, and analyze data. Taking steps to provide NEXRAD processing examples for existing AWS users, as well as efforts to integrate access to the AWS S3 bucket in existing NEXRAD applications, is instrumental in reducing the learning curve for users.
AWS data notifications.
AWS has configured both the real-time and archive NEXRAD on AWS buckets to publish notifications to an Amazon Simple Notification Service (SNS) topic when new data are available (Amazon 2016c). Researchers and developers can subscribe to the SNS topics with Amazon Simple Queue Service (SQS) or AWS Lambda. SQS will receive messages from the SNS topic and record them in a queue that can be acted upon using any number of virtual servers in the cloud. AWS Lambda allows users to run code without provisioning or managing servers (Amazon 2016b). Messages received by AWS Lambda can automatically analyze NEXRAD data and send results to another Amazon S3 bucket or to any other server for further analysis.
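A Lambda function subscribed to such a topic might begin by extracting the names of the newly arrived objects, as in the sketch below. It assumes that each SNS message body is a JSON document in the standard Amazon S3 event notification layout; the actual payload published by the NEXRAD topics is not described in this paper and may differ. The function names are illustrative.

```python
import json
from urllib.parse import unquote_plus


def new_object_keys(event):
    """Extract (bucket, key) pairs from an SNS-delivered S3 notification event."""
    pairs = []
    for record in event.get("Records", []):
        # Assumption: the SNS message body is a JSON-encoded S3 event.
        message = json.loads(record["Sns"]["Message"])
        for s3_record in message.get("Records", []):
            s3 = s3_record["s3"]
            # Object keys are URL-encoded in S3 event notifications.
            pairs.append((s3["bucket"]["name"], unquote_plus(s3["object"]["key"])))
    return pairs


def handler(event, context):
    """AWS Lambda entry point: report which new objects arrived."""
    keys = new_object_keys(event)
    for bucket, key in keys:
        print(f"new object: s3://{bucket}/{key}")
    return {"count": len(keys)}
```

In practice, the loop in `handler` is where the analysis of each new file would run, with results written to another bucket or forwarded to another server as described above.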
NEXRAD archive data access at NCEI.
Since 2000, NCEI has provided access to digital NEXRAD Level II data through public web applications. The web applications allow users to examine the inventory of available NEXRAD data and place orders, which retrieve the files from tape and place them into a web-accessible download location. In February 2016, NCEI made existing users aware of the new BDP and AWS capabilities by integrating the new direct download functionality into the existing web page workflows. NCEI added an additional web page that provides the option to users to order data from NCEI or download the data instantly from any BDP collaborator, such as AWS (Figs. 3, 4). The web page design facilitates the addition of other BDP collaborators in the future. From March through July 2016, 40% of users chose the direct data download option. It is likely that some of those users did not return to use the NCEI web pages after learning how to directly browse the Amazon S3 data bucket using stand-alone data transfer tools.
The NOAA Weather and Climate Toolkit.
The Weather and Climate Toolkit (WCT) is a software application, originally developed at NCEI in 2004, used for the simple visualization and data format conversion of meteorological data, including NEXRAD (Ansari et al. 2009). The WCT can access radar data from local disk drives in addition to remote locations accessible via http, https, or ftp protocols. The WCT has a dedicated data selection tab for completed radar orders from NCEI. NCEI created an additional dedicated data selection tab following the release of Level II NEXRAD data in the BDP (Fig. 5). This allows users to select their site and date and list the available files using the AWS API (Amazon 2016a). Users can directly download and visualize a selected file from AWS with a single click. The WCT supports animations and data format conversions. The interface design accommodates additional datasets and BDP collaborators in the future.
THE ANALYSIS OF THE DATA ACCESS STATISTICS.
AWS publicly released access to the AWS Level II NEXRAD archive data on 27 October 2015 (Gold and Barr 2015). NCEI analyzed the user access statistics from NCEI and AWS between November 2015 and July 2016. The analysis recorded a significant total combined increase in outgoing NEXRAD Level II data from NCEI and AWS, totaling 445 TB (Fig. 6). For the single month of March 2016, users accessed a total of 94 TB from NCEI and AWS combined, more than doubling the previous monthly maximum from NCEI alone. The amount of outgoing NEXRAD Level II data from NCEI has decreased by 50% overall, most notably for federal government and military users, for whom NCEI recorded an 84% reduction in data access between January 2015 and July 2016 (Fig. 7). A comparison of the January to July time periods from 2013 through 2016 shows users accessed more than double (2.3 times) the amount of Level II NEXRAD data in 2016 as a direct result of the BDP and the release of the data by AWS (Fig. 8).
From November 2015 to July 2016, users accessed 445 TB of NEXRAD Level II data from NCEI and AWS combined. Of the total 445 TB, users accessed 78%, or 347 TB, from AWS (Fig. 9). From the AWS total of 347 TB, users downloaded 36%, or 126 TB, outside of the AWS environment. Users transferred the remaining 64%, or 221 TB, within the AWS cloud computing environment.
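The percentages quoted above can be checked against the stated volumes with a short calculation. The NCEI remainder is derived here by subtraction and is not a figure stated in the text; the variable and function names are illustrative.

```python
# Volumes in TB, as stated in the text for November 2015 to July 2016.
total_tb = 445          # NCEI and AWS combined
aws_tb = 347            # accessed from AWS
aws_external_tb = 126   # downloaded outside the AWS environment
aws_internal_tb = 221   # transferred within the AWS environment


def share(part, whole):
    """Percentage of whole represented by part, rounded to the nearest integer."""
    return round(100 * part / whole)


print(share(aws_tb, total_tb))          # 78
print(share(aws_external_tb, aws_tb))   # 36
print(share(aws_internal_tb, aws_tb))   # 64
print(total_tb - aws_tb)                # 98 (TB from NCEI, derived by subtraction)
```

The internal and external AWS volumes sum to the AWS total, so nearly two-thirds of the AWS traffic never left the cloud environment.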
FUTURE POSSIBILITIES OF INNOVATION AND SCIENCE IN THE BDP.
Since almost all of NOAA’s data are open and available upon request, the Big Data Partnership is an opportunity to discover how federal data can be best utilized given modern technologies. As an oversimplification, the challenge can be thought of as “platform versus portal”: can NOAA’s data be utilized more, and/or more easily, by providing them integrated within a compute environment rather than simply providing access to those data? Are the manner and speed of service more significant for application development than simple access to the data?
The NEXRAD on AWS experience is showing that both NOAA data access and research may be facilitated, at reduced cost to the taxpayer, on the AWS platform through effective partnerships of government with industry. Seamless access to NOAA’s NEXRAD Level II holdings is now available across time: users can find and use both historical and real-time data in the same place, in the same way. During the writing of this paper, a second collaborator, the Open Commons Consortium, also provided open, API-based access to the Level II NEXRAD dataset, expanding the cloud platforms available for analysis (OCC 2017).
New research opportunities are being created: studies of bird migration and insects, for example, have been conducted on AWS using NEXRAD data (Amazon 2016d). New catalogs have been developed, new data transformations have been performed, and applications have been built on AWS. Does this also quicken the pace of application development and shorten time to market? A faster pace of innovation and scientific discovery is now possible because of the scalability of the AWS platform. Whereas NOAA’s recent reprocessing of 11 years of NEXRAD data took years of calendar time to accomplish because of system limitations, the same volume of processing can now be accomplished on the AWS platform in days.
The role of NOAA as the single authoritative source for all metadata and data copied onto cloud platforms is pivotal to the continued success of the project. For NEXRAD data, NOAA/NCEI will continue to provide the inventory information required to ensure synchronization with the official NCEI archive and reproducible results across multiple cloud platforms. NCEI also maintains the collection metadata, in ISO 19115-2 format, in addition to the digital object identifier (DOI; NWS 1991).
The future of the BDP approach, beyond the initial CRADA term, is yet to be determined. The BDP is by definition a research project, and the results are informing data access and utilization strategies within NOAA and with its collaborators. Currently, the BDP offers an alternative platform for innovation and an augmentation of existing NOAA data services. Opportunities for future data services and collaborations are largely dependent upon the combined value realized during the project by NOAA, the BDP collaborators, and the users of NOAA’s data.
Challenges remain for the BDP and its participants. Development of new true “big data” applications (using many different but voluminous data types to discern new insights) within the BDP is now limited by the understanding of their business cases and the commitment of the necessary resources on both sides of the CRADA. And how can NOAA best steward the data held on AWS and others’ systems, extending the federal government’s role of ensuring the quality and authenticity of these data? The BDP experience with NEXRAD data and AWS represents the first steps toward developing best practices for NOAA data access and stewardship in the cloud-based big data evolution.
This work was performed under a Cooperative Research and Development Agreement (CRADA) between NOAA and Amazon Web Services. However, the views expressed herein are not necessarily those of NOAA, the Department of Commerce, or the U.S. government. This project would not have been possible without the special assistance from Alan Steremberg (Presidential Innovation Fellow), Maia Hansen (Presidential Innovation Fellow), Doug Ross (NOAA/NCEI), John Kobar (NOAA/NCEI), Brian Nelson (CICS-NC), Scott Stevens (CICS-NC), Travis Smith (NOAA/NSSL), and Kiel Ortega (NOAA/NSSL).
File naming conventions.
The archive file naming convention is built from the following components:
<Year> is the year the data were collected,
<Month> is the month of the year the data were collected,
<Day> is the day of the month the data were collected,
<NEXRAD Station> is the NEXRAD ground station (map of ground stations), and
<filename> is the name of the file containing the data. These files are compressed with gzip, and the file name carries more precise time stamp information.
All files in the archive use the same compressed format (.gz). An example data file name is KAKQ20010101_080138.gz. The file naming convention is GGGGYYYYMMDD_TTTTTT, where
GGGG = ground station ID (map of ground stations),
YYYY = year,
MM = month,
DD = day, and
TTTTTT = time when data started to be collected (UTC).
Note that the 2015 files carry an additional field: “_V06” is appended to the end of the file name, as in KABX20150303_001050_V06.gz.
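The GGGGYYYYMMDD_TTTTTT convention lends itself to mechanical parsing. The following is a minimal Python sketch, not part of the original article, that splits an archive file name into its station ID, UTC start time, and optional trailing field. The names ArchiveName and parse_archive_name are hypothetical, and the pattern assumes that station IDs are four uppercase alphanumeric characters and that any trailing field (such as “V06”) is alphanumeric.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

# GGGGYYYYMMDD_TTTTTT, an optional trailing field such as "_V06", then ".gz".
# Assumption: station IDs are four uppercase alphanumeric characters.
_ARCHIVE_NAME = re.compile(
    r"^(?P<station>[A-Z0-9]{4})"
    r"(?P<date>\d{8})_(?P<time>\d{6})"
    r"(?:_(?P<extra>[A-Za-z0-9]+))?"
    r"\.gz$"
)


@dataclass(frozen=True)
class ArchiveName:
    station: str           # GGGG, ground station ID
    start: datetime        # time when data collection started (UTC)
    extra: Optional[str]   # trailing field such as "V06", or None if absent


def parse_archive_name(name: str) -> ArchiveName:
    """Split an archive file name into station, start time, and extra field."""
    match = _ARCHIVE_NAME.match(name)
    if match is None:
        raise ValueError(f"not a recognized archive file name: {name!r}")
    start = datetime.strptime(match["date"] + match["time"], "%Y%m%d%H%M%S")
    return ArchiveName(
        station=match["station"],
        start=start.replace(tzinfo=timezone.utc),
        extra=match["extra"],
    )


# The two example names given in the text.
older = parse_archive_name("KAKQ20010101_080138.gz")
newer = parse_archive_name("KABX20150303_001050_V06.gz")
print(older.station, older.start.isoformat(), older.extra)
print(newer.station, newer.start.isoformat(), newer.extra)
```

Making the trailing field optional lets one parser handle both the older names and the 2015 names without special-casing the year.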
Each chunk of each volume scan file is its own object in Amazon S3. The object naming convention is built from the following components:
<Site> is the NEXRAD ground station (map of ground stations),
<Volume_number> is the volume ID number (cycles from 0 to 999),
<YYYYMMDD> is the date of the volume scan,
<HHMMSS> is the time of the volume scan,
<CHUNKNUM> is the chunk number, and
<CHUNKTYPE> is the chunk type.
All files in the real-time feed use bzip2 compression.
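The real-time components listed above can likewise be recovered from an object name. The text gives the components but not how they are joined, so the Python sketch below assumes a layout of <Site>/<Volume_number>/<YYYYMMDD>-<HHMMSS>-<CHUNKNUM>-<CHUNKTYPE>. That layout, the example key, and the names ChunkKey and parse_chunk_key are assumptions made for illustration and should be checked against the current AWS documentation before use. The scan time is kept without a time zone because the text does not state one for the real-time feed.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ChunkKey:
    site: str             # NEXRAD ground station
    volume_number: int    # volume ID number, cycling from 0 to 999
    scan_time: datetime   # date and time of the volume scan (no time zone attached)
    chunk_num: int        # chunk number within the volume scan
    chunk_type: str       # chunk type, kept as an opaque string


def parse_chunk_key(key: str) -> ChunkKey:
    """Parse a real-time chunk object name under the ASSUMED layout
    <Site>/<Volume_number>/<YYYYMMDD>-<HHMMSS>-<CHUNKNUM>-<CHUNKTYPE>."""
    # Unpacking raises ValueError if the key does not have the assumed shape.
    site, volume, leaf = key.strip("/").split("/")
    date, time, chunk_num, chunk_type = leaf.split("-")
    volume_number = int(volume)
    if not 0 <= volume_number <= 999:
        raise ValueError(f"volume number out of range: {volume_number}")
    scan_time = datetime.strptime(date + time, "%Y%m%d%H%M%S")
    return ChunkKey(site, volume_number, scan_time, int(chunk_num), chunk_type)


# Hypothetical key, constructed only to exercise the parser.
chunk = parse_chunk_key("KTLX/591/20150624-184559-001-S")
print(chunk)
```

Sorting parsed keys by (site, volume_number, chunk_num) is one way to reassemble the chunks of a volume scan in order, under the same layout assumption.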
A supplement to this article is available online (10.1175/BAMS-D-16-0021.2).