The First ISLSCP Field Experiment (FIFE) provided an opportunity to test the concept of data publication for long-term access to valuable scientific data. In analogy with the procedures used in research publication, the FIFE Information System and NASA’s Pilot Land Data System adapted the functions performed by authors, editors, and publishers to an information management environment. Procedures and standards were developed to organize, quality check, document, and review data and associated supporting information for publication on a series of five CD-ROM volumes. The CD-ROM series has been successfully published and is in widespread use in the scientific community. The preliminary indications are that this publication will pass the “20-year test” recommended by a committee of the National Research Council for preserving global change data. It is concluded that the data publication approach, using near-permanent distributable publication units like CD-ROMs, is an important addition to the tools for ensuring the survival of large scientific datasets over long periods.
1. Introduction
The First International Satellite Land Surface Climatology Project Field Experiment (FIFE) occurred at a time when concepts and technology for handling field experiment data and information were rapidly changing. During the course of the experiment, it became possible to communicate routinely to an international group of investigators via e-mail, to use the Internet to remotely and interactively query a large online database and download query results, and to produce a permanent archival record of the experiment data on CD-ROM. The FIFE Information System (FIS), in cooperation with NASA’s Pilot Land Data System (PLDS), used these new capabilities to conduct an experiment in publishing the data.
FIFE (Sellers et al. 1992) was a local-scale climatology project set in the prairies of central Kansas during 1987–89. It was designed to study the flows of heat and moisture between the land surface and the atmosphere over a 15 km × 15 km region. To achieve this objective, aircraft and ground data collection activities were coordinated with satellite overpasses over the same area at the same time. Monitoring data were acquired continuously through the 3-yr study period while five intensive field campaigns were conducted (four in 1987 and one in 1989) to collect coordinated datasets encompassing relevant scientific questions in meteorology, biology, remote sensing, hydrology, and soil science.
The FIS (Strebel et al. 1990) was developed to 1) capture and preserve the data from FIFE; 2) facilitate cooperative exchange of data and information for experiment design, execution, and analysis; and 3) provide the foundation for a long-term archive of the data. The raw digital data totaled over 100 gigabytes, encompassed over 1000 distinct parameters and variables, and was submitted to the FIS by approximately 50 independent sources. The FIS staged and organized the raw data and distributed it to investigators during FIFE. After initial data processing and quality assurance, the data selected (by the investigators) for further use were placed in the PLDS (Meeson et al. 1993) for intermediate-term distribution to the land remote sensing community. As the use of the FIFE dataset collection expanded to this larger community, it became evident that a more permanent archive would eventually be required. The FIFE data publication experiment grew out of the attempt to address this need.
2. The origin of the experiment
Neither the FIS nor the PLDS was designed to be a long-term data archive. As with most current digital systems, they were based on media that were short-lived (e.g., <10-yr shelf life) and subject to failure. Providing long-term access would have required continual maintenance costs to overcome media deterioration and adapt to technological advance. Network access and routine operational costs were also likely to be subject to priority shifts and funding irregularities that would lead to data loss.
A better approach seemed to be a focused effort to create the digital equivalent of a book—a self-contained information resource with a long shelf life, low access costs, and high long-term survivability resulting from the production and distribution of multiple copies. Experiments by the Planetary Data System (PDS) (Arvidson and Dueck 1994; McMahon 1996) had shown that the CD-ROM medium had most of these characteristics and could be used for at least limited distribution of large volumes of raw scientific data. It was particularly attractive that CD-ROMs were physically stable and long-lived, that they were accessible via mid-range personal computers, and that they had a commercially supported international format standard (ISO 9660) that would be in use for some time. Essentially, the CD-ROM made it possible to avoid most of the costs and problems of central archiving and move the archive to the scientist’s desktop.
With this goal in mind, the FIS and the PLDS jointly proposed an information management experiment to extend the PDS approach by preparing an archive-ready version of the FIFE dataset collection on CD-ROM. The experiment hypothesis was that publishing and distributing the final, integrated data in this form would create a dispersed archive that would increase accessibility, use, and survival of the data. The objective was to meet the “20-year test” (Webster 1991); that is, the archived data should be usable after 20 years by a scientist unfamiliar with the data or their collection. To achieve this result, we found it useful to develop and apply a formal concept of data publication.
3. The data publication paradigm
The 20-yr test was specifically stated in the context of long-term research areas, such as global change studies, in which data and its documentation must be preserved for future investigators or unanticipated investigations. Such preservation and transmission of information between places, times, and cultures is routinely achieved by published text, which prompted us to examine the publication enterprise to develop an analogous publication model that could be applied to data. A detailed analysis of this analogy, including the need for data publication in science and the proposed mechanisms for implementing it, is presented in Meeson and Strebel (1998).
It should be noted that data publication has always been an implicit part of the scientific method. To support verification or modification of a scientific conclusion, it is a logical necessity that the data on which the conclusion is based be publicly available for critical analysis, replication, and validation. Traditionally, reports on the results of original research also incorporated the measurements and observations on which the conclusions were based. This classical approach cannot accommodate the volume and nature of data, such as remotely sensed imagery, now routinely collected and analyzed in interdisciplinary scientific research. Extensive data publication in scientific papers is also discouraged by research journal editors concerned with soaring costs.
In response to this situation, data publication has begun to emerge as a distinct scientific activity that is the responsibility of scientific information systems. In concept, a system’s publication services should be analogous to those provided by a publisher of books or journals. This analogy identifies the organizational structures and functional activities that are required and leads to a data publication model that has the equivalents of editors, printers, and publishers to support the scientist-authors who collect or assemble the data.
Figure 1 illustrates the information flows between the four fundamental organizational entities (labeled author, publisher, printer, and distributor) of the model. The functions provided by each structural component are linked efficiently in the publication industry because there is an overall set of standards for the exchange of material and information. No such structures or standards were yet established for data publication at the time of the FIFE data publication experiment.
4. An experimental application to FIFE
Working collaboratively, the FIS and the PLDS were organized to fill the structural roles illustrated in Fig. 1. Specific functional assignments included standard data formatting and quality assurance (FIS), coordinating review of data and documentation by authors and outside reviewers (PLDS), staff review and testing of proof copies of each CD-ROM (FIS), making and overseeing printing arrangements (PLDS), and distribution to the scientific community (PLDS). It is important to note the contrast with other efforts (such as the PDS approach) in which one information management group provides all publishing functions.
In the first phase of the experiment, a prototype CD-ROM volume was produced that incorporated a wide variety of FIFE data types and reflected draft standards for documentation and organization (Strebel et al. 1991; Landis et al. 1992). Based on this prototype, the working relationships between FIS and PLDS were optimized for publishing the formal five-volume CD-ROM series. Formal distribution to the scientific community was assigned by NASA to the Goddard Distributed Active Archive Center (DAAC), and later transferred to the Oak Ridge National Laboratory (ORNL) DAAC.
In the production phase of the experiment, datasets were first prepared for publication individually and then combined into integrated groups. The overall process that was applied to each dataset, including incorporating the dataset into the FIS, running quality control checks, preparing documentation, conducting reviews, and publishing it through PLDS, was essentially the same as described in Table 5.2 of Strebel et al. (1994b). That process entailed two dozen steps that took, on average, half a person-year per dataset. Many of these steps were completed for immediate use of the data by the FIFE investigators and hence were not repeated for the publication experiment. The goal of creating a high-quality published product that would be of long-term use to the general scientific community, however, led to a particular emphasis on procedures for data selection and integration, quality assurance, documentation, and review. Each of these areas is described more fully below.
a. Selection and integration
The FIFE scientists served both as a selection committee and a quality assurance team for determining the subset of the FIFE data that would be preserved on the CD-ROM archive. The result of the initial scientific assessment of the data was that a reduced set of key data amounting to about 20% of the total volume collected (5–7 gigabytes) would produce most of the scientific results. Most of the auxiliary datasets collected by ground instruments, if they were of reasonable quality and adequately documented, were included to provide a relatively complete background record. For high-volume data, such as the aircraft imagery, it was more important to have a good-quality representative set of images collected under ideal or near-ideal conditions. In some cases, the FIS also created spatial or temporal subsets of the data that coincided with areas or events of special interest.
The scientific selection of a reduced volume of data was the first step in creating an archival publication that was more than a collection of the raw data. The FIFE CD-ROM archive is a new product reflecting much effort to integrate the datasets for use together. As part of the analysis phase of FIFE, the selected data were first worked on by individual investigators, then examined by disciplinary groups meeting in focused workshops, and finally studied by interdisciplinary groups asking broad integrating questions. The FIS aided these processes by developing common spatial and temporal reference schemes and relating the data to them. Further, as the data were prepared for publication, they were regrouped to reflect discovered commonalities and ease of access for an uninitiated investigator, rather than propagating the specialized dataset groupings initially used to support the funded FIFE investigations.
b. Quality assurance
As part of the “authoring” functions of the FIS, a continual quality assurance effort was conducted by the staff who processed the data, the user support staff as they delivered the data, and the investigators as they worked with the data. Initial data submissions were considered preliminary and were revised as problems were identified. Quality codes and revision dates were associated with each data item as the online database was compiled, and they were carried into the CD-ROM representation of these data. To identify any statistical anomalies, an automated quality assurance screening check was performed as the datasets were extracted from the database and formatted for CD-ROM publication. During the prepublication review of the CD-ROMs, additional random spot checks were made to identify systematic errors or formatting problems.
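The automated statistical screening described above can be illustrated with a short sketch. This is not the actual FIS code; the function, thresholds, and sample values are hypothetical, but the technique (flagging physically impossible values and statistical outliers as data are extracted) is the one the text describes:

```python
# Hypothetical sketch of an automated QA screening check: flag values
# that are outside a physically valid range, or that are statistical
# outliers relative to the in-range data. Thresholds and sample data
# are illustrative only.
import statistics

def screen_column(values, valid_min, valid_max, sigma=3.0):
    """Return (index, value, reason) tuples for suspect values."""
    in_range = [v for v in values if valid_min <= v <= valid_max]
    mean = statistics.mean(in_range)
    sd = statistics.stdev(in_range)
    suspect = []
    for i, v in enumerate(values):
        if not (valid_min <= v <= valid_max):
            suspect.append((i, v, "out of range"))
        elif sd > 0 and abs(v - mean) > sigma * sd:
            suspect.append((i, v, "statistical outlier"))
    return suspect

# Example: air temperatures (deg C) containing one impossible value
temps = [21.5, 22.0, 21.8, 95.0, 22.3, 21.9]
flags = screen_column(temps, valid_min=-40.0, valid_max=50.0)
```

Such a check cannot prove data correct, but it cheaply catches the systematic errors and formatting problems that the prepublication spot checks were also designed to find.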
c. Documentation
The minimal documentation that often accompanies data exchanges between collaborators is not sufficient for archival publication. A large part of the FIFE CD-ROM publication effort was therefore devoted to writing, reviewing, and editing descriptive information for the datasets, assisting the FIFE scientists in providing enhanced documentation that could support innovative use of their data by uninitiated scientists. Preparing the documentation to the comprehensive outline developed for the CD-ROM publication also frequently revealed data gaps, quality problems, or links to other datasets, thus aiding overall quality assurance and data integration.
d. Review
Both critical internal review and independent external review are fundamental to quality publication and to the scientific method. As there were no standard ways to review integrated scientific data and documentation for publication on CD-ROM, a series of review steps was defined. These steps included the following: 1) preliminary design review for the CD-ROM series, 2) quality assurance review of the data to be included, 3) documentation review for datasets, 4) peer review of prototype CD-ROMs, 5) peer review of dataset groups and their documentation, and 6) community review of published CD-ROMs. Most of steps 1–3 were incorporated into the procedures for staff data processing or documentation preparation. The FIFE investigators themselves served as the review committee for the prototype CD-ROM volume as they used it in their analyses (step 4). As a way of implementing step 5, peer review workshops were held to conduct a final prepublication review of the dataset documentation and prepare summary documents for the dataset groups. Step 6 was not officially organized, but is occurring informally as use of the FIFE CD-ROM series expands.
5. CD-ROM organization and distribution
The experience gained from selecting, quality assuring, and documenting data for the prototype CD-ROM volume, and especially from the user review of the prototype (step 4 above), was important in determining the organization of the formal CD-ROM series. The prototype volume was specifically designed to test methods and theories of storing large diverse datasets in complex file and directory structures (Landis et al. 1992). Analysis of the performance of these structures, along with the feedback from the user review, led to a parallel hierarchical organization of datasets and documentation.
For the purposes of the FIFE CD-ROM series, a dataset was defined as a collection of observations that were acquired with similar methods for similar objectives, usually under the direction of the same person or group. Each of the datasets so defined was then placed by the FIS into one of 13 groups of associated datasets (see Table 1). A summary document for each of the groups was written in a peer review workshop to describe the datasets and how they are related, and to give a general assessment of the quality and known shortcomings of the datasets. For investigators unfamiliar with FIFE, these dataset groups provide the primary access routes (via the summary documents and a similarly grouped directory hierarchy) to the data and documentation. Table 2 indicates the general contents of the final five-volume CD-ROM series (Strebel et al. 1992a,b,c, 1993, 1994a) by volume.
As each volume was completed, it was distributed by either PLDS (for the first volumes) or the Goddard DAAC (for the later volumes) to FIFE investigators, to their scientific colleagues who learned of the CD-ROMs by word of mouth, and to others who indicated an interest after hearing presentations or lectures. When the series of CD-ROMs was completed, responsibility for access and distribution was transferred to the Earth Observing System (EOS) DAAC at the Oak Ridge National Laboratory. All distribution beyond the formal FIFE group was by request, in keeping with the publication approach that the consumer must take an action to acquire the publication. Any distribution and usage statistics are therefore free of the biases that might result from an “advertising” approach in which the CD-ROMs were distributed indiscriminately using large mailing lists.
Mechanisms to measure the progress of the publishing experiment were also considered. Initial feedback would be provided by user comments and requests for the CD-ROM series, and long-term feedback would be indicated by citations. Formal acknowledgement of the CD-ROMs as published works was encouraged by material accompanying the CD-ROMs. Although still unfamiliar to the scientific community, formal citation of the CD-ROMs as publications would be consistent with the international copyright convention and with proper scholarly treatment of material from other types of creative works that appear on nonprint media (e.g., literary anthologies or encyclopedias on CD-ROM, records, films, etc.).
The scientific community has frequently lost access to field experiment data supporting remote sensing studies when individual scientists responsible for the data have changed institutions or research areas, archiving institutions have vanished or changed, or the costs of data (media) maintenance have not been funded in favor of new data collection. The point of the FIFE data publication experiment is to examine whether the inevitable decay of available scientific information due to these factors can be eliminated by archiving on distributable media like CD-ROMs. The body of knowledge captured from FIFE in this form amounts to over 100 datasets, thousands of compressed image products, and numerous scanned photographs with visual details of the experiment site in Kansas, along with extensive documentation of the project and data.
The publication of these data on CD-ROM is still an experiment in progress: so far it has only been established that the publication model can be applied to scientific data. It is not yet possible to test the formal hypothesis that accessibility, use, and physical survival of the data will be increased, since the standard of proof is whether the 20-yr test is met. Since the FIFE CD-ROM series was just finished in 1994, it will be 2014 before the FIFE data publication experiment can be completed by a formal evaluation of the CD-ROM series by scientists who have not had previous exposure to FIFE. The authors intend to propose such an evaluation at that time.
In the meantime, there is only anecdotal and partial evidence to predict the likely outcome of the experiment. It is encouraging, for example, that the number of FIFE CD-ROMs in use has continued to grow. The initial distribution of the prototype volume numbered just over 200. As the publication of the five volumes of the official series progressed, the additional requests received pushed the distribution list to almost 300. The FIFE CD-ROM archive was formally transferred to the ORNL DAAC in July of 1994. In the 18-month period between that date and December of 1995, there were 125 FIFE-related requests to the User Services Office. More than 80% of these requests were orders for FIFE CD-ROMs (Olsen 1996).
While numbers indicate interest, they do not indicate usefulness or value. The booklike convenience of the CD-ROM medium, however, has allowed the data to be propagated throughout the world and used by scientists with only rudimentary facilities and support. The authors have received letters indicating that the CD-ROMs have been of use to investigators of many nationalities in a variety of circumstances. This suggests that the documentation and organization of the data are in fact sufficient to make the CD-ROMs a stand-alone archive.
Usability is also suggested by instances of research based on the FIFE data that has been completed and published by investigators not directly connected with FIFE. Among these instances are Ph.D. dissertations by students who had no direct FIFE experience and whose advisors were not FIFE investigators. This argues that in 20 years an equally uninitiated scientist will, in fact, be able to understand and use the data in new scientific investigations.
Actual quantitative measurements of use should be possible at that time, particularly if, as we expect, the scientific community begins to treat data publication more formally. Now that the CD-ROMs have become the primary source of the FIFE data, and research using the data is spreading to investigators not directly involved in FIFE, increasing numbers of formal literature citations are expected. In fact, citations to individual volumes of the series have appeared in Science Citation Index starting as early as 1992.
Overall, the current evidence is that the FIFE CD-ROM archive will exist as a published work for 20 years and be usable at that time for scientific research without additional information sources. Thus, the interim results predict that the 20-year test will be satisfactorily met in 2014. The authors fully anticipate that early in the next century a new crop of global climate researchers will again visit the FIFE site and that the value of that new research will be multiplied many times over by the existence and usability of this detailed and well-documented baseline dataset.
By applying the publication model to preserving the data from FIFE, an archival data publication likely to pass the 20-yr test has been created. Approximately 400 copies of this archive are now in the keeping of individual scientists at locations throughout the world. Many of these copies of the archive should survive well into the next century—accelerated stress tests have indicated that CD-ROMs may remain readable for over 100 years. This physical survivability of the media indicates that information loss would be primarily due to accidental or deliberate destruction. Assuming that there is a loss rate of 10% per year due to such factors, there should still be almost 50 recoverable copies of the archive in 2014 (the 20-yr mark) and at least 2 remaining at the 50-yr mark in 2044.
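The survival estimate above is simple exponential decay. A quick check of the arithmetic, using the figures given in the text (roughly 400 distributed copies and an assumed 10% annual loss rate):

```python
# Expected surviving copies under a constant 10% annual loss rate,
# starting from ~400 distributed copies (figures from the text).
copies = 400
annual_loss = 0.10

def surviving(n0, rate, years):
    """Expected copies remaining after `years` of constant fractional loss."""
    return n0 * (1.0 - rate) ** years

at_20 = surviving(copies, annual_loss, 20)  # 20-yr mark (2014): ~49 copies
at_50 = surviving(copies, annual_loss, 50)  # 50-yr mark (2044): ~2 copies
```

The model confirms the text's figures: almost 50 recoverable copies at the 20-yr mark and at least 2 at the 50-yr mark, even under a pessimistic loss assumption.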
Compared with the alternative of online databases, which depend on well-funded technology, reliable network links, and continual staffing for maintenance and operation, CD-ROM publication appears well worth the up-front cost of its labor-intensive preparation. A mature, cross-checked, and well-documented data collection on CD-ROM is an optimal way to deliver large quantities of data directly to the user and to ensure the persistence of those data in the scientific community for long periods. Since multiple copies will survive, the disastrous consequences of single-point failure modes, such as unrecoverable hard disks, temperamental critical computers, or politically vulnerable institutional archives, are obviated in the preservation of scientific data.
Data publication on CD-ROM may, in fact, be the most cost-effective archiving solution available today. Once the CD-ROM master is created, the incremental cost of duplicating and distributing even thousands of copies is still far below the basic cost of such alternatives as online interactive databases, World Wide Web (Web) sites, or FTP sites. These alternatives, however, can also benefit from an application of the publication model. In general, for any data dissemination method, the more formal the approach to publication and the more attention paid to documentation issues, the more usable the data for a longer period of time.
For example, the same publication structures and concepts that were used in the FIFE CD-ROM data publication experiment can be invoked for placing data on the Web. The basic difference is that the Web is a transient medium more akin to a newspaper than a book—it is unlikely that anyone would maintain a Web site untouched for 20 years or more. In terms of scientific data publication, then, a Web site can serve as a short-term staging ground and testing area for assembling, quality checking, documenting, and reviewing new data collections. The entire Web site, if constructed properly, can then be written to a CD-ROM for final archiving, duplicating, and distribution.
The other major advantage of CD-ROMs is that they avoid the limits that network speed and traffic always place on online data access. It seems obvious that placing a common package of several hundred megabytes of data, well organized and constantly accessible, on every scientist’s desktop is superior to and more efficient than forcing each scientist to spend several hours (on a good network day) choosing, downloading, storing, and organizing that data from online sources. This fundamental advantage of a distributable publication unit, like a book or a CD-ROM, will be multiplied as digital video disk (DVD) technology is implemented. This new generation of CD-like media offers about an order-of-magnitude increase in storage capacity, without compromising the use of existing CD-ROMs. Therefore, publication of large scientific datasets is likely to become even more viable and more frequently used as a method to establish a permanent record of scientific work.
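The bandwidth point can be made concrete with a back-of-envelope calculation. The link and drive speeds below are illustrative assumptions (not figures from the text), chosen to be typical of shared mid-1990s networks and 2x CD-ROM drives:

```python
# Illustrative transfer-time arithmetic: moving one CD-ROM's worth of
# data (~650 MB) over an assumed shared network link vs. reading it
# from a local drive. Speeds are hypothetical period-typical values.
def transfer_hours(megabytes, megabits_per_second):
    """Hours to move `megabytes` of data at the given link speed."""
    return (megabytes * 8) / megabits_per_second / 3600

net_hours = transfer_hours(650, 0.5)  # assumed 0.5 Mb/s effective shared link
cd_hours = transfer_hours(650, 2.4)   # assumed 2x CD-ROM drive (~300 KB/s)
```

Under these assumptions the network transfer takes roughly three hours against well under an hour of local reading, consistent with the "several hours on a good network day" experience described above.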
This picture contrasts with current suggestions (Taubes 1996) that the future of science lies in rapid mass distribution of unreviewed publications and data products via electronic networks, with no need for print or other archival media. Such suggestions ignore the fundamental principle that the scientific method requires careful, quality-controlled, peer-reviewed efforts reflecting mature consideration. Scientific advancements are achieved by constantly testing new knowledge against existing data and accepted conclusions; hence, it will always be necessary to provide data and other peer-reviewed scientific information to scientists on permanent archival media, and to have reliable retrieval of original works over long time periods. In this context, the FIFE CD-ROM publication experiment shows that the data publication model not only works, but is a key to adapting the scientific method to the emerging “information age.”
This experiment was initially conceived in a PLDS meeting with the help of Stephen Ungar, and was funded by NASA HQ Office of Mission to Planet Earth. For continuing support and encouragement, we thank Ghassem Asrar, Forrest Hall, and Piers Sellers. The success of the experiment was dependent on much dedicated work by the staffs of FIS and PLDS, and on the cooperation of the FIFE investigators, who gracefully tolerated our incessant nagging about documentation. We also thank Allison Brindley for her editorial assistance in creating a clear and readable final version of this paper.
Corresponding author address: Dr. Donald E. Strebel, Versar, Inc., 9200 Rumsey Rd., Columbia, MD 21045.