Global surface temperature changes are a fundamental expression of climate change. Recent, much-debated variations in the observed rate of surface temperature change have highlighted the importance of uncertainty in adjustments applied to sea surface temperature (SST) measurements. These adjustments are applied to compensate for systematic biases and changes in observing protocol. Better quantification of the adjustments and their uncertainties would increase confidence in estimated surface temperature change and provide higher-quality gridded SST fields for use in many applications.
Bias adjustments have been based on either physical models of the observing processes or the assumption of an unchanging relationship between SST and a reference dataset, such as night marine air temperature. These approaches produce similar estimates of SST bias on the largest space and time scales, but regional differences can exceed the estimated uncertainty. We describe challenges to improving our understanding of SST biases. Overcoming these will require clarification of past observational methods, improved modeling of biases associated with each observing method, and the development of statistical bias estimates that are less sensitive to the absence of metadata regarding the observing method.
New approaches are required that embed bias models, specific to each type of observation, within a robust statistical framework. Mobile platforms and rapid changes in observation type require biases to be assessed for individual historic and present-day platforms (i.e., ships or buoys) or groups of platforms. The lack of observational metadata and of high-quality observations for validation and bias model development is likely to remain a major challenge.
Bias estimation for sea surface temperature is discussed and recommendations for improving data, observational metadata, and uncertainty modeling are given.
The global surface temperature record is constructed by blending sea surface temperature (SST) with air temperature over land and ice (see also section S1 of the supplement, which is available online at http://dx.doi.org/10.1175/BAMS-D-15-00251.2). Both SST and land air temperature require adjustments to account for changes in, for example, depth or height of measurement, instrumentation, and siting. Improvement of estimated biases in historical measurements of SST will have a major effect on estimates of global surface temperature change and their uncertainty (Jones 2016).
The historical record of observations of the temperature of water at the “sea surface” is a disparate collection of measurements made using different methods from different measurement platforms. Most measurements come from platforms that move (mostly ships and drifting buoys) with relatively few providing time series at fixed locations (e.g., ocean weather ships, fixed platforms, coastal installations, or moored buoys). Adjustment of near-surface air temperatures over land, often called homogenization, relies on comparisons of a candidate station with nearby stations to identify and correct unphysical changes (Trewin 2010). The continually evolving, and largely mobile, marine observing system means that such approaches cannot be easily applied to marine observations.
Folland et al. (1984) applied first-order SST bias adjustments, adding a constant value of 0.3°C to observations made before 1942, based on the difference between global night marine air temperature (NMAT) and SST. By the time of the Intergovernmental Panel on Climate Change (IPCC) First Assessment Report (Houghton et al. 1990), more complex models of SST bias had been developed (Jones et al. 1986; Bottomley et al. 1990) and presently several different estimates of SST bias exist. Figure 1 shows global-mean SST anomalies for the current, commonly used, long-term gridded SST analyses: Hadley Centre SST dataset, version 3 (HadSST3; Kennedy et al. 2011a,b); Extended Reconstructed SST, version 4 (ERSSTv4; Huang et al. 2015); and Centennial Observation-Based Estimates of SST, version 2 (COBE-SST2; Hirahara et al. 2014), along with their bias estimates and uncertainties.
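The first-order adjustment attributed above to Folland et al. (1984) amounts to adding a single constant to early observations. A minimal sketch follows; the function name, the example values, and the treatment of the cutoff as a strict "before 1942" test are illustrative assumptions rather than details taken from that study.

```python
# Illustrative sketch only: names and example data are hypothetical.
CUTOFF_YEAR = 1942    # observations made before this year are adjusted
ADJUSTMENT_C = 0.3    # constant added to pre-cutoff SST, in degrees C

def adjust_sst(year, sst_c):
    """Return SST with the constant first-order adjustment applied."""
    if year < CUTOFF_YEAR:
        return sst_c + ADJUSTMENT_C
    return sst_c

# Made-up (year, SST) pairs used only to exercise the function.
observations = [(1900, 18.2), (1941, 17.9), (1942, 18.4), (1960, 18.6)]
adjusted = [(year, round(adjust_sst(year, sst), 2)) for year, sst in observations]
print(adjusted)
```

A constant of this kind cannot represent the seasonal, regional, or method-dependent structure of the biases, which is why the later, more complex models were developed.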
SST observations and gridded datasets underpin many thousands of published research papers every year, including their use as boundary conditions for atmospheric reanalysis, so the benefits of improved SST bias estimation are wide reaching. However, severe challenges arise because the observations we have are not from a dedicated climate observing system. Early observers were largely concerned with navigation and safety. Observations were collated to document climatology rather than climate change. Detailed information on the ships and the different methods of measurement, now known to be of immense value for assessing changes, has been lost (see the sidebar for more information about lost datasets). Different measurement methods have different characteristic biases, and there are variations peculiar to individual platforms and installations. The characteristic biases also depend on environmental conditions, such as wind speed, solar radiation, and air–sea temperature contrasts, as does the real variability of ocean temperature, with further real variations due to the depth of measurement. Reconciling all of this to make consistent estimates of SST changes would be a challenge even with good documentation. The patchy availability of observational and platform metadata, and sparse sampling in some regions and periods, make it even harder.
Over the years there have been several studies comparing either SST measurements made by different methods or detailed wind tunnel– and ship-based assessments of temperature change from buckets. We have learned a lot from the papers and reports describing these experiments, but much more could be done if we were able to track down the original measurements. We have tried and failed, but we still hope they are out there and that someone knows where they are. And, of course, if you know the whereabouts of any similar measurements, we would be delighted to hear from you.
James and Fox (1972): Approximately 16,000 log entries, each containing at least two measurements of SST and ancillary data, and metadata collected under the auspices of the World Meteorological Organization (WMO) and analyzed at the U.S. Naval Oceanographic Office in Washington, D.C.
Roll (1951a,b): Wind-tunnel measurements of the temperature change of a German SST bucket made at the Meteorological Office for northwestern Germany, Central Office, Hamburg. Also pairs of SST measurements made on the Fishery Patrol Vessel Meerkatze during 1950.
Ashford (1948): Wind tunnel measurements of temperature change of a range of SST buckets carried out in the Instruments Branch of the Meteorological Office, Air Ministry, United Kingdom.
Brooks (1926, 1928): Paired measurements of SST made on the Royal Mail Ship Empress of Britain and other ships in the 1920s. Analysis was at Clark University, Worcester, Massachusetts, and at least a subset of the data was filed with the Library of the U.S. Weather Bureau in Washington, D.C.
We are also on the lookout for instructions given to observers, descriptions of how measurements were made, photographs, diagrams, and other metadata; so again, if you have anything that might be useful, please get in touch.
The first-order bias adjustments required to account for changes in methods of SST observation over the past 150+ years are known. We know that adjustments are required and the direction and approximate size of the change at very large scales. However, a comparison of the different approaches used to estimate SST bias adjustments shows that differences remain that are hard to fully explain. Unexplained differences occur at smaller scales and in periods where measurement methods change quickly. This shows the need to better understand the biases, to improve adjustment methods, and to refine the uncertainty estimates.
Our recommendations to improve the situation are in four areas. First is the enhancement of the source archive to provide more observations, to provide more complete metadata, and to improve quality. Second is a need to develop better models of SST bias and to maintain a range of SST products using different approaches to bias adjustment. Third is a need for accessible, high-quality, consistent validation datasets to be assembled from existing archives and for the availability of such data to be established as metrics for assessing the observing system. Finally, we would like to see more people working in this area and suggest how barriers to getting started might be reduced.
WHAT IS SST AND HOW IS IT MEASURED?
What is SST?
The temperature of the water near the sea surface varies on all space and time scales. The term SST has typically been used to describe the mean temperature of the upper few meters of the ocean. Historically, measurements taken at depths from the surface down to about 20 m have all been assumed representative of the SST. Under well-mixed conditions this is a good assumption. However, there are well-known variations of ocean temperature with depth, especially at low wind speeds and in sunny conditions (Kawai and Wada 2007). Developers of long-term datasets have taken a pragmatic approach, assuming that either the measurements represent well-mixed conditions or the conditions were well sampled and therefore representative of the surface layer even if it was not well mixed. When considering biases, it is necessary to consider spatial differences in the depth dependence of temperature. Further discussion on the definition of SST and its uncertainty can be found in section S2 of the supplement.
How is SST measured?
SST has been measured in different ways over the past 200 years. The observations record real variations in temperature but also contain an imprint of how they were measured. Both the real variations and the biases are affected by the ambient environmental conditions, making them hard to disentangle.
The earliest observations were probably made by sampling seawater in a bucket. Maury (1858) recommended wooden buckets, which were likely used around this time. The type of bucket used evolved over time, with canvas buckets becoming predominant, later replaced by better-insulated rubber and plastic buckets. Figure 2 (left) summarizes the different factors that can cause bias in observations of SST made using buckets.
For measurement, the bucket is thrown into the water to collect a sample. The exact depth of sampling is unknown, but it is close to the surface, especially if the ship is moving fast. If the bucket is at a very different temperature from the water, or contained water from a past sample, then the time the bucket spends in the water to equilibrate is important. We do not know how much care the observers took in following instructions on sampling protocol in this regard, nor in others. Once a bucket leaves the sea, both the bucket and the water sample exchange heat with the atmosphere in a way that is dependent on their volume, thermal properties, and the environmental conditions. The temperature continues to change while the thermometer is read; the change is related to the length of time taken to get a stable reading and to whether the bucket is taken out of the wind and/or into the shade. The initial temperature and response time of the thermometer can also influence the reported temperature.
For ships with engines, the temperature of water pumped on board to cool the engines can be used as an estimate of SST (Fig. 2, right). Sampling is usually deep, as the inlet has to be below the surface whatever the loading of the ship. The ship may also mix the water, so the effective depth of sampling is ambiguous even if the inlet depth is known. Typically, most details of the installation are unknown, so it is hard to determine how an observation might be affected by heat exchange between the inlet and the point of measurement. Historically, there is evidence for inaccurate thermometers and poor installation (Kent and Taylor 2006). An extensive analysis of engine room intake (ERI) observations by James and Fox (1972) showed that ERI SSTs, at that time, were particularly warm for large ships with thermometers more than 3 m inboard from the inlet. Technological developments have likely resulted in thermometers placed nearer to the hull (possible with remote-reading automatic sensors) and farther from the engine room. The type of ERI thermometer was also important, with precision thermometers and thermistors showing smaller offsets relative to bucket measurements than mercury or other types of thermometers. There is some evidence that ERI biases have decreased over time (Kent and Kaplan 2006), which could be explained by better thermometers or improved siting. Determining a ship-by-ship estimate of mean ERI bias would represent a significant advance, perhaps permitting more subtle variations due to greater measurement depths or ship speed to be explored.
Hull-mounted sensors (also shown in Fig. 2, right) are dedicated SST sensors. Kent et al. (1993) showed, for a small subset of ships, that hull sensors were more accurate (smaller bias and noise) than ERI, but good insulation is required (Beggs et al. 2012). A wider analysis of hull sensor accuracy in the field is long overdue.
Surface drifting buoys (Fig. 2, bottom) measure at shallow depths, nominally 10–20 cm. Biases in drifter measurements might arise due to error in sensor calibration, temperature calibration “drift” while deployed, or biofouling on the sensor. Since problems with early drifters were resolved (Bitterman and Hansen 1993), drifting buoys have provided measurements of SST that are near-globally distributed and more accurate than those from ships (Kennedy et al. 2012). Careful quality control is still required to identify spurious spikes in reported position, SST measurements made while the buoy is out of the water (due to predeployment data transmission, beaching, or human interference), and instrument failure or other causes of erroneous data (Lumpkin et al. 2012; Atkinson et al. 2013). Observations made available in delayed mode [e.g., by Integrated Science Data Management (ISDM) or the Atlantic Oceanographic and Meteorological Laboratory] typically have quality control flags appended, but checks of the International Comprehensive Ocean–Atmosphere Data Set (ICOADS) have revealed additional problematic reports in both delayed mode (from ISDM) and real-time data (Atkinson et al. 2013).
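As an illustration of the kind of check that can identify spurious spikes in a drifter SST series, the sketch below compares each report with the median of its neighbors. It is not the procedure used by any of the data centers or studies cited above; the window length and the 2°C threshold are arbitrary assumptions chosen for the example.

```python
from statistics import median

def flag_spikes(sst_series_c, window=5, threshold_c=2.0):
    """Flag values that depart strongly from the median of nearby reports.

    sst_series_c is a time-ordered list of SST values from one buoy.
    Returns a list of booleans, True where a value differs from the median
    of its neighbours (excluding itself) by more than threshold_c.
    """
    half = window // 2
    flags = []
    for i, value in enumerate(sst_series_c):
        lo = max(0, i - half)
        hi = min(len(sst_series_c), i + half + 1)
        neighbours = sst_series_c[lo:i] + sst_series_c[i + 1:hi]
        # A report with no neighbours cannot be assessed and is left unflagged.
        flags.append(bool(neighbours) and abs(value - median(neighbours)) > threshold_c)
    return flags

# Made-up series with one obvious spike at index 3.
series = [20.0, 20.1, 20.2, 27.5, 20.1, 20.0, 19.9]
print(flag_spikes(series))
```

A median is used rather than a mean so that the spike itself does not cause its neighbors to be flagged. A fixed threshold of this kind would need tuning in regions of strong real SST gradients.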
Moored buoys produce continuous measurements at fixed locations at a depth of about 1 m or at several predetermined depths (Kennedy 2014), typically only near coasts or in tropical regions. The mechanisms causing their biases are similar to those for surface drifters, but it is often possible to recover instrumentation from moored buoys for recalibration, improving their overall accuracy.
Availability of observations and ancillary information.
SST observations were first made available in the nineteenth century as charts to aid navigation (Rennell 1832; Maury 1858). Much later, national compilations of marine observations were used to generate gridded analyses of SST for scientific applications (e.g., Bunker 1976; Bottomley et al. 1990). The U.S. national collection developed into a publicly available databank (Woodruff et al. 1987) that became ICOADS, currently release 3.0 (Freeman et al. 2017). ICOADS is the preferred source for constructing historical SST analyses, providing traceability of the data, simpler comparison among derived data products, and access to newly digitized observations (e.g., Allan et al. 2011) and observational metadata (Kent et al. 2007). Moreover, it enables a dialogue that can lead to improvements in ICOADS and in the many ICOADS-derived datasets (JCOMM 2015).
Quantifying SST bias ideally requires accurate location and time information, platform information, and complete information on the methods, instruments, and protocols used, and on the ambient conditions (Fig. 2). ICOADS contains some of the information required (described in section S3 of the supplement), but its availability is patchy. We make recommendations that will enhance the amount of SST data and metadata available by digitization of data and metadata from ships' logbooks (recommendation 1), by reprocessing of the existing ICOADS archive (recommendation 2), and by improved use of external sources of observational metadata (recommendation 3).
CURRENT APPROACHES TO SST BIAS ESTIMATION.
Physics-based bias models.
The factors affecting bucket SST measurements are well known (Fig. 2, top) and have been discussed since the time of Maury (1858). The heat exchange experienced by a water sample in a bucket can be estimated with a physical model (Folland and Parker 1995, hereafter FP95). The bucket is represented by a partly closed cylinder with appropriate thermal properties: uninsulated for canvas buckets, partly insulated for wooden buckets. More difficult is applying these models to historical measurements made using buckets of unknown dimensions and thermal properties in environmental conditions that are also not well known. The approach of FP95 to this problem, as used in HadSST3 and COBE-SST2, is summarized in section S4 of the supplement. Recommendation 4 addresses the need for simplified physical models of SST biases from buckets and better estimates of the thermodynamic forcing required.
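To make the idea of a physical bucket model concrete, the sketch below treats the water sample as a single lumped mass that relaxes exponentially toward an effective ambient temperature once it leaves the sea. This is a deliberately simplified stand-in and not the FP95 model, which treats the heat exchanges in more detail; the effective ambient temperature, the time constants, and all function names are assumptions made for illustration.

```python
import math

def bucket_temperature(t_sea_c, t_ambient_c, elapsed_s, time_constant_s):
    """Sample temperature after elapsed_s seconds on deck.

    Assumes lumped relaxation toward t_ambient_c with an e-folding time of
    time_constant_s (short for an uninsulated bucket, longer when insulated).
    """
    decay = math.exp(-elapsed_s / time_constant_s)
    return t_ambient_c + (t_sea_c - t_ambient_c) * decay

def bucket_bias(t_sea_c, t_ambient_c, elapsed_s, time_constant_s):
    """Measured minus true SST; negative means the sample has cooled."""
    return bucket_temperature(t_sea_c, t_ambient_c, elapsed_s, time_constant_s) - t_sea_c

# Hypothetical comparison: the same 4-minute exposure for two bucket types.
for label, tau in [("uninsulated", 600.0), ("partly insulated", 3000.0)]:
    print(label, round(bucket_bias(20.0, 15.0, 240.0, tau), 3))
```

Even in this reduced form, the bias depends on quantities that are rarely recorded: the exposure time, the thermal properties of the bucket, and the ambient conditions.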
Physical models for biases in ERI SSTs have not been developed, as the detailed information required on individual installations (Matthews and Matthews 2013) is almost always unavailable (Fig. 2, right). Similarly, the estimation of bias in hull sensors has not yet been tackled with physically based models.
Although drifter and moored buoy SSTs are usually considered to be bias free, adjustments for their differences relative to ship-derived SSTs are typically made (Kennedy et al. 2011b; Hirahara et al. 2014; Huang et al. 2015). This choice has been shown to have little effect on long-term trends (Kennedy et al. 2011b).
Physical models for the ocean cool-skin effect and for thermal stratification within the upper few meters of ocean (which can be significant during the daytime if mixing is small) are used to relate satellite SSTs to SST at the depths representative of buoys (Merchant et al. 2012). The models are driven by weather analysis fields, and have skill in reconciling satellite and subsurface measurements (Embury et al. 2012). Such models could be used to inform comparisons of in situ measurements made at different depths.
Application of physics-based models.
The two main barriers to the application of physical-correction models are uncertainty in the measurement method used and in the environmental conditions pertaining to individual observations. Section S3 of the supplement describes the information available in ICOADS to determine the type of platform and measurement method.
Kennedy et al. (2011b) brought together evidence from ICOADS, external sources of measurement metadata [such as that published by the WMO in Publication No. 47 (Publ. 47); Kent et al. 2007], and other documentary information, to estimate measurement methods and their uncertainties (Fig. 3). They weighted bias estimates for each method to produce estimated fields of the unbiased SST. Method weightings, and bias estimates, were varied within plausible ranges to produce an ensemble of SST fields spanning the likely uncertainty. In contrast, Hirahara et al. (2014) approached the problem by estimating the proportions of different methods from differences in the data. They assumed a bias model for each type (insulated bucket, uninsulated bucket, or engine intake) to adjust observations where the method was known. Proportions of observations with an unknown method were then assigned to the different methods such that global SST averages from observations with unknown methods agreed with SST averages from known methods when combined with the method-dependent bias models. These approaches show broad agreement in inferred measurement methods (Fig. 3b). Notable discrepancies include estimates of the rate of transition from uninsulated to insulated buckets (Kennedy 2014).
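The principle of assigning observations of unknown method so that their average agrees with that of known-method observations can be sketched for the simplest case of two methods. The code below is not the procedure of Hirahara et al. (2014); it assumes a single region and time, two methods with fixed biases, and a reference mean already derived from adjusted known-method observations, and all names and values are hypothetical.

```python
def infer_bucket_fraction(unknown_mean_c, reference_mean_c, bucket_bias_c, eri_bias_c):
    """Fraction of unknown-method reports attributed to buckets.

    Solves
        unknown_mean = reference_mean + f * bucket_bias + (1 - f) * eri_bias
    for f and clips the result to the physical range [0, 1].
    """
    if bucket_bias_c == eri_bias_c:
        raise ValueError("biases must differ to separate the two methods")
    f = (unknown_mean_c - reference_mean_c - eri_bias_c) / (bucket_bias_c - eri_bias_c)
    return min(1.0, max(0.0, f))

# Hypothetical values: buckets read 0.3 C cold, engine intakes 0.2 C warm.
print(infer_bucket_fraction(18.0, 18.0, -0.3, 0.2))
```

The clipping step hides cases in which the assumed biases cannot explain the observed difference, so in practice such cases would merit inspection rather than silent truncation.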
Once the measurement method has been assigned, the bias adjustment can be calculated using the appropriate bias model. This is presently done simply: bucket bias adjustments are applied using the fields calculated by FP95 weighted by the proportions of observations thought to be made using wooden, canvas, or rubber buckets (Kennedy et al. 2011b; Hirahara et al. 2014). The relative biases between ships and drifting buoys are fixed. Biases for ERI or hull sensors are fixed in the COBE-SST2 analysis and vary within an estimated range in the HadSST3 analysis.
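The weighting of method-dependent biases by the estimated proportions of each method, and the generation of an ensemble by varying the inputs within plausible ranges, can be sketched as follows. The method names, bias values, and the choice of uniform sampling are assumptions for illustration; the published datasets use gridded bias fields and their own sampling schemes.

```python
import random

def weighted_bias(fractions, biases):
    """Mean bias for a mix of methods; fractions are normalized to sum to 1."""
    total = sum(fractions.values())
    return sum(fractions[m] / total * biases[m] for m in fractions)

def bias_ensemble(fractions, bias_ranges, n_members, seed=0):
    """Build an ensemble by sampling each method bias within its range.

    bias_ranges maps each method to a (low, high) pair. Uniform sampling
    is an assumption of this sketch.
    """
    rng = random.Random(seed)
    members = []
    for _ in range(n_members):
        sampled = {m: rng.uniform(lo, hi) for m, (lo, hi) in bias_ranges.items()}
        members.append(weighted_bias(fractions, sampled))
    return members

# Hypothetical mix of methods and bias ranges (degrees C).
fractions = {"canvas": 0.5, "eri": 0.5}
ranges = {"canvas": (-0.5, -0.3), "eri": (0.1, 0.3)}
members = bias_ensemble(fractions, ranges, n_members=100)
print(min(members), max(members))
```

The same structure extends to perturbing the method fractions themselves, which is how uncertainty in the assignment of measurement method would enter the ensemble.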
Large-scale statistical adjustments using air temperature.
A statistical approach to bias adjustment of ship observations was developed by Smith and Reynolds (2002, hereafter SR02) based on large-scale differences between SST and NMAT measured from ships. The rationale is that biases in NMAT are more straightforward to adjust (Kent et al. 2013; section S1 of the supplement) and that the large-scale differences between SST and NMAT will not vary markedly over time (Huang et al. 2015). NMAT, rather than all-hours MAT, is used to avoid uncertainty due to daytime heating on ships. Details of the SR02 statistical bias model and its implementation by Huang et al. (2015) are described in section S6 of the supplement.
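The rationale of the NMAT-based approach can be illustrated with a reduced sketch in which the SST bias is taken as the departure of the mean SST minus NMAT difference from a fixed reference value. This is not the SR02 model or its implementation by Huang et al. (2015), which work with large-scale spatial patterns; the single reference difference and all names and values below are assumptions.

```python
def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)

def estimate_sst_bias(sst_c, nmat_c, reference_difference_c):
    """Estimated SST bias from collocated SST and bias-adjusted NMAT.

    Assumes the true SST - NMAT difference equals reference_difference_c
    at all times, so any departure is attributed to bias in SST.
    """
    differences = [s - a for s, a in zip(sst_c, nmat_c)]
    return mean(differences) - reference_difference_c

def adjust_sst(sst_c, bias_c):
    """Remove the estimated bias from each SST value."""
    return [s - bias_c for s in sst_c]

# Hypothetical early-period data: SST - NMAT is 0.6 C, reference is 1.0 C.
sst = [17.6, 17.8]
nmat = [17.0, 17.2]
bias = estimate_sst_bias(sst, nmat, reference_difference_c=1.0)
print(round(bias, 2), [round(s, 2) for s in adjust_sst(sst, bias)])
```

The sketch makes the main sensitivities explicit: any residual error in the NMAT values or in the reference difference passes directly into the estimated SST bias.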
This method does not need the detailed information required by physical models, but there are still uncertainties. Any residual biases in adjusted NMAT will influence the SST bias estimates (Rayner et al. 2003; Kent et al. 2013), and uncertainty in NMAT will propagate through to the SST estimates. Although NMAT variations are representative of SST variations on the largest scales (Huang et al. 2015), the relationship is likely to be locally weaker. The computed spatial patterns of SST–NMAT differences are critical for the estimate, and assuming that the patterns are well known and invariant over time also introduces uncertainty. SR02 originally used the bias model only in the pre–World War II (WWII) period dominated by bucket measurements (Fig. 3). Huang et al. (2015) extended the method throughout the record and generated an ensemble to explore uncertainty (described in section S6 of the supplement).
Recommendation 5 calls for the extension of statistical-based modeling of SST biases beyond large-scale adjustments based on NMAT.
COMPARISON AND EVALUATION OF ESTIMATES OF SST BIAS.
Comparison of bias estimates.
The first test of the different bias adjustments is whether the estimates agree within their uncertainty ranges. Figure 4 compares the bias adjustments from HadSST3 and ERSSTv4. In these datasets the sensitivity of the bias estimates to assumptions and values chosen for internal parameters (parametric uncertainty; Kennedy 2014) has been quantified through making plausible perturbations to each of these choices to create an ensemble of bias estimates spanning the known uncertainty in the method (the calculation of the ensembles is described in sections S4 and S6 of the supplement). Figure 4 illustrates the differences between the bias adjustments in the context of the range of the uncertainty ensembles and shows that, by this measure, we do not yet fully understand the biases and their uncertainties at all times throughout the record. Maps showing the average spatial variation of the biases averaged over 1890–1919 (Figs. 4a,c) show differences that exceed the range of their combined uncertainty ensembles over large regions (Fig. 4e). Even in the more recent period of 1995–2004 (Figs. 4b,d), there are regions where the difference exceeds the ensemble range (Fig. 4f). Zonal-mean (Fig. 4g) and global-average differences (Fig. 4h) show that during these periods the large-scale biases are relatively well understood, albeit with compensating bias differences with latitude giving global-average agreement within uncertainty in the earlier period. Differences in the bias adjustments fall outside the ensemble range in two periods: at the start of the record (before about 1880) and around the 1980s. In the early period both SST and NMAT data are sparse, so it is not surprising that our understanding is limited. The later period is when the proportion of SST observations made by ERI is increasing (Fig. 3), and the buoy observing system for SST is not yet well established.
Figure 4h suggests that the discrepancy is likely to arise from an underestimate in uncertainty during this period. However, improving our understanding of in situ SST bias during this period is necessary if the data are to be used with confidence to produce adjustments or validation for satellite-derived estimates of SST. The period around WWII is known to be problematic (e.g., Thompson et al. 2008), as making observations became dangerous, especially at night, when the use of lights could attract an attack. During WWII a greater proportion of observations were made during daylight hours, engine intake measurements were preferred to buckets, and buckets may have been carried inside: all tending to give a warm bias. The WWII period shows rapid variations in the difference between the bias estimates (Figs. 4g,h) but also a large ensemble range, so by this metric these differences are understood, albeit very uncertainly. Such comparisons can help to focus attention on periods and regions where differences are large (e.g., prior to about 1880 or in tropical and high-latitude regions prior to the mid-1990s), when uncertainties are large (e.g., during WWII), or where the uncertainty may be underestimated (e.g., during the 1980s).
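One simple way to express the test of whether two bias estimates agree within their uncertainty ensembles is to ask whether the ranges spanned by the two ensembles overlap. The sketch below takes that reading; it is an assumption about how such a comparison could be coded and not a description of how Fig. 4 was produced, and the ensemble values are invented.

```python
def range_gap(ensemble_a, ensemble_b):
    """Gap between the [min, max] ranges of two ensembles (0.0 if they overlap)."""
    return max(0.0,
               min(ensemble_a) - max(ensemble_b),
               min(ensemble_b) - max(ensemble_a))

def ensembles_disagree(ensemble_a, ensemble_b):
    """True when the two ensemble ranges do not overlap at all."""
    return range_gap(ensemble_a, ensemble_b) > 0.0

# Hypothetical bias estimates (degrees C) for one grid box and period.
dataset_a = [-0.30, -0.25, -0.20]
dataset_b = [-0.10, -0.05]
print(ensembles_disagree(dataset_a, dataset_b), round(range_gap(dataset_a, dataset_b), 2))
```

A nonzero gap can indicate either a real error in one of the bias estimates or an ensemble that is too narrow, which is why such a test points to periods needing attention rather than identifying the cause.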
The comparison shows we are yet to fully reconcile the biases in all types of SST observations throughout the historical record. It also shows that improvements in uncertainty estimation must go hand in hand with improvements in bias estimates. Nevertheless, uncertainties in the bias adjustments are not thought to be large enough to alter the conclusion that global SSTs have increased over the historical record (Hartmann et al. 2013). However, confidence in regional adjustments is lower than for the global mean, as the spatial patterns predicted by the different methods do not agree well (Figs. 4e–g; also Huang et al. 2015; section S7 of the supplement). Uncertainty due to undersampling can be large in some regions and periods (Kennedy 2014), particularly early in the record (Hirahara et al. 2014) and outside major shipping lanes prior to the extension of coverage provided by drifting buoys (Zhang et al. 2009).
Such comparisons of different estimates of the bias, or (less directly) datasets adjusted in different ways, are a good first step toward understanding uncertainty in bias adjustments. A range of different approaches to bias estimation should be maintained and compared (recommendation 6). However, more is learned by disagreement than by agreement, and in order to evaluate the estimated biases an independent reference is needed.
Evaluation by comparison with independent data.
Comparisons with validation data should cover a range of diagnostics, including mean bias and variance relative to validation data evaluated across a range of locations and throughout the annual and diurnal cycles. Attention should be paid to differences arising from the depths of the measurements.
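The diagnostics described above, mean bias and variance relative to validation data evaluated separately for different conditions, can be sketched as a grouped summary of matched pairs. The grouping key (here a month label), the use of the population variance, and the example values are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, pvariance

def diagnostics_by_group(matchups):
    """Summarize ship-minus-reference differences for each group.

    matchups is an iterable of (group, ship_sst_c, reference_sst_c), where
    group might be a month, an hour of day, or a region label.
    """
    diffs = defaultdict(list)
    for group, ship, reference in matchups:
        diffs[group].append(ship - reference)
    return {
        group: {"n": len(d), "mean_bias": mean(d), "variance": pvariance(d)}
        for group, d in diffs.items()
    }

# Hypothetical collocations of ship SST with a buoy reference.
pairs = [("Jan", 18.5, 18.0), ("Jan", 18.1, 18.0), ("Jul", 22.0, 22.2)]
for group, stats in diagnostics_by_group(pairs).items():
    print(group, stats["n"], round(stats["mean_bias"], 2), round(stats["variance"], 3))
```

Such a summary does not account for differences in measurement depth between the ship and reference data, which would need to be handled before the differences are interpreted as bias.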
In the modern period, since the mid-1990s, there are multiple sources of validation data for estimation of biases in SST observations from ships. Drifting and moored buoys take measurements of better accuracy and stability than is routinely obtained by shipboard measurements. Argo floats (Argo 2000) provide accurate data, albeit at low sampling rates, and can be used for validation after about 2005. Some satellite datasets covering the 1990s to the present are of the desired accuracy and are largely independent of the in situ record (Merchant et al. 2012, 2014); therefore, they are suited to validation or independent assessment of SST bias adjustments applied to ship observations. Validating over longer time scales is more difficult. Drifting buoys can be used back to the early 1990s, before which there was no standardized design. Oceanographic measurements are available (Gouretski et al. 2012), but they are also affected by biases (Cheng et al. 2016) and seldom numerous. Ocean weather ships and underway observations from research vessels are potential sources of validation data. Although they may be affected by biases, there is a greater chance of obtaining a full set of high-quality marine meteorological variables and metadata. Work is ongoing to extend independent satellite SST records back to the early 1980s, but the achievable stability of observation is as yet unknown. Careful consideration must be given to the uncertainty inherent in all these data sources.
Extending validation to a wider range of comparison datasets would be valuable. Careful analysis is required if comparisons are made with different parameters (such as air temperature), with coastal observations (which might not be fully representative of open-ocean conditions), or with observations that may have their own biases. Records with consistent instrumentation over the several decades when the observing system was in flux could be valuable—perhaps records from harbor logs, lighthouses, or atolls should be considered. Land station air temperature data from other regions could also be used indirectly via experiments with climate models run with prescribed SST biases adjusted in different ways (e.g., Folland 2005). An overview of potential validation data is given in section S8 of the supplement. Recommendation 7 outlines the need for improved accessibility and management of existing potential sources of validation data. Recommendation 8 considers how the need for consistent and high-quality observations can be built into observing-system adequacy requirements.
Evaluation using measures of internal consistency.
The different types of bias can leave their own characteristic fingerprint on the SST record. For example, FP95 showed that there were signals in the data, related to the seasonal cycle, that could be explained by the characteristic biases in bucket measurements. In this case a measure of the effectiveness of the bucket bias adjustment would be the removal of spurious signals in the seasonal cycle of SST. Kennedy et al. (2011b) showed that adjustments applied to ERI and bucket measurements improved agreement between these two subsets of data from the 1950s onward.
Separating data into two datasets, one used for estimation and training and the other for validation, is a good general approach. This is widely used in assessing statistical techniques and might be applied to existing statistical methods of bias estimation (e.g., SR02). The method can also be applied more generally by setting aside a subset of data for validation, preferably a subset of known high quality that is not used in the estimation or correction of biases. Unfortunately, the data most suitable for validation also have great value for estimating biases. The price paid for having a dataset with credible, validated uncertainty estimates might be a slightly higher overall uncertainty; the alternative is a lower overall uncertainty that is impossible to assess fairly. Research vessel data and Argo data, which are not yet widely used in historical SST datasets, might be used to validate modern periods. Newly digitized data could be used for historical assessments. A degree of independence should also be maintained between the institutions producing bias adjustments and those performing validation. This could be achieved if validation were carried out by an organization independent of the dataset developers, or by using a standard set of widely agreed criteria and comparisons.
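A hold-out split of the kind described above can be sketched by withholding whole platforms, so that no ship or buoy contributes to both bias estimation and validation. The record layout, the `platform_id` key, the withheld fraction, and the random selection are all assumptions; in practice the withheld subset would preferably be chosen for its known quality.

```python
import random

def split_by_platform(reports, validation_fraction=0.2, seed=0):
    """Split reports into estimation and validation sets by platform.

    reports is a list of dicts, each with a 'platform_id' key. At least one
    platform is always withheld for validation.
    """
    ids = sorted({r["platform_id"] for r in reports})
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_validation = max(1, round(len(ids) * validation_fraction))
    validation_ids = set(ids[:n_validation])
    estimation = [r for r in reports if r["platform_id"] not in validation_ids]
    validation = [r for r in reports if r["platform_id"] in validation_ids]
    return estimation, validation

# Hypothetical archive: five platforms with two reports each.
reports = [{"platform_id": f"SHIP{i}", "sst_c": 18.0 + 0.1 * i + 0.01 * j}
           for i in range(5) for j in range(2)]
estimation, validation = split_by_platform(reports)
print(len(estimation), len(validation))
```

Splitting by platform rather than by individual report avoids the platform-specific component of the bias leaking from the estimation set into the validation set.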
To date, the evaluation of bias adjustments using measures of internal consistency has been limited. The development of bias adjustment methods to be applied to individual observations or to data from individual ships would enable the extension of this type of evaluation to other metrics including perhaps a consistent representation of diurnal variations or a minimization of ship-to-ship differences.
PRIORITIES FOR THE FUTURE.
Improvements to data and metadata.
Fundamentally, there is scope for improvements to ICOADS. Although ICOADS is often thought of as “raw” data, it is derived from a larger, more heterogeneous underlying databank from diverse sources. Further reprocessing of the databank could help to better resolve duplicate observations, incomplete ship identifiers, scale conversions, missing metadata, and positional errors among other basic problems (recommendation 2). The recent addition (release 2.5.1 and later) of unique identification (UID) to each report in ICOADS is tremendously helpful. Tying quality control information and metadata studies back to ICOADS via the UID and sharing code and methods will improve traceability, promote collaboration, and help new researchers enter the field (recommendation 9).
Much is to be gained from improvements to metadata (recommendations 1–3). Ship tracking—the association of individual reports into coherent voyages (Carella et al. 2017)—will enable the better characterization of ship-by-ship biases and other errors. Bringing together known sources of metadata into a single repository would be a step toward a more holistic synthesis. A start has been made on inferring absent metadata (Kent et al. 2007, 2010; Kennedy et al. 2011b; Hirahara et al. 2014; Carella et al. 2017) and resolving conflicts that arise when different sources present inconsistent information, but more needs to be done.
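As a rough illustration of what ship tracking involves, the sketch below groups time-ordered reports into tracks by checking that the speed implied by consecutive positions is physically plausible. It is far simpler than the method of Carella et al. (2017); the report layout `(hours, lat, lon)`, the function names, and the 40 km/h threshold are all assumptions made for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0

def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat = p2 - p1
    dlon = math.radians(lon2 - lon1)
    a = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def split_into_voyages(reports, max_speed_kmh=40.0):
    """Group time-ordered (hours, lat, lon) reports into plausible tracks.

    A new track starts whenever the implied speed exceeds max_speed_kmh,
    a hypothetical threshold chosen for illustration.
    """
    tracks = []
    for report in sorted(reports):
        if tracks:
            t0, lat0, lon0 = tracks[-1][-1]
            t1, lat1, lon1 = report
            elapsed = t1 - t0
            distance = great_circle_km(lat0, lon0, lat1, lon1)
            if elapsed > 0 and distance / elapsed <= max_speed_kmh:
                tracks[-1].append(report)
                continue
        tracks.append([report])
    return tracks

# Three reports along 50N at six-hour intervals, then one far away.
reports = [(0, 50.0, -10.0), (6, 50.0, -12.0), (12, 50.0, -14.0), (18, 10.0, 120.0)]
tracks = split_into_voyages(reports)
```

Here the first three reports form one track and the last report starts another. A practical system would also need to handle missing or shared identifiers, position errors, and several ships reporting at once.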
A barrier to the use of recent marine data from ships is the decision by some countries to anonymize ship reports. The reasons often given are that the information has commercial value, or that there are concerns about security. Whatever the reason, it prevents the matching of ships to the relevant metadata in Publ. 47. We hope that a solution can be found to provide this information in a way consistent with the safety of the vessels, if not in real time, then after an appropriate delay.
There is also a need for existing sources of high-quality independent validation data to be collated. While such compilations exist for some data types, such as Argo and drifting buoy observations, complete authoritative archives of data and metadata do not exist for moored buoys, ocean weather ships, or research vessels. Land-based coastal observations are difficult to identify in global and regional archives, and multivariate records are often fragmented (Thorne et al. 2017). A consistent approach to the management of such high-quality observations, quality assured by experts in each data type, would be valuable for the validation of SST biases (recommendation 7). The need for such consistent observations, and for their appropriate management, should be recognized in climate observing-system requirements (recommendation 8).
Improvements to physically based models of SST bias.
Development of the physical models used to estimate bucket biases should continue. Models will be most valuable if independently tested in well-designed experiments under controlled laboratory conditions and at sea. Well-validated physical models will give improved estimates of the expected mean biases and their uncertainties, and allow for the possibility of estimating biases for each observation individually. Careful experimental design is needed before undertaking expensive and time-consuming measurements at sea. Simplified parameterizations of the bucket models are needed for application to a wider range of bucket designs, including modern insulated buckets (recommendation 4).
To drive physical models, we need to understand the inputs to those models and their uncertainties. Estimates of air temperature, humidity, cloud, and wind speed and direction are all needed and all are affected by biases comparable in magnitude to those affecting SST (Berry et al. 2004; Willett et al. 2008; Berry and Kent 2011; Eastman et al. 2011; Thomas et al. 2008). Reanalyses may prove a valuable tool for understanding the expected spatiotemporal variability of bucket-related SST biases and could reveal components of bias variability related to weather and longer-term effects (recommendation 4). It might be expected that as our understanding of these dependencies increases, the estimated random error of the measurements, which is partly an aggregation of many unresolved systematic processes, will decrease. Improved bias estimates will consequently need to go hand in hand with revisions to estimates of other components of the uncertainty.
Some other biases are not easily modeled. It may be impossible to derive meaningful physically based estimates of bias for an individual ERI installation (Fig. 2, right), so these ship-specific biases may need to be characterized statistically.
Improved statistical approaches.
SST biases are statistically and computationally challenging. There are several hundred million in situ observations in ICOADS. This amount of data is modest by modern standards, but complexity arises because the data are from diverse sources representing reports from perhaps hundreds of thousands of individual ships and buoys, some uniquely identified, some not. The data are of varied quality. Metadata are sometimes incomplete or conflicting. Reference observations are few and not always of unimpeachable quality. Improved statistical methods are required to advance and capitalize fully on the improvements in the basic data and modeling described above. Progress is likely to come from working more closely with statisticians, data scientists, and computational experts to develop state-of-the-art analysis systems. It may also be possible to adapt methods developed for the homogenization of land station data (Venema et al. 2012).
It is possible to write a system of equations encapsulating a full statistical description of the problem of estimating spatially complete unbiased fields, and their uncertainty, from sparse, noisy, and biased measurements of SST. In practice, however, the terms in these equations are subject to the same effects causing uncertainty in the current approaches. For example, the form of the method-dependent bias model must still be specified. Solving even a simplified version at coarse resolution is presently computationally challenging. The goal is to include all we know about SST biases into a holistic, statistically rigorous Bayesian analysis framework. The framework should embed method-dependent physically based bias models within a full description of the correlation structure of the variability of SSTs and their biases (recommendation 5).
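One schematic way to write such a system is given below, purely as an illustration and not as the formulation used by any existing dataset. All symbols are assumptions introduced here: y_i is the i-th SST observation at location x_i and time t_i, T is the true SST field, m(i) is the observing method or platform of observation i, b is a method-dependent bias model with parameters θ, ε_i is the random measurement error, and K_T is the covariance of real SST variability.

```latex
\begin{aligned}
y_i &= T(\mathbf{x}_i, t_i)
      + b_{m(i)}\!\left(\mathbf{x}_i, t_i; \boldsymbol{\theta}_{m(i)}\right)
      + \varepsilon_i,
      \qquad \varepsilon_i \sim \mathcal{N}\!\left(0, \sigma_i^2\right), \\
T &\sim \mathcal{GP}\!\left(\mu, K_T\right),
      \qquad \boldsymbol{\theta}_m \sim p(\boldsymbol{\theta}_m), \\
p\!\left(T, \boldsymbol{\theta} \mid \mathbf{y}\right)
  &\propto p\!\left(\mathbf{y} \mid T, \boldsymbol{\theta}\right)\,
           p(T)\, \prod_m p(\boldsymbol{\theta}_m).
\end{aligned}
```

In this form the choices discussed in the text become explicit: the functional form of each bias model, the prior on its parameters, and the covariance structure all have to be specified, and the posterior over the field and the bias parameters must be computed jointly.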
Elements of such a holistic statistical approach are now being developed. The Met Office is developing methods to generate SST fields using estimates of the correlation structures of variability associated with both real changes in SST and biases. In this approach, individual ship biases and their uncertainties can be identified (Fig. 5). This relatively simple implementation, described in more detail in section S9 of the supplement, is able to identify biased measurements made by individual ships and could reduce the obvious SST artifacts related to “ship tracks” often present in SST analyses.
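To give a feel for how individual ship biases and their uncertainties can be estimated, the sketch below uses a much simpler model than the one described in section S9 of the supplement, and it is not the Met Office method. It assumes that each observation's residual from a background field equals a constant ship bias plus independent Gaussian noise, with a zero-mean Gaussian prior on the bias. The function name and the values of `sigma_obs` and `tau_bias` are placeholders.

```python
import numpy as np

def ship_bias_posterior(residuals, ship_ids, sigma_obs=1.0, tau_bias=0.5):
    """Posterior mean and standard deviation of each ship's bias.

    Assumes residual = bias_ship + noise, with noise ~ N(0, sigma_obs^2)
    and a prior bias_ship ~ N(0, tau_bias^2). Both values are placeholders.
    """
    residuals = np.asarray(residuals, dtype=float)
    ship_ids = np.asarray(ship_ids)
    posterior = {}
    for ship in np.unique(ship_ids):
        r = residuals[ship_ids == ship]
        # Conjugate normal update: precisions add, and the mean is the
        # precision-weighted combination of the data and the zero prior mean.
        precision = r.size / sigma_obs**2 + 1.0 / tau_bias**2
        mean = (r.sum() / sigma_obs**2) / precision
        posterior[str(ship)] = (mean, precision**-0.5)
    return posterior

# Hypothetical residuals (deg C): four reports from ship S1, one from S2.
residuals = [1.0, 1.0, 1.0, 1.0, 0.5]
ships = ["S1", "S1", "S1", "S1", "S2"]
post = ship_bias_posterior(residuals, ships)
```

Ships with few reports are pulled strongly toward the prior mean and retain a large posterior uncertainty, whereas ships with many reports are constrained mainly by their own data. A fuller treatment would estimate the background field and the biases together, using the spatial and temporal correlation structure described in the text.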
Everything we have learned from the existing approaches can feed into new statistical models. Every scrap of information about the structure of expected biases can be used to constrain and inform statistical analyses. Further constraints could also be applied, such as a large-scale consistency with NMAT. The development of improved statistical models should proceed in tandem with efforts to better characterize the observations and their biases.
Maintaining research effort and extending the community.
Huge progress has been made since the first estimates of SST bias were published in 1984. There are currently three families of SST datasets available that take different approaches to bias adjustment [HadSST/Hadley Centre Sea Ice and Sea Surface Temperature dataset (HadISST), ERSST, and COBE]. However, all still use approaches that are essentially adaptations of methods originally developed decades ago. We now need to develop new approaches to bias adjustment that take advantage of recent advances in statistical methods and computing power (recommendation 5) while maintaining a diversity of different methods (recommendation 6). Diversity of methods helps quantify structural uncertainty: the spread between datasets arising from fundamental choices in analysis method, and the assumptions underlying them, that are difficult, and in many cases impossible, to capture by varying the parameters or modules within a single analysis system (Thorne et al. 2005).
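A first, crude indication of structural uncertainty is the spread between independently produced analyses over a common period. The sketch below computes that spread for three series. The dataset labels and anomaly values are invented for illustration and do not correspond to HadSST, ERSST, or COBE.

```python
import numpy as np

# Hypothetical global-mean SST anomalies (deg C) for the same three years
# from three analyses; the numbers are invented for illustration.
anomalies = {
    "dataset_a": np.array([0.10, 0.15, 0.22]),
    "dataset_b": np.array([0.12, 0.11, 0.25]),
    "dataset_c": np.array([0.08, 0.19, 0.19]),
}

stack = np.vstack(list(anomalies.values()))
ensemble_mean = stack.mean(axis=0)
# Two simple measures of between-dataset spread for each year.
spread_range = stack.max(axis=0) - stack.min(axis=0)
spread_std = stack.std(axis=0, ddof=1)
```

With only a few datasets, which also share much of their input data, such a spread can sample only part of the structural uncertainty. This is one reason for maintaining and extending the range of methods.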
Progress has been slower than we would like, as the number of researchers active in the area is small and fresh perspectives would be welcome. There are many barriers to new researchers entering this area; presenting the data and metadata in accessible ways and providing a range of different types of documentation are essential to engage a wider community in assessment and validation (recommendation 9).
Recommendation 1: Add more data and metadata to ICOADS.
Additional observations of SST and associated variables such as air temperature, humidity, wind, cloud, pressure, and weather information recovered from logbook digitization will help improve estimates of SST and SST bias. Every effort should be made to retain observational metadata and to keep multivariate observations together.
Recommendation 2: Reprocess existing ICOADS records.
Older ICOADS acquisitions are often lacking metadata and are compromised by legacy deficiencies in data management and storage formats. A full reprocessing of ICOADS legacy data, alongside improvements to data formats, would improve SST bias adjustment through improved ship tracking, recovery of information on platform identity, better identification of mispositioned and duplicate reports, better quality control, and recovery of additional data and metadata from the existing reports. A critical review of all input ICOADS data sources should be carried out to ensure that ICOADS contains the best available data, metadata, and quality information.
Recommendation 3: Improve information on observational methods.
A comprehensive review of documentary sources will better constrain the uncertainty in methods and protocols for historical observations. ICOADS call-sign recovery and reprocessing of WMO Publ. 47 metadata will help link observations to metadata from individual ships.
Recommendation 4: Improve physical models of SST bias.
Simplified and validated physically based models of SST bias are required along with better estimates of ambient conditions and understanding of how to use those estimates to drive the models.
Recommendation 5: Improve statistical models of SST bias.
More holistic and powerful statistical approaches to the problem of estimating SST biases and their uncertainties are needed, especially to study presently unknown causes of inhomogeneities.
Recommendation 6: Maintain and extend the range of different estimates of SST bias.
SST datasets and gridded analyses will continue to improve, but they will never become identical. A wider range of bias estimates taking different approaches to adjustment will enable improved understanding of structural uncertainty. Carefully designed comparisons, including all the developers of bias-adjusted SST analyses, will improve our understanding of biases and their uncertainties.
Recommendation 7: Expand data sources for validation and extend use of measures of internal consistency in validation.
Resources for validating SST bias adjustments include SST from satellites and ocean reanalyses, as well as observed air temperatures, albeit with their own uncertainties. Collating, assembling, and extending consistent datasets providing validation sources will enable more thorough validation of SST bias adjustments. Such sources include ocean weather ships, research vessels, moored buoys, land-based coastal stations, and independent satellite SST records. A more imaginative approach is required to make the best use of available validation data and to widen the use of measures of internal consistency in SST bias validation.
Recommendation 8: Ensure adequacy and continuity of the observing system.
It is important that the challenges we have encountered in understanding the historical SST record do not persist into the future. Requirements for consistency, metadata, subsets of high-quality validation data, and appropriate curation for climate applications should be integrated into the metrics for assessing observing-system adequacy and performance (e.g., GCOS 2010).
Recommendation 9: Improve openness and access to information.
Despite the complexity of the problem, SST bias adjustment has been tackled by only a small number of small groups producing SST products. Many aspects of the problem are potentially of much wider interest to physicists, metrologists, historians, computer scientists, and statisticians, among others. Providing modular software tools and improved access to data, metadata, and historical documentation will help to widen the range of approaches to the important, complex, and interesting problem of SST bias adjustment.
We thank the three reviewers for their help in improving this paper. Funding support was provided by the following organizations: Natural Environment Research Council (Grants NE/J020788/1, NE/I030127/1, and NE/J02306X/1); the Office of Naval Research (Grant N00014-12-1-0911); Deutsche Forschungsgemeinschaft (Grant DFG VE 366/8); Ministry of Environment, Japan (ERDTF 2-1506); and BEIS/Defra (Grant GA01101).
© 2017 American Meteorological Society.
Publisher’s Note: This article was revised on 16 October 2017 to include the Creative Commons license that was omitted when originally published.
A supplement to this article is available online (doi:10.1175/BAMS-D-15-00251.2).