1. Introduction
The two most recent versions of the extended-reconstructed SST datasets, ERSST4 (Huang et al. 2015) and ERSST5 (Huang et al. 2017), both show anomalous warmth in global-mean SSTs of, respectively, 0.30°C [95% confidence interval (c.i.): 0.17° to 0.41°C] and 0.29°C (0.23° to 0.37°C) during World War II (WW2) (Fig. 1a and Table 1). SST anomalies are calculated as the global, annual average between 1941 and 1945 relative to the average over 1936–40 and 1946–50, and are referred to as the World War II warm anomaly (WW2WA). All uncertainties are reported as 95% coverage intervals unless otherwise noted. Version 4 of the Hadley Centre SST (HadSST4) shows a similar WW2WA of 0.19°C (Kennedy et al. 2019), although with a much larger uncertainty estimate ranging from −0.09° to 0.45°C (Table 1).
Table 1. WW2WA and SST variability. The WW2WA is the average over 1941–45 relative to the average over 1936–40 and 1946–50. The standard deviation of the global-mean SST (second column) is computed from annual averages over 1936–50. The average regional standard deviation (third column) is the square root of the global average of variance at 5° × 5° grids. ERSSTs have lower regional variability because their mapping technique truncates small-scale variability. The Cowtan SST is only available as global averages. CMIP5 models do not contain sampling and random observational errors and, therefore, show lower regional variability. Sampling and random errors cancel under averaging and become negligible at global and decadal scales (Kennedy et al. 2011a; Chan et al. 2019). The 95% confidence intervals are given in square brackets and are estimated using the following ensembles: 1000 random adjustment members for groupwise-adjusted SSTs (R4 and R5), ERSST4, and ERSST5; 100 members for HadSST3; 200 members for HadSST4; 94 simulation members for CMIP5 historical runs; 38 members for CMIP6 historical runs; 1662 15-yr segments for CMIP5 preindustrial control simulations; and 1020 segments for CMIP6 control simulations.
If the WW2WA reflects physical changes in climate, it would have important implications for understanding the magnitude of decadal climate variability (Hansen et al. 2010; Morice et al. 2012; Vose et al. 2012), constraining uncertain external forcing (Stevens 2015), and partitioning relative contributions of anthropogenic forcing and internal variability in driving historical climate change (Jones et al. 2013; Bindoff et al. 2013; Maher et al. 2014; Hegerl et al. 2018). For example, such an anomaly could indicate the ability of El Niño–Southern Oscillation (ENSO) to lead to larger and more persistent warming than is otherwise understood (Thompson et al. 2009).
A number of other data-based analyses and simulations suggest that the physicality of the WW2WA is questionable. The WW2WA is essentially absent in HadSST3 (Kennedy et al. 2011b). SSTs referenced to air temperatures from nearshore weather stations (Cowtan et al. 2018) and temperature proxies derived from isotopes in tropical coral reefs (Pfeiffer et al. 2017) also show a negligible WW2WA. Furthermore, the WW2WA found in ERSST and HadSST4 estimates greatly exceeds that reproduced by any of the CMIP5 (Taylor et al. 2012) or CMIP6 (Eyring et al. 2016) historical simulations available over this interval (gray curves in Fig. 1c). Neither can statistical models explain this warm anomaly using known climate forcing and internal variability (Folland et al. 2018).
A leading hypothesis for the WW2WA relates to switching from predominantly bucket measurements of SST before and immediately after WW2 to engine-room-intake (ERI) measurements during WW2 (Thompson et al. 2008). A typical measurement from a U.K. canvas bucket has been estimated to be, on average, 0.4°C cooler than actual SSTs because of latent cooling before measurement (Folland and Parker 1995). Conversely, although ERI measurements are typically sampled at 5–15 m below the surface and should be consequently cooler than SSTs, given that SSTs are typically defined as coming from depths between 20 and 30 cm (Kennedy et al. 2019), ERI SSTs have an average warm bias ranging from 0.1° to 0.5°C because of absorbing heat released from ship engines (Kennedy et al. 2011b, 2019). A 58% reduction in the number of SST measurements during the WW2 interval (Freeman et al. 2017) could also make errors in a subset of the data more likely to lead to seemingly global anomalies.
A second hypothesis involves changes in protocols for taking measurements at night. Nighttime marine air temperature readings are thought to have been taken inboard to avoid detection and, consequently, to be warmly biased by approximately 0.8°C (Folland et al. 1984). Bucket SSTs may have also been read inboard during WW2. We also note that the proportion of SST readings during the day, as opposed to the night, shifts from 55% in the surrounding 10 years of the war to 61% between 1941 and 1945 (Freeman et al. 2017), suggesting a preference for taking measurements during the daytime.
To further assess disagreement in existing estimates, we evaluate contributions to the WW2WA from specific groups of measurements and differences between day and nighttime measurements. In section 2, we investigate the evolution of bucket (R2) and ERI-only SSTs (R3) and show that, although SSTs from the two methods have systematic offsets, neither estimate has a strong WW2WA compared to when all data are used together (R1). In section 3, we re-examine the WW2WA after removing systematic offsets using an extended version of a linear-mixed-effect methodology (Chan and Huybers 2019) (R4). We also test the hypothesis of problematic nighttime bucket measurements using a daytime-only SST reconstruction (R5) in section 4. Finally, in section 5, we compare our results with estimates from previous studies and general circulation model simulations and discuss the implication of our updated estimate of the WW2WA.
2. R1–R3: Uncorrected reconstructions using all measurements or only buckets or ERIs
Six major SST estimates that cover the WW2 period (Fig. 1a) give distinct estimates for the WW2WA. All six estimates rely upon data coming from the International Comprehensive Ocean–Atmosphere Data Set (ICOADS; Woodruff et al. 2011; Freeman et al. 2017). Differences among estimates largely reflect differences in bias corrections, although use of different mapping procedures and inclusion criteria may also contribute. In one type of correction, bucket and ERI measurements are not distinguished, and global-mean SSTs are corrected to follow independent estimates of temperatures. Results depend on the choice of reference temperature. For example, ERSST5 (Huang et al. 2017) is referenced to Hadley nighttime marine air temperatures (NMAT; Kent et al. 2013), from which the global average inherits the 0.22°C WW2WA present in NMAT estimates. Like SSTs, ship-based air temperatures are potentially subject to their own biases on account of changes in measurement protocols (Folland et al. 1984). Alternatively, referencing against air temperatures from coastal and island weather stations leads to removal of the WW2WA (Cowtan et al. 2018), an estimate that we refer to as Cowtan SST.
A second approach to correcting SST biases involves distinguishing between bucket and ERI measurements and attempts to account for their respective biases (Kennedy et al. 2011b; Hausfather et al. 2017; Kennedy et al. 2019), potentially giving a more detailed correction than available from a bulk correction of all SST data. A major impediment to such corrections, however, is that method information is poorly documented for most measurements during WW2, with only 6% of observations explicitly indicated as coming from buckets, 11% explicitly indicated as coming from ERIs, and 83% whose method requires some degree of inference (Freeman et al. 2017; Kent et al. 2017). Magnitudes of measurement biases are also uncertain and may have changed during WW2 (Folland et al. 1984; Kent et al. 2013). The lack of information regarding measurements has been addressed through plausible but potentially insufficient assumptions. In constructing HadSST3, for example, Kennedy et al. (2011b) assumed that U.S. and U.K. naval ships with missing method information take ERI measurements of SST that are, on average, warmly biased by 0.2°C. HadSST4 randomly designates measurements with missing method information during WW2 to be either bucket or ERI SSTs, with the portion of bucket measurements ranging from 0% to 25%. Wartime ERI measurements in HadSST4 are assumed to be biased warm, on average, by 0.25°C, whereas bucket SSTs are assumed to be biased cold, on average, by −0.2°C (Kennedy et al. 2019).
The implications of these corrections for the WW2WA are not obvious, and it is useful to make raw estimates—neither including corrections nor infilling regions that lack data—to better quantify the magnitude of the WW2WA in the underlying data. We, therefore, first reconstruct SST using all quality-controlled raw ship-based measurements (R1) available from ICOADS3.0 (Freeman et al. 2017). We also examine SSTs estimated using data thought to come only from buckets (R2) or ERIs (R3).
Quality control procedures for SST measurements are the same as those in Chan and Huybers (2019). We identify ship-based SSTs using the ICOADS platform metadata (PT from 0 to 5). Method information is identified from ICOADS SST measurement method (SI) metadata. If the SI metadata are not available, the measurement method is assigned to be unknown. Following Kennedy et al. (2011b), an exception is made for SST measurements from U.S. ships, which are assumed to be ERI measurements. This assumption is supported by the fact that U.S. measurements have a diurnal cycle that is smaller than that expected from bucket measurements (Carella et al. 2018). A small diurnal cycle is consistent with ERI measurements that are typically sampled at a depth of 5–15 m that is less affected by the diurnal cycle of insolation (Carella et al. 2018). We identify nations first using the ICOADS country code (C1). If C1 is not available, nations are inferred from ship call signs (Chan et al. 2019) or deck information (Kennedy et al. 2011b; Chan and Huybers 2019).
Global-average estimates of raw SSTs using only observations thought to come from buckets (R2) give a WW2WA of 0.18°C, and a similar estimate for ERI-only SSTs (R3) gives a WW2WA of 0.08°C (Fig. 1b). Both estimates are far more stable than the 0.41°C WW2WA obtained if all available raw ICOADS SSTs are evaluated (R1; Fig. 1b). The fact that R3 is, on average, 0.52°C warmer than R2 highlights the potential for misidentification of measurement methods or insufficient corrections leading to biases remaining in existing SST estimates. Note that quantifying the WW2WA as the difference between the average over 1941–45 and the average over the 10 surrounding years neutralizes the effect of a constant SST bias and also accounts for the potential for an underlying linear trend between 1936 and 1950.
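The WW2WA metric and its insensitivity to a constant bias or a linear trend can be illustrated with a minimal sketch. The function name ww2wa and all series below are hypothetical and synthetic; they are not values from any dataset discussed here.

```python
import numpy as np

def ww2wa(years, sst):
    """Mean over 1941-45 minus the mean over 1936-40 and 1946-50."""
    years = np.asarray(years)
    sst = np.asarray(sst, dtype=float)
    war = (years >= 1941) & (years <= 1945)
    ref = ((years >= 1936) & (years <= 1940)) | ((years >= 1946) & (years <= 1950))
    return sst[war].mean() - sst[ref].mean()

years = np.arange(1936, 1951)

# Synthetic annual global-mean series (illustrative only)
bump = np.where((years >= 1941) & (years <= 1945), 0.3, 0.0)  # wartime step
trend = 0.01 * (years - 1943)                                  # linear trend
offset = 0.5                                                   # constant bias

anomaly_plain = ww2wa(years, bump)
anomaly_biased = ww2wa(years, bump + trend + offset)
```

Because the reference years sit symmetrically around the wartime window, both the constant offset and the linear trend cancel, so anomaly_plain and anomaly_biased agree to rounding error.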
In addition to a reduced WW2WA in SST estimates stratified by instruments, R1 follows the ERI-only estimate (R3) more closely during the war and the bucket-only estimate (R2) before and after the war (Fig. 1b). The proportion of SSTs we identify to come from buckets decreases from 44% before and after the war to 6% during the war, whereas the proportion identified to come from ERIs, including both explicitly indicated and inferred U.S. measurements, increases from 25% to 50%. The remaining 44% of observations during WW2 have unassigned measurement types. This initial investigation of raw ICOADS indicates that the WW2WA mainly reflects instrumental changes at the start and the end of the war (Thompson et al. 2008).
3. R4: Accounting for groupwise offsets
To better account for limitations in previous corrections, we use a linear-mixed-effect (LME) intercomparison framework (Chan and Huybers 2019) to quantify systematic offsets associated with distinct groups of SSTs. We use our LME model to compare nearby measurements and, thereby, obtain estimates of SST offsets among different groups regardless of whether the method of measurement is known. Moreover, by diagnosing data offsets associated with individual nations, groups of SSTs, and available measurement types, the LME method allows for inference of more-detailed SST adjustments than previous estimates (e.g., Kennedy et al. 2019). Details include different magnitudes of offsets potentially contributed by different bucket or ERI designs, distinct protocols, or separate postprocessing effects, as well as the temporal and spatial patterns of observational biases associated with each group.
We applied a similar LME method to only bucket SST observations and showed that it accurately identifies offsets between nation and deck groups (Chan and Huybers 2019). The skill of the LME method is also supported by negative correlations between offsets and the amplitude of diurnal cycles in SSTs (Chan and Huybers 2020), identification of offsets later found to come from data truncation (Chan et al. 2019), and improved agreement between adjusted SSTs and air temperatures from nearby coastal weather stations (Chan et al. 2019). In this study, we assess all available ship-based SSTs that come from buckets, ERIs, or hull sensors, or with missing method information.
a. Linear-mixed-effect method
SST observations are grouped according to nation, deck, and method of measurement. Nation and method information is identified following the same approach as in the last section, but we no longer assume that U.S. SSTs with missing method information are ERI measurements, as done in obtaining R3. Rather, we define “missing method” as a category and allow the LME method to determine any required adjustment for these U.S. measurements. We include deck numbers in defining different groups because these indicate information regarding ICOADS data collectors and processors (Freeman et al. 2017), and processing has been found to be a potentially important source of bias (Chan et al. 2019).
To intercompare different groups of SSTs and estimate systematic offsets, we first pair SSTs if they come from distinct groupings according to nation, deck, and method and are within 300 km and 2 days of one another. We use each measurement at most once to prevent error covariance between pairs, with the pairing algorithm prioritizing SST pairs that are closest in space (Chan and Huybers 2019). Compared with Chan and Huybers (2019), who only intercompared SSTs thought to come from buckets, the inclusion of SSTs from ERIs, hull sensors, or with missing method information increases the total number of pairs from 17.8 to 45.8 million throughout 1850–2014. These pairs come from 492 groups that each contribute at least 5000 pairs of SSTs. Our focus is on the years 1935–49, which contain a subset of 1.8 million SST differences from 66 groups (Fig. 2b and Table 2), but we analyze all 45.8 million SST pairs for purposes of more fully accounting for covariance across groupings. To account for physical separation between paired measurements, we first remove climatological differences expected from geographical and temporal displacement. The expected differences are estimated from NOAA optimal interpolated SSTs (Reynolds et al. 2007) and drifter observations in ICOADS3.0 (Chan and Huybers 2019).
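The pairing rule described above can be sketched as a greedy match. This is an illustration under stated assumptions and not the implementation of Chan and Huybers (2019): the record layout, the function names, and the brute-force candidate search are hypothetical, and the group labels and observations are synthetic.

```python
import math
from itertools import combinations

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def pair_measurements(obs, max_km=300.0, max_days=2.0):
    """Pair observations from distinct groups, closest pairs first,
    using each observation at most once."""
    candidates = []
    for i, j in combinations(range(len(obs)), 2):
        a, b = obs[i], obs[j]
        if a["group"] == b["group"]:
            continue  # only intercompare distinct groups
        if abs(a["day"] - b["day"]) > max_days:
            continue
        d = haversine_km(a["lat"], a["lon"], b["lat"], b["lon"])
        if d <= max_km:
            candidates.append((d, i, j))
    candidates.sort()  # prioritize pairs that are closest in space
    used, pairs = set(), []
    for d, i, j in candidates:
        if i in used or j in used:
            continue  # each measurement contributes to at most one pair
        used.update((i, j))
        pairs.append((i, j))
    return pairs

# Synthetic observations; "group" stands for a nation-deck-method combination
obs = [
    {"group": "A", "lat": 10.0, "lon": 60.0, "day": 0.0, "sst": 28.4},
    {"group": "B", "lat": 10.5, "lon": 60.0, "day": 1.0, "sst": 28.0},
    {"group": "B", "lat": 11.5, "lon": 60.0, "day": 0.5, "sst": 27.9},
    {"group": "C", "lat": 12.2, "lon": 60.0, "day": 1.0, "sst": 28.1},
    {"group": "D", "lat": 10.0, "lon": 60.0, "day": 10.0, "sst": 28.3},
]
pairs = pair_measurements(obs)
```

The brute-force search over all combinations is quadratic in the number of observations and only suits a toy example; an analysis with tens of millions of pairs would need spatial and temporal indexing.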
Table 2. Measurement groups containing SSTs during 1935–49. Among 66 groups that have groupwise offset estimates, 33 have valid estimates for the amplitude of diurnal cycles. Shown LME offsets are annual mean offsets averaged over 1935–49 for the analysis that uses both daytime and nighttime SSTs (“Mean offsets” column) and the analysis that only uses daytime SSTs (“Day offsets” column). Diurnal amplitudes are anomalies relative to collocated 1990–2014 climatology estimated from drifting buoys (“Excess DA” column). One asterisk (*) indicates groupwise offsets differing from zero (P < 0.05) or diurnal amplitudes differing from that of drifting buoys (P < 0.05). Two asterisks (**) indicate significance after Bonferroni corrections (P < 0.05/66 for groupwise offsets and P < 0.05/33 for diurnal amplitudes). Checkmarks highlight U.S. groups with unknown method information, which are assumed to contain ERI SSTs when computing ERI-only estimates (R3).
In practice, to reduce the computational cost, we aggregate data by averaging SST differences according to combinations of pairs of groups, regions, and years before estimating offsets. Uncertainties associated with aggregated SST differences are budgeted to account for observational error, physical SST variability, and heteroscedasticity associated with distinct group size, and are used to weigh aggregated pairs in the LME analysis. The error estimate resulting from the LME analysis is a multivariate Gaussian that accounts for covariance. We represent the uncertainties of groupwise adjustments using a 1000-member ensemble of random adjustments having groupwise offsets that are perturbed according to the estimated multivariate Gaussian. The ensemble captures error covariance among fixed and random effects as well as covariance introduced by changes in the spatial coverage of individual groups. Further details regarding the LME implementation are in appendix A.
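As a minimal sketch of how such an ensemble of random adjustments might be drawn, assume that the LME fit has already produced a vector of mean groupwise offsets and its error covariance. The three-group values below are hypothetical placeholders, not estimates from this study.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical LME output for three groups (deg C); not values from the paper
offsets_mean = np.array([0.40, -0.10, -0.30])
offsets_cov = np.array([
    [0.010, 0.002, 0.000],
    [0.002, 0.008, 0.001],
    [0.000, 0.001, 0.012],
])

# Draw 1000 perturbed sets of offsets that respect the error covariance
n_members = 1000
ensemble = rng.multivariate_normal(offsets_mean, offsets_cov, size=n_members)

# 95% range of each group's offset across ensemble members
lower, upper = np.percentile(ensemble, [2.5, 97.5], axis=0)
```

Each row of ensemble is one set of groupwise adjustments. Subtracting a row from the corresponding groups' SSTs and recomputing the global mean would give one member of the adjusted-SST ensemble.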
One limitation of the LME method is that it informs regarding relative offsets and does not account for biases common to all groups. Common SST biases, however, may vary with time. For example, bucket biases toward being cold are generally thought to diminish with systematic changes from less-insulated canvas to more-insulated rubber buckets (Folland and Parker 1995; Kennedy et al. 2011b, 2019). Existing estimates represent this bucket change as occurring gradually from the 1930s to the 1970s (Kennedy et al. 2019) or as being confined to after the 1950s (Kennedy et al. 2011b). Systematic changes in biases for other ship-based measurements have also been identified for more recent years by comparison against marine profile measurements (e.g., Kennedy et al. 2019). We cannot rule out systematic changes between all types of measurements during WW2 but proceed with an examination of identifiable offsets.
b. Groupwise offsets and the diurnal cycle
We first apply the LME methodology to intercompare groups of SST measurements using all available SST data. Of the 66 groups present between 1935 and 1949, 29 have significant offsets (P < 0.05; Table 2). Significance is assessed relative to a null hypothesis of zero-mean offset relative to the average across all groups (Chan and Huybers 2019). In addition, 12 groups still show significant offsets after a Bonferroni correction. The Bonferroni correction compensates for the increased chance of false positives when conducting n tests by evaluating each at P < α/n. In our case, we have n equal to 66 groups between 1935 and 1949 and α equal to 0.05. There are five positively identified ERI groups that are each found to be warmer than the 24 bucket groups, on average, by 0.53°C (0.25° to 0.72°C; Table 2). Offsets of groups having missing method information range from −0.4° to 0.6°C. This range is similar to that spanned by the entire population of the bucket and ERI groups, suggesting that at least some of these groups are distinctly from bucket or ERI measurements.
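The Bonferroni threshold described above amounts to the following check. The p values here are invented for illustration and padded with nonsignificant entries to reach n = 66; the function name is hypothetical.

```python
import numpy as np

def bonferroni_significant(p_values, alpha=0.05):
    """Return masks for P < alpha and for P < alpha / n."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    return p < alpha, p < alpha / n

# Invented p values, padded to 66 tests with clearly nonsignificant entries
p_values = np.array([0.0001, 0.0005, 0.003, 0.02, 0.04, 0.20])
padded = np.concatenate([p_values, np.ones(66 - p_values.size)])

uncorrected, corrected = bonferroni_significant(padded)
n_uncorrected = int(uncorrected.sum())
n_corrected = int(corrected.sum())
```

With 66 tests the corrected threshold is 0.05/66, or roughly 0.00076, so only the two smallest invented p values remain significant.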
Chan and Huybers (2020) demonstrated the utility of using the diurnal cycle in combination with offsets to infer the composition of measurements within a group. A negative correlation is generally found between diurnal amplitudes and offsets across groups that have variable compositions of ERI and bucket data because ERIs are generally warmer and have a smaller diurnal cycle than bucket measurements. To estimate diurnal cycles, we make use of tracked ships (Carella et al. 2017) and only evaluate ships making measurements at least four times per day using a least squares fit of a once-per-day sinusoid. Although the Nyquist cutoff is two measurements per day, in practice non-sinusoidal components of the diurnal cycle make using higher-resolution data useful. U.S. deck 195 presents a special case, however, because it contributes 47% of U.S. wartime measurements (Fig. 2a) but has a sampling frequency of only three times per day. To fit the amplitude of this deck, we estimate a linear combination of diurnal cycles from two basis functions based upon known bucket and ERI measurements (see appendix B). The best estimate indicates that U.S. deck 195 is consistent with being purely composed of ERI measurements.
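A least squares fit of a once-per-day sinusoid can be sketched by writing the cycle as a mean plus cosine and sine terms, which makes the problem linear in its coefficients. The function name and the synthetic observations below are assumptions for illustration, and the sketch omits the tracking of individual ships.

```python
import numpy as np

def fit_diurnal_cycle(local_hours, sst):
    """Fit sst = mean + a*cos(w*h) + b*sin(w*h) with a 24-h period."""
    h = np.asarray(local_hours, dtype=float)
    y = np.asarray(sst, dtype=float)
    omega = 2.0 * np.pi / 24.0
    design = np.column_stack([np.ones_like(h), np.cos(omega * h), np.sin(omega * h)])
    (mean, a, b), *_ = np.linalg.lstsq(design, y, rcond=None)
    amplitude = np.hypot(a, b)                      # half of the peak-to-trough range
    peak_hour = (np.arctan2(b, a) / omega) % 24.0   # local time of maximum
    return mean, amplitude, peak_hour

# Synthetic ship reporting four times per day for two days
hours = np.array([0.0, 6.0, 12.0, 18.0, 0.0, 6.0, 12.0, 18.0])
sst = 20.0 + 0.15 * np.cos(2.0 * np.pi / 24.0 * (hours - 15.0))

mean, amplitude, peak_hour = fit_diurnal_cycle(hours, sst)
```

With noise-free synthetic data the fit recovers the prescribed amplitude of 0.15°C and the 1500 LT peak exactly; real observations would add residual scatter.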
Half of the 66 groups present between 1935 and 1949, along with U.S. deck 195, have tracked ships with sufficient resolution of the diurnal cycle (Table 2). To account for distinct spatial and seasonal coverage of individual groups, we report diurnal amplitudes as anomalies relative to a 1990–2014 climatology of diurnal amplitudes estimated from drifting buoys (Chan and Huybers 2019). As expected, diurnal amplitude is strongly anticorrelated with groupwise offsets (Fig. 4). The relationship between diurnal amplitudes and groupwise offsets is estimated using a York regression (York et al. 2004) and associated uncertainties are estimated by bootstrapping individual groups 10 000 times with replacement. The three known ERI groups are associated with relatively warm and small-amplitude diurnal cycles, whereas all known bucket groups are colder and have a higher diurnal amplitude. In general, bucket groups have higher intergroup variability in terms of both diurnal amplitudes and groupwise offsets, consistent with the fact that a variety of bucket designs and measurement protocols were used to collect SSTs (Folland and Parker 1995; Kent and Taylor 2006). The large spread across bucket groups may also involve misclassification of ERI SSTs as coming from buckets (Carella et al. 2018; Chan and Huybers 2020).
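The bootstrap over groups can be illustrated as follows. Note that this sketch substitutes an ordinary least squares slope for the York regression used in the analysis, which additionally accounts for errors in both variables, and that the offsets and diurnal amplitudes are synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic groups: offsets (deg C) and anticorrelated excess diurnal amplitude
n_groups = 33
offsets = rng.normal(0.0, 0.25, n_groups)
excess_da = -0.2 * offsets + rng.normal(0.0, 0.02, n_groups)

def ols_slope(x, y):
    """Ordinary least squares slope along the last axis (stand-in for York)."""
    xc = x - x.mean(axis=-1, keepdims=True)
    yc = y - y.mean(axis=-1, keepdims=True)
    return (xc * yc).sum(axis=-1) / (xc ** 2).sum(axis=-1)

best_slope = ols_slope(offsets, excess_da)

# Resample whole groups with replacement, 10 000 times
n_boot = 10_000
idx = rng.integers(0, n_groups, size=(n_boot, n_groups))
boot_slopes = ols_slope(offsets[idx], excess_da[idx])

slope_lo, slope_hi = np.percentile(boot_slopes, [2.5, 97.5])
```

Resampling groups rather than individual measurements keeps each group's offset and amplitude together, so the interval reflects intergroup scatter.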
U.S. decks 110, 116, 195, 281, and 705 account for 88% of all U.S. measurements during 1935–49 and each is significantly warmer than the average across all groups and exhibits a diurnal amplitude that is significantly smaller than a climatology derived from drifting buoys (P < 0.05; Fig. 4, Table 2). The combination of warm offsets and small diurnal amplitudes supports the assumption made in HadSST3 (Kennedy et al. 2011b) and the findings of Carella et al. (2018) that U.S. measurements with missing method information during WW2 are ERI measurements. Confirmation of U.S. decks being composed of ERI measurements also supports the offset between R2 and R3 reflecting biases between ERI and bucket measurements.
c. Reduced WW2WA after removing groupwise offsets
The R4 reconstruction of historical SSTs during WW2 comes from combining all groups of SSTs after adjusting for groupwise offsets (Fig. 1c) and gives a WW2WA of 0.13°C (0.01° to 0.26°C). R4 can be contrasted with the nonadjusted R1 reconstructions having a WW2WA of 0.41°C (Fig. 1b). Groupwise adjusted SSTs also show a smaller WW2WA in bucket-only and ERI-only estimates of the WW2WA (Fig. 5a) that are, respectively, 0.15°C (0.06° to 0.24°C) and 0.07°C (−0.08° to 0.24°C). As expected, collocated ERI minus bucket difference decreases from an average of 0.48°C over 1936–50 in raw ICOADS to being centered on zero after groupwise adjustments (Fig. 5b).
The diminished WW2WA in adjusted SSTs largely reflects adjustments of U.S. deck 195 that features a warm offset of 0.43°C (0.17° to 0.68°C) and whose adjustment alone revises the WW2WA from 0.41°C in raw ICOADS to 0.22°C (Fig. 5c). Also of note is U.K. deck 245, which has offsets of −0.31°C (−0.53° to −0.09°C) before 1940 and 0.25°C (0.03° to 0.47°C) between 1941 and 1947 (Fig. 4). The adjustment of deck 245 further diminishes the WW2WA by 0.06°C (Fig. 5c) with local decreases of more than 0.4°C over the Indian Ocean and the Pacific warm pool.
4. R5: Reconstruction using daytime-only measurements
A second effect that we examine stems from recognition by Folland et al. (1984) that nighttime marine air temperatures (MAT) were likely measured inboard during WW2. Folland et al. (1984) state that “the reason is thought to be that it was forbidden, at least on UK ships, to shine a torch in an exposed place, so night MAT was observed well inboard, with consequential larger heating errors” (p. 672). Inboard measurements may have been operationally required to minimize light pollution and potential detection by enemy ships or submarines. For the same reason, water temperatures inside buckets were likely to have been read inboard, with warmer indoor air temperatures and lower wind speeds expected to lead to less sensible and evaporative cooling.
There are five additional lines of evidence that point to nighttime SSTs measured using buckets being anomalously warm and taken inboard during WW2. First, we examine nighttime and daytime-only SSTs coming from bucket and ERI measurements. Day and nighttime observations are identified using the ICOADS night–day flag (ND). Whereas daytime bucket SSTs show a WW2WA of 0.09°C, the nighttime estimate indicates the WW2WA being 0.32°C (Fig. 6a), indicating nighttime estimates as the source of a larger anomaly.
Second, nighttime bucket SSTs reverse from being colder than collocated daytime temperatures by −0.20°C during the five years before and after WW2, as expected regardless of bucket design (Chan and Huybers 2020), to being 0.02°C warmer during WW2. The inversion of the day–night difference in bucket SSTs during WW2 is mainly attributable to British Navy ships from deck 204 that contribute more than 75% of open-ocean bucket SSTs from 1942 to 1945. SST observations from buckets that are concentrated near shore, such as deck 720 (Deutscher Wetterdienst Marine Meteorological Archive), have little overall influence on global SST estimates after gridding. Accordingly, the warmest anomalies in WW2 nighttime bucket SSTs are found over the Indian Ocean and the extratropical Atlantic (Fig. 6c).
Third, and more specifically, the diurnal cycle of SSTs in deck 204 shifts from having peak temperatures at 1600 local time (LT) in the 5 years before and after WW2 to 2000 LT during WW2, and the overall amplitude of the diurnal cycle decreases (Fig. 6d). Here, the diurnal cycle of deck 204 is estimated by directly binning all available SST anomalies by local hours. This approach allows for using all data but is more susceptible to noise contributions from changes in systematic offsets associated with individual ships relative to our typical approach of assessing the diurnal cycle (e.g., in Chan and Huybers 2020). If we compare the diurnal amplitude of deck 204 relative to collocated climatological amplitudes estimated from drifting buoys (Chan and Huybers 2019), the averaged anomalous amplitude of deck 204 decreases from being 0.03°C larger than drifters in the 5 years before and after WW2 to being 0.03°C smaller during WW2.
Fourth, it is possible to rule out other instrumental or physical causes for anomalously warm nighttime bucket SSTs. The smaller diurnal amplitude found in deck 204 is unlikely to be related to switching to ERI measurements because the average temperature of daytime measurements remains consistent with bucket measurements taken before and after WW2 and remains cooler than known ERI measurements (Fig. 6d). Furthermore, we are unaware of a physical mechanism that would cause nighttime SSTs to be routinely warmer, when averaged over a year, than daytime SSTs. A climatological cause of the WW2WA in nighttime bucket measurements is also contradicted by the lack of a warm anomaly in nighttime ERI measurements (Fig. 6b).
Finally, the inference that observation protocols were changed to avoid light pollution suggests that sailors would favor daytime over nighttime measurements. Indeed, the percentage of daytime bucket SSTs relative to all available bucket SST observations increases from 52% in the five years before and after WW2 to 62% during WW2. These additional lines of evidence provide a strong indication that nighttime bucket SST during WW2 were measured inboard. We also note that the shift from taking 55% of all available SST observations, including bucket, ERI, and unknown types, during the daytime in the five years before and after WW2 to 61% during WW2 makes only a minor contribution to the WW2WA. Sampling hourly-resolved climatological diurnal cycles from drifters indicates that such a shift contributes only 0.005°C to the observed warm anomaly.
Although the LME method could be further extended to temporally resolve anomalies in nighttime biases, building in such flexibility would force nighttime temperatures to be consistent with daytime temperatures. This approach would make nighttime temperatures effectively uninformative because daytime and nighttime measurements have approximately the same spatial and temporal coverage (Fig. S1 in the online supplemental material). Instead, we simply repeat our analysis excluding nighttime measurements. Specifically, we use 24.1 million pairings of daytime SST measurements between 1850 and 2014 with 1.0 million of these pairs available between 1935 and 1949. Whereas using both day and night measurements gives a 0.13°C (0.01° to 0.26°C) WW2WA (R4), the daytime-only analysis gives a WW2WA of 0.09°C (−0.01° to 0.18°C, R5; Table 1, Fig. 1c).
We compare the WW2WA in our various observational estimates against the variability found in CMIP5 models. To compare the WW2WA against simulated internal variability, we regrid a total of 25 236 years of simulated preindustrial SSTs from 42 available CMIP5 models (Taylor et al. 2012) to a common 5° resolution (see Table S1 for a list of models used). Simulations are then divided into 15-yr segments, with each segment masked using the 1936–50 least common coverage between HadSST4 and R5 on a month-by-month basis. The difference between the central five years and the surrounding 10 years is calculated for individual preindustrial segments to estimate the range of internal variability. The CMIP5 runs indicate a 95% range of internal variability being −0.10° to 0.10°C. A similar analysis based on 15 453 years of preindustrial runs from 27 available CMIP6 models gives a 95% range of internal variability of −0.11° to 0.10°C, where both the CMIP5 and CMIP6 results are consistent with R5 (Fig. 7b and Table 1). Also consistent with simulated internal variability are bucket-only estimates of daytime SST that have a WW2WA of 0.08°C (−0.02° to 0.17°C) and ERI-only estimates that have an anomaly of 0.03°C (−0.07° to 0.14°C). Time series of groupwise-adjusted daytime SSTs using only bucket or ERI measurements are shown in Fig. S2 in the supplemental material.
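The segment-based estimate of internal variability can be sketched as follows, assuming non-overlapping 15-yr segments and omitting the regridding and coverage-masking steps. White noise stands in for a preindustrial control simulation, and the function name is hypothetical.

```python
import numpy as np

def segment_anomalies(annual_sst, segment_len=15, core=slice(5, 10)):
    """Central-5-yr mean minus surrounding-10-yr mean for each
    non-overlapping segment of an annual global-mean series."""
    x = np.asarray(annual_sst, dtype=float)
    n_seg = x.size // segment_len
    segs = x[: n_seg * segment_len].reshape(n_seg, segment_len)
    central = segs[:, core].mean(axis=1)
    surround = np.delete(segs, np.arange(core.start, core.stop), axis=1).mean(axis=1)
    return central - surround

rng = np.random.default_rng(2)

# Synthetic stand-in for 1500 years of control-run global-mean SST (deg C)
control = rng.normal(0.0, 0.1, 1500)

anoms = segment_anomalies(control)
range_lo, range_hi = np.percentile(anoms, [2.5, 97.5])
```

An observed WW2WA can then be compared against range_lo and range_hi; values inside the range are consistent with the simulated internal variability.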
5. Further discussion and conclusions
Our groupwise intercomparison indicates that ERI and bucket groups have an average offset of 0.53°C (0.25° to 0.72°C) during 1935–49. Such a difference is nearly 0.1°C higher than the wartime difference that averages 0.45°C across ensemble members as implemented in HadSST4 (Kennedy et al. 2019). Furthermore, our analysis indicates that unknown U.S. and U.K. measurements from 1942 to 1945, which account for 98% of unknown wartime measurements, are offset warm. In HadSST4, an average of 12.5% of SSTs with missing method information were assumed to come from buckets and adjusted positively. The smaller offset assumed between the bucket and ERI SSTs and a higher percentage of observations assumed to come from buckets explain the re-emergence of the WW2WA from HadSST3 to HadSST4. Our findings indicate that HadSST3 provides a more accurate assessment of the WW2WA. In addition, whereas HadSST4 specifies large uncertainties during WW2, our result reduces the standard error of WW2WA from 0.14°C in HadSST4 to 0.05°C in R5 on account of attributing more variance in the raw data to systematic offsets that can be corrected (Fig. 7b). Our LME results also indicate that R5 has slightly lower uncertainty than R4, even though using daytime-only measurements approximately halves the sample size. Sampling and random errors are negligible at global and decadal scales, and we infer that the greater uncertainty in R4 is driven by systematic errors associated with nighttime measurements, possibly associated with variable inboard offsets.
The SST adjustments obtained through our LME approach agree with an independent estimate arrived at using nearshore, land-station data (Cowtan et al. 2018; Fig. 7a). Whereas the approach of Cowtan et al. (2018) requires average SSTs to agree with land-station data, our analysis shows that WW2WA is an artifact arising from specific groups and features of SST measurements. Once these artifacts are accounted for, both nearshore land temperatures and SSTs are in good agreement. Our results thus help confirm the global-average results reported by Cowtan et al. (2018). On the other hand, compared with ERSST estimates that are referenced to MATs, our results cast doubt on the reliability of using nighttime or daytime MATs for adjusting SSTs during WW2. Another discrepancy is reported in a more modern context whereby the global mean of Hadley Centre nighttime marine air temperatures (HadNMAT2.0.1.0) appears to be significantly colder (P < 0.05) than SSTs by more than 0.08°C after the 1990s (Kennedy et al. 2019).
An important attribute of groupwise adjustments is the ability to resolve regional biases arising from spatially heterogeneous distributions of distinct groups (Fig. 8). Removing groupwise offsets leads to a greater decrease in the WW2WA over the Indian Ocean and Pacific warm pool and smaller decreases over the tropical eastern Pacific and the South Atlantic (Fig. 8b). The spatial correlation of the WW2WA between our adjustments and R5 is rs = 0.04, where the small correlation indicates that the magnitude of the pattern that we remove is appropriate because R5 is nearly free of the estimated pattern of bias. HadSST estimates partially account for biases associated with shifting instruments and have a similar pattern of correction, albeit one that is less complete such that rs = −0.15 for HadSST3 (Fig. 8e) and rs = −0.18 for HadSST4 (Fig. 8f). In contrast, ERSST estimates use a fixed spatial pattern (Huang et al. 2015, 2017) that does not account for patterns associated with groupwise offsets, giving rs = −0.28 for ERSST4 (Fig. 8d) and rs = −0.25 for ERSST5. The zonally symmetric corrections from Cowtan et al. (2018) also do not capture the patterns of WW2 offsets.
An implication of the removal of WW2WA in our analysis is a more stable and smoothly evolving SST estimate (Table 1). The 1936–50 standard deviation of global-average, annual SST anomalies decreases from 0.24°C in raw ICOADS (R1) to 0.07°C (0.06° to 0.11°C) in the adjusted daytime-only estimates (R5). Such subdecadal variability is consistent with estimates from HadSST3 and Cowtan SST and lies within the 95% confidence interval of CMIP5 and CMIP6 historical simulations that we analyze. In contrast, the HadSST4 median estimate has a standard deviation of 0.12°C that is larger than 93 out of 94 CMIP5 historical simulations and 37 out of 38 CMIP6 historical simulations, and ERSST median estimates have standard deviations of 0.16°–0.17°C that are higher than all CMIP5 and CMIP6 historical simulations (Table 1). On regional scales, the effect of groupwise adjustments is smaller than that of other sources of variability (including physical changes, sampling uncertainty, and random measurement errors), such that the 1936–50 variance on 5° × 5° grids decreases, on average, by only 12%.
In sum, our results help confirm that the WW2WA in instrumental SST estimates is a data artifact that arises from instrumental changes (Thompson et al. 2008). We identify U.S. and U.K. ships as the primary origin of the WW2WA. Warm biases in WW2 nighttime bucket SSTs are also identified. Adjusting for these offsets removes the WW2WA and leads to a more homogeneous trend in SSTs. Our results highlight the importance of resolving systematic errors in SSTs and reconcile the largest existing discrepancy between historical surface temperatures and model estimates (Folland et al. 2018). The fact that our independently derived adjustments to the SST record lead to consistency with model simulations of SST variations during WW2 gives greater confidence in predictions based on such models.
Ongoing work to recover historic SST will make more wartime data available for U.S. deck 195, which currently only has measurements that were collected at 0800, 1200, and 2000 LT included in ICOADS. Metadata that allow for distinguishing between types of naval ships, such as destroyers and destroyer escorts, are also being recovered (Hawkins et al. 2020). Incorporation of these additional SST observations and metadata in future work should permit more accurate and detailed adjustments of systematic offsets and more accurate estimates of historical SST.
Acknowledgments
We thank three anonymous reviewers for their detailed and thoughtful feedback. Conversations with Carl Wunsch and Elizabeth Kent also improved the content of this manuscript. This study was supported by the Harvard Global Institute.
Data availability statement
All datasets used in this study are available as follows: ERSST4 and a 1000-member ensemble (https://psl.noaa.gov/data/gridded/data.noaa.ersst.v4.html; last access: 18 April 2020; doi:10.7289/V5KD1VVF), ERSST5 and a 1000-member ensemble (https://www.esrl.noaa.gov/psd/data/gridded/data.noaa.ersst.v5.html; last access: 5 April 2020; doi:10.7289/V5T72FNM), Cowtan SST (https://www-users.york.ac.uk/~kdc3/papers/evaluating2017/methods.html; last access: 7 April 2020), HadSST2 (https://www.metoffice.gov.uk/hadobs/hadsst2/data/download.html; last access: 18 April 2020), HadSST3.1.1.0 and a 100-member ensemble (https://www.metoffice.gov.uk/hadobs/hadsst3/data/download.html; last access: 13 February 2020), and HadSST4.0.0.0 and a 200-member ensemble (https://www.metoffice.gov.uk/hadobs/hadsst4/data/download.html; last access: 5 April 2020). HadSST.4.0.0.0 data are ©British Crown Copyright, Met Office 2021, provided under an Open Government License, http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/. Monthly CMIP5 SST (tos) outputs re-gridded to 2.5° resolution are available from the ETH repository (contact Jan.Sedlacek@env.ethz.ch or cmip5-archive@env.ethz.ch for access; last access: 21 February 2019). Monthly CMIP6 outputs are from the ESGF portal (https://esgf-node.llnl.gov/search/cmip6/; last access: 8 January 2021). Raw and groupwise adjusted SSTs (R1–5), as well as key results in this manuscript, are available from the Harvard Dataverse repository, https://doi.org/10.7910/DVN/RJLBOQ.
The full reference for ICOADS3.0 follows: Research Data Archive/Computational and Information Systems Laboratory/National Center for Atmospheric Research/University Corporation for Atmospheric Research, Physical Sciences Laboratory/Earth System Research Laboratory/OAR/NOAA/U.S. Department of Commerce, Cooperative Institute for Research in Environmental Sciences/University of Colorado, National Oceanography Centre/University of Southampton, Met Office/Ministry of Defence/United Kingdom, Deutscher Wetterdienst (German Meteorological Service)/Germany, Department of Atmospheric Science/University of Washington, Center for Ocean–Atmospheric Prediction Studies/Florida State University, and National Centers for Environmental Information/NESDIS/NOAA/U.S. Department of Commerce. 2016, updated monthly. International Comprehensive Ocean–Atmosphere Data Set (ICOADS) Release 3, Individual Observations. Research Data Archive at the National Center for Atmospheric Research, Computational and Information Systems Laboratory (https://doi.org/10.5065/D6ZS2TR3, accessed 3 October 2018).
Code required to reproduce the full analysis and display items is available from a GitHub repository, https://github.com/duochanatharvard/World-War-II-Warm-Anomaly.
APPENDIX A
Setup and Implementation of the LME Methodology
Our setup of the LME method partly resolves variations in offsets that arise from geographically varying measurement environments (Fig. A1). Higher-order interactions that involve group, year, and region are not accounted for in this model to limit the number of free parameters. Note that the model does not explicitly resolve seasonal variations in offsets. To estimate seasonality, we fit the model on subsets of three consecutive months and combine 12 successive analyses to cover the full year. Southern Hemisphere SSTs are shifted by six months to account for different seasons between hemispheres (Chan and Huybers 2020).
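The seasonal subsetting can be sketched as follows. This is an illustration only: it assumes that each 3-month window is centered on its target month and that an observation is a dictionary with hypothetical `month` and `lat` keys. The helper names are not taken from the published code.

```python
def season_month(month, lat):
    """Map a calendar month (1-12) to a season-aligned month by shifting
    Southern Hemisphere observations by six months."""
    if lat < 0:
        return (month - 1 + 6) % 12 + 1
    return month

def window_months(center):
    """Three consecutive months centered on `center`, wrapping across the
    year boundary (centering is an assumption of this sketch)."""
    return [(center - 2 + k) % 12 + 1 for k in range(3)]

def select_window(observations, center):
    """Observations whose season-aligned month falls in the window."""
    months = set(window_months(center))
    return [ob for ob in observations
            if season_month(ob["month"], ob["lat"]) in months]

# Twelve successive windows cover the full year.
windows = [window_months(m) for m in range(1, 13)]
```

With this alignment, a July observation at 40°S enters the same fit as a January observation at 40°N, so that each of the 12 fits pools measurements taken in comparable seasons.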
After averaging, we use an iterative numerical algorithm (Harville 1977), as implemented in the Matlab function “fitlmematrix”, to estimate the LME model. In particular, we estimate three hyperparameters: the variance of all yearly effects,
The significance of groupwise offsets is estimated using a two-sided Z test from 1000 sets of groupwise randomized offset estimates. Offsets are realized according to the mean and uncertainties associated with fixed and random effects. A Bonferroni correction is also applied to account for the increased probability of incorrectly rejecting true null hypotheses in multiple hypothesis testing, which is carried out by lowering the p-value threshold to 0.05/66, where 66 is the number of groups from 1935 to 1949. We also use these 1000 sets of randomized offset estimates to generate a 1000-member ensemble of adjusted monthly gridded SSTs, which permits accounting for error covariance reflecting the spatial and temporal coverage of individual groups.
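A minimal sketch of such a test is given below. It assumes that the Z statistic for each group is the ensemble-mean offset divided by the ensemble standard deviation, which is one plausible reading of the procedure rather than a confirmed detail, and the function names are hypothetical.

```python
import math
from statistics import mean, stdev

def two_sided_p(samples):
    """Two-sided p value for the null hypothesis of zero mean offset,
    using z = ensemble mean / ensemble standard deviation (an assumption)."""
    z = mean(samples) / stdev(samples)
    # 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)) for a standard normal.
    return math.erfc(abs(z) / math.sqrt(2))

def significant_groups(ensembles, alpha=0.05):
    """Flag groups whose p value is below the Bonferroni threshold
    alpha / (number of groups tested)."""
    threshold = alpha / len(ensembles)
    return {name: two_sided_p(s) < threshold for name, s in ensembles.items()}

# Toy example with two hypothetical groups and three members each.
toy = {"group_a": [0.4, 0.5, 0.6], "group_b": [-0.1, 0.0, 0.1]}
print(significant_groups(toy))
```

With 66 groups the threshold becomes 0.05/66, or approximately 7.6 × 10⁻⁴, which corresponds to requiring |z| greater than about 3.4 for a single group.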
APPENDIX B
Using Diurnal Variations to Infer Measurement Type for Deck 195 Measurements
All SST measurements from U.S. deck 195 are sampled at 0800, 1200, and 2000 LT, whereas elsewhere we have required that each 6-hourly bin of a day has at least one measurement (Carella et al. 2018; Chan and Huybers 2020). Because deck 195 contains 24% of observations within the WW2 interval, however, we take an alternative approach that uses diurnal cycles from bucket and ERI measurements as basis functions (Fig. B1). Specifically, the diurnal cycle of deck 195 is represented as a linear combination of bucket and ERI cycles, with the mixture determined using least squares fitting. When using all available SST measurements, the best fit yields 100% ERI and 0% bucket (Fig. B1a), equivalent to a diurnal amplitude of 0.05°C that is 0.07°C smaller than that of drifting buoys. Our inference that deck 195 is consistent with purely ERI SSTs is robust to dividing the analysis to focus on individual regions and seasons, where in each case the observed diurnal variations are consistent with other ERI data and highly inconsistent with bucket observations (Figs. B1b–e).
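The mixture fit can be illustrated with the sketch below. It assumes that the two weights sum to one and are bounded between 0 and 1, so that a single bucket fraction is estimated; whether the original fit imposes these constraints is not stated here. The diurnal anomalies in the example are made-up values at the three sampling times, not data from this study.

```python
def bucket_fraction(deck, bucket, eri):
    """Least-squares fraction f in deck ~ f * bucket + (1 - f) * eri,
    clipped to the interval [0, 1]."""
    # Rearranged model: (deck - eri) ~ f * (bucket - eri).
    num = sum((d - e) * (b - e) for d, b, e in zip(deck, bucket, eri))
    den = sum((b - e) ** 2 for b, e in zip(bucket, eri))
    if den == 0:
        raise ValueError("bucket and ERI cycles are identical")
    # For a one-parameter quadratic, clipping the unconstrained solution
    # gives the constrained least-squares solution.
    return min(1.0, max(0.0, num / den))

# Hypothetical diurnal anomalies (degC) at 0800, 1200, and 2000 LT.
bucket_cycle = [-0.10, 0.15, -0.05]
eri_cycle = [-0.02, 0.03, -0.01]
deck_cycle = [-0.02, 0.03, -0.01]  # identical to the ERI cycle here

f = bucket_fraction(deck_cycle, bucket_cycle, eri_cycle)
print(f"bucket fraction: {f:.2f}, ERI fraction: {1 - f:.2f}")
```

A returned fraction of zero corresponds to the reported best fit of 100% ERI and 0% bucket. Clipping also handles the case in which the observed amplitude is smaller than that of either basis cycle.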
REFERENCES
Bindoff, N. L., and Coauthors, 2013: Detection and attribution of climate change: From global to regional. Climate Change 2013: The Physical Science Basis, Cambridge University Press, 867–952.
Carella, G., E. C. Kent, and D. I. Berry, 2017: A probabilistic approach to ship voyage reconstruction in ICOADS. Int. J. Climatol., 37, 2233–2247, https://doi.org/10.1002/joc.4492.
Carella, G., J. Kennedy, D. Berry, S. Hirahara, C. J. Merchant, S. Morak-Bozzo, and E. Kent, 2018: Estimating sea surface temperature measurement methods using characteristic differences in the diurnal cycle. Geophys. Res. Lett., 45, 363–371, https://doi.org/10.1002/2017GL076475.
Chan, D., and P. Huybers, 2019: Systematic differences in bucket sea surface temperature measurements among nations identified using a linear-mixed-effect method. J. Climate, 32, 2569–2589, https://doi.org/10.1175/JCLI-D-18-0562.1.
Chan, D., and P. Huybers, 2020: Systematic differences in bucket sea surface temperatures caused by misclassification of engine room intake measurements. J. Climate, 33, 7735–7753, https://doi.org/10.1175/JCLI-D-19-0972.1.
Chan, D., E. C. Kent, D. I. Berry, and P. Huybers, 2019: Correcting datasets leads to more homogeneous early-twentieth-century sea surface warming. Nature, 571, 393–397, https://doi.org/10.1038/s41586-019-1349-2.
Cowtan, K., R. Rohde, and Z. Hausfather, 2018: Evaluating biases in sea surface temperature records using coastal weather stations. Quart. J. Roy. Meteor. Soc., 144, 670–681, https://doi.org/10.1002/qj.3235.
Eyring, V., S. Bony, G. A. Meehl, C. A. Senior, B. Stevens, R. J. Stouffer, and K. E. Taylor, 2016: Overview of the Coupled Model Intercomparison Project phase 6 (CMIP6) experimental design and organization. Geosci. Model Dev., 9, 1937–1958, https://doi.org/10.5194/gmd-9-1937-2016.
Folland, C., and D. Parker, 1995: Correction of instrumental biases in historical sea surface temperature data. Quart. J. Roy. Meteor. Soc., 121, 319–367, https://doi.org/10.1002/qj.49712152206.
Folland, C., D. Parker, and F. Kates, 1984: Worldwide marine temperature fluctuations 1856–1981. Nature, 310, 670–673, https://doi.org/10.1038/310670a0.
Folland, C., O. Boucher, A. Colman, and D. E. Parker, 2018: Causes of irregularities in trends of global mean surface temperature since the late 19th century. Sci. Adv., 4, eaao5297, https://doi.org/10.1126/sciadv.aao5297.
Freeman, E., and Coauthors, 2017: ICOADS release 3.0: A major update to the historical marine climate record. Int. J. Climatol., 37, 2211–2232, https://doi.org/10.1002/joc.4775.
Hansen, J., R. Ruedy, M. Sato, and K. Lo, 2010: Global surface temperature change. Rev. Geophys., 48, RG4004, https://doi.org/10.1029/2010RG000345.
Harville, D. A., 1977: Maximum likelihood approaches to variance component estimation and to related problems. J. Amer. Stat. Assoc., 72, 320–338, https://doi.org/10.1080/01621459.1977.10480998.
Hausfather, Z., K. Cowtan, D. C. Clarke, P. Jacobs, M. Richardson, and R. Rohde, 2017: Assessing recent warming using instrumentally homogeneous sea surface temperature records. Sci. Adv., 3, e1601207, https://doi.org/10.1126/sciadv.1601207.
Hawkins, E., P. Brohan, G. Compo, K. Wood, and M. Mark, 2020: Old Weather—WW2. Accessed 22 November 2020, https://www.zooniverse.org/projects/krwood/old-weather-ww2/about/research.
Hegerl, G. C., S. Brönnimann, A. Schurer, and T. Cowan, 2018: The early 20th century warming: Anomalies, causes, and consequences. Wiley Interdiscip. Rev.: Climate Change, 9, e522, https://doi.org/10.1002/wcc.522.
Huang, B., and Coauthors, 2015: Extended Reconstructed Sea Surface Temperature version 4 (ERSST.v4). Part I: Upgrades and intercomparisons. J. Climate, 28, 911–930, https://doi.org/10.1175/JCLI-D-14-00006.1.
Huang, B., and Coauthors, 2017: Extended Reconstructed Sea Surface Temperature, version 5 (ERSSTv5): Upgrades, validations, and intercomparisons. J. Climate, 30, 8179–8205, https://doi.org/10.1175/JCLI-D-16-0836.1.
Jones, G. S., P. A. Stott, and N. Christidis, 2013: Attribution of observed historical near-surface temperature variations to anthropogenic and natural causes using CMIP5 simulations. J. Geophys. Res. Atmos., 118, 4001–4024, https://doi.org/10.1002/jgrd.50239.
Kennedy, J., N. Rayner, R. Smith, D. Parker, and M. Saunby, 2011a: Reassessing biases and other uncertainties in sea surface temperature observations measured in situ since 1850: 1. Measurement and sampling uncertainties. J. Geophys. Res., 116, D14103, https://doi.org/10.1029/2010JD015218.
Kennedy, J., N. Rayner, R. Smith, D. Parker, and M. Saunby, 2011b: Reassessing biases and other uncertainties in sea surface temperature observations measured in situ since 1850: 2. Biases and homogenization. J. Geophys. Res., 116, D14104, https://doi.org/10.1029/2010JD015220.
Kennedy, J., N. Rayner, C. Atkinson, and R. Killick, 2019: An ensemble data set of sea surface temperature change from 1850: The Met Office Hadley Centre HadSST.4.0.0.0 data set. J. Geophys. Res. Atmos., 124, 7719–7763, https://doi.org/10.1029/2018JD029867.
Kent, E. C., and P. K. Taylor, 2006: Toward estimating climatic trends in SST. Part I: Methods of measurement. J. Atmos. Oceanic Technol., 23, 464–475, https://doi.org/10.1175/JTECH1843.1.
Kent, E. C., N. A. Rayner, D. I. Berry, M. Saunby, B. I. Moat, J. J. Kennedy, and D. E. Parker, 2013: Global analysis of night marine air temperature and its uncertainty since 1880: The HadNMAT2 data set. J. Geophys. Res. Atmos., 118, 1281–1298, https://doi.org/10.1002/jgrd.50152.
Kent, E. C., and Coauthors, 2017: A call for new approaches to quantifying biases in observations of sea surface temperature. Bull. Amer. Meteor. Soc., 98, 1601–1616, https://doi.org/10.1175/BAMS-D-15-00251.1.
Maher, N., A. Sen Gupta, and M. H. England, 2014: Drivers of decadal hiatus periods in the 20th and 21st centuries. Geophys. Res. Lett., 41, 5978–5986, https://doi.org/10.1002/2014GL060527.
Morice, C. P., J. J. Kennedy, N. A. Rayner, and P. D. Jones, 2012: Quantifying uncertainties in global and regional temperature change using an ensemble of observational estimates: The HadCRUT4 data set. J. Geophys. Res., 117, D08101, https://doi.org/10.1029/2011JD017187.
Pfeiffer, M., J. Zinke, W.-C. Dullo, D. Garbe-Schönberg, M. Latif, and M. Weber, 2017: Indian Ocean corals reveal crucial role of World War II bias for twentieth century warming estimates. Sci. Rep., 7, 14434, https://doi.org/10.1038/s41598-017-14352-6.
Reynolds, R. W., T. M. Smith, C. Liu, D. B. Chelton, K. S. Casey, and M. G. Schlax, 2007: Daily high-resolution-blended analyses for sea surface temperature. J. Climate, 20, 5473–5496, https://doi.org/10.1175/2007JCLI1824.1.
Stevens, B., 2015: Rethinking the lower bound on aerosol radiative forcing. J. Climate, 28, 4794–4819, https://doi.org/10.1175/JCLI-D-14-00656.1.
Taylor, K. E., R. J. Stouffer, and G. A. Meehl, 2012: An overview of CMIP5 and the experiment design. Bull. Amer. Meteor. Soc., 93, 485–498, https://doi.org/10.1175/BAMS-D-11-00094.1.
Thompson, D. W., J. J. Kennedy, J. M. Wallace, and P. D. Jones, 2008: A large discontinuity in the mid-twentieth century in observed global-mean surface temperature. Nature, 453, 646–649, https://doi.org/10.1038/nature06982.
Thompson, D. W., J. M. Wallace, P. D. Jones, and J. J. Kennedy, 2009: Identifying signatures of natural climate variability in time series of global-mean surface temperature: Methodology and insights. J. Climate, 22, 6120–6141, https://doi.org/10.1175/2009JCLI3089.1.
Vose, R. S., and Coauthors, 2012: NOAA’s merged land–ocean surface temperature analysis. Bull. Amer. Meteor. Soc., 93, 1677–1685, https://doi.org/10.1175/BAMS-D-11-00241.1.
Woodruff, S. D., and Coauthors, 2011: ICOADS release 2.5: Extensions and enhancements to the surface marine meteorological archive. Int. J. Climatol., 31, 951–967, https://doi.org/10.1002/joc.2103.
York, D., N. M. Evensen, M. L. Martinez, and J. De Basabe Delgado, 2004: Unified equations for the slope, intercept, and standard errors of the best straight line. Amer. J. Phys., 72, 367–375, https://doi.org/10.1119/1.1632486.