An evaluation of the performance of the twentieth century reanalysis version 3

: The performance of a new historical reanalysis, the NOAA–CIRES–DOE Twentieth Century Reanalysis version3(20CRv3),isevaluatedviacomparisonswithotherreanalysesandindependentobservations.Thisdatasetprovides global, 3-hourly estimates of the atmosphere from 1806 to 2015 by assimilating only surface pressure observations and prescribing sea surface temperature, sea ice concentration, and radiative forcings. Comparisons with independent observations, other reanalyses, and satellite products suggest that 20CRv3 can reliably produce atmospheric estimates on scales ranging from weather events to long-term climatic trends. Not only does 20CRv3 recreate a ‘‘best estimate’’ of the weather, including extreme events, it also provides an estimate of its conﬁdence through the use of an ensemble. Surface pressure statistics suggest that these conﬁdence estimates are reliable. Comparisons with independent upper-air observations in the


Introduction
A detailed understanding of past weather and climate, including variability and trends, is essential to better understand and predict ongoing changes in climate and weather statistics.Historical observational datasets intended to accomplish this are spatially and temporally incomplete, and often have inhomogeneity issues (Brönnimann et al. 2013;Cram et al. 2015;Jones et al. 1999;Parker et al. 1997;Rennie et al. 2014;Thorne et al. 2017;Noone et al. 2021).Reanalyses can provide complete and consistent atmospheric fields by objectively combining historical observations with modern numerical weather prediction model forecasts, while accounting for estimated errors in both (Kalnay et al. 1996).Most reanalyses, however, only go back to circa 1950 or 1979 to use the most comprehensive observing network while avoiding inconsistencies arising from major changes in it, such as the introduction of extensive upper-air observations or satellite data (Bengtsson et al. 2004;Bosilovich et al. 2011;Kinter et al. 2004;Kistler et al. 2001;Zhang et al. 2012).
By assimilating only long-term surface observations, historical reanalyses can avoid some of these inconsistencies and extend further back in time.In partnership with the international Atmospheric Circulation Reconstructions over the Earth initiative (ACRE; Allan et al. 2011), the University of Colorado Boulder's Cooperative Institute for Research in Environmental Sciences (CIRES) and the National Oceanic and Atmospheric Administration (NOAA) were the first to generate a dynamically consistent ''sparse input'' global atmospheric reanalysis, the Twentieth Century Reanalysis (20CR), based on only surface pressure observations.The preliminary first version spanned 1908-58 to demonstrate the feasibility of a surface-pressure-only reanalysis (Compo et al. 2006;Whitaker et al. 2004).The second version, 20CRv2, spanned from 1871 to the present and was kept up to date until 2012 (Compo et al. 2011).The follow-up, 20CRv2c, improved upon 20CRv2 and extended the reanalysis period to 1851-2014.The European Centre for Medium-Range Weather Forecasts (ECMWF) subsequently generated a sparse-input reanalysis, the ECMWF Twentieth Century Reanalysis (ERA-20C), assimilating both surface pressure and marine winds and extending back to 1900 (Poli et al. 2016).Their most recent historical reanalysis, the Coupled ECMWF Reanalysis of the Twentieth Century (CERA-20C), spans 1901-2010 and uses a coupled ocean-atmosphere forecast model to also assimilate subsurface ocean profile observations (Laloyaux et al. 2018).
The latest version of the Twentieth Century Reanalysis has been generated by NOAA, CIRES, and the U.S. Department of Energy (DOE).This NOAA-CIRES-DOE 20CR version 3 (20CRv3), uses a newer, higher-resolution model, assimilates a larger set of observations, and includes an improved data assimilation system relative to its predecessor 20CRv2c.The 20CRv3 system further extends the reanalysis period to 1836-2015, with an experimental extension spanning 1806-35.Slivinski et al. (2019a) provide an in-depth description of the system that generated the 20CRv3 reanalysis product, as well as preliminary evaluations of a subset of the 20CRv3 product (mainly via comparisons with 20CRv2c).Here, we provide a first evaluation of the entire 1806-2015 time span of the 20CRv3 dataset through comparisons with independent observations, full-and sparse-input reanalyses, and satellite products.The focus is on the synoptic and climatic behavior of a few key atmospheric variables from the surface to the upper atmosphere; further work will provide more exhaustive evaluations.
By going back to 1806, one can potentially study trends in the longest instrument-based reanalysis generated to date.However, one must first understand the accuracy and reliability of the dataset relative to other products and observations.To that end, this work provides an initial evaluation of the 20CRv3 dataset on weather and climate scales, but in-depth investigations of particular phenomena and variability are left for future research.This work is organized as follows: Section 2 reviews relevant details of 20CRv3 and introduces the other reanalyses and observational datasets used for comparison.Section 3 investigates an extreme weather event in the nineteenth century, illustrating that 20CRv3 has efficacy for evaluating the dynamics of past weather when data density is sufficient.Section 4 considers synoptic-scale performance metrics, including errors in surface pressure and upper-air geopotential heights.Section 5 focuses on climate time scales and discusses long-term structures of mass, circulation, and precipitation in 20CRv3.Section 6 concludes with a discussion of implications.

Data and methods
The 20CRv3 system consists of a numerical weather prediction model, an observational dataset, and an assimilation method.We review this system briefly; further details are given by Slivinski et al. (2019a).Using an 80-member ensemble Kalman filter, the 20CRv3 system assimilates only surface pressure observations from the open, unrestricted, and publicly available International Surface Pressure Databank (ISPD) version 4.7 (Compo et al. 2019;Cram et al. 2015), into the U.S. National Centers for Environmental Prediction (NCEP) Global Forecast System (GFS) model, version 14.0.1, with a spectral horizontal resolution of T254 (effectively 60 km at the equator) and a vertical atmospheric resolution of 64 levels up to about 0.3 hPa.Sea surface temperature (SST) fields are prescribed from two eightmember ensembles: version 3 of the Simple Ocean Data Assimilation with sparse input (SODAsi.3),reanalysis for 1836-1980 (Giese et al. 2016); and the Hadley Centre Sea Ice and Sea Surface Temperature dataset, version 2.2 (HadISST2.2),for 1981-2015.These SST fields, originally available as 5-day averages, were interpolated to daily resolution for 20CRv3.Sea ice concentration fields are prescribed from monthly HadISST2.3(Titchner and Rayner, 2014) interpolated to daily resolution.Solar radiation is determined from the Total Solar Irradiance (TSI) Reconstruction based on the Naval Research Laboratory TSI (NRLTSI2; Coddington et al. 2016), and time-varying atmospheric constituents of volcanic aerosols (Crowley and Unterman, 2013), stratospheric ozone (Cionni et al. 2011), and atmospheric carbon dioxide (CO 2 ) levels (Saha et al. 2010) are also specified.Output subsurface land fields (such as soil moisture), surface fields, and atmospheric fields are provided at 3-hourly resolution.
The experimental extension spanning 1806-35 uses an identical system, with SST fields for 1815-35 specified from the eight-member ensemble of SODAsi.3 and for 1806-14 specified as the 30-yr average climatological fields from HadISST2.1 for years 1861-90 (to be consistent with previous iterations of 20CR; see Giese et al. 2016).At the time of production, 1804 was the first year with at least one observation globally every six hours (mainly from stations in western Europe and occasionally North America).With a 16-month spinup period, 1806 is the first full year available in the extension.
To evaluate the performance of 20CRv3, its fields are compared with a variety of atmospheric reanalyses and observational datasets.Fujiwara et al. (2017) organize reanalyses into three general categories: full input, conventional input, and surface input.Full-input reanalyses assimilate all available surface and upper-air conventional observations as well as satellite observations as they become available in time.Conventionalinput reanalyses assimilate surface and upper-air conventional observations, but not satellite data; we do not consider any conventional-input reanalyses here.Surface-input reanalyses assimilate only surface conventional observations (such as surface pressure and marine winds.) In addition to reanalyses, 20CRv3 is also compared with independent upper-air, station-based, satellite-based, and satellite-station blended observational datasets of upper-air fields, precipitation, and lower-tropospheric temperature.A summary of all reanalyses and datasets used in this study is shown in Table 1.This list is far from exhaustive; see Fujiwara et al. (2017) for more details.

Case study: The Great Blizzard of 1888
A key feature of 20CRv3 is its 3-hourly resolution for the span of 210 years, allowing users to investigate weather extremes from across the nineteenth to twenty-first centuries in a consistent framework.Several studies have shown how well previous versions of 20CR were able to reconstruct individual weather events (e.g., Brönnimann et al. 2013;Gergis et al. 2020;Lorrey and Chappell 2016;Moore and Babij 2017;Stucki et al. 2015).Slivinski et al. (2019a) investigated how well the 20CRv3 system represents the 1915 Galveston Hurricane, and found that 20CRv3 has the strongest intensity (i.e., lowest central pressure) of the four historical reanalyses considered there: 20CRv2c, 20CRv3, ERA-20C, and CERA-20C.Here, we investigate an extratropical winter storm that impacted North America in the nineteenth century and has been the focus of several previous studies (Kocin 1983(Kocin , 1988;;Michaelis and Lackmann 2013), but 20CRv3 could be used to study a variety of weather extremes back to 1806 (e.g., the 1816 year without a summer; Brugnara et al. 2015;Skrynyk et al. 2021).
The Great Blizzard of 1888 was a historic snowstorm impacting the northeast United States between 11 and 14 March 1888, delivering up to 125 cm of snow across parts of New England along with strong winds and low temperatures (Kocin and Uccellini 2004;Kocin 1983Kocin , 1988)).For reference, in the original development of the Northeast Snowfall Impact Scale (NESIS), the Blizzard of 1888 received a score of 8.34, while the Superstorm of March 1993 received a value of 12.52, and the Blizzard of January 1996 received a score of 11.54.Michaelis and Lackmann (2013) downscaled 20CRv2 using the Weather Research and Forecasting (WRF) Model and were able to reconstruct offshore cyclogenesis and heavy snowfall, albeit with a significant position error.Here, we illustrate the ability of the 20CRv3 system to reconstruct this storm at its output resolution.Figure 1 shows synoptic maps over North America for 0000 UTC 13 March 1888.Figure 1a shows ensemble mean sea level pressure (SLP), the locations of observations assimilated within the previous 24 h, and confidence.Confidence of an atmospheric field is defined as in (Slivinski et al. 2019a) but for a single time t: where spread ens is the standard deviation of the ensemble of analyzed fields from 20CRv3 valid for the given date and time, and spread clim is the temporal standard deviation of the 20CRv3 ensemble mean field, over the given month, day, and hour, for 1981-2010.This metric is intended to provide information on the ensemble spread (not on mean biases), while normalizing out the effects of intrinsic variability.A value of zero in this metric corresponds to confidence equal to that of a climatological estimate, while negative (positive) values correspond to less (more) confidence than a climatological estimate.In particular, negative values can occur when the instantaneous ensemble spread is larger than the ensemble mean's temporal variability.This does not necessarily imply poor performance of the data assimilation system, unless longterm averages of confidence are regularly below zero (see section 3.1 of Slivinski et al. 2019a).Figure 1b shows ensemble mean 500-hPa geopotential height (Z500) and its confidence field.Figures 1c and 1d show ensemble mean 2-m air temperature and 6-hourly accumulated precipitation, respectively.While the confidence shading in Figs.1a and 1b demonstrates the spread of the ensemble, Fig. 2 provides further evidence of the ensemble variability by illustrating the SLP and precipitation fields from the first 20 ensemble members; this set is representative of the full 80-member ensemble.A storm is evident in all ensemble members shown here, but the variability within the ensemble suggests that there is more uncertainty in precipitation than in SLP.
Comparisons with preexisting reconstructions demonstrate that the overall structure and location of the storm in 20CRv3 are realistic (Fig. 3).The real-time U.S. Daily Weather Map, a reconstruction created by Kocin (1983), and the 20CRv3 fields valid at 1200 UTC 13 March 1888 compare well with each other.Note that this is 12 h later than the fields shown Figs. 1 and 2, after the storm began to move offshore, in order to compare with the U.S. Daily Weather Maps (which are only available once per day at 1200 UTC).Though the 20CRv3 ensemble mean is arguably an imperfect estimate of an extreme weather event since ensemble averaging tends to dampen gradients in positiondependent features, there is still good agreement across all three reconstructions, particularly in the location of the storm.
This example illustrates how well 20CRv3 can represent historic extreme weather events at the surface on subdaily scales.In the remainder of the paper, we investigate how well 20CRv3 performs on synoptic to climatic scales from the surface to high levels of the atmosphere.

Synoptic skill evaluation a. Surface pressure statistics
To investigate the overall performance, we initially consider background errors in surface pressure.The root-mean-squared errors (RMSEs) between independent surface pressure observations that have not yet been assimilated and the 20CRv3 ensemble mean background equivalents provide a simple metric for how well the system is performing.We refer to this as the ''actual'' error: where i indexes all N obs observations considered (here, in a given region for a single year), x o,i is the ith observation, and x b,i is the ensemble mean background field interpolated to the ith observation time and location.By comparing to expected errors, we can determine how well the uncertainty of the system is being estimated as well.Here, ''expected'' errors are calculated as the root mean of the sum of the observation error s 2 o,i and the estimated background error (variance of the background ensemble interpolated to the observation's location and time) s 2 b,i : As shown by Desroziers et al. (2005), if the observation and background errors are uncorrelated and unbiased, then RMSE actual should be equivalent to RMSE exp .
Figure 4 shows observation-forecast difference statistics for surface pressure and the associated expected RMSEs for three zonal regions and for the full experimental and production span of 20CRv3, 1806-2015.Here the Northern Hemisphere (NH) is defined as 208-908N, the tropics as 208S-208N, and the Southern Hemisphere (SH) as 908-208S.The RMSEs generally decrease in time and are consistent with the expected errors even as the available pressure observations vary over four orders of magnitude, demonstrating a high level of performance in this metric and an improvement over the performance of 20CRv2c due to the upgrades in forecast model and data assimilation algorithms (see Slivinski et al. 2019a, their Fig. 12 and related discussion).This also suggests that the prescribed observation errors are realistic, though the disagreement (particularly when the expected errors are larger than the actual errors) suggests that there is still room for improvement; this will be investigated in future tests.
There is a strong negative correlation between the RMSE and the log of the number of observations assimilated (Fig. 4), demonstrating that the errors generally decrease as the observation network density increases.This motivates global and regional data rescue efforts, particularly in regions that are currently data sparse (e.g., Allan et al. 2011;Brönnimann et al. 2019a;Williamson et al. 2018; https://climatehistory.com.au); it is estimated that there could be millions of pressure observations that exist but are currently unavailable for assimilation (i.e., they have not yet been scanned or digitized).If early paper record observations were digitized and made available for assimilation, Fig. 4 suggests that errors could be significantly decreased in the corresponding time periods.The continued decrease in actual error in the tropics and Southern Hemisphere also suggests that even now, the observation networks in these regions are not dense enough for the error to have saturated at its minimum level.

b. Modern upper-air comparisons
In Fig. 5 we evaluate the skill of Z500 estimates in 20CRv3, a large-scale variable that governs atmospheric circulation patterns (such as troughs and ridges), with respect to ERA5, the latest full-input reanalysis from ECMWF (Hersbach et al. 2020;Hersbach and Dee 2016).ERA5 assimilates nearly all available observations including extensive satellite-based, radiosonde, aircraft, and other conventional upper-air and surface observations.To compare with the available operational forecast errors, which are unavailable poleward of 808, here we define ''NH'' as 208-808N and ''SH'' as 808-208S.In both the NH and SH, 20CRv3 has smaller root-mean-squared differences (RMSDs) with respect to ERA5 than 20CRv2c, and similar to Fig. 4, RMSDs are more consistent with the ensemble spread in 20CRv3.This is in contrast to 20CRv2c, whose small ensemble spread relative to its RMSD suggests overconfidence, likely due to the imperfect covariance inflation algorithm used in the 20CRv2c system that was updated for 20CRv3 (Slivinski et al. 2019a).The seasonal variability in 20CRv3 differences relative to ERA5 has also been diminished from 20CRv2c in the NH.For reference, the 1981-2010 climatological variability (standard deviation over time) in the 20CRv3 ensemble mean field is 79.24 m averaged over the Northern Hemisphere and 91.09 m averaged over the Southern Hemisphere.
To put these numbers in context, we also compare with the 2019 annual average errors of operational forecasts of Z500 for 2-, 3-, and 4-day leads.The 20CRv3 analysis errors for 1979-2015 are comparable to modern 3-4-day operational forecast skill in the NH, and 4-day forecast skill in the SH, consistent with the expected skills predicted by Compo et al. (2006).The degraded performance in the SH for the period 1980-85 may be due to a lack of observations in this period (see the small drop in number of assimilated observations in Fig. 4c around 1980), causing the errors and the ensemble spread of 20CRv3 to increase.However, this may also affect the performance of the benchmark ERA5, as there are fewer satellite and conventional upper-air observations available in the Southern Hemisphere in this time period than in more recent periods.This reduction in skill is consistent with Fig. 1 of (Hersbach et al. 2020), which shows a slight decrease in the range of 365day mean anomaly correlations of Z500 forecasts for ERA5 in the Southern Hemisphere from about 1982-85 and considerable interannual variability in the quality of its Z500 fields before 1987.

c. Mid-twentieth-century surface and upper-air comparisons
Next, we compare surface and upper-air fields from 20CRv3 with the Japanese 55-year Reanalysis Project (JRA-55; Kobayashi et al. 2015).Figure 6 shows maps of local anomaly correlation between 20CRv3 and JRA-55 for SLP, Z500, and 300-hPa geopotential heights (Z300).Anomaly fields are computed for each reanalysis relative to its climatology over the stated time period and interpolated to a 18 3 18 grid; correlations in time are then computed for each grid point.Correlations are high in the Northern Hemisphere even before considerable satellite data are available for use in JRA-55 (1958-78;left column), though the extent of high correlation regions decreases higher in the atmosphere.While JRA-55 mainly assimilates conventional surface and upper-air data in this period, note that sparse satellite observations from the Vertical Temperature Profile Radiometer (VTPR) are included from 1973 to 1979 (see appendix A of Kobayashi et al. 2015).The correlation increases during the era of significant satellite observations (1979-2015; right column), particularly in the Southern Hemisphere, likely due to the lack of available observations in high southern latitudes prior to 1979 (Bromwich et al. 2007).
Stippled regions illustrate where 20CRv3 has low confidence.The high pattern correlation between the 20CRv3 confidence fields and the local anomaly correlation fields suggest that 20CRv3 uncertainty estimates are a good predictor of

d. Comparisons with independent upper-air observations
Upper-air skill prior to 1958 can be evaluated by comparisons with independent upper-air observations.We begin with observations at Lindenberg, Germany (52.228N, 14.128E), the station with the longest observational record available in the Integrated Global Radiosonde Archive, version 2 (IGRA2; Durre et al. 2016), from 1905 to the present.The observations are taken from a mix of kite, airplane, balloon, and radiosonde platforms (Adam and Dier 2005;Stickler et al. 2010).Even in the early twentieth century, the analyzed Z500 anomalies from both 20CRv3 and 20CRv2c correlate highly with observed anomalies (Fig. 7).Errors decrease in time for both systems, but RMSDs with 20CRv3 are consistently smaller than in 20CRv2c, and correlations with these observations are higher.
Figure 8 further demonstrates that 20CRv3 is able to predict its own skill by showing the RMSDs of analyzed Z500 anomalies with respect to the Lindenberg observations as a function of ensemble spread, for 20CRv3 and 20CRv2c.Only the latter two time periods considered in Fig. 7 are shown in Fig. 8, as it is unclear whether large errors in the early time period are due to instrumental errors in the observations or errors in the reanalysis, and the smaller number of observations does not allow for calculation of robust statistics.The gray diagonal line shows the expected RMSD for perfect observations, and the gray shaded area shows where the RMSD would be expected to fall for observations with error between 15 and 25 m (Wartenburger et al. 2013) assuming accurate ensemble spread.Thus, when the scatterplot lies above the gray swath, the reanalysis is overconfident (ensemble spread is too small); when the scatterplot lies below the gray swath, the reanalysis is underconfident (ensemble spread is too large).In the mid-to late twentieth century, 20CRv2c was nearly always overconfident in this field while 20CRv3 more accurately represents uncertainty  or is underconfident (particularly in the more recent period from 1981 to 2001).
To investigate whether these results hold in other locations and at other levels in the latter half of the twentieth century, geopotential heights from eleven other stations from IGRA2 are analyzed at 300, 500, and 850 hPa.Station names, locations, and maximum available time periods that include complete metadata are listed in Table 2.The RMSD and correlation with respect to 20CRv3 for each station at each level are also included, as well as the local variability (as measured by the standard deviation of all observed anomalies).The RMSDs are consistently smaller than the local variability, and the correlations are generally high, though in both measures, the performance of 20CRv3 worsens in the tropics and at higher levels, as also seen in (Compo et al. 2011).Figure 9 demonstrates that these errors are often well predicted by the 20CRv3 ensemble spread, even at 300 hPa.Similar to Fig. 8, Fig. 9 shows RMSDs of 20CRv3 analyzed Z300, Z500, and Z850 anomalies with respect to the labeled IGRA station observations as a function of 20CRv3 ensemble spread, for leveldependent observation errors defined in (Wartenburger et al. 2013).As in Fig. 8, there are several cases where 20CRv3 is over or underconfident.However, the 20CRv3 analyzed anomalies are interpolated to the instrument release time and location, and thus do not account for balloon drift, which is likely to influence the results (McGrath et al. 2006).

Climatic skill evaluation
The previous section illustrates the performance of 20CRv3 for synoptic variability.Here we evaluate the performance of long-term, climatic structures of mass, circulation, and precipitation.

a. Vertical structure of mass and circulation
To investigate possible systematic biases in 20CRv3 throughout the atmosphere, we compare its vertical profiles of long-term means of temperature and zonal wind with those from two full-input reanalyses, JRA-55 and ERA5, during the period 1979-2015 (Fig. 10).On average, 20CRv3 is warmer than both JRA-55 and ERA5 by about 18-1.58C from 600 to 300 hPa and below 900 hPa, pointing to a possible model bias in this version of the GFS (though this bias may be due to incorrectly tuned parameters in the model used at this resolution).Also note that midtroposphere (around 800-600 hPa) tropical biases, in both variables, are of opposite sign between the two full-input reanalyses, with magnitudes of about 0.58C and 0.5 m s 21 .In particular, 20CRv3 is cooler than JRA-55 by less than 0.58C between about 208S and 208N and from 800 to 600 hPa, but warmer than ERA5 in the same region, while ERA5 is cooler than JRA-55 by 0.58-1.08C(Fig. 10e); similar patterns hold in the same regions for zonal wind.In addition, note that the differences south of 608S and below 600 hPa are larger between ERA5 and JRA-55 than between 20CRv3 and ERA5.In both zonal wind and temperature, 20CRv3 has much larger differences in the upper atmosphere than in the middle to lower atmosphere, possibly demonstrating the limitations of a system that assimilates only surface observations.
Relative to similar comparisons (see Fig. A1 of Compo et al. 2011) between the earlier version 2 of 20CR, ERA-40 (Uppala et al. 2005), and NCEP-NCAR Reanalysis 1 (NNR1; Kalnay et al. 1996;Kistler et al. 2001), the 20CRv3 upperatmosphere differences in tropical and extratropical temperature appear to be larger, but the high-latitude differences in temperature are much smaller (Fig. 10).The structure of the zonal wind differences are different in Fig. 10 than shown by Compo et al. (2011), but the magnitudes of the differences are smaller between about 400 and 100 hPa.It is possible that the improved representation of the upper atmosphere in high latitudes will also correct systematic upper-level cold biases in vertical temperature stratification for the Arctic detected in previous versions of 20CR in the period 1934-40 (see Klaus et al. 2018, their Fig. 7).The possible causes of these differences between 20CRv3 and full-input reanalyses will be studied in future work.
To provide further context for the numbers in Fig. 10, Fig. 11 shows the global mean absolute biases (MAB) of long-term temperature and zonal wind between 20CRv3, JRA-55, ERA5, and the ensemble mean of a set of 50 nonassimilating atmospheric model simulations, denoted ''AMIP.''The latter were integrated using an older version of the GFS than 20CRv3 and are provided on a 18 3 18 grid with 64 vertical levels.They have prescribed SST and sea ice concentration fields (Hurrell et al. 2008), time-varying greenhouse gases based on observed data from 1979 to 2005 and RCP6 estimates thereafter (Meinshausen et al. 2011), and time-varying CMIP5 ozone (Cionni et al. 2011).Though the AMIP curves in Fig. 11 use a different version of the GFS and different SSTs than 20CRv3, they provide general magnitudes for biases one would expect from a model simulation relative to a full-input reanalysis.These curves demonstrate that the 20CRv3 biases relative to ERA5 and to JRA-55 are consistently smaller than the respective AMIP biases, with exceptions for surface temperature (where all the biases are between 18 and 1.58C), and above 150 hPa in both variables.In addition, the magnitudes of the 20CRv3 biases relative to ERA5 in temperature are similar to or smaller than the magnitudes of the absolute differences between JRA-55 and ERA5 below 700 hPa, and in zonal wind, are similar below 850 hPa.

b. Precipitation structure
Figures 12 and 13 illustrate the extent to which the precipitation fields in 20CRv3 could be used for long-term interannual to multidecadal variability and trend studies.Figure 12 shows the structure of the global precipitation field averaged for January 1979-2015 in 20CRv3 as well as differences between 20CRv3 and the satellite-station blend of the Global Precipitation Climatology Project version 2.3 (GPCP; Adler et al. 2003Adler et al. , 2018)), and the station blend of the Climatic Research Unit Time Series version 4.03 (CRU TS; Harris et al. 2020;Harris and Jones 2020), over land (Fig. 12d).For reference, the differences between ERA5 and GPCP (Fig. 12c) and CRU TS (Fig. 12e) are also shown.Over land, 20CRv3 differences are largest in highaltitude regions, precipitation in most of North America is overestimated, and precipitation in Australia is underestimated.Relative to GPCP, 20CRv3 also overestimates high-precipitation regions in the tropics and the North Atlantic, and the tropical convection zones differ (see the brown band across the tropical oceans in Fig. 12b).However, over Europe, the Sahel, and much of Russia, the precipitation fields in 20CRv3 lie within the range of magnitudes estimated via GPCP and CRU TS.Relative to CRU TS, 20CRv3 is wetter over all of Europe and Asia and drier over the Sahel; relative to GPCP, though, 20CRv3 is drier over Russia, wetter over the Sahel, and has patchy differences over Europe.
Though the results in Fig. 12 point to likely biases in the precipitation field of 20CRv3, the interannual variability is captured remarkably well in some regions.Figure 13 shows time series of monthly (January and July) precipitation averages over land for 20CRv3, 20CRv2c, GPCP, and CRU TS in two regions: the western United States (WUS; 308-508N and 1408-1008W) and Southern Australia (SAus; 508-268S and 1008-1608E), shown in black boxes in Fig. 12a.Additionally, two regional stationbased datasets are shown: PRISM for the western United States (PRISM Climate Group, Oregon State University, http:// prism.oregonstate.edu,created 11 May 2020); and the gridded dataset developed for the Australian Water Availability Project (AWAP; Jones et al. 2009; http://www.bom.gov.au/cgi-bin/climate/change/timeseries.cgi,accessed 11 May 2020).In January in both regions, the correlations are remarkably high.While the correlations are lower in July, especially over the western United States, the correlations between the observational products and 20CRv3 are higher than the respective correlations with 20CRv2c in all cases considered here.The lower correlations over the western United States in July and Southern Australia in January are consistent with (Compo et al. 2006), who predicted lower skill of the reanalysis in summer months.Additionally, Gehne et al. (2016) demonstrate that several global precipitation datasets (including reanalyses and satellite-based products) disagree in their estimates of precipitation over North America, despite the relatively dense network of precipitation observations in this region.In general, estimating large-scale convection in models and full-input reanalyses is difficult (Gehne et al. 2016;Stephens et al. 2010;Trenberth et al. 2003Trenberth et al. , 2011)), and gauge-based precipitation observations often suffer from systematic biases due to wind, evaporation, and snow undercatch, among others (Adam and Lettenmaier 2003;Peterson FIG. 5.The 500-hPa geopotential height RMSDs between ERA5 and 20CRv2c (thick blue) and 20CRv3 (thick red) ensemble mean fields for (a) the Northern Hemisphere (208-808N) and (b) the Southern Hemisphere (808-208S).Root-mean ensemble variance is shown in thin blue and red lines.The 2019 annual-average operational forecast errors for the given lead times are shown in gray swaths: the lower limit of each gray bar is the ECMWF forecast error, and the upper limit is the NCEP forecast error (https://www.emc.ncep.noaa.gov/gmb/STATS_vsdb/longterm/; accessed 29 Jan 2020).The 1981-2010 climatological variability in the 20CRv3 ensemble mean field is 79.24 m averaged over the Northern Hemisphere and 91.09 m averaged over the Southern Hemisphere.et al. 1998;Rasmussen et al. 2011;Sevruk et al. 2009).In particular, precipitation gauges tend to be in valleys, so CRU TS and PRISM may be underestimating in winter WUS due to snow undercatch.While this could explain the consistent difference between 20CRv3 and the observational products in January WUS, note that there is also a consistent difference in January SAus.Despite this, the agreement of interannual variability on a regional, monthly scale between 20CRv3 and three independent precipitation datasets is encouraging.

c. Long-term variability in mass
Figure 10 demonstrated the likely biases in air temperature; here we demonstrate how well 20CRv3 can represent multidecadal variability of a comparable mass variable, the mean temperature of the lower troposphere defined as the 500-1000-hPa atmospheric layer, relative to other full-and sparseinput reanalyses and to satellite measurement products.The mean layer temperature from each reanalysis is calculated using the hypsometric equation (Wallace and Hobbs 1977): where Z500 and Z1000 are the 500-and 1000-hPa geopotential heights, g is gravitational acceleration, and R d is the specific gas constant for dry air.We include two satellite products in this comparison: the Remote Sensing Systems (RSS) Temperature Lower Troposphere calculated via weighted measurements from Microwave Sounding Units (MSU) and Advance Microwave Sounding Units (AMSU) (hereafter denoted RSS; Mears and Wentz 2009); and version 6 of the University of Alabama Huntsville's Mean Layer Temperature record derived from  1) and averaged in time] for the given variable is less than 0.35 are stippled.The global average pattern correlation between the confidence field and the anomaly correlation field is given for each panel.
MSU and AMSU radiances (UAH; Christy et al. 2017).To calculate their temperature of the lower-troposphere (TLT) product, RSS calculates a weighted difference between singlechannel measurements from MSU2 and AMSU5, while UAH calculates TLT as a linear combination of their calculated temperatures in the midtroposphere, tropopause, and lower stratosphere (Christy et al. 2017).
To consistently compare RSS and UAH to each other as well as to the reanalyses, both RSS and UAH are regressed onto the ERA5 T anomalies for 1979-2018.While the most accurate method of comparing a satellite-based product to a reanalysis would be to use a weighted vertical profile of the reanalysis temperatures, the vertical structures of the weights used for RSS and UAH differ; it is therefore impossible to consistently compare all datasets simultaneously.For this reason, instead of projecting the reanalysis data onto the two different satellitebased observation spaces, we instead project the satellite data into the ''reanalysis space'' of T.
Figure 14 shows annual anomalies in T relative to 1981-2010 for the maximum available region over the globe, 708S-828N.By taking anomalies, the biases illustrated in Figs. 10 and 11 are effectively removed.Uncertainty swaths are included as twice the ensemble standard deviation for the ensemble products 20CRv2c, 20CRv3, CERA-20C, and RSS.Note that, for RSS, the central estimate is provided by version 4 of the data.The RSS uncertainty swath was calculated by applying the above regression to the anomaly of each member of the v3.3 ensemble, since there is currently no version 4 ensemble available.This ensemble is only available until 2015, so the RSS spread for 2016 onward is calculated as the average spread for 1998-2012.Finally, we find similar results using the levels 300-850 hPa (not shown), which are arguably closer to the ''lower troposphere'' as defined in the UAH and RSS weighting functions.
In the modern period, the decadal variability of atmospheric-layer temperature from 20CRv3 correlates well with that of other reanalyses and with satellite products.The FIG. 7. The 500-hPa geopotential height analyzed anomalies from 20CRv2c (blue) and 20CRv3 (red) vs observed anomalies from upper-air measurements at Lindenberg, Germany, for the (a) early twentieth century (1905-17; 1925-38), (b) mid-twentieth century , and (c) late twentieth century .Note that there are no available observations during the period 1918-24.Anomalies are calculated from 1973 to 2001 daily climatologies for the reanalyses and observations, respectively, with the annual cycle smoothed using a Fourier filter with three harmonics.RMSDs between observed and analyzed anomalies, Pearson correlations, and the number of observations for each panel are also given.FIG. 8. RMSDs between analyzed Z500 anomalies and observed Z500 anomalies at Lindenberg, Germany, from 20CRv2c (blue) and 20CRv3 (red) as a function of 20CR ensemble spread (where bin width is 1 m).The gray line shows expected RMSD for perfect observations; gray shading shows region of expected RMSD for observation errors between 15 and 25 m.Only points with 30 observations or more are plotted.high correlation is not solely a result of trends: detrended correlations (Fig. 15 and Table 3) between 20CRv3 and the other datasets remain higher than 0.8, with many higher than 0.9.The relatively low correlation between 20CRv3 and ERA5 over land suggests that ERA5 may not represent interannual fluctuations of this variable over land as well as 20CRv3, relative to the RSS satellite estimate.While intriguing, though, these low correlations are not statistically significantly different from the respective (higher) correlations over the oceans.Zonal averages show similar correlations, though correlations are generally lower in the Southern Hemisphere (Table 3).
Not only does 20CRv3 correlate highly with upper-air reanalyses and satellite data products in the mid-to late twentieth century and early twenty-first century, it also has reasonable correlations (above 0.8) with CERA-20C in the early twentieth century (Fig. 15), though CERA-20C has a stronger warming trend throughout the twentieth and early twenty-first centuries.The variability in 20CRv3 and in 20CRv2c in the last quarter of the nineteenth century behave similarly, though they differ significantly from 1851 to 1870.This difference in behavior will be explored further in future work, but the confidence intervals for 20CRv2c are less reliable than those for 20CRv3; note the artificial jumps in the 20CRv2c ensemble spread in 1871 and 1951 (Fig. 14) corresponding to suboptimal parameter changes (see Slivinski et al. 2019a).
The warming trend in 20CRv3 during 1835-65 may be an effect of the slow recovery of the Earth system from multiple volcanic eruptions between 1808 and 1835 (Brönnimann et al. 2019b), and the subsequent cooling from 1875 to 1905 is intriguing, though several merged SST/land temperature datasets commonly used to describe temperature variability exhibit a similar trend (Hegerl et al. 2018;Vose et al. 2012;Wen et al. 2011).These trends also exist in the 20CRv3 time series restricted both to land only and ocean only (not shown).Additional work will be needed to determine the sensitivity of these trends to the specified SODAsi.3SSTs, errors in the assimilated observations, and other factors.However, the overall high correlations between 20CRv3 and other products throughout the twentieth century suggest that the multidecadal variability seen in the full 210-yr span of 20CRv3 is representative of the true variability on these time scales.

Conclusions
The latest Twentieth Century Reanalysis version 3, has been evaluated on synoptic to climatic time scales via comparison with independent observations, satellite products, and full-and sparse-input reanalyses.Although the availability of comparable datasets decreases further back in time, the results shown here demonstrate that 20CRv3 can produce useful state estimates for its full time span and that its internal estimates of uncertainty are informative and reliable in the illustrated situations.In particular, 20CRv3 has smaller errors that are more consistent with its expected errors than 20CRv2c.
In the late twentieth and early twenty-first centuries, 20CRv3 represents several surface and upper-air fields and their variability well, relative to satellite products, reanalyses, and independent observations.Comparisons with a modern, full-input reanalysis suggest that the upper-air skill of 20CRv3 in the late twentieth and early twenty-first centuries is comparable to modern 3-4-day forecasts in both the Northern and Southern Hemispheres.
In the early to mid-twentieth century, 20CRv3 is evaluated via comparisons with independent surface and upper-air observations, and with the full-input reanalysis JRA-55 in the mid-twentieth century, which assimilates conventional surface, upper-air, and early satellite observations.The errors in 20CRv3 generally increase further back in time, but these errors are mainly predicted by the uncertainty in the 20CRv3 ensemble.
In the nineteenth century, 20CRv3 can reconstruct individual storms in the Northern Hemisphere and as in later time periods, internal background statistics demonstrate good agreement between the synoptic-scale errors in surface pressure and the prediction of those errors from the background ensemble spread.These results, along with the evaluations in the twentieth and early twenty-first centuries, suggest that 20CRv3 can provide reliable estimates of the illustrated atmospheric fields on synoptic scales, as well as their uncertainties, even in the nineteenth century.
Further, there is evidence that the interannual variability of mass and precipitation fields from 20CRv3 on climatic scales is also reliable.Comparisons with station-based precipitation datasets over regional and monthly scales indicate that 20CRv3 captures variability remarkably well for 1901-2015, while comparisons with a satellite-station blended product over 1979-2015 further support this assessment.Similarly, the largescale mass variable of 500-1000-hPa layer temperature shows high correlations globally in the modern time period with three full-input reanalyses and two satellite data products.It is interesting to note the large differences in this variable between 20CRv2c and 20CRv3 prior to about 1875 in light of the analyses above, though previous work suggests that the confidence estimates of 20CRv2c in this time period are less reliable than those of 20CRv3.These results suggest that the interannual variability of many fields from 20CRv3 and their uncertainty estimates on global and regional scales may be reliable throughout the twentieth and twenty-first centuries, even for difficult-to-estimate variables like precipitation, but caution is still needed to interpret results in the nineteenth century.Further comparisons with early observations are required to more carefully evaluate 20CRv3 in this time period, when there are as yet no other comparable instrument-based reanalyses.Additional evaluation of 20CRv3 in the nineteenth century would allow us to place modern trends of these variables in a consistent, long-term context.Future work, currently in progress, will allow further quantification of these results, particularly on smaller regional scales.
While the overall performance of this dataset has improved over previous versions of 20CR, there are several remaining issues for potential users of the new 20CRv3 to keep in mind.First, although variability of 20CRv3 fields appears to be well  2.
represented, there are still mean biases in temperature, wind, and precipitation, as well as in the location and orientation of the tropical convergence zones.Second, there are substantial biases in temperature and wind above 300 hPa, suggesting that a reanalysis that assimilates only surface pressure may not be adequate for upper-atmosphere studies.Finally, the Southern Hemisphere fields are generally less accurate than the Northern Hemisphere, and have less confidence.However, there is evidence that acquiring more observations, particularly by digitizing paper records in the eighteenth and nineteenth centuries, could have strong effects on the performance of future versions of 20CR.
In addition to further data rescue efforts, continued improvements to the 20CR system could provide much-needed information to support model development.For example, use of the 20CR system in sparsely populated regions, such as the Pacific Islands, could extend understanding of rainfall dynamics and past weather and climate patterns arising from convergence zones (Harvey et al. 2019;Lorrey et al. 2012).These regions are poorly represented in models but produce extreme conditions with strong impacts on society (Kruk et al. 2015;Lorrey and Fauchereau 2018) and would therefore benefit strongly from improved model representation.
Despite remaining issues, this evaluation of the first global, 3-hourly, instrument-based reanalysis spanning over 200 years has demonstrated that the 20CRv3 system can skillfully represent mass, circulation, and precipitation fields, on synoptic to climatic scales.Importantly, the ensemble-based approach also adds reliable estimates of uncertainty that can often predict this skill.Results demonstrate that 20CRv3 exhibits multidecadal variability in several fields, providing the opportunity to place recent trends in a longer historical context.The Justus Liebig University of Giessen, Germany is thanked for financial support to digitise, quality control, and analyse early instrumental meteorological data across the world.The following people are financially supported by the University

FIG. 4 .
FIG. 4. Actual (solid black) and expected (dashed) annual firstguess RMSE of surface pressure averaged for (a) the Northern Hemisphere (208-908N), (b) the tropics (208S-208N), and (c) the Southern Hemisphere (908-208S).The annual average number of observations assimilated within a 6-h window is shown in blue (right-hand axis).The Pearson correlation (r) between the actual error and log of the number of observations is also given.Note that the left-hand y axis differs in (c).

FIG. 6 .
FIG. 6. Maps of local anomaly correlation between JRA-55 and 20CRv3 for (top) SLP, (middle) Z500, and (bottom) Z300 over the years (left) 1958-78 and (right) 1979-2015.Heavy black contours represent correlation values of 0.975; note the nonlinear color scale.Regions where the 20CRv3 confidence field [as calculated by Eq. (1) and averaged in time] for the given variable is less than 0.35 are stippled.The global average pattern correlation between the confidence field and the anomaly correlation field is given for each panel.

FIG. 9 .
FIG. 9.As in Fig.8, but for (a) Z850, (b) Z500, and (c) Z300 anomalies from 20CRv3 and the labeled IGRA stations.Observation errors are assumed to be between 8 and 20 m for Z850, between 15 and 25 m for Z500, and between 20 and 50 m for Z300.Locations of the stations are shown in map inset.Exact latitude, longitude, and time period for each station plotted, as well as other statistics, are shown in Table2.
managed by Lawrence Berkeley National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract DE-AC02-05CH11231 and used resources of NOAA's Remotely Deployed High-Performance Computing Systems.Support for the Twentieth Century Reanalysis Project version 3 dataset is provided by the U.S. Department of Energy, Office of Science Biological and Environmental Research (BER), by the National Oceanic and Atmospheric Administration Climate Program Office, and by the NOAA Physical Sciences Laboratory.The efforts of the NERSC consultants are acknowledged.The technical support of the IT group of the NOAA Physical Sciences Laboratory is acknowledged.Don Hooper of NOAA PSL and CIRES is especially acknowledged for his extensive work to make 20CRv3 freely available from PSL. Comments from Steve Penny (CIRES and NOAA) on an earlier version improved this manuscript.John Christy (UAH) is thanked for access to the UAH TLT fields and further discussions about comparing reanalyses to the UAH data.Carl Mears (RSS) is thanked for access to the RSS TLT fields and discussions about their uncertainty.Collaborations with N. Rayner and H. Titchner of the Met Office in the development, production, and use of HadISST boundary conditions are gratefully acknowledged.Collaborations with B. Giese of Texas A&M University in the iterative devel-opment, production, and use of the SODAsi boundary conditions are gratefully acknowledged.The efforts of the National Center for Atmospheric Research (NCAR) Data Engineering and Curation section, especially C.-F. Shih are acknowledged.M. Benoy for the Citizen Science Unit of the Australian Meteorological Association, working with the Australian Bureau of Meteorology, is gratefully acknowledged for ongoing support.H. Mächel's (DWD) contribution of German climate observations is gratefully acknowledged.The authors would also like to thank the following individuals for their invaluable contributions of observations to the ISPD: and Past'' Contract CAOA2001.The research work of R. Przybylak and P. Wyszy nski was supported by the National Science Centre, Poland (Grants DEC-2012/07/B/ST10/04002 and 2015/19/B/ST10/02933).

FIG. 13 .
FIG. 13.Regionally averaged, land-only precipitation rate for (a) January in WUS (308-508N and 1408-1008W), (b) January in SAus (508-268S and 1008-1608E), (c) July in WUS, and (d) July in SAus.The 20CRv3 ensemble mean is shown in red, 20CRv2c ensemble mean in blue, CRU TS in black, GPCP in gold, and PRISM/AWAP in cyan (depending on region).The 2s confidence intervals (calculated from the ensemble standard deviation) are shaded for 20CRv2c and 20CRv3.Pearson correlations for comparable time periods included.

TABLE 1 .
Summary of reanalyses and observational datasets discussed in this study.

TABLE 2 .
Station details and statistics for IGRA stations plotted in Fig.9.RMSDs and correlations are calculated between the observed anomalies and the 20CRv3 anomalies.Local variability is calculated as the temporal standard deviation across all observations in the statistic.