## Abstract

This paper provides a multivariate regression method to estimate the sampling errors of the annual quasi-global (75°S–75°N) precipitation reconstructed by an empirical orthogonal function (EOF) expansion. The Global Precipitation Climatology Project (GPCP) precipitation data from 1979 to 2008 are used to calculate the EOFs. The Global Historical Climatology Network (GHCN) gridded data (1900–2011) are used to calculate the regression coefficients for reconstructions. The sampling errors of the reconstruction are analyzed in detail for different EOF modes. The reconstructed time series of the global-average annual precipitation shows a 0.024 mm day^{−1} (100 yr)^{−1} trend, which is very close to the trend derived from the mean of 25 models of phase 5 of the Coupled Model Intercomparison Project. Reconstruction examples of 1983 El Niño precipitation and 1917 La Niña precipitation demonstrate that the El Niño and La Niña precipitation patterns are well reflected in the first two EOFs. Although the validation in the GPCP period shows remarkable skill at predicting oceanic precipitation from land stations, the error pattern analysis through comparison between reconstruction and GHCN suggests the critical importance of improving oceanic measurement of precipitation.

## 1. Introduction

Finding reliable reconstructions for historical, observed global precipitation is of critical importance to both climate change assessment and climate model validation, in addition to numerous other applications. Activities involving various reconstructions of global precipitation for the twentieth century have intensified in the last few years (Smith et al. 2008b, 2012; Sapiano et al. 2008). In the past, relatively little effort was made to estimate reconstruction errors, but error estimation and uncertainty quantification have now gained wide attention from researchers. Smith et al. (2013) estimated the different components of errors: sampling error, bias error, and random error. This paper will focus on a detailed analysis of sampling errors.

Error estimates are important to climate change research in order to quantify uncertainties in the climate change assessment. Error estimates are not only a critical issue of data aggregation and statistics but also a central issue of climate modeling. A systematic discrepancy exists in the global-average annual-mean precipitation between reconstruction observations (about 2.66 mm day^{−1}) and models from phase 5 of the Coupled Model Intercomparison Project (CMIP5) (about 2.96 mm day^{−1}; Arkin et al. 2010; Ren et al. 2013; Bosilovich et al. 2008). Compared to the reconstructed observation (Fig. 1), the CMIP5 model mean systematically overestimates the global precipitation. Other model means include 2.85 mm day^{−1} from the 24 CMIP3 models, 3.36 mm day^{−1} from the Climate Variability and Predictability (CLIVAR) International Climate of the Twentieth Century Project (C20C), and 2.93 mm day^{−1} from the National Centers for Environmental Prediction’s (NCEP) Reanalysis-1, while the Global Precipitation Climatology Project (GPCP; Adler et al. 2003) and the Climate Prediction Center’s (CPC) Merged Analysis of Precipitation (CMAP; Xie and Arkin 1997) observations since 1979 agree with the reconstruction and are all about 2.65–2.67 mm day^{−1}. This systematic discrepancy involves a combination of uncertainties in the modeling of complex precipitation processes and the uncertainties in merging inhomogeneous and diverse measurements and estimates of precipitation into global precipitation datasets. Climate models have simulated changes in global-average annual-mean surface air temperature (SAT) for the past century (Solomon et al. 2007; Fyfe et al. 2010) and the increase of atmospheric water vapor associated with these SAT changes with reasonable success, the latter being confirmed in recent studies using modern satellite-based analyses (Santer et al. 2007). However, quantifying uncertainties of the observed global precipitation and improving its model simulation are still a grand scientific challenge. The current paper attempts to address some of the uncertainties in observed global precipitation datasets by focusing on the assessment of sampling errors in annual quasi-global (75°S–75°N) precipitation reconstructions. In particular, we will focus on the sampling errors in the reconstruction method of Smith et al. (1996, 2008b).

Our approach is to formulate the Smith method into a multivariate regression procedure so we can use the existing statistical inference results on errors such as confidence interval (CI), sum of squared errors (SSE), and root-mean-square errors (RMSE). This approach has other advantages too, such as allowing simplified comparisons against other regression models and checking the model’s fit for the given climate parameter. Our hope is that our simplified Smith reconstruction procedure will facilitate systematic studies of reconstructions and their error estimations for various climate parameters in addition to precipitation. The main objective of this paper is thus to develop the mathematical procedures for sampling error estimates. To facilitate that goal, we chose to focus on annual data, although the method can also be applied to monthly data. Our empirical orthogonal function (EOF) basis will be calculated from the 1979–2008 GPCP data with a 5° × 5° latitude–longitude resolution. Our historical data will be from the gridded Global Historical Climatology Network (GHCN) station-based precipitation anomaly dataset, also with a 5° × 5° latitude–longitude resolution.

Aside from our new reconstruction results, our uncertainty quantification work differs from the error estimation work of Smith et al. (2013) in four main aspects: (i) we build our theory based on the existing multivariate regression; (ii) we estimate the confidence interval of regression coefficients; (iii) we examine the regression error variance over the data domain; and (iv) we provide some expanded discussions of physical mechanisms, such as El Niño–Southern Oscillation (ENSO) patterns that contribute to the reconstruction skills.

## 2. Multivariate regression of the Smith reconstruction method

### a. Smith reconstruction, regression, and error bounds

Smith et al. (1996) introduced a reconstruction method for sea surface temperature (SST) using EOFs. They expanded the SST field into a finite series of EOFs and used the least squares method to determine the EOF coefficients or weights of the EOFs. Hence, it is a regression procedure. To determine which EOF modes should be included in the series expansion, Smith et al. (1998) introduced a screening procedure that determined whether a particular EOF was properly sampled by the observed data. This is basically a screening regression method. The Smith reconstruction method has been successfully used to reconstruct both SST (Smith and Reynolds 2005; Smith et al. 2008a) and precipitation (Smith et al. 2008b, 2012). Here, we recapitulate the Smith reconstruction theory and apply a standard regression analysis to describe the error bounds of the regression reconstruction.

The Smith reconstruction model is

where

is the reconstruction and

is the reconstruction error and is assumed to be normally distributed for each grid box *x*. Here, *x* denotes the location of the *N* grid boxes and hence *x* runs from 1 to *N*; is the relative-area factor and is equal to the cosine of the latitude of the centroid of the grid box *x*; and is the set of selected EOF modes and has an order *M*. The term denotes the *m*th EOF computed from the area-weighted covariance matrix of annual GPCP precipitation anomaly data and the Euclidean norm of is one because of the inclusion of the area factor, while is the physical EOF that is an eigenfunction of the integral operator from the kernel of precipitation covariance function without area factor; *t* denotes time; is the EOF weight or regression coefficient for the *m*th EOF and is to be estimated; is the unknown precipitation field to be reconstructed based on the estimation of ; the regression error variance,

is to be estimated; and the angle brackets denotes the operation of expected value.

The domain of the reconstruction field is denoted by , which is fixed. The observed monthly gridded GHCN anomaly data on 5° × 5° grid boxes are distributed across a subset of the domain. This subset is denoted by , which varies with time, since the sample stations and satellites change over time. The orders of , , and are denoted by

The observed data are denoted by

The sum of squared errors (SSE) over the spherical geometry is

The SSE formula can be rewritten into

Then, the standard least squares estimate formulas can be applied to the area-weighted data (Johnson and Wichern 2007; Wilks 2006). To use the standard multivariate regression framework, we introduce the following notations:

Thus, the vector form of the regression model [Eq. (1)] can be written as

The least squares linear regression formula that minimizes SSE [see chapter 7 of Johnson and Wichern (2007)] leads to an estimator for the regression coefficients,

where **b** is the unbiased estimator of and

is the link matrix of order that links the GHCN data and the EOF mode weights.

The expected value (i.e., the ensemble mean) of is ;

is the reconstructed precipitation vector and is an approximation to the area-weighted data over the entire domain . The rectangular matrix,

is called the “hat matrix” in statistics. The hat matrix is a link between the observed data over the data domain and the reconstruction over the entire domain .

The approximation error vector restricted to the data domain is denoted by

and is an vector (. According to the multivariate regression theory (Johnson and Wichern 2007), the regression error variance , after eliminating the area factor, can be estimated by SSE which is defined by Eq. (7):

where is the regular Euclidean norm by

Equation (21) implies that the error variance of regression is inversely proportional to the number of data boxes: More data imply less error variance, but more modes may mean larger error variance; that is, a larger *M* may imply a larger . This is because higher modes may include much noise. In practice, we usually choose *M* so that 90% of the variance is explained by the first *M* EOF modes. It seems difficult to find an optimal number of modes to be retained (Shen et al. 2001).

The CI at 95% confidence level for in Eq. (2), based on the *t* distribution of degrees of freedom, is

where

In the above, is the *t* value of a *t* distribution with degrees of freedom equal to and the tail probability equal to 0.95/2, and indicates the *m*th diagonal element of a square matrix (Johnson and Wichern 2007, p. 372).

The CI of regression coefficients implies that the reconstruction by Eq. (18) is the expected value of a family of reconstructions, determined by the CI of . Hence, a range of the reconstructed field can be determined and may be regarded as a spread of true precipitation fields.

### b. Sampling errors of the reconstruction

Similar to Eq. (22) of Shen et al. (2004), the area-weighted mean square error field of the reconstruction due to sampling gaps can be derived as

The first term on the right-hand side accounts for the truncated sampling error from sampling the first *M* EOF modes, while the second term accounts for the random data error in the reconstruction. According to Shen et al. (1994, 2004), the truncated error field is

For every selected mode in , the coefficient can be calculated by the following formula:

where is the Kroneker delta, which is equal to one if and equal to zero otherwise; is the variance of the observed error for data *D*(*x*, *t*); and is the *m*th column and the *x*th row of the matrix. For each given time *t*, this formula calculates the error map of the reconstruction.

If it is a perfect sample, then and . Further, . Thus, and , implying zero sampling error.

Equation (26) is a truncated sampling error variance combined with a random data error variance. The total error variance that reflects the overall uncertainty should consist of this truncated sampling error, random truncation error, and bias errors. Smith et al. (2013) discussed the structure of these different errors and their relative importance. The current paper focuses on the detailed mathematics of the truncated sampling errors.

## 3. Datasets and EOF basis

Two datasets are used here for reconstruction. One is the GPCP version 2.2 combined global precipitation anomalies dataset from 1979 to 2008 (Adler et al. 2003). The climatology is the 1979–2008 mean. GPCP version 2.2 is given at a 2.5° × 2.5° latitude–longitude grid resolution starting from (88.75°N, 1.25°E) and going first to the east and then to the south (GPCP data were downloaded from ftp://precip.gsfc.nasa.gov/pub/gpcp-v2.2/psg/).

Another dataset contains the gridded monthly GHCN version 2 data from station observations. The GHCN-Monthly (GHCN-M) precipitation dataset contains precipitation data from January 1900 to the present on 5° × 5° latitude–longitude grid boxes starting from (85°–90°N, 180°–175°W) and going first to the east and then to the south. The GHCN precipitation is in monthly total millimeters, which is converted into millimeters per day to be consistent with the GPCP precipitation units [GHCN data were downloaded from the U.S. National Climatic Data Center (NCDC): http://www.ncdc.noaa.gov/temp-and-precip/ghcn-gridded-products.php]. The history of the number of 5° × 5° grid boxes with data is shown in Fig. 2. The number of grid boxes with data started at around 380 in 1900, monotonically increased (except for a small dent around the First World War) to around 800 in the 1960s, maintained this level until the 1980s, and has finally decreased since the 1990s. The final decrease in the number of stations from the 1980s is due to a time lag in the global data reporting systems. Most of the GHCN data were compiled during the late 1980s and, in the case of the African datasets, in the early 1980s or late 1970s.

The GHCN anomalies are with respect to the 1961–90 climatology, as this period had the best station coverage. We recentered GHCN anomalies on the GPCP climatology period, 1979–2008, as in Smith et al. (2012). This was done for every GHCN grid box by calculating the temporal mean of the GHCN anomalies data from 1979 to 2008 and by subtracting the mean from the original GHCN anomalies.

For the regression theory of section 2, we use GPCP data to calculate the explainable variable matrix, also called the design matrix in statistics, and we used GHCN data and the design matrix to calculate the regression coefficients **b**. Finally, we used regression Eq. (18) to calculate the reconstruction.

Adler et al. (2012) show an increase in estimated bias error in the GPCP precipitation data at a higher latitude. We thus exclude the large-error polar areas of 75°–90°N and 75°–90°S and limit our reconstruction domain to 75°S–75°N, which already covers 97% of Earth’s total surface area. We call our reconstruction over 75°S–75°N a quasi-global reconstruction. We miss 3% of the polar areas and less than 3% of the truly global precipitation, because the polar precipitation is light compared with that of the tropical regions.

Our EOFs from the area-weighted covariance matrix are computed in the following way: The area-weighted GPCP data matrix from 1979 to 2008 on the 5° × 5° grid is expressed as

This matrix has *N* rows with grid box running from 1 to *N* [*N* = 2160 in the case of (75°S–75°N) and *N* = 2592 for (90°S–90°N)] and *Y* columns with time *t* running from 1 (corresponding to 1979) to *Y*, which is 30 (corresponding to 2008, in our case). The GPCP area-weighted precipitation covariance matrix is computed by the formula

This is an matrix and its diagonal decomposition. Here, [(mm day^{−1})^{2}] is the diagonal eigenvalue matrix and is the area-weighted orthonormal eigenvector matrix with . When is large, solving the eigenvalue problem for , one can encounter computational difficulties. Fortunately, these difficulties can be avoided by using a matrix transpose, which converts the spatial covariance matrix into a temporal covariance matrix . The basis data matrix is often a rectangular, tall matrix with *N* much larger than *Y*. Thus, has its rank less or equal to . In our particular case, and . We can use a relationship between the eigenvectors of and to reduce the eigenvalue problem of an matrix to a much smaller matrix. This approach has been used in the climate research community (e.g., Smith et al. 2012), but we recapitulate the mathematical derivation here to facilitate broad applications. In particular, this derivation will help those who use high-level program languages such as MatLab, R, or SAS to explain the meaning of their eigenvectors and eigenvalues. The eigenvalue problem of the temporal covariance matrix is

The eigenvalue problem of this matrix is

Multiplying both sides of this equation by yields

Therefore, one can solve the eigenvalue problem [Eq. (31)] of the smaller matrix . This matrix’s eigenvector being multiplied by the original basis data matrix is an eigenvector of the original and physically meaningful . The eigenvalues of and are the same. Finally, our desired normalized EOFs are thus

where is the Euclidean norm of vector . The higher-order eigenvalues for any . The corresponding higher-order eigenvectors are any *N*-dimensional vectors and usually have no physical meaning. In applications, one normally retains the first *M* eigenvalues up to the level of 90% or 95% of explained variance, which often corresponds to a cut-off order *M* of less than 20. In our particular calculation, we use the first 20 EOFs: that is, *M* = 20.

The spatial distribution of GPCP climatology displays where the heaviest and lightest precipitation occurs and has appeared in literature (e.g., http://www.ncl.ucar.edu/Applications/gpcp.shtml). EOFs are related not only to the climatology (i.e., the first statistical moment or mean), but more importantly to the variance and covariance (i.e., the second moment). Figure 3a shows spatial field of variance, of which the area-weighted spatial sum is the total variance of the quasi-global precipitation and is equal to the sum of all the eigenvalues [see Eqs. (38) and (39) of Shen et al. (1998)]. The figure shows that the predominant variances are over the tropical western and central Pacific. Figure 3b shows the eigenvalues and their corresponding total level of variances explained, relative to the above total variance. The figure shows a sharp decrease of the eigenvalues (i.e., the variances) from EOF1 (29% of the total variance) to EOF2 (10% of the total variance), and then to EOF3 (6% of the total variance). The first three eigenvalues already account for 45% of the total variance. The large percentage of explained variance by EOF1 (shown in Fig. 3c) implies its resemblance to the variance field shown in Fig. 3a. Our EOF1 (Fig. 3c) exhibits the canonical ENSO precipitation pattern over the tropical Pacific, as displayed in many existing research works (e.g., Xie and Arkin 1997; CPC 2013a).

The ENSO precipitation has equatorial extremes of opposite signs in the western Pacific–Maritime Continent (positive) and in the central and eastern Pacific (negative). El Niño events tend to have negative precipitation anomalies (i.e., drought) in the western tropical Pacific all the way into the Indian Ocean and positive anomalies over the central and eastern tropical Pacific and along the eastern coast of Central and South America. Figure 3c shows the EOF1 pattern, which resembles the precipitation pattern of a typical El Niño as shown in Fig. 1c of Dai and Wigley (2000). It also resembles the average spatial distribution of El Niño minus La Niña precipitation derived from GPCP data from 1979 to 2010 [see Fig. 2 of Curtis and Adler (2003)]. EOF1 also exhibits a dipole pattern of precipitation over the western and central tropical Pacific. EOF1 positives are over the western Pacific’s “warm pool” area, extending southeast along the eastern boundary of the Maritime Continent, reaching as far north as southern China and southeast to the eastern Australian ocean area all the way beyond the international date line. The EOF1 negatives are over the Niño-4 and Niño-3 regions (central and eastern tropical Pacific) and are strongest in the eastern part of the Niño-4 region, extending directly east all the way to the coast of Central America and also diverting in the southeastern direction, with weaker negative anomalies reaching Chile’s coastline. This Pacific zone of 40°S–20°N has the heaviest precipitation in the world during an El Niño. The EOF1’s west–east negatives and southeast positives form the two blades of a pair of scissors. The positive and negative precipitation anomalies along the two blades of the scissors alternate signs when the precipitation changes from El Niño to La Niña. The scissors open eastward; the two blades of the dry intertropical convergence zone (ITCZ) and South Pacific convergence zone (SPCZ) clip the wet and warm tongue of El Niño over the eastern tropical Pacific (CPC 2013a; Ropelewski and Halpert 1989, 1996) and break the SST tongue structure during transitions between El Niño and La Niña scenarios. This scissor pattern captures the difference between El Niño and La Niña annual anomalies and is dominated by the tropical Pacific region. The above suggests that the large amount of precipitation over the tropical Pacific must often exhibit alternative signs of anomalies in the east and west; the eastern tropical Pacific’s positive precipitation anomalies may imply the western tropical Pacific’s negative anomalies (i.e., El Niño) and vice versa for the La Niña precipitation. In summary, EOF1 corresponds to the ENSO precipitation pattern with El Niño or La Niña, depending on the EOF1’s negative or positive sign.

Figures 3d, 3f, and 3h show the first three principal components (PCs): PC1, PC2, and PC3. The large percentage (29%) of variance explained by EOF1 is reflected in PC1’s larger volatility, as compared to PC2 and PC3. Noticeably, PC1 has five positive extremes (1984, 1988, 1996, 1999, and 2008) and four negative extremes (1987, 1992, 1997, and 2002), PC2 has two large positive extremes in 1983 and 1998, and PC3 has only one extreme in 1997. This implies the great importance of EOF1 precipitation in the global precipitation field even in a non–El Niño or non–La Niña year.

Our EOF2 (Fig. 3e) over the Pacific also shows an ENSO-like pattern: the southeastern tropical Pacific’s positive anomalies and the northeastern tropical Pacific’s negative anomalies coexist with strong negative anomalies over the central and western tropical Pacific. Significant differences from EOF1 exist over the southern Pacific from 20° to 40°S. For example, EOF2 does not change signs from west to east, while EOF1 does. In addition, EOF2 changes signs from the western tropical Pacific to the Maritime Continent–Indonesia area.

Our EOF3 (Fig. 3g) has a more complex pattern, which may be associated with the multiyear modulation of different regional monsoon patterns, such as the Indian monsoon, western North Pacific monsoon, North American monsoon, and South American monsoon. However, from mode 3, the eigenvalue decreases slowly; hence, their neighborhood eigenvalues are close to each other. When two neighborhood eigenvalues are close to each other, their corresponding eigenfunctions can have large sampling errors, according to North’s rule of thumb (North et al. 1982; Shen et al. 2001). The higher-order EOFs thus may not have clear physical meaning, as a result of large uncertainty in the EOF patterns, caused by colinearity in the eigenspace, which is approximately multidimensional. Thus, a linear combination of several eigenvectors is still an eigenvector. This leads to nonuniqueness and large errors in the higher-order EOF patterns, which often do not have a physical correspondence or interpretation. Nonetheless, they are still useful in data dimension reduction and in data representation and reconstruction. In our case, we choose to use EOFs up to order 20: that is, *M* = 20.

## 4. Results

According to Eq. (18), the reconstructed precipitation

where is the *x*th component of the regressed result vector of Eq. (18). We will first show our annual precipitation reconstructions and their errors for two special years: Fig. 4 for 1983 (a strong El Niño in the 1982/83 boreal winter) and Fig. 5 for 1917 (a strong La Niña year).

It is not surprising that the 1983 reconstructed precipitation (Fig. 4a) has a very high level of resemblance to EOF2, particularly over the eastern tropical Pacific (see Fig. 3e), and some resemblance to EOF1, particularly over the South China Sea and the extended western tropical Pacific (Fig. 3c). This may be because the two modes are highly associated with El Niño precipitation patterns. With EOF1, the positive and negative phases are opposite from those of positive and negative anomalies over the tropical Pacific region. Over the tropical Atlantic, the same resemblance also exists, which may be due to the high tropical precipitation variance but not the El Niño variation itself. These high resemblances are quantitatively supported by the large regression coefficients for EOF1 and EOF2 in the reconstruction.

Table 1 shows the regression coefficients for the first eight EOF modes in the 1983 and 1917 reconstructions. The table also shows the spatial correlations between the reconstructed precipitation and each EOF. For 1983, the reconstruction skill comes almost exclusively from the first two EOFs. The EOF2 coefficient is 10.21, while that of EOF1 is −4.77. The absolute values of other EOF coefficients are less than 2. The spatial pattern correlation between the reconstructed precipitation and EOF2 is as high as 0.79, while the correlation for EOF1 is −0.36. Although the 1982/83 El Niño was very strong, the event faded away in the late summer of 1983 and switched to a La Niña phase, according to the extended reconstructed SST, version 3b (ERSST.v3b) SST anomalies in the Niño 3.4 region (5°N–5°S, 120°–170°W; Smith et al. 2008a; CPC 2013b). Thus, we may regard the 1983 precipitation as having mixed signals of El Niño and La Niña. One may avoid this signal-mixture problem by aggregating the annual data from July to June, which will contain the entire event of a typical El Niño from October to April in a single year. Further discussion on this will be given in section 5.

Figure 4b shows the spatial distribution of the truncated RMSE (mm day^{−1}) of the annual precipitation reconstruction for 1983. The expression is calculated according to Eqs. (26) and (27). The RMSE values are large over the western tropic Pacific and eastern Indian Ocean, with values reaching about 1.5 mm day^{−1}. The actual error should be larger than the sampling error shown in Fig. 4b, because random error, which may be largely contained in in Eq. (25), and bias error should be incorporated together to assess the total error (Smith et al. 2013; Shen et al. 2004). However, since the sampling gap is a major cause of the total error despite some large random errors in the historical observations, the total error should not be larger than twice the sampling error estimated here, as discussed in Smith et al. (2013) and Shen et al. (2004).

The eastward scissor-blade pattern of the RMSE seems to follow that of EOF1, enhanced by EOF2. The large values over the eastern Indian Ocean seem to follow the EOF3 pattern.

The secondary-level large values of RMSE are over the tropical Atlantic and some land regions, including eastern Brazil, Central America, and the northwestern coastal area of North America. These large errors might be due to the large precipitation anomalies of this particular year, with the following positive and negative anomalies: negative anomalies in the tropical Atlantic and positive anomalies in eastern Brazil, Central America, and the northwestern coastal area of North America. The large RMSE values over Madagascar are expected, since there is a large amount of precipitation over the region and the precipitation variance is large too, although Madagascar’s 1983 precipitation has a small anomaly.

The large RMSE over the tropical Pacific and Indian Oceans suggest the importance of rain gauge measurement over these regions by island stations (see Fig. 6 for the island GHCN data locations) or by remote sensing (Simpson et al. 1988).

Figure 4c shows the relative error of reconstruction: the ratio of RMSE to GPCP climatology. Because of the heavy precipitation in the western tropical Pacific (reaching 9 mm day^{−1}), the relative error in this region is actually fairly small. The largest relative errors are over a 15° zonal band of the tropical Pacific east of Tahiti all the way to the coast of Peru, where the normal precipitation is low (see GPCP climatology) while the RMSE is high (see Fig. 4b). Another area of high relative error is over northern Africa, the Sahara desert region, where the precipitation climatology also has very low value, under 0.5 mm day^{−1}.

For 1917, the first EOF mode plays a dominant role, with a spatial correlation of 0.90 and a regression coefficient of 12.98, 4 times larger than that of the next important mode: EOF2. Nonetheless, EOF2, EOF3, EOF4, EOF6, and EOF7 all make nonnegligible contributions to the reconstruction. These regression coefficients have absolute values greater than 1.77 and less than 3.2. The spatial correlation coefficients have absolute values between 0.12 and 0.22. During the period of 1900–2011, year 1917 had the strongest La Niña, which lasted for 21 months, from 1916 to 1918 (Giese and Ray 2011). This may imply that the 1917 precipitation had a purer ENSO signal than the 1983 precipitation and explains the dominance of EOF1 in the 1917 reconstruction. Comparing the 1917 reconstructed precipitation (Fig. 5a) and EOF1 (Fig. 3c), we can see the eastern tropical Pacific’s positive phase fork and the northeastern negative–southwestern positive dipole structure over the western Pacific’s warm pool region. The tropical Atlantic’s positive phase of EOF1 is also reflected in the reconstruction. The 1917 La Niña precipitation shows a better resemblance to EOF1 than does the 1983 El Niño precipitation. This, in turn, supports that EOF1 carries a strong ENSO signal.

The 1917 long-lasting and strong La Niña resulted in a huge precipitation surplus all the way from India to the eastern coast of Australia, along the northwest–southeast-oriented SPCZ (CPC 2013a; Ropelewski and Halpert 1989). The year also had a huge precipitation deficit in a large area immediately south of the east–west-oriented ITCZ and along the northwest–southeast-oriented zone almost parallel to the SPCZ. The SPCZ and the ITCZ form two blades of a pair of scissors. The EOF1’s scissor structure is thus largely the La Niña precipitation pattern. Despite the huge precipitation surplus over the SPCZ and the warm pool area, the La Niña global-average precipitation in general has a deficit because of the huge negative anomaly from the central tropical Pacific to the eastern tropical Pacific. Figure 1 shows a modest negative annual precipitation anomaly for 1917 (−0.0152 mm day^{−1}), although the largest negative anomaly was in 1918 (−0.0635 mm day^{−1}).

The spatial distribution of RMSE for the 1917 reconstruction is displayed in Fig. 5b. A striking feature is that the RMSE pattern for 1917 is highly similar to the one for 1983. However, they have large differences in the ocean area east of northern Australia, the SPCZ region, and the southeastern United States and its eastern oceanic region on the Atlantic. This high similarity between the two RMSE maps may suggest the following. First, the two years are El Niño and La Niña years, respectively, and hence are associated with the same heavily weighted EOF1 and EOF2 and thus have similar sampling-error patterns. Second, the large-error regions are the high-volatility areas of precipitation, and more gauges should be deployed to sample these key centers of action.

Figure 5c shows the relative error of reconstruction for 1917, which resembles that of the 1983 reconstruction. Differences do exist, such as the noticeable errors over the northern Indian Ocean and the Middle East, where modern observations in the GHCN dataset helped reduce the reconstruction errors.

The above two examples for 1983 and 1917 show reconstruction skills from ENSO modes. However, the global precipitation has other factors too, particularly in high-latitude areas, such as the heavy precipitation over the oceanic regions east of Japan and the Amazon region (Ropelewski and Halpert 1989, 1996). Higher-order EOFs play roles for the accurate reconstruction of precipitation over these regions while the ENSO modes may play minor roles or no role.

A regression coefficient divided by the corresponding correlation coefficient is approximated by the spatial standard deviation of the precipitation field through the following formula:

where is determined by Eq. (18). The spatial standard deviation where is the reconstructed field in Eq. (34). It is our intent to estimate the spatial standard deviation of this quantity, which is defined as the mean value, from the mean value theorem in calculus,

Table 1 and Eq. (35) imply 13.0 mm day^{−1} for 1983 for every EOF mode. For example, Similarly, Table 1 and Eq. (35) also imply 14.5 mm day^{−1} for 1917. The mean spatial standard deviation is thus around mm day^{−1}, which is about 13% of the global-average annual-mean precipitation (2.67 mm day^{−1}). This standard deviation seems to be a reasonable value for the globally reconstructed precipitation. It is remarkable that the spatial standard deviation—that is, the value—stays almost constant with respect to the EOF mode number.

Although EOF1 and EOF2 play very important roles in the 1983 and 1917 reconstructions, the contributions from higher EOF modes are different for the two years. For 1983, only EOF6 carries a noticeable signal and leads to a correlation of −0.14. For 1917, several higher EOF modes contribute to the reconstruction skill with correlations of 0.20 for EOF3, 0.14 for both EOF6 and EOF7, and −0.12 for EOF4. It is unclear whether these higher-order EOF contributions have as clear physical meanings as EOF1 and EOF2 do.

Although the reconstructed precipitation and their errors in Figs. 4 and 5 are for the entire latitude bend (75°S–75°N), the information is overwhelmed by the tropical signals, which are much larger there than in other parts of the world, particularly over land. To evaluate the uncertainty of the land precipitation reconstruction, we display the relative differences between the reconstructed precipitation and the GHCN observed data (Fig. 6). What is common about the two figures is that large errors exist over the tropics where the precipitation and its variances are great. However, there is a clear distinction between the two figures: each of which has many unique features. For 1983 (Fig. 6a), the reconstruction over most of the land areas has small departures from the GHCN data. It is not surprising that the differences of the reconstruction minus the GHCN are small over the United States and western Europe, where stations were dense and well distributed.

However, it is somehow unexpected that the differences are also small, within 0.5 mm day^{−1} (the ratio of RMSE to climatology is less than 25%, displayed in Fig. 5c), over most parts of China, Siberia, and Africa. Of course, the precipitation over these regions is also light, limited to less than 2 mm day^{−1}. Still, the consistent skill between the reconstruction and the GHCN data to measure precipitation demonstrated through these two figures is remarkable.

For 1917 (Fig. 6b), the differences are much greater over most land areas, as one would expect, as a result of less GHCN data coverage. Over the United States, the reconstruction in general has an overestimate, with an amount around 0.5–1.0 mm day^{−1} over the entire Great Plains area and East and West Coasts. Noticeably large differences existed over the southeastern United States and northeastern Mexico. Relatively speaking, the 1917 reconstruction has better fidelity over Europe and even in Africa. For Australia, the reconstruction has an overestimate in the north and east and an underestimate in the west and south, which may mean that the La Niña signal is well preserved in the reconstruction. South America has a similar situation: the reconstruction has an underestimate in parts of the area (e.g., Peru and the Amazon region) and overestimate in southeastern Brazil.

A few blue boxes of very large differences may be due to the small-scale microclimate that is highly inhomogeneous in each grid box. The inhomogeneous climate field cannot be resolved by our EOFs and can result in large sampling errors if there are not enough well-distributed rain gauges in a grid box. These anomalously large differences in 1983 (Fig. 6a) and 1917 are at different places.

The relative differences (see Figs. 6c,d) show large values in those coastal areas where normal precipitation is low and has large spatial gradients, such as California and Peru. Our current EOF reconstruction has the weakness of not being able to recover the variability of short spatial scales. Largely, the relative differences for 1983 are smaller than those of 1917 as a result of more observations.

Another feature of the two figures is the discontinuity in the differences field. In particular, some neighboring GHCN boxes over the tropics have different signs and large magnitudes. Although topographic differences and local microclimates can have significant influences on the inhomogeneity of the precipitation anomalies, the existence of large observational GHCN data errors, including instrument error, bias due to calibration issues, and sampling errors, cannot be ruled out. The variations that cannot be resolved by the limited number of EOFs and the nonstationary variations that fall out of the GPCP data period can also enhance the large-scale spatial gradient and discontinuity. Some of the strong discontinuities of the difference field, necessarily resulting from the discontinuity of the GHCN data, may be due to either the insufficient sampling of a highly inhomogeneous precipitation field or the errors of aggregating the station data into the GHCN gridbox data (Shen et al. 2007, 2012). Further investigation of the GHCN data discontinuity should be made in the data homogenization process.

Another validation is through the comparison between the reconstructed and GPCP annual precipitation anomalies. The differences of the reconstructed minus the GPCP anomalies for 1983 and 2009 are shown in Figs. 7a and 7b. Since GPCP data are a fusion product from both satellite remote sensing and land stations, they are, in general, spatially continuous, except for a few coastal grid boxes. Thus, these differences are much smoother than the differences shown in Fig. 6. Because of the El Niño dominance in the 1983 precipitation and the reconstruction’s excellent skill from El Niño, the 1983 reconstruction has smaller errors than the 2009 reconstruction. For 1983 (Fig. 7a), the reconstruction overestimates in the northern vicinity of the Niño-4 region, extended to 20°N, and underestimates in the regions of Niño-3.4 and Niño-3 and its eastern regions, extended to the Central American coast. This may be due to the reconstruction’s strong preference of the ENSO patterns in the first two EOFs. The underestimates in the Indian Ocean and the Indonesian islands are perhaps due to the negative phase of the monsoon system.

The year 2009 started with phasing down La Niña from January to March and entered into a weak El Niño phase from October on. The 2009 precipitation pattern was not pure signals of ENSO or monsoon but was associated with more complex climate dynamics. The difference pattern (Fig. 7b) is more complex than that of 1983 (Fig. 7a). The significant underestimate over the western Pacific warm pool area and the significant overestimate in its immediate western neighbor, the Indonesian islands, might be due to the inability of the reconstruction to resolve smaller spatial scales and larger spatial gradients.

We further look into the quasi-global average annual precipitation estimated by reconstruction and GPCP from 1979 to 2011 (Fig. 8). They show remarkable agreement, in terms of variance, extremes, and trends. Their correlation is 0.74.

Although this excellent agreement is expected, since the EOF patterns are computed from the GPCP data, the high-level agreement shown in Fig. 8 still implies the excellent skill of using the GHCN land precipitation data to predict the oceanic precipitation. The agreement also implies the high quality of both GPCP and GHCN data. This comparison result is hence a validation of our reconstruction model for the historical global precipitation and shows the appropriateness of our approach.

Finally, we look at the time series of the quasi-global-average annual precipitation from 1900 to 2011 in Fig. 1. The positive trend of the global precipitation is 0.024 mm day^{−1} (100 yr)^{−1}. This is comparable to the 1900–2008 observed data trend assessed by Ren et al. (2013), which is 0.04 mm day^{−1} (100 yr)^{−1}over the ocean and 0.03 mm day^{−1} (100 yr)^{−1} over the land, and our current estimate of a weaker trend of 0.024 (mm day^{−1})(100 yr)^{−1} in 1900–2011 might be due to the damping caused by the lack of ocean data and is consequently closer to the mean trend of 0.02 mm day^{−1} (100 yr)^{−1} in 1900–2008 from the 25 CMIP5 models (Whitaker and Hamill 2006; Ren et al. 2013). According to Ren et al. (2013), the trend varies along the latitude bands. The variation seems to suggest that the wet tropical areas will get wetter with a large positive trend, and the dry subtropical areas will get drier [see Fig. 3 of Ren et al. (2013)].

Another feature of Fig. 1 is the quasi-periodic variation over about 30 years, shown in the 10-yr moving-average curve. Whether this is related to the Pacific decadal oscillation (PDO) or climate shift (CS) is a question to explore.

A regression coefficient has uncertainties, related to degrees of freedom, data variance, and sampling errors. This uncertainty is quantified by a range of determined by Eq. (23), which corresponds to a range of reconstructions. Figure 1 is the quasi-global average reconstruction results from . Figure 9 shows the reconstruction results from

for the upper bound and from

for the lower bound, as determined by Eqs. (23) and (24). This figure’s shaded area is not the confidence interval of the reconstruction and hence is not symmetrical about the solid curve; rather, it shows the probabilistic spread of the reconstruction and may be regarded as a measure of the probability of extreme events. Therefore, this spread is much larger than the actual 95% confidence interval (not shown in this paper). The relatively small change of the spread throughout the entire reconstruction period is in agreement with the confidence intervals shown in Fig. 9 of Smith et al. (2013).

Then what are the spatial patterns of the upper-bound and lower-bound variability in the reconstructed precipitation? The case of 1983 is shown in Fig. 10. Compared to Fig. 4a, which shows the expected values of the reconstruction, Figs. 10a and 10b show more spatial variability and smaller spatial scales.

The upper bound (Fig. 10a) still shows traits of the El Niño precipitation pattern in the tropical Pacific, but with smaller scales. The strong eastern North Pacific pattern and eastern South Pacific patterns in the upper bound may be extreme climate possibilities. The lower bound (Fig. 10b) shows little of an El Niño precipitation pattern. Instead, it shows variations across a wide range of the Pacific, extending from 40°S to 40°N. The Atlantic variability reaches from 60°S to 60°N. The Indian Ocean also has a positive and negative pattern. These smaller-scale patterns might be due to the smaller weight of the lower-order EOFs and relatively larger weights of the smaller-scale, higher-order EOFs.

## 5. Conclusions and discussion

We have used a multivariate regression approach to estimate the sampling errors of annual quasi-global precipitation reconstructed by an empirical orthogonal function (EOF) expansion. The mathematical details of the error analysis approach are a distinct feature of this paper. The Global Precipitation Climatology Project (GPCP) precipitation data from 1979 to 2008 are used to calculate the EOFs. The Global Historical Climatology Network (GHCN) gridded data (1900–2011) are used to calculate regression coefficients for reconstructions. The sampling errors of the reconstructions are carefully analyzed using multivariate regression theory. Our reconstructed time series of quasi-global-average annual precipitation shows a 0.024 mm day^{−1} (100 yr)^{−1} trend, which is very close to the trend derived from the mean of 25 models of phase 5 of the Coupled Model Intercomparison Project. Our reconstruction examples of 1983 El Niño and 1917 La Niña precipitation patterns demonstrate that the El Niño and La Niña precipitation is well reflected in the first two EOFs. Although our validation in the GPCP period shows remarkable skill to predict oceanic precipitation from land stations, our error pattern analysis through comparison of the reconstruction and GHCN suggests the critical importance of improving tropical oceanic measurement of precipitation. This is because the tropical oceanic precipitation has both large mean and large variance.

The validation of our reconstruction results with GPCP makes it possible to use the reconstruction as the benchmark data for climate models. This will help the climate modeling community to improve model precipitation mechanisms and reduce the systematic difference between observed global precipitation, which hovers at around 2.7 mm day^{−1} for reconstructions and GPCP, and model precipitation, which has a range of 2.6–3.3 mm day^{−1} for CMIP5.

A noticeable feature of our reconstructed annual precipitation from 1900 to 2011 is that the maximum positive and negative anomalies of the reconstruction appear in 1998 (0.0528 mm day^{−1}) and 1918 (−0.0635 mm day^{−1}; see Figs. 1 and 8). The GPCP maximum annual precipitation anomaly also had its peak in 1998, which was 0.0394 mm day^{−1}. The GPCP maximum negative anomaly was −0.0373 mm day^{−1} in 1991. The year 1998 was a transition from a strong El Niño in the boreal winter of 1997/98 to a moderate La Niña in the later half of 1998, while the year 1918 was a transition from a super La Niña in 1917 to a moderate El Niño. The year 1991 was also a transition to the 1991/92 boreal winter strong El Niño. These transitions might be related to extremes in the annual precipitation anomalies during the reconstruction period of 1900–2011, yet it is unknown whether this is conclusive or which mechanisms might cause extreme global precipitation anomalies.

When using EOFs, there is always a question of how many EOF modes to retain. This is an issue of model fit (Shen et al. 2001). Akaike’s information criterion (AIC) might be useful in addressing this question, since the AIC method can help select an optimal number of empirical orthogonal functions (Michaelis 2013). This question will be deferred to future investigations.

Another question is whether to use a different starting month for aggregating the annual data. Our presented results are the annual mean of January–December in millimeters per day. We have also calculated the annual mean starting from any of the other 11 months. The July–June mean is particularly interesting, as it contains a typical ENSO event in a single year, rather than distributing it across two calendar years with weaker magnitude. The global precipitation time series is affected little with different aggregation methods. The EOF patterns and their explained variances are noticeably different. Figure 11 shows the first three EOFs and their corresponding temporal principal components. Comparison of Figs. 11 and 3 indicates that two spatial patterns of EOF1 are similar but their explained variances are very different: 38% for June–July EOF1 with a cleaner ENSO signal, compared to 29% for January–December EOF1. The two EOF2s are very similar in terms of both spatial pattern and the explained variance. The spatial patterns of the two EOF3s are very different, with June–July’s large variabilities occurring in the Indian, Atlantic, and eastern Pacific Oceans while January–December’s large variabilities are still over the Pacific and eastern Indian Oceans and are still a part of the ENSO variation. The principal components are very different for PC1, resemble each other for PC2, and are also different for PC3. The July–June PC1 of Fig. 11b is well synchronized with the oceanic Niño index (ONI) of the U.S. Climate Prediction Center (CPC 2013b): the low extremes of PC1 correspond to El Niño (warm) episodes (1982/83, 1991/92, 1997/98, 2002/03, 2004/05, and 2006/07), and the high extremes of PC1 correspond to La Niña (cold) episodes (1988/89, 1995/96, 1998/99–2000/01, and 2007/08). The two peaks of PC2 in both Figs. 3 and 11 correspond to the two strongest El Niño events (1982/83 and 1997/98) during the period of 1979–2008. The increasing trend of PC3 in Fig. 11 might be related to the Indian Ocean’s warming since the 1970s and the associated precipitation changes (Li et al. 2012). Our discussion on the issue of annual data aggregation is limited in this paper; however, further investigation would be worthwhile to identify the spatial and temporal patterns with limited spatial and temporal domains for exploring signals of extreme climate, such as the United States’ 1930s Dust Bowl.

In the reconstruction equations [Eqs. (1)–(3)], we assumed that our reconstruction error is normally distributed: that is, the differences between the true anomalies and the reconstructed are normally distributed. Most GHCN residual data can pass the q–q-plot test for this error normality assumption. However, this does not imply that the annual precipitation anomalies are necessarily Gaussian. Sardeshmukh et al. (2000) demonstrated the non-Gaussian monthly precipitation anomalies in January, February, and March using the Kolmogorov–Smirnov test. The distributions of the precipitation anomalies and their reconstruction errors are worth further studies in the future.

## Acknowledgments

This study was supported in part by the U.S. National Oceanographic and Atmospheric Administration (Award EL133E09SE4048), the U.S. National Science Foundation (Awards AGS-1015926 and AGS-1015957), the U.S. Department of Energy (Award DE-SC002763), and a contract from the NASA Jet Propulsion Laboratory. The revision suggestions from the anonymous referees have significantly helped improve the clarity of the paper.

## REFERENCES

*Applied Multivariate Statistical Analysis.*6th ed. Pearson Prentice Hall, 773 pp.

*DOF and Information Criteria Calculations for the United States Temperature and Precipitation Data*. M.S. thesis, Dept. of Mathematics and Statistics, San Diego State University, 64 pp.

*Climate Change 2007: The Physical Science Basis.*Cambridge University Press, 996 pp.

*Statistical Methods in the Atmospheric Sciences.*2nd ed. Elsevier, 627 pp.