## Abstract

This study formulates the design of optimal observing networks for past surface climate conditions as the solution to a data assimilation problem, given a realistic proxy system model (PSM), paleoclimate observational uncertainties, and a network of current and proposed observing sites. The method is illustrated with the design of optimal networks of coral *δ*^{18}O records, chosen among candidate sites, and used to jointly infer sea surface temperature (SST) and sea surface salinity (SSS) fields from the Community Climate System Model, version 4, last millennium simulation over the 1850–2005 period. It is shown that an existing paleo-observing network accounts for approximately 20% of the SST variance, and that adding 25 to 100 optimal pseudocoral sites would boost this fraction to 35%–52%. Characterizing the SST variance alone, or jointly with the SSS, leads to similar optimal networks, which justifies using coral *δ*^{18}O records for SST reconstructions. In contrast, the network design for reconstructing SSS alone is fundamentally different, emphasizing the hydroclimatic centers of action of El Niño–Southern Oscillation. In all cases, network design depends strongly on the amplitude of the observational error, so replicates may be more beneficial than the exploration of new sites; these replicates tend to be chosen where proxies are already informative of the large-scale climate field(s). Finally, extensions to other types of paleoclimatic observations are discussed, and a path to operationalization is outlined.

## 1. Introduction

A general goal of paleoclimatology is to use paleoclimate observations, which are sparser and noisier than direct observations, to characterize preinstrumental climate variability. Paleoclimate observations involve, for example, using *δ*^{18}O of coral aragonite to measure a combination of temperature during calcification and seawater *δ*^{18}O value, while direct observations would be made, for instance, by using a thermometer to measure bucket SST. Ideally, the design of sampling strategies includes both local and global considerations in some objectively determined combination. Local considerations include the presence and availability of instrumental records and material archives that may be sampled to produce paleoclimate observations, the sensitivity of the proxy system to climate conditions at the site, accessibility, safety, and costs. For scientific goals that include the description of nonlocal phenomena, additional considerations include the extent to which site observations reflect large-scale patterns, the variables in which those patterns are locally expressed, and the balance between expanding the observing network spatially and/or temporally and replicating existing data to reduce paleoclimate observational uncertainties. Although the former category of network design criteria is highly site specific, the latter is approachable by means of hypothetical studies. Given assumptions about local optimization criteria, how would one prioritize sampling expeditions given limited resources? Would it be more efficient to generate replicates of a proxy series at an existing site or go sample a new one? How much would each paleoclimate observation contribute to reducing uncertainty in our knowledge of past climates? Similar questions were explored by Bradley (1996), who considered an existing proxy network and examined how well it predicted global mean annual temperature series. Evans et al. 
(1998b) explored the more general problem of optimal proxy sampling for minimizing theoretical error in reconstructed fields, applied to the specific case of the SST field. To that end, the authors used a reduced-space optimal interpolation (OI)-based reconstruction framework (Gandin 1965; Kaplan et al. 1997) and assumed that hypothetical SST observations made with random error might be available at a predetermined, quasi-realistic, coral-atoll-like sampling network. For small paleoclimate observational errors, they found that relatively small networks were able to resolve the most important large-scale patterns of SST variance (e.g., Evans et al. 1998b, their Figs. 4–6) and reduce the analysis error by up to 30% regionally. In the presence of relatively large paleoclimate observational errors, resampling (replication) was often preferred over sampling additional locations. Their iterative, “next best” approach did not, however, guarantee that the resulting proxy network would be optimal; nor did they consider that the most common paleoclimatic observation made in coral aragonite is oxygen isotopic composition (*δ*^{18}O), which depends not only on temperature but also on seawater *δ*^{18}O at the time of calcification (Evans et al. 1998b).

Recently, Mauger et al. (2013) solved the related problem of designing optimal climatological networks for a specified field using a related data assimilation framework (Ancell and Hakim 2007; Hakim and Torn 2008). They applied their method to monitoring two separate quantities: U.S. Pacific Northwest regionally and annually averaged precipitation and temperature. By minimizing the error about the mean temperature field, they found that a relatively small number of stations (5–10) was sufficient to characterize the mean-field variance over the domain. This approach, generalized to vector optimization (N. Hryniw and G. J. Hakim 2015, unpublished manuscript) and applied to paleoclimate observations mapped from multivariate fields, provides a solid foundation for the present work.

It turns out that such problems fall under the general umbrella of optimal sensor placement (OSP), which is the subject of an extensive literature. Roughly speaking, OSP problems can be divided into two main categories:

1) monitoring for event detection, for instance in large-scale urban infrastructures (see Berry et al. 2005; Krause et al. 2008a; Hart and Murray 2010; Comboul and Ghanem 2013); and

2) using sparse sensor networks to estimate a physical process in space and time (see kriging and its generalization to Gaussian process regression; e.g., Zimmerman 2006; Krause et al. 2008b; Castello et al. 2010).

In this study we formalize the design of optimal strategies for climate proxy sampling by borrowing tools from this literature. We generalize previous approaches so that the paleo-observational network design rests solely on the choice of an information utility measure and the ability of each proxy to capture environmental signals. Three ingredients make this approach feasible: 1) an efficient formulation of the sensitivity problem (Ancell and Hakim 2007; Hakim and Torn 2008; Mauger et al. 2013); 2) models relating climate fields to paleoclimate observations, that is, proxy system models (PSMs; Evans et al. 2013); and 3) an efficient search algorithm (e.g., a greedy algorithm; Nemhauser et al. 1978; Krause et al. 2008b; Das and Kempe 2008, 2011) if a direct search is not feasible. Given a set of possible sampling locations dictated by a set of presumed site-local biological and geological considerations, we seek to characterize the variance of one or more climate fields with a minimum number of paleoclimate observations of a given quality. To illustrate this process, we focus on designing an optimal network of oxygen isotope observations in coral aragonite (*δ*^{18}O_{c}), among a list of possible observing sites, with the goal of learning about sea surface temperature (SST) and/or sea surface salinity (SSS) from the Community Climate System Model, version 4 (CCSM4), last millennium simulation of Landrum et al. (2013). More precisely, given *p*_{0} locations where coral *δ*^{18}O could potentially be measured, where are the best *p* locations to capture as much variance in SST and SSS as possible? To do so, we leverage the simple bivariate model of Thompson et al. (2011) in a perfect model scenario aimed at establishing a proof of concept. The framework, however, is extremely general and applicable to any proxy system that is sufficiently well understood to be numerically modeled via a PSM. A summary of the methodology is provided in Fig. 1, which illustrates how the different components of our approach operate together. The article is structured as follows: in section 2 we formulate the mathematical problem and propose a solution method. A pseudocoral example is introduced in section 3, which describes the setting and experimental design. Results are shown in section 4, followed by a discussion in section 5. We conclude with an assessment of the challenges that remain before this strategy can be operationalized.

## 2. Methodology

Let *V* denote the set of *p*_{0} paleoclimate observation sites (i.e., |*V*| = *p*_{0} is the cardinality of *V*). To *V*, we associate a set of observable random variables (RVs) **y**. In our case, **y** represents proxy measurement anomalies, such as coral *δ*^{18}O or strontium/calcium ratios, taken at specific sites. We denote by *F*, with |*F*| = *q*, the set of grid points, and by **x** a climate field at those locations, observed indirectly via **y**. In the following, **x** will stand for anomalies of sea surface temperature and/or salinity, and **J** for the field information we wish to infer from the proxies (e.g., the whole SST field, the Niño-3.4 index, or some other metric derived from SST and/or SSS), also called the forecast metric or response function of **x**. Next, we describe the Gaussian model representation for **x**, which is suitable for field anomaly values (centered around zero), and the Bayesian estimation of the system given paleoclimate observations **y**_{S} at some set of locations *S*.

### a. Sensor placement as a filtering problem

Let us consider the filtering problem consisting of a Gaussian process **x**, which we cannot observe directly, and an observation process **y**_{S} correlated with **x** as follows:

**x** ~ *N*(**μ**, Σ), (1)

**y**_{S} = ℋ_{S}(**x**) + *ε*_{S}, (2)

where **μ** and Σ are the prior mean and error covariance matrix of **x**, **y**_{S} is the vector of *p* observations at a subset of locations *S* ⊂ *V*, and ℋ_{S} is the operator mapping the field variables to the measurements at *S*. Since **x** is typically not directly observable (i.e., **y**_{S} ≠ **x**_{S} + *ε*_{S}), the operator ℋ_{S} is derived from a forward model such as the one described in section 3b and is generally nonlinear. In addition, *ε*_{S} ~ *N*(0, R_{S}) is the error vector, which includes measurement noise and the portion of the proxy archive that is not explained by the PSM; R is a *p*_{0} × *p*_{0} diagonal matrix (assuming the errors are independent), and its submatrix restricted to *S* is denoted by R_{S}.

Then, we can estimate the posterior distribution of **x** given these observations **y**_{S}. It is also Gaussian, with conditional mean and covariance derived via Bayes’s theorem (e.g., Wikle and Berliner 2007):

**μ**′ = **μ** + Σ**H**_{S}^{T}(**H**_{S}Σ**H**_{S}^{T} + R_{S})^{−1}[**y**_{S} − ℋ_{S}(**μ**)], (3)

Σ′ = Σ − Σ**H**_{S}^{T}(**H**_{S}Σ**H**_{S}^{T} + R_{S})^{−1}**H**_{S}Σ. (4)

Here **H**_{S} is the first-order linearization of the observation operator ℋ_{S}. Placing this problem in context, kriging (or, equivalently, Gaussian process regression) is the case where ℋ_{S} is just the identity at sites *S* applied to **x** (in other words, **H**_{S} has ones at the diagonal indices corresponding to sites *S* and zeros elsewhere).

Letting D_{S} denote the innovation error covariance and **K** the Kalman gain,

D_{S} = **H**_{S}Σ**H**_{S}^{T} + R_{S},  **K** = Σ**H**_{S}^{T}D_{S}^{−1}, (5)

Eq. (4) can be viewed as a classic “Kalman update” in data assimilation parlance:

Σ′ = (I − **K** **H**_{S})Σ. (6)

Equation (6) shows that the state error covariance matrix Σ′ is actually independent of the measurements at sites *S*. Only R_{S}, Σ, and the mapping operator **H**_{S} are necessary to compute Σ′. Note that R_{S} is given by the quality of the observations, and Σ is our initial guess of the uncertainty in **x**.
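As a concrete check of this property, the following Python sketch (a toy covariance with two observed grid points; all names and values are illustrative, not from the paper) computes the posterior covariance of Eq. (6) from Σ, the observation operator, and the error covariance alone; no observed values ever enter the computation:

```python
import numpy as np

def kalman_covariance_update(Sigma, H, R):
    # Eq. (6): posterior covariance from Sigma, H, and R alone;
    # the observed values y never appear in the computation.
    D = H @ Sigma @ H.T + R                  # innovation covariance D_S
    K = Sigma @ H.T @ np.linalg.inv(D)       # Kalman gain
    return (np.eye(Sigma.shape[0]) - K @ H) @ Sigma

rng = np.random.default_rng(0)
q = 6                                        # toy field with 6 grid points
A = rng.standard_normal((q, q))
Sigma = A @ A.T + q * np.eye(q)              # SPD prior covariance
H = np.zeros((2, q))
H[0, 1] = H[1, 4] = 1.0                      # observe grid points 1 and 4
R = 0.5 * np.eye(2)                          # independent observational errors

Sigma_post = kalman_covariance_update(Sigma, H, R)
# Posterior variances shrink everywhere, most strongly at the observed sites.
```

The assimilation reduces the variance at every grid point correlated with the observed sites, which is the property exploited below for network design.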

In this work, we are only interested in some features of **x** that we call **J**. Let **Δσ**_{J}(*S*) denote the change in the variance of **J** due to information at *S*. Then, using the first-order Taylor expansion of **J**, Ancell and Hakim (2007) showed that the assimilation of new observations **y**_{S} will change the covariance of **J** according to

**Δσ**_{J}(*S*) = (∂**J**/∂**x**)Σ**H**_{S}^{T}D_{S}^{−1}**H**_{S}Σ(∂**J**/∂**x**)^{T}, (7)

where ∂**J**/∂**x** denotes the sensitivity (Jacobian) of **J** with respect to the state **x**. Note that this equation also applies to the generalized case where **J** is a vector, except that **Δσ**_{J}(*S*) is a covariance matrix (N. Hryniw and G. J. Hakim 2015, unpublished manuscript). Since the posterior error covariance of **J** also does not depend on the actual observations **y**_{S}, the trace of **Δσ**_{J}(*S*) provides a good measure of the performance of sensor network *S*. Intuitively, when divided by the prior variance of **J**, it represents the proportion of variance of the forecast metric **J** explained by *S*. Different sensor placement choices *S* produce distinct values of Tr[**Δσ**_{J}(*S*)], where Tr[ ] denotes the trace of a matrix, so one can formulate a *p*-sensor placement problem (a “sparse approximation problem”) as finding the subset *S* ⊂ *V* (|*S*| = *p*) that maximizes Tr[**Δσ**_{J}(*S*)]. That is, one seeks the locations that maximize the knowledge of (minimize the uncertainty about) a climate field of interest.

### b. Optimization method: Greedy algorithm

The combinatorial nature of the *p*-sensor placement problem makes an exhaustive search impractical, such that heuristic methods must be employed, even at the cost of yielding suboptimal solutions. Most variance reduction functions [Eq. (7)] were shown by Das and Kempe (2008) to be submodular, which yields interesting theoretical properties that are very relevant in this context. Submodularity implies that for all subsets *A* ⊆ *B* ⊆ *V* and any element *s* ∈ *V*, the variance reduction function satisfies the following properties:

1. **Δσ**_{J}(∅) = 0,
2. 0 ≤ **Δσ**_{J}(*A*) ≤ **Δσ**_{J}(*B*), and
3. **Δσ**_{J}(*A* ∪ {*s*}) − **Δσ**_{J}(*A*) ≥ **Δσ**_{J}(*B* ∪ {*s*}) − **Δσ**_{J}(*B*).

The monotonic increase with the cardinality of *S* and the submodularity of the reduction function justify the use of a greedy algorithm (Nemhauser et al. 1978) to find near-optimal solutions to our optimization problem. The idea behind the greedy algorithm is simple: it sequentially adds the best sensor location to an existing placement *S*, starting from the empty set. It has specifically been shown (Nemhauser et al. 1978) that greedy algorithms produce near-optimal solutions when the objective function is nondecreasing and submodular. Near-optimality, in this context, can further be quantified in the sense that the value achieved by the greedy placement *S*_{G} is bounded from below by (1 − *e*^{−1})**Δσ**_{J}(*S**), where *S** is the optimal *p*-sensor placement and *e* is the exponential constant. The total computational cost is now reduced to *O*(*p* × *p*_{0}) objective evaluations, where *q*^{3} is the cost of inverting the covariance matrix and *p*^{3} is the matrix multiplication cost per evaluation. Factorial growth in computational cost has thus been replaced by linear growth with the size of the proxy network. With these greedy algorithms available, the exploration of stochastic variability can be carried out in an efficient and comprehensive fashion. A pseudocode of the optimization scheme is shown in algorithm 1.

Algorithm 1: Optimal sensor placement—greedy

Input: Σ, R, **H**, *V*, *p*
Output: *S* ⊂ *V* such that |*S*| = *p*
*S* ← ∅
for *k* from 1 to *p* do
  Find *s** = argmax_{*s* ∈ *V*} Tr[**Δσ**_{J}(*S* ∪ {*s*})]
  *S* ← *S* ∪ {*s**}
end for
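To make the loop concrete, here is a minimal Python sketch of the greedy selection under simplifying assumptions of my own: each candidate site is observed directly (identity observation operator) with independent noise, the metric **J** is the field itself, and the objective for *S* ∪ {*s*} is re-evaluated naively at each step. All names are illustrative.

```python
import numpy as np

def greedy_placement(Sigma, R_diag, candidates, p, J_grad):
    # Greedy near-optimal p-sensor placement maximizing Tr[variance reduction].
    chosen = []
    for _ in range(p):
        best_site, best_score = None, -np.inf
        for s in candidates:
            if s in chosen:                  # no replicates in this sketch
                continue
            sites = chosen + [s]
            H = np.zeros((len(sites), Sigma.shape[0]))
            for i, site in enumerate(sites):
                H[i, site] = 1.0             # direct observation of each site
            R = np.diag([R_diag[site] for site in sites])
            D = H @ Sigma @ H.T + R          # innovation covariance
            G = J_grad @ Sigma @ H.T
            score = np.trace(G @ np.linalg.solve(D, G.T))
            if score > best_score:
                best_site, best_score = s, score
        chosen.append(best_site)
    return chosen

rng = np.random.default_rng(2)
q = 10
A = rng.standard_normal((q, q))
Sigma = A @ A.T
sites = greedy_placement(Sigma, np.full(q, 0.5), list(range(q)), 3, np.eye(q))
```

A production version would instead update the covariance after each selection, as done in section 2c, rather than re-evaluating the joint objective from scratch.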

### c. Ensemble Kalman filter

To estimate the covariance terms required to optimize Eq. (7), we choose an ensemble-based description of the stochastic variables (ensemble sensitivity analysis; Ancell and Hakim 2007; Hakim and Torn 2008) that solves the optimal proxy network problem via an ensemble Kalman filter (EnKF). The approach is similar to the optimal monitoring network design discussed in Mauger et al. (2013).

While the observed values play a crucial role in updating the state estimate in data assimilation, in our case sensors must be selected solely on the basis of the sensor characteristics, their locations, and their predicted effect on the forecast metric. Initially, *n* (independent) samples of **x** are used to represent the distribution of the random field; these are assembled as the columns of a state matrix **X** of size *q* × *n*. The corresponding perturbation matrix *δ***X** is obtained by removing the ensemble mean from each row of **X**. We then compute the response function **J** for every realization of **x** to get the ensemble forecast **J** of size *q*_{0} × *n*, from which the ensemble mean is removed to form *δ***J**. Then the background error covariance Σ of Eq. (6) can be estimated by the sample covariance

Σ ≈ (*n* − 1)^{−1}*δ***X**(*δ***X**)^{T}. (8)

The ensemble covariance estimate of the adjoint sensitivity of **J** with respect to changes in the initial state was shown in Ancell and Hakim (2007) to satisfy

(∂**J**/∂**x**)Σ ≈ (*n* − 1)^{−1}*δ***J**(*δ***X**)^{T}. (9)

Replacing the covariance and adjoint sensitivity terms in Eq. (7) with their Monte Carlo (MC) estimates, Eqs. (8) and (9) (i.e., distributions are approximated by ensembles of random samples), we can rewrite the change in the variance of **J** due to observations at sites *S* [Eq. (7)] as

**Δσ**_{J}(*S*) = (*n* − 1)^{−2}*δ***J**(**H**_{S}*δ***X**)^{T}D_{S}^{−1}(**H**_{S}*δ***X**)(*δ***J**)^{T}, (10)

where the innovation error covariance is estimated as D_{S} = (*n* − 1)^{−1}(**H**_{S}*δ***X**)(**H**_{S}*δ***X**)^{T} + R_{S}, and is scalar for a single observation. Combining the greedy optimization approach, which sequentially finds the next best observation location, with the EnKF, which propagates the uncertainty reduction once a site is chosen via the Kalman covariance update, yields an efficient network design tool. Once a site *s* ∈ *V* is found after one iteration of algorithm 1, the posterior error covariance is computed with Eq. (6) so as to remove the variance explained by observations at *s* (Mauger et al. 2013) and is then used as the prior for the next iteration. This differs from the OI approach used in Evans et al. (1998b), which assumes a stationary covariance and hence does not benefit from data-driven updates. In the present study, we found that updating the error covariance significantly affects the resulting error amplitudes (decreasing as more stations are chosen), but not the spatial patterns. This decrease in error covariance amplitude generally leads to fewer chosen replicates when compared with the stationary covariance case. That is, this method is able to extract more information from similarly noisy observations.

Noting that data assimilation is performed for a single site at a time (greedy approach), and to avoid explicit computations with large covariance matrices, we use an equivalent square root formulation (Mauger et al. 2013), which directly updates the perturbation matrix *δ***X** rather than its error covariance, as follows:

*δ***X**^{(k+1)} = *δ***X**^{(k)} − *β***K H**_{s}*δ***X**^{(k)}, (11)

where *k* denotes the greedy iteration index, **H**_{s} maps **x** to observations **y** at *s*, *r* is the paleoclimate observational noise variance at site *s*, connected to the signal-to-noise ratio (SNR) by the relationship SNR = [*r*^{−1}var(**H**_{s}*δ***X**)]^{1/2}, **K** is the Kalman gain expressed as **K** = (*n* − 1)^{−1}*δ***X**(**H**_{s}*δ***X**)^{T}D_{s}^{−1}, and *β* = [1 + (*r*/D_{s})^{1/2}]^{−1} results from the square root description of the problem. It reduces the field update according to the proportion of variance at site *s* relative to the measurement uncertainty.
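The square root update of Eq. (11) is compact in code. A minimal Python sketch for one scalar observation at site *s*, assuming a linear single-point observation operator with coefficient `a` (names and toy dimensions are illustrative):

```python
import numpy as np

def ensrf_update(dX, s, a=1.0, r=0.5):
    # Square root update of ensemble perturbations dX (q x n) after
    # assimilating one scalar observation H_s x = a * x[s] with noise var r.
    n = dX.shape[1]
    HdX = a * dX[s, :]                       # observed-space perturbations
    D = HdX @ HdX / (n - 1) + r              # scalar innovation variance
    K = dX @ HdX / (n - 1) / D               # Kalman gain (length-q vector)
    beta = 1.0 / (1.0 + np.sqrt(r / D))      # square root reduction factor
    return dX - beta * np.outer(K, HdX)

rng = np.random.default_rng(3)
dX = rng.standard_normal((6, 200))
dX -= dX.mean(axis=1, keepdims=True)         # zero-mean rows (perturbations)
dX_post = ensrf_update(dX, s=2)
var_prior = (dX ** 2).sum(axis=1) / 199
var_post = (dX_post ** 2).sum(axis=1) / 199
# Ensemble variance shrinks at every point covarying with site s, most at s.
```

Because only a rank-one update of *δ***X** is performed per site, the full covariance matrix never needs to be formed.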

Additionally, to account for the inherent uncertainty that results from approximating the distribution of **x** by a climate dataset **X**, we take a bootstrapping (Monte Carlo) approach, resampling from **X** and performing the analysis over *m* subsamples **X**_{i} of **X** and corresponding responses **J**_{i}, where *i* ∈ {1, 2, …, *m*} (typically *m* = 1000). Denoting by **Δσ̄**_{J}(*S*) the expected variance reduction for a sensor placement *S* over all realizations **X**_{i} of **X**, then

**Δσ̄**_{J}(*S*) = *m*^{−1} ∑_{i=1}^{m} **Δσ**_{J}^{(i)}(*S*). (12)

We may now formulate our *p*-sensor placement optimization problem as finding the set *S* ⊆ *V* that maximizes the expected variance reduction over all state matrix realizations **X**_{i}, weighted by their probability of occurrence (here, all equally probable, which assumes a uniform prior):

*S** = argmax_{*S* ⊆ *V*, |*S*| = *p*} Tr[**Δσ̄**_{J}(*S*)]. (13)

The OSP formulation described in this article thus differs from that of Mauger et al. (2013) in that the optimization is performed over the stochastic component, namely the **X**_{i}. Therefore, a single sensor placement is obtained and is guaranteed to be near optimal, as explained in section 2b.

Algorithm 1 can thus be modified as follows:

Algorithm 2: Optimal sensor placement—greedy + Kalman filter

Input: **X**^{(0)} → **J**^{(0)}, R, *V*, *p*
Output: *S* ⊂ *V* such that |*S*| = *p*
*S* ← ∅
for *k* from 1 to *p* do
  Find *s** = argmax_{*s* ∈ *V*} Tr[**Δσ̄**_{J}(*S* ∪ {*s*})]
  *S* ← *S* ∪ {*s**}
  for *i* from 1 to *m* do
    Update **X**_{i}^{(k)} from Eq. (11) and the derived **J**_{i}^{(k)}
  end for
end for
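A sketch of how the greedy search and the square root update interlock, under illustrative assumptions of my own: direct observation of each grid point, **J** equal to the field (so *δ***J** = *δ***X**), scalar observations, and a handful of toy subsamples standing in for the *m* = 1000 bootstrap draws. All names are mine.

```python
import numpy as np

def site_score(dX, dJ, s, r):
    # Ensemble estimate of the variance of J explained by one scalar site s.
    n = dX.shape[1]
    HdX = dX[s, :]                           # identity observation operator
    D = HdX @ HdX / (n - 1) + r              # scalar innovation variance
    g = dJ @ HdX / (n - 1)                   # cross covariances with J
    return float(g @ g / D)

def greedy_enkf(subsamples, p, r=0.5):
    S = []
    ens = [Xi - Xi.mean(axis=1, keepdims=True) for Xi in subsamples]
    q = ens[0].shape[0]
    for _ in range(p):
        # Expected variance reduction, averaged over the subsamples.
        scores = [np.mean([site_score(dX, dX, s, r) for dX in ens])
                  for s in range(q)]
        s_star = int(np.argmax(scores))      # sites may repeat: replicates
        S.append(s_star)
        updated = []
        for dX in ens:                       # square root update per subsample
            n = dX.shape[1]
            HdX = dX[s_star, :]
            D = HdX @ HdX / (n - 1) + r
            K = dX @ HdX / (n - 1) / D
            beta = 1.0 / (1.0 + np.sqrt(r / D))
            updated.append(dX - beta * np.outer(K, HdX))
        ens = updated
    return S

rng = np.random.default_rng(4)
subs = [rng.standard_normal((8, 100)) for _ in range(5)]  # toy subsamples
S = greedy_enkf(subs, p=3)
```

Because the candidate list is never pruned here, the same site can be selected repeatedly, which is the replication scenario explored in section 4.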

## 3. Application to optimal coral sampling

We now apply this methodology to the placement of coral paleo-observing sites. Using a perfect model approach (e.g., temperature and salinity output fields from GCM simulations), we can quantify the value of coral *δ*^{18}O record networks in characterizing the uncertainty in the climate fields with the sole knowledge of paleoclimate observational error. This is achieved by relying on a PSM, which we describe next.

### a. Question-based sampling

The existing network of subannually resolved corals (mostly in the Indo-Pacific region) documents large interannual and interdecadal variability in sea surface conditions (e.g., Evans et al. 2000; Gagan et al. 2000; Cobb et al. 2008; Lough 2010; Emile-Geay and Eshleman 2013). The following questions motivate the application of the mathematical framework described above:

- Is the current coral network optimal?
- Where should the next samples be gathered?
- Would it be more advantageous to sample at new locations or replicate paleoclimate observations at existing sites?
- How many paleoclimate observations are needed to yield useful information about a climate field?
- Can SSS be constrained from corals?
- When should longer records be chosen over shorter ones?

To address these questions, we employ a pseudoproxy framework (Smerdon 2012), generating the coral observations from a millennial simulation of a state-of-the-art climate model. These “pseudocoral” observations are generated via a forward model of coral aragonite oxygen isotope composition (Thompson et al. 2011), which we now describe.

### b. Coral sensor model

Thompson et al. (2011) proposed a process-based model of the oxygen isotopic composition anomalies of coral aragonite given sea surface temperature and salinity (SSS, taken as a proxy for the oxygen isotopic composition of seawater):

*δ*^{18}O_{c} = *a*_{1}SST + *a*_{2}SSS. (14)

This is an anomaly model parameterized at interannual time scales: SST and SSS refer to annually averaged anomalies relative to the mean state, and *a*_{1} and *a*_{2} are coefficients determined from prior paleoclimate observations, manipulative experiments on inorganic aragonite precipitation, and global databases of surface salinity and seawater *δ*^{18}O observations (LeGrande and Schmidt 2006; Evans et al. 2000; Juillet-Leclerc and Schmidt 2001; Lough 2004). Note that *a*_{1} has a constant value of −0.21 and *a*_{2} varies by region: tropical Pacific (0.27), South Pacific (0.45), and Indian Ocean (0.16) (LeGrande and Schmidt 2006). The model explains large-scale ENSO-related variations in SST and SSS as sampled by a sparse coral *δ*^{18}O observing network, but underestimates twentieth-century trends in observed coral *δ*^{18}O. Suppose the measured oxygen isotopic composition of coral aragonite is written

*δ*^{18}O_{m,j} = *δ*^{18}O_{c,j} + *ε*_{j}, (15)

where the subscript *m* stands for measured *δ*^{18}O and *c* for coral *δ*^{18}O (the true composition). Also, *j* ∈ {1, 2, …, *p*_{0}} is a spatial location index, and **ε** ~ *N*(0, R) is a measure of nonclimatic influences, whose covariance matrix, of size *p*_{0} × *p*_{0}, is assumed diagonal (this assumption can be relaxed if a more detailed structure is suspected). Although the coral PSM was primarily validated for the Indo-Pacific region, we extend its use to a global set of potential sites for the reconstruction of the tropical SST field anomaly. We apply the *δ*^{18}O–SSS relationship for the tropical Atlantic (0.15) and South Atlantic (0.51) regions (LeGrande and Schmidt 2006, their Table 1) to the global potential observing network studied by Evans et al. (1998b). Although the regression of *δ*^{18}O on SSS in the Atlantic is less skillful than in the Indo-Pacific region, we found that our results were not sensitive within the reported uncertainty in the regression coefficients (results not shown). For this study, the SNR value was set to 0.5 at all sites, which can be related to the paleoclimate data noise variance *r* in Eq. (11). While this value is chosen arbitrarily (it is probably a function of, but not limited to, space, time, data type, target variable, and frequency of variation) and lies at the upper range of the spectrum, it is not unrealistic (Smerdon 2012; Wang et al. 2014).
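As an illustration, the bivariate PSM plus white nonclimatic noise might be sketched as follows in Python. The region-to-*a*_{2} mapping and the SNR-based noise variance follow the values quoted above; function and variable names are mine.

```python
import numpy as np

# Bivariate pseudocoral forward model plus white nonclimatic noise.
# a1 = -0.21 everywhere; a2 by region (LeGrande and Schmidt 2006 values
# as quoted in the text).
A2 = {"tropical_pacific": 0.27, "south_pacific": 0.45, "indian": 0.16,
      "tropical_atlantic": 0.15, "south_atlantic": 0.51}

def pseudocoral(sst_anom, sss_anom, region, snr=0.5, rng=None):
    """Forward-model coral d18O anomalies from SST/SSS anomaly series."""
    a1, a2 = -0.21, A2[region]
    signal = a1 * np.asarray(sst_anom) + a2 * np.asarray(sss_anom)
    if rng is None:
        rng = np.random.default_rng()
    r = signal.var() / snr ** 2              # noise variance implied by the SNR
    return signal + rng.normal(0.0, np.sqrt(r), size=signal.shape)

rng = np.random.default_rng(5)
sst = rng.standard_normal(155)               # toy annual anomaly series
sss = 0.3 * sst + rng.standard_normal(155)
d18o = pseudocoral(sst, sss, "tropical_pacific", rng=rng)
```

With SNR = 0.5, the injected noise variance is four times the signal variance, which makes concrete why replication becomes attractive in the experiments below.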

### c. A virtual coral network

The SST and SSS fields were extracted from the CCSM4 last millennium simulation of Landrum et al. (2013). CCSM4 is a state-of-the-art coupled general circulation model whose performance at simulating tropical Pacific climate is extensively documented (Gent et al. 2011; Capotondi 2013). Briefly, the model successfully captures the observed annual-mean trade winds and precipitation, sea surface temperature, surface heat fluxes, surface currents, Equatorial Undercurrent, and subsurface thermal structure. CCSM4 displays a robust El Niño–Southern Oscillation (ENSO) with multidecadal fluctuations in amplitude, an irregular period between 2 and 7 years, and a distribution of SST anomalies that is skewed toward warm events, as observed. The simulation covers the interval 850–2005. The fields were coarse-grained to a 5° × 5° grid for computational efficiency.

The ensemble state matrix **X** is constructed from the SST and SSS time series of the CCSM4 simulation by stacking the two fields:

**X** = [SST; SSS], of size 2*q* × *n*, (16)

where *q* is the number of field grid points and *n* is the number of time realizations. For a single new proxy measurement at location *s*, **H** = **H**_{s} is a 1 × 2*q* row vector, and in the specific case of *δ*^{18}O following Eq. (14), it may be written as

**H**_{s} = (0, …, 0, *a*_{1}, 0, …, 0, *a*_{2}, 0, …, 0), (17)

where the *a*_{1} coefficient is located at index *s* and *a*_{2} at index *q* + *s*.
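Assembling the stacked state matrix and the sparse observation-operator row is direct in code. This Python sketch uses the tropical Pacific *a*_{2} value and toy dimensions; all names are illustrative.

```python
import numpy as np

def build_state_matrix(sst, sss):
    # Stack SST over SSS (each q x n) into the 2q x n state matrix.
    return np.vstack([sst, sss])

def obs_operator_row(s, q, a1=-0.21, a2=0.27):
    # Sparse 1 x 2q observation operator: a1 at index s, a2 at index q + s.
    H = np.zeros(2 * q)
    H[s], H[q + s] = a1, a2
    return H

rng = np.random.default_rng(6)
q, n = 12, 155
X = build_state_matrix(rng.standard_normal((q, n)),
                       rng.standard_normal((q, n)))
H = obs_operator_row(s=4, q=q)
y = H @ X                                    # noise-free pseudocoral series at site 4
```

Applying this row vector to the joint state is what lets a single *δ*^{18}O observation constrain SST and SSS simultaneously.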

To produce the results that follow, some design choices were required. We denote by **X**_{i}, *i* ∈ {1, 2, …, *m*} with *m* = 1000, the subsamples from the last 155 years of the 1000-yr CCSM4 simulation of SST and SSS fields (i.e., the calibration period is 1851–2005). Since we are mainly interested in reconstructing past SSTs, we choose **J** to be the SST field; therefore **J** ∈ ℝ^{q}. For this vector-valued **J**, we choose to maximize the trace of **Δσ̄**_{J}(*S*) in Eq. (13). The impact of the choice of **J** (**J** = {SSS} or **J** = {SST, SSS}) on the OSP is investigated in section 4e. Covariance localization (Gaspari and Cohn 1999) is applied with a correlation length of 6000 km (e.g., the covariance is null between points that are more than 12 000 km apart). The pseudoproxy experiments of Smerdon (2012) suggest that global patterns of covariance between paleodata and the historical data used for calibrating the paleodata should be tested carefully. Hence, this value was determined, as done in Steiger et al. (2014), by finding the minimum in mean error variance that provides a smooth field. Also, *V* is the set of coral network locations of Table 1, so |*V*| = *p*_{0} = 53 sites. In one scenario, the optimization is run such that chosen stations are removed from the list of possible sites after each iteration; in other words, we remove {*s**} from the set *V* at line 3 of algorithm 2, and the sensor network size cannot exceed *p*_{0}. In another scenario, *V* remains the same during the entire analysis, allowing sites to be chosen multiple times. The latter scenario introduces possible replicates, and we can ask whether it is preferable to sample new sites or multiple replicates at a single location.

### d. A two-tiered coral network

There are a few examples of observing networks designed for specialized phenomena. In the Prediction and Research Moored Array in the Tropical Atlantic (PIRATA) project (Bourlès et al. 2008), the Autonomous Temperature Line Acquisition System (ATLAS) network is positioned so as to resolve the two main modes of tropical Atlantic climate variability. Similarly, the Tropical Atmosphere Ocean (TAO) network of the Tropical Ocean Global Atmosphere (TOGA) program (McPhaden et al. 2010) is a major component of the ENSO observing system and is deployed on a more or less regular array. In all cases, there are limits on where the sensors can be placed. In our case, the pseudocoral locations were grouped into two tiers (see Table 1 and Fig. 2).

- Tier 1 consists of locations that approximate the positions of publicly available coral *δ*^{18}O observations (http://www.ncdc.noaa.gov/paleo/corals.html) with a resolution better than 2 months. This is the network used by Emile-Geay and Eshleman (2013).
- Tier 2 consists of locations where coral samples might potentially be collected. We chose to work with the global potential observing network studied by Evans et al. (1998b), which includes Moorea and the Gambier Islands (French Polynesia), Suwarrow (Cook Islands), Hawaii (Big Island), Andaman Island (eastern India), Lakshadweep (western India), the Galapagos (Ecuador), Christmas Island (Australia, in the Indian Ocean), Easter Island (Polynesia, in the southeastern Pacific Ocean), the Bahamas and Barbados (Caribbean Sea), Cape Verde, St. Helena, and Sao Tome (Atlantic Ocean), and American Samoa (South Pacific Ocean) (Evans et al. 1998a).

This network is not exhaustive, and while these locations contain corals, they are not necessarily conducive to multicentury-long coral-based reconstructions; it suffices, however, for a proof of concept. In test cases, the use of an updated database (e.g., ReefGIS; http://reefgis.reefbase.org) yielded similar results at the granularity imposed by realistic SST/SSS datasets.

### e. Network assessment

To evaluate the various sensor networks obtained in the following scenarios, we reconstruct the GCM-simulated SST or SSS fields over the 1700–1850 time period using the pseudoproxies described in section 3b at the candidate sites. Reconstructions are performed via assimilation of the optimal pseudocoral observations, as described in Steiger et al. (2014). At each location, we consider 100 noise realizations to validate our uncertainty quantification scheme (i.e., does the actual coverage rate match the nominal coverage rate? This may be seen as the gray band in Fig. 3a,b, bottom panels).

Next, we quantify reconstruction skill via the pointwise coefficient of efficiency (CE; Cook et al. 1994) for the 1700–1850 validation period. The CE measures the fraction of common variance between the actual and the reconstructed fields (at each grid point) over the validation period, using the validation period mean-field value as a benchmark. The CE skill score ranges from −∞ to 1, but our maps have cutoff values of −1 to 1, where 1 indicates that the reconstructed time series perfectly matches the true sequence, while a negative score denotes that the squared error between the two series exceeds the variance of the true sequence. In some cases, we also juxtapose the tropical average anomaly reconstructions with the true series and look at the ensemble coverage skill. The tropical average is taken over the 35°S–35°N latitude range. To convey the information gain/loss between two networks, we show differences between CE maps (*δ*CE), which range from −2 to 2.
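The CE score itself is compact in code. A Python sketch with its two boundary cases (a perfect reconstruction scores 1; reconstructing the validation-period mean scores 0); names are mine.

```python
import numpy as np

def coefficient_of_efficiency(truth, recon):
    # CE (Cook et al. 1994): 1 - SSE / variance of the truth, using the
    # validation-period mean of the truth as the benchmark.
    truth, recon = np.asarray(truth), np.asarray(recon)
    sse = np.sum((truth - recon) ** 2)
    ref = np.sum((truth - truth.mean()) ** 2)
    return 1.0 - sse / ref

t = np.array([0.1, -0.2, 0.3, 0.0, -0.1])
assert coefficient_of_efficiency(t, t) == 1.0            # perfect reconstruction
assert coefficient_of_efficiency(t, np.full(5, t.mean())) == 0.0
```

A negative CE thus means the reconstruction does worse than simply predicting the validation-period mean at that grid point.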

## 4. Results on synthetic data

We now address the motivating questions posed in section 3a with this virtual network.

### a. Is the current coral network optimal?

In this experiment, we compare the performance of the pseudocoral *δ*^{18}O network at existing locales (33 unreplicated tier 1 locations listed in Table 1) with the optimal 33-sensor network. Again, the target is the 1700–1850 SST field. Figures 3a and 3b (top panels) illustrate how the CE skills of the two networks differ spatially. Looking at the areas of positive CE, the OSP is more efficient in the Indonesian and Philippine regions, the northern tropical and central Pacific, and the southern tropical Atlantic. This point is reinforced in the bottom panel of Fig. 3, which displays the pointwise difference between the CE map for the optimal 33 sensors (Fig. 3b, top) and the CE map for tier 1 (Fig. 3a, top). The OSP skill is seen to exceed the tier 1 skill in the regions cited above. It is, however, lower in the southwestern Pacific region and the southeastern Indian Ocean. [Looking at Fig. 6 (top, red dot and blue curve), the optimal 33 sensors explain nearly 10% more SST variance than tier 1.] This demonstrates the advantage of our design approach, which allows us to quantify the information added by a particular set of sensors. Although the field skill is quite different, the skill is indistinguishable for the reconstructed tropical means. Despite the warm bias (von Storch et al. 2004; Smerdon et al. 2011; Wang et al. 2014), which is due to the calibration period 1851–2005 being warmer than the reconstructed interval 1700–1850, both networks perform well in reconstructing the tropical average SST and have uncertainty bands of similar width. The two networks differ spatially in that the OSP privileges the northern over the southern tropical Pacific sites, with the exception of the Gambier Islands (French Polynesia). Atlantic locations are also picked more often in the OSP case. Sites around Australia, however, are not chosen, probably owing to their localized spatial impact.
This result, however, is model dependent since the covariance structure depends on the target SST and SSS fields (the CCSM4 “past1000” simulation). Different results may be obtained with different target fields.

### b. Where should the next samples be gathered?

Given the current network of 33 unreplicated sites (tier 1), where should the next 25 or 50 coral records be gathered? As in the previous experiment, we do not allow the sampling of replicates. Using an SNR of 0.5, Fig. 4 shows how the SST reconstructions improve with the addition of *p* = 0, 25, and 50 records.

Figure 4 (top) shows the pointwise CE skill score for tier 1, while Fig. 4 (middle and bottom) illustrates the incremental growth of the positive CE area with network size via *δ*CE, that is, the CE gain from adding 25 and 50 optimally placed sensors to tier 1. Note that *δ*CE clearly demonstrates the diminishing-returns property, as the first 25 additional optimal sites have a greater impact on the positive CE area (Fig. 4, middle) than the subsequent 25 (Fig. 4, bottom). The next sites to sample for SST reconstruction are the Gambier Islands (French Polynesia), followed by the southern tropical Atlantic region, the equatorial Indian Ocean, and the northern tropical Pacific sites. However, the mean SST field variance reduction seems to stall as the network size increases (Fig. 6, bottom; blue curve): adding 65 new sites to tier 1 reduces the variance by less than 20%. We therefore hypothesize that a better strategy is to allow replication to reduce the paleoclimate observational uncertainty.
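The diminishing-returns behavior is a direct consequence of the submodularity of the variance-reduction criterion that the greedy algorithm exploits. The sketch below is a strong simplification of the EnKF-based procedure used here (the function and variable names are ours, and a single scalar-observation covariance update stands in for the full square root filter); it illustrates how each greedy pick reduces the remaining field variance less than the previous one:

```python
import numpy as np

def greedy_placement(C, candidates, p, noise_var=1.0):
    """Greedily pick p sensors that maximize total variance reduction.

    C          : (m, n) centered anomaly ensemble of the target field
    candidates : indices of grid cells where a sensor could be placed
    noise_var  : observational error variance of each sensor
    Illustrative sketch only; the paper uses a square root EnKF update.
    """
    m, _ = C.shape
    cov = C.T @ C / (m - 1)              # sample covariance of the field
    chosen = []
    for _ in range(p):
        best, best_gain = None, -np.inf
        for j in candidates:
            if j in chosen:              # no replicates in this sketch
                continue
            # trace reduction from observing cell j with a noisy sensor
            gain = np.sum(cov[:, j] ** 2) / (cov[j, j] + noise_var)
            if gain > best_gain:
                best, best_gain = j, gain
        chosen.append(best)
        # serial Kalman covariance update for the scalar observation
        k = cov[:, [best]] / (cov[best, best] + noise_var)
        cov = cov - k @ cov[[best], :]
    return chosen
```

Because the covariance is deflated after each pick, a site highly correlated with an already-chosen one contributes little additional gain, which is why strongly redundant candidates (e.g., near-neighbors) are passed over.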

### c. Are replicates more advantageous?

We repeat the previous analysis, this time allowing for a site to be chosen multiple times. Given the current pseudocoral network of 35 sites, which includes two replicates in Rarotonga (Cook Islands), where should we sample the next 50 or 100 corals, allowing for replication? We call this scenario our “control” run for future comparison. Results are shown in Fig. 5.

For up to 100 paleoclimate observations on top of the 35 existing sites, the addition of sensors greatly improves the SST field reconstruction. The combined positive *δ*CE area in Fig. 5 (middle and bottom), showing the information gain from adding the optimal 50 and 100 sensors to tier 1, covers nearly the entire Pacific, Indian, and Atlantic Oceans, with the exception of the southern and northern limits of the map. We note a reduction of the mean SST field variance of 35% when adding 100 optimally placed sensors to tier 1 (Fig. 6, bottom; green curve). The resampling rate is highest in the central Pacific region (Palmyra and Kiritimati) as well as in the eastern Pacific Ocean (Galapagos and Clipperton Atoll), followed by the Seychelles in the Indian Ocean. Note also that replicates occur in places where there are few nearby alternatives.

### d. How many paleoclimate observations are needed to yield useful information about a climate field?

OSP results are highly sensitive to the paleoclimate observational error or, equivalently, the SNR [Eq. (11)]. We illustrate this with experiments that achieve a similar level of reconstruction CE (Fig. 7) while sampling plausible ranges of SNR and numbers of sites/replicates. Figure 7 shows that similar CE patterns and amplitudes can be achieved by 10 sites sampled with SNR = 1 (Fig. 7, top), 25 sites sampled with SNR = 0.5 (Fig. 7, middle), or 100 sites sampled with SNR = 0.25 (Fig. 7, bottom). Focusing on the Niño-3.4 region in Fig. 7, note how the number of replicates scales up as the SNR decreases, growing from 3 to 8 to 30 for SNR = 1, 0.5, and 0.25, respectively. This shows that a few well-chosen, high-quality records can carry more information about the SST field as a whole, and hence represent a better investment of resources, than a much greater number of low-quality records.
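The scaling behind this tradeoff follows from the central limit theorem: averaging *n* replicates with independent random errors shrinks the noise standard deviation by √*n*, so the effective SNR grows as √*n*. A minimal sketch under that independence assumption (it fails for systematic errors):

```python
import numpy as np

def effective_snr(snr_single, n_replicates):
    """Effective SNR of the mean of n replicates with independent,
    purely random errors: noise std shrinks by sqrt(n)."""
    return snr_single * np.sqrt(n_replicates)

# e.g., four replicates of an SNR = 0.5 record behave, in aggregate,
# like a single SNR = 1 record; matching SNR = 0.25 takes 16 replicates
```

This is consistent with the roughly quadratic growth of the replicate counts quoted above (3, 8, 30 for SNR = 1, 0.5, 0.25).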

### e. Are corals more informative of temperature or salinity?

So far we have focused on how much temperature information could be gleaned from a sparse network of coral observations. Could such a network also inform us about salinity (SSS)? If so, how would an optimal SSS-characterization coral network differ from an optimal SST-characterization coral network? Is it possible to jointly estimate SST and SSS with the same network?

To address these questions, we repeat the experiment carried out in section 4c with two alternative selections for **J**, namely **J** = {SSS} (Fig. 8) and **J** = {SST, SSS} (Fig. 9). The coral network that minimizes the uncertainty in the SSS field (Fig. 8) is much more concentrated around the western Pacific warm pool (Philippines and Papua New Guinea) and sparser in the central Pacific region; this is ostensibly because salinity is a dominant influence on coral *δ*^{18}O only in the western Pacific warm pool region, where SST variations are weak compared to precipitation-induced SSS variations. Note also that the SSS field is more difficult to reconstruct accurately with the tier 1 and tier 2 networks and requires many more sensors, perhaps not limited to coral records. There is, however, little structural difference between the OSP obtained with **J** = {SST, SSS} (Fig. 9) and that using **J** = {SST} only (Fig. 5). This indicates that, with this synthetic network and PSM at least, coral *δ*^{18}O is so dominated by thermal signals that little additional information can be extracted about SSS.

### f. How important is the length of the records?

So far we have only considered an extremely idealized, complete paleoclimate observational coverage. In reality, most series are of different lengths, and we would like to quantify how this affects site selection. We assume a time span that is common to all the sites between 1931 and 2005, but a random start date (chosen from a uniform distribution over 1851–1930) is assigned to each location, such that some samples go farther back in time than others. A single realization of start dates is used in this experiment. Therefore, some samples may provide at most twice as many annual measurements as others. In Eq. (11), the length of a particular time series specifies the ensemble size that is being updated, so longer sequences affect the update of a larger portion of the ensemble, while shorter ones only impact a subsample thereof. Additionally, short sequences result in noisier paleoclimate observations because the SNR is held constant. This setting would be representative of cases where the average length of samples is known in specific areas. The resulting OSP (Fig. 10, middle), when compared with the OSP obtained with fixed-length sequences (Fig. 10, top), differs in that it generally exhibits a lower replication rate. Where the fixed-length OSP picks 19 distinct sites, the variable-length OSP chooses 25 sites that are more uniformly distributed within the climatically important regions. Yet not all the chosen pseudocorals have long sequences (Fig. 10, bottom), suggesting that location may be just as important as length.
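The variable start-date setup can be mimicked in a few lines (the seed and the realization below are illustrative, not the one used in the experiment):

```python
import numpy as np

rng = np.random.default_rng(42)                    # illustrative seed only
n_sites = 35
starts = rng.integers(1851, 1931, size=n_sites)    # uniform over 1851-1930
lengths = 2005 - starts + 1                        # annual record lengths
# Every record covers at least the common 1931-2005 window, and the
# longest possible record (1851-2005, 155 yr) holds roughly twice as
# many annual values as the shortest possible one (1930-2005, 76 yr).
```

In the assimilation, `lengths[i]` would then set the ensemble fraction updated by site *i*, and (at constant SNR) the error variance of its paleoclimate observations.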

## 5. Discussion

The previous section laid down a proof of concept for how the data assimilation framework may be used to design optimal sampling strategies and answer broad scientific questions. Because it uses a linear bivariate forward model instead of assuming a univariate dependence on temperature with 1:1 scaling, this work brings the exercise of Evans et al. (1998b) one step closer to reality. It also approaches optimality of the total resultant network, rather than an iterative next best site to explore, and carries information gained sequentially as sites are being assimilated. Finally, its computational efficiency allows for the testing of multiple scenarios. However, results from this optimization problem are clearly sensitive to choices made to close the problem. In this section, we discuss some of our assumptions and how they may be relaxed to provide more realistic results in operational use. We also consider some possible extensions.

### a. Assumptions

#### 1) Realistic targets

The CCSM4 simulation that generated the target SST and SSS fields used here is a plausible representation of tropical ocean–atmosphere processes on time scales of interest and produced long (150 yr) calibration and validation periods, necessary to provide reliable estimates of the error covariances. However, its spatial covariance structure differs from the true SST and SSS (e.g., Deser et al. 2012). For a more realistic sampling design experiment, one might use historical observations to represent the true target fields as closely as possible. Those, by contrast, are spatially and temporally incomplete (therefore biased), noisy, and span a shorter time range, which may require some form of regularization (space reduction) to stabilize covariances at the expense of spatial detail (Kaplan et al. 1997). Additionally, the cost functions used in most of our experiments are for (co)variance of the SST and/or SSS fields. We used the entire SST or SSS fields as targets for generality, but the approach is applicable to any derived function of those fields. Thus, our algorithm would work just as well at a finer spatial resolution, if one chose to focus on a regional process or an area index (e.g., Niño-3.4 SST anomaly or SSS averaged over the Caribbean Sea).

#### 2) PSM dependency

The optimization results depend on the PSM chosen to model the paleoclimate observations and whether it precisely and accurately represents the response of the sensor to SST and SSS variations. Given the emerging state of the proxy system modeling field, our analysis only considers a single PSM, for which we fixed the parameters to the values proposed by Thompson et al. (2011). However, in reality one needs to account for structural uncertainties as well as parametric uncertainties. Parametric uncertainties could be tackled with sensitivity analyses using different values for the model parameters. Structural uncertainties could be identified with experiments comparing validated performance of a hierarchy of PSMs spanning a broad range of structural complexities.

#### 3) List of candidate sites

Because of the steep and anisotropic gradients in SST/SSS, results might also be sensitive to the exact location of paleoclimate observational sites in tier 1 and tier 2. For operational applications and depending on the granularity of interest, one might want to use an updated network of potential sites (e.g., ReefBase; http://reefgis.reefbase.org), in conjunction with an observational target with similarly enhanced spatial and/or temporal resolution.

#### 4) SNR estimation

In this paper, for simplicity, we used a uniform SNR value of 0.25, 0.5, or 1.0 for all potential sites, but this assumption can be relaxed and different SNR values may be specified at each location. If we were to perform an experiment with spatially varying SNR values, experience shows that, other things being equal, sites with high SNRs would be favored by our algorithm. In the real world, empirical SNR values are challenging to assign, especially given that paleoclimate observational uncertainties at different sites might include both random and systematic errors. In the case where in situ or local calibrations using contemporaneous direct and indirect observations have already been performed at the site (e.g., the existing tier 1 network), the SNR may be directly estimated (e.g., Evans et al. 2000). If that is not the case, then one must make an educated guess given SNR values observed across different samples, such as drawing SNR values from empirical SNR distributions (e.g., Wang et al. 2014).
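One way to translate a site-level SNR into the observational error variance the assimilation requires, under the common convention SNR = σ_signal/σ_noise (the exact convention in Eq. (11) may differ; the function name here is ours), is:

```python
import numpy as np

def obs_error_variance(pseudoproxy, snr):
    """Observational error variance implied by an assumed SNR, taking
    SNR = std(signal) / std(noise).  `pseudoproxy` is the noise-free
    forward-modeled series at a site."""
    return np.var(pseudoproxy, ddof=1) / snr ** 2

# Spatially varying SNRs could then be drawn per site from an empirical
# distribution of values reported in calibration studies, rather than
# being fixed to a single uniform value.
```

Halving the assumed SNR quadruples the implied error variance, which is why the OSP results are so sensitive to this choice.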

### b. Extensions

#### 1) Choice of cost function

The choice of cost function is crucial to the sensor selection process. Because our goal is to predict the information utility of specific proxy measurements before obtaining the data, the cost function evaluates the compactness, spread, and variance of the estimated targets. In this study, we used the total variance reduction for **J**, but we could also use the entropy of **J** conditioned on observations at sites *S*. However, the entropy reduction, unlike the variance reduction, is neither submodular nor supermodular (Anstreicher et al. 2001), which prevents the use of a greedy algorithm. Evans et al. (1998b) also optimized based on the resolution of area-weighted indices such as the Niño-3.4 index, the global temperature, and the warm pool average. Alternatively, Krause et al. (2008b) use a mutual information maximization criterion between sites *S* and the rest of the space *V*\*S*, approximated as

MI(*S*) = *H*(*V*\*S*) − *H*(*V*\*S* | *S*),

which allows them to reduce the error in the estimates over the space not covered by the sensor locations. They were also able to show that the mutual information criterion is submodular and, for sparse enough observation densities, approximately nondecreasing, which permits the use of a greedy optimization approach. We leave the investigation of the OSP sensitivity to the choice of cost function for future work.

#### 2) Ancillary optimization constraints

We have assumed that practical considerations (e.g., differences in site access, sample availability, systematic differences in longevity of coral colonies depending on the mean temperature and climatic variability, distribution of suitable species) are uniform across candidate sites. We could allow for more realism by considering a parameterized deployment cost for each potential observing site. Here we considered the potential coral sites from Evans et al. (1998b) according to a priori biological and geological considerations, but clearly those do not encompass all suitable locations. Conversely, not all of those sites may actually host coral species (e.g., *Porites*) and colonies suitable for climate reconstruction, for instance because of recent environmental deterioration, tropical cyclone activity, or simply because suitable materials are not safely or practically accessible. Specifically, corals at tier 2 locations such as Cape Verde, Easter Island, and Rocas Reef have been shown to be unfit for paleoclimate reconstruction (see Moses et al. 2006a; Mucciarone and Dunbar 2003; Ferreira et al. 2013, respectively). On the other hand, some locations are known to provide particularly relevant coral records for reconstructions (e.g., in Bermuda, Fiji, Puerto Rico, and La Reunion; see respectively Goodkin et al. 2008; Linsley et al. 2004; Winter et al. 2000; Pfeiffer et al. 2004). Therefore, as a step toward realism, expert judgment regarding coral quality at potential sites could easily be included in our analysis in the form of an optimization constraint. We might do so by assigning a weight *w*_{i} ∈ [0, 1], *i* = 1, …, *p*, assessing conditions at each site. Sites at which corals previously produced valuable reconstructions could be given higher weights, while sites where corals were reported to be stressed or unfit would be downgraded.
The algorithm would favor sites weighted more heavily, and this would have the added advantage of making site selection criteria explicit and transparent. Since results may be sensitive to weighting, efforts should be made to assess the validity of the weighting assignments before making inference on the results.
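One hypothetical implementation of such a constraint (the function name and weighting scheme are ours, not the paper's) is to scale each candidate's variance-reduction gain by its expert weight before each greedy pick:

```python
def weighted_pick(gains, weights):
    """Pick the next site after scaling each candidate's variance-reduction
    gain by an expert-judgment weight w_i in [0, 1]; sites without an
    explicit weight default to 1 (no penalty)."""
    scored = {site: g * weights.get(site, 1.0) for site, g in gains.items()}
    return max(scored, key=scored.get)
```

With this scheme, a site whose corals are known to be stressed (low *w*_{i}) must offer a proportionally larger variance reduction to be selected, which makes the selection criteria explicit and auditable.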

#### 3) Chronological uncertainties

This work has neglected age uncertainties, which are known to exist in corals and in all paleoclimatic observations, for which time of formation or imprinting is measured (Evans et al. 2013). The effect of chronological uncertainty on observing network definition may be assessed by incorporating ensemble estimation of plausible chronologies of the paleoclimatic data using probabilistic modeling (e.g., Comboul et al. 2014).

#### 4) Application to other observing networks

Although we focused our results on one proxy system (coral *δ*^{18}O), we note that this framework is generally applicable to any proxy system that is well understood enough to be explicitly modeled via a PSM. An open-source toolbox of such PSMs (Dee et al. 2015) will greatly expand the scope of such sensor placement studies. In our coral example, we used a linear anomaly mapping *δ*ℋ between the climate fields and the network of potential paleoclimate observations. However, one could use any nonlinear anomaly mapping *δ*ℋ(**x**) = ℋ(**x**) − 〈ℋ(**x**)〉, where ℋ denotes a nonlinear operator and the angle brackets the ensemble mean, in lieu of the linear relation. Because the mapping is site specific, both the PSM parameters and structures may vary across a network, permitting optimizations for observing networks comprised of multiple classes of paleoclimatic sensors and observations (Emile-Geay et al. 2013a,b).

In practice, one could apply this methodology to the design of observing networks for multivariate, indirect responders to environmental conditions with observational error of any sort, thereby enabling applications in radioastronomy, fixed buoy networks, Lagrangian ARGO floats, or satellite remote sensing arrays.

## 6. Conclusions

We have described a general method to solve the optimal paleoclimatic sensor placement problem and illustrated its use for the design of coral *δ*^{18}O sampling networks using simulated climate fields. Although the subject had been investigated earlier by Evans et al. (1998b), here we presented a more generalized and realistic approach to its closure that was permitted by the integration of three novelties: 1) a data assimilation method, consisting of the square root formulation of the EnKF (Mauger et al. 2013), to skillfully measure the (possibly multidimensional) information gain from paleo-observing networks; 2) a regionally defined bivariate PSM relating SST and SSS anomaly fields to coral *δ*^{18}O (Thompson et al. 2011); and 3) an effective optimization scheme, which is a greedy algorithm (Nemhauser et al. 1978), enabled by exploiting the submodular and monotonic properties of the variance reduction criteria. While our optimization results are likely to depend on the GCM, PSM, target network, and assumptions chosen, we demonstrated the general merit of the OSP approach by quantifying the impact of additional paleoclimate observations on the reconstruction of simulated preinstrumental climate fields and by exploring several design strategies. In the idealized situation presented here (perfect climate model, perfect coral system model), we gave the following responses to our motivating questions (section 3a):

Is the current coral network optimal?

The existing network oversamples the southwestern Pacific region. The OSP approach targets more sparsely sampled regions, thereby achieving a more uniform coverage of the reconstruction domain.

Where should the next samples be gathered?

Given the existing coral *δ*^{18}O network (i.e., tier 1), our list of potential sites (tier 1 + tier 2), and the goal of maximizing knowledge of the tropical SST field at 5° × 5° resolution, the next samples should be sought first in French Polynesia, then in the southern tropical Atlantic, followed by the equatorial Indian Ocean and the northern tropical Pacific. By contrast, maximizing knowledge of the tropical SSS field leans heavily toward new samples located in the southern tropical Pacific.

Are replicates more advantageous?

The choice of optimal sites depends sensitively on the initial assumption regarding the uncertainty in the paleoclimate observations (sensor precision; Evans et al. 1998b). While high-quality paleoclimate observations (SNR ≥ 1) lead to mostly unreplicated OSPs, replicates prove advantageous when errors are larger, in which case they essentially improve the SNR via the central limit theorem (assuming the observational error is random rather than systematic). The replication rate is highest in the central equatorial Pacific (Palmyra and Kiritimati) and the eastern Pacific (Galapagos and Clipperton Atoll) (Evans et al. 1998b), whether inferring the SST field alone or the SST and SSS fields jointly. Inferring the salinity field alone results in replicates that are more uniformly distributed within the western Pacific warm pool, besides a few places such as the Cook Islands and the Seychelles.

How many paleoclimate observations are needed to yield useful information about a climate field?

This experiment and the previous one lead to a similar conclusion: the quality of the paleoclimate observations determines the size of the paleo-observing network required.

Are corals more informative of temperature or salinity?

We showed that inferring the SSS resulted in a sensor network structurally different from the OSP inferring the SST. However, the OSP for the combined SST and SSS fields was very close to the one computed with the temperature field alone, because the contribution to **J** of *δ*^{18}O variance by the scaled SST dynamic range (section 3b) is larger than that of the scaled SSS dynamic range. In combination with the tendency for SSS and SST to covary in the tropical climate, these results support the use of coral *δ*^{18}O records to reconstruct SST (see Tierney et al. 2015).

How important is record length?

Record lengths played a major role in the OSP design, with variable pseudocoral time series length encouraging more uniformly distributed networks and less replication.

The idealized nature of these experiments was necessary to establish a proof of concept, so these results are merely illustrative of the type of questions one may address via this framework; they should not be taken literally. Some of these results (items 3–6 above) are general enough that they should carry over to the real world. However, the first two questions hinge crucially on the network of potential sites, the resolution, and the spatial relations simulated by the GCM, which are known to be unrealistic in CCSM4 and other GCMs from phase 5 of CMIP (CMIP5) because of the so-called "cold tongue" bias (Li and Xie 2014), among other things [see Bellenger et al. (2014) for a recent review].

Operationalizing the OSP framework for paleoclimatic applications could be done relatively easily and accommodate a wide range of applications. It would require 1) the use of instrumental data as target climate fields; 2) validated PSM(s) (ideally, evaluating sensitivity to parametric and structural uncertainties); 3) an accurate catalog of potential sampling sites defined using comprehensive and updated paleoclimatic and ecological databases, on an appropriately resolved target field; and 4) estimation of SNR values using existing, validated paleoclimate record calibration studies. In future work, we may apply this framework to real-world optimal coral sampling, and extend it to other classes of paleoclimatic observations.

## Acknowledgments

We thank all reviewers and the Editor for constructive reviews that improved the paper. JEG and MC acknowledge funding from NOAA awards NA10OAR4310115 and NA14OAR4310175. GH acknowledges funding from NSF award AGS-1304263 and NOAA award NA14OAR4310176.

## REFERENCES

*Climatic Variations and Forcing Mechanisms of the Last 2000 Years*, P. D. Jones, R. S. Bradley, and J. Jouzel, Eds., NATO ASI Series, Vol. 41, Springer, 603–624.

*Proc. 42nd Southeastern Symp. on System Theory (SSST)*, Tyler, TX, IEEE, 275–279.

*Proc. CLIVAR-PAGES Workshop on Representing and Reducing Uncertainties in High-Resolution Climate Proxy Data*, Trieste, Italy, CLIVAR. [Available online at http://www.ncdc.noaa.gov/paleo/reports/trieste2008/corals.pdf.]

*Proc. 40th Annual ACM Symp. on Theory of Computing (STOC ’08)*, Victoria, BC, Canada, ACM, 45–54.

*Proc. Int. Conf. on Machine Learning (ICML)*, Bellevue, WA, International Machine Learning Society, 1057–1064.

*Synoptic–Dynamic Meteorology and Weather Analysis and Forecasting: A Tribute to Fred Sanders*, L. F. Bosard and H. B. Bluestein, Eds., Amer. Meteor. Soc., 147–161.

*Easter Island: Scientific Exploration into the World’s Environmental Problems in Microcosm*, J. Loret and J. T. Tanacredi, Eds., Kluwer Academic/Plenum Publishers, 113–132.

## Footnotes

*Publisher’s Note:* This article was revised on 5 October 2015 to include additional text in the Acknowledgments section.