## 1. Introduction

Key climate system parameters—in particular equilibrium climate sensitivity (*S*_{eq}), effective vertical deep-ocean diffusivity (*K*_{υ}), and total aerosol forcing (*F*_{aer})—are often estimated by studies (usually formulated in Bayesian terms) that compare simulations by adjustable parameter climate models with observations. Examples of such studies include Forest et al. (2000, 2001, 2002, 2006, 2008, hereafter F00, F01, F02, F06, and F08, respectively; collectively the Forest studies), Andronova and Schlesinger (2001), Frame et al. (2005), Hegerl et al. (2006), Knutti et al. (2002), Sansó et al. (2008, hereafter SFZ08), Sansó and Forest (2009, hereafter SF09), Aldrin et al. (2012), and Ring et al. (2012). These studies provided six of the eight probability density functions (PDFs) given in the Intergovernmental Panel on Climate Change (IPCC) Fourth Assessment Report (AR4) for equilibrium climate sensitivity inferred from observed changes in climate [Hegerl et al. (2007), see their appendix 9.B for an explanation of such studies].

The Forest studies are excellent examples since they used a wide spread of instrumental observations and avoided dependence on existing ill-constrained estimates of *K*_{υ} and *F*_{aer} by jointly estimating them with *S*_{eq}. For this reason, using F06 as a starting point, this paper derives and implements an improved, objective Bayesian methodology, which provides better-defined PDFs for *S*_{eq} in particular. Further, by taking advantage of the final 6 years of model simulation data, unused in F06, and revising the experimental design to improve diagnostic power, an updated, closely constrained estimate for *S*_{eq} is obtained.

F06 and similar studies involve comparisons between observed temperatures at various spatiotemporal coordinates and climate model simulations. The models have adjustable calibrated parameters controlling key climate properties and are more suited than atmosphere–ocean general circulation models (AOGCMs) for exploring the entire parameter space and running multiple simulations at varying parameter settings.

F06 used the Massachusetts Institute of Technology (MIT) 2D climate model (2DCM) (Sokolov and Stone 1998; F06). Forcings from all greenhouse gases (specified explicitly), sulfate aerosols (calculated from emissions), stratospheric and tropospheric ozone, land use, solar irradiance, and volcanism (specified as stratospheric aerosol optical depth) (collectively referred to as GSOLSV) were included: see the F06 auxiliary material for details. To place F06 in context, it represented an update of F02 using more comprehensive forcings; F08 used the same methods and data as F06 except for its model simulations, which employed a later version of the MIT 2DCM, while Libardoni and Forest (2011) investigated sensitivities to the surface temperature dataset. Drignei et al. (2008) used a statistical model as a nonlinear regression surrogate to estimate the same parameters using F02's data and alternative correlation structures. SFZ08 and SF09 used F06 data but employed more complex hierarchical Bayesian methods, unlike the approaches used in F06 and this paper, which differ principally in the prior distribution used. The SFZ08 and SF09 posterior PDFs for *S*_{eq} and the other parameters differed substantially from those in F06.

F06 used three “diagnostics” (groups of variables whose observed values are compared to model simulations):

- Surface-air temperatures [surface (sfc)]: These are four equal-area latitude averages for each of the 5 decades comprising 1946–95, referenced to 1905–95 climatology (Jones et al. 1999).
- Deep-ocean temperatures [deep ocean (do)]: This is the trend in global mean 0–3-km-deep-layer pentadal averages ending in 1959–95 (Levitus et al. 2005).
- Upper-air temperatures [upper air (ua)]: These are the differences between 1986–95 and 1961–80 averages at eight standard pressure levels from 850 to 50 hPa on a 5° grid (Parker et al. 1997).

Simulations were run from 1860 to 2001 using 499 parameter combinations, with *S*_{eq} ranging from 0.5 to 15 K, *K*_{υ} from 0 to 64 cm^{2} s^{−1}, and *F*_{aer} from −1.5 to +0.5 W m^{−2}. (Units for these parameters are generally omitted from here on.) Cases with *S*_{eq} > 10 were discarded. For estimation, *K*_{υ} was parameterized as its square root (Sokolov et al. 2003), ocean heat uptake being proportional thereto. The term *F*_{aer} represents net forcing (direct and indirect) during the 1980s relative to pre-1860 levels and implicitly includes omitted forcings with patterns similar to those of sulfate aerosols. Means of four-member initial condition ensembles were used at each parameter combination to reduce the impact of internal variability. The diagnostics ended in 1995, matching F02 to enable the effects of including a more complete set of forcings to be illustrated, so the final six simulation years were not used. Reference should be made to F06 and F02 for a fuller description of the MIT 2DCM and simulation runs, applied climate forcings, and methods used.

F06 uses an optimal fingerprint method involving comparing the modeled temperatures **T**_{m}(**θ**_{m}) with the observed temperatures for each diagnostic. The model's effective *S*_{eq} changes somewhat with time as additional feedbacks are activated. Simulation variability and model error are ignored.

AOGCM control run data provide an estimate of the natural variability (climate noise) covariance matrix **C**_{N} for each diagnostic. The estimated noise covariance matrix is regularized by truncation, only the leading *κ* eigenvectors [empirical orthogonal functions (EOFs), or modes of variability] being retained in the estimate of its inverse, which is used to whiten the differences between observed and modeled values.

Measurement error for the surface and upper-air diagnostics is small compared to estimated internal climate variability and is therefore ignored. For the univariate deep-ocean diagnostic, neither observational measurement error nor control run variance dominates. The two variances are added to give the total variance used in whitening, truncation at *κ* EOFs being irrelevant here since the diagnostic is univariate.

For each diagnostic, goodness-of-fit statistics *r*^{2} (sums of squared whitened differences between the observations and the model simulations **T**_{m}(**θ**_{m})) are computed, from which F06 compute a likelihood representing the relative probability of that diagnostic's observations as a function of the candidate parameter value **θ**_{m}. The likelihood function is based on Δ*r*^{2}, the excess of *r*^{2} over its minimum value, having (after division by *m*) an *F*_{m,ν} distribution, *m* being the number of parameters being estimated and *ν* the degrees of freedom (DF) available for estimating **C**_{N} (see supplemental material for additional discussion). The minimum *r*^{2} is checked for consistency with the errors being generated by internal variability.
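A likelihood of this Δ*r*^{2}-based form can be sketched as follows. This is an illustrative implementation, not F06's actual code: the choices *m* = 3 and *ν* = 40 are assumptions, and the *F*-density is written out from its standard formula using only the standard library.

```python
import math

def f_pdf(x, d1, d2):
    """Density of the F distribution with (d1, d2) degrees of freedom."""
    if x <= 0:
        return 0.0
    log_beta = (math.lgamma(d1 / 2) + math.lgamma(d2 / 2)
                - math.lgamma((d1 + d2) / 2))
    log_num = (d1 / 2) * math.log(d1 / d2) + (d1 / 2 - 1) * math.log(x)
    log_den = ((d1 + d2) / 2) * math.log(1 + d1 * x / d2)
    return math.exp(log_num - log_den - log_beta)

def likelihood_from_delta_r2(delta_r2, m=3, nu=40):
    """Relative likelihood of a candidate parameter setting from its excess
    sum of squares delta_r2 = r^2 - min(r^2): delta_r2 / m is referred to an
    F(m, nu) density (a sketch of the Delta-r^2 recipe, with m estimated
    parameters and nu DF for the noise covariance estimate)."""
    return f_pdf(delta_r2 / m, m, nu)
```

A better-fitting parameter setting (smaller Δ*r*^{2} beyond the density's mode) receives higher relative likelihood; note that the unadjusted density vanishes as Δ*r*^{2} approaches zero.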

F06 uses a Bayesian paradigm, whereby probability distributions can be estimated for unknown parameters. Applying Bayes's theorem, F06 derives a joint posterior PDF for the parameter vector **θ** as the normalized product of the likelihood function from one diagnostic and a “prior” probability distribution consisting of the product of separate uniform priors for each parameter. Bayes's theorem is then applied twice more, each time multiplying the previous posterior PDF for **θ**, used as the prior, by the likelihood function from another diagnostic to obtain an updated posterior PDF. Marginal PDFs for individual climate system parameters are obtained by integrating out the other parameters from the final joint parameter posterior PDF. Although such PDFs may appear to provide precise probabilistic information, they are perhaps better viewed as indicating how likely it is that any chosen range brackets the parameter value.
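The sequential updating and marginalization just described can be sketched on a parameter grid. The 2D grid and Gaussian pseudo-likelihoods below are illustrative stand-ins for the 3D (*S*_{eq}, √*K*_{υ}, *F*_{aer}) grid and the three diagnostic likelihoods; all numerical values are assumptions.

```python
import numpy as np

# Sketch of F06-style sequential Bayesian updating on a parameter grid.
s = np.linspace(0.5, 10.0, 96)          # S_eq grid (K)
k = np.linspace(0.0, 8.0, 64)           # sqrt(K_v) grid
S, K = np.meshgrid(s, k, indexing="ij")

def gaussian_like(x, mu, sd):
    return np.exp(-0.5 * ((x - mu) / sd) ** 2)

# Three hypothetical diagnostic likelihoods over the joint grid:
like_sfc = gaussian_like(S, 3.0, 1.5) * gaussian_like(K, 2.0, 2.0)
like_do  = gaussian_like(S - 0.3 * K, 2.2, 1.0)
like_ua  = gaussian_like(S, 3.5, 2.5)

prior = np.ones_like(S)                 # product of uniform priors, as in F06
posterior = prior * like_sfc            # first application of Bayes's theorem
posterior *= like_do                    # update with the second diagnostic
posterior *= like_ua                    # update with the third diagnostic
posterior /= posterior.sum()            # normalize to unit total probability

marginal_s = posterior.sum(axis=1)      # integrate out K: marginal PDF for S_eq
mode_s = s[marginal_s.argmax()]
```

Because the uniform prior is constant, the order in which the likelihoods are multiplied does not matter here; section 2 explains why this no longer holds once noninformative priors are used.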

Inferences as to climate system parameter values will be affected by the data used. The choice of data (and parameterizations) may be guided by the physics of the climate system but remains somewhat subjective. There is merit in seeking data that well constrain parameter values, but there may be a trade-off with data quality and coverage. Surface, upper-air, and/or deep-ocean temperature data spanning much of the instrumental period are typically used. Using other data types or periods may give rise to significantly different parameter estimates, as can detailed data processing choices. The selection of data types to compare to model simulations is a key decision; in this paper we work from the F06 data choices.

## 2. Objective Bayesian inference

In order for Bayesian inference to reflect, insofar as possible, only the data from which it is derived—as is appropriate when reporting objectively stand-alone scientific results—a noninformative prior must be used (Bernardo and Smith 1994; Kass and Wasserman 1996). If the prior is informative, and not overwhelmed by the data, a Bayesian posterior density is unlikely to approximate the density arising if, hypothetically, the experiment(s) concerned were to be repeated indefinitely, so valid frequentist confidence intervals cannot be estimated. Since the available data are insufficient to constrain climate parameters narrowly, an informative prior exerts strong influence.

Typically, scientific studies not expressed in Bayesian terms implicitly use a noninformative prior when considered from a Bayesian perspective. That occurs when observables are sampled according to their probability distributions. Gregory et al. (2002) estimated the change in global mean temperature between two periods and the corresponding change in forcing net of ocean heat uptake. They repeatedly sampled the probability distributions for those estimated changes, assuming independent normal error distributions, to derive a PDF for the ratio of the changes, which gives *S*_{eq}. The resulting PDF effectively embodies a noninformative prior. Similarly, Forster and Gregory (2006) diagnosed *S*_{eq} by ordinary least squares regression of changes in forcing net of radiative flux on changes in surface temperature, all errors being assumed Gaussian. They stated that this was equivalent to assuming uniform priors in the data (observables), which is noninformative given Gaussian errors.
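The Gregory et al. (2002) sampling-of-observables approach can be sketched as a Monte Carlo calculation. The central estimates and uncertainties below are illustrative assumptions, not the values actually used in that study; only the structure (a ratio of two sampled normal variables, scaled by the forcing for doubled CO2) follows the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
f2x = 3.7                                  # W m-2, forcing for doubled CO2
# Illustrative (not Gregory et al.'s actual) estimates and uncertainties:
dT = rng.normal(0.5, 0.1, n)               # K, change in global mean temperature
dF_minus_dQ = rng.normal(1.2, 0.4, n)      # W m-2, forcing net of heat uptake

S = f2x * dT / dF_minus_dQ                 # sampled values of S_eq
S = S[(S > 0) & (S < 20)]                  # discard the unphysical tail

# Histogram estimate of the implied PDF for S_eq; it is right skewed
# because the denominator's uncertainty enters reciprocally.
pdf, edges = np.histogram(S, bins=100, density=True)
median = float(np.median(S))
```

Sampling the observables' error distributions in this way effectively embodies a noninformative prior, as the text notes; no prior over *S*_{eq} itself is ever specified.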

Explicitly Bayesian climate sensitivity studies have commonly used uniform priors, or sometimes deliberately informative “expert” priors, for the parameters being estimated. Frame et al. (2005) advocated sampling a flat prior distribution in *S*_{eq} if that is the target of the estimate and did not mention noninformative priors, but nevertheless derived PDFs for *S*_{eq} using various different sampling methods. Their uniform sampling-of-observables method, which they stated was appropriate for, and gave an objective range for, *S*_{eq} relevant to twenty-first-century warming forecasts, is very similar to the one we propose, although their implementation requires equal numbers of observables and parameters, and likelihood skewness may mean the prior involved was only approximately noninformative. Pueyo (2012) asserted that the problems of estimating *S*_{eq} and its reciprocal, the climate feedback parameter, were equivalent, and hence their priors should have the same form, implying a uniform-in-log(*S*_{eq}) prior. However, it is not clear that in practice the two problems have the same characteristics. In any event, Pueyo's arguments are not applicable where, as here, the prior is for jointly estimating *S*_{eq} and other parameters and is a function of all those parameters.

When data affected by random errors bear strongly nonlinear relationships to parameters upon which the (hypothetical) true data values depend, as with *S*_{eq} and *K*_{υ} in particular, a uniform prior is only noninformative if applied to the true *data*, the likelihood functions for which are of known form, centered on the observations (Box and Tiao 1973). Applied to the model *parameters*, as in F06, a uniform prior will be informative and may lead to substantially erroneous estimated parameter PDFs. Many parameterizations are possible. In this paper, a uniform prior is applied to the true data, a parameterization in which a uniform prior is noninformative.

Correct inference about parameters that have nonlinear relationships with data is impossible when, as in F06, only a sum of squared whitened differences (*r*^{2}) is computed. The sum of squares suffices to derive a probability density in data space but not to map that density into a density in parameter space. To estimate objectively a joint probability density for the parameters, the relationship between natural volume elements in data space and in parameter space must be computed. Determining a noninformative prior for the parameters is intimately bound up with that metric relationship (Kass 1989). Computing it requires information on how each whitened difference changes with model parameter values. Without such information, there is no way of correctly allocating probability mass between different locations in parameter space with identical sums of squared differences. For example, if the position of a point in 3D space relative to an origin is measured with error in Cartesian coordinates but inference is required in spherical coordinates, equal sums of squared coordinate errors correspond to volume elements in spherical-coordinate space whose size varies with position, so probability cannot be allocated correctly from the sum of squares alone.
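The 3D point example can be made concrete. For Cartesian measurements with spherical-coordinate parameters (r, θ, φ), the volume-element ratio is |det J| = r² sin θ, which plays the role of the noninformative prior; the sketch below checks this determinant numerically against the analytic expression.

```python
import numpy as np

def to_cartesian(p):
    """Map spherical coordinates (r, theta, phi) to Cartesian (x, y, z)."""
    r, th, ph = p
    return np.array([r * np.sin(th) * np.cos(ph),
                     r * np.sin(th) * np.sin(ph),
                     r * np.cos(th)])

def numeric_jacobian(f, p, h=1e-6):
    """Central-difference Jacobian of f at p."""
    p = np.asarray(p, dtype=float)
    cols = []
    for i in range(p.size):
        dp = np.zeros_like(p)
        dp[i] = h
        cols.append((f(p + dp) - f(p - dp)) / (2 * h))
    return np.column_stack(cols)

p = np.array([2.0, 0.7, 1.1])             # an arbitrary (r, theta, phi)
J = numeric_jacobian(to_cartesian, p)
det = abs(np.linalg.det(J))
analytic = p[0] ** 2 * np.sin(p[1])       # r^2 sin(theta)
```

A uniform prior on (r, θ, φ) would misallocate probability precisely because this factor varies with position; the sum of squared Cartesian errors alone carries no information about it.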

Furthermore, Bayesian updating cannot provide objective inference when the data used in deriving the posterior to be updated (forming the prior) and the data from which the updating likelihood function is derived have differing nonlinear relationships with the parameters, as the sets of diagnostic data in F06 do. In that case, the noninformative parameter priors required for objective Bayesian inference from the two datasets individually would differ. Using the appropriate individually noninformative prior, Bayesian updating would produce a different result according to the order in which Bayes's theorem was applied to the two datasets (see supplemental material for additional discussion). That noninformative priors and Bayesian updating conflict is a known problem (Kass and Wasserman 1996).

Noninformative priors vary with the experiment involved; they cannot be directly interpreted in probabilistic terms (Bernardo and Smith 1994). In order for the posterior to be dominated to the greatest possible extent by the data, however weak, and thereby for the prior to convey no particular knowledge as to the parameters, a noninformative prior must differ according to the nonlinear relationships the data have with the parameters.

## 3. Data

### a. Data sources

The raw surface and upper-air observational datasets employed in F06 were revised subsequent to use in F06 and are no longer online. Only partial AOGCM control run data still exist. We were initially unable to obtain original data for F06 and instead obtained archives of processed diagnostic data for two related studies: Curry et al. (2005, hereafter CSF05) and SFZ08. F06 stated that CSF05 used its data, while SFZ08 stated that it used F06 data. An archive of F06's computer code and partially processed annual and decadal data (GRL06_reproduce) was subsequently made available, with model simulation data verified against partially extant raw data. Analysis showed that the SFZ08 surface and upper-air diagnostic data were essentially identical to those generated by GRL06_reproduce and that the significantly different CSF05 data were misprocessed.

The GRL06_reproduce MIT model data ended in November 2001, with the last 6 years' data being discarded since the F06 diagnostics ended in 1995, but the surface observational data used ran to August 1996. Dr. Forest has confirmed (C. E. Forest 2012, personal communication) that there is thus a 9-month discrepancy between the F06/SFZ08 model simulation and observational surface diagnostic data and also that the forcings used were valid through to 2001. The timing mismatch has little impact on results.

We present results using the F06/SFZ08 5-decade to 1995/96 surface diagnostic data so as to provide accurate comparisons with the F06 results. We also present results using a revised extended surface diagnostic with correctly matched model and observational data for the 6 decades to 2001, using a 9-decade climatology to 1991 to compute temperature anomalies. Doing so substantially improves the constraint provided by the surface diagnostic. With a surface diagnostic comprising 5 decades, all included in the climatology, greater model simulation temperature increases at higher *S*_{eq} settings are more heavily diluted by deduction of increased model simulation climatological means. This effect may be illustrated by calculating, for both surface diagnostics, what proportion of the difference between the global mean model-simulated temperatures in the key final diagnostic decade at low and high *S*_{eq} settings survives deduction of the climatological means. As a result, over much of the *F*_{aer} range, the F06 surface diagnostic is unable to provide any significant constraint on *S*_{eq} at high *S*_{eq} values.
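The dilution effect can be illustrated with a toy calculation: a hypothetical steady warming trend, expressed as anomalies, is damped when the reference climatology includes the diagnostic decades themselves. The decade count and warming rate below are assumptions for illustration only.

```python
import numpy as np

decades = np.arange(10)                 # 10 decades of hypothetical model output
temps = 0.2 * decades                   # steady 0.2 K/decade warming

# 5-decade diagnostic whose decades all lie inside its own 5-decade climatology
# (as in the F06 design):
diag5 = temps[5:10] - temps[5:10].mean()

# 6-decade diagnostic referenced to a 9-decade climatology ending before the
# final decade (as in the revised design):
diag6 = temps[4:10] - temps[0:9].mean()

# The final-decade anomaly is larger under the revised design, so the extra
# simulated warming at high S_eq settings is less heavily diluted.
final_anomaly_f06, final_anomaly_revised = diag5[-1], diag6[-1]
```
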

For surface observational data to 2001, to which the original Met Office Hadley Centre Climate Research Unit temperature (HadCRUT) dataset does not extend, we use the latest version, HadCRUT4 (Morice et al. 2012). Otherwise, we use GRL06_reproduce data (for the F06 surface and upper-air diagnostics, the essentially identical SFZ08 data), save for using the third climate configuration of the Met Office Unified Model (HadCM3; Gordon et al. 2000) control run deep-ocean data.

When employing the surface diagnostic extending to 2001, we use the full Levitus et al. (2005) deep-ocean observational dataset, matching the model simulation data for 40 pentads ending 1998 rather than 37 ending in 1995. The periods covered by the various diagnostics did not match in F06 and need not do so, but it is preferable that they be broadly similar.

### b. Issues with the upper-air diagnostic

The Bayesian inference in both our and F06's method involves multiplying probability densities relating to the three diagnostics' whitened differences. Doing so will only be valid if errors in those differences are independent. If the errors are positively correlated, multiplying the diagnostic likelihood functions will overstate statistical significance. Covariance of the temperature change variables within the individual diagnostics is addressed by regularized whitening, resulting in a reduced number of variables, which should be independent provided AOGCM control run simulations represent natural variability sufficiently accurately. This issue is discussed in AT99, where a consistency criterion is proposed based (in the F06 case and ours) on the minimum *r*^{2}, given the truncation level *κ*, lying within the 5%–95% points of its theoretical distribution.

However, whitening does not address interdiagnostic correlations. Nonindependence of observational variability is not a particularly serious concern regarding the deep-ocean diagnostic, where natural variability is only moderately correlated with surface temperature variability and most variance comes from measurement/analysis error. However, it is a concern between the surface and upper-air diagnostics, where no dilution of correlations by added measurement/analysis error occurs. Because of linkage via the lapse rate, fluctuations in surface and tropospheric temperatures are likely to be highly correlated. In the tropics, the tropopause is generally above the 150-hPa level. Figure 1 shows that the decadal-scale correlation of natural variability of upper air with surface temperatures, as simulated by the second climate configuration of the Met Office Unified Model (HadCM2; Johns et al. 1997) control run, is close to one from 20°S–20°N except at 100- and 50-hPa pressure levels. Outside the tropics, these correlations remain generally high for the 850–300-hPa levels.

Even ignoring the correlation issue, inferences from the upper-air diagnostic are problematic. Parameter inference from the upper-air diagnostic varies greatly as *κ*_{ua} and/or the weighting of levels is changed and is also somewhat sensitive to the smoothness of interpolation. Mass weighting of upper-air data was employed in F06, treating each pressure level as extending halfway toward the next, and toward 1000 hPa (surface pressure) at the bottom and 30 hPa at the top. If alternatively the top (50 hPa) level is treated as extending halfway toward zero pressure, increasing its weight modestly, from 4.0% to 5.6%, parameter inference in the *S*_{eq}–*K*_{υ} plane from the fit between the model and observational upper-air data changes dramatically.

Figure 2a replicates the AT99 statistical consistency test on the sum of squares of whitened differences between best-fit modeled and observed upper-air diagnostic temperatures. At F06's selection of *κ*_{ua}, the minimum *r*^{2} with the top level weighted toward 0 hPa is 10.2, easily satisfying the AT99 consistency test, whereas with the F06 weightings it is 20.2, failing the stricter variant of the test. These figures differ from the *r*^{2} values produced by the GRL06_reproduce code, which do not accurately reflect the F06 method and have a much lower and incorrect minimum of 11.4.

Given high correlations and sensitivity to weightings, the validity of inference based on inclusion of information from the upper-air diagnostic is questionable. Fortunately, the upper-air diagnostic is considerably less informative than the surface diagnostic in constraining parameter values. The results using the extended surface diagnostic avoid the problematic upper-air diagnostic by employing only surface and deep-ocean diagnostics; adding the upper-air diagnostic, using the preferred weighting and truncation, has only a modest effect on them.

## 4. Method

### a. Introduction

We retain the F06 approach of whitening diagnostic variables, using the same regularized inverse climate noise covariance matrix estimates. As discussed in section 3, whitening does not remove interdiagnostic noise correlations, which appear substantial as regards the upper-air diagnostic but tolerably low between the surface and deep-ocean diagnostics. We remove dependence on parameter surface flatness by working with the full set of whitened variables rather than only their sum of squares. The *F* distribution we use relates to *r*^{2}, not Δ*r*^{2}, and its density requires a geometric volume adjustment appropriate to the geometry involved; the unadjusted density goes to zero at the best-fit point (see supplemental material for additional discussion). This step appears to have been omitted in F06: its effects increase with dimensionality and are modest for the univariate deep-ocean diagnostic.

Although working with full sets of whitened differences, rather than just their sum of squares, is much more computationally demanding than the F06 method, given the large number of parameter combinations involved, we thereby retain the information required to derive a PDF conversion factor between whitened-variable space and parameter space.
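The conversion-factor computation can be sketched numerically: form the κ × m Jacobian of the whitened differences with respect to the m parameters by finite differencing, then take √det(JᵀJ). The smooth mapping below is an arbitrary stand-in for interpolated model output, not the actual diagnostics.

```python
import numpy as np

def whitened_diffs(params):
    """Hypothetical smooth dependence of 6 whitened differences on 3
    parameters (stand-ins for S_eq, sqrt(K_v), F_aer)."""
    s, k, f = params
    return np.array([s + 0.5 * k, s * k, np.log(s) - f,
                     f ** 2 + k, s - f, k - 0.1 * s * f])

def conversion_factor(params, h=1e-5):
    """sqrt(det(J^T J)) with J the kappa x m finite-difference Jacobian of
    the whitened differences with respect to the parameters."""
    params = np.asarray(params, dtype=float)
    cols = []
    for i in range(params.size):
        dp = np.zeros_like(params)
        dp[i] = h
        cols.append((whitened_diffs(params + dp)
                     - whitened_diffs(params - dp)) / (2 * h))
    J = np.column_stack(cols)           # kappa x m
    return float(np.sqrt(np.linalg.det(J.T @ J)))

factor = conversion_factor([3.0, 1.2, -0.4])
```

Because the Jacobian varies over parameter space, the factor reallocates probability between locations with identical sums of squared differences, which is exactly the information a sum-of-squares-only method discards.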

### b. Derivation and interpretation of the PDF conversion factor

Applying Bayes's theorem in whitened data space, with a uniform prior on the true values of the whitened variables, yields a posterior density for those true values. Multiplying that density by the factor relating volume elements in whitened-variable space to those in parameter space converts it into a joint posterior PDF for the parameter vector **θ**; the conversion factor thereby plays the role of a prior for **θ**. This will be a noninformative Jeffreys' prior, since the prior used when applying Bayes's theorem to the whitened temperatures was a noninformative Jeffreys' prior, and Jeffreys' priors are invariant under reparameterization.

The standard Jeffreys' noninformative prior for parameters related to normally distributed whitened variables, derived in Jewson et al. (2009), is indeed identical to this conversion factor. The likelihood for each diagnostic is based on its *r*^{2} statistic conforming, after a (*r*^{2})^{κ/2−1} geometric volume adjustment (see supplemental material for additional discussion), to a χ^{2} distribution with DF equal to the number *κ* of whitened differences. To allow for variance uncertainty, we replace the χ^{2} distribution with an *F* distribution (a *t* distribution for the deep-ocean diagnostic), as in the F06 method, giving the combined likelihood as the product of the three diagnostics' likelihood functions.

To simplify the mathematics, the derivation of the PDF conversion factor from whitened variables to parameter space—which allocates probability according to relative volumes in those spaces—ignores uncertainty in the variance of the whitened differences arising from **C**_{N} being estimated with only *ν* DF. Such variance uncertainty has little effect on the calculation involved, and the conversion factor has much less influence than the joint likelihood (where variance uncertainty is allowed for). Variance uncertainty is small for the surface diagnostic since *ν*_{sfc} is large.

## 5. Implementation

### a. Interpolation

F06 computed *r*^{2} for each of the modeled parameter combinations and interpolated it to a fine parameter grid spanning their range in two stages, the first to a spacing of Δ*S*_{eq} = 0.1 K. We instead interpolate the individual whitened variables themselves before computing *r*^{2}, so that model simulation variability can be allowed for.

We follow Curry (2007) in using a thin plate spline (TPS) single-stage interpolation. For convenience we extrapolate *S*_{eq} into the range 0–0.5; as likelihood is extremely low there, this has negligible impact. The interpolation fits a TPS, separately for each variable, so as to best match actual values at all model run parameter combinations. The total squared misfits are minimized subject to a smoothness constraint, which by giving influence to all actual values in a neighborhood restricts the impact of distortions in individual values caused by model simulation variability. The F06 method is less able to reduce the effects of model simulation variability both because it uses two steps and because the relationship between the quantity being interpolated and model simulation variability is indirect.
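A TPS-style single-stage smoother can be sketched as follows. This is a minimal polyharmonic radial-basis-function smoother with kernel φ(r) = r (the polyharmonic kernel for three dimensions) plus a linear polynomial and a smoothing parameter `lam`; it illustrates the idea, not Curry (2007)'s actual implementation, and the test surface and noise level are assumptions.

```python
import numpy as np

def tps_fit(X, y, lam=1e-3):
    """Fit a smoothed polyharmonic spline to values y at 3D points X.
    Larger lam = smoother fit, more damping of point-wise noise."""
    n = X.shape[0]
    K = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # phi(r) = r
    P = np.hstack([np.ones((n, 1)), X])                         # 1, s, k, f
    A = np.zeros((n + 4, n + 4))
    A[:n, :n] = K + lam * np.eye(n)     # smoothing via ridge on the kernel
    A[:n, n:] = P
    A[n:, :n] = P.T
    b = np.concatenate([y, np.zeros(4)])
    coef = np.linalg.solve(A, b)
    return coef[:n], coef[n:]

def tps_eval(X, w, a, Xq):
    """Evaluate the fitted spline at query points Xq."""
    K = np.linalg.norm(Xq[:, None, :] - X[None, :, :], axis=-1)
    P = np.hstack([np.ones((Xq.shape[0], 1)), Xq])
    return K @ w + P @ a

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, (120, 3))                 # model-run "parameter" points
y = np.sin(2 * X[:, 0]) + X[:, 1] * X[:, 2]     # smooth hypothetical surface
y_noisy = y + rng.normal(0, 0.02, y.size)       # model simulation variability

w, a = tps_fit(X, y_noisy, lam=1e-2)
y_hat = tps_eval(X, w, a, X)                    # smoothed values at run points
```

The smoothing term gives every nearby model-run value some influence at each point, which is what limits the distortion caused by simulation variability in individual runs.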

Since we compute derivatives of the whitened interpolated variables with respect to the parameters—and from that (Jacobian) matrix the PDF conversion factor/noninformative prior—by differencing across adjacent fine grid cells, reasonably smooth interpolation is preferable. We achieve this by restricting the DF used in interpolation to below that resulting from the default smoothness constraint. With 256 DF for all diagnostics, the noninformative prior remains somewhat artifacted, particularly at model run locations. Using 128 DF instead for all diagnostics produced less artifaction, while still departing only modestly from model ensemble values.

However, the interpolation should depart from model ensemble values on account of model simulation variability (Curry 2007). Using 128 DF, interpolation error at model run locations averages 60%–100% of estimated model ensemble variability for the surface diagnostic, confirming that choice is reasonable. Similar comparisons suggest that 64 DF would be appropriate for the upper-air diagnostic interpolation and 256 DF for the deep-ocean diagnostic. These values are used for all results. The parameter marginal posterior PDFs using our method are very similar with 64, 128, or 256 DF interpolation.

When using the F06 method, parameter PDFs exhibit more sensitivity to the DF used in interpolation, principally for the upper-air diagnostic. TPS upper-air interpolation with unrestricted DF results in a flattening—starting at *S*_{eq} approaching 4—appearing in the *S*_{eq} PDF. Investigation reveals that the flattening is only noticeable when unrestricted DF interpolation is used for the 0°–5°S latitude band, where it appears that some data misprocessing may have occurred. The even more limited reduction of model simulation variability with the F06 interpolation method may account for such flattening being more pronounced in the published F06 *S*_{eq} PDF.

### b. Whitening the diagnostic variables

We follow F06's use of surface and upper-air data from the 1691-yr HadCM2 control run, masking for observational availability. F06 found that PDFs resulting from use of surface control run data from different AOGCMs did not differ qualitatively. Before generating the new 6-decade surface diagnostic, we verified that, using the F06 diagnostic periods, our code accurately generated the F06/SFZ08 surface diagnostic from model, observational, and control data extracted from the GRL06_reproduce archive.
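The regularized whitening used for these diagnostics can be sketched with synthetic data: estimate **C**_{N} from control-run segments, retain the leading κ EOFs, and whiten the model-minus-observation differences with them. The dimensions and random inputs below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
p, n_seg, kappa = 20, 40, 8               # variables, control segments, truncation

# Synthetic control-run segments with a nontrivial covariance structure:
control = rng.normal(size=(n_seg, p)) @ rng.normal(size=(p, p)) / p
C_N = np.cov(control, rowvar=False)       # estimated climate noise covariance

evals, evecs = np.linalg.eigh(C_N)
order = np.argsort(evals)[::-1]           # leading EOFs first
lam, E = evals[order[:kappa]], evecs[:, order[:kappa]]

diff = rng.normal(size=p)                 # model-minus-observed differences
whitened = (E.T @ diff) / np.sqrt(lam)    # kappa unit-variance variables
r2 = float(whitened @ whitened)           # goodness-of-fit statistic
```

Projecting onto the leading EOFs and scaling by the square roots of their eigenvalues leaves variables that, under the estimated noise covariance, are uncorrelated with unit variance, so their squared sum is the *r*^{2} statistic.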

Figure 2b replicates the AT99 consistency test on the sum of squares of whitened differences between best-fit modeled and observed surface diagnostic temperatures. Using the GRL06_reproduce HadCM2 control data, the 5- and 6-decade diagnostic data fail the AT99 consistency test beyond moderate truncation levels *κ*_{sfc}.

In view of the significant difference in inference as to *S*_{eq} arising from use of the revised surface diagnostic and the danger that this might arise from a particular EOF correlating with some noise pattern in the observations, we have examined the effects of excluding one EOF at a time. Since the AT99 consistency test is only satisfied up to moderate truncation levels, we recomputed *r*^{2} values with the contribution thereto from each retained EOF in turn removed, and derived parameter PDFs for each case, using the same 40-yr revised deep-ocean diagnostic throughout. No major differences emerged, with the mode of all the PDFs for *S*_{eq} being within ±0.2 K of the main-result mode.

As shown in Fig. 3, using the revised diagnostics, parameter PDFs become less well constrained at higher truncation levels.

We use the F06/SFZ08 HadCM2 upper-air control data matrix, containing 40 nonoverlapping samples.

Whitening the deep-ocean diagnostic differences involves no truncation but necessitates aggregating climate noise and estimated uncertainty in the observational temperature trend. Weighted linear least squares regression is used, as in F06, giving observational trend estimates for the 37 years to 1995 (40 years to 1998) of 0.70 (0.68) mK yr^{−1}, with unadjusted standard error (SE) of 0.08 (0.07) mK yr^{−1}. However, the regression residuals are highly autocorrelated, largely because of the 80% overlap of adjacent pentads. Adjusted, approximately twice as high, SEs estimated from regressions on nonoverlapping pentadal data were therefore used. This ignores the previously discussed gain in effective DF resulting from using overlapping samples. Most remaining autocorrelation is probably because of slow natural variability in ocean temperature, which is accounted for by adding climate noise rather than measurement error correlation. To test sensitivity to observational trend SEs, variants of the main results using 50% higher SEs were generated. These variants also serve to test the impact of climate noise correlation between the surface and deep-ocean diagnostics since the higher SEs reduce (to below 15%) the noise contribution to total deep-ocean whitening variance. In each case, the effective DF *t* distribution was adjusted to reflect the adjustments made to the SEs.
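The overlapping-pentad issue can be sketched with synthetic data: OLS on overlapping pentadal means understates the trend standard error because adjacent pentads share 80% of their years, which regressing on the non-overlapping subset avoids. The trend and noise magnitudes below are illustrative assumptions, not the Levitus et al. (2005) values.

```python
import numpy as np

rng = np.random.default_rng(3)
years = np.arange(1958, 1999)
annual = 0.0007 * (years - years[0]) + rng.normal(0, 0.01, years.size)

# Overlapping pentadal means (one per year, 80% overlap between neighbors):
pent = np.convolve(annual, np.ones(5) / 5, mode="valid")
t = np.arange(pent.size)

def ols_trend_se(x, y):
    """OLS trend and its (independence-assuming) standard error."""
    X = np.column_stack([np.ones_like(x, dtype=float), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2 = resid @ resid / (y.size - 2)
    cov = s2 * np.linalg.inv(X.T @ X)
    return float(beta[1]), float(np.sqrt(cov[1, 1]))

trend_all, se_all = ols_trend_se(t, pent)            # overlapping pentads
trend_nov, se_nov = ols_trend_se(t[::5], pent[::5])  # non-overlapping subset
# se_nov is typically roughly double se_all, as in the adjustment described,
# because the overlapping residuals are strongly autocorrelated.
```
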

We obtained deep-ocean control data from the 6100-yr HadCM3 control run, whereas F06 used years 1–900 of the Geophysical Fluid Dynamics Laboratory (GFDL) R30 spectral resolution control run (Delworth et al. 2002). Trend variability in the first one-third of these control runs greatly exceeds that in the middle and final thirds, with the models appearing to be adjusting nonmonotonically toward dynamic equilibrium during the first one-third, perhaps because of dynamically inconsistent initial conditions. We therefore use the final two-thirds of the HadCM3 data; the corresponding segments of GFDL R30 data yield a similar estimate. Our deep-ocean climate noise estimate is accordingly only about half F06's. Conversely, our observational trend SE is about double F06's since F06 made no autocorrelation adjustment. The aggregate deep-ocean trend variance used (climate noise plus squared observational trend SE), without the further 50% increase in SE, is close to that in F06.

### c. Off-grid probability mass

We adopt the F06 approach of assigning zero probability to off-grid regions and normalizing to unit total probability. The joint parameter posterior PDFs are low at all grid boundaries except (when using the objective Bayesian method with the original F06 diagnostics) to a modest extent at the upper *S*_{eq} boundary. If substantial parts of the likelihood lay in infeasible parameter regions, parameter inference would be problematic.

## 6. Results

The left-hand panels in Fig. 3 show marginal posterior PDFs for *S*_{eq}, *K*_{υ}, and *F*_{aer} obtained with the original diagnostics at the preferred settings.

Using the F06 method, PDFs for *S*_{eq} are much worse constrained than when using our new method, particularly at the deprecated settings. The *S*_{eq} PDF using our method has a shape approximating that expected theoretically (Roe and Baker 2007): when converted into a PDF for the climate feedback parameter, which is reciprocally related to *S*_{eq}, its distribution is close to normal. The *S*_{eq} PDFs using the F06 method, when so converted, are much less symmetric.
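The reciprocal conversion used in this check is a standard change of variables. The sketch below assumes a hypothetical normal distribution for the feedback parameter λ = 1/*S*_{eq} (in arbitrary units; the mean and spread are illustrative) and derives the implied, right-skewed PDF for *S*_{eq} via the Jacobian |dλ/dS| = 1/S².

```python
import numpy as np

s = np.linspace(0.5, 15.0, 5000)
mu_lam, sd_lam = 0.5, 0.1                 # hypothetical feedback distribution

def normal_pdf(x, mu, sd):
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

# Change of variables lambda = 1/S: p(S) = p_lambda(1/S) * |d lambda / dS|
p_s = normal_pdf(1.0 / s, mu_lam, sd_lam) / s ** 2

ds = s[1] - s[0]
mode_s = float(s[p_s.argmax()])
mean_s = float((s * p_s).sum() / p_s.sum())
# Right skewed: the mean exceeds the mode, the shape Roe and Baker (2007)
# describe for S_eq when the feedback parameter is near normal.
```

Running the conversion in the opposite direction, from an estimated *S*_{eq} PDF to a feedback-parameter PDF, and inspecting the result for normality is the symmetry check described in the text.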

The PDFs for *F*_{aer} are likewise better constrained using our method.

The F06 main result PDFs based on uniform priors are shown for comparison. They differ slightly from the PDFs we compute using the F06 method. Apart from producing *r*^{2} values not according with its method, as already mentioned, the GRL06_reproduce code shows that in computing likelihoods the *F* distribution's cumulative distribution function (CDF) was erroneously used, rather than its PDF, and that the univariate deep-ocean diagnostic was treated as trivariate, its *r*^{2} value being wrongly divided by 3.

Using our method, the 5%–95% ranges for *S*_{eq}, *K*_{υ}, and *F*_{aer} are respectively 2.0–3.6 K, 0.1–1.3 cm s^{−½}, and −0.6 to −0.15 W m^{−2}. Using the F06 method (in parentheses: as reported in F06) the corresponding ranges are 2.0–6.5 (2.1–8.9) K, 0.2–1.7 (0.2–2.0) cm s^{−½}, and −0.75 to −0.2 (−0.74 to −0.14) W m^{−2}. The modes using our and the F06 method are respectively 2.4 and 2.5 (2.9) K for *S*_{eq} and 0.6 and 0.7 (0.8) cm s^{−½} for *K*_{υ}.
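The 5%–95% ranges and modes quoted here are functionals of the marginal PDFs, which are tabulated on the parameter grid. A minimal sketch of extracting such bounds from a gridded PDF (illustrative code, not the study's own) is:

```python
import numpy as np

def bounds_and_mode(grid, pdf, lo=0.05, hi=0.95):
    """5%-95% credible bounds and mode of a PDF tabulated on a 1-D grid."""
    widths = np.diff(grid)
    # Trapezoid-rule CDF, normalized to unit total probability
    cdf = np.concatenate([[0.0], np.cumsum(0.5 * (pdf[1:] + pdf[:-1]) * widths)])
    cdf /= cdf[-1]
    return np.interp(lo, cdf, grid), np.interp(hi, cdf, grid), grid[np.argmax(pdf)]
```

For a standard normal density tabulated on a fine grid this returns approximately (−1.645, 1.645, 0.0), the familiar 90% bounds and mode.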

Imposing the assumption, implicit in the F06 method, that the parameter surface is flat raises the 95% bound for *S*_{eq}, but only by 0.3 K even in the 50% higher than standard deep-ocean observational trend SE case. This confirms that assuming flatness is inappropriate, although the effects are modest here.

The right-hand panels in Fig. 3 show PDFs corresponding to those in the left-hand panels but using the revised diagnostics: the longer 6-decade surface diagnostic to 2001, the 40-yr deep-ocean diagnostic to 1998, and no upper-air diagnostic. The PDFs for *S*_{eq} are narrower and have lower modes, while those for *F*_{aer} are wider, particularly when using the F06 method. Varying *F*_{aer} has only a small effect on model-simulated surface temperatures, and hence on the diagnostic fit, when model climate sensitivity is low. A looser constraint on *F*_{aer} is therefore a counterpart of *S*_{eq} being tightly constrained at lower levels.

As when using the F06 diagnostics, we emphasize results using our new method, under which the PDFs are little changed, the *F*_{aer} PDF shifting only slightly. However, PDFs using the F06 method become substantially less well constrained.

Using our method, the 5%–95% ranges for *S*_{eq}, *K*_{υ}, and *F*_{aer} are respectively 1.2–2.2 K, 0.3–2.1 cm s^{−½}, and −0.55 to 0.0 W m^{−2}. Using the F06 method the corresponding ranges are 1.1–2.9 K, 0.5–4.4 cm s^{−½}, and −0.8 to −0.05 W m^{−2}. The modes (medians) using our and the F06 method are, respectively, 1.6 and 1.5 K (both 1.6 K) for *S*_{eq} and 0.9 and 1.0 (1.0 and 1.3) cm s^{−½} for *K*_{υ}.

Imposing a 50% increase in the estimated deep-ocean observational trend SE widens the ranges only modestly using our method, the 95% bound for *S*_{eq} rising to 2.3 K. Using the F06 method they increase somewhat more: most notably, the 95% bound for *S*_{eq} becomes 4.4 K.

Figure 4 shows the computed PDF conversion factor from whitened difference space to parameter space (or noninformative joint parameter prior) used to generate the new-method revised-diagnostics results; a cross section at fixed *F*_{aer} values is shown. When instead conditioned on other *F*_{aer} values, the prior retains its broad shape over the range where the *F*_{aer} likelihood is significant but scales up by a factor of several times as *F*_{aer} becomes less negative. The sharp decline in the prior with increasing *S*_{eq} reflects the data becoming less informative about *S*_{eq} at high sensitivities.

Model-prediction variability remaining after interpolation accounts for departures from smoothness and monotonicity in the noninformative prior. At low *S*_{eq}, where temperature changes are small, model variability results in some of the simulated surface diagnostic temperatures changing in the wrong direction, producing an upturn in the prior there.

Figure 5 shows marginal joint credible regions for parameter pairs. The substantial surface diagnostic likelihood existing at very high *S*_{eq} when using the F06 diagnostics is only reduced to low levels (producing better-constrained PDFs, particularly for *S*_{eq}) by the falling value of the noninformative prior; the other diagnostic likelihoods cannot by themselves sufficiently reduce it. Using the revised diagnostics, the *S*_{eq} PDF is reasonably constrained even when employing the F06 uniform-priors method.

## 7. Discussion

The Forest papers develop a powerful means of estimating climate sensitivity jointly with uncertain ocean diffusivity and aerosol forcing. We develop a revised, objective Bayesian, statistical inference approach that improves their methods, principally by use of a noninformative prior but also by the avoidance of Bayesian updating (which is incompatible therewith) and of dependence on parameter surface flatness, and by incorporating a geometric volume adjustment.

Using our objective Bayesian method, the F06 approach of comparing observed with model-simulated spatiotemporal surface temperature patterns constrains parameter estimates fairly well with the same diagnostics as in F06, whereas *S*_{eq} is badly constrained using the F06 method and uniform priors. We find the upper-air diagnostic, which provides relatively weak inference, problematic: natural variability in many of its variables is likely highly correlated with that in surface diagnostic variables, and inference is sensitive to the weightings and truncation used, with the weightings–truncation combination used in F06 seemingly producing unsatisfactory inference.

We resolve these issues by employing only surface and deep-ocean diagnostics, revising these to use longer diagnostic periods, taking advantage of previously unused post-1995 model simulation data and correctly matching model simulation and observational data periods (mismatched by 9 months in the F06 surface diagnostic). Using the revised diagnostics, estimates of *S*_{eq} are lower and more tightly constrained, with a 1.1–2.9-K range obtained using the F06 method and 1.2–2.2 K using the new method. Switching from the original HadCRUT observational surface temperature dataset to the updated HadCRUT4 may have contributed significantly to this reduction: Ring et al. (2012) reported that doing so caused a 0.5-K reduction in their *S*_{eq} estimate.

Comparison, at the best-fit parameter combinations, of the model-simulated and observed rises in global mean temperature between the first 20 and last 20 simulation years provides a key reality check for the validity of inference arising from the alternative diagnostics, testing all stages of the optimal fingerprint method. Nevertheless, sensitivity of the results to the data used and to its processing and analysis indicates that results using the revised diagnostics should not be regarded as definitive.
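The reality check described above reduces, at its core, to comparing a simple statistic of the simulated and observed series. A minimal illustrative helper (not taken from the paper) is:

```python
import numpy as np

def simulated_rise(annual_means):
    """Rise in mean temperature: last 20 minus first 20 years of a series."""
    a = np.asarray(annual_means, dtype=float)
    return a[-20:].mean() - a[:20].mean()
```

Applied to a toy series of 100 annual means warming linearly at 0.01 K yr^{−1}, this returns a rise of 0.8 K.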

Our method of converting directly from a PDF in whitened variable space to one in parameter space yields a conversion factor equating to a noninformative joint prior for the parameters. Its shape is far removed from the uniform priors mainly used in F06, substantially affecting results. Although, coincidentally, the central section of the shape of the F06 expert prior for *S*_{eq} is broadly similar to a cross section of the noninformative joint prior at fixed *K*_{υ} and *F*_{aer}, the expert prior declines much more rapidly at low and high *S*_{eq}. Moreover, the noninformative joint prior is far from uniform in *K*_{υ} and *F*_{aer} and is not a separable function of the three parameters. The shape of the noninformative prior reflects how informative the data are about the parameters as their values vary, rather than the preexisting information as to parameter values. Any such prior information is unlikely to be independent of information provided by the data, invalidating Bayesian inference. We recommend that a computed noninformative joint parameter prior, not separate uniform (or expert) priors, be used in future Bayesian climate parameter studies.
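For intuition about why a noninformative prior takes its shape from the data's informativeness, consider a toy one-parameter model *y* ~ *N*(*f*(θ), σ²): its Jeffreys prior is proportional to |*f*′(θ)|/σ, the Jacobian-like factor converting parameter space to data space. The sketch below is purely illustrative; the saturating *f* is an assumed stand-in for surface temperatures becoming insensitive to the parameter at high values (loosely analogous to high *S*_{eq}):

```python
import numpy as np

def jeffreys_prior(f, theta, sigma=1.0, eps=1e-6):
    """Jeffreys prior for y ~ N(f(theta), sigma^2), proportional to |f'(theta)|/sigma."""
    theta = np.asarray(theta, dtype=float)
    deriv = (f(theta + eps) - f(theta - eps)) / (2 * eps)  # central difference
    return np.abs(deriv) / sigma

# Saturating response: the data become less informative as theta grows,
# so the computed prior declines there rather than staying uniform.
theta = np.linspace(0.5, 10.0, 100)
prior = jeffreys_prior(lambda t: 1.0 - np.exp(-t), theta)
```

The declining prior here is data-driven geometry, not a prior belief about the parameter, mirroring the distinction drawn in the text.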

Using the same diagnostics and method as F06, we obtain tighter bounds (5%–95% points of marginal posterior PDFs) for *S*_{eq}, the central estimate of which is, moreover, reduced by 0.4 K. This principally reflects use of a preferable upper-air truncation parameter and a corrected implementation of the F06 method. Our 90% range of 2.0–3.6 K for *S*_{eq}, obtained using the new method with the F06 diagnostics, improves on the IPCC's 2–4.5-K “likely” range given in Hegerl et al. (2007). Our 90% range of 1.2–2.2 K for *S*_{eq}, obtained using the preferred revised diagnostics and the new method, appears low in relation to that range, partly because uncertainty in nonaerosol forcings and surface temperature measurements is ignored. Incorporation of such uncertainties is estimated to increase the *S*_{eq} range to 1.0–3.0 K, with the median unchanged (see supplemental material for derivation and additional discussion).

Our 1.6-K mode for *S*_{eq} obtained with the objective Bayesian method and the preferred revised diagnostics is identical to that from the main results in two recent studies providing observationally constrained estimates of *S*_{eq}: Aldrin et al. (2012) and (using the same HadCRUT4 dataset) Ring et al. (2012). Our 1.1–2.9-K 90% range for *S*_{eq} obtained using the F06 method (uniform priors) and revised diagnostics compares with the 1.2–3.5 K obtained using a uniform prior for *S*_{eq} in Aldrin et al. (2012).

We thank Charles Curry, Chris Forest, and Bruno Sansó for making available respectively the CSF05, GRL06_reproduce, and SFZ08 archives, Tim Johns for providing HadCM2/3 AOGCM control run data, and Hu McCulloch, Andrew Montford, Steven Mosher, and four reviewers for helpful comments.

# APPENDIX A

## The F06 Method and *mF*_{m,ν} Distribution

Each F06 goodness-of-fit statistic is based on a vector of *p* differences between observations and model predictions. Through premultiplication by a “whitening” matrix, this vector is transformed into a set of error variables (whitened differences) **ũ** that would have independent *N*(0, 1) distributions if the model parameter settings equaled the climate system parameters' hypothetical true values. After EOF truncation there are *κ* nonzero whitened differences. The whitened differences' sum of squares,

*r*^{2} = Σ_{j=1}^{κ} ũ_{j}^{2},

is computed for each parameter combination setting; it represents the squared distance of the observations from the model predictions in the *κ*-dimensional whitened difference space, with origin where all those differences are zero. AT99 states, in the context of climate change detection and attribution, that where *m* scaling factors for the ratios of observed to model-predicted pattern amplitudes are estimated before model minus observation differences are determined, the minimum of *r*^{2} has known distributional properties (with *κ* set by the number of retained EOFs). When, as in F06, the aim is instead to estimate climate system parameters, the scaling factors are all set to unity rather than being estimated, as discussed in F01, and *r*^{2} is computed at each parameter combination. It is argued in F01 [see Eq. (6) therein] that

Δ*r*^{2} = *r*^{2} − *r*^{2}_{min} ∼ *mF*_{m,ν},   (A4)

where *r*^{2}_{min} is the minimum of *r*^{2} as model parameters are varied, *m* now represents the number of unknown model parameters, and ν is the relevant number of degrees of freedom. Here Δ*r*^{2} measures the excess of *r*^{2} at an arbitrary location in parameter space over *r*^{2}_{min}.

The parameter combinations map out an *m*-dimensional surface (the parameter surface) in the *κ*-dimensional whitened difference space, the location of that surface relative to the origin depending on the observational values. If, in the *κ*-dimensional space, the parameter surface coincided with its *m*-dimensional tangent hyperplane where it meets the best-fit point, Δ*r*^{2} would have an *mF*_{m,ν} distribution per (A4). However, the parameter surface is more likely to be convex, resulting in an *mF*_{m,ν} distribution producing tighter bounds on the parameters than are justified (see supplemental material for additional discussion).
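Under (A4), a confidence region at level *c* is delimited by Δ*r*^{2} ≤ *m* · *F*_{m,ν}^{−1}(*c*). A short sketch of the threshold computation (assuming SciPy is available; the degrees of freedom shown are illustrative, not the study's):

```python
from scipy.stats import f

def delta_r2_threshold(m, nu, conf=0.95):
    """Delta r^2 bound delimiting a conf-level region under Delta r^2 ~ m F(m, nu)."""
    return m * f.ppf(conf, m, nu)
```

For *m* = 1 this reduces to the square of the two-sided Student-*t* critical value, a familiar sanity check.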

# APPENDIX B

## New Method

We start by defining whitened versions of the modeled, observed, and underlying surface diagnostic temperatures, with **θ**_{t} denoting the parameters' true values. The whitened observed temperatures form a *κ*-dimensional vector **x**, while the model-predicted whitened temperatures at the various parameter combinations map out an *m*-dimensional hypersurface in the *κ*-dimensional whitened difference space. If *g*(**θ**) is a PDF on parameter space, the change-of-variables relationship implies that **x** has a corresponding PDF on the hypersurface involving the Jacobian factor of the parameters-to-hypersurface transformation (**D**), with the result that (B9) becomes a product of the whitened-space density and that factor. Rearranging (B11), invoking the presumed model prediction accuracy together with (B1), and noting that conditionality on **θ** may be transferred to the corresponding hypersurface point, we can change (B12) to (B13). Substituting from Eq. (B6) in Eq. (B13), and noting the conditionality on **θ**_{t}, gives the posterior PDF for the parameters, in which the conversion factor from whitened difference space to parameter space equates to a noninformative joint prior.
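The core of the appendix B derivation is a change of variables between whitened difference space and parameter space. A one-dimensional sketch of the Jacobian conversion factor (illustrative only, with `h` a hypothetical stand-in for the parameters-to-whitened-space map):

```python
import numpy as np

def param_pdf_from_whitened(p_x, h, theta, eps=1e-6):
    """Convert a 1-D density p_x in whitened space to a density over theta.

    The |dh/dtheta| Jacobian is the conversion factor that plays the role
    of a noninformative prior in the text.
    """
    theta = np.asarray(theta, dtype=float)
    jac = np.abs((h(theta + eps) - h(theta - eps)) / (2 * eps))
    return p_x(h(theta)) * jac
```

As a check, with **x** standard normal and `h = np.log` the result reproduces the analytic lognormal density, confirming the conversion factor is applied correctly.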

## REFERENCES

Aldrin, M., M. Holden, P. Guttorp, R. B. Skeie, G. Myhre, and T. K. Berntsen, 2012: Bayesian estimation of climate sensitivity based on a simple climate model fitted to observations of hemispheric temperatures and global ocean heat content. *Environmetrics,* **23,** 253–271.
Allen, M. R., and S. F. B. Tett, 1999: Checking internal consistency in optimal fingerprinting. *Climate Dyn.,* **15,** 419–434.
Andronova, N. G., and M. E. Schlesinger, 2001: Objective estimation of the probability density function for climate sensitivity. *J. Geophys. Res.,* **106** (D19), 22 605–22 611.
Bernardo, J. M., and A. F. M. Smith, 1994: *Bayesian Theory.* Wiley, 608 pp.
Box, G. E. P., and G. C. Tiao, 1973: *Bayesian Inference in Statistical Analysis.* Addison-Wesley, 588 pp.
Curry, C. T., 2007: Inference for climate system properties. M.S. thesis, Dept. of Applied Mathematics and Statistics, University of California, Santa Cruz, 40 pp. [Available online at http://www.webcitation.org/68VqaIbVK.]
Curry, C. T., B. Sansó, and C. E. Forest, 2005: Inference for climate system properties. Dept. of Applied Mathematics and Statistics, University of California, Santa Cruz, Tech. Rep. ams2005-13, 7 pp. [Available online at http://www.soe.ucsc.edu/research/technical-reports/ams2005-13.]
Datta, G. S., and T. J. Sweeting, 2005: Probability matching priors. *Handbook of Statistics,* Vol. 25, D. K. Dey and C. R. Rao, Eds., Elsevier, 91–114.
Delworth, T. L., R. J. Stouffer, K. W. Dixon, M. J. Spelman, T. R. Knutson, A. J. Broccoli, P. J. Kushner, and R. T. Wetherald, 2002: Review of simulations of climate variability and change with the GFDL R30 coupled climate model. *Climate Dyn.,* **19,** 555–574.
Drignei, D., C. Forest, and D. Nychka, 2008: Parameter estimation for computationally intensive nonlinear regression with an application to climate modeling. *Ann. Appl. Stat.,* **2,** 1217–1230, doi:10.1214/08-AOAS210.
Forest, C. E., M. R. Allen, P. H. Stone, and A. P. Sokolov, 2000: Constraining uncertainties in climate models using climate change detection methods. *Geophys. Res. Lett.,* **27,** 569–572.
Forest, C. E., M. R. Allen, A. P. Sokolov, and P. H. Stone, 2001: Constraining climate model properties using optimal fingerprint detection methods. *Climate Dyn.,* **18,** 277–295.
Forest, C. E., P. H. Stone, A. P. Sokolov, M. R. Allen, and M. D. Webster, 2002: Quantifying uncertainties in climate system properties with the use of recent climate observations. *Science,* **295,** 113–117.
Forest, C. E., P. H. Stone, and A. P. Sokolov, 2006: Estimated PDFs of climate system properties including natural and anthropogenic forcings. *Geophys. Res. Lett.,* **33,** L01705, doi:10.1029/2005GL023977.
Forest, C. E., P. H. Stone, and A. P. Sokolov, 2008: Constraining climate model parameters from observed 20th century changes. *Tellus,* **60A,** 911–920.
Forster, P. M. de F., and J. M. Gregory, 2006: The climate sensitivity and its components diagnosed from Earth radiation budget data. *J. Climate,* **19,** 39–52.
Frame, D. J., B. B. B. Booth, J. A. Kettleborough, D. A. Stainforth, J. M. Gregory, M. Collins, and M. R. Allen, 2005: Constraining climate forecasts: The role of prior assumptions. *Geophys. Res. Lett.,* **32,** L09702, doi:10.1029/2004GL022241.
Gordon, C., C. Cooper, C. A. Senior, H. Banks, J. M. Gregory, T. C. Johns, J. F. B. Mitchell, and R. A. Wood, 2000: The simulation of SST, sea ice extents and ocean heat transports in a version of the Hadley Centre coupled model without flux adjustments. *Climate Dyn.,* **16,** 147–168.
Gregory, J., R. J. Stouffer, S. C. B. Raper, P. A. Stott, and N. A. Rayner, 2002: An observationally based estimate of the climate sensitivity. *J. Climate,* **15,** 3117–3121.
Hegerl, G. C., T. C. Crowley, W. T. Hyde, and D. J. Frame, 2006: Climate sensitivity constrained by temperature reconstructions over the past seven centuries. *Nature,* **440,** 1029–1032, doi:10.1038/nature04679.
Hegerl, G. C., and Coauthors, 2007: Understanding and attributing climate change. *Climate Change 2007: The Physical Science Basis,* S. Solomon et al., Eds., Cambridge University Press, 663–745.
Jeffreys, H., 1946: An invariant form for the prior probability in estimation problems. *Proc. Roy. Soc. London,* **186A,** 453–461.
Jewson, S., D. Rowlands, and M. Allen, cited 2009: A new method for making objective probabilistic climate forecasts from numerical climate models based on Jeffreys' prior. [Available online at http://arxiv.org/pdf/0908.4207.pdf.]
Johns, T. C., R. E. Carnell, J. F. Crossley, and J. M. Gregory, 1997: The second Hadley Centre coupled ocean-atmosphere GCM: Model description, spinup and validation. *Climate Dyn.,* **13,** 103–134.
Jones, P. D., M. New, D. E. Parker, S. Martin, and I. G. Rigor, 1999: Surface air temperature and its changes over the past 150 years. *Rev. Geophys.,* **37,** 173–199.
Kass, R. E., 1989: The geometry of asymptotic inference. *Stat. Sci.,* **4,** 188–219.
Kass, R. E., and L. Wasserman, 1996: The selection of prior distributions by formal rules. *J. Amer. Stat. Assoc.,* **91,** 1343–1370.
Knutti, R., T. F. Stocker, F. Joos, and G.-K. Plattner, 2002: Constraints on radiative forcing and future climate change from observations and climate model ensembles. *Nature,* **416,** 719–723.
Levitus, S., J. Antonov, and T. Boyer, 2005: Warming of the world ocean, 1955–2003. *Geophys. Res. Lett.,* **32,** L02604, doi:10.1029/2004GL021592.
Libardoni, A. G., and C. E. Forest, 2011: Sensitivity of distributions of climate system properties to the surface temperature dataset. *Geophys. Res. Lett.,* **38,** L22705, doi:10.1029/2011GL049431.
Mardia, K. V., J. T. Kent, and J. M. Bibby, 1979: *Multivariate Analysis.* Academic Press, 518 pp.
Morice, C. P., J. J. Kennedy, N. A. Rayner, and P. D. Jones, 2012: Quantifying uncertainties in global and regional temperature change using an ensemble of observational estimates: The HadCRUT4 dataset. *J. Geophys. Res.,* **117,** D08101, doi:10.1029/2011JD017187.
Mosegaard, K., and A. Tarantola, 2002: Probabilistic approach to inverse problems. *International Handbook of Earthquake and Engineering Seismology,* W. H. K. Lee et al., Eds., International Geophysics Series, Vol. 81A, Elsevier, 237–265.
Parker, D. E., M. Gordon, D. P. N. Cullum, D. M. H. Sexton, C. K. Folland, and N. Rayner, 1997: A new global gridded radiosonde temperature data base and recent temperature trends. *Geophys. Res. Lett.,* **24,** 1499–1502.
Pueyo, S., 2012: Solution to the paradox of climate sensitivity. *Climatic Change,* **113,** 163–179, doi:10.1007/s10584-011-0328-x.
Ring, M. J., D. Lindner, E. F. Cross, and M. E. Schlesinger, 2012: Causes of the global warming observed since the 19th century. *Atmos. Climate Sci.,* **2,** 401–415.
Roe, G. H., and M. B. Baker, 2007: Why is climate sensitivity so unpredictable? *Science,* **318,** 629–632.
Sansó, B., and C. Forest, 2009: Statistical calibration of climate system properties. *J. Roy. Stat. Soc.,* **58C,** 485–503.
Sansó, B., C. E. Forest, and D. Zantedeschi, 2008: Inferring climate system properties using a computer model (with discussion). *Bayesian Anal.,* **3,** 1–62.
Sokolov, A. P., and P. H. Stone, 1998: A flexible climate model for use in integrated assessments. *Climate Dyn.,* **14,** 291–303.
Sokolov, A. P., C. E. Forest, and P. H. Stone, 2003: Comparing oceanic heat uptake in AOGCM transient climate change experiments. *J. Climate,* **16,** 1573–1582.