## Abstract

Decadal climate predictions exhibit large biases, which are often subtracted and forgotten. However, understanding the causes of bias is essential to guide efforts to improve prediction systems, and may offer additional benefits. Here the origins of biases in decadal predictions are investigated, including whether analysis of these biases might provide useful information. The focus is especially on the lead-time-dependent bias tendency. A “toy” model of a prediction system is initially developed and used to show that there are several distinct contributions to bias tendency. Contributions from sampling of internal variability and a start-time-dependent forcing bias can be estimated and removed to obtain a much improved estimate of the true bias tendency, which can provide information about errors in the underlying model and/or errors in the specification of forcings. It is argued that the true bias tendency, not the total bias tendency, should be used to adjust decadal forecasts.

The methods developed are applied to decadal hindcasts of global mean temperature made using the Hadley Centre Coupled Model, version 3 (HadCM3), climate model, and it is found that this model exhibits a small positive bias tendency in the ensemble mean. When considering different model versions, it is shown that the true bias tendency is very highly correlated with both the transient climate response (TCR) and non–greenhouse gas forcing trends, and can therefore be used to obtain observationally constrained estimates of these relevant physical quantities.

## 1. Introduction

Until recently, projections of future climate have been generated by running climate models forced by estimates of future natural and anthropogenic (e.g., from greenhouse gases and aerosols) radiative forcing. The motivation for decadal climate predictions is to improve on these standard projections by using observations to initialize predictable modes of natural variability, and by correcting errors in a model’s response to past radiative forcings. Producing climate predictions that are initialized using observations of the current climate state is now a major field of scientific research (e.g., Smith et al. 2007, hereafter S07; Keenlyside et al. 2008; Pohlmann et al. 2009; Smith et al. 2013). For example, initialized decadal climate prediction experiments are a major component of phase 5 of the Coupled Model Intercomparison Project (CMIP5; Meehl et al. 2009; Taylor et al. 2012; Meehl et al. 2014). Decadal climate predictions could potentially be of great benefit to society, for example, helping to inform decisions on adaptation to a changing climate. However, there are many challenges in producing forecasts that are useful for adaptation decisions (e.g., Meehl et al. 2009; Oreskes et al. 2010).

One key challenge in producing robust predictions of future climate is to demonstrate an ability to make predictions in the past (“hindcasts”). Comparisons between hindcasts and past observations offer a wealth of information for assessing the strengths and weaknesses of a prediction system, including information that can guide work to improve the system. Such an approach has proved invaluable in weather forecasting (e.g., Ferranti and Viterbo 2006). Comparisons may focus on specific case studies (e.g., Robson et al. 2012; Yeager et al. 2012), particular regions (e.g., Toniazzo and Woolnough 2013) or on the average behavior of a system over a longer period (e.g., S07; Smith et al. 2010; van Oldenborgh et al. 2012). A particularly important issue for decadal climate predictions is the existence of large biases (i.e., systematic differences between hindcasts and observations). Biases may vary with the lead time of hindcasts and are often larger than the anomalies that the system is aiming to predict. In this situation the current standard approach (e.g., Goddard et al. 2013) is to subtract the mean bias from all hindcasts before assessing other aspects of the system performance (e.g., RMSE). Such an approach is pragmatic but assumes a linear additivity between bias and forced response and ignores many important issues, such as the following: Why is the bias present? Does it provide any useful information? Could it be reduced?

The aim of this paper is to investigate the first two of these questions in particular, initially in the context of an idealized “toy” model, and second using results from a real decadal prediction system. We focus especially on the growth of bias with lead time, which we demonstrate offers valuable information about a prediction system and the underlying climate model. We then show further that analysis of biases for different model versions can be used to obtain useful information about the real world, in particular new constraints on the transient climate response, which measures the transient sensitivity of the climate system to increases in greenhouse gases.

The structure of the paper is as follows. Section 2 discusses the design of decadal prediction experiments and clarifies terminology. Section 3 introduces our toy model of a decadal prediction system, explains how the bias can be decomposed into distinct contributions, and examines sampling issues. The methodology we develop is then applied to predictions of global mean surface air temperature from an operational decadal prediction system in sections 4 and 5. Conclusions and a discussion of implications are in section 6.

## 2. Experimental design and terminology

There are several types of decadal climate prediction experiment discussed in the literature. One important issue is the specification of external radiative forcings in the hindcasts. The two main choices are as follows:

“Projection” type, where anthropogenic forcings are assumed to be known, but “projected” natural forcings are used (e.g., see S07). In this case any volcanic aerosol present at the forecast start time is allowed to decay, but no “future” volcanic aerosol is used. In addition, the solar cycle is repeated from the previous cycle. This approach attempts to mimic the realistic situation in which there is little knowledge of future natural forcing.

“CMIP5” type, where all forcings are assumed to be known. This is the design adopted by the CMIP5 protocol (Taylor et al. 2012).

In addition, hindcasts may be initialized using observations at the forecast start time (“Assim”—because assimilation is used to generate the initial states), or be initialized directly from a model state without the use of observations (“NoAssim”).

The simplest case is arguably the “NoAssim CMIP5” type, corresponding to traditional so-called “transient” climate model simulations. However, the ensemble sizes for these simulations tend to be small (fewer than 5), which, as we will show, limits the robustness of the bias analysis. In this study we focus on the “NoAssim projection” type of hindcasts, as performed by the Met Office (see S07). The Met Office used this approach to produce a very large ensemble of hindcasts with different versions of the same GCM (Smith et al. 2010), which proves to be a very useful resource for our analysis. However, in examining these hindcasts we must take into account the difference between the natural forcings used to force the model and those that occurred in the real world.

The reason that we focus on NoAssim-type experiments is that understanding the biases in these experiments is a prerequisite for understanding the biases in Assim-type experiments. We demonstrate that the bias derived from NoAssim experiments provides useful information, and we will be investigating applications to Assim-type experiments in future work.

## 3. Estimating bias in a toy model of a decadal prediction system

We first build a toy model of a decadal prediction system to examine some of the issues involved with estimating the bias of a real prediction system.

### a. Bias of hindcasts

Pseudo-observations *O*(*t*) are generated by assuming an externally forced linear trend in time, with added red noise,

where *t* is time, is the “observed” climatology, *α* is the slope of the linear trend, and the red noise is denoted by *ϵ*(*t*).

We first assume that the ensemble mean of our pseudohindcasts (*N*) for the same quantity can be generally represented, for start time *T* and lead time *τ*, by

where is the model climatology and *γ* is the modeled linear response to the external forcing. If *α* ≠ *γ* then the climate model would produce a different trend from the observations and therefore be biased. This could either be because the model is in error or because there is an error in the specification of the forcing (see later). This equation for *N* assumes that we have an infinite ensemble of hindcasts, as there is no noise in the ensemble mean. This assumption will be relaxed later. Note that these pseudohindcasts are only attempting to predict the forced response and not the internal variability component.

The bias *B* of a prediction system is simply the mean error as a function of prediction lead time:

where *L* is the number of hindcast start dates and we assume that there is a decadal hindcast (*τ* = 1–10 yr) started every year between, and including, *T* = 1 and *T* = *L*. Note that in an operational system *N* and *O* would often represent anomalies from a particular reference period. However, our analysis focusses on “bias tendency” (defined below), which is independent of the choice of reference period.

### b. Correcting the bias for observed variability

The estimated bias defined in Eq. (3) has two contributing factors: the true bias (if *α* ≠ *γ* or ) and a bias from an insufficient sampling of the internal variability in the observations. Ideally, we would like to correct for this second variability contribution to obtain the true bias.

Following Robson (2010), in the case of an infinite ensemble in a stationary climate (*α* = *γ* = 0), the bias from Eq. (3) would be

where *t* represents time and *B*_{obsvar}(*τ*) is the mean of the observational anomalies used for validation for a particular lead time *τ*. An important point is that different observations are used for different lead times. Thus, *B*_{obsvar}(*τ*) is an estimate of the bias resulting from the insufficient sampling of the observed variability and will approach zero as *L* increases leaving the true bias, .

For the more realistic case when the climate is not stationary, and there is a trend in the observations (*α* ≠ 0) then we can estimate

and this is the definition we adopt. In the toy model examples shown here we use a linear detrending. When considering the real observations we performed sensitivity tests to explore linear and quadratic detrending and the results were very similar (not shown), so assume a linear detrending in all that follows.

A schematic demonstrating *B*_{obsvar} for different lead times is shown in Fig. 1 with pseudo-observations in black, which include a linear trend and red noise, and some predictions (for a noninfinite ensemble) shown in red in each panel. The gray regions indicate the area to be integrated to give the value of *B*_{obsvar}, which varies with the lead time chosen, and need not be zero, as shown in Fig. 1d.

### c. Bias tendency

In this analysis we generally consider the bias tendency *B*′ rather than the bias itself, that is, we use the bias relative to the bias for the mean of the first year:

This choice is made because we want to consider the growth of bias with lead time, which is natural for a prediction system. We do not use *τ* = 0 to avoid arbitrary assumptions about defining climatological periods. Hence, this bias tendency has the desirable property of being independent of the choice of climatology.

Similarly to the bias, the observed variability correction is also made into a tendency:

as shown in Fig. 1e, and an estimate of the underlying true bias tendency is then

The nature of the bias growth may give valuable information about the physical processes that cause prediction errors, potentially allowing particular parameterizations to be targeted for improvement.

### d. Estimating the bias tendency in the toy hindcasts

To test the bias tendency estimates described above, we first consider whether we can estimate the true bias tendency of the toy model using various numbers of hindcast start dates. Here, we generally assume that *α* = 0.016 K yr^{−1} and that the red noise *ϵ* in Eq. (1) has a first-order autoregressive (AR1) parameter *β* = 0.5 and total variance . These values are chosen to roughly simulate observed annual global mean surface air temperature (SAT) observations since 1850 (Brohan et al. 2006), although the conclusions are insensitive to the exact choices. We pick *γ* = 0.020 K yr^{−1} (i.e., the toy hindcasts are positively biased by 25%) and retain the infinite ensemble assumption for now.

An example of such a hindcast system is shown in Fig. 2a for decadal hindcasts started every year for *L* = 20 yr, where the black line represents the observations, the solid blue line is the true forced trend (*α*), the dashed blue line is a linear fit to the observations used in the estimation of , and the red lines represent the pseudohindcasts *N*, which are identical because of the infinite ensemble assumption.

In Fig. 2b, we show estimates of the bias tendency for the situation in Fig. 2a. The solid blue line uses the definition of uncorrected bias tendency [Eq. (8)], and the dashed blue line corrects for the observed variability using Eq. (10). Note that the dashed blue line does not match the true bias (gray shading) because the estimated trend from the observations is not correct (i.e., the estimate of is not exact). If the true forced trend is used in the estimation of then the true bias tendency is recovered (black line).

We next simulate 1000 realizations of the pseudo-observations and hindcast sets. Bias tendency estimates for 10 examples of these realizations are shown in Fig. 2c. With these 20 start dates there is a wide range of estimated bias tendencies. For different numbers of hindcast start dates *L*, Fig. 3 demonstrates that correcting the bias tendency using (dashed line) reduces the error in the estimates of bias tendency at a lead time of 10 yr compared to using the uncorrected bias tendency (solid line). Both estimators of the bias tendency are themselves unbiased (i.e., the mean over all realizations equals the true bias tendency; not shown). The spread in bias tendency estimates decreases with the number of start dates as more observations allow more accurate estimates. The observed variability correction also becomes smaller with more start dates. When analyzing the operational NoAssim hindcasts in section 4 we generally use 40 start dates, so the spread is around half as large as suggested in Fig. 2c.

For the particular set of toy model parameters chosen here, we see that the expected error in the bias tendency estimate becomes smaller than the bias itself (gray line in Fig. 3; i.e., the sign of the true bias tendency could be detected) for around *L* = 15–20 hindcast start dates. For fewer hindcasts, the uncertainty in the bias estimates does not allow a detection, with the implication for ensemble design that more start dates are required. If the bias is uncorrected then more start dates are required to detect the bias.

### e. Forcing bias and consistent verification times

So far we have assumed that the radiative forcing that is causing a warming or cooling trend has been correctly specified and so any bias tendency is attributable to errors in the model response to this forcing. However, there are two types of forcing bias that could make this assumption invalid: start-time-independent and start-time-dependent bias. The CMIP5 design discussed in section 2 results in start-time-independent forcing biases because all hindcasts see the same forcing at the same date. However, for the Projection design this is not the case: hindcasts started from different dates may see different forcings. For example, a hindcast started in 1989 would not include any volcanic aerosol from the Mount Pinatubo eruption in 1991, whereas a hindcast started in 1992 would. Thus, there is a start-time-dependent forcing bias. S07 noted that this type of forcing bias makes a significant contribution to the bias of a set of hindcasts. They attempted to remove it, somewhat arbitrarily, by excluding years just after volcanic eruptions from the estimation of the bias. Fortunately, a further correction is available to account for this start-time-dependent bias.

In deriving, *B* from Eq. (3) we chose to use all possible combinations of start dates and verification times. However, an alternative is to use a “consistent” set of verification times, which only includes years where all lead times *τ* can be *simultaneously* assessed (i.e., the *same* observation can be used to assess the bias at all lead times). In the schematic of Fig. 1 these times are shown by the range of the blue bars (i.e., years 11–21 in this example) as year 11 is the earliest time that a 10-yr lead-time forecast can be verified (along with forecasts for lead times of 1–9 yr), and year 21 is the last time that a 1-yr lead time can be verified (along with forecasts for lead times of 2–10 yr).

Using these consistent verification times, assuming there is no start-time-dependent forcing bias and an infinite ensemble, and generalizing from Eq. (3), the bias becomes

where *τ*_{max} is the largest lead time to be considered. Crucially, for this particular choice of verification times, all the terms on the right-hand side of Eq. (12) are *independent* of lead time, because *N*(*t*) is the same for all lead times and is zero for this choice of verification times (Fig. 1). In this instance, *B*_{consis}(*τ*) is a constant *A* with lead time, and therefore, the bias tendency using consistent verification times is

Hence, in the absence of a start-time-dependent forcing bias, is exactly zero (assuming an infinite ensemble).

To test the impact of a start-time-dependent forcing bias in our toy model, we generalize Eq. (1) by adding a volcanic eruption into the pseudo-observations, within the consistent validation time period, of the following form:

where *V* is the temperature response to a volcanic eruption, which reduces over time *ξ* (measured in years) with an exponential decay time scale of 1 yr, from a peak impact of 0.2 K. We also assume that the hindcasts also include this impact, but only after the eruption has occurred.

Repeating our toy hindcasts (Fig. 4), still assuming an infinite ensemble, demonstrates that the measured bias tendency (blue) is overestimated when compared to the true bias tendency (dark gray), because the bias tendency attributable to the volcanic eruption is nonzero (light gray). Here is shown by the red line in Fig. 4 (right panel), which matches the forcing bias tendency (light gray) as expected.

Note especially that to estimate from the data there is no need to assume any functional form for the forcing bias. Therefore, we can correct for the start-time-dependent forcing bias by estimating the bias tendency using all verification times, and subtracting off the bias tendency estimated using consistent verification times . Generalizing Eq. (10),

The green lines in Fig. 4 (right panel) are an example of such an estimate using the bias tendency corrected only by the consistent verification times (solid) and using Eq. (17) (dashed). Below we will demonstrate that it is necessary to remove the forcing bias in this way to obtain a robust estimate of the true bias tendency, which is the key quantity of interest.

We note here that there are still two contributions to the true bias tendency. The first is errors in the underlying climate model; for example, if the sensitivity of the model to greenhouse gas forcing is higher or lower than that of the real world, the hindcasts will warm too rapidly or too slowly, giving a positive or negative bias tendency. The second is (start-time independent) errors in the forcing applied to the model; for example, if the negative radiative forcing attributable to anthropogenic aerosols is lower or higher in the model than in the real world, this will also give a positive or negative bias tendency. Correcting the bias tendency using the period of consistent verification times does not deal with the issue of forcing errors that may occur outside of the period of consistent verification times, and this is discussed further when considering the real observations.

Finally, it should be noted that estimating the bias tendency using all verification times and subtracting off the bias tendency using consistent verification times is not the same as estimating the bias tendency using “nonconsistent” verification times (not shown).

### f. How many ensemble members are needed?

As discussed above, we have so far assumed that the toy hindcasts have infinite ensemble members. We now relax this assumption to understand how many ensemble members would be required to ensure a robust bias tendency estimate.

For a finite ensemble, our toy model for the predictions is generalized from Eq. (2) to

where *ζ* is red noise with the same AR1 parameter as the pseudo-observations (*β* = 0.5) and a noise component which depends on *M*, the number of ensemble members [i.e., ]. Note that this definition is equivalent to taking the mean of *M* different ensemble members, each with variance .

Figure 5 explores the spread in estimates of the true bias tendency using various values for *M*, making (or not) the different corrections discussed above. This spread is derived from 100 000 different realizations of the toy model. The colors represent using 20 start dates (gray) and 40 start dates (blue). First, the most reliable and accurate estimate of the true bias is when all the corrections described above are applied (Fig. 5a). For the other cases, the bias estimate itself becomes more biased, or more uncertain (Figs. 5b–d).

In addition, as the number of ensemble members is increased the uncertainty in the bias estimates initially decrease, but then stabilize. For *M* ≳ 8, the expected error in the bias remains roughly constant. This analysis suggests that as long as *M* ≳ 8, then the ensemble is effectively infinite for global mean temperature. In addition, to detect the sign of a true bias tendency *it is far better to increase the number of start years, than to increase the number of ensemble members.* This is also found to be the case when the variance of the noise is doubled to represent a regional mean, rather than a global mean (not shown).

We note that the mean of the toy model realizations in the fully corrected case does not quite match the expected value (black). This is probably as a result of an interaction between the *B*_{consis} and *B*_{obsvar} correction terms as *B*_{consis} will also have a variability component, but this estimate is still the least biased.

## 4. Estimating the true bias in an operational decadal prediction system

S07 describe the performance of a set of hindcasts made using the Hadley Centre Coupled Model, version 3 (HadCM3), global climate model (Gordon et al. 2000). Here we analyze a later set of ensembles, termed NoAssimPPE, which utilizes the same HadCM3 GCM, but with nine different “perturbed physics” versions (Smith et al. 2010). These different perturbed physics ensemble (PPE) versions were chosen to sample a wide range of climate sensitivities and ENSO amplitudes (e.g., Murphy et al. 2004; Smith et al. 2010; Collins et al. 2011).

The hindcasts were initialized from model states consistent with the applied radiative forcings using start dates once per year from 1961 to 2001, with one 10-yr prediction per model version. As in the original S07 hindcasts, the NoAssimPPE hindcasts used the projection approach to specifying external forcings (section 2).

### a. Start-time-dependent forcing bias

First, we demonstrate the presence of a start-time-dependent forcing bias in the NoAssimPPE hindcasts (41 start dates and 9 ensemble members, 1961–2001). Because the hindcasts use only information available at the start of the forecast, “future” volcanic eruptions were not considered. This produces hindcasts that are biased warm when compared to observations. Also, the previous solar cycle is repeated, which is another potential source of bias.

Figure 6 shows estimates of the natural forcings (volcanic and solar) used in the transient twentieth-century integrations (left panels) and in the prediction system (center panels). The estimates for the prediction systems assume an exponential decay rate of the volcanic aerosol present at the forecast start time of 1 yr and an 11-yr solar cycle length. The resulting forcing bias is shown in the right panels.

When integrated over all start dates an estimate of the start-time-dependent forcing bias is produced (Fig. 6, bottom right). The magnitude of the bias is dominated by the volcanic component and peaks at around 0.45 W m^{−2} at a lead time of 3 yr, subsequently dropping to around 0.30 W m^{−2} at a lead time of 10 yr.

### b. Bias tendency estimates in NoAssimPPE

We now explore the expected error in the bias estimates using the results from analysis of the toy model. Figure 7 shows the expected growth with lead time of the error in the estimated bias for NoAssimPPE (gray) where the solid (dashed) gray line indicates the expected error using 1 (9) ensemble members. The black line shows the corresponding error for the original NoAssim (S07) hindcasts (effectively 20 start dates and 16 ensemble members). The greater number of ensemble members in the original NoAssim results in a smaller expected error at short lead times (1–3 yr), compared with the single member PPE system. However, the larger number of start dates in NoAssimPPE suggests a far smaller error at long lead times (5–10 yr), even using a single ensemble member. The uncertainty estimates for 5-yr means (horizontal gray bars) are used below in section 5.

We next apply the bias estimate methodology developed using the toy model to annual means of global mean surface air temperature from the NoAssimPPE hindcasts (Fig. 8). We compare the hindcasts to four observational datasets [Hadley Centre/Climatic Research Unit temperature, version 4 (HadCRUT4; Morice et al. 2012), Goddard Institute for Space Studies Surface (GISS) Temperature Analysis (GISTEMP; Hansen et al. 2010), National Centers for Environmental Prediction (NCEP) reanalysis (Kalnay et al. 1996), and 40-yr European Centre for Medium-Range Weather Forecasts (ECMWF) Re-Analysis (ERA-40; Uppala et al. 2005)], but all give consistent results. Note that the observations used are for 1961–2010, except ERA-40, which uses 1961–2001. Unless otherwise stated we use HadCRUT4 in all that follows. For the NoAssimPPE system, the raw bias tendency estimate (Fig. 8a) suggests that HadCM3 has a warm bias, which is apparently a primary result of a start-time-dependent forcing bias (Fig. 8b) rather than an insufficient sampling of the observational variability (Fig. 8c). The best estimate for the true bias tendency (Fig. 8d) shows a very slight warm bias of around 0.04 K decade^{−1}, which is marginally statistically significant. The interpretation of this true bias tendency is discussed in section 5.

In addition, we note that the bias is positive over both land and sea (Figs. 8e,f). Both the spatial pattern and physical processes responsible for the bias growth will be explored in future work.

The global mean SAT bias tendency associated with the time-dependent forcing error makes the largest contribution to the SAT total bias tendency (Fig. 8). S07 also recognized the importance of accounting for the bias caused by volcanic eruptions. They estimated that the raw bias for NoAssim was around 0.14 K decade^{−1} (consistent with Fig. 8), but they removed the forcing bias by excluding some years following volcanic eruptions. We believe that our result is more robust as we are accounting for the forcing bias more explicitly and objectively.

The lead-time evolution of the ensemble mean global averaged shortwave radiation (SW) bias tendency over the ocean at the top of the atmosphere (TOA) (i.e., the forcing error) using the consistent verification times is illustrated in Fig. 9a, and shows a rapid increase in downward solar radiation in the first 3–4 yr to about 0.30–0.35 W m^{−2} and it maintains this magnitude afterward. This estimated forcing error and its lead time evolution are consistent with the implied surface heat flux bias tendency from vertically integrated ocean heat content (OHC) bias tendency (the implied flux bias tendency is not sensitive to the depth chosen for the integration since OHC bias tendency is mostly confined in the top 500 m) as shown in Fig. 9b and it is also consistent with the directly estimated forcing error associated with volcanic eruptions (Fig. 9c, smoothed from Fig. 6). A caveat with using the 1961–2001 start dates for validation is that the Agung volcano in 1963 is before the consistent verification times. We have performed a sensitivity test by excluding the hindcasts from 1961 through 1964, but this does not significantly affect the results.

The relative importance of each component of the bias is illustrated in Fig. 10, which confirms that the lead-time-dependent forcing bias dominates. For NoAssimPPE the sampling correction (orange) is very small for global mean temperature because the number of hindcast starts dates is large. Note, however, that this contribution is expected to be larger for other variables and smaller regions. These results illustrate clearly the importance of decomposing the bias into its different components before interpreting its meaning. Furthermore, if a bias correction were to be applied to a forecast (rather than a hindcast), we suggest it is the underlying true bias tendency that should be used, rather than the raw bias tendency derived from the hindcasts, in contrast to some current practices (e.g., Smith et al. 2013). We plan to explore the issues surrounding the application of bias corrections to forecasts in future work.

## 5. Interpretation of the true bias tendency

### a. Role of ocean heat uptake in bias tendency

The true bias tendency could arise either from start-time-independent errors in the forcings applied to the model (e.g., errors in the specification of anthropogenic aerosols) or from errors in the transient sensitivity of the model to such forcings (or both). Errors in the transient sensitivity could themselves arise from errors in either the representation of atmospheric or surface feedbacks and/or from errors in the representation of ocean heat uptake (e.g., Raper et al. 2002; Gregory and Forster 2008; Boé et al. 2009). This last factor can be examined by considering the bias tendency for global mean OHC (Fig. 11). As for surface air temperature the total bias is dominated by the start-time-dependent forcing bias. The true bias tendency for the surface or top 100 m is again positive, and is near zero below a few hundred meters. If insufficient ocean heat uptake were the cause of the warming bias at the surface we would expect to see a cooling bias subsurface. The fact that we do not see such a feature suggests that ocean heat uptake is not the reason for the warming bias in surface air temperature.

Further insights into the true bias tendency may be obtained by considering the biases associated with individual model versions (as distinct from the ensemble mean considered previously). Figure 12 shows that, within the PPE, there is a high positive correlation between the true bias tendency for OHC and that for SAT. This correlation again implies that variations in ocean heat uptake are not the primary cause of variations in SAT bias in NoAssimPPE.

### b. Relating climate sensitivity, forcing trends, and bias tendency

Next we consider the possible causes of the different true bias tendencies in the various PPE versions.

The first possible explanation is that the true bias tendency is directly related to the climate sensitivity of the model version (Fig. 13a). Values for the transient climate response (TCR) were obtained for each model version through separate specific experiments carried out at the Met Office. The HadCM3 NoAssimPPE model versions have a TCR range of 1.6–2.7 K with a mean of 2.1 K, which may be compared with the likely range of 1.0–2.5 K from the Intergovernmental Panel on Climate Change (IPCC) Fifth Assessment Report (AR5; Stocker et al. 2013). Figure 13a shows a linear relationship between the true bias tendency for global mean SAT and TCR, in which the most sensitive models give the largest warming bias tendency, with a correlation coefficient of 0.89. This high correlation suggests that the true bias tendency may be providing very useful information about the sensitivity of the underlying model. The correlation between TCR and the uncorrected bias tendency is 0.75, so the corrections have also improved this relationship. In addition, since a perfect model should yield a true bias tendency of zero, we can use this relationship to estimate a likely range for TCR.

A Monte Carlo approach is used to fit regression lines to the data by perturbing the true bias tendency of each model version, taking into account the bias tendency uncertainty (0.016 K, calculated from the toy model). The distribution of the intercepts of these lines with the *y* = 0 line (corresponding to zero true bias tendency) then provides an observationally constrained range for TCR. We find that the 5%–95% range for TCR constrained in this way is 1.4–1.8 K with a median of 1.6 K using HadCRUT4 (Fig. 13c). This range is considerably narrower than the corresponding likely range from IPCC AR5 of 1.0–2.5 K, and observation-based ranges of 1.3–2.3 K (Gregory and Forster 2008) and 0.9–2.0 K (Otto et al. 2013). With doubled estimates for the uncertainty in the true bias tendency the range from this study becomes 0.9–1.9 K. The standard version of HadCM3 has a TCR of 2.0 K (Randall et al. 2007).

The constrained ranges of TCR for different observational datasets, are summarized in Table 1. Results indicate that the median and the ranges of the constrained TCR are only slightly sensitive to the data that are used to validate the hindcasts, with the other datasets producing values of TCR about 0.15 K higher. The reduced spread of TCR is a robust feature and so the underlying SAT true bias tendency from the decadal climate hindcasts could be used to constrain the model TCR, complementing other approaches proposed in the literature (e.g., Allen et al. 2000; Stott and Forest 2007; Gregory and Forster 2008; Knutti and Tomassini 2008; Murphy 2010; Tett et al. 2013). It is also interesting to note that having a range of models with widely different TCR has proved very useful in this analysis, especially to constrain the upper end of our TCR ranges.

However, there is another possible explanation for the true bias tendency differences. When considering the role of TCR we have assumed that the forcing trends in each PPE version are the same. However, Harris et al. (2013) recently demonstrated that the different PPE versions of HadCM3 have different non–greenhouse gas (GHG) forcing, likely attributable to the different interaction of aerosols with low clouds. The relationship is such that versions of HadCM3 with a low TCR, and negative bias tendency, also have a cooling trend from non-GHG forcing from 1961 to 2010, and this could potentially contribute to the relationship between TCR and true bias tendency.

Figure 13b relates the true bias tendency to the non-GHG forcing trends for the different PPE model versions. The forcing data are taken from Harris et al. (2013), and linear trends have been fitted from 1961 to 2010, excluding years with, and shortly after, volcanic eruptions. This provides an estimate of the non-GHG forcing trends and the observed relationship can be used to produce an improved constraint on the non-GHG forcing trend, which is found to be negative, unlike in the majority of the model versions.

Therefore, there are two possible causes for the relationship between perturbed parameter versions of HadCM3 and the true bias tendency: it is clear that the parameter perturbations affect both the TCR and the non-GHG forcing trends and that both factors influence the true bias tendency. Trying to separate the two effects is beyond the scope of this paper, but further work will use the spatial patterns, and other climate variables, to further understand the causes of the bias tendencies. However, we note that if both factors are playing a role then the constrained ranges for TCR and non-GHG forcing would broaden.

An additional related caveat is that if there is a *systematic* error (i.e., common to all model versions) in the trends in the radiative forcing applied to the model then this would also affect the true bias tendency. For example, if the forcing trends were systematically too large then the true bias tendency would also be too large, and vice versa. The result of any such bias would be to displace all the data in Figs. 13a,b vertically along the true bias tendency axis. Such a displacement would shift the constrained ranges but would not broaden the distributions. This caveat should be kept in mind when interpreting our results.

One possible approach to addressing these various caveats would be a multimodel study where the forcings are likely to be different for each model, and this is planned further work.

## 6. Conclusions and discussion

We have explored the estimation of bias in a toy model of a decadal prediction system, and applied the techniques developed to analyze the bias of operational predictions of global mean temperature. We have focused on hindcasts initialized from model states, rather than from observations, and examined the bias tendency in particular. The main findings can be summarized as follows:

The total bias tendency can be separated into several components: a contribution from sampling uncertainty attributable to internal variability, a start-time-dependent forcing bias tendency, and the true bias tendency.

We have shown how the contributions from sampling uncertainty and start-time-dependent forcing bias can be estimated, and removed, to give a better (lower variance and less biased) estimate of the true bias tendency. We argue that it is the true bias tendency, not the total bias tendency, which should be used to adjust decadal forecasts.

The true bias tendency is attributable to the following: 1) errors in the sensitivity of the underlying model to forcing and/or 2) start-time-independent errors in the specification of forcing (e.g., errors in the specification of anthropogenic aerosols).

To improve estimates of bias tendencies, more hindcast start dates are more beneficial than more ensemble members.

The Met Office NoAssimPPE prediction system exhibits, in the ensemble mean, a small positive true bias tendency in hindcasts of global mean surface air temperature, and this is marginally statistically significant. We have demonstrated that this bias is not attributable to insufficient ocean heat uptake.

The different true bias tendencies in global mean surface air temperature in the various PPE versions can be used to constrain relevant physical properties of the models, such as the TCR and non-GHG forcing trends.

There are a number of caveats to the findings above. In the toy model, we have assumed linear trends. However, we do not believe that this compromises the decomposition of the bias tendency into its different terms. Second, we assumed that the toy model has the same variability properties as the toy observations. This is unlikely to hold perfectly in an operational setting as there is a broad spread in simulated variability among different models (Hawkins and Sutton 2012) and even among the different PPE versions of HadCM3 (Ho et al. 2013), but this would only change the number of start dates and ensemble members required to reliably estimate the bias. Most importantly, we have assumed the radiative forcings imposed in the decadal hindcasts are correct, as discussed in section 5.

In the decadal hindcast experiments for CMIP5, the standard start dates are every 5 years (Meehl et al. 2009; Taylor et al. 2012). In this situation there is no way of estimating the consistent bias on annual time scales. Therefore, any lead-time-dependent errors in the forcing cannot be removed. However, in the “Tier 1” CMIP5 predictions, the complete volcanic and solar forcings are assumed known, so there should be little start-time-dependent forcing bias. In other suggested experiments this is not the case. We suggest that the design of future decadal prediction experiments should consider start dates every year to allow for any start-time-dependent forcing bias to be removed.

We believe that the analysis of bias tendencies has considerable potential to provide further insights into climate models and the real climate system. We note that Masson and Knutti (2013) suggest that perturbed-physics and multimodel ensembles can behave differently and show opposite emergent constraints so it would be valuable to repeat this analysis using a wider range of operational prediction systems.

Beyond the global means considered in this paper there is a great deal of information in the spatial patterns of bias growth for a range of variables, and we have begun work to analyze these patterns. Last, there is an obvious need to examine how the growth of biases in a system initialized from model states is related to the growth of biases in a system initialized from observational states. This work involves many challenges but is essential for the development of decadal predictions.

## Acknowledgments

We thank Glen Harris for providing the forcing data from his important study and for valuable discussions. We also thank two anonymous reviewers for their helpful comments that improved the manuscript. The research leading to this paper has received support from NCAS-Climate (EH, BD, and RS), from the European Community’s Seventh framework programme (FP7) under Grant GA212643 (THOR) (EH, DS) and from the UK NERC funded EQUIP (EH) and VALOR (JR) projects. DS was also supported by the joint DECC/Defra Met Office Hadley Centre Climate Programme (GA01101) and the EU FP7 COMBINE Project.

## REFERENCES

*Geophys. Res. Lett.,*

**36,**L22701,

*J. Geophys. Res.,*

**113,**D23105,

*Geophys. Res. Lett.,*

**39,**L01702,

**40,**5770–5775,

*Geophys. Res. Lett.,*

**35,**L09701,

*J. Geophy. Res.,*

**117,**D08101,

*Geophys. Res. Lett.,*

**37,**L09704,

*Climate Change 2007: The Physical Science Basis,*S. Solomon et al., Eds., Cambridge University Press, 589–662.

*Geophys. Res. Lett.,*

**39,**L19713,

**41,**2875–2888, doi:10.1007/s00382-012-1600-0.

*. Climate Change 2013: The Physical Science Basis,*Cambridge University Press, 3–28.

**26,**9367–9383, doi:10.1175/JCLI-D-12-00596.1.