## 1. Introduction

DelSole (2004a, hereafter Part I) discussed a framework for quantifying predictability based on information theory. This framework, which is reviewed in the following section, requires probability distributions that are not known and, in practice, are estimated from an imperfect forecast model. The purpose of this paper is to discuss an approach to accounting for imperfect forecasts within the above framework. The basic idea is to use not the forecast itself, but the conditional distribution of the state given the forecast. This idea was suggested by Schneider and Griffies (1999), although our interpretation appears to differ from theirs. The assumptions inherent in this approach are laid out in section 3 and the resulting predictability estimates are shown in section 4 to constitute a lower bound on the true predictability and potential predictability. The mutual information between verification and forecast is argued to be an attractive measure of forecast skill. Section 5 discusses predictable components of imperfect models and their potential significance in practical applications. The above concepts are illustrated in section 6 in the context of stationary, Gaussian, Markov systems. Numerical examples are presented in section 7. Finally, a summary of the results is given in section 8.

This paper considers the ideal case of large samples (large in a sense to be discussed in section 3). Strategies for dealing with small samples will be addressed in Part III.

## 2. Brief review of predictability

In this section we summarize the predictability framework proposed in Part I. Consider a dynamical system of dimension *K*. The state of the system at time *t* is specified by a *K*-dimensional vector **x**_{t}, which specifies a point with coordinates **x**_{t} in the *K*-dimensional space. If the state is uncertain, it is appropriate to describe the state by the density of possible points in phase space. This density is essentially a probability distribution function and evolves in time in a manner described by Liouville’s equation for conservative systems. The distribution of **x**_{t} changes discontinuously after the system is observed. Let the set of all observations up to time *t* be denoted by **o**_{t}. Note that **x**_{t} and **o**_{t} often reside in different spaces. The distribution of the state after observations become available is the conditional distribution *p*(**x**_{t}|**o**_{t}), whose mean is called the analysis. As is well known from state space estimation theory, the analysis distribution *p*(**x**_{t}|**o**_{t}) depends on the forecast model and therefore is conditioned on the forecast model.

It proves convenient to distinguish the state at two different times by different symbols. Thus, let the initial condition at time *t* be **i** = **x**_{t}, the verification at time *t* + *τ* be **v** = **x**_{t+τ}, and the observations up to time *t* be **o** = **o**_{t}. The parameter *τ* is called the lead-time. The probability distribution functions (pdf’s) of **i**, **v**, **o** will be denoted by *p*(**i**), *p*(**v**), *p*(**o**), respectively, where the function *p*(·) is understood to differ according to its argument. In this notation, the analysis distribution at time *t* is *p*(**i**|**o**). For stationary systems, *p*(**v**) = *p*(**i**).

The distribution of the future state **v** = **x**_{t+τ}, after observations become available, is denoted by *p*(**v**|**o**) and computed from the classical formula

$$
p(\mathbf{v}|\mathbf{o}) = \int r(\mathbf{v}|\mathbf{i})\, p(\mathbf{i}|\mathbf{o})\, d\mathbf{i}, \tag{1}
$$

where *r*(**v**|**i**) is a *transition probability* associated with a dynamical or stochastic model and the integral is a multiple integral. The distribution *p*(**v**|**o**) will be called the *perfect model forecast distribution*. This distribution describes our knowledge of the future state **v** = **x**_{t+τ} after antecedent observations **o**_{t} and a (perfect model) forecast based on those observations become available. We use the term forecast system to refer to the combined influence of the forecast model and the uncertainty in the initial condition. Note that a perfect model forecast distribution is not “perfectly predictable,” even for a deterministic model, because even if *r*(**v**|**i**) is deterministic and hence a delta function, *p*(**v**|**o**) from (1) is not a delta function, owing to uncertainty in the initial condition as described by *p*(**i**|**o**).
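As a minimal sketch of how a transition probability and an analysis distribution combine into a forecast distribution, the discrete analogue of the integral over initial conditions can be written as a weighted sum. The two-state system, the transition tables `r` and `r_det`, and the analysis probabilities below are illustrative assumptions, not quantities from this paper.

```python
# Hypothetical two-state system; all numbers are illustrative assumptions.
# r[i][v] plays the role of the transition probability r(v|i).
r = [[0.9, 0.1],
     [0.2, 0.8]]

def forecast_distribution(p_i_given_o, transition):
    """Discrete analogue of p(v|o) = sum over i of r(v|i) p(i|o)."""
    n_i = len(transition)
    n_v = len(transition[0])
    return [sum(p_i_given_o[i] * transition[i][v] for i in range(n_i))
            for v in range(n_v)]

# Uncertain analysis: the initial state is 0 with probability 0.7.
p_analysis = [0.7, 0.3]
p_v_given_o = forecast_distribution(p_analysis, r)

# Deterministic model: each row of the transition table is a delta function.
r_det = [[0.0, 1.0],
         [1.0, 0.0]]
p_v_det = forecast_distribution(p_analysis, r_det)

print(p_v_given_o)
print(p_v_det)
```

Even with the deterministic table `r_det`, the resulting forecast distribution is not concentrated on a single state, because the uncertainty in the analysis is carried forward unchanged.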

In the absence of the observations **o**_{t}, the variable **v** = **x**_{t+τ} has a climatological distribution given by its marginal distribution

$$
p(\mathbf{v}) = \int p(\mathbf{v}|\mathbf{o})\, p(\mathbf{o})\, d\mathbf{o}. \tag{2}
$$

If the system is stationary or cyclostationary, then the climatological distribution is independent of time or periodic, respectively, and can be estimated from historical records. The variable **v** is said to be unpredictable if *p*(**v**|**o**) = *p*(**v**), which is equivalent to the statement that **v** is independent of the observations **o**.

## 3. The accessible forecast distribution

In practice, the transition probability *r*(**v**|**i**) for the climate system is not known. It follows then that the perfect model forecast distribution *p*(**v**|**o**) cannot be computed from (1) and, hence, is unknown too. Moreover, the transition probability *r*(**v**|**i**) cannot be estimated from data because nature provides only a single realization of **x**_{t+τ} for a given value of **x**_{t}, and the atmosphere has no natural analogues in the sense discussed by Lorenz (1969). For these reasons, the transition probability must be estimated from a model. The details of the model are immaterial: for example, the model could be purely empirical or purely physical. What is important is that the model provides a transition probability, which in all realistic cases differs from that of the true system. Moreover, the state space of the forecast model usually differs from that of the true system. Consequently, the “initial condition” appropriate for the accessible forecast, denoted **i**_{f}, differs from the initial condition appropriate for the perfect model forecast, **i**. Let the initial condition distribution appropriate for the accessible forecast be *p*(**i**_{f}|**o**), and let the forecast verifying at time *t* + *τ* be **f**. The forecast distribution is then given by

$$
p(\mathbf{f}|\mathbf{o}) = \int r'(\mathbf{f}|\mathbf{i}_f)\, p(\mathbf{i}_f|\mathbf{o})\, d\mathbf{i}_f, \tag{3}
$$

where *r*′(**f**|**i**_{f}) is the transition probability for the model. The distribution *p*(**f**|**o**) will be called the accessible forecast distribution, to distinguish it from the perfect model forecast distribution *p*(**v**|**o**), which is inaccessible in any realistic scenario. Samples drawn from *p*(**f**|**o**) constitute the forecast ensemble.

Obviously, we would eliminate model errors if we could. Hence, we assume that model errors cannot be eliminated easily. In such situations, there appears to be no alternative other than to quantify predictability based on the past behavior of the model and observations, assuming that the past relation between model and observations will persist into the future. This assumption is reasonable for stationary systems, but is problematic for nonstationary systems, such as occur in climate change scenarios.

Under this assumption, the past relation among verification, observations, and forecast is described by the joint distribution *p*(**v**, **o**, **f**). The distribution of the verification given knowledge of the forecast and observations is the conditional distribution *p*(**v**|**o**, **f**). Since the true system evolves according to a set of laws that are (presumably) independent of any accessible forecast, **f** and **v** are conditionally independent, in the sense that

$$
p(\mathbf{v}|\mathbf{o}, \mathbf{f}) = p(\mathbf{v}|\mathbf{o}). \tag{4}
$$

Hence, if the joint distribution *p*(**v**, **o**, **f**) were really known, then the accessible forecast would be irrelevant for the purposes of measuring predictability. The fact that knowledge of *p*(**v**, **o**, **f**) is tantamount to knowledge of the perfect model distribution *p*(**v**|**o**) raises the question as to the role of the forecast model. The answer lies in the fact that some distributions are more accessible than others. For instance, the distributions *p*(**v**, **o**) and *p*(**f**, **o**) are anticipated to be complicated functions owing to the nonlinear transition probabilities associated with dynamical systems. On the other hand, if the forecast model captures enough detail in the nonlinear processes, then it is hoped that the forecast **f** will differ from **v** in “simple” ways that are “easily” corrected. For instance, if the forecast merely is biased relative to **v**, then the best prediction is the forecast distribution shifted by an amount that removes the bias. In this scenario, *p*(**v**, **o**) and *p*(**f**, **o**) would be impractical to estimate owing to their nonlinearity, but *p*(**v**, **f**) would not because **v** and **f** are related by an additive constant. The approach pursued here assumes that *p*(**v**, **f**) requires “much less” data for its estimation than *p*(**v**, **o**); otherwise we would use *p*(**v**, **o**) and dispense with the forecast altogether. Some insight into these assumptions is provided by the examples presented in section 7.

Since our focus is on predictability, we do not dwell on the question of how to utilize *p*(**v**, **f**) to characterize forecast errors; see Murphy (1993) and von Storch and Zwiers (1999) for discussion of this issue. Given the joint distribution *p*(**v**, **f**), the distribution of the verification given the accessible forecast is *p*(**v**|**f**). We call *p*(**v**|**f**) the regression forecast distribution for reasons that will become apparent. The regression forecast distribution *p*(**v**|**f**) has many desirable properties related to accuracy, reliability, and resolution, in the sense of Murphy (1993). Furthermore, if the accessible forecast is independent of the verification, then *p*(**v**|**f**) = *p*(**v**): the regression forecast distribution reduces to the climatological distribution, and the variable **v** is said to be unpredictable. In such cases, the accessible forecast distribution gives no information about the verification that is not already contained in the climatological distribution.

If an ensemble of forecasts is available, say **f**_{1}, **f**_{2}, . . . , **f**_{M}, then the desired regression forecast distribution is the conditional distribution given the forecast ensemble *p*(**v**|**f**_{1}, **f**_{2}, . . . , **f**_{M}). Typically, forecast ensembles from the same model are constructed such that each member is equally likely. In such cases, the order of the ensemble members is irrelevant and hence the regression forecast distribution *p*(**v**|**f**_{1}, **f**_{2}, . . . , **f**_{M}) can depend on the sample only through certain sufficient statistics. In section 6 we show that if the distribution is joint normal, then the ensemble mean is a sufficient statistic of the regression forecast distribution; that is, *p*(**v**|**f**_{1}, **f**_{2}, . . . , **f**_{M}) = *p*(**v**|〈**f**〉), where 〈**f**〉 is the sample mean of the ensemble forecasts [also defined in (18)].

Note that the forecast distribution *p*(**f**|**o**) plays a relatively minor role in predictability, as compared to the regression forecast distribution *p*(**v**|**f**). This point deserves mention since numerous predictability studies focus almost exclusively on the forecast distribution *p*(**f**|**o**). This emphasis is appropriate if the accessible forecast is perfect. If the accessible forecast is not perfect, however, structure in the forecast distribution is relevant only to the extent that it covaries with the event in question. One might suggest that the forecast distribution can be transformed into the relevant distribution for the event that we want to predict. Leaving aside the question of how to construct an appropriate transformation, there can be no more information in the derived forecast distribution than in the individual members of the forecast that were used to construct the distribution. Hence, it is sensible that the regression forecast distribution, which plays a central role in our predictability framework, depends on the actual realizations of the forecast rather than on some distribution derived from the realizations.

## 4. Predictability of regression forecasts

The previous section introduced the regression forecast distribution *p*(**v**|**f**), which can be interpreted as the distribution of the future state **v** given the accessible forecast **f** and the (past) joint behavior between these two variables. This section shows that the predictability of a regression forecast distribution constitutes a rigorous lower bound on the true predictability. Furthermore, the “potential predictability,” which is a measure of the difference between the accessible forecast distribution *p*(**f**|**o**) and its climatology *p*(**f**), provides an upper bound on the predictability of the regression forecast distribution. These bounds clarify the role of accessible forecasts in the estimation of predictability.

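Predictability of the true system is measured by the difference between the perfect model forecast distribution *p*(**v**|**o**) and climatological distribution *p*(**v**). Two measures of this difference are relative entropy *R*_{v,o} and predictive information *P*_{v,o}, as discussed in Kleeman (2002) and Schneider and Griffies (1999). DelSole (2004b) showed that the average of either of these quantities, over all observations, yields a quantity called the mutual information *I*(**V**; **O**):

$$
I(\mathbf{V}; \mathbf{O}) = \iint p(\mathbf{v}, \mathbf{o}) \log \frac{p(\mathbf{v}, \mathbf{o})}{p(\mathbf{v})\, p(\mathbf{o})}\, d\mathbf{v}\, d\mathbf{o}. \tag{5}
$$

Mutual information has the attractive property that it is invariant with respect to invertible, nonlinear transformations.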
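To make the definition concrete, the following sketch evaluates mutual information for a discrete joint distribution, where the double integral becomes a double sum. The function name `mutual_information` and the 2 × 2 probability tables are illustrative assumptions. Relabelling the categories of one variable is used as a simple stand-in for an invertible transformation.

```python
import math

def mutual_information(joint):
    """Mutual information (in nats) of a joint pmf stored as a list of rows."""
    p_row = [sum(row) for row in joint]
    p_col = [sum(col) for col in zip(*joint)]
    return sum(p * math.log(p / (p_row[a] * p_col[b]))
               for a, row in enumerate(joint)
               for b, p in enumerate(row) if p > 0.0)

# Illustrative joint distributions (assumed values).
joint_dependent = [[0.4, 0.1],
                   [0.1, 0.4]]
joint_independent = [[0.25, 0.25],
                     [0.25, 0.25]]

# Swapping the labels of the second variable is an invertible transformation.
joint_relabelled = [row[::-1] for row in joint_dependent]

i_dependent = mutual_information(joint_dependent)
i_independent = mutual_information(joint_independent)
i_relabelled = mutual_information(joint_relabelled)

print(i_dependent, i_independent, i_relabelled)
```

The independent table gives zero mutual information, and the relabelled table gives the same value as the original, consistent with the invariance property.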

Unfortunately, the perfect model distribution *p*(**v**|**o**) is not accessible, as discussed in the previous section. Instead, we have access to the forecast *p*(**f**|**o**) and the regression forecast distribution *p*(**v**|**f**). Hence, two distinct predictability measures may be conceived. First, the accessible forecast *p*(**f**|**o**) may be compared to its climatology *p*(**f**). By analogy with (5), the associated average predictability is the mutual information between **F** and **O**, denoted *I*(**F**; **O**). Here *I*(**F**; **O**) will be called potential predictability since this term is used similarly in the literature to describe the predictability of a forecast system relative to its climatology, without reference to the true system. Second, the regression forecast distribution *p*(**v**|**f**) may be compared to the climatological distribution *p*(**v**). It can be shown that the average predictability of the regression forecast distribution is the mutual information between **F** and **V**, denoted *I*(**F**; **V**); *I*(**F**; **V**) will be called the predictability of the regression forecast distribution.

The three quantities *I*(**V**; **O**), *I*(**F**; **O**), and *I*(**F**; **V**) satisfy certain fundamental inequalities. First, Eq. (4) implies that the variables **v**, **f**, **o** form a Markov chain in the order **f** ⇒ **o** ⇒ **v**. By the fundamental data processing theorem in information theory (Cover and Thomas 1991, chapter 2), the mutual information between the above variables satisfies the inequality

$$
I(\mathbf{F}; \mathbf{O}) \ge I(\mathbf{F}; \mathbf{V}). \tag{6}
$$

This inequality states that the potential predictability of an accessible forecast system is greater than or equal to the predictability of the regression forecast distribution. The above inequality clarifies that potential predictability does not constitute an upper bound on the true predictability, as is sometimes implied in the literature but, rather, it constitutes an upper bound on the average predictability of the *regression forecast*.

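Second, a Markov chain read in reverse order is also a Markov chain, so the same variables form a chain in the order **v** ⇒ **o** ⇒ **f**, from which it follows that

$$
I(\mathbf{V}; \mathbf{O}) \ge I(\mathbf{F}; \mathbf{V}). \tag{7}
$$

This inequality states that no regression forecast can have greater predictability than that of the true system. Equivalently, the predictability of the regression forecast distribution constitutes a lower bound on the average predictability of the true system.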
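The two bounds can be checked numerically on a small discrete example. The sketch below assumes a joint distribution for verification and observations, and an imperfect forecast that depends on the observations alone, so that verification, observations, and forecast form a Markov chain. All probability tables and names are illustrative assumptions.

```python
import math

def mutual_information(joint):
    """Mutual information (in nats) of a joint pmf stored as a list of rows."""
    p_row = [sum(row) for row in joint]
    p_col = [sum(col) for col in zip(*joint)]
    return sum(p * math.log(p / (p_row[a] * p_col[b]))
               for a, row in enumerate(joint)
               for b, p in enumerate(row) if p > 0.0)

states = range(2)

# Assumed joint distribution p(v, o), indexed as joint_vo[v][o].
joint_vo = [[0.45, 0.05],
            [0.05, 0.45]]
# Assumed imperfect forecast p(f|o), indexed as p_f_given_o[o][f].
p_f_given_o = [[0.8, 0.2],
               [0.3, 0.7]]

p_o = [sum(joint_vo[v][o] for v in states) for o in states]

# p(f, o) = p(o) p(f|o); p(v, f) = sum over o of p(v, o) p(f|o).
joint_fo = [[p_o[o] * p_f_given_o[o][f] for o in states] for f in states]
joint_vf = [[sum(joint_vo[v][o] * p_f_given_o[o][f] for o in states)
             for f in states] for v in states]

i_vo = mutual_information(joint_vo)   # predictability of the true system
i_fo = mutual_information(joint_fo)   # potential predictability
i_vf = mutual_information(joint_vf)   # predictability of the regression forecast

print(i_vo, i_fo, i_vf)
```

In this example the mutual information between verification and forecast is the smallest of the three quantities, as the data processing theorem requires for any choice of tables with this chain structure.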

In contrast to most other proposed measures of predictability, mutual information does not require that the state space of the accessible forecast and the true system be the same. The desirability of this property can be appreciated from the fact that investigators generally are interested in whether some set of forecast variables can provide useful predictors of the verification, regardless of whether the variables are the same. Large mutual information indicates that some forecast variables are statistically dependent with the verification and hence can provide useful predictors of the verification.

A schematic that may facilitate the interpretation of the above quantities is shown in Fig. 1. In this abstraction, a probability distribution is characterized by a “point,” and the distance between two points indicates the difference between two distributions. Each line segment in the figure has been labeled to indicate its meaning. Angles have no meaning. The distance between the climatological distribution *p*(**v**) and perfect model distribution *p*(**v**|**o**) defines the predictability of the true system. However, we do not have access to the perfect model distribution *p*(**v**|**o**); rather, we have access to the accessible forecast distribution *p*(**f**|**o**). The “distance” between the perfect model distribution *p*(**v**|**o**) and accessible forecast *p*(**f**|**o**) is the most complete description of forecast error (although this distance is undefined if these distributions are represented in different state spaces). The distance between the accessible forecast *p*(**f**|**o**) and its climatology *p*(**f**) represents potential predictability. The regression forecast distribution *p*(**v**|**f**) provides a link between these two types of predictability measures. The distance between the regression forecast distribution *p*(**v**|**f**) and the climatological distribution *p*(**v**) is the best estimate of predictability based solely on the accessible forecasts. The figure has been constructed such that the predictability of the regression forecast distribution is smaller than either the predictability of the true system or the predictability of the accessible forecast system, as required by inequalities (6) and (7).

Suppose that, given a forecast **f**, we attempt to construct a new forecast, denoted *L*(**f**), with the goal of improving the skill over the original forecast **f**. If the operation *L*(·) is a function only of **f**, then the distribution of *L*(**f**), conditional on **f** and **v**, must be independent of **v**:

$$
p(L(\mathbf{f})|\mathbf{f}, \mathbf{v}) = p(L(\mathbf{f})|\mathbf{f}). \tag{8}
$$

It follows from this expression that the above variables form a Markov chain in the order

$$
\mathbf{v} \Rightarrow \mathbf{f} \Rightarrow L(\mathbf{f}). \tag{9}
$$

By the fundamental data processing theorem in information theory (Cover and Thomas 1991, chapter 2), the mutual information between the variables satisfies the inequality

$$
I(\mathbf{V}; \mathbf{F}) \ge I(\mathbf{V}; L(\mathbf{F})). \tag{10}
$$

Hence, no manipulation of the forecast can enhance *I*(**V**; **F**). This property distinguishes *I*(**V**; **F**) from other skill metrics, such as mean square error, which often can be improved by biasing the forecast toward climatology. Since mutual information is invariant with respect to invertible, nonlinear transformations, the above proof implies that noninvertible transformations can only reduce mutual information. The above inequality has an intuitive interpretation in communication theory: It states that, if a message is sent through a noisy channel and received at the other end as an output, no manipulation of the output can increase the information about the message contained in the output.

Mutual information between forecast and verification also can be interpreted as a measure of skill, as suggested briefly by Leung and North (1990). The skill of a forecast can be measured in at least two distinct ways: by the “closeness” between forecast and verification, as measured by mean square error, or by the “temporal similarity” between forecast and verification, as measured by the correlation coefficient. Mutual information can be interpreted as a generalization of “similarity” measures since it is based on the fundamental probabilistic definition of independence and, hence, does not make implicit assumptions regarding the form of the relation between two variables. Furthermore, mutual information is invariant with respect to nonlinear transformations of the data, is invariant with respect to the role of forecast and verification, cannot be improved by manipulating the forecast (provided the manipulation is independent of verification), and vanishes if and only if the forecast is statistically independent of the verification. Also, owing to (5), forecasts with larger mutual information provide more information about the verification, which is a sensible measure of skill. Finally, in the case of bivariate normal distributions, mutual information is monotonically related to the correlation between forecast and verification. Therefore, mutual information reduces to a common measure of skill in suitable circumstances.

The above discussion implies that a forecast can contain large systematic errors and yet still have significant mutual information. In communication theory, we would say that the forecast is subject to distortion. Such distortion can be corrected if the functional relation between the forecast and verification can be inverted. This correction is implicitly included in the regression forecast distribution *p*(**v**|**f**). Adopting mutual information as a measure of skill would implicitly include this correction and hence eliminate the temptation to statistically correct forecasts for the purposes of improving skill. These comments should not be construed as suggesting that skill metrics, such as mean square error, are not useful. We are simply clarifying the fact that mutual information measures a different kind of skill than mean square error.

For continuous distributions, mutual information has no maximum value. Joe (1989) showed that the transformation [1 − exp(−2*I*)]^{1/2} produces a value in the interval [0, 1] and recovers the correlation, multiple correlation, and partial correlation in appropriate circumstances when the variables are normally distributed. Schneider and Griffies (1999) propose an analogous transformation for predictive information. For discrete distributions, mutual information is bounded above by the entropy of the verification *H*(**V**). Accordingly, in the discrete case, the ratio *I*(**V**; **F**)/*H*(**V**) is bounded between 0 and 1 and, hence, may provide an attractive skill score for discrete, probabilistic forecast verification. Joe (1989) discusses other normalizations of mutual information.
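As a brief check of the normalization for normally distributed variables, the sketch below uses the standard result that a bivariate normal distribution with correlation ρ has mutual information −½ log(1 − ρ²), and applies the square-root form of the transformation, [1 − exp(−2*I*)]^{1/2}. The function names are illustrative assumptions.

```python
import math

def gaussian_mutual_information(rho):
    """Mutual information (in nats) of a bivariate normal with correlation rho."""
    return -0.5 * math.log(1.0 - rho ** 2)

def normalized_information(mi):
    """Map mutual information in [0, inf) to [0, 1) via sqrt(1 - exp(-2 I))."""
    return math.sqrt(1.0 - math.exp(-2.0 * mi))

correlations = [0.0, 0.3, 0.9]
recovered = [normalized_information(gaussian_mutual_information(rho))
             for rho in correlations]

for rho, value in zip(correlations, recovered):
    print(rho, value)
```

For each assumed correlation, the normalized value reproduces the magnitude of the correlation to within rounding error.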

Mutual information also satisfies the identity

$$
I(\mathbf{V}; \mathbf{F}) = H(\mathbf{V}) - H(\mathbf{V}|\mathbf{F}), \tag{11}
$$

where *H*(**V**) is the entropy of the climatological distribution *p*(**v**) and *H*(**V**|**F**) is the conditional entropy of **V** given **F** (Cover and Thomas 1991, chapter 2). According to this identity, positive skill *I*(**V**; **F**) implies *H*(**V**) > *H*(**V**|**F**), implying that a skillful forecast reduces the average uncertainty of an event relative to the climatological distribution. This relation links the concepts of degree of dependence (skill) and reduction in uncertainty (predictability). Inequality (7) and identity (11) imply *H*(**V**|**F**) ≥ *H*(**V**|**O**), which states that the forecast cannot reduce uncertainty more than observations and a perfect forecast model. The above results can be extended to show that conditioning never increases the average uncertainty. It follows that a forecast based on all available knowledge should have less uncertainty than a forecast based on partial knowledge.

## 5. Predictable components of an accessible forecast

A potential unpredictable component is defined as a forecast variable **F**_{u} that satisfies

$$
p(\mathbf{f}_u|\mathbf{o}) = p(\mathbf{f}_u). \tag{12}
$$

If a forecast variable does not satisfy (12), then it is called a potential predictable component, denoted **F**_{p}. The word potential is used to indicate that these components are predictable in the accessible forecast but not necessarily in the true system, though this term often will be dropped in the sequel because predictable components of other forecasts will not be considered. In this section, we show that, under certain plausible assumptions, potential predictable components, and these components alone, can be used as predictors of a regression forecast. This result has important implications if the potential predictable components span a dimension smaller than that of the full system.

The proof given below holds even if the variables are not joint normally distributed. In practice, however, the normal assumption is needed to identify predictable components. For instance, if the variables are joint normally distributed, then canonical correlation analysis (CCA) can identify the unpredictable components (DelSole 2004b). This procedure is equivalent to predictable component analysis proposed by Schneider and Griffies (1999), provided the distributions are joint normal (DelSole and Chang 2003). Even if variables are not normally distributed, CCA still might be a useful method of finding potential predictable components because it identifies components with large correlation. Whether more general methods of finding predictable components are needed for realistic systems is a question that only experiment can settle.

Let the forecast variables be partitioned into two sets: **Z**_{u} = {**f**^{(1)}_{u}, **f**^{(2)}_{u}, . . .}, called potential unpredictable components, and everything else, **Z**_{p} = {**f**^{(1)}_{p}, **f**^{(2)}_{p}, . . .}, called potential predictable components. The unpredictable components are identified with weather noise. As such, it is plausible to assume that the unpredictable components are independent of observations jointly:

$$
p(\mathbf{z}_u|\mathbf{o}) = p(\mathbf{z}_u). \tag{13}
$$

We make the stronger, yet still plausible, assumption that weather noise in the accessible forecast is independent of observations, verification, and predictable components:

$$
p(\mathbf{z}_u|\mathbf{o}, \mathbf{v}, \mathbf{z}_p) = p(\mathbf{z}_u). \tag{14}
$$

This assumption holds automatically if weather noise is parameterized as independent, additive noise, as is usually the case in predictability studies. Note that the above assumption implies that **Z**_{u} is independent of any combination of **o**, **v**, **Z**_{p}.

The importance of the above identities, (15) and (16), is that the dimension of **Z**_{p} may be much smaller than the dimension of the full system, especially in the context of monthly or seasonal predictability. Furthermore, the predictable components of an accessible forecast model can be determined with more accuracy than the predictable components of the observed system, owing to the availability of multiple realizations of the accessible forecast. Finally, inequality (6) implies *I*(**Z**_{p}; **O**) ≥ *I*(**V**; **Z**_{p}): The predictability of the potential predictable components is never less than the predictability of the regression forecast distribution. For these reasons, predictable components may provide an attractive basis set for reducing the dimension of the predictability analysis.

## 6. Regression forecasts for Gaussian, Markov systems

If **v** and **f** are joint normally distributed, then it is well known that the conditional distribution *p*(**v**|**f**) is

$$
p(\mathbf{v}|\mathbf{f}) = N\!\left(\boldsymbol{\mu}_v + \boldsymbol{\Sigma}_{vf} \left(\boldsymbol{\Sigma}^{\tau}_{f}\right)^{-1} (\mathbf{f} - \boldsymbol{\mu}_f),\; \boldsymbol{\Sigma}_v - \boldsymbol{\Sigma}_{vf} \left(\boldsymbol{\Sigma}^{\tau}_{f}\right)^{-1} \boldsymbol{\Sigma}_{fv}\right), \tag{17}
$$

where **μ**_{v} and **Σ**_{v} are the mean and covariance matrix of the marginal distribution *p*(**v**), **μ**_{f} and **Σ**^{τ}_{f} are the analogous quantities for *p*(**f**), and **Σ**_{vf} = **Σ**^{T}_{fv} is the cross-covariance matrix between **f** and **v** [see Johnson and Wichern (1982), p. 170 for a derivation]. Readers familiar with statistical methods will recognize this result as equivalent to a least squares linear prediction of **v**, given **f**, for asymptotically large sample size. Although least squares estimation would be an obvious way of correcting a forecast, it arises here not because we are trying to minimize forecast error variance directly, but because, as is well known, it is equivalent to the conditional distribution *p*(**v**|**f**) for Gaussian variables.
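For scalar variables the conditional distribution reduces to a familiar linear regression, which the following sketch evaluates. It assumes jointly normal scalar verification and forecast. The function name `regression_forecast` and all numerical statistics (a forecast with a bias of 2 and inflated variance) are illustrative assumptions.

```python
def regression_forecast(f, mu_v, mu_f, var_v, var_f, cov_vf):
    """Mean and variance of p(v|f) for scalar, jointly normal v and f."""
    gain = cov_vf / var_f                 # regression coefficient
    mean = mu_v + gain * (f - mu_f)       # bias and amplitude are both corrected
    variance = var_v - gain * cov_vf      # never exceeds the climatological variance
    return mean, variance

# Assumed statistics: forecast climatology is shifted and has larger variance.
mean, variance = regression_forecast(
    f=3.0, mu_v=0.0, mu_f=2.0, var_v=1.0, var_f=4.0, cov_vf=1.6)

# If forecast and verification are uncorrelated, climatology is recovered.
clim_mean, clim_variance = regression_forecast(
    f=3.0, mu_v=0.0, mu_f=2.0, var_v=1.0, var_f=4.0, cov_vf=0.0)

print(mean, variance)
print(clim_mean, clim_variance)
```

The conditional variance does not depend on the particular forecast value, only on the joint statistics, while the conditional mean removes the assumed bias and rescales the forecast anomaly.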

In practice, the forecast is available as an ensemble drawn from the accessible forecast distribution *p*(**f**|**o**). Let the forecast ensemble be **f**_{1}, **f**_{2}, . . . , **f**_{M}. The conditional distribution of the verification given the forecast ensemble is *p*(**v**|**f**_{1}, **f**_{2}, . . . , **f**_{M}). For Gaussian variables, the distribution *p*(**v**|**f**_{1}, **f**_{2}, . . . , **f**_{M}) is identical to *p*(**v**|〈**f**〉), where 〈**f**〉 is the sample ensemble mean forecast

$$
\langle \mathbf{f} \rangle = \frac{1}{M} \sum_{m=1}^{M} \mathbf{f}_m. \tag{18}
$$

The equivalence between *p*(**v**|**f**_{1}, **f**_{2}, . . . , **f**_{M}) and *p*(**v**|〈**f**〉) can be seen in several ways. Perhaps the simplest is to note that, owing to the Gaussian form, the distribution *p*(**v**|**f**_{1}, **f**_{2}, . . . , **f**_{M}) can depend only linearly on the forecasts **f**_{1}, **f**_{2}, . . . , **f**_{M}. Furthermore, since there is no basis for treating any one forecast from the same model differently from the others, the distribution *p*(**v**|**f**_{1}, **f**_{2}, . . . , **f**_{M}) must be invariant with respect to an interchange of any two forecasts. The properties of invariance and linearity imply that the distribution *p*(**v**|**f**_{1}, **f**_{2}, . . . , **f**_{M}) can depend only on the sum over *M* vectors **f**_{1} + **f**_{2} + . . . + **f**_{M}, which is proportional to the sample ensemble mean 〈**f**〉. From the joint normal distribution assumption, it follows that the conditional distribution *p*(**v**|〈**f**〉) is

$$
p(\mathbf{v}|\langle \mathbf{f} \rangle) = N\!\left(\boldsymbol{\mu}_v + \boldsymbol{\Sigma}_{v\langle f \rangle} \boldsymbol{\Sigma}_{\langle f \rangle}^{-1} (\langle \mathbf{f} \rangle - \boldsymbol{\mu}_{\langle f \rangle}),\; \boldsymbol{\Sigma}_v - \boldsymbol{\Sigma}_{v\langle f \rangle} \boldsymbol{\Sigma}_{\langle f \rangle}^{-1} \boldsymbol{\Sigma}_{v\langle f \rangle}^{T}\right), \tag{19}
$$

which has the same form as (17), but with mean **μ**_{〈f〉} and covariance matrices **Σ**_{v〈f〉} and **Σ**_{〈f〉} pertaining to the sample ensemble mean forecast 〈**f**〉.

Since *p*(**v**|**f**_{1}, **f**_{2}, . . . , **f**_{M}) depends only on the sample mean forecast 〈**f**〉, it might appear that the regression forecast distribution is independent of the forecast spread. This is not the case, as we will now demonstrate. The population mean forecast, often called the signal, is a random variable given by

$$
E[\mathbf{f}|\mathbf{o}] = \int \mathbf{f}\, p(\mathbf{f}|\mathbf{o})\, d\mathbf{f}. \tag{20}
$$

Note that *E*[**f**|**o**] is a random function since it depends on **o**. It is routine to show that

$$
\boldsymbol{\Sigma}^{\tau}_{f} = \boldsymbol{\Sigma}_s + \boldsymbol{\Sigma}_n, \tag{21}
$$

where

$$
\boldsymbol{\Sigma}_s = E\!\left[(E[\mathbf{f}|\mathbf{o}] - \boldsymbol{\mu}_f)(E[\mathbf{f}|\mathbf{o}] - \boldsymbol{\mu}_f)^{T}\right], \qquad
\boldsymbol{\Sigma}_n = E\!\left[(\mathbf{f} - E[\mathbf{f}|\mathbf{o}])(\mathbf{f} - E[\mathbf{f}|\mathbf{o}])^{T}\right], \tag{22}
$$

in which *E*[] with no conditioning represents the expectation over *p*(**v**, **f**_{1}, **f**_{2}, . . . , **f**_{M}, **o**). The term **Σ**_{s} measures the variance of the “signal,” while the term **Σ**_{n} measures the variance of “forecast spread” or “noise.” Elementary sampling theory shows that

$$
\boldsymbol{\Sigma}_{\langle f \rangle} = \boldsymbol{\Sigma}_s + \frac{1}{M} \boldsymbol{\Sigma}_n, \qquad
\boldsymbol{\Sigma}_{v\langle f \rangle} = \boldsymbol{\Sigma}_{vf}, \tag{23}
$$

where **Σ**_{〈f〉} is the covariance matrix of 〈**f**〉, and **Σ**_{v〈f〉} is the covariance matrix between **v** and 〈**f**〉. These expressions show that **Σ**_{〈f〉} depends on the spread of the forecast **Σ**_{n}. Since **Σ**_{s} and **Σ**_{n} are positive definite, increasing **Σ**_{n} increases the variance of 〈**f**〉 but does not alter the covariance **Σ**_{v〈f〉}. Thus, appearances to the contrary, the distribution *p*(**v**|〈**f**〉) depends on forecast spread in the following sense: given two forecasts with the same signal but different ensemble spreads, the forecast with larger spread gives rise to a regression forecast distribution with larger uncertainty. The variation of the regression forecast distribution depends on the sample only through the sample ensemble mean.
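The dependence on spread can be illustrated in the scalar case. The sketch below assumes that each ensemble member is a common signal plus independent noise, so that the variance of the ensemble mean is the signal variance plus the noise variance divided by the ensemble size, while the covariance between verification and ensemble mean is unaffected by the noise. The function name and all numbers are illustrative assumptions.

```python
def regression_variance(var_v, cov_v_mean, var_signal, var_noise, members):
    """Variance of p(v|<f>) for scalar Gaussian variables.

    Assumes var(<f>) = var_signal + var_noise / members and that the
    covariance between v and <f> does not depend on the noise variance.
    """
    var_mean = var_signal + var_noise / members
    return var_v - cov_v_mean ** 2 / var_mean

# Same signal and same covariance with the verification, different spreads.
small_spread = regression_variance(
    var_v=1.0, cov_v_mean=0.5, var_signal=0.5, var_noise=0.5, members=10)
large_spread = regression_variance(
    var_v=1.0, cov_v_mean=0.5, var_signal=0.5, var_noise=2.0, members=10)

print(small_spread, large_spread)
```

With the signal held fixed, the ensemble with the larger assumed spread yields the larger regression forecast variance, even though only the ensemble mean enters the conditional mean.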

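In a perfect model scenario, *p*(**v**|**o**) = *p*(**f**|**o**), which implies that

$$
\boldsymbol{\mu}_f = \boldsymbol{\mu}_v, \qquad
\boldsymbol{\Sigma}^{\tau}_{f} = \boldsymbol{\Sigma}_v = \boldsymbol{\Sigma}_s + \boldsymbol{\Sigma}_n, \qquad
\boldsymbol{\Sigma}_{v\langle f \rangle} = \boldsymbol{\Sigma}_s. \tag{24}
$$

The last relation arises because the forecast and verification can be represented each as a sum of a common signal plus independent noise, and all cross-covariances involving the noise terms vanish. Substituting these expressions into (19) gives a regression forecast distribution *p*(**v**|〈**f**〉) that is multivariate Gaussian with mean and covariance matrix

$$
\boldsymbol{\mu}_{\mathrm{PM}} = \boldsymbol{\mu}_v + \boldsymbol{\Sigma}_s \left(\boldsymbol{\Sigma}_s + \tfrac{1}{M} \boldsymbol{\Sigma}_n\right)^{-1} (\langle \mathbf{f} \rangle - \boldsymbol{\mu}_v), \qquad
\boldsymbol{\Sigma}_{\mathrm{PM}} = \boldsymbol{\Sigma}_s + \boldsymbol{\Sigma}_n - \boldsymbol{\Sigma}_s \left(\boldsymbol{\Sigma}_s + \tfrac{1}{M} \boldsymbol{\Sigma}_n\right)^{-1} \boldsymbol{\Sigma}_s, \tag{25}
$$

where “PM” indicates perfect model. In the limit *M* → ∞, the regression forecast distribution approaches *N*(*E*[**f**|**o**], **Σ**_{n}), which is the (correct) perfect model distribution *p*(**v**|**o**). For finite ensemble size *M*, however, the conditional distribution *p*(**v**|〈**f**〉) differs from the perfect model forecast distribution *p*(**v**|**o**), even for a perfect model scenario, reflecting the fact that the forecast distribution has not been adequately sampled.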
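The finite-ensemble effect in the perfect model scenario can be sketched for scalar variables. The code assumes that verification and each ensemble member share a common signal with variance `var_signal` and carry independent noise with variance `var_noise`, with a common climatological mean. These names and the numbers are illustrative assumptions.

```python
def perfect_model_regression(ens_mean, var_signal, var_noise, members, mu=0.0):
    """Mean and variance of p(v|<f>) for a scalar perfect model scenario."""
    gain = var_signal / (var_signal + var_noise / members)
    mean = mu + gain * (ens_mean - mu)
    variance = var_signal + var_noise - gain * var_signal
    return mean, variance

# One member, a modest ensemble, and an effectively infinite ensemble.
mean_1, var_1 = perfect_model_regression(1.0, 1.0, 1.0, members=1)
mean_10, var_10 = perfect_model_regression(1.0, 1.0, 1.0, members=10)
mean_inf, var_inf = perfect_model_regression(1.0, 1.0, 1.0, members=10**9)

print(mean_1, var_1)
print(mean_10, var_10)
print(mean_inf, var_inf)
```

As the assumed ensemble size grows, the conditional mean approaches the ensemble mean and the conditional variance approaches the noise variance, while small ensembles give a damped mean and a larger variance.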

$\mathbf{x}_t$ of such a system can be interpreted as a solution of a linear stochastic model of the form

$$\frac{d\mathbf{x}_t}{dt} = \mathsf{A}\mathbf{x}_t + \mathbf{w}, \tag{26}$$

where $\mathsf{A}$ is a stable dynamical operator and $\mathbf{w}$ is a Gaussian white noise process with zero mean and covariance matrix $\mathsf{Q}$. The properties of this system have been discussed extensively in the literature (Gardiner 1990; DelSole 2004b, and references therein). The main facts of relevance in this paper are the following. Assuming the solution to (26) was begun in the infinite past, then $\mathbf{x}_t$ is stationary. If the initial condition is drawn randomly from the stationary distribution $p(\mathbf{x}_t)$, then the marginal distributions for the initial condition $\mathbf{i} = \mathbf{x}_t$ and verification $\mathbf{v} = \mathbf{x}_{t+\tau}$ are equal, independent of time, and normally distributed with zero mean and constant covariance matrix $\boldsymbol{\Sigma}_v$. Thus $\mathbf{v}$ can be written as the sum of two terms,

$$\mathbf{v} = \mathsf{P}\mathbf{i} + \mathbf{e}_v,$$

where $\mathsf{P} = \exp(\mathsf{A}\tau)$ is the propagator of the system and $\mathbf{e}_v$ is Gaussian white noise with distribution

$$p(\mathbf{e}_v) = N\left(\mathbf{0},\ \boldsymbol{\Sigma}_v - \mathsf{P}\boldsymbol{\Sigma}_v\mathsf{P}^T\right).$$

The random variables $\mathbf{i}$ and $\mathbf{e}_v$ are independent. It follows from the above two equations that the conditional distribution of $\mathbf{v}$, given $\mathbf{i}$, is

$$p(\mathbf{v}|\mathbf{i}) = N\left(\mathsf{P}\mathbf{i},\ \boldsymbol{\Sigma}_v - \mathsf{P}\boldsymbol{\Sigma}_v\mathsf{P}^T\right).$$

The predictability of this system, as measured by relative entropy, predictive information, and mutual information, has been discussed in Part I and need not be reproduced here.
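As a minimal numerical sketch of these facts (not code from the paper), the quantities above can be computed for a small example. The sketch assumes the standard stationary Lyapunov relation $\mathsf{A}\boldsymbol{\Sigma}_v + \boldsymbol{\Sigma}_v\mathsf{A}^T + \mathsf{Q} = 0$ for white noise forcing; the operator `A` and the helper names `stationary_covariance` and `propagator` are illustrative assumptions.

```python
import numpy as np

def stationary_covariance(A, Q):
    """Solve A S + S A^T + Q = 0 for S using Kronecker-product vectorization."""
    K = A.shape[0]
    I = np.eye(K)
    # Column-major vec: vec(A S) = (I kron A) vec(S), vec(S A^T) = (A kron I) vec(S)
    L = np.kron(I, A) + np.kron(A, I)
    s = np.linalg.solve(L, -Q.flatten(order="F"))
    return s.reshape((K, K), order="F")

def propagator(A, tau):
    """P = exp(A tau) via eigendecomposition (assumes A is diagonalizable)."""
    lam, V = np.linalg.eig(A)
    return (V @ np.diag(np.exp(lam * tau)) @ np.linalg.inv(V)).real

# Hypothetical stable operator and noise covariance, chosen only for illustration.
A = np.array([[-0.2, 1.0], [0.0, -1.0]])
Q = np.eye(2)
tau = 1.0

Sigma_v = stationary_covariance(A, Q)      # climatological covariance
P = propagator(A, tau)                     # propagator over lead time tau
Sigma_e = Sigma_v - P @ Sigma_v @ P.T      # covariance of e_v, i.e., of p(v|i)
print(Sigma_v, P, Sigma_e, sep="\n")
```

The conditional covariance `Sigma_e` grows from zero at $\tau = 0$ toward `Sigma_v` at long lead times, which is the loss of predictability that the measures in Part I quantify.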

$\mathbf{v}$ given $\mathbf{o}$. In this attempt, it would be unrealistic to assume that (26) is perfectly known. Thus, a forecast based on a stochastic model,

$$\frac{d\mathbf{x}_t}{dt} = \mathsf{A}_f\mathbf{x}_t + \mathbf{w}_f,$$

is attempted, where $\mathsf{A}_f$ differs from $\mathsf{A}$ in (26), and $\mathbf{w}_f$ is Gaussian white noise with statistics possibly different from those of $\mathbf{w}$ in (26). The propagator of the forecast system is $\mathsf{P}_f = \exp(\mathsf{A}_f\tau)$, and the covariance matrix of the asymptotic forecast is $\boldsymbol{\Sigma}_f^\infty$. To this model corresponds an analysis $p(\mathbf{i}_f|\mathbf{o})$, which represents the distribution of the initial condition appropriate for the forecast model. Note that $p(\mathbf{i}_f|\mathbf{o})$ is not equal to $p(\mathbf{i}|\mathbf{o})$; that is, the analysis for the perfect model does not equal the analysis for the accessible forecast model. A forecast by the accessible forecast model starting from $\mathbf{i}_f$ satisfies the equation

$$\mathbf{f} = \mathsf{P}_f\mathbf{i}_f + \mathbf{e}_f, \tag{32}$$

where $\mathbf{e}_f$ is a Gaussian random process with distribution

$$p(\mathbf{e}_f) = N\left(\mathbf{0},\ \boldsymbol{\Sigma}_f^\infty - \mathsf{P}_f\boldsymbol{\Sigma}_f^\infty\mathsf{P}_f^T\right). \tag{33}$$

The variables $\mathbf{i}$ and $\mathbf{e}_f$ are independent. Physically, this independence follows from the fact that $\mathbf{i}$ is a realization from the true system (26) whereas $\mathbf{e}_f$ represents the internal noise of the forecast. We could allow $\mathbf{e}_f$ to have nonzero mean, in which case the forecast $\mathbf{f}$ would be biased, but this situation represents only a trivial extension of the unbiased case. It thus follows from (32) and (33) that

[equation not recovered]

$\mathbf{i}$, $\mathbf{i}_f$, and $\mathbf{o}$. Since $\mathbf{i}$ and $\mathbf{i}_f$ arise from an analysis procedure, they depend only on the antecedent observations and forecast models. In particular, $\mathbf{i}$ and $\mathbf{i}_f$ are conditionally independent given the observations:

$$p(\mathbf{i}, \mathbf{i}_f|\mathbf{o}) = p(\mathbf{i}|\mathbf{o})\,p(\mathbf{i}_f|\mathbf{o}).$$

We consider the case in which the initial condition errors are small. Thus, for simplicity, we assume $\mathbf{i} = \mathbf{i}_f$, in which case the triplet ($\mathbf{e}_v$, $\mathbf{e}_f$, $\mathbf{i}$) forms a mutually independent set. Since all three variables are independent and normally distributed, any linear combination of the triplet is joint normally distributed. In particular, the pair ($\mathbf{e}_v + \mathsf{P}\mathbf{i}$) and ($\mathbf{e}_f + \mathsf{P}_f\mathbf{i}$) are joint normally distributed, from which it follows that $\mathbf{v}$ and $\mathbf{f}$ are joint normally distributed. Thus, the regression forecast distribution can be written immediately as (17), where it remains to determine the covariance matrices (the means are assumed to vanish).

$p(\mathbf{f})$ is Gaussian with zero mean and covariance

$$\boldsymbol{\Sigma}_f^\tau = \mathsf{P}_f\boldsymbol{\Sigma}_v\mathsf{P}_f^T + \boldsymbol{\Sigma}_f^\infty - \mathsf{P}_f\boldsymbol{\Sigma}_f^\infty\mathsf{P}_f^T, \tag{36}$$

where we have used the fact that $\mathbf{i}$ and $\mathbf{e}_f$ are independent. In general, $\boldsymbol{\Sigma}_f^\tau$ can depend on lead time. This dependence arises whenever the initial condition for the forecast model is drawn from a distribution that differs from the marginal distribution of the forecast. If $\boldsymbol{\Sigma}_f^\infty$ and $\boldsymbol{\Sigma}_v$ are equal, implying that the climatology of the forecast and true system coincide, then the covariance matrix $\boldsymbol{\Sigma}_f^\tau$ is independent of $\tau$. The cross-covariance $\boldsymbol{\Sigma}_{vf}$ is

$$\boldsymbol{\Sigma}_{vf} = \mathsf{P}\boldsymbol{\Sigma}_v\mathsf{P}_f^T. \tag{37}$$

All quantities in (17) have now been specified. This completes the task of finding the regression forecast distribution $p(\mathbf{v}|\mathbf{f})$ for a stationary, Gaussian, Markov process.

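$\mathbf{f}_1, \mathbf{f}_2, \ldots, \mathbf{f}_M$ drawn randomly from the accessible forecast distribution $p(\mathbf{f}|\mathbf{i})$. As discussed in section 3, the appropriate regression forecast distribution is $p(\mathbf{v}|\mathbf{f}_1, \mathbf{f}_2, \ldots, \mathbf{f}_M)$, which in the case of Gaussian distributions is identical to $p(\mathbf{v}|\langle\mathbf{f}\rangle)$, where $\langle\mathbf{f}\rangle$ is the ensemble mean forecast

$$\langle\mathbf{f}\rangle = \frac{1}{M}\sum_{m=1}^{M}\mathbf{f}_m.$$

To derive an expression for $p(\mathbf{v}|\langle\mathbf{f}\rangle)$, we may follow precisely the same procedure used to derive $p(\mathbf{v}|\mathbf{f})$, but with the new variable

$$\langle\mathbf{e}_f\rangle = \frac{1}{M}\sum_{m=1}^{M}\mathbf{e}_{f,m}$$

with distribution

$$p(\langle\mathbf{e}_f\rangle) = N\left(\mathbf{0},\ \frac{1}{M}\left(\boldsymbol{\Sigma}_f^\infty - \mathsf{P}_f\boldsymbol{\Sigma}_f^\infty\mathsf{P}_f^T\right)\right).$$

The resulting regression forecast distribution for the ensemble is

$$p(\mathbf{v}|\langle\mathbf{f}\rangle) = N\left(\boldsymbol{\Sigma}_{v\langle f\rangle}\boldsymbol{\Sigma}_{\langle f\rangle}^{-1}\langle\mathbf{f}\rangle,\ \boldsymbol{\Sigma}_v - \boldsymbol{\Sigma}_{v\langle f\rangle}\boldsymbol{\Sigma}_{\langle f\rangle}^{-1}\boldsymbol{\Sigma}_{v\langle f\rangle}^T\right). \tag{41}$$

Standard sampling theory gives

$$\boldsymbol{\Sigma}_{\langle f\rangle} = \mathsf{P}_f\boldsymbol{\Sigma}_v\mathsf{P}_f^T + \frac{1}{M}\left(\boldsymbol{\Sigma}_f^\infty - \mathsf{P}_f\boldsymbol{\Sigma}_f^\infty\mathsf{P}_f^T\right), \qquad \boldsymbol{\Sigma}_{v\langle f\rangle} = \mathsf{P}\boldsymbol{\Sigma}_v\mathsf{P}_f^T.$$

We recover the covariances for a single realization $\mathbf{f}$, (36) and (37), by substituting $M = 1$ into the above expression. It can be verified that, in the limit of infinite ensemble size $M \to \infty$, the regression forecast distribution (41) asymptotically approaches the perfect model distribution (30), provided $\boldsymbol{\Sigma}_{\langle f\rangle}$ and $\boldsymbol{\Sigma}_{v\langle f\rangle}$ remain nonsingular.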
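The limit claimed above can be checked in a few lines. The sketch below assumes $\mathbf{i} = \mathbf{i}_f$, zero means, and ensemble-mean covariances of the form implied by the definitions in this section (forecast noise covariance $\boldsymbol{\Sigma}_f^\infty - \mathsf{P}_f\boldsymbol{\Sigma}_f^\infty\mathsf{P}_f^T$ reduced by the factor $1/M$); these expressions are reconstructed from the surrounding definitions rather than quoted. Note that $\mathsf{P}_f$ is a matrix exponential and is therefore always invertible.

```latex
\begin{align*}
\boldsymbol{\Sigma}_{\langle f\rangle}
  &= \mathsf{P}_f\boldsymbol{\Sigma}_v\mathsf{P}_f^T
   + \tfrac{1}{M}\left(\boldsymbol{\Sigma}_f^\infty
   - \mathsf{P}_f\boldsymbol{\Sigma}_f^\infty\mathsf{P}_f^T\right)
   \;\xrightarrow{\,M\to\infty\,}\;
   \mathsf{P}_f\boldsymbol{\Sigma}_v\mathsf{P}_f^T,
   \qquad
   \langle\mathbf{f}\rangle \;\to\; \mathsf{P}_f\mathbf{i}, \\[4pt]
E[\mathbf{v}\,|\,\langle\mathbf{f}\rangle]
  &= \boldsymbol{\Sigma}_{v\langle f\rangle}
     \boldsymbol{\Sigma}_{\langle f\rangle}^{-1}\langle\mathbf{f}\rangle
   \;\to\;
     \mathsf{P}\boldsymbol{\Sigma}_v\mathsf{P}_f^T
     \left(\mathsf{P}_f\boldsymbol{\Sigma}_v\mathsf{P}_f^T\right)^{-1}
     \mathsf{P}_f\mathbf{i}
   = \mathsf{P}\mathbf{i}, \\[4pt]
\operatorname{cov}[\mathbf{v}\,|\,\langle\mathbf{f}\rangle]
  &= \boldsymbol{\Sigma}_v
   - \boldsymbol{\Sigma}_{v\langle f\rangle}
     \boldsymbol{\Sigma}_{\langle f\rangle}^{-1}
     \boldsymbol{\Sigma}_{v\langle f\rangle}^T
   \;\to\;
     \boldsymbol{\Sigma}_v - \mathsf{P}\boldsymbol{\Sigma}_v\mathsf{P}^T .
\end{align*}
```

Under these assumptions the limiting mean and covariance coincide with those of $p(\mathbf{v}|\mathbf{i})$ for the true system: an infinite ensemble averages out the forecast noise, so the ensemble mean determines $\mathbf{i}$ exactly and the regression removes the error in the forecast propagator.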

$\mathbf{v}|\langle\mathbf{f}\rangle$) are obtained by substituting (41) into the appropriate expressions in Part I. The results are

[equations not recovered]

where

[equations not recovered]

These expressions are isomorphic to those obtained for the perfect model distributions and, hence, have properties similar to those associated with the true system (i.e., relative entropy depends on initial condition, all three quantities decay monotonically with lead time, etc.).

## 7. Numerical examples

$\beta$ is a tunable parameter. This model can be solved analytically by methods described in Gardiner (1990) and DelSole (2004b). Since $\mathsf{A}$ is upper triangular, its eigenvalues are $-1/5$ and $-1$, regardless of $\beta$. The case $\beta = 0$ corresponds to a normal dynamical operator in prewhitened coordinates, which constitutes a lower bound on the predictability of all stochastic systems with the same eigenvalues (Tippett and Chang 2003). Suppose that the “truth” is identified with the stochastic model with $\beta = 0$, while the accessible forecast is identified with $\beta = 5$; in both cases the covariance matrix is assumed to be $\mathsf{Q} = \mathsf{I}$. This experiment may be termed a perfect initial condition scenario since uncertainty arises from stochastic forcing within the model and not from the initial condition. The mutual information between verification and initial condition in the true system is given by

[equation not recovered]

and is shown as the dashed line in Fig. 2. The solid lines show, for different ensemble sizes, the mutual information between verification and accessible forecast, as evaluated from $I_{\langle f\rangle}$ in (43). First, note that the predictability of the regression forecast distribution is always less than or equal to the predictability of the system. This reflects the inequality (7), which states that the predictability of a regression forecast distribution is a lower bound on the predictability of the true system. Second, note that the gain in predictability due to doubling the ensemble size is modest after one time unit. This is not surprising given that the regression forecast distribution of joint normal distributions depends on the sample only through the ensemble mean, so extra ensemble members merely refine the sample mean. The predictability of the regression forecast distribution converges to the true predictability more rapidly as the $\beta$ in the forecast model approaches the true value of $\beta$.
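The qualitative behavior described above can be explored with a short sketch. It is not the paper's code: it uses the standard Gaussian mutual information $I(\mathbf{x};\mathbf{y}) = -\tfrac{1}{2}\ln\det(\mathsf{I} - \boldsymbol{\Sigma}_x^{-1}\boldsymbol{\Sigma}_{xy}\boldsymbol{\Sigma}_y^{-1}\boldsymbol{\Sigma}_{yx})$, which may be written differently from (43) and (46), and it assumes a particular layout for the operator (upper triangular with diagonal $-1/5$, $-1$ and $\beta$ in the upper right corner). All function names are illustrative.

```python
import numpy as np

def stationary_covariance(A, Q):
    """Solve A S + S A^T + Q = 0 by Kronecker-product vectorization."""
    K = A.shape[0]
    L = np.kron(np.eye(K), A) + np.kron(A, np.eye(K))
    return np.linalg.solve(L, -Q.flatten(order="F")).reshape((K, K), order="F")

def propagator(A, tau):
    """P = exp(A tau) via eigendecomposition (assumes A is diagonalizable)."""
    lam, V = np.linalg.eig(A)
    return (V @ np.diag(np.exp(lam * tau)) @ np.linalg.inv(V)).real

def gaussian_mi(S_x, S_y, S_xy):
    """Mutual information (in nats) of jointly Gaussian x and y."""
    C = np.linalg.solve(S_x, S_xy) @ np.linalg.solve(S_y, S_xy.T)
    return -0.5 * np.log(np.linalg.det(np.eye(S_x.shape[0]) - C))

def operator(beta):
    # Assumed layout: eigenvalues -1/5 and -1 on the diagonal, beta off-diagonal.
    return np.array([[-0.2, beta], [0.0, -1.0]])

def mi_true(tau, A, Q):
    """I(v; i) for the true system, using cov(v, i) = P Sigma_v."""
    S_v = stationary_covariance(A, Q)
    return gaussian_mi(S_v, S_v, propagator(A, tau) @ S_v)

def mi_ensemble(tau, M, A, A_f, Q):
    """I(v; <f>) for an M-member ensemble mean from the imperfect model A_f."""
    S_v = stationary_covariance(A, Q)
    S_f_inf = stationary_covariance(A_f, Q)
    P, P_f = propagator(A, tau), propagator(A_f, tau)
    S_mean = P_f @ S_v @ P_f.T + (S_f_inf - P_f @ S_f_inf @ P_f.T) / M
    return gaussian_mi(S_v, S_mean, P @ S_v @ P_f.T)

A, A_f, Q = operator(0.0), operator(5.0), np.eye(2)
for M in (1, 2, 4, 8, 16):
    print(M, mi_ensemble(1.0, M, A, A_f, Q), mi_true(1.0, A, Q))
```

Under these assumptions the ensemble values increase with $M$ while staying below the value for the true system, consistent with the lower bound (7).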

How well can mutual information be estimated from finite samples? To gain insight into this question, we numerically generated time series from the above stochastic models using a forward Euler stochastic scheme (Kloeden and Platen 1999, p. 305) with a time step of 0.01 time units. The verification and observation time series were constructed by first integrating the true stochastic model (i.e., $\beta = 0$) and then sampling this single realization 10 times, once every 16 time units. Since the slowest decaying eigenmode has an $e$-folding time of 5 time units, sampling every 16 time units ensures that each verification–initial condition pair is effectively independent of all other such pairs. Within each 16 time unit interval, the initial condition is identified with the starting point and the verification is the value of the time series $\tau$ time units later. Accessible forecasts were constructed by starting at each initial condition and integrating the stochastic model with $\beta = 5$, using random forcing that was independent of that used to generate the truth. Multiple ensemble members were generated by integrating from the same initial condition but with independent realizations of the random forcing. This procedure produces 10 $\mathbf{v}$–$\mathbf{i}$ pairs and 10 $\mathbf{v}$–$\langle\mathbf{f}\rangle$ pairs. Estimates of $I(\mathbf{v}, \mathbf{i})$ and $I(\mathbf{v}, \mathbf{f})$, denoted $I_e(\mathbf{v}, \mathbf{i})$ and $I_e(\mathbf{v}, \mathbf{f})$, were obtained by replacing population covariance matrices with sample covariance matrices in (43) and (46), respectively. This procedure was repeated 500 times to estimate the distribution of $I_e(\mathbf{v}, \mathbf{i})$ and $I_e(\mathbf{v}, \mathbf{f})$.
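A rough sketch of this kind of experiment, restricted to $I_e(\mathbf{v}, \mathbf{i})$, is given below. It is not the paper's procedure in every detail: initial conditions are drawn directly from the stationary distribution instead of subsampling one long run, 200 repetitions are used instead of 500, the operator layout for $\beta = 0$ is assumed to be diagonal with entries $-1/5$ and $-1$, and the standard Gaussian mutual information formula stands in for (46). Function names are illustrative.

```python
import numpy as np

def integrate(A, x0, tau, dt, rng):
    """Euler-Maruyama for dx = A x dt + dW with unit noise covariance (Q = I).

    x0 has shape (n_samples, K); each row is integrated independently.
    """
    x = x0.copy()
    for _ in range(int(round(tau / dt))):
        x = x + dt * (x @ A.T) + np.sqrt(dt) * rng.standard_normal(x.shape)
    return x

def gaussian_mi_from_samples(v, i):
    """Plug sample covariances into the Gaussian mutual information formula."""
    K = v.shape[1]
    S = np.cov(np.hstack([v, i]).T)
    S_v, S_i, S_vi = S[:K, :K], S[K:, K:], S[:K, K:]
    C = np.linalg.solve(S_v, S_vi) @ np.linalg.solve(S_i, S_vi.T)
    return -0.5 * np.log(np.linalg.det(np.eye(K) - C))

rng = np.random.default_rng(0)
A = np.diag([-0.2, -1.0])      # beta = 0 "truth" (assumed layout)
S_v = np.diag([2.5, 0.5])      # stationary covariance for this A with Q = I
tau, dt, n_pairs, n_repeat = 1.0, 0.01, 10, 200

estimates = []
for _ in range(n_repeat):
    i = rng.standard_normal((n_pairs, 2)) * np.sqrt(np.diag(S_v))
    v = integrate(A, i, tau, dt, rng)
    estimates.append(gaussian_mi_from_samples(v, i))

# Exact value for the diagonal case: I - P P^T = diag(1 - e^{-0.4}, 1 - e^{-2}) at tau = 1.
exact = -0.5 * np.log((1 - np.exp(-0.4)) * (1 - np.exp(-2.0)))
print(np.mean(estimates), np.percentile(estimates, [10, 90]), exact)
```

With only 10 pairs per estimate, the sample covariances overfit and the mean estimate sits above the exact value, which is the upward bias discussed in the text that follows.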

Figure 3 reproduces the exact values of $I(\mathbf{v}, \mathbf{i})$ and $I(\mathbf{v}, \langle\mathbf{f}\rangle)$ for this model for 16 ensemble members (dashed and solid lines, respectively, without filled circles). The figure also shows the mean and the 10th and 90th percentiles, as error bars, of the sample estimates $I_e(\mathbf{v}, \mathbf{i})$ and $I_e(\mathbf{v}, \langle\mathbf{f}\rangle)$ (dashed and solid lines, with error bars). First, we see that the sample estimates $I_e(\mathbf{v}, \mathbf{i})$ and $I_e(\mathbf{v}, \langle\mathbf{f}\rangle)$ tend to be biased upward relative to their respective exact values $I(\mathbf{v}, \mathbf{i})$ and $I(\mathbf{v}, \langle\mathbf{f}\rangle)$. The magnitude of this bias decreases as the sample size increases. Second, we see that $I_e(\mathbf{v}, \mathbf{i})$ tends to be larger than $I_e(\mathbf{v}, \langle\mathbf{f}\rangle)$, even though the latter is computed from 16 ensemble members. Quantitatively, $I_e(\mathbf{v}, \mathbf{i}) > I_e(\mathbf{v}, \langle\mathbf{f}\rangle)$ in over 75% of the results for time lags less than 3 time units. We have verified that this result holds even if the accessible forecast model is perfect (i.e., if the accessible forecast model has $\beta = 0$, but with random forcing that is independent of the truth), provided the number of forecast ensemble members is less than 10. These limited results suggest that, if the system is joint normally distributed, there is no compelling reason to utilize accessible forecasts, since higher (but equally biased) estimates of predictability can be obtained by estimating $I(\mathbf{v}, \mathbf{i})$ directly from the observed time series. Conceivably, prior knowledge of the system could be incorporated into the estimation procedure to improve the predictability estimates, but this was not explored.

Now consider the nonlinear dynamical model of Lorenz (1963) with parameter values σ = 10, *ρ* = 8/3, and *β* = 28, for which the model is chaotic. This model is distinguished from the previous model in that it is nonlinear and non-Gaussian. This model was integrated with a fourth-order Runge–Kutta scheme starting from a random point near the origin to construct a single time series of length 2600 time units. After computing this time series, the initial 1000 time units were discarded to avoid spinup effects, and independent random numbers from a Gaussian distribution with zero mean and unit variance were added to the time series. The resulting time series then was sampled every 16 time units to construct 100 initial conditions. The accessible forecasts were constructed by integrating the Lorenz model at each of the 100 initial conditions, but with *β* = 20, all other parameters held the same (our major conclusions below do not appear to depend on the parameter being perturbed). Additional initial conditions for generating ensemble members were constructed by adding new, independent random numbers to the original solution to the Lorenz model. The resulting 100 **v**–**i** pairs and 100 **v**–〈**f**〉 pairs are insufficient to estimate distributions along the lines of Kleeman (2002). Nevertheless, the sample size is typical in climate research.
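The construction of a truth run and an imperfect forecast can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: it uses the standard Lorenz (1963) equations written in this paper's labeling (so `beta` is the parameter set to 28 for the truth and 20 for the forecast, and `rho` is 8/3), an assumed step size of 0.01, and a much shorter integration than the 2600 time units described above.

```python
import numpy as np

def lorenz_rhs(x, sigma=10.0, rho=8.0 / 3.0, beta=28.0):
    """Lorenz (1963) tendencies in this paper's labeling: beta appears in dY/dt."""
    X, Y, Z = x
    return np.array([sigma * (Y - X), X * (beta - Z) - Y, X * Y - rho * Z])

def rk4_step(x, dt, **params):
    """One classical fourth-order Runge-Kutta step."""
    k1 = lorenz_rhs(x, **params)
    k2 = lorenz_rhs(x + 0.5 * dt * k1, **params)
    k3 = lorenz_rhs(x + 0.5 * dt * k2, **params)
    k4 = lorenz_rhs(x + dt * k3, **params)
    return x + dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0

def integrate(x0, n_steps, dt=0.01, **params):
    """Return the trajectory (n_steps + 1 states) starting from x0."""
    traj = np.empty((n_steps + 1, 3))
    traj[0] = x0
    for n in range(n_steps):
        traj[n + 1] = rk4_step(traj[n], dt, **params)
    return traj

rng = np.random.default_rng(1)
truth = integrate(0.1 * rng.standard_normal(3), 5000)   # start near the origin, beta = 28
i0 = truth[-1] + rng.standard_normal(3)                  # unit-variance noise on the state
forecast = integrate(i0, 100, beta=20.0)                 # imperfect model, one time unit
```

Additional ensemble members would, following the description above, start from further independent perturbations of `truth[-1]` and use the same perturbed parameter.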

Given the small sample size, we evaluated mutual information by (incorrectly) assuming a Gaussian form for the distributions so that Eqs. (43) and (46) can be used. Figure 4 shows $I_e(\mathbf{v}, \mathbf{i})$ (dashed) and $I_e(\mathbf{v}, \mathbf{f})$ (solid) estimated from the time series for one ensemble member. We see that $I_e(\mathbf{v}, \mathbf{f})$ exceeds $I_e(\mathbf{v}, \mathbf{i})$, in contrast to Fig. 3. This result does not contradict inequality (7), which pertains to the exact probability distributions, because here a Gaussian form is imposed for the distributions. The essential reason for this result is that the accessible forecasts track the verification better than a linear prediction based on the initial condition, presumably because the accessible forecast to some extent captures important nonlinearity. Consequently, the covariance between $\mathbf{v}$ and $\mathbf{f}$ remains much higher as $\tau$ increases than the covariance between $\mathbf{v}$ and $\mathbf{i}$. Interestingly, adding ensemble members does not substantially improve estimates of mutual information. The reason for this is that each ensemble member remains relatively close to the other members over the time scales considered; that is, each $\mathbf{f}$ is close to $\langle\mathbf{f}\rangle$. Thus, adding new ensemble members does not add much information about the prediction.

The potential predictable components were obtained by performing CCA between 100 $\mathbf{f}$–$\mathbf{i}$ pairs. The mutual information between $\mathbf{v}$ and the first two predictable components $\mathbf{f}_p$, as computed from $I_{\langle f\rangle}$ in (43), is shown in Fig. 4 as the line with circles. Although the curve shows that the predictability based on two predictable components is comparable to the (Gaussian) mutual information between $\mathbf{v}$ and $\mathbf{i}$, this appears to be a coincidence. The important result is that the predictability based on potential predictable components underestimates the predictability of the regression forecast distribution by a factor of 2. This result contradicts our hypothesis that only a small number of potential predictable components can capture the full predictability. The example given here, however, is more appropriately compared with short time weather forecasts, rather than with climate forecasts. We suspect that our hypothesis is valid for large dimensional climate systems on long time scales.

## 8. Summary

This paper proposed a predictability theory framework that accounts for imperfect forecast models. The critical quantity in this framework is neither the perfect model distribution, which is unknown anyway, nor the accessible forecast distribution, whose state space and variability may differ from the true system, but rather the conditional distribution of the state given all accessible forecasts. This idea also was proposed in Schneider and Griffies (1999), although our interpretation appears to differ from theirs. We have called this distribution the regression forecast distribution because, in the case of normal distributions, it is equivalent to a linear regression of the verification given the accessible forecast. Theoretically, the regression forecast distribution is not the best possible prediction. The best prediction is the conditional distribution given all forecasts and all antecedent observations. However, this latter distribution is independent of the accessible forecasts, reflecting the fact that an imperfect forecast is irrelevant if a perfect forecast model is available. The usefulness of imperfect forecasts appears to lie in the fact that they capture nonlinear or nonstationary behavior, which is difficult to capture in low-dimensional, statistical models estimated from short historical records.

This paper showed that the average predictability of the regression forecast distribution, denoted *I*(**V**; **F**), satisfies certain fundamental inequalities. First, *I*(**V**; **F**) provides a rigorous lower bound to the average predictability of the true system. This bound clarifies an important role of accessible forecasts in predictability studies. Second, *I*(**V**; **F**) is bounded above by the average potential predictability of the forecast model, defined as the average predictability of the accessible forecast distribution relative to its own climatology. The potential predictability of an accessible forecast system does not constitute an upper bound on the true predictability, as is sometimes asserted in the literature. In fact, the potential predictability and the true predictability need not have any relation to each other. Rather, the potential predictability constitutes an upper limit to the average predictability of the regression forecast distribution.

The absence of perfect models has led some authors to suggest that all measures of predictability require some reference to accessible forecast models; that is, a true predictability does not exist. But defining predictability with respect to forecast models leads to multiple definitions of predictability, one for each model. Also, one should be careful not to equate existence with accessibility: just because we do not have access to something does not mean it does not exist. The framework proposed here presumes the existence of a true predictability, which is the distribution that a perfect model would produce given the initial condition distribution (which itself is constructed from a data assimilation procedure using the perfect model). The true predictability is an inaccessible property of the climate system and associated observations. Accessible forecasts provide lower bound estimates of true predictability; they are not needed to define predictability. The framework correctly implies that classical, deterministic models are perfectly predictable if both the initial condition and dynamical model are known perfectly, but not otherwise. The framework also accounts for the model dependence of uncertainty in the initial condition. True predictability can never be quantified definitively since at any given time only imperfect forecasts and finite observations exist, and there is no plausible way to eliminate the possibility that a better forecast model or better observations could lead to greater predictability.

This paper suggested that mutual information between accessible forecast and verification *I*(**V**; **F**) provides an attractive measure of forecast skill. This measure arises naturally in our framework as the average predictability of the regression forecast distribution. It also measures the degree of dependence between two sets of variables that is more fundamentally related to predictability than mean square error, which requires, for instance, that the forecast and verification be represented in the same state space. Furthermore, this measure is invariant with respect to nonlinear transformations of the data, is invariant with respect to the role of forecast and verification, cannot be improved by manipulating the forecast (provided the manipulation is independent of verification), and vanishes if and only if the forecast is statistically independent of the verification. In the case of bivariate normal distributions, mutual information reduces to familiar measures of skill based on the correlation between forecast and verification.
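For concreteness, the bivariate normal case mentioned above takes the following standard form, where $r$ denotes the correlation coefficient between forecast and verification (the symbol $r$ is chosen here for illustration and is not the paper's notation):

```latex
I(V; F) \;=\; -\tfrac{1}{2}\,\ln\!\left(1 - r^{2}\right)
```

This is a monotonically increasing function of $|r|$ that vanishes at $r = 0$ and diverges as $|r| \to 1$, so in this special case ranking forecasts by mutual information and ranking them by $|r|$ give the same ordering.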

This paper showed that, under certain plausible assumptions, potential predictable components, and these components alone, completely describe the variability of regression forecasts. Potential predictable components therefore provide a basis for reducing the dimensionality of the predictability problem without loss of generality, provided they can be identified. If the forecast and observations are joint normally distributed, then potential predictable components can be obtained by canonical correlation analysis. In non-Gaussian cases, CCA may still provide a useful method of finding predictable components since it optimizes the correlation coefficient.
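As an illustration of how canonical correlation analysis connects to Gaussian mutual information, the sketch below computes sample canonical correlations by whitening each variable set and taking singular values, then sums $-\tfrac{1}{2}\ln(1 - r_k^2)$ over components. The synthetic data, the function name `canonical_correlations`, and the coefficients 0.9 and 0.3 are hypothetical; this is a generic CCA recipe, not necessarily the procedure used in the paper.

```python
import numpy as np

def canonical_correlations(x, y):
    """Sample canonical correlations between x and y (rows are samples)."""
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    n = x.shape[0]
    S_xx = x.T @ x / (n - 1)
    S_yy = y.T @ y / (n - 1)
    S_xy = x.T @ y / (n - 1)
    # Whiten each block with its Cholesky factor: W = L_x^{-1} S_xy L_y^{-T}.
    L_x = np.linalg.cholesky(S_xx)
    L_y = np.linalg.cholesky(S_yy)
    W = np.linalg.solve(L_x, S_xy) @ np.linalg.inv(L_y).T
    # Singular values of the whitened cross-covariance are the canonical correlations.
    return np.linalg.svd(W, compute_uv=False)

# Synthetic example: only the first two forecast components carry signal.
rng = np.random.default_rng(2)
f = rng.standard_normal((500, 3))
signal = np.column_stack([0.9 * f[:, 0], 0.3 * f[:, 1], np.zeros(500)])
v = signal + rng.standard_normal((500, 3))

r = canonical_correlations(v, f)
mi_gaussian = -0.5 * np.sum(np.log(1.0 - r**2))    # total, in nats
mi_leading = -0.5 * np.log(1.0 - r[0] ** 2)         # leading component only
print(r, mi_gaussian, mi_leading)
```

Truncating the sum to the leading components gives the Gaussian mutual information carried by those components alone, which is how a reduced set of predictable components can be compared against the full predictability.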

The predictability of regression forecast distributions for stationary, Gaussian, Markov systems was examined. The distribution of all relevant random variables was given explicitly. If ensemble forecasts are available, the regression forecast distribution varies only with the sample ensemble mean forecast, while the ensemble spread influences the predictability of the regression forecast distribution.

Simple numerical experiments were conducted to illustrate the above concepts and to gain insight into the usefulness of regression forecast distributions. In these experiments, the truth was identified with a single realization from a chosen model, while the forecast was generated by a model from the same class but with different parameter values. The exact predictability of a regression forecast distribution of a two-dimensional, Gaussian, Markov model was computed for various ensemble sizes. The results revealed that relatively small increases in predictability were to be gained with increasing ensemble size. This conclusion is not surprising since, for joint normal distributions, the regression forecast distribution depends on the forecast ensemble only through the ensemble mean, and hence the “extra” ensemble members merely “sharpen” the estimate of the ensemble mean. Sample estimates of the predictability of these stochastic systems, derived from numerical realizations, were biased upward relative to their true values. This bias, which is a manifestation of artificial skill that occurs in statistical prediction, represents a significant problem in the estimation of predictability from finite samples. In most cases, the true predictability estimated from finite realizations tended to be larger than the predictability of regression forecast distributions, even for large ensemble sizes. Further investigation of linear stochastic models in different parameter regimes suggests that, in the absence of prior information, imperfect forecast systems are not always useful in joint normally distributed systems since greater predictability often can be obtained directly from data. By contrast, the Lorenz (1963) model revealed opposite behavior: Gaussian approximated regression forecasts generally have more predictability than Gaussian approximated perfect model distributions. This difference was attributed to the nonlinear dynamics in the Lorenz model, which could be captured by an accessible forecast model from the same model class but with slightly incorrect parameter values, but not by a joint normal distribution, which essentially assumes a linear relation between verification and initial condition. Since the historical record is not adequate for developing anything more than simple linear regression laws, we concluded that the present usefulness of imperfect forecast models lies in the extent to which they capture relevant nonlinear dynamics and/or nonstationary behavior, or facilitate the identification of potential predictable components. These experiments on low-dimensional systems did not support the hypothesis that a truncated set of potential predictable components can capture most of the predictability. Whether this finding holds in more realistic, large-dimensional climate models on long time scales can only be settled by experiment.

The problem of estimating the predictability of systems from finite samples will be addressed in more detail in Part III.

I am very much indebted to J. Shukla, Tapio Schneider, Michael Tippett, and Ben Kirtman for instructive discussions. Comments from Tapio Schneider, acting as reviewer, and two other reviewers also led to improvements in this paper. Discussions with Lenny Smith also led to helpful clarifications. This research was supported by the NSF (ATM9814295), NOAA (NA96-GP0056), and NASA (NAG5-8202).

## REFERENCES

Cover, T. M., and J. A. Thomas, 1991: *Elements of Information Theory*. Wiley, 576 pp.

DelSole, T., 2004a: Predictability and information theory. Part I: Measure of predictability. *J. Atmos. Sci.*, **61**, 2425–2440.

DelSole, T., 2004b: Stochastic models of quasigeostrophic turbulence. *Surv. Geophys.*, **25**, 107–149.

DelSole, T., and P. Chang, 2003: Predictable component analysis, canonical correlation analysis, and autoregressive models. *J. Atmos. Sci.*, **60**, 409–416.

Gardiner, C. W., 1990: *Handbook of Stochastic Methods*. 2d ed. Springer-Verlag, 442 pp.

Joe, H., 1989: Relative entropy measures of multivariate dependence. *J. Amer. Stat. Assoc.*, **84**, 157–164.

Johnson, R. A., and D. W. Wichern, 1982: *Applied Multivariate Statistical Analysis*. Prentice-Hall, 594 pp.

Kleeman, R., 2002: Measuring dynamical prediction utility using relative entropy. *J. Atmos. Sci.*, **59**, 2057–2072.

Kloeden, P. E., and E. Platen, 1999: *Numerical Solution of Stochastic Differential Equations*. Springer, 636 pp.

Leung, L-Y., and G. R. North, 1990: Information theory and climate prediction. *J. Climate*, **3**, 5–14.

Lorenz, E. N., 1963: Deterministic nonperiodic flow. *J. Atmos. Sci.*, **20**, 130–141.

Lorenz, E. N., 1969: Atmospheric predictability as revealed by naturally occurring analogues. *J. Atmos. Sci.*, **26**, 636–646.

Murphy, A. H., 1993: What is a good forecast? An essay on the nature of goodness in weather forecasting. *Wea. Forecasting*, **8**, 281–293.

Schneider, T., and S. M. Griffies, 1999: A conceptual framework for predictability studies. *J. Climate*, **12**, 3133–3155.

Tippett, M. K., and P. Chang, 2003: Some theoretical considerations on predictability of linear stochastic dynamics. *Tellus*, **55A**, 148–157.

von Storch, H., and F. Zwiers, 1999: *Statistical Analysis in Climate Research*. Cambridge University Press, 484 pp.