Predictability and Information Theory. Part I: Measures of Predictability

Timothy DelSole George Mason University, Fairfax, Virginia, and Center for Ocean–Land–Atmosphere Studies, Calverton, Maryland


Abstract

This paper gives an introduction to the connection between predictability and information theory, and derives new connections between these concepts. A system is said to be unpredictable if the forecast distribution, which gives the most complete description of the future state based on all available knowledge, is identical to the climatological distribution, which describes the state in the absence of time lag information. It follows that a necessary condition for predictability is for the forecast and climatological distributions to differ. Information theory provides a powerful framework for quantifying the difference between two distributions that agrees with intuition about predictability. Three information theoretic measures have been proposed in the literature: predictive information, relative entropy, and mutual information. These metrics are discussed with the aim of clarifying their similarities and differences. All three metrics have attractive properties for defining predictability, including the fact that they are invariant with respect to nonsingular linear transformations, decrease monotonically in stationary Markov systems in some sense, and are easily decomposed into components that optimize them (in certain cases). Relative entropy and predictive information have the same average value, which in turn equals the mutual information. Optimization of mutual information leads naturally to canonical correlation analysis, when the variables are joint normally distributed. Closed form expressions of these metrics for finite dimensional, stationary, Gaussian, Markov systems are derived. Relative entropy and predictive information differ most significantly in that the former depends on the “signal to noise ratio” of a single forecast distribution, whereas the latter does not. Part II of this paper discusses the extension of these concepts to imperfect forecast models.

Corresponding author address: Timothy DelSole, Center for Ocean–Land–Atmosphere Studies, 4041 Powder Mill Rd., Suite 302, Calverton, MD 20705-3106. Email: delsole@cola.iges.org


1. Introduction

Recently, several papers have discussed the connection between information theory and predictability (Leung and North 1990; Schneider and Griffies 1999; Kleeman 2002). These papers present different definitions of predictability and propose different metrics for measuring it. The reasons for choosing different metrics, and the connection between the metrics, are not always explained in these papers. In addition, the literature on this subject has been confined almost exclusively to perfect model scenarios. Finally, even for a perfect model scenario, the fundamental measure of predictability is not universally agreed upon, as evidenced by other metrics proposed in the literature: for example, mean square error in various norms (Lorenz 1965), signal to noise ratio (Chervin and Schneider 1976), analysis of variance (Shukla 1981), potential predictive utility measured by Kuiper's statistical test (Anderson and Stern 1996), and difference in distributions measured by a Kolmogorov–Smirnov measure (Sardeshmukh et al. 2000).

The goal of this paper is to clarify the fundamental definition of predictability and its connection to information theory, and to point out connections between metrics based on information theory that apparently are not well known. This paper will concentrate on theory; applications of the above metrics can be found in the above-cited papers. In Part II of this paper (DelSole 2004, manuscript submitted to J. Atmos. Sci.), we discuss methods, suggested by Schneider and Griffies (1999), for dealing with imperfect forecast models. Readers familiar with information theory and its connection to predictability may skip to sections 5 and 6, where the bulk of the new material is presented. A summary of this paper is given in the concluding section (section 7).

2. The definition of predictability

Predictability is the study of the extent to which events can be predicted. Perhaps the most utilized measure of predictability is the mean square error of a (perfect) forecast model. Typically, mean square error increases with lead time and asymptotically approaches a finite value, called the saturation value. The saturation value is comparable to the mean square difference between two randomly chosen fields from the system. Consequently, all predictability is said to be lost when the errors are comparable to the saturation value since the forecast offers no better a prediction than a randomly chosen field from the system. Despite its widespread use, mean square error turns out to be a limited measure of predictability. First, the most complete description of a forecast is its probability distribution. Mean square error is merely a moment of a joint distribution and hence conveys much less information than a distribution. Second, as Schneider and Griffies (1999) pointed out, mean square error depends on the basis set in which the data is represented. Unfortunately, no agreement exists on how to combine different variables to obtain a single measure of predictability—for example, how many degrees Celsius of temperature is equivalent to one meter per second of wind velocity. Third, the question of how to attribute predictability to specific structures in the initial condition has no natural answer in this framework. Finally, the question of how to quantify predictability uniquely with imperfect models is not addressed in this framework since different models produce different characteristic errors.

The foundation of predictability theory was laid by Lorenz (1963, 1965, 1969) and Epstein (1969). In this framework, the forecast model is considered perfect; that is, the governing equations are known exactly and can be solved exactly. However, the initial state of a system is not known exactly and therefore is represented most appropriately by a probability distribution function (pdf), which can be interpreted as a density of possible states in phase space. The distribution evolves according to a continuity equation for probability, which is derived in the same way as the mass continuity equation, under the principle that no ensemble members are created or destroyed (i.e., the total amount of probability always equals unity). This continuity equation reduces to Liouville's equation in the case of conservative dynamical systems. A generalized governing equation, called the Fokker–Planck equation, applies if the dynamical model is stochastic—that is, if the future state is not uniquely determined by the initial state, as would be the case if certain physical processes were represented by random processes.

Whether a state is predictable depends on two distributions. To clarify this point, consider the statement “event A is unpredictable.” This statement is not interpreted in the literal sense of saying that the distribution for A is completely unknown. For instance, in many cases a past record of observations is available from which a climatological distribution could be constructed and used to produce a forecast. Even though the climatological distribution constitutes a forecast, the event in question is considered to be unpredictable if the climatological distribution is the only basis for prediction. In order for an event to be predictable, a forecast should be “better” in some sense than a prediction based on the climatological distribution. Thus, the statement “an event is not predictable” is not an absolute statement; rather, it is a relative statement meaning that the forecast distribution does not differ significantly from the climatological distribution.

The above considerations imply that predictability is a property of the difference between two probability distributions. The two distributions in question are commonly referred to as the climatological and forecast distributions. In Bayesian terminology, the climatological distribution, which is not concerned with the chronological order of the states, would be called the prior distribution. The forecast distribution, which involves a lag between the times in which states occur and is available only after observations become available, would be called the posterior distribution. [Note that this terminology differs from that of Schneider and Griffies (1999), in which the prior and posterior distributions are distinguished according to whether a forecast is available.] Ironically, this terminology is exactly opposite from the standard terminology in data assimilation. Therefore, to avoid confusion, we avoid the use of “prior” and “posterior.”

To express the above quantities more precisely, let the state of the climate at time t be represented by an N-dimensional vector Xt. Capital letters will be used to denote a random variable and lower case letters to denote one of its realizations. The probability that Xt lies between xt and xt + dxt is denoted p(xt)dxt. We will call t the initial condition time, t + τ the verification time, and τ the lead time. To account for nonstationary signals, such as a seasonal or diurnal cycle, the climatological distribution is identified with the density at the verification time, p(xt+τ).

During times when the climate system is not observed, the distribution of states evolves in accordance with the continuity equation for probability. After the system is observed, the distribution changes discontinuously. For instance, a 20% chance of rain at time t discontinuously changes to 100% certainty when rain is observed (with certainty) at that time. Typical examples of observations include the state itself, boundary condition, and/or external forcing. Suppose the system is observed at times 1, 2, … , t with observed values β1, β2, … , βt. Let ot be the set of all observations up to time t: ot = (β1, β2, … , βt). Then the distribution of the state that embodies all the information contained in the observations ot is denoted p(xt|ot), which is the distribution of the climate state conditioned on ot. (We use the notation that the density function p(·) differs according to its argument.) Statistical procedures for constructing p(xt|ot) were developed by Wiener and Kalman and now have been generalized quite extensively (Jazwinski 1970). In meteorology and oceanography, the mean of p(xt|ot) is called the analysis, and the procedure by which it is obtained is typically associated with the Kalman filter. The analysis distribution at the future time t + τ, namely p(xt+τ|ot+τ), is called the verification distribution. We assume in this paper that p(xt|ot) and p(xt+τ|ot+τ) are known.

The climatological distribution p(xt+τ) is related to the analysis distribution by the chain rule,
p(x_{t+\tau}) = \int p(x_{t+\tau} \mid o_{t+\tau})\, p(o_{t+\tau})\, do_{t+\tau}    (1)
which is merely the marginal distribution for xt+τ obtained by “integrating out” all other variables. If the climate system is observed perfectly, then p(xt+τ|ot+τ) is a delta function and the above integral collapses to p(xt+τ) = p(ot+τ); that is, the climatological distribution equals the observation distribution if the observations cover all dimensions and are perfect. If the observation and climate systems are stationary, then the climatological distribution is independent of time, that is, p(xt+τ) = p(xt), and is simply the distribution of the variable over its past history, which can be estimated from observations. It should be noted that the integral in (1) is a multiple integral. If xt is discrete, then the integral is replaced by an appropriate sum. Here, continuous forms have been chosen as an arbitrary convention.

The question arises as to the appropriate time scale for defining the climatological distribution. For instance, the climatological distribution of the earth's climate would include ice ages, strictly speaking, but this distribution would be inappropriate for quantifying the predictability of tomorrow's weather. No unique climatological distribution appears to exist—distributions over different time scales all constitute different aspects of the total problem. A climatological distribution that describes the past few decades would be appropriate for weather predictability, while one that describes the past few million years would be appropriate for quantifying the predictability of long-term climate. In stationary Markov systems, the climatological distribution is chosen by convention to be the unconditional distribution of the state, which in most practical cases is also the time asymptotic forecast distribution.

The most complete description of the (future) state Xt+τ based on all observations is denoted p(xt+τ|ot). This distribution defines the forecast distribution, and is related to the analysis distribution p(xt|ot) through the standard “chain rule,”
p(x_{t+\tau} \mid o_t) = \int p(x_{t+\tau} \mid x_t)\, p(x_t \mid o_t)\, dx_t    (2)
where the identity p(xt+τ|xt, ot) = p(xt+τ|xt) has been invoked, owing to the assumption that each member of the ensemble evolves in accordance with the governing equations; this assumes that observations do not disturb the system. The distribution p(xt+τ|xt) is called a transition probability and is computed from a dynamical or stochastic model. If the climate system is deterministic, then p(xt+τ|xt) can be interpreted as a delta function centered about the deterministic solution with initial condition xt, and p(xt+τ|ot) describes the forecast ensemble generated from initial states randomly selected from p(xt|ot). For Markov systems, p(xt+τ|ot) satisfies a Fokker–Planck equation with initial condition p(xt|ot) (Gardiner 1990). If the system is stationary, then the climatological distribution p(xt+τ) is independent of time and (almost) every forecast asymptotically approaches the stationary distribution; that is, p(xt) = p(xt+τ) = lim p(xt+τ|ot) as τ → ∞.
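To make these chain rules concrete, the following minimal sketch (in Python, with made-up numbers) applies discrete analogues of (1) and (2) to a hypothetical two-state dry/rain system. The system is taken to be stationary, so the invariant distribution of the transition matrix plays the role of the climatology, while the analysis distribution plays the role of p(xt|ot).

```python
import numpy as np

# Hypothetical two-state system: 0 = "dry", 1 = "rain" (all numbers made up).
# Transition probabilities p(x_{t+tau} | x_t): rows index x_t, columns x_{t+tau}.
P_trans = np.array([[0.9, 0.1],
                    [0.4, 0.6]])

# Analysis distribution p(x_t | o_t) after rain was (imperfectly) observed at time t.
p_analysis = np.array([0.3, 0.7])

# Forecast distribution, discrete analogue of (2):
# p(x_{t+tau} | o_t) = sum over x_t of p(x_{t+tau} | x_t) p(x_t | o_t).
p_forecast = p_analysis @ P_trans

# Climatological distribution: for this stationary system it is the invariant
# distribution of the transition matrix (left eigenvector with eigenvalue 1).
evals, evecs = np.linalg.eig(P_trans.T)
p_clim = np.real(evecs[:, np.argmax(np.real(evals))])
p_clim /= p_clim.sum()

print("forecast      :", p_forecast)   # [0.55, 0.45]
print("climatological:", p_clim)       # [0.8 , 0.2 ] -> the two differ, so the event is predictable
```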

To summarize, the basis of all predictability theory is a forecast procedure and a set of observations. Given all observations up to time t, ot, we construct the best description of the state at time t, called the analysis distribution and denoted by p(xt|ot). The most complete description of the state at time t + τ based on observations ot is the forecast distribution p(xt+τ|ot). To assess whether the system is predictable, we compare the forecast distribution p(xt+τ|ot) to the climatological distribution of the climate system, namely p(xt+τ). The degree of predictability is given by some measure of the difference between p(xt+τ|ot) and p(xt+τ).

The above framework allows different predictability scenarios to be described in a unified manner. For instance, a perfect model scenario identifies the transition probability p(xt+τ|xt) with a delta function and the analysis distribution p(xt|ot) with nonzero variance, in which case the forecast distribution p(xt+τ|ot) from (2) describes an ensemble of forecasts initialized at states randomly drawn from the analysis distribution. A perfect initial condition scenario identifies the transition probability p(xt+τ|xt) with nonzero variance and the analysis distribution p(xt|ot) with a delta function, in which case the forecast distribution from (2) describes the evolution of a stochastic model. If the observations are perfect and the model is deterministic, then the forecast distribution from (2) remains a delta function for all time whereas the climatological from (1) does not, indicating predictability for all time. This framework implies that predictability is a combined property of the dynamical system and the observations.

As will become evident, there is considerable disagreement as to what measure of the difference between climatological and forecast distributions should be used to quantify predictability. Murphy (1993) describes several distinct measures of goodness in forecasting. However, there is one point on which everyone agrees: an event is unpredictable if the forecast distribution is identical to the climatological distribution. According to this definition, all predictability is lost when p(xt+τ|ot) = p(xt+τ), which is precisely equivalent to the statement that the future state Xt+τ is statistically independent of the observations ot. This condition is intuitively consistent with the idea that an event is predictable only if its forecast depends statistically on the observations on which the forecast is based. Equivalently, we may say that a necessary condition for predictability is for the forecast and climatological distributions to differ. The question of how to quantify predictability based on this difference is discussed in the next section.

The above framework does not exhaust all types of predictability. As Lorenz (1975) noted, there are at least two kinds of predictability. The first kind refers to the change in statistics over a fixed time span, as the beginning and end of the time span advance. The second kind refers to the change in statistics due to changes in climate forcing. (The reader should be warned that Lorenz's definitions often are misquoted in the literature.) If the state Xt is interpreted as a statistic over a fixed time span (whose end points could be infinitesimally close, in which case Xt is the instantaneous state), then the framework we have just discussed encompasses predictability of the first kind.

To clarify predictability of the second kind, consider a prediction for the change in climate statistics due to a doubling of atmospheric greenhouse gas concentration. If the original concentration is denoted θ0 and the increased level is θ1, then the climatology is identified with the distribution conditioned on the current concentration, p(xt+τ|θ0), while the forecast is the distribution conditioned on the higher concentration level p(xt+τ|θ1). Importantly, the two distributions p(xt+τ|θ0) and p(xt+τ|θ1) cannot be interpreted as conditional and marginal distributions as defined above because the forecast distribution does not satisfy the chain rule (2) and does not involve a chronological sequence of climate states. Rather, the forecast and climatological distribution are interpreted as subensembles of a larger climate population. Moreover, the climatological distribution defined in (1) involves a weighting by the probability of θ1, but this weighting is irrelevant to those who are concerned about the actual global warming. These considerations suggest that all types of predictability can be defined in terms of the difference between two distributions, but the class of distributions may differ according to the type of predictability.

It should be noted that the above framework assumes that the forecast distribution of the future state is known precisely, which of course is not a realistic situation. In light of this fact, numerous investigators have suggested that predictability should be defined by some measure of the predictability of a forecast model. The major problem with this definition is that it depends on the arbitrary forecast model and, hence, gives rise to multiple degrees of predictability for the same event—one for each distinct model. Furthermore, the predictability of an imperfect forecast model may bear no relation to the predictability of the climate system. For instance, some models may indicate infinite predictable times whereas others indicate short predictable times. The question of how to quantify predictability with imperfect models has received relatively little attention in the literature. This issue will be discussed in Part II of this paper.

3. Entropy, relative entropy, and mutual information

While we have defined predictability by the difference between the climatological and forecast distributions, how are we to quantify this difference? Before addressing this question, it should be recognized that any ranking of the difference between the climatological and forecast distributions will depend on geographical, economic, or societal issues that differ from user to user. The goal of this section is to show that defining predictability based on principles of information theory leads to an intuitive methodology.

Mean square error and the second moments of the forecast distribution are by far the most common measures of predictability. These metrics measure the dispersion, or “spread,” of the forecast distribution. Forecasts with larger dispersion indicate less predictability because they are associated with greater uncertainty. This reasoning suggests that the degree of uncertainty provides a more fundamental basis for quantifying predictability.

To quantify uncertainty, consider a random event. Before the event occurs, we are uncertain as to the outcome of the event. The amount of uncertainty is related in some way to the probability distribution of the event. After the event has been observed, the uncertainty has been removed to an extent that depends on the analysis and observation errors. If these errors are sufficiently small, then we may say that we have received some information through observation. Thus, a decrease in uncertainty corresponds to an increase in information. A plausible requirement of any measure of information is that, if two events X and Y are independent, then the information h gained when we learn both X and Y should equal the sum of the information gained if X alone were learned, and the information gained if Y alone were learned; that is, h(x and y) = h(x) + h(y). It can be shown that the logarithm is the only differentiable function of probability density (to within a multiplicative factor) that has this additive property for independent events. Thus, a measure of the information content of the (discrete) event X = x with probability p(x) is
h(x) = -\log p(x)    (3)
Note how this measure conforms with our intuition about information: if p(x) = 99.99%, then knowledge that event X = x occurred gives very little information since it had a high probability of occurrence to begin with; if p(x) is very small, then knowledge that event X = x occurred conveys a lot of information since x is rare. The average information, weighted by the probability of each event, is called the entropy and is given by
H(X) = -\int p(x) \log p(x)\, dx    (4)
There exist many excellent reviews of entropy in the literature (Shannon 1948; Goldman 1953; Reza 1961; Cover and Thomas 1991). These reviews give numerous compelling arguments demonstrating that entropy arises as a natural and fundamental measure of uncertainty in communication theory, data compression, gambling, computational complexity, statistics, and statistical mechanics. We will not reproduce those arguments here, as we could hardly do them justice. Nevertheless, if we want a measure of uncertainty to have properties consistent with the meaning of the concept, then it is hard to escape the fact that we are compelled to use entropy as a measure of uncertainty.
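As a minimal illustration of the entropy measure (4) in its discrete form, the short Python sketch below (with made-up distributions) shows that a sharply peaked distribution carries little uncertainty while a uniform distribution over the same outcomes carries the most.

```python
import numpy as np

def entropy(p):
    """Entropy -sum p log p of a discrete distribution, in nats (natural log)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                # the contribution of zero-probability outcomes is zero
    return -np.sum(p * np.log(p))

print(entropy([0.999, 0.0005, 0.0005]))   # ~0.009 nats: nearly certain outcome
print(entropy([1/3, 1/3, 1/3]))           # log(3) ~ 1.10 nats: maximal uncertainty
```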
There are several ways in which to measure the difference between the forecast and climatological distributions in the context of information theory. One approach is to measure predictability by the entropy difference between the climatological and forecast distributions:
P_o = H(X_{t+\tau}) - H(X_{t+\tau} \mid O_t = o_t)    (5)
which we call predictive information. Schneider and Griffies (1999) introduce a related measure of predictive information, defined as the entropy of the climatological distribution minus that of the forecast error. As such, their measure of predictability is defined with respect to a forecast system. If the forecast system is perfect, by which we mean that the forecast error is distributed as p(xt+τ|ot), aside from a shift in mean, then Po defined above is formally equivalent to predictive information as defined by Schneider and Griffies. The Po defined above is intended to measure the fundamental predictability of the physical and observational system, not that of a forecast model. If the perfect model scenario is invoked, then no distinction exists between Po defined above and predictive information defined by Schneider and Griffies.

The first term on the right of (5) measures the uncertainty when no (recent) observation of the system is available, while the second term measures the uncertainty after the observation and associated forecast becomes available. The difference in uncertainties quantifies the amount of information about the event gained from observation and forecast. Schneider and Griffies (1999) review the properties of predictive information, to which we refer the reader for full details. Here we briefly state two properties. First, predictive information is invariant with respect to a nonsingular, linear coordinate transformation. It follows that predictive information is independent of the basis set used to represent the data. Hence, all climate variables, regardless of their units or natural variances, can be accounted for in a single measure of predictability. Second, for joint normally distributed variables predictive information can be decomposed into a set of uncorrelated components ordered by their entropy difference such that the leading component is the most predictable, the second is the most predictable uncorrelated with the first, and so on. This decomposition is explained in section 5.

Predictive information is not always positive—the entropy of a conditional distribution can exceed that of the unconditional distribution. Table 1 demonstrates this fact by example. This situation reflects the fact that, sometimes, a particular observation or forecast leads to “confusion.” However, the average predictive information is positive. To show this, we need to distinguish between the entropy of a conditional distribution for a given value of ot, and the average of this quantity. These quantities are as follows:
H(X \mid O = o) = -\int p(x \mid o) \log p(x \mid o)\, dx, \qquad H(X \mid O) = \int p(o)\, H(X \mid O = o)\, do    (6)
The key distinction between these quantities is that the first depends on the particular realization ot, whereas the second, which averages over ot, does not. The latter quantity is called the conditional entropy and satisfies the inequality H(X) ≥ H(X|O) (Cover and Thomas 1991, chapter 2). This inequality implies that the average predictive information is non-negative—observations can only add information on average. Incidentally, for normal distributions, predictive information is positive regardless of the conditional argument (see appendix A).
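The following sketch makes the same point as Table 1 with made-up numbers: for one value of the observation the forecast has more entropy than the climatology, so the predictive information is negative, yet the average over observations is non-negative.

```python
import numpy as np

def H(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# Hypothetical two-state system observed imperfectly (all numbers made up).
p_o = np.array([0.75, 0.25])                 # p(o): distribution of the observation
p_x_given_o = np.array([[0.9, 0.1],          # p(x | o=0): a sharp forecast
                        [0.5, 0.5]])         # p(x | o=1): a maximally uncertain forecast
p_x = p_o @ p_x_given_o                      # climatological distribution p(x) = [0.8, 0.2]

for o in (0, 1):
    print(f"predictive information for o={o}:", H(p_x) - H(p_x_given_o[o]))
# o=1 yields negative predictive information (the forecast is more uncertain than
# climatology), but the average over observations is non-negative:
print("average:", H(p_x) - np.sum(p_o * [H(row) for row in p_x_given_o]))
```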
An alternative measure of the difference between the forecast and climatological distributions is the difference in their information content:
\log \frac{p(x_{t+\tau} \mid o_t)}{p(x_{t+\tau})}    (7)
This difference depends on the realization xt+τ. A summary measure is provided by the average information difference, weighted by the probability of event xt+τ. Of the two possibilities for this weighting, the most useful turns out to be
R_o = \int p(x_{t+\tau} \mid o_t) \log \frac{p(x_{t+\tau} \mid o_t)}{p(x_{t+\tau})}\, dx_{t+\tau}    (8)
The above quantity, called relative entropy, was proposed by Kleeman (2002) as a measure of predictability. Relative entropy also is known as the Kullback–Leibler distance. Kullback (1959) gives compelling arguments that the relative entropy is a measure of the difficulty of discriminating between two distributions. It can be shown that relative entropy is positive, invariant to nonlinear invertible transformations of the random variables (Majda et al. 2002), and decomposable into components that optimize it (as discussed in section 5).
The above measures depend on the observation used for conditioning. We are often interested in the mean predictability as derived by averaging over all observations. It can be shown that both relative entropy and predictive information have the same such average. To see this explicitly, the average relative entropy can be computed as follows:
\langle R \rangle = \int p(o)\, R_o\, do = \iint p(x, o) \log \frac{p(x \mid o)}{p(x)}\, dx\, do = \iint p(x, o) \log \frac{p(x, o)}{p(x)\, p(o)}\, dx\, do    (9)
where the subscripts indicating time have been suppressed for ease of interpretation, p(x, o) is the joint distribution between x and o, and the classical relation p(x, o) = p(x|o)p(o) has been used. Similarly, the average predictive information is
\langle P \rangle = \int p(o)\, P_o\, do = H(X) - H(X \mid O) = \iint p(x, o) \log \frac{p(x, o)}{p(x)\, p(o)}\, dx\, do    (10)
where we have used the classical relation
p(x) = \int p(x, o)\, do    (11)
Comparison between (9) and (10) shows that the quantities have the same average. Interestingly, not only do relative entropy and predictive information have the same average, but this average is precisely the measure of predictability introduced by Leung and North (1990), called transinformation. Cover and Thomas (1991) call this quantity mutual information, a terminology that we adopt in this paper. Mutual information measures the dependence between two variables and is defined as
I(X_{t+\tau}, O_t) = \iint p(x_{t+\tau}, o_t) \log \frac{p(x_{t+\tau}, o_t)}{p(x_{t+\tau})\, p(o_t)}\, dx_{t+\tau}\, do_t    (12)
which can be seen to equal (9) and (10). If the observations are perfect, that is, ot = xt, then I(xt+τ, ot) = I(xt+τ, xt). The significance of this identity is that Cover and Thomas (1991, chapter 2) show that I(xt+τ, xt) decreases monotonically for a stationary, Markov process. It follows from this that, if the observations are perfect, the average predictability always decays with time (starting from the time of the last observation).
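The equality of these averages is easy to verify numerically. The sketch below does so for an arbitrary discrete joint distribution p(x, o); the distribution is randomly generated and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
p_xo = rng.random((3, 4))
p_xo /= p_xo.sum()                               # joint distribution p(x, o)
p_x, p_o = p_xo.sum(axis=1), p_xo.sum(axis=0)    # marginals p(x) and p(o)

H = lambda p: -np.sum(p[p > 0] * np.log(p[p > 0]))

# Average relative entropy: sum over o of p(o) * R_o, cf. (9).
R_avg = sum(p_o[j] * np.sum(p_xo[:, j] / p_o[j] *
            np.log(p_xo[:, j] / p_o[j] / p_x)) for j in range(4))
# Average predictive information: sum over o of p(o) * [H(X) - H(X | O = o)], cf. (10).
P_avg = sum(p_o[j] * (H(p_x) - H(p_xo[:, j] / p_o[j])) for j in range(4))
# Mutual information, cf. (12).
I = np.sum(p_xo * np.log(p_xo / np.outer(p_x, p_o)))

print(R_avg, P_avg, I)   # the three values agree
```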

4. Discussion

The question arises as to which of the above metrics should be used to measure predictability. As noted earlier, no universal principle exists to settle this question. The following comments may help to clarify this issue. First, predictive information and relative entropy measure the predictability of a single forecast distribution, whereas mutual information measures the mean predictability averaged over all initial conditions. Second, the average of the first two quantities equals the third. Thus, if we are interested only in the mean predictability, all three metrics are equivalent and no distinction exists.

Any measure of predictability should vanish when the forecast equals the climatological. Relative entropy and predictive information, which are the relevant metrics for single forecasts, satisfy this condition. However, a key difference between predictive information and relative entropy is that relative entropy (8) vanishes if and only if the two distributions are equal, whereas this is not the case for predictive information (5). Theoretically, predictive information can vanish even when the climatological differs from the forecast. Whether this represents a drawback with predictive information depends on a more detailed definition of predictability, as discussed below.

An interesting question is whether a forecast with negative predictive information should be considered predictable. Recall that negative predictive information implies that the forecast has more entropy, and hence more uncertainty, than the climatology. It is not absurd to characterize an event as unpredictable if the forecast has more uncertainty than the climatological distribution. This reasoning suggests that a variable should be defined as unpredictable if the predictive information vanishes or is negative. On the other hand, relative entropy is positive when predictive information is negative. In fact, measuring predictability by relative entropy would imply that any difference between the climatological and forecast distribution should be characterized as predictable. Whether this represents a drawback for relative entropy depends on a more refined definition of predictability.

Kleeman (2002) states that the relative entropy of a stationary, Markov process decays monotonically, whereas absolute entropy does not, thereby implying that entropy does not conform to our intuition that predictability in such systems monotonically degrades with lead time. This point deserves comment. First, Cover and Thomas (1991, chapter 2) show that the conditional entropy H(Xt+τ|Xt) of a stationary Markov process increases monotonically with τ. Recall that for perfect observations H(Xt+τ|Ot) = H(Xt+τ|Xt), and that the average predictive information is H(Xt+τ) − H(Xt+τ|Ot). It follows from this and the above theorem that the predictive information of a stationary, Markov process decreases monotonically on average. Second, if the distributions are normal, then the entropy of a stationary, Markov process can be shown to always increase monotonically with lead time (again, under perfectly known initial conditions). This fact is proven in appendix B. Third, Cover and Thomas (1991, chapter 2) prove that relative entropy between a stationary, Markov process and any stationary distribution decreases monotonically with time. Thus, relative entropy decreases no matter how the climatological distribution is chosen. This curious behavior is related to the fact that relative entropy does not satisfy the triangle inequality, so a decrease in relative entropy does not necessarily imply that the forecast and climatological distributions are getting “closer.”

T. Schneider (2003, personal communication) has suggested that interpreting the above metrics in terms of coding theory may help to clarify the different metrics. For discrete distributions, entropy can be interpreted as the average number of bits (i.e., the average number of “yes–no” questions) needed to describe a variable from a distribution (Cover and Thomas, 1991, p. 6). Hence the predictive information is the average reduction (when it is positive) in the number of bits achieved by a code optimized for the forecast distribution relative to a code optimized for the climatological distribution. Cover and Thomas (1991, p. 18) effectively state the following: to describe a realization from the forecast distribution, a code optimized for the forecast distribution would have average length H(Xt+τ|Ot = ot), whereas a code optimized for the climatological distribution would have (longer) average length H(Xt+τ|Ot = ot) + Ro. This implies that relative entropy could be interpreted as a measure of the inefficiency of using the climatological distribution to describe the forecast distribution. This is not a quantity one would naturally choose to measure predictability. In contrast, predictive information could be interpreted as the decrease in complexity in the forecast distribution relative to the climatological. This may have some intuitive appeal as a measure of predictability.

Other equally valid interpretations can be given that do not always favor the same metric. For instance, one approach to measuring the difference between the climatological and forecast distributions is to ask whether we could determine from which of the two distributions a given sample was drawn. If we cannot, then perhaps the difference in distributions is not “important.” It is well known that the optimum test, in the sense of having desirable properties with respect to type I and type II errors, is given by the Neyman–Pearson likelihood ratio test. Kullback (1959) shows that this test is formally equivalent to comparing the relative entropy between the empirical distribution and the possible distributions. Thus, relative entropy can be interpreted as a measure of the difficulty of discriminating between forecast and climatological distributions, a natural prerequisite for measuring predictability. Alternatively, recall that Pearson's chi-square test is a classical method of deciding whether two distributions differ. It can be shown that the chi-square statistic is formally equivalent to relative entropy in the limit of small differences between the two distributions (Kullback 1959, chapter 5). Another basis for quantifying the difference between two distributions is by the amount of money a gambler would make given a forecast, provided a “bookie” assigns fair odds based on the climatology. In effect, Cover and Thomas (1991, section 6.1) show that the average doubling rate of the gambler's winnings equals the relative entropy between the forecast and climatological distributions. This situation is not far removed from those associated with weather derivatives in real financial markets.

Bernardo and Smith (2000) give an elegant summary of Bayesian theory in which relative entropy and mutual information arise naturally in the problem of choosing actions in situations with uncertainty. This framework requires 1) a set of axioms for making decisions and assigning degrees of belief based on data and 2) the concept of utility. The logarithmic utility function turns out to be the only utility function that is smooth, proper, and local. For such utility functions, the expected utility of data is precisely the relative entropy between the forecast and climatological distributions, which are distinguished according to the availability of data. In the context of predictability, relative entropy (8) can be interpreted as the expected utility of the observed data. Mutual information (12), which is the average relative entropy over the distribution of possible data outcomes, can be interpreted as the expected information from an experiment. Thus, relative entropy and mutual information are natural measures of the utility provided by an experiment when an individual's preferences are described by a smooth, proper, local utility function. Whether the aforementioned properties of the utility function are fundamental remains unclear, as the many references to alternative measures in Bernardo and Smith illustrate.

Relative entropy and mutual information are invariant with respect to nonlinear, invertible transformations (Bernardo and Smith 2000, p. 158; Goldman 1953, p. 153), whereas predictive information is invariant only to linear transformations. Bernardo and Smith (2000) suggest that this invariance is essential for any measure of the utility increase from observed data. For instance, in the context of predictability, one could argue that the degree of predictability should not depend on whether the data is represented in pressure or sigma coordinates, which are related by a nonlinear transformation. Since the average predictive information equals mutual information, it follows that while predictive information is not invariant to nonlinear transformations, its average is. Whether the invariance property in its full nonlinear sense can ever be taken advantage of in realistic situations with finite sample sizes remains to be seen.

An issue that may raise concern is the fact that all three metrics depend on the log of the determinant of a covariance matrix, for normal distributions. Consequently, small sampling errors in the smallest eigenvalues of the covariance matrices produce large errors in the metrics. This problem might discourage the use of such measures in favor of other measures, perhaps those based on the trace of the error covariance matrix, which are not as sensitive to sampling errors. While such sensitivity affects the statistical estimation procedure, it need not be interpreted as a problem in the fundamental concepts. Indeed, the advantage of measures based on determinants can be seen in the following scenario. Consider a system that is unpredictable everywhere except at a single location. A predictability measure based on the mean square forecast error would be dominated by the environmental noise, in which case the high degree of predictability at the particular location might go undiscovered. In contrast, a predictability measure based on determinants would reveal a great deal of predictability, even though the highly predictable component is due to a single location. Moreover, as discussed in section 6, the above predictability measures can be decomposed into a linear combination of uncorrelated components. This decomposition allows an investigator to isolate those components that contribute significantly to the total predictability.
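The scenario just described can be made concrete with idealized covariance matrices, assuming normal distributions, for which the predictive information equals one-half the difference of the log determinants of the climatological and forecast covariances: a trace-based measure barely registers a single nearly noise-free location, while the determinant-based measure does.

```python
import numpy as np

N = 100
Sigma_clim = np.eye(N)                 # idealized climatological covariance
Sigma_fcst = np.eye(N)
Sigma_fcst[0, 0] = 1e-4                # one highly predictable location, all others unpredictable

# Trace-based measure: total forecast error variance relative to its saturation value.
print(np.trace(Sigma_fcst) / np.trace(Sigma_clim))    # ~0.99: looks essentially unpredictable

# Determinant-based measure: predictive information for normal distributions.
logdet_clim = np.linalg.slogdet(Sigma_clim)[1]
logdet_fcst = np.linalg.slogdet(Sigma_fcst)[1]
print(0.5 * (logdet_clim - logdet_fcst))              # ~4.6 nats: substantial predictability
```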

In the case of predictability of the second kind, we are generally interested in any difference between two distributions, even if there is no change in uncertainty. Relative entropy appears to provide an attractive measure for predictability of the second kind.

5. Predictability of linear stochastic models

Linear stochastic models provide an opportunity to understand the predictability of a system comprehensively. Moreover, Wang and Uhlenbeck (1945) show that any stationary, Gaussian, Markov process can be represented by a suitable linear stochastic model, and vice versa. Thus, a general conclusion in one representation implies an analogous conclusion in the other. A linear stochastic model is of the form
\dot{x} = \mathsf{A}\, x + w    (13)
where x is a state vector, the dot indicates a time derivative, 𝗔 is a dynamical operator, and w is a Gaussian white noise process with zero mean and covariance matrix 𝗤. If the dynamical operator is independent of time, then the system is autonomous and the solution can be calculated analytically. Such models are well understood and have been reviewed in numerous publications (Gardiner 1990; DelSole 2004). Here we simply state relevant results. The initial condition xt and the verification xt+τ represent solutions of (13) for a single realization of the noise. If the stochastic process was begun in the infinite past, then xt and xt+τ are stationary processes with distribution
p(x_t) = p(x_{t+\tau}) = N(0, \Sigma_\upsilon)    (14)
where N(μ, Σ) denotes a normal distribution with mean μ and covariance matrix Σ, and the “climatological” covariance matrix is given by
\Sigma_\upsilon = \int_0^{\infty} e^{\mathsf{A} s}\, \mathsf{Q}\, e^{\mathsf{A}^T s}\, ds    (15)
The distribution p(xt+τ) can be interpreted as the climatological distribution when no (recent) observation is available. The initial condition xt and verification xt+τ are related by
x_{t+\tau} = e^{\mathsf{A}\tau} x_t + \eta_\upsilon(t, \tau)    (16)
where ηυ is distributed as
\eta_\upsilon \sim N\!\left(0,\; \Sigma_\upsilon - e^{\mathsf{A}\tau}\, \Sigma_\upsilon\, e^{\mathsf{A}^T\tau}\right)    (17)
The first term on the right of (16) constitutes an unbiased estimate of xt+τ and is not a random variable when xt is known. Schneider and Griffies (1999) write a “constitutive equation” of the same form as (16) (the first displayed equation in their paper), but their equation differs from (16) in a subtle way, which need not be discussed at this point. Many predictability studies call the first term the signal and the second term the noise.
It follows from (16) and (17) that the conditional distribution at time t + τ, given knowledge of the state at time t, is
p(x_{t+\tau} \mid x_t) = N\!\left(e^{\mathsf{A}\tau} x_t,\; \Sigma_\upsilon - e^{\mathsf{A}\tau}\, \Sigma_\upsilon\, e^{\mathsf{A}^T\tau}\right)    (18)
This distribution can be interpreted as the forecast distribution when the initial condition becomes known. Importantly, the initial condition xt affects only the mean of the forecast. In the limit of large lead time, the forecast (18) approaches the climatological (14).
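A minimal numerical sketch of these climatological and forecast distributions is given below, using the Gaussian expressions above; the dynamical operator, noise covariance, initial condition, and lead time are made-up values, and the stationary covariance is obtained from the equivalent Lyapunov equation 𝗔Σ + Σ𝗔ᵀ + 𝗤 = 0.

```python
import numpy as np
from scipy.linalg import expm, solve_continuous_lyapunov

# Illustrative 2-D linear stochastic model dx/dt = A x + w (all parameters made up).
A = np.array([[-1.0,  2.0],
              [ 0.0, -2.0]])            # stable, non-normal dynamical operator
Q = np.eye(2)                           # white-noise covariance
tau = 0.5                               # lead time
x_t = np.array([1.0, 0.5])              # perfectly observed initial condition

# Climatological covariance: solution of A S + S A^T + Q = 0 (equivalent to the integral form).
Sigma_clim = solve_continuous_lyapunov(A, -Q)

# Forecast distribution: mean e^{A tau} x_t, covariance Sigma_clim - e^{A tau} Sigma_clim e^{A^T tau}.
G = expm(A * tau)
mu_fcst = G @ x_t
Sigma_fcst = Sigma_clim - G @ Sigma_clim @ G.T

print("forecast mean       :", mu_fcst)
print("forecast covariance :", Sigma_fcst)
print("climatological cov. :", Sigma_clim)
```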

To evaluate the climatological and forecast distributions (1) and (2), the distribution of the observations p(ot) is needed. While all distributions can be evaluated analytically if the observation distribution p(ot) is Gaussian, this complete solution tends to obscure the basic conclusions related to predictability. Thus, for simplicity, we assume that the system is perfectly observed and, hence, the forecast distribution is simply p(xt+τ|xt) given in (18).

It is shown in appendix A that the relative entropy for this system is
R_o = -\tfrac{1}{2}\log\det\!\left(\mathsf{I} - \mathsf{W}\mathsf{W}^T\right) - \tfrac{1}{2}\,\mathrm{tr}\!\left(\mathsf{W}\mathsf{W}^T\right) + \tfrac{1}{2}\, x_t^T e^{\mathsf{A}^T\tau}\, \Sigma_\upsilon^{-1}\, e^{\mathsf{A}\tau} x_t    (19)
where
\mathsf{W} = \Sigma_\upsilon^{-1/2}\, e^{\mathsf{A}\tau}\, \Sigma_\upsilon^{1/2}    (20)
The matrix 𝗪 is called the prewhitened dynamical operator. Kleeman (2002) calls the last term in (19) the “signal” and the remaining terms the “dispersion.” The initial condition xt affects only the signal. As lead time τ increases, 𝗪 tends to vanish, and hence relative entropy tends to vanish.
It is shown in appendix A that the predictive information for this system is
P_o = -\tfrac{1}{2}\log\det\!\left(\mathsf{I} - \mathsf{W}\mathsf{W}^T\right)    (21)
Note that the above expression is independent of the initial condition, in contrast to the relative entropy (19). Finally, the mutual information between the verification and initial condition can be obtained by averaging the predictive information (21) over all initial conditions, which is a trivial exercise with the result
I(X_{t+\tau}, X_t) = -\tfrac{1}{2}\log\det\!\left(\mathsf{I} - \mathsf{W}\mathsf{W}^T\right)    (22)
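The sketch below evaluates these Gaussian expressions for a small made-up model, assuming the forms reconstructed above (in particular a prewhitened operator of the form Σ_υ^{-1/2} e^{𝗔τ} Σ_υ^{1/2}); it illustrates that predictive information and mutual information coincide here, while relative entropy carries an additional signal term that depends on the initial condition.

```python
import numpy as np
from scipy.linalg import expm, solve_continuous_lyapunov, sqrtm

# Illustrative 2-D stationary Gaussian Markov model (all parameters made up).
A = np.array([[-1.0, 2.0],
              [0.0, -2.0]])
Q = np.eye(2)
tau = 0.5
x_t = np.array([1.0, 0.5])

Sigma = solve_continuous_lyapunov(A, -Q)             # climatological covariance
G = expm(A * tau)                                     # propagator e^{A tau}
S_half = np.real(sqrtm(Sigma))
W = np.linalg.inv(S_half) @ G @ S_half                # prewhitened propagator
WWt = W @ W.T
N = A.shape[0]

P = -0.5 * np.linalg.slogdet(np.eye(N) - WWt)[1]      # predictive information
signal = x_t @ G.T @ np.linalg.inv(Sigma) @ G @ x_t   # signal term (depends on x_t)
R = P - 0.5 * np.trace(WWt) + 0.5 * signal            # relative entropy
I_mut = P                                             # mutual information: average of P over x_t

print(f"P = {P:.3f}, R = {R:.3f}, I = {I_mut:.3f}")
```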

Tippett and Chang (2003) show that, for all of the above measures, the least predictable system, out of all systems with the same eigenvalues and vanishing initial error, is that in which the prewhitened dynamical operator 𝗪 is normal. Since the prewhitened operator is unique up to an orthogonal transformation, the normality of 𝗪 is not altered by coordinate changes that preserve the prewhitening characteristic.

To connect with classical theory, we consider a measure of predictability based on the mean square difference between two randomly chosen members of the forecast distribution. If ϵ is half the difference between two randomly chosen states, and 𝗠 is a positive definite matrix defining the norm, then appendix B shows that
\left\langle \epsilon^T \mathsf{M}\, \epsilon \right\rangle = \tfrac{1}{2}\,\mathrm{tr}\!\left[\mathsf{M}\left(\Sigma_\upsilon - e^{\mathsf{A}\tau}\, \Sigma_\upsilon\, e^{\mathsf{A}^T\tau}\right)\right]    (23)
where the angle brackets denote an average over the forecast distribution. Note that this measure is independent of the initial condition.

Interestingly, the lead-time dependence of all the above measures is controlled by the operator 𝗪. Appendix B shows that all of the above measures are monotonic functions of lead time. [Note that the proof in appendix B assumes that the initial state is known with certainty and, hence, has zero entropy. Without this assumption, predictability could increase with time, as pointed out by Cover and Thomas (1991, p. 35). For instance, the arbitrary initial condition could have larger entropy than the asymptotic stationary distribution.] Hence all of the above metrics conform to the intuition that predictability degrades with lead time in Gaussian, stationary, Markov systems.

If errors are measured in the prewhitening space, that is, 𝗠 = Σ−1υ, which is similar to measuring errors by a Mahalanobis distance (Johnson and Wichern 1982), then, remarkably, the expressions (19), (21), and (23) can be combined into the single equation
R_o = P_o + \left\langle \epsilon^T \Sigma_\upsilon^{-1} \epsilon \right\rangle - \tfrac{N}{2} + \tfrac{1}{2}\, x_t^T e^{\mathsf{A}^T\tau}\, \Sigma_\upsilon^{-1}\, e^{\mathsf{A}\tau} x_t    (24)
The last three terms have zero mean. Thus, relative entropy can be larger or smaller than predictive information depending on the initial condition. It follows that rankings of predictability according to relative entropy will differ from rankings according to predictive information.

By far the most significant difference among the above measures in the context of Markov models is that predictive information, mutual information, and mean square error are independent of the initial condition, whereas the relative entropy is not. A revealing thought experiment is to consider the case in which the initial condition projects on the leading singular vector of a suitable propagator. In this case, the ensemble mean grows to a large value, whereas the forecast spread grows at a rate independent of the initial condition. Penland and Sardeshmukh (1995) and Moore and Kleeman (1999) have suggested that ENSO predictability can be understood in these terms. However, singular vectors are relevant to this type of predictability only if one adopts a measure of predictability that depends on the signal to noise ratio of a single forecast. Of the metrics discussed in this paper, only relative entropy satisfies this condition. In contrast, predictive information and mean square error are independent of the initial condition because they measure uncertainty, which is independent of the initial condition in Gaussian systems.

6. Predictable components

It is often insightful to decompose predictability into components that maximize it. This approach is analogous to the use of principal components to understand how different spatial structures contribute to total variance. This decomposition is especially useful if predictability is dominated by a few coherent patterns. In such cases, the decomposition allows an investigator to focus only on the few structures that are predictable and to ignore the vast majority of structures that are unpredictable. This decomposition also provides the basis for attributing predictability to specific structures in the initial condition, boundary condition, and/or external forcing.

In this section, we show (apparently for the first time) that optimization of predictive information is identical to optimization of relative entropy when the distributions are Gaussian and the means of the climatological and forecast are identical. The resulting structures will be called predictable components. Furthermore, we show that optimization of mutual information is equivalent to canonical correlation analysis. Finally, we discuss a surprising connection between these optimization procedures.

It should be recognized that defining predictable components according to error variance is problematic. First, absolute error variance is not a measure of predictability. Rather, error variance relative to its saturation value is the appropriate measure of predictability. Second, total error variance can be dominated by structures with large natural variance and can have little to do with predictability. Third, a background random noise field can dominate error variance and thereby hide highly predictable components.

Schneider and Griffies (1999) proposed defining predictable components based on optimizing predictive information. The optimization procedure, called predictable component analysis, is discussed in Schneider and Griffies (1999), to which the reader is referred for details (see also DelSole and Chang 2003). Here we give only an outline of the method. It proves convenient to distinguish different random variables by different symbols. Thus, the variable having the climatological distribution, called the verification, will be denoted by v, while the variable with the forecast distribution will be denoted by f. Consider a projection vector qk; the subscript k is an index anticipating the fact that multiple vectors will be obtained. The projected verification and forecast are υk(t) = qTkv and fk(τ, t) = qTkf, both of which are scalar time series. If the variables are joint normally distributed, then the projected quantities also are Gaussian with scalar means and variances:
\mu_\upsilon = q_k^T \langle v \rangle, \quad \sigma_\upsilon^2 = q_k^T \Sigma_\upsilon q_k, \quad \mu_f = q_k^T \langle f \rangle, \quad \sigma_f^2 = q_k^T \Sigma_f q_k    (25)
where the fact that each of these parameters depends on k is understood. The predictive information associated with projection vector qk can be inferred from (A3) to be
P_k = -\tfrac{1}{2}\log r_k    (26)
where rk is the ratio of variances associated with the kth projection vector
r_k = \frac{\sigma_f^2}{\sigma_\upsilon^2} = \frac{q_k^T \Sigma_f q_k}{q_k^T \Sigma_\upsilon q_k}    (27)
This ratio can be interpreted as the ratio of the forecast spread to the climatological spread. Since the log-function is monotonic, optimization of predictive information Pk is equivalent to optimization of the variance ratio rk. The variance ratio (27) is a Rayleigh quotient. It is a standard procedure (Noble and Daniel 1988) to show that optimization of the quotient rk with respect to qk leads to the generalized eigenvalue problem
\Sigma_f\, q_k = r_k\, \Sigma_\upsilon\, q_k    (28)
The eigenvectors yield statistically uncorrelated time series; that is, 〈fkfm〉 = 0 and 〈υkυm〉 = 0 for k ≠ m. The corresponding eigenvalues give the variance ratio rk associated with each vector. Ordered by their eigenvalues from smallest to largest, the leading eigenvector gives the projection vector that maximizes predictive information, the second maximizes predictive information over all vectors uncorrelated with the first, and so on.
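A minimal sketch of this decomposition is given below, with made-up covariance matrices containing a single rank-one predictable signal; scipy's symmetric generalized eigensolver returns the variance ratios in ascending order, so under the convention above the most predictable component appears first.

```python
import numpy as np
from scipy.linalg import eigh

# Made-up climatological covariance and a forecast covariance obtained by
# removing a single rank-one "signal" from it.
Sigma_v = np.array([[2.0, 0.5, 0.0],
                    [0.5, 1.0, 0.2],
                    [0.0, 0.2, 0.8]])
s = np.array([1.0, -0.5, 0.2])
Sigma_f = Sigma_v - 0.6 * np.outer(s, s)      # remains positive definite

# Generalized eigenproblem Sigma_f q = r Sigma_v q; eigenvalues are variance ratios.
r, q = eigh(Sigma_f, Sigma_v)                 # ascending order
P = -0.5 * np.log(r)                          # predictive information of each component

print("variance ratios        :", r)          # one ratio well below 1, two equal to 1
print("predictive information :", P)          # only the leading component is predictable
print("leading projection q_1 :", q[:, 0])
```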
Now consider components that optimize relative entropy. The relative entropy Rk associated with projection vector qk can be inferred from (A4) to be
R_k = \tfrac{1}{2}\left[\log\frac{\sigma_\upsilon^2}{\sigma_f^2} + \frac{\sigma_f^2}{\sigma_\upsilon^2} + \frac{(\mu_f - \mu_\upsilon)^2}{\sigma_\upsilon^2} - 1\right]    (29)
where the terms appearing on the right are defined in (25). All terms appearing in (29) are ratios of quadratic forms. It follows that Rk is invariant with respect to multiplicative factors applied to the projection vector qk. Without loss of generality, the projection vectors may be normalized such that σ2υ = 1. Optimization of Rk with respect to qk, subject to the constraint σ2υ = 1, leads to the equation
\left[\left(1 - \sigma_f^{-2}\right)\Sigma_f + \left(\langle f \rangle - \langle v \rangle\right)\left(\langle f \rangle - \langle v \rangle\right)^T\right] q_k = \lambda\, \Sigma_\upsilon\, q_k    (30)
where λ is a Lagrange multiplier. Since σ2f depends on qk [see (25)], the above equation is nonlinear and solvable only by iterative procedures. Nevertheless, it still holds that if qk and qj are two solutions of (30) with different values of λ, then qTkΣυqj = δkj; that is, the time series associated with qk and qj are uncorrelated with respect to the verification ensemble. This may be shown in essentially the same way that one proves that eigenvectors of symmetric matrices with distinct eigenvalues are orthogonal.
We now come to an interesting fact: if V and F have the same means, then (30) reduces to the eigenvalue problem (28), from which it follows that relative entropy and predictive information share the same predictable components. This correspondence is not surprising for two normal distributions with identical means, since such distributions differ only in their variances. If λi is the ith eigenvalue of (28), then the total relative entropy, when the difference in means vanishes, is
R = \tfrac{1}{2}\sum_{i=1}^{N}\left(\lambda_i - \log \lambda_i - 1\right)    (31)
In the opposite extreme, in which covariances are nearly the same, the matrix on the left of (30) is dominated by the terms involving the means, in which case the projection vector can be solved analytically as
q_k \propto \Sigma_\upsilon^{-1}\left(\langle f \rangle - \langle v \rangle\right)    (32)
Readers familiar with “fingerprint” methods (Hasselmann 1993) and Fisher's linear discriminant function will recognize this projection vector as the one that optimizes the signal to noise ratio, where signal is defined as the difference in means and noise is the climatological variance.
Now consider the determination of components that optimize mutual information. The mutual information for multivariate normal distributions is shown in appendix A to be
I(F, O) = -\tfrac{1}{2}\log\det\!\left(\mathsf{I} - \Sigma_f^{-1}\, \Sigma_{fo}\, \Sigma_o^{-1}\, \Sigma_{of}\right)    (33)
where Σfo = ΣTof is the cross-covariance matrix between the observation and forecast, o and f. We follow the same procedure as above except that we introduce a pair of projection vectors, say qk and sk. Let the projected observation be ok(t) = sTko, and the projected forecast be fk(t) = qTkf, both of which are scalar time series. The projection renders all covariance matrices in (33) scalar variables, which by visual inspection can be seen to yield the mutual information between ok and fk in the form
I_k = -\tfrac{1}{2}\log\!\left(1 - \rho_k^2\right)    (34)
where ρk is the cross correlation between ok and fk. Thus, for Gaussian variables, the projection vector pair that maximizes Ik also maximizes the squared correlation between the associated time series. This relation is reasonable in view of the fact that mutual information measures the degree of dependence between two variables, which in the case of Gaussian variables is quantified by ρ. However, optimization of correlation is solved by a well-known procedure called canonical correlation analysis (CCA). It follows that, in the case of Gaussian variables, CCA applied to the initial and verification fields determines the components that maximize mutual information. See Barnett and Preisendorfer (1987) and DelSole and Chang (2003) for a review of the CCA procedure and the method for constructing predictions based on canonical patterns.
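The following sketch, using synthetic data, computes the canonical correlations as the singular values of the whitened cross-covariance matrix and converts each one to a mutual information via (34); the data-generating model is made up.

```python
import numpy as np
from scipy.linalg import sqrtm, svd

rng = np.random.default_rng(2)
n = 2000
o = rng.standard_normal((n, 4))                       # "initial condition" field (4 variables)
f = 0.7 * o @ rng.standard_normal((4, 3)) + rng.standard_normal((n, 3))   # related "verification" field

Sigma_o = np.cov(o, rowvar=False)
Sigma_f = np.cov(f, rowvar=False)
Sigma_of = np.cov(o, f, rowvar=False)[:4, 4:]         # cross-covariance block

# Canonical correlations = singular values of the whitened cross-covariance.
Wo = np.linalg.inv(np.real(sqrtm(Sigma_o)))
Wf = np.linalg.inv(np.real(sqrtm(Sigma_f)))
rho = svd(Wo @ Sigma_of @ Wf, compute_uv=False)

print("canonical correlations            :", rho)
print("mutual information per pair (nats):", -0.5 * np.log(1 - rho**2))
```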
As is well known in analysis of variance (ANOVA) techniques, the forecast covariance of a perfect model can be written as Σf = Συ − 〈μfμTf〉, provided that the variables are defined such that μυ = 0. If we identify μf as the “signal” and Σf as the “spread,” then this last expression shows that signal and spread are inversely related—the larger the signal, the smaller the spread. Substituting this expression into the appropriate equations above shows that the predictability metrics can be written as
P_k = I_k = −(1/2) ln(1 − 〈μ_f²〉/σ_υ²),   R_k = −(1/2) ln(1 − 〈μ_f²〉/σ_υ²) − 〈μ_f²〉/(2σ_υ²) + μ_f²/(2σ_υ²).    (35)
Note that all three quantities are monotonic functions of the parameter 〈μ_f²〉/σ_υ², which can be interpreted as the average signal to noise ratio. Relative entropy R_k is distinguished from the other two metrics by the fact that it also involves the signal to noise ratio for a single forecast, given by the last term μ_f²/σ_υ². If we ignore the last term in R_k, then (35) shows that, for Gaussian variables, optimization of predictability, as measured by relative entropy, predictive information, or mutual information, is equivalent to optimization of the average signal to noise ratio 〈μ_f²〉/σ_υ².
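The scalar relations summarized in (35) can be verified directly from the Gaussian formulas of appendix A under the perfect-model constraint σ_f² = σ_υ² − 〈μ_f²〉. The sketch below uses illustrative values of the climatological variance and signal variance; the agreement of the printed quantities (up to sampling error in the simulated forecast means) reflects the statement that relative entropy and predictive information share the same average, which equals the mutual information.

```python
import numpy as np

rng = np.random.default_rng(2)

sigma2_v = 2.0                    # climatological variance (illustrative)
signal_var = 0.8                  # <mu_f^2>, the variance of the forecast mean
sigma2_f = sigma2_v - signal_var  # perfect-model spread: signal and spread are inversely related

# Draw many forecast means consistent with <mu_f^2> = signal_var.
mu_f = rng.normal(0.0, np.sqrt(signal_var), size=200_000)

# Scalar Gaussian formulas (appendix A).
P = 0.5 * np.log(sigma2_v / sigma2_f)                       # predictive information (same for every forecast)
R = 0.5 * (np.log(sigma2_v / sigma2_f)                      # relative entropy of each individual forecast
           + sigma2_f / sigma2_v + mu_f**2 / sigma2_v - 1.0)
I = -0.5 * np.log(1.0 - signal_var / sigma2_v)              # mutual information

print(P, R.mean(), I)   # all approximately -0.5*ln(1 - <mu_f^2>/sigma_v^2)
```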

7. Summary and discussion

This paper reviewed a framework for understanding predictability based on information theory, assuming that the governing equations of the system were known exactly. In this framework, predictability depends on three quantities: 1) a set of observations of the system, 2) a climatological distribution of the system that is known in the absence of recent observations, and 3) a forecast distribution that is known after all observations become available and a forecast based on these observations has been performed. An event is said to be unpredictable if the climatological and forecast distributions are identical in every way. Hence, the degree of predictability depends on the degree to which the prior and posterior distributions differ.

Any ranking of predictability by some measure of the difference between climatological and forecast distributions will depend on geographical, economic, and/or societal issues that differ from user to user. Information theory provides an intuitive and rigorous framework for quantifying predictability because it measures the degree of uncertainty in a random variable. However, there is no unique measure within this framework. At least three measures of predictability based on information theory have been proposed: mutual information between the forecast and initial condition (Leung and North 1990), predictive information between the forecast and climatological distributions (Schneider and Griffies 1999), and relative entropy between the forecast and climatological distributions (Kleeman 2002). Significant properties of these metrics are as follows:

  1. All three metrics are invariant with respect to nonsingular linear transformations. Hence, the metrics do not depend on the basis in which the data are represented, and variables with different units and natural variances can be analyzed as a single state variable without introducing an arbitrary normalization, since such a normalization cannot affect the final results. This property is not shared by metrics such as mean square error. Relative entropy and mutual information are also invariant with respect to nonlinear, invertible transformations.

  2. Relative entropy and predictive information have the same average over all initial conditions, and this average is precisely equal to the mutual information. Thus, relative entropy and predictive information measure the predictability of a single forecast distribution, while mutual information measures their common average over all initial conditions. This fact, which is not stated in the above-cited papers on predictability, reveals a similarity between the measures that may not be apparent from their formal definitions.

  3. Predictable component analysis, which identifies highly predictable components in a Gaussian system, not only optimizes predictive information, but also optimizes relative entropy when the means of the climatological and forecast distribution are identical.

  4. Optimization of the average predictability in Gaussian, Markov systems is equivalent to canonical correlation analysis of the forecast and initial condition. Thus CCA, which is often introduced as an ad hoc method in predictability studies, arises naturally in information theory.

  5. All three metrics, as well as mean square error, are monotonic functions of lead time for normally distributed, stationary, Markov systems, consistent with the classical notion (Lorenz 1969) that predictability degrades with time. Predictive information is a monotonic function of lead time only in an average sense in non-Gaussian systems.

  6. All three metrics have closed-form expressions in finite dimensional, stationary, Gaussian, Markov systems. These expressions completely describe the predictability of all linear stochastic models driven by Gaussian noise.

An unresolved issue in the above framework is whether predictability should be measured by a "distance" metric, like relative entropy, or by an entropy difference, like predictive information. For normal distributions, the key distinction between these metrics is that one depends on the signal to noise ratio of a single forecast distribution, whereas the other does not. By signal to noise ratio, we mean that the signal is identified with the mean of the forecast distribution and the noise is identified with the variance of the climatological distribution (though similar conclusions hold if we identify the noise with the forecast variance). A contrived situation that highlights the differences among the metrics is illustrated in Fig. 1. In this figure, two forecast distributions are shown, labeled F1 and F2, that have different means but exactly the same entropy. The climatological distribution, shown as the dashed curve, has more entropy than either forecast, but has the same mean as F1. Since the forecasts F1 and F2 have exactly the same entropy, they have exactly the same predictive information. In contrast, the relative entropy is greater for F2 than for F1 owing to F2's larger signal to noise ratio. The question arises as to which of these metrics conforms to our intuition about predictability. We can immediately reject the idea that the signal to noise ratio alone is sufficient to measure predictability: in the limit that F1 approaches a delta function, it would have zero signal to noise ratio but complete certainty, and it would be absurd to call a variable with this distribution unpredictable. The seeming importance of the signal to noise ratio may arise from the fact that it often appears in context-dependent utility functions, which quantify the cost of a random event with respect to an individual. But the mere fact that the signal to noise ratio plays an important role in utility functions does not necessarily imply that it is fundamental to predictability, just as the meaning of a message is irrelevant to the engineering problem of communication. Also, in a perfect model scenario, the signal is perfectly known (with an infinite ensemble) and does not constitute a source of uncertainty. There seems to be no inconsistency in restricting predictability theory to metrics that include the signal to noise ratio, or to metrics that exclude it, especially since both can have the same average.
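The contrast drawn in Fig. 1 can be reproduced with a few lines of code using the parameters quoted in the figure caption (climatology: μ = 0, σ² = 5; F1: μ = 0, σ² = 1.5; F2: μ = 5.4, σ² = 1.5). The helper functions below are a minimal sketch of the scalar Gaussian formulas, not code from the original study.

```python
import numpy as np

def predictive_info(var_clim, var_fcst):
    # Entropy of the climatological Gaussian minus entropy of the forecast Gaussian.
    return 0.5 * np.log(var_clim / var_fcst)

def relative_entropy(mu_clim, var_clim, mu_fcst, var_fcst):
    # Kullback-Leibler divergence of the forecast from the climatology (scalar Gaussians).
    return 0.5 * (np.log(var_clim / var_fcst) + var_fcst / var_clim
                  + (mu_fcst - mu_clim) ** 2 / var_clim - 1.0)

# Parameters from the Fig. 1 caption.
mu_c, var_c = 0.0, 5.0
f1 = (0.0, 1.5)   # F1: same mean as climatology
f2 = (5.4, 1.5)   # F2: shifted mean, same variance as F1

print(predictive_info(var_c, f1[1]), predictive_info(var_c, f2[1]))              # identical
print(relative_entropy(mu_c, var_c, *f1), relative_entropy(mu_c, var_c, *f2))    # larger for F2
```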

In principle, information theory provides an elegant and powerful framework for quantifying predictability and addressing a host of questions that otherwise would remain unanswered. In practice, however, applying it poses significant challenges. First, the perfect model is not available; in practice, forecast models are used to simulate the forecast distribution. Unfortunately, errors in all forecast models give rise to differences between the forecast and climatological distributions that, if not accounted for, would lead to false indications of predictability. The question of how to extend the above framework to imperfect forecasts is discussed in Part II. The second problem is that the required probability distributions are not known and must be estimated from data. Strategies for estimating the necessary distributions from finite sample sizes will be discussed in future work.

Acknowledgments

I am very much indebted to J. Shukla, who provided essential criticism and encouragement during the course of this work, and to Tapio Schneider, with whom I have had many stimulating correspondences and discussions about this work. I also thank Michael Tippett, David Nolan, and Ben Kirtman for discussions on this topic, and William Merryfield and the reviewers for many helpful comments. This research was supported by the NSF (ATM9814295), NOAA (NA96-GP0056), and NASA (NAG5-8202).

REFERENCES

  • Anderson, J. L., and W. F. Stern, 1996: Evaluating the potential predictive utility of ensemble forecasts. J. Climate, 9, 260–269.

  • Barnett, T. P., and R. Preisendorfer, 1987: Origins and levels of monthly and seasonal forecast skill for United States surface air temperatures determined by canonical correlation analysis. Mon. Wea. Rev., 115, 1825–1850.

  • Bernardo, J. M., and A. F. M. Smith, 2000: Bayesian Theory. Wiley, 586 pp.

  • Chervin, R. M., and S. H. Schneider, 1976: On determining the statistical significance of climate experiments with general circulation models. J. Atmos. Sci., 33, 405–412.

  • Cover, T. M., and J. A. Thomas, 1991: Elements of Information Theory. Wiley, 576 pp.

  • DelSole, T., 2004: Stochastic models of quasigeostrophic turbulence. Surv. Geophys., 25, 107–149.

  • DelSole, T., and P. Chang, 2003: Predictable component analysis, canonical correlation analysis, and autoregressive models. J. Atmos. Sci., 60, 409–416.

  • Epstein, E. S., 1969: Stochastic dynamic predictions. Tellus, 21, 739–759.

  • Gardiner, C. W., 1990: Handbook of Stochastic Methods. 2d ed. Springer-Verlag, 442 pp.

  • Goldman, S., 1953: Information Theory. Prentice Hall, 385 pp.

  • Hasselmann, K., 1993: Optimal fingerprints for the detection of time-dependent climate change. J. Climate, 6, 1957–1971.

  • Horn, R. A., and C. R. Johnson, 1985: Matrix Analysis. Cambridge University Press, 561 pp.

  • Jazwinski, A. H., 1970: Stochastic Processes and Filtering Theory. Academic Press, 376 pp.

  • Johnson, R. A., and D. W. Wichern, 1982: Applied Multivariate Statistical Analysis. Prentice Hall, 594 pp.

  • Kleeman, R., 2002: Measuring dynamical prediction utility using relative entropy. J. Atmos. Sci., 59, 2057–2072.

  • Kullback, S., 1959: Information Theory and Statistics. Wiley and Sons, 399 pp. [Republished by Dover, 1968.]

  • Leung, L-Y., and G. R. North, 1990: Information theory and climate prediction. J. Climate, 3, 5–14.

  • Lorenz, E. N., 1963: Deterministic nonperiodic flow. J. Atmos. Sci., 20, 130–141.

  • Lorenz, E. N., 1965: A study of the predictability of a 28-variable atmospheric model. Tellus, 17, 321–333.

  • Lorenz, E. N., 1969: The predictability of a flow which possesses many scales of motion. Tellus, 21, 289–307.

  • Lorenz, E. N., 1975: Climatic predictability. The Physical Basis of Climate and Climate Modelling, B. Bolin et al., Eds., GARP Publication Series, Vol. 16, World Meteorological Organization, 132–136.

  • Magnus, J. R., and H. Neudecker, 2001: Matrix Differential Calculus with Applications in Statistics and Econometrics. Wiley and Sons, 395 pp.

  • Majda, A., R. Kleeman, and D. Cai, 2002: A framework for predictability through relative entropy. Methods Appl. Anal., 9, 425–444.

  • Moore, A. M., and R. Kleeman, 1999: Stochastic forcing of ENSO by the intraseasonal oscillation. J. Climate, 12, 1199–1220.

  • Murphy, A. H., 1993: What is a good forecast? An essay on the nature of goodness in weather forecasting. Wea. Forecasting, 8, 281–293.

  • Noble, B., and J. W. Daniel, 1988: Applied Linear Algebra. 3d ed. Prentice Hall, 521 pp.

  • Penland, C., and P. D. Sardeshmukh, 1995: The optimal growth of tropical sea surface temperature anomalies. J. Climate, 8, 1999–2024.

  • Reza, F. M., 1961: An Introduction to Information Theory. McGraw-Hill, 496 pp.

  • Sardeshmukh, P. D., G. P. Compo, and C. Penland, 2000: Changes of probability associated with El Niño. J. Climate, 13, 4268–4286.

  • Schneider, T., and S. M. Griffies, 1999: A conceptual framework for predictability studies. J. Climate, 12, 3133–3155.

  • Shannon, C. E., 1948: A mathematical theory of communication. Bell Syst. Tech. J., 27, 379–423, 623–656.

  • Shukla, J., 1981: Dynamical predictability of monthly means. J. Atmos. Sci., 38, 2547–2572.

  • Tippett, M. K., and P. Chang, 2003: Some theoretical considerations on predictability of linear stochastic dynamics. Tellus, 55A, 148–157.

  • Wang, M. C., and G. E. Uhlenbeck, 1945: On the theory of the Brownian motion II. Rev. Mod. Phys., 17, 323–342.

APPENDIX A

Expressions for Gaussian Variables

In this appendix we derive explicit expressions for predictive information, relative entropy, and mutual information for joint normally distributed variables. It proves necessary to distinguish different distributions by different subscripts. Thus, the climatological distribution will be denoted by p_υ(x) (υ for verification), while the forecast distribution will be denoted by p_f(x). Let the mean and covariance matrix of the climatological distribution be μ_υ and Σ_υ, and those of the forecast distribution be μ_f and Σ_f. We use the notation p_υ(x) = N(μ_υ, Σ_υ) to denote the M-dimensional normal distribution with mean μ_υ and covariance matrix Σ_υ:
p_υ(x) = (2π)^{-M/2} (det Σ_υ)^{-1/2} exp[−(1/2) (x − μ_υ)^T Σ_υ^{-1} (x − μ_υ)].    (A1)
The entropy of a normal distribution is given in Cover and Thomas (1991, chapter 9) and Schneider and Griffies (1999). The explicit expression can be derived as follows:
H(p_υ) = −∫ p_υ(x) ln p_υ(x) dx = (1/2) E[(x − μ_υ)^T Σ_υ^{-1} (x − μ_υ)] + (1/2) ln[(2π)^M det Σ_υ] = (M/2) + (M/2) ln(2π) + (1/2) ln det Σ_υ = (1/2) ln[(2πe)^M det Σ_υ].    (A2)
The predictive information therefore is
P = H(p_υ) − H(p_f) = (1/2) ln(det Σ_υ / det Σ_f).    (A3)
The relative entropy for normal distributions is given in Kullback (1959, p. 189) and Kleeman (2002). The expression can be derived as follows:
R = ∫ p_f(x) ln[p_f(x)/p_υ(x)] dx = (1/2) [ln(det Σ_υ / det Σ_f) + tr(Σ_υ^{-1} Σ_f) + (μ_f − μ_υ)^T Σ_υ^{-1} (μ_f − μ_υ) − M].    (A4)
As discussed in section 5, an appropriate forecast distribution for a stochastic process is (18). Substituting the appropriate covariance matrix from (18) into (A3) gives the predictive information:
P = (1/2) ln[det Σ_υ / det(Σ_υ − e^{Aτ} Σ_υ e^{A^T τ})].    (A5)
Making use of the identity Σ_υ = Σ_υ^{1/2} Σ_υ^{1/2} and standard properties of determinants, we can express the above equation for predictive information as
P = −(1/2) ln det(I − Σ_υ^{-1/2} e^{Aτ} Σ_υ e^{A^T τ} Σ_υ^{-1/2}),    (A6)
which is given in (21) in more concise notation. Similarly, substituting the forecast distribution for a stochastic process (18) into (A4) gives the relative entropy:
R = (1/2) [ln det Σ_υ − ln det(Σ_υ − e^{Aτ} Σ_υ e^{A^T τ}) + tr(Σ_υ^{-1} (Σ_υ − e^{Aτ} Σ_υ e^{A^T τ})) + (μ_f − μ_υ)^T Σ_υ^{-1} (μ_f − μ_υ) − M].    (A7)
Using similar “tricks” as above allows the relative entropy to be written as (19).
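For reference, a minimal numerical sketch of (A3) and (A4) for multivariate normal distributions is given below; the two-dimensional means and covariances are illustrative assumptions.

```python
import numpy as np

def predictive_information(cov_clim, cov_fcst):
    # Eq. (A3): half the log ratio of generalized variances.
    _, logdet_clim = np.linalg.slogdet(cov_clim)
    _, logdet_fcst = np.linalg.slogdet(cov_fcst)
    return 0.5 * (logdet_clim - logdet_fcst)

def relative_entropy(mu_clim, cov_clim, mu_fcst, cov_fcst):
    # Eq. (A4): Kullback-Leibler divergence of the forecast from the climatology.
    M = len(mu_clim)
    inv_clim = np.linalg.inv(cov_clim)
    dmu = mu_fcst - mu_clim
    _, logdet_clim = np.linalg.slogdet(cov_clim)
    _, logdet_fcst = np.linalg.slogdet(cov_fcst)
    return 0.5 * (logdet_clim - logdet_fcst + np.trace(inv_clim @ cov_fcst)
                  + dmu @ inv_clim @ dmu - M)

# Illustrative two-dimensional example.
cov_v = np.array([[2.0, 0.5], [0.5, 1.0]])
cov_f = 0.4 * cov_v
print(predictive_information(cov_v, cov_f))
print(relative_entropy(np.zeros(2), cov_v, np.array([0.8, -0.3]), cov_f))
```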
In a perfect model scenario, the forecast distribution is the conditional distribution of the verification given all available observations. If the verification V and observations O are joint normally distributed, then the conditional distribution of V, given that the random variable O equals o, is the normal distribution (Johnson and Wichern 1982):
p_{υ|o}(x) = N(μ_υ + Σ_{υo} Σ_o^{-1} (o − μ_o), Σ_υ − Σ_{υo} Σ_o^{-1} Σ_{oυ}),    (A8)
where Σ_{υo} = Σ_{oυ}^T is the cross-covariance matrix between V and O, μ_o and Σ_o are the mean and covariance matrix of O, and N(μ, Σ) denotes a normal distribution with mean μ and covariance matrix Σ. From (A8) and (A2), the entropy of the conditional distribution is
H(V | O = o) = (1/2) ln[(2πe)^M det(Σ_υ − Σ_{υo} Σ_o^{-1} Σ_{oυ})].    (A9)
It follows that the predictive information given the event O = o is
P_o = H(V) − H(V | O = o) = (1/2) ln[det Σ_υ / det(Σ_υ − Σ_{υo} Σ_o^{-1} Σ_{oυ})] = −(1/2) ln det(I − Σ_υ^{-1} Σ_{υo} Σ_o^{-1} Σ_{oυ}),    (A10)
where standard properties of the determinant have been used to derive the last expression.

The mutual information for normal distributions is given in Reza (1961, p. 296). It can be derived from the identity I(V, O) = H(V) − H(V|O). However, if the distribution is normal, then H(V|O = o) is independent of o [see (A9)], in which case H(V|O) = H(V|O = o). It follows that the mutual information I(V, O) is identically equal to the predictive information P_o [Eq. (A10)] when the variables are Gaussian.

The matrix inside the determinant in (A10) is familiar from canonical correlation analysis (CCA). In particular, the eigenvectors of Σ_υ^{-1} Σ_{υo} Σ_o^{-1} Σ_{oυ} are associated with the canonical patterns between V and O, and the eigenvalues are the squared canonical correlations, which lie between 0 and 1. Since the determinant of a matrix equals the product of its eigenvalues, it follows that the predictive information P_o is
P_o = −(1/2) Σ_{i=1}^{M} ln(1 − ρ_i²),    (A11)
where ρ_1, ρ_2, … , ρ_M are the canonical correlations. Since each squared canonical correlation lies between 0 and 1, the predictive information P_o is non-negative. Thus, the predictive information of a normally distributed set of variables, given an observation that reveals the exact value of a second set of variables, is non-negative. Since no constraints beyond joint normality were assumed for V and O, this result indicates that conditioning a normal distribution cannot increase its entropy. Loosely speaking, knowledge decreases the average uncertainty.
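The equivalence between the CCA-based expression (A11) and the entropy difference (A2)-(A9) is easy to verify numerically. The sketch below constructs an illustrative joint covariance matrix (an assumption, not data from the paper), extracts the squared canonical correlations as eigenvalues of Σ_υ^{-1} Σ_{υo} Σ_o^{-1} Σ_{oυ}, and compares the two routes to the predictive information.

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative joint covariance for a 3-dimensional V and a 2-dimensional O (assumption).
B = rng.standard_normal((5, 5))
C = B @ B.T + np.eye(5)
Svv, Soo, Svo = C[:3, :3], C[3:, 3:], C[:3, 3:]

# Squared canonical correlations: eigenvalues of Svv^{-1} Svo Soo^{-1} Sov.
rho2 = np.linalg.eigvals(np.linalg.inv(Svv) @ Svo @ np.linalg.inv(Soo) @ Svo.T).real

# Predictive information via Eq. (A11).
P_cca = -0.5 * np.sum(np.log(1.0 - rho2))

# Predictive information via the entropy difference, Eqs. (A2) and (A9).
cov_cond = Svv - Svo @ np.linalg.inv(Soo) @ Svo.T
P_direct = 0.5 * (np.linalg.slogdet(Svv)[1] - np.linalg.slogdet(cov_cond)[1])

print(rho2)             # each eigenvalue lies in [0, 1)
print(P_cca, P_direct)  # the two values agree to numerical precision
```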
The relative entropy between pυ(x) and pυ|o(x) can be obtained by substituting the associated population parameters from (A1) and (A8) into (A4), which gives
R = (1/2) [−ln det(I − Σ_υ^{-1} Σ_{υo} Σ_o^{-1} Σ_{oυ}) − tr(Σ_υ^{-1} Σ_{υo} Σ_o^{-1} Σ_{oυ}) + (o − μ_o)^T Σ_o^{-1} Σ_{oυ} Σ_υ^{-1} Σ_{υo} Σ_o^{-1} (o − μ_o)].    (A12)

APPENDIX B

The Variation of Predictability

Cover and Thomas (1991, chapter 2) show that mutual information and relative entropy decay monotonically for discrete, stationary, Markov chains (provided the initial state is known with certainty). This result is extended here to show that, for continuous normal distributions of stationary, Markov processes, relative entropy, predictive information, mutual information, and mean-square error are monotonic functions of lead time.

A basic result that will be needed is the derivative of the log of a determinant. Magnus and Neudecker (2001, chapter 8) show that, for any nonsingular matrix 𝗫,
∂ ln det X / ∂τ = tr(X^{-1} ∂X/∂τ).    (B1)
It also will prove convenient to define the following quantities:
(B2)
It can be shown that both matrices Φ and Γ are positive definite for finite τ. Also, the eigenvalues of Φ lie between 0 and 1, which follows from the standard result in stochastic calculus that Σ_υ − e^{Aτ} Σ_υ e^{A^T τ} is a positive definite covariance matrix for τ > 0.
The properties of normally distributed, stationary Markov processes were reviewed in section 4, within the equivalent context of linear stochastic models. The derivative of mutual information (22) with respect to lead time τ is given by
(B3)
where we have used (B1), (B2), and the Lyapunov equation A Σ_υ + Σ_υ A^T + Q = 0 associated with the stochastic model (13). By the ordering theorem (see next paragraph) and the fact that Φ is positive definite with eigenvalues less than unity, Φ^{-1} Γ is positive definite. This proves that the mutual information (22) and predictive information (21) for a normally distributed, stationary Markov process decay monotonically with lead time.
Now consider the relative entropy given in (19). The above result proves that the second term in (19) decays monotonically. The derivative of the first term in (19) is
(B4)
Since Γ is positive definite, Tr(WW^T) increases with lead time. Despite this increase, the sum of the first two terms in (19) decreases with lead time. This follows from the ordering theorem (Horn and Johnson 1985, chapter 7), which can be used to show that Γ − ΓΦ is positive definite because Φ and Γ are positive definite and Φ has eigenvalues less than unity. Similarly, the last term in (19) can be shown to decrease monotonically with lead time. Thus, the sum of all three terms in (19), and hence the relative entropy, decreases monotonically with lead time.
Finally, consider the mean square difference between two randomly chosen members of the forecast distribution. If ϵ is half the difference between two randomly chosen members of the forecast distribution (18), and 𝗠 is a positive definite matrix defining the norm, then by (23)
(B5)
where the braces denote an average over the forecast distribution. The last line in (B5) is the trace of the product of two positive definite matrices and, hence, is positive. This demonstrates that the mean square difference between two randomly chosen members of the forecast distribution increases monotonically with lead time, for any norm.
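The monotonic behavior established in this appendix is easily illustrated in the simplest case, a scalar Ornstein-Uhlenbeck process, for which the forecast mean and spread take the familiar forms μ_f(τ) = x_0 e^{−aτ} and σ_f²(τ) = σ_υ² (1 − e^{−2aτ}). The sketch below (with illustrative parameter values) tabulates predictive information, relative entropy, and the ensemble spread as functions of lead time and confirms their monotonicity; it is a schematic check, not the paper's calculation.

```python
import numpy as np

a = 0.5              # damping rate of the scalar Ornstein-Uhlenbeck process (illustrative)
sigma2_v = 1.0       # climatological (stationary) variance
x0 = 1.5             # a particular initial condition (illustrative)
tau = np.linspace(0.1, 10.0, 50)

# Forecast mean decays and forecast spread relaxes toward the climatological variance.
mu_f = x0 * np.exp(-a * tau)
sigma2_f = sigma2_v * (1.0 - np.exp(-2.0 * a * tau))

# Predictive information (equal to mutual information for Gaussian variables).
P = 0.5 * np.log(sigma2_v / sigma2_f)

# Relative entropy of the forecast started from x0.
R = 0.5 * (np.log(sigma2_v / sigma2_f) + sigma2_f / sigma2_v + mu_f**2 / sigma2_v - 1.0)

# Mean square of half the difference between two randomly drawn forecast members.
spread = 0.5 * sigma2_f

# Predictability measures decay, and the spread grows, monotonically with lead time.
print(np.all(np.diff(P) < 0), np.all(np.diff(R) < 0), np.all(np.diff(spread) > 0))
```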

Fig. 1.

Illustration of two forecast distributions, labeled F1 and F2, and a climatological distribution, in which relative entropy and predictive information give different measures. All distributions are Gaussian with the following mean (μ) and variance (σ2): climatology has parameters μ = 0, σ2 = 5; F1 has μ = 0, σ2 = 1.5; and F2 has μ = 5.4, σ2 = 1.5. Thus, F1 and F2 have exactly the same variance, and hence the same entropy, but differ in their means. Predictive information for F1 equals that of F2, whereas the relative entropy for F2 is greater than that of F1. Which is the more appropriate measure of predictability?


Table 1.

The probabilities of a discrete distribution p(x, y) with negative predictive information. The indices x and y take only the values 1 and 2. The top table gives the probabilities of the joint distribution p(x, y) for all possible values of x and y, together with the marginal distributions p(x) and p(y). The bottom table gives the conditional distribution computed from the definition p(x|y) = p(x, y)/p(y). The entropies of p(x) and p(x|2), given in the box below the tables, show that the entropy of the conditional distribution exceeds that of the unconditional distribution.
