1. Introduction
Recently, several papers have discussed the connection between information theory and predictability (Leung and North 1990; Schneider and Griffies 1999; Kleeman 2002). These papers present different definitions of predictability and different metrics for measuring it. The reasons for choosing different metrics, and the connections between the metrics, are not always explained in these papers. In addition, the literature on this subject has been confined almost exclusively to perfect model scenarios. Finally, even for a perfect model scenario, the fundamental measure of predictability is not universally agreed upon, as evidenced by other metrics proposed in the literature: for example, mean square error in various norms (Lorenz 1965), signal to noise ratio (Chervin and Schneider 1976), analysis of variance (Shukla 1981), potential predictive utility measured by Kuiper's statistical test (Anderson and Stern 1996), and difference in distributions measured by a Kolmogorov–Smirnov measure (Sardeshmukh et al. 2000).
The goal of this paper is to clarify the fundamental definition of predictability and its connection to information theory, and to point out connections between metrics based on information theory that apparently are not well known. This paper will concentrate on theory; applications of the above metrics can be found in the above-cited papers. In Part II of this paper (DelSole 2004, manuscript submitted to J. Atmos. Sci.), we discuss methods for dealing with imperfect forecasts, which were suggested by Schneider and Griffies (1999). Readers familiar with information theory and its connection to predictability may skip to sections 5 and 6, where the bulk of the new material is presented. A summary of this paper is given in the concluding section (section 7).
2. The definition of predictability
Predictability is the study of the extent to which events can be predicted. Perhaps the most widely used measure of predictability is the mean square error of a (perfect) forecast model. Typically, mean square error increases with lead time and asymptotically approaches a finite value, called the saturation value. The saturation value is comparable to the mean square difference between two randomly chosen fields from the system. Consequently, all predictability is said to be lost when the errors are comparable to the saturation value, since the forecast then offers no better a prediction than a randomly chosen field from the system. Despite its widespread use, mean square error turns out to be a limited measure of predictability. First, the most complete description of a forecast is its probability distribution. Mean square error is merely a moment of a joint distribution and hence conveys much less information than a distribution. Second, as Schneider and Griffies (1999) pointed out, mean square error depends on the basis set in which the data is represented. Unfortunately, no agreement exists on how to combine different variables to obtain a single measure of predictability—for example, how many degrees Celsius of temperature are equivalent to one meter per second of wind velocity. Third, the question of how to attribute predictability to specific structures in the initial condition has no natural answer in this framework. Finally, the question of how to quantify predictability uniquely with imperfect models is not addressed in this framework, since different models produce different characteristic errors.
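In symbols, the saturation level follows from a standard relation (stated here for concreteness; it is implicit rather than explicit in the discussion above): if X and X' are two independent draws from the climatological distribution, with common mean and covariance matrix \Sigma, then

    E\,\lVert X - X' \rVert^2 = 2\,\mathrm{tr}(\Sigma),

so the mean square error of a forecast saturates at roughly twice the total climatological variance.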
The fundamental framework for predictability was proposed by Lorenz (1963, 1965, 1969) and Epstein (1969). In this framework, the forecast model is considered perfect; that is, the governing equations are known exactly and can be solved exactly. However, the initial state of a system is not known exactly and therefore is represented most appropriately by a probability distribution function (pdf), which can be interpreted as a density of possible states in phase space. The distribution evolves according to a continuity equation for probability, which is derived in the same way as the mass continuity equation, under the principle that no ensemble members are created or destroyed (i.e., the total amount of probability always equals unity). This continuity equation reduces to Liouville's equation in the case of conservative dynamical systems. A generalized governing equation, called the Fokker–Planck equation, applies if the dynamical model is stochastic—that is, if the future state is not uniquely determined by the initial state, as would be the case if certain physical processes were represented by random processes.
Whether a state is predictable depends on two distributions. To clarify this point, consider the statement “event A is unpredictable.” This statement is not interpreted in the literal sense of saying that the distribution for A is completely unknown. For instance, in many cases a past record of observations is available from which a climatological distribution could be constructed and used to produce a forecast. Even though the climatological distribution constitutes a forecast, the event in question is considered to be unpredictable if the climatological distribution is the only basis for prediction. In order for an event to be predictable, a forecast should be “better” in some sense than a prediction based on the climatological distribution. Thus, the statement “an event is not predictable” is not an absolute statement; rather, it is a relative statement meaning that the forecast distribution does not differ significantly from the climatological distribution.
The above considerations imply that predictability is a property of the difference between two probability distributions. The two distributions in question are commonly referred to as the climatological and forecast distributions. In Bayesian terminology, the climatological distribution, which is not concerned with the chronological order of the states, would be called the prior distribution. The forecast distribution, which involves a lag between the times in which states occur and is available only after observations become available, would be called the posterior distribution. [Note that this terminology differs from that of Schneider and Griffies (1999), in which the prior and posterior distributions are distinguished according to whether a forecast is available.] Ironically, this terminology is exactly opposite from the standard terminology in data assimilation. Therefore, to avoid confusion, we avoid the use of “prior” and “posterior.”
To express the above quantities more precisely, let the state of the climate at time t be represented by an N-dimensional vector Xt. Capital letters will be used to denote a random variable and lower case letters to denote one of its realizations. The probability that Xt lies between xt and xt + dxt is denoted p(xt)dxt. We will call t the initial condition time, t + τ the verification time, and τ the lead time. To account for nonstationary signals, such as a seasonal or diurnal cycle, the climatological distribution is identified with the density at the verification time, p(xt+τ).
During times when the climate system is not observed, the distribution of states evolves in accordance with the continuity equation for probability. After the system is observed, the distribution changes discontinuously. For instance, a 20% chance of rain at time t discontinuously changes to 100% certainty when rain is observed (with certainty) at that time. Typical examples of observations include the state itself, boundary condition, and/or external forcing. Suppose the system is observed at times 1, 2, … , t with observed values β1, β2, … , βt. Let ot be the set of all observations up to time t: ot = (β1, β2, … , βt). Then the distribution of the state that embodies all the information contained in the observations ot is denoted p(xt|ot), which is the distribution of the climate state conditioned on ot. (We use the notation that the density function p(·) differs according to its argument.) Statistical procedures for constructing p(xt|ot) were developed by Wiener and Kalman and now have been generalized quite extensively (Jazwinski 1970). In meteorology and oceanography, the mean of p(xt|ot) is called the analysis, and the procedure by which it is obtained is typically associated with the Kalman filter. The analysis distribution at the future time t + τ, namely p(xt+τ|ot+τ), is called the verification distribution. We assume in this paper that p(xt|ot) and p(xt+τ|ot+τ) are known.
The question arises as to the appropriate time scale for defining the climatological distribution. For instance, the climatological distribution of the earth's climate would include ice ages, strictly speaking, but this distribution would be inappropriate for quantifying the predictability of tomorrow's weather. No unique climatological distribution appears to exist—distributions over different time scales all constitute different aspects of the total problem. A climatological distribution that describes the past few decades would be appropriate for weather predictability, while one that describes the past few million years would be appropriate for quantifying the predictability of long-term climate. In stationary Markov systems, the climatological distribution is chosen by convention to be the unconditional distribution of the state, which in most practical cases is also the time-asymptotic forecast distribution.
To summarize, the basis of all predictability theory is a forecast procedure and a set of observations. Given all observations up to time t, ot, we construct the best description of the state at time t, called the analysis distribution and denoted by p(xt|ot). The most complete description of the state at time t + τ based on observations ot is the forecast distribution p(xt+τ|ot). To assess whether the system is predictable, we compare the forecast distribution p(xt+τ|ot) to the climatological distribution of the climate system, namely p(xt+τ). The degree of predictability is given by some measure of the difference between p(xt+τ|ot) and p(xt+τ).
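In terms of the transition density and the analysis distribution, these two distributions can be written out explicitly. The following is a sketch of the standard relations (presumably the expressions referred to below as (1) and (2), which are not reproduced here); the forecast relation assumes that the future state depends on past observations only through the current state:

    climatological:  p(x_{t+\tau}) = \int p(x_{t+\tau} \mid o_t)\, p(o_t)\, do_t ,
    forecast:        p(x_{t+\tau} \mid o_t) = \int p(x_{t+\tau} \mid x_t)\, p(x_t \mid o_t)\, dx_t .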
The above framework allows different predictability scenarios to be described in a unified manner. For instance, a perfect model scenario identifies the transition probability p(xt+τ|xt) with a delta function and the analysis distribution p(xt|ot) with nonzero variance, in which case the forecast distribution p(xt+τ|ot) from (2) describes an ensemble of forecasts initialized at states randomly drawn from the analysis distribution. A perfect initial condition scenario identifies the transition probability p(xt+τ|xt) with nonzero variance and the analysis distribution p(xt|ot) with a delta function, in which case the forecast distribution from (2) describes the evolution of a stochastic model. If the observations are perfect and the model is deterministic, then the forecast distribution from (2) remains a delta function for all time whereas the climatological from (1) does not, indicating predictability for all time. This framework implies that predictability is a combined property of the dynamical system and the observations.
As will become evident, there is considerable disagreement as to what measure of the difference between climatological and forecast distributions should be used to quantify predictability. Murphy (1993) describes several distinct measures of goodness in forecasting. However, there is one point on which everyone agrees: an event is unpredictable if the forecast distribution is identical to the climatological distribution. According to this definition, complete loss of predictability occurs when p(xt+τ|ot) = p(xt+τ), which is precisely equivalent to the statement that the future state Xt+τ is statistically independent of the observations ot. This condition is intuitively consistent with the idea that a forecast is predictable only if it is statistically dependent on the initial condition on which the forecast was based. Equivalently, we may say that a necessary condition for predictability is for the forecast and climatological distributions to differ. The question of how to quantify predictability based on this difference is discussed in the next section.
The above framework does not exhaust all types of predictability. As Lorenz (1975) noted, there are at least two kinds of predictability. The first kind refers to the change in statistics over a fixed time span, as the beginning and end of the time span advance. The second kind refers to the change in statistics due to changes in climate forcing. (The reader should be warned that Lorenz's definitions often are misquoted in the literature.) If the state Xt is interpreted as a statistic over a fixed time span (whose end points could be infinitesimally close, in which case Xt is the instantaneous state), then the framework we have just discussed encompasses predictability of the first kind.
To clarify predictability of the second kind, consider a prediction for the change in climate statistics due to a doubling of atmospheric greenhouse gas concentration. If the original concentration is denoted θ0 and the increased level is θ1, then the climatology is identified with the distribution conditioned on the current concentration, p(xt+τ|θ0), while the forecast is the distribution conditioned on the higher concentration level p(xt+τ|θ1). Importantly, the two distributions p(xt+τ|θ0) and p(xt+τ|θ1) cannot be interpreted as conditional and marginal distributions as defined above because the forecast distribution does not satisfy the chain rule (2) and does not involve a chronological sequence of climate states. Rather, the forecast and climatological distribution are interpreted as subensembles of a larger climate population. Moreover, the climatological distribution defined in (1) involves a weighting by the probability of θ1, but this weighting is irrelevant to those who are concerned about the actual global warming. These considerations suggest that all types of predictability can be defined in terms of the difference between two distributions, but the class of distributions may differ according to the type of predictability.
It should be noted that the above framework assumes that the forecast distribution of the future state is known precisely, which of course is not a realistic situation. In light of this fact, numerous investigators have suggested that predictability should be defined by some measure of the predictability of a forecast model. The major problem with this definition is that it depends on the arbitrary forecast model and, hence, gives rise to multiple degrees of predictability for the same event—one for each distinct model. Furthermore, the predictability of an imperfect forecast model may bear no relation to the predictability of the climate system. For instance, some models may indicate infinite predictable times whereas others indicate short predictable times. The question of how to quantify predictability with imperfect models has received relatively little attention in the literature. This issue will be discussed in Part II of this paper.
3. Entropy, relative entropy, and mutual information
While we have defined predictability by the difference between the climatological and forecast distributions, how are we to quantify this difference? Before addressing this question, it should be recognized that any ranking of the difference between the climatological and forecast distributions will depend on geographical, economic, or societal issues that differ from user to user. The goal of this section is to show that defining predictability based on principles of information theory leads to an intuitive methodology.
Mean square error and the second moments of the forecast distribution are by far the most common measures of predictability. These metrics measure the dispersion, or “spread,” of the forecast distribution. Forecast distributions with larger dispersion indicate less predictability because they are associated with greater uncertainty. This reasoning suggests that the degree of uncertainty provides a more fundamental basis for quantifying predictability.
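For reference, the quantities discussed in the remainder of this section can be written in their standard forms (the numbered equations cited below are not reproduced here; the expressions that follow are the usual definitions, with notation matching the text and with the base of the logarithm setting the units, bits for base 2 and nats for base e):

    H(X_{t+\tau}) = -\int p(x_{t+\tau}) \log p(x_{t+\tau}) \, dx_{t+\tau}        (entropy of the climatological distribution)
    P_o = H(X_{t+\tau}) - H(X_{t+\tau} \mid O_t = o_t)                           (predictive information of a single forecast)
    R_o = \int p(x_{t+\tau} \mid o_t)\, \log \frac{p(x_{t+\tau} \mid o_t)}{p(x_{t+\tau})} \, dx_{t+\tau}        (relative entropy of a single forecast)
    I(X_{t+\tau}; O_t) = \iint p(x_{t+\tau}, o_t)\, \log \frac{p(x_{t+\tau}, o_t)}{p(x_{t+\tau})\, p(o_t)} \, dx_{t+\tau}\, do_t        (mutual information)

Here H(X_{t+\tau} \mid O_t = o_t) is the entropy of the forecast distribution p(x_{t+\tau} \mid o_t).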
The first term on the right of (5) measures the uncertainty when no (recent) observation of the system is available, while the second term measures the uncertainty after the observation and associated forecast become available. The difference in uncertainties quantifies the amount of information about the event gained from observation and forecast. Schneider and Griffies (1999) review the properties of predictive information, to which we refer the reader for full details. Here we briefly state two properties. First, predictive information is invariant with respect to a nonsingular, linear coordinate transformation. It follows that predictive information is independent of the basis set used to represent the data. Hence, all climate variables, regardless of their units or natural variances, can be accounted for in a single measure of predictability. Second, for jointly normally distributed variables, predictive information can be decomposed into a set of uncorrelated components ordered by their entropy difference, such that the leading component is the most predictable, the second is the most predictable component uncorrelated with the first, and so on. This decomposition is explained in section 5.
4. Discussion
The question arises as to which of the above metrics should be used to measure predictability. As noted earlier, no universal principle exists to settle this question. The following comments may help to clarify this issue. First, predictive information and relative entropy measure the predictability of a single forecast distribution, whereas mutual information measures the mean predictability averaged over all initial conditions. Second, the average of the first two quantities over all initial conditions equals the third. Thus, if we are interested only in the mean predictability, all three metrics are equivalent and no distinction exists.
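In symbols, these statements amount to the standard identities

    \int p(o_t)\, P_o \, do_t = H(X_{t+\tau}) - H(X_{t+\tau} \mid O_t) = I(X_{t+\tau}; O_t) = \int p(o_t)\, R_o \, do_t ,

that is, predictive information and relative entropy have the same average over all observations (equivalently, over all initial conditions), and that average is the mutual information.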
Any measure of predictability should vanish when the forecast equals the climatological. Relative entropy and predictive information, which are the relevant metrics for single forecasts, satisfy this condition. However, a key difference between predictive information and relative entropy is that relative entropy (8) vanishes if and only if the two distributions are equal, whereas this is not the case for predictive information (5). Theoretically, predictive information can vanish even when the climatological differs from the forecast. Whether this represents a drawback with predictive information depends on a more detailed definition of predictability, as discussed below.
An interesting question is whether a forecast with negative predictive information should be considered predictable. Recall that negative predictive information implies that the forecast has more entropy, and hence more uncertainty, than the climatology. It is not absurd to characterize an event as unpredictable if the forecast has more uncertainty than the climatological distribution. This reasoning suggests that a variable should be defined as unpredictable if the predictive information vanishes or is negative. On the other hand, relative entropy is positive when predictive information is negative. In fact, measuring predictability by relative entropy would imply that any difference between the climatological and forecast distribution should be characterized as predictable. Whether this represents a drawback for relative entropy depends on a more refined definition of predictability.
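The Gaussian case makes this distinction concrete. For a forecast distribution with mean \mu_f and covariance \Sigma_f and a climatological distribution with mean \mu_c and covariance \Sigma_c, the standard expressions (a sketch, with N the dimension of the state) are

    P_o = \tfrac{1}{2} \log \frac{\det \Sigma_c}{\det \Sigma_f} ,
    R_o = \tfrac{1}{2} \left[ \log \frac{\det \Sigma_c}{\det \Sigma_f} + \mathrm{tr}\left(\Sigma_c^{-1} \Sigma_f\right) - N + (\mu_f - \mu_c)^T \Sigma_c^{-1} (\mu_f - \mu_c) \right] .

Predictive information depends only on the covariances, so it vanishes whenever the two determinants are equal, even if the means differ; relative entropy contains an additional signal term in the mean difference and vanishes only when the two distributions coincide.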
Kleeman (2002) states that the relative entropy of a stationary, Markov process decays monotonically, whereas absolute entropy does not, thereby implying that entropy does not conform to our intuition that predictability in such systems monotonically degrades with lead time. This point deserves comment. First, Cover and Thomas (1991, chapter 2) show that the conditional entropy H(Xt+τ|Xt) of a stationary Markov process increases monotonically with τ. Recall that for perfect observations H(Xt+τ|Ot) = H(Xt+τ|Xt), and that the average predictive information is H(Xt+τ) − H(Xt+τ|Ot). It follows from this and the above theorem that the predictive information of a stationary, Markov process decreases monotonically on average. Second, if the distributions are normal, then the entropy of a stationary, Markov process can be shown to always increase monotonically with lead time (again, under perfectly known initial conditions). This fact is proven in appendix B. Third, Cover and Thomas (1991, chapter 2) prove that relative entropy between a stationary, Markov process and any stationary distribution decreases monotonically with time. Thus, relative entropy decreases no matter how the climatological distribution is chosen. This curious behavior is related to the fact that relative entropy does not satisfy the triangle inequality, so a decrease in relative entropy does not necessarily imply that the forecast and climatological distributions are getting “closer.”
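For completeness, the discrete-time version of the first argument is one line long (a sketch of the standard proof): for a stationary Markov process,

    H(X_{t+\tau} \mid X_t) \ge H(X_{t+\tau} \mid X_t, X_{t+1}) = H(X_{t+\tau} \mid X_{t+1}) = H(X_{t+\tau-1} \mid X_t) ,

where the inequality holds because conditioning cannot increase entropy, the first equality uses the Markov property, and the second uses stationarity. Hence H(X_{t+\tau} \mid X_t) is nondecreasing in \tau, and the average predictive information H(X_{t+\tau}) - H(X_{t+\tau} \mid X_t) is nonincreasing.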
T. Schneider (2003, personal communication) has suggested that interpreting the above metrics in terms of coding theory may help to clarify their differences. For discrete distributions, entropy can be interpreted as the average number of bits (i.e., the average number of “yes–no” questions) needed to describe a variable from a distribution (Cover and Thomas 1991, p. 6). Hence, predictive information is the reduction (when positive) in the average number of bits needed to describe the verification when a code optimized for the forecast distribution replaces a code optimized for the climatological distribution. Cover and Thomas (1991, p. 18) effectively state the following: to describe a realization from the forecast distribution, a code optimized for the forecast distribution would have average length H(Xt+τ|Ot = ot), whereas a code optimized for the climatological distribution would have (longer) average length H(Xt+τ|Ot = ot) + Ro. This implies that relative entropy can be interpreted as a measure of the inefficiency of using the climatological distribution to describe the forecast distribution. This is not a quantity one would naturally choose to measure predictability. In contrast, predictive information can be interpreted as the decrease in complexity of the forecast distribution relative to the climatological. This may have some intuitive appeal as a measure of predictability.
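In symbols, the statement from Cover and Thomas is the standard cross-entropy identity (up to the rounding overhead of less than one bit incurred by integer code lengths): if realizations are drawn from a distribution p but encoded with a code optimized for q, the average description length is

    E_p[\, -\log q(X) \,] = H(p) + D(p \,\|\, q) ,

which, with p the forecast distribution and q the climatological distribution, exceeds the optimal length H(p) by exactly the relative entropy R_o.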
Other equally valid interpretations can be given that do not always favor the same metric. For instance, one approach to measuring the difference between the climatological and forecast distributions is to ask whether we could determine from which of the two distributions a given sample was drawn. If we cannot, then perhaps the difference in distributions is not “important.” It is well known that the optimum test, in the sense of having desirable properties with respect to type I and type II errors, is given by the Neyman–Pearson likelihood ratio test. Kullback (1959) shows that this test is formally equivalent to comparing the relative entropy between the empirical distribution and the possible distributions. Thus, relative entropy can be interpreted as a measure of the difficulty of discriminating between forecast and climatological distributions, a natural prerequisite for measuring predictability. Alternatively, recall that Pearson's chi-square test is a classical method of deciding whether two distributions differ. It can be shown that the chi-square statistic is formally equivalent to relative entropy in the limit of small differences between the two distributions (Kullback 1959, chapter 5). Another basis for quantifying the difference between two distributions is by the amount of money a gambler would make given a forecast, provided a “bookie” assigns fair odds based on the climatology. In effect, Cover and Thomas (1991, section 6.1) show that the average doubling rate of the gambler's winnings equals the relative entropy between the forecast and climatological distributions. This situation is not far removed from those associated with weather derivatives in real financial markets.
Bernardo and Smith (2000) give an elegant summary of Bayesian theory in which relative entropy and mutual information arise naturally in the problem of choosing actions in situations with uncertainty. This framework requires 1) a set of axioms for making decisions and assigning degrees of belief based on data and 2) the concept of utility. The logarithmic utility function turns out to be the only utility function that is smooth, proper, and local. For such utility functions, the expected utility of data is precisely the relative entropy between the forecast and climatological distributions, which are distinguished according to the availability of data. In the context of predictability, relative entropy (8) can be interpreted as the expected utility of the observed data. Mutual information (12), which is the average relative entropy over the distribution of possible data outcomes, can be interpreted as the expected information from an experiment. Thus, relative entropy and mutual information are natural measures of the utility provided by an experiment when an individual's preferences are described by a smooth, proper, local utility function. Whether the aforementioned properties of the utility function are fundamental remains unclear, as the many references to alternative measures in Bernardo and Smith illustrate.
Relative entropy and mutual information are invariant with respect to nonlinear, invertible transformations (Bernardo and Smith 2000, p. 158; Goldman 1953, p. 153), whereas predictive information is invariant only to linear transformations. Bernardo and Smith (2000) suggest that this invariance is essential for any measure of the utility increase from observed data. For instance, in the context of predictability, one could argue that the degree of predictability should not depend on whether the data is represented in pressure or sigma coordinates, which are related by a nonlinear transformation. Since the average predictive information equals mutual information, it follows that while predictive information is not invariant to nonlinear transformations, its average is. Whether the invariance property in its full nonlinear sense can ever be taken advantage of in realistic situations with finite sample sizes remains to be seen.
An issue that may raise concern is the fact that, for normal distributions, all three metrics depend on the log of the determinant of a covariance matrix. Consequently, small sampling errors in the smallest eigenvalues of the covariance matrices produce large errors in the metrics. This problem might discourage the use of such measures in favor of other measures, perhaps those based on the trace of the error covariance matrix, which are not as sensitive to sampling errors. While such sensitivity affects the statistical estimation procedure, it need not be interpreted as a problem in the fundamental concepts. Indeed, the advantage of measures based on determinants can be seen in the following scenario. Consider a system that is unpredictable everywhere except at a single location. A predictability measure based on the mean square forecast error would be dominated by the environmental noise, in which case the high degree of predictability at the particular location might go undiscovered. In contrast, a predictability measure based on determinants would reveal a great deal of predictability, even though the highly predictable component is due to a single location. Moreover, as discussed in section 6, the above predictability measures can be decomposed into a linear combination of uncorrelated components. This decomposition allows an investigator to isolate those components that contribute significantly to the total predictability.
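A small numerical illustration of this scenario, using the Gaussian expressions given earlier (the numbers are constructed for illustration and are not taken from the text): suppose the climatological covariance is the identity in N = 100 dimensions and the forecast covariance equals the climatological covariance in 99 directions but is 0.01 in the remaining direction. A trace-based (mean square error) measure gives tr(\Sigma_f)/tr(\Sigma_c) = 99.01/100 \approx 0.99, so the system appears essentially saturated, whereas the determinant-based measure gives P_o = \tfrac{1}{2}\log(1/0.01) \approx 2.3 nats, revealing the single highly predictable component.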
In the case of predictability of the second kind, we are generally interested in any difference between two distributions, even if there is no change in uncertainty. Relative entropy appears to provide an attractive measure for predictability of the second kind.
5. Predictability of linear stochastic models
To evaluate the climatological and forecast distributions (1) and (2), the distribution of the observations p(ot) is needed. While all distributions can be evaluated analytically if the observation distribution p(ot) is Gaussian, this complete solution tends to obscure the basic conclusions related to predictability. Thus, for simplicity, we assume that the system is perfectly observed and, hence, the forecast distribution is simply p(xt+τ|xt) given in (18).
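Since the expression referred to as (18) is not reproduced above, the following sketch of the standard result may be helpful. For a stable linear stochastic model driven by Gaussian white noise,

    d\mathbf{x} = \mathbf{A}\,\mathbf{x}\, dt + \mathbf{S}\, d\mathbf{W} ,

with stationary covariance \Sigma_\infty satisfying \mathbf{A}\Sigma_\infty + \Sigma_\infty \mathbf{A}^T + \mathbf{S}\mathbf{S}^T = 0, the forecast distribution conditioned on a perfectly known initial state x_t is Gaussian with

    mean:        \mathbf{G}(\tau)\, x_t ,   where \mathbf{G}(\tau) = e^{\mathbf{A}\tau} ,
    covariance:  \Sigma_\tau = \Sigma_\infty - \mathbf{G}(\tau)\, \Sigma_\infty\, \mathbf{G}(\tau)^T ,

while the climatological distribution is Gaussian with zero mean and covariance \Sigma_\infty. The initial condition enters only through the forecast mean; the forecast covariance \Sigma_\tau is independent of x_t, a fact used in the discussion below.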
Tippett and Chang (2003) show that, for all of the above measures, the least predictable system, out of all systems with the same eigenvalues and vanishing initial error, is that in which the prewhitened dynamical operator 𝗪 is normal. Since the prewhitened operator is unique up to an orthogonal transformation, the normality of 𝗪 is not altered by coordinate changes that preserve the prewhitening characteristic.
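Here, "prewhitening" presumably refers to transforming to coordinates in which the climatological covariance is the identity, z = \Sigma_\infty^{-1/2}\, x, so that the dynamical operator takes the form \mathbf{W} = \Sigma_\infty^{-1/2}\, \mathbf{A}\, \Sigma_\infty^{1/2} (with an analogous expression for the propagator), and "normal" is meant in the matrix sense, \mathbf{W}\mathbf{W}^T = \mathbf{W}^T\mathbf{W}. Because the square root \Sigma_\infty^{1/2} is defined only up to an orthogonal factor, \mathbf{W} is determined only up to an orthogonal transformation, which leaves its normality unchanged; this is the sense of the uniqueness statement above.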
Interestingly, the lead-time dependence of all the above measures is controlled by the operator 𝗪. Appendix B shows that all of the above measures are monotonic functions of lead time. [Note that the proof in appendix B assumes that the initial state is known with certainty and, hence, has zero entropy. Without this assumption, predictability could increase with time, as pointed out by Cover and Thomas (1991, p. 35). For instance, the arbitrary initial condition could have larger entropy than the asymptotic stationary distribution.] Hence all of the above metrics conform to the intuition that predictability degrades with lead time in Gaussian, stationary, Markov systems.
By far the most significant difference among the above measures in the context of Markov models is that predictive information, mutual information, and mean square error are independent of the initial condition, whereas the relative entropy is not. A revealing thought experiment is to consider the case in which the initial condition projects on the leading singular vector of a suitable propagator. In this case, the ensemble mean grows to a large value, whereas the forecast spread grows at a rate independent of the initial condition. Penland and Sardeshmukh (1995) and Moore and Kleeman (1999) have suggested that ENSO predictability can be understood in these terms. However, singular vectors are relevant to this type of predictability only if one adopts a measure of predictability that depends on the signal to noise ratio of a single forecast. Of the metrics discussed in this paper, only relative entropy satisfies this condition. In contrast, predictive information and mean square error are independent of the initial condition because they measure uncertainty, which is independent of the initial condition in Gaussian systems.
6. Predictable components
It is often insightful to decompose predictability into components that maximize it. This approach is analogous to the use of principal components to understand how different spatial structures contribute to total variance. This decomposition is especially useful if predictability is dominated by a few coherent patterns. In such cases, the decomposition allows an investigator to focus only on the few structures that are predictable and to ignore the vast majority of structures that are unpredictable. This decomposition also provides the basis for attributing predictability to specific structures in the initial condition, boundary condition, and/or external forcing.
In this section, we show (apparently for the first time) that optimization of predictive information is identical to optimization of relative entropy when the distributions are Gaussian and the means of the climatological and forecast are identical. The resulting structures will be called predictable components. Furthermore, we show that optimization of mutual information is equivalent to canonical correlation analysis. Finally, we discuss a surprising connection between these optimization procedures.
It should be recognized that defining predictable components according to error variance is problematic. First, absolute error variance is not a measure of predictability. Rather, error variance relative to its saturation value is the appropriate measure of predictability. Second, total error variance can be dominated by structures with large natural variance and can have little to do with predictability. Third, a background random noise field can dominate error variance and thereby hide highly predictable components.
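As a computational sketch of the decomposition described above (this is not code from the original paper; it assumes only that estimates of the forecast and climatological covariance matrices are available, and it uses the generalized eigenvalue formulation that is standard for Gaussian predictable component analysis):

    import numpy as np
    from scipy.linalg import eigh

    def predictable_components(sigma_forecast, sigma_clim):
        """Predictable components for Gaussian distributions (illustrative sketch).

        Solves the generalized eigenvalue problem
            sigma_forecast q = lam * sigma_clim q.
        Small eigenvalues correspond to directions whose forecast variance is
        small relative to their climatological variance, i.e., to highly
        predictable components.
        """
        lam, q = eigh(sigma_forecast, sigma_clim)     # eigenvalues in ascending order
        predictive_info = 0.5 * np.log(1.0 / lam)     # per-component predictive information (nats)
        return lam, q, predictive_info

The eigenvectors returned by eigh are normalized so that q^T sigma_clim q is the identity; the first column is then the most predictable component, and the per-component values of predictive information sum to the total predictive information because the product of the generalized eigenvalues equals det(\Sigma_f)/det(\Sigma_c).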
7. Summary and discussion
This paper reviewed a framework for understanding predictability based on information theory, assuming that the governing equations of the system were known exactly. In this framework, predictability depends on three quantities: 1) a set of observations of the system, 2) a climatological distribution of the system that is known in the absence of recent observations, and 3) a forecast distribution that is known after all observations become available and a forecast based on these observations is performed. An event is said to be unpredictable if the climatological and forecast distributions are identical in every way. Hence, the degree of predictability depends on the degree to which the climatological and forecast distributions differ.
Any ranking of predictability by some measure of the difference between climatological and forecast distributions will depend on geographical, economic, and/or societal issues that differ from user to user. Information theory provides an intuitive and rigorous framework for quantifying predictability because it measures the degree of uncertainty in a random variable. However, there is no unique measure within this framework. At least three measures of predictability based on information theory have been proposed: mutual information between the forecast and initial condition (Leung and North 1990), predictive information between the forecast and climatological distributions (Schneider and Griffies 1999), and relative entropy between the forecast and climatological distributions (Kleeman 2002). Significant properties of these metrics are as follows:
All three metrics are invariant with respect to nonsingular linear transformations. Hence, the metrics do not depend on the basis in which the data is represented, and variables with different units and natural variances can be analyzed as a single state variable without introducing an arbitrary normalization since such normalization cannot affect the final results. This property is not shared by such metrics as mean square error. Relative entropy and mutual information are also invariant with respect to nonlinear, invertible transformations.
Relative entropy and predictive information have the same average over all initial conditions, and this average is precisely equal to mutual information. Thus, relative entropy and predictive information measure the predictability of a single forecast distribution, while mutual information measures their common average. This fact, which is not stated in the above-cited papers on predictability, clearly reveals a similarity between the measures that may not be apparent from their formal definitions.
Predictable component analysis, which identifies highly predictable components in a Gaussian system, not only optimizes predictive information, but also optimizes relative entropy when the means of the climatological and forecast distribution are identical.
Optimization of the average predictability in Gaussian, Markov systems is equivalent to canonical correlation analysis of the forecast and initial condition. Thus, CCA, which is an ad hoc method in many predictability studies, arises naturally in information theory.
All three metrics, as well as mean square error, are monotonic functions of lead time for normally distributed, stationary, Markov systems, consistent with the classical notion (Lorenz 1969) that predictability degrades with time. Predictive information is a monotonic function of lead time only in an average sense in non-Gaussian systems.
All three metrics have closed-form expressions in finite dimensional, stationary, Gaussian, Markov systems. These expressions completely describe the predictability of all linear stochastic models driven by Gaussian noise.
An unresolved issue in the above framework is whether predictability should be measured by a “distance” metric, like relative entropy, or by an entropy difference, like predictive information. For normal distributions, the key distinction between these metrics is that one set depends on the signal to noise ratio of a single forecast distribution, whereas the other does not. By signal to noise ratio, we mean that the signal is identified with the mean of the forecast distribution and the noise is identified with the variance of the climatological distribution (though similar conclusions hold if we identify the noise with the forecast variance). A contrived situation that highlights the differences among the metrics is illustrated in Fig. 1. In this figure, two forecast distributions are shown, labeled F1 and F2, that have different means but exactly the same entropy. The climatological distribution, shown as the dashed curve, has more entropy than either forecast, but has the same mean as F1. Since the forecasts F1 and F2 have exactly the same entropy, they have exactly the same predictive information. In contrast, the relative entropy is greater for F2 than that for F1 owing to a larger signal to noise ratio. The question arises as to which of these metrics conforms with our intuition about predictability. We can immediately reject the idea that signal to noise ratio alone is sufficient to measure predictability: in the limit of a delta function, F1 would have zero signal to noise ratio but complete certainty; it would be absurd to call a variable with this distribution unpredictable. The seeming importance of signal to noise ratio may arise from the fact that it often appears in context-dependent utility functions, which quantify the cost of a random event with respect to an individual. But the mere fact that signal to noise ratio plays an important role in utility functions does not necessarily imply that it is fundamental to predictability, just as the meaning of a message is irrelevant to the engineering problem of communication. Also, in a perfect model scenario, the signal is perfectly known (with an infinite ensemble) and does not constitute a source of uncertainty. There seems to be no inconsistency in restricting predictability theory to metrics with the signal to noise ratio, or to metrics without the signal to noise ratio, especially since both can have the same average.
In principle, information theory provides an elegant and powerful framework for quantifying predictability and addressing a host of questions that otherwise would remain unanswered. In practice, however, there are significant challenges with actually applying it. First, the perfect model is not available. In practice, this problem is dealt with by using forecast models to simulate the forecast distribution. Unfortunately, errors in all forecast models give rise to differences between the forecast and climatological distributions that, if not accounted for, would lead to false indications of predictability. The question of how to extend the above framework to imperfect forecasts is discussed in Part II. The second problem is that the required probability distributions are not known and must be estimated from data. Strategies for estimating the necessary distributions from finite sample sizes will be discussed in future work.
Acknowledgments
I am very much indebted to J. Shukla, who provided essential criticism and encouragement during the course of this work, and to Tapio Schneider, with whom I have had many stimulating correspondences and discussions about this work. I also thank Michael Tippett, David Nolan, and Ben Kirtman for discussions on this topic, and William Merryfield and the reviewers for many helpful comments. This research was supported by the NSF (ATM9814295), NOAA (NA96-GP0056), and NASA (NAG5-8202).
REFERENCES
Anderson, J. L., and W. F. Stern, 1996: Evaluating the potential predictive utility of ensemble forecasts. J. Climate, 9, 260–269.
Barnett, T. P., and R. Preisendorfer, 1987: Origins and levels of monthly and seasonal forecast skill for United States surface air temperatures determined by canonical correlation analysis. Mon. Wea. Rev., 115, 1825–1850.
Bernardo, J. M., and A. F. M. Smith, 2000: Bayesian Theory. Wiley, 586 pp.
Chervin, R. M., and S. H. Schneider, 1976: On determining the statistical significance of climate experiments with general circulation models. J. Atmos. Sci., 33, 405–412.
Cover, T. M., and J. A. Thomas, 1991: Elements of Information Theory. Wiley, 576 pp.
DelSole, T., 2004: Stochastic models of quasigeostrophic turbulence. Surv. Geophys., 25, 107–149.
DelSole, T., and P. Chang, 2003: Predictable component analysis, canonical correlation analysis, and autoregressive models. J. Atmos. Sci., 60, 409–416.
Epstein, E. S., 1969: Stochastic dynamic predictions. Tellus, 21, 739–759.
Gardiner, C. W., 1990: Handbook of Stochastic Methods. 2d ed. Springer-Verlag, 442 pp.
Goldman, S., 1953: Information Theory. Prentice Hall, 385 pp.
Hasselmann, K., 1993: Optimal fingerprints for the detection of time-dependent climate change. J. Climate, 6, 1957–1971.
Horn, R. A., and C. R. Johnson, 1985: Matrix Analysis. Cambridge University Press, 561 pp.
Jazwinski, A. H., 1970: Stochastic Processes and Filtering Theory. Academic Press, 376 pp.
Johnson, R. A., and D. W. Wichern, 1982: Applied Multivariate Statistical Analysis. Prentice Hall, 594 pp.
Kleeman, R., 2002: Measuring dynamical prediction utility using relative entropy. J. Atmos. Sci., 59, 2057–2072.
Kullback, S., 1959: Information Theory and Statistics. Wiley and Sons, 399 pp. [Republished by Dover, 1968.]
Leung, L-Y., and G. R. North, 1990: Information theory and climate prediction. J. Climate, 3, 5–14.
Lorenz, E. N., 1963: Deterministic nonperiodic flow. J. Atmos. Sci., 20, 130–141.
Lorenz, E. N., 1965: A study of the predictability of a 28-variable atmospheric model. Tellus, 17, 321–333.
Lorenz, E. N., 1969: The predictability of a flow which possesses many scales of motion. Tellus, 21, 289–307.
Lorenz, E. N., 1975: Climatic predictability. The Physical Basis of Climate and Climate Modelling, B. Bolin et al., Eds., GARP Publication Series, Vol. 16, World Meteorological Organization, 132–136.
Magnus, J. R., and H. Neudecker, 2001: Matrix Differential Calculus with Applications in Statistics and Econometrics. Wiley and Sons, 395 pp.
Majda, A., R. Kleeman, and D. Cai, 2002: A framework for predictability through relative entropy. Methods Appl. Anal., 9, 425–444.
Moore, A. M., and R. Kleeman, 1999: Stochastic forcing of ENSO by the intraseasonal oscillation. J. Climate, 12, 1199–1220.
Murphy, A. H., 1993: What is a good forecast? An essay on the nature of goodness in weather forecasting. Wea. Forecasting, 8, 281–293.
Noble, B., and J. W. Daniel, 1988: Applied Linear Algebra. 3d ed. Prentice Hall, 521 pp.
Penland, C., and P. D. Sardeshmukh, 1995: The optimal growth of tropical sea surface temperature anomalies. J. Climate, 8, 1999–2024.
Reza, F. M., 1961: An Introduction to Information Theory. McGraw-Hill, 496 pp.
Sardeshmukh, P. D., G. P. Compo, and C. Penland, 2000: Changes of probability associated with El Niño. J. Climate, 13, 4268–4286.
Schneider, T., and S. M. Griffies, 1999: A conceptual framework for predictability studies. J. Climate, 12, 3133–3155.
Shannon, C. E., 1948: A mathematical theory of communication. Bell Syst. Tech. J., 27, 379–423, 623–656.
Shukla, J., 1981: Dynamical predictability of monthly means. J. Atmos. Sci., 38, 2547–2572.
Tippett, M. K., and P. Chang, 2003: Some theoretical considerations on predictability of linear stochastic dynamics. Tellus, 55A, 148–157.
Wang, M. C., and G. E. Uhlenbeck, 1945: On the theory of the Brownian motion II. Rev. Mod. Phys., 17, 323–342.
APPENDIX A
Expressions for Gaussian Variables
The mutual information for normal distributions is given in Reza (1961, p. 296). It can be derived from the identity I(V, O) = H(V) − H(V|O). However, if the distribution is normal, then H(V|O = o) is independent of o [see (A9)], in which case H(V|O) = H(V|O = o). It follows that the mutual information I(V, O) is identically equal to predictive information Po [Eq. (A10)] when the variables are Gaussian.
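A one-line justification of the independence claimed here (standard Gaussian algebra, stated in place of the expressions (A9) and (A10), which are not reproduced): if V and O are jointly Gaussian, the conditional covariance is the Schur complement

    \Sigma_{V|O} = \Sigma_{VV} - \Sigma_{VO}\, \Sigma_{OO}^{-1}\, \Sigma_{OV} ,

which does not depend on the realization o. Hence H(V \mid O = o) = \tfrac{1}{2} \log\left[ (2\pi e)^N \det \Sigma_{V|O} \right] is the same for every o, and H(V \mid O) = H(V \mid O = o).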
APPENDIX B
The Variation of Predictability
Cover and Thomas (1991, chapter 2) show that mutual information and relative entropy decay monotonically for discrete, stationary, Markov chains (provided the initial state is known with certainty). This result is extended here to show that, for continuous normal distributions of stationary, Markov processes, relative entropy, predictive information, mutual information, and mean-square error are monotonic functions of lead time.
Fig. 1. Illustration of two forecast distributions, labeled F1 and F2, and a climatological distribution, for which relative entropy and predictive information give different measures. All distributions are Gaussian with the following mean (μ) and variance (σ2): the climatology has μ = 0, σ2 = 5; F1 has μ = 0, σ2 = 1.5; and F2 has μ = 5.4, σ2 = 1.5. Thus, F1 and F2 have exactly the same variance, and hence the same entropy, but differ in their means. Predictive information for F1 equals that of F2, whereas the relative entropy for F2 is greater than that of F1. Which is the more appropriate measure of predictability?
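A quick numerical check of this comparison, using the standard univariate Gaussian formulas and the parameters listed in the caption (an illustrative sketch, not code from the original paper):

    import numpy as np

    def predictive_information(var_clim, var_fcst):
        # P = 0.5 * ln(sigma_clim^2 / sigma_fcst^2); independent of the means.
        return 0.5 * np.log(var_clim / var_fcst)

    def relative_entropy(mean_clim, var_clim, mean_fcst, var_fcst):
        # Kullback-Leibler divergence of the forecast from the climatology.
        return 0.5 * (np.log(var_clim / var_fcst) + var_fcst / var_clim
                      + (mean_fcst - mean_clim) ** 2 / var_clim - 1.0)

    # Climatology: mean 0, variance 5.  F1: mean 0, variance 1.5.  F2: mean 5.4, variance 1.5.
    print(predictive_information(5.0, 1.5))       # ~0.60 nats, identical for F1 and F2
    print(relative_entropy(0.0, 5.0, 0.0, 1.5))   # F1: ~0.25 nats
    print(relative_entropy(0.0, 5.0, 5.4, 1.5))   # F2: ~3.17 nats

The two forecasts are indistinguishable by predictive information but differ by more than an order of magnitude in relative entropy, the difference coming entirely from the signal (mean) term.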
Table caption: The probabilities of a discrete distribution p(x, y) with negative predictive information. The indices x and y take the values 1 and 2 only. The top table gives the probabilities of the joint distribution p(x, y) for all possible values of x and y, and those for the marginal distributions p(x) and p(y). The bottom table gives the conditional distribution computed from the definition p(x|y) = p(x, y)/p(y). The entropy of p(x) and p(x|2) is given in the box below the tables and shows that the entropy of the conditional distribution exceeds that of the unconditional distribution.
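Because the table itself is not reproduced here, the following constructed example illustrates the same phenomenon; the probabilities are hypothetical (chosen only so that the conditional entropy given one outcome exceeds the unconditional entropy) and are not the values from the original table:

    import numpy as np

    def entropy_bits(p):
        p = np.asarray(p, dtype=float)
        return -np.sum(p * np.log2(p))

    # Hypothetical joint distribution p(x, y); rows are x = 1, 2 and columns are y = 1, 2.
    p_xy = np.array([[0.495, 0.25],
                     [0.005, 0.25]])
    p_x = p_xy.sum(axis=1)                         # marginal p(x)
    p_x_given_y2 = p_xy[:, 1] / p_xy[:, 1].sum()   # conditional p(x | y = 2)

    print(entropy_bits(p_x))            # ~0.82 bits
    print(entropy_bits(p_x_given_y2))   # 1.00 bits: conditioning on y = 2 increases the entropy,
                                        # i.e., negative predictive information for that outcome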