## Abstract

In this paper, possible connections between actual and potential skill are discussed. Actual skill is obtained when the prediction time series is verified against the observations, whereas potential (or perfect) skill is obtained when the observed verification time series is replaced by one of the members from the ensemble of predictions. It is argued that (i) there need not be a relationship between potential and actual skill; (ii) potential skill is not constrained to be always greater than actual skill, and examples to the contrary can be found; and (iii) there are methods whereby statistical characteristics of predicted anomalies can be compared with their observed counterparts, so that inferences about the validity of the (positive) gap between potential and actual skill as “room for improvement” can be better substantiated.

## 1. Introduction

For weather and climate predictions based on the ensemble approach, two measures of skill have often been reported side by side (Waliser et al. 2003; Wang et al. 2013; Becker et al. 2013; Holland et al. 2013; Younas and Tang 2013; Boer et al. 2013). The first is referred to as the prediction (or the actual) skill and is evaluated by validating predictions against corresponding observations. The other, commonly referred to as the potential (or the perfect) skill, is based on treating one of the ensemble members as proxy for the observations, while remaining ensemble members are used for predictions.

A feature of the assessment of *potential* skill is that, because the statistical characteristics (e.g., mean and spread) of the “observations” (one of the members from the ensemble of predictions) and of the prediction are the same (hence the use of the word “perfect”), the estimate of potential skill is not affected by model biases. The estimate of the potential skill is also the upper bound of realizable prediction skill in a *hypothetical world* where the statistical characteristics of the observed time series are identical to the statistical characteristics of the model-predicted time series.

It is common to see analyses where both actual and potential measures of skill are reported (Waliser et al. 2003; Wang et al. 2013; Becker et al. 2013; Holland et al. 2013; Younas and Tang 2013; Boer et al. 2013). In such comparisons, it is generally the case that the potential skill is found to be larger than the actual skill. An interpretation is then made that the gap between the potential and the actual skill represents room for improvement. It is further hypothesized that as model biases get smaller, the actual skill will converge to its perfect skill counterpart that was estimated based on the ensemble of predictions. For example, statements along the lines of “…potential skill is reasonably high which suggests that forecast improvement might be possible…” (Boer et al. 2013); “…that the potential predictability is higher than actual skill suggests that there may be a lot of room to improve present prediction skill of PNA…” (Younas and Tang 2013); and “…initial-value predictability in Antarctic sea ice has been assessed in a perfect model framework. This provides an upper limit on the predictability that could be realized…” (Holland et al. 2013) create an impression that estimates of potential skill based on the perfect model assumption are also the limits of realizable skill that could be achieved with continued model improvement efforts or, alternatively, that the current level of actual skill is unduly low because of model biases.

In assuming the validity of the “closing of the prediction skill gap” conjecture, an often ignored fact is that the estimate of perfect skill can be an erroneous estimate of the true upper limit of prediction skill in observations. In positing that the gap between the perfect and actual skill is the room for improvement, one essentially makes an assumption that when the statistical characteristics of the observed time series become similar to those of the model-simulated time series, the estimate of perfect skill will become the realizable skill. What is disregarded is that the converse should happen: model improvements should lead to the convergence of the statistical characteristics of the model forecasts to those of the observations (and not vice versa). Further, because of the preponderance of instances where the estimate of perfect skill is found to be higher than the actual skill, it is often not recognized that there is no a priori constraint that perfect skill has to be better than the actual skill. Indeed, there are instances when actual skills higher than the perfect skill have been reported (Mehta et al. 2000).

In this paper, based on analysis of ensemble seasonal prediction systems, we focus on clarifying two important aspects of perfect and actual skill. We illustrate that (i) potential skill need not always be higher than the actual skill, and (ii) although desirable, there need not be an a priori relationship between the two. We also provide physical interpretations based on simple arguments whereby the model-predicted and observed variability, when decomposed into the predictable (ensemble mean) and unpredictable (departure from ensemble mean) components, governs the magnitude of perfect and actual skill. Finally, we provide recommendations whereby an assessment of differences between perfect and actual skill, and its interpretation as “room for improvement,” can be tested by some basic comparisons of statistical properties of predicted and observed time series.

## 2. Data and analysis procedure

The model forecast data used in this analysis are from the North American Multimodel Ensemble (NMME) archives (Kirtman et al. 2014). For the sake of brevity, out of several models we select seasonal forecasts from two prediction systems: the Climate Forecast System, version 2 (CFSv2), and the Community Earth System Model, version 3 (CESMv3). The analysis is for the prediction of sea surface temperatures (SSTs) for December–January–February (DJF) seasonal means for forecasts initialized from October initial conditions, and is during the period 1982–2009. Further details about respective forecast systems can be found in Kirtman et al. (2014).

The analysis is for two skill measures for deterministic forecasts: anomaly correlation and root-mean-square error. The anomaly correlation between the ensemble mean as the forecast and the corresponding verifying observations is defined as

$$\mathrm{AC} = \frac{\langle \bar{F}_i \, O_i \rangle}{\sigma_F \, \sigma_O}$$

where $\bar{F}_i$ is the ensemble mean forecast anomaly for the year *i* and $O_i$ is the corresponding verifying observed anomaly; $\sigma_F$ and $\sigma_O$ are the standard deviations of the forecast and observed time series; and 〈 〉 is the average over all the years in the verification time series. Under the perfect model assumption, and to estimate potential skill, the verifying observed time series is replaced by one of the members from the ensemble of forecasts. The potential skill can be computed as many times as there are ensemble members, and averaging over all such estimates provides a more robust assessment of perfect skill.

The root-mean-square error between the ensemble mean forecast and observations is defined as

$$\mathrm{RMSE} = \sqrt{\langle (\bar{F}_i - O_i)^2 \rangle}$$

For the perfect model assumption, $O_i$ is once again replaced by one of the members from the ensemble of forecasts.
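The two skill measures and their perfect-model counterparts can be sketched in a few lines of NumPy. This is a minimal illustration rather than the code used for the analysis: the function names, the array layout (years by members), and the synthetic data are assumptions made here, and the perfect skill is computed by withholding each member in turn and verifying the mean of the remaining members against it.

```python
import numpy as np

def anomaly_correlation(forecast, verification):
    """Temporal correlation of two anomaly series (one value per year).
    Both series are re-centered, which is harmless for true anomalies."""
    f = forecast - forecast.mean()
    v = verification - verification.mean()
    return float(np.mean(f * v) / (f.std() * v.std()))

def rmse(forecast, verification):
    """Root-mean-square difference between two series."""
    return float(np.sqrt(np.mean((forecast - verification) ** 2)))

def actual_skill(ens, obs, metric):
    """ens: (n_years, n_members) forecast anomalies; obs: (n_years,)."""
    return metric(ens.mean(axis=1), obs)

def perfect_skill(ens, metric):
    """Treat each member in turn as the 'observations' and verify the
    mean of the remaining members against it; average over all members."""
    scores = []
    for j in range(ens.shape[1]):
        proxy_obs = ens[:, j]
        remaining = np.delete(ens, j, axis=1)
        scores.append(metric(remaining.mean(axis=1), proxy_obs))
    return float(np.mean(scores))

# Synthetic example (hypothetical data, not NMME output): 28 years, 10 members.
rng = np.random.default_rng(0)
signal = rng.normal(size=28)
ens = signal[:, None] + rng.normal(size=(28, 10))
obs = signal + rng.normal(size=28)

print("actual  AC  :", actual_skill(ens, obs, anomaly_correlation))
print("perfect AC  :", perfect_skill(ens, anomaly_correlation))
print("actual  RMSE:", actual_skill(ens, obs, rmse))
print("perfect RMSE:", perfect_skill(ens, rmse))
```

Passing the metric as an argument keeps the actual and perfect calculations identical except for the choice of verification series, which is the only difference between the two skill estimates.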

We also analyze temporal autocorrelations for forecast DJF SST anomalies against the forecast November monthly mean SST. A larger (smaller) autocorrelation implies larger (smaller) persistence of initial anomalies, and therefore, this analysis is an implicit measure of the dispersion characteristics of the CFSv2 and CESMv3 forecast systems. The verification dataset for SSTs is from the extended reconstructed SST, version 2 (ERSSTv2; Smith et al. 2008). All anomalies for respective model and observed SSTs are computed relative to their own climatologies.

## 3. Results

Figure 1 compares the actual skill (left panels) versus the perfect skill (right panels) anomaly correlation for predicted SSTs from the two dynamical seasonal forecast systems. Actual skill for both forecast systems is largest in the central to eastern equatorial Pacific and is associated with a skillful prediction of interannual variability associated with El Niño–Southern Oscillation (ENSO). Collocated with high actual skill in the equatorial Pacific is the area where perfect skill is also high. Beyond the equatorial tropical Pacific, however, the relationship between actual and perfect skill seems to break down. For example, while for CESMv3 over a wide geographical area perfect skill is >0.7 (Fig. 1, bottom-right panel), the actual skill does not exceed 0.3 (Fig. 1, bottom-left panel). Even more disconcerting is the recognition that in regions away from the equatorial Pacific, even though perfect skill for CESMv3 is generally higher than for CFSv2, its actual skill is in fact lower. A similar conclusion in the context of comparison of skill measures for the two versions of the CFS was reported by Becker et al. (2013). On a stand-alone basis, if results from a single model were shown, a traditional argument could be that the difference between actual and perfect skill is where there is room for improvement. With results from two models shown side by side, however, the validity of this inference can be easily questioned, for one would not know which perfect skill is the right target toward which the actual skill should converge.

Another feature revealed by a comparison of perfect and actual anomaly correlation is that the former is not constrained to be always larger than the latter (i.e., at some places, perfect model skill can be lower than the actual skill). This is better illustrated in the scatterplot between actual and perfect skill (Fig. 2). For all the points below the diagonal, actual anomaly correlation (AC) happens to be larger than perfect AC. Similar cases have also been reported earlier (Mehta et al. 2000). This leads to an interpretative dilemma: in these instances, with model improvements there is “room for reduction in skill.”

Yet another point to note from the scatterplot is a lack of relationship between perfect and actual AC. Ideally the points in the scatterplot should be along the diagonal, but that is not the case. For a constant value of perfect AC (corresponding to any horizontal line in the scatterplot), there is a wide range of outcomes for actual AC.

How should these results be interpreted? The answer lies in the signal-to-noise (SN) ratio and the corresponding expected value of prediction skill measured in terms of anomaly correlation (Kumar and Hoerling 2000). A more detailed argument is presented in the appendix, but a simple explanation is the following: for ensemble based deterministic prediction, ensemble mean is considered as the predictive signal, while the departure from the ensemble mean for individual forecasts is the noise. The difference between actual and perfect skill is that while the same ensemble mean prediction is used in both cases, it is correlated either with the observed time series, or with the model simulated time series taken as a proxy of the observations. With the same ensemble mean as the prediction in the computation of both actual and perfect skill, the respective anomaly correlation depends on the relative magnitude of predictive signal and the noise in the corresponding verification time series (i.e., observed time series in the case of actual skill and model forecast time series in the case of perfect skill).

If the model is biased then there are two possibilities for the relative split of total variability between the predictive signal and noise: observations can either have a larger signal (smaller noise) or a smaller signal (larger noise) than the model. When the observed time series has a larger (smaller) signal relative to the noise than the model-predicted time series, actual AC will be larger (smaller) than the perfect AC. We should point out that since the observations are a single realization, an explicit estimate of the signal and the noise based on the observed time series, as is possible for the ensemble of forecasts, cannot be made. This limitation, in fact, is at the heart of developing various approaches to estimate limits of prediction skill (or predictability) in observations (DelSole et al. 2013).
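For intuition, the dependence of anomaly correlation on the SN ratio can be written out for an idealized case. The expression below is a sketch under assumptions introduced here, not a result quoted from the cited work: the verification series is taken as a signal *μ* with variance $\sigma_e^2$ plus independent noise *ε* with variance $\sigma_n^2$, the forecast is the signal itself (the limit of a very large ensemble), and $s = \sigma_e/\sigma_n$ denotes the SN ratio.

$$\mathrm{AC} = \frac{\langle \mu\,(\mu + \varepsilon) \rangle}{\sigma_e \sqrt{\sigma_e^2 + \sigma_n^2}} = \frac{\sigma_e}{\sqrt{\sigma_e^2 + \sigma_n^2}} = \frac{s}{\sqrt{1 + s^2}}$$

Under these assumptions, a verification series with $s = 1$ gives AC ≈ 0.71, while one with $s = 0.5$ gives AC ≈ 0.45. The same forecast therefore scores differently depending only on how the variability of the verification series is split between signal and noise.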

Is there a framework to provide a sanity check to assess whether the estimate of perfect skill is a realistic estimate or not, and whether the difference between perfect and actual AC should be inferred with some caution? We propose that for initialized predictions, there are metrics based on which a meaningful comparison between observed and predicted variability can be made, and the realism of perfect skill as an estimate of realizable skill can be put to the test. One such measure is temporal autocorrelation.

For an ensemble of initialized predictions of SST starting from different initial conditions, the spread between different ensemble realizations is a measure of noise while the ensemble mean is the predictive signal (Peng et al. 2011). The temporal autocorrelation is related to the persistence of anomalies, with high (low) autocorrelation indicating higher (lower) persistence. An assessment of temporal autocorrelation is also an assessment of how the ensemble mean (the predictive signal) and the departure from the ensemble mean for individual forecasts (the noise) evolve with lead time. Further, a larger (smaller) autocorrelation also indicates less (more) dispersion among forecasts.

Following this argument, regions where autocorrelation for predicted anomalies is higher (lower) will have a higher (lower) SN ratio. The relationship between SN ratio and the expected value for AC (Kumar and Hoerling 2000) then implies that regions with higher (lower) autocorrelation for predictions will also have higher (lower) perfect AC. We illustrate this by comparing the temporal autocorrelation and perfect skill between CFSv2 and CESMv3. For the forecasts starting from the October initial conditions, the autocorrelation for the evolution of forecast anomalies is estimated as the correlation between the forecast for the November monthly mean and the forecast for the DJF seasonal mean. This process is repeated for individual forecasts in the ensemble, and similar to the perfect skill, autocorrelations are averaged over all such estimates.
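The member-by-member autocorrelation estimate described above can be sketched as follows. The function names, the array layout (years by members), and the synthetic "systems" are hypothetical and serve only to show the calculation; in the synthetic data each member evolves independently, which is a simplification.

```python
import numpy as np

def member_mean_autocorrelation(nov, djf):
    """nov, djf: (n_years, n_members) forecast anomalies for the November
    monthly mean and the DJF seasonal mean. The correlation over years is
    computed separately for each member and then averaged."""
    nov = np.asarray(nov, dtype=float)
    djf = np.asarray(djf, dtype=float)
    corrs = [np.corrcoef(nov[:, j], djf[:, j])[0, 1]
             for j in range(nov.shape[1])]
    return float(np.mean(corrs))

def synthetic_system(persistence, n_years=28, n_members=10, seed=0):
    """Hypothetical forecast system: the DJF anomaly retains a fraction
    `persistence` of the November anomaly, plus independent variability."""
    rng = np.random.default_rng(seed)
    nov = rng.normal(size=(n_years, n_members))
    growth = rng.normal(size=(n_years, n_members))
    djf = persistence * nov + np.sqrt(1.0 - persistence**2) * growth
    return nov, djf

high = member_mean_autocorrelation(*synthetic_system(0.9))
low = member_mean_autocorrelation(*synthetic_system(0.3))
print(f"high-persistence system: {high:.2f}")
print(f"low-persistence system:  {low:.2f}")
```

Applying the same function to observed November and DJF anomalies (a single "member") gives the observed counterpart against which the forecast persistence can be compared.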

The spatial maps of difference in perfect AC and corresponding autocorrelations between CFSv2 and CESMv3 are shown in Fig. 3. In general, the autocorrelation is smaller for the CFSv2 compared to CESMv3 (Fig. 3, bottom panel) indicating larger persistence time scales for anomalies for the CESMv3. Consistent with the CESMv3 having larger persistence (and consequently, less dispersion among forecasts), the perfect AC for CESMv3 is also larger (Fig. 3, top panel). The scatterplot of difference in autocorrelation and perfect AC shown in Fig. 4 confirms this relationship. A similar analysis of discrepancy between perfect and actual AC can also be made. If perfect AC is much larger than the actual AC, the likely cause is a larger persistence (or underdispersion) of the forecast anomalies, thereby providing a check on the realism for the estimate of perfect AC. A comparison of temporal autocorrelation between two forecast systems and the observations indeed shows that model forecasts, in general, have higher persistence (i.e., are underdispersive) (not shown).

Higher autocorrelation (or higher persistence or a smaller spread among different forecast members) also influences the root-mean-square error (RMSE) skill measure in a very distinct manner. If the statistical characteristics of model predictions are unbiased then the perfect and the actual RMSE should be similar since replacing observations as verification by a member from the ensemble prediction will not matter. A higher autocorrelation in predictions, however, should result in perfect RMSE being less than actual RMSE, and is an indication of a forecast system that is underdispersive.
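The effect of underdispersion on RMSE can be illustrated with a synthetic experiment. This is a sketch under assumptions made here (Gaussian signal and noise, a common signal in the observations and forecasts, and arbitrary noise amplitudes), not a reproduction of the analysis in this paper.

```python
import numpy as np

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

def perfect_rmse(ens):
    """Average RMSE of the mean of the remaining members verified
    against each withheld member in turn."""
    scores = []
    for j in range(ens.shape[1]):
        remaining_mean = np.delete(ens, j, axis=1).mean(axis=1)
        scores.append(rmse(remaining_mean, ens[:, j]))
    return float(np.mean(scores))

rng = np.random.default_rng(1)
n_years, n_members = 2000, 10           # long record to limit sampling noise
signal = rng.normal(size=n_years)
obs = signal + rng.normal(scale=1.0, size=n_years)   # observed noise std = 1

results = {}
for label, noise_std in [("well dispersed", 1.0), ("underdispersive", 0.5)]:
    # Member noise matches the observed noise only in the first case.
    ens = signal[:, None] + rng.normal(scale=noise_std,
                                       size=(n_years, n_members))
    results[label] = (rmse(ens.mean(axis=1), obs), perfect_rmse(ens))
    actual, perfect = results[label]
    print(f"{label:16s} actual RMSE = {actual:.2f}, perfect RMSE = {perfect:.2f}")
```

When the member noise matches the observed noise the two RMSE values are close, whereas halving the member spread leaves the actual RMSE nearly unchanged and makes the perfect RMSE markedly smaller.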

To demonstrate that both prediction systems are generally underdispersive, the perfect and the actual RMSE are shown in Fig. 5. Almost everywhere, perfect RMSE is smaller than the actual RMSE, consistent with our reasoning that an underdispersive forecast system will have a smaller perfect RMSE compared to the actual RMSE. The perfect RMSE for CESMv3 is even smaller than for CFSv2, which is consistent with (i) higher autocorrelation (Fig. 3, bottom panel) and (ii) higher perfect AC (Fig. 1, right panels). This relationship among temporal autocorrelation, RMSE, and AC also parallels the concept of statistical consistency or the reliability of a forecast system (Eckel and Mass 2005; Peng et al. 2012).

## 4. Summary

In this paper, based on analysis of perfect and actual skill for anomaly correlation and root-mean-square error, the following aspects of the possible relationship between perfect and actual skill were highlighted:

Because of model biases, there does not have to be a relationship between perfect and actual skill. For an unbiased prediction system, perfect skill should be the same as the actual skill; beyond that, for prediction systems with various biases, estimates of perfect skill are just estimates of predictability in a world where observed variability is assumed to mimic the “erroneous” model-predicted variability. The inference that the gap between perfect and actual skill is “room for improvement” in prediction skill is a flawed interpretation, and assumes a *hypothetical world* where the statistical characteristics of the observed time series are identical to the statistical characteristics of the model-predicted time series. No definitive statements about the realizable skill corresponding to observed variability can be made based on a comparison of perfect and actual skill.

There is no a priori reason that the perfect skill has to be larger than the actual skill (for further discussion, see the appendix). Examples of such cases can be easily found, and an explanation based on decomposition of variability into predictive signal and noise can be formulated.

To the extent possible, inferences about the estimates of perfect skill need to be confronted with a comparison of statistical characteristics of predictions with corresponding observations. For example, for initialized predictions, one could easily compare the autocorrelation of predicted and observed quantities. A large difference in autocorrelation will also be reflected in a discrepancy between perfect and actual skill measures. Other basic comparisons include a comparison of total variance and a comparison of linear relationships between the causes of predictability and the atmospheric variability, for example, linear regression between SST indices related to El Niño–Southern Oscillation (ENSO) and interannual atmospheric variability.

The utility of perfect skill estimates, and a large discrepancy from the actual skill, may in fact be in pointing to serious issues with model biases, and therefore, the right context for the use of perfect skill may be as a model diagnostics tool.

## Acknowledgments

We thank two anonymous reviewers for their comments that led to improvements in the final version of the manuscript.

### APPENDIX

#### Derivation of Perfect and Actual Skill

In this section, using an idealized scenario, we demonstrate how the decomposition of total variability into signal and noise for predictions and observations affects the magnitude of the perfect and actual anomaly correlation. Let observations and predictions be given by

$$O_i = \mu_{o,i} + \varepsilon_{o,i} \tag{A1}$$

$$F_{ij} = \mu_{m,i} + \varepsilon_{m,ij} \tag{A2}$$

where *O* and *F* are the observed and predicted seasonal means for the year *i*. As the model prediction is based on ensembles, the index *j* is for the member within the ensemble. On the right-hand side of (A1) and (A2), *μ* and *ε* are the predictive signal and the unpredictive noise, respectively; and subscripts *o* and *m* refer to observed and predicted quantities, respectively. For the model, the predictive signal *μ* is the ensemble mean, while the noise is the difference between the ensemble mean and individual predictions. Because observations are a single realization, such an explicit decomposition of the observed time series into the signal and noise components cannot be made. A deterministic seasonal prediction based on the ensemble mean signal is given by

$$\bar{F}_i = \mu_{m,i} \tag{A3}$$

The total interannual variability of seasonal means *σ* can be decomposed into variability associated with the predictive signal and noise (Kumar and Hoerling 1995):

$$\sigma_o^2 = \sigma_{oe}^2 + \sigma_{on}^2 \tag{A4}$$

$$\sigma_m^2 = \sigma_{me}^2 + \sigma_{mn}^2 \tag{A5}$$

where subscripts *e* and *n* refer to the predictive and the noise components of the variability, respectively. The first simplifying assumption we make is that the total observed and predicted variability are the same:

$$\sigma_o = \sigma_m = \sigma \tag{A6}$$

The temporal anomaly correlation between observations and predictions (i.e., the actual skill) is given by

$$\mathrm{AC}_a = \frac{\langle O_i \, \bar{F}_i \rangle}{\sigma_o \, \sigma_{me}} \tag{A7}$$

where 〈〉 represents the average over all the years *i*. Recognizing the fact that $\bar{F}_i = \mu_{m,i}$ and $\langle \varepsilon_{o,i} \, \mu_{m,i} \rangle = 0$, (A7) reduces to

$$\mathrm{AC}_a = \frac{\langle \mu_{o,i} \, \mu_{m,i} \rangle}{\sigma \, \sigma_{me}} \tag{A8}$$

The computation of the perfect skill, which is equivalent to replacing the observed verification time series in (A7) with the time series of one realization of the prediction $F_{ij}$, is given by

$$\mathrm{AC}_p = \frac{\langle F_{ij} \, \bar{F}_i \rangle}{\sigma_m \, \sigma_{me}} = \frac{\langle \mu_{m,i}^2 \rangle}{\sigma \, \sigma_{me}} = \frac{\sigma_{me}}{\sigma} \tag{A9}$$

Out of the infinite possibilities as to what the relationship between the model-predicted and observed signal and noise may be, which depend on the characteristics of model biases, our next simplifying assumption is that the predicted and the observed signals have a linear relationship, with *α* > 0:

$$\mu_{m,i} = \alpha \, \mu_{o,i} \tag{A10}$$

With this assumption $\sigma_{me} = \alpha \, \sigma_{oe}$, so that (A8) gives $\mathrm{AC}_a = \sigma_{oe}/\sigma$, and the relationship between perfect and actual skill reduces to

$$\mathrm{AC}_p = \frac{\alpha \, \sigma_{oe}}{\sigma} = \alpha \, \mathrm{AC}_a \tag{A11}$$

From (A11), if *α* > 1 then $\mathrm{AC}_p > \mathrm{AC}_a$, and if *α* < 1 then $\mathrm{AC}_p < \mathrm{AC}_a$ [i.e., if the predicted signal is larger (smaller) than the observed signal, and consequently the predicted noise is smaller (larger) than the observed noise, then the perfect AC is larger (smaller) than the actual AC]. There is no constraint that the perfect skill has to be larger than the actual skill.

In our idealized system, where total variability for predicted and observed anomalies is the same, the relationship between the perfect and actual skill will depend on how the total variability is partitioned between the predictive signal and unpredictive noise. In reality, (i) the total variability of predicted and observed anomalies, because of model biases, need not be the same; and (ii) the dependence between the predicted and observed signal can be more complex than assumed in (A10) and, further, can vary from year to year, which could lead to a complex (and unknown) relationship between perfect and actual AC.
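The idealized system can also be checked numerically. The sketch below is an illustration under the stated simplifications (equal total variance, set to 1 here; a predicted signal proportional to the observed signal through a positive factor *α*; Gaussian signal and noise; the signal itself used as the ensemble-mean forecast). The function name and parameter values are hypothetical.

```python
import numpy as np

def idealized_skill(alpha, sigma_oe=0.6, n_years=200_000, seed=0):
    """Return (actual AC, perfect AC) for an idealized system with unit
    total variance and predicted signal = alpha * observed signal."""
    sigma_me = alpha * sigma_oe
    if not 0.0 < sigma_me < 1.0:
        raise ValueError("alpha * sigma_oe must lie in (0, 1) so that the "
                         "model noise variance stays positive")
    rng = np.random.default_rng(seed)
    mu_o = rng.normal(scale=sigma_oe, size=n_years)   # observed signal
    mu_m = alpha * mu_o                               # predicted signal
    # Noise amplitudes chosen so that both total variances equal 1.
    obs = mu_o + rng.normal(scale=np.sqrt(1.0 - sigma_oe**2), size=n_years)
    member = mu_m + rng.normal(scale=np.sqrt(1.0 - sigma_me**2), size=n_years)
    ac_actual = np.corrcoef(obs, mu_m)[0, 1]      # forecast verified on observations
    ac_perfect = np.corrcoef(member, mu_m)[0, 1]  # forecast verified on a member
    return float(ac_actual), float(ac_perfect)

for alpha in (0.5, 1.0, 1.25):
    ac_a, ac_p = idealized_skill(alpha)
    print(f"alpha = {alpha:4.2f}: actual AC = {ac_a:.3f}, "
          f"perfect AC = {ac_p:.3f}, ratio = {ac_p / ac_a:.3f}")
```

In this setup the ratio of perfect to actual AC should track *α* to within sampling error, so the perfect AC falls below the actual AC whenever the predicted signal is weaker than the observed one.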