1. Introduction
In the research fields of weather forecasting and climate prediction, the benefit of a given physical parameterization or data assimilation technique for forecast quality is assessed by comparing the prediction scores of two retrospective sets of forecasts, with and without the new feature. These scores can be, for example, the anomaly correlation coefficient or the root-mean-square error, which measure the accuracy of an ensemble-mean prediction, that is, how far the ensemble-mean prediction lies from the observations on average. They can also be, for example, a measure of the spread of the members around the ensemble mean, which provides information on the reliability of the probabilistic forecast that can be derived from the ensemble (Palmer et al. 2005). Any robust comparison of two scores requires their confidence intervals. The wide variety of scores used to assess forecast quality currently leads to the use of a wide variety of parametric tests relying on the χ², Fisher, or Student distributions, which require an estimated effective number of independent data Neq, and of nonparametric block bootstrap tests, which require an estimated persistence time Tp so that the serial dependence of the data is accounted for. An alternative approach commonly used in weather forecasting is to consider the data to be independent from one day to the next, although the same meteorological conditions often persist for a few days (Wilks and Wilby 1999).
The objectives of this paper are the following:
We will illustrate the substantial uncertainty in the Neq provided by the standard Eq. (2) from von Storch and Zwiers (2001), even when the hypotheses for its use are valid. This issue was mentioned earlier by Thiébaux and Zwiers (1984), yet the formula remains widely used, in particular when estimating confidence intervals for skill scores in weather and climate forecasting. Section 2 focuses on this objective.
We will suggest a preliminary treatment of the time series, applied before such a formula, that allows for more robust estimates of the effective sample size. Tools are made available in the supplementary material to apply this alternative approach, which aims at increasing the robustness of the Neq estimation. The method is described in detail in section 3.
We will highlight a second issue, not previously reported in the literature, that arises when such a formula is applied to climate and weather prediction problems because one of its hypotheses, namely the hypothesis of “identically distributed” data, is frequently violated. We explain this issue in section 4.
We will suggest additional preliminary treatments that can prevent the consequences of violating this hypothesis. These treatments are briefly described in section 4 and validated in the supplementary material, and they are available as options in the tools contained in the supplementary material.
Sections 5 and 6 respectively provide a discussion and the conclusions. An appendix describes succinctly the tools provided in the supplementary material to apply the methods suggested in this article.
2. A large uncertainty on the estimated effective number of independent data with the standard approach
The large spread of the Neq provided by Eq. (4), and the unrealistic results obtained in a number of cases, originate from the large uncertainty in the estimation of the autocorrelation function from the short time series that are usually available. This uncertainty is illustrated in Fig. 2, which shows, as black lines, the true autocorrelation function of an AR1 with α1 = 0.3 or α1 = 0.8 as a function of the lag and, as colored lines, the intervals between various quantiles of the 1000 autocorrelation functions estimated from AR1 samples drawn randomly as explained in the previous paragraph. The range of estimated autocorrelations increases quickly with the lag: for lags larger than τ = 5, the estimated autocorrelation ranges between approximately −0.5 and 0.5. This uncertainty strongly affects the estimated Neq shown in Fig. 1.
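This sampling spread is easy to reproduce with a short simulation. The sketch below is written in Python with NumPy (the tools accompanying this article are in R), and the choices N = 50, α1 = 0.3, and lag 5 are illustrative assumptions matching the discussion above; the function names are ours.

```python
import numpy as np

def draw_ar1(n, alpha, rng):
    """Draw one AR1 sample of length n with lag-1 autocorrelation alpha."""
    x = np.empty(n)
    x[0] = rng.standard_normal() / np.sqrt(1.0 - alpha**2)  # stationary start
    for t in range(1, n):
        x[t] = alpha * x[t - 1] + rng.standard_normal()
    return x

def sample_acf(x, lag):
    """First-guess (biased) sample autocorrelation at the given lag."""
    xc = x - x.mean()
    return np.dot(xc[:-lag], xc[lag:]) / np.dot(xc, xc)

rng = np.random.default_rng(0)
n, alpha, lag = 50, 0.3, 5
est = np.array([sample_acf(draw_ar1(n, alpha, rng), lag)
                for _ in range(1000)])
print(np.quantile(est, [0.025, 0.975]))  # wide interval around alpha**lag
```

Although the true autocorrelation at lag 5 is 0.3⁵ ≈ 0.002, the 95% interval of the 1000 estimates spans several tenths, consistent with the spread visible in Fig. 2.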
3. How can one reduce the uncertainty on the estimated effective number of independent data?
The histograms of Neq obtained by applying solutions 1 and 2 to 1000 AR1 samples are shown in Figs. 3 and 4, respectively. Both solutions provide a distribution that peaks around the true Neq, indicated by a vertical brown line, whereas the original formulation provided a distribution peaking at Neq = 50. The distance between the mean of the distribution, indicated by a vertical blue line, and the true Neq ranges between 7.74 and 14.33 depending on α1 for solution 1, but only between 0.45 and 3.66 for solution 2. The peak of the distribution is closer to the true Neq with solution 2 (Fig. 4) than with solution 1 (Fig. 3), and the spurious secondary peak at Neq = 50 disappears much more quickly as α1 increases with solution 2. We thus conclude that solution 2 reduces the uncertainty on the estimated number of independent data to a larger extent than solution 1.
Figure 5 further compares the performance of the original formulation (open circles), solution 1 (filled circles), and solution 2 (triangles) in terms of bias (left panel), standard deviation (center panel), and root-mean-square error (RMSE; right panel) of the estimated Neq, expressed as a percentage of the sample size N, for samples of length N equal to 20 (black lines), 50 (red lines), and 100 (green lines) and for α1 between 0.1 and 0.8 as given by the x axis. The bias is systematically reduced when using solution 1 over the original formulation and when using solution 2 over solution 1. The standard deviation is increased for low α1 but decreased for high α1 when using solution 2 over the original formulation. Overall, the RMSE, which combines the bias and standard deviation information, is systematically reduced when using solution 1 over the original formulation and when using solution 2 over solution 1. Whereas the RMSE increases with α1, that is, with a decrease in the true Neq, when using the original formulation, it decreases with α1 when using solution 2.
We concluded in section 2 that the large uncertainty in the estimation of the effective number of independent data when using Eq. (3) originates from the large uncertainty in the estimation of the autocorrelation function. We compare in Fig. 6 the performance of our solution 2 (triangles) with that of the first guess of the estimated autocorrelation function (ACF; open circles) and of the maximum likelihood estimator (MLE; filled circles) in terms of bias (left panel), standard deviation (center panel), and root-mean-square error (right panel) of the estimates of the autocorrelation at lag 1. These performances have been computed on 2000 AR1 processes drawn with the α1 given by the x axis and with length N equal to 20 (black lines), 50 (red lines), and 100 (green lines). Our estimator following solution 2 has a lower bias than the other two estimators for low α1 but a larger bias for large α1, and it has a lower standard deviation for any α1 for the sample sizes considered in this study. Overall, its root-mean-square error is lower than that of the other two estimators for α1 < 0.6 for N = 20, for α1 < 0.5 for N = 50, and for α1 < 0.4 for N = 100. Since the most common case in climate and weather prediction combines a small sample size and a small α1, we recommend the use of our solution 2 over the maximum likelihood estimator or the first guess of the autocorrelation function.
Thiébaux and Zwiers (1984) suggested seven different estimators of the effective number of independent data. We have reproduced parts of their Tables 3, 4, and 5, which compare the performance of these estimators. We have considered only the results for sample sizes below 60, since small sample size is one of the issues also faced in climate and weather prediction. Furthermore, we have not considered their example of a second-order autoregressive process, which exhibits a large peak in its power spectrum and therefore does not seem representative of our typical observed data. The methods considered by Thiébaux and Zwiers (1984) consist of the following:
“DIR”: applying the original Eq. (3),
“DIR2”: truncating the original Eq. (3) at lag 10,
“ARMA”: fitting an autoregressive moving-average process after determining its order with Akaike’s information criterion and then using its theoretical autocorrelation function,
“SPEC1”: estimating the spectrum at the origin by using the first ordinate of the smoothed periodogram (or Daniell estimator),
“SPEC5”: estimating the spectrum at the origin by using the first five ordinates of the smoothed periodogram (or Daniell estimator),
“SPECT5”: estimating the spectrum at the origin by using the Daniell estimator with 20% of the data tapered with a cosine taper, and
“BART”: using a weighted covariance estimator with a Bartlett lag window.
Table 1. Partial reproduction of Table 3 from Thiébaux and Zwiers (1984) with an additional column, in boldface roman font, that provides the performance of our estimator. Shown are the true median and the medians of 1000 estimates of the effective sample size made with our estimator and with the seven estimators of Thiébaux and Zwiers (1984). The first two columns give the properties of the samples on which the effective sample size is estimated. Boldface italics indicate cases in which our method is outperformed.
Table 2. As in Table 1, but for Table 5 of Thiébaux and Zwiers (1984) and for the standard deviation.
Last, we would like to warn the reader about a drawback of this method. Whereas the original formulation tends to provide an Neq larger than N when applied to random samples of independent data, solution 2 tends to provide an Neq slightly lower than N when applied to such samples. Using this solution might thus make statistical hypothesis tests too conservative.
4. Violation of the “identically distributed” hypothesis
The formula developed by Anderson (1971) and revisited in von Storch and Zwiers (2001) can be derived under a set of hypotheses (see the supplementary material). Each datum of the time series is considered a particular draw of a random variable X. The process of drawing a time series is then modeled as a list X1, X2, X3, X4, … , XN of identically distributed random variables. This mathematical framework for modeling the sample is a classical one that has been used to develop many commonly used parametric statistical tests (e.g., Student’s t test or the Fisher test), except that those parametric tests additionally require the independence of the data. The identically distributed hypothesis implies that the data in the sample are equivalent; that is, they are not influenced by any underlying physical process that would make the distributions of two Xi different.
The identically distributed hypothesis is a strong one that in many cases might not be verified. The annual cycle, driven by the insolation, imposes an annual oscillation on climate variables. That summer and winter processes are not identically distributed is well understood in the scientific community, and summer and winter values are never compared. Because of El Niño–Southern Oscillation, the distribution of a climate variable might differ between warm and cold phases. More generally, in a time series affected by any climate oscillation, a datum in one phase cannot be represented by the same X as a datum in the other phase. Climate change also induces a slow change in the distributions of all climate variables. In any time series affected by a trend, such as that of climate change, we should consider that the mean of the random variable X changes with time. Otherwise, a spurious correlation between the random variables Xi is introduced, and it affects the estimated equivalent number of observations.
To illustrate the impact of neglecting the identically distributed hypothesis, we consider the example of the global annual mean sea surface temperature (SST) over the 1960–2010 period (Fig. 7a), which is affected by a strong trend. The number of independent data provided by solution 2 applied to this example, which contains N = 51 actual data, is Neq = 4.8 because the trend constitutes a major part of the signal. Indeed, if we consider a sample drawn from a simple line with a positive trend and containing N = 51 actual data, we obtain Neq = 2.3, only slightly lower than the result obtained for the global mean SST. The autocorrelation function estimated from the global mean SST time series is only slightly lower than 1.0 because of the weak departures from the trend, thus resulting in an Neq slightly larger than 2.3. In the mathematical framework under which the formula was developed, a sample drawn from a simple line corresponds to two phases of a long oscillation: a long phase of below-average anomalies followed by a long phase of above-average anomalies, hence a number of independent data close to 2.
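The dramatic effect of a trend on the estimated Neq can be reproduced with a small synthetic experiment. The sketch below is in Python rather than R and uses the common AR1 closed form Neq = N(1 − α1)/(1 + α1) with a raw lag-1 autocorrelation estimate in place of the refined solution 2; the slope and noise level are arbitrary illustration choices, not the SST data of Fig. 7a.

```python
import numpy as np

def lag1_acf(x):
    """First-guess autocorrelation at lag 1."""
    xc = x - x.mean()
    return np.dot(xc[:-1], xc[1:]) / np.dot(xc, xc)

def neq_ar1(x):
    """Effective sample size under an AR1 model, using the common
    closed form N_eq = N * (1 - alpha1) / (1 + alpha1); the raw
    lag-1 autocorrelation stands in for the refined solution 2."""
    a = lag1_acf(x)
    return len(x) * (1.0 - a) / (1.0 + a)

rng = np.random.default_rng(1)
t = np.arange(51)
trended = 0.5 * t + rng.standard_normal(51)   # trend dominates the noise
fit = np.polyfit(t, trended, 1)               # linear detrending
detrended = trended - np.polyval(fit, t)
print(round(neq_ar1(trended), 1), round(neq_ar1(detrended), 1))
```

With the trend left in, the lag-1 autocorrelation is close to 1 and Neq collapses to a few units; after linear detrending, Neq recovers most of the N = 51 actual data, mirroring the SST example above.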
When verifying against observational datasets the decadal temperature predictions produced within the framework of phase 5 of the Coupled Model Intercomparison Project (CMIP5; Taylor et al. 2012), which are initialized every 5 years from 1960 to 2005, only 9–10 observational data are available, depending on the forecast time considered. When applying solution 2 to the near-surface temperature averaged over forecast years 2–5 from this climate-prediction case, the number of independent data for verification is about 2 in the Indian and North Atlantic Oceans (Fig. 7b). If the physical mechanisms targeted by the analyses are contained in the departures from the long-term trend and if the trend has a large amplitude as compared with those departures, the formula discussed here should rather be applied to a detrended time series. The number of independent data provided by solution 2 applied to the linearly detrended global annual mean SST shown in Fig. 7a is Neq = 31.6. A linear trend is a crude estimate of the climate change response; however, the Neq obtained with such a simple method is already much closer to the true Neq if one is interested in the climate signal contained in the departures from the trend. The added value of a linear detrending is illustrated in Fig. S2 of the supplementary material for various combinations of α1 and trend slope.
5. Discussion
The use of Eq. (2) combined with classical parametric or nonparametric inferential tests suited to each prediction skill score is standard practice in the research fields of climate and weather prediction. The mathematical theory that led to this formula, however, initially considered only tests of means, and its extension to other statistical inferential tests seems not to be as accurate as its original usage. On the one hand, from a physical point of view, the independence between two data depends solely on the underlying physical processes that induce (or not) a temporal coherence between two separate dates. On the other hand, from a statistical point of view, the concept of equivalent sample size depends on the statistical context, that is, on the statistical parameter of interest, the particular inferential test, and the distribution of the random variables that model the sample-drawing process (see von Storch and Zwiers 2001). The equivalent sample size is equal to the number of independent data that would provide the same amount of information about a given statistical parameter as the available sample of dependent data, that is, the same standard deviation of the estimator. The unification of both points of view is not straightforward, and the choice of the statistical model most suitable to represent a given physical process remains the responsibility of the physicist. As long as well-tested methods accounting for serial dependence are not available for each statistical inference test, as they are for tests of means (Zwiers and von Storch 1995), an effective number of independent data has to be estimated with one of the currently available options. Because of the large uncertainties in the estimate obtained with the original formula following Eq. (2) and the better performance of our approach relative to any other available to our knowledge, we recommend the use of our solution 2.
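This definition of the equivalent sample size, the number of independent data giving the same standard deviation of the estimator, can be checked numerically for the simplest case, the time mean of an AR1, whose variance equals σ²/Neq with Neq = N/[1 + 2Σk(1 − k/N)α1^k]. The sketch below is in Python with NumPy; N, α1, and the number of draws are arbitrary illustration choices.

```python
import numpy as np

# Numerical check that the variance of the time mean of an AR1 equals
# sigma^2 / N_eq, with N_eq = N / (1 + 2 * sum_k (1 - k/N) * alpha1^k).
rng = np.random.default_rng(2)
n, alpha1, ndraw = 50, 0.5, 20000
sigma2 = 1.0 / (1.0 - alpha1**2)          # stationary AR1 variance

eps = rng.standard_normal((ndraw, n))
x = np.empty((ndraw, n))
x[:, 0] = rng.standard_normal(ndraw) * np.sqrt(sigma2)  # stationary start
for t in range(1, n):
    x[:, t] = alpha1 * x[:, t - 1] + eps[:, t]
means = x.mean(axis=1)                    # one time mean per draw

k = np.arange(1, n)
neq = n / (1.0 + 2.0 * np.sum((1.0 - k / n) * alpha1**k))
print(means.var(), sigma2 / neq)  # should agree within sampling noise
```

The empirical variance of the 20000 time means matches σ²/Neq closely, which is the sense in which the N = 50 dependent data carry only Neq ≈ 17 independent pieces of information about the mean.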
To improve the robustness of the equivalent sample size estimation, we have relied on the statistical model of a first-order autoregressive process. More general statistical models, such as higher-order autoregressive processes and autoregressive moving-average (ARMA) processes, could have been used. Those models might represent the physical processes of interest slightly more accurately. However, fitting such models would require determining additional sets of parameters. In practical applications, one has to balance the benefits of using a more complex model against the uncertainties brought by the determination of additional parameters. As we have shown in Fig. 2, the information from the first guess of the autocorrelation function is rapidly lost as the lag increases. A strength of our solution 2 lies in the weights given to the different lags of the first guess of the autocorrelation function when estimating the single parameter α1, as explained in section 3. This would not be possible with statistical models of higher order. Indeed, the solution proposed by Thiébaux and Zwiers (1984), which relies on fitting an ARMA statistical model to the sample to estimate its effective number of independent data, is outperformed by our solution 2, as mentioned in section 3.
6. Conclusions
Most climate and weather studies make use of statistical hypothesis tests and confidence interval estimation to assess the robustness of their conclusions. The statistical methods involved in those assessments require the data of the time series to which they are applied to be independent. Those time series are unfortunately usually affected by the strong persistence of meteorological and climatic phenomena. A common approach to cope with this obstacle is to replace, in those statistical methods, the actual number of data with an effective number of independent data estimated from the well-known formula of Anderson (1971) and von Storch and Zwiers (2001). Even though this formula can be derived under some hypotheses, it provides unreliable results in practical examples because it is applied with an estimated autocorrelation function, which is affected by a large uncertainty owing to the usual shortness of the available time series. Our recommendation is to fit the autocorrelation function of a first-order autoregressive process to the estimated one prior to the application of the Anderson (1971) formula in order to reduce this uncertainty. An R function called CFU_eno is made available to the reader to implement this method. In addition to not respecting the hypothesis of independence of the data, many meteorological and climatic time series also do not respect the hypothesis of stationarity, for example, when affected by climate change. The estimated number of independent data is heavily affected by the existence of a trend. The CFU_eno function we make available in the supplementary material contains an option to linearly detrend the time series prior to the computation of the number of independent data for cases in which the climate signal targeted by the analyses is contained in the departures from the trend.
More details are provided in the appendix about the various R functions made available to the reader to apply the solution proposed in this article.
Acknowledgments
We are grateful to John Maxwell Halley for fruitful discussions about this article. This work was supported by the EU-funded QWeCI (FP7-ENV-2009-1-243964), CLIM-RUN (FP7-ENV-2010-265192), and SPECS (FP7-ENV-2012-308378) projects; the MINECO-funded RUCSS (CGL2010-20657) and PICA-ICE (CGL2012-31987) projects; and the Catalan government. The authors thank the two anonymous reviewers for their fruitful suggestions and their substantial contribution to the improvement of our article.
APPENDIX
Tools Provided as Supplementary Materials
A set of seven functions written in the R programming language is provided to the reader as part of the supplementary material. Those R functions are distributed to
enable the reader to apply the methods we suggest in this article to their own dataset by using the CFU_eno function described below, which depends on CFU_alpha and fit_acfcoef, also described below (and potentially on CFU_spectrum and/or CFU_filter, depending on the input arguments to CFU_eno);
reproduce our tests and results by using CFU_gen_series to generate autoregressive processes and by applying CFU_eno with different input arguments; and
have access to the exact implementation details of our method by reading our code.
A brief description of each function is provided below:
CFU_eno: This function estimates the effective number of independent data of the input array xdata. It has one compulsory argument (the xdata array of which the effective number of independent data has to be estimated) and two optional flag arguments [detrend (default: “FALSE”) to apply a linear detrending prior to the estimation of the effective sample size and filter (default: “FALSE”) to apply a filtering of any periodic signal prior to the estimation of the effective sample size]. This function calls CFU_alpha to obtain a refined estimate of the autocorrelation at lag 1 following the solution 2 presented in this article, and then it applies Eq. (10) to obtain the effective number of independent data. The method used to apply the linear detrending and to filter any periodic signal is described below in the comments on the CFU_alpha function.
CFU_alpha: This function estimates the autocorrelation at lag 1 of the input array xdata. It has one compulsory argument (the xdata array of which the autocorrelation at lag 1 has to be estimated) and two optional flag arguments [detrend (default: “FALSE”) to apply a linear detrending prior to the estimation of the autocorrelation at lag 1 and filter (default: “FALSE”) to apply a filtering of any periodic signal prior to the estimation of the autocorrelation at lag 1]. This function, after a potential linear detrending and periodic-signal filtering of xdata, estimates the first guess of the autocorrelation function and then calls fit_acfcoef to refine the estimate of the autocorrelation at lag 1 following our solution 2. If detrend = “TRUE,” the linear trend of the xdata array is estimated together with its 95% confidence interval. If this confidence interval does not encompass 0, the confidence interval limit with the minimum absolute value is used as the slope of the linear trend to be removed. Indeed, a nonnull autocorrelation at lag 1 itself induces a trend in the xdata array, so that the benefits of subtracting the linear trend for large slopes have to be balanced against the removal of part of the autocorrelation signal at lag 1 for weak slopes. The results of our tests regarding the impact of a linear detrending are illustrated in Fig. S2 of the supplementary material. If filter = “TRUE,” the frequency spectrum of xdata is estimated by calling CFU_spectrum. If any peak in the obtained frequency spectrum is significant at the 99% level, it is filtered out by calling CFU_filter.
CFU_gen_series: This function generates first-order autoregressive processes containing n data, with alpha as the autocorrelation at lag 1 and with mean and standard deviation given by the mean and std arguments. It has four compulsory arguments: the n, alpha, mean, and std values. This function can be used by the reader to reproduce the various tests whose results are presented in this article.
fit_acfcoef: This function finds, with Cardano’s formula, the minimum point of the fourth-order polynomial (a − x)² + 0.25(b − x²)² written to fit the two autoregression coefficients a and b. It has two compulsory arguments: the a and b values. Provided that a and b are in the [0, 1] interval, the discriminant Δ is positive and the minimum is unique. This function can be used to minimize the mean-square differences between the true autocorrelation function of an AR1 and the first guess of the estimated autocorrelation function using only the first two lags.
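The minimization performed by fit_acfcoef can be sketched numerically as follows. The sketch is in Python rather than R, and it replaces the closed-form Cardano solution with a simple grid search for illustration; fit_alpha is our own name, and a and b stand for the first-guess autocorrelations at lags 1 and 2.

```python
import numpy as np

def fit_alpha(a, b):
    """Return the alpha in [0, 1] minimizing the cost
    (a - alpha)^2 + 0.25 * (b - alpha^2)^2, where a and b are the
    first-guess autocorrelations at lags 1 and 2.  fit_acfcoef
    solves this in closed form; a fine grid search suffices here."""
    grid = np.linspace(0.0, 1.0, 100001)
    cost = (a - grid) ** 2 + 0.25 * (b - grid**2) ** 2
    return grid[np.argmin(cost)]

# For a perfect AR1 first guess, b = a**2, so the fit recovers a itself.
print(fit_alpha(0.6, 0.36))
# A noisy lag-2 estimate pulls the result only weakly (weight 0.25).
print(fit_alpha(0.6, 0.10))
```

The 0.25 weight on the lag-2 term reflects the larger sampling uncertainty of the autocorrelation estimate at lag 2, so a noisy lag-2 value perturbs the fitted α1 only moderately.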
CFU_spectrum: This function estimates the frequency spectrum of the xdata array together with its 95% and 99% significance levels. It has one compulsory argument: the xdata array. Its output is provided as a matrix with dimensions (number of frequencies, 4). The second dimension contains the frequency values, the power, the 95% significance level, and the 99% one. The spectrum estimation relies on an R built-in function and the significance levels are estimated by a Monte Carlo method. This function can be used to detect the potential periodic signals in the xdata array that might need to be filtered out prior to the computation of the effective number of independent data to avoid a violation of the “identically distributed” hypothesis.
CFU_filter: This function filters out of the xdata array the signal of frequency freq. The filtering is performed by a dichotomous search for the frequency around freq and the phase that maximize the signal to be subtracted from xdata. It has two compulsory arguments: an xdata array and the freq value. The maximization of the signal to be subtracted relies on a minimization of the mean-square differences between xdata and a cosine of given frequency and phase. As highlighted in section 4, the presence of a periodic signal induces a violation of the “identically distributed” hypothesis. Such a periodic signal can be filtered out with this function prior to the computation of the effective number of independent data.
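The least-squares fit underlying this filtering step can be sketched as follows. The sketch is in Python rather than R and, for simplicity, fits the cosine at the given frequency by linear regression on cosine and sine terms instead of refining the frequency and phase by a dichotomous search; filter_frequency is our own name, and the monthly-like test signal is an arbitrary illustration.

```python
import numpy as np

def filter_frequency(x, freq):
    """Remove the periodic signal of frequency freq (cycles per time
    step) from x: regress x on cosine and sine terms at that frequency
    and subtract the fitted oscillation.  A least-squares sketch of the
    role of CFU_filter, which additionally refines frequency and phase."""
    t = np.arange(len(x))
    design = np.column_stack([np.cos(2 * np.pi * freq * t),
                              np.sin(2 * np.pi * freq * t)])
    coef, *_ = np.linalg.lstsq(design, x - x.mean(), rcond=None)
    return x - design @ coef

rng = np.random.default_rng(3)
t = np.arange(120)
x = 2.0 * np.cos(2 * np.pi * t / 12 + 0.4) + 0.3 * rng.standard_normal(120)
y = filter_frequency(x, 1.0 / 12.0)
print(x.std(), y.std())  # the periodic signal dominates x but not y
```

Regressing on both cosine and sine terms is equivalent to fitting a cosine of unknown phase, since A cos(ωt + φ) = A cos φ cos ωt − A sin φ sin ωt.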
CFU_fitautocor: This function can be used to minimize the mean-square differences between the true autocorrelation function of an AR1 and the first guess of the estimated autocorrelation function using any range of lags, but it is less computationally efficient than fit_acfcoef for the particular case of two lags recommended in this article. This function has one compulsory argument (the first guess of the estimated autocorrelation function, which can contain any range of lags) and two optional arguments [the window in which the output autocorrelation at lag 1 should lie (default: [−1; 1]) and the precision to which the output autocorrelation at lag 1 should be determined (default: 0.01)]. The estimation of the output autocorrelation at lag 1 relies on a dichotomous minimization of the mean-square differences between the true autocorrelation function of an AR1 and the first guess of the autocorrelation function.
REFERENCES
Anderson, T. W., 1971: The Statistical Analysis of Time Series. John Wiley and Sons, 704 pp.
Bayley, G. V., and J. M. Hammersley, 1946: The “effective” number of independent observations in autocorrelated time series. J. Roy. Stat. Soc. Suppl., 8, 184–197.
Fan, Y., and H. van den Dool, 2008: A global monthly land surface air temperature analysis for 1948–present. J. Geophys. Res., 113, D01103, doi:10.1029/2007JD008470.
Hansen, J., R. Ruedy, M. Sato, and K. Lo, 2010: Global surface temperature change. Rev. Geophys., 48, RG4004, doi:10.1029/2010RG000345.
Jones, R. H., 1975: Estimating the variance of time averages. J. Appl. Meteor., 14, 159–163, doi:10.1175/1520-0450(1975)014<0159:ETVOTA>2.0.CO;2.
Knight, J. R., R. J. Allan, C. K. Folland, M. Vellinga, and M. E. Mann, 2005: A signature of persistent natural thermohaline circulation cycles in observed climate. Geophys. Res. Lett., 32, L20708, doi:10.1029/2005GL024233.
Leith, C. E., 1973: The standard error of time-average estimates of climatic means. J. Appl. Meteor., 12, 1066–1069, doi:10.1175/1520-0450(1973)012<1066:TSEOTA>2.0.CO;2.
Palmer, T., R. Buizza, R. Hagedorn, A. Lawrence, M. Leutbecher, and L. Smith, 2005: Ensemble prediction: A pedagogical perspective. ECMWF Newsletter, No. 106, ECMWF, Reading, United Kingdom, 10–17. [Available online at http://www.ecmwf.int/publications/newsletters/pdf/106.pdf.]
Smith, T. M., R. W. Reynolds, T. C. Peterson, and J. Lawrimore, 2008: Improvements to NOAA’s historical merged land–ocean surface temperature analysis (1880–2006). J. Climate, 21, 2283–2296, doi:10.1175/2007JCLI2100.1.
Taylor, K. E., R. J. Stouffer, and G. A. Meehl, 2012: An overview of CMIP5 and the experiment design. Bull. Amer. Meteor. Soc., 93, 485–498, doi:10.1175/BAMS-D-11-00094.1.
Thiébaux, H. J., and F. W. Zwiers, 1984: The interpretation and estimation of effective sample size. J. Climate Appl. Meteor., 23, 800–811, doi:10.1175/1520-0450(1984)023<0800:TIAEOE>2.0.CO;2.
Trenberth, K. E., 1984: Some effects of finite sample size and persistence on meteorological statistics. Part I: Autocorrelations. Mon. Wea. Rev., 112, 2359–2368, doi:10.1175/1520-0493(1984)112<2359:SEOFSS>2.0.CO;2.
von Storch, H., and F. W. Zwiers, 2001: Statistical Analysis in Climate Research. Cambridge University Press, 484 pp.
Wilks, D. S., and R. L. Wilby, 1999: The weather generation game: A review of stochastic weather models. Prog. Phys. Geogr., 23, 329–357, doi:10.1177/030913339902300302.
Zwiers, F. W., and H. von Storch, 1995: Taking serial correlation into account in tests of the mean. J. Climate, 8, 336–351, doi:10.1175/1520-0442(1995)008<0336:TSCIAI>2.0.CO;2.