1. Introduction
Linear regression has long played an important role in weather and climate forecasting, both in empirical prediction models and statistical postprocessing of physics-based prediction model output (e.g., Glahn and Lowry 1972; Penland and Magorian 1993). Here we focus on the use of regression to correct and calibrate climate forecasts, although our findings are generally applicable to all regression-based forecasts. A regression between past model output and observations provides an effective means of combining available forecast model guidance and past model behavior to predict a future observation (e.g., Landman and Goddard 2002; Tippett et al. 2005; Vecchi et al. 2011). Describing a model forecast f and its corresponding observation o as random variables, the regression forecast distribution is the conditional distribution p(o | f), that is, the probability distribution of the observation o conditional on the forecast f (DelSole 2005). The regression forecast distribution, by definition, contains all the information about the observation that can be extracted from the model forecast. When the model forecast provides no useful information (i.e., the model forecast is independent of the observation), the regression forecast distribution p(o | f) is the same as the unconditional or climatological distribution p(o). The mean of the regression forecast distribution is the best estimate of the observation in the sense that it minimizes the expected squared forecast error, irrespective of the forecast and observation distributions. Additional statistics of the regression forecast distribution (e.g., variance) can be used to produce probabilistic forecasts (Kharin and Zwiers 2003).
The regression forecast distribution generally must be determined empirically from data. When forecasts and observations have a joint-Gaussian distribution, the regression forecast distribution p(o | f) is itself Gaussian and is specified completely by its mean and variance. Moreover, in this case, the regression forecast mean is given by linear regression, and the regression forecast variance is the error variance of the linear regression. Such Gaussian regression forecasts, where the mean and variance are known exactly, are reliable, essentially by construction as we show below (see also Johnson and Bowler 2009).
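To make this construction concrete, the following sketch (Python; the correlation of 0.6, the sample size, and the variable names are illustrative assumptions, not values from the paper) forms the Gaussian regression forecast distribution from known population parameters and checks its reliability with the probability integral transform.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Illustrative population parameters (assumed for this sketch)
rho = 0.6                      # correlation between forecast f and observation o
n = 100_000                    # number of forecast/observation pairs

# Joint-Gaussian (f, o) pairs with zero means and unit variances
cov = [[1.0, rho], [rho, 1.0]]
f, o = rng.multivariate_normal([0.0, 0.0], cov, size=n).T

# Regression forecast distribution p(o | f): Gaussian with
#   mean = b * f, where b is the population regression coefficient,
#   variance = error variance of the linear regression.
b = rho                        # = rho * sigma_o / sigma_f for unit variances
mean_of = b * f
var_of = 1.0 - rho**2

# Probability integral transform (PIT): uniform on [0, 1] when the forecast
# distribution is probabilistically calibrated (reliable).
pit = stats.norm.cdf(o, loc=mean_of, scale=np.sqrt(var_of))
print(np.histogram(pit, bins=10, range=(0.0, 1.0))[0] / n)
# Each decile frequency is approximately 0.1, consistent with reliability.
```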
Reliability is an aspect of forecast quality that measures the extent to which probabilistic forecasts capture the actual likelihood of the event being predicted. A fundamental challenge in assessing forecast reliability is that nature provides only a single outcome for a particular forecast. Consequently, a reliability assessment necessarily involves some pooling of forecasts. There are several ways of quantifying reliability (Gneiting et al. 2007). Probabilistic calibration is assessed via the probability integral transform or rank histogram (Hamill 2001). Marginal calibration refers to the equality of the forecast and observed climate. Exceedance calibration is defined in terms of thresholds; there is no empirical method in the literature for assessing exceedance calibration (Machete 2013).
In practice, the regression forecast distribution must be estimated from data; in the case of seasonal climate forecasts, those records are fairly short. Therefore, the impact of sampling error on the regression parameter estimates, and in turn on forecast quality, is an important consideration. In this paper we analyze the impact of sampling error on the reliability of Gaussian regression forecast distributions. In section 2 we present the Gaussian (linear) regression forecast, demonstrate its reliability with population parameters, and use those parameters to characterize the reliability of the underlying uncorrected forecast ensemble in a procedure that parallels detection and attribution analysis (Allen and Tett 1999). In section 3 we demonstrate that applying regression to seasonal precipitation forecasts improves reliability on dependent data but that sampling errors in the regression parameter estimates lead to unreliable, overconfident forecasts on independent data. We use synthetic data and theoretical arguments in section 4 to show that estimated regression forecast distributions are unreliable because of a positive bias in the forecast signal variance, and we characterize that unreliability as a function of sample size and skill level. A summary of the results is given in section 5.
2. Univariate Gaussian forecasts
In this paper we consider the case in which an ensemble of forecasts is available, the forecast ensemble mean and corresponding observation are scalars, and their joint distribution is Gaussian. We show that the resulting regression forecast distribution is reliable when the population regression parameters are known and that the regression parameters characterize the reliability of the underlying uncorrected forecast ensemble. In particular, the calibration function for the forecast ensemble is an explicit function of the regression parameters, and the requirements for the reliability of ensemble-based probabilities can be stated in terms of the regression parameters. We also show that testing the consistency of the sample regression parameters with the reliability requirements parallels the procedure used in detection and attribution analysis.
a. Reliability of Gaussian regression forecasts










b. Regression parameters and ensemble reliability







Only sample estimates of the regression parameters are available in practice. The consistency of the sample values with the hypothesis of a skillful and reliable forecast ensemble can be checked using linear hypothesis testing in a procedure that parallels that employed in detection and attribution analysis (Allen and Tett 1999). First, the observations are regressed onto the forecast ensemble mean and a confidence interval is obtained for the estimated regression coefficient; skill requires that this interval exclude zero, and signal calibration requires that it include one. A sketch of such a check appears below.
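The following minimal sketch assumes ordinary least squares and two-sided 95% intervals; the function name, the 29-yr synthetic sample, and the decision rules shown are illustrative, and the variance-calibration test used in the paper is not included.

```python
import numpy as np
from scipy import stats

def reliability_checks(ens_mean, obs, alpha=0.05):
    """Regress observations on the forecast ensemble mean and compare the
    fitted intercept and slope with the values implied by a skillful,
    reliable ensemble (intercept 0, slope 1).  A sketch only."""
    res = stats.linregress(ens_mean, obs)
    tcrit = stats.t.ppf(1.0 - alpha / 2.0, len(obs) - 2)
    b_lo, b_hi = res.slope - tcrit * res.stderr, res.slope + tcrit * res.stderr
    a_lo, a_hi = (res.intercept - tcrit * res.intercept_stderr,
                  res.intercept + tcrit * res.intercept_stderr)
    return {
        "mean_calibrated": a_lo <= 0.0 <= a_hi,     # intercept interval contains 0
        "skillful": not (b_lo <= 0.0 <= b_hi),      # slope interval excludes 0
        "signal_calibrated": b_lo <= 1.0 <= b_hi,   # slope interval contains 1
    }

# Illustrative 29-yr synthetic sample (assumed values, not CFSv2 data)
rng = np.random.default_rng(1)
signal = rng.standard_normal(29)
obs = signal + rng.standard_normal(29)
ens_mean = signal + 0.5 * rng.standard_normal(29)
print(reliability_checks(ens_mean, obs))
```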
c. Regression parameters and the calibration function

Fig. 1. Calibration functions (solid), linear interpolation approximation (dashed), and ideal (dotted) for (a) b = 0.5, (b) b = 2, (c) σ/σe = 2, and (d) σ/σe = 0.5.
3. Example: Seasonal precipitation forecasts
We now examine the reliability of seasonal precipitation forecasts made by a current coupled general circulation model. The distribution of seasonal precipitation, suitably transformed, has been shown to permit a Gaussian description (Tippett et al. 2007). Figure 2a shows the ranked probability skill score (RPSS) of the National Oceanic and Atmospheric Administration (NOAA) Climate Forecast System version 2 (CFSv2) forecasts of December–February (DJF) precipitation over North and South America, made the previous 1 November, for the period 1982–2010 (Saha et al. 2014). The forecast data are those provided to the National Multi-Model Ensemble project (Kirtman et al. 2014). Precipitation observations were taken from the Climate Prediction Center (CPC) Unified Gauge-Based Analysis of Global Daily Precipitation averaged onto a 1° × 1° grid (Xie et al. 2007; Chen et al. 2008). Forecast probabilities of the below-normal, normal, and above-normal categories are computed from ensemble frequencies of membership in the model-defined tercile-based categories; the ensemble size is 24. Although positive RPSS values are seen in ENSO teleconnection regions, there are substantial areas with negative RPSS values. The corresponding reliability diagrams for the above-normal and below-normal categories shown in Fig. 2c are computed by pooling all forecasts and locations and indicate that, overall, the forecasts are overconfident in the sense that the calibration function has a slope less than one. The pooling of forecasts in the reliability diagram means that the observed overconfidence cannot be ascribed to particular locations. We note that the calibration function intersects the ideal line near the climatological probability of ⅓, consistent with signal miscalibration.
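For reference, the sketch below shows one way the tercile-category probabilities and the RPSS could be computed at a single grid point from ensemble counts. The thresholds lower and upper are assumed to come from the model's own hindcast climatology, as described above, and climatological probabilities of 1/3 serve as the reference forecast; the function names are illustrative.

```python
import numpy as np

def tercile_probs(ens, lower, upper):
    """Ensemble-frequency probabilities of the below-, near-, and
    above-normal categories defined by the thresholds lower and upper."""
    p_below = np.mean(ens < lower)
    p_above = np.mean(ens > upper)
    return np.array([p_below, 1.0 - p_below - p_above, p_above])

def rps(prob, obs_cat):
    """Ranked probability score for one three-category forecast; obs_cat is
    the index (0, 1, or 2) of the observed category."""
    obs = np.zeros(3)
    obs[obs_cat] = 1.0
    return np.sum((np.cumsum(prob) - np.cumsum(obs)) ** 2)

def rpss(prob_list, obs_cats):
    """RPSS of a set of forecasts relative to climatology (1/3, 1/3, 1/3)."""
    clim = np.full(3, 1.0 / 3.0)
    rps_fcst = np.mean([rps(p, c) for p, c in zip(prob_list, obs_cats)])
    rps_clim = np.mean([rps(clim, c) for c in obs_cats])
    return 1.0 - rps_fcst / rps_clim
```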
Fig. 2. RPSS of CFSv2 forecasts of DJF precipitation made 1 November with probabilities computed using (a) ensemble frequencies and (b) linear regression. Map averages are shown in the titles. Reliability of probabilities computed using (c) ensemble frequencies and (d) linear regression. Inset histograms show the frequency of forecast probability issuance.
To estimate the regression forecast distribution, we first apply the square root transformation to both the observed and forecast precipitation values to reduce the positive skewness and then compute a regression between the transformed ensemble mean forecasts and observations at each grid point. Least squares estimates of the regression parameters are made using all 29 years of data. A map of RPSS values for the linear regression forecast distribution probabilities is shown in Fig. 2b, and values are positive in essentially the same locations as the RPSS of the ensemble frequency-based forecasts. However, where the ensemble frequency-based forecasts have negative RPSS values, the regression forecast distribution forecasts have near-zero RPSS. The corresponding reliability diagram in Fig. 2d indicates that the regression distribution forecasts are well calibrated with the occurrence frequencies being virtually equal to the forecast probabilities. The histograms of forecast probability issuance indicate that the regression distribution forecast probabilities are more tightly clustered around the climatological frequency than those of the ensemble frequency-based forecasts. Linear regression reduces forecast probability excursions from climatology and improves reliability.
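A sketch of the per-gridpoint procedure just described follows; the function and variable names are illustrative, and the use of observed climatological terciles (in transformed space) as category boundaries and of the regression error standard deviation as the forecast spread follows the Gaussian regression forecast of section 2.

```python
import numpy as np
from scipy import stats

def regression_tercile_probs(fcst_train, obs_train, fcst_new):
    """Tercile-category probabilities from a Gaussian regression forecast
    distribution fit to square-root-transformed precipitation.  A sketch."""
    x = np.sqrt(fcst_train)                 # transformed ensemble-mean hindcasts
    y = np.sqrt(obs_train)                  # transformed observations
    res = stats.linregress(x, y)            # least-squares fit
    resid = y - (res.intercept + res.slope * x)
    sigma_e = np.std(resid, ddof=2)         # regression error std deviation

    mu = res.intercept + res.slope * np.sqrt(fcst_new)          # forecast mean
    t_lo, t_hi = np.percentile(y, [100.0 / 3.0, 200.0 / 3.0])   # observed terciles
    p_below = stats.norm.cdf(t_lo, loc=mu, scale=sigma_e)
    p_above = 1.0 - stats.norm.cdf(t_hi, loc=mu, scale=sigma_e)
    return np.array([p_below, 1.0 - p_below - p_above, p_above])
```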
We can examine the calibration of the forecast ensemble on a spatial basis using the linear reliability hypothesis tests introduced in the previous section. Where the intercept confidence interval includes zero, the calibrated mean hypothesis cannot be rejected. Figure 3a shows that the intercept calibration hypothesis is rejected at the 95% level at relatively few points, mostly over South America. The no-skill hypothesis cannot be rejected where the 95% confidence interval for the regression coefficient includes zero; those locations are left uncolored in Fig. 3b.
Fig. 3. Colors indicate regions where the CFSv2 forecasts pass the tests for (a) mean calibration M and (b) skill s, skill and signal calibration s/S, and skill, signal, and variance calibration s/S/υ. Land points with no color in (b) indicate locations where the no-skill hypothesis cannot be rejected.
Unfortunately, the quality of these regression forecasts is overestimated because the regression parameters were estimated using the same data that are used to evaluate skill and reliability. Figures 4a and 4c show the RPSS and reliability, respectively, of regression forecasts developed in a leave-one-out cross-validated fashion. There is a modest decrease in RPSS but a substantial reduction in reliability, with a reappearance of overconfidence. The forecast issuance histograms show no marked changes. Some aspect of the cross-validation procedure results in a loss of reliability. Figures 4b and 4d show the RPSS and reliability, respectively, of regression forecasts in which only the regression coefficient is estimated in a cross-validated manner.
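A sketch of the leave-one-out procedure, assuming a single predictor at each grid point (the function and variable names are illustrative): each year is predicted with regression parameters estimated from the remaining years.

```python
import numpy as np
from scipy import stats

def loo_regression_forecast(x, y):
    """Leave-one-out cross-validated regression forecast mean and error
    standard deviation: year i is predicted using parameters estimated
    from all other years."""
    n = len(x)
    mean_cv = np.empty(n)
    sigma_cv = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        res = stats.linregress(x[keep], y[keep])
        resid = y[keep] - (res.intercept + res.slope * x[keep])
        mean_cv[i] = res.intercept + res.slope * x[i]   # forecast for year i
        sigma_cv[i] = np.std(resid, ddof=2)             # forecast spread
    return mean_cv, sigma_cv
```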
Fig. 4. As in Fig. 2, but for probabilities computed using (a) cross-validated linear regression and (b) cross-validated estimates of only the regression coefficient. Calibration functions for probabilities computed using (c) cross-validated linear regression and (d) cross-validated estimates of only the regression coefficient, respectively. Also shown in (c) and (d) are analytical approximations described in the text.
To avoid sources of signal miscalibration other than sampling error, such as the observations or forecasts not satisfying the assumptions of linear regression or some unknown degeneracy associated with the cross-validation procedure (Barnston and van den Dool 1993), we use synthetic data with specified distributions and theoretical estimates in the following section to analyze the reliability of estimated regression forecast distributions.
4. Reliability of estimated regression forecast distributions
a. Synthetic data
We design a synthetic data experiment that mimics the seasonal climate regression forecasts of the previous section. The use of synthetic data permits the population distributions to be specified and cross-validation to be avoided. We take the number of “grid points” to be 5000 and the training period to consist of 29 samples. Additional experiments (not shown) indicate that the number of grid points in the synthetic data affects only the calibration function error bars, with more grid points leading to smaller error bars. The population correlation of the forecasts with observations is taken to be uniformly distributed between 0 and a specified value rmax. Linear regression parameters are estimated from the 29 training samples using least squares. The resulting regression is used to make probability forecasts of the observations occurring in the lower tercile category for an independent verification dataset with 100 samples. Figure 5 shows the resulting reliability diagrams for rmax = 0.3 and rmax = 0.5 (Figs. 5a and 5b, respectively). While the probabilities are reliable during the training period (blue lines), they are not reliable during the verification period (red lines) and display overconfidence, as did the cross-validated CFSv2 regression forecasts. The overconfidence is relatively greater for rmax = 0.3. Applying cross-validation to one regression parameter at a time reveals that the overconfidence is almost entirely due to sampling error in the estimate of the regression coefficient.
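The sketch below reproduces the flavor of this experiment under the stated assumptions (unit-variance joint-Gaussian forecasts and observations, lower-tercile probabilities from the fitted Gaussian regression distribution); the probability binning of the calibration function and all variable names are illustrative choices rather than the paper's exact procedure.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_points, n_train, n_verif, rmax = 5000, 29, 100, 0.5

rho = rng.uniform(0.0, rmax, n_points)   # population correlation per "grid point"
q = stats.norm.ppf(1.0 / 3.0)            # population lower-tercile threshold

probs, hits = [], []
for r in rho:
    # Unit-variance joint-Gaussian forecasts f and observations o, correlation r
    f = rng.standard_normal(n_train + n_verif)
    o = r * f + np.sqrt(1.0 - r**2) * rng.standard_normal(n_train + n_verif)
    f_tr, o_tr = f[:n_train], o[:n_train]          # training period
    f_vf, o_vf = f[n_train:], o[n_train:]          # independent verification period

    res = stats.linregress(f_tr, o_tr)             # least-squares estimates
    resid = o_tr - (res.intercept + res.slope * f_tr)
    sigma_e = np.std(resid, ddof=2)

    mu = res.intercept + res.slope * f_vf          # estimated regression forecast mean
    probs.append(stats.norm.cdf(q, loc=mu, scale=sigma_e))  # P(o in lower tercile)
    hits.append(o_vf < q)

probs, hits = np.concatenate(probs), np.concatenate(hits)

# Crude calibration function: observed frequency within forecast-probability bins
edges = np.linspace(0.0, 1.0, 11)
for k in range(10):
    sel = (probs >= edges[k]) & (probs < edges[k + 1])
    if sel.any():
        print(f"forecast ~{edges[k] + 0.05:.2f}  observed {hits[sel].mean():.2f}")
# On the verification data the observed frequencies lie closer to 1/3 than the
# forecast probabilities, i.e., the estimated regression forecasts are overconfident.
```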
Fig. 5. Calibration functions for estimated regression distribution probability forecasts in the training data and in independent data using least squares (LS) and using ridge regression (RR). The maximum correlations between observations and forecasts are (a) 0.3 and (b) 0.5. An analytical approximation is also shown.
b. Theoretical estimates







5. Summary
For a given model forecast and verifying observations, the regression forecast distribution is the probability distribution of the observations conditional on the model forecast (DelSole 2005). The regression forecast distribution contains all the information about the observation that can be extracted from the model forecast and is the optimal correction of the model forecast. The regression forecast distribution produces reliable calibrated probability forecasts in the sense that the expected occurrence of an event is equal to its forecast probability. The relation between expected occurrence and forecast probability is summarized by the calibration function that appears in the reliability diagram (Murphy 1973). Here we considered joint-Gaussian distributed forecasts and observations, in which case the regression forecast distribution is also Gaussian. Additionally, we suppose that an ensemble of interchangeable model forecasts is available; by interchangeable forecasts we mean forecasts from a single model rather than from multiple models. Under these assumptions, we demonstrated that the parameters of the regression forecast characterize the skill and reliability of the underlying uncorrected forecast ensemble. In fact, the calibration function for the underlying forecast ensemble is an explicit function of the regression parameters. A consequence of this functional dependence is that examination of the calibration function can distinguish between overconfidence due to an overly strong signal and overconfidence due to an overly small variance. The consistency of estimated regression parameters with the hypothesis of ensemble skill and reliability can be assessed using linear hypothesis testing in a procedure that parallels detection and attribution analysis (Allen and Tett 1999).
Linear reliability hypothesis testing offers some advantages over the reliability diagram approach for Gaussian distributions. Calibration functions are generally computed nonparametrically and often require pooling forecasts from multiple locations and lead times. On the other hand, linear reliability hypothesis testing is parametric, requires less data, and thus permits finer assessments of reliability than do reliability diagrams. For instance, linear reliability hypothesis testing allowed us to characterize, on a spatially varying basis, the reliability of a 29-yr set of CFSv2 ensemble seasonal precipitation forecasts for North and South America.
The regression forecast distribution generally must be estimated from data. Least squares estimates of the regression forecast mean and variance are unbiased. However, sampling error in the estimate of the mean causes a positive bias in the regression forecast signal variance that manifests itself in overconfident forecast probabilities. This result is consistent with that of Richardson (2001), who noted that sampling error due to finite ensemble size introduced unreliability into probability forecasts. No comparable bias is seen in the noise variance. The level of sampling error and the corresponding signal bias depend on the sample size and the underlying level of skill. We analytically estimated the signal bias, as well as its impact on the calibration function, as a function of the sample size and population correlation skill. For the low-skill, small-sample situations typical in climate forecasting, there is substantial sampling error, leading to a substantial positive bias in signal variance and overconfident probability forecasts. With larger forecast sample sizes and/or higher average correlation skill, the positive bias of the signal variance lessens until it becomes inconsequential. The reality of the signal bias and the accuracy of the calibration function estimate were demonstrated using idealized numerical simulations with synthetic data and CFSv2 ensemble seasonal precipitation forecasts. The impact of this bias on reliability has not been previously recognized, although its impact on (out of sample) prediction error is known and provides the motivation for shrinkage methods such as ridge regression that reduce prediction error and improve reliability by decreasing the forecast signal variance.
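As an illustration of the shrinkage idea, a minimal single-predictor ridge regression sketch follows; the function name and the regularization parameter lam are illustrative assumptions, and this is not the specific ridge formulation used in the experiments above. The fitted slope, and hence the variance of the forecast signal, is reduced relative to ordinary least squares.

```python
import numpy as np

def ridge_slope(x, y, lam):
    """Single-predictor ridge regression slope on centered data.  The penalty
    lam shrinks the least-squares slope toward zero, reducing the variance of
    the fitted signal; lam = 0 recovers ordinary least squares."""
    xc, yc = x - x.mean(), y - y.mean()
    return np.sum(xc * yc) / (np.sum(xc ** 2) + lam)
```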
Acknowledgments
The authors thank two anonymous reviewers for their insightful and useful comments and corrections. MKT and AGB are supported by grants from the National Oceanic and Atmospheric Administration (NA05OAR4311004 and NA08OAR4320912) and the Office of Naval Research (N00014-12-1-0911). TD gratefully acknowledges support from grants from the NSF (0830068), the National Oceanic and Atmospheric Administration (NA09OAR4310058), and the National Aeronautics and Space Administration (NNX09AN50G). The views expressed herein are those of the authors and do not necessarily reflect the views of NOAA or any of its sub-agencies.
REFERENCES
Allen, M. R., and S. F. B. Tett, 1999: Checking for model consistency in optimal fingerprinting. Climate Dyn., 15, 419–434, doi:10.1007/s003820050291.
Allen, M. R., and P. A. Stott, 2003: Estimating signal amplitudes in optimal fingerprinting, Part I: Theory. Climate Dyn., 21, 477–491, doi:10.1007/s00382-003-0313-9.
Barnston, A. G., and H. M. van den Dool, 1993: A degeneracy in cross-validated skill in regression-based forecasts. J. Climate, 6, 963–977, doi:10.1175/1520-0442(1993)006<0963:ADICVS>2.0.CO;2.
Chen, M., W. Shi, P. Xie, V. B. S. Silva, V. E. Kousky, R. W. Higgins, and J. E. Janowiak, 2008: Assessing objective techniques for gauge-based analyses of global daily precipitation. J. Geophys. Res., 113, D04110, doi:10.1029/2007JD009132.
Copas, J. B., 1983: Regression, prediction and shrinkage. J. Roy. Stat. Soc., 45B, 311–354. [Available online at http://www.jstor.org/stable/2345402.]
DelSole, T., 2005: Predictability and information theory. Part II: Imperfect forecasts. J. Atmos. Sci., 62, 3368–3381, doi:10.1175/JAS3522.1.
DelSole, T., 2007: A Bayesian framework for multimodel regression. J. Climate, 20, 2810–2826, doi:10.1175/JCLI4179.1.
DelSole, T., and X. Yang, 2011: Field significance of regression patterns. J. Climate, 24, 5094–5107, doi:10.1175/2011JCLI4105.1.
DelSole, T., M. K. Tippett, and L. Jia, 2013: Scale-selective ridge regression for multimodel forecasting. J. Climate, 26, 7957–7965, doi:10.1175/JCLI-D-13-00030.1.
Glahn, H. R., and D. A. Lowry, 1972: The use of model output statistics (MOS) in objective weather forecasting. J. Appl. Meteor., 11, 1203–1211, doi:10.1175/1520-0450(1972)011<1203:TUOMOS>2.0.CO;2.
Gneiting, T., F. Balabdaoui, and A. E. Raftery, 2007: Probabilistic forecasts, calibration and sharpness. J. Roy. Stat. Soc., 69B, 243–268, doi:10.1111/j.1467-9868.2007.00587.x.
Hamill, T. M., 2001: Interpretation of rank histograms for verifying ensemble forecasts. Mon. Wea. Rev., 129, 550–560, doi:10.1175/1520-0493(2001)129<0550:IORHFV>2.0.CO;2.
Hastie, T., R. Tibshirani, and J. Friedman, 2009: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 767 pp.
Hoerl, A., and R. Kennard, 1988: Ridge regression. Encyclopedia of Statistical Sciences, Vol. 8, Wiley, 129–136.
Johnson, C., and N. Bowler, 2009: On the reliability and calibration of ensemble forecasts. Mon. Wea. Rev., 137, 1717–1720, doi:10.1175/2009MWR2715.1.
Kharin, V. V., and F. W. Zwiers, 2003: Improved seasonal probability forecasts. J. Climate, 16, 1684–1701, doi:10.1175/1520-0442(2003)016<1684:ISPF>2.0.CO;2.
Kirtman, B., and Coauthors, 2014: The North American Multimodel Ensemble: Phase-1 seasonal-to-interannual prediction; phase-2 toward developing intraseasonal prediction. Bull. Amer. Meteor. Soc., doi:10.1175/BAMS-D-12-00050.1, in press.
Landman, W. A., and L. Goddard, 2002: Statistical recalibration of GCM forecasts over southern Africa using model output statistics. J. Climate, 15, 2038–2055, doi:10.1175/1520-0442(2002)015<2038:SROGFO>2.0.CO;2.
Livezey, R. E., and W. Chen, 1983: Statistical field significance and its determination by Monte Carlo techniques. Mon. Wea. Rev., 111, 46–59, doi:10.1175/1520-0493(1983)111<0046:SFSAID>2.0.CO;2.
Machete, R. L., 2013: Early warning with calibrated and sharper probabilistic forecasts. J. Forecast., 32, 452–468, doi:10.1002/for.2242.
Mason, S. J., J. S. Galpin, L. Goddard, N. E. Graham, and B. Rajaratnam, 2007: Conditional exceedance probabilities. Mon. Wea. Rev., 135, 363–372, doi:10.1175/MWR3284.1.
Murphy, A. H., 1973: A new vector partition of the probability score. J. Appl. Meteor., 12, 595–600, doi:10.1175/1520-0450(1973)012<0595:ANVPOT>2.0.CO;2.
Murphy, A. H., and R. L. Winkler, 1977: Reliability of subjective probability forecasts of precipitation and temperature. J. Roy. Stat. Soc., 26C, 41–47. [Available online at http://www.jstor.org/stable/2346866.]
Penland, C., and T. Magorian, 1993: Prediction of Niño-3 sea surface temperatures using linear inverse modeling. J. Climate, 6, 1067–1076, doi:10.1175/1520-0442(1993)006<1067:PONSST>2.0.CO;2.
Richardson, D. S., 2001: Measures of skill and value of ensemble prediction systems, their interrelationship and the effect of ensemble size. Quart. J. Roy. Meteor. Soc., 127, 2473–2489, doi:10.1002/qj.49712757715.
Saha, S., and Coauthors, 2014: The NCEP Climate Forecast System version 2. J. Climate, 27, 2185–2208, doi:10.1175/JCLI-D-12-00823.1.
Tippett, M. K., L. Goddard, and A. G. Barnston, 2005: Statistical–dynamical seasonal forecasts of central southwest Asia winter precipitation. J. Climate, 18, 1831–1843, doi:10.1175/JCLI3371.1.
Tippett, M. K., A. G. Barnston, and A. W. Robertson, 2007: Estimation of seasonal precipitation tercile-based categorical probabilities from ensembles. J. Climate, 20, 2210–2228, doi:10.1175/JCLI4108.1.
Unger, D. A., H. van den Dool, E. O’Lenic, and D. Collins, 2009: Ensemble regression. Mon. Wea. Rev., 137, 2365–2379, doi:10.1175/2008MWR2605.1.
Vecchi, G. A., M. Zhao, H. Wang, G. Villarini, A. Rosati, A. Kumar, I. M. Held, and R. Gudgel, 2011: Statistical–dynamical predictions of seasonal North Atlantic hurricane activity. Mon. Wea. Rev., 139, 1070–1082, doi:10.1175/2010MWR3499.1.
Weigel, A. P., M. A. Liniger, and C. Appenzeller, 2007: The discrete Brier and ranked probability skill scores. Mon. Wea. Rev., 135, 118–124, doi:10.1175/MWR3280.1.
Xie, P., A. Yatagai, M. Chen, T. Hayasaka, Y. Fukushima, C. Liu, and S. Yang, 2007: A gauge-based analysis of daily precipitation over East Asia. J. Hydrometeor., 8, 607–626, doi:10.1175/JHM583.1.