## Abstract

Forecasters often develop rules of thumb for adjusting model guidance. Ideally, before use, these rules of thumb should be validated through a careful comparison of model forecasts and observations over a large sample. Practically, such evaluation studies are difficult to perform because forecast models are continually being changed, and a hypothesized rule of thumb may only be applicable to a particular forecast model configuration.

A particular rule of thumb was examined here: *d*prog/*dt.* Given a set of lagged forecasts from the same model all verifying at the same time, this rule of thumb suggests that if the forecasts show a trend, this trend is more likely than not to continue and thus provide useful information for correcting the most recent forecast. Forecasters may also note the amount of continuity of forecasts to estimate the magnitude of the error in the most recent forecast.

Statistical evaluation of this rule of thumb was made possible here using a dataset of forecasts from a “frozen” model. A 23-yr record of forecasts was generated from a T62 version of the medium-range forecast model used at the National Centers for Environmental Prediction. Forecasts were initialized from reanalysis data, and January–March forecasts were examined for selected locations. The rule *d*prog/*dt* was evaluated with 850-hPa temperature forecasts. A total of 2070 sample days were used in the evaluation.

Extrapolation of forecast trends was shown to have little forecast value. Also, the amount of discrepancy between short-term lagged forecasts provided only a small amount of information on forecast accuracy. The lack of validity of this rule of thumb suggests that others should also be carefully scrutinized before use.

## 1. Introduction

Numerical weather prediction (NWP) models grow increasingly sophisticated with each passing year. Unfortunately, an NWP model free of systematic error remains elusive. Weather forecasters often develop rules of thumb to adjust the guidance produced by NWP models. Sometimes a rule of thumb may become obvious from a small sample. If, say, Eta Model (Black 1994; Rogers et al. 1995, 1996; Mesinger 1996) forecasts are consistently too cold over snow every day for a month, a forecaster would certainly be wise to compensate for this bias until the model is improved. Nonetheless, human forecasters are fallible; their rules may appear to be appropriate from a relatively small sample of recent forecasts, but human judgment can often be a poor arbiter of statistical significance [see Gilovich (1993) for some interesting examples]. Ideally, forecasters should statistically validate their rules of thumb with a longer time series of forecasts. Practically, this takes time and effort, and a statistically robust sample may not be available, since operational weather prediction centers frequently update their weather forecast models.

Given these model changes, rules of thumb for adjusting model forecasts that can be applied regardless of the specific forecast model would be especially valuable. One potentially fruitful avenue for improving upon the latest numerical guidance is to consider multiple forecasts from the same model valid at the same time. Such *lagged-average forecasts* (LAFs; Hoffman and Kalnay 1983; Dalcher et al. 1988) have previously been shown to be useful for improving the skill of medium-range forecasts. For shorter-range forecasts, an evaluation of trends in lagged forecasts is often referred to informally as *d*prog/*dt.* Thus, one may see in forecast discussions that “temperature *d*prog/*dt* is negative,” meaning that more recent numerical forecasts are colder than older ones. Some forecasters may also view *d*prog/*dt* as a handy, model-independent rule of thumb: if forecasts are trending colder, does that not suggest that the most likely actual state is yet somewhat colder than the most recent forecast? Forecasters may also note the amount of continuity of these lagged forecasts as a judge of the likely error in the most recent forecast. Lagged forecasts that have been consistent are judged to be more accurate than ones that substantially differ from each other.

Because *d*prog/*dt* is often used as a rule of thumb regardless of the forecast model, it should be generally valid and testable with almost any model that can be run long enough to generate a statistically significant sample. Ideally, the most appropriate data to test would be the ones forecasters are now using. Hence, if forecasters are applying *d*prog/*dt* to 12-, 24-, and 36-h forecasts from the Eta Model, this model should be tested. However, the Eta Model is frequently modified at the National Centers for Environmental Prediction (NCEP), so a long history of forecasts from the current version of this model is not available. Consequently, we test the validity of the rule with a forecast model for which we do have a long time series of forecasts from the same model version: a reduced-resolution version of NCEP's Medium-Range Forecast (MRF) model. If *d*prog/*dt* cannot be validated here, its applicability to more complex models should be considered suspect until demonstrated statistically.

The data to test *d*prog/*dt* were generated at the National Oceanic and Atmospheric Administration–Cooperative Institute for Research in Environmental Sciences (NOAA–CIRES) Climate Diagnostics Center (CDC) in our “reforecasting” project (information available online at http://www.cdc.noaa.gov/~jsw/refcst). This project was undertaken in part to study whether significant improvements to forecast skill are possible if a very long time series of forecasts is available from a frozen model. Using this large training dataset, systematic model errors can be detected, and current forecasts using the same frozen model can be adjusted for these errors. We have thus far generated 23 yr of medium-range weather forecasts from a T62 resolution version of NCEP's MRF model (Kanamitsu 1989; Kanamitsu et al. 1991; Caplan et al. 1997; Wu et al. 1997). A single control forecast has been run forward for 2 weeks once every day from 0000 UTC initial conditions using the NCEP–National Center for Atmospheric Research (NCAR) reanalyses (Kalnay et al. 1996) from 1979 to 2001. Recently, we have also completed a 15-member ensemble of forecasts over the 23 yr. The reduced, T62 resolution was chosen so that the experiments could be conducted on the limited computer resources available at CDC.

The rest of this note consists of a brief examination of the skill of this forecast dataset, an examination of how much improvement can be obtained through lagged regression approaches, and an examination of the validity of the *d*prog/*dt* rules of thumb. We hope the reader will see beyond the specifics of testing *d*prog/*dt*; the broader point is that hypothesized rules of thumb need careful statistical evaluation.

## 2. Results

Our dataset consists of 1-, 2-, and 3-day control forecasts of 850-hPa temperature from January to March of 1979 to 2001. Sea level pressure forecasts were also examined but are not shown here; the results were both qualitatively and quantitatively similar. NCEP–NCAR reanalyses were used as verification data. For simplicity, regression corrections and the usefulness of *d*prog/*dt* were evaluated at a limited set of locations in the United States. These locations were the grid points nearest to Seattle, Washington; Los Angeles, California; Denver, Colorado; Minneapolis, Minnesota; San Antonio, Texas; Columbus, Ohio; Tampa, Florida; Cape Hatteras, North Carolina; and Portland, Maine. To minimize the direct effect of forecast bias and the annual cycle upon the analysis, a 31-day running mean climatology of the analysis state and the mean forecast state was computed for each of these locations using the full 23-yr dataset. These running means were subtracted from the analyses and forecasts prior to the subsequent examination.
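The anomaly step described above can be illustrated with a minimal Python sketch. The function names, the nested-list layout (`values[y][d]` for year `y` and day-of-season `d`), and the truncation of the 31-day window at the ends of the season are all assumptions made for illustration; they are not taken from the reforecast processing itself.

```python
def running_mean_climatology(values, half_window=15):
    """Return one climatological mean per day of the season.

    values[y][d] holds the value for year y and day-of-season d.
    Each mean pools all years and all days within +/- half_window
    of day d (31 days for half_window=15). The window is simply
    truncated at the ends of the season, which is an assumption.
    """
    n_days = len(values[0])
    climatology = []
    for d in range(n_days):
        lo = max(0, d - half_window)
        hi = min(n_days, d + half_window + 1)
        window = [row[k] for row in values for k in range(lo, hi)]
        climatology.append(sum(window) / len(window))
    return climatology


def to_anomalies(values, climatology):
    """Subtract the day-of-season climatology from every year."""
    return [[v - c for v, c in zip(row, climatology)] for row in values]
```

In this sketch the same two functions would be applied once to the analyses and once to the forecasts at each lead time, so that each series is expressed as a deviation from its own climatology.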

### a. Validity of extrapolating forecast trends

First consider the overall error statistics of these forecasts. Table 1 provides the root-mean-square (rms) error characteristics of the forecasts at the nine locations as a function of lead time.

As a baseline for evaluating the value of forecast trends, a simple univariate regression was performed to predict the 850-hPa temperature from just the 24-h forecast temperature. For this regression, a cross-validation approach was used (Wilks 1995): the regression constants were calculated separately for each of the 23 years, using the remaining 22 years as training data. Denote *T*_{pred} as the predicted 850-hPa temperature (deviation from observed climatology) and *T*_{24} the 24-h forecast (deviation from forecast climatology). The regression equation was of the form

$$T_{\mathrm{pred}} = \beta_0 + \beta_1 T_{24}.$$

The rms error of this univariate regression is also displayed in column 5 of Table 1. The errors are consistently slightly lower than those from the 24-h forecast itself. On average, there was an ∼0.07 K reduction in rms error.
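The leave-one-year-out procedure can be sketched as follows. This is a minimal illustration, assuming a regression with an intercept and a single slope on the 24-h forecast anomaly; the function names and the list-of-years data layout are hypothetical.

```python
import math


def fit_line(x, y):
    """Ordinary least squares fit of y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1


def cross_validated_rmse(t24_by_year, obs_by_year):
    """Rms error of regression-corrected forecasts.

    For each year, the coefficients are fit on all other years
    and then applied to the withheld year.
    """
    squared_errors = []
    for held_out in range(len(t24_by_year)):
        x_train = [v for y, row in enumerate(t24_by_year)
                   if y != held_out for v in row]
        y_train = [v for y, row in enumerate(obs_by_year)
                   if y != held_out for v in row]
        b0, b1 = fit_line(x_train, y_train)
        for fcst, obs in zip(t24_by_year[held_out], obs_by_year[held_out]):
            squared_errors.append((b0 + b1 * fcst - obs) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

Withholding a full year at a time, rather than single days, keeps serially correlated days from the same winter out of the training sample for the forecasts being verified.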

If there is value in the trend in lagged forecasts, inclusion of these trends ought to significantly improve the accuracy of these forecasts. Accordingly, denote (*T*_{48} − *T*_{24}) the trend between 48- and 24-h lagged forecasts valid at the same time, and similarly for (*T*_{72} − *T*_{48}). A cross-validated multivariate linear regression was performed of the form

$$T_{\mathrm{pred}} = \beta_0 + \beta_1 T_{24} + \beta_2 (T_{48} - T_{24}) + \beta_3 (T_{72} - T_{48}).$$

The rms errors of this multivariate regression are also displayed in the last column of Table 1. The inclusion of additional information on forecast trends made only a very small improvement to the skill of the forecasts; on average, the rms errors were only ∼0.02 K lower than those from the univariate regression. If one examines the distributions of regression coefficients produced via the cross-validation (not shown), the distributions of *β*_{2} and *β*_{3} typically overlapped zero, indicating little confidence that the optimal values for these coefficients were significantly different from zero.
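A regression with the two trend predictors can be fit by solving the normal equations, as in the minimal sketch below. It assumes an intercept, the 24-h forecast anomaly, and the two trends (*T*_{48} − *T*_{24}) and (*T*_{72} − *T*_{48}) as predictors; the function names are hypothetical, and the small Gaussian-elimination solver merely stands in for whatever least squares routine one prefers.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_trend_regression(t24, t48, t72, obs):
    """Return [b0, b1, b2, b3] for
    obs ~ b0 + b1*T24 + b2*(T48 - T24) + b3*(T72 - T48)."""
    rows = [[1.0, a, b - a, c - b] for a, b, c in zip(t24, t48, t72)]
    k = 4
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
           for i in range(k)]
    xty = [sum(r[i] * o for r, o in zip(rows, obs)) for i in range(k)]
    return solve_linear_system(xtx, xty)
```

Calling `fit_trend_regression` once per withheld year, on the remaining years only, yields one set of coefficients per year; the spread of the two trend coefficients across those fits is what indicates whether they differ meaningfully from zero.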

The reason for the limited value of extrapolating trends becomes more apparent from a scatterplot of 48–24-h forecast trends against the difference between the 24-h forecast and the analyzed state. Figure 1 provides this scatterplot; the difference in temperatures between 48- and 24-h forecasts valid at the same time is plotted along the *x* axis, the difference between 24-h forecasts and the verification along the *y* axis. There was little relationship between the forecast trend and the 24-h forecast error, as noted by the correlation coefficients near zero (plotted in the upper-left corner of each panel). The correlations were generally smaller yet if the trend was evaluated between 72 and 24 h, and the correlations were no larger if one examined the subset of cases where there was a consistent trend in the 72–48- and 48–24-h forecast tendencies.
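The correlation between forecast trend and forecast error can be computed as in the following sketch. The sign conventions here (trend as *T*_{48} − *T*_{24}, error as the 24-h forecast minus the verification) are assumptions chosen to match the notation used for the regression; the function names are hypothetical.

```python
import math


def pearson_correlation(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def trend_error_correlation(t24, t48, obs):
    """Correlate the 48-to-24-h trend with the 24-h forecast error."""
    trend = [f48 - f24 for f24, f48 in zip(t24, t48)]
    error = [f24 - o for f24, o in zip(t24, obs)]
    return pearson_correlation(trend, error)
```

If trends tended to continue, the trend and the remaining error of the latest forecast would be systematically related, and the magnitude of this correlation would be well above zero.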

### b. Estimating forecast skill from consistency

Is the consistency of forecasts useful for determining the accuracy of the most recent forecast? Figure 2 provides a scatterplot of the absolute difference between 48- and 24-h forecasts (*x* axis) and the mean absolute error (MAE) of the 24-h forecasts (*y* axis). Ideally, the larger the discrepancy between the 48- and 24-h forecasts, the larger the typical MAE should be. With *F* denoting the absolute difference between the 48- and 24-h forecasts, the MAE averaged over all forecasts with *F* ≤ 1, with 1 < *F* ≤ 3, and with *F* > 3 is also plotted in Fig. 2. Note that there was only a small difference between the average MAEs of forecasts with large and small discrepancies; the discrepancy in short-term lagged forecasts was thus only a slightly useful predictor of forecast skill.
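The binned averages can be sketched as follows. The thresholds of 1 and 3 come from the text and are assumed here to be in kelvins; the function name, the bin labels, and the choice to return `None` for an empty bin are hypothetical.

```python
def binned_mae(t24, t48, obs, edges=(1.0, 3.0)):
    """Mean absolute error of the 24-h forecast, grouped by the
    absolute discrepancy F between the 48- and 24-h forecasts.

    Bins: F <= edges[0], edges[0] < F <= edges[1], F > edges[1].
    """
    low, high = edges
    bins = {"small": [], "medium": [], "large": []}
    for f24, f48, o in zip(t24, t48, obs):
        discrepancy = abs(f48 - f24)
        abs_error = abs(f24 - o)
        if discrepancy <= low:
            bins["small"].append(abs_error)
        elif discrepancy <= high:
            bins["medium"].append(abs_error)
        else:
            bins["large"].append(abs_error)
    # An empty bin has no defined average.
    return {key: (sum(v) / len(v) if v else None)
            for key, v in bins.items()}
```

A useful consistency rule would show the average error rising clearly from the "small" bin to the "large" bin; nearly equal bin averages indicate that the discrepancy carries little information about forecast error.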

## 3. Conclusions

Weather forecasters develop rules of thumb to help them improve upon the numerical forecast guidance. Unfortunately, the human brain is often deceived into seeing patterns where there may be none (Gilovich 1993). Hence, a rule of thumb ought to be statistically validated before use, if this is possible. As an example of the potential problems with rules of thumb, we examined the usefulness of short-term lagged forecasts, that is, *d*prog/*dt.* Using data from a reduced-resolution version of NCEP's MRF model and NCEP–NCAR reanalysis initial conditions, *d*prog/*dt* was shown to have little validity as a forecast rule of thumb. Short-term temperature trends with this model should not be extrapolated, and the amount of discrepancy in lagged forecasts has only slight value for predicting the magnitude of forecast error.

Is this apparent lack of improvement a consequence of using this particular model, or the better NCEP–NCAR initial conditions? While rules of thumb are often model dependent, this particular rule seems to be applied across a variety of models and analysis systems. Following this same reasoning, *d*prog/*dt* should be carefully validated in other models rather than being used unquestioningly.

If not *d*prog/*dt,* then what? There are demonstrably valuable techniques for estimating forecast uncertainty and improving upon the skill of a single deterministic forecast. One such technique is commonly referred to as ensemble forecasting (Toth and Kalnay 1993, 1997; Molteni et al. 1996; Houtekamer et al. 1996). There is a smaller body of literature on the usefulness of ensembles for shorter-range forecasts. See Brooks et al. (1992) for a motivation for short-range ensemble forecasting and Hamill et al. (2000) for a literature review. Other recent synoptic evaluations of ensembles include Mullen and Buizza (2001, 2002), Wandishin et al. (2001), and Grimit and Mass (2002). Though there are many challenging problems that need to be addressed to improve these forecasts, such datasets should be more useful for evaluating the uncertainty of shorter-range forecasts. Readers who may have used *d*prog/*dt* but are looking for a more theoretically justifiable alternative are encouraged to consider the information from these ensemble studies and to examine the new short-range ensemble forecast guidance now being generated at NCEP.

## Acknowledgments

Matt Briggs (Cornell University) and Richard Grumm (National Weather Service, State College, Pennsylvania) are gratefully acknowledged for their consultation during the drafting of this manuscript. The reviews of Joseph Schaefer and two anonymous reviewers improved the quality of the final manuscript.

## Footnotes

*Corresponding author address:* Dr. Thomas M. Hamill, NOAA–CIRES CDC, R/CDC 1, 325 Broadway, Boulder, CO 80305-3328. Email: tom.hamill@noaa.gov