Statistical postprocessing methods have been successful in correcting many defects inherent in numerical weather prediction model forecasts. Among them, model output statistics (MOS) and perfect prog have been most common, each with its own strengths and weaknesses. Here, an alternative method (called RAN) is examined that combines the two, while at the same time utilizes the information in reanalysis data. The three methods are examined from a purely formal/mathematical point of view. The results suggest that whereas MOS is expected to outperform perfect prog and RAN in terms of mean squared error, bias, and error variance, the RAN approach is expected to yield more certain and bias-free forecasts. It is suggested therefore that a real-time RAN-based postprocessor be developed for further testing.
It is well known that forecasts of numerical weather prediction (NWP) models have certain defects that can be removed by statistically postprocessing their output (Wilks 1995). Two of the more popular postprocessing approaches are model output statistics (MOS) and perfect prog (Glahn and Lowry 1972; Klein et al. 1959), both of which are based on the idea of relating model forecasts to observations through linear regression. A great many variations on the original ideas have been proposed: Vislocky and Fritsch (1997), for example, include observations as both predictor and predictand, and Marzban (2003) additionally allows for nonlinear relationships among the various variables. In response to the necessity of deriving new regression equations every time an NWP model changes, Wilson and Vallée (2002) have developed an updateable version of MOS whose training set is a blend of “old” and “new” data. Vislocky and Young (1989) even consider perfect prog as a further postprocessor of MOS.
In spite of the development of increasingly more sophisticated versions of MOS and perfect prog, they have certain characteristics that have remained distinct, resulting in certain advantages or disadvantages of the respective approaches. For example, although MOS is known to remove the bias from NWP forecasts, its development generally requires not-readily-available large datasets involving both model variables and observations. The development of perfect prog, on the other hand, is simpler because it requires only observations as both predictor and predictand. However, perfect prog forecasts are not bias free. Meanwhile, Brunet et al. (1988) have shown that perfect prog outperforms MOS in short-term forecasts. Wilson and Vallée (2003) found that their updateable MOS outperforms both perfect prog and MOS, at least in forecasting 2-m temperature, 10-m wind direction and speed, and probability of precipitation.
MOS is known to maintain reliability but loses sharpness and converges to climatology for longer forecast projections. In ensemble systems, this is disadvantageous because an MOS-based ensemble forecast system fails to adequately forecast rare events. For this reason, some operational centers now rely on perfect prog as their method of choice in postprocessing ensemble outputs (Denis and Verret 2004). On the other hand, an MOS-type approach to postprocessing of ensemble forecasts has been developed (Stephenson et al. 2004), which shows improved forecasts in comparison with individual model forecasts as well as the ensemble mean.
In the development of all the variations on MOS and perfect prog, a limitation is the amount of data available for developing (or training) the regression model. A possible solution to this data shortage problem was proposed by Kalnay (2003). There, in appendix C.2, it is suggested to utilize reanalysis data (Kalnay et al. 1996; Kistler et al. 2001) to develop a postprocessor with the advantages of both MOS and perfect prog, but without the weaknesses due to limited training data. In this article, this method is referred to as RAN (for reanalysis). As described in the next section, RAN also has the added virtue of separating the loss of information between predictor and predictand into its components—one due to the inadequacies of the NWP model, and the other due to chaos in the atmosphere itself.
The object of this work is to compare MOS, perfect prog, and RAN within the confines of a simple statistical model that allows for exact/analytic conclusions. A databased comparison will be performed elsewhere. The approach assumes a single underlying numerical model, and not an ensemble of models. Finally, the postprocessors are for processing model data at a given station, and not for all points on a grid. The next section addresses the three approaches in more detail, and the mathematical framework is presented in section 3. Two subsections compare the three postprocessors in terms of bias, error variance, and uncertainty of the forecasts. The paper ends with conclusions and discussions in section 4. This last section also discusses questions regarding the effect (on the three postprocessors) of changes in the NWP models, as well as the consequences of the low resolution of the reanalysis data.
2. MOS, perfect prog, and the reanalysis method
First, it is worthwhile to examine MOS more closely. Figure 1 displays two schematic versions of MOS: one where the predictor and predictand are contemporaneous (MOS1), and one that allows for different times for the two variables (MOS2)—t for the forecast time, and T for the valid time. The left portion of the figures refers to the development stage of the equations, where linear regression equations are developed to relate the relevant quantities. The right portion displays the manner in which the developed regression equations are utilized for making forecasts. The vertical lines mark the initialization/analysis time. The variable x denotes a predictor, and y represents the predictand. The subscript “o” indicates observation, “m” stands for model, and “f” represents forecast. These definitions are tabulated in Table 1. As an example, one may think of x as the 700-mb height, and y as the surface temperature.
Which MOS in Fig. 1 is better? To answer that question, it is reasonable to assume that the best predictor of yo is xo, and not xm. This is so because xm can, at best, be equal to xo. In other words, if the NWP model is perfect, then xm = xo. Therefore, if the NWP model is perfect, then MOS1 will outperform MOS2, because at every point in time, the best predictor of y is the xo at that same time.
However, no NWP model is perfect. The value of xm agrees with xo at the initialization/analysis time, but thereafter diverges from xo.1 This loss of information/skill is only partially due to the imperfect nature of the model. There is another loss due to the chaotic nature of the atmosphere itself. That loss reflects itself in the divergence between xo and y as the time difference between them increases. In other words, quite independently of the model, the skill in forecasting y from xo decreases as a function of the time between them. If the loss of information due to the imperfect nature of the model is larger than that due to chaotic growth, then MOS2 will outperform MOS1.
In short, whether MOS1 is better than MOS2 depends on the rate at which the mutual information between x and y is lost in the atmosphere relative to the rate at which the model loses information as it propagates a state into the future. At one extreme, if the model is perfect, then MOS1 is better, because one would run the model forward all the way to the valid time, confident that xm at that time has not diverged from xo. At the other extreme, if the model is completely flawed, then y is better predicted from xm if the model is not run at all, for running the model forward for any duration would only worsen xm. Given that most models reside somewhere in the range between perfect and completely flawed, it follows that there exists an “optimal” time lag between the predictor and predictand.
The same argument applies to perfect prog (see Fig. 2). The assumption of perfect prog is that the NWP model is perfect in the sense that xm = xo for all times. Then one can run the model forward to the valid time (T), and produce a regression forecast (yf) based on the value of xm at that time. As such, it would be sufficient to produce the regression equations from contemporaneous values of xo and yo at the valid time. But if the model is completely flawed, then the model should not be run forward at all. It follows, again, that a generalization of perfect prog where the predictor and predictand are measured at different times may outperform the conventional perfect prog. Hereafter, “perfect prog” refers to this more general definition of the model.
The arguments presented above for allowing a time lag between predictor and predictand is predicated on the acknowledgment that information is lost in two ways—one due to model deficiency and another due to the atmosphere itself. This raises the possibility of modeling the two components, separately. In other words, as a first step, one may develop a regression equation that maps xm to xo. This regression model would capture only model deficiencies, since it maps the model value of a quantity to the observed value of that same quantity.2 The second step would then involve mapping xo to yo. Since this regression does not involve the model at all, it captures loss of information due to chaos.3 In practice, one may employ this two-step approach to produce a forecast for y. In a way, this method can be considered as a hybrid of MOS and perfect prog, since both xo and xm are employed in the forecasting y.
One drawback of this two-step approach is that it does not allow for x and y to be the same physical quantities, because the second step of the model would then involve mapping a variable to itself. This is a drawback because one would expect that the model forecast for, say, temperature is a better predictor of surface temperature than any other single predictor. To overcome this problem, one may replace xo with a “reanalysis” value of x. This will allow same-variable regression models, for at no point in the development stage will a variable be regressed onto itself. Meanwhile, the reanalysis value of x (xr) encapsulates knowledge of the observed value, and so the information in the observations is not lost. In this article, this reanalysis-based approach is denoted RAN. Note that during the training phase, the NWP model is utilized to provide the best estimate of the reanalysis, followed by a regression model to provide the best estimate of yo (Kalnay 2003).
Moreover, note that the optimal time lag for the three postprocessors (MOS, perfect prog, and RAN) may be mutually different. This is indicated in Figs. 1 and 2 by marking the time at which the forecast is made (t) at different points along the time line. As such, in comparing the three approaches, one must first derive the optimal value of t for each approach, and then compare the models at their respective optimal values of t. This is an important point and will be further addressed in the next section.
In summary, better postprocessors may be obtained if 1) the predictor is measured at some “optimal” time preceding the time at which the predictand is measured, and 2) the two-step approach involving the reanalysis field is employed. Figure 2 illustrates the general situation underlying the three statistical forecasting approaches. The subscript “r” represents the reanalysis value of the predictor (see Table 1). Note that the above arguments for allowing a time lag between predictor and predictands applies to each of the two steps in RAN, introducing an additional time parameter (τ). The aim of this article is to assess the relative expected performance of the three postprocessors displayed in Fig. 2.
The statistical model is developed over some historical data on predictors at time t, and predictands at time T. The perfect prog and MOS regression equations, and forecasts (yf) can be written as (see Fig. 2)
where the function f is usually a linear equation whose parameters are estimated from data. The same function with the estimated parameters is then utilized for making the forecasts yf for time T. It is assumed that there exists only a single predictor; the analysis can be generalized to the multiple predictor case in a straightforward manner.
The RAN method involves two regression equations, and one forecast equation:
where the functions f and g are linear functions whose parameters are again estimated from data. Note that for RAN, yf depends also on the “reanalysis time,” τ. In the following, this τ dependence will not be expressed unless necessary.
Although the choice of the performance measure is arbitrary, here only the mean-squared error, as well as its decomposition into bias and error variance, is considered. Therefore, the model parameters are estimated by minimizing the mean of the squared errors in Eqs. (1)–(3), and the verification measure is computed as
where the overbar denotes average. A useful decomposition of E is E = V + B2, where V and B are called the error variance and bias, respectively, that is,
where σ2[x(t)] denotes the variance of x(t). Good forecasts yield small values of V and B.
The minimization of E in (4) yields the well-known ordinary least squares (OLS) estimates:
where σ[x(t), y(T)] denotes the covariance between x at time t and y at time T. Note that these OLS estimates are denoted by the same symbol as the arbitrary parameters appearing in the regression Eqs. (6). Henceforth, any reference to the α, β parameters refers to their respective OLS values given in Eq. (7).
To bring transparency to the equations, it is convenient to introduce the quantities
where r[x, y] is the linear correlation coefficient between x and y, defined as
Given that r(x, y) assesses the linear correlation between a pair of variables (usually a predictor and a predictand), the quantities ρP, ρM, and ρR also assess the “correlation,” but in a generalized sense; if the variables are scaled to have unit variance, then the ρ's are determined entirely from the r's.
Then V and B in Eq. (5) for the three methods become
where the various times (t, T, τ) have been supplemented with an index—P, M, R—denoting the respective approaches (perfect prog, MOS, and RAN). These time parameters represent the respective optimal times for each method, as discussed in section 2.
It is worth mentioning that BR(τR, tR, T) = 0 only if one assumes that the average of xm over the training data is equal to that over the data of the forecasting phase. In other words, it is assumed that for the left portion of Fig. 2 is the same as that of the right portion. The same is assumed for . This is a reasonable assumption if the sample employed for training is representative of the population.
a. Comparison of bias and variance
With a bit of algebra, one can show
where, for brevity, the dependence of the ρ quantities on the valid time T has been suppressed. This form of the equations is particularly useful, because the last term on the right-hand side of all three equations involves only ρM evaluated at different times. The remaining terms on the right-hand side are all perfect squares, and hence nonnegative.
For example, consider the first equation in (14). The first term on the right-hand side is nonnegative. The second term is the difference between ρM at time tM and time tP. However, by the definition of tM, it is the time at which the mean squared error for MOS is a minimum. But since the bias is zero at that time [Eq. (13)], then VM, too, is minimized at time tM. Finally, based on Eq. (12), ρM is maximized at time tM. In short, ρM(tM) ≥ ρM(tP), which implies that the right-hand side of the first equation in Eq. (14) is nonnegative. It follows that VP(tP) ≥ VM(tM), that is, MOS outperforms perfect prog in terms of the variance of the forecast errors.
The same argument applies to the second equation in (12), which in turn, implies that VR(τR, tR) ≥ VM(tM), that is, MOS outperforms RAN in terms of the variance of the forecast errors.
Given that the bias of MOS forecasts is zero, it then follows that MOS outperforms both perfect prog and RAN in terms of bias and error variance (and mean squared error).4
b. Comparison of uncertainty
It is now possible to compute the uncertainty associated with the forecasts, namely, the variance of the forecasts.5 For brevity, the time dependence of the various quantities is suppressed in this section, because the central quantity in the analysis of these uncertainties is the sample size of the data. Note that the three approaches may be based on samples of different size. Let the sample size of the data employed in MOS be NM, and that in perfect prog NP. As for RAN, the samples in the two steps of that approach may again have different sizes. In practice, the first step of the approach involves model predictor, xm, and as such will have a sample size of NM. The second step of RAN involves reanalysis data, and for that reason, its sample size is denoted NR.
The uncertainties can be computed as
A comparison of MOS and perfect prog uncertainties in Eq. (15) suggests that the latter may in fact have smaller uncertainties when NP > NM, at least for some values of xm. However, this slight advantage is offset by the fact that perfect prog forecasts are not bias free [Eq. (13)].
As for MOS and RAN, it follows from (15) that, if NR > NM, then the uncertainty in RAN forecasts is smaller than that of MOS forecasts, for all values of xm. Therefore, if the second step of RAN (i.e., the step with reanalysis predictors) is based on a larger dataset than the first step (i.e., the step based on model predictors), then RAN forecasts have lower uncertainty than MOS forecasts. In short, if the reanalysis data are larger than NWP model data, then RAN forecasts have lower uncertainty than MOS and perfect prog forecasts. Therefore, as far as uncertainty is concerned, RAN outperforms the alternatives.
4. Conclusions and discussion
MOS is expected to outperform perfect prog and RAN in terms of bias, error variance, and mean squared error. However, the uncertainty of MOS forecasts may be hindered by the limited size of available model data. Perfect prog is less restrictive because its development is not limited by the availability of model data; however, its forecasts are biased and have higher error variance than MOS forecasts. On the other hand, RAN forecasts are bias free and have lower uncertainty than MOS and perfect prog. In short, even though MOS has higher performance (in terms of bias and error variance) than perfect prog and RAN, the latter has lower uncertainty associated with its forecasts if its sample size is larger than MOS's sample size.
These findings are based on several assumptions, including the following: First, the underlying relations are assumed to be linear. It is this linearity that allows for analytic conclusions like those presented here. Nonlinear models do not lend themselves to such analysis, and are better assessed with real data. That task will be addressed in the future. Second, ordinarily, in regression modeling, the predictors are assumed to be error free, and only the target is assumed to have errors. In RAN, however, there are two regression models—one from a model variable to a reanalysis variable, and a second from the reanalysis variable to an observation. Here, the reanalysis variable is assumed to be error free during the first step, but with error during the second step. The more realistic analysis wherein both the predictors and predictands are assumed to have error is more complicated [see section 3.4 of Draper and Smith (1998)] and will be considered elsewhere.
Given all of these assumptions, it is important to emphasize that the results reported here regarding the relative performance of the three methods are only “expected” and not guaranteed. However, as shown here, there exists sufficient theoretical justification to develop and test a real-time, RAN-based postprocessor.
One question is whether changes in the numerical model call for retraining RAN. In other words, in an operational setting, is one expected to retrain RAN each time a numerical model undergoes some revision? The important point to note is that the difficulty in training MOS is in the requirement for large datasets involving observations. In RAN, however, observations play a role only in the second step, where they are mapped to the reanalysis data, not model data. When a new version of a model is released, only the first step of RAN must be retrained, but that is a relatively simple task (compared to MOS), given that it involves mapping the “new” model data only to the reanalysis data, that is, the reanalysis based on the “old” model compared to the “new” model, where a large amount of data exists on identical grids. Consequently, only a relatively short model rerun is required to retrain. This is especially important when dealing with rare phenomena, where MOS would require exceedingly large training sets mapping the “new” model to the observations. Moreover, it is possible to implement an updateable approach (Wilson and Vallée 2002) for RAN. As such, model changes are expected to have minimal effect on RAN in an operational setting.
Another question is whether or not there is an advantage in using the NWP model used for the reanalysis also for producing the forecasts. One might expect that the correlation between the reanalysis (i.e., modeled “truth”) and an NWP prediction based on the same underlying dynamics, physics, and numerical approximations should be higher than the correlation based on an entirely different NWP system. If true, this would constitute a severe restriction to RAN. However, it is also possible that the benefits of a sufficiently long training period (as is possible in RAN) may outweigh any drawbacks of using different models for reanalysis and forecasts. This is especially true if differences between models are “systematic.” This question will be further addressed in the future.
Finally, how does the relatively low resolution of the reanalysis data affect the practical utility of RAN? A RAN postprocessor will reproduce a response to larger-scale features but not very high resolution features (small eddies, thunderstorms, etc.). But that is precisely what one would desire in a postprocessor—the ability to calibrate to large-scale features that are highly predictable, rather than to “contaminate” the calibration with less predictable, small-scale features. (It is useful to note here that the NWS operational MOS was developed using a very low resolution version NWP model, following this reasoning.) The first step of RAN calibrates the model to the reanalysis, correcting the model representation of the large scale. The second step of RAN calibrates the reanalysis to the observations, ensuring that site-specific properties of the observations are appropriately represented.
David Stephenson is acknowledged for pointing out to us the connection of this work to the Gauss–Markov theorem. We also thank Frederic Atger, and Bertrand Denis for commenting on the role of MOS and perfect prog in ensemble-based forecast systems. Partial funding was provided under a cooperative agreement between the National Oceanic and Atmospheric Administration (NOAA) and the University Corporation for Atmospheric Research (UCAR). The views expressed herein are those of the authors and do not reflect the views of NOAA, its subagencies, or UCAR.
Corresponding author address: Caren Marzban, Center for Analysis and Prediction of Storms, University of Oklahoma, Norman, OK 73019. Email: firstname.lastname@example.org
To be exact, the value of xm never agrees with xo, but their difference is minimum at the initialization time.
This step is affected by more than model deficiencies; for example, observational and assimilation errors still come into play. However, for the present analysis, only two sources of “error” are considered—one due to an imperfect model, and another due to chaos in the atmosphere.
Technically, this regression models the dynamic relation between xo and yo, and so it captures all contributions unrelated to the NWP model. For example, even instrument error may be captured.
That MOS is better than the alternatives is related to the Gauss–Markov theorem, which states that the best linear unbiased estimator is the OLS estimator. David Stephenson is acknowledged for pointing out this connection.
The variance of the forecasts is to be contrasted with the error variance computed previously. The former is the variance of the forecasts, while the latter is the variance of the forecast errors (technically, residuals).