1. Introduction
In a recent paper (Applequist et al. 2002, hereafter referred to as AGPN), the authors examined the relative skills of various statistical methodologies for making probabilistic quantitative precipitation forecasts (PQPFs) at each of 154 stations over the central and eastern regions of the United States (Fig. 1). This was done for 24-h precipitation accumulations (0–24 h initialized at 1200 UTC) exceeding thresholds of 0.01, 0.05, and 0.10 in. during the cool season (December–March). The predictors used were Nested Grid Model (NGM) gridded analyses and predictions of various meteorological quantities for the period December 1992–March 1996 obtained from the National Center for Atmospheric Research (NCAR) archive (available online at http://dss.ucar.edu/datasets/ds069.5/). This method is known as model output statistics (MOS; Glahn and Lowry 1972). The statistical methodologies included linear regression, discriminant analysis, logistic regression, neural networks, and a classifier system with a genetic algorithm. For each methodology, the coefficients that relate the predictand to the predictors were determined from a training dataset consisting of NGM model output and observed precipitation accumulations for different combinations of three cool seasons. The skill of each methodology was tested on an independent dataset consisting of similar data for the other cool season. The primary finding was that, at all three thresholds, logistic regression had a significantly higher Brier skill score (Brier 1950) than linear regression (the benchmark methodology used by the National Weather Service) at the 99% confidence limits. The other methods, while generally better than linear regression, showed mixed results.
For each methodology, the coefficients assigned to the different predictors in our previous work were determined from the training dataset by maximizing the Brier skill score (BSS) and using a stopping rule to determine the model order at which the BSS failed to increase by a specified percentage. The present paper contains improved scores for linear regression and logistic regression obtained by choosing the model order using the generalized information criterion (GIC). In this work, the choice of coefficients for logistic regression is based on the use of the maximum likelihood method with a Fisher scoring algorithm. In addition, a third statistical methodology is examined in which the training data are used to pair the numerical model predictions of precipitation accumulation with observed accumulations to determine the probability of precipitation exceeding a given threshold for a given model-predicted accumulation. Finally, in response to questions raised about the independence of the scores at different stations used to determine the significance of our results, we present here an alternative significance test, the results of which support our earlier conclusion concerning the greater forecast skill of logistic regression over linear regression.
2. Statistical methods


a. Screening of predictors
In AGPN the programs for training and application of linear regression and logistic regression were written by the first author of that paper in such a way that they could handle large numbers of potential predictors. Rather than rewrite these programs to incorporate the generalized information criterion for determining model order, and the maximum likelihood approach for choosing the coefficients, we purchased the S-Plus statistical software package (MathSoft Inc. 1999), which already incorporates these methodologies. This package is, however, more limited in the number of potential predictors it can handle. Accordingly, we reduced the set of 234 potential predictors (see Table 1) to a pool of 20 predictors by keeping only those that correlated best with the actual rainfall amount in the training dataset for each of the 154 stations, 4 yr, and three thresholds (a total of 1848 sets with 20 predictors in each set). This drastic reduction in the initial pool of predictors is judged to be inconsequential to the final results: first because many of the predictors are highly dependent on one another, and second because, in our previous results, we found that our statistical models rarely required more than 10 predictors and that the same leading predictors, namely, model-predicted precipitation and the vertical average of relative humidity, were selected in almost all models at almost all stations.
b. Choice of coefficients for logistic regression






c. Determination of model order


d. The binning method
In this methodology, the model-predicted precipitation interpolated from the model grid to each station is the only predictor, R, was divided into six bins as follows: R = 0.0 in., 0.0 < R < 0.01 in., 0.01 ≤ R < 0.05 in., 0.05 ≤ R < 0.10 in., 0.10 ≤ R < 0.25 in., and 0.25 in. ≤ R. Other precipitation ranges and numbers of bins were tried and it was found that the results were not sensitive to the details. If, in the training dataset, any bin had less than five forecasts of precipitation falling within its range, we combined this bin with the next lower one. This situation occurred at some of the stations with drier climatologies.


3. Results
In order to make maximum use of our dataset, we employed cross validation (Elsner and Schmertmann 1994), where the coefficients corresponding to each model are determined using three years of training data and verified on the fourth. By choosing different combinations of three years among the four for training, verification forecasts were produced for four years of independent data. Table 2 compares the Brier skill scores for the methods used by AGPN [viz., linear and logistic regression using a stopping rule based on the Brier skill score to limit the number of predictors, and logistic regression using predictors chosen by principal component (PC) analysis] with the new methods (viz., linear regression and logistic regression using S-Plus software, and binning). Each of these scores is the mean of the individual scores for the 154 stations over the four independent cool seasons. Every method is seen to do better than climatology, with scores ranging from 0.378 to 0.510 and (with the exception of logistic regression using PC analysis) increasing as the threshold is increased from 0.01 to 0.10 in. However, it should be noted that Brier skill scores in other studies have been found to decrease at thresholds above 0.10 in. (Ebert 2001). At the lowest threshold, logistic regression using predictors determined from principal component analysis gives the highest BSS, while, at the higher thresholds, the binning method, which removes the bias from the numerical predictions of precipitation accumulation, scores best. The use of the S-Plus software, which incorporates the generalized information criterion for determining the model order, as well as the maximum likelihood approach for choosing the coefficients, improves the scores at all thresholds for both linear regression and logistic regression over those obtained by the use of a stopping rule. At the 0.01-in. threshold, however, logistic regression using predictors chosen by PC analysis still exhibits the greatest skill.
Following the approach in AGPN, we made a paired t test (Weiss and Hassett 1991) to determine, for each threshold, the extent to which the differences among the Brier skill scores corresponding to the different methodologies are significantly different from one another. The results of this test are shown in Tables 3–5. The numbers in the body of these tables are the t values, positive indicating that the method in the left-hand column has a higher mean BSS than the one listed along the top row, and negative indicating the reverse. A magnitude greater than 2.576 in the table indicates with at least 99% confidence that the method with the higher mean BSS is significantly more skilled than the one with the lower mean score. For the 95% and 90% confidence limits the corresponding criteria are 1.960 and 1.645, respectively. In each table, the numbers to the lower left of the main diagonal are mirror images of those to the upper right, with the signs reversed. All the numbers are retained here in order to facilitate comparisons between a single method and all the rest.
Table 3 gives the t values for the 0.01-in. threshold. We compare first the results for linear and logistic regression based on the use of the S-Plus software with those for the corresponding methodologies using a stopping rule based on the BSS reported in AGPN. The boldfaced number in the second row indicates that the BSS of 0.389 for linear regression using S+ (Table 2) is significantly higher at the 99% confidence level than the score of 0.378 obtained using our earlier method. The boldfaced numbers in the fourth row of Table 3 indicate that the BSS of 0.407 in Table 2 for logistic regression using S-Plus is significantly higher at the 99% level than is the score of 0.384 for the previous method (which uses a stopping rule based on changes in the BSS as predictors are added), but it is significantly lower at the 95% level than the score of 0.413 for logistic regression using the leading principal components as predictors.
The last row of Table 3 reveals that the score of 0.400 in Table 2 for the 0.01-in. threshold using the binning method is significantly higher at the 99% level than both scores for linear regression and the score for logistic regression using a stopping rule to limit the number of predictors, but significantly lower than the scores for logistic regression using the top 10 predictors from a PC analysis or using S-Plus. As indicated by the comparisons in the fifth row, logistic regression using PC analysis is clearly the method with the greatest overall skill at the 0.01-in. threshold, while the comparisons in the first two rows indicate that linear regression is the method with the lowest overall skill.
The t values for the 0.05- and the 0.10-in. thresholds are given in Tables 4 and 5, respectively. The boldfaced numbers in the second and fourth rows of each of these tables reveal that the Brier skill scores for both linear and logistic regression at this threshold corresponding to the use of the S-Plus software are significantly higher at the 99% level than their counterparts using our previous methodologies. Moreover, the comparisons in the last row of each table reveal that the Brier skill scores of 0.492 for 0.05 in. and 0.510 for 0.10 in. (Table 2) corresponding to the use of binning are significantly higher at the 99% level than those for all methodologies other than logistic regression using S-Plus software. Since the skill scores corresponding to binning and logistic regression using S-Plus software are not significantly different from each other at either the 99% or 95% level binning must be regarded as the method of choice for practical forecasting because it is computationally much more efficient than logistic regression using S-Plus or any other software. It should be emphasized that these conclusions are based on the use of the NGM model over the eastern half of the United States during the winter season. For other regions, other seasons, or the use of other numerical models, studies such as this one and the one by AGPN should be repeated.
Attributes diagrams (Murphy 1973; Wilks 1995), such as those shown in Figs. 2–4, provide insight into the relative performances of different methodologies. In particular, they display graphically the reliability of forecasts of different probabilities and the frequency with which they are made by each methodology.
A forecast methodology is considered perfectly reliable if, corresponding to each range of predicted probabilities, the observed frequencies of precipitation equaling or exceeding the threshold are equal to the predicted probabilities. The closer the points on a plot of observed frequency versus forecast probability are to the diagonal from (0, 0) to (1, 1), the more reliable are the forecasts. The upper graphs in Figs. 2–4 reveal that logistic regression forecasts on an independent dataset using S-Plus software are systematically more reliable at all three thresholds than the linear regression forecasts on the same dataset using the same software. While not shown here, the curves for binning are almost identical to those for logistic regression.
Good or perfect reliability is not, however, a sufficient condition for a forecast methodology to achieve a high BSS. It is necessary that the methodology also make a significant number of forecasts of probabilities that are not close to, or equal to, the climatological probability for each location. This is because the BSS is zero (i.e., no skill) when the forecast probability is equal to the climatological frequency of the event. A methodology with good reliability that gives probability forecasts far removed from the climatological frequency will have a significantly higher BSS than one with equally good reliability that gives probability forecasts close to the climatological frequency. The histograms at the bottom of Figs. 2–4 reveal that logistic regression gives many more forecasts far removed from the climatological frequency and that the linear regression forecast probabilities tend to be much closer to the climatological frequencies at all three thresholds.
The conclusions reached here about the greater reliability and greater frequency of forecasts far removed from climatology for logistic regression are similar to those reached in AGPN. The Brier skill scores are, however, higher, and the frequencies of logistic regression forecasts of 98% chance of precipitation equaling or exceeding the threshold are significantly greater with the use of the S-Plus software.
4. Sensitivity to the choice of the penalty weight A
In the statistics literature, two choices are recommended for the value of the penalty function, A, in (5). One is based on the Akaike information criterion (A = 2) and the other on the Bayesian information criterion [A = ln(n)]. As noted earlier, the former yields a higher-order model with more predictors than the latter. For a given application, the value that gives the best results depends on the nature of the problem and the structure of the data and must be determined empirically. In the present study, with n = 360, we compared both of these choices with a third choice (viz., A =
5. Significance test revisited
Since the completion of AGPN, some questions have arisen concerning our assumption that the differences between the Brier skill scores for linear regression and logistic regression at each of the 154 meteorological stations are independent of those at all the other stations. This is because of the expected spatial dependence of the predictors and the predictand associated with organized atmospheric disturbances. The assumption of independence led to the use of m = 154 in the paired t test for significance in that paper and in the earlier sections of the present paper. In order to address these concerns, we revisit the subject here using, for each methodology, only one score for each of the seven regions defined in AGPN. It will be recalled that the stations within each region were determined, with the use of factor analysis, as having meteorological characteristics that were similar to one another and different from those in the other regions. There can be no question, therefore, that the differences between the Brier skill scores for linear regression and logistic regression for each of the seven regions are independent of all the others. Moreover, to eliminate any concerns about the possible lack of independence of two nearby stations in adjacent regions, we eliminated all stations that are closer than 500 km to a station in an adjacent region. This reduced the total number of stations used for the new significance test to 88.
A Brier skill score was then computed for each of the seven regions using all of the remaining stations in the region, and a t test was performed on the paired differences of the scores for linear regression and logistic regression using the S-Plus software with m = 7. We regard the use of m = 7 as the most stringent possible significance test, since, in reality, there must be a larger number of stations over the eastern half of the United States that can be considered independent of one another. If significance at a sufficiently high level can be found using m = 7, we would feel even more confident in the conclusions.
The averages of the Brier skill scores for linear regression and logistic regression over the seven regions are given in Table 10. Since 88 stations were used to obtain seven regional scores, which were then averaged to get the values in Table 10, whereas 154 individual scores were averaged to get the scores in Table 2, the corresponding scores in the two tables are not identical. Nevertheless, as in Table 2, the results in Table 10 reveal higher skill scores for logistic regression at all three thresholds. For m = 7, the 90%, 95%, and 99% confidence levels are now 1.943, 2.447, and 3.707. The t values for the three thresholds and the corresponding confidence levels are given in the last two rows of Table 10. While the confidence levels are below 99%, they are sufficiently high in this most stringent test to support the conclusion that logistic regression is significantly more skillful in PQPF using predictors derived from the NGM over the eastern half of the United States at all three thresholds.
6. Conclusions
The results of this follow-up study suggest the following for probability forecasting of cool season precipitation accumulations at thresholds of 0.01, 0.05, and 0.10 in. using NGM model analyses and forecasts.
The BSS for linear regression forecasts can be improved with the use of the Akaike information criterion to determine the model order.
The BSS for logistic regression can also be improved by use of the Bayesian information criterion to determine the model order and the maximum likelihood method with a Fisher scoring algorithm to determine the coefficients. At the lowest threshold, however, the use of the leading principal components determined from the original predictor set of meteorological variables still gives the highest BSS when applied with logistic regression.
For PQPF, logistic regression yields significantly higher Brier skill scores than does linear regression. This is not surprising, since standard linear regression is based on the assumption that the predictand is normally distributed, whereas the predictand in this case (precipitation exceeding or not exceeding a prescribed threshold) has a binomial distribution (0 or 1).
At the 0.01-in. threshold, logistic regression with predictors chosen from among the leading PCs as predictors gives significantly higher Brier skill scores than does binning. At the two higher thresholds, the differences between the Brier skill scores for binning and for logistic regression using the Bayesian information criterion and the maximum likelihood function with a Fisher scoring algorithm (which yields the highest skill score for logistic regression) are not significantly different from each other. Since binning is much less computationally intensive than logistic regression, it is suggested that this is the method of choice at these thresholds.
Further significance testing supports the notion that logistic regression is significantly more skillful than linear regression at very high confidence levels.
Acknowledgments
This research was supported by NOAA under CSTAR Grant NA17WA1010, and in part by NSF and NOAA under USWRP Grant ATM 9714414, and by AFOSR AASERT Grant F49620-93-1-0531. The authors would like to thank the Geophysical Fluid Dynamics Institute at The Florida State University (FSU) for supplying the computer resources necessary for implementing and running the statistical models, and Dr. Jon Ahlquist of the FSU Meteorology Department for some very insightful discussions. Additionally, the data were provided by the Data Support Section, Scientific Computing Division, at the National Center for Atmospheric Research, which is supported by grants from the National Science Foundation.
REFERENCES
Agresti, A., 1990: Categorical Data Analysis. John Wiley and Sons, 558 pp.
Akaike, H., 1973: Information theory and an extension of the maximum likelihood principle. Proc. Second Int. Symp. on Information Theory, Budapest, Hungary, Akademiai Kaido, 267–281.
Applequist, S., Gahrs G. E. , Pfeffer R. L. , and Niu X-F. , 2002: Comparison of methodologies for probabilistic quantitative precipitation forecasting. Wea. Forecasting, 17 , 783–799.
Brier, G. W., 1950: Verification of forecasts expressed in terms of probability. Mon. Wea. Rev., 78 , 1–3.
Buizza, R., Hollingsworth A. , Lalaurette F. , and Ghelli A. , 1999: Probabilistic predictions of precipitation using the ECMWF Ensemble Prediction System. Wea. Forecasting, 14 , 168–189.
Du, J., Mullen S. L. , and Sanders F. , 1997: Short-range ensemble forecasting of quantitative precipitation. Mon. Wea. Rev., 125 , 2427–2459.
Ebert, E. E., 2001: Ability of a poor man's ensemble to predict the probability and distribution of precipitation. Mon. Wea. Rev., 129 , 2461–2480.
Elsner, J. B., and Schmertmann C. P. , 1994: Assessing forecast skill through cross validation. Wea. Forecasting, 9 , 619–624.
Glahn, H. R., and Lowry D. A. , 1972: Use of model output statistics (MOS) in objective weather forecasting. J. Appl. Meteor., 11 , 1203–1211.
Hamill, T. M., and Colucci S. J. , 1998: Evaluation of the Eta/RSM ensemble probabilistic precipitation forecasts. Mon. Wea. Rev., 126 , 711–724.
MathSoft Inc., 1999: S-Plus 2000 Guide to Statistics. Vol. 2. Data Analysis Products Division, MathSoft, 582 pp.
Murphy, A. H., 1973: A new vector partition of the probability score. J. Appl. Meteor., 12 , 595–600.
Schwarz, G., 1978: Estimating the dimension of a model. Ann. Stat., 6 , 461–464.
Wedderburn, R. W. M., 1976: On the existence and uniqueness of the maximum likelihood estimates for certain generalized linear models. Biometrika, 63 , 27–32.
Weisberg, S., 1985: Applied Linear Regression. John Wiley and Sons, 324 pp.
Weiss, N. A., and Hassett M. J. , 1991: Introductory Statistics. Addison-Wesley, 834 pp.
Wilks, D. S., 1995: Statistical Methods in the Atmospheric Sciences. Academic Press, 467 pp.

The 154 stations used for this study
Citation: Weather and Forecasting 18, 5; 10.1175/1520-0434(2003)018<0879:IRFPQP>2.0.CO;2

The 154 stations used for this study
Citation: Weather and Forecasting 18, 5; 10.1175/1520-0434(2003)018<0879:IRFPQP>2.0.CO;2
The 154 stations used for this study
Citation: Weather and Forecasting 18, 5; 10.1175/1520-0434(2003)018<0879:IRFPQP>2.0.CO;2

Attributes diagram for all 154 stations for the 0.01-in. threshold for linear regression and logistic regression using the S-Plus software. The heavy solid line represents perfect reliability. The dotted–dashed line represents “no resolution,” and the heavy dashed line represents “no skill.” The histogram below shows the frequency of the different probability forecasts made by linear and logistic regression for all 154 stations at the 0.01-in. threshold
Citation: Weather and Forecasting 18, 5; 10.1175/1520-0434(2003)018<0879:IRFPQP>2.0.CO;2

Attributes diagram for all 154 stations for the 0.01-in. threshold for linear regression and logistic regression using the S-Plus software. The heavy solid line represents perfect reliability. The dotted–dashed line represents “no resolution,” and the heavy dashed line represents “no skill.” The histogram below shows the frequency of the different probability forecasts made by linear and logistic regression for all 154 stations at the 0.01-in. threshold
Citation: Weather and Forecasting 18, 5; 10.1175/1520-0434(2003)018<0879:IRFPQP>2.0.CO;2
Attributes diagram for all 154 stations for the 0.01-in. threshold for linear regression and logistic regression using the S-Plus software. The heavy solid line represents perfect reliability. The dotted–dashed line represents “no resolution,” and the heavy dashed line represents “no skill.” The histogram below shows the frequency of the different probability forecasts made by linear and logistic regression for all 154 stations at the 0.01-in. threshold
Citation: Weather and Forecasting 18, 5; 10.1175/1520-0434(2003)018<0879:IRFPQP>2.0.CO;2

Same as in Fig. 2 except for at the 0.05-in. threshold
Citation: Weather and Forecasting 18, 5; 10.1175/1520-0434(2003)018<0879:IRFPQP>2.0.CO;2

Same as in Fig. 2 except for at the 0.05-in. threshold
Citation: Weather and Forecasting 18, 5; 10.1175/1520-0434(2003)018<0879:IRFPQP>2.0.CO;2
Same as in Fig. 2 except for at the 0.05-in. threshold
Citation: Weather and Forecasting 18, 5; 10.1175/1520-0434(2003)018<0879:IRFPQP>2.0.CO;2

Same as in Fig. 2 except for at the 0.10-in. threshold
Citation: Weather and Forecasting 18, 5; 10.1175/1520-0434(2003)018<0879:IRFPQP>2.0.CO;2

Same as in Fig. 2 except for at the 0.10-in. threshold
Citation: Weather and Forecasting 18, 5; 10.1175/1520-0434(2003)018<0879:IRFPQP>2.0.CO;2
Same as in Fig. 2 except for at the 0.10-in. threshold
Citation: Weather and Forecasting 18, 5; 10.1175/1520-0434(2003)018<0879:IRFPQP>2.0.CO;2
List of the variables in the predictor pool with the pressure level (in hPa) at which they were taken. Bracketed terms on the right-hand side indicate vertical averages, and the last five entries represent gridded binaries with the threshold value of each one denoted in brackets on the left-hand side


Brier skill scores for the six models (two linear regression models, three logistic regression models, and binning) at the three precipitation thresholds. The symbols BSS and S+ within the parentheses indicate the method of choosing the model order (Brier skill score or generalized information criterion, respectively). Here, PC represents the use of principal components as predictors


Paired difference t scores for the six models at the threshold of 0.01 in. Table is read as model 1 vs model 2 with model 1 on the left-hand side and model 2 across the top. Negative scores represent a better performance by model 2, and positive scores indicate a better performance by model 1. The threshold values for the 90%, 95%, and 99% confidence levels are 1.645, 1.960, and 2.576, respectively


Same as in Table 3 except for precipitation threshold of 0.05 in


Same as in Table 3 except for precipitation threshold of 0.10 in


BSSs for the three different thresholds for the linear regression and logistic regression models using three different values for the generalized information criterion [A = 2, A = ln (n), A = √n ]


Paired difference t scores for the 0.01-in. threshold for logistic regression and linear regression using the generalized information criterion [A = 2, A = n (n), A = √n ]


Same as in Table 7 except for the 0.05-in. threshold


Same as in Table 7 except for the 0.10-in. threshold


Averages of the seven regional BSSs, the corresponding paired difference t scores, and the confidence levels based on the limited sample of 88 stations


Geophysical Fluid Dynamics Institute Contribution Number 433.