1. Introduction
As has been noted in previous works (e.g., Czys et al. 1996; Bourgouin 2000), the fundamental physical processes that determine the hydrometeor phase, and thus the precipitation type, are well known. Nevertheless, the prediction of precipitation type remains a substantial forecast challenge, in large part because the relevant atmospheric variables, such as the vertical distribution of temperature and moisture, vertical motion, and ice nuclei, are not well known. Detailed temperature and moisture profiles are available only from a sparse rawinsonde network, while vertical motion and ice nuclei distributions are not observed at all (aside from during special field programs). Thus, we do not know precisely how accurately present numerical weather prediction models represent these properties. Moreover, the resolution of current operational models (Δx ≈ 10 km, Δz ≈ 100–500 m) is insufficient for detailed microphysical calculations, so postprocessing algorithms such as model output statistics (MOS) are frequently employed in forecasting precipitation type (e.g., Allen and Erickson 2001; Allen 2001). Hence, there are two large components of uncertainty in forecasts of precipitation type: the model-generated forecast sounding and the algorithm summarizing the microphysical processes (i.e., melting and freezing).
Medium-range ensemble forecasting has been operational for over a decade (Tracton and Kalnay 1993; Molteni et al. 1996), and, as computational power increases, ensembles are increasingly being used for mesoscale and smaller-scale forecasts. Previous work (Brooks et al. 1995; Hamill and Colucci 1997; Wandishin et al. 2001) called for these short-range ensembles to focus on short-range phenomena (e.g., precipitation and severe weather). Recent studies have taken up this call, applying the ensemble approach to forecasts of surface temperature and dewpoint (Stensrud and Yussouf 2003), surface winds in the complex terrain of the Pacific Northwest (Grimit and Mass 2002), and severe weather parameters (Stensrud and Weiss 2002). The uncertainties inherent in precipitation-type forecasts would seem to lend themselves well to the ensemble framework.
Cortinas et al. (2002) explored the accuracy of individual algorithms as well as of most-probable-type and probabilistic forecasts derived from the combination of the algorithms. Their study was done in a single-model context and found that no single algorithm was most accurate for all times and all precipitation types. Similarly, the most-probable-type forecast was not dominant, but it did consistently rank in the upper half when compared to the individual algorithms, which suggests that there is some benefit to be gained through the use of multiple algorithms.
The present study is an extension of Cortinas et al. (2002) to a full ensemble framework, that is, multiple algorithms combined with a short-range ensemble system. A description of the ensemble and algorithms is given in the following section. The experimental design is described in section 3, followed by the results in section 4. Sections 5 and 6 present some further discussion and conclusions, respectively.
2. Ensemble description
The ensemble used in this study consists of two components: a 10-member mixed-model, mixed-initial-condition ensemble and five separate precipitation-type algorithms applied in a postprocessing mode. The components are described briefly below.
a. Models
The operational short-range ensemble from the National Centers for Environmental Prediction (NCEP) serves as the model base for our combined ensemble (Du et al. 2004). The NCEP ensemble consists of 10 members using both the Eta Model (Black 1994) and the Regional Spectral Model (RSM; Juang 2000). Each model is run with a set of five initial conditions (ICs): one control and two pairs of regional bred modes (Toth and Kalnay 1997). The models use a horizontal grid spacing of 48 km. [As of summer 2004, the NCEP Short-Range Ensemble Forecast (SREF) system is run at 32 km.] The ensemble is run twice per day starting at 0900 and 2100 UTC out to 51 h, with output every 3 h. (During the experiment, NCEP extended the ensemble forecast lengths to 63 h but, for consistency, only the first 51 h were used in this study.)
b. Algorithms
This work is an extension of Cortinas et al. (2002), and so the same precipitation-type algorithms are used here, with the exception of the simple thickness algorithm, which is often unable to produce a forecast because of the lack of a 1000-mb level above the surface; this leaves the five algorithms described below. Only brief descriptions of the algorithms will be provided here; the reader is referred to the individual references or to Cortinas et al. (2002) for more complete descriptions.
The Ramer (1993; hereafter Rm) algorithm uses pressure p, temperature T, relative humidity RH, and wet-bulb temperature Tw to determine the ice fraction, I (0 ≤ I ≤ 1), of the precipitation at the surface. If Tw > −6.6°C in the precipitation generation layer, the hydrometeors are assumed to begin as supercooled liquid; otherwise, they are taken to be entirely ice. If I = 1 at the surface, that is, no melting has occurred, then snow is diagnosed. For partial melting (0.85 < I < 1), the diagnosis is ice pellets. If I < 0.04, then either rain (R) or freezing rain (ZR) is diagnosed, depending on whether the lowest-level Tw is above or below freezing, respectively. A mix of types is diagnosed for 0.04 ≤ I ≤ 0.85. Additionally, if Tw > 2°C at the lowest level, the hydrometeors are assumed to be completely liquid and rain is diagnosed.
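To make the sequence of thresholds concrete, the surface diagnosis step can be sketched as follows. This is only an illustration of the decision logic described above, not the Rm implementation used in the study; the function name and type labels are invented, and the downward integration that produces the ice fraction is omitted.

```python
def rm_surface_type(ice_fraction, tw_lowest_c):
    """Sketch of the Rm surface decision given the ice fraction I (0 <= I <= 1)
    and the lowest-level wet-bulb temperature (deg C). Thresholds follow the
    description in the text; the computation of I along the path is not shown."""
    if tw_lowest_c > 2.0:       # warm lowest level: hydrometeors completely liquid
        return "RA"
    if ice_fraction == 1.0:     # no melting has occurred
        return "SN"
    if ice_fraction > 0.85:     # only partial melting
        return "IP"
    if ice_fraction < 0.04:     # essentially complete melting
        return "RA" if tw_lowest_c > 0.0 else "ZR"
    return "MIX"                # 0.04 <= I <= 0.85
```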
The Baldwin et al. (1994; hereafter Bd) algorithm is a version of the algorithm currently used by NCEP. First, the initial hydrometeor phase is determined based on the Tw of the precipitation generation layer (i.e., the highest saturated layer). The algorithm then determines whether melting or freezing occurs based on the area (i.e., the depth and strength) of subsequent (looking downward) warm or cold layers, respectively. A phase change is diagnosed if the area of the warm or cold layer is of (an empirically derived) sufficient magnitude. Finally, the surface temperature is used to distinguish between rain and freezing rain.
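The notion of a layer's "area" (its depth and strength relative to a reference temperature) can be illustrated with a simple trapezoidal integration, as sketched below. This is a generic illustration with invented names; the empirical reference values and decision thresholds used by the Bd algorithm are not reproduced here.

```python
import numpy as np

def layer_area(tw_c, height_m, reference_c=0.0):
    """Illustrative layer 'area' in deg C * m: positive for a warm layer and
    negative for a cold layer, relative to reference_c. Inputs are the wet-bulb
    temperature profile (deg C) and the corresponding heights (m) of the layer."""
    tw_c = np.asarray(tw_c, dtype=float)
    height_m = np.asarray(height_m, dtype=float)
    return float(np.trapz(tw_c - reference_c, height_m))

# Example: a 500-m-deep layer averaging about +2 deg C gives an area near 1000 deg C m
print(layer_area([0.0, 3.0, 3.0, 0.0], [1000.0, 1150.0, 1350.0, 1500.0]))
```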
Conceptually similar to the Bd algorithm, the algorithm developed by Bourgouin (2000; hereafter Bg) is currently used by the Canadian Meteorological Centre (CMC). Whereas the Bd algorithm computes the area of warm and cold layers, the Bg algorithm computes the melting and freezing energies of these layers from a standard tephigram. The final hydrometeor phase is determined by comparing the melting and freezing energies. The original Bg algorithm allows for circumstances in which either freezing rain and ice pellets (IP) or rain and ice pellets are deemed equally likely. Following Cortinas et al. (2002), one of the two types is then chosen at random.
The Czys et al. (1996; hereafter Cz) algorithm was developed to distinguish between freezing rain and ice pellet environments; minor modifications were added by Cortinas et al. (2002) to diagnose rain and snow (S) as well. Precipitation type is determined by computing the ratio of the time an ice sphere remains in a warm layer to the time required to completely melt the sphere. The initial size of the ice sphere is fixed and was empirically determined by Czys et al. (1996) from radar reflectivity data. The melting time is determined from bulk properties of the warm layer. If no warm layer (i.e., a layer with average Tw > 0°C) exists, snow is diagnosed; rain is predicted if complete melting occurs and the surface temperature is above freezing.
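As an illustration only (not the authors' code), and assuming the residence and melting times have been computed elsewhere from the bulk layer properties, one plausible reading of the Cz decision step is sketched below.

```python
def cz_type(time_in_warm_layer_s, time_to_melt_s, has_warm_layer, t_surface_c):
    """Sketch of the Cz decision step based on the nondimensional ratio of the
    hydrometeor's residence time in the warm layer to its complete-melting time.
    Mapping partial melting to ice pellets, and complete melting over a
    subfreezing surface to freezing rain, is an assumption consistent with the
    algorithm's stated purpose of separating ZR and IP environments."""
    if not has_warm_layer:                       # no layer with average Tw > 0 deg C
        return "SN"
    tau = time_in_warm_layer_s / time_to_melt_s  # tau >= 1 implies complete melting
    if tau >= 1.0:
        return "RA" if t_surface_c > 0.0 else "ZR"
    return "IP"                                  # incomplete melting; refreezes below
```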
The Cortinas et al. (2002; hereafter Ct) algorithm is nearly identical to the Cz algorithm except that, instead of computing melting from the bulk properties of the warm layer, it interpolates the complete thermodynamic profile below 500 mb into 25-mb layers.
3. Experimental design
The above algorithms were applied to the ensemble forecast soundings from 93 runs (41 runs starting at 0900 UTC and 52 runs starting at 2100 UTC) between 1 January and 31 March 2002 and compared to surface observations. Following Cortinas et al. (2002), consideration was given only to sites at which precipitation was both observed and forecast by at least one ensemble member. (The algorithms assume that precipitation will occur and are designed to produce a forecast of the phase of that precipitation. Therefore, it is desirable to separate the problem of precipitation type from the problem of occurrence–nonoccurrence.)
One unfortunate result of this approach is that not all 100% probabilities are equal. Specifically, assuming the use of a single algorithm only, if four of the 10 members do not predict any precipitation, should unanimity among the other six forecasts be considered a 100% probability or a 60% probability? We have chosen to disregard nonprecipitating forecasts completely in the computation of the forecast probability and thus consider the above scenario a 100% forecast. However, the argument could plausibly be made that a 100% probability from 6 members does not inspire the same confidence as one from 10 members. Verification scores were recomputed with the required number of precipitating members increased from 1 to all 10 (not shown). Skill scores generally increased by 5%–10%, but there was no clear or consistent break in the curves on which one could base a recommendation (e.g., that five precipitating members be required). Also, as the requirement is made more stringent, more and more cases are discarded as not possessing enough members to make a forecast. Therefore, we have chosen the most lenient requirement (only a single precipitating member need be present) in order to increase the sample size. In an operational setting, the lack of precipitating members (i.e., the probability of precipitation occurrence) can be handled separately, or the precipitation-type probability could be multiplied by the probability of precipitation to yield an unconditional probability.
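For concreteness, the convention of ignoring nonprecipitating members when forming the type probability can be written as a short sketch; the member forecasts and labels below are hypothetical.

```python
from collections import Counter

def type_probabilities(member_forecasts):
    """Return precipitation-type probabilities from a list of per member-algorithm
    forecasts, e.g. ["SN", "SN", "ZR", None, ...], where None denotes a member
    that produced no precipitation. Nonprecipitating members are excluded from
    the denominator, so unanimity among six precipitating members is a 100%
    forecast even if the other four members are dry."""
    precipitating = [f for f in member_forecasts if f is not None]
    if not precipitating:
        return {}                 # no precipitating member: no type forecast is made
    n = len(precipitating)
    return {ptype: count / n for ptype, count in Counter(precipitating).items()}

# Example: 6 of 10 members precipitate, all as snow
print(type_probabilities(["SN"] * 6 + [None] * 4))   # {'SN': 1.0}
```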
Following Cortinas et al. (2002), an attempt is made to eliminate “easy” rain forecasts by neglecting situations in which the observed surface temperature is greater than 5°C. However, unlike Cortinas et al. (2002), who compared the forecasts to observations within a 3-h window, only observations matching the forecast valid time are used in this study. Cortinas et al. tested an unequal weighting of the algorithm output in constructing forecast probabilities, using preliminary results from the application of the algorithms to 25 yr of observed soundings, but found little difference between forecasts with equal and unequal weights. Since this is the first study to investigate extensively the potential use of ensembles for forecasting precipitation type, we employ the simpler approach of equal weighting.
Some algorithms can produce forecasts of mixed or unknown type. However, not all of the algorithms produce such forecasts, so for these cases a full ensemble cannot be constructed and no fair comparison between the algorithms is possible. Therefore, these forecasts (∼3% of all forecasts) are ignored. An observation of mixed type (e.g., S–IP) counts as a hit for a forecast of any of the observed types, but mixed-type observations make up only a small percentage (1%–2%) of the total set of observations.
Finally, in an effort to boost the sample size of the freezing rain and ice pellet categories, all forecast lead times are considered together when calculating performance measures. This choice does lead to some degradation in reported performance through the inclusion of inferior longer-lead-time forecasts, but this should not affect the conclusions qualitatively. The difference in performance between 3- and 48-h forecasts is typically less than 10% for the rain and snow forecasts. Generally, the differences for the freezing rain and ice pellet forecasts are about twice that amount, with some measures seeing considerable deterioration, but stable statistics could not be obtained for these events otherwise. The forecast degradation is qualitatively similar to that found by Cortinas et al. (2002, their Fig. 1).
4. Results
The challenge that forecasts of precipitation type present is illustrated by Fig. 1, which shows a 15-h forecast by the ensemble for Topeka, Kansas, valid at 1200 UTC 25 March 2002. Also shown is a meteogram for the seven surface reports centered on 1200 UTC that gives the hourly observations of temperature, dewpoint, and precipitation type. The forecast type from each of the five algorithms is given next to the ensemble member identifier. The entire column is subfreezing, but the surface and a warm nose around 800 mb are only barely so. Between them is a wedge of cold air, with the temperature at 850 mb about −9°C. The cloud top lies above the warm nose, around 750 mb at roughly −5°C. Thus, for much of the cloud depth the temperature is in an ambiguous region with respect to the formation of supercooled water droplets or ice. This is reflected in the meteogram, which shows that light snow was observed at the valid time, but light freezing rain occurred 2 h before and 1 h after the snow. This example also illustrates how timing errors of just an hour or two can produce false alarms in our study.
The ensemble forecasts reflect the ambiguous nature of the event, with 26 of the 50 forecasts being for snow, 12 for ice pellets, and 12 for freezing rain. All of the models reproduce the basic structure of the sounding, with a surface cold layer topped by a warm layer, but none of them captures the details with much fidelity. Most of the models do reasonably well with the surface temperature, but all of them place the inversion height too low, and four of the models (etap2, etan1, rsmn2, and rsmn1) are so warm above the inversion that the nose of this warm layer is erroneously above freezing. Of these four, only some of the algorithms for the rsmn2 member are able to correctly predict the snow, a result of the low cloud top predicted by that model, which keeps the entire cloud below the warm layer. Thus, the forecasts of snow from this model could be said to be correct, but for the wrong reason. Given a reasonably accurate representation of the column, all of the algorithms except the Ramer algorithm are able to correctly predict snow. The Ramer algorithm has the most restrictive condition for initiating precipitation as ice, requiring Tw < −6.6°C, a condition not met in these forecast soundings.
Thus, it is seen that precipitation-type forecasts depend greatly on the performance of both the algorithm and the underlying model. Disentangling the sources of forecast error is far from trivial, especially in view of the small sample size, and so will not be attempted in this paper. However, some understanding of the performance of the individual algorithms can be gleaned by examining the relevant ensemble subsets. Following Murphy (1993) and Murphy and Winkler (1987), performance will be assessed with two measures of goodness (a likelihood–base rate factorization and a calibration–refinement factorization) and a measure of value. Each of these will be explained briefly as it is introduced.
a. Attributes diagrams
Attributes diagrams (Hsu and Murphy 1986; Wilks 1995) plot the forecast probabilities against the frequency of event occurrence for the times when each forecast probability was issued, and thus they measure the reliability of the forecasts (e.g., the event should occur 30% of the time a forecast of 30% is issued). By including the climatological event frequency and the frequency with which each forecast probability was issued, attributes diagrams also convey information on the resolution (i.e., how different the forecasts are from climatology) and the skill of the forecasts.
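A minimal sketch of the binning that underlies such a diagram (leaving the plotting itself aside, and with an illustrative number of bins) is given below; it assumes binary observations and forecast probabilities between 0 and 1.

```python
import numpy as np

def reliability_table(forecast_probs, observed, n_bins=11):
    """For each probability bin, return (mean forecast probability, observed
    event frequency, number of forecasts). These triples are the points plotted
    in a reliability or attributes diagram."""
    probs = np.asarray(forecast_probs, dtype=float)
    obs = np.asarray(observed, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (probs >= lo) & (probs < hi)
        if hi == edges[-1]:                      # include forecasts of exactly 1.0
            in_bin |= np.isclose(probs, 1.0)
        if in_bin.any():
            rows.append((probs[in_bin].mean(),   # mean forecast probability in the bin
                         obs[in_bin].mean(),     # conditional observed frequency
                         int(in_bin.sum())))     # how often this probability was issued
    return rows
```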
Attributes diagrams for the full ensemble are shown in Fig. 2 for each of the four precipitation types. The first impression is that rainfall forecasts are extremely good, snow forecasts are reasonably good, and forecasts of freezing rain and ice pellets are awful. The reliability and skill of the snow forecasts are actually much closer to that for the rain forecasts than appears at first glance; in fact, the snow forecasts even show slightly higher skill. For these scores, the distance from the 45° diagonal is weighted by the relative frequency of use of each forecast probability, so the underforecasting of snowfall for moderate probabilities carries little penalty since these probabilities are rarely issued. In fact, only the extreme probabilities (e.g., 0%, 10%, 90%, and 100%) are consistently given for any of the precipitation types, so the forecasts exhibit the desired attribute of sharpness (Murphy 1993). Snow and rain forecasts both have skewed U-shaped distributions with the high end closer to climatology. Freezing rain and ice pellets are both extremely rare in this sample (about 3% and 1%, respectively) and accordingly forecasts for those events have an L-shaped distribution with the majority of forecasts being 0% and no forecasts being issued with probabilities above 50% for freezing rain and 30% for ice pellets. Still, the reliability curves show that these events are badly overforecast, particularly ice pellets.
Figure 3 gives attributes diagrams partitioned by model: the five Eta models (black) and the five RSM models (gray). The two subgroups perform very similarly for rain and snow forecasts. The slight reductions in rain forecast skill compared to the full ensemble (Fig. 2) result from an increase in the overforecast of higher probabilities and the underforecast of lower probabilities—the ensemble subgroups are overconfident. For snow forecasts, the drop in skill is primarily in the underforecast of the 0%–20% probabilities; that is, snow occurs too often when these low forecast probabilities are issued (e.g., event frequency ≈ 8% for 0% forecasts). Some separation in the performance between the two ensemble subgroups is seen for the freezing rain and ice pellet forecasts, with the Eta-based ensemble showing a greater decline in skill. The greater tendency of the Eta-based ensemble to overforecast freezing rain occurs for most forecast probabilities. The Eta-based ensemble actually has greater skill than the RSM-based subgroup for higher probabilities of ice pellets, but those forecasts occur so infrequently as to be negligible to the overall skill.
The ensemble size can be further restricted by applying the five algorithms to a single model run, in this case the control runs of the Eta and RSM (Fig. 4). Thus, the probabilities are generated solely by the different algorithms, each using the same model soundings. The resolution of the probability thresholds is therefore limited to increments of 20%. These curves extend the trend seen from the full ensemble to the five-member subgroups in both reliability and skill. Once again, the decline in skill for rain and snow forecasts is surprisingly small, while the skill of the freezing rain and ice pellet forecasts experiences a larger drop. As with the five-member subgroups, the RSM–control ensemble outperforms its Eta counterpart on freezing rain and ice pellet forecasts. Note that as the ensemble size is reduced from the full ensemble to the five-member subgroups to the single-model subgroups, the distribution of the forecast probabilities becomes sharper, with the 0% forecasts and, for rain and snow, the 100% forecasts being populated more often by the subgroups. This reflects the limited dispersion attainable by the ensemble subgroups.
Alternate ensemble subgroups can be formed by constructing separate 10-member ensembles for each of the five algorithms. Figure 5 shows the attributes diagrams for these ensemble subgroups. The range of skillfulness varies considerably with precipitation type. For snow forecasts, the algorithms each display the underprediction problem seen for the previous ensemble groups, with the Ct and Cz algorithms consistently worse across all probabilities. Similarly, each of the algorithms replicates the poor skill and overprediction of freezing rain and ice pellets. However, the ice pellet forecasts possess by far the widest range in skill, from the only marginally unskillful Rm and Bg algorithms, which outperform the full ensemble, to the woeful Ct forecasts. In contrast, the reliability of the rain forecasts varies considerably, from the underprediction of the Ct algorithm to the overprediction of the Cz algorithm, but since the curves straddle the perfect-reliability line, the range in skill is not as great as the visual spread of the curves. Note that the skill of these ensemble subgroups is comparable to that of the mixed-algorithm ensemble subgroups.
One point of particular interest is the relative behavior of the Ct and Cz algorithms. As noted, the Ct algorithm is based on Cz, with the primary difference between them being that Cz uses bulk characteristics of the warm layer to determine whether a falling ice particle has partially or fully melted, whereas Ct computes the melting rate as the particle falls through each level of the sounding. The interdependence of the four attributes diagrams (the combined bias of the four events must equal unity if only the conditional forecasts are considered) can be helpful in discerning the behavior of the algorithms.
The two algorithms produce nearly identical reliability curves for snow forecasts and perform similarly for freezing rain, although Cz predicts fewer freezing rain events than does Ct (or any other algorithm). The overprediction of ice pellets is much greater for Ct, however, and, as noted above, Ct underpredicts rain while Cz strongly overpredicts rain. The identical snow forecasts are a product of the fact that both algorithms first check for the existence of a warm layer (defined as any data point within the column with Tw > 0°C). The absence of such a layer leads to a forecast of snow, but the presence of such a layer, however small, leads to a forecast of one of the other three types. This likely accounts in good part for the large underprediction of snow by these two algorithms. To discern the differences associated with the other types, note that the Cz algorithm includes a hard trigger that converts ice pellets to rain if the surface wet-bulb temperature is above freezing. This explains why Ct predicts many more ice pellets than Cz and why Cz has a high bias in its rain forecasts. Both algorithms bring too many ice pellets to the surface (suggesting that either melting is too slow, the initial particle size is too large, or only minimal melting has occurred and wet snow is more likely than ice pellets), which Cz then converts to rain if the surface is warm (Tw > 0°C). Finally, the discrepancy in freezing rain forecasts (in which Ct predicts freezing rain while Cz predicts ice pellets) suggests that either melting occurs faster in the Ct algorithm or refreezing is more likely with Cz, with the former being more likely than the latter.
Bootstrap resampling (Wilks 1995) was performed to estimate confidence intervals for the Brier skill score (BSS) of each of the ensemble groups (Fig. 6, first column). A significant separation between two distributions requires only that the mean of one distribution not lie within the pluses (+) of the other distribution. Note that the benefit of using the full ensemble (F) is greater for the rain and snow forecasts than for the freezing rain and ice pellet forecasts. The optimistic view of this is that the full ensemble shows the most improvement for the more commonly issued forecasts; the pessimistic view is that the full ensemble does not provide much benefit where it is most needed, in forecasting rare events such as freezing rain (i.e., our models are fundamentally incapable of forecasting these events, and so adding more members will not help). In fact, two of the algorithms (Rm and Bg), though still not skillful for ice pellet forecasts, are significantly better than the full ensemble. The increase in skill when all five members (E, R) are used rather than only the control run (Ec, Rc) of the single model is consistent for all precipitation types, but the statistical significance is greater for rain and freezing rain than for ice pellets and snow. Except for freezing rain forecasts, there are no significant differences between the Eta- and RSM-based ensemble subgroups. Finally, there is no consistent signal as to whether one can achieve better skill by using a single model run and several algorithms or several model runs and a single algorithm. Rather, it depends on the single algorithm chosen. The Ramer (Rm) and Bourgouin (Bg) algorithm subgroups are generally better than the single-model subgroups (E, R, Ec, Rc) with a reasonably high level of confidence, while the other three algorithms (Bd, Cz, and Ct) are generally equal to or worse than the single-model subgroups. The notable exception is once again the freezing rain forecasts.
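The Brier skill score and a simple bootstrap of it can be sketched as follows. The resampling here treats forecast-observation pairs as independent, which ignores any spatial or temporal correlation in the real dataset, and the reference climatology is taken to be the sample base rate.

```python
import numpy as np

def brier_skill_score(probs, obs):
    """BSS = 1 - BS_forecast / BS_climatology, with climatology the sample base rate."""
    probs = np.asarray(probs, dtype=float)
    obs = np.asarray(obs, dtype=float)
    bs = np.mean((probs - obs) ** 2)
    base_rate = obs.mean()
    bs_clim = np.mean((base_rate - obs) ** 2)
    return 1.0 - bs / bs_clim

def bootstrap_bss(probs, obs, n_boot=1000, seed=0):
    """Resample forecast-observation pairs with replacement and return the
    resulting BSS distribution; percentile ranges of this array give confidence
    intervals. Assumes each resample contains at least one observed event."""
    rng = np.random.default_rng(seed)
    probs = np.asarray(probs, dtype=float)
    obs = np.asarray(obs, dtype=float)
    n = obs.size
    return np.array([brier_skill_score(probs[idx], obs[idx])
                     for idx in (rng.integers(0, n, n) for _ in range(n_boot))])
```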
b. ROC curves
Whereas the attributes diagram is based on the forecasts (i.e., given a particular forecast, what occurred?), the relative operating characteristic (ROC; Swets 1973; Mason 1982) curve is based on the observations (i.e., given that event A occurred, what was forecast?). In this way, ROC analysis measures the ability of a forecast system to discriminate between two event classes (e.g., rain–no rain). A ROC curve is obtained by plotting the probability of detection (POD, also referred to as the hit rate or true positive rate) against the probability of false detection (POFD, also referred to as the false alarm rate or false positive rate) for different forecast thresholds. That is, a separate 2 × 2 contingency table can be constructed for each of several forecast probability thresholds, from which (POFD, POD) pairs are obtained. The line from the upper-right corner of the diagram (1, 1) through each of the (POFD, POD) pairs to the lower-left corner (0, 0) constitutes the ROC curve. The area under this curve (AUC) is equivalent to the probability that a randomly selected “no” event will have a lower forecast probability than a randomly selected “yes” event (Hand and Till 2001). Alternatively, the AUC can be viewed as a measure of the separation between the distributions of the two event classes. If the distributions are assumed to have equal variance, then an AUC of 0.76 represents a one-standard-deviation separation of the means of the two distributions.
Different strategies for calculating the AUC have been proposed, with the two most common being the trapezoidal area under the discrete ROC curve (Mason 1982) and the integrated area beneath an empirically fitted continuous curve (Mason 1982; Harvey et al. 1992). The curve fitting is accomplished by transforming the points of the ROC curve to normal-deviate space, in which the ROC curve becomes a straight line. This continuous, fitted curve gives an estimate of the performance of an infinite-member ensemble based on the same underlying model as the actual ensemble (Richardson 2000). The advantage of using continuous ROC curves is that they are not sensitive to the number of ensemble members and thus remove that ambiguity when comparing different ensembles. The advantage of the trapezoidal method is that it gives information on the performance of the ensemble as it is configured rather than of an idealized infinite-member ensemble. For that reason, and because all ROC curves in this study were calculated after binning the forecast probabilities into an equal number of bins, thus alleviating the sensitivity to the number of members, the trapezoidal method is used in all AUC calculations.
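A sketch of how the (POFD, POD) pairs and the trapezoidal AUC might be computed from binned forecast probabilities follows; the threshold set in the example corresponds to 10% bins, and the synthetic data are purely illustrative.

```python
import numpy as np

def roc_points(probs, events, thresholds):
    """Return (POFD, POD) pairs, one per threshold; a 'yes' forecast is issued
    whenever the forecast probability meets or exceeds the threshold."""
    probs = np.asarray(probs, dtype=float)
    events = np.asarray(events, dtype=bool)
    points = []
    for t in thresholds:
        yes = probs >= t
        pod = (yes & events).sum() / events.sum()        # hits / observed events
        pofd = (yes & ~events).sum() / (~events).sum()   # false alarms / nonevents
        points.append((pofd, pod))
    return points

def trapezoidal_auc(points):
    """Area under the discrete ROC curve, anchored at (0, 0) and (1, 1)."""
    pts = sorted(points + [(0.0, 0.0), (1.0, 1.0)])
    x, y = zip(*pts)
    return float(np.trapz(y, x))

# Illustrative use with synthetic forecasts and 10%-spaced thresholds
rng = np.random.default_rng(1)
events = rng.random(1000) < 0.3
probs = np.clip(0.3 * events + 0.7 * rng.random(1000), 0.0, 1.0)
print(trapezoidal_auc(roc_points(probs, events, np.arange(0.1, 1.01, 0.1))))
```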
Figure 7 shows the ROC curves for the full ensemble as well as the Eta- and RSM-based five-member subgroups for each of the precipitation types. Climatological forecasts would lie along the (no skill) diagonal. Once again, the ensemble and its subgroups do much better forecasting rain and snow than freezing rain or ice pellets with little difference between the full ensemble and the five-member subgroups. Despite the strong similarity in the ROC curves (especially for rain and snow) between these three groups, the differences between the corresponding AUCs are statistically significant except for ice pellets (Fig. 6), highlighting the fact that statistically significant differences are not necessarily of practical importance. A true assessment of whether a statistical improvement is of practical importance would require a detailed examination of each particular forecast user. One advantage of the full ensemble is that it consistently produces a better spread of its (POFD, POD) pairs. Specifically, the full ensemble provides information for a wider range of POFD scores and thus may be useful to a larger set of users with different levels of false alarm tolerance.
ROC curves for the five algorithm-based ensembles are shown in Fig. 8. In general, the ROC points from the different forecast systems fall along the same curve, with the differences in area resulting primarily from how far along the curve (in a clockwise sense) the points lie. For example, the largest AUC for ice pellet forecasts is for the Cz-based ensemble because its POD is near 0.6 for 10% forecasts; the Rm-based ensemble achieves a POD of only around 0.1. Thus, users could capture many more events by using the Cz-based ensemble rather than the Rm-based ensemble at the cost of a modest increase in the number of false detections. The exceptions to this general pattern are the Bd algorithm for snow forecasts, which is slightly, though significantly (see Fig. 6), inferior to the other ensemble groups, and the strikingly different Ct curve for ice pellets. Recall that the Ct algorithm grossly overforecasts ice pellets. The ROC curves for ice pellets suggest that relatively more of the Ct forecasts fail to materialize, resulting in a higher POFD value for the same POD. This is particularly noticeable for the higher-probability forecasts, which lie along the diagonal in the lower-left corner. Nonetheless, Ct ice pellet forecasts could still be useful for some users; note that the 10% forecast point (the point farthest to the right along the Ct curve) has a higher POD value than any point on the other curves, so a user with a higher false alarm tolerance would prefer the Ct ensemble.
The bootstrapped AUC distributions for the different ensemble groups are markedly different from those for the BSS, particularly among the algorithm-based subgroups. For example, the Rm and Bg algorithms are consistently as good as or better than the others as measured by the BSS, but they are not superior as measured by the AUC. This is particularly evident for the ice pellet forecasts: Rm and Bg are clearly more skillful than the other algorithms, yet they (especially Rm) possess significantly less ability to discriminate between ice pellet and non-ice-pellet events. Despite the lack of Brier skill for ice pellet forecasts, all of the ensemble groups possess greater discriminating ability than climatology (AUC > 0.5), and some of the groups are able to separate the means of the event and nonevent distributions by more than one standard deviation (AUC > 0.76), the standard for “reasonable” discrimination.
ROC curves and attributes diagrams assess different aspects of forecast quality, so it should not be assumed that they would suggest identical conclusions about the relative performance of different forecast systems. Still, the discrepancy in the assessment of the ice pellet forecasts is striking. One possible explanation lies with the fact that all of the ensemble groups possessed a strong overforecast bias. Overforecasting rotates the reliability curves below the no-skill line (e.g., Figs. 2–5), resulting in a dismal BSS (Fig. 6). Also, ice pellets occur rarely in this sample, which makes climatology, the standard against which the ensemble forecasts are judged for the BSS, a very difficult forecast to beat. For an extremely rare event one can be highly accurate simply by forecasting “no” every time; for an event with a 1% base rate, for example, that strategy already attains a Brier score of only about 0.01. In this situation, a forecast system is severely punished by false alarms. ROC curves, on the other hand, depend solely on the ability of the forecast system to order the cases (events should receive higher probabilities than nonevents), making them insensitive to both the forecast bias and the event frequency, so the ensemble can possess strong discrimination ability despite these difficulties.
c. Relative economic value
The previous section contained several references to the usefulness of various forecasts, thereby making implicit reference to the value of the forecasts for users. The concept of forecast value can be made more explicit through use of a simple cost–loss model (Richardson 2000). Consider the simple decision problem in which the forecast information is used to decide whether or not to take preventative action (e.g., whether or not to ready the road crews to deal with inclement weather). A rational user will act in a manner that minimizes losses (or, alternatively, maximizes benefits or utility) over an extended period of time. Just as a 2 × 2 contingency table can be constructed to describe a set of forecasts, so too can one be constructed describing the costs incurred by taking protective action, the losses suffered by not taking action when an event occurs, and the benefit gained by taking action when an event does occur. The expected utility of a forecast system is a linear combination of the forecast system's performance and the user's utility matrix. Relative value measures the expected utility of a given forecast system against the expected utility of a forecast based on climatology. [For a detailed description of the cost–loss value problem, see Richardson (2000).] Thus, a forecast system that provides no utility over climatology will have zero relative value, and a perfect forecast system will have 100% relative value. This relative value is typically plotted as a function of the cost–loss ratio, which is used as a proxy for the different forecast users; that is, each user will typically have a unique cost–loss ratio. Therefore, the relative value curve is a general measure of value covering all possible forecast users.
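Under the assumptions of this cost–loss model, the relative value of a single yes/no forecast system (one point on the ROC curve) can be written down directly; the sketch below follows the standard formulation used by Richardson (2000), with all expenses expressed per unit loss.

```python
def relative_value(pod, pofd, base_rate, cost_loss_ratio):
    """Relative economic value V = (E_climate - E_forecast) / (E_climate - E_perfect)
    for a user with cost-loss ratio alpha = C/L, an event with climatological
    frequency s, and a forecast system with hit rate POD and false alarm rate POFD."""
    s, alpha = base_rate, cost_loss_ratio
    e_climate = min(alpha, s)                  # cheaper of always/never protecting
    e_perfect = s * alpha                      # protect only when the event occurs
    e_forecast = (s * pod * alpha              # hits: protection cost
                  + s * (1.0 - pod)            # misses: full loss
                  + (1.0 - s) * pofd * alpha)  # false alarms: protection cost
    return (e_climate - e_forecast) / (e_climate - e_perfect)

# The ensemble's value curve takes, at each cost-loss ratio, the maximum of
# relative_value over the (POFD, POD) pairs from the different probability thresholds.
```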
The relative value curves for the full ensemble for each of the precipitation types are presented in Fig. 9. Separate value curves can be drawn for each (POFD, POD) pair from the ROC curve. Note that the relative value is always maximized when the cost–loss ratio equals the climatological event frequency. A user with only climatological information will maximize benefit (minimize losses) by always protecting if the climatological event frequency is greater than that user's cost–loss ratio and never protecting if the event occurs less frequently than the cost–loss ratio. When the cost–loss ratio equals climatology, the two strategies provide equal benefit; therefore, climatology provides no utility to such a user, and an alternate forecast system will maximize the value it provides relative to climatology (Richardson 2000).
To simplify comparisons between the different ensemble groups, the information in the relative value curves will be summarized by two parameters: the maximum relative value (Vmax) and the width of the value curve (Vwid), that is, the range of users for whom the forecast system provides positive value. Aside from capturing the basic features of the value curves, an added benefit of these parameters is that they are directly tied to the Peirce skill score and Clayton's skill score, standard skill measures derived from the 2 × 2 contingency table (see Richardson 2000; Wandishin and Brooks 2002).
Bootstrap distributions of Vmax and Vwid are presented in Fig. 6. Immediately apparent is the similarity between the relationship of the AUCs among the ensemble groups and the relationship of the maximum values among the same groups. In other words, the relative rank of the scores among the ensemble groups is very similar for the two measures; indeed, this is not surprising given that the points of the ROC curve are used to generate the value curves. Furthermore, Vmax is directly related to the Peirce skill score, which in turn is equal to the difference between the POD and the POFD. A good AUC is achieved by moving the curve toward the upper-left corner of the ROC graph, meaning a large POD and a small POFD and thus a large difference between them (i.e., a large Peirce skill score). So a forecast system with a large AUC is expected to provide a large maximum relative value.
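The link between Vmax and the Peirce skill score can be made explicit by setting the cost–loss ratio equal to the climatological frequency (α = s) in the relative value expression sketched above:

$$
V(\alpha = s) \;=\; \frac{s - s^{2}\,\mathrm{POD} - s(1-\mathrm{POD}) - s(1-s)\,\mathrm{POFD}}{s - s^{2}}
\;=\; \frac{s(1-s)\,(\mathrm{POD} - \mathrm{POFD})}{s(1-s)}
\;=\; \mathrm{POD} - \mathrm{POFD},
$$

which is the Peirce skill score, so the maximum relative value and the POD − POFD difference rise and fall together.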
This same parallel does not hold for the width of the value curve, however. In fact, some comparisons are reversed between Vmax and Vwid. For example, the Ct algorithm gives the lowest maximum value for rain forecasts, but the range of users for whom the Ct algorithm provides some value is as large as or larger than that of any of the other ensemble subgroups. Similarly, the Eta and Eta–control ensembles have significantly larger maximum value than the respective RSM-based subgroups, but they provide value to significantly fewer users.
Still, some general conclusions can be drawn. The full ensemble clearly provides significantly more value than the ensemble subgroups for all precipitation types except ice pellets, and it is at least as good for the latter. The five-member groups are as good as or better than the groups based on the Eta and RSM control members. There is no clear winner among the single-algorithm ensembles. If a user were forced to choose a single algorithm, that choice would depend on which precipitation type the user was most concerned with and the user's cost–loss ratio for that type. Finally, as with the BSS, there is no overall advantage to using a fixed-model, multiple-algorithm approach over a fixed-algorithm, multiple-model approach, whether the fixed-model ensemble uses just a single member or five members.
Figure 9 also highlights the added value of ensemble forecasts in that several of the individual ROC-point-based curves make up the overall maximum value curve. Recall that each point on the ROC curve (and thus each thin line in the value plots) comes from a specific forecast probability, and thus an ensemble, by providing a greater choice of probability thresholds, is potentially able to deliver more value to a wider range of users. Richardson (2000) has demonstrated that the added value of large ensembles is particularly great for users with low cost–loss ratios and for rarer events (e.g., ice pellets) because of the ensemble's ability to sample the tails of probability space.
d. Evaluation of false alarm forecasts
It is of interest to examine not only whether a set of forecasts was correct but also, when it was wrong, what in fact did occur. Figure 10 shows, for each algorithm (rows), the frequency of occurrence of the four precipitation types conditioned on the forecast type (columns) and probability (abscissa). The observed type matching the forecast is highlighted by the thick curve in each plot. When the thick curve lies along the thin dotted diagonal line, the forecasts are unbiased (i.e., perfectly reliable). The numbers next to the column headings indicate the sample climatological frequency of each precipitation type. This is important because it affects what one would expect to see in the plots. For example, rain and snow forecasts (first and last columns) basically reduce to a decision between whether it will rain or snow; the freezing rain and ice pellet frequencies are small and fairly consistent. Note that the crossing point at which the predicted type becomes the most likely observed type appears to be strongly related to the event frequency. The crossing point for snow forecasts is between 20% and 30%, the crossing point for rain forecasts is between 30% and 70%, and it is 60% for freezing rain forecasts. Note that this crossing point is where the predicted type becomes more likely than any other single type, not more likely than all the other types combined; the latter, for a perfectly reliable forecast, would occur only for forecast probabilities above 50%.
The difficulty in distinguishing between rain and freezing rain is evident from the plots in the second column. It is not until the highest forecast probabilities that the freezing rain curve separates itself from the rain curve, and the reduction in rain frequency as a function of forecast probability is slight. For the Rm algorithm, the drop in rain frequency is negligible. The algorithms also display varying capacities to rule out snow in the freezing rain forecasts. Both Rm and Bd start with snow frequencies over 50% (i.e., snow occurs more than half the time a 10% freezing rain forecast is issued), but the Rm algorithm is better at reducing the snow likelihood as the freezing rain forecast probability increases, while Bd keeps snow at least as likely as rain (roughly 30%). The Cz algorithm is the best at avoiding incorrectly labeling snow events as freezing rain. Finally, note that while ice pellet frequencies are still low and fairly steady (∼10%), ice pellets are more likely to occur when freezing rain is forecast than when either rain or snow is forecast, with the frequency being 10 times its climatological value. So, whereas one still would not expect to see ice pellets when freezing rain is forecast, it would be wise to recognize that such a forecast does indicate an elevated risk of an ice pellet event.
Event frequencies given an ice pellet forecast are shown in the third column. As was noted earlier, all algorithms overforecast ice pellets, and they do this despite the fact that few high-probability ice pellet forecasts are made by any algorithm except Ct (the noisy curves are a result of the small sample sizes). Once again, snow is the most likely precipitation type for Rm, Bd, and Ct, but the situation is much more complicated for the Bg and Cz algorithms, with no precipitation type dominating. The relatively high frequency of freezing rain events suggests that many of these ice pellet forecasts occur when a surface freezing layer is forecast, with the algorithms incorrectly converting the supercooled raindrops into ice pellets.
Some light can be shed on the performance of the individual algorithms by focusing on the rows. For example, the high snow frequencies associated with freezing rain and ice pellet forecasts suggest that the large underforecasting of snow by the Bd algorithm results from an overly lenient criterion for conversion from snow to another type. To produce a snow forecast, the Bd algorithm must find a layer below 500 mb with T ≤ −4°C, and the area of the sounding between the −4°C isotherm and the Tw profile must be less than 3000° m. The fact that Bd is highly reliable for rain forecasts further suggests that the conversion from snow occurs too easily when a surface cold layer is present. The Cz algorithm also substantially underforecasts snow, but it does not produce the same high snow frequencies for its freezing rain and ice pellet forecasts. Rather, a higher percentage of Cz rain forecasts are followed by snow events than for any other algorithm, while at the same time Cz exhibits a strong overforecast of rain. This suggests that Cz is too quick to melt snow and convert it completely to rain, as opposed to the Ct algorithm, which leaves many hydrometeors only partially melted, resulting in a strong overforecast of ice pellets.
5. Summary
In this paper we extended the short-range ensemble forecasting approach to a critical winter weather problem, specifically, 0–48-h predictions of precipitation type. The ensemble system uses forecast soundings produced by a 10-member mixed-model, mixed-initial-condition ensemble as input for five different postprocessing algorithms to determine precipitation-type forecast probabilities. Along with confirming that snow and rain are much easier to predict than freezing rain or ice pellets, the results show that despite the low (or even negative) skill associated with forecasts of the latter two types, ensemble forecasts still can provide substantial value to potential users.
There is a drop in skill as the number of members to which the algorithms are applied is decreased from 10 (full) to 5 (Eta, RSM) to 1 (Ec, Rc), with a larger drop for freezing rain and ice pellets than for rain and snow. The drop in skill, in general, seems qualitatively consistent with that expected from reducing ensemble size, as modeled by Richardson (2001). Unfortunately, quantitative application of Richardson’s theory, which is based on idealized, perfect ensembles, is difficult because an imperfect, mixed-model framework is used here.
The distribution of forecast probabilities becomes sharper (less ensemble spread) as the number of members is reduced, but there is a limit to the amount of ensemble spread the algorithms by themselves can contribute. While there is more variability among the verification scores for the different algorithms, on average the single-algorithm ensembles perform similarly to the control-run (Ec, Rc) or even the single-base-model (Eta, RSM) ensembles.
No algorithm is found to be superior to all others under all circumstances. For example, the Baldwin algorithm is among the best for freezing rain and ice pellets but among the worst for rain and snow. Similarly, none of the five algorithms tested can be discarded as universally inferior. Thus, if only a subset of the algorithms is desired, algorithm selection must be based on the user's specific concerns (e.g., a low tolerance for false alarms or a high sensitivity to freezing rain).
The RSM produces slightly more spread than the Eta (not shown), but including model diversity by taking members from both models increases spread by 15%–30%. Using the full ensemble (i.e., doubling the number of members) increases spread by another 15%–30%. The diversity of solutions provided by a mixed-algorithm ensemble is on a par with a mixed-IC ensemble without model diversity. It is important to emphasize that, by its very nature (i.e., postprocessing), the mixed-algorithm approach cannot produce the expected increase in spread with time. The spread of a mixed-algorithm ensemble will fluctuate in time as the atmosphere moves in and out of states that exploit the different approaches of the algorithms, but no error growth can occur.
Finally, given that ice pellets compose such a small fraction of the total winter precipitation events (∼1%) and that the algorithms and models all do such a poor job of forecasting these rare events, we question whether ice pellets should remain a separate forecast category. The simplest alternative is to use “liquid,” “freezing,” and “frozen” categories. We recommend that ice pellets be merged with snow into the frozen category rather than with freezing rain into the freezing category (e.g., Allen and Erickson 2001) because of freezing rain's severe impact on transportation, public utilities, forests, etc. (e.g., Cortinas et al. 2004 and references therein). As it is, snow is at least as likely as, if not more likely than (depending on the algorithm), any other precipitation type to follow a forecast of ice pellets.
Acknowledgments
The lead author (MSW) is primarily supported by NSF Grant ATM-9908968, while the third author (SLM) is supported by NSF Grant ATM-9714397 and ONR Grant N00014-99-1-0181. This paper is also funded in part under cooperative agreements between the National Oceanic and Atmospheric Administration (NOAA) and the University Corporation for Atmospheric Research (UCAR), and NOAA and the University of Oklahoma (Grant NA17RJ1227). The views expressed herein are those of the authors and do not necessarily reflect the views of NOAA, its subagencies, or UCAR. The authors thank Harold Brooks for many discussions on forecast evaluation and the anonymous reviewers for their helpful comments.
REFERENCES
Allen, R. L., 2001: MRF-based MOS precipitation type guidance for the United States. NWS Tech. Procedures Bull. 485, 12 pp.
Allen, R. L., and M. C. Erickson, 2001: AVN-based MOS precipitation type guidance for the United States. NWS Tech. Procedures Bull. 476, 10 pp.
Baldwin, M., R. Treadon, and S. Contorno, 1994: Precipitation type prediction using a decision tree approach with NMC's mesoscale Eta Model. Preprints, 10th Conf. on Numerical Weather Prediction, Portland, OR, Amer. Meteor. Soc., 30–31.
Black, T., 1994: The new NMC mesoscale Eta Model: Description and forecast examples. Wea. Forecasting, 9, 265–278.
Bourgouin, P., 2000: A method to determine precipitation type. Wea. Forecasting, 15, 583–592.
Brooks, H. E., M. S. Tracton, D. J. Stensrud, G. DiMego, and Z. Toth, 1995: Short-range ensemble forecasting: Report from a workshop. Bull. Amer. Meteor. Soc., 76, 1617–1624.
Cortinas, J. V., K. F. Brill, and M. E. Baldwin, 2002: Probabilistic forecasts of precipitation type. Preprints, 16th Conf. on Probability and Statistics in the Atmospheric Sciences, Orlando, FL, Amer. Meteor. Soc., 140–145.
Cortinas, J. V., B. C. Bernstein, C. C. Robbins, and J. W. Strapp, 2004: An analysis of freezing rain, freezing drizzle, and ice pellets across the United States and Canada: 1976–90. Wea. Forecasting, 19, 377–390.
Czys, R., R. Scott, K. C. Tang, R. W. Przybylinski, and M. E. Sabones, 1996: A physically based, nondimensional parameter for discriminating between locations for freezing rain and ice pellets. Wea. Forecasting, 11, 591–598.
Du, J., and Coauthors, 2004: The NOAA/NWS/NCEP Short Range Ensemble Forecast (SREF) system: Evaluation of an initial condition vs multiple model physics ensemble approach. Preprints, 16th Conf. on Numerical Weather Prediction, Seattle, WA, Amer. Meteor. Soc., CD-ROM, 21.3.
Grimit, E. P., and C. F. Mass, 2002: Initial results of a mesoscale short-range ensemble forecasting system over the Pacific Northwest. Wea. Forecasting, 17, 192–205.
Hamill, T. M., and S. J. Colucci, 1997: Verification of Eta–RSM ensemble probabilistic precipitation forecasts. Mon. Wea. Rev., 125, 1312–1327.
Hand, D. J., and R. J. Till, 2001: A simple generalization of the area under the ROC curve for multiple class classification problems. Mach. Learning, 45, 171–186.
Harvey, L. O., K. R. Hammond, C. M. Lusk, and E. F. Mross, 1992: The application of signal detection theory to weather forecasting behavior. Mon. Wea. Rev., 120, 863–883.
Hsu, W.-R., and A. H. Murphy, 1986: The attributes diagram: A geometrical framework for assessing the quality of probability forecasts. Int. J. Forecasting, 2, 285–293.
Juang, H.-M., 2000: The NCEP mesoscale spectral model: A revised version of the nonhydrostatic Regional Spectral Model. Mon. Wea. Rev., 128, 2329–2362.
Mason, I., 1982: A model for assessment of weather forecasts. Aust. Meteor. Mag., 30, 291–303.
Molteni, F., R. Buizza, T. N. Palmer, and T. Petroliagis, 1996: The ECMWF ensemble prediction system: Methodology and validation. Quart. J. Roy. Meteor. Soc., 122, 73–119.
Murphy, A. H., 1993: What is a good forecast? An essay on the nature of goodness in weather forecasting. Wea. Forecasting, 8, 281–293.
Murphy, A. H., and R. L. Winkler, 1987: A general framework for forecast verification. Mon. Wea. Rev., 115, 1330–1338.
Ramer, J., 1993: An empirical technique for diagnosing precipitation type from model output. Preprints, Fifth Int. Conf. on Aviation Weather Systems, Vienna, VA, Amer. Meteor. Soc., 227–230.
Richardson, D., 2000: Skill and economic value of the ECMWF ensemble prediction system. Quart. J. Roy. Meteor. Soc., 126, 649–667.
Richardson, D., 2001: Measures of skill and value of ensemble prediction systems, their interrelationship and the effect of ensemble size. Quart. J. Roy. Meteor. Soc., 127, 2473–2489.
Stensrud, D. J., and S. J. Weiss, 2002: Mesoscale model ensemble forecasts of the 3 May 1999 tornado outbreak. Wea. Forecasting, 17, 526–543.
Stensrud, D. J., and N. Yussouf, 2003: Short-range ensemble predictions of 2-m temperature and dewpoint temperature over New England. Mon. Wea. Rev., 131, 2510–2524.
Swets, J. A., 1973: The relative operating characteristic in psychology. Science, 182, 990–1000.
Toth, Z., and E. Kalnay, 1997: Ensemble forecasting at NCEP and the breeding method. Mon. Wea. Rev., 125, 3297–3319.
Tracton, M. S., and E. Kalnay, 1993: Operational ensemble prediction at the National Meteorological Center: Practical aspects. Wea. Forecasting, 8, 378–398.
Wandishin, M. S., and H. E. Brooks, 2002: On the relationship between Clayton's skill score and expected value for forecasts of binary events. Meteor. Appl., 9, 455–459.
Wandishin, M. S., S. L. Mullen, D. J. Stensrud, and H. E. Brooks, 2001: Evaluation of a short-range multimodel ensemble system. Mon. Wea. Rev., 129, 729–747.
Wilks, D. S., 1995: Statistical Methods in the Atmospheric Sciences: An Introduction. Academic Press, 467 pp.