Calls for moving from a deterministic to a probabilistic view of weather forecasting have become increasingly urgent over recent decades, yet the primary national forecasting competition and many in-class forecasting games are wholly deterministic in nature. To counter these conflicting trends, a long-running forecasting game at Rutgers University has recently been modified to become probabilistic in nature. Students forecast high- and low-temperature intervals and probabilities of precipitation for two locations: one fixed at the Rutgers cooperative observing station, the other chosen for each forecast window to maximize difficulty. Precipitation errors are tabulated with a Brier score, while temperature errors contain a sharpness component dependent on the width of the forecast interval and an interval miss component dependent on the degree to which the verification falls outside the interval.

The inclusion of a probabilistic forecasting game allows for the creation of a substantial database of forecasts that can be analyzed using standard probabilistic approaches, such as reliability diagrams, relative operating characteristic curves, and histograms. Discussions of probabilistic forecast quality can be quite abstract for undergraduate students, but the use of a forecast database that students themselves help construct motivates these discussions and helps students make connections between their forecast process, their standing in class rankings, and the verification diagrams they use. Student feedback on the probabilistic game is also discussed.

Rutgers University students compete at making probabilistic forecasts for both temperature and precipitation and, in the process, accumulate a large database with which to practice analyzing forecast quality.

It has become increasingly recognized that the addition of uncertainty information to weather forecasts will improve the decisions made based on such forecasts, but this requires education of both users and providers. The National Research Council has made this need explicit, recommending that “All sectors and professional organizations of the [weather and climate] Enterprise should cooperate in educational initiatives that will improve communication and use of uncertainty information. In particular, (1) hydrometeorological curricula should include understanding and communication of risk and uncertainty. . . .” (National Research Council 2006, p. 99). Communication of weather forecast uncertainty is also a priority that has emerged from The Observing System Research and Predictability Experiment (THORPEX) program (Morss et al. 2008b).

Despite these increasingly frequent calls for communication of forecast uncertainty (e.g., probabilistic forecasts), exposure of students to probabilistic forecasting is still the exception rather than the rule. Aside from a few probabilistic contests described later, many students engage in contests, in class or otherwise, that are deterministic (C. Mass 2011, personal communication). For example, the Weather Challenge (WxChallenge), an intercollegiate weather forecasting contest with over 2,000 student participants, is wholly deterministic in format (Illston et al. 2009). Additionally, students examining weather forecasts from the National Weather Service (NWS) obtain primarily deterministic information courtesy of the National Digital Forecast Database (Glahn and Ruth 2003). Even the most common probabilistic forecast, the probability of precipitation (POP), is often interpreted incorrectly (Morss et al. 2008a; Joslyn et al. 2009).

In an effort to introduce students to issues related to the communication of weather uncertainty and to improve upon poor reviews of a previous contest, the New Brunswick Forecasting Game (NBFG) at Rutgers University has been revised to become fully probabilistic. In addition to providing students with practice making probabilistic forecasts, the scoring system itself introduces students to concepts of forecast quality such as reliability and sharpness. The purpose of this paper is to raise awareness of probabilistic forecast contests within the community by describing the mechanics of the NBFG and students' reactions to it, to show how the forecast database can be mined to introduce concepts such as the receiver operating characteristic (ROC) curve, and to draw comparisons with other contests.

## THE NEW BRUNSWICK FORECASTING GAME.

### Overview.

The current NBFG is designed as follows. During the fall semester, students in synoptic meteorology are required to make forecasts once a week (i.e., one “forecast window”) for four periods for two locations. The four periods run from either 0000–1200 or 1200–0000 UTC and can be thought of as tonight, tomorrow, tomorrow night, and the day after tomorrow. In the spring, students in mesoscale meteorology are required to make forecasts twice a week, but for only two periods: tonight and tomorrow. These differences between semesters are a consequence of varying computer lab schedules in the two classes and are consistent with distinguishing between synoptic and mesoscale time scales. One of the two locations is always the Rutgers Gardens weather station located in New Brunswick (NB), New Jersey, roughly 1.5 km east of the meteorology classrooms (the weather station may be viewed at http://meteorology.rutgers.edu/weatherstationNew.jpg). This location is an NWS cooperative station (ID 286055) and a member of the U.S. Historical Climatology Network. The instructor chooses the other location, which varies with each forecast window but not with each forecast period, to maximize forecast difficulty. This is referred to as the alternate location (AL).

Assessment of forecast difficulty is inherently subjective, but some factors that frequently are used to determine the alternate location include large spreads in model output statistics (MOS) temperature forecasts, MOS or NWS POPs near 50%, and/or precipitation expected to begin or end near the transition from one period to the next. Occasionally, the alternate location is chosen because it may be of particular interest to students (e.g., a spring break destination in March or a World Series site in October).

Students are charged with predicting temperature intervals and the POP for each period and location. For night periods, the low temperature is predicted, whereas for day periods, the high temperature is predicted. Note that if the low temperature for the day occurs at, say, 1300 UTC, it would not correspond to the verification. Only the lowest temperature occurring between 0000 and 1200 UTC is counted. At Rutgers Gardens, precipitation must accumulate to at least 0.25 mm (0.01 in.) before precipitation is declared to have occurred (i.e., the bucket must tip). However, the alternate location is verified via aviation routine weather report (METAR) observations that include the ability to record a trace of precipitation. Thus, for the alternate location, only a trace of precipitation is necessary for precipitation to be declared to have occurred. These competing definitions of “precipitation” allow students to experience the abstract concept of reference class (de Elía and Laprise 2005) concretely in a classroom setting. Participants must submit forecasts via a password-protected website by 0000 UTC.

All NBFG forecasts made during the four semesters between the spring of 2010 and fall of 2011 inclusive and their corresponding verifications provided 6,688 forecasts of both precipitation and temperature for the analysis. There are 3,344 forecasts for New Brunswick temperature and likewise for New Brunswick precipitation, alternate location temperature, and alternate location precipitation. Because forecasts beyond the second period are only made in the fall, 4,660 of the 6,688 (70%) temperature forecasts are for tonight or tomorrow (day 1), with the remaining 30% valid for subsequent times (day 2), and likewise for precipitation.

### Precipitation score.

A previous incarnation of the NBFG (Croft and Milutinovic 1991) also employed probabilistic precipitation forecasts. In fact, as is the case with some other contests (see Table 1), POPs were provided for a variety of accumulated precipitation ranges. Unlike other contests that use the ranked probability score (RPS) (Wilks 1995, 269–272) to measure forecast performance, this previous NBFG used Brier scores (BS) local to each bin that were weighted based on the distance between each forecast bin and the verifying precipitation bin, something akin to the location measure used in object-based verification techniques such as Wernli et al. (2008). For example, suppose there were three precipitation categories (0, 1–5, and 6+ mm); forecast A was (0%, 90%, 10%) and forecast B was (0%, 10%, 90%). If no precipitation occurred, then the ranked probability scores for these forecasts would be 1.01 (1^{2} + 0.1^{2} + 0^{2}) and 1.81 (1^{2} + 0.9^{2} + 0^{2}), respectively, but the distance weighting used by Croft and Milutinovic (1991) results in A scoring 1.415 [1^{2} + (0.9)^{2}/2 + (0.1)^{2}/1] and B scoring 1.815 [1^{2} + (0.1)^{2}/2 + (0.9)^{2}/1].
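The two scoring variants can be checked in a few lines. The sketch below reproduces the worked example; note that the generalization of the distance weighting (non-verifying bins scaled by their distance from the verifying bin divided by the maximum possible distance) is inferred from that single example and is an assumption, not the formula published by Croft and Milutinovic (1991).

```python
import numpy as np

def rps(forecast, outcome):
    """Ranked probability score: squared errors of the cumulative distributions."""
    cf = np.cumsum(forecast)   # cumulative forecast probabilities
    co = np.cumsum(outcome)    # cumulative 0/1 verification
    return float(np.sum((cf - co) ** 2))

def distance_weighted_bs(forecast, outcome):
    """Bin-by-bin Brier score with distance weighting in the spirit of
    Croft and Milutinovic (1991). The weighting used here (distance over
    maximum distance for non-verifying bins) is inferred from the worked
    example in the text, not taken from the original paper."""
    k = len(forecast)
    verifying = outcome.index(1)  # index of the bin that verified
    total = 0.0
    for i, (p, o) in enumerate(zip(forecast, outcome)):
        dist = abs(i - verifying)
        weight = 1.0 if dist == 0 else dist / (k - 1)
        total += weight * (p - o) ** 2
    return total

# Three categories (0, 1-5, and 6+ mm); no precipitation occurred.
obs = [1, 0, 0]
A = [0.0, 0.9, 0.1]
B = [0.0, 0.1, 0.9]
```

Running this reproduces the scores quoted above: RPS of 1.01 and 1.81 for A and B, and distance-weighted scores of 1.415 and 1.815.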

To limit the forecasting burden on the students (they also participate in the WxChallenge, and the temperature portion of the NBFG was changed to be more complicated as shown below), the precipitation portion of the current NBFG was simplified to one POP for each location and forecast period. In addition to being simple, this approach allows for the straightforward construction of attributes and reliability diagrams and ROC curves.

POPs are provided to the nearest 10% and assessed using the following half-Brier score:

*E _{P}* = 10(*p* – *o*)^{2}, (1)

where *p* is the POP expressed as a number between 0 and 1, *o* is the observed occurrence of precipitation (either 0 or 1), and *E _{P}* represents the error points attributed to precipitation incurred by the forecaster. Appropriate *E _{P}* values are then summed over all forecast–observation pairs contributed by that forecaster to arrive at the forecaster's overall precipitation error score. The immediate scaling by 10 lets error points be expressed with only one digit past the decimal point. With this score, a forecast POP of 50% generates 2.5 error points regardless of the verification. A perfect deterministic forecast of 0% or 100% generates no error points, but an imperfect deterministic forecast (e.g., forecasting 0% when it does precipitate) generates 10 error points. Despite the instructor's admonitions not to be overconfident, students construct 10-error-point forecasts on a regular basis, as the reliability diagrams in the "Precipitation forecast performance" section indicate.
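As a concrete check on the numbers above, the score can be written in a couple of lines; a minimal sketch consistent with the values quoted in the text:

```python
def precip_error_points(pop, occurred):
    """Half-Brier precipitation score, scaled by 10 as in the NBFG.

    pop:      forecast probability of precipitation, 0.0-1.0
    occurred: True if precipitation verified, False otherwise
    """
    o = 1.0 if occurred else 0.0
    return 10.0 * (pop - o) ** 2
```

A 50% POP costs 2.5 points regardless of the outcome, while a busted deterministic forecast of 0% or 100% costs the maximum 10 points.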

### Temperature score.

Forecasters make temperature forecasts that, based on the scoring rule described below, may be interpreted as 50% central credible interval forecasts (Murphy and Winkler 1974; Winkler and Murphy 1979). Let *l* be the forecast lower bound for temperature, *u* be the forecast upper bound for temperature, and *T* be the observed temperature. Then Gneiting and Raftery (2007) show that the following interval scoring rule is proper for a (1 – *α*) × 100% central credible interval of a continuous predictand regardless of the forecast distribution:

*S*(*l*, *u*; *T*) = (*α*/2)[*s*(*u*) – *s*(*l*)] + [*s*(*l*) – *h*(*T*)] I(*T* < *l*) + [*h*(*T*) – *s*(*u*)] I(*T* > *u*), (2)

where *s* is any nondecreasing function, *h* is arbitrary, and I(·) equals one when its argument is true and zero otherwise. Dunsmore (1968) and Winkler (1972) first introduced scoring rules of this nature. Letting *s*(*T*) = *h*(*T*) = 2*T*/*α* and allowing for equality because temperatures are reported to the nearest whole degree, the interval score becomes

*S*(*l*, *u*; *T*) = (*u* – *l*) + (2/*α*)(*l* – *T*) I(*T* < *l*) + (2/*α*)(*T* – *u*) I(*T* > *u*). (3)

The NBFG uses an interval score in the form of (3) to evaluate temperature forecasts, but it is presented in the following way. Temperature error points are given by *E _{T}* = *E _{S}* + *E _{M}*, where *E _{S}* = *u* – *l* represents the *sharpness* of the temperature forecast, and *E _{M}* is given by

*E _{M}* = 4(*l* – *T*) if *T* < *l*, 4(*T* – *u*) if *T* > *u*, and 0 otherwise,

which measures the degree to which the observed temperature falls outside the forecast interval. The *E _{M}* score can be thought of as an *interval miss* score. The NBFG scoring rule for temperature forecasts is equivalent to (3) with *α* set to 50%. Hamill and Wilks (1995) use a similar scoring rule.

Just as for precipitation, a perfect score of zero is achievable with a deterministic temperature forecast (*l* = *u*) that verifies. A wider interval increases the sharpness score, but also increases the chance that the verifying temperature will be within the interval, thus lowering the expected interval miss score. Based on the above, it would seem that the best forecast strategy is to choose an interval such that the temperature occurs within it 50% of the time and above or below it 25% of the time each. However, because of rounding, temperatures that are just outside the interval will not be penalized by the interval miss score. For example, suppose the true high temperature was 72.4°F and the forecast upper bound was 72°F. Then the interval miss score is still zero because the high temperature will be reported as 72°F, whereas the true interval miss score should be 1.6 in this case. Because of these considerations, the best results will be achieved when the verifying temperature falls within the interval at a rate that is somewhat larger than 50%.
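The temperature rule just described is easy to sketch. The implementation below follows the *E _{T}* = *E _{S}* + *E _{M}* decomposition with *α* = 50% (so 2/*α* = 4) and assumes, as the text implies, that observed temperatures enter the score only after rounding to a whole degree:

```python
def temperature_error_points(l, u, t_obs):
    """NBFG temperature score E_T = E_S + E_M with alpha = 50% (2/alpha = 4).

    l, u:  forecast interval bounds in whole degrees F
    t_obs: verifying temperature as reported, i.e., already rounded
    """
    e_s = u - l                   # sharpness component: interval width
    if t_obs < l:
        e_m = 4 * (l - t_obs)     # interval miss below the lower bound
    elif t_obs > u:
        e_m = 4 * (t_obs - u)     # interval miss above the upper bound
    else:
        e_m = 0                   # equality with a bound counts as a hit
    return e_s + e_m

# A true high of 72.4 F is reported as 72 F, so a 68-72 F forecast incurs
# no miss points; unrounded, the miss would have been 4 * 0.4 = 1.6 points.
```

Note how the rounding in the last example removes the miss penalty entirely, which is why the optimal coverage is somewhat larger than 50%.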

## PRECIPITATION FORECAST PERFORMANCE.

The attributes diagram (Wilks 1995, 263–265) is a standard way to assess the quality of probabilistic forecasts for dichotomous events. Figure 1 presents the attributes diagram for all of the precipitation forecasts in the NBFG database. The attributes diagram presents information regarding the reliability and skill of the set of forecasts as a whole. The notion of forecast reliability is that, if something is forecast *x*% of the time, then it should actually happen *x*% of the time. Thus, in Fig. 1, the diagonal dashed line represents perfect reliability; the *x* axis is the forecast, and the *y* axis is the frequency with which the event occurs given a forecast of *x*. From the diagram, we can see that POPs of 70%–90% are quite reliable in this dataset. The distribution of forecast POPs shows that the forecasts are relatively sharp (many 0% and 100% forecasts), but some of this sharpness comes at the expense of reliability. It actually precipitates 5% of the time when the forecast is 0%, and it does not precipitate 6% of the time when the forecast is 100%.^{1} Except for the highest POPs, NBFG forecasters have a systematic underforecasting bias, as the observed relative frequencies tend to be above the perfect reliability line. This is most pronounced with the 30% POP; when that forecast is made, it precipitates 47% of the time.
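The observed relative frequencies behind a reliability curve reduce to a conditional mean over each issued POP. A minimal sketch, using toy forecast–observation pairs (hypothetical data, not the NBFG database):

```python
import numpy as np

def reliability_curve(pops, obs):
    """Observed relative frequency of precipitation, conditioned on forecast POP."""
    pops = np.asarray(pops, float)
    obs = np.asarray(obs, float)
    bins = np.unique(pops)                                  # POPs actually issued
    freq = np.array([obs[pops == b].mean() for b in bins])  # conditional event frequency
    count = np.array([int(np.sum(pops == b)) for b in bins])
    return bins, freq, count

# Toy forecast-observation pairs:
pops = [0.0, 0.0, 0.3, 0.3, 0.3, 1.0, 1.0]
obs = [0, 0, 0, 1, 1, 1, 1]
bins, freq, count = reliability_curve(pops, obs)
```

Plotting `freq` against `bins` (with `count` as the sharpness histogram) gives the reliability curve and forecast distribution of an attributes diagram.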

NBFG forecasters can use this information to calibrate their forecasts. Given the data shown in Fig. 1, a first step toward that process would be to have all forecasters increase their forecasts by 10% when they are thinking of a POP between 20% and 60%. An unanswered question is whether the availability of the attributes diagram in real time for a particular semester results in increased forecast reliability.

Figure 1 also indicates the degree of resolution (the ability to discriminate between events) and skill relative to climatology exhibited by NBFG forecasters. For the most part, forecasts display sizeable resolution, although the nearly horizontal lines between POPs of 40% and 50% and between 60% and 70% indicate somewhat less ability to resolve those forecast scenarios into separate events. This may be a reflection of the fact that 40% and 60% are less popular forecasts than their neighbors are. All forecasts except those in the 20%–40% range contribute to positive skill with respect to the NBFG sample climatology. Positive skill is provided when the reliability curve is below (above) the “no skill” line when the no-skill line is below (above) the “no resolution” line (Wilks 1995).

Examining subsets of the NBFG database reveals interesting patterns in forecast reliability (Fig. 2). For example, when only forecasts for New Brunswick are considered, there is an extremely large overforecasting bias. A 50% POP leads to precipitation only 12% of the time! In part, this is likely an artifact of the fact that traces of precipitation at New Brunswick are not considered precipitation, and students are not able to calibrate their forecasts to this definition. Evidence for this possibility comes from the fact that at nearby Trenton, climate records indicate that traces occur about 6% of the time. Given that 50% POPs are issued for New Brunswick only 2% of the time, traces could conceivably account for all of this unreliability. However, it is likely that other POPs are issued when traces occur as well. Other factors leading to this overforecasting bias are unclear. Perhaps Schwartz's admonition (Bosart 1983), which states that the observed precipitation is some fraction of the forecast, where the fraction is inversely proportional to a "map room excitement" factor, is playing a role. Regardless, the fact that this bias can be pointed out to students is important in and of itself.

In contrast, Fig. 2 shows that forecasts for the intentionally difficult alternate location are pervasively underforecast. Here, students may be tripped up in part by not recognizing the ease with which a trace of precipitation may occur under certain synoptic regimes. Forecasts for the alternate location are considerably less sharp than those for New Brunswick are (compare the forecast distributions in the bottom right of Fig. 2), as students are less knowledgeable of the peculiarities of the alternate location, and it is typically less predictable than New Brunswick by design. Even with the reduced sharpness, however, very noticeable overconfidence in predicting a 0% POP for the alternate location is evident.

Forecasts for day 1 (tonight and tomorrow) mirror the overall reliability, which is not surprising given 70% of the dataset consists of day 1 forecasts. Two interesting differences emerge when comparing day 1 forecasts to day 2 forecasts, however. First, there is a clear reduction in the sharpness of the day 2 forecasts, as evidenced particularly by the greatly reduced tendency to forecast a POP of 100%. The 0% POP is less affected, perhaps because the overall climatological frequency of precipitation is less than 50% (recall Fig. 1) but also because the author may be biased in choosing alternate locations where precipitation is particularly tricky during day 1. The second difference is seen in the greater amount of underforecasting that occurs for midrange (especially 30%–60%) POPs in the day 2 forecasts.

Although the ability of the POP forecasts to discriminate between precipitation and no precipitation is reflected in the strong verticality to the reliability curves just discussed, an alternate way of depicting resolution is to condition the forecasts based on what happened. By applying thresholds to a set of POP forecasts, a series of deterministic yes/no forecasts can be derived to construct the ROC curve (Mason and Graham 1999). Given that precipitation did not occur, how often was the forecast wrong? This is the probability of false detection (POFD), or false-alarm rate.^{2} Given that precipitation did occur, how often was the forecast right? This is the probability of detection (POD), or hit rate. Because the NBFG POPs are multiples of 10%, it is natural to use thresholds of 5%, 15%, . . ., 95% to construct the ROC curve, and this is what is done to create Fig. 3. When thresholds are very high, no forecasts for precipitation are issued, so both the POD and POFD are zero. When thresholds are very low, each forecast is treated as if precipitation will occur, which necessitates that the POD and POFD be one. A skilled forecaster will be able to detect the phenomenon (in this case, precipitation) without issuing many false alarms. This is represented by points in the top-left half of the ROC diagram. Hence, the area under a ROC curve (AUC) is a common forecast quality metric, with 1 being a perfect score, and 0.5 representing no skill.
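The threshold sweep described above translates directly into code. The sketch below applies thresholds of 5%, 15%, . . ., 95% and computes a trapezoidal AUC anchored at the (0, 0) and (1, 1) endpoints; it assumes the verification sample contains both events and non-events:

```python
import numpy as np

def roc_points(pops, obs, thresholds=np.arange(0.05, 1.0, 0.10)):
    """POD (hit rate) and POFD (false-alarm rate) for each POP threshold."""
    pops = np.asarray(pops, float)
    obs = np.asarray(obs, int)
    pod, pofd = [], []
    for thr in thresholds:
        yes = pops > thr                                  # deterministic yes/no forecast
        hits = np.sum(yes & (obs == 1))
        misses = np.sum(~yes & (obs == 1))
        false_alarms = np.sum(yes & (obs == 0))
        correct_negs = np.sum(~yes & (obs == 0))
        pod.append(hits / (hits + misses))                # POD = A / (A + B)
        pofd.append(false_alarms / (false_alarms + correct_negs))  # POFD = C / (C + D)
    return np.array(pofd), np.array(pod)

def auc(pofd, pod):
    """Trapezoidal area under the ROC curve, anchored at (0,0) and (1,1)."""
    x = np.concatenate(([0.0], np.sort(pofd), [1.0]))
    y = np.concatenate(([0.0], np.sort(pod), [1.0]))
    return float(np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2.0))
```

A perfectly discriminating set of POPs yields an AUC of 1, while POPs unrelated to the outcome fall on the diagonal with an AUC near 0.5.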

Reassuringly for NBFG forecasters, Fig. 3 shows that the ROC curves for various subsets of the NBFG database all indicate significant skill. It is not surprising that New Brunswick forecasts are of much higher quality than their alternate location counterparts are, as this is by design. New Brunswick may offer tranquil weather throughout a given forecast window, whereas the alternate location is chosen to be difficult. An additional factor explaining this difference may be that students are more familiar with the weather in New Brunswick than they are at the alternate location. Although Roebber et al. (1996) suggested that distance between station and forecaster had no effect on forecast skill, especially for novices forecasting precipitation, their study did not examine stations within 100 km of the forecasters. Hence, the possibility that distance does matter if it is less than 100 km is not excluded. The reduction in skill between day 1 and day 2 forecasts reflects the increased uncertainty associated with a longer-range forecast. However, the difficulty of day 2 relative to day 1 is evidently less than the difficulty of the alternate location relative to New Brunswick.

## TEMPERATURE FORECAST PERFORMANCE.

Turning our attention to the temperature interval forecasts, Fig. 4 shows that the inverse relationship between the sharpness and interval miss scores posited earlier exists within the NBFG. Note that, while Fig. 4 shows the results of the fall 2011 semester alone for clarity, other semesters exhibit similar patterns (not shown). As forecast sharpness increases (i.e., the sharpness score decreases), the interval miss score increases as well. This tendency is revealed most easily by comparing forecaster 1 (the best temperature forecaster) to forecaster N (the worst). Forecaster 1 and other high-ranking forecasters had a roughly even mix of sharpness and interval miss error points, with a small shift toward more sharpness points. Forecaster N and other poor performers had narrow intervals, leading to good sharpness scores but poor interval miss scores. A few forecasters ran counter to the overall trend, particularly forecasters C and H. Those forecasters had few interval miss points, but at the cost of wide temperature intervals and hence large sharpness scores.

The reliability of NBFG temperature forecasts can be assessed based on the frequency with which the observed temperature falls below, within, or above the forecast interval. Figure 5 displays such an analysis and, once again, only results from the fall 2011 semester are shown for clarity. Other semesters are similar. As was suggested by Fig. 4, there is a tendency for forecasters at the bottom of the standings to perform poorly because of overconfidence. Temperatures rarely fall within their intervals, so their bars in Fig. 5 are relatively narrow. Only forecaster H (and to some extent forecaster C) bucks this trend. Combining these results with our discussion of precipitation errors allows us to make the general statement that, as a whole, NBFG forecasters are an overconfident lot apt to be too sharp at the expense of reliability. Perhaps this overconfidence is a side effect of the risk taking (i.e., one-upmanship) associated with students jovially competing in the NBFG.

Figure 5 also suggests that rounding does increase the effective coverage of the credible intervals beyond 50%. All of the top 10 temperature forecasters had observed temperatures fall within their intervals more than 50% of the time. The best forecasters saw temperatures within their intervals 60%–65% of the time. The degree to which this extra coverage can be explained by rounding was checked by simulating thousands of forecasts designed to be reliable in the long run. Using temperatures and standard deviations from the 1981–2010 climate normals (Arguez et al. 2012) for New Brunswick for a variety of months, reliable forecast intervals were constructed by rounding the 25th and 75th percentiles to the nearest integer, with pseudo-observations drawn from the same distribution. Assuming the alternate location has similar climatological characteristics to New Brunswick, the results (not shown) indicate that about one-quarter of the extra coverage is due to rounding. In other words, a perfectly reliable forecaster would hit the interval 53% of the time. Hence, most NBFG participants have some bias toward wide intervals, even after accounting for rounding.
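The rounding check can be sketched as a small Monte Carlo simulation. The mean and standard deviation below are hypothetical placeholders (the study used the 1981–2010 normals for New Brunswick across several months), so the exact hit rate will differ somewhat from the 53% quoted above:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical climatological mean and standard deviation in deg F; the study
# drew these values from the 1981-2010 normals for New Brunswick.
mu, sigma = 55.0, 8.0

# A perfectly reliable forecaster: bounds at the true 25th and 75th
# percentiles (z = -/+ 0.6745), rounded to whole degrees as in the NBFG.
lower = round(mu - 0.6745 * sigma)
upper = round(mu + 0.6745 * sigma)

# Pseudo-observations from the same distribution, reported to the nearest degree.
obs = np.round(rng.normal(mu, sigma, 200_000))
hit_rate = float(np.mean((obs >= lower) & (obs <= upper)))
```

With these placeholder parameters the hit rate comes out a little above 50%; the size of the rounding effect grows as the climatological spread shrinks relative to one degree.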

Finally, Fig. 5 shows that most NBFG forecasters exhibited a cold bias, as it is more likely for temperatures to verify above the forecast interval than below it. Further analysis indicates that this cold bias is present in all semesters for both New Brunswick and the alternate location (not shown). Perhaps this bias is a result of participants relying on NWP models that have their own biases. In addition, the nearest MOS location to New Brunswick [Somerville, New Jersey (SMQ)] is in an exurban valley that is known to be a relatively cold location in New Jersey, so the cold bias for New Brunswick may be explained by participants not realizing that local conditions differ between SMQ and New Brunswick.

## DISCUSSION.

In any forecast contest, it becomes necessary to determine how the various elements of the forecast combine into an overall score. In the case of the NBFG, Fig. 6 shows that simply adding the half-Brier scores for precipitation to the interval scores for temperature would vastly underweight the precipitation scores. Many contests deal with this issue using skill scores (e.g., Bosart 1983; Sanders 1986; Gyakum 1986; Hamill and Wilks 1995; Newman 2003), but skill scores are not proper in general and hence may be gamed (Gneiting and Raftery 2007). Therefore, the initial revision to the NBFG scaled the precipitation score such that the median temperature and precipitation scores were identical (precip med). However, this approach tended to overweight precipitation scores. It also has the disadvantage that a forecaster could finish in first place for each forecast window at the time each window is originally scored yet not finish first overall because of changes in the scale factor over time. A simpler approach is to multiply the precipitation score by a reasonable constant. In the fall of 2011, this constant was set to six, which seemed to produce results that are more balanced (precip 6). Weighting precipitation errors such that the interquartile range is the same for both precipitation and temperature is another approach under consideration for future semesters (precip IQR).
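The three weighting schemes amount to different choices of a single scale factor applied to the precipitation error points. A sketch (the function and option names are illustrative, not taken from the contest code):

```python
import numpy as np

def precip_scale(precip_scores, temp_scores, method="iqr"):
    """Scale factor applied to precipitation error points so that they carry
    weight comparable to temperature error points when the two are summed."""
    p = np.asarray(precip_scores, float)
    t = np.asarray(temp_scores, float)
    if method == "median":                  # "precip med": match the medians
        return float(np.median(t) / np.median(p))
    if method == "constant":                # "precip 6": fixed factor, fall 2011
        return 6.0
    if method == "iqr":                     # "precip IQR": match the interquartile ranges
        iqr = lambda x: np.percentile(x, 75) - np.percentile(x, 25)
        return float(iqr(t) / iqr(p))
    raise ValueError(f"unknown method: {method}")
```

Unlike the constant factor, the median and IQR factors change as scores accumulate, which is why a window winner under those schemes can slip in the overall standings later.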

The same students that participate in the NBFG are also required to participate in the WxChallenge. This makes it possible to see whether probabilistic forecast skill carries over into deterministic forecast skill and vice versa. Indeed, there is a relationship between the two. For the 22 participants during the fall 2011 semester, the Spearman rank correlation was 0.57, indicating that forecasters ranked highly in the NBFG were also likely to rank highly in the WxChallenge. Outliers exist (not shown), especially when raw scores are compared, which reduces the Pearson correlation to 0.39.
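The rank correlation reported above is simply the Pearson correlation of the rank-transformed standings. A minimal sketch that ignores tied ranks (the data below are hypothetical, not the fall 2011 standings):

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks.
    Tie handling is omitted for brevity."""
    def ranks(v):
        order = np.argsort(v)
        r = np.empty(len(v))
        r[order] = np.arange(1, len(v) + 1)   # rank 1 = smallest value
        return r
    rx, ry = ranks(np.asarray(x)), ranks(np.asarray(y))
    return float(np.corrcoef(rx, ry)[0, 1])
```

Because only ranks matter, outlier raw scores that drag down the Pearson correlation (0.39 here) leave the Spearman correlation (0.57) unaffected.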

Every course at Rutgers is subjected to boilerplate end-of-semester evaluations that include Likert items such as “I learned a great deal in this course” (1 = strongly disagree; 5 = strongly agree). However, instructors are allowed to add custom questions to the evaluation (changes in evaluation procedures between the Fall 2009 and Fall 2010 semesters precluded this, hence the gap in Table 2). For synoptic and mesoscale meteorology courses, the additional items “I enjoyed the NBFG” and “I enjoyed the WxChallenge” were added. Before the NBFG was revised over the summer of 2009, students scored the old NBFG poorly (Table 2). In fact, 49% of the 72 students surveyed preferred the WxChallenge to the NBFG, 43% expressed no preference, and only 8% (six students) preferred the NBFG. These poor results in addition to reports such as Morss et al. (2008b) were what led the instructor to try a probabilistic approach to the NBFG.

Unfortunately, only two semesters have been surveyed since the probabilistic revision of the NBFG. Despite this, Table 2 shows encouraging results. Now, 58% of the 31 students surveyed show no preference between the WxChallenge and NBFG, the proportion of those favoring the WxChallenge has fallen to 26%, and the proportion of those favoring the NBFG has doubled to 16%. Note that general WxChallenge enjoyment increased as well (mean 4.48 versus 4.06 previously despite no change in that contest); perhaps recent students have simply liked to forecast. However, against the null hypothesis that the new NBFG is no better than the old NBFG, the increase of nearly a point in students' enjoyment of the NBFG is marginally significant even with the correlation between NBFG and WxChallenge ratings taken into account (*p* value 0.052 for a one-sided *t* test). While the distribution of student responses is far from Gaussian, such a *t* test should still give reasonable results (Norman 2010). Hence, there is at least a suggestion that modifying the NBFG has increased student enjoyment of the contest, but because of the small sample size available since the modifications, it is prudent to consider these results encouraging rather than definitive.

While quantitative results shed some light on students' perceptions of these differing forecast contests, the “other comments” section of the course evaluations can provide windows into an individual student's experiences. In recent semesters (after revising the NBFG), students have been invited to comment specifically on the NBFG and WxChallenge in that section. Many students left that section blank, but among those who did not, the following comments could be found:

“I really enjoyed doing [the NBFG]. It was time consuming sometimes, but I feel that it helped me apply what I learned throughout our meteorology courses.”

“I gained . . . an appreciation for the different components that go into creating a ‘good’ forecast. . . . I have read papers on probabilistic versus deterministic forecasts in the past but it was through the NBFG that I gained a broader understanding of why some argue one type over the other.”

“I feel like the NBFG is more geared towards the type of forecasting that is expected of operational meteorologists today.”

“Make NBFG be only P001 or above and do not count P000.” (This student wants traces of precipitation at the alternate location to not count as precipitation.)

Here we see that students had both positive and negative perceptions of the probabilistic NBFG. On the one hand, it seems more relevant to real-world forecasting and allows students to understand what probabilistic forecasting is more fully. On the other hand, it is more time consuming (24 numbers are required each week versus 16 for the WxChallenge), and students tend not to like having to consider more than one reference class for precipitation events.

## CONCLUSIONS.

After a previous deterministic New Brunswick Forecasting Game at Rutgers University received poor reviews, the NBFG was transformed into a probabilistic contest. Participants forecast high- and low-temperature intervals and POPs for New Brunswick and a location that varies each time a set of forecasts is made. Course evaluations suggest that students find these changes have made the NBFG more enjoyable, but that level of enjoyment is no higher than that found for the deterministic WxChallenge according to the available evidence.

Overall, NBFG forecasts are skillful and reasonably reliable when considered together. However, notable biases are found within particular forecast subsets. A pervasive overforecasting bias exists when precipitation forecasts for New Brunswick are considered alone, so that a 50% POP corresponds to precipitation occurring less than half as frequently. In contrast, underforecasting of precipitation occurs for the alternate location, and these differences are statistically significant. Temperature forecasts have a cold bias regardless of location, and even the best students tend to choose temperature intervals a bit too wide. Thus, there is room for students to improve their performance on both aspects of the NBFG.

As a step toward improving students' probabilistic forecasting ability, the results of this study will be shared with them in future semesters. Not only will students be able to use knowledge of the biases found by this study to improve their forecasts, but these data also motivate understanding of the abstract concepts behind ROC curves and attributes diagrams. Additionally, it is planned to provide figures like those shown in this study to students in real time as forecast windows are scored. The combination of attributes diagrams for the class as a whole with attributes diagrams that apply to each student individually may allow for students to reduce their forecast biases more completely. The performance of model guidance will be tracked in a similar fashion, so that students may see whether their biases derive from the models or from other means. Finally, it may be of interest to probe more deeply into students' perceptions of the NBFG and WxChallenge by asking them to rate the degree to which they felt each contest helped them learn more about forecasting in addition to the more general queries regarding their enjoyment of each contest.

*ACKNOWLEDGMENTS*

The author extends special thanks to all of the students whose countless hours of forecasting produced the dataset analyzed in this work: W. C. Alston, S. Ames, B. Bachman, M. Baker, D. Baur, J. Berenguer, J. Carlin, A. Caroprese, K. A. Cicalese, N. Cicero, A. Coletti, K. Cronin, R. D'Arienzo, P. Degliomini, J. Deppa, M. Devereaux, C. Devito, A. Donovan, M. Drews, N. Gibbs, G. Grek, A. Harrison, G. Heidelberger, J. Kafka, K. Kelsey, P. Loikith, D. Manzo, C. Marciano, A. McCullough, N. Mentel, S. Nicholls, E. Nielsen, G. Orphanides, H. Patel, N. Peterson, G. Pietrafesa, J. Ramey, R. Reale, A. Riggi, S. Sabo, C. Sheridan, D. Simonian, C. Speciale, C. Sutherland, C. Tait, C. Thompson, J. Webster, C. Weibel, R. Winnan, and S. Winter. Michael Ferner coded the forecast submission website. Three anonymous reviewers helped to improve this work substantially. This research was funded in part by the New Jersey Agricultural Experiment Station.


## Footnotes

^{1} Because of rounding, a 0% (100%) POP corresponds to an interval of 0%–5% (95%–100%). Hence, these results are not as unreliable as they seem. Even so, the observed frequencies when compared to the interval midpoints indicate some overconfidence.

^{2} As Mason and Graham (1999) note, the false-alarm *rate* is often confused with the false-alarm *ratio* (FAR). In a contingency table where A is the number of hits, B is the number of misses, C is the number of false alarms, and D is the number of correct negatives, the false-alarm rate is given by POFD = C / (C + D), whereas FAR = C / (A + C).