## 1. Introduction

Although a large body of work has examined the skill of weather forecasts, less effort has been devoted to examining the development of forecast skill, with the notable exceptions of Gedzelman (1978) and Roebber and Bosart (1996). This paper extends their work, using the results from 10 yr of forecasts made by students enrolled in a senior-level weather analysis and forecasting class (ATMS 452) taught at the University of Washington.

The primary objective of ATMS 452 is to demonstrate how the relatively abstract information presented in classes on the dynamics and thermodynamics of the atmosphere can be applied to forecasting tangible weather. It is designed for seniors in the last term of their undergraduate studies but has included students at different stages of their education (e.g., juniors, visiting students, and a few graduate students). The class is taught every spring, with a 10-week period of instruction from the end of March to early June. ATMS 452 is composed of a 1-h lecture 3–4 days a week with an emphasis on practical topics, a 1-h laboratory session 4 days a week that includes real-time surface map analysis and case studies illustrating specific phenomena such as mesoscale structures induced by orography, and, most germane to this paper, a 1-h forecasting exercise 5 days a week. The lecture portion is taught by the second author (Mass); the laboratory and forecast components are taught by the first author (Bond). The focus of ATMS 452 is on understanding and forecasting day-to-day weather using the concepts and tools of modern meteorology.

The daily forecast scores for students in ATMS 452 provide a unique opportunity to analyze the development of proficiency at weather forecasting in a controlled environment. This environment more closely resembles an operational setting than does the forecast “game” used by Roebber and Bosart (1996), in that the forecasts are not only mandatory and graded but also carried out under a tight time constraint. Moreover, the consistency with which the forecast component of ATMS 452 has been conducted over many years provides a substantial amount of comparable data. Our analysis of these data has had the following objectives: 1) to document the rates at which students develop skill at short-term forecasting, 2) to determine the differences in the rates at which proficiency is gained for different kinds of forecasts, and 3) to compare forecasting skill with performance on conventional written tests.

The organization of the paper is as follows. The forecasts, scoring, and methods of analysis are described in the next section. Section 3 presents the results of our analysis of the forecast scores. A summary and concluding remarks are provided in the final section.

## 2. Description of forecasts and methods

The students are subjected to an intensive forecasting experience in ATMS 452 for three reasons: 1) to provide the repetition important in gaining prowess at forecasting (e.g., Roebber and Bosart 1996); 2) to yield a large number of realizations so that the students’ scores, which compose part of their grades in ATMS 452, are statistically robust; and 3) to give the students a chance to find out how well they cope under the time pressures that they would face in an operational forecast setting.

### a. Weather parameters and scoring

The scores analyzed here are based on next-day forecasts made by the students and Bond on Monday–Thursday afternoons. Three basic types of the next-day forecasts are the subjects of this study. The type 1 category is for parameters verifying at 1200 UTC the next morning. These parameters, itemized in Table 1, consist of categorical forecasts of ceiling, visibility, and sky cover, and forecasts of wind speed and direction, temperature, and dewpoint. The scoring method for each parameter is also indicated in Table 1. In practice, while it is difficult for all forecasters to predict cloud cover with consistent success, the most error points are generally associated with forecasts of dewpoint, followed by temperature. The type 2 category relates to precipitation over the period of 0600–1800 UTC the next day. There are three separate elements: a precipitation probability, a thunderstorm probability, and a categorical quantitative precipitation forecast, summarized, along with their scoring, in Table 2. Note that the probability forecasts are scored in a “proper” manner (e.g., Murphy and Epstein 1967) such that in the long run scores are optimized when forecast probabilities match the true probabilities. After a short introductory period, the students make these two types of forecasts for stations in four cities in different regions of the United States: Will Rogers World Airport (OKC), in Oklahoma City, Oklahoma; Pittsburgh International Airport (PIT), in Pittsburgh, Pennsylvania; Hector International Airport (FAR), in Fargo, North Dakota; and Seattle–Tacoma International Airport (SEA) in SeaTac, Washington. Finally, a third category of forecasts for approximately the last month of class consists of a detailed, local forecast for SEA (Table 3). 
This is also a next-day forecast and consists of projections for wind and sky conditions at 6-h intervals (0600 UTC, 1200 UTC, etc.), minimum and maximum temperatures, as well as precipitation probabilities and quantitative precipitation forecasts (QPFs) for 0600–1800 UTC and 1800–0600 UTC. The type 3 forecast supplants the type 1 and type 2 forecasts for SEA, while the forecasts for the other three cities are unchanged.
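The notion of a “proper” score mentioned above can be illustrated with a minimal sketch. The actual point schedule of Table 2 is not reproduced here, so the quadratic (Brier-type) penalty below is an assumption chosen only because it is a well-known proper score; the function names and the grid of candidate probabilities are likewise hypothetical.

```python
def expected_penalty(forecast_prob: float, true_prob: float) -> float:
    """Expected quadratic penalty when the event occurs with probability true_prob.

    Assumption: a Brier-type penalty, (forecast - outcome)^2, stands in for
    the class's actual scoring of probability forecasts.
    """
    penalty_if_event = (forecast_prob - 1.0) ** 2
    penalty_if_no_event = (forecast_prob - 0.0) ** 2
    return true_prob * penalty_if_event + (1.0 - true_prob) * penalty_if_no_event


def best_forecast(true_prob: float, steps: int = 10) -> float:
    """Search a grid of candidate probabilities for the lowest expected penalty."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda p: expected_penalty(p, true_prob))


if __name__ == "__main__":
    for q in (0.1, 0.3, 0.7):
        # Under a proper score the best long-run forecast equals the true probability.
        print(f"true probability {q:.1f} -> best forecast {best_forecast(q):.1f}")
```

Under this assumed penalty, hedging toward 0% or 100% raises the expected error points, which is the property the text attributes to proper scoring.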

The resources available to the students for making these forecasts include all available observations (station, satellite, radar, etc.) as well as analyses and numerical weather prediction (NWP) model data from the National Centers for Environmental Prediction’s Global Forecast System (GFS) and North American Mesoscale (NAM) models. For the type 3 forecasts for SEA, they are allowed to consider the output from the high-resolution Weather Research and Forecasting Model (WRF) system that is run at the University of Washington. The Department of Atmospheric Sciences at the University of Washington has developed a host of tools for displaying these observational and model data as well as other resources on the Internet.

It is important to distinguish the forecast component of ATMS 452 from other forecast competitions that have been the subject of previous studies (e.g., Bosart 1983; Sanders 1986). First, the students’ forecasts are individual efforts. While student conversations about the weather situation in general are encouraged, comparisons of their specific forecasts during the preparation stage are discouraged. Second, the use of model output statistics (MOS) guidance is prohibited.^{1} This policy is based on the idea that inexperienced forecasters will use MOS as a crutch, and probably should do so if forecast score is the top priority (Baars and Mass 2005). Using MOS likely delays the development of an understanding of how various elements of the weather relate to the larger-scale aspects of the atmosphere. With a longer period of instruction, it would probably be effective to let the students use MOS, and see for themselves how its utility varies with the weather regime (Roebber 1998).

### b. Analysis procedure

The daily scores for each student, Bond, and persistence are analyzed here. A persistence forecast, that is, what happened today will happen tomorrow, is an appropriate standard or control forecast by which a next-day forecast is evaluated.^{2} For the purposes of illustration, time series of daily scores during 2007 for three students chosen at random, Bond, and persistence for type 1 and type 2 forecasts are shown in Figs. 1a and 1b, respectively. The key trends apparent in these time series occur essentially each year. For example, the raw daily type 1 scores for virtually all students tend to decline over the quarter, as do the scores for Bond and persistence (Fig. 1a). Since the class is always held in the spring, when synoptic-scale disturbances become progressively weaker and day-to-day changes in temperatures and winds decrease, error scores tend to decline with time. On the other hand, less systematic temporal change is typical for type 2 (precipitation) scores, as shown in Fig. 1b. There is, of course, a major change in the origin of the precipitation over the quarter, with deep convection tending to play an increasingly important role in producing precipitation variability.

Forecast skill is expressed here as a skill score (SS) relative to persistence:

$$\mathrm{SS} = 1 - \frac{\mathrm{FP}_{\mathrm{ind}}}{\mathrm{FP}_{\mathrm{per}}}, \qquad (1)$$

where $\mathrm{FP}_{\mathrm{per}}$ is the point total of the persistence forecast and $\mathrm{FP}_{\mathrm{ind}}$ is the point total of the human forecaster. In this formulation, “1” represents perfect skill, “0” represents no skill, and it is possible to have negative skill, relative to the control persistence forecast. This measure of forecast skill follows the form used by Sanders (1979) and others. Sequences of daily average SS values over many classes are computed by averaging the scores with respect to the number of days into the quarter. For example, averages are computed of the scores (FPs) for the first day of forecasts over all the years, and SS values for that forecast day are then produced by using these averages in (1). This procedure is repeated for the second day scores, and so forth, through the 33 days of forecasting each quarter. The alternative, computing SS values for each individual forecast and then averaging, yields results dominated by the instances for which the persistence score is small (and is undefined when persistence is a perfect forecast). We present results in the following section on the sequences of the average SSs for the median student, and for the two students that are most and least proficient each year. The rankings for the latter two groups are based on the individual students’ total forecast scores over the quarter for the category under consideration. For reference purposes, SS values are also shown for Bond. The Bond skill scores indicate the skill of an experienced forecaster and are by no means the upper limit of potential skill. A reviewer suggested that the Bond scores be used to normalize the student scores, because this would smooth the time series for the three student categories. We have chosen to display the results without this scaling, so as to better reveal the variability in forecast skill over the course of the quarter. The SS sequences for the type 1, 2, and 3 forecasts are considered separately.
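The averaging order described above can be sketched as follows. This is only an illustration under stated assumptions: the skill score is taken as 1 − FP_ind/FP_per (the form implied by the description of perfect, zero, and negative skill), the data layout (one list of daily error points per year) is hypothetical, and the numbers are invented.

```python
from statistics import mean


def skill_score(fp_ind: float, fp_per: float) -> float:
    """Skill relative to persistence, assuming SS = 1 - FP_ind / FP_per."""
    if fp_per == 0:
        raise ValueError("SS is undefined when persistence is a perfect forecast")
    return 1.0 - fp_ind / fp_per


def daily_skill_sequence(fp_ind_by_year, fp_per_by_year):
    """Average error points over years for each forecast day, then form SS."""
    n_days = min(len(year) for year in fp_ind_by_year + fp_per_by_year)
    sequence = []
    for day in range(n_days):
        mean_ind = mean(year[day] for year in fp_ind_by_year)
        mean_per = mean(year[day] for year in fp_per_by_year)
        sequence.append(skill_score(mean_ind, mean_per))
    return sequence


def mean_of_individual_scores(fp_ind_by_year, fp_per_by_year, day):
    """The alternative: form SS for each forecast first, then average."""
    return mean(
        skill_score(ind[day], per[day])
        for ind, per in zip(fp_ind_by_year, fp_per_by_year)
    )


if __name__ == "__main__":
    # Invented error points: 2 years x 2 forecast days.
    student = [[8.0, 3.0], [12.0, 10.0]]
    persistence = [[10.0, 1.0], [30.0, 30.0]]
    print(daily_skill_sequence(student, persistence))
    # On day 1 a single small persistence total (1.0) dominates this average.
    print(mean_of_individual_scores(student, persistence, day=1))
```

In the invented second forecast day, one year with a nearly perfect persistence forecast pulls the average of individual scores well below zero, whereas the score formed from the averaged point totals remains positive.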

## 3. Forecast skill sequences

Time series of forecast skill in the type 1 category are shown in Fig. 2. A marked increase in median student skill occurs over the course of the first half of the quarter for the type 1 forecasts, with a flattening in this trend over the second half of the term. Since the forecast skill of Bond does not vary systematically during the term, the upward trends in student skill show that proficiency is being gained by the students. Note also the substantial day-to-day variations in skill, for all groups of students and Bond, even though these traces reflect averages over 10 yr. The standard errors in the daily means of these skill scores average about 0.1 for each type of forecaster, which is comparable to the day-to-day variations in average skill. The consistency in the day-to-day variations in the time series for the different categories of forecasters illustrates that the relative degree of difficulty in forecasting on particular groups of days is independent of skill and experience. The top student forecasters have an immediate edge over the other students, continue to improve over the first two-thirds of the term, and by the last third of the term have skill approaching that of Bond. The skill of the least successful student forecasters increases steadily over the course of the quarter. The difference in skill between the higher and lower student groups narrows from about 0.25 to 0.15 over the duration of the quarter.

The time series for the type 2 forecasts (Fig. 3) have some interesting differences from those of the type 1 forecasts. First, a typical day’s median student forecast skill is actually higher for the type 2 predictions related to precipitation than for the type 1 forecasts related to clouds, winds, and temperatures. Second, while the median student skill for the type 2 forecasts declines with time during the course of instruction, Bond’s skill actually exhibits a greater rate of decline, to the point where the students’ skill closely approaches Bond’s. As mentioned above, the precipitation tends to be increasingly convective in nature during the period considered, posing an increasing challenge for all forecasters. This is also reflected in the standard errors of the daily means of these skill scores, which average about 0.2, and tend to be higher later in the quarter (not shown) when there is greater year-to-year variability in forecast errors. It is interesting that the top students maintain their skill late into the spring as precipitation becomes less predictable, and their mean skill near the end of the term actually slightly surpasses Bond’s. Unlike type 1 category forecasts, the gap in skill between the poorer forecasters and the other groups is relatively constant during the quarter.

The final type of forecast considered is the intensive forecast for SEA (type 3). This forecast type is for clouds, winds, temperatures, and precipitation probability and amount, and hence includes the elements in both the type 1 and type 2 forecast results shown above. This forecast is made only during the latter portion of the class, after the students have had 5–6 weeks of forecast practice. Time series of skill as a function of forecast number are shown in Fig. 4. These time series are rather short, but do not suggest obvious trends in the skills for any of the groups. The standard errors in the daily means of these skill scores are about 0.1 and, as for the other types of forecasts, are similar in magnitude for all four forecaster categories. For the most part, the differences in skill between Bond and the students in the three groups are similar to those differences near the end of the term for the type 1 and type 2 forecasts. This result is not surprising. The students have had forecast experience by the onset of these forecasts, and also have the benefit of familiarity with Seattle weather, based on both knowledge gained before taking ATMS 452 and the early activities of the class.

While our focus is on the rate of development of student forecast skill over the period of instruction, our dataset also allows for an assessment of how this skill has evolved between 1997 and 2007. Toward that end, the average skill over each quarter as a whole has been calculated for each type of forecast and all categories of forecaster. The linear trends in the quarter-average skill scores are itemized in Table 4. The trends in forecast skill are generally small relative to their mean values, and are of both positive and negative signs. They explain very little, typically <3%, of the interannual variance in forecast skill. The two exceptions are the type 3 forecasts in the median and highest student categories, but this was due to very low skill values in the first year of 1997. The trends in these skill scores were negligible when this year was excluded. It is interesting that a lack of systematic growth in forecast skill was found for a period during which there has been an undeniable improvement in the quality of the NWP models available. We speculate that this may be at least partly explained by our consideration of only next-day forecasts and that the advances in NWP guidance would be more obvious for forecasts with longer time horizons.
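A linear trend and the fraction of interannual variance it explains can be computed as in the sketch below. The least-squares fit is an assumption about how such trends are typically obtained, not a statement of the procedure behind Table 4, and the quarter-average skill values in the usage example are invented.

```python
def linear_trend(years, values):
    """Return (slope, r_squared) of an ordinary least-squares line.

    slope is the change in skill per year; r_squared is the fraction of
    variance in the values explained by the linear trend.
    """
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    syy = sum((y - mean_y) ** 2 for y in values)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, values))
    slope = sxy / sxx
    r_squared = 0.0 if syy == 0 else (sxy * sxy) / (sxx * syy)
    return slope, r_squared


if __name__ == "__main__":
    # Invented quarter-average skill scores, one per year.
    years = [1997, 1998, 1999, 2000, 2001]
    skill = [0.40, 0.50, 0.40, 0.50, 0.40]
    slope, r2 = linear_trend(years, skill)
    print(f"trend = {slope:.4f} per year, explained variance = {100 * r2:.1f}%")
```

A series that fluctuates from year to year without a systematic drift, as in this invented example, yields a slope near zero and a small explained variance.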

The consistency that has been maintained in the lecture portion of ATMS 452 over the years, in particular the nature of its examinations, allows assessment of the relationship between the students’ forecast scores and test scores. A scatterplot of the students’ average test scores, with the midterm and final exams weighted equally, against their overall forecast scores, combining all four types of forecasts, is shown in Fig. 5. There is a positive relationship between test and forecast scores, as might be expected, but the linear correlation coefficient is rather modest (*r* ∼ 0.4). The distribution is asymmetric. A large proportion of the best forecasters also did well on tests, but for forecast scores below about 80, there is little relationship between forecast and test scores.
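The asymmetry noted above can be examined by computing the correlation separately for students above and below a forecast-score threshold. The sketch below is one hypothetical way to do so; the threshold of 80 follows the text, while the function names and all scores are invented for illustration. Correlations computed within a restricted range of forecast scores should be interpreted with caution.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson linear correlation coefficient; NaN if either series is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / sqrt(sxx * syy)


def split_correlations(test_scores, forecast_scores, threshold=80.0):
    """Correlation of test vs forecast scores above and below a forecast threshold."""
    pairs = list(zip(test_scores, forecast_scores))
    high = [(t, f) for t, f in pairs if f >= threshold]
    low = [(t, f) for t, f in pairs if f < threshold]
    r_high = pearson_r([t for t, _ in high], [f for _, f in high])
    r_low = pearson_r([t for t, _ in low], [f for _, f in low])
    return r_high, r_low


if __name__ == "__main__":
    # Invented (test, forecast) scores for seven hypothetical students.
    tests = [60.0, 90.0, 70.0, 85.0, 88.0, 92.0, 95.0]
    forecasts = [70.0, 72.0, 75.0, 78.0, 85.0, 90.0, 94.0]
    print("overall r:", pearson_r(tests, forecasts))
    print("r above / below threshold:", split_correlations(tests, forecasts))
```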

## 4. Final remarks

The forecasts of students enrolled in the senior weather analysis and forecasting class (ATMS 452) at the University of Washington for the years of 1997–2007 have been analyzed. The primary objective of this analysis was to determine the rate at which the students, primarily seniors in the last quarter of their undergraduate studies, develop skill at short-term weather forecasting. The opportunity for such an analysis is afforded by the controlled setting of ATMS 452 and the consistency over the years in its forecast segment.

The typical student requires about 6 weeks or about 25 forecasts (each involving multiple parameters at up to four different locations) to gain basic proficiency in next-day forecasts of clouds, winds, and temperatures (the so-called type 1 forecasts). Beyond this point, improvements in skill are minimal on average. Based on the students’ success at forecasting for unfamiliar as well as familiar locations, it appears that this proficiency arises both from practice in the drill of forecasting and from the development of local knowledge, that is, of the nature of the weather in particular locations. While the best student forecasters have skill comparable to that of the instructor (Bond) during the latter portion of the class, his prior experience gives him a sizable advantage early in the class. Additional support for the importance of experience is provided by the results based on the intensive forecasts for SEA: the flat learning curves for the students in this category presumably reflect their preexisting knowledge of Seattle weather and their prior forecast practice during the earlier portion of the class.

The sequences of daily forecast scores for the forecasts involving precipitation (type 2) reveal that typical students have almost immediate skill at a level not much lower than Bond’s. This result may be attributable to all forecasters relying on basically the same suite of NWP model precipitation forecasts, and to the difficulty humans have in improving upon such numerical predictions. Our results suggest that the top student forecasters are better at maintaining high skill levels through the latter portion of the term, when convective precipitation is more prevalent, whereas the other groups of students and Bond show slightly greater declines in skill.

Student forecast and test scores for ATMS 452 are only moderately correlated (*r* ∼ 0.4). This result probably relates to the spectrum of students in the undergraduate program in the Department of Atmospheric Sciences at the University of Washington, which ranges from individuals with more of a theoretical or academic bent to those with more interest in the day-to-day weather. Presumably, the former tend to do better on examinations and the latter tend to do better at forecasting. They may also employ different styles of forecasting, but this topic is outside the scope of this paper. Readers interested in pursuing this topic should consult Roebber et al. (2004), which includes a variety of relevant references. The finding that some highly capable forecasters are not necessarily high academic achievers should be considered by employers seeking to fill forecasting positions.

We conclude with some musings on what it takes to become a capable forecaster based not just on the results presented above, but also our own interactions with the students. First and foremost is the role of experience. As discussed by Roebber (2005), many and perhaps most individuals learn most effectively through the use of concrete examples, and there is nothing like a busted forecast to bring home a point. A large number of realizations are required to understand the many ways a forecast can fail. In particular, novice forecasters tend to be overconfident with probabilistic forecasts (e.g., Doswell 2004). Indeed, for the type 2 forecasts in ATMS 452, our students frequently forecast precipitation probabilities of either 0% or 100% early in the quarter, but quickly gain appreciation for the inherent uncertainties in forecasting precipitation. Based on our conversations with the students, particularly during group discussions held right after forecast preparation, it appears that many students’ methods evolve over the course of the quarter. In particular, early in the quarter many students focus on NWP model output and give short shrift to consideration of the present state. They seem to bring these two elements of making a short-term forecast into a better balance with practice. It also seems to require at least a few weeks for most students to develop a consistent, complete, and efficient procedure for examining a large amount of observational and NWP model data, and then effectively distilling this information into a reasonable prediction. The drill of daily forecasting is presumed to also be important in learning how to best take advantage of NWP guidance, that is, recognizing its capabilities and limitations in making specific weather forecasts. Finally, repetition also helps students form conceptual models of the weather tailored to specific locations. 
As mentioned above, for the typical student about 6 weeks of intensive forecasting is required to develop competency.

Further gains in proficiency in forecasting are probably more subtle and difficult to measure. The very best forecasters are distinguished by their ability to handle unusual situations. Stuart et al. (2007) suggest that this success arises through the ability to mix analytic approaches with more heuristic, intuitive methods. We expect that this necessitates the deep knowledge base that is gained with years of scrutiny of a region’s weather.

## Acknowledgments

This publication is (partially) funded by the Joint Institute for the Study of the Atmosphere and Ocean (JISAO) under NOAA Cooperative Agreement NA17RJ1232. Support was also provided by a grant from the National Science Foundation (ATM-0504028). We appreciate the constructive comments we received from Prof. Paul Roebber and two anonymous reviewers.

## REFERENCES

Baars, J. A., and C. F. Mass, 2005: Performance of National Weather Service forecasts compared to operational, consensus, and weighted model output statistics. *Wea. Forecasting*, **20**, 1034–1047.

Bosart, L. F., 1983: An update on trends in skill of daily forecasts of temperature and precipitation at the State University of New York at Albany. *Bull. Amer. Meteor. Soc.*, **64**, 346–354.

Doswell, C. A., III, 2004: Weather forecasting by humans: Heuristics and decision making. *Wea. Forecasting*, **19**, 1115–1126.

Gedzelman, S. D., 1978: Forecasting skill of beginners. *Bull. Amer. Meteor. Soc.*, **59**, 1305–1309.

Murphy, A. H., and E. S. Epstein, 1967: A note on probability forecasts and “hedging”. *J. Appl. Meteor.*, **6**, 1002–1004.

Roebber, P. J., 1998: The regime dependence of degree day forecast technique, skill, and value. *Wea. Forecasting*, **13**, 783–794.

Roebber, P. J., 2005: Bridging the gap between theory and applications: An inquiry into atmospheric science teaching. *Bull. Amer. Meteor. Soc.*, **86**, 507–517.

Roebber, P. J., and L. F. Bosart, 1996: The contributions of education and experience to forecast skill. *Wea. Forecasting*, **11**, 21–40.

Roebber, P. J., D. M. Schultz, B. A. Colle, and D. J. Stensrud, 2004: Toward improved prediction: High-resolution and ensemble modeling systems in operations. *Wea. Forecasting*, **19**, 936–949.

Sanders, F., 1979: Trends in skill of daily forecasts of temperature and precipitation. *Bull. Amer. Meteor. Soc.*, **60**, 763–769.

Sanders, F., 1986: Trends in skill of Boston forecasts made at MIT, 1966–1984. *Bull. Amer. Meteor. Soc.*, **67**, 170–176.

Stuart, N. A., D. M. Schultz, and G. Klein, 2007: Maintaining the role of humans in the forecast process: Analyzing the psyche of expert forecasters. *Bull. Amer. Meteor. Soc.*, **88**, 1893–1898.

Fig. 2. Time series of type 1 (cloud, wind, and temperatures) forecast skill for the median student (green), highest two students (yellow), lowest two students (blue), and Bond (red) with respect to forecast day, averaged over 1997–2007.

Citation: Weather and Forecasting 24, 4; 10.1175/2009WAF2222214.1

Fig. 3. As in Fig. 2 but for type 2 (precipitation) forecast skill.

Fig. 4. As in Fig. 2 but for the type 3 (intensive SEA) forecasts.

Fig. 5. Overall forecast grade (ordinate) vs average (midterm and final) test grade (abscissa) in ATMS 452. The linear trend of the forecast grades on the test grades is also shown; the correlation coefficient between these two parameters is ∼0.4.

Table 1. Type 1 forecast parameters and scoring.

Table 2. Type 2 forecast parameters and scoring.

Table 3. Type 3 forecast parameters and scoring.

Table 4. Linear trends in quarter-average forecast skill.

^{1} We recognize that it is impossible to strictly enforce the ban on the use of MOS. Nevertheless, students have not reported their classmates using MOS, and no cheating has been observed during the hour of forecasting.

^{2} The climatological mean for the day could be an alternative control forecast. The forecast scores for climatology have not been evaluated for ATMS 452 since climatologies for cloud cover and wind speed for individual days and hours of the day are not readily available.

* Pacific Marine Environmental Laboratory Contribution Number 3074 and Joint Institute for the Study of the Atmosphere and Ocean Contribution Number 1738.