Experimental gridded forecasts of surface temperature issued by National Weather Service offices in the western United States during the 2003/04 winter season (18 November 2003–29 February 2004) are evaluated relative to surface observations and gridded analyses. The 5-km horizontal resolution gridded forecasts issued at 0000 UTC for forecast lead times at 12-h intervals from 12 to 168 h were obtained from the National Digital Forecast Database (NDFD). Forecast accuracy and skill are determined relative to observations at over 3000 locations archived by MesoWest. Forecast quality is also determined relative to Rapid Update Cycle (RUC) analyses at 20-km resolution that are interpolated to the 5-km NDFD grid as well as objective analyses obtained from the Advanced Regional Prediction System Data Assimilation System that rely upon the MesoWest observations and RUC analyses. For the West as a whole, the experimental temperature forecasts issued at 0000 UTC during the 2003/04 winter season exhibit skill at lead times of 12, 24, 36, and 48 h on the basis of several verification approaches. Subgrid-scale temperature variations and observational and analysis errors undoubtedly contribute some uncertainty regarding these results. Even though the “true” values appropriate to evaluate the forecast values on the NDFD grid are unknown, it is estimated that the root-mean-square errors of the NDFD temperature forecasts are on the order of 3°C at lead times shorter than 48 h and greater than 4°C at lead times longer than 120 h. However, such estimates are derived from only a small fraction of the NDFD grid boxes. Incremental improvements in forecast accuracy as a result of forecaster adjustments to the 0000 UTC temperature grids from 144- to 24-h lead times are estimated to be on the order of 13%.
The National Weather Service (NWS) has extensively revised the procedures used by forecasters to create and distribute forecasts of sensible weather elements. Instead of manually typing text forecast products, forecasters now use graphical editors included in the Interactive Forecast Preparation System (IFPS) to create high-resolution gridded forecasts of weather elements that can be viewed graphically by customers as well as downloaded for specific user applications (Ruth 2002; Glahn and Ruth 2003; Glahn 2003; Spayd et al. 2005). Forecast grids at resolutions of 1.25, 2.5, or 5 km are produced at each NWS Weather Forecast Office (WFO) for their respective County Warning Area (CWA). These local grids are then combined to form national grids at 5-km resolution that are stored and disseminated as part of the National Digital Forecast Database (NDFD; Glahn and Ruth 2003).
The NWS currently provides access to NDFD forecast grids for 13 operational and experimental forecast elements over the 48 contiguous states (information available online at http://www.nws.noaa.gov/ndfd). Operational elements since December 2004 include maximum temperature, minimum temperature, and probability of precipitation. Temperature, dewpoint temperature, and weather elements have been operational since March 2004. Experimental elements include sky cover, wind direction and speed, quantitative precipitation forecast, snow amount, significant wave height, relative humidity, and apparent temperature. The forecasts are available with lead times up to 7 days.
Concerns have been raised within the NWS (Western Region Science and Operations Officers 2003) and external to that agency regarding the IFPS (Mass 2003a, b; see also Glahn 2003). In response to those concerns, the NWS Office of Science and Technology formed the IFPS Science Steering Team to suggest improvements to the IFPS (see http://www.nws.noaa.gov/ost/ifps_sst). Of particular interest to this study is the first recommendation of the Western Region Science and Operations Officers: “Develop a national real-time, gridded verification system of surface-based parameters to track the accuracy of both the numerical model guidance and the official, forecaster-edited grids.” The NWS Meteorological Development Laboratory (MDL) is verifying NDFD gridded forecasts nationally (Dagostaro et al. 2004). In addition, the NWS Office of Science and Technology sponsored a meeting during June 2004 in part to address the needs for gridded verification and subsequently formed the Mesoscale Analysis Committee to foster the development of a national mesoscale “analysis of record” that would serve as the foundation for future NDFD forecast verification (Horel and Colman 2005).
As discussed by Mass (2003a), forecasters using the IFPS often spend considerable effort adjusting model guidance and prior forecasts to develop the current set of forecast grids. Forecasters are free to choose the methods by which they create and update those grids for their CWA (L. Dunn 2005, personal communication). Forecasters may decide to initialize their entire set of forecast grids from numerical model guidance. A common approach used by forecasters in the NWS Western Region is to incrementally adjust a prior set of grids valid for a particular forecast time taking into account new information available from more recent numerical model guidance (D. Edman 2005, personal communication). In this fashion, mesoscale variations within the forecast grid that are introduced by previous forecasters are more likely to be retained.
To introduce some of the complexities faced by the forecasters who generate surface fields in the western United States, Fig. 1 shows a conservative estimate of the subgrid-scale variability in terrain height as a function of location. This estimate of the roughness of the underlying surface within NDFD grid boxes is derived from the standard deviation of the terrain height at a resolution of 0.0083° in latitude and longitude (roughly 0.8-km resolution) within each 0.0498° latitude–longitude grid box (slightly larger than a NDFD grid box). If a finer-resolution terrain dataset were used, then the subgrid-scale variability estimated in this fashion would be higher in most hilly locations. The standard deviation of terrain height is greater than 100 m in over 20% of the continental grid boxes. Distinct microclimates and responses to specific weather situations as a result of such subgrid-scale terrain variability often lead to large temperature variations within distances of 5 km in the West. Myrick et al. (2005) examined a case where 15-min averages of temperature in a terrain gap were 10°C lower than those observed 4 km away along an adjacent ridge as a result of the development of a nocturnal cold pool in northern Utah.
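The subgrid-scale terrain variability estimate described above (standard deviation of ~0.8-km terrain heights within each ~5-km box) can be sketched as follows. The function name, block size, and synthetic terrain array are illustrative assumptions, not the authors' actual code or data:

```python
import numpy as np

def subgrid_terrain_std(fine_terrain, block):
    """Std dev of fine-resolution terrain height within each coarse box.

    fine_terrain: 2D array of heights (m) at the fine (~0.8 km) resolution
    block: number of fine cells per coarse box along each axis
    """
    ny, nx = fine_terrain.shape
    ny_c, nx_c = ny // block, nx // block
    # Trim so the grid divides evenly, then group fine cells by coarse box
    t = fine_terrain[:ny_c * block, :nx_c * block]
    boxes = t.reshape(ny_c, block, nx_c, block).swapaxes(1, 2)
    return boxes.std(axis=(2, 3))

# Synthetic example: a flat plain (std = 0 m) next to rough hills (std >> 100 m)
rng = np.random.default_rng(0)
terrain = np.zeros((12, 12))
terrain[:, 6:] = 2000.0 + 300.0 * rng.standard_normal((12, 6))
sd = subgrid_terrain_std(terrain, 6)   # 2 x 2 grid of coarse boxes
```

Boxes over the flat half yield a standard deviation of zero, while the rough half exceeds the 100-m threshold discussed in the text.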
As reviewed by Kalnay (2003), the goal of objective analysis is to minimize the difference between the analyses and “unknown” truth over a large sample of analyses, given the errors inherent in both the observations and background fields from which the analyses are derived. Similarly, according to the NWS Digital Data Products/Services Specification (NWS Directive 10-506; see http://www.nws.noaa.gov/directives/), the NDFD temperature forecast is defined as the expected value for the indicated hour and is representative of the conditions expected across the grid box. As discussed by Kalnay (2003), the expected value in the context of objective analysis is the average obtained by making many similar measurements and would equal the average of all the true values in a sample of unbiased observations.
Even though the expected or true values appropriate for verifying forecasts are unknown, forecasters face the dilemma of specifying a value for each grid box in their CWA. It is often assumed by forecasters and those who evaluate human forecasts that observational errors are small such that forecast accuracy can be assessed by comparing forecasts directly with observations. For example, a common measure of forecast accuracy is the mean squared error (MSE),

MSE = (1/N) Σᵢ (fᵢ − oᵢ)²,   (1)

where f (o) denotes the forecast (observation). If it is assumed that the forecast and observational departures from truth are uncorrelated, then the MSE is equal to

MSE = σ_f² + σ_o²,   (2)

where σ_f² (σ_o²) is the forecast (observation) error variance. The sensitivity of verification results to observational uncertainty σ_o² often receives limited attention. The text by Jolliffe and Stephenson (2003) has only two pages devoted to data quality and verification data biases. Observations can develop systematic biases under certain synoptic situations due to siting, exposure, and instrument characteristics; for example, temperature obtained from nonaspirated thermistors with radiation shields can be in error by as much as 0.5°–2.0°C (Hubbard and Lin 2002). Further, assume that the “true” grid-box temperature for the cold pool example of Myrick et al. (2005) is equal to the average of the observations on the ridge (warm) and in the gap (cold) and that the forecaster successfully predicts this value. If only one observation from the ridge or gap is available, most objective verification schemes based solely on observations will penalize this correct forecast.
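The role of observational uncertainty in the MSE decomposition above can be illustrated with a small simulation. The error magnitudes below are assumptions chosen for the example, not values from the study:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200_000
truth = 10.0 * rng.standard_normal(n)       # unknown "true" grid-box temperatures
sigma_f, sigma_o = 3.0, 2.0                 # assumed error std devs (deg C)

# Forecast and observation errors drawn independently, hence uncorrelated
forecast = truth + sigma_f * rng.standard_normal(n)
obs = truth + sigma_o * rng.standard_normal(n)

# MSE against imperfect observations inflates toward sigma_f**2 + sigma_o**2 = 13.0
mse = np.mean((forecast - obs) ** 2)
```

Even a perfect forecast (sigma_f = 0) would score an MSE near sigma_o² here, which is the point made with the ridge/gap example: verifying solely against observations penalizes forecasts by the observation error as well.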
Brier and Allen (1951) classified the goals of forecast verification into three categories: administrative (assess overall forecast performance for the purposes of strategic planning), scientific (improve understanding of the nature and causes of forecast errors in order to improve future forecasts), and economic (assess the value of the forecasts to the end users). Many tools have been developed over the years to evaluate numerical forecast guidance that are equally applicable to the NDFD gridded forecasts (Jolliffe and Stephenson 2003). The goal of this study is to assess selected estimates of forecast quality (Murphy 1993) in order to identify procedures that are appropriate for administrative-oriented verification at the national level as well as scientific-oriented verification that may provide feedback directly to the forecasters. Our study will attempt to answer the following questions related to the verification of NDFD gridded forecasts:
What are appropriate verification metrics that can be used to verify NDFD forecast grids?
Do estimates of forecast accuracy and skill differ if the grids are verified directly against observations at specific locations rather than verified against an analysis at the same resolution as the forecast grid?
Is it possible to estimate the incremental improvements in forecast accuracy as a function of lead time? In other words, what measures are appropriate to quantify the impact of the forecasters’ adjustments to the prior set of grids?
A sample of the experimental NDFD temperature forecasts from the 2003/04 winter season will be used to explore these questions. This study does not address forecast value (Brier and Allen 1951; Murphy 1993); for example, it will not be possible to assess from our results whether the human-edited gridded forecasts provide added skill relative to the original numerical guidance for an application of interest to a specific end user. Specific details of the observations and objective analyses follow in section 2. In section 3, NDFD forecasts during two distinctly different synoptic events are evaluated. In addition, verification statistics for NDFD forecasts during the 2003/04 winter season are presented and the sensitivity of those measures of forecast quality to selected aspects of the methodology is examined. A discussion of the results and conclusions follow in section 4.
2. Verification datasets
Surface data from weather observing stations across the United States have been linked together into a common database as part of MesoWest (Horel et al. 2002). Through MesoWest, the Automated Surface Observing System (ASOS) network maintained by the NWS, Federal Aviation Administration, and Department of Defense is supplemented by networks supported by over 150 government agencies and commercial firms. The characteristics of the observations vary considerably from one reporting network to another as a result of the needs of the agency or firm that installed the equipment. MesoWest observations of surface temperature during the period from 18 November 2003 to 7 March 2004 are used in this study. Data entered into the MesoWest database are quality controlled in real time. For this study, considerable effort was spent objectively and subjectively reevaluating many of the observations in the database to remove obvious biases.
Approximately 2500 temperature observations were available on average for each verification time across the western United States from MesoWest during the 2003/04 winter to supplement the roughly 300 ASOS reports. However, only 2% of the total number of 5 km × 5 km NDFD grid boxes contain an observation from MesoWest. For example, the 2691 observations available at 0000 UTC 5 January 2004 fell within only 2582 of the 129 763 grid boxes over the western United States: 2484 grid boxes contained one observation from MesoWest and 98 contained more than one.
As a crude measure of the areas where there are sufficient observations to verify the NDFD forecast products in the West, Fig. 2 shows estimates of the spatial coverage provided by only ASOS as well as that available using all MesoWest observations (including ASOS observations). These estimates are derived from the total area (km2) of each NWS forecast zone multiplied by the standard deviation (km) of the terrain height within each zone divided by the average number of observations (obs) available. The NWS forecast zones are used because they reflect common geographic and climate regions over which watches and warnings are issued. The area is multiplied by the standard deviation of the terrain height because observations in flat regions are more likely to be representative of larger areas than observations in hilly ones. For a moderately hilly zone (terrain height standard deviation equal to 0.2 km), no shading in Fig. 2 reflects less than one station within a 50 km × 50 km area (i.e., one station within 100 NDFD grid boxes) while the darkest shading denotes more than four stations within the same area. Obviously, the ASOS observations by themselves do not provide sufficient coverage to verify NDFD forecasts throughout the West. Even with the MesoWest observations, there are still many areas of the West where there are few observations available nearby to use for verification.
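The zone coverage estimate described above (area times terrain-height standard deviation divided by observation count) reduces to a one-line calculation; the function name and the zone values are hypothetical, chosen to match the moderately hilly example in the text:

```python
def coverage_index(area_km2, terrain_std_km, n_obs):
    """Crude observation-coverage index for a forecast zone.

    Larger values indicate sparser effective coverage: big, rough zones
    with few observations score high; flat, well-observed zones score low.
    """
    return area_km2 * terrain_std_km / max(n_obs, 1)

# Moderately hilly 50 km x 50 km zone (terrain std dev = 0.2 km):
sparse = coverage_index(2500.0, 0.2, 1)   # one station in the zone
dense = coverage_index(2500.0, 0.2, 4)    # four stations in the zone
```

With one station the index is 500 and with four stations it is 125, a factor-of-4 difference that mirrors the shading thresholds described for Fig. 2.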
The operational Rapid Update Cycle (RUC) analyses at 20-km horizontal resolution provided by the National Centers for Environmental Prediction (Benjamin et al. 2004) were used to verify the NDFD forecasts. For direct comparisons with the 5-km NDFD forecasts, the 20-km RUC surface potential temperature field was bilinearly interpolated horizontally to the 5-km NDFD grid and then converted to surface temperature. These 5-km RUC grids were also used as the background fields for objective analyses that rely upon the MesoWest observations. The verifying analyses are created at the University of Utah using the Advanced Regional Prediction System (Xue et al. 2000, 2001, 2003) Data Assimilation System (ADAS). For the objective analysis, ADAS employs the Bratseth method of successive corrections, which is an inexpensive analysis procedure that converges to the same solution as optimal interpolation (Bratseth 1986; Kalnay 2003). The University of Utah version of ADAS employed in this study is described by Lazarus et al. (2002) and Myrick et al. (2005).
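A minimal sketch of the interpolate-then-convert step follows, assuming Poisson's equation for the potential-temperature-to-temperature conversion. The 2 × 2 grid and the 850-hPa pressure are purely illustrative; the actual conversion would use the surface pressure appropriate to each 5-km grid point:

```python
import numpy as np

def theta_to_temp(theta_k, pressure_hpa):
    """Convert potential temperature (K) to temperature (K) via Poisson's equation."""
    kappa = 287.04 / 1004.0            # R_d / c_p for dry air
    return theta_k * (pressure_hpa / 1000.0) ** kappa

def bilinear(field, y, x):
    """Bilinear interpolation of a 2D field at fractional grid indices (y, x)."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    dy, dx = y - y0, x - x0
    return ((1 - dy) * (1 - dx) * field[y0, x0] +
            (1 - dy) * dx * field[y0, x0 + 1] +
            dy * (1 - dx) * field[y0 + 1, x0] +
            dy * dx * field[y0 + 1, x0 + 1])

# Toy 20-km potential temperature patch, interpolated to a 5-km point:
theta20 = np.array([[300.0, 302.0],
                    [304.0, 306.0]])
theta5 = bilinear(theta20, 0.25, 0.25)       # horizontal interpolation first
t5 = theta_to_temp(theta5, 850.0)            # then convert at (assumed) sfc pressure
```

Interpolating potential temperature rather than temperature itself avoids smearing elevation-driven temperature gradients across grid points of differing heights, which is presumably why the RUC surface field is handled this way.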
The assumptions made regarding the magnitude of the observation and background errors (as well as the rate at which these errors become uncorrelated with distance) are fundamental aspects of objective analysis. The observation (σ_o²) and background (σ_b²) error variances can be estimated following an approach used by previous investigators (e.g., Lönnberg and Hollingsworth 1986; Xu et al. 2001). The covariance between observational innovations (difference between observation and background values) at two points i and j separated by distance r is calculated from all pairs of points for all background fields during the entire winter season,

cov(r) = ⟨(oᵢ − bᵢ)(oⱼ − bⱼ)⟩,   (3)
where o and b, in our case, are MesoWest observations and RUC analysis values, respectively, at the nearest grid points to the observation locations (as mentioned above, the covariance is calculated from roughly 2% of the total grid points). Subtracting the unknown “truth” from each observed and background value in Eq. (3) and expanding the right-hand side yields

cov(r) = ⟨o′ᵢo′ⱼ⟩ − ⟨o′ᵢb′ⱼ⟩ − ⟨b′ᵢo′ⱼ⟩ + ⟨b′ᵢb′ⱼ⟩,   (4)

where departures from truth are denoted by primes and ⟨o′ᵢo′ⱼ⟩ (⟨b′ᵢb′ⱼ⟩) is an estimate of the observational (background) error covariance. It is usually assumed 1) that the observational errors are uncorrelated with one another such that ⟨o′ᵢo′ⱼ⟩ = δᵢⱼ σ_o² and 2) that the background and observational errors are uncorrelated such that ⟨o′ᵢb′ⱼ⟩ = 0 and ⟨b′ᵢo′ⱼ⟩ = 0. Thus,

cov(r) = δᵢⱼ σ_o² + σ_b² ρᵢⱼ,   (5)

where ρᵢⱼ is the background error correlation, which is often assumed to diminish isotropically with increasing distance. [Myrick et al. (2005) discuss the limitations of this isotropic assumption.]
The covariance of observational innovations as a function of distance r during the 2003/04 winter season is shown in Fig. 3 for the 0000 and 1200 UTC RUC background fields. The covariance drops sharply as a function of horizontal distance but does not asymptote to 0, which suggests that the RUC background fields exhibit errors that remain correlated over distances of hundreds of kilometers. Fitting a least squares curve to the innovation covariance values for horizontal distances greater than 5 km and extrapolating the curve back to r = 0 makes it possible to estimate σ_b². Based on sensitivity tests with 4th- through 12th-order polynomials (not shown), a 7th-order polynomial was chosen to fit the covariance values because the 95% confidence limits for the extrapolated values at r = 0 were the smallest (e.g., 0.25°C at 0000 UTC and 0.26°C at 1200 UTC). Thus, σ_b at 0000 (1200) UTC is estimated to be 2.5°C (2.7°C). Using Eq. (5), the observation error variance can then be estimated from the difference between the innovation covariance value at distance zero and the estimate of σ_b². Thus, σ_o at 0000 (1200) UTC is estimated to be 2.3°C (2.7°C). The causes of the higher observational error at 1200 UTC compared with 0000 UTC are not entirely clear, but may reflect that winter morning temperature is more difficult to observe because of the often complex early morning boundary layer in many locations, in contrast to the generally well-mixed afternoon boundary layer. As will be discussed in section 4, future work will examine in greater detail the causes of these large errors in the MesoWest observations.
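The innovation-covariance procedure of Eqs. (3)–(5) might be sketched as below. The binning scheme, the rescaling of distances before the polynomial fit (for numerical stability), and the synthetic exponential correlation used in the check are illustrative assumptions rather than the paper's exact procedure:

```python
import numpy as np

def innovation_covariance_by_distance(innov, xy, bin_edges_km):
    """Bin the covariance of (obs - background) innovations by station separation.

    innov: (n_times, n_stations) innovations, station means removed
    xy:    (n_stations, 2) station coordinates in km
    """
    d = np.hypot(xy[:, None, 0] - xy[None, :, 0],
                 xy[:, None, 1] - xy[None, :, 1])
    iu, ju = np.triu_indices(len(xy), k=1)            # each station pair once
    pair_cov = (innov[:, iu] * innov[:, ju]).mean(axis=0)
    which = np.digitize(d[iu, ju], bin_edges_km)
    centers = 0.5 * (bin_edges_km[:-1] + bin_edges_km[1:])
    cov = np.array([pair_cov[which == k + 1].mean()
                    if np.any(which == k + 1) else np.nan
                    for k in range(len(centers))])
    return centers, cov

def split_error_variances(centers, cov, cov_at_zero, order=7):
    """Extrapolate a polynomial fit of cov(r) for r > 0 back to r = 0.

    The intercept estimates sigma_b**2; the remainder of the total innovation
    variance at r = 0 is attributed to observation error (Eq. 5).
    """
    x = centers / centers.max()          # rescale distances for a stable fit
    coeffs = np.polyfit(x, cov, order)
    sigma_b2 = float(np.polyval(coeffs, 0.0))
    sigma_o2 = cov_at_zero - sigma_b2
    return sigma_b2, sigma_o2

# Synthetic check with sigma_b = 2.5, sigma_o = 2.3 and an exponential correlation:
r = np.linspace(5.0, 400.0, 80)
cov_synth = 2.5**2 * np.exp(-r / 150.0)
sb2, so2 = split_error_variances(r, cov_synth, 2.5**2 + 2.3**2)
# recovers sigma_b^2 near 6.25 and sigma_o^2 near 5.29
```

The synthetic check mirrors the values quoted for 0000 UTC; with real innovations the scatter in the binned covariances, not shown here, is what drives the choice of polynomial order.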
For this study, the magnitudes of the background and observation error variances for the ADAS analyses were set to be equal (e.g., both standard deviations are approximately 2.6°C in Fig. 3). Previous work by Myrick et al. (2005) explored the rate at which the ADAS analyses converged and found that the best results were obtained when the ratio of the error variances was set to 1. In addition, the rate at which the background errors decorrelate with distance in the ADAS analyses is specified a priori, beginning with a length scale of 75 km and decreasing to 55 km in later iterations, as shown schematically by the dotted lines in Fig. 3. Note that our ADAS analyses assume that the background errors remain more (less) correlated at distances less than 50 km (greater than 100 km) than determined empirically. Vertical and terrain-blocking constraints also influence the background error correlation in ADAS (Myrick et al. 2005).
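A highly simplified successive-correction pass, in the spirit of (but not equivalent to) the Bratseth scheme used in ADAS, can illustrate the shrinking length scale. The Gaussian weight, the unit observation-error penalty in the denominator, and the toy one-dimensional grid are assumptions for illustration only:

```python
import numpy as np

def successive_correction(background, grid_xy, obs_val, obs_xy,
                          length_scales_km=(75.0, 55.0)):
    """Simplified successive-correction analysis.

    background: (n_grid,) background values at the grid points
    grid_xy:    (n_grid, 2) grid-point coordinates (km)
    obs_val:    (n_obs,) observed values
    obs_xy:     (n_obs, 2) observation coordinates (km)
    """
    analysis = background.astype(float).copy()
    d = np.hypot(grid_xy[:, None, 0] - obs_xy[None, :, 0],
                 grid_xy[:, None, 1] - obs_xy[None, :, 1])
    nearest = d.argmin(axis=0)            # grid point closest to each observation
    for L in length_scales_km:
        w = np.exp(-(d / L) ** 2)         # isotropic Gaussian correlation model
        resid = obs_val - analysis[nearest]   # obs-minus-current-analysis residuals
        # +1.0 in the denominator mimics an observation-error penalty (assumption)
        analysis += (w * resid).sum(axis=1) / (w.sum(axis=1) + 1.0)
    return analysis

# One 5.0 degC observation at x = 100 km on a line of grid points, zero background:
grid_xy = np.column_stack([np.arange(0.0, 201.0, 10.0), np.zeros(21)])
analysis = successive_correction(np.zeros(21), grid_xy,
                                 np.array([5.0]), np.array([[100.0, 0.0]]))
```

Broad corrections are drawn in by the 75-km pass and sharpened by the 55-km pass, so the analysis approaches the observation near its location while relaxing to the background far away; the real ADAS additionally applies the vertical and terrain-blocking constraints mentioned above.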
Benjamin et al. (2004) found root-mean-square errors (rmse) between the RUC surface analyses and the hourly ASOS observations around the nation to be on the order of 1.5°C for temperature. Table 1 provides similar information limited to the western United States by comparing the nearest gridded analysis values with all of the MesoWest observations (including the ASOS stations that were used in the operational RUC analyses) during the 2003/04 winter season. As illustrated in Table 1, larger mean absolute errors and rmse are evident between the RUC analyses and MesoWest observations over the mountainous terrain of the West than those found by Benjamin et al. (2004). Note that calculating the rmse is essentially the same as computing the covariance of the observation innovations at r = 0 (see Fig. 3). In other words, the MSE is not the error, σ_b², of the RUC analyses alone, but the combined error of the RUC analyses and observations (σ_b² + σ_o²). The ADAS analyses have minimal bias when compared with the MesoWest observations and the rmse of the ADAS analyses are smaller than those of the background field since the “optimal” analysis errors should be less than the observational and background errors (Kalnay 2003).
3. Verification of NDFD forecasts

a. NDFD forecast examples
Two contrasting weather situations are used to illustrate NDFD temperature forecast accuracy. The first (22–24 November 2003) focuses on a rapidly moving cold front to the east of the Rockies and the typical decrease of temperature with elevation throughout much of the West while the second (14–16 January 2004) emphasizes the persistence of cold pools in many basins. As discussed in the introduction, the latter situation is often difficult to analyze objectively. For example, the rmse between the RUC (ADAS) analysis and all observations in the West is 3.2°C (1.8°C) for 0000 UTC 22 November 2003, which is slightly better than the differences between the analysis grids and observations for the season as a whole (see Table 1). In contrast, the rmse between the RUC (ADAS) analysis and all MesoWest observations in the West is 4.7°C (2.5°C) for 0000 UTC 14 January 2004.
The 48-h NDFD temperature forecast valid at 0000 UTC 22 November 2003 (Fig. 4a) in part reflects the climatological tendency for temperature to decrease from south to north and with increasing elevation across the West. In addition, a strong cold front is forecast to push southward from Canada into eastern Montana. The 48-h NDFD temperature forecast (Fig. 4a) represents a blend of individual forecast grids provided by forecasters at 40 NWS offices. For the most part, the 48-h temperature forecast appears seamless and consistent from one CWA to another. Noticeable exceptions in this instance include the areas of northeastern Wyoming and the state border between Arizona and New Mexico.
Comparing the 48-h temperature forecast (Fig. 4a) to the 20-km RUC analysis of temperature interpolated to the 5-km NDFD grid (Fig. 4b) at 0000 UTC 22 November 2003, it is evident that the intensity of the arctic outbreak in Montana was underestimated and the onset of the arctic push into Wyoming was not captured. For example at Sheridan, Wyoming, near the Montana–Wyoming border, the temperature dropped from −4°C at 2000 UTC to −13°C at 0000 UTC; the NDFD forecast valid at 0000 UTC in the vicinity of Sheridan was −4°C. The ADAS temperature analysis (Fig. 4c) constrains the RUC analysis (Fig. 4b) by the local observations and terrain features such that the arctic outbreak and decreasing temperature with elevation in most mountain ranges is delineated in greater detail than is evident in the RUC analysis.
The southward progression of the arctic air mass to the east of the Rockies is evident in the forecast and analysis grids 48 h later (Fig. 5). The forecasters underestimated the progression of the front through southeastern Colorado at this time, as is evident in the RUC and ADAS analysis grids. In addition, the onset of downslope winds and rapid warming to the east of the Montana Rockies was not forecast. For example, a temperature of −1°C was observed at Cut Bank, Montana, at 0000 UTC 24 November 2003; the 48-h NDFD forecast valid at this time near Cut Bank was −11°C.
To examine how this rapidly evolving event compares with the conditions throughout the winter season, departures of the 48-h forecasts and analyses from their respective seasonal means for the period 18 November 2003–29 February 2004 are shown in Figs. 4d–f and 5d–f. Hence, the forecast and analysis biases for the season as a whole are removed in order to focus upon the synoptic-scale and mesoscale spatial distributions of the forecast temperature compared with those analyzed by the RUC and ADAS. Over the West as a whole, the temperature departures of these forecasts from the seasonal average agree subjectively quite well with the corresponding analyses. The aforementioned errors in the intensity and timing of the arctic push are evident, however.
In contrast to the previous case where the forecast problem of the day was a fast-moving arctic cold front to the east of the Rockies, we now examine forecasts during a prolonged period of upper-level ridging over the western United States in mid-January 2004 that led to the maintenance of extensive snow cover in mountain valleys and the development of persistent cold pools in many basins. During this period, forecasters at WFOs in the Great Basin predicted the valley cold pools to persist; however, they underestimated the intensity of the cold pools. For example, at 0000 UTC 14 January 2004, the temperature was at or below −6°C in the Salt Lake Valley, Utah, while the coldest temperature forecast with a 48-h lead time was greater than −4°C (Fig. 6a). Similarly, temperature tended to be below −6°C at Pocatello, Idaho, and nearby regions of the upper Snake River Valley while the coldest temperature forecast was greater than −4°C.
The cold pools in Utah, Idaho, and Wyoming at 0000 UTC 14 January 2004 are particularly evident in the RUC and ADAS analyses (Figs. 6b and 6c) and more comparable to the temperatures observed in those valley locations than those forecast 48 h earlier (Fig. 6a). The ADAS temperature analysis (Fig. 6c) constrains the RUC analysis by local terrain features such that the cold pools are confined to the valleys. The ADAS analysis also more clearly captures the increase of temperature with elevation along the slopes and sidewalls of the valleys in northern Utah, southwestern Wyoming, northeastern Nevada, and southern Idaho. Consistent with the ADAS analysis, observed temperatures at high elevations (e.g., Wasatch and Uinta Mountains in northern Utah) are as much as 10°C higher than in nearby valleys. The persistence of this cold pool event is particularly striking and can be seen by comparing the forecasts and verifying analyses on 14 January (Fig. 6) with those two days later on 16 January (Fig. 7). The 48-h forecast continues to underestimate the intensity of the valley cold pools.
To examine how this persistent ridging event over the West compares with the conditions throughout the winter season, departures of the forecasts and analyses from their respective seasonal means are shown in Figs. 6d–f and 7d–f. Hence, relatively cold conditions compared with the seasonal mean temperature were forecast on both days in central Washington, the Snake River Valley in Idaho, and portions of western and eastern Utah, which generally agrees with the conditions analyzed by the RUC and ADAS. As was evident before, the RUC analyses tend to extend the cold pools beyond the confines of the valleys into surrounding higher elevations compared with the ADAS analyses.
Table 2 summarizes several basic measures of forecast accuracy applied to these two events when calculated only at the small number of grid points adjacent to the MesoWest observations. The rmse for the four forecasts relative to the observations is roughly 4°C. The bias and rmse of the forecasts computed relative to the ADAS analyses are similar to those determined relative to the observations while larger rmse (and a warm bias of 2.4° and 1.6°C during the cold pool cases) are evident when the forecasts are compared with the RUC analyses.
The statistics in Table 3 can be compared with those in Table 2 to assess the sensitivity of the bias and rmse to whether they are computed near observation locations only (roughly 2% of the grid) or over the entire grid. Accuracy measures of the forecasts relative to ADAS are fairly similar for the two approaches. However, the bias and rmse of the forecasts computed over the entire grid relative to the RUC analyses are as much as 0.5°C lower than when computed for the small fraction of the grid where observations are nearby.
The anomaly correlation coefficient (ACC) relates the spatial patterns of the departures from their respective seasonal means of the 48-h forecasts and verifying analyses (right-hand panels in Figs. 4 –7). The removal of the seasonal means in this instance draws attention to how well the forecasts capture the day-to-day variations in synoptic and mesoscale features. As discussed by Murphy and Epstein (1989) and Wilks (2006), the threshold for “skillful” forecasts is an ACC of 0.5; however, common practice for relatively smooth fields (such as midtropospheric geopotential height) is to use a threshold of 0.6 to define “useful” forecasts. As shown in Table 3, all four of the 48-h forecasts in this instance would be judged to be skillful and the November 2003 cases would be judged to be useful. However, the skill of a persistence forecast as measured by the ACC is high during the cold pool event.
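One common centered form of the ACC used for comparisons like those above can be written as follows; the seasonal-mean field and the toy 2 × 2 grids are assumptions for illustration:

```python
import numpy as np

def anomaly_correlation(forecast, verifying, seasonal_mean):
    """Centered anomaly correlation between forecast and verifying fields.

    The same (assumed) seasonal-mean field is removed from both grids,
    then the spatial means of the anomalies are removed before correlating.
    """
    fa = np.ravel(np.asarray(forecast, float) - seasonal_mean)
    va = np.ravel(np.asarray(verifying, float) - seasonal_mean)
    fa = fa - fa.mean()
    va = va - va.mean()
    return float(fa @ va / np.sqrt((fa @ fa) * (va @ va)))

# Sanity checks on a toy grid with a zero climatology:
fcst = np.array([[1.0, 2.0], [3.0, 4.0]])
clim = np.zeros((2, 2))
acc_perfect = anomaly_correlation(fcst, fcst, clim)     # identical patterns
acc_reversed = anomaly_correlation(fcst, -fcst, clim)   # opposite patterns
```

Because the seasonal mean is subtracted first, a forecast that merely reproduces climatology earns no credit; only the day-to-day anomaly pattern, the quantity of interest in Figs. 4–7, contributes to the score.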
b. Measures-oriented cumulative statistics
As outlined in section 3a, examining forecast error as a function of synoptic situation is clearly necessary in order to provide feedback to forecasters and end users. However, cumulative statistics are also necessary to elucidate common forecast tendencies and reduce the dimensionality of the verification problem (Murphy 1991). Thus, we have also examined cumulative bias and rmse metrics over the entire winter season (18 November 2003–29 February 2004) as a function of location and forecast lead time. The bias of the 48-h temperature forecasts relative to the analysis grids (Figs. 8a and 8b) highlights regions where forecasters tend to have a warm or cold bias during this winter season. For example, forecasters at the Medford, Oregon, and Elko, Nevada, WFOs had a warm bias forecasting 48-h temperature when compared with the verifying RUC and ADAS analyses. The biases between the NDFD 48-h temperature forecasts and the verifying ADAS analyses tend to be smaller compared with those between the NDFD forecasts and the RUC analyses (cf. Figs. 8a and 8b). The smallest 48-h rmse (<2°C) between the forecast and analysis grids (Figs. 8c and 8d) are located over the desert Southwest near the California–Arizona border and offshore while the largest rmse (4°–7°C) tend to be located over the higher terrain of the West.
To assess the sensitivity of basic accuracy measures such as bias and rmse to different verification methodologies, statistics are presented for verification of the forecast values over the small fraction of the grid adjacent to observation locations (Fig. 9) and for verification of the forecast values over the entire grid (Fig. 10). Generally, the NDFD forecasts have a cold bias in the morning and a smaller warm bias in the afternoon, while rmse increase from 3.5°–4°C during the first day to 5°–5.5°C after several days. A possible explanation for the morning and afternoon temperature biases will be discussed in section 4. NDFD temperature rmse are largest for forecasts valid at 1200 UTC (Figs. 9b and 10b). The temperature biases of the NDFD forecasts relative to the ADAS analyses are quite similar to those exhibited relative to the observations (Fig. 9a).
Because we have estimated σ_o² and σ_b² at the observation locations (Fig. 3), it is possible to calculate a gross estimate of the analysis error variance, denoted by σ_a² (see Kalnay 2003, section 5.3.17, p. 146). Estimates of the error variance of the NDFD forecasts (σ_f²) near the observation locations can then be determined from the rmse (Fig. 9b) assuming that the errors of the observations, RUC analyses, and ADAS analyses are uncorrelated with the forecast errors. These estimates of σ_f are shown in Fig. 9c for the limited sample of grid points near observation locations (it is not possible to estimate σ_f for the entire grid because σ_o², σ_a², and σ_b² are not known at all of the grid points). Although there are larger discrepancies between the three estimates of σ_f at short lead times, these results suggest that NDFD forecast error increases from roughly 3°C at lead times less than 48 h to more than 4°C after 120 h. Hence, even though the true values are unknown, we can crudely estimate the accuracy of the NDFD forecasts for the entire season for the small fraction of the grid points near observations.
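The partitioning used here, rmse² = σ_f² + σ_reference², can be inverted to estimate σ_f once the reference (observation or analysis) error variance is known. The rmse and observation-error values below are round numbers consistent with the magnitudes reported in the text, not exact figures from the study:

```python
import numpy as np

def forecast_error_std(rmse_vs_reference, reference_error_var):
    """Estimate sigma_f from an rmse computed against an imperfect reference.

    Assumes reference errors are uncorrelated with forecast errors, so that
    rmse**2 = sigma_f**2 + reference_error_var.
    """
    return float(np.sqrt(rmse_vs_reference ** 2 - reference_error_var))

# e.g., an assumed rmse of 3.9 C against observations with error std dev 2.3 C
sigma_f = forecast_error_std(3.9, 2.3 ** 2)   # roughly 3.1 C
```

The same inversion with an analysis as the reference would substitute σ_a² for σ_o², which is why three separate σ_f curves (vs observations, RUC, and ADAS) can be drawn in Fig. 9c.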
As discussed in section 3a, the ACCs computed between forecast and analysis fields with their respective seasonal means removed focus attention upon the synoptic and mesoscale features forecast on particular days. To assess the day-to-day variability of the ACC during this winter season, ACCs between 48-h NDFD temperature forecasts and RUC and ADAS analyses are shown in Fig. 11. The ACC between pairs of ADAS analyses separated by 48 h is also shown to provide a reference persistence (PERS) forecast. ACCs greater than 0.5 occurred 69% of the time for 48-h NDFD temperature forecasts when verified against ADAS during the 2003/04 winter season. The day-to-day differences between the ACCs are large compared with the differences arising from the various analysis approaches (i.e., cf. the ADAS and RUC values for any particular forecast). Hence, once the seasonal biases are removed (Figs. 8a,b), the day-to-day variations in the spatial patterns of the ADAS and RUC are very similar to one another. During several active weather periods (e.g., a cold front sweeping across the West from 21 to 23 November 2003 and an arctic outbreak to the east of the Rockies after 1 January 2004), the NDFD temperature forecasts at 48 h exhibited considerable skill, especially relative to the 48-h persistence forecasts. During the persistent upper-level ridging episode from 10 to 15 January 2004, the NDFD forecasts exhibited skill, but the persistence forecasts had equal, and occasionally even greater, skill. The particularly low NDFD skill on 25 November 2003, immediately following a period of high skill, resulted from the failure to capture several synoptic and mesoscale details around the West, including rapid warming to the east of the Rockies.
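The ACC for a single forecast day can be sketched as an uncentered anomaly correlation over the grid, with each grid point's seasonal mean removed first; this is a generic form, not necessarily the exact operational computation.

```python
import numpy as np

def anomaly_correlation(forecast, analysis, f_clim, a_clim):
    """Spatial anomaly correlation between one forecast field and the
    verifying analysis, each with its seasonal mean field removed.
    All inputs are 2-D arrays over the verification grid."""
    fa = (forecast - f_clim).ravel()   # forecast anomaly
    aa = (analysis - a_clim).ravel()   # analysis anomaly
    return np.dot(fa, aa) / np.sqrt(np.dot(fa, fa) * np.dot(aa, aa))
```

A perfect anomaly forecast yields an ACC of 1; the 0.5 threshold used above is a conventional cutoff for a skillful synoptic pattern forecast.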
Averaged over the entire 2003/04 winter season as a function of forecast lead time, ACCs indicate that NDFD temperature forecasts exhibit skill at 12-, 24-, 36-, 48-, and 72-h lead times when verified against ADAS assuming a skill threshold of 0.5 (Fig. 12). In addition, the NDFD 72-h temperature forecasts exhibit skill comparable to 24-h persistence forecasts. The skill of the NDFD temperature forecasts would be assessed to be lower if the RUC is used for the verification.
c. Examples of distributions-oriented statistics
Murphy and Winkler (1987) and Brooks and Doswell (1996) summarize many of the limitations of the traditional accuracy measures used in section 3b. Alternatively, a distributions-oriented approach can be used that focuses upon the joint distribution of the forecasts and observations or the conditional distributions predicated upon the values of the forecasts or observations (Potts 2003). However, as noted by Murphy (1991), the large dimensionality of the joint distributions of NDFD forecasts is a significant drawback as a result of the large number of possible combinations of forecasts and observations as a function of geographic location, lead time, parameter, forecast issuance time, season, and verification datasets, among other factors. Reduction in the dimensionality of the joint distributions requires defining specific verification goals, for example, assessing the skill of the criteria used for watches or warnings, evaluating the skill for significant meteorological events such as large temperature falls or rises from one day to the next, or focusing on the needs of specific user communities.
The starting point in any distributions-oriented verification is to examine the joint distribution of forecasts and observations as a means to assess general forecast tendencies. A subset of the joint distribution of 0000 UTC 48-h NDFD forecasts and ADAS analyses is shown in Table 4 within the range ±10°C for all locations in the West. A bin size of 5°C was selected to reflect our estimate of the observational error (∼2.5°C). Nearly 15 million pairs of analysis and forecast values are summarized in Table 4. For 48-h NDFD forecasts between −5° and 0°C issued at 0000 UTC, the percentage of accurate forecasts (49.5%) is calculated by dividing the percentage of forecasts and analyses between −5° and 0°C (9.4%) by the percentage of ADAS analysis values between −5° and 0°C from the marginal distribution (19.0%). The percentage of accurate forecasts for forecasts in the range 0°–5°C is slightly higher, 53%. Relative to the ADAS analyses, forecasters tended to overpredict the temperature between −5° and 0°C; that is, 6.1% (2.5%) of the time the predicted temperature was one bin higher (lower).
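The accuracy calculation from the joint and marginal percentages in Table 4 reduces, for a single 5°C bin, to the following sketch (variable names are illustrative):

```python
import numpy as np

def joint_accuracy(forecasts, analyses, lo=-5.0, hi=0.0):
    """Percentage of 'accurate' forecasts for one bin: the joint
    percentage of (forecast, analysis) pairs that both fall in [lo, hi)
    divided by the marginal percentage of analyses in [lo, hi)."""
    f = np.asarray(forecasts, dtype=float)
    a = np.asarray(analyses, dtype=float)
    in_bin_a = (a >= lo) & (a < hi)
    joint = ((f >= lo) & (f < hi) & in_bin_a).mean()
    marginal = in_bin_a.mean()
    return 100.0 * joint / marginal
```

With the Table 4 values, 100 × 9.4 / 19.0 recovers the 49.5% quoted in the text.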
A salient example from the large number of joint probability distributions derived in this study is shown in Fig. 13a. The accuracy of NDFD forecasts in the West between −5° and 0°C at lead times from 12 to 168 h is determined for all grid points relative to the RUC and ADAS analyses as well as for the much smaller sample of grid locations adjacent to observation sites. The forecast accuracy improves from 40% at 168-h lead time to over 50% at 12 h. Forecast accuracy in the afternoon is estimated to be lower when comparing the forecasts with the RUC analyses rather than the ADAS analyses.
Because the binning procedure used above to define the joint probability distributions crudely takes into consideration estimates of observational or analysis uncertainty, alternative approaches may be more appropriate for verifying the NDFD forecasts. For example, a forecast of −4.5°C is considered accurate even if the observed or analyzed temperature is 4°C warmer (−0.5°C) while it is counted as inaccurate if the observed or analyzed temperature is only 1°C colder (−5.5°C). Conditional probability distributions that rely on binning the differences between the forecast and analysis values help to mitigate this limitation. Depending on the application, we may choose to assess the accuracy of forecasts given a specific range of forecast values or address the accuracy of forecasts given a specific range of analysis values. We will show examples of both.
Table 5 shows a subset of the conditional probability distribution of such differences constrained by the NDFD forecast value. The rows delineate the forecast values while the columns indicate the degree of departure of the forecasts from the verifying analyses. For example, the NDFD forecast value was within ±2.5°C (2.5°–7.5°C too high) of the ADAS analysis value for 10% (4.3%) of the pairs when the forecast fell within the range from −5° to 0°C. Hence, the percentage of accurate forecasts between −5° and 0°C (53.2%) is calculated by dividing the percentage of forecasts between −5° and 0°C that verify within ±2.5°C of the ADAS analyses (10.0%) by the total percentage of NDFD forecasts between −5° and 0°C (18.8%). Similarly, the percentage of accurate forecasts between 0° and 5°C computed in this fashion is 55%.
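The difference-binning accuracy of Table 5 amounts to a conditional calculation: given that the forecast falls in a bin, count the fraction of pairs that verify within a tolerance. A sketch with the ±2.5°C tolerance and illustrative names:

```python
import numpy as np

def conditional_accuracy(forecasts, analyses, lo=-5.0, hi=0.0, tol=2.5):
    """Percentage of forecasts within +/- tol deg C of the verifying
    analysis, conditioned on the forecast falling in [lo, hi)."""
    f = np.asarray(forecasts, dtype=float)
    a = np.asarray(analyses, dtype=float)
    sel = (f >= lo) & (f < hi)             # condition on the forecast bin
    return 100.0 * (np.abs(f[sel] - a[sel]) <= tol).mean()
```

Swapping the roles of `f` and `a` in the selection mask gives the analysis-conditioned accuracy used later for Table 6.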
The percentage of accurate forecasts determined from the number of forecast–verification values that fall within ±2.5°C of each other, given that the forecast value was within the range between −5° and 0°C, is shown in Fig. 13b as a function of lead time and verifying dataset. The application of a more appropriate treatment of analysis and observational uncertainty has contributed to an overall increase in the estimate of forecast accuracy on the order of a few percent (cf. Figs. 13a and 13b). As before, verification of the NDFD forecasts relative to the RUC analyses leads to a lower estimate of accuracy than that evident relative to the ADAS analyses or at the small sample of grid points adjacent to observations. As already noted in the rmse statistics (Fig. 9b), lower forecast accuracy is evident during the morning compared with the afternoon.
The conditional probability of the differences between the NDFD forecasts and ADAS analysis values, constrained by the analysis values, is shown in Table 6. The rows in Table 6 delineate the degree of departure of the forecasts from the verifying analyses while the columns denote the analysis values. When the analyzed values are between −5° and 0°C, the percentage of accurate forecasts (53.2%) is calculated by dividing the percentage of ADAS analysis values between −5° and 0°C that are forecast within ±2.5°C (10.1%) by the marginal percentage for that range (19.0%). A slight tendency for colder temperatures to be overforecast and warmer temperatures to be underforecast is evident in Table 6. The percentage of accurate forecasts determined from the number of forecast–verification values that fall within ±2.5°C of each other, given that the verification value was within the range from −5° to 0°C, is shown in Fig. 13c as a function of lead time and verifying dataset. The tendency for lower forecast accuracy in the morning relative to the afternoon is accentuated using this measure (cf. Figs. 13a–c).
Tables 4–6 and Fig. 13 estimate forecast accuracy for the West as a whole. To assess forecast accuracy as a function of location, probability and conditional probability distributions can be computed separately for each grid point or observation location or aggregated into regions such as forecast zones or CWAs. Following the approach used for Table 5 and Fig. 13b, the accuracy of 48-h temperature forecasts for each forecast zone is determined for those occasions when the 48-h temperature forecast falls in the range from −5° to 0°C (Fig. 14). Zones are shaded in black where winter temperature in this range is uncommon. As was already seen in Fig. 13, the accuracy of the NDFD forecasts in the range from −5° to 0°C estimated relative to the RUC (Fig. 14a) is generally lower than that estimated from ADAS (Fig. 14b). For locations where temperatures between −5° and 0°C are common, the most accurate forecasts are evident in portions of Idaho, Utah, southern Oregon, and eastern Colorado, while the least accurate forecasts are evident in Nevada and Wyoming.
d. Estimated impact of forecaster adjustments
While the joint and conditional probability distributions shown in section 3c are useful for assessing the overall accuracy of the forecasts and the general improvement in accuracy from long to short forecast lead times, we also wish to examine ways of assessing the incremental improvement in accuracy that may arise from forecasters’ adjustments to prior grids. Sequences of forecast temperature values from 168- to 24-h lead times were examined at many observation locations and valid times. It was immediately apparent that forecasters chose to leave temperature forecasts unmodified on many occasions, as the shorter-lead-time forecast values were exactly the same as the longer-lead-time values (or the changes were so small that they were likely caused by rounding or interpolation after the forecasts were created). Hence, it is possible to distinguish those occasions when the forecast value is unchanged from an earlier forecast from those when the value is updated by the forecaster.
We wish to illustrate one way to assess how often forecasters 1) left an earlier accurate forecast unchanged, 2) left an earlier inaccurate forecast unchanged, 3) improved forecast accuracy by adjusting an earlier forecast value that was inaccurate, 4) made adjustments to an earlier accurate forecast and it remained accurate, 5) reduced forecast accuracy by adjusting an earlier accurate forecast, and 6) adjusted an earlier inaccurate forecast but the final forecast remained inaccurate. While each of these quantities is of interest, the difference between the third and fifth categories indicates the incremental improvement in forecast accuracy as a result of forecasters’ adjustments to the grids. Our example is based on the aggregate of all (roughly 250 000) pairs of observations and adjacent NDFD forecast values over the West during the 2003/04 winter season valid at 0000 UTC. Accurate forecasts are defined to be forecasts that are within ±2.5°C of the verifying observations, which takes into account our estimate of observational uncertainty. Forecasts are considered unchanged if the difference between the forecasts at the two lead times is less than 0.25°C to allow for rounding and interpolation during postprocessing of the forecasters’ grids.
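One way to tally the six categories, using the stated thresholds (±2.5°C for accuracy, 0.25°C for "unchanged"), is sketched below; this is our illustration, not the study's code, and the 1-D array interface is an assumption.

```python
import numpy as np

def classify_adjustments(f_early, f_late, obs, tol=2.5, eps=0.25):
    """Percentage of forecast-observation pairs in each of the six
    adjustment categories. f_early, f_late: forecasts for the same valid
    time at the longer and shorter lead times; obs: verifying
    observations (all 1-D arrays, deg C)."""
    f_early, f_late, obs = map(np.asarray, (f_early, f_late, obs))
    acc_early = np.abs(f_early - obs) <= tol     # earlier forecast accurate
    acc_late = np.abs(f_late - obs) <= tol       # final forecast accurate
    unchanged = np.abs(f_late - f_early) < eps   # left unmodified
    cats = [
        unchanged & acc_late,                    # 1 accurate, left unchanged
        unchanged & ~acc_late,                   # 2 inaccurate, left unchanged
        ~unchanged & ~acc_early & acc_late,      # 3 adjusted, became accurate
        ~unchanged & acc_early & acc_late,       # 4 adjusted, stayed accurate
        ~unchanged & acc_early & ~acc_late,      # 5 adjusted, became inaccurate
        ~unchanged & ~acc_early & ~acc_late,     # 6 adjusted, stayed inaccurate
    ]
    return [100.0 * c.mean() for c in cats]
```

The incremental improvement quoted in the text is then the third minus the fifth returned percentage.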
Table 7 relates the joint distribution of forecast changes from 48 to 24 h to the accuracy of the 24-h forecasts. The marginal distribution in the right column indicates that 58.4% of all of the 24-h forecasts are judged to be accurate. The forecasts are both accurate as well as unchanged from 24 h earlier for 15% of the total forecast–observation pairs. Note that the majority of the forecaster adjustments to the 48-h forecast values are less than our estimate of the observational uncertainty. Similarly, Table 8 relates the joint distribution of the forecast changes from 144 to 24 h to the accuracy of the 24-h forecasts. There are fewer forecast values that remain unchanged from 144 to 24 h (7.6%).² As might have been expected, the accurate 24-h forecasts result from larger forecaster adjustments over the intervening period (40.4% of the accurate 24-h forecasts result from changes greater than our estimate of the observational uncertainty).
Figure 15 summarizes information obtained from many joint distributions in terms of the percentage of all of the forecast and observation pairs that fall in the six categories listed above. Consider first Fig. 15a where the original (final) forecasts are issued 48 h (24 h) prior to the valid time: 1) 15% of the 48-h forecasts were correctly left unchanged (see also Table 7), 2) 10.3% of the 48-h temperature forecasts were incorrectly left unchanged, 3) 11.6% of the 48-h temperature forecasts were adjusted so that the 24-h forecast became accurate, 4) 31.8% of the 48-h forecast values were adjusted and remained accurate even though the original forecast was already assessed to be accurate according to our criterion, 5) 8.7% of the 48-h forecasts were adjusted and were no longer accurate at 24 h, and 6) 22.6% of the forecasts were originally inaccurate and remained so. The difference (2.9%) between the third and fifth categories indicates the modest incremental improvement in accuracy resulting from adjusting the 48-h forecasts by incorporating information available to the forecasters during the next day. Similar modest differences of 2%–3% were found between nearly all pairs of lead times separated by a day; for example, the difference arising from forecaster adjustments was 2.3% between 144- and 120-h forecasts (Fig. 15b). We found that 65%–75% of the temperature forecast grids were adjusted from one day to the next (e.g., cf. the percentage sums in the left and right halves of Figs. 15a and 15b). Generally, the accuracy is improved due to leaving some of the prior forecast values unchanged from one day to the next; that is, the differences between the first and second categories are positive (4.7% in Fig. 15a). However, a reduction in accuracy sometimes occurs by leaving a fraction of the grid unchanged as shown in Fig. 15b.
To assess to what extent the modest incremental improvements in forecast accuracy from one day to the next contribute to an increase in forecast accuracy over longer lead times, consider Figs. 15b–d. As the separation between forecast lead times increases, the fractions of the grids that are adjusted by the forecasters increased (cf. the total percentages in the left and right halves of each figure) and the percentage of forecasts that were 1) correctly left unchanged decreased, 2) incorrectly left unchanged also decreased, 3) adjusted and thereby became accurate increased substantively, 4) adjusted a small amount and remained accurate stayed roughly the same, 5) adjusted and became inaccurate increased a small amount, and 6) adjusted and remained inaccurate stayed roughly the same. Overall, the incremental adjustments of the forecast grids from 144 to 24 h increased the accuracy of the forecasts by 13.4% (Fig. 15d).
4. Summary and discussion
Experimental NDFD gridded forecasts of surface temperature from the 2003/04 winter season have been examined subjectively and objectively in terms of a variety of estimates of forecast quality. The subgrid-scale variability in weather and inadequate ground truth in many areas, especially in regions of complex terrain, are significant complicating factors for the generation of the NDFD forecasts as well as their verification. Nonetheless, NWS forecasters are expected to issue a deterministic forecast for each grid box out to 7 days and estimates of forecast skill are needed in order to help them diagnose and mitigate systematic errors.
We contrasted the NDFD forecast accuracy during an active weather period with that during a persistent one. Even though cold pools in basins are more persistent, they are more difficult to analyze given present analysis capabilities. Removal of the seasonal means helps to focus attention on the accuracy at forecasting synoptic and mesoscale weather events.
For administrative-oriented verification where the goal is to assess overall forecast performance for the West or the nation as a whole, ACCs calculated in this study suggest that the experimental NDFD temperature forecasts issued at 0000 UTC for the 2003/04 winter season on average exhibited skill at 12-, 24-, 36-, and 48-h lead times over the West, assuming a skill threshold of 0.5 (Fig. 12). The ACC is a good measure of forecast quality if the goal is to focus on the overall ability of forecasters to forecast large-scale weather features from day to day or over a season. Limiting the evaluation to the small fraction of grid points adjacent to observation locations, it is possible to estimate roughly the NDFD forecast error relative to the unknown expected values for those grid points (order 3°C at lead times less than 48 h to over 4°C after 120 h). Similar estimates computed from model guidance would help to assess the value added by forecasters as a result of the adjustments to the model grids.
For scientific-oriented verification where the goal is to improve future forecasts by understanding past forecast errors, we have focused on assessing the uncertainty arising from the choice of verifying datasets: MesoWest observations, RUC analyses, and ADAS analyses that adjust the RUC analyses by using the MesoWest observations. ASOS observations clearly provide insufficient coverage in the West for assessing NDFD forecast quality, and only a fraction of the NDFD forecast grid can be verified directly against MesoWest observations. Because forecasters are required to produce forecast values at every grid point, we recommend applying verification tools that evaluate the entire forecast grid. As a general rule, the NDFD forecasts exhibit the greatest accuracy when evaluated using observations or ADAS analyses rather than RUC analyses interpolated to the 5-km grid.
As evident in Figs. 9, 10, and 13, the NDFD temperature forecasts were found to have larger biases and rmse and lower accuracy when verified at 1200 UTC compared with 0000 UTC. We hypothesize that this difference is partly caused by the methodology used to derive the hourly values for the forecasts and observations. The IFPS procedure for forecasting surface temperature used by most forecasters is to predict the minimum and maximum temperature for the day and then fit a diurnal curve to those extremes to obtain the hourly temperature grids. This interpolation step from the daily temperature extremes may lead to an underprediction (morning) or overprediction (afternoon) of temperature since the temperature extremes often occur during only a short period (see Monti et al. 2002; Clements et al. 2003).
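To make the interpolation step concrete, a generic half-cosine curve between the daily extremes is sketched below. The actual IFPS curve-fitting algorithm is not specified here, so the assumed hours of the extremes (0500 and 1500 local) and the cosine shape are illustrative assumptions; the point is that any smooth curve anchored only at the extremes flattens the short-lived morning minimum and afternoon maximum.

```python
import numpy as np

def diurnal_hourly(tmin, tmax, h_min=5, h_max=15):
    """Illustrative (not the IFPS algorithm) reconstruction of 24 hourly
    temperatures from the daily extremes: a half-cosine rise from tmin
    at hour h_min to tmax at hour h_max, and a slower cosine fall that
    wraps through midnight back to h_min."""
    hours = np.arange(24)
    temps = np.empty(24)
    rising = (hours >= h_min) & (hours <= h_max)
    frac = (hours[rising] - h_min) / (h_max - h_min)
    temps[rising] = tmin + (tmax - tmin) * 0.5 * (1 - np.cos(np.pi * frac))
    # falling branch: hours after h_max, wrapping past midnight
    fall_len = 24 - (h_max - h_min)
    fh = (hours[~rising] - h_max) % 24
    temps[~rising] = tmax - (tmax - tmin) * 0.5 * (1 - np.cos(np.pi * fh / fall_len))
    return temps
```

Comparing such a curve with observed hourly traces, which often hold near the minimum through the morning, illustrates one plausible source of the 1200 UTC cold bias noted above.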
The dilemma faced in any verification study of the NDFD forecasts is how to reduce the dimensionality given the vast number of forecasts being issued. For brevity, we showed results for temperature only, even though we have computed similar statistics for wind speed and dewpoint temperature. We limited our study to the 2003/04 winter season and evaluated forecasts at 12-h intervals issued at 0000 UTC only. To assess the quality and value of the NDFD forecasts, clear verification goals must be defined. For example, quantifying forecast skill in terms of the lead time provided for thresholds associated with watch and warning criteria may be useful. Our simple examples in section 3c related to the probability distributions near 0°C could be loosely tied to the needs of the winter road maintenance community. Of greater interest to that community might be: what is the forecast skill for those grid boxes containing highways with road surface temperatures below freezing when precipitation is likely and the temperature is forecast to be above freezing and then forecast to drop below freezing? Even more specifically, what is the skill for such occasions during the first snowstorm of the season when accidents are more likely? Such verification will obviously require sophisticated data-mining tools to extract that information from the NDFD database and the observational assets to be able to assess those conditions.
Joint distributions and summary statistics of the type illustrated in section 3d provide a means to assess the incremental improvements in accuracy as a function of lead time. A more thorough examination of this issue remains to be completed and requires the comparison of such measures to the incremental improvements in the numerical model guidance over the same span of lead times. We found that 65%–75% of the forecast temperature values at the observation locations change from day to day and we expect that this fraction is likely higher for other fields, such as probability of precipitation. The adjustments from one day to the next tend to be small (Table 7). Cumulatively those changes lead to measurable increases in forecast accuracy of the order of 13% from 144 to 24 h. A substantial fraction of the small adjustments from one day to the next are made when the longer-range guidance would later be found to be accurate (Fig. 15). Using our criterion for forecast accuracy based on our estimate of observational uncertainty, the forecaster decision to leave portions of the grids unchanged is a reasonable one that during the 2003/04 winter season led to a small improvement in forecast accuracy. Our example suggests that forecasters may wish to focus greater attention on those portions of the grids that require substantive adjustment and leave the original forecasts unchanged over larger portions of the grid.
Revealing and resolving deficiencies through verification is a critical step in improving the IFPS system and NDFD products. Forecasters respond rapidly to the results of verification studies. For example, discontinuities in the temperature forecasts across CWA boundaries evident in many locations during the 2003/04 winter season in our results were also apparent in the statistics generated by MDL that were disseminated to the forecasters during the past year. Efforts were made to reduce CWA border discontinuities. Hence, such coordination issues are largely absent in the NDFD forecast verification statistics for the 2004/05 winter available from MDL. However, it is unclear at this point whether improving forecast consistency across CWAs leads to improved accuracy. A number of successful pilot verification efforts are under way at local NWS offices in the West that are beginning to focus on the errors as a function of elevation, weather regime, time of day, change from day to day, etc. Verification needs to proceed within each CWA as well as on a national level. Local verification provides feedback to the forecasters that will result in improved forecasts; national verification will provide feedback to the user community on the accuracy of the NDFD products.
For the development of the IFPS forecasts as well as for verification of those forecasts, considerable research and development are required to improve the observational database, the estimates of the errors in those observations, and analysis techniques in mountainous regions. For example, the reasons for the relatively large sampling errors in the MesoWest database estimated from Fig. 3 need to be investigated (instrument errors are generally expected to be less than 1°C). The integration of observations from different networks with different protocols and maintenance standards likely contributes to these errors of representativeness. However, we suspect that our relatively large estimate of observational error may be an upper limit arising in part from assuming that the background error covariance is a function of horizontal distance only. For example, imagine a pair of stations, one located in a basin and the other a few kilometers away on a nearby slope or mountain peak. The errors between these two locations are likely to be unrelated to one another as a result of the differences in elevation or presence of intervening terrain, which would tend to lower the error covariance used to calculate Fig. 3. To examine the error characteristics of MesoWest observations in greater detail, we intend in the future to compute the background error covariance separately for each of the major network types [e.g., ASOS, Bureau of Land Management/U.S. Department of Agriculture Forest Service Remote Automated Weather Stations, Natural Resources Conservation Service Snowpack Telemetry (SNOTEL)] as well as for stations located in similar topographic and microclimate regimes. This information will be of benefit not only for the verification of NDFD forecasts but also may lead to improvements in quality control procedures and application of the observations in objective analyses.
This work was supported by National Weather Service Grant 468004 to the NOAA Cooperative Institute for Regional Prediction at the University of Utah and NOAA Office of Global Programs Grant GC04-281. This study was made possible in part due to the data made available by the governmental agencies, commercial firms, and educational institutions participating in MesoWest. Special thanks to three anonymous reviewers whose comments greatly enhanced this manuscript. Conversations with B. Colman, L. Dunn, and D. Edman of the National Weather Service contributed to an improved understanding of the methodologies used by the forecasters to create and update the NDFD grids.
Corresponding author address: David T. Myrick, Dept. of Meteorology, University of Utah, 135 South 1460 East, Rm. 819, Salt Lake City, UT 84112-0110. Email: firstname.lastname@example.org
Although a nonzero correlation between the background and ASOS observational errors should be expected because the RUC assimilates the ASOS observations, the MesoWest observations represent independent information not included in the RUC analyses.
Spot checks of sequences of forecast values from 144 to 24 h revealed no cases where the forecast values at 144 and 24 h were identical as a result of compensating adjustments during the intervening 120-h period.