The Weather Channel (TWC) is a leading provider of weather information to the general public. In this paper the reliability of their probability of precipitation (PoP) forecasts over a 14-month period at 42 locations across the United States is verified. It is found that PoPs between 0.4 and 0.9 are well calibrated for near-term forecasts. However, overall TWC PoPs are biased toward precipitation, significantly so during the warm season (April–September). PoPs lower than 0.3 and above 0.9 are not well calibrated, a fact that can be explained by TWC’s forecasting procedure. In addition, PoPs beyond a 6-day lead time are miscalibrated and artificially avoid 0.5. These findings should help the general public to better understand TWC’s PoP forecasts and provide important feedback to the TWC so that they may improve future performance.
1. Introduction
The Weather Channel (TWC) is a leading provider of weather information to the general public via its cable television network and interactive Web site (see http://www.weather.com/). TWC’s cable network is available in 95% of cable TV homes in the United States and reaches more than 87 million households. Their Internet site, providing weather forecasts for 98 000 locations worldwide, averages over 20 million unique users per month and is among the top 15 news and information Web sites, according to Nielsen/NetRatings (more information is available online at http://press.weather.com/company.asp).
The public uses TWC’s forecasts to make decisions as mundane as whether to carry an umbrella or as significant as whether to seek shelter from an approaching storm. How accurate are these forecasts? Are they free from bias? Should the public accept TWC forecasts at face value or do they need to be adjusted to arrive at a better forecast?
In this paper, we analyze the reliability of probability of precipitation (PoP) forecasts provided by TWC (via weather.com) over a 14-month period (2 November 2004–16 January 2006), at 42 locations across the United States. Specifically we compare n-day-ahead PoP forecasts, where n ranges from 0 (same day) to 9, with actual precipitation observations.
This paper is organized as follows. In the next section, we describe our verification approach and review the associated literature. In section 3 we summarize our data collection procedure. In section 4 we present the reliability results and discuss the implications. In section 5 we present our conclusions.
2. Verification of probability forecasts
a. Distributional measures
Let F be the finite set of possible PoP forecasts f_i ∈ [0, 1], i = 1 to m. Here X is the set of precipitation observations, which we assume may take only the value x = 1 in the event of precipitation and x = 0 otherwise. The empirical relative frequency distribution of forecasts and observations given a particular lead time l is p( f, x|l) and completely describes the performance of the forecasting system. A perfect forecasting system would ensure that p( f, x|l) = 0 when f ≠ x. In the case of TWC, l may take integer values ranging from 0 (same day) to 9 (the last day in a 10-day forecast).
Two different factorizations of p( f, x|l) are possible and each facilitates the analysis of forecasting performance.
The first factorization, p( f, x|l) = p( f | l)p(x| f, l) is known as the calibration-refinement (CR) factorization. Its first term, p( f | l), is the marginal or predictive distribution of forecasts and its second term, p(x| f, l), is the conditional distribution of the observation given the forecast. For example, p(1| f, l) is the relative frequency of precipitation when the forecast was f and the lead time was l. The forecasts and observations are independent if and only if p(x| f, l) = p(x|l). A set of forecasts is well calibrated (or reliable) if p(1| f, l) = f for all f. A set of forecasts is perfectly refined (or sharp) if p( f ) = 0 when f is not equal to 0 or 1 (i.e., the forecasts are categorical). Forecasting the climatological average or base rate will be well calibrated, but not sharp. Likewise, perfectly sharp forecasts generally will not be well calibrated.
The second factorization, p( f, x|l) = p(x|l)p( f | x, l), is the likelihood-base rate (LBR) factorization. Its first term, p(x|l), is the climatological precipitation frequency. Its second term, p( f | x, l), is the likelihood function (also referred to as discrimination). For example, p( f | 1, l) is the relative frequency of forecasts when precipitation occurred, and p( f | 0, l) is the forecast frequency when precipitation did not occur. The likelihood functions should be quite different in a good forecasting system. If the forecasts and observations are independent, then p( f | x, l) = p( f | l).
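The two factorizations can be illustrated with a short sketch. The function name `factorizations` and the toy data are hypothetical and are not part of the study; the code simply tabulates relative frequencies from a list of (forecast, observation) pairs for a single lead time.

```python
from collections import Counter

def factorizations(pairs):
    """Return the CR and LBR factors for (f, x) pairs at one lead time.

    pairs: list of (forecast f, observation x), with x equal to 0 or 1.
    """
    pairs = list(pairs)
    n = len(pairs)
    joint = Counter(pairs)                      # counts of each (f, x)
    f_counts = Counter(f for f, _ in pairs)     # counts of each forecast
    x_counts = Counter(x for _, x in pairs)     # counts of each outcome

    # Calibration-refinement: p(f | l) and p(x | f, l)
    p_f = {f: c / n for f, c in f_counts.items()}
    p_x_given_f = {(x, f): joint[(f, x)] / f_counts[f]
                   for f in f_counts for x in x_counts}

    # Likelihood-base rate: p(x | l) and p(f | x, l)
    p_x = {x: c / n for x, c in x_counts.items()}
    p_f_given_x = {(f, x): joint[(f, x)] / x_counts[x]
                   for f in f_counts for x in x_counts}
    return p_f, p_x_given_f, p_x, p_f_given_x

# Toy data (illustrative only): a 0.2 PoP verified 2 times in 10,
# and a 0.8 PoP verified 8 times in 10.
toy = [(0.2, 0)] * 8 + [(0.2, 1)] * 2 + [(0.8, 1)] * 8 + [(0.8, 0)] * 2
p_f, p_x_given_f, p_x, p_f_given_x = factorizations(toy)
```

In this toy sample p(1| f, l) equals f for both forecast values, so the forecasts are well calibrated, and the two likelihood functions differ, so the forecasts discriminate between wet and dry periods.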
b. Summary measures
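In addition to the distributional comparison discussed above, we will use several summary measures of forecast performance. The mean forecast given a particular lead time is

$$\bar{f}_l = E[f \mid l] = \sum_{f \in F} f \, p(f \mid l), \qquad (2)$$

where E is the expectation operator. Likewise, the climatological frequency of precipitation, indexed by lead time, is

$$\bar{x}_l = E[x \mid l] = p(1 \mid l). \qquad (3)$$

The mean error (ME) is

$$\mathrm{ME}(f, x \mid l) = \bar{f}_l - \bar{x}_l \qquad (4)$$

and is a measure of unconditional forecast bias. The mean-square error (MSE) or the Brier score (Brier 1950) is

$$\mathrm{MSE}(f, x \mid l) = E[(f - x)^2 \mid l]. \qquad (5)$$

The climatological skill score (SS) is

$$\mathrm{SS}(f, x \mid l) = 1 - \frac{\mathrm{MSE}(f, x \mid l)}{\mathrm{MSE}(\bar{x}_l, x \mid l)} = 1 - \frac{\mathrm{MSE}(f, x \mid l)}{\sigma_x^2}, \qquad (6)$$

where $\sigma_x^2$ is the variance of the observations. Therefore,

$$\mathrm{MSE}(f, x \mid l) = [1 - \mathrm{SS}(f, x \mid l)] \, \sigma_x^2, \qquad (7)$$

and we see that SS measures the proportional amount by which the forecast reduces our uncertainty regarding precipitation, as measured by variance.
In addition to these scoring measures, we will also investigate the correlation between the forecasts and the observations, which is given by

$$\rho(f, x \mid l) = \frac{\mathrm{cov}(f, x \mid l)}{\sqrt{\sigma_f^2 \, \sigma_x^2}}, \qquad (8)$$

where cov is the covariance and $\sigma_f^2$ is the variance of the forecasts.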
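As a rough illustration of these summary measures, the sketch below computes the mean error, mean-square error, climatological skill score, and correlation for one lead time. The function name `summary_measures` and the sample data are hypothetical; variances and the covariance are taken as population (divide-by-N) quantities, which is an assumption about the convention rather than something stated in the text.

```python
import math

def summary_measures(forecasts, observations):
    """Summary measures for paired PoP forecasts and 0/1 observations."""
    n = len(forecasts)
    f_bar = sum(forecasts) / n                  # mean forecast
    x_bar = sum(observations) / n               # climatological frequency
    me = f_bar - x_bar                          # mean error (bias)
    mse = sum((f - x) ** 2
              for f, x in zip(forecasts, observations)) / n
    var_x = sum((x - x_bar) ** 2 for x in observations) / n
    var_f = sum((f - f_bar) ** 2 for f in forecasts) / n
    cov = sum((f - f_bar) * (x - x_bar)
              for f, x in zip(forecasts, observations)) / n
    ss = 1.0 - mse / var_x                      # skill relative to climatology
    rho = cov / math.sqrt(var_f * var_x)        # forecast-observation correlation
    return {"ME": me, "MSE": mse, "SS": ss, "rho": rho}

# Illustrative data only: 0.2 forecast 10 times (2 wet), 0.8 forecast 10 times (8 wet).
fcst = [0.2] * 10 + [0.8] * 10
obs = [0] * 8 + [1] * 2 + [1] * 8 + [0] * 2
measures = summary_measures(fcst, obs)
```

For this sample the MSE is 0.16 against an observation variance of 0.25, so the skill score is 0.36: the forecasts remove 36% of the variance that remains under a climatological forecast.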
3. Data gathering procedure
a. PoP forecasts
We collected TWC forecasts from 2 November 2004 to 16 January 2006. [These data were collected from http://www.weather.com/, which provides a 10-day forecast that includes forecasts from the same day (0-day forecast) to 9 days ahead.] Figure 1 displays a representative 10-day forecast from 2007. These forecasts are available for any zip code or city and include probability of precipitation, high/low temperature, and verbal descriptions of weather outcomes such as “partly cloudy.” The forecasts are updated on a regular basis and are freely available to the public.
TWC’s PoP forecasts cover a 12-h window during the daytime (0700–1900 local time), rather than a complete 24-h day. The 12-h PoP is the maximum hourly PoP estimated by TWC during the forecast window. PoPs are rounded and must adhere to local rules relating PoPs to weather outcomes (B. Rose 2007, personal communication).¹
We selected 50 locations in the United States, one in each state. Within each state we selected a major city. Within each city we selected the lowest zip code, excluding P.O. boxes. See Table 1 for a list of the cities and zip codes included in this study.
Since TWC’s forecasts are not archived, we recorded the forecasts daily. We automated this collection using a Web query and a macro in Microsoft Excel. The macro gathered forecast data directly from Web pages, such as that shown in Fig. 1. This process worked well, but was not completely automatic. In some cases, we experienced temporary problems with certain zip codes (e.g., http://www.weather.com/ data being unavailable) or faced Internet outages. These errors were generally discovered at a point at which forecast data could still be acquired. However, on some days (though fewer than 5%), we were unable to retrieve the PoP forecasts, and these data have been excluded from the analysis. While we did record high and low temperature in addition to PoP, we do not analyze temperature forecasts in this paper.
Because the archival process required intervention, we decided upon a single collection time. This timing is important because the forecasts are updated frequently and not archived. To ensure that we did not collect same-day forecasts before they were posted in our westernmost zip codes (Hawaii and Alaska) we established a collection time of 1130 central standard time (CST), which corresponds to 0730 Hawaii–Aleutian standard time, 0830 Alaska standard time, 0930 Pacific standard time, 1030 mountain standard time, and 1230 eastern standard time (EST). During daylight saving time (DST), we archived forecasts at 1130 central daylight time (CDT; 1030 CST). TWC builds their forecasts at 0100, 0300, 0900, 1100, 1810, and 2300 EST [or eastern daylight time (EDT); B. Rose 2007, personal communication]. These forecasts reach TWC’s Web site approximately 15 min later. Therefore, our forecasts represent TWC’s view at 1000 CST (or CDT). On rare occasions, TWC amends forecasts during the day, but we do not try to account for this.
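The timing argument can be checked with a few lines of arithmetic. This is only a sketch: the helper names are hypothetical, the build times and 15-min publication delay are those quoted in the text, and the code assumes a build is visible on the Web site once that delay has elapsed. It does not handle a collection time that falls before the first build of the day.

```python
BUILD_TIMES_EST = ["0100", "0300", "0900", "1100", "1810", "2300"]
PUBLISH_DELAY_MIN = 15       # approximate lag before a build reaches the Web site
EST_MINUS_CST_MIN = 60       # eastern time is 1 h ahead of central time

def to_minutes(hhmm):
    """Convert an 'HHMM' string to minutes after midnight."""
    return int(hhmm[:2]) * 60 + int(hhmm[2:])

def to_hhmm(minutes):
    """Convert minutes after midnight to an 'HHMM' string."""
    return f"{minutes // 60:02d}{minutes % 60:02d}"

def latest_build_cst(collection_cst="1130"):
    """Central time of the most recent build visible at the collection time."""
    collect_est = to_minutes(collection_cst) + EST_MINUS_CST_MIN
    visible = [to_minutes(b) for b in BUILD_TIMES_EST
               if to_minutes(b) + PUBLISH_DELAY_MIN <= collect_est]
    if not visible:
        return None          # would be the previous day's last build
    return to_hhmm(max(visible) - EST_MINUS_CST_MIN)

build_seen = latest_build_cst("1130")
```

With a 1130 central collection time the most recent visible build is the 1100 eastern build, that is, 1000 central, which matches the statement above. Because the build schedule follows eastern daylight time during DST, the same arithmetic applies to the 1130 CDT collection.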
b. Precipitation observations
The observed records of daily precipitation and high/low temperature of the current and previous month are available online at the TWC Web site. However, the Web site only archives daily precipitation observations, whereas we require hourly observations because the PoP forecast is for the 12-h window during the daytime. Therefore, we obtained hourly precipitation observation data from the National Climatic Data Center (NCDC; available online at www.ncdc.noaa.gov). Using NCDC’s database, we selected the observation station that was closest to our forecast zip code.² Table 1 lists the observation stations used in this study and both the distance and elevation difference between the forecast zip code and the observation station. Most stations were within 20 km of the forecast zip code. However, eight stations were more than 20 km from the forecast area (i.e., Alaska, California, Colorado, Idaho, New Mexico, Oklahoma, Pennsylvania, and Vermont). In addition, one of these forecast–observation pairs was separated by more than 500 m in elevation (i.e., Alaska). We have therefore removed these eight locations from our analysis, leaving 42 locations.³ The average distance and elevation difference between observation stations and our zip codes for these 42 locations are approximately 7 km and 18 m, respectively. The maximum distance and elevation difference between forecast–observation pairs are 16 km and 181 m, respectively. We also verified that the surface conditions between the observation–forecast pairs for the 42 remaining stations are similar.
The hourly data for each observation station are archived according to local standard time (LST). We used a 12-h observation window from 0700 to 1900 LST for each location to verify the PoP forecast data, which corresponds to TWC’s PoP definition. Because the observations are always archived according to LST, during DST we slide our observation window up 1 h (0600–1800 LST) except in Arizona and Hawaii.
The verification of the same day PoP forecasts is more complicated than other PoP forecasts because the timing of the forecast collection determines which hours of observation data should be included. For example, in the eastern time zone, we only want to include precipitation observations between 1100 and 1900 EST (or between 1000 and 1800 EST during DST). Therefore, we removed hourly precipitation observations that occurred before the forecast time for the same-day forecasts at each location.
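A minimal sketch of how a binary observation might be derived from hourly records follows. The function name, the dictionary layout (hour-beginning LST mapped to inches of precipitation in that hour), and the choice to apply the 0.01-in. threshold to each hourly value are all assumptions made for illustration; the text does not specify these details.

```python
MIN_PRECIP_IN = 0.01   # amounts below this are treated as no precipitation

def precipitation_observed(hourly_in, dst=False, start_hour=None):
    """Return 1 if precipitation fell in the verification window, else 0.

    hourly_in:  dict mapping hour-beginning LST (0-23) to inches in that hour.
    dst:        True when daylight saving time is in effect locally, which
                shifts the 0700-1900 local window to 0600-1800 LST.
    start_hour: for same-day forecasts, the LST hour of the forecast;
                earlier hours are dropped from the window.
    """
    shift = 1 if dst else 0
    first, last = 7 - shift, 19 - shift         # window is [first, last)
    if start_hour is not None:
        first = max(first, start_hour)
    wet = any(hourly_in.get(h, 0.0) >= MIN_PRECIP_IN
              for h in range(first, last))
    return int(wet)

# Rain only between 0900 and 1000 LST: counts for a 1-day-ahead forecast,
# but not for an eastern same-day forecast whose window starts at 1100 EST.
sample = {9: 0.20}
x_next_day = precipitation_observed(sample)
x_same_day = precipitation_observed(sample, start_hour=11)
```

In Arizona and Hawaii the `dst` flag would always be `False`, since the window is not shifted there.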
c. Data summary
Before beginning our analysis, we summarize our forecast and observation data in Table 2. We collected between 15 742 and 17 338 PoP forecasts, depending on the lead time (169 163 PoPs in total). Precipitation was observed approximately 21% of the time. The frequency of precipitation for same-day forecasts is lower (18%) because these forecasts span less than a 12-h window for some time zones. TWC’s average PoP forecast varied over the lead times, ranging from a low of 0.198 (7 day) to a high of 0.265 (8 day). All but one lead time exhibits a positive mean error between the forecast and the observation, suggesting some degree of positive bias in TWC’s PoP forecasts. The same-day bias is 0.052.
Table 3 details the number of forecasts by PoP and lead time. TWC forecast a 0.2 PoP 4930 times for their same-day forecast. Overall, a 0.0 PoP was forecast 24 382 times, while a PoP of 1.0 was forecast 410 times. The italic values identify forecasts that were made fewer than 40 times, which we exclude from further analysis.⁴
4. Forecast verification
a. Calibration-refinement factorization
Figure 2 displays a calibration or attributes diagram (Hsu and Murphy 1986) for TWC’s 0-day PoP forecasts. The line at 45°, labeled “perfect,” identifies PoPs that are perfectly calibrated [i.e., p(1| f, l) = f ] . The horizontal line labeled “no resolution” identifies the case where the frequency of precipitation is independent of the forecast. The line halfway between no resolution and perfect is labeled “no skill.” Along this line the skill score is equal to zero and according to Eq. (7), the forecast does not reduce uncertainty in the observation. Points above (below) this line exhibited positive (negative) skill.
The gray area in Fig. 2 presents the frequency with which different PoPs are forecast [i.e., p( f )]. We notice peaks at PoPs of 0.0 and 0.2, each being forecast more than 20% of the time.
We identified a probability interval around the line of perfect calibration, based on the number of forecasts, which determines whether we identify a PoP as being not well calibrated. Based on the normal approximation to the binomial distribution, we establish a 99% credible interval, in which case there is a 1% chance a forecast–observation pair would be outside this interval (0.5% chance of being above and 0.5% chance of being below). For example, if the PoP was truly f, then there is a 99% chance that the actual relative frequency of precipitation would be within

$$f \pm \Phi^{-1}(0.995) \sqrt{\frac{f(1 - f)}{N}}, \qquad (9)$$

where $\Phi^{-1}$ is the inverse of the standard normal cumulative distribution function [$\Phi^{-1}(0.995) = 2.576$] and N is the number of forecasts. This range forms an envelope around the line of perfect calibration, the width of which is determined by Eq. (9). If a forecast–observation pair lies outside this range, then the forecast is not well calibrated.⁵ PoPs of 0.0, 0.1, 0.2, 0.3, and 1.0 are not well calibrated. PoPs of 0.0 and 1.0 will not be well calibrated if even a single contrary event occurs, because the interval collapses to a single point when f(1 − f) = 0; this is a good reason to restrict PoP forecasts to the open interval (0, 1).
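The calibration test can be sketched as follows, assuming the interval takes the usual normal-approximation form f ± 2.576 √[f(1 − f)/N]. The function names are hypothetical, and clipping the interval to [0, 1] is a convenience added here rather than something described in the text.

```python
import math

Z_995 = 2.576   # inverse standard normal cumulative at 0.995, as quoted in the text

def calibration_interval(f, n, z=Z_995):
    """99% interval for the observed frequency if the PoP were truly f."""
    half_width = z * math.sqrt(f * (1.0 - f) / n)
    return max(0.0, f - half_width), min(1.0, f + half_width)

def is_well_calibrated(f, observed_freq, n):
    """True if the observed relative frequency lies inside the interval."""
    low, high = calibration_interval(f, n)
    return low <= observed_freq <= high

# Same-day 0.2 PoP: forecast 4930 times, with precipitation 5.5% of the time.
low, high = calibration_interval(0.2, 4930)
same_day_02_ok = is_well_calibrated(0.2, 0.055, 4930)
```

For the same-day 0.2 PoP the interval is roughly 0.185 to 0.215, so an observed frequency of 0.055 falls well outside it. For f = 0.0 or f = 1.0 the half-width is zero, so any contrary event puts the observed frequency outside the interval.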
The 0.3 PoP is not well calibrated and exhibits no skill. PoPs below 0.3 are quite poor: they are miscalibrated, exhibit negative skill, and are biased. For example, when TWC forecast a 0.2 chance of precipitation for the same day, precipitation occurred only 5.5% of the time.
PoPs of 0.4 and above, excluding 1.0, can be taken at face value and used directly in decision making. However, PoPs of 0.3 and below or 1.0 require adjustment—sometimes significant.
Figure 3 presents the calibration diagrams for lead times of 1–9 days. The 1-day forecasts exhibit the same behavior as the 0-day forecasts: PoPs from 0.0 to 0.2 and 1.0 are miscalibrated. The calibration of midrange PoPs begins to degrade with lead time. Performance decreases markedly beginning with the 7-day forecasts. For example, most of the PoP forecasts lie along the no skill line for lead times of 7 days or longer. While predictability does decrease with lead time, calibration performance should not; when the forecast is f, precipitation should occur f × 100% of the time whether the forecast was for the next hour or the next year.
These phenomena can be explained in part by TWC’s forecasting procedure (B. Rose 2007, personal communication). The meteorologists at TWC receive guidance from a mixture of numerical, statistical, and climatological inputs provided by computer systems. The human forecasters rarely intervene in forecasts beyond 6 days. Thus, the verification results of the 7–9-day forecasts represent the “objective” machine guidance being provided to TWC’s human forecasters. In this respect, the human forecasters appear to add considerable skill, since the 0–6-day calibration performance is so much better.
However, when humans do intervene, they introduce considerable bias into the low-end PoP forecasts. This bias could be a by-product of the intervention tools used by the human forecasters. The forecasters do not directly adjust the PoPs, but instead change what is known as the sensible weather forecast. For example, they might change partly cloudy to “isolated thunder.” When this change is made, a computer algorithm determines the “smallest” change that must be made in a vector of weather parameters to make them consistent with the sensible weather forecast. A PoP of 29% is the cutoff for a dry forecast and therefore, it appears as though this intervention tool treats all “dry” PoPs as being nearly equivalent. This also might explain the curious dip in forecast frequency at 0.1 in both the 0- and 1-day forecasts.
The frequency of forecasts highlights additional challenges with the machine guidance. The most likely 8- and 9-day forecasts are 0.0 and 0.6, with a forecast of 0.5 being very unlikely. TWC appears to avoid forecasts of 0.5. We can even see the “ghost” of the 0.6 peak in the shorter-term human-adjusted forecasts. Forecasts as extreme as 0.0 or 0.6 are difficult to justify far into the future. For example, the frequency of precipitation conditional on the forecast ranges from 0.12 to 0.32 for the 9-day forecast. It appears that TWC’s forecasts would need to be constrained to this range if they were intended to be well calibrated.
Table 4 presents several summary measures of forecasting performance. The mean-square error [Eq. (5)] ranges from 0.095 to 0.188. The variance of the forecasts is less than the variance of the observations, but much less stable. The correlation between the forecasts and the observations begins at 0.615 and declines quickly with lead time. The same-day skill score is approximately 36% and declines with lead time. The 8- and 9-day computer forecasts exhibit negative skill—using the computer forecasts directly induces more error than using climatology. For comparison, Murphy and Winkler (1977) found an overall SS for a sample of National Weather Service forecasts, averaged over all lead times, of approximately 31%.
b. Likelihood-base-rate factorization
Figure 4 displays the likelihood functions (or discrimination plots), p( f | 1, l) and p( f | 0, l) for TWC’s 0-day PoP forecasts. Given that precipitation did not occur, it is likely TWC forecast a PoP of either 0.0 or 0.2. Likewise, it is unlikely that PoPs greater than 0.6 were forecast in this situation. However, if precipitation did occur, a range of PoPs from 0.3 to 0.8 were almost equally likely to have been forecast. Ideally, one would hope to see p( f | 1, l) peak at high PoPs and decline to the left.
Figure 5 displays likelihoods for the remainder of lead times. The degree of overlap between the likelihood functions increases rapidly with lead time, as the forecasts lose their ability to discriminate and skill scores fall. The peaks at a PoP of 0.6 are even more pronounced in the likelihood graphs.
c. Warm and cool seasons
Following Murphy and Winkler (1992), we gain additional insight into TWC’s forecasts by analyzing their performance during warm (April–September) and cool (October–March) months. Table 5 summarizes the forecast and observation data by season. Approximately 60% of our dataset covers the cool season because we gathered data from 2 November 2004 to 16 January 2006. The sum of the number of forecasts for the cool and warm seasons is lower than the totals presented in Table 2 because we have excluded PoPs that were forecast fewer than 40 times. For example, a same-day PoP of 0.9 was forecast only 26 times during the warm season and has therefore been excluded from the warm-season analysis (17 338 − 10 374 − 6938 = 26).
The frequency of precipitation was lower during the warm season than during the cool season. Yet, TWC forecast higher PoPs during the warm season, resulting in a larger mean error. For example, the 0-day warm season PoP was 0.086 too high on average.
Figure 6 compares the 0-day PoP calibration in the cool and warm seasons. The most likely forecast in the cool season was 0.0, even though precipitation occurred more frequently than during the warm season. The cool season is not well calibrated for low (0.0–0.2) or high (0.8–1.0) PoPs, whereas the lower half of the PoP range performs poorly during the warm season—TWC overforecasts PoPs below 0.5 during the warm season. Overall, the warm season is not as well calibrated as the cool.
Figure 7 contrasts the cool and warm calibration for 1–9-day forecasts. The calibration performance between the two seasons is similar. However, the cool-season PoPs tend to be sharper because they forecast 0.0 more frequently. One noticeable difference in forecast behavior is the increased frequency of 0.3 PoPs during the warm season.
Table 6 compares the skill scores and correlations between the two seasons. Warm-season forecasts are about half as skillful as cool-season forecasts: cool-season skill scores begin at about 44% and decline to 0% by day 7, whereas warm-season skill scores are about 50% lower. For comparison, Murphy and Winkler (1992) found skill scores of 57%, 38%, and 30% for the 0-, 1-, and 2-day forecasts during the cool season and 37%, 24%, and 21% during the warm season, respectively. TWC’s performance is on par with these earlier studies in the cool season, if somewhat worse for same-day forecasts. Warm-season performance appears to lag previous studies.
The MSE can be decomposed as

$$\mathrm{MSE}(f, x \mid l) = \sigma_x^2 + E_f\{[f - p(1 \mid f, l)]^2\} - E_f\{[p(1 \mid f, l) - p(1 \mid l)]^2\}, \qquad (10)$$

where $\sigma_x^2$ is the variance of the observations and $E_f$ denotes the expectation over the forecast distribution p( f | l). The second term on the rhs of (10) is a measure of calibration (reliability). The last term is the resolution (Murphy and Daan 1985). Figure 8 plots the MSE for the cool and warm seasons according to this factorization. Note that we have displayed the negative of the resolution (the lowest area) so that higher resolution lowers the MSE, as in Eq. (10). We see that cool-season forecasts have better resolution (more negative) than the warm season. In addition, the cool season exhibits better calibration for near-term (2 days or less) and long-term (7 days or more) PoP forecasts. The variance of the observations is slightly lower in the warm season.
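A short sketch of this decomposition follows, assuming it takes the standard form in which the MSE equals the variance of the observations, plus a calibration term, minus a resolution term. The function name and the sample data are hypothetical.

```python
from collections import defaultdict

def mse_decomposition(forecasts, observations):
    """Split the MSE into observation variance, calibration, and resolution."""
    n = len(forecasts)
    x_bar = sum(observations) / n
    var_x = x_bar * (1.0 - x_bar)           # variance of a 0/1 variable

    # Group the observations by the forecast value that preceded them.
    groups = defaultdict(list)
    for f, x in zip(forecasts, observations):
        groups[f].append(x)

    calibration = 0.0
    resolution = 0.0
    for f, xs in groups.items():
        weight = len(xs) / n                # p(f | l)
        freq = sum(xs) / len(xs)            # p(1 | f, l)
        calibration += weight * (f - freq) ** 2
        resolution += weight * (freq - x_bar) ** 2

    mse = var_x + calibration - resolution
    return {"variance": var_x, "calibration": calibration,
            "resolution": resolution, "MSE": mse}

# Illustrative data only: perfectly calibrated 0.2 and 0.8 forecasts.
fcst = [0.2] * 10 + [0.8] * 10
obs = [0] * 8 + [1] * 2 + [1] * 8 + [0] * 2
parts = mse_decomposition(fcst, obs)
direct_mse = sum((f - x) ** 2 for f, x in zip(fcst, obs)) / len(fcst)
```

Here the calibration term is zero and the resolution term is 0.09, so the decomposed MSE of 0.25 − 0.09 = 0.16 agrees with the MSE computed directly from the forecast–observation pairs.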
The best measure of a probability distribution’s sharpness is its entropy H (Cover and Thomas 1991), which is given by

$$H(p) = -\sum_{i} p_i \log(p_i). \qquad (11)$$

The logarithm can be to any base, but we will use base 2. Entropy is at a minimum in the case of certainty and at a maximum when the probabilities are uniform. In the case of binary forecasts, the maximum entropy is log2(2) = 1. Entropy can also be thought of as a measure of the amount of information contained in a probability assessment, with lower entropies conveying greater information content.
Suppose a forecaster provides a PoP of f. The entropy of this forecast is −[ f log2( f ) + (1 − f ) log2(1 − f )]. We can therefore associate an entropy with each of TWC’s forecasts. Figure 9 plots the average entropy of TWC forecasts for the cool and warm seasons as a function of lead time. In addition, the entropy of a climatological forecast, based on Table 5, is also displayed. In the case of the cool season, we see that TWC forecasts have less entropy (more information) than climatology. The 0- and 1-day forecasts are much narrower than forecasts based solely on climatology because a PoP of 0.0 is forecast often. Entropy increases with lead time as one would expect, but suddenly drops for lead times of 7–9 days. Because these forecasts are not calibrated, we do not view this drop in entropy as the result of superior information. Rather, the long-term forecasts are too sharp. The warm-season entropies are closer to climatology, but also drop significantly after 6 days.
The 0-day likelihood functions for the cool and warm seasons are compared in Fig. 10. Given that precipitation was not observed, the most likely forecast during the cool season was 0.0, whereas it was 0.2 during the warm season. If precipitation was observed, it was much more likely that a lower PoP was forecast during the warm season than during the cool season. We also notice peaks at 0.8 in the event of precipitation. Figure 11 compares the likelihoods for the remaining lead times. The overlap between the likelihood functions is greater during the warm season. We also observe peaks at particular probabilities. For example, if precipitation occurred during the warm season, it is almost certain that TWC did not forecast a PoP of 0.7 one day ahead. Likewise, the 0.6 peaks are prominent in both seasons. Again, one would hope to see the likelihood function given precipitation peak at high PoPs and monotonically decline to the left. TWC’s forecasts are good at identifying a lack of precipitation, but are not particularly strong at identifying precipitation, especially during the warm season.
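The entropy of a PoP forecast, and the average entropy of a set of forecasts, can be computed as in the sketch below. The function names are hypothetical, and the convention 0 log2(0) = 0 is adopted so that categorical forecasts of 0.0 and 1.0 have zero entropy.

```python
import math

def binary_entropy(f):
    """Entropy in bits of a PoP forecast f, taking 0 * log2(0) as 0."""
    total = 0.0
    for p in (f, 1.0 - f):
        if p > 0.0:
            total -= p * math.log2(p)
    return total

def average_entropy(forecasts):
    """Mean entropy of a collection of PoP forecasts."""
    return sum(binary_entropy(f) for f in forecasts) / len(forecasts)

# Illustrative comparison: a sharp set of forecasts versus a constant
# forecast at a base rate of 0.21.
sharp_set = [0.0, 0.0, 0.2, 0.6]
h_sharp = average_entropy(sharp_set)
h_base_rate = binary_entropy(0.21)
```

A set of forecasts containing many 0.0 values has a low average entropy regardless of whether those forecasts verify, which is why low entropy indicates superior information only when the forecasts are also calibrated.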
5. Conclusions
TWC’s forecasts exhibit positive skill for lead times less than 7 days. Midrange PoPs tend to be well calibrated, but performance decreases with lead time and worsens during the warm season. PoPs below 0.3 and above 0.9 are miscalibrated and biased. Overall, almost all lead times exhibit positive bias and the same-day bias is significant, especially during the warm season.
As discussed previously, there is no reason, per se, that calibration performance should decrease with lead time. Rather, the difficulty of the forecasting task should be reflected in the sharpness of the forecasts. TWC’s long-term forecasts are too sharp. Apparently, one cannot reasonably forecast a 0% or 60% chance of precipitation 8 or 9 days from now, much less provide these forecasts nearly 40% of the time.
There seem to be two primary areas in which TWC could improve its forecasts: the machine guidance provided to human forecasters and the intervention tool used by these forecasters to arrive at sensible forecasts. The long-term forecasts, which are unedited by humans, exhibit a tendency to provide extreme forecasts and to artificially avoid 0.5. Perhaps revisions/additions to these models could improve performance. If not, TWC might want to consider intervening in these forecasts as well. The intervention of human forecasters increases skill, but also introduces bias. The intervention tool uses a least squares procedure to adjust underlying weather variables. Perhaps other approaches, such as the application of maximum entropy techniques (Jaynes 1957), would improve performance. Maximum entropy techniques would avoid producing narrow and biased forecasts.
Performance during the warm season is noticeably worse, even though the variance of the observations is lower (see Table 6). This suggests that TWC should concentrate its attention on improving PoPs during this time.
In addition, providing PoPs at 0.05 intervals (i.e., 0.05, 0.10, . . . , 0.95) might be helpful and enable TWC to avoid forecasts of 0.0 and 1.0, which will not be well calibrated.
The authors gratefully acknowledge the assistance and guidance of Dr. Bruce Rose at The Weather Channel and the suggestions of two anonymous reviewers.
* Current affiliation: Operations Research/Industrial Engineering, The University of Texas at Austin, Austin, Texas.
Corresponding author address: J. Eric Bickel, Operations Research/Industrial Engineering Program, The University of Texas at Austin, Austin, TX 78712-0292. Email: email@example.com
¹ Bruce Rose is a meteorologist and software designer for TWC based in Atlanta, Georgia. The authors worked closely with Dr. Rose to understand TWC’s forecasting process.
² We considered an NCDC observation of less than 0.01 in. of precipitation as an observation of no precipitation.
³ In hindsight, we should have selected forecasts that correspond to observation stations. However, we initially thought we would be able to use TWC’s observation data, only later realizing that these observations do not cover the same length of time as the forecasts.
⁴ A cutoff of 40 is common in hypothesis testing. The variance of a binomial distribution is Np(1 − p). The normal approximation to the binomial is very good when this variance is greater than 10. Thus, if p = ½ then N should be greater than 40.
⁵ This is identical to a two-tailed t test with a 1% level of significance.