Figure captions

Fig. 1. Scatterplot of (top) 6-h and (bottom) 30-h PPM temperature forecasts vs observations for Jan–Mar 2000, Toronto Pearson International Airport. Observation data are from METARs.
Fig. 2. Schematic diagram showing the dependent and independent datasets used in this evaluation. Dark portions of shaded bars indicate summer-season data and lighter portions indicate winter-season data. The open bars refer to the periods for which old and new model data were available. Tick marks on the axis represent the beginning of each year. See text for further details.
Fig. 3. Bias of winter-season temperature forecasts issued from the 1200 UTC run, as a function of projection time, for the five sets of forecasts. Sample size is about 17 000 events for each, from 205 Canadian stations.
Fig. 4. Percent reduction of variance (RV) with respect to the independent sample mean for the same dataset as in Fig. 3.
Fig. 5. Summary of reduction of independent sample variance (RV) over all four datasets for (a) 6- and (b) 48-h temperature forecasts. Summer sample of about 22 400 cases over 210 stations and winter sample of about 17 000 cases over 205 stations, respectively.
Fig. 6. Brier score as a function of projection time for the five sets of POP forecasts issued from the 1200 UTC run, summer season. Sample size is about 18 000 cases over 170 stations.
Fig. 7. Reliability table for 0–6-h PPM, UMOS, and DMO POP forecasts from the 1200 UTC run, summer season. Frequency of use of the 10 forecast probability deciles is shown in the inset (upper left) and by means of the figures above the plotted points, with UMOS frequency above and PPM frequency below. The categorical DMO forecasts are shown by the plotted points in the two extreme deciles. The independent sample climatological frequency is indicated by the horizontal dashed line.
Fig. 8. Same as in Fig. 7 but for the 42–48-h POP forecasts.
Fig. 9. Same as in Fig. 8 but for the 42–48-h UMO, PPM, and DMO POP forecasts. Forecast probability frequencies are shown with UMO above and PPM below.
Fig. 10. Reliability component of the Brier score as a function of projection time for the five sets of POP forecasts issued from the 1200 UTC run, summer season. Sample size is about 18 000 cases over 170 stations.
Fig. 11. Brier score as a function of projection time for the five sets of POP forecasts issued from the 0000 UTC run, winter season. Sample size is about 13 000 cases over 168 Canadian stations.
Fig. 12. Same as in Fig. 11 but for the reliability component of the Brier score.
Fig. 13. Reliability table for 0–6-h PPM, UMOS, and DMO POP forecasts from the 0000 UTC run, winter season. Format as in Fig. 7.
Fig. 14. Same as in Fig. 13 but for 42–48-h POP forecasts.
Fig. 15. Bias of summer-season wind speed forecasts issued from 0000 UTC, as a function of projection time, for the five sets of forecasts. Sample size is about 18 000 cases over 155 stations.
Fig. 16. Frequency bias of 18-h UMOS and PPM wind speed forecasts, from the data in Tables 2 and 3.
Fig. 17. Percent correct for (left) 6- and (right) 18-h wind speed forecasts, based on six-category forecast–observed contingency tables, for summer-season forecasts issued from 0000 UTC.
Fig. 18. Heidke skill score (relative to chance) for summer-season wind speed forecasts issued from 0000 UTC, as a function of projection time, based on six-category contingency tables.
Fig. 19. Reduction of independent sample variance (RV) as a function of projection time for wind direction forecasts, winter-season run from 1200 UTC. Sample size is about 8500 cases for which wind speed was forecast and observed over 8.96 km h−1, for 145 stations.

The Canadian Updateable Model Output Statistics (UMOS) System: Validation against Perfect Prog

Meteorological Service of Canada, Dorval, Quebec, Canada

Abstract

This paper describes validation tests of the Canadian Updateable Model Output Statistics (UMOS) system against the perfect prognosis forecast system and forecasts of weather elements from the operational numerical weather prediction model. Several update experiments were performed using 2-m temperature, 10-m wind direction and speed, and probability of precipitation as predictands. These experiments were designed to evaluate the ability of the UMOS system to provide improved forecasts during the period following a model change when the development samples contain data from two or more different model versions. Tests were run for about 200 Canadian stations for both summer and winter periods. Independent summer and winter samples were used in the evaluation, to compare UMOS forecast accuracy with the direct model output forecasts, the perfect prog forecasts, and MOS forecasts based only on data from the earlier model version. The authors were also able to compare the evaluation results of forecasts generated using the data from a 4-month summer “parallel run” period for which two versions of the model were run concurrently. Results show that the UMOS forecasts are generally superior to both perfect prog and direct model output forecasts for all three weather elements. The UMOS forecasts are particularly responsive to bias changes; most forecast biases could be corrected with relatively little data from the newer model version. Although some of the improvement over perfect prog forecasts is apparently due solely to the use of MOS, the updating brings additional improvements even during the data blending period. The results also suggest that the higher-resolution predictions from the model bring advantages only for the first day of the forecast period. For the day-2 forecasts, the improvement over the much smoother perfect prog forecasts was smaller, especially for probability of precipitation.

Corresponding author address: Laurence J. Wilson, Recherche en Prévision Numérique, Meteorological Service of Canada, 2121 Route Transcanadienne, Suite 500, Dorval, QC H9P 1J3, Canada. Email: lawrence.wilson@ec.gc.ca

1. Introduction

In Canada, we have recently implemented a new system for the interpretation of numerical weather prediction model output into local weather element forecasts. Called Updateable Model Output Statistics (UMOS), the system replaced an older weather element forecast system based on the perfect prognosis method (PPM; Klein and Lewis 1970). The PPM forecasts had been used operationally for about 10 years because frequent changes in the driving model prevented the accumulation of a sufficiently large and stable dataset for redevelopment of model output statistics (MOS) equations following each significant change of numerical weather prediction (NWP) model. As the spatial resolution of our operational model increased, a return to the MOS formulation became more attractive. The UMOS system was developed as a way of ensuring not only that statistically stable MOS equations would always be available for use in operations, but also that the predictive expertise from new models would be brought into the weather element guidance forecasts as quickly as possible after the model change.

UMOS differs from a standard MOS system (Glahn and Lowry 1972; Glahn and Dallavalle 2000), in two important respects. First, UMOS is designed to allow frequent and automatic updating of the statistical forecast equations, so that the output from recent model runs is incorporated into the statistical forecasts as early as possible. Second, UMOS includes a user-controllable weighting scheme. This means that, following a change of the numerical weather prediction model, output from the new model can be given higher weights for equation development relative to output from the older model, so that the statistical equations respond as quickly as possible to the characteristics of the new model. Both of these design features are intended to make the MOS forecasts as responsive as possible to new model characteristics following model changes. Data from the old model version are retained and used in development to keep the equations statistically stable until the sample from the new model attains sufficient size to produce reliable forecasts. In the case of the Canadian UMOS system, blending of new and old model samples begins when 30 cases from the new model have been accumulated. As the sample size of new model cases increases, the weight applied to the old model data decreases. After a sufficiently large sample from the new model becomes available (300–350 cases, depending on the predictand), the old model data are no longer used in development. Since separate equations are developed for summer and winter seasons, this means that the data blending period following a model change lasts approximately 2 yr.
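
The weighting function itself is documented in WV02; purely as a schematic illustration of the blending logic just described, the sketch below assumes a linear decay of the old-model weight between the 30-case and a 300-case threshold (the actual cutoff is 300–350 cases, depending on the predictand, and the operational weighting may differ from a linear ramp).

```python
def old_model_weight(n_new, start=30, full=300):
    """Schematic relative weight applied to old-model cases during blending.

    Assumes: before `start` new-model cases the equations rest on old-model
    data alone (weight 1.0); between `start` and `full` cases the old-model
    weight decays linearly; beyond `full` cases the old-model data are no
    longer used (weight 0.0). The operational function is given in WV02.
    """
    if n_new < start:
        return 1.0
    if n_new >= full:
        return 0.0
    return 1.0 - (n_new - start) / float(full - start)

# In a weighted-regression development, the sums of squares and cross
# products from the two samples would then be combined roughly as
#   S = old_model_weight(n_new) * S_old + S_new.
```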

The Canadian UMOS system is described in detail in a companion paper (Wilson and Vallée 2002, hereafter WV02). That paper discusses the design of the system, including the weighting scheme and the data management aspects. The results presented in WV02 illustrate the impact of the weighted blending of new and old model output on individual regression equations for selected stations, and summarized over a set of 200 stations. It was found that the predictors adjusted quickly to the addition of data from a new model version, and that the selected predictors changed only slightly through the blending period. Though the blended samples inevitably contain variance that is due solely to the change in the statistical characteristics of the model variables, this did not seem to have a significant detrimental effect on the fit of these samples to observations. We found little evidence of increased difficulty in fitting the blended model output to observations for the three predictands, temperature, wind, and probability of precipitation (POP).

All the results in WV02 are based on the samples used to develop the regression equations. In order to demonstrate that the UMOS forecasts would be useful in operations, it is necessary to test the equations on independent data. In fact, before the equations could be accepted for implementation at the Canadian Meteorological Center (CMC), it was necessary to show that the forecasts from the UMOS system were superior to those from the existing operational PPM system. It is also preferable that the UMOS forecasts retain their quality with respect to the PPM forecasts even during the period following a model change, when the equations are based on blended samples. That is one of the goals of the present work: to determine the quality of the UMOS forecasts compared to that of the operational PPM, as well as that of direct model output (DMO) forecasts of temperature, wind, and POP.

Comparisons between PPM and MOS forecasts have been done before, for example, Brunet et al. (1988) and Dallavalle (1988). Their studies were motivated by a desire to consider reverting to PPM forecasts when frequent model changes precluded the accumulation of statistically stationary samples of model output for MOS development. These tests generally showed that PPM methods were competitive with MOS based on the models of the time, but more importantly, they also illustrated noticeable differences in the characteristics of forecasts from the two methods: PPM forecasts tended to retain their sharpness over the entire range of forecast projections, while MOS forecasts would become more conservative with increasing projection (Brunet et al. 1988). On the other hand, MOS forecasts would retain their reliability while PPM forecasts would become less reliable with increasing projection.

These differences arise because of the differences in development methods of PPM and MOS. The “goodness of fit” of statistical equations is sensitive to the level of error variance (noise) in the predictors: the greater the noise level in the predictors, the poorer the fit to the observations. Equations that explain a relatively small portion of the predictand variance will lead to predictions that are conservative, clustering toward the dependent sample mean. Thus, as model error variance increases with increasing forecast projection, the model output predictors used in MOS development contain more noise, with the result that MOS equations become more conservative, and increasingly forecast toward climatology. PPM equations are based on analyses or observations and, therefore, are not sensitive to model error variance. The statistical relationships may be quite strong, and are independent of model projection. PPM forecasts therefore do not tend toward climatology as model errors increase; they retain their sharpness and will generally forecast extreme values more often than MOS forecasts.
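
The clustering toward the sample mean can be illustrated with the standard errors-in-variables (attenuation) result for simple linear regression; the notation below is ours and is used only to make the argument explicit.

```latex
% Suppose the true relationship is y = \beta x + e, and the MOS predictor is the
% model's noisy estimate x' = x + \epsilon, with error variance \sigma_\epsilon^2.
% The least-squares slope fitted on the noisy predictor is attenuated:
\hat{\beta} \;\approx\; \beta\,\frac{\sigma_x^2}{\sigma_x^2 + \sigma_\epsilon^2},
% so as \sigma_\epsilon^2 grows with projection time, \hat{\beta} shrinks and the
% regression forecast \hat{y} = \bar{y} + \hat{\beta}\,(x' - \bar{x}') collapses
% toward the dependent-sample mean \bar{y}, i.e., toward climatology.
```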

Modern operational models are capable of simulating finer horizontal scales than can be resolved by regular surface and upper-air observation networks, especially in Canada. In statistical terms, this means that forecast variables from higher-resolution models contain greater spatial and temporal variance. In this study, the UMOS equations have access to all model variables at their full resolution, and the forecasts are compared with PPM forecasts that were developed on rather low resolution analyses of standard variables, but are run using the higher-resolution model estimates of these variables. If there is useful predictive information in the smaller scales simulated by the model, then the UMOS equations should outperform the PPM equations, even if the PPM equations are driven by the higher-resolution model output. A reevaluation of MOS versus PPM forecasts is therefore essentially a way of evaluating the relative quality of high-resolution variables in the model.

In summary, the comparisons described in this paper have three related purposes: 1) to evaluate the UMOS forecasts on independent data, with respect to the existing operational PPM forecasts; 2) to evaluate the utility of high-resolution model variables as used in MOS, compared to lower-resolution PPM forecasts; and 3) to verify the performance of UMOS equations based on blended samples. Section 2 describes the forecasts and data used in the study, section 3 presents the results, and the results are discussed in section 4.

2. Description of the forecasts and data used in the comparison

To address the three goals of the study, five different sets of forecasts were generated for each of three predictand elements, temperature, POP, and 10-m wind. The first two goals were addressed by comparing the DMO, PPM, and UMOS forecasts. To meet the third goal, we developed three different versions of UMOS forecasts that are designated UMOS, UMB, and UMO. Thus the full study consists of the five-way intercomparison of DMO, PPM, UMOS, UMB, and UMO on the test datasets. This section describes the source and characteristics of the forecasts and data used in the study.

a. PPM forecasts

Table 1 summarizes the characteristics of the PPM equations, compared to the UMOS equations, for all three predictands. We used the operational PPM forecasts that were implemented in March 1991 after frequent model changes rendered older MOS forecasts obsolete. Equations for 3-h spot temperature and 6-h POP were developed using a 22-yr (1963–84) dataset of analyzed upper-air data with a grid resolution of 381 km. Temperature equations were developed for 2-month seasons and POP equations for 3-month seasons. Predictors for the temperature forecasts consisted of geopotential heights at standard levels, various thicknesses derived from these heights, thermal wind, horizontal gradients of thickness, relative humidity, vorticity advection, geostrophic vorticity, and vertical wind shears. Surface observations were also added as predictors. All the derived predictors were obtained by differencing the basic height, temperature, and wind fields from the original 381-km resolution analyses. For POP, the predictors included heights, temperatures, dewpoint depressions, relative humidity, thicknesses, vorticity, winds, and advection, all for standard upper-air levels. To these were added some derived predictors including the George K stability index, thickness gradients, and the Laplacian of height changes. Upstream surface predictors and persistence predictors were also included in the PPM POP system. Different equations are used for different upstream displacements. Further details of the PPM forecasts are given in Brunet (1987) for the 3-h spot temperatures and Verret (1987) for the 6-h POP forecasts.

It is important to note that the temperature and POP PPM forecasts are postprocessed in ways that affect the predicted distribution of these variables (Verret 1988). Postprocessing of the POP forecasts was designed to increase the sharpness of the forecasts, that is, to increase the frequency with which the extreme probabilities are forecast. Although PPM forecasts tend to be sharper than MOS forecasts, as discussed above, the postprocessing of the PPM forecasts is intended to counteract the smoothing effects of the low-resolution analysis data used in equation development. Temperature forecasts are subjected to an anomaly reduction routine, which effectively reduces the tendency toward prediction of extreme temperatures for longer projection times. This, in effect, reduces the sharpness of the temperature forecasts as a function of projection, and would tend to make the forecast distribution for longer projections appear more like MOS, with a reduction in the frequency of forecasts of large anomalies.

The anomaly reduction scheme is described in appendix A. An example of its impact is shown in Fig. 1, for a specific station (Toronto Pearson International Airport). The verification data are the same for both forecast projections, with a temperature range of about −18° to +13°C. The forecast range is −18° to +15°C for the 6-h forecasts and only −17° to +10°C for the 30-h forecasts.
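
The operational scheme is specified in appendix A; purely to illustrate the effect visible in Fig. 1, the sketch below damps the forecast departure from station climatology by a factor that shrinks with projection time. The damping form, the rate `k`, and the name `reduce_anomaly` are our assumptions, not the CMC algorithm.

```python
def reduce_anomaly(t_forecast, t_climatology, projection_h, k=0.005):
    """Hypothetical anomaly reduction: damp the forecast departure from
    climatology by a factor that decreases with projection time.
    The operational scheme is described in appendix A; the linear
    damping rate `k` (per forecast hour) is an illustrative assumption."""
    factor = max(0.0, 1.0 - k * projection_h)
    return t_climatology + factor * (t_forecast - t_climatology)

# e.g. a +15 degC anomaly at 30 h with k = 0.005 is damped to
# 0.85 * 15 = 12.75 degC above climatology, qualitatively consistent
# with the narrower 30-h forecast range seen in Fig. 1.
```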

The PPM wind forecasts were developed a little differently. They were based on only 8 yr of analyses, also with a resolution of 381 km. The predictors included heights, temperatures, relative humidities at standard levels, along with numerous derived predictors such as vertical and horizontal shears, geostrophic wind components, thicknesses, vorticities, sine and cosine of wind direction, and advection of thickness, temperature, and vorticity. To this were added a few special predictors to assist with modeling the diurnal variation in wind speed. Separate equations were developed for the west and south wind components and the scalar speed. The components were used to compute wind direction. Since diurnal predictors were included, only one equation per day was developed. The PPM equations were run on 3 yr of data from the model that was operational between 1985 and 1987, and the resulting forecasts were tuned to each station and each forecast projection. The tuning had the effect of removing the bias on the 3-yr test sample and, also, resulted in changes to the forecast variance in wind speed. Given that the corrections were based on forecasts that used model-derived estimates of the predictor variables, the tuned equations would have some of the characteristics of MOS equations. The performance of these equations could therefore be expected to be degraded after model changes. Despite this concern, the forecasts are still considered as useful guidance by forecasters, and have been run and verified without further modification through the 1990s. Further details of the PPM wind forecast technique are given in Sarrazin (1989) and Sarrazin and Wilson (1989).
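
The PPM system predicts the wind components and the scalar speed separately and derives direction from the components; as an illustration only (the PPM uses west and south components, whereas the snippet below assumes the more common eastward/northward convention), the meteorological direction can be recovered as follows.

```python
import numpy as np

def wind_direction(u, v):
    """Meteorological wind direction (degrees from which the wind blows,
    measured clockwise from north) from eastward (u) and northward (v)
    wind components."""
    return np.degrees(np.arctan2(-u, -v)) % 360.0

# e.g. u = 0, v = -5 (flow toward the south) gives 0 deg: a north wind.
```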

All the PPM forecasts are run using model data extracted on the same 381-km grid that was used in development. Predictor values are interpolated linearly to stations, and differenced predictors are computed in the same way as in the development sample. This is necessary to ensure that the model estimates of the predictor values are statistically as close as possible to the analyzed predictors that were used in development. It should be noted that the model predictors are sampled on the lower-resolution grid; there is no systematic spatial smoothing or filtering applied. Therefore, they are more likely to contain smaller-scale variations than they would if analyzed onto the lower-resolution grid before being used to generate predictor values. Differenced predictors, however, are expressed over distances corresponding to the grid resolution.

b. DMO forecasts

Estimates of the 2-m temperature, the 10-m wind speed and direction, and precipitation are available from the regional version of the CMC operational Global Environmental Multiscale (GEM) model at each forecast time. The data used in the comparison experiments come from two different versions of the model. The “old” model has a horizontal resolution of 0.33° latitude, uses the Kuo convective precipitation scheme and carries out the radiation balance calculations every 135 min. The “new” model has a resolution of 0.24° latitude, uses the Fritsch–Chappell convection scheme, and carries out the radiation balance calculations every 60 min. In addition, the new model uses a higher-resolution topography database to define the topography and roughness fields than was used in the old model. The model was changed on 14 September 1998 after a 4-month period when both new and old models were run each day.

The changes to the model and an assessment of their impact on precipitation forecasts are described in detail in Bélair et al. (2000). Their results showed that the change to the convection scheme produced a positive impact on summer precipitation prediction, but a neutral impact on winter precipitation. The impact of the other changes on precipitation forecasts was found to be small. One might expect the resolution change to have an impact on model variables that relate to gradients, such as surface and upper-air wind forecasts, as well as adding variance to the predictor fields that are expressed at the full model resolution. The topography resolution increase might also have a systematic impact on the DMO surface predictors evaluated in these experiments, as well as on the UMOS forecasts that use them.

c. UMOS forecasts

The UMOS system is described in general terms in the introduction and in detail in WV02. The predictands are similar to the PPM predictands: 3-h spot temperature (17 projections from 0 to 48 h), 6-h POP (8 projections from 0 to 48 h), and wind direction and speed every 3 h (17 projections, 0 to 48 h). The wind predictand in UMOS is defined slightly differently: we use the maximum hourly reported wind over a 3-h period centered on the projection time, while the PPM forecasts are spot 3-h values. For comparison, each method was verified as defined. The predictors used in the UMOS equations for temperature, POP, and wind are listed in detail in WV02, and will not be repeated here. Aside from the model variables, persistence predictors (surface observations) are used for all projections up to 24 h (12 h for POP), and the model's estimate of the predictand (DMO) is included as a predictor. No smoothing or filtering is applied to the model predictors; they are interpolated to station locations using spline interpolation methods, which are capable of inferring local maxima and minima between grid points. This means that, following a model resolution change, the resolution of the predictors will also change, and the blended new and old model development samples will contain predictors that express different model resolutions. The only exception to this is that the finite-differencing distance is fixed at 50 km for computation of horizontal differenced predictors, regardless of the model resolution.
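
As a sketch of the kind of interpolation described here (not the operational code), a cubic spline fitted to the gridded model field can be evaluated at station locations; unlike linear interpolation, it can return values outside the range of the surrounding grid points, i.e., it can infer local maxima and minima between them. The grid spacing and station coordinates below are hypothetical.

```python
import numpy as np
from scipy.interpolate import RectBivariateSpline

# Illustrative regular lat-lon grid and a stand-in for a model predictor field.
lats = np.linspace(40.0, 60.0, 81)
lons = np.linspace(-100.0, -60.0, 161)
field = np.random.rand(lats.size, lons.size)

# Cubic spline in both directions; s=0 forces the surface through the grid values.
spline = RectBivariateSpline(lats, lons, field, kx=3, ky=3, s=0)

# Evaluate the fitted surface at (hypothetical) station coordinates.
stn_lat = np.array([45.47, 53.32])
stn_lon = np.array([-73.75, -60.42])
stn_values = spline.ev(stn_lat, stn_lon)
```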

As noted above, one of the purposes of this study is to evaluate the UMOS forecasts during a period when the equations are based on blended development samples from two or more model versions. This evaluation was necessary to determine whether the quality of the forecasts can be maintained during the model transition period. Table 1 lists the overall characteristics of the UMOS equations while Fig. 2 summarizes the origin of the three versions of UMOS forecasts: UMOS, UMB (UMOS both), and UMO (UMOS old). The winter (gray) and summer (black) development sample periods for each version are indicated by the three bars in the upper part of the figure. The overlap in the bars in Fig. 2, between 6 May and 14 September 1998, indicates the model parallel-run period, when both the new model and the old were run twice each day. Output was saved from both model versions, and was used in the UMOS development.

First, the UMO equations are based on old model data only; the equation updating was stopped at the time of the model change, 14 September 1998. Next, the summer season UMB equations were generated using the parallel-run data. That is, the update cycle used old model summer season data until 6 May 1998; then the equations were updated with the new model data between 6 May and 14 September 1998. Development of the UMB summer equations then was stopped at the time of the model change. Thus, the only difference between the UMO and UMB summer equations is that the latter use the 4 months of new model data (about 120 cases), blended with the old model data, while the former use only old model data during this period. The development period is the same for both. For the winter UMB equations, we did not have access to parallel-run data, and so these were generated using one season of new model cases from the 1998/99 winter season, blended with old model data from previous winters.

Finally, the UMOS forecasts were generated by running the update cycle through the entire period for which data were available, September 1995 until the end of March 2000. Since this period contains the test sample period, it was necessary to ensure independence of the test results by stopping the update each week and running the test on the following week during the test periods. This means the UMOS equations were changed (updated) during the verification periods, while the UMO and UMB equations were fixed throughout the verification periods.

The UMO, UMB, and UMOS equations form a kind of hierarchy for the UMOS evaluation. The UMO equations contain no new model data so that when they are run on an independent sample from the new model, they should indicate the effect of ignoring the model change completely. The UMB equations contain some new model data, and represent an intermediate stage in the transition from old to new model dependence. They do not contain any data that are close in time to the test periods. Finally, the UMOS equations, developed as they would be in operations, contain more new model data (usually more than one season's worth), and benefit from updating with recent model output.

d. Independent datasets and test methods

Four separate independent samples were used in the evaluation, one for each season and one for each of the two daily model runs, 0000 and 1200 UTC, all from the new model. These are indicated in Fig. 2 by the black (summer) and gray (winter) bars contained within the larger open “new” model bar. To minimize the effects of differences in the sampling climatology on the results, we chose the same 4-month period as was used in development for the summer equations, 13 May–13 September, but the test period is one year later, 1999. This is not quite true for the winter equations: The new model development period (UMB equations) is approximately one full winter season while the independent test period is from 3 January to 3 April 2000. Except for a small percentage of missing forecasts, the four independent samples are the same for all five sets of forecasts that are compared in the results that follow.

The DMO and PPM forecasts for each of the four independent test periods were used as standards of comparison. The DMO forecasts are simply the model's estimates of the three predictands, interpolated to the stations, while the PPM forecasts were prepared using new model data and postprocessed exactly as they are in operations.

The verification data were obtained from the standard combined METAR (aviation routine weather report) and synoptic observation data files available operationally at CMC. Temperature observations from 210 (205) primary-manned Canadian stations were available for each of the summer (winter) independent samples, 155 (145) stations for wind and 170 (168) for precipitation. The temperature, wind, and POP independent sample sizes are approximately 22 400, 18 000, and 18 000, respectively, for each of the two summer datasets, and 17 000, 13 500, and 13 000 for the two winter verification datasets. For the wind direction verification only, forecast and/or observed light wind cases were eliminated from the sample, reducing its size to approximately 10 000 cases in summer and 8500 cases in winter. The observations were subjected to a rudimentary quality control before being inserted into the database. Observations and forecasts for all three predictands were matched for verification according to the forecast valid time, and according to the predictand definition for each element. For POP, the observation is binary. A value of 1 is inserted into the database if measurable precipitation (>0.2 mm) occurred during the 6-h valid period and a 0 is inserted if it did not.

In order to summarize the performance of the different sets of temperature and wind forecasts, the bias, mean absolute error (MAE), root-mean-square error (rmse), and reduction of variance (RV) were used. In addition, contingency tables were computed for categories of wind speed to aid in the evaluation of performance for high wind speed cases. Percent correct, frequency bias, and the Heidke skill score relative to chance were computed from the contingency table data. For POP, we used verification measures appropriate to probability forecasts, namely the Brier score, the reliability component of the Brier score, and reliability tables. For definitions of the evaluation measures shown in the paper, the reader is referred to appendix B.
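
For reference, the continuous-predictand summary measures used here can be computed as below; appendix B gives the paper's definitions, and this sketch follows the standard forms, with RV taken relative to the independent-sample mean as described in the text.

```python
import numpy as np

def summary_scores(forecast, observed):
    """Bias, MAE, rmse, and reduction of variance (RV) for paired
    forecast/observation arrays. RV is measured against the independent
    sample climatology, i.e., the mean of the verifying observations."""
    f = np.asarray(forecast, dtype=float)
    o = np.asarray(observed, dtype=float)
    err = f - o
    bias = err.mean()
    mae = np.abs(err).mean()
    rmse = np.sqrt((err ** 2).mean())
    mse_clim = ((o.mean() - o) ** 2).mean()   # squared error of a climatology forecast
    rv = 1.0 - (err ** 2).mean() / mse_clim   # fraction of variance explained
    return bias, mae, rmse, rv
```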

3. Results

In the results that follow, a five-way comparison is shown among the UMOS forecasts, the PPM forecasts, and the DMO. We found a good deal of consistency in the results over the four independent datasets. The results shown below were chosen to be representative of the characteristics of the performance over all the independent data.

a. Temperature

Figure 3 shows the bias (mean error) for the independent sample winter temperature forecasts issued from the 1200 UTC model run. Several characteristics of the performance are indicated in this figure. First, the DMO forecasts are negatively biased (cold bias), and therefore the PPM forecasts also show a cold bias, generally of about the same magnitude. For projections up to 12 h, the PPM forecasts seem to benefit from the use of persistence and upstream predictors, which have the effect of reducing the cold bias with respect to that of the DMO. Second, all three of the UMOS runs have reduced the cold bias, and the full UMOS run essentially eliminates the bias. This is consistent with WV02, where we found that UMOS responds quickly to new model data to eliminate biases. The cold bias is reduced even for the UMO run, which suggests that there is benefit from using MOS equations based on an old model. The two runs using development data from the new model further improve the bias characteristics, as expected. Forecasts from all three of the MOS runs are essentially unbiased to 9 h, which reflects the effect of the persistence predictors in the equations.

Figure 4 shows the RV (see appendix B) for the five forecast sets, for the same sample as Fig. 3. The reference climatology in this case is the independent sample climatology, which is the same for all five forecasts. The MOS forecasts improve substantially upon both the model and PPM forecasts, gaining about 5% over the DMO forecasts. This translates to a 9–12-h improvement for the first 18 h and one of 18 h or so for the longer projections. Once again more than half the improvement is achieved by using MOS rather than PPM, but additional gains are realized by using the full update cycle. The blending of development samples does not seem to have hindered the performance of the MOS equations, in terms of either bias or RV characteristics. The RV gains of 1% or so when the full update cycle is used remain more or less constant over all projections.

Figure 5 demonstrates variations in performance of the temperature forecasts over the four independent samples in terms of RV. The RV is higher in all cases for the winter samples than for the summer samples. At 6 h, the relative performance across the five sets of forecasts is much the same over all the independent samples, with the MOS forecasts showing higher RV than the DMO and PPM forecasts. The use of recent data in UMOS seems to have benefited the performance more in the winter than in summer. At 48 h, the relative improvement of MOS over PPM is slightly smaller, and all methods show greater RV in winter than in summer. Since the RV is a quadratic scoring rule, the postprocessing of the PPM forecasts may have helped their performance by reducing the incidence of heavily weighted large errors.

b. POP

Precipitation is known to be a more “difficult” predictand because of its highly variable nature and non-normal sampling distribution. Thus we expected that it would be a greater challenge to show skill with POP, based on small and/or blended samples of model variables. Two of the four changes to the model would be expected to have an impact on POP forecasts: the change of convection scheme and the change of resolution. Results for POP forecasts are shown for two of the four independent datasets in this section, one for each season.

Figure 6 shows Brier scores for the five sets of forecasts for the summer season, 1200 UTC forecast run. In terms of the Brier score, all the statistical forecasts have improved considerably over those of the DMO at all projections. Differences in Brier scores among the various statistical forecasts are much smaller by comparison. The full UMOS run has improved over the PPM forecasts for the first 18 h of the forecast period only. After 18 h, there is no significant difference between the UMOS and PPM Brier scores. As in the temperature results, some of the improvement is attributable to the use of MOS, since the scores for the UMO forecasts generally lie between the PPM and UMOS scores. Brier scores for all techniques are lower (better) during the nighttime period, when convective activity is lower. The poorer daytime performance may be due to the serious spatial undersampling of the discrete observation dataset, rather than to any real difference in forecast skill. Figure 6 also indicates no apparent difference between UMO and UMB forecasts. There is thus no evidence here to indicate a positive or negative impact due to the change of convective scheme in the model. Again, any real impact might have been obscured by the unrepresentativeness of the spatially discrete station observations.
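
The Brier score, and the reliability component used later (Figs. 10 and 12), follow Murphy's (1973) partition. The sketch below is a minimal illustration (not the operational verification code), assuming the forecasts are binned into the ten probability deciles used in the reliability tables; the binary observation follows the POP definition given in section 2d (1 if more than 0.2 mm fell in the 6-h valid period, else 0).

```python
import numpy as np

def brier_decomposition(prob, occurred, n_bins=10):
    """Brier score and its Murphy (1973) partition into reliability,
    resolution, and uncertainty, using decile bins of forecast probability."""
    p = np.asarray(prob, dtype=float)
    o = np.asarray(occurred, dtype=float)
    bs = ((p - o) ** 2).mean()

    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.digitize(p, edges[1:-1])          # bin index 0..n_bins-1 for each forecast
    obar = o.mean()
    reliability = resolution = 0.0
    for k in range(n_bins):
        in_bin = idx == k
        n_k = in_bin.sum()
        if n_k == 0:
            continue
        p_k = p[in_bin].mean()                 # mean forecast probability in the bin
        o_k = o[in_bin].mean()                 # observed relative frequency in the bin
        reliability += n_k * (p_k - o_k) ** 2
        resolution += n_k * (o_k - obar) ** 2
    reliability /= p.size
    resolution /= p.size
    uncertainty = obar * (1.0 - obar)
    return bs, reliability, resolution, uncertainty   # bs ≈ rel - res + unc
```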

To interpret the UMOS–PPM comparison, one must bear in mind the differences between the two equation sets. First, the PPM forecasts are postprocessed to increase their sharpness; that is, the frequency with which the extreme probabilities are forecast has been increased. Under some conditions, this can improve the Brier score. Second, the PPM forecasts use upper-air variables that are sampled on the same low-resolution grid as in development, then interpolated to stations. Although differenced variables are computed at low resolution, the PPM forecasts have some access to higher resolution of the model forecasts via these sampled predictors. The UMOS forecasts are not postprocessed, and rely on the sharpness of the model predictors to generate sharp forecasts. The UMOS–PPM comparison result is encouraging in the sense that the blended samples do not degrade the forecasts compared to those of the PPM. However, it also suggests that the additional resolution in the model predictors is of little use after day 1 since the smoother PPM equations are competitive in terms of the Brier score.

The characteristics of the POP forecasts may be further explored by means of reliability tables. Reliability tables can be used to evaluate the attributes “reliability,” “resolution,” and “sharpness” (Murphy 1993). (See appendix B for descriptions of these attributes and their relationship to the Brier score.) Figure 7 is a reliability table for the UMOS–PPM comparison, for the shortest-range forecasts, again for summer POP forecasts from the 1200 UTC run. It can be seen that both the UMOS and PPM forecasts are quite reliable overall (close to the 45° line), but that UMOS is slightly more reliable. The PPM forecasts show an underforecasting bias. In addition to being slightly more reliable overall, the UMOS forecasts are sharper, as indicated by the frequencies of use of the probability categories and the frequency histogram inset. For example, UMOS predicts the 90%–100% range 434 times compared to 123 times for PPM. (It should be noted that the underforecasting bias of the PPM forecasts gives the appearance that they are sharper at the low end of the probability range. We checked this by using the variance of the forecast probabilities, which confirmed the greater sharpness of the MOS forecasts even in the 0%–10% range.) The model categorical forecasts are shown in the figure as well. While these are not reliable, the DMO shows some accuracy since the frequency of precipitation occurrence is much lower when the model does not forecast it than when it does.

Figure 8 shows the corresponding reliability table for the 42–48-h forecasts. Here, the PPM overforecasts the probabilities in all ranges, while UMOS remains nearly perfectly reliable for POP less than 70%. The apparent loss of reliability for both methods above 70% may be due to a lack of detection of convective events in the observation dataset rather than to forecast errors. Figure 8 exhibits the known characteristics of MOS forecasts to maintain reliability as model errors increase, but also to decrease in sharpness, as discussed above. For example, UMOS attempts the highest probability range only 135 times at 48 h, compared to 194 times for PPM. (Once again, the overforecasting bias of PPM obscures the sharpness characteristics in the lowest forecast range; the PPM forecasts are indeed sharper in the 0%–10% range.) The DMO accuracy has also decreased: Of the 5823 occasions when it predicted precipitation, only 36% of those cases were associated with observations of precipitation.

It is interesting to note that the reliability curve for the UMO forecasts resembles more closely the PPM forecasts (Fig. 9) than it does the UMOS forecasts (Fig. 8). It is as if the old model equations lose their “precipitation signature” when applied to new model data, and behave more like a PPM system, which does not use model output in development.

Figure 10 quantifies the reliability for the five sets of forecasts for the summer 1200 UTC dataset. The MOS forecasts that use new model data are most reliable (lowest values) while the PPM and UMO forecasts are less reliable, especially for afternoon and evening valid times, when convective activity is normally at a daily maximum. The relative unreliability of the UMO forecasts is possibly an effect of the change in the GEM model convective scheme.

For the winter season, the full UMOS POP forecasts are superior to the PPM forecasts in terms of the Brier score for the first 24 h and only slightly better after 24 h (Fig. 11). The improvement over the UMO forecasts is smaller, especially for day 1, which supports the finding of Bélair et al. (2000) that the effect of the change of convection scheme was neutral in winter. Perhaps surprisingly, the full UMOS forecasts are less reliable at some of the forecast projections than the PPM forecasts (Fig. 12), especially 6 and 30 h. Since reliability is a component of the Brier score, this is an apparent contradiction, which can be resolved by looking at the reliability table (Fig. 13) for the 6-h forecasts. The figure indicates that the UMOS forecasts are overresolved. That is, probabilities above the climatological frequency are underforecast and probabilities below the climatological frequency are overforecast. Coupled with the higher weights assigned to the extreme bins of the probability distribution because of the greater sharpness of the UMOS forecasts, the benefits of the extra resolution outweigh the slight penalty in reliability, and this leads to a lower Brier score. At 48 h, both the overresolution and the sharpness are reduced for UMOS (Fig. 14); both UMOS and PPM are quite reliable overall and so have similar Brier scores. Compared to the corresponding summertime result, the reliability of both is much greater at higher probability levels, probably because the greater dominance of larger-scale precipitation systems in winter reduces the representativeness problems of the observation data.

c. Wind

Figure 15 shows the independent sample bias for wind speed for forecasts from the summer 0000 UTC run. Both DMO and PPM forecasts show an underforecasting bias that is strongest near the maximum temperature time, when surface winds are more strongly coupled to free atmosphere winds. Overnight, the underforecasting bias is probably partially offset by the model's tendency to underestimate the strength of the nocturnal inversion and, therefore, to overestimate surface winds. All of the MOS forecasts have essentially eliminated this bias. In this case, the full UMOS run does not further improve the bias characteristics over those of the UMO forecasts. This is consistent with the fact that wind prediction relies heavily on gradients, which are computed over a constant 50-km grid distance regardless of model resolution. Thus, the model resolution change may have had less of an impact on wind speed forecasts than on other variables. This result may also suggest that the statistical characteristics of the DMO wind speed did not change significantly from the old model to the new model.

The wind speed forecasts were also evaluated in terms of contingency tables. These made possible a convenient comparison of the forecast and observed distributions. Six-category tables were used, with thresholds at 8.96, 15, 25, 40, and 60 km h−1. The first threshold defines “light winds,” which are not mentioned in a forecast, and the upper two thresholds are for wind advisories. Complete tables are shown only for the 18-h PPM and UMOS forecasts in Tables 2 and 3; the DMO table is similar to the PPM one, and all of the three UMOS versions are similar to each other. The tables show the tendency of the PPM forecasts to underestimate the wind speed, although there is an attempt to forecast the highest category on three occasions. The MOS forecasts exhibit the characteristic “central tendency” of MOS, overpredicting the central category and underpredicting both extreme categories. These differences in the predicted and observed frequency distributions are more clearly depicted by the frequency bias (Fig. 16). Postprocessing methods can be used to remedy both types of errors, for example, some sort of bias correction in the case of the PPM forecasts, and inflation (Glahn 1970) for the MOS forecasts. Both of these methods would tend to lower the percentage correct.
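
As a sketch of the categorical wind speed verification described here and scored in the next two paragraphs (not the operational code), the six-category table and the derived measures can be computed as follows, using the thresholds given above.

```python
import numpy as np

THRESHOLDS = [8.96, 15.0, 25.0, 40.0, 60.0]   # km/h; define six speed categories

def wind_speed_scores(forecast, observed, thresholds=THRESHOLDS):
    """Six-category contingency table plus frequency bias, percent correct,
    and the Heidke skill score (skill relative to chance)."""
    f_cat = np.digitize(forecast, thresholds)
    o_cat = np.digitize(observed, thresholds)
    n_cat = len(thresholds) + 1
    table = np.zeros((n_cat, n_cat), dtype=int)   # rows: forecast, cols: observed
    for fc, oc in zip(f_cat, o_cat):
        table[fc, oc] += 1

    n = table.sum()
    # Frequency bias per category (forecast count / observed count); categories
    # that are never observed would need special handling in practice.
    freq_bias = table.sum(axis=1) / table.sum(axis=0)
    percent_correct = 100.0 * np.trace(table) / n
    # Heidke: proportion correct relative to the proportion expected by chance.
    chance = (table.sum(axis=1) * table.sum(axis=0)).sum() / n ** 2
    heidke = (np.trace(table) / n - chance) / (1.0 - chance)
    return table, freq_bias, percent_correct, heidke
```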

Percentage correct for the 6- and 18-h forecasts is shown in Fig. 17. The three sets of MOS forecasts improve considerably over both the PPM and the DMO at both forecast projections shown here, and at other projections as well (not shown).

The Heidke skill scores relative to chance (Fig. 18) are generally consistent with the above, and depict the variations in skill over the 48-h forecast period for all five forecast sets. Consistent with the temperature results, the three MOS forecasts outperform the DMO and PPM forecasts, and the full UMOS improves slightly over the other MOS forecasts. It is worth noting that the extra skill of UMOS is negligible compared to UMB after day 1, that the improvement over PPM and the DMO decreases after day 1, and that the improvement of UMB and UMOS over UMO decreases almost to 0 by 48 h. These three results together suggest that the additional resolution of the new model predictors loses its value rather quickly, as was noted for the POP forecasts. Similar results were obtained for the other three independent datasets (not shown). In fact, for the winter 1200 UTC run, the DMO showed slightly better skill than all of the MOS forecasts at 48 h.

Results for wind direction were more mixed. In the summer, there was little difference in the RV among the three MOS forecasts, and the DMO was just as skillful. The PPM forecasts were not at all competitive with the others, in either season. This is not surprising, because the PPM wind forecasts were tuned to a much lower resolution model that was operational in the late 1980s, and have not been retuned since then. As shown in Fig. 19, the winter MOS forecasts achieved higher RV than the DMO and PPM forecasts. However, the UMO forecasts were slightly superior to the UMB and UMOS forecasts in terms of RV. This is the only instance where the forecasts based on a MOS sample from a single model outperformed the forecasts based on blended samples. This is consistent with WV02 where it was noted that the only evidence of some difficulty with the fit of the regression equations occurred with winds on blended samples. This was attributed to the change in model resolution.

4. Summary and discussion

We have carried out a comprehensive evaluation of forecasts of surface temperature, POP, and wind from the new Canadian UMOS system, comparing their performance with older perfect prog forecasts and the model's estimates of the three predictands. The evaluation had three overlapping purposes: 1) The evaluation of the new UMOS forecast system against the operational PPM system using independent datasets; 2) the comparison of the PPM forecasts and the UMOS forecasts in the context of high-resolution GEM model–derived predictors; and 3) exploration of the performance characteristics of UMOS using developmental samples blended from two versions of the GEM model. The evaluation used four different independent samples of predictors from the Canadian operational GEM model, along with station observations from about 200 Canadian stations. The evaluation was carried out using a variety of summary measures and diagnostic performance measures appropriate to the form of the predictands.

The results indicate that the UMOS forecasts are clearly superior to the DMO forecasts and generally better than the PPM forecasts. For temperature, the improvement of UMOS over PPM is significant at all projections, but for precipitation and wind, the improvement is noticeable only for the day 1 forecasts. For temperature also, at least half of the improvement seems to come from the use of MOS. Even MOS forecasts from equations developed on the old version of the model were superior to the PPM forecasts and DMO. This result might have been anticipated, given that the PPM equations are in fact quite old, and based on relatively low-resolution analysis data. For all three predictands, there is evidence that the use of recent data to update the MOS equations is of benefit. There is also little evidence that the blending of developmental samples has degraded the forecast accuracy during the model transition period after a model change. MOS equations based on a single model version (UMO) performed better than the equations developed from blended model data only for wind direction, but even in that case, the UMOS forecasts outperformed the PPM forecasts.

For temperature, the improvements over PPM and DMO remain relatively constant over all projections. However, for wind and precipitation, both of which contain more small-scale and local variability, the improvement over the PPM forecasts decreases with increasing projection. Since an important advantage of the MOS equations over the simpler PPM equations is the access to high-resolution model variables, this is a somewhat worrisome result because it suggests that these higher-resolution variables lose their usefulness as predictors by day 2. The UMOS system does no preprocessing of predictors; all model variables are entered into the equations at full resolution. Further, differenced predictors are also computed at relatively high resolution. If the higher-resolution components in the model predictors lose their accuracy quickly, then the MOS development responds either by shunning those predictors or entering them with lower weight, due to the increasing error variance. If the error variance levels in all the MOS predictors rise quickly with increasing projection because of small-scale noise, then the MOS equations will become conservative more quickly. This raises the question of whether the performance of the UMOS equations could be improved by preprocessing the model variables to filter out the noisy smallest scales. Such preprocessing would allow a better fit to predictors that contain only those scales that can be predicted reasonably accurately by the model. This is a question for another study, but it is an important one if MOS techniques are to be applied to mesoscale or high-resolution regional models.
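
Purely to illustrate the kind of preprocessing raised as a question here (it is not part of UMOS), a gridded model predictor could be low-pass filtered before interpolation and equation development, for example with a Gaussian kernel whose width reflects the smallest scales judged predictable; the field and the smoothing scale below are hypothetical.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

field = np.random.rand(120, 200)              # stand-in for a gridded model predictor
smoothed = gaussian_filter(field, sigma=2.0)  # sigma in grid lengths; a tuning choice
```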

Finally, the verification results for precipitation and, to a lesser extent, for wind are limited by the sampling limitations of point observations. There is evidence of these effects in the summer season, when small-scale convective processes dominate. As models become capable of simulating spatial variations on scales too small to be depicted by station networks, it is imperative that data from other sources such as radar, satellite, and lightning detection networks be used in verification of high-resolution model output and statistical guidance products derived from that output.

Acknowledgments

The authors thank Mr. Franco Petrucci of CMC for his help with the preparation of improved predictor data for the POP forecasts. We also thank two anonymous reviewers for their helpful comments on an earlier version of the paper.

REFERENCES

Bélair, S., Méthot, A., Mailhot, J., Bilodeau, B., Patoine, A., Pellerin, G., and Coté, J., 2000: Operational implementation of the Fritsch–Chappell convective scheme in the 24-km Canadian regional model. Wea. Forecasting, 15, 257–274.

Brunet, N., 1987: Development of a perfect prog system for spot time temperature forecasts. CMC Tech. Doc. 30, 55 pp. [Available from Environment Canada, 2121 Route Transcanadienne, Dorval, QC H9P 1J3, Canada.]

Brunet, N., Verret, R., and Yacowar, N., 1988: An objective comparison of model output statistics and perfect prog systems in producing numerical weather element forecasts. Wea. Forecasting, 3, 273–283.

Dallavalle, J. P., 1988: An evaluation of techniques used by the National Weather Service to produce objective maximum/minimum temperature forecasts. Preprints, Eighth Conf. on Numerical Weather Prediction, Baltimore, MD, Amer. Meteor. Soc., 572–579.

Glahn, H. R., 1970: A method for predicting surface winds. ESSA Tech. Memo. WBTM TDL 29, U.S. Department of Commerce, Washington, DC, 18 pp.

Glahn, H. R., and Lowry, D. A., 1972: The use of model output statistics (MOS) in objective weather forecasting. J. Appl. Meteor., 11, 1203–1211.

Glahn, H. R., and Dallavalle, J. P., 2000: MOS-2000. TDL Office Note 00-1, NOAA/Techniques Development Laboratory, 169 pp.

Klein, W. H., and Lewis, F., 1970: Computer forecasts of maximum and minimum temperatures. J. Appl. Meteor., 9, 350–359.

Murphy, A. H., 1973: A new vector partition of the probability score. J. Appl. Meteor., 12, 595–600.

Murphy, A. H., 1993: What is a good forecast? An essay on the nature of goodness in weather forecasting. Wea. Forecasting, 8, 281–293.

Sarrazin, R., 1989: Development of a tuned perfect prog surface wind forecasting system. ARD Research Rep. MSRB 89-6, 56 pp. [Available from Environment Canada, 2121 Route Transcanadienne, Dorval, QC H9P 1J3, Canada.]

Sarrazin, R., and Wilson, L. J., 1989: A tuned perfect prog wind forecast system for Canada. Preprints, 11th Conf. on Probability and Statistics in Atmospheric Sciences, Monterey, CA, Amer. Meteor. Soc., 50–55.

Stanski, H. R., Wilson, L. J., and Burrows, W. R., 1990: Survey of common verification methods in meteorology. WMO World Weather Watch Tech. Rep. 8, 115 pp.

Verret, R., 1987: Development of a perfect prog system to forecast probability of precipitation and sky cover. CMC Tech. Doc. 29. [Available from Environment Canada, 2121 Route Transcanadienne, Dorval, QC H9P 1J3, Canada.]

Verret, R., 1988: Post-processing of statistical weather element forecasts. CMC Monthly Review, Vol. 7, No. 5, 2–16. [Available from Environment Canada, 2121 Route Transcanadienne, Dorval, QC H9P 1J3, Canada.]

Wilson, L. J., and Vallée, M., 2002: The Canadian Updateable Model Output Statistics (UMOS) system: Design and development tests. Wea. Forecasting, 17, 206–222.

APPENDIX A

Description of the Temperature Anomaly Reduction Used in the PPM 3-h Spot Temperature Forecasts

The anomaly reduction for temperature forecasts applies after day 1 and is given by [reduction in percent = (forecast projection in h − 24) × (absolute value of the anomaly)]. This amounts to a simple linear reduction of the forecast anomaly as a function of projection time, and is intended to compensate for the tendency of perfect prog forecasts to retain their sharpness, and therefore to lose reliability, as model skill decreases with increasing projection. Such anomaly reduction schemes would be expected to lead to better scores, especially those obtained from quadratic scoring rules such as the rmse, and would cause the forecasts to mimic some of the characteristics of MOS. In contrast, the anomaly reduction effect of MOS arises naturally as a consequence of the decreasing confidence in the model predictors, as expressed by the variance explained by the MOS equations.
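Written symbolically (the symbols t for the forecast projection in hours and A for the forecast temperature anomaly are introduced here only for clarity and are not part of the original specification), the rule stated above is
$$r = (t - 24)\,\lvert A\rvert \ \text{percent}, \qquad t > 24.$$
Under the natural reading that the issued anomaly is the model-derived anomaly scaled by (1 − r/100), a 36-h forecast with a 2°C anomaly would, for example, be reduced by (36 − 24) × 2 = 24%, to about 1.5°C.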

APPENDIX B

Verification Measures Used in This Study

For the two continuous predictands, temperature and wind, we used the bias or mean error and the reduction of variance to assess the forecasts. The bias is given by
$$\mathrm{bias} = \frac{1}{N}\sum_{i=1}^{N} \left(f_i - x_i\right)$$
for a sample of size N, where fi is the forecast value and xi the observed value. Negative values indicate underforecasting and positive values overforecasting. The reduction of variance (RV) is a skill score, where the standard is the climatology of the independent sample. It is given by
$$\mathrm{RV} = 1 - \frac{\sum_{i=1}^{N} \left(f_i - x_i\right)^2}{\sum_{i=1}^{N} \left(x_i - \bar{x}\right)^2},$$
where $\bar{x}$ is the sample mean observation.
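As a reproducibility aid, a minimal Python/NumPy sketch of these two measures is given below; the function names and example numbers are ours and are not part of the UMOS system.

import numpy as np

def bias(f, x):
    # Mean error: positive values indicate overforecasting, negative underforecasting.
    f, x = np.asarray(f, dtype=float), np.asarray(x, dtype=float)
    return np.mean(f - x)

def reduction_of_variance(f, x):
    # Skill score with the independent-sample climatology (mean observation) as
    # the standard: 1 is a perfect forecast, 0 is no better than climatology.
    f, x = np.asarray(f, dtype=float), np.asarray(x, dtype=float)
    return 1.0 - np.sum((f - x) ** 2) / np.sum((x - x.mean()) ** 2)

# Example with three forecast-observation pairs (deg C):
f = [2.0, -1.0, 5.0]
x = [1.0, -2.0, 6.5]
print(bias(f, x))                   # about 0.17 (slight overforecasting)
print(reduction_of_variance(f, x))  # about 0.89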
For wind speed, evaluation was also carried out in terms of contingency tables and their associated scores. For a K-category contingency table where the columns represent the forecast category and the rows the observed category, the (i, j)th cell of the table contains the total number of times the ith category was observed when the jth category was forecast, $n_{ij}$. The sums of the rows and columns give the total observations and forecasts of each category, respectively:
$$N_{x_i} = \sum_{j=1}^{K} n_{ij}, \qquad N_{f_j} = \sum_{i=1}^{K} n_{ij}.$$
Then, the percentage correct (PC) is given by
$$\mathrm{PC} = \frac{100}{N}\sum_{k=1}^{K} n_{kk},$$
where $N = \sum_{k=1}^{K} N_{x_k} = \sum_{k=1}^{K} N_{f_k}$ is the sample size. The frequency bias is given by $N_{f_k}/N_{x_k}$, $k = 1, \ldots, K$. A perfect bias in this context is a value of 1, which indicates that the forecast distribution matches the observation distribution. The Heidke skill score (HSS) expresses the percentage improvement over a standard forecast and is in the usual format for skill scores. We use chance as the standard forecast, but the skill score can be computed against other standards if the necessary additional information is available. The score is of the form
$$\mathrm{HSS} = \frac{\sum_{k=1}^{K} n_{kk} - \frac{1}{N}\sum_{k=1}^{K} N_{x_k} N_{f_k}}{N - \frac{1}{N}\sum_{k=1}^{K} N_{x_k} N_{f_k}} \times 100.$$
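The contingency-table measures above can be computed with the short Python/NumPy sketch below; the array convention follows the text (rows are observed categories, columns are forecast categories), and the function name and example table are ours.

import numpy as np

def contingency_scores(table):
    # table[i, j] = number of times category i was observed when category j was forecast.
    n = np.asarray(table, dtype=float)
    N = n.sum()
    n_obs = n.sum(axis=1)    # N_x_k: row sums (total observed in each category)
    n_fcst = n.sum(axis=0)   # N_f_k: column sums (total forecast in each category)
    hits = np.trace(n)       # sum of the diagonal cells n_kk
    pc = 100.0 * hits / N                              # percentage correct
    freq_bias = n_fcst / n_obs                         # frequency bias per category (1 is perfect)
    expected = np.sum(n_obs * n_fcst) / N              # correct forecasts expected by chance
    hss = 100.0 * (hits - expected) / (N - expected)   # Heidke skill score relative to chance
    return pc, freq_bias, hss

# Example: a 3-category table (rows observed, columns forecast)
pc, freq_bias, hss = contingency_scores([[30, 10, 2],
                                         [8, 25, 7],
                                         [2, 6, 20]])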
For POP verification, we have used equivalent measures, but in the form appropriate for probability forecasts of categorical events. Precipitation occurrence is a dichotomous event; thus, the observation value is either 1 or 0 according to whether precipitation occurred or not. The Brier score (PS) measures the mean squared error of the probability forecasts:
$$\mathrm{PS} = \frac{1}{N}\sum_{i=1}^{N} \left(p_i - x_i\right)^2,$$
where $p_i$ is the forecast probability and $x_i$ the binary (0 or 1) observation.
In this form, the score is negatively oriented (lower is better) and the range is 0 to 1. Murphy (1973) proposed a partitioning of the Brier score into three components,
$$\mathrm{PS} = \frac{1}{N}\sum_{k=1}^{T} N_k\left(p_k - \bar{x}_k\right)^2 - \frac{1}{N}\sum_{k=1}^{T} N_k\left(\bar{x}_k - \bar{x}\right)^2 + \bar{x}\left(1 - \bar{x}\right),$$
where $N_k$ is the number of forecasts in the kth subsample and $p_k$ is the forecast probability associated with that subsample.
Here, the N events of the sample have been partitioned (stratified) into T subsamples according to the forecast probability. For the analysis in this paper, T = 10. The overbars indicate averages, over the kth subsample if there is a subscript and over the whole sample if not. The three terms on the right-hand side represent the reliability, resolution, and uncertainty components of the score, respectively. The graphs of reliability shown in this paper are computed using the first term of this equation. Graphically, this represents the mean squared vertical distance from the plotted points to the 45° line on the reliability graphs shown. The resolution component, though not shown quantitatively in this paper, can be visualized from the reliability graph as the mean squared vertical distance from the plotted points to the horizontal line representing the sample average frequency of precipitation ($\bar{x}$). The uncertainty term is the variance of the observations in the sample and has nothing to do with the forecast. This term varies from one verification sample to another, which makes it difficult to compare Brier scores across different samples. All the comparisons reported in this paper use the same sample. Finally, “sharpness” is depicted on the reliability table by the distribution of forecast probabilities, and can be expressed quantitatively by the variance of the forecasts. Unlike resolution, it does not depend on the observations. Further details on the characteristics of all these verification scores can be found in Stanski et al. (1990).
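A minimal Python/NumPy sketch of the Brier score and its Murphy (1973) partition is given below, stratifying the sample into T = 10 probability bins as in this paper; the function names and example data are ours, and binning by probability intervals is one possible implementation of the stratification described above.

import numpy as np

def brier_score(p, o):
    # Half Brier score: mean squared error of the probability forecasts (0 is perfect).
    p, o = np.asarray(p, dtype=float), np.asarray(o, dtype=float)
    return np.mean((p - o) ** 2)

def brier_partition(p, o, n_bins=10):
    # Reliability, resolution, and uncertainty terms of the Brier score, with the
    # sample stratified into n_bins subsamples by forecast probability.
    # PS = reliability - resolution + uncertainty holds exactly when each subsample
    # contains a single forecast probability value (e.g., forecasts issued in deciles).
    p, o = np.asarray(p, dtype=float), np.asarray(o, dtype=float)
    N = len(p)
    obar = o.mean()                     # overall observed relative frequency
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    reliability = resolution = 0.0
    for k in range(n_bins):
        lo, hi = edges[k], edges[k + 1]
        in_bin = (p >= lo) & (p <= hi) if k == n_bins - 1 else (p >= lo) & (p < hi)
        Nk = in_bin.sum()
        if Nk == 0:
            continue
        pk = p[in_bin].mean()           # mean forecast probability in subsample k
        obar_k = o[in_bin].mean()       # observed relative frequency in subsample k
        reliability += Nk * (pk - obar_k) ** 2 / N
        resolution += Nk * (obar_k - obar) ** 2 / N
    uncertainty = obar * (1.0 - obar)
    return reliability, resolution, uncertainty

# Example: probability forecasts and 0/1 precipitation occurrences
p = [0.1, 0.4, 0.7, 0.9, 0.2, 0.6]
o = [0, 0, 1, 1, 0, 1]
print(brier_score(p, o))
print(brier_partition(p, o))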

Fig. 1. Scatterplot of (top) 6-h and (bottom) 30-h PPM temperature forecasts vs observations for Jan–Mar 2000, Toronto Pearson International Airport. Observation data are from METARs.

Fig. 2. Schematic diagram showing the dependent and independent datasets used in this evaluation. Dark portions of shaded bars indicate summer-season data and lighter portions indicate winter-season data. The open bars refer to the periods for which old and new model data were available. Tick marks on the axis represent the beginning of each year. See text for further details.

Fig. 3. Bias of winter-season temperature forecasts issued from the 1200 UTC run, as a function of projection time, for the five sets of forecasts. Sample size is about 17 000 events for each, from 205 Canadian stations.

Fig. 4. Percent reduction of variance (RV) with respect to the independent sample mean for the same dataset as in Fig. 3.

Fig. 5. Summary of reduction of independent sample variance (RV) over all four datasets for (a) 6- and (b) 48-h temperature forecasts. Summer sample of about 22 400 cases over 210 stations and winter sample of about 17 000 cases over 205 stations, respectively.

Fig. 6. Brier score as a function of projection time for the five sets of POP forecasts issued from the 1200 UTC run, summer season. Sample size is about 18 000 cases over 170 stations.

Fig. 7. Reliability table for 0–6-h PPM, UMOS, and DMO POP forecasts from the 1200 UTC run, summer season. Frequency of use of the 10 forecast probability deciles is shown in the inset (upper left) and by means of the figures above the plotted points, with UMOS frequency above and PPM frequency below. The categorical DMO forecasts are shown by the plotted points in the two extreme deciles. The independent sample climatological frequency is indicated by the horizontal dashed line.

Fig. 8. Same as in Fig. 7 but for the 42–48-h POP forecasts.

Fig. 9. Same as in Fig. 8 but for the 42–48-h UMO, PPM, and DMO POP forecasts. Forecast probability frequencies are shown with UMO above and PPM below.

Fig. 10. Reliability component of the Brier score as a function of projection time for the five sets of POP forecasts issued from the 1200 UTC run, summer season. Sample size is about 18 000 cases over 170 stations.

Fig. 11. Brier score as a function of projection time for the five sets of POP forecasts issued from the 0000 UTC run, winter season. Sample size is about 13 000 cases over 168 Canadian stations.

Fig. 12. Same as in Fig. 11 but for the reliability component of the Brier score.

Fig. 13. Reliability table for 0–6-h PPM, UMOS, and DMO POP forecasts from the 0000 UTC run, winter season. Format as in Fig. 7.

Fig. 14. Same as in Fig. 13 but for 42–48-h POP forecasts.

Fig. 15. Bias of summer-season wind speed forecasts issued from 0000 UTC, as a function of projection time, for the five sets of forecasts. Sample size is about 18 000 cases over 155 stations.

Fig. 16. Frequency bias of 18-h UMOS and PPM wind speed forecasts, from the data in Tables 2 and 3.

Fig. 17. Percent correct for (left) 6- and (right) 18-h wind speed forecasts, based on six-category forecast–observed contingency tables, for summer-season forecasts issued from 0000 UTC.

Fig. 18. Heidke skill score (relative to chance) for summer-season wind speed forecasts issued from 0000 UTC, as a function of projection time, based on six-category contingency tables.

Fig. 19. Reduction of independent sample variance (RV) as a function of projection time for wind direction forecasts, winter-season run from 1200 UTC. Sample size is about 8500 cases for which wind speed was forecast and observed over 8.96 km h⁻¹, for 145 stations.

Table 1. Characteristics of statistical forecasts used in the comparison.

Table 2. Contingency table for 18-h wind speed forecasts, PPM, summer season.

Table 3. Contingency table for 18-h wind speed forecasts, UMOS, summer season.