Abstract

The objective consensus forecasting (OCF) system is an automated operational forecasting system that adapts to underlying numerical model upgrades within 30 days and generally outperforms direct model output (DMO) and model output statistics (MOS) forecasts. It combines routinely available DMO and MOS guidance, after bias correction, using a mean absolute error (MAE)-weighted average algorithm.

OCF generates twice-daily forecasts of screen-level temperature maxima and minima, ground-level temperature minima, evaporation, sunshine hours, and rainfall and its probability for day 0 to day 6 for up to 600 Australian sites.

Extensive real-time trials of temperature forecasts yielded MAEs at days 0–2 about 40% lower than those of the component MOS and DMO forecasts. At day 1, MAEs were 8% and 10% lower than those of matching official forecasts of maxima and minima, respectively, and OCF outperformed the official forecasts at over 71% and 75% of sites, respectively. Weighted average consensus forecasts outperformed simple average forecasts by about 5% in MAE.

1. Introduction

Acceleration in computing and remote sensing capabilities facilitates more frequent upgrades to data assimilation and numerical weather prediction systems, resulting in improved weather forecasts from direct model output (DMO). A downside of this progress is a corresponding deterioration of the widely used and successful model output statistics (MOS) forecasts (Glahn and Lowry 1972), because MOS cannot easily accommodate new sites, new models, and major upgrades to existing models. The increasing frequency of numerical and observational system changes suggests that the importance of DMOs will increase relative to MOS. Hence, it is important that modern automated weather forecasting systems include DMO components.

A key similarity between DMO (Stensrud and Skindlov 1996) and degraded MOS (Wilson and Vallée 2002) forecasts is the presence of location-dependent, systematic biases. Once biases are removed, both schemes provide close to operational quality forecasts.

Clemen (1989) surveyed over 200 papers drawn from economics, management, psychology, medicine, and meteorology. He concluded that

  1. accuracy can be substantially improved through the combination of multiple forecasts,

  2. simple combinations often work better than more complex methods, and

  3. combining forecasts should become part of mainstream practice.

The objective consensus forecasting (OCF) system employs bias correction of both multimodel DMO and MOS component forecasts followed by consensus merging. It retains the flexibility of DMOs in generating forecasts for new sites and models and in quickly exploiting observational and numerical system enhancements yet still benefits from available bias-corrected MOS forecasts. Improvement in OCF accuracy as models improve is achieved by weighting bias-corrected component DMO forecasts according to their recent accuracy while constraining the sum of weights to one.

OCFs are issued at 0400 and 1600 UTC for screen-level maximum and minimum temperature, ground minimum temperature, hours of sunshine, daily evaporation, rainfall, and the probability of rain exceeding 0.2 mm. Forecast projections extend from the local day of issue (day 0) to day 6. Screen temperature and rainfall forecasts cover over 600 sites (see Table 1 and Fig. 1 for distribution details).

Table 1. Approximate number of daily weather element observations.
Fig. 1. Distribution of sites for which temperature is forecast, together with the area of coverage of the LAPS 050 model over SE Australia (203 sites).

The 0400 UTC screen temperature forecast issue is the main result reported upon here, as it has wider coverage than the forecasts for other elements and can be readily compared to corresponding official day 1 forecasts using the Bureau of Meteorology's standard forecast verification system. Day 1 forecasts cover next-morning minima (a lead time of about 16–19 h under normal conditions) and afternoon maxima (26–29-h lead time). Rainfall forecasts cannot be processed using the OCF strategy because of rainfall's discontinuous occurrence, so their discussion is reserved for section 6f.

The following conventions are used to refer to individual forecast systems. DMO and MOS forecasts derived from a 0000 (1200) UTC model run receive the suffix _00 (_12) where necessary. DMOs are referenced using two uppercase alphanumerics drawn from their parent model names. Where the same model is available at different resolutions, the second alphanumeric indicates the resolution. Thus DMOs from the Limited Area Prediction System (LAPS; Puri et al. 1998) at 37.5-km resolution run at 0000 UTC are denoted L3_00. MOS forecasts include an M as a third alphanumeric (e.g., MOS from L3_00 is L3M_00).

Section 2 describes MOS and DMO components of OCF, observations, and weather elements. Sections 3 and 4 describe the development of bias-correction and compositing algorithms, respectively. Operational aspects are described in section 5 and the main results are presented in section 6. Finally, section 7 provides a summary discussion and expected future developments.

2. Data

a. Numerical models

LAPS is run at approximately 37.5-, 12.5-, and 5.0-km resolution, generating the L3, L1, and L0 DMO forecasts, respectively. LAPS 37.5-km-resolution fields also underpin the Australian regional MOS forecasts (L3M). The corresponding GA and GAM forecasts are generated from the Australian Global Assimilation and Prognosis model (GASP; Seaman et al. 1995; Bourke et al. 1995).

The ease with which DMO forecasts can be generated allowed the Met Office's global circulation model and the National Centers for Environmental Prediction's (NCEP's) Global Forecast System DMOs to be included in OCF. They are derived from low spatiotemporal resolution fields routinely received via the Global Telecommunication System (GTS) and are referred to here as UK and US, respectively. They contribute to OCF's resilience to single-component model changes and add new information through their low cross correlations (≈0.5) with the local DMOs, compared to local-to-local cross correlations of ≈0.9.

Details of the models underpinning OCF are provided in Table 2. They exhibit a range of spatial and temporal resolutions, analysis schemes, types (grid point or spectral), parameterizations, and cross correlations that are unachievable from single-model ensembles. Component forecasts included in OCF vary with weather element, projection, and, under operational conditions, availability.

Table 2. Forecast schemes contributing to the 0000 UTC issue of OCF.

b. Observational data

High-quality observations are key elements of OCFs because they are used to update bias correction and weights applied to contributing forecasts. The daily weather elements in Table 1 are collected at 0000 UTC each day and contain values for the previous 24 h. Figure 1 shows the locations of daily screen temperature reports. Ground minima, sunshine hours, and evaporation observations have similar but less dense distributions. Safeguards against large observational error impacts are built into the bias-correction algorithm described in section 3.

c. MOS forecasts

MOS often employs over a million predictive equations that require 2–4 yr of stable developmental data for their derivation (Jacks et al. 1990). Viability of operational MOS relies upon its developmental conditions continuing. In the current climate of rapidly evolving models and observing systems, this assumption rarely holds.

The Australian MOS equations, derived using the method described in Woodcock (1984), are updated biannually and provide predictions for the elements in Table 1 and sites in Fig. 1 from L3M for days 0–2 and from GAM for days 1–6.

Underlying model changes in resolution and soil moisture parameterization introduced large biases into summer and autumn MOS maximum temperature forecasts. Wilson and Vallée (2002) and Erickson et al. (2002) reported similar MOS bias errors induced by model upgrades.

d. DMO forecasts

In OCF, the DMO forecast at a specific site is spatially interpolated from the model's nearest-layer grid values, without any attempt to accommodate differences between gridpoint and site elevations or underlying surface variations such as land versus sea. Time series interpolation, or nearest-in-time values, are then used to derive the final DMO forecasts. The Australian models' DMOs suffer less interpolation-generated bias than the routinely available UK and US DMOs because of their higher spatiotemporal resolution.
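As an illustration only: the text specifies interpolation from nearest-layer grid values but not the scheme itself, so the bilinear interpolation, the function name, and the ascending-coordinate grids in this sketch are assumptions rather than a description of the operational code.

```python
import numpy as np

def dmo_at_site(field, lats, lons, site_lat, site_lon):
    """Interpolate a 2D model field (nearest model layer) to a site location.

    Bilinear interpolation; lats and lons are assumed to be ascending 1D grid
    coordinates. No adjustment is made for gridpoint-vs-site elevation or
    land-sea differences, so the result inherits any representativeness bias.
    """
    i = np.searchsorted(lats, site_lat) - 1   # grid row just south of the site
    j = np.searchsorted(lons, site_lon) - 1   # grid column just west of the site
    wy = (site_lat - lats[i]) / (lats[i + 1] - lats[i])
    wx = (site_lon - lons[j]) / (lons[j + 1] - lons[j])
    return ((1 - wy) * (1 - wx) * field[i, j]
            + (1 - wy) * wx * field[i, j + 1]
            + wy * (1 - wx) * field[i + 1, j]
            + wy * wx * field[i + 1, j + 1])
```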

Stensrud and Skindlov (1996) and Mao et al. (1999, their Fig. 3) showed that bias was a major source of DMO error. Apart from interpolation-introduced bias, Stensrud and Skindlov (1996), Mao et al. (1999), Fritsch et al. (2000), and Mass (2003) have attributed DMO bias to deficiencies in model physics, parameterization, computational resolution, and topographical resolution.

3. Bias correction

Because bias is a major, and easily removed, component of both DMO and MOS forecast error, in the scheme adopted here bias correction is undertaken prior to consensus compositing, as recommended by Fritsch et al. (2000). Partitioning bias correction and compositing, rather than combining them into a single method such as regression, is preferred because (i) it is relatively easy to build a robust bias-correction algorithm that withstands missing data and/or extreme errors, (ii) modularity permits more detailed analysis of, and experimentation with, the underlying processes, and (iii) after bias correction, common error statistics such as mean absolute error (MAE) and rmse clearly relate to the amplitude of errors, undistorted by bias.

The objective of bias correction is to minimize the error of the next forecast using bias from past errors. A short learning period enables an updating system to respond quickly to model changes but increases vulnerability to missing data and/or large errors and other limitations of small sample estimates. The interrelated issues to be resolved in the development of a bias-correction scheme concern (i) the parameter representing the center of the distribution in the historical data, (ii) the bias-correction method, and (iii) the historical sample size.

The methodology in OCF was adopted after experiments with daily forecasts from L1, L3, and L3M for day 1 and day 2 screen maximum and minimum temperatures at about 600 sites for 3 months in 2001. Algorithms were applied to each site, element, and projection. The effects were assessed from corresponding forecast element and projection statistics averaged over all sites and so involved over 500 000 forecasts.

Along with reliable statistics, this approach provided good tests of algorithm efficiency and robustness. Subsequently the adopted methodology was very effective in correcting GA MOS and DMOs, UK and US DMOs, and for evaporation, ground minima, and hours of sunshine forecasts.

a. Center of distribution of the historical sample

It is our intent to apply a common algorithm to a relatively short history of both DMO and MOS forecasts for a range of weather elements whose sample distributions may not be normal. The sample average is sensitive to extreme values, which most likely result from unusual events such as poor forecasts and/or incorrect observations. Similarly, the median, depending as it does on a single member of the sample, can also be misleading in small samples. The best easy systematic estimator (BES; Wonnacott and Wonnacott 1972, section 7.3) was adopted as the measure of the center of the historical sample in OCF:

$$\mathrm{BES} = \frac{Q_1 + 2Q_2 + Q_3}{4}, \quad (1)$$

where Q1, Q2, and Q3 are the first, second, and third quartiles, respectively. BES is robust with respect to extreme values while still representing the bulk of the sample.
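In code, BES is essentially a one-liner. The minimal sketch below assumes NumPy's default quartile interpolation, which may differ in detail from the scheme used operationally; the six-day error history is hypothetical.

```python
import numpy as np

def bes(sample):
    """Best easy systematic estimator: (Q1 + 2*Q2 + Q3) / 4."""
    q1, q2, q3 = np.percentile(sample, [25, 50, 75])
    return (q1 + 2.0 * q2 + q3) / 4.0

# One wild error barely moves BES compared with the sample mean.
errors = [0.4, -0.2, 0.1, 0.3, -0.1, 9.0]   # hypothetical error history (deg C)
print(round(np.mean(errors), 2), round(bes(errors), 2))   # 1.58 vs 0.18
```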

b. Bias-correction methodology

Numerous time series algorithms are available to address the problem of minimizing the bias of the next forecast based on historically learned bias. Of these, a running-mean bias correction, exponential smoothing (single, double, and adaptive), simple linear regression, and the Kalman filter were tested, as they are suitable for small samples. The final choice was a simple running-BES bias correction: it performed close to best in most comparisons, was simple to implement, and was robust.
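A minimal sketch of such a running-BES correction for one site, element, and projection follows; the simple skipping of missing observations here is a stand-in, not a description of the operational safeguards mentioned in section 3.

```python
import numpy as np

def bes(sample):
    q1, q2, q3 = np.percentile(sample, [25, 50, 75])
    return (q1 + 2.0 * q2 + q3) / 4.0

def bias_correct(raw, obs, window=30):
    """Subtract from each day's raw forecast the BES of the verifiable
    forecast-minus-observation errors over the preceding window of days."""
    raw = np.asarray(raw, dtype=float)
    obs = np.asarray(obs, dtype=float)
    corrected = np.empty_like(raw)
    for t in range(len(raw)):
        errors = raw[max(0, t - window):t] - obs[max(0, t - window):t]
        errors = errors[~np.isnan(errors)]     # tolerate missing observations
        corrected[t] = raw[t] - (bes(errors) if errors.size else 0.0)
    return corrected
```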

c. Bias-correction window

Sample sizes in the literature include 7 days (DMO; Stensrud and Skindlov 1996), 21 days (DMO; Mao et al. 1999), 30 days (DMO and MOS; Young 2002), and 30 days (MOS; Erickson et al. 2002).

The relationship between the number of days in the running bias-correction window and the improvement in the next day's forecast is shown in Fig. 2. The percentage of large errors in the next forecast decreased rapidly as the number of days included in the running mean increased, reaching an asymptotic value by 15–20 days. Rmse plots had similar profiles, suggesting that a common bias-correction window of more than 15 days could be applied to every projection of the temperature forecasts and to both DMO and MOS. However, when producing a forecast for a single site under operational conditions, additional tolerance for missing or faulty observations may be necessary, and so a 30-day bias-correction window was adopted.
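An experiment of this kind can be sketched as follows; a running-mean bias stands in for the running-BES correction, and the synthetic site (a seasonal cycle plus a constant +1.5°C model bias) is purely illustrative.

```python
import numpy as np

def pct_large_errors(raw, obs, window, threshold=2.5):
    """Percentage of bias-corrected next-day errors exceeding the threshold,
    for a given running bias-correction window length (in days)."""
    raw, obs = np.asarray(raw, float), np.asarray(obs, float)
    errs = [abs(raw[t] - np.mean(raw[t - window:t] - obs[t - window:t]) - obs[t])
            for t in range(window, len(raw))]
    return 100.0 * np.mean(np.array(errs) > threshold)

# One synthetic year of observations and biased raw forecasts.
rng = np.random.default_rng(2)
obs = 20 + 5 * np.sin(np.arange(365) / 58.0) + rng.normal(0, 1.5, 365)
raw = obs + 1.5 + rng.normal(0, 1.8, 365)
for w in (5, 10, 15, 20, 30):
    print(w, round(pct_large_errors(raw, obs, w), 1))  # roughly flattens by ~15-20
```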

Fig. 2. Percentage of errors greater than 2.5°C vs bias-correction window length, over approximately 600 sites per day for 3 months.

4. Combining forecasts

Gupta and Wilton (1987), following a review of the literature on combining forecasts, suggested the following desirable properties of the compositing method.

  1. It should not require large quantities of data for estimating weights.

  2. It should distinguish between better and poorer available candidate models with the distinction being made on precision (i.e., low MAEs or rmse’s) and redundancy (i.e., low cross correlations with other contributing forecasts).

  3. Derived weights should be intuitively meaningful. More precise and less redundant components should be given higher weight.

Commonly, meteorological forecast combination studies employ either linear regression or simple averages. Relatively few meteorological studies have compared methods of combination. Apart from brief mentions of the performance of linear regression compared to equal-weight averages in Brown and Murphy (1996) and Vislocky and Young (1989), the focus on optimizing forecast combination in the meteorological literature was very limited until the recent papers by Young (2002) and Gerding and Myers (2003). Between them, these studies compared multiple linear regression, partial least squares regression (e.g., Garthwaite 1994), nonnegative restricted least squares regression (e.g., Aksu and Gunter 1992), principal component regression (e.g., Jackson 1991), gradient of least descent (e.g., Forsythe et al. 1977), and adaptive data fusion based on gradient of least descent. Both studies demonstrated that forecasts from simpler consensus algorithms usually outperformed those from more complex algorithms and that consensus forecasts outperformed their component forecasts, in agreement with Clemen (1989).

a. Linear regression

Thompson (1977) showed that, for two series of independent forecasts, their linear regression combination has a lower rmse than either series alone. Since linear regression optimally weights components, it should outperform simple average forecasts, but in practice this is not guaranteed: Thompson's assumptions of stationary data and independence (cross-correlation coefficient = 0) of predictors cannot be satisfied in practice.

Despite a breach of Thompson's (1977) assumptions, several meteorological studies have successfully combined forecasts using linear regression. These include Woodcock and Southern (1983), Fraedrich and Leslie (1987), Vislocky and Young (1989), and Krishnamurti et al. (2000). However, when combining DMOs in OCF, some predictor cross correlations are very high (e.g., among L3, L1, and L0, ≈0.9), indicating high redundancy of information. In OCF, the small samples in combination with such large cross correlations pose a high risk of instability in the resultant forecasts (Ezekiel and Fox 1970; Winkler and Clemen 1992).

Principal component regression (e.g., Jackson 1991) or averaging highly cross-correlated predictions prior to regression can overcome the instability, but both Young (2002) and Gerding and Myers (2003) preferred simpler methods of combining forecasts than principal component regression.

A further major factor against using regression for combining forecasts under operational conditions is that some predictors (component forecasts) may be unavailable. A solution is therefore needed for every possible combination of component forecasts, significantly increasing the computational burden.

b. Equal weight averages

Many meteorological studies have used equal weight averaging for combining forecasts. Some better-known examples include Sanders (1973), Bosart (1975), Winkler et al. (1977), Vislocky and Fritsch (1995), Fritsch et al. (2000), and Ebert (2001). At first sight, equal weight averaging appears to be a poor algorithm because it assigns the same weight to all forecasts irrespective of their relative historical or theoretical merit; the best forecasts are thus contaminated by the worst. However, the strengths of equal weight averaging compared to linear regression are its computational efficiency, its robustness with respect to both missing data and high cross correlations among predictors, and its diminished dependence on stationary data.

c. Performance-weighted averages

The extensive quantitative data available in OCF prompted a comparison of equal- and performance-weighted averages prior to finalizing the system configuration. Experiments with real forecasts (Young 2002) and with synthetic stationary and nonstationary series with both high and low cross correlations (not shown) indicated that weighted average combinations based on the inverse of MAE or mean-square error usually outperformed equal-weighted average combinations.

If ai is the MAE over the last 30 days for the ith of n bias-corrected contributing forecast schemes, then it is assigned the weight wi, where

$$w_i = \frac{1/a_i}{\sum_{j=1}^{n} 1/a_j}. \quad (2)$$

Weights are lower (higher) for component forecasts with greater (smaller) MAEs over the last 30 days, and the sum of the weights is constrained to equal one. Hereafter, the weighting scheme described in Eq. (2) is referred to as optimum weighting (OW). OW accommodates Gupta and Wilton's (1987) requirements with respect to meaningful weights and precision.

The consensus forecast is simply generated from the bias-corrected component forecasts (fi) by

$$\mathrm{OCF} = \sum_{i=1}^{n} w_i f_i. \quad (3)$$

OW is very similar to the gradient descent algorithm used by Young (2002), except that Young used mean-square-error weights whereas OCF employs MAE weights; both measures generate similar results.
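In code, Eqs. (2) and (3) amount to only a few lines. The sketch below is a minimal Python rendering; the component forecasts and 30-day MAEs are hypothetical.

```python
import numpy as np

def ow_weights(maes):
    """Eq. (2): weights proportional to inverse 30-day MAE, summing to one."""
    inv = 1.0 / np.asarray(maes, dtype=float)
    return inv / inv.sum()

def consensus(forecasts, maes):
    """Eq. (3): MAE-weighted average of bias-corrected component forecasts."""
    return float(np.dot(ow_weights(maes), forecasts))

# Hypothetical day 1 maxima (deg C) from four bias-corrected components
# and their 30-day MAEs (deg C):
f = [31.2, 30.5, 32.0, 30.9]
a = [1.7, 1.5, 1.9, 1.6]
print(ow_weights(a))     # lowest-MAE component receives the largest weight
print(consensus(f, a))
```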

5. Operational procedure

The 30-day biases are computed between the observation cutoff for the 0000 UTC LAPS 37.5-km-resolution run and the completion of the subsequent L3, L1, L0, and L3M forecasts. They are computed daily, following Young (2002), who showed that mean-square errors of forecasts after daily updating were 5% smaller than those achieved using weekly updating. Biases are computed for all verifiable 0000 and 1200 UTC component forecasts. At the same time, using the same data, the corresponding 30-day, post-bias-correction MAEs are computed.

Equation (3) can be conveniently reconfigured as

$$\mathrm{OCF} = \sum_{i=1}^{n} w_i (f_i - b_i), \quad (4)$$

where fi is now the ith forecast before bias correction and bi is its 30-day bias.

As soon as a new fi is available, its bias correction (bi) is applied. Then, whenever OCFs are required, MAEs for the latest bias-corrected forecasts are accumulated, the appropriate weight is applied to each forecast, and the weighted forecasts are summed. Missing components are not problematic. Automatic accommodation of whatever forecasts are available is important, as it makes OW computationally very efficient and robust.
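A sketch of this step is below. The exact operational treatment of missing components is not detailed in the text, so dropping them and renormalizing the inverse-MAE weights over the available components is an assumption consistent with the constraint that the weights sum to one.

```python
import numpy as np

def ocf(raw, bias, mae):
    """Eq. (4): OCF = sum_i w_i (f_i - b_i) over the components available
    today; missing components (None) are dropped and the inverse-MAE
    weights renormalized, so the weights always sum to one."""
    avail = [(f, b, a) for f, b, a in zip(raw, bias, mae) if f is not None]
    inv = np.array([1.0 / a for _, _, a in avail])
    weights = inv / inv.sum()
    return float(sum(w * (f - b) for w, (f, b, _) in zip(weights, avail)))

# Hypothetical values; the first component is missing today, so the
# remaining two carry the forecast.
print(ocf(raw=[None, 30.8, 32.4], bias=[0.3, -0.5, 0.9], mae=[1.5, 1.7, 1.9]))
```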

From an operational perspective, the slowly varying weights and MAEs resulting from the use of 30-day parameters provide an additional level of robustness, since day −1 biases and weights can be used if the day 0 computations fail.

6. Verification results

For a single site, day, and weather element, if the component forecast errors are in phase (i.e., either all positive or all negative), then on average half of them will outperform OCF. However, if both positive and negative errors occur, the rank of OCF depends upon the extent of their cancellation. As the verifying domain expands, the likelihood of all bias-corrected components being in phase decreases, and the cancellation of opposite-phase errors improves OCF relative to its components.
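A two-component, equal-weight illustration (hypothetical errors in °C):

```python
# In-phase errors: consensus error +0.9, so the +0.6 component beats it.
in_phase = (1.2 + 0.6) / 2
# Opposite-phase errors: consensus error +0.3, beating both components.
opposite = (1.2 - 0.6) / 2
print(in_phase, opposite)
```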

a. Impact of bias correction

Table 3 summarizes the impact of the 30-day bias correction on MAEs of specific OCF member forecasts of day 1 screen maximum and minimum temperatures at about 600 sites per day over the 6-month period January–June 2003. Generally, the improvement in DMOs from bias correction increased with decreasing spatiotemporal resolution, from L1 (23%) to US (39%). DMOs as a whole improved on average by 32% and MOS by 8%. Bias correction made a larger improvement to DMO forecasts of screen maxima (38%) than to forecasts of screen minima (23%), except for GA, which showed the opposite trend.

Table 3. Impact of the running 30-day bias correction on MAEs of raw forecasts of screen temperatures (°C) for day 1 over about 600 sites, Jan 2003 to Jun 2003 inclusive.

Bias-correction impacts were positive for all weather elements but were less pronounced for ground temperature minima, evaporation, and sunshine hours than for screen maxima and minima forecasts.

Prior to bias correction, MOS forecasts were the most accurate at day 1 for both maxima and minima. After correction, however, the L1 and UK DMOs had the lowest MAEs for forecasts of maxima, and the L3, UK, and US DMOs the lowest MAEs for forecasts of minima.

b. OCF versus existing guidance

The main quantitative result from the real-time trial of OCF over January–June 2003 (midsummer to early winter) was that, over all weather elements and forecast projections, MAEs from OCF were between 20% and 60% lower than the MAEs of its component forecasts. The effects of both bias correction and compositing on day 0 temperature maxima through to day 3 temperature minima are shown in Fig. 3. Standard errors of the MAEs in Fig. 3 ranged from 0.005°C for OCF to 0.026°C for DMO forecasts without bias correction, over more than 10⁵ comparisons. The results show that bias correction and compositing each contributed substantially to the improvement of OCF over its component forecasts.

Fig. 3. Comparison of MAEs (°C) of original and bias-corrected DMO and MOS component forecasts with OCF over a 6-month real-time trial covering 10⁵ matching site forecasts of screen temperature maxima and minima.

MAEs for OCF day 1 and day 2 maximum and minimum temperatures combined were 41% lower than the MAEs of the corresponding uncorrected DMOs: a slightly smaller improvement than the 46% reported by Mao et al. (1999) from their sample of six sites over 1 month.

Similarly, the OCF day 1 combined temperature rmse over the 6 months was 32% lower than that of the corresponding MOS forecasts uncorrected for bias. This value agrees with the 22%–46% improvement in rmse over MOS forecasts achieved by Young (2002) using a similar methodology. Day 1 OCF MAEs were smaller than those of its best overall bias-corrected component forecasts of maximum temperature (L1 and UK) by 7% and of minimum temperature (L3, UK, and US) by 11%.

Table 4 shows that OCF reduced the percentage of errors greater than 4.5°C generated by its component forecasts uncorrected for bias by 36%–82%.

Table 4. Average 6-month, all-site percentage reduction, due to OCF, of screen temperature errors greater than 4.5°C in existing guidance.

c. OCF versus official forecasts

The accuracy of OCFs compared to official forecasts is a major consideration, as poor guidance has the potential to hinder real-time forecast production. However, if OCFs are better, then forecasters will adapt to optimize their usage.

Results from over a year of real-time comparisons of matching OCF and official day 1 maximum and minimum temperature forecasts, for about 134 sites per day and covering over 44 000 comparisons for each forecast type, are summarized in Fig. 4, which shows that OCF consistently provided good guidance. OCF MAEs were 8% lower than those of corresponding official day 1 maximum forecasts, with lower MAEs at over 71% of the sites. For day 1 minimum forecasts the corresponding improvement over official forecasts was 10%, with lower MAEs at 75% of the sites. Additionally, the MAE for OCF day 1 maximum temperature forecasts over the 134 sites in July 2003 was a record low of 0.96°C.

Fig. 4. Comparison of OCF and official day 1 temperature forecast rmse (°C) over matching forecasts at 134 sites per day. Note: verifications were unavailable for Mar 2003.

Feedback from operational forecasters and service managers has confirmed the value of the OCF guidance demonstrated by these results. In 2005, OCF will become fully operational and will be expanded to provide 3-hourly guidance forecasts of temperature, dewpoint, and air pressure to support aviation and fire weather services.

d. Optimum versus equal-weighted consensus forecasts

Optimum weight (OW) consensus forecasts used in OCF were based on the historical, bias-corrected MAEs. Comparative verification over about 10⁵ matching events with corresponding equal-weighted forecasts showed optimum weight MAEs were 2%–5% lower. The improvement in the reduction of errors greater than 4.5°C ranged from 2% to 10%. Optimum weight averages outperformed equal weight averages for each weather element and projection. Young (2002) and Gerding and Myers (2003) also reported that optimum weights outperformed simple averages.
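The flavor of this comparison can be reproduced with a toy simulation. The sketch below is illustrative only: component errors are assumed independent and Gaussian with unequal spreads, and the OW weights are fixed from a 30-day spin-up rather than updated daily.

```python
import numpy as np

rng = np.random.default_rng(0)
n_days = 5000
# Three hypothetical bias-corrected components with different error sizes.
errors = rng.normal(0.0, [1.4, 1.6, 2.2], size=(n_days, 3))

ew_err = errors.mean(axis=1)                # equal-weight consensus error
mae30 = np.abs(errors[:30]).mean(axis=0)    # 30-day MAE "spin-up"
w = (1.0 / mae30) / (1.0 / mae30).sum()     # fixed OW weights thereafter
ow_err = errors[30:] @ w

print(np.abs(ew_err[30:]).mean(), np.abs(ow_err).mean())  # OW MAE usually lower
```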

e. OCF for extreme events

As Young (2002) noted, a common criticism of consensus forecasts is a loss of sensitivity to extreme events. The January–June 2003 period continued the severe 2002 El Niño–associated drought over eastern Australia (Nicholls 2004) and included 30 record high daily maximum temperature events over 16 days. On some days a single site reported a record high maximum temperature event while on other days a cluster of sites were affected.

OCF performance varied according to whether the observed record temperature fell "within the ensemble" or "beyond the ensemble" range of predictions. Of the 20 events within the ensemble, only GAM (MAE = 1.85°C) of the nine bias-corrected components outperformed OCF (MAE = 2.18°C). GAM produced the best of the ensemble forecasts for four events in this category and OCF the best forecast for two events. Table 5 lists the performance of all components and OCF in the within-ensemble category.

Table 5. Comparison of bias-corrected forecast and OCF MAEs over 20 events of day 1 record-high maximum temperatures where the observed temperature fell within the range of the component forecasts.

For the 10 events when the observed temperature was beyond the ensemble, five component forecasts were more accurate than OCF; L1_00 performed best. Performance in the beyond the ensemble category is listed in Table 6.

Table 6. As in Table 5 but for the 10 events where the observed temperature fell beyond the range of the component forecasts.

Over all 30 events, only L1_00 (MAE = 3.32°C) and GAM_12 (MAE = 3.40°C) outperformed OCF (MAE = 3.73°C). L1_00 provided the best forecast for three events and the worst for two; GAM_12 provided the best forecast for five events and no worst forecasts; and OCF produced the best forecast twice and no worst forecasts. The performance of GAM_12 is interesting because operational forecasters generally have more confidence in the nested high-resolution LAPS guidance than in the coarser-resolution, longer-lead-time guidance available from GASP.

These results show that the perception of consensus forecasts as insensitive to extremes requires qualification. Only when the event falls beyond the ensemble does OCF provide a poor forecast, and even then a forecaster must somehow anticipate that all of the guidance forecast errors will be in phase in order to do better.

f. Rainfall and probability of rain

The OCF prediction of rain and its probability of occurrence cannot use either bias correction or MAE weights effectively because rainfall is discontinuous. Quantitative rainfall in OCF therefore uses equal-weighted averages, and the probability of rain is determined from an equal-weighted average of DMO-derived and explicit MOS rainfall probability predictions. If a DMO rainfall forecast exceeds 0.2 mm, its probability of rain is set to one; otherwise, it is zero. The methodology is very similar to Ebert's (2001) poor man's ensemble (PME) but differs in that OCF uses site-interpolated rainfall and includes MOS predictions, whereas PME remaps multimodel gridpoint rainfall to a common grid before averaging grid DMO rainfall. PME also computes probabilities of rain at different thresholds, whereas OCF uses only a 0.2-mm threshold to determine rain or no rain. Limited verification of OCF rainfall amounts and probability of rain by E. Ebert (2003, personal communication) shows OCF and PME rainfall and probability-of-rain predictions to be similar.
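A minimal sketch of this probability-of-rain consensus follows; the component amounts and probabilities are hypothetical.

```python
def prob_of_rain(dmo_amounts_mm, mos_probs, threshold=0.2):
    """Equal-weight consensus probability of rain: each DMO contributes
    1.0 if its site-interpolated rainfall exceeds the threshold, else 0.0;
    MOS components contribute their explicit probabilities."""
    members = [1.0 if amt > threshold else 0.0 for amt in dmo_amounts_mm]
    members += list(mos_probs)
    return sum(members) / len(members)

# Four DMO amounts (mm) and two MOS probabilities for one site/projection:
print(prob_of_rain([0.0, 1.4, 0.6, 0.0], [0.55, 0.40]))   # -> ~0.49
```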

7. Discussion

a. Distribution of bias-corrected forecasts

Interestingly, after bias correction the UK and US DMO forecasts were as accurate for day 1 temperatures as the higher-resolution local DMOs from L1 and L3 despite a 12-h-longer lead time. Excluding GA, the day 1 MAEs of all components (including MOS) range from 1.7° to 1.9°C for maxima and from 1.5° to 1.7°C for minima (see Table 3).

For each component forecast, the number of sites at which it was most accurate is graphed as relative frequency percentages in Fig. 5. The L0 contributions have been excluded from the graph because they affect only 203 sites (see Fig. 1). Because of their high cross correlations with L0, it is reasonable to assume that L1 and L3 would have recorded more successes had L0 not competed. Hence, not only are the bias-corrected component forecasts roughly equal in accuracy, they provide the best forecast in approximately the same proportions.

Fig. 5. Relative frequency of occurrence of the best component forecast of day 1 screen temperature maximum or minimum, based on approximately 10⁵ events for each forecast type.

No coherent broad-scale spatial pattern of best performing components for a specific weather element on a specific day can be detected. Sites separated by only a few kilometers often had their best forecasts provided by different components. At any site, the best component forecast for the day 1 maximum usually differed from the best component for the day 2 maximum and usually differed again from the best component for the day 1 minimum, etc. Time series of the best component forecasts at single sites supported the impression of a random distribution of best component forecasts. The popular concept among forecasters of “the best model of the day” does not appear applicable to bias-corrected temperature fields.

b. Summary of trial

The operational trial of the OCF system exceeded expectations. It was far more accurate than its underlying component MOS and DMO forecasts and outperformed official forecasts at the majority of sites. Bias correction and optimal weighting algorithms were very efficient and robust.

During the OCF trial, both LAPS 375 and LAPS 125 underwent significant upgrades, including extension of LAPS 375 projections by 24 h and of LAPS 125 projections by 12 h. Collection of 30 days of upgraded-model biases and MAEs commenced with parallel testing of the upgraded models, which allowed seamless accommodation in OCF when the upgrades became operational. Daily statistics generated by the OCF system provided detailed parallel-run information showing the impact of the upgrades on guidance forecasts prior to operational acceptance.

c. Additional DMOs

For days 0–2, up to 10 component forecasts are available for OCF temperature forecasts, and model change impacts appear to be assimilated in about 15 learning days. Beyond day 2, the only current components are GA and GAM, so a change to GASP will impact both components and a full 30-day assimilation will be required.

Synthetic data (not shown), real forecast experiments (e.g., Winkler et al. 1977), and theoretical studies (e.g., Clemen and Winkler 1985) have shown that consensus forecasts usually improve rapidly in going from one to two components, but that the rate of improvement drops off asymptotically with further additions. Also, independent predictors (those with low cross correlations with other predictors) contribute more to consensus forecast accuracy than do dependent predictors (Clemen and Winkler 1985). Hence it is anticipated that additional international DMOs would have a substantial impact on OCF performance beyond day 2. To test this, it is planned to include longer-projection DMOs derived from the European Centre for Medium-Range Weather Forecasts (ECMWF) and the Japan Meteorological Agency in OCF.
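Both effects can be illustrated with a toy Monte Carlo in which n equally weighted components share a common pairwise error correlation ρ; the numbers are illustrative and not a model of actual DMO error structure. Improvement saturates quickly when ρ is high (≈0.9, as among the local DMOs) and more slowly when ρ is lower (≈0.5, as between local and international DMOs).

```python
import numpy as np

rng = np.random.default_rng(1)

def consensus_mae(n, rho, n_days=20000, sigma=1.5):
    """MAE of an equal-weight consensus of n components whose errors all
    have std sigma and share a common pairwise correlation rho."""
    common = rng.normal(0.0, 1.0, (n_days, 1))       # shared error component
    indiv = rng.normal(0.0, 1.0, (n_days, n))        # independent components
    errors = sigma * (np.sqrt(rho) * common + np.sqrt(1.0 - rho) * indiv)
    return np.abs(errors.mean(axis=1)).mean()

for n in (1, 2, 4, 8):
    print(n, round(consensus_mae(n, 0.5), 3), round(consensus_mae(n, 0.9), 3))
```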

d. Further developments

Several procedures of the current OCF formulation could possibly be improved. Spatial interpolation versus nearest grid point, bias and MAE window history size, basing component weights on rmse rather than MAE history, and optimizing the number and mix of component forecasts warrant further investigation.

Acknowledgments

Thanks to Andrew Amad-Corson, Yen Le, Robert Dahni, and Milos Setek for their work in establishing the database and programming and technical support during the development and preliminary trials of OCF. Thanks also to Christina Sirakoff and Jim Fraser from the National Meteorological and Oceanographic Centre, and to Effie Hoareau and Moshe Galeboe, students from Royal Melbourne Institute of Technology, for their assistance with development and validation of bias-correction algorithms. Beth Ebert, Barry Hanstrum, Peter May, and Graham Mills from the Bureau of Meteorology Research Centre reviewed the draft manuscript and helped improve it with many suggestions. Thanks to them and two anonymous reviewers for their valuable advice. Finally, many thanks are due to Terry Hart for his sustained support throughout the project.

REFERENCES

Aksu, C., and S. I. Gunter, 1992: An empirical analysis of the accuracy of SA, OLS, ERLS and NRLS combination forecasts. Int. J. Forecasting, 8, 27–43.
Bosart, L. F., 1975: SUNYA experimental results in forecasting daily temperature and precipitation. Mon. Wea. Rev., 103, 1013–1020.
Bourke, W., T. Hart, P. Steinle, R. Seaman, G. Embery, M. Naughton, and L. Rikus, 1995: Evolution of the Bureau of Meteorology's Global Assimilation and Prediction System. Part 2: Resolution enhancements and case studies. Aust. Meteor. Mag., 44, 19–40.
Brown, B. G., and A. H. Murphy, 1996: Improving forecasting performance by combining forecasts: The example of road-surface temperature forecasts. Meteor. Appl., 3, 257–265.
Clemen, R. T., 1989: Combining forecasts: A review and annotated bibliography. Int. J. Forecasting, 5, 559–583.
Clemen, R. T., and R. L. Winkler, 1985: Limits for the precision and value of information from dependent sources. Operations Res., 33, 427–442.
Ebert, E. E., 2001: Ability of a poor man's ensemble to predict the probability and distribution of precipitation. Mon. Wea. Rev., 129, 2461–2480.
Erickson, M. C., J. P. Dallavalle, and K. L. Carroll, 2002: The new AVN/MRF MOS development and model changes: A volatile mix? Preprints, 16th Conf. on Probability and Statistics in the Atmospheric Sciences, Orlando, FL, Amer. Meteor. Soc., 82–87.
Ezekiel, M., and H. A. Fox, 1970: Methods of Correlation and Regression Analysis. 3d ed. Wiley, 548 pp.
Forsythe, G. E., M. A. Malcolm, and C. B. Moler, 1977: Computer Methods for Mathematical Computations. Prentice Hall, 259 pp.
Fraedrich, K., and L. M. Leslie, 1987: Combining predictive schemes in short-term forecasting. Mon. Wea. Rev., 115, 1640–1644.
Fritsch, J. M., J. Hilliker, J. Ross, and R. L. Vislocky, 2000: Model consensus. Wea. Forecasting, 15, 571–582.
Garthwaite, P. H., 1994: An interpretation of partial least squares. J. Amer. Stat. Assoc., 89, 122–127.
Gerding, S., and B. Myers, 2003: Adaptive data fusion of meteorological forecast modules. Preprints, Third Conf. on Artificial Intelligence Applications, Long Beach, CA, Amer. Meteor. Soc., CD-ROM, 4.8.
Glahn, H. R., and D. A. Lowry, 1972: The use of model output statistics (MOS) in objective weather forecasting. J. Appl. Meteor., 11, 1203–1211.
Gupta, S., and P. C. Wilton, 1987: Combination of forecasts: An extension. Manage. Sci., 33, 356–372.
Jacks, E., J. B. Bower, V. J. Dagostaro, J. P. Dallavalle, M. C. Erickson, and J. C. Su, 1990: New NGM-based MOS guidance for maxima and minima temperature, probability of precipitation, cloud amount, and surface wind. Wea. Forecasting, 5, 128–138.
Jackson, J., 1991: A User's Guide to Principal Components. Wiley, 569 pp.
Krishnamurti, T. N., T. E. Kishtawal, D. W. Shin, and C. E. Williford, 2000: Improving tropical precipitation forecasts from a multimodel superensemble. J. Climate, 13, 4217–4227.
Mao, Q., R. T. McNider, S. F. Mueller, and H. H. Juang, 1999: An optimal model output calibration algorithm suitable for objective temperature forecasting. Wea. Forecasting, 14, 190–202.
Mass, C. F., 2003: IFPS and the future of the National Weather Service. Wea. Forecasting, 18, 75–79.
Nicholls, N., 2004: The changing nature of Australian droughts. Climatic Change, 63, 323–326.
Puri, K., G. S. Dietachmayer, G. A. Mills, N. E. Davidson, R. Bowen, and L. W. Logan, 1998: The new BMRC Limited Area Prediction System, LAPS. Aust. Meteor. Mag., 47, 203–223.
Sanders, F., 1973: Skill in forecasting daily temperature and precipitation: Some experimental results. Bull. Amer. Meteor. Soc., 54, 1171–1179.
Seaman, R., W. Bourke, P. Steinle, T. Hart, G. Embery, M. Naughton, and L. Rikus, 1995: Evolution of the Bureau of Meteorology's Global Assimilation and Prediction System. Part 1: Analysis and initialisation. Aust. Meteor. Mag., 44, 1–18.
Stensrud, D. J., and J. A. Skindlov, 1996: Gridpoint predictions of high temperature from a mesoscale model. Wea. Forecasting, 11, 103–110.
Thompson, P. D., 1977: How to improve accuracy by combining independent forecasts. Mon. Wea. Rev., 105, 228–229.
Vislocky, R. L., and G. S. Young, 1989: The use of perfect prog forecasts to improve model output statistics forecasts of precipitation. Wea. Forecasting, 4, 202–209.
Vislocky, R. L., and J. M. Fritsch, 1995: Improved model output statistics forecasts through model consensus. Bull. Amer. Meteor. Soc., 76, 1157–1163.
Wilson, L. J., and M. Vallée, 2002: The Canadian Updatable Model Output Statistics (UMOS) system: Design and development tests. Wea. Forecasting, 17, 206–222.
Winkler, R. L., and R. T. Clemen, 1992: The sensitivity of weights in combining forecasts. Operations Res., 40, 609–614.
Winkler, R. L., A. H. Murphy, and R. W. Katz, 1977: The consensus of subjective probability forecasts: Are two heads better than one? Preprints, Fifth Conf. on Probability and Statistics, Las Vegas, NV, Amer. Meteor. Soc., 57–62.
Wonnacott, T. H., and R. J. Wonnacott, 1972: Introductory Statistics. Wiley, 510 pp.
Woodcock, F., 1984: Australian experimental model output statistics forecasts of daily maximum and minimum temperature. Mon. Wea. Rev., 112, 2112–2121.
Woodcock, F., and B. Southern, 1983: The use of linear regression to improve official temperature forecasts. Aust. Meteor. Mag., 31, 57–62.
Young, G., 2002: Combining forecasts for superior prediction. Preprints, 16th Conf. on Probability and Statistics in the Atmospheric Sciences, Orlando, FL, Amer. Meteor. Soc., 107–111.

Footnotes

Corresponding author address: Frank Woodcock, Australian Bureau of Meteorology Research Centre, P.O. Box 1289 K, Melbourne, Victoria 3001, Australia. Email: F.Woodcock@bom.gov.au