## Abstract

Previous methods for creating consensus forecasts weight individual ensemble members based upon their relative performance over the previous *N* days, implicitly making a short-term persistence assumption about the underlying flow regime. A postprocessing scheme in which model performance is linked to underlying weather regimes could improve the skill of deterministic ensemble model consensus forecasts. Here, principal component analysis of several synoptic- and mesoscale fields from the North American Regional Reanalysis dataset provides an objective means for characterizing atmospheric regimes. Clustering techniques, including *K*-means and a genetic algorithm, are developed that use the resulting principal components to distinguish among the weather regimes. This pilot study creates a weighted consensus from 48-h surface temperature predictions produced by the University of Washington Mesoscale Ensemble, a varied-model (differing physics and parameterization schemes) multianalysis ensemble with eight members. Different optimal weights are generated for each weather regime. A second regime-dependent consensus technique uses linear regression to predict the relative performance of the ensemble members based upon the principal components. Consensus forecasts obtained by the regime-dependent schemes are compared using cross validation with traditional *N*-day ensemble consensus forecasts for four locations in the Pacific Northwest, and show improvement over methods that rely on the short-term persistence assumption.

## 1. Introduction

Ensemble modeling is proving to be a viable approach for addressing uncertainty in numerical weather prediction (Hamill et al. 2000). Ensemble modeling creates a distribution of potential forecast solutions by varying the initial conditions, boundary conditions, model physics, parameterization schemes, or the models themselves. Ensemble models are run operationally at several meteorological centers, including the National Centers for Environmental Prediction (NCEP; Toth and Kalnay 1997) and the European Centre for Medium-Range Weather Forecasts (ECWMF; Molteni et al. 1996).

Recent research has elucidated methods for optimally combining the forecasts from a set of ensemble members to create a single deterministic consensus forecast. These methods include using the simple ensemble mean, linear regression in superensembles (Krishnamurti et al. 1999) and ensemble model output statistics (EMOS; Gneiting et al. 2005), performance-based metrics (Woodcock and Engel 2005), and Bayesian model averaging (BMA; Raftery et al. 2005). In its simplest expression, the consensus forecast can be expressed as a linear combination of individual ensemble member forecasts plus a bias correction term. The weight of a member in the combination typically is determined by considering the performance of that member relative to the other ensemble members over some window of time. Most methods, such as that in Eckel and Mass (2005), employ a sliding window of dates, *W*, between 1 week and 2 months directly preceding the date the forecast is issued. Thus, the underlying assumption is that if member A performs relatively better than member B on the date the forecast verifies, then member A can be expected to continue to perform better than member B on the following day. This approach captures the effects of model evolution, but handles the dependence of skill on atmospheric regime only to the extent that the same regime persists throughout the period *w*.

Such atmospheric regimes have been recognized in models ranging from the very simple (Lorenz 2006) through full climate models (Branstator 1992). Forecasters have long understood that model performance is linked to the characteristics of the weather situation being forecast. For example, Fritsch et al. (2000) found the magnitude of wintertime model temperature errors to be related to the positioning of the jet stream. Forecasters frequently subjectively weight the sources of information used in developing their forecasts depending upon such weather characteristics (Roebber 1998). Different model physics packages, such as cumulus parameterization schemes, reveal their relative strengths and weaknesses under particular weather situations (Wang and Seaman 1997). As many ensembles employ a variety of physics packages, such situational model skill variations should also be present in ensemble model output. Forecasters use these situational biases operationally when, for example, they recognize that model A is superior to model B when predicting heavy precipitation events (Meng and Zhang 2007). One may then surmise that an ensemble postprocessing system that considers these situational biases may improve forecast skill. Clustering, the process through which similar elements of a dataset are grouped together, can be used to separate a time series of model forecasts into a finite set of weather situations or regimes. Here, a regime is defined broadly as any configuration of atmospheric variable fields such that membership in a specific regime can be determined objectively. Deloncle et al. (2007) have recently shown that artificial intelligence methods, specifically *k*-nearest neighbor techniques and random forests, can successfully identify transitions between regimes.

This work explores the synthesis of consensus ensemble forecasting and weather regime identification. We hypothesize that the relative performance of ensemble members is linked to weather regimes. This hypothesis of differing relative performances implies that varying the weights of ensemble members depending upon regimes should produce a more accurate consensus forecast. Clustering forecast dates by weather regime prior to computing skill and assigning performance-based weights to ensemble members may result in reduced forecast error for a consensus deterministic forecast. The historical methods explored here serve as benchmarks with which to judge the success of the new postprocessing technique and, thus, test our underlying hypothesis that improved analysis of regime dependence increases the success of ensemble member weighting.

Section 2 describes the data used in this study and its preprocessing. The regime characterization techniques are described in section 3, as are the types of regimes found in the data. Section 4 explores methods of combining the data to obtain performance-weighted consensus forecasts and section 5 elucidates our methods of obtaining regime-dependent consensus forecasts. Results appear in section 6 and conclusions in section 7.

## 2. Ensemble model and verification data

The University of Washington Mesoscale Ensemble (UWME+; Eckel and Mass 2005) is used as our source of ensemble forecasts. The system is based upon the fifth-generation Pennsylvania State University–National Center for Atmospheric Research Mesoscale Model (MM5) and contains eight members. Each member has different initial and boundary conditions as well as different physical/parameterization schemes. We use data from a nested 12-km-spaced grid roughly centered over the states of Washington and Oregon. Table 1 summarizes the data downloaded from the UWME archives.

To provide a relatively simple methodological test bed, point data from four locations (Portland, Oregon; Astoria, Oregon; Olympia, Washington; and Redmond, Oregon) are extracted from the UWME+ ensemble forecast dataset. Specifically, 2-m temperature data are bilinearly interpolated at each location to create 1-yr time series representing model forecasts from each ensemble member extending for the period from 1 September 2005 to 31 August 2006. The city of Portland is located along the banks of the Columbia River, with the Cascades to the east and Coast Range to the west. Astoria Airport is located on the tidal flats of the Columbia, and is ringed by low mountains with the exception of the west where lies the Pacific Ocean. Olympia is located in the Puget Sound lowlands, and is separated from the Pacific by the Coast Range. Redmond is located at 900 m above sea level in central Oregon east of the Cascades.

Ensemble model forecasts are compared with the actual state of the atmosphere using a verification dataset. For point verification, surface observations are superior to gridded analyses, because they do not require additional interpolation and represent ground truth. Surface temperature verifications are therefore taken from the Automated Surface Observing System (ASOS) reports at 0000 UTC each day, and collected into a time series extending from 1 September 2005 to 31 August 2006. Errors (model forecast minus observation) were then computed for each ensemble member.

Table 2 summarizes the error mean (bias), error standard deviation, and mean absolute error for all four locations. Eckel and Mass (2005) note the importance of bias correction for short-range ensemble forecasts. Bias correction also implicitly corrects for the difference between the elevation of the model grid points and the actual elevation of the surface stations, rendering a lapse rate temperature adjustment unnecessary. The application of bias correction to each ensemble member results in a typical improvement of 10%. The details of our bias correction scheme are discussed in the appendix.

## 3. Atmospheric weather regime characterization

The North American Regional Reanalysis (NARR; Mesinger et al. 2006) provides a rich dataset from which to characterize atmospheric regimes. Data for 0000 UTC each day are obtained for the same 12-month time period (1 September 2005–31 August 2006) as the UWME+ ensembles. For a synoptic domain on the order of 1000 km × 1000 km consisting of the Pacific states and the adjacent ocean, the fields include 500-hPa heights (h500) to represent midlevel atmospheric flow, mean sea level pressure (MSLP), and specific humidity (q700) as a proxy for midlevel cloud cover. For a mesoscale domain on the order of 300 km × 300 km, the fields include the *u* and *υ* components of the wind at 850 and 925 hPa (uv850 and uv925). Since the NARR data are an analysis that combines observations with an initialization via a numerical weather prediction (NWP) model, it represents a best guess at the true state of the atmosphere at a given time. We choose to use the NARR analyses at the time the forecast is issued to represent a proxy for the model initialization.

Principal component analysis (PCA) is applied to reduce the dimensionality of this dataset while capturing the most important modes of variability (Wilks 2006). PCA is applied to each of the five NARR data fields individually. This analysis produces pattern vectors (eigenvectors) and a time series of amplitudes, the principal components (PCs) for each data field. Deloncle et al. (2007) showed that a small number of PCs can span the major weather regimes of a three-layer quasigeostrophic model. We therefore hypothesize that the same is at least approximately true of the real atmosphere. To interpret the pattern vectors physically and to avoid obtaining Buell patterns, the eigenvectors are rotated (Richman 1986). Buell patterns are characteristic of the domain shape and can appear regardless of the field under consideration; even PCs of random data can contain such patterns. Rotation is accomplished here using the promax (oblique) technique (Richman 1986), retaining only the first four eigenvectors. On average, the first four pattern vectors capture 88% of the variance of the data, confirming the ability of a small number of PCs to span the major weather regimes of the Pacific Northwest.

The pattern vectors for MSLP are plotted in Fig. 1. The first pattern vector for MSLP can be interpreted as the Aleutian low with a high pressure area west of California; the second can be interpreted as a trough of low pressure offshore of and roughly parallel to the Pacific coast. Physical interpretations can be ascribed to the other pattern vectors as well. A linear combination of these four patterns describes the actual atmospheric state for MSLP on a given day. Principal components of several atmospheric variables together can characterize atmospheric regimes, and provide inputs to a regime-dependent consensus forecast. Figures 2 –4 indicate the first rotated four pattern vectors for 500-hPa heights, specific humidity at 700 hPa, and 850-hPa horizontal wind vectors, respectively. The 925-hPa wind field is similar in appearance to the 850-hPa field. The first four eigenvectors of each of these five fields are linearly combined to quantify the weather regime as described in section 5.

## 4. Ensemble model consensus forecasts

We form a consensus forecast as a linear combination of individual member forecasts as

where *X** is the consensus forecast, *n* is the number of ensemble members, *X _{i}* are the individual bias-corrected ensemble member forecasts, and

*a*are ensemble member weights. The goal of this study is to select ensemble member weights

_{i}*a*so that

_{i}*X** is the optimal consensus forecast defined as having the least mean absolute error (MAE).

The simplest consensus forecast is the ensemble mean, where *a _{i}* are equal for all

*i*such that their sum is one (

*a*= 1/

_{i}*n*, where

*n*is the number of ensemble members), referred to as the equal-weighted average. An improved choice of the ensemble member weights,

*a*, can be obtained by considering the relative performance of ensemble members (Woodcock and Engel 2005). Performance-weighted averages (PWA) use the inverse of the MAE of that ensemble member over a window of forecasts

_{i}*W*to derive raw weights. Thus, ensemble members that are performing the best (have the smallest MAE) are given the largest weighting in the linear combination. The weights are then normalized so that their sum is equal to one. The formula for calculating an ensemble member weight using PWA is given as follows:

where *p _{i}* is a performance measure (here, MAE) for ensemble member

*i*and

*n*is the number of ensemble members.

Consensus forecast error can be reduced based upon a clever selection of the performance window, *W*. Here, *W* refers to a set of dates under which performance is evaluated; *w* refers to the number of forecast dates in *W*, typically in the context of the previous *w* days. The simplest choice of *W* is the entire set of forecasts from the dependent dataset, that is, the data used to train the algorithm (see section 5d, which describes the details of how we choose our training data). This dataset is henceforth referred to as *W*_{dep}. Operationally, other studies employ a sliding window of dates that directly precede that on which the forecast is issued. Eckel and Mass (2005) prefer 14 days (*W*_{14}) but Gneiting et al. (2005) prefer 40 days for the duration, *w*, of the window. Operationally, the performance of a 48-h ensemble forecast is not known until 48 h after the forecast is issued. Thus, to simulate a realistic forecast environment, *W*_{14} must be chosen so that there is a 2-day lag between the last valid date in the window and the time to which the forecast applies. For example, the performance of the ensemble for forecasts valid on 1–14 January is used to weight the ensemble members for the forecast valid 16 January.

Figure 5 reveals the skill of *N*-day sliding-window performance weighting for various performance window sizes (*w*) compared to equal weighting. These values were computed using the UWME+ 365-day time series of ensemble model temperature forecasts for the four locations described in section 2. An average of the four values is represented by the solid line in Fig. 5. The optimum window size was 7 days (0.70% average improvement over the ensemble mean), with an MAE slightly better than for 14 days (0.62% average improvement). This number reflects the typical persistence of atmospheric flow regimes in the Pacific Northwest. We thus use the best window for each of the individual verification locations.

Consensus forecasting using performance-weighted averaging with both *W*_{dep} and *W _{N}* offers a slight improvement (1.2% at Portland) over an equal-weighted forecast MAE. This percent improvement serves as a benchmark for the performance of the regime-dependent methods; they should be considered successful only if they can produce a greater percentage improvement. The goal of the clustering methods discussed in the next section is to identify regime-dependent patterns in the intermodel distribution of optimal weights so that a performance-weighted average applied to independent data will reveal a more substantial improvement. The purpose of the clustering is thus an intelligent selection of a temporally noncontiguous regime-based

*W*in order to minimize MAE.

## 5. Regime-based consensus forecasts

We consider three explicitly regime-based methods for determining the optimal weights for ensemble consensus forecasts. The first uses the *K*-means clustering algorithm to cluster forecast dates into atmospheric regimes before computing the performance-based weights. The second uses a genetic algorithm to cluster forecast dates into atmospheric regimes. The third uses multivariate linear regression of principal components to predict ensemble member errors and then applies that information to compute the performance-based weights. This section also describes the environment used to test these methods.

### a. Regime clustering using a K-means algorithm

Clustering divides a dataset into a number of subsets or groups called clusters. The goal of traditional clustering methods, such as *K*-means, is to group similar cases together while separating dissimilar ones (Wilks 2006). Similarity is determined based upon some measure of distance between points in phase space, typically the *N*-dimensional Euclidean distance. Here, each cluster is a group of forecast dates, that is, *W*_{clstr}, that form the temporally discontinuous window used for computing the relative performance of the ensemble members.

The *K*-means clustering algorithm (Hartigan and Wong 1979) uses an iterative process to minimize the distance between objects and the centroid of their associated cluster. The algorithm takes as input a training dataset and *k*, the number of desired clusters. The algorithm randomly selects *k* initial data points to serve as initial cluster centroids. The distance from each data point to each cluster centroid is calculated; the data points are assigned to the cluster with the closest centroid. The cluster centroids (analogous to the “center of mass” for the cluster) are then recalculated. The objects are regrouped by minimum distance. This process repeats iteratively until the data points no longer switch clusters; minimum distance has been attained.

In this project, the input data to the *K*-means algorithm are the principal component time series of the NARR dataset. There are thus up to 20 predictors available for use by the clustering algorithm; the number of predictor variables is equal to the number of dimensions in the phase space where the clustering occurs. We wish to explore the optimal performance of this algorithm, so the number of predictors becomes a parameter to be optimized. Parameters are added one at a time in analogy to stepwise regression. For example, suppose we wish to cluster using *p* predictors and *k* clusters. First, the *K*-means algorithm is run for each of the 20 predictors and *k* clusters. The best predictor (scoring the lowest MAE on dependent data) “wins” the first round and is retained. Next, the algorithm is run again 19 times, with the first-round winner and a new predictor serving as inputs to the algorithm. The process is repeated *p* times until the best predictors have been selected.

### b. Regime clustering using a genetic algorithm

Traditional clustering considers only the distribution of data within the phase space of the dataset to be classified (in this case, the principal components of the NARR data) so that the “distance” between members of a cluster is minimized. However, the ultimate goal for this project is to find clusters such that the sum of the consensus forecast MAEs of the clusters (weighted by the number of members in the cluster) is minimized. Examination of the scatterplots of the principal components did not reveal any readily discernable patterns. Therefore, a new method is considered that uses a genetic algorithm to perform the clustering of dates.

Consider a simple cluster rule R1: given a single PC (here, a time series of data points *x*_{1}), place all *x*_{1} > *r* into the first cluster, all *x*_{1} ≤ *r* into a second. Thus, *r* is a boundary that subdivides the phase space of the PC. A second cluster rule R2 can be applied to another PC. Together, these rules can divide the set of forecast dates into up to four clusters. If both rules R1 and R2 apply to the same PC *x*_{1}, only three clusters are produced. In general, *b* unique boundaries produce up to 2* ^{b}* clusters.

The boundaries (cluster rules) that minimize MAE can be determined using a genetic algorithm (GA). A genetic algorithm is a technique for optimization that intelligently samples a portion of the solution space before eventually converging upon the optimum (Haupt and Haupt 2004). A GA begins with a random pool of possible solutions called chromosomes. As in biological evolution, the solution pool evolves from generation to generation. Less fit members are removed, and fit members survive to produce offspring solutions. The fitness of a chromosome (each chromosome specifies a cluster rule) is determined by the total MAE of all the clusters using PWA, with each individual cluster MAE rescaled by the size of the cluster. Crossover blends information from different solutions, while mutation of the solutions prevents the gene pool from being trapped in local minima. This GA simultaneously optimizes several parameters, including the selection of which PC to apply a cluster boundary (a discrete variable) and the exact location *r* of the boundary (a continuous variable). For the latter, the GA is allowed to select from 20 predictors (five atmospheric fields times four PCs each). The number of boundaries is specified before the GA is run, and the analysis is repeated for a varying number of boundaries (one to five), which correspond to between 2 and 32 clusters. The 100 benchmark runs used 500 GA iterations with 20 chromosomes, a mutation rate of 0.25, and a survival rate of 0.5. Once optimal regimes are defined by the GA, performance-weighted averaging using optimal weights for each regime (cluster) is applied to generate consensus forecasts. The nature of these cluster rules constrains them to be parallel to the coordinate axes in phase space.

### c. Regime regression

An alternative technique for regime-based consensus forecasts uses multivariate linear regression for quantifying the model performance under various weather regimes. Whereas clustering divides a dataset into discrete groups, regression maintains the continuous nature of the principal components. With the 20 atmospheric PCs as inputs, forward-screening multivariate linear regression (Wilks 2006) predicts the errors (MAEs) of each ensemble member. These predicted errors are then used to calculate the performance weights using the PWA equation. The performance window consists of the set of predicted errors produced by the regression equation, rather than actual errors from previous dates, as done previously. Although this regression approach uses atmospheric regime information (the PCs), it does not explicitly classify a given forecast date as belonging to a specific regime as is done in the clustering approach. Because it uses linear regression of other atmospheric variables to postprocess model output, this technique is somewhat analogous to traditional MOS (Glahn and Lowry 1972). It offers an advantage over MOS, however, in that it considers atmospheric data from the entire domain rather than just at the point location. Moreover, these data have been compressed through the principal component analysis, thus reducing the danger of overfitting.

### d. Cross validation and the test environment

After training a model on a set of dependent data, cross validation is typically employed to test a model on an independent dataset to detect overfitting (Wilks 2006). It is particularly relevant to this project in order to prove that any newly developed methods can succeed in an operational forecasting environment, where information from previous days’ forecasts (where both forecasts and verifications are known) is applied to a current forecast (where verification data are as yet unknown). As weather regimes (discussed in section 3) may have a length of several days, an autocorrelation may exist in the relative performance metrics for ensemble members. This possibility would lead to unfairly boosted skill scores for the independent data and potentially overfitting if a 10-fold cross validation was employed. Instead, the dataset is split randomly into a dependent (training) dataset (67% of the original size, or 243 days) and an independent (testing) dataset (33% of the original size, or 122 days).

Given the relatively small length of the time series of ensemble data (365 data points), the MAE of the dependent and independent datasets can fluctuate randomly by as much as 10% depending upon how the division takes place, which can be greater in magnitude than the performance gains from using a new method. To overcome this difficulty, the random division (2/3 dependent, 1/3 independent) is repeated many times, with these trials approximating the distribution of possible error values. Over many trials, the means of the MAEs for the independent and dependent datasets both approach that of the entire dataset. This approach, using multiple simulations, gives more confidence to conclusions drawn about the relative performance of methods, and also characterizes the variance depending upon which data are used in each set. A similar approach is employed in the bootstrapping technique (Efron 1979). Each method that requires a division into dependent and independent data will be simulated 50 times, with each simulation having a different random separation of data into dependent and independent datasets. Performance metrics, such as MAE, will be reported as the mean value over all 50 simulated trials. Error standard deviations over the 50 simulation trials can characterize the uncertainty associated with the results.

## 6. Results

Three explicitly regime-dependent methods for generating consensus forecasts (described in section 5)—regime clustering with the *K*-means algorithm, regime clustering with a genetic algorithm, and regime regression—are evaluated in comparison to regime-independent methods (described in section 4). Using cross validation (as described in section 5d), the methods were trained on the dependent dataset (243 days), and applied to an independent dataset (122 days). The averages over 50 repetitions are used to evaluate performance. For the three explicitly regime-dependent methods, these simulations used PCs of NARR atmospheric fields for the same time and date that the forecasts are issued. As the NARR serves as a proxy for the model initial conditions, this test environment is operationally realistic.

### a. Results for the dependent dataset

It is informative to first analyze the results of the training (dependent) data before describing the results for the independent data. Figure 6 compares the performance of several methods for generating consensus forecasts using dependent training data. The baseline is an equal-weighted ensemble consensus forecast—the ensemble mean. The bars of the graph represent a percent improvement from this baseline method. Results for the optimum configuration of each technique (number of clusters and number of predictors) are displayed here; a discussion of the optimization is found in section 6c. All three new methods exhibit a superior performance compared to traditional performance-weighted and equal-weighted averaging when working with a dependent dataset. *K*-means clustering (8.77%) and GA clustering (9.84%) have the greatest average percent improvement, followed by regime regression (6.78%). Because the genetic algorithm approach selects cluster membership rules with the goal of minimizing MAE of the consensus forecasts, it should score the highest on the dependent (training) dataset.

### b. Results for the independent dataset

For a true test of skill, we analyze the results for the independent data tests. Figure 7 compares the performance of several methods for generating consensus forecasts using independent data. The baseline is again the equal-weighted ensemble consensus forecast—the ensemble mean. Results for the optimum configuration of each technique (number of clusters and number of predictors, or number of days in the sliding window) are displayed here. As expected, we observe a significant reduction in skill for the regime-dependent methods from the dependent training data tests in Fig. 6. On average over all four locations, *K*-means regime clustering (1.39% improvement) outperforms the best of the *N*-day sliding windows (0.87% improvement). Regime clustering with a genetic algorithm has comparable skill (0.77% improvement) to the sliding window, and occasionally outperforms it (e.g., at Olympia). Regime regression (3.66% decline) was unsuccessful when applied to the independent dataset.

Table 3 uses the means and standard deviations over the 50 simulations to characterize the uncertainty of the results. The 50 random divisions into dependent and independent datasets (discussed in section 5d) are somewhat analogous to a bootstrap. Thus, the standard deviations, when padded to the means, form a confidence interval for the percent improvements. Note that the *N*-day performance weighting does not have a confidence interval because it is applied to the entire dataset and should not be affected by partitioning of the data. The variations in the regime-dependent techniques are due to sensitivity to the data partitioning as well as the nondeterminism of the algorithms (random initial cluster seeds are used in *K* means, and the genetic algorithm makes liberal use of randomness through its mutation operator). The intervals confidently show both clustering techniques to be superior to an equal-weighting average, and regime regression to be inferior. However, the performance of the *N*-day sliding-window technique lies within the error bars for both clustering techniques. Thus, while *K*-means regime clustering may outperform the *N*-day window for most instances, the *N*-day technique should also be expected to perform better for some datasets. The wide error bars underscore the necessity for a bootstrap-like test environment.

### c. Technique tuning and optimization

For the *K*-means clustering technique, there are two parameters that must be specified: the number of clusters and the number of predictors. For dependent data, the skill increases monotonically with the number of clusters. The relationship becomes more complex for the independent dataset. Figure 8 illustrates a parameter phase space for the *K*-means clustering algorithm for Portland using the independent dataset. In general, using a greater number of predictors (atmospheric variable principal components) results in greater method skill. Increasing the number of clusters raises the skill until reaching 24 clusters, after which it begins to decline (due to overfitting). Thus, the optimum skill for Portland is obtained in the region of 16–20 predictors and 16–24 clusters. Figure 9 demonstrates the skill of the *K*-means method over a varying number of clusters for all four locations. Encouraged by the results of Fig. 8, all 20 predictors have been used to reduce the dimensionality of the phase space. The dotted line represents an average of the four locations. Thus, 24 clusters appear to be the optimum configuration for *K*-means regime clustering.

For the genetic algorithm clustering technique, one must predetermine the number of cluster boundaries to be applied. The location of these boundaries, and to which predictor they apply, is determined by the algorithm. Figure 10 demonstrates the effect of changing the number of cluster boundaries, and thus the number of clusters, on the algorithm’s performance. The number of clusters is equal to 2* ^{b}*, where

*b*is the number of boundaries; thus, 1, 2, 3, 4, and 5 boundaries correspond to 2, 4, 8, 16, and 32 clusters. The optimum number of boundaries (thus clusters) is less apparent for this algorithm, as Astoria favors five boundaries, while Portland performs best with only one. Examining the average performance of genetic algorithm clustering over all four locations (the solid line) suggests that a smaller number of clusters (2–8) yield forecasts with greater skill than a larger number of clusters (16–32). While the genetic algorithm performed best on the training data, it fared only moderately well on the independent data. An additional disadvantage of the technique is its greater computation time. The nature of the GA boundaries, which are by design perpendicular to the axes in the predictor variable phase space, may prevent optimum cluster membership from being attained. There may also be too many free parameters in the problem for the size of the training dataset. Since the GA-based clustering is a novel technique, it may improve with further development.

For regime regression, one can vary the number of predictor variables used in the stepwise multivariate linear regression. On dependent data, the skill levels increased with an increasing number of predictors until 14 predictor variables were attained. On independent data, the skill remained negative with respect to the equal-weighted average for all numbers of predictors (1–20), and did not vary by more than 1%. This result suggests that the poor skill of regime regression on independent data is not caused by a simple overfitting of the available predictor variables, but rather that the technique may require a larger training corpus of forecast dates. The nature of the technique may force unreliable extrapolation when confronted by regime information that is quite different than that in the training dataset.

### d. Summary

Table 4 traces the mean absolute error of the ensemble forecasts from the raw data through the latest regime-based consensus forecasts. Note that the percent improvements discussed previously have been computed within their own dataset (dependent, independent, or entire), and thus one will not get the same results if computing percentages from this table. Lines 1 and 3 show MAE averaged over all ensemble members, while lines 2 and 4 show the MAE for the individual ensemble member that performed the best over the entire time period. Bias correction (which implicitly includes a lapse rate adjustment) reduces the typical member MAE by roughly 0.4°C, and the best member MAE by 0.3°C. Taking the ensemble mean (equal-weighted average) reduces the error from that of a given ensemble member by a similar margin. The equal-weighted average also outperforms the best ensemble member by 0.15°C. Additional gains from using performance weighting on independent data are smaller (on the order of 0.02°C). The *K*-means regime clustering results in a performance-weighted ensemble consensus forecast that has typically smaller MAEs than that of traditional *N*-day performance weighting. The GA regime clustering performs similarly to *N*-day performance weighting, whereas regime regression causes degradation in skill. These results, especially the *K*-means regime clustering, demonstrate the enhancement of performance weighting produced by considering atmospheric regime information when generating optimal ensemble model consensus forecasts.

## 7. Conclusions

This work has tested the relative merits of several postprocessing techniques for the creation of a deterministic consensus forecast from ensemble model output for a particular forecasting location. Figure 11 presents a flowchart of the methods used in this study. The test bed consists of a 12-month time series of UWME+ ensemble forecasts for surface temperature at four locations in the Pacific Northwest (Portland, Olympia, Astoria, and Redmond). Two techniques use PCs of atmospheric fields to group the forecast dates into clusters. The first uses the *K*-means clustering algorithm, which maximizes similarities in the PCs for each regime. The second contains a genetic algorithm to determine the optimum cluster rules with the goal of reducing the MAEs of the resulting consensus forecasts. A final technique uses multivariate linear regression of atmospheric regime principal components to predict ensemble member forecast errors. The *K*-means clustering technique was the most successful at reducing the MAE below that of the sliding-window performance-weighted averaging (by greater than 0.5%). These results demonstrate the enhancement of optimal ensemble model consensus forecasts by considering atmospheric regime information, which is statistically significant, as seen from Table 3.

The improvements in forecast skill demonstrated here are preliminary since they represent a single forecast variable (2-m temperature) at four locations, limited training data, and a limited set of training techniques. A single year’s worth of data were used in developing and testing the techniques. Further skill might be obtained if a larger (i.e., multiyear) training dataset were available. Applying the “perfect prog” assumption and using the ensemble centroid at the time of forecast verification rather than the ensemble initializations for the regime information would likely improve performance. Moreover, there are a multitude of other techniques that could be applied to regime-based consensus forecasting. The promising results of this case study indicate that further research is merited. Additional studies, however, need to confirm their superior performance on longer datasets and for other locations. Although currently used for deterministic forecasts, the clustering paradigm could be applied eventually to probabilistic forecasting using implementations such as Bayesian model averaging (Raftery et al. 2005). Tuning the genetic algorithm parameters, as well as generalizing the cluster rules, may improve the performance of the GA clustering technique. Methods that permit membership in several atmospheric regimes simultaneously could be devised that would allow for nonlinear dependence of skill, in contrast with the linear dependence permitted in the regression model described here. Alternatively, one could devise a scheme for regime-dependent bias correction, which would use the regime information at an earlier stage and allow one to maintain multiple ensemble member forecasts. It would be interesting to compare such a scheme to the techniques developed here. Finally, our techniques could be expanded to a grid-based forecasting domain, and include several forecast variables simultaneously.

## Acknowledgments

The authors thank Dr. John H. E. Clark, Dr. Hampton N. Shirer, Dr. Eugenia Kalnay, and Dr. David Stauffer for providing valuable feedback on this project. Thanks also are due to Ben Root, Rich Grumm, and Paul Knight for facilitating the acquisition of the NARR data, and to the University of Washington for enabling public access to its ensemble data. Addressing the comments of two anonymous reviewers greatly improved this paper. This research was funded in part by the PSU Applied Research Lab Honors Program.

## REFERENCES

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

### APPENDIX

#### Ensemble Preprocessing and Bias Correction

This appendix provides the details of the preprocessing of the UWME+ ensemble data and the bias correction procedures. Figure A1 depicts the spatial extent of the 32-km and nested 12-km UWME domains; the 12-km resolution data are used in this project (section 2).

Figure A2 presents a histogram of simple error (signed absolute error, or the forecast value minus the observed value) for the UWME+ forecasts at Portland from 1 September 2005 through 31 August 2006. The bias for the entire ensemble, ranging from −0.57° to −1.51°C across locations, motivates the need for a bias correction scheme.

Bias correction is applied separately to each ensemble member. The bias-corrected forecast for date *i*, *f* ′_{i}, can be described as follows:

where the *f _{i}* is the raw forecast on date

*i*,

*W*is the bias correction window (a set of dates) of size

_{i}*w*used for date

_{i}*i*,

*j*is a date within the bias correction window, and

*f*and

_{j}*o*are the forecast and observation, respectively, for date

_{j}*j*. Equation (A1) corrects for bias by subtracting from each forecast the mean difference between forecasts and observations in a given window of dates.

There are several ways to set the bias correction window, *W _{i}*. One way is to use the entire dependent training dataset

^{A1}for

*W*; we call this bias correction scheme B1. The correction determined from using B1 on the dependent dataset was applied to the dependent and independent forecasts separately (the first two rows of Table A1). Eckel and Mass (2005) used a “2-week running mean” bias correction in order to capture the temporal variability of the individual biases, as they found a

*W*of the previous 14 days to be most effective. We term this bias correction B2. In B2, if 14 dates were not available prior to the given forecast date (at the beginning of the dataset), as many days as were available were used. Method B2 operates on

_{i}*f*independently of the given day’s verification data

_{i}*o*, since day

_{i}*i*is not included in

*W*. Thus, B2 was applied to the entire set of dates and is summarized in the final two rows of Table A1.

_{i}As expected, the dependent data exhibit a slightly greater improvement. Approach B2 reduces the error on the independent data by a nearly identical percentage as does B1. Eckel and Mass (2005), however, note that bias correction factors can be dependent upon synoptic regime. Thus, B2 would not be independent of regime, and may “contaminate” the effectiveness of the regime clustering to be applied later in the project. Thus, B1 was selected as the bias correction scheme, and was applied to both the dependent and independent datasets for use throughout the rest of this case study.

## Footnotes

*Corresponding author address:* Sue Ellen Haupt, Applied Research Laboratory, The Pennsylvania State University, P.O. Box 30, State College, PA 16804. Email: haupts2@asme.org

See section 5d for a description of the cross-validation selection of the training data.