• Bakhshaii, A., and R. Stull, 2009: Deterministic ensemble forecasts using gene-expression programming. Wea. Forecasting, 24, 1431–1451, doi:10.1175/2009WAF2222192.1.

  • Brier, G. W., 1950: Verification of forecasts expressed in terms of probability. Mon. Wea. Rev., 78, 1–3, doi:10.1175/1520-0493(1950)078<0001:VOFEIT>2.0.CO;2.

  • Cooksey, R. W., 1996: Judgment Analysis: Theory, Methods and Applications. Academic Press, 407 pp.

  • Coulibaly, P., 2004: Downscaling daily extreme temperatures with genetic programming. Geophys. Res. Lett., 31, L16203, doi:10.1029/2004GL020075.

  • Crochet, P., 2004: Adaptive Kalman filtering of 2-metre temperature and 10-metre wind speed forecasts in Iceland. Meteor. Appl., 11, 173–187, doi:10.1017/S1350482704001252.

  • Cui, B., Z. Toth, Y. Zhu, and D. Hou, 2012: Bias correction for global ensemble forecast. Wea. Forecasting, 27, 396–410, doi:10.1175/WAF-D-11-00011.1.

  • Darwin, C., 1859: On the Origin of Species. Bantam Classic Edition, 495 pp.

  • Darwin, C., 1871: The Descent of Man. Penguin Classics, 791 pp.

  • Fogel, L. J., 1999: Intelligence through Simulated Evolution: Forty Years of Evolutionary Programming. John Wiley, 162 pp.

  • Fogel, L. J., A. J. Owens, and M. J. Walsh, 1966: Artificial Intelligence through Simulated Evolution. John Wiley, 170 pp.

  • Gibbons, J. F., and S. Mylroie, 1973: Estimation of impurity profiles in ion-implanted amorphous targets using joined half-Gaussian distributions. Appl. Phys. Lett., 22, 568–569, doi:10.1063/1.1654511.

  • Hamill, T. M., and J. S. Whitaker, 2007: Ensemble calibration of 500-hPa geopotential height and 850-hPa and 2-m temperatures using reforecasts. Mon. Wea. Rev., 135, 3273–3280, doi:10.1175/MWR3468.1.

  • Hamill, T. M., J. S. Whitaker, and S. Mullen, 2006: Reforecasts: An important dataset for improving weather predictions. Bull. Amer. Meteor. Soc., 87, 33–46, doi:10.1175/BAMS-87-1-33.

  • Haupt, R. L., and S. E. Haupt, 2000: Optimum population size and mutation rate for a simple real genetic algorithm that optimizes array factors. Appl. Comput. Electromag. Soc. J., 15, 94–102.

  • Haupt, S. E., G. S. Young, and C. T. Allen, 2006: Validation of a receptor–dispersion model coupled with a genetic algorithm using synthetic data. J. Appl. Meteor. Climatol., 45, 476–490, doi:10.1175/JAM2359.1.

  • Hénon, M., 1976: A two-dimensional mapping with a strange attractor. Commun. Math. Phys., 50, 69–77, doi:10.1007/BF01608556.

  • Hillis, W. D., 1990: Co-evolving parasites improve simulated evolution as an optimization procedure. Physica D, 42, 228–234, doi:10.1016/0167-2789(90)90076-2.

  • Hoeting, J. A., D. Madigan, A. E. Raftery, and C. T. Volinsky, 1999: Bayesian model averaging: A tutorial. Stat. Sci., 14, 382–417, doi:10.1214/ss/1009212519.

  • Homleid, M., 1995: Diurnal corrections of short-term surface temperature forecasts using the Kalman filter. Wea. Forecasting, 10, 689–707, doi:10.1175/1520-0434(1995)010<0689:DCOSTS>2.0.CO;2.

  • John, S., 1982: The three-parameter two-piece normal family of distributions and its fitting. Comm. Stat. Theory Methods, 11, 879–885, doi:10.1080/03610928208828279.

  • Kong, A., and Coauthors, 2012: Rate of de novo mutations and the importance of father’s age to disease risk. Nature, 488, 471–475, doi:10.1038/nature11396.

  • Lakshmanan, V., 2000: Using a genetic algorithm to tune a bounded weak echo region detection algorithm. J. Appl. Meteor., 39, 222–230, doi:10.1175/1520-0450(2000)039<0222:UAGATT>2.0.CO;2.

  • Lorenz, E. N., 1963: Deterministic nonperiodic flow. J. Atmos. Sci., 20, 130–141, doi:10.1175/1520-0469(1963)020<0130:DNF>2.0.CO;2.

  • Lorenz, E. N., 2005: Designing chaotic models. J. Atmos. Sci., 62, 1574–1587, doi:10.1175/JAS3430.1.

  • Mercer, A. E., C. M. Shafer, C. A. Doswell III, L. M. Leslie, and M. B. Richman, 2012: Synoptic composites of tornadic and nontornadic outbreaks. Mon. Wea. Rev., 140, 2590–2608, doi:10.1175/MWR-D-12-00029.1.

  • Mirus, K. A., and J. C. Sprott, 1999: Controlling chaos in low- and high-dimensional systems with periodic parametric perturbations. Phys. Rev. E Stat. Phys. Plasmas Fluids Relat. Interdiscip. Topics, 59, 5313–5324.

  • Monteith, K., J. Carroll, K. Seppi, and T. Martinez, 2011: Turning Bayesian model averaging into Bayesian model combination. Proc. Int. Joint Conf. on Neural Networks (IJCNN'11), San Jose, CA, IEEE, 2657–2663.

  • Murphy, A. H., 1973: A new vector partition of the probability score. J. Appl. Meteor., 12, 595–600, doi:10.1175/1520-0450(1973)012<0595:ANVPOT>2.0.CO;2.

  • O’Steen, L., and D. Werth, 2009: The application of an evolutionary algorithm to the optimization of a mesoscale meteorological model. J. Appl. Meteor. Climatol., 48, 317–329, doi:10.1175/2008JAMC1967.1.

  • Persson, A., 1991: Kalman filtering—A new approach to adaptive statistical interpretation of numerical meteorological forecasts. WMO Training Workshop on the Interpretation of NWP Products in Terms of Local Weather Phenomena and Their Verification, Wageningen, Netherlands, WMO, WMO PSMP Rep. Series 34, XX-27–XX-32.

  • Raftery, A. E., T. Gneiting, F. Balabdaoui, and M. Polakowski, 2005: Using Bayesian model averaging to calibrate forecast ensembles. Mon. Wea. Rev., 133, 1155–1174, doi:10.1175/MWR2906.1.

  • Roebber, P. J., 1998: The regime dependence of degree day forecast technique, skill, and value. Wea. Forecasting, 13, 783–794, doi:10.1175/1520-0434(1998)013<0783:TRDODD>2.0.CO;2.

  • Roebber, P. J., 2010: Seeking consensus: A new approach. Mon. Wea. Rev., 138, 4402–4415, doi:10.1175/2010MWR3508.1.

  • Roebber, P. J., 2013: Using evolutionary programming to generate skillful extreme value probabilistic forecasts. Mon. Wea. Rev., 141, 3170–3185, doi:10.1175/MWR-D-12-00285.1.

  • Roebber, P. J., and L. F. Bosart, 1996: The contributions of education and experience to forecast skill. Wea. Forecasting, 11, 21–40, doi:10.1175/1520-0434(1996)011<0021:TCOEAE>2.0.CO;2.

  • Ross, G. H., 1987: An updateable model output statistics scheme. Programme on Short- and Medium-Range Series, WMO Rep. 25, World Meteorological Organization, 25–28.

  • Simonsen, C., 1991: Self adaptive model output statistics based on Kalman filtering. WMO Training Workshop on the Interpretation of NWP Products in Terms of Local Weather Phenomena and Their Verification, Wageningen, Netherlands, WMO, WMO PSMP Rep. Series 34, XX-33–XX-37.

  • Stevens, M. H. H., M. Sanchez, J. Lee, and S. E. Finkel, 2007: Diversification rates increase with population size and resource concentration in an unstructured habitat. Genetics, 177, 2243–2250, doi:10.1534/genetics.107.076869.

  • Stewart, T., P. J. Roebber, and L. F. Bosart, 1997: The importance of the task in analyzing expert judgment. Organ. Behav. Hum. Decis. Processes, 69, 205–219, doi:10.1006/obhd.1997.2682.

  • Teisberg, T. J., R. F. Weiher, and A. Khotanzad, 2005: The economic value of temperature forecasts in electricity generation. Bull. Amer. Meteor. Soc., 86, 1765–1771, doi:10.1175/BAMS-86-12-1765.

  • Wilson, L. J., and M. Vallée, 2002: The Canadian Updateable Model Output Statistics (UMOS) system: Design and development tests. Wea. Forecasting, 17, 206–222, doi:10.1175/1520-0434(2002)017<0206:TCUMOS>2.0.CO;2.

  • Yamagiwa, J., J. Kahekwa, and A. K. Basabose, 2003: Intra-specific variation in social organization of gorillas: Implications for their social evolution. Primates, 44, 359–369, doi:10.1007/s10329-003-0049-5.

  • Yang, H. T., C. M. Huang, and C. L. Huang, 1996: Identification of ARMAX model for short-term load forecasting: An evolutionary programming approach. IEEE Trans. Power Syst., 11, 403–408, doi:10.1109/59.486125.

  • Yuval, and W. W. Hsieh, 2003: An adaptive nonlinear MOS scheme for precipitation forecasts using neural networks. Wea. Forecasting, 18, 303–310, doi:10.1175/1520-0434(2003)018<0303:AANMSF>2.0.CO;2.


Evolving Ensembles

Paul J. Roebber

University of Wisconsin–Milwaukee, Milwaukee, Wisconsin

Abstract

An ensemble forecast method using evolutionary programming, including various forms of genetic exchange, disease, mutation, and the training of solutions within ecological niches, is presented. A 2344-member ensemble generated in this way is tested for 60-h minimum temperature forecasts for Chicago, Illinois.

The ensemble forecasts are superior in both ensemble average root-mean-square error and Brier skill score to those obtained from a 21-member operational ensemble model output statistics (MOS) forecast. While both ensembles are underdispersive, spread calibration produces greater gains in probabilistic skill for the evolutionary program ensemble than for the MOS ensemble. When a Bayesian model combination calibration is used, the skill advantage for the evolutionary program ensemble relative to the MOS ensemble increases for root-mean-square error, but decreases for Brier skill score. Further improvement in root-mean-square error is obtained when the raw evolutionary program and MOS forecasts are pooled, and a new Bayesian model combination ensemble is produced.

Future extensions to the method are discussed, including those capable of producing more complex forms, those involving 1000-fold increases in training populations, and adaptive methods.

Corresponding author address: Paul J. Roebber, Atmospheric Science Group, Department of Mathematical Sciences and School of Freshwater Sciences, University of Wisconsin–Milwaukee, 3200 North Cramer Ave., Milwaukee, WI 53211. E-mail: roebber@uwm.edu


1. Introduction

In On the Origin of Species, Charles Darwin (1859) wrote: “Can it … be thought improbable … that other variations useful in some way to each being in the great and complex battle of life, should sometimes occur in the course of thousands of generations? If such do occur, can we doubt … that individuals having any advantage, however slight … would have the best chance of surviving and of procreating their kind?” This is the conceptual basis of both natural selection and of its digital doppelgänger, evolutionary programming (EP), a process in which simulated evolution is used to find solutions to problems as diverse as the sorting of numbers (Hillis 1990), forecasting short-term power loads (Yang et al. 1996), downscaling temperatures (Coulibaly 2004), and forming optimal forecast algorithms (Roebber 2010). Although the application of this idea in meteorological fields is relatively new (e.g., Lakshmanan 2000; Coulibaly 2004; Haupt et al. 2006; O’Steen and Werth 2009; Bakhshaii and Stull 2009; Roebber 2010, 2013), work in this area has been ongoing since the 1960s (e.g., Fogel 1999).

In Roebber (2013), the “method of tribes” was introduced, a concept similar to the “multiple worlds” of Bakhshaii and Stull (2009), although in the former instance this idea was used as a means to generate large ensembles of algorithms. These ensembles were shown to produce probabilistic forecasts for 500-hPa height with superior skill to those obtained from a model reforecast ensemble (Hamill et al. 2006).

Nonetheless, a number of outstanding issues remain in developing the probabilistic form of EP. The most prominent, and the focus here, is the question of how to optimize training. On the one hand, minimizing mean-square error (MSE) will provide the best performing individual algorithm. On the other hand, a population of individual algorithms trained in this way tends to be more alike than is desirable for providing unbiased probability estimates (e.g., Roebber 2013, 3182–3183; see also section 3 below). Further, although skillful, the EP ensemble forecasts exhibited less reliability (a measure of the consistency between forecast probabilities and observed event frequencies) than those from the reforecast model ensemble [see Hamill et al. (2006) for details concerning the reforecast ensemble]. Finally, the method of EP ensemble generation was not previously applied to temperature forecasts, where calibration appears to be most needed (e.g., Hamill and Whitaker 2007).

Here, using both simulated (the appendix) and real data (section 2), we present an EP method that produces skillful and reliable ensembles for minimum temperature forecasts (sections 3 and 4; full details of the method are documented and justified in the appendix). The results from a trained ensemble using the method are detailed in section 4, along with an extension using a more sophisticated calibration method. Finally, in section 5, we present a discussion of future directions.

2. Chicago (ORD) dataset

Observed and forecast data were collected for Chicago, Illinois (ORD), for April 2008 through February 2012 (Table 1). After accounting for missing data, this dataset includes 1370 individual forecast dates. The observed data are maximum and minimum temperature, precipitation amount category (0 if less than 1 mm, 1 if at least 1 mm and less than 5 mm, 2 if at least 5 mm), and snow on the ground (1 if at least 25 mm, 0 otherwise). These observations were supplemented with the last daily average temperature (through 0000 UTC of the forecast issue time) at three “upstream” sites: Des Moines, Iowa (DSM); Minneapolis, Minnesota (MSP); and St. Louis, Missouri (STL). Such data might be relevant, for example, when certain synoptic patterns progress across the region (e.g., heat waves and cold waves), but for which the model guidance displays a bias. The use of observed data in this way was productive in prior minimum temperature EP model development (Roebber 2010).

Table 1.

Potential cues for 60-h minimum temperature forecasts for Chicago, Illinois (ORD). Cues based on observations are shown in bold. Square brackets indicate that the cue is derived from an ensemble.


Forecast data are minimum temperature at 60 h from the 21-member Global Forecast System (GFS) ensemble model output statistics (MOS) issued at 0000 UTC (hereafter, GFS ensemble MOS). This ensemble consists of a control and 20 bred perturbation runs, to which the operational GFS MOS prediction equations are applied. Also obtained from each member of the GFS ensemble MOS were dewpoint temperature, cloud cover (clear, partly cloudy, overcast), and 12-h precipitation probability. Additionally, since wind speed is not an output of the GFS ensemble MOS, single-value wind speed forecasts for these time ranges were obtained from the 0000 UTC cycle of the extended-range GFS-based MOS (MEX) guidance. The GFS ensemble mean MOS temperature from the prior 0000 UTC run, validating at the same time, was also obtained as a lagged-ensemble measure of forecast variability.

To account for spatial gradients, MEX forecasts of temperature, dewpoint temperature, wind speed, cloud cover, and precipitation probability were obtained for Davenport, Iowa (DVN), and Green Bay, Wisconsin (GRB). Additional potential predictors were the sine and cosine of the Julian day.

To reduce the dimensionality of the training problem, a maximum likelihood factor analysis was applied to a subset of these data, collapsing the combined model measures of temperature (including dewpoint), wind speed, and precipitation to three orthogonal factors (using varimax rotation), with one factor each projecting largely on temperature (57%), precipitation (73%), and wind (89%), as objectively assessed using relative weights analysis (see Cooksey 1996; Roebber 1998). Note that we retained the ORD wind speed forecast as a separate input, despite constructing a wind speed factor, to account for local as well as regional influences of the wind. Although the ORD wind speed forecast and the wind speed factor are correlated, the distribution of differences between these two normalized measures is approximately Gaussian; thus, in about 32% of the training cases the difference between the regional and local winds is at least approximately 4 kt (2.1 m s−1), and in 5% of the cases it exceeds 8 kt (4.1 m s−1).
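A minimal sketch of this dimensionality-reduction step is given below, assuming scikit-learn (version 0.24 or later, which supports varimax rotation); the data and variable names are synthetic placeholders, not the study's dataset.

```python
# Sketch (not the author's code) of maximum likelihood factor analysis with
# varimax rotation, collapsing model-based predictors to three factors.
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(690, 12))   # stand-in for the combined model predictors

X_std = StandardScaler().fit_transform(X)
fa = FactorAnalysis(n_components=3, rotation="varimax")  # scikit-learn >= 0.24
factors = fa.fit_transform(X_std)  # (690, 3): e.g., temperature, precip, wind factors
print(fa.components_.round(2))     # loadings: which inputs project on each factor
```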

Additionally, a stepwise multiple linear regression analysis was performed on the set of predictors, excluding the three observation sites, in effect producing a “nested MOS” forecast (hereafter MLR). This approach was shown in Roebber (2010) to provide a highly competitive forecast for minimum temperature. The MLR produced in this way used a combination of seven inputs, in decreasing order of importance (as determined by the variance accounted for by the input alone): the forecast temperature factor, the cosine of the Julian day, the ensemble mean cloud cover, the forecast wind speed factor, the forecast precipitation factor, the forecast cloud cover at DVN, and the precipitation persistence. With these steps, the final set of potential forecast cues consisted of 19 inputs (Table 2).

Table 2.

As in Table 1, but for final cues for 60-h minimum temperature forecasts for Chicago, Illinois (ORD). Cues based on observations are shown in bold.


These 1370 dates were split, with the first 690 days used for training and the final 680 days for independent testing. The sampling is such that there is a ~2% greater percentage of days each in the September–October–November and December–January–February (DJF) periods in the training data compared to the test data. These differences, while small, result in a greater root-mean-square error (RMSE) in the GFS ensemble MOS forecasts for the training period than for the test period (4.49° and 4.21°F, respectively).

3. Evolutionary program ensembles

As developed by Roebber (2010, 2013), the basic “genetic architecture” of EP consists of a series of IF–THEN equations involving linear and nonlinear combinations of predictors (Fig. 1), a structure chosen to be flexible yet still allow interpretation of the forecast logic. Each algorithm’s genome consists of the specific variables, mathematical operators, and values of the coefficients within the IF–THEN equations. A population of algorithms with this underlying structure is initialized randomly, and the level of “fitness” of each solution is assigned based on the MSE obtained when it is applied to the training dataset. Fit algorithms are allowed to propagate some portion of their components to the next generation—via crossover, the genetic exchange between two parent algorithms; mutation, a random change to one component of a parent, passed to the next generation; and cloning, an identical copy of the parent algorithm [see Figs. 1 and 2 and further discussion below; see Fogel et al. (1966) for an early description of this approach in the context of computer algorithms].

Fig. 1.

Algorithm structure. The 10 EP-genes each contain 5 variables (blue), 1 relational operator (orange), 2 mathematical operators (green), and 3 coefficients (red). Here (top left) a father and (top right) a mother produce a (bottom right) child algorithm through crossover. The modified lines of the child (cf. the father) are indicated in red (lines 1, 3, 6, 7, and 9), and mutated components of EP-genes 5 and 7 are shown in yellow (in lines 5 and 7). Note that the coefficients are abbreviated to three digits here, but are carried to five digits in the calculations.


Fig. 2.

Schematic illustration of sexual selection for a population of 40 algorithms, including 20 males (black squares, M1, M2, …, M20) and 20 females (red circles, F1, F2, … , F20). Of these, 10 males and 10 females pass the selection criterion. The top-ranked male M1 is cloned (large C), and also mates with the 10 females that exceed the MSE criterion. One of these child algorithms also contains a mutation (large M). See text for details.


A gradually tightening MSE threshold with genetic exchange between fit algorithms drives progress toward improved performance over successive generations. With these rules, the ith algorithm has the following form:
$$T_i = \sum_{j=1}^{10} \delta_{ij}\left( C1_{ij} V3_{ij} \; O1_{ij} \; C2_{ij} V4_{ij} \; O2_{ij} \; C3_{ij} V5_{ij} \right), \qquad (1)$$

where

$$\delta_{ij} = \begin{cases} 1 & \text{if } V1_{ij} \; OR_{ij} \; V2_{ij} \text{ is true} \\ 0 & \text{otherwise,} \end{cases}$$

and V1ij, V2ij, V3ij, V4ij, and V5ij can be any of the input variables; C1ij, C2ij, and C3ij are real-valued multiplicative constants in the range ±1; ORij is a relational operator (≤, >); and O1ij and O2ij are addition or multiplication operators. The input variables and the forecast variable are normalized to (0, 1) based on the minimum and maximum of the training data.
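To make the genome concrete, the following is a minimal illustrative re-implementation (not the author's code) of evaluating one algorithm per (1); left-to-right evaluation of the THEN clause is an assumption, and the dictionary representation is purely for exposition.

```python
# Evaluate one EP algorithm: ten IF-THEN EP-genes, each with five variables,
# one relational operator, two mathematical operators, and three coefficients.
import operator
import random

REL_OPS = {"<=": operator.le, ">": operator.gt}
MATH_OPS = {"+": operator.add, "*": operator.mul}
N_GENES, N_INPUTS = 10, 19

def random_gene():
    return {
        "vars": [random.randrange(N_INPUTS) for _ in range(5)],    # V1..V5
        "rel": random.choice(list(REL_OPS)),                       # OR_ij
        "ops": [random.choice(list(MATH_OPS)) for _ in range(2)],  # O1, O2
        "coefs": [random.uniform(-1.0, 1.0) for _ in range(3)],    # C1..C3
    }

def evaluate(genome, x):
    """x: input vector normalized to (0, 1); returns the normalized forecast."""
    total = 0.0
    for g in genome:
        v = [x[i] for i in g["vars"]]
        if REL_OPS[g["rel"]](v[0], v[1]):      # IF V1 OR_ij V2 THEN ...
            term = g["coefs"][0] * v[2]        # C1 x V3
            term = MATH_OPS[g["ops"][0]](term, g["coefs"][1] * v[3])  # O1 C2 x V4
            term = MATH_OPS[g["ops"][1]](term, g["coefs"][2] * v[4])  # O2 C3 x V5
            total += term
    return total

genome = [random_gene() for _ in range(N_GENES)]
print(evaluate(genome, [random.random() for _ in range(N_INPUTS)]))
```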

Note that while the genome consists of 10 IF–THEN statements, these can be zeroed out as the solutions evolve, through a conditional that can never be satisfied. Likewise, IF–THEN statements (hereafter, EP-genes) can also evolve to ensure that the statement will always be executed. Thus, the final form of an algorithm usually consists of a “base” forecast equation with a small number of modifiers (see below). The structure of (1) differs from Roebber (2013) in that an ensemble deflation factor is no longer incorporated. Further discussion of this point is provided below.

One could extend this deliberately simple architecture to include higher-order polynomials and transcendental functions, and specific EP-genes could reference other EP-genes. This increase in complexity, however, would come at the cost of an increase in the search space that would need to be explored to find effective solutions and a potential loss of interpretability. Given the relative success of linear regression in meteorological forecasting in general, the possible additional benefit of these approaches likely would be outweighed by the costs. Nonetheless, this represents an area of research that could be addressed in future work, and is briefly explored in the conclusions.

A critical aspect of evolution, commented upon at length by Darwin (1871) in The Descent of Man, is the role of sexual selection: “… it appears that the strongest and most vigorous males … have prevailed under nature, and have led to the improvement of the natural breed or species. A slight degree of variability leading to some advantage, however slight, in reiterated deadly contests would suffice for the work of sexual selection.” This idea was first proposed in the context of evolutionary programming in the 1960s, but was not then implemented owing to computational limitations (Fogel 1999).

Accordingly, we have added a simulated sexual selection pressure to the procedure (Fig. 2), as follows. First, each algorithm in the initial population is randomly (but with equal probability) assigned as either male or female. Next, all of the male algorithms are ranked in order of fitness as defined by MSE on the training dataset. Then, the top-ranked male program is cloned and also allowed to “mate” with up to 10 female programs, selected randomly from the set of qualifying female programs (based on MSE). Each mating produces a “child” algorithm (randomly selected as either male or female) through crossover, while the maternal program is then removed from the pool of potential mates. A certain fraction of these offspring will experience mutations. This process continues until no more qualifying programs (male or female) remain in that generation. Note that with the above rules, only the top 10% of all qualifying male programs will produce offspring.

Although this behavior is not viewed favorably in most modern human societies, it does ensure the rapid propagation of genetic material in a population, while still preserving the opportunity for new combinations (through crossover) and innovation (through mutation). It is worth noting that analogs to this social structure are abundant in nature, with perhaps the closest parallel being that of the gorilla (e.g., Darwin 1871; Yamagiwa et al. 2003). Tests of this technique (see the appendix) showed that comparably performing solutions required on average 53%–80% fewer generations than those evolving through the standard method of Roebber (2010), depending on the size of the initial algorithm population.
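A hedged sketch of one generation of this sexual-selection step follows: rank qualifying males by training MSE, clone the top male, and let each breeding male mate with up to 10 qualifying females, each female mating at most once per generation. The Algorithm container and the crossover stub are illustrative assumptions, not the author's implementation.

```python
import copy
import random
from dataclasses import dataclass

@dataclass
class Algorithm:
    genome: list
    sex: str          # "M" or "F"
    train_mse: float

def crossover(father, mother):
    # Placeholder: a full implementation follows the crossover and mutation
    # rules sketched later in this section.
    return Algorithm(copy.deepcopy(father.genome), random.choice("MF"), float("inf"))

def one_generation(population, mse_threshold):
    males = sorted((a for a in population if a.sex == "M" and a.train_mse <= mse_threshold),
                   key=lambda a: a.train_mse)
    females = [a for a in population if a.sex == "F" and a.train_mse <= mse_threshold]
    offspring = []
    for rank, male in enumerate(males):
        if not females:
            break                                  # lower-ranked males find no mates
        if rank == 0:
            offspring.append(copy.deepcopy(male))  # the top-ranked male is also cloned
        for mother in random.sample(females, min(10, len(females))):
            offspring.append(crossover(male, mother))
            females.remove(mother)                 # each female mates once per generation
    return offspring
```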

Consider Fig. 1, in which three algorithms trained to produce 60-h minimum temperature forecasts at Chicago are depicted: two parent algorithms (top row) and a child algorithm (bottom right), which has also undergone two types of mutation. Note that in the case of the paternal algorithm (Fig. 1, top left), EP-genes 2, 4, 6, and 10 are always invoked, while EP-genes 1, 3, 7, and 8 are never invoked. Thus, as mentioned above, the algorithm structure can be readily simplified to
a base forecast equation (2), built from the EP-genes that are always invoked, and modified according to two conditions, (2a) and (2b), where T* is then scaled up to the physical range according to the maximum and minimum of the minimum temperature in the training data.

This algorithm sets a baseline based on the forecast temperature factor, modified by forecast cloud coverage, wind speed, and current temperature conditions upstream. This baseline is adjusted according to the presence of snow, provided that the forecast calls for cold conditions in (2a), and some weight is given to the MLR forecast if it is transitioning to wet conditions in (2b). The maternal algorithm (Fig. 1, top right), when simplified, shows similar logic, but, for example, does not conditionally weight the MLR forecast and adjusts more strongly to snow cover. These differences result in an RMSE of 4.12°F (4.37°F) for the paternal (maternal) algorithm forecasts on the training data. Note that in this paper, temperature will be expressed on the Fahrenheit scale to match the reported units of both the forecast and observed data. Additionally, although training is conducted using MSE as a metric, results will generally be reported in terms of RMSE for convenience of interpretation.

Crossover is obtained by first seeding the child algorithm with the paternal EP-genes. Next, there is a 50% probability that the paternal EP-gene will be replaced by the maternal EP-gene. In that case, however, the paternal coefficients [C1ij, C2ij, and C3ij, where the notation is as in (1)] are retained. This has occurred in Fig. 1 for EP-genes 1, 3, 6, 7, and 9. Once crossover is completed, a check for mutation is executed. Mutations can occur in at most 1 of the 10 EP-genes, and if a mutation is to occur, the specific line is first selected. Then, the mutation will be identified as one and only one of the following, with per-component probability p: a change to a new variable V1ij (p = 0.031 25); a change to one of the operators ORij, O1ij, or O2ij (p = 0.031 25 each); a change to a new variable V2ij, V3ij, V4ij, or V5ij (p = 0.015 625 each); a change from the variable V2ij, V3ij, V4ij, or V5ij to a constant = 1 (p = 0.015 625 each); or a change to a coefficient C1ij, C2ij, or C3ij (p = 0.0833 each). Thus, the overall chance of mutation (the sum of these per-component probabilities) is 50%; a mutation is shown in Fig. 1 to have occurred in EP-gene 7, coefficient C1ij. Note that a coefficient mutation is accomplished using a random draw from [−1, 1].

A second variety of mutation, known as transposition (copy error), is also included. This is also considered after crossover is completed, but can occur whether or not there has already been a mutation. The probability of transposition occurring is also set to 50%. Once transposition is indicated, it is accomplished by selecting 1 of 4 EP-gene segments (V1ij | ORij | V2ij; C1ij × V3ij | O1ij; C2ij × V4ij | O2ij; or C3ij × V5ij) from 1 of the 10 lines of a child algorithm and copying that segment to a different line. For the child algorithm of Fig. 1, this has occurred in the C2ij × V4ij | O2ij segment of EP-gene 5, but here has had the same effect as a standard coefficient mutation.
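The sketch below renders these crossover, mutation, and transposition rules in code, using the gene representation from the earlier Eq. (1) sketch. Interpreting the quoted probabilities as applying per component is an assumption (they then sum to the stated 50% overall mutation chance).

```python
import copy
import random

def crossover_genes(father_genome, mother_genome):
    child = copy.deepcopy(father_genome)
    for j in range(len(child)):
        if random.random() < 0.5:                  # swap in the maternal EP-gene...
            gene = copy.deepcopy(mother_genome[j])
            gene["coefs"] = copy.deepcopy(father_genome[j]["coefs"])  # ...keeping paternal C1-C3
            child[j] = gene
    return child

def mutate(genome, n_inputs=19):
    g = genome[random.randrange(len(genome))]      # at most one EP-gene mutates
    r = random.random()
    if r < 0.03125:                                # new variable V1
        g["vars"][0] = random.randrange(n_inputs)
    elif r < 0.12500:                              # change OR, O1, or O2 (3 x 0.03125)
        k = random.randrange(3)
        if k == 0:
            g["rel"] = random.choice(["<=", ">"])
        else:
            g["ops"][k - 1] = random.choice(["+", "*"])
    elif r < 0.18750:                              # new V2..V5 (4 x 0.015625)
        g["vars"][random.randrange(1, 5)] = random.randrange(n_inputs)
    elif r < 0.25000:                              # V2..V5 -> constant 1 (4 x 0.015625)
        g["vars"][random.randrange(1, 5)] = None   # evaluation would read None as 1
    elif r < 0.50000:                              # redraw a coefficient (3 x 0.0833)
        g["coefs"][random.randrange(3)] = random.uniform(-1.0, 1.0)
    # else (r >= 0.5): no mutation this generation

def transpose(genome):
    if random.random() >= 0.5:                     # transposition also occurs 50% of the time
        return
    src, dst = random.sample(range(len(genome)), 2)
    s, d = genome[src], genome[dst]
    seg = random.randrange(4)   # 0: V1|OR|V2, 1: C1xV3|O1, 2: C2xV4|O2, 3: C3xV5
    if seg == 0:
        d["vars"][0], d["rel"], d["vars"][1] = s["vars"][0], s["rel"], s["vars"][1]
    elif seg == 3:
        d["coefs"][2], d["vars"][4] = s["coefs"][2], s["vars"][4]
    else:
        d["coefs"][seg - 1], d["vars"][seg + 1], d["ops"][seg - 1] = (
            s["coefs"][seg - 1], s["vars"][seg + 1], s["ops"][seg - 1])
```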

The example child algorithm (Fig. 1, bottom right) greatly resembles the father, except that the snow correction is now weighted more heavily (as with the mother). EP-genes 1, 3, and 8 are switched off, but this latent genetic diversity, which here would provide input from observed upstream conditions at MSP and forecast winds at ORD, could be tapped and/or modified in subsequent generations to produce variations in response. Owing to the late stage of the training, however, most further changes will lead to degraded performance—here, the crossover and mutations resulted in a small improvement in the RMSE to 4.07°F.

An illustration of sexual selection is provided in Fig. 2, for a population of 40 algorithms in an environment in which the carrying capacity is also a population of 40. These 40 algorithms are evenly split by gender, and ranked from 1 to 40 according to MSE on the available training data. In this example, suppose that 10 males and 10 females satisfy the MSE performance criterion. Then, the following occurs. The top-ranked male (M1) is cloned and also mates with the 10 females that satisfy the MSE criterion (F1, F2, …, F10). Although mates are selected randomly, only 10 females qualify in this simple example, so all 10 are selected. The next 9 ranked males (M2, M3, …, M10) also satisfy the MSE criterion, but they cannot find qualified mates in this example, and thus do not produce members of the next generation. Here, 11 offspring would be produced: 10 through crossover and 1 through cloning.

Genetic bottlenecks are avoided as follows. If for any reason the total qualifying population of the best tribe decreases below 10 members, the MSE criterion is adjusted upward by 10% (i.e., selection pressure is relaxed). Although poorer performing members of the population are prevented from producing offspring, any such algorithms whose “position” in the population has not yet been replaced by offspring are carried to the next generation. Thus, under circumstances where the overall population has decreased below the critical number and the MSE criterion is relaxed, some of these “ghost” algorithms may again participate in genetic exchange. If not, however, they will gradually be replaced by better-performing offspring over succeeding generations.

In the above example, the 11 offspring replace 11 of the 20 non-qualifying algorithms (selected at random), leaving 9 ghost algorithms and the original 20 qualifying algorithms to fill the remaining 29 positions. Then, the MSE criterion is tightened and some fraction of these 40 algorithms will be disqualified at the next stage, depending on their performance (note that unless the MSE criterion is relaxed, none of the 9 ghosts would qualify). Provided that the original 20 qualifying algorithms satisfy the tightened MSE criterion at the next step, they would again qualify to produce new algorithms (in the case of the males, their ranking and the number of available mates would determine whether or not they actually did so).

Proceeding in this way, solutions may settle on suboptimal performance plateaus, perhaps owing to saturation of the “gene pool” by a small number of individuals. To reduce the likelihood of this occurring, we introduce the concept of “fatal disease,” an idea similar to that of Hillis (1990), who employed “coevolving parasites.” The details of this implementation depend on the tribe or niche with which an algorithm is associated, and are further explained below.

Variation in the conditions in which the evolution occurs might affect the diversity of the gene pool. These include adjustments to the number of mating partners, the number of children per female, and modified restrictions on the eligibility of female algorithms. The increased combinations produced during crossover, while often producing changes that are impediments to performance, can sometimes produce new, beneficial genetic forms. These variations are incorporated into the training through the establishment of virtual “ecological niches.” In nature, organisms adapt to the unique conditions of their environment, providing ample opportunity for otherwise similar populations to diverge from one another (recall that the beak variations of the Galápagos finches provided considerable evolutionary inspiration to Darwin). As in Roebber (2013), here the overall population is divided into 20 “tribes,” in analogy with the organization of early human societies. Each tribe, however, inhabits a unique ecological niche (see Table A1). Inputs are made available in each niche according to “model” (GFS ensemble MOS, MEX, and the three model factors), “observations” (the two persistence inputs, and the three upstream observation sites), “model” combined with MLR, or observations combined with MLR. All of the niches include the sine and cosine of the Julian day.

Some niches are disease free, while others harbor fatal disease, and still others are home to nonfatal, but mutation-causing diseases. Fatal disease has a 50% chance of affecting any member possessing the suspect EP-gene, which occurs in proportion to its frequency in the population. Mutation-causing diseases are widespread, such that every individual in an affected tribal niche experiences one mutation each generation.

Sexual reproduction is varied in two ways. In the first mode, in a given generation, a male can mate with two partners and produce five children with that mating partner, where that partner is selected from any available female within the tribe (hereafter, mode 2–5–A–T, for two partners–five children–any available female–within tribe). In the second mode, a male can mate with 10 partners, producing 1 child per mating partner, with those partners being selected from any MSE-qualified female from any tribe (hereafter, mode 10–1–Q–A, for 10 partners–1 child–qualified female–any tribe). In the latter case, however, the children stay within the male tribe. The overall “carrying capacity” for the 20 tribes together is set to 10 000 individuals. If a specific tribal membership exceeds 1000, then there is a 50% chance that a given member will move, either to an adjacent tribe (e.g., member of tribe 1 moves to tribe 2; 25% chance) or a distant tribe (defined as five tribes away and including wraparound, constituting a fully different niche; e.g., member of tribe 1 moves to tribe 6; 25% chance). In the case of a nonoptimal tribe falling below 10 members, those individuals disperse to an adjacent tribe, thus potentially introducing genetic diversity.
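An illustrative sketch of this per-generation niche bookkeeping follows: fatal disease strikes carriers of a "suspect" EP-gene with 50% probability, and crowded tribes (more than 1000 members) shed members to adjacent or distant tribes with 25% probability each. The data structures and helper names are hypothetical.

```python
import random

N_TRIBES = 20

def apply_fatal_disease(tribe_members, suspect_gene, carries):
    """Remove each carrier of the suspect EP-gene with 50% probability."""
    return [a for a in tribe_members
            if not (carries(a, suspect_gene) and random.random() < 0.5)]

def migrate(tribes):
    """tribes: dict mapping tribe index -> list of member algorithms."""
    for t in range(N_TRIBES):
        if len(tribes[t]) > 1000:
            for member in list(tribes[t]):
                r = random.random()
                if r < 0.25:
                    dest = (t + 1) % N_TRIBES      # adjacent tribe
                elif r < 0.50:
                    dest = (t + 5) % N_TRIBES      # distant tribe (with wraparound)
                else:
                    continue                        # member stays put (50% chance)
                tribes[t].remove(member)
                tribes[dest].append(member)
    return tribes
```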

Here, inputs are treated conceptually like cultural norms rather than available food sources. In other words, all algorithms are initialized (and mutated) with the above niche-based restrictions, but in any genetic exchange or tribal adoption, EP-genes that include norms from, for example, an observations niche can also be used in a model niche, and vice versa. Thus, successful input practices developed in one niche can be transported to another. Disease and mating modes, however, are fully restricted and confined to each niche (see Table A1).

Selection pressure on individuals is based on a tightening MSE threshold, which is dropped by 5% with each step after 5 iterations; after 30 iterations, selection additionally requires matching the cumulative distribution function of each algorithm to that of the training data climatology. The remaining individuals from each tribe form a tribal consensus, which is also compared to the MSE threshold. If no tribe meets the threshold at any step, then the threshold is iteratively relaxed until a qualifying tribe is found. Additionally, if the total algorithm population across all tribes falls below 2500 members (i.e., 25% of the carrying capacity of 10 000) or the standard deviation of the generated tribal ensemble for the training data is less than 1°F, then the MSE threshold is relaxed by 1%. All individuals are ranked according to MSE, and training is stopped once at least 100 iterations are complete and the best individual RMSE at the current step differs from the average best individual RMSE over the last 50 iterations by 0.02 or less. The final ensemble is constructed using the surviving members of the best tribe (defined by the ensemble mean MSE on the training data).
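A schematic sketch of the outer training loop implied by these rules is shown below. The generation step and population statistics are passed in as hypothetical callables, and the CDF-matching criterion applied after 30 iterations is omitted for brevity.

```python
def train(population, initial_threshold, step_generation,
          best_rmse_of, spread_of, population_size_of):
    threshold = initial_threshold
    best_history = []
    iteration = 0
    while True:
        iteration += 1
        population = step_generation(population, threshold)
        if iteration > 5:
            threshold *= 0.95                     # tighten selection pressure by 5%
        if population_size_of(population) < 2500 or spread_of(population) < 1.0:
            threshold *= 1.01                     # relax by 1% if diversity collapses
        best = best_rmse_of(population)
        best_history.append(best)
        if (iteration >= 100 and
                abs(best - sum(best_history[-50:]) / 50.0) <= 0.02):
            break                                 # best individual RMSE has plateaued
    return population
```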

Outside of disease, mutations and transpositions occur with 50% probability. Although such mutation rates may seem high, lower bound estimates of the human per capita mutation rate are on the order of 100 per generation (Kong et al. 2012). Experiments with the four types of mutation used here (to coefficients, variables, operators, and using transposition), show that regardless of the stage of the evolution (early, middle, or late; see the appendix for details), alterations sometimes can be produced that lead to performance improvements.

One issue that arises in training is how to minimize ensemble mean MSE while maximizing ensemble diversity. Consider Fig. 3, which shows the root-mean-square difference (RMSD) between any two algorithms as a function of the RMSE for these same two algorithms, for all combinations of algorithms, computed from the training sample. It is clear that a way to maximize variance between algorithms is to combine a more effective algorithm with a less effective algorithm, while, as might be supposed, the way to minimize ensemble MSE is to combine the best individuals. Evidently the best-performing algorithms are considerably more alike than any two randomly selected algorithms (see also, e.g., the paternal and child algorithms of Fig. 1). As will be shown in section 4, however, when extended across the entire ensemble, sufficient diversity can be maintained for deriving skillful probabilities without sacrificing practical performance.

Fig. 3.

Root-mean-square difference (RMSD) between any two algorithms as a function of the RMSE (100 × °F) of these same two algorithms, for all combinations of algorithms, computed from the training sample. Color coding indicates RMSD of <0.5°F (dark blue) to 2.5°F (dark red), in increments of 0.5°F.


4. Results

a. Simple correction

Following training using the methods of section 3, a 2344-member ensemble was obtained from tribe 16 (observations and MLR niche, fatal disease, and mating mode 10–1–Q–A). When considering only those portions of the genome that are always invoked, we find that six inputs are most often utilized by the ensemble members: temperature factor, wind factor, prior day average temperature at STL, sine of the Julian day, ensemble mean cloud coverage, and prior day average temperature at DSM (Fig. 4). Of these, the most heavily weighted are the first three of the inputs listed above.

Fig. 4.

Weights from the ensemble of EP algorithms based on training data performance. Shown are the average weight (absolute value), the frequency of usage (bubble size), and the weight variance (bubble color) of each input across the set of 2344 algorithms.


There is considerable diversity in the use of this information by individual algorithms, however, as indicated by the weight variance (Fig. 4). Note that three additional variables appear with high frequency when qualifiers are considered, specifically, cosine of the Julian day (100%), MLR (98%), and snow on the ground (88%). Note that these results also provide a posteriori evidence that retaining the ORD wind speed forecast as a measure of local departures from regional wind conditions was unnecessary, since the training largely filtered out its use (i.e., ORD wind speed was used in only about 5% of the algorithm population).

One can also examine diversity by considering a second ensemble trained without niches using the most general setup (i.e., the niche for all tribes is the same as that used for tribe 20: all inputs are available, fatal disease is present, and the mating mode is 10–1–Q–A). For these data, if the control on variance during training is removed (i.e., the standard deviation of the generated tribal ensemble for the training data is no longer required to be at least 1°F), there is on average a threefold decrease in ensemble variance compared to when niche-based training is used. Conversely, if the control on variance is imposed, the RMSE of the ensemble mean is ~6% higher than that obtained when using niches.

As stated in section 3, the best performing algorithms tend to be more alike, but there exists a range of “good” algorithms for which diversity of solutions does exist. This variability between reasonable forecast approaches is analogous to the formation of consensus by skilled forecasters when confronted with the same information (e.g., Roebber and Bosart 1996; Stewart et al. 1997), and supports the idea that useful probabilistic information can be obtained in this way. Nonetheless, while Roebber (2013) showed that the EP ensemble skill and resolution were superior to those obtained from the 15-member GFS-based reforecast ensemble (Hamill et al. 2006), its reliability was lower. This suggests that a direct spread calibration approach might prove useful and is employed here.

Although there is a small amount of skewness in the forecast distributions toward lower temperatures (as might be suggested by radiative physics), the Gaussian (normal) distribution provides an excellent approximation. Thus, for simplicity and with little loss of information, we set the calibrated variance to

$$S_{\mathrm{cal}}^{2} = I\,S^{2}, \qquad (3)$$

where S2 is the ensemble variance for a given forecast and I is an inflation factor, set to a fixed value for all cases. The inflation factor is fit over the entire training sample by requiring that 90% of the observations are contained within the 5th and 95th percentile temperatures implied by the corresponding normal distribution [note that a more sophisticated calibration, using Bayesian model combination (BMC), is tested in section 4b]. To remove any bias from the ensemble mean, we also use a weighted correction following Cui et al. (2012):

$$B_{\mathrm{new}} = (1 - w)\,B_{\mathrm{past}} + w\,B_{\mathrm{current}}, \qquad (4)$$
where Bcurrent is the forecast error from the last day, Bpast is the past accumulated bias, and Bnew is the updated accumulated bias (which will become Bpast for the next forecast). The next forecast is then adjusted by subtracting Bnew from the raw forecast. Here, we set w = 0.15.
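A worked sketch of these two calibration steps follows: fit a single inflation factor I so that the Gaussian 5%–95% interval captures 90% of training observations [Eq. (3)], then run the decaying-average bias correction with w = 0.15 [Eq. (4)]. All data here are synthetic placeholders, and the simple grid search for I is an illustrative choice.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
mean = rng.normal(30.0, 10.0, size=690)        # ensemble-mean forecasts (degF)
spread = np.full(690, 2.0)                     # raw (underdispersive) ensemble spread
obs = mean + rng.normal(0.0, 4.0, size=690)    # verifying observations

z95 = norm.ppf(0.95)                           # ~1.645

def coverage(I):
    half_width = z95 * np.sqrt(I * spread**2)  # Eq. (3): calibrated variance = I * S^2
    return np.mean(np.abs(obs - mean) <= half_width)

candidates = np.linspace(1.0, 25.0, 1000)
cov = np.array([coverage(I) for I in candidates])
I_fit = candidates[np.argmin(np.abs(cov - 0.90))]  # inflation giving ~90% coverage

# Eq. (4): B_new = (1 - w) * B_past + w * B_current, subtracted from the next forecast
w, bias = 0.15, 0.0
for f, o in zip(mean, obs):
    f_corrected = f - bias                     # apply the accumulated bias
    bias = (1.0 - w) * bias + w * (f - o)      # update with the latest forecast error
print(round(float(I_fit), 2))
```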

We find that the calibrated 90% interval brackets the observed value 89% of the time on the test data, with an RMSE of 3.86°F, an 8.3% reduction in RMSE compared to that of the 60-h GFS ensemble MOS (RMSE = 4.21°F). Notably, we find that the GFS ensemble MOS lowest and highest forecast members bracket the observed temperatures only 40% of the time. The new EP ensemble also compares favorably to an ensemble trained using the simpler method of Roebber (2013), which attains an RMSE of 4.14°F on these same data (not shown).

Of relevance is the contribution to the EP ensemble performance from the spread calibration obtained through (3) and the bias calibration obtained through (4). We will return to this issue below when discussing Brier skill score (BSS) results. The raw EP ensemble [i.e., uncalibrated, using neither (3) nor (4)] obtains an RMSE of 3.97°F on the test data, but the 90% interval verifies only 34% of the time. Bias correction through (4) alone improves the EP RMSE to 3.86°F. The GFS MOS ensemble 90% interval, based on a normal distribution approximation but without inflation through (3), also brackets the observed value 34% of the time. When inflation through (3) alone is applied to the EP forecasts, the RMSE remains unchanged at 3.97°F but the 90% interval brackets the observed value 90% of the time. As will be demonstrated below, when inflation is applied to the GFS MOS ensemble, bracketing improves to 90%, but at a cost of excessively wide forecast ranges.

In principle, bias correction should not be needed for the GFS forecasts since this is part of the MOS development, but we find that if (4) is applied to the GFS MOS ensemble, the RMSE improves to 4.01°F. Thus, the percent improvement of the raw EP forecast over the raw MOS ensemble is 5.7%, with each displaying overly restrictive confidence intervals, while the fully corrected EP forecast improves RMSE relative to the corrected MOS by 3.7%, with calibrated confidence intervals. While these are somewhat incremental improvements in forecast error, as has been discussed elsewhere (e.g., Roebber 2010), such improvements can be extraordinarily valuable in specific user contexts.

The distribution of the temperature ranges encompassed by the EP 90% interval is bimodal in DJF (Fig. 5). The bulk of the range (73% of the cases) is from 10° to 20°F, but a second mode occurs at 23°F. These range intervals are clearly separated by the observed temperatures, with the coldest temperatures associated with the largest uncertainties. Notably, a second high uncertainty mode occurs with warmer DJF temperatures, in association with the passage of synoptic waves.

Fig. 5.

Observed temperature as a function of 90% temperature interval. Contours indicate relative density of points (darker indicates more points).


The portion of the independent test period from 26 December 2010 to 4 January 2011 (Figs. 6 and 7) illustrates many of the findings regarding the relative performance of the EP and GFS ensembles. The RMSE over these 10 days was 4.69°F (6.07°F) for the EP (GFS) ensemble, with the EP outperforming or tying the GFS ensemble MOS on 8 of the dates, while bracketing the observed minimum temperature on 9 of the dates (cf. only 1 day for the GFS ensemble MOS). Note that with spread calibration through (3) and bias correction through (4), the GFS MOS ensemble RMSE remains high at 5.93°F but brackets the observed minimum temperature on eight of the dates.

Fig. 6.

Minimum temperature forecasts at 60 h for the period 26 Dec 2010–4 Jan 2011. Shown are the observed temperature (black line and points), EP ensemble mean forecast temperature (green), and GFS ensemble MOS temperature (orange). Also shown are the calibrated EP ensemble 5% (blue dotted) and 95% (red dotted) temperatures, and the lowest and highest GFS ensemble MOS minimum temperature values (orange triangles).


Fig. 7.

Observations at ORD for the period 0000 UTC 26 Dec 2010–0000 UTC 5 Jan 2011. Shown are temperature (red), dewpoint temperature (green), sea level pressure (dashed), cloud coverage (black rectangles), and wind speed (red dots).


The spread calibration, owing to the limited variability in the raw MOS forecasts (Fig. 6), can produce considerably larger 90% ranges (Fig. 8). For 1 January 2011, for example, when the observed overnight minimum temperature fell to 28°F, the EP ensemble provides a range of 16°–41°F centered about the ensemble mean of 29°F, representing the uncertainty of the timing of the frontal passage. In this case, the observed temperature peaked at 53°F at 0200 UTC 1 January, and had begun falling by 0500 UTC, reaching 28°F at 1200 UTC and continuing downward to 19°F by 1800 UTC (Fig. 7). Note that during this period, the size of the EP ensemble 90% confidence interval ranges from a low of 10°F (on 3 January, under relatively quiescent conditions) to highs of 25°F on 30 December and 1 January, dates in which rapid warming and cooling, respectively, occurred. In contrast, for 1 January, the calibrated GFS ensemble MOS produces an excessively large range of 11°–55°F, with a mean forecast of 33°F. Thus, the alternatives for the GFS ensemble MOS are to use an uncalibrated (and therefore biased low) confidence interval, or a calibrated, wide confidence interval.

Fig. 8.

Box plot of calibrated 90% temperature intervals for the EP and GFS ensembles. The lines in the box plot show the 10th, 25th, 50th, 75th, and 90th percentile values. The points show outliers beyond the 90th percentile.


We examine the skill of the two ensemble forecast approaches using the BSS (Brier 1950), which as shown by Murphy (1973) can be partitioned into resolution (a measure of the correctness of forecast deviations from the climatological probabilities, i.e., forecast usability) and reliability components. Note that a reliable system would be one in which events occur at about the same frequency as the predicted probabilities. Thus, a climatological forecast has no resolution but is perfectly reliable.

The BSS decomposition is written as

$$\mathrm{BSS} = \frac{\mathrm{RES} - \mathrm{REL}}{\mathrm{UNC}}, \qquad (5)$$

where RES is the resolution, REL is the reliability (with zero indicating perfect reliability), and UNC is the uncertainty. As in Roebber (2013), we report the complement of the reliability scaled by UNC, the reliability fraction (REL*):

$$\mathrm{REL}^{*} = 1 - \frac{\mathrm{REL}}{\mathrm{UNC}}, \qquad (6)$$

such that REL* increases for better reliability. Likewise, we scale the resolution by UNC to obtain the resolution fraction (RES*):

$$\mathrm{RES}^{*} = \frac{\mathrm{RES}}{\mathrm{UNC}}. \qquad (7)$$

Accordingly, BSS becomes RES* + REL* − 1.
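The sketch below computes the Murphy (1973) partition used in Eqs. (5)–(7) from binned probability forecasts of a binary event; the binning scheme and synthetic data are illustrative assumptions.

```python
import numpy as np

def brier_partition(p, y, n_bins=11):
    """p: forecast probabilities in [0, 1]; y: binary outcomes (0/1)."""
    ybar = y.mean()
    unc = ybar * (1.0 - ybar)                   # uncertainty
    bins = np.clip((p * n_bins).astype(int), 0, n_bins - 1)
    rel = res = 0.0
    for k in range(n_bins):
        m = bins == k
        if m.any():
            nk, pk, ok = m.sum(), p[m].mean(), y[m].mean()
            rel += nk * (pk - ok) ** 2          # reliability term
            res += nk * (ok - ybar) ** 2        # resolution term
    rel, res = rel / len(y), res / len(y)
    bss = (res - rel) / unc                     # Eq. (5)
    rel_star = 1.0 - rel / unc                  # Eq. (6)
    res_star = res / unc                        # Eq. (7); BSS = RES* + REL* - 1
    return bss, res_star, rel_star

rng = np.random.default_rng(2)
p = rng.uniform(size=5000)
y = (rng.uniform(size=5000) < p).astype(float)  # perfectly reliable forecasts
print(brier_partition(p, y))
```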

While calibration improves the GFS ensemble MOS BSS, the EP ensemble is skillful and superior across the full temperature range (Fig. 9). Much of this additional skill is provided by better resolution (Fig. 10). For example, we find that at the extremes represented by the 0°–10°F and 60°–70°F ranges, the EP resolution averages 4.7% and 10.4% better than the GFS ensemble MOS, respectively, while the skill differential over these same ranges is 5.9% and 10.2%. The effects of bias correction on the overall BSS (i.e., for all values of temperature) are minimal, with the EP ensemble BSS improving by only 0.1% when bias correction through (4) alone is applied. As might be expected, improvement in BSS is substantial for both the EP and GFS MOS ensembles when spread calibration through (3) is applied, with improvements of 38% and 32% (to BSS values of +10.1% and +5.4%), respectively. We conclude that the effort to emphasize diversification of the EP ensemble has helped to extract additional skill relative to the GFS MOS ensemble, but that the EP ensemble is still insufficiently diverse. This is confirmed through examination of the uncalibrated EP and GFS MOS ensemble rank histograms (not shown), which are both U shaped.

Fig. 9.

BSS skill for the EP ensemble 60-h forecast minimum temperature for the range 0°–70°F (solid black), and the skill difference (%) between the EP and GFS ensembles (dashed black).


Fig. 10.

(top to bottom) Resolution fraction (RES*), reliability fraction (REL*), and Brier skill score (BSS) for the EP ensemble (green) and GFS MOS ensemble (orange) forecasts. See text for details. Note that REL* is plotted at half the scale of the other two measures.


b. BMC correction

An intrinsic advantage of the EP ensemble generation method is the ability to produce ensembles with many members from readily available data at minimal computational cost. As shown above, relative to the raw or calibrated and bias-corrected GFS ensemble MOS, this yields superior deterministic and probabilistic skill.

Since both the EP and the GFS ensembles are underdispersive, however, and thus require some form of calibration, it may be that one of several existing, more sophisticated methods could be put to productive use. Here, we employ BMC along with bias correction using (4).

In the meteorological literature, the usual approach has been to use the method of Bayesian model averaging (BMA; e.g., Raftery et al. 2005) rather than BMC. BMA is in many ways similar to BMC, albeit with one important distinction: in BMA, one assumes that a single model member is the true data generating model (DGM) and seeks to select the best model. Consequently, BMA typically weights most heavily a relative few of the ensemble members. For BMC, in contrast, one proceeds from the assumption that no single ensemble member may truly represent the DGM, and instead seeks to select the best combination of model members. BMC has been shown to outperform BMA across a wide variety of datasets (see Monteith et al. 2011).

Unfortunately, owing to computational constraints, whether one chooses BMA or BMC, it is usual to subselect members from the overall ensemble (e.g., Hoeting et al. 1999). In the case of BMC with linear combination (Monteith et al. 2011), stepping through a subset of 10 ensemble members with 4 possible raw weights requires the evaluation of more than 1 000 000 possible combinations (4^10 = 1 048 576), each of which is then normalized so that the weights sum to one.

Since we wish to directly compare the two calibrated ensembles, we subselect from the 21 GFS MOS ensemble members and the 2344 EP ensemble members using the following logic. First, all ensemble members were ranked according to their MSE on the training data. Second, the training-data forecast variance between each ensemble pair was compared, and where this variance was less than a critical threshold (set to 0.09°F²), the lower-ranking member of the pair was eliminated from consideration. The 10 highest-ranked remaining members by MSE were then selected for BMC. For the EP ensemble, this resulted in between-member forecast variances ranging from 0.11° to 5.57°F² and RMSE from 4.11° to 4.64°F. For the GFS MOS ensemble, these values were 1.35° to 4.04°F² for variance and 4.51° to 4.67°F for RMSE.
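The following Python sketch illustrates this subselection logic (it is not the study's code); interpreting the "forecast variance between each ensemble pair" as the variance of the difference between the two members' training forecasts is an assumption.

```python
import numpy as np

def subselect_members(forecasts, mse, n_keep=10, var_threshold=0.09):
    """forecasts: (n_members, n_cases) training forecasts; mse: per-member MSE."""
    order = np.argsort(mse)          # rank members, best (lowest MSE) first
    keep = []
    for idx in order:
        # Drop the lower-ranked member of any pair whose between-member
        # forecast variance falls below the critical threshold (0.09 F^2).
        too_close = any(
            np.var(forecasts[idx] - forecasts[j]) < var_threshold for j in keep
        )
        if not too_close:
            keep.append(idx)
        if len(keep) == n_keep:
            break
    return keep
```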

Next, we bias correct each of the ensemble members using (4). Then, we cycle through all possible combinations of weights for the 10 bias-corrected members, producing a weighted forecast after normalizing the weight values so that they sum to one. Following Monteith et al. (2011), we compute the posterior probability for the model combination (e) given the training data (D) according to

\[ P(e \mid D) \propto P(e)\,(1 - \varepsilon)^{r}\,\varepsilon^{\,n-r}, \quad (8) \]

where \(\varepsilon\) is estimated using the average error rate of the model combination on the training data, r is the number of correct predictions, and n is the total number of training cases. For this, we consider a model combination to be correct for a given training case if the weighted forecast is within 5° of the observed value. After all combinations have been considered, the selected model is the one that maximizes the logarithm of (8).
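A minimal Python sketch of this search follows (not the study's code); the raw weight set {0, 1, 2, 3} and a uniform prior P(e) are assumptions, and the exhaustive loop over the 4^10 combinations is written for clarity rather than speed.

```python
import itertools
import numpy as np

def bmc_search(forecasts, obs, n_weights=4, tol=5.0):
    """forecasts: (10, n) bias-corrected member forecasts; obs: (n,) observations."""
    n_members, n = forecasts.shape
    best_logpost, best_w = -np.inf, None
    for raw in itertools.product(range(n_weights), repeat=n_members):
        total = sum(raw)
        if total == 0:
            continue                               # skip the all-zero combination
        w = np.array(raw, dtype=float) / total     # normalize: weights sum to one
        combined = w @ forecasts
        r = np.sum(np.abs(combined - obs) <= tol)  # "correct" if within 5 degrees
        eps = 1.0 - r / n                          # average error rate
        # log of Eq. (8), with a uniform prior over combinations
        logpost = (r * np.log(max(1.0 - eps, 1e-12))
                   + (n - r) * np.log(max(eps, 1e-12)))
        if logpost > best_logpost:
            best_logpost, best_w = logpost, w
    return best_w
```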
Each of the 10 members of the weighted model is assumed to be normally distributed with mean given by the bias-corrected forecast and variance estimated as

\[ \sigma^{2} = \frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{10} w_{k}\left(F_{k,i} - O_{i}\right)^{2}, \]

where w_k is the normalized weight of the kth ensemble member, F_{k,i} is its bias-corrected forecast for training case i, O_i is the observed value, and the summation is over the 10 ensemble members and the n training cases. The resulting probabilistic forecasts can thus be multimodal, although in practice the PDF typically resembles a skewed normal distribution.
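Consistent with the variance expression above, the following sketch evaluates the weighted-mixture forecast PDF; the single shared variance across the 10 components follows that expression, while the function and variable names are illustrative.

```python
import numpy as np
from scipy.stats import norm

def estimate_sigma2(train_forecasts, train_obs, weights):
    """sigma^2 = (1/n) * sum over cases i and members k of w_k (F_{k,i} - O_i)^2."""
    err2 = (train_forecasts - train_obs) ** 2      # shape (10, n)
    return float(weights @ err2.mean(axis=1))

def forecast_pdf(x, member_forecasts, weights, sigma2):
    """Weighted-mixture density at temperatures x for one forecast case."""
    sigma = np.sqrt(sigma2)
    return sum(w * norm.pdf(x, loc=f, scale=sigma)
               for w, f in zip(weights, member_forecasts))
```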

Results from the BMC with linear combination are summarized in Table 3. For the EP ensemble, BMC with linear combination produces slightly lower RMSE than that obtained with bias correction alone, while the BSS performance is marginally worse than that obtained using the simple calibration of (3). The result is quite different for the GFS ensemble, where RMSE increases but BSS improves by 3.5%. We speculate that this differing effect owes to the considerably different initial ensemble sizes of the two methods (2344 vs 21). As stated above, generating large ensembles at relatively low computational cost is a natural consequence of using EP. This test demonstrates that the GFS ensemble's relative disadvantage can be partially overcome using BMC; note, however, that the EP ensemble, whether or not it is corrected using BMC, remains superior to the BMC-calibrated GFS ensemble in both BSS and RMSE (and considerably so in the latter case). Also of interest is that the maximum weighting does not correspond to the lowest-RMSE individual ensemble member, supporting the validity of the combined RMSE ranking and variance approach for model subselection.

Table 3.

Forecast performance of the EP and GFS MOS ensembles using BMC with linear combination. Also shown is a BMC model comprising five EP ensemble members, four GFS MOS members, and the MLR. For reference, the results from the EP and GFS MOS ensembles using the simple calibration with (3) are also shown (denoted basic calibration), along with the GFS MOS ensemble mean RMSE without bias correction (in parentheses). BSS is the percentage above climatology, evaluated on the basis of standard deviations from the monthly mean (see text for details). The minimum and maximum weights for the 10 members are also shown, and the RMSE ranking of the member with maximum weight is given in parentheses.

These results suggest that a third approach might be profitably applied. Effectively, the EP method provides a “new” set of models. One could pool the raw EP and GFS ensembles along with the single MLR forecast, and apply BMC with linear combination to that set as above. The result, after model subselection and BMC, is a weighted forecast consisting of five EP members, four GFS MOS members, and the MLR. This combined model, by weight, is 58% EP, 34% GFS, and 8% MLR, and improves RMSE by 10% (0.41°F) over the raw GFS ensemble and 6% (0.23°F) over the BMC GFS ensemble. Meanwhile, the combined model remains about as skillful as the EP and somewhat better than the BMC GFS ensemble in terms of BSS.

Refinements to the BMC procedure are also possible. For example, one could cycle through fewer raw weights and thus allow more ensemble members to be considered (using 3 rather than 4 possible raw weights would allow 13 ensemble members at about the same computational cost, since 3^13 ≈ 1.6 million combinations). Further, as an alternative to linear combination, one could sample from a Dirichlet distribution to estimate the weights. Monteith et al. (2011), however, showed that the improvement in accuracy from this latter procedure relative to linear combination was modest.
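A sketch of the Dirichlet alternative follows; the concentration parameter (all ones, i.e., uniform over the weight simplex) and the number of samples are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_members, n_samples = 13, 1_000_000
# Each row is a candidate weight vector over the 13 members, already
# normalized to sum to one; score each with the same posterior, Eq. (8),
# as in the discrete grid search.
candidate_weights = rng.dirichlet(alpha=np.ones(n_members), size=n_samples)
```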

5. Conclusions

In this paper, we have further explored the question of how to optimize the training of ensemble evolutionary programs, that is, how to balance minimizing ensemble RMSE against maximizing solution diversity. In doing so, we have examined several approaches, most particularly various forms of genetic exchange, disease, and mutation, and the training of solutions within "ecological niches."

The approach used here produced a 2344-member evolutionary program ensemble, providing forecasts of Chicago minimum temperature at 60-h range with an RMSE of 3.86°F on an independent test dataset, compared to an RMSE of 4.21°F for the uncorrected GFS ensemble MOS on these same data. The evolutionary program ensemble's 90% interval brackets the observed value 89% of the time for these data, compared to only 40% of the time for the GFS ensemble MOS (when using the lowest and highest MOS forecast members to define that range). When both ensembles are calibrated using the same methods, whether a simple calibration or a procedure using Bayesian model combination, the evolutionary program ensemble provides deterministic and probabilistic skill superior to the GFS ensemble MOS. Further reduction in RMSE is shown to be possible by pooling the raw EP and GFS MOS ensemble members and applying Bayesian model combination. While the overall improvement of this ensemble relative to the raw MOS guidance is small in an absolute sense (0.41°F; in other words, public forecasts relying on either method would likely be perceived as comparable), these skill differences can produce substantial value in particular weather-decision domains, such as energy demand (e.g., Teisberg et al. 2005; Roebber 2010).

Further refinements are possible. Implementation of a parallelized form of the EP method could increase population sizes by as much as 1000-fold. Although larger populations have not been shown necessarily to produce better solutions (e.g., Haupt and Haupt 2000), the tests reported in this paper using the Lorenz (2005) model (see the appendix) established that larger populations reduce the number of iterations needed to obtain those solutions.

It should be noted that the computational costs of increased ensemble size are not substantial, particularly with respect to operations. Instead, the primary cost of the EP method is the need to construct training datasets for each site and variable for which forecasts are desired. Adaptive systems would appear to be a means of limiting this cost after the initial investment, as has been done with MOS using Kalman filters (e.g., Persson 1991; Simonsen 1991; Homleid 1995; Crochet 2004), with the updateable MOS approach (Ross 1987; Wilson and Vallée 2002), and with artificial neural networks (e.g., Yuval and Hsieh 2003). One might suppose that a statistical method based upon the principles of evolution should be uniquely suited to adaptation. Efforts by the author to develop such a method are already under way.

A single experimental run with the Chicago data using an initial population of 200 000 members also shortened the training (by 12%). Additionally, evidence from nature indicates that population size, given sufficient resource availability, strongly increases genetic diversification (e.g., Stevens et al. 2007). At this point, however, the tests herein have been applied to populations an order of magnitude smaller than might be readily obtainable using a parallelized approach, and thus the effects of larger populations on solution diversity and overall probabilistic performance remain unknown. In that regard, given the tension between optimizing on MSE and on probabilistic performance (and hence, ensemble diversity), another area in need of further research is the choice of stopping criteria. Techniques used in training other nonlinear methods, such as artificial neural networks, would likely provide some guidance.

More complex evolutionary program architectures than the one used here are possible but, as noted previously, come at the cost of an enlarged search space and a potential loss of interpretability. Nonetheless, as a brief consideration of this idea, consider a genetic structure as before, but where the output from any lower-numbered line can be used as a variable input to a higher-numbered line. Thus, for example, the results from EP-gene 1 of a particular algorithm could be fed as any of the three variables into EP-genes 2 and above, the results from EP-gene 2 could be fed as any one of the three variables into EP-gene 3 and above, and so on. This structure allows for the possibility of polynomial solutions of higher order than previously, as well as more complex conditionals (note that we do not add transcendental functions here). Several training runs with this modified architecture were performed using the Chicago data; the chaining idea is sketched below.
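The following Python sketch conveys only the chaining idea, using a deliberately simplified gene form (the actual EP gene grammar is richer): the output of each lower-numbered gene is made available, by name, as an input to every higher-numbered gene.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def evaluate_algorithm(genes, inputs):
    """genes: list of (var_a, op, var_b); names may reference raw inputs
    or 'gene<k>' outputs from earlier (lower-numbered) lines."""
    env = dict(inputs)
    for i, (a, op, b) in enumerate(genes, start=1):
        env[f"gene{i}"] = OPS[op](env[a], env[b])
    return env[f"gene{len(genes)}"]        # the final line is the forecast

# Example: gene2 squares gene1's output, yielding a second-order term
# (input names are illustrative): gene1 = 3.0 * 1.5 = 4.5; gene2 = 4.5**2.
forecast = evaluate_algorithm(
    [("t2m", "*", "wind"), ("gene1", "*", "gene1")],
    {"t2m": 3.0, "wind": 1.5},
)
```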

The best-performing individual algorithm (defined by performance on the training data) using the modified architecture had an RMSE of 3.95°F, compared to 4.07°F for the best individual from the simpler architecture studied in this paper. This new algorithm references five prior lines but can be simplified to a baseline involving the MLR, wind speed, and snow cover; a complex conditional contingent upon precipitation probability and cloud cover (which, if satisfied, invokes precipitation and wind speed); and three additional conditionals factoring in further combinations of wind and snow cover (not shown).

Given the differences in seasonal sampling (section 2), one expects that RMSE might be lower on the test data than on the training data, as occurred with the simpler architecture for both the ensemble and the best-performing individual. Using the more complex architecture, however, the RMSE for the best-performing individual increased to 3.98°F on the test data. Further, the corresponding calibrated best ensemble produces an RMSE of 3.86°F on the test data, identical to that of the simpler architecture reported in section 4, but with a slightly inferior BSS. Thus, the added complexity is suggestive of overfitting, particularly since cross-validation procedures were not used. Regardless, while further research on this architecture is warranted, it appears that real gains will be hard to obtain without new and better predictive inputs.

Finally, one may well ask, where in the larger evolutionary balance does the human forecaster still remain? Perhaps the best forecasters, armed with sophisticated guidance such as presented here, or as extended to more problematic forecasts [e.g., severe weather, using cues at the resolvable scales as identified by Mercer et al. (2012)], could leverage this information in combination with their own evolved understanding to add further utility or value. Whether this will remain possible is a function of the weather decision support context in which such forecasts are issued (e.g., energy demand vs public safety), and societal willingness to invest in the necessary training that would allow forecasters to optimize their use of the range of available information, including analysis of observations, meteorological theory, and the full suite of forecasting tools.

APPENDIX

Training Rule Tests

A series of tests were conducted to find ways to maintain the diversity of algorithmic solutions while minimizing ensemble mean error. The following is a description of these tests.

One might suppose that increasing algorithm population sizes and/or mutation rates would increase the tempo of evolution, since both will add opportunities for innovation. The strong selection pressure applied here means that faster evolution will tend to produce better solutions in the ensemble mean sense more quickly. Fast training, however, tends to produce solutions that are less diverse than otherwise. Thus, there appears to be a trade-off between minimizing mean-square error (MSE) and preserving diverse solutions, since the latter can provide more robust probabilistic information.

Using chaotic datasets obtained from the Hénon map (Hénon 1976) and from a modified form of the Lorenz (2005) model, which resembles 500-hPa height data (Roebber 2013), the effects of increasing population size and mutation rate were examined. Population size strongly influenced the rate of evolution, with excellent solutions developing 30% faster for the Lorenz (2005) data when population sizes increased from 10 000 to 250 000 individuals. Increasing the mutation rate, however, did not tangibly affect the rate of training, likely because mutations in the usual training procedure are common (with a 50% probability of occurrence in a given member of a generation).
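For reference, a chaotic training series of the kind used here can be generated from the Hénon map as follows; the classical parameter values a = 1.4 and b = 0.3 are assumptions, as the paper does not state its choices.

```python
import numpy as np

def henon_series(n, a=1.4, b=0.3, x0=0.1, y0=0.1, discard=1000):
    """Iterate x_{k+1} = 1 - a*x_k^2 + y_k, y_{k+1} = b*x_k (Henon 1976),
    discarding an initial transient before recording n values of x."""
    x, y = x0, y0
    out = np.empty(n)
    for i in range(n + discard):
        x, y = 1.0 - a * x * x + y, b * x
        if i >= discard:
            out[i - discard] = x
    return out
```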

The effect of sexual selection with and without fatal disease was also tested using the Hénon map data. With the addition of fatal disease, the EP method using sexual selection trained in fewer steps and to better solutions (as measured by MSE) than with the standard procedure.

A coupled form of the Lorenz (1963) model (Mirus and Sprott 1999) was employed to test the effect of the number of mating partners (1, 5, and 10) and the number of children per female (1 or 2) on the MSE performance. For these data, these changes had little influence on the final MSE or on the number of iterations to reach that solution. In experiments constructing ensembles using the Chicago training data, however, mutation, transposition, and sexual selection together contributed to higher diversity as measured by the variance of ensemble forecasts for a given event.

The performance of ensembles created from 100 training runs using Chicago minimum temperature data was analyzed using analysis of variance to determine the effect of training niches on ensemble performance. In this experiment, niches that use model inputs, and niches without fatal disease, were favored to survive to the end of training (Table A1). For example, 100% of the training runs produced ensembles from tribe 1, whose niche includes only model data and lacks disease. We also find statistically significant effects of mating mode on the size of the resulting ensemble (with mode 2–5–A–T ensembles averaging about 20% larger) and on MSE (with mode 10–1–Q–A ensembles performing better). Fatal disease also had a statistically significant improving effect on MSE, consistent with the results reported above using the Hénon map.

Table A1.

The EP training "ecological niches" for each tribe. Inputs denote the selection of the 19 input variables used to train that tribe, where model refers to the GFS ensemble MOS and MEX variables, obs refers to the persistence and upstream 24-h average observations, and MLR refers to a nested multiple linear regression model (see text for details). Disease occurs in only some niches and, where present, causes either mutation or the "death" of the algorithm, as indicated. The mating variables include the number of female partners per qualifying male algorithm, the number of children produced per partner, whether the female partner is selected only from inside the tribe or from any tribe, and whether female partners are qualified by MSE. Also shown is the percentage of the 100 trials (Ens) in which a tribe qualified, on the MSE criteria, by the end of training (percentages do not sum to 100 owing to multiple tribe survivals).

Niche restrictions were also tested. Specifically, the niche experiment was repeated, except that algorithms were restricted to using only those inputs allowed in their current niche (e.g., an algorithm that transfers from a "model only" niche to an "observations only" niche would no longer be able to employ model variables in its genetic code). Not surprisingly, in this experiment we find that niches using a combination of model and observational data are favored to survive to the end of training. Interestingly, those with fatal disease are also favored to survive, perhaps because disease prunes algorithms with nonfunctioning genes. We also find that fatal disease and the 10–1–Q–A mating mode have statistically significant effects on the size of the resulting ensembles (larger) and on the MSE for these tribes (improved). Thus, common factors between the two niche experiments appear to be the benefits of fatal disease and of mating mode 10–1–Q–A on ensemble performance. We note, however, that the root-mean-square error (RMSE) of the best-performing ensemble is improved by 5% on the training data using the less restrictive niche rules.A1

The effect of mutation on algorithm performance was tested as a function of the stage of evolution. This was accomplished by identifying three algorithms, representing the early, middle, and late stages of training based on RMSE (11.48°, 8.54°, and 4.16°F, respectively). Each algorithm was then mutated in one of the four possible ways (to coefficients, variables, or operators, or through transposition), and the RMSE of the resulting mutated form was determined. This was repeated 10 000 times to generate distributions of the training RMSE following the one-step mutation (Fig. A1). The median outcome is no change in RMSE, but there is considerable variability depending on the type of mutation and the stage of training. At all stages, any of the four types of mutation can produce alterations that improve the RMSE. The most effective types at the early, middle, and late stages were transpositions, coefficients, and variables, respectively.
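A sketch of this experiment's outer loop follows; `mutate_fn` and `rmse_fn` are hypothetical stand-ins for the EP system's mutation operators and training-RMSE evaluation, supplied by the caller.

```python
import copy
import numpy as np

def mutation_distribution(algorithm, mutate_fn, rmse_fn, n_trials=10_000):
    """Apply a single one-step mutation of one chosen type to a fresh copy of
    the algorithm, record the training RMSE, and repeat to build a distribution."""
    rmses = np.empty(n_trials)
    for t in range(n_trials):
        mutant = mutate_fn(copy.deepcopy(algorithm))   # one-step mutation
        rmses[t] = rmse_fn(mutant)
    # 10th, 50th, and 90th percentiles, as plotted in Fig. A1
    return np.percentile(rmses, [10, 50, 90])
```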

Fig. A1.

RMSE (°F) following a one-step mutation for three algorithms representing early (blue, RMSE before mutation of 11.48°F), middle (green, RMSE of 8.54°F), and late (red, RMSE of 4.16°F) stages of evolution. Results shown for each stage are for mutations of coefficients, variables, operators, and copies (transpositions), and plotted are the 10th percentile (regular triangle), median (rectangle), and 90th percentile (downward triangle) RMSE. Distributions are based on 10 000 repeated experiments.


Four additional approaches to ensemble diversification were attempted: 1) slowing the rate of MSE threshold tightening (from 5% to 1% reduction per generation); 2) reducing the frequency of mutations (from 50% to 10%), while increasing the rate of fatal disease (from 50% to 99%); and two forms of resampling [3) forming an EP ensemble of size M by rerunning the training M times, and selecting the best individual from each run, using niches with jitter to increase the training sample size; 4) using the best individual from a single EP training, but using bootstrapped GFS ensemble MOS “samples” of size 21 to drive an ensemble]. With the exception of method 3, all of these approaches led to increases in RMSE compared to the usual strategy, and methods 3 and 4 also led to decreases in root-mean-square difference.

REFERENCES

  • Bakhshaii, A., and R. Stull, 2009: Deterministic ensemble forecasts using gene-expression programming. Wea. Forecasting, 24, 1431–1451, doi:10.1175/2009WAF2222192.1.

  • Brier, G. W., 1950: Verification of forecasts expressed in terms of probability. Mon. Wea. Rev., 78, 1–3, doi:10.1175/1520-0493(1950)078<0001:VOFEIT>2.0.CO;2.

  • Cooksey, R. W., 1996: Judgment Analysis: Theory, Methods and Applications. Academic Press, 407 pp.

  • Coulibaly, P., 2004: Downscaling daily extreme temperatures with genetic programming. Geophys. Res. Lett., 31, L16203, doi:10.1029/2004GL020075.

  • Crochet, P., 2004: Adaptive Kalman filtering of 2-metre temperature and 10-metre wind speed forecasts in Iceland. Meteor. Appl., 11, 173–187, doi:10.1017/S1350482704001252.

  • Cui, B., Z. Toth, Y. Zhu, and D. Hou, 2012: Bias correction for global ensemble forecast. Wea. Forecasting, 27, 396–410, doi:10.1175/WAF-D-11-00011.1.

  • Darwin, C., 1859: On the Origin of Species. Bantam Classic Edition, 495 pp.

  • Darwin, C., 1871: The Descent of Man. Penguin Classics, 791 pp.

  • Fogel, L. J., 1999: Intelligence through Simulated Evolution: Forty Years of Evolutionary Programming. John Wiley, 162 pp.

  • Fogel, L. J., A. J. Owens, and M. J. Walsh, 1966: Artificial Intelligence through Simulated Evolution. John Wiley, 170 pp.

  • Gibbons, J. F., and S. Mylroie, 1973: Estimation of impurity profiles in ion-implanted amorphous targets using joined half-Gaussian distributions. Appl. Phys. Lett., 22, 568–569, doi:10.1063/1.1654511.

  • Hamill, T. M., and J. S. Whitaker, 2007: Ensemble calibration of 500-hPa geopotential height and 850-hPa and 2-m temperatures using reforecasts. Mon. Wea. Rev., 135, 3273–3280, doi:10.1175/MWR3468.1.

  • Hamill, T. M., J. S. Whitaker, and S. Mullen, 2006: Reforecasts: An important dataset for improving weather predictions. Bull. Amer. Meteor. Soc., 87, 33–46, doi:10.1175/BAMS-87-1-33.

  • Haupt, R. L., and S. E. Haupt, 2000: Optimum population size and mutation rate for a simple real genetic algorithm that optimizes array factors. Appl. Comput. Electromag. Soc. J., 15, 94–102.

  • Haupt, S. E., G. S. Young, and C. T. Allen, 2006: Validation of a receptor–dispersion model coupled with a genetic algorithm using synthetic data. J. Appl. Meteor. Climatol., 45, 476–490, doi:10.1175/JAM2359.1.

  • Hénon, M., 1976: A two-dimensional mapping with a strange attractor. Commun. Math. Phys., 50, 69–77, doi:10.1007/BF01608556.

  • Hillis, W. D., 1990: Co-evolving parasites improve simulated evolution as an optimization procedure. Physica D, 42, 228–234, doi:10.1016/0167-2789(90)90076-2.

  • Hoeting, J. A., D. Madigan, A. E. Raftery, and C. T. Volinsky, 1999: Bayesian model averaging: A tutorial. Stat. Sci., 14, 382–417, doi:10.1214/ss/1009212519.

  • Homleid, M., 1995: Diurnal corrections of short-term surface temperature forecasts using the Kalman filter. Wea. Forecasting, 10, 689–707, doi:10.1175/1520-0434(1995)010<0689:DCOSTS>2.0.CO;2.

  • John, S., 1982: The three-parameter two-piece normal family of distributions and its fitting. Comm. Stat. Theory Methods, 11, 879–885, doi:10.1080/03610928208828279.

  • Kong, A., and Coauthors, 2012: Rate of de novo mutations and the importance of father’s age to disease risk. Nature, 488, 471–475, doi:10.1038/nature11396.

  • Lakshmanan, V., 2000: Using a genetic algorithm to tune a bounded weak echo region detection algorithm. J. Appl. Meteor., 39, 222–230, doi:10.1175/1520-0450(2000)039<0222:UAGATT>2.0.CO;2.

  • Lorenz, E. N., 1963: Deterministic nonperiodic flow. J. Atmos. Sci., 20, 130–141, doi:10.1175/1520-0469(1963)020<0130:DNF>2.0.CO;2.

  • Lorenz, E. N., 2005: Designing chaotic models. J. Atmos. Sci., 62, 1574–1587, doi:10.1175/JAS3430.1.

  • Mercer, A. E., C. M. Shafer, C. A. Doswell III, L. M. Leslie, and M. B. Richman, 2012: Synoptic composites of tornadic and nontornadic outbreaks. Mon. Wea. Rev., 140, 2590–2608, doi:10.1175/MWR-D-12-00029.1.

  • Mirus, K. A., and J. C. Sprott, 1999: Controlling chaos in low- and high-dimensional systems with periodic parametric perturbations. Phys. Rev. E, 59, 5313–5324.

  • Monteith, K., J. Carroll, K. Seppi, and T. Martinez, 2011: Turning Bayesian model averaging into Bayesian model combination. Proc. Int. Joint Conf. on Neural Networks (IJCNN'11), San Jose, CA, IEEE, 2657–2663.

  • Murphy, A. H., 1973: A new vector partition of the probability score. J. Appl. Meteor., 12, 595–600, doi:10.1175/1520-0450(1973)012<0595:ANVPOT>2.0.CO;2.

  • O’Steen, L., and D. Werth, 2009: The application of an evolutionary algorithm to the optimization of a mesoscale meteorological model. J. Appl. Meteor. Climatol., 48, 317–329, doi:10.1175/2008JAMC1967.1.

  • Persson, A., 1991: Kalman filtering—A new approach to adaptive statistical interpretation of numerical meteorological forecasts. WMO Training Workshop on the Interpretation of NWP Products in Terms of Local Weather Phenomena and Their Verification, Wageningen, Netherlands, WMO, WMO PSMP Rep. Series 34, XX-27–XX-32.

  • Raftery, A. E., T. Gneiting, F. Balabdaoui, and M. Polakowski, 2005: Using Bayesian model averaging to calibrate forecast ensembles. Mon. Wea. Rev., 133, 1155–1174, doi:10.1175/MWR2906.1.

  • Roebber, P. J., 1998: The regime dependence of degree day forecast technique, skill, and value. Wea. Forecasting, 13, 783–794, doi:10.1175/1520-0434(1998)013<0783:TRDODD>2.0.CO;2.

  • Roebber, P. J., 2010: Seeking consensus: A new approach. Mon. Wea. Rev., 138, 4402–4415, doi:10.1175/2010MWR3508.1.

  • Roebber, P. J., 2013: Using evolutionary programming to generate skillful extreme value probabilistic forecasts. Mon. Wea. Rev., 141, 3170–3185, doi:10.1175/MWR-D-12-00285.1.

  • Roebber, P. J., and L. F. Bosart, 1996: The contributions of education and experience to forecast skill. Wea. Forecasting, 11, 21–40, doi:10.1175/1520-0434(1996)011<0021:TCOEAE>2.0.CO;2.

  • Ross, G. H., 1987: An updateable model output statistics scheme. Programme on Short- and Medium-Range Series, WMO Rep. 25, World Meteorological Organization, 25–28.

  • Simonsen, C., 1991: Self adaptive model output statistics based on Kalman filtering. WMO Training Workshop on the Interpretation of NWP Products in Terms of Local Weather Phenomena and Their Verification, Wageningen, Netherlands, WMO, WMO PSMP Rep. Series 34, XX-33–XX-37.

  • Stevens, M. H. H., M. Sanchez, J. Lee, and S. E. Finkel, 2007: Diversification rates increase with population size and resource concentration in an unstructured habitat. Genetics, 177, 2243–2250, doi:10.1534/genetics.107.076869.

  • Stewart, T., P. J. Roebber, and L. F. Bosart, 1997: The importance of the task in analyzing expert judgment. Organ. Behav. Hum. Decis. Processes, 69, 205–219, doi:10.1006/obhd.1997.2682.

  • Teisberg, T. J., R. F. Weiher, and A. Khotanzad, 2005: The economic value of temperature forecasts in electricity generation. Bull. Amer. Meteor. Soc., 86, 1765–1771, doi:10.1175/BAMS-86-12-1765.

  • Wilson, L. J., and M. Vallée, 2002: The Canadian Updateable Model Output Statistics (UMOS) system: Design and development tests. Wea. Forecasting, 17, 206–222, doi:10.1175/1520-0434(2002)017<0206:TCUMOS>2.0.CO;2.

  • Yamagiwa, J., J. Kahekwa, and A. K. Basabose, 2003: Intra-specific variation in social organization of gorillas: Implications for their social evolution. Primates, 44, 359–369, doi:10.1007/s10329-003-0049-5.

  • Yang, H. T., C. M. Huang, and C. L. Huang, 1996: Identification of ARMAX model for short-term load forecasting: An evolutionary programming approach. IEEE Trans. Power Syst., 11, 403–408, doi:10.1109/59.486125.

  • Yuval, and W. W. Hsieh, 2003: An adaptive nonlinear MOS scheme for precipitation forecasts using neural networks. Wea. Forecasting, 18, 303–310, doi:10.1175/1520-0434(2003)018<0303:AANMSF>2.0.CO;2.

1

One could devise alternative threshold adjustments. For example, one could set the new critical threshold at each step according to the MSE of the Nth best individual, where N might depend on the maximum number of algorithms allowed. Here, where that maximum is set to 10 000, experiments with N up to 500 resulted in faster training, but also reduced the diversity of the resulting ensembles and were not pursued further.

2

Another possible way to account for a skewed distribution would be through the use of the two-piece normal distribution (Gibbons and Mylroie 1973; John 1982), in which the halves of two normal distributions with the same mode but two different variances are joined.

A1

A more generalized form of ecological niches, in which five sets of “cultural norms” for the four different disease and mating combinations were used, but for which the norms were defined by random selection of the 19 input variables for each set, with either partial or no variable overlap between sets, also failed to produce superior results on the training data.
