## Abstract

Simple phase schemes to predict seasonal climate based on leading ENSO indicators can be used to estimate the value of forecast information in agriculture and watershed management, but may be limited in predictive skill. Here, a simple two-tier statistical method is used to hindcast seasonal precipitation over the continental United States, and the resulting skill is compared with that of ENSO phase systems based on Niño-3 sea surface temperature anomaly (SSTA) and Southern Oscillation index (SOI) persistence. The two-tier approach first predicts Niño-3 winter season SSTA, and then converts those predictions to categorical precipitation hindcasts via a simple phase translation process. The hindcasting problem used to make these comparisons is relevant to winter wheat production over the central United States. Thus, given the state of seasonal SOI and Niño-3 indicators defined before August, the goal is to predict the tercile category of the following November–March precipitation. Generally, it was found that the methods based on either predicted or persisted winter Niño-3 conditions were skillful over areas where ENSO affects U.S. winter precipitation—that is, the Southeast and the Gulf Coast, Texas, the southern and central plains, the Southwest, Northwest, and the Ohio River valley—and that the two-tier approach based on predicted Niño-3 conditions was more likely to provide the best skill. Skill based on SOI persistence was generally lower over many of those regions and was insignificant over broad parts of the central and southwest United States, but did lead the other methods over the Ohio River valley and the northwest. A more restrictive test of leading hindcast skill showed that the skill advantages of the two-tier approach over the central and western United States were not substantial, and mainly highlighted SOI persistence’s lack of skill over the central United States and leading skill over the Ohio River valley. However, two-tier hindcasts based on neural-network-predicted Niño-3 SSTA were clearly more skillful than both ENSO phase methods over areas of the Southeast. It is suggested that the relative skill advantage of the two-tier approach may be due in part to the use of arbitrary thresholds in ENSO phase systems.

## 1. Introduction

The El Niño–Southern Oscillation (ENSO) is a coupled ocean–atmosphere climate mechanism that is a leading source of predictability in the earth’s climate system (Philander 1990; Barnston 1994; Allan et al. 1996; Trenberth 1997; Cane 2000; McPhaden et al. 2006). While ENSO-related indices of anomalous sea surface temperature (SSTA) and sea level pressure (SLPA) correlate with seasonal climate impacts over many areas of the globe (Ropelewski and Halpert 1987, 1996; Kiladis and Diaz 1989), the state of those indices also persist or develop predictably over interseasonal time scales (Rasmusson and Carpenter 1982; Trenberth and Shea 1987). As a result, they can be used as leading indicators in simple schemes to predict seasonal climate variation.

In studying the effect of ENSO-related forecast information in agricultural management such simple forecasting methods are useful because they can be easily integrated into long-term management and cropping simulations. A commonly used approach in these simulations is based on the definition of analog years. This method typically defines a set of analogous years in the historical record based on the condition of an ENSO predictor in the months before planting. Although analog years have been defined based on SSTA variables (Hansen et al. 1998; Jones et al. 2000; Stone et al. 2000), they are more commonly defined based on Southern Oscillation index (SOI) phase systems such as that of Stone et al. (1996a). Once a set of analog years has been defined based on a particular preplanting ENSO phase, crop simulations driven by the actual weather outcomes during those year’s growing seasons can be used to estimate yield effects associated with a specific planting strategy. Using an appropriate profit model, those yield effects can be converted to profit effects. By repeating these simulations using a variety of planting strategies an optimally profitable strategy for that leading ENSO phase can be determined. Once identified for each preplanting ENSO phase condition, the value of an ENSO phase system’s forecast information (Murphy 1993; Wilks 1997) can be estimated by comparing profits resulting from each year’s optimal planting practice with those resulting from simulations using a fixed planting practice (Fig. 1).

The predictive ability of ENSO phase systems is based on persistence and correlation. Although SOI phase systems have been used to predict frost risk (Stone et al. 1996b), in most applications their skill is based on the persistence of ENSO indicators between the preplanting and growing season period and those indicators’ correlation with growing-season precipitation. Although many have conducted management simulations based on this simple forecast method (e.g., Meinke et al. 1996; Meinke and Stone 1997; Carberry et al. 2000; Stephens et al. 2000; Hammer 2000; Hill et al. 2000; Everingham et al. 2003), such simulations can provide basic guidelines to the use of seasonal forecast information in agriculture. By testing for and defining best planting strategies they can determine management options that producers can consider in responding to forecast information. Value of information analyses can provide estimates of the associated profit effects. As a result, this method can provide answers to two basic questions that producers might pose in considering the use of growing-season climate forecasts: What are the best management practices I should use in responding to these forecasts, and what will the consequences be on yield and profits?

Basing the forecast component of these simulations on simple ENSO SSTA and SLPA indices provides an advantage in the relatively simple data requirements to form those indicators. ENSO indices derived from SSTA and SLPA reanalysis fields (Kaplan et al. 1998, 2000) or station data (e.g., Troup 1965; Ropelewski and Jones 1987) can be defined over near-centennial or greater time scales. As a result, Fig. 1’s annual management cycles can be simulated over similar periods if daily meteorological data are available to drive crop simulation models. By contrast, retrospective forecast records derived from the National Centers for Environmental Prediction (NCEP) climate forecast system, although potentially more skillful, are available for only 24 yr (Saha et al. 2006). Thus basing management simulations on these simple prediction schemes might provide better statistical sampling of the production and profit effects of forecast information. But although the ultimate goal of those simulations is to define optimal management strategies and estimate the value of forecast information, that value depends upon forecast skill (Murphy 1993). Although ENSO phase systems might provide a statistical advantage in management simulations, their limited skill may also provide lower estimates of potential forecast value. Alternatively, more skillful methods might be used that produce better estimates of the potential effect of ENSO forecasts on yields and profits.

Although usually based on a crude forecasting approach, the Fig. 1 simulation method can be an important research tool for studying the effect of forecast information in agriculture. With increased development such methods might also improve the performance of management software similar to the Australian “Rainman” application (Clewett et al. 2003), or might be used in personal computer–based simulation games designed to introduce producers to the profit and risk dynamics of forecast use. They might also be used to simulate the effect of forecast information in other areas where advanced knowledge of seasonal climate conditions might have value, for example, watershed and hydroelectric management. The goal here is to explore improvements to this method’s forecast foundation by comparing the skill of ENSO phase systems with that derived from a simple adaptation of the two-tier forecasting approach of Bengtsson et al. (1993). The forecasting problem used to compare these methods was defined by the constraints of winter wheat production in the central United States; thus the focus will be on forecasting winter precipitation. Because of the importance of avoiding artificial skill in studies integrating forecast information into climate-sensitive management simulations, section 2 reviews the importance of cross validation in statistical hindcasting. Section 3 describes the various datasets used and the calculation of Niño-3 and SOI ENSO variables, and justifies and defines a winter precipitation predictand. A discussion of methods in section 4 describes how the statistical two-tier forecasting method converts hind cast winter Niño-3 SSTA conditions to categorical hindcasts of “dry,” “normal,” or “wet” November–March (NDJFM) precipitation, and how the skill of those hindcasts was calculated. That section concludes with a description of the two ENSO phase methods used here. Section 5 presents the skill of all the methods considered here at the climate division level over the continental United States, and section 6 summarizes and discusses those results.

## 2. Categorical hindcasting and cross validation

An accurate estimate of a forecast method’s ability for correct prediction, or skill, is an important requirement before forecast information can be used confidently in agriculture or any other climate-sensitive undertaking. With historical records of limited–duration, leave-one-out cross validation (Stone 1974; Michaelsen 1987; Elsner and Schmertmann 1994) can provide such estimates by simulating forecasts of historical conditions. In categorical forecasts of seasonal climate conditions, this process acts as if a past year’s predictand class is unknown, and then predicts that class based on the year’s predictor conditions and a prediction rule. The prediction rule is derived from the relationship between predictor and predictand variables during the remaining years of the historical record. The central prediction problem in the two-tier method tested here is to retroactively forecast winter Niño-3 SSTA conditions during the 85-yr period 1915–99.^{1} Those forecasts will be based on prediction algorithms derived from the relationships between winter Niño-3 conditions and the state of previous spring–summer Niño-3 and SOI predictor variables during the remaining 84 yr of data. As Elsner and Schmertmann (1994) note, the leave-one-out method can be considered a fictitious reordering of history in which a past year’s predictand is considered a future unknown, and the remaining years are considered as known historical conditions preceding that year. The resulting retrospective forecasts are referred to as hindcasts. Comparing the hindcast outcome of each year in the historical record with the year’s observed predictand state can provide a “track record” of a forecast method’s skill. Management simulations using such hindcast information can also potentially provide end-users with a corresponding track record of the risks and benefits of forecast use.

However, artificially skillful hindcasts can produce biases in calculating forecast information’s effect on production or profits. For hindcast skill to approximate a forecasting method’s true skill, the overall hindcast algorithm must accurately mirror the circumstances of an actual forecast. As outlined by Elsner and Schmertmann (1994) the general principle in achieving this is that the algorithm must not be influenced in any way by the hindcast predictand, which is unknown under forecast conditions. The change in skill when this general principle is violated is referred to as artificial skill, which typically increases hindcast skill estimates above that of true forecast skill (Michaelsen 1987). Hindcast algorithms may consist of both prediction rule selection and the calculation of parameters used in the selected rule. To remove the direct influence of the predictand, rule selection and parameter estimation must be conducted with data from which that predictand has been withheld, also known as out-of-sample or training data. But two additional requirements ensure that training data are not indirectly influenced by the hindcast predictand:

The held-out predictand must be statistically independent of the training data. In seasonal forecasting this ensures that the held-out year’s predictand state is not closely duplicated by that of the preceding or following years, which would defeat the general purpose of cross validation. While the persistence of adjoining data values can be a problem in forming prediction rules over intraseasonal time scales, for example, the prediction of tropical storm movement (Shapiro 1984), seasonal SST anomalies generally do not persist interannually (Namias and Cayan 1981). The December–February (DJF) Niño-3 SSTA predictand considered here is negligibly correlated with itself at a 1-yr lag during 1915–99 (

*ρ*= −0.07).The hindcast algorithm and the training data should not be based on statistics that are influenced by the hindcast-year predictand. Thus, for example, a hindcast year’s prediction algorithm should not be based on climatologies, correlations, or percentiles that are calculated using that year’s predictand. As will be seen, this requirement does impose constraints on the calculation of tercile thresholds of winter-season SSTA and precipitation, and the calculation of sea surface temperature climatologies.

## 3. Data

### a. Pacific SSTA and SLPA data

The seasonal Niño-3 SSTA data used here were formed using the concatenation of the historical SSTA analysis of Kaplan et al. (1998) and the post-1981 SSTA analyses of Reynolds and Smith (1994). This extended Kaplan historical SSTA analysis is currently available from January 1856 to October 2006 over a 5° × 5° grid. But while this monthly data is reported as degrees Celsius departures from the 1951–80 Global Ocean Surface Temperature Atlas (GOSTA) sea surface temperature (SST) climatology (Bottomley et al. 1990), SST anomalies calculated relative to that climatology pose a problem in the strict cross validation of winter Niño-3 SSTA hindcasts made during 1951–80. That is, because SST conditions during those years contribute to the GOSTA SST climatology, the resulting hindcasts would violate the second cross-validation requirement outlined in section 2. To avoid the possibility of artificial skill, each hindcast year’s DJF Niño-3 SSTA predictand was recalculated from the Kaplan data using alternate SST means calculated from DJF Niño-3 SST values in the training data. Winter Niño-3 SST values were derived here by adding the appropriate monthly 1951–80 GOSTA climatological SST fields to the original Kaplan SSTA fields to recover monthly SST fields for December, January, and February during 1915–2000. Those fields were averaged into winter-season-mean SST fields for each year, which were then spatially averaged over the Niño-3 grid region to form each year’s winter Niño-3 SST value. In the course of the linear regression and neural network training algorithms used here to hindcast winter Niño-3 SSTA, those SST values were converted to SSTA using mean DJF Niño-3 SST calculated from each winter of 1915–99 except the hindcast year. Thus for each hindcast year, DJF Niño-3 SSTA was recalculated with respect to an 84-yr Niño-3 mean that held out the effect of that year. However, because cross-validation requires that information that is normally unavailable at forecast time should not influence a prediction rule, similar restrictions on climatology do not apply to predictor variables. As a result, the seasonal Niño-3 SSTA predictor values used here were directly averaged from the monthly extended Kaplan SSTA fields. Similarly, the spring–summer SOI values used here as predictor variables were calculated by seasonally averaging University of East Anglia (http://www.cru.uea.ac.uk/cru/data/soi.htm) monthly SOI values. Kaplan et al. (2000) gridded surface pressure anomaly fields are also used here for illustrative purposes (Fig. 3b).

### b. Winter-season precipitation predictand

In the two-tier forecast method evaluated here DJF Nino-3 SSTA hindcasts are an interim step to hindcasting the tercile category of cumulative NDJFM precipitation at each U.S. climate division, minus the Florida Keys division. Those seasonal precipitation totals were derived from U.S. climate division precipitation data (Guttman and Quayle 1996), which are in turn derived from averages of monthly station data reported over those 343 regions. These monthly precipitation data currently extend from January 1895 to April 2006.

In comparing ENSO-related yield effects on corn and winter wheat over the central United States, Mauget and Upchurch (1999) found that the most significant yield effect was found in Texas winter wheat. Generally, a significant tendency to above-median winter precipitation was found over the wheat growing regions of Texas, Oklahoma, and Kansas during warm winter SSTA conditions over the equatorial Pacific, and subsequent state-averaged yields showed a significant tendency to be above the technological yield trend. Conversely, cold SSTA conditions were consistent with dry winter conditions and suppressed yields. In Texas, winter wheat production is concentrated mainly in a climate division in the northwest part of the state, referred to here as “the Panhandle” (see Fig. 2a). As a result, Panhandle winter precipitation is closely correlated with Texas state wheat production. Correlations calculated here between the detrended National Agricultural Statistics Service–U.S. Department of Agriculture (NASS–USDA) Texas yield record and Panhandle climate division precipitation integrated over varying winter periods showed the November–March total to be most highly correlated (*ρ* = 0.64) with state-averaged yield (Fig. 2b). Panhandle November–March precipitation is in turn correlated with both equatorial Pacific SSTA (Fig. 3a) and the tropical oscillation in sea level pressure associated with the Southern Oscillation (Fig. 3b). Given the ability of the ENSO mechanism to develop predictably over interseasonal time scales and its correlation with U.S. winter wheat belt precipitation and yield effects, NDJFM precipitation is defined here as the target predictand in a practically relevant seasonal hindcasting problem. In comparing the skill of the two-tier and ENSO phase methods considered here, the goal will be to retrospectively predict whether this variable will fall into one of three tercile classes defined by its 33rd and 66th percentiles during the winters of 1915–99.

## 4. Methods

Seasonal forecast information can influence agricultural management decisions if forecasts are provided early enough to influence those decisions (Easterling and Mjelde 1987). In the Texas Panhandle, winter wheat planting for forage production can occur as early as late August or early September (Hossain et al. 2003). Thus to allow for a forecast formulation and delivery period of at least 1 month, the seasonal predictor variables used here must be defined over periods ending no later than July to be useful in influencing planting decisions. As a result, the applied hindcasting problem here is to classify the November–March precipitation tercile based on the state of seasonal SOI and Nino-3 SSTA conditions defined no later than the previous July.

But although advanced knowledge of seasonal precipitation might be useful in agriculture, predicting growing-season climate depends on preconditions in the climate system. Seasonal climate forecasting is possible if atmospheric boundary forcing evolves predictably over interseasonal time scales, and the atmospheric response to that forcing is also predictable. Two-tier climate prediction based on numerical methods (e.g., Bengtsson et al. 1993; Latif et al. 1993; Graham and Barnett 1995) tries to reproduce those processes by first predicting SST conditions at seasonal or greater lead times using ocean circulation models, and then uses those forecast boundary conditions to force either statistical or numerical atmospheric models during the predictand season. The skill of purely statistical methods may also depend on their ability to follow that two-stage physical process. The two-tier statistical method tested here begins by predicting ENSO SSTA boundary conditions; then assumes those conditions have a simplified linear effect on U.S. precipitation. As illustrated in Fig. 4, this approach first hindcasts DJF Niño-3 SSTA using two prediction methods. The second tier consists of converting those hindcasts to SSTA phase hindcasts and then translating those cold, neutral, or warm SSTA predictions to categorical hindcasts of NDJFM precipitation. The skill of the resulting dry, normal, or wet phase hindcasts is then estimated through the calculation of a Gandin–Murphy skill (G–M) score (Gandin and Murphy 1992; Livezey 2003). The following will describe the two steps of this statistical hindcast scheme, and the calculation of skill, in more detail.

### a. DJF Niño-3 hindcast methods

Two methods were tested in the SSTA prediction component of the Fig. 4 two-tier hindcasting scheme: linear regression with simple rule selection and a nonlinear neural network using commercially available Predict software (Neuralware, Inc.).

#### 1) Linear regression

Three linear regression prediction rules were tested here using a double-nested cross-validation process (Stone 1974; Drosdowsky and Chambers 1998). The first rule was a simple first-order autoregression (Chu and Katz 1985) of May–July (MJJ) Niño-3 (NIN3) on subsequent DJF Niño-3 SSTA conditions:

The second rule was a similar regression of MJJ SOI on subsequent DJF Niño-3 conditions:

The final rule was a simple multivariate linear regression (MLR) of both MJJ NIN3 and SOI on the DJF NIN3 predictand:

For each hindcast year *n* the double cross-validation process first selects the best prediction rule based on successive leave-one-out subsets of the training data. Given the selected rule, the associated coefficients were then derived via regression on the entire training dataset. Then, using the MJJ Niño-3 SSTA and/or MJJ SOI conditions for the hindcast year, the subsequent DJF Niño-3 SSTA anomaly was predicted via the resulting regression equation. In the course of the rule-selection process, any of the Eqs. (1)–(3) regression rules may have been selected to predict a particular hindcast year’s winter Niño-3 conditions. However, in the course of that process the Eq. (3) MLR model was selected for every year of 1915–99 based on its out-of-sample root-mean-square (RMS) prediction error relative to the other models. The RMS error of the 85 winter Niño-3 hindcasts based on the Eq. (3) MLR model was 0.645°C.

#### 2) Neural network

The feed-forward neural network (NNET) used here had a three-layer architecture similar to that used by Tangang et al. (1998), with two linear input nodes, one hidden layer with a variable number of nonlinear tanh activation function nodes, and one linear output node. Replacing the nonlinear hidden layer of such a network with direct connections between the linear input and output nodes produces an equivalent MLR network with two predictor variables. Because the addition of nonlinear nodes in the hidden layer allows the network to reproduce arbitrary continuous functions (Rojas 1996), the training of those nodes produces a form of nonlinear regression. While neural networks have been used to predict Niño-3 SSTA using SSTA and SLPA predictors derived from empirical orthogonal function (EOF) analysis (Tangang et al. 1998; Hsieh and Tang 1998), the emphasis here is on the use of simple predictors. As a result, the same MJJ Niño-3 and SOI predictors used in the MLR hindcasts were used as NNET predictor variables. In the context of a consistent hindcast algorithm, this is equivalent to adopting the results of the linear predictor selection process described above during each hindcast year. Thus for each year, neural networks were trained to optimize the correlation between the network output and observed DJF Niño-3 SSTA in the training data. The held-out year’s winter Niño-3 conditions were then hindcast using the optimized network and that year’s MJJ Niño-3 and SOI values. In addition to optimizing network parameters this training process also selected for the optimal number of nodes in the hidden layer. As a result, that number varied with each hindcast year, with the number of hidden nodes varying between 1 and 30. The RMS error of the resulting 84 NNET-derived Niño-3 hindcasts was 0.685°C.

### b. Derivation of categorical winter precipitation

Hindcast winter Niño-3 SSTA resulting from the MLR and NNET-based two-tier methods were converted to categorical hindcasts of winter precipitation through a simple phase conversion process (Fig. 4a). First, the hindcast SSTA was converted to a three-phase categorical hindcast through comparison with out-of-sample tercile thresholds calculated from the rankings of the remaining 84 yr’s DJF Niño-3 values during 1915–99. Those cold-, neutral-, or warm-phase hindcasts were then translated into categorical hindcasts of dry, normal, or wet NDJFM conditions, where the translation was dependent on the sign of the correlation between DJF Niño-3 SSTA and a climate division’s NDJFM precipitation during the 84 out-of-sample years. Thus, assuming a positive sense of correlation as found over the Texas Panhandle, warm Niño-3 SSTA hindcasts translated to wet precipitation hindcasts, while cold hindcasts translated to dry hindcasts. Neutral Niño-3 hindcasts translate to normal precipitation hindcasts regardless of the sense of correlation. The phase of a hindcast year’s observed NDJFM precipitation was, like the phase translation of the hindcast Niño-3 SSTA, defined using out-of-sample tercile thresholds of NDJFM precipitation calculated from the remaining 84-yr precipitation totals. The joint distribution of the resulting categorical hindcasts and the hindcast year’s observed phase of NDJFM precipitation were then used to update 3 × 3 contingency tables.

### c. Gandin–Murphy skill scores and Monte Carlo–derived confidence levels

The skill associated with a contingency table’s entries can be measured using the G–M skill score. This score is calculated by weighting the table’s entries by scoring matrix values designed to reward and penalize specific hindcast outcomes. An example of such a table for the MLR-based hindcasts for the Texas Panhandle climate division can be found in Fig. 4b. When normalized by the total number of hindcast years (85) the Fig. 4b contingency table produces the Fig. 4c probability matrix (𝗣). The G–M skill score (GMSS) is calculated as the sum of nine products derived from the corresponding entries of 𝗣 and a 3 × 3 scoring matrix 𝗦 (Fig. 4d):

The 𝗦 entries are selected based on a number of constraining requirements (Livezey 2003). First, the average score for random or static hindcasts is required to equal 0.0, while the score for perfect hindcasts is required to equal 1.0. The scoring matrix must be symmetric (e.g., *S _{ij}* =

*S*) if the user wants to equally penalize a hindcast of class

_{ji}*i*conditions when a class

*j*outcome occurs and visa versa. The matrix values are chosen to impose increased penalties for hindcasts in error by more than one class, for example, a dry hindcast when wet conditions are subsequently observed. Scoring matrix values are also selected to not reward conservatism; that is, they should give greater reward for correctly hindcasting wet or dry conditions than normal conditions. When the unconditional probabilities for dry, normal, or wet winter conditions are equal (0.33), the Fig. 4d 𝗦 entries satisfy those requirements. The calculation of G–M skill in Eq. (4) allows the 𝗦 matrix to assign positive rewards of varying magnitude to correct hindcasts—that is, the counts along the contingency table’s diagonal—and negative penalties to off-diagonal counts. Using this scoring matrix, the G–M skill for the Texas Panhandle’s MLR-derived hindcasts in the Fig. 4b contingency table is 0.253.

Although the mean G–M skill for random hindcasts is 0.0, values for randomly assigned contingency tables vary about 0.0. To estimate that variance Monte Carlo trials were conducted to test whether each contingency table formed here was consistent with a semirandom hindcast sequence. In these trials the actual sequence of 85 categorical precipitation hindcasts was replaced with a “red” noise sequence with interannual transition probabilities equal to that of the hindcast sequence. Together with the observed outcomes, that noise hindcast sequence was used to form a contingency table from which a G–M score was calculated. This process was repeated 2000 times, and the percentiles of the resulting skill score null distributions were used to estimate confidence levels. These G–M score null distributions were generated for each precipitation hindcast sequence formed here, that is, for each forecast method at each climate division. The resulting 95% confidence levels for significant G–M skill varied between 0.129 and 0.191, while the 99% confidence levels varied between 0.165 and 0.253.

### d. ENSO phase systems

In Fig. 4e the Texas Panhandle NIN-55 scores were based on a three-phase Niño-3 system and the sense of the correlation between DJF Niño-3 SSTA and NDJFM precipitation in the hindcast year’s training data. If that correlation was positive, as in the Panhandle, cold–neutral–warm phase MJJ Niño-3 conditions that are assumed to persist into the following DJF season were translated to dry–normal–wet NDJFM precipitation hindcasts. Conversely, at negatively correlated locations wet–normal–dry conditions would be hindcast. The MJJ Niño-3 SSTA phase thresholds used to derive those categorical hindcasts were the +0.5 and −0.5°C National Oceanic and Atmospheric Administration (NOAA) operational values used by Hansen et al. (1998) and Jones et al. (2000), thus the designation NIN-55. The Fig. 4e SOI-66 G-M scores were derived using an identical scheme, but based on MJJ SOI phase and the correlation sense between DJF SOI and NDJFM precipitation in the out-of-sample years. The thresholds for defining MJJ SOI phase were the +0.6 and −0.6 values used by Hill et al. (2000), thus the designation SOI-66. The subsequent treatment of the winter precipitation hindcasts resulting from the NIN-55 and SOI-66 methods—that is, the formation of contingency tables, and the calculation of G–M skill scores and confidence levels—was identical to that of the MLR and NNET-based two-tier hindcasts.

## 5. Results: Climate division hindcast skill

Figures 5a–d maps G–M hindcast skill for the NDJFM precipitation tercile at each U.S. climate division during 1915–99. The skill resulting from the two-tier MLR and NNET-derived hindcasts are mapped in Figs. 5a and 5b, while that resulting from the NIN-55 and SOI-66 ENSO phase methods are found in Figs. 5c and 5d. In those figures, colored climate divisions show G–M skill significant at a 95% confidence level. Figures 5a–c show that hindcasts based on both two-tier methods and on the NIN-55 phase system provided consistently significant skill over areas where NDJFM precipitation and DJF Niño-3 are correlated (Fig. 5f), specifically, the Southeast and the Gulf Coast, Texas, the southern and central plains, the Southwest, Northwest, and the Ohio River valley. By contrast, the SOI-66 method provided regionally significant skill over Florida and the Gulf Coast, the Ohio River valley, and climate divisions in the southwest and northwest, but was unskillful over the central and much of the southwestern United States. The percentage of U.S area with significant hindcast skill by method is graphed in Fig. 6a and shows that the two-tier and NIN-55 methods were skillful over 36%–38% of the continental United States, while SOI-66 skill was significant over 21% of U.S. area. In Fig. 6a the breakdown of area where skill was weakly (GMSS < 0.23), moderately (0.23 < GMSS < 0.32), or strongly significant (GMSS > 0.32) is similar in the MLR and NNET-based two-tier methods. However, while those two methods provided weakly significant skill over 19% of U.S. area, the NIN-55 phase method was weakly significant over a greater percentage (25%) of area. The SOI-66 method was strongly significant in one Idaho climate division, and climate divisions where that method was moderately skillful were found mainly in Florida, Montana, and the Ohio River valley.

Figure 5e maps the method that provided the best skill over climate divisions where at least one of the methods of Figs. 5a–d provided significant skill. Figure 6b graphs the percentage of U.S. area over which each method provided such leading skill. The latter figure shows that the MLR and NNET methods led in skill over 19% and 16% of U.S. area respectively, while the NIN-55 and SOI-66 phase methods each led over approximately 7% of U.S. area. In Fig. 5e NNET scores consistently led in skill over southeastern climate divisions in Florida, the Gulf Coast, Georgia, and the Carolinas. Over climate divisions in Texas, the central plains, and the southwest, the NNET and MLR methods alternated in providing leading skill, although the NIN-55 method led with weakly significant skill over areas of Texas, Oklahoma, and eastern New Mexico. Similarly, the two-tier methods led in skill over Montana and Idaho climate divisions, with the NIN-55 method leading over some areas of eastern Montana and Wyoming. The SOI-66 phase method consistently provided best skill over the Ohio River valley and most climate divisions in Oregon and Washington State, where significant skill was found. The NNET method produced the highest G–M skill found here (0.438) over climate divisions in Georgia, Florida, and southernmost Texas. The highest MLR-derived (0.412) and NIN-55 derived skill (0.403) was also found over southernmost Texas.

Figure 5e maps where each forecast method provided leading hindcast skill over the United States, but in many areas those skill advantages are minor. To make more restrictive tests for leading skill, pairwise skill comparisons were made between each of the four methods in Figs. 5a–d. Those comparisons were based on a null hypothesis (*H _{d}*) that holds that G–M score differences were consistent with both scores being randomly drawn from stochastic score distributions derived from fixed hit rate hindcasts (see the appendix). As a result,

*H*holds that two G–M scores are consistent with equal outcomes of a more basic measure of skill, that is, the overall rate of correct hindcasts, or hit rate (Wilks 1995). The Monte Carlo trials that generated those G–M score distributions showed that their standard deviations were stable as hit rate varied, as were the standard deviations of derived distributions of G–M score differences (Table A1 in the appendix). As the standard deviations of the derived difference distributions varied closely about 0.057, the following

_{d}*Z*statistic was used to test

*H*:

_{d}where *σ _{d}* = 0.057. Thus when the absolute value of the difference between two methods’ G-M scores was greater than 1.96

*σ*(0.112),

_{d}*H*was rejected at a 95% confidence level.

_{d}The *Z* statistics for the six possible pairwise comparisons of G–M scores for the MLR, NNET, NIN-55, and SOI-66 methods at each climate division can be found in Fig. 7. In those figures black climate divisions show where the G–M scores for the black font method in each figure’s legend exceeds the gray font method score at a 95% confidence level (*Z* > 1.96). Conversely, gray divisions mark where the gray font method’s score significantly exceeds that of the black font method (*Z* < −1.96). White divisions mark areas where |*Z*| < 1.96, or areas where the G–M scores of both methods are shaded as not significantly different from 0.0 (white) in Figs. 5a–d. Figures 7a–c map *Z* statistics for comparisons among the NNET, MLR, and NIN-55 methods, while Figs. 7d–f map the *Z* statistics between those three methods and the SOI-66 phase method.

In Figs. 7a–c the predominance of white climate divisions in the pairwise comparisons of the NNET, MLR, and NIN-55 methods shows that their regional skill advantages in Fig. 5e are rarely significant with respect to one another according to the Eq. (5) test. The exception is found in Fig. 7c, where the NNET-derived scores significantly exceed NIN-55 scores over an arc of climate divisions extending from the Gulf Coast to the coasts of the Carolinas. In Figs. 5a–c the MLR, NNET, and NIN-55 methods are consistently skillful over similar areas of the United States, and in Fig. 5e those methods randomly alternate in providing leading skill over the southwest, Texas, the central plains, Montana, and Idaho. But in Figs. 7a–c those three methods rarely lead the other two by a 1.96*σ _{d}* margin over those areas, which shows that no method has a consistent and substantial skill advantage over the central and western United States.

In contrast to Figs. 7a–c, Figs. 7d–f shows consistent patterns of substantial skill advantage in the pairwise comparisons of SOI-66 G–M scores versus the remaining three methods. Those figure’s general pattern of significant *Z* statistics shows MLR, NNET, and NIN-55 skill leading SOI-66 skill by more than 1.96*σ _{d}* over climate divisions in the central United States, while the SOI-66 method produces a similar skill advantage over each of those methods over areas of the Ohio River valley. The significant negative

*Z*statistics reflects the NNET, MLR, and NIN-55 method’s lack of significant G–M skill over areas of the southern Ohio River valley where the SOI-66 method produced weakly significant skill (<0.23). The significant positive statistics over the central United States mainly reflect similar but reversed skill comparisons, that is, areas of Texas and the Great Plains where the NNET, MLR, and NIN-55 methods produce moderately or weakly significant G–M skill, but SOI-66 skill was insignificant. The exception is found in climate divisions in the southeast and southern Texas in Fig. 7f, where, like the NNET versus NIN-55 comparison in Fig. 7c, high NNET skill substantially exceeds significant SOI-66 skill.

## 6. Summary and discussion

To test a possible alternative to the simple ENSO phase systems frequently used in agricultural management simulations (Fig. 1), the skill of Niño-3 (NIN-55) and SOI-based (SOI-66) phase systems were compared with that of a two-tier method based on multiple linear regression (MLR) and neural-network-derived (NNET) winter Niño-3 SSTA hindcasts, illustrated here in Fig. 4. The common prediction problem used to evaluate these methods was defined by the constraints of winter wheat production in the central United States. Thus the goal was to hindcast the tercile of November–March U.S. climate division precipitation over the continental United States during 1915–99, based on the state of SOI and Pacific SSTA conditions during the previous May–July season. The main measure of skill that these comparisons were based on was the Gandin–Murphy (G–M) skill score. Skillful G–M scores—that is, scores that significantly differed from those associated with semirandom categorical hindcast sequences—were determined via Monte Carlo simulations.

Over areas of the continental United States where winter Niño-3 SSTA conditions are significantly correlated with NDJFM rainfall (Fig. 5f), the MLR and NNET-based two-tier hindcasts and the NIN-55 ENSO phase method based on Niño-3 persistence provided significant and spatially consistent hindcast skill (Figs. 5a–c). But although SOI phase systems have been used to study ENSO effects on U.S. winter grain crops (Mjelde and Keplinger 1998; Hill et al. 2000), the SOI-66 method was not found here to be skillful over the central United States and much of the southwest (Fig. 5d). Over climate divisions where at least one method produced a significant G–M score, the two-tier hindcasts frequently provided leading skill (Fig. 5e). But regions where a single method produced consistent leading skill were more the exception than the rule. Those exceptions were the southeastern United States, where the NNET two-tier method provided some of the highest G–M scores found here, and the Ohio River valley and the Northwest, where SOI-66 skill consistently led that of the remaining three methods.

A more restrictive comparison of leading skill among the two-tier and ENSO phase methods did not reveal substantial differences in MLR, NNET, and NIN-55 skill over the central and western United States (Figs. 7a–c), but did show NNET skill to clearly lead NIN-55 and SOI-66 skill in the southeast (Figs. 7c,f). The most consistent result of those comparisons was to highlight the differences between a southern Ohio River valley region where the SOI-66 method led in skill, and central U.S. areas where the remaining methods were skillful but SOI-66 skill was insignificant (Figs. 7d–f).

In terms of providing significant hindcast skill over more extensive areas of the continental United States, the methods based on either predicted or persisted winter Niño-3 conditions—that is, the MLR, NNET and NIN-55 methods—were generally the leading performers here (Fig. 6a). The MLR and NNET methods based on predicted SSTA conditions were in turn more effective in providing leading skill than the NIN-55 ENSO phase method (Fig. 6b). The reasons for this relative skill advantage might be explored by comparing the two-tier and NIN-55 prediction approaches. In both methods, hindcast NDJFM precipitation tercile is derived from the phase of winter Niño-3 SSTA through a simple phase translation process. The two-tier method explicitly hindcasts DJF Niño-3 phase, while the NIN-55 method implicitly assumes that MJJ phase—as defined by −0.5 and +0.5°C NOAA thresholds—persists into the following winter season. Thus in both methods each climate division’s precipitation hindcast is phase translated from a common winter Niño-3 phase hindcast. As a result, in areas where those winter SSTA conditions correlate with NDJFM precipitation, the method with the most skillful SSTA phase hindcasts might produce more skillful precipitation hindcasts.

Tables 1a and 1b are the contingency tables for the NNET and MLR-derived hindcasts of DJF Niño-3 phase, while Table 1c compares the NOAA MJJ Niño-3 phase with the following winter’s DJF phase as defined by the tercile thresholds used here. As measured by each table’s G–M score, the NNET method is more skillful in hindcasting DJF Niño-3 phase (GMSS = 0.677) than the MLR method (GMSS = 0.597), and both two-tier methods are more skillful than NIN-55 persistence (GMSS = 0.456). The NNET method’s leading skill in predicting winter Niño-3 phase may explain its consistent skill in hindcasting NDJFM precipitation over southeastern areas where those two variables are highly correlated (Fig. 5f). However, all three methods hindcast Niño-3 phase at higher or lower rates than was observed to some degree. By definition, cold, neutral, and warm DJF Niño-3 conditions occurred in 28 or 29 yr of 1915–99. In Tables 1a and 1b both the NNET and MLR methods hindcast warm conditions at slightly higher rates than observed. But in Table 1c, the NIN-55 method hindcast neutral conditions at a clearly higher rate (46 yr) than was observed (29 yr). This occurred because although the NIN-55 NOAA thresholds specify neutral SSTA conditions as occurring over a 1.0°C range, the central tercile of the 85 MJJ Niño-3 values used here spans a range of only 0.52°C. As a result, the NIN-55 method identifies MJJ Niño-3 phase as neutral at a much higher rate than defined by tercile thresholds. That hindcast bias is then phase translated into a higher rate of normal NDJFM precipitation hindcasts. In the G–M scoring scheme, this produces conservative hindcasts, as hindcasting normal conditions at a higher rate than observed results in minor scoring rewards and penalties. This conservatism may explain the NIN-55 method’s increased incidence of weakly significant G–M skill over the United States in Fig. 6a, but also suggests the use of tercile phase thresholds that are more consistent with the percentiles of actual Niño-3 SSTA distributions. Alternatively, the use of optimal thresholds that maximize G–M skill in leave-one-out training datasets, similar in principle to the Optimal Climate Normal approach of Huang et al. (1996), might be considered an improvement to simple ENSO phase systems.

The statistical two-tier forecast method considered here is intended as a simple incremental improvement to the ENSO phase schemes normally used in agricultural management simulations. This approach produces substantially better hindcasts of NDJFM precipitation than the SOI-66 method over the wheat-growing region of the Texas Panhandle, but did not clearly improve on the skill derived from NIN-55 persistence (Fig. 4e). However, the economic forecast value that such simulations are meant to calculate is not a simple function of forecast skill. As pointed out by Murphy (1993), the relationship between forecast skill and value can be complex and user-specific. But that relationship can be determined through a simulation approach similar to that illustrated in Fig. 1. Future work will use such simulations to calculate those effects on winter wheat production and grazing management at farm locations in the Texas Panhandle.

## Acknowledgments

This work was supported by the ARS-University Ogallala Aquifer Project. Thanks are given to Jeanne Schneider of the ARS Grazinglands Research Laboratory for helpful preliminary review comments. All figures were produced using Generic Mapping Tools (Wessel and Smith 1995).

## REFERENCES

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

### APPENDIX

#### Gandin–Murphy Score Variance for Constant-Hit-Rate Contingency Tables

Tests made here to define substantial advantages in G–M skill were based on the hypothesis that two G–M scores were consistent with equal incidences of correct hindcasts. That incidence is given by the ratio of hindcast counts on a contingency table’s diagonal to the total number of hindcasts, also known as the hit rate (Wilks 1995). A Monte Carlo algorithm used to form constant-hit-rate contingency tables began with a random sequence of 85 observed precipitation terciles and an identical sequence of tercile hindcasts. As the contingency table derived from those observations and hindcast series had no off-diagonal entries, the resulting G–M score and hit rate equaled 1.0. Then, a fixed number of *N* tercile values in the hindcast series were randomly selected without replacement and randomly recast as an incorrect hindcast. Thus if the correct hindcast was dry, it would be reassigned as wet or normal with a 50% probability. The effect on the contingency table would be to randomly remove *N* values from the diagonal and randomly reassign them to off-diagonal locations, resulting in a lower but fixed hit rate of (85 − *N*)/85. The G–M score of that semirandomized contingency table was then calculated. This process was repeated with 2000 initially identical 85-yr observation and hindcast series, thus forming 2000 G–M skill scores with identical hit rates. Those scores were then used to form a second distribution to test the null hypothesis that differences in G–M skill were consistent with different hit rates. From those G–M 2000 skill scores, 2000 pairs were randomly selected with replacement, and their differences were calculated. The mean and standard deviation of the G–M score distribution, and of the derived difference distribution, were then recorded (Table A1). The distribution for a value of *N* = 25 (hit rate = 0.706) had a mean (*μ*_{GM}) of 0.561 and a standard deviation (*σ*_{GM}) of 0.039. As expected, the mean value of the differences of G–M skill scores randomly sampled from that normal distribution (*μ _{d}*) is near 0.0. The standard deviation of those differences (

*σ*) is 0.058. Repeating these Monte Carlo trials with increasing values of

_{d}*N*= 30, 35, 40, 45 caused the mean of the resulting G–M skill distribution to decrease, but the standard deviations of those distributions in Table A1 were relatively unchanged. Similarly, the standard deviations of the resulting difference distributions were closely clustered about 0.057, which led to the use of

*σ*= 0.057 in Eq. (5).

_{z}## Footnotes

*Corresponding author address:* Steven A. Mauget, Agricultural Research Service, U.S. Department of Agriculture, USDA Plant Stress and Water Conservation Laboratory, 3810 4th Street, Lubbock, TX 79415. Email: smauget@lbk.ars.usda.gov

^{1}

Winter year references here indicate the year in which the first month of the seasonal averaging period falls; for example, winter 1915 refers to December–February of 1915/16.