The authors examine recent changes in three agro-climate indices (frost days, thermal time, and heat stress index) in North America (centered around the continental United States) using observations from a historical climate network and an ensemble of 17 global climate models (GCMs) from the Fourth Assessment Report of the Intergovernmental Panel on Climate Change (IPCC AR4). Agro-climate indices provide the basis for analyzing agricultural time series that are unbiased by long-term technological intervention. Observations from the last 60 years (1951–2010) confirm conclusions of previous studies showing continuing declines in the number of frost days and increases in thermal time. Increases in heat stress are largely confined to the western half of the continent. The authors do not observe accelerating agro-climate warming trends in the most recent decade of observations. The spatial variability of the temporal trends in GCMs is lower compared to the observed patterns, which still show some regional cooling trends. GCM skill, defined as the ability to reproduce observed patterns (i.e., correlation and error) and variability, is highest for frost days and lowest for heat stress patterns. Individual GCM skill is incorporated into two model weighting schemes to gauge their ability to reduce predictive uncertainty for agro-climate indices. The two weighted GCM ensembles do not substantially improve results compared to the unweighted ensemble mean. The lack of agreement between simulated and observed heat stress is relatively robust with respect to how the heuristic is defined and appears to reflect a weakness in the ability of this last generation of GCMs to reproduce this impact-relevant aspect of the climate system. However, it remains a question for future work as to whether the discrepancies between observed and simulated trends primarily reflect fundamental errors in model physics or an incomplete treatment of relevant regional climate forcings.
Recent crop simulation and global climate models suggest that the global food supply may decrease toward the end of the twenty-first century as a result of anthropogenic climate change (Easterling et al. 2007; Battisti and Naylor 2009). However, the majority of these models assume a static shift in the surface temperature distribution as the mean temperature increases (Easterling et al. 2007). This potentially neglects changes in variance or extreme values (as noted by Meehl et al. 2000) that are more closely coupled to biomass production than the mean surface air temperature (Neild and Newman 1986).
Recent work has focused on characterizing the effects of recent observed climate variability on crop yields, where yield is defined as the ratio of total production to area under production (Lobell and Asner 2003; Lobell and Field 2007; Tebaldi and Lobell 2008; Lobell et al. 2011). Time series of crop yields embed technological advances (e.g., cultivar development, fertilization procedures, and land management) and other nonclimate determinants of crop growth such as pest and disease outbreaks (although many of these determinants are themselves tightly coupled to climate variability and change). But removing these effects is difficult to implement because of the potentially confounding effect between technology and climate as well as the difficulty in accounting for regional variations in exogenous factors (Schlenker and Roberts 2009). An avenue to potentially improve the projections of climate change effects on agricultural production is to calculate climate-based indices such as the annual frost days, or consecutive days without precipitation; measures that are strongly correlated with biomass production in agro-ecosystems (Tollenaar and Hunter 1983; Muchow et al. 1990; Wilhelm et al. 1999). The three most common temperature-based agro-climate indices measure heat stress (heat stress index), cold stress (frost days or growing season length), and phenological development (thermal time or growing degree-days). Together these indices are important indicators of potential production for a given crop or region (Neild and Newman 1986). Increases in the duration, timing, or magnitude of suboptimum conditions as expressed in these indices could adversely affect total biomass production.
Several studies have shown changes in some agro-climate indices and related temperature extremes in North America in the twentieth century (Easterling 2002; Frich et al. 2002; Feng and Hu 2004; Kunkel et al. 2004; Robeson 2004; Schwartz et al. 2006; Christidis et al. 2007; Meehl et al. 2005). But it is not known how well global climate models (GCMs) are able to reproduce these historic patterns at the impact-relevant regional scale except with regard to broad qualitative comparisons (Tebaldi et al. 2006; Meehl et al. 2005), or with older generations of coupled atmosphere–ocean GCMs (Kiktev et al. 2003). Downscaling methods can improve model fidelity to regional or local conditions, especially when evaluating changes in the tails of a parameter's distribution (Qian et al. 2010). However, it is still important to understand whether the raw GCMs are able to simulate these impact-relevant indices even at the scales at which they are intended to be used for assessment activities (e.g., continental or subcontinental). Finally, the current generation of widely available GCM simulations is likely nearing the end of its shelf life. Therefore, it is important to document model skill for these types of variables so as to gauge whether the next generation of GCMs improves upon prior results. This study evaluates historic agro-climate trends and the ability of a large number of current-generation GCMs to simulate agro-climate indices. This is a necessary step toward improved impact assessments of the effects of anthropogenic climate change on North American agriculture in the twenty-first century.
Assessing climate model skill and accuracy is essential for impact assessment and for continuing GCM development. Past multimodel evaluations indicated that GCMs show considerable ability to reproduce global annual surface temperature, but are less skillful in reproducing observed precipitation and pressure fields (Covey et al. 2003). Models also successfully simulate other aspects of the climate system such as monthly sea surface temperature and salinity (Schneider et al. 2007). Detailed model–data comparisons and skill assessments are less common for climate variables based on daily data, as reproducibility is expected to be lower given the coarse resolution of GCMs and the increasing importance of random weather events in defining the statistics of a climate parameter as the time scale decreases. However, studies comparing results of multiple GCMs at higher temporal resolutions found that the models show some ability to simulate impact-relevant indices. For example, Tebaldi et al. (2006) found that an ensemble of GCMs correctly reproduced the sign of the temporal trend of 10 climate-extreme indices, while Meehl et al. (2004) found some similarities between spatial trends in model output and observations for twentieth-century frost days in the United States.
The release of results from nearly two dozen GCMs through the World Climate Research Programme Third Coupled Model Intercomparison Project (WCRP CMIP3 or CMIP3) gives researchers unprecedented access to state-of-the-art climate model output (Meehl et al. 2007). This dataset facilitates the study of more regional and localized climate change impacts (e.g., Tebaldi et al. 2004; Schneider et al. 2007). It also enables researchers to examine multimodel projections of more impact-relevant climate indices such as temperature and precipitation extremes (Meehl et al. 2005; Tebaldi et al. 2006) and to better evaluate differences in model skill through comparisons with past observations (Covey et al. 2003; Schmittner et al. 2005). Finally, using multimodel ensembles can reduce predictive uncertainty compared to using any single model in isolation (Hagedorn et al. 2005; Raftery et al. 2005). We use this set of model simulations to gauge the ability of the current generation of GCMs to reproduce observed patterns of agro-climate indices in North America.
Our approach to assessing the skill of a GCM follows Taylor (2001), who defines skill as the ability of a model to reproduce a climatic variable’s observed spatial and temporal patterns. A “perfect” model under this definition would have zero root-mean-square (rms) error, would correlate perfectly with the data, and would have the same standard deviation as the data. Thus, “skill” measures correspondence between patterns, trends, and variability in the model and observations. We adopt this definition of skill for our analysis.
We address two questions: 1) what are the spatial and temporal patterns of agro-climate indices in North America in the late-twentieth and early-twenty-first century, and 2) what is the skill of GCMs (both individually and as a combined multimodel ensemble) in reproducing these patterns? We analyze observed patterns through the use of a combined global long-term station observation network, the Global Historical Climatology Network (GHCN) (Durre et al. 2010). We evaluate model skill by calculating twentieth-century agro-climate indices for 17 GCMs from the CMIP3 with the necessary daily maximum and minimum temperature output. We examine individual and ensemble model skill through spatial and temporal pattern similarity statistics. With this approach, we examine the current limits of the CMIP3 dataset to potentially provide useful projections for ecologically and socially important climate variables at more relevant spatial and temporal scales.
We analyze an area bounded by the conterminous United States and northern Mexico (north of 20°N) and Canada to the 55th parallel. The GHCN contains long records of daily station data for maximum and minimum temperatures for thousands of stations updated daily (Durre et al. 2010). These data have been carefully quality controlled to minimize processing errors by subjecting the data to 19 quality assurance tests (e.g., internal consistency checks, time-of-day reporting biases, and duplicate values). We removed all flagged values from the dataset, which still allowed us to retain observations from over 17 000 stations.
b. Climate models
We use GCM data from the WCRP CMIP3 multimodel dataset (Meehl et al. 2007). We choose GCMs with daily model output for the Climate of the Twentieth Century Experiment (20C3M) performed for the Fourth Assessment Report (AR4) of the Intergovernmental Panel on Climate Change (IPCC). In all, 17 GCMs have daily maximum and minimum temperature data with time series lengths ranging from 38 to 100 years (Table 2). For GCMs with more than one model run, we use the first model run to retain maximum variability in the daily output. We ignore GCM output over oceans in order to maintain consistency with the climatic processes observed by the GHCN.
a. Agro-climate indices calculation
We examine three agro-climate indices: frost days, thermal time (or growing degree-days), and the heat stress index (HSI). A frost day is a day on which the minimum temperature is below 0°C. The number of frost days impacts crops by (i) affecting the growing season length and (ii) damaging crops from either early or late growing season frost events. Thermal time is the number of accumulated degrees within certain thresholds over a given time period for a crop. It is a useful heuristic because of its strong correlation with crop growth (Coehlo and Dale 1980). Following Feng and Hu (2004) we define thermal time as
$$\mathrm{TT} = \sum_{i=G_b}^{G_e} \left[ \frac{T_{\max,i} + T_{\min,i}}{2} - T_l \right], \qquad (1)$$
where TT is the thermal time; Gb and Ge are the beginning and ending dates of a standard growth period for a crop (e.g., 1 April through 31 October); Tmax and Tmin are the maximum and minimum daily temperature, respectively; and Tl is the limiting temperature where the upper and lower limits define the range of crop growth. If the mean daily temperature [the first term in (1)] is below the lower limit of Tl or above the upper limit of Tl, then the mean daily temperature is set to Tl and the number of growing degrees for that day is set to 0. We use temperature thresholds applicable for growing maize (Zea mays L.) in the central United States to derive the thermal time values with Gb and Ge set to 1 April and 31 October, respectively, and upper Tl and lower Tl fixed at 30° and 10°C (Feng and Hu 2004). We use this particular growing season to facilitate comparison of trends across regions and because most of the study area is agriculturally active during this time of year. Maize was chosen as the threshold crop because it encompasses the largest acreage and greatest production value of any single crop in the United States (USDA 2009). In addition, in a warming climate there is potential for northern expansion of maize production areas (or conversely, a southern contraction of suitable production areas). The heat stress index, similar to thermal time, is the accumulated degrees above a specified temperature threshold (equivalent to the upper Tl of 30°C), aggregated over the growing season. Temperatures above this threshold can negatively impact key plant processes such as grain filling, resulting in reduced biomass production (Wilhelm et al. 1999). All three indices are expressed as aggregated yearly values, and we remove years with more than 10% missing days, or if more than 5% of days are missing within the season of interest (i.e., cold season for frost days or the growing season for heat stress and thermal time).
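The index definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: the function names are hypothetical, the inputs are assumed to be daily temperatures in °C already restricted to the relevant season, and the heat stress index is assumed to accumulate daily maximum temperature in excess of the 30°C threshold.

```python
import numpy as np

def frost_days(tmin):
    """Count days whose minimum temperature is below 0 degC."""
    return int(np.sum(np.asarray(tmin, dtype=float) < 0.0))

def thermal_time(tmax, tmin, t_lower=10.0, t_upper=30.0):
    """Accumulate growing degrees over the supplied growth period.

    Days whose mean temperature falls outside [t_lower, t_upper]
    contribute zero growing degrees, following the rule in the text.
    """
    tmean = (np.asarray(tmax, dtype=float) + np.asarray(tmin, dtype=float)) / 2.0
    in_range = (tmean >= t_lower) & (tmean <= t_upper)
    return float(np.sum(np.where(in_range, tmean - t_lower, 0.0)))

def heat_stress_index(tmax, threshold=30.0):
    """Accumulate degrees of daily maximum temperature above the threshold.

    Using Tmax (rather than the daily mean) is an assumption of this sketch.
    """
    excess = np.asarray(tmax, dtype=float) - threshold
    return float(np.sum(np.clip(excess, 0.0, None)))

# Four illustrative days (values are made up for the example).
tmax = [20.0, 32.0, 36.0, 5.0]
tmin = [10.0, 18.0, 30.0, -3.0]
print(frost_days(tmin), thermal_time(tmax, tmin), heat_stress_index(tmax))
```

In this toy series the third day has a mean temperature of 33°C, so it adds nothing to thermal time even though it adds 6 degree-days to the heat stress index.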
b. Linear trend time periods
We separate our analysis of the linear trends of agro-climate indices in the GHCN into two “sets” of time periods. The first set contains two time periods: 1951–80 and 1981–2010. Trends are not calculated prior to 1951 in order to maximize the number of stations available with nearly complete records (greater than 90% of years available). This allows for a consistent comparison across time periods and increases confidence in the observed trend patterns. The break point at 1980 is chosen since it roughly coincides with the beginning of the most recent warming period of the twentieth century (Brohan et al. 2006; Smith and Reynolds 2005) and is concurrent with noted atmospheric circulation changes in the late 1970s that coincided with an abrupt regime shift in the heat content of the Pacific Ocean (Barnett et al. 2001; Stephens et al. 2001; Brohan et al. 2006). The second period ends with the most recent year available in the daily GHCN dataset used for this study.
Time periods in the second set of linear trends consist of seven overlapping 30-yr periods over which we assess statistical significance of agro-climate trends: 1951–80, 1956–85, 1961–90, 1966–95, 1971–2000, 1976–2005, and 1981–2010. These periods are used to illustrate the evolving character of the agro-climate signal as measured by the percentage of stations showing significant trends. We also use these overlapping time periods to examine the evidence for accelerating agro-climate warming trends.
To calculate the statistical significance of a station’s temporal trend, we use least squares regression to fit a linear trend to the data and account for temporal autocorrelation by fitting a first-order autoregressive time series model to the residuals of the linear model (Harvey 1993, sections 3.3 and 3.4; Brockwell and Davis 1996, sections 3.3 and 8.3). Both models are fit using the arima function in the R statistical package (R Development Core Team 2008). A station’s trend is considered statistically significant if, after accounting for autocorrelation, the 95% confidence interval around the linear trend does not contain zero. We employ a Kalman filter to interpolate missing values in a station’s time series (Shumway and Stoffer 2006, 348–352). Stations for which 80% or more of the agro-climate index values are a single value (i.e., zero) are removed from the trend analysis. We assume that these observing stations are not in areas with climatic conditions commensurate with how we have defined these three agro-climate indices, and it is difficult to estimate a trend for these locations because too few days fall within the thresholds. We also remove stations from this portion of the analysis if they have five or more consecutive years of missing data or if the final three or more years are missing. Finally, because we test for statistical significance at multiple sites, we account for an inflated null hypothesis rejection rate (also known as the false-discovery rate) by calculating adjusted p values that are pooled across all station trends (Benjamini and Hochberg 1995). Overall, between 778 (thermal time) and 1233 (frost days) stations remained in the trend analysis upon completion of the filtering process (Table 1).
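The significance test can be illustrated with a simplified Python sketch. It is not a port of the R arima fit described above: as a stand-in, it inflates the least squares standard error using an effective sample size derived from the lag-1 autocorrelation of the residuals, and it uses a normal approximation for the p value. The function names trend_with_ar1 and bh_adjust are hypothetical; the second function applies the Benjamini and Hochberg (1995) step-up adjustment to p values pooled across stations.

```python
import math
import numpy as np

def trend_with_ar1(y):
    """Least squares trend with a simple AR(1) correction (illustrative only)."""
    y = np.asarray(y, dtype=float)
    n = y.size
    t = np.arange(n, dtype=float)
    slope, intercept = np.polyfit(t, y, 1)
    resid = y - (slope * t + intercept)
    # Lag-1 autocorrelation of the residuals; only positive values are corrected.
    r1 = max(np.sum(resid[1:] * resid[:-1]) / np.sum(resid ** 2), 0.0)
    n_eff = n * (1.0 - r1) / (1.0 + r1)   # effective sample size
    dof = max(n_eff - 2.0, 1.0)           # guard against tiny samples
    se = math.sqrt(np.sum(resid ** 2) / dof / np.sum((t - t.mean()) ** 2))
    # Two-sided p value from a normal approximation (an assumption of this sketch).
    p = math.erfc(abs(slope / se) / math.sqrt(2.0))
    return slope, se, p

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p value.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(scaled, 1.0)
    return adjusted
```

A station trend would then be flagged as significant when its adjusted p value falls below 0.05, which plays the same role as the 95% confidence interval criterion in the text.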
c. Interpolation method–trend patterns
We examine spatial patterns of agro-climate trends by interpolating the station trends onto a 0.5° latitude by 0.5° longitude grid. We use kriging methods to perform the interpolation (Bretherton et al. 1976; Cressie 1993). This requires the specification of a model to represent the spatial correlation structure between station trends so as to arrive at an objectively weighted value at each grid cell based on the surrounding stations.
We select the model for the correlation structure of the data by examining empirical variograms depicting the spatial correlation between locations in a 2D graphical form (Cressie 1993). In an empirical variogram, spatial dependence is expressed by the magnitude of the semivariance values (i.e., the dissimilarity) across all pairs of data points, separated into bins according to the distance (or “lag”) between station locations. Unresolved small-scale variation, referred to as the “nugget,” is given by the semivariance value for the lag-0 bin. For the analysis we create empirical variograms for the two time periods (1951–80 and 1981–2010). Examination of these variograms suggests that an exponential function with maximum ranges between 7° (frost days) and 9° (thermal time and heat stress index) distance is a reasonable approximation to the observed variability and is used for the interpolations. The nugget values for the three agro-climate indices range from 0.2 to 0.4 [(days yr⁻¹)²] for frost days, 1.5 to 3.0 [(degree-days yr⁻¹)²] for thermal time, and 0.4 to 1.9 [(degree-days yr⁻¹)²] for heat stress index.
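The variogram step can be sketched as follows. This is a minimal illustration with hypothetical function names: distances are taken as Euclidean separations in degrees (a simplification; great-circle distances may be more appropriate), and the exponential model is written with a range parameter whose relation to the 7° to 9° ranges quoted above is an assumption.

```python
import numpy as np

def empirical_variogram(coords, values, bin_edges):
    """Mean semivariance of all station pairs, binned by separation distance."""
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    i, j = np.triu_indices(len(values), k=1)       # every unique station pair
    dx, dy = (coords[i] - coords[j]).T
    dist = np.hypot(dx, dy)
    semivar = 0.5 * (values[i] - values[j]) ** 2
    which = np.digitize(dist, bin_edges) - 1        # bin index for each pair
    n_bins = len(bin_edges) - 1
    out = np.full(n_bins, np.nan)                   # NaN marks empty bins
    for b in range(n_bins):
        mask = which == b
        if mask.any():
            out[b] = semivar[mask].mean()
    return out

def exponential_variogram(h, nugget, partial_sill, range_param):
    """Exponential variogram model; zero at lag 0, nugget + sill at large lags."""
    h = np.asarray(h, dtype=float)
    model = nugget + partial_sill * (1.0 - np.exp(-h / range_param))
    return np.where(h > 0, model, 0.0)
```

Comparing the binned semivariances with the model curve for candidate nugget, sill, and range values is one way to judge whether an exponential form is a reasonable approximation.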
d. Interpolation method–block patterns
The spatial resolution of the CMIP3 simulations is too coarse for direct comparison to station trends. One option is to use empirical downscaling methods wherein a statistical model is fit between the GCM output and the locations of interest (Christensen et al. 2007). However, one goal of our analysis is to retain the original model output in order to evaluate strengths and weaknesses of the current generation of GCMs as a prelude to the release of the next CMIP dataset (CMIP5). Therefore, we must aggregate the observations up to a scale that is appropriate for comparison with the GCM data. We chose to interpolate the GHCN data to a 5° latitude by 5° longitude grid covering the study area. We assume this scale is coarse enough to allow for comparison with GCM output while also still showing regional patterns of change and variability for the agro-climate indices.
We follow a modified version of the procedure described in Haylock et al. (2008) using thin-plate splines (TPS) for the interpolation. The spline model (with an elevation covariate) is fit to the station data by generalized cross-validation in the R package fields (Fields Development Team 2006). The fitted model is then used to interpolate the daily observations of maximum and minimum temperature to a very fine grid (10′ resolution or roughly 0.167°) of locations commensurate with the National Oceanic and Atmospheric Administration 1′ gridded elevations/bathymetry for the world (ETOPO1) Global Relief Model (Amante and Eakins 2009). These interpolated data are averaged across the corresponding 5° grid cells. We use the resulting daily maximum and minimum temperature fields to calculate annual agro-climate indices for each grid cell. Haylock et al. (2008) argue that this method allows for better comparison with GCM output since averaging across a fine grid to achieve the coarse grid values is more similar to the output from the finite difference method used in GCM simulations than is interpolating directly from station observations to the grid centroid.
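The aggregation from the fine grid to the coarse grid amounts to a block average. The sketch below assumes a regular fine grid whose dimensions are exact multiples of the coarse cell size (for a 10′ grid and 5° cells, 30 fine cells per side); the function name is hypothetical, missing or ocean cells are assumed to be stored as NaN, and no latitude-dependent area weighting is applied.

```python
import numpy as np

def block_average(fine, factor):
    """Average a 2D fine-grid field over non-overlapping factor x factor blocks."""
    fine = np.asarray(fine, dtype=float)
    ny, nx = fine.shape
    if ny % factor or nx % factor:
        raise ValueError("grid dimensions must be multiples of factor")
    # Split each axis into (coarse index, position within block).
    blocks = fine.reshape(ny // factor, factor, nx // factor, factor)
    # NaN cells (e.g., ocean) are ignored within each block.
    return np.nanmean(blocks, axis=(1, 3))

# A 4 x 4 field averaged onto a 2 x 2 grid.
print(block_average(np.arange(16).reshape(4, 4), 2))
```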
e. GCM analysis
We calculate two different skill scores for each agro-climate index to gauge the ability of GCMs to simulate twentieth-century patterns of agriculturally related climate change. These skill scores reflect both the correlation and the area-weighted average deviation between the GCM results and the observations. The first measure is the Taylor skill score (Taylor 2001):
where R is the temporal correlation between the model (the GCM) and observations; σm and σobs are the estimated standard deviations of the model and observations, respectively; and Ro is the multimodel ensemble mean correlation (Taylor 2001; Schneider et al. 2007). Related to this skill score, Taylor diagrams are also calculated. These diagrams compare the “distance” of a model from the observations and relative to other models by exploiting the geometric and algebraic relationship between the correlation, standard deviation, and rms error so that they can all be viewed simultaneously on one diagram (Taylor 2001). We calculate this and all subsequent skill scores for the agro-climate indices averaged across the entire study area, creating a single time series for each GCM and the observations.
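A skill score of this type can be sketched in Python. The form below is one of the variants given by Taylor (2001); the exponent (set here through the hypothetical power argument) and the default reference correlation r0 are assumptions of this sketch and may differ from the exact expression used in the study.

```python
import numpy as np

def taylor_skill(model, obs, r0=1.0, power=4):
    """Taylor-type skill score combining correlation and normalized variability.

    The score equals 1 when the correlation reaches r0 and the model standard
    deviation matches the observed one; it falls toward 0 as either degrades.
    """
    model = np.asarray(model, dtype=float)
    obs = np.asarray(obs, dtype=float)
    r = np.corrcoef(model, obs)[0, 1]
    sigma_hat = np.std(model) / np.std(obs)   # normalized standard deviation
    numerator = 4.0 * (1.0 + r) ** power
    denominator = (sigma_hat + 1.0 / sigma_hat) ** 2 * (1.0 + r0) ** power
    return float(numerator / denominator)
```

Because the score depends only on the correlation and the ratio of standard deviations, adding a constant offset to the model series leaves it unchanged.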
The second skill score is the mean absolute error (MAE). This score is similar to the rms error but is less sensitive to outliers (Willmott et al. 1985). It is calculated as
$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| \mathrm{obs}_i - m_i \right|,$$
where $N$ is the number of years at each location to compare, $\mathrm{obs}_i$ is the observed agro-climate value at a location in year $i$, and $m_i$ is the model value at the same location in year $i$. We note that the Taylor skill score is unaffected by a constant model bias, whereas the MAE is sensitive to it.
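As a small illustration of the MAE and its sensitivity to a constant bias, the following Python sketch uses a hypothetical function name and made-up values.

```python
import numpy as np

def mean_absolute_error(obs, model):
    """Mean absolute difference between paired observed and modeled values."""
    obs = np.asarray(obs, dtype=float)
    model = np.asarray(model, dtype=float)
    return float(np.mean(np.abs(obs - model)))

obs = np.array([1.0, 2.0, 3.0])
print(mean_absolute_error(obs, [2.0, 2.0, 5.0]))   # errors of 1, 0, and 2
print(mean_absolute_error(obs, obs + 2.0))         # a constant bias of 2
```

A model that tracks the observations perfectly apart from a constant offset still receives an MAE equal to that offset, which is why working with anomalies matters for this score.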
We create Taylor diagrams (Taylor 2001) and weighted rankings based on Taylor skill scores and the mean absolute error (MAE) for anomalies of agro-climate indices for each GCM over the period 1961–98 (the period when all 17 GCMs had complete time series). Before calculation of the model’s two skill scores from (4) and (5) we use bilinear interpolation to create estimated GCM values for the coarse grid locations used to aggregate the GHCN observations.
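The bilinear interpolation step can be sketched as follows. The function name and argument layout are hypothetical, and the sketch assumes a regular GCM grid with ascending latitude and longitude coordinates and a target point inside the grid.

```python
import numpy as np

def bilinear(grid, lats, lons, lat, lon):
    """Bilinearly interpolate grid[lat index, lon index] to the point (lat, lon)."""
    grid = np.asarray(grid, dtype=float)
    lats = np.asarray(lats, dtype=float)
    lons = np.asarray(lons, dtype=float)
    # Index of the grid cell whose lower corner lies at or below the target.
    i = int(np.clip(np.searchsorted(lats, lat) - 1, 0, len(lats) - 2))
    j = int(np.clip(np.searchsorted(lons, lon) - 1, 0, len(lons) - 2))
    ty = (lat - lats[i]) / (lats[i + 1] - lats[i])   # fractional position in lat
    tx = (lon - lons[j]) / (lons[j + 1] - lons[j])   # fractional position in lon
    low = (1.0 - tx) * grid[i, j] + tx * grid[i, j + 1]
    high = (1.0 - tx) * grid[i + 1, j] + tx * grid[i + 1, j + 1]
    return float((1.0 - ty) * low + ty * high)
```

Applying this at the centroid of each 5° cell gives GCM values at the same locations used to aggregate the observations.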
a. Interpolation comparison
To evaluate the efficacy of our method, we compare the results of the TPS interpolation to another dataset of maximum and minimum temperatures (Maurer et al. 2002). These gridded data are available from 1950 to 1999 (currently being updated to 2010) for the conterminous United States at a resolution of 1/8° (~12 km). Originally, the dataset was created for hydrologic modeling and, as such, the model used to create the gridded observations is structured so that the land surface water and energy budgets balance at each time step. Thus, additional forcings are used to derive the model, in contrast to a two parameter (distance and elevation) model used in our TPS procedure. For each grid cell we calculate the MAE, the correlation coefficient, and the daily difference between the two datasets. We also compute the area-weighted average correlation coefficient and MAE across all grid cells.
In general there is good agreement between the datasets. Data-sparse regions in Mexico, mountainous regions, and near the Great Lakes have higher MAE values (Figs. 1a and 1b) and lower correlation coefficients (Figs. 1c and 1d). For both maximum and minimum temperature, the average MAE value across all grid cells and all days is 0.4°C, and the spatially averaged correlation coefficient is very high, at 0.99. In Figs. 1e and 1f, the daily time series of average differences between grid cells shows that the TPS gridded values are biased toward slightly colder maximum temperatures (mean difference of −0.1°C) and slightly warmer minimum temperatures (mean difference of 0.1°C). These differences appear to be consistent for the entire 50-yr period, suggesting that they are a product of differences in the gridding methodology.
b. Agro-climate trends
We first present the spatial results of the estimated and interpolated least squares station trends without accounting for serial correlation. Figure 2 shows spatial trend patterns of agro-climate indices for the analyzed stations and the interpolated results for the study area over the two time periods (1951–80 and 1981–2010). Only those stations with trends larger than ±0.5 (frost days yr−1), ±2.5 (growing degree-days yr−1), and ±2.5 (heat stress degree-days yr−1) are shown. There is a noticeable decline in the number of stations with increasing frost days trends (i.e., a decline in cooling trends; Figs. 2a and 2b) and an increase in the number of stations with negative frost days trends, especially in the eastern United States and Canada (red circles). The most recent time period shows the shift to a warming pattern that dominates large portions of the continent (Fig. 2b). The pattern is consistent with the regional late-twentieth-century trends seen in Cayan et al. (2001), Easterling (2002), and Feng and Hu (2004) where western areas of the continent show the most warming over the entire period, although areas of the central United States and southeastern Canada also have experienced substantial declines in frost days. Easterling (2002) and Feng and Hu (2004) also show weak cooling trends in the southeastern United States, and a decade after these studies, this pattern continues, insofar as areas around the South Atlantic coastal plain have shown a weak cooling trend.
There is a similar pattern for the thermal time trends in which negative trends are seen for the majority of stations for the first time period, which could be coincident with the positive frost trends (Figs. 2c and 2a). However, given that thermal time for maize is defined by both a lower and an upper surface air temperature limit, an increase in extreme high temperatures can also result in lower thermal time and thus a negative trend. Figure 2d generally mirrors the pattern seen with frost days (i.e., majority declines in frost days and increases in thermal time).
The pattern of heat stress index trends changes markedly between the two periods (Figs. 2e,f). There are negative HSI trends over most of the eastern and southern half of the United States in the first time period (nearly 40% of stations, Fig. 2e). Most of the interpolated HSI trends for Canada are near zero since the extreme maximum temperatures that define the heat stress index are not as common in this region. There are positive HSI trends in the western parts of the continent, while negative trends persist for both time periods in the central and southeastern regions. Feng and Hu (2004) and Lobell and Asner (2003) find similar cooling trends in the central and southeastern parts of the United States and the warming trend in the southwestern United States is consistent with what Easterling (2002) found for frost days. The warming trend for HSI values in the western United States confirms that warming in recent decades in this region is not limited to increasing night temperatures. The number of stations with large warming trends is the lowest for HSI values compared to the other agro-climate indices. This is consistent with prior studies showing that for most of the United States, the recent warming primarily manifests as an increase in minimum temperatures (Karl et al. 1993; Jones et al. 1999; Easterling et al. 2002; Caesar et al. 2006). However, the overall number of stations with large positive trends increases from 3% to 13% between the two time periods.
We estimate probability density functions (PDFs) of the linear station trends for the two time periods plus the entire 60-yr period for each agro-climate index (Fig. 3). The horizontal lines above each PDF correspond to two standard deviations around the sample mean. The results confirm the large shift in trends from the first time period to the two most recent periods. The continued increase in minimum temperatures indicated by the frost day trends and the lack of widespread increases in HSI values are likely contributors to the large shift toward positive thermal time trends seen in Fig. 3b. The HSI PDFs also show that, although the majority of station trends remain near zero for the three time periods, the large change in the mean from the first to the most recent period stems from the decline in negative skewness in the trend distribution.
We use the method described in section 2 to detect statistically significant station trends while accounting for serial correlation in the data. We use the second set of seven overlapping 30-yr time periods from 1951 to 2010. The results in Table 1 are separated according to trend direction (warming or cooling) and region (east or west of the 100°W meridian). With the exception of frost day trends in the western region, all agro-climate indices show declines in the percentage of stations with statistically significant negative trends. However, the data also show the influence of the recent series of colder summers and winters as some agro-climate indices show substantial declines in the percentage of stations with statistically significant warming trends for the 1981–2010 period compared to the immediately preceding decades.
Figure 4 presents the results of Table 1 in graphical form by plotting the station trends after accounting for autocorrelation. If an accelerating warming trend is present, one would expect to see the mean temporal trend increase through time for the heat-stress index and thermal time while decreasing for frost days. Such an acceleration of warming is possible as the anthropogenic greenhouse gas (GHG) signal becomes more pronounced and as the maximum oceanic heat uptake approaches. We find no clear evidence of such trends at this time. The frost day trends shown in Fig. 4 actually increase in the most recent decades, although the overall trend is still negative. Thermal time and heat stress index trends (not shown) are nearly constant for all but the first 30-yr time period. We conclude that at least qualitatively, there is no obvious and strong evidence for an accelerating agro-climate warming trend.
The recent cooling pattern observed in parts of the central and southeastern United States for frost days and HSI is consistent with the so-called “warming hole” described by Pan et al. (2004) and Kunkel et al. (2006). Goldstein et al. (2009) posited that the cooling summer trend in the southeastern United States was likely due to greater aerosol optical thickness resulting from interactions between anthropogenic aerosols and biogenic volatile organic compounds (BVOC), causing a negative localized feedback between temperature and BVOC. Portmann et al. (2009) also speculated that spatiotemporal patterns of atmospheric aerosols in the eastern United States are linked to positive changes in the hydrologic cycle that in turn reduce summer temperatures.
c. GCM results
Most GCMs only have daily data from 1961–2000 for the 20C3M experiment as part of the IPCC AR4. We compare model versus observed trends for this period and calculate least squares linear trend estimates after interpolating to the same 5° latitude by 5° longitude grid as used for the observation data. Figure 5 summarizes the results of the comparison for each grid cell. The results are ordered by ascending observation trend for each agro-climate index. For each grid cell, the mean, median, and range of the 17 GCM agro-climate trends are plotted. At the top of each panel are the aggregated results for the continental and the eastern and western regions. Several patterns emerge in this figure. Unlike the GHCN data, the GCMs show very few cooling trends and generally show a more static range of agro-climate index trends. Furthermore, this range of GCM values is more likely to exclude the observed trend when that trend is a cooling trend. The aggregated and spatially averaged GCM trends all show stronger warming than the observed trends. That being said, the sign of the simulated agro-climate index trends is correct for the continental and regional scales. However, it is important to note that, even at the continental scale, the observed thermal time trend falls outside the range of the GCM ensemble (Fig. 5b).
d. Model skill evaluation
The Taylor diagrams for the 17 GCMs for each agro-climate index show relatively low model skill for all agro-climate indices, particularly for HSI values (Fig. 6). The three statistics (correlation, standard deviation, and rms error) that define the Taylor diagram generally show poor model fit for the individual GCMs, with the exception of good agreement between observed and modeled temporal variability (Table 2). As a group the GCMs have highest skill in simulating frost days, although correlation coefficients are still for the most part low (μr = 0.23, σr = 0.2). However, most individual GCM standard deviations for this agro-climate index are close to the observed value. Most sample standard deviation values are also within 50% of the observed value for thermal time with similar correlation coefficients and rms errors as the simulated frost day anomalies. Taylor diagrams indicate that model skill for reproducing the observed heat stress is low. This may be partly due to GCMs simulating too few maximum daily temperatures above the 30°C HSI threshold. An extreme example of this is the fact that the National Center for Atmospheric Research Parallel Climate Model (NCAR PCM) output did not contain any days above the HSI threshold in the entire study area. Consequently, rms errors for this agro-climate index are much higher compared to thermal time or frost day rms errors and correlation coefficients are near zero for all GCMs. Some GCMs also have sample standard deviation values more than twice as large as the observations, although many are within 50% of the observed value.
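For readers unfamiliar with the three statistics summarized in a Taylor diagram, the following Python sketch computes them for a single model series. It assumes the centered (bias-removed) rms difference and the population (1/N) standard deviation used by Taylor (2001); whether the study used exactly these conventions is an assumption, and the function name and sample values are hypothetical.

```python
import numpy as np

def taylor_statistics(model, obs):
    """Return (correlation, model std, observed std, centered rms difference)."""
    model = np.asarray(model, dtype=float)
    obs = np.asarray(obs, dtype=float)
    m_anom = model - model.mean()
    o_anom = obs - obs.mean()
    sigma_m = m_anom.std()  # population standard deviation (ddof=0)
    sigma_o = o_anom.std()
    r = np.mean(m_anom * o_anom) / (sigma_m * sigma_o)
    crmse = np.sqrt(np.mean((m_anom - o_anom) ** 2))
    return r, sigma_m, sigma_o, crmse

# Illustrative values only: the model has twice the observed variability.
obs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
model = 2.0 * obs
r, sigma_m, sigma_o, crmse = taylor_statistics(model, obs)

# The three statistics are linked by the law of cosines, which is what
# allows them to be shown on a single two-dimensional diagram.
identity_gap = crmse**2 - (sigma_m**2 + sigma_o**2 - 2.0 * sigma_m * sigma_o * r)
print(r, sigma_m / sigma_o, crmse, identity_gap)
```

Because the statistics are tied together geometrically, a model can match the observed standard deviation well and still plot far from the reference point when its correlation is low, as reported for most of the GCMs here.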
We also include results from an empirically downscaled GCM [from the Geophysical Fluid Dynamics Laboratory model] for comparison with the raw GCM output. Daily maximum and minimum temperatures were downscaled to a ⅛° grid using the Maurer et al. (2002) dataset as the basis for the fitted model and then aggregated up to the coarse grid and the continental scale. This particular downscaling method uses asynchronous regression (described in O’Brien et al. 2001) to better represent the extremes of the temperature distribution. The results are plotted in the Taylor diagram in Fig. 6 and are labeled “DS.” The results do not appear to substantially change for the downscaled output for this particular model. All metrics used to calculate and plot the skill score (standard deviation, correlation, and rms error) show results that are similar to the raw GCM output. We do not find this result surprising since the regions that we use for analysis are on the subcontinental scale.
It is important to note that correlation values (and therefore part of the skill score assessment) could be greatly affected by differences in teleconnection phase (e.g., ENSO or NAO) or the initial values and patterns of sea surface temperatures. This boundary condition uncertainty can lead to low (or high) correlation between models and observations, meaning that this metric may not be as meaningful a measure of skill as other metrics such as the standard deviation (Tebaldi and Knutti 2007). Nevertheless, if there is a consistent long-term trend in both models and observations, then it should be reflected in the correlation coefficient.
We use the Taylor and MAE skill scores to calculate model-specific weights and a corresponding weighted GCM ensemble mean to attempt to reduce predictive uncertainty of projected changes in agro-climate indices (Schneider et al. 2007). To calculate the model weights, the skill scores are rescaled to range from zero to one (Table 2). The two sets of model weights are applied to the ensemble-mean GCM for each agro-climate index. The resulting weighted and unweighted (which can also be thought of as equally weighted) time series and the least squares trend lines are shown in Fig. 7. Of the two weighting schemes, only the Taylor skill score weighted mean ensemble is shown, but the MAE weighted time series produces very similar results. We also show the individual GCM time series and the observed mean time series and its trend line for comparison. Although the Taylor diagrams indicate low model skill for most individual GCMs, the ensemble means (both weighted and unweighted) track closely with the observed time series. Some individual discrepancies between the ensemble means and the observations can be identified, such as the sharp decline in thermal time and heat stress index values around the time of the Mt. Pinatubo eruption. Some GCMs did not include this strong, temporary forcing in the 20C3M experiment and thus no abrupt cooling would be expected. Nevertheless, the ensemble mean does show a decline during these years that is likely reflective of lower simulated temperatures for GCMs that did include volcanic forcings in the experiment.
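A skill-weighted ensemble mean of this kind can be sketched in Python as follows. The sketch assumes that "rescaled to range from zero to one" means min-max rescaling and that the weights are normalized by their sum; both are assumptions, as are the function names and the toy skill scores, and none of the numbers come from Table 2.

```python
import numpy as np

def rescale_scores(scores):
    """Min-max rescale skill scores to [0, 1] (assumed rescaling rule)."""
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    if span == 0:
        # All models equally skillful: fall back to equal weights.
        return np.ones_like(scores)
    return (scores - scores.min()) / span

def weighted_ensemble_mean(gcm, weights):
    """Weighted mean across models; gcm has shape (n_models, n_years)."""
    gcm = np.asarray(gcm, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return weights @ gcm / weights.sum()

# Toy example with three models and two years.
skill = [0.2, 0.5, 0.8]
gcm = np.array([[10.0, 12.0], [20.0, 22.0], [30.0, 32.0]])
weights = rescale_scores(skill)
weighted = weighted_ensemble_mean(gcm, weights)
unweighted = weighted_ensemble_mean(gcm, np.ones(len(skill)))
print(weights, weighted, unweighted)
```

Note that under min-max rescaling the least skillful model receives zero weight, so the choice of rescaling rule can itself affect how far the weighted mean departs from the arithmetic mean.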
The overall improvement in the two ensemble mean GCM time series results in smaller MAE and rms errors when compared to individual GCM error. For example, the weighted ensemble mean GCM MAE values for the three agro-climate indices are 70%, 65%, and 69% (for frost days, thermal time, and heat stress index, respectively) of the mean of the individual GCM mean absolute error values.
Overall, little improvement is seen in the Taylor skill-score-weighted ensemble mean GCM versus the arithmetic ensemble mean (Table 3). Least squares trends for thermal time and the heat stress index are larger than observed and very similar for all weighting methods. The model skill as depicted in these statistics is lowest for the heat stress index regardless of the method used. The three GCM ensembles do correctly reproduce the sign of the observed trend for all agro-climate indices, similar to the results of Tebaldi et al. (2006).
There is potential for the weighted ensemble GCM to suffer from model overfitting, where the addition of parameters meant to improve the agreement between the GCM ensemble and the observations causes a loss of predictive accuracy and, thus, no reduction in uncertainty. We examine this possibility by performing a simple leave-one-out cross-validation test of the Taylor skill-score-weighted ensemble GCM. Cross-validation involves using a subset of the original dataset as the input or “training” dataset for the model. The model is then used for prediction over the remainder of the data that are originally withheld. In this case, the model is the ensemble GCM created from the Taylor and MAE skill scores. We withhold one year of data from each GCM, recalculate the weighted ensemble GCM, and repeat for each year from 1961 to 1998. The average MAE over all withheld years (Table 4) does not indicate model overfitting. However, it does appear that the two weighting schemes fail to improve on the prediction error.
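The leave-one-out procedure can be sketched in Python as shown below. The weighting rule used here (weights proportional to inverse MAE) is a stand-in chosen for brevity and is an assumption, not the Taylor or MAE skill-score scheme of the study; the function names and the synthetic data are likewise hypothetical.

```python
import numpy as np

def equal_weights(gcm, obs):
    """Unweighted ensemble: every model gets the same weight."""
    return np.full(gcm.shape[0], 1.0 / gcm.shape[0])

def inverse_mae_weights(gcm, obs):
    """Assumed stand-in weighting: proportional to 1 / MAE over training years."""
    mae = np.mean(np.abs(gcm - obs), axis=1)
    w = 1.0 / (mae + 1e-12)  # small constant guards against division by zero
    return w / w.sum()

def leave_one_out_mae(gcm, obs, weight_fn):
    """Withhold each year in turn, fit weights on the rest, score the held-out year."""
    n_years = gcm.shape[1]
    errors = np.empty(n_years)
    for t in range(n_years):
        keep = np.arange(n_years) != t
        w = weight_fn(gcm[:, keep], obs[keep])
        prediction = w @ gcm[:, t]
        errors[t] = abs(prediction - obs[t])
    return errors.mean()

# Synthetic example: three models with constant biases of +1, -1, and +4.
obs = np.arange(10.0)
gcm = np.array([obs + 1.0, obs - 1.0, obs + 4.0])
mae_equal = leave_one_out_mae(gcm, obs, equal_weights)
mae_weighted = leave_one_out_mae(gcm, obs, inverse_mae_weights)
print(mae_equal, mae_weighted)
```

In this contrived case the weighting helps because one model carries a large constant bias. The study reports the opposite outcome for the real ensembles, where the held-out error of the weighted means is no better than that of the unweighted mean.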
e. Heat stress skill
One remaining issue is the much lower model skill (for both correlation and variability) when simulating HSI values compared to the other two agro-climate indices. Similar to thermal time, HSI is a cumulative measure of temperature within specified thresholds (and in this case an unbounded upper threshold). One would expect that a model’s ability to reproduce a yearly cumulative value would be less than its ability to reproduce the number of days beyond a particular threshold. Indeed, GCMs show higher skill in reproducing frost day trends compared to both thermal time and HSI. In addition, the threshold itself may have an impact on model skill. For example, the frost day threshold of 0°C is only 0.1 sample standard deviations from the study area sample mean minimum daily temperature of 1.7°C (calculated across the 5° × 5° grid). However, the HSI threshold of 30°C is 1.7 sample standard deviations greater than the sample mean maximum daily temperature of 14.3°C. Is the low HSI model skill a statistical artifact due to the manner in which this heuristic is defined, or is this aspect of the climate system poorly simulated by the GCMs?
We address this question by calculating two alternative definitions of heat stress. First, we define annual heat stress days as the number of days above a specified maximum daily temperature threshold, similar to frost days. Second, we calculate heat stress days for a series of 21 thresholds that correspond to values between zero and two standard deviations (between 14.3° and 34.5°C) above the observed mean daily maximum temperature. We then use the same set of standard deviation values to calculate 21 different cold stress thresholds between 1.2° and −17.6°C. Using these definitions, we recalculate heat and cold stress days for the interpolated observations and GCM output. We then compute the ratio of the area-wide standard deviation for each GCM to the observed standard deviation.
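The threshold experiment can be sketched in Python as follows. The sketch assumes evenly spaced thresholds (21 values from zero to two standard deviations, i.e., steps of 0.1), daily data arranged as one row per year, and the standard deviation taken over annual counts; these details, the function names, and the synthetic temperatures are assumptions for illustration only.

```python
import numpy as np

def stress_days(daily_temp, threshold, kind="heat"):
    """Annual count of days beyond a threshold; daily_temp is (n_years, n_days)."""
    if kind == "heat":
        return (daily_temp > threshold).sum(axis=1)
    return (daily_temp < threshold).sum(axis=1)

def std_ratio_by_threshold(obs_daily, gcm_daily, kind="heat",
                           n_thresholds=21, max_sd=2.0):
    """Ratio of model to observed std of annual stress days at each threshold."""
    mean, sd = obs_daily.mean(), obs_daily.std()
    sign = 1.0 if kind == "heat" else -1.0  # cold thresholds lie below the mean
    thresholds = mean + sign * np.linspace(0.0, max_sd, n_thresholds) * sd
    ratios = np.full(n_thresholds, np.nan)
    for i, thr in enumerate(thresholds):
        obs_sd = stress_days(obs_daily, thr, kind).std()
        if obs_sd > 0:  # leave NaN where the observed counts do not vary
            ratios[i] = stress_days(gcm_daily, thr, kind).std() / obs_sd
    return thresholds, ratios

# Synthetic daily maximum temperatures (degrees C), 40 years of 365 days.
rng = np.random.default_rng(0)
obs_daily = rng.normal(14.3, 10.1, size=(40, 365))
gcm_daily = rng.normal(15.0, 9.0, size=(40, 365))
thresholds, ratios = std_ratio_by_threshold(obs_daily, gcm_daily, kind="heat")
_, self_ratios = std_ratio_by_threshold(obs_daily, obs_daily, kind="heat")
print(thresholds[[0, -1]], ratios[[0, -1]])
```

A ratio of one means that the model reproduces the observed interannual variability of stress days at that threshold; comparing the observations with themselves returns one at every threshold, which serves as a simple sanity check.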
The box plots in Fig. 8 show that for both indices, as the threshold distance from the mean increases, the model fidelity to the observations decreases. However, there are also differences in this behavior between the indices. The range of standard deviation ratios is much smaller for the cold stress thresholds and grows smaller as the corresponding thresholds increase in absolute magnitude. This figure also shows the importance of using the ensemble mean to reduce predictive uncertainty in impact assessments: although the range of standard deviation ratios greatly increases for the heat stress thresholds, the mean ratio remains near one (corresponding to the observed value).
The spatiotemporal trends in the agro-climate indices all show large areas of regional warming in recent decades. However, we also find substantial differences in the magnitude and geographic extent of this warming across the United States and southern Canada. Common to all of the agro-climate index trends is the recent warming in the western continental United States, which is most pronounced for thermal time and heat stress index. This is consistent with GCM projections of precipitation decreases and temperature increases across the southwest United States concomitant with a weakening of the summer monsoon (Christensen et al. 2007).
The most recent decades show a more moderated warming trend. Some of this moderation is likely related to the weakened North Atlantic Oscillation (the so-called Warm Arctic/Cold Continent phenomenon, Budikova 2009). This is especially the case with the reduced warming trend for frost days, where historically low summer arctic sea ice amounts may lead to a weakened polar vortex in the fall and winter that translates into deep penetration of colder air masses into the midlatitudes (Francis et al. 2009). We also see evidence for the “warming hole” over the central and southeastern United States in these agro-climate indices (Kunkel et al. 2006). Goldstein et al. (2009) attribute cooler summer temperatures in the southeastern United States to a negative feedback between warming temperatures and biogenic volatile organic compounds. While the most recent HSI trends confirm this cooling effect in the extreme southeastern United States, the negative HSI trends are larger and more widespread in the central United States.
Our evaluation of the ability of GCMs to simulate twentieth-century North American agro-climate indices shows that individual model skill is low compared to other aspects of the climate system that have been evaluated (e.g., Covey et al. 2003; Schmittner et al. 2005). This is partly due to the fact that examining these three particular agro-climate indices requires using GCM data at a high temporal resolution that will negatively impact statistical agreement between models and observations. First and foremost, the low correlation scores are to be expected given that these are dynamic atmosphere–ocean models with their own simulated internal variability that has no year-to-year relation to the observed interannual variability (with the notable exception of GCMs that included large forcing events such as the Mt. Pinatubo eruption). In addition, the anthropogenic component of biogenic secondary organic aerosols is neglected in the CMIP3 models and thus this negative feedback on temperature is absent in 20C3M simulations (Goldstein et al. 2009). However, overall model skill also suffers both from the higher variability of climatic data at daily time scales and from the fact that the agro-climate indices are cumulative annual variables whose errors are compounded if a GCM exhibits a systematic bias. We also note that, for subcontinental scales, using the entire multimodel ensemble is important to increase the likelihood that the observed trend will fall within the range of model output. However, even with the full ensemble, for many grid cells the observed trend fell outside this range. This suggests that use of the range of model outputs can lead to overconfident predictions and potentially insufficient hedging against more extreme results (Draper 1995; Urban and Keller 2009).
The different heat and cold stress definitions analyzed in our study show that there are real differences in model skill as the thresholds increase. These differences exist both within the meteorological parameter being considered (i.e., maximum or minimum daily temperature) and across parameters. It appears that the CMIP3 generation of GCMs is less skillful at simulating the tails of the distribution for maximum temperatures as depicted by the heat stress index compared to other agriculturally pertinent climate indices such as frost days. The poor model fit around the years impacted by the Mt. Pinatubo eruption (Robock and Mao 1995) suggests some possible reasons for the discrepancy. This eruption caused a major cooling event due to the release of large amounts of aerosols into the atmosphere. Accurately simulating the effect of such aerosols (both natural and anthropogenic) on climate has long been a significant challenge for the modeling community (Penner et al. 1994; Haywood and Boucher 2000), especially for indirect effects such as clouds (Randall et al. 2007). Even for those models that explicitly specify the estimated reductions in shortwave radiation due to the effects of volcanic eruptions, the heat stress temporal trend and year-to-year anomalies are still biased toward values higher than those observed. Additional difficulties in simulating precipitation (and the attendant cloud cover) may also disproportionately affect the ability of GCMs to simulate heat extremes owing to its greater effect on maximum temperatures versus minimum temperatures.
We analyzed late-twentieth and early-twenty-first-century trends of three agro-climate indices (frost days, thermal time, and heat stress index) in North America and the ability of 17 GCMs to reproduce the observed temporal and spatial patterns. While many other indices could be used for impact analyses, these three represent a useful marker to illustrate relative strengths and weaknesses of the current generation of GCMs. Future efforts could include other agriculturally relevant indices such as the frost-free period or growing degree-day thresholds based on other crops and regions.
Using a historical climate network as the basis for the observations, we find widespread warming trends in all three indices, with frost days and thermal time exhibiting the most consistent warming trends. The areal extent of cooling trends declines through time and a large-scale pattern shift is observed around 1980. Accelerating agro-climate warming trends in the most recent observations are not observed, suggesting a stable or even a moderated warming trend at the present time in North America. GCM skill in reconstructing twentieth-century agro-climate index changes is poor compared to the ability of GCMs to simulate other aspects of the climate system such as sea surface temperature (Schmittner et al. 2005) and mean surface air temperature (Covey et al. 2003). This result held even when compared against an empirically downscaled GCM. The analyzed model-weighting schemes of Taylor (2001) and Schmittner et al. (2005) do not substantially improve agreement with the observed temporal patterns. Using the ensemble mean does, however, more accurately reproduce the observed variability of the agro-climate indices and the sign of the temporal trends. Using the multimodel ensemble increases the likelihood that the range of GCM trends will include the observed value. GCMs have the greatest skill in simulating frost days and accurately simulate both the sign and the magnitude of the linear trend.
Finally, we note the implications of using GCMs for agricultural impact studies in North America given the observed and simulated agro-climate trends. The recent trends have likely mitigated most of the adverse effects of climate change on crop production seen in other regions of the world (Lobell et al. 2011). Thus, the anomalous conditions seen in North America should not be extrapolated to other areas as an indication of how a warming world will impact agriculture. Much more work is needed to understand how regional aerosol forcings (both anthropogenic and natural) and unforced internal variability interact with the anthropogenic GHG warming signal to create localized negative climate feedbacks. We believe that incorporating these regional forcings into the next generation of GCM simulations, along with an increase in the total number of available simulations to better capture internal variability (such as in the forthcoming CMIP5; Meehl et al. 2009), will improve model performance and benefit future impact studies.
This study was supported by the National Oceanic and Atmospheric Administration under U.S. Department of Commerce Agreement EL133E07SE4607. This work was partially supported by the National Science Foundation (SES-0345925) as well as the Penn State Center for Climate Risk Management. We thank M. Haran, J. Fricks, and N. Urban for helpful feedback and discussion about the time series analysis and other aspects of spatial smoothing. The helpful comments and suggestions from three anonymous reviewers greatly improved this work. Any opinions and errors are those of the authors.
Current affiliation: Department of Biology, North Carolina State University, Raleigh, North Carolina.