## 1. Introduction

Weather and climate extremes, such as droughts and heat waves, have a huge impact on the environment [for reviews see Karl et al. (2008); Field et al. (2012)]. An unprecedentedly large number of extreme events has been observed in recent years (Frich et al. 2002; Alexander et al. 2006; Meehl et al. 2009; Coumou and Rahmstorf 2012; Peterson et al. 2012), including prominent examples such as the 2003 European heat wave (Stott et al. 2004; Schär et al. 2004), the 2010 Russian heat wave (Dole et al. 2011), and the 2000 floods in the United Kingdom (Pall et al. 2011). An important question is whether changes in the frequencies of such events are a consequence of internal variability in the climate system or whether they can be attributed to climate change (Frei and Schär 2001; Stott et al. 2004).

However, solid statistical evidence for changes in extremes and records is not easy to obtain (Frei and Schär 2001). Focusing on a confined geographical region or a brief period after a certain extreme event has been observed may lead to selection bias. A four-sigma event observed at a particular location is, by definition, rare but may be observed often if a more extended region is considered. Furthermore, the atmosphere is both spatially and temporally autocorrelated, and extreme events will therefore have a tendency to cluster in both space and time. For example, if a particular station or site experiences an extreme warm temperature on a particular day, this site will have an increased probability for also experiencing warm extremes on the following days. Neighboring sites will also have an increased probability for recording warm extremes on the same or following days. Teleconnections, related to modes of variability such as the North Atlantic Oscillation, complicate the spatial covariance structure and add long-range spatial connectivity. Also, long-range spatial connectivity may be more important for some temporal frequencies than for others—a consequence of the nonseparability of the covariance function in space and time (Hasselmann 1993; Stouffer et al. 2000; Tingley et al. 2012). Such challenges have been noted before (Benestad 2003) and they have been dealt with, for example, by reducing the number of degrees of freedom in the statistical tests (Wergen and Krug 2010) either by spatial subsampling (Kunkel et al. 2007) or by temporal bootstrapping (Alexander et al. 2006; Meehl et al. 2009). However, neither of these methods includes the full spatiotemporal covariance structure and may therefore lead to inaccurate estimates of the statistical significance.

In this paper, we will introduce and apply a method to investigate the statistical significance of extreme events that takes spatial and temporal correlations into account. Our approach is purely statistical and does not depend on knowledge about the physical mechanisms responsible for variability and trends. This is both a strength and a weakness. On one hand, our method is completely nonparametric and does not depend on any assumptions about the physical mechanisms; on the other hand, we need to separate internal variability from the secular variations using only the brief period with observations. Alternatively, circulation models offer an opportunity to get an estimate of the low-frequency variability and to investigate the influence of different forcings, for example, through optimal fingerprinting (Christidis et al. 2005; Morak et al. 2011). Such models are, however, limited by our current knowledge of the physics.

Our method produces an ensemble of surrogate fields with the same spatial and temporal structure as the original observed field but without the effects of the secular variations. The considered diagnostics are calculated from these “worlds that could have been” and their probability densities computed. These probability densities can then be compared to similar diagnostics from the original field. We are not limited to pointwise comparisons but can address pathwise questions, for example, related to the number of consecutive years when the diagnostic exceeds a certain value.

We will here consider three diagnostics based on the near-surface temperature: the warm extremes in summer mean temperature (Hansen et al. 2012) and monthly and daily warm records. For each year we calculate the area experiencing such events and then investigate if these areas have changed significantly over the period 1948–2011 with reanalysis data. Note that the statistical approach is not limited to temperature records but can be used for many kinds of extreme events and can be applied to both regularly gridded datasets and irregularly distributed station data.

The paper is organized as follows. In section 2a, we describe the data and the diagnostics used for records and extreme temperatures, while the ensemble surrogate method for the estimation of the statistical significance is described in section 2b. In section 3, we present the results regarding the number of daily and monthly warm records: in section 3a we focus on the annual number of records in the extratropical Northern Hemisphere (NH) and, in section 3b, look at the individual seasons and Europe and also discuss issues related to the robustness of the method. In section 4, we present results regarding the summer mean extremes, and the paper is closed by the conclusions in section 5.

## 2. Data and methodology

In section 2a, we describe the data and the three different diagnostics. Section 2b is devoted to a detailed description of the ensemble surrogate method used to estimate the statistical significance.

### a. Data and data processing

We use daily mean values of the near-surface temperature from the National Centers for Environmental Prediction (NCEP) reanalysis (Kalnay et al. 1996). This dataset covers 64 years, from 1948 to 2011, and is defined on a 2.5° × 2.5° latitude–longitude grid. As mentioned, we will be interested in both warm extremes in the summer mean temperature and in monthly and daily warm records. We will take a large-scale perspective and, unless otherwise stated, include all of the extratropical Northern Hemisphere north of 20°N in the study. In the rest of the paper, we will refer to the reanalysis data as “observations.” We recognize that reanalysis data may not be optimal for the study of extremes and have, therefore, confirmed parts of our results using gridded instrumental monthly mean temperatures from the Hadley Centre and the Climatic Research Unit at the University of East Anglia, version 3 (HadCRUT3v) (Brohan et al. 2006).

The summer mean temperature (June–August) is calculated for each grid point in the extratropical NH. Each of these time series is then centered to zero by subtracting its temporal mean and normalized by dividing by its temporal standard deviation. The mean and standard deviation are calculated from all 64 years. The standard deviation is calculated from the time series after first removing the low-frequency variability (in the form of a third-order polynomial fit). This procedure differs somewhat from that of Hansen et al. (2012), where different subintervals were used as baselines for the standard deviation. These details are of minor importance for the present study, but removing the low-frequency variability is important when studying whether the distribution of the anomalies has widened with time. As in Hansen et al., we then calculate the areas, as fractions of the total area under consideration, where these standardized anomalies fall in certain intervals: for example, between 2 and 3 [denoted very warm in Hansen et al. (2012)] or above 3 (extremely warm).
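The normalization and area calculation described above can be sketched as follows. This is a minimal illustration, not the code used for the paper: the function names, the array layout `(n_years, n_lat, n_lon)`, and the use of the sample standard deviation (`ddof=1`) are assumptions made here.

```python
import numpy as np

def standardized_anomalies(series, detrend_order=3):
    """Center a 1-D series and scale it by the std of its detrended residual."""
    t = np.arange(series.size, dtype=float)
    t = (t - t.mean()) / t.std()          # rescale time for a well-conditioned fit
    # Low-frequency variability estimated by a polynomial fit (third order in the text).
    trend = np.polyval(np.polyfit(t, series, detrend_order), t)
    sigma = np.std(series - trend, ddof=1)  # ddof=1 is an assumption
    return (series - series.mean()) / sigma

def area_fraction(anomalies, lat_deg, lower, upper=np.inf):
    """Fraction of the area with lower < anomaly <= upper, weighted by cos(latitude).

    anomalies: array (n_years, n_lat, n_lon); lat_deg: array (n_lat,).
    """
    w = np.cos(np.deg2rad(lat_deg))[None, :, None]
    w = np.broadcast_to(w, anomalies.shape)
    inside = (anomalies > lower) & (anomalies <= upper)
    return (inside * w).sum(axis=(1, 2)) / w.sum(axis=(1, 2))
```

With a field `z` of standardized anomalies, `area_fraction(z, lat, 2, 3)` would give the yearly "very warm" fraction and `area_fraction(z, lat, 3)` the "extremely warm" fraction.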

We will investigate the change in frequencies of both monthly and daily warm records in the near-surface temperature. Let us, for rigor, assume that we have a time series $x_1, x_2, \ldots, x_N$ of length $N$. Most previous studies (Benestad 2003; Rahmstorf and Coumou 2011; Anderson and Kostinski 2010) have defined an upper (or “warm”) record as an entry larger than all previous entries: that is, an $x_j$ for which $x_j > x_i$ for all $i = 1, \ldots, j - 1$. For a stationary time series this gives the probability $1/j$ for having a record in the $j$th time step (in the first time step a record is certain). This result is independent of the underlying distribution as pointed out by many (e.g., Benestad 2003; Wergen and Krug 2010). Because these records become increasingly rare with time, it is difficult, at least visually, to isolate the influence of climate change, and Rowe and Derry (2012) introduced an alternative definition based on a fixed threshold. Here we will use yet another alternative, defining a record as an all-time record: that is, an $x_j$ for which $x_j > x_i$ for all $i \neq j$. For a stationary time series the probability that the all-time record falls in a given step is simply $1/N$, a result that is again independent of the underlying distribution. We have confirmed that the results of the following sections are only weakly sensitive to the exact definition of the records.
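The two record definitions and their stationary probabilities ($1/j$ for running records, $1/N$ for the all-time record) can be checked with a small simulation. The sketch below assumes independent, identically distributed Gaussian series, which is stronger than stationarity; the function names are introduced here for illustration only.

```python
import numpy as np

def running_records(x):
    """Boolean mask: True where x[j] exceeds every earlier entry."""
    x = np.asarray(x, dtype=float)
    prev_max = np.maximum.accumulate(x)
    mask = np.ones(x.size, dtype=bool)      # the first entry is always a record
    mask[1:] = x[1:] > prev_max[:-1]
    return mask

def all_time_record(x):
    """Index j of the all-time record (x[j] > x[i] for all i != j)."""
    return int(np.argmax(x))

rng = np.random.default_rng(0)
n_steps, n_series = 64, 20000
samples = rng.standard_normal((n_series, n_steps))  # i.i.d. series (assumption)

# Empirical frequency of a running record at step j, to compare with 1/j.
freq_running = np.mean([running_records(s) for s in samples], axis=0)
expected_running = 1.0 / np.arange(1, n_steps + 1)

# Empirical frequency of the all-time record falling in each step, to compare with 1/N.
counts = np.bincount(samples.argmax(axis=1), minlength=n_steps)
freq_all_time = counts / n_series
```

The all-time definition gives a flat reference level of $1/N$ in every step, which is what makes a change over time easier to see than with the decaying $1/j$ reference.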

For monthly records we proceed as follows (the process for daily records is similar). For each location (grid point) and each of the 12 calendar months we find the year with the warmest temperature in the 64 years under consideration. For each calendar month we then calculate the area that experienced temperature records in a given year. Areas are calculated as weighted means with weights equal to the areas of the individual grid points, which in the present case with a regular grid amounts to weighting with the cosine(latitude). We report these areas as fractions of the total area under consideration, $N_{y,m}$, where $y$ is the year and $m$ is the month. We refer to $N_{y,m}$ as the number of records (with equal weighting they would be the number of records per grid point). We can then further calculate the annual and seasonal means of the number of records $N_y$.$^{1}$ Using the average and not the total number of records allows us to compare different regions and seasons. We use the same weighting when we calculate the spatial means of the temperature and we have confirmed that our results are not sensitive to the weighting.
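As an illustration of the monthly-record diagnostic, the following sketch computes the area fraction with an all-time record for every year and calendar month, and from it the annual mean. The array layout `(n_years, n_months, n_lat, n_lon)` and the function name are assumptions made for this example.

```python
import numpy as np

def record_area_fraction(temp, lat_deg):
    """Area fraction holding the all-time warm record, per year and calendar month.

    temp: array (n_years, n_months, n_lat, n_lon); lat_deg: array (n_lat,).
    Returns an array N of shape (n_years, n_months).
    """
    n_years = temp.shape[0]
    # Year index of the warmest value at each grid point and calendar month.
    record_year = temp.argmax(axis=0)                     # (n_months, n_lat, n_lon)
    w = np.cos(np.deg2rad(lat_deg))[None, :, None]        # cos(latitude) area weights
    w = np.broadcast_to(w, record_year.shape)
    years = np.arange(n_years)[:, None, None, None]
    is_record = record_year[None] == years                # (n_years, n_months, n_lat, n_lon)
    return (is_record * w).sum(axis=(2, 3)) / w.sum(axis=(1, 2))

rng = np.random.default_rng(1)
temp = rng.normal(size=(64, 12, 6, 10))                   # synthetic stationary field
lat = np.linspace(22.5, 87.5, 6)
N_ym = record_area_fraction(temp, lat)
N_y = N_ym.mean(axis=1)                                   # annual mean number of records
```

Because every grid point has exactly one all-time record per calendar month, each column of `N_ym` sums to one, and for a stationary field each entry has expectation 1/64.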

### b. Statistical significance

An important aspect of this study is the estimation of the statistical significance by a Monte Carlo approach where the observed area with records or extremes is compared to similar areas obtained from an ensemble of stationary surrogate temperature fields with the same spatial and temporal structure and coverage as the observed temperature field.

The surrogate fields are generated with a phase-scrambling procedure described in Christiansen (2007). First the original field is transformed into a linear combination of principal components (PCs; depending only on time) and empirical orthogonal functions (EOFs; depending only on space) and a truncated set of these are retained. Then the average annual cycle and the secular variations are removed from each retained PC. The secular variations (trends and variability on the lowest frequencies) are estimated by a third-order polynomial fit (however, see section 3b for the sensitivity to this choice). The resulting original PC anomalies are then Fourier transformed, the Fourier phases are randomized but with the same random phases for all PCs, and inverse Fourier transforms performed to get the surrogate PC anomalies. We then force the surrogate anomalies to have the same seasonal nonstationarity (cyclostationarity) as the original anomalies by scaling the surrogate anomalies with the annual periodic envelope of the standard deviation of the original anomalies. This correction was used in Christiansen (2001) and Thejll et al. (2003) when testing the significance of correlations. Now the average annual cycle is added back to the surrogate PC anomalies and these are finally recombined with the original EOFs to get the surrogate fields.
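The core phase-randomization step can be sketched as below. This is a simplified illustration under stated assumptions: the input is taken to be PC anomalies with the annual cycle and secular variations already removed, and the rescaling with the seasonal envelope of the standard deviation is omitted. The function name is hypothetical.

```python
import numpy as np

def phase_scramble(pcs, rng):
    """Return a surrogate of the PC anomalies with randomized Fourier phases.

    pcs: array (n_time, n_pc). The same random phase is applied to all PCs
    at each frequency, which keeps the cross covariances between PCs intact.
    """
    n_time = pcs.shape[0]
    spec = np.fft.rfft(pcs, axis=0)
    phases = rng.uniform(0.0, 2.0 * np.pi, size=spec.shape[0])
    phases[0] = 0.0                       # the zero-frequency term must stay real
    if n_time % 2 == 0:
        phases[-1] = 0.0                  # the Nyquist term must stay real
    rotated = spec * np.exp(1j * phases)[:, None]
    return np.fft.irfft(rotated, n=n_time, axis=0)

rng = np.random.default_rng(2)
mixing = np.array([[1.0, 0.5, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 1.0]])
pcs = rng.standard_normal((768, 3)) @ mixing   # correlated synthetic PCs
pcs -= pcs.mean(axis=0)
surrogate = phase_scramble(pcs, rng)
```

Drawing one phase per frequency and sharing it across PCs is the design choice that matters here: independent phases per PC would keep each PC's spectrum but destroy the spatial structure.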

Note that the transformation to PC/EOF space is only done for efficiency. We choose a truncation so that 99.5% of the variability (including the seasonal cycle) is retained. For daily (monthly) data this requires that we retain the first 256 (51) PCs when we consider the extratropical NH. This amounts to a drastic reduction of the dimensionality as the original field is defined on more than 4000 grid points.
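A truncation that retains a given fraction of the variability can be obtained from a singular value decomposition. The sketch below is one possible way to do this, assuming a field stored as `(n_time, n_space)`; any centering or area weighting applied before the decomposition is left to the caller, and the function name is introduced here.

```python
import numpy as np

def truncated_eofs(field, retained=0.995):
    """Decompose field (n_time, n_space) and keep the leading components.

    Returns (pcs, eofs) such that pcs @ eofs approximates field and the kept
    components account for at least the `retained` fraction of the sum of squares.
    """
    u, s, vt = np.linalg.svd(field, full_matrices=False)
    explained = np.cumsum(s**2) / np.sum(s**2)
    k = int(np.searchsorted(explained, retained)) + 1
    k = min(k, s.size)                    # guard against rounding at retained ~ 1
    return u[:, :k] * s[:k], vt[:k]

rng = np.random.default_rng(3)
low_rank = rng.standard_normal((200, 3)) @ rng.standard_normal((3, 50))
field = low_rank + 1e-3 * rng.standard_normal((200, 50))
pcs, eofs = truncated_eofs(field)
```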

Using this procedure we are able to generate an ensemble of surrogate fields with the same temporal and spatial structure as the original field. More precisely, as shown in Christiansen et al. (2009), the phase-randomizing procedure preserves all instantaneous and lagged covariances and cross covariances. It therefore also preserves the strength of the variability in all grid points; that is, the variances are preserved while higher-order moments are not necessarily preserved (see the end of section 3b for details). Therefore, the surrogate fields represent “counterfactual worlds” or “worlds that could have been” if climate had been stationary (except for the annual cycle) over the 64 yr. Any remaining apparent trends or nonstationarities in the surrogate data are therefore due to sampling effects. Comparing the original data to the ensemble of surrogates amounts to assuming the null hypothesis that the original data share the properties of the surrogates, that is, they are stationary except for the annual cycle. We have previously used this method in tests of paleoreconstruction methods (e.g., Christiansen et al. 2009, 2010; Christiansen and Ljungqvist 2011). The method can be seen as an extension of the time series method described by Theiler and Prichard (1996) to spatiotemporal fields.
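The preservation of covariances can be seen in one line. The notation here is introduced for illustration and is not taken from the cited papers: let $X_k(f)$ be the Fourier transform of the $k$th PC anomaly and $\phi(f)$ the random phase shared by all PCs at frequency $f$. Then

$$
\tilde{X}_k(f) = X_k(f)\, e^{i\phi(f)}
\quad\Longrightarrow\quad
\tilde{X}_k(f)\, \tilde{X}_l^{*}(f) = X_k(f)\, X_l^{*}(f) .
$$

The cross spectrum of every pair of PCs is therefore unchanged. Since the lagged cross covariances of a finite series are, in the circular sense, the inverse Fourier transform of the cross spectrum, they are unchanged as well; setting $k = l$ gives the corresponding statement for variances and autocovariances.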

The properties of the surrogate method are demonstrated in Fig. 1 in which the observed NH February 1976 near-surface temperature anomaly is shown together with a February anomaly from a surrogate field. The similarities of the spatial structures—such as amplitudes and spatial decorrelation length—are clear. Likewise, Fig. 2 shows two decades of observed monthly temperature anomalies at a specific grid point (blue curve) together with the similar time series for one of the surrogates (red curve). Note how both observations and surrogate show larger intra- and interannual variability in winter than in other seasons.

## 3. Daily and monthly records

Here we present the temporal development of the number of warm daily and monthly records with emphasis on the statistical significance as calculated with the ensemble surrogate method described in section 2b. Section 3a considers the annual mean number of the daily and monthly records in the extratropical NH. Section 3b considers results when the analysis is restricted to individual seasons or Europe, and here we will also present results regarding the robustness of our results.

### a. The annual mean number of records in the extratropical NH

The number of daily warm records is shown in the bottom panel of Fig. 3 as a function of year and calendar day, while the number of monthly warm records is shown in the top panel. We recall from section 2a that the number of records is shorthand for the fraction of the area north of 20°N that experiences warm records, $N_{y,m}$. For comparison Fig. 4 shows the daily and monthly NH mean temperature anomalies in the same format. The vertical structures in these figures show the temporal autocorrelation of the records and of the temperature anomalies. The distributions of these structures also indicate a rather strong interannual variability with some years experiencing many records and some years few.

We also note that most calendar days and months experience records for all years although there are some days and months, in particular in the 1970s, without records. December 2010 is the month with most records: 9% of the extratropical NH had this December as the warmest on record. Likewise, 23 October 2003 is the day with most records: almost 10% of the extratropical NH had this day as the warmest on record. We observe a distinct low-frequency variation of the records with some warm records before 1965 and an abundance of warm records after 1995 separated by a relatively quiet period with fewest records in the 1970s. In particular, the recent decade shows many records with 2010 as a very active year. In the recent period most records fall in summer and autumn. Cold records (not shown) display the reverse pattern with most records between 1960 and 1995 and with relatively few records in the recent decade. There is a striking but not surprising resemblance between the NH mean temperature anomalies (Fig. 4) and the distribution of the number of records. It is also interesting that, while the warming in the late 1940s and in the 1950s was strongest in the summer season, the recent warming seems weakest in this season.

We now look at the total annual number of records, denoted $N_y$ in section 2a. The total annual numbers of warm daily or monthly records are shown as a function of year by the green curves in the two panels of Fig. 5. As expected from the discussion above, the numbers of daily and monthly records have the same general development. These numbers are small in the period 1960–85, after which they begin to increase. In particular, the last 10–15 yr have shown many records with the largest numbers in 1998 and 2010.

Significance levels as described in section 2b have been calculated from an ensemble of 1000 surrogates and are shown with the blue curves. For the number of monthly records the years 1998, 2007, and 2010 are significant at the 99% level, while 2003, 2006, and 2011 are also significant at the 95% level. For the number of daily records 1998 and all years from 2002 to 2011 are significant at the 99% level, while all years after 1998 excluding 2000 are significant at the 95% level.
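Given a diagnostic computed for every surrogate, significance levels of this kind can be read off as percentiles of the ensemble. The sketch below assumes that the levels are one-sided and computed separately for each year; that reading, the array layout, and the function name are assumptions of this example.

```python
import numpy as np

def significance_levels(surrogate_diag, levels=(95.0, 99.0)):
    """Per-year upper significance levels from a surrogate ensemble.

    surrogate_diag: array (n_surrogates, n_years) holding the diagnostic
    (for example the annual number of records) for each surrogate.
    Returns a dict mapping each level to an array of thresholds (n_years,).
    """
    return {lv: np.percentile(surrogate_diag, lv, axis=0) for lv in levels}

rng = np.random.default_rng(4)
ensemble = rng.random((1000, 64))         # synthetic stand-in for 1000 surrogates
levels = significance_levels(ensemble)
```

An observed year is then called significant at the 95% level when its value lies above `levels[95.0]` for that year.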

Figure 5 also shows the number of warm records obtained from five different surrogate fields. By definition, the surrogate fields will on average show about 3 yr (5% of 64 yr) with the number of records above the 95% significance level. For monthly records none of the five surrogates shows as large a number of records as found in the observed data for 2010, but two surrogates show years with numbers comparable to the number found in the observed data for 1998 and several of the surrogates have years for which the number of records exceeds the 99% significance level. For daily records none of the surrogates reaches numbers as large as those found in the observed data in 1998 and in the last 5 yr.

But, how significant is the recent increase in the numbers of records? To address this question we consider the number of consecutive years with number of records above the 95% significance level. For the observed data this number is 2 (2006–07) for monthly records and 11 (2001–11) for daily records. In our ensemble of 1000 surrogates we never find 11 consecutive years with the number of daily records above the 95% significance level. In fact, as illustrated in Fig. 6, we most often find one or two such consecutive years in each surrogate and never above five. This very strongly indicates that the observed behavior of daily records in the last decade is very rare and that it cannot be explained as a chance occurrence. On the other hand, for monthly records the observed two consecutive years above the 95% level is quite common. However, there is some ambiguity in this result. We could, after having seen the top panel of Fig. 5, have decided to consider, for example, the maximum number of years in a decade with the number of monthly records above 0.02. We would then find that this number is 10 in the observed data and that the similar number in the surrogates never exceeds 9.
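The pathwise test on consecutive years can be sketched as follows: find the longest run of years above the 95% level in the observations and count how often a run at least that long occurs in the surrogates. The function names and array layout are assumptions; the thresholds are taken per year from the same ensemble.

```python
import numpy as np

def longest_run(exceed):
    """Length of the longest run of True values in a 1-D boolean sequence."""
    best = current = 0
    for flag in exceed:
        current = current + 1 if flag else 0
        best = max(best, current)
    return best

def run_length_pvalue(observed, surrogates, level=95.0):
    """Observed longest run above the per-year level, and the fraction of
    surrogates with a run at least as long.

    observed: array (n_years,); surrogates: array (n_surrogates, n_years).
    """
    threshold = np.percentile(surrogates, level, axis=0)
    obs_run = longest_run(observed > threshold)
    sur_runs = np.array([longest_run(s > threshold) for s in surrogates])
    return obs_run, float(np.mean(sur_runs >= obs_run))
```

Other pathwise diagnostics, such as the maximum number of years in a decade above a fixed value, would follow the same pattern with a different summary function in place of `longest_run`.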

We have shown that for daily data the null hypothesis of no change in the number of records can be rejected, whereas this null hypothesis cannot be rejected for monthly data. That the null hypothesis cannot be rejected for monthly data does not necessarily mean that there is no change, but it does mean that this change is small enough to be explained as a chance occurrence. From the histogram in Fig. 6 we find that only about 4% of the surrogates have three or more consecutive years with the number of records above the threshold. Observing three or four (instead of two) consecutive years with the number of records above the threshold would therefore be enough to reject the null hypothesis.

### b. Individual seasons, Europe, and robustness

We now take a look at the daily records in the individual seasons (Fig. 7). In all seasons the number of warm records increases after 1990. Summer and autumn show larger numbers of records than winter and spring in the recent decade. The year 2010 is exceptional for all seasons but least exceptional in spring. We note that the significance intervals for the individual seasons are wider than for the whole year. This reflects the reduced absolute number of warm events when focusing on a specific season. We also note that the significance intervals are comparable for all four seasons. This is because the significance intervals are determined not simply by the strengths of the interannual or intraannual variability but rather by their relative strengths. Summer and autumn both show 11 (2001–11) consecutive years exceeding the 95% levels. This is not seen in any of the surrogates, and it indicates an extremely significant increase of daily records in these seasons. Winter and spring show five (2006–10) and three (1997–99) consecutive years exceeding the 95% levels, respectively, making the increase in the number of records very significant in winter (with a similar number found in 0.1% of the surrogates) and moderately significant in spring (3% of the surrogates). For monthly records all seasons are insignificant (not shown).

Above, we have studied the whole NH north of 20°N. We expect that focusing on a large area will decrease natural fluctuations and increase the statistical significance. We have repeated the calculations for Europe (here latitudes between 35° and 70°N and longitudes between 10°W and 30°E) and the results are shown in the top panel of Fig. 8 for the annual number of daily records. We see that the number of observed records (relative to the size of the areas) is about the same as for the whole NH (Fig. 5), but the significance intervals have widened considerably. For the whole year, three consecutive years (2009–11) have a number of daily records above the 95% level. For all four individual seasons, this number is two: winter and spring 1989–90, summer 2002–03, and autumn 2005–06. For the whole year, three consecutive years are found in 1% of the surrogates making the change in the annual number of warm daily records very significant. The change is not significant for any of the individual seasons. For monthly records no significance is found at all (not shown).

Before generating the surrogate fields the secular variations have been removed to generate worlds that might have been without climate change. In the results described above, secular variations were estimated by a third-order polynomial fit. The lower the order of the polynomial, the more low-frequency variability is preserved in the surrogates and the larger the width of the resulting confidence intervals. We have tested the sensitivity of our results by repeating the analysis with polynomials of different orders. The bottom panel of Fig. 8 shows results for a first-order polynomial (i.e., linear) fit. The confidence intervals have expanded (cf. Fig. 5), and, if we focus on the 99% level, then the year 2010 is still significant, 1998 is almost significant, and 2007 has become insignificant. All years after 2005 are significant at the 95% level, whereas the years 2002–04, which were significant when the analysis was based on a third-order polynomial fit, now are insignificant. Using higher-order polynomials (fourth and fifth order) gives the same results as for the third-order polynomial. We have also confirmed that our results are robust to the level of truncation in the PC/EOF space. For example, retaining 99.95% of the variance [673 (230) PCs for daily (monthly) data] gives virtually the same significance levels as when retaining 99.5% [256 (51) PCs for daily (monthly) data].
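The role of the polynomial order can be illustrated with a few lines of code. This sketch only shows the detrending step, with a hypothetical function name: a lower-order fit removes less of the low-frequency variability, so more of it remains in the anomalies from which the surrogates are built.

```python
import numpy as np
from numpy.polynomial import polynomial as P

def remove_secular(series, order=3):
    """Subtract a least-squares polynomial fit of the given order from a 1-D series."""
    t = np.linspace(-1.0, 1.0, series.size)   # scaled time axis for a stable fit
    coeffs = P.polyfit(t, series, order)
    return series - P.polyval(t, coeffs)

rng = np.random.default_rng(6)
t = np.linspace(-1.0, 1.0, 64)
series = 0.5 * t + 0.8 * t**3 + rng.standard_normal(64)   # synthetic example
var_linear = np.var(remove_secular(series, order=1))
var_cubic = np.var(remove_secular(series, order=3))
```

Because the polynomial fits are nested, `var_linear` can never be smaller than `var_cubic`, which is the mechanism behind the wider confidence intervals obtained with the linear fit.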

The surrogate fields generated by the method used above and described in section 2b come with a caveat: the anomalies are drawn from Gaussian distributions. To test the influence of this limitation, we transform the surrogate anomalies (before adding back the annual cycle) so that for each month they have the same distribution as the original anomalies (Christiansen 2001). We apply this adjustment to the full field without the prior transformation to PC. This adjustment changes the covariance structure of the surrogate fields only slightly. We have confirmed that the adjustment only has modest impact on the significance.
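One common way to give a surrogate series the same distribution as the original is a rank mapping, in which the sorted original values are reordered according to the ranks of the surrogate. The sketch below shows this idea for a single series of one calendar month; whether it matches the exact transformation of Christiansen (2001) is an assumption, and the function name is introduced here.

```python
import numpy as np

def match_distribution(surrogate, original):
    """Return a series with the values of `original` and the rank order of `surrogate`.

    Both inputs are 1-D arrays of equal length.
    """
    ranks = np.argsort(np.argsort(surrogate))   # rank of each surrogate value
    return np.sort(original)[ranks]

rng = np.random.default_rng(7)
original = rng.gamma(2.0, size=500)             # skewed, non-Gaussian sample
surrogate = rng.standard_normal(500)            # Gaussian surrogate
adjusted = match_distribution(surrogate, original)
```

The mapping is monotonic, so the temporal ordering of high and low values in the surrogate is kept, while its covariance structure is changed only to the extent that the two distributions differ.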

As mentioned in section 2a, the reanalysis data may not be optimal for the study of extremes. The NCEP reanalysis suffers from homogeneity issues owing to changes in the observational network (Sturaro 2003), and the surface temperature may be influenced by the forecast model's land surface scheme. To test these limitations, we repeat the analysis of the monthly records using the gridded instrumental temperatures from HadCRUT3v (Brohan et al. 2006). We select the 299 grid points north of 20°N with complete monthly records during the period 1948–2011. The geographical distribution of these grid points is shown in Fig. 9. The number of warm records as a function of year and calendar month (Fig. 10, top) strongly resembles that found from the NCEP reanalysis (Fig. 3, top). A similar strong resemblance is found for the total number of warm records as a function of year, the related significance levels, and the NH mean temperature (Fig. 10, bottom; cf. Fig. 5, top). Accordingly, the HadCRUT3v data confirm the previous result that neither the change in the annual nor the seasonal number of warm records is statistically significant. The main differences, including a somewhat larger number of records after the year 2000, are primarily a consequence of the reduced spatial coverage and not of differences in the data. We have confirmed this by repeating the calculation with the reanalysis data restricted to the 299 grid points used for the HadCRUT3v data.

## 4. Summer extremes

The observed areas with normalized (as described in section 2a) NH summer mean temperature (June–August) anomalies in different intervals are shown in Fig. 11 (top) as a function of time. We find a pronounced low-frequency variability common to all intervals. The warm area (anomalies above zero) decreases from 50% at the beginning of the period to 25% in the 1970s and then grows to 75% at the beginning of the twenty-first century. The very warm and extremely warm areas together [anomalies above 2, following the vocabulary of Hansen et al. (2012)] reach values of up to 20% at the end of the period and are particularly large in 1998 and 2010. This should be compared to an average value of about 2.3% for a stationary climate (under Gaussianity). The widening and shrinking of the warm intervals are mirrored in the reverse behavior of the cold intervals. These results are in line with the findings of Hansen et al., although we consider a different geographic region and a slightly different definition of the normalized anomalies (see section 2a).

How much of this low-frequency variability can be explained by chance? The middle and bottom panels of Fig. 11 show the corresponding plots for two surrogate fields. It is clear that for shorter periods (e.g., from 1 to 5 yr) large deviations from the climatological means can be observed. In both surrogates we find years where the very warm and extremely warm areas together fill about 15% and years where the warm area fills more than 75%. However, these fluctuations are not as durable as those found in the original field. To be more quantitative, Fig. 12 shows the area with anomalies larger than two (very warm and extremely warm) as a function of time together with the confidence intervals calculated from an ensemble of 1000 surrogates. Three years, 1998, 2007, and 2010, are significant at the 99% level and five consecutive years, 2007–11, are significant at the 95% level (almost 9 consecutive years, 2003–11, as 2006 is just on the edge). This is extremely unusual in a stationary climate. In the 1000 surrogates the largest number of consecutive years above the 95% level is four.

As discussed, the surrogate fields have the same spatial and temporal correlations as the original surface temperature field. It is interesting to see how much spatial and temporal correlations influence the estimation of confidence intervals. We have therefore calculated the confidence intervals from surrogates breaking both the temporal and spatial correlations and from surrogates breaking only the temporal correlations. The former surrogates have been generated by temporally bootstrapping every grid point independently and the latter surrogates by bootstrapping the grid points uniformly (Kiktev et al. 2003). We find (Fig. 12) that breaking both spatial and temporal correlations (i.e., assuming all summers and all grid points are independent) results in extremely narrow confidence intervals (red curves). Breaking only the temporal correlations (i.e., assuming all summers are independent but keeping the spatial structure) results in much wider confidence intervals (brown curves), but they are still only a little more than half as wide as the original confidence intervals (blue curves, keeping both the temporal and spatial structure). Although these results are specific to the present application, as they depend on the spatial resolution of the field and the temporal averaging, they show the importance of taking both temporal and spatial correlations into account when estimating the statistical significance.
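The two bootstrap variants used for comparison can be sketched as below, assuming resampling of years with replacement and a field of summer means stored as `(n_years, n_space)`. The function names are introduced for this example.

```python
import numpy as np

def bootstrap_independent(field, rng):
    """Resample years independently at every grid point.

    This breaks both the temporal and the spatial correlations.
    """
    n_years, n_space = field.shape
    idx = rng.integers(0, n_years, size=(n_years, n_space))
    return np.take_along_axis(field, idx, axis=0)

def bootstrap_uniform(field, rng):
    """Resample whole years, using the same draw for all grid points.

    This breaks the temporal correlations but keeps each spatial pattern intact.
    """
    idx = rng.integers(0, field.shape[0], size=field.shape[0])
    return field[idx]
```

Neither variant keeps the year-to-year persistence, which is why both give narrower confidence intervals than surrogates that preserve the full spatiotemporal structure.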

## 5. Conclusions

The number of warm records and extremes has often been reported to increase. However, such events happen also in a stationary climate and, because of the natural clustering of events in space and time, it is not straightforward to determine if the observed increase could have happened by chance or if it is a reflection of a changing climate.

We have studied the statistical significance of the increase in warm records and extremes in the extratropical Northern Hemisphere and considered the number of daily and monthly warm records as well as warm extremes in summer means. We applied an ensemble surrogate method that generates fields with the same temporal and spatial covariance structure as the original field but without its secular variations. When estimating the statistical significance by comparing the diagnostics calculated from the original field to the diagnostics calculated from the surrogate ensemble, the effects of clustering of events in space and time are therefore automatically taken into account.

For both the number of daily and monthly warm records and for the warm extremes in summer means, we find the same overall development with a minimum in the 1970s and a gradual but erratic increase since. Three years, 1998, 2007, and 2010, stand out as exceptional for all three diagnostics.

We find that the increase in the annual number of daily warm records over the last decade is highly significant in the extratropical NH. The 11 consecutive years in the period 2001–11 are all statistically significant at the 95% level, whereas three years, 1998, 2007, and 2010, are significant at the 99% level. The chance of having 11 consecutive years above the 95% level is vanishingly small in the surrogates. In other words, there is almost no chance that the observed number of records in the period 2001–11 would have happened in a stationary climate without the secular variations. Considering the individual seasons, we find that summer and autumn are extremely significant while the significance is weakest in spring. When we consider the annual number of monthly warm records, the significance is much weaker. Now only 2 consecutive years are above the 95% level, which could easily be a chance occurrence as such events are often found in the surrogates. Turning to the extreme summer means, we find that the very warm and extremely warm summers (i.e., with normalized anomalies larger than 2) again show a very significant change. Now 5 (almost 9) consecutive years are above the 95% level, a number that is never seen in the 1000 surrogates.

Considering a smaller area, we would expect the statistical significance to be more difficult to obtain. However, this depends on the strength of the natural variability, the number of spatial and temporal degrees of freedom, and the number of observed records or extremes. For Europe we do find an increase of warm records that is statistically significant when the full year is considered, but no significance is found for any of the individual seasons.

The number of records and the warm extremes follow the temperature anomaly (gray curves in Figs. 5, 7, 8, and 12) to a high degree, in particular regarding the low-frequency variations. The best correspondence seems to be found for annual means and the extratropical region, whereas larger deviations are found for the individual seasons (in particular winter) and when the analysis is restricted to Europe. However, these deviations seem to be within the natural fluctuations, as seen when the calculation of the confidence intervals is repeated with surrogate ensembles on which the secular variations have been superposed (not shown). This suggests that changes in the number of records and extremes could be explained simply as a consequence of a changing temperature mean and that nonlinear feedbacks or changes in the higher moments (skewness and flatness) of the temperature are not required.
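A small worked example shows how strongly a shift in the mean alone can change the frequency of extremes. It assumes a Gaussian distribution with unchanged unit variance and a purely hypothetical shift of one standard deviation; neither number is taken from the analysis in this paper.

```python
import math

def exceedance_probability(threshold, mean_shift=0.0):
    """P(X > threshold) for X ~ N(mean_shift, 1), in standard deviation units."""
    return 0.5 * math.erfc((threshold - mean_shift) / math.sqrt(2.0))

p_before = exceedance_probability(2.0)                  # about 0.023
p_after = exceedance_probability(2.0, mean_shift=1.0)   # about 0.159
print(f"P(anomaly > 2) rises from {p_before:.3f} to {p_after:.3f}")
```

Under these assumptions the probability of a normalized anomaly larger than 2 increases roughly sevenfold without any change in variance, skewness, or flatness.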

We note that the ensemble surrogate method is not restricted to the simple records and extremes considered in this paper but can be used for many other diagnostics, such as heat waves and cold spells (Frich et al. 2002; Alexander et al. 2006). It can also be used for both gridded data and irregularly distributed station data. The importance of taking spatial and temporal correlations into account when estimating the significance of extremes and records depends on the strengths of these correlations and may be smaller for precipitation than for temperature. However, when the original data are not Gaussian, an additional transformation of the surrogates is needed. If the non-Gaussianity of the original data is moderate, this transformation will change the covariance structure only slightly; a main limitation of the ensemble surrogate method may appear when it is applied to strongly non-Gaussian data, such as precipitation.

This work was supported by the Danish Climate Centre at the Danish Meteorological Institute. The NCEP reanalysis data were provided by the NOAA/CIRES Climate Diagnostics Center, Boulder, Colorado, from their website (http://www.cdc.noaa.gov/). The HadCRUT3v data are available online as well (http://www.cru.uea.ac.uk/cru/data/temperature/).

## REFERENCES

Alexander, L. V., and Coauthors, 2006: Global observed changes in daily climate extremes of temperature and precipitation. *J. Geophys. Res.*, **111**, D05109, doi:10.1029/2005JD006290.

Anderson, A., and A. Kostinski, 2010: Reversible record breaking and variability: Temperature distributions across the globe. *J. Appl. Meteor. Climatol.*, **49**, 1681–1691.

Benestad, R. E., 2003: How often can we expect a record event? *Climate Res.*, **25**, 3–13.

Brohan, P., J. Kennedy, T. Haris, S. F. B. Tett, and P. D. Jones, 2006: Uncertainty estimates in regional and global observed temperature changes: A new dataset from 1850. *J. Geophys. Res.*, **111**, D12106, doi:10.1029/2005JD006548.

Christiansen, B., 2001: Downward propagation of zonal mean zonal wind anomalies from the stratosphere to the troposphere: Model and reanalysis. *J. Geophys. Res.*, **106**, 27 307–27 322.

Christiansen, B., 2007: Atmospheric circulation regimes: Can cluster analysis provide the number? *J. Climate*, **20**, 2229–2250.

Christiansen, B., and F. C. Ljungqvist, 2011: Reconstruction of the extratropical NH mean temperature over the last millennium with a method that preserves low-frequency variability. *J. Climate*, **24**, 6013–6034.

Christiansen, B., T. Schmith, and P. Thejll, 2009: A surrogate ensemble study of climate reconstruction methods: Stochasticity and robustness. *J. Climate*, **22**, 951–976.

Christiansen, B., T. Schmith, and P. Thejll, 2010: A surrogate ensemble study of sea level reconstructions. *J. Climate*, **23**, 4306–4326.

Christidis, N., P. A. Stott, S. Brown, G. C. Hegerl, and J. Caesar, 2005: Detection of changes in temperature extremes during the second half of the 20th century. *Geophys. Res. Lett.*, **32**, L20716, doi:10.1029/2005GL023885.

Coumou, D., and S. Rahmstorf, 2012: A decade of weather extremes. *Nat. Climate Change*, **2**, 491–496, doi:10.1038/nclimate1452.

Dole, R., and Coauthors, 2011: Was there a basis for anticipating the 2010 Russian heat wave? *Geophys. Res. Lett.*, **38**, L06702, doi:10.1029/2010GL046582.

Field, C. B., and Coauthors, Eds., 2012: *Managing the Risks of Extreme Events and Disasters to Advance Climate Change Adaptation.* Cambridge University Press, 594 pp.

Frei, C., and C. Schär, 2001: Detection probability of trends in rare events: Theory and application to heavy precipitation in the Alpine region. *J. Climate*, **14**, 1568–1584.

Frich, P., L. V. Alexander, P. Della-Marta, B. Gleason, M. Haylock, A. M. G. Klein Tank, and T. Peterson, 2002: Observed coherent changes in climatic extremes during the second half of the twentieth century. *Climate Res.*, **19**, 193–212.

Hansen, J., M. Sato, and R. Ruedy, 2012: Perception of climate change. *Proc. Natl. Acad. Sci. USA*, **109**, E2415–E2423, doi:10.1073/pnas.1205276109.

Hasselmann, K., 1993: Optimal fingerprints for the detection of time-dependent climate change. *J. Climate*, **6**, 1957–1971.

Kalnay, E., and Coauthors, 1996: The NCEP/NCAR 40-Year Reanalysis Project. *Bull. Amer. Meteor. Soc.*, **77**, 437–471.

Karl, T. R., G. A. Meehl, C. D. Miller, S. J. Hassol, A. M. Waple, and W. L. Murray, Eds., 2008: Weather and climate extremes in a changing climate. Regions of focus: North America, Hawaii, Caribbean, and US Pacific Islands. U.S. Climate Change Science Program and the Subcommittee on Global Change Research Tech. Rep., 164 pp.

Kiktev, D., D. H. M. Sexton, L. Alexander, and C. K. Folland, 2003: Comparison of modeled and observed trends in indices of daily climate extremes. *J. Climate*, **16**, 3560–3571.

Kunkel, K. E., T. R. Karl, and D. R. Easterling, 2007: A Monte Carlo assessment of uncertainties in heavy precipitation frequency variations. *J. Hydrometeor.*, **8**, 1152–1160.

Meehl, G. A., C. Tebaldi, G. Walton, D. Easterling, and L. McDaniel, 2009: Relative increase of record high maximum temperatures compared to record low minimum temperatures in the US. *Geophys. Res. Lett.*, **36**, L23701, doi:10.1029/2009GL040736.

Morak, S., G. C. Hegerl, and J. Kenyon, 2011: Detectable regional changes in the number of warm nights. *Geophys. Res. Lett.*, **38**, L17703, doi:10.1029/2011GL048531.

Pall, P., T. Aina, D. A. Stone, P. A. Stott, T. Nozawa, A. G. J. Hilberts, D. Lohmann, and M. R. Allen, 2011: Anthropogenic greenhouse gas contribution to flood risk in England and Wales in autumn 2000. *Nature*, **470**, 382–385, doi:10.1038/nature09762.

Peterson, T. C., P. A. Stott, and S. Herring, 2012: Explaining extreme events of 2011 from a climate perspective. *Bull. Amer. Meteor. Soc.*, **93**, 1041–1067.

Rahmstorf, S., and D. Coumou, 2011: Increase of extreme events in a warming world. *Proc. Natl. Acad. Sci. USA*, **108**, 17 905–17 909, doi:10.1073/pnas.1101766108.

Rowe, C. M., and L. E. Derry, 2012: Trends in record-breaking temperatures for the conterminous United States. *Geophys. Res. Lett.*, **39**, L16703, doi:10.1029/2012GL052775.

Schär, C., P. L. Vidale, D. Lüthi, C. Frei, C. Häberli, M. A. Liniger, and C. Appenzeller, 2004: The role of increasing temperature variability in European summer heatwaves. *Nature*, **427**, 332–336, doi:10.1038/nature02300.

Stott, P. A., D. A. Stone, and M. R. Allen, 2004: Human contribution to the European heatwave of 2003. *Nature*, **432**, 610–614, doi:10.1038/nature03089.

Stouffer, R. J., G. Hegerl, and S. Tett, 2000: A comparison of surface temperature variability in three 100-yr coupled ocean–atmosphere model integrations. *J. Climate*, **13**, 513–537.

Sturaro, G., 2003: A closer look at the climatological discontinuities present in the NCEP/NCAR reanalysis temperature due to the introduction of satellite data. *Climate Dyn.*, **21**, 309–316.

Theiler, J., and D. Prichard, 1996: Constrained-realization Monte-Carlo method for hypothesis testing. *Physica D*, **94**, 221–235.

Thejll, P., B. Christiansen, and H. Gleisner, 2003: On correlations between the North Atlantic Oscillation, geopotential heights, and geomagnetic activity. *Geophys. Res. Lett.*, **30**, 1347, doi:10.1029/2002GL016598.

Tingley, M. P., P. F. Craigmile, M. Haran, B. Li, E. Mannshardt-Shamseldin, and B. Rajaratnam, 2012: Piecing together the past: Statistical insights into paleoclimatic reconstructions. *Quat. Sci. Rev.*, **35**, 1–22.

Wergen, G., and J. Krug, 2010: Record-breaking temperatures reveal a warming climate. *Europhys. Lett.*, **92**, 30008, doi:10.1209/0295-5075/92/30008.

^{1} *i* = 1, …, *N* indicates the *N* stations or grid points; *y* = 1948, …, 2011 for years; and *m* = 1, …, 12 for calendar months (or *m* = 1, …, 365 for calendar days). We define

Then our number of records is given by

*w_i* are the weights. Annual and seasonal means are calculated straightforwardly: for example, the annual mean as
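
The record count described in this footnote can be illustrated with a short Python sketch. It rests on several assumptions that are not confirmed by the text: a value counts as a warm record when it exceeds all earlier years for the same station and calendar month (or day), the first year is not counted, the count is the weighted sum over stations with weights *w_i*, and the annual mean is the plain average over calendar months. The function name `weighted_record_counts` and the array layout are hypothetical.

```python
import numpy as np

def weighted_record_counts(temps, weights):
    """Weighted number of warm records per year and calendar month.

    temps   : (n_years, n_months, n_points) temperatures
    weights : (n_points,) weights w_i
    A value is flagged as a record if it exceeds every earlier year at
    the same point and calendar month; the first year is never flagged
    (an assumption of this sketch).
    """
    running_max = np.maximum.accumulate(temps, axis=0)
    is_record = np.zeros(temps.shape, dtype=bool)
    is_record[1:] = temps[1:] > running_max[:-1]
    return (is_record * weights).sum(axis=-1)           # (n_years, n_months)

def annual_mean(counts):
    """Average the (n_years, n_months) counts over calendar months."""
    return counts.mean(axis=1)
```

Seasonal means would follow in the same way by averaging over the relevant subset of calendar months.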