A statistical learning approach to produce seasonal temperature forecasts in western Europe and Scandinavia was implemented and tested. The leading principal components (PCs) of sea surface temperature (SST) and the geopotential at the 150-hPa level (GPT) were derived from reanalysis datasets and used at different lags (from one to five seasons) as predictors. Random sampling of both the fitting years and the potential predictors, together with the Least Absolute Shrinkage and Selection Operator regression (LASSO), was used to create a large ensemble of statistical models. Applying the models to independent test years shows that the ensemble performs well over the target areas and that the ensemble mean is, on average, more accurate than the best individual ensemble member. Skillful results were found especially for summer and fall, with anomaly correlation coefficient values ranging between 0.41 and 0.68 for these seasons. The correct simulation of decadal trends, the use of sufficiently long time series for fitting (70 years), and the use of lagged predictors increased the prediction skill. The decadal-scale variability of SST, most importantly the Atlantic multidecadal oscillation (AMO), and different PCs of GPT are the most important individual predictors. SST and GPT contribute roughly equal predictive power overall, although their relative importance varies by season.
Skillful seasonal forecasts can help various weather-sensitive sectors to anticipate weather-related risks (Clark et al. 2017; Tauser and Cajka 2014; De Cian et al. 2013). For that reason, the predictability of temperature, precipitation, and, for example, the North Atlantic Oscillation index (NAO) in monthly and seasonal time scales has been an active research topic in Europe. Several potential sources of predictability have been reported, including sea surface temperatures (SSTs) in various basins (e.g., Kolstad and Årthun 2018; Smith et al. 2016; Toniazzo and Scaife 2006; Brönnimann 2007; Rodwell and Folland 2002), continental snow cover (Allen and Zender 2011; Cohen and Jones 2011; Cohen and Entekhabi 1999), stratospheric geopotential (GPT) or winds (Jia et al. 2017; Wang et al. 2017; Scaife et al. 2016; Brönnimann et al. 2016), sea ice cover (Liptak and Strong 2014; Vihma et al. 2014), and soil moisture (Orth and Seneviratne 2014; van den Hurk et al. 2012). These parameters share two important properties: they vary slowly in time, and they can act as forcings for the troposphere. Both properties increase the predictive power of seasonal forecasting models.
Statistical modeling is an appealing approach to producing seasonal forecasts because it is computationally cheap compared to numerical–dynamical models, which require massive hardware for parallelization, long running times, and large storage capacity for the input and output data. Moreover, even though they skillfully predict the wintertime NAO (Baker et al. 2018b), the current operational dynamical models do not always perform particularly well in simulating fundamental climate variables, such as temperature and precipitation over Europe (Mishra et al. 2019; Weisheimer and Palmer 2014). Therefore, investigations of better statistical approaches remain useful, assuming that 1) predictability is an inherent part of the climate system on seasonal time scales, 2) dynamical models are only partly capable of utilizing it, and 3) modern statistical learning methods are capable of finding and exploiting the true physical connections between the predictands and predictors. Furthermore, uncovered statistical linkages may provide valuable insights into the behavior of the climate system.
Previously, both dynamical and statistical models have been used to predict, for example, precipitation over the British Isles (Baker et al. 2018a; Ossó et al. 2018), precipitation in Europe (Dunstone et al. 2018; Totz et al. 2017), the wintertime NAO index (Dobrynin et al. 2018; Hall et al. 2017; Scaife et al. 2014; Stockdale et al. 2015), snow accumulation over the Alps (Förster et al. 2018), sea ice cover in the Baltic Sea (Karpechko et al. 2015), wintertime European temperatures (Folland et al. 2012), and monthly temperatures at various midlatitude locations (Karpechko 2015).
Machine learning and statistical learning methods are becoming increasingly popular in different applications of the atmospheric sciences (e.g., Sprenger et al. 2017; Ukkonen et al. 2017). However, they have not yet been used very comprehensively in statistical seasonal forecasting, probably because the most advanced deep learning methods require vast amounts of training data to learn the nonlinear relationships between predictors and predictands. In the framework of seasonal forecasting, the number of available years (i.e., the number of samples n) is typically less than 100. Such a small n may effectively preclude the fitting of deep learning models, such as artificial neural networks. However, statistical learning methods vary widely, and some are more suitable for smaller datasets than others. This paper is a step toward utilizing the potential and power of statistical learning in seasonal forecasting, using statistical learning methods as robust as possible to avoid overfitting.
The model of this paper consists of an ensemble of individual regression models created with random sampling and the regularization of predictors. The principal components (PCs) of large-scale predictor parameters are used as potential predictors to utilize the teleconnections from global or hemispheric domains. The regularization, or shrinkage, is based on the Least Absolute Shrinkage and Selection Operator regression (LASSO; e.g., Hastie et al. 2009, section 3.4 therein). The skill of the proposed procedure in producing seasonal and areal mean temperature forecasts is evaluated over a set of independent test years in different target domains (Fig. 1).
The paper is organized as follows. In sections 2a–c, the predictor and predictand variables are presented, followed by the description of the statistical methods in sections 2d–g. In section 3, the skill of the ensemble is evaluated over target domains and seasons (section 3a), the importance of details in modeling is shown (section 3b), and the automatically selected predictors are analyzed (section 3c). The discussion and conclusions are collected in sections 4 and 5.
The Python code for reproducing the results of this study is available online at https://github.com/fmidev/seasonal_forecasting.
2. Materials and methods
The shortness of time series is a major challenge in statistical seasonal forecasting. It is tempting to use the years of the satellite era, often considered to begin in the late 1970s, in training the statistical models because the quality of the reanalysis datasets increases when satellite observations are included. However, this period covers only ~40 years, which might be too short to obtain statistically significant results, especially when seasonal or monthly averaging is used in data aggregation, such that each month or season is represented only 40 times in the data. Potentially important decadal-scale variations of the climate system cannot be identified from so short a sample. Isolating separate test years from the original set of years makes the training sample even smaller. Finally, the chaotic, unpredictable part of the variability is often large compared to the predictable part, which makes the fitting of statistical models uncertain when the signal-to-noise ratio is too small.
Moreover, the selection of the test years for validating the model is not trivial. First, to measure the performance statistics reliably, the number of samples (i.e., years in this case) should not be too small, as the range of the uncertainty estimates becomes impractically large. Second, to keep potential autocorrelations in the predictors and predictands from unduly affecting the results, a contiguous and preferably isolated period should be used, with some buffer years between the fitting and testing years. Third, the most recent years should be selected as the test sample, so that potential changes in predictability, caused, for example, by ongoing centennial-scale climate change or by decadal-scale ocean processes, can be seen and the results generalized to represent the current state of the climate system as well as possible. In other words, an adequately long, contiguous, and representative period is required for testing (Tashman 2000). Different cross-validation procedures have sometimes been used as an alternative to independent test years, but overfitting can still occur if the same years are used in cross-validation and predictor selection; as a result, the application of such models to truly independent years fails (e.g., DelSole and Shukla 2009).
Nowadays, alternatives for global, relatively long, and homogeneous datasets exist. One of them is the ERA-20C reanalysis dataset (Poli et al. 2016), which was used as the source for predictor data in this study. As complementary input data, version 2C of the Twentieth Century Reanalysis (20CRv2c; Compo et al. 2011) was used. All results shown, unless otherwise stated, are based on ERA-20C. The data were downloaded in monthly time resolution and in their original spatial resolutions (ERA-20C: 1.125° × 1.125°; 20CRv2c: 2° × 2°). The leading principal components of global SST and Northern Hemispheric GPT at the 150-hPa level were extracted from both reanalyses to be used as potential predictors. The ERA-Interim reanalysis data (Dee et al. 2011) were used to validate the GPT data of other reanalyses. Near-surface temperatures from the observational HadCRUT4 dataset (5° × 5°; Morice et al. 2012) were averaged over target domains and seasons to be used as predictands. The details of the treatment of the parameters are explained in the next sections.
In seasonal forecasting all predictors in the predictor matrix, consisting of p potential predictors of length n, are more or less weak in explaining the predictand variables. For this reason it is important to pay attention not only to the choice of the statistical method, but also to the dimensions of the predictor matrix. To minimize the risk of overfitting, it is necessary to use as long a time series as possible (a large number of samples n) for fitting. As demonstrated by Hastie et al. (2009, chapter 18 therein), the risk of overfitting increases rapidly when p ≈ n or p > n (high-dimensional dataset), but the risk is smaller when p < n (low-dimensional dataset). As n increases, it is possible to also increase the number of potential predictors p and still have p < n.
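The effect of the p/n ratio described above can be illustrated with a small numerical experiment (purely synthetic data, not from the study's code): an ordinary least squares fit to pure noise attains a spuriously high in-sample R² as p approaches n.

```python
# A quick numerical illustration, under synthetic assumptions, of the
# p-vs-n overfitting risk: an OLS fit to a pure-noise predictand
# "explains" more and more of the noise as p approaches n.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n = 50
y = rng.normal(size=n)                    # pure-noise predictand

def insample_r2(p):
    X = rng.normal(size=(n, p))           # p unrelated random predictors
    return LinearRegression().fit(X, y).score(X, y)

r2_low, r2_high = insample_r2(5), insample_r2(45)
# the low-dimensional fit explains little; near p = n the fit is spurious
assert r2_low < r2_high
```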
On the other hand, using overly uncertain data for fitting yields poorly fitted models. The uncertainty of reanalyses is largest in the earliest years because of the sparseness of the observations used in the assimilation (Poli et al. 2016; Compo et al. 2011), and for that reason the years before 1915 were excluded from the centennial-scale reanalyses. Excluding longer periods would increase the ratio p/n (i.e., the dimensionality of the predictor matrix) too much, and was indeed found to decrease the accuracy of the model, as shown in section 3b.
a. Selection of predictor parameters
As mentioned in the introduction, different observational SST data have traditionally been used in statistical seasonal forecasting, and their predictive power is well known. Long time series are easily accessible either as derived indices describing SST variability in different locations, or as global gridded products. For these reasons SST is a natural choice to be used in this study as well.
Additionally, the variables describing stratospheric circulation are becoming popular. This is because they contain exploitable signals on the seasonal scale, and because their accessibility, at present, is good since modern reanalyses also contain pressure level information in addition to the surface layer data. Only surface data have been used in the assimilation of the ERA-20C and 20CRv2c reanalyses, and because upper-air observations were not assimilated, there is a potential risk that the derived stratospheric circulation would be too uncertain to be used as a predictor parameter for seasonal forecasting. Earlier, Gerber and Martineau (2018) validated the annular modes of the atmospheric circulation in several reanalysis datasets, including ERA-20C and 20CRv2c, finding the representation of the modes to be quite accurate among reanalyses, especially in the Northern Hemisphere. To further validate the seasonal 150-hPa level GPT data for this study, the ERA-Interim reanalysis geopotential anomalies for that level were compared to the GPT anomalies of ERA-20C and 20CRv2c in 1979–2010. The assimilation model of ERA-Interim also includes the upper-air observations, and for that reason it can be assumed to be reliable at different altitudes. High consistency was found in the representation of GPT between different reanalyses, as seen in Fig. 2. For example, the temporal correlation between ERA-20C and ERA-Interim GPT anomalies is 0.80 at its minimum, and 0.93 on average. Based on this analysis, the 150-hPa GPT is accurately represented in reanalysis datasets at least for the recent decades and is, therefore, potentially useful for training the seasonal forecasting models.
In addition to GPT and SST, other potential predictor parameters, such as sea ice and snow cover, have previously been used in seasonal forecasting. See appendix A for the reasons why they are not included in this study.
b. Extraction of predictor variables
Both parameters, SST and GPT, were aggregated to seasonal time resolution by temporal averaging over the standard 3-monthly DJF, MAM, JJA, and SON seasons. Seasonal anomalies were then calculated by subtracting the 1915–2010 seasonal means from each grid cell separately for both parameters, followed by latitude weighting of the data, using $\sqrt{\cos\varphi}$ (where $\varphi$ is latitude), to decrease the influence of the high-latitude grid cells in the results (Wilks 2011, section 12.2 therein). Finally, to remove the centennial-scale climate change signal, the 1915–2010 linear trends were subtracted from the predictor data, separately for each season and grid cell. Removing the trends and the seasonal cycle improves the stationarity of the data, which eases, or enables, the fitting of the LASSO regression models later (Wilks 2011, section 9.1 therein).
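The preprocessing chain (seasonal anomalies, detrending, latitude weighting) can be sketched as follows on synthetic data. All names and sizes are illustrative, and the sqrt(cos latitude) weight is assumed here as the form commonly used before PCA; neither is taken from the study's code.

```python
# Sketch of the predictor preprocessing chain on a synthetic field;
# `sst` stands in for one season of gridded anomaly input per year.
import numpy as np

rng = np.random.default_rng(0)
n_years, n_lat, n_lon = 96, 8, 16               # e.g., 1915-2010, coarse grid
lats = np.linspace(75, -75, n_lat)
sst = rng.normal(size=(n_years, n_lat, n_lon))  # one season per year

# 1) anomalies: subtract the long-term seasonal mean per grid cell
anom = sst - sst.mean(axis=0)

# 2) detrend: remove the linear trend per grid cell (least squares slope)
t = np.arange(n_years)
slope = ((t[:, None, None] - t.mean()) * anom).sum(axis=0) \
        / ((t - t.mean()) ** 2).sum()
detrended = anom - slope * (t[:, None, None] - t.mean())

# 3) latitude weighting, assuming sqrt(cos(lat)) as is common before PCA
weights = np.sqrt(np.cos(np.deg2rad(lats)))
weighted = detrended * weights[None, :, None]

# each grid cell now has (near-)zero mean and no linear trend
assert np.allclose(weighted.mean(axis=0), 0.0, atol=1e-10)
```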
After the aforementioned preprocessing steps, PC time series with N_PCs ∈ [1, 5] for SST and N_PCs ∈ [1, 3] for GPT were calculated from the seasonal anomaly fields of each parameter, using year-round data, a global domain for SST, and a Northern Hemispheric domain for GPT (Figs. 3 and 4). Altogether, 37% of the total variance of the SST data was explained by the leading five PCs. The distribution of the explained variance was different for GPT: 56% of the variance was explained already by the first three components.

The SST PC time series were then lagged with N_lags ∈ [1, 5] to take into account different and possibly prolonged causal relations of the predictors. In the 3-monthly data an N_lags value of 2, for example, means that predictor values from two seasons earlier are used to forecast the predictand values of the target season. A total of 5 components × 5 lagged steps = 25 potential PC predictors were derived from the SST parameter. Changes in the stratospheric circulation, including, for example, sudden stratospheric warming events (SSWs; e.g., Kidston et al. 2015), bring exploitable signals to the GPT parameter, and they typically affect the troposphere for no longer than two seasons. For this reason the number of lags was reduced to N_lags ∈ [1, 2] for GPT. The total number of potential GPT predictors was then 3 components × 2 lagged steps = 6, so altogether 25 + 6 = 31 potential PC predictors were extracted from the SST and GPT parameters. Finally, all the potential predictors were scaled to N(0, 1) (i.e., to have zero mean and unit variance) because LASSO regression cannot be applied without standardizing the predictors.
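The derivation of lagged, standardized PC predictors might look roughly like the sketch below, using scikit-learn's PCA on synthetic data. The dimensions (five PCs, lags of one to five seasons, 25 predictors) follow the text; everything else, including the flattened 200-cell grid, is an illustrative assumption.

```python
# Illustrative sketch: extract leading PCs, lag them, and standardize.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
n_seasons = 384                                 # 96 years x 4 seasons
field = rng.normal(size=(n_seasons, 200))       # flattened anomaly grid

pcs = PCA(n_components=5).fit_transform(field)  # leading 5 PC time series

# lag each PC by 1..5 seasons: the value from `lag` seasons earlier
# becomes the predictor for the current season
n_lags = 5
lagged = np.column_stack([np.roll(pcs[:, c], lag)
                          for c in range(5) for lag in range(1, n_lags + 1)])
lagged = lagged[n_lags:]                        # drop rows that wrapped around

# standardize to zero mean, unit variance (required before LASSO)
X = (lagged - lagged.mean(axis=0)) / lagged.std(axis=0)
assert X.shape[1] == 25                         # 5 PCs x 5 lags
```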
The first SST component, SST1, corresponds to the El Niño–Southern Oscillation (ENSO) index. The second component, SST2, mostly represents the Atlantic multidecadal oscillation (AMO). The third and fourth components are both connected to the Pacific decadal oscillation (PDO) and North Pacific Gyre Oscillation (NPGO) indices. See appendix B for an explanation of how the different PCs of SST were linked to previously identified modes of SST variability.
c. Aggregation of the predictand variables inside target domains
Temporal and spatial aggregation inside the study domains (Fig. 1) was applied to the monthly HadCRUT4 temperature. The data were first averaged over the standard seasons. After that, the seasonal cycle was removed by subtracting the seasonal means, and detrending was applied by removing the linear 1915–2010 trend from each season separately. Finally, areal means were calculated to form one predictand time series per domain.
To study the representativeness of the areal mean temperature inside the target domains, the correlation coefficient between the areal mean temperature anomaly and each individual grid cell anomaly was calculated for Scandinavia (SC) and western Europe (WE). The mean correlation over all cells was 0.88 in SC and 0.78 in WE, and the weakest correlations, still above 0.58, were found in the offshore grid cells. Based on this analysis, the areal mean temperature represents the temperature variability over land areas well in the smaller SC and WE domains, and the mean temperature predictions can be generalized inside these regions.
In addition to the target domains of this study, the predictabilities of the Mediterranean region and eastern Europe were tested. They were found to be lower than in SC and WE for most of the seasons, sometimes considerably weaker. The results from those experiments were not included in this paper since many conclusions are based on averaging the results over all domains, and the risk for misleading conclusions would be increased if less predictable domains were included. On the other hand, this means that our results primarily apply to those parts of Europe where the prediction skill is the highest.
d. Random sampling of the training data
After preprocessing the predictor and predictand data and deriving the potential predictors using all the years of the study, the years 1986–2010 were set aside to be used later as an independent test sample. These 25 years did not participate in the fitting of the models, in the predictor selection procedure, or in the calibration of the postprocessing model.
The training data, consisting of the years 1915–85 and 31 potential predictors, were then sampled 1000 times so that in each iteration, 35 years (50% of all fitting years) and 10 predictors [33% of all predictors, the percentage suggested by Hastie et al. (2009, section 15.3 therein) for regression problems] were selected randomly without replacement to be used in fitting each individual LASSO regression model. Building a model ensemble in this way is called random subspace sampling or attribute bagging (e.g., Bryll et al. 2003), and it is closely related to random forest sampling and bootstrap-aggregation sampling (i.e., bagging; Hastie et al. 2009, sections 8.7 and 16.3 therein). In the attribute bagging approach, the data samples are smaller than in bagging, which considerably speeds up the calculation and increases the accuracy of the ensemble by decreasing the correlation among the individual regression models.
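The attribute-bagging sampling step can be sketched as follows. The counts follow the text (35 of the 71 training years, 10 of the 31 predictors, 1000 members); all variable names are illustrative, not the paper's code.

```python
# Minimal sketch of the random-subspace (attribute bagging) sampling:
# each ensemble member gets its own random subset of years and predictors,
# drawn without replacement.
import numpy as np

rng = np.random.default_rng(42)
n_train_years = 71                   # 1915-1985
n_predictors = 31
n_members = 1000

samples = [(rng.choice(n_train_years, size=35, replace=False),
            rng.choice(n_predictors, size=10, replace=False))
           for _ in range(n_members)]

# each member sees a distinct random subspace of the training data
yr_idx, pred_idx = samples[0]
assert len(set(yr_idx)) == 35 and len(set(pred_idx)) == 10
```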
e. LASSO regression in random samples
For each target domain, season, and random sample, one LASSO regression model was fitted using Python’s Scikit-learn library (version 0.20.1; Pedregosa et al. 2012). Automatic predictor selection and shrinkage of the sampled predictors were applied using cross-validation and the least angle regression algorithm (LARS; Efron et al. 2004) as implemented in the LassoLarsCV function.
LASSO differs from ordinary least squares (OLS) regression in that the quantity to be minimized,

$$\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^{2} + \lambda\sum_{j=1}^{p}\left|\beta_j\right|,$$

where $i$ denotes the sample and $j$ the predictor indices, $y$ the predictand, $x$ the predictors, and $\beta$ the regression coefficients, contains an extra penalty term, $\lambda\sum_{j=1}^{p}|\beta_j|$, in addition to the residual sum of squares, $\sum_{i=1}^{n}(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij})^{2}$, minimized in OLS (Hastie et al. 2009, section 3.4 therein). The extra term defines the amount of shrinkage of the regression coefficients $\beta_j$, controlled by the regularization parameter $\lambda \ge 0$. Depending on the value of $\lambda$, the coefficients $\beta_j$ may be positive, negative, or zero, the latter effectively leading to the rejection of predictors.
In this study, K-fold cross-validation with K = 5 was used to optimize the value of λ for each ensemble member. The algorithm first fits a separate LASSO model for each of the K folds and calculates a separate regularization path and the related mean square error for each. The optimal value of λ is then selected by averaging the mean square errors over all K solutions and locating the minimum of that average. Using K = 20 was also tested, but it only slowed down the fitting process without improving the results.
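Fitting one ensemble member could look roughly like the sketch below, using scikit-learn's LassoLarsCV with 5-fold cross-validation as described in the text. The synthetic data stand in for one random sample of 35 years and 10 PC predictors; the planted coefficients are an assumption for illustration.

```python
# Hedged sketch of one ensemble-member fit with LassoLarsCV: the CV
# procedure selects the regularization strength along the LARS path.
import numpy as np
from sklearn.linear_model import LassoLarsCV

rng = np.random.default_rng(7)
X = rng.normal(size=(35, 10))             # 35 sampled years, 10 predictors
beta = np.zeros(10)
beta[:3] = [0.8, -0.5, 0.3]               # only 3 predictors carry signal
y = X @ beta + 0.3 * rng.normal(size=35)

model = LassoLarsCV(cv=5).fit(X, y)

# LASSO shrinks weak predictors toward (or exactly to) zero
n_active = int(np.sum(model.coef_ != 0))
assert 1 <= n_active <= 10
```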
The LARS algorithm is essentially the same as the forward stagewise predictor selection, which typically yields a higher number of selected predictors than other, greedier approaches, such as forward selection. Despite this, the weakest predictors were either rejected or considerably regularized in the fitting, as shown in section 3c. This regularization of the βj coefficients decreases the risk of overfitting of the individual ensemble members compared to using, for example, the unregularized OLS.
f. Deterministic validation metrics
Two metrics were selected for validating the quality of the modeling against observed values, the most important being the anomaly correlation coefficient (ACC; Wilks 2011, section 8.6 therein), that is, the Pearson correlation coefficient calculated between the observed and modeled predictand anomaly time series. Additionally, the mean square skill score (MSSS) was used, defined as

$$\mathrm{MSSS} = 1 - \frac{\sum_{i=1}^{n}\left(y_{m,i} - y_{o,i}\right)^{2}}{\sum_{i=1}^{n}\left(y_{\mathrm{ref},i} - y_{o,i}\right)^{2}}.$$

The $y_m$ and $y_o$ symbols denote the modeled and observed time series, respectively. As reference forecasts, denoted by $y_{\mathrm{ref}}$, the climatological and persistence forecasts were used, yielding MSSSclim and MSSSpers; the former always assumes zero anomalies, and the latter carries the observed anomaly forward from one season to the next.
The ACC value describes the model ability to reproduce the anomalies (independently of their amplitude), while MSSS tells about the model’s relative accuracy compared to the given reference forecast. ACC and MSSS can have values in the range [−1, 1] and [−∞, 1], respectively. Both scores are positively oriented, so that the higher the score, the better the result. In general, negative or near-zero ACC values indicate nonskillful forecasts when estimated from an independent test sample. Positive MSSS values indicate that the forecast is able to surpass the skill of the reference.
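The two metrics can be written as short functions consistent with the definitions above. This is a sketch; the variable names are illustrative.

```python
# ACC and MSSS as minimal functions matching the definitions in the text.
import numpy as np

def acc(y_mod, y_obs):
    """Anomaly correlation coefficient: Pearson r of the anomaly series."""
    return np.corrcoef(y_mod, y_obs)[0, 1]

def msss(y_mod, y_obs, y_ref):
    """Mean square skill score: 1 - MSE(model) / MSE(reference)."""
    mse = lambda a, b: np.mean((a - b) ** 2)
    return 1.0 - mse(y_mod, y_obs) / mse(y_ref, y_obs)

obs = np.array([0.2, -0.1, 0.4, -0.3, 0.1])
clim = np.zeros_like(obs)               # climatological reference forecast

assert abs(acc(obs, obs) - 1.0) < 1e-12  # perfect forecast: ACC = 1
assert msss(obs, obs, clim) == 1.0       # perfect forecast: MSSS = 1
assert msss(clim, obs, clim) == 0.0      # reference scores 0 against itself
```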
All metrics were calculated using the ensemble mean of the LASSO models as a deterministic forecast; they are fast to calculate and interpret and are therefore suitable for model development purposes. The probabilistic interpretation of the ensemble results remains for further work. To some extent the deterministic performance of the ensemble mean also reflects the probabilistic performance: it is unlikely that an inaccurate deterministic model would produce valuable probabilistic predictions, but an accurate one might well do so.
g. Postprocessing the raw output
After creating the ensemble of LASSO models, each member of the ensemble was applied to predict the whole 1915–2010 period, including the test years. Minimizing the RSS in the fitting of the LASSO regression models typically leads to reduced variance in the model output compared to the observed variability of the predictands. This was handled by increasing the variability of the modeled predictand values with a nonlinear quantile mapping method [modified from the approach of Räisänen and Räty (2013)], such that the results from each ensemble member were corrected separately using a model-specific quantile correction function

$$y_c = F_o^{-1}\left(F_m\left(y_m\right)\right),$$

where $y_c$ represents the corrected predictand time series, $y_m$ the raw model output, $F_m$ the cumulative distribution function calculated from the raw model data, and $F_o^{-1}$ the corresponding inverse distribution function calculated from the observed values. The correction functions were calculated for all models separately using the training years, and then applied to correct all years 1915–2010. Additionally, the same correction was applied to the persistence reference forecasts, because the variability typically differs considerably between seasons, which would make MSSSpers too high without the correction.
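A minimal empirical version of this quantile correction can be sketched by mapping each raw value through the empirical model CDF and the inverse empirical observed CDF, both estimated from the training years. The interpolation details below are assumptions for illustration, not the paper's exact implementation.

```python
# Empirical quantile-mapping sketch: y_c = F_o^{-1}(F_m(y_m)), with both
# distribution functions estimated from training-period data.
import numpy as np

rng = np.random.default_rng(3)
obs_train = rng.normal(0.0, 1.0, size=71)   # observed anomalies (training)
mod_train = rng.normal(0.0, 0.4, size=71)   # raw model output, damped variance

def quantile_map(y_raw, mod_ref, obs_ref):
    # F_m(y): empirical CDF position of y among the training model values
    probs = np.interp(y_raw, np.sort(mod_ref),
                      np.linspace(0.0, 1.0, len(mod_ref)))
    # F_o^{-1}(p): observed value at the same quantile
    return np.quantile(obs_ref, probs)

corrected = quantile_map(mod_train, mod_train, obs_train)

# the correction inflates the damped model variance toward the observed one
assert corrected.std() > mod_train.std()
```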
3. Results

a. Prediction skill of temperatures in Europe
Temperature predictability in all domains and seasons was estimated using the validation metrics, calculated over the independent test years. As seen in Fig. 5, the ACC of the ERA-20C-based ensemble is positive and statistically significant in all domains and seasons except DJF and MAM in WE. The mean ACC, calculated over all seasons and domains, is 0.51. Essentially the same mean ACC, 0.49, was achieved when the 20CRv2c reanalysis was used to train and test the model ensemble, indicating that both reanalyses contain information that can be extracted by the PCA and used in fitting and forecasting. The skill scores of the individual domains and seasons are also, on average, comparable between the two reanalyses. In general, the prediction skill for temperatures in the JJA and SON seasons is better than in DJF and MAM, and the prediction skill for the largest domain [i.e., all of Europe (EU)] is higher than for the smaller domains.
The MSSS can be decomposed into several terms, of which ACC is one (Murphy and Epstein 1989); ACC is therefore a measure of potential rather than actual skill, and even though related, the two metrics are not directly proportional to each other. Measured by the number of statistically significant target domains and seasons, the performance in terms of MSSS is not as good as that measured by the ACC. Overcoming the persistence forecast, for example, can be difficult because of the often strong autocorrelation of the predictand. Autocorrelation brings ACC skill to the persistence reference forecast, as shown in the bottom-right legends of Fig. 6, making it difficult for the LASSO ensemble to beat. Despite that, the climatological forecast appears even harder to beat, as seen in the lower mean MSSSclim value (0.24) compared with the mean MSSSpers value (0.37).
b. Sensitivity of skill to details of methodology
Including the local persistence of the predictand variable in the list of potential predictors was found to improve the accuracy of the modeling, increasing the mean ACC from 0.51 to 0.54 and enhancing the prediction skill especially in the WE region. Using the persistence of the predictand as a predictor is common in other studies (e.g., Karpechko 2015; Kolstad et al. 2015; Johansson et al. 1998), but since the main objective of this work was to study and show the potential of lagged SST and GPT PCs, it was not used further here.
According to modeling tests with different domains and reanalyses, temperature forecasts are most skillful and consistent when the global domain is used for the extraction of SST PCs, compared to the smaller Northern Hemispheric domain (not shown). The predictive power of GPT, however, was stronger when only the signals from the Northern Hemisphere were taken into account. To test the relative importance of the SST and GPT input parameters, each was excluded from the modeling in turn. When GPT was excluded, the mean ACC decreased from 0.51 to 0.44 (from 0.49 to 0.44 in 20CRv2c), and when SST was excluded, the mean ACC likewise decreased to 0.44 (0.39 in 20CRv2c). In general, SST performs better in JJA and SON, and GPT in DJF and MAM. The better performance of GPT in winter and spring is not surprising because of the stronger stratosphere–troposphere coupling, which increases predictability in these seasons (Baldwin et al. 2003). These results show that either input parameter could be used in seasonal forecasting separately, and on average they contribute roughly equal predictive power. The best result, however, was achieved when both parameters were included.
Using the quantile mapping correction for the individual ensemble members is beneficial for the MSSS in those domains and seasons where the prediction skill is high enough prior to correction (ACC > 0.40), but it improves the results only marginally, if at all, where the uncorrected results are not very skillful. Without the correction, the variance of the ensemble mean would be near zero, which explains the improvement in MSSS after the correction; restoring this variance also considerably improved the simulated trends. The effect on the ACC was only minor, even though the nonlinear correction could, in principle, affect that metric as well.
The usefulness of predictor lagging was tested by excluding lags 2–5 from the potential predictors, such that only the predictor values from the previous season were used to forecast the next season. As a consequence, the mean ACC dropped from 0.51 to 0.46, indicating that using the lagged predictors is valuable. Similarly, the importance of using long time series in fitting and predictor selection was tested by excluding the period 1915–44 from the fitting and predictor selection years. The mean ACC dropped even more, to 0.41, probably because of the increase in the ratio p/n, which leads to overfitting, or because of increased sampling uncertainty related to the preprocessing: calculation of trends, climatology, and PCs require a certain amount of data to be robust.
A large part of the predictability seems to arise from the correct simulation of decadal variability and decadal trends. For example, in SC the observed and modeled temperature trends are positive during the test years in all seasons, but especially in DJF and SON, as seen in Fig. 6. Similar trends are visible in the other domains as well, as shown in Figs. S1 and S2 in the online supplemental material. To study the effect of the decadal trend on the ACC, linear trends were removed separately from the modeled and observed test period time series 1986–2010 in each domain and season. The mean ACC was found to decrease from 0.51 to 0.36 after the removal of the decadal trends.
The accuracy of the ensemble mean was compared to that of the best individual ensemble member for all domains and seasons. To find the best member, the ACC of each LASSO model was calculated for the fitting period 1915–85, and the member with the highest score was selected. Over the test years, the mean ACC of these best individual members was only 0.34, demonstrating the power of the ensemble mean approach. As Fig. 7b shows, including more than 50 members in the ensemble does not often improve the ensemble mean accuracy further. However, growing the ensemble beyond that size is not usually harmful either, and the skill sometimes peaks at considerably larger ensemble sizes, justifying the use of 1000 members in this study.
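The advantage of the ensemble mean over individual members can be illustrated with a toy experiment in which independent member errors partly cancel when averaged. The data below are purely synthetic, not the paper's forecasts.

```python
# Toy illustration: averaging many noisy forecasts cancels independent
# errors, so the ensemble mean correlates with "truth" better than any
# single member.
import numpy as np

rng = np.random.default_rng(5)
truth = rng.normal(size=25)                            # "observed" anomalies
members = truth + rng.normal(0.0, 1.0, size=(200, 25)) # noisy members

def acc(a, b):
    return np.corrcoef(a, b)[0, 1]

member_accs = [acc(m, truth) for m in members]
ens_acc = acc(members.mean(axis=0), truth)

# the averaged forecast beats even the luckiest individual member
assert ens_acc > max(member_accs)
```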
c. Evolution of the ACC skill with time
The evolution of the ACC over the full period 1915–2010 was analyzed by applying a moving 25-yr analysis window. As shown in Fig. 7a, and in Figs. S3 and S4, the temporal variations in the ACC have typically been small, and its minimum values high, in the JJA and SON seasons since 1950. Further, there was an increasing trend in the ACC skill in JJA in all domains of the study. This trend might be linked to the slowdown of the hemispheric Rossby waves and the increased persistence of weather patterns (e.g., Francis and Vavrus 2012), which can add skill to the statistical model. However, this link cannot be confirmed directly from the data of this study.
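The moving-window analysis can be sketched as follows on synthetic series; the window length follows the text, while the series themselves and the signal strength are illustrative assumptions.

```python
# Sketch of a sliding 25-year ACC analysis over a 1915-2010 record.
import numpy as np

rng = np.random.default_rng(9)
years = np.arange(1915, 2011)
obs = rng.normal(size=len(years))
mod = 0.6 * obs + 0.8 * rng.normal(size=len(years))  # partially skillful model

win = 25
acc_series = [np.corrcoef(mod[i:i + win], obs[i:i + win])[0, 1]
              for i in range(len(years) - win + 1)]

# one ACC value per possible window start
assert len(acc_series) == len(years) - win + 1
```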
Compared to JJA and SON, lower test period ACC skill scores were achieved in the DJF and MAM seasons. In addition, larger variability in skill between different periods was found in DJF and MAM. Weisheimer et al. (2017) reported a corresponding time dependency of skill in their winter NAO forecasts, finding, for example, that their forecast skill drops around 1960 and increases again later. In our analysis, the timing of high- and low-skill periods depends on the season and domain, but on average, the skill seems to be lower around 1950–70.
In the majority of the seasons and domains, the ACC skill does not drop when crossing the boundary between the fitting and testing years, but remains essentially the same or increases slightly. This demonstrates that the model ensemble is not overfitted, which builds confidence in the true skill of the statistical modeling.
d. Analysis and interpretation of the predictors in the ensemble members
The predictors selected into the LASSO models differ depending on the season and domain. The number of predictors per model also varies, as can be seen in the analysis of predictors in Fig. 8 and Figs. S5 and S6. Typically, a large average number of predictors was found in SON and sometimes in JJA, whereas the models fitted to forecast DJF and MAM contained fewer predictors.
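Both diagnostics, the number of predictors per model and the selection frequency of each predictor across the ensemble, follow directly from the LASSO coefficients (a predictor is "selected" when its coefficient is nonzero). A minimal sketch on synthetic data, with illustrative dimensions and regularization:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n, p = 70, 12
X = rng.normal(size=(n, p))
y = 0.9 * X[:, 2] - 0.7 * X[:, 5] + rng.normal(0.0, 1.0, n)  # predictors 2 and 5 matter

coef = np.zeros((100, p))
for i in range(100):
    yrs = rng.choice(n, n // 2, replace=False)   # each member fitted on half of the years
    coef[i] = Lasso(alpha=0.1).fit(X[yrs], y[yrs]).coef_

predictors_per_model = (coef != 0).sum(axis=1)   # model complexity, cf. Fig. 8
selection_freq = (coef != 0).mean(axis=0)        # how often each predictor is selected
print(selection_freq.round(2))
```

In this toy setting the two informative predictors are selected in most members, while noise predictors enter only occasionally; the same counting applied to the real ensemble yields the frequency rankings discussed below.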
Different PCs of GPT were found in all seasons, but they were especially common in the MAM models. While the second PC of GPT, GPT2, is by far the most common predictor in all domains and seasons, all lags of SST2, in no particular order, rank in the top 15 of the 31 potential predictors. Temporally remote signals from past seasons sometimes appear in the models, as lags of three to five seasons were found for different PCs of SST; lags 4 and 5, however, tend to be less common than lags 1–3. Interestingly, the dominant components of both parameters, SST1 and GPT1, were quite rare among the ensemble members. Forcing the ENSO-related SST1 component manually into all random samples did not improve the results. In other studies, ENSO has been found to affect, for example, the NAO variability (e.g., Hall et al. 2017; Bell et al. 2009) and the European climate in general (Brönnimann 2007). It is likely that the response to the ENSO forcing is nonlinear in Europe, and major modifications of the SST1 signal would be needed before linear regression models could utilize it, as shown by Folland et al. (2012) and Hall et al. (2017). Preliminary tests with the ENSO signal modification of Folland et al. did not bring any useful predictive power to the modeling, though, probably because seasonal mean values, instead of monthly means, were used as predictors.
On the other hand, the SST2 component, which mostly describes the AMO on decadal scales, was the most frequent SST predictor on average. This result not only implies the strong role of the North Atlantic in regulating the climate of Europe (Li et al. 2018; Chen et al. 2018; Knight et al. 2006; Czaja and Frankignoul 2002; Peng et al. 2005), but also suggests the importance of decadal-scale SST modes as a source of predictability in seasonal forecasts (Davini et al. 2015). Out of all SST components, only SST2 contains a clear trend during the test period years, which overlaps with the temperature trends in different predictand seasons. This trend probably partly explains the correct simulation of the decadal temperature trends during the test period years, especially in JJA and SON. Similar decadal-scale variability and trends can be seen in the GPT predictors, and they might also contribute to the simulation of decadal trends in the predictands. However, it is important to keep in mind that automatic predictor selection methods are not aware of true physical connections between predictors and predictands, and random, coincidental trends can affect the results, especially with a small number of samples n.
Finally, SST3 and SST4, which primarily describe different shorter-term variability modes of the Pacific Ocean (as well as the PDO), are relatively common in all seasons, but especially in DJF and MAM. Corresponding shorter-scale variability is also visible in SST2, in addition to its quite strong decadal-scale signal. All lags of SST2 were common among the models, which might indicate that the predictive power of that component originates from the decadal low-frequency AMO variability rather than from its higher frequencies (i.e., annual- or seasonal-scale variability). However, whether the longer-term variability of SST2 is actually more important than its shorter-scale variability in bringing prediction skill to seasonal forecasts, or vice versa, remains to be studied in further work.
Predictor selection is a sensitive and potentially error-prone task, vulnerable to failure if the quality of the predictand or predictors is low, if the internal variability of the predictand is prominent compared to the predictable component, or if the number of samples is too small. The selected modeling approach could be used directly to forecast grid cell values, but that would be risky, because the internal variability at individual grid cells is probably too large, which could cause problems in predictor selection and regularization. In this study, the predictands derived from the smaller target domains might still contain too much internal variability despite the areal averaging, which could at least partly explain why the quality of the results in WE was occasionally lower than in the other domains. A possible cure for this problem could be to use the predictors that were found for the largest EU domain to predict the smaller domains, using refitted models. In the EU domain, the signal-to-noise ratio is higher, so predictor selection might work better there, and it is possible that the same global predictors that bring predictability to the largest domain also work for the smaller domains. On the other hand, the risk of this approach is that if the target domain is too large, different parts of it may need to be predicted using partly or completely different predictors. This avenue of exploration is, however, left for further research.
Another potential risk in predictor selection is related to the use of lagged, occasionally highly collinear predictors. Collinearity refers to correlations between predictors, and it is typically high between differently lagged versions of the same predictor. Autocorrelation analysis of the SST PCs, for example, indicates that the five lags are positively correlated in all five components (not shown). Collinearity can increase the probability of selecting nonoptimal lags. Even though the use of nonoptimal lags might not degrade the skill of the models very much, compared to using too many and/or too weak potential predictors, it can weaken the physical interpretability of the models.
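The lag collinearity can be demonstrated with a slowly varying synthetic series. An AR(1) process stands in for a persistent SST PC here; the persistence parameter is an illustrative assumption.

```python
import numpy as np

# AR(1) series mimicking a slowly varying SST PC; lagged copies of it are
# mutually correlated, which is the collinearity discussed above.
rng = np.random.default_rng(4)
n = 400
x = np.zeros(n)
for t in range(1, n):
    x[t] = 0.8 * x[t - 1] + rng.normal()

lags = np.column_stack([x[5 - k:n - k] for k in range(1, 6)])  # lags 1..5
corr = np.corrcoef(lags, rowvar=False)
print(corr.round(2))
```

For this persistence level, adjacent lags correlate at roughly the AR coefficient (about 0.8), and the correlation decays with lag separation; a regularized selector choosing among such columns can easily pick a neighboring, nonoptimal lag.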
However, according to the modeling results, the potential drawbacks of using lagged predictors seem to be minor compared to the benefits. Because of the very complex and possibly temporally chained interactions between different components of the climate system (Vihma 2014), it is possible that some previously known or unknown mechanisms increase the prediction skill of the modeling. The propagation of a signal through known pathways such as SST → GPT → troposphere (Bell et al. 2009; Kidston et al. 2015) and continental snow cover → GPT → troposphere (Gastineau et al. 2017) can take several months or seasons, which probably explains why using different lags was beneficial. At best, statistical learning can find and take advantage of these mechanisms to improve the accuracy of the forecasts. A deeper analysis of the data is needed to identify and explain these mechanisms in detail.
Each random sample contained only 50% of the fitting years. Those years were used to fit one LASSO model, and the remaining 50% could be used to validate that particular model. Furthermore, the validation years could be used to weight the models according to their validation ACC skill, in order to enhance the accuracy of the modeling. This procedure was briefly tested, but the accuracy of the ensemble mean was found to be no better than with unweighted models. It is possible that fittings performed on certain subsets of fitting years produce models that perform well in the validation sample, and when the number of random samples is high enough, models fitted using roughly those particular sets of years receive heavier weights. Effectively, this leads to the rejection of some potentially valuable fitting years. Successful weighting of the models would probably require more data (i.e., longer time series).
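The tested weighting scheme amounts to a skill-weighted average of the member forecasts. A minimal sketch, with a hypothetical helper name (`weighted_ensemble_mean`) and synthetic member predictions and validation scores:

```python
import numpy as np

def weighted_ensemble_mean(member_preds, validation_acc):
    """Weight members by their (clipped, nonnegative) validation ACC."""
    w = np.clip(np.asarray(validation_acc, dtype=float), 0.0, None)
    if w.sum() == 0:
        w = np.ones_like(w)            # fall back to the unweighted mean
    return (w[:, None] * member_preds).sum(axis=0) / w.sum()

rng = np.random.default_rng(5)
preds = rng.normal(size=(50, 25))      # 50 members x 25 test years (synthetic)
val_acc = rng.uniform(-0.2, 0.8, 50)   # hypothetical validation ACC scores
wmean = weighted_ensemble_mean(preds, val_acc)
print(wmean.shape)
```

With equal weights the scheme reduces exactly to the plain ensemble mean, which is the variant the study ultimately retained.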
The earliest years of the century are more uncertain in the SST predictor data because of the sparseness of the observations at that time (Kennedy 2014), and the same time dependency of uncertainty also holds for the GPT data. These uncertainties affect the fitting of the LASSO models. However, if the biases of the predictor data are not systematic over the years, but random and not too large, they might not hamper the fitting of the models too much. This hypothesis is supported by the test in which the exclusion of the earliest fitting years was found to decrease the accuracy of the modeling (section 3b). The uncertainty of the most recent years, which were selected for testing purposes, is nevertheless smaller, probably increasing the validation scores in those years to some extent. On the other hand, the temperatures in Europe are probably comparatively well described in the HadCRUT4 data, thanks to the dense weather station network there throughout the century, reducing the uncertainty related to the predictands.
A large part of the existing literature on seasonal predictability concentrates on wintertime NAO index forecasts (section 1), and because the NAO was not a predictand of this study, comparing the skill of our modeling with that of others is challenging. References to summer or fall temperature forecasting experiments are especially sparse. Furthermore, the measures of skill vary between studies, and comparing, for example, the probabilistic or categorical skill scores used in many papers (Karpechko 2015; Weisheimer and Palmer 2014; Graham et al. 2005) with our deterministic scores is difficult. However, given the mostly insignificant and occasionally even negative ACC skill of the temperature forecasts of current operational forecasting models (Mishra et al. 2019), it is probably safe to say that contemporary dynamical models perform worse than our model, at least in summer and fall. On the other hand, we add, as a note of caution, that the historical skill of seasonal forecasts does not necessarily imply future skill (e.g., Weisheimer et al. 2017). In the statistical pre-reanalysis approach of Colman and Davey (1999), good, although not completely independently estimated, summer predictability was found for a region that roughly corresponds to our WE domain. Interestingly, they used North Atlantic SSTs as predictors, which were also an important source of predictability in our modeling. Johansson et al. (1998) also studied the predictability of seasonal mean temperatures in northern Europe, using PCs of SST as predictors, and concluded that fall was the least predictable season in their approach, whereas in our modeling fall was, along with summer, the most predictable. They found the most skill in winter forecasts, but even for that season their ACC score was not very high.
In our approach, some potentially important predictors, such as sea ice and snow cover, could not be used because they were too uncertain over the period of our study (appendix A). Sufficiently accurate predictors derived from these parameters might have improved the somewhat poorer winter and spring forecasts, as the variability and predictive power of these parameters are strongly connected to those seasons (e.g., Liptak and Strong 2014). Preliminary tests also suggested that using monthly mean values of the predictor parameters, instead of seasonal means, improves the accuracy of the forecasts in some seasons and domains, particularly in spring. Spring is especially sensitive to the GPT variability, which partly explains the improvement: wintertime sudden stratospheric warming (SSW) events typically affect spring the most, and identifying them correctly from seasonally averaged data is apparently too difficult for the regression models. Furthermore, seasonally averaged predictor data provide no information about the timing of events within the three-month winter season, which could be important because of the relatively short time frame of the stratosphere–troposphere interaction (e.g., Baldwin and Dunkerton 2001).
Replacing the ERA-20C and the 20CRv2c datasets with some other reanalyses would have improved the reliability of many or all input parameters, probably to the extent where, for example, predictors derived from the sea ice would have been applicable. However, other reanalyses, such as ERA-Interim, usually contain considerably shorter time series. Increasing the number of potential predictors p and at the same time decreasing the number of fitting years n would both increase the dimensionality of the data (i.e., the ratio p/n), which would lead to overfitting. In other words, using long time series for training the model ensemble allows us to use more PCs and more lags per predictor parameter, such that the full potential of the parameters can be utilized, probably better than would be possible with shorter time series.
In this paper, we present a new method to forecast seasonal temperatures in Europe. The skill of the model was estimated, and the most important predictors were identified to explain the sources of predictability. The main findings are collected below.
Statistical learning is useful in the field of seasonal forecasting. In complexity, the LASSO-based ensemble approach of this study lies between very simple statistical OLS models and computationally demanding dynamical models. The method is robust and suitable for smaller datasets, and it is reasonably tractable for physical interpretation compared with more flexible and highly nonlinear approaches, such as neural networks or support vector machines (Hastie et al. 2009). As the comparison with the skill of the best individual ensemble member shows, the statistical learning properties of the method can help both in selecting and weighting the most important predictors and in improving the accuracy of the results.
As already shown in previous papers, the Northern Hemispheric GPT variability in the preceding seasons was confirmed to bring predictability into the forecasts of European temperatures over various domains, mostly in winter and spring.
When compared to GPT, the global SST variability, including the PDO and, most importantly, the AMO indices, is on average an equally important source of predictability. Because sufficiently long time series were used for fitting the models, decadal variability was captured and then skillfully exploited during the test period years. Summer and fall forecasts benefit the most from using SST as a predictor.
Using lagged SST predictors increases the predictive power of the statistical model ensemble compared to the more traditional approach, in which only values from the previous season, or month, are used. Slowly evolving currents and other processes in the oceans contain information and signals that can be exploited by using predictor values from further in the past.
Two centennial-scale reanalyses, ERA-20C and 20CRv2c, were used separately in deriving the potential predictors. The prediction skill was found to be insensitive to the choice of the reanalysis product. Additionally, the predictive skill was roughly on the same level in the fitting and testing periods on average, although there were variations in skill with time. These results indicate the robustness and low variance of the model system.
Compared to the skill achieved in previous papers on seasonal temperature forecasts for European target regions, the forecasts for summer and fall were especially skillful and robust.
MK was supported by the Kone Foundation, via a personal Ph.D. grant for studying the climatic effects of ocean surface temperature in the Northern Hemisphere (https://koneensaatio.fi/en/grants/tuetut/2016-2/; Grant 088575), which was the main source of funding for this study. PU was supported by the EC Marie Curie Support Action LAWINE (Grant 707262). AYK was supported by the Academy of Finland (Grants 286298, 319397). OH and IL were supported by the Finnish Meteorological Institute. JR was supported by the Academy of Finland Centre of Excellence program (project 307331). We thank the creators and maintainers of the ERA-20C, 20CRv2c, and HadCRUT4 datasets, and the other datasets used in this study. We are grateful to the three anonymous reviewers whose helpful comments improved the quality of the original manuscript. The authors declare no conflicts of interest.
Exclusion of Sea Ice and Snow Cover Variables from Predictor Parameters
In previous studies, the potential of sea ice cover or concentration (SIC) as a source of predictability has been recognized (e.g., Bader et al. 2011). Unfortunately, appropriate sea ice predictor data could not be found for this study, since the longest available sea ice records contain many uncertainties, which additionally vary in time. Attempts to decompose the HadISST1 (Rayner et al. 2003) or Walsh et al. (2015) sea ice datasets with PCA were not successful: the discontinuities and changes in the observational systems could be clearly seen in the derived PCs, making the fitting and application of the LASSO models too uncertain. According to Bunzel et al. (2016), dynamical seasonal models are sensitive to details in the sea ice, which not only highlights the importance of ice concentration data for seasonal forecasting but also reveals the risks of fitting statistical models to overly uncertain datasets.
The PCs of snow cover (SNC) were found to be more homogeneous in reanalyses than the PCs of SIC in observational datasets. Using PCs of SNC as predictors was found to be possible and beneficial for the accuracy in some cases, and thus they could be used as predictors in seasonal forecasting. However, for the following reasons, they were excluded from this study:
1) When SNC was used, the results were slightly less accurate than with other predictors. Combining SNC with SST and GPT predictors did not improve the model skill compared to using only SST and GPT.
2) The snow covers in ERA-20C and 20CRv2c were found to differ, and less skillful results were achieved when 20CRv2c SNC was used as a predictor parameter, compared to using ERA-20C SNC.
3) Previous studies have revealed inaccuracies in the reanalysis snow data (e.g., Wegmann et al. 2017).
Connection of SST Principal Components to Known SST Indices
According to Messié and Chavez (2011), the leading five or six SST PCs can be associated with the major variability modes of the World Ocean, namely El Niño–Southern Oscillation (ENSO), the Atlantic multidecadal oscillation (AMO), the Pacific decadal oscillation (PDO), the North Pacific Gyre Oscillation (NPGO), the El Niño Modoki index (EMI), and the Atlantic El Niño index (ATL3). The results of Messié and Chavez, namely the monthly principal components of SST, were downloaded from http://climexp.knmi.nl/, season-averaged, and then compared to our components using cross-correlation analysis. Strong correlations were found between the first components (representing ENSO; r = 0.98) and between the second components (AMO; r = 0.80). The third and fourth components (representing mostly the Pacific modes PDO, NPGO, and EMI) were also strongly correlated (r = −0.79 and r = −0.72, respectively). The fifth (EMI, ATL3) component was not very strongly correlated between the two PC datasets (r = 0.33), and it also leaked some variability to the sixth component (r = 0.56). The differences between the datasets arise from the use of different SST data, different time periods and resolutions, and partly different preprocessing steps.
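The comparison procedure, season-averaging two monthly PC series and correlating them, can be sketched as follows on synthetic data; the series lengths and noise level are illustrative.

```python
import numpy as np

def season_average(monthly):
    """Average a monthly series into non-overlapping 3-month seasons."""
    monthly = np.asarray(monthly, dtype=float)
    usable = monthly[: len(monthly) // 3 * 3]   # drop trailing partial season
    return usable.reshape(-1, 3).mean(axis=1)

rng = np.random.default_rng(6)
ours = rng.normal(size=240)                     # 20 years of monthly PC values
theirs = ours + rng.normal(0.0, 0.3, 240)       # a closely related PC series

r = np.corrcoef(season_average(ours), season_average(theirs))[0, 1]
print(round(r, 2))
```

Season averaging suppresses month-to-month noise, so two PC series describing the same mode correlate strongly at the seasonal scale, as with the leading components above.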
Supplemental information related to this paper is available at the Journals Online website: https://doi.org/10.1175/JCLI-D-18-0765.s1.
Alternatives to that procedure exist in the literature. To implement, for example, the low-pass filtering approach of Huang et al. (1996), the local mean temperature of the recent past (e.g., the past 10 years) would be added as an additional predictor in the statistical model. However, defining the optimal averaging period objectively is not trivial, and this approach would also reduce the usefulness of the potentially valuable decadal-scale signals in the data.
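The Huang et al. (1996)-style idea can be sketched as appending one extra predictor column per year. This is a minimal illustration; the helper name `add_recent_mean_predictor` is hypothetical, and the 10-yr window is the example value from the text.

```python
import numpy as np

def add_recent_mean_predictor(X, temperature, window=10):
    """Append, for each year t, the mean temperature over the window [t-window, t)."""
    recent = np.array([temperature[max(0, t - window):t].mean() if t > 0
                       else temperature[0]          # no history for the first year
                       for t in range(len(temperature))])
    return np.column_stack([X, recent])

rng = np.random.default_rng(7)
temp = rng.normal(size=50)                 # synthetic local temperature series
X = rng.normal(size=(50, 3))               # synthetic existing predictors
X_aug = add_recent_mean_predictor(X, temp)
print(X_aug.shape)
```

The appended column is itself a low-pass-filtered version of the predictand history, which is exactly why, as noted above, it would compete with the decadal-scale signals carried by the other predictors.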