1. Introduction
Long-term climate change is a fundamental challenge for the decades to come. Nevertheless, adaptation needs to start now to address the risks associated with the changing climate. To do so, the best climate information for the next decades should be made available. While there is a vast number of climate information sources, in the form of processed climate simulations, for instance, it is important to assess the data used so as to produce the most accurate information possible. This becomes especially relevant at the regional scale, as some models are known to have issues representing certain regional features. In this paper, we illustrate some methods that aim to generate improved near-term climate information. The evaluation is carried out for the Mediterranean region, which has been identified as a climate change hotspot, especially in the summer season [June–August (JJA)] (Giorgi and Lionello 2008; Lionello and Scarascia 2018; Cos et al. 2022).
The Coupled Model Intercomparison Project (CMIP) is the largest community effort that provides climate simulations for the past, current, and future climate. CMIP6 is the latest phase of the project (Eyring et al. 2016) and uses newer versions of climate models running larger ensembles compared to the previous CMIP phase (CMIP5) (Taylor et al. 2012). CMIP6 provides, among many other experiments, continuous climate simulations for the twentieth and twenty-first centuries (historical runs and future projections) and retrospective initialized 10-yr forecasts (decadal predictions and hindcasts) with annual initializations from 1961 onward. The number of models included in CMIP6 is larger than that in CMIP5, as is the number of simulations performed with an individual model (referred to as ensemble members). The resulting increased ensemble size is key to improving estimates of the climate simulation uncertainty via a more complete sampling of both the internal variability and the structural or model error (Tebaldi and Knutti 2007).
Structural or model error is at the origin of differences in the simulation of observed teleconnections (Dalelane et al. 2023; McGregor et al. 2022), in the representation of known drivers of change (Cannon 2020; Beobide-Arsuaga et al. 2021), and in the models’ sensitivity (Meehl et al. 2020; Flynn and Mauritsen 2020), among other aspects. As a consequence, some models do a worse job than others at representing certain aspects of the observed climate. Averaging the simulations ensures that the mean is better, in a statistical sense, than using any random simulation from a single model (Tebaldi and Knutti 2007). However, this simple approach might suffer from issues related to the lack of model democracy and independence among models (Knutti et al. 2017; Sanderson et al. 2015a,b). For instance, CMIP6 has seen an increase in the number of models involved with respect to previous CMIP cycles, but this increase at times results from including model variants that should not be considered independent models (Merrifield et al. 2020, 2023). The sharing of model components has been shown to have a clear relationship with the proximity of model results (Boé 2018), thereby introducing biases in the sampling of model uncertainty (Tebaldi and Knutti 2007).
Climate projections are performed as a continuation of the historical simulations after 2015. Both historical and projection simulations start from an initial condition that is not related to the observed contemporaneous climate beyond the fact of using estimates of anthropogenic and natural forcings. Projections are typically extended to 2100. Decadal predictions are sets of climate simulations that are initialized with an estimate of the observed climate state every year from the 1960s and run in ensemble mode, typically for up to 10 years. Projections and historical simulations are long continuous runs that are not necessarily in phase with climate variability. Decadal predictions attempt to phase the simulated internal variability, considered the dominant source of uncertainty for near-term regional information (Hawkins and Sutton 2009, 2011; Lehner et al. 2020), with that of the observed climate. However, at only 10 years long, they are too short to provide information for the whole near-term period (considered the next 20 years in this study). This study focuses on the next 20 years of climate projections because this is the period in which many sectors need to make decisions on adaptation. To improve the near-term climate information that can be obtained from the full CMIP6 projection ensemble, discriminating among individual models according to their quality and including as much information about the internal variability as possible, this paper explores different methods based on the idea of seamless prediction. These methods aim to produce the most accurate and reliable information, bridging the gap between the decadal predictions and the provision of climate information for the next 20 years based on climate projections, as in Befort et al. (2020, 2022) and Mahmood et al. (2021, 2022).
Several methods to constrain the climate projections can be used for this purpose (Hegerl et al. 2021; Brunner et al. 2020b), but this paper focuses on two types of methods: (i) subsetting the ensemble by selecting only the members that best represent a given climate characteristic (Cox 2019; Hall et al. 2019; Tokarska et al. 2020; Mahmood et al. 2021) and (ii) estimating constraints that give each member a different weight according to its performance and/or independence (Knutti et al. 2017; Lorenz et al. 2018; Brunner et al. 2019; Merrifield et al. 2020).
Future climate estimates are available from both the full CMIP and constrained ensembles. However, to demonstrate their feasibility, most of the studies describing these methods estimated the credibility of the constrained projections using out-of-sample evaluation (e.g., testing constraints against other simulations that were not part of the ensemble when deriving the constraints) (Sanderson et al. 2017; Boé and Terray 2015; Herger et al. 2018). While this can conceptually illustrate that the constraints work according to expectations, it limits the confidence in their improved representation of the real-world climate because there is no comparison with any observations over the historical period. As suggested in Doblas-Reyes et al. (2013) for decadal predictions, it would be important to evaluate constrained projections against observations over a common period to measure whether and to what extent they could improve climate projections compared to the unconstrained projections, at least in the period when observations are available. The application of such evaluation methods is routinely done for initialized climate projections (Meehl et al. 2021; Delgado-Torres et al. 2022), and their application to transiently forced climate projections has recently been demonstrated by Donat et al. (2023b) and has also been done for some constraining approaches (Mahmood et al. 2021, 2022). In this study, we illustrate the relevance of evaluating the methods to constrain the projections against observations as an essential element in the provision of climate information.
In this paper, climate projections of summer temperature for the Mediterranean region are obtained with four constraining methods. The methods are evaluated by retrospectively comparing the climate information generated against an observational dataset over a well-observed period. We examine three selection methods developed by Befort et al. (2020) and Mahmood et al. (2021, 2022). These methods, which use either observed climate anomalies or decadal predictions as constraints, are designed to phase in the simulated internal variability of the simulations at a given point in time to reduce the uncertainty of regional climate information in the following two decades (Hawkins and Sutton 2009, 2011; Lehner et al. 2020). Additionally, we also consider the Climate Weights for Independence and Performance (ClimWIP) method, developed by Knutti et al. (2017) and further refined by Brunner et al. (2020a). This method weights the different ensemble members according to their historical performance and independence and, therefore, aims to improve the sampling of modeling uncertainty (Tebaldi and Knutti 2007). The independence weighting is relevant for the near term in the Mediterranean region, where model uncertainty explains around 20% of the total uncertainty (Lehner et al. 2020). As done for the three selection methods, ClimWIP is implemented retrospectively for the first time, allowing the evaluation of the quality of the generated information and the comparison with the three other methods. While some publications compare projection constraints (Hegerl et al. 2021; Brunner et al. 2020b), a typical drawback is that the methods considered are often implemented using different evaluation strategies: using different periods, not employing observational references, or being applied to different sets of model simulations (e.g., those that different groups had access to when performing their analysis).
Therefore, in this study, we overcome this difficulty by building a common framework so that the methods can be compared and evaluated consistently.
2. Data and methods
a. Observational datasets
The ERA5 reanalysis (Hersbach et al. 2020), extended back to 1950 and obtained from the Copernicus Climate Change Service Climate Data Store (CDS), the Berkeley Earth land air temperature (TAS) dataset (Rohde and Hausfather 2020), and the HadSLP2 sea level pressure (SLP) dataset (Allan and Ansell 2006) have been used as observational references. The use of these datasets is further explained in the following subsections and in Table 1.
Table 1. Parameters describing the four constraining methods. Note that the OBSfield-, DPfield-, and DPtime-selections have been applied in two different ways, using both North Atlantic and global SSTs. This makes seven approaches in total.
b. CMIP6 data
We have used CMIP6 data from historical simulations, climate projections, and decadal climate predictions, as described in more detail below.
1) Historical simulations and projections
The climate projection information used from CMIP6 includes the historical and SSP2-4.5 simulations, where SSP2 stands for the second shared socioeconomic pathway and 4.5 specifies the trajectory reaching a radiative forcing of 4.5 W m−2 at the end of the twenty-first century (Riahi et al. 2017). The historical simulations start from a preindustrial run with constant boundary conditions and are forced with estimates of anthropogenic and natural forcings from 1850 to 2014 (Eyring et al. 2016). The projection starts at the end of the historical simulation in 2015. The SSP2-4.5 scenario was selected as it falls halfway between the most optimistic and pessimistic anthropogenic emission scenarios and was considered sufficient for illustration purposes. Moreover, since the study looks at projections only 20 years into the future, the scenarios available in CMIP6 are very similar over that period. As a result, the simulation divergence due to the different scenarios is not as important for this time scale as the model uncertainty or the internal variability (Lehner et al. 2020).
Most models perform multiple simulations, differing among them only in the initial conditions to sample the uncertainty related to internal climate variability. These independent simulations form ensembles that have different sizes across the CMIP6 models. We use 209 members from 31 models of the CMIP6 multimodel ensemble (see Table S1 in the online supplemental material). Depending on the constraining method, individual members will be either given weights or selected into a smaller ensemble (which can be seen as weights of either 0 or 1).
The longest period used in the weight estimation of the constraining methods runs from 1935 to 2019. This period consists of the historical simulations from 1935 to 2014 and the projections from 2015 to 2019.
2) Decadal predictions
The Decadal Climate Prediction Project (DCPP) is a CMIP6-endorsed project that produces decadal prediction (also referred to as DP) data. DCPP produced a set of retrospective decadal climate simulations (or hindcasts) (Boer et al. 2016) initialized from estimates of the contemporaneous observations once a year since 1960 and run for the following 10 years. The latest initialization year common to all models is 2015 (listed in Table S2), but some models that contribute to real-time prediction exercises have extended the predictions to start dates beyond 2020 (Hermanson et al. 2022).
c. Constraining and evaluation of the ensembles
Four constraining methods are applied in this study. The first method, ClimWIP, is thoroughly explained by Brunner et al. (2020b) and Merrifield et al. (2020). It weights each member of the multimodel ensemble according to 1) its performance against some characteristics of the observed climate and 2) its independence from the rest of the CMIP6 members over the historical period. The second method selects a number of CMIP6 ensemble members based on how they rank against spatial patterns of observed ocean temperature anomalies reflecting different modes of climate variability (hence referred to as OBSfield-selection in this study) according to a metric estimated over a common period (Mahmood et al. 2022). The number of selected members is set to N = 30 following Mahmood et al. (2022), who show that the sensitivity between taking the best 10, 30, or 50 members is relatively small and that selecting 30 members represents a good compromise between having a large enough ensemble size and avoiding the selection of members that differ substantially from the constraining reference. The third method also selects members from the CMIP6 multimodel ensemble, but in this case, they are chosen according to how they rank against initialized decadal predictions (hence referred to as DPfield-selection) over a future contemporaneous period (Mahmood et al. 2021). The fourth method is conceptually similar to the third, but the CMIP6 ensemble members are instead selected based on the mean absolute error against the temporal evolution of the area-averaged decadal prediction ensemble mean (hence referred to as DPtime-selection). This selection is proposed in Befort et al. (2020) and takes the best 35 members, as their assessment of 25, 35, and 45 members leads to similar conclusions. We include a slight change to the methodology by using sea surface temperatures instead of surface air temperatures to make it more comparable to the other selection methods.
The last three methods have been implemented with two slightly different approaches, constraining on either global or regional North Atlantic sea surface temperature (SST) anomalies, as explained below. Therefore, in this study, the comparison is done across seven different constraining approaches from the four methods: the ClimWIP method, two implementations of OBSfield-selection, two of DPfield-selection, and two of DPtime-selection.
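The performance/independence weighting idea behind ClimWIP can be sketched numerically. The following is a minimal illustration in the style of Knutti et al. (2017), assuming that generic distance metrics have already been computed; the function name, the Gaussian form of the weights, and the shape parameters used in the example are illustrative and do not reproduce the exact ClimWIP implementation (described in the appendix and in Brunner et al. 2020a):

```python
import numpy as np

def climwip_like_weights(perf_dist, indep_dist, sigma_d, sigma_s):
    """Illustrative performance/independence weights (Knutti et al. 2017 style).

    perf_dist : (n,) distance of each member to the observations
    indep_dist: (n, n) symmetric member-to-member distances (zero diagonal)
    sigma_d, sigma_s : shape parameters controlling how sharply the weights
        respond to the performance and independence distances
    """
    d = np.asarray(perf_dist, dtype=float)
    s = np.asarray(indep_dist, dtype=float)
    performance = np.exp(-(d / sigma_d) ** 2)
    # The zero diagonal contributes exp(0) = 1, i.e., the "1 +" term of the
    # independence denominator 1 + sum_{j != i} exp(-(S_ij / sigma_s)^2).
    independence = np.exp(-(s / sigma_s) ** 2).sum(axis=1)
    w = performance / independence
    return w / w.sum()  # normalize so the weights sum to one
```

A member close to the observations and far from all other members receives a large weight; near-duplicate members share their weight through the larger independence denominator.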
We define here some common terminology to refer to aspects of the methods:
- Start date: The methods are applied at a point in the past, referred to as the start date. The start date refers to the first day of January of the first year of the 20-yr period for which a future estimation is sought. The start date will be referred to by that year (labeled in blue in Fig. 1a).
- Constraining period: The constraining methods also share the concept of a constraining period, which is the period over which the weight/rank estimations are performed (represented in green in Fig. 1a). This period is different for the four methods, as explained in the following paragraphs.
The 20-yr period that begins at the start date (in blue in Fig. 1) comprises the years for which the constrained projections are obtained and evaluated. The focus of this study is on the summer season; therefore, the 20-yr projection period is a 20-yr average of the JJA seasonal average, i.e., a single temperature map per start date.
ClimWIP computes the performance and independence metrics (based on TAS and SLP) as distances against the observations and the rest of the members, respectively. The estimate is obtained over the constraining period, which uses data from the 35 years prior to the start date (Brunner et al. 2020a). This implementation differs from the one by Brunner et al. (2020b) and Merrifield et al. (2020) in that the constraining period and the 20-yr constrained projections are contiguous here. The constraining period in the OBSfield-selection method is the 9-yr period prior to the start date, following the example of Mahmood et al. (2022), who test different periods and settle for one of the periods that give better results. All members are ranked according to the uncentered pattern correlation of their 9-yr mean SST with the observed 9-yr mean field (see Fig. 1a). The DPfield-selection is conceptually similar, but the constraining period corresponds to the 9 years from the start date. In this case, the average SST anomaly fields of the CMIP6 members in this period are correlated with the 9-yr average of the DP ensemble mean (see Fig. 1a) of the DP initialized shortly before or at the start date (each DCPP decadal forecast system uses a slightly different initialization date). Hence, while the OBSfield-selection method uses observations up to the start date, the DPfield-selection method takes advantage of the future information with respect to the start date provided by the decadal predictions. The DPtime-selection works exactly like the DPfield-selection, but in this case, the members are ranked according to the mean absolute error of their area-averaged 9-yr SST anomaly time series against the 9-yr DP ensemble mean SST anomaly time series. As mentioned above, the last three methods have been implemented in this study using two different approaches: the field/area used to compute the uncentered pattern correlation (see Fig. 1c) or the time series mean absolute error (see Fig. 1d) of the 9-yr mean SSTs was either the global ocean (referred to as Glob) or the North Atlantic region (NAtl, defined as 0°–80°W, 0°–60°N) in the OBSfield and DPfield methods, or the North Atlantic subpolar gyre (SPG, defined as 20°–50°W, 45°–60°N) in the DPtime method. These choices are motivated by the known drivers of summer TAS in the Mediterranean (Mariotti and Dell’Aquila 2012; Sutton and Dong 2012; Hermanson et al. 2014; Ghosh et al. 2017) and by the added value found in previous constraining studies (Befort et al. 2020; Mahmood et al. 2021, 2022). For a more in-depth explanation of the ClimWIP, OBSfield-selection, DPfield-selection, and DPtime-selection methods, the reader is referred to the appendix.
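As an illustration of the ranking step shared by the OBSfield- and DPfield-selection methods, the following sketch scores members by uncentered pattern correlation against a reference SST anomaly field and keeps the best N; the function names and the NaN handling for masked land points are our own assumptions, not the operational implementation:

```python
import numpy as np

def uncentered_pattern_corr(field_a, field_b):
    """Uncentered (no mean removal) spatial correlation of two anomaly
    fields; NaNs (e.g., masked land points for SST) are skipped."""
    a, b = np.ravel(field_a), np.ravel(field_b)
    ok = ~np.isnan(a) & ~np.isnan(b)
    a, b = a[ok], b[ok]
    return float(np.sum(a * b) / np.sqrt(np.sum(a ** 2) * np.sum(b ** 2)))

def select_best_members(member_fields, reference_field, n_select=30):
    """Keep the n_select members whose mean SST anomaly pattern correlates
    best with the reference (the observed field for OBSfield, the DP
    ensemble-mean field for DPfield)."""
    scores = np.array([uncentered_pattern_corr(f, reference_field)
                       for f in member_fields])
    return np.argsort(scores)[::-1][:n_select]  # member indices, best first
```

The uncentered form is used because the fields are already anomalies; removing the spatial mean would discard the large-scale anomaly sign that the selection is meant to exploit.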
This study aims to compare these seven different constraining approaches to assess whether and to what extent they can improve the quality of 20-yr climate projections. Please note that the goal of this study is not to identify the best-performing method but rather to demonstrate the importance of evaluating the forecast quality of the constraining methods before using them to derive near-term climate information. To evaluate the quality of the constrained projections, we estimate the new ensembles on a rolling basis starting every year from 1970. In other words, for every year from 1970 to 2000, either new weights or a new ensemble selection have been estimated and a constrained 20-yr ensemble has been obtained. This is another important difference from the way the ClimWIP method has been applied in the past, while it is similar to the evaluation of the constraining methods developed in Mahmood et al. (2021, 2022) and in Befort et al. (2020). Over the historical period, the results are compared with an observational reference in an objective evaluation process. To allow for a consistent intercomparison of the evaluation results across the seven approaches, the same CMIP6 multimodel ensemble, periods, and reference observations have been used (see Table 1). The data have been interpolated to a common 1.5° × 1.5° grid using a first-order conservative interpolation method, which is a compromise among the CMIP6 model resolutions that range between 0.5° and 2.8°. The land masks for SSTs are made of all grid points where at least one model or observational dataset is masked. The 31 start dates (period 1970–2000) over which the constraining methods are applied are referred to as the evaluation period throughout the text. The last start date is 2000 because the corresponding 20-yr period into the future ends in 2019, the last date for which all observational datasets were available. Climate anomalies have been calculated using 1981–2010 as the reference period.
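The rolling evaluation can be summarized schematically as follows. Here `constrain_fn` stands for any of the weighting/selection procedures (a selection being weights of 0 or 1), and the flat (members, years) data layout is a simplifying assumption for illustration:

```python
import numpy as np

def rolling_constrained_means(proj, years, start_years, constrain_fn, window=20):
    """For each start date, derive member weights from data before the start
    date and compute the constrained weighted mean over the following
    20-yr window. `proj` has shape (members, years)."""
    out = {}
    for sy in start_years:
        i = int(np.where(years == sy)[0][0])
        weights = constrain_fn(proj[:, :i])           # constraining period: the past
        segment = proj[:, i:i + window].mean(axis=1)  # 20-yr mean per member
        out[sy] = float(np.sum(weights * segment))    # weighted ensemble mean
    return out
```

Repeating this for every start date from 1970 to 2000 yields the 31-value time series of constrained 20-yr means that is then compared against the observational reference.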
We compared the time series of 20-yr averages for all start dates against the ERA5 reanalysis, which we use as the observational reference, using a number of metrics (an illustration of how the evaluation was performed can be seen in Fig. 1b).
To illustrate the applicability of the methods to future near-term projections, we constrain the CMIP6 ensemble for the 20-yr period 2015–34. The year 2015 is the last feasible start date due to the limited availability of a large ensemble of more recent multimodel decadal predictions. This is because only a small subset of the DCPP systems has made its decadal predictions initialized after 2015 publicly available.
We used the Earth System Model Evaluation Tool (ESMValTool; Righi et al. 2020) to load and process all the data in a standardized way and to ensure that all the necessary checks were passed. The evaluation of the results has been done using functions from the R package s2dv (Manubens et al. 2018) within the ESMValTool framework.
d. Evaluation metrics
Various forecast quality metrics are used to evaluate the constraining methods. The metrics are calculated individually for each grid point along the time series defined by the start dates from 1970 to 2000. A test is conducted using the longer 1961–2000 evaluation period to determine the sensitivity of the evaluation results to the time period used.
We computed the temporal anomaly correlation coefficient (ACC) of the projections against the observations. The correlation difference of the constrained projections minus that of the unconstrained ones (ΔACC) is then estimated. The statistical significance of ΔACC is computed with a two-sided test for equality of dependent correlation coefficients (Steiger 1980; Siegert et al. 2017) using effective degrees of freedom that account for the autocorrelation of the time series (Zwiers and von Storch 1995).
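For reference, a minimal sketch of the ACC together with an effective sample size that discounts lag-1 autocorrelation, in the spirit of Zwiers and von Storch (1995); the function names are illustrative, and the Steiger (1980) test for ΔACC is not reproduced here:

```python
import numpy as np

def lag1_autocorr(x):
    """Lag-1 autocorrelation of a 1-D time series."""
    x = np.asarray(x, dtype=float)
    a = x[:-1] - x[:-1].mean()
    b = x[1:] - x[1:].mean()
    return float(np.sum(a * b) / np.sqrt(np.sum(a ** 2) * np.sum(b ** 2)))

def acc_with_neff(forecast, obs):
    """Anomaly correlation coefficient plus an effective sample size
    n_eff = n (1 - r1f*r1o) / (1 + r1f*r1o), which would replace n in
    the significance test to account for serial correlation."""
    f = np.asarray(forecast, dtype=float) - np.mean(forecast)
    o = np.asarray(obs, dtype=float) - np.mean(obs)
    acc = float(np.sum(f * o) / np.sqrt(np.sum(f ** 2) * np.sum(o ** 2)))
    r1 = lag1_autocorr(forecast) * lag1_autocorr(obs)
    neff = len(f) * (1 - r1) / (1 + r1)
    return acc, neff
```

With strongly trended 20-yr means the autocorrelation product is large, so n_eff is much smaller than the 31 nominal start dates, which is why the ΔACC significance is assessed with effective degrees of freedom.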
The residual correlation (ResCor) (Smith et al. 2019) is computed to determine whether the constrained projections capture any of the observed variability that is not already captured by the raw projections. The observational reference and the constrained ensemble are regressed against the full ensemble and their residuals correlated against each other. The statistical significance of the ResCor has been computed with a two-sided t test (Wilks 2011) using effective degrees of freedom to account for the autocorrelation of the time series (Zwiers and von Storch 1995).
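The residual correlation can be sketched as follows, assuming simple linear regression against the full-ensemble mean; the significance testing with effective degrees of freedom is omitted, and the function name is illustrative:

```python
import numpy as np

def residual_correlation(constrained, obs, full_ensemble_mean):
    """ResCor in the spirit of Smith et al. (2019): remove, by linear
    regression, the part of both series explained by the full-ensemble
    mean, then correlate the residuals."""
    def residual(y, x):
        slope, intercept = np.polyfit(x, y, 1)
        return y - (slope * x + intercept)

    x = np.asarray(full_ensemble_mean, dtype=float)
    rc = residual(np.asarray(constrained, dtype=float), x)
    ro = residual(np.asarray(obs, dtype=float), x)
    return float(np.corrcoef(rc, ro)[0, 1])
```

A positive ResCor indicates that the constrained ensemble captures observed variability (e.g., internal variability around the forced trend) that the raw projections do not.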
To assess the improvement of multicategorical probabilistic projections in the constrained ensemble, we used the ranked probability skill score (RPSS) with three categories (below normal, normal, and above normal) (Wilks 2011). The score of the full CMIP6 ensemble is used as a reference so that positive values correspond to an improvement of the constrained with respect to the full ensemble. The thresholds for the three categories have been computed using the whole evaluation period separately for the ensembles and the observations. The statistical significance is determined through a random walk test with the 95% confidence level (DelSole and Tippett 2016). To evaluate the contribution of reliability and resolution in the RPSS results, we decomposed the Brier score (Ferro 2007) in each of the three categories into reliability and resolution as described by Murphy (1973). Reliability is a measure of how under/overconfident an ensemble is, and it improves as it approaches 0. Resolution measures the degree to which the ensemble distribution follows the observational reference, and, contrary to the reliability, it is positively oriented.
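A minimal sketch of the three-category ranked probability score underlying the RPSS; the tercile thresholds are assumed to be precomputed from the evaluation period, the function names are illustrative, and the skill score follows as 1 minus the ratio of the constrained to the full-ensemble RPS:

```python
import numpy as np

def tercile_probs(ens, lower, upper):
    """Probability the ensemble assigns to the below-normal, normal, and
    above-normal categories defined by the two tercile thresholds."""
    ens = np.asarray(ens, dtype=float)
    p_below = np.mean(ens < lower)
    p_above = np.mean(ens > upper)
    return np.array([p_below, 1.0 - p_below - p_above, p_above])

def rps(prob_list, obs_cat_list):
    """Ranked probability score: mean squared distance between the
    cumulative forecast and observed category probabilities."""
    total = 0.0
    for p, k in zip(prob_list, obs_cat_list):
        o = np.zeros(3)
        o[k] = 1.0  # observed category as a one-hot vector
        total += np.sum((np.cumsum(p) - np.cumsum(o)) ** 2)
    return total / len(prob_list)

# RPSS of the constrained relative to the full ensemble would then be
# rpss = 1 - rps_constrained / rps_full, positive when constraining helps.
```

Because the score uses cumulative probabilities, forecasts that miss by one category are penalized less than forecasts that miss by two, which is the distinguishing property of the RPS over a plain multicategory Brier score.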
3. Evaluation of the constrained estimates over the observational period
Figure 2 shows the quality metrics defined in section 2d over the Mediterranean region (dashed box bounded by 10°W–40°E and 30°–45°N) for the seven constraining approaches: ClimWIP; OBSfield- and DPfield-selection with the global and North Atlantic constraints; and DPtime-selection with the global and North Atlantic subpolar gyre constraints.
The ACC and ΔACC maps show, respectively, high and significant correlation between the constrained ensemble mean and the observations, and a small improvement of all the constraining approaches with respect to the full multimodel ensemble. However, the ΔACC values are not statistically significantly different from zero. The high ACCs are easily explained by the forced warming trend during the evaluation period (1970–2000) (Smith et al. 2019). The saturation of the ACC (very high values close to the upper bound of 1) for every approach and for the full ensemble, due to the trend, makes the other quality metrics more informative.
The ResCor and RPSS results are far more heterogeneous across the different methods. A summary of the results follows, before their interpretation:
- ClimWIP: The method shows a relatively large area of the Mediterranean region with significant deterioration of skill as measured by ResCor [Fig. 2(3)]. ClimWIP improves the near-term climate information with respect to the full ensemble in terms of probabilistic estimates throughout the western and central part of the region, with a few points showing statistically significant improvement [Fig. 2(4)]. This suggests that the method can improve the distribution of the categorical probability information (of 20-yr projected change), while it degrades the representation of the residual variability around the trend.
- OBSfield-selection: The two approaches (Glob and NAtl) show a general improvement with respect to the full ensemble when assessing ResCor [Figs. 2(7),(11)]. Overall, the Glob selection generates a constrained ensemble that explains the residual variability in the Mediterranean region significantly better than the full ensemble in a larger area and to a greater extent than the NAtl selection. The RPSS also shows widespread improvements for the OBSfield-selections, with the exception of the Iberian Peninsula [Figs. 2(8),(12)]. In agreement with ResCor, the Glob OBSfield-selection provides a larger area with significant improvement in the RPSS.
- DPfield-selection: The results are similar for both the Glob and NAtl selection approaches. It is important to stress that these selections produce some areas with significant ResCor deterioration [Figs. 2(15),(19)]. The Glob DPfield-selection shows the largest RPSS values of all the approaches, with improvements with respect to the full CMIP6 ensemble almost everywhere [Fig. 2(20)]. The NAtl DPfield-selection shows generally smaller RPSS in the region.
- DPtime-selection:
  - SPG: There is little significant improvement in ResCor in the region [Fig. 2(23)]. In contrast, the RPSS map shows a general enhancement in quality, especially over the sea, but with many nonsignificant points over land [Fig. 2(24)].
  - Glob: The results show some significant improvement in ResCor over most of the Mediterranean region, with some scattered deterioration in sea regions and the Iberian Peninsula [Fig. 2(27)]. The RPSS map shows a generalized, although not significant, deterioration over the whole region [Fig. 2(28)].
Overall, the ClimWIP method shows the smallest effect on the constrained near-term climate information compared to the full-ensemble results. This behavior is very different from that of the selection methods, whose constrained ensembles tend to show heterogeneous regions of significant enhancement and deterioration in comparison with the full CMIP6 ensemble. The Glob and NAtl OBSfield-selection and the Glob DPtime-selection methods have similar and significantly better ResCor results, suggesting that both are able to improve the representation of the variability that is not driven by the radiative forcing. Nevertheless, the quality measured in terms of probabilistic forecasts is poorer in the Glob DPtime-selection approach, and it is significantly better for the DPfield-selection method, which performed worse in the ResCor assessment.
To gain more insight into the results of the OBSfield- and DPfield-selection methods, we analyze which members are selected for each start date over the whole evaluation period. We first look at the spatial SST field of the constraining reference (the observations or the decadal prediction ensemble mean, depending on the method) and the two members with the best pattern correlation, as displayed in Fig. 3. For the DPfield-selection, the SST field of the reference is very homogeneous in space, especially at the beginning and end of the evaluation period, i.e., negative close to start date 1970 and positive close to start date 2000. In contrast, the OBSfield-selection reference has a more heterogeneous SST structure, and there is a less obvious trend from the beginning to the end of the evaluation period. This suggests that the DPfield-selection approaches are more driven by the warming trend than the OBSfield-selection approaches. This can be explained by the fact that the DPfield-selection uses the multimodel mean of the decadal simulations as a reference, while the OBSfield-selection uses SST fields from a single observational dataset; the former presents a smoother SST field than the latter.
To better understand the OBSfield- and DPfield-selections, the exact member selection over the evaluation period is plotted in Fig. 4, which shows a large proportion of CanESM5 selections (especially for Glob DPfield, see the blue box in Fig. 4a). In principle, this would go against the idea of selecting members according to the internal variability (Mahmood et al. 2021, 2022), as no single model should dominate above others. CanESM5 is known for having the largest equilibrium climate sensitivity (ECS) within the CMIP6 ensemble (Meehl et al. 2020; Schlund et al. 2020), which suggests that the trend plays a large role in the selection process of this approach. The member selections from models with ECS values above the likely range proposed in the Sixth Assessment Report of the Intergovernmental Panel on Climate Change (IPCC 2023) are highlighted in Fig. 4. To quantify the overselection of high-ECS members, the ratio of above-likely-range selections is shown at the top of the figure, zero meaning that the ratio is the same as in the full CMIP6 ensemble and values above (below) zero meaning that the ratio of high-ECS members has increased (decreased) in the selection. This information tells us that the Glob DPfield-selection tends to overselect models with very high climate sensitivity. The influence of the trend also depends on the region used to constrain the ensemble. The NAtl is relatively less affected by the trend, as the contributions of internal variability and forcing are more similar there than in the Glob case (Smith et al. 2020), and, as seen in Fig. S1, the NAtl selections are, relative to their Glob counterparts, more widely spread across models.
A similar analysis can be conducted for the DPtime-selection. Figure 6 shows the SST time series of the reference and of the selected and unselected CMIP6 members for the first, middle, and last start dates. It can be seen how the spread between CMIP6 members changes according to the distance from the start date to the center of the climatology period (1981–2010). Therefore, the reference time series for start dates closer to the year 1996 should have more CMIP6 ensemble members “near” the reference, unlike start dates farther away. Conceptually, this could mean that the strictness of the selection of the best 35 members changes along the evaluation period. Figure 5e shows a slight reduction in the mean absolute errors of the best 35 members over the 1990s, which could be due to the mentioned reduction of spread in the middle of the climatology period, but it also coincides with a period of higher SST trends. The selections from the DPtime method can be seen in Fig. 7; contrary to the OBSfield- and DPfield-selection methods, the high-ECS models are never overrepresented.
Figure 8 shows the distribution of weights for each member along all start dates and helps visualize how ClimWIP modifies the CMIP6 ensemble. The weights are quite homogeneous along the ensemble, but some darker horizontal stripes indicate that some members usually receive more weight. These weight maxima are normally associated with the independence weight (see Fig. S3 for a weight decomposition), although performance can also be the dominant source of weight in some cases. Figure S3a shows some years in the evaluation period with particularly homogeneous performance weights. This can be due to the caveats in finding optimal shape parameters [see Eq. (A1) in the appendix] already discussed in Merrifield et al. (2020).
Apart from the metric used to constrain the members, many other parameters might influence the results of the evaluation; therefore, additional sensitivity tests have been conducted. We tested the role of observational uncertainty by using Berkeley Earth as the observational reference instead of ERA5 and obtained similar results (not shown). Finally, we changed the period over which the constraints are verified from 1970–2000 to 1961–2000, which is possible because there are SST observations prior to 1960 and decadal predictions are available with start dates from 1961. The quality patterns are overall maintained, with the exception of the OBSfield-selection RPSS, which shows noticeably lower skill over the longer 40-yr period, and the Glob DPtime-selection RPSS, which sees a slight improvement (Fig. S4). The lower quality could be a consequence of lower-quality observational data in the earlier decades in parts of the Mediterranean region (Cornes et al. 2018; Harris et al. 2020) or of the OBSfield-selection method failing to capture the TAS behavior in 1961–69. Due to these uncertainties related to the earlier data, we decided to focus the evaluation on the later period from 1970 onward.
The overall results of the evaluation show that different constraining approaches yield projections of different quality in the Mediterranean region. This highlights the potential, but also the limitations, of constraining projections in the Mediterranean region when an objective evaluation strategy is followed. Our approach of verifying the projections over the observational period underscores the absolute need to evaluate constrained projections, particularly when climate-vulnerable users are expected to make decisions based on them.
4. Constrained future projections
We have obtained updated near-term future projections with each of the constraining approaches. To interpret the projections, it is essential to consider the evaluation of the approaches, as it informs us about the robustness and trustworthiness of the information. We also divide the future warming distributions into two subregions, the western and eastern Mediterranean (10°W–15°E and 15°–40°E, respectively), because the quality estimates show some spatial heterogeneities across these regions. Figure 9 illustrates the cumulative distributions of the 2015–34 20-yr average near-surface temperature from the full CMIP6 ensemble (blue) and from each of the constrained ensembles (red). The projections are shown separately for both subregions, and the RPSS and ResCor for the average temperature estimates over the validation period are displayed.
ClimWIP modifies the distribution only slightly and keeps the highlighted quantiles almost unchanged from their values in the full CMIP6 ensemble (see Figs. 9a,b). Since it uses weights rather than a selection of members, it can preserve the original distribution range. The selection approaches, instead, tend to result in a narrower temperature range as they can exclude members at the extremes of the distribution and thereby reduce the uncertainty range of the projections.
A general feature of the four OBS/DPfield approaches is that the temperature range is shifted toward higher values after choosing the best 30 members (see Figs. 9c–j). The NAtl OBSfield-selection removes the lowest quartile of the raw ensemble in both the western and eastern Mediterranean. The NAtl DPfield-selection shows a range focused toward even higher temperatures. The selections based on global SSTs retain the lower part of the distribution better than the NAtl selections, although the Glob DPfield-selection produces the largest increase in mean warming in both Mediterranean subregions.
In contrast, the DPtime approaches tend to constrain the highest percentiles of the distribution to lower temperature values than the full CMIP6 ensemble (see Figs. 9k–n). The constrained 75th and 90th percentiles always see a decline in temperature, while the median is practically the same as the original one. The lower percentiles cut off the coldest members of the full distribution.
Some constraining approaches provide statistically significant improvements in the evaluation period (illustrated by the RPSS and ResCor values in bold in Fig. 9). The residual correlation indicates how well the constrained ensemble captures variability around the estimated trend. The OBSfield-selection thus provides a good representation of this variability in the historical period over the eastern Mediterranean (Figs. 9d,f), as does the Glob DPtime-selection in the same region. In terms of the RPSS, the Glob DPfield-selection, with RPSSs above 0.2 for both the east and the west, has the highest quality. The NAtl DPfield- and OBSfield-selections also have a significantly positive RPSS in the western and eastern basins, respectively, but with lower values. The constrained ensemble with the highest RPSS is also the one that produces the largest increase in warming compared to the original CMIP6 ensemble. Note that Figs. 9f and 9j show two constraints (Glob OBSfield- and DPfield-selection) with similar quality in the eastern Mediterranean that lead to very different warming distributions, suggesting that even with constraints that improve upon the full ensemble, some uncertainty remains about the optimal methodology.
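The residual correlation can be illustrated with a minimal sketch. This is not the exact implementation used in the study (which relies on the s2dv package); it assumes a simple recipe in which the forced signal is approximated by a linear trend removed from both series before correlating:

```python
import numpy as np

def residual_correlation(forecast, obs):
    """Correlate forecast and observed residuals after removing a
    linear trend from each series (a simple stand-in for the
    estimated forced signal)."""
    t = np.arange(len(obs))
    res_f = forecast - np.polyval(np.polyfit(t, forecast, 1), t)
    res_o = obs - np.polyval(np.polyfit(t, obs, 1), t)
    return np.corrcoef(res_f, res_o)[0, 1]
```

A high value indicates that the constrained ensemble mean varies in phase with the observations once the common trend is discounted, which is the property the selection methods are designed to deliver.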
An analysis of the selected members from specific models can help in understanding the results of the selection methods. For instance, in the Glob DPfield-selection, 23 out of the 30 selected members are from models with ECSs above the IPCC AR6 likely range (above 4°C), and 19 of those are in the low-likelihood high-warming range (above 5°C) (Meehl et al. 2020; IPCC 2023). Only 10 out of 30 members should be above the likely range if the ratio of models above the likely range were maintained from the full CMIP6 ensemble. This can explain why the DPfield-selection produces warmer results than the full ensemble. Additionally, as presented above, the use of the decadal prediction ensemble mean implies that much of the SST variability is smoothed out (this can be observed in the SST maps from the decadal predictions in Fig. 3a), making the warming signal of the single CMIP6 members relatively more important than the variability in the selection process.
5. Discussion
One of the novel aspects of this study is the performance comparison against observations of different approaches when applying constraints to climate projections. A common framework has been developed to assess the quality of the methods with a systematic comparison against observational data over the historical period.
The overall evaluation shows that the seven constraining approaches studied have different quality features in the Mediterranean region. The improvements of the constrained information with respect to the full CMIP6 ensemble highlight the potential of improving the projections of near-surface air temperature in the Mediterranean region. Nevertheless, in certain areas for some approaches, there is no gain in quality by constraining the full ensemble. The need to deliver this key piece of information to climate information users motivated the systematic evaluation against observational datasets. The climate change community has applied constraints to future projections, often without a systematic evaluation framework (Hegerl et al. 2021; Brunner et al. 2020b). Sometimes, the improvements in the constrained projections were supported by out-of-sample evaluation (Knutti et al. 2017; Brunner et al. 2019; O’Reilly et al. 2020), e.g., in a perfect-model framework where the constraint is then tested on model simulations that have not been used to derive the constraint. These out-of-sample evaluation examples can demonstrate that the methods work conceptually but not necessarily that they provide information of higher quality about the real-world climate. This is an important limitation of constraining long-term climate projections.
As our study focuses on nearer-term projections, we apply the methods and knowledge of the climate prediction community, which systematically estimates the quality of the information generated against observational references. This is an essential process to generate confidence in the climate information that is disseminated.
The first constraining method, ClimWIP, does not provide increased quality in the Mediterranean region over the observational period. It can be argued that this is because the method is optimized for, and more sensitive to, mid- to long-term trend projections [as concluded in Befort et al. (2022)] rather than near-term climate, when internal variability plays an important relative role in the climate estimates (Merrifield et al. 2020; Brunner et al. 2020a). The weights estimated by ClimWIP do not produce noticeable changes in the constrained ensemble compared to the unconstrained one, as the constrained ensemble distribution remains practically the same as when using the full ensemble (Fig. 9). This suggests that the sampling of model uncertainty (through the independence criterion) has no measurable effect on near-term climate projections for the next 20 years in the Mediterranean. The residuals of the constrained ensemble are very small and opposite to the residuals of the observed time series (not shown), meaning that the weighting is not capturing the internal variability well. These two results can explain both the very low RPSS and the deteriorated ResCor for this method. Along the lines of what is proposed in Ribes et al. (2022), a way to further explore the potential of the ClimWIP method for shorter time scales could be to combine it with other constraining approaches that do capture internal variability.
It is found that the constrained ensembles built using selection methods that take into account some aspects of climate variability phasing are better suited for the near-term future. This translates to better results in the evaluation metrics, for both the ensemble mean and probabilistic estimates.
The Glob OBSfield- and DPfield-selections are the approaches with the largest improvements in the probabilistic estimates of near-term climate. To better understand why, we decomposed the ranked probability score of the three categories (or terciles) considered in the RPSS into its resolution and reliability components (Murphy 1973). The two terms from the full CMIP6 ensemble are subtracted from the corresponding terms of the constrained ensembles to obtain the improvements from constraining (the reliability, which is negatively oriented, has been multiplied by −1 so that a positive difference corresponds to a gain in reliability). Figures S5 and S6 show that reliability drives the improvement of the probabilistic climate estimates in the constrained ensembles, especially for categories 1 (below average) and 3 (above average). This implies that constraining the ensemble improves the sampling of the uncertainty, which is what reliability measures. Resolution is spatially more heterogeneous across the region. The negative values mean that, for a reliable ensemble, the constrained information cannot discriminate between categories better than the full ensemble at those locations. The decomposition of the two terms also explains the OBSfield-selection decrease in quality in the westernmost Mediterranean region, which is due to reduced resolution in all three categories.
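The Murphy (1973) decomposition applied here can be sketched for a single category. The snippet below is an illustrative implementation, not the study's own code: it computes the reliability, resolution, and uncertainty terms of the Brier score for one binary category by binning the forecast probabilities, and the category-wise terms would then be summed to decompose the RPS:

```python
import numpy as np

def brier_decomposition(p, o, bins=5):
    """Murphy (1973) decomposition of the Brier score for one category.
    p: forecast probabilities in [0, 1]; o: binary outcomes (0/1).
    Returns (reliability, resolution, uncertainty); the Brier score
    equals reliability - resolution + uncertainty when forecasts are
    constant within each bin."""
    p, o = np.asarray(p, float), np.asarray(o, float)
    obar = o.mean()
    edges = np.linspace(0.0, 1.0, bins + 1)
    idx = np.clip(np.digitize(p, edges[1:-1]), 0, bins - 1)
    rel = res = 0.0
    for k in range(bins):
        m = idx == k
        if m.any():
            w = m.mean()                    # fraction of forecasts in bin k
            pk, ok = p[m].mean(), o[m].mean()
            rel += w * (pk - ok) ** 2       # calibration error of the bin
            res += w * (ok - obar) ** 2     # discrimination gained by the bin
    unc = obar * (1.0 - obar)
    return rel, res, unc
```

Reliability is negatively oriented (smaller is better), which is why the text multiplies its difference by −1 so that positive values indicate a gain.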
An analysis of the selection processes in the evaluation period has been conducted for the OBSfield and DPfield approaches. The results presented above, in Eq. (1) and Figs. 3–5, show that ranking members via the uncentered pattern correlation of their SST field against a reference can sometimes select members for reasons other than internal variability alone: the metric can favor members with the absolute highest (not the closest) anomaly at the grid points where the reference has the same anomaly sign. This means that constraining on start dates away from the center of the climatology period (1981–2010) will favor members with a higher ECS, as their absolute SST anomalies will be larger than the signal of other members. As seen in the results, the trend effect is larger in the Glob than in the NAtl selection, probably because the latter region is more influenced by internal variability (Smith et al. 2020). This characteristic of the methods, when the selection is based on uncentered pattern correlations, implies that results in previous studies (Mahmood et al. 2021, 2022) may not have been due to phasing of climate variability but rather constrained on signatures of both variability and the warming trend. Further research is warranted to overcome these potential caveats.
One approach to avoid the effect of the trend and the overselection of high-ECS members would be to center the Pearson correlation coefficient, therefore comparing the covariance of two fields whose area averages equal zero. In practice, this means that the terms m_ji and x_i in Eq. (1) would be replaced by their centered counterparts, m_ji − m̄_j and x_i − x̄, where the overbar denotes the area average over the constraining region.
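The difference between the two metrics can be sketched as follows. This is an illustrative implementation (the area weights and field shapes are simplified assumptions, not the study's code): with centered=False it reproduces the uncentered pattern correlation of Eq. (1), which rewards large-amplitude anomalies of the right sign; with centered=True the area mean is removed first, so only the covariance of the patterns counts:

```python
import numpy as np

def pattern_correlation(field, ref, weights=None, centered=True):
    """Spatial pattern correlation between two flattened fields,
    optionally area-weighted. centered=False keeps the mean offset
    in the score; centered=True subtracts the (weighted) area mean."""
    f, r = np.ravel(field).astype(float), np.ravel(ref).astype(float)
    w = np.ones_like(f) if weights is None else np.ravel(weights).astype(float)
    w = w / w.sum()
    if centered:
        f = f - np.sum(w * f)   # remove area-mean signal (e.g., trend)
        r = r - np.sum(w * r)
    num = np.sum(w * f * r)
    den = np.sqrt(np.sum(w * f * f) * np.sum(w * r * r))
    return num / den
```

Adding the same uniform warming offset to both fields inflates the uncentered score while leaving the centered one unchanged, which is precisely why the uncentered metric can favor high-ECS members away from the climatology center.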
The selections over the evaluation period have also been investigated for the DPtime method. Figure 7 indicates that the methodology picks a more scattered selection of members, suggesting that it might be phasing the internal variability as proposed in Befort et al. (2020). In this work, we raise the issue that performing the constraint on SST anomalies changes the intermodel standard deviation of the ensemble depending on the relative time distance between the start date and the center of the climatology period (see Fig. 6); therefore, the selection for different start dates is done on distributions with different properties. Nevertheless, no strong evidence is found to support this issue, as the values of the mean absolute error are quite constant along the evaluation period (see Figs. 5e,f). This suggests that the full CMIP6 ensemble is large enough to apply the DPtime-selection, and constrain a 35-member ensemble, in a fair way. Further work could optimize this method for the Mediterranean using different SST regions and/or different constraining metrics.
The quality of the selection methods can also be sensitive to the evaluation period chosen. We have tested two evaluation periods that include the start dates over 1961–2000 and 1970–2000. We noticed that the former period results in degraded quality for the OBSfield-selection RPSS, especially in the eastern Mediterranean (see differences between Fig. 2 and Fig. S4). Nevertheless, ResCor seems more robust to period changes, as for most approaches it gives values similar to the evaluation performed over 1970–2000. To gain insight into the evolution of the constrained and unconstrained projections along the evaluation period in smaller subregions, we assessed the eastern and western Mediterranean area-averaged 20-yr mean surface temperature time series of the full CMIP6 and constrained ensembles along with the observations (Figs. S9–S14 show the time series). We aggregated the data spatially to simplify the interpretation, although this makes the results not directly comparable to those in Fig. 2 and Fig. S4. The observed eastern Mediterranean trend only appears after 1970, as seen in Figs. S9–S11. As a consequence, the ensembles from the selection approaches do not capture this behavior, leading to a reduction in the probabilistic quality of the constrained ensemble. In the western Mediterranean, the near-surface air temperature shows, in the period 1961–70, a trend comparable to the trend after 1970 (see Figs. S12–S14). This favors the quality of approaches influenced by the trend, such as the Glob DPfield-selection, leading to improvements in the RPSS in the western region when using the longer 1961–2000 evaluation period. The difference in the RPSS between the Glob and NAtl approaches in the OBSfield- and DPfield-selections could also be explained by the higher influence of the trend on the Glob SSTs, as previously discussed.
The results including start dates for the period 1961–70 might be more uncertain due to lower data quality in parts of the Mediterranean in the 1960s (Cornes et al. 2018; Harris et al. 2020).
Finally, we have also provided warming projections for the near-term future (2015–34) in the Mediterranean region to show how the results vary for each constraining approach and to give an idea of the expected evolution of temperature with respect to the 1981–2010 climatology. The evaluation of the methods showed that not all approaches have increased quality with respect to the full CMIP6 ensemble in estimating surface temperature changes. Therefore, all projections should be treated carefully (this also applies to the projections of the full ensemble) and interpreted with the evaluation estimates in mind (Donat et al. 2023). Furthermore, we have identified issues of the different approaches that users should be aware of before trusting the future projections of any of the methods. The OBSfield and DPfield methods always produce warmer temperatures, in accordance with the regional constraints conducted in Ribes et al. (2022), while the DPtime constraint lowers the high percentiles of the distribution, more in line with the results presented in Cos et al. (2022). The Glob DPfield-selection shows the strongest future warming, not only in its median value but across the whole distribution, in line with the influence that the warming trend has on this approach. This constrained ensemble takes a large proportion of members from high-ECS models (Meehl et al. 2020; Hausfather 2019). According to the Glob DPfield-selection, the summer temperature projections for the Mediterranean region for the period 2015–34 are 1.7°C (west) and 1.8°C (east) warmer than the climatological period 1981–2010 (with a 90% range spanning from 1.1° to 2.2°C in the west and from 1.2° to 2.3°C in the east).
On the other side of the spectrum, the projections according to the Glob DPtime-selection are 1.4°C (west) and 1.5°C (east) warmer than the climatological period 1981–2010 (with a 90% range spanning from 1.1° to 1.8°C in the west and from 1.1° to 1.9°C in the east). The full CMIP6 ensemble gives warmings of 1.4°C (0.8°–2.0°C 90% range) in the western region and 1.5°C (0.9°–2.0°C 90% range) in the eastern region.
These results show that a full understanding of the methods is needed to provide credible temperature projections over the Mediterranean region when applying constraining methodologies to large multimodel ensembles. This task is within the scope of climate service actors, for whom the intercomparison of the seven approaches and the proposed evaluation framework offer a robust setting. Caution is advised to anyone who wants to apply or use results from this kind of method, as the structural issues must be understood and ideally solved beforehand. The authors hope that this work has also provided some insight into the potential issues of the methods explored and an enhanced understanding of why they constrain a multimodel ensemble the way they do.
6. Summary and conclusions
We have compared different methods for constraining near-term CMIP6 projections of summer near-surface air temperature over the Mediterranean area. We assessed the performance of the DPtime-selection, DPfield-selection, and OBSfield-selection methods (Befort et al. 2020; Mahmood et al. 2021, 2022), which aim to phase the constrained ensemble with the state of climate variability at the start date, and the ClimWIP method (Brunner et al. 2020a; Merrifield et al. 2020), which weights each member of the ensemble according to its past performance and its independence within the ensemble. The comparison has been carried out with a new framework inspired by the forecast quality assessment of decadal predictions. The framework yields quality estimates of the constrained projection approaches by producing 20-yr temperature estimates every year from 1970 to 2000 and computing quality metrics against observational references over the corresponding evaluation period.
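The hindcast-style loop at the core of the framework can be sketched as follows. This skeleton is an assumption-laden simplification (annual series indexed from a hypothetical first year, a generic `select_members` callback standing in for any of the constraints), not the actual evaluation code:

```python
import numpy as np

def evaluate_constraint(select_members, ensemble, obs,
                        start_years=range(1970, 2001), horizon=20):
    """Hindcast-style evaluation skeleton: for each start year, apply
    the constraining function, average the selected members over the
    next `horizon` years, and collect the matching observed mean.
    ensemble: dict member -> 1D array of annual values from year0.
    select_members(start_year) -> list of member keys (the constraint).
    Returns (constrained_means, observed_means)."""
    year0 = 1850  # assumed first year of the annual series
    fc, ob = [], []
    for sy in start_years:
        i0, i1 = sy - year0, sy - year0 + horizon
        members = select_members(sy)
        fc.append(np.mean([ensemble[m][i0:i1].mean() for m in members]))
        ob.append(obs[i0:i1].mean())
    return np.array(fc), np.array(ob)
```

The two returned series are what the quality metrics (RPSS, ResCor, and others) are then computed on.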
The evaluation results show some differences between the constraining methods. The improvement and deterioration of climate information with respect to the quality measures of the full CMIP6 ensemble show strong spatial heterogeneity. The OBSfield-, DPfield-, and DPtime-selection approaches provide significant improvements in the quality of the forecast in scattered regions of the Mediterranean basin in terms of variability and probabilistic information, although the ability of the constraints to improve the accuracy of the information depends on both the region and the method. The ClimWIP method generally shows small quality differences with respect to the full CMIP6 ensemble, as the weights tend to be distributed homogeneously across the multimodel ensemble. An analysis of the mechanisms of the constraining methods shows that the selection methods based on SST fields (OBSfield and DPfield) are not only phasing variability; the forced trend also plays an important role in the selection. In contrast, the method based on SST anomaly time series (DPtime) seems not to be as affected by this forced trend. The analysis also shows that there are important sensitivities to some of the parameters in the selection methods and that there might be issues with the constraining metrics that need to be resolved before gaining confidence in the robustness of the constraining methods.
Despite the caveats of the seven analyzed constraining approaches, there is potential to improve near-term climate projections, as robust improvements were found. This study highlights the need to optimize and better understand these methods for the target region and variable of interest. More importantly, it underscores the need to evaluate the approaches through retrospective assessments against observational references. This is one way to generate confidence in near-term climate estimates.
Acknowledgments.
We wish to thank all those who have provided the data used for this work, and M. Samso and P. A. Bretonniere for their data support. The discussions with B. Solaraju, R. Bilbao, P. Ortega, C. Delgado-Torres, A. Weisheimer, C. O’Reilly, and J. Mindlin are also gratefully acknowledged. We acknowledge the development and support of the s2dv and ESMValTool packages, especially by S. Loosveldt-Tomas, A. Ho, and C. Delgado-Torres. The work of the two anonymous reviewers has also greatly contributed to the quality of this text. This work was partly supported by the European Commission Horizon Europe projects ASPECT (Grant 101081460) and PATHFINDER (Grant PID2021-127943NB-I00 funded by MICIU/AEI/10.13039/501100011033 and by ERDF, EU). Dr. Markus G. Donat is grateful for the kind support through the AXA Research Fund. Dr. Raül Marcos-Matamoros is a Serra Húnter fellow.
Data availability statement.
All the data used are publicly available or restricted to the signed-up users of the C3S CDS portal. The observational data used are obtained from Berkeley Earth (https://berkeleyearth.org/data/) and HadSLP (https://doi.org/10.1175/JCLI3937.1). CMIP data: all the CMIP6 datasets were downloaded from the Earth System Grid Federation (ESGF). The models used are listed in Tables S1 and S2. The tool used for the diagnostics (ESMValTool) can be found at https://github.com/ESMValGroup/. ESMValTool and ESMValCore are developed on the GitHub repositories available at https://github.com/ESMValGroup. The evaluation metrics were run with the s2dv package (https://cran.r-project.org/web/packages/s2dv/index.html). The software developed and used to make the calculations in this study is available at https://doi.org/10.5281/zenodo.11160789.
APPENDIX
Detailed Information about the Application of the Constraining Methods
a. Performance and independence weighting method (ClimWIP)
This method aims at estimating weights for each member of the multimodel ensemble according to how well it resembles the historical climate and how independent it is from the rest of the members. The first criterion measures the historical performance of the members (as a distance between the observed and simulated fields), with the goal of obtaining a better representation of the future TAS; the second criterion measures member independence within the ensemble, the goal being that the resulting distribution better samples the modeling uncertainty by reducing the overrepresentation of some model classes. The performance diagnostics are restricted to the target region, as this paper focuses on the Mediterranean region in summer (Brunner et al. 2019). In contrast to the three selection methods, weights are computed on summer means for both the constraining and the evaluation phases.
Our implementation differs in some respects from those of Brunner et al. (2020a) and Merrifield et al. (2020). In our evaluation framework, we applied this method from 1970 to 2000 by taking the 35 years prior to each start date and computing the diagnostics, the shape parameters, and the weights over that time period. Afterward, the 20-yr mean is computed for TAS, and the resulting time series is compared to the observed running mean.
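The combined performance-independence weighting can be sketched in the spirit of Knutti et al. (2017) and Brunner et al. (2020a). The snippet below is an illustrative simplification (scalar distances instead of multivariate diagnostics), not the ClimWIP code itself:

```python
import numpy as np

def climwip_weights(D, S, sigma_d, sigma_s):
    """Performance-independence weights in the spirit of ClimWIP.
    D: (n,) performance distances of each member to the observations.
    S: (n, n) symmetric intermember distance matrix.
    sigma_d, sigma_s: shape parameters controlling the strength of the
    performance and independence criteria [cf. Eq. (A1)]."""
    D, S = np.asarray(D, float), np.asarray(S, float)
    perf = np.exp(-(D / sigma_d) ** 2)            # reward low obs distance
    sim = np.exp(-(S / sigma_s) ** 2)             # similarity to other members
    np.fill_diagonal(sim, 0.0)                    # exclude self-similarity
    indep = 1.0 / (1.0 + sim.sum(axis=1))         # penalize near-duplicates
    w = perf * indep
    return w / w.sum()                            # normalize to sum to one
```

Near-duplicate members share their weight through the independence term, while a poorly performing member is down-weighted regardless of its independence; the shape parameters set how sharply each effect acts.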
b. Selection methods based on SST patterns around the start date
These methods aim to select the members from the CMIP6 ensemble that are best aligned with the variability of the initialized predictions or the observations, so that a better representation of the variability in the near-term constrained projections can be expected. Following Mahmood et al. (2021), the selection process consists of comparing each member's spatial distribution of 9-yr mean sea surface temperature against the constraining reference, which, depending on the method, can be observations or decadal predictions. Two regions are considered to compute the SST pattern correlation, the global sea extent (Glob) and the North Atlantic (NAtl), which lead to different selections of the best 30 members. The near-future boreal summer (June–August) constrained projections are obtained with the subselected ensemble. The constraining is done on annual means, while the evaluation is done for the JJA season (Mahmood et al. 2022).
To evaluate the methods, we apply them for the start dates 1970 to 2000. Once the selection of the best 30 members is done, the lead-year 1–20 JJA mean is computed for each start date, and the resulting time series is compared against the corresponding observations over the same periods using a set of metrics defined in section 4d.
1) Decadal prediction constraints (DPfield-selection)
The method consists of using the decadal multimodel mean anomalies starting at a given start date to constrain the projections from that start date for the following 20 years. Using constrained projections further into the future is possible, but we keep the 20-yr horizon to be consistent with the IPCC projection estimates (IPCC 2023). The decadal ensemble mean is built using the datasets mentioned in the decadal predictions section [section 2b(2)] by first computing the multimember mean of each model and then the multimodel mean (Delgado-Torres et al. 2022). The anomaly is then computed using the reference period 1981–2010 in a lead-time-dependent manner: for lead year 1, we subtract the mean of all first lead years whose valid times fall within 1981–2010; for lead year 2, the corresponding start dates are 1980–2009; and so on. The 9-yr mean projections and decadal predictions are compared using the spatial pattern correlation, and the best 30 members are selected.
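The lead-time-dependent anomaly computation can be sketched as follows. This is an illustrative sketch, not the study's code, and it assumes for simplicity that lead year 1 is valid in the start year itself:

```python
import numpy as np

def lead_dependent_anomaly(hindcasts, ref_start=1981, ref_end=2010):
    """Lead-time-dependent anomalies for decadal hindcasts.
    hindcasts: dict start_year -> 1D array of lead-year values
    (lead year 1 assumed valid in the start year itself).
    For each lead L, the climatology is the mean over the start dates
    whose lead-L valid year falls inside ref_start..ref_end."""
    n_lead = len(next(iter(hindcasts.values())))
    clim = np.empty(n_lead)
    for lead in range(n_lead):  # lead index 0 corresponds to lead year 1
        vals = [h[lead] for sy, h in hindcasts.items()
                if ref_start <= sy + lead <= ref_end]
        clim[lead] = np.mean(vals)
    return {sy: h - clim for sy, h in hindcasts.items()}
```

Computing a separate climatology per lead year removes the lead-dependent drift of the predictions before the pattern correlation is evaluated.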
2) Observational constraints (OBSfield-selection)
This method is conceptually the same as DP selection, but instead of constraining using 9 years of predictions from the start date, we use observations from 9 years prior to the start date. The 30 members with the highest spatial pattern correlation between the observed and simulated SSTs are selected to obtain the constrained ensemble.
c. Selection method based on decadal prediction SST temporal evolution from the start date (DPtime-selection)
This method was developed by Befort et al. (2020) and is conceptually similar to the DPfield-selection method. It also aims at phasing the variability of the selected ensemble by using the information given by the first 9 years of the decadal prediction multimodel ensemble [refer to section b(1) for more information on the processing of the decadal predictions]. The selection process consists of comparing the temporal evolution of the SST anomalies (time series of the area-averaged SSTs) of the decadal prediction multimodel mean against each of the CMIP6 members' SST time series, as highlighted in Fig. 1d of the main text. The area-averaged SSTs are computed for the global and North Atlantic SPG (defined as 20°–50°W, 45°–60°N) regions. Constraining with the latter region provides added value (Befort et al. 2020), while the former is an addition of this study to explore the performance of using the global temporal evolution as a constraint. The comparison between an individual CMIP6 member time series and the decadal ensemble mean is based on the mean absolute error over nine consecutive annual mean temperature values (time series in Fig. 1d). To identify the CMIP6 members that display long-term variability similar to the decadal prediction ensemble mean, an 11-yr Hanning window is applied to smooth all CMIP6 ensemble members. This smoothing mitigates the inherent interannual variability present in each member, a process implicitly carried out in the decadal prediction ensemble mean through the averaging of all the decadal members. Note that the methodology developed in Befort et al. (2020) uses surface air temperature instead of sea surface temperature; we use SSTs for consistency with the other selection methods.
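The smoothing and ranking steps above can be sketched as follows. This is an illustrative implementation under simplifying assumptions (reflect-padded edges for the filter, a plain dict of member series), not the study's own code:

```python
import numpy as np

def hanning_smooth(x, width=11):
    """Smooth an annual series with a normalized Hanning window,
    mirroring the 11-yr filter applied to each CMIP6 member."""
    w = np.hanning(width)
    w /= w.sum()
    pad = width // 2
    # reflect-pad the edges so the output keeps the series length
    xp = np.concatenate([x[pad:0:-1], x, x[-2:-pad - 2:-1]])
    return np.convolve(xp, w, mode="valid")

def dptime_select(members, dp_mean, n_best=35):
    """Rank CMIP6 members by the mean absolute error of their smoothed
    9-yr SST series against the decadal-prediction ensemble mean and
    return the n_best closest members."""
    errors = {name: np.abs(hanning_smooth(ts)[:9] - dp_mean[:9]).mean()
              for name, ts in members.items()}
    return sorted(errors, key=errors.get)[:n_best]
```

The decadal-prediction ensemble mean needs no explicit filtering because averaging its members already damps the interannual noise, as noted in the text.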
REFERENCES
Allan, R., and T. Ansell, 2006: A new globally complete monthly historical gridded mean sea level pressure dataset (HadSLP2): 1850–2004. J. Climate, 19, 5816–5842, https://doi.org/10.1175/JCLI3937.1.
Befort, D. J., C. H. O’Reilly, and A. Weisheimer, 2020: Constraining projections using decadal predictions. Geophys. Res. Lett., 47, e2020GL087900, https://doi.org/10.1029/2020GL087900.
Befort, D. J., and Coauthors, 2022: Combination of decadal predictions and climate projections in time: Challenges and potential solutions. Geophys. Res. Lett., 49, e2022GL098568, https://doi.org/10.1029/2022GL098568.
Beobide-Arsuaga, G., T. Bayr, A. Reintges, and M. Latif, 2021: Uncertainty of ENSO-amplitude projections in CMIP5 and CMIP6 models. Climate Dyn., 56, 3875–3888, https://doi.org/10.1007/s00382-021-05673-4.
Boé, J., 2018: Interdependency in multimodel climate projections: Component replication and result similarity. Geophys. Res. Lett., 45, 2771–2779, https://doi.org/10.1002/2017GL076829.
Boé, J., and L. Terray, 2015: Can metric-based approaches really improve multi-model climate projections? The case of summer temperature change in France. Climate Dyn., 45, 1913–1928, https://doi.org/10.1007/s00382-014-2445-5.
Boer, G. J., and Coauthors, 2016: The Decadal Climate Prediction Project (DCPP) contribution to CMIP6. Geosci. Model Dev., 9, 3751–3777, https://doi.org/10.5194/gmd-9-3751-2016.
Brunner, L., R. Lorenz, M. Zumwald, and R. Knutti, 2019: Quantifying uncertainty in European climate projections using combined performance-independence weighting. Environ. Res. Lett., 14, 124010, https://doi.org/10.1088/1748-9326/ab492f.
Brunner, L., A. G. Pendergrass, F. Lehner, A. L. Merrifield, R. Lorenz, and R. Knutti, 2020a: Reduced global warming from CMIP6 projections when weighting models by performance and independence. Earth Syst. Dyn., 11, 995–1012, https://doi.org/10.5194/esd-11-995-2020.
Brunner, L., and Coauthors, 2020b: Comparing methods to constrain future European climate projections using a consistent framework. J. Climate, 33, 8671–8692, https://doi.org/10.1175/JCLI-D-19-0953.1.
Cannon, A. J., 2020: Reductions in daily continental-scale atmospheric circulation biases between generations of global climate models: CMIP5 to CMIP6. Environ. Res. Lett., 15, 064006, https://doi.org/10.1088/1748-9326/ab7e4f.
Cornes, R. C., G. Van Der Schrier, E. J. M. Van Den Besselaar, and P. D. Jones, 2018: An ensemble version of the E-OBS temperature and precipitation data sets. J. Geophys. Res. Atmos., 123, 9391–9409, https://doi.org/10.1029/2017JD028200.
Cos, J., F. Doblas-Reyes, M. Jury, R. Marcos, P.-A. Bretonnière, and M. Samsó, 2022: The Mediterranean climate change hotspot in the CMIP5 and CMIP6 projections. Earth Syst. Dyn., 13, 321–340, https://doi.org/10.5194/esd-13-321-2022.
Cox, P. M., 2019: Emergent constraints on climate-carbon cycle feedbacks. Curr. Climate Change Rep., 5, 275–281, https://doi.org/10.1007/s40641-019-00141-y.
Dalelane, C., K. Winderlich, and A. Walter, 2023: Evaluation of global teleconnections in CMIP6 climate projections using complex networks. Earth Syst. Dyn., 14, 17–37, https://doi.org/10.5194/esd-14-17-2023.
Delgado-Torres, C., and Coauthors, 2022: Multi-model forecast quality assessment of CMIP6 decadal predictions. J. Climate, 35, 4363–4382, https://doi.org/10.1175/JCLI-D-21-0811.1.
DelSole, T., and M. K. Tippett, 2016: Forecast comparison based on random walks. Mon. Wea. Rev., 144, 615–626, https://doi.org/10.1175/MWR-D-15-0218.1.
Doblas-Reyes, F. J., and Coauthors, 2013: Initialized near-term regional climate change prediction. Nat. Commun., 4, 1715, https://doi.org/10.1038/ncomms2704.
Donat, M. G., C. Delgado-Torres, P. De Luca, R. Mahmood, P. Ortega, and F. J. Doblas-Reyes, 2023: How credibly do CMIP6 simulations capture historical mean and extreme precipitation changes? Geophys. Res. Lett., 50, e2022GL102466, https://doi.org/10.1029/2022GL102466.
Eyring, V., S. Bony, G. A. Meehl, C. A. Senior, B. Stevens, R. J. Stouffer, and K. E. Taylor, 2016: Overview of the Coupled Model Intercomparison Project Phase 6 (CMIP6) experimental design and organization. Geosci. Model Dev., 9, 1937–1958, https://doi.org/10.5194/gmd-9-1937-2016.
Ferro, C. A. T., 2007: Comparing probabilistic forecasting systems with the Brier score. Wea. Forecasting, 22, 1076–1088, https://doi.org/10.1175/WAF1034.1.
Flynn, C. M., and T. Mauritsen, 2020: On the climate sensitivity and historical warming evolution in recent coupled model ensembles. Atmos. Chem. Phys., 20, 7829–7842, https://doi.org/10.5194/acp-20-7829-2020.
Ghosh, R., W. A. Müller, J. Baehr, and J. Bader, 2017: Impact of observed North Atlantic multidecadal variations to European summer climate: A linear baroclinic response to surface heating. Climate Dyn., 48, 3547–3563, https://doi.org/10.1007/s00382-016-3283-4.
Giorgi, F., and P. Lionello, 2008: Climate change projections for the Mediterranean region. Global Planet. Change, 63, 90–104, https://doi.org/10.1016/j.gloplacha.2007.09.005.
Hall, A., P. Cox, C. Huntingford, and S. Klein, 2019: Progressing emergent constraints on future climate change. Nat. Climate Change, 9, 269–278, https://doi.org/10.1038/s41558-019-0436-6.
Harris, I., T. J. Osborn, P. Jones, and D. Lister, 2020: Version 4 of the CRU TS monthly high-resolution gridded multivariate climate dataset. Sci. Data, 7, 109, https://doi.org/10.1038/s41597-020-0453-3.
Hausfather, Z., 2019: CMIP6: The next generation of climate models explained. Carbon Brief, https://www.carbonbrief.org/cmip6-the-next-generation-of-climate-models-explained.
Hawkins, E., and R. Sutton, 2009: The potential to narrow uncertainty in regional climate predictions. Bull. Amer. Meteor. Soc., 90, 1095–1108, https://doi.org/10.1175/2009BAMS2607.1.
Hawkins, E., and R. Sutton, 2011: The potential to narrow uncertainty in projections of regional precipitation change. Climate Dyn., 37, 407–418, https://doi.org/10.1007/s00382-010-0810-6.
Hegerl, G. C., and Coauthors, 2021: Toward consistent observational constraints in climate predictions and projections. Front. Climate, 3, 678109, https://doi.org/10.3389/fclim.2021.678109.
Herger, N., G. Abramowitz, R. Knutti, O. Angélil, K. Lehmann, and B. M. Sanderson, 2018: Selecting a climate model subset to optimise key ensemble properties. Earth Syst. Dyn., 9, 135–151, https://doi.org/10.5194/esd-9-135-2018.
Hermanson, L., R. Eade, N. H. Robinson, N. J. Dunstone, M. B. Andrews, J. R. Knight, A. A. Scaife, and D. M. Smith, 2014: Forecast cooling of the Atlantic subpolar gyre and associated impacts. Geophys. Res. Lett., 41, 5167–5174, https://doi.org/10.1002/2014GL060420.
Hermanson, L., and Coauthors, 2022: WMO global annual to decadal climate update: A prediction for 2021–25. Bull. Amer. Meteor. Soc., 103, E1117–E1129, https://doi.org/10.1175/BAMS-D-20-0311.1.
Hersbach, H., and Coauthors, 2020: The ERA5 global reanalysis. Quart. J. Roy. Meteor. Soc., 146, 1999–2049, https://doi.org/10.1002/qj.3803.
IPCC, 2023: Climate Change 2021: The Physical Science Basis. Cambridge University Press, 2391 pp., https://doi.org/10.1017/9781009157896.
Knutti, R., J. Sedláček, B. M. Sanderson, R. Lorenz, E. M. Fischer, and V. Eyring, 2017: A climate model projection weighting scheme accounting for performance and interdependence. Geophys. Res. Lett., 44, 1909–1918, https://doi.org/10.1002/2016GL072012.
Lehner, F., C. Deser, N. Maher, J. Marotzke, E. M. Fischer, L. Brunner, R. Knutti, and E. Hawkins, 2020: Partitioning climate projection uncertainty with multiple large ensembles and CMIP5/6. Earth Syst. Dyn., 11, 491–508, https://doi.org/10.5194/esd-11-491-2020.
Lionello, P., and L. Scarascia, 2018: The relation between climate change in the Mediterranean region and global warming. Reg. Environ. Change, 18, 1481–1493, https://doi.org/10.1007/s10113-018-1290-1.
Lorenz, R., N. Herger, J. Sedláček, V. Eyring, E. M. Fischer, and R. Knutti, 2018: Prospects and caveats of weighting climate models for summer maximum temperature projections over North America. J. Geophys. Res. Atmos., 123, 4509–4526, https://doi.org/10.1029/2017JD027992.
Mahmood, R., M. G. Donat, P. Ortega, F. J. Doblas-Reyes, and Y. Ruprich-Robert, 2021: Constraining decadal variability yields skillful projections of near-term climate change. Geophys. Res. Lett., 48, e2021GL094915, https://doi.org/10.1029/2021GL094915.
Mahmood, R., M. G. Donat, P. Ortega, F. J. Doblas-Reyes, C. Delgado-Torres, M. Samsó, and P.-A. Bretonnière, 2022: Constraining low-frequency variability in climate projections to predict climate on decadal to multi-decadal timescales – A poor man’s initialized prediction system. Earth Syst. Dyn., 13, 1437–1450, https://doi.org/10.5194/esd-13-1437-2022.
Manubens, N., and Coauthors, 2018: An R package for climate forecast verification. Environ. Modell. Software, 103, 29–42, https://doi.org/10.1016/j.envsoft.2018.01.018.
Mariotti, A., and A. Dell’Aquila, 2012: Decadal climate variability in the Mediterranean region: Roles of large-scale forcings and regional processes. Climate Dyn., 38, 1129–1145, https://doi.org/10.1007/s00382-011-1056-7.
McGregor, S., C. Cassou, Y. Kosaka, and A. S. Phillips, 2022: Projected ENSO teleconnection changes in CMIP6. Geophys. Res. Lett., 49, e2021GL097511, https://doi.org/10.1029/2021GL097511.
Meehl, G. A., C. A. Senior, V. Eyring, G. Flato, J.-F. Lamarque, R. J. Stouffer, K. E. Taylor, and M. Schlund, 2020: Context for interpreting equilibrium climate sensitivity and transient climate response from the CMIP6 Earth system models. Sci. Adv., 6, eaba1981, https://doi.org/10.1126/sciadv.aba1981.
Meehl, G. A., and Coauthors, 2021: Initialized Earth system prediction from subseasonal to decadal timescales. Nat. Rev. Earth Environ., 2, 340–357, https://doi.org/10.1038/s43017-021-00155-x.
Merrifield, A. L., L. Brunner, R. Lorenz, I. Medhaug, and R. Knutti, 2020: An investigation of weighting schemes suitable for incorporating large ensembles into multi-model ensembles. Earth Syst. Dyn., 11, 807–834, https://doi.org/10.5194/esd-11-807-2020.
Merrifield, A. L., L. Brunner, R. Lorenz, V. Humphrey, and R. Knutti, 2023: Climate model Selection by Independence, Performance, and Spread (ClimSIPS v1.0.1) for regional applications. Geosci. Model Dev., 16, 4715–4747, https://doi.org/10.5194/gmd-16-4715-2023.
Murphy, A. H., 1973: A new vector partition of the probability score. J. Appl. Meteor., 12, 595–600, https://doi.org/10.1175/1520-0450(1973)012<0595:ANVPOT>2.0.CO;2.
O’Reilly, C. H., D. J. Befort, and A. Weisheimer, 2020: Calibrating large-ensemble European climate projections using observational data. Earth Syst. Dyn., 11, 1033–1049, https://doi.org/10.5194/esd-11-1033-2020.
Riahi, K., and Coauthors, 2017: The shared socioeconomic pathways and their energy, land use, and greenhouse gas emissions implications: An overview. Global Environ. Change, 42, 153–168, https://doi.org/10.1016/j.gloenvcha.2016.05.009.
Ribes, A., J. Boé, S. Qasmi, B. Dubuisson, H. Douville, and L. Terray, 2022: An updated assessment of past and future warming over France based on a regional observational constraint. Earth Syst. Dyn., 13, 1397–1415, https://doi.org/10.5194/esd-13-1397-2022.
Righi, M., and Coauthors, 2020: Earth System Model Evaluation Tool (ESMValTool) v2.0—Technical overview. Geosci. Model Dev., 13, 1179–1199, https://doi.org/10.5194/gmd-13-1179-2020.
Rohde, R. A., and Z. Hausfather, 2020: The Berkeley Earth land/ocean temperature record. Earth Syst. Sci. Data, 12, 3469–3479, https://doi.org/10.5194/essd-12-3469-2020.
Sanderson, B. M., R. Knutti, and P. Caldwell, 2015a: Addressing interdependency in a multimodel ensemble by interpolation of model properties. J. Climate, 28, 5150–5170, https://doi.org/10.1175/JCLI-D-14-00361.1.
Sanderson, B. M., R. Knutti, and P. Caldwell, 2015b: A representative democracy to reduce interdependency in a multimodel ensemble. J. Climate, 28, 5171–5194, https://doi.org/10.1175/JCLI-D-14-00362.1.
Sanderson, B. M., M. Wehner, and R. Knutti, 2017: Skill and independence weighting for multi-model assessments. Geosci. Model Dev., 10, 2379–2395, https://doi.org/10.5194/gmd-10-2379-2017.
Schlund, M., A. Lauer, P. Gentine, S. C. Sherwood, and V. Eyring, 2020: Emergent constraints on equilibrium climate sensitivity in CMIP5: Do they hold for CMIP6? Earth Syst. Dyn., 11, 1233–1258, https://doi.org/10.5194/esd-11-1233-2020.
Siegert, S., O. Bellprat, M. Ménégoz, D. B. Stephenson, and F. J. Doblas-Reyes, 2017: Detecting improvements in forecast correlation skill: Statistical testing and power analysis. Mon. Wea. Rev., 145, 437–450, https://doi.org/10.1175/MWR-D-16-0037.1.
Smith, D. M., and Coauthors, 2019: Robust skill of decadal climate predictions. npj Climate Atmos. Sci., 2, 13, https://doi.org/10.1038/s41612-019-0071-y.
Smith, D. M., and Coauthors, 2020: North Atlantic climate far more predictable than models imply. Nature, 583, 796–800, https://doi.org/10.1038/s41586-020-2525-0.
Steiger, J. H., 1980: Tests for comparing elements of a correlation matrix. Psychol. Bull., 87, 245–251, https://psycnet.apa.org/doi/10.1037/0033-2909.87.2.245.
Sutton, R. T., and B. Dong, 2012: Atlantic Ocean influence on a shift in European climate in the 1990s. Nat. Geosci., 5, 788–792, https://doi.org/10.1038/ngeo1595.
Taylor, K. E., R. J. Stouffer, and G. A. Meehl, 2012: An overview of CMIP5 and the experiment design. Bull. Amer. Meteor. Soc., 93, 485–498, https://doi.org/10.1175/BAMS-D-11-00094.1.
Tebaldi, C., and R. Knutti, 2007: The use of the multi-model ensemble in probabilistic climate projections. Philos. Trans. Roy. Soc., A365, 2053–2075, https://doi.org/10.1098/rsta.2007.2076.
Tokarska, K. B., M. B. Stolpe, S. Sippel, E. M. Fischer, C. J. Smith, F. Lehner, and R. Knutti, 2020: Past warming trend constrains future warming in CMIP6 models. Sci. Adv., 6, eaaz9549, https://doi.org/10.1126/sciadv.aaz9549.
Wilks, D. S., 2011: Forecast verification. Statistical Methods in the Atmospheric Sciences, D. S. Wilks, Ed., International Geophysics, Vol. 100, Academic Press, 301–394, https://doi.org/10.1016/B978-0-12-385022-5.00008-7.
Zwiers, F. W., and H. von Storch, 1995: Taking serial correlation into account in tests of the mean. J. Climate, 8, 336–351, https://doi.org/10.1175/1520-0442(1995)008<0336:TSCIAI>2.0.CO;2.