A SIMPLE PEDAGOGICAL MODEL LINKING INITIAL-VALUE RELIABILITY WITH TRUSTWORTHINESS IN THE FORCED CLIMATE RESPONSE

Using a simple pedagogical model, it is shown how information about the statistical reliability of initial-value ensemble forecasts can be relevant in assessing the trustworthiness of the system’s response to an external forcing.

Understanding and modeling the response of the climate system to external forcings, especially to anthropogenic forcing, are central to climate science. Typically, projections of coupled general circulation models (GCMs) are made that simulate a future world using assumptions, or scenarios, of how such forcings will evolve over time. A related methodology is pursued in attributing extreme weather and climate events to anthropogenic forcing. Here, the response of the current climate system to the ongoing anthropogenic forcing is estimated by comparing model states with and without anthropogenic forcing.
Regional climate projections are increasingly providing the basis for climate-related decision-making in a range of societal sectors. Hence, the degree to which the GCMs' forced responses are trustworthy is becoming an issue of practical as well as theoretical importance. Given the continuing significant biases in regional simulations of key climate variables (IPCC 2013), this trustworthiness cannot be guaranteed.
For example, the frequency of anticyclonic blocking is projected to decrease in most climate models as a result of anthropogenic forcing (Matsueda et al. 2009; Masato et al. 2013). However, the frequency of blocking is severely underestimated in phase 5 of the Coupled Model Intercomparison Project (CMIP5) twentieth-century integrations (Anstey et al. 2013) and continues to be underestimated in high-resolution simulations (Schiemann et al. 2017). In the presence of such shortcomings, how trustworthy are projections of blocking frequency? Since anticyclonic blocking is a vital factor in determining the frequency of drought throughout the year, and of cold weather in winter, answering the question above is of considerable importance. (This article is licensed under a Creative Commons Attribution 4.0 license.)
Other examples where substantial model biases could lead to untrustworthiness of model-derived regional climate projections are the Asian summer monsoon (e.g., Webster et al. 1998; Turner and Annamalai 2012; Ramesh and Goswami 2014) and the decreasing Arctic sea ice and its link to the global atmospheric circulation (e.g., Wettstein and Deser 2014; Francis and Skific 2015; Barnes and Screen 2015).
By definition, we cannot know for certain whether our current climate change projections are untrustworthy until these changes eventually occur in the future and can be verified with observations. However, we build trust in climate models by critically evaluating their performance in present-day or past climate conditions. This is typically achieved by assessing characteristics of the simulated climatological probability distributions, for example, first- and higher-order moments, and patterns of climatological spectral variability.
As will be shown below, these characteristics do not guarantee the trustworthiness of the forced response. Are there other diagnostics that can help determine the trustworthiness of regional climate projections or attributions of observed extreme weather events? One possible diagnostic is the statistical reliability of initial-value ensemble forecasts (Wilks 2011; Weisheimer and Palmer 2014), which relates forecast probability to frequency of occurrence. The possible link between initial-value reliability and the trustworthiness of the forced response was first proposed by Palmer et al. (2008, hereafter P08), who discussed how information from a multimodel ensemble of initialized coupled seasonal forecasts can be used to constrain the trustworthiness of the regional projection of precipitation. However, this notion has proved controversial. For example, Scaife et al. (2009) argued that since initial-value predictions (on the seasonal time scale or shorter) merely provide estimates of the future internal variability of the climate system, they are largely irrelevant in assessing the trustworthiness of the long-term forced-response signal. Similarly, in their recent review paper, Stott et al. (2016) discuss, but largely discount, the use of initialized seasonal forecast reliability diagrams for assessing the ability of a particular climate model to be used for extreme event attribution. Stott et al. (2016, p. 32) comment: "A seasonal forecast reliability diagram indicates whether the model is able to capture the predictable features of the event under consideration. Although the use of reliability is well established for forecasting, its meaning for attribution is less clear given that reliable attribution is still possible when there is no inherent real-world predictability." We will return to this comment below. For now, however, it can be noted that Matsueda et al. (2016) have tested the P08 hypothesis based on a series of high- and low-resolution integrations of a comprehensive atmospheric model, where the high resolution is treated as a surrogate of truth. They showed quantitatively that information about the reliability of seasonal forecast ensembles can help improve the skill of regional climate change projections of precipitation.
Although this latter study lends some support to the P08 philosophy, it is clear that a conceptual picture of how initial-value unreliability can undermine trust in the climate-forced response is lacking. In this paper, we present such a picture using an easily understood nonlinear toy model, qualitatively analogous to the extratropical atmosphere. We show how the failure to produce a trustworthy response to external forcing can be clearly diagnosed from the model's initial-condition forecast unreliability.
The pedagogical model is configured in two different ways. One configuration, the more complex, is defined as reality; the other, simpler, configuration defines our weather/climate model. An external forcing, representing, for example, anthropogenic forcing, is then applied to both configurations. The model is deliberately constructed in such a way that its response to the forcing is completely incorrect compared with reality. We use this setup to investigate whether and how it is possible to know that the model response to an external forcing is untrustworthy without directly comparing the forced model with its corresponding forced reality (which corresponds to the real-world situation where the forced reality will be known only in the future). It is demonstrated in the conceptual model that the untrustworthiness of the forced response is not only diagnosable from, but also dynamically linked with, the unreliability of the initial-value forecasts. By contrast, the untrustworthiness of the forced response cannot be determined from simple diagnostics of biases in the model's unforced climatology.
The structure of the paper is as follows: The conceptual toy model is introduced in the next section. In "Forecast reliability," the concept of forecast reliability is discussed in the context of the conceptual model. A brief analysis of the (un)reliability of operational subseasonal and seasonal precipitation forecasts in the European Centre for Medium-Range Weather Forecasts (ECMWF) model is presented in the section "Reliability on seasonal and subseasonal time scales." Some discussion and conclusions are given in the final section.

THE TOY MODEL.
Our climate can be considered a nonlinear dynamical system represented schematically by the equation

dX/dt = F[X],

where F is at least quadratic in the state variable X. Nonlinearity can manifest in two different ways. First, if we linearize these equations to describe how small perturbations δX evolve in time, the linearized equations have the form

d(δX)/dt = (dF/dX) δX. (1)

Notice that if F is at least quadratic in X, then the Jacobian operator dF/dX will be at least linear in X. That is to say, in a nonlinear system, the growth of small perturbations (which characterize the predictability of the system) will vary with the underlying state X of the system. This dependence has been illustrated, for example, by Buizza and Palmer (1995) when studying the fastest-growing linear perturbations in the atmosphere. The existence of relatively stable regions of phase space, which characterize the phenomena of circulation regimes (Legras and Ghil 1985), is another manifestation of nonlinearity.

Figure 1 shows two configurations of an idealized system that incorporate these manifestations of nonlinearity (Palmer 1999). A ball is dropped onto the ridge separating two channels. If the ball is dropped slightly to the right of the ridge, the ball falls into the right-hand cup for the first configuration and into the left-hand cup for the more complex second configuration. The position of the ball can be considered as defining the variable X. Near the ridge, small perturbations to X will grow; this can be considered an unstable part of the system. By contrast, the cups can be considered very stable parts of the system; essentially, the ball will remain stationary in the cups until it is lifted out and dropped again into the funnel. The cups can be considered to represent atmospheric circulation regimes.
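The state dependence of perturbation growth can be made concrete with a minimal sketch of our own (an illustrative double-well system, not the apparatus of Fig. 1): its two stable fixed points play the role of the cups and its unstable fixed point plays the role of the ridge.

```python
# Illustrative double-well system (our own example, not the system of
# Fig. 1): dX/dt = F(X) with F(X) = X - X**3 has stable fixed points
# ("cups") at X = +/-1 and an unstable fixed point ("ridge") at X = 0.
# Because F is nonlinear, the Jacobian dF/dX = 1 - 3*X**2 depends on the
# state X: small perturbations grow near the ridge but decay in the cups.

def F(x):
    """Nonlinear tendency: a cubic double-well force."""
    return x - x**3

def jacobian(x):
    """dF/dX, which is state dependent because F is nonlinear."""
    return 1.0 - 3.0 * x**2

print(jacobian(0.0))   # ridge: +1.0, perturbations grow
print(jacobian(1.0))   # cup:  -2.0, perturbations decay
```

In a linear system, by contrast, dF/dX would be a constant, and predictability would not vary with the underlying flow.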
A fan is shown in Fig. 1, which, in analogy to the anthropogenic climate change situation, demonstrates the effects of a simple external forcing on the system. In both configurations, when the fan is off and the ball is repeatedly dropped into the funnel, the probability of the ball dropping in either of the cups is equal (i.e., 50%). However, when the fan is switched on, the probability that the ball will drop into the right-hand cup in the top configuration, or the left-hand cup in the bottom configuration, decreases.
The second configuration is clearly more complicated than the first. So let us suppose that the second configuration corresponds to reality and the first is a simplified model of reality. Hence, compared with reality, the model responds incorrectly to the applied forcing. For example, if the right-hand cup defines what we shall call an "anticyclonic blocking regime" and the left-hand cup defines a "zonal-flow regime" and the fan defines "anthropogenic forcing," then while the model predicts an increase in the zonal regime with anthropogenic forcing, reality predicts an increase in the blocked regime.
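The ball-dropping experiment can be mimicked numerically. The sketch below is our own construction (the uniform entry-point distribution, the fan strength, and the function names are illustrative assumptions); it reproduces the qualitative behavior described above: with the fan off both configurations yield 50% blocking, while with the fan on the model's blocking frequency decreases and reality's increases.

```python
import random

def drop_ball(twisted, fan_push, rng):
    """Return the regime the ball lands in for one drop.

    fan_push > 0 nudges the ball toward the left-hand channel entrance.
    In the simple (model) configuration this favors the left-hand "zonal"
    cup; in the twisted (reality) configuration the channels cross, so
    the same nudge makes the right-hand "blocked" cup MORE likely.
    """
    x = rng.uniform(-1.0, 1.0) - fan_push  # entry point relative to the ridge
    lands_right = x > 0.0
    if twisted:                            # crossed channels swap the cups
        lands_right = not lands_right
    return "blocked" if lands_right else "zonal"

def blocked_frequency(twisted, fan_push, n=20000, seed=1):
    """Frequency of the blocked regime over n repeated drops."""
    rng = random.Random(seed)
    hits = sum(drop_ball(twisted, fan_push, rng) == "blocked" for _ in range(n))
    return hits / n
```

Note that with `fan_push = 0` the two configurations are statistically indistinguishable: the unforced climatologies are identical even though the forced responses are opposite.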
Most current-generation CMIP models predict that an increase in greenhouse gas forcing will lead to a decrease in the frequency of northern European blocking (Matsueda et al. 2009; Anstey et al. 2013). However, in the absence of a clear theory explaining this, most modelers would not feel especially confident in such a result, especially as the CMIP models do not simulate long-lived blocking anticyclones with any degree of realism (Masato et al. 2013), and even models with higher atmospheric resolution still suffer large biases in Euro-Atlantic blocking (Schiemann et al. 2017). In this regard, the missing "twist" in the simplified model of reality could be associated with an occasional phase error in the Rossby wave response to tropical forcing or, perhaps, an inadequate stratosphere.
The key question we wish to consider in this paper is the following: Given the incorrect response to an external forcing in our toy model, compared to our toy reality, how could we determine a priori that the described idealized model response to forcing was wholly incorrect before the fan was actually switched on? This clearly relates to the more practical question of how to know whether the CMIP models are correctly simulating the response to forcing at a time before the response to this forcing is known in reality.
Note that we cannot readily answer our question by studying the unforced model climatological probability distribution. The idealized model has regimes (i.e., cups) in exactly the same location as reality. Not only that, the unforced frequency of occurrence of each of the regimes is also exactly correct (i.e., 50%). That is to say, our idealized model has none of the climatological errors in blocking and yet it responds incorrectly to forcing. Is there a way that we could know this before the fan was switched on?
FORECAST RELIABILITY. What specific information from a set of initial-value forecasts is relevant here? We argue that it is the probabilistic forecast reliability. Consider an (initial value) ensemble forecast system run over a large number of independent initial states. Focus on some binary event: rainfall exceeding the lower climatological tercile, for example. For a given grid point and forecast lead time, we can partition forecasts from our ensemble system into subsamples, or bins, where the probability of the event lay in the ranges, say, 0%-10%, 10%-20%, 20%-30%, …, and 90%-100%. For each forecast, the binary event either did occur or did not occur. Therefore, we can associate a frequency of occurrence for each of the subsamples of ensemble forecasts. For a reliable ensemble prediction system and a large-enough sample of events, the frequency of occurrence of events in the 0%-10% bin should lie between 0% and 10% and so on for all other bins. Put another way, on a 2D plot, where the frequency of occurrence defines distance along the y axis and the different forecast probability bins are plotted along the x axis, the data points should lie on or close to the diagonal. Examples of actual reliability diagrams from a state-of-the-art numerical weather prediction model are given below.
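The binning procedure just described can be written down directly. The sketch below (a hypothetical helper of our own, standard library only) returns one (mean forecast probability, observed frequency) point per non-empty bin; these are the points plotted in a reliability diagram.

```python
def reliability_curve(probs, outcomes, n_bins=10):
    """Partition forecasts into probability bins and verify each bin.

    probs    -- forecast probabilities in [0, 1], one per forecast
    outcomes -- 1 if the binary event occurred, 0 otherwise
    Returns one (mean forecast probability, observed frequency) point per
    non-empty bin; a reliable system puts every point near the diagonal.
    """
    bins = [[] for _ in range(n_bins)]
    for p, o in zip(probs, outcomes):
        k = min(int(p * n_bins), n_bins - 1)  # p = 1.0 goes in the top bin
        bins[k].append((p, o))
    curve = []
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            freq = sum(o for _, o in members) / len(members)
            curve.append((mean_p, freq))
    return curve
```

For example, a set of forecasts that always issues the climatological probability of an event occurring half the time yields the single on-diagonal point (0.5, 0.5).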
With this as background, let us return to the idealized system shown in Fig. 1. The points A, B, C, D, E, and F in Fig. 2 correspond to sets of initial conditions for ensemble forecasts using observations of reality, assimilated (with some small random observation error) into the faulty model during a period before the fan has been switched on. That is to say, we will study the probabilistic reliability of the unforced model.
For initial conditions corresponding to A and B, the model reliably predicts, respectively, a 100% or 0% probability for the left-hand regime. That is to say, in the subset of situations where the initial conditions belong to A, the forecast probability of occurrence of the left-hand regime is always 100%, and the frequency of occurrence of this regime is 100%. Similarly, in the subset of situations where the initial conditions belong to B, the forecast probability of occurrence of the left-hand regime is always 0%, and the frequency of occurrence of this regime is 0%.
For a set of initial conditions belonging to category C, that is, close to the unstable ridge, half of the time the ball falls into the left-hand regime, and half of the time the ball falls into the right-hand regime. Hence, the frequency of occurrence of either regime is 50%. The forecast ensembles reliably predict that this is a situation with no predictability; here, in any ensemble the forecast probability of the left-hand regime is 50%. This illustrates an important point relevant to the Stott et al. (2016) comment in the introduction: reliability diagrams not only test the ability of models to capture the predictable features of the event under consideration, they also test the ability of the model to predict reliably the situations where there is no predictable signal, that is, by producing an ensemble probability equal (within sampling uncertainty) to the climatological frequency of the event.
However, consider sets of initial conditions denoted by D in Fig. 2b. Some of the time the initial conditions will lie on the top channel; on other occasions, the initial conditions will lie on the bottom channel. Our imperfect model, on the other hand, is unable to discriminate between these situations and in all circumstances the model initial conditions (cf. Fig. 2a) lie on the left-hand channel (this is an example where initial-condition error is predominantly due to model error rather than observation error). As such, the model always predicts a 100% probability of occurrence of the left-hand regime, when the observed frequency of occurrence of reality is actually 50%. Probabilistic forecasts from initial conditions D are therefore unreliable.
Worse still, for initial conditions E the model ensembles predict 0% probability of the left-hand regime, when the observed frequency of occurrence in reality is 100%, and for initial conditions F the model predicts a probability of 100% when the observed frequency in reality is 0%.
If we bin the data according to prescribed probability bins and regress over all the data points, the reliability curve for the model is flat, as illustrated in Fig. 3.
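The flat curve can be verified with a small synthetic sample. In the sketch below (a hypothetical construction of ours, not the paper's calculation), each category A-F is assigned the forecast probability and observed frequency described above, and the pooled verification is computed per forecast-probability bin.

```python
import random

# Hypothetical verification sample: for each initial-condition category
# we list (model forecast probability of the left-hand regime, observed
# frequency of that regime in reality), as described in the text.
CATEGORIES = {
    "A": (1.0, 1.0),  # reliably certain
    "B": (0.0, 0.0),  # reliably certain
    "C": (0.5, 0.5),  # reliably uncertain
    "D": (1.0, 0.5),  # overconfident: the model cannot see the twist
    "E": (0.0, 1.0),  # confidently wrong
    "F": (1.0, 0.0),  # confidently wrong
}

def observed_frequency_by_forecast_bin(cases_per_category=1000, seed=0):
    """Pool all categories and verify each forecast-probability bin."""
    rng = random.Random(seed)
    tally = {}  # forecast probability -> [event count, case count]
    for prob, freq in CATEGORIES.values():
        counts = tally.setdefault(prob, [0, 0])
        for _ in range(cases_per_category):
            counts[0] += rng.random() < freq  # event occurs at rate freq
            counts[1] += 1
    return {p: occ / tot for p, (occ, tot) in sorted(tally.items())}
```

Pooled over A-F, the forecast probabilities 0.0, 0.5, and 1.0 all verify with an observed frequency near the climatological 50%: a flat reliability curve.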
The key problem with the model is that it does not have the twist in its channel, compared with reality.
In principle, this error should show up in studies of the climatology of the model. However, the twist itself is not an energetically dominant aspect of the system.

Fig. 2. Based on the toy systems shown in Fig. 1, we imagine a set of ensemble forecasts of the model of reality from the different initial conditions A-F. We study the probabilistic reliability of these forecasts.
That is to say, although the model error could in principle be revealed by some empirical orthogonal function (EOF) analysis, one may have to go to some very high-order EOF before the problem becomes apparent. This problem is revealed by focusing on forecast reliability. We therefore assert that forecast reliability should be considered a necessary (but certainly not sufficient) condition for having confidence that the model responds correctly to external forcing.

RELIABILITY ON SEASONAL AND SUBSEASONAL TIME SCALES.
In the previous sections, the relevance of initial-value reliability for the forced climate response was demonstrated in a pedagogical nonlinear system. In this section, we discuss and present some reliability diagrams on seasonal and subseasonal time scales from the ECMWF operational system to assess the reliability of initialized ensemble forecasts performed with complex state-of-the-art coupled ocean-atmosphere models.
Seasonal predictions provide estimates of seasonal-mean statistics of weather, typically up to 3 months ahead of the season in question. As such, a seasonal forecast can provide information on how likely it is that the coming season will be wetter, drier, warmer, or colder than normal. Seasonal climate forecasts are increasingly being used across a range of application sectors, and reliable inputs are essential for any forecast-based decision-making. Weisheimer and Palmer (2014) characterized the reliability of regional temperature and precipitation forecasts from ECMWF's operational seasonal forecast System 4 in terms of usefulness and found a wide range of rankings, depending on region and variable. Most of the temperature forecasts over land were found to be at least marginally useful in terms of reliability. Overall, the reliability performance for precipitation was poorer than for temperature, with more regions classified with lower reliability scores. For example, over northern Europe, the reliability of precipitation forecasts for winters [December-February (DJF)] is classified as not useful for dry events and marginally useful for wet events. The reliability for dry summers over Europe is notably poor, with southern Europe classified in reliability categories as not useful and northern Europe as "dangerously useless." The most frequent reliability category for precipitation was the marginally useful category. Such forecasts are not very reliable but might be marginally useful for some applications; see the calibration technique discussed in the next section. The category with the second-highest number of regions is the one of perfect reliability, which is an optimistic result for the usefulness of seasonal forecasts of precipitation. However, there are substantially more cases of areas that have a poorer precipitation forecast reliability than there are for temperature.
It is exactly in those areas that our analysis above suggests the response to forcing is untrustworthy. Weisheimer et al. (2017) have proposed the use of reliability information from a 110-yr-long dataset of seasonal retrospective forecasts to contrast the reliability of forecast probabilities of extreme events at the beginning of the twentieth century, indicative of a "pre-climate change" period, with the reliability of forecast probabilities for more recent decades, indicative of a "post-climate change" period. Consistent with the discussion here, they conclude that such information can be important for increasing the confidence in attribution statements of extreme weather and climate events.
The problem of seasonal forecasts becoming seriously unreliable for certain events, regions, and seasons, as highlighted above, manifests itself much earlier in the forecast range. Here, we analyze data from a subseasonal retrospective forecast experiment using the ECMWF monthly forecasting system of model cycle 41R1. The 32-day-long hindcasts were run for 80 start dates over the period 1989-2008. We estimate forecast reliability for four weekly periods from day 4 to day 32 based on weekly precipitation anomalies over Europe. The results are presented in Fig. 4 and show that, while forecast reliability is good for days 4-11, it quickly drops to a flat line in the reliability diagram from days 12-18 onward, indicating no useful forecast reliability on these time scales. The fact that reliability decreases only after the end of the conventional medium range of weather forecasts suggests that such unreliability is due to relatively slowly growing model error rather than to the more rapidly growing initial-condition uncertainty.
Although multimodel ensembles typically have greater initial-value reliability than single-model ensembles, reliability is still far from perfect. For example, the component models of a contemporary multimodel ensemble all suffer from the same systematic deficiencies in regime statistics (such as undersimulation of long-lived blocks). Hence, the results here are also relevant to multimodel ensembles.

DISCUSSION AND CONCLUSIONS.
In this paper, we have used a simple nonlinear toy model of the atmospheric circulation to clarify the notion that the unreliability of initial-value ensemble predictions can directly reveal the untrustworthiness of the model's forced response. The system consists of a reality with two circulation regimes and a simpler model of reality. While, by construction, the model simulates these two regimes with the correct climatological structure and frequency, the model is not able to correctly simulate the response to an external forcing. We have shown that it is possible to know a priori (i.e., without performing the model simulation under forcing) that the model response to forcing is incorrect. The key that allows such knowledge is data from initial-value (unforced) predictions of reality using the simplified model. Specifically, the model's unreliability indicates that its response to an external forcing is also not trustworthy.
With this in mind, it is worth returning to the Stott et al. (2016) quote in the introduction. As discussed in the previous section "Forecast reliability," an initial-value reliability diagram does not only indicate whether the model is able to capture the predictable features of the event under consideration, it also indicates whether the model can predict climatological probabilities in situations where the event is unpredictable. That is to say, initial-value reliability is still an important diagnostic to study even in situations where there is no inherent real-world predictability. We therefore do not agree with the conclusion of Stott et al. (2016, p. 32) that "although the use of reliability is well established for forecasting, its meaning for attribution is less clear given that reliable attribution is still possible when there is no inherent real-world predictability."

How can we use initial-condition reliability to improve quantitatively the trustworthiness of a climate model's response to forcing? The reliability diagram provides a simple framework for calibrating probabilities to make them (in sample) reliable. Suppose the raw forecast probabilities are not reliable because the ensemble system is overconfident and creates, on average, insufficient spread to account for the forecast errors (i.e., the ensemble is underdispersive). In Fig. 5 such a case is schematically illustrated by the solid circles in the reliability diagram. Here, the diagram shows an example for overconfident predictions of a binary tercile event. Where forecast probabilities are larger than 1/3, the events occur less frequently than predicted by the ensemble. Where forecast probabilities are smaller than 1/3, the events occur more frequently.
By moving the uncalibrated forecast probabilities (solid circles) horizontally toward the perfect reliability diagonal (open circles), the forecast probabilities can be corrected to be more reliable, keeping the observed frequency of occurrence unchanged. This is illustrated in Fig. 5 by shifting the solid circles to the open circles.
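In code, the horizontal shift amounts to replacing each raw forecast probability with the observed frequency of its reliability-diagram point. The sketch below is a simplified illustration of ours (nearest-point lookup and the example points are assumptions, not the operational calibration).

```python
def calibrate(raw_prob, reliability_points):
    """Replace a raw forecast probability with the observed frequency of
    the nearest reliability-diagram point -- the horizontal shift onto
    the diagonal sketched in Fig. 5.  (Nearest-point lookup is a
    simplifying assumption; any monotone interpolation would also do.)

    reliability_points -- (raw forecast probability, observed frequency)
    pairs estimated from past, verifiable forecasts.
    """
    nearest = min(reliability_points, key=lambda pt: abs(pt[0] - raw_prob))
    return nearest[1]

# Overconfident tercile forecasts (cf. Fig. 5): high raw probabilities
# verify too seldom and low ones too often, so calibration pulls both
# toward the climatological 1/3.
points = [(0.1, 0.2), (0.5, 0.4), (0.9, 0.6)]
```

With these illustrative points, `calibrate(0.9, points)` returns 0.6: a confident 90% forecast is downweighted to the 60% frequency with which such forecasts actually verified, while `calibrate(0.1, points)` is raised to 0.2.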
Here, we propose that such forecast calibration could, for example, be used to downweight probabilistic attribution statements. Exactly how this should be done is a matter for further research, but a plausible methodology would be to downweight the probabilities based on the reliability of initial-value forecasts with shorter lead times that can be verified. The lead time of the forecasts could be linked to the time scale of the attributed event of interest. For example, probabilistic attribution statements for seasonal-mean extreme events could be downweighted on the basis of seasonal forecast reliability for percentile categories in which the event occurs. Similarly, statements about extremes on shorter time scales could be calibrated with forecast reliability from extended-range or subseasonal predictions.