Improving short-range flow-dependent reliability may provide a practical approach to increasing forecast skill out to ∼10 days. This possibility is discussed and illustrated for a specific flow situation associated with convection over North America and poor forecast skill for Europe.
Numerical weather prediction is fundamentally a probabilistic task owing to the growth of unavoidable uncertainties in the forecast’s initial conditions and in the forecast model itself (Sutton 1954; Lorenz 1963). A key question for the user is how certain they can be that a “10% probability of precipitation” really means that they will be unlucky to get wet. How would they assess the reliability of such a prediction? One approach would be for them to keep a record of the days when the forecast indicated a 10% probability of precipitation and to see if rainfall actually occurred on 10% of those days. If, in reality, it occurred on 20% of the days, this would indicate that these probabilistic forecasts were unreliable. On the other hand if, for a large set of occasions when the forecast gave a 40% chance of a hurricane making landfall in a particular region, a hurricane did make landfall 40% of the time, this would represent a “reliable” (Sanders 1958) set of forecasts. If the decision about whether to defend a vulnerable piece of infrastructure could be based purely on the forecast, then, since the forecast is reliable, a simple approach would be to defend the infrastructure if the cost to do so was less than 40% of the loss that would otherwise occur (Richardson 2000). In reality other factors will influence such a decision, but this example does highlight the importance of reliability. We may ask how we could do even better. Suppose we could partition the forecasts into two categories (based on the initial flow conditions) where the probabilities of landfall were, say, 60% and 20% (i.e., one flow situation is more likely to lead to landfall than the other), and suppose that the forecasts in both categories were reliable. It is straightforward to show (see "The benefits of flow-dependent reliability" sidebar) that, under the assumptions of this simple decision model, the expected expense in defending the infrastructure is always reduced (or matched). 
This result, which does not depend on the choice of numbers, emphasizes the potential utility of improving flow-dependent reliability. The key question to address here is, How do we assess and improve the flow-dependent reliability of our forecasts? Before attempting to address this question, it is useful to discuss developments that have already improved forecast reliability and then to motivate use of the approach taken in this study.
THE BENEFITS OF FLOW-DEPENDENT RELIABILITY
In the landfalling hurricane example, suppose that the overall forecast probability for landfall is p and that this probability is reliable (so that landfall does occur in a fraction p of these cases). For the simple decision model discussed, let C be the fixed cost of taking action and let L be the fixed loss that would be sustained if action was not taken and the event occurred. Writing α = C/L, the expected expense of defending or rebuilding the infrastructure would then be min(p,α) (in units of L). Suppose now that it is possible to partition the forecasts into two categories based on their initial flow conditions, with the fraction of forecasts in each category being w1 and w2 (where w1 + w2 = 1) and the reliable probabilities of landfall being p1 and p2, respectively. Then w1p1 + w2 p2 = p. Without loss of generality, we can assume p1 < p < p2. The overall expected expense following the partition will be w1min(p1,α) + w2min(p2,α). There are four possible rankings of α, p1, p, and p2 and corresponding expected expenses for the partitioned and unpartitioned forecasts (Table SB1). In all cases, the expected expense is reduced or matched when a partitioning (with flow-dependent reliability) is possible.
Table SB1. Rankings and comparison of expected expenses for partitioned and unpartitioned forecasts.
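The cost–loss argument in this sidebar can be checked numerically. The short Python sketch below uses the 20%/60% landfall probabilities from the text; the 50/50 split of cases between the two flow types is an illustrative assumption. It verifies that, for every value of α, the partitioned expected expense never exceeds the unpartitioned one.

```python
import numpy as np

def expense(p, alpha):
    """Expected expense (in units of the loss L) for a reliable forecast
    probability p under the simple cost-loss model: act if alpha <= p."""
    return min(p, alpha)

# Illustrative partition: the 20%/60% example from the text, with an
# assumed 50/50 split of cases between the two flow types.
w1, p1, p2 = 0.5, 0.2, 0.6
p = w1 * p1 + (1 - w1) * p2          # overall (reliable) probability, here 0.4

for alpha in np.linspace(0.01, 0.99, 99):
    partitioned = w1 * expense(p1, alpha) + (1 - w1) * expense(p2, alpha)
    # The partitioned expense is always reduced or matched.
    assert partitioned <= expense(p, alpha) + 1e-12
```

The result does not depend on the particular numbers chosen, matching the claim in the main text.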
PREVIOUS PROGRESS IN ENSEMBLE FORECAST RELIABILITY.
Probabilistic weather forecasts are generally made through use of an ensemble of individual forecasts (Lewis 2005), each starting from a slightly different initial state and including a different realization of model uncertainty. Computing resources put a limit on the number of ensemble members that can be made at a chosen resolution, and thus on the fidelity with which the underlying forecast distribution can be estimated. This and other limitations make it impractical to assess forecast reliability for all possible events and probabilities—particularly for rare and/or extreme events—and so simpler approaches to diagnosing deficiencies in reliability are useful. One consequence of reliability in an ensemble forecast system is the so-called spread–error relationship (Leutbecher and Palmer 2008), whereby the average difference (or distance in state space) between an ensemble member and the ensemble mean (e.g., the ensemble standard deviation) should match the average difference between the eventual outcome and the ensemble mean [e.g., the root-mean-square error (rmse) of the ensemble mean] when averaged over a large-enough set of start dates and using a simple adjustment to account for finite ensemble size. As will be seen below, this relationship provides a practical first-order method for assessing ensemble reliability, although it is not so useful for the diagnosis of deficiencies in reliability.
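The spread–error relationship can be illustrated with synthetic data. In the Python sketch below (all numbers are illustrative assumptions, not ECMWF values), truth is drawn from the same flow-dependent distribution as the ensemble members, so the mean-square error of the ensemble mean should match the mean ensemble variance after a simple (1 + 1/M) adjustment for finite ensemble size M.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 50, 20000          # ensemble size and number of forecast cases (assumed)

# A reliable ensemble: truth and members are draws from the same
# flow-dependent forecast distribution N(mu, sigma).
sigma = rng.uniform(0.5, 2.0, size=N)        # day-to-day forecast uncertainty
mu = rng.normal(0.0, 1.0, size=N)
truth = rng.normal(mu, sigma)
members = rng.normal(mu[:, None], sigma[:, None], size=(N, M))

ens_mean = members.mean(axis=1)
mean_var = members.var(axis=1, ddof=1).mean()     # average ensemble variance
mse = ((truth - ens_mean) ** 2).mean()            # mean-square error of the mean

# Finite-ensemble adjustment: mse should match (1 + 1/M) * mean_var.
ratio = mse / ((1 + 1 / M) * mean_var)
print(ratio)   # close to 1 for a reliable ensemble
```

A ratio well above 1 over many start dates would indicate underdispersion (too little spread), and a ratio below 1 overdispersion.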
Here, we focus mainly on the European Centre for Medium-Range Weather Forecasts (ECMWF) ensemble (ENS) forecast system, which became operational in May 1994 (Palmer et al. 1992; Molteni et al. 1996). The diagnostic approach discussed should, however, be more widely applicable. Figure 1a shows annual means of ENS spread (solid) and error (dashed) as a function of forecast lead time for Northern Hemisphere 500-hPa geopotential height (Z500) for the years 1996, 2005, and 2014. System developments have led to a reduction in errors and an improvement in the annual-mean agreement with the spread to the point that the spread and error curves are virtually indistinguishable by 2014. These changes represent substantial improvements in the reliability and sharpness of predictive distributions.
In 1996 (Fig. 1a, blue curves), the rapid rise in spread to day 2 reflected the inclusion of fast-growing “singular vector” (SV) perturbations to the initial conditions (Molteni and Palmer 1993), which helped the spread keep pace with error growth over the first 2 days, but it can be seen that this was not sustained beyond 2 days. Since 1996, four-dimensional variational data assimilation (Rabier et al. 2000) was developed and subsequently incorporated into an ensemble of data assimilations (EDA; Isaksen et al. 2010) to produce a set of equally likely initial conditions that take into account observation and model uncertainty and the growth of uncertainty from the previous set of initial conditions. These developments, together with many other incremental improvements (e.g., English et al. 2000; Dee and Uppala 2009; Jung et al. 2010), have enabled a reduction in the magnitude of SV perturbations (Leutbecher and Lang 2014) along with the reduced errors and improved annual-mean agreement with the spread. Notice also the more realistic “exponential” shape (Molteni and Palmer 1993; Harlim et al. 2005) of the spread and error curves for 2014 over the first 5 or 6 days, which are much flatter at short ranges.
Obtaining annual-mean agreement between spread and error is an important first step in the development of a reliable forecast system. The primary reason for making ensemble forecasts, however, is to represent the day-to-day variations in forecast uncertainty. Figure 1b shows time series of spread and error for forecasts of European Z500 at a lead time of 6 days from several of the world’s operational forecasting centers. The concerted rise and fall in spread for all centers demonstrates that predictability varies with the prevailing flow situation. The partial agreement between the spread and error curves (it can never be perfect on a day-to-day basis; Whitaker and Loughe 1998) demonstrates that the forecasts do possess some degree of flow-dependent reliability, including for flow situations where intrinsic predictability is low (i.e., where uncertainty is growing rapidly). A key question is whether we can do any better.
WHY FOCUS ON INITIAL UNCERTAINTY GROWTH RATES?
This section motivates the approach considered here, where flow-dependent reliability (out to ∼10 days) is assessed and improved through the diagnosis of short-range (∼12 h) ensemble forecasts.
A CASE FOR INTERNATIONAL COOPERATION
The case shown in Fig. 2 highlights that one key source of forecast uncertainty is associated with intense convective activity over the United States. Around the date shown, the National Oceanic and Atmospheric Administration’s (NOAA) Storm Prediction Center’s storm reports recorded 631 events of high winds, 194 events of hail, and 89 events of tornadoes, primarily within the upper Mississippi Valley. Unfortunately, injuries and widespread structural damage did occur. Hence, this case illustrates the broader impacts and benefits of international cooperation in weather research, since dynamically active regions of the atmosphere are important for forecasts both locally and “downstream.”
The collocation of strong uncertainty growth rates with moist processes (in both the MCS and WCB regions) is interesting. Does this result represent, for example, an improvement over the baroclinic singular vectors highlighted by Molteni and Palmer (1993) when nonlinear effects and moisture are included, or does it reflect deficiencies in the representation of model uncertainty? Note that model uncertainty is represented in the EDA through “stochastic perturbations to physical tendencies” (SPPT; Buizza et al. 1999), but there are no SV initial perturbations to the background forecasts. In short, how well does the forecast system represent intrinsic uncertainty growth rates and maintain reliability in these (and other) synoptic flow situations? This work thus complements the more idealized studies on predictability by Durran and Gingrich (2014) and Sun and Zhang (2016), and results such as those presented in Fig. 2 may provide useful information for such studies on the sources and scales of uncertainty that dominate present-day forecast initialization.
The trough/CAPE initial flow situation represents here a useful example with which to demonstrate the more general utility of evaluating flow-dependent reliability at short (∼12 h) forecast ranges. If we can show that short-range forecasts from an initial flow situation, such as the trough/CAPE synoptic pattern, are unreliable owing to deficiencies in uncertainty growth rates (too much or too little), then this will imply that ensemble forecasts that predict a high likelihood of a trough/CAPE situation at any lead time will become unreliable after this lead time (in an area expanding from the likely trough/CAPE event). Such arguments would apply to other synoptic flow types, for which initial uncertainty growth rates might be found to be deficient. Putting the argument the other way around, modeling developments that improve initial uncertainty growth rates for a range of flow types are a necessary requirement for more reliable (and indeed more skillful) ensemble forecasts. Here we discuss a diagnostic strategy that should help achieve this goal.
EVALUATION OF INITIAL UNCERTAINTY GROWTH.
COMPOSITE STUDY: THE TROUGH/CAPE FLOW TYPE.
As an example of a flow-dependent application of Eq. (2), we consider “jet stream winds” (zonal winds at 200 ± 15 hPa—as observed by aircraft, together with collocated model background values) for the trough/CAPE synoptic situation discussed above. A total of 54 cases were found during the period 19 November 2013–12 May 2015 (when version 40r1 of ECMWF’s Integrated Forecast System was operational) for which the initial conditions of the background forecast of the EDA unperturbed control closely matched the trough/CAPE pattern, using the same method as in Rodwell et al. (2013). Figure 3f shows the aircraft observation density for these cases. Aircraft observations are numerous over central North America at this cruising altitude, and, indeed, they are particularly influential in the data assimilation system.
Before calculating the terms in the EDA reliability budget in Eq. (2), all data are first aggregated onto an approximately equal-area grid (∼125 km²). Figures 3a–e show the resulting terms for the trough/CAPE composite. As discussed above, the budget decomposes the squared departure term (Depar²; Fig. 3a) into contributions from the bias (Bias²; Fig. 3b), ensemble variance (EnsVar; Fig. 3c), observation uncertainty (ObsUnc²; Fig. 3d), and the Residual term (Fig. 3e).
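A minimal synthetic version of this budget can be sketched as follows. The Python code below is not the operational formulation: the Gaussian error model, the parameter values, and the (1 + 1/M) finite-ensemble adjustment are illustrative assumptions. For a reliable system the Residual is near zero; deficient ensemble variance would appear as a positive Residual.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 25, 50000            # ensemble size, number of (obs, background) pairs

# Reliable synthetic system: truth is exchangeable with the ensemble members.
mu = rng.normal(0.0, 1.0, size=N)            # centre of forecast distribution
sig_ens, sig_o, bias = 0.8, 0.5, 0.1         # ensemble spread, obs error, bias
truth = mu + rng.normal(0, sig_ens, size=N)
members = mu[:, None] + bias + rng.normal(0, sig_ens, size=(N, M))
obs = truth + rng.normal(0, sig_o, size=N)

dep = obs - members.mean(axis=1)             # ensemble-mean departures
Depar2 = (dep ** 2).mean()
Bias2 = dep.mean() ** 2
EnsVar = members.var(axis=1, ddof=1).mean() * (1 + 1 / M)  # finite-size adjusted
ObsUnc2 = sig_o ** 2                         # assumed-known observation uncertainty
Residual = Depar2 - Bias2 - EnsVar - ObsUnc2
print(Residual)   # near zero when ensemble variance is sufficient
```

Shrinking the member perturbations while leaving the truth distribution unchanged would inflate the Residual, mimicking the underdispersion diagnosed in the trough/CAPE composite.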
The spatial structure of the squared departure term (Fig. 3a) indicates increased ensemble-mean forecast departures from the observations in the Great Lakes–Mississippi River region of North America in the trough/CAPE composite. This is to be expected because of the strong (and less predictable) MCS activity liable to be taking place in this region. The ensemble variance (Fig. 3c) does indicate more uncertainty (relative to surrounding regions), but notice that this does not fully account for the increased departures. The observation uncertainty term (Fig. 3d) scales roughly as the reciprocal of the number of observations aggregated into a grid cell (Fig. 3f) and is thus relatively small over North America. The residual (Fig. 3e; note the different shading interval) is particularly large in the region associated with MCS activity; it has roughly twice the magnitude of the ensemble variance and is statistically significant at the 5% significance level (as indicated by the more saturated colors).
A key question here is whether the residual (Fig. 3e) in the Great Lakes–Mississippi River region is due to an underestimation of observation errors (or observation error correlations) or a lack of modeled ensemble variance (or model representativity of point observations). At the ∼125-km² aggregation scale used here, it is unlikely that deficiencies in observation error modeling or representativity are the most important issue since similar (high) observation densities are seen over western North America (Fig. 3f) where the residual (Fig. 3e) is much smaller. Hence it is more likely that the main deficiency is an underestimation of ensemble variance in the jet-stream winds during trough/CAPE situations. From a diagnostic point of view, the key result here is that the EDA reliability budget is able to identify statistically significant flow-dependent deficiencies in reliability. Note that the EDA uses the variances of the background forecasts directly in its background error covariance matrix. This is likely to make the EDA responsive to the flow of the day, which could be important for the success of this targeting of a given flow type.
It is possible that an enhanced representation of model uncertainty in this convective region could improve the reliability budget—consistent with the results of Rodwell et al. (2016), who showed that turning off SPPT reduced mean EnsVar in convective regions by ∼60%.
There could, however, also be a role for flow-dependent systematic model error. For the trough/CAPE composite, Fig. 4c shows the convective heating at 300 hPa within the EDA control (unperturbed) background forecast. It highlights the increased likelihood for MCS activity over the Great Lakes–Mississippi River region. This convective heating is largely balanced by dynamical cooling (Fig. 4a). The “hole” in radiative cooling in this region is perhaps indicative of higher cloud tops, and, indeed, the cloud term (Fig. 4d) does also indicate active grid-scale microphysics. While there is some cancellation between all these terms, there is a positive mean “analysis increment” (Fig. 4e; note the reduced shading interval). The new observational data being assimilated evidently suggest that the background forecast at 300 hPa is systematically too cold. This could indicate that the modeled convection does not extend high enough, with the consequence that the interaction between the mesoscale convection and the jet stream is underrepresented. This could be another reason for the apparent lack of ensemble variance in zonal winds at 200 hPa (Fig. 3). Investigation will continue to better identify the reasons for the strong positive residual in Fig. 3e.
MODEL OR OBSERVATION UNCERTAINTY?
The residual in Fig. 3e is likely to be associated with deficiencies in the forecast model (or the representation of model uncertainty). However, the EDA reliability budget is also sensitive in general to the modeling of observation uncertainty; this is useful because good modeling of observation uncertainty is also important for the reliable initialization of ensemble forecasts.
There can be situations where the residual term is most clearly associated with the modeling of observation uncertainty. For example, composites have been produced based on the existence of WCBs (objectively identified through trajectory calculations; Wernli and Davies 1997; Madonna et al. 2014). The EDA reliability budget for this WCB composite, evaluated for satellite Microwave Humidity Sounder channels sensitive to midtropospheric humidity (not shown here), has an ObsUnc² term that alone is around twice the magnitude of the Depar² term. The budget thus suggests that the magnitudes of modeled observation uncertainty could be reduced. This would have the effect of drawing the EDA more strongly toward the observations, producing a sharper initial distribution. It is possible that other factors will need to be improved at the same time, including cloud detection and the forecast model’s representation of sharp inversions—particularly when trying to assimilate observations with deep weighting functions. Note that with new observation types and developments to a data assimilation system (such as the all-sky developments active in this WCB example; Geer and Bauer 2011), it is better from a reliability point of view to first err on the side of overestimating observation uncertainty.
In other situations, there may be more ambiguity as to whether it is the representation of observation or model uncertainty that is the main problem. Deficiencies in the model’s representativity of observations are a good example—which may be explored by calculating Eq. (2) without aggregating data beforehand. Additional information on observation errors (e.g., “Desroziers statistics”; Desroziers et al. 2005) might also help resolve some ambiguities.
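To illustrate how Desroziers statistics can help, the scalar Python sketch below (all values are assumptions) builds an optimal analysis and checks that the mean product of observation-minus-analysis and observation-minus-background departures recovers the observation-error variance σ_o².

```python
import numpy as np

rng = np.random.default_rng(3)
N = 200000
sig_b, sig_o = 1.0, 0.5          # true background / observation error std (assumed)

truth = rng.normal(0.0, 2.0, size=N)
xb = truth + rng.normal(0, sig_b, size=N)    # background forecasts
y = truth + rng.normal(0, sig_o, size=N)     # observations

# Scalar optimal analysis using the correct error statistics.
K = sig_b**2 / (sig_b**2 + sig_o**2)
xa = xb + K * (y - xb)

d_ob = y - xb                                # observation-minus-background
d_oa = y - xa                                # observation-minus-analysis

# Desroziers diagnostic: E[d_oa * d_ob] ~ sigma_o^2.
sig_o2_est = (d_oa * d_ob).mean()
print(sig_o2_est)   # close to sig_o**2 = 0.25
```

In an operational system the same diagnostic, applied per observation type and channel, can indicate whether assumed observation errors are too large or too small, helping to separate observation-uncertainty from model-uncertainty deficiencies in the budget.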
DISCUSSION: A DIAGNOSTIC DEVELOPMENT FRAMEWORK.
This study has been motivated by the idea that focusing modeling efforts on short-range flow-dependent reliability offers a practical framework for improving forecast skill (see "A framework for forecast system development" sidebar for a discussion based on the improvement of “proper” scores). To investigate the feasibility of this approach, the “reliability budget” in Eq. (2) of Rodwell et al. (2016), which is essentially an extension of the spread–error relationship to include observation uncertainty, has been applied here to the ECMWF EDA for a composite of cases where the trough/CAPE flow over North America was present in the initial conditions. This initial flow type is known to lead to increased forecast uncertainty for Europe at a lead time of 6 days (Rodwell et al. 2013). Results here indicate that uncertainty growth rates in the vicinity of the jet stream over North America are too weak, and this is likely to be associated with insufficient forcing by mesoscale convective systems (MCSs), which themselves have low predictability. The immediate implication of this result is that, while forecast uncertainty is large for Europe in these situations, it may not be large enough. Partly this may be a consequence of systematic errors in the height that this convection attains (and thus how strongly it interacts with the jet stream), but deficiencies in the representation of model uncertainty in such convective situations are also likely to play a role. There is scope for providing advice to users based on such knowledge. For example, if a strong trough/CAPE situation exists over North America in the initial conditions, or is likely tomorrow, then users could be advised that the current large forecast uncertainty for Europe next week is probably an underestimate and that delaying decisions (until details of the imminent MCS activity are known) might be a sensible course of action. It is also possible that calibration of forecast output (Hagedorn et al. 2008, 2012; Hamill et al. 2008; Hemri et al. 2014) could benefit in a limited way from such knowledge.
A FRAMEWORK FOR FORECAST SYSTEM DEVELOPMENT
The second line in Eq. (SB1) shows the “reliability–refinement” decomposition of the Brier score (DeGroot and Fienberg 1983). Instead of the usual approach of binning directly on forecast probabilities, it is argued here, as a thought experiment, that this decomposition can be considered as a sum over a partition of K initial synoptic-scale flow types. The probabilities that arise from any forecast initialized from a given flow type should be similar enough to be represented by a single probability bin (Stephenson et al. 2008) if the flow types are defined tightly enough, if the events are local to the flow type, and if short-enough lead times are considered. Here pk represents such forecast probabilities from each flow type, ok represent the corresponding outcome frequencies, and wk represents the fraction of forecasts within each initial flow-type bin. Modeling developments that improve short-range flow-dependent reliability will bring pk closer to ok, and thus reduce the overall reliability term, but they should have less impact on the refinement term, which is only directly dependent on the initial flow types and the verifying observations. Hence, modeling efforts focused on improving short-range flow-dependent reliability (even if this involves increasing uncertainty growth rates) should lead to improvements in the Brier score. As an aside, the use of more observations, or the extraction of more information from the observations, is likely to be important for the improvement of the refinement term. Similar arguments follow for other proper scores (see, e.g., Bentzien and Friederichs 2014), which can measure forecast performance over a wider range of event definitions.
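The decomposition described in this sidebar can be written BS = Σ_k w_k(p_k − o_k)² + Σ_k w_k o_k(1 − o_k) and checked numerically. In the Python sketch below, three hypothetical "flow types" with assumed weights, forecast probabilities, and outcome frequencies stand in for the partition; the simulated Brier score matches the sum of the reliability and refinement terms.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical partition into K = 3 initial flow types (all values assumed).
w = np.array([0.5, 0.3, 0.2])    # fraction of forecasts in each flow type
p = np.array([0.1, 0.4, 0.7])    # single probability issued for each type
o = np.array([0.2, 0.4, 0.6])    # observed outcome frequency for each type

# Simulate forecast cases and outcomes consistent with those frequencies.
N = 200000
k = rng.choice(len(w), size=N, p=w)
y = (rng.random(N) < o[k]).astype(float)

brier = ((p[k] - y) ** 2).mean()
reliability = (w * (p - o) ** 2).sum()   # reduced by better flow-dependent reliability
refinement = (w * o * (1 - o)).sum()     # depends only on flow types and outcomes
print(brier, reliability + refinement)   # the two agree closely
```

Note that bringing each p_k to its o_k drives the reliability term to zero while leaving the refinement term unchanged, which is the argument made in the sidebar.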
There remain slight differences in the way the ECMWF ENS is initialized and run compared to the background forecasts of the EDA (Lang et al. 2015), but as a more seamless EDA–ENS is developed, these short-range diagnostics should become ever more relevant to the ENS. Forecast model developments that improve short-range reliability for this and a variety of other flow types (including developments to the representation of model uncertainty; Plant and Craig 2008; Berner et al. 2009; Christensen et al. 2017; Ollinaho et al. 2017) should enable the ensemble to better maintain reliability out to ∼10 days, as the phase-space trajectories “pass through” these different flow types. (At longer lead times, slower processes that are not assessable by the EDA reliability budget will start to become important.) Further work is required before the anticipated beneficial impacts on forecast skill can be verified for the trough/CAPE flow type. However, previous application of the EDA reliability budget (without flow dependence; Rodwell et al. 2016) indicated that there may be too much uncertainty growth in regions of subtropical anticyclones. The hypothesis was that the ECMWF representation of model uncertainty (Buizza et al. 1999) might be too active in clear-sky conditions. Motivated by this diagnostic result, the model uncertainty scheme at ECMWF has been adapted to reduce its impact in clear-sky conditions, and this development is about to be implemented in the operational forecast system. Further discussion of the impact of this change will be published shortly (S.-J. Lock 2017, personal communication). The demonstration that such diagnostics can have a real beneficial impact on the development of an operational forecast system is encouraging.
More reliable initialization of ensemble forecasts can also come from improvements in the flow-specific modeling of observation error characteristics and “observation operators,” which map forecast model fields to observed quantities (Geer and Bauer 2011). Again, such modeling developments should lead to improved forecast skill.
Of course, such modeling work already takes place (at ECMWF and in other forecasting centers around the world), but the thought is that this diagnostic framework, based on short-range flow-dependent reliability, might help to organize and prioritize efforts. For example, initial flow types could be identified through localized cluster analysis and prioritized for attention based on their contribution to the overall reliability term of a proper score (see "A framework for forecast system development" sidebar) or, analogously, on their frequency-weighted residual in the EDA reliability budget in Eq. (2). This budget should also be useful in monitoring the overall development process (including the assimilation of more observational information as well as improvements to modeling), the aim being to reduce EnsVar while maintaining small Bias² and Residual terms for a range of initial flow types.
ACKNOWLEDGMENTS
The authors thank Elias Hólm, Heather Lawrence, Sam Lillo, and Ashton Robinson for the helpful discussions, and Alan Thorpe for initiating links between ECMWF and the University of Oklahoma. They would also like to thank Brian Etherton, Tom Hamill, Tim Palmer, and an anonymous reviewer for their insightful comments. One author, D.B.P., would especially like to thank Steven Cavallo and members of his research group.
REFERENCES
Bentzien, S., and P. Friederichs, 2014: Decomposition and graphical portrayal of the quantile score. Quart. J. Roy. Meteor. Soc., 140, 1924–1934, https://doi.org/10.1002/qj.2284.
Berner, J., G. J. Shutts, M. Leutbecher, and T. N. Palmer, 2009: A spectral stochastic kinetic energy backscatter scheme and its impact on flow-dependent predictability in the ECMWF ensemble prediction system. J. Atmos. Sci., 66, 603–626, https://doi.org/10.1175/2008JAS2677.1.
Browning, K. A., 1990: Organization of clouds and precipitation in extratropical cyclones. Extratropical Cyclones: The Erik Palmén Memorial Volume, C. Newton and E. Holopainen, Eds., Amer. Meteor. Soc., 129–153.
Buizza, R., M. Miller, and T. N. Palmer, 1999: Stochastic representation of model uncertainties in the ECMWF Ensemble Prediction System. Quart. J. Roy. Meteor. Soc., 125, 2887–2908, https://doi.org/10.1002/qj.49712556006.
Christensen, H. M., S.-J. Lock, I. M. Moroz, and T. N. Palmer, 2017: Introducing independent patterns into the stochastically perturbed parametrization tendencies (SPPT) scheme. Quart. J. Roy. Meteor. Soc., 143, 2168–2181, https://doi.org/10.1002/qj.3075.
Dee, D. P., and S. Uppala, 2009: Variational bias correction of satellite radiance data in the ERA-Interim reanalysis. Quart. J. Roy. Meteor. Soc., 135, 1830–1841, https://doi.org/10.1002/qj.493.
DeGroot, M., and S. Fienberg, 1983: The comparison and evaluation of forecasters. J. Roy. Stat. Soc., Ser. D (The Statistician), 32, 12–22, https://doi.org/10.2307/2987588.
Desroziers, G., L. Berre, B. Chapnik, and P. Poli, 2005: Diagnosis of observation background and analysis-error statistics in observation space. Quart. J. Roy. Meteor. Soc., 131, 3385–3396, https://doi.org/10.1256/qj.05.108.
Durran, D. R., and M. Gingrich, 2014: Atmospheric predictability: Why butterflies are not of practical importance. J. Atmos. Sci., 71, 2476–2488, https://doi.org/10.1175/JAS-D-14-0007.1.
English, S. J., R. J. Renshaw, P. C. Dibben, A. J. Smith, P. J. Rayer, C. Poulsen, F. W. Saunders, and J. R. Eyre, 2000: A comparison of the impact of TOVS and ATOVS satellite sounding data on the accuracy of numerical weather forecasts. Quart. J. Roy. Meteor. Soc., 126, 2911–2931, https://doi.org/10.1002/qj.49712656915.
Geer, A. J., and P. Bauer, 2011: Observation errors in all-sky data assimilation. Quart. J. Roy. Meteor. Soc., 137, 2024–2037, https://doi.org/10.1002/qj.830.
Gneiting, T., and A. E. Raftery, 2007: Strictly proper scoring rules, prediction, and estimation. J. Amer. Stat. Assoc., 102, 359–378, https://doi.org/10.1198/016214506000001437.
Grams, C. M., and Coauthors, 2011: The key role of diabatic processes in modifying the upper-tropospheric wave guide: A North Atlantic case-study. Quart. J. Roy. Meteor. Soc., 137, 2174–2193, https://doi.org/10.1002/qj.891.
Hagedorn, R., T. M. Hamill, and J. S. Whitaker, 2008: Probabilistic forecast calibration using ECMWF and GFS ensemble reforecasts. Part I: Two-meter temperatures. Mon. Wea. Rev., 136, 2608–2619, https://doi.org/10.1175/2007MWR2410.1.
Hagedorn, R., R. Buizza, T. M. Hamill, M. Leutbecher, and T. N. Palmer, 2012: Comparing TIGGE multimodel forecasts with reforecast-calibrated ECMWF ensemble forecasts. Quart. J. Roy. Meteor. Soc., 138, 1814–1827, https://doi.org/10.1002/qj.1895.
Hamill, T. M., R. Hagedorn, and J. S. Whitaker, 2008: Probabilistic forecast calibration using ECMWF and GFS ensemble reforecasts. Part II: Precipitation. Mon. Wea. Rev., 136, 2620–2632, https://doi.org/10.1175/2007MWR2411.1.
Harlim, J., M. Oczkowski, J. A. Yorke, E. Kalnay, and B. R. Hunt, 2005: Convex error growth patterns in a global weather model. Phys. Rev. Lett., 94, 228501, https://doi.org/10.1103/PhysRevLett.94.228501.
Hemri, S., M. Scheuerer, F. Pappenberger, K. Bogner, and T. Haiden, 2014: Trends in the predictive performance of raw ensemble weather forecasts. Geophys. Res. Lett., 41, 9197–9205, https://doi.org/10.1002/2014GL062472.
Hoskins, B. J., M. E. McIntyre, and A. W. Robertson, 1985: On the use and significance of isentropic potential vorticity maps. Quart. J. Roy. Meteor. Soc., 111, 877–946, https://doi.org/10.1002/qj.49711147002.
Isaksen, L., J. Haseler, R. Buizza, and M. Leutbecher, 2010: The new ensemble of data assimilations. ECMWF Newsletter, No. 123, ECMWF, Shinfield Park, Reading, United Kingdom, 22–27.
Jung, T., and Coauthors, 2010: The ECMWF model climate: Recent progress through improved physical parametrizations. Quart. J. Roy. Meteor. Soc., 136, 1145–1160, https://doi.org/10.1002/qj.634.
Klinker, E., and P. D. Sardeshmukh, 1992: The diagnosis of mechanical dissipation in the atmosphere from large-scale balance requirements. J. Atmos. Sci., 49, 608–627, https://doi.org/10.1175/1520-0469(1992)049<0608:TDOMDI>2.0.CO;2.
Klocke, D., and M. J. Rodwell, 2014: A comparison of two numerical weather prediction methods for diagnosing fast-physics errors in climate models. Quart. J. Roy. Meteor. Soc., 140, 517–524, https://doi.org/10.1002/qj.2172.
Lang, S. T. K., M. Bonavita, and M. Leutbecher, 2015: On the impact of re-centring initial conditions for ensemble forecasts. Quart. J. Roy. Meteor. Soc., 141, 2571–2581, https://doi.org/10.1002/qj.2543.
Leutbecher, M., and T. N. Palmer, 2008: Ensemble forecasting. J. Comput. Phys., 227, 3515–3539, https://doi.org/10.1016/j.jcp.2007.02.014.
Leutbecher, M., and S. T. K. Lang, 2014: On the reliability of ensemble variance in subspaces defined by singular vectors. Quart. J. Roy. Meteor. Soc., 140, 1453–1466, https://doi.org/10.1002/qj.2229.
Lewis, J. M., 2005: Roots of ensemble forecasting. Mon. Wea. Rev., 133, 1865–1885, https://doi.org/10.1175/MWR2949.1.
Lillo, S. P., and D. B. Parsons, 2017: Investigating the dynamics of error growth in ECMWF medium-range forecast busts. Quart. J. Roy. Meteor. Soc., 143, 1211–1226, https://doi.org/10.1002/qj.2938.
Lorenz, E. N., 1963: Deterministic nonperiodic flow. J. Atmos. Sci., 20, 130–141, https://doi.org/10.1175/1520-0469(1963)020<0130:DNF>2.0.CO;2.
Madonna, E., H. Wernli, H. Joos, and O. Martius, 2014: Warm conveyor belts in the ERA-Interim dataset (1979–2010). Part I: Climatology and potential vorticity evolution. J. Climate, 27, 3–26, https://doi.org/10.1175/JCLI-D-12-00720.1.
Molteni, F., and T. N. Palmer, 1993: Predictability and finite-time instability of the northern winter circulation. Quart. J. Roy. Meteor. Soc., 119, 269–298, https://doi.org/10.1002/qj.49711951004.
Molteni, F., R. Buizza, T. N. Palmer, and T. Petroliagis, 1996: The ECMWF ensemble prediction system: Methodology and validation. Quart. J. Roy. Meteor. Soc., 122, 73–119, https://doi.org/10.1002/qj.49712252905.
Ollinaho, P., and Coauthors, 2017: Towards process-level representation of model uncertainties: Stochastically perturbed parametrizations in the ECMWF ensemble. Quart. J. Roy. Meteor. Soc., 143, 408–422, https://doi.org/10.1002/qj.2931.
Palmer, T. N., F. Molteni, R. Mureau, R. Buizza, P. Chapelet, and J. Tribbia, 1992: Ensemble prediction. ECMWF Tech. Rep., 43 pp.
Parsons, D. B., and Coauthors, 2017: THORPEX research and the science of prediction. Bull. Amer. Meteor. Soc., 98, 807–830, https://doi.org/10.1175/BAMS-D-14-00025.1.
Plant, R. S., and G. C. Craig, 2008: A stochastic parameterization for deep convection based on equilibrium statistics. J. Atmos. Sci., 65, 87–105, https://doi.org/10.1175/2007JAS2263.1.
Rabier, F., H. Järvinen, E. Klinker, J.-F. Mahfouf, and A. Simmons, 2000: The ECMWF operational implementation of four-dimensional variational assimilation. I: Experimental results with simplified physics. Quart. J. Roy. Meteor. Soc., 126, 1143–1170, https://doi.org/10.1002/qj.49712656415.
Richardson, D. S., 2000: Skill and relative economic value of the ECMWF ensemble prediction system. Quart. J. Roy. Meteor. Soc., 126, 649–667, https://doi.org/10.1002/qj.49712656313.
Rodwell, M. J., 2016: Using ensemble data assimilation to diagnose flow-dependent forecast reliability. ECMWF Newsletter, No. 146, ECMWF, Shinfield Park, Reading, United Kingdom, 29–34.
Rodwell, M. J., and T. N. Palmer, 2007: Using numerical weather prediction to assess climate models. Quart. J. Roy. Meteor. Soc., 133, 129–146, https://doi.org/10.1002/qj.23.
Rodwell, M. J., and T. Jung, 2008: Understanding the local and global impacts of model physics changes: An aerosol example. Quart. J. Roy. Meteor. Soc., 134, 1479–1497, https://doi.org/10.1002/qj.298.
Rodwell, M. J., and Coauthors, 2013: Characteristics of occasional poor medium-range weather forecasts for Europe. Bull. Amer. Meteor. Soc., 94, 1393–1405, https://doi.org/10.1175/BAMS-D-12-00099.1.
Rodwell, M. J., S. T. K. Lang, N. B. Ingleby, N. Bormann, E. Hólm, F. Rabier, D. S. Richardson, and M. Yamaguchi, 2016: Reliability in ensemble data assimilation. Quart. J. Roy. Meteor. Soc., 142, 443–454, https://doi.org/10.1002/qj.2663.
Sanders, F., 1958: The evaluation of subjective probability forecasts. MIT Dept. of Earth, Atmospheric and Planetary Sciences Sci. Rep. 5, 63 pp.
Smith, L. A., C. Ziehmann, and K. Fraedrich, 1999: Uncertainty dynamics and predictability in chaotic systems. Quart. J. Roy. Meteor. Soc., 125, 2855–2886, https://doi.org/10.1002/qj.49712556005.
Stephenson, D. B., C. A. S. Coelho, and I. T. Jolliffe, 2008: Two extra components in the Brier score decomposition. Wea. Forecasting, 23, 752–757, https://doi.org/10.1175/2007WAF2006116.1.
Sun, Y. Q., and F. Zhang, 2016: Intrinsic versus practical limits of atmospheric predictability and the significance of the butterfly effect. J. Atmos. Sci., 73, 1419–1438, https://doi.org/10.1175/JAS-D-15-0142.1.
Sutton, O. G., 1954: The development of meteorology as an exact science. Quart. J. Roy. Meteor. Soc., 80, 328–338, https://doi.org/10.1002/qj.49708034503.
Swinbank, R., and Coauthors, 2016: The TIGGE project and its achievements. Bull. Amer. Meteor. Soc., 97, 49–67, https://doi.org/10.1175/BAMS-D-13-00191.1.
Wernli, H., and H. C. Davies, 1997: A Lagrangian-based analysis of extratropical cyclones. I: The method and some applications. Quart. J. Roy. Meteor. Soc., 123, 467–489, https://doi.org/10.1002/qj.49712353811.
Whitaker, J. S., and A. F. Loughe, 1998: The relationship between ensemble spread and ensemble mean skill. Mon. Wea. Rev., 126, 3292–3302, https://doi.org/10.1175/1520-0493(1998)126<3292:TRBESA>2.0.CO;2.