## Abstract

The full range of ensemble forecast members, expressed in percentile format, can give additional valuable information to users compared to a deterministic solution. This study contains the first rigorous assessment of the operational U.K. site-specific percentile forecast generated by the Met Office. Maximum meteorological summer daytime temperature forecasts issued between 2014 and 2016 are analyzed using the ranked probability score (RPS), ranked probability skill score (RPSS), categorized mean squared error (MSE), quantile skill score (QSS), and relative economic value (REV). Site-specific observed climatology is used to define the temperature threshold for each category (where applicable), thereby ensuring identical categorical event base rates across all 99 sites considered. Forecast ranges between 6 and 120 h are assessed, with the RPS decomposition indicating no perceivable change in the reliability yet an almost linear decrease in skill with forecast range, due solely to a near-linear decrease in the resolution. Using the categorized MSE (the deterministic equivalent to the RPS), the probabilistic forecast is shown to possess more skill than its deterministic counterpart and the disparity between these scores increases with the forecast range. This finding is reinforced by a REV assessment; this indicates that the economic value associated with the probabilistic envelope is greater than that associated with the deterministic solution at the majority of cost–loss ratios. The QSS appears to be well correlated with the RPSS (*r*_{s} = 0.852 at *T* + 24) and identifies the outlying quantiles of the probabilistic forecast as being the least skillful.

## 1. Introduction

Subjective use of climatology to contextualize weather conditions has been the recent focus of the meteorological community particularly in the context of warnings of events that are likely to cause impacts (Neal et al. 2014). Objective use of site-specific climatological information to identify thresholds has also been the basis of a paper by Magnusson et al. (2014) and more recently by Sharpe et al. (2018). The present study extends this methodology to the everyday summer daytime maximum temperatures reported by the U.K. surface observing network and forecast daily by postprocessed probabilistic and deterministic site-specific numerical weather prediction models developed by the Met Office. Four temperature thresholds are used to align the continuous output from these models with public perceptions, thereby categorizing each forecast as a very cool, cool, typical, warm, or very warm summer day. Therefore, the focus of this study is an attempt to adapt established verification metrics to reflect the perceptions of forecast users in an objective way, since a member of the public is (arguably) more likely to interpret the forecast in this way. At each site, the numerical value of each threshold is obtained by selecting a percentile from the cumulative distribution function formed by the long-term climatology at that site. The use of climatological information in the assessment of numerical weather prediction models is not a new concept. NCEP has used equally likely climate-based bins for the evaluation of its models since 1995 (Zhu et al. 1996). However, in the context of gridded models, reanalysis data are often used for the climatology (Zhu and Toth 2008); whereas, restricting the analysis to station sites allows the use of observed climatology instead. Site-specific data are also more appropriate to members of the public who access data for their location using the Met Office app or website. 
The frequency of occurrence (base rate) of each category will be the same at every site because the same percentile is chosen for each of the four thresholds. However, the resulting numerical values will depend upon the site-specific climatology, so sites with distinctly different climatologies will have very different threshold values. Although the Met Office does not currently convey its operational forecasts in terms of site-specific climatological categories, such expression would be possible and of potential benefit to those who crave contextualization of weather information. Currently temperatures are presented in terms of a number (degrees Celsius in the United Kingdom) and a single all-site color range. However, both of these can be misleading to members of the public. For instance, some individuals may only appreciate temperature in degrees Fahrenheit. Similarly, an all-site color range is not an adequate representation since U.K. site climatologies vary significantly. For example, when an all-site color range is adopted the blue (commonly used to represent cold) is more likely to represent typical conditions in northerly or high-altitude sites.

In this study three measures are used to assess the site-specific probabilistic forecasts: the ranked probability score (RPS) (Epstein 1969), the quantile score (QS) (Ben Bouallègue et al. 2015; Bentzien and Friederichs 2014), and the relative economic value (REV) (Richardson 2000). Both the REV and the RPS are used to compare the skill of the deterministic and probabilistic forecasts, where the deterministic forecast is assessed using a categorized version of the mean squared error (MSE). The QS has been adopted in preference to the arguably more commonly used continuous ranked probability score (CRPS) because the site-specific, postprocessed, probabilistic forecast is expressed in terms of quantiles (rather than ensemble members). An assessment of the effect on performance of using climate-based categories is achieved through a comparison of the QS and the RPS.

Section 2 contains a discussion of the site-specific climatology generation; section 3 describes the methodologies used to verify the forecast; section 4 presents the results of analyzing forecasts issued during the meteorological summers of 2014, 2015, and 2016; and section 5 gives brief conclusions.

## 2. Site-specific climatology

Following the rationale outlined in Magnusson et al. (2014) temperature data from the quality controlled U.K. observing network (Met Office 2012) rather than from model analyses were used to generate a climatological cumulative distribution function at site *i*.

With the help of the National Climate Information Centre a climatological cumulative distribution function was generated at each U.K. observing site using the 30-yr time series of quality controlled observational data collected between 1983 and 2012. However, 45 of the 144 sites available in the Met Office meteorological database were excluded from the analysis because less than 20 years of summer daytime maximum temperature data were available during this period, and this was deemed insufficient to adequately represent the climatology. At the remaining 99 sites, five categories are used to assess the forecast. To aid forecast communication to users (who typically struggle to appreciate climatological percentiles) category thresholds were chosen based on a whole number of days. Therefore, the daily meteorological summer (June, July, August) maximum temperature condition was defined as follows:

Very cool if it was one of the 10 coolest days of a typical meteorological summer;

Cool if it corresponds to a temperature between the 10th and 20th coolest day of a typical meteorological summer;

Typical if it corresponds to a temperature between the 20th coolest and 20th warmest day of a typical meteorological summer;

Warm if it corresponds to a temperature between the 10th and 20th warmest day of a typical meteorological summer; and

Very warm if it was one of the 10 warmest days of a typical meteorological summer.

Although these categories correspond to unusual percentile choices (viz., values of 0.1087, 0.2174, 0.7826, and 0.8913) they allow the forecast to be communicated in a natural and simple way, without the need to mention percentiles of the climatological distribution. Figure 1 displays the distribution formed by the upper threshold (which also serves as the lower threshold of the higher category) of the lowest four categories at each valid site. A negative skew is immediately obvious with far more sites lying toward the warmer end of the scale. This reflects the unequal geographical distribution of sites within the United Kingdom, perhaps in line with population proportions. The site located on Cairngorm Summit (57.117°N, 3.633°W; altitude 1245 m) stands out as the only outlier; it is interesting to note that the warmest 10-day threshold at this site is cooler than the coolest 10-day threshold at 86 of the remaining 98 sites. None of the other sites considered in this study have an altitude > 500 m. There is a tendency for the maximum daily summer temperature to decrease with increasing latitude; however, Cairngorm Summit remains an outlier (of the distribution formed by the coolest three categories) even when the sites are restricted to those only located in Scotland. Fair Isle (59.533°N, 1.633°W; altitude 68 m) is the only other outlier for the distributions formed by the coolest 10- and 20-day category thresholds at Scottish sites.
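As an illustration, the thresholds at one site could be derived from its 30-yr record along the following lines. The percentile levels 10/92, 20/92, 72/92, and 82/92 correspond to the 10- and 20-day definitions above (a 92-day meteorological summer); the synthetic climatology and the function names are purely illustrative, not the Met Office implementation.

```python
# Sketch: deriving the four category thresholds for one site from its 30-yr
# summer climatology. The temperature sample and names are invented.
import numpy as np

def category_thresholds(summer_max_temps):
    """Thresholds separating very cool/cool, cool/typical, typical/warm,
    and warm/very warm days at a single site."""
    levels = [10 / 92, 20 / 92, 72 / 92, 82 / 92]  # 0.1087, 0.2174, 0.7826, 0.8913
    return np.quantile(np.asarray(summer_max_temps, float), levels)

def categorize(temp, thresholds):
    """Category index 0..4 (very cool, cool, typical, warm, very warm);
    a temperature equal to a threshold falls in the warmer category."""
    return int(np.searchsorted(thresholds, temp, side="right"))

rng = np.random.default_rng(0)
climatology = rng.normal(20.0, 3.0, size=30 * 92)  # 30 synthetic summers
t = category_thresholds(climatology)
```

Because the same percentile levels are applied everywhere, the base rate of each category is identical across sites even though the threshold temperatures themselves differ.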

The subfigures of Fig. 2 show the geographical distribution of the threshold values in Fig. 1. Two temperature scales have been used to display these data, thereby facilitating a clearer reflection of the spread of temperatures at sites across the United Kingdom and enabling intersite comparison. The warmest site thresholds are located in the southeast (as expected) and the temperatures appear to decrease with increasing distance from London, with the coolest (blue) stations located in Scotland and the north of England, west Wales, and the far southwest of England. There also appears to be an indication that at similar latitudes, coastal and mountainous sites tend to be cooler than inland sites at lower altitudes. The warmest site is St. James's Park, where

Very cool is a temperature less than 18.2°C;

Cool is a temperature greater than or equal to 18.2°C and less than 19.5°C;

Typical is a temperature greater than or equal to 19.5°C and less than 25.5°C;

Warm is a temperature greater than or equal to 25.5°C and less than 27.6°C; and

Very warm is a temperature greater than or equal to 27.6°C.

For comparison, at the coolest site (the Cairngorm Mountains):

Very cool is a temperature less than 11.0°C;

Cool is a temperature greater than or equal to 11.0°C and less than 11.9°C;

Typical is a temperature greater than or equal to 11.9°C and less than 14.7°C;

Warm is a temperature greater than or equal to 14.7°C and less than 15.5°C; and

Very warm is a temperature greater than or equal to 15.5°C.

## 3. Verification methodology

### a. Deterministic and probabilistic data

Postprocessed, site-specific, Met Office forecasts are derived by blending model data from various deterministic and ensemble model forecasts (Moseley 2011); this procedure produces a spread of possible scenarios, each of which is accompanied by a probability of occurrence, together with a deterministic forecast. The concept is to create an optimal blend of all available observations and models to produce the most accurate possible site-specific forecast. Gridded nowcasts, regional (high resolution), and global (lower resolution) numerical weather prediction deterministic and ensemble models (generated by both the Met Office and ECMWF) are used for this purpose. Where site-specific observations are available, Kalman filtering is also applied and then the different forecasts are combined using lagging and blending techniques.

Deterministic forecast data are available via the online Met Office forecast database DataPoint (Met Office 2018). Probabilistic model data are currently unpublished but are available on request from the Met Office via a Freedom of Information request to enquiries@metoffice.gov.uk, citing this paper. The postprocessed, probabilistic, site-specific cumulative forecast distribution function that the Met Office routinely produces comprises the following 15 percentiles: 0.025, 0.05, 0.10, 0.20, 0.25, 0.30, 0.40, 0.50, 0.60, 0.70, 0.75, 0.80, 0.90, 0.95, and 0.975; the corresponding quantiles are denoted by $q_1$ to $q_{15}$. A probabilistic forecast provides more information about the spread of possible outcomes, enabling more informed decisions and potentially providing a more reliable service. In contrast, a deterministic forecast consists of just one value: a blend of the ensemble means and deterministic model solutions. Although deterministic data may appear more precise and useful to the forecast user (since little interpretation is required), this simplicity can be misleading because it can give a false impression of certainty.
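For illustration, a 15-quantile forecast of this kind can be queried for the probability that the temperature stays below a given threshold by interpolating between the quantile values. The quantile temperatures below are invented, and linear interpolation (with clamping outside the 0.025–0.975 range) is just one plausible choice; section 3b describes the simpler non-interpolating rule actually adopted for binning in this study.

```python
# Illustrative sketch: querying a 15-quantile site forecast for
# P(T <= threshold) by linear interpolation between quantile values.
import numpy as np

PERCENTILES = np.array([0.025, 0.05, 0.10, 0.20, 0.25, 0.30, 0.40, 0.50,
                        0.60, 0.70, 0.75, 0.80, 0.90, 0.95, 0.975])

def prob_not_exceeding(threshold, quantile_temps):
    """Approximate P(T <= threshold) from the 15 forecast quantile values,
    clamping to 0.025/0.975 outside the forecast range."""
    q = np.asarray(quantile_temps, float)  # must be monotonically increasing
    return float(np.interp(threshold, q, PERCENTILES))

quantile_temps = np.linspace(16.0, 26.0, 15)  # hypothetical quantile values
```

A call such as `prob_not_exceeding(21.0, quantile_temps)` then returns the interpolated cumulative probability at 21.0°C for this hypothetical forecast.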

### b. Ranked probability score decomposition

The negatively orientated ranked probability score (Epstein 1969) is a proper score for probabilistic forecasts based on ranked categories; it ranges from 0 (a perfect forecast) to 1. A site-specific value of the RPS is evaluated using Eq. (1):

$$\mathrm{RPS}_i = \frac{1}{M N_i}\sum_{k=1}^{N_i}\sum_{m=1}^{M}\big[F_{i,k}(t_{i,m}) - H(t_{i,m} - o_{i,k})\big]^2,\tag{1}$$

where $m = 1, \ldots, M$ (with $M = 4$) denotes the temperature category threshold, $N_i$ is the number of forecasts at site $i$, $F_{i,k}$ is the value of the cumulative forecast distribution function at site $i$ for forecast $k$, $t_{i,m}$ denotes the upper boundary to the $m$th temperature category at site $i$, $o_{i,k}$ is the observed value at site $i$ during forecast $k$, and $H$ is the Heaviside step function

$$H(x) = \begin{cases} 1, & x \ge 0, \\ 0, & x < 0. \end{cases}$$

Then an all-site value for the RPS is evaluated using

$$\mathrm{RPS} = \frac{\sum_{i} N_i\,\mathrm{RPS}_i}{\sum_{i} N_i},\tag{2}$$

that is, the average over every forecast issued at every site.

The RPS is the mean squared error of a probabilistic multicategory forecast; therefore, a comparison between probabilistic and deterministic forecasts is achieved by making a direct comparison between Eq. (1) for the RPS and the categorized version of the MSE expressed by

$$\mathrm{MSE}_i = \frac{1}{M N_i}\sum_{k=1}^{N_i}\sum_{m=1}^{M}\big[H(t_{i,m} - d_{i,k}) - H(t_{i,m} - o_{i,k})\big]^2,\tag{3}$$

where $d_{i,k}$ denotes deterministic forecast *k* at site *i*.
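A minimal sketch of how the site-specific RPS [Eq. (1)] and the categorized MSE [Eq. (3)] can be computed, assuming the forecast CDF has already been evaluated at the four category thresholds; the array layout and function names are illustrative assumptions, not the operational code.

```python
# Sketch: RPS for a probabilistic forecast and categorized MSE for its
# deterministic counterpart, averaged over forecasts and thresholds.
import numpy as np

def rps(forecast_cdfs, observations, thresholds):
    """forecast_cdfs[k][m]: forecast probability that the temperature does
    not exceed threshold m; observations[k]: observed maximum temperature."""
    F = np.asarray(forecast_cdfs, float)  # shape (K forecasts, M thresholds)
    O = np.asarray(observations, float)[:, None] <= np.asarray(thresholds)[None, :]
    return float(np.mean((F - O) ** 2))

def categorized_mse(det_forecasts, observations, thresholds):
    """Deterministic counterpart: the forecast CDF is a unit step at the
    deterministic value, so every squared term is either 0 or 1."""
    D = np.asarray(det_forecasts, float)[:, None] <= np.asarray(thresholds)[None, :]
    O = np.asarray(observations, float)[:, None] <= np.asarray(thresholds)[None, :]
    return float(np.mean((D.astype(float) - O.astype(float)) ** 2))
```

Because both functions average the same squared differences over the same thresholds, the two scores are directly comparable, which is the basis of the probabilistic-versus-deterministic comparison in section 4.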

The Brier score (BS) (Brier 1950) compares forecast probabilities with binary observations in relation to a single threshold instead of multiple categories. Equation (4) gives an expression for the BS at site $i$:

$$\mathrm{BS}_i(m) = \frac{1}{N_i}\sum_{k=1}^{N_i}\big[F_{i,k}(t_{i,m}) - H(t_{i,m} - o_{i,k})\big]^2;\tag{4}$$

therefore, RPS_{i} is simply the mean of BS_{i}(*m*) over each category, that is,

$$\mathrm{RPS}_i = \frac{1}{M}\sum_{m=1}^{M}\mathrm{BS}_i(m).\tag{5}$$

Sanders (1963) and Murphy (1973) describe how the BS decomposes into reliability (REL), uncertainty (UNC), and resolution (RES) terms; while UNC is forecast independent, both RES and REL are important indicators of forecast quality. Reliable forecasts contain probabilities that are close to the observed frequency of occurrence, whereas well-resolved forecasts contain probabilities that differ significantly from the event base rate. The BS decomposition is expressed by

$$\mathrm{BS}_i(m) = \mathrm{REL}_i(m) - \mathrm{RES}_i(m) + \mathrm{UNC}_i(m),\tag{6a}$$

where

$$\mathrm{REL}_i(m) = \frac{1}{N_i}\sum_{j=1}^{J} n_{i,j}(m)\big[p_j - \bar{o}_{i,j}(m)\big]^2,\tag{6b}$$

$$\mathrm{RES}_i(m) = \frac{1}{N_i}\sum_{j=1}^{J} n_{i,j}(m)\big[\bar{o}_{i,j}(m) - \bar{o}_i(m)\big]^2,\tag{6c}$$

$$\mathrm{UNC}_i(m) = \bar{o}_i(m)\big[1 - \bar{o}_i(m)\big].\tag{6d}$$

In these equations $J$ is the number of probability bins, $p_j$ is the probability associated with bin *j*, $n_{i,j}(m)$ is the number of times the threshold corresponding to category $m$ at site $i$ was predicted with a probability within bin *j*, $\bar{o}_{i,j}(m)$ is the proportion of those occasions on which threshold $m$ was exceeded at site $i$ given that it was predicted with a probability within bin *j*, and $\bar{o}_i(m)$ is the proportion of forecast times during which threshold $m$ was exceeded at site $i$. The RES, REL, and UNC terms for the RPS are easily obtained by combining Eqs. (5) and (6), giving

$$\mathrm{REL}_i = \frac{1}{M}\sum_{m=1}^{M}\mathrm{REL}_i(m),\tag{7a}$$

$$\mathrm{RES}_i = \frac{1}{M}\sum_{m=1}^{M}\mathrm{RES}_i(m),\tag{7b}$$

$$\mathrm{UNC}_i = \frac{1}{M}\sum_{m=1}^{M}\mathrm{UNC}_i(m),\tag{7c}$$

respectively.
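The single-threshold decomposition can be sketched as follows; the decile binning and the bin-assignment rule are assumptions for illustration. Note that the identity BS = REL − RES + UNC is exact only when all probabilities within a bin are identical (otherwise within-bin variance terms appear, the point addressed by Stephenson et al. 2008 and discussed below).

```python
# Sketch: Murphy (1973) reliability/resolution/uncertainty decomposition of
# the Brier score for one threshold, using fixed equally sized bins.
import numpy as np

def brier_decomposition(probs, outcomes, n_bins=10):
    """probs[k]: forecast probability of exceeding the threshold;
    outcomes[k]: 1 if the threshold was exceeded, else 0."""
    p = np.asarray(probs, float)
    o = np.asarray(outcomes, float)
    base_rate = o.mean()
    bins = np.clip((p * n_bins).astype(int), 0, n_bins - 1)  # bin index per forecast
    rel = res = 0.0
    for j in range(n_bins):
        mask = bins == j
        n_j = mask.sum()
        if n_j == 0:
            continue
        p_j = p[mask].mean()   # mean forecast probability in bin j
        o_j = o[mask].mean()   # observed frequency in bin j
        rel += n_j * (p_j - o_j) ** 2
        res += n_j * (o_j - base_rate) ** 2
    n = len(p)
    return rel / n, res / n, base_rate * (1.0 - base_rate)
```

Averaging these three terms over the four thresholds gives the RPS components of Eqs. (7a)–(7c).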

The decomposition in Eqs. (6a)–(6d) is based on a system of forecast probability binning, and it is common practice to use equally sized probability bins, for which deciles are often chosen. The postprocessed multimodel probabilistic forecast is expressed in terms of 15 quantiles; therefore, interpolation would usually be necessary to obtain an exact forecast probability associated with each temperature threshold *m*. The value of the probabilities within each bin will vary from forecast to forecast, and so Stephenson et al. (2008) devised a new (more complicated) decomposition of the BS that accounts for the within-bin variance. Generalizing this result to the RPS would add a further level of complexity; therefore, in this study, the probability associated with each threshold *m* has been chosen as the minimum forecast quantile that has a value less than the threshold. This choice significantly reduces the complexity because no interquantile interpolation is required and the within-bin variance reduces to zero. The impact of this simplification is likely to be minimal, whereas accounting for the within-bin variance in Eqs. (7a)–(7c) would add considerable complexity.

### c. Ranked probability skill score

The ranked probability skill score (RPSS) is positively orientated, ranges from −∞ to 1, and is evaluated using

$$\mathrm{RPSS}_i = 1 - \frac{\mathrm{RPS}_i}{\mathrm{RPS}_i^{\mathrm{ref}}},\tag{8}$$

where $\mathrm{RPS}_i^{\mathrm{ref}}$ denotes the value of the RPS achieved when a suitably chosen reference forecast is substituted for the actual forecast. A positive value for the RPSS indicates that the issued forecast is more skillful than the reference.

The RPSS can be expressed as

$$\mathrm{RPSS}_i = 1 - \frac{\mathrm{REL}_i - \mathrm{RES}_i + \mathrm{UNC}_i}{\mathrm{RPS}_i^{\mathrm{ref}}},\tag{9}$$

where $\mathrm{REL}_i$, $\mathrm{RES}_i$, and $\mathrm{UNC}_i$ are given by Eq. (7) and since $0 \le \mathrm{RPS}_i \le 1$, it follows that

$$\mathrm{RPSS}_i \ge 1 - \frac{1}{\mathrm{RPS}_i^{\mathrm{ref}}} = \frac{\mathrm{RPS}_i^{\mathrm{ref}} - 1}{\mathrm{RPS}_i^{\mathrm{ref}}},\tag{10}$$

where only the left-hand side is dependent on the forecast. This demonstrates how forecast skill is bounded from below by the (forecast independent) ratio of $\mathrm{RPS}_i^{\mathrm{ref}} - 1$ to $\mathrm{RPS}_i^{\mathrm{ref}}$, an expression which may be different for each site. Consequently, to ensure that the minimum possible value for $\mathrm{RPSS}_i$ is the same at every site it is necessary to set $\mathrm{RPS}_i^{\mathrm{ref}} = \mathrm{UNC}_i$. In other words, when performing intersite comparisons, the sample uncertainty should be used as the reference forecast and this generates the following expression for the skill score,

$$\mathrm{RPSS}_i = \frac{\mathrm{RES}_i - \mathrm{REL}_i}{\mathrm{UNC}_i},\tag{11}$$

obtained via the application of Eqs. (5), (6a), (7), and (8). Clearly, at sites where the frequency of occurrence of each category during the assessment period is almost identical to its long-term climatological equivalent, the choice of reference forecast is immaterial. However, the RPSS will indicate improved performance at sites where these frequencies differ because climatologically based frequencies for the reference forecast will not produce an optimal reference RPS. Therefore, in the following sections the sample uncertainty $\mathrm{UNC}_i$ has been chosen as the reference forecast for any figure that displays site-specific values. However, choosing site-specific climatology is appropriate from the standpoint of the operational environment, in which it is necessary to issue a forecast every day: if, for some technical reason, no forecast were available at the time of broadcast, there must be a fallback position. Therefore, all-site RPSS performance figures use the site-specific climatologically based frequencies for the reference forecast.
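As a concrete illustration of the sample-uncertainty reference, the sketch below scores a reference forecast that always issues the trial-period cumulative frequency of each category; names and data shapes are illustrative.

```python
# Sketch: RPS of the "sample uncertainty" reference forecast, which always
# issues the observed trial-period frequency of each category, plus the
# generic skill score of Eq. (8).
import numpy as np

def sample_uncertainty_reference(observations, thresholds):
    """RPS obtained by always forecasting the trial-period cumulative
    frequencies; this equals the mean UNC term of the decomposition."""
    O = (np.asarray(observations, float)[:, None]
         <= np.asarray(thresholds, float)[None, :]).astype(float)
    clim_cdf = O.mean(axis=0)                    # sample cumulative frequencies
    return float(np.mean((clim_cdf[None, :] - O) ** 2))

def rpss(rps_forecast, rps_reference):
    """Positive values mean the forecast beats the reference."""
    return 1.0 - rps_forecast / rps_reference
```

For a constant forecast equal to the sample frequencies, each threshold contributes $p(1-p)$, so the returned value coincides with the (threshold-averaged) uncertainty term.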

### d. Quantile score

The quantile score (Gneiting and Raftery 2007; Bentzien and Friederichs 2014) is a proper scoring method that sums a weighted absolute error between forecasts and observations. Unlike the other scores considered in this study, it does not use categorical thresholds; instead, the QS considers each quantile separately. This is ideal for verifying the postprocessed probabilistic Met Office forecast because it is expressed in terms of 15 quantiles. Ben Bouallègue et al. (2015) define the QS as

$$\mathrm{QS}_\tau = \frac{1}{N}\sum_{n=1}^{N}\rho_\tau\big(y_n - q_{\tau,n}\big),\tag{12}$$

where *τ* is the probability level of the quantile, $q_{\tau,n}$ is the *n*th forecast value for this quantile, $\rho_\tau$ is a conditional indicator function known as the check loss function [first defined in the context of regression by Koenker and Bassett (1978)], $y_n$ is the observation, and

$$\rho_\tau(u) = u\big[\tau - I(u < 0)\big].$$

Consequently, Eq. (12) becomes

$$\mathrm{QS}_\tau = \frac{1}{N}\left[\sum_{n:\,y_n \ge q_{\tau,n}} \tau\,\big|y_n - q_{\tau,n}\big| + \sum_{n:\,y_n < q_{\tau,n}} (1 - \tau)\,\big|y_n - q_{\tau,n}\big|\right],\tag{13}$$

and just like the RPS, it is possible to decompose the QS into reliability, resolution, and uncertainty components. Similarly, the quantile skill score (QSS) is evaluated using

$$\mathrm{QSS}_\tau = 1 - \frac{\mathrm{QS}_\tau}{\mathrm{QS}_\tau^{\mathrm{ref}}},\tag{14}$$

where (just as for the RPSS) the reference forecast is usually chosen to be either the long-term site-specific quantile climatology or the sample quantile uncertainty term given by

$$\mathrm{UNC}_\tau = \frac{1}{N}\sum_{n=1}^{N}\rho_\tau\big(y_n - \bar{q}_\tau\big).\tag{15}$$

Here $\bar{q}_\tau$ (the so-called unconditional sample quantile) is the *τ*th quantile of the observed distribution; therefore, $\mathrm{UNC}_\tau$ is a characteristic of the observations and as such it is completely independent of the forecast.
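A minimal sketch of the check-loss (pinball) quantile score and a skill score against a constant reference quantile; function names are illustrative assumptions.

```python
# Sketch: check-loss quantile score at probability level tau, and a QSS
# relative to always forecasting a fixed reference quantile value.
import numpy as np

def quantile_score(tau, quantile_forecasts, observations):
    """Mean check loss: (1 - tau)|u| when the observation falls below the
    forecast quantile, tau|u| when it falls at or above it."""
    q = np.asarray(quantile_forecasts, float)
    y = np.asarray(observations, float)
    u = y - q
    return float(np.mean(u * (tau - (u < 0))))

def qss(tau, quantile_forecasts, observations, reference_quantile):
    """Skill relative to a constant reference quantile (e.g., climatology)."""
    y = np.asarray(observations, float)
    qs = quantile_score(tau, quantile_forecasts, y)
    qs_ref = quantile_score(tau, np.full_like(y, reference_quantile), y)
    return 1.0 - qs / qs_ref
```

Averaging `quantile_score` over the 15 forecast probability levels gives the all-quantile QSS used for the comparison with the RPSS in section 4.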

### e. Relative economic value

The cost–loss ratio (Thompson 1952; Murphy 1977) is a decision-making analysis tool that considers forecasts in terms of a user’s decision to act against an adverse event. Fundamental to this approach is an expected cost *C* associated with taking action (irrespective of whether the event occurs or not) and an expected loss *L* associated with a missed event (given that no action is taken). The relative economic value is the ratio of the expected saving achieved by using the actual forecast service *E*(forecast) to the expected saving achieved by a perfect forecast service *E*(perfect forecast), both measured relative to the expected expense of relying on climatology alone, *E*(climate). Consequently

$$\mathrm{REV} = \frac{E(\text{climate}) - E(\text{forecast})}{E(\text{climate}) - E(\text{perfect forecast})},\tag{16}$$

where

$$E(\text{forecast}) = \frac{(h + f)C + mL}{h + f + m + \mathrm{cr}},\tag{16a}$$

$$E(\text{perfect forecast}) = \frac{(h + m)C}{h + f + m + \mathrm{cr}},\tag{16b}$$

$$E(\text{always act}) = C,\tag{16c}$$

$$E(\text{never act}) = \frac{(h + m)L}{h + f + m + \mathrm{cr}},\tag{16d}$$

and $E(\text{climate}) = \min[E(\text{always act}), E(\text{never act})]$. Here *h*, *f*, *m*, and cr refer to the number of hits, false alarms, misses, and correct rejections, respectively. Dividing both numerator and denominator by *L* introduces the *C*/*L* ratio into Eqs. (16a)–(16d), thereby facilitating the evaluation of the REV for different cost–loss ratios. A particular advantage of this assessment method is that it allows for easy comparison between deterministic and probabilistic forecasts.
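The REV calculation for a single cost–loss ratio can be sketched as follows, with expenses expressed per forecast and in units of *L* so that *C*/*L* is the only cost parameter required; the function name and contingency-table inputs are illustrative.

```python
# Sketch: relative economic value for one C/L ratio from a 2x2 contingency
# table (hits, false alarms, misses, correct rejections).
def relative_economic_value(hits, false_alarms, misses, correct_rejections, cl_ratio):
    n = hits + false_alarms + misses + correct_rejections
    h, f, m = hits / n, false_alarms / n, misses / n
    e_forecast = (h + f) * cl_ratio + m   # act on warnings, lose on misses
    e_perfect = (h + m) * cl_ratio        # act only when the event occurs
    e_climate = min(cl_ratio, h + m)      # cheaper of always/never acting
    if e_climate == e_perfect:
        return float("nan")               # REV undefined at this C/L ratio
    return (e_climate - e_forecast) / (e_climate - e_perfect)
```

Sweeping `cl_ratio` over (0, 1) for each forecast quantile in turn, and taking the pointwise maximum, reproduces the quantile curves and the probabilistic envelope plotted in Fig. 11.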

## 4. Results

The present study examines the ability with which the postprocessed site-specific probabilistic and deterministic forecast products issued by the Met Office predict the maximum summer daytime temperature categories defined by the thresholds displayed in Fig. 2 during the meteorological summers between 2014 and 2016. Although these forecasts are generated by blending various probabilistic and deterministic models, the final probabilistic forecast solution is heavily weighted toward the output from the Met Office Global and Regional Ensemble Prediction System (MOGREPS). During the first 72 h, the PDF generated by the latest model run of the U.K. version of MOGREPS is given a weight of 70% in the probabilistic forecast, whereas at forecast ranges greater than 72 h the global version of MOGREPS is given this weighting, with a smaller contribution coming from the medium-range global ensemble forecast produced by the European Centre for Medium-Range Weather Forecasts (ECMWF).

Figure 3 displays the RPS together with its decomposed components as a function of increasing forecast range between 6 and 120 h. An increase in the RPS with forecast range is expected; what is surprising is the near-linear nature of this increase, from 0.04 at *T* + 6 to 0.075 at *T* + 120. This appears to be solely due to a decrease in RES, since both UNC and REL are almost invariant with respect to the forecast range. UNC is forecast independent so its invariance is expected; however, the invariance of REL with respect to forecast range is a positive result. A decrease in RES with increasing forecast range indicates that the probabilities within the forecast tend toward the local climatology, as often speculated on a subjective examination of site-specific meteograms (box-and-whisker plots generated by either ECMWF or the Met Office).

Figure 4 displays the all-site RPSS together with its decomposed REL and RES components. Using UNC as the reference forecast ensures that the RPSS is simply the difference between RES/UNC and REL/UNC. Figure 3 showed how UNC and REL were almost invariant to the forecast range; consequently, it is no surprise to observe that both the RPSS and RES/UNC decrease in a near-linear fashion with increasing forecast range while REL/UNC remains almost invariant. Between *T* + 6 and *T* + 30, the most highly weighted model used within the postprocessing at the time was the high-resolution MOGREPS–U.K. model; however, at *T* + 36 an alternative (lower resolution) model replaced it. This substitution manifests itself in Fig. 4 as an interruption of the steady decrease in both the RPSS and RES/UNC, the values of these statistics at *T* + 36 being almost identical to those at *T* + 30. Therefore, it would be interesting to investigate whether an earlier switch (at *T* + 30 or before) away from the MOGREPS–U.K. model would result in improved RPSS performance. Similar behavior is observed between the time steps at *T* + 42 and *T* + 48, *T* + 66 and *T* + 72, and *T* + 114 and *T* + 120.

Figure 5 compares the site-specific values of the RPSS at *T* + 24 using the site-specific UNC reference (*x* axis) and the site-specific long-term climatology reference (*y* axis). Most points lie below the diagonal, indicating that the RPSS relative to UNC is smaller at the majority of sites. It is no surprise that forecast skill is lower when it is measured relative to the sample uncertainty. The sample uncertainty represents forecasting each category according to its frequency of occurrence during the trial period and is therefore only known after the event. The frequency of occurrence of each category during the trial period may differ (potentially quite significantly) from the long-term frequency of occurrence of these categories prior to the trial. Therefore, forecasting the climatological frequency of occurrence will produce a poorer score than forecasting the true frequency during the trial period (i.e., the sample uncertainty). Consequently, the sites that lie farthest from the diagonal in Fig. 5 are those where the frequency of occurrence of each category during the trial period is most different from the 30-yr site-specific climatology. The two sites highlighted in red and blue are Exeter Airport and Bridlington MRSC; these sites are the farthest from and closest to the diagonal, respectively. The inset histogram displays the difference between the frequency of occurrence of each category (expressed as a proportion) during the long-term climatology and the equivalent frequency during the trial period. At Bridlington the frequency of occurrence of each category was within ±5% of its equivalent during the climatological period. At Exeter Airport the largest difference was observed in the typical category, which occurred over 20% less often during the trial period.

Figure 6 shows the value of the RPSS (calculated using site-specific sample uncertainty as the reference forecast) at *T* + 24 (left) and *T* + 120 (right) for each site in Fig. 2. At *T* + 24 there is a cluster of high-performing (red) sites in a swath of locations stretching from Yeovilton in the southwesterly county of Somerset, through inland counties of the south of England, Greater London, and on into East Anglia, with another cluster on the coast between North Wales and the Lake District. The two highest-performing sites at *T* + 24 are both aerodromes: the value of the RPSS at the aforementioned Yeovilton (51.017°N, 2.63°W) is 0.652 and at Farnborough (51.283°N, 0.767°W) it is 0.656, and both of these sites are located within the south of England. Poorly performing sites at *T* + 24 also appear to be somewhat clustered, with one group located in the southwest of England and another in the north of Scotland above 58°N. It is also interesting to observe that 16 out of the 20 green/turquoise sites are coastal locations. Perhaps the added complexities of forecasting in the coastal zone adversely affect performance at these locations; however, 7 of the 29 red sites are also located on the coast, so coastal effects cannot be the only cause. The four worst performing sites at *T* + 24 are Bridlington (54.083°N, 0.167°W), Ronaldsway (54.083°N, 4.633°W), Inverbervie (56.850°N, 2.267°W), and St. Mary’s Airport on the Isles of Scilly (49.917°N, 6.300°W). At *T* + 120 the picture is more mixed: the seven highest-performing turquoise/green sites are widely distributed, from Portland (50.517°N, 2.450°W) and St. Catherine's Point (50.567°N, 1.3°W) on the south coast of England to Fair Isle (59.533°N, 1.633°W) off the north coast of Scotland. The southerly cluster of sites identified as high performing at *T* + 24 remain good (light blue) performers at *T* + 120; however, the northwest coastal locations have become relatively poor (dark blue) performers.
By and large, the cluster of poorly performing sites in the southwest of England remain poor performers at *T* + 120. The poorest performing sites at *T* + 120 are West Freugh (54.867°N, 4.933°W) in Dumfries and Galloway, Eskdalemuir (55.317°N, 3.217°W) also in Dumfries and Galloway, Ronaldsway (54.085°N, 4.632°W) on the Isle of Man, and Dishforth (54.133°N, 1.417°W) in North Yorkshire. Only Ronaldsway (the worst performing site according to this measure) appears in the bottom four sites at both *T* + 24 and *T* + 120.

The postprocessed model blend produces both a probabilistic and deterministic forecast. The calculations used to create the deterministic solution are entirely separate from those used to create its probabilistic equivalent. The deterministic forecast is a weighted mean of deterministic forecast values and ensemble means. Figures 3–6 have examined the probabilistic version; however, the deterministic equivalent currently populates the Met Office website. The categorized MSE evaluates the difference between the observed category and the category predicted by the deterministic forecast. In the form expressed by Eq. (3) the MSE is directly comparable to the RPS.

Figure 7 plots the categorized MSE (for deterministic forecasts) against the RPS (for corresponding probabilistic forecasts), with every point corresponding to a different site in Fig. 2. The black and red color coding denote a forecast range of 24 and 120 h, respectively. All points lie above the diagonal line of equal skill; since both scores are negatively oriented, this indicates that the probabilistic forecast tends to be more skillful than its deterministic equivalent at every site. At *T* + 24 there are three sites that display almost equal deterministic and probabilistic forecast skill. These sites are Exeter Airport (near the coast in southwest England), Manston (near the coastal town of Ramsgate in southeast England), and Ronaldsway (on the south coast of the Isle of Man), whereas Bridlington (a coastal site in northeast England) stands out as the farthest from the diagonal. Both of the measures displayed in Fig. 7 identify Bridlington as the site with the largest RPS and MSE values at both *T* + 24 and *T* + 120. The *T* + 24 scores at this site appear to be larger (i.e., worse) than the *T* + 120 scores at a majority of U.K. sites. Out of the remaining 94 sites, the *T* + 120 MSE was lower than that corresponding to the *T* + 24 Bridlington deterministic forecast at 84 sites and the *T* + 120 RPS was lower than that corresponding to the *T* + 24 Bridlington probabilistic forecast at 85 sites. However, the site that is farthest from the diagonal at *T* + 120 is Weybourne (a coastal site on the North Norfolk coast in East Anglia). The site that is closest to the diagonal at *T* + 120 is Mumbles Head (located on Swansea Bay on the south coast of Wales).

The MSE and RPS are highly correlated at both forecast ranges, with *R*^{2} = 0.825 at *T* + 24 and *R*^{2} = 0.914 at *T* + 120, indicating a strong relationship between a site's deterministic and probabilistic forecast skill. It is also interesting to note that when the sites are ranked in terms of these statistics the Spearman ranked correlation coefficient obtained by comparing *T* + 24 with *T* + 120 is 0.784 for the RPS and 0.684 for the MSE. This indicates that a high-performing site at a short forecast range tends to remain a high performer at longer ranges. The cloud of red points is noticeably farther from the diagonal equal-skill line, suggesting that the skill of the probabilistic forecast degrades more slowly with increasing forecast range than that of the deterministic solution, and it therefore gives better guidance to users.

Figure 8 displays the all-site values of the categorized MSE and RPS at forecast ranges from *T* + 6 to *T* + 120. At every forecast range the probabilistic forecast outperforms its deterministic equivalent; however, the difference between these performance statistics noticeably increases with the forecast range, thereby confirming the supposition inferred from Fig. 7. The increase in MSE is almost monotonic (with the exception of *T* + 90) and as discussed in relation to Figs. 3 and 4 the generally monotonic increase in RPS with forecast range is also broken at *T* + 36, *T* + 48, *T* + 108, and *T* + 120. This figure also appears to indicate that the performance of the probabilistic model at *T* + 120 is approximately the same as the performance of the deterministic model at *T* + 54.

The QSS (Bentzien and Friederichs 2014) is a useful way to evaluate the skill associated with each quantile of a probabilistic forecast; it is particularly useful in the present study because the forecast is expressed solely in terms of quantiles.
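The quantile score underlying the QSS is the pinball (check) loss, with the skill score formed against a reference forecast in the usual way. A minimal sketch follows; the function names and example values are illustrative, not taken from the paper:

```python
def quantile_score(tau, forecasts, obs):
    # Mean pinball (check) loss for quantile level tau in (0, 1); negatively
    # oriented. Under-forecasting is weighted by tau, over-forecasting by 1 - tau.
    total = 0.0
    for f, o in zip(forecasts, obs):
        u = o - f
        total += u * tau if u >= 0 else u * (tau - 1.0)
    return total / len(forecasts)

def qss(tau, forecasts, obs, ref_forecasts):
    # Skill relative to a reference forecast (e.g. the climatological
    # tau-quantile at the site); 1 = perfect, 0 = no better than reference.
    qs = quantile_score(tau, forecasts, obs)
    qs_ref = quantile_score(tau, ref_forecasts, obs)
    return 1.0 - qs / qs_ref
```

Because the loss is minimized in expectation by the true tau-quantile, evaluating each issued percentile with its own tau (as in Fig. 9) tests every quantile of the forecast distribution separately.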

Figure 9 displays the all-site QSS for every forecast quantile at *T* + 6, *T* + 24, and *T* + 120, evaluated using site-specific climatology as the reference forecast. As expected, the QSS decreases with increasing forecast range; however, at each range considered the quantiles below 20% and above 90% generate the lowest scores. This indicates that the outlying quantiles possess less skill at forecasting summer maximum daytime temperatures.

The QSS and RPSS both assess forecast skill against a reference; however, the (more established) RPSS evaluates skill at predicting multiple categories whereas the QSS requires no such categorization for its evaluation. Figure 10 examines whether (despite this fundamental difference) these skill scores are correlated by comparing the values they generate for the *T* + 24 forecasts issued at each site. In this figure UNC has been chosen for the reference forecast and the all-quantile QSS has been evaluated by averaging the QSS over every forecast quantile.

The Spearman’s rank correlation coefficient for the points in Fig. 10 is 0.852. This clearly indicates that, despite different calculation methodologies, there exists a strong correlation between the performance rankings of summer daytime maximum temperature forecasts at each site according to the QSS and the RPSS. Both scores decompose into resolution, reliability, and uncertainty terms; however, the ability of the QSS to provide an overall score in addition to scores for individual quantiles places it at a distinct advantage over the RPSS for the assessment of Met Office probabilistic forecasts, since this forecast product is quantile based.

Figure 11 shows the results of a REV assessment of the thresholds corresponding to the ten warmest summer days at every U.K. site displayed in Fig. 2. This category has been chosen because the expected losses associated with the lower threshold sets are likely to be minimal. Each subfigure compares the deterministic forecast (red) with each quantile of the probabilistic forecast (gray) and the probabilistic forecast envelope (blue) at forecast ranges of 6, 24, and 120 h. The forecast quantile corresponding to the 50th percentile is displayed in dark gray. Within each subfigure there is an obvious discontinuity in the first derivative of every curve at *C*/*L* ratios of approximately 0.08. This first-order discontinuity corresponds to the point at which the cost of always taking action is the same as that of never taking action [i.e., in Eq. (13c)]. At this *C*/*L* ratio all light gray curves with higher REVs than the dark gray curve correspond to quantiles with percentiles > 50%, and all the remaining light gray curves describe quantiles with percentiles < 50%. The probabilistic forecast envelope outperforms the deterministic solution at all *C*/*L* ratios below 0.46 at *T* + 6, 0.46 at *T* + 24, and 0.56 at *T* + 120. This improved performance is due to the higher quantiles of the forecast (those with probabilities > 50%). The deterministic solution marginally outperforms all quantiles of the probabilistic forecast at *C*/*L* ratios between 0.46 and 0.66 at *T* + 6, 0.46 and 0.77 at *T* + 24, and 0.56 and 0.67 at *T* + 120; however, in each case the probabilistic forecast again outperforms the deterministic solution at *C*/*L* ratios that exceed these ranges. Therefore, these figures appear to indicate that

when the loss associated with an event is many times greater than the cost of the forecast, the probabilistic solution should be preferred (particularly at longer lead times) and should be acted on whenever an event is predicted by a high quantile;

when the loss associated with an event is approximately between 1.5 and 3 times the cost of the forecast, the deterministic solution is a suitable choice for action;

when the loss associated with an event is comparable to the cost of the forecast, the probabilistic solution should again be preferred, acting whenever an event is predicted by a low quantile.
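The REV calculation behind an assessment such as Fig. 11 can be sketched using the standard static cost–loss model (with the loss normalized to 1); this is an illustrative formulation of relative economic value, not necessarily identical to the paper's Eq. (13):

```python
def relative_economic_value(cl_ratio, hit_rate, false_alarm_rate, base_rate):
    """REV of a binary warning system in the standard cost-loss model (loss L = 1,
    cost C = cl_ratio). REV = 1 for a perfect forecast; REV = 0 when the forecast
    is no cheaper than the climatological baseline of always or never acting."""
    a, H, F, s = cl_ratio, hit_rate, false_alarm_rate, base_rate
    e_climate = min(a, s)        # cheaper of always acting (a) vs never acting (s)
    e_perfect = s * a            # act exactly when the event occurs
    # expense of following the forecast: costs of all actions + losses from misses
    e_forecast = (H * s + F * (1 - s)) * a + (1 - H) * s
    return (e_climate - e_forecast) / (e_climate - e_perfect)
```

Because `e_climate` switches branch at `cl_ratio == base_rate`, every REV curve inherits a first-derivative discontinuity at that point, consistent with the kink near *C*/*L* ≈ 0.08 noted above.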

Figure 11 appears to indicate that even if the cost of generating the probabilistic model solution were higher than that associated with the deterministic model, this would be justifiable for customers with low *C*/*L* ratios, particularly at longer lead times. However, in this instance the deterministic solution is derived as the “most likely” solution from the PDF of the probabilistic forecast, so there is no significant difference in production cost. The similarity between the deterministic solution and the median of the probabilistic PDF is confirmed in Fig. 11 by the apparent correlation, within each subfigure, between the REV curves (as functions of the *C*/*L* ratio) for the dark gray quantile (50th percentile) and for the deterministic solution.

## 5. Summary

This study examines the skill with which postprocessed Met Office numerical models (both deterministic and probabilistic) predicted summer daytime maximum temperatures between 2014 and 2016 at U.K. observing sites for which at least 20 years of observed climatology are available between 1983 and 2012. An RPSS analysis indicates that the probabilistic forecast is better than the site-specific climatology at all sites, with a significant cluster of high-performing inland sites in the south of the United Kingdom (particularly around London) at a forecast range of 24 h. Sites near the coast, in Northern Ireland, the Isle of Man, and across northern Scotland appeared not to perform as well. By and large, the sites identified as relatively poor performers at *T* + 24 were also poorly ranked at *T* + 120; however, some sites identified as relatively good performers at *T* + 24 became relatively poor performers at *T* + 120.

The reliability component of the all-site RPSS (evaluated as being between 0.1 and 0.15) was found to be almost invariant to changes in the forecast range. Using a categorized version of the MSE enables a direct comparison between the probabilistic forecast and its deterministic equivalent; this reveals that the probabilistic model is the better performer at every site and at all forecast ranges, with the difference increasing with the range.

An all-site REV assessment (using site-specific thresholds corresponding to the very warm category) confirmed this finding, with the probabilistic model outperforming its deterministic equivalent for the vast majority of *C*/*L* ratios at *T* + 24, and this advantage increased significantly at *T* + 120. However, the deterministic model solution was the slightly better performer when the financial cost of the service was between approximately one-third and two-thirds of the expected financial loss. The REV analysis also identified that the performance of the deterministic solution was very similar to that of the 50th percentile of the probabilistic forecast.

A good degree of correlation between site-specific values of the (threshold independent) QSS and the (threshold dependent) RPSS confirms that the choice of thresholds used for the RPSS calculation does not appear to significantly affect the measured skill. In addition, the QSS gives useful information for users who act when a particular forecast percentile crosses a specific activity-related threshold. All-site values of the QSS indicate that the central quantiles (those between the 20th and 90th) are the better performers. At *T* + 6 and *T* + 24 the lowest quantiles (corresponding to the 2.5th and 5th percentiles) display the poorest skill; however, at *T* + 120 the maximum quantile (the 97.5th percentile) becomes the poorest performer.

Categorizing weather forecast values in terms of the climatology at the site to which they correspond is particularly useful to the forecast user, who typically wants to know whether warmer or cooler conditions than usual for the time of year are being forecast at their location. The analysis undertaken in the present study examines the forecast skill associated with predicting these categories using either a deterministic or a probabilistic postprocessed numerical model solution at forecast ranges from 6 to 120 h. Perhaps the relatively poor performance of the outlying quantiles is an indication that the probabilistic forecast solution between 2014 and 2016 was underdispersed.

This analysis constitutes the first rigorous assessment of operational percentile forecasts generated by the Met Office, and it is highly applicable to many other weather types for which site-specific forecasts and long-term climatologies exist. Examining weather components during different synoptic conditions would help identify situations in which the model does not perform as well. Another interesting extension would be to use impact data to choose the thresholds from site-specific climatological CDFs, thereby enabling an investigation of the extent to which these categories influence health or asset management across the United Kingdom. A further extension could include the use of another model (such as the ECMWF medium-range ensemble) as the reference forecast. Additional benefit could also be obtained by assessing forecast performance at each stage of the postprocessing procedure.

## Acknowledgments

The authors would particularly like to thank the National Climate Information Centre, the Public Weather Service, Clare Bysouth, and Rebecca Stretton for their assistance, and the anonymous reviewers for their helpful suggestions. The probabilistic model forecast data used in this study are unpublished; however, they are archived at the Met Office and are available for research use via a freedom of information request. For access, please contact enquiries@metoffice.gov.uk citing this paper.

## REFERENCES

*19th Conf. on Probability and Statistics*, New Orleans, LA, Amer. Meteor. Soc., 2.2, http://ams.confex.com/ams/pdfpapers/131645.pdf.

*15th Conf. on Weather Analysis and Forecasting*, Norfolk, VA, Amer. Meteor. Soc., J79–J82.
