A Physically Based Stochastic Boundary Layer Perturbation Scheme. Part II: Perturbation Growth within a Superensemble Framework

Convection-permitting forecasts have improved the forecasts of flooding from intense rainfall. However, probabilistic forecasts, generally based upon ensemble methods, are essential to quantify forecast uncertainty. This leads to a need to understand how different aspects of the model system affect forecast behavior. We compare the uncertainty due to initial and boundary condition (IBC) perturbations and boundary layer turbulence using a superensemble (SE) created to determine the influence of 12 IBC perturbations versus 12 stochastic boundary layer (SBL) perturbations constructed using a physically based SBL scheme. We consider two mesoscale extreme precipitation events. For each, we run a 144-member SE. The SEs are analyzed to consider the growth of differences between the simulations, and the spatial structure and scales of those differences. The SBL perturbations rapidly spin up, typically within 12 h of precipitation commencing. The SBL perturbations eventually produce spread that is not statistically different from the spread produced by the IBC perturbations, though in one case there is initially increased spread from the IBC perturbations. Spatially, the growth from IBC occurs on larger scales than that produced by the SBL perturbations (typically by an order of magnitude). However, analysis across multiple scales shows that the SBL scheme produces a random relocation of precipitation up to the scale at which the ensemble members agree with each other. This implies that statistical postprocessing can be used instead of running larger ensembles. Use of these statistical postprocessing techniques could lead to more reliable probabilistic forecasts of convective events and their associated hazards. We highlight the need to go beyond the grid scale when considering convective-scale forecasts. We also indicate the need for careful consideration of the interpretation of


Introduction
Forecasting of convective events has had a ''step change'' in ability since the advent of convection-permitting models (e.g., Lean et al. 2008; Clark et al. 2016). In turn, this has led to improvements in the prediction of floods with a rapid rate of rise, i.e., both surface water and flash flooding (e.g., Roberts et al. 2009; Cuo et al. 2011). However, quantitative forecasting of convective precipitation still remains a key challenge due to uncertainty in spatial structure (e.g., Roberts and Lean 2008; Dey et al. 2014, 2016a; Flack et al. 2018), timing (e.g., Lean et al. 2008), storm structure (e.g., Stein et al. 2015) and intensity (e.g., Mittermaier 2014): these issues are covered in more detail by Clark et al. (2016).
Convection-permitting forecasts lead to improved forecasts of convective events (e.g., Clark et al. 2009) but the smaller scales represented have, in general, faster error growth than the larger scales represented in coarser-resolution systems (e.g., Hohenegger et al. 2006; Hohenegger and Schär 2007; Clark et al. 2010). While errors growing faster at smaller scales in the atmosphere is not a surprising result (e.g., Lorenz 1969), the implication is that for most forecast lead times a probabilistic approach is required.
To help represent this uncertainty many operational centers use ensemble prediction systems (hereafter ensembles) at convection-permitting resolution (e.g., Seity et al. 2011; Baldauf et al. 2011; Hagelin et al. 2017) to indicate the range of plausible outcomes from subtle changes in initial conditions, boundary conditions and model physics (e.g., Buizza and Palmer 1995). However, there are still questions concerning error growth within ensembles, and as such convective-scale predictability (e.g., Zhang et al. 2003; Selz and Craig 2015; Johnson and Wang 2016). These questions need to be answered to allow for the effective design and implementation of convective-scale ensembles. While error growth is overall faster at these scales there are differences in the error growth that depend on the environmental flow, such as the presence or lack of a diurnal cycle (e.g., Nielsen and Schumacher 2016), and the scales at which the dominant growth occurs (e.g., Johnson et al. 2014; Flack et al. 2018). These factors need to be considered carefully in ensemble design to allow a reliable ensemble to be made, as they indicate that perturbations need to be made across a range of scales. Here we compare ensembles created by two different types of perturbations in the context of both magnitude and spatial aspects of perturbation growth.
Recent work examining convective-scale error growth has considered the spatial aspects of the growth for a range of cases (e.g., Johnson et al. 2014; Surcel et al. 2016). Generally these studies indicate that more widespread precipitation results in a greater areal extent of error growth than more localized precipitation. However, more localized precipitation is less predictable compared to larger areas of precipitation (e.g., . There are also other factors that determine the spatial aspects of error growth. For example, Flack et al. (2018) indicated the scales at which the error growth was dominating were partly linked to the large-scale synoptic forcing. Indeed, for their experiments it was shown that cases with weaker synoptic forcing had perturbation growth that dominated on scales O(1) km whereas for cases that were strongly forced there was an order of magnitude difference, so growth dominated on scales O(10) km.
Many more studies have considered the magnitude of error growth across multiple cases (e.g., Done et al. 2006; Keil and Craig 2011; Done et al. 2012). These studies showed that the total (area-averaged) precipitation had reduced spread between ensemble members in strong synoptically forced compared to weakly forced cases. These results were then developed by Keil et al. (2014) and Kühnlein et al. (2014) to consider the response of convection to different perturbation strategies. It was indicated that model physics perturbations had a greater influence on the total precipitation spread in weakly forced cases compared to strongly forced conditions, particularly around the initiation time of events, in agreement with Surcel et al. (2017). This agrees with previous studies considering convective cases that found that model physics perturbations have their greatest impact at convective initiation (e.g., Zhang et al. 2003; Hohenegger et al. 2006; Leoncini et al. 2010).
Intrinsic predictability experiments yield the theoretical lowest amount of uncertainty possible for an event whereas practical predictability experiments yield the uncertainty in models for actual cases (based on current capabilities). In a forecasting context, both intrinsic predictability experiments and practical predictability experiments have their uses for forecast interpretation. Generally studies [including most previously discussed, with the exception of Keil et al. (2014) and Kühnlein et al. (2014)] have focused on intrinsic rather than practical predictability. However, there are now more studies considering practical predictability (e.g., Melhauser and Zhang 2012; Sun and Zhang 2016). Both of these studies considered the up/downscale growth of perturbations and show that if the errors on large scales (of roughly 1000 km) are large then the forecasts can be improved via more accurate initial conditions, whereas if the errors on the large scale are small then, regardless of improvements in initial conditions, there will be limited improvement in the forecasts on the mesoscale. This result was also found by Durran and Gingrich (2014) and Weyn and Durran (2017), though the latter study notes that there is no upscale/downscale growth within their idealized simulations and the errors grow up-amplitude on all scales simultaneously. These discrepancies show that further work needs to go into these practical predictability experiments as this will help indicate where forecasts can be improved further, for example through better specification of initial conditions or better representation of unresolved processes such as turbulent eddies.
In Clark et al. (2021, hereafter Part I) we discussed the formulation of our physically based stochastic boundary layer (SBL) perturbation scheme and tested it for two distinct cases (18 July and 5 August 2017) over the United Kingdom. Our physically based stochastic scheme is designed to represent the sampling error from unresolved turbulent eddies within the boundary layer. It depends upon the average number of thermals triggered over an area in a set time and is such that situations with, on average, more thermals result in relatively smaller stochastic increments. Testing showed that the scheme does not result in any significant systematic change in overall precipitation, but generates significant differences from a control simulation at the convective cell scale over a forecast of several hours and so can form the basis of an ensemble designed to represent the impact of this form of uncertainty. The stochastic scheme is designed to be relatively insensitive to the spatial scale the perturbations are applied on, and testing confirmed this; some sensitivity to the magnitude of the perturbations was observed, though a factor-of-10 increase was required to produce significantly more displacement in the convective precipitation from the control simulations.
The magnitude of stochastic increments appears very small (around 0.01 K), but this is because the boundary layer heating is similarly small on the same time scale. In fact, at the scales applied, the variability of increments can easily match the size of the mean. As discussed in Part I in more depth, this irreducible variability must exist in even the most idealized smoothly forced circumstances, and one of our objectives is to determine how significant this source of variability is. Other sources of uncertainty exist, including uncertainty in surface parameters, and so-called structural uncertainties due to the inaccuracy of the parameterization scheme. The former depends on knowledge of surface characteristics (or lack thereof) and is difficult to model universally. For example, the ''uncertainty'' in evapotranspiration would be larger in a model using climatological values of, for example, leaf area index compared with that using a measured value from satellite-based remote sensing. Clearly, the objective with such uncertainty is to reduce it using more or better measurements (though again there is likely to be an irreducible limit to be determined). ''Structural'' uncertainty is not a well enough defined concept, but we take it to mean that the ensemble mean response to forcing is likely to be in error. Such errors tend to be systematic, often leading to different quasiequilibrium profiles for given forcing, and it is very hard to argue that the representation of such errors should be stochastic on small scales without introducing the physical reasoning behind our scheme. Our scheme represents the variability about the ensemble mean, which increases as the space and time averaging scale decreases. Of course, the mean is zero so the question is how much of the variability is retained and grows. 
We therefore would argue that the variability represented by our scheme must be considered at high resolution, and in this paper we do so cleanly, comparing its effect with that of a well-defined and separate source of uncertainty.
Thus, here in Part II of this study, we wish to determine the impact of the SBL compared to initial and boundary condition (IBC) perturbations on forecast uncertainty and so determine the spatial scales at which these perturbations act. We consider the perturbation growth in a superensemble (SE) framework using the same two cases in practical predictability experiments. An SE is a large ensemble that consists of several subensembles in which different types of perturbations are used. This is a useful but expensive tool. This expense arises from the need to consider a large number of ensemble members, either $n^m$ or (if each factor has a different ensemble size) $n_0 \times n_1 \times \cdots \times n_m$, where $n$ represents the ensemble size and $m$ represents the number of factors being considered (i.e., for our situation $m = 2$ to compare the influence of IBC perturbations and SBL perturbations) to be able to determine the impact of each factor.
The SE framework is a simple and effective method for determining the (relative) impact of different sources of uncertainty upon the forecast (e.g., Kühnlein et al. 2014; Keil et al. 2014). Since the SBL perturbations are small scale, we wish to address a second question. Practical ensembles do not contain enough members to enable probabilities of, for example, precipitation to be derived simply and directly. Some postprocessing is needed to smooth the predicted probabilities, often based on ''neighborhood'' methods (as discussed in section 4d). The SE provides us with a tool to compare both the scales of variability due to the SBL with that assumed in the neighborhood processing, and the predicted rainfall probabilities. If the postprocessed ensemble is similar to the full SE it implies that the postprocessing acts to artificially increase the ensemble size thus saving the computational expense of running an SE operationally, particularly at the convective scale. While this paper acts to test our scheme in an operational context, the questions considered in the manuscript apply more widely to all forms of SBL perturbations.
Thus, through our SE we consider two questions: (i) How does the perturbation growth induced by our SBL compare to the growth from IBC perturbations, and (ii) how does the impact of our SBL scheme compare to that from postprocessing an ensemble without our SBL scheme using neighborhood-based diagnostics?
The remainder of this paper is set out as follows. The construction of the SE is discussed in section 2, a brief overview of the cases is given in section 3 and diagnostics considered here are explained in section 4, with a particular emphasis on those not used in Part I. The magnitude of the perturbation growth is considered in section 5 and the spatial aspects are considered in section 6; finally, conclusions are drawn in section 7.

The superensemble
Here we have taken an operational convection-permitting ensemble and expanded it into a much larger ensemble using the perturbations from our SBL scheme. We have termed this larger ensemble an SE as it is one large ensemble made of many subensembles. The SE (Fig. 1) is constructed using the Met Office Unified Model (MetUM) at version 10.6. The MetUM is a nonhydrostatic, semi-implicit, semi-Lagrangian model that uses the Even Newer Dynamics for General Atmospheric Modeling of the Environment (ENDGAME) formulation for its dynamical core (Wood et al. 2014). We use the standard MetUM parameterizations for the boundary layer (Lock et al. 2000), microphysics (Wilson and Ballard 1999), radiation (Edwards and Slingo 1996) and surface-layer scheme (Porson et al. 2010). A convection scheme is not used as convection is treated explicitly.
The SE is constructed from 12 members of the operational Met Office Global and Regional Ensemble Prediction System for the United Kingdom (MOGREPS-U.K.; Hagelin et al. 2017). The MOGREPS-U.K. configuration of the MetUM is a 2.2 km grid-length ensemble. It is closely connected to the U.K. variable resolution (UKV) configuration of the MetUM operational at the time of the case studies (except the UKV has a 1.5 km grid length over the United Kingdom). This configuration uses 4DVAR data assimilation to produce an analysis every 3 h; in practice analysis increments are ''nudged'' into a forecast started from the 1 h forecast from the previous analysis. MOGREPS-U.K. follows a similar process, starting with the same UKV 1 h forecast and analysis increments reconfigured to the 2.2 km grid, but each ensemble member also has added the downscaled perturbations for IBCs from the 33 km grid-length global ensemble (MOGREPS-G; Bowler et al. 2008; Tennant et al. 2011). The intention is thus to retain both the high-resolution information from the UKV analysis and the mesoscale perturbations from MOGREPS-G. This setup is identical to the setup described by Hagelin et al. (2017) except that the UKV analysis increments have since been updated to 4DVAR analysis increments instead of 3DVAR analysis increments. Each of the 12 MOGREPS-U.K. members forms the basis of a 12-member subensemble by generating a further 11 members with our SBL scheme using 11 different random seeds. This process results in a set of 12 subensembles each with 12 members, and thus an SE with 144 members. The IBC perturbed components of the SE are those generated by the operational MOGREPS-U.K. system.
In our experiments, unlike in the operational version of MOGREPS-U.K., we do not use the operational stochastic potential temperature ($\theta$) perturbations or the random parameter scheme to produce model physics perturbations (discussed in McCabe et al. 2016; Hagelin et al. 2017). We run the SBL scheme discussed in Part I instead.
Our SBL scheme is designed to represent the variation due to unresolved turbulent processes that is not accounted for in traditional boundary layer schemes. In the SE the SBL scheme is set up to perturb $\theta$, $q$, $u$, and $v$ (where $q$, $u$, and $v$ represent specific humidity, and wind components in the zonal and meridional directions, respectively) over a region of 8 × 8 grid boxes that is repeated in a ''checkerboard'' effect across the domain. The magnitudes of the perturbations are set to a value that is physically appropriate based on boundary layer scalings and is not multiplied by an extra factor. On this eddy turnover time scale the scheme adds perturbations with standard deviation roughly $(A_t/\Delta A)^{1/2}\theta_*$, where $A_t$ is the area occupied by one eddy, $\Delta A$ is the averaging area, and $\theta_*$ is the free convective temperature scale, typically of order 0.1 K; with an 8 × 8 checkerboard and 2.2 km grid box, so $(A_t/\Delta A)^{1/2} \simeq 1/16$, this is roughly 0.1/16 K. Thus, perturbations are very small. They could have been applied over a smaller area, and thus been larger, but the results of Part I suggest the results would not be significantly different. The scheme is applied on all model levels that the boundary layer scheme runs on and at every time step throughout the run. In practice this means all model levels, but the perturbations outside the actual boundary layer are generally much smaller. Full justification for these choices and sensitivity of the scheme is discussed in Part I.
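As a rough arithmetic check on the magnitude quoted above, the following sketch takes the stated 0.1/16 K at face value, i.e., it assumes the square-root factor in the scaling is about 1/16, and infers the eddy area that this would imply for an 8 × 8 checkerboard of 2.2 km grid boxes. The variable names and the inferred eddy area are illustrative assumptions rather than values taken from Part I.

```python
# Illustrative check of the perturbation magnitude; not the scheme itself.
GRID_LENGTH_KM = 2.2      # grid length quoted in the text
CHECKER_BOXES = 8         # perturbation region is 8 x 8 grid boxes
THETA_STAR_K = 0.1        # typical free convective temperature scale (K)
SQRT_RATIO = 1.0 / 16.0   # ASSUMED value of (A_t / dA) ** 0.5


def increment_std(theta_star_k, sqrt_ratio):
    """Standard deviation of the stochastic increment, sqrt_ratio * theta_star."""
    return sqrt_ratio * theta_star_k


# Averaging area covered by one checkerboard square.
averaging_area_km2 = (CHECKER_BOXES * GRID_LENGTH_KM) ** 2

# Eddy area implied by the assumed square-root ratio (hypothetical inference).
implied_eddy_area_km2 = SQRT_RATIO ** 2 * averaging_area_km2

sigma_k = increment_std(THETA_STAR_K, SQRT_RATIO)

print(f"averaging area    : {averaging_area_km2:.2f} km^2")
print(f"implied eddy area : {implied_eddy_area_km2:.2f} km^2")
print(f"increment std dev : {sigma_k:.5f} K")
```

Under these assumptions the implied eddy area is of order 1 km², and the increment is a few thousandths of a kelvin, in line with the statement that the perturbations are very small.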
In the SE experiments the two cases considered are initiated at 1500 UTC the day prior to the event of interest (17 July and 4 August 2017, respectively). This allows the event of interest to occur at a time in the forecast (approximately T + 24 h) when all the perturbations have had time to grow to produce a similar influence on the forecast, as we will demonstrate in section 5.
Throughout the rest of the paper the following notation (used within Fig. 1) is used to describe the different ensemble members within the SE: a.x where a refers to the IBC member and x refers to the stochastic member. Thus member 0.0 of the SE is the control of MOGREPS-U.K. and there are no stochastic perturbations added (i.e., it is the unperturbed control member of the entire SE). Furthermore, we define two types of subensembles (IBC and SBL subensembles), e.g., 1.x refers to the subensemble with IBC member 1 with all 12 stochastic members (i.e., an IBC subensemble) whereas a.1 is the subensemble with stochastic member 1 with all 12 different IBCs (i.e., an SBL subensemble). We also refer to subensemble a.0 as the control subensemble (the ensemble with no stochastic perturbations), which is our equivalent to MOGREPS-U.K. without any stochastic perturbations.
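The a.x notation can be made concrete with a short sketch that enumerates the SE members and the two kinds of subensemble. The function names are hypothetical, and the zero-based numbering of both indices (with 0 denoting the control IBC member and the absence of stochastic perturbations) is an assumption inferred from the label 0.0.

```python
N_IBC = 12  # IBC members a = 0..11 (0 assumed to be the MOGREPS-U.K. control)
N_SBL = 12  # stochastic members x = 0..11 (0 assumed to mean no SBL perturbations)


def member_label(a, x):
    """Label of the SE member with IBC member a and stochastic member x."""
    return f"{a}.{x}"


def ibc_subensemble(a):
    """Subensemble a.x: one IBC member combined with every stochastic member."""
    return [member_label(a, x) for x in range(N_SBL)]


def sbl_subensemble(x):
    """Subensemble a.x with x fixed: one stochastic member across every IBC member."""
    return [member_label(a, x) for a in range(N_IBC)]


# The full superensemble is the Cartesian product of the two factors.
superensemble = [member_label(a, x) for a in range(N_IBC) for x in range(N_SBL)]
control_subensemble = sbl_subensemble(0)  # a.0: no stochastic perturbations

print(len(superensemble))      # number of SE members
print(ibc_subensemble(1)[:3])  # first few members of the 1.x subensemble
print(control_subensemble[:3]) # first few members of the a.0 subensemble
```

With two factors of 12 members each, the product gives the 144 members described in the text.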

Case studies
As discussed in the introduction we use the same cases here as we did in Part I; however, we provide a brief overview of the cases here to set the scene and the terminology around the cases. Both cases are named primarily after the locations where the convection was observed to be most intense or dynamically active, rather than after the analysis domains. Figure 2 shows the probability of reaching an hourly-precipitation accumulation of at least 1 mm for these events generated from the control subensemble (a.0) and the entire SE, for both the Coverack case (Figs. 2a,c) and the Kent case (Figs. 2b,d). The cases were chosen to show different types of convection, and via the convective adjustment time scale (e.g., Done et al. 2006) can be shown to occur within different places along the spectrum of convective regimes (e.g., Flack et al. 2018).

a. Coverack case: 18 July 2017
In this case a mesoscale convective system (MCS) progressed toward the United Kingdom after forming off the coast of Brittany at 1200 UTC. The MCS moved over Cornwall at 1400 UTC bringing intense precipitation that resulted in a devastating flood for the village of Coverack (Essex 2018) as part of the MCS became anchored over Coverack for approximately 3 h from 1400 UTC. The convective adjustment time scale for this case is initially 4.2 h and over time reduces to 0.4 h. Combining this with the local forcing keeping the storm anchored places this case toward the nonequilibrium end of the spectrum (despite the marginal time scale).

b. Kent case: 5 August 2017
The second case began as scattered showers forming in the lee of the Welsh mountains before aggregating as they traveled across England. Upon reaching east England (East Anglia) at 1400 UTC, they had formed into two S-N oriented squall lines (see Figs. 2b,d). The eastern squall line then moved along the north Kent coast (not shown). By chance, the lead author was there at the time and from 1502 to 1534 UTC he witnessed multiple mesocyclones and three funnel clouds as the squall line passed directly overhead. The rainfall associated with the southern squall line was intense and could have led to flooding had it been further south over land. However, most of the precipitation occurred along the coast either onto marshland or into the sea. This case has a low convective adjustment time scale of initially 1.1 h dropping to 0.1 h and so is placed on the convective quasi-equilibrium end of the spectrum of convective regimes.

Diagnostics
Three diagnostics were utilized in this study and are now described. Alongside the mean square difference (MSD) previously discussed in Part I and defined in Flack et al. (2018), a variance diagnostic and a diagnostic that considers the spatial aspects of the forecasts are also used: the temperature variance and the ensemble agreement scale (EAS; Dey et al. 2016a,b). All analysis using the MSD is performed over a region of 205 × 205 grid boxes (451 km × 451 km) which includes the formation locations for each event. The temperature variance is calculated for the full forecasts and the interior domain (2.2 km) of MOGREPS-U.K., while the EAS is calculated across the entire domain, but shown over the same analysis domain as the MSD. Figure 2 indicates the analysis domains for each case, which are identical to those used in Part I. The diagnostics are considered for both forecasts at times specific to the life cycle of the event across the full SE including formation and decay or leaving the United Kingdom. These times are T + 12 h to T + 36 h for the Coverack case and T + 6 h to T + 30 h for the Kent case (Fig. 1). They are further chosen to allow at least 1% of points within the domain to have precipitation as otherwise it becomes difficult to separate numerical artifacts due to the small number of points from physical differences.

a. Mean square difference
The MSD was used and discussed in detail in Part I in the testing of our SBL scheme. We repeat the formula here for convenience: for $P_c$ the hourly precipitation accumulations in the control forecast and $P_p$ the hourly precipitation accumulations in the perturbed forecast, evaluated at each grid square within the analysis domain. As in Part I the MSD is considered only over the common points, so is referred to as $\mathrm{MSD}_{\mathrm{common}}$. The ''common'' points are those at which precipitation occurs in the same location in both the control and perturbed forecasts. The $\mathrm{MSD}_{\mathrm{common}}$ is used to help diminish the ''double penalty'' problem as it neglects points where precipitation only occurs in one forecast. An arbitrary hourly precipitation accumulation threshold for the identification of convective precipitation is used here for $\mathrm{MSD}_{\mathrm{common}}$. This threshold is set at 1.0 mm to keep consistency with Part I although the conclusions are insensitive to reasonable changes in this value (not shown). For the calculation of the MSD the ensembles have been bootstrapped with replacement for 10 000 samples to produce confidence intervals on the mean, and reliable 95th and 5th percentiles. Furthermore, times during this analysis period with a low number of precipitating points are separated by vertical dot-dashed lines on the figures. Times before that, indicated by a line near the start of the analysis period, or after that, indicated by a line near the end of the analysis period, are less statistically reliable, and hence conclusions are not drawn from these periods.
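The restriction to common points and the bootstrap can be sketched as follows. This is a minimal illustration under stated assumptions, not the calculation used in Part I: the difference is left unnormalized (any normalization in the original formula is not reproduced here), a point counts as precipitating when its accumulation is at least the threshold, and the function names and the fixed random seed are hypothetical.

```python
import numpy as np


def msd_common(control, perturbed, threshold=1.0):
    """Mean square difference over points precipitating in BOTH forecasts.

    ASSUMPTION: unnormalized mean of squared differences; a point is 'common'
    when both hourly accumulations are >= threshold (mm).
    """
    control = np.asarray(control, dtype=float)
    perturbed = np.asarray(perturbed, dtype=float)
    common = (control >= threshold) & (perturbed >= threshold)
    if not common.any():
        return float("nan")  # no common precipitating points at this time
    diff = perturbed[common] - control[common]
    return float(np.mean(diff ** 2))


def bootstrap_mean(values, n_samples=10_000, seed=0):
    """Bootstrap (with replacement) the mean of per-member values.

    Returns the mean of the resampled means and their 5th and 95th percentiles.
    """
    values = np.asarray(values, dtype=float)
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, values.size, size=(n_samples, values.size))
    means = values[idx].mean(axis=1)
    return float(means.mean()), float(np.percentile(means, 5)), float(np.percentile(means, 95))


# Tiny synthetic example (values in mm, invented for illustration only).
control = np.array([[0.0, 2.0], [3.0, 0.5]])
perturbed = np.array([[2.0, 4.0], [3.0, 2.0]])
print(msd_common(control, perturbed))
```

In the synthetic example only two grid points exceed the threshold in both fields, so the points where rain appears in just one forecast do not contribute, which is how the ''double penalty'' is reduced.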

b. Temperature variance
The temperature variance has been chosen as a diagnostic because it is one of the components of the difference total energy (DTE), which consists of the difference kinetic energy and a thermal component (e.g., Zhang et al. 2003), and is frequently used to consider error growth (e.g., Zhang et al. 2003; Selz and Craig 2015; Durran and Gingrich 2014). Here, we consider the evolution of the temperature variance ($\mathrm{DTE}_T$) at 850 hPa averaged across the interior domain. Following the thermal component of the DTE (e.g., Zhang et al. 2003), the temperature variance is given by

$$\mathrm{DTE}_T = \frac{1}{2}\frac{c_p}{T_{\mathrm{ref}}}T'^2,$$

where $c_p$ is the specific heat capacity, $T_{\mathrm{ref}}$ is a reference temperature, here taken to be 273 K, and $T'$ denotes the difference between the control (0.0) and the perturbed forecasts (a.x). This diagnostic allows a direct inference of the size of the temperature perturbations to indicate the impact of the spatial scale of different size perturbations, and the behavior can be used to help infer the different growth mechanisms.
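A minimal sketch of this diagnostic is given below. It assumes the standard thermal component of the DTE, $\frac{1}{2}(c_p/T_{\mathrm{ref}})T'^2$, averaged over the domain; the factor of one-half, the value used for $c_p$, and the function name are assumptions for illustration rather than details confirmed by the text.

```python
import numpy as np

CP = 1004.0    # J kg^-1 K^-1; ASSUMED value for dry air at constant pressure
T_REF = 273.0  # K; reference temperature quoted in the text


def temperature_variance(t_control, t_perturbed, cp=CP, t_ref=T_REF):
    """Domain-averaged thermal component of the difference total energy.

    ASSUMPTION: DTE_T = 0.5 * (cp / t_ref) * T'^2, with T' the perturbed-minus-
    control temperature difference at each grid point (e.g., on 850 hPa).
    """
    t_prime = np.asarray(t_perturbed, dtype=float) - np.asarray(t_control, dtype=float)
    return float(np.mean(0.5 * (cp / t_ref) * t_prime ** 2))


# Synthetic 850-hPa temperature fields (K), invented for illustration only.
t_control = np.full((4, 4), 280.0)
t_perturbed = t_control + 0.01  # a uniform 0.01 K difference

print(temperature_variance(t_control, t_perturbed))
```

Because the diagnostic is quadratic in $T'$, a tenfold increase in the temperature difference raises it by a factor of 100, which is why it separates small stochastic increments from larger IBC differences so clearly early in a forecast.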

c. Ensemble agreement scale
At small scales ensemble members are more likely to be in disagreement with each other and observations because of differences in positioning and intensity which means low predictability and low skill for any given member. A lack of predictability at small scales will also lead to noisy (spatially fragmented) probability forecasts from an ensemble unless there are either sufficient ensemble members to account for the uncertainty or neighborhood processing is used to effectively add members. At larger scales there is typically more agreement so a ''skillful scale'' can be defined as the smallest scale at which the members are in agreement. Here we use the EAS defined by Dey et al. (2016a) to determine a scale for each individual grid point that can be used to establish appropriate neighborhood sizes to be used when generating probabilities from the ensembles.
The calculation of the EAS starts by comparing pairs of fields and is applied to each grid point. First, for each grid point, a comparison is made with its equivalent and then successively larger square neighborhoods are tested until a neighborhood size is found in which the precipitation forecasts are found to have sufficient agreement with one another (they ''suitably'' agree) as defined by Eqs. (1) and (2) below. Usually, the overall EAS for each grid point (i, j) is defined as the average agreement scale between all member-member pairs at that grid point (Dey et al. 2016a,b). However, given the size of the SE we restrict this to the average agreement scale between the control and perturbed member pairs.
The agreement scale, $S^{A}_{ij}$, for each control-perturbed member pair, at each grid point, is the neighborhood width for which the two forecasts are in agreement. If two forecasts agree at the scale $n \times n$ grid points, the agreement scale is $(n - 1)/2$ grid lengths. For example, if the forecasts agree at the grid scale, they have an agreement scale of 0 whereas if a neighborhood of size 3 × 3 grid boxes is required, they have an agreement scale of 1 and for 5 × 5 grid boxes the agreement scale is 2, etc. The EAS is therefore defined as the minimum scale $S$, in terms of number of complete grid boxes, that satisfies

$$D^{S}_{ij} \le D^{S}_{\mathrm{crit},ij}, \tag{1}$$

where $D^{S}_{ij}$ is the normalized ratio of the squared differences between the fields and $D^{S}_{\mathrm{crit},ij}$ is the critical value which determines if the values suitably agree with one another. For hourly precipitation accumulations for the control-perturbed member pairs, $D^{S}_{ij}$ is calculated as

$$D^{S}_{ij} = \begin{cases} \dfrac{\left(P^{S}_{p,ij} - P^{S}_{c,ij}\right)^2}{\left(P^{S}_{p,ij}\right)^2 + \left(P^{S}_{c,ij}\right)^2}, & P^{S}_{p,ij} > 0 \ \text{or} \ P^{S}_{c,ij} > 0, \\ 1, & P^{S}_{p,ij} = 0 \ \text{and} \ P^{S}_{c,ij} = 0, \end{cases} \tag{2}$$

where $P^{S}_{p,ij}$ and $P^{S}_{c,ij}$ represent the precipitation within the neighborhood of width $S$ centered on grid box $(i, j)$ for the control (c) and perturbed (p) forecasts. The critical value is determined by

$$D^{S}_{\mathrm{crit},ij} = \alpha + (1 - \alpha)\frac{S}{S_{\mathrm{lim}}}, \tag{3}$$

where $\alpha$ is a bias tolerance between the forecasts varying between zero and unity (here it is set to 0.5) and $S_{\mathrm{lim}}$ is a predetermined fixed maximum scale where (1) will always be satisfied [here it is set to 80 as in Dey et al. (2016a)]. Recall that for forecasts that agree at the grid scale $S$ is zero and so $D^{S}_{\mathrm{crit},ij} = \alpha$. The assessment of agreement is performed iteratively, incrementing $S$ from the initial value of zero until an agreement scale, or $S_{\mathrm{lim}}$, is reached.
Given that (1) is for the minimum when the criterion is met the EAS will range from zero (an acceptably spatially identical forecast) to S lim which either implies that there is only precipitation in one forecast over the area corresponding to S lim , or that there is no precipitation in either forecast over the area, or that there is no spatial agreement between forecasts.
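The iterative search for the agreement scale of one control-perturbed pair can be sketched as below. The sketch assumes the Dey et al. (2016a) form of the criterion, with the normalized squared difference set to 1 when neither neighborhood contains precipitation and a critical value that rises linearly from the bias tolerance at the grid scale to 1 at the maximum scale. The function names, the use of neighborhood means, and the truncation of neighborhoods at the domain edge are illustrative assumptions.

```python
import numpy as np


def neighborhood_mean(field, i, j, s):
    """Mean over the (2s+1) x (2s+1) square centered on (i, j).

    ASSUMPTION: the square is simply truncated at the domain edges.
    """
    i0, i1 = max(i - s, 0), min(i + s + 1, field.shape[0])
    j0, j1 = max(j - s, 0), min(j + s + 1, field.shape[1])
    return field[i0:i1, j0:j1].mean()


def agreement_scale(control, perturbed, i, j, alpha=0.5, s_lim=80):
    """Smallest scale S at which the two fields 'suitably' agree at (i, j)."""
    for s in range(s_lim + 1):
        pc = neighborhood_mean(control, i, j, s)
        pp = neighborhood_mean(perturbed, i, j, s)
        if pc > 0 or pp > 0:
            d = (pp - pc) ** 2 / (pp ** 2 + pc ** 2)
        else:
            d = 1.0  # no precipitation in either neighborhood
        d_crit = alpha + (1.0 - alpha) * s / s_lim
        if d <= d_crit:
            return s
    return s_lim  # guard only; the criterion is always met at s_lim


# Synthetic fields: a single wet grid box displaced by two boxes.
control = np.zeros((11, 11))
perturbed = np.zeros((11, 11))
control[5, 5] = 1.0
perturbed[5, 7] = 1.0

print(agreement_scale(control, perturbed, 5, 5))
```

In the synthetic example the wet box is displaced by two grid lengths, so the neighborhoods first contain equal precipitation at a 5 × 5 window. Averaging the resulting scales over all control-perturbed pairs at each grid point would then give the EAS as described in the text.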

d. Postprocessing with the EAS
The EAS is used to define a neighborhood size for generating probabilities that can vary with each grid point in the domain rather than have a fixed size for every grid point. The use of the EAS, as developed by Dey et al. (2016a), has included applications for the United Kingdom (Dey et al. 2016b), China (e.g., Chen et al. 2017) and the United States (Blake et al. 2018). The postprocessing here follows three simple steps:
1) The ensemble probabilities are calculated at each individual grid point, as standard.
2) The EAS is calculated using the method outlined above.
3) At each grid point the neighborhood length is defined by the EAS for that grid point. The postprocessed probability of rainfall at each grid point is then calculated as the average of the probabilities within the neighborhood [the neighborhood ensemble probability (NEP) as defined by Schwartz et al. (2010) and Schwartz and Sobash (2017)]; e.g., for a grid point with an EAS of 5, an average over the probabilities in the 11 × 11 grid points centered on that grid point is calculated.
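The three steps above can be sketched as follows, taking the EAS field as already computed. The function names are hypothetical; the treatment of neighborhoods at the domain edge (simple truncation), the rounding of the EAS to whole grid boxes, and the use of "at least" for the threshold test are assumptions made for illustration.

```python
import numpy as np


def gridpoint_probability(members, threshold):
    """Step 1: fraction of members reaching the threshold at each grid point.

    members has shape (n_members, ny, nx).
    """
    return (np.asarray(members, dtype=float) >= threshold).mean(axis=0)


def eas_neighborhood_probability(prob, eas):
    """Step 3: average the gridpoint probabilities over a window set by the EAS.

    An EAS of S at a grid point gives a (2S+1) x (2S+1) window, so S = 5
    corresponds to 11 x 11 grid points. ASSUMPTION: windows are truncated at
    the domain edges and the EAS is rounded to whole grid boxes.
    """
    prob = np.asarray(prob, dtype=float)
    out = np.empty_like(prob)
    ny, nx = prob.shape
    for i in range(ny):
        for j in range(nx):
            s = int(round(eas[i, j]))
            window = prob[max(i - s, 0):i + s + 1, max(j - s, 0):j + s + 1]
            out[i, j] = window.mean()
    return out


# Synthetic two-member ensemble (mm); only one member is wet at the center.
members = np.zeros((2, 3, 3))
members[0, 1, 1] = 5.0
prob = gridpoint_probability(members, threshold=1.0)

# Step 2 is assumed done elsewhere; here the EAS is 1 everywhere.
nep = eas_neighborhood_probability(prob, np.ones((3, 3)))
print(prob[1, 1], nep[1, 1])
```

Where the EAS is zero the probabilities are left unchanged, so smoothing is applied only where the members disagree on location.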

Magnitude analysis
Here we analyze the precipitation intensity within the SE. Figure 3 shows the cumulative precipitation for both of the cases, alongside the maximum hourly accumulations within the analysis domain. It indicates that the control member (a.0) of each of the ensembles lies toward the center of the precipitation distribution and that the spread is increasing with increasing lead time. The dashed lines representing the stochastic members remain close to their corresponding IBC member, implying that there is more spread from the IBC perturbations than from the SBL perturbations, and that the SBL scheme serves the purpose of ''filling the gaps'' associated with having a small ensemble. This impression is confirmed by the standard deviation (not shown) and the subensemble-averaged range of the IBC subensembles being an order of magnitude larger than that of the stochastic subensembles. For the Coverack case the ranges of the IBC and SBL subensembles at T + 48 h are 6.8 and 0.5 mm, respectively; for the Kent case the same ranges are 1.6 and 0.2 mm, respectively. The order of magnitude differences between the range of the subensembles also, qualitatively, holds throughout the forecast after the initial perturbation growth. The process of ''filling in the gaps'' in itself is a useful property as it may enable further confidence in the probabilities generated by the ensemble forecasts, and thus a better interpretation of the forecast. It is also worth noting that any bias introduced by the scheme for these cases is minimal and has no meteorological significance.
When the largest precipitation totals are considered, which become particularly meaningful when considering a flooding or potential flooding situation, our stochastic scheme introduces an increase in the number of events. This increase is shown particularly in hourly accumulations over 50 mm (Figs. 3c,d) where the probability of exceedance in the SE is 32/144 = 22.2% compared to 0% in the control subensemble in the Coverack case; for the Kent case the equivalent probabilities are 16/144 = 11.1% and 1/12 = 8.3% between T + 12 h and T + 36 h. The larger SE is able to sample more extreme tails of the precipitation rate distribution compared to the control subensemble, which is equivalent to MOGREPS-U.K. without any stochastic perturbations. This production of larger precipitation rates from the SBL scheme would have been beneficial for operational meteorologists in the Coverack case as it showed increased potential for large precipitation rates, and hence risk of flooding.
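The percentages quoted above are simple member counts. As a minimal sketch, assuming each member is reduced to a single value (for instance its largest hourly accumulation in the analysis domain over the period of interest; this reduction and the function name are assumptions for illustration), the probability of exceedance is the fraction of members above the threshold:

```python
import numpy as np

def exceedance_probability(member_values, threshold):
    """Fraction of ensemble members whose value exceeds `threshold`."""
    member_values = np.asarray(member_values, dtype=float)
    return float(np.mean(member_values > threshold))

# Synthetic stand-in values (not data from the paper):
# 32 of 144 members exceed 50 mm, as in the Coverack SE count quoted above.
se_values = np.concatenate([np.full(32, 60.0), np.full(112, 10.0)])
p_se = exceedance_probability(se_values, 50.0)  # 32/144

# A 12-member subensemble with no member above 50 mm gives a probability of 0,
# whatever the true risk; a single member changes it in steps of 1/12.
p_control = exceedance_probability(np.full(12, 10.0), 50.0)
```

The coarse 1/12 resolution of a 12-member subensemble is one reason the 144-member SE can represent the tail of the distribution more finely.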
The magnitude of the perturbation growth from the scheme is considered further through the use of the MSD, addressing the first question we posed in section 1. When considering the MSD for every point within the domain it is clear that there is more spread produced from the IBC than from the SBL perturbations (not shown). However, from the full MSD it is not clear whether the ''double penalty'' problem is influencing the results. Hence in the remainder of this section we shall discuss the perturbation growth magnitude by considering only the common points in both forecasts (MSD_common). Figure 4 shows MSD_common for both cases and for all of the IBC subensembles and the SBL subensembles.
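The distinction between the full MSD and the common-point MSD can be illustrated with a short sketch. It assumes that ''common points'' are grid points where both forecasts of a pair exceed a precipitation threshold; the threshold is left as a free parameter because its value is not restated in this section, and the function name is hypothetical.

```python
import numpy as np

def msd_full_and_common(f1, f2, threshold):
    """Mean square difference over all points and over common points only.

    Common points (assumed definition): both forecasts exceed `threshold`.
    Returns (msd_full, msd_common); msd_common is NaN if no common points exist.
    """
    f1 = np.asarray(f1, dtype=float)
    f2 = np.asarray(f2, dtype=float)
    sq_diff = (f1 - f2) ** 2
    common = (f1 > threshold) & (f2 > threshold)
    msd_full = float(sq_diff.mean())
    msd_common = float(sq_diff[common].mean()) if common.any() else float("nan")
    return msd_full, msd_common

# Synthetic example: the same 5 mm cell displaced by one grid point.
a = np.array([[0.0, 5.0, 0.0, 0.0]])
b = np.array([[0.0, 0.0, 5.0, 0.0]])
full_shift, common_shift = msd_full_and_common(a, b, threshold=1.0)
```

In the displaced-cell example the full MSD is penalized twice (once where the cell is missing and once where it appears), although the two forecasts contain identical intensities. Restricting the calculation to common points removes that positional contribution, leaving only intensity differences.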
As expected for both of the cases, there is a larger confidence interval for the MSD_common in the IBC subensembles compared to the SBL perturbation subensembles. The initial period of growth in the analysis period is hard to interpret because of the limited number of precipitating points meeting the required threshold (<1% of points in the analysis domain) for both periods (and also at the end of the analysis period for the Kent case; Fig. 4). Throughout both forecasts the impact of the SBL perturbations retains a similar magnitude whereas the impact of the IBCs varies in magnitude. For the Coverack event (Fig. 4a) the MSD_common values in the perturbation subensembles remain statistically distinguishable from each other (at the 5% statistical significance level) until T + 26 h, 14 h after the start of the precipitation in the forecast. Until this time, MSD_common for the forecasts in the SBL subensembles remains smaller than that for the forecasts in the IBC subensembles. On the other hand, MSD_common values from the perturbation subensembles in the Kent case are statistically indistinguishable, at the 5% significance level, throughout the forecast after 10 h from the start of the run (which is 4 h into the precipitation). There is a short period of time at T + 25 h where the subensembles do split and this is associated with departure of the squall line from the analysis domain at different times.
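A confidence interval on a mean MSD value can be estimated by bootstrapping. The following is a minimal percentile-bootstrap sketch, not a reproduction of the calculation used for the figures: the number of resamples, the seed, and the function name are assumptions, and the input is taken to be one MSD value per forecast pair at a single lead time.

```python
import numpy as np

def bootstrap_mean_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of `values`.

    Returns (mean, lower, upper) for a (1 - alpha) confidence level.
    """
    values = np.asarray(values, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample with replacement: one row of indices per bootstrap replicate.
    idx = rng.integers(0, values.size, size=(n_boot, values.size))
    boot_means = values[idx].mean(axis=1)
    lower, upper = np.quantile(boot_means, [alpha / 2.0, 1.0 - alpha / 2.0])
    return float(values.mean()), float(lower), float(upper)

# Synthetic stand-in for per-pair MSD values (not data from the paper).
pair_msd = np.random.default_rng(1).gamma(shape=2.0, scale=1.5, size=66)
mean, lower, upper = bootstrap_mean_ci(pair_msd)
```

One simple way to read such intervals is to treat two sets of subensembles as distinguishable when their intervals do not overlap. That criterion is conservative and is offered only as an illustration; it is not necessarily the test applied here.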
Further insight into why there are differences between the IBC and SBL perturbation growth can be gained from the DTE_T (Fig. 5). The most obvious difference is (as expected) that the IBC-induced perturbations are larger than the SBL perturbations by a factor of 10. Considering the growth rate, within the first two hours there are minimal differences although the SBL perturbations grow slightly faster than IBC perturbations; at later times the SBL perturbations grow at a much faster rate, as in Weyn and Durran (2019).
More revealing differences emerge from considering the overall evolution of the growth of the DTE_T. For both cases the IBC growth is relatively smooth with limited changes of growth rate until saturation of the initial growth. In contrast, the growth of the SBL perturbations is more ''stepped'' and irregular with time, particularly for the Coverack case (Fig. 5a) in which the steps, and the associated growth rate changes, are large. The difference in growth evolution between the two cases is akin to results from Flack et al. (2018) in which cases that were closer to the nonequilibrium end of the convective spectrum had ''erratic steps'' in their error growth, whereas cases toward the equilibrium end were much smoother. The steps are produced as a direct result of perturbation growth due to convection (cf. Figs. 3a,b) and imply that, while there is a difference in initial magnitude and so there is still growth in the SBL perturbations at the end of the forecast, there is a scale separation between the growth of the IBC and SBL perturbations. Furthermore, the growth is less likely to saturate as the SBL perturbations are applied throughout the forecast. The stepping and influence of continuous perturbations also raise questions about the upscale growth of the errors under different circumstances, and in more realistic models as opposed to the idealized configurations examined previously (e.g., Zhang et al. 2003; Selz and Craig 2015; Weyn and Durran 2018, 2019), and as such warrant further investigation (however, addressing these questions is beyond the scope of this paper). Note that the magnitude of the eventual DTE_T in SBL-perturbed runs corresponds to about 0.4 and 0.3 K standard deviation in the Coverack and Kent cases, respectively; these are similar to, but larger than, the total boundary layer standard deviation, most of which occurs at very small scales, and much larger than the stochastic forcing applied. This variability can easily account for much of the ''representativity'' error of boundary layer temperature observations.

FIG. 4 (caption, beginning not recovered). ... x-type comparisons). The thickest (darkest) line represents the average MSD, the thinner (dark) lines the 95% confidence interval on the mean from bootstrapping, and the palest lines represent the 5% and 95% quantiles of the MSD of the SE. The vertical dot-dashed lines represent when 1% of points in the analysis domain (420 grid points) are precipitating, and as such will produce more statistically reliable values. In (a) points after the line exceed this threshold and in (b) points between the two lines exceed the threshold.

MARCH 2021
FLACK ET AL.
In summary, the magnitude analysis has revealed that for the Kent case, independent of the type of perturbation, the common points are precipitating at a similar rate, whereas for the Coverack case the precipitation rate is being altered by both types of perturbations, with the IBC having a stronger impact than the perturbations from the SBL scheme. This finding is consistent with Flack et al. (2018) (which considered Gaussian θ perturbations in the boundary layer rather than the more physically derived ones used here): there is a smaller impact of SBL perturbations on precipitation intensity in cases of scattered showers (such as the Kent case in which the intensities from the perturbed members remain close to the control) compared to cases with more organized convection (such as the Coverack case in which the intensities deviate more strongly from the control). It is also more generally consistent with Weyn and Durran (2019). The magnitude results further show that not only can the SBL scheme produce reasonable differences from the corresponding control members (a.0), but also that these differences can be comparable to those produced by IBC growth after around 12 h. There is also evidence supporting the idea of the scheme ''filling in the gaps'' left by the control subensemble due to growth being directly related to convection, and hence occurring on smaller scales. However, not all aspects of the perturbation growth have been considered, and this analysis has been performed on the grid scale. To consider the perturbation growth further we next consider spatial diagnostics to analyze the ensembles where, from the DTE_T, larger differences occur.

Spatial analysis
The forecasts of convection in the two cases are also subject to positioning errors, thus the spatial aspects of the forecast are now considered. The objective is to compare the scales of agreement (or, more relevantly, disagreement) associated with the two perturbation methods. This analysis is performed across multiple scales through the use of the EAS and has been computed separately from the IBC subensembles and the SBL subensembles. Thus, each IBC member has a subensemble of SBL members and vice versa. The fraction of common points has also been calculated for the SE, and for the SBL perturbations remains consistent with the results in Part I (not shown). Figure 6 shows the average EAS for four subensembles chosen randomly from each set and for each case. Figures 6a-d and 6i-l show the EAS from the IBC subensembles (a.0, a.2, a.6, and a.11, with a varying across the IBC members) and Figs. 6e-h and 6m-p show the EAS of the SBL subensembles (0.x, 2.x, 6.x, and 11.x, with x varying over the SBL members). The results presented in this figure are for near the period of maximum intensity; however, the conclusions drawn are consistent for all other times in the analysis periods (not shown).
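The a.x member labeling can be made concrete with a small indexing sketch. It assumes the SE is stored as an array with the IBC index a on the first axis and the SBL index x on the second; this storage layout, the toy grid size, and the function names are assumptions for illustration only.

```python
import numpy as np

n_ibc, n_sbl = 12, 12   # 12 IBC members, each with 12 SBL members
ny, nx = 4, 5           # toy grid size for illustration
rng = np.random.default_rng(0)

# Synthetic stand-in for hourly precipitation: se[a, x] is member a.x.
se = rng.random((n_ibc, n_sbl, ny, nx))

def ibc_subensemble(se, x):
    """Members a.x for fixed SBL index x and all IBC indices a."""
    return se[:, x]

def sbl_subensemble(se, a):
    """Members a.x for fixed IBC index a and all SBL indices x."""
    return se[a, :]

control = ibc_subensemble(se, 0)     # a.0: the control subensemble
sbl_of_member_2 = sbl_subensemble(se, 2)  # 2.x
n_members = se.reshape(-1, ny, nx).shape[0]
```

Each member therefore belongs to exactly one IBC subensemble and one SBL subensemble, which is what allows the EAS to be computed separately for the two perturbation types from the same 144 runs.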
The two cases at these times (1500 UTC for Coverack, 1400 UTC for Kent) both have organized convection (although there are still some scattered showers in Wales for the Kent case at this time) and both show similar results. There are a few more locations with a small EAS (EAS ≈ 1) for the Coverack case compared with the Kent case (e.g., compare Figs. 6e and 6m). This difference is due to the larger areal extent of organized precipitation coverage associated with the MCS compared to the narrow squall lines (e.g., Fig. 2). The larger regions of organized convection having more agreement in the location of precipitation, and hence larger predictability (indicated by the small EAS), is consistent with Johnson et al. (2014) and Surcel et al. (2016).
Differences between the two perturbation techniques are clearer than between the two cases. There is a much smaller spatial uncertainty given by the SBL subensembles (smallest EAS of 1) compared with that of the IBC subensembles (smallest EAS of 5). For example, compare Figs. 6e and 6a. This separation of scales implies that IBC perturbations provide more variability on larger spatial scales than the SBL perturbations. The scale difference is approximately on the order of 5-10 grid points. The perturbation growth is generally occurring on scales smaller than 5-6 grid points for the SBL perturbations, whereas for the IBC perturbations growth is generally occurring on scales larger than 5-6 grid points. The existence of convection in regionally different locations with different initial conditions supports the greater importance of this perturbation type at larger scales; for example, in Figs. 6m, 6o, and 6p there is less variability in the location of convection in northern France compared to Fig. 6n. The envelope of the EAS remains the same between different SBL subensembles, i.e., subensembles including all the members with different IBC perturbations. This scale separation of perturbation growth shows that the two types of perturbations have different roles and that using them in conjunction will allow greater forecast variability. This conclusion is somewhat supported by the DTE_T analysis, which ties the growth from the SBL perturbations specifically to convection, whereas this link is less apparent for the IBC perturbations. These results demonstrate that the scale separation happens with physically based perturbations as well as with the idealized perturbations considered in Weyn and Durran (2019). We now address the second question posed in section 1, which is whether the SBL scheme produces a random relocation of cells below the ''skillful'' scale of the forecast.
To examine this question we compare the probability of exceedance fields created from two ensembles: the control subensemble (a.0) and the full SE. The probability fields for the control subensemble and the full SE are shown in Figs. 7a-d for the two different cases. A threshold of hourly accumulations exceeding 4 mm is used. This threshold is larger than that used for the previous calculations because, in operations, ensembles at short lead times (6-36 h) are predominantly used to consider the likelihood of extremes and the chance of severe weather. Comparing Figs. 7a-d with their equivalent plots in Fig. 2 shows the expected reduced precipitation coverage (and probabilities) associated with a higher precipitation threshold. The clearest difference between the control subensemble (Figs. 7a,b) and the full SE (Figs. 7c,d) is the smoother probability field for the full SE, which does appear to ''fill in the gaps'' and smooth out the small-scale variability.
The combination of this result and the EAS (Fig. 6) indicates that there is the possibility of producing similar results to the full SE (in terms of spatial location) by using neighborhood techniques to artificially increase the ensemble size. To demonstrate this possibility, and to see whether the EAS is the correct scale with which to postprocess the results, the control subensemble is postprocessed with the average EAS generated from the control subensemble (i.e., Figs. 6a and 6i, for the Coverack and Kent cases, respectively). The EAS generated from the control subensemble is used because, in an operational context, there would not be any access to the other runs (given that an SE is computationally expensive to run). This EAS is used to set a different neighborhood size for each grid point to generate the probability of rainfall at that location. A smaller EAS implies a more confident forecast and so fewer neighborhood points are used compared to a larger EAS.
The results of the postprocessing using a neighborhood based on the EAS are shown in Figs. 7e and 7f. Comparing these figures with Figs. 7a and 7b shows (as in the SE) a smoother field with ''filled in gaps.'' The postprocessing does not give the same result as the SE (Figs. 7c,d) for either case as there are some cells introduced in the SE that do not appear in the postprocessed plots. However, the vast majority of the grid points in the SE with nonzero probabilities also have nonzero probabilities in the postprocessed data, which shows that sensible postprocessing of ensembles can act to artificially increase the ensemble size. For these two cases the postprocessing is not changing the overall ''story'' of the weather forecast. Therefore, in this instance postprocessing provides meaningful probabilities, with significantly reduced computational expense, compared to that of running the full SE.
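The statement that most SE grid points with nonzero probability are also nonzero in the postprocessed field can be quantified with a simple overlap fraction. The sketch below is one possible measure, introduced here for illustration (the function name and the measure itself are assumptions, and no value of it is reported in the paper):

```python
import numpy as np

def nonzero_coverage(p_reference, p_candidate):
    """Fraction of points with nonzero reference probability that are also
    nonzero in the candidate field. Returns NaN if the reference is all zero."""
    ref = np.asarray(p_reference) > 0
    cand = np.asarray(p_candidate) > 0
    if not ref.any():
        return float("nan")
    return float((ref & cand).sum() / ref.sum())

# Synthetic fields (not data from the paper).
p_se = np.array([[0.0, 0.1, 0.2],
                 [0.0, 0.0, 0.3]])
p_post = np.array([[0.05, 0.1, 0.0],
                   [0.0, 0.0, 0.2]])
coverage = nonzero_coverage(p_se, p_post)
```

A value close to 1 would indicate that the postprocessed control subensemble flags nearly all of the locations flagged by the full SE, while cells that appear only in the SE reduce the fraction.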

Conclusions
Convective-scale ensembles are enabling better probabilistic forecasts of severe weather associated with convective events. In Part II of this study we have compared and contrasted the roles of SBL perturbation growth and IBC perturbation growth within the framework of an SE. The SE comprised the 12 members of MOGREPS-U.K., within each of which a 12-member subensemble was created using the SBL scheme outlined in Part I. This study has resulted in the following conclusions:
1) Boundary layer perturbation growth, as defined by the MSD in hourly precipitation, can equal that of IBC perturbation growth within 12 h of precipitation in the forecast commencing. This occurs only when considering the common points in ensemble pairs, as otherwise the result is dominated by the ''double penalty'' problem and so would indicate that the two forms of perturbation growth do not equal each other.
2) SBL perturbations can enhance the largest precipitation values within the forecast.
3) On the forecast time scales studied (about 12-36 h), IBC perturbation growth dominates on scales with neighborhood widths greater than 6 grid points whereas boundary layer perturbation growth dominates on scales with neighborhood widths less than 6 grid points. While magnitude differences play a role, this is also determined to be a spatial difference, based on the behavior of the temperature variance, which links the rapid growth of the boundary layer perturbations to the convection.
4) Using the EAS to postprocess the ensemble is a computationally cheap alternative to provide similar probabilities to those produced by the SBL scheme in the full SE.
These conclusions clearly hold for these two cases and for this configuration of ensemble, particularly regarding the scales present in the IBC perturbations. The results from other convective cases and other weather types (such as extratropical cyclones) may be different and longer term testing of the scheme would be required to show these results more generally, and also determine the reliability of forecasts produced with these types of perturbations. However, the results have noteworthy implications for the prediction of convection and in particular potential flooding from intense rainfall cases as they indicate the need to consider that the precipitation falling into one grid point is also likely to fall within another grid point up to the skillful scale (assuming the skillful scale reflects reality). The consideration of precipitation up to the skillful scale is required as small ensembles do not necessarily provide the correct uncertainty at the gridpoint scale. This work also has implications for research into convective-scale ensembles and model verification because it indicates the need for consideration of physically based SBL perturbations in convection-permitting ensembles. However, it also demonstrates that there are computationally cheap alternatives to running vast ensembles that can produce similar results (as in Schwartz and Sobash 2017; Blake et al. 2018, for example). As with many other papers in this area (e.g., Roberts and Lean 2008; Dey et al. 2016a; Flack et al. 2018), we highlight the need to go beyond the grid scale when considering convective-scale forecasts. We also indicate the need for careful consideration of the interpretation of diagnostics for convective-scale verification and comparisons, because of the large uncertainty at the small scales, to ensure fair and meaningful comparisons are made.