Initial Conditions for Convection-Allowing Ensembles over the Conterminous United States

Craig S. Schwartz, May Wong, Glen S. Romine, Ryan A. Sobash, and Kathryn R. Fossell

National Center for Atmospheric Research, Boulder, Colorado

Abstract

Five sets of 48-h, 10-member, convection-allowing ensemble (CAE) forecasts with 3-km horizontal grid spacing were systematically evaluated over the conterminous United States with a focus on precipitation across 31 cases. The various CAEs solely differed by their initial condition perturbations (ICPs) and central initial states. CAEs initially centered about deterministic Global Forecast System (GFS) analyses were unequivocally better than those initially centered about ensemble mean analyses produced by a limited-area single-physics, single-dynamics 15-km continuously cycling ensemble Kalman filter (EnKF), strongly suggesting relative superiority of the GFS analyses. Additionally, CAEs with flow-dependent ICPs derived from either the EnKF or multimodel 3-h forecasts from the Short-Range Ensemble Forecast (SREF) system had higher fractions skill scores than CAEs with randomly generated mesoscale ICPs. Conversely, due to insufficient spread, CAEs with EnKF ICPs had worse reliability, discrimination, and dispersion than those with random and SREF ICPs. However, members in the CAE with SREF ICPs undesirably clustered by dynamic core represented in the ICPs, and CAEs with random ICPs had poor spinup characteristics. Collectively, these results indicate that continuously cycled EnKF mean analyses were suboptimal for CAE initialization purposes and suggest that further work to improve limited-area continuously cycling EnKFs over large regional domains is warranted. Additionally, the deleterious aspects of using both multimodel and random ICPs suggest efforts toward improving spread in CAEs with single-physics, single-dynamics, flow-dependent ICPs should continue.

Corresponding author: Craig Schwartz, schwartz@ucar.edu


1. Introduction

Convection-allowing ensembles (CAEs) provide useful and valuable forecast guidance (e.g., Clark et al. 2012; Evans et al. 2014; Schwartz et al. 2019) and are now operational at many meteorological offices (e.g., Gebhardt et al. 2011; Peralta et al. 2012; Hagelin et al. 2017; Raynaud and Bouttier 2017; Jirak et al. 2018; Klasa et al. 2018). Nonetheless, uncertainty remains about optimal CAE design, especially regarding initial condition perturbations (ICPs), which are needed to generate forecast diversity before perturbations from lateral boundary conditions (LBCs) and model error representation schemes engender substantial spread (e.g., Hohenegger et al. 2008; Vié et al. 2011; Peralta et al. 2012; Kühnlein et al. 2014; Romine et al. 2014; Zhang 2019).

There are several broad approaches for producing CAE ICPs. One simple method is to add random noise to a deterministic field (e.g., Hohenegger and Schär 2007; Johnson et al. 2014; Raynaud and Bouttier 2016; hereafter RB16). While this operation is trivial, randomly produced ICPs are not flow dependent, a potential limitation.

Another straightforward method for initial condition (IC) generation is to downscale preexisting coarse-resolution analyses or short-term forecasts from either an ensemble or collection of deterministic numerical weather prediction (NWP) models directly onto the CAE grid (Jones and Stensrud 2012; Duc et al. 2013; Romine et al. 2014; Schumacher and Clark 2014; Schwartz et al. 2015a,b, 2019; Tennant 2015; Clark 2017; Schellander-Gorgas et al. 2017; Jirak et al. 2018; Klasa et al. 2018; Cafaro et al. 2019; Porson et al. 2019). Downscaling means the ICPs directly reflect the NWP model and data assimilation (DA) system underlying the coarse-resolution fields, and finescale details are not introduced into the CAE ICs.

Alternatively, ICPs derived from coarser-resolution models can be recentered about a deterministic analysis of one’s choosing, which could either be interpolated onto or produced directly on the CAE grid (Xue et al. 2007; Kong et al. 2008, 2009; Peralta et al. 2012; Kühnlein et al. 2014; Tennant 2015; RB16; Raynaud and Bouttier 2017; Hagelin et al. 2017). Thus, while these ICPs again reflect the external modeling system, the initial ensemble center could possess finescale structures that are imparted to individual ensemble members during recentering. However, recentering can be complex; European studies examining how recentering affects ensemble forecasts revealed mixed results (e.g., Lang et al. 2015; Tennant 2015; RB16), and recentering analysis ensembles about deterministic “hybrid” variational-ensemble analyses within DA contexts yields little impact (e.g., Clayton et al. 2013; Wang et al. 2013; Pan et al. 2014; Schwartz et al. 2015c).

Still another approach for generating CAE ICPs is to produce them directly on the CAE grid with an ensemble DA system, which provides flow-dependent ICPs fully consistent with the CAE forecast model that span all possible resolvable scales (e.g., Vié et al. 2011; Bouttier et al. 2012; Harnisch and Keil 2015; Wheatley et al. 2015; Johnson and Wang 2016; RB16; Keresturi et al. 2019). While this method is more sophisticated than and theoretically preferable to others, convective-scale DA is still evolving and computationally expensive.

Within each of these overarching methods, there are many options for producing CAE ICs: random noise can be generated in a variety of manners with different correlation length scales; coarse-resolution analyses are available from numerous NWP models with varied resolutions and DA methods; perturbations can be derived from and centered about many potential datasets; and myriad high-resolution DA implementations are possible. Moreover, these various approaches can be combined to produce CAE ICPs (e.g., Zhang 2018, 2019).

Yet, despite the multitude of options for CAE ICP generation, few studies have rigorously examined CAE forecast sensitivity to ICPs. Perhaps the most systematic study devoted to CAE ICPs was RB16, who found ICPs provided from both correlated random noise and a high-resolution, perturbed-observation variational DA system led to better CAE forecasts than ICPs from downscaled global ensemble analyses through 9–12 h. However, RB16 reported negligible sensitivity to ICP method for 12–36-h forecasts, presumably because LBC information quickly swept through their fairly small France-centered computational domain, and it is unclear how RB16’s results may translate to larger domains that are less prone to LBC impacts (e.g., Warner et al. 1997; Romine et al. 2014; Schumacher and Clark 2014), where sensitivity to ICPs may be detectable beyond 9–12 h.

In addition, several studies have assessed the suitability of limited-area ensemble Kalman filters (EnKFs; Evensen 1994; Houtekamer and Zhang 2016) for CAE initialization in case-study or idealized frameworks, although some did not fully isolate ICP impacts. For example, Harnisch and Keil (2015) suggested a convective-scale EnKF could initialize better CAE forecasts than downscaled ICPs for three forecasts, but forecast differences were not fully attributable to ICPs given discrepancies regarding DA and LBCs between various CAEs. Similarly, although Schumacher and Clark (2014) suggested an EnKF-initialized CAE sometimes outperformed a CAE initialized by downscaling and recentering non-EnKF perturbations about a deterministic analysis for a multiday heavy rainfall case, many differences between the CAEs also limited attribution to ICPs. Conversely, Johnson and Wang (2016) performed an idealized, controlled experiment and noted ICPs produced directly on a convection-allowing grid via EnKF DA led to modestly better 9-h precipitation forecasts than when ICPs were provided by coarser-resolution EnKF analyses, but their “perfect model” framework may not apply to many real-data situations.

More broadly, limited-area EnKFs are attractive for CAE initialization, as EnKFs seamlessly meld ensemble DA and forecasting in a single step to produce analysis ensembles that can initialize CAEs. Furthermore, continuously cycling EnKFs have become increasingly popular for real-time CAE forecast applications. For instance, between 2015 and 2017, the National Center for Atmospheric Research (NCAR) produced experimental, real-time CAE forecasts over the conterminous United States (CONUS) initialized with a continuously cycling EnKF (Schwartz et al. 2015b, 2019), and in 2017, Germany began using a continuously cycling EnKF to initialize their operational CAE (Schraff et al. 2016; Pantillon et al. 2018).

However, using continuously cycling limited-area EnKFs to initialize CAEs has risks, as biases can accumulate through assimilation cycles and degrade forecasts (e.g., Hsiao et al. 2012; Torn and Davis 2012; Romine et al. 2013; Wong et al. 2020); EnKFs over large domains like the CONUS might be more susceptible to this problem than EnKFs over comparatively small European domains where prominent LBC influences may mitigate bias accumulation. Therefore, although NCAR’s experimental CAE forecasts were credible and widely adopted by both researchers and forecasters (Schwartz et al. 2019), it remains unclear whether large-domain, limited-area, continuously cycling EnKFs are optimal for producing CAE ICPs. Furthermore, objective assessments of systematic, controlled experiments designed to isolate impacts of ICPs on real-data CAE forecasts over the CONUS have yet to be reported, although subjective evaluations of two CAEs differing solely by ICPs performed during NOAA’s 2019 Hazardous Weather Testbed Spring Forecasting Experiment suggested ICPs had little impact on severe weather forecasts (Clark et al. 2019).

Accordingly, to further understanding about ICP methods for CAEs, including EnKF-based approaches, this study systematically examined 31 48-h forecasts from several 10-member CAEs over the CONUS, where many of the CAEs differed solely by their ICPs. In addition, to explore the impacts of recentering ICPs, other CAEs differed solely by their central initial states. Thus, differences between various CAE forecasts were fully attributable to either ICPs or central initial state, providing insight about CAE initialization and design that has implications for development of future operational CAEs, such as those at NOAA under the Unified Forecast System (UFS) framework.

2. Model configurations and ICP strategies

a. Model configurations

All CAEs employed forecast model configurations similar to those used in NCAR’s real-time CAE project (Schwartz et al. 2015b, 2019). Specifically, 48-h forecasts were produced by version 3.6.1 of the Advanced Research Weather Research and Forecasting (WRF) Model (Skamarock et al. 2008; Powers et al. 2017) over a two-way nested domain spanning the CONUS and adjacent areas (Fig. 1a). The horizontal grid spacing was 15 km in the outer domain and 3 km in the nest, where time steps were 75 and 18.75 s, respectively. Both domains had 40 vertical levels, a 50-hPa top, and used common physical parameterizations (Table 1), except no cumulus parameterization was employed on the 3-km grid. All ensemble members used identical physics and dynamics.

Fig. 1. (a) Computational domain. The horizontal grid spacing was 15 km in the outer domain (415 × 325 points) and 3 km in the nest (1581 × 986 points). Objective verification only occurred over the red shaded region of the 3-km domain (CONUS east of 105°W). Blue dots denote locations of rawinsonde observations used for verification. (b) Total accumulated Stage IV (ST4) precipitation (mm) between 0000 UTC 1 May and 0000 UTC 2 Jun 2015 over the verification region. This accumulation period encompassed all possible valid times of the model forecasts.

Table 1. Physical parameterizations for all WRF Model forecasts. Cumulus parameterization was only used on the 15-km domain.

During the 48-h integrations, LBCs were produced for each ensemble member by perturbing forecasts from NCEP’s Global Forecast System (GFS) with random, correlated, Gaussian noise with zero mean (e.g., Barker 2005; Torn et al. 2006) drawn from the default “cv3” background error covariances (BECs) provided by the WRF Model’s DA system (WRFDA; Barker et al. 2012), which were produced with the “NMC method” (Parrish and Derber 1992) based on differences between 48- and 24-h forecasts from a legacy ~100-km configuration of the GFS model. Following Schwartz et al. (2015b), LBC perturbation magnitudes linearly increased throughout the forecasts to promote spread. Identical LBC sets were used for all CAEs.

b. IC generation and experimental design

Five sets of 10-member, 48-h CAE forecasts were produced over May 2015, which was the wettest month ever recorded over the CONUS (e.g., Blunden and Arndt 2016) and featured a broad precipitation maximum over the central CONUS (Fig. 1b). The CAEs were identical except for their ICs, which are now described.

1) Continuously cycling EnKF DA

Two CAEs had ICPs derived from an experimental continuously cycling ensemble adjustment Kalman filter (Anderson 2001, 2003; Anderson and Collins 2007), a type of EnKF, implemented in the Data Assimilation Research Testbed (DART) software (Anderson et al. 2009). The EnKF DA system had 80 ensemble members and produced analyses solely on the 15-km domain (Fig. 1a).

Similar to Schwartz et al. (2015a), an initial 15-km ensemble was created by adding random noise drawn from the WRFDA-provided BECs to the 0.25° 0000 UTC 26 April 2015 GFS analysis; this randomly generated ensemble served as the prior (before assimilation) ensemble for the first EnKF analysis. Then, the 0000 UTC 26 April 2015 posterior (after assimilation) ensemble initialized a 6-h, 80-member ensemble forecast that became the prior ensemble for the second EnKF analysis at 0600 UTC 26 April 2015, and this analysis–forecast cycle with a 6-h period continued until 0000 UTC 31 May 2015. Model configurations and LBC perturbation strategies during the 6-h forecast steps were identical to those described in section 2a, except forecasts were not produced on the 3-km grid. Soil states evolved freely and independently for each member during the entire cycling period, and the first 5 days of cycling were considered spinup and discarded.

The EnKF assimilated conventional observations as described by Schwartz et al. (2015b), with the addition of global positioning system radio occultation refractivity observations. Observation errors, preprocessing, and quality control were also detailed by Schwartz et al. (2015b) and included an “outlier check” to reject observations far from the prior ensemble mean, inflating errors of observations near lateral boundaries, rejecting surface observations with mismatched modeled and observed terrain heights, and superobbing aircraft and satellite wind observations.

Specific DA settings, summarized in Table 2, were mostly similar to those employed by NCAR during their real-time CAE project (Schwartz et al. 2015b, 2019). Compared to NCAR's real-time EnKF analyses produced in May 2015 (Schwartz et al. 2015b), the biggest differences involved ensemble size and covariance inflation. Specifically, this study used 80 members and posterior relaxation-to-prior-spread (RTPS) inflation (Whitaker and Hamill 2012), whereas the real-time analyses had 50 members and used prior adaptive inflation (e.g., Anderson 2009). Systematic experimentation found little precipitation forecast sensitivity to inflation method, so we chose RTPS inflation for its simplicity. Ultimately, the EnKF configuration herein was well tuned in a spread–skill sense (Houtekamer et al. 2005) and initialized significantly better precipitation forecasts than the real-time EnKF analyses (not shown), likely due to the larger ensemble size, which benefits EnKFs (e.g., Zhang et al. 2013; Houtekamer et al. 2014).

Table 2. Configuration details of the DART-based EnKF DA system.

For each 0000 UTC EnKF analysis between 1 and 31 May 2015 (inclusive), the first 10 members of the 15-km analysis ensemble (i.e., members 1–10) initialized 48-h forecasts on the nested grid (Fig. 1a), where 3-km ICs were downscaled from the 15-km analyses that lacked storm-scale structures. Because the EnKF can be conceived as separately updating a mean and perturbations about the mean, our EnKF-based ICPs were centered about 80-member ensemble mean EnKF analyses ("EnKFEnKF"; Table 3). On average, each EnKF member was equally likely to be closest to "truth," so choosing the first 10 members to initialize 48-h forecasts was analogous to randomly picking 10 members from the full 80-member analysis ensembles (e.g., Schwartz et al. 2014, 2019), and 10-member CAEs can provide skillful and valuable forecasts (e.g., Clark et al. 2011, 2018; Schwartz et al. 2014).

Table 3. Description of the various CAEs in terms of their initial centers, IC perturbation methods, and initial hydrometeors.

In addition, between 1 and 31 May 2015 (inclusive), 0000 UTC perturbations of zonal and meridional wind, potential temperature, water vapor mixing ratio, and perturbation geopotential and dry surface pressure (U, V, θ, qυ, ϕ, and μ, respectively) from analysis members 1–10 were added to corresponding 0000 UTC 0.25° GFS analyses to create another set of ICs that initialized 48-h forecasts on the nested grid ("GFSEnKF"; Table 3). These ICs had U, V, θ, qυ, ϕ, and μ perturbations identical to those of EnKFEnKF but were centered on GFS analyses, rather than EnKF mean analyses (Table 3), providing insight about sensitivity to IC center and suitability of large-domain regional continuously cycling EnKFs to initialize CAE forecasts. Compared to the EnKF analyses, GFS analyses had coarser resolution but assimilated many more observations, including satellite radiances, and reflected a well-tuned, operational deterministic forecast system. Moreover, standard WRF Model preprocessing discards GFS hydrometeor analyses, such that CAEs with GFS initial centers started with no (zero value) hydrometeors, contrasting those CAEs with EnKF mean initial centers (Table 3).
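To make the recentering operation concrete, a minimal Python/NumPy sketch follows; the array names are hypothetical, the fields are assumed to already share a common grid, and this is an illustration rather than the exact preprocessing code used in this study.

    import numpy as np

    def recenter(enkf_members, enkf_mean, gfs_analysis):
        """Recenter ensemble ICs: retain each member's perturbation about the
        80-member EnKF mean but move the ensemble center to the GFS analysis.
        In this study the operation would apply separately to U, V, theta,
        qv, phi, and mu; `enkf_members` has shape (n_members, ny, nx)."""
        perturbations = enkf_members - enkf_mean[None, ...]
        return gfs_analysis[None, ...] + perturbations

For a single field, passing the 10 selected EnKF analysis members, the 80-member EnKF mean, and the corresponding GFS analysis yields that field's GFSEnKF ICs.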

2) Random ICPs

Performing limited-area EnKF DA can be expensive, so other cheaper, pragmatic methods of producing CAE ICPs were also explored. Thus, two additional sets of 10-member 48-h forecasts on the nested grid (Fig. 1a) were initialized by taking random draws of U, V, θ, qυ, ϕ, and μ from the WRFDA-provided BECs and adding them to both 0000 UTC 15-km EnKF mean (“EnKFRAND”) and 0.25° GFS analyses (“GFSRAND”) between 1 and 31 May 2015 (inclusive; Table 3). The random patterns differed for each initialization, but, for a particular initialization and ensemble member, identical random perturbations were added to both EnKF mean and GFS analyses (i.e., member 1 of EnKFRAND and GFSRAND had identical perturbations).

Length scales and variances of random perturbations can be tuned when drawing from BECs, providing many possibilities for specifying initial correlated random noise. However, we only used one set of tuning parameters, where the length scales were empirically reduced by ~85% from those within the WRFDA-provided BECs; this reduction was necessary because of our much finer grid spacing compared to the ~100-km statistics contained in the BECs. Variances were also reduced relative to those in the WRFDA-provided BECs in an attempt to roughly approximate the spread of the other initial ensembles. Ultimately, our randomly produced ICPs had mesoscale structures, contrasting RB16, who used a similar method to produce random CAE ICPs but with convective-scale structures. Although subsequent CAE forecasts may be sensitive to the length scale and variance of initial random noise, examining this sensitivity was beyond the scope of this work; the primary purpose of constructing random ICPs was to assess whether they yielded forecast quality comparable to that of flow-dependent EnKF ICPs, but at substantially lower cost, by examining relative performances of CAEs with the same initial center but different ICPs (e.g., EnKFRAND versus EnKFEnKF and GFSRAND versus GFSEnKF; Table 3).
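As a rough illustration of how spatially correlated random ICPs can be generated, the following Python sketch smooths Gaussian white noise with a Gaussian kernel and rescales it to a target standard deviation. This mimics the general character of drawing from static BECs with tunable length scales and variances, but it is not WRFDA's actual recursive-filter implementation, and the parameter values shown are hypothetical.

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def correlated_noise(shape, length_scale_km, dx_km, target_std, seed=None):
        """Draw a 2D random perturbation field with approximately Gaussian
        spatial correlations. The length scale and target standard deviation
        are tunable, analogous to the length-scale and variance reductions
        described above."""
        rng = np.random.default_rng(seed)
        white = rng.standard_normal(shape)
        # Gaussian smoothing imposes the horizontal correlation length scale
        smooth = gaussian_filter(white, sigma=length_scale_km / dx_km)
        # Smoothing damps variance, so rescale to the desired spread
        return smooth * (target_std / smooth.std())

    # Hypothetical example: one 15-km-grid perturbation for a single level
    perturbation = correlated_noise((325, 415), length_scale_km=150.0,
                                    dx_km=15.0, target_std=0.5, seed=1)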

3) SREF ICPs

An additional IC set was produced by adding perturbations of U, V, θ, qυ, ϕ, and μ derived from 2100 UTC–initialized 3-h forecasts of NCEP’s Short-Range Ensemble Forecast (SREF; Du et al. 2014) system to 0000 UTC 0.25° GFS analyses (“GFSSREF”; Table 3); these ICs then initialized 48-h forecasts on the nested grid between 1 and 31 May 2015 (inclusive). This inexpensive method was very similar to that used by the Center for Analysis and Prediction of Storms to produce CAE ICPs for many years (e.g., Xue et al. 2007; Kong et al. 2008, 2009; Gallo et al. 2017). Like EnKF perturbations, SREF perturbations were flow-dependent, and although the SREF system had 16-km horizontal grid spacing, data available to us had been coarsened to 32 km.

During the experimental period (May 2015), the SREF contained 21 members with diversity provided by varied dynamic cores, physics, and ICs (Du et al. 2014). However, we only needed perturbations from 10 members, which were chosen as the 8 SREF members used to initialize the National Severe Storms Laboratory's experimental CAE (Clark 2017) plus 2 additional members based on the Advanced Research WRF dynamic core (the "p3" and "n3" SREF members). Contrasting the single-physics, single-dynamics EnKF ICPs, these 10 SREF-based ICPs collectively reflected three dynamic cores, each associated with its own unique IC generation method, and, moreover, some physics schemes varied across SREF members with a common core (Du et al. 2014). Thus, below we refer to SREF ICPs as "multimodel ICPs," with the understanding that differences between GFSSREF members cannot be fully attributed to dynamic core, physics, or initialization method encapsulated and entangled within their ICPs.

3. ICP characteristics and spread growth

a. Initial spread characteristics

Mean 700-hPa zonal wind spread over all 31 0000 UTC initial 15-km ensembles highlighted differences between the various ICPs. Specifically, EnKF and SREF ICPs were flow dependent (Figs. 2a,b), with relatively large spread associated with stronger mean height gradients that portend uncertainty, such as over eastern Canada and the central CONUS, and comparatively small spread associated with weaker height gradients over the southeast CONUS and West Coast. Conversely, random ICPs were not flow dependent and yielded nearly uniform mean spread reflecting the tuned BECs (Fig. 2c).

Fig. 2. Average standard deviation of 700-hPa zonal wind (m s−1) over all 10-member, 15-km 0000 UTC initial ensembles between 1 and 31 May 2015 (inclusive) constructed with (a) EnKF, (b) SREF, and (c) random perturbations. Mean 700-hPa height (m; contoured every 20 m) over all 0000 UTC Global Forecast System (GFS) analyses between 1 and 31 May 2015 (inclusive) is overlaid on each panel, and the verification region (CONUS east of 105°W) is outlined.

Consistent with Figs. 2a and 2b, EnKF- and SREF-based initial ensembles had comparable spreads for wind (Figs. 3a,b), while SREF-based initial ensembles had larger spread than EnKF-based initial ensembles for temperature and moisture below 250 hPa (Figs. 3c,d); these larger SREF spreads for thermodynamic variables were possibly manifestations of diverse precipitation patterns produced by the multiple models in the unconstrained 3-h SREF forecasts leveraged to obtain ICPs. Except at jet stream level, randomly produced initial ensembles had wind spreads broadly comparable to those with SREF and EnKF perturbations, but with smoother vertical structures (Figs. 3a,b). However, random ICPs had more midtropospheric temperature and low-level moisture spread than the other ICPs (Figs. 3c,d).

Fig. 3. Standard deviation of (a) zonal wind (m s−1), (b) meridional wind (m s−1), (c) potential temperature (K), and (d) water vapor mixing ratio (g kg−1) as a function of pressure (hPa) over the verification region (CONUS east of 105°W) averaged over all 10-member, 15-km 0000 UTC initial ensembles between 1 and 31 May 2015 (inclusive).

b. Spread evolution

1) Spread and error at rawinsonde locations

Ensemble mean RMSEs with respect to rawinsonde observations and standard deviations at rawinsonde locations were computed to assess spread and error growth. The ensembles with EnKF ICPs (EnKFEnKF and GFSEnKF; Table 3) had similar spreads throughout the forecast (Fig. 4), but the CAE initially centered about EnKF mean analyses (EnKFEnKF; gray curves) had smaller RMSEs than the CAE initially centered about GFS analyses (GFSEnKF; blue curves) at the initial time, indicating the EnKF (section 2b) fit rawinsonde observations more closely than GFS analyses. However, EnKFEnKF ensemble mean RMSEs grew quickly and were statistically significantly worse than those from GFSEnKF after initialization, indicating forecast sensitivity to initial center and suggesting GFS analyses were overall better than EnKF mean analyses.

Fig. 4. (a) 925-, (b) 850-, (c) 700-, (d) 500-, and (e) 300-hPa temperature (K) ensemble mean RMSE compared to rawinsonde observations (solid lines) and standard deviation at rawinsonde locations (dashed lines) aggregated over all 31 3-km forecasts over the verification region (CONUS east of 105°W) as a function of forecast hour. See Fig. 1a for locations of rawinsonde observations. Statistical significance between EnKFEnKF and GFSEnKF RMSEs was determined by a bootstrap resampling technique (see section 4b) and denoted by open circles placed on the curve with the significantly lower RMSE; for example, a blue circle on a blue line indicates GFSEnKF had a statistically significantly lower RMSE than EnKFEnKF at the 95% level. (f)–(j) As in (a)–(e), but for wind speed (m s−1). Black asterisks represent observation error standard deviations.

The three CAEs with GFS initial centers (GFSEnKF, GFSRAND, GFSSREF; Table 3) typically had similar RMSEs but different spreads (Fig. 4), with initial spreads generally aligned with Figs. 3a–c. In particular, except for 500- and 300-hPa wind, EnKF ICPs had the smallest initial spread at rawinsonde locations due to the restorative effect of assimilating those very observations, and GFSEnKF spread was smaller than GFSRAND and GFSSREF spread from 12 to 48 h.

While GFSEnKF and GFSSREF spread sometimes grew more than GFSRAND spread over the entire 48-h forecast, over the first 12 h, GFSSREF spread usually grew faster than GFSEnKF spread and GFSRAND spread growth rates were typically highest (Fig. 4). Although GFSRAND initial spread was often relatively large, even when GFSRAND initial spread was comparable to or smaller than that of the other ensembles, rapid spread growth still occurred over the first 12 h (Figs. 4a,b,f,g,j), suggesting GFSRAND forecast spread was not simply modulated by its initial spread.

2) Perturbation power spectra

To further understand spread growth characteristics over the first 12 h, perturbation power spectra were computed using the discrete Fourier transform after applying a Hanning window (e.g., Harris 1978) to enforce periodicity. Random 2-m temperature and 10-m wind ICPs had less power than SREF and EnKF ICPs for scales <500 and 250 km, respectively (Figs. 5a,f), reflecting the specified length scales used to construct random noise. However, random ICPs led to rapid error growth over the first hour (Figs. 5b,g), with larger growth rates than EnKF and SREF ICPs at small scales, suggesting rapid GFSRAND spread increases over the first 12 h (e.g., Figs. 4a,b,f–j) were driven by small-scale perturbation growth ultimately spurred by downscale propagation of random mesoscale errors (e.g., Durran and Gingrich 2014). After the first hour, GFSRAND error growth rates were much slower (Figs. 5c–e,h–j), but by 12 h, at all scales GFSRAND had the most perturbation energy and GFSEnKF the least (Figs. 5e,j), consistent with greater low-level GFSRAND 12-h forecast spread compared to GFSEnKF (Figs. 4a,f).
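A minimal sketch of this diagnostic is below, assuming 2D perturbation fields (member minus ensemble mean) on a uniform grid: a 2D Hann window is applied before the DFT, and power is binned by scalar wavenumber to give a radially averaged spectrum as a function of wavelength. The binning and normalization details are illustrative and may differ from the exact procedure used here.

    import numpy as np

    def perturbation_spectrum(field, dx_km, nbins=100):
        """Radially averaged power spectrum of a 2D perturbation field. A 2D
        Hann (Hanning) window enforces periodicity before the DFT, as
        described above. Returns wavelengths (km) and binned power."""
        ny, nx = field.shape
        window = np.outer(np.hanning(ny), np.hanning(nx))
        power = np.abs(np.fft.fft2(field * window)) ** 2
        # Scalar wavenumber (cycles per km) for each 2D Fourier coefficient
        ky = np.fft.fftfreq(ny, d=dx_km)[:, None]
        kx = np.fft.fftfreq(nx, d=dx_km)[None, :]
        k = np.hypot(kx, ky)
        # Sum power within wavenumber bins; wavelength = 1 / wavenumber, so
        # the returned wavelengths decrease as wavenumber increases
        kbins = np.linspace(k[k > 0].min(), k.max(), nbins)
        which = np.digitize(k.ravel(), kbins)
        spectrum = np.bincount(which, weights=power.ravel(),
                               minlength=nbins + 1)[1:nbins + 1]
        return 1.0 / kbins, spectrum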

Fig. 5. (a)–(e) Average perturbation energy for 2-m temperature (K2) as a function of wavelength (km) computed over all 31 3-km forecasts and the 3-km domain east of 105°W, excluding 16 and 42 grid points from the eastern and northern/southern boundaries, respectively, for (a) analyses and (b) 1-, (c) 3-, (d) 6-, and (e) 12-h forecasts. (f)–(j) As in (a)–(e), but for 10-m kinetic energy (m2 s−2). Vertical lines denote 6 times the horizontal grid spacing (3 km), the approximate effective resolution of the forecasts (Skamarock 2004), and horizontal lines are for reference to help visualize changes across forecast hours. Note that the initial conditions had 15-km horizontal grid spacing, which is manifested by lack of small-scale power in (a) and (f).

Overall, these spectra illustrate rapid GFSRAND error growth was insensitive to ICP variance magnitude; GFSRAND surface temperature spread was relatively small (Fig. 3c) while its surface wind spread was relatively large (Figs. 3a,b), yet rapid GFSRAND error growth occurred over the first hour for both variables. Similar evolutions were evident for other vertical levels, and after 12 h, spectra from all three ensembles gradually converged as common LBCs exerted their influence (not shown).

3) Precipitation spread

Precipitation development and spread over the first 18 h were sensitive to initial spread characteristics. Most notably, precipitation variances (about each ensemble's mean) were largest in the two CAEs with random ICPs, with rapid spread increases over the first 6 h (Fig. 6a) consistent with fast low-level error growth (Figs. 4a,b,f,g, 5). Comparatively, precipitation variances were less sensitive to IC center, although the CAEs initially centered about GFS analyses had less spread than those initially centered about EnKF mean analyses through 18 h, possibly because initial nonzero hydrometeor states in the CAEs initially centered about EnKF mean analyses (Table 3) contributed to spread growth over the first few hours. GFSSREF had larger spread than GFSEnKF through 24 h, possibly due to the multiple models reflected in SREF-based ICPs and generally consistent with greater GFSSREF initial spread (Figs. 3c,d, 4) and spread growth over the first 12 h (Fig. 4). Variances computed after a bias correction (see section 4b) generally behaved similarly to uncorrected variances over the first 12–18 h but with smaller differences among the ensembles (Fig. 6b).

Fig. 6. Average ensemble variance (mm2) over the verification region (CONUS east of 105°W) and all 31 3-km forecasts of 1-h accumulated precipitation as a function of forecast hour computed from (a) raw, native grid data and (b) bias-corrected precipitation interpolated onto the ST4 grid (see section 4b). Values on the x axis represent ending forecast hours of 1-h accumulation periods (e.g., an x-axis value of 24 is for 1-h accumulated precipitation between 23 and 24 h).

After 18 h, precipitation variances were more similar across all five CAEs than at earlier times. However, of the three CAEs initially centered about GFS analyses, GFSSREF had the most spread between 24 and 33 h for raw variances (Fig. 6a), while bias-corrected variances indicated more spread from random and SREF ICPs between 24 and 42 h compared to EnKF ICPs (Fig. 6b).

c. Forecast example

The forecast initialized at 0000 UTC 11 May 2015 nicely illustrates how different ICPs impacted spread growth. At this time, precipitation was ongoing in the vicinity of tropical depression Ana over southeastern North Carolina and along surface boundaries stemming from a low pressure center over South Dakota, which was associated with an upper-level trough over the Rockies and adjacent plains (Figs. 7s–u). Initial 2-m temperature EnKF and SREF perturbations indicated flow dependence with enhanced spread around these features (Figs. 7a,g,s,t), whereas random perturbations did not reflect these phenomena (Figs. 7m,u).

Fig. 7. Standard deviation of 2-m temperature (K) at the (a),(g),(m) initial time and for (b),(h),(n) 1-, (c),(i),(o) 3-, (d),(j),(p) 6-, (e),(k),(q) 9-, and (f),(l),(r) 12-h 3-km forecasts initialized at 0000 UTC 11 May 2015 for (a)–(f) GFSEnKF, (g)–(l) GFSSREF, and (m)–(r) GFSRAND. (s)–(u) As in (a),(g), and (m), respectively, but with a different color scale to more easily see structural features, and 10-m winds (barbs; kts), sea level pressures (hPa; gray lines), and 500-hPa heights (m; magenta lines) from the 0000 UTC 11 May 2015 GFS analysis are overlaid.

The 2-m temperature perturbation magnitudes were initially small (Figs. 7a,g,m,s–u), but by 1 h, spread substantially increased. At 1 h, GFSEnKF and GFSSREF spread primarily reflected the surface low pressure system and attendant fronts (Figs. 7b,h), while GFSRAND had not fully developed flow-dependent characteristics (Fig. 7n). However, after 3 h, all spread patterns reflected synoptic-scale features (Figs. 7c,i,o), and by 6–12 h, the three ensembles had comparable structures near the fronts (Figs. 7d–f,j–l,p–r), with GFSSREF spread highest along the boundaries. Conversely, in the weak forcing regime over the southeastern CONUS and Ohio Valley, GFSRAND possessed much more spread than GFSSREF and GFSEnKF that peaked from 6 to 9 h (Figs. 7d,e,j,k,p,q), consistent with Johnson et al. (2014), who suggested random noise was most likely to promote spread growth in weak forcing scenarios. It appears that these random spread patterns were initially organized on small scales (Figs. 7n,o), consistent with perturbation spectra indicating rapid small-scale error growth over the first several hours (Figs. 5a–c,f–h).

Regarding precipitation, GFSEnKF had more spread than GFSRAND and GFSSREF at 1 h (Figs. 8a,f,k), and although GFSRAND 2-m temperature structures were not fully flow-dependent at this time (Fig. 7n), GFSRAND precipitation spread represented flow-dependent features (Fig. 8k). This finding was similar to RB16, who noted flow-dependent precipitation structures quickly developed in a CAE with storm-scale random ICPs. By 3 h, precipitation spread had grown substantially in all CAEs (Figs. 8b,g,l), and by 6–12 h, there were generally more enhanced and wider areas of nonzero spread in GFSRAND than GFSEnKF in the vicinity of frontally forced precipitation (Figs. 8c–e,h–j,m–o). Consistent with Fig. 6a, GFSRAND spread peaked at 6 h, and while GFSSREF and GFSEnKF had similar patterns, there was slightly more GFSSREF spread from 3 to 12 h.

Fig. 8. Standard deviation of 1-h accumulated precipitation (mm) for (a),(f),(k) 1-, (b),(g),(l) 3-, (c),(h),(m) 6-, (d),(i),(n) 9-, and (e),(j),(o) 12-h 3-km forecasts initialized at 0000 UTC 11 May 2015 for (a)–(e) GFSEnKF, (f)–(j) GFSSREF, and (k)–(o) GFSRAND over the verification region (CONUS east of 105°W). The mean standard deviation (σ) over the verification region is annotated above each panel.

Between 3 and 9 h, GFSRAND precipitation spread was particularly large over the weakly forced southeast CONUS and Ohio Valley, whereas GFSEnKF and GFSSREF had much less spread in similar locales (Figs. 8b–d,g–i,l–n). Areas of enhanced GFSRAND precipitation spread often appeared to be preceded by relatively large GFSRAND 2-m temperature perturbations (Figs. 7m–p) and were accompanied by low probabilities of precipitation over wide areas where observed precipitation did not occur (not shown). Thus, at least for this case, random ICPs led to false alarms in some members in areas with weak forcing.

d. Summary

The preceding analyses suggest random ICPs promoted rapid short-term error growth, primarily driven by small-scale perturbations (Fig. 5), while EnKF and SREF ICPs had comparatively slower error growth rates. Accordingly, compared to the other ICP strategies, random ICPs generally yielded more spread over the first 12–18 h (Figs. 4, 6). As the next section shows, this additional spread from random ICPs was sometimes helpful, yet did not always possess favorable characteristics.

4. Precipitation verification

Hourly accumulated precipitation forecasts were objectively compared to NCEP’s Stage IV (ST4) analyses (Lin and Mitchell 2005) over the CONUS east of 105°W (Fig. 1a), where ST4 data were most robust (e.g., Nelson et al. 2016) and considered as “truth.” While some statistics were computed on native grids, many verification metrics require a common grid for forecasts and observations. So, for these metrics, all precipitation forecasts were interpolated to the ST4 grid (4.763-km horizontal grid spacing) using a precipitation-conserving budget interpolation algorithm (e.g., Accadia et al. 2003). We primarily focused on precipitation because it is an important sensible weather field and depends on many physical processes, thus providing an overall summary of model performance.

Statistics presented in this section are aggregated over all 31 forecasts.

a. Bias characteristics

1) Total precipitation

Total precipitation over the verification region (Fig. 1a) normalized by number of grid points in the verification region was determined for each member on native grids (Fig. 9). To concisely summarize results, only the mean and range (maximum minus minimum; lines with circle markers) are shown for all five CAEs (Fig. 9a), while individual GFSSREF members are shown in Fig. 9b.

Fig. 9. (a) Average 1-h accumulated precipitation (mm) per grid point over all 31 3-km forecasts and the verification region (CONUS east of 105°W), computed on native grids, as a function of forecast hour. These statistics were computed for all 10 ensemble members, but for most ensembles, only the ensemble mean values (lines) and ranges (maximum minus minimum; lines with circle markers near the x axis) are shown. The red and gray shadings represent envelopes of the 10 members comprising EnKFRAND and GFSSREF, respectively, and darker shadings indicate intersections of their two envelopes. Values on the x axis represent ending forecast hours of 1-h accumulation periods (e.g., an x-axis value of 24 is for 1-h accumulated precipitation between 23 and 24 h). (b) As in (a), but the curves are for individual members from the GFSSREF ensemble, with colors corresponding to the dynamic core each SREF member possessed. These different dynamic cores were reflected solely in the ICPs; for the WRF Model forecasts, all members had common configurations.

The largest differences between the CAEs occurred over the first 12 h, where the two CAEs with random ICPs spun up precipitation much faster than the other CAEs but grossly overshot observed domain-total precipitation (Fig. 9a). While the CAEs with EnKF and SREF ICPs had broadly similar mean spinups, distinct trifurcation of GFSSREF members occurred based on dynamic core (Fig. 9b), consistent with Johnson et al. (2011) and indicating how ICPs reflecting multiple models can lead to clustering, which is undesirable (e.g., Gowan et al. 2018; Schwartz et al. 2019). In general, spinup appeared more sensitive to ICPs than IC center, even though initial center determined whether the CAE had initial hydrometeors (Table 3). This finding suggests ICP characteristics influence spinup more than initial hydrometeor state for 0000 UTC–initialized forecasts over the central-eastern CONUS.

Despite varied spinups, all CAEs generally represented diurnal cycle timing well after 18 h, and GFSEnKF and GFSSREF members with NMM dynamic core ICPs typically had domain-total precipitation best matching observations, including the observed peak around 24 h (Fig. 9). Conversely, the two ensembles with random ICPs had less mean precipitation than observations between ~24 and 42 h, while EnKFEnKF produced too much precipitation at the maximum around 24 h (Fig. 9a). Interestingly, despite overpredicting at 24 h, EnKFEnKF precipitation dramatically decreased thereafter, underpredicting between ~26 and 42 h, perhaps due to insufficient upscale growth of convection (e.g., Schwartz et al. 2015b).

GFSSREF clearly had the widest range throughout the forecast, reflecting its ICPs with multimodel diversity (Fig. 9a). Additionally, the two CAEs with random ICPs typically had wider ranges than those with EnKF ICPs, particularly over the first 12 h, essentially a manifestation of randomness. However, after 12 h, except for GFSSREF, the four other CAEs had fairly similar ranges.

2) Precipitation distributions

Average areal coverages of 1-h accumulated precipitation meeting or exceeding selected accumulation thresholds (e.g., 10.0 mm h−1) were calculated over the verification region on native grids to assess precipitation distributions (Fig. 10). The CAEs generally represented diurnal cycle timing well after the spinup, although there were sometimes biases, particularly for thresholds ≤1.0 mm h−1, where all CAEs usually had mean coverages lower than those observed (Figs. 10a,b).

Fig. 10. Fractional areal coverage (%) of 1-h accumulated precipitation meeting or exceeding (a) 0.25, (b) 1.0, (c) 2.5, (d) 5.0, (e) 10.0, and (f) 20.0 mm h−1 over the verification region (CONUS east of 105°W), computed on native grids and aggregated over all 31 3-km forecasts as a function of forecast hour. These statistics were computed for all 10 ensemble members, but for most ensembles, only the ensemble mean values (lines) and ranges (maximum minus minimum; lines with circle markers near the x axis) are shown. The red and gray shadings represent envelopes of the 10 members comprising EnKFRAND and GFSSREF, respectively, and darker shadings indicate intersections of their two envelopes. Values on the x axis represent ending forecast hours of 1-h accumulation periods (e.g., an x-axis value of 24 is for 1-h accumulated precipitation between 23 and 24 h).

Areal coverage characteristics for thresholds ≥2.5 mm h−1 (Figs. 10c–f) were broadly consistent with domain-total precipitation statistics, and GFSSREF members again clustered based on dynamic core represented in the ICPs (not shown). Specifically, the two ensembles with random ICPs had lower mean coverages than observations between ~24 and 42 h for thresholds ≥2.5 mm h−1 but clearly overpredicted during the spinup, which contributed to their excessive total precipitation during this period (e.g., Fig. 9a). Before and during the first observed precipitation peak (18–24 h), GFSEnKF and GFSSREF typically had ensemble mean coverages closest to observations, while EnKFEnKF overpredicted for thresholds between 2.5 and 10.0 mm h−1 (Figs. 10c–e). Ensemble ranges of areal coverages (lines with circles in Fig. 10) also resembled those for domain-total precipitation, with GFSSREF having the widest ranges and the CAEs with random ICPs possessing relatively large ranges for the first ~12 h.

Probability density functions (PDFs) further revealed the different spinups engendered by EnKF and random ICPs (Fig. 11). At 1 h, while finescale structures were still developing, CAEs with EnKF ICPs had more heavy precipitation than those with random ICPs, although none of the forecasts could yet reproduce the observed heavy rainfall frequency (Fig. 11a). However, between 1 and 3 h, heavy precipitation rapidly developed in the CAEs with random ICPs (Fig. 11b), with slower development in the CAEs with EnKF ICPs, and by 5–7 h, the CAEs with random ICPs produced too much rainfall >40.0 mm h−1 while PDFs of the CAEs with EnKF ICPs gradually aligned with those observed (Figs. 11c,d).

Fig. 11. Probability density functions (PDFs; %) constructed from all points within the verification region (CONUS east of 105°W) over all 31 (a) 1-, (b) 3-, (c) 5-, and (d) 7-h 3-km forecasts of 1-h accumulated precipitation (mm) for member 1 from various ensembles, computed on native grids. The corresponding observed (ST4) PDFs are also shown. Dashed lines are for reference to help visualize changes across forecast hours.

These collective findings clearly suggested issues when initializing CAEs with random noise for short-term precipitation forecasts (Figs. 8–11), possibly due to gross imbalances in random initial states. Moreover, our results are consistent with Johnson et al. (2014), who found initializing CAEs with correlated random noise led to "spurious precipitation that formed over large areas on many cases" at short forecast ranges; as Johnson et al. (2014) constructed random ICPs with smaller length scales than those used here, it appears that using random noise to initialize CAEs may be challenging regardless of its correlation scale.

b. Ensemble precipitation verification

Areal coverages sometimes indicated biases (Fig. 10), which can hamper interpretation of verification metrics designed to quantify spatial errors (e.g., Baldwin and Kain 2006; Roberts and Lean 2008). Thus, forecasts were bias corrected before assessing measures of probabilistic forecast quality with a “probability-matching” approach that forced each ensemble member’s distribution to the ST4 distribution by replacing the model grid point containing the most precipitation within the verification region with the highest ST4 amount within the verification region, and so on, thus eliminating bias (e.g., Ebert 2001; Clark et al. 2009, 2010a, b; Schwartz et al. 2015a; Loken et al. 2019; Pyle and Brill 2019). Despite replacing model values with observations, this method preserves forecast spatial patterns.
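A minimal sketch of this probability-matching step, assuming a single member and the ST4 analysis on the same grid over the verification region (array names are hypothetical):

    import numpy as np

    def probability_match(member_precip, obs_precip):
        """Bias-correct one member by probability matching: the member's
        ranked amounts are replaced by the ranked observed (ST4) amounts, so
        the corrected field has the observed amount distribution while
        preserving the member's spatial pattern."""
        flat = member_precip.ravel()
        order = np.argsort(flat)                      # driest to wettest points
        matched = np.empty_like(flat)
        matched[order] = np.sort(obs_precip.ravel())  # assign amounts by rank
        return matched.reshape(member_precip.shape)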

After interpolating precipitation forecasts to the ST4 grid and bias correcting, a “neighborhood” approach (e.g., Theis et al. 2005; Ebert 2008, 2009) was employed to derive probabilistic fields suitable for verification following Schwartz and Sobash (2017). First, ensemble probabilities (EPs) at a particular grid point were determined as the fraction of ensemble members predicting an event at that point, where an event was defined as precipitation meeting or exceeding an accumulation threshold (e.g., 5.0 mm h−1). Then, “neighborhood ensemble probabilities” (NEPs; Schwartz et al. 2010; Schwartz and Sobash 2017) were computed by choosing a neighborhood length scale (r) to define a spatial neighborhood and averaging EPs over all grid points in the neighborhood. NEPs are probabilities of event occurrence at a point given a neighborhood length scale (e.g., Schwartz and Sobash 2017) and are more appropriate for verifying CAEs than point-based probabilities (i.e., EPs) because they incorporate spatial uncertainty and acknowledge that CAEs are inherently inaccurate at the grid scale.

NEPs were produced from all CAEs with r between 5 and 150 km, which represented radii of circular neighborhoods. Following Schwartz and Sobash (2017), NEPs at the ith point were verified against corresponding observations (i.e., ST4) at the ith point, where the ith observed value could either be binary (i.e., 0 or 1) or fractional depending on what the metric required; fractional observations (e.g., Roberts and Lean 2008) at the ith point were obtained by determining the fraction of observed events within its neighborhood, analogously to NEPs.
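The NEP computation can be sketched as follows; this is a simplified illustration assuming precipitation fields for all members on a common grid, with the circular neighborhood implemented as a uniform kernel convolution.

    import numpy as np
    from scipy.ndimage import convolve

    def neighborhood_ensemble_probability(precip, threshold, r_km, dx_km):
        """NEPs: the ensemble probability (fraction of members >= threshold)
        at each point, averaged over a circular neighborhood of radius r_km.
        `precip` has shape (n_members, ny, nx)."""
        ep = (precip >= threshold).mean(axis=0)   # point ensemble probability
        r = int(round(r_km / dx_km))              # neighborhood radius in grid points
        y, x = np.ogrid[-r:r + 1, -r:r + 1]
        kernel = (x**2 + y**2 <= r**2).astype(float)
        kernel /= kernel.sum()                    # uniform circular average
        return convolve(ep, kernel, mode="nearest")

Fractional observations are obtained analogously by convolving the observed binary event field with the same kernel.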

For brevity, results are shown solely for r = 100 km, but overall findings were unchanged using different r. Additionally, a maximum event threshold of 10.0 mm h−1 was used, as metrics computed at higher thresholds were noisy due to small sample sizes (e.g., Fig. 10).

Statistical significance testing followed Schwartz (2019), who examined performance of several ensembles, and the following text parallels that description. Specifically, statistical significance was determined with a bootstrap technique by randomly drawing paired samples (10 000 times) of daily statistics from two ensembles over all forecast cases to calculate resampled distributions of aggregate differences between the two ensembles (e.g., Hamill 1999; Wolff et al. 2014). This procedure assumed individual forecasts, initialized 24 h apart, were independent (e.g., Hamill 1999). Bounds of 90% bootstrap confidence intervals (CIs) were obtained from the distribution of resampled aggregate differences using the bias corrected and accelerated method (e.g., Gilleland 2010). If the bounds of a 90% bootstrap CI did not encompass zero, using a one-tailed interpretation, differences between two ensembles were statistically significant at the 95% level or higher.
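A simplified sketch of this procedure is below; it uses a plain percentile bootstrap rather than the bias corrected and accelerated method actually employed, and assumes one aggregate statistic per ensemble per forecast case.

    import numpy as np

    def paired_bootstrap_ci(stat_a, stat_b, n_resamples=10_000, alpha=0.10,
                            seed=0):
        """Bootstrap CI for the aggregate difference between two ensembles'
        daily statistics (one value per case, paired by forecast case)."""
        rng = np.random.default_rng(seed)
        diffs = np.asarray(stat_a) - np.asarray(stat_b)  # paired differences
        n = diffs.size
        idx = rng.integers(0, n, size=(n_resamples, n))  # resample cases with replacement
        resampled = diffs[idx].mean(axis=1)              # aggregate difference per resample
        return tuple(np.quantile(resampled, [alpha / 2, 1.0 - alpha / 2]))

    # A 90% CI (alpha = 0.10) that excludes zero implies significance at the
    # 95% level or higher under a one-tailed interpretation.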

1) Fractions skill scores

The fractions skill score (FSS; Roberts and Lean 2008) was used to evaluate spatial placement, where FSS = 1 means a perfect forecast and FSS = 0 indicates no skill. For fixed initial centers, CAEs with flow-dependent ICPs usually had higher FSSs than those with random ICPs, while differences between GFSSREF and GFSEnKF FSSs were usually small and not statistically significant (Fig. 12). These results indicated the value of flow-dependent ICPs compared to random ICPs and minimal benefits of multimodel ICPs. However, regardless of ICPs, the three CAEs initially centered on GFS analyses typically had higher FSSs than the two CAEs initially centered about EnKF mean analyses, demonstrating GFS analyses were generally better than EnKF mean analyses and suggesting initial center is more important than ICPs for achieving high FSSs.
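For reference, the FSS compares NEPs with fractional observations via FSS = 1 − MSE/MSE_ref, where MSE_ref is the largest possible MSE given no overlap between forecast and observed fractions. A minimal sketch, assuming NEP and fractional-observation fields on a common grid:

    import numpy as np

    def fractions_skill_score(nep, obs_fraction):
        """FSS (Roberts and Lean 2008): 1 - MSE(P, O) / (mean(P^2) + mean(O^2)),
        where P are forecast probabilities (here, NEPs) and O are observed
        fractions on the same grid."""
        p = np.asarray(nep, dtype=float).ravel()
        o = np.asarray(obs_fraction, dtype=float).ravel()
        mse = np.mean((p - o) ** 2)
        mse_ref = np.mean(p ** 2) + np.mean(o ** 2)  # worst-case (no-overlap) MSE
        return 1.0 - mse / mse_ref if mse_ref > 0 else np.nan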

Fig. 12. Fractions skill score (FSS) over the verification region (CONUS east of 105°W) with a 100-km neighborhood length scale for the (a) 0.25, (b) 1.0, (c) 5.0, and (d) 10.0 mm h−1 thresholds aggregated over all 31 3-km forecasts of 1-h accumulated precipitation as a function of forecast hour. Values on the x axis represent ending forecast hours of 1-h accumulation periods (e.g., an x-axis value of 24 is for 1-h accumulated precipitation between 23 and 24 h). Note that the y-axis scales are different in each panel. Symbols along the top axis denote forecast hours when differences between two ensembles were statistically significant at the 95% level. As indicated in the legend, in order from top to bottom, the rows indicate differences between GFSEnKF and EnKFEnKF, EnKFEnKF and EnKFRAND, GFSEnKF and GFSRAND, GFSSREF and GFSRAND, and GFSEnKF and GFSSREF, with symbols corresponding to those in the legend that denote which ensemble had statistically significantly higher FSSs. For example, in the top row, blue symbols indicate GFSEnKF had statistically significantly higher FSSs than EnKFEnKF, while gray symbols indicate EnKFEnKF had statistically significantly higher FSSs than GFSEnKF. Similarly, in the bottom row, blue symbols indicate GFSEnKF had statistically significantly higher FSSs than GFSSREF, while black symbols indicate GFSSREF had statistically significantly higher FSSs than GFSEnKF. Absence of a symbol means the difference was not statistically significant at the 95% level.

2) Rank histograms

Rank histograms (e.g., Hamill 2001) based on domain-total precipitation were constructed as in Schwartz et al. (2014). Although rank histograms are sensitive to observation errors (e.g., Hacker et al. 2011), ST4 observation errors are not well known and were not included. The reliability index (RI; Delle Monache et al. 2006) was used to summarize rank histogram flatness; lower values are preferable.
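A brief sketch of both diagnostics, assuming, for each case, an array of the members' domain-total precipitation and the corresponding observed total (ties between members and the observation are ignored for simplicity):

    import numpy as np

    def rank_histogram(ensemble_totals, obs_totals):
        """Counts of the observation's rank among the N sorted member values
        (ranks 0..N) over all cases; `ensemble_totals` is (n_cases, n_members)."""
        ens = np.asarray(ensemble_totals, dtype=float)
        obs = np.asarray(obs_totals, dtype=float)
        ranks = (ens < obs[:, None]).sum(axis=1)
        return np.bincount(ranks, minlength=ens.shape[1] + 1)

    def reliability_index(counts):
        """RI (Delle Monache et al. 2006): summed absolute deviation of bin
        frequencies from the flat 1/(N + 1) expectation; lower is better."""
        freq = counts / counts.sum()
        return np.abs(freq - 1.0 / counts.size).sum()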

Observations fell within the ensemble more regularly, and values in most bins were closer to optimal, when CAEs had random or SREF ICPs rather than EnKF ICPs (Fig. 13). GFSSREF and GFSRAND RIs were fairly similar and much smaller than GFSEnKF RIs (Figs. 13a,c), and differences between the CAEs were comparatively small after 18 h (cf. Figs. 13a,c and Figs. 13b,d), reflecting generally converging spread with time (e.g., Figs. 4, 6). Nonetheless, results at all forecast ranges suggest that enhanced precipitation spread engendered by random ICPs (e.g., Fig. 6) led to better dispersion characteristics than flow-dependent EnKF ICPs, even though this improved spread was also a manifestation of spinup issues (e.g., Figs. 7–11) and using random ICPs degraded forecast skill as measured by FSSs (e.g., Fig. 12).

Fig. 13. Rank histograms containing all 31 3-km (a),(b) 1–18- and (c),(d) 18–36-h forecasts of non-bias-corrected domain-total 1-h accumulated precipitation on the ST4 grid over the verification region (CONUS east of 105°W) for ensembles initially centered about (a),(c) GFS and (b),(d) EnKF mean analyses. Horizontal lines are optimal values, and the reliability index (RI; Delle Monache et al. 2006) is annotated for each ensemble in the legend; lower values are better and indicate flatter rank histograms.

3) ROC areas

Ability to discriminate events from climatology was quantified by area under the relative operating characteristic (ROC) curve (Mason 1982; Mason and Graham 2002), which was computed using decision thresholds of 1%, 2%, 3%, 4%, 5%, 10%, 15%, …, 95%, and 100% and a trapezoidal approximation. ROC area > 0.5 indicates better discriminating ability than random forecasts.
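A minimal sketch of the ROC-area computation, assuming NEP and binary observed event fields on a common grid and decision thresholds expressed as fractions (0.01, 0.02, ..., 1.0):

    import numpy as np

    def roc_area(nep, obs_event, thresholds):
        """Area under the ROC curve via the trapezoidal rule. For each
        decision threshold, a 'yes' forecast is NEP >= threshold; POD and
        POFD are computed over all verification points."""
        p = np.asarray(nep).ravel()
        o = np.asarray(obs_event).ravel().astype(bool)
        pod, pofd = [1.0], [1.0]           # a 0% threshold forecasts everything
        for t in sorted(thresholds):
            yes = p >= t
            pod.append((yes & o).sum() / max(o.sum(), 1))
            pofd.append((yes & ~o).sum() / max((~o).sum(), 1))
        pod.append(0.0)                    # beyond 100%, nothing is forecast
        pofd.append(0.0)
        pod, pofd = np.array(pod), np.array(pofd)
        # POFD decreases along the curve, so integrate in that direction
        return float(np.sum((pofd[:-1] - pofd[1:]) * (pod[:-1] + pod[1:]) / 2))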

As with FSSs, all three CAEs initially centered on GFS analyses usually had higher ROC areas than the two CAEs initially centered about EnKF mean analyses (Figs. 14a–d), again suggesting GFS analysis superiority to EnKF mean analyses and greater importance of initial center than ICPs. Between ~6 and 18 h, for fixed initial centers, the CAEs with random ICPs had statistically significantly higher ROC areas than those with EnKF ICPs, while before 6 h and after 18 h, EnKF and random ICPs yielded similar ROC areas (Figs. 14a–d). These results differed from FSSs (Fig. 12) that clearly indicated EnKF ICPs were preferable to random ICPs. Outside the 6–18-h period, GFSSREF had the highest ROC areas among the three CAEs with GFS initial centers, often statistically significantly higher than GFSEnKF ROC areas, suggesting benefits of incorporating multimodel diversity within CAE ICPs and contrasting the similar GFSSREF and GFSEnKF FSSs.

Fig. 14.

Area under the ROC curve over the verification region (CONUS east of 105°W) with a 100-km neighborhood length scale for the (a),(e) 0.25, (b),(f) 1.0, (c),(g) 5.0, and (d),(h) 10.0 mm h−1 thresholds aggregated over all 31 3-km forecasts of 1-h accumulated precipitation as a function of forecast hour computed with (a)–(d) the full range of decision thresholds and (e)–(h) a truncated set of decision thresholds with a lowest nonzero threshold of 25% (see text). Values on the x axis represent ending forecast hours of 1-h accumulation periods (e.g., an x-axis value of 24 is for 1-h accumulated precipitation between 23 and 24 h). Note that the y-axis scales are different in each panel. Statistical significance is denoted along the top axis as in Fig. 12.


Further investigation revealed that the relatively poor 6–18-h GFSEnKF ROC areas compared to GFSRAND and GFSSREF were primarily due to insufficient contributions from NEPs < 25%. Specifically, GFSRAND and GFSSREF were less sharp than GFSEnKF, with higher coverages of NEPs for r = 100 km between 5% and 25% and lower coverages of NEPs ≥ 45% at most thresholds (Figs. 15a–d). In general, the GFSRAND distribution differed more from the GFSEnKF distribution than did the GFSSREF distribution; for example, within the 5%–25% bin for the 0.25 and 1.0 mm h−1 thresholds, GFSRAND had ~50% more NEPs than GFSEnKF, while the difference between GFSSREF and GFSEnKF was smaller (~15%; Figs. 15a,b). These enhanced low-probability coverages in GFSSREF and GFSRAND reflected their greater spreads relative to GFSEnKF between 6 and 18 h (Fig. 6); this extra spread was beneficial (e.g., Fig. 13a), enabling better detection of low-probability events without appreciably increasing false alarm rates and thereby boosting ROC areas.
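The sharpness tallies behind Fig. 15 amount to histogramming NEPs into coarse probability bins and forming per-bin ratios between ensembles. A minimal sketch, assuming pooled NEP arrays and bin edges matching the 5%–25% and ≥45% groupings discussed above (other names illustrative):

```python
# Sketch of the sharpness tally: how often NEPs fall into coarse
# probability bins, plus the per-bin coverage ratio shown on the
# right axis of Fig. 15 (values > 1 mean the first ensemble has
# more NEPs in that bin).
import numpy as np

BIN_EDGES = np.array([0.0, 0.05, 0.25, 0.45, 0.65, 0.85, 1.0])

def sharpness_counts(probs):
    """probs: NEPs pooled over grid points, cases, and forecast hours."""
    counts, _ = np.histogram(np.clip(probs, 0.0, 1.0), bins=BIN_EDGES)
    return counts

def coverage_ratio(probs_a, probs_b):
    """Per-bin occurrence ratio of ensemble A to ensemble B; assumes
    the reference ensemble populates every bin."""
    return sharpness_counts(probs_a) / sharpness_counts(probs_b)
```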

Fig. 15.

Sharpness diagrams depicting how often bias-corrected NEPs of 1-h accumulated precipitation computed with r = 100 km fell into various probabilistic bins over the verification region (CONUS east of 105°W) and all 31 (a)–(d) 6–18- and (e)–(h) 18–36-h forecasts for event thresholds of (a),(e) 0.25, (b),(f) 1.0, (c),(g) 5.0, and (d),(h) 10.0 mm h−1. The scale along the right axis is for ratios of the number of occurrences of GFSRAND to GFSEnKF (solid) and of GFSSREF to GFSEnKF (dashed), with a horizontal line at 1.


Between 18 and 36 h, even though GFSRAND again had more low probabilities than GFSEnKF (Figs. 15e–h), differences between GFSRAND and GFSEnKF coverages of 5%–25% NEPs were smaller than between 6 and 18 h. Although greater GFSRAND spread was beneficial from a dispersion perspective (e.g., Fig. 13c), GFSRAND spatial placement was significantly poorer than that of GFSEnKF (Fig. 12), counteracting benefits from enhanced spread and likely explaining the comparable GFSRAND and GFSEnKF ROC areas outside of 6–18 h (Figs. 14a–d). Conversely, differences between the GFSEnKF and GFSSREF NEP distributions were similar across both forecast intervals (Fig. 15), and because GFSSREF and GFSEnKF FSSs were similar, the combination of good GFSSREF placement and greater GFSSREF spread translated into higher GFSSREF ROC areas than GFSEnKF over most of the forecast.

Overall, ROC areas indicated greater benefits of random and multimodel ICPs than FSSs did. However, the higher ROC areas from these techniques appear related solely to enhanced spread and greater low-probability coverages. In fact, when ROC areas were computed with decision thresholds of 0%, 25%, 30%, 35%, …, 95%, and 100% to explicitly exclude contributions from NEPs < 25%, ROC areas plummeted, but CAEs with EnKF ICPs had higher ROC areas than CAEs with the same initial center but random ICPs, and GFSEnKF had comparable or higher ROC areas than GFSSREF (Figs. 14e–h). These truncated ROC areas supported the same conclusions as FSSs regarding benefits of flow-dependent EnKF ICPs, and they suggest that multimodel ICPs may be unnecessary for users unconcerned with low-probability decision thresholds.
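As a concrete illustration, the truncated decision-threshold set described here can be built and passed to the roc_area sketch given earlier; the construction below is an assumed convenience, not the paper's code.

```python
# Truncated decision thresholds (0%, 25%, 30%, ..., 95%, 100%) that
# exclude contributions from NEPs < 25%; the small epsilon guards
# against floating-point shortfall at the 100% endpoint.
import numpy as np

truncated_thresholds = np.concatenate(([0.0], np.arange(0.25, 1.0 + 1e-6, 0.05)))
# Example, with names as in the earlier sketch:
# auc_truncated = roc_area(probs, obs_event, truncated_thresholds)
```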

4) Attributes statistics

Attributes diagrams (Wilks 2011) were constructed with forecast probability bins of 0%–5%, 5%–15%, 15%–25%, …, 85%–95%, and 95%–100% (Fig. 16) to assess calibration; curves falling on the diagonal indicate perfect reliability. Over the first 18 h, for fixed initial centers, the CAEs with random and SREF ICPs were more reliable than those with EnKF ICPs for most thresholds and probability bins, and GFSRAND was sometimes more reliable than GFSSREF (Figs. 16a–d). The better GFSRAND and GFSSREF reliabilities compared to GFSEnKF were aided by less sharp distributions with fewer high-probability forecasts (e.g., Figs. 15a–d), which diminished overconfidence, again reflecting their greater spreads. Nonetheless, relatively poor GFSRAND FSSs suggest many low probabilities did not correspond well with observations. The initial center again mattered: for fixed ICPs, the CAEs with GFS initial centers typically had better reliabilities than those with EnKF mean initial centers.
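A minimal sketch of the statistics behind these diagrams follows, assuming pooled probability and observed-event arrays; the bin edges match those listed above, and the 500-point minimum-sample rule from the Fig. 16 caption is applied. Variable names are illustrative.

```python
# Sketch of the attributes-diagram statistics: mean forecast
# probability and observed event frequency per probability bin;
# perfect reliability means obs_freq equals mean_prob (the diagonal).
import numpy as np

# Bin edges: 0%-5%, 5%-15%, ..., 85%-95%, 95%-100%
EDGES = np.array([0.0, 0.05, 0.15, 0.25, 0.35, 0.45,
                  0.55, 0.65, 0.75, 0.85, 0.95, 1.0])

def attributes_curve(probs, obs_event, min_samples=500):
    """probs: forecast probabilities pooled over grid points/cases.
    obs_event: boolean observed-event array of the same shape."""
    bin_idx = np.digitize(probs, EDGES[1:-1])   # indices 0..len(EDGES)-2
    mean_prob, obs_freq = [], []
    for b in range(len(EDGES) - 1):
        in_bin = bin_idx == b
        if in_bin.sum() >= min_samples:   # skip sparsely populated bins
            mean_prob.append(probs[in_bin].mean())
            obs_freq.append(obs_event[in_bin].mean())
    return np.array(mean_prob), np.array(obs_freq)
```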

Fig. 16.

Attributes statistics computed over the verification region (CONUS east of 105°W) with a 100-km neighborhood length scale aggregated over all 31 (a)–(d) 1–18- and (e)–(h) 18–36-h 3-km forecasts of 1-h accumulated precipitation for the (a),(e) 0.25, (b),(f) 1.0, (c),(g) 5.0, and (d),(h) 10.0 mm h−1 thresholds. Horizontal lines near the x axis represent observed frequencies of the event and diagonal lines are lines of perfect reliability. Points lying in gray-shaded regions had skill compared to climatological forecasts as measured by the Brier skill score (Brier 1950; Wilks 2011). Statistical significance is denoted along the top axis as in Fig. 12, where the symbols denote the ensemble statistically significantly closest to perfect reliability. Values were not plotted for a particular bin if fewer than 500 grid points had forecast probabilities in that bin over the verification region and all 31 forecasts. Note that the attributes diagrams themselves stop at 100%; area above 100% was added to make room for statistical significance markers.


Similar conclusions generally held at later times (18–36 h; Figs. 16e–h), although GFSSREF and GFSRAND reliabilities were more similar to each other than at earlier times. Over both periods, most ensembles were overconfident, and all CAEs had little or no skill relative to climatological forecasts at the 10.0 mm h−1 threshold, indicating that challenges persist for making reliable predictions of highly localized events like heavy rainfall.

5. Summary and conclusions

Five sets of 48-h, 10-member, 3-km CAE forecasts were initialized at 0000 UTC each day in May 2015 over the CONUS with various configurations designed to isolate forecast sensitivity to ICPs and central initial state. Sensitivity to ICs extended throughout the 48-h forecasts, contrasting with many European studies showing IC impacts through only 6–12 h (e.g., Hohenegger et al. 2008; Vié et al. 2011; Kühnlein et al. 2014; RB16); this disparity is probably due to the much larger computational domain used here, and our findings suggest ICs take on enhanced importance over large domains.

Specifically, using random mesoscale ICPs yielded undesirable spinup characteristics and relatively poor FSSs compared to employing flow-dependent ICPs provided by both single-physics, single-dynamics 15-km limited-area continuously cycling EnKF analyses and 3-h multimodel SREF forecasts. However, random ICPs also increased spread, leading to less overconfidence and broader low-probability coverages that improved ROC areas, rank histogram flatness, and attributes statistics compared to EnKF ICPs (and sometimes SREF ICPs). Therefore, it appears random ICPs engendered some beneficial properties despite their lack of flow dependence, but substantial work is needed to further understand and remedy the detrimental impacts of random noise on model spinup.

Compared to EnKF ICPs, SREF ICPs yielded comparable FSSs but improved performance for spread-sensitive metrics. Yet, individual members of the SREF-initialized CAE had different climatologies that undesirably clustered by the dynamic core represented in their ICPs. Thus, although SREF-based and random ICPs often provided improvements over EnKF ICPs, the challenges associated with multimodel and random ICPs suggest that obtaining “good spread” in CAEs remains elusive. Within future operational CAEs, like those being developed under NOAA’s Unified Forecast System (UFS), it may be more fruitful to recover the helpful, spread-inducing aspects of random and multimodel ICPs by instead using stochastic physics schemes in association with single-physics, single-dynamics, flow-dependent ICPs (e.g., Bouttier et al. 2012; Romine et al. 2014; Jankov et al. 2019).

Additionally, our findings stress the importance of the CAE initial center, which mattered more than ICPs for achieving high ROC areas and FSSs. Moreover, CAEs initially centered about operational GFS analyses were unequivocally superior to those initially centered on our experimental EnKF mean analyses. These results strongly suggest relative superiority of GFS analyses and lend credence to the “partial cycling” strategy currently employed by NOAA’s limited-area DA systems over the CONUS, which periodically discards cycled states and replaces them with fields from a global model (e.g., Benjamin et al. 2016; Wu et al. 2017).

Despite our seemingly discouraging EnKF-based results, continuously cycling EnKFs over large regional domains can potentially be enhanced by decreasing the cycling period (e.g., using 1-h cycles), assimilating more observations, and, likely most importantly, improving the limited-area NWP model (e.g., Romine et al. 2013). In addition, our results documenting very slow perturbation growth over the first 12 h from EnKF ICPs compared to random ICPs suggest efforts toward understanding and accelerating this slow growth should be undertaken to improve short-term forecast spread from EnKF ICPs. While increasing EnKF resolution may also help, especially for nowcasting purposes, finer EnKF resolution is likely not a panacea, and it is entirely possible that continuously cycling limited-area EnKFs, despite their many attractive properties, may not currently be optimal for initializing large-domain regional CAEs, particularly for next-day forecasts that are less impacted by spinup. Nonetheless, ongoing research at NCAR is attempting to improve limited-area NWP models (e.g., Wong et al. 2020), with hopes that these efforts will translate into better continuously cycling DA systems over large regional domains.

Acknowledgments

This work was partially funded by NCAR’s Short-term Explicit Prediction (STEP) program and NOAA/OAR Office of Weather and Air Quality Grants NA17OAR4590182 and NA17OAR4590122. Thanks to Adam Clark and two anonymous reviewers for their constructive comments. All forecasts were produced on NCAR’s Yellowstone supercomputer (Computational and Information Systems Laboratory 2016). The National Center for Atmospheric Research is sponsored by the National Science Foundation.

REFERENCES

  • Accadia, C., S. Mariani, M. Casaioli, A. Lavagnini, and A. Speranza, 2003: Sensitivity of precipitation forecast skill scores to bilinear interpolation and a simple nearest-neighbor average method on high-resolution verification grids. Wea. Forecasting, 18, 918–932, https://doi.org/10.1175/1520-0434(2003)018<0918:SOPFSS>2.0.CO;2.

  • Anderson, J. L., 2001: An ensemble adjustment Kalman filter for data assimilation. Mon. Wea. Rev., 129, 2884–2903, https://doi.org/10.1175/1520-0493(2001)129<2884:AEAKFF>2.0.CO;2.

  • Anderson, J. L., 2003: A local least squares framework for ensemble filtering. Mon. Wea. Rev., 131, 634–642, https://doi.org/10.1175/1520-0493(2003)131<0634:ALLSFF>2.0.CO;2.

  • Anderson, J. L., 2009: Spatially and temporally varying adaptive covariance inflation for ensemble filters. Tellus, 61A, 72–83, https://doi.org/10.1111/j.1600-0870.2008.00361.x.

  • Anderson, J. L., 2012: Localization and sampling error correction in ensemble Kalman filter data assimilation. Mon. Wea. Rev., 140, 2359–2371, https://doi.org/10.1175/MWR-D-11-00013.1.

  • Anderson, J. L., and N. Collins, 2007: Scalable implementations of ensemble filter algorithms for data assimilation. J. Atmos. Oceanic Technol., 24, 1452–1463, https://doi.org/10.1175/JTECH2049.1.

  • Anderson, J. L., T. Hoar, K. Raeder, H. Liu, N. Collins, R. Torn, and A. Arellano, 2009: The Data Assimilation Research Testbed: A community facility. Bull. Amer. Meteor. Soc., 90, 1283–1296, https://doi.org/10.1175/2009BAMS2618.1.

  • Baldwin, M. E., and J. S. Kain, 2006: Sensitivity of several performance measures to displacement error, bias, and event frequency. Wea. Forecasting, 21, 636–648, https://doi.org/10.1175/WAF933.1.

  • Barker, D. M., 2005: Southern high-latitude ensemble data assimilation in the Antarctic Mesoscale Prediction System. Mon. Wea. Rev., 133, 3431–3449, https://doi.org/10.1175/MWR3042.1.

  • Barker, D. M., and Coauthors, 2012: The Weather Research and Forecasting Model’s Community Variational/Ensemble Data Assimilation System: WRFDA. Bull. Amer. Meteor. Soc., 93, 831–843, https://doi.org/10.1175/BAMS-D-11-00167.1.

  • Benjamin, S. G., and Coauthors, 2016: A North American hourly assimilation and model forecast cycle: The Rapid Refresh. Mon. Wea. Rev., 144, 1669–1694, https://doi.org/10.1175/MWR-D-15-0242.1.

  • Bishop, C. H., and D. Hodyss, 2009a: Ensemble covariances adaptively localized with ECO-RAP. Part I: Tests on simple error models. Tellus, 61A, 84–96, https://doi.org/10.1111/j.1600-0870.2008.00371.x.

  • Bishop, C. H., and D. Hodyss, 2009b: Ensemble covariances adaptively localized with ECO-RAP. Part II: A strategy for the atmosphere. Tellus, 61A, 97–111, https://doi.org/10.1111/j.1600-0870.2008.00372.x.

  • Blunden, J., and D. S. Arndt, 2016: State of the Climate in 2015. Bull. Amer. Meteor. Soc., 97, Si–S275, https://doi.org/10.1175/2016BAMSSTATEOFTHECLIMATE.1.

  • Bouttier, F., B. Vié, O. Nuissier, and L. Raynaud, 2012: Impact of stochastic physics in a convection-permitting ensemble. Mon. Wea. Rev., 140, 3706–3721, https://doi.org/10.1175/MWR-D-12-00031.1.

  • Brier, G. W., 1950: Verification of forecasts expressed in terms of probability. Mon. Wea. Rev., 78, 1–3, https://doi.org/10.1175/1520-0493(1950)078<0001:VOFEIT>2.0.CO;2.

  • Cafaro, C., T. H. A. Frame, J. Methven, N. Roberts, and J. Bröcker, 2019: The added value of convection-permitting ensemble forecasts of sea breeze compared to a Bayesian forecast driven by the global ensemble. Quart. J. Roy. Meteor. Soc., 145, 1780–1798, https://doi.org/10.1002/qj.3531.

  • Chen, F., and J. Dudhia, 2001: Coupling an advanced land surface–hydrology model with the Penn State–NCAR MM5 modeling system. Part I: Model implementation and sensitivity. Mon. Wea. Rev., 129, 569–585, https://doi.org/10.1175/1520-0493(2001)129<0569:CAALSH>2.0.CO;2.

  • Clark, A. J., 2017: Generation of ensemble mean precipitation forecasts from convection-allowing ensembles. Wea. Forecasting, 32, 1569–1583, https://doi.org/10.1175/WAF-D-16-0199.1.

  • Clark, A. J., W. A. Gallus Jr., M. Xue, and F. Kong, 2009: A comparison of precipitation forecast skill between small convection-allowing and large convection-parameterizing ensembles. Wea. Forecasting, 24, 1121–1140, https://doi.org/10.1175/2009WAF2222222.1.

  • Clark, A. J., W. A. Gallus Jr., and M. L. Weisman, 2010a: Neighborhood-based verification of precipitation forecasts from convection-allowing NCAR WRF Model simulations and the operational NAM. Wea. Forecasting, 25, 1495–1509, https://doi.org/10.1175/2010WAF2222404.1.

  • Clark, A. J., W. A. Gallus Jr., M. Xue, and F. Kong, 2010b: Growth of spread in convection-allowing and convection-parameterizing ensembles. Wea. Forecasting, 25, 594–612, https://doi.org/10.1175/2009WAF2222318.1.
