Improved Analog Ensemble Formulation for 3-Hourly Precipitation Forecasts

Julia Jeworrek (a) (https://orcid.org/0000-0002-4586-0982), Gregory West (a,b), and Roland Stull (a)

(a) The University of British Columbia, Vancouver, British Columbia, Canada
(b) BC Hydro, Vancouver, British Columbia, Canada
Open access

Abstract

Analog ensembles (AnEns) traditionally use a single numerical weather prediction (NWP) model to make a forecast, then search an archive to find a number of past similar forecasts (analogs) from that same model, and finally retrieve the actual observations corresponding to those past forecasts to serve as members of an ensemble forecast. This study investigates new statistical methods to combine analogs into ensemble forecasts and validates them for 3-hourly precipitation over the complex terrain of British Columbia, Canada. Applying the past analog error to the target forecast (instead of using the observations directly) reduces the AnEn dry bias and makes prediction of heavy-precipitation events probabilistically more reliable—typically the most impactful forecasts for society. Two variants of this new technique enable AnEn members to obtain values outside the distribution of the finite archived observational dataset—that is, they are theoretically capable of forecasting record events, whereas traditional analog methods cannot. While both variants similarly improve heavier precipitation events, one variant predicts measurable precipitation more often, which enhances accuracy during winter. A multimodel AnEn further improves predictive skill, albeit at higher computational cost. AnEn performance shows larger sensitivity to the grid spacing of the NWP than to the physics configuration. The final AnEn prediction system improves the skill and reliability of point forecasts across all precipitation intensities.

Significance Statement

The analog ensemble (AnEn) technique is a data-driven method that can improve local weather forecasts. It improves raw model forecasts using past similar model predictions and observations, reducing future forecast errors and providing probabilities for a range of possible outcomes. One limitation of AnEns is that they commonly tend to make rare-event (e.g., heavy precipitation) forecasts appear less extreme. Usually, heavier precipitation events have a higher impact on society and the economy. This study introduces two new AnEn techniques that make operational forecasts of both probabilities and most likely amounts more accurate for heavy precipitation.

© 2023 American Meteorological Society. This published article is licensed under the terms of the default AMS reuse license. For information regarding reuse of this content and general copyright information, consult the AMS Copyright Policy (www.ametsoc.org/PUBSReuseLicenses).

Corresponding author: Julia Jeworrek, jjeworrek@eoas.ubc.ca

1. Introduction

Analog ensemble (AnEn) forecasts search an archive of past forecasts to find multiple past forecasts that were similar to a future target forecast. Then the observations corresponding to those multiple past forecasts are combined into an ensemble as an improved (postprocessed) estimate of the future weather.

AnEn performance is known to be conditioned by the quality (Marty et al. 2012; Nagarajan et al. 2015) and length (Hamill et al. 2006; Chapman et al. 2022; Delle Monache et al. 2020) of their archive of paired search (i.e., past forecast) and observational datasets. The temporal length of the archive is especially important when forecasting significant and heavier precipitation rates, because rarer events have smaller sample sizes in a finite dataset, which reduces the chance of finding good analogs. As a result, AnEns of high impact events are more likely to be biased toward less extreme (i.e., more common) values (Hamill et al. 2006; Alessandrini et al. 2019; Hamill et al. 2015; Alessandrini 2022). Significantly, this AnEn method is not capable of forecasting a record event.

Alessandrini et al. (2019) proposed an AnEn bias-correction method for rare wind events that applies a correction factor to the analog members if the target forecast exceeds a defined threshold. Initial testing is required to assess a suitable threshold value. Alessandrini (2022) improved analogs for rare events in solar power predictions by combining the technique in Alessandrini et al. (2019) with another method that is similar to the so-called supplemental lead time (SLT) approach introduced by Jeworrek et al. (2022, hereafter J22). The SLT approach expands the analog search to more analog candidates from adjacent lead times in the same archive of past forecasts. Namely, analog candidates are considered over a lead-time window centered on the target lead time. As such, it behaves similarly to an AnEn with a longer past-forecast search dataset. SLTs improve the representation of rare events by shifting the critical threshold further down the heavy-precipitation tail of the forecasted distribution, but the underforecasting issue remains for the rarest events.

Conventional AnEns use a single numerical weather prediction (NWP) model as the basis for finding analogs (e.g., Delle Monache et al. 2011, 2013; Alessandrini et al. 2015; J22). The use of a single (i.e., deterministic) imperfect NWP solution limits the range of possible outcomes and thus postprocesses only that particular NWP forecast. As NWP ensembles are able to represent error growth dynamically, Hamill and Whitaker (2006) and Eckel and Delle Monache (2016) hypothesized that initializing an AnEn from a small set of NWP models may enable more diversity in the prediction system, making it more robust. Hamill and Whitaker (2006), using a grid spacing of ∼250 km, found decreased daily precipitation forecast skill when selecting analogs from each member individually compared to selecting them from the ensemble mean. Mugume et al. (2018) obtained opposite results for an ensemble of Weather Research and Forecasting (WRF; Skamarock et al. 2008) Model configurations with varying cumulus schemes and 10-km grid spacing during a case study in Uganda. Eckel and Delle Monache (2016) showed mixed results for temperature and wind speed forecasts; namely, a ∼15-km single-model AnEn performed better for wind, and a ∼33-km multimodel AnEn performed better for temperature forecasts. Odak Plenković et al. (2020) found that using the mean and standard deviation of a ∼10-km NWP ensemble as predictors can provide an accurate and efficient multimodel AnEn, similar to using each NWP member as a predictor during one analog search. However, searching for analogs from each NWP ensemble member separately [AnEnMem in Odak Plenković et al. (2020); similar to Eckel and Delle Monache (2016)] was less accurate. While all these studies reach different conclusions about how an NWP ensemble is best utilized to obtain optimal AnEn results, the properties of the NWP ensemble are rarely considered or varied in AnEn sensitivity studies.

This study follows two previous studies over the mountainous coastal region of southwest British Columbia (BC). One study (Jeworrek et al. 2021, hereafter J21) showed that WRF physics configurations and model settings, such as spatial and temporal resolution, have a significant impact on forecast performance, as do climatological and geographical conditions. The other study (J22) optimized an AnEn by postprocessing one of the best performing configurations in J21, while considering accumulation windows from hourly to daily. The present study examines whether the previous single-model AnEn results from J22 hold true for some of the other configurations (grid spacings and physics suites) that also performed well in J21, but under different conditions and for different verification statistics. The focus in all three studies (J21; J22, and this study) is on the South Coast of BC, where electricity production relies heavily on hydropower, and where accurate precipitation forecasts are crucial to manage water resources and mitigate flood risks. However, local weather predictions in this region can be subject to large uncertainties because of the diverse terrain with steep topography and complex coastlines, and because of the upstream data void over the Pacific Ocean (i.e., largely devoid of nonsatellite-derived observations such as weather balloon radiosondes).

The dry-bias issue of high-impact AnEns is evident in J22’s optimized AnEn, where forecasts of heavy precipitation events exceeding 90th-percentile (90p) observed amounts were often unconditionally underpredicted. In an attempt to overcome this issue, this study proposes new analog techniques to generate ensemble members that are capable of making forecasts outside the observational distribution of the archived training period (see section 2a). The proposed methods are specifically designed for zero-bound variables such as precipitation and do not rely on the definition of a threshold value, unlike the correction method for rare events of Alessandrini et al. (2019). While the new analog technique prioritizes prediction accuracy for heavier precipitation, its two variants differ in how they treat weaker precipitation events.

Section 3c utilizes a multiphysics, multiresolution NWP ensemble and combines the analogs found from each individual NWP member into one AnEn, similar to Eckel and Delle Monache (2016). It further investigates whether a multimodel AnEn benefits most from using varied NWP physics configurations or varied NWP grid spacings. Section 3d compares all tested AnEn variants with raw and bias-corrected NWP forecasts over an independent 1-yr verification period, and section 4 provides the summary and conclusions.

2. Methodology

This study uses the same archived dataset as described in J22 plus two additional physics setups for the WRF Model. In addition to the WSM5-KF-YSU-NoahMP configuration used in J22 (see abbreviations and references in the appendix), this study also uses the Thom-KF-ACM2-NoahMP and Thom-GF-YSU-NoahMP configurations, all of which were among the best performing WRF physics suites in J21. In that study, WSM5-KF-YSU-NoahMP had the best 75th percentile (75p) equitable threat score (ETS) and 75p probability of detection (POD) for subdaily precipitation, whereas Thom-KF-ACM2-NoahMP had the best correlation, mean square difference (MSD), and standard deviation (SD) of errors, as well as the best 75p false alarm ratio (FAR) and good 75p accuracy and mean absolute error (MAE), especially during winter. Thom-GF-YSU-NoahMP had the best bias and good performance for measurable precipitation (>0.25 mm) metrics, and ranked among the best configurations for summer MAE.

Since all three configurations use the same NoahMP land surface model, we will refer to the three configurations as WSM5-KF-YSU, Thom-KF-ACM2, and Thom-GF-YSU, hereafter. Other model settings (e.g., initial and boundary conditions, domains, other parameterizations, etc.) are identical to J21. For example, three two-way nested WRF domains (with 27–9–3-km grid spacing) are centered over the BC South Coast (see Fig. 1a in that study).

While J22 optimized forecasts from only the 9-km midsize domain, the present study considers all three grid spacings. The resulting nine-member multiphysics, multiresolution reforecast ensemble was generated over the same period as in J22, and split into 4.75 years of training (from January 2016 to September 2020) and 1 year of testing (the water year 2021: October 2020–September 2021). The investigation of the proposed methods in section 3a through section 3c is conducted on the training period using a leave-one-out approach, whereas the independent verification and comparison with a reference bias-correction method in section 3d is conducted over the testing period.

Following the optimization results from J22, this study uses a temporal trend similarity (TTS) of τ = 1, 2, or 3 for forecast days 1, 2, and 3, respectively, and supplemental lead times (SLTs) in a window of ±6 lead-time steps. This means that analogs are identified not only by matching the predictors at the same lead time, but also by considering 1) their temporal evolution and 2) adjacent analog candidates over a range of surrounding lead times. The TTS and SLT techniques are described in detail in J22. This study also uses the same 46 stations from the Environment and Climate Change Canada (ECCC) and BC Hydro networks (see Fig. 1 in J22). However, here we focus on 3-hourly rolling accumulation windows, which have notably better predictability than hourly windows while still maintaining relatively high temporal resolution (J21).
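For illustration, the analog similarity over a TTS window can be sketched as below, in the style of the metric of Delle Monache et al. (2011) used by J22; the function name and array layout are our own illustration, not the authors' code.

```python
import numpy as np

def tts_distance(target_win, cand_win, weights, sigma):
    """Weighted analog distance over a temporal-trend (TTS) window.

    target_win, cand_win: arrays of shape (n_predictors, 2*tau + 1) holding
        predictor values centered on the target/candidate lead time.
    weights, sigma: per-predictor weight and climatological SD
        (each of length n_predictors).
    """
    # Squared differences summed over the trend window, per predictor ...
    sq = np.sum((target_win - cand_win) ** 2, axis=1)
    # ... then combined across predictors, normalized by each SD.
    return np.sum((weights / sigma) * np.sqrt(sq))

# SLT enlarges the candidate pool: for a target lead time t, candidate
# windows are drawn from every archived forecast at all lead times
# t-6 .. t+6, not just at lead time t itself.
```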

First, we perform the efficient forward selection technique (“All-EFS” predictor selection in J22) on each of the nine training datasets individually to identify appropriate predictor variables. Similar to brute force, the All-EFS technique tests a large number of predictor combinations, but in a stepwise approach and with weighting-option limitations to reduce the number of trials. The best predictors and weights are assessed in terms of 75p threshold-weighted continuous ranked probability scores (twCRPS; J22). The results are rarely significantly different from those obtained using the predictor weights that were trained on the configuration used in J22 (9-km WSM5-KF-YSU). Significant differences were found only for configurations having both different physics and grid spacing. Although the model configurations here have different grid spacings and parameterization combinations, many commonalities still exist, such as the initial and boundary conditions, the WRF dynamical core and version, the domain setup, and various unchanged parameterization types.

The resulting predictors and weights are specific to each observational station, meteorological season, and—despite the minor differences—model configuration. Similar to Fig. A3 in J22, the most important predictor is model precipitation, commonly followed by integrated water vapor transport (IVT; during all seasons except summer) and various kinematic variables. Other meteorological model variables may also be selected as predictors by the All-EFS technique, but usually with smaller relative weights or at fewer stations only. Table 1 in J22 lists all considered predictor candidates.

As in J22, if the distribution of verification values across observational stations is normal, a paired t test assesses whether different methods yield significantly different distributions at the α = 0.05 level. However, if the Shapiro–Wilk test for normality (Shapiro and Wilk 1965) is rejected, the nonparametric two-sided Wilcoxon signed-rank test (Wilcoxon 1945) is used instead.
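For illustration, this testing procedure can be sketched with SciPy as follows; where exactly normality is assessed (here, on the paired differences) is our assumption.

```python
import numpy as np
from scipy import stats

def significantly_different(scores_a, scores_b, alpha=0.05):
    """Compare paired per-station verification scores of two methods."""
    a, b = np.asarray(scores_a), np.asarray(scores_b)
    # Shapiro-Wilk normality test, here applied to the paired differences
    # (an assumption; the paper does not state exactly what is tested).
    _, p_normal = stats.shapiro(a - b)
    if p_normal > alpha:                       # normality not rejected
        _, p_value = stats.ttest_rel(a, b)     # paired t test
    else:                                      # normality rejected
        _, p_value = stats.wilcoxon(a, b)      # two-sided Wilcoxon signed-rank
    return p_value < alpha                     # True -> significantly different
```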

a. AnEn variants

In other AnEn studies (including J22), the AnEn uses the past analog observations (AnObs) as ensemble members (see Fig. 1). For clarity we call this conventional AnEn the analog observation ensemble (AnObsEn) hereafter, whereas “AnEn” will be used as an umbrella term for the analog ensemble technique comprising various subtypes.

Fig. 1. Illustration of the methodology used for a 4-member analog observation ensemble (AnObsEn; adapted from J22).

As an alternative to the AnObsEn, we introduce two variants of an analog error ensemble (AnErrEn) that apply the past error of each past analog model forecast (AnFcst) to the target model forecast (TaFcst). Since precipitation distributions are zero-bound and skewed, an additive bias correction:
$$\mathrm{AnErrAdd} = \mathrm{TaFcst} + (\mathrm{AnObs} - \mathrm{AnFcst}) \tag{1}$$
may result in negative values, whereas a multiplicative bias correction:
$$\mathrm{AnErrMult} = \frac{\mathrm{AnObs}}{\mathrm{AnFcst}} \times \mathrm{TaFcst} \tag{2}$$
may result in division by zero [i.e., “not a number” (NaN)] or result in unreasonably large numbers for small AnFcsts. Neither of these options is unconditionally appropriate. Therefore, we define the two variants of the AnErrEn to use AnErrMult, AnErrAdd, or zero as a solution for their ensemble members under specific conditions as follows:
$$\mathrm{AnErrEn1} = \begin{cases} 0 & \text{if } \mathrm{AnFcst} = 0 \text{ and } \mathrm{AnObs} = 0,^{1} \\ \mathrm{AnErrAdd} & \text{if } \mathrm{AnFcst} = 0 \text{ and } \mathrm{AnObs} > 0^{2} \text{ or } \mathrm{AnErrMult} > \mathrm{AnErrAdd} > 0,^{3} \\ \mathrm{AnErrMult} & \text{otherwise.} \end{cases} \tag{3}$$
$$\mathrm{AnErrEn2} = \begin{cases} 0 & \text{if } \mathrm{AnFcst} = 0 \text{ and } \mathrm{AnObs} = 0, \\ \mathrm{AnErrAdd} & \text{if } \mathrm{AnFcst} = 0 \text{ and } \mathrm{AnObs} > 0 \text{ or } \mathrm{AnErrMult} > \mathrm{AnErrAdd} > 0 \\ & \quad \text{or } \mathrm{AnObs} = 0 \text{ and } \mathrm{TaFcst} > \mathrm{AnFcst} > 0,^{4} \\ \mathrm{AnErrMult} & \text{otherwise.} \end{cases} \tag{4}$$
AnErrEn1 defaults to AnErrMult unless AnErrMult results in division by zero or in numbers larger than AnErrAdd. AnErrEn2 does the same with one additional (perhaps controversial) condition: if the TaFcst and past AnFcst both suggest nonzero precipitation, even though the past verifying AnObs is zero, the difference between the TaFcst and past AnFcst is applied to the TaFcst (only if TaFcst > AnFcst; otherwise AnErrAdd would be negative). We chose to experiment with this second AnErrEn variant in an attempt to recover more true positive precipitation events from the past history. Although the precipitation values of nonzero AnErrEn members differ from AnObsEn (e.g., they can generate larger extreme values), AnErrEn1 is dry more often than AnObsEn: if AnObs is zero, both AnObsEn and AnErrEn1 members are also zero, whereas if AnObs is nonzero, the AnObsEn member is nonzero but the AnErrEn1 member can be zero or nonzero. Since the observed precipitation distribution has a majority of dry events, the baseline chance of selecting dry AnObs is higher in a large AnEn. This study investigates whether AnErrEn2 can recover more nonzero precipitation forecasts to reduce the number of missed events without introducing too many false alarms.
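For clarity, the piecewise rules of Eqs. (3) and (4) translate directly into a short routine. The sketch below is our reading of those definitions (footnote numbers refer to the conditions above); names are illustrative.

```python
def an_err_member(ta_fcst, an_fcst, an_obs, variant=1):
    """One AnErrEn member following Eqs. (3)-(4)."""
    if an_fcst == 0.0 and an_obs == 0.0:
        return 0.0                              # both dry -> dry (footnote 1)
    an_err_add = ta_fcst + (an_obs - an_fcst)   # additive correction, Eq. (1)
    if an_fcst == 0.0 and an_obs > 0.0:
        return an_err_add                       # avoid NaN (footnote 2)
    an_err_mult = (an_obs / an_fcst) * ta_fcst  # multiplicative, Eq. (2)
    if an_err_mult > an_err_add > 0.0:
        return an_err_add                       # cap blow-ups (footnote 3)
    if variant == 2 and an_obs == 0.0 and ta_fcst > an_fcst > 0.0:
        return an_err_add                       # recover wet members (footnote 4)
    return an_err_mult
```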

Sections 3a and 3b evaluate single-model AnEns over the training period, where each of the three AnEn variants (AnObsEn, AnErrEn1, and AnErrEn2) are created using each of the nine WRF configurations separately (27 ensemble prediction systems total). Section 3c proposes an additional multimodel AnEn (MM-AnEn) alternative that is similar to Eckel and Delle Monache (2016)’s hybrid ensemble: It combines the (N) best analog members from various (M) model configurations to compose one ensemble—in this study, producing one ensemble for each of the AnEn variants. Since the MM-AnEn uses equal numbers of analogs from each of the M configurations, the ensemble size is a multiple of the number M (i.e., N × M).
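A minimal sketch of this MM-AnEn composition, assuming per-model analog similarity scores and candidate member values are already available (array layout and names are hypothetical):

```python
import numpy as np

def mm_anen(distances, member_values, n_per_model=6):
    """Compose an MM-AnEn from the N best analogs of each of M NWP models.

    distances:     (n_models, n_candidates) similarity scores (lower = better).
    member_values: (n_models, n_candidates) value each candidate contributes
        (AnObs for the AnObsEn, or an AnErr-corrected value).
    """
    distances = np.asarray(distances)
    member_values = np.asarray(member_values)
    ensemble = []
    for dist, vals in zip(distances, member_values):
        best = np.argsort(dist)[:n_per_model]  # N best analogs of this model
        ensemble.extend(vals[best])
    return np.asarray(ensemble)                # ensemble size = N x M
```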

If different models in the MM-AnEn select the same analog, the AnObsEn would resample the same AnObs as an AnEn member, whereas AnErrEn members would be different as long as the TaFcst differs among models (even though AnObs is identical). Eckel and Delle Monache (2016) found that allowing repeated MM-AnObsEn members from resampling analogs has a beneficial effect of adding more weight to better analogs.

We also built an ensemble of the best overall analogs with the best similarity score no matter which configuration they originate from. Although this ensemble improved upon the single-model AnEns, it is slightly overconfident, it underpredicts the observed frequencies (i.e., is less reliable), and it has slightly worse MAEs and skill–spread relationships (not shown). Since it performed somewhat worse than the MM-AnEn described above, section 3c only shows results of the MM-AnEn using equal numbers of AnEn members from different NWP models.

b. Reference bias-correction method

Section 3d compares performance of the AnEn variants to 1) the raw WRF forecasts and 2) a bias-corrected version of the WRF forecasts. Since raw NWP output is impaired by systematic errors, we apply a simple bias correction technique following McCollor and Stull (2008) that aims to reduce the seasonal bias in a forecast by using estimates of past biases. The degree of mass balance (DMB)
$$\mathrm{DMB}_{j,k} = \frac{\sum_{y=2016}^{2020} \sum_{i=-45}^{45} \mathrm{PaFcst}_{j+i,y,k}}{\sum_{y=2016}^{2020} \sum_{i=-45}^{45} \mathrm{PaObs}_{j+i,y,k}} \tag{5}$$
is calculated for each day-of-year j and lead time k. It divides the sum of past model forecasts (PaFcsts) by the sum of past observations (PaObs) over initializations i within ±45 days centered on the target day-of-year j, across the training years y. The bias-corrected TaFcst^c is then calculated as
$$\mathrm{TaFcst}_{j,k}^{c} = \frac{\mathrm{TaFcst}_{j,k}^{\mathrm{raw}}}{\mathrm{DMB}_{j,k}}. \tag{6}$$
In essence, this approach is similar to the multiplicative bias correction in Eq. (2), but it uses the past seasonal biases from the training period instead of the individual analogs. The moving initialization window of ±45 days was chosen based on the need to avoid division by zero or very small numbers in Eq. (5). While this window is sufficient for our training dataset, different (e.g., shorter) datasets or drier station locations may require a longer moving initialization window. In this study the resulting window of 91 daily initializations in total corresponds to the length of a meteorological season and is therefore capable of adjusting for seasonal error characteristics while also being robust. A time window of 91 days has also been used in Hamill and Whitaker (2006) to filter for seasonal error characteristics in bias-correction and analog-search techniques.
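A compact sketch of Eqs. (5) and (6), assuming a hypothetical training-array layout indexed by day-of-year, year, and lead time (our illustration, not the authors' code):

```python
import numpy as np

def dmb_correct(ta_fcst, pa_fcst, pa_obs, doy, lead, window=45):
    """DMB bias correction, Eqs. (5) and (6), for one day-of-year and lead.

    pa_fcst, pa_obs: training precipitation arrays with assumed shape
        (365, n_years, n_leads), indexed by (day-of-year, year, lead time);
        doy is the 0-based target day-of-year.
    """
    # +/-45-day window around the target day-of-year, wrapped at year ends
    days = (doy + np.arange(-window, window + 1)) % 365
    dmb = pa_fcst[days, :, lead].sum() / pa_obs[days, :, lead].sum()  # Eq. (5)
    return ta_fcst / dmb                                              # Eq. (6)
```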

During the verification over the testing period, the AnEns as well as the DMB-corrected NWP forecasts are all postprocessed based on the data in the training period only. This warrants a fair comparison for the performance of the proposed AnEn variants and the WRF forecasts with the seasonal systematic error removed. The AnObsEn method used herein was also benchmarked against a simpler AnObsEn method in J22.

3. Results and discussion

a. The role of ensemble size

Unlike with NWP ensembles, the number of ensemble members in AnEns has no impact on computational time. This provides an opportunity to investigate how performance skill varies with ensemble size and to determine an optimal size that is free of computational constraints.

Figure 2 shows that the optimal ensemble size varies by verification metric and is often in the range of 30–60 members. With fewer than ∼30 ensemble members, AnEn skill is considerably worse, and beyond the optimal ensemble size, skill worsens again but less rapidly. For ensemble sizes beyond the ideal number of members, the AnErrEn errors grow more gradually than the AnObsEn errors, making the choice of ensemble size more important for the AnObsEn.

Fig. 2. Mean absolute error (MAE), continuous ranked probability score (CRPS), and 95th percentile Brier score (95p BS) as a function of ensemble size, averaged over lead times and station locations. The optimal ensemble size is marked and annotated at the minimum (best) values. The solid line is the average, and the shading is the spread across model configurations.

The AnObsEn often reaches the lowest errors overall at a smaller ensemble size, followed by AnErrEn1 and AnErrEn2. Only at high percentile thresholds does AnErrEn produce better (lower) Brier scores (BS) than AnObsEn. Within the spread among configurations in Fig. 2, finer grids tend to have larger errors, which is consistent with findings in J21.

The optimal ensemble size (i.e., minimum error) also depends on lead time and precipitation intensity (not shown). For example, the best AnObsEn continuous ranked probability score (CRPS) occurs at an ensemble size of about 48, 51, and 54 members on average for forecast days 1, 2, and 3, respectively. The best AnObsEn BS occurs at an ensemble size of 50–60 members for thresholds below the 80th percentile (80p) and decreases to 40 members at 90p. The two AnErrEn variants show similar lead-time dependencies, but (as mentioned above) since their error minimum lies on a flatter part of the curve, a wider range of ensemble sizes has similarly low error values. For prediction systems with longer forecast horizons it might be worth accounting for the lead-time dependency of the AnEn size. However, for the three-day forecasts in this study we consider an ensemble size of 50 analog members to be appropriate across AnEn variants, lead times, and percentiles.
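For readers who want to reproduce such ensemble-size scans, the CRPS of an ensemble forecast can be estimated empirically from the members alone; the sketch below uses the standard estimator mean|X - y| - 0.5 mean|X - X'| with illustrative names (our code, not the authors').

```python
import numpy as np

def ensemble_crps(members, obs):
    """Empirical CRPS of one ensemble forecast vs. one observation."""
    x = np.asarray(members, dtype=float)
    return np.mean(np.abs(x - obs)) - 0.5 * np.mean(
        np.abs(x[:, None] - x[None, :]))

# Hypothetical scan over ensemble sizes, keeping the best-ranked analogs
# first, to trace curves like those in Fig. 2:
# crps_by_size = {n: np.mean([ensemble_crps(m[:n], y) for m, y in pairs])
#                 for n in range(10, 101, 5)}
```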

The spread–skill relationship in Fig. 3 [assessed similarly to Delle Monache et al. (2020) and Alessandrini et al. (2019)] shows that all AnEn systems are able to quantify their own uncertainty well. Namely, the binned spread (i.e., SD) of the 50 ensemble members correlates strongly with the skill (i.e., RMSE) of the ensemble median for all AnEn variants. The nearly perfect, roughly 1:1 slope, however, has a consistent offset to the left, which indicates that the ensembles are underdispersive. This offset is smallest for AnObsEn and largest for AnErrEn2.
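The binned spread–skill diagram itself is straightforward to compute; below is a sketch assuming the per-forecast ensemble SD and ensemble-median error are precomputed (our code, not the authors').

```python
import numpy as np

def binned_spread_skill(spread, median_error, n_bins=50):
    """Mean ensemble SD vs. RMSE of the ensemble median in equally populated
    spread bins (one dot per bin, as in Fig. 3)."""
    spread = np.asarray(spread)
    median_error = np.asarray(median_error)
    order = np.argsort(spread)
    bins = np.array_split(order, n_bins)       # equally populated bins
    mean_spread = np.array([spread[b].mean() for b in bins])
    rmse = np.array([np.sqrt(np.mean(median_error[b] ** 2)) for b in bins])
    return mean_spread, rmse                   # plot rmse against mean_spread
```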

Fig. 3. Binned spread–skill diagram for the 50-member AnEn variants calculated over all stations and forecast lead times. Each dot represents the center of one of 50 equally populated ensemble-spread intervals (>200 000 samples in each bin), and each line connects dots that result from the same WRF configuration.

While Fig. 3 is calculated over all stations and forecast lead times, subdividing the dataset according to lead times and geographical locations (not shown) indicates slightly worse relative skill–spread relationships for longer lead times and higher-elevation inland station locations. However, even the ensembles for shorter outlooks and coastal station locations are still underdispersive.

b. Performance dependencies on the precipitation rate

The two AnErrEn variants were designed to improve the dry bias of the AnObsEn, especially during high-impact events (discussed in the introduction). Figure 4 shows that they indeed improve the frequency bias of events with thresholds > 80p, where the AnObsEn becomes very dry as it is unable to find enough good analogs among the wet events in the search dataset. While AnErrEn2 has a relatively stable frequency bias of about 0.90–0.95 across percentile thresholds, AnErrEn1 is too dry at low precipitation rates but converges with AnErrEn2 toward higher percentiles. The frequency bias compares only the predicted and the observed event frequencies without assessing whether the predictions were successful. Probability of detection (POD) indicates how many of the observed events were successfully predicted. It shows that the AnEn members capture about 50% of total observed 3-hourly wet events and about 25%–30% of the 90p (“heavy”) observed precipitation events. AnErrEn2 has the best PODs across percentile thresholds, which is in part a consequence of having more wet events overall (i.e., a higher frequency bias). AnErrEn1 has PODs similar to AnObsEn at lower percentiles, and PODs similar to AnErrEn2 at higher percentiles, where the hits for 95p (97p) thresholds are increased by over 30% (60%) by using one of the AnErrEns as opposed to the AnObsEn. PODs of >95p events using AnErrEn methods are still fairly low, but note that PODs of 3-hourly forecasts are considerably lower than those for longer (e.g., daily) accumulation windows.

Fig. 4. Categorical metrics as a function of observed percentile threshold. Values are calculated across all 50 ensemble members, 46 stations, and day-1 forecast lead times. The solid line is the average, and the shading is the spread across model configurations. The black arrows on the y axes point toward better values for each metric, respectively.

The probability of false detection (POFD) gives the fraction of false alarms over all negative (dry) observed events. Since AnErrEn2 has a higher frequency bias, it is also more likely to have a higher POFD. However, at very low percentile thresholds AnObsEn surprisingly has a larger POFD despite its lower frequency bias. The overall accuracy (the fraction of correct forecasts with respect to percentile thresholds; Fig. 4) reflects that AnErrEn2 (and to a slightly lesser extent AnErrEn1) has more correct predictions in total, even for low-impact precipitation intensities.
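The four categorical metrics discussed here all derive from a 2 × 2 contingency table; below is a minimal sketch with illustrative names (our code, not the authors').

```python
import numpy as np

def categorical_scores(fcst, obs, threshold):
    """Contingency-table metrics for the event 'value >= threshold'
    (whether the event uses > or >= is our choice here)."""
    f = np.asarray(fcst) >= threshold
    o = np.asarray(obs) >= threshold
    hits = np.sum(f & o)
    misses = np.sum(~f & o)
    false_alarms = np.sum(f & ~o)
    correct_negatives = np.sum(~f & ~o)
    return {
        "frequency_bias": (hits + false_alarms) / (hits + misses),
        "POD": hits / (hits + misses),
        "POFD": false_alarms / (false_alarms + correct_negatives),
        "accuracy": (hits + correct_negatives) / f.size,
    }
```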

Figure 5 provides another perspective of the AnEn dry bias as a function of the WRF-forecasted precipitation rate. Since verifying observations are available only in hindsight, the analysis in Fig. 5 is useful as it informs what bias one can expect based on the target WRF forecast alone. To ensure fair sample-size comparison despite the skewed nature of precipitation distributions, the TaFcsts are binned into equally populated intervals, which spread out more widely toward the higher-precipitation tail of the distribution.
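A short sketch of this equally populated binning (illustrative names; our code, not the authors'):

```python
import numpy as np

def conditional_bias(ta_fcst, anen_median, obs, n_bins=50):
    """AnEn-median bias binned by the WRF target forecast, using equally
    populated bins so every dot reflects a comparable sample size."""
    ta_fcst, anen_median, obs = map(np.asarray, (ta_fcst, anen_median, obs))
    order = np.argsort(ta_fcst)
    bins = np.array_split(order, n_bins)       # equally populated bins
    centers = np.array([np.median(ta_fcst[b]) for b in bins])
    bias = np.array([np.mean(anen_median[b] - obs[b]) for b in bins])
    return centers, bias
```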

Fig. 5. The 50-member AnEn median bias as a function of WRF precipitation rate. Circles are located at the center of 50 equally populated bins (with >4000 samples each) and represent average values across 46 stations and 3 forecast days. Values closer to zero bias are better. As in previous figures, the solid line is the average, and the shading is the spread across model configurations. The dotted vertical line marks the threshold for measurable precipitation (0.25 mm).

When the TaFcst is very small, so is the ensemble median bias of the AnEns. At measurable precipitation rates (>0.25 mm) the AnEn bias stabilizes for AnErrEn2 at about −0.75 mm (3 h)−1 and for AnErrEn1 at about −1 mm (3 h)−1. The bias of the AnObsEn, on the other hand, continues to worsen with increasing TaFcst precipitation rates.

1) Significant precipitation events

Events in the right tail of the precipitation distribution can have a large impact on society and industries. For example, high precipitation rates can disrupt road, rail, and air transportation, damage crops, impact construction operations, and in the worst case cause flash floods and landslides endangering people’s lives. Assessing the probability of extreme events accurately enables people to better prepare for and mitigate disasters.

Reliability diagrams visualize how well forecasted probabilities correspond to observed frequencies, which is an important forecast property. For the percentile thresholds shown in Fig. 6 (75p, 90p, and 95p), AnErrEn1 is the best-calibrated AnEn technique; AnErrEn2 leans slightly toward overprediction (e.g., if an event is predicted at a 75% probability, it is observed only ∼65% of the time); and AnObsEn underpredicts the events (e.g., if an event is predicted at a 70% probability, it is observed 80% of the time). These differences become larger for higher percentile thresholds, with AnErrEn1 outperforming the other two AnEn variants with regard to reliability at all significant and heavier precipitation rates.
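For reference, the reliability curve underlying such diagrams can be computed as sketched below, taking the forecast probability as the fraction of ensemble members exceeding the threshold (a common convention that we assume here; names are illustrative).

```python
import numpy as np

def reliability_curve(members, obs, threshold, n_prob_bins=10):
    """Forecast probability vs. observed relative frequency for the event
    'value >= threshold'. members: (n_cases, n_members); obs: (n_cases,)."""
    members, obs = np.asarray(members), np.asarray(obs)
    prob = np.mean(members >= threshold, axis=1)   # member fraction above threshold
    event = obs >= threshold
    edges = np.linspace(0.0, 1.0, n_prob_bins + 1)
    idx = np.clip(np.digitize(prob, edges) - 1, 0, n_prob_bins - 1)
    fcst_prob, obs_freq = [], []
    for k in range(n_prob_bins):
        mask = idx == k
        if mask.any():                             # skip empty probability bins
            fcst_prob.append(prob[mask].mean())
            obs_freq.append(event[mask].mean())
    return np.array(fcst_prob), np.array(obs_freq)
```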

Fig. 6. Reliability diagrams for 75p, 90p, and 95p thresholds with corresponding sharpness diagrams (insets in the bottom-right corners, showing the relative frequencies of forecasts). As in previous figures, the solid line is the average, and the shading is the spread across model configurations.

2) Measurable precipitation events

The probability of measurable precipitation (i.e., rain versus no-rain events) may not have as disastrous impacts as heavy precipitation rates, but it can still affect people’s decision making, for example, in agriculture, construction (e.g., pouring cement), wildfire management, the film industry, and outdoor recreational activities. Depending on the forecast application, different end users might prioritize that the prediction system detects true precipitation events (high POD), avoids false alarms (low POFD), or is unbiased (frequency bias ∼1.0).

Due to the distinct bimodal climate in BC, we investigate the predictability of summer-season and winter-season measurable precipitation separately. The BC warm (summer) season is very dry and has only occasional precipitation events that are often convective, whereas the cool (winter and shoulder) seasons are much wetter with frequent frontal precipitation. In summer, most end users of precipitation forecasts are likely focused on the prediction of wet events, whereas in winter, when it is often wet, some end users may be interested in the prediction of dry events.

Figure 7 shows that categorical metrics of measurable precipitation exhibit a clear diurnal cycle in summer. During local daytime, the WRF forecasts are too often wet (see frequency bias plot in the bottom-left panel of Fig. 7), which induces an elevated false-alarm rate (see POFD plot in Fig. 7). Hit rates (not shown) and PODs also increase corresponding to the wetter baseline; however, overall accuracy diminishes as a result of the large number of false alarms. As discussed in J21, configurations that have coarser grid spacings and use the KF cumulus scheme suffer from this diurnal error (see Fig. 6 in that study).

Fig. 7. Categorical metrics for (left) summer and (right) winter measurable precipitation [>0.25 mm (3 h)−1] as a function of forecast lead time. As in Fig. 4, the solid line is the average, and the shading is the spread across model configurations; the black arrows on the y axes point toward better values for each metric, respectively. However, here values are calculated for the 50-member AnEn median for a fair comparison with the raw WRF forecasts. The vertical grid lines mark lead times that correspond to 0000 local standard time.

AnErrEn2 inherits some of this false-alarm pattern from the raw WRF forecasts in summer. AnObsEn and AnErrEn1 are barely distinguishable in Fig. 7 (green AnErrEn1 is plotted over yellow AnObsEn) and are both able to remove most of the diurnal cycle, as seen by a smoother curve in POFD and frequency bias. However, they have a considerable dry bias: AnObsEn and AnErrEn1 forecast measurable precipitation less than half as often as the station observations. Nevertheless, thanks to their low POFD, they yield the best accuracy in comparison with the raw model forecasts and AnErrEn2. Figure 8 shows that AnObsEn and AnErrEn1 have good resolution in summertime (AnObsEn slightly better than AnErrEn1) as opposed to AnErrEn2, which has a positive ensemble bias in the rank histogram and is overconfident.

Fig. 8. (top) Reliability diagrams with corresponding sharpness diagrams (>0.25 mm) and (bottom) rank histograms for (left) summer and (right) winter precipitation.

During wintertime, AnErrEn2 acts again as a compromise between the dry AnObsEn/AnErrEn1 and the wetter raw model forecasts. Ensemble-median AnErrEn2 has the best overall accuracy due to its better POD (Fig. 7). Figure 8 shows that all AnEns, but especially AnErrEn2, are underdispersive during winter; i.e., the U shape of the rank histogram indicates that verifying observations frequently fall outside the ensemble spread. However, AnErrEn2’s wet bias is less pronounced in winter than in summer. The AnObsEn/AnErrEn1 dry bias, on the other hand, is more pronounced in winter than in summer.

In summary, AnObsEn shows the best performance in terms of reliability and accuracy in summer for the study region. However, forecasters who prioritize POD over POFD (fewer misses at the cost of more false alarms) during the dry summer season may opt for the wetter AnErrEn2 variant. In winter, AnErrEn2 has the best accuracy for measurable precipitation. However, forecasters who prioritize POFD over POD (fewer false alarms at the cost of more misses) during the wet winter season may opt for the drier AnObsEn variant.

Note that the values in Fig. 7 may differ from Fig. 4, because Fig. 4 evaluates each of the 50 ensemble members during forecast day 1, whereas Fig. 7 considers the ensemble median to ensure a fair comparison with the raw WRF forecasts. This means that the forecasts in Fig. 4 get credit for predicting the event even with a minority of members, whereas in Fig. 7 a majority of AnEn members is required to exceed the threshold.

c. Multimodel AnEn

Ensemble-size testing similar to Fig. 2 shows that MM-AnEn performance improves rapidly up to approximately 30–60 members and continues to improve at a much lower rate until 100 members (and possibly beyond). Since the strongest improvement comes from the first ∼50 members and there are nine model configurations (M = 9, from three physics configurations and three grid spacings), we select an ensemble size of 54 members for the MM-AnEn (i.e., N = 6 members per model configuration). The 54-member MM-AnEn slightly improves the spread–skill relationship (not shown), in that the trend line is a little closer to the 1:1 line (less underdispersive/overconfident) compared to the single-model AnEns (Fig. 3).

Reliability diagrams (Fig. 9) show that the MM-AnEn technique produces very reliable forecasts for high percentile thresholds, where the underlying AnEn variants are almost indistinguishable. Lower precipitation thresholds show larger differences among AnEn variants: MM-AnObsEn and MM-AnErrEn1 have an unconditional dry bias for measurable precipitation (similar to the winter reliability diagram for the single-model AnEns in Fig. 8), whereas MM-AnErrEn2 shows a tendency toward being overconfident (i.e., having less resolution).

Fig. 9. The 54-member MM-AnEn reliability diagrams with insets of corresponding sharpness diagrams for different precipitation intensities.

The MM-AnEn contains ensemble members that originate from a multimodel and multiresolution NWP ensemble. To investigate the impact of “initializing” an MM-AnEn from an NWP ensemble that includes different model physics configurations versus different grid spacings, we test additional MM-AnEns using only the best analog members (i.e., with the best similarity score; see J22) from subsets that fix either a physics configuration or a grid spacing. For example, the 3-km MM-AnEn consists of the 18 best analogs found by each of the three physics configurations. Since low percentile thresholds show the largest deviation from perfect reliability among AnEn variants, Fig. 10 displays measurable-precipitation reliability diagrams. It is evident that the model grid spacing has a larger impact on AnEn reliability than the choice of physics configuration, as seen by the larger spread among lines on the right side of Fig. 10. Finer grid spacings have higher resolution, which makes the MM-AnErrEn2 more reliable. However, MM-AnObsEn and MM-AnErrEn1 underforecast measurable precipitation, and finer grids make these AnEns even drier.

Fig. 10. The 54-member MM-AnEn reliability diagrams and corresponding sharpness diagrams (insets) for measurable precipitation, using the 18 best analogs from (left) each grid spacing for each physics configuration and (right) each physics configuration for each grid spacing.

MM-AnEns perform more similarly at high percentile thresholds (not shown); however, coarser-grid MM-AnEns and AnErrEns are more reliable compared to finer grids and the AnObsEn variant, which have a tendency to underpredict higher-impact precipitation events.

The MM-AnEn in this study blends analogs from models with varying grid spacings and physics configurations only, whereas all NWP datasets use the same model dynamics and initial and boundary conditions. Exploring the full benefit of a true multimodel AnEn technique (using different NWP models and initial and boundary conditions) is computationally expensive and lies outside the scope of this study.

d. Verification

Sections 3a–3c investigated the proposed AnEn methods over the 4.75-yr training period using the leave-one-out approach. The long training period enabled a robust assessment of the behavior of the AnEn variants across precipitation intensities, lead times, and seasons. This section verifies these results over the independent 1-yr testing period and compares the AnEn variants with a baseline of DMB-corrected model forecasts, where each postprocessed testing forecast is trained solely on the training data.

One motivation of this study is to improve the AnEn dry bias especially for high-impact events. The high-percentile thresholds in Fig. 11 confirm that the AnErrEn variants have less underprediction than the single-model AnObsEn, making the high-impact AnErrEns more reliable. The MM-AnEns have better resolution at higher percentiles as seen by the subtle S curve, and they are sharper. For example, MM-AnEns predict rare 95p events more often with higher probabilities than the single-model AnEns, due to the difficulties of the single-model AnEns in finding enough good high-impact analogs. The DMB correction improves the raw ensemble slightly; however, high-impact raw and DMB-corrected model ensemble forecasts both have very poor resolution and are overconfident.

Fig. 11. Reliability diagrams with insets of corresponding sharpness diagrams for different precipitation intensities during the testing period. The ensemble size varies between forecasts and is given in parentheses in the legend.

Larger differences among AnEn variants are evident in the reliability diagram for measurable precipitation (top left in Fig. 11), although they agree overall with our findings from the previous sections. Single-model AnEns have a larger underprediction than MM-AnEns at low forecast probabilities, which agrees with the considerable dry biases discussed earlier in section 3b(2). However, MM-AnEns generally have a larger underprediction at mid- to high forecast probabilities. Overall, MM-AnErrEn2 yields the best reliability for measurable precipitation during testing. Although raw and DMB-corrected ensembles are generally sharper, they are again overconfident and underpredict measurable precipitation at lower and overpredict at higher forecast probabilities.

Similar to the bias during the training period seen in Fig. 5, AnEns also show a persistent dry bias across forecasted precipitation intensities during the testing period (left side of Fig. 12). Raw and DMB-corrected model forecasts, on the other hand, exhibit a bias that becomes wetter for larger forecasted precipitation intensities. Since smaller precipitation rates are much more frequent, the total mean bias (bottom-right panel of Fig. 12) is negative for all forecasts, which means that the predictions are drier than the observations on average. The DMB correction is able to reduce the average raw-model biases, but the wet biases at higher-intensity precipitation rates worsen compared to the raw NWP. AnEn-median biases are even drier than the raw and DMB-corrected model forecasts. However, as intended and seen before, the AnErrEns reduce the dry bias compared to the AnObsEns. Note that for better visualization, the MM-AnEns are not included on the left side of Fig. 12; however, their trends lie mostly between the green AnErrEn1 and the blue AnErrEn2 lines.

Fig. 12. Mean error of raw model forecasts, DMB-corrected model forecasts, and AnEn medians during the testing period aggregated over 46 stations and all lead times in the 3-day forecast horizon. The left panel is analogous to Fig. 5, but with different line styles for different grid spacings. The bottom-right panel shows the total average of mean biases with the range over configurations.

Biases in complex terrain are highly dependent on location. Since this study uses a different selection of station locations than in J21, the mean biases differ. Although overall biases are drier in this study, higher resolutions are still drier than coarser resolutions, making the absolute bias of finer grids larger. Overall, AnEns trained on coarser-grid WRF Model forecasts yield better verification statistics.

Figure 13 compares verification statistics among all individual AnEn experiments and DMB-corrected model forecasts using the raw output as reference. Although the MM-AnEns do not further reduce mean biases, they show larger improvement over the raw model forecasts with regard to MAEs (not shown) and CRPS (Fig. 13).

Fig. 13. CRPSS of the nine-configuration ensemble with DMB-corrected members and AnEn variants using the nine-configuration raw model ensemble as reference during the testing period aggregated over stations and lead times. The black arrow on the x axis points toward better values.

Although the differences in average CRPS are not large among AnEn methods, in most cases the differences are significant at the 95% confidence level when comparing the distributions of CRPS across stations. Only about 1 in 10 pairings (typically ones that use similar grid spacings) are not significantly different. Single-model AnErrEns are almost always significantly different from MM-AnEns. All AnEn CRPS differ significantly from the raw and the DMB-corrected model forecasts, whereas CRPS from the raw and the DMB-corrected model forecasts are not significantly different from each other.

The DMB-corrected model ensemble, which improves average bias, often has worse MAEs and CRPS, especially on finer grids (not shown). This is likely a result of applying a multiplicative factor to a dataset with large internal variability. Although individual errors can cancel each other out when calculating the average bias score, nonsystematic errors may be amplified by the DMB correction. Finer grids, which have larger SDs of errors (see J21), especially show this effect. Note that raw and DMB-corrected forecasts have a statistical disadvantage for probabilistic ensemble statistics in that their ensemble size is much smaller (9 members) than that of the AnEns (50 or 54 members).

4. Summary and conclusions

This study improved short-range 3-hourly precipitation AnEn forecasts by introducing new methods of constructing ensembles from past NWP analogs. The conventional approach of using past observations as ensemble members to compose an AnEn is compared to two new approaches that apply past analog error to the target NWP forecast. Both methods construct AnEn members by applying either a multiplicative or additive bias correction to the target forecast based on the values of past analog model forecast (AnFcst) and analog observation (AnObs). Since the first variant (AnErrEn1) is by design relatively often dry, AnErrEn2 uses a modified definition that allows the resulting analog members to be wet more often. Through the application of past errors to the target forecast, both AnErrEn methods were able to create ensemble members with values outside of the observational distribution in the training dataset [and we have confirmed cases where this occurs (not shown)]. Overcoming this limitation dramatically improves the prediction of extreme events, which traditional AnEns commonly struggle with—probability of detection (POD) of 90p and 95p events is improved by 30% and 60%, although it is still relatively low.

In addition, we investigated multimodel (MM) AnEns that use an equal number of analog members from each of 1) nine model configurations with varying physics parameterizations and grid spacings, and 2) the subsets of the NWP model ensemble with either a fixed physics configuration or a fixed grid spacing.

This study showed that both AnErrEn methods, but in particular AnErrEn2, have less dry bias than the AnObsEn, especially for significant and heavier forecasted precipitation rates [>2 mm (3 h)−1] and observed precipitation rates > 80p. Overall accuracy was improved by both AnErrEn methods across precipitation intensities. For significant and heavier precipitation events during training AnErrEn1 was most reliable, while AnObsEn unconditionally underpredicted such events, and AnErrEn2 unconditionally overpredicted them (but to a lesser extent). During the testing period, AnErrEn1 erred on the dry side while AnErrEn2 was slightly more reliable for 75p.

Using all NWP configurations, the MM-AnEn exhibited excellent reliability up to very high percentiles with minor deviations among AnEn variants. Even the MM-AnObsEn variant represented rare events much better than the single-model AnObsEn. This is because although the MM-AnObsEn uses members from the same observational dataset as the AnObsEn, searching analogs based on various NWP models increases the chances of finding good analogs, and/or puts extra weight on good analogs by selecting them multiple times.

Precipitation across all intensities (especially measurable precipitation) was mostly underpredicted by the AnObsEns and more so by AnErrEn1, while measurable precipitation was conditionally overpredicted by the sharper AnErrEn2. This was observed for both single-model and multimodel AnEn variants. In summer the raw model forecasts exhibited a distinct diurnal cycle with an elevated number of false-alarm events during daytime hours. AnErrEn2 inherited some of this pattern, whereas AnErrEn1 and AnObsEn were able to remove most of it. Despite their dry frequency bias, AnErrEn1 and AnObsEn were the most accurate during summer. However, during winter, AnErrEn2 had the best accuracy and POD. Thus, the best choice of AnEn variant depends on the forecast application and the end-user priorities.

In this study, the performance of all AnEns was more sensitive to different grid spacings in the NWP dataset than to different physics configurations. AnEns from coarser (27-km) NWP models yielded better performance across many metrics (bias, MAE, and CRPS). This finding contradicts the findings of Delle Monache et al. (2013) and Eckel and Delle Monache (2016), who concluded that higher-resolution NWP models improve AnEn performance. However, both of those previous studies compared 33- and 15-km grid spacings for wind and temperature forecasts. While our 9- and 3-km NWP grids are likely affected by convective gray-zone issues (Jeworrek et al. 2019), the AnEn dependency on NWP grid spacing is very similar between configurations using the conventional KF versus the scale-aware GF cumulus scheme. J21 showed that GF performs better in convective situations during summer, but most precipitation in the area of interest occurs during wintertime, and overall verification statistics obtained similar grid dependencies from KF and GF configurations. While AnEns significantly improve on the raw NWP model errors, they seem to inherit similar trends in grid-size dependency.

The reference DMB-corrected WRF Model forecasts reduced the model dry bias, yet amplified the MAE and CRPS. AnEns, on the other hand, were impaired by an exaggerated dry bias, but significantly improved MAEs and CRPS. As such, this study confirms the conclusions of Nagarajan et al. (2015) that AnEns perform best when random error components dominate the NWP forecast, whereas time series methods (such as DMB bias correction) are better when systematic error components dominate. High-resolution (spatially and temporally) precipitation forecasts have particularly high random-error components, since finer scales enhance spatiotemporal variability and the chance of double penalty errors.

This study has developed a novel technique, the AnErrEn, that utilizes analog error to build an ensemble prediction system that is seamlessly skillful across common and rare events. As such, it does not require the definition of a threshold value as in Alessandrini et al. (2019)’s linear regression technique, which may introduce discontinuities in the forecast distribution, and requires forecasters to conduct initial testing to identify an appropriate threshold value.

The definition of the AnErrEn approach in this study is closely linked to the properties of precipitation, namely, that precipitation distributions are zero-bound and skewed. Several conditions were necessary to define an appropriate solution for cases when the bias correction may result in unreasonable values. Simpler AnErrEn formulations may be suitable for other variables with different distributions. For example, an AnErrEn for temperature forecasts may be created based on a simple additive bias correction using Eq. (1).

This study found dramatic accuracy improvements from using AnErrEns as compared to the traditional AnEn, paired with dramatic improvements in computational efficiency as compared to a traditional NWP ensemble. The AnErrEn further improved the dry bias across 3-hourly precipitation rates, and much improved reliability and POD for high impact events. For end users who do not require high temporal resolution, it should be noted that better skill can be obtained by using longer accumulation windows (i.e., time steps longer than 3 h used herein), based on results from J21 and J22.

Although the multimodel (MM) variants of AnEns obtained great improvements even for high-impact events, and even with the conventional AnObsEn approach, they are computationally more expensive because they require several NWP model datasets and analog searches. However, one of the main advantages of AnEns is that they are computationally lightweight and efficient. Therefore, the new single-model AnErrEn methods provide many of the benefits of MM-AnEns without the additional computational expense.

1 Without this condition AnErrMult would be NaN and AnErrAdd would be equal to TaFcst, although both AnFcst and AnObs suggest dry conditions.

2 Without this condition AnErrMult would be NaN.

3 Without this condition AnErrMult can result in unrealistically large numbers for small AnFcst. We determine that no AnErr member should exceed the value of the additive bias correction.

4 Without this condition AnErrMult would be 0 although both TaFcst and AnFcst suggest precipitation. This would introduce a lower frequency bias compared to the raw forecast.

Acknowledgments.

Computational and storage resources to create the reforecast dataset and to optimize the AnEns were provided by WestGrid (westgrid.ca) and the Digital Research Alliance of Canada (alliancecan.ca) through the Resource Allocation Competition (RAC) Awards 2019–22. The research was enabled by funding support provided by Mitacs (Grants IT07224 and IT28208), BC Hydro (Contracts 00089063, 00091424, and PO 4130005193), the Natural Science and Engineering Research Council (NSERC; Discovery Grant RGPIN-2017-03849), and the University of British Columbia (UBC). We thank Dr. William Wei Hsieh for supporting this research through the Chih-Chuang and Yien-Ying Wang Hsieh Memorial Scholarship. We acknowledge that this work is based on and includes content from a thesis submitted by the first author in partial fulfillment of the requirements for their degree of doctor of philosophy (https://doi.org/10.14288/1.0427253).

Data availability statement.

WRF Model point forecasts are not publicly archived, but can be reproduced following Jeworrek et al. (2021) or made available upon request [contact Roland Stull (rstull@eoas.ubc.ca)]. ECCC station data used for verification are available at https://climate.weather.gc.ca/historical_data/search_historic_data_e.html (accessed on 2 November 2021), whereas BC Hydro station data may be obtained by contacting Gregory West (greg.west@bchydro.com).

APPENDIX

List of Abbreviations and Their Definitions

ACM2: Asymmetric Convection Model version 2 PBL scheme (Pleim 2007)
All-EFS: Efficient forward selection testing all model variables (J22)
AnEn(s): Analog ensemble(s)
AnErrEn(s): Analog error ensemble(s)
AnErrEn1: Analog error ensemble variant 1; see Eq. (3)
AnErrEn2: Analog error ensemble variant 2; see Eq. (4)
AnErrAdd: AnErrEn member calculated from the additive bias; see Eq. (1)
AnErrMult: AnErrEn member calculated from the multiplicative bias; see Eq. (2)
AnFcst(s): Analog forecast(s)
AnObs: Analog observation(s)
AnObsEn: Analog observation ensemble
BC: British Columbia
CRPS: Continuous ranked probability score
CRPSS: Continuous ranked probability skill score
DMB: Degree of mass balance
ECCC: Environment and Climate Change Canada
ETS: Equitable threat score
FAR: False alarm ratio
Fcst: Forecast
GF: Grell–Freitas cumulus convection scheme (Grell and Freitas 2014)
KF: Kain–Fritsch cumulus convection scheme (Kain 2004)
MAE(s): Mean absolute error(s)
MM-AnEn(s): Multimodel analog ensemble(s)
MSD: Mean square difference
NaN: Not a number
NoahMP: Multiphysics Noah land surface model (Niu et al. 2011; Yang et al. 2011)
NWP: Numerical weather prediction
Obs: Observation(s)
PaFcst(s): Past forecast(s)
PaObs: Past observation(s)
PBL: Planetary boundary layer
POD: Probability of detection
POFD: Probability of false detection
SD(s): Standard deviation(s)
SLT(s): Supplemental lead time(s)
TaFcst(s): Target forecast(s)
Thom: 1.5-moment 6-class Thompson microphysics scheme (Thompson et al. 2008)
TTS: Temporal trend similarity
twCRPS: Threshold-weighted continuous ranked probability score
VerifObs: Verifying observation(s)
WRF: Weather Research and Forecasting Model
WSM5: WRF single-moment 5-class microphysics scheme (Hong et al. 2004)
YSU: Yonsei University PBL scheme (Hong et al. 2006)
75p: 75th percentile
80p: 80th percentile
90p: 90th percentile
95p: 95th percentile

REFERENCES

Alessandrini, S., 2022: Predicting rare events of solar power production with the analog ensemble. Sol. Energy, 231, 72–77, https://doi.org/10.1016/j.solener.2021.11.033.

Alessandrini, S., L. Delle Monache, S. Sperati, and J. N. Nissen, 2015: A novel application of an analog ensemble for short-term wind power forecasting. Renewable Energy, 76, 768–781, https://doi.org/10.1016/j.renene.2014.11.061.

Alessandrini, S., S. Sperati, and L. Delle Monache, 2019: Improving the analog ensemble wind speed forecasts for rare events. Mon. Wea. Rev., 147, 2677–2692, https://doi.org/10.1175/MWR-D-19-0006.1.

Chapman, W. E., L. Delle Monache, S. Alessandrini, A. C. Subramanian, F. M. Ralph, S.-P. Xie, S. Lerch, and N. Hayatbini, 2022: Probabilistic predictions from deterministic atmospheric river forecasts with deep learning. Mon. Wea. Rev., 150, 215–234, https://doi.org/10.1175/MWR-D-21-0106.1.

Delle Monache, L., T. Nipen, Y. Liu, G. Roux, and R. Stull, 2011: Kalman filter and analog schemes to postprocess numerical weather predictions. Mon. Wea. Rev., 139, 3554–3570, https://doi.org/10.1175/2011MWR3653.1.

Delle Monache, L., F. A. Eckel, D. L. Rife, B. Nagarajan, and K. Searight, 2013: Probabilistic weather prediction with an analog ensemble. Mon. Wea. Rev., 141, 3498–3516, https://doi.org/10.1175/MWR-D-12-00281.1.

Delle Monache, L., S. Alessandrini, I. Djalalova, J. Wilczak, J. C. Knievel, and R. Kumar, 2020: Improving air quality predictions over the United States with an analog ensemble. Wea. Forecasting, 35, 2145–2162, https://doi.org/10.1175/WAF-D-19-0148.1.

Eckel, F. A., and L. Delle Monache, 2016: A hybrid NWP–analog ensemble. Mon. Wea. Rev., 144, 897–911, https://doi.org/10.1175/MWR-D-15-0096.1.

Grell, G. A., and S. R. Freitas, 2014: A scale and aerosol aware stochastic convective parameterization for weather and air quality modeling. Atmos. Chem. Phys., 14, 5233–5250, https://doi.org/10.5194/acp-14-5233-2014.

Hamill, T. M., and J. S. Whitaker, 2006: Probabilistic quantitative precipitation forecasts based on reforecast analogs: Theory and application. Mon. Wea. Rev., 134, 3209–3229, https://doi.org/10.1175/MWR3237.1.

Hamill, T. M., J. S. Whitaker, and S. L. Mullen, 2006: Reforecasts: An important dataset for improving weather predictions. Bull. Amer. Meteor. Soc., 87, 33–46, https://doi.org/10.1175/BAMS-87-1-33.

Hamill, T. M., M. Scheuerer, and G. T. Bates, 2015: Analog probabilistic precipitation forecasts using GEFS reforecasts and climatology-calibrated precipitation analyses. Mon. Wea. Rev., 143, 3300–3309, https://doi.org/10.1175/MWR-D-15-0004.1.

Hong, S.-Y., J. Dudhia, and S.-H. Chen, 2004: A revised approach to ice microphysical processes for the bulk parameterization of clouds and precipitation. Mon. Wea. Rev., 132, 103–120, https://doi.org/10.1175/1520-0493(2004)132<0103:ARATIM>2.0.CO;2.

Hong, S.-Y., Y. Noh, and J. Dudhia, 2006: A new vertical diffusion package with an explicit treatment of entrainment processes. Mon. Wea. Rev., 134, 2318–2341, https://doi.org/10.1175/MWR3199.1.

Jeworrek, J., G. West, and R. Stull, 2019: Evaluation of cumulus and microphysics parameterizations in WRF across the convective gray zone. Wea. Forecasting, 34, 1097–1115, https://doi.org/10.1175/WAF-D-18-0178.1.

Jeworrek, J., G. West, and R. Stull, 2021: WRF precipitation performance and predictability for systematically varied parameterizations over complex terrain. Wea. Forecasting, 36, 893–913, https://doi.org/10.1175/WAF-D-20-0195.1.

Jeworrek, J., G. West, and R. Stull, 2022: Optimizing analog ensembles for sub-daily precipitation forecasts. Atmosphere, 13, 1662, https://doi.org/10.3390/atmos13101662.

Kain, J. S., 2004: The Kain–Fritsch convective parameterization: An update. J. Appl. Meteor., 43, 170–181, https://doi.org/10.1175/1520-0450(2004)043<0170:TKCPAU>2.0.CO;2.

Marty, R., I. Zin, C. Obled, G. Bontron, and A. Djerboua, 2012: Toward real-time daily PQPF by an analog sorting approach: Application to flash-flood catchments. J. Appl. Meteor. Climatol., 51, 505–520, https://doi.org/10.1175/JAMC-D-11-011.1.

McCollor, D., and R. Stull, 2008: Hydrometeorological accuracy enhancement via postprocessing of numerical weather forecasts in complex terrain. Wea. Forecasting, 23, 131–144, https://doi.org/10.1175/2007WAF2006107.1.

Mugume, I., and Coauthors, 2018: Improving quantitative rainfall prediction using ensemble analogues in the tropics: Case study of Uganda. Atmosphere, 9, 328, https://doi.org/10.3390/atmos9090328.

Nagarajan, B., L. Delle Monache, J. P. Hacker, D. L. Rife, K. Searight, J. C. Knievel, and T. N. Nipen, 2015: An evaluation of analog-based postprocessing methods across several variables and forecast models. Wea. Forecasting, 30, 1623–1643, https://doi.org/10.1175/WAF-D-14-00081.1.

Niu, G.-Y., and Coauthors, 2011: The community Noah land surface model with multiparameterization options (Noah-MP): 1. Model description and evaluation with local-scale measurements. J. Geophys. Res., 116, D12109, https://doi.org/10.1029/2010JD015139.

Odak Plenković, I., I. Schicker, M. Dabernig, K. Horvath, and E. Keresturi, 2020: Analog-based post-processing of the ALADIN-LAEF ensemble predictions in complex terrain. Quart. J. Roy. Meteor. Soc., 146, 1842–1860, https://doi.org/10.1002/qj.3769.

Pleim, J. E., 2007: A combined local and nonlocal closure model for the atmospheric boundary layer. Part I: Model description and testing. J. Appl. Meteor. Climatol., 46, 1383–1395, https://doi.org/10.1175/JAM2539.1.

Shapiro, S. S., and M. B. Wilk, 1965: An analysis of variance test for normality (complete samples). Biometrika, 52, 591–611, https://doi.org/10.2307/2333709.

Skamarock, W., and Coauthors, 2008: A description of the Advanced Research WRF version 3. NCAR Tech. Note NCAR/TN-475+STR, 113 pp., https://doi.org/10.5065/D68S4MVH.

Thompson, G., P. R. Field, R. M. Rasmussen, and W. D. Hall, 2008: Explicit forecasts of winter precipitation using an improved bulk microphysics scheme. Part II: Implementation of a new snow parameterization. Mon. Wea. Rev., 136, 5095–5115, https://doi.org/10.1175/2008MWR2387.1.

Wilcoxon, F., 1945: Individual comparisons by ranking methods. Biom. Bull., 1, 80–83, https://doi.org/10.2307/3001968.

Yang, Z.-L., and Coauthors, 2011: The community Noah land surface model with multiparameterization options (Noah-MP): 2. Evaluation over global river basins. J. Geophys. Res., 116, D12110, https://doi.org/10.1029/2010JD015140.
Fig. 1. Illustration of the methodology used for a 4-member analog observation ensemble (AnObsEn; adapted from J22).

Fig. 2. Mean absolute error (MAE), continuous ranked probability score (CRPS), and 95th-percentile Brier score (95p BS) as a function of ensemble size, averaged over lead times and station locations. The optimal ensemble size is marked and annotated at the minimum (best) values. The solid line is the average, and the shading is the spread across model configurations.

Fig. 3. Binned spread–skill diagram for the 50-member AnEn variants, calculated over all stations and forecast lead times. Each dot represents the center of one of 50 equally populated ensemble-spread intervals (>200 000 samples in each bin), and each line connects dots that result from the same WRF configuration.

Fig. 4. Categorical metrics as a function of observed percentile threshold. Values are calculated across all 50 ensemble members, 46 stations, and day-1 forecast lead times. The solid line is the average, and the shading is the spread across model configurations. The black arrows on the y axes point toward better values for each metric.

Fig. 5. The 50-member AnEn median bias as a function of WRF precipitation rate. Circles are located at the center of 50 equally populated bins (with >4000 samples each) and represent average values across 46 stations and 3 forecast days. Values closer to zero bias are better. As in previous figures, the solid line is the average, and the shading is the spread across model configurations. The dotted vertical line marks the threshold for measurable precipitation (0.25 mm).

Fig. 6. Reliability diagrams for 75p, 90p, and 95p thresholds with corresponding sharpness diagrams (insets in the bottom-right corners, showing the relative frequencies of forecasts). As in previous figures, the solid line is the average, and the shading is the spread across model configurations.

Fig. 7. Categorical metrics for (left) summer and (right) winter measurable precipitation [>0.25 mm (3 h)−1] as a function of forecast lead time. As in Fig. 4, the solid line is the average, the shading is the spread across model configurations, and the black arrows on the y axes point toward better values for each metric. Here, however, values are calculated for the 50-member AnEn median for a fair comparison with the raw WRF forecasts. The vertical grid lines mark lead times that correspond to 0000 local standard time.

Fig. 8. (top) Reliability diagrams with corresponding sharpness diagrams (>0.25 mm) and (bottom) rank histograms for (left) summer and (right) winter precipitation.

Fig. 9. The 54-member MM-AnEn reliability diagrams with insets of corresponding sharpness diagrams for different precipitation intensities.

Fig. 10. The 54-member MM-AnEn reliability diagrams and corresponding sharpness diagrams (insets) for measurable precipitation, using the 18 best analogs from (left) each grid spacing for each physics configuration and (right) each physics configuration for each grid spacing.

Fig. 11. Reliability diagrams with insets of corresponding sharpness diagrams for different precipitation intensities during the testing period. The ensemble size varies between forecasts and is given in parentheses in the legend.

Fig. 12. Mean error of raw model forecasts, DMB-corrected model forecasts, and AnEn medians during the testing period, aggregated over 46 stations and all lead times in the 3-day forecast horizon. The left panel is analogous to Fig. 5, but with different line styles for different grid spacings. The bottom-right panel shows the total average of mean biases with the range over configurations.

Fig. 13. CRPSS of the nine-configuration ensemble with DMB-corrected members and AnEn variants, using the nine-configuration raw model ensemble as reference, during the testing period aggregated over stations and lead times. The black arrow on the x axis points toward better values.
