Evaluating GFS and ECMWF Ensemble Forecasts of Integrated Water Vapor Transport along the U.S. West Coast

Briana E. Stewart, Meteorology Program, Plymouth State University, Plymouth, New Hampshire
Jason M. Cordeira, Meteorology Program, Plymouth State University, Plymouth, New Hampshire
F. Martin Ralph, Center for Western Weather and Water Extremes, Scripps Institution of Oceanography, University of California, San Diego, San Diego, California

Abstract

Atmospheric rivers (ARs) are long and narrow regions in the atmosphere of enhanced integrated water vapor transport (IVT) and can produce extreme precipitation and high societal impacts. Reliable and skillful forecasts of landfalling ARs in the western United States are critical to hazard preparation and aid in decision support activities, such as Forecast-Informed Reservoir Operations (FIRO). The purpose of this study is to compare the cool-season water year skill of the NCEP Global Ensemble Forecast System (GEFS) and ECMWF Ensemble Prediction System (EPS) forecasts of IVT along the U.S. West Coast for 2017–20. The skill is analyzed using probability-over-threshold forecasts of IVT magnitudes ≥ 250 kg m−1 s−1 (P250) using contingency table skill metrics in coastal Northern California and along the west coast of North America. Analysis of P250 with lead time (dProg/dt) found that the EPS provided ∼1 day of additional lead time for situational awareness over the GEFS at lead times of 6–10 days. Forecast skill analysis highlights that the EPS leads over the GEFS with success ratios 0.10–0.15 higher at lead times > 6 days for P250 thresholds of ≥25% and ≥50%, while event-based skill analysis using the probability of detection (POD) found that both models were largely similar with minor latitudinal variations favoring higher POD for each model in different locations along the coast. The relative skill of the EPS over the GEFS is largely attributed to overforecasting by the GEFS at longer lead times and an increase in the false alarm ratio.

Significance Statement

The purpose of this study is to evaluate the efficacy of the NCEP Global Ensemble Forecast System (GEFS) and the ECMWF Ensemble Prediction System (EPS) in forecasting enhanced water vapor transport along the U.S. West Coast commonly associated with landfalling atmospheric rivers and heavy precipitation. The ensemble models allow us to calculate the probability that enhanced water vapor transport will occur, thereby providing situational awareness for decision-making, such as in hazard mitigation and water resource management. The results of this study indicate that the EPS model is on average more skillful than the GEFS model at lead times of ∼6–10 days with a higher success ratio and lower false alarm ratio.

Cordeira’s current affiliation: Center for Western Weather and Water Extremes, Scripps Institution of Oceanography, University of California, San Diego, San Diego, California.

© 2022 American Meteorological Society. For information regarding reuse of this content and general copyright information, consult the AMS Copyright Policy (www.ametsoc.org/PUBSReuseLicenses).

Corresponding author: Jason M. Cordeira, jcordeira@ucsd.edu


1. Introduction

Atmospheric rivers (ARs) are long and narrow regions of enhanced integrated water vapor transport (IVT) that can influence the occurrence of precipitation-related high-impact weather events along the U.S. West Coast, such as floods and flash floods (e.g., Young et al. 2017). Landfalling ARs may also influence the occurrence of extreme wind events (Waliser and Guan 2017) and increase the likelihood of avalanches and avalanche fatalities (Hatchett et al. 2017) and shallow landslides (Oakley et al. 2018). The potential for hazardous weather associated with landfalling ARs can be summarized by 1) National Weather Service–issued watches, warnings, and advisories (WWAs), where 60%–90% of flood-related WWAs in the western United States occur on days with cool-season landfalling ARs (Bartlett and Cordeira 2021), and 2) damage claims in the National Flood Insurance Program, where ARs have caused an average of $1.1 billion (U.S. dollars) in flood damages annually across the western United States (Corringham et al. 2019). Owing to the causal relationship between landfalling ARs and the potential for hazardous weather across the western United States, reliable and skillful forecasts of landfalling ARs are critical to hazard preparation, risk mitigation, and water resources management (e.g., DeFlorio et al. 2018; Ralph et al. 2019; Cordeira and Ralph 2021). The goal of this study is to compare the cool-season skill of the European Centre for Medium-Range Weather Forecasts (ECMWF) Ensemble Prediction System (EPS) and the National Centers for Environmental Prediction (NCEP) Global Ensemble Forecast System (GEFS) model forecasts of the enhanced IVT often observed during landfalling ARs along the West Coast of North America for water years (WY; 1 October–30 September) 2017–20. This study complements a previous study by Cordeira and Ralph (2021) that investigated only GEFS model forecasts.
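The IVT central to this study is the mass-weighted vertical integral of the horizontal water vapor flux. A minimal sketch of that calculation follows (our own illustration under stated assumptions; the function name and profile inputs are hypothetical and not drawn from any operational system):

```python
import numpy as np

G = 9.81  # gravitational acceleration (m s^-2)

def ivt_magnitude(q, u, v, p):
    """IVT magnitude (kg m^-1 s^-1) from profiles of specific humidity
    q (kg kg^-1), winds u and v (m s^-1), and pressure p (Pa), with all
    arrays ordered from the surface upward (p decreasing)."""
    def vert_int(f):
        # Trapezoidal mass-weighted integral; -dp > 0 because p decreases upward
        return np.sum(0.5 * (f[1:] + f[:-1]) * -np.diff(p)) / G
    return np.hypot(vert_int(q * u), vert_int(q * v))
```

For example, a uniformly moist profile (q = 0.01 kg kg−1) with 20 m s−1 zonal flow over a 700-hPa-deep layer yields an IVT magnitude of roughly 1430 kg m−1 s−1, well above the 250 kg m−1 s−1 threshold used throughout this study.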

As discussed in Cordeira and Ralph (2021), studies of midlatitude cyclones over the northeast Pacific illustrate that the skill of numerical weather prediction model forecasts can vary widely. For example, parameters related to storm location and intensity can be either highly accurate or contain errors on the scale of hundreds of kilometers and tens of hectopascals for a range of lead times (McMurdie and Mass 2004). Ensemble forecasts of these cyclones and related parameters have proven beneficial in increasing forecast skill relative to deterministic forecasts, with ECMWF EPS forecasts of different parameters, including precipitation, typically containing higher forecast skill than forecasts from other modeling centers (Atger 2001; Hamill et al. 2008; Froude 2010; Su et al. 2014). In Atger (2001), the difference in member populations between the EPS and GEFS models (N = 50 versus N = 20, respectively, at that time) was evaluated by constructing a smaller ensemble version of the EPS; the skill remained similar to that of the fully populated version, suggesting that model resolution or other characteristics of the ensemble (e.g., the method of generating perturbations) likely drive differences in skill between the two models. These differences in skill may also be reduced through statistical postprocessing methods that correct for systematic biases and other inconsistencies among forecasts for a given location and lead time, but such methods require retrospective analyses of skill, long periods of study, and large numerical datasets (e.g., Hagedorn et al. 2008; Hamill et al. 2017).

For AR-related forecasts, some parameters are also forecast with higher predictability than others. For example, Lavers et al. (2016) demonstrated that forecasts of IVT across the western United States have higher “potential predictability” than parameters related to precipitation or river discharge at lead times of ∼4–9 days. The Atmospheric River Retrospective Forecasting Experiment (WPC 2012) came to a similar conclusion along the U.S. West Coast: model forecasts of moisture parameters, such as IVT, may help identify extreme events even when the model precipitation forecasts do not contain large precipitation amounts (i.e., they may improve situational awareness). IVT has therefore been investigated in numerous predictability and forecasting studies related to AR frequency, intensity, duration, and landfalling location (e.g., Cordeira et al. 2017).

In an effort to summarize forecasts of AR frequency, intensity, duration, and landfalling location, an “AR Landfall Tool” was created and used as an aid in providing situational awareness of landfalling ARs using ensemble numerical weather prediction data (Cordeira et al. 2017; Cordeira and Ralph 2021). The AR Landfall Tool primarily depicts ensemble IVT data as a probability-over-threshold for different IVT magnitudes along a forecast transect on the west coast of North America (Fig. 1). The most commonly used threshold in AR-related forecasts along the U.S. West Coast is IVT magnitudes ≥ 250 kg m−1 s−1 (Cordeira et al. 2017), and the ensemble probability of IVT magnitudes ≥ 250 kg m−1 s−1 is referred to as P250 by Cordeira and Ralph (2021). The study by Cordeira and Ralph (2021) found that the 20-member GEFS P250 forecasts near coastal Northern California at 38°N, 123°W for WY2017–2020 were reliable and successful at lead times up to ∼8–9 days, with an average success ratio > 0.50 for P250 forecasts ≥ 50% at lead times of 8 days and Brier skill scores > 0.10 at a lead time of 8–9 days. The highest success ratios and probability of detection values for P250 forecasts ≥ 50% occurred on average along the Northern California and Oregon coastlines, and the lowest occurred on average along the Southern California coastline. The average probability of detection of more intense and longer duration landfalling ARs was also 0.10–0.20 higher than that of weaker and shorter duration events at lead times of 3–9 days. Cordeira and Ralph (2021) also identified that the potential for AR Landfall Tool forecasts to enhance situational awareness may be improved for individual applications by allowing for flexibility in the location and time of verification; the success ratios increased 10%–30% at lead times of 5–10 days when flexibility of ±1.0° latitude and ±6 h was allowed in verification.

Fig. 1.

Forecasts of the coastal ensemble probability of IVT magnitude ≥ 250 kg m−1 s−1 (shaded according to scale) initialized daily at 0000 UTC 7–11 Feb 2019 for the (a),(c),(e),(g),(i) GEFS model out to 16 days and (b),(d),(f),(h),(j) EPS model out to 15 days. Coastal latitudes are shown in the rightmost panels of each image and topography is shaded every 100 m using a blue–green–brown–white color scale. The gray, black, and red bars in these panels represent the number of hours that probability values exceed 75%, 90%, and 99% and are not used in this study.

Citation: Weather and Forecasting 37, 11; 10.1175/WAF-D-21-0114.1

The results of the GEFS study by Cordeira and Ralph (2021) are comparable to results from reforecast and global hindcast data from different numerical weather prediction modeling centers that demonstrate that the accuracy of forecasting the occurrence of ARs decreases as lead time increases and approaches climatology or random chance at lead times of 10–14 days (Nardi et al. 2018; see their Fig. 5). Of the different numerical weather prediction modeling centers studied by Nardi et al. (2018), the ECMWF model forecasts contained lower bias in the overall frequency of AR occurrence (i.e., enhanced IVT magnitudes) over the Northeast Pacific and on average predicted fewer AR occurrences than the GEFS and reanalysis datasets. Comparing the EPS and GEFS directly, DeFlorio et al. (2019) found that the EPS is notably more accurate than the GEFS at predicting either no AR activity or high AR activity relative to moderate AR activity across most lead times over the Northeast Pacific. Combined, these studies suggest that the EPS may provide enhanced situational awareness of AR occurrence relative to the GEFS in association with a higher success ratio (and a lower false alarm ratio) for forecasts of both AR occurrence (i.e., IVT magnitudes ≥ 250 kg m−1 s−1) and AR nonoccurrence (i.e., IVT magnitudes < 250 kg m−1 s−1). The present study tests this hypothesis by directly comparing the skill of EPS and GEFS P250 forecasts along the west coast of North America within the framework of the AR Landfall Tool.

Additional motivation for the present study is provided by the differences between EPS and GEFS P250 forecasts shown in Fig. 1 prior to a high-impact landfalling AR along the California coast on 13–15 February 2019 (Hatchett et al. 2020; Hecht et al. 2022). Comparison of AR Landfall Tool forecasts from both models initialized at 0000 UTC 7 February 2019 illustrates a high-confidence EPS forecast, with P250 ≥ 50% along the central California coast at lead times of ∼6.5–9 days verifying during the 13–15 February 2019 period, that is not present in the GEFS forecast (cf. Figs. 1a,b). Forecasts initialized at 0000 UTC 8 February 2019 subsequently illustrate 1) a poleward expansion of P250 ≥ 50% in GEFS forecasts along the California coast at lead times of ∼6–7 days, suggesting an increase in confidence over time, and 2) a lead-time contraction or filamentation of P250 ≥ 50% in EPS forecasts along the California coast at lead times of ∼5–6 days (cf. Figs. 1c,d), suggesting a change in confidence over time and forecast jumpiness (Zsoter et al. 2009). It is important to recognize that the change in confidence over time may be interpreted as a refinement of the previous forecast to emphasize a shorter-duration event or an increase in confidence related to timing. Forecasts initialized on 9, 10, and 11 February 2019 continue to illustrate that a combination of the “poleward expansion of P250 ≥ 50%” by GEFS forecasts and the “lead-time contraction and filamentation” by EPS forecasts became the final forecast solution in the days prior to the 13–15 February 2019 event (Figs. 1e–j). The differences between the GEFS and EPS models and the forecast changes over time (i.e., “dProg/dt”) are further addressed in this study.

In addition to studying forecast skill for the west coast of North America, the present study will focus on a model grid point located at 38°N, 123°W for consistency with Cordeira and Ralph (2021). The analysis for this location in coastal Northern California is intended to highlight forecasts where ongoing studies seek to better quantify the skill of AR-related forecasts to provide situational awareness in support of Forecast-Informed Reservoir Operations (FIRO; Jasperse et al. 2020). FIRO leverages the skill of modern numerical weather prediction models and hydrologic forecasting techniques to inform water resources management by maximizing water supply while minimizing flood risk within a catchment area, its reservoir, and regions downstream. Model results (Delaney et al. 2020) and FIRO in practice at Lake Mendocino within the Russian River watershed, which outlets to the Pacific Ocean near 38°N, 123°W and where landfalling ARs drive an overwhelming majority of annual precipitation and almost all precipitation extremes and floods (Ralph et al. 2013), indicate that FIRO can increase median storage by >30% over conventional reservoir operations while maintaining water supply, mitigating flood risk, and supporting healthy ecosystems.

2. Data and methods

This study used 0.5° latitude × 0.5° longitude gridded forecast data along the west coast of North America (25°–55°N; see locations in Fig. 1) from the GEFS and the EPS models verifying during October–April of water years (WYs) 2017, 2018, 2019, and 2020. Data were primarily collected from The International Grand Global Ensemble (TIGGE) database at ECMWF, with additional details regarding collection of forecast data provided within Cordeira and Ralph (2021). The forecast data include the fraction of the 20 (GEFS) and 50 (EPS) ensemble members that contained forecast IVT magnitudes ≥ 250 kg m−1 s−1 (i.e., the probability of IVT magnitudes ≥ 250 kg m−1 s−1; P250), and these values are compared to the IVT magnitude at verification. Verification in this study followed the methodology of Cordeira and Ralph (2021) and is defined as the 0-h ensemble-mean IVT magnitude forecast from each ensemble. In this way, the model forecasts are scored against themselves in lieu of comparing each model forecast to an independent reanalysis (e.g., the EPS forecasts are verified against the EPS 0-h ensemble mean). The difference and absolute difference between the 0-h ensemble-mean IVT magnitudes in the EPS and GEFS models were on average 3.0 and 16.3 kg m−1 s−1, respectively, across the study period at 38°N, 123°W. See section 3 for additional context on differences in numbers of verifying events between the two models. Note that forecast skill in this study is analyzed every 12 h, as opposed to every 6 h as in Cordeira and Ralph (2021), given the 12-h availability of the 0-h EPS ensemble-mean IVT magnitude forecasts.
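The probability-over-threshold quantity defined above reduces to a simple fraction of ensemble members. A minimal sketch (the function name is ours, and the member values in the usage note are hypothetical):

```python
import numpy as np

def p250(member_ivt, threshold=250.0):
    """Percentage of ensemble members whose forecast IVT magnitude meets or
    exceeds the threshold (kg m^-1 s^-1), i.e., the probability-over-threshold."""
    return 100.0 * np.mean(np.asarray(member_ivt, dtype=float) >= threshold)
```

For a hypothetical 20-member GEFS forecast in which 5 members reach 250 kg m−1 s−1, `p250` returns 25.0, i.e., a P250 of 25%.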

There were four upgrades to the EPS and zero upgrades to the GEFS model system during the period of study (Table 1). The subsequent comparisons of the two model systems should therefore be read with the caveat that while overall model skill remained constant for the GEFS, it may have improved for the EPS. In other words, variability in skill in the EPS may not be attributed to meteorology alone, and this study therefore assesses the “operational” skill of the models. In contrast to the study by Cordeira and Ralph (2021), which assessed GEFS year-to-year variability in forecast skill using forecasts verifying every 6 h, the EPS model upgrades and the reduced number of verification times under 12-h verification make it challenging to assess skill on an annual basis in this study. Note that the GEFS model system was upgraded following the period of study; this upgrade is discussed again in section 4.

Table 1

EPS and GEFS model versions used in this study.


As in Cordeira and Ralph (2021), forecast skill is assessed using a four-outcome contingency table of whether P250 forecasts exceeded a percentage threshold (e.g., ≥50%) and whether a verification time contained IVT magnitudes ≥ 250 kg m−1 s−1. The success ratio, probability of detection (POD), and equitable threat score (ETS) are calculated from the contingency table at different lead times and are complemented by reliability diagrams and the Brier skill score (BSS). Note that the success ratio identifies what fraction of P250 forecasts verify with IVT magnitudes ≥ 250 kg m−1 s−1; the POD identifies what fraction of verification times with IVT magnitudes ≥ 250 kg m−1 s−1 were correctly forecast; the ETS measures what fraction of observed and/or forecast events were correctly predicted while accounting for hits associated with random chance; and the BSS assesses the skill of P250 forecasts relative to a reference climatology, including both forecasts of events (i.e., large P250 values prior to times with IVT magnitudes ≥ 250 kg m−1 s−1) and nonevents (i.e., small P250 values prior to times with IVT magnitudes < 250 kg m−1 s−1).
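Under their standard definitions, which we assume correspond to those used here, these metrics follow directly from the 2 × 2 contingency counts and squared probability errors; a minimal sketch (function names are ours):

```python
import numpy as np

def contingency_scores(forecast_yes, observed_yes):
    """Success ratio, POD, and ETS from paired yes/no boolean series."""
    f = np.asarray(forecast_yes, dtype=bool)
    o = np.asarray(observed_yes, dtype=bool)
    hits = np.sum(f & o)
    false_alarms = np.sum(f & ~o)
    misses = np.sum(~f & o)
    n = f.size
    sr = hits / (hits + false_alarms) if hits + false_alarms else np.nan
    pod = hits / (hits + misses) if hits + misses else np.nan
    hits_random = (hits + misses) * (hits + false_alarms) / n  # chance hits
    ets = (hits - hits_random) / (hits + misses + false_alarms - hits_random)
    return sr, pod, ets

def brier_skill_score(prob, observed_yes):
    """BSS of probability forecasts relative to the sample climatology."""
    p = np.asarray(prob, dtype=float)          # forecast probabilities in [0, 1]
    o = np.asarray(observed_yes, dtype=float)  # 1 if IVT >= 250, else 0
    bs = np.mean((p - o) ** 2)                 # Brier score of the forecasts
    clim = np.mean(o)                          # climatological event frequency
    bs_ref = np.mean((clim - o) ** 2)          # Brier score of climatology
    return 1.0 - bs / bs_ref
```

The success ratio is the complement of the false alarm ratio (SR = 1 − FAR), which is why an increase in false alarms at long lead times appears directly as a lower success ratio in the results that follow.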

The success ratio, POD, and ETS are all calculated using P250 thresholds of ≥25%, ≥50%, and ≥75% and illustrated for a point at 38°N, 123°W. These metrics are subsequently shown along the entire west coast of North America using a threshold of P250 ≥ 50%, and are only calculated if there are 10 or more forecasts/verification times to evaluate. The BSS is not calculated from a threshold, but is also shown for a point at 38°N, 123°W and summarized along the west coast of North America. Statistical significance for the differences between the EPS and GEFS forecasts is visualized using random sampling and non-overlap of the 95th percentile confidence levels. At each lead time, the success ratio and POD for each model are calculated for a random sample of 25 forecasts or 25 event times, respectively, iterated 1000 times with replacement to generate success ratio and POD distributions from which to identify confidence levels. The same process is followed for calculating statistical significance for the BSS and ETS with random samples of 50 forecasts to account for the larger populations of data used in these metrics.
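The resampling procedure can be sketched as follows, shown here for a generic per-forecast binary outcome whose sample mean is the metric (e.g., hit = 1, false alarm = 0 for the success ratio); the function name and fixed seed are our own additions:

```python
import numpy as np

def bootstrap_ci(outcomes, sample_size=25, n_iter=1000, pctl=95, seed=0):
    """Two-sided confidence bounds on a metric's mean from repeated random
    samples drawn with replacement (25 forecasts x 1000 iterations here)."""
    rng = np.random.default_rng(seed)
    vals = np.asarray(outcomes, dtype=float)
    means = np.array([rng.choice(vals, size=sample_size, replace=True).mean()
                      for _ in range(n_iter)])
    tail = (100 - pctl) / 2
    return np.percentile(means, tail), np.percentile(means, 100 - tail)
```

Non-overlap of the two models' intervals at a given lead time then marks a significant difference, as in the figures that use these confidence levels.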

The difference in the average P250 values at lead times of 6–8 days (EPS minus GEFS) was also calculated in order to assess intermodel variability, with the 95th percentile chosen to mark a significant difference between the two models. Intermodel variability is also explored for P250 forecasts prior to events with different characteristics related to 1) verifying times with IVT magnitudes of 250–500 kg m−1 s−1 or ≥500 kg m−1 s−1 and 2) verifying times associated with individual high-impact landfalling ARs in Northern California. The latter were chosen based on extrema in daily precipitation data from the California Department of Water Resources Northern Sierra Eight Station Index [CDEC (California Data Exchange Center) 2021] during WY17–20. Both the time of maximum IVT magnitude at 38°N, 123°W prior to the 10 highest daily precipitation totals in the Northern Sierra Eight Station Index and the first time (i.e., onset time) of IVT magnitudes increasing above 250 kg m−1 s−1 were used as the verification times to analyze forecast skill of P250 prior to individual high-impact landfalling ARs (Table 2). Eight of the 10 verification times were the same in the 12-h GEFS and EPS data, and two of the 10 times differed by ±12 h owing to intermodel differences in the 0-h ensemble-mean forecast data.

Table 2

The 10 highest daily precipitation totals in the Northern Sierra Eight Station Index in California during the study period with associated times of maximum IVT magnitude, maximum IVT magnitude, onset time of IVT magnitudes increasing to ≥250 kg m−1 s−1, and precipitation amount. The cases are sorted by date. Date and times marked with an asterisk indicate that the EPS time of maximum IVT magnitude occurred 12 h after the time listed. IVT magnitudes are the 0-h ensemble-mean forecasts from each respective model.


Throughout this study IVT magnitudes ≥ 250 kg m−1 s−1 are used to verify forecasts associated with landfalling ARs. While not all occurrences of IVT magnitudes ≥ 250 kg m−1 s−1 are associated with landfalling ARs, this threshold is a common ingredient used to describe landfalling ARs on the U.S. West Coast sometimes simply referred to as “AR conditions” (Cordeira et al. 2017; Ralph et al. 2019; Cordeira and Ralph 2021). Requiring a geometric or object-based definition of a landfalling AR coincident with enhanced IVT magnitudes (e.g., Shields et al. 2018) would ultimately reduce the number of verifying events while simultaneously likely removing many weaker or shorter duration periods with IVT magnitudes ≥ 250 kg m−1 s−1, potentially increasing the skill of the EPS and GEFS forecasts based on the results of Cordeira and Ralph (2021) and those results that will be shown herein. This study includes all verification times with IVT magnitudes ≥ 250 kg m−1 s−1 given that most forecast tools used in the prediction of landfalling ARs (e.g., Cordeira et al. 2017 and Fig. 1) currently do not invoke a geometric or object-based definition.

3. Results

a. Forecast characteristics and reliability

For all the cool-season WYs examined in this study, 273 of 1698 (∼16.1%) and 266 of 1698 (∼15.7%) 12-h times (i.e., initializations) verified with IVT magnitudes ≥ 250 kg m−1 s−1 at 38°N, 123°W for the EPS and GEFS, respectively (Table 3 and Fig. 2). The difference in the number of verification times indicates that each model is verified against a slightly different but overwhelmingly similar set of events. The two models verified 245 events that were the same (>92%); the remaining differences were largely associated with marginal events either containing IVT magnitudes < 300 kg m−1 s−1 or with slight differences in timing (e.g., the EPS had IVT magnitudes drop below the 250 kg m−1 s−1 threshold just prior to the verification time while the GEFS was still above the threshold; not shown). WY17 contained the majority of the verification times (>100 in each model), while WY18, WY19, and WY20 contained ∼25%–50% fewer verification times meeting the minimum IVT magnitude criteria in this study. For reference, the frequency of P250 forecast values ≥ 50% and ≥75% in each model as a function of lead time is approximately the same as the frequency of event times with IVT magnitudes ≥ 250 kg m−1 s−1 at short lead times and decreases as lead time increases (Fig. 2). The lower frequency of P250 forecast values ≥ 50% and ≥75% at longer lead times suggests some underforecasting that would be consistent with a lower POD at longer lead times. Alternatively, the frequency of P250 forecast values ≥ 25% is higher than the frequency of event times with IVT magnitudes ≥ 250 kg m−1 s−1 at almost all lead times, with the GEFS forecasts containing a noticeably higher frequency than the EPS forecasts, especially at lead times > 10 days. The high frequency of P250 forecast values ≥ 25% at longer lead times suggests some overforecasting that would be consistent with a lower success ratio and higher false alarm ratio.

Fig. 2.

Frequency of the GEFS (red) and EPS (blue) P250 forecasts ≥ 25%, ≥50%, and ≥75% for all forecasts verifying every 12 h at 38°N, 123°W. The number of verifying times with IVT magnitudes ≥ 250 kg m−1 s−1 in each model is shown as thin black horizontal lines.


Table 3

Number of verification times and verification times with IVT magnitudes ≥ 250 kg m−1 s−1 during the study period and during each WY for the EPS and GEFS at 38°N, 123°W. The fraction is provided in parentheses.


The statistical consistency of the EPS and GEFS IVT magnitude forecasts for 1) all forecasts and 2) all forecasts specifically prior to verifying times with IVT magnitudes of 250–500 kg m−1 s−1 at 38°N, 123°W is shown via a dispersion diagram of the average ensemble spread (i.e., ensemble member standard deviation) and average root-mean-square error (RMSE) of the ensemble-mean IVT magnitude forecasts (Fig. 3). Note that these relationships are influenced by both the underlying skill of the model forecasts and characteristics of ensemble spread in model forecasts of nonnormally distributed IVT magnitude that favors nonevents over events by a factor of >4:1 (Table 3). The ensemble spread is on average not large enough to capture the average RMSE of all forecasts in both models, which implies that both the EPS and GEFS are underdispersive across all lead times (Fig. 3a). As mentioned above, the average of all forecasts likely misrepresents the statistical consistency of forecasts prior to events with IVT magnitudes ≥ 250 kg m−1 s−1 by including many low-magnitude, low-spread forecasts prior to times with IVT magnitudes < 250 kg m−1 s−1 that outnumber the former by >4:1 (Table 3). The ensemble spread of forecasts prior to verifying times with IVT magnitudes of 250–500 kg m−1 s−1 is comparable to the average RMSE for both models at lead times through ∼6 days (i.e., adequately dispersive) and becomes underdispersive thereafter as the models become less skillful and spread saturates at 100–125 kg m−1 s−1 (Fig. 3b). In both analyses, the average ensemble spread and RMSE are similar across all lead times in the GEFS and EPS, with differences less than ∼5–10 kg m−1 s−1 in spread and less than ∼15 kg m−1 s−1 in RMSE. The RMSE (spread) of the GEFS forecasts is consistently larger (smaller) than that of the EPS forecasts at almost all lead times and is largest (smallest) for forecasts verifying with IVT magnitudes of 250–500 kg m−1 s−1 at lead times of 6–10 days, suggesting that the smaller GEFS ensemble may be characteristically more underdispersive than the larger EPS ensemble. This result also suggests that the GEFS ensemble member IVT magnitude forecasts, relative to the EPS, may cluster too closely and result in overly confident P250 forecasts (e.g., P250 = 100% or P250 = 0%) for events with IVT magnitudes near 250 kg m−1 s−1.
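The two quantities in the dispersion diagram can be computed as below; a sketch with hypothetical array shapes, where statistical consistency is indicated when spread and RMSE are comparable:

```python
import numpy as np

def spread_and_rmse(member_fcsts, verif):
    """Average ensemble spread and RMSE of the ensemble-mean forecast.

    member_fcsts: array (n_times, n_members) of IVT magnitude forecasts
    verif: array (n_times,) of verifying IVT magnitudes
    """
    fc = np.asarray(member_fcsts, dtype=float)
    v = np.asarray(verif, dtype=float)
    spread = np.mean(np.std(fc, axis=1, ddof=1))         # mean member std dev
    rmse = np.sqrt(np.mean((fc.mean(axis=1) - v) ** 2))  # error of ensemble mean
    return spread, rmse
```

In a well-calibrated (statistically consistent) ensemble, the average spread approximately matches the RMSE of the ensemble mean; spread falling short of the RMSE, as in Fig. 3a, indicates underdispersion.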

Fig. 3.

The average ensemble spread and root-mean-square error (RMSE) of ensemble-mean IVT magnitude forecasts for (a) all times and (b) times verifying with IVT magnitudes of 250–500 kg m−1 s−1 for forecasts verifying every 12 h. Solid lines represent the RMSE, and dashed lines represent ensemble spread (GEFS = red, EPS = blue).


Analysis of the P250 forecast values as a function of lead time (dProg/dt) for the GEFS and EPS forecasts in each WY at 38°N, 123°W highlights both interannual, intraseasonal and intermodel variability (Fig. 4). For example, WY17 contained many forecasts with higher P250 than during other water years owing to higher AR activity (Table 3 and Cordeira and Ralph 2021) that was common to both models. The P250 differences (EPS minus GEFS) between the two model systems for all forecasts do not visually demonstrate any large systematic preferences for higher P250 values within either model at lead times > 6–10 days (Fig. 5; left column); however, note that the GEFS forecasts did produce 50%–100% more P250 forecasts with values ≥ 25% at leads times > 10 days (Fig. 2) that is not necessarily apparent in this illustration. The visualization in Fig. 5 does clearly illustrate large event-to-event variability where either the EPS or GEFS contained higher or lower P250 values prior to different verification times. These results are similar for all forecasts verifying with IVT magnitudes of 250–500 kg m−1 s−1 (Fig. 5; right column) with a visual preference for a positive difference (EPS is higher) beginning to appear (more red shade in Fig. 5; right column) prior to these verifying events. When averaged across all forecasts prior to verification times with IVT magnitudes ≥ 250 kg m−1 s−1, both the EPS and GEFS contain P250 values that increase relative to their model’s climatology (here taken as the 4-yr average of P250) at a lead time of ∼11 days and cross above the 50% P250 threshold at a lead time of ∼5–6 days (Fig. 6). For similar P250 thresholds, the EPS provides, on average, and additional one day of lead time over the GEFS at lead times of ∼6–10 days. For similar lead times, the EPS P250 values are on average ∼5% points higher than the GEFS at the same lead times. 
Note that the dProg/dt analysis illustrating higher or lower P250 values prior to verifying events does not necessarily imply higher or lower skill given that higher P250 at longer lead times could result from random chance and be associated with a large false alarm ratio.
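The probability-over-threshold quantity underlying this analysis is simply the fraction of ensemble members meeting the IVT criterion. A minimal sketch (function name is illustrative):

```python
import numpy as np

def p250(ivt_members, threshold=250.0):
    """Probability-over-threshold: the percentage of ensemble members whose
    forecast IVT magnitude (kg m-1 s-1) meets or exceeds the threshold."""
    members = np.asarray(ivt_members, dtype=float)
    return float(100.0 * np.mean(members >= threshold))
```

For example, an ensemble with member IVT magnitudes of 180, 260, 310, and 240 kg m−1 s−1 yields a P250 of 50%.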

Fig. 4.

A forecast–lead-time illustration of the (left) GEFS and (right) EPS P250 (shaded according to scale) for verification times every 12 h during (a),(b) WY17; (c),(d) WY18; (e),(f) WY19; and (g),(h) WY20 at 38°N, 123°W.


Fig. 5.

A forecast–lead-time illustration of the difference in P250 forecast values between the EPS and GEFS (EPS minus GEFS) for (left) all verification times and (right) all verification times with IVT magnitudes of 250–500 kg m−1 s−1 every 12 h during (a),(b) WY17; (c),(d) WY18; (e),(f) WY19; and (g),(h) WY20 at 38°N, 123°W.


Fig. 6.

Forecast–lead-time change in the P250 forecast values (i.e., “dProg/dt”) prior to verification times with IVT magnitudes ≥ 250 kg m−1 s−1 for the GEFS (red) and EPS (blue) for forecasts verifying every 12 h at 38°N, 123°W. The solid line represents the mean, while the dashed lines represent the 95% confidence level generated through 1000 random samples of 25-member populations of forecasts.


The short- and medium-range P250 forecasts at 38°N, 123°W are reliable, with GEFS P250 forecasts of 52%–60% at lead times of 1–3, 4–6, and 7–9 days verifying 48%, 58%, and 53% of the time (Fig. 7a) and EPS P250 forecasts of 52%–60% verifying 57%, 64%, and 61% of the time (Fig. 7b). The P250 forecasts at longer lead times are unreliable and less skillful in the GEFS, with P250 forecasts of 42%–50% at lead times of 10–12 and 13–15 days verifying 40% and 33% of the time, similar to results shown by Cordeira and Ralph (2021) (Fig. 7a). Alternatively, the EPS verifies P250 forecasts of ∼42%–50% at lead times of 10–12 and 13–15 days 48% and 51% of the time. These reliability diagrams confirm that the GEFS has a higher false alarm ratio at longer lead times, as suggested by the numerous P250 forecasts ≥ 25% at longer lead times that do not verify (Figs. 2 and 7). On average, at lead times > 10 days, GEFS P250 forecasts are not as reliable as EPS P250 forecasts.
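The reliability calculation pairs each forecast-probability bin with the observed relative frequency of events in that bin. A rough sketch of the binning described for Fig. 7 (a 0%-only bin plus 10% bins, with a minimum sample size), with illustrative names:

```python
import numpy as np

def reliability_bins(p_forecast, observed, min_count=10):
    """Observed relative frequency for probability forecasts (0%-100%)
    grouped into a 0%-only bin and then 10% bins, skipping bins with
    fewer than min_count forecasts."""
    p = np.asarray(p_forecast, dtype=float)
    o = np.asarray(observed, dtype=bool)
    bins = [(0.0, p == 0)]                       # the 0%-only bin
    for lo in range(0, 100, 10):                 # (0,10], (10,20], ..., (90,100]
        hi = lo + 10
        bins.append(((lo + hi) / 2, (p > lo) & (p <= hi)))
    out = []
    for center, mask in bins:
        n = int(mask.sum())
        if n >= min_count:
            # (bin center %, observed event frequency, sample size)
            out.append((center, float(o[mask].mean()), n))
    return out
```

Points falling on the 1:1 diagonal of a reliability diagram indicate that forecast probabilities match observed frequencies.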

Fig. 7.

Reliability diagram for the (a) GEFS and (b) EPS P250 forecasts at 38°N, 123°W for lead times of 1–15 days (colored solid lines) grouped by lead times of 3 days according to the legend. Values are only calculated if there were 10 or more forecasts to evaluate, and forecast values are grouped into a 0%-only bin and then into 10% bins thereafter. The black lines represent no resolution (solid), no skill (dot–dash), and 1:1 reliability (dashed) for the latter 10 bins.


b. Success ratio

From a forecast perspective, the success ratio for both models at 38°N, 123°W illustrates that EPS P250 forecasts ≥ 50% have on average higher skill than the GEFS across all lead times, with significantly higher skill at lead times of 6–8 days where the difference between the two models grows to ≥0.20 (Fig. 8b). Similar results are shown for threshold P250 forecasts ≥ 25% and ≥75% (Figs. 8a,c). Similar to Cordeira and Ralph (2021), the success ratio of GEFS P250 forecasts ≥ 50% is higher along the Washington and Oregon coastlines at lead times > 5 days, where ARs are more frequent, and lower along the southwest Canadian, Southern California, and northwest Mexico coastlines, where ARs are less frequent (Fig. 8d). This latitudinal variability in success ratio is also observed in the EPS forecasts (Fig. 8e); however, the EPS success ratio is noticeably larger at lead times between ∼4 and 10 days (e.g., >0.70 at lead times of 6–8 days near ∼35°N in central California, as compared to >0.50). The success ratios of EPS forecasts are significantly higher across almost all latitudes and almost all lead times > 3 days, with the largest differences at lead times of 6–8 days across central and Southern California (∼32°–38°N), >8 days across Northern California, Oregon, and Washington, and ∼6–10 days farther north (Fig. 8f). The least significant differences in success ratio between the two models occur across most lead times along the coastline of Northern California and Oregon between ∼40° and 46°N.
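The success ratio follows from a standard 2 × 2 contingency table in which a "yes" forecast is a P250 value at or above a probability threshold and a "yes" observation is a verifying IVT magnitude ≥ 250 kg m−1 s−1. A minimal sketch (function and argument names are illustrative):

```python
import numpy as np

def contingency_counts(p250_vals, obs_ivt, prob_threshold=50.0, ivt_threshold=250.0):
    """2x2 contingency counts for probability-over-threshold forecasts:
    forecast 'yes' when P250 >= prob_threshold (%), observed 'yes' when the
    verifying IVT magnitude >= ivt_threshold (kg m-1 s-1)."""
    fcst = np.asarray(p250_vals, dtype=float) >= prob_threshold
    obs = np.asarray(obs_ivt, dtype=float) >= ivt_threshold
    hits = int(np.sum(fcst & obs))
    false_alarms = int(np.sum(fcst & ~obs))
    misses = int(np.sum(~fcst & obs))
    correct_negatives = int(np.sum(~fcst & ~obs))
    return hits, false_alarms, misses, correct_negatives

def success_ratio(hits, false_alarms):
    """SR = hits / (hits + false alarms) = 1 - false alarm ratio."""
    return hits / (hits + false_alarms)
```

A lower success ratio at a given lead time therefore directly reflects a larger false alarm ratio, the behavior attributed to the GEFS at longer lead times.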

Fig. 8.

Success ratio calculated for the GEFS and EPS for all forecast lead times valid between 0000 UTC 1 Oct 2016 and 1200 UTC 30 Apr 2020: (a)–(c) at 38°N, 123°W for P250 thresholds ≥ 25%, ≥50%, and ≥75%, respectively; (d),(e) along the coast from 25° to 55°N on the west coast of North America; and (f) the difference between the two models along the coast. Solid lines in (a)–(c) represent the mean, while dashed lines in (a)–(c) represent 95% confidence levels for statistical significance generated through 1000 random samples of 25-member populations of forecasts. The dots in (f) denote locations where the difference is statistically significant at the 95% confidence level based on a two-sided Student’s t test compared with the mean of the same 1000 random samples of 25-member populations of forecasts. The black “N/A” shading indicates that fewer than 10 forecasts of P250 ≥ 50% were available at that lead time and latitude to evaluate the success ratio.
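The confidence levels described in the caption can be approximated by repeatedly resampling small populations of forecasts and taking percentile bounds on the resampled statistic. A rough sketch, assuming a simple mean of the metric is resampled (sample sizes follow the caption; names are illustrative):

```python
import numpy as np

def bootstrap_bounds(values, n_samples=1000, sample_size=25, level=95.0, seed=0):
    """Percentile bounds on the mean of a skill metric from n_samples random
    draws (with replacement) of sample_size forecasts each."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    means = np.array([rng.choice(values, size=sample_size).mean()
                      for _ in range(n_samples)])
    half = (100.0 - level) / 2.0
    lo, hi = np.percentile(means, [half, 100.0 - half])
    return float(lo), float(hi)
```

Non-overlapping bounds between the two models at a given lead time would then indicate a statistically significant difference.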


c. Probability of detection

From an “event” perspective, the POD for the GEFS and EPS forecasts at 38°N, 123°W using P250 thresholds of ≥25%, ≥50%, and ≥75% was largely similar with overlapping confidence levels (Figs. 9a–c). For example, the POD for P250 forecasts ≥ 50% differed by at most ∼0.10 at lead times of 6–7 days, and the differences were not statistically significant (Fig. 9b). The POD values indicate that >50% of events are correctly forecast by P250 forecasts ≥ 25%, ≥50%, and ≥75% at lead times within ∼8–9 days, ∼6 days, and ∼3–4 days, respectively. The similarities between the two models in POD and the differences in success ratio suggest that events are not necessarily being misforecast, but that the GEFS is likely overforecasting the frequency of events relative to the EPS. In both models, the POD is highest along the Oregon and Washington coastlines at ∼42°–47°N and lowest along the Southern California and northwest Mexico coastlines at ∼28°–34°N (Figs. 9d,e). The POD of the EPS is significantly higher along the Washington and Oregon coastlines (∼43°–48°N) at lead times of 4–6 days, whereas the POD of the GEFS is significantly higher along the California coastline (∼34°–38°N) at lead times of 0–2 days and the northwest Mexico coastline (25°–31°N) at lead times of 6–10 days (Fig. 9f).
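The POD complements the success ratio by conditioning on observed events rather than on forecasts. A minimal sketch (names are illustrative):

```python
import numpy as np

def probability_of_detection(p250_vals, obs_ivt, prob_threshold=50.0,
                             ivt_threshold=250.0):
    """POD = hits / (hits + misses): the fraction of observed events
    (verifying IVT >= ivt_threshold) preceded by P250 >= prob_threshold."""
    fcst = np.asarray(p250_vals, dtype=float) >= prob_threshold
    obs = np.asarray(obs_ivt, dtype=float) >= ivt_threshold
    hits = np.sum(fcst & obs)
    misses = np.sum(~fcst & obs)
    return float(hits / (hits + misses))
```

Similar POD but different success ratios between two models is consistent with one model issuing more false alarms, as argued above for the GEFS.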

Fig. 9.

As in Fig. 8, but for the probability of detection (POD).


d. Equitable threat score and Brier skill score

The ETS illustrates that EPS P250 forecasts ≥ 50% at 38°N, 123°W are only slightly more skillful on average than the GEFS forecasts, accounting for random chance, at lead times of ∼2–10 days (Fig. 10b). The difference between the GEFS and EPS ETS values grows more significant as the P250 forecast threshold is lowered from ≥75% to ≥25%, which again likely highlights the overforecasting of “low probability” forecasts in the GEFS (Figs. 10a,c; i.e., the GEFS is penalized for being correct by random chance). The largest and significant differences in ETS between the two models are ∼0.10–0.12 at lead times of 6–7 days for a threshold of ≥50% (Fig. 10b) and ∼0.08–0.12 at lead times of 3–10 days for a threshold of ≥25% (Fig. 10a). The difference in ETS for P250 ≥ 50% extends to most latitudes along the West Coast of North America, with EPS forecasts leading the GEFS forecasts in ETS by ∼0.10 for similar lead times or ∼1 day for similar ETS values (Figs. 10d,e). As previously shown by the success ratio and POD, the highest ETS values are located along the Northern California and Oregon coastlines (∼40°–44°N) and the lowest are located along the Southern California coastline (∼32°–34°N). The differences in ETS for P250 ≥ 50% between the EPS and GEFS are primarily significant at lead times of 2–7 days at latitudes between ∼42° and 48°N (Fig. 10f).
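The ETS adjusts the threat score for hits expected by random chance, which is why a model that overforecasts low probabilities is penalized. A minimal sketch of the standard formulation:

```python
def equitable_threat_score(hits, false_alarms, misses, correct_negatives):
    """ETS: threat score adjusted for hits expected by random chance.

    hits_random is the expected number of hits if forecasts and observations
    were statistically independent at the observed base rates.
    """
    n = hits + false_alarms + misses + correct_negatives
    hits_random = (hits + misses) * (hits + false_alarms) / n
    return (hits - hits_random) / (hits + misses + false_alarms - hits_random)
```

An ETS of 1 indicates a perfect forecast, 0 indicates no skill beyond chance, and negative values indicate worse-than-chance performance.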

Fig. 10.

As in Figs. 8 and 9, but for the equitable threat score (ETS). Statistical significance is calculated as in Fig. 8, except using 1000 random samples of 50-member populations of forecasts.


The BSS illustrates that EPS P250 forecasts at 38°N, 123°W for events and nonevents are also on average more skillful than GEFS P250 forecasts across all lead times (Fig. 11a), with BSS values that drop below 0.1 at lead times of ∼10–11 days. The largest and significant differences in the BSS are >0.10 at lead times of ∼6–9 days. This difference in the BSS extends to most latitudes along the West Coast of North America, with EPS forecasts leading the GEFS forecasts by ∼0.10 for similar lead times or ∼1 day for similar BSS values (Figs. 11b,c). The EPS forecasts are significantly higher than the GEFS across most latitudes in Washington, Oregon, and Northern California at lead times of ∼2–8 days and also across northwest Mexico at lead times > 8 days (Fig. 11d). Note that the latter BSS values in Mexico are <0.1 in each model at these lead times, with their differences straddling zero (i.e., comparing a forecast with very little skill over climatology to one with negative skill). Similarly, large differences in the BSS near 31°–33°N in Southern California are not statistically significant and occur where the BSS of the GEFS drops below zero (i.e., little-to-no utility over the reference climatology). The BSS suggests that the ensemble-derived P250 forecasts have limited utility in forecasting events or nonevents beyond ∼7–10 days depending on latitude, especially in Southern California.
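The BSS measures probabilistic skill relative to a reference forecast; a sketch using the climatological event frequency as the reference (an assumption consistent with, but not confirmed to be, the study's exact reference):

```python
import numpy as np

def brier_skill_score(p_forecast, observed):
    """BSS = 1 - BS / BS_ref, with the climatological event frequency as
    the reference forecast.

    p_forecast: probabilities in [0, 1]
    observed:   1 if the event occurred, else 0
    """
    p = np.asarray(p_forecast, dtype=float)
    o = np.asarray(observed, dtype=float)
    bs = np.mean((p - o) ** 2)              # Brier score of the forecasts
    clim = o.mean()                          # climatological event frequency
    bs_ref = np.mean((clim - o) ** 2)        # Brier score of climatology
    return float(1.0 - bs / bs_ref)
```

A BSS of 1 is a perfect probabilistic forecast, 0 matches climatology, and negative values indicate less utility than simply forecasting the climatological frequency.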

Fig. 11.

Brier skill score calculated for the GEFS and EPS for all forecast lead times valid between 0000 UTC 1 Oct 2016 and 1200 UTC 30 Apr 2020: (a) at 38°N, 123°W; (b),(c) along the coast from 25° to 55°N on the West Coast of North America; and (d) the difference between the two models along the coast. Statistical significance is calculated as in Fig. 8, but using 1000 random samples of 50-member populations of forecasts.


e. Forecast skill variants

For a randomly generated multimodel ensemble, constructed by randomly selecting a 20-member ensemble that varies from all GEFS members to all EPS members, the success ratio for P250 forecasts ≥ 50% at 38°N, 123°W is largely similar (i.e., 71%–74%) for a GEFS-weighted versus EPS-weighted ensemble at lead times less than 5 days (Fig. 12). At lead times greater than 6 days, P250 forecasts ≥ 50% from ensembles containing progressively more GEFS members produce systematically lower success ratios than those from ensembles containing progressively more EPS members. Note that these differences were not assessed for statistical significance. In other words, the EPS ensemble reduced from 50 members to 20 members still produces a more successful forecast than the GEFS. This result suggests that the forecast success of a multimodel ensemble using a combination of EPS and GEFS members is on average also lower than the skill of the EPS ensemble alone. Both results are similar to those of Atger (2001).
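The random multimodel construction can be sketched as drawing members without replacement from each model and recomputing P250 from the pooled set; a rough illustration under the repetition count stated for Fig. 12 (names and interface are assumptions):

```python
import numpy as np

def mixed_ensemble_p250(gefs_ivt, eps_ivt, n_gefs, n_total=20,
                        threshold=250.0, n_repeat=50, seed=0):
    """Mean P250 (%) from a randomly drawn n_total-member ensemble built
    from n_gefs GEFS members and (n_total - n_gefs) EPS members, averaged
    over n_repeat random draws (cf. the 50 repetitions in Fig. 12)."""
    rng = np.random.default_rng(seed)
    gefs = np.asarray(gefs_ivt, dtype=float)
    eps = np.asarray(eps_ivt, dtype=float)
    vals = []
    for _ in range(n_repeat):
        g = rng.choice(gefs, size=n_gefs, replace=False)
        e = rng.choice(eps, size=n_total - n_gefs, replace=False)
        pooled = np.concatenate([g, e])
        vals.append(100.0 * np.mean(pooled >= threshold))
    return float(np.mean(vals))
```

Sweeping `n_gefs` from 0 to 20 reproduces the vertical axis of Fig. 12, from an all-EPS to an all-GEFS 20-member ensemble.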

Fig. 12.

Success ratio for P250 forecasts ≥ 50% derived from a randomly selected 20-member ensemble using different combinations of ensemble members from the GEFS and EPS forecasts at 38°N, 123°W. The top row represents an ensemble of all EPS members, and the bottom row represents an ensemble of all GEFS members. The values are a mean success ratio calculated from repeating the random sampling 50 times.


The POD of P250 forecasts at 38°N, 123°W can be partitioned into verifying times with IVT magnitudes of 250–500 kg m−1 s−1 and of ≥500 kg m−1 s−1, in a similar (but not identical) fashion to Cordeira and Ralph (2021), in order to 1) assess whether more intense events are detected using P250 at longer lead times than less intense events and 2) provide context for the accuracy of forecasts prior to landfalling ARs of different intensities and their precipitation (see section 3g). Note that we should expect the P250 values to be higher in advance of more intense events given the propensity for the spread of the ensembles to result in more ensemble members with IVT magnitudes ≫ 250 kg m−1 s−1. The POD of EPS P250 forecasts ≥ 50% increases from 0.30 at a lead time of 7 days for verifying times with IVT magnitudes of 250–500 kg m−1 s−1 (Fig. 13a) to 0.50 for verifying times with IVT magnitudes ≥ 500 kg m−1 s−1 (Fig. 13b). Alternatively stated, half of the more intense events are forecast at a lead time of 7 days, whereas less intense events reach the same POD at a lead time of ∼5.5 days. These results suggest that more intense events are detected using P250 forecasts at longer lead times than less intense events, similar to Cordeira and Ralph (2021). Although not tested for statistical significance, the EPS P250 forecasts do have a higher POD than GEFS P250 forecasts at lead times of 3–9 days using a P250 forecast threshold of 25% for events verifying with IVT magnitudes of 250–500 kg m−1 s−1 and at lead times of ∼6–8 days using thresholds of 50% and 25% for events verifying with IVT magnitudes ≥ 500 kg m−1 s−1. This latter result suggests that the EPS provides slightly better lead-time prediction of more intense landfalling ARs in coastal regions of Northern California than the GEFS.
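The intensity partition amounts to masking the verifying times into the two IVT categories before computing the POD for each; a minimal sketch (names and category labels are illustrative):

```python
import numpy as np

def pod_by_intensity(p250_vals, obs_ivt, prob_threshold=50.0):
    """POD computed separately for verifying times with moderate
    (250-500 kg m-1 s-1) and strong (>=500 kg m-1 s-1) IVT magnitudes."""
    p = np.asarray(p250_vals, dtype=float)
    ivt = np.asarray(obs_ivt, dtype=float)
    pod = {}
    for label, events in (("250-500", (ivt >= 250.0) & (ivt < 500.0)),
                          (">=500", ivt >= 500.0)):
        hits = np.sum((p >= prob_threshold) & events)
        pod[label] = float(hits / events.sum())
    return pod
```

A higher POD in the strong category at a fixed lead time reflects the expectation stated above: more intense events push more ensemble members well above the 250 kg m−1 s−1 threshold.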

Fig. 13.

Probability of detection of the GEFS (red) and EPS (blue) for all forecast lead times of 0–15 days and three different P250 thresholds (≥25%, ≥50%, and ≥75%) for forecasts verifying with IVT magnitudes of (a) 250–500 kg m−1 s−1 and (b) ≥500 kg m−1 s−1.


f. Forecast differences

The differences in the EPS and GEFS P250 forecast skill metrics at lead times of ∼6–8 days at 38°N, 123°W motivated additional analysis of the variability between the two models. The distribution of the differences of the average P250 forecast in the 6–8-day period illustrates that a majority of the values are near zero (due to a large majority of nonevents and the differences between small probability values being small), but the distribution is skewed toward the GEFS having higher P250 values than the EPS across all forecasts (Fig. 14a). The differences of the average P250 forecast in the 6–8-day lead time (i.e., EPS minus GEFS) for times that verified with IVT magnitudes of 250–500 kg m−1 s−1 (Fig. 14b) and >500 kg m−1 s−1 (Fig. 14c) illustrate where the EPS or GEFS provided a more successful forecast. Overall, there are a larger number of occurrences of positive (blue) values as compared to negative (red) values for events with IVT magnitudes 250–500 kg m−1 s−1 (121 of 194 favor the EPS) and for events with IVT magnitudes > 500 kg m−1 s−1 (28 of 35 favor the EPS), indicating an overwhelming number of relative “wins” for the EPS P250 forecasts prior to times that verify. Note that some cases, however, were greater successes than others. For example, 6–8-day forecasts prior to 0000 UTC 16 February 2017 (blue arrows in Figs. 14b,c) were clearly higher in the EPS model, whereas the 6–8-day forecasts prior to 0000 UTC 20 November 2017 were clearly higher in the GEFS (red arrow in Fig. 14b). Future work is aimed at identifying why these forecasts, and others, are more confident in one model versus another.
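The tally of relative "wins" reduces to signed differences of the lead-time-averaged P250 values at verifying events; a minimal sketch (function name is illustrative):

```python
import numpy as np

def count_relative_wins(eps_p250, gefs_p250):
    """Counts of verifying events for which the average 6-8-day EPS P250
    exceeded the GEFS value (an EPS 'win') and vice versa; ties excluded."""
    diff = np.asarray(eps_p250, dtype=float) - np.asarray(gefs_p250, dtype=float)
    return int(np.sum(diff > 0)), int(np.sum(diff < 0))
```

Applied to the events in Figs. 14b,c, this tally yields the 121-of-194 and 28-of-35 counts favoring the EPS reported above.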

Fig. 14.

(a) Distribution of the difference between average 6–8-day EPS and GEFS P250 forecasts (EPS minus GEFS) for all verification times at 38°N, 123°W. (b),(c) Forecast differences between the EPS and GEFS average 6–8-day P250 forecasts (EPS minus GEFS) for events that verified with IVT magnitudes of (b) 250–500 kg m−1 s−1 and (c) ≥500 kg m−1 s−1 at 38°N, 123°W. Positive values (blue) illustrate a higher EPS forecast, while negative values (red) illustrate a higher GEFS forecast.


g. Top-10 precipitation events

The top-10 dates with the highest daily precipitation in the Northern Sierra region of California from the Northern Sierra Eight Station Index were selected as verifying times for further analysis. These dates, their precipitation amounts, the antecedent time containing the maximum IVT magnitude at 38°N, 123°W, and the maximum IVT magnitude are provided in Table 2. All 10 events contained daily precipitation amounts > 2.7 in. (∼68 mm) and featured event-maximum IVT magnitudes ≥ 500 kg m−1 s−1. The EPS and GEFS P250 forecasts prior to each time of maximum IVT magnitude illustrate a large amount of variability at lead times > 3 days, with features characteristic of both “smooth” and “jumpy” dProg/dt forecasts (Figs. 15a,b). The standard deviation of the EPS and GEFS forecasts (i.e., spread) is quite similar and is largest (∼0.3) at a lead time of ∼5–6 days, indicating that some forecasts were already starting to “lock in” on AR conditions while others were less certain (Fig. 15e). The visual spread among the EPS P250 forecasts in Fig. 15b appears larger than that of the GEFS, but this is primarily due to especially low P250 forecast values prior to 15–16 December 2016 (red line) and 14 February 2019 (orange line). Note that these two events appear at the bottom of both the GEFS and EPS forecast “envelopes” in Figs. 15a and 15b as two of the most challenging-to-forecast high-impact precipitation events of the 4-yr study period at lead times of ∼4–6 days. Alternatively, the two high-impact precipitation events during early January (salmon line) and early February 2017 (dark green line) that preceded the Lake Oroville Spillway incident (France et al. 2018; Vano et al. 2019; White et al. 2019) were apparently two of the least challenging (i.e., best forecast) events. Further analysis is required to understand why some of these events were forecast relatively well at lead times > 6–8 days, while others were forecast relatively poorly at lead times < 6 days.
A review of the forecast skill of the GEFS prior to the 14 February 2019 case is provided by Hecht et al. (2022).

Fig. 15.

(a),(c) GEFS and (b),(d) EPS P250 forecasts prior to the top-10 precipitation events in Northern California that were also associated with landfalling ARs. Forecasts are partitioned prior to the time of maximum IVT magnitude in (a) and (b) and the first time with IVT magnitudes ≥250 kg m−1 s−1 (onset of event) in (c) and (d) at 38°N, 123°W. (e),(f) The top-10 average P250 forecast value and standard deviation of the top-10 P250 forecast values for both models at the time of maximum IVT and onset, respectively.


The prior analysis considered P250 forecasts prior to the time of maximum IVT magnitude, which were all >500 kg m−1 s−1 and quasi-centered within a >24-h period with IVT magnitudes ≥ 250 kg m−1 s−1 (not shown). A similar analysis is shown by illustrating the P250 forecasts prior to the onset time of each event, i.e., the time when the IVT magnitude first increased above 250 kg m−1 s−1 prior to the high-impact precipitation event (Figs. 15c,d). The differences between the two models and among individual cases are much larger. For example, the average P250 forecast values prior to verification at onset are higher in the EPS at most lead times < 5 days. The standard deviation of the GEFS P250 forecast values prior to verification at onset is also larger than that of the EPS at onset and larger than that of the GEFS at the time of maximum IVT. These differences suggest that, for these top-10 precipitation events and potentially others, the onset of AR conditions (i.e., timing) is more challenging to forecast than the time of maximum IVT magnitude, particularly for the GEFS relative to the EPS in coastal regions of Northern California.

4. Conclusions

Landfalling ARs are typically associated with enhanced water vapor transport that can produce extreme precipitation, high societal impacts, and benefits to water resources. For example, floods, flash floods, extreme winds, avalanches, and favorable conditions for shallow landslides and debris flows are all possible hazards related to landfalling ARs (e.g., Young et al. 2017; Waliser and Guan 2017; Hatchett et al. 2017; Oakley et al. 2018; Bartlett and Cordeira 2021), making skillful forecasts of their occurrence beneficial to hazard mitigation and water resources management. Given the relative predictability of IVT forecasts at lead times of ∼6–10 days over traditional precipitation forecast metrics (Lavers et al. 2016) and a previous study analyzing the probability of IVT magnitudes ≥ 250 kg m−1 s−1 (P250) along the west coast of North America using the GEFS, this study compares the skill of P250 forecasts from the EPS and GEFS ensemble systems.

Analysis of dProg/dt for the GEFS and EPS P250 forecasts prior to verification times with IVT magnitudes ≥ 250 kg m−1 s−1 in coastal regions of Northern California (38°N, 123°W) indicated that EPS P250 forecasts provided ∼1 day of additional lead-time guidance for situational awareness over the GEFS at lead times of 6–10 days (Fig. 6). Reliability analysis of all P250 forecasts from the EPS and GEFS at 38°N demonstrated that P250 forecasts at lead times through ∼9 days were on average reliable; however, the EPS was overall more reliable than the GEFS at lead times > 9 days (Fig. 7). The EPS also had higher forecast skill, with success ratios that were 0.10–0.15 higher than those of the GEFS at lead times > 6 days for P250 thresholds of ≥25% and ≥50%, and >3 days for a P250 threshold ≥ 75%, with similar differences for P250 ≥ 50% as a function of latitude along the west coast of North America (Fig. 8). When accounting for success via random chance, the ETS values suggested that differences between the EPS and GEFS P250 forecasts largely arise for P250 thresholds ≥25% and ≥50% at lead times between ∼6 and 8 days at 38°N and are more widespread at different lead times elsewhere along the coast for P250 ≥ 50% (Fig. 10). The ETS difference is largest along the coast of Oregon and Washington where AR frequency is highest (not shown), suggesting that the GEFS may attribute at least some of its skill in this region to random chance and its lack of skill to false alarms.

The event-based skill analysis demonstrated that the POD for both models was largely similar at 38°N (Fig. 9) with 1) improvements in POD for P250 prior to more intense events as compared to less intense events (Fig. 13) and 2) minor latitudinal variations favoring higher POD in the EPS along the coast of Washington and Oregon and higher POD in the GEFS along the coast of Mexico (Fig. 9). The BSS illustrated that the EPS contained higher skill at lead times of ∼6–10 days than the GEFS in forecasting events and nonevents at 38°N (Fig. 11). The BSS of the EPS was also higher than the GEFS by ∼0.10 for most locations along the West Coast of North America at lead times > 1 day, and for a given BSS value, the EPS led the GEFS by ∼0.5 days at lead times of 3–5 days and by ∼1.0 days at lead times of 5–10 days (Fig. 11). Overall, the EPS and GEFS P250 forecasts contained largely similar skill at lead times < 6 days and >10 days, whereas the EPS contained better skill at lead times of ∼6–8 days at 38°N. These lead times do vary by latitude, but overall favor the EPS. Given the largely similar POD values, yet differences in success and ETS, it appears the EPS provides more skillful P250 forecasts owing to a lower false alarm ratio.

The statistical consistency of the two ensembles did suggest that the EPS ensemble spread was slightly larger than the GEFS ensemble spread, was paired with a lower RMSE than the GEFS, and was less underdispersive than the GEFS at lead times of ∼6–10 days, specifically for forecasts prior to verification times with IVT magnitudes of 250–500 kg m−1 s−1 in coastal regions of Northern California (Fig. 3). When the ensemble spread of the EPS is normalized to 20 members through random sampling, the success ratio of the EPS forecasts remained larger than that of the comparable 20-member GEFS forecasts (Fig. 12), suggesting that model physics, data assimilation, perturbation growth, model resolution, or some other factor is likely responsible for the systematically better skill of the EPS or the systematically lower skill of the GEFS. Future work is aimed at assessing whether there are combinations of EPS and GEFS members that can produce forecast skill superior to either ensemble system alone, especially with the newer 30-member GEFS version 12 introduced in late 2020, after the period of study.

The differences in skill at lead times of 6–8 days led to an analysis of whether these differences could be attributed to systematic biases in the difference between EPS and GEFS P250 forecasts. Both models had relatively similar frequencies of “wins” in terms of having higher average 6–8-day P250 forecast values than the other model for events verifying between 250 and 500 kg m−1 s−1; however, the EPS was a clear winner, with an overwhelming majority of higher average 6–8-day P250 forecasts prior to verifying times with IVT magnitudes ≥ 500 kg m−1 s−1 (Fig. 14). This “winner” and “loser” perspective derived from calculating EPS average P250 minus GEFS average P250 is related to the POD and illustrates that the EPS typically has a higher average P250 forecast prior to more intense landfalling ARs. Differences in the forecasts and forecast jumpiness were also observed through analysis of the top-10 high-precipitation events in Northern California and their associated P250 forecasts prior to onset and the time of maximum IVT magnitude (Fig. 15). This analysis suggested that 1) the onset of a landfalling AR is more challenging to forecast using P250 than the time of maximum intensity and 2) differences in forecast jumpiness and forecast skill between the EPS and GEFS at lead times of 6–8 days are likely related to how well the ensembles predict the “encompassing meteorology” of individual events (e.g., synoptic meteorology prior to and during an event as shown by Hecht et al. 2022), in addition to the aforementioned differences in model physics, data assimilation, perturbation growth, and model resolution.

The results of this study can be applied to forecasts of landfalling ARs to enhance situational awareness and to support applications such as FIRO (Ralph et al. 2019). In coordination with FIRO, forecasts within a 3–5-day lead time are typically used to determine the likelihood and strength of an upcoming AR and its precipitation, which support decisions on how best to manage water within a FIRO-supported reservoir. This study suggests that both the EPS and GEFS P250 forecasts provide on average similar measures of skill for periods of enhanced IVT associated with a landfalling AR at lead times < 6 days and are likely equally useful in this decision-making process. However, the EPS P250 forecasts are on average more likely to provide better guidance at lead times > 6 days, potentially supporting decisions as to whether a second event is likely at longer lead times. This result is supported by the BSS of P250 forecasts at lead times of 6–8 days, which indicated that the EPS provides more skillful probabilistic forecasts than the GEFS for both events with IVT magnitudes ≥ 250 kg m−1 s−1 and nonevents with IVT magnitudes < 250 kg m−1 s−1.

The EPS was upgraded four times within the period of study, whereas the GEFS was not. The upgrades to the EPS likely influenced the overall skill of the model throughout the study, which is why it is important to note that this assessment is only valid for the “operational” skill of the models during WY17–20 (i.e., if a forecaster used one model consistently in lieu of the other). It will be important to reproduce this study in the future with the latest version 12 of the GEFS and with reforecast data in order to identify whether systematic biases in forecasts and skill remain. It will also be important to investigate the relationships between forecast skill and the “encompassing meteorology” in order to identify whether different types of flow patterns and events result in higher or lower AR-related forecast skill. For example, previous studies have shown that forecast skill is related to variability in large-scale flow patterns resulting from different phases of certain teleconnection patterns, including the Pacific–North American (PNA) pattern, El Niño–Southern Oscillation (ENSO), and the Madden–Julian oscillation (DeFlorio et al. 2018, 2019). Given the apparent intraseasonal variability in lead-time prediction of events in Fig. 4 and the variability in skill of P250 forecasts prior to the top-10 precipitation events in Northern California, these events, or an expansion to a larger number of events, could serve as a basis for further analysis into why these particular events, their IVT, and precipitation were more or less predictable than others.

Acknowledgments.

Support for this project was provided by awards from the State of California, Department of Water Resources (4600013361) and the U.S. Army Corps of Engineers (W912HZ-15-2-0019, W912HZ-19-2-0023) as part of broader projects led by the Center for Western Weather and Water Extremes (CW3E) at the University of California, San Diego, Scripps Institution of Oceanography. We are grateful for the comments provided by two anonymous reviewers, which improved the quality of this manuscript. A majority of this research was completed as an M.S. thesis by the first author (BS) at Plymouth State University, with comments provided on earlier drafts by Dr. Eric Hoffman.

Data availability statement.

Data analyzed in this study were a reanalysis and derivation of existing data, which are openly available at locations cited in section 2 and in Cordeira and Ralph (2021).

REFERENCES

  • Atger, F., 2001: Verification of intense precipitation forecasts from single models and ensemble prediction systems. Nonlinear Processes Geophys., 8, 401–417, https://doi.org/10.5194/npg-8-401-2001.

  • Bartlett, S. M., and J. M. Cordeira, 2021: A climatological study of National Weather Service watches, warnings, and advisories and landfalling atmospheric rivers in the western United States 2006–18. Wea. Forecasting, 36, 1097–1112, https://doi.org/10.1175/WAF-D-20-0212.1.

  • CDEC, 2021: Department of Water Resources California Data Exchange Center: CDEC Webservice JSON and CSV. Accessed 1 March 2021, https://cdec.water.ca.gov/dynamicapp/wsSensorData.

  • Cordeira, J. M., and F. M. Ralph, 2021: A summary of GFS ensemble integrated water vapor transport forecasts and skill along the U.S. West Coast during water years 2017–20. Wea. Forecasting, 36, 361–377, https://doi.org/10.1175/WAF-D-20-0121.1.

  • Cordeira, J. M., F. M. Ralph, A. Martin, N. Gaggini, J. R. Spackman, P. J. Neiman, J. J. Rutz, and R. Pierce, 2017: Forecasting atmospheric rivers during CalWater 2015. Bull. Amer. Meteor. Soc., 98, 449–459, https://doi.org/10.1175/BAMS-D-15-00245.1.

  • Corringham, T. W., F. M. Ralph, A. Gershunov, D. R. Cayan, and C. A. Talbot, 2019: Atmospheric rivers drive flood damages in the western United States. Sci. Adv., 5, eaax4631, https://doi.org/10.1126/sciadv.aax4631.

  • DeFlorio, M. J., D. E. Waliser, B. Guan, D. A. Lavers, F. M. Ralph, and F. Vitart, 2018: Global assessment of atmospheric river prediction skill. J. Hydrometeor., 19, 409–426, https://doi.org/10.1175/JHM-D-17-0135.1.

  • DeFlorio, M. J., and Coauthors, 2019: Experimental Subseasonal-to-Seasonal (S2S) forecasting of atmospheric rivers over the western United States. J. Geophys. Res. Atmos., 124, 11 242–11 265, https://doi.org/10.1029/2019JD031200.

  • Delaney, J., and Coauthors, 2020: Forecast informed reservoir operations using ensemble streamflow predictions for a multipurpose reservoir in northern California. Water Resour. Res., 56, e2019WR026604, https://doi.org/10.1029/2019WR026604.

  • France, J. W., I. A. Alvi, P. A. Dickson, H. T. Falvey, S. J. Rigbey, and J. Trojanowski, 2018: Oroville Dam spillway incident independent forensic team final report (5 January 2018). 584 pp., www.ussdams.org/our-news/oroville-dam-spillway-incident-independent-forensic-team-final-report.

  • Froude, L. S. R., 2010: TIGGE: Comparison of the prediction of Northern Hemisphere extratropical cyclones by different ensemble prediction systems. Wea. Forecasting, 25, 819–836, https://doi.org/10.1175/2010WAF2222326.1.

  • Hagedorn, R., T. M. Hamill, and J. S. Whitaker, 2008: Probabilistic forecast calibration using ECMWF and GFS ensemble reforecasts. Part I: Two-meter temperatures. Mon. Wea. Rev., 136, 2608–2619, https://doi.org/10.1175/2007MWR2410.1.

  • Hamill, T. M., R. Hagedorn, and J. S. Whitaker, 2008: Probabilistic forecast calibration using ECMWF and GFS ensemble reforecasts. Part II: Precipitation. Mon. Wea. Rev., 136, 2620–2632, https://doi.org/10.1175/2007MWR2411.1.

  • Hamill, T. M., E. Engle, D. Myrick, M. Peroutka, C. Finan, and M. Scheuerer, 2017: The U.S. National Blend of Models for statistical postprocessing of probability of precipitation and deterministic precipitation amount. Mon. Wea. Rev., 145, 3441–3463, https://doi.org/10.1175/MWR-D-16-0331.1.

  • Hatchett, B. J., S. Burak, J. J. Rutz, N. S. Oakley, E. H. Bair, and M. L. Kaplan, 2017: Avalanche fatalities during atmospheric river events in the western United States. J. Hydrometeor., 18, 1359–1374, https://doi.org/10.1175/JHM-D-16-0219.1.

  • Hatchett, B. J., and Coauthors, 2020: Observations of an extreme atmospheric river storm with a diverse sensor network. Earth Space Sci., 7, e2020EA001129, https://doi.org/10.1029/2020EA001129.

  • Hecht, C. W., A. C. Michaelis, A. C. Martin, J. M. Cordeira, F. Cannon, and F. M. Ralph, 2022: Illustrating ensemble predictability across scales associated with the 13–15 February 2019 atmospheric river event. Bull. Amer. Meteor. Soc., 103, E911–E922, https://doi.org/10.1175/BAMS-D-20-0292.1.

  • Jasperse, J., and Coauthors, 2020: Lake Mendocino forecast informed reservoir operations final viability assessment. UC San Diego, 141 pp., https://cw3e.ucsd.edu/FIRO_docs/LakeMendocino_FIRO_FVA.pdf.

  • Lavers, D. A., D. E. Waliser, F. M. Ralph, and M. D. Dettinger, 2016: Predictability of horizontal water vapor transport relative to precipitation: Enhancing situational awareness for forecasting western U.S. extreme precipitation and flooding. Geophys. Res. Lett., 43, 2275–2282, https://doi.org/10.1002/2016GL067765.

  • McMurdie, L., and C. Mass, 2004: Major numerical forecast failures over the Northeast Pacific. Wea. Forecasting, 19, 338–356, https://doi.org/10.1175/1520-0434(2004)019<0338:MNFFOT>2.0.CO;2.

  • Nardi, K. M., E. A. Barnes, and F. M. Ralph, 2018: Assessment of numerical weather prediction model reforecasts of the occurrence, intensity, and location of atmospheric rivers along the west coast of North America. Mon. Wea. Rev., 146, 3343–3362, https://doi.org/10.1175/MWR-D-18-0060.1.

  • Oakley, N. S., J. T. Lancaster, B. J. Hatchett, J. Stock, F. M. Ralph, S. Roj, and S. Lukashov, 2018: A 22-year climatology of cool season hourly precipitation thresholds conducive to shallow landslides in California. Earth Interact., 22, https://doi.org/10.1175/EI-D-17-0029.1.

  • Ralph, F. M., T. Coleman, P. J. Neiman, R. J. Zamora, and M. D. Dettinger, 2013: Observed impacts of duration and seasonality of atmospheric-river landfalls on soil moisture and runoff in coastal Northern California. J. Hydrometeor., 14, 443–459, https://doi.org/10.1175/JHM-D-12-076.1.

  • Ralph, F. M., J. J. Rutz, J. M. Cordeira, M. Dettinger, M. Anderson, D. Reynolds, L. J. Schick, and C. Smallcomb, 2019: A scale to characterize the strength and impacts of atmospheric rivers. Bull. Amer. Meteor. Soc., 100, 269–289, https://doi.org/10.1175/BAMS-D-18-0023.1.

  • Shields, C. A., and Coauthors, 2018: Atmospheric River Tracking Method Intercomparison Project (ARTMIP): Project goals and experimental design. Geosci. Model Dev., 11, 2455–2474, https://doi.org/10.5194/gmd-11-2455-2018.

  • Su, X., H. Yuan, Y. Zhu, Y. Luo, and Y. Wang, 2014: Evaluation of TIGGE ensemble predictions of Northern Hemisphere summer precipitation during 2008–2012. J. Geophys. Res. Atmos., 119, 7292–7310, https://doi.org/10.1002/2014JD021733.

  • Vano, J. A., K. Miller, M. D. Dettinger, R. Cifelli, D. Curtis, A. Dufour, J. R. Olsen, and A. M. Wilson, 2019: Hydroclimatic extremes as challenges for the water management community: Lessons from Oroville Dam and Hurricane Harvey. Bull. Amer. Meteor. Soc., 100, S9–S14, https://doi.org/10.1175/BAMS-D-18-0219.1.

  • Waliser, D., and B. Guan, 2017: Extreme winds and precipitation during landfall of atmospheric rivers. Nat. Geosci., 10, 179–183, https://doi.org/10.1038/ngeo2894.

  • WPC, 2012: The 2012 Atmospheric River Retrospective Forecasting Experiment: Final Experiment Report. NOAA, 19 pp., http://www.wpc.ncep.noaa.gov/hmt/ARRFEX_Final_Report.pdf.

  • White, A. B., B. J. Moore, D. J. Gottas, and P. J. Neiman, 2019: Winter storm conditions leading to excessive runoff above California’s Oroville Dam during January and February 2017. Bull. Amer. Meteor. Soc., 100, 55–70, https://doi.org/10.1175/BAMS-D-18-0091.1.

  • Young, A. M., K. T. Skelly, and J. M. Cordeira, 2017: High-impact hydrologic events and atmospheric rivers in California: An investigation using the NCEI Storm Events Database. Geophys. Res. Lett., 44, 3393–3401, https://doi.org/10.1002/2017GL073077.

  • Zsoter, E., R. Buizza, and D. Richardson, 2009: “Jumpiness” of the ECMWF and Met Office EPS control and ensemble-mean forecasts. Mon. Wea. Rev., 137, 3823–3836, https://doi.org/10.1175/2009MWR2960.1.

  • Fig. 1.

    Forecasts of the coastal ensemble probability of IVT magnitude ≥ 250 kg m−1 s−1 (shaded according to scale) initialized daily at 0000 UTC 7–11 Feb 2019 for the (a),(c),(e),(g),(i) GEFS model out to 16 days and (b),(d),(f),(h),(j) EPS model out to 15 days. Coastal latitudes are shown in the rightmost panels of each image, and topography is shaded every 100 m using a blue–green–brown–white color scale. The gray, black, and red bars in these panels represent the number of hours that probability values exceed 75%, 90%, and 99%, respectively; these bars are not used in this study.

  • Fig. 2.

    Frequency of the GEFS (red) and EPS (blue) P250 forecasts ≥ 25%, ≥50%, and ≥75% for all forecasts verifying every 12 h at 38°N, 123°W. The number of verifying times with IVT magnitudes ≥ 250 kg m−1 s−1 in each model is shown as thin black horizontal lines.

  • Fig. 3.

    The average ensemble spread and root-mean-square error (RMSE) of ensemble-mean IVT magnitude forecasts for (a) all times and (b) times verifying with IVT magnitudes of 250–500 kg m−1 s−1 for forecasts verifying every 12 h. Solid lines represent the RMSE, and dashed lines represent ensemble spread (GEFS = red, EPS = blue).

  • Fig. 4.

    A forecast–lead-time illustration of the (left) GEFS and (right) EPS P250 (shaded according to scale) for verification times every 12 h during (a),(b) WY17; (c),(d) WY18; (e),(f) WY19; and (g),(h) WY20 at 38°N, 123°W.

  • Fig. 5.

    A forecast–lead-time illustration of the difference in P250 forecast values between the EPS and GEFS (EPS minus GEFS) for (left) all verification times and (right) all verification times with IVT magnitudes of 250–500 kg m−1 s−1 every 12 h during (a),(b) WY17; (c),(d) WY18; (e),(f) WY19; and (g),(h) WY20 at 38°N, 123°W.

  • Fig. 6.

    Forecast–lead-time change in the P250 forecast values (i.e., “dProg/dt”) prior to verification times with IVT magnitudes ≥ 250 kg m−1 s−1 for the GEFS (red) and EPS (blue) for forecasts verifying every 12 h at 38°N, 123°W. The solid line represents the mean, while the dashed lines represent the 95% confidence level generated through 1000 random samples of 25-member populations of forecasts.
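
    The 95% confidence bands described in this caption (and in Figs. 8–11) come from repeated random subsampling. A minimal sketch of that kind of bootstrap, assuming sampling with replacement and 1000 draws of 25-member subsamples — the study's exact resampling procedure is not restated here, so treat the details as illustrative:

    ```python
    import random

    def bootstrap_bounds(values, n_samples=1000, sample_size=25):
        """Approximate 95% confidence bounds on a mean by drawing many small
        random subsamples (with replacement, an assumption) and taking the
        2.5th and 97.5th percentiles of the subsample means."""
        means = sorted(
            sum(random.choices(values, k=sample_size)) / sample_size
            for _ in range(n_samples)
        )
        return means[int(0.025 * n_samples)], means[int(0.975 * n_samples) - 1]
    ```

    Applied to a population of forecast values, the returned pair brackets the range of means that a random 25-member subsample would plausibly produce.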

  • Fig. 7.

    Reliability diagram for the (a) GEFS and (b) EPS P250 forecasts at 38°N, 123°W for lead times of 1–15 days (colored solid lines), grouped into 3-day lead-time bins according to the legend. Values are calculated only if there were 10 or more forecasts to evaluate; forecast values are grouped into a 0%-only bin and into 10% bins thereafter. The black lines represent no resolution (solid), no skill (dot–dash), and 1:1 reliability (dashed) for the latter 10 bins.

  • Fig. 8.

    Success ratio calculated for the GEFS and EPS for all forecast lead times valid between 0000 UTC 1 Oct 2016 and 1200 UTC 30 Apr 2020: (a)–(c) at 38°N, 123°W for P250 thresholds ≥ 25%, ≥50%, and ≥75%, respectively; (d),(e) along the coast from 25° to 55°N on the west coast of North America; and (f) the difference between the two models along the coast. Solid lines in (a)–(c) represent the mean, while the dashed lines in (a)–(c) represent 95% confidence levels for statistical significance generated through 1000 random samples of 25-member populations of forecasts. The dots in (f) denote locations where the difference is statistically significant at the 95% confidence level based on a two-sided Student’s t test compared with the mean of the same 1000 random samples of 25-member populations of forecasts. The black “N/A” shading indicates that fewer than 10 forecasts of P250 ≥ 50% were made at that lead time and latitude to evaluate the success ratio.

  • Fig. 9.

    As in Fig. 8, but for the probability of detection (POD).

  • Fig. 10.

    As in Figs. 8 and 9, but for the equitable threat score (ETS). Statistical significance is calculated as in Fig. 8, except using 1000 random samples of 50-member populations of forecasts.
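
    The success ratio, POD, and ETS shown in Figs. 8–10 are standard 2 × 2 contingency-table metrics. A minimal sketch using the textbook formulas (the counts in the usage note are illustrative, not taken from the study):

    ```python
    def contingency_scores(hits, misses, false_alarms, correct_negatives):
        """Standard 2x2 contingency-table metrics for dichotomous forecasts."""
        total = hits + misses + false_alarms + correct_negatives
        pod = hits / (hits + misses)                   # probability of detection
        success_ratio = hits / (hits + false_alarms)   # 1 - false alarm ratio
        # ETS discounts hits expected from random chance.
        hits_random = (hits + misses) * (hits + false_alarms) / total
        ets = (hits - hits_random) / (hits + misses + false_alarms - hits_random)
        return pod, success_ratio, ets
    ```

    For example, hits = 50, misses = 10, false alarms = 25, and correct negatives = 915 give POD ≈ 0.83, success ratio ≈ 0.67, and ETS ≈ 0.57.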

  • Fig. 11.

    Brier skill score calculated for the GEFS and EPS for all forecast lead times valid between 0000 UTC 1 Oct 2016 and 1200 UTC 30 Apr 2020: (a) at 38°N, 123°W; (b),(c) along the coast from 25° to 55°N on the west coast of North America; and (d) the difference between the two models along the coast. Statistical significance is calculated as in Fig. 8, but using 1000 random samples of 50-member populations of forecasts.
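
    The Brier skill score compares the mean squared error of the probability forecasts against a reference forecast. A minimal sketch, assuming the reference is the sample climatological frequency of the IVT ≥ 250 kg m−1 s−1 event (the study's actual reference choice is not restated here):

    ```python
    def brier_skill_score(probs, obs):
        """Brier skill score of probability forecasts `probs` (0-1) against
        binary outcomes `obs` (0/1), relative to the sample climatology."""
        n = len(probs)
        bs = sum((p - o) ** 2 for p, o in zip(probs, obs)) / n
        clim = sum(obs) / n                       # climatological event frequency
        bs_ref = sum((clim - o) ** 2 for o in obs) / n
        return 1.0 - bs / bs_ref
    ```

    Positive values indicate more skill than the climatological reference; a perfect forecast scores 1.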

  • Fig. 12.

    Success ratio for P250 forecasts ≥ 50% derived from a randomly selected 20-member ensemble using different combinations of ensemble members from the GEFS and EPS forecasts at 38°N, 123°W. The top row represents an ensemble of all EPS members, and the bottom row represents an ensemble of all GEFS members. The values are a mean success ratio calculated from repeating the random sampling 50 times.

  • Fig. 13.

    Probability of detection of the GEFS (red) and EPS (blue) for all forecast lead times of 0–15 days and three different P250 thresholds (≥25%, ≥50%, and ≥75%) for forecasts verifying with IVT magnitudes of (a) 250–500 kg m−1 s−1 and (b) ≥500 kg m−1 s−1.

  • Fig. 14.

    (a) Distribution of the difference between the average EPS and GEFS 6–8-day P250 forecasts (EPS minus GEFS) for all verification times at 38°N, 123°W. (b),(c) Forecast differences between the EPS and GEFS average 6–8-day P250 forecasts (EPS minus GEFS) for events that verified with IVT magnitudes of 250–500 kg m−1 s−1 in (b) and ≥500 kg m−1 s−1 in (c) at 38°N, 123°W. Positive values (blue) indicate a higher EPS forecast, while negative values (red) indicate a higher GEFS forecast.

  • Fig. 15.

    (a),(c) GEFS and (b),(d) EPS P250 forecasts prior to the top-10 precipitation events in Northern California that were also associated with landfalling ARs. Forecasts are partitioned prior to the time of maximum IVT magnitude in (a) and (b) and the first time with IVT magnitudes ≥ 250 kg m−1 s−1 (onset of event) in (c) and (d) at 38°N, 123°W. (e),(f) The top-10 average P250 forecast value and standard deviation of the top-10 P250 forecast values for both models at the time of maximum IVT and onset, respectively.
