1. Introduction
Numerical weather prediction (NWP) models are routinely evaluated by research scientists at modeling centers [e.g., the European Centre for Medium-Range Weather Forecasts (ECMWF) and the National Centers for Environmental Prediction (NCEP)], academic institutions, and companies offering forecast products to general users and customers. There can be various motivations for evaluating model forecasts, but one of the most common reasons is to ensure a steady improvement in forecast performance (e.g., accuracy, precision) as modeling systems have evolved and continue to evolve. Forecast model evaluation and verification make use of a wide-ranging suite of statistical metrics. At best, an optimal combination of these metrics can allow for rich insights into the forecast capabilities of a model, but at worst, failing to include and accurately analyze multiple complementary metrics can lead to a simplistic understanding of a model system (Murphy and Winkler 1987). However, desired outcomes from verification efforts differ depending on the information that is most meaningful to the end-user (Casati et al. 2008; Ebert et al. 2013). For example, NWP model developers focused on precipitation forecasts may require specific insights that are only available through a combination of traditional (e.g., contingency table statistics such as frequency bias) and spatial [e.g., fractions skill score and the Method for Object-Based Diagnostic Evaluation (MODE)] verification methods, such as those used by Wolff et al. (2014). Alternatively, some end-users may be interested in identifying model biases in order to adjust for forecast error using postprocessing techniques, which is the motivation for this paper. Knowledge of these biases not only helps to inform statistical postprocessing (e.g., Model Output Statistics; Glahn and Lowry 1972) but also provides motivation to remedy those existing biases through model updates.
Modern deterministic forecasting in the United States is largely dependent on three publicly accessible models: the Global Forecast System (GFS), North American Mesoscale Forecast System (NAM), and the High-Resolution Rapid Refresh (HRRR) model (NWS 2017). Retrospective and real-time forecast verification efforts of these and other modeling systems are usually focused on either the globe (if applicable) or various regions and atmospheric levels, leading to an understanding of model biases for forecast variables at the surface and aloft. Numerous studies have investigated model biases associated with GFS, NAM, and/or HRRR forecasts of various quantities, usually focusing on one or more of the following: 2-m temperature; 2-m dewpoint temperature; 10-m wind components, speed, or direction; precipitation accumulation over hourly or daily intervals; incoming/outgoing radiation; or several other variables of interest, including upper-air parameters such as geopotential height at various pressure levels.
Evaluation of precipitation forecasts can be, and has been, completed using several forecast verification metrics that focus on both the magnitude of accumulated precipitation and spatial characteristics of the forecast precipitation. Haiden et al. (2012) completed a global daily precipitation forecast verification for five global NWP models from the Canadian Meteorological Centre (CMC), Japan Meteorological Agency (JMA), NCEP, Met Office (UKMO), and ECMWF through use of the stable equitable error in probability space (SEEPS) score, which measures the ability of a model to discriminate among light, moderate, and heavy precipitation across different climatic regions. One main result was the overprediction of light precipitation, defined relative to the climatology of each included station, in both the tropics and extratropics: light precipitation was forecast when dry conditions were observed (overpredicting drizzle) and when heavy precipitation was observed (underpredicting heavy precipitation). Wolff et al. (2014) evaluated GFS and NAM precipitation forecasts with traditional, spatial, and object-based methods and provided further explanation of the prediction problems associated with precipitation over the contiguous United States. Using the MODE spatial verification technique, precipitation objects can be created from model quantitative precipitation forecast fields as well as gridded observation fields at any given threshold. Then, related statistics such as the area and total number of these objects (i.e., count) can be derived. The high frequency bias Wolff et al. (2014) identified at most precipitation thresholds during the winter was related to inflated precipitation object areas in the GFS and precipitation object counts in the NAM, and the low frequency bias was related to low object counts in the summer. Coarse horizontal grid resolution was hypothesized to lead to the issues of relatively large object areas and discrepancies in object counts. While NAM precipitation forecasts were less skillful than those of the GFS, their spatial distribution of precipitation aligned more closely with observations. Other studies focused on the forecast skill of regional models; Herman and Schumacher (2016) compared precipitation forecasts between the NAM 4-km, HRRR, and other NWP models and found that the NAM 4-km had a strong positive frequency bias across the United States, while the HRRR and other models tended to overforecast extreme events, especially in the West and Southwest regions. Additionally, Dougherty et al. (2021) focused on the comparative performance of 12-h precipitation forecasts from the HRRR, NAM 3-km, and the Naval Research Laboratory’s Coupled Ocean–Atmosphere Mesoscale Prediction System (COAMPS) across California during the 2018/19 winter season and found the HRRR to be the most skillful at predicting precipitation amounts, particularly for moderate rainfall events.
Expanding the focus beyond precipitation, HRRR forecasts have been evaluated with respect to various phenomena of interest, including but not limited to mesoscale convective systems (Pinto et al. 2015), wind variability within the Pacific Northwest (Pichugina et al. 2019), and surface-based forecasts of temperature, humidity, wind, and precipitation (Ikeda et al. 2013; Lee et al. 2019; Fovell and Gallagher 2020; Gallagher 2021; Min et al. 2021). Not only do the variables or phenomena of interest differ among these studies, but so do the model versions and time periods included in the analyses. Fovell and Gallagher (2020) completed an in-depth verification of the HRRRv3 in the boundary layer and at the surface over the contiguous United States, specifically focusing on near-surface wind speed and potential temperature as well as 2-m temperature during January and August 2013. They found that 10-m wind speed biases were negatively correlated with observed speed and 2-m temperature biases were strongly tied to station elevation: 75% of Automated Surface Observing System (ASOS) stations sited above 500 m above sea level had a warm bias. Ikeda et al. (2013) took a different approach and evaluated HRRRv1 2-m temperature, wet-bulb temperature, and instantaneous precipitation type forecasts during 39 winter precipitation events over the 2010/11 cold season. Analysis of 2-m temperature forecasts during these events indicated a slight cold bias, although the mean error and mean absolute error fell within the ASOS sensor accuracy of 0.5°C, and the forecasts were highly correlated (r > 0.96) with observations, even at the longest analyzed lead time of 8 h. Note that both of these studies provide valuable information about HRRR forecast biases, but only for limited periods of time or specific events.
In addition to the in-depth research studies discussed above, NWP model evaluation occurs in real time and is highly prioritized before upgrades to any operational model. Underscoring the importance of consistent forecast system evaluation in the United States, a new Environmental Modeling Center Verification System (EVS) is undergoing development (Levit et al. 2022) as of the time of writing. Meanwhile, NCEP Environmental Modeling Center’s Model Evaluation Group will continue model-upgrade centered verification, as this is necessary to fully understand the influence of the underlying model architecture changes on the forecast itself, especially in various weather situations. This type of work has also been explored by Caron and Steenburgh (2020), who evaluated and compared 24-h precipitation forecasts from the HRRRv2 and HRRRv3, as well as the GFSv14 and GFSv15 across the western United States. Analysis of contingency tables (which assess true/false positives and negatives for dichotomous forecasts) and their related statistics (e.g., hit rate, false alarm rate, equitable threat score) determined no statistically significant differences between the two HRRR versions, while statistically significant differences did exist between the GFS versions at higher precipitation thresholds and later in the forecast period. Similarly, the impact of model version updates on forecast error will be briefly explored within this paper.
While each of these studies provides some level of insight into the model systems of interest, one shortcoming is that they all focus on relatively short time frames, varying from 2 months to over 1 year. Additionally, while global and regional forecast evaluation exists for deterministic models with respect to specific forecast quantities, such as precipitation, there is a need for more subregional (i.e., statewide) verification that may better capture local errors and biases for a few societally impactful forecast parameters. A robust evaluation of operational model forecasts using reliable observations will not only provide insights and guidance to those who forecast across New York State (NYS), but will also provide the foundation for developing data-driven models of forecast uncertainty across seasons, lead times, and regions. As such, the goal of this work is to evaluate GFS, NAM, and HRRR model forecasts of 2-m temperature, 10-m wind speed, and 1- or 3-h precipitation accumulation, leveraging New York State Mesonet (NYSM) Standard Network surface observations. To cut down on repetition throughout the remainder of this paper, 2-m temperature and 10-m wind speed will be referenced without their height descriptors. Section 2 provides an overview of the model forecasts evaluated herein, along with the observations and metrics of interest. The impact of a recent GFS upgrade on its associated forecast error is explored in section 3. Forecast verification for the three quantities of interest follows in section 4, followed by a summary of the major findings in section 5.
2. Data and methods
a. Forecast and observation data
Operational forecasts from the 0.5° GFS, 12-km NAM, and 3-km HRRR models were obtained for each 0000 and 1200 UTC initialization between 1 January 2018 and 31 December 2021. Due to storage limitations, forecast hours were limited to the range of 0–96 h (interval of 3 h) for the GFS. All 18 forecast hours (interval of 1 h) were obtained for the HRRR along with all 84 forecast hours (interval of 1 h until hour 36, followed by 3-h interval for remaining hours) for the NAM. To continue with the verification task, a robust network of observations is required.
The NYSM is a meteorological observation network collecting quality-controlled 5-min observations across NYS. Both automated and manual quality assurance and quality control procedures are applied to all NYSM data in real time as well as on a daily, weekly, monthly, and annual basis (Brotzge et al. 2020). Each observation is automatically assigned a data quality flag of good, suspect, warning, or failure; publicly shared data do not include any data that were assigned a warning or failure flag. The flagged data are manually reviewed and issues are resolved either remotely or via a technician visit. All available NYSM data observed between 1 January 2018 and 31 December 2021 were downloaded for each of the 126 standard sites (Fig. 1). Observations of interest include temperature, precipitation accumulation, and wind speed. Wind speed is measured by redundant sensors across the NYSM, including both propeller and sonic anemometers (Brotzge et al. 2020). Note that propeller measurements of wind speeds less than 1 m s−1 were removed from this analysis, as propeller anemometers may not be able to accurately measure very slow wind speeds. Wind speeds above this threshold were from the sonic anemometer if those observations were available. If they were unavailable, the propeller anemometer measurements were used instead. Note that this methodology was adopted on 1 March 2018; before this date the reverse was true, where propeller anemometer observations were preferred over those from the sonic anemometer (Brotzge et al. 2020). To directly compare forecasts to NYSM observations, the GFS and NAM temperature and wind speed forecasts were linearly interpolated to the closest NYSM site location. As the HRRR grid is highly resolved, the nearest-neighbor grid points to each NYSM site were used (as in Min et al. 2021) for the temperature, wind speed, and precipitation forecasts instead of interpolation. Nearest-neighbor grid points were also used for the GFS and NAM precipitation forecasts. Note that only model grid cells over land were included in the interpolation and nearest-neighbor methods, which impacts the values at NYSM sites in close proximity to bodies of water (e.g., Atlantic Ocean, Lake Erie, Lake Ontario). Using a point verification method potentially introduces observation representation issues, especially for precipitation forecasts (Haiden et al. 2012), but this method does allow for the direct use of NYSM observations across NYS and an analysis of location-specific model performance. At each NYSM site location, these data were resampled to 1- and 3-h increments to be directly comparable to 1-h HRRR forecasts, 3-h GFS forecasts, and mixed 1- and 3-h forecasts from the NAM. With the exception of precipitation, the 5-min observations were used as the representative observation at each valid model output time (i.e., the top of each hour). Precipitation was summed over the 1- and 3-h time intervals preceding the valid time to obtain an accurate accumulation for that valid time. For example, the observed 1- and 3-h precipitation for a 0600 UTC valid time would be obtained by summing observations from 0505 to 0600 UTC and from 0305 to 0600 UTC, respectively. While observation error is not explicitly accounted for within this paper, the NYSM performs rigorous quality control and assurance on all data, as detailed above.
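As an illustrative sketch of this matching step (file and column names are assumptions, not the NYSM or authors' processing code), the 5-min observations can be aggregated to the 1- and 3-h values described above with pandas:

```python
import pandas as pd

# Load 5-min observations for one NYSM site (hypothetical file and column names).
obs = (pd.read_csv("nysm_site_5min.csv", parse_dates=["time"])
         .set_index("time")
         .sort_index())

# Precipitation: sum the 5-min values over the window ending at each valid time,
# e.g., observations from 0505 to 0600 UTC form the 1-h accumulation valid at 0600 UTC.
precip_1h = obs["precip_5min"].resample("1H", closed="right", label="right").sum()
precip_3h = obs["precip_5min"].resample("3H", closed="right", label="right").sum()

# Temperature and wind speed: keep the last 5-min observation in each window,
# i.e., the top-of-the-hour value, as the representative observation at the valid time.
temp_1h = obs["temp_2m"].resample("1H", closed="right", label="right").last()
wspd_1h = obs["wspd_10m"].resample("1H", closed="right", label="right").last()
```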
Fig. 1. A map of the 126 NYSM site locations (circles) and topography. The color fill of the dots represents the distinct climate division within which the site is located. The 10 divisions, their corresponding color, and the number of sites (N) associated with each division are provided on the rightmost colorbar.
New York State covers diverse terrain, including coastal regions in close proximity to the Atlantic Ocean, Lake Ontario, and Lake Erie; the Adirondack and Catskill Mountains; plateaus; and valleys. These regional terrain features can have impacts on local meteorological conditions that, to varying degrees, may also be captured by the GFS, NAM, and HRRR. Each NYSM standard network site is assigned to a climate division (NCEI 2015) to help characterize these variable meteorological conditions. The 10 climate divisions in NYS include the Central Lakes (N = 13), Great Lakes (16), St. Lawrence Valley (6), Champlain Valley (7), Mohawk Valley (6), Hudson Valley (20), Coastal (8), and the Northern (18), Eastern (23), and Western (9) Plateaus (Fig. 1). These climate divisions are used to investigate model verification statistics across NYS.
NYSM site and/or instrument downtime due to maintenance or weather-related issues leads to unusable or unavailable data. The percentage of all hourly observations throughout a calendar year that are “null” is calculated annually from 2018 to 2021 and provided in Table 1. Across 2018–21, temperature, wind speed, and precipitation observations are null for less than 1% of all observation times. Temperature observations contain the greatest percentage of null data, ranging from 0.69% to 0.91%, while precipitation observations have the least at ∼0.06%. This analysis indicates that null observations are infrequent throughout the network, which maximizes the amount of data available for model verification.
Table 1. Percentage of null 5-min NYSM observations annually.
b. Verification statistics
Table 2. The 2 × 2 contingency table used for precipitation verification.
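The contingency-table scores used in section 4c [frequency bias, probability of detection (POD), success ratio, critical success index (CSI), and equitable threat score (ETS)] are defined by equations in the full version of this section that are not reproduced in this excerpt. The following minimal sketch (illustrative function and variable names, not the authors' code) shows the standard forms of these scores in terms of the 2 × 2 counts:

```python
import numpy as np

def contingency_counts(forecast, observed, threshold):
    """2 x 2 counts for a dichotomous precipitation event defined by `threshold`."""
    fcst_yes = np.asarray(forecast) >= threshold
    obs_yes = np.asarray(observed) >= threshold
    a = np.sum(fcst_yes & obs_yes)     # hits
    b = np.sum(fcst_yes & ~obs_yes)    # false alarms
    c = np.sum(~fcst_yes & obs_yes)    # misses
    d = np.sum(~fcst_yes & ~obs_yes)   # correct negatives
    return a, b, c, d

def scores(a, b, c, d):
    """Standard verification scores derived from the 2 x 2 contingency table."""
    n = a + b + c + d
    pod = a / (a + c)                    # probability of detection
    sr = a / (a + b)                     # success ratio (1 - false alarm ratio)
    csi = a / (a + b + c)                # critical success index
    bias = (a + b) / (a + c)             # frequency bias
    a_r = (a + b) * (a + c) / n          # hits expected by random chance
    ets = (a - a_r) / (a + b + c - a_r)  # equitable threat score (Gilbert skill score)
    return {"POD": pod, "SR": sr, "CSI": csi, "Bias": bias, "ETS": ets}
```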
3. Operational model upgrades
NWP model architecture is upgraded on a regular basis in the pursuit of alleviating model biases and thereby improving forecast performance. During the 2018–21 period, the GFS underwent three updates: one on 12 July 2019 to the FV3-based GFS (v15.1), another shortly after on 7 November 2019 to GFSv15.2, and one on 22 March 2021 to GFSv16. The HRRR underwent two updates during this time period: HRRRv3 went into operations on 12 July 2018 and HRRRv4 on 2 December 2020. In contrast, the NAM received its final major upgrade, which went into operations on 21 March 2017. These updates have the potential to impact forecast biases, but the impacts can be nearly impossible to untangle without forecasts from both model versions being run over the same time period (i.e., in parallel).
Any potential differences in forecast behavior between model versions (e.g., GFSv15.2 versus GFSv16) can only be elucidated through statistical comparison of parallel forecasts. GFSv16 was run in parallel with GFSv15 starting with the 1200 UTC initialization on 17 October 2019 (A. Bentley 2022, personal communication). Evaluation of GFSv16 forecasts in both real time and for retrospective weather events indicated an improvement to forecast skill with respect to tropical cyclone genesis, snowstorms, and extreme rainfall events, among other phenomena (NOAA 2021). Although thorough comparative analyses exist for specific regions (e.g., North America/Pacific, tropics) and meteorological events, it is of interest to understand the impact of this model update on forecasts over NYS and to assess whether any differences in model biases are statistically significant. As such, all available parallel 0.5° GFSv16 data were downloaded to facilitate analysis of forecast error differences between GFSv15 and GFSv16. Comparisons between these two versions were completed using the available data for the model runs initialized at 0000 UTC from 3 November 2019 to 31 May 2020. A total of 208 forecast initializations were available during this time period, meaning that only three initializations were missing. Due to the lack of parallel data availability, additional comparisons between earlier GFS versions and any HRRR versions were not completed for this study. However, recall that Caron and Steenburgh (2020) did not identify statistically significant precipitation forecast differences between HRRRv2 and HRRRv3; HRRRv2 was the operational version during the first several months of 2018.
The focus of this comparative analysis is on temperature forecasts and their associated error. Joint distributions of temperature forecasts and error for GFSv15 and GFSv16 are provided in Figs. 2a and 2b, respectively. Qualitative comparison between these distributions reveals the most noticeable differences at the lowest frequencies, with fewer instances of highly negative error at near-freezing temperatures in GFSv16. Differences within the highest frequencies will have a greater impact on the statistical comparison and result in a larger relative change in frequency, as seen in Fig. 2c. GFSv16 temperature forecasts > 0°C are more frequently associated with cold biases and less frequently associated with warm biases. The opposite generally holds true when forecast temperatures are subfreezing in GFSv16. However, with such small normalized frequency differences between model versions, it is necessary to test for statistical significance. A nonparametric Mann–Whitney U test indicated that the GFSv15 and GFSv16 temperature error distributions have statistically significant differences.
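A minimal sketch of this significance test is shown below; the arrays here are placeholders, whereas in practice the pooled GFSv15 and GFSv16 temperature errors from the parallel period would be used.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Placeholder arrays standing in for the pooled 2-m temperature forecast errors
# (forecast minus observation, degC) from the parallel GFSv15 and GFSv16 runs;
# in practice these would come from the matched verification dataset.
rng = np.random.default_rng(0)
err_v15 = rng.normal(-0.3, 1.5, size=10_000)
err_v16 = rng.normal(-0.4, 1.5, size=10_000)

# Two-sided nonparametric test for a difference between the two error distributions.
stat, p_value = mannwhitneyu(err_v15, err_v16, alternative="two-sided")
print(f"U = {stat:.0f}, p = {p_value:.3g}")  # p < 0.05 would indicate a significant difference
```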
Fig. 2. Normalized joint histograms and the linear regression fit (red solid line) of 2-m temperature forecasts (°C) and associated forecast error (°C) in GFS versions (a) 15 and (b) 16 from 3 Nov 2019 to 31 May 2020. (c) The difference between the GFSv16 and GFSv15 joint histograms, with orange indicating a frequency increase and purple indicating a frequency decrease.
It is also of interest to identify any time dependence of the forecast bias differences. Forecast errors were averaged for each month from November 2019 through May 2020 and are compared between GFSv15 and GFSv16 in Fig. 3. While GFSv16 shows a similar or slightly reduced cold bias between November and January, it tends to have a stronger cold bias in the spring, which is likely associated with the boundary layer scheme changes and increased vertical resolution implemented in GFSv16. Moving forward into the following verification analyses, it is necessary to remember that the forecast error distribution differs between GFSv15 and GFSv16. Including multiple model versions in the same verification analysis could potentially lead to washed-out signals due to compensating errors or a dominating error contribution from a version that has been operational for a longer time period than other versions. With that said, model version changes occur regularly, and the value of a robust period of record may outweigh the implications of deriving bulk verification statistics from forecasts encompassing multiple model versions. To simplify the verification approach, the remainder of this paper disregards the model upgrades that occurred between 2018 and 2021, treating the GFS and HRRR each as a single model and allowing for comparisons among the GFS, NAM, and HRRR.
Fig. 3. Scatterplot comparison of the monthly average 2-m temperature error between GFSv15 and GFSv16 from November 2019 to May 2020. Relative to GFSv15, orange indicates a decrease in error in GFSv16 and purple indicates an increase in error in GFSv16.
4. Forecast verification
Verification statistics for temperature, wind speed, and 1- or 3-h precipitation accumulation were calculated on various subsets of the GFS, NAM, and HRRR data, including common forecast hours and ranges, seasons, and climate divisions. Analyses of both the 0000 and 1200 UTC initializations of each NWP model were completed, but this paper focuses solely on results from the 1200 UTC data. Aside from expected differences in the temperature and wind speed distributions (the local afternoon period is not represented in the 0000 UTC data because of the focus on the first 18 forecast hours), the 0000 UTC initialization was found to share considerable similarities in its error characteristics with the 1200 UTC initialization, especially in the diurnal variation of error.
Murphy and Winkler (1987) emphasized that a careful analysis of the joint distributions of forecast performance can provide increased insight into forecast quality (i.e., accuracy, bias, reliability, discrimination, etc.; Murphy 1993) compared to solely relying on summary statistics that may be oversimplified. The following three sections include multiple realizations of similar or complementary metrics in order to gain a deep understanding of the model errors and biases associated with the GFS, NAM, and HRRR over NYS. The verification of temperature and wind speed is similarly structured, while the precipitation verification involves various statistical measures derived from the contingency table described in section 2.
a. 2-m temperature
Comparisons of observed temperature with forecast error allow for an understanding of NWP biases with respect to various temperatures. For example, if a forecast system has historically had a considerable cold bias when extremely warm temperatures were observed, it is important for this bias to be investigated and corrected in addition to a meteorologist accounting for that bias within their forecasting process. Daily averages of observed NYSM station temperature and associated forecast error are shown in Figs. 4a–c for the GFS, NAM, and HRRR. Only fhour ≤ 18 are included in this analysis to facilitate comparison among models. Linear regression fits to these data aid in identifying any existing relationship between observations and forecast error. Interestingly, as observed temperatures warm, the cool temperature bias in the GFS and the warm temperature bias in the NAM both become slightly more prominent (Figs. 4a,b), as seen in the linear fits. However, the NAM does exhibit a consistent warm bias, particularly when observed temperatures are colder than −10°C. The strongest relationship between observations and forecast error exists in the HRRR: forecasts are too cool when observed temperatures are in the lower end of the temperature climatology (e.g., from near freezing through −20°C) and too warm when observed temperatures are in the upper end of the temperature climatology (e.g., above freezing; Fig. 4c). Each of these relationships between the observed temperature and associated forecast error (Fig. 4, top row) holds true when considering forecast temperature and its associated error (Fig. 4, bottom row). Further, the relationship between the daily mean observed temperature and forecast error is unique to each forecast model, which suggests that the particular combinations of modeling choices, such as boundary and surface layer parameterizations and grid resolution, distinctly impact temperature forecasts and the resultant error characteristics.
Fig. 4. NYSM (top) observed and (bottom) forecast 2-m temperature compared to forecast error in the (a),(d) GFS, (b),(e) NAM, and (c),(f) HRRR for fhour ≤ 18. Each dot represents a daily mean value for each NYSM site location, where the color fill indicates the density of points (blue is less dense, yellow is more dense). Zero error is shown by the gray solid line, and the linear regression fit is shown by the red solid line. Annotated on each panel are the number of points (N) included in the analysis as well as the correlation (r) and the coefficient of determination (r2) of the linear regression.
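A minimal sketch of the kind of computation behind the per-model fits in Fig. 4 is given below; the long-format layout, file name, and column names are assumptions, not the authors' code.

```python
import pandas as pd
from scipy.stats import linregress

# Matched forecast-observation pairs in long format, one row per site, valid time,
# and lead time, with 2-m temperature forecasts, observations, and
# error = forecast - observation (hypothetical file and column names).
matched = pd.read_parquet("gfs_nysm_temperature_matched.parquet")
subset = matched[matched["fhour"] <= 18]

# Daily mean observed temperature and forecast error at each NYSM site.
daily = (subset
         .groupby(["station", subset["valid_time"].dt.date])
         .agg(obs_mean=("obs_temp", "mean"), err_mean=("error", "mean"))
         .reset_index())

# Linear regression of daily mean error on daily mean observed temperature,
# analogous to the red fit lines in Figs. 4a-c.
fit = linregress(daily["obs_mean"], daily["err_mean"])
print(f"slope = {fit.slope:.3f}, r = {fit.rvalue:.2f}, r^2 = {fit.rvalue**2:.2f}")
```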
The linear fits in Fig. 4 are helpful for gleaning the general error characteristics of each model, but as seen in the NAM (Fig. 4b), there can be patterns of error specific to certain observed and forecast temperature ranges. The split between warm and cold biases across forecast temperatures from −40° to 40°C, in 5°C intervals, is provided in Fig. 5. The forecast errors associated with very cold and very warm temperatures in all three models are either positively or negatively skewed, while more neutral errors are found at more frequently forecast temperatures (Fig. 5). Figure 5 broadly follows the trends seen in Fig. 4, but with more of a focus on the forecast error associated with discrete temperature forecast ranges. Within the GFS, a cold bias is slightly more frequent at temperatures below 25°C, while the frequency of a warm bias increases above this threshold (Fig. 5a). The forecast error of the NAM is less clear-cut than that of the GFS, as the bias is either equally split between warm and cold (i.e., neutral, on average) or switches between warm and cold multiple times throughout the considered temperature ranges (Fig. 5b). The HRRR has a relatively smooth progression from a cold forecast bias at the coldest temperature ranges to a neutral and then warm forecast bias at the warmest temperature ranges (Fig. 5c). Overall, each model struggles with a warm bias when forecasting temperatures above 30°C. This warm bias is also prevalent in the NAM at temperatures below −30°C, whereas the HRRR exhibits a cold bias at these coldest temperatures.
Fig. 5. Percentage of raw (1- or 3-h) forecasts in multiple temperature forecast ranges in intervals of 5°C, from −40° to 40°C for the (a) GFS, (b) NAM, and (c) HRRR. The percentage of negative errors (e.g., cold error) is indicated by negative values and blue bars whereas the percentage of positive errors (e.g., warm error) is indicated by positive values and red bars. The magnitude of the bar in each temperature range sums to 100%.
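The percentages in Fig. 5 amount to counting the sign of the forecast error within 5°C forecast-temperature bins; a minimal sketch of that calculation (synthetic placeholder data and assumed column names, not the authors' code) is:

```python
import numpy as np
import pandas as pd

# Synthetic placeholder for matched pairs with forecast temperature ("fcst_temp")
# and error = forecast - observation ("error"); assumed column names.
rng = np.random.default_rng(0)
fcst = rng.uniform(-40, 40, size=50_000)
matched = pd.DataFrame({"fcst_temp": fcst,
                        "error": rng.normal(loc=0.1 * fcst / 40, scale=1.5)})

# 5 degC forecast-temperature bins from -40 to 40 degC, as in Fig. 5.
matched["fcst_bin"] = pd.cut(matched["fcst_temp"], np.arange(-40, 45, 5))

# Percentage of warm (error > 0) and cold (error < 0) errors in each bin;
# the two values sum to ~100% per bin (exact zeros are negligible).
split = (matched.groupby("fcst_bin", observed=True)["error"]
         .agg(warm_pct=lambda e: 100.0 * (e > 0).mean(),
              cold_pct=lambda e: 100.0 * (e < 0).mean()))
print(split)
```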
The general takeaways from the previous two analyses relating overall and daily mean forecast errors to temperature forecasts are as follows.
- The GFS has a cold bias at all forecast temperatures <25°C.
- There is a less direct relationship between most forecast temperatures and error in the NAM. However, a warm forecast bias exists at both temperature extremes.
- The HRRR progresses smoothly from a cold bias at the coolest forecast temperature ranges to a warm bias at the warmest forecast temperature ranges.
- A warm bias exists in the GFS, NAM, and HRRR at the warmest (>30°C) forecast temperatures.
It is of interest to identify any additional deviations from these generalizations if they exist, starting with variations with lead time (i.e., forecast hour). The average temperature error for the GFS, NAM, and HRRR is shown at each available lead time in Fig. 6. The magnitude and/or sign of the forecast error tend to vary with lead time, especially for the GFS and NAM, and particularly during the transition from daytime to nighttime (defined as between local standard times of 1700 and 2100 because sunset time varies throughout the year; Fig. 6, yellow shade). This is not entirely unexpected given the impact that the diurnal cycle can have on forecast biases. The transitions from day to night and from night to day are associated with sharp changes in error. It is notable that the early morning hours are best captured by the GFS, on average, as its maxima in forecast error are the closest to zero around this time period. Note that the overarching cool bias of the GFS seen in Figs. 4 and 5 is especially evident in Fig. 6.
Fig. 6. Average 2-m temperature error at each available forecast hour for the GFS (blue solid line), NAM (purple dotted line), and HRRR (orange dashed line). The zero temperature error is indicated with a horizontal black dashed line. The transition to local overnight is shaded in yellow.
Seasonal variations in forecast error are also relevant to explore. Figure 7 shows the temperature bias associated with temperature forecast ranges in intervals of 5°C for fhour ≤ 18 and each meteorological season, where winter includes December–February (DJF), spring includes March–May (MAM), summer includes June–August (JJA), and fall includes September–November (SON). The percentage of temporally aligned observations that fall into each forecast range is annotated on the lower half of each box, indicating the frequency of accurate forecasts within a given range. For example, when the HRRR forecast temperatures between −35° and −30°C during DJF, it had a cold bias of −3.8°C, with only 29% of the associated observations falling within that range. At least two forecasts must fall into a forecast range for it to be included in this analysis. Tables 3 and 4 provide the same error data shown in Fig. 7, but averaged across forecast temperature ranges and seasons, respectively.
Fig. 7. Seasonally averaged 2-m temperature error for multiple temperature forecast ranges in intervals of 5°C, from −40° to 40°C. Only fhour ≤ 18 are considered. The rows include data for each season clustered by model. The first four rows show data from the GFS for DJF, MAM, JJA, and SON. The same seasonal order is then repeated for the NAM and then the HRRR, from top to bottom. Each box within the heatmap is color coded by the mean temperature error, where red represents a warm bias and blue represents a cold bias. The temperature bias is annotated in the upper half of each box and the percentage of observations that fell within the respective forecast range is annotated in the lower half, indicating the frequency of accurate forecasts within a given range.
Table 3. The 2-m temperature error (°C) averaged across all seasons in Fig. 7 for each model (columns) and temperature range (rows). Bold text in each row indicates the value with the lowest error magnitude (i.e., the best model statistic).
Table 4. The 2-m temperature error (°C) averaged across all forecast temperature ranges in Fig. 7 for each model (columns) and season (rows). Bold text in each row indicates the value with the lowest error magnitude (i.e., the best model statistic).
The bias characteristics of the GFS were broadly consistent across the seasons, with a slight cold bias at all but the warmest temperature ranges (Fig. 7, Table 4), which aligns with the general cold bias seen in Figs. 4d, 5a, and 6. In all seasons but DJF, the right tail of the GFS forecast distribution is associated with a warm bias, which is consistent with the analysis in Fig. 5a but is not seen in Fig. 4d, as these temperatures occur infrequently in each respective season (not shown). Within the NAM, a warm bias is found during DJF (0.73°C), MAM (0.41°C), and SON (1.15°C) at subfreezing forecast temperature ranges, but DJF differs from the other seasons with its consistent cool bias at above-freezing temperatures. The general cool-to-warm bias seen in Figs. 4f and 5c for the HRRR is also prevalent in all seasons, although it is less prominent during JJA. Min et al. (2021) identified a similar warm-season warm bias across NYS in HRRRv3, albeit for daily maximum temperature forecasts, which was ultimately related to incoming shortwave radiation biases caused by issues modeling low-level cloud cover.
It is also of interest to examine the biases associated with the NAM and GFS forecasts later in the forecast period, specifically at 18 < fhour ≤ 84 (Fig. 8). Tables 5 and 6 provide the same data averaged over forecast temperature ranges and seasons, respectively. There are only a few noticeable differences as compared to fhour ≤ 18 (Fig. 7). The GFS now has a warm bias at temperatures > 15°C in DJF. Additionally, the NAM bias is no longer nearly neutral but instead mostly cool for temperature ranges > −5°C in MAM. Note that because the study herein and the results shown in Fig. 7 are for a 1200 UTC initialization time and, typically, only the first 18 forecast hours are analyzed to allow direct comparison among the three models, those analyses lack data from the 0600–1200 UTC time frame. Hence, Fig. 8 not only represents data that include a 6-h period (0600–1200 UTC) not considered in Fig. 7, but it also captures the expected general increase in forecast error with increasing lead time, which primarily contributes to the differences between Figs. 7 and 8. Unsurprisingly, these results indicate that forecast biases are in flux over multiple time dimensions (e.g., forecast hour and season). If the biases identified so far were to be used within a forecasting process, care would need to be taken to ensure that the dependencies on lead time, time of year, and, as explored in the following paragraphs, geographic location are properly accounted for.
Fig. 8. As in Fig. 7, but only including the GFS and NAM at 18 < fhour ≤ 84.
Table 5. The 2-m temperature error (°C) averaged across all seasons in Fig. 8 for each model (columns) and temperature range (rows). Bold text in each row indicates the value with the lowest error magnitude (i.e., the best model statistic).
Table 6. The 2-m temperature error (°C) averaged across all forecast temperature ranges in Fig. 8 for each model (columns) and season (rows). Bold text in each row indicates the value with the lowest error magnitude (i.e., the best model statistic).
Forecast errors can and do, of course, also vary in space. To understand the forecast performance of the GFS, NAM, and HRRR within the climate divisions introduced in section 2, and to capture the forecast variance, the RMSE was calculated for each month of forecast initialization and each climate division. In this way, any seasonal to subseasonal temporal and/or regional dependence can be explored. Recall that RMSE values closer to 0 indicate more accurate forecasts.
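A minimal sketch of this aggregation (file and column names are assumptions, not the authors' code):

```python
import numpy as np
import pandas as pd

# Matched pairs with forecast error (degC), initialization time, and the climate
# division of each site (hypothetical file and column names).
matched = pd.read_parquet("temperature_matched_pairs.parquet")

# RMSE for each climate division and initialization month; values closer to zero
# indicate more accurate forecasts, as noted above.
rmse = (matched
        .assign(init_month=matched["init_time"].dt.to_period("M"))
        .groupby(["climate_division", "init_month"])["error"]
        .apply(lambda e: np.sqrt(np.mean(np.square(e)))))
print(rmse.head())
```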
Box-and-whisker plots of the monthly temperature RMSE within each climate division are presented in Fig. 9. It was anticipated that the GFS would perform poorly (i.e., have inflated RMSE) across all climate divisions due to its coarse horizontal resolution and, conversely, that the HRRR would consistently yield the most accurate temperature forecasts due to its 3-km horizontal resolution. Instead, the NAM has the lowest median RMSE in all climate divisions other than the Northern Plateau and St. Lawrence Valley. This is likely related to the largely balanced, near-neutral warm and cool errors across most temperature ranges seen in Fig. 7. The accuracy of HRRR temperature forecasts is especially reduced in the Central Lakes, Champlain Valley, and Coastal climate divisions, where the median RMSE surpasses the 75th percentile of both competing models. With its slight cool error (Fig. 7), the GFS has a reduced RMSE as compared to the HRRR but is generally at or slightly below the accuracy of the NAM across the climate divisions.
Fig. 9. Box-and-whisker plots of 2-m temperature RMSE (°C) calculated with respect to NYSM climate divisions and seasons. Within each climate division, a box-and-whisker plot is provided for the GFS (light gray), NAM (medium gray), and HRRR (dark gray). Overlaid on each are the monthly RMSE values color coded by season, where DJF is blue, MAM is purple, JJA is green, and SON is orange.
As the focus shifts to seasonal differences, RMSE is largely minimized in SON among all models and climate divisions (Fig. 9). With the highest RMSEs across most climate divisions, the HRRR struggles with temperature forecasts especially in JJA. This behavior suggests that the difficulty the HRRR has in forecasting warm afternoon temperatures (Figs. 4–6) in the summer season (Fig. 7) places the model at a particular disadvantage relative to the GFS and NAM.
b. 10-m wind
A similar suite of verification statistics is used to evaluate wind speed forecasts across NYS. Daily NYSM station-mean statistics are used to identify relationships between wind speed bias and observed (Figs. 10a–c) and forecast (Figs. 10d–f) wind speeds. The wind speed errors tend to maximize at slower observed wind speeds and minimize at higher observed wind speeds (Figs. 10a–c). In other words, low observed winds are frequently overforecast and high observed winds are underforecast (but to a lesser degree), which aligns with the findings of Fovell and Cao (2014) and Fovell and Gallagher (2020). Gallagher (2021) similarly found HRRR 10-m wind forecasts to be biased high as compared to ASOS network observations, even though ASOS stations often report windier conditions than the NYSM. However, low wind observations are associated with a greater magnitude of forecast error than high wind observations (not shown). Increasing forecast wind speed is associated with increased overforecasting in the GFS and NAM, but less so in the HRRR (Figs. 10d–f). A consistent relationship exists between the daily mean observations and forecasts among the models, which differs from the model-specific relationships identified in the same analysis for temperature (Figs. 4a–c). This consistency suggests a common issue across these models leading to inaccurate forecasts of near-surface wind speed. The high bias at low wind speeds suggests a process-related issue that leads to a tendency to forecast wind even when little or no wind is observed. Fovell and Cao (2014) noted that forecast skill is very much dependent on the land surface model parameterization, specifically in how surface roughness is handled. This is intricately connected to the land-use category within NWP models. Gallagher (2021) identified a relationship between the model land-use category and forecast wind speed biases, with wind speeds over urban, grasslands/open shrublands, and croplands consistently underpredicted by the HRRR in NYS and those over all forested categories overpredicted. Hence, the general overprediction reflects the fact that 43% of NYSM sites are located in forested areas (Gallagher 2021).
Fig. 10. As in Fig. 4, but for (a)–(c) observed and (d)–(f) forecast 10-m wind speed.
The variation of the average wind speed error with lead time is shown in Fig. 11. As anticipated from the analysis of Figs. 10d–f, the wind speed bias tends to be positive (i.e., the observed wind speed is overforecast) at each forecast hour. The fluctuation of the forecast error, specifically for the GFS and NAM, follows the expected increase in forecast wind speed during the daytime and calmer winds during the nighttime. However, the range of forecast error in the NAM exceeds that of both the HRRR and the GFS, demonstrating that the NAM in particular has difficulty forecasting wind speeds during the daytime. That bias decreases considerably during the overnight hours, although not to a magnitude that outperforms the GFS at those later lead times.
Fig. 11. As in Fig. 6, but for 10-m wind speed error.
The seasonally averaged wind speed bias for each model and forecast range, in intervals of 2 m s−1 from 0 to 22 m s−1, is presented in Fig. 12. Tables 7 and 8 provide the same data averaged over wind speed ranges and seasons, respectively. Again, wind speeds are overforecast at all ranges in each season, especially as the forecast wind speeds increase. There is a hint of some seasonality, as the wind speed errors in JJA are at times elevated when forecast wind speeds exceed 10 m s−1. The greatest percentage of observations falling within the forecast wind speed range occurs at 0–2 m s−1, meaning that forecast accuracy is maximized at the lowest forecast wind speeds. Further, the greatest forecast wind speed ranges (e.g., ≥18 m s−1) are rarely, if ever, observed. These findings are similar for the GFS and NAM at 18 < fhour ≤ 84 (not shown).
Fig. 12. As in Fig. 7, but for 10-m wind speed error for multiple wind speed forecast ranges in intervals of 2 m s−1, from 0 to 22 m s−1.
Table 7. The 10-m wind speed error (m s−1) averaged across all seasons in Fig. 12 for each model (columns) and forecast wind speed range (rows). Bold text in each row indicates the value with the lowest error magnitude (i.e., the best model statistic).
Table 8. The 10-m wind speed error (m s−1) averaged across all forecast wind speed ranges in Fig. 12 for each model (columns) and season (rows). Bold text in each row indicates the value with the lowest error magnitude (i.e., the best model statistic).
With wind speed errors consistent among models, the distinct over- and underperformance of the models across climate divisions that was evident in the temperature analysis in section 4a is not present in Fig. 13. No single model outperforms another to the degree seen in Fig. 9, but the GFS does hold a competitive edge over the other models in the Champlain Valley, Eastern Plateau, and Hudson Valley, where its median RMSE is below the 25th percentile of both the NAM and HRRR. Also, the NAM performs particularly poorly in the Great Lakes and Northern Plateau. Wind speed forecast accuracy is maximized in JJA for all models, which is likely due to lower average wind speed forecasts during the summer season. The least accurate wind speed forecasts occur in DJF and MAM due to the overprediction at higher forecast wind speeds during these two seasons (Fig. 12).
Fig. 13. As in Fig. 9, but for 10-m wind speed RMSE (m s−1).
c. Precipitation
The final parameter to be explored in this paper is 1- and 3-h forecasts of precipitation accumulation. As touched upon earlier in this section, the verification of 1- or 3-h precipitation accumulation will use a different framework than that of temperature and wind speed to allow for verification at multiple thresholds of interest. Much of the focus within this penultimate section rests on the contingency table metrics introduced in section 2b. Contingency tables were calculated for 1-h precipitation thresholds of 0.1, 0.5, 1, 1.5, 2, 2.5, 3, 4, and 5 mm across various forecast hours and for all NYSM site locations. To facilitate one-to-one forecast hour and model comparison, these 1-h thresholds are adjusted for the 3-h precipitation accumulation forecasts provided by all forecast hours in the GFS and fhour > 36 in the NAM.
The bias score was calculated using Eq. (3) for the aforementioned precipitation forecast thresholds. As the frequency bias compares the number of forecast events to the number of observed events, a value of 1 indicates an unbiased forecast, while a value above or below 1 points toward a high or low frequency bias, respectively. A high frequency bias indicates that a precipitation event is forecast more often than it is observed, suggesting a wet bias, while a low frequency bias indicates the opposite, suggesting a dry bias.
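For reference, Eq. (3) is not reproduced in this excerpt; in terms of the 2 × 2 contingency-table counts [hits (a), false alarms (b), and misses (c)], the frequency bias takes its standard form, matching the bias expression in the sketch in section 2b above,

\mathrm{Bias} = \frac{a + b}{a + c},

i.e., the ratio of the number of “yes” forecasts to the number of “yes” observations.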
The frequency bias at each threshold and forecast hour in the GFS, NAM, and HRRR is shown in Fig. 14. A wet bias exists at precipitation rates below ∼1 mm at all forecast hours in the GFS and NAM, while in the HRRR the bias at a given forecast hour tends to be consistently wet or dry across all thresholds. Dry biases tend to exist at greater precipitation thresholds, suggesting that moderate-to-heavy precipitation tends to be underforecast by the GFS and NAM, but especially the GFS. Within the GFS, the bias becomes increasingly dry during the local afternoon (i.e., forecast hours 6–12, 30–36, etc.) throughout the forecast period, particularly at thresholds above 1.5 mm. The NAM and HRRR also experience a pronounced afternoon dry bias, but unlike the GFS they then transition into a wet bias during the overnight hours. The forecast hour dependency of the frequency bias in the GFS and NAM could be useful information for forecasters relying on these models, especially since other bulk metrics, such as a performance diagram, smooth these details out.
Fig. 14. Frequency bias of 1- and 3-h precipitation rate at multiple thresholds and forecast hours for the (a) GFS, (b) NAM, and (c) HRRR. Green indicates a high frequency (wet) bias and purple indicates a low frequency (dry) bias.
An assessment of how often precipitation events are correctly forecast complements the frequency bias analysis. This information is provided by the ETS, also referred to as the Gilbert skill score, which was calculated from the contingency table at the same thresholds as the frequency bias using Eq. (5). The ETS measures the fraction of correctly forecast events relative to all observed and forecast events, with an adjustment for hits that occur by random chance; hence, a larger ETS indicates a more skillful forecast. ETS values are shown across precipitation thresholds and forecast hours in Fig. 15, similar to the frequency bias in Fig. 14. Comparing across common forecast hours, the ETS is similar for all models in that they exhibit the greatest, or most skillful, ETS when the threshold is <1 mm and within approximately the first 6 h of the forecast period, capturing the morning to early afternoon hours. The ETS tends to decrease for a given threshold throughout the forecast period, indicating that skillful predictions of events at the tested thresholds diminish with longer lead times. Unlike the frequency bias, the ETS shows considerably reduced diurnal fluctuations in the GFS and NAM, although increases in ETS are evident during the overnight hours (e.g., forecast hours 18–24 and 42–48). The reduced ETS at higher precipitation thresholds underscores the difficulty of skillfully forecasting such events, as seen by the dry biases in the GFS and HRRR and oscillating wet/dry biases in the NAM (Fig. 14).
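Equation (5) is likewise not reproduced here; the standard form of the ETS, consistent with the ets expression in the section 2b sketch, uses the number of hits expected by random chance, a_r:

a_r = \frac{(a + b)(a + c)}{a + b + c + d}, \qquad \mathrm{ETS} = \frac{a - a_r}{a + b + c - a_r},

where d is the number of correct negatives.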
Fig. 15. Equitable threat score of 1- and 3-h precipitation rate at multiple thresholds and forecast hours for the (a) GFS, (b) NAM, and (c) HRRR. The score is annotated on the contours by intervals of 0.05.
The ETS and frequency bias analyses above are rich in detail about diurnal fluctuations, but model-to-model comparisons can be burdensome to derive from them. To help alleviate this issue, pairs of success ratio and POD are plotted in the performance diagrams shown in Fig. 16 for each precipitation threshold and model. For fhour ≤ 18, each of the three NWP models produces its best forecasts for light precipitation (lighter marker fills), as those rates maximize the POD (y axis) and success ratio (x axis), although with a high frequency bias (dashed lines) that decreases to unity (i.e., neutral bias) at thresholds of 0.5 mm in the HRRR and 1.5 mm in the GFS and NAM (Fig. 16a). With increasing precipitation thresholds (darkening marker fills), the GFS transitions from a high (wet) to a low (dry) frequency bias while the NAM and HRRR transition from a high to a consistently neutral bias, following the trends in Fig. 14. Note that the diurnal variation in frequency bias in the NAM and HRRR appears neutral here because of compensation between the high and low frequency bias values in the first 18 forecast hours (Fig. 14). This indicates that while light precipitation may be well forecast, those events are forecast more frequently than observed, and that heavier precipitation is forecast as often (HRRR) or less often (NAM) than observed. Although the GFS has POD values similar to those of the NAM and HRRR, its success ratio consistently remains around 0.4–0.5, indicating fewer false alarms from the GFS at higher precipitation thresholds. At later forecast times (Fig. 16b), the GFS and NAM have reduced forecast accuracy, as indicated by reduced CSI, and comparatively lower POD and success ratios. However, the GFS continues to outperform the NAM with greater success ratios at thresholds above 1 mm, aligning with its performance during earlier forecast hours (Fig. 16a). In summary, as precipitation thresholds increase, the frequency bias of the NAM and HRRR decreases, yet their forecasts are less trustworthy due to an increased false alarm ratio, whereas the bias of the GFS becomes drier yet more trustworthy, as it is associated with a consistent, relatively low false alarm ratio.
Fig. 16. Performance diagrams for each success ratio–probability of detection pair at each precipitation threshold of 0.1, 0.5, 1, 1.5, 2, 2.5, 3, 4, and 5 mm where (a) fhour ≤ 18 and (b) 18 < fhour ≤ 84. Data pairs represent metrics calculated across the entire NYSM network and over the 4-yr time period. The marker color fill darkens with increasing threshold value. GFS values are shown as purple triangles, NAM values as green squares, and HRRR values as red circles. The color filled contours are the critical success index and the dashed lines are the frequency bias.
5. Summary and conclusions
GFS, NAM, and HRRR forecasts of 2-m temperature, 10-m wind speed, and accumulated precipitation initialized at 1200 UTC over New York State (NYS) were verified using NYS Mesonet (NYSM) observations from the 126 standard network sites during 1 January 2018–31 December 2021. Gridded forecasts from the GFS and NAM were interpolated to each NYSM site location, whereas a nearest-neighbor method was used for HRRR forecasts. Only model grid cells over a land surface were included in the interpolation and nearest-neighbor methods. Focus was first placed on evaluating changes to temperature forecast error with the update of the GFS to version 16. Statistically significant differences were identified in the temperature forecast error distributions, which suggests caution when analyzing these data in a bulk sense. Within the remainder of the paper, analysis of each forecast quantity considered temporal (e.g., forecast hour, month, season) and/or spatial (e.g., climate division) effects on verification metrics. Both forecast error and root-mean-square error were used to analyze temperature and wind speed forecasts, while contingency table statistics provided the foundation for the precipitation verification.
The temperature forecast errors associated with each model were unique, especially when considered across different temperature ranges. The GFS, NAM, and HRRR were all found to have dominant warm biases when forecast temperatures were extremely warm, specifically at temperatures > 30°C. At temperatures < 30°C, the GFS more frequently had a cool bias. The NAM tended to have a warm bias not only at cold extremes, but also at warm forecast temperatures, whereas the HRRR tended toward a cold bias as the forecast temperatures decreased. The accuracy of the HRRR temperature forecasts degraded in the summer season as the model struggled to forecast warm afternoon temperatures. Within this warmer temperature range, both the GFS and NAM shifted toward varying degrees of cold bias. All models provided the most accurate forecasts during the fall season, independent of the climate division. While the GFS and HRRR exhibited similar bias characteristics across seasons (e.g., consistent warm or cool biases in each temperature range), the NAM had a cool bias at above-freezing temperatures that was only identified in the winter season. Despite this, the NAM had the highest relative temperature forecast accuracy among the models in most of the climate divisions. These results illustrate the level of specificity that is required to provide the most representative biases for each season and region of interest, as it has been demonstrated that biases calculated over a relatively small area such as NYS or over a small range of years may not generalize well when applied to different temporal or regional subsets.
Unlike the temperature errors, the character of the wind speed error remained consistent across the GFS, NAM, and HRRR. Low observed wind speeds are frequently overforecast and high observed wind speeds are underforecast by each of the NWP models, which previous literature attributes to land-use category and surface roughness. Errors peaked in the winter and spring seasons, which are associated with increased average wind speeds. Lower wind speeds in the summer season led to the most accurate wind speed forecasts from all models and across all climate divisions.
Precipitation accumulation forecasts were the final variable analyzed herein. The NAM and especially the GFS were found to have a dry bias associated with 1- or 3-h precipitation > 1.5 mm. This dry bias is persistent throughout the first 84 forecast hours of the GFS and intensifies during the afternoon hours. The opposite occurs in the NAM, where the afternoon hours are associated with a wet bias at these same thresholds. The HRRR follows the same diurnal change as the NAM, but at all precipitation thresholds. The NAM and HRRR were found to have near-unity frequency biases for all but the lowest precipitation thresholds, but with an increased false alarm ratio their forecasts may be considered less trustworthy than GFS forecasts. At the same thresholds, GFS forecasts have a relatively low false alarm ratio but also a frequency bias < 1. When considering the 1200 UTC initializations of these models, there is thus a trade-off between the frequency bias and the false alarm ratio. With the GFS, high-threshold events may be forecast less frequently than they are observed as compared to the NAM and HRRR, but when they are forecast they are more likely to be observed. This underscores that one's tolerance for a false alarm versus a missed event must be taken into consideration when depending on a specific deterministic precipitation forecast.
This verification effort serves as a necessary foundation for future work that will focus on the development of machine learning (ML) models to predict the forecast error associated with near-surface temperature, wind speed, and accumulated precipitation over a given forecast period for any NYSM site location across NYS. The verification statistics discussed herein, along with other forecast variables, will be used as features from which the ML models learn to predict the forecast error for a given variable (e.g., 2-m temperature). This application is similar to Model Output Statistics (MOS; Klein and Glahn 1974), but will predict model error rather than an adjusted forecast value, ideally leading to an assessment of forecast uncertainty at various lead times and locations across NYS.
Acknowledgments.
Funding for this research was provided by the National Science Foundation Grant ICER-2019758. This research is made possible by the New York State (NYS) Mesonet. The authors thank Arnoldas Kurbanovas of the Extreme Collaboration, Innovation, and Technology (xCITE) Laboratory for gathering and storing the NWP data needed for this paper. Original funding for the NYS Mesonet was provided by Federal Emergency Management Agency Grant FEMA-4085-DR-NY, with the continued support of the NYS Division of Homeland Security and Emergency Services; the state of New York; the Research Foundation for the State University of New York (SUNY); the University at Albany, SUNY; the Atmospheric Sciences Research Center (ASRC) at SUNY Albany; and the Department of Atmospheric and Environmental Sciences (DAES) at SUNY Albany.
Data availability statement.
GFS and NAM data were accessed from the National Centers for Environmental Information. HRRR data were accessed from the University of Utah MesoWest HRRR archive (Blaylock et al. 2017) and from Amazon Web Services via the Python package Herbie. NYSM data can be requested at http://nysmesonet.org.
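For readers retrieving HRRR output in the same way, a minimal Herbie sketch follows. It assumes Herbie's documented interface (class name, keyword arguments, and GRIB search string), which may differ across package versions.

```python
# Retrieve a single HRRR surface forecast field via Herbie.
from herbie import Herbie

# 6-h HRRR surface forecast initialized 1200 UTC 1 Jul 2021.
H = Herbie("2021-07-01 12:00", model="hrrr", product="sfc", fxx=6)

# Open only the 2-m temperature field as an xarray Dataset.
ds = H.xarray(":TMP:2 m above ground")
```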
REFERENCES
Blaylock, B. K., J. D. Horel, and S. T. Liston, 2017: Cloud archiving and data mining of high-resolution rapid refresh forecast model output. Comput. Geosci., 109, 43–50, https://doi.org/10.1016/j.cageo.2017.08.005.
Brotzge, J. A., and Coauthors, 2020: A technical overview of the New York State Mesonet standard network. J. Atmos. Oceanic Technol., 37, 1827–1845, https://doi.org/10.1175/JTECH-D-19-0220.1.
Caron, M., and W. J. Steenburgh, 2020: Evaluation of recent NCEP operational model upgrades for cool-season precipitation forecasting over the western conterminous United States. Wea. Forecasting, 35, 857–877, https://doi.org/10.1175/WAF-D-19-0182.1.
Casati, B., and Coauthors, 2008: Forecast verification: Current status and future directions. Meteor. Appl., 15, 3–18, https://doi.org/10.1002/met.52.
Dougherty, K. J., J. D. Horel, and J. E. Nachamkin, 2021: Forecast skill for California heavy precipitation periods from the high-resolution rapid refresh model and the Coupled Ocean–Atmosphere Mesoscale Prediction System. Wea. Forecasting, 36, 2275–2288, https://doi.org/10.1175/WAF-D-20-0182.1.
Ebert, E., and Coauthors, 2013: Progress and challenges in forecast verification. Meteor. Appl., 20, 130–139, https://doi.org/10.1002/met.1392.
Fovell, R. G., and Y. Cao, 2014: Wind and gust forecasting in complex terrain. 15th WRF Users Workshop, Boulder, CO, NCAR, 5A.2, http://www2.mmm.ucar.edu/wrf/users/workshops/WS2014/ppts/5A.2.pdf.
Fovell, R. G., and A. Gallagher, 2020: Boundary layer and surface verification of the High-Resolution Rapid Refresh, version 3. Wea. Forecasting, 35, 2255–2278, https://doi.org/10.1175/WAF-D-20-0101.1.
Gallagher, A. R., 2021: Exploring environmental and methodological sensitivities of forecasted and observed surface winds and gusts using underutilized datasets. M.S. thesis, Dept. of Atmospheric and Environmental Sciences, University at Albany, State University of New York, 283 pp.
Glahn, H. R., and D. A. Lowry, 1972: The use of Model Output Statistics (MOS) in objective weather forecasting. J. Appl. Meteor., 11, 1203–1211, https://doi.org/10.1175/1520-0450(1972)011<1203:TUOMOS>2.0.CO;2.
Haiden, T., M. J. Rodwell, D. S. Richardson, A. Okagaki, T. Robinson, and T. Hewson, 2012: Intercomparison of global model precipitation forecast skill in 2010/11 using the SEEPS score. Mon. Wea. Rev., 140, 2720–2733, https://doi.org/10.1175/MWR-D-11-00301.1.
Herman, G. R., and R. S. Schumacher, 2016: Extreme precipitation in models: An evaluation. Wea. Forecasting, 31, 1853–1879, https://doi.org/10.1175/WAF-D-16-0093.1.
Ikeda, K., M. Steiner, J. Pinto, and C. Alexander, 2013: Evaluation of cold-season precipitation forecasts generated by the hourly updating high-resolution rapid refresh model. Wea. Forecasting, 28, 921–939, https://doi.org/10.1175/WAF-D-12-00085.1.
Klein, W. H., and H. R. Glahn, 1974: Forecasting local weather by means of Model Output Statistics. Bull. Amer. Meteor. Soc., 55, 1217–1227, https://doi.org/10.1175/1520-0477(1974)055<1217:FLWBMO>2.0.CO;2.
Lee, T. R., M. Buban, D. D. Turner, T. P. Meyers, and C. B. Baker, 2019: Evaluation of the High-Resolution Rapid Refresh (HRRR) model using near-surface meteorological and flux observations from northern Alabama. Wea. Forecasting, 34, 635–663, https://doi.org/10.1175/WAF-D-18-0184.1.
Levit, J. J., G. S. Manikin, A. M. Bentley, L. C. Dawson, and T. L. Jensen, 2022: Development and planning of the new environmental modeling center verification system (EVS). 27th Conf. on Probability and Statistics and 31st Conf. on Weather Analysis and Forecasting (WAF)/27th Conf. on Numerical Weather Prediction (NWP), Online, Amer. Meteor. Soc., J2.2, https://ams.confex.com/ams/102ANNUAL/meetingapp.cgi/Paper/391554.
Min, L., D. R. Fitzjarrald, Y. Du, B. E. J. Rose, J. Hong, and Q. Min, 2021: Exploring sources of surface bias in HRRR using New York State Mesonet. J. Geophys. Res. Atmos., 126, e2021JD034989, https://doi.org/10.1029/2021JD034989.
Murphy, A. H., 1993: What is a good forecast? An essay on the nature of goodness in weather forecasting. Wea. Forecasting, 8, 281–293, https://doi.org/10.1175/1520-0434(1993)008<0281:WIAGFA>2.0.CO;2.
Murphy, A. H., and R. L. Winkler, 1987: A general framework for forecast verification. Mon. Wea. Rev., 115, 1330–1338, https://doi.org/10.1175/1520-0493(1987)115<1330:AGFFFV>2.0.CO;2.
NCEI, 2015: U.S. climate divisions. National Centers for Environmental Information, accessed 3 August 2023, https://www.ncei.noaa.gov/access/monitoring/dyk/us-climate-divisions.
NOAA, 2021: NOAA upgrades flagship U.S. global weather model. National Oceanic and Atmospheric Administration, accessed 7 April 2022, https://www.noaa.gov/media-release/noaa-upgrades-flagship-us-global-weather-model.
NWS, 2017: About models. National Weather Service, accessed 6 August 2023, https://www.weather.gov/about/models.
Pichugina, Y. L., and Coauthors, 2019: Spatial variability of winds and HRRR–NCEP model error statistics at three Doppler-lidar sites in the wind-energy generation region of the Columbia River basin. J. Appl. Meteor. Climatol., 58, 1633–1656, https://doi.org/10.1175/JAMC-D-18-0244.1.
Pinto, J. O., J. A. Grim, and M. Steiner, 2015: Assessment of the high-resolution rapid refresh model’s ability to predict mesoscale convective systems using object-based evaluation. Wea. Forecasting, 30, 892–913, https://doi.org/10.1175/WAF-D-14-00118.1.
Roebber, P. J., 2009: Visualizing multiple measures of forecast quality. Wea. Forecasting, 24, 601–608, https://doi.org/10.1175/2008WAF2222159.1.
Wolff, J. K., M. Harrold, T. Fowler, J. H. Gotway, L. Nance, and B. G. Brown, 2014: Beyond the basics: Evaluating model-based precipitation forecasts using traditional, spatial, and object-based methods. Wea. Forecasting, 29, 1451–1472, https://doi.org/10.1175/WAF-D-13-00135.1.