Search Results
You are looking at 1 - 10 of 98 items for
- Author or Editor: Thomas M. Hamill
Abstract
High-quality, high-resolution, hourly, unbiased surface (2-m) temperature analyses are needed for many applications, including the training and validation of statistical postprocessing methods. These temperature analyses are often generated through data assimilation procedures, whereby a short-range gridded background forecast is adjusted toward newly available observations. Even with frequent updates to newly available observations, surface temperature analysis errors and biases can be large relative to the errors and biases of midtropospheric variables, especially over land, despite the greater density of near-surface in situ observations. Larger near-surface errors may have several causes, including biased background forecasts and the spatial heterogeneity of surface temperatures that results from subgrid-scale surface, vegetation, land-use, and terrain variations. Are biased raw background forecasts the predominant cause of surface temperature analysis errors? Part I of this two-part series describes a simple benchmark for evaluating the error characteristics of short-term (1-h) raw model background surface temperature forecasts. For stations with a relatively complete time series of data, it is possible to generate an hourly, diurnally, and seasonally dependent observation climatology at each station. The benchmark adds the deviation of the current hour’s temperature observation from that hour’s and Julian day’s climatology to the climatology for the next hour. For contiguous U.S. stations in July 2015, the station benchmark had lower error than interpolated 1-h high-resolution numerical predictions of surface temperature from NOAA’s High-Resolution Rapid Refresh (HRRR) system, although those HRRR predictions did not include full postprocessing. For August 2018, 1-h HRRR forecasts were much improved when tested against the station benchmark.
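The benchmark described above is essentially a persistence-of-anomaly forecast: the current observation’s departure from its hour- and Julian-day-specific climatology is carried forward and added to the next hour’s climatology. A minimal sketch of that arithmetic in Python follows; the array shapes, function names, and the simple multiyear mean used for the climatology are illustrative assumptions, not the paper’s exact construction.

```python
import numpy as np

def hourly_climatology(temps_year_day_hour):
    """Mean temperature for each (Julian day, hour) over all available years.

    `temps_year_day_hour` is a hypothetical station record with shape
    (n_years, 366, 24); missing observations are NaN and are ignored.
    """
    return np.nanmean(temps_year_day_hour, axis=0)   # shape (366, 24)

def station_benchmark(obs_now, clim, day, hour):
    """Benchmark value for the next hour: the next hour's climatology plus
    the current hour's observed departure from climatology."""
    anomaly = obs_now - clim[day, hour]
    next_hour = (hour + 1) % 24
    next_day = (day + (hour + 1) // 24) % 366
    return clim[next_day, next_hour] + anomaly
```

In Part I this simple benchmark is the yardstick against which interpolated 1-h HRRR forecasts are judged; the sketch only illustrates the benchmark arithmetic itself.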
Abstract
Common methods for the postprocessing of deterministic 2-m temperature (T2m) forecasts over the United States were evaluated at leads from +12 to +120 h. Forecast data were extracted from the Global Ensemble Forecast System (GEFS) v12 reforecast dataset and thinned to a ½° grid. Analyzed data from the European Centre/Copernicus reanalysis (ERA5) were used for training and validation, with the 2000–18 period used for training and 2019 forecasts validated. The postprocessing methods compared were the raw forecast guidance, a decaying-average bias correction (DAV), quantile mapping (QM), a univariate model output statistics (uMOS) algorithm, and a multivariate MOS (mvMOS) algorithm. The mvMOS algorithm used the raw forecast temperature, the DAV adjustment, and the QM adjustment as predictors. All of the postprocessing methods reduced the root-mean-square error (RMSE) and bias relative to the raw guidance. QM produced forecasts with slightly higher error than DAV, while DAV estimates were the most consistent from day to day. The uMOS and mvMOS algorithms produced statistically significantly lower RMSEs than DAV at forecast leads longer than 1 day, with mvMOS exhibiting the lowest error. Taylor diagrams showed that the MOS methods reduced the variability of the forecasts while improving forecast–analysis correlations, and QM and DAV modified the distribution of the forecasts to more closely match that of the analyzed data. A main conclusion is that a judicious statistical combination of guidance from multiple postprocessing methods can produce forecasts with better error statistics than any single technique. Because each method applied here is algorithmically simple, this suggests that operational deterministic postprocessing that combines multiple correction methods could produce improved T2m guidance.
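Two of the simpler adjustments named above, the decaying-average bias correction and quantile mapping, can be stated compactly. The Python sketch below assumes a single location and lead time; the weight `alpha`, the empirical mapping, and the function names are illustrative assumptions rather than the paper’s exact settings.

```python
import numpy as np

def update_decaying_average_bias(prev_bias, forecast, analysis, alpha=0.05):
    """Decaying-average (DAV-style) bias update: blend the previous bias
    estimate with the latest forecast-minus-analysis error.  The corrected
    forecast is then simply `forecast - bias`.  `alpha` is an assumed weight."""
    return (1.0 - alpha) * prev_bias + alpha * (forecast - analysis)

def quantile_map(raw_value, train_forecasts, train_analyses):
    """Empirical quantile mapping: locate the raw forecast's non-exceedance
    probability in the training-forecast distribution, then return the
    corresponding quantile of the training-analysis distribution."""
    sorted_fcst = np.sort(train_forecasts)
    p = np.searchsorted(sorted_fcst, raw_value) / len(sorted_fcst)
    return np.quantile(train_analyses, np.clip(p, 0.0, 1.0))
```

The mvMOS combination described in the abstract would then use the raw temperature together with these two adjustments as regression predictors.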
Abstract
During the period 9–16 September 2013, more than 17 in. (~432 mm) of rainfall fell over parts of Boulder County, Colorado, with more than 8 in. (~203 mm) over a wide swath of Colorado’s northern Front Range. This caused significant flash and river flooding, loss of life, and extensive property damage. The event set a record for daily rainfall in Boulder (9.08 in., or >230 mm), nearly double the previous daily record of 4.8 in. (122 mm) set on 31 July 1919. The operational performance of precipitation forecast guidance from global ensemble prediction systems and from the National Weather Service’s global and regional forecast systems during this event is documented briefly in the article and more extensively in online supplemental appendixes. While the precipitation forecast guidance uniformly depicted a much wetter-than-average period over northeastern Colorado, neither the global modeling systems nor most of the regional systems predicted precipitation amounts as heavy as analyzed. Notable exceptions were the Short-Range Ensemble Forecast (SREF) members that used the Advanced Research Weather Research and Forecasting Model (ARW-WRF) dynamical core; these members consistently produced record rainfall over the Front Range. However, these members also predicted record rainfall for the day before the heaviest observed precipitation as well as for the day of the heaviest precipitation.
Abstract
Probabilistic quantitative precipitation forecasts (PQPFs) were generated from The Observing System Research and Predictability Experiment (THORPEX) Interactive Grand Global Ensemble (TIGGE) database from July to October 2010 using data from Europe (ECMWF), the United Kingdom [Met Office (UKMO)], the United States (NCEP), and Canada [Canadian Meteorological Centre (CMC)]. Forecasts of 24-h accumulated precipitation were evaluated at 1° grid spacing within the contiguous United States against analysis data based on gauges and bias-corrected radar data.
PQPFs from ECMWF’s ensembles generally had the highest skill of the raw ensemble forecasts, followed by CMC. Those of UKMO and NCEP were less skillful. PQPFs from CMC forecasts were the most reliable but the least sharp, and PQPFs from NCEP and UKMO ensembles were the least reliable but sharper.
Multimodel PQPFs were more reliable and skillful than individual ensemble prediction system forecasts. The improvement was larger for heavier precipitation events [e.g., >10 mm (24 h)⁻¹] than for smaller events [e.g., >1 mm (24 h)⁻¹].
ECMWF ensembles were statistically postprocessed using extended logistic regression and the five-member weekly reforecasts for the June–November periods of 2002–09, the period for which precipitation analyses were also available. Multimodel ensembles were also postprocessed using logistic regression and the last 30 days of prior forecasts and analyses. The reforecast-calibrated ECMWF PQPFs were much more skillful and reliable for the heavier precipitation events than the raw ECMWF forecasts, but much less sharp. Raw multimodel PQPFs were generally more skillful than reforecast-calibrated ECMWF PQPFs for the light precipitation events and had about the same skill for the heavier events; they were also sharper but somewhat less reliable than the ECMWF reforecast-based PQPFs. Postprocessing improved the multimodel PQPFs less than the reforecast-based calibration improved the ECMWF forecasts.
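The extended logistic regression used to calibrate the ECMWF reforecasts includes the precipitation threshold itself as a predictor, so a single fitted equation yields a probability for any amount. The sketch below shows only that fitted form; the square-root transforms, the coefficient names b0, b1, b2, and the use of the ensemble-mean precipitation as the predictor are common but assumed choices, and the fitting step on the reforecast training sample is omitted.

```python
import numpy as np

def pqpf_extended_logistic(ens_mean_precip, threshold, b0, b1, b2):
    """Probability that 24-h precipitation exceeds `threshold`, given an
    ensemble-mean forecast amount, using an extended logistic regression in
    which the threshold enters as an additional predictor.  The coefficients
    are assumed to come from a prior fit to reforecasts and analyses."""
    z = b0 + b1 * np.sqrt(ens_mean_precip) + b2 * np.sqrt(threshold)
    prob_not_exceeding = 1.0 / (1.0 + np.exp(-z))   # P(obs <= threshold)
    return 1.0 - prob_not_exceeding                 # P(obs >  threshold)
```

One set of coefficients therefore provides PQPFs for both the >1 mm and >10 mm (24 h)⁻¹ events evaluated above.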
The evidence presented here suggests that all operational centers, even ECMWF, would benefit from the open, real-time sharing of precipitation forecast data and the use of reforecasts.
Abstract
A global reforecast dataset was recently created for the National Centers for Environmental Prediction’s Global Ensemble Forecast System (GEFS). This reforecast dataset consists of retrospective and real-time ensemble forecasts produced with the GEFS from 1985 to the present day. An 11-member ensemble was produced once daily, out to a +15-day lead time, from 0000 UTC initial conditions. While the forecast model was stable during the production of this dataset, the forecast model used within the data assimilation system changed significantly in 2011 and several times thereafter, as did the assimilation system itself and the observations that were assimilated. These changes resulted in substantial changes in the statistical characteristics of the reforecast dataset. Such changes make it problematic to use reforecasts uncritically for statistical postprocessing, which commonly assumes that forecast error and bias are approximately consistent from one year to the next. Ensuring consistency in the statistical characteristics of past and present initial conditions is desirable but can be in tension with the expectation that prediction centers upgrade their forecast systems rapidly.
Abstract
No abstract available.
Abstract
When evaluating differences between competing precipitation forecasts, formal hypothesis testing is rarely performed. This may be due to the difficulty of applying common tests given the spatial correlation and non-normality of forecast errors. Possible ways around these difficulties are explored here. Two datasets of precipitation forecasts are evaluated: a set of two competing gridded precipitation forecasts from operational weather prediction models, and sets of competing probabilistic quantitative precipitation forecasts from model output statistics and from an ensemble of forecasts. For each test, data from each competing forecast are collected into one sample per case day to avoid problems with spatial correlation. Several possible hypothesis-test methods are then evaluated: the paired t test, the nonparametric Wilcoxon signed-rank test, and two resampling tests. The more involved resampling methodology is the most appropriate when testing threat scores from nonprobabilistic forecasts, while the simpler paired t test or Wilcoxon test is appropriate for testing the skill of probabilistic forecasts evaluated with the ranked probability score.
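Below is a minimal Python sketch of the paired tests discussed above, applied to hypothetical day-by-day ranked probability scores for two competing probabilistic forecast systems. The resampling shown is a simple sign-permutation of the paired daily differences, not necessarily the more involved threat-score resampling recommended for the nonprobabilistic forecasts.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical daily ranked probability scores (lower is better),
# paired by case day, for two competing forecast systems.
rps_a = rng.gamma(shape=2.0, scale=0.05, size=90)
rps_b = rps_a + rng.normal(loc=0.005, scale=0.02, size=90)

# Paired t test on the day-by-day score differences.
t_stat, p_t = stats.ttest_rel(rps_a, rps_b)

# Nonparametric Wilcoxon signed-rank test on the same pairs.
w_stat, p_w = stats.wilcoxon(rps_a, rps_b)

# Simple resampling test: randomly flip the sign of each daily difference
# and compare the observed mean difference with the permutation distribution.
diffs = rps_a - rps_b
observed = diffs.mean()
flips = rng.choice([-1.0, 1.0], size=(10000, diffs.size))
p_resample = np.mean(np.abs((flips * diffs).mean(axis=1)) >= abs(observed))

print(f"paired t: p={p_t:.3f}  Wilcoxon: p={p_w:.3f}  resampling: p={p_resample:.3f}")
```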
Abstract
Forecasters often develop rules of thumb for adjusting model guidance. Ideally, before use, these rules of thumb should be validated through a careful comparison of model forecasts and observations over a large sample. Practically, such evaluation studies are difficult to perform because forecast models are continually being changed, and a hypothesized rule of thumb may only be applicable to a particular forecast model configuration.
A particular rule of thumb was examined here: dprog/dt. Given a set of lagged forecasts from the same model, all verifying at the same time, this rule of thumb suggests that if the forecasts show a trend, the trend is more likely than not to continue and thus provide useful information for correcting the most recent forecast. Forecasters may also use the degree of continuity among the lagged forecasts to estimate the magnitude of the error in the most recent forecast.
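Under one simple reading of the rule, the run-to-run change between the two most recent lagged forecasts is extrapolated one more step and used to adjust the latest forecast. The Python snippet below illustrates only that extrapolation arithmetic with hypothetical 850-hPa temperature values; it is not the paper’s full evaluation procedure.

```python
# Hypothetical 850-hPa temperature forecasts (deg C) from three successive
# runs of the same model, oldest first, all verifying at the same time.
lagged = [-3.0, -2.0, -1.0]

# dprog/dt: assume the run-to-run trend continues one more step and use it
# to adjust the most recent forecast.
trend = lagged[-1] - lagged[-2]
adjusted = lagged[-1] + trend
print(f"latest forecast: {lagged[-1]:.1f} C, trend-adjusted: {adjusted:.1f} C")
```

As the evaluation below shows, extrapolating this trend added little value over the 23-yr reforecast sample.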
Statistical evaluation of this rule of thumb was made possible here using a dataset of forecasts from a “frozen” model. A 23-yr record of forecasts was generated from a T62 version of the medium-range forecast model used at the National Centers for Environmental Prediction. Forecasts were initialized from reanalysis data, and January–March forecasts were examined for selected locations. The rule dprog/dt was evaluated with 850-hPa temperature forecasts. A total of 2070 sample days were used in the evaluation.
Extrapolation of the forecast trends was shown to have little forecast value. There was also only a small amount of information about forecast accuracy in the degree of discrepancy between short-term lagged forecasts. The lack of validity of this rule of thumb suggests that other rules of thumb should also be carefully scrutinized before use.