Search Results
Showing 1–10 of 98 items for Author or Editor: Thomas M. Hamill
Abstract
Rank histograms are a tool for evaluating ensemble forecasts. They are useful for determining the reliability of ensemble forecasts and for diagnosing errors in the ensemble mean and spread. Rank histograms are generated by repeatedly tallying the rank of the verification (usually an observation) relative to the values of an ensemble sorted from lowest to highest. However, uncritical use of the rank histogram can lead to misinterpretations of the qualities of that ensemble. For example, a flat rank histogram, usually taken as a sign of reliability, can still be generated from unreliable ensembles. Similarly, a U-shaped rank histogram, commonly understood as indicating a lack of ensemble variability, can also be a sign of conditional bias. It is also shown that flat rank histograms can be generated for some model variables if the variance of the ensemble is correctly specified, yet if covariances between model grid points are improperly specified, rank histograms for combinations of model variables may not be flat. Further, if imperfect observations are used for verification, the observational errors should be accounted for; otherwise, the shape of the rank histogram may mislead the user about the characteristics of the ensemble. If a statistical hypothesis test is to be performed to determine whether deviations from rank uniformity are statistically significant, then the samples used to populate the rank histogram must be located far enough apart in time and space to be considered independent.
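To make the tallying procedure concrete, here is a minimal Python sketch of rank-histogram construction, assuming an (n_cases × n_members) ensemble array and one verifying observation per case; the random tie-breaking is an illustrative choice, not a detail specified in the abstract.

```python
import numpy as np

def rank_histogram(ensemble, observations, rng=None):
    """Tally the rank of each observation within its sorted ensemble.

    ensemble: array of shape (n_cases, n_members)
    observations: array of shape (n_cases,)
    Returns bin counts of length n_members + 1.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_cases, n_members = ensemble.shape
    counts = np.zeros(n_members + 1, dtype=int)
    for members, obs in zip(ensemble, observations):
        # Rank = number of members strictly below the observation;
        # ties are broken randomly so tied values do not pile into one bin.
        below = np.sum(members < obs)
        ties = np.sum(members == obs)
        counts[below + rng.integers(0, ties + 1)] += 1
    return counts

# Example: a reliable 10-member ensemble drawn from the same distribution
# as the verifying "truth" should yield a roughly flat histogram.
rng = np.random.default_rng(0)
ens = rng.normal(size=(5000, 10))
obs = rng.normal(size=5000)
print(rank_histogram(ens, obs, rng))
```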
Abstract
Common methods for the postprocessing of deterministic 2-m temperature (T2m) forecasts over the United States were evaluated from +12- to +120-h lead. Forecast data were extracted from the Global Ensemble Forecast System (GEFS) v12 reforecast dataset and thinned to a ½° grid. Analyzed data from the European Centre/Copernicus reanalysis (ERA5) were used for training and validation. Data from the 2000–18 period were used for training, and 2019 forecasts were validated. The postprocessing methods compared were the raw forecast guidance, a decaying-average bias correction (DAV), quantile mapping (QM), a univariate model output statistics (uMOS) algorithm, and a multivariate MOS (mvMOS) algorithm. The mvMOS algorithm used the raw forecast temperature, the DAV adjustment, and the QM adjustment as predictors. All of the postprocessing methods reduced the root-mean-square error (RMSE) and bias relative to the raw guidance. QM produced forecasts with slightly higher error than DAV, and DAV estimates were the most consistent from day to day. The uMOS and mvMOS algorithms produced statistically significantly lower RMSEs than DAV at forecast leads longer than 1 day, with mvMOS exhibiting the lowest error. Taylor diagrams showed that the MOS methods reduced the variability of the forecasts while improving forecast-analyzed correlations. QM and DAV modified the distribution of the forecasts to more closely match that of the analyzed data. A main conclusion is that the judicious statistical combination of guidance from multiple postprocessing methods can produce forecasts with improved error statistics relative to any individual technique. As each method applied here is algorithmically relatively simple, this suggests that operational deterministic postprocessing combining multiple correction methods could produce improved T2m guidance.
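As an illustration of the decaying-average idea, the sketch below updates a running bias estimate with each new forecast/analysis pair and subtracts the latest estimate from the next raw forecast; the weighting factor alpha and the toy numbers are assumptions for illustration, not values from the study.

```python
import numpy as np

def decaying_average_bias(forecasts, analyses, alpha=0.02):
    """Sketch of a decaying-average (DAV-style) bias estimate.

    At each step the running bias is relaxed toward the latest
    forecast-minus-analysis error with weight alpha.  alpha=0.02 is an
    illustrative choice, not the value used in the paper.
    """
    bias = 0.0
    bias_series = []
    for f, a in zip(forecasts, analyses):
        bias = (1.0 - alpha) * bias + alpha * (f - a)
        bias_series.append(bias)
    return np.array(bias_series)

# Correct today's raw forecast by subtracting yesterday's bias estimate.
fcst = np.array([20.5, 21.0, 19.8, 22.1])   # invented T2m forecasts (deg C)
anal = np.array([19.9, 20.2, 19.5, 21.0])   # invented verifying analyses
bias = decaying_average_bias(fcst, anal)
corrected = fcst[1:] - bias[:-1]
print(corrected)
```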
Abstract
A global reforecast dataset was recently created for the National Centers for Environmental Prediction’s Global Ensemble Forecast System (GEFS). This reforecast dataset consists of retrospective and real-time ensemble forecasts produced with the GEFS from 1985 to the present day. An 11-member ensemble was produced once daily to a +15-day lead time from 0000 UTC initial conditions. While the forecast model was held stable during the production of this dataset, in 2011 and several times thereafter there were significant changes to the forecast model used within the data assimilation system, as well as changes to the assimilation system itself and to the observations that were assimilated. These changes resulted in substantial changes in the statistical characteristics of the reforecast dataset. Such changes make it problematic to use the reforecasts uncritically for statistical postprocessing, which commonly assumes that forecast error and bias are approximately consistent from one year to the next. Ensuring consistency in the statistical characteristics of past and present initial conditions is desirable but can be in tension with the expectation that prediction centers upgrade their forecast systems rapidly.
Abstract
During the period 9–16 September 2013, more than 17 in. (~432 mm) of rainfall fell over parts of Boulder County, Colorado, with more than 8 in. (~203 mm) over a wide swath of Colorado’s northern Front Range. This caused significant flash and river flooding, loss of life, and extensive property damage. The event set a record for daily rainfall in Boulder (9.08 in., or >230 mm) that was nearly double the previous daily record of 4.8 in. (122 mm), set on 31 July 1919. The operational performance of precipitation forecast guidance from global ensemble prediction systems and the National Weather Service’s global and regional forecast systems during this event is documented briefly in the article and more extensively in online supplemental appendixes. While the precipitation forecast guidance uniformly depicted a much wetter-than-average period over northeastern Colorado, none of the global modeling systems, and only a minority of the regional systems, predicted precipitation amounts as heavy as were analyzed. Notable exceptions were the Short-Range Ensemble Forecast (SREF) members that used the Advanced Research Weather Research and Forecasting Model (ARW-WRF) dynamical core; these members consistently produced record rainfall in the Front Range. However, the SREF predicted record rainfall both on the day before the heaviest observed precipitation and on the day of the heaviest precipitation.
Abstract
Probabilistic quantitative precipitation forecasts (PQPFs) were generated from The Observing System Research and Predictability Experiment (THORPEX) Interactive Grand Global Ensemble (TIGGE) database from July to October 2010 using data from Europe (ECMWF), the United Kingdom [Met Office (UKMO)], the United States (NCEP), and Canada [Canadian Meteorological Centre (CMC)]. Forecasts of 24-h accumulated precipitation were evaluated at 1° grid spacing within the contiguous United States against analysis data based on gauges and bias-corrected radar data.
PQPFs from ECMWF’s ensembles generally had the highest skill of the raw ensemble forecasts, followed by CMC. Those of UKMO and NCEP were less skillful. PQPFs from CMC forecasts were the most reliable but the least sharp, and PQPFs from NCEP and UKMO ensembles were the least reliable but sharper.
Multimodel PQPFs were more reliable and skillful than individual ensemble prediction system forecasts. The improvement was larger for heavier precipitation events [e.g., >10 mm (24 h)⁻¹] than for smaller events [e.g., >1 mm (24 h)⁻¹].
ECMWF ensembles were statistically postprocessed using extended logistic regression and the five-member weekly reforecasts for the June–November period of 2002–09, the period for which precipitation analyses were also available. Multimodel ensembles were also postprocessed using logistic regression and the last 30 days of prior forecasts and analyses. The reforecast-calibrated ECMWF PQPFs were much more skillful and reliable for the heavier precipitation events than the ECMWF raw forecasts but much less sharp. Raw multimodel PQPFs were generally more skillful than reforecast-calibrated ECMWF PQPFs for the light precipitation events but had about the same skill for the heavier events; they were also sharper but somewhat less reliable than the ECMWF reforecast-based PQPFs. Postprocessed multimodel PQPFs did not provide as much improvement to the raw multimodel PQPF as the reforecast-based processing did to the ECMWF forecast.
The evidence presented here suggests that all operational centers, even ECMWF, would benefit from the open, real-time sharing of precipitation forecast data and the use of reforecasts.
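As a rough illustration of the reforecast-based calibration described above, the following sketch fits an extended logistic regression in which the precipitation threshold is itself a predictor, so one fitted equation yields probabilities for any amount. The single ensemble-mean predictor, the square-root transforms, and the toy data are illustrative assumptions, not the exact configuration used in the study.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # numerically stable logistic function

def fit_extended_logistic_regression(ens_mean, obs, thresholds):
    """Minimal sketch of extended logistic regression for PQPF.

    Models P(obs <= q) = expit(b0 + b1*sqrt(ens_mean) + b2*sqrt(q)), so one
    fitted equation yields probabilities for any threshold q.
    """
    # Each (case, threshold) pair becomes one binary training sample.
    X, y = [], []
    for f, o in zip(ens_mean, obs):
        for q in thresholds:
            X.append([1.0, np.sqrt(f), np.sqrt(q)])
            y.append(1.0 if o <= q else 0.0)
    X, y = np.array(X), np.array(y)

    def neg_log_likelihood(beta):
        p = np.clip(expit(X @ beta), 1e-10, 1.0 - 1e-10)
        return -np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

    return minimize(neg_log_likelihood, np.zeros(3), method="BFGS").x

# Toy training data (mm per 24 h), then the probability that precipitation
# exceeds 10 mm for a new case with an ensemble-mean forecast of 8 mm:
beta = fit_extended_logistic_regression(
    ens_mean=np.array([0.0, 2.0, 5.0, 12.0, 30.0]),
    obs=np.array([0.0, 1.0, 8.0, 15.0, 28.0]),
    thresholds=[1.0, 10.0, 25.0],
)
p_le_10 = expit(beta @ np.array([1.0, np.sqrt(8.0), np.sqrt(10.0)]))
print("P(> 10 mm) ≈", 1.0 - p_le_10)
```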
Abstract
High-quality, high-resolution, unbiased hourly surface (2-m) temperature analyses are needed for many applications, including the training and validation of statistical postprocessing. These temperature analyses are often generated through data assimilation procedures, whereby a background short-range gridded forecast is adjusted toward newly available observations. Even with frequent updates to newly available observations, surface temperature analysis errors and biases can be comparatively large relative to those of midtropospheric variables, especially over land, despite the greater number of near-surface in situ observations. Larger near-surface errors may have several causes, including biased background forecasts and the spatial heterogeneity of surface temperatures that results from subgrid-scale surface, vegetation, land-use, and terrain variations. Are biased raw background forecasts the predominant cause of surface temperature analysis errors? Part I of this two-part series describes a simple benchmark for evaluating the error characteristics of short-term (1-h) raw model background surface temperature forecasts. For stations with a relatively complete time series of data, it is possible to generate an hourly, diurnally, and seasonally dependent observation climatology at a station. The deviation of the current hour’s temperature observation with respect to that hour’s and Julian day’s climatology is added to the climatology for the next hour. For contiguous U.S. stations in July 2015, the station benchmark had lower error than interpolated 1-h high-resolution numerical predictions of surface temperature from NOAA’s High-Resolution Rapid Refresh (HRRR) system, although the HRRR forecasts did not include full postprocessing. For August 2018, 1-h HRRR forecasts were much improved when tested against the station benchmark.
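A minimal sketch of the benchmark described above, assuming a precomputed hourly/Julian-day station climatology: the forecast for the next hour is that hour's climatological value plus the current observation's departure from the current hour's climatology. The numbers are invented for illustration.

```python
import numpy as np

def hourly_climatology(temps, hours, doys):
    """Crude hourly/day-of-year station climatology: the mean of all past
    observations sharing the same (hour, day-of-year) bin.  In practice the
    climatology would be smoothed across neighboring hours and days; that
    refinement is omitted here."""
    bins = {}
    for t, h, d in zip(temps, hours, doys):
        bins.setdefault((h, d), []).append(t)
    return {key: float(np.mean(vals)) for key, vals in bins.items()}

def benchmark_next_hour(obs_now, clim_now, clim_next):
    """Persist the current departure from climatology: the benchmark for the
    next hour is that hour's climatological temperature plus the deviation of
    the current observation from the current hour's climatology."""
    return clim_next + (obs_now - clim_now)

# Invented example: climatology is 22.0 C at 1800 UTC and 20.5 C at 1900 UTC
# for this day of year, and the 1800 UTC observation is 24.3 C.
print(benchmark_next_hour(obs_now=24.3, clim_now=22.0, clim_next=20.5))  # ~22.8
```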
Abstract
No abstract available.
Abstract
When evaluating differences between competing precipitation forecasts, formal hypothesis testing is rarely performed. This may be due to the difficulty of applying common tests given the spatial correlation and non-normality of forecast errors. Possible ways around these difficulties are explored here. Two datasets of precipitation forecasts are evaluated: a set of two competing gridded precipitation forecasts from operational weather prediction models, and sets of competing probabilistic quantitative precipitation forecasts from model output statistics and from an ensemble of forecasts. For each test, data from each competing forecast are collected into one sample per case day to avoid problems with spatial correlation. Next, several possible hypothesis test methods are evaluated: the paired t test, the nonparametric Wilcoxon signed-rank test, and two resampling tests. The more involved resampling methodology is the most appropriate when testing threat scores from nonprobabilistic forecasts. The simpler paired t test or Wilcoxon test is appropriate for testing the skill of probabilistic forecasts evaluated with the ranked probability score.
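For concreteness, here is a generic Python sketch of the three paired approaches applied to daily score differences (one aggregate score per case day, which sidesteps spatial correlation within a day). The sign-flip resampling scheme and the synthetic data are illustrative assumptions, not the exact resampling procedure of the study.

```python
import numpy as np
from scipy import stats

def paired_tests(score_a, score_b, n_resamples=10000, seed=0):
    """Compare two forecast systems from paired daily verification scores.

    Returns p-values from the paired t test, the Wilcoxon signed-rank test,
    and a simple sign-flip resampling test on the mean score difference.
    """
    d = np.asarray(score_a) - np.asarray(score_b)
    t_p = stats.ttest_rel(score_a, score_b).pvalue
    w_p = stats.wilcoxon(score_a, score_b).pvalue

    # Resampling: under the null hypothesis the sign of each daily difference
    # is arbitrary, so randomly flip signs and recompute the mean difference.
    rng = np.random.default_rng(seed)
    observed = abs(d.mean())
    signs = rng.choice([-1.0, 1.0], size=(n_resamples, d.size))
    resampled = np.abs((signs * d).mean(axis=1))
    resample_p = (np.sum(resampled >= observed) + 1) / (n_resamples + 1)
    return t_p, w_p, resample_p

# Example with synthetic daily ranked probability scores for two systems:
rng = np.random.default_rng(1)
rps_a = rng.gamma(2.0, 0.05, size=60)
rps_b = rps_a + rng.normal(0.01, 0.02, size=60)
print(paired_tests(rps_a, rps_b))
```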