Search Results

You are looking at 1 - 10 of 86 items for

  • Author or Editor: Thomas M. Hamill
  • All content
Thomas M. Hamill

Abstract

The most common method of verifying multicategory probabilistic forecasts, such as those used in probabilistic quantitative precipitation forecasting, is the ranked probability score. This single-number description of forecast accuracy can never capture the multidimensional nature of forecast quality, and it does not inform the forecaster about the sources of forecast deficiencies. A new type of reliability diagram is developed here and applied to probabilistic quantitative precipitation forecasts from a university contest. This diagram is shown to be potentially useful in helping the forecaster correct some errors in assigning the categorical probabilities.
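
For reference, the ranked probability score compares the cumulative forecast probabilities over the ordered categories with the cumulative distribution of the verifying category. A minimal sketch of that computation, with an illustrative function name and example probabilities not taken from the paper:

```python
import numpy as np

def ranked_probability_score(forecast_probs, observed_category):
    """Ranked probability score for one multicategory probabilistic forecast.

    forecast_probs    : probabilities assigned to the ordered categories (should sum to 1)
    observed_category : index of the category that verified
    """
    forecast_probs = np.asarray(forecast_probs, dtype=float)
    # Cumulative forecast distribution over the ordered categories.
    cum_forecast = np.cumsum(forecast_probs)
    # Cumulative "observed" distribution: 0 below the verifying category, 1 at and above it.
    cum_observed = np.zeros_like(cum_forecast)
    cum_observed[observed_category:] = 1.0
    # The RPS is the sum of squared differences between the two cumulative distributions.
    return float(np.sum((cum_forecast - cum_observed) ** 2))

# Example: four precipitation categories, with the second category verifying.
print(ranked_probability_score([0.1, 0.4, 0.3, 0.2], observed_category=1))  # 0.30
```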

Full access
Thomas M. Hamill

Abstract

Rank histograms are a tool for evaluating ensemble forecasts. They are useful for determining the reliability of ensemble forecasts and for diagnosing errors in the ensemble mean and spread. Rank histograms are generated by repeatedly tallying the rank of the verification (usually an observation) relative to the values of the ensemble sorted from lowest to highest. However, uncritical use of the rank histogram can lead to misinterpretations of the qualities of that ensemble. For example, a flat rank histogram, usually taken as a sign of reliability, can still be generated from unreliable ensembles. Similarly, a U-shaped rank histogram, commonly understood as indicating a lack of variability in the ensemble, can also be a sign of conditional bias. It is also shown that flat rank histograms can be generated for some model variables if the variance of the ensemble is correctly specified, yet if covariances between model grid points are improperly specified, rank histograms for combinations of model variables may not be flat. Further, if imperfect observations are used for verification, the observational errors should be accounted for; otherwise, the shape of the rank histogram may mislead the user about the characteristics of the ensemble. If a statistical hypothesis test is to be performed to determine whether the deviations from rank uniformity are statistically significant, the samples used to populate the rank histogram must be located far enough away from each other in time and space to be considered independent.
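
A minimal sketch of how such a rank histogram might be tallied, assuming one verifying value per case and random breaking of ties; the function name and the synthetic, statistically consistent example are illustrative, not from the paper:

```python
import numpy as np

def rank_histogram(ensembles, observations, rng=None):
    """Tally the rank of each verification within its sorted ensemble.

    ensembles    : array of shape (n_cases, n_members)
    observations : array of shape (n_cases,)
    Returns counts over the n_members + 1 possible ranks.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    n_cases, n_members = ensembles.shape
    counts = np.zeros(n_members + 1, dtype=int)
    for ens, obs in zip(ensembles, observations):
        # Rank = number of members below the verification; ties are broken randomly
        # so that identical values do not all pile into a single bin.
        below = int(np.sum(ens < obs))
        ties = int(np.sum(ens == obs))
        counts[below + rng.integers(0, ties + 1)] += 1
    return counts

# With a synthetic ensemble drawn from the same distribution as the verification,
# the histogram should be approximately flat.
rng = np.random.default_rng(1)
print(rank_histogram(rng.normal(size=(5000, 10)), rng.normal(size=5000), rng))
```

One way to account for imperfect observations, as the abstract cautions, is to add observation-error noise to each ensemble member before the ranking; that detail is omitted in this sketch.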

Full access
Thomas M. Hamill

Abstract

No abstract available.

Full access
Thomas M. Hamill

Abstract

When evaluating differences between competing precipitation forecasts, formal hypothesis testing is rarely performed. This may be due to the difficulty of applying common tests given the spatial correlation and non-normality of forecast errors. Possible ways around these difficulties are explored here. Two datasets of precipitation forecasts are evaluated: a pair of competing gridded precipitation forecasts from operational weather prediction models, and sets of competing probabilistic quantitative precipitation forecasts from model output statistics and from an ensemble of forecasts. For each test, data from each competing forecast are collected into one sample per case day to avoid problems with spatial correlation. Next, several possible hypothesis test methods are evaluated: the paired t test, the nonparametric Wilcoxon signed-rank test, and two resampling tests. The more involved resampling methodology is the most appropriate when testing threat scores from nonprobabilistic forecasts. The simpler paired t test or Wilcoxon test is appropriate for testing the skill of probabilistic forecasts evaluated with the ranked probability score.
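
A minimal sketch of a paired resampling (permutation) test applied to per-case-day scores, a simplified stand-in for the resampling methodology discussed in the abstract; the score arrays in the usage comments are hypothetical, and the scipy calls noted at the end correspond to the paired t test and Wilcoxon signed-rank test:

```python
import numpy as np

def paired_permutation_test(scores_a, scores_b, n_resamples=10000, seed=0):
    """Two-sided paired resampling test on per-case-day verification scores.

    Returns a p-value for the null hypothesis that the mean score difference is zero.
    """
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    observed = abs(diffs.mean())
    exceed = 0
    for _ in range(n_resamples):
        # Under the null hypothesis, the sign of each case-day difference is arbitrary.
        signs = rng.choice([-1.0, 1.0], size=diffs.size)
        if abs((signs * diffs).mean()) >= observed:
            exceed += 1
    return exceed / n_resamples

# Hypothetical usage with one score per case day from each competing forecast:
# p = paired_permutation_test(scores_model_a, scores_model_b)
# The simpler alternatives mentioned above:
# from scipy.stats import ttest_rel, wilcoxon
# ttest_rel(scores_model_a, scores_model_b); wilcoxon(scores_model_a, scores_model_b)
```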

Full access
Thomas M. Hamill

Abstract

During the period 9–16 September 2013, more than 17 in. (~432 mm) of rainfall fell over parts of Boulder County, Colorado, with more than 8 in. (~203 mm) over a wide swath of Colorado’s northern Front Range. This caused significant flash and river flooding, loss of life, and extensive property damage. The event set a record for daily rainfall (9.08 in., or >230 mm) in Boulder that was nearly double the previous daily rainfall record of 4.8 in. (122 mm) set on 31 July 1919. The operational performance of precipitation forecast guidance from global ensemble prediction systems and the National Weather Service’s global and regional forecast systems during this event is documented briefly in the article and more extensively in online supplemental appendixes. While the precipitation forecast guidance uniformly depicted a much wetter-than-average period over northeastern Colorado, neither the global modeling systems nor most of the regional modeling systems predicted precipitation amounts as heavy as those analyzed. Notable exceptions were the Short-Range Ensemble Forecast (SREF) members that used the Advanced Research Weather Research and Forecasting Model (ARW-WRF) dynamical core. These members consistently produced record rainfall in the Front Range. However, these members also predicted record rainfall for the day before the heaviest observed precipitation as well as for the day on which it actually fell.

Full access
Thomas M. Hamill

Abstract

Probabilistic quantitative precipitation forecasts (PQPFs) were generated from The Observing System Research and Predictability Experiment (THORPEX) Interactive Grand Global Ensemble (TIGGE) database from July to October 2010 using data from Europe (ECMWF), the United Kingdom [Met Office (UKMO)], the United States (NCEP), and Canada [Canadian Meteorological Centre (CMC)]. Forecasts of 24-h accumulated precipitation were evaluated at 1° grid spacing within the contiguous United States against analysis data based on gauges and bias-corrected radar data.
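
A minimal sketch of how a raw-ensemble PQPF for a 24-h threshold exceedance might be formed and scored with the Brier score, assuming member accumulations and 0/1 observed exceedances are already matched on the verification grid; the names and the multimodel pooling shown in the comments are illustrative, not necessarily the paper’s exact procedure:

```python
import numpy as np

def pqpf_exceedance(member_precip, threshold_mm):
    """Raw-ensemble PQPF: fraction of members exceeding a 24-h accumulation threshold.

    member_precip : array of shape (n_cases, n_members), 24-h accumulations in mm
    """
    return np.mean(np.asarray(member_precip, dtype=float) > threshold_mm, axis=1)

def brier_score(prob_forecasts, binary_obs):
    """Mean squared error of probability forecasts against 0/1 observed exceedances."""
    p = np.asarray(prob_forecasts, dtype=float)
    o = np.asarray(binary_obs, dtype=float)
    return float(np.mean((p - o) ** 2))

# One simple way to form a multimodel PQPF is to pool the members from all centers
# with equal weight (arrays below are hypothetical):
# pooled = np.concatenate([ecmwf_members, ukmo_members, ncep_members, cmc_members], axis=1)
# bs = brier_score(pqpf_exceedance(pooled, threshold_mm=10.0), obs_exceeds_10mm)
```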

PQPFs from ECMWF’s ensembles generally had the highest skill of the raw ensemble forecasts, followed by CMC. Those of UKMO and NCEP were less skillful. PQPFs from CMC forecasts were the most reliable but the least sharp, and PQPFs from NCEP and UKMO ensembles were the least reliable but sharper.
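
A minimal sketch of the binning behind a reliability assessment, which also exposes sharpness through how often near-0 or near-1 probabilities are issued; the ten-bin choice is an assumption, not taken from the paper:

```python
import numpy as np

def reliability_table(prob_forecasts, binary_obs, n_bins=10):
    """Bin probability forecasts and tabulate the observed frequency in each bin.

    Returns (bin_centers, observed_frequency, counts). Reliability is judged by how
    closely observed_frequency tracks bin_centers; the counts describe sharpness.
    """
    p = np.asarray(prob_forecasts, dtype=float)
    o = np.asarray(binary_obs, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    centers, freqs, counts = [], [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (p >= lo) & (p <= hi) if hi == 1.0 else (p >= lo) & (p < hi)
        centers.append(0.5 * (lo + hi))
        counts.append(int(in_bin.sum()))
        freqs.append(float(o[in_bin].mean()) if in_bin.any() else np.nan)
    return np.array(centers), np.array(freqs), np.array(counts)
```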

Multimodel PQPFs were more reliable and skillful than individual ensemble prediction system forecasts. The improvement was larger for heavier precipitation events [e.g., >10 mm (24 h)−1] than for smaller events [e.g., >1 mm (24 h)−1].

ECMWF ensembles were statistically postprocessed using extended logistic regression and the five-member weekly reforecasts for the June–November period of 2002–09, the period for which precipitation analyses were also available. Multimodel ensembles were also postprocessed using logistic regression and the last 30 days of prior forecasts and analyses. The reforecast-calibrated ECMWF PQPFs were much more skillful and reliable for the heavier precipitation events than the raw ECMWF forecasts but much less sharp. Raw multimodel PQPFs were generally more skillful than reforecast-calibrated ECMWF PQPFs for the light precipitation events but had about the same skill for the heavier precipitation events; they were also sharper but somewhat less reliable than the ECMWF reforecast-based PQPFs. Postprocessed multimodel PQPFs did not improve on the raw multimodel PQPFs as much as the reforecast-based postprocessing improved on the raw ECMWF forecasts.
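
A minimal sketch of an extended logistic regression of the general kind used here, in which the precipitation threshold itself enters as a predictor so that a single set of coefficients yields probabilities that are coherent across all thresholds; the square-root predictors, the fitting method, and the variable names are assumptions rather than the paper’s exact configuration:

```python
import numpy as np
from scipy.optimize import minimize

def fit_extended_logistic_regression(ens_mean, analyzed_precip, thresholds):
    """Fit P(precip <= q) = 1 / (1 + exp(-(b0 + b1*sqrt(ens_mean) + b2*sqrt(q)))).

    ens_mean        : ensemble-mean 24-h precipitation for each training case (mm)
    analyzed_precip : verifying analyzed precipitation for each case (mm)
    thresholds      : the accumulation thresholds q (mm) used in the fit
    """
    x = np.sqrt(np.asarray(ens_mean, dtype=float))
    y = np.asarray(analyzed_precip, dtype=float)

    def neg_log_likelihood(params):
        b0, b1, b2 = params
        nll = 0.0
        for q in thresholds:
            p = 1.0 / (1.0 + np.exp(-(b0 + b1 * x + b2 * np.sqrt(q))))  # P(precip <= q)
            p = np.clip(p, 1e-9, 1.0 - 1e-9)
            obs = (y <= q).astype(float)  # 1 if the analysis was at or below q
            nll -= np.sum(obs * np.log(p) + (1.0 - obs) * np.log(1.0 - p))
        return nll

    return minimize(neg_log_likelihood, x0=np.zeros(3), method="Nelder-Mead").x

# The calibrated exceedance probability at any threshold q then follows as
# P(precip > q) = 1 - 1 / (1 + exp(-(b0 + b1*sqrt(ens_mean) + b2*sqrt(q)))).
```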

The evidence presented here suggests that all operational centers, even ECMWF, would benefit from the open, real-time sharing of precipitation forecast data and the use of reforecasts.

Full access
Thomas M. Hamill
Full access
Thomas M. Hamill

Abstract

A global reforecast dataset was recently created for the National Centers for Environmental Prediction’s Global Ensemble Forecast System (GEFS). This reforecast dataset consists of retrospective and real-time ensemble forecasts produced for the GEFS from 1985 to the present day. An 11-member ensemble was produced once daily to a +15-day lead time from 0000 UTC initial conditions. While the forecast model was stable during the production of this dataset, in 2011 and several times thereafter there were significant changes to the forecast model used within the data assimilation system itself, as well as changes to the assimilation system and to the observations that were assimilated. These changes resulted in substantial changes in the statistical characteristics of the reforecast dataset. Such changes make it difficult to use the reforecasts uncritically for statistical postprocessing, which commonly assumes that forecast error and bias are approximately consistent from one year to the next. Ensuring consistency in the statistical characteristics of past and present initial conditions is desirable but can be in tension with the expectation that prediction centers upgrade their forecast systems rapidly.
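
A minimal sketch of the kind of year-by-year bias check that this consistency assumption invites; the variable names are illustrative:

```python
import numpy as np

def bias_by_year(years, forecasts, analyses):
    """Mean forecast bias per calendar year, a simple check of the year-to-year
    consistency that reforecast-based postprocessing typically assumes.

    years     : year of each matched (forecast, analysis) pair
    forecasts : forecast values
    analyses  : verifying analyses or observations
    """
    years = np.asarray(years)
    errors = np.asarray(forecasts, dtype=float) - np.asarray(analyses, dtype=float)
    return {int(y): float(errors[years == y].mean()) for y in np.unique(years)}

# A jump in the per-year bias around 2011, when the assimilation system changed,
# would warn against fitting a single correction to the full 1985-present record.
```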

Full access
Thomas M. Hamill

Abstract

Forecasters often develop rules of thumb for adjusting model guidance. Ideally, before use, these rules of thumb should be validated through a careful comparison of model forecasts and observations over a large sample. Practically, such evaluation studies are difficult to perform because forecast models are continually being changed, and a hypothesized rule of thumb may only be applicable to a particular forecast model configuration.

A particular rule of thumb was examined here: dprog/dt. Given a set of lagged forecasts from the same model, all verifying at the same time, this rule of thumb suggests that if the forecasts show a trend, the trend is more likely than not to continue, and thus that it provides useful information for correcting the most recent forecast. Forecasters may also use the degree of continuity between successive forecasts to estimate the magnitude of the error in the most recent forecast.
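
A minimal sketch of the trend extrapolation the dprog/dt rule implies, using illustrative numbers rather than data from the study:

```python
import numpy as np

def dprogdt_extrapolation(lagged_forecasts):
    """Adjust the most recent forecast by continuing the trend across lagged forecasts.

    lagged_forecasts : values from oldest to newest initialization, all valid at one time
    """
    f = np.asarray(lagged_forecasts, dtype=float)
    trend = f[-1] - f[-2]      # most recent forecast-to-forecast change
    return f[-1] + trend       # assume that change continues one more step

# Illustrative 850-hPa temperatures (K) from three successive runs, all valid at one time:
lagged = [271.5, 272.2, 273.0]
verification = 272.8
print(abs(dprogdt_extrapolation(lagged) - verification),  # error of the extrapolation
      abs(lagged[-1] - verification))                      # error of the latest forecast alone
```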

Statistical evaluation of this rule of thumb was made possible here using a dataset of forecasts from a “frozen” model. A 23-yr record of forecasts was generated from a T62 version of the medium-range forecast model used at the National Centers for Environmental Prediction. Forecasts were initialized from reanalysis data, and January–March forecasts were examined for selected locations. The rule dprog/dt was evaluated with 850-hPa temperature forecasts. A total of 2070 sample days were used in the evaluation.

Extrapolation of forecast trends was shown to have little forecast value. Also, the discrepancy between short-term lagged forecasts provided only a small amount of information about forecast accuracy. The lack of validity of this rule of thumb suggests that other rules of thumb should also be carefully scrutinized before use.

Full access