## Abstract

In the event of the release of a dangerous atmospheric contaminant, an atmospheric transport and dispersion (ATD) model is often used to provide forecasts of the resulting contaminant dispersion affecting the population. These forecasts should also be accompanied by accurate estimates of the forecast uncertainty to allow for more informed decisions about the potential hazardous area. This study examines the calculation of uncertainty in the meteorological data as derived from an ensemble, and its effects when used as additional input to drive an ATD model. The first part of the study examines the capability of a linear function to relate ensemble spread to error variance of the ensemble mean given ensemble spread from 24 days of forecasts from the National Centers for Environmental Prediction (NCEP) Short-Range Ensemble Forecast (SREF). This linear function can then be used to calibrate the ensemble spread to produce a more accurate estimate of the meteorological uncertainty. Results for the linear relationship of wind variance are very good, with values of the coefficient of determination *R*^{2} generally exceeding 0.94 for forecast lengths of 12 h and greater. The calibration is shown to be more sensitive to forecast hour than vertical level within the lower troposphere. The second part presents a 24-h case study to assess the impact of meteorological uncertainty calculations on Second-Order Closure Integrated Puff (SCIPUFF) ATD model predictions. Both uncalibrated ensemble wind variances and wind variances calibrated based on the results of the first part show improvement in mean concentration forecasts relative to a control experiment using the default hazard mode uncertainty when compared with a baseline SCIPUFF integration based on a high-resolution dynamic analysis of the meteorological conditions. 
The SCIPUFF experiments that use a wind variance calibration show both qualitative and quantitative improvement in most of the mean concentrations and patterns over the control experiment and the SCIPUFF experiment using uncalibrated wind variances. The SCIPUFF experiments using meteorological ensemble uncertainty information also produce mean concentrations and patterns that compare favorably to those of an explicit SCIPUFF ensemble based on each SREF member. Use of the uncalibrated variance information within a single ATD prediction produces mean ATD predictions most similar to those of the explicit ATD ensemble, and use of calibrated ensemble variance is shown to have some advantages over the explicit ATD ensemble.

## 1. Introduction

In the event of a dangerous airborne substance release, either accidental or as a deliberate attack, atmospheric transport and dispersion (ATD) models coupled with adequate meteorological (MET) information can predict vital information about the risk areas. This information can be a critical component in the plan of any emergency response personnel. As valuable as this information is, however, it is important that it be accompanied by an estimate of the uncertainty in these predicted risk areas and exposure levels (National Research Council 2003; Rao 2005).

Ensemble forecasting attempts to quantify MET uncertainty by running multiple numerical weather prediction (NWP) models and/or model configurations, with each member using different initial conditions, boundary conditions, and/or model physics. The goal of these ensembles is to span the likely range of outcomes given the uncertainties in the model and its inputs (Leith 1974). While ensemble forecasting is a significant step toward forecasting both the most likely outcome and the uncertainty in the forecast, the size of operational ensembles and their design for traditional weather forecasting render them insufficient to fully represent the probability density function (PDF) of possible forecasts, especially for ATD. An ensemble capable of doing so is impractical with current computing resources, if indeed such an ensemble is even possible. Therefore, any existing MET ensemble is merely sampling the full PDF of possible outcomes, and any estimates of uncertainty we generate from such an ensemble should be calibrated to better represent an estimate of the full PDF rather than that of the limited ensemble. Many studies, including Houtekamer et al. (1997), show that most MET ensembles are underdispersive (the ensemble spread is consistently smaller than the spread in the forecast errors).

There are many studies that examine the representativeness of ensemble spread as a measure of forecast uncertainty. Several attempt to define a correlation between ensemble spread and different measures of the error in the resulting MET forecast across many different variables (Kalnay and Dalcher 1987; Murphy 1988; Barker 1991; Houtekamer 1993; Buizza 1997; Hamill and Colucci 1998; Stensrud et al. 1999). These studies all find an unacceptably low correlation between spread and errors (Grimit and Mass 2007). This poor correlation is explained by Houtekamer (1993), who uses a stochastic model to show that, even in idealized cases, the correlation between ensemble spread and absolute error will never be very large. Additionally, Houtekamer (1993) shows that extremes in ensemble spread (either much smaller or much larger than the climatological average) are much more likely to predict skill than spreads near the climatological mean (Whitaker and Loughe 1998).

Grimit (2004) posits that, rather than a simple spread-to-error correlation, a more probabilistic approach should be used to evaluate ensemble uncertainty. Grimit stresses the distinction between forecast error and forecast uncertainty. In brief, the forecast uncertainty measures the spread of the full forecast PDF, whereas a forecast error is computed relative to a single realization (what actually happens) chosen from that PDF. Thus, when we calculate the absolute error using an observation, we are only sampling one realization from the full forecast error PDF and we cannot relate that single observation to the actual uncertainty in the forecast. If, however, the relationship between the spread of the MET ensemble and forecast uncertainty is approximately constant for a given forecast length, we can collect multiple independent samples of different instances to reconstruct the distribution of errors over a range of spreads. Using this distribution, we can calculate the variance of the errors (our measure of uncertainty) and compare this with the ensemble spread (Grimit 2004).

Following this approach, Grimit (2004) and Grimit and Mass (2007) use an idealized stochastic model to show a strong linear relationship between ensemble spread and error variance. Because they are using an idealized model, the relationship is not only linear but lies very near the ideal *y* = *x* line passing through the origin. Any remaining variation from the *y* = *x* line is attributed to sampling errors caused by having a finite number of samples. Roulston (2005) applies a similar technique, but instead of using the ensemble variance, explores the spatial relationship within variable fields by comparing average ensemble covariance between two points to the actual error covariance between those same points in European Centre for Medium-Range Weather Forecasts (ECMWF) ensemble forecasts. The current study seeks to extend the methods of Grimit (2004) and Roulston (2005) by using the variance from an operational Short-Range Ensemble Forecast (SREF) to predict the actual error variance and actual single-point error covariance observed in the ensemble mean for variables important to ATD. If the relationship is linear, as in the Grimit (2004) and Roulston (2005) studies, or follows some other simple form, it provides a simple, computationally efficient method for estimating the actual error variance from the diagnosed ensemble variance.
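The distinction between a single-realization error and the error variance can be illustrated with a small synthetic experiment. The sketch below is not the stochastic model of Grimit and Mass (2007); the lognormal distribution of the forecast standard deviation, the sample sizes, and all variable names are assumptions made for illustration. Each case draws one error from a distribution whose variance is known exactly, so the ensemble variance is a perfect predictor of uncertainty by construction.

```python
import numpy as np

rng = np.random.default_rng(0)

n_cases = 100_000   # number of independent forecast cases (assumed)
bin_size = 1_000    # cases per equally populated bin (assumed)

# Assumed: the true forecast standard deviation varies from case to case.
sigma = rng.lognormal(mean=0.0, sigma=0.5, size=n_cases)
spread_var = sigma ** 2              # idealized, perfectly known ensemble variance
error = rng.normal(0.0, sigma)       # one realization of the error per case

# Case-by-case correlation between spread and absolute error.
r_single = np.corrcoef(sigma, np.abs(error))[0, 1]

# Sort cases by ensemble variance and average within equally populated bins.
order = np.argsort(spread_var)
evar_b = spread_var[order].reshape(-1, bin_size).mean(axis=1)
avar_b = (error[order] ** 2).reshape(-1, bin_size).mean(axis=1)

# Linear relationship between binned ensemble variance and binned error variance.
slope, intercept = np.polyfit(evar_b, avar_b, 1)
r_binned = np.corrcoef(evar_b, avar_b)[0, 1]
```

Under these assumptions the case-by-case correlation stays well below one even though the spread is a perfect measure of uncertainty, whereas the binned error variance follows the binned ensemble variance closely, with a slope near 1.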

The ATD models also have uncertainties dependent on their initial conditions and model assumptions. Thus the creation of ATD ensembles driven by MET ensembles provides another way to account for such uncertainties. Many ATD models used today are designed to use the outputs from MET models to drive their transport and dispersion calculations, presenting an excellent opportunity to leverage output from existing MET model ensembles to drive ensembles of ATD models. Several studies have applied this or other ensemble approaches to ATD models with some success (Dabberdt and Miller 2000; Galmarini et al. 2001; Straume 2001; Warner et al. 2002; Delle Monache and Stull 2003; Draxler 2003; Galmarini et al. 2004; Lee et al. 2009). The advantages of creating an ensemble of dispersion forecasts are to improve the ensemble mean forecast and to provide an estimate of the uncertainty in that dispersion forecast.

Unfortunately, in a crisis situation where decisions must be made quickly to coordinate emergency response, timely computation of an ensemble of ATD model forecasts may not be possible, so decisions may be made based on a single ATD model forecast, even on a laptop computer in the field. Since MET ensembles are readily available, we propose the use of the uncertainty information present in the MET ensemble, preferably calibrated to better represent the actual error variances, to assess the uncertainty of a single ATD forecast. Although many ATD models use output from advanced mesoscale models to provide the flow field for transport and dispersion calculations, given known source characteristics, uncertainty in the transporting winds remains the biggest cause of uncertainty in ATD models (Rao 2005). Therefore, if we can estimate the uncertainty in the input wind fields, we can improve our estimates of the ATD model uncertainty related to the MET inputs and produce an estimate of uncertainty in the resulting dispersion calculation with a single ATD model forecast. This capability is already built into the Second-Order Closure Integrated Puff (SCIPUFF) dispersion model, making it ideal for studying the capability of this multi-MET, single ATD model approach in predicting dispersion uncertainty (Sykes et al. 2006).

SCIPUFF ingests MET uncertainty information from ensembles via four-dimensional (three spatial dimensions plus time), single-point wind statistics: variance in the *U* component of wind (called UUE in SCIPUFF and throughout this paper), variance in the *V* component of wind (VVE), covariance between the *U* and *V* components (UVE), and a Lagrangian length scale (SLE). Four-dimensional wind variances and covariance are readily computed in an uncalibrated form directly from MET ensemble output, while calculation of SLE is an area of ongoing research (Lee et al. 2009). However, we would like to apply a simple, effective calibration to the variances to provide SCIPUFF with the best possible estimate of MET uncertainty. Here we use a linear variance calibration (LVC) method based on Grimit (2004) and Roulston (2005), which is described in more detail in section 2a, to improve our wind variances.
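As a minimal sketch of how the uncalibrated single-point statistics could be computed from ensemble output, the function below takes arrays of member winds and returns fields named after the SCIPUFF inputs. The function name, the array layout (member index first), and the use of the *M* − 1 normalization are assumptions for illustration; this is not SCIPUFF or SREF code.

```python
import numpy as np

def wind_statistics(u_members, v_members):
    """Return (UUE, VVE, UVE) from member winds shaped (member, ...).

    Assumes the unbiased (M - 1) normalization for variances and covariance.
    """
    u = np.asarray(u_members, dtype=float)
    v = np.asarray(v_members, dtype=float)
    m = u.shape[0]                      # number of ensemble members
    du = u - u.mean(axis=0)             # departures from the ensemble mean
    dv = v - v.mean(axis=0)
    uue = (du * du).sum(axis=0) / (m - 1)   # variance of U
    vve = (dv * dv).sum(axis=0) / (m - 1)   # variance of V
    uve = (du * dv).sum(axis=0) / (m - 1)   # covariance of U and V
    return uue, vve, uve

# Three hypothetical members at a single grid point.
uue, vve, uve = wind_statistics([[1.0], [2.0], [3.0]], [[3.0], [2.0], [1.0]])
```

Any trailing dimensions (time, level, *y*, *x*) are carried through unchanged, so the same call would apply to a full four-dimensional field.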

Warner et al. (2002) is one of the first studies to examine the effect of MET ensemble variability on ATD model results. They explore the use of uncertainty information in SCIPUFF combined with a single dispersion model simulation, and use the large-scale variability (LSV) length scale as a proxy for SLE. LSV was originally designed to account for subgrid-scale variability in SCIPUFF. Warner et al. (2002) use a couple of ad hoc estimates to evaluate the effect of LSV on the resulting concentration probabilities, demonstrating that longer length scales translate to larger areas of low dosage and smaller areas of high dosage.

We have two primary goals in the current study. The first is to demonstrate the applicability of the Roulston (2005) method for calibrating ensemble variances of field variables (e.g., boundary layer winds) important to ATD models. The second goal is to investigate the sensitivity and potential improvement in SCIPUFF forecasts using ensemble spread, both in an uncalibrated form and calibrated using the technique described herein. Section 2 details the linear calibration method and presents results of the calibration for typical planetary boundary layer (PBL) wind fields during the training period. Section 3 uses a case study to evaluate the sensitivity of SCIPUFF predictions to input MET uncertainty. The predictions using ensemble-derived variance information, both calibrated and uncalibrated, are compared with a baseline dispersion solution created using MET output from a high-resolution mesoscale model using four-dimensional data assimilation (Stauffer and Seaman 1994) to provide a high-quality “dynamic analysis” of MET conditions to run SCIPUFF. An ensemble of SCIPUFF forecasts driven by each MET ensemble member is also compared to the baseline and other single-SCIPUFF experiments. Section 4 summarizes our conclusions and suggests areas for future research.

## 2. Linear variance calibration

### a. Methodology

The method outlined in Roulston (2005) involves studying the two-point error covariance of a MET ensemble through a training period and determining a linear relationship with the observed two-point error covariance during the same period. Here we adapt that technique for use with single-point variances from a MET ensemble during the training period. While the full matrix of single-point statistics is much smaller than that of two-point statistics, we have chosen to follow the Roulston (2005) method by using sampling with replacement to obtain a representative sample rather than computing the statistics for every point, which is feasible when sufficient computing power is available. In this sampling technique, rather than calculating statistics on all points at each forecast length, we choose a random sample point in space and forecast time from all forecasts with a given forecast length, and then repeat the process with the same forecast length until the sample is large enough to approximate the results of calculating the statistics at every point. The goal is to derive linear relationships between MET ensemble variance and actual error variance to calibrate forecast variance.
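The sampling step can be sketched as drawing random index triples (forecast, row, column) with replacement for one forecast length. The function name, argument names, and grid dimensions below are hypothetical, and the sketch samples grid points uniformly, which is an assumption about how "random" is defined.

```python
import numpy as np

def sample_points(n_forecasts, n_y, n_x, n_samples, seed=0):
    """Draw (forecast, j, i) index triples with replacement.

    Each triple selects one grid point from one forecast in the training
    period, all at the same forecast length.
    """
    rng = np.random.default_rng(seed)
    f = rng.integers(0, n_forecasts, size=n_samples)  # which forecast
    j = rng.integers(0, n_y, size=n_samples)          # grid row
    i = rng.integers(0, n_x, size=n_samples)          # grid column
    return f, j, i

# Hypothetical dimensions: 48 forecasts on a 100 x 150 grid.
f, j, i = sample_points(n_forecasts=48, n_y=100, n_x=150, n_samples=1000)
```

With a field stored as an array indexed `[forecast, j, i]`, the sampled values would be obtained by fancy indexing, for example `field[f, j, i]`.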

Since the error of the ensemble mean is the result of a single realization (the mean value versus what actually happened), we need to compute the error over many realizations to calculate the actual uncertainty in the forecast at a particular forecast hour. This is accomplished by calculating the ensemble mean error at many locations in space throughout the training period for a given forecast length. This process assumes that the relationship between the ensemble variance and the actual error variance is constant throughout the training period and throughout the domain sampled for each variable field and forecast length, so that the temporal–spatial averaging is comparable to averaging over all realizations. While this assumption may not always be true, this technique may still be useful if the correlation is high. In this situation, the calculated error variance would be an approximation based on all the ensemble variance to error variance relationships present within the sample. This also provides a possible future refinement: separating forecast points by their underlying ensemble variance to error variance relationship, by regime clustering, regional calibration, etc. However, care must be taken in acquiring an adequate sample size while not expanding into another distribution (e.g., trying to acquire a sufficient regional sample by lengthening the training period to include multiple regimes).

For this study, we follow the Roulston (2005) method of grouping into equally populated bins based on our predictor value, the ensemble variance. The variances are used to represent the ensemble uncertainty at a particular forecast length. The variance, or spread, of the ensemble, EVar of a scalar quantity *s*, at a given height or pressure, is given by

$$
\mathrm{EVar}[s(ij)] = \frac{1}{M-1} \sum_{m=1}^{M} \left[ s_m(ij) - \overline{s}(ij) \right]^2 ,
$$

where *M* is the number of ensemble members, *ij* denotes the value at a single grid point (*i*, *j*), *s*_{m} is the scalar quantity at *ij* for a single ensemble member, and the overbar denotes an average over all ensemble members. Thus EVar[*s*(*ij*)], the ensemble variance of scalar *s* at point (*i*, *j*) over *M* ensemble members, is averaged inside each bin of size *N* to give the average ensemble variance of the bin, EVar*_{b}*:

$$
\mathrm{EVar}_b = \frac{1}{N} \sum_{n=1}^{N} \mathrm{EVar}[s(ij_n)] ,
$$

with *ij*_{n} representing the point chosen randomly and *N* the total number of points in each bin. The value of *N* should be specified to be large enough that there is a sufficient sample within each bin, but small enough not to include too wide a range of ensemble variances in a single bin.

The actual error variance of the ensemble mean, AVar*_{b}*, for the corresponding bin is

$$
\mathrm{AVar}_b = \frac{1}{N} \sum_{n=1}^{N} \left[ \overline{s}(ij_n) - s_{\upsilon}(ij_n) \right]^2 ,
$$

where *s*_{υ} denotes the value of scalar *s* at the verification time (i.e., the “true” value) and other variables are as before. Note that without averaging over a group of samples, this reduces to the mean-square error in the ensemble mean forecast.
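The binning step can be sketched as follows, assuming the sampled ensemble variances, ensemble means, and verification values are already available as one-dimensional arrays. The function and argument names are hypothetical, and the requirement that the sample size be an exact multiple of the bin size is a simplification of this sketch.

```python
import numpy as np

def binned_variances(evar, ens_mean, truth, bin_size):
    """Return (EVar_b, AVar_b) for equally populated bins sorted by ensemble variance."""
    evar = np.asarray(evar, dtype=float)
    sq_err = (np.asarray(ens_mean, dtype=float) - np.asarray(truth, dtype=float)) ** 2
    if evar.size % bin_size != 0:
        raise ValueError("sample size must be a multiple of bin_size")
    # Sort all samples by the predictor (ensemble variance).
    order = np.argsort(evar, kind="stable")
    # Each row of the reshaped array is one bin of bin_size samples.
    evar_b = evar[order].reshape(-1, bin_size).mean(axis=1)
    avar_b = sq_err[order].reshape(-1, bin_size).mean(axis=1)
    return evar_b, avar_b

# Four hypothetical samples grouped into two bins of two.
evar_b, avar_b = binned_variances(
    evar=[4.0, 1.0, 3.0, 2.0],
    ens_mean=[0.0, 0.0, 0.0, 0.0],
    truth=[2.0, 1.0, 1.0, 0.0],
    bin_size=2,
)
```

Each pair (EVar*_{b}*, AVar*_{b}*) then supplies one point for the linear fit.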

The quality of fit for the linear calibration is evaluated both qualitatively and quantitatively using the coefficient of determination *R*^{2}, as specified in Wilks (2006). The stability of the regression with respect to the sampling parameters was analyzed by examining the sensitivity of *R*^{2}, slope, and intercept to changes in the sampling parameters, and we have found a value for *N* of 1000 to be acceptable with the sample size of 100 000 used in this study.
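A minimal sketch of the fit and its coefficient of determination is given below, using ordinary least squares and *R*^{2} = 1 − SSE/SST. The function name is hypothetical, and the sketch assumes an unweighted fit across bins.

```python
import numpy as np

def fit_lvc(evar_b, avar_b):
    """Fit AVar_b = slope * EVar_b + intercept and return (slope, intercept, R^2)."""
    x = np.asarray(evar_b, dtype=float)
    y = np.asarray(avar_b, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)       # ordinary least squares line
    residual = y - (slope * x + intercept)
    sse = np.sum(residual ** 2)                  # unexplained sum of squares
    sst = np.sum((y - y.mean()) ** 2)            # total sum of squares
    r2 = 1.0 - sse / sst
    return slope, intercept, r2

# Hypothetical bin averages lying exactly on a line.
slope, intercept, r2 = fit_lvc([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
```

Repeating the fit while varying the bin size or the total sample size is one way to examine the stability of the slope, intercept, and *R*^{2}.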

### b. Experimental design

The ensemble used in this study is the National Centers for Environmental Prediction (NCEP) SREF, as described in McQueen et al. (2005). During our 2004 study period, SREF was produced operationally twice daily at 0900 and 2100 UTC with 15 members, 10 of which were produced using the Eta Model and 5 using the Regional Spectral Model (RSM). Because of variations in the MET variables’ availability between the two models, we only use the 10 Eta members in this study. The native horizontal resolution of these Eta members is 36 km before being interpolated to the common 40-km grid for SREF output (NCEP grid 212). The North American Regional Reanalysis (NARR) is used as the verification for this study (Mesinger et al. 2006), interpolated from its native 32-km grid (NCEP grid 221) to the 40-km SREF common grid to allow direct comparison. We use a 24-day training period to calculate the relationship between ensemble variance and actual error variance based on SREF forecasts from 0900 UTC 23 August to 2100 UTC 15 September 2004 (48 forecasts). For this study, we consider forecasts in 3-h increments from 3 to 60 h.

The ensemble members vary in the convective parameterization and microphysics schemes used and their initial conditions, and thus are designed to represent the variability in fields important for synoptic and mesoscale forecasting, such as quantitative precipitation forecasts (QPF). This may not be the ideal ensemble setup for ATD applications, as dispersion calculations are more dependent on the low-level winds, stability, and boundary layer depth, and therefore would be better served by an ensemble created to directly provide variance in those fields. The ideal configuration for a MET ensemble for use in ATD forecasts remains an open research topic, but such an ensemble needs to vary internal parameterizations such as the land surface model, the PBL scheme, and their internal parameters rather than (or in addition to) cloud microphysics and convective parameterization. Also, initial-condition perturbations and data assimilation methods need to focus on variables that more directly affect PBL mixing and development, such as low-level winds, stability, surface buoyancy flux, surface moisture, and so on. Such an ensemble would ideally also have sufficient spatial resolution in the horizontal and vertical, as small-scale features such as slope flows and sea breezes can have important effects in areas of complex topography and coastlines not well resolved by the 40-km SREF. However, SREF is a good example of the type of ensemble that would be available operationally for hazard prediction and consequence assessment and thus available for use in SCIPUFF during a crisis.

Our goal is to compute the uncertainty in the low-level wind fields that are most important for use in ATD calculations. We examine the surface *U* and *V* fields at 10 m above ground level (AGL) for representing lower PBL conditions, and the 850-hPa *U* and *V* fields for conditions near the top of the PBL.

Evaluation of the LVC method is conducted over the SREF domain, which covers the entire continental United States. We sample 100 000 random points from SREF at a given forecast length throughout the training period and bin these points by the ensemble spread into bins of 1000 points each, giving 100 bins total. These points are then plotted against the actual error variance of the bin, using the NARR as verification. The plots are then evaluated both qualitatively and quantitatively for fit using the coefficient of determination *R*^{2}, which represents the portion of the variance in the actual error variance accounted for by the linear fit, and thus is a measure of the quality of the fit. The fit for forecasts of varying length from 3 to 60 h every 3 h is evaluated to determine any dependence on forecast length.

### c. Results

Figure 1 compares the spread in the 10-m-AGL *U* field with the actual error variance over the entire SREF domain at five different forecast lengths. These plots indicate a strong linear relationship between ensemble spread and error variance for each forecast length. This qualitative observation is supported by the *R*^{2} values in Fig. 2, which shows that LVC *R*^{2} values for 10-m-AGL and 850-hPa winds rise from 0.74–0.92 at a forecast length of 3 h to 0.92 or higher at 9 h and 0.95 or higher at 12 h and beyond. Thus at 12 h and longer, at least 95% of the variance in the actual error variance is explained by the linear model.

Another property of the plots in Fig. 1 is the change in slope with forecast length. This is an important result, as it indicates that the calibration should change based on the forecast length. If ensemble variance were truly equivalent to error variance, the relationship would have a slope of 1 and an intercept of 0. As Fig. 3 shows, the change in slope with forecast length is not unique to the 10-m-AGL *U*, but rather is present in both wind components at both levels studied. Figure 4 shows that there is also an increase in the *y* intercept with forecast length, but this increase is primarily observed at longer forecast lengths. The exact cause of these LVC parameter changes with forecast length is uncertain, although they are likely related to changes with forecast length in the dispersiveness of the ensemble relative to the actual uncertainty in the atmosphere, to observation error, and to sampling artifacts from using a finite-sized ensemble.

## 3. SCIPUFF case study

### a. Model descriptions

The SCIPUFF ATD model used in this study is a second-order closure Lagrangian puff dispersion model, using a collection of Gaussian puffs to represent the three-dimensional, time-dependent concentration field (Sykes et al. 2006). SCIPUFF can use a constant specified wind speed and direction, surface or profile observation data, or gridded output from a numerical weather prediction (NWP) model to drive its dispersion calculations. We consider here only this final method of using gridded data from a MET model. There are two different MET models used to drive SCIPUFF for this study: 40-km SREF, also used in the LVC and described in section 2b, and the fifth-generation Pennsylvania State University–National Center for Atmospheric Research Mesoscale Model (MM5), which is used to produce a 4-km baseline as described below. Both model outputs are converted into the Multiscale Environmental Dispersion over Complex Terrain (MEDOC) common grid file format for ingestion into SCIPUFF (Sykes et al. 2006).

The SCIPUFF model also forecasts dispersion probabilities as a function of time and space in addition to the typical dispersion outputs related to the mean plume. This makes it extremely valuable in assessing the potential risk of extreme events that cannot be captured in a simple mean plume, as well as in providing some measure of uncertainty present in the calculations.

SCIPUFF includes two methods for assessing the uncertainty in the wind field. One is the subgrid uncertainty model, which is intended to approximate the uncertainty due to subgrid variations in the wind field not resolved by the gridded MET input (Sykes et al. 2006). The other is a grid-scale uncertainty model, intended to approximate the uncertainty caused by uncertainty in the MET model wind fields. Both are modeled similarly, using the velocity variance, single-point covariance, and a length scale to calculate the corresponding uncertainty in the puff concentrations and locations. This study focuses on the ATD uncertainty due to uncertainty in the MET model wind fields, as input to SCIPUFF via MET ensemble–derived UUE and VVE.

UUE, VVE, UVE, and SLE are used by SCIPUFF in conjunction with the subgrid uncertainty model to calculate the variance of each of the Lagrangian puffs. Thus, a larger UUE or VVE input will result in a larger puff variance, which results in a larger-area, lower-mean concentration field. In this formulation, the covariance UVE acts as a dilatation axis; a negative UVE means a negative correlation, and thus positive *U* variations correspond to negative *V* variations, resulting in puffs (and plumes) stretched in the northwest–southeast direction, and conversely a positive UVE results in northeast–southwest stretching.
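The relationship between the sign of UVE and the stretching direction can be illustrated by treating UUE, VVE, and UVE as a 2 × 2 covariance matrix and finding its leading eigenvector. This is only a geometric sketch, not SCIPUFF's internal calculation; the function name and the angle convention (degrees counterclockwise from east, folded into the range 0 to 180) are assumptions.

```python
import numpy as np

def stretching_axis_deg(uue, vve, uve):
    """Orientation of the major axis of the wind covariance ellipse.

    Returned in degrees counterclockwise from east (+U), in [0, 180).
    45 corresponds to northeast-southwest, 135 to northwest-southeast.
    """
    cov = np.array([[uue, uve], [uve, vve]], dtype=float)
    vals, vecs = np.linalg.eigh(cov)        # symmetric eigen-decomposition
    major = vecs[:, np.argmax(vals)]        # eigenvector of the largest eigenvalue
    angle = np.degrees(np.arctan2(major[1], major[0]))
    return angle % 180.0                    # an axis has no sign, so fold to [0, 180)

axis_pos = stretching_axis_deg(1.0, 1.0, 0.5)    # positive UVE
axis_neg = stretching_axis_deg(1.0, 1.0, -0.5)   # negative UVE
```

With equal variances in both components, a positive covariance gives an axis along northeast–southwest and a negative covariance gives an axis along northwest–southeast, consistent with the description above.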

To produce our best estimate of the MET conditions for our case study, we run MM5 (Dudhia 1993; Grell et al. 1994) using four-dimensional data assimilation (FDDA) to produce dynamic analyses for a 36-h period encompassing the entire test case (Stauffer and Seaman 1990). The configuration uses three domains: an outer domain with a horizontal resolution of 36 km, and internal nests of 12-km and 4-km resolution (Fig. 5). The dynamic-analysis FDDA ingests standard World Meteorological Organization (WMO) upper-air and surface data throughout the model integration (no free forecast period), using a combination of analysis nudging on the outer two domains and observation nudging on the innermost two domains (Stauffer and Seaman 1994). An objective analysis is used to blend the NARR and observations for the initial and lateral boundary conditions, and also for the gridded analyses used for FDDA. This multiscale continuous data assimilation allows us to reconstruct a MET analysis that provides an excellent representation of the state of the atmosphere during the case study. A 4-km MM5 dynamic analysis is used as an input to SCIPUFF to create a baseline dispersion output (section 3c) with which we can compare our experimental results.

### b. Case description

To evaluate the results from SCIPUFF using both calibrated and uncalibrated MET ensemble statistics, we chose a case without significant precipitation and with generally light winds over the eastern United States. Early in the case study, high pressure is situated over eastern Canada with a stalled cold front dissipating in the mid-Atlantic and midwestern states. Low-level winds are easterly along the coast and more southerly west of the Appalachian Mountains (Fig. 6). This easterly wind results in the damming of cooler air against the Appalachians, leading to the formation of a “wedge ridge” pattern (e.g., Stauffer and Warner 1987) by the end of the study period.

Figure 6 also shows a sample comparison of the 4-km MM5 baseline 10-m winds with the observed 10-m winds, along with the baseline sea level pressure, at 15 h (0000 UTC 25 August 2004). The wind pattern is typical of the mean boundary layer flow during the case study. There are light easterly winds over eastern North Carolina, with good agreement between the model-predicted wind field and the observed winds. The winds become southeasterly in the higher terrain of western North Carolina in both the dynamic analysis and the observations. Following the anticyclonic flow farther, we find it turning more southerly in eastern Ohio as we reach the edge of the 4-km domain, with analysis and observation continuing to show good agreement. The area of poorest agreement between the dynamic-analysis winds and observed 10-m winds at this time is in central South Carolina, but even these only show a 45° deviation from observed winds. However, the simulated plume (section 3c) does not pass through this region. Over the entire domain and analysis period, the speed and direction errors are quantified by the mean magnitude of the vector wind difference between the baseline and observations as 1.56 m s^{−1} for the lowest layer, with an average observed wind speed of 2.56 m s^{−1}, and 1.17 m s^{−1} for layers between the lowest layer and 1 km AGL, with an average observed wind speed of 4.52 m s^{−1}. Both of these mean errors are smaller than those of the nowcast dynamic analysis reported by Schroeder et al. (2006) and the high-resolution dynamically initialized forecasts used for hazard prediction and consequence assessment described by Stauffer et al. (2007), indicating a very good simulation of actual conditions for this case study by the MM5 baseline.
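The mean magnitude of the vector wind difference used above can be sketched as follows, assuming model winds have already been interpolated to the observation locations and times. The function and argument names are hypothetical.

```python
import numpy as np

def mean_vector_wind_difference(u_model, v_model, u_obs, v_obs):
    """Mean of |V_model - V_obs| over all matched model-observation pairs."""
    du = np.asarray(u_model, dtype=float) - np.asarray(u_obs, dtype=float)
    dv = np.asarray(v_model, dtype=float) - np.asarray(v_obs, dtype=float)
    # hypot gives the magnitude of each difference vector.
    return float(np.mean(np.hypot(du, dv)))

# Two hypothetical pairs: one with a 5 m/s vector difference, one with none.
mvd = mean_vector_wind_difference([3.0, 0.0], [4.0, 0.0], [0.0, 0.0], [0.0, 0.0])
```

Because the difference is taken between vectors, this single measure reflects both speed and direction errors.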

### c. Experimental design

The second portion of this investigation explores the use of ensemble mean winds and the ability of spatially and temporally varying UUE, VVE, and UVE derived from the SREF MET ensemble, both uncalibrated and calibrated using the method presented in section 2, to improve SCIPUFF dispersion calculations. These predictions using a single-SCIPUFF forecast, but with ensemble-derived MET uncertainty information, are hypothesized to be similar to those based on an explicit ensemble of dispersion calculations, each member based on an individual MET member. Since we have no actual concentration (e.g., tracer) data, we use the 4-km MM5 dynamic analysis as input to SCIPUFF to produce a baseline dispersion realization as a basis of comparison against the experiments described below. An image of the MM5 domains superimposed over a map of the United States is shown in Fig. 5.

The SCIPUFF dispersion forecast assumes the release of a passive tracer (C_{7}F_{14}) at 0900 UTC 24 August 2004 (just before dawn) lasting for 10 min at 36°N, 79°W, or just west of Durham, North Carolina, near the Durham/Orange County line, from an elevation of 10 m AGL. Figure 7 shows the SCIPUFF domain and release point, and provides a comparison between the high-resolution terrain from MM5 used to create the baseline (Fig. 7a) versus the lower-resolution SREF terrain used in the SCIPUFF experiments (Fig. 7b). SCIPUFF forecasts are made for a 24-h period beginning with the start of the release at 0900 UTC 24 August 2004. To avoid the SREF spinup period in both the mean forecast and LVC, we use the 2100 UTC 23 August 2004 SREF forecast for our experiments, so that the SCIPUFF forecasts are using 12–36-h SREF forecasts.

We recognize that applying the same ATD code used in the experiments to produce the baseline dispersion data can introduce “twin problem” issues (Halem et al. 1982). However, use of different MET inputs (MM5 continuously assimilating all available surface and upper-air data versus mean SREF forecasts), created at much different spatial resolutions (4-km, 22 terrain-following layers between surface and 850 hPa in MM5 versus 40-km, 7 pressure levels between sea level and 850 hPa in SREF), to drive the SCIPUFF baseline should greatly reduce the impact of any SCIPUFF-based relationships between the baseline SCIPUFF results and experimental SCIPUFF results.

Five different experiments are designed for this investigation. The first four experiments use the same uncertainty model but with different inputs for the variances. The first is a control experiment (CTL) that uses the MET ensemble mean winds without any MET ensemble uncertainty statistics (UUE, VVE, etc.). In the absence of uncertainty inputs, SCIPUFF uses its own default simple error growth model for the variances and length scale based on grid length and the Gifford spectrum, and doubling of the error in 2–3 days (Sykes et al. 2006). [The baseline also uses the simple error growth model, but variances are kept small because the MET input is considered an analysis (0-h forecast) since the continuous data assimilation relaxes the model fields toward the observations.] The second experiment, UNCAL, uses the MET ensemble mean fields, but uses the raw (uncalibrated) MET ensemble spreads (i.e., UUE, VVE, and UVE) as the input uncertainty fields. SLE is assumed to be constant in time and throughout the domain and is set to an intermediate value of 50 km based on Warner et al. (2002). The third experiment, CAL_10m, again uses the MET ensemble mean wind fields, but with calibrated uncertainty statistics based on the ensemble spread at 10 m AGL and using the calibration method discussed in section 2. We assume that the calibration is constant throughout space and time and use the LVC parameter values derived for the SREF 10-m-AGL *U* and *V* 24-h forecast (midpoint of the 24-h SCIPUFF prediction) over the entire SREF domain from section 2c (Figs. 3, 4) as follows:

The single-point covariance UVE is not altered and SLE is again fixed at 50 km. The fourth experiment, CAL_10m_TV, a time-varying CAL_10m calibration experiment, also calibrates ensemble spreads based on the 10-m-AGL LVC (Figs. 3, 4), but varies the calibration with forecast length. Thus the 12-h calibration is applied to the 12-h SREF forecast (0-h SCIPUFF), the 15-h calibration to the 15-h SREF forecast (3-h SCIPUFF), and so on. The fifth experiment is the mean of an explicit SCIPUFF ensemble (EX_ENS), in which each SCIPUFF member uses one of the SREF member fields for MET and has the default hazard mode uncertainty model turned off.

Each experiment is created in a similar manner. First, MEDOC files are created for each of the individual ensemble members from the SREF output. Then, in all but EX_ENS, these files are averaged to create a single ensemble mean MEDOC file. For those experiments using ensemble MET variance information (UNCAL, CAL_10m, and CAL_10m_TV), the mean file also includes the ensemble variance information (UUE, VVE, UVE). Finally, for the calibrated experiments (CAL_10m and CAL_10m_TV), the LVC is applied to the UUE and VVE fields by multiplying each by its slope and adding its intercept, as specified above.
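The processing chain described above can be illustrated for a single grid point with a minimal Python sketch. It is not the code used in this study: the function names, the sample member winds, the slope and intercept values, and the choice of the unbiased (n − 1) normalization are all assumptions made for illustration, and the actual LVC coefficients are those derived in section 2.

```python
from statistics import fmean


def ensemble_wind_stats(u_members, v_members):
    """Return (u_mean, v_mean, uue, vve, uve) for one grid point.

    u_members, v_members: equal-length lists of member wind components (m/s).
    The (n - 1) normalization is an assumption of this sketch.
    """
    n = len(u_members)
    if n < 2 or n != len(v_members):
        raise ValueError("need at least two members and matching lengths")
    u_mean = fmean(u_members)
    v_mean = fmean(v_members)
    du = [u - u_mean for u in u_members]
    dv = [v - v_mean for v in v_members]
    uue = sum(d * d for d in du) / (n - 1)              # variance of U
    vve = sum(d * d for d in dv) / (n - 1)              # variance of V
    uve = sum(a * b for a, b in zip(du, dv)) / (n - 1)  # U-V covariance
    return u_mean, v_mean, uue, vve, uve


def apply_lvc(variance, slope, intercept):
    """Linear variance calibration: multiply by the slope, add the intercept."""
    return slope * variance + intercept


# Hypothetical three-member ensemble at one grid point (not SREF data).
u = [2.0, 3.0, 4.0]
v = [1.0, 0.0, -1.0]
u_mean, v_mean, uue, vve, uve = ensemble_wind_stats(u, v)

# Placeholder coefficients, NOT the values derived in section 2.
SLOPE_U, INTERCEPT_U = 1.5, 2.0
uue_cal = apply_lvc(uue, SLOPE_U, INTERCEPT_U)  # UVE is left uncalibrated
```

In this sketch the small raw variance (1.0 m^{2} s^{−2}) is dominated by the additive intercept after calibration, which mirrors the qualitative behavior reported for the calibrated experiments.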

Each experiment is evaluated both qualitatively and quantitatively using a set of statistical scores. The qualitative comparisons focus on the mean surface concentrations, plume shapes, and orientations, because these reflect the underlying wind fields and the wind variances, which differ among the experiments. The quantitative comparison of mean surface concentration uses bias (*B*), probability of detection (POD), threat score (TS), and false-alarm rate (FAR) as described in Wilks (2006), for logarithmically increasing threshold concentrations of 2 × 10^{−15}, 5 × 10^{−15}, and 1 × 10^{−14} (m^{3} C_{7}F_{14}) (m^{3} air)^{−1}. The statistical scores are calculated hourly over a theoretical sampling network grid with horizontal spacing of 0.2° × 0.2° at the surface (0 m) using the mean concentration from the baseline SCIPUFF simulation as the verification.
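For reference, the threshold-based scores can be written in terms of a 2 × 2 contingency table. The following Python sketch is illustrative only: the function name and sample concentrations are hypothetical, and because this passage does not give the formula used for FAR, both the false-alarm ratio, b/(a + b), and the false-alarm rate, b/(b + d), are returned so that the two definitions are not confused.

```python
def contingency_scores(forecast, baseline, threshold):
    """Bias, POD, TS, and two false-alarm measures for one threshold.

    forecast, baseline: equal-length lists of concentrations at the
    sampling points; a point counts as "yes" when it is >= threshold.
    """
    if len(forecast) != len(baseline):
        raise ValueError("forecast and baseline must have the same length")
    hits = false_alarms = misses = correct_negatives = 0
    for f, o in zip(forecast, baseline):
        f_yes, o_yes = f >= threshold, o >= threshold
        if f_yes and o_yes:
            hits += 1
        elif f_yes:
            false_alarms += 1
        elif o_yes:
            misses += 1
        else:
            correct_negatives += 1

    def ratio(num, den):
        # Undefined scores (zero denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "bias": ratio(hits + false_alarms, hits + misses),
        "pod": ratio(hits, hits + misses),
        "ts": ratio(hits, hits + false_alarms + misses),
        "false_alarm_ratio": ratio(false_alarms, hits + false_alarms),
        "false_alarm_rate": ratio(false_alarms,
                                  false_alarms + correct_negatives),
    }


# Hypothetical concentrations at five sampling points (not study data).
experiment = [3e-15, 1e-15, 6e-15, 0.0, 2.5e-15]
baseline = [4e-15, 3e-15, 0.0, 0.0, 2e-15]
scores = contingency_scores(experiment, baseline, threshold=2e-15)
```

With these sample values there are two hits, one miss, one false alarm, and one correct negative, so the bias is 1 even though the forecast and baseline "yes" areas only partly overlap; this is why bias is read together with POD and TS.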

### d. Results

The impact of the calibration at 6 h (1500 UTC 24 August 2004) is shown in Fig. 8, with the uncalibrated UUE and VVE in Figs. 8a,b, respectively, and the postcalibration UUE and VVE from CAL_10m in Figs. 8c,d. Throughout most of our SCIPUFF domain, uncalibrated wind variances are quite small, so when the CAL_10m calibration [Eq. (4)] is applied, the additive constant dominates over the multiplicative constant, yielding larger variances. The same is true for all forecast lengths during the experiments, so CAL_10m_TV provides a similar effect, generally increasing variance values in the SCIPUFF domain. The uncalibrated variances remain low in the vicinity of the plume throughout the forecast period (not shown), so that the application of both calibrations results in increased variance near the plume throughout the SCIPUFF forecasts.

We already see significant differences among the SCIPUFF experiments at this time (Fig. 9). In CTL, the default hazard mode uncertainty model used when no uncertainty information is provided causes much more effective diffusion^{1} of the mean plume than those forecasts where MET ensemble uncertainty information is provided (UNCAL, CAL_10m, and CAL_10m_TV). This effective diffusion results in a much broader plume with lower concentrations than the baseline. Among those experiments where uncertainty information is provided, the calibrated experiments show a broader concentration pattern than the uncalibrated UNCAL experiment. The reason for this becomes clear after comparing the uncalibrated UUE and VVE to the calibrated UUE and VVE at 6 h (Fig. 8). The calibrated variances, which are much larger than the uncalibrated variances, cause SCIPUFF to broaden the plume. The explicit ensemble EX_ENS exhibits a mean surface concentration similar to that seen in UNCAL, although somewhat smaller and with a more west–east rather than west-northwest to east-southeast orientation.

Generally, all of the experiments have approximately the same location of maximum concentration. However, the broader diffusion pattern lowers the maximum concentration while expanding the area of lower concentrations, conserving the total mass of the release. CTL clearly expands the plume and reduces the concentration far too much compared to the baseline. Experiment UNCAL retains the high-concentration area, but does not expand the plume enough, and preferentially elongates the plume in the west-northwest to east-southeast direction much more than the baseline. The CAL_10m and CAL_10m_TV experiments expand the plume more than UNCAL and better predict the baseline plume area and shape. EX_ENS results are most similar to the UNCAL results, with a similar concentration pattern that includes the same elongation as UNCAL, but in a west–east direction.

At 12 h (Fig. 10), the baseline mean surface concentration is beginning to show some asymmetries toward the northwest in the mean plume. EX_ENS is showing greater asymmetry, but with a pronounced elongation in the east–west direction. All of the other experimental results, however, are still quite symmetrical, although UNCAL shows a slight elongation in the east–west direction. We postulate that the baseline pattern asymmetry toward the northwest develops earlier than that in the experiments because its SCIPUFF simulation uses higher-resolution MET and terrain data. At this time, the UNCAL experiment more closely resembles EX_ENS than any other experiment.

We see a significant change in the appearance of the mean surface concentration fields at 18 h (Fig. 11). We now start to see asymmetry not just in the baseline, but in the calibrated experiments as well. The CTL plume continues to expand and now encompasses a significant portion of the domain. The UNCAL experiment continues to be roughly symmetric about its long axis, now oriented southwest–northeast. The two calibration experiments, however, are now distorted into a kidney shape, with higher concentrations extending northwestward into southeastern West Virginia. Of particular interest here is the orientation of the centerline of maximum concentration (long axis) of the plumes. The baseline plume has a long axis oriented roughly to the northwest and extending into West Virginia. CTL and UNCAL fail to capture this orientation, each with a long axis that extends north-northeast or northeast. The two calibrated experiments, with their kidney shape, have an axis that points northeast, paralleling the Appalachians through the Piedmont of North Carolina, but then bends westward farther north, extending the plume northwestward into West Virginia. This allows the two calibrated experiments to better fit the shape of the baseline plume for the medium to low contours shown. The EX_ENS mean concentration field again looks substantially like that of UNCAL, but exhibits an eastward lobe of higher concentration that is missing from UNCAL.

We postulate that the cause for this northwest tilt in the experimental forecasts is related to the UVE in the region. The UVE values over southern West Virginia prior to this time are negative and less than −0.5 m^{2} s^{−2}, in contrast to near-zero or positive values throughout the majority of the domain (Fig. 12). This negative correlation between *U* and *V* would tend to elongate a plume in the northwest–southeast direction. The calibrated experiment plumes (CAL_10m and CAL_10m_TV), with larger effective diffusion, would reach this negative UVE region before the UNCAL experiment plume. We hypothesize that this effect leads to the kidney-shaped area and thus, the improved agreement with the baseline.
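The geometric argument above can be checked with a short Python sketch that computes the principal axis of the 2 × 2 wind-error covariance matrix formed from UUE, VVE, and UVE. This illustrates only the geometry of the covariance and is not a description of how SCIPUFF treats these fields internally; the function names and sample values are hypothetical.

```python
import math


def major_axis_angle_deg(uue, vve, uve):
    """Orientation of the major axis of the covariance ellipse.

    Returned in degrees counterclockwise from east (+U). The angle is
    not meaningful when the covariance is isotropic (uue == vve, uve == 0).
    """
    return math.degrees(0.5 * math.atan2(2.0 * uve, uue - vve))


def principal_variances(uue, vve, uve):
    """Eigenvalues (major, minor) of [[uue, uve], [uve, vve]]."""
    mean = 0.5 * (uue + vve)
    radius = math.hypot(0.5 * (uue - vve), uve)
    return mean + radius, mean - radius


# Hypothetical values with equal variances and negative covariance.
angle = major_axis_angle_deg(1.0, 1.0, -0.5)
major, minor = principal_variances(1.0, 1.0, -0.5)
```

Here the angle is −45°, that is, the major axis runs from northwest to southeast, with the larger principal variance (1.5 m^{2} s^{−2}) along that axis. Reversing the sign of the covariance rotates the axis to southwest–northeast.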

At 24 h (Fig. 13), the baseline plume becomes oriented just west of north and extends across the Ohio River into eastern Ohio near the northern edge of the SCIPUFF domain. The maximum concentrations extend from northern North Carolina to eastern Ohio. The CTL experiment plume has now become so diffuse that it covers two-thirds of the domain, making it rather difficult to use as a predictive tool. The UNCAL experiment, with its lower effective diffusion, reproduces the higher concentrations exhibited by the baseline, whereas CTL, CAL_10m, and CAL_10m_TV do not; however, UNCAL is unable to extend its higher concentrations far enough north and south to cover the same latitudinal extent as the baseline. The CAL_10m_TV experiment, while not able to keep the highest concentrations, is able to achieve similar latitudinal extent and orientation of the medium contours of the baseline plume at this late time. The CAL_10m experiment is also able to cover the correct latitudinal extent, but the orientation of the plume is more north-northeast toward southwestern Pennsylvania. EX_ENS again most resembles the UNCAL experiment, with the highest concentrations in each falling in much the same area and orientation. EX_ENS has lower-concentration extensions toward the east and southwest than UNCAL does, and is able to extend higher mean concentrations into Ohio, which is more consistent with the baseline.

Overall, CTL was outperformed by all of the experiments that included MET uncertainty information derived from a MET ensemble throughout the 24-h forecast period when compared to the baseline. Both calibrated experiments, CAL_10m and CAL_10m_TV, have similar mean concentration patterns throughout the forecast period and best match the baseline throughout most of the forecast period, especially for medium to low concentrations. Experiments UNCAL and EX_ENS have similar patterns, although not as similar to each other as the relation between the two calibrated experiments. This similarity between UNCAL and EX_ENS indicates that the SCIPUFF uncertainty model is performing as intended and is producing a reasonable approximation of the explicit ensemble given only the wind variances and covariance. This means that differences among all the experiments are not due to an artifact of the way the variance data are being treated by the model, but rather by the input (or modeled) variances themselves.

It is worth noting that none of the experiments are able to predict the hazard area at the highest concentrations of the baseline. UNCAL and EX_ENS are able to produce some area at the higher concentrations, and CAL_10m and CAL_10m_TV are able to achieve better areal coverage at the medium- and lower-concentration thresholds, but none achieves the correct area for the highest concentrations. We believe that this may be due in part to the differing PBL depths between the 4-km baseline and the 40-km SREF experiments. The baseline uses the PBL depth computed by MM5, but the experiment set uses a PBL depth calculated by SCIPUFF based on the MET input from SREF. The MM5 PBL depths are consistently lower over the SCIPUFF domain than those based on the SREF MET, by about 600 m on average throughout the simulation period (not shown). Since the tracer quickly becomes vertically mixed throughout the boundary layer, the increased PBL depth of the SREF-based experiments leads to an increased volume occupied by the tracer, and thus lower concentrations by conservation of mass. This suggests that, given PBL depths similar to those of the baseline, the two calibrated experiments might reproduce the higher concentrations, whereas UNCAL and EX_ENS might have concentrations that are too high.
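The size of this dilution effect can be estimated with a simple well-mixed box argument: for a fixed tracer mass spread over a fixed horizontal area, the mean concentration scales inversely with the PBL depth. The Python sketch below is an idealization under that assumption. Only the roughly 600-m difference comes from the text; the absolute depths of 1000 and 1600 m are hypothetical values chosen for illustration.

```python
def well_mixed_concentration_ratio(h_baseline_m, h_experiment_m):
    """Ratio of experiment to baseline concentration for a well-mixed tracer.

    Assumes the same tracer mass and horizontal plume area in both cases,
    so that concentration is proportional to 1 / (PBL depth).
    """
    if h_baseline_m <= 0 or h_experiment_m <= 0:
        raise ValueError("PBL depths must be positive")
    return h_baseline_m / h_experiment_m


# Hypothetical depths that differ by the ~600 m noted in the text.
ratio = well_mixed_concentration_ratio(1000.0, 1600.0)
```

Under these assumed depths the experiment concentrations would be about 0.63 times those of the baseline, a reduction large enough to matter when the thresholds examined differ by factors of only 2 to 2.5.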

Evaluation of the cumulative hourly statistics for the lowest threshold of 2 × 10^{−15} in Fig. 14 shows that CAL_10m_TV performs best in all measures except FAR, while CAL_10m performs second best. However, the FAR is relatively small and ranges between 0.00 and 0.03 for all experiments. We also feel that for hazard prediction, FAR is of less importance than TS and POD, as the cost of failing to warn is far higher than that of a false alarm. Experiment UNCAL performs similarly to EX_ENS in every category, both outperforming the control in POD and TS despite having lower *B*.

Figure 15 indicates that at the middle threshold examined (5 × 10^{−15}), the two calibrated experiments continue to show an advantage over the other experiments, but by a smaller margin than for the lower threshold. Again the calibration that changes with forecast length (CAL_10m_TV) slightly outperforms the calibration that uses the same calibration throughout the forecast period (CAL_10m). UNCAL and EX_ENS again perform similarly in all statistical scores, with UNCAL performing slightly better. At this intermediate threshold, CTL performs worst in all statistical scores.

At the highest threshold of 1 × 10^{−14} (Fig. 16), the statistical scores are much closer and the ranking of the experiments changes. UNCAL now performs slightly better than the other experiments in all scores, and EX_ENS performs second best. CAL_10m_TV performs next best in POD and TS. CTL again performs worst in all scores. This would seem to indicate that the calibration is less effective for very high thresholds. However, statistical differences among the experiments are much smaller, and a time series of the statistical scores (not shown) indicates that the calibrated experiments actually perform better than UNCAL and EX_ENS during the first 12 h of the simulation. Also, we must again consider the effect of the different PBL depths in preventing the calibrated experiments from reproducing the highest concentrations in the baseline.

In general, the objective statistical scores suggest that there is added value in including MET variance information as an input to SCIPUFF. There is further value in using calibrated variance information for all but the highest threshold examined, and calibrated results show some advantage even at the highest threshold through 12 h. Allowing the calibration to vary with forecast length provides yet more benefit. It is important to note that while the calibrated experiments did not perform as well at the highest threshold late in the study period, lower concentrations can still be dangerous, depending on the material. Furthermore, moderate to light concentrations will often persist over a much longer period of time, resulting in greater exposure than a high but brief peak concentration.

## 4. Conclusions

In this study we examined the use of MET ensemble information to quantify MET uncertainty to improve atmospheric transport and dispersion predictions. The first part of the study (section 2) examined a linear variance calibration (LVC) method that provides a simple model for computing MET uncertainty from the spread in MET ensembles. The calibration method was tested for wind variances in the lower troposphere using SREF during a 24-day period in late summer 2004.

The calibration results for wind fields over the SREF domain were very good, with *R*^{2} values of 0.95 or greater for both vertical levels at forecast lengths of 12 h and longer. This result indicated that for our study period, ensemble spread is related to the actual error variance in the ensemble mean field through a simple linear relationship. The results also indicated that the calibration varies in the vertical but changes more with forecast length, as slopes generally increased with forecast length and *y* intercepts increased with forecast length at longer forecast lengths. The reason for this result is likely related to changes in the ensemble dispersiveness at different forecast lengths relative to the actual uncertainty in the atmosphere and to sampling artifacts from finite ensemble size. These results are promising for use in ATD, as uncertainty in the grid-resolved wind field is a major contributor to uncertainty in the dispersion calculations.
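As a reminder of what such a linear relationship involves, the following Python sketch fits a slope, intercept, and *R*^{2} to binned pairs of ensemble variance and squared error of the ensemble mean. It is a simplified illustration and not the procedure of section 2: the equal-population binning, the ordinary least squares fit, the function name, and the synthetic data are all assumptions made here.

```python
from statistics import fmean


def fit_lvc(ens_variance, sq_error_of_mean, n_bins):
    """Fit error variance = slope * ensemble variance + intercept.

    Pairs are sorted by ensemble variance and split into n_bins
    equal-population bins (an assumption of this sketch); the line is
    fit to the bin means. Returns (slope, intercept, r_squared).
    """
    pairs = sorted(zip(ens_variance, sq_error_of_mean))
    size = len(pairs) // n_bins if n_bins >= 2 else 0
    if size < 1:
        raise ValueError("need n_bins >= 2 and at least one pair per bin")
    xs, ys = [], []
    for b in range(n_bins):
        stop = (b + 1) * size if b < n_bins - 1 else len(pairs)
        chunk = pairs[b * size:stop]
        xs.append(fmean([p[0] for p in chunk]))  # mean ensemble variance
        ys.append(fmean([p[1] for p in chunk]))  # error variance in the bin
    x_bar, y_bar = fmean(xs), fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0.0:
        raise ValueError("bin-mean ensemble variances are all identical")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    r_squared = 1.0 - ss_res / ss_tot if ss_tot else float("nan")
    return slope, intercept, r_squared


# Synthetic, exactly linear data (not SREF data): error = 2 * spread + 1.
spread = [float(i) for i in range(12)]
error = [2.0 * s + 1.0 for s in spread]
slope, intercept, r_squared = fit_lvc(spread, error, n_bins=4)
```

Because the synthetic data are exactly linear, the fit recovers a slope of 2 and an intercept of 1 with *R*^{2} = 1. With real forecast errors, the squared error at a single point is very noisy, which is why averaging within bins precedes the fit in this sketch.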

The calibration technique is new and offers substantial areas for continued research. Before widespread use, however, the calibration should be further tested for other training windows of varying length and season, and other models and model resolutions. Other possible areas for future research include studying the diurnal and seasonal variation in the calibration and calculating the calibration directly on the vertical levels used for input into the ATD model.

The case study (section 3) evaluated the use of ensemble-derived MET uncertainty information in the SCIPUFF ATD model via the UUE, VVE, UVE, and SLE fields using both calibrated and uncalibrated MET ensemble spread. Experiments included a control forecast using the default hazard prediction MET uncertainty, experiments using uncalibrated and calibrated ensemble wind variances, and the explicit average of an ensemble of SCIPUFF forecasts. All experiments were compared to a baseline SCIPUFF simulation driven by MET from a high-resolution MM5 dynamic analysis using four-dimensional data assimilation of observational data throughout the 24-h SCIPUFF forecast period.

The SCIPUFF experimental results showed that including ensemble-derived MET uncertainty information dramatically improves the agreement of the predicted concentrations with the baseline SCIPUFF simulation relative to the control forecast. The addition of a very simple calibration to UUE and VVE that is constant in space and time changed the predicted plume compared to the uncalibrated experiment, and a similar calibration allowed to vary in time produced additional small improvements. Both calibrated experiments improved the shape of the predicted plume over the uncalibrated prediction at most forecast times when compared to the baseline for medium and low concentrations. The wind variance calibrations shifted the orientation of the centerline of the plume and expanded the hypothetical hazard regions of medium concentration in areas more consistent with those in the baseline SCIPUFF simulation. The calibrated experiments were unable to reproduce the highest concentrations present in the baseline beyond 12 h. However, this may be due in part to larger PBL depths in the SREF experiments relative to the baseline.

The statistical results for hourly concentrations compared to the baseline throughout the 24-h period also showed that the calibrated experiments outperformed the uncalibrated experiment at the two lower thresholds of 2 × 10^{−15} and 5 × 10^{−15}, with a slight advantage for the calibration allowed to vary with forecast length (CAL_10m_TV) relative to a static calibration based on the midpoint of the forecast (CAL_10m). However, the calibrated experiments performed slightly worse than UNCAL for the highest threshold examined (1 × 10^{−14}). Since the thresholds for human impact vary by hazardous material, the differing results also suggest the need to analyze more than one threshold when assessing the impact of ATD changes.

The results of using MET uncertainty information (both calibrated and uncalibrated) in SCIPUFF also compared favorably to an explicit ensemble of SCIPUFF calculations based on the output of individual MET ensemble members, which we consider to be an attractive but computationally intensive method for generating dispersion uncertainty. The plume explicitly averaged from the SCIPUFF ensemble members most closely resembled the plume for the SCIPUFF experiment using uncalibrated ensemble statistics. The explicit ensemble, which also produced similar statistical scores to the uncalibrated experiment, produced scores worse than those of the calibrated experiments at the lower two thresholds examined. Thus, calibrated ensemble statistics experiments appeared to produce mean concentration fields more similar to those in the baseline SCIPUFF simulation, even compared to the explicit SCIPUFF ensemble.

Future work will further study the effects of uncertainty information in ATD short-range forecasts, as well as longer-range (>24 h) forecasts, using both ensemble mean winds and “best-member” winds, and the best way to apply the calibration to that information. Further investigation into schemes for applying calibrations that vary by level and by subregion may be required to discover the best way to apply these calibrations. More immediately, additional case studies should be investigated.

This paper has demonstrated proof of concept that 1) using MET uncertainty based on ensemble-derived wind variance statistics improves ATD predictions, and 2) calibrating MET ensemble statistics as a proxy for MET uncertainty adds value to the ATD predictions, even relative to those based on an ensemble of ATD models.

## Acknowledgments

We acknowledge Mark Roulston for stressing the importance of covariance information and starting us down the path of using a linear calibration via binning. Our frequent discussion with Ian Sykes about parameterizing meteorological uncertainty in SCIPUFF is much appreciated. We thank Jeff McQueen and NCEP for providing full ensemble data for SREF. We also thank our fellow Penn State team members, Joel Peltier, John Wyngaard, George Young, Nelson Seaman, and Jared Lee, for useful discussion about ATD uncertainty, and Jeff Zielonka for creating the baseline simulation. Finally, we thank the anonymous reviewers who provided insightful suggestions for improving this manuscript. This work was supported by the Defense Threat Reduction Agency (DTRA) under the supervision of CDR Stephanie Hamilton and Dr. John Hannan via the L3-Titan/DTRA Contract DTRA01-03-0013.

## Footnotes

*Corresponding author address:* David R. Stauffer, Department of Meteorology, The Pennsylvania State University, 503 Walker Building, University Park, PA 16802. Email: stauffer@meteo.psu.edu

^{1} We use the term “effective diffusion,” as it is not representing real diffusion in the atmosphere, but rather is a manifestation of the increased uncertainty in the mean concentration.