## Introduction

The need to assess the accuracy of hazardous material transport and dispersion models continues to be of great importance. Applications for these models as planning aids, and even as “real time” emergency response tools, continue to increase (Petty 2000). Past studies have compared the predictions of transport and dispersion models with field observations using a variety of statistical quantities. Statistical measures of bias, scatter, and correlation have been discussed (Hanna et al. 1993) and have been applied typically to quantities *derived* from the field observations, for example, maximum dosage along a sampler arc or estimated plume width at a given downwind range. To a large degree, derived quantities have been used for comparisons because it was recognized that measures of bias, scatter, and correlation applied to point-to-point comparisons, that is, observations and predictions paired in space and time, could indicate very poor model performance, given small plume displacements (Hanna 1988). That is, the predicted and observed “plumes” could have the same shape and size, and yet point-to-point comparisons would indicate that the model performs poorly in terms of the above statistical measures simply because, for example, the input wind direction was in error by a few degrees. In addition, for typical air-pollution applications, models might be required to predict the maximum concentrations at a certain distance without regard to direction of the plume. Therefore, the above statistical measures applied to derived quantities (e.g., the maximum) would be deemed acceptable for assessing model performance in air-quality or similar studies. In a similar way, when used as a tool for future planning, one imagines that a transport and dispersion model that gets the overall size, shape, and maximum values correct might be good enough, especially when considering the impracticality of predicting the detailed wind field weeks or months into the future.

Recently developed transport and dispersion models that incorporate access to more complete weather information, to include numerical weather predictions/forecasts, have led some to consider their usage as real-time, or at least near-real-time, emergency response aids (Nasstrom et al. 2000). For such applications, the actual location of the hazardous material, not just its size and shape, is of critical importance. Thus, observations and predictions, paired in space and time, must be compared, and measures to assess such an examination should be identified.

Recent studies of the European Tracer Experiment (ETEX) have included a figure of merit in space (FMS), defined as the overlap area between the prediction and observation divided by the total predicted and observed areas, all above some threshold concentration. For ETEX, FMS was used to compare the predictions of several models (Mosca et al. 1998). At its core, FMS compares observations and predictions paired in space and time; however, the FMS method, when applied to actual field observations, requires interpolation over the set of sampled observations and, as such, can be subject to “artificial” sensitivities introduced by the type of interpolation technique used. Furthermore, FMS does not distinguish between regions of over- and underprediction (i.e., regions of overprediction and underprediction are “weighted” identically). For some applications, describing the model's accuracy separately in terms of regions of over- and underprediction would provide insight, in particular, because a model user may assess the risk associated with model over- and underpredictions very differently.

Model validation efforts are meant to examine the accuracy of a given model's predictions within an “operational” context. Such validation analysis should include a metric by which field trial observations and predictions can be compared within the context of the specific application—a measure of effectiveness (MOE). The ideal MOE would faithfully capture and portray model performance and would convey to the users a certain degree of confidence (high or low) in the model, taking into account their particular application.

In this paper, an MOE that allows for point-to-point comparisons of predictions and observations and can provide for straightforward communication of a model's relative worth (accuracy) to a user who is concerned with the size, shape, *and* location of the hazard is identified and described.

## Description of two-dimensional MOE

A fundamental feature of any comparison of hazard prediction model output to observations is the set of over- and underprediction regions. We define the false-negative region as the region in which a hazard is observed but not predicted and the false-positive region as the region in which a hazard is predicted but not observed. Figure 1 shows one possible interpretation of these regions—the observed and predicted areas in which a prescribed dosage is exceeded. This view can be extended to consider the marginal over- and underpredicted dosages, as will be discussed below. In any case, numerical estimates of the false-negative region (*A*_{FN}), the false-positive region (*A*_{FP}), and the overlap region (*A*_{OV}) characterize this conceptual view.

The MOE is defined as a point in a two-dimensional space in which the *x* axis corresponds to the ratio of the overlap region to the observed region and the *y* axis corresponds to the ratio of the overlap region to the predicted region. When these mathematical definitions are algebraically rearranged, one recognizes that the *x* axis corresponds to 1 minus the false-negative fraction and the *y* axis corresponds to 1 minus the false-positive fraction:

MOE = (*x*, *y*) = (*A*_{OV}/*A*_{OB}, *A*_{OV}/*A*_{PR}) = [(*A*_{OB} − *A*_{FN})/*A*_{OB}, (*A*_{PR} − *A*_{FP})/*A*_{PR}],  (1)

where *A*_{FN} = region of false negative, *A*_{FP} = region of false positive, *A*_{OV} = region of overlap, *A*_{PR} = region of the prediction, and *A*_{OB} = region of the observation.
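As a concrete sketch, the MOE of Eq. (1) can be computed directly from the three region estimates. The function name and sample values below are illustrative only, not from the paper:

```python
# Minimal sketch of Eq. (1): the 2D MOE point from the three region estimates.
def moe(a_ov, a_fn, a_fp):
    """Return the MOE point (x, y) = (A_OV/A_OB, A_OV/A_PR)."""
    a_ob = a_ov + a_fn  # observed region = overlap + false negative
    a_pr = a_ov + a_fp  # predicted region = overlap + false positive
    return a_ov / a_ob, a_ov / a_pr

# A prediction that covers 8 of 10 observed "units" but adds 8 spurious ones:
x, y = moe(a_ov=8.0, a_fn=2.0, a_fp=8.0)
# x = 0.8 (false-negative fraction 0.2), y = 0.5 (false-positive fraction 0.5)
```

Note that a perfect prediction (no false-negative or false-positive region) yields the point (1, 1).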

### Characteristics of the MOE

Consistent with the above algebraic rearrangement, Fig. 2 shows the region of false negative decreasing from left to right and the region of false positive decreasing from bottom to top. Figure 2 demonstrates some of the key characteristics of the two-dimensional (2D) MOE space. We begin with the (1, 1) point located at the upper-right corner. Here, both plumes overlap entirely (no false-negative or false-positive fraction), and, thus, the model would achieve perfect agreement with the field trial. Point (0, 0) signifies that there is no region of overlap, and, thus, the model disagrees completely with the field trial. The 2D MOE includes directional effects; that is, the prediction of the location of a hazard, not just the shape and size of the plume, is critical to obtaining a high MOE “score.”

Along the line *x* = 1, the prediction completely envelops the observation. Along the line *y* = 1, the observation completely envelops the prediction. The “purple” diagonal line represents the situation in which the prediction and the observation have identical “total” sizes [i.e., *x* = *y* implies from Eq. (1) that *A*_{OB} = *A*_{PR}]. As one traverses this diagonal line from (1, 1) toward (0, 0), the fraction of overlap area between the predicted and observed plumes decreases.

Figure 3 suggests an additional interpretation of the 2D MOE. In this figure, the gold circular region represents the estimate of the MOE for some set of fictional model predictions ("model A") and field trial observations. The point estimate, perhaps the vector mean value of several similar trials, would be found approximately at the center of this region, and the overall size of the region represents the uncertainty associated with the point estimate of the MOE.

If a second set of model predictions was compared with “model A,” several conclusions might be anticipated. The second model's MOE estimate might be found in the region shaded “orange” (lower left). This would imply that model A performs significantly better—both its false-positive and false-negative fractions are lower. As an alternative, the second model might lead to an estimate in the green region (upper right), an indication that model A is the poorer performer (for this set of field trial observations). Last, the new model predictions might lead to an MOE value that is located in one of the gray regions. The implication here is that a user would have to make a determination as to the trade-off between false positive and false negative before deciding which model was most appropriate for his or her specific application.

### Computation of the MOE

Although Fig. 1 describes the conceptual view in terms of the areas *A*_{FN}, *A*_{FP}, and *A*_{OV}, it is not necessary to have actual physical areas to compute the components of the MOE. Rather, *A*_{FN}, *A*_{FP}, and *A*_{OV} can be computed directly from the predictions and field trial observations paired in space and time. For the dosage-based MOE, the false-positive region is the dosage predicted in a region but not observed. Therefore, for *A*_{FP} (as shown in Fig. 4a), one first considers all of the samplers at which the prediction is of greater value than the observation. Next, one sums the differences between the predicted and observed dosages at those samplers. Based on the samplers that contained observed values that were larger than the predicted values, one can similarly compute *A*_{FN}. Then *A*_{OV} is calculated by considering all samplers and summing the dosages associated with the minimum predicted or observed value. Restating the above mathematically, let *P*_{i} and *O*_{i} be the predicted and observed dosages at sampler *i*; then

*A*_{FP} = Σ_{i} max(*P*_{i} − *O*_{i}, 0),  (2)

*A*_{FN} = Σ_{i} max(*O*_{i} − *P*_{i}, 0),  (3)

*A*_{OV} = Σ_{i} min(*P*_{i}, *O*_{i}).  (4)

These estimates can be made on a linear scale, as shown in Fig. 4a for a Project Prairie Grass field trial, or on a logarithmic scale. If concentration information were available in place of dosages, an analogous procedure could be used to compute concentration-based MOE values.
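The summing procedure described above reduces to elementwise comparisons over the paired samplers. A minimal sketch, with illustrative names and invented dosage values:

```python
# Sketch of the dosage-based region estimates from paired samplers.
# pred/obs hold predicted and observed dosages at the same samplers.
def dosage_regions(pred, obs):
    a_ov = sum(min(p, o) for p, o in zip(pred, obs))        # overlap
    a_fn = sum(o - p for p, o in zip(pred, obs) if o > p)   # false negative
    a_fp = sum(p - o for p, o in zip(pred, obs) if p > o)   # false positive
    return a_ov, a_fn, a_fp

a_ov, a_fn, a_fp = dosage_regions([0.0, 3.0, 5.0, 2.0], [1.0, 2.0, 5.0, 4.0])
# a_ov = 9.0, a_fn = 3.0, a_fp = 1.0
```

A useful consistency check is that *A*_{OV} + *A*_{FP} recovers the total predicted dosage and *A*_{OV} + *A*_{FN} recovers the total observed dosage.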

For an MOE based on a dosage threshold *T*, one can partition the set of samplers into four subsets—OV, FN, FP, and BELOW:

OV = {*i* : *O*_{i} ≥ *T* and *P*_{i} ≥ *T*},
FN = {*i* : *O*_{i} ≥ *T* and *P*_{i} < *T*},
FP = {*i* : *O*_{i} < *T* and *P*_{i} ≥ *T*}, and
BELOW = {*i* : *O*_{i} < *T* and *P*_{i} < *T*},

where *P*_{i} and *O*_{i} are the predicted and observed dosages at sampler *i*. Then

*A*_{OV} = *n*(OV), *A*_{FN} = *n*(FN), and *A*_{FP} = *n*(FP),  (5)

where *n*(·) denotes the number of samplers in a subset. It is possible to modify the above definition of *A*_{OV} to include the number of elements (samplers) in the BELOW set. To be consistent with the conceptual view illustrated in Fig. 1, *A*_{OV} was defined as in Eq. (5).

Figure 4b illustrates this procedure using a sulfur dioxide (SO_{2}) dosage threshold of 60 mg s m^{−3} that was consistent with the sampler sensitivity for the Project Prairie Grass field trial (Barad 1958). This procedure is analogous to assessing an area-based MOE at a contour level (e.g., as illustrated conceptually in Fig. 1) using area interpolation of observations and predictions.
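The threshold-based bookkeeping can be sketched as simple sampler counts; the helper function and data below are ours, for illustration:

```python
# Partition paired samplers by a dosage threshold t and count each subset.
def threshold_regions(pred, obs, t):
    ov = fn = fp = below = 0
    for p, o in zip(pred, obs):
        if o >= t and p >= t:
            ov += 1
        elif o >= t:
            fn += 1          # observed above threshold, predicted below
        elif p >= t:
            fp += 1          # predicted above threshold, observed below
        else:
            below += 1       # both below: excluded from A_OV here
    return ov, fn, fp, below

# With the 60 mg s m^-3 threshold used for Project Prairie Grass:
counts = threshold_regions([70, 10, 90, 20], [80, 65, 30, 5], t=60)
# counts == (1, 1, 1, 1)
```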

## Relationship of MOE to standard statistical measures

This section describes and illustrates the mathematical relationships between the 2D MOE and several one-dimensional measures, including FMS, fractional bias, and a measure of scatter between observations and predictions.

### Figure of merit in space

The FMS described in the introduction is defined as

FMS = *A*_{OV}/(*A*_{FN} + *A*_{OV} + *A*_{FP}).  (6)

Expressions for *A*_{FN} and *A*_{FP} in terms of the MOE components are then substituted into Eq. (6), and, following algebraic rearrangement, one obtains

FMS = *xy*/(*x* + *y* − *xy*),  (7)

where the substituted expressions follow from rearranging Eq. (1):

*A*_{FN} = *A*_{OV}(1 − *x*)/*x* and *A*_{FP} = *A*_{OV}(1 − *y*)/*y*.  (8)

One can generalize the FMS for a particular user by introducing coefficients *C*_{FN} and *C*_{FP} to weight the false-negative and false-positive regions, respectively. We refer to this notional user scoring function as the risk-weighted FMS (RWFMS):

RWFMS = *A*_{OV}/(*C*_{FN}*A*_{FN} + *A*_{OV} + *C*_{FP}*A*_{FP}),  (9)

where *C*_{FN} and *C*_{FP} are greater than 0.

It may be true that, for some applications (e.g., technical model validation), the weightings for false negatives and false positives are considered irrelevant or are set equal (*C*_{FN} = *C*_{FP}). As developed here, the implicit coefficient associated with *A*_{OV} is 1.0. Therefore, the notion of equal weights for *A*_{FN} and *A*_{FP} (i.e., *C*_{FN} = *C*_{FP}) is insufficient for the complete specification of RWFMS. That is, the precise RWFMS values will depend on the values chosen for *C*_{FN} and *C*_{FP} and not just on their ratio.
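A minimal sketch of the RWFMS computation follows; the default weights reproduce the ordinary FMS, and all names are illustrative:

```python
# Sketch of the risk-weighted FMS; the overlap weight is implicitly 1.0.
def rwfms(a_ov, a_fn, a_fp, c_fn=1.0, c_fp=1.0):
    return a_ov / (c_fn * a_fn + a_ov + c_fp * a_fp)

# With c_fn = c_fp = 1 this is the ordinary FMS:
score = rwfms(8.0, 2.0, 8.0)                              # 8 / 18
# Weighting false negatives heavily drags the score down further:
risk_averse = rwfms(8.0, 2.0, 8.0, c_fn=5.0, c_fp=0.5)    # 8 / 22
```

The example makes the point in the text concrete: scaling both coefficients by a common factor changes the score, so specifying only the ratio *C*_{FN}/*C*_{FP} is not enough.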

Figure 5a shows contours of RWFMS (i.e., isolines) in the 2D MOE space for *C*_{FN} = *C*_{FP} = 1. Figure 5b similarly illustrates the case in which *A*_{FN} is weighted by a factor of 10 relative to *A*_{FP} and a factor of 5 relative to *A*_{OV}; that is, *C*_{FN} = 5 and *C*_{FP} = 0.5.

The contours of RWFMS can be used as the basis for scoring the performance of a model and for coloring the MOE space according to the RWFMS score. Figure 6 provides an example of user coloring of the MOE space, based on RWFMS for several values of *C*_{FN} and *C*_{FP} (e.g., red is “bad” and green is “good”). At an RWFMS of 0.0, this coloring scheme incorporates pure red. As the user-defined RWFMS increases from 0.0 to 0.50, the intensity of green increases linearly. For instance, at an RWFMS value of 0.5, there are equal intensities of red and green (hence, yellow). In a similar way, for RWFMS values between 0.50 and 1.0, the red intensity is reduced linearly with increasing RWFMS value. At an RWFMS value of 1.0, the coloring used is pure green.
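The red-to-green interpolation just described can be sketched as an RGB triple in [0, 1]; the exact encoding below is our reading of the scheme:

```python
# Red-to-green coloring of an RWFMS score in [0, 1]:
# green rises linearly from 0 to full intensity at 0.5;
# red stays full below 0.5, then fades linearly to 0 at 1.0.
def rwfms_color(score):
    green = min(2.0 * score, 1.0)
    red = min(2.0 * (1.0 - score), 1.0)
    return red, green, 0.0  # (R, G, B)

# rwfms_color(0.0) -> pure red; rwfms_color(0.5) -> yellow (equal red and green);
# rwfms_color(1.0) -> pure green
```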

### Fractional bias

A hazardous material transport and dispersion model might be applied to problems for which the actual location of the hazard or direction of the plume is of no particular importance. For example, such a model might be used to study potential future outcomes of an accidental or intentional release. In these cases, the actual weather (e.g., wind speed and direction) in the far future, associated with the planning, cannot be known with any certainty. For these applications it is desirable to have a scoring function that simply compares the sizes of the predicted and observed regions. In essence, model users in these cases would want a model that minimizes the overall model bias.

The fractional bias is defined here as

FB = 2(*C̄*_{p} − *C̄*_{o})/(*C̄*_{p} + *C̄*_{o}),

where *C* = observation/prediction of interest (e.g., dosage), *C*_{p} corresponds to the model prediction, *C*_{o} corresponds to the observation, and a bar above the quantity (e.g., *C̄*_{p}) denotes an average over all samplers. An FB of 0 implies that the predicted and observed regions have identical total sizes, which in the 2D MOE space corresponds to the diagonal line *y* = *x*. Then, to construct a figure of merit based on FB, one can require that *A*_{PR} and *A*_{OB} be within a factor of *s* of each other, with *s* > 1. This requirement is stated mathematically by requiring

1/*s* ≤ *A*_{PR}/*A*_{OB} ≤ *s*.  (16)

Figure 7 shows isolines of this FB figure of merit (FBFOM), in MOE space, for various values of the parameter *s*.

A coloring scheme (red to green, as discussed previously) for the 2D MOE space, using FBFOM, can be formulated (Warner et al. 2001b), with the results shown in Fig. 8.

To express FB in terms of the MOE components, let *n* = number of data points used in the comparisons, *C*^{(i)}_{o} = the *i*th observed concentration, and *C*^{(i)}_{p} = the *i*th predicted concentration, so that

FB = 2(Σ_{i} *C*^{(i)}_{p} − Σ_{i} *C*^{(i)}_{o})/(Σ_{i} *C*^{(i)}_{p} + Σ_{i} *C*^{(i)}_{o}).  (17)

Then, the numerator can be rearranged as in Eq. (18):

Σ_{i} (*C*^{(i)}_{p} − *C*^{(i)}_{o}) = *A*_{FP} − *A*_{FN}.  (18)

In a similar way, the denominator in Eq. (17) can be reduced to

Σ_{i} (*C*^{(i)}_{p} + *C*^{(i)}_{o}) = 2*A*_{OV} + *A*_{FN} + *A*_{FP}.  (19)

Thus, Eq. (20) results:

FB = 2(*A*_{FP} − *A*_{FN})/(2*A*_{OV} + *A*_{FN} + *A*_{FP}).  (20)

Substituting *A*_{FP} and *A*_{FN} from Eq. (8) into Eq. (20) leads to

FB = 2[*A*_{OV}(1 − *y*)/*y* − *A*_{OV}(1 − *x*)/*x*]/[2*A*_{OV} + *A*_{OV}(1 − *x*)/*x* + *A*_{OV}(1 − *y*)/*y*],  (21)

and, after algebraic simplification,

FB = 2(*x* − *y*)/(*x* + *y*).  (22)

Further rearrangement of Eq. (22) yields

*y* = *x*(2 − FB)/(2 + FB),  (23)

which shows that isolines of constant FB in the 2D MOE space are straight rays through the origin (Fig. 7) with slope *m*:

*m* = (2 − FB)/(2 + FB).  (24)

Within the context of the FB figure of merit, for FB ≥ 0, *m* = 1/*s* from Eq. (16), and, for FB < 0, *m* = *s*.
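The algebra relating FB to the MOE components can be checked numerically. The sketch below uses a small invented data set and assumes the sign convention in which positive FB indicates overprediction:

```python
# Numerical check that FB computed from sampler sums matches
# FB = 2(x - y)/(x + y) computed from the MOE components.
pred = [0.0, 3.0, 5.0, 2.0]
obs = [1.0, 2.0, 5.0, 4.0]

a_ov = sum(min(p, o) for p, o in zip(pred, obs))        # 9
a_fn = sum(max(o - p, 0.0) for p, o in zip(pred, obs))  # 3
a_fp = sum(max(p - o, 0.0) for p, o in zip(pred, obs))  # 1
x, y = a_ov / (a_ov + a_fn), a_ov / (a_ov + a_fp)       # (0.75, 0.9)

fb_sums = 2.0 * (sum(pred) - sum(obs)) / (sum(pred) + sum(obs))
fb_moe = 2.0 * (x - y) / (x + y)
# both equal 2*(10 - 12)/22 = -2/11: the model underpredicts total dosage
```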

### Measure of scatter

## Example applications of the 2D MOE

This section provides a few example applications of the MOE to predictions of short-range field observations. Comparisons of predictions from the U.S. Defense Threat Reduction Agency's Hazard Prediction and Assessment Capability (HPAC; SAIC 2001), which includes the second-order closure-integrated puff (SCIPUFF) model as its transport and dispersion engine, and the U.S. Department of Energy's National Atmospheric Release Advisory Center (NARAC; Nasstrom et al. 2000) model with the Project Prairie Grass field observations (Warner et al. 2001c) and with controlled, computer-simulated releases (Warner et al. 2001a) have been completed using the MOE.

### Model comparisons with Project Prairie Grass

Project Prairie Grass field trials were conducted during the summer of 1956 in north-central Nebraska near the town of O'Neill (Barad 1958). The primary objective of Project Prairie Grass was to determine the rate of diffusion of a neutrally buoyant tracer gas as a function of meteorological conditions. These experiments involved continuous 10-min releases of SO_{2} from a near-surface point source. Downwind SO_{2} concentrations were sampled along five concentric, semicircular arcs located 50, 100, 200, 400, and 800 m away from the gas source. The samplers were arranged at 2° intervals along the 50-, 100-, 200-, and 400-m arcs (91 samplers per arc) and at 1° intervals for the 800-m arc (i.e., 181 samplers along the 800-m arc). The Project Prairie Grass experiments represent a relatively well-defined, well-known, and classic standard for the evaluation of transport and dispersion models. As such, these experiments are ideal for the initial demonstration of the MOE concepts.

A total of 70 releases were conducted during the Project Prairie Grass experiment. Of these 70 releases, 19 were eliminated from further consideration because crucial wind or sampler concentration information was missing (14 releases), the source height was different and no turbulence fit at that height was reported (4 releases), or the maximum observed concentration was extremely small (<0.5 mg m^{−3}; 1 release). The 51 trials that were included in this study were numbered 5, 7–28, 32–46, 48–51, and 54–62 in the original Project Prairie Grass report (Barad 1958). The detailed protocol for the computation of SCIPUFF and NARAC predictions of the Project Prairie Grass releases is further discussed in Warner et al. (2001e).

Figure 10 displays SCIPUFF MOE values for the five Project Prairie Grass arcs. The point estimate for the MOE value for each arc is represented by the vector average obtained from the individual MOEs calculated for the 51 Project Prairie Grass releases that were examined and lies approximately at the center of the given colored cluster. The colored clusters correspond to the approximate 95% confidence region associated with the MOE point estimate. These approximate confidence regions were computed using the bootstrap percentile method and 10 000 bootstrap samples (Efron and Tibshirani 1993). For example, MOE values were computed at each arc for each of the 51 releases that were examined. Resampling with replacement (“bootstrap”) of the 51 MOE vectors was done; that is, 10 000 sets of 51 vector resamples were created. From these sets, 10 000 vector averages (i.e., MOE values) were calculated and used to estimate the confidence region associated with the original MOE point estimate.

For the MOE based on total dosage, model performance degrades at the longer ranges (Fig. 10a). This result is statistically significant, because the 800-m arc (red) MOE confidence region is completely separated from the 50-m arc (dark blue) MOE confidence region. For the MOEs based on a dosage threshold of 60 mg s m^{−3}, similar results, albeit with smaller differences, can be seen (Fig. 10b).

MOE values for individual trials were created by combining the arc results. To create values that would be more closely related to actual area-based measures, the MOE computations were weighted by the intersampler distances for the individual arcs. First, each sampler was associated with a short line segment centered at the sampler location and having the same length as the intersampler distance for that arc. The intersampler spacing was computed as *rθ,* where *r* is the distance to the arc (e.g., 800 m) and *θ* is the angular separation between samplers (in radians). This procedure led to the following intersampler distances, rounded to two decimal places (in meters): 1.75, 3.49, 6.98, 13.96, and 13.96 for the 50-, 100-, 200-, 400-, and 800-m arcs, respectively. Next, the summed dosages used to define *A*_{OV}, *A*_{FN}, and *A*_{FP} for each individual arc (or, for some specified threshold, the numbers of samplers used to define *A*_{OV}, *A*_{FN}, and *A*_{FP}) were multiplied by the corresponding intersampler distance. Adding the values for the five arcs together forms the *A*_{OV}, *A*_{FN}, and *A*_{FP} estimates for the entire (all arcs) trial. In this way, contributions from each arc to an overall “area based” MOE were estimated. This weighting scheme should not be considered to be a general technique, but rather to be a natural approach for this specific arc-based sampling space. For a densely sampled field trial like Project Prairie Grass, one can also consider area interpolation as a method of “sampler weighting” that would lead to area-based MOE values.
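The weighting step can be sketched directly from the arc geometry described above (2° spacing on the 50- to 400-m arcs, 1° on the 800-m arc); the combining function and its inputs are illustrative:

```python
import math

# Intersampler spacing r*theta for each Project Prairie Grass arc.
spacing = {r: r * math.radians(2.0) for r in (50, 100, 200, 400)}
spacing[800] = 800 * math.radians(1.0)
# e.g., spacing[50] ~ 1.75 m and spacing[800] ~ 13.96 m

# Weight each arc's (A_OV, A_FN, A_FP) sums by its spacing and add,
# giving "area based" totals for the whole (all arcs) trial.
def combine_arcs(per_arc):
    return tuple(sum(spacing[r] * comps[k] for r, comps in per_arc.items())
                 for k in range(3))
```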

Figures 11a and 11b show MOE 95% confidence regions for the 51 Project Prairie Grass trials, based on total dosage and a dosage threshold of 60 mg s m^{−3}, respectively. Figures 11c and 11d show MOE 95% confidence regions for the same 51 Project Prairie Grass trials as a function of stability category grouping. Stability category assignments, which were previously developed for Project Prairie Grass by Irwin and Rosu (1998), were used to group trials with similar characteristics in terms of atmospheric stability. For this display, stability category assignments of 1, 2, and 3 were considered "unstable," assignments of 5, 6, and 7 were considered "stable," and trials assigned 4 were considered "neutral."

For both types of MOE values—those based on total dosage and those based on a dosage threshold—Fig. 11 suggests that the model performed best during the trials associated with the more unstable atmospheric conditions (red). In terms of total dosage (Figs. 11a,c), the false-negative fraction does not change much across stability conditions; however, the false-positive fraction steadily increases from unstable to neutral to stable. For the dosage threshold-based MOEs (Figs. 11b,d), predictions of trials conducted under stable conditions led to increased false-positive and decreased false-negative fractions relative to the other trials. That is, the model overpredicted the region above the threshold and, in this sense, might be considered somewhat conservative. For the neutral trials, the false-positive fraction was minimized, but at the expense of an increased false-negative fraction relative to the other stability conditions.

### Probabilistic predictions

To address inherent uncertainties associated with real observations, some transport and dispersion models provide predictions of ensemble mean values, as well as probabilistic-based predictions. In this view, observations of concentration, for example, are seen as individual realizations from some population—the ensemble (Venkatram 1988; ASTM 2000). The SCIPUFF model can provide probabilistic predictions of hazardous material transport and dispersion (Sykes et al. 1996). This section illustrates the application of the user-oriented MOE to assess probabilistic-based prediction outputs.

Figure 12 provides a view of SCIPUFF's capability to produce probabilistic outputs. The predicted contours are associated with the probability that a dosage of 60 mg s m^{−3} is exceeded. For example, the 0.1 probability contour, the outer light purple contour, is meant to encompass the region in which 9 of 10 plume realizations will lie. Therefore, by this nomenclature, the smaller probability values, like 0.1, lead to notionally fatter plumes.

Figure 13 shows approximate 95% confidence regions for MOE estimates based on probabilistic prediction outputs and the mean value prediction (yellow cluster) for the 51 Project Prairie Grass field trials that were examined (Warner et al. 2001b). The 0.01 probability values are always associated with the widest predictions (most conservative) and the 0.999 probability values are always associated with the narrowest predictions. Figures 13a and 13b show the same MOE confidence regions superimposed on two notional user colorings—one that equally weights false-positive and false-negative fractions and one that weights false-negative fractions as 10 times as important as false-positive fractions.

Based on the RWFMS user colorings described in Fig. 13, the following notional evaluation is possible. For equal weighting of the false-negative, false-positive, and overlap fractions (i.e., *C*_{FN} = *C*_{FP} = 1), the probabilistic predictions in the range between 0.01 and 0.90, as well as the mean value predictions, lead to acceptable (i.e., within the green user-colored space) model performance. For the conservative *C*_{FN} = 5 and *C*_{FP} = 0.5 user coloring, the 0.01 probability prediction provides the only acceptable performance (of those examined in this study). This example illustrates how the MOE, in conjunction with an agreed-upon scoring function (or coloring), can be used to tune a model parameter (the probability parameter in this case) to meet a user's requirement.

### MOEs assessed in dosage regimes of significance to humans

To assess MOE values in dosage regimes of significance to humans, consider an effects model with LCt_{50} = *l* and probit slope = *α* (for our notional probit effects model), where, at LCt_{50}, by definition, one-half of the exposed population would be expected to die. Next, for any dosage (or time-integrated concentration) *d*, let LE(*d*) be the fraction of the people that die. Then, for any sampler *i* with predicted dosage *P*_{i} and observed dosage *O*_{i}, consider the marginal contributions:

OV_{i} = min[LE(*O*_{i}), LE(*P*_{i})],
FN_{i} = max[LE(*O*_{i}) − LE(*P*_{i}), 0], and
FP_{i} = max[LE(*P*_{i}) − LE(*O*_{i}), 0].  (29)

One then obtains *A*_{OV}, *A*_{FN}, and *A*_{FP} by summing the marginal contributions of Eq. (29).

For the notional effects model shown in Fig. 14, small differences between actual and predicted dosages near LCt_{50} can have a dramatic impact on the outcome. On the other hand, for larger predicted and observed dosages (well beyond LCt_{50}) or for smaller predicted and observed dosages (well below LCt_{50}), substantial relative differences do not necessarily have much impact in terms of human effects. This technique allows one to assess the MOE in the regime that is of *particular interest* to the user and application. For example, if a model predicts a dosage of 10^{−15} mg s m^{−3} and the observation is really 10^{−12} mg s m^{−3}, and neither level has any impact on humans, one could question the significance of this “3-orders-of-magnitude difference.”
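One plausible realization of a notional probit effects curve and the marginal-contribution sums is sketched below. The exact probit form (base-10 logarithm, standard normal CDF) is our assumption for illustration, not the paper's specification:

```python
import math

# Notional probit effects curve: fraction of the exposed population
# affected at dosage d, with 50% effects at lct50. ASSUMED FORM:
# LE(d) = Phi(slope * log10(d / lct50)), Phi = standard normal CDF.
def lethality(d, lct50, slope):
    if d <= 0.0:
        return 0.0
    z = slope * math.log10(d / lct50)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Marginal contributions per sampler, summed into effects-based regions.
def effects_regions(pred, obs, lct50, slope):
    le = lambda v: lethality(v, lct50, slope)
    a_ov = sum(min(le(p), le(o)) for p, o in zip(pred, obs))
    a_fn = sum(max(le(o) - le(p), 0.0) for p, o in zip(pred, obs))
    a_fp = sum(max(le(p) - le(o), 0.0) for p, o in zip(pred, obs))
    return a_ov, a_fn, a_fp
```

With a steep slope, dosage pairs far below LCt_{50} contribute essentially nothing, which is exactly the "3-orders-of-magnitude difference" point made above.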

By incorporating the lethality of the release, as outlined above, one can convert the *x* and *y* axes of the MOE to “fraction of the population inadvertently exposed” (false negative) and “fraction of the population unnecessarily warned” (false positive), respectively. We illustrate this with the Project Prairie Grass data and predictions. For these short-range, densely sampled observations, interpolation between samplers is relatively straightforward—we used a Delaunay triangulation procedure (Warner et al. 2001b). Therefore, one can compute MOE values in terms of actual areas, that is, false-positive, false-negative, and overlap areas. Next, one can assume an underlying population distribution—we chose spatially uniform for this illustration. At this point then, the *x* and *y* axes of our MOE space are converted to the fraction of the population inadvertently exposed to some effects level of interest and the fraction of the population unnecessarily warned.

Figure 15a presents an overlay of MOE estimates for the comparisons of SCIPUFF and NARAC predictions of the Project Prairie Grass field trials on a notional user-coloring scheme. Shown are MOE estimates for the NARAC and SCIPUFF predictions based on ocular and lethal effects for some notional agent. Sulfur dioxide was the agent actually released during the Project Prairie Grass field trials, but for this demonstration we assumed a nerve agent–like material in its place. To generate the MOE values of Fig. 15, a probit model was used with “ocular” defined as ocular effects of OE_{50} = 30 mg s m^{−3} with a probit slope = 12 and “lethal” defined as LCt_{50} = 4000 mg s m^{−3} with a probit slope of 12. For this situation (Fig. 15a), assuming green is acceptable, one might conclude that both models and both levels of effects, ocular and lethal, are satisfactory, at least at short range. Of course, at longer ranges, one imagines that MOE performance based on ocular effects might degrade substantially, because these ocular effects are expected to extend to ranges where, for example, uncertainties in wind field direction and speed are much more significant. Figure 15b presents the four SCIPUFF and NARAC comparative MOE estimates from our earlier discussion where, in this case, the FBFOM user coloring is applied. Here only models with MOE values near the diagonal are considered acceptable, because it is on the diagonal that the observed and predicted regions are of identical size. In this case, at least for ocular effects, our notional user might prefer the SCIPUFF predictions.

### Testing for differences between two-dimensional MOE values

Figure 15 also shows that the 95% confidence regions for the corresponding NARAC and SCIPUFF MOE values are completely separate, suggesting statistically significant differences. These differences can be further quantified by computing *p* values associated with an appropriate hypothesis test, as discussed below. First, the 51 individual MOE vector differences between various model predictions are computed; for example, the vector differences between the NARAC (ocular) and the SCIPUFF (ocular) MOE values are calculated. If two sets of model predictions were identical, then all 51 vector differences would be (0, 0). For this study, the null hypothesis is that the two models being compared are equivalent. Therefore, any MOE vector difference is expected to be equally likely to reside in any of the four quadrants, defined as positive *x,* positive *y* (+, +); positive *x,* negative *y* (+, −); negative *x,* positive *y* (−, +); and negative *x,* negative *y* (−, −). Given this null hypothesis, one tests how unlikely the observed result is by simulating the appropriate permutations in the following way. First, the quadrant with the most MOE vector differences is identified, and the number of differences in that quadrant is noted. For example, for SCIPUFF (lethal) minus NARAC (lethal), 42 of 51 vector differences occupy the “−, +” quadrant, consistent with Fig. 15 and the notion that the SCIPUFF (lethal) predictions resulted in a larger false-negative and smaller false-positive fraction than the NARAC (lethal) predictions. Next, we simulate results for equivalent model predictions by creating 100 000 samples of 51 drawn from the uniform integer distribution on [1, 4], that is, a multinomial distribution with equal likelihood for each of the four outcomes. These 100 000 samples of 51 correspond to our simulated vector differences for equivalent models. For each sample of 51, the numbers of “1”s, “2”s, “3”s, and “4”s that were randomly selected are determined. 
The maximum observed number (e.g., 42 from above) is then compared with the corresponding maximum number associated with each simulated sample. The number of simulated samples that contain a maximum value that is greater than or equal to the observed maximum is determined and is denoted *N*_{≥}. The estimated *p* value is then computed as *N*_{≥} divided by 100 000. For suitably low *p* values, one might reject the null hypothesis of equivalence between the values being compared (Sprent and Smeeton 2001). The two-dimensional, four-quadrant hypothesis test described here is a natural extension of the one-dimensional sign test (Sprent 1998). When comparing SCIPUFF and NARAC MOE values based on ocular and lethal effects, the resulting maximum numbers for MOE vector differences are 42 and 36, respectively, and reside in the (−, +) quadrant. In both cases, the resulting *p* values are less than 1 × 10^{−5}, strongly supporting the notion that the differences between SCIPUFF and NARAC shown in Fig. 15 are statistically significant.
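The four-quadrant test can be sketched as a short simulation; the function and variable names are ours, and ties (zero components) are binned with the negatives here:

```python
import random

# Four-quadrant extension of the sign test: under the null hypothesis of
# equivalent models, each MOE vector difference falls in any quadrant
# with probability 1/4. The p value is the fraction of simulated samples
# whose largest quadrant count is at least the observed maximum.
def quadrant_p_value(diffs, n_sim=100_000, seed=1):
    counts = [0, 0, 0, 0]
    for dx, dy in diffs:
        counts[(dx > 0) + 2 * (dy > 0)] += 1   # quadrant index 0..3
    observed_max = max(counts)
    rng = random.Random(seed)
    n_ge = 0
    for _ in range(n_sim):
        tally = [0, 0, 0, 0]
        for _ in range(len(diffs)):
            tally[rng.randrange(4)] += 1       # equally likely quadrants
        if max(tally) >= observed_max:
            n_ge += 1
    return n_ge / n_sim
```

For example, 42 of 51 differences in one quadrant yields a vanishingly small p value, whereas differences spread evenly over the quadrants yield a p value near 1.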

## Conclusions

A two-dimensional MOE has been proposed for the evaluation of transport and dispersion models. In model-to-field-trial comparisons, this user-oriented measure of effectiveness has consistently resolved important model performance features. Statistically significant resolution of model performance differences as a function of downwind range and meteorological stability category grouping was described for SCIPUFF predictions. Also, differences between models—SCIPUFF and NARAC—could be easily discerned and characterized with the MOE.

By applying a lethality/effects “filter,” one can compute MOE values that relate the goodness of a prediction for presumed agents of greatly varying toxicity (e.g., ocular vs lethal effects). A quantitative method that can aid in the communication of a user's risk tolerance has been described—the coloring of the two-dimensional MOE space in terms of a user's potential scoring function.

The above features may make this MOE of particular value for validation studies. For instance, the specific application and user will dictate the effects level of interest and the associated risk tolerance (and, therefore, the user-coloring scheme). This type of user-oriented MOE, when properly employed, necessarily involves the user and a specific application. Such early involvement of the potential user is often a critical missing element in validation efforts.

Future related efforts will focus on expanding the application of this MOE to predictions of longer-range observations (e.g., the European Tracer Experiment; Mosca et al. 1998), observations within an urban environment (URBAN 2000; Allwine et al. 2002), and interior building releases (Platt et al. 2002), and on its use as a diagnostic aid for model intercomparisons (Warner et al. 2001d).

## Acknowledgments

This research is sponsored by the Defense Threat Reduction Agency, with Dr. Allan Reiter as project monitor. The authors thank Drs. Steven R. Hanna and Joseph C. Chang of George Mason University for numerous helpful discussions. The views expressed in this paper are solely those of the authors. No official endorsement by the Department of Defense is intended or should be inferred.

## REFERENCES

Allwine, K. J., J. H. Shinn, G. E. Streit, K. L. Clawson, and M. Brown. 2002. Overview of URBAN 2000: A multiscale field study of dispersion through an urban environment. *Bull. Amer. Meteor. Soc.* 83:521–536.

ASTM. 2000. Standard guide for statistical evaluation of atmospheric dispersion model performance. American Society for Testing and Materials, Designation D 6589-00, 17 pp. [Available from ASTM, 100 Barr Harbor Dr., PO Box C700, West Conshohocken, PA 19428.]

Barad, M. L., Ed. 1958. Project Prairie Grass, a field program in diffusion. Vols. I and II, Geophysical Res. Papers 59, Rep. AFCRC-TR-58-235, 439 pp.

Efron, B., and R. J. Tibshirani. 1993. *An Introduction to the Bootstrap*. Monographs on Statistics and Applied Probability, No. 57, Chapman and Hall, 436 pp.

Finney, D. J. 1971. *Probit Analysis*. Cambridge University Press, 333 pp.

Hanna, S. R. 1988. Air quality model evaluation and uncertainty. *J. Air Pollut. Control Assoc.* 38:406–412.

Hanna, S. R., J. C. Chang, and D. G. Strimaitis. 1993. Hazardous model evaluation with field observations. *Atmos. Environ.* 27A:2265–2285.

Irwin, J. S., and M.-R. Rosu. 1998. Comments on draft practices for statistical evaluation of atmospheric dispersion models. Preprints, *10th Joint Conf. on the Applications of Air Pollution Meteorology,* Phoenix, AZ, Amer. Meteor. Soc., 6–10.

Mosca, S., G. Graziani, W. Klug, R. Bellasio, and R. Bianconi. 1998. A statistical methodology for the evaluation of long-range dispersion models: An application to the ETEX exercise. *Atmos. Environ.* 32:4307–4324.

Nasstrom, J. S., G. Sugiyama, J. M. Leone Jr., and D. L. Ermak. 2000. A real-time atmospheric dispersion modeling system. Preprints, *11th Joint Conf. on the Applications of Air Pollution Meteorology,* Long Beach, CA, Amer. Meteor. Soc., 84–89.

Petty, R. 2000. User requirements for dispersion modeling. *Proc. Workshop on Multiscale Atmospheric Dispersion Within the Federal Community,* Silver Spring, MD, Office of the Federal Coordinator for Meteorological Services and Supporting Research, 1-1–1-3.

Platt, N., S. Warner, and J. F. Heagy. 2002. Application of two-dimensional user-oriented measure of effectiveness to interior building releases. *Proc. Sixth Annual George Mason University Transport and Dispersion Modeling Workshop,* Fairfax, VA, Defense Threat Reduction Agency, CD-ROM. [Available from School of Computational Sciences, MS 5C3, 103 Science & Technology I, George Mason University, 4400 University Drive, Fairfax, VA 22030-4444.]

SAIC. 2001. The Hazard Prediction and Assessment Capability (HPAC) user's guide version 4.0. Science Applications International Corporation (SAIC) for Defense Threat Reduction Agency (DTRA), HPAC-UGUIDE-02-U-RAC0, 598 pp. [Available from Defense Threat Reduction Agency, 6801 Telegraph Road, Alexandria, VA 22310-3398.]

Sprent, P. 1998. *Data Driven Statistical Methods*. Chapman and Hall, 406 pp.

Sprent, P., and N. C. Smeeton. 2001. *Applied Nonparametric Statistical Methods*. 3d ed. Chapman and Hall, 461 pp.

Sykes, R. I., S. F. Parker, and R. S. Gabruk. 1996. SCIPUFF—A generalized hazard prediction model. Preprints, *Ninth Joint Conf. on the Applications of Air Pollution Meteorology,* Atlanta, GA, Amer. Meteor. Soc., 184–188.

Venkatram, A. 1988. Inherent uncertainty in air quality modeling. *Atmos. Environ.* 22:1221–1227.

Warner, S., and Coauthors. 2001a. Evaluation of transport and dispersion models: A controlled comparison of Hazard Prediction and Assessment Capability (HPAC) and National Atmospheric Release Advisory Center (NARAC) predictions. Institute for Defense Analyses Paper P-3555, 251 pp. [Available from Steve Warner, Institute for Defense Analyses, 4850 Mark Center Drive, Alexandria, VA 22311-1882.]

Warner, S., N. Platt, and J. F. Heagy. 2001b. Application of user-oriented measure of effectiveness to HPAC probabilistic predictions of Prairie Grass field trials. Institute for Defense Analyses Paper P-3586, 275 pp. [Available from Steve Warner, Institute for Defense Analyses, 4850 Mark Center Drive, Alexandria, VA 22311-1882.]

Warner, S., N. Platt, and J. F. Heagy. 2001c. User-oriented measures of effectiveness for the evaluation of transport and dispersion models. *Proc. Seventh Int. Conf. on Harmonisation within Atmospheric Dispersion Modelling for Regulatory Purposes,* Belgirate, Italy, JRC-EI, 24–29.

Warner, S., and Coauthors. 2001d. Model intercomparison with user-oriented measures of effectiveness. *Proc. Fifth Annual George Mason University Transport and Dispersion Modeling Workshop,* Fairfax, VA, Defense Threat Reduction Agency, CD-ROM. [Available from School of Computational Sciences, MS 5C3, 103 Science & Technology I, George Mason University, 4400 University Drive, Fairfax, VA 22030-4444.]

Warner, S., and Coauthors. 2001e. User-oriented measures of effectiveness for the evaluation of transport and dispersion models. Institute for Defense Analyses Paper P-3554, 797 pp. [Available from Steve Warner, Institute for Defense Analyses, 4850 Mark Center Drive, Alexandria, VA 22311-1882.]