1. Introduction
Verification at the mesoscale has proven to be a challenging problem. Mesoscale models provide highly detailed, deterministic guidance regarding the position and structure of specific weather systems. While such information can be valuable to forecasters, the typical statistical scores like rms, bias, and the threat score do not necessarily reflect the increase in value (e.g., Mass et al. 2002). Koch (1985), White et al. (1999), Ebert and McBride (2000), Colle et al. (2001), and Baldwin et al. (2002) suggest that small errors in timing or position will penalize highly detailed forecasts more than their less detailed counterparts because of the increased variability of the predicted fields. Also, many of the phenomena best predicted by a mesoscale model, such as convection (e.g., Bernardet et al. 2000; Xue et al. 2001), intense precipitation (e.g., Colle et al. 1999), or coastal jets (e.g., Doyle 1997), compose only a small portion of the overall weather. Measurements of these events may not significantly influence the overall statistics. Unless these phenomena are coherently isolated and verified separately, the added value of the mesoscale forecasts may be lost in the statistics.
In this regard, event-based methods show considerable promise as a verification tool. Koch (1985) isolated the loci of mesoscale convective system (MCS) genesis regions as predicted by a numerical model and compared them to the corresponding observed loci. He reported critical success indices on the order of 0.45 based on the criteria that a system occurred within 250 km and 3 h of the forecast loci. Ebert and McBride (2000) compared contiguous rain areas in gridded precipitation analyses with the corresponding forecasts and decomposed the error into location, rain volume, and pattern elements. They found that 45% of the verified rain events were well forecast based on the magnitude of the error components. Baldwin et al. (2001, 2002) are also developing an events-oriented approach based on the similarity between the observed and predicted field attributes. Their methods seek to objectively mimic the process by which a human would subjectively compare these two fields.
These event-based methods are most effective when high-resolution, high-quality observational data are available. Since mesoscale forecasts offer very specific guidance, the prevailing philosophy is to compare the forecasts to the observations as specifically as possible and record the deviations as a distribution of errors. This can pose problems if the observations are incomplete, uncertain, or not sampled at the same scale as the forecasts (Tustison et al. 2001). This is often the case over the open ocean, where intense wind events are a concern to shipping and navy operations. Over the ocean, the primary observations consist of satellite-based wind measurements, buoys, and ship observations. The satellite data are quite promising for their relatively high resolution, but the coverage is limited to finite swaths. Rain and very high winds, often associated with oceanic wind events, further limit the coverage. In many cases, the full extent, intensity, and coverage of any one event are only partially known.
Since high-resolution numerical models are frequently used to explicitly forecast oceanic winds, an event-based evaluation would be a useful measure of model performance. Composite techniques offer an attractive and relatively simple approach to the data-availability problem. The philosophy is similar to that applied by Gray and Frank (1977), who used sounding composites to gain insight into tropical cyclone structure. In their study, a given storm was typically sampled by only one or two sounding measurements. By combining multiple measurements from many storms onto a common storm-relative grid, important information concerning the general storm-environment structure was derived. Applying this method to event verification relaxes the information requirements on any one event. If enough quasi-randomly placed observations of a distribution of similar events exist, bulk properties of the forecasts and the observations can be reliably estimated in an event-relative framework. This allows incomplete observations to be smoothly incorporated into a coherent, statistically meaningful comparison.
In this work, the composite method is applied to evaluate forecasts of the mistral in the Mediterranean Sea. Mistrals are well-defined regions of strong northeasterly, northerly, or northwesterly winds that occur in the northern Mediterranean (Fig. 1). The mistral is a multiscale phenomenon that occurs as a result of synoptic-scale northerly or northwesterly flow that is accelerated as it passes over the Massif Central and is deflected by the western Alps (Jiang et al. 2003). Anecdotal evidence indicates that these events are well predicted by the models; thus, the event-driven statistics should show some palpable measure of skill.
2. Model and observational data
Wind events were selected from approximately 1 yr of 10-m wind real-time forecasts over the Mediterranean (Fig. 1). The forecasts were generated using the Naval Research Laboratory Coupled Ocean/Atmosphere Mesoscale Prediction System (COAMPS) (Hodur 1997), run at the Fleet Numerical Meteorology and Oceanography Center (FNMOC). Forecasts were initialized at both 0000 and 1200 UTC and were run from November 2000 through October 2001. All forecasts were initialized with the multivariate optimum interpolation (MVOI) (Barker 1992) using an interim 6-h COAMPS forecast as a first guess. Boundary conditions, provided by forecasts from the Navy Operational Global Atmospheric Prediction System (NOGAPS) (Hogan et al. 2002), were updated every 3 h using a Davies (1976) scheme. Two one-way nested grids were used with horizontal spacings of 81 and 27 km, respectively. The domains had 30 sigma levels in the vertical, with the lowest level at 10 m AGL, and the forecasts were run to 72 h.
The forecasts were verified against the Special Sensor Microwave Imager (SSM/I) winds retrieved using the Goodberlet et al. (1990) regression with the Petty (1993) water vapor correction. In the absence of rain, Goodberlet et al. (1990), Petty (1993), and Gemmill and Krasnopolsky (1999) found that the retrieved winds estimated the in situ buoy and ship observations at the 19.5-m level with an rms of 2 m s−1. For this study, all data within 75 km of land and all rain-flagged data were discarded. The SSM/I speeds were adjusted to the 10-m level for comparison with the model data using the logarithmic wind profile for neutral conditions. Typical adjustments were less than 1 m s−1.
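The neutral log-profile height adjustment described above can be sketched as follows. The roughness length z0 is an assumed open-ocean value, not one given in the text, and the function name is illustrative:

```python
import math

def adjust_to_10m(u_195, z0=1.5e-4):
    """Scale a 19.5-m neutral-stability wind speed (m/s) to 10 m using
    the logarithmic wind profile, u(z) proportional to ln(z / z0).
    z0 is an assumed open-ocean roughness length (m)."""
    return u_195 * math.log(10.0 / z0) / math.log(19.5 / z0)
```

For typical mistral speeds this produces a downward adjustment of a bit under 1 m s−1, consistent with the values quoted above.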
Despite the relatively low general rms errors, the SSM/I wind estimates tend to have larger errors at wind speeds greater than 20 m s−1. This is due in part to the lack of high-wind data available for the derivation of the empirical retrieval. These errors could potentially skew the observed wind distributions for intense mistrals, especially near the axis of maximum winds. Very few reliable estimates of the high-wind error have been performed, though rms errors as high as 4 m s−1 have been measured at FNMOC. While the SSM/I bias at these speeds is unknown, a trend for underestimation has been noted. Because of these problems, the maximum wind strength in the most intense mistrals will likely be poorly estimated. The deficiencies of the SSM/I estimates emphasize the importance of keeping the model evaluation as general as possible while still providing information specific to the mistral. Although the deterministic truth is unknown, general forecast features such as event position and occurrence can still be investigated. Relative error growth through the forecast integration can also be tracked. Results shown in sections 4 and 5 indicate that the model and SSM/I winds were quite similar on average near the mistral center. The largest discrepancies were in areas of weaker winds.
The SSM/I data were bilinearly interpolated to the 27-km model grid to facilitate direct comparisons. The model grid spacing was quite compatible with the 25-km SSM/I footprint. The typical wind data swath was about 1400 km wide, and all satellite passes that occurred within 1 h of the verification time were interpolated to the same grid. The mistral region was well sampled by the bulk of the passes, with the best coverage over the western and central Mediterranean. Satellite overpass intervals were generally 6–12 h; thus, some forecasts had more validation data than others. Over the Mediterranean, the overpass frequency was sufficient to evaluate the forecasts at 12-h intervals.
3. Verification technique
A rules-based algorithm was used to define unique, contiguous wind events in each forecast, and a subdomain was defined on the 27-km grid (Fig. 1) where the algorithm searched for events. All searches were initially conducted on the forecast data because speed and direction were readily defined at all grid points. From this, the conditional distribution of events given that an event existed in the forecast was derived. The conditional distribution contingent on the existence of observed events was more difficult to derive because of the lack of full SSM/I observations. The existence of observed events was inferred using methods described in the next paragraph and in section 5. Events that extended beyond the subdomain were kept in the composite if the center of the event was within the subdomain. In the Mediterranean, all contiguous points with winds greater than 12 m s−1 and directions between 270° and 70° (a sector wrapping northward through 360°) were defined as mistrals. Although this definition is rather subjective, it proved to be quite effective upon visual inspection. Once an event was identified, all surrounding data were transferred to a 31 × 31 point relative sample grid with grid spacing identical to the model. The center of each polygonal event, as defined by its area-weighted “center of mass,” was positioned at the center of the relative grid. After that, all available satellite data were also positioned on the relative grid. Model data were then templated by the available observations, such that all forecasts outside of the SSM/I swath were removed from the set (Fig. 2). Transferring the events to the relative grid effectively synchronized them about a common central point. Once this was done, the remaining sources of variance within the composite were related to the size, shape, and intensity of each event. Size criteria could easily be used to further refine the distributions.
However, the lack of observations necessitated that most of the events be combined in a single composite containing sizes from 75 to 500 grid points. This relatively broad range poses some limitations on the precision of the results, but meaningful information was still obtained. The relationships between precision and event-size range, as well as several other factors, are detailed in section 6.
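The identification and compositing steps described above might be sketched as follows. The flood-fill labeling, the speed-weighted centroid, and all function names are illustrative assumptions; the text specifies only contiguity, the 12 m s−1 and 270°–70° criteria, and an area-weighted center of mass:

```python
import numpy as np

def find_events(speed, direction, thresh=12.0, lo=270.0, hi=70.0):
    """Label contiguous events: speed > thresh and direction inside the
    sector that wraps from lo deg through north to hi deg."""
    mask = (speed > thresh) & ((direction >= lo) | (direction <= hi))
    labels = np.zeros(mask.shape, dtype=int)
    nlab = 0
    for i0, j0 in zip(*np.nonzero(mask)):
        if labels[i0, j0]:
            continue
        nlab += 1
        stack = [(i0, j0)]          # flood fill over 4-connected neighbors
        while stack:
            i, j = stack.pop()
            if labels[i, j]:
                continue
            labels[i, j] = nlab
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ni, nj = i + di, j + dj
                if (0 <= ni < mask.shape[0] and 0 <= nj < mask.shape[1]
                        and mask[ni, nj] and not labels[ni, nj]):
                    stack.append((ni, nj))
    return labels, nlab

def center_of_mass(speed, labels, lab):
    """Speed-weighted centroid of one labeled event (an assumption for
    the paper's area-weighted "center of mass")."""
    ii, jj = np.nonzero(labels == lab)
    w = speed[ii, jj]
    return (int(round((ii * w).sum() / w.sum())),
            int(round((jj * w).sum() / w.sum())))

def to_relative_grid(field, ci, cj, half=15):
    """Copy a (2*half+1)-point square window centered on (ci, cj) onto
    the 31 x 31 relative grid, padding with NaN off the domain."""
    out = np.full((2 * half + 1, 2 * half + 1), np.nan)
    for di in range(-half, half + 1):
        for dj in range(-half, half + 1):
            i, j = ci + di, cj + dj
            if 0 <= i < field.shape[0] and 0 <= j < field.shape[1]:
                out[di + half, dj + half] = field[i, j]
    return out
```

In the same spirit, the SSM/I template would be applied by setting forecast points to NaN wherever the interpolated satellite field is missing.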
As mentioned above, all events were initially identified from the forecasts alone. This implies that only the conditional distribution of model-predicted events can be easily obtained. Murphy (1991) pointed out that a complete scoring system should include enough information to reconstruct the joint distribution of the forecasts and the observations. Here this means that the conditional distribution contingent on the existence of observed events needs to be derived. The satellite data alone were too limited to fully describe any specific feature. Many events were partially sampled, and wind direction was completely unknown. However, it will be shown in section 5 that the existence and center point of the observed events were sufficiently estimated from the model analyses and short-term forecasts. The corresponding distribution of satellite observations associated with these model-estimated events was then used to estimate the distribution of observed events. Other longer-term forecasts valid at these times were verified against the observations in the composite sense. The veracity of these calculations depends on the ability of the model to consistently simulate event location and existence. Results discussed in section 5 show that observational composites derived from the 0–12-h forecasts were viable in this regard.
Although mistral winds were relatively common, the observations were not frequent enough to derive composites at every forecast hour. Instead, model–SSM/I sample pairs were collected in 6-h intervals starting with the 3-h forecast. Thus, statistics associated with the 6-h forecast actually contain SSM/I–model comparisons for all forecasts between 3 and 9 h. The 6-h interval slightly broadened the distributions, but the compromise was necessary to maintain statistical significance. The first and last 3-h forecast intervals (0–2 and 70–72 h) contained almost no observations and were thus not used.
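The assignment of lead times to 6-h verification bins can be expressed as a small helper. This is only a sketch; the treatment of leads falling exactly on a bin edge (9, 15, ... h) is an assumption, since the text leaves it ambiguous:

```python
def composite_bin(lead_hour):
    """Map a forecast lead time (h) to its 6-h composite bin center.
    Bin c collects leads in [c-3, c+3); centers run 6..66 h, and the
    sparsely observed 0-2 and 70-72 h tails are dropped."""
    c = 6 * ((lead_hour + 3) // 6)
    return c if 6 <= c <= 66 else None
```

For example, a 4-h or 8-h forecast contributes to the 6-h statistics, while a 1-h forecast is discarded.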
4. Verification of the mistral forecast conditional composite
This first set of composite statistics was derived from the set of known forecasts. The sampling strategy ensures that an event exists in all of the forecast samples but not necessarily in the corresponding observations. Thus the forecast composites represent an estimate of all predicted mistrals as sampled by the SSM/I template, and the observational composites represent the sample of satellite winds associated with the forecasts. Although deterministic properties are not derived from any one event, information regarding the general model performance can be statistically determined. At the most basic level, it is important to know if the model can systematically reproduce a distribution of events that is similar in size and geographical location to the observations. The number distributions of the predicted and observed mistrals on the native model grid can be used for this comparison. In the 18-h forecasts, for example (Fig. 3), the shapes of the distributions were slightly different, but the highest occurrence frequencies were centered in the Gulf of Lion in both the model and the observations. The primary differences were in the number of occurrences. Near the mistral center, the model crossed the wind threshold on 5–10 more occasions than was observed, but the SSM/I frequencies were higher just downwind (east) of the narrow gap between Corsica and Sardinia. Other than these differences the general model wind distribution was quite similar to the observations.
The relative-grid occurrence frequencies, as represented by the number distributions, are shown in Fig. 4. The relative-grid statistics are more stringent than the geographical case because the forecast events are relegated to the center of the grid. Large daily discrepancies will broaden and dilute the distribution of the observations with respect to the forecasts. In this case, the relative-grid number composites look quite similar to the geographical distributions. Sardinia, Corsica, the Balearic Islands, and the French/Spanish coastline all appear as data-void areas.1 This stems from the geographically fixed nature of the mistral, and for this special case a relatively well synchronized composite could be derived on the geographic grid alone. Such a trait made this an attractive phenomenon to test the relative-grid verification method. Future studies will be conducted on meteorological events that are not geographically anchored, thus requiring the use of the relative grid.
The composite wind speeds (Fig. 5) suggest that the differences in the number distributions represent relatively minor differences in the wind speed statistics. The average maximum wind speed near the center of the 18-h mistral forecast composite was just over 16 m s−1 (Fig. 5a), while the corresponding observed maximum speed was about 15 m s−1. The positions of the axes of maximum winds were quite close to one another, though the forecasts were displaced 1–2 grid points westward in some areas. Speed biases were +1 to +2 m s−1 near the mistral center and −1 to −3 m s−1 on the eastern and western sides of the mistral. The positive speed bias reflects the positive bias in the number distributions. However, the rms values near the mistral center were 3 m s−1, which was relatively low compared to the mean speed. These values suggest that the positive biases in the number distribution were associated with either a small systematic error or a large intermittent error. In a separate experiment, Nachamkin (2002) showed that the number frequency bias in this case was quite sensitive to small wind speed differences. Adding an arbitrary 1.5 m s−1 factor to the SSM/I winds reduced the frequency bias by up to 50%. This sensitivity suggests that the errors were due to small systematic biases. This example emphasizes the complementary roles played by the speed and frequency bias as well as the rms in determining the nature of the forecast error.
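The pointwise bias and rms fields discussed in this section can be computed from a stack of swath-templated relative-grid samples roughly as follows. The array layout, the NaN convention for points outside the SSM/I swath, and the function name are assumptions; the 20-sample cutoff follows the 20-observation template mentioned later in the text:

```python
import numpy as np

def composite_stats(fcst, obs, min_samples=20):
    """Pointwise bias (F - O) and rms over stacks of relative-grid
    samples shaped (n_events, ny, nx).  NaN marks points outside the
    SSM/I swath; points with too few valid samples are masked."""
    diff = fcst - obs
    n = np.sum(~np.isnan(diff), axis=0)
    bias = np.nanmean(diff, axis=0)
    rms = np.sqrt(np.nanmean(diff ** 2, axis=0))
    bias[n < min_samples] = np.nan
    rms[n < min_samples] = np.nan
    return bias, rms
```

The standard deviation fields of section 4 would follow the same pattern with `np.nanstd` applied to each stack separately.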
The highest rms values were actually in areas of weaker winds on the eastern and western flanks of the mistral. The lowest rms values were in a region of weak to moderate winds south of the mistral core. Standard deviation was quite low in the forecast distribution near the center of the mistral (Fig. 6a), which is expected given that many similar events are composited about a common point. Forecast standard deviation generally increased away from the event center in part because of differences in the structure and size of individual events. The highest standard deviations were east and west of the mistral core in the regions of high rms mentioned above. The observational standard deviations (Fig. 6b) were more uniform across the grid. Values were elevated east of the mistral core, but were not as high as the forecast standard deviations.
The error patterns from most of the mistral forecasts at other lead times were similar to those at 18 h. Wind biases were generally positive within and west of the high-wind core, with negative biases and larger rms errors to the north and east. The magnitude of the errors generally increased with time. By 66 h (Fig. 7) the rms and bias increased by about 1 m s−1 over most areas, except near the French coast, where rms increased by 2–3 m s−1. Bias values in the area between Corsica and Sardinia were close to neutral, but the rms was nearly double that at 18 h. Errors within the axis of the strongest winds remained relatively low, especially in the central and southern portions of the mistral. The average speed patterns remained quite similar in shape, though the forecasts again appeared to be shifted slightly westward. The gradient on the eastern side of the speed maximum was sharper in the forecasts than the observations. This does not necessarily indicate tighter gradients in the forecasts, as phase errors can also affect the gradient strength in the composite. Pure phase errors between two high-gradient fields would exhibit an elevated rms in and near the gradient zone. The rms does increase within the gradient, but it remains high well to the east of the gradient area. This indicates additional errors beyond a simple phase shift.
Note that the number of observations, as indicated by the 20-observation template, is different in Figs. 5 and 7. This is because the observation samples are collected only when a mistral is predicted, and the number, size, and location of mistrals will change through the model forecast. Thus, different sets of observations can be selected for comparison at different forecast lead times. As mentioned above, these statistics represent the model performance given that a mistral is predicted. Mistrals may occur that were not predicted, but those events are not sampled here. Samples of the mistral based on the existence of an observed event are discussed in the next section.
The speed distributions and their corresponding difference statistics suggest that when a mistral is predicted, the predictions are quite reliable out to long lead times. Forecast wind speeds in most areas are within 3 m s−1 of the corresponding observations. Some of the best performance is found in the region of maximum winds. The weak points in the model are primarily associated with the longitudinal extent of the mistral. The largest errors existed to the west and particularly the east of the mistral core. These were associated with areas of elevated standard deviation in conjunction with moderate rms and negative bias errors. This indicates highly variable winds in these areas, with the model having too much event-to-event variability and not enough strong winds. Complex coastal terrain to the east and west of the mistral axis may be influencing the local winds, and the 27-km grid spacing will not fully resolve the processes in these areas. Elsewhere, the general increase in the bias with time appears to be more benign as the rms error in the regions of positive bias increased in proportion to the bias (Fig. 7). This reflects a general trend of increased positive model bias with forecast time that occurred in the comparisons to all SSM/I winds over the entire model grid.
Basic elements of the model performance, contingent on the existence of a forecast event, can be summarized from the average conditional statistics over the entire relative grid (Fig. 8). The forecast-based conditional bias was initially negative but slowly increased through the forecast. Values were less than 1 m s−1 in magnitude, indicating significant cancellation of the positive and negative centers. This suggests that condensed averages of bias and rms can be relatively insensitive to mesoscale errors. The grid-total rms values rose slowly through the first 30 h and then steadily increased. The greatest increases occurred between 30 and 54 h. Pattern correlations are sensitive to the mesoscale pattern, but the values depend on the calculation method. The correlations between the average wind patterns remained high through the forecast period, never dropping below 0.88. However, the average correlation, taken as the mean of the daily correlations weighted by the number of observations, dropped steadily from a high of 0.75 at 0 h to 0.53 at 66 h. Clearly, by 66 h the deterministic quality of the forecasts drops significantly. However, the general pattern, especially within the mistral core, remains quite consistent (Fig. 7). This suggests that the long-term forecasts correctly indicate the likelihood of a mistral but lack the precise details. According to Fig. 7, the greatest uncertainties develop north and east of the core winds.
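The two correlation measures quoted above (the correlation between the time-averaged patterns versus the observation-weighted mean of the daily correlations) can be sketched as follows. The centered (anomaly) form of the pattern correlation is an assumption, as the text does not state the exact formula:

```python
import numpy as np

def pattern_corr(a, b):
    """Centered pattern correlation between two fields, using only
    points valid (non-NaN) in both."""
    m = ~(np.isnan(a) | np.isnan(b))
    a, b = a[m] - a[m].mean(), b[m] - b[m].mean()
    return float((a * b).sum() / np.sqrt((a * a).sum() * (b * b).sum()))

def correlation_summaries(fcsts, obs):
    """Two summaries from stacks of daily fields (n, ny, nx):
    (1) correlation between the time-averaged patterns, and
    (2) mean of daily correlations weighted by each day's obs count."""
    r_mean = pattern_corr(np.nanmean(fcsts, axis=0),
                          np.nanmean(obs, axis=0))
    daily = np.array([pattern_corr(f, o) for f, o in zip(fcsts, obs)])
    w = np.array([np.sum(~np.isnan(o)) for o in obs], dtype=float)
    return r_mean, float(np.sum(w * daily) / np.sum(w))
```

As the text notes, the first measure rewards consistency of the mean pattern and stays high even when day-to-day (deterministic) agreement, captured by the second measure, degrades.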
5. Verification of the mistral observation conditional composite
The statistics above only apply to cases when a mistral was predicted but say little about the mistral forecasts when an event was actually observed. Those statistics could be quite different, especially if the forecasts are of poor quality. As noted previously, the distribution of observed events is difficult to estimate because of SSM/I data-coverage constraints. However, event properties can be estimated from the model analyses and short-term forecasts if these fields are sufficiently accurate. Since all observations are placed on the relative grid with respect to the center of the modeled events, the analyses need only be good enough to estimate the event position and occurrence frequency. When valid events are located in the analyses, all associated observations can be composited on the relative grid. Consistent correlations between the positions of the observed and simulated events permit the construction of a focused observational composite. When the observational composite is constructed, all forecasts valid at the times of the observations can be verified in the composite sense.
In practice, predicted wind events from the first 12 h of the model integration were used to generate the observational composites. As with the forecast-based conditional composites (discussed in the previous section), the observations were binned into 6-h intervals. In this study all of the observed events were derived from the wind fields in the first 3–9 h of the model run because of the lack of SSM/I passes at other times. Once a candidate event was located, the relative grid was positioned with the center defined as the center of the model-simulated event. Then, just as before, SSM/I and model data were collected from all points with valid observations. Over the course of the data-collection period, data from many separate events were combined to generate a composite. To increase the probability of the existence of an observed event, only those events with 75 or more SSM/I observations greater than 12 m s−1 on the relative grid were chosen. This ensured that high winds were observed somewhere on the relative grid when an event was found in the short-term forecasts. Attempts were made to impose additional constraints based on daily correlations and rms errors, but variability in the event sizes and observation counts prevented the development of consistent thresholds.
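The event-existence filter described here (at least 75 SSM/I winds above the event threshold on the relative grid) reduces to a simple predicate; the function name and the NaN convention for swath gaps are assumptions:

```python
import numpy as np

def is_observed_event(ssmi_relative, thresh=12.0, min_count=75):
    """Accept a candidate event (located from the 0-12-h forecasts)
    only if at least min_count valid SSM/I winds on the relative grid
    exceed the event threshold; NaN marks points outside the swath."""
    valid = ssmi_relative[~np.isnan(ssmi_relative)]
    return int(np.sum(valid > thresh)) >= min_count
```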
Using the model fields to estimate the occurrence and location of the observed mistrals may result in some events being missed, especially in areas with little data input. Given the lack of observations, some loss of deterministic information is inevitable. While mistral occurrence can be inferred from the observations alone, statistical assumptions would be necessary to determine the probability of mistral existence for each set of high-wind observations. Properties like size and central location would also be probabilistic, and the quality of the estimate would depend on the number of observations. Such an analysis would essentially be a statistical analog of the deterministic analysis already produced by the optimal interpolation (OI) package. Since the OI analysis offers consistent estimates of the wind field that are among the set of solutions most likely to exist given the available observations, the OI analyses were used as the probability filter for event existence. The SSM/I winds are assimilated in the OI, so the analyses are influenced by these observations to some extent. However, if large discrepancies exist, the model background is used in place of the SSM/I. Since portions of the analyses not covered by data were heavily biased toward the background forecast field, the full analysis fields were not used. The observational composites consist entirely of SSM/I data associated with the highly probable events. This limits the bias introduced by the model while maximizing the statistical benefits provided by the analysis. When formulating an estimate of reality, consistency is as important as accuracy. In summary, the observational composites can be thought of as a statistical estimate of the distribution of winds associated with all observed events. The samples are taken when and where the analyses and short-term forecasts predict a high probability of sampling an event.
The observational composite was checked to ensure that a distribution of events with known meteorological characteristics was consistently isolated on the relative grid. To be valid, the observational distribution needed to resemble, or at least be well correlated with, the corresponding distribution of analyses and forecasts used to derive it. Each distribution should have low internal variance, and the two should exhibit high daily correlations and a high probability of finding an event over a consolidated area. The observational and model composites are compared in Fig. 9. As with the forecast-based conditional composites in the previous section, only those points on the relative grid with at least 20 SSM/I samples were used in the statistics. The reduced data coverage reflects the corresponding reduction in the number of events incurred by imposing the 75-point minimum SSM/I high-speed coverage constraint.
The fields in Fig. 9 indicate that a well-synchronized sample of mistral observations was successfully isolated. Event probability density, defined as the number of positive event observations divided by the total number of observations, was greater than 0.9 in a consolidated area near the grid center. The high density indicates many consistently collocated events. The probability density decreased away from the central maximum as mistrals of different sizes and shapes overlapped. The observed and predicted average speed distributions were similar, with maximum winds centered close to the center of the relative grid (Fig. 9b). The correlation between the average speed patterns was 0.95 while the corresponding average daily speed correlation was 0.77. The daily correlations were typically lower than the correlations between the average fields mainly because of random fluctuations in the wind fields. However, these values are still acceptably high. Both the observed and modeled wind standard deviations (Fig. 9c) were low near the center of the grid, with increasing values toward the grid boundaries. This is an important statistic because low standard deviation indicates that a relatively homogeneous sample of events was chosen. This allows for the assumption that an observed event exists in every sample and is anchored near grid center.
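Event probability density as defined above (positive event observations divided by total valid observations at each relative-grid point) can be computed along these lines, again assuming NaN marks points outside the swath:

```python
import numpy as np

def event_probability_density(samples, thresh=12.0):
    """Pointwise fraction of valid SSM/I observations exceeding the
    event threshold over a stack shaped (n_samples, ny, nx)."""
    valid = ~np.isnan(samples)
    hit = np.sum((samples > thresh) & valid, axis=0)
    n = np.sum(valid, axis=0)
    return np.where(n > 0, hit / np.maximum(n, 1), np.nan)
```

Values near 1 at the grid center, as in Fig. 9, indicate many consistently collocated events.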
The statistics from the 18-h forecasts that corresponded to the observed mistral composite are shown in Figs. 10a and 10b. Both speed distributions in Fig. 10a display maximum values above 16 m s−1 near the grid center, though the forecast pattern was shifted 1–2 grid points west of the observations in many areas. This trend was also evident in the 18-h forecast-based composites (Fig. 5a), though not as pronounced. The bias (Fig. 10b) was generally negative to positive from east to west across the mistral core, while the rms was generally 3 m s−1 or less except in a region east of the speed maximum, where the rms increased to 4 m s−1. The variances in both the observations and the forecasts (not shown) were higher in this region, but the forecast variance exceeded that of the observations by up to 2 m s−1. This behavior in the rms and variance fields was again similar to those in the forecast-based composites in Figs. 5 and 6. At 66 h (Figs. 10c,d), the east-to-west gradient in the bias pattern remained across the mistral core, but bias values were primarily negative. The shape of the forecast speed distribution remained intact, though the forecast maxima were distinctly lower. The largest negative biases approached −4 m s−1 and occurred east of the core winds. The rms errors in that area exceeded 7 m s−1.
The similarities between the observation- and forecast-based conditional speed composites indicate that the same error patterns were occurring regardless of whether a mistral event was predicted or observed. Areas east of the mistral center showed the largest errors, while errors closer to the mistral center were smaller. This similarity in the patterns indicates that the forecasts were relatively consistent, especially early in the forecast period. The only indication of significant differences appeared in the grid-total bias statistics (Fig. 11). The observation-based conditional bias generally dropped after 30 h instead of steadily increasing through the forecast period as the forecast-based bias did (Fig. 8). This is reflected at 66 h by generally positive speed biases across the mistral core in the forecast-based composite (Fig. 7b) and generally negative biases in the observation-based composite (Fig. 10d). In the next section this behavior is quantified to describe the consistency and resolution of the forecasts with respect to event-scale phase errors, false alarms, and missed forecasts.
6. Discussion
The large number of partially sampled events prevents the reliable decomposition of the error into displacement, pattern, and volume components as was done by Ebert and McBride (2000). However, the sampling process itself provides useful information regarding the type of errors that occurred. The samples that make up each composite are based on the condition that a predicted or observed event exists. Specifically, the forecast-based conditional composite consists of synchronized samples of the forecasts and observations given that an event exists in the forecasts. The observation-based conditional composite consists of synchronized samples of the forecasts and the observations given that an event exists in the observations. If the predicted and observed events are always collocated, both conditional samples and their associated conditional biases will be identical. In this work, these samples will be referred to as symmetric, owing to the agreement in the opposing conditional samples. Differences between the predicted and observed event locations and frequencies promote differences in the conditional biases because of the nature of the samples. These asymmetric errors are represented by event-scale false alarms, missed forecasts, and phase errors. False alarms tend to positively skew the forecast-based conditional (F − O) bias since event forecasts outnumber the observations in the sample. Similarly, missed forecasts tend to negatively skew the observation-based conditional (F − O) bias. Taken together, the difference between the conditional biases is an indication of the presence of these event-scale position errors. The errors are asymmetric in that they arise from differences in the sample means associated with predicted and observed events that are significantly displaced.
The conditional bias difference (CBD = forecast-based conditional bias − observation-based conditional bias) has some interesting properties that illustrate model performance as well as the limitations of the verification process. As noted above, the CBD will be zero if all predicted and observed events are collocated, for in this case the conditional samples are identical. The CBD is also unaffected by a constant model bias. Since such a bias is present in all samples, it factors out when the conditional samples are differenced. This makes the CBD less sensitive to systematic error than the false alarm ratio (FAR) or the equitable threat score (ETS). The CBD is not without its complexities, however. Its value is relative to the bias associated with the symmetric events. For example, if a set of symmetrically collocated event forecasts and observations has a bias of 1 m s−1, any subsequent asymmetric samples with a conditional bias of 1 m s−1 will not contribute to the CBD. This is because the bias associated with the displaced sample is no different than that of the symmetric samples. The CBD can be negative if the asymmetric samples yield high values that are not considered as events. An example of this would be a forecast of weak mistral that is collocated with strong observed winds that did not strictly meet the mistral criteria. The sample is asymmetric because the forecast–observation pair only contributes to the forecast-based composite.2 This example illustrates that an event in the strict terms of the definition need not exist in both fields for the forecast to be useful. In this regard knife-edge errors, resulting from the use of discrete thresholds to describe events, are mitigated. The existence of numerous “near misses” will simply reduce the magnitude of the asymmetric error.
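The invariance of the CBD to a constant model bias can be checked numerically. This sketch uses synthetic Gaussian "speeds" rather than any mistral data, and it holds the event samples fixed when the constant bias is added (a simplification, since in practice a shifted field could change which events are detected):

```python
import numpy as np

rng = np.random.default_rng(0)
obs = rng.normal(10.0, 2.0, size=(50, 16))          # synthetic observed speeds
fcst = obs + rng.normal(0.0, 1.0, size=obs.shape)   # imperfect forecasts
ev_f = fcst.mean(axis=1) > 10.0                     # "event" flagged in forecast
ev_o = obs.mean(axis=1) > 10.0                      # "event" flagged in observations

def cbd(f, o, ev_fcst, ev_obs):
    """Conditional bias difference for fixed event samples."""
    return np.mean(f[ev_fcst] - o[ev_fcst]) - np.mean(f[ev_obs] - o[ev_obs])

base = cbd(fcst, obs, ev_f, ev_o)
# A constant bias shifts both conditional biases equally, so it cancels.
shifted = cbd(fcst + 3.0, obs, ev_f, ev_o)
```

Because the constant offset is present in every member of both conditional samples, `base` and `shifted` are equal to within rounding, which is the sense in which systematic error "factors out" of the CBD.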
The largest caveats associated with the CBD are its dependence on the size of the relative grid and the range of the features that are considered to be events. Statistics from relative grids that are large compared to the size of the event are less sensitive to spatial errors because of the large overlap between the conditional samples. Statistics from a broad event-size spectrum may be skewed in favor of the larger events. Furthermore, broad event spectra tend to reduce the sensitivity of the CBD by shifting increasingly large errors from the asymmetric to the symmetric sample set. For example, a large forecast event collocated with a small observed event would be sampled as a symmetric pair if both events met the criteria for the event sample set. If only the large event met the criteria, it would be sampled as an asymmetric false alarm. Broad event spectra promote fewer asymmetric samples that correspondingly contribute less to the conditional biases relative to the symmetric samples. In the limit as the event definition widens to include all event types and sizes, and the relative-grid size approaches the size of the forecast grid, the conditional biases approach the true bias and the CBD approaches zero. This illustrates the insensitivity of the true bias to asymmetric errors and is one reason why its value is limited in mesoscale verification. The true bias conveys very little information regarding the forecast resolution; it only samples systematic error.
At the opposite extreme, pointwise contingency statistics like the ETS are essentially derived from samples with a relative-grid size of one point. This maximizes the sensitivity to asymmetric samples since only exact “hits” are sampled symmetrically. The error capacity in these types of scores is rapidly saturated by relatively minor errors, and they are thus limited by their oversensitivity. The CBD measurements described in this study are an attempt to moderate the sensitivity of a bias-based measurement by varying its precision. The degree of precision is a subjective quantity, but it can be consistently defined using relatively simple parameters.
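For reference, the ETS discussed here is computed from the standard 2 × 2 contingency counts; a minimal implementation of the standard formula (not code from this study) is:

```python
def equitable_threat_score(hits, misses, false_alarms, correct_negatives):
    """ETS = (hits - hits_random) / (hits + misses + false_alarms - hits_random),
    where hits_random is the number of hits expected by chance."""
    total = hits + misses + false_alarms + correct_negatives
    hits_random = (hits + misses) * (hits + false_alarms) / total
    return (hits - hits_random) / (hits + misses + false_alarms - hits_random)
```

Because every non-collocated event pair registers as one miss plus one false alarm, the denominator grows twice as fast for displacement errors as for a simple under- or overforecast, which is one expression of the oversensitivity referred to above.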
Restricting the relative-grid size and the breadth of the event distribution increases the sensitivity of the CBD. However, this comes at the expense of statistical significance and simplicity of the final result. One may have very precise information about multiple event types, but the validity is limited if each bin only contains a few observations. The tolerance for precision must be balanced by the available data. One way to estimate the number of samples necessary for statistical significance is by measuring the sample error. This can be done by comparing the mean speed from all mistral forecasts on the relative grid with the sample mean derived from those forecasts that corresponded to valid SSM/I measurements (Fig. 12). Figure 12a illustrates that any grid point containing fewer than 20 samples will likely have significant sampling errors. In some areas even the 20-point minimum threshold, which was applied to the statistics in this study, contained sample errors as high as 4 m s−1. Using the variance from the distribution of all mistral forecasts, the width of the 95% confidence interval as well as the number of samples required to achieve this width can be estimated. Figure 12b indicates that even 20 samples are not enough to achieve a width of 1.5 m s−1 at the 95% confidence level in the northern portions of the relative grid. At best, 20 samples achieve a 95% confidence interval of 2 m s−1 in these areas. Not surprisingly, these are the areas with the largest sample errors in Fig. 12a. At the opposite extreme, the number of required samples drops below 10 near the grid center because of the lowered variance in the forecast-based sample. This does not necessarily imply that the total sample error is overcome by only 10 samples. The observational variance and sample error will likely be higher than that of the forecast-based conditional distribution near the event center, especially as the asymmetric errors increase.
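The sample-size reasoning above follows from the usual normal-approximation confidence interval on a mean, whose full width is 2zσ/√n. A small sketch (the standard deviation below is illustrative, not a variance measured in this study):

```python
import math

def ci_width(sigma, n, z=1.96):
    """Full width of the 95% CI on a sample mean (normal approximation)."""
    return 2.0 * z * sigma / math.sqrt(n)

def samples_for_width(sigma, width, z=1.96):
    """Samples needed so the 95% CI on the mean is no wider than `width`."""
    return math.ceil((2.0 * z * sigma / width) ** 2)
```

With an illustrative standard deviation of 2 m s−1, 20 samples give a 95% confidence interval roughly 1.75 m s−1 wide, and reaching a 1.5 m s−1 width requires 28 samples, consistent with the finding that 20 samples fall short of the 1.5 m s−1 target in the high-variance portions of the grid.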
The lack of a very large sample of SSM/I wind observations prevented an accurate estimate of the sample error in the observations over most of the relative grid. However, the magnitude of the sample variances in the observations was generally similar to that of the forecasts. Thus it was assumed that the observed sample errors were similar to those of the forecasts, and the minimum number of samples was left at 20. These restrictions on the sample size were the primary reason for combining most of the resolvable mistrals into a single category spanning 75–500 points. Most events contained 300 points or fewer, and the larger events were quite strongly forced and relatively well predicted. However, some event-scale errors may have been sampled symmetrically, and the CBD statistics may underestimate the asymmetric error. Increased sample sizes will allow for more precise measurements in future studies.
The conditional bias differences along with the conditional biases are shown in Fig. 13. The forecast-based conditional bias was recalculated here to account for the reduced area template in the observation-based composite (cf. Figs. 5 and 10a). The values increased by 0.2–0.4 m s−1, but the qualitative trends remained relatively unchanged (Figs. 8 and 13). The full model bias within the mistral search subdomain in Fig. 1 was also calculated for all days when a mistral was either predicted or observed. During the first 30 h of the mistral forecasts, both of the conditional biases increased at nearly the same rate (Fig. 13), resulting in a nearly constant CBD. The full bias also increased, but it closely tracked the forecast-based bias. After 30 h, the forecast-based bias continued to slowly rise through 54 h and leveled off by 66 h. The observation-based bias generally fell through the same period, with the greatest declines between 54 and 66 h. The full bias remained positive and relatively close to the forecast-based bias, but decreases at 42 and 66 h reflected similar decreases in the observation-based bias. These divergent trends are reflected in the CBD, which steadily increased after 30 h. Interpreting these results, the steady rise in the bias in all samples prior to 30 h indicates that it is a systematic error. The model produced reliable forecasts with relatively minor event-scale phase shifts, false alarms, or missed forecasts. The depictions of the conditional composites in Figs. 5 and 10 reflect this. Beyond 30 h, the asymmetric errors become more noticeable, although the values were relatively small compared to the relative-grid mean speed, which was about 11 m s−1 at any given time. The 66-h speed distributions in Figs. 7 and 10 indicate that the sample differences are not related to any major systematic phase errors.
The close agreement between the full bias and the forecast-based bias may be due to the higher number of days in which a mistral forecast was known to exist. The number of observed mistral days may have been underestimated because of the lack of observations.
7. Conclusions
One advantage of the mesoscale forecast is its ability to resolve highly detailed phenomena. Evaluating this ability in a simple, concise, and meaningful way makes mesoscale verification a difficult and many-faceted problem. Meteorological events are predicted as tangible entities with distinct properties like size, shape, intensity, and location. Most of the standard verification methods were not devised to judge this kind of forecast. The bias is relatively insensitive to large phase errors, false alarms, and missed forecasts. It only measures systematic error. The rms is sensitive to nonsystematic error, but it is an absolute measure and is thus insensitive to the sampling method. False alarms, missed forecasts, and displacement errors are indistinguishable from one another and other types of random error. At the other extreme, the ETS and scores like it are very sensitive. The scores drop rapidly if the forecast deviates much from an exact solution. The poor performance of the ETS suggests that very precise forecasts do not share a one-to-one correspondence with the observations and should not be interpreted as such. However, precise forecasts that convey a high likelihood that an acceptably similar event will occur within an acceptable temporal and spatial window are very useful. To successfully measure this ability, the sensitivity of the verification should reflect the parameters of acceptability, and it should consistently measure the deviation of the forecasts from these parameters in a simple and transparent way.
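The displacement sensitivity described above is easy to reproduce with synthetic fields. In this sketch (a hypothetical 4 × 4 event on a 20 × 20 grid, not data from this study), a perfect forecast scores ETS = 1, while the identical event displaced by its own width scores below zero:

```python
import numpy as np

def ets_from_fields(fcst_event, obs_event):
    """ETS from boolean event masks on a common grid (pointwise sampling)."""
    hits = int(np.sum(fcst_event & obs_event))
    misses = int(np.sum(~fcst_event & obs_event))
    false_alarms = int(np.sum(fcst_event & ~obs_event))
    total = fcst_event.size
    hits_random = (hits + misses) * (hits + false_alarms) / total
    return (hits - hits_random) / (hits + misses + false_alarms - hits_random)

obs = np.zeros((20, 20), dtype=bool)
obs[8:12, 8:12] = True               # a 4x4 observed event
perfect = obs.copy()                 # exact forecast
displaced = np.roll(obs, 4, axis=1)  # same event, shifted 4 points
```

The displaced forecast turns all 16 hits into 16 misses plus 16 false alarms, driving the ETS slightly negative even though the event's size, shape, and intensity are predicted perfectly.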
The sporadic coverage and variable precision of the observations represent a major obstacle in the verification process. Often, only portions of meteorological events are directly sampled by high-resolution observations. Even when analyses are available, their reliability depends on the observation density and type, the first-guess field (if any), and even the ambient weather. Deriving complex deterministic parameters from the measurements or the analyses makes the determination of observational uncertainty very difficult. How does one quantify the expected error in the shape of an event when observational error is often expressed in terms of a standard deviation of the measured quantities? These considerations are especially relevant when measuring wind events over the oceans. Data are often sparse, and the analyses may contain errors. With regard to this mistral study, the existence and central position of the observed events were estimated from the analyses and short-term forecasts. The true observational uncertainty of the central position is unknown. However, the statistics in Fig. 9 indicate that the analyses were reliable in terms of the expected speed errors associated with the mean observational pattern. The nature of this comparison reflects the philosophy of providing focused information as simply as possible. Composites and conditional samples are used to measure the general character of the observed and forecast fields. While the deterministic properties of the events are not directly evaluated, the expected errors in the meteorological parameters that imply the existence of the events are evaluated. Even when deterministic properties are compared, the statistics are nondeterministic in that they represent the average over many occurrences.
These measurements indicate that the COAMPS mistral forecasts are quite reliable up to 66 h in advance. Observed and predicted wind speed composite patterns were well correlated both in the daily and average sense. The rms and bias errors were relatively low in comparison with the wind speeds, especially near the center of the highest winds. A minor westward shift of the position of the highest average winds in the forecast mistral core in relation to the observations indicates a minor systematic location error. Any other location errors were likely nonsystematic. Given the geographically constrained nature of the mistral, this result is not unexpected. The area of greatest forecast uncertainty was located east of the main wind core, where the conditional rms error was maximized. The conditional model bias in this region was relatively large and negative regardless of whether an event was observed or predicted, indicating a consistent underprediction error. High winds are likely not as predictable in this portion of the mistral, as indicated by the high variance.
The similarities in the conditional composites and their sample biases also indicate that the forecasts were consistent on the event scale. Prior to the 30-h forecast time, the difference in the conditional biases remained small and nearly constant, though the predicted wind speeds generally increased through this period. Since this tendency was in all of the samples, it was likely the result of a consistent model bias. After 30 h the conditional sample biases began diverging, indicating mistral-scale errors. The geographically anchored nature of the mistral suggests that the majority of the asymmetric errors were missed forecasts and false alarms. As noted above, the position of the composite average speed patterns indicates only minor systematic phase errors. Despite the increase in the asymmetric error, the magnitude of the CBD was relatively small compared to the average wind speed.
In relation to many current verification problems, the mistral represents a relatively modest challenge. Although the observations were incomplete, the stationary and predictable nature of the mistral made it relatively easy to track. Future studies on less predictable systems such as open-ocean winds and precipitation systems will prove to be far more challenging. Lower predictability as well as the increased probability of phase errors will likely increase the asymmetric error. Separating model skill from event predictability also presents an interesting challenge as these quantities are intertwined. One possible measure of predictability could be the divergence rate of the conditional samples from multiple model runs with differing initial conditions. The magnitude of the differences in the conditional biases could also be used to evaluate the breadth of the multiple model solutions. In this regard, a sample-based composite method would be a good candidate for the verification of mesoscale ensembles.
Acknowledgments
This research is supported by the Office of Naval Research (ONR) through Program Element 0602435N and the Space and Naval Warfare Systems Command (SPAWAR) through Program Element 603207N. Computing time was supported in part by a grant of high performance computing (HPC) time from the Department of Defense Major Shared Resource Center, Stennis Space Center, Mississippi, and performed on a Cray SV1. Special thanks go to Kim Richardson (NRL), the NRL Satellite Meteorological Applications group, and FNMOC for providing SSM/I data. Sue Chen, Jerry Schmidt, and Jim Doyle (NRL) provided guidance and assistance with COAMPS. Jeff Lerner (FNMOC) also provided helpful input.
REFERENCES
Baldwin, M. E., S. Lakshmivarahan, and J. S. Kain, 2001: Verification of mesoscale features in NWP models. Preprints, Ninth Conf. on Mesoscale Processes, Fort Lauderdale, FL, Amer. Meteor. Soc., 255–258.
Baldwin, M. E., S. Lakshmivarahan, and J. S. Kain, 2002: Development of an “events-oriented” approach to forecast verification. Preprints, 19th Conf. on Weather Analysis and Forecasting, San Antonio, TX, Amer. Meteor. Soc., 255–258.
Barker, E. H., 1992: Design of the navy's multivariate optimum interpolation analysis system. Wea. Forecasting, 7, 220–231.
Bernardet, L. R., L. D. Grasso, J. E. Nachamkin, C. A. Finley, and W. R. Cotton, 2000: Simulating convective events using a high-resolution mesoscale model. J. Geophys. Res., 105, 14 963–14 982.
Colle, B. A., K. J. Westrick, and C. F. Mass, 1999: Evaluation of MM5 and Eta-10 precipitation forecasts over the Pacific Northwest during the cool season. Wea. Forecasting, 14, 137–154.
Colle, B. A., C. F. Mass, and D. Ovens, 2001: Evaluation of the timing and strength of MM5 and Eta surface trough passages over the eastern Pacific. Wea. Forecasting, 16, 553–572.
Davies, H. C., 1976: A lateral boundary formulation for multi-level prediction models. Quart. J. Roy. Meteor. Soc., 102, 405–418.
Doyle, J. D., 1997: The influence of mesoscale orography on a coastal jet and rainband. Mon. Wea. Rev., 125, 1465–1488.
Ebert, E., and J. L. McBride, 2000: Verification of precipitation in weather systems: Determination of systematic errors. J. Hydrol., 239, 179–202.
Gemmill, W. H., and V. M. Krasnopolsky, 1999: The use of SSM/I data in operational marine analysis. Wea. Forecasting, 14, 789–800.
Goodberlet, M. A., C. T. Swift, and J. C. Wilkerson, 1990: Ocean surface wind speed measurements of the Special Sensor Microwave/Imager (SSM/I). IEEE Trans. Geosci. Remote Sens., 28, 823–827.
Gray, W. M., and W. M. Frank, 1977: Tropical cyclone research by data compositing. NEPRF Tech. Rep. TR-177-01, Naval Environmental Prediction Research Facility, Monterey, CA, 70 pp.
Hodur, R. M., 1997: The Naval Research Laboratory's Coupled Ocean/Atmosphere Mesoscale Prediction System (COAMPS). Mon. Wea. Rev., 125, 1414–1430.
Hogan, T. F., M. S. Peng, J. A. Ridout, and W. M. Clune, 2002: A description of the impact of changes to NOGAPS convection parameterization and the increase in resolution to T239L30. NRL Memo. Rep. NRL/MR/7530-02-52, Naval Research Laboratory, Monterey, CA, 10 pp.
Jiang, Q., R. B. Smith, and J. D. Doyle, 2003: The nature of the mistral: Observations and modelling of two MAP events. Quart. J. Roy. Meteor. Soc., 129, 857–875.
Koch, S. E., 1985: Ability of a regional scale model to predict the genesis of intense mesoscale convective systems. Mon. Wea. Rev., 113, 1693–1713.
Mass, C. F., D. Ovens, K. Westrick, and B. A. Colle, 2002: Does increasing horizontal resolution produce more skillful forecasts? The results of two years of real-time numerical weather prediction over the Pacific Northwest. Bull. Amer. Meteor. Soc., 83, 407–430.
Murphy, A. H., 1991: Forecast verification: Its complexity and dimensionality. Mon. Wea. Rev., 119, 1590–1601.
Nachamkin, J. E., 2002: Forecast verification using meteorological event composites. Preprints, 19th Conf. on Weather Analysis and Forecasting, San Antonio, TX, Amer. Meteor. Soc., 206–209.
Petty, G. W., 1993: A comparison of SSM/I algorithms for the estimation of surface wind. Proc. Shared Processing Network DMSP SSM/I Algorithm Symp., Monterey, CA, Fleet Numerical Meteorology and Oceanography Center.
Tustison, B., D. Harris, and E. Foufoula-Georgiou, 2001: Scale issues in verification of precipitation forecasts. J. Geophys. Res., 106D, 11 775–11 784.
White, G. B., J. Paegle, W. J. Steenburgh, J. D. Horel, R. T. Swanson, L. K. Cook, D. J. Onton, and J. G. Miles, 1999: Short-term forecast validation of six models. Wea. Forecasting, 14, 84–108.
Xue, M., and Coauthors, 2001: The Advanced Regional Prediction System (ARPS)—A multi-scale nonhydrostatic atmospheric simulation and prediction tool. Part II: Model physics and applications. Meteor. Atmos. Phys., 76, 143–165.