Recent advancements in numerical weather prediction (NWP) and the enhancement of model resolution have created the need for more robust and informative verification methods. In response to these needs, a plethora of spatial verification approaches have been developed in the past two decades. A spatial verification method intercomparison was established in 2007 with the aim of gaining a better understanding of the abilities of the new spatial verification methods to diagnose different types of forecast errors. The project focused on prescribed errors for quantitative precipitation forecasts over the central United States. The intercomparison led to a classification of spatial verification methods and a cataloging of their diagnostic capabilities, providing useful guidance to end users, model developers, and verification scientists. A decade later, NWP systems have continued to increase in resolution, including advances in high-resolution ensembles. This article describes the setup of a second phase of the verification intercomparison, called the Mesoscale Verification Intercomparison over Complex Terrain (MesoVICT). MesoVICT focuses on the application, capability, and enhancement of spatial verification methods to deterministic and ensemble forecasts of precipitation, wind, and temperature over complex terrain. Importantly, this phase also explores the issue of analysis uncertainty through the use of an ensemble of meteorological analyses.
MesoVICT focuses on the application, capability, and enhancement of spatial verification methods as applied to deterministic and ensemble forecasts of precipitation, wind, and temperature over complex terrain and includes observation uncertainty assessment.
As numerical weather prediction (NWP) models began to increase considerably in resolution, it became clear that traditional gridpoint-by-gridpoint verification methods did not provide adequate diagnostic information about forecast performance for some users (e.g., Mass et al. 2002). Double penalties arising from spatial displacement (or timing) errors, together with the more rapid growth of small-scale errors, often result in poorer performance scores for higher-resolution forecasts than for their coarser counterparts, even when subjective evaluation would judge the higher-resolution models to be better. Subsequently, a host of new verification methods, which we will refer to as spatial methods, were developed in rapid succession (Ebert and McBride 2000; Harris et al. 2001; Casati et al. 2004; Nachamkin 2004; Davis et al. 2006; Keil and Craig 2007; Roberts and Lean 2008; Marzban et al. 2009; Gilleland et al. 2010b). There are still gaps in our understanding when it comes to interpreting what these spatial methods tell us; gaining an in-depth understanding of forecast performance depends on grasping the full meaning of the verification results. Furthermore, the investment required to implement a new spatial method is relatively high compared to traditional verification methods, so criteria are needed for deciding which methods best suit a particular user’s needs. A spatial methods meta-verification, or intercomparison, project was therefore created to address these questions.
The first spatial verification methods intercomparison project (ICP; Gilleland et al. 2009, 2010a) was initiated in 2007 with the aim of better understanding the rapidly increasing literature concerning new spatial verification methods, and several questions were addressed:
How does each method inform about forecast performance overall?
Which aspects of forecast error does each method best identify (e.g., location error, scale dependence of the skill, etc.)?
Which methods yield identical information to each other and which methods provide complementary information?
The aim of the ICP was to analyze the behavior and provide a structured and systematic cataloging of the existing spatial verification methods, with the final goal of providing guidance to the users on which methods are best for which purpose. The project focused on prescribed errors for idealized cases and quantitative precipitation forecasts, where the ability of the verification methods to diagnose the known error was assessed (Ahijevych et al. 2009). For this first intercomparison, complex terrain was considered too problematic, and therefore a test dataset from a region with relatively flat terrain was chosen, namely, the Great Plains of the United States. Moreover, the verification methods were tested on nine selected case studies from the 2005 National Severe Storms Laboratory (NSSL) and Storm Prediction Center (SPC) Spring Experiment, using Stage II precipitation analyses as the verification fields, and the results were compared to human subjective assessments. The focus then was on the existing NWP operational capability: convection-permitting 4–7-km grids and the challenges these grid spacings presented for verifying precipitation forecasts.
The ICP was highly informative not only for end users, but also for the verification method developers, and it led to several improvements in the verification approaches (e.g., Keil and Craig 2009). Results for specific methods from the first intercomparison of spatial verification methods can be found in the Weather and Forecasting ICP special collection (https://journals.ametsoc.org/topic/verification_icp; Brill and Mesinger 2009; Davis et al. 2009; Ebert 2009; Ebert and Gallus 2009; Keil and Craig 2009; Marzban et al. 2009; Nachamkin 2009; Wernli et al. 2009; Casati 2010; Gilleland et al. 2010b; Lack et al. 2010; Lakshmanan and Kain 2010; Mittermaier and Roberts 2010).
The ICP helped to promote the mainstream adoption of spatial verification approaches. Since the ICP special collection, a large number of papers have been published making use of the newer spatial techniques. For example, neighborhood methods [particularly the fractions skill score (FSS)] have been employed by Weusthoff et al. (2010), Schaffer et al. (2011), Sobash et al. (2011), Duc et al. (2013), Mittermaier et al. (2013), and Skok and Roberts (2016). Scale separation methods have been used by De Sales and Xue (2011) and Liu et al. (2011), and field deformation approaches by Nan et al. (2010). Feature-based techniques have been used by Demaria et al. (2011), Hartung et al. (2011), Johnson et al. (2011), Gorgas and Dorninger (2012b), Wapler et al. (2012), Crocker and Mittermaier (2013), Mittermaier and Bullock (2013), Weniger and Friederichs (2016), and Mittermaier et al. (2016). The verification of forecasts at observing locations, introduced by Theis et al. (2005), has been extended by Ben Bouallègue and Theis (2014), Mittermaier (2014), and Mittermaier and Csima (2017), who compared deterministic and ensemble forecasts at the kilometer scale. A comprehensive list of papers can be found at the spatial verification intercomparison website (www.ral.ucar.edu/projects/icp/references.html).
Current operational NWP now includes kilometer-scale deterministic and ensemble systems, in which the details of the terrain are felt much more explicitly by the models. Small-scale detail is not confined to precipitation; it appears in other spatial fields as well. Moreover, as NWP evolves and models become more accurate, analysis and observation uncertainty related to measurements and recording procedures, temporal and spatial sampling, gridding procedures, and other observation errors has an evolving (and sometimes increasingly important) impact on the verification results and ought to be taken into account in verification practice.
A second phase of the ICP, called the Mesoscale Verification Intercomparison over Complex Terrain (MesoVICT; www.ral.ucar.edu/projects/icp), was established in 2014, where the concept of the spatial methods intercomparison is extended to incorporate recent trends in modeling and verification science. The focus of MesoVICT is the application, assessment, and enhancement of the capability of spatial verification methods to evaluate deterministic and ensemble forecasts over complex terrain at near-convection-resolving and convection-permitting resolution. Test cases include additional variables beyond precipitation, such as wind and temperature. The cases represent interesting meteorological events that develop over time rather than single snapshots, in a region of complex terrain over the Alps where extensive observation data were collected during the Mesoscale Alpine Program Forecast Demonstration Project (MAP-FDP; Rotach et al. 2009). Synoptic observations from a very dense observation network provided the input for the generation of model-independent analyses that are used as verifying fields.
This article introduces the rationale behind MesoVICT, gives an overview, updates the classification of spatial verification techniques, describes the MesoVICT datasets, and gives a simple example of the approach used to evaluate the verification methods. The results of the MesoVICT project will be reported in an American Meteorological Society (AMS) special collection in Monthly Weather Review and Weather and Forecasting.
MesoVICT is a nonfunded, open, collaborative project. Participation in the project is still possible and welcomed. Researchers, including students, who are interested in participating are encouraged to contact one of the authors or visit the project homepage at https://ral.ucar.edu/projects/icp/.
Building from the first intercomparison.
The major results of the first spatial verification methods intercomparison project are summarized in Gilleland et al. (2009, 2010a) and Ahijevych et al. (2009). Through that project, the spatial verification approaches were categorized into four classes:
Neighborhood methods relax the requirement of an exact forecast location and define neighborhoods of increasing size in which forecasts and observations are matched; the same idea can also be applied in time. This is equivalent to applying a low-pass spatial filter. The treatment of the data within the neighborhood defines the verification strategy, ranging from simple upscaling to probabilistic and ensemble verification approaches that assess the forecast probability density function within the observation neighborhood. Also included in this class are single-observation–neighborhood-forecast methods, which compare the occurrence of an event in the forecast neighborhood with its observed occurrence.
Scale-separation techniques use single-band spatial filters to decompose forecast and observed fields into scale components from which traditional verification scores are evaluated for each individual scale component, separately. These methods enable assessment of the scale dependence of the bias, error, and skill and evaluation of the forecast versus observed scale structure, that is, the scales for which the forecast can replicate observed structures, and scales at which the forecast performs well.
Feature-based (object based) methods first identify and isolate features in the forecast and observation domains by applying a threshold (e.g., rainfall accumulations ≥ 10 mm) and then assess different feature attributes (e.g., location, extent, intensity) for the paired forecast–observation features.
Field-deformation techniques use a vector (displacement) field to morph the forecast field versus the observed field (up to an optimal fit); then a scalar (amplitude) field is applied in order to correct intensity errors. These techniques assess the displacement and intensity error evaluated over the whole field.
In the present classification, we propose adding an emerging class of spatial verification techniques:
Distance measures for binary images assess the distance between forecast and observation fields by evaluating the geographical distances between all the grid points exceeding a selected threshold. These techniques can be considered a hybrid between field-deformation and feature-based techniques.
Distance measures for binary images were developed in image processing for edge detection and pattern recognition and include Pratt’s figure of merit (FoM; Pratt 1978); the Fréchet distance (Alt and Godau 1995; Eiter and Mannila 1994); the Hausdorff metric and its derivatives, the modified and partial Hausdorff distances (Dubuisson and Jain 1994); the mean error distance (MED; Peli and Malah 1982); and the Baddeley delta metric (Baddeley 1992a,b). These distance measures are sensitive to the difference in shape and extent of objects and assess the distance/displacement between forecast and observation features. Several studies have exploited these metrics for spatial verification of precipitation forecasts (Schwedler and Baldwin 2011; Gilleland 2011, 2017; Gilleland et al. 2008; Venugopal et al. 2005; Zhu et al. 2011) and sea ice prediction (Hebert et al. 2015; Heinrichs et al. 2006; Dukhovskoy et al. 2015).
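As a concrete illustration of this class, the mean error distance can be computed from a Euclidean distance transform of the binary forecast field. The following minimal Python sketch is illustrative only (the function name and the observation-to-forecast direction convention are ours, not part of any MesoVICT protocol):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def mean_error_distance(obs_bin, fcst_bin):
    """Mean distance (in grid units) from each observed event point to the
    nearest forecast event point (in the spirit of Peli and Malah 1982)."""
    # distance_transform_edt gives, at every grid point, the Euclidean
    # distance to the nearest zero cell, so invert the forecast mask to
    # obtain the distance to the nearest forecast event point.
    dist_to_fcst = distance_transform_edt(~fcst_bin)
    return dist_to_fcst[obs_bin].mean()

# Toy fields: a 3 x 3 rain object displaced eastward by 5 grid points.
obs = np.zeros((50, 50), dtype=bool)
fcst = np.zeros((50, 50), dtype=bool)
obs[20:23, 20:23] = True
fcst[20:23, 25:28] = True
print(mean_error_distance(obs, fcst))  # 4.0: mean of 5-, 4-, and 3-point offsets
```

Note that this direction of the measure is not symmetric: averaging over forecast points instead would generally give a different value, which is exactly the sensitivity these metrics exploit.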
In the first intercomparison classification, the neighborhood and scale-separation methods were referred to as filtering methods because they use spatial filtering, and the field-deformation and feature-based techniques were grouped as displacement techniques because they explicitly assess displacements. However, these classes represent a fairly general categorization, so that two methods in the same class can still be very different; in fact, some methods straddle more than one category, and others simply do not fit well in any category. For example, the structure–amplitude–location (SAL) method introduced by Wernli et al. (2009) identifies features in the fields but does not analyze them individually; rather, it yields summary scores based on all features across space. Often, techniques from one category might be used to optimize a method in another category: for example, a low-pass smoother can be adopted within a feature-based approach to reduce small-scale noise (and hence help to identify the objects). Similarly, scale separation and smoothing are applied in some field-deformation approaches: as an example, Keil and Craig’s (2009) displacement and amplitude score (DAS) uses a spatial filter prior to defining the deformation vector field, and the image warping approach presented by Gilleland et al. (2010b) uses smoothing to help find the best-fitting warp, although its final summary is based on the original, raw amplitudes without any smoothing.
Therefore, a new penta-classification (Fig. 1) is proposed here, which aims to highlight the similarities and overlaps between classes. The proximity of the techniques in the penta-classification diagram indicates which methods are more closely related. For example, scale-separation approaches are close to neighborhood methods because both rely on spatial filtering (a single-band versus a low-pass filter, respectively). Field-deformation methods overlap with the scale-separation approaches because many field-deformation algorithms perform single-band spatial filtering prior to the field morphing. Neighborhood methods are close to feature-based methods because smoothing (i.e., a low-pass filter) is often applied to better identify features (e.g., from high-resolution models). Finally, distance metrics bridge feature-based and field-deformation methods: they are related to feature-based methods because both assess distances between features, yet their algorithms more closely resemble those of field-deformation approaches because they measure the distances between all gridpoint pairs (with no prior identification and matching of forecast and observed features).
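To make the neighborhood class concrete, the fractions skill score of Roberts and Lean (2008) can be sketched in a few lines: binary exceedance fields are converted to neighborhood fractions with a moving-average (low-pass) filter and then compared. The synthetic fields, threshold, and window sizes below are illustrative only:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def fss(fcst, obs, threshold, n):
    """Fractions skill score for an n x n neighborhood at a given
    exceedance threshold (Roberts and Lean 2008)."""
    # Neighborhood fractions = low-pass filtered binary exceedance fields.
    pf = uniform_filter((fcst >= threshold).astype(float), size=n)
    po = uniform_filter((obs >= threshold).astype(float), size=n)
    mse = np.mean((pf - po) ** 2)
    ref = np.mean(pf ** 2) + np.mean(po ** 2)  # no-skill reference MSE
    return 1.0 - mse / ref if ref > 0 else np.nan

rng = np.random.default_rng(0)
obs = rng.gamma(0.5, 2.0, size=(100, 100))   # synthetic rain-like field
fcst = np.roll(obs, 4, axis=1)               # forecast displaced by 4 points
for n in (1, 5, 17):
    print(n, round(fss(fcst, obs, 5.0, n), 3))  # skill rises with window size
```

For a purely displaced forecast, the score increases with neighborhood size, which is how the FSS reveals the scale at which a forecast becomes skillful.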
The first spatial verification intercomparison project addressed the following research questions:
How does the method inform about performance at different scales?
Does the method inform about the spatial structure error?
How does the method inform about location error?
Can the method inform about timing error?
Does the method inform about intensity error and distribution differences?
Does the method provide information about hits, misses, false alarms, and correct negatives?
Does the method do anything that is counterintuitive?
Does the method have tunable parameters, and how sensitive are the results to their specific user-chosen values?
Can the results be aggregated across multiple cases?
Are the results accompanied by confidence intervals or statistical significance?
A table of attributes was produced as a reference and guide by Gilleland et al. (2009) in order to help users identify which methods provide useful information to answer each of those questions.
For MesoVICT, new challenges and research questions have been identified:
What is the ability of the method to verify forecasts of variables other than precipitation (e.g., wind)?
How can the method be adapted to evaluate ensemble forecasts?
Does the method show unusual behavior in complex terrain, and how should results be interpreted given the challenges of forecasting in complex terrain?
What is the sensitivity of existing spatial verification methods to their own specific tuning parameters, the domain size, interpolation, and regridding? The aim of this assessment is to provide guidance on the best practices.
Can the method be used fairly to compare the performance of high-resolution and coarser-resolution forecasts?
Can the method account, or be adapted to account, for analysis or observation uncertainty?
The MesoVICT outcomes should benefit both verification users and verification method developers. One aim is to refresh the guidance for end users on the best use of spatial verification approaches; at the same time, the analysis will potentially identify shortcomings in existing methods. MesoVICT therefore encourages the scientific community to further develop and improve the existing spatial verification methods.
MesoVICT is a verification methods (meta verification) intercomparison rather than a model intercomparison. The focus of MesoVICT is, therefore, to analyze and document the behavior of the spatial verification methods. Participants in the intercomparison are expected to test their chosen spatial verification methods on a set of selected case studies. To harmonize the intercomparison and facilitate participation, the selected NWP forecasts and a verifying gridded analysis and station observations have been interpolated onto a common preformatted grid, and all the data can be downloaded in ASCII format from the MesoVICT website www.ral.ucar.edu/projects/icp/. The MesoVICT data and a simple example are described in the “MesoVICT data” and “MesoVICT case and example application” sections.
The intercomparison is structured with the view of maximizing participation (hence including a wide spectrum of preexisting and new spatial verification methods), but also ensuring that participants analyze the same data and focus on addressing the same scientific questions, which is crucial if the intercomparison is to succeed. Figure 2 provides a schematic of the experimental design (Dorninger et al. 2013). All participants in the intercomparison are requested to complete the analysis of the core experiment (case 1) for inclusion in the intercomparison final review. There are six identified cases (described in the “MesoVICT case and example application” section and Table 5), of which the analysis of case 1 is the absolute minimum. The core of the intercomparison focuses on the assessment of deterministic precipitation forecasts over complex terrain. Both gridded and point observations are provided, depending on what a particular method requires.
Following Dorninger et al. (2013), the core experiment provides the foundation on which the other tiers are gradually built up tier by tier. Beyond the core, participants can choose how many additional tiers to evaluate and perhaps make other contributions. The tiers represent a progression in terms of forecast type, parameter, and choice of a verification analysis or observation, exploring a range of challenges:
Tier 1 explores the spatial verification of deterministic wind forecasts, as well as ensemble forecasts of precipitation and wind, against control analysis and point observations.
Tier 2a considers the use of an ensemble of analyses as the verification dataset for quantifying analysis uncertainty in a deterministic forecast context.
Tier 2b considers the use of an ensemble of analyses as the verification dataset for quantifying analysis uncertainty in an ensemble forecast context.
Tier 3 is the user-defined tier where method-specific sensitivities can be explored, model reruns can be assessed alongside the common dataset, or other parameters can be explored.
One of the key aims of the first spatial verification method intercomparison was to understand whether the spatial verification approaches provide more intuitive results that are in better agreement with a human’s subjective verification. Objective results of the different spatial verification methods were therefore compared to a subjective assessment (Ahijevych et al. 2009). The experiences gained during this process revealed just how challenging subjective assessment can be, as there was considerable disagreement between the assessors, depending on how each ranked the forecast attributes from most to least important. We do not propose to repeat exactly the same methodology in MesoVICT, but instead construct a ranking of cases for each of the participating methods and compare how the different verification methods rank the same set of cases. This process could result in a grouping of methods that assess similar attributes. Because each case study spans a time period of a few days, hourly scores can be used to track aspects of forecast performance over the case study time span. Furthermore, aggregation of the forecast performance metrics over the time period of the case study will provide the summary score for the ranking of the six case studies, which represent very different but typical synoptic settings in the Alpine region. Participants are encouraged to also provide inference information for these aggregated scores (e.g., confidence intervals or p values) to ensure that the ranking of the case studies is accompanied by measures of statistical significance.
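One simple way to attach inference information to a case-aggregated score is a percentile bootstrap over the hourly scores. The sketch below is illustrative only: the function name and data are ours, and because hourly scores are serially correlated, a block bootstrap would often be more defensible than the plain resampling shown here:

```python
import numpy as np

def bootstrap_ci(hourly_scores, n_boot=10000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a case-aggregated mean
    score, resampling the hourly scores with replacement."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(hourly_scores, dtype=float)
    # n_boot resamples of the hourly scores, each the size of the case.
    idx = rng.integers(0, scores.size, size=(n_boot, scores.size))
    means = scores[idx].mean(axis=1)
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), (lo, hi)

# 72 synthetic hourly MAE values spanning a three-day case.
scores = np.abs(np.sin(np.arange(72))) * 2.0
m, (lo, hi) = bootstrap_ci(scores)
print(m, lo, hi)
```

The resulting interval can accompany the aggregated score when ranking the six case studies.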
MesoVICT takes advantage of the huge data collection effort within the framework of two World Weather Research Programme (WWRP) Forecast Demonstration Projects (FDPs), namely, the Mesoscale Alpine Program (MAP)–Demonstration of Probabilistic Hydrological and Atmospheric Simulation of Flood Events in the Alpine Region (D-PHASE) project (Rotach et al. 2009) and the Convective and Orographically-Induced Precipitation Study (COPS; Wulfmeyer et al. 2008) over central Europe in 2007. This data collection covers observations, Vienna Enhanced Resolution Analyses (VERA), and deterministic and ensemble model forecasts for at least June–November 2007. All of these data are stored at the World Data Centre for Climate (WDCC) of the Deutsches Klimarechenzentrum (DKRZ) in Hamburg, Germany, and are freely available (https://cera-www.dkrz.de/WDCC/ui/cerasearch/). Observational data and VERA analyses are stored in netCDF format and model forecasts are stored in gridded binary 1 (GRIB1) format. To facilitate ease of access to the data for MesoVICT participants, data for the selected case studies are provided in ASCII format (“MesoVICT case and example application” section) at the National Center for Atmospheric Research (NCAR) website and can be downloaded from www.ral.ucar.edu/projects/icp.
During the D-PHASE Operations Period (DOP) from June to November 2007, a total of 23 atmospheric deterministic NWP models were run in an operational mode, many at a horizontal grid spacing of a few kilometers (convection-permitting models). Moreover, seven regional atmospheric ensemble modeling systems of intermediate resolution were run with up to 24 members. Some basic model specifications of the participating models, including their lower-resolution driving models, are given by Arpagaus et al. (2009). Not all of the high-resolution model domains cover the entire Alpine region, which makes a comprehensive model comparison difficult, as shown by Dorninger and Gorgas (2013). Therefore, three models have been selected for MesoVICT: the forecasts from the Swiss model Consortium for Small-Scale Modeling 2 (COSMO-2), the Canadian Global Environmental Multiscale Limited Area Model (GEM-LAM), and the ensemble model COSMO Limited Area Ensemble Prediction System (COSMO-LEPS) from the Hydro-Meteo-Climate Regional Service of Emilia-Romagna (ARPA-SIMC), Italy. Their basic model specifications are listed in Table 1 and their limited area domains are illustrated in Fig. 3a. References and links to model documentation are provided in Table 2.
Recognizing the opportunity to analyze the advances in NWP systems within the framework of MesoVICT, several institutions provided reruns of the case studies using their most recent state-of-the-art NWP systems. The reruns that have been completed to support projects in tier 3 are listed in Table 1, and a selection of their mainly innermost domains is shown in Fig. 3b. Again, references and links to model documentation can be found in Table 2.
Because the gridded analysis and its uncertainty are key components of MesoVICT, it is important to explain briefly how it is generated. To simplify the intercomparison of verification methods requiring gridded analyses, the fields of the selected models were interpolated onto the 8-km VERA grid using an inverse distance method based on the Cressman interpolation scheme (Cressman 1959). The interpolation radius is adapted to the model resolution so that at least nine surrounding grid points contribute to each interpolation; for remapping a 2.2-km grid onto the 8-km grid, for example, an interpolation radius of 4 km was chosen. The selection of 8 km for the analysis grid is a compromise between the available observation density (∼16 km) and the model grid resolution (kilometer scale).
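The Cressman weighting just described can be sketched as follows. The function and its arguments are illustrative (not the operational VERA remapping code); the classical Cressman weight for a source point at distance r within radius R is w = (R² − r²)/(R² + r²):

```python
import numpy as np

def cressman_remap(src_pts, src_vals, tgt_pts, radius):
    """Remap scattered source values to target points using Cressman (1959)
    weights w = (R^2 - r^2) / (R^2 + r^2) for points with r < R."""
    out = np.full(len(tgt_pts), np.nan)
    for i, p in enumerate(tgt_pts):
        r2 = np.sum((src_pts - p) ** 2, axis=1)   # squared distances
        m = r2 < radius ** 2                      # points inside the radius
        if m.any():
            w = (radius ** 2 - r2[m]) / (radius ** 2 + r2[m])
            out[i] = np.sum(w * src_vals[m]) / np.sum(w)
    return out

# Four source grid points (km coordinates) carrying a constant field,
# remapped to one target point at the cell center.
src = np.array([[0.0, 0.0], [8.0, 0.0], [0.0, 8.0], [8.0, 8.0]])
vals = np.array([2.0, 2.0, 2.0, 2.0])
tgt = np.array([[4.0, 4.0]])
print(cressman_remap(src, vals, tgt, radius=7.0))  # constant field is preserved
```

In practice the radius would be adapted to the source grid spacing, as in the 4-km radius used for the 2.2-km model fields.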
The VERA scheme (Steinacker et al. 2000) has been developed to provide the best possible model-independent analysis fields in complex terrain. Sparsely and irregularly distributed observations are interpolated to a regular grid using a thin-plate spline algorithm. Upstream of the analysis, VERA applies a comprehensive data quality control scheme in order to exclude erroneous data from the analysis procedure (Steinacker et al. 2011). The analysis scheme does not make use of any NWP-model information as background fields, which makes it very suitable for model verification purposes.
To downscale the analysis beyond the resolution afforded by the observation network, the so-called fingerprint method was developed (Steinacker et al. 2006; Bica et al. 2007). The irregular spacing and sparse density of station observations with respect to topography (i.e., in valleys and basins, on mountaintops, on passes and slopes) may result in a rather rough analysis field. On the one hand, conventional analysis systems cannot sufficiently resolve small-scale structures caused by topography, which are therefore treated as noise and smoothed out. On the other hand, mountainous topography can produce small-scale structures of considerable amplitude. Two physical processes can be identified as causes of the modification of the atmosphere in complex terrain: thermal effects due to differential heating or cooling of the atmosphere over mountains (e.g., a thermal high or thermal low over the Alps; Bica et al. 2007) and dynamical effects (e.g., blocking and lee-side effects). These features can be modeled far below the scales resolved by the observation network, provided that a very high-resolution topographic dataset is available. The modeled thermal and dynamic fingerprints are used to downscale the observation data locally by a least squares method, with weighting factors computed individually for every analysis. These fingerprints have much in common with empirical orthogonal functions (EOFs), but they are determined physically rather than statistically.
VERA is a two-dimensional analysis system for surface parameters. Hourly analysis fields have been produced for mean sea level pressure, surface potential and equivalent potential temperature, near-surface wind, and accumulated precipitation as default parameters. Surface mixing ratio and moisture flux divergence are computed from the default output in a postprocessing step. Three fingerprints have been implemented: one thermal and two dynamic fingerprints (one for east–west and one for north–south flow patterns), which enhance the information for pressure- and temperature-related parameters in data-sparse regions. Although the wind and precipitation analyses are not supported by fingerprints, Dorninger et al. (2008) showed that the quality of the analysis is not diminished as long as there is sufficient coverage of observation stations.
The ensemble of VERA is generated using the observation error ensemble approach of Gorgas and Dorninger (2012a). In this approach error estimates at observation locations, derived as residuals from VERA’s quality control scheme (Steinacker et al. 2011), are perturbed assuming a Gaussian distribution and added to the observations, which are then reanalyzed using scale-dependent weightings of the perturbations. For details, refer to Gorgas and Dorninger (2012a).
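The perturbation step of this approach can be sketched as follows. This is only an illustration of the Gaussian perturbation of observations: it omits the scale-dependent weighting and the subsequent reanalysis, and the function name and data are ours, not the Gorgas and Dorninger (2012a) code:

```python
import numpy as np

def perturb_observations(obs, residuals, n_members=50, seed=1):
    """Generate an observation-error ensemble: quality-control residuals are
    treated as Gaussian error standard deviations and added as random
    perturbations to the station observations (one row per member)."""
    rng = np.random.default_rng(seed)
    sigma = np.abs(np.asarray(residuals, dtype=float))
    return obs + rng.normal(0.0, sigma, size=(n_members, obs.size))

# Two hypothetical station temperatures (K) with QC-derived error estimates.
obs = np.array([280.0, 285.0])
resid = np.array([0.5, 1.0])
ens = perturb_observations(obs, resid, n_members=2000)
print(ens.shape)  # (2000, 2)
```

Each member of the perturbed observation set would then be reanalyzed with VERA to yield one member of the analysis ensemble.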
VERA is applicable to arbitrary grid resolutions and analysis domains. For MesoVICT it covers the larger D-PHASE domain with horizontal grid resolution of 8 km. The domain of the VERA ensemble extends only over the larger Alpine region for computational reasons (Fig. 3).
For verification methods that make use of the observations at sites rather than on a grid, data from a very dense observation network in and around the Alps are provided for the MesoVICT community. In a joint activity of MAP D-PHASE and COPS, a unified D-PHASE–COPS (JDC) dataset of surface observations over central Europe was established (Dorninger et al. 2009; Gorgas et al. 2009). These products include data provided via the Global Telecommunication System (GTS) of the World Meteorological Organization (WMO) as well as from other networks for the whole of 2007, including the COPS measurement period (June–August 2007) and the DOP of D-PHASE (June–November 2007). A list of all data providers is given in Table 3 (Dorninger et al. 2013). The Department of Meteorology and Geophysics of the University of Vienna took over the responsibility for this data-collection activity.
The JDC dataset consisted of reports from more than 12,000 stations over Europe, corresponding to a mean station distance of approximately 16 km. Not all stations measure all parameters; an overview of all data included in the JDC dataset can be found in Table 4 (Dorninger et al. 2013). Some of the station networks measured precipitation over different accumulation periods, and the accumulation periods of non-GTS precipitation data varied among weather services. To create a homogeneous dataset with the highest possible station density for precipitation, accumulation periods shorter than 1 h were summed to 1-, 3-, 6-, 12-, and 24-h totals.
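The summation of sub-hourly gauge reports to common accumulation periods can be expressed compactly with pandas; the station values below are synthetic and purely illustrative:

```python
import numpy as np
import pandas as pd

# One day of hypothetical 10-min rain-gauge amounts (0.5 mm per report).
rain = pd.Series(
    np.full(144, 0.5),
    index=pd.date_range("2007-06-20", periods=144, freq="10min"),
)

hourly = rain.resample("1h").sum()    # 1-h accumulations: 3.0 mm each
daily = rain.resample("24h").sum()    # 24-h accumulation: 72.0 mm
print(hourly.iloc[0], daily.iloc[0])  # 3.0 72.0
```

The same resampling applied across all stations yields the homogeneous 1-, 3-, 6-, 12-, and 24-h precipitation dataset.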
MesoVICT case and example application.
A set of six synoptic cases has been selected, covering a wide range of meteorological phenomena in and around the Alps, for example, widespread convective events, organized convection along squall lines, cyclogenesis with heavy precipitation leading to severe flooding, and cold front interactions with the Alpine barrier (Table 5). For a detailed description of the synoptic situation in the different cases, the reader is referred to Dorninger et al. (2013).
In addition to these NWP case studies, a new set of idealized synthetic cases is proposed within MesoVICT (available at www.ral.ucar.edu/projects/icp). Synthetic cases primarily aim to represent simplified and individual forecast errors (e.g., a displacement or an extent error) and in the first ICP proved to be very informative on the basic diagnostic capabilities of the spatial verification methods.
Only a few studies have addressed some of the scientific questions listed in the “Objectives” subsection, and MesoVICT is the first coordinated effort to answer them for several verification methods. Therefore, MesoVICT participants are asked to begin their studies by running the core case (Fig. 2). The core case (20–22 June 2007) is characterized by strong convective events caused by unstable warm and moist air masses advected into the Alpine region on 20 June 2007. The following day, intense convection occurred again ahead of a cold front with strong westerly winds. The resulting spotty rain field for a 12-h accumulation period is shown in Fig. 4a, together with the cold front approaching from the northwest, depicted in terms of equivalent potential temperature, in Fig. 4b.
In the following we present a simple example to show how the scientific questions can be addressed. We use the mean absolute error (MAE) and continuous ranked probability score (CRPS) as verification measures and as comparators for the spatial verification methods that will be evaluated in the MesoVICT project.
Figure 5 shows the spatial distribution of MAE for precipitation for the whole period of the core case. Large spatial variability is evident, partly related to the shape of the Alpine barrier (over Austria and parts of Switzerland). The time series of the MAE and CRPS are presented in Fig. 6. They show a pronounced maximum at around 0300 UTC 21 June 2007, when the cold front and its associated rainband (not shown) impinges on the Alps.
What is the ability of the method to verify forecasts of variables other than precipitation forecasts (e.g., wind)?
In theory the MAE and CRPS can be applied to any scalar variable, though for quantities that vary by orders of magnitude, it is probably wise to transform the variable before using it, for example, precipitation or cloud-base height.
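As a minimal illustration of such a transformation (a sketch only; the variable, its distribution, and the choice of transform here are hypothetical stand-ins), a log1p transform can be applied before computing the MAE so that a few extreme amounts do not dominate the score:

```python
import numpy as np

rng = np.random.default_rng(3)
# Stand-in heavy-tailed "precipitation" samples; not MesoVICT data.
precip_fcst = rng.gamma(0.5, 8.0, size=1000)
precip_obs = rng.gamma(0.5, 8.0, size=1000)

# MAE on the raw values is dominated by the largest amounts; a log1p
# transform (one common variance-stabilizing choice) damps that influence.
mae_raw = np.mean(np.abs(precip_fcst - precip_obs))
mae_log = np.mean(np.abs(np.log1p(precip_fcst) - np.log1p(precip_obs)))
```

Because log1p compresses large values, the transformed MAE is not directly comparable to the raw MAE and should be reported on its own scale.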
How can the method be adapted to evaluate ensemble forecasts?
Adapting the MAE for ensemble forecasts is rather straightforward, since the CRPS reduces to the MAE in the limiting (deterministic) case (Hersbach 2000).
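This limiting behavior can be checked numerically. The sketch below (with hypothetical values) uses the energy-score form of the empirical CRPS; for a one-member "ensemble" the pairwise term vanishes and the CRPS equals the absolute error:

```python
import numpy as np

def crps_ensemble(members, obs):
    """Empirical CRPS of an ensemble against a scalar observation,
    energy-score form: CRPS = E|X - y| - 0.5 * E|X - X'|."""
    members = np.asarray(members, dtype=float)
    term1 = np.mean(np.abs(members - obs))
    term2 = 0.5 * np.mean(np.abs(members[:, None] - members[None, :]))
    return term1 - term2

# Deterministic limit: a single member makes the second term zero,
# so the CRPS is just the absolute error (the per-case MAE contribution).
print(crps_ensemble([2.5], 3.2))       # |2.5 - 3.2| = 0.7 (up to floating point)
print(crps_ensemble([2.0, 4.0], 3.0))  # 1.0 - 0.5 = 0.5
```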
Does the method show unusual behavior in complex terrain, and how should results be interpreted given the challenges of forecasting in complex terrain?
A key question for spatial verification methods is whether and how a method can account for meteorological phenomena tied to mountain ranges (e.g., precipitation rates increasing with altitude, windward/leeward and foehn effects, cold-air pooling in valleys and basins, and valley wind systems versus synoptic winds at higher altitudes). The behavior of the MAE and CRPS in complex terrain depends on how well the model represents the actual orography. For surface variables, deviations of the model orography from the real orography may lead to larger errors in complex terrain, especially for variables such as temperature. Precipitation and/or cloud may also be displaced relative to the underlying orography. Model resolution determines the level of detail in the model orography: coarser models have smoother orography, with potentially larger deviations between actual and model heights. Additionally, the domain-average statistics often required by spatial verification methods may mix data over flat and complex terrain, diluting the signal; this must also be taken into account.
What is the sensitivity of existing spatial verification methods to their own specific tuning parameters, the domain size, interpolation, and regridding?
The MAE and CRPS are not spatial methods per se, but they can be applied on a grid-square-by-grid-square basis to create spatial maps of forecast accuracy (Fig. 5). They could also be applied to upscaled or smoothed fields to assess the sensitivity of the results to interpolation, regridding, and displacement.
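The gridpoint-wise use can be sketched as follows (shapes and data are synthetic placeholders, not MesoVICT fields):

```python
import numpy as np

# Synthetic stand-ins: ntime forecast/analysis pairs on a common ny x nx grid.
rng = np.random.default_rng(0)
ntime, ny, nx = 8, 20, 30
forecast = rng.gamma(2.0, 1.0, size=(ntime, ny, nx))
analysis = rng.gamma(2.0, 1.0, size=(ntime, ny, nx))

# Averaging |error| over time, separately at each grid square, yields a
# spatial map of accuracy (cf. Fig. 5) rather than a single summary score.
mae_map = np.mean(np.abs(forecast - analysis), axis=0)  # shape (ny, nx)

# A conventional domain-average MAE is recovered by averaging the map.
domain_mae = float(mae_map.mean())
```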
Can the method be used fairly to compare the performance of high-resolution and coarser-resolution forecasts?
If the MAE and/or CRPS are applied to forecasts originating on different grids, it is recommended to interpolate both to a common high-resolution grid (but no finer than the real resolution of the analysis grid) so that no information is lost from either forecast. It is also recommended to interpolate both forecasts to station locations (or to use the nearest grid point for precipitation) and verify against station observations. The results can be plotted on maps to examine their spatial structure. Ideally, one could track the information content on different scales and compare directly, scale by scale (an approach used by many spatial verification methods).
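The station-based option can be sketched as a nearest-grid-point match (the grid, station coordinates, and values below are all hypothetical):

```python
import numpy as np

# Hypothetical regular lat/lon grid carrying one forecast field.
lats = np.linspace(45.0, 49.0, 40)
lons = np.linspace(8.0, 16.0, 80)
field = np.random.default_rng(2).gamma(2.0, 1.0, size=(40, 80))

# Hypothetical stations: (lat, lon, observed precipitation).
stations = [(47.1, 9.5, 1.8), (46.4, 13.2, 0.0), (48.2, 11.6, 4.1)]

# Nearest grid point (as suggested for precipitation) rather than
# bilinear interpolation, which would smear sharp precipitation edges.
errors = []
for slat, slon, obs in stations:
    i = int(np.abs(lats - slat).argmin())  # nearest grid row
    j = int(np.abs(lons - slon).argmin())  # nearest grid column
    errors.append(abs(field[i, j] - obs))
station_mae = float(np.mean(errors))
```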
Can the method account, or be adapted to account, for analysis/observation uncertainty?
The MAE and CRPS can be adapted. Figures 7 and 8 show the uncertainty of the CRPS for wind speed forecasts at the grid point corresponding to Vienna. The CRPS has been calculated against each individual member of the VERA analysis ensemble (Gorgas and Dorninger 2012a). Figure 7 shows the uncertainty of the analysis, indicated by the spread of the ensemble members, which translates into substantial differences in the CRPS. The time series in Fig. 8 for the MesoVICT core case shows large variability in this CRPS uncertainty. The variation may be connected to the diurnal cycle of the wind speed (lower wind speeds at night resulting in lower variance of the CRPS) and/or to the high variance of the wind observations during the frontal passage (approximately 1200–1500 UTC 21 June 2007 for Vienna).
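The procedure can be sketched as follows (all numbers are synthetic stand-ins for the NWP and VERA ensembles): recompute the CRPS of the same forecast ensemble against every analysis member, and treat the spread of the resulting scores as the analysis-uncertainty component of the verification.

```python
import numpy as np

def crps_ensemble(members, obs):
    # Energy-score form of the empirical CRPS: E|X - y| - 0.5 * E|X - X'|.
    members = np.asarray(members, dtype=float)
    t1 = np.mean(np.abs(members - obs))
    t2 = 0.5 * np.mean(np.abs(members[:, None] - members[None, :]))
    return t1 - t2

rng = np.random.default_rng(1)
fcst_members = rng.normal(5.0, 1.5, size=16)      # stand-in wind-speed ensemble
analysis_members = rng.normal(4.5, 0.8, size=50)  # stand-in analysis ensemble

# One CRPS value per analysis member; the spread of these scores expresses
# how much the verification result itself depends on the uncertain analysis.
crps_per_analysis = np.array(
    [crps_ensemble(fcst_members, a) for a in analysis_members])
crps_mean = crps_per_analysis.mean()
crps_spread = crps_per_analysis.std()
```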
SUMMARY, RECENT MesoVICT ACTIVITY, AND NEXT STEPS.
The initial spatial forecast verification intercomparison (ICP) was initiated in response to the rapid development of new methods for verifying high-resolution forecasts. The ICP clarified much about how the newly proposed methods described aspects of forecast error, which ones provided diagnostic guidance (and what sort), and which methods might yield similar kinds of information. However, the ICP cases focused only on verifying deterministic forecasts of precipitation over the central United States with characteristically flat terrain. A great deal remains to be learned about the information content and behavior of spatial verification methods in other forecast contexts.
This second phase of the project, the Mesoscale Verification Intercomparison over Complex Terrain (MesoVICT), was initiated in 2014 with the aim of advancing the knowledge of the various methods to determine how well they inform about forecast performance over complex terrain, for additional variables (e.g., wind and temperature), in the presence of modeling uncertainty (represented by ensemble NWP) and observation and analysis uncertainty (represented by an ensemble of analyses). Point observations have also been provided as verification data. Instead of single snapshots in time, MesoVICT cases evolve over a few days, and hence are more realistic for a forecaster but also more complicated in terms of analyzing model performance. To help ensure that all methods are analyzed on the same set of cases, a core case has been identified that all participants are expected to analyze. Beyond this core case, a tiered evaluation framework is provided to allow for more advanced studies to be conducted within the project.
The project started with an initial planning meeting in September 2013 at the 13th European Meteorological Society (EMS) and 11th European Conference on Applications of Meteorology (ECAM) annual conference in Reading, United Kingdom. This was followed by a kickoff meeting (First MesoVICT Workshop) held in Vienna in October 2014. Since then, MesoVICT meetings and presentations have taken place at other conferences on several occasions, including at each EMS annual meeting. A full list is given in Table 6, along with current plans for future meetings. Some first investigations using MesoVICT data have already been published (Geiß 2015; Gilleland 2017; Kloiber 2017; Skok and Hladnik 2018), and more are expected within the next few years. A closing meeting is planned for 2020, and the results of the MesoVICT project will be reported in an American Meteorological Society special collection in Monthly Weather Review and Weather and Forecasting.
MesoVICT is organized by the World Meteorological Organization (WMO) Joint Working Group on Forecast Verification Research (JWGFVR) and is an activity of the World Weather Research Programme (WWRP) High Impact Weather (HIWeather) Evaluation task team. NCAR is sponsored by the U.S. National Science Foundation. Additional support for Brown and Gilleland was provided from the National Science Foundation (NSF) through Earth System Modeling (EaSM) Grant AGS-1243030. Work at NCAR was also supported in part by the Air Force 557th Weather Wing. Simon Kloiber (University of Vienna) provided Figs. 7 and 8.