1. Introduction
Clouds vary rapidly in height, thickness, and coverage over time. Since no single observation characterizes the full cloud volume, verification of numerical weather prediction (NWP) cloud forecasts is challenging. Passive geostationary satellite observations offer superior spatial and temporal coverage but are limited by their top-down nature. Active sensors, such as CloudSat, offer volumetric coverage but are constrained to a horizontal data "curtain." Ground-based observations often undersample boundary layer clouds (Alexander and Protat 2018; McErlich et al. 2021) and are constrained by their pointwise nature. Furthermore, NWP cloud forecast variables, such as cloud liquid, cloud ice, and cloud type, are not directly or routinely observed. To compare model data to satellite observations in radiance or brightness temperature space, we can utilize satellite simulators, such as the Community Radiative Transfer Model (CRTM) (Johnson et al. 2023), Radiative Transfer for Television InfraRed Observation Satellite (TIROS) Operational Vertical Sounder (TOVS) (RTTOV) (Saunders et al. 2020), or the Cloud Feedback Model Intercomparison Project (CFMIP) Observation Simulator Package (COSP) (Bodas-Salcedo et al. 2011). Radiance verification requires a careful correspondence between the radiances and the model's geophysical quantities, as highlighted by Zhang et al. (2010) and Bani Shahabadi et al. (2016). While radiance-based comparisons offer a consistent means of evaluating the model forecasts, satellite retrievals, such as cloud top height and cloud base height, provide valuable insight for identifying systematic errors associated with specific cloud regimes such as low, mid, and high clouds.
Traditional verification metrics based on point-to-point comparisons, such as probability of detection (POD), bias and root-mean-square error (RMSE), and equitable threat score (ETS), offer valuable insights into forecast performance (Roebber 2009). However, none of these methods account for near misses. For instance, in a partly cloudy situation where the predicted clouds do not fully overlap the observations, a pointwise score cannot communicate that the forecast coverage was generally correct despite the lack of overlap. High-resolution forecasts may receive low scores when evaluated with traditional metrics, which provide no information on displacement error (Davis et al. 2009). Feature-based verification, on the other hand, treats predicted and observed clouds as distinct features with identifying attributes such as area, location, and intensity. This approach provides a more comprehensive understanding of spatial forecast quality, thereby contributing to model improvement. For instance, Davis et al. (2009) demonstrated that the Weather Research and Forecasting (WRF) Nonhydrostatic Mesoscale Model (NMM) exhibited a tendency to overpredict the number and extent of rain area objects. These errors could be attributed to excessive near-gridscale variability if numerical dissipation is not properly tuned (Skamarock 2004). Such aliasing of near-gridscale energy in the lower troposphere may lead to the excessive growth of precipitation systems.
Several object-based methods are available for assessing model performance, including the contiguous rain area method (Ebert and McBride 2000), the composite method (Nachamkin 2004), and multiscale object verification (Lack et al. 2010). Here we focus on applying the MET Method for Object-Based Diagnostic Evaluation (MODE; Davis et al. 2006, 2009) to evaluate cloud forecasts. MODE facilitates quantitative, feature-based spatial verification between observed and forecast fields. However, obtaining reliable, high-quality collocated cloud forecasts and observations is challenging, primarily due to the inherently ill-defined nature of clouds. MODE requires consistent, gridded fields to define the verification objects. In this work, we utilized satellite-retrieved GOES-16 cloud masks and collocated COAMPS-derived forecast masks. These masks depict the presence of specific cloud types based on analyzed thermodynamic fields, retrieved satellite cloud properties, and COAMPS-predicted thermodynamics and microphysics. The masks are continuous gridded binary fields, making them well suited for use in the MODE verification process.
The study is organized as follows: section 2 describes the datasets and classification methods employed for the different cloud masks; section 3 presents the verification metrics, most of which were generated using the MET package; and section 4 presents sensitivity tests that assess the effect of MODE's configuration on the statistical outputs. Section 5 provides the primary verification of both stable and unstable cloud masks, including detailed results at 1800 UTC (6-h lead time) as well as for the 3-, 6-, 9-, and 12-h lead time forecasts. The final section offers concluding remarks, summarizing the verification outcomes and discussing how community tools can be effectively leveraged.
2. Materials
a. Cloud mask datasets
Forecast data originate from COAMPS high-resolution forecasts over the United States Mid-Atlantic region for the 2-yr period from 1 January 2018 to 31 December 2019 (Nachamkin et al. 2022). Although the COAMPS setup consists of three telescoped domains (45-, 15-, and 5-km horizontal resolution, respectively), only the data from the 5-km domain over the Norfolk, Virginia, region are used to derive the cloud masks (Fig. 1). Norfolk was selected as it is a major hub for Navy operations. COAMPS forecasts are initialized daily at 0000, 0600, 1200, and 1800 UTC out to a 12-h lead time. This study specifically focuses on the 1200 UTC initialization to leverage the increased accuracy of the GOES retrievals during daylight hours. For each cycle, the Naval Research Laboratory's Atmospheric Variational Data Assimilation System (NAVDAS) (Daley and Barker 2001) generated the initial conditions, and the Navy Global Environmental Model (NAVGEM) provided the boundary conditions. Explicit microphysics (Rutledge and Hobbs 1983, 1984) is used on all domains, where mixing ratios of cloud droplets, cloud ice, rain, snow, and graupel are predicted by a modified single-moment bulk scheme. Subgrid-scale convection is parameterized only on the 15- and 45-km domains. Further details can be found in Nachamkin et al. (2022) and Chen et al. (2003).
Fig. 1. Five different cloud masks classified by Nachamkin et al. (2022): lower-tropospheric stable and unstable clouds, midtropospheric clouds (labeled as "mid"), upper-tropospheric clouds (labeled as "high"), and deep precipitating clouds (labeled as "deep") for (left) COAMPS and (right) GOES-16 at 1800 UTC [forecast lead time (tau) = 6 h] 8 May 2018.
Satellite observations consisted of the GOES-16 Advanced Baseline Imager (ABI) 0.65-μm normalized reflectance (visible channel) as well as retrievals of cloud top height, cloud base height, liquid water path, and cloud top phase. Data originated from the Cooperative Institute for Research in the Atmosphere (CIRA), and the National Aeronautics and Space Administration (NASA) Langley Research Center (LARC) Clouds and the Earth’s Radiant Energy System (CERES) (Minnis et al. 2021). Additional mitigation to account for semitransparent ice-phase clouds and snow cover artifacts was performed as described by Nachamkin et al. (2022). Further, 3-h precipitation accumulation from the Integrated Multi-satellitE Retrievals for GPM (IMERG) Level-3, version 6B data (Huffman et al. 2019) was used to locate deep precipitating clouds.
Figure 1 illustrates the five distinct cloud regimes identified for this study based on their physical characteristics: lower-tropospheric stable and unstable clouds, midtropospheric clouds, upper-tropospheric clouds, and deep precipitating clouds. Clouds at different levels of the troposphere, and in different environments, may interact with various components of the numerical weather model in different ways. For instance, the physics and structure of low-level clouds in unstable environments may respond strongly to the model turbulence, land surface, and microphysics schemes, whereas upper-level clouds may interact more strongly with the radiation scheme, microphysics, advection, and cumulus parameterization. It is important to note that clouds at all tropospheric levels in all environments have some interaction with and response to all of these components (e.g., turbulence does affect high clouds, and radiation does affect low clouds in unstable environments), though the relative importance of each of these effects may differ. Binary cloud mask data were derived from these five cloud regimes (Fig. 1). Both the forecast and observed cloud masks were interpolated to a common 5-km grid at 3-h intervals during the local daytime (from 1200 to 0000 UTC every 3 h).
b. Brief review of cloud classification
Clouds are classified based on multiple observed and analyzed variables such as cloud top height, cloud base height, precipitation, and boundary layer buoyancy. In this work, we focus specifically on the stable and unstable clouds as described by Nachamkin et al. (2022). A brief description of the algorithm is provided in this section; for full details, see Nachamkin et al. (2022).
Stable and unstable clouds were identified from lower-tropospheric clouds with cloud base heights below 6 km above ground level (AGL) for unstable clouds and below 4 km AGL for stable clouds. The GOES-retrieved clouds were paired with the nearest (in time) COAMPS analyses and short-term forecasts to estimate the observed atmospheric thermodynamics. Note the model environments were only used to assist in classifying the cloud types and were not used to conduct any quantitative comparison of cloud thermodynamic or microphysical properties. COAMPS forecast clouds were paired directly with the corresponding thermodynamic variables valid at the time of the forecast. Boundary layer stability was determined from the boundary layer lapse rates along with a series of indices including the lifted condensation level (LCL), the convective condensation level (CCL), and the estimated inversion strength (EIS). These variables were combined to produce weighted scores that described the lower-tropospheric stability. Cloud type masks were in turn derived based on empirical thresholds applied to the stable and unstable scores. Thresholds were chosen by comparing satellite observations, balloon soundings, surface observations, and precipitation maps with the scores to identify the most consistent match. Although the exact threshold selection process was subjective, the masks served the purpose of identifying general cloud families that are determined by specific atmospheric processes. For example, fair weather cumulus and marine stratocumulus clouds are well differentiated by this scheme. Since stability and cloud types follow a continuum, the masks were allowed to overlap (Fig. 1).
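Purely as an illustration of this kind of weighted-score classification, a minimal sketch is given below. The predictors, weights, thresholds, and the classify_low_cloud helper are hypothetical placeholders and are not the tuned values or code of Nachamkin et al. (2022).

```python
def classify_low_cloud(cloud_base_km, lapse_rate_score, lcl_score, ccl_score, eis_score,
                       unstable_thresh=0.6, stable_thresh=0.6):
    """Illustrative stable/unstable low-cloud classification.

    Each *_score argument is assumed to be a normalized (0-1) index derived from the
    boundary layer lapse rate, LCL, CCL, or EIS. The weights and thresholds below are
    hypothetical and are not those of Nachamkin et al. (2022).
    """
    # Hypothetical weighted combinations describing lower-tropospheric (in)stability
    unstable_score = 0.4 * lapse_rate_score + 0.3 * lcl_score + 0.3 * ccl_score
    stable_score = 0.5 * eis_score + 0.5 * (1.0 - lapse_rate_score)

    labels = []
    if cloud_base_km < 6.0 and unstable_score >= unstable_thresh:
        labels.append("unstable")
    if cloud_base_km < 4.0 and stable_score >= stable_thresh:
        labels.append("stable")   # masks may overlap, as in Fig. 1
    return labels
```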
3. Verification methods
a. Traditional measures
Evaluation of binary model forecasts and observations can be performed at the gridpoint-to-gridpoint scale using the standard 2 × 2 contingency table (Table 1).
Standard 2 × 2 contingency table. The nij values represent the counts of grid points in each forecast–observation category, where i represents the forecast and j represents the observations.
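To make the pointwise scoring concrete, the following minimal sketch (not the MET implementation) shows how the counts in Table 1 translate into the scores used later in this study (POD, false alarm ratio, frequency bias, and ETS), assuming binary forecast and observation masks on a common grid and nondegenerate counts.

```python
import numpy as np

def contingency_scores(fcst, obs):
    """Pointwise scores from binary forecast/observation masks (1 = cloud, 0 = clear).

    Returns POD, false alarm ratio (FAR), frequency bias, and ETS computed from the
    standard 2 x 2 contingency table (Table 1). Degenerate cases (e.g., no observed
    events) are not handled in this sketch.
    """
    fcst = np.asarray(fcst, dtype=bool)
    obs = np.asarray(obs, dtype=bool)

    hits = np.sum(fcst & obs)            # forecast yes, observed yes
    false_alarms = np.sum(fcst & ~obs)   # forecast yes, observed no
    misses = np.sum(~fcst & obs)         # forecast no, observed yes
    corr_neg = np.sum(~fcst & ~obs)      # forecast no, observed no
    total = hits + false_alarms + misses + corr_neg

    pod = hits / (hits + misses)
    far = false_alarms / (hits + false_alarms)
    freq_bias = (hits + false_alarms) / (hits + misses)

    # Hits expected by chance, used by the equitable threat score (ETS)
    hits_random = (hits + false_alarms) * (hits + misses) / total
    ets = (hits - hits_random) / (hits + false_alarms + misses - hits_random)
    return pod, far, freq_bias, ets
```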
b. Spatial metrics
c. Object-oriented verification
Additional spatial verification can be performed on specific features (objects) following the methodology defined by MODE (Davis et al. 2006, 2009). This verification process involves the following steps:
- Identification of forecast and observed objects. This is achieved by convolving or smoothing the raw fields and applying a threshold to the convolved fields. The raw cloud fields are incorporated into the objects for subsequent verification.
- Computation of single object attributes. Various attributes of individual objects are calculated, including area and centroid.
- Grouping of objects within the same field. Objects sharing the same field are grouped, and those with similar geometric shapes are merged using a specific merging threshold.
- Matching objects from different fields. All possible pairs of objects are compared to determine which objects should be matched. The merging and matching process utilizes a fuzzy logic algorithm (Davis et al. 2006, 2009). The algorithm quantifies the similarity of the objects using a weighted average of object attributes known as "total interest." For the jth object pair, the total interest Ij is calculated as
\[
I_j = \frac{\sum_{i=1}^{S} w_i c_i f_{ij}}{\sum_{i=1}^{S} w_i c_i}, \qquad (9)
\]
where S represents the number of object attributes, and ci and wi denote the confidence and weight of each attribute, respectively. The term fij is the interest function for an attribute for a pair of objects. The interest function provides a degree of similarity for each attribute (e.g., centroid distance) with a value ranging from zero (representing no interest) to one (high interest). It is worth noting that the weights are user defined and may influence the outcome of the matching process.
Total interest measures the degree of similarity between two objects, represented by a fuzzy value ranging from 0.0 to 1.0. In this study, the attributes used include centroid distance (weight: 2, or 20%), minimum separation distance of object boundaries (weight: 4, or 40%), orientation angle difference (measured in degrees clockwise from the zonal direction; weight: 1, or 10%), area ratio (weight: 1, or 10%), and intersection area (weight: 2, or 20%). These weight values were taken from the default configuration of the MET release. The confidence parameter ci describes how well each attribute represents the forecast and is set to unity for all attributes considered, except for the orientation angle and centroid distance, where little confidence is assigned for objects with substantial distance separation and/or markedly different sizes (Davis et al. 2009; Johnson and Wang 2013). Objects are considered a match when the total interest exceeds a defined threshold.
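As an illustration of the fuzzy-logic matching, the following sketch evaluates Eq. (9) for a single object pair using the default attribute weights quoted above; the attribute interest values f_ij are hypothetical inputs that would normally come from MODE's interest functions.

```python
def total_interest(interest, confidence, weight):
    """Weighted-average total interest for one forecast-observation object pair [Eq. (9)].

    interest:   attribute interest values f_ij in [0, 1]
    confidence: attribute confidences c_i (unity unless down-weighted, see text)
    weight:     user-defined attribute weights w_i
    """
    num = sum(w * c * f for w, c, f in zip(weight, confidence, interest))
    den = sum(w * c for w, c in zip(weight, confidence))
    return num / den

# Hypothetical attribute interests for one object pair, ordered as:
# [centroid distance, boundary distance, orientation angle, area ratio, intersection area]
f_ij = [0.9, 0.8, 0.6, 0.7, 0.5]
c_i = [1.0, 1.0, 1.0, 1.0, 1.0]   # unit confidence assumed here for simplicity
w_i = [2.0, 4.0, 1.0, 1.0, 2.0]   # default MET weights quoted in the text

# Prints 0.73; the pair would be matched if the interest threshold is 0.7
print(total_interest(f_ij, c_i, w_i))
```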
The schematic in Fig. 2 illustrates that among three forecast objects and two observed objects, there are three matched pairs whose total interest exceeds the user-defined threshold (0.7 in this schematic). Forecast object 1 matches observed objects 1 and 2 (MODE allows for double matching), while forecast object 2 matches observed object 2. Consequently, observed objects 1 and 2 are considered a "cluster." Additionally, because one of the observed objects (object 2) also matches forecast object 2, forecast objects 1 and 2 are viewed as another cluster. Forecast object 3, on the other hand, has no matches in the observation, possibly due to its smaller area coverage compared to the other objects. It is worth noting that cluster labeling is contingent on the matches between forecast and observed objects. Thus, different forecast fields verified against the same observed fields may yield distinct clusters. Figure 2 does not demonstrate the underlying merging process, wherein objects exceeding a user-defined merging threshold are combined. A more detailed depiction of the impact of convolving and thresholding on features is provided in section 4.
Fig. 2. Schematic illustrating hypothetical cloud objects from the forecast (blue circles) and observation (red circles) with the corresponding interest matrix at right. Yellow shading indicates matched objects, whereas no shading denotes no match. Hypothetical total interest values greater than 0.7 are shown in bold in the matrix. Adapted from Davis et al. (2009).
Object-based verification scores, including frequency bias and ETS, along with object attributes such as area difference and centroid distance (Fig. 3), are computed for individual pairs of simple objects rather than for clusters. Note that the standard contingency table–based categories (Table 1) use grid points within identified objects, rather than the number of objects (Skinner et al. 2018), to facilitate comparison between the standard gridpoint statistics and the object-based approach. Additional metrics can be derived from these object attributes. For instance, the extent of overlap or nonoverlap between the forecast and observed objects can be quantified by computing the area where the two objects intersect, or the area where they do not (the symmetric difference), and normalizing by the total (union) area of the matched objects.
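For clarity, the derived overlap quantities illustrated in Fig. 3 can be computed from two binary object masks as sketched below; this is an illustrative calculation, not the MODE code, and the 25 km^2 cell area reflects the 5-km grid used here.

```python
import numpy as np

def overlap_metrics(fcst_obj, obs_obj, cell_area_km2=25.0):
    """Area-based attributes for a matched object pair from binary masks.

    cell_area_km2 is the grid-cell area (25 km^2 for a 5-km grid). Returns the
    intersection, union, and symmetric-difference areas (km^2) and the overlap
    ratio (intersection divided by union, as in Fig. 11).
    """
    f = np.asarray(fcst_obj, dtype=bool)
    o = np.asarray(obs_obj, dtype=bool)

    intersection = np.sum(f & o) * cell_area_km2   # area covered by both objects
    union = np.sum(f | o) * cell_area_km2          # area covered by either object
    sym_diff = union - intersection                # nonoverlapping area
    overlap_ratio = intersection / union if union > 0 else 0.0
    return intersection, union, sym_diff, overlap_ratio
```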
Fig. 3. Schematic to illustrate how MODE defines the object cluster attributes: symmetric difference, intersection area, union area, and centroid distance. Symmetric difference is the combined area of two matched objects that does not overlap, intersection area is the area where the two matched objects overlap, union area is the total area covered by either of the matched objects, and centroid distance is the distance between the two paired object centroids.
4. MODE configuration sensitivity test
MODE is highly configurable, which allows a user to tune the software for different purposes but also poses the challenge of selecting the parameters that best capture the features of interest. This section presents sensitivity tests of several MODE parameters to arrive at a suitable configuration for cloud verification.
Initial tests comparing the MODE default configuration to configurations with a few modified parameters showed that the output statistics (such as skill scores and object attributes) are very sensitive to the configuration options. Unstable clouds are more likely to be sensitive to configuration changes due to their smaller size, greater spatial variance, and convective nature; thus, the examples below are conducted for the unstable clouds.
There are many configurable parameters in the MODE software, which could warrant detailed examination. As a starting point, we evaluated the following parameters:
- Convolution radius (R). This parameter determines how many surrounding grid points are used to smooth the raw fields. With this parameter, MODE converts the raw binary fields g(x, y) into continuous fields C(x, y) by a simple circular filter function φ:
\[
C(x, y) = \sum_{(x_1, y_1)} \phi(x - x_1,\, y - y_1)\, g(x_1, y_1),
\]
where (x, y) and (x1, y1) are grid coordinates. The filter function φ is defined as
\[
\phi(x, y) =
\begin{cases}
H, & x^2 + y^2 \le R^2 \\
0, & \text{otherwise,}
\end{cases}
\]
where the influence radius R and height H satisfy πR²H = 1 so that φ integrates to unity.
- Convolution threshold (T). This parameter creates a binary mask field M(x, y) by applying a threshold T to the convolved continuous field:
\[
M(x, y) =
\begin{cases}
1, & C(x, y) \ge T \\
0, & \text{otherwise.}
\end{cases}
\]
The masked field sets the boundaries of the objects, within which the original raw fields are retrieved. As noted in Davis et al. (2009), a large convolution radius and a high convolution threshold will result in retaining large-scale features of interest for the subsequent verification. (The convolution and thresholding steps are sketched in code after this list.)
- Merging threshold. This is used to filter which simple objects are to be merged into the cluster objects.
- Interest threshold. This threshold determines if the simple objects or cluster objects between forecast and observations are a match. Once the total interest [Eq. (9)] for forecast and observation exceeds this threshold, objects are considered a match.
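To make the convolution and thresholding steps concrete, the following sketch approximates MODE's object identification with standard scientific Python tools (a circular filter of radius R, a threshold T, and a minimum-area filter). It is an illustrative approximation, not the MODE implementation; the function name identify_objects and its defaults are ours.

```python
import numpy as np
from scipy import ndimage

def identify_objects(raw_mask, radius=5, conv_thresh=0.5, min_area=50):
    """Approximate MODE-style object identification on a binary cloud mask.

    radius:      convolution radius R in grid points (circular filter)
    conv_thresh: threshold T applied to the convolved field
    min_area:    minimum object size in grid squares
    Returns a labeled object field and the number of objects.
    """
    # Circular filter; dividing by the sum is the discrete analog of pi * R^2 * H = 1
    y, x = np.ogrid[-radius:radius + 1, -radius:radius + 1]
    kernel = (x**2 + y**2 <= radius**2).astype(float)
    kernel /= kernel.sum()

    convolved = ndimage.convolve(raw_mask.astype(float), kernel, mode="constant")
    mask = convolved >= conv_thresh            # binary mask field from threshold T

    labels, nobj = ndimage.label(mask)         # connected regions become candidate objects
    for k in range(1, nobj + 1):               # drop objects smaller than min_area grid squares
        if np.sum(labels == k) < min_area:
            labels[labels == k] = 0
    labels, nobj = ndimage.label(labels > 0)   # relabel the remaining objects
    return labels, nobj
```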
Figure 4 shows the sensitivity of the predicted and observed unstable cloud mask objects to the convolution radius (the convolution threshold is set to 0.4). When the convolution radius is set to 0, the raw, unfiltered binary fields are used to set the boundaries of the "features." Due to the various shapes and discontinuities in the raw binary spatial distribution (the scattered nature of clouds), the MODE-defined object boundaries might be erroneous. For instance, the object identified over the ocean in the southeast corner of the domain has "holes" (separations) within the object (Fig. 4 for convolution radius = 0). This is likely caused by the raw discontinuous mask boundaries. As the convolution radius increases, more model grid points are used to smooth the raw fields, and the "holes" within the object disappear (Fig. 4 for convolution radii ≥ 5). However, larger convolution radii can dilate and oversmooth the raw model fields such that distant systems are merged together (e.g., Fig. 4 for a convolution radius of 20 grid points).
Fig. 4. Unstable mask at 1800 UTC 30 Dec 2019 for the original raw fields, the convolved fields with convolution radii (R) from 0 to 20 grid points, and the corresponding identified objects, respectively. The convolution threshold is set to 0.4. Line contours in the convolved fields represent the smallest closure that contains the respective objects. Lines are shown to delineate objects and are not used for the calculation of object attributes.
The varying convolution radii also affect the object attributes. Here we examine a few object attribute distributions by varying the convolution radius from 5 to 20 grid points, as well as with no convolution (convolution radius = 0) (Fig. 5). Area difference is the forecast object area minus the observed object area, signifying the object size bias. Centroid distance and intersection area are illustrated in Fig. 3, indicating location bias and overlapped area size. Varying the convolution radius affects all the examined object attribute distributions. Specifically, fewer objects are matched with increasing convolution radius since smaller objects are either filtered out or merged together. Larger objects generally retain their size, though smaller-scale appendages are removed or smoothed. After filtering, the remaining objects and clusters tend to represent larger-scale weather features, resulting in increased differences between the location, size, coverage, and overlap of the matched objects. Importantly, the equivalent meteorological scale represented by the objects can be adjusted significantly by adjusting the convolution and merging thresholds, as will be demonstrated below. While clouds are scattered and noisy in nature, we intend to maintain the physical representation of the meteorological features while filtering some of the noisy and less predictable features. Subjective interpretation reveals that smaller convolution radii filter out the smaller-scale cloud features that are not of interest for our verification, particularly in the observations, while maintaining the cloud features at spatial scales that the COAMPS model can adequately resolve. Differences in object attributes (e.g., area and centroid) then correspond to the performance of predicted cloud features on the order of the scale of the convolution radius. Large radii tend to overfilter the raw fields, thus creating larger differences in all object attributes (Fig. 5), and sometimes combine separate systems into larger entities (lower-right two systems in Fig. 4 for convolution radius = 20, and the forecast area distribution in Fig. 5). This leads us to choose a radius of 5 grid points (25 km), which reasonably represents the raw fields with small object attribute differences. As documented in Cai and Dumais (2015), a convolution radius of 2–8 times the grid spacing is sufficient to reveal convective storm objects in convection-allowing models. The purpose of this smoothing is to retain subjectively important cloud features while reducing the importance of features smaller than the effective resolution of the model.
Fig. 5. Boxplots of object attributes (forecast area size, area difference, intersection area, and centroid distance) for unstable clouds as a function of convolution radius (km) at 1800 UTC 30 Dec 2019. Each dot represents a simple object defined by MODE.
The next test holds the convolution radius constant at 5 grid points but varies the threshold for matching and merging objects. To capture some smaller features within the identified objects, the merging threshold was set to 0.1 less than the convolution threshold (e.g., if the convolution threshold is 0.5, the merging threshold is 0.4). The convolution and merging thresholds modify the convolved field to either emphasize larger-scale objects or fill in smaller-scale objects to account for near misses. In general, the object area decreases with increasing convolution threshold (Fig. 6). Thresholds below 0.5 act to fill existing objects and can even restore small-scale objects that were heavily filtered by the convolution. In the extreme case, a threshold of 0.0 restores a single-pixel object to a scale that is approximately twice that of the convolution radius. Conversely, thresholds greater than 0.5 select the core regions of the largest objects, removing smaller-scale features that were diluted by the filter.
Fig. 6. MODE-identified objects for unstable clouds using different convolution thresholds (T) at 1800 UTC 30 Dec 2019. The convolution radius was set to 25 km (5 grid points), and the merging threshold was set to 0.1 less than the convolution threshold.
Objects that are close together are affected drastically by increasing convolution and merging thresholds. We proceed to examine the number of paired grid points within each category of the contingency table. The smaller the convolution threshold, the more objects from both the forecast and observation are merged together. These larger objects tend to result in more overlap between the forecasts and observations ("hit" in Fig. 7a). Conversely, the larger the convolution threshold, the fewer objects are retained and merged, thus increasing the number of misses from both the forecast and observation ("miss" in Fig. 7a). Similar trends are observed for the other contingency table categories ("false alarm" and "correct negative"). This behavior is consistent with the contingency table–based statistics (Fig. 7b), where the ETS decreases and the false alarm ratio (FAR) increases with increasing convolution threshold. The highest skill scores, such as ETS and POD, achieved at a convolution threshold of 0.1, result from the more generous overlap between padded and merged objects. This padding occurs even in areas where cloud coverage is sparse in the raw fields (Fig. 6). Our intent is to ensure that the convolved fields faithfully represent the raw fields, minimizing any artificial smoothing or padding introduced by the convolution and thresholding process in MODE. Figure 7 reveals that the counts in the four categories of the contingency table at a convolution threshold of 0.5 (merging threshold of 0.4 in this case) are similar to those from the raw fields.
Fig. 7. (a) Number of forecast/observed points within the raw (unconvolved) and convolved field objects as a function of different convolution thresholds for unstable clouds at 1800 UTC 30 Dec 2019. The x axis represents different convolution thresholds. The convolution radius is set to 25 km (5 grid points). (b) Different skill scores computed using the number of points in (a).
The above tests were also performed for the other four cloud types, and the results were very similar (not shown). Given the above sensitivity tests, a convolution radius of 5 grid points and a convolution threshold of 0.5 with a matching and merging threshold of 0.4 are used for the rest of the verification. An additional examination of different interest thresholds was also carried out. We found that the final results are relatively insensitive to variations in the interest threshold when it is above 0.5 (figure not shown). The default interest threshold of 0.7 works well for all cloud types. In addition, objects covering fewer than 50 contiguous grid squares are removed (a configuration option within MODE), similar to Davis et al. (2006). This constraint prevents the identification of very small objects that are spatially far from other objects (for instance, the small-area object over eastern Maryland in Fig. 4).
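For reference, the configuration selected above corresponds to a call such as the following, reusing the illustrative identify_objects sketch from earlier in this section with hypothetical stand-in masks.

```python
import numpy as np

# Hypothetical binary masks standing in for the 5-km COAMPS and GOES grids
rng = np.random.default_rng(0)
fcst_mask = rng.random((200, 200)) < 0.3
obs_mask = rng.random((200, 200)) < 0.3

# Configuration adopted for the remainder of the verification:
# 25-km (5 grid point) convolution radius, 0.5 convolution threshold,
# and removal of objects smaller than 50 grid squares
# (the 0.4 matching/merging threshold applies to the MODE matching step itself).
fcst_labels, n_fcst = identify_objects(fcst_mask, radius=5, conv_thresh=0.5, min_area=50)
obs_labels, n_obs = identify_objects(obs_mask, radius=5, conv_thresh=0.5, min_area=50)
print(n_fcst, n_obs)
```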
5. Verification results
a. Verification for cloud mask at 1800 UTC
We start with the 6-h forecast dataset valid at 1800 UTC [local time 1300 eastern standard time (EST)], which has the most favorable sunlight. Object-based verification can generate the same traditional pointwise metrics, but computed using only the grid points within identified objects (section 3c). In general, cloud mask data residing within the MODE-identified objects exhibit pointwise statistics similar to the full-grid statistics calculated by Nachamkin et al. (2022) (Fig. 8).
Fig. 8. Scatterplots of daily frequency bias (FBias), frequency mean (FMean), and equitable threat score (ETS) for COAMPS (a),(b) unstable and (c),(d) stable clouds at 1800 UTC. The FBias, FMean, and ETS statistics are derived from grid points within MODE objects. Lines in each plot denote a second-order polynomial fit among the data points.
Specifically, unstable clouds (cumulus) exhibited little seasonality in the bias or mean coverage. The mean bias is close to 1.0, though some forecasts had values of up to 2.5 (Fig. 8a). The ETS exhibits greater seasonal variability (Fig. 8b), with higher values in the winter associated with higher detection probabilities and lower false alarm ratios (figure not shown). Nachamkin et al. (2022) noted that unstable clouds from cold-air outbreaks and cyclone warm sectors over this region in the wintertime account for cloud coverages that are comparable to those in the summertime. The strong forcing associated with these systems might explain the higher ETS in the wintertime. In this region, most wintertime unstable clouds form when cold air advects over the relatively warmer waters of the Great Lakes and Atlantic Ocean. The forcing for these clouds is geographically fixed and is modulated primarily by water–air temperature differences. Since the latent heat fluxes are often quite strong, even simple boundary layer and bulk microphysics schemes are able to simulate these clouds. Other wintertime unstable clouds form along narrow frontal bands associated with large-scale cyclones. Summertime convection is not as strongly forced and is often governed by micro- to mesoscale processes that are not well observed or resolved in COAMPS.
The mean stable cloud cover from the forecasts is slightly higher during the winter season than the summer season (Fig. 8c). The mean bias is generally less than 1, and the ETS for stable clouds is overall lower than that for unstable clouds. The ETS seasonality for stable clouds is less pronounced than for unstable clouds (Fig. 8d), though the ETS is still lower during the summer. The low summer ETS is correlated with decreased cloud coverage.
FSS values at different spatial scales are also analyzed. COAMPS unstable clouds exhibit a faster rate of FSS improvement with increasing spatial scale than stable clouds (Figs. 9 and 10). As noted in section 3b, the rate of FSS improvement corresponds to the spatial scale of the forecast error. Unstable clouds appear to be associated with smaller-scale errors than stable clouds and are more sensitive to the neighborhood size. Though the FSS is a relative score that depends on the reference MSEref at a given neighborhood size, it may be affected by extremely low cloud coverage. Figure 10 shows that the mean FSS for stable clouds on some summer days is consistently low despite increasing neighborhood size, which corresponds to days with sparse cloud coverage (Fig. 8c). This indicates that the FSS metric can be adversely affected on very clear days. Note that the FSS distribution presented here is slightly different from Nachamkin et al. (2022) due to an updated fix to the snow cover removal algorithm applied to the GOES data.
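For completeness, the FSS computation can be sketched as follows (after Roberts and Lean 2008): neighborhood fractions of the binary masks are compared through a skill score relative to the reference MSEref. This is an illustrative implementation, not the MET one; note how a near-zero MSEref on very clear days makes the score volatile, as discussed above.

```python
import numpy as np
from scipy import ndimage

def fss(fcst_mask, obs_mask, neighborhood):
    """Fractions skill score for binary masks at a square neighborhood size (grid points).

    FSS = 1 - MSE / MSE_ref, where MSE compares the neighborhood fractions of the
    forecast and observation and MSE_ref is the largest possible MSE for those
    fraction fields (Roberts and Lean 2008).
    """
    pf = ndimage.uniform_filter(fcst_mask.astype(float), size=neighborhood, mode="constant")
    po = ndimage.uniform_filter(obs_mask.astype(float), size=neighborhood, mode="constant")

    mse = np.mean((pf - po) ** 2)
    mse_ref = np.mean(pf ** 2) + np.mean(po ** 2)   # near zero on nearly clear days
    return 1.0 - mse / mse_ref if mse_ref > 0 else np.nan
```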
Fig. 9. FSS distribution (second-order polynomial fit to the daily FSS) at different spatial neighborhood scales (km; denoted by different colors) for unstable and stable clouds at 1800 UTC. Solid lines represent the two scales of FSS analyzed in Nachamkin et al. (2022).
Fig. 10. Mean FSS at different spatial scales (y axis) for (a) unstable and (b) stable clouds at 1800 UTC.
Further analysis of the cloud objects suggests that the average area of observed stable and unstable clouds in the summertime is about the same (Figs. 11a,d). In the summertime, a slight overforecast of unstable cloud object area is seen, while the opposite is true for stable clouds (Figs. 11b,e). Stable clouds also exhibit much larger underforecasts in coverage in the wintertime, which is consistent with the low fractional biases (Fig. 8c). Both stable and unstable clouds show similar overlap ratios with the observed objects regardless of season, though a generally higher overlap ratio is seen in the wintertime (Figs. 11c,f), consistent with the higher wintertime ETS (Figs. 8b,d).
Fig. 11. Observation object area size (km2), forecast-minus-observation object area difference (km2), and object area overlap percentage (%; intersection area divided by the union area) for (a)–(c) unstable and (d)–(f) stable clouds at 1800 UTC during the warm season (15 Apr–13 Oct, Julian days from 105 to 286) and cold season. White lines and diamonds in the box denote the median and mean, respectively.
b. Verification for cloud masks at all lead times
Both the forecast and observed cloud masks are available at 3-h intervals from the analysis time (tau = 0) out to the 12-h forecast lead time. A further verification for all of these lead times was performed. Figure 12 shows box-and-whisker plots of the frequency bias, ETS, and FSS for unstable and stable cloud objects at every 3-h lead time. Note these metrics are calculated using the grid points covered by the MODE-defined objects. An aggregated mean frequency bias and ETS is calculated from the whole data sample, which gives more weight to the days with larger cloud cover. For unstable clouds, the mean frequency bias is close to 1 at lead times of 3–9 h but is slightly higher at 12 h. Similarly, the ETS and FSS are highest at the 3-h lead time and gradually decrease with forecast lead time. These statistics may indicate that the forecast quality of unstable clouds decreases over time, or that the quality of the observed unstable clouds diminishes as the sun angle lowers. The sudden increase in the bias scores at 12 h suggests that the latter process plays a significant role. Wintertime unstable clouds in this region are dominated by lower-tropospheric stratocumulus over and downwind of the Great Lakes and Atlantic Ocean during cold-air outbreaks. The GOES-16 cloud mask retrievals sometimes undersample these low clouds at night due to small differences between cloud top and near-surface brightness temperatures. Either way, fewer matches between the forecast and the observations occurred, resulting in poorer performance as measured by these traditional statistical measures. The frequency bias for unstable clouds shows slightly higher variability than for stable clouds, particularly at the 12-h lead time. The negative bias for stable clouds is apparent in the aggregated frequency bias at all forecast lead times. This may lead to slightly smaller mean ETS and FSS values for stable clouds compared to the unstable clouds. However, the aggregated ETS for stable clouds is similar to that for the unstable clouds.
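To clarify the distinction between the aggregated scores and a mean of daily scores, the sketch below sums the daily contingency counts before computing the statistics, so days with larger cloud cover carry more weight; the daily_counts input is a hypothetical list of (hits, false alarms, misses) tuples.

```python
def aggregated_bias_ets(daily_counts, n_grid_points_per_day):
    """Aggregate frequency bias and ETS from daily (hits, false_alarms, misses) counts.

    Counts are summed across all days before the scores are computed, so days with
    larger cloud coverage contribute more than they would to a mean of daily scores.
    """
    hits = sum(h for h, f, m in daily_counts)
    fas = sum(f for h, f, m in daily_counts)
    misses = sum(m for h, f, m in daily_counts)
    total = n_grid_points_per_day * len(daily_counts)

    freq_bias = (hits + fas) / (hits + misses)
    hits_random = (hits + fas) * (hits + misses) / total
    ets = (hits - hits_random) / (hits + fas + misses - hits_random)
    return freq_bias, ets
```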
Fig. 12. MODE-verified frequency bias, equitable threat score (ETS), and fractions skill score (FSS) for the COAMPS (left) unstable and (right) stable clouds initialized at 1200 UTC out to 12-h lead time at 3-h intervals during 2018–19. Red lines on the frequency bias and ETS panels show the aggregated mean statistics. Black lines and diamonds in the box denote the median and mean, respectively. The number of available date samples is labeled at the bottom of the top panels.
We further diagnosed the centroid displacement of the matched objects between the forecast and observation (section 3b). The centroid difference between two paired objects is an indicator of "over the target" accuracy. A smoothing method called kernel density estimation (KDE; Peel and Wilson 2008) is used to illustrate the bivariate and univariate distributions of the east–west and north–south centroid displacements. KDE is analogous to a histogram but represents the data with a continuous density function (a Gaussian kernel in this case) in one or more dimensions, making it less sensitive to bin size and easier to interpret. Figure 13 shows that COAMPS generally places the clouds in the right location at all forecast lead times, with the mean of the univariate KDE of the centroid displacement (in either the east–west or north–south direction) being nearly zero. The model displays a larger spread of matched object centroid locations at longer forecast lead times, where the contour areas of the 50% and 90% joint KDEs expand slightly over time, indicating greater uncertainty in location at longer forecast lead times (Fig. 13). The model also tends to have slightly larger position errors in the east–west direction than in the north–south direction, illustrated by the larger east–west span of the KDEs of the centroid joint probability distribution. The COAMPS model shows similar positioning skill for unstable and stable clouds. No systematic position errors are found for either cloud type.
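For illustration, the joint and marginal KDEs of the centroid displacements can be generated with a Gaussian kernel as sketched below using scipy's gaussian_kde; the displacement arrays here are random stand-ins for the MODE output.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical east-west (dx) and north-south (dy) centroid displacements (grid points)
rng = np.random.default_rng(1)
dx = rng.normal(0.0, 8.0, size=500)
dy = rng.normal(0.0, 5.0, size=500)

# Bivariate Gaussian KDE of the joint displacement distribution
kde_2d = gaussian_kde(np.vstack([dx, dy]))

# Evaluate the density on a regular grid (e.g., to draw the 50%/90% contours of Fig. 13)
xx, yy = np.meshgrid(np.linspace(-30, 30, 121), np.linspace(-30, 30, 121))
density = kde_2d(np.vstack([xx.ravel(), yy.ravel()])).reshape(xx.shape)

# Univariate KDEs of the marginal east-west and north-south displacements
kde_dx = gaussian_kde(dx)
kde_dy = gaussian_kde(dy)
```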
Fig. 13. Scatterplot of forecast–observation object clusters' centroid displacement for unstable and stable cloud objects (orange dots) at forecast lead time (tau) 3, 6, 9, and 12 h, respectively. Contours overlaid on the scatter points are kernel density estimation (KDE) representing 50%, 90%, and 99% of the bivariate joint probability distribution. The corresponding contours at the top of the y axis and right of the x axis are the univariate KDEs. The number of matched objects is noted in the top-left corner of each panel.
The overall matched object area for the forecast unstable clouds shows a slight diurnal pattern, with the largest aggregate coverage occurring at the time of optimal sunlight (6-h lead time, 1300 EST local time), then gradually dissipating and shrinking in area toward sunset. Meanwhile, the area coverage for stable clouds is roughly steady up to 6 h and decreases by 9–12 h (Fig. 14). Though the mean forecast areas for unstable and stable cloud objects are similar, the variability of the forecast object area for unstable clouds is much larger (Figs. 14a,c). The area difference is generally small for unstable clouds during daylight (nearly zero for the median), although the mean object area difference tends to increase over time. COAMPS slightly underpredicts the mean area for unstable clouds at the 3-h lead time but overpredicts beyond 6 h, with the largest overprediction and uncertainty at 12 h (1900 EST local time). Conversely, COAMPS tends to underpredict the mean area for stable clouds during the early morning and early afternoon, which is consistent with the bias noted earlier (Figs. 8 and 12). The distribution of object areas for both unstable and stable clouds is highly skewed, characterized by a substantial gap between the mean and the median values. Overall, unstable clouds exhibit a significantly higher count of matched objects compared to stable clouds. An increase in the number of matched objects with forecast lead time for unstable clouds is particularly noticeable in the summertime (figure not shown).
Fig. 14. Box-and-whisker plots of the forecasted object area (km2) and forecast-minus-observation object area difference (km2) for (a),(b) unstable and (c),(d) stable clouds at forecast lead time 3, 6, 9, and 12 h during 2018–19. Black lines and diamonds in the box denote the median and mean, respectively. The number of matched object pairs is noted at the bottom of each panel.
The cloud fraction for both unstable and stable clouds (Fig. 15) exhibits a pattern similar to the object area (Figs. 14a,c). It is important to note that the total number of objects in either the forecast or the observation does not decrease over time (Figs. 15a,c). However, the number of matched objects for stable clouds decreases over time (Figs. 15b,d). This implies that the reduction in matched objects reflects a genuine decline in matching success rather than a decrease in the number of objects. The mean OTS for both unstable and stable clouds remains about 0.8–0.85 up to the 9-h lead time and decreases to 0.7–0.75 at 12 h (Figs. 15b,d). An OTS of 1.0 indicates a perfect forecast (section 3b). While a direct benchmark OTS for cloud mask object verification is not readily available, OTS values for regional storm-scale precipitation (Johnson and Wang 2013) and for cold cloud objects identified from satellite radiances (Griffin et al. 2021) are on the order of 0.3–0.5 and 0.5–0.85, respectively. This suggests that, in the context of object-oriented verification, COAMPS generally performs well in predicting low-level clouds.
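For reference, the OTS aggregates the total interest of the matched pairs weighted by object area. A sketch following the definition of Johnson and Wang (2013), with hypothetical inputs, is given below.

```python
def object_based_threat_score(matched_pairs, total_fcst_area, total_obs_area):
    """Object-based threat score (OTS) following Johnson and Wang (2013).

    matched_pairs: iterable of (interest, fcst_area, obs_area) for each matched pair,
    where interest is the pair's total interest [Eq. (9)] and the areas are in km^2.
    total_fcst_area / total_obs_area: summed areas of all forecast / observed objects,
    matched or not, so unmatched objects lower the score.
    """
    weighted = sum(interest * (a_f + a_o) for interest, a_f, a_o in matched_pairs)
    return weighted / (total_fcst_area + total_obs_area)
```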
Fig. 15. Box-and-whisker plots of the cloud fraction for all objects of forecast and observation and object-based threat score (OTS) for (a),(b) unstable and (c),(d) stable clouds at forecast lead time 3, 6, 9, and 12 h during 2018–19. Black lines and white diamonds in the box denote the median and mean, respectively. The number of total objects for forecast and observation is noted at the bottom of the box-and-whisker plot in the top panels, while the number of matched objects is annotated at the bottom of the y axis in the bottom panels.
6. Conclusions and discussion
This study focuses on verifying the cloud masks derived from the Navy’s regional model COAMPS against GOES-16 retrieved cloud masks over the United States Mid-Atlantic region during the period 2018–19. The verification process is conducted using the community tool MET. The cloud masks consist of five distinct cloud types: stable, unstable, midlevel clouds, high-level tropospheric clouds and deep precipitating clouds (Nachamkin et al. 2022). Object-based verification is carried out in detail, with a particular focus on COAMPS stable and unstable clouds.
Overall, COAMPS stable and unstable cloud objects observed during the optimal sunlight hour (1300 EST) tend to be larger during wintertime than summertime, resulting in greater differences in area and overlap ratio during the winter season. COAMPS represents unstable clouds well, with a mean frequency bias close to 1 at forecast lead times of 3–9 h but about 2 at the 12-h lead time. Stable clouds, on the other hand, are consistently underpredicted (with a frequency bias less than 1.0) across most forecast lead times. A decrease in skill, as indicated by either the ETS or FSS, is seen over time for both unstable and stable clouds. COAMPS generally positions the unstable and stable objects accurately, showing minimal systematic position errors. The object area for COAMPS unstable clouds exhibits a diurnal pattern in the daytime forecasts, reflecting the physical behavior of these cloud types. In contrast, the object area for COAMPS stable clouds displays steady coverage throughout the day. Over time, the COAMPS model tends to slightly overforecast (underforecast) the object area for unstable (stable) clouds. Overall, the mean object-based threat score (OTS) for both unstable and stable clouds remains about 0.8–0.85 up to the 9-h lead time and decreases slightly to 0.7–0.75 at 12 h, suggesting that COAMPS performs well in predicting low-level clouds.
The study demonstrates that MET is capable of verifying model cloud forecasts; however, it underscores the importance of exercising caution and fine-tuning the process to accurately represent the cloud features of interest. The results suggest that the grid points selected by the objects yield pointwise statistics (e.g., frequency bias, ETS) similar to those obtained from the full-grid raw data. Furthermore, object-based verification offers additional insight into various attributes (e.g., displacement error, area coverage) of the matched and unmatched objects. An object-based equivalent of the threat score, the OTS, is used to further demonstrate the object-oriented verification.
Moreover, this study illustrates that no single verification approach can fully capture the quality of the forecasts, given the intricacies of NWP models, the limitations of observations, and the inherent complexities of the verification approaches themselves (Murphy 1993). To gain a comprehensive understanding of model performance, it is essential to combine standard, spatial, and object-oriented verification techniques. These approaches offer a consistent cross-check of model performance and provide complementary measures. Furthermore, our sensitivity tests highlight the importance of adapting community tools to the specific application and emphasize the need to understand the tools' configurations and their implications (Davis et al. 2009; Johnson and Wang 2013; Cai and Dumais 2015; Griffin et al. 2021). This understanding becomes particularly critical when transitioning from regional to global verification efforts. Leveraging community tools also streamlines collaboration and expedites the transition from research to operational applications.
Acknowledgments.
The authors acknowledge funding support from the Office of Naval Research (Project N0001423WX00921) and computing resources provided by the Navy DSRC. The authors thank Elizabeth Satterfield for the initial implementation of the MET software package and helpful suggestions. The authors thank three anonymous reviewers whose comments and suggestions greatly improved the quality of the manuscript. The authors declare no conflicts of interest.
Data availability statement.
The COAMPS forecasts as well as the satellite data are stored at the DoD HPC centers and are controlled unclassified data that require users to register with the U.S. government and acquire permission prior to use. More details can be found online (https://www.nrlmry.navy.mil/coamps-web/web/reg). Interested users can submit further inquiries to the corresponding author of this work. The IMERG data are available from NASA, and information about the dataset can be found online (https://gpm.nasa.gov/data/directory). The satellite retrieval data were collected daily from NASA and CIRA. NASA LARC daily imagery can be found at https://satcorps.larc.nasa.gov/, and CIRA daily imagery can be found at https://rammb.cira.colostate.edu/ramsdis/online/goes-16.asp.
REFERENCES
Alexander, S. P., and A. Protat, 2018: Cloud properties observed from the surface and by satellite at the northern edge of the Southern Ocean. J. Geophys. Res. Atmos., 123, 443–456, https://doi.org/10.1002/2017JD026552.
Bani Shahabadi, M. B., Y. Huang, L. Garand, S. Heilliette, and P. Yang, 2016: Validation of a weather forecast model at radiance level against satellite observations allowing quantification of temperature, humidity, and cloud-related biases. J. Adv. Model. Earth Syst., 8, 1453–1467, https://doi.org/10.1002/2016MS000751.
Bodas-Salcedo, A., and Coauthors, 2011: COSP: Satellite simulation software for model assessment. Bull. Amer. Meteor. Soc., 92, 1023–1043, https://doi.org/10.1175/2011BAMS2856.1.
Cai, H., and R. E. Dumais Jr., 2015: Object-based evaluation of a numerical weather prediction model’s performance through forecast storm characteristic analysis. Wea. Forecasting, 30, 1451–1468, https://doi.org/10.1175/WAF-D-15-0008.1.
Chen, S., and Coauthors, 2003: COAMPS version 3 model description-general theory and equations. Naval Research Laboratory Tech. Note NRL/PU/7500-03448, 143 pp.
Daley, R., and E. Barker, 2001: NAVDAS Source Book 2001: NRL atmospheric variational data assimilation system. Naval Research Laboratory Tech. Note NRL/PU/7530-01-441, 163 pp.
Davis, C. A., B. G. Brown, and R. Bullock, 2006: Object-based verification of precipitation forecasts. Part II: Application to convective rain systems. Mon. Wea. Rev., 134, 1785–1795, https://doi.org/10.1175/MWR3146.1.
Davis, C. A., B. G. Brown, R. Bullock, and J. Halley-Gotway, 2009: The Method for Object-based Diagnostic Evaluation (MODE) applied to numerical forecasts from the 2005 NSSL/SPC Spring Program. Wea. Forecasting, 24, 1252–1267, https://doi.org/10.1175/2009WAF2222241.1.
Ebert, E. E., and J. L. McBride, 2000: Verification of precipitation in weather systems: Determination of systematic errors. J. Hydrol., 239, 179–202, https://doi.org/10.1016/S0022-1694(00)00343-7.
Griffin, S. M., and Coauthors, 2021: Evaluating the impact of planetary boundary layer, land surface model, and microphysics parameterization schemes on cold cloud objects in simulated GOES-16 brightness temperatures. J. Geophys. Res. Atmos., 126, e2021JD034709, https://doi.org/10.1029/2021JD034709.
Huffman, G. J., and Coauthors, 2019: NASA Global Precipitation Measurement (GPM) Integrated Multi-satellitE Retrievals for GPM (IMERG). NASA Algorithm Theoretical Basis Doc., version 06, NASA, 38 pp., https://gpm.nasa.gov/sites/default/files/document_files/IMERG_ATBD_V06.pdf.
Johnson, A., and X. Wang, 2013: Object-based evaluation of a storm-scale ensemble during the 2009 NOAA Hazardous Weather Testbed Spring Experiment. Mon. Wea. Rev., 141, 1079–1098, https://doi.org/10.1175/MWR-D-12-00140.1.
Johnson, B. T., C. Dang, P. Stegmann, Q. Liu, I. Moradi, and T. Auligne, 2023: The Community Radiative Transfer Model (CRTM): Community-focused collaborative model development accelerating research to operations. Bull. Amer. Meteor. Soc., 104, E1817–E1830, https://doi.org/10.1175/BAMS-D-22-0015.1.
Lack, S. A., G. L. Limpert, and N. I. Fox, 2010: An object-oriented multiscale verification scheme. Wea. Forecasting, 25, 79–92, https://doi.org/10.1175/2009WAF2222245.1.
McErlich, C., A. McDonald, A. Schuddeboom, and I. Silber, 2021: Comparing satellite- and ground-based observations of cloud occurrence over high southern latitudes. J. Geophys. Res. Atmos., 126, e2020JD033607, https://doi.org/10.1029/2020JD033607.
Minnis, P., and Coauthors, 2021: CERES MODIS cloud product retrievals for Edition 4. Part I: Algorithm changes. IEEE Trans. Geosci. Remote Sens., 59, 2744–2780, https://doi.org/10.1109/TGRS.2020.3008866.
Murphy, A. H., 1993: What is a good forecast? An essay on the nature of goodness in weather forecasting. Wea. Forecasting, 8, 281–293, https://doi.org/10.1175/1520-0434(1993)008<0281:WIAGFA>2.0.CO;2.
Nachamkin, J. E., 2004: Mesoscale verification using meteorological composites. Mon. Wea. Rev., 132, 941–955, https://doi.org/10.1175/1520-0493(2004)132<0941:MVUMC>2.0.CO;2.
Nachamkin, J. E., A. Bienkowski, R. Bankert, K. Pattipati, D. Sidoti, M. Surratt, J. Gull, and C. Nguyen, 2022: Classification and evaluation of stable and unstable cloud forecasts. Mon. Wea. Rev., 150, 81–98, https://doi.org/10.1175/MWR-D-21-0056.1.
Peel, S., and L. J. Wilson, 2008: Modeling the distribution of precipitation forecasts from the Canadian Ensemble Prediction System using kernel density estimation. Wea. Forecasting, 23, 575–595, https://doi.org/10.1175/2007WAF2007023.1.
Roberts, N. M., and H. W. Lean, 2008: Scale-selective verification of rainfall accumulations from high-resolution forecasts of convective events. Mon. Wea. Rev., 136, 78–97, https://doi.org/10.1175/2007MWR2123.1.
Roebber, P. J., 2009: Visualizing multiple measures of forecast quality. Wea. Forecasting, 24, 601–608, https://doi.org/10.1175/2008WAF2222159.1.
Rutledge, S. A., and P. V. Hobbs, 1983: The mesoscale and microscale structure and organization of clouds and precipitation in midlatitude cyclones. VIII: A model for the “seeder-feeder” process in warm-frontal rainbands. J. Atmos. Sci., 40, 1185–1206, https://doi.org/10.1175/1520-0469(1983)040%3C1185:TMAMSA%3E2.0.CO;2.
Rutledge, S. A., and P. V. Hobbs, 1984: The mesoscale and microscale structure and organization of clouds and precipitation in midlatitude cyclones. XII: A diagnostic modeling study of precipitation development in narrow cold-frontal rainbands. J. Atmos. Sci., 41, 2949–2972, https://doi.org/10.1175/1520-0469(1984)041%3C2949:TMAMSA%3E2.0.CO;2.
Saunders, R., and Coauthors, 2020: RTTOV-13 science and validation report. NWPSAF-MO-TV-046, 106 pp., https://nwp-saf.eumetsat.int/site/download/documentation/rtm/docs_rttov13/rttov13_svr.pdf.
Skamarock, W. C., 2004: Evaluating mesoscale NWP models using kinetic energy spectra. Mon. Wea. Rev., 132, 3019–3032, https://doi.org/10.1175/MWR2830.1.
Skinner, P. S., and Coauthors, 2018: Object-based verification of a prototype warn-on-forecast system. Wea. Forecasting, 33, 1225–1250, https://doi.org/10.1175/WAF-D-18-0020.1.
Wilks, D. S., 1995: Statistical Methods in the Atmospheric Sciences: An Introduction. International Geophysics Series, Vol. 59, Elsevier, 467 pp.
Wolff, J. K., M. Harrold, T. Fowler, J. H. Gotway, L. Nance, and B. G. Brown, 2014: Beyond the basics: Evaluating model-based precipitation forecasts using traditional, spatial, and object-based methods. Wea. Forecasting, 29, 1451–1472, https://doi.org/10.1175/WAF-D-13-00135.1.
Zhang, Y., S. A. Klein, J. Boyle, and G. G. Mace, 2010: Evaluation of tropical cloud and precipitation statistics of community atmosphere model version 3 using CloudSat and CALIPSO data. J. Geophys. Res., 115, D12205, https://doi.org/10.1029/2009JD012006.