## Abstract

A sensitivity analysis methodology recently developed by the authors is applied to COAMPS and WRF. The method involves varying model parameters according to Latin Hypercube Sampling, and developing multivariate multiple regression models that map the model parameters to forecasts over a spatial domain. The regression coefficients and *p* values testing whether the coefficients are zero serve as measures of sensitivity of forecasts with respect to model parameters. Nine model parameters are selected from COAMPS and WRF, and their impact is examined on nine forecast quantities (water vapor, convective and gridscale precipitation, and air temperature and wind speed at three altitudes). Although the conclusions depend on the model parameters and specific forecast quantities, it is shown that sensitivity to model parameters is often accompanied by nontrivial spatial structure, which itself depends on the underlying forecast model (i.e., COAMPS vs WRF). One specific difference between these models is in their sensitivity with respect to a parameter that controls temperature increments in the Kain–Fritsch trigger function; whereas this parameter has a distinct spatial structure in COAMPS, that structure is completely absent in WRF. The differences between COAMPS and WRF also extend to the quality of the statistical models used to assess sensitivity; specifically, the differences are largest over the waters off the southeastern coast of the United States. The implication of these findings is twofold: not only is the spatial structure of sensitivities different between COAMPS and WRF, the underlying relationship between the model parameters and the forecasts is also different between the two models.

## 1. Introduction

Sensitivity analysis (SA) refers to a wide range of techniques for assessing the impact of something on something else (e.g., the impact of a single observation on the forecasts in the process of data assimilation, or the impact of model parameters on the forecasts). In meteorology various SA methods have been developed (Aires et al. 2014; Fasso 2006; Oakley and O’Hagan 2004; Saltelli et al. 2010; Sobol’ 1993; Zhao and Tiede 2011; Lucas et al. 2013; Errico 1997; Ancell and Hakim 2007; Safta et al. 2015; Hacker et al. 2011; Laine et al. 2012; Ollinaho et al. 2014; Roebber 1989; Roebber and Bosart 1998; Robock et al. 2003; Järvinen et al. 2012; Marzban 2013; Marzban et al. 2014, 2018a,b, 2019, 2020), and a partial taxonomy of these methods is provided in Marzban et al. (2019).

A subset of these works describes the development of an SA method for assessing how model parameters affect the spatial structure of forecasts. The SA method requires sampling the space spanned by the model parameters, and therefore, continuous- and discrete-valued model parameters must be treated separately. When the model parameters are continuous, Latin hypercube sampling (LHS) is the sampling method of choice (McKay et al. 1979; Cioppa and Lucas 2007; Montgomery 2009). For discrete model parameters two of the most commonly used sampling designs are Graeco-Latin square designs and fractional factorial designs (Montgomery 2009). The case of continuous model parameters is covered in Marzban et al. (2014, 2018a,b), where the forecast model is COAMPS (Hodur 1997). The case of discrete model parameters is described in Marzban et al. (2019, 2020), where the model parameters are those of the stochastic kinetic energy backscatter scheme (Leith 1990; Mason and Thomson 1992; Berner et al. 2011) in the Weather Research and Forecasting (WRF) Model.

Whereas comparisons of forecast models in terms of the quality of their forecasts are relatively commonplace (Wilks 2011; Jolliffe and Stephenson 2011), a comparison in terms of their sensitivity with respect to model parameters is less common. Such a comparison is an inherently difficult problem because of the nonlinear and interactive fashion in which model parameters affect forecast quantities. For instance, even if a specific model parameter appears in exactly the same manner in two forecast models, it is not at all guaranteed that the effect of that model parameter on the forecasts will be the same. Even less trivial is the effect of the model parameters on the spatial structure of the forecasts. In spite of the inherently nonlinear and interactive nature of the forecast models, here the underlying relationships are approximated by a statistical model that assumes linear and noninteractive relations (also see the discussion section), if only for the purpose of being able to assess the spatial structure of the sensitivities.

Given that all of the development of the present SA method is based on COAMPS and WRF, it is natural to ask whether these models behave differently in terms of their sensitivity to model parameters. To that end, nine model parameters are identified in COAMPS and WRF based on their similarity in their respective models, and the effect of these model parameters on the spatial structure of several forecast fields is examined. The forecast fields examined are 24-h forecasts of water vapor, convective and gridscale precipitation, and air temperature and wind speed at three different vertical levels. Spatial structure is often quantified by a variogram, a graph that, among other things, reveals the extent of spatial correlations (Marzban and Sandgathe 2009). However, by virtue of being diagrams, variograms are not suitable for displaying local spatial structure across a larger domain. Therefore, the sensitivity results are instead displayed as maps, which convey spatial structure visually. Some of the analysis reported here abandons the focus on spatial structure in favor of a more quantitative assessment of the sensitivity of a scalar summary measure, namely, the 90th percentile of the respective forecast field across the forecast domain.

Broadly speaking, the SA method employed here has three main components: 1) a sampling scheme according to which the model parameters are varied (sampled), 2) a statistical model relating the model parameters to the forecasts, and 3) a measure of sensitivity. Further details of these components are presented in the Method section. Here, suffice it to say that 50 different sets of values for the 9 model parameters are selected, multivariate multiple regression is employed to map the model parameters to forecast fields on 30 days, and regression coefficients and *p* values are used as measures of sensitivity.^{1} Further details of the SA methodology can be found in the prequels to this article (e.g., Marzban et al. 2014, 2018a,b, 2019); the main focus of the present article is to apply that methodology to the problem of comparing COAMPS and WRF in terms of their sensitivity with respect to a common set of model parameters.

The outline of the paper is as follows: The data section discusses the manner in which the (computer) data necessary for estimating the sensitivities are generated. The method section summarizes the important components of the SA methodology. The number of comparisons one can make is large and unwieldy, but some of the more salient findings are reported in the results section, which is followed by discussions of finer details.

## 2. Data

COAMPS, version 4.2.2 (Hodur 1997), and WRF, version 3.7.1 (Skamarock and Klemp 2008), with lateral boundary conditions specified by the Navy Operational Global Atmospheric Prediction System (NOGAPS) and the Global Forecast System (GFS), respectively, are run on a 25-km domain over the CONUS. A period of 30 days is selected, with initial forecast times 4 days apart to ensure minimal temporal association between days. Specifically, they are 16, 20, 24, and 28 February; 4, 8, …, 28 March; 1, 5, …, 29 April; 3, 7, …, 31 May; and 22, 26, and 30 June 2009. These specific dates are used here in order to maintain continuity with previous work performed by the authors and to allow for comparisons with their past results.

The model parameters, denoted Par1 through Par9, and their COAMPS and WRF names are shown in Table 1. A total of 50 different values for these 9 parameters are selected according to LHS; the ranges over which they are varied, and their default values, are also shown in Table 1. LHS has been used for sampling the model parameter space (Marzban et al. 2014), for selecting the members of ensembles for ensemble forecasting (Hacker et al. 2011), for emulation (Santner et al. 2003), and for performing variance-based SA (Saltelli et al. 2010, 2008). The popularity of this sampling scheme derives from the fact that it leads to estimates that are often more precise, and never less precise, than those from simple random sampling (Cioppa and Lucas 2007; Marzban 2013).
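As an illustration of this sampling step, a 50-run Latin hypercube design over 9 parameters can be sketched with SciPy's quasi-Monte Carlo module. This is a minimal sketch, not the authors' implementation; the parameter ranges below are hypothetical stand-ins for those in Table 1.

```python
import numpy as np
from scipy.stats import qmc

# Hypothetical parameter ranges standing in for Par1-Par9 in Table 1.
lower = np.full(9, 0.0)
upper = np.array([2.0, 1.0, 1.5, 0.5, 1.0, 3.0, 2.0, 1.0, 0.5])

sampler = qmc.LatinHypercube(d=9, seed=0)   # one dimension per model parameter
unit = sampler.random(n=50)                  # 50 design points in the unit hypercube
designs = qmc.scale(unit, lower, upper)      # map to the physical parameter ranges

# Defining property of LHS: each parameter's 50 values fall into
# 50 distinct equal-width bins of its range.
bins = np.floor(unit * 50).astype(int)
assert all(len(set(bins[:, j])) == 50 for j in range(9))
```

The final assertion checks the stratification property that distinguishes LHS from simple random sampling: every one of the 50 equal-probability bins of each parameter's range contains exactly one design point.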

The final choice of the nine parameters in Table 1 is the result of beginning with 11 COAMPS parameters used in past work and attempting to find the analogous parameters in WRF. Two of the original 11 COAMPS parameters were excluded (and hence set to their respective default values) because the authors were unable to find their WRF analogs. The specific choices and ranges of the model parameters match those employed in Holt et al. (2011); one exception is described in the appendix.

Generally, Par1 is associated with the representation of boundary layer turbulent mixing, while Par2 to Par5 control various aspects of the Kain–Fritsch (KF) (Kain 2004) parameterization of cumulus convection. Par6 to Par9 are associated with the parameterization of microphysical processes for gridscale precipitation. More details of these model parameters are provided in the appendix.

The examined meteorological quantities are the 24-h forecasts of precipitation (accumulated over 3 h), surface water vapor, and temperature and wind speed both at the surface and at 850 and 250 hPa. These quantities, denoted F1–F9, are tabulated in Table 2.

## 3. Method

### a. Statistical model

The SA methods developed here consist of relatively standard techniques in the field of experimental design (Montgomery 2009). There are three main ingredients: 1) a statistical model for mapping the model parameters to the forecasts, 2) a sampling scheme for generating the data necessary for the development of the aforementioned model, and 3) a specific measure of sensitivity. Random-effects models were employed for most of the past development because they allow one to compute *p* values and/or confidence intervals that assess the model sensitivities across all values of the model parameters, not only the specific values chosen for the analysis. Also, depending on the application, corrections were made to the *p* values (e.g., based on controlling the false discovery rate) in order to account for the spatial correlations in the forecast fields and for multiple hypothesis testing. However, for the purpose of comparing two models (i.e., COAMPS and WRF), many of these steps are not necessary.

To assess the effect of the model parameters on the forecasts, a fixed-effects regression setting is adopted. Schematically, the model is

$\text{Forecast} = \beta_{0} + \beta_{1} X_{1} + \beta_{2} X_{2} + \cdots + \beta_{9} X_{9} + \epsilon, \qquad (1)$

where *X*_{1}, …, *X*_{9} denote the nine model parameters, and therefore, the *β* coefficients assess the sensitivity of the forecast with respect to the corresponding model parameter. Given the existence of spatial correlation in the forecasts, the forecast at any grid point is correlated with the forecasts in neighboring grid points. To account for this spatial correlation in the forecasts, the model in Eq. (1) is extended, on the left-hand side, to allow for multiple responses corresponding to forecasts at several grid points. The resulting model is called a multivariate multiple regression (MMR) model (Draper and Smith 1998; DelSole and Yang 2011), wherein the term “multiple” refers to the multiple model parameters (here nine), and “multivariate” implies that there exist several response variables corresponding to the forecasts at several grid points. In the present analysis, the selected grid points belong to a *W* × *W* window of grid points centered on each of the grid points in the forecast domain. The discussion section addresses the “optimal” size of this window.

More specifically, the MMR model employed here is

$Y = X \beta + \epsilon, \qquad (2)$

where the matrix $Y$ denotes the forecasts, $X$ is the matrix of model parameters, and *ϵ* is a matrix of errors whose sum of squares is minimized for the purpose of estimating the regression coefficients *β*. Here, $Y$ is an *n* × *q* matrix, with *n* denoting the number of cases (here *n* = 50) and *q* = *W* × *W*; $X$ is an *n* × (*p* + 1) matrix, with *p* denoting the number of forecast model parameters (here *p* = 9); and *β* is a (*p* + 1) × *q* matrix of coefficients. The *n* × *q* matrix *ϵ* represents any other source of variability in the forecasts that cannot be accounted for by the forecast model parameters.

The utility of this multivariate model derives from the existence of statistical tests that check whether a given model parameter has an effect on any of the response variables. Here, Pillai’s trace test (Fox et al. 2013; Rencher and Christensen 2012) allows one to compute a single *p* value assessing the statistical significance of the effect of a given model parameter on the forecast at any of the grid points in the *W* × *W* window. This feature reduces the number of statistical tests that must be performed, and therefore tames the increase in the number of type I errors associated with multiple hypothesis testing.^{2}
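A minimal numerical sketch of this test is given below, using synthetic stand-in data rather than COAMPS/WRF output. For a one-row contrast such as a single regression coefficient, the hypothesis matrix has rank one, so Pillai's trace and the other multivariate criteria coincide; the statistic then reduces to Hotelling's T², which has an exact F reference distribution.

```python
import numpy as np
from scipy.stats import f as f_dist

rng = np.random.default_rng(0)
n, p, q = 50, 9, 9            # 50 LHS runs, 9 parameters, q = W*W = 9 responses

# Synthetic stand-in data: only the first model parameter drives the forecasts.
X = np.column_stack([np.ones(n), rng.standard_normal((n, p))])   # n x (p+1)
Y = 5.0 * np.outer(X[:, 1], np.ones(q)) + rng.standard_normal((n, q))

XtX_inv = np.linalg.inv(X.T @ X)
B = XtX_inv @ X.T @ Y                      # (p+1) x q OLS coefficients
E = (Y - X @ B).T @ (Y - X @ B)            # error SSCP matrix, q x q
nu = n - (p + 1)                           # error degrees of freedom

def coef_pvalue(j):
    """Multivariate test that row j of beta is zero across all q responses.

    For a one-row contrast the multivariate criteria coincide; the statistic
    reduces to Hotelling's T^2, with an exact F reference distribution.
    """
    b = B[j]                                            # length-q coefficient vector
    t2 = nu * (b @ np.linalg.solve(E, b)) / XtX_inv[j, j]
    F = t2 * (nu - q + 1) / (nu * q)
    return f_dist.sf(F, q, nu - q + 1)

p_active = coef_pvalue(1)                  # the parameter with a real effect
p_inert = coef_pvalue(2)                   # a parameter with no effect
```

With the strong synthetic signal, `p_active` comes out many orders of magnitude smaller than `p_inert`; one such *p* value is produced per model parameter per window.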

MMR has one additional virtue. When performing inference in multiple regression (i.e., with only one response), the variance of each *β*, and the covariance between pairs of *β*, depend on the variance of *ϵ*. That variance is estimated from the sample variance of the residuals, which itself depends on the sample variance of the response. In MMR, however, inference on *β* ultimately depends on the variance-covariance matrix of *ϵ*, and therefore on the variance-covariance matrix of the response variables. In other words, the abovementioned tests involve the covariance matrix of the forecast quantity across the grid points in the *W* × *W* window, and as such, the spatial correlation across grid points is explicitly taken into account.

To provide a bit more detail, note that the model in Eq. (2) involves two sets of parameters (not to be confused with the model parameters in Table 1): the coefficients *β*, and Σ, the variance-covariance matrix of the errors *ϵ*. It can be shown (Fox et al. 2013) that the maximum likelihood estimate of Σ is given by $\hat{\Sigma} = (1/n)(Y - X\hat{\beta})^{T}(Y - X\hat{\beta})$, where $\hat{\beta}$ denotes the ordinary least squares estimate of the *β* coefficients. The term $(1/n)Y^{T}Y$ in this expression is the sample covariance matrix of the (centered) forecasts $Y$ within the aforementioned *W* × *W* window of grid points. Given that the computation of the *p* values in MMR involves $\hat{\Sigma}$, the spatial correlation of the forecasts enters the analysis prominently.

Although the value of *W* is a user-specified quantity, depending on the forecast quantity and the spatial extent of the correlations across grid points, here, it is fixed at *W* = 3; that is, the corresponding window is 3 × 25 = 75 km on each side. *W* = 5 was also tested, and the results were found to be very similar to (though smoother than) those for *W* = 3. See the discussion section for some of the complexities in choosing a value for *W*.
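As a concrete sketch of how the *q* = *W* × *W* responses might be assembled at each grid point, consider the following illustrative helper; the grid size and indexing conventions are assumptions, not the authors' actual code.

```python
import numpy as np

W = 3
field = np.arange(100.0).reshape(10, 10)     # stand-in 10 x 10 forecast field

def window_responses(field, i, j, W):
    """Flatten the W x W window centered on grid point (i, j) into a
    length-W*W response vector for the MMR model (interior points only)."""
    h = W // 2
    return field[i - h:i + h + 1, j - h:j + h + 1].ravel()

y = window_responses(field, 5, 5, W)         # q = 9 responses at grid point (5, 5)
```

One such length-9 response vector is built for every interior grid point, and a separate MMR model is fit at each.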

It is important to emphasize that in spite of all of these precautions to avoid problems with spatial correlations in multiple hypothesis testing, no hypothesis testing is actually performed here. In other words, the *p* values are not compared with any preset significance level for the purpose of rejecting or not rejecting any null hypothesis. Instead, the *p* values are used directly as a measure of sensitivity. Another reason for using the *p* value as a measure of sensitivity is discussed below.

Consequently, it is possible to view the spatial structure associated with each model parameter as a spatial map of *p* values. However, a priori there exists one such map per day, for each of the 30 days. To aggregate the 30 maps into a single map of *p* values, the 30 *p* values at each grid point are subjected to a test of uniformity. This is a useful test because if a given parameter has no effect at a given grid point, then the distribution of the *p* values across time should be uniform. The uniformity of *p* values under the null hypothesis is a standard result in statistics (Montgomery 2009), and many tests of uniformity exist; here, a Kolmogorov–Smirnov test is performed. This procedure leads to a single map of *p* values for each model parameter, assessing sensitivity across all 30 days. An assumption of this test is the temporal independence of the *p* values. A visual examination of the time series of the *p* values across the 30 days reveals no evidence that they are temporally correlated.
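This aggregation step can be sketched as follows, with synthetic stand-in *p* values in place of the daily maps; `scipy.stats.kstest` performs the one-sample Kolmogorov–Smirnov test against the uniform distribution.

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(1)

# Stand-in daily p values at two hypothetical grid points (30 days each):
# no effect -> p values look uniform; a real effect -> p values pile up near 0.
p_no_effect = rng.uniform(size=30)
p_effect = rng.beta(0.2, 1.0, size=30)      # heavily skewed toward zero

# One-sample KS test against Uniform(0, 1); a small p value rejects uniformity.
agg_no_effect = kstest(p_no_effect, "uniform").pvalue
agg_effect = kstest(p_effect, "uniform").pvalue

# The aggregated map then displays -log(p), so larger values = more sensitive.
sensitivity = -np.log(agg_effect)
```

Repeating this at every grid point collapses the 30 daily maps into the single map of −log(*p* value) described below.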

Therefore, the effect, across 30 days, of a given model parameter on the forecast at a given grid point is gauged by a *p* value testing whether the model parameter has an effect on the forecasts at any of the grid points in a *W* × *W* window centered on the grid point. Given that these *p* values are employed only as a measure of sensitivity (and not for hypothesis testing), it is convenient to transform them into an equivalent measure that is positively associated with sensitivity. To that end, instead of displaying a map of *p* values, the map of −log(*p* value) is displayed.

Although these maps are the main device for visually assessing the spatial structure of the impact of each model parameter on the forecasts, it is useful (and simpler) to also perform an SA that does not involve spatial structure of the forecasts. To that end, an extremely simplified version of the above method is also examined; there, the response of the regression model is a univariate quantity representing the 90th percentile of the forecasts across the entire spatial domain. In other words, the purpose of that exercise is to assess the effect of the model parameters on a single, nonspatial summary measure of the forecasts. This simplification allows one to “step back” from *p* values and view the distribution (technically, histogram) of the regression coefficients across the 30 days. With all model parameters standardized to have a sample mean and variance of 0 and 1, respectively, these coefficients can be compared across different model parameters, and even across different models (e.g., COAMPS and WRF). The histograms (across 30 days) of the regression coefficients are summarized by boxplots in order to allow a visual comparison across model parameters. The comparison of these boxplots is necessarily a qualitative exercise, but it does have the advantage of being multifaceted; some guidance for interpreting comparative boxplots is provided in the results section.
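A minimal sketch of this nonspatial SA is given below, with synthetic stand-in data: the scalar response stands in for the domainwide 90th percentile of a forecast field, and, as in the text, the parameters are standardized so that coefficients are comparable across parameters and models.

```python
import numpy as np

rng = np.random.default_rng(2)
n_days, n_runs, n_par = 30, 50, 9

coefs = np.empty((n_days, n_par))
for day in range(n_days):
    X = rng.uniform(size=(n_runs, n_par))          # stand-in LHS design
    # Stand-in scalar response (e.g., 90th percentile of the forecast field),
    # here driven only by parameter 0 plus noise.
    y = 2.0 * X[:, 0] + 0.3 * rng.standard_normal(n_runs)

    # Standardize parameters to mean 0 and variance 1 so coefficients
    # are comparable across parameters (and across models).
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    A = np.column_stack([np.ones(n_runs), Z])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    coefs[day] = beta[1:]                          # drop the intercept

# Each column of `coefs` is the 30-day distribution summarized by one boxplot.
median_effects = np.median(coefs, axis=0)
```

Each column of `coefs` corresponds to one boxplot in figures such as Fig. 3: its position relative to zero reflects the magnitude of the effect, and its spread reflects the day-to-day consistency of that effect.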

### b. Model adequacy

Given that all of the sensitivity results are based on MMR models of the type shown in Eq. (2), it is useful to assess the adequacy of the models themselves. Here, the *R*^{2} value of the MMR is used to measure the goodness of fit. Recall that *R*^{2} measures the proportion of variability in the response (i.e., the forecast quantity) that is due to variability in the predictors (i.e., model parameters). Given that the number of cases employed for training the MMR models (i.e., 50) is much larger than the number of parameters in the MMR (1 + 9 = 10), and given the absence of multicollinearity among the model parameters (a consequence of LHS), any concern about overfitting is probably unwarranted. As such, *R*^{2} is an adequate measure of the goodness of the fits.
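For concreteness, one way to compute a goodness-of-fit summary for an MMR fit is sketched below (synthetic data; whether the per-column *R*^{2} values are averaged or reported separately is an assumption here, not specified by the text).

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, q = 50, 9, 9

# Synthetic stand-in MMR data with a known linear signal plus noise.
X = np.column_stack([np.ones(n), rng.standard_normal((n, p))])
true_B = rng.standard_normal((p + 1, q))
Y = X @ true_B + 0.5 * rng.standard_normal((n, q))

B, *_ = np.linalg.lstsq(X, Y, rcond=None)
resid = Y - X @ B

# R^2 computed column by column (one value per response/grid point),
# then averaged into a single number for the window.
ss_res = (resid ** 2).sum(axis=0)
ss_tot = ((Y - Y.mean(axis=0)) ** 2).sum(axis=0)
r2 = 1.0 - ss_res / ss_tot
mean_r2 = r2.mean()
```

With 50 cases against 10 regression parameters and an orthogonal-by-design predictor matrix, the fitted *R*^{2} tracks the true signal-to-noise ratio closely, which is the basis for the overfitting remark above.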

## 4. Results

Before embarking on a comparison of the sensitivities across COAMPS and WRF, it is helpful to compare the forecasts directly. The top panels in Fig. 1 show the 24-h forecasts of surface water vapor (kg kg^{−1}) from COAMPS and WRF for one day (16 February 2009). Here, the nine model parameters are set to their default values. The spatial differences between the two fields become clearer when one examines the grid-by-grid difference between the two fields (bottom-left panel). A general agreement between the two models is seen, as shown by gridpoint values generally falling along the 1:1 diagonal line in the scatterplot of the two sets of forecasts (bottom-right panel). This scatterplot shows that, although there are differences between the two forecasts on a grid-by-grid basis, there is no overall bias between the forecasts of the two models. More quantitatively, the regression *R*^{2} suggests that 92% of the variability in WRF’s forecasts can be explained by the COAMPS forecasts. From a spatial perspective, on this particular day, WRF’s forecast is generally characterized by more humid conditions over the oceans than those of COAMPS, while COAMPS produces higher humidity levels than WRF over the southeastern United States. These differences are reflected in the clustering of points apparent in the scatterplot. Many of the characteristics of these figures are typical across the 30 days and 50 parameter values.

Many of the comparisons in Fig. 1 apply to Fig. 2 as well, where forecasts of water vapor have been averaged across all 30 days in the data. As seen in the bottom panel, there are distinct spatial regions in which forecasts from COAMPS and WRF are different. This is not surprising, as the regional forecasts are generated using slightly different initial conditions and lateral boundary forcing (from the GFS global forecast modeling system for WRF and from NOGAPS for COAMPS). Differences in model components between WRF and COAMPS (temporal and spatial discretization schemes, advection scheme, land surface model, etc.) may also contribute to the differences observed.

### a. Nonspatial SA

#### 1) Water vapor

As mentioned in the Method section, it is useful to perform an SA of the model parameters on a scalar (nonspatial) measure of forecasts: here, the 90th percentile (across the domain) of the forecast quantity. Figure 3 shows the boxplots of the regression coefficients in Eq. (1) when the response is forecasts of water vapor. As such, these coefficients measure the sensitivity of water vapor forecasts with respect to the model parameters. The aforementioned standardization of the model parameters allows for a comparison of regression coefficients corresponding to different model parameters, and to different models, i.e., COAMPS (black) and WRF (red). Given that the comparison of boxplots is an inherently qualitative exercise, some guidance on their interpretation is in order. Recall that the spread in each boxplot is across the 30 days in the data. As such, the overall position of the boxplots relative to the horizontal line at zero conveys information about the magnitude of the effect of the corresponding model parameter on the forecast. The spread of the boxplots conveys information on the consistency of the effect (if any) across the 30 days.

According to Fig. 3 it appears that, with the exception of Par1 and Par2, none of the model parameters have a significant impact on water vapor (because the boxplots have a significant overlap with the horizontal line at zero). And even among the exceptions, one may argue that Par2 in COAMPS has a very weak and variable effect, at best. The boxplots for Par1 have little overlap with the zero line, but the relatively large spread suggests that the effect is variable across the 30 days. These mostly negative values for the regression coefficients for Par1 may be explained as follows: Par1 corresponds to an ad hoc modulation of the parameterized turbulent mixing within the atmospheric boundary layer, including the process of entrainment of free atmospheric air into the boundary layer. Larger values of this parameter result in stronger mixing and greater entrainment of dry air from aloft into the boundary layer, contributing to a drier boundary layer and drier surface air.

Par2 represents the magnitude of an additive perturbation to the temperature increment applied at the lifting condensation level (LCL) in the trigger function of the Kain–Fritsch parameterization of cumulus convection. This parameter, therefore, influences the characteristics of air parcels in updrafts, namely their buoyancy and upward velocity, and therefore the likelihood of deep convection, its upward extent, and the amount of precipitation produced when cumulus convection is activated. The link between Par2 and surface water vapor content is less direct than with Par1, precluding a simple explanation. The complexity of the link is suggested by the large daily variability in the sensitivity of surface water vapor to this parameter seen with both models. Although a majority of days exhibit a negative sensitivity (larger LCL temperature increments are associated with drier surface conditions), a significant number of days are characterized by a positive sensitivity (larger increments associated with more humid surface conditions). Depending on when convection is triggered and precipitation occurs in the 24-h forecast period, the enhanced likelihood of deeper convective updrafts may result in a net upward moisture flux, leaving overall drier conditions at the surface. Gaining further insight into the variability of the effect of this parameter would require more detailed diagnostics of the evolution of conditions in the WRF and COAMPS forecasts.

As a way of further quantifying these results, a paired *t* test has also been performed on the difference between the black and red boxplots. This test is possible only because there is a one-to-one pairing of the model parameters across COAMPS and WRF. The corresponding *p* values are shown under each boxplot in Fig. 3. Although these *p* values are consistent with the qualitative comparison of the boxplots, the authors prefer the latter because boxplots provide information—albeit qualitative—on both the statistical significance as well as on the magnitude of the effect; by contrast, *p* values provide information only on the former.

#### 2) Other forecast fields

The above analysis is repeated on all forecast quantities F1–F9 in Table 2. The results are complex and lend themselves to a wide range of interpretations. Table 3 summarizes the most clear and unambiguous findings. The “Influential” model parameters are selected based on both the magnitude and the variability of the effect of the model parameters, assessed qualitatively from the boxplots of the regression coefficients (e.g., Fig. 3 for water vapor). The signs (plus or minus) indicate whether the effect of the model parameter is positive or negative; a positive (negative) association means that increasing the model parameter increases (decreases) the forecast quantity. The parameters in parentheses are those deemed to be only marginally influential.

The appearance of Par5 as an important parameter for convective and gridscale precipitation, with opposite signs for the regression coefficients, can be explained easily. Par5 controls the fraction of convective precipitation produced by the KF scheme that is fed back to the gridscale microphysics scheme. This process may alternatively be viewed as a representation of detrainment of condensate generated within convective updrafts, predominantly occurring at upper levels. A negative (positive) sensitivity of convective (gridscale) precipitation to Par5 is expected given this transfer of condensate from the convective (subgridscale) to gridscale precipitation.

A comparison of the boxplots of Par5 for convective and gridscale precipitation (not shown) suggests more variability across the 30 days for the former. This may be because the transfer of subgrid to gridscale condensate generally occurs at higher altitudes, leaving the possibility that the large-scale precipitating particles will evaporate before reaching the surface, diminishing the variability in the signal detected in surface water vapor forecasts.

Par9 appears to have a small effect on gridscale precipitation with a positive sensitivity, more predominantly for COAMPS. This parameter represents the threshold beyond which cloud water is converted into precipitation. The positive sensitivity may be explained as follows: a larger threshold contributes to a delay in the autoconversion of cloud water to rainwater, resulting in clouds containing more water, which can eventually lead to more gridscale precipitation being produced over the accumulation period. The difference in sensitivity between the models is likely related to 1) the different formulation of the parameter itself (i.e., the actual threshold in COAMPS, compared to a critical particle radius involved in the estimation of the cloud water threshold for WRF; see Eq. (A9) in the appendix), and/or 2) the different formulations used in the two models for the rate at which cloud water is converted to rain once the threshold has been reached.

For surface air temperature, Par4 appears as an important parameter in COAMPS and WRF, with a negative sensitivity. This can be understood as follows: Par4 controls the entrainment/detrainment between convective updrafts and their environment. Increasing its value leads to a reduction in entrainment, which in turn means less dilution of the buoyant air parcels, larger amount of convective precipitation, and hence more evaporative cooling below the clouds, which contributes to cooler surface temperatures.

The appearance of Par4 (in Table 3) as an important parameter with a positive sensitivity on air temperature at 250 hPa may be explained as follows: As Par4 increases, it promotes reduced entrainment of environmental air into the convective updrafts, allowing the rising moist air parcels to maintain their buoyancy, likely resulting in greater vertical extent of the updrafts and hence increased diabatic heating in the upper troposphere.

According to Table 3, surface wind speed is controlled mostly by Par1, for both COAMPS and WRF. A possible explanation is that increasing the mixing within the boundary layer leads to enhanced downward turbulent transfer of momentum from aloft, resulting in higher wind speeds at the surface. Also, this enhanced mixing promotes deeper boundary layers, increasing the likelihood of the turbulent layer extending up to the 850-hPa level, which in turn reduces wind speeds at that level through the impact of turbulent drag.

As mentioned previously, the number of comparisons one can make is unwieldy, and for that reason, the discussion here is restricted to only the most important of the model parameters, and only those whose effect can be supported with explanations involving meteorological processes.

### b. Spatial SA

#### 1) Water vapor

Figure 4 shows the map of −log(*p* value) for water vapor (F1); the rows in the figure correspond to the nine parameters Par1–Par9. A variety of patterns can be observed, but only the most salient are discussed here.

Before interpreting the *p* value maps, it is important to recall what exactly a *p* value represents. Consider the *p* values which test the null hypothesis that a given model parameter has no effect on a given forecast quantity. Then, a small *p* value signifies evidence from data in support of the alternative hypothesis that the model parameter has an effect on the forecast quantity. By contrast, a large *p* value does not indicate that the model parameter has no effect on the forecast quantity; it simply implies that there is insufficient evidence from data to support the hypothesis that an effect does exist.

Par1 has a spatially uniform effect in both COAMPS and WRF. This is not surprising as this parameter modifies the turbulent mixing within the boundary layer and exchange with the free atmosphere aloft. This process is expected to be active at least part of the time at every grid point in the domain during a diurnal cycle, covered here in our 24-h forecasts. Some effect of Par1 on turbulent transport of low level moisture should be expected in both models, as is confirmed in our analysis.

Par2 also has an effect on surface moisture over large areas of the domain, particularly for WRF, where sensitivity to this parameter is uniformly distributed across the entire model grid. In contrast, the parameter appears to have little to no effect over most of the Pacific Ocean sector within the COAMPS domain. This feature possibly has its origins in a tighter coupling of conditions in the surface layer to the imposed lower boundary conditions in COAMPS, reducing the possible subtle effects of parameters controlling aspects of precipitation. This could particularly be the case in areas mostly characterized by light precipitation, such as the eastern Pacific over the days considered in our experiment.

Par3, also modifying the magnitude of temperature increments at the LCL in the KF trigger function, exhibits an effect on surface moisture only over parts of the domain in both models. Sensitivity to this parameter is mostly confined to the continental mountainous areas of western North America. In contrast to Par2, Par3 represents a multiplicative modulation of the baseline temperature increment at the LCL within the KF trigger function. This baseline increment is determined as a function of gridscale vertical motion at the LCL [see Eq. (1) in Kain (2004)]. This vertical motion is expected to be more important over mountainous regions due to topographically induced flow perturbations. Therefore, Par3 has the potential for a greater influence in these regions, compared to grid points where the gridscale vertical motion is nonexistent or too small to induce a significant temperature perturbation in the trigger function. An anonymous reviewer is acknowledged for noting a west–east dipole pattern.

Par4, too, has an effect on surface moisture mostly over the mountainous western North American continent, a pattern which is common to both COAMPS and WRF. Some effect over the eastern United States is also noticeable, more so for COAMPS than WRF. Par4 controls the amount of entrainment/detrainment of convective updrafts with their environment. Therefore, its influence is expected only at grid points where convective conditions are diagnosed (presence of unstable conditions and updrafts as determined by the KF trigger function). The findings are generally consistent with the discussion above, suggesting that such conditions are more prevalent over mountainous areas. That the influence of this parameter covers a larger portion of the eastern United States in COAMPS is not easily explained, but may be related to the more complex relation used to determine the cloud radius in WRF compared to COAMPS, and/or the slightly different default values for the cloud radius used in the KF parameterization in both models.

Par5, the fraction of convective precipitation fed back to the gridscale precipitation, exhibits a pattern similar to the parameters controlling various aspects of the KF scheme discussed above. However, sensitivity to this parameter characterizes a larger region in WRF. This difference is again not easily explained, but may be related to differences in the microphysical parameterizations used in both models, influencing the characteristics of gridscale precipitation.

Par6 appears to have little to no effect on water vapor over the entire domain, with the possible exception of the Pacific northwest region for COAMPS, and the entire western coastal states of the United States and northern Mexico for WRF. Once again, differences in both models are likely rooted in differences in the microphysical parameterization schemes used by both models, which could sensitively affect gridscale precipitation and its evaporation in conditions observed in these particular areas.

Par7 (the intercept of the size distribution of the snow particles) has an effect on water vapor, but mostly in the northern latitudes of the eastern United States, and at all latitudes in the mountainous western United States. That spatial pattern is common to both COAMPS and WRF, and is not surprising, as frozen precipitation preferentially occurs in the colder environments of the higher latitudes and mountainous areas.

Par8 appears to have no effect anywhere, while Par9 has a more noticeable effect, mostly contained to the southwest coast of the United States. The lack of sensitivity to Par8, in contrast to Par9, is consistent with the study by Weinstein (1970), who found that *q*_{0} (Par9) is the crucial parameter in the parameterization of autoconversion. The effect of Par9 is noticeable over a wider region in COAMPS compared to WRF, particularly over the coastal areas of Southern California. Par9 regulates the production of cloud water and its autoconversion to precipitation. The difference in sensitivities between models is unlikely to be directly related to differences between representations of autoconversion in WRF and COAMPS since, as noted by Hong et al. (2004), similar rates are obtained with the Tripoli and Cotton (1980) scheme (in WRF), compared to the simpler formulation of Kessler (1969) (in COAMPS; see the appendix), for typical cloud water contents. Therefore, differences in effects are likely to originate from higher-order interactions with other model components, which are more difficult to clearly identify and characterize.

#### 2) Other forecast fields

The spatial maps corresponding to surface air temperature (F4) and surface wind speed (F7) are essentially identical to those of water vapor (F1) shown in Fig. 4.

The spatial maps of sensitivity for convective precipitation are shown in Fig. 5. The color bar displays the values of −log(*p* value), and the color white is used to signify grid points at which the number of cases with convective precipitation is insufficient for estimating the regression coefficients in the MMR model.

The following observations can be made: All parameters exhibit some sensitivity over the western part of the domain, where convective precipitation is generally light during the days considered. Lower sensitivity is observed over the eastern half of the United States, with the exception of Par2 and Par5 over the southeastern United States and coastal regions of the Gulf states, where the more intense convective precipitation occurs in our sample of days. In regions of more intense convection, sensitivity to Par2 is generally more important in WRF than COAMPS (see maps of differences in sensitivities, with generally negative values over the eastern part of the continent), whereas the differences in sensitivities to Par5 are characterized by greater variability at smaller scales, suggesting that a broad interpretation of these differences is not straightforward.

The spatial map of sensitivities for gridscale precipitation is not shown here, as it is similar to that of convective precipitation (Fig. 5), but with the following notable exceptions: Par1, Par2, and Par7, and only in WRF, show higher sensitivity at higher latitudes and over higher terrain. To first order, this spatial distribution of the Par7 sensitivity is expected as solid precipitation preferentially occurs over these regions. The lack of sensitivity to this parameter in COAMPS over the higher latitudes is worth further investigation.

By contrast, Par9 appears to have an effect limited to the waters off the southwestern coast, but only for COAMPS. This difference between COAMPS and WRF is likely related to the slightly different nature of the parameter and its control over autoconversion and how it interacts with the other cloud microphysical processes in conditions of light coastal precipitation (i.e., drizzle).

Forecasts of air temperature and wind speed at 850 and at 250 hPa all have nearly identical spatial structure of sensitivities. That for air temperature at 850 hPa is shown in Fig. 6. Par1, Par2, Par4, and Par5 have a generally uniform effect across large portions of the domain, except for weaker effects off the southeastern coast of the United States and over the eastern Gulf of Mexico for Par1, and over the mountainous regions of the western United States and over the southeastern Pacific for Par2, Par4, and Par5 in COAMPS. It is interesting to note the differences in the spatial structure between COAMPS and WRF with respect to sensitivity to Par3. COAMPS exhibits stronger sensitivity over the eastern part of the domain, whereas WRF shows enhanced sensitivity over the western part of the continent. For Par6, COAMPS has a distinct north–south structure, with stronger sensitivity over the northern latitudes, whereas WRF shows a sensitivity distributed over the entire domain but with greater variability at smaller scales. Such complex sensitivity patterns and differences between models are not easily explained. A more comprehensive and detailed analysis of respective model output would be required to gain further insights into the origins of these structures and their differences.

#### 3) Model adequacy

As mentioned in the Method section, it is useful to assess the goodness of the MMR models used for assessing sensitivity. Figure 7 shows the results when the forecast quantity is water vapor. The *R*^{2} values for COAMPS and WRF are shown in Figs. 7a and 7b. Although the magnitude and spatial structure of the *R*^{2} values are similar, there are some differences; Fig. 7c shows the difference of the *R*^{2} values, and it can be seen that COAMPS and WRF can differ in their typical *R*^{2} value by as much as ±0.3. The biggest differences are in the waters off the southeastern coast of the United States, where WRF is generally accompanied by higher *R*^{2} values than COAMPS. In spite of these differences, the scatterplot of the *R*^{2} values shows that the MMR models for COAMPS and WRF are comparable in terms of model adequacy, with both having large *R*^{2} values in the 0.6 to 1.0 range. Figure 7e shows the histogram of the *R*^{2} values for COAMPS (black) and WRF (red); although the two have similar means, the histogram for COAMPS is slightly skewed toward larger *R*^{2} values than that of WRF. This skew is confirmed by examining the quantile–quantile plot of the *R*^{2} values (Fig. 7f); if COAMPS and WRF had exactly the same distribution of *R*^{2} values, the resulting quantile–quantile plot would be a diagonal line; a difference in their means would be reflected in a vertical shift, and a difference in variance would translate to a slope that is not equal to 1. As evident in Fig. 7f, COAMPS has marginally higher *R*^{2} values than WRF. In summary, these figures suggest that the MMR models for COAMPS and WRF are adequate for assessing sensitivity.
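The adequacy diagnostics of the kind shown in Fig. 7 can be sketched as follows. The *R*^{2} values here are synthetic stand-ins for the per-gridpoint fields, and the beta distributions are illustrative assumptions, not fits to the actual maps.

```python
import numpy as np

def r_squared(y, yhat):
    """Coefficient of determination of the predictions yhat for data y."""
    ss_res = np.sum((y - yhat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(1)
# stand-ins for the per-gridpoint R^2 values of the two forecast models
r2_coamps = rng.beta(8.0, 2.0, size=1000)   # skewed toward large values
r2_wrf = rng.beta(7.0, 2.0, size=1000)
probs = np.linspace(0.01, 0.99, 99)
qq = np.column_stack([np.quantile(r2_coamps, probs),
                      np.quantile(r2_wrf, probs)])
# identical distributions would place every qq point on the diagonal;
# a shift in means appears as a vertical offset, and a change in
# variance as a slope different from 1
```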

Although the predominantly large *R*^{2} values for the MMR models used to assess the sensitivities in COAMPS and WRF suggest that the MMR models are adequate, the differences in the spatial structures associated with COAMPS and WRF suggest that the underlying relationship between model parameters and forecasts is different between the two models.

## 5. Conclusions and discussion

A methodology for assessing the spatial structure of the sensitivity of forecasts with respect to model parameters, developed elsewhere by the authors, is applied to the forecasts of COAMPS and WRF for the purpose of comparing the two models in terms of their sensitivities. Nine model parameters and nine forecast quantities are examined. Additionally, the effect of the model parameters on the 90th percentile of the forecast quantities is also examined. The number of possible comparisons is too large for reporting here, but the more salient findings are as follows:

- The effect of Par2 (i.e., temperature increment at the LCL for the KF trigger) on water vapor is relatively homogeneous for WRF, but has a distinct (land–water) spatial structure in COAMPS. WRF and COAMPS also differ in terms of their sensitivity to Par5 (fraction of available precipitation in KF fed back to the grid scale).

- Forecasts of convective and gridscale precipitation are similarly different between WRF and COAMPS, with Par2 and Par5 being the main discriminators.

- For forecasts of 850-hPa temperature and wind speed, the spatial structure of sensitivity differs between COAMPS and WRF for all of the model parameters, to varying degrees, with the exception of Par8 (an autoconversion factor for the microphysics), which appears to have no influence anywhere in the domain in either model.

- COAMPS and WRF differ in terms of the underlying processes that map the model parameters to their respective forecasts.

Many other conclusions follow from an examination of the generated maps of sensitivity. One general finding is that if/when a model parameter has an effect on forecasts at all, the effect has a highly nontrivial spatial structure which varies across COAMPS and WRF. Attempts have been made to “explain” many of these results in terms of meteorological processes; however, many of the results remain unexplainable in spite of both the large magnitude and the statistical significance of the effects. On the positive side, these findings shed light on the complexity of the impact of model parameters on forecasts; on the cautionary side, these results suggest that any attempt to fine-tune model parameters must take into account the possibility that the spatial structure of the effects depends on the forecast model itself.

In any window-based method, the selection of the window size involves a trade-off. In the present setting, adopting larger windows leads to smoother *p*-value maps which may “hide” underlying spatial structure, while a smaller window will not adequately account for spatial correlations in the forecast quantity. Here, *W* was set to 3 (and *W* = 5 was also tested, giving similar results). On the recommendation of an anonymous reviewer, a variogram analysis (not shown) was also performed, revealing that the extent of the spatial correlation is generally much longer than 3 grid lengths; depending on the forecast parameter, the forecast day, and the value of the model parameters, it can range from 20 to 40 grid lengths. Also, when the forecast quantity is precipitation, the discontinuous nature of the forecast field creates additional complexity in the interpretation of the variogram (Marzban and Sandgathe 2009). Consequently, while *W* is set to a fixed number in this work, it is advisable to set *W* to a value no larger than the spatial extent of the correlations, with the latter determined either on physical grounds, or inferred from the variogram.
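A minimal empirical variogram of the sort used to gauge the spatial extent of correlation can be sketched as follows. A one-dimensional toy field with an assumed correlation scale is used here for simplicity; the actual analysis operates on the two-dimensional forecast fields.

```python
import numpy as np

def empirical_variogram(field, max_lag):
    """gamma(h) = half the mean squared difference between points at lag h."""
    return np.array([0.5 * np.mean((field[h:] - field[:-h]) ** 2)
                     for h in range(1, max_lag + 1)])

rng = np.random.default_rng(2)
x = np.arange(500)
# toy field: smooth signal with a long correlation scale, plus noise
field = np.sin(2.0 * np.pi * x / 120.0) + 0.1 * rng.normal(size=x.size)
gamma = empirical_variogram(field, 60)
# the lag at which gamma levels off estimates the correlation length;
# W would be chosen no larger than that lag
```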

The analysis presented here involves an examination of the sensitivities of COAMPS and of WRF, as well as a comparison of the sensitivities between the two models. If one were interested in the comparison only, a more direct method is possible because the experimental design used here (to generate the computer data for SA) is a paired design (i.e., the 50 values of the 9 parameters are the same between COAMPS and WRF). The main advantage of paired designs is that they generally have higher power than unpaired designs. Said differently, if a difference truly exists between COAMPS and WRF, it is more likely to be detected in a paired design. Even though an examination of the scatterplots of the regression coefficients of COAMPS versus WRF (not shown) suggests that there is little evidence that the sensitivities are paired, the authors have examined the boxplots of the difference between the COAMPS and WRF regression coefficients. The resulting boxplots (not shown) give similar results to those reported here, with a few exceptions. One exception pertains to the effects of Par1 and Par2. Specifically, Par1 has a larger (negative) effect in COAMPS than in WRF, while the pattern is reversed for Par2. It may be worth repeating all of the analysis above using statistical models that are designed for paired data; of course, in such an analysis, the spatial structure of each of the two models can no longer be assessed.
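The power advantage of a paired design mentioned above can be illustrated with a toy example. The per-gridpoint regression coefficients below are synthetic, with an assumed common structure between the two models; none of the numbers come from the actual analysis.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# coefficients for one model parameter at 200 grid points, in two
# models that share most of their spatial structure
coef_coamps = rng.normal(loc=-0.20, scale=0.05, size=200)
coef_wrf = coef_coamps + rng.normal(loc=0.05, scale=0.02, size=200)

paired = stats.ttest_rel(coef_coamps, coef_wrf)     # uses the pairing
unpaired = stats.ttest_ind(coef_coamps, coef_wrf)   # ignores the pairing
# the paired test isolates the systematic offset from the shared
# spatial variability, yielding far stronger evidence of a difference
```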

Many more additions to this work are possible. On the technical side, the multivariate multiple regression model can be extended to include interaction terms between the model parameters. Although interactions are expected based on theoretical considerations, their inclusion will lead to reduced interpretability of the final results. For example, one may not be able to disambiguate the effect of a given parameter from the effect of an interaction. Such confounding of effects is almost always present in a multivariate model involving interactions, and it is the main reason why interactions are not included in the present study.

Another technical extension, specifically to analysis pertaining to precipitation, involves including more precipitation days. This extension may “fill in” the white regions in Fig. 5, and thereby allow an examination of the spatial structure of sensitivity in the western region of the forecast domain.

And, finally, on the nontechnical side, many of the results found here remain unexplained in terms of meteorological processes, either because there are too many to fit within the bounds of a journal article, or because the authors have been unable to construct a viable explanation. Any explanation of these unexplained results is apt to be helpful in separating true results from artifacts of the analysis, and in turn lead to a better understanding of the true relationship between model parameters and forecasts.

## Acknowledgments

This work has received support from the Office of Naval Research (N00014-12-G-0078 task 29) and the National Science Foundation (AGS-1402895). We are grateful to Greg Hakim for invaluable contributions.

Data availability statement: The (computer) data generated and used for the present study are available upon request from the corresponding author.

### APPENDIX

#### Explanation of Model Parameters

This section provides a detailed explanation of the nine model parameters in Table 1.

Par1 is a parameter added in both models that increases or decreases the intensity of turbulent mixing within the boundary layer (BL). As the boundary layer scheme in COAMPS is a 1.5-order closure scheme (see Hodur 1997), Par1 multiplies the mixing length, which in turn modifies the diffusion coefficient at every vertical level in the BL:

*K* = *S*(*P*_{1}ℓ)*e*^{1/2}, (A1)

where *S* is a constant, ℓ is the stability-dependent mixing length, *e* is the turbulent kinetic energy (TKE), and *P*_{1} represents Par1. The default BL parameterization in WRF is the Yonsei University (YSU) scheme (Hong et al. 2006), which does not involve TKE and mixing lengths. Here, the diffusion coefficients are based on the following formulation:

*K* = *kw*_{s}*z*(1 − *z*/*h*)^{*p*}, (A2)

where *k* is the von Kármán constant, *p* is the profile shape exponent, *z* is height above the surface, *h* is the height of the turbulent BL, and *w*_{s} is a stability-dependent turbulent velocity scale. In a manner similar to the modified diffusion coefficient in COAMPS, the BL diffusion coefficient in WRF is simply modified as

*K* → *P*_{1}*K*. (A3)
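The role of *P*_{1} in the two diffusion-coefficient formulations can be illustrated numerically. This is a rough sketch; the constant *S*, the mixing length, the TKE, and the velocity scale *w*_{s} below are arbitrary illustrative values, not values used in either model.

```python
import numpy as np

KARMAN = 0.4  # von Karman constant

def k_coamps(S, ell, e, P1):
    """1.5-order-closure diffusion coefficient, with Par1 (P1)
    scaling the mixing length ell."""
    return S * (P1 * ell) * np.sqrt(e)

def k_ysu(z, h, ws, P1, p=2.0):
    """YSU-style profile k*ws*z*(1 - z/h)^p, scaled as a whole by Par1."""
    return P1 * KARMAN * ws * z * (1.0 - z / h) ** p

z = np.linspace(10.0, 990.0, 99)        # heights inside a 1000-m BL
K_wrf = k_ysu(z, h=1000.0, ws=1.0, P1=1.2)
K_cmp = k_coamps(S=0.39, ell=50.0, e=0.5, P1=1.2)
# increasing P1 scales the mixing up uniformly in both schemes; the
# YSU profile vanishes at the BL top and peaks in the lower-middle BL
```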

Par2 and Par3 modify the temperature increment at the lifting condensation level (LCL) used in the trigger function within the KF cumulus parameterization scheme. The parameters modify temperature increments as follows:

Δ*T* = *P*_{2} + *P*_{3}Δ*T*_{KF}, (A4)

where Δ*T*_{KF} is the increment determined within the KF scheme [using Eq. (1) from Kain (2004)], and *P*_{2} and *P*_{3} represent Par2 and Par3 in Table 1, respectively.

Par4 is the cloud radius involved in the calculation of the entrainment rate of environmental air into convective updrafts [see Eq. (5) in Kain (2004)]. In WRF, this cloud radius *R* is calculated following:

where *R*_{min} (i.e., Par4) is the minimum cloud radius, and *W*_{KL} represents gridscale excess vertical velocity above a specified threshold. In COAMPS, a value of *R* is specified without the dependence on *W*_{KL}.

Par5 is also a constituent parameter within the KF scheme, and represents a specified fraction of convective precipitation fed back to the microphysical parameterization for gridscale precipitation. This parameter represents the detrainment of condensate produced within convective updrafts to their environment, primarily occurring at upper levels.

Par6 and Par7 represent the intercept values of the raindrop and snowflake size distributions, respectively, and affect the precipitating particle number concentrations and hence a number of microphysical processes, such as the mean sedimentation rate of the particles. The parameterization of these microphysical processes is based on Rutledge and Hobbs (1983) in COAMPS, whereas WRF incorporates updates from Hong et al. (2004). For the particular parameters here, the main difference between models involves the addition of a dependency on temperature for the intercept parameter for snow (Par7) in WRF:

n0s = n0s_{ref} exp[0.12(*T*_{0} − *T*)], (A6)

where *T*_{0} is a reference temperature. In COAMPS, n0s is set to a single default value (n0s = 2 × 10^{7} m^{−4}), whereas Par7 in WRF corresponds to n0s_{ref} in Eq. (A6), with a default value of 2 × 10^{6} m^{−4}.

Par8 and Par9 are parameters involved in the parameterized autoconversion of cloud water to rainwater. Par8 controls the rate at which the conversion occurs, while Par9 is the cloud water threshold beyond which autoconversion occurs. Following Kessler (1969), the autoconversion rate is generally expressed as

*P*_{aut} = *α*(*q*_{c} − *q*_{0}), *q*_{c} > *q*_{0}, (A7)

where *q*_{c} is the cloud water mixing ratio, *q*_{0} is the critical cloud water threshold, and *α* is a parameter modulating the conversion of excess cloud water to rain. Equation (A7) is the formulation used in COAMPS. Therefore, Par8 corresponds to *α*, and Par9 corresponds to *q*_{0}. The default microphysical parameterization in WRF uses a modified formulation following Hong et al. (2004). The modified autoconversion is based on the more physically based parameterization of Tripoli and Cotton (1980):

*P*_{aut} = [0.104*gE*_{c}*ρ*^{4/3}/*μ*(*N*_{C}*ρ*_{w})^{1/3}]*q*_{c}^{7/3}*H*(*q*_{c} − *q*_{0}), (A8)

where *μ* is the dynamic viscosity of air, *ρ*_{w} is the density of water, *N*_{C} is the droplet concentration, *g* is the acceleration due to gravity, *E*_{c} is the mean collection efficiency, and *H* is the Heaviside unit step function, which suppresses the autoconversion processes until cloud water reaches a critical mixing ratio *q*_{0}. Furthermore, this critical mixing ratio is expressed as

*q*_{0} = (4*π*/3)*ρ*_{w}*N*_{C}*r*_{cr}^{3}/*ρ*, (A9)

where *ρ* is air density and *r*_{cr} is the critical mean droplet radius at which autoconversion begins. To mimic within WRF the effect of parameters applied in COAMPS through Eq. (A7), and considering the relative uncertainty with which values of the various parameters in Eqs. (A8) and (A9) can be determined, *E*_{c} (collection efficiency) in Eq. (A8) was selected as Par8, and *r*_{cr} (critical droplet radius) in Eq. (A9) as Par9. The range of values for the substitute Par8 and Par9 parameters has been determined such that rough consistency is maintained with the range specified for the corresponding parameters in COAMPS (see Table 1).
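The observation of Hong et al. (2004), cited above, that the two autoconversion formulations yield similar rates for typical cloud water contents can be checked numerically. All constants below are representative assumptions in SI units, not values taken from either model.

```python
def kessler_autoconversion(qc, alpha=1.0e-3, q0=5.0e-4):
    """Kessler-type rate, Eq. (A7) form: alpha*(qc - q0) above threshold."""
    return alpha * max(qc - q0, 0.0)

def tripoli_cotton_autoconversion(qc, qc_crit=5.0e-4, Ec=0.55, Nc=1.0e8,
                                  rho=1.0, rho_w=1.0e3, mu=1.8e-5, g=9.8):
    """Tripoli-and-Cotton-type rate with assumed, representative constants."""
    if qc <= qc_crit:
        return 0.0
    return (0.104 * g * Ec * rho ** (4.0 / 3.0)
            / (mu * (Nc * rho_w) ** (1.0 / 3.0)) * qc ** (7.0 / 3.0))

qc = 1.0e-3                                 # 1 g/kg of cloud water
rate_k = kessler_autoconversion(qc)         # ~5e-7 kg/kg/s
rate_tc = tripoli_cotton_autoconversion(qc)
# for these values the two rates agree to within a factor of ~2,
# consistent with the similarity noted by Hong et al. (2004)
```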

## REFERENCES

*J. Climate*

*Mon. Wea. Rev.*

*Mon. Wea. Rev.*

*Technometrics*

*J. Climate*

*Applied Regression Analysis*

*Bull. Amer. Meteor. Soc.*

*Proc. iEMSs Third Biennial Meeting: Summit on Environmental Modelling and Software*

*R J.*

*Tellus*

*Mon. Wea. Rev.*

*Ocean Dyn.*

*Mon. Wea. Rev.*

*Mon. Wea. Rev.*

*Quart. J. Roy. Meteor. Soc.*

*Forecast Verification: A Practitioner’s Guide in Atmospheric Science.*

*J. Appl. Meteor.*

*On the Distribution and Continuity of Water Substance in Atmospheric Circulations. Meteor. Monogr.*

*Quart. J. Roy. Meteor. Soc.*

*Phys. Fluids A Fluid Dyn.*

*Geosci. Model Dev.*

*Mon. Wea. Rev.*

*Wea. Forecasting*

*Mon. Wea. Rev.*

*Geosci. Model Dev.*

*Mon. Wea. Rev.*

*Meteor. Appl.*

*Mon. Wea. Rev.*

*J. Fluid Mech.*

*Technometrics*

*Design and Analysis of Experiments.*

*J. Roy. Stat. Soc.*

*Geosci. Model Dev.*

*Methods of Multivariate Analysis.*

*J. Geophys. Res.*

*Mon. Wea. Rev.*

*Mon. Wea. Rev.*

*J. Atmos. Sci.*

*Geosci. Model Dev.*

*Global Sensitivity Analysis: The Primer.*

*Comput. Phys. Commun.*

*The Design and Analysis of Computer Experiments.*

*J. Comput. Phys.*

*Math. Model. Comput. Exp.*

*J. Appl. Meteor.*

*J. Atmos. Sci.*

*Statistical Methods in the Atmospheric Sciences.*

*Nonlinear Processes Geophys.*

## Footnotes

^{1}

In Marzban et al. (2018b), it was shown that 50 or 99 different values of the model parameters yield comparable results.

^{2}

Briefly, a type I error refers to the situation where an effect truly does not exist, but the data suggest that it does.