## Abstract

Estimates of the uncertainty of model output fields (e.g., 2-m temperature, surface radiation fluxes, or wind speed) are of great value to the weather and climate communities. The traditional approach for the uncertainty estimation is to conduct an ensemble of simulations where the model configuration is perturbed and/or different models are considered. This procedure is very computationally expensive and may not be feasible, in particular for higher-resolution experiments. In this paper, a new method based on Bayesian hierarchical models (BHMs) that requires just one model run is proposed. It is applied to the Weather Research and Forecasting (WRF) Model’s 2-m temperature in the Botnia–Atlantica region in Scandinavia for a 10-day period in the winter and summer seasons. For both seasons, the estimated uncertainty using the BHM is found to be comparable to that obtained from an ensemble of experiments in which different planetary boundary layer (PBL) schemes are employed. While WRF-BHM is not capable of generating the full set of products obtained from an ensemble of simulations, it can be used to extract commonly used diagnostics including the uncertainty estimation that is the focus of this work. The methodology proposed here is fully general and can easily be extended to any other output variable and numerical model.

## 1. Introduction

The estimation of the uncertainty of a model prediction is a critical issue that has become increasingly relevant in the last decades with the use of more complex and sophisticated numerical models. When a variable like temperature is measured with a thermometer at a given location and time, the uncertainty that goes with the measurement can be readily estimated. In the same way, the model predictions at a given grid box and time optimally should come with an uncertainty estimate. The uncertainty of simulated weather and climate variables has been the subject of a significant number of published works. It is inherent to the chaotic nature of the nonlinear equations solved in weather and climate models (Slingo and Palmer 2011) and has important implications for short-term weather forecasts (e.g., Fox and Wikle 2005), future climate change studies (e.g., Salazar et al. 2016; Overland et al. 2016), ocean (e.g., Bitner-Gregersen et al. 2014), and chemistry (e.g., Malmberg et al. 2008) research and intermodel comparison (e.g., Wang et al. 2016). As discussed in Leutbecher and Palmer (2008), there are two main sources of errors in numerical weather prediction models that result in uncertainties: initial condition errors and model errors. The former arises because the data used to initialize the models are not perfect, with some of these errors amplifying during the simulation and resulting in significant forecast errors. The latter is due to deficiencies in the model’s representation of the dynamics and physics of the atmosphere by numerical algorithms. The traditional way to estimate the uncertainty of a model variable is to perform an ensemble of experiments in which one (or more) aspect(s) of the model configuration is varied: for example, different physics and/or dynamics options are selected (e.g., Evans et al. 2012) and the initial and/or boundary conditions are perturbed (e.g., Torn and Hakim 2009). Another commonly followed approach is a multimodel ensemble (e.g., Salazar et al. 2016). The uncertainty sampled with an ensemble of simulations will, of course, depend on how the ensemble is set up. For example, in a multiphysics ensemble, the uncertainty estimate will reflect the sensitivity of the model’s response to the different treatments of a given process (e.g., boundary layer dynamics, cumulus convection, or cloud microphysics processes). Another possibility is to perturb the initial and/or boundary conditions with some measurement of uncertainty associated with them. In this case, it will be an indication of how the uncertainty associated with the forcing data propagates to the model response. Philosophically, such ensemble methods are fundamentally different from measurement error and natural variability of variables in nature. Ensemble uncertainty estimation methods quantify the differences in various researchers’ implementation and interpretation of nature. We point out that such model ensemble error summaries are conceptually different than natural variability plus measurement uncertainty. These methods are not only time consuming but very computationally demanding; therefore, slowing down the progression into higher spatial resolutions is needed for a more realistic simulation of the atmospheric conditions (e.g., Prein et al. 2015). Providing the uncertainty for a time-dependent spatial field like the 2-m temperature is crucial for a majority of weather and climate applications, not least for climate change projections, as it will help guide policy-makers and stakeholders responsible for developing mitigation strategies. There is a need to develop efficient ways of estimating the uncertainty of a model prediction that uses the least possible amount of computational resources. In this paper, such an approach is proposed that makes use of a single model run and a statistical model, the Bayesian hierarchical model (BHM).

The vast majority of the statistical models used in numerical weather research are based on multiple linear regression, such as model output statistics (MOS) techniques Vislocky and Fritsch 1995, 1997) and the Statistical Hurricane Intensity Prediction Scheme (SHIPS; DeMaria and Kaplan 1994; DeMaria et al. 2005). Statistical models based on artificial neural networks (e.g., Kuligowski and Barros 1998; Sideratos and Hatziargyriou 2007), linear stochastic processes (e.g., Toth et al. 2000), and Bayesian model averaging (e.g., Raftery et al. 2005) have also been used. The BHM proposed here is a hierarchical model inferred using the Bayesian approach (i.e., the different layers in the BHM are linked through Bayes’s rule), which allows both observed data and model parameters to be random variables, resulting in a more realistic and accurate estimation of the uncertainty compared to non-Bayesian approaches (e.g., Haining et al. 2007). In addition, flexible prior information can be incorporated into the Bayesian model: as pointed out by Dormann (2007), prior information (i.e., information that accounts for subjective beliefs or expectations regarding the behavior of a given parameter before it is estimated from the data itself) can be very helpful in modeling spatial autocorrelative effects compared to ordinary nonspatial linear effects. These features of the Bayesian model are behind our choice of this approach for this work.

An example of a problem that has been successfully addressed using a BHM approach is the estimation of the rate of suicide mortality in 32 London boroughs (Blangiardo et al. 2013). The number of suicides in each borough, as well as their location, is known. Assuming that the number of suicides is conditionally independent given its suicide rate, in the sense that the occurrence of one does not affect the probability of a second to occur, the number of suicides will follow a Poisson distribution:

where is the location index with , and is the suicide rate at . The parameter is modeled by a multivariate Gaussian distribution:

which is specified by the parameters , where is the mean vector and is the covariance matrix indexed by ; is a hyperparameter, as it has to be specified before the method is implemented. An appropriate prior distribution will then be chosen for . Once that is done, and through Bayes’s rule, one can derive the posterior distribution,

where is the whole data, and denotes distribution. Finally, can be extracted, giving the full distribution of suicide mortality from which statistics such as the mean and standard deviation can be inferred.

One obstacle, however, hinders the use of a Bayesian model: the approximations step. The normalizing constant used in Eq. (3), relevant for model-selection problems but not important for parameter estimation, is typically intractable with the analytical derivation replaced by numerical approximations that are often very costly. The Markov chain Monte Carlo (MCMC; Steinsland and Jensen 2010) method is a key step to make the inference. The basic idea is to draw a sequence of dependent samples from a target density of interest. As the number of draws tends to infinity, the resulting output behaves as if it were a sample from the density of interest. However, as the data become larger and their inherent structure more complex, this procedure becomes very computationally expensive. What is more, for hierarchical models, such as the BHM used here, the convergence of the MCMC is very slow and unpredictable (De Smedt et al. 2015). The estimation of hyperparameters is also challenging for the MCMC. An alternative to the MCMC method for which faster approximations of posterior distributions even for more complex models are possible is the integrated nested Laplace approximation (INLA; Rue et al. 2009) method. The INLA method uses a combination of analytical approximations and numerical integrations to speed up the estimation of the posterior distribution instead of generating a long sequence of dependent samples such as in the MCMC. Methods such as INLA make the Bayesian model increasingly more popular in several applications such as for spatial or spatiotemporal data in imaging analysis and geographical analysis (Cressie and Wikle 2011; Blangiardo et al. 2013). Even with these new approximation methods, the implementation of a Bayesian model still requires large amounts of computational resources [in particular, random-access memory (RAM)] compared to non-Bayesian approaches. In particular, it takes about 50 min and roughly 12 gigabyte (GB) RAM on a computer with an Intel Xeon central processing unit (CPU) E3-1270, version 3, at 3.5 GHz to implement a single BHM model. Hence, it has to be applied on a high-performance server that can handle big data with a complex dependent structure.

The Weather Research and Forecasting (WRF; Skamarock et al. 2008) Model used in this work is a fully compressible, nonhydrostatic model that uses a terrain-following hydrostatic pressure-based coordinate in the vertical and the Arakawa C grid staggering for horizontal discretization. It has been used in a wide variety of applications including polar meteorology studies (e.g., Hines and Bromwich 2008) and boundary layer research (e.g., Banks et al. 2016). In this work, the uncertainty of the daily mean 2-m temperature is estimated from one WRF run using the BHM. To assess how realistic the WRF-BHM uncertainty estimate is, it will be compared to that obtained from an ensemble of WRF simulations in which different planetary boundary layer (PBL) schemes are employed. The uncertainty that will be targeted in this work is that associated with different treatments of the boundary layer dynamics and representation of the near-surface flow, which play a crucial role in the simulation of the daily mean near-surface air temperature. It is important to stress that WRF-BHM is not capable of generating the full set of products that can be obtained from an ensemble of experiments. An example is the most likely event, such as for precipitation type. Knowing which precipitation type (e.g., rain, snow, ice pellets, or freezing rain) is more likely can be achieved by checking which is predicted by the majority of the ensemble members with precipitation but cannot be obtained from the BHM. As a result, the BHM cannot be considered as a true alternative to ensemble forecasting. Having said that, ensemble products such as the probability of exceedance, box-and-whisker diagrams, background-error covariance for ensemble Kalman filter, and hybrid variational data and uncertainty estimation, the focus of this work, can be obtained in a computationally cheaper and faster way with the BHM. In this paper, the uncertainty estimation is considered for the daily-mean 2-m temperature, but the methodology developed can be easily extended to other fields.

## 2. Experimental setup

In this study, WRF, version 3.7.1, is forced with the National Centers for Environmental Prediction (NCEP) Climate Forecast System Reanalysis (CFSR; Saha et al. 2010) 6-hourly data (horizontal resolution of 0.5° × 0.5°) and is run for one month in the boreal summer (July 2016) and winter (January 2017) seasons. To minimize the accumulation of integration errors, and following Lo et al. (2008), each month’s run is broken into three overlapping 11–12-day periods (e.g., for July 2016, the model is run from 0000 UTC 30 June to 0000 UTC 11 July, from 0000 UTC 10 July to 0000 UTC 21 July, and from 0000 UTC 20 July to 0000 UTC 1 August) with the first day regarded as model spinup.

The physics parameterizations used include the Goddard 6-class microphysics scheme (Tao et al. 1989), the Rapid Radiative Transfer Model for Global Circulation Models (RRTMG) models for both shortwave and longwave radiation (Iacono et al. 2008), the Mellor–Yamada Nakanishi and Niino (MYNN), level 2.5 (Nakanishi and Niino 2006); with Monin–Obukhov surface layer parameterization (Monin and Obukhov 1954); the 4-layer Noah land surface model (Chen and Dudhia 2001); and for cumulus convection the Betts–Miller–Janjić scheme (Janjić 1994) coupled with the precipitating convective cloud scheme described in Koh and Fonseca (2016). A climatological aerosol distribution based on Tegen et al. (1997) is used in the radiation scheme. An interactive prognostic scheme for the sea surface skin temperature (SSKT) based on Zeng and Beljaars (2005) is also added to the model. The lower boundary condition to the SSKT scheme is the 6-hourly SSTs from CFSR, linearly interpolated in time in order to have a continuously varying forcing on the skin layer. The fractional sea ice coverage and sea ice depth are also read from CFSR every 6 h with the sea ice albedo a function of air temperature, skin temperature, and snow (Mills 2011). Gravitational settling of cloud drops in the atmosphere is parameterized as in Duynkerke (1991) and Nakanishi (2000), while cloud water (fog) deposition onto the surface because of turbulent exchange and gravitational settling is treated using the simple fog deposition estimation (FogDES) scheme (Katata et al. 2011).

In addition to the control experiment (model configuration stated above, with the MYNN, level 2.5, PBL scheme), nine additional simulations are performed using different PBL schemes. As the focus of this work is on the 2-m temperature, it makes sense to perturb the PBL scheme as the near-surface fields are highly sensitive to how the boundary layer dynamics are represented. This is particularly true for Scandinavia, which has a very large seasonal contrast: snow-covered terrain, sea ice in the Gulf of Bothnia, and a very short daylight period in winter, contrasting with open water everywhere, snow-free land, and long daylight hours in summer. These sensitivity runs are conducted over a 10-day period at the beginning of January 2017 and July 2016 with a 1-day spinup before. The winter period selected comprises both milder days, when baroclinic systems from the Atlantic prevail, and colder days, when areas of high pressure are in control and rather cold temperatures, dropping below −30°C in some places, occur. The summer period started with a southerly flow and rather warm temperatures, but the weather conditions became more unstable in the middle and latter parts of the period. Both are representative of the different atmospheric conditions experienced in Scandinavia in the cold and warm seasons. The PBL schemes considered are the Yonsei University (YSU; Hong et al. 2006); asymmetric convective model, version 2 (ACM2; Pleim 2007); quasi-normal scale elimination (QNSE; Sukoriansky et al. 2005); Mellor–Yamada–Janjić (MYJ; Janjić 1994); Bougeault–Lacarrère (BouLac; Bougeault and Lacarrère 1989); the University of Washington (UW) scheme (Bretherton and Park 2009), Grenier–Bretherton–McCaa (GBM; Grenier and Bretherton 2001); Shing–Hong (SH) scheme (Shin and Hong 2015); and MYNN, level 3 (MYNN3; Nakanishi and Niino 2006). Further information about these schemes can be found in Banks et al. (2016) and Cohen et al. (2015).

WRF is run in a nested configuration with the domains shown in Fig. 1a. The outermost grid, at a horizontal resolution of 15 km, comprises the entire Scandinavian Peninsula. The 3-km grid includes the full Botnia–Atlantica region that consists of the Nordic counties of Nordland, Norway; Västerbotten, Sweden; and Pohjanmaa, Finland. The cumulus scheme is switched off in the second domain, while in the outermost grid analysis, nudging toward CFSR is employed. The nudging regions are given in Fig. 1b: the water vapor mixing ratio is nudged in the mid- and upper troposphere, whereas the horizontal wind components and potential temperature perturbation are nudged in the upper troposphere and lower stratosphere, with the nudging time scale set to 1 h for all variables. To prevent the reflection of waves near the model top, a sponge layer is included (Skamarock et al. 2008). Figure 1b also shows the model vertical levels: 60 levels, concentrated in the PBL with about 20 levels in the lowest 200 m. The model output of domain 1 is discarded, and that of domain 2 is stored every 1 h and used for analysis.

## 3. Statistical model

The BHM provides a framework to characterize spatial or spatiotemporal data (Cressie and Wikle 2011). It is a statistical model consisting of multiple layers, typically three. The top layer, denoted as data model, is devoted to the data vector **y** modeled by a certain distribution [**y**|], conditional on the latent variables (defined as those that are not directly measured but rather inferred from observed variables) and the parameters . In the second layer, often called latent or process model, the latent variables are modeled by the distribution [|], conditional on the parameters , in order to characterize the complex spatiotemporal dependencies in the data that are unobservable. In the last layer, the prior distributions for the parameters [], are provided. Through the product of the three (conditional) distributions, as shown in the equation below, one can obtain the joint distribution and make inference on and based on **y**, :

Let the data vector , where denotes the daily mean 2-m temperature as predicted by WRF at a given location and time . The top layer is modeled as follows:

where is the mean of , is a Gaussian noise with mean 0 and variance of accounting for measurement error, and ∈ . The are assumed to be conditionally independent. In the second layer, the latent variables are modeled as

where , is the intercept term, is the *m*th covariate, is the *m*th coefficient of , and *M* is the number of covariates. The first two terms on the right-hand side of Eq. (6) express the mean as a multiple linear regression, while is a noise term that accounts for the spatiotemporal effect; is modeled through its first-order neighbor in time and neighbors in space as

where . In Eq. (7), characterizes the temporal correlation between and , and represents the spatial effect specified by a Gaussian Markov random field (GMRF). A GMRF is a random vector following a multivariate Gaussian distribution. Any element of the random vector, given its neighbors, is conditionally independent of all the other elements. A GMRF is used here because (i) usually the data in nature follows a Gaussian distribution, (ii) it is generally a good starting point for statistical methods, and (iii) it approximates the Bayesian inference with high accuracy. Let and where is the marginal variance of, such that is a weakly stationary process in time. As shown in Lindgren et al. (2011), one can represent , a continuous indexed Gaussian field with Matérn covariance (Matérn 1960), as a discrete indexed GMRF with Matérn covariance defined in the triangulation over the spatial domain (i.e., the domain is subdivided into spatial elements that have the minimum number of boundary points; for a two-dimensional space, the domain is subdivided into triangles). Note that the larger the number of vertices in the triangulation, the higher the accuracy of the GMRF representing it, which comes at an increased computational cost. Blangiardo et al. (2013) illustrates how to configure the triangulation in detail; can be used to represent a discrete indexed GMRF with zero mean and covariance defined by

for .

Matérn covariance at time point *t* is defined as

where is the modified Bessel function of the second kind, is the Gamma function, is a smoothness parameter, and is a range parameter. In Eq. (9), represents the Euclidian (i.e., straight line) distance between the sites and . Two important properties of a random field with Matérn covariance are (i) it is stationary and isotropic (when the distance is Euclidean), and (ii) the covariance decreases as the distance between and increases [note that very rapidly approaches zero with increasing ]. This second property of a Matérn covariance appropriately reflects the relationship of the temperature over the spatial domain in our case and justifies the use of this particular covariance in this work. The ratio represents the distance at which the covariance is about 0.1 (Matérn 1960; Stein 1999). If two points are spaced farther apart than this distance, the corresponding random variables are weakly correlated or uncorrelated.

The parameters used in Eqs. (7) and (9) specify the latent spatiotemporal process and are included in the parameter space . In the last layer, some priors are provided for all these parameters:

The posterior mean of a latent variable/parameter is used as its Bayes estimator: for example, is the Bayes estimator of . Posterior means of all the latent variables/parameters are derived from the joint posterior distribution following Eq. (4).

The INLA (Rue et al. 2009) package in the software R is used to implement the BHM as described above. The R code is quite straightforward; it codes the BHM model as presented above directly into the R language, and is freely available online.^{1}

## 4. Methods

In this section, the setup of the BHM and the diagnostics used to evaluate the performance of the WRF-BHM are presented.

### a. WRF data

The focus is on the daily mean 2-m temperature from the 3-km WRF grid, Fig. 1a, which contains 301 × 381 grid points. The BHM is set up in a 20-day period in the winter (1–20 January 2017) and summer (1–20 July 2016) seasons. For each month, the BHM was first fit to the WRF data using different length periods multiple of 5 days and up to 30 days. It was found that there was very little sensitivity in the parameters for a period longer than 20 days. Hence, and for computational reasons, a 20-day period for each season is selected to develop the BHM. The variance is assumed to be the same for all time and space points. While it may be reasonable to consider that at a given grid point, is constant in time (at least for a given month), spatial homogeneity is unlikely: for example, a coastal location is expected to have a smaller temperature variance than an inland site, with the latter more likely to experience extreme values. Given this, possible spatial heterogeneity in is investigated in this study. Four approaches are considered:

Method I: The 10 000 grid points (

*N*= 10 000) are randomly selected from the available 301 × 381 in the model domain. The assumption is that the variance is spatially homogeneous in the domain considered.Method II: The 10 000 grid points (

*N*= 10 000) around coastlines and inland sites randomly are selected. A grid point is considered to be a coastline point if one of its nearest spatial neighbors is totally (or partially) covered by an ocean or sea. It is considered to be an inland point if its nearest coastal point is beyond its third-order neighbors. Here, it is assumed that spatial heterogeneity in arises mostly from the proximity to water bodies, in particular to the Atlantic Ocean and Gulf of Bothnia.Method III: The 10 000 grid points (

*N*= 10 000) at two different elevations (namely, ≤673.2 and >673.2 m) are randomly selected. A threshold of 673.2 m is chosen as to have at least 10 000 grid points in each category. The assumption is that spatial heterogeneity in arises mostly from the elevation of the site.Method IV: The domain is divided into nine regions (shown in Fig. A1) each containing a minimum of 10 000 grid points (

*N*= 10 000). The purpose of this approach is to have the largest possible spatial heterogeneity in with the available data.

In a nutshell, a total of 10 000 grid points × 20 days’ worth of data from the model simulations are used to set up the BHM for each season.

### b. BHM parameter setting

#### 1) Covariates

While the 2-m temperature is a function of many covariates, for simplicity only one, the elevation, is considered (i.e., ). Future extensions of the BHM developed here may consider additional covariates such as the latitude, land-use type, and the proximity to a water body.

#### 2) Prior distributions for parameters

How to build the prior distribution (informative or noninformative approach) of a given parameter is a challenging task in Bayesian analysis and will depend on whether there is preexisting information about the parameter being modeled. Both approaches have their strengths and weaknesses (e.g., Berger 2006). When one has limited information about the parameter, weakly informative prior is generally a good choice (Simpson et al. 2015). In this study, the default specifications in INLA for the distributions of the parameters are used, which correspond to noninformative prior for and weakly informative priors for the others.

The noninformative prior assigned to is Gaussian distributed with a mean of 0 and a precision (inverse of variance) of 0, while the values follow a Gaussian distribution with a mean of 0 and a precision of 0.001. The is set to (1.5 × 10^{−5}), which is defined as a random variable if (Martino and Rue 2010); is assumed to follow a normal distribution with a mean of 0 and a precision of 0.15; and are assumed as . The variable is set to 1, which corresponds to a third-order conditional autoregressive structure under regular lattice data. Although other values for are possible, they have not been fully tested in the INLA package and hence are not considered.

### c. BHM evaluation

Four diagnostics are employed with the aim of assessing different aspects of the WRF-BHM. Diagnostics D1 and D2 are computed at every grid point using the WRF-BHM and the 10 day × 10 day summer and winter experiments and compare the mean and standard deviation obtained from the two methods using the root-mean-square error (RMSE). They are defined in Eqs. (10) and (11) below:

where is the WRF-BHM estimate of the mean 2-m temperature at the *i*th grid point and time , while is the sample mean calculated from the 10 WRF simulations with days. Diagnostic D2 is as D1, but for the standard deviation:

where is the WRF-BHM estimate of the standard deviation for a given method, and is the sample standard deviation at the *i*th grid point and time calculated from the 10 WRF simulations (also days).

The next two diagnostics assess how close the temperature distributions estimated from WRF-BHM and the traditional multiphysics ensemble experiment are. The WRF-BHM temperature distribution is Gaussian by assumption, Eq. (5), while lacking additional information, the temperature distribution from the multiphysics ensemble at a given day and grid point is obtained using the kernel density estimation method with the 10 ensemble members. The third indicator is the Kolmogorov–Smirnov (K-S) statistic (Massey 1951) given by

where and represent the cumulative distribution functions (CDFs) of the WRF-BHM and the ensemble 2-m temperature data at the *i*th grid point and time , respectively. At a given grid point and for a given day, the two CDFs are sampled times with the maximum distance between them being stored. This procedure is then repeated for the other days with the diagnostic returning the maximum distance over days. The calculations were repeated for different values of , and it was concluded that the sensitivity of the K-S statistic to is almost negligible, at least for values in the range 500–10 000 (not shown). The optimal score is 0, in which case the two CDFs overlap each other for the full period, and the worst score is 1.

The fourth diagnostic is the Kullback–Leibler (K-L) divergence (MacKay 2003) defined as

where and are the probability distribution functions (PDFs) of WRF-BHM and ensemble 2-m temperature data at the *i*th grid point and time , respectively. This diagnostic is also computed over days. For a given day, and are sampled 1000 times (i.e., = 1000) in the range [268, 300 K] in the summer period and [238, 290 K] in the winter period to compute the K-L divergence for that day. The K-L divergence scores are found to be not insensitive to the number of samples , at least for values in the interval 500–10 000 (not shown). The ranges of temperatures selected for each season for sampling the two PDFs cover the full range of temperatures obtained with the WRF-BHM and ensemble approaches. The K-L divergence values for each of the 10 days are then averaged to generate the D4 diagnostic. The K-L divergence can be interpreted as a nonsymmetric (i.e., not invariant to a change in the order of and ) measure of the difference between the two PDFs at a given model grid point. As with D3, the optimal score is 0 obtained when the two PDFs overlap, where the closer to zero, the better the agreement between the predicted and observed temperature distributions.

## 5. Results and discussion

### a. Winter experiment (January 2017)

Table 1 shows the summary statistics of the posterior distributions of the BHM parameters for this season. Only parameters statistically significantly different from zero at the 95% confidence interval , where represents the quantile, are shown. In addition, for method IV, only the minimum and maximum of the summary statistics over the nine regions are given; full results are provided in the appendix. The marginal variance , which explains partially the variability of , is not reported as it is not of interest.

Regarding method I, is constant across the domain implying, from Eq. (6), a linear decrease of the 2-m temperature with elevation at a rate of −6.7 K km^{−1}. This value is close to the average environmental lapse rate of −6.5 K km^{−1}. The parameter follows a distribution with a mean of 0.609 K and a standard deviation of 0.001 K, indicating that the uncertainty in the WRF-predicted 2-m temperature is typically of about 0.61 K. The correlation used in defined in Eq. (7) is rather high, suggesting that the temperature variability at a given grid point is strongly linked to that of its closest neighbors. This gives further credibility to the BHM approach followed here, which is built on the premise that the closest neighbors of a given grid point can provide useful information about the state of a field at that grid point. The same conclusion is reached by looking at the last parameter shown, , which indicates the distance between two points beyond which the corresponding temperatures are weakly correlated or uncorrelated. For method I, this distance is typically 10–12 km, which corresponds to 3–4 model grid points, and therefore includes the closest spatial neighbors.

Comparable values are obtained for methods II–IV. For method II, a lower for coastal sites is expected as they are likely to have a reduced temperature variability when compared to inland regions. For method III, the higher-magnitude for the low-terrain category (≤673.2 m) is indicative of steeper lapse rates, closer to the dry adiabatic lapse rate (~9.8 K km^{−1}). This is consistent with the expected strong thermal inversions that develop here as a result of cold-air pooling (e.g., Pepin et al. 2009). The increased for the high-elevation category (>673.2 m) reflects a wider temperature distribution with a more frequent occurrence of extremes. For method IV, a wide range of values for all parameters is obtained, highlighting a strong spatial variability in the domain considered. In line with the findings of methods II and III (as shown in Table A1 in the appendix), the highest values are found in the northern and eastern parts of the domain, closer to the Arctic and featuring more mountainous terrain, with lower values in the southern part of the grid, which is relatively flat and where the steepest lapse rates are found. For all methods, is generally reduced when is larger, indicating an increasing difficulty of the model to capture the 2-m temperature distribution if the latter is broader and contains more extreme values.

Having fully set up the BHM, the results for the four diagnostics will now be presented. Figure 2a shows the diagnostic D1 for the winter period. Over water bodies and southwestern Finland, the RMSE values are rather small, generally less than 0.5 K. Conversely, and mostly over the northeastern part of the domain in northern Sweden and Finland, they are generally in excess of 0.75 K. Here, rather cold surface temperatures associated with stably stratified boundary layers are common in winter (e.g., Pepin et al. 2009). The WRF Model is known to underperform in such environments (e.g., García-Díez et al. 2013) as some of the dominant physical processes are not well modeled (e.g., Steeneveld 2014). By using different PBL schemes that represent the boundary layer dynamics in different ways (e.g., Cohen et al. 2015; Banks et al. 2016), a wide range of values for the 2-m temperature is obtained, resulting in a larger discrepancy between the two approaches. An inspection of the temperature data revealed that for a given day in the 10-member ensemble experiment, a temperature range in excess of 10 K is obtained in the Arctic Scandinavia, whereas farther south the range is generally smaller, typically less than 2 K (not shown). Larger RMSE values are also seen in valleys and low-elevation regions, where cold-air pooling is a common occurrence in winter, and over the high terrain, where the WRF surface-temperature forecasts are known to be sensitive to the choice of the PBL scheme (e.g., Zhang et al. 2013). Elsewhere, the RMSE values are mostly in the range 0.5–0.75 K. The results in Fig. 2a show that the mean 2-m temperature estimates using the two approaches are generally similar (RMSE differences mostly within 0.75 K). Figure 3 shows the results for for the four methods. The spatial pattern of the RMSE values of the standard deviation resembles that of the RMSE of the mean shown in Fig. 2a with higher values in the northeastern part of the domain and lower values elsewhere, in particular in southern and southwestern parts. According to Table 1, is typically of 0.6 K, and hence, the RMSE values obtained here are about a third to half of that magnitude. This indicates that the uncertainty estimated from just one WRF experiment using the BHM approach is comparable to that obtained from an ensemble of simulations with perturbed model physics. This is confirmed by looking at the uncertainties themselves (not shown). As no method is found to clearly outperform another, method I will be considered in the subsequent analysis, given its simplicity.

To illustrate the closeness between the WRF-BHM and ensemble temperature distributions, Figs. 4a and 4c show diagnostics D3, the K-S statistic, and D4, the K-L divergence, respectively. Both metrics are nonnegative with the optimal value being zero. The K-S values are mostly within 0.8, being less than 0.5 over the water bodies and in parts of southern Sweden and southern and western Finland. Here, the WRF-BHM and ensemble 2-m temperature CDFs are expected to be more similar given the smaller RMSE values shown in Figs. 2a and 3. Diagnostic D4 gives a similar picture with the two PDFs being more divergent in the northeastern part of the domain where their means and standard deviations are farther apart. Quantitatively speaking, the scores for the K-S and K-L diagnostics shown in Fig. 4 are not very good. As shown in Fig. 5, in which an explicit comparison of the two temperature distributions at several locations in the model domain is performed, and as discussed in more detail below, the lower K-L (and K-S) scores arise from a more significant deviation of the ensemble temperature distribution from a Gaussian shape.

As to allow for a more direct evaluation in the left panels of Fig. 5 and at a given day and for two grid points over land and water for which higher and lower D1, D2, and D4 (similar results for D3) scores are obtained, the WRF-BHM and ensemble temperature distributions for the winter season are plotted. It is important to note that similar results are obtained at other grid points (not shown), and hence these curves are representative of those obtained elsewhere. Regarding the shape of the distribution, and despite the presence of a secondary maximum, the ensemble temperature distribution generally resembles a Gaussian distribution, supporting the assumption of a Gaussian shape made in section 3 for the BHM. What is more, and as expected, the agreement between the curves is better if the D1, D2, and D4 scores are higher, which also confirms that the use of these verification diagnostics for this work is justified. For all the winter results, the mean value of the WRF-BHM temperature distribution is within 1 K of that obtained with the ensemble of simulations. Because of the presence of the referred secondary maximum, possibly arising from the small number (10) of ensemble members considered, there are some differences in the spread of the two distributions. In any case, and when taken as a whole, the results given in Fig. 5 highlight that the simpler and computationally cheaper WRF-BHM approach can be used to estimate the ensemble-generated temperature distribution at least for this region and season.

### b. Summer experiment (July 2016)

Table 2 is as Table 1, but for the summer experiment. For this season, is in the range from −6 to −8 K km^{−1}, typical values of the environmental lapse rate, and is mostly between 0.3 and 0.5 K. A comparison with the corresponding winter values reveals a smaller for the summer period. During the warm season, the occurrence of cold and stably stratified environments is infrequent with the longer daylight leading to a more convectively active boundary layer and stronger vertical mixing. This is likely to lead to a smaller variability of the 2-m temperature, which is reflected in the reduced values of and in the higher values of . These results have an important implication: temporal homogeneity cannot be assumed at least for some of the BHM parameters for a full year. Should this technique be extended to comprise a complete annual cycle, in a high-latitude region with complex topography such as Botnia–Atlantica, the BHM would have to be fitted to each season (or perhaps each month) separately. Nevertheless, the computational resources needed to do so are much less than those required to perform an ensemble of simulations.

The correlation obtained with WRF data is still high in excess of 0.5 but smaller than the winter one by up to 30%. The lower correlation values seem to come mostly from the coastal regions where strong thermal circulations that normally develop in this season (e.g., Smedman et al. 1995) may not be well represented by the model. The higher for high-elevation sites noted in the winter experiment in method III is also present, while the steeper lapse rates obtained for the low-elevation regions in the cold season are no longer seen as cold-air pooling is not significant in the summer. The results for method IV (given in Table A2) show generally smaller differences in the parameter values for the nine regions compared to the winter experiment. As is the case for the cold season, is lower when is higher, reflecting an increasing difficulty of the BHM to simulate the observed temperature variability when the latter is larger.

The results for diagnostics D1 and D2 for the summer period are given in Figs. 2b and 6, respectively. The RMSE for the mean temperature is smaller than that for the winter season with values generally less than 0.5 K everywhere in the domain. RMSE values in excess of 0.5 K are seen mostly in coastal regions. As stated before, land–sea breeze circulations and associated temperature variations are a common occurrence in the summer along coastlines, and their representation is sensitive to the PBL option chosen (e.g., Steele et al. 2013, 2015). Hence, and at these sites, a larger difference between the predicted 2-m mean temperatures using the two methods is likely to occur. An inspection of the temperatures for the ensemble experiment reveals that in coastal regions the predicted 2-m temperatures can differ by as much as 3 K, whereas at inland sites they are mostly within 1 K (not shown). Larger RMSE values are also seen over the high terrain where the model temperature forecasts are known to be sensitive to the PBL scheme used (e.g., Zhang et al. 2013). As is the case for the diagnostic D1, the RMSE of the standard deviation estimates, given in Fig. 6, is also generally smaller than that obtained in the winter season. As for the summer period, is typically of 0.3–0.5 K (Table 2) the RMSE values are about half of that magnitude, and hence, the uncertainty estimated using the WRF-BHM is comparable to that obtained from the ensemble of simulations. The only regions for which diagnostic D2 exceeds 0.4 K are a portion of central-eastern Sweden and parts of Finland. Here, there is a larger spread in the temperatures predicted by the different members of the ensemble (not shown). It is possible that more regional and local-scale circulations develop, leading to an increased sensitivity to the choice of the PBL scheme. A further investigation into why the variability is larger in these regions is beyond the scope of this paper. Similar scores are obtained regardless of the assumptions made on the spatial heterogeneity of . As for the winter season, method I is selected for further analysis given its simplicity.

Figures 4b and 4d show the K-S and K-L divergence diagnostics for the summer season. The magnitude and spatial pattern of the two scores are similar to those obtained in the winter season with lower values mostly over the Gulf of Bothnia and Atlantic Ocean and higher values in some inland sites. As opposed to the winter period, the lowest scores are not seen in the northeastern (Arctic) part of the domain but in some coastal and high-terrain regions. Here, as shown in Figs. 2b and 6, the mean and standard deviation of the 2-m temperature distributions obtained using the WRF-BHM and multiphysics ensemble differ the most, so the correspondent CDFs and PDFs are expected to be farther apart. These lower scores can be attributed to a more significant deviation of the ensemble temperature distribution from a Gaussian shape as seen in Fig. 5.

The right panels in Fig. 5 show the two temperature distributions at four grid points for the summer season (similar results are obtained at other locations in the domain, not shown). As for the winter case, the WRF-BHM distribution generally resembles that obtained with an ensemble of simulations with the respective means being within 1 K of each other. The presence of a secondary maximum in the ensemble temperature distribution may be an artifact of the limited number of samples used. The results presented here highlight the success of the proposed WRF-BHM formulation in simulating the daily mean 2-m temperature distribution as given by the multiphysics ensemble.

## 6. Summary and conclusions

When running a numerical model, the gridbox-averaged value is a standard output, but the uncertainty that goes with that prediction is generally not provided. A common way to estimate the uncertainty associated with weather and climate projections is to perform an ensemble of model simulations. This can be achieved in different ways, for example, by changing the model physics and/or dynamics configuration, perturbing the initial and/or boundary conditions, or even by considering different models. In this paper, we propose a novel approach that only requires one simulation: fitting a BMH to the model data and from the estimated distribution extracting the uncertainty of a given variable (here, the 2-m temperature, a crucial field for climate projections, is considered). It is important to note that WRF-BHM is not a true alternative to ensemble forecasting in the sense that it cannot generate the same set of metrics. For example, the most likely event (e.g., for precipitation type) cannot be obtained with the BHM. However, and for commonly used ensemble products such as the uncertainty estimation, probability of exceedance, and background-error covariance for ensemble Kalman filter and hybrid variational data assimilation schemes, it provides an easy to apply, computationally cheap, and faster alternative to ensemble forecasting. We would also like to stress that, although not discussed here, the BHM can also be run in prediction mode. In particular, it can analyze data possessing spatiotemporal dependence and make use of the estimated dependence to predict the values of a given field at different locations and time periods. However, it can be argued that using the BHM for an understanding of the spatiotemporal dependency inherent to a dataset is more relevant than using it for prediction purposes.

WRF is run at 3-km spatial resolution over the Botnia–Atlantica region for one month in the winter (January 2017) and summer (July 2016) seasons. In addition, an ensemble of simulations is generated with 10 different PBL schemes for the first 10 days of January 2017 and July 2016. The uncertainty sampled in this work is that associated with the way the boundary layer dynamics and near-surface flow are represented in the model and their impact on the daily mean 2-m temperature predictions. A fundamental assumption of comparing the BHM to the multiphysics ensemble is that the error is primarily associated with PBL physics uncertainties and that the multiphysics ensemble provides a meaningful estimate of that.

The BHM uses nearby information in both space and time to estimate statistical properties of our WRF variable of interest at each grid location. The 2-m temperature, at a given time and spatial location, is assumed to follow a normal distribution with mean and standard deviation . Four assumptions are made to explore spatial heterogeneity in , ranging from spatial homogeneity to largest possible heterogeneity. Four diagnostics are used to evaluate the performance of the BHM. In the first two, the temperature mean and standard deviation estimated from WRF-BHM is compared to that obtained from the referred ensemble of simulations. The other two diagnostics give information regarding the degree of overlap between the two distributions. For a given day and model grid point, the ensemble temperature distribution is calculated using the kernel density estimation method with the 10 ensemble members.

The BHM parameters obtained with WRF data are consistent with the expected values. Given the differences in some of them between the two seasons, and at least for this region, it is not advisable to consider just one BHM for the full year. Studies that wish to extend this work for a complete annual cycle should fit a BHM to at least each season separately, as the regional and local-scale dynamics are very different throughout the year. For both summer and winter seasons, the mean and uncertainty estimated using the WRF-BHM are found to be generally comparable to those obtained from an ensemble of simulations with different PBL schemes. The exceptions are (i) Arctic Scandinavia and low-level regions in winter, where cold and stably stratified boundary layers are ubiquitous; (ii) coastal regions in the summer, where land–sea breeze circulations are more significant; and (iii) over the high terrain in both seasons. Here, the WRF-predicted 2-m temperature has been reported to be particularly sensitive to the choice of the PBL scheme (e.g., García-Díez et al. 2013; Steele et al. 2015; Zhang et al. 2013) resulting in larger differences in the simulated temperature distributions using the two approaches. A direct inspection of the WRF-BHM and ensemble temperature distributions revealed the presence of a secondary maximum in the latter, with this stronger deviation from a Gaussian shape probably explaining the lower scores. This feature possibly arises from the limited number of ensemble members considered. Despite differences in the spread, the means of the two distributions are very similar, within 1 K of each other. Hence, the WRF-BHM approach can be used to estimate the uncertainty and, more generally, the distribution of the 2-m temperature as given by the multiphysics ensemble. Since these results are comfortably interpreted based on our general understanding of the weather variables for the Scandinavian Peninsula, we expect the application of the BHM for estimation of modeled weather variable uncertainty to perform similarly in other temporal and spatial domains. What is more, if the ensemble WRF distributions displayed in Fig. 5 were associated with actual surface observations, and we used an appropriate inference method to assess the equality of the distribution pairs, the approach followed in this work then allows a statistical assessment on the equality of the model results and observations. Such inferential statistical analyses have important applications for geospatial research, including climate change.

In conclusion, the main contributions of this work are as follows: (i) we apply the BHM method to estimate the uncertainty of a WRF output variable (2-m temperature) using space- and time-dependent data; (ii) we prove the consistency of the method by demonstrating that the uncertainty estimate correctly summarizes the aggregated space/time data; (iii) we show that the temperature distribution obtained with WRF-BHM is similar to that computed from an ensemble of model simulations with perturbed physics. The methodology developed here is fully general and can easily be applied to any other variable and numerical model. Possible future extensions of this work include (i) employing more covariates in the process model; (ii) developing a data-driven and/or a higher-order autoregressive model in time; (iii) considering a wider domain and/or higher spatial resolution as well as temporal extension in order to have a larger number of pixels that will allow for a better fit of the BHM; (iv) extending the GMRF in space from stationary to nonstationary based, for example, on latitude and land-use type; (v) applying the latent model on instead of on to reflect heterogeneity within one BHM; and (vi) develop more efficient algorithms to speed up the implementation of the BHM that requires a significant amount of computer power. In addition, the BHM method can be used in conjunction with an ensemble of simulations to model the distribution of a particular variable. Some of these improvements will be presented in a subsequent paper.

## Acknowledgments

We acknowledge Botnia–Atlantica, an EU program financing cross-border cooperation projects in Sweden, Finland, and Norway, for their support of this work through the WindCoE project. We thank the High Performance Computing Center North (HPC2N) for providing the computer resources needed to perform the numerical experiments presented in this paper. We thank two anonymous reviewers for their detailed and insightful comments and suggestions that helped to improve the quality of the paper.

### APPENDIX

#### BHM Parameters for Method IV

In this appendix, the values of the parameters for each of the nine regions (Fig. A1) of method IV are given in Tables A1 and A2 for the winter and summer experiments, respectively.

## REFERENCES

*Statistics for Spatio-Temporal Data.*Wiley, 624 pp.

*Information Theory, Inference and Learning Algorithms.*Cambridge University Press, 640 pp.

*Implementing Approximate Bayesian Inference using Integrated Nested Laplace Approximation: A manual for the INLA program.*Department of Mathematical Sciences, Norwegian University of Science and Technology, 72 pp.

*10th Annual School of Earth, Society, and Environmental Research Review*, Urbana, Illinois, University of Illinois at Urbana-Champaign.

*Interpolation of Spatial Data: Some Theory for Kriging.*Springer-Verlag, 228 pp.

*Joint Statistical Meeting 2016*, Chicago, Illinois, Section on Statistics and the Environment, 2886–2898.

## Footnotes

© 2019 American Meteorological Society. For information regarding reuse of this content and general copyright information, consult the AMS Copyright Policy (www.ametsoc.org/PUBSReuseLicenses).