## 1. Introduction

Mountain glaciers around the world have experienced a large loss of ice volume throughout the twentieth century (e.g., Cogley 2010; Leclercq et al. 2011), and this ice loss is projected to continue in the future (e.g., Marzeion et al. 2012; Radić et al. 2014). Reductions of the mountain cryosphere, in turn, imply serious consequences for many natural and human systems worldwide (e.g., Carey 2010; Kaser et al. 2010; Huss 2011; Marzeion et al. 2012; Giesen and Oerlemans 2013). The primary obstacle that hinders accurate modeling of glacier mass changes—at both regional and global scales—is the scarcity of the local-scale atmospheric information required as input for process-based glacier mass balance models (e.g., Hock and Holmgren 2005; Mölg et al. 2009). To date, research groups around the world invest considerable effort in maintaining in situ observation systems on and near mountain glaciers (e.g., Munro and Marosz-Wantuch 2009). While the observations obtained through these efforts represent an invaluable source of information on local-scale atmospheric variability at high mountain sites, three constraints limit their immediate applicability to the study of climatic change and its impacts. In situ observations from glacierized mountain environments (i) usually cover only short periods in time (typically a few years), (ii) provide a nonhomogeneous, generally poor spatial coverage (both within individual regions and across regions), and (iii) are often of limited accuracy given the extremely challenging circumstances under which they are obtained (e.g., precipitation falling primarily as snow, high wind speeds, high solar radiation, infrequent maintenance visits to remote stations that are difficult to access, lightning strikes, and instrument theft or damage; e.g., Juen 2006; Cogley 2011; Scheel et al. 2011; Rangwala and Miller 2012).

Often, as an attractive alternative to in situ atmospheric observations from glacierized mountain environments, the output of global numerical atmospheric models (GNAMs) is considered (e.g., Marzeion et al. 2012; Giesen and Oerlemans 2013; Radić et al. 2014). The output of GNAMs is available for extensive periods in time, including the future, at high temporal resolutions, and on a homogeneous grid that covers the entire globe. However, GNAMs are limited in terms of their spatial resolution, which ranges from several tens to several hundreds of kilometers and is insufficient for a realistic representation of the atmospheric boundary layer. The scale discrepancy between GNAMs and local-scale atmospheric states is critical especially over complex terrain, where mountain glaciers are often located (cf. the model topography of the ERA-Interim for the European Alps, the northern tropical Andes, and New Zealand in Fig. 1). In the fields of climate and weather research, so-called atmospheric downscaling techniques have been developed to correct the information provided by the large-scale GNAMs to more realistically represent local-scale atmospheric states (e.g., Flato et al. 2013).

Dynamical downscaling (DD) is limited-area atmospheric modeling. It consists of solving the governing equations of atmospheric motion, mass, and energy conservation for a limited spatial domain, requiring time-dependent output by a GNAM as lateral boundary conditions. DD yields a physically consistent set of multiple atmospheric parameters and therefore represents an attractive downscaling option in the context of glacier modeling (e.g., Kotlarski et al. 2010; Mölg and Kaser 2011; Prasch et al. 2013). However, the major drawback is that DD is computationally expensive, which leads to limitations in the achievable spatial resolution of long-term simulations for climate change research: for example, 12 km in the case of the European domain of the Coordinated Regional Climate Downscaling Experiment (Kotlarski et al. 2014). Therefore, DD output usually needs further postprocessing prior to its application in hydrological and glaciological models (e.g., Marke et al. 2011). A critical concern about DD currently being investigated is the value added through its application (e.g., Flato et al. 2013). For example, Eden et al. (2014) found no clear added value of the statistically postprocessed output of two regional climate models over statistically postprocessed GNAM output for daily precipitation measured across the United Kingdom.

Statistical downscaling (SD) represents the computationally inexpensive alternative to DD. A large diversity of SD approaches exists, ranging from simple linear bias corrections to nonlinear algorithms, all critically dependent on the high-quality observations needed to train the empirical models. From a conceptual point of view, SD may be categorized according to the statistics of interest that are optimized: for example, regression-type SD considers the temporal sequencing, while bias correction (or quantile mapping) approaches adjust temporal mean values (or marginal distributions, respectively). A major drawback of SD is that the statistical procedures do not guarantee physical consistency. Active challenges concern the simultaneous and consistent consideration of multiple statistical properties with relevance for impact models, for example, to realistically model intervariable and intersite dependency structures or to accurately represent trends of all quantiles (with a particular focus on extreme values) when correcting the marginal distributions (e.g., Cannon et al. 2015; Cannon 2016). Regression-type approaches underestimate the observed variability (and thus trends) by construction unless the model fits the observations perfectly (e.g., Bürger 2014). Stationarity of the empirical relationships under altered climatic circumstances is a key assumption upon which all SD approaches rely (e.g., Frías et al. 2006). A nontrivial ingredient of SD procedures—also relating to the stationarity assumption—is the identification of the information in a GNAM that is relevant for the local-scale weather, that is, the selection of the large-scale predictors.
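The marginal-distribution-correcting SD flavor mentioned above can be made concrete with a minimal sketch of empirical quantile mapping on synthetic data (the function name and setup are illustrative, not taken from the study):

```python
import numpy as np

def quantile_map(model_series, model_train, obs_train, n_q=101):
    """Empirical quantile mapping: transform model values so that their
    marginal distribution matches the observed (training) distribution."""
    q = np.linspace(0.0, 1.0, n_q)
    model_q = np.quantile(model_train, q)  # model quantiles (training period)
    obs_q = np.quantile(obs_train, q)      # observed quantiles (training period)
    # each model value is passed through the model CDF and then the
    # observed inverse CDF; values outside the training range are clipped
    return np.interp(model_series, model_q, obs_q)
```

Note that, in contrast to a regression model, this mapping constrains only the marginal distribution of the corrected series and ignores the temporal sequencing entirely.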

Our study systematically explores the potential of different predictor strategies for improving the performance of regression-based downscaling approaches. The ERA-Interim reanalysis data serve as the large-scale predictors. The downscaling experiments are carried out for three distinct measurement sites, all located within glacierized mountain environments but in very different orographic and climate settings: 1) the Vernagtbach station in the European Alps, 2) the Artesonraju measuring site in the tropical Andes, and 3) the Mount Brewster station in New Zealand. The investigated local-scale target variables include all variables with relevance for sophisticated surface energy balance–based glacier mass balance models (e.g., Mölg et al. 2009; Weber and Prasch 2016): precipitation (prc), air temperature (*t*), wind speed (ws), relative humidity (rh), and global radiation (gr). The different parameters are considered at a daily time resolution (i.e., daily mean values for *t*, ws, rh, and gr and daily sums for prc). A decisive advantage of regression-based downscaling approaches in the context of this study is their ability to provide statistically significant inference based on short time series of only a few years (e.g., Hastie et al. 2009; Hofer et al. 2015). However, it is important to note that regression-based downscaling can be established only with GNAMs that have a realistic time sequencing (such as reanalysis data or numerical weather prediction model output, but not free-running climate model simulations; e.g., Maraun et al. 2010). This is because regression models relate the predictors to the target variable’s time series in sequence (i.e., the time sequencing links each model state to an observed state). The ultimate goal of this analysis is to extend short-term, high-resolution time series from glacierized mountain environments into the past in order to provide a well-grounded knowledge base for estimating future changes.

Section 2 provides a general literature overview on predictor selection in SD. Sections 3 and 4 describe the observational and reanalysis datasets used in this study. Section 5 gives an overview of “sDoG,” the regression-based downscaling code that was developed in the framework of the Statistical Downscaling for Glacierized Mountain Environments project and was applied here. In section 6, different predictor strategies found in the literature are applied and evaluated in a systematic manner for the different target variables and sites. Results are summarized and conclusions are drawn in section 7.

## 2. Predictor selection in SD

Predictor selection in SD involves the choice of 1) the predictor variable types (such as *t*, rh, prc, etc.); 2) the spatial allocation in the GNAM, in terms of the grid point(s) and vertical level(s) at which the parameters are considered; and 3) the specific GNAM(s) to be considered (e.g., reanalysis datasets by one or several institutions). In the SD literature, two conceptually different approaches to predictor selection can be distinguished. First, the so-called direct approaches exist, which are univariate approaches with the predictor being a single variable from the GNAM, namely, the equivalent of the target variable (e.g., prc as predictor for prc). In terms of spatial allocation, the direct predictors are extracted at the grid point(s) that geographically correspond(s) to the site of interest. Direct approaches represent the usual choice in the field of numerical weather prediction ensemble postprocessing (e.g., Thorarinsdottir and Gneiting 2010; Messner et al. 2014) but are becoming increasingly popular in downscaling for climate applications as well (e.g., Schmidli et al. 2007; Eden and Widmann 2014). Second, the so-called traditional approaches involve multivariate predictor sets, in terms of different quantities, grid points, and/or vertical levels (prc, rh, etc., from multiple grid points and vertical levels). Thus, the direct approaches assume that for a local-scale target variable, the large-scale equivalent integrates all relevant information in the GNAM. The traditional approaches are less restrictive, assuming only an empirical link between a local-scale variable and a wider set of large-scale atmospheric information provided by the GNAM.

For direct SD approaches, it is important to note that the spatial allocation within a GNAM of a predictand measured at 2 m above the ground is not without ambiguity. In particular, would the better predictor be (i) the corresponding surface variable from the GNAM at 2-m height or (ii) the corresponding variable at the pressure level of the study site? The discrepancy between the surface pressure (or altitude) in a GNAM and the true surface pressure (or altitude) at the site of interest increases with the true topographic complexity and with decreasing horizontal resolution of the GNAM. For example, Hofer et al. (2015) showed that for daily air temperature measured at 5050 m in the glacierized Cordillera Blanca (Peru), air temperature data from the ERA-Interim considered at the pressure level of the study site (550 hPa) have considerably higher skill than reanalysis air temperature data at 2 m above the surface. Culver and Monahan (2013) argue that variables from above the atmospheric boundary layer are generally better represented within GNAMs, because dynamic scales above the planetary boundary layer are relatively large, whereas the performance of GNAMs within the atmospheric boundary layer is limited by the approximations made necessary by their discretization (e.g., Giorgi and Bates 1989). The discrepancy between the model surface and the pressure level of the study site concerns the target variables air temperature, wind speed, and relative humidity but not precipitation and global radiation, because these two variables are available only as surface variables from the GNAMs.
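To illustrate option (ii), the pressure level matching a site can be estimated from its altitude with a simple barometric approximation. The scale height and the level list below are illustrative assumptions for this sketch, not the study's procedure:

```python
import numpy as np

# illustrative subset of reanalysis pressure levels (hPa)
LEVELS = (1000, 925, 850, 700, 600, 550, 500, 400, 300)

def site_pressure_hpa(altitude_m, p0=1013.25, scale_height_m=8434.0):
    """Rough site pressure from an isothermal barometric formula."""
    return p0 * np.exp(-altitude_m / scale_height_m)

def nearest_level(altitude_m, levels=LEVELS):
    """Pick the pressure level closest to the site's estimated pressure."""
    p = site_pressure_hpa(altitude_m)
    return min(levels, key=lambda lv: abs(lv - p))
```

For a site at 5050 m this yields roughly 557 hPa and hence the 550-hPa level, consistent with the level used by Hofer et al. (2015) for the Cordillera Blanca.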

In the SD literature, the vast majority of studies concentrate on prc, *t*, and ws as the target variables, whereas only very few studies focus on rh (or humidity variables in general) or gr. In the case of traditional SD (here defined as SD that considers multivariate predictor sets), the predictor sets found in the literature differ widely. Table 1 shows a selective literature review examining the variety of applied predictor sets for each of the target variables in the present study. All parameter abbreviations used in Table 1 (and throughout this paper) are defined in Table 2. Across the different studies, it is difficult to identify a common strategy with regard to the specific predictor sets applied. However, we can stratify the traditional SD approaches into the following options: L, V, G, VL, GL, GV, and GVL (see Table 2). Option L denotes the use of predictor data from different vertical levels, V refers to the use of different physical quantities or parameters (such as prc, tc, etc.), and G denotes the use of horizontal gridpoint fields (instead of single-gridpoint data). Options VL, GL, GV, and GVL represent the corresponding combinations of L, V, and G. With the exception of GVL, all these multiple predictor options are evaluated in a systematic manner in the second part of this study.

Table 1. Literature review on predictor variable sets applied in traditional SD approaches that focus on one or several of the target variables in this study at daily or higher time resolutions. The boldface letters in parentheses (**V**, **VL**, etc.) indicate the multiple predictor options as categorized in the present study to which each of the individual approaches corresponds most closely. For the definitions of the abbreviations, see Table 2.

Table 2. Predictor specifications and abbreviations used throughout the paper. All variables without specification S are from the pressure level close to the study site.

Predictor sets applied for prc often include a combination of circulation and humidity variables (with or without prc itself as a predictor). Recently, several studies showed that simulated prc represents an important predictor for local-scale prc (e.g., Schmidli et al. 2006; Eden and Widmann 2014; Eden et al. 2014) and, in particular, that direct approaches outperform traditional SD (e.g., Widmann et al. 2003; Themeßl et al. 2011). On the one hand, it is intuitive that simulated prc shows high skill for local-scale prc, assuming that simulated prc already integrates all relevant information in a GNAM in the physically most consistent way. On the other hand, traditionally little confidence has been attributed to simulated prc, because (i) prc is known to be strongly influenced by small-scale processes that are not fully represented in the GNAMs (e.g., Prein et al. 2015) and (ii) prc measurements carry large uncertainties, which generally inhibits the assimilation of prc measurements into the GNAMs (e.g., Dee et al. 2011). Both (i) and (ii) are even more problematic for regions of complex topography, such as glacier-covered mountains. Thus, when focusing on prc in glacierized mountain environments, it is still unclear whether direct or traditional predictor selection is the more suitable downscaling option.

For *t*, Hofer et al. (2015) demonstrated that *t* from the ERA-Interim showed higher skill than several other potential predictors. Traditional approaches mostly consider predictor sets that include *t* together with circulation variables (e.g., *z* and slp), sometimes also humidity variables, and often over spatial variable fields (multiple predictor options GV, VL, or GVL). For the target variable ws, traditional approaches include ws, *u*, and/or *υ* in most cases. Other frequently applied predictors are vo and dp (or dsp), as well as *t* and rh. As mentioned above, the remaining two target variables in this study, rh and gr, have rarely been the target of SD studies. For rh, circulation variables and *t* have been applied in addition to humidity variables. Huth (2005), however, found that upper-air humidity variables are the most efficient predictors and that adding circulation and/or temperature variables brings marginal or no improvement. For daily gr as a target variable, Iizumi et al. (2012) included *t*, circulation and humidity variables, and gr itself as predictors. Several studies with a focus on cloud cover (closely related to gr) use circulation, temperature, and humidity variables as predictors. Only a small number of SD studies have compared different predictor sets (e.g., Huth 2005; Cavazos and Hewitson 2005; Salameh et al. 2009; Hofer et al. 2015), and some have applied the same set of predictors to different predictand variables (Enke and Spekat 1997). It may be concluded that the final choice of the predictor set for traditional SD usually remains somewhat arbitrary (e.g., Hofer et al. 2010).

## 3. Study sites and observations

In this study, the efficiency of various predictor strategies is quantified with a focus on three different measurement sites, namely, Vernagtbach (VERNAGT), Artesonraju (ARTESON), and Mount Brewster (BREWSTER). All three sites are situated close to a mountain glacier but in three very distinct climate settings. As described below, each of the target variables considered here is measured at each site, all year round, and for more than three years. Parts of the data are published in earlier studies (e.g., Hofer et al. 2010, 2012; Escher-Vetter et al. 2014; Cullen and Conway 2015). Figure 1 details the topography in which the three sites are situated, and Table 3 gives an overview of their specific locations and the instrumentation relevant for this study.

Table 3. Locations of and measurement sensors used at VERNAGT, ARTESON, and BREWSTER.

VERNAGT is located at an altitude of 2640 m in the Vernagtbach basin in the Austrian part of the European Alps. The European Alps are characterized by strong spatial gradients of weather and climate, resulting from the interaction of the 1200-km-long and approximately 200-km-wide mountain range with distinct atmospheric circulation patterns embedded in the prevailing westerly flow (e.g., Schmidli et al. 2002). VERNAGT is situated in a dry inner-alpine valley very close to the main alpine crest. Mean annual prc at VERNAGT amounts to about 1500 mm (determined for the period 1979 to 2006; Braun et al. 2007). Measurements at VERNAGT considered in this study cover the period from January 2002 to December 2012, as published in Escher-Vetter et al. (2014). Regarding prc, automated measurements during winter were problematic at VERNAGT (because of snow covering the weighing gauge, over- or underestimation of prc caused by snowdrift, etc.). Therefore, a homogenized, quality-controlled time series is used for prc that is not yet published in Escher-Vetter et al. (2014) and is preliminarily available only from 2009 to 2011. This homogenized prc time series is based on a semiautomated joint analysis of 1) the continuous records of a weighing rain gauge, a tipping-bucket rain gauge, and a sonic ranging sensor (all three devices installed at VERNAGT within a few meters of each other), 2) the *t* and ws records at VERNAGT, 3) available photos of VERNAGT from automatic cameras, and 4) the prc records of a station installed in the nearby village of Vent, Austria. Information from the latter three sources is used mainly to decide which of the three devices listed in the first source provided the most plausible values in individual situations.

ARTESON consists of two weather stations, one installed on a moraine of the Artesonraju Glacier (AWS2; at 4827 m MSL) and one on a moraine of the Caraz Glacier (AWS1; at 5050 m MSL), at a distance of less than 1 km from each other in the Paron Valley (northern Cordillera Blanca, Peru; see also Juen 2006). The Cordillera Blanca is the most extensively glacier-covered tropical mountain range, with peaks reaching up to almost 7 km MSL. The midtropospheric flow in the Cordillera Blanca is dominated by easterly winds all year round (with only the meridional component varying; e.g., Georges 2005), and anomalies of the large-scale flow play a key role in the interannual precipitation variability (e.g., Vuille et al. 2008). Annual mean prc in the Paron Valley amounts to 787 mm water equivalent (WE, averaged over the years 1953–94; Juen 2006). ARTESON data series used in this study are available from March 2007 to July 2011 (prc, measured at AWS2), July 2006 to May 2012 (*t* and rh, measured at AWS1), and November 2004 to November 2011 (gr and ws, measured at AWS2). The *t* and rh series from AWS1 are used because they represent the longest and most reliable time series of these parameters at ARTESON. The prc, ws, and gr data used here are from AWS2, because prc was not measured at AWS1, the wind sensor at AWS1 was affected by technical failures, and the gr time series is longer at AWS2 than at AWS1.

BREWSTER is situated in the Southern Alps of New Zealand, at an altitude of 1650 m, next to the proglacial lake of Brewster Glacier. BREWSTER represents an example of a Southern Hemisphere midlatitude site within a mountain range that is surrounded by ocean and immediately exposed to the prevailing westerly airflow. The 600-km-long, approximately 2.5-km-high Southern Alps represent a barrier to this westerly airflow, resulting in considerable rates of orographic precipitation that likely exceed 12 m WE per year (e.g., Cullen and Conway 2015 and references therein). The BREWSTER data used in this study, described in more detail in Cullen and Conway (2015), cover the period from June 2010 to October 2015. Precipitation measurements at BREWSTER were typically available only for the months November–May (with the exact dates varying from year to year) because of deep snow covering the rain gauge during winter and spring. Therefore, winter prc was constructed by scaling a precipitation time series measured at a site 30 km from Brewster Glacier at an altitude of 320 m MSL (for details, see Cullen and Conway 2015).

Because in this study all available observations are used for each site, the respective application time periods necessarily differ: 11 years for VERNAGT, approximately 6 years for ARTESON, and approximately 5 years for BREWSTER. While the performance of a regression model is generally expected to increase with the size of the training data (e.g., Hofer et al. 2015), it is observed to approach a limiting behavior (e.g., Meek et al. 2002; Hastie et al. 2009). In other words, after a certain size of the training data is reached, the dependence of the performance on the size of the training data diminishes. This limit can be determined only empirically (e.g., Meek et al. 2002). Hofer et al. (2015) found for *t* at ARTESON that the minimum number of observations required for the SD model to obtain statistically significant skill ranges from 40 to 140. In our modeling procedures (see section 5), the minimum number of observations required to obtain statistically significant skill is met in all cases. We still acknowledge that longer observational time series may lead to higher and more certain estimates of the skill. We therefore consider the certainty of the estimated values of skill in the definition of the optimum predictor option (see sections 5 and 6e). Also note that because of the large geographical distance between the three sites, the weather conditions affecting each site differ even for identical time periods.

As mentioned above, all sites are situated in the proximity of mountain glaciers. The sites VERNAGT, BREWSTER, and ARTESON AWS1 are situated on rocky terrain downslope of the glacier termini, albeit at varying distances: (more than) 1500 m in the case of VERNAGT (Rissel 2012), 200 m in the case of ARTESON AWS1, and 500 m in the case of BREWSTER (Cullen and Conway 2015). ARTESON AWS2 is situated on a moraine that delimits the tongue of Artesonraju Glacier to the south, about 30 m above the glacier surface and at 80 m horizontal distance. The moraine itself is glacier-covered about 670 m upslope of AWS2. Note that the sites cannot be considered representative of the atmospheric conditions immediately above the nearby glacier surfaces (see, e.g., Munro and Marosz-Wantuch 2009). The way in which the glacier boundary layer affects the different target variables (particularly *t*, ws, and rh; e.g., Van den Broeke 1997) is most likely to vary among the different sites, as well as throughout the measuring periods. Presumably these effects are strongest for ARTESON, because its two measurement sites AWS1 and AWS2 are located closer to the glaciers than are VERNAGT and BREWSTER. For BREWSTER, in particular, Conway (2013) reported topographic controls playing a larger role than the glacier boundary layer in controlling ws and wind direction.

## 4. The predictors: Reanalysis data

The reanalysis data used in this study are the ERA-Interim by the European Centre for Medium-Range Weather Forecasts (ECMWF). Although reanalysis data are often considered to represent the large-scale truth, the considerable discrepancies between different reanalysis datasets available from different institutions show that reanalysis data are strongly affected by uncertainties in the observations, the numerical model, and data assimilation systems (e.g., Hofer et al. 2012 and references therein). The ERA-Interim represents a third-generation reanalysis dataset, with many of the inaccuracies exhibited by previous reanalyses (e.g., ERA-40) eliminated or strongly reduced (see Dee et al. 2011). In a study focusing on daily air temperature in the Cordillera Blanca, ERA-Interim showed higher performance than other available reanalysis datasets (Hofer et al. 2012). Figure 1 shows the model topography of the ERA-Interim centered around the three study sites. The largest discrepancy between the true and the ERA-Interim topographies is evident for ARTESON, where peaks reach up to almost 7 km MSL, and the study site is situated about 2 km lower in the model topography than in reality. In the case of VERNAGT, the ERA-Interim model topography is about 1 km lower than the true elevation. At BREWSTER, the model topography is about 700 m lower.

## 5. The sDoG tool

The SD analyses in this study are carried out using sDoG, an SD training and validation framework that combines state-of-the-art linear regression tools with a particular focus on appropriately considering the pitfalls of short and patchy observational records. The sDoG framework is the upgraded version of the SD training and validation approaches presented in Hofer (2012) and Hofer et al. (2015). It includes single and multiple linear ordinary least squares (OLS) regression, with and without symmetry-producing variable transformations (e.g., Wilks 2011), beta regression, and generalized linear models (GLMs). For prc, sDoG offers a two-step approach (Stern and Coe 1984): prc occurrence is modeled in a first step and the prc amount on days with precipitation in a second step. In the case of multivariate regression (i.e., for traditional SD), sDoG provides a choice of different feature selection algorithms, in particular, stepwise regression, the least absolute shrinkage and selection operator (LASSO), and partial least squares regression (PLS). In this study, these feature selection algorithms are based on OLS (see also section 6a), but combinations of stepwise regression, LASSO, and PLS with GLMs and symmetry-producing OLS are also possible.
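The two-step approach of Stern and Coe (1984) can be sketched on synthetic data: a logistic model for occurrence, followed by OLS on log-transformed amounts for wet days. The data generation and link choices here are illustrative and not sDoG's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic large-scale predictor (e.g., reanalysis prc) and local observations
x = rng.gamma(2.0, 1.0, 500)
wet = rng.random(500) < 1 / (1 + np.exp(-(x - 2)))        # occurrence
amount = np.where(wet, np.exp(0.5 * x + rng.normal(0, 0.3, 500)), 0.0)

X = np.column_stack([np.ones_like(x), x])

# step 1: occurrence -- logistic regression fitted by Newton (IRLS) iterations
beta = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    W = p * (1 - p)
    beta += np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (wet - p))

# step 2: amount on wet days only -- OLS on log-transformed amounts
gamma_, *_ = np.linalg.lstsq(X[wet], np.log(amount[wet]), rcond=None)

def predict(x_new, thresh=0.5):
    """Dry day if occurrence probability < thresh, else back-transformed amount."""
    Xn = np.column_stack([np.ones_like(x_new), x_new])
    occ = 1 / (1 + np.exp(-Xn @ beta)) > thresh
    return np.where(occ, np.exp(Xn @ gamma_), 0.0)
```

Separating occurrence from amount avoids fitting one regression to a series dominated by zeros, which is the design motivation behind the two-step option.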

To account for seasonality, sDoG computes different regression models for each day of the year (doy). Data slices centered around the doy of interest, with the window size amounting to 31 days, are used to fit each doy’s model. The window size of 31 days has been determined as a compromise between including enough data for the model fit (see Hofer et al. 2015) and minimizing the effect of seasonality in the time series (e.g., Themeßl et al. 2012). sDoG applied to a certain target variable, at a specific site of interest, produces 365 models (for 29 February, the same model as for 28 February is used). Given the redundancy of data for neighboring doys and to save computation time in the multidimensional experiments performed below, results are computed here only for each 10th doy (i.e., doy 1, 11, 21, …, 361; in total, 37 doys). Each regression model is then validated individually for each assessed predictor option (in total 16), for each of the 37 doys, each of the five variables, and each of the three target sites, summing up to 8880 regression models to be evaluated (excluding the precursor analyses described in section 6a).
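The seasonal windowing can be sketched as a simple circular mask over day-of-year indices (a minimal illustration, not sDoG's actual code):

```python
import numpy as np

def doy_window(doys, center, half_width=15):
    """Select all samples whose day of year lies within a (2*half_width + 1)-day
    window centred on `center`, wrapping around the turn of the year."""
    d = np.abs(np.asarray(doys) - center)
    return np.minimum(d, 365 - d) <= half_width

# the evaluation grid used in the experiments: every 10th doy
eval_doys = np.arange(1, 366, 10)
```

With doys 1, 11, ..., 361 this grid contains 37 entries, matching the 37 models evaluated per predictor option, target variable, and site.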

The quantity SS is appropriate for the evaluation of predictors used in regression models, because it increases when there is a strong explanatory relationship between the model and the observations and decreases when that relationship is weak. This can be understood more intuitively based on the decomposition of SS into two main terms (after Murphy 1988): namely, 1) the square of the correlation coefficient between the model and the observations and 2) the conditional and unconditional bias terms (see also Wilks 2011; Hofer et al. 2015). More specifically, SS increases with increasing correlation between a model and the observations, and SS decreases when there are nonzero bias terms. It is important to note that for ordinary least squares regression, the conditional and unconditional bias terms are zero by construction (e.g., Wilks 2011). However, because SS here is estimated based on cross validation (see below) and thus on *independent* values of the model and the observations (i.e., values that have not seen each other in the training of the ordinary least squares regression), the bias terms are not exactly zero (however close to zero, as we also show for a case study below). Furthermore, because the square of the correlation coefficient cannot become negative, negative values of SS point to the model being affected by a bias (e.g., Wilks 2011). Put another way, SS as defined here, multiplied by 100, corresponds to the percentage improvement of the evaluated model over the reference model, which is a constant.
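The relationship between SS, the squared correlation, and the two bias terms can be made concrete with a short numerical sketch (illustrative code, assuming the constant mean-of-observations reference model described above):

```python
import numpy as np

def skill_score(model, obs):
    """MSE-based skill score against the constant mean-of-observations model."""
    mse = np.mean((model - obs) ** 2)
    mse_ref = np.mean((obs - obs.mean()) ** 2)
    return 1.0 - mse / mse_ref

def murphy_decomposition(model, obs):
    """SS = r^2 - (conditional bias)^2 - (unconditional bias)^2 (Murphy 1988)."""
    r = np.corrcoef(model, obs)[0, 1]
    cond_bias = r - model.std() / obs.std()          # conditional bias term
    uncond_bias = (model.mean() - obs.mean()) / obs.std()
    return r ** 2 - cond_bias ** 2 - uncond_bias ** 2
```

The two functions agree exactly: with both bias terms at zero, SS reduces to the squared correlation, and any nonzero bias pushes SS down, possibly below zero.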

The sDoG framework estimates the test error based on cross validation (e.g., Michaelsen 1987; Hastie et al. 2009; Wilks 2011). In short, cross validation consists of splitting up a data series with *n* observations into *k* parts of size *n*/*k*. Then, in *k* repetitions, *k* − 1 parts are used to train a particular model, while the part left out of the training is used to estimate the test errors. Putting together the *n*/*k* test errors estimated in each of the *k* repetitions, a time series of *n*/*k* × *k* = *n* test errors is obtained (e.g., Hofer et al. 2015 and references therein). This way, cross validation allows all the available data to be used both for the training of a particular model and for estimating the test error (which is not the case for split-sample validation; e.g., Wilks 2011). Cross validation is therefore particularly useful when setting up a statistical model based on short observation time series, for example, as is often necessary when the focus is on data-sparse mountain environments (see also Hofer et al. 2015). However, note that cross validation is computationally more expensive than split-sample validation, given that the model training and error estimation procedure has to be repeated *k* times. In sDoG, *k* (also called the number of “folds” of the cross validation) is a specifiable input. In this study, *k* is set to 10, as suggested by, for example, Hastie et al. (2009).
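The cross-validation bookkeeping described above can be sketched for an OLS model, using contiguous folds for simplicity (illustrative code, not sDoG's implementation):

```python
import numpy as np

def kfold_cv_errors(X, y, k=10):
    """k-fold cross validation for OLS: each of the n observations receives
    exactly one out-of-sample prediction error."""
    n = len(y)
    folds = np.array_split(np.arange(n), k)      # k contiguous index blocks
    errors = np.empty(n)
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        beta, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
        errors[test_idx] = y[test_idx] - X[test_idx] @ beta
    return errors
```

Concatenating the k held-out parts yields a full-length series of n test errors, which is the quantity SS is estimated from.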

The sDoG framework uses a variant of cross validation that accounts for persistence in the time series, namely, moving-block cross validation (e.g., Hofer et al. 2015 and references therein). Moving-block cross validation is particularly important for meteorological time series, as they are frequently affected by serial correlation (e.g., Wilks 2011; Hofer et al. 2015). Moving-block cross validation is applied not only for estimating the model performance but also in the feature selection (e.g., in the case of LASSO). In that case, sDoG computes a double cross-validation loop (e.g., Michaelsen 1987): that is, an inner loop for feature selection and an outer loop for quantifying the model performance, with *k* of the inner cross-validation loop set to 4 in this study, as suggested by, for example, Shao (1993).
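The double cross-validation loop can be sketched as follows, with contiguous blocks standing in for the moving-block scheme and a brute-force subset search standing in for the feature selection step (illustrative only):

```python
import numpy as np

def ols_cv_mse(X, y, idx, cols, k):
    """Inner-loop CV MSE of an OLS model using predictor columns `cols`."""
    mse = 0.0
    for test in np.array_split(idx, k):
        train = np.setdiff1d(idx, test)
        b, *_ = np.linalg.lstsq(X[np.ix_(train, cols)], y[train], rcond=None)
        mse += np.mean((y[test] - X[np.ix_(test, cols)] @ b) ** 2)
    return mse / k

def double_cv_errors(X, y, candidate_sets, k_outer=10, k_inner=4):
    """Outer loop: estimate test errors. Inner loop (training data only):
    pick the candidate predictor set with the lowest CV MSE."""
    n = len(y)
    errors = np.empty(n)
    for test in np.array_split(np.arange(n), k_outer):
        train = np.setdiff1d(np.arange(n), test)
        best = min(candidate_sets,
                   key=lambda cols: ols_cv_mse(X, y, train, cols, k_inner))
        b, *_ = np.linalg.lstsq(X[np.ix_(train, best)], y[train], rcond=None)
        errors[test] = y[test] - X[np.ix_(test, best)] @ b
    return errors
```

Keeping the feature selection inside the outer training folds is what prevents the selection step from leaking information into the performance estimate.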

To quantify the uncertainty of SS, sDoG applies a moving-block bootstrap that resamples the error time series in blocks of length *L*. Here, *L* is approximated for first-order autoregressive processes as suggested by Wilks (1997), based on the implicit equation

*L* = (*n* − *L* + 1)^{(2/3)(1 − *n*′/*n*)},     (2)

with

*n*′ = *n*(1 − ρ_{1})/(1 + ρ_{1}).

In this study, ρ_{1} is computed as the lag-1 autocorrelation [AR(1)] of *ε*_{r}(*t*), and *n* is the number of elements in *ε*_{r}(*t*). For each case (in terms of target variable, site, and doy), 10 000 bootstrap estimates of SS are generated by first bootstrapping *ε*_{r}(*t*) and *ε*(*t*) (in parallel) and then computing the values of SS based on the bootstrapped time series of *ε*_{r}(*t*) and *ε*(*t*) following Eq. (1). Here, AR(1) of *ε*_{r}(*t*), rather than AR(1) of *ε*(*t*), is considered in the definition of *L*, because AR(1) of *ε*_{r}(*t*) is generally larger than AR(1) of *ε*(*t*). This is because the reference model is defined here as a constant time series, and consequently, AR(1) of *ε*_{r}(*t*) is identical to AR(1) of each doy’s observation time series. The term AR(1) of *ε*(*t*), by contrast, is smaller than AR(1) of each doy’s observation time series, because in the case of positive skill, the model reflects (at least) a part of the temporal variability in the time series. As a starting value for the iterative solution of Eq. (2), *L* = (*n*)^{1/2} is used, as suggested by Wilks (1997). One standard error (SE) of SS can then be defined as the standard deviation of the 10 000 bootstrap estimates of SS. Furthermore, SS is considered significantly positive if the 5th percentile of the bootstrap estimates of SS is larger than zero. The version of sDoG used in this study is written in MATLAB and available from the first author upon request. An implementation of sDoG in R is under way.
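
The block-length iteration and the parallel moving-block bootstrap of SS can be sketched as follows (an illustrative numpy version; function and variable names are ours):

```python
import numpy as np

def block_length(n, rho1):
    """Iteratively solve Wilks's (1997) implicit equation
    L = (n - L + 1)**((2/3)*(1 - nprime/n)),
    with nprime = n*(1 - rho1)/(1 + rho1), starting from L = sqrt(n)."""
    expo = (2.0 / 3.0) * (1.0 - (1.0 - rho1) / (1.0 + rho1))
    L = np.sqrt(n)
    for _ in range(100):
        L = (n - L + 1.0) ** expo
    return max(1, int(round(L)))

def bootstrap_ss(err, err_ref, n_boot=10000, seed=0):
    """Moving-block bootstrap estimates of SS = 1 - MSE/MSE_ref; err and
    err_ref are resampled in parallel, with the block length derived
    from the lag-1 autocorrelation of the reference errors."""
    rng = np.random.default_rng(seed)
    n = len(err)
    rho1 = np.corrcoef(err_ref[:-1], err_ref[1:])[0, 1]
    L = block_length(n, rho1)
    nblocks = -(-n // L)                      # ceil(n / L)
    ss = np.empty(n_boot)
    for b in range(n_boot):
        starts = rng.integers(0, n - L + 1, size=nblocks)
        idx = (starts[:, None] + np.arange(L)).ravel()[:n]
        ss[b] = 1.0 - np.mean(err[idx]**2) / np.mean(err_ref[idx]**2)
    return ss
```

The standard deviation of the returned sample gives SE(SS), and SS is deemed significantly positive when the sample's 5th percentile exceeds zero.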

## 6. Results and discussion

### a. Precursor experiments

In this section, we briefly address two secondary analyses carried out to systematically determine the experimental setup of this study; they do not pertain to the primary results and are therefore not presented in more detail. The first analysis refers to the choice of the regression model for the experiments. A set of linear regression options suggested in the literature for the different target variables was tested, with a particular focus on accounting for deviations from a Gaussian distribution (see Table 4 for a complete list of all considered regression options). Somewhat surprisingly, OLS outperformed all other regression options in most cases. Consistent improvements relative to OLS were obtained only for ws by means of a Poisson GLM using the log link function. For prc, by contrast, we found large negative values of SS when applying a two-step model approach (see Table 4). A closer look revealed that prc amounts simulated by the gamma GLM (with reciprocal link function) reached unrealistically large values for predictor data outside the range of the training data. To sum up, because none of the tested regression alternatives showed systematic, significant improvements over OLS for all doys and sites, we decided to use OLS for all target variables and all experiments in this study (combined with feature selection in the case of multiple regression; see below). Note that the applied OLS procedures do not assume a Gaussian error distribution (neither in the OLS parameter estimation nor in the validation and the uncertainty analysis based on cross validation and moving-block bootstrap). For example, the Gauss–Markov theorem states that in a linear regression model, the best linear unbiased estimator of the coefficients is given by the OLS estimator, with no need for the errors to be normally distributed (assuming, however, homoscedastic, uncorrelated errors with zero mean and finite variance).
Linear models have the practical advantage of being more interpretable, which is especially relevant regarding the stationarity assumption in SD (e.g., Zorita and von Storch 1999).
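
To illustrate the kind of alternative tested, a Poisson GLM with log link can be fit by iteratively reweighted least squares (IRLS). The following numpy sketch is ours and not the code used in the study:

```python
import numpy as np

def fit_poisson_glm(X, y, n_iter=50):
    """Poisson GLM with log link, fit by iteratively reweighted least
    squares; an intercept column is prepended to X."""
    A = np.column_stack([np.ones(len(y)), X])
    beta = np.zeros(A.shape[1])
    for _ in range(n_iter):
        mu = np.exp(A @ beta)            # mean under the log link
        W = mu                           # Poisson variance function
        z = A @ beta + (y - mu) / mu     # working response
        beta = np.linalg.solve(A.T @ (W[:, None] * A), A.T @ (W * z))
    return beta

def predict_poisson(beta, X):
    A = np.column_stack([np.ones(len(X)), X])
    return np.exp(A @ beta)
```

The log link guarantees nonnegative predictions, which makes this model a natural candidate for a nonnegative quantity such as ws.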

Table 4. Linear regression models tested in this study, some relevant references, and the target variables concerned.

In a second precursor analysis, we compared LASSO, PLS, and stepwise regression with regard to their suitability as the feature selection algorithm in the multiple predictor experiments below. In terms of computational speed, we found that the coordinate descent algorithm implemented in LASSO could require considerable time, depending on the design of the predictor matrix (in particular, the number of predictors and the amount of collinearity among them; not shown), while PLS emerged as the fastest algorithm. In terms of model skill, LASSO significantly outperformed PLS for small predictor sets (fewer than 10 covariates), while PLS outperformed LASSO for the larger predictor sets. Therefore, in this study, LASSO is applied for the multiple predictor options L and V, and PLS is applied for all the remaining multiple predictor options. Note that, traditionally, stepwise selection has been the most frequently applied choice in SD (e.g., Glahn and Lowry 1972; Huth 2004). LASSO has been introduced only recently (e.g., Gao et al. 2014), whereas PLS has received very little attention in the context of SD so far. Even though we feel that our preliminary analysis justifies the use of LASSO and PLS here, the relative performance of the different available feature selection algorithms for SD deserves to be investigated more comprehensively in a separate study.
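
The coordinate descent algorithm at the core of LASSO can be sketched as follows (a bare-bones numpy version with a fixed penalty, assuming standardized predictors and a centred response; production implementations add convergence checks and a cross-validated penalty path):

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """LASSO via cyclic coordinate descent, minimizing
    0.5*||y - X b||^2 + lam*||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_ss = (X**2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # partial residual excluding predictor j
            r = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r
            # soft-thresholding update shrinks small coefficients to zero
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_ss[j]
    return beta
```

The soft-thresholding step is what sets whole coefficients to exactly zero, which is why LASSO acts as a feature selection algorithm, in contrast to PLS, which projects onto latent components instead of discarding predictors.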

### b. OS analysis

An important concept that needs to be considered in the case of direct predictor selection in SD is the optimum scale (OS) of a GNAM (e.g., Grotch and MacCracken 1991). OS analysis is based on the phenomenon that GNAM output considered at resolutions lower than the native resolution generally shows higher skill than the output at the native resolution. This phenomenon is counterintuitive, because the native resolution of GNAMs is generally lower than the scales of the atmospheric processes usually being studied. The reason why increased skill can be achieved through spatial averaging is that data from single GNAM grid points are affected by numerical errors (i.e., truncation errors and errors due to explicit and implicit model filtering and smoothing), which can be reduced by spatial filtering techniques (e.g., Räisänen and Ylhäisi 2011; Hofer et al. 2012 and references therein). In weather prediction, this phenomenon is more commonly referred to as the “effective resolution” of a numerical model (e.g., Skamarock 2004).

Figures 2 and 3 (left panels) show results of an OS analysis for the three study sites, VERNAGT, ARTESON, and BREWSTER, and the five target variables, prc, *t*, ws, rh, and gr. Single OLS is applied using—for each predictand—the same parameter from the ERA-Interim as a predictor (direct SD; thus prc for prc, *t* for *t*, etc.). Then, values of SS are computed for increasing square averaging domains centered around the study site (the grid point closest to the study site, the average of 2 × 2 grid points, 3 × 3 grid points, and so on, up to 10 × 10 grid points; see also the squares in Fig. 1, top-left panel). For *t*, ws, and rh, the horizontal averaging domain is considered at the ERA-Interim levels close to the study site (i.e., the 750-hPa level for VERNAGT, the 550-hPa level for ARTESON, and the 850-hPa level for BREWSTER). For prc and gr, available only as surface variables in the ERA-Interim, the respective surface parameters are used as predictors.

The definition of the OS is not trivial. In Figs. 2 and 3 (left panels), values of SS are shown for the following possibilities: 1) single gridpoint predictors (assuming that the OS is equivalent to the minimum scale; hereinafter referred to as no-OS); 2) a fixed (parameter- and site-independent) OS (thus without prior data-based analysis; hereinafter referred to as fixed-OS); and 3) an OS determined individually for each site, target variable, and doy (hereinafter referred to as spec-OS). Option 2, fixed-OS, is assessed here because it is often applied in SD studies [e.g., Themeßl et al. (2011), who assume a fixed-OS of 3 × 3 grid cells of a regional climate model without data-based analysis]. Here, we use 4 times the minimum scale of the ERA-Interim as fixed-OS, motivated by results found in Hofer et al. (2012).
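
The scanning of square averaging domains can be illustrated with synthetic data (an in-sample numpy sketch for brevity, whereas sDoG evaluates SS via cross validation; all names are ours):

```python
import numpy as np

def os_skill(field, y, max_k=10):
    """Skill of a single-predictor OLS model as a function of the square
    averaging domain: field has shape (time, ny, nx) with the study site
    near the grid centre; for k = 1..max_k the predictor is the mean
    over a k x k block of grid points around the centre."""
    nt, ny, nx = field.shape
    cy, cx = ny // 2, nx // 2
    ss = []
    for k in range(1, max_k + 1):
        y0, x0 = cy - k // 2, cx - k // 2
        xk = field[:, y0:y0 + k, x0:x0 + k].mean(axis=(1, 2))
        A = np.column_stack([np.ones(nt), xk])
        c, *_ = np.linalg.lstsq(A, y, rcond=None)
        err = y - A @ c
        err_ref = y - y.mean()        # constant reference model
        ss.append(1.0 - np.mean(err**2) / np.mean(err_ref**2))
    return np.array(ss)
```

With gridpoint noise that is independent between cells, averaging over k × k points reduces the noise variance by a factor of k², which is the mechanism behind the skill gain at the OS.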

The results in Figs. 2 and 3 (left panels) show that spec-OS and fixed-OS can lead to improvements of SS of up to 0.4 relative to the SS values obtained by no-OS, with specific improvements largely depending on the different target variables, sites, and doys. Improvements due to spec-OS or fixed-OS are, however, hardly evident for BREWSTER, where only the humidity-related target variables prc and rh show small increases of SS for some doys (prc in April and rh in Southern Hemisphere summer). Most noteworthy is the increase of SS by applying fixed-OS or spec-OS—against no-OS—for rh at VERNAGT in the summer months: the direct, no-OS option shows high skill for winter with values of SS exceeding 0.7, but for summer (in particular, July and August), SS drops below 0.3 (see Fig. 3, middle-right panel, top subplot). The fixed-OS option shows higher skill for rh in winter (SS > 0.8) and considerably higher skill for rh in summer (SS ≈ 0.6). In summary, for VERNAGT and ARTESON, values of SS are increased in more than 75% of all cases (target variables and doys) by fixed-OS (relative to no-OS). The spec-OS shows improvements for a further 15% of all cases. Thus, in total, values of SS are improved in more than 90% of all variables and doys by means of OS analysis but with the improvements not being strongly dependent on an exact data-based choice of the OS (justifying the use of a predetermined size of horizontal averaging domain; e.g., Themeßl et al. 2011). For BREWSTER, by contrast, fixed-OS leads to decreasing values of SS in about 80% of all cases (variables and doys), and spec-OS leads to improvements of SS in only about 40% of all cases (variables and doys). This indicates that the loss of local-scale detail due to the spatial averaging of the ERA-Interim predictors is generally larger than numerical errors in single gridpoint data for BREWSTER.

### c. Example

Before discussing the performance of further single and multiple predictor options, we illustrate the regression procedure in more detail using the example of rh at VERNAGT, doy 221 (i.e., 9 August). The time series used to train and evaluate the regression models consists (as already detailed in section 5) of 31 daily mean values centered around each doy 221 of each of the 11 years of measurements at VERNAGT (thus, 11 × 31 = 341 values). Figure 4 shows a scatterplot of the rh observations against sDoG model results based on the no-OS predictor (i.e., the time series of rh extracted at the reanalysis grid point closest to the study site; top-left panel) and the corresponding time series plot (top-right panel). The bottom panels show scatterplots and time series plots for the sDoG model based on the spec-OS predictor (i.e., for rh at VERNAGT doy 221, the average of the time series extracted at 6 × 6 grid points centered around the study site). For the no-OS predictor, a weak relationship to the observations is evident, with SS = 0.27. The bias between the observed and the modeled time series, however, is very small. Figure 4 clearly shows the major problem of the models for which low values of SS are found in this study, namely, underestimation of the observed variance. Because of the low correlation of the no-OS predictor with the observations, the OLS constrains the coefficients to small values, and the resulting model is almost a constant. For the sDoG model based on the spec-OS predictor, however, considerably higher correlation with the observations is evident, an accordingly higher value of skill is obtained (SS = 0.67), and the modeled variance is closer to the observed variance. The corresponding bias is almost zero. Histograms of the errors (not shown) reveal considerably narrower and more symmetric tails for the spec-OS predictor than for the no-OS predictor, with a clear maximum at and around zero.
Note that SS = 1 would imply that the model time series is identical to the observational time series. As a consequence, observed and modeled distributions would be identical as well. Note also that for SD methods that optimize the similarity between the modeled and the observed distributions (such as quantile mapping; e.g., Déqué 2007), a perfect result would not give any indication about the correlation between the modeled and the observed time series. In other words, when the distributions of two time series match perfectly, the mean squared or absolute errors between the two time series may still be large.
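
A short numerical illustration of this point: a random permutation of the observations has exactly the observed distribution, yet its mean squared error against the observations is on the order of twice the observed variance (synthetic numbers, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
obs = rng.normal(size=365)
model = rng.permutation(obs)       # identical distribution by construction
assert np.allclose(np.sort(model), np.sort(obs))
# large errors despite a perfect distributional match
mse = np.mean((model - obs) ** 2)
```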

### d. Single predictor analysis

The right panels of Figs. 2 and 3 show box plots of SS for the different doys, the three study sites, and five target variables, with different variable types being tested as single predictors. The leftmost predictor in each panel is the fixed-OS predictor (abbreviated as f-OS on the *x* axis), and the second leftmost predictor is the no-OS case (i.e., the closest grid point as predictor). As noted before, both fixed-OS and no-OS use the target variables’ equivalents as predictors (i.e., prc for prc, *t* for *t*, etc., at the pressure level of the study sites—for *t*, rh, and ws—and at the ERA-Interim surface—for prc and gr). For *t*, rh, and ws, the third predictor in each plot is the target variable’s equivalent considered at the ERA-Interim model surface (at 2 m above the ground for *t* and rh and at 10 m above the ground for ws). Thus, the first two (three) single predictor options for prc and gr (and *t*, ws, and rh, respectively) correspond to the direct predictor approach. For each target variable, a list of further potential predictors is analyzed, all considered at the grid point closest to the study site (no-OS) and—according to their availability in the ERA-Interim—at each study site’s pressure level or as surface variables.

For prc, the parameters prc, tc, rh, sh—and for BREWSTER also *u* and *υ*—show significant skill as predictors. The importance of the horizontal wind for prc at BREWSTER can be related to the dominance of orographic precipitation for this site (e.g., Kerr et al. 2011). For VERNAGT and BREWSTER, large-scale prc turns out to be the most important predictor for local-scale prc [the direct predictor option, in agreement with the findings of, e.g., Widmann et al. (2003) that are discussed in section 2]. For ARTESON, however, prc (whether fixed-OS or no-OS) shows very little skill and is outperformed by tc, rh, and sh. This is likely to be related to deficiencies of the ERA-Interim (and reanalysis data in general; e.g., Poccard et al. 2000) with regard to prc in the tropics. In the case of *t* and rh, for all three sites, the direct predictor—both fixed-OS and no-OS—outperforms all other assessed predictor quantities, including *t*S and rhS. For *t* at ARTESON, all predictors other than *t* (fixed-OS and no-OS) and *t*S show only very little skill, while for VERNAGT and BREWSTER, several parameters other than *t* show significant skill for most of the doys, for example, sh and *υ* for local-scale *t* at VERNAGT and *w* and *υ* for local-scale *t* at BREWSTER. The importance of the meridional wind component for local-scale *t* variability at both midlatitude sites is likely related to cold and warm air advection often coinciding with northerly and southerly flow, respectively, at VERNAGT (and vice versa for BREWSTER). For local-scale ws, ws fixed-OS and no-OS show the highest skill of all single predictors at VERNAGT and BREWSTER, but *u* emerges as a similarly important predictor at ARTESON. At BREWSTER, wsS outperforms ws fixed-OS and no-OS (note that this is the only case where a predictor considered at the surface outperforms the corresponding pressure-level parameter).
For rh, the performance of circulation predictors and air temperature is much lower than the performance of humidity predictors. This is in agreement with the findings in Huth (2005), who studied the performance of predictors by the NCEP–NCAR reanalysis data for humidity variables in the Czech Republic. For local-scale gr at all sites, the humidity variables rh and sh are important predictors.

In summary, the direct predictors show the highest skill of all assessed parameters in all cases except prc at ARTESON, and the quantities considered at the model surface are outperformed by their equivalents considered at the pressure levels corresponding to the study sites. For VERNAGT and ARTESON, fixed-OS shows overall higher performance than no-OS, while the opposite is true for BREWSTER.

### e. Multiple predictor options and synthesis

Next to the OS analysis and different single predictors, multiple predictor options L, V, G, VL, GL, and GV (defined in section 2) are explored in this study. The specific predictor sets used for each site, target variable, and multiple predictor option are detailed in Table 5. Note that the multiple predictor options initially include largely differing numbers of predictor variables, namely, 3 or 4 (L), 9 or less (V), 100 (G), 36 or less (VL), 300 or 400 (GL), and 900 or less (GV). However, using the feature selection algorithms LASSO and PLS (see section 6a), the dimensionality of the predictor dataset that finally enters into the regression models is largely reduced (with 20 being the predefined possible maximum number of predictor variables, which is further decreased by means of double cross-validation-based screening; see section 5).

Table 5. Predictor sets used for each of the multiple predictor options and each target variable at VERNAGT, ARTESON, and BREWSTER. The numbers next to the parameter abbreviations indicate the pressure level (hPa) at which each parameter is considered. If not stated otherwise, each parameter is considered at the ERA-Interim grid point closest to the study site. The predictor set for V includes all parameters analyzed as single predictors (see Figs. 2 and 3, where the levels are not indicated for the sake of brevity).

In this section, we discuss the optimum predictor option for each target variable, site, and doy. The optimum predictor option for a particular case is defined here following the principles of the one standard error (1 SE) rule (as suggested for statistical model selection; e.g., in Hastie et al. 2009): that is, to choose the least complex model within 1 SE of the best model. In this study, for simplicity, only two levels of model complexity are distinguished, namely, 1) the single predictor models (including all considered single predictors—no-OS and fixed-OS) and 2) multiple predictor models (including L, V, G, VL, GL, and GV). Then, for each case, the optimum single predictor option and the optimum multiple predictor option are identified by looking for the respective optimum values of SS. The optimum multiple predictor option is then considered as the optimum option only if its improvement relative to the optimum single predictor option is larger than 1 SE of its uncertainty. If the improvement is within 1 SE of its uncertainty, however—or if the value of SS of the optimum single predictor option is larger than the value of SS of the optimum multiple predictor option (which is true only in a few exceptions)—then the optimum single predictor option emerges as the overall optimum option. To sum up, the ranking within multiple predictor options is based solely on the values of SS obtained, while the ranking between multiple and single predictor options also considers the uncertainty of SS. In other words, when one of the single predictor options turns out as the optimum option, this indicates that the improvement of the optimum multiple predictor option is smaller than 1 SE of its uncertainty. This way, the improvement obtained by using a more complex model is set in relation to the uncertainty of the improvement, and parsimonious models are preferred if this uncertainty is larger than the improvement.
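
The selection rule can be written compactly as follows (option names and numbers below are made up for illustration):

```python
def pick_option(ss_single, ss_multi, se_multi):
    """One-standard-error rule with two complexity levels: the best
    multiple predictor option is chosen only if it beats the best
    single predictor option by more than one SE of its skill."""
    best_s = max(ss_single, key=ss_single.get)
    best_m = max(ss_multi, key=ss_multi.get)
    if ss_multi[best_m] - ss_single[best_s] > se_multi[best_m]:
        return best_m
    return best_s
```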

Figures 5 and 6 (left panels) show the optimum predictor options resulting for each variable, station, and doy. The right panels in Figs. 5 and 6 show the corresponding uncertainties, in terms of the SE of SS—in the following referred to as SE(SS)—for the multiple predictor option GV. Note that, while SE(SS) shows large differences for the different target variables, sites, and doys, we find no particular dependence of SE(SS) on model complexity in terms of the different predictor options considered (not shown). Option GV clearly turns out as the optimum predictor option on average over all cases. In individual cases, however, certain options are preferred over GV. For example, in the case of prc at VERNAGT, the two single predictor options fixed-OS and direct, no-OS dominate as the optimum options for November to May. This can be explained by large values of SE(SS) in these cases (ranging up to 0.3; see Fig. 5, top-right panel). The values of SE(SS) found for prc at ARTESON and BREWSTER are generally smaller (between 0.05 and 0.15) than for prc at VERNAGT but still larger than for the other target variables. An annual cycle is also evident for prc at these two sites, with the larger values of SE(SS) at ARTESON evident in Southern Hemisphere wintertime (from June to September)—the dry season in the Cordillera Blanca (e.g., Juen 2006). For BREWSTER, by contrast, larger values of SE(SS) are found in Southern Hemisphere summer, coinciding with the times when the prc data at BREWSTER used here were actually measured on site (since wintertime precipitation at BREWSTER is reconstructed based on a lowland station; see section 3). Overall, the uncertainty patterns of SS for prc reflect the difficulties of measuring solid precipitation and, in general, of measuring precipitation in remote mountain sites.

For the target variable *t*, the multiple predictor options (in particular GV, GL, and VL) systematically outperform the single predictor options for all sites (with the only exception of ARTESON for about one-third of all doys, where fixed-OS dominates). This is because values of SE(SS) are overall very small for *t* (ranging from 0.005 to 0.02; see Fig. 5, bottom-right panel). For the target variable ws, despite relatively large values of SE(SS) of up to 0.14, the single predictor options are hardly selected as the optimum predictor option at all. The corresponding improvements of SS obtained through using one of the multiple predictor options (particularly GV) consequently exceed 0.14 on these occasions (see also Fig. 7, right panel, discussed below). For rh, GV clearly dominates for VERNAGT, and V clearly dominates for BREWSTER, while for rh at ARTESON, the fixed-OS option emerges as the optimum option particularly from January to March (core wet season in the Cordillera Blanca; e.g., Juen 2006). In the case of gr, the single predictor option fixed-OS is preferred for VERNAGT (particularly for November–April—Northern Hemisphere winter). Except for prc in Northern Hemisphere winter, uncertainties of SS found for VERNAGT are about 50% lower than for ARTESON and BREWSTER. Occasional maxima in the uncertainties of SS possibly relate to individual measurement problems in the generally short time series, particularly for ARTESON and BREWSTER. For example, the maxima of SE(SS) for gr at BREWSTER in June and July (cf. Fig. 6, bottom-right panel) may be related to difficulties in resolving cloud properties over the Southern Alps (e.g., Conway and Cullen 2016).

Figure 7 (left panel) shows the maximum values of SS obtained through the optimum predictor option for each case. At all sites, the overall highest values of SS are obtained for the target variable *t*, and the lowest values of SS are obtained for prc (VERNAGT and ARTESON) and ws (BREWSTER). Note, however, that the higher values of SS for prc at BREWSTER are most probably a result of the BREWSTER time series containing data from a lowland station less influenced by topographic effects. For the target variables *t*, ws, and rh, the highest values of SS are found at VERNAGT. For prc and gr, the highest values of SS result for BREWSTER. ARTESON shows the lowest values of SS for all target variables. This is possibly related to (i) deficiencies of reanalysis data in the tropics, for example, caused by the enhanced role of subgrid-scale processes like convection (e.g., Cavazos and Hewitson 2005); (ii) the fact that the discrepancies between the real and the ERA-Interim topographies are largest for ARTESON, given the extreme complexity of the Cordillera Blanca (see Fig. 1); (iii) effects of the glacier boundary layer likely being most prominent at ARTESON (as discussed in section 3); but also (iv) possibly lower quality of the observational data at ARTESON, considering its remoteness and difficult accessibility (e.g., Juen 2006). Note that for the target variable *t* at VERNAGT, values of SS are exceptionally high. Values of SS being close to 1 already in the case of fixed-OS (cf. Fig. 2) indicates that day-to-day and year-to-year variability of *t* at VERNAGT is governed almost entirely by *t* at large scales. For example, the scale of fixed-OS is as large as 3° × 3° (longitude–latitude, corresponding to an area of approximately 90 000 km^{2}; cf. also the black square in Fig. 1).

Figure 7 (right panel) shows the added value of the optimum predictor option relative to the direct, no-OS predictor (in terms of differences of SS). For comparison, the added value of spec-OS relative to the direct, no-OS predictor is also shown (light shadings in Fig. 7, right panel). For almost all variables, sites, and doys, the application of the optimum predictor option leads to a considerable improvement relative to the direct, no-OS predictor. The smallest improvements are evident for the target variable *t*. The largest improvements are obtained for the target variable ws. This implies that, despite evident discrepancies between local-scale ws and large-scale ws, large parts of the local-scale ws anomalies are governed by the large-scale atmospheric conditions. The spec-OS shows its largest efficiency for ws and rh at VERNAGT (with median differences in skill being larger than 0.1), as well as for prc at VERNAGT, and ws, rh, and gr at ARTESON (with median differences in skill being larger than 0.05). Overall, the improvements, in terms of differences in values of SS of the optimum predictor option relative to the direct, no-OS predictor option, range from 0.05 up to 0.4 (doys’ medians), with considerable seasonal variations (cf. the edges of the boxes in Fig. 7) and extreme values extending beyond 0.5. These results clearly indicate the importance of sophisticated predictor screening, and particularly of the use of multiple predictor options for the performance of statistical downscaling.

Finally, Fig. 8 shows the full (not doy-separated) time series of daily ws at VERNAGT from December 2001 to September 2002 (the time series shown is limited to 10 months only in order to allow for the individual values to be distinguished). The time series modeled by sDoG (multiple predictor option GV) overall corresponds well to the observations (the corresponding mean SS—averaged over all 365 doys—amounts to 0.59). Together with the modeled time series, sDoG also outputs the cross-validation-based uncertainty estimates in terms of one and two standard deviations of the test errors (i.e., standard errors) for individual doys (not to be confused with values of SE(SS) estimated based on the bootstrap). To show an example of a model projection beyond the range of measurements used here, sDoG model values are also displayed for December 2001. As a reference, the untransformed reanalysis data series of ws extracted from the closest grid point is shown (i.e., the predictor of the direct, no-OS option but without regression). The reanalysis data series shows an overall higher variability than the measured time series. Despite some degree of correlation, the errors for individual points in time are evidently much larger than the errors of the sDoG model based on option GV. Consequently, even though ws of the closest reanalysis data grid point shows—as expected—large discrepancies from the observations, the regression model GV, which considers multiple predictor variables at multiple grid points, is able to simulate local-scale, daily ws at VERNAGT more realistically.

## 7. Summary and conclusions

We presented a detailed analysis of possible predictor strategies in regression-based downscaling for precipitation (prc), air temperature (*t*), wind speed (ws), relative humidity (rh), and global radiation (gr). All variables were considered at a daily time scale. In our analysis, we focused on the skill of the predictors to correctly represent anomalies from a mean value determined individually for each day of the year. Two conceptually different approaches of predictor selection, as found in the literature, were evaluated, namely, direct predictor selection (which implies the use of a single variable as a predictor, usually the same large-scale parameter; e.g., large-scale prc for local-scale prc) and traditional predictor selection (based on multiple regression, using as predictors several large-scale parameters from various grid points and/or vertical levels).

In the case of direct SD, the optimum scale analysis shows that for VERNAGT and ARTESON, overall higher skill can be obtained by applying horizontal gridpoint averaging to the predictors instead of considering the predictors from a single grid point. It is noteworthy that a constant number of 4 × 4 grid points (fixed-OS) performs well for all five parameters and different days of the year at these two distinct sites. In particular, for *t* at VERNAGT, the skill found for fixed-OS is exceptionally high, with values of SS close to 1 throughout the year. This implies that daily *t* anomalies at VERNAGT are determined almost entirely by *t* at scales as large as 300 km. For all target variables at BREWSTER, by contrast, gridpoint averaging deteriorates the skill of direct SD most of the time. A reason for the more critical loss of local-scale detail through gridpoint averaging at BREWSTER is possibly its location on an island (New Zealand), whereas VERNAGT and ARTESON are situated within continental landmasses. Notwithstanding this contrasting result for BREWSTER, our analyses suggest that OS analysis is a potentially very efficient measure for improving the predictors' skill without the need to increase the SD model complexity. Regarding the optimum vertical allocation of a predictor for direct SD, we find that—in agreement with previous studies—the predictors from the same pressure level as the study site outperform the predictors at the ERA-Interim surface. For the different target variables, the direct predictors usually outperform the set of possible single predictors analyzed here. Only in the case of prc and gr at ARTESON do different parameters—like relative humidity or total cloud cover—outperform the direct predictors. The low performance of prc and gr as direct predictors for ARTESON is most likely related to deficiencies in the ERA-Interim for these parameters in the tropics.

Our results show that traditional SD based on multivariate predictor sets clearly outperforms direct SD (with or without the consideration of OS analysis). This is particularly true for ws, for which the largest improvements are obtained by including predictors other than wind speed and from different grid points horizontally (i.e., GV, the multiple predictor option often found for ws also in the SD literature). For example, differences in skill between the no-OS, direct SD approach and the optimum multiple predictor option for ws have median values of almost 0.2 for ARTESON, 0.3 for BREWSTER, and 0.4 for VERNAGT, with the highest values for individual days of the year exceeding 0.5. Yet our uncertainty analysis reveals that in some cases, standard errors of the obtained values of skill can be as large as these improvements. For these cases, according to the one-standard-error rule, the use of direct SD is suggested. The largest standard errors of the skill are found for the target variable prc, pointing to considerable sampling uncertainty related to prc time series.

Of all target variables, we find the highest performance of the ERA-Interim predictors for *t* and the lowest skill for prc. Higher skill can be obtained for the two midlatitude sites, VERNAGT and BREWSTER, than for the tropical site ARTESON. In all cases, the ERA-Interim predictors show significant skill in reflecting local-scale atmospheric variability. The comparably high skill values achieved through ordinary least squares regression are promising and provide a benchmark for more complex SD methods. All analyses in this study are based on the SD training and validation code sDoG, which is applicable to any mountain site with at least three years of observational time series available for model training. The key output of sDoG is simulations of local-scale, high-resolution time series covering the entire period of reanalysis data availability (e.g., from 1979 to the present for the ERA-Interim dataset), along with reliable uncertainty estimates.
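The MSE-based skill score used for these comparisons (Murphy 1988) and an ordinary least squares downscaling benchmark can be sketched as follows. This is a minimal illustration with hypothetical function names and data shapes, not the sDoG code itself.

```python
import numpy as np

def mse_skill_score(obs, pred):
    """Murphy (1988) skill score: SS = 1 - MSE / MSE_clim, with the
    observed climatological mean as the reference forecast. SS = 1 is
    a perfect model; SS <= 0 means no skill beyond climatology."""
    mse = np.mean((pred - obs) ** 2)
    mse_clim = np.mean((obs - obs.mean()) ** 2)
    return 1.0 - mse / mse_clim

def ols_downscale(X_train, y_train, X_new):
    """Fit the local-scale target y on large-scale predictors X by
    ordinary least squares (with intercept) and predict at new
    predictor values; X_* are 2-D arrays (time x predictors)."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_new)), X_new]) @ coef
```

In practice, skill would be evaluated on time steps withheld from the regression fit (cross-validation), so that SS reflects out-of-sample performance.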

## ACKNOWLEDGMENTS

This study is funded by the Austrian Science Fund (P280060). Research on Brewster Glacier is supported by the Department of Geography, University of Otago, New Zealand; the National Institute of Water and Atmospheric Research (Climate Present and Past CLC01202); and the Department of Conservation under concession OT-32299-OTH. Maintenance and data quality control of the Vernagtbach station are financed by the Bavarian Academy of Sciences, Commission for Geodesy and Glaciology, Munich, Germany. The ERA-Interim data were obtained online from the ECMWF.

## REFERENCES

Braun, L. N., H. Escher-Vetter, M. Siebers, and M. Weber, 2007: Water balance of the highly glaciated Vernagt basin, Ötztal Alps. *The Water Balance of the Alps*, R. Psenner and R. Lackner, Eds., Vol. III, *Alpine Space—Man and Environment*, Innsbruck University Press, 33–42.

Bürger, G., 2014: Comment on “Bias correction, quantile mapping, and downscaling: Revisiting the inflation issue.” *J. Climate*, **27**, 1819–1820, doi:10.1175/JCLI-D-13-00184.1.

Cannon, A. J., 2008: Probabilistic multisite precipitation downscaling by an expanded Bernoulli–gamma density network. *J. Hydrometeor.*, **9**, 1284–1300, doi:10.1175/2008JHM960.1.

Cannon, A. J., 2016: Multivariate bias correction of climate model output: Matching marginal distributions and intervariable dependence structure. *J. Climate*, **29**, 7045–7064, doi:10.1175/JCLI-D-15-0679.1.

Cannon, A. J., S. R. Sobie, and T. Q. Murdock, 2015: Bias correction of GCM precipitation by quantile mapping: How well do methods preserve changes in quantiles and extremes? *J. Climate*, **28**, 6938–6959, doi:10.1175/JCLI-D-14-00754.1.

Carey, M., 2010: *In the Shadow of Melting Glaciers: Climate Change and Andean Society*. Oxford University Press, 288 pp.

Cavazos, T., and B. C. Hewitson, 2005: Performance of NCEP–NCAR reanalysis variables in statistical downscaling of daily precipitation. *Climate Res.*, **28**, 95–107, doi:10.3354/cr028095.

Chandler, R. E., and H. S. Wheater, 2002: Analysis of rainfall variability using generalized linear models: A case study from the west of Ireland. *Water Resour. Res.*, **38**, 1192, doi:10.1029/2001WR000906.

Cheng, C. S., G. Li, Q. Li, and H. Auld, 2008: Statistical downscaling of hourly and daily climate scenarios for various meteorological variables in south-central Canada. *Theor. Appl. Climatol.*, **91**, 129–147, doi:10.1007/s00704-007-0302-8.

Cogley, J. G., 2010: A more complete version of the world glacier inventory. *Ann. Glaciol.*, **50**(53), 32–38, doi:10.3189/172756410790595859.

Cogley, J. G., 2011: Present and future states of Himalaya and Karakoram Glaciers. *Ann. Glaciol.*, **52**(59), 69–73, doi:10.3189/172756411799096277.

Conway, J. P., 2013: Constraining cloud and airmass controls on the surface energy and mass balance of Brewster Glacier, Southern Alps of New Zealand. Ph.D. dissertation, University of Otago, 182 pp.

Conway, J. P., and N. J. Cullen, 2016: Cloud effects on surface energy and mass balance in the ablation area of Brewster Glacier, New Zealand. *Cryosphere*, **10**, 313–328, doi:10.5194/tc-10-313-2016.

Cullen, N. J., and J. P. Conway, 2015: A 22 month record of surface meteorology and energy balance from the ablation zone of Brewster Glacier, New Zealand. *J. Glaciol.*, **61**, 931–946, doi:10.3189/2015JoG15J004.

Culver, A. M. R., and A. H. Monahan, 2013: The statistical predictability of surface winds over western and central Canada. *J. Climate*, **26**, 8305–8322, doi:10.1175/JCLI-D-12-00425.1.

Dee, D. P., and Coauthors, 2011: The ERA-Interim reanalysis: Configuration and performance of the data assimilation system. *Quart. J. Roy. Meteor. Soc.*, **137**, 553–597, doi:10.1002/qj.828.

Déqué, M., 2007: Frequency of precipitation and temperature extremes over France in an anthropogenic scenario: Model results and statistical correction according to observed values. *Global Planet. Change*, **57**, 16–26, doi:10.1016/j.gloplacha.2006.11.030.

Eden, J. M., and M. Widmann, 2014: Downscaling of GCM-simulated precipitation using model output statistics. *J. Climate*, **27**, 312–324, doi:10.1175/JCLI-D-13-00063.1.

Eden, J. M., M. Widmann, D. Maraun, and M. Vrac, 2014: Comparison of GCM- and RCM-simulated precipitation following stochastic postprocessing. *J. Geophys. Res. Atmos.*, **119**, 11 040–11 053, doi:10.1002/2014JD021732.

Enke, V., and A. Spekat, 1997: Downscaling climate model outputs into local and regional weather elements by classification and regression. *Climate Res.*, **8**, 195–207, doi:10.3354/cr008195.

Escher-Vetter, H., L. N. Braun, and M. Siebers, 2014: Hydrological and meteorological records from the Vernagtferner basin–Vernagtbach station, for the years 2002 to 2012. PANGAEA, accessed 2 July 2015, doi:10.1594/PANGAEA.829530.

Ferrari, S., and F. Cribari-Neto, 2004: Beta regression for modelling rates and proportions. *J. Appl. Stat.*, **31**, 799–815, doi:10.1080/0266476042000214501.

Flato, G., and Coauthors, 2013: Evaluation of climate models. *Climate Change 2013: The Physical Science Basis*, T. F. Stocker et al., Eds., Cambridge University Press, 741–866.

Frías, M. D., E. Zorita, J. Fernández, and C. Rodríguez-Puebla, 2006: Testing statistical downscaling methods in simulated climates. *Geophys. Res. Lett.*, **33**, L19807, doi:10.1029/2006GL027453.

Gao, K., L. Schulz, and M. Bernhardt, 2014: Statistical downscaling of ERA-Interim forecast precipitation data in complex terrain using LASSO algorithm. *Adv. Meteor.*, **2014**, 472741, doi:10.1155/2014/472741.

Georges, C., 2005: Recent glacier fluctuations in the tropical Cordillera Blanca and aspects of the climate forcing. Ph.D. dissertation, University of Innsbruck, 169 pp.

Giesen, R. H., and J. Oerlemans, 2013: Climate-model induced differences in the 21st century global and regional glacier contributions to sea-level rise. *Climate Dyn.*, **41**, 3283–3300, doi:10.1007/s00382-013-1743-7.

Giorgi, F., and G. T. Bates, 1989: The climatological skill of a regional model over complex terrain. *Mon. Wea. Rev.*, **117**, 2325–2347, doi:10.1175/1520-0493(1989)117<2325:TCSOAR>2.0.CO;2.

Glahn, H. R., and D. A. Lowry, 1972: The use of model output statistics (MOS) in objective weather forecasting. *J. Appl. Meteor.*, **11**, 1203–1211, doi:10.1175/1520-0450(1972)011<1203:TUOMOS>2.0.CO;2.

Grotch, S. L., and M. C. MacCracken, 1991: The use of global climate models to predict regional climatic change. *J. Climate*, **4**, 286–303, doi:10.1175/1520-0442(1991)004<0286:TUOGCM>2.0.CO;2.

Hastie, T., R. Tibshirani, and J. Friedman, 2009: *The Elements of Statistical Learning: Data Mining, Inference, and Prediction*. 2nd ed. Springer Series in Statistics, Springer, 745 pp.

Hewitson, B., and R. Crane, 2006: Consensus between GCM climate change projections with empirical downscaling: Precipitation downscaling over South Africa. *Int. J. Climatol.*, **26**, 1315–1337, doi:10.1002/joc.1314.

Hock, R., and B. Holmgren, 2005: A distributed surface energy-balance model for complex topography and its application to Storglaciären, Sweden. *J. Glaciol.*, **51**, 25–36, doi:10.3189/172756505781829566.

Hofer, M., 2012: Statistical downscaling of atmospheric variables for data-sparse, glaciated mountain sites. Ph.D. dissertation, University of Innsbruck, 96 pp.

Hofer, M., T. Mölg, B. Marzeion, and G. Kaser, 2010: Empirical-statistical downscaling of reanalysis data to high-resolution air temperature and specific humidity above a glacier surface (Cordillera Blanca, Peru). *J. Geophys. Res.*, **115**, D12120, doi:10.1029/2009JD012556.

Hofer, M., B. Marzeion, and T. Mölg, 2012: Comparing the skill of different reanalyses and their ensembles as predictors for daily air temperature on a glaciated mountain (Peru). *Climate Dyn.*, **39**, 1969–1980, doi:10.1007/s00382-012-1501-2.

Hofer, M., B. Marzeion, and T. Mölg, 2015: A statistical downscaling method for daily air temperature in data-sparse, glaciated mountain environments. *Geosci. Model Dev.*, **8**, 579–593, doi:10.5194/gmd-8-579-2015.

Huss, M., 2011: Present and future contribution of glacier storage change to runoff from macroscale drainage basins in Europe. *Water Resour. Res.*, **47**, W07511, doi:10.1029/2010WR010299.

Huth, R., 2002: Statistical downscaling of daily temperature in central Europe. *J. Climate*, **15**, 1731–1742, doi:10.1175/1520-0442(2002)015<1731:SDODTI>2.0.CO;2.

Huth, R., 2004: Sensitivity of local daily temperature change estimates to the selection of downscaling models and predictors. *J. Climate*, **17**, 640–652, doi:10.1175/1520-0442(2004)017<0640:SOLDTC>2.0.CO;2.

Huth, R., 2005: Downscaling of humidity variables: A search for suitable predictors and predictands. *Int. J. Climatol.*, **25**, 243–250, doi:10.1002/joc.1122.

Iizumi, T., M. Nishimori, M. Yokozawa, A. Kotera, and N. Duy Khang, 2012: Statistical downscaling with Bayesian inference: Estimating global solar radiation from reanalysis and limited observed data. *Int. J. Climatol.*, **32**, 464–480, doi:10.1002/joc.2281.

Juen, I., 2006: Glacier mass balance and runoff in the Cordillera Blanca, Peru. Ph.D. dissertation, University of Innsbruck, 173 pp.

Kaser, G., M. Grosshauser, and B. Marzeion, 2010: The contribution potential of glaciers to water availability in different climate regimes. *Proc. Natl. Acad. Sci. USA*, **107**, 20 223–20 227, doi:10.1073/pnas.1008162107.

Kerr, T., I. Owens, and R. Henderson, 2011: The precipitation distribution in the Lake Pukaki catchment. *J. Hydrol. N. Z.*, **50**, 361–382.

Kotlarski, S., F. Paul, and D. Jacob, 2010: Forcing a distributed glacier mass balance model with the regional climate model REMO. Part I: Climate model evaluation. *J. Climate*, **23**, 1589–1606, doi:10.1175/2009JCLI2711.1.

Kotlarski, S., and Coauthors, 2014: Regional climate modeling on European scales: A joint standard evaluation of the EURO-CORDEX RCM ensemble. *Geosci. Model Dev.*, **7**, 1297–1333, doi:10.5194/gmd-7-1297-2014.

Leclercq, P., J. Oerlemans, and J. Cogley, 2011: Estimating the glacier contribution to sea-level rise for the period 1800–2005. *Surv. Geophys.*, **32**, 519–535, doi:10.1007/s10712-011-9121-7.

Maraun, D., and Coauthors, 2010: Precipitation downscaling under climate change: Recent developments to bridge the gap between dynamical models and the end user. *Rev. Geophys.*, **48**, RG3003, doi:10.1029/2009RG000314.

Marke, T., W. Mauser, A. Pfeiffer, and G. Zängl, 2011: A pragmatic approach for the downscaling and bias correction of regional climate simulations: Evaluation in hydrological modeling. *Geosci. Model Dev.*, **4**, 759–770, doi:10.5194/gmd-4-759-2011.

Marzeion, B., A. H. Jarosch, and M. Hofer, 2012: Past and future sea-level change from the surface mass balance of glaciers. *Cryosphere*, **6**, 1295–1322, doi:10.5194/tc-6-1295-2012.

Meek, C., B. Thiesson, and D. Heckerman, 2002: The learning-curve sampling method applied to model-based clustering. *J. Mach. Learn. Res.*, **2**, 397–418.

Messner, J. W., G. J. Mayr, A. Zeileis, and D. S. Wilks, 2014: Heteroscedastic extended logistic regression for postprocessing of ensemble guidance. *Mon. Wea. Rev.*, **142**, 448–456, doi:10.1175/MWR-D-13-00271.1.

Michaelsen, J., 1987: Cross-validation in statistical climate forecast models. *J. Climate Appl. Meteor.*, **26**, 1589–1600, doi:10.1175/1520-0450(1987)026<1589:CVISCF>2.0.CO;2.

Mölg, T., and G. Kaser, 2011: A new approach to resolving climate-cryosphere relations: Downscaling climate dynamics to glacier-scale mass and energy balance without statistical scale linking. *J. Geophys. Res.*, **116**, D16101, doi:10.1029/2011JD015669.

Mölg, T., N. Cullen, and G. Kaser, 2009: Solar radiation, cloudiness and longwave radiation over low-latitude glaciers: Implications for mass balance modeling. *J. Glaciol.*, **55**, 292–302, doi:10.3189/002214309788608822.

Munro, D. S., and M. Marosz-Wantuch, 2009: Modeling ablation on Place Glacier, British Columbia, from glacier and off-glacier data sets. *Arct. Antarct. Alp. Res.*, **41**, 246–256, doi:10.1657/1938-4246-41.2.246.

Murphy, A. H., 1988: Skill scores based on the mean square error and their relationships to the correlation coefficient. *Mon. Wea. Rev.*, **116**, 2417–2424, doi:10.1175/1520-0493(1988)116<2417:SSBOTM>2.0.CO;2.

Nelder, J. A., and R. W. M. Wedderburn, 1972: Generalized linear models. *J. Roy. Stat. Soc.*, **135A**, 370–384.

Poccard, I., S. Janicot, and P. Camberlin, 2000: Comparison of rainfall structures between NCEP/NCAR reanalyses and observed data over tropical Africa. *Climate Dyn.*, **16**, 897–915, doi:10.1007/s003820000087.

Prasch, M., W. Mauser, and M. Weber, 2013: Quantifying present and future glacier melt-water contribution to runoff in a central Himalayan river basin. *Cryosphere*, **7**, 889–904, doi:10.5194/tc-7-889-2013.

Prein, A. F., and Coauthors, 2015: A review on regional convection-permitting climate modeling: Demonstrations, prospects, and challenges. *Rev. Geophys.*, **53**, 323–361, doi:10.1002/2014RG000475.

Pryor, S. C., and R. J. Barthelmie, 2014: Hybrid downscaling of wind climates over the eastern USA. *Environ. Res. Lett.*, **9**, 024013, doi:10.1088/1748-9326/9/2/024013.

Radić, V., A. Bliss, A. C. Beedlow, R. Hock, E. Miles, and J. G. Cogley, 2014: Regional and global projections of twenty-first century glacier mass changes in response to climate scenarios from global climate models. *Climate Dyn.*, **42**, 37–58, doi:10.1007/s00382-013-1719-7.

Räisänen, J., and J. S. Ylhäisi, 2011: How much should climate model output be smoothed in space? *J. Climate*, **24**, 867–880, doi:10.1175/2010JCLI3872.1.

Rangwala, I., and J. R. Miller, 2012: Climate change in mountains: A review of elevation-dependent warming and its possible causes. *Climatic Change*, **114**, 527–547, doi:10.1007/s10584-012-0419-3.

Rissel, R., 2012: Physikalische Interpretation des Temperatur-Index-Verfahrens zur Berechnung der Eisschmelze am Vernagtferner (Physical interpretation of the temperature index method for calculating ice melt at the Vernagtferner). B.S. thesis, Fachbereich Architektur, Bauingenieurwesen und Umweltwissenschaften, Technische Universität Braunschweig, 76 pp.

Salameh, T., P. Drobinski, M. Vrac, and P. Naveau, 2009: Statistical downscaling of near-surface wind over complex terrain in southern France. *Meteor. Atmos. Phys.*, **103**, 253–265, doi:10.1007/s00703-008-0330-7.

Scheel, M. L. M., M. Rohrer, C. Huggel, D. Santos Villar, E. Silvestre, and G. J. Huffman, 2011: Evaluation of TRMM Multi-Satellite Precipitation Analysis (TMPA) performance in the central Andes region and its dependency on spatial and temporal resolution. *Hydrol. Earth Syst. Sci.*, **15**, 2649–2663, doi:10.5194/hess-15-2649-2011.

Schmidli, J., C. Schmutz, C. Frei, H. Wanner, and C. Schär, 2002: Mesoscale precipitation variability in the region of the European Alps during the 20th century. *Int. J. Climatol.*, **22**, 1049–1074, doi:10.1002/joc.769.

Schmidli, J., C. Frei, and P. L. Vidale, 2006: Downscaling from GCM precipitation: A benchmark for dynamical and statistical downscaling methods. *Int. J. Climatol.*, **26**, 679–689, doi:10.1002/joc.1287.

Schmidli, J., C. M. Goodess, C. Frei, M. R. Haylock, Y. Hundecha, J. Ribalaygua, and T. Schmith, 2007: Statistical and dynamical downscaling of precipitation: An evaluation and comparison of scenarios for the European Alps. *J. Geophys. Res.*, **112**, D04105, doi:10.1029/2005JD007026.

Schubert, S., and A. Henderson-Sellers, 1997: A statistical model to downscale local daily temperature extremes from synoptic-scale atmospheric circulation patterns in the Australian region. *Climate Dyn.*, **13**, 223–234, doi:10.1007/s003820050162.

Shao, J., 1993: Linear model selection by cross-validation. *J. Amer. Stat. Assoc.*, **88**, 486–494, doi:10.2307/2290328.

Skamarock, W. C., 2004: Evaluating mesoscale NWP models using kinetic energy spectra. *Mon. Wea. Rev.*, **132**, 3019–3032, doi:10.1175/MWR2830.1.

Stern, R. D., and R. Coe, 1984: A model fitting analysis of daily rainfall data. *J. Roy. Stat. Soc.*, **147A**, 1–34.

Stidd, C. K., 1973: Estimating the precipitation climate. *Water Resour. Res.*, **9**, 1235–1241, doi:10.1029/WR009i005p01235.

Themeßl, M. J., A. Gobiet, and A. Leuprecht, 2011: Empirical-statistical downscaling and error correction of daily precipitation from regional climate models. *Int. J. Climatol.*, **31**, 1530–1544, doi:10.1002/joc.2168.

Themeßl, M. J., A. Gobiet, and G. Heinrich, 2012: Empirical-statistical downscaling and error correction of regional climate models and its impact on the climate change signal. *Climatic Change*, **112**, 449–468, doi:10.1007/s10584-011-0224-4.

Thorarinsdottir, T. L., and T. Gneiting, 2010: Probabilistic forecasts of wind speed: Ensemble model output statistics by using heteroscedastic censored regression. *J. Roy. Stat. Soc.*, **173A**, 371–388, doi:10.1111/j.1467-985X.2009.00616.x.

Van den Broeke, M. R., 1997: Structure and diurnal variation of the atmospheric boundary layer over a mid-latitude glacier in summer. *Bound.-Layer Meteor.*, **83**, 183–205, doi:10.1023/A:1000268825998.

Vuille, M., G. Kaser, and I. Juen, 2008: Glacier mass balance variability in the Cordillera Blanca, Peru and its relationship to climate and large-scale circulation. *Global Planet. Change*, **62**, 14–28, doi:10.1016/j.gloplacha.2007.11.003.

Weber, M., and M. Prasch, 2016: Influence of the glaciers on runoff regime and its change. *Regional Assessment of Global Change Impacts: The Project GLOWA-Danube*, W. Mauser and M. Prasch, Eds., Springer International Publishing, 493–509, doi:10.1007/978-3-319-16751-0_56.

Weichert, A., and G. Bürger, 1998: Linear versus nonlinear techniques in downscaling. *Climate Res.*, **10**, 83–93, doi:10.3354/cr010083.

Widmann, M., C. S. Bretherton, and E. P. Salathé Jr., 2003: Statistical precipitation downscaling over the northwestern United States using numerically simulated precipitation as a predictor. *J. Climate*, **16**, 799–816, doi:10.1175/1520-0442(2003)016<0799:SPDOTN>2.0.CO;2.

Wilks, D. S., 1997: Resampling hypothesis tests for autocorrelated fields. *J. Climate*, **10**, 65–82, doi:10.1175/1520-0442(1997)010<0065:RHTFAF>2.0.CO;2.

Wilks, D. S., 2011: *Statistical Methods in the Atmospheric Sciences*. 3rd ed. Elsevier, 676 pp.

Zorita, E., and H. von Storch, 1999: The analog method as a simple statistical downscaling technique: Comparison with more complicated methods. *J. Climate*, **12**, 2474–2489, doi:10.1175/1520-0442(1999)012<2474:TAMAAS>2.0.CO;2.