## 1. Introduction

Atmospheric reanalyses are the product of a synthesis of observations and numerical weather prediction model integration through a data assimilation scheme that optimally combines information from both sources. These datasets provide a four-dimensional retrospective analysis of the atmospheric state over decades or even centuries as in, for example, ERA-Interim (Dee et al. 2011) and the Twentieth Century Reanalysis (20CR; Compo et al. 2011). Therefore, reanalyses have become an essential tool for scientists to monitor, analyze, and understand the state of the climate as well as its change over time.

In the last decade, efforts have been undertaken to increase the commonly coarse resolution of global reanalyses by nesting a regional reanalysis into an existing global reanalysis by using a limited-area model; examples include the North American Regional Reanalysis (Mesinger et al. 2006) and the Arctic System Reanalysis (Wilson et al. 2011). A regional reanalysis that is based on the Consortium for Small-Scale Modeling (COSMO) model has also recently become available: the COSMO reanalysis at 6-km horizontal grid spacing (COSMO-REA6; Bollmeyer et al. 2015).

COSMO-REA6 provides data for multiple variables at a high horizontal resolution of 6.2 km. The reanalysis covers approximately 20 years of data and uses the European Coordinated Regional Climate Downscaling Experiment (CORDEX)-11 domain (http://www.cordex.org; Giorgi et al. 2009). Because no precipitation data are assimilated in its reanalysis cycle, the reanalysis data still exhibit shortcomings in the representation of precipitation, especially at the subdaily aggregation level. The aim of this study is to show the feasibility of a statistical downscaling approach for precipitation applied to COSMO-REA6 by using the “analog ensemble” (AnEn) method.

The AnEn approach is an analog-based technique that is mainly used as a postprocessing scheme for the output of numerical weather prediction (NWP) systems (e.g., Hamill and Whitaker 2006; Delle Monache et al. 2013; Eckel and Delle Monache 2016). The method provides sound probabilistic estimates for a specific variable. Because AnEn is based only on deterministic forecasts, the ensemble can be generated with much lower computational effort than is involved in running a full NWP ensemble configuration. A further advantage is the sampling of (analog) ensemble members from the observed distribution; thus, the corresponding “true” probability density function is estimated. This stands in contrast to other postprocessing methods based on statistical models that employ some kind of transformation to estimate the targeted distribution.

Recent implementations focus on the use of AnEn to provide estimates for parameters that are not forecast by the physical model itself (Alessandrini et al. 2015a; Vanvyve et al. 2015). Thereby, the sampling from observation space renders the use of a possibly error-prone forward operator unnecessary. With respect to the application of AnEn to precipitation, Hamill and Whitaker (2006) employ the technique to reforecasts that are generated with a coarse T62 (spectral resolution) version of NCEP’s Global Forecast System. Their findings show the potential of AnEn to improve precipitation forecasts. Its calibration with respect to the set of predictors, neighborhood size, and choice of metric is essential to obtaining an added value relative to the verification time series data, however.

In this study, the AnEn method is not applied to forecasts but rather to reanalysis data for downscaling precipitation. The main focus lies on testing and tuning the various characteristics of the AnEn approach with the aim of generating high-quality retrospective precipitation time series. This will allow for the reconstruction of synthetic observation datasets for time periods when no observations are available.

The paper is structured as follows: Section 2 describes the data used for this study, and section 3 provides details about the different aspects of the generation process of the analog ensembles. Section 4 presents the results, and section 5 summarizes them and provides conclusions.

## 2. Data

For the AnEn approach, two datasets are needed: 1) The predictors typically consisting of model simulations (forecasts) are used to determine which data are chosen as analogs. 2) The predictands constituting the actual members of the analog ensemble are taken from the set of observations of the parameter to be estimated.

### a. Observations

In this study, we focus on a dense network of precipitation observations over Germany obtained from the German Weather Service’s (DWD) climatological database (available online at

### b. Reanalysis

In previous implementations of the AnEn method, weather forecasts have been used as predictors to identify the analogs. In this approach, regional reanalysis data constitute the predictors. The reanalysis data come from a high-resolution regional reanalysis (the data are accessible online at ftp://ftp-cdc.dwd.de/pub/REA/COSMO_REA6) on a domain covering the European continent at a horizontal grid resolution of 6.2 km (REA6). Because of the fine horizontal resolution, the gridpoint values are not interpolated to the observation site; rather, the nearest-gridpoint approach is used. A visualization of the reanalysis grid can be found in Fig. 1, with the black crosses representing every fifth grid point of REA6.

For a preliminary investigation of predictors, a set of 37 REA6 variables was taken into account (4 variables accumulated over 6 h and 33 instantaneous variables). As values for the latter were available at both the beginning and the end of the respective 6-h time interval, these variables were reduced to a single value by simply calculating the mean of the two values. Experiments using both values show that the results differ only marginally.

To be able to run the experiments for all observing sites with the different settings discussed in the following section at a reasonable computational cost, the number of predictors had to be reduced. First, an absolute correlation threshold value of 0.4 was defined. Only predictors that exhibit such a correlation for at least one observing site are taken into consideration. From this subset, predictor variables that exhibit a high cross correlation with one or more other predictors in the subset are neglected. A list of the final 10 predictor variables can be found in Table 1.
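The two-stage screening described above can be sketched as follows. This is a minimal illustration under our own naming, not the operational implementation; in particular, the cross-correlation cutoff `cross_max` is an assumption, since the paper does not state the value used.

```python
import numpy as np

def screen_predictors(X, y, corr_min=0.4, cross_max=0.9):
    """Two-stage predictor screening: keep predictors whose absolute
    correlation with the observations reaches corr_min, then greedily drop
    predictors highly cross-correlated with an already retained one.

    X: (n_times, n_predictors) candidate predictor values at one site
    y: (n_times,) observed values at that site
    Returns the column indices of the retained predictors."""
    n_pred = X.shape[1]
    # Stage 1: absolute correlation with the predictand
    corr = np.array([abs(np.corrcoef(X[:, k], y)[0, 1]) for k in range(n_pred)])
    candidates = [k for k in np.argsort(-corr, kind="stable") if corr[k] >= corr_min]
    # Stage 2: cross-correlation pruning, strongest predictors first
    kept = []
    for k in candidates:
        if all(abs(np.corrcoef(X[:, k], X[:, j])[0, 1]) < cross_max for j in kept):
            kept.append(k)
    return kept
```

In the study, a predictor passes stage 1 if the threshold is reached at *at least one* observing site, so the per-site results would additionally be combined across sites.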

Table 1. List of variables chosen as predictors for the AnEn implementation.

## 3. Methods

The approach used in this study is based on the AnEn technique described by Delle Monache et al. (2013) that is mostly applied to postprocess numerical weather predictions. The idea of AnEn is to estimate the probability density function of a parameter at a given time step *t* (e.g., a forecast lead time) by sampling from a dataset of past observations of the same parameter. The observations to be sampled into AnEn are the ones that correspond to the predictors most similar to those at the desired time step *t* given a predictor dataset. In general, these predictors are taken from model simulations (e.g., numerical weather predictions). To measure the degree of similarity between the time step in question and the time steps in the training dataset from which the observations are sampled, a metric has to be defined that consists of either one or multiple predictors. The quality of the final AnEn is mainly dependent on the choice of these predictors as well as the design of the metric.

Our approach utilizes the AnEn method to conduct a statistical downscaling of reanalysis data by estimating an analog ensemble for an observational time series using reanalyzed variables as predictors. Figure 2 provides a schematic illustration of the applied approach. To accomplish the planned AnEn implementation, we take the following steps to construct the AnEn estimates:

1. Two time series covering an 8-yr period, namely, the observational dataset and the corresponding predictor dataset, are prepared. The observations are 6-hourly rain gauge measurements from the dense network of the DWD. The predictor data are taken from the REA6 containing numerous parameters. More details can be found in section 2.

2. The observation and predictor time series are divided into two 4-yr datasets: first, a training dataset from which the analogs will be taken and, second, a verification dataset for which the analog ensembles will be determined and that will be used to assess the performance of the AnEn method in this implementation.

3. The initial number of 37 predictors is reduced using a correlation analysis. By taking into account cross correlations, a final set of 10 predictors is chosen to be used in the determination of the AnEn estimates.

4. A metric is defined to calculate the similarity between a time step in the verification dataset and all time steps in the training dataset given the predictors chosen. The metric is set up so that it takes into account not only the grid cell containing the observation site but also a number of grid cells around it; that is, a neighborhood approach is applied.

5. Because the metric includes a weight for each predictor type, these weights have to be determined. To this end, an approach called simplified brute force (SBF) is applied to determine the weight for each predictor at each observation site independently. Further, to assess the performance of SBF in estimating the optimal predictor weights, a standard brute-force (BF) approach is used for 10 randomly chosen observation sites.

To evaluate different aspects of the method and its implementation, the following alternative datasets are generated for comparison:

- A random ensemble that is similar to the analog ensemble but whose members are chosen randomly from the training dataset.
- A logistic regression applied to the original reanalysis data as an alternative statistical postprocessing method.
- A repetition of the AnEn estimation restricted to summer or winter data, that is, with a seasonal filter applied to the AnEn approach.

### a. Metric

The metric is based on the *L*^{2} norm, extended by including a weight for each predictor. The value of the metric determining the similarity of time step *i* in the training dataset with respect to time step *j* in the verification dataset at station *s* is defined as

*m*_{i,j}(*s*) = ∑_{k=1}^{*N*_{p}} *w*_{k}(*s*) [*p*_{k,i}(*s*) − *v*_{k,j}(*s*)]^{2},

where *N*_{p} is the number of predictors, *p*_{k,i}(*s*) is the *k*th predictor normalized by its standard deviation in the training dataset at time *i* at location *s*, *v*_{k,j}(*s*) is the *k*th predictor normalized by its standard deviation in the verification dataset at time *j* at location *s*, and *w*_{k}(*s*) is the weight of the *k*th predictor for station *s*. The metric values for each time step in the training data are therefore determined by the similarity of each predictor between the chosen time step in the verification dataset and the time step in the training dataset. The impact of each predictor is further determined by the respective predictor weight.

When the metric values have been calculated for each time step in the training dataset, these values are sorted in increasing order. For an AnEn size of *N*_{ENS}, the first *N*_{ENS} indices are used to determine the members of the analog ensemble, that is, the observations at the respective times in the training dataset.
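The metric computation and the subsequent member selection can be sketched as follows. This is a minimal sketch under our own naming (`analog_ensemble` and its arguments are illustrative, not the operational code); the normalization by the training-period standard deviation follows the text.

```python
import numpy as np

def analog_ensemble(train_pred, train_obs, verif_pred, weights, n_ens):
    """Select an analog ensemble for one verification time step at one station.

    train_pred: (n_train, n_pred) predictor values in the training dataset
    train_obs:  (n_train,) observations paired with the training predictors
    verif_pred: (n_pred,) predictor values at the verification time step
    weights:    (n_pred,) predictor weights w_k(s)
    Returns the n_ens ensemble members and their metric values."""
    # Normalize each predictor by its standard deviation in the training set
    sigma = train_pred.std(axis=0)
    t_norm = train_pred / sigma
    v_norm = verif_pred / sigma
    # Weighted squared-difference metric for every training time step
    metric = ((t_norm - v_norm) ** 2 * weights).sum(axis=1)
    # The n_ens most similar training time steps supply the members
    best = np.argsort(metric)[:n_ens]
    return train_obs[best], metric[best]
```

The ensemble members are thus past *observations*, so the sampled values come directly from the observed distribution, as discussed in the introduction.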

### b. Neighborhood approach

Similar to Hamill and Whitaker (2006), the metric is calculated not only for the single grid cell containing the observation site but for a neighborhood of *N*_{NBH} × *N*_{NBH} grid boxes centered around the observation site, with *N*_{NBH} defining the number of grid boxes used in each direction. With this approach, the search pattern for the best analogs in the training-data time series is extended not only to take into account the best match of predictors at the observation site but to find the optimal analogs with respect to the similarity in the spatial structure of the predictors.

Previous studies have successfully applied the same AnEn algorithm to a range of meteorological variables and models by performing the matching at the observation location only (e.g., Delle Monache et al. 2011; Alessandrini et al. 2015b; Nagarajan et al. 2015). For precipitation, however, we hypothesize that the skill of the model may be more pronounced over a geographical area than at a single point, in which case performing the matching over a neighborhood of grid points may prove to be beneficial for downscaling a model-based estimate of precipitation. This hypothesis will be verified in section 4.

The settings for the neighborhood approach used in this study are *N*_{NBH} = 11, 21, 31, 41, and 51 as well as the simple approach with a single grid box (*N*_{NBH} = 1). With a horizontal grid size of 6.2 km, the largest neighborhood approximately covers an area of 316 × 316 km^{2} around the observation, thus including local and regional features of land surface and orography.

For the neighborhood approach, the metric is extended by summing the squared predictor differences over all grid points of the neighborhood:

*m*_{i,j}(*s*) = ∑_{k=1}^{*N*_{p}} *w*_{k}(*s*) ∑_{l=1}^{*N*_{NBH}×*N*_{NBH}} [*p*_{k,i,l}(*s*) − *v*_{k,j,l}(*s*)]^{2},

where *p*_{k,i,l}(*s*) and *v*_{k,j,l}(*s*) denote the (normalized) *k*th predictor at the grid point *l* of the neighborhood around station *s* at time *i* of the training dataset and at time *j* of the verification dataset, respectively. With this approach, we are able not only to take into account similarities in predictor space but also to calculate metrics that depend on the spatial patterns of the predictors; that is, the metric will be smallest where the predictor pattern of the training dataset corresponds most closely to the predictor pattern at the observation time *j*.
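The neighborhood extension of the metric can be sketched as follows (a minimal sketch with our own array layout; predictors are assumed to be pre-normalized by their training-period standard deviation, as in the single-gridpoint metric):

```python
import numpy as np

def neighborhood_metric(train_patch, verif_patch, weights):
    """Weighted metric over an N_NBH x N_NBH neighborhood.

    train_patch: (n_train, n_pred, n_nbh, n_nbh) predictor patches, training
    verif_patch: (n_pred, n_nbh, n_nbh) predictor patch at verification time
    weights:     (n_pred,) predictor weights
    Returns one metric value per training time step."""
    # Squared differences at every neighborhood grid point, summed over the
    # patch so that spatially similar predictor patterns yield small metrics
    sq = (train_patch - verif_patch) ** 2
    per_pred = sq.sum(axis=(2, 3))            # (n_train, n_pred)
    return (per_pred * weights).sum(axis=1)   # (n_train,)
```

A training time step whose predictor patch matches the verification patch exactly yields a metric of zero and is therefore ranked first.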

Notice that in previous studies (e.g., Delle Monache et al. 2011, 2013; Alessandrini et al. 2015a,b) the matching was also often performed over a small time window (e.g., three time steps), which in part compensates for the lack of spatial pattern matching in such studies. While it is likely that AnEn for precipitation also could be further enhanced by using multiple time steps (especially in combination with the neighborhood approach), the use of temporal trends has not been considered in this study for complexity reasons. The inclusion of temporal trends for precipitation would require modeling a complex multivariate process because of its non-Gaussian characteristics. Therefore, additional consideration would have to be given to the dichotomous case (precipitation yes/no) in such a framework.

### c. Finding the optimal set of predictors

To find the optimal combination of predictors, a common approach is the so-called brute-force technique, as proposed by Junk et al. (2015) and Alessandrini et al. (2015a), in which all the possible combinations of predictor weights are tested. Within the scope of this study, however, the application of BF is impractical because it would be computationally very expensive with respect to the amount of data to be considered. Hence, we follow the method that is suggested by Alessandrini et al. (2016) for selecting the AnEn predictors and the corresponding weights in an application related to forecasts of tropical cyclone intensity—simplified brute force.

In SBF, the weight for each predictor–observation site combination is determined by an iterative procedure applied to the training dataset. In an initial step, an AnEn is generated for each predictor by setting the weights of all other predictors to zero. For these *N*_{p} ensembles, the continuous ranked probability score (CRPS; e.g., Hersbach 2000) is calculated. The predictor resulting in the lowest CRPS is chosen as the first predictor *p*_{1}.

Then, for each of the remaining predictors, analog ensembles for all combinations of the other predictors with *p*_{1} are estimated. This is done by applying a weight to each predictor in each of the predictor combinations using an increment of 0.1, with the sum of the weights always being equal to 1. From these analog ensembles, the one with the lowest CRPS determines the second predictor *p*_{2}. This step is then repeated until the increase in performance (the decrease in CRPS) for an iterative step is lower than a predefined threshold, which is set to 1% in our case. In addition, we limit the maximum number of predictors to five for computational reasons. The determined weights *w*_{k}(*s*) are then used to calculate the metrics for finding the analog ensembles.
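The SBF search described above can be sketched as follows. This is an illustrative skeleton, not the operational code: `score_of_weights` is an abstract callable standing in for "build the AnEn with weight vector *w* and return its CRPS," so that the sketch stays self-contained.

```python
import itertools
import numpy as np

def weight_grid(n, step=10):
    """All allocations of weights over n active predictors in increments of
    1/step (0.1 here) that sum to 1, each active predictor getting > 0."""
    for c in itertools.product(range(1, step + 1), repeat=n):
        if sum(c) == step:
            yield [x / step for x in c]

def simplified_brute_force(score_of_weights, n_pred, max_pred=5, tol=0.01):
    """Iterative SBF weight search; lower score (CRPS) is better."""
    # Step 1: best single predictor
    singles = [score_of_weights(np.eye(n_pred)[k]) for k in range(n_pred)]
    chosen = [int(np.argmin(singles))]
    best_w, best_score = np.eye(n_pred)[chosen[0]], min(singles)
    # Add one predictor per iteration while the CRPS decreases by > tol (1%)
    while len(chosen) < max_pred and len(chosen) < n_pred:
        trial = None
        for k in range(n_pred):
            if k in chosen:
                continue
            for alloc in weight_grid(len(chosen) + 1):
                w = np.zeros(n_pred)
                for p, a in zip(chosen + [k], alloc):
                    w[p] = a
                s = score_of_weights(w)
                if trial is None or s < trial[0]:
                    trial = (s, w, k)
        if trial[0] >= best_score * (1.0 - tol):
            break  # improvement below the predefined threshold
        best_score, best_w, k_new = trial
        chosen.append(k_new)
    return best_w, best_score
```

The full BF approach would instead score every weight combination over all predictors at once, which is what makes it computationally impractical here.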

### d. Random ensemble

To analyze the performance of the AnEn method with respect to the resolution component of the ensemble, a random ensemble (RaEn) is generated for each station, verification time step, and ensemble size. The different realizations of the random ensembles are drawn from the observational training dataset for the respective time of day. The random-number generator is reinitialized with a new seed value for each draw so that identical draws (i.e., repeated permutations) occur only by chance.
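A minimal sketch of one RaEn draw is given below; sampling without replacement is our assumption (the text speaks of permutations), and the function name is ours.

```python
import numpy as np

def random_ensemble(train_obs, n_ens, seed):
    """Draw a random ensemble (RaEn) of size n_ens from the training
    observations for one verification time step. A fresh seed per draw
    makes repeated permutations purely coincidental."""
    rng = np.random.default_rng(seed)
    return rng.choice(train_obs, size=n_ens, replace=False)
```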

### e. Logistic regression

A further reference against which to measure the performance of AnEn is a common postprocessing tool for meteorological-model output. To this end, a logistic regression (LR) is performed for every station and each time of day. The LR estimation for each observing site is based on the same dataset as the AnEn estimation, that is, the same observations, predictors, and training and verification periods. However, with only 1461 available observations (i.e., training-data time steps), it is not possible to use the same neighborhood approach: whereas AnEn incorporates the neighborhood predictor values by adding them to a single metric value, LR would need a far greater number of observations to allow inclusion of all of the neighborhood values as predictors. Nevertheless, we performed LR separately for each neighborhood size using one value averaged over the neighborhood as predictor. The LR models are then applied to each verification time step, with LR providing probabilities of exceedance for a given threshold. The procedure is repeated for all of the thresholds used in the verification of AnEn (section 4).
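The LR reference can be sketched as follows for one station, one time of day, and one threshold. This is a from-scratch gradient-descent fit chosen so the sketch stays self-contained; the study does not specify the fitting procedure, and the function and parameter names are ours.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_exceedance_lr(x, y, lr=0.5, n_iter=2000):
    """Logistic regression P(exceedance | x) = sigmoid(a + b*x).
    x: predictor averaged over the neighborhood in the training period
    y: 0/1 series of threshold exceedances in the training period
    Returns the coefficients for standardized x and the (mean, std) used."""
    mu, sd = x.mean(), x.std()
    xs = (x - mu) / sd                         # standardize for stable training
    X = np.column_stack([np.ones_like(xs), xs])
    beta = np.zeros(2)
    for _ in range(n_iter):
        grad = X.T @ (y - sigmoid(X @ beta)) / len(y)
        beta += lr * grad                      # gradient ascent on log-likelihood
    return beta, (mu, sd)

def predict_exceedance(beta, norm, x_new):
    """Exceedance probability at a verification time step."""
    mu, sd = norm
    return sigmoid(beta[0] + beta[1] * (x_new - mu) / sd)
```

One such model is fit per threshold, which mirrors the study's setup of repeating the LR for every threshold used in the verification.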

### f. Seasonal filter

When determining the analog ensemble, the standard approach is to use all of the available data to find the analogs. Especially with respect to precipitation as a predictand, a seasonal filter is expected to be beneficial because the characteristics of precipitation generation vary considerably across seasons in Germany. Therefore, an experiment is set up in which the analog ensemble for a season (winter or summer) is determined only from the training data of the respective season; for example, summer analogs are generated using observations from June, July, and August in the training time series. This also includes the process of determining the predictors and their respective weights. The approach is similar in philosophy to that of Hamill and Whitaker (2006), who selected a training period of ±45 days centered on the date of interest.
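Restricting the training data to a season amounts to a simple index filter before the analog search, which can be sketched as (function name ours):

```python
import numpy as np

def seasonal_indices(months, season):
    """Indices of training time steps belonging to a season.
    months: array of the month (1-12) of each training time step
    season: 'summer' (JJA) or 'winter' (DJF), as used in the experiment."""
    wanted = {"summer": (6, 7, 8), "winter": (12, 1, 2)}[season]
    return np.where(np.isin(months, wanted))[0]
```

The predictor selection, weight estimation, and metric calculation are then all carried out on the filtered subset only.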

## 4. Results

The various aspects of the proposed AnEn implementation are reflected in this results section by performing different evaluations, each of which focuses on a specific aspect of the methods described in the six parts of section 3. Section 4a analyzes the performance of the predictor-finding algorithm SBF by comparing the CRPS of this approach with that of BF at 10 randomly chosen observation sites. In section 4b, the importance of the single predictor variables is investigated by looking at the average optimal weights for different precipitation thresholds. The effect of different settings for the main tuning parameters *neighborhood size* and *ensemble size* are explored in section 4c by examining the Brier skill score of AnEn. Section 4d further analyzes the performance of the AnEn estimates by comparing them with the original reanalysis in a deterministic fashion using the equitable threat score. Further, a probabilistic evaluation against the logistic regression model is conducted using the Brier skill score for different precipitation intensities. Section 4e takes a closer look at the probabilistic characteristics—namely, the reliability and resolution of the AnEn implementation—by using the CRPS against a random ensemble. Last, the benefit of the seasonal filter is investigated in section 4f by comparing the changes in Brier score for low- and high-precipitation events when only summer or winter data are used to estimate the analog ensemble.

### a. Performance of the predictor-finding algorithm

The best method for finding the optimal set of predictors is the BF approach, that is, testing all possible combinations of predictors and determining the one that results in the best score (here, the lowest CRPS). Because the BF method is very time consuming, SBF is chosen as a computationally cheaper alternative, as described in section 3c. To evaluate the performance of SBF relative to BF, we apply BF to 10 randomly chosen stations for all different settings of neighborhood and ensemble sizes. We then compare the resulting CRPS of the two methods.
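The CRPS underlying this comparison can be estimated directly from the ensemble members via the standard sample formula CRPS = E|X − y| − ½E|X − X′|; a minimal sketch (naming ours):

```python
import numpy as np

def crps_sample(ens, obs):
    """Sample-based CRPS averaged over all verification cases.
    ens: (n_cases, n_ens) ensemble members; obs: (n_cases,) observations."""
    obs_term = np.abs(ens - obs[:, None]).mean()
    spread_term = 0.5 * np.abs(ens[:, :, None] - ens[:, None, :]).mean()
    return obs_term - spread_term
```

The performance measure of this section is then simply the ratio of the CRPS obtained with BF-selected weights to that obtained with SBF-selected weights.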

The performance is assessed by calculating the CRPS ratio of the BF method and the SBF approach; that is, a perfect performance of SBF would result in a ratio value of 1. In Fig. 3, frequencies of ratio values above 0.95 are depicted as colored boxes in a row for each neighborhood size in the top panel (aggregated over the stations and ensemble sizes, the total number per row is 50) and for each ensemble size in the bottom panel (aggregated over neighborhood sizes and stations, the total number per row is 60).

In general, there are only 3 of 300 total combinations of stations, neighborhood sizes, and ensemble sizes that lie below 0.98 and none that lie below 0.97. In most cases, the ratio lies between 0.99 and 1.0, with 24 of the combinations showing the exact same solution for both methods.

With respect to the neighborhood size, the frequencies in the upper panel of Fig. 3 indicate that the performance of our SBF approach draws closer to the BF method with increasing neighborhood size; for example, for the largest neighborhood size (51 × 51), 47 of the 50 combinations lie above a ratio of 0.99. This gain in performance of SBF may be the result of the corresponding increase in the total number of variables used to calculate the metric. For a combination of two predictors, the number of predictor values determining the metric value is two, that is, the predictor values at the observation location. With a neighborhood of *N*_{NBH} × *N*_{NBH} grid boxes, however, this number rises to 2 × *N*_{NBH}^{2} (i.e., 5202 values for *N*_{NBH} = 51). Therefore, arbitrary variations in the predictor values leading to inferior SBF results (relative to BF) are less likely to occur.

For the ensemble sizes, a similar characteristic can be observed. The CRPS ratio draws closer to 1 with increasing ensemble size. For the case of 50 ensemble members, all predictor sets determined by our approach result in ratio values of 0.99 or higher, that is, a decrease in performance of 1% at most relative to the BF method. The reason may be that the number of possible subsets determined as the analog ensemble is considerably reduced with increasing ensemble size.

In terms of the geographic locations of the stations, Fig. 4 shows a map containing the 10 stations used in the BF experiment and their corresponding CRPS ratio averaged over all 30 combinations of neighborhood and ensemble sizes. For all stations, mean ratios above 0.99 are attained. Although the values seem to decrease from north to south, such a dependence could be coincidental and cannot be established with 10 locations.

### b. Predictor weights

As described in the methods section, a set of predictor weights (from 0 to 1) is calculated independently for each of the 742 observation sites to optimally generate our analog ensembles given the training dataset, that is, to minimize the CRPS of the resulting AnEn for each site. Using the six neighborhood sizes and the five ensemble sizes, 30 AnEn estimates and therefore 30 sets of predictor weights are obtained for each station.

To investigate optimal predictor sets with respect to different precipitation thresholds without performing the SBF and the BF again for all thresholds, we utilize the 30 obtained predictor weight sets by calculating the Brier score for the respective precipitation thresholds and determining, for each site, the predictor set with the lowest Brier score. From these 742 predictor sets, we calculate the average weight of each of the 10 predictors. We apply this procedure independently for each of the precipitation thresholds to illustrate the variations of the optimal predictor set with precipitation intensity. Figure 5 shows the average over all observation sites as colored boxes for each predictor–threshold combination.
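The threshold-wise evaluation rests on the Brier score of ensemble-derived exceedance probabilities, which can be sketched as follows (a minimal version with our own naming):

```python
import numpy as np

def brier_score(ens, obs, threshold):
    """Brier score for one precipitation threshold. The forecast probability
    is the fraction of ensemble members above the threshold; the outcome is
    whether the observation exceeded it."""
    prob = (ens > threshold).mean(axis=1)      # (n_cases,)
    outcome = (obs > threshold).astype(float)  # (n_cases,)
    return float(((prob - outcome) ** 2).mean())
```

For each site, the weight set among the 30 with the lowest such score for a given threshold is taken as that site's threshold-specific optimum.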

From the prechosen set of predictors, precipitation itself has the largest weights and contributes the most to the overall performance of the implemented AnEn approach. Further, the algorithm seems to benefit from the predictors zonal wind at 850 hPa and, to some degree, convective available potential energy (CAPE). For the smallest threshold of 0.1 mm, that is, precipitation yes/no, all predictors other than precipitation have only minor relevance for the outcome of AnEn.

While one would expect some differences when considering seasonal variations, the results indicate major differences only for the zonal wind at 850 hPa, which has a stronger influence in summer. Its average weight is higher for the smaller thresholds (0.5, 1.0, and 2.0 mm) in summer, and vice versa for winter. For summer, this interrelation can be attributed to orography-induced (convective) precipitation in which zonal winds play a role. In winter, heavy precipitation over Germany is often connected with low pressure systems, which typically exhibit strong westerly winds. On average, however, the importance of precipitation as a predictor dominates the optimal metrics of AnEn for all thresholds and seasons.

To provide insight about regional and local variations of optimal predictor sets, Fig. 6 depicts a map of the predictor weights for precipitation (left panel) and zonal wind at 850 hPa (right panel) at each observation site for the minimum CRPS achieved in the 30 experiments without seasonal filtering.

While the predictor precipitation attains widespread weight values of 0.7 and higher, there are regions with weights below 0.4 or even 0.2. When these results are compared with the map for the zonal wind speed at 850 hPa on the right, it becomes evident that these are the regions where the weights of zonal wind are high (even drawing close to 1). Comparison with the underlying orography (gray shading) shows that the higher weights of the zonal wind are collocated with two different topographic features. First, a larger area of high weights for zonal wind stretches from just inland of the western shore of the North Sea down into the central north German plain. This can be attributed to precipitation intensity being strongly dependent on the onshore moisture transport from the North Sea. Because the prevailing winds here are westerly, a similar effect cannot be observed for the predictor meridional wind in the regions south of the shores of the North Sea and Baltic Sea (not shown). Second, the other observations with high zonal-wind predictor weights mostly follow the main ridges of midrange mountainous regions such as the Black Forest or the Hunsrück. Most of these regions lie in the western part of Germany, that is, toward the more maritime climate with prevailing westerlies. There are also, however, a few locations with high zonal-wind weights in the southeast, for example, in the Bavarian Forest. These orographic effects can be explained by precipitation amplification and attenuation at the midmountain ridges being driven by the zonal wind.

### c. Neighborhood and ensemble sizes

Figure 7 shows the Brier skill score (BSS) of AnEn, with LR as the reference, for all combinations of neighborhood sizes on the *x* axis and ensemble sizes on the *y* axis. Redder colors indicate better performance of AnEn, and bluer colors indicate that LR is better; the black dot denotes the maximum BSS in a diagram. The BSS is calculated over all seasons (Figs. 7a,d), winter (Figs. 7b,e), and summer (Figs. 7c,f) and for threshold values of 1.0 (Figs. 7a–c) and 5.0 mm (Figs. 7d–f).

In general, the results show superior performance of AnEn relative to LR, with LR only performing better than AnEn for the higher threshold of 5.0 mm at the smallest ensemble size of 10 members or without the neighborhood approach (left column of each panel). This is especially pronounced in the winter season; during summer, AnEn performs better than LR for all combinations.

The combination of neighborhood and ensemble size for which the maximum BSS is achieved varies between seasons and thresholds, so the best settings depend on the focus of the user. If an overall solution is wanted, a reasonable compromise would be a 31 × 31 neighborhood with 30 ensemble members.

### d. Original reanalysis and logistic regression

Although the major focus of AnEn is to provide probabilistic estimates, we first compare the AnEn results deterministically with REA6 as an initial assessment of the estimation quality. To this end, the root-mean-square error (RMSE) with respect to the precipitation observations is calculated for the AnEn ensemble mean (ensemble size 30 and neighborhood size 31 × 31) as well as for the original REA6 precipitation estimates. The results show that, depending on the choice of neighborhood and ensemble size (see the previous section), AnEn can reduce the average RMSE by more than 10% over all stations and by more than 40% for specific stations.

Taking a deeper look into the deterministic performance of the AnEn estimates, Fig. 8 shows the equitable threat scores (ETS) for different precipitation thresholds also calculated from the AnEn ensemble mean and REA6. The box plots are derived from the results for the 742 observation sites. With the variability among the sites being very large, the deterministic performance of AnEn relative to the original reanalysis strongly depends on the location. For each threshold, there are sites for which AnEn outperforms the original reanalysis and vice versa. In general, AnEn seems to perform better than the reanalysis for thresholds of 1.0 and 2.5 mm. For precipitation thresholds that are larger than 5 mm, the ETS could not be calculated because there were not enough data to provide a reasonable contingency table.

For a further analysis of AnEn performance, especially with respect to probabilistic characteristics, the original REA6 data have to be postprocessed to allow for a fair comparison using probabilistic metrics. Therefore, the well-established statistical LR model is applied to the REA6 data using the same training dataset as that of AnEn. The results are shown as BSSs with the LR estimates as reference. Figure 9 shows a stationwise comparison of the BSS for the two thresholds of 1.0 (left panel) and 5.0 mm (right panel) precipitation. Again, the redder colors indicate better performance of AnEn and bluer colors indicate that LR is better. For the lower threshold, AnEn beats LR at each station, with the performance gain varying with geographic region. The largest gain in performance can be observed over central Germany, with values decreasing to the north and south. The BSS is in general lower for the windward side of midrange mountains and is better for the leeward side, which might again be related to the performance of the reanalysis itself over these areas.

For the higher threshold of 5.0 mm, the BSS is in general much closer to 1, indicating a much better performance of AnEn relative to LR. A windward and leeward effect similar to that for the lower threshold can also be observed, albeit more weakly. This shortcoming could be compensated for by using a longer training dataset. The fact that AnEn outperforms LR with respect to the Brier score for certain thresholds is even more remarkable because AnEn is tuned to the CRPS whereas LR is applied for each threshold separately.

### e. Random ensemble

A central question when using AnEn is whether its performance can be attributed solely to a better representation of reliability or whether the skill in terms of resolution also increases. Therefore, the performance of AnEn is compared with that of RaEn, in which the same number of ensemble members is chosen randomly from the training dataset, as described in section 3d.
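A minimal sketch of how such a random reference ensemble could be drawn (the training record, its length, and the seed here are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)

# Illustrative training record of observed precipitation (~4 years, daily)
training_obs = rng.gamma(0.5, 4.0, size=1460)

# RaEn: draw the same ensemble size as AnEn, but purely at random, so the
# ensemble follows the observed climatology yet carries no case-specific
# information; any skill AnEn shows beyond RaEn must come from resolution
n_members = 40
raen = rng.choice(training_obs, size=n_members, replace=False)
```

Because RaEn is sampled from the same observed distribution as AnEn, it isolates the climatological (reliability) component of the skill.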

Figure 10 shows a comparison for each station between AnEn and RaEn in terms of the continuous ranked probability skill score (CRPSS), defined analogously to the Brier skill score in Eq. (3) but with RaEn as the reference. Values greater than 0 indicate higher skill for AnEn, whereas values less than 0 indicate poorer performance relative to RaEn.
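For reference, the empirical CRPS of an ensemble can be written as E|X − y| − ½ E|X − X′|, where X and X′ are independent draws from the ensemble and y is the observation; the CRPSS then follows by analogy with Eq. (3). A minimal Python sketch (illustrative, not the study's code):

```python
import numpy as np

def crps_ensemble(members, obs):
    """Empirical CRPS of an ensemble: E|X - y| - 0.5 * E|X - X'|."""
    x = np.asarray(members, dtype=float)
    term1 = np.mean(np.abs(x - obs))                      # distance to obs
    term2 = 0.5 * np.mean(np.abs(x[:, None] - x[None, :]))  # ensemble spread
    return term1 - term2

def crpss(crps_anen, crps_raen):
    """CRPSS with RaEn as reference; > 0 means AnEn is more skillful."""
    return 1.0 - crps_anen / crps_raen
```

For a single member, the CRPS reduces to the absolute error, which is why it is often read as a probabilistic generalization of the mean absolute error.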

The plot exhibits a CRPSS of greater than 0 for all stations, with the skill score increasing from the northwest to the southeast. This gradient roughly reflects the transition from maritime climate toward the North Sea coast to a humid continental climate in the southern and eastern parts of Germany; that is, the AnEn method for precipitation seems to gain more skill from resolution in a continental climate.

Lower CRPSS values can also be found in the midrange mountains of central Germany. This result may be due to orographic effects that are not well represented in the reanalysis itself and therefore decrease the performance gain of the AnEn method.

### f. Seasonal filter

The results of the seasonal-filter version of precipitation AnEn are shown in Figs. 11 and 12 as the change in Brier score at each station when the seasonal filter is applied to the training data, relative to the AnEn approach using the full training dataset.

Figure 11 shows the results for the seasonal filter for winter. A widespread improvement of the Brier score, that is, negative values, can be found for both thresholds (1.0 and 5.0 mm), with the amplitude of the reduction in Brier score being larger for the 5.0-mm threshold.

The results for summer are shown in Fig. 12. For the 1.0-mm threshold, the reduction in Brier score is similar to the winter results. For the 5.0-mm precipitation threshold, however, the variability in summer is much higher than in winter. Averaged across all stations, there is still a reduction in Brier score, but multiple stations show higher Brier scores with the seasonal filter, especially in the midrange mountains in southwestern Germany and along the Elbe River (from the North Sea coast in a southeasterly direction). Hence, these stations perform worse when only the seasonal training data are used. This is likely because the training dataset from which the analogs are chosen is only 4 years long; for a summer season, the number of available data points is thus 368, which may be too small to find reasonable analogs for a threshold of 5.0-mm precipitation in some areas.

## 5. Conclusions

A statistical downscaling approach for reanalysis precipitation data using the analog ensemble method has been presented. The results show its potential to generate retrospective time series data for precipitation observations, with AnEn even being able to outperform the reanalysis itself, depending on location and precipitation intensity. Furthermore, AnEn provides a reliable quantification of the underlying uncertainty of the reanalysis.

In general, the performance of the AnEn method strongly depends on the quality of the predictors, that is, the reanalysis data and especially reanalyzed precipitation. Therefore, AnEn performs better for stations on flat terrain than for those in midrange mountains, where the reanalysis is prone to deviate more often from the true atmospheric conditions. On the other hand, the improvements of AnEn with respect to the reanalysis become larger with increasing complexity of the topography.

To assess the added value of AnEn in terms of uncertainty quantification, a logistic regression is applied to the reanalysis data; LR serves as a well-established statistical reference method for generating probabilistic estimates from deterministic time series. Even though AnEn only samples from the distribution in observation space and is not tuned to specific thresholds, the generated analog ensemble clearly outperforms the logistic regression. The superior performance comes not only from the use of the ensemble approach itself, that is, from enhancing the reliability of the estimates, but also from the skill of the analog ensembles in terms of resolution. This can be seen from the comparison with a random ensemble taken from the same training dataset, that is, sampled from the same observation distribution.

The quality of AnEn-reconstructed time series varies with the configuration adopted. Experiments were conducted with different numbers of analogs to assess the impact of ensemble size on the quality of AnEn for precipitation. Although smaller ensemble sizes (i.e., 20 members) have proven sufficient for other AnEn applications (Delle Monache et al. 2013), an ensemble of 30–40 members appears better suited to capture the characteristics of downscaled precipitation, especially for larger precipitation thresholds. The largest tested ensemble size of 50 members shows a lower performance on average than sizes of 30 or 40 members. For experiments with longer training periods, however, the use of larger ensemble sizes could be reconsidered. A longer training dataset would provide a larger number of appropriate samples because the selected analogs would then constitute a smaller fraction of the total number of data points; that is, the quality of the estimated probability density function would be enhanced. A larger sample basis would particularly benefit the representation of extreme-precipitation events because of their sparseness in the dataset.

The neighborhood size also influences the quality of the AnEn approach. The 1 × 1 experiment, that is, using the predictors without a neighborhood, results in the worst performance of all neighborhood-size experiments. We can therefore confirm our hypothesis that for precipitation analogs the use of spatial predictor patterns leads to a considerable performance increase whereas local atmospheric properties alone do not seem to be sufficient to determine the analogs properly. The best performance was achieved by using a neighborhood size of 31 × 31 or 41 × 41. For different setups, and especially for different horizontal grid sizes, the optimal settings of the neighborhood size may be different because the extent of the relevant patterns may well depend on the scale resolved by the reanalysis.

Because the characteristics of precipitation over Germany are strongly dependent on the season, the AnEn results can be enhanced by using a seasonal filter; for example, analog time series for summer are constructed by using training data that only contain summer months. For a long-term reconstruction, such a seasonal filter could be expanded to a moving window by taking a fixed number of days/months around the day/month in question. This would result in extended computations but would most likely enhance the quality of the analog time series.

With respect to the predictors used, reanalyzed precipitation is naturally the most important one for the construction of synthetic precipitation observations. Other predictors can also play a major role in enhancing the quality of the analogs, however, especially with respect to the season and amplitude of the precipitation event. Again, the quality enhancement of the estimates through an increased number of predictors would come at the cost of increased computational demand. Therefore, the applied simplified brute-force (SBF) algorithm provides a reasonable solution for finding the optimal set of predictors. SBF is shown to maintain levels of performance very similar to those of the optimal but expensive brute-force approach, with a considerable decrease in computational effort. For this reason, it allows one to test and eventually use a larger set of predictors in the AnEn metric.

Overall, the experiments in this study show as a proof of concept that AnEn has the potential to generate high-quality synthetic precipitation observation time series from reanalysis data. The study further indicates the possibilities of enhancing the output by adapting key tuning parameters such as ensemble size and neighborhood size. Depending on the specifications of the application, such as time frame and/or the characteristics of the input data, these tuning parameters have to be revisited.

## Acknowledgments

This work has been conducted in the framework of the Hans-Ertel-Centre for Weather Research funded by the German Federal Ministry for Transportation and Digital Infrastructure (Grant BMVI/DWD DWD2014P5A). The authors also thank the National Center for Atmospheric Research (NCAR) for granting visiting funds as well as providing computational resources at the NCAR–Wyoming Supercomputing Center.

## REFERENCES

Alessandrini, S., L. Delle Monache, S. Sperati, and G. Cervone, 2015a: An analog ensemble for short-term probabilistic solar power forecast. *Appl. Energy*, **157**, 95–110, doi:10.1016/j.apenergy.2015.08.011.

Alessandrini, S., L. Delle Monache, S. Sperati, and J. N. Nissen, 2015b: A novel application of an analog ensemble for short-term wind power forecasting. *Renewable Energy*, **76**, 768–781, doi:10.1016/j.renene.2014.11.061.

Alessandrini, S., L. Delle Monache, C. M. Rozoff, and W. E. Lewis, 2016: Probabilistic prediction of hurricane intensity with an analog ensemble. *Special Symp. on Hurricane Katrina*, New Orleans, LA, Amer. Meteor. Soc., 486. [Available online at https://ams.confex.com/ams/96Annual/webprogram/Paper289851.html.]

Bollmeyer, C., and Coauthors, 2015: Towards a high-resolution regional reanalysis for the European CORDEX domain. *Quart. J. Roy. Meteor. Soc.*, **141**, 1–15, doi:10.1002/qj.2486.

Compo, G. P., and Coauthors, 2011: The Twentieth Century Reanalysis project. *Quart. J. Roy. Meteor. Soc.*, **137**, 1–28, doi:10.1002/qj.776.

Dee, D., and Coauthors, 2011: The ERA-Interim reanalysis: Configuration and performance of the data assimilation system. *Quart. J. Roy. Meteor. Soc.*, **137**, 553–597, doi:10.1002/qj.828.

Delle Monache, L., T. Nipen, Y. Liu, G. Roux, and R. Stull, 2011: Kalman filter and analog schemes to postprocess numerical weather predictions. *Mon. Wea. Rev.*, **139**, 3554–3570, doi:10.1175/2011MWR3653.1.

Delle Monache, L., T. Eckel, D. Rife, and B. Nagarajan, 2013: Probabilistic weather prediction with an analog ensemble. *Mon. Wea. Rev.*, **141**, 3498–3516, doi:10.1175/MWR-D-12-00281.1.

Eckel, F. A., and L. Delle Monache, 2016: A hybrid NWP–analog ensemble. *Mon. Wea. Rev.*, **144**, 897–911, doi:10.1175/MWR-D-15-0096.1.

Giorgi, F., C. Jones, and G. R. Asrar, 2009: Addressing climate information needs at the regional level: The CORDEX framework. *WMO Bull.*, **58**, 175–183. [Available online at https://library.wmo.int/pmb_ged/bulletin_58-3.pdf.]

Hamill, T. M., and J. S. Whitaker, 2006: Probabilistic quantitative precipitation forecasts based on reforecast analogs: Theory and application. *Mon. Wea. Rev.*, **134**, 3209–3229, doi:10.1175/MWR3237.1.

Hersbach, H., 2000: Decomposition of the continuous ranked probability score for ensemble prediction systems. *Wea. Forecasting*, **15**, 559–570, doi:10.1175/1520-0434(2000)015<0559:DOTCRP>2.0.CO;2.

Junk, C., L. Delle Monache, S. Alessandrini, G. Cervone, and L. von Bremen, 2015: Predictor-weighting strategies for probabilistic wind power forecasting with an analog ensemble. *Meteor. Z.*, **24**, 361–379, doi:10.1127/metz/2015/0659.

Mesinger, F., and Coauthors, 2006: North American Regional Reanalysis. *Bull. Amer. Meteor. Soc.*, **87**, 343–360, doi:10.1175/BAMS-87-3-343.

Nagarajan, B., L. Delle Monache, J. P. Hacker, D. L. Rife, K. Searight, J. C. Knievel, and T. N. Nipen, 2015: An evaluation of analog-based postprocessing methods across several variables and forecast models. *Wea. Forecasting*, **30**, 1623–1643, doi:10.1175/WAF-D-14-00081.1.

Vanvyve, E., L. Delle Monache, A. J. Monaghan, and J. O. Pinto, 2015: Wind resource estimates with an analog ensemble approach. *Renewable Energy*, **74**, 761–773, doi:10.1016/j.renene.2014.08.060.

Wilson, A. B., D. H. Bromwich, and K. M. Hines, 2011: Evaluation of polar WRF forecasts on the Arctic System Reanalysis domain: Surface and upper air analysis. *J. Geophys. Res.*, **116**, D11112, doi:10.1029/2010JD015013.