Verification of Solid Precipitation Forecasts from Numerical Weather Prediction Models in Norway

Assessing the quality of precipitation forecasts requires observations, but all precipitation observations have associated uncertainties, making it difficult to quantify the true forecast quality. One of the largest uncertainties is due to the wind-induced undercatch of solid precipitation gauge measurements. This study discusses how this impacts the verification of precipitation forecasts for Norway for one global model [the high-resolution version of the ECMWF Integrated Forecasting System (IFS-HRES)] and one high-resolution, limited-area model [the Applications of Research to Operations at Mesoscale (AROME)-based MetCoOp Ensemble Prediction System (MEPS)]. First, the forecasts are compared with high-quality reference measurements (less undercatch) and with simpler, commonly available measurement equipment (substantial undercatch) at the Haukeliseter observation site. Then the verification is extended to include all Norwegian observation sites: 1) stratified by wind speed, since calm (windy) conditions experience less (more) undercatch; and 2) by applying transfer functions, which convert measured precipitation to what would have been measured with high-quality equipment with less undercatch, before the forecast–observation comparison is performed. Results show that the wind-induced undercatch of solid precipitation has a substantial impact on verification results. Furthermore, applying transfer functions to adjust for wind-induced undercatch of solid precipitation gives a more realistic picture of true forecast capabilities. In particular, estimates of systematic forecast biases are improved, and to a lesser degree, verification scores like correlation, RMSE, ETS, and stable equitable error in probability space (SEEPS). However, uncertainties associated with applying transfer functions are substantial and need to be taken into account in the verification process. Precipitation forecast verification for liquid and solid precipitation should be done separately whenever possible.


Introduction
Accurate precipitation forecasts are important for weather warnings, as an integral part of the hydrological cycle, and for people's everyday life. However, all liquid-equivalent precipitation measurements, whether made by precipitation gauges, ultrasonic measurements, snow pillows, radar, satellite, or other equipment, and whether operated by professionals or by citizen weather stations, have weaknesses (e.g., Sun et al. 2018; Tapiador et al. 2017; Nitu et al. 2018; Goble et al. 2020). This makes it difficult to quantify the true precipitation, and hence the true quality of weather forecasts. This study discusses how the wind-induced undercatch of solid precipitation by precipitation gauge measurements (e.g., Rasmussen et al. 2012; Nitu et al. 2018) impacts the verification process of weather forecasts.
Precipitation gauges are widely used for verification of short- and medium-range precipitation forecasts (e.g., Rodwell et al. 2010; Cookson-Hills et al. 2017; Frogner et al. 2019), as the basic input to many gridded precipitation products, and as ground truth for calibrating or verifying precipitation estimates from radar and satellite (e.g., Sun et al. 2018; Sebastianelli et al. 2013). The main advantages of precipitation gauges are that the observation data are easily accessible, measure precipitation directly, and include long time series for a number of places. However, some disadvantages are spatial and temporal representativity and observational errors. The challenges of the high spatial and temporal variability of precipitation in the verification process have been addressed over the last few decades with the emergence of high-resolution numerical weather prediction (NWP) models (e.g., Dorninger et al. 2018; Gilleland et al. 2010). However, the observational error part has received less attention. In cold regions and periods (e.g., high latitudes, elevations, and winter) the wind-induced undercatch of solid precipitation is the dominant source of observational errors. In the presence of wind the airflow around the opening of the precipitation gauge is disturbed and hydrometeors may not fall into the gauge (e.g., Rasmussen et al. 2012; Nitu et al. 2018; Colli et al. 2016). The magnitude of the error depends on, among other factors, the wind speed, the shape of the gauge, artificial and natural shielding of the gauge, precipitation phase, and shape and fall velocity of the hydrometeors (e.g., Kochendorfer et al. 2017; Rasmussen et al. 2012; Nitu et al. 2018). Wetting and evaporation errors, capping on top of the precipitation gauge, and blowing snow are some other features that may also have an impact on the quality of the observations (Rasmussen et al. 2012).
The World Meteorological Organization (WMO) has facilitated several projects to improve precipitation gauge measurements (Goodison et al. 1998; Sevruk et al. 2009; Nitu et al. 2018). This includes the recently finished WMO Solid Precipitation Intercomparison Experiment (SPICE), in which a set of different measurement equipment was tested and compared at a number of locations with various climates. A reference measurement setup has been agreed on, the double-fence automated reference (DFAR), which minimizes the wind-induced undercatch and to which other measurement setups can be compared (e.g., Rasmussen et al. 2012; Wolff et al. 2013). Such reference setups are not feasible for widespread use, as they require more space, are more expensive than simpler configurations, and may not comply with local legal restrictions. However, based on controlled experiments, so-called transfer functions (TFs) have been developed to estimate what would have been measured by the DFAR, given measurements with a simpler configuration.
In many verification studies, the effect of wind-induced undercatch has been ignored (e.g., Frogner et al. 2019; Haiden et al. 2012), or used to explain results without an attempt toward quantification of the effect (e.g., Gowan et al. 2018; Wang et al. 2019). However, the undercatch has been taken into account in a quantitative way in some studies. Buisán et al. (2020) compared verification results obtained using WMO SPICE DFAR observations and more commonly used measurement equipment prone to wind-induced undercatch (both unadjusted and adjusted with TFs). They demonstrated that in the absence of DFAR or other reference observations, solid precipitation verification is complex and subject to uncertainty, but that applying TFs provides more reliable verification scores, although the results then still contain the limitations and uncertainty associated with the TFs. Cookson-Hills et al. (2017) applied a TF when the daily mean air temperature was below 0°C, Bromwich et al. (2018) used gridded precipitation datasets that had undergone quality-control procedures to better handle the wind-induced undercatch, and Schirmer and Jamieson (2015) verified forecast winter and mountain precipitation against observations from ultrasonic snow depth measurements and snow pillows. In the latter study a systematic underestimation in forecast winter and mountain precipitation was found, as opposed to the forecast overestimation found in studies using precipitation gauges prone to undercatch. Similarly, a change from overestimation to underestimation of precipitation was found for four NWP systems when corrected precipitation measurements were used in an Arctic region. These studies show that not only is the magnitude of the precipitation errors uncertain in cold regions, but sometimes also the sign of the bias.
The aim of this work is 1) to study the impact of wind-induced undercatch of observed solid precipitation on forecast verification, 2) to contribute to the development of best practices for using precipitation gauges in verification, and 3) to assess the true quality of solid precipitation forecasts in Norway from the high-resolution version of the global ECMWF Integrated Forecasting System (IFS-HRES) and from the control run of the operational Nordic regional convection-permitting system, the MetCoOp Ensemble Prediction System (MEPS). This study extends previous work by putting additional effort into estimating the uncertainty in the verification introduced by wind-induced undercatch, and by extending the time period, domain, and verification metrics studied. This work also uses the outcomes of the recently finished WMO SPICE in new applications.
The observations and NWP systems used in this study are described in section 2, while the applied transfer functions are described and discussed in section 3. In section 4, precipitation observations, adjusted observations, and forecasted precipitation are compared for one of the WMO SPICE supersites, Haukeliseter in Norway. In section 5, hourly precipitation forecasts in Norway are verified, both stratified by wind speed and after adjusting the observations with transfer functions. In section 6, a method to adapt the adjustments of hourly precipitation to daily precipitation observations is suggested and tested, in order to increase the data sample. Section 7 discusses where wind-induced undercatch of solid precipitation is important. Finally, a summary and conclusions are given in section 8.

a. Observation data
In this study we use quality-controlled observations from the Norwegian Meteorological Institute (eklima.met.no) of hourly and daily precipitation, instantaneous 2-m air temperature (T2m), and 10-m wind speed (WS10). The quality control system consists of both automatic and human quality control routines to flag or remove suspicious or erroneous observations (Kielland 2005). However, erroneous precipitation observations due to wind-induced undercatch are not removed or flagged in the quality control process. All hourly precipitation observations are measured with Single-Alter shielded Geonor (SA Geonor) precipitation gauges, while the equipment for daily precipitation varies. In total, there are 76 stations with hourly precipitation (SA Geonor), T2m, and WS10 measurements. In addition, there are 206 stations where daily precipitation is measured by Norwegian/Swedish gauges, of which 179 have a Norwegian version of the Nipher windshield and 27 are unshielded. It should be noted that the automated SA Geonor measurements are built to maintain homogeneity of the precipitation records in Norway when replacing the manual gauges, and the two have been shown to be very similar in what they measure (Bakkehøi et al. 1985; Jacobi et al. 2019; M. Wolff 2020, personal communication). For the Norwegian observations the wind-induced undercatch is the dominant error source, but wetting and evaporation errors, capping on top of the precipitation gauge, and blowing snow may also play a substantial role (Førland et al. 1996). The spatial distribution of observation sites is shown in Fig. 1.
The Haukeliseter observation site was a supersite in the recently finished WMO SPICE, where targeted studies were done on wind-induced undercatch of solid precipitation. Haukeliseter is located in the Norwegian mountains (59.8118°N, 7.2143°E, 991 m MSL; Fig. 1) and is described in detail in Wolff et al. (2013, 2015). In this study we make use of the Haukeliseter collocated Geonor precipitation observations with SA and DFAR shielding, together with WS10 and T2m measurements. It should be noted that the DFAR Geonor measurements also experience wind-induced undercatch of solid precipitation, estimated to be 5%–6% (Nitu et al. 2018), but considerably less than other measurement equipment. Kochendorfer et al. (2017) showed that, averaged over the eight WMO SPICE sites, the wind-induced undercatch was 24% for SA shieldings and 34% for completely unshielded precipitation gauges.

b. Model data
The two NWP systems verified are the operational versions of 1) the high-resolution version of the global ECMWF Integrated Forecasting System (IFS-HRES; Buizza et al. 2017), and 2) the control run from the Nordic regional convection-permitting system, MetCoOp Ensemble Prediction System (MEPS; Müller et al. 2017; Bengtsson et al. 2017; Frogner et al. 2019).
IFS-HRES employs a 4D-Var assimilation scheme for upper air fields, and optimal interpolation for near-surface variables. The system has approximately 9-km grid spacing and 137 vertical layers. Cloud and large-scale precipitation processes are described by prognostic equations for cloud liquid water, cloud ice, rain, snow and a gridbox fractional cloud cover. The basic design of the scheme with respect to the prognostic cloud fraction and sources/sinks of all cloud variables due to the major microphysical generation and destruction processes follows Tiedtke (1993). However, liquid and ice water contents are independent, allowing a more physically realistic representation of supercooled liquid water and mixed-phase cloud than the original scheme. Rain and snow precipitate with a determined terminal fall speed and can be advected by the three-dimensional wind. A multidimensional implicit solver is used for the numerical solution of the cloud and precipitation prognostic equations (Forbes et al. 2011).
MEPS is an operational regional convection-permitting ensemble prediction system for Norway, Sweden and Finland, based on the HIRLAM-ALADIN Research on Mesoscale Operational NWP in Euromed (HARMONIE)-Applications of Research to Operations at Mesoscale (AROME). The system employs a 3D-Var assimilation scheme for upper air and optimal interpolation for surface assimilation. The system has 2.5-km grid spacing and 65 vertical layers. Lateral boundary conditions are taken from 6-h old IFS-HRES forecasts, and the regional domain is shown in Fig. 1. In MEPS, the cloud microphysics is handled by the ICE3 scheme (Cohard and Pinty 2000a,b), with some additions known as the OCND2 option (Müller et al. 2017). ICE3 is a single-moment scheme with explicit calculations for the mass of cloud water, rain, cloud ice, snow, and graupel. Shallow convection is parameterized by the EDMFm scheme (Bengtsson et al. 2017).
The model data are paired to the observation data by applying nearest neighbor interpolation. The choice of interpolation method may have an impact on the verification results, but Køltzow et al. (2019) showed that for a winter period in Norway the difference in root-mean-square error between nearest neighbor and bilinear interpolation is less than 2% for precipitation. The two NWP systems do not use precipitation measurements in their assimilation process, and the verification is therefore done with independent observation data.
The main verification period in this study is the 3-yr-long period from January 2017 to December 2019 in which both model systems have gone through upgrades. In the verification of operational systems this will always be the case unless the systems are verified for shorter periods. To make the statistics more robust longer periods are required and we do believe the model upgrades have minor impact on how undercatch in precipitation observations influences the verification. Also, in the period chosen, there were only minor changes in the microphysics scheme in both the MEPS and the IFS. In the section on verification of daily precipitation we present time series of the evolution of forecast errors during the period.

Transfer functions
Transfer functions aim to convert measured precipitation exposed to wind-induced undercatch to what would have been measured with equipment less prone to wind-induced undercatch, e.g., to convert SA Geonor measurements to what would have been measured by DFAR Geonor. Many different TFs are reported in the literature (e.g., Goodison et al. 1998; Sevruk et al. 2009; Smith 2009; Wolff et al. 2015; Kochendorfer et al. 2017; Buisán et al. 2017), and they may vary in input parameters, sites (characterized by different climatologies) used for development, wind sheltering, measurement equipment, and accumulation periods, to mention some aspects. We apply the TF developed during the recently finished WMO SPICE and described in Kochendorfer et al. (2017):

$$\mathrm{CE} = \exp\left\{-a\,U\left[1 - \tan^{-1}\left(b\,T_{\mathrm{air}}\right) + c\right]\right\}. \tag{1}$$

The TF [Eq. (1)] states that the catch efficiency (CE) of observed precipitation is a (negative) exponential function of the wind speed $U$ with a coefficient in the exponent that varies with the arctangent of the air temperature $T_{\mathrm{air}}$. Universal values of the three coefficients a, b, and c are determined based on observation data from 30-min precipitation events from eight globally distributed WMO SPICE sites for unshielded and SA shielded measurements, for wind measurements at 10 m and at gauge height.
A substantial case-to-case variability in the observed CE for the same wind speed and temperature exists (e.g., Fig. 4 in Wolff et al. 2015), implying that additional factors must be of importance, e.g., precipitation characteristics and other local conditions. The site dependency is seen in Table 1, with differing a, b, and c coefficients when these are estimated from the eight individual WMO SPICE sites (coefficients provided by J. Kochendorfer 2020, personal communication). Plotting the CEs with the different site coefficients (Fig. 2) underlines the large uncertainty in the adjustment of the observed precipitation. For example, with an air temperature of −2.5°C and 3 m s⁻¹ wind speed the CE varies from 0.88 to 0.52 depending on the choice of coefficients (sites). In addition, for all coefficient sets there is a maximum wind speed (max WS, right column of Table 1), beyond which the CE is kept constant, since above this threshold the observational data are scarce and the coefficients are difficult to estimate. In Fig. 2 the dashed lines show the extrapolated CE without such a cutoff beyond max WS; for these wind speeds the CE should be used very cautiously. A suggestion on how the observation uncertainty can be taken into account in the verification process is made and tested in section 5.
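As an illustration of how such a TF is applied, the following minimal Python sketch evaluates a catch efficiency of the form CE = exp{−a U [1 − arctan(b T_air) + c]}, which is the form assumed here for Eq. (1), and divides a measured amount by it. The function names and the coefficient values are hypothetical placeholders chosen for this example; they are not the values in Table 1. The optional `max_ws` argument mimics the constant-CE cutoff beyond the maximum wind speed.

```python
import math


def catch_efficiency(wind_speed, air_temp, a, b, c, max_ws=None):
    """Catch efficiency CE = exp(-a * U * (1 - arctan(b * T_air) + c)).

    wind_speed in m/s, air_temp in degrees Celsius. If max_ws is given,
    CE is held constant for wind speeds above that threshold.
    """
    u = wind_speed if max_ws is None else min(wind_speed, max_ws)
    return math.exp(-a * u * (1.0 - math.atan(b * air_temp) + c))


def adjust_precipitation(measured, wind_speed, air_temp, a, b, c, max_ws=None):
    """Estimate reference-like precipitation from a gauge measurement."""
    return measured / catch_efficiency(wind_speed, air_temp, a, b, c, max_ws)


# Placeholder coefficients for illustration only (NOT the values in Table 1).
A, B, C = 0.03, 1.0, 0.7
ce_calm = catch_efficiency(0.0, -5.0, A, B, C)      # no wind: CE equals 1
ce_windy = catch_efficiency(6.0, -5.0, A, B, C)     # CE drops with wind speed
ce_capped = catch_efficiency(20.0, -5.0, A, B, C, max_ws=9.0)
```

With real coefficients, comparing the capped and uncapped CE at high wind speeds shows how strongly the choice of cutoff affects the adjustment at windy sites.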

Observed, adjusted, and forecasted precipitation at Haukeliseter
At Haukeliseter precipitation is observed with both the commonly applied SA Geonor (prone to wind-induced undercatch) and the WMO reference equipment DFAR (less prone to wind-induced undercatch). Accumulated precipitation from DFAR and the unadjusted SA Geonor measurements, and from MEPS and IFS-HRES forecasts (lead times from +6 to +30 h), at Haukeliseter from 15 December 2017 to 31 March 2018 is shown in Fig. 3. In addition, the accumulated adjusted precipitation obtained by applying TFs [Eq. (1) with different sets of coefficients from Table 1] to the SA Geonor measurements is shown. Notice that this period is not included in the period used to determine the TF coefficients for Haukeliseter and can therefore be used for independent verification. The SA Geonor at Haukeliseter measures 57% (accumulated 225 mm) of the precipitation amount in the DFAR Geonor (393 mm), illustrating the problem of wind-induced undercatch of solid precipitation. The undercatch also results in less accurate observations in terms of variability, because the amount of undercatch depends on wind speed. This is demonstrated by a linear correlation between SA and DFAR Geonor measurements of 0.88 (e.g., SA Geonor measurements are not always able to identify which events had the most precipitation).
By applying the TF with universal coefficients, the accumulated precipitation is 306 mm, which is 36% more than measured by SA Geonor, but only 78% of what was measured by the DFAR Geonor. Applying the individual coefficients gives a large span in estimates of accumulated precipitation, from 250 mm (Sodankyla coefficients) to 444 mm (Formigal coefficients), i.e., a span from 64% to 113% of DFAR Geonor. The correlation between TF-estimated precipitation and DFAR Geonor measurements varies from 0.88 to 0.93, i.e., a small improvement over the SA/DFAR correlation. The best match between TF-adjusted and DFAR measurements is, as expected, found when using the coefficients from Haukeliseter itself, with an estimated accumulated precipitation of 372 mm (95% of DFAR Geonor) and a correlation of 0.93. In comparison, the more complex transfer function developed by Wolff et al. (2015) for Haukeliseter gives a similar result (2% overestimation and correlation of 0.93) for the investigated period.
The validity of the different sets of coefficients for the TF is limited by the max WS10 in Table 1. For wind speeds higher than these thresholds there were insufficient data to estimate the CE from the WMO SPICE sites. In Fig. 3, we have kept a constant CE, determined by the max WS10, beyond these thresholds. However, since Haukeliseter is a windy site, it can be argued that this approach gives a too high CE (resulting in a too mild adjustment) for coefficient sets with a low maximum wind speed threshold. We therefore also applied a threshold of 12 m s⁻¹ for all coefficient sets, i.e., applied the dashed lines in Fig. 2 up to 12 m s⁻¹. With the exception of the Haukeliseter-based coefficients (which already have a threshold of 12 m s⁻¹), this gave a relatively modest increase in estimated precipitation, in the range between 3% and 27% (4% for the universal coefficients) at Haukeliseter; however, the main features of Fig. 3 are kept (not shown). At Haukeliseter, MEPS (479 mm) and IFS-HRES (521 mm) substantially overestimate the SA Geonor measured precipitation, with 213% and 232% of the observed precipitation, respectively. They also overestimate, but to a much smaller degree, compared to DFAR Geonor measurements, with 122% (MEPS) and 133% (IFS-HRES). The correlation is also improved: 0.74 (both models) against DFAR Geonor, compared with 0.65 (IFS-HRES) and 0.66 (MEPS) against SA Geonor. However, the correlation between forecasts and TF-adjusted observations is only slightly higher than against SA Geonor (an increase between 0.01 and 0.04, with the Haukeliseter coefficients giving the largest increase). This indicates that applying TFs improves estimates of forecast bias (i.e., the correspondence between the mean forecast and mean observation), but improves the correlation only to a more limited degree.

TABLE 1. Universal coefficients a, b, and c for 10-m height wind measurements, for Eq. (1) based on all WMO SPICE sites (Single-Alter shielded and unshielded) and a cutoff wind speed (insufficient data for deriving CE above this limit) from Table 2 in Kochendorfer et al. (2017). Same coefficients and wind speed cutoff based on individual WMO SPICE sites for Single-Alter shieldings obtained from J. Kochendorfer (2020, personal communication).

Verification of hourly precipitation forecasts

a. Stratification on wind speed
By stratifying the precipitation verification with respect to wind speed, more weight can be put on the results obtained during calm conditions and less on those obtained during windy conditions exposed to wind-induced undercatch. The additive forecast bias of hourly accumulated liquid (T2m > +2°C) and solid (T2m < −2°C) precipitation from MEPS and IFS-HRES when compared to uncorrected SA Geonor measurements, stratified by wind speed, is shown in Figs. 4a and 4b. The observations used are from the 76 sites observing hourly precipitation, T2m, and WS10 described in section 2a. For MEPS, the liquid precipitation bias is insensitive to wind speed, while the bias for solid precipitation increases with wind speed. We believe that (at least part of) this positive bias is artificial and due to the wind-induced undercatch of the measurements. A similar behavior is also seen for solid precipitation from IFS-HRES. However, IFS-HRES also has an increase in precipitation bias with wind speed for liquid precipitation that cannot be explained by observational issues.
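The stratification can be sketched in a few lines of Python. The example below bins forecast–observation pairs by observed wind speed, computes the ratio of total forecast to total observed precipitation in each bin, and estimates a percentile bootstrap interval by resampling cases. The ratio-of-sums definition of the bias, the bin edges, the function names, and the data are assumptions made for illustration and are not taken from the study.

```python
import random


def bias_by_wind_bin(forecast, observed, wind, edges):
    """Ratio bias for cases whose wind speed falls in [edges[i], edges[i+1])."""
    result = {}
    for lo, hi in zip(edges[:-1], edges[1:]):
        idx = [i for i, w in enumerate(wind) if lo <= w < hi]
        obs_sum = sum(observed[i] for i in idx)
        if obs_sum > 0:  # skip empty or completely dry bins
            result[(lo, hi)] = sum(forecast[i] for i in idx) / obs_sum
    return result


def bootstrap_ci(forecast, observed, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for the ratio bias (resampling cases)."""
    rng = random.Random(seed)
    n = len(forecast)
    samples = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        obs_sum = sum(observed[i] for i in idx)
        if obs_sum > 0:
            samples.append(sum(forecast[i] for i in idx) / obs_sum)
    samples.sort()
    lower = samples[int((1 - level) / 2 * len(samples))]
    upper = samples[int((1 + level) / 2 * len(samples)) - 1]
    return lower, upper


# Synthetic example values (not data from the study).
forecast = [1.0, 2.0, 1.5, 3.0]
observed = [1.0, 2.0, 1.0, 2.0]
wind = [1.0, 2.0, 5.0, 6.0]
bias = bias_by_wind_bin(forecast, observed, wind, edges=[0.0, 4.0, 8.0])
ci = bootstrap_ci(forecast, observed)
```

In this synthetic case the calm bin is unbiased while the windy bin shows an apparent overestimation, which is the pattern one would expect if undercatch grows with wind speed.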
In addition to continuous metrics, skill scores based on categorical forecasts are commonly used for precipitation verification. The Equitable Threat Score (ETS) is shown for liquid and solid precipitation in calm (< 2.5 m s⁻¹) and windy (> 4 m s⁻¹) conditions for a variety of precipitation thresholds in Figs. 5a and 5b. The ETS is derived from the threat score = hits/(hits + false alarms + misses), which is then modified to account for hits obtained by a random forecast [see Wilks (2011, chapter 8) or Jolliffe and Stephenson (2012) for definition and more details]. The ETS therefore measures the fraction of observed and/or forecast events that were correctly predicted, relative to chance, and is one of the most widely used scores in the verification of precipitation forecasts. ETS is a positively oriented score, i.e., a larger score indicates a better forecast, with a perfect score being 1 and a forecast with no skill achieving the value 0. During calm conditions the ETS is quite similar for liquid and solid precipitation, but during windy conditions there is a large difference depending on the precipitation phase, apparently showing low skill for solid precipitation, which we argue is (at least partially) due to wind-induced observation errors.
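For concreteness, the ETS for a single threshold can be computed from the contingency table as in the minimal Python sketch below, using the standard definition in which the hits expected from a random forecast are subtracted from both numerator and denominator. The function name and the convention that an event is "at or above the threshold" are choices made for this example.

```python
def equitable_threat_score(forecast, observed, threshold):
    """ETS = (hits - hits_random) / (hits + misses + false_alarms - hits_random)."""
    hits = misses = false_alarms = n = 0
    for f, o in zip(forecast, observed):
        n += 1
        f_event, o_event = f >= threshold, o >= threshold
        if f_event and o_event:
            hits += 1
        elif o_event:
            misses += 1
        elif f_event:
            false_alarms += 1
    if n == 0:
        return float("nan")
    # Hits expected by chance, given the forecast and observed event frequencies.
    hits_random = (hits + misses) * (hits + false_alarms) / n
    denom = hits + misses + false_alarms - hits_random
    return (hits - hits_random) / denom if denom != 0 else float("nan")


# Synthetic examples: a perfect forecast and a forecast with chance-level hits.
ets_perfect = equitable_threat_score([0, 1, 0, 2], [0, 1, 0, 2], 0.5)
ets_no_skill = equitable_threat_score([1, 1, 0, 0], [1, 0, 1, 0], 0.5)
```

Because undercatch lowers the observed amounts, observed events can drop below a threshold and turn hits into false alarms, which is one way the ETS for solid precipitation in windy conditions can appear artificially low.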
The verification of precipitation stratified by wind speed clearly indicates that when solid precipitation is present the results may be misleading if the wind-induced undercatch is not considered. However, we do not suggest applying such an approach operationally: a weakness of using wind stratification to assess the forecast quality is that many precipitation events happen during windy conditions, so the trustworthy part of the verification is based on a much smaller data sample. Such an approach would also put more weight on wind-sheltered sites and generally include fewer cases from windy areas such as coastal and mountainous regions. In addition, only including calm conditions may provide a skewed sample in terms of types of precipitation systems.

b. Applying transfer functions
In this section, we use the TF developed by Kochendorfer et al. (2017) and described in section 3 to adjust the observations from the 76 hourly SA Geonor sites (described in section 2a and also used in section 5a) before they are used in verification. The large spread in CE depending on the choice of coefficient set (Table 1 and Figs. 2 and 3) underlines the high uncertainty associated with the choice of transfer function. Ideally, the most appropriate TF and coefficients should be found for each observation site (e.g., sites similar to Haukeliseter should use coefficients from Haukeliseter, etc.). However, there is no straightforward and objective way to do this. Therefore, we present for all Norwegian observation sites that observe hourly precipitation, WS10, and T2m, precipitation biases applying the universal coefficients, accompanied by the spread resulting from applying all sets of coefficients in Table 1. The latter is an attempt to quantify the uncertainty in the choice of TFs. For this purpose, we argue that extrapolation beyond the maximum wind speed thresholds in Table 1 (dashed curves in Fig. 2) gives a better representation of the uncertainty than applying an unrealistic cutoff threshold (constant CE beyond max WS).

FIG. 4. Liquid (red) and solid (blue) precipitation forecasts divided by observed precipitation (SA Geonor measurements) for (a) MEPS and (b) IFS-HRES, stratified by observed wind speed. Shaded areas are 95% confidence intervals calculated by bootstrapping.
Liquid and solid precipitation biases for all Norwegian sites for MEPS and IFS-HRES for a 3-yr-long period (from January 2017 to December 2019) are shown in Fig. 6. An underestimation of solid precipitation is seen at coastal stations (marked with blue), which is even more pronounced after applying TFs. In contrast, a general overestimation of solid precipitation is seen in the mountains (stations situated at more than 700 m MSL, marked with red) before corrections, but this is reduced to a neutral or less pronounced positive bias after applying TFs. The uncertainty (spread between different choices of TF coefficients) is large for the mountain sites, which is expected due to higher wind speeds. The average MEPS bias is positive for both liquid (forecast/measured = 1.05) and solid (1.15) precipitation before corrections. However, applying the universal coefficients gives an underestimation of solid precipitation (forecast/adjusted observations = 0.87). To estimate the uncertainty in the average bias we assume that all coefficient sets in Table 1 are equally likely to be representative for a site. The total bias is then calculated by combining random draws of which TF coefficients to use at each station, e.g., a random draw of coefficients at site 1 is combined with a random draw of coefficients at site 2, and so on until the last site, after which an average bias can be calculated. This process is repeated 1000 times and the spread between the 1000 calculated average biases is used as a measure of the uncertainty, ranging from 0.71 to 0.88, with median 0.82 (forecast/adjusted observation). Many of the same features found for MEPS are also valid for IFS-HRES. For IFS-HRES, an overestimation of liquid precipitation (forecast/measured = 1.26) is present, while the solid precipitation bias changes from an overestimation (1.21) to an underestimation (0.92) after adjusting the observations, with an uncertainty range from 0.75 to 0.93 and median 0.86.
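The random-draw procedure described above can be sketched as follows. For each repetition, one coefficient set is drawn independently for every site, the corresponding adjusted observed totals are summed, and the ratio of total forecast to total adjusted observation is stored. The data layout, the function name, the ratio-of-sums definition of the average bias, and the numbers are assumptions for illustration only.

```python
import random
import statistics


def sample_average_bias(forecast_totals, adjusted_totals, n_draws=1000, seed=0):
    """Spread of the average bias under random choices of TF coefficients.

    forecast_totals: forecast precipitation sum for each site.
    adjusted_totals: for each site, a list with the adjusted observed sum
        obtained with each candidate coefficient set (all equally likely).
    Returns (minimum, median, maximum) of the sampled average biases.
    """
    rng = random.Random(seed)
    forecast_sum = sum(forecast_totals)
    biases = []
    for _ in range(n_draws):
        # One independent draw of a coefficient set per site.
        obs_sum = sum(rng.choice(options) for options in adjusted_totals)
        biases.append(forecast_sum / obs_sum)
    return min(biases), statistics.median(biases), max(biases)


# Synthetic totals in mm for two sites and two coefficient sets each.
forecast_totals = [100.0, 200.0]
adjusted_totals = [[100.0, 125.0], [200.0, 250.0]]
low, med, high = sample_average_bias(forecast_totals, adjusted_totals)
```

Drawing the coefficient set independently per site lets over- and under-adjustments partly cancel in the average, so the resulting spread is narrower than if one coefficient set were applied to all sites at once.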
After adjustments of the observations it is evident that both model systems overestimate liquid precipitation, but underestimate the true solid precipitation. It is important to understand the reasons for this model behavior, but it requires further investigations and is out of the scope of this paper. Also verification metrics like normalized standard deviation of the error (NSDE), root-mean-square error (RMSE), correlation, and ETS were calculated with and without applying TFs for solid precipitation. However, only small changes in these metrics were found when applying raw and adjusted observations (not shown).
The large local spatial variability in precipitation biases seen for both model systems, e.g., between the three mountain stations Klevavatn (underestimation), Midtstova (overestimation), and Finse (neutral/underestimation), situated less than 20 km apart, indicates that local effects and the observation site sampling play an important role for the verification results. Moreover, when averaging over all sites as above, the sites are all equally weighted, i.e., areas with a denser observation network are given more weight. It would therefore be beneficial to also verify forecasts against daily accumulated precipitation, which would include many more observation sites and make the statistics more robust and possibly more geographically representative.

Verification of daily precipitation forecasts
Precipitation observations are becoming more and more automated and thereby also available for a range of accumulation periods (e.g., as short as an hour). However, a large number of precipitation measurements are still done manually, e.g., daily accumulated precipitation at Norwegian climate stations. Including these observations in the verification process will make the results more robust (by increasing the number of observations) and make it possible to create long-term verification time series.
Applying TFs to adjust daily observations for wind-induced undercatch requires some additional considerations. 1) Information on the time of day when the precipitation mainly accumulated on a given day is needed to use the best possible WS10 and T2m to calculate the CE. To overcome this issue we follow the approach of Vormoor and Skaugen (2013) and distribute the observed daily precipitation in time by using the temporal patterns of the forecasts. If the forecast has no precipitation, the observed precipitation is distributed equally over the 24 h. 2) Observations of WS10 and T2m may not be available. To overcome this we use WS10 and T2m from the forecasts, which was shown to be a useful approach by Masuda et al. (2019) using reanalysis. 3) A mixture of observation equipment/sheltering exists, and hence different TFs are ideally needed for different stations. Norwegian stations observing daily precipitation deploy SA Geonor, Norwegian/Swedish gauges with a Nipher wind shield, and Norwegian unshielded precipitation gauges. However, the SA Geonor measurements and the manual gauges used in Norway have been shown to be very similar in what they measure (Bakkehøi et al. 1985; Jacobi et al. 2019; M. Wolff 2020, personal communication). Therefore, we apply Eq. (1) with the universal SA Geonor coefficients (Table 1) on all sites with a windshield and the coefficients for unshielded gauges (Table 1) on the unshielded observations.
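Step 1 of this procedure can be sketched in Python as below: the observed daily total is distributed over the hours in proportion to the forecast hourly amounts (or equally if the forecast is dry), each hourly share is divided by an hourly catch efficiency, and the adjusted shares are summed back to a daily value. The function names are hypothetical, and the hourly CE values are assumed to have been computed beforehand from WS10 and T2m with the chosen TF.

```python
def distribute_daily_precip(daily_obs, hourly_forecast):
    """Split an observed daily total using the forecast's temporal pattern."""
    n = len(hourly_forecast)
    total_forecast = sum(hourly_forecast)
    if total_forecast <= 0:
        # Dry forecast: spread the observed amount equally over all hours.
        return [daily_obs / n] * n
    return [daily_obs * f / total_forecast for f in hourly_forecast]


def adjust_daily_precip(daily_obs, hourly_forecast, hourly_ce):
    """Adjusted daily total: each hourly share divided by that hour's CE."""
    hourly_obs = distribute_daily_precip(daily_obs, hourly_forecast)
    return sum(p / ce for p, ce in zip(hourly_obs, hourly_ce))


# Synthetic two-period example (a real application would use 24 hourly values).
shares = distribute_daily_precip(10.0, [1.0, 3.0])          # proportional split
uniform = distribute_daily_precip(12.0, [0.0] * 24)         # dry forecast
adjusted = adjust_daily_precip(10.0, [1.0, 3.0], [0.5, 1.0])
```

The adjustment is sensitive to timing: in the example, only the share assigned to the period with CE = 0.5 is doubled, so a forecast that places precipitation in the wrong (windier or calmer) hours changes the adjusted daily total.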
To test steps 1 and 2 in the described approach, we compare the adjusted daily precipitation at observation sites where the adjustments can be based on WS10, T2m, and precipitation timing from both forecasts and observations. For the winter months (December, January, and February) of 2017–19, applying observation-based adjustments (observed T2m, WS10, and timing of precipitation) gives an increase in total precipitation of 17%, while applying MEPS and IFS-HRES adjustments (forecasted T2m, WS10, and timing of precipitation) gives increases of 22% and 20%, respectively. A general positive wind speed bias in MEPS leads to an overestimation of the wind-induced undercatch. On the other hand, the correlation between adjustments calculated from observed values and from MEPS is 0.87, while it is 0.79 for IFS-HRES. We attribute the lower correlation for IFS-HRES to less spatial detail and an underestimation of wind speed in some regions (e.g., in the mountains). This comparison is done mainly at sites equipped with SA Geonor. In the rest of this section we calculate the CE based on observations where observations of WS10 and T2m exist and based on MEPS forecasts for the other sites.
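The two quantities compared above (the relative increase in total precipitation and the correlation between adjustments) can be sketched as follows. The function names are hypothetical, and "adjustment" is assumed here to mean the adjusted minus the raw daily amount, which the text does not specify; the input values are synthetic.

```python
import numpy as np

def relative_increase(raw_mm, adjusted_mm):
    """Relative increase in total precipitation after adjustment."""
    raw = np.asarray(raw_mm, dtype=float)
    adj = np.asarray(adjusted_mm, dtype=float)
    return float((adj.sum() - raw.sum()) / raw.sum())

def adjustment_correlation(raw_mm, adjusted_a_mm, adjusted_b_mm):
    """Pearson correlation between two sets of adjustments, where an
    adjustment is assumed to be (adjusted - raw) for each day."""
    raw = np.asarray(raw_mm, dtype=float)
    incr_a = np.asarray(adjusted_a_mm, dtype=float) - raw
    incr_b = np.asarray(adjusted_b_mm, dtype=float) - raw
    return float(np.corrcoef(incr_a, incr_b)[0, 1])

# Synthetic daily amounts (mm), for illustration only.
raw = [1.0, 2.0, 3.0]
adj_obs = [1.2, 2.4, 3.6]    # adjusted with observed WS10/T2m
adj_model = [1.3, 2.6, 3.9]  # adjusted with forecasted WS10/T2m
increase_obs = relative_increase(raw, adj_obs)
corr = adjustment_correlation(raw, adj_obs, adj_model)
```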
The daily bias for different seasons, before and after measurement adjustments, is shown in Fig. 7. A substantial change in the biases is present during winter (DJF). For MEPS the bias changes, on average, from an almost neutral bias to an underestimation of almost 0.8 mm day⁻¹ (~20%). A clear change in bias for MEPS is also seen during spring (MAM) and autumn (SON). Without any adjustment of the observations IFS-HRES shows a consistent overestimation of 0.4–0.8 mm day⁻¹ (10%–25%) during all seasons. However, in winter this is changed to a negative bias (~0.2 mm day⁻¹, ~5%) after measurement adjustments. In spring and autumn the measurement adjustments also clearly reduce the IFS-HRES overestimation. The interpretation of these results should take into account the added uncertainty in the adjustment process of daily accumulated precipitation described above. For example, since the adjustment with T2m and WS10 from MEPS had a tendency to adjust too much, the impact seen on the verification results can be regarded as an upper limit of the expected impact. However, these results are qualitatively similar to what was found for hourly observations and adjustments. Although a large impact is found in systematic biases when adjusting the observations, only a minor impact is again seen on metrics like correlation, NSDE, RMSE, and ETS (not shown).
In the following we apply the "stable equitable error in probability space" metric, hereafter named SEEPS (Rodwell et al. 2010). SEEPS divides the precipitation into "dry," "light precipitation," and "heavy precipitation" based on the local climatology. Here we define "dry" as measured daily accumulated precipitation of less than 0.5 mm, and the highest one-third of the precipitation amounts (excluding dry events) as "heavy." This means that "heavy precipitation" events are not necessarily high-impact or extreme precipitation events. The SEEPS score is chosen because it is one of the headline verification scores at ECMWF and MET-Norway and because it encourages refinement, discourages hedging, can be combined for different climatic regions, is less sensitive to sampling uncertainty, is equitable, and measures the error in probability space and is therefore less sensitive to observation and representativeness errors (Rodwell et al. 2010). The SEEPS score can be decomposed, and different aspects of the precipitation forecast error can be quantified (Haiden et al. 2012). A disadvantage is that SEEPS requires knowledge about the local precipitation climatology; i.e., it is necessary to have long time series of precipitation observations available, which should be adjusted for wind-induced observation errors. However, this is not always possible (i.e., WS10 and T2m are not available for all sites). Instead we take advantage of the fact that SEEPS measures the error in probability space, and we therefore adjust the forecasts to what should be expected to be measured (CE × forecasted precipitation) in windy conditions. The adjusted forecasts are then compared with the (unadjusted) observations and the SEEPS score can be calculated.
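The category definitions and the forecast adjustment described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the operational code: the function names are hypothetical, the light/heavy threshold is taken as the 2/3 quantile of the non-dry climatological amounts, and the SEEPS scoring matrix itself (Rodwell et al. 2010) is not reproduced.

```python
import numpy as np

DRY_THRESHOLD_MM = 0.5  # from the text: "dry" is < 0.5 mm per day

def heavy_threshold(climatology_daily_mm):
    """Light/heavy boundary: the 2/3 quantile of the non-dry
    climatological amounts, so the top third counts as heavy."""
    clim = np.asarray(climatology_daily_mm, dtype=float)
    wet = clim[clim >= DRY_THRESHOLD_MM]
    return float(np.quantile(wet, 2.0 / 3.0))

def seeps_category(daily_mm, heavy_threshold_mm):
    """Return 0 (dry), 1 (light), or 2 (heavy) for each daily amount."""
    p = np.asarray(daily_mm, dtype=float)
    cat = np.ones(p.shape, dtype=int)
    cat[p < DRY_THRESHOLD_MM] = 0
    cat[p > heavy_threshold_mm] = 2
    return cat

def expected_measured_forecast(fcst_hourly_mm, ce_hourly):
    """Daily forecast reduced to what a gauge is expected to measure,
    i.e., the sum of hourly CE x forecasted precipitation."""
    fcst = np.asarray(fcst_hourly_mm, dtype=float)
    ce = np.asarray(ce_hourly, dtype=float)
    return float(np.sum(ce * fcst))

# Synthetic climatology (mm per day), for illustration only.
threshold = heavy_threshold([0.0, 0.0, 0.2, 1, 2, 3, 4, 5, 6, 7])
categories = seeps_category([0.0, 0.4, 0.5, 3.0, 5.0, 9.0], 5.0)
reduced = expected_measured_forecast([2.0, 2.0], [0.5, 1.0])
```

The reduced forecast and the raw observation would then both be categorized with thresholds from the unadjusted climatology before scoring.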
For both NWP systems SEEPS scores are slightly worse after adjustments are made, and on average 3%–4% worse for the winter months (Fig. 8). A simple test in which SEEPS is calculated only for sites with elevations higher than 300 m MSL (i.e., increasing the fraction of solid precipitation in the dataset) shows slightly larger differences (not shown). Furthermore, the change in SEEPS score due to adjustments is less than the difference between the two NWP systems verified. MEPS and IFS-HRES have quite similar performance in 2019, but in certain periods (e.g., autumn) in 2018 and 2017 IFS-HRES scores better. The decomposition of the SEEPS score (Fig. 9) shows that these differences originate from cases where forecasted MEPS precipitation is less intense than the observed precipitation. An annual cycle with larger errors in summer than in winter for MEPS indicates that part of this is related to double-penalty issues, which are more pronounced with the higher horizontal resolution in MEPS. IFS-HRES has a general overestimation of the frequency of light precipitation, and of heavy precipitation during summer, but a weaker annual cycle of the error. However, MEPS shows a better frequency bias for all three precipitation categories and lower errors during dry conditions. Hence, the SEEPS decomposition clearly identifies different strengths and weaknesses in the two NWP systems. However, it is not within the scope of this paper to discuss these intermodel differences in detail.
The SEEPS diagnostics also show how the adjustment of precipitation impacts the total score. The adjustments lead to a minor improvement, in both systems, during winter in the frequency bias for dry conditions. In addition, the winter frequency bias of heavy precipitation events in IFS-HRES is improved by reducing the overestimation. For MEPS, a winter overestimation of the heavy precipitation events is turned into an underestimation of the same events. For the error components the adjustments give a higher contribution from the category "forecasted light and observed heavy," but a lower error contribution from "forecasted light and observed dry."

Which areas are exposed to wind-induced undercatch of solid precipitation?
The impact of wind-induced observation errors on forecast verification, as calculated by TFs, depends on the amount of solid precipitation, wind speed, and temperature. The impact therefore varies in time and with season and region. In Scandinavia, solid precipitation is a substantial part of the total precipitation during winter, as shown by forecasted precipitation from MEPS for December 2018–February 2019 (Figs. 10a–c). The fraction of solid precipitation increases inland and northward, as expected, and liquid precipitation is mostly limited to coastal/ocean regions or inland areas south of 60°N. The areas with the highest precipitation amounts, in the Norwegian mountains, are dominated by solid precipitation (>600 mm). The spatial distribution of the wind-induced undercatch is estimated in Fig. 10d, i.e., the fraction of the total precipitation that we would expect to observe with SA Geonor measurements given that the model precipitation equals what would be measured with DFAR (hereafter named the model estimate of the SA Geonor fraction). The model-based CE is calculated by applying MEPS forecasts of WS10 and T2m in Eq. (1) with the universal coefficients from Table 1. The model estimate of the SA Geonor fraction can then be calculated as the sum of hourly model-based CE × forecasted precipitation divided by the sum of the forecasted precipitation (assumed to equal the DFAR measurement).
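The ratio described above can be written compactly. The following Python sketch uses a hypothetical function name and synthetic inputs, and assumes the hourly model-based CE has already been computed from Eq. (1); grid points without forecasted precipitation are returned as NaN, which is a choice made here for illustration.

```python
import numpy as np

def sa_geonor_fraction(ce_hourly, fcst_hourly_mm):
    """Model estimate of the SA Geonor fraction: sum over time of
    CE x forecasted precipitation divided by the sum of forecasted
    precipitation. Time is assumed to be the first array axis."""
    ce = np.asarray(ce_hourly, dtype=float)
    precip = np.asarray(fcst_hourly_mm, dtype=float)
    total = precip.sum(axis=0)
    expected = (ce * precip).sum(axis=0)
    # The fraction is undefined where no precipitation is forecasted.
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(total > 0.0, expected / total, np.nan)

# Synthetic example: two hours (rows) at two grid points (columns).
ce = [[0.5, 1.0],
      [1.0, 1.0]]
precip = [[2.0, 0.0],
          [2.0, 0.0]]
fraction = sa_geonor_fraction(ce, precip)
```

At the first grid point half of the precipitation falls with a CE of 0.5, giving a fraction of 0.75; the second grid point is dry and is returned as NaN.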
The model-estimated SA Geonor fraction varies with temperature (e.g., higher over the sea, but reduced northward) and wind speed (e.g., lower in the mountains, but higher for lower-elevation inland regions), and shows substantial geographical variability. While the estimated SA Geonor fraction in Finland is relatively homogeneous (~0.7–0.8, decreasing northward), the spatial pattern in Norway is more complex but follows the Norwegian topography and coastline, with a minimum estimated SA Geonor fraction in the mountains (<0.5) and higher fractions in more wind-sheltered valleys and lower-elevation inland regions (>0.80). It should also be noted that the SA Geonor fraction along the Norwegian coast decreases from south (>0.9) to north (<0.6). In summary, the model-estimated SA Geonor fraction is less than 0.9 for most of the domain, with the exception of the southeastern part covering Denmark and the southwest coast of Norway. Wind-induced observation errors can therefore possibly impact the verification results for most parts of the MEPS domain.

Conclusions
The aim of this study is to investigate how the wind-induced undercatch of solid precipitation measurements impacts weather forecast verification. Precipitation gauges are a key component for verification of weather forecasts, as input to gridded precipitation products, and as ground truth for precipitation estimates from radar and satellite products. Even with moderate wind speeds the wind-induced undercatch of solid precipitation by precipitation gauges is substantial (e.g., Rasmussen et al. 2012; Nitu et al. 2018) and makes it difficult to quantify the true precipitation, and hence the true quality of weather forecasts.
FIG. 8. Time series of daily SEEPS skill score (1 − SEEPS) for MEPS (red) and IFS-HRES (blue) compared with unadjusted (dashed lines) and adjusted (solid lines) precipitation measurements from 123 stations. A 3-month running mean is applied when plotting.

This study takes transfer function coefficients for calculating the gauge catch efficiency (CE) that were developed at eight sites in various parts of the world during the WMO SPICE project and applies them to 3 years of gauge measurements collected in Norway. Typical operational SA measurements before and after the CE adjustments and more reliable reference DFIR Geonor measurements were compared to the precipitation forecasts from a global model (high-resolution ECMWF IFS-HRES) and a high-resolution regional model [Applications of Research to Operations at Mesoscale (MEPS)]. In addition, the difference between liquid precipitation (less prone to undercatch) and solid precipitation verification has been studied. The main findings can be summarized as follows:
• The wind-induced undercatch of solid precipitation introduces observational errors that can have a substantial impact on NWP verification results. Verification at the Haukeliseter supersite shows that more reliable observations, i.e., those from DFAR Geonor, result in a substantial improvement in forecast errors compared to the standard SA Geonor equipment.
• Applying TFs to adjust for wind-induced undercatch of solid precipitation provides useful information and gives a more realistic picture of the true forecast capabilities. In particular, estimates of systematic forecast biases are improved and, to a lesser degree, verification metrics like correlation, NSDE, RMSE, and ETS.
• It is possible to make use of WS10, T2m, and the temporal distribution of precipitation from forecasts to adjust daily observations of precipitation, but this adds an extra layer of uncertainty and the verification results should be interpreted with caution. An alternative, and more consistent, way is to evaluate forecasted precipitation reduced by the model-estimated precipitation undercatch against raw measurements.
• Applying TFs is associated with uncertainty, which should be taken into account in the verification process and in the interpretation of the results. Further work on reducing the uncertainty in the TFs is needed. Open issues are how to find the most appropriate TF for a given observational site, and how best to include the (remaining) observational uncertainty in the verification process.
• Due to the uncertainty associated with applying TFs it is recommended to complement the precipitation verification with different types of precipitation measurements independent of precipitation gauges (e.g., snow pillows) when available.
• The interpretation of precipitation verification is easier if the evaluation for liquid and solid precipitation is done separately.
• Model skill is dependent on type of precipitation (liquid versus solid), location (mountain versus coastal), season, and precipitation rate (dry, light, and heavy). For the particular region, period, and NWP systems in this study some specific findings are: 1) an impression of overestimation of winter precipitation was changed to an underestimation after adjustments of observations; 2) the adjustment of observations increases the underestimation of coastal precipitation and reduces the overestimation of precipitation in the mountains when compared to raw observations (most pronounced in MEPS); 3) due to the limitations of existing TFs it remains to provide reliable estimates of verification metrics like correlation, RMSE, ETS, and SEEPS. Only verifying solid precipitation in calm conditions reduces the problem of wind-induced undercatch but has other disadvantages.
Existing TFs have been evaluated recently (Pierre et al. 2019; Smith et al. 2019) and the associated uncertainty has been documented. The performance varies substantially by site, but in general the TFs underadjust at windier sites (also seen at Haukeliseter in this study), while a net overadjustment is found for less windy sites. Their results also suggest that TFs are useful but should be applied with caution, similar to what we found in our study. Furthermore, the results suggest that local meteorology and conditions (e.g., natural shielding through vegetation) are important for understanding the undercatch and that assigning appropriate TFs to all observation sites is therefore a difficult task (Pierre et al. 2019).
There are a number of possibilities for further improving our understanding of how wind-induced undercatch affects verification results. Here we identify at least three ways forward: 1) improve TFs and how they are used in the verification process; 2) perform precipitation verification for global models at all WMO SPICE sites against DFAR, SA, and unshielded gauges to better understand the impact (e.g., Buisán et al. 2020); and 3) perform complementary forecast verification against a variety of solid precipitation observation types, when available (e.g., Schirmer and Jamieson 2015).