Strong winds crossing elevated terrain and descending to its lee occur over mountainous areas worldwide. Winds fulfilling these two criteria are called foehn in this paper although different names exist depending on the region, the sign of the temperature change at onset, and the depth of the overflowing layer. These winds affect the local weather and climate and impact society. Classification is difficult because other wind systems might be superimposed on them or share some characteristics. Additionally, no unanimously agreed-upon name, definition, or set of indications exists for such winds. The most trusted classifications have been performed by human experts. A classification experiment for different foehn locations in the Alps and different classifier groups addressed hitherto unanswered questions about the uncertainty of these classifications, their reproducibility, and dependence on the level of expertise. There were five classifier groups: one of mountain meteorology experts, two of master’s degree students who had taken mountain meteorology courses, and two objective algorithms. Sixty periods of 48 h were classified for foehn–no foehn conditions at five Alpine foehn locations. The intra-human-classifier detection varies by about 10 percentage points (interquartile range). Experts and students are nearly indistinguishable. The algorithms are in the range of human classifications. One difficult case appeared twice in order to examine the reproducibility of classified foehn duration, which turned out to be 50% or less. The classification dataset can now serve as a test bed for automatic classification algorithms, which, if successful, eliminate the drawbacks of manual classifications: lack of scalability and reproducibility.
Many processes and phenomena in the atmosphere need to be diagnosed—from low pressure systems with fronts in midlatitudes and hurricanes in the tropics to fog or lightning. Some diagnoses are easy to make. Hearing thunder identifies lightning, and not being able to see a building less than 1 km away during daytime indicates fog. These diagnoses can even be automated with suitable instrumentation—to identify lightning from its signature in the electromagnetic waves it emits and fog from scattering of a light source. Some processes and phenomena, however, are much harder to classify, often because not enough information is available or the process itself is insufficiently understood. Lately, methods from statistics and machine learning in combination with a huge increase in computing power have been harnessed with ever-increasing success to tackle more and more difficult classification tasks, earning them the label “artificial intelligence.” Arguably the greatest progress has been made in classifying images, from spotting a dog in a photo to identifying a particular person. The underlying neural-network algorithms, however, typically need thousands or even hundreds of thousands of preclassified images provided by humans in order to “learn.” Such “supervised” learning is much easier than “unsupervised learning” for which no “truth” exists. This is the area where classifications by human experts are still the gold standard, albeit with several drawbacks: lack of scalability and reproducibility, as well as unknown error rates. Because only a few people have the required expertise to perform a classification, which takes a substantial amount of time, the classification task cannot be extended to an arbitrarily large number of instances, and comparisons of classifications among different experts or by the same expert performed at different times are at best extremely rare.
A group of experts collaborated recently on such a task to remedy two of the classification drawbacks by providing estimates of classification uncertainty and reproducibility, and a database against which existing and future algorithms can be tested. The classification task identified periods of downslope windstorms in time series of weather station measurements.
Such windstorms result from winds that cross topographic obstacles and accelerate as they descend to their lee. They occur over mountainous locations worldwide and are known by different names, which are sometimes also used to refer to an additional characteristic. Because no all-encompassing name exists, this article will use “foehn” for simplicity without implying a temperature increase during its onset, or a specific region. Foehn winds affect local weather and climate and impact agriculture (growing conditions due to temperature and humidity changes; top soil erosion), tourism (reliable spots for wind and kite surfing), artificial snow making (change of wet-bulb temperatures), air pollution (trapping pollutants in cold pools underneath the foehn layer, or sweeping them away in case of breakthrough), human health (reduction of air pollution), forest fires (intensifying them to uncontrollable extents), ground traffic (toppling trucks; snow or sand drifts; blasting of vehicles with sand and small rocks), and air traffic (closure of runways when crosswinds are too high). The increasing density of automatic weather stations allows the observation of such winds at progressively more locations. Classification, however, is difficult because other wind systems, such as radiatively driven downslope/downvalley winds, might be superimposed on foehn or share some of its characteristics, or because not enough information is available. The difficulty is compounded for three reasons: no unanimously agreed-upon definition of foehn and its indications exists, foehn occurs in a variety of synoptic-scale and mesoscale settings, and different names are used depending on the region, the sign of the temperature change at its onset, and its depth.
Nevertheless, two unanimously agreed-upon characteristics are that air crosses an obstacle and that it descends and accelerates on the downwind side, causing strong winds. A fairly simple conceptual model of the flow situation after the onset of foehn, corroborated by field campaigns, laboratory experiments, computer simulations, and theoretical investigations, is shown in Fig. 1. Unfortunately, no continuous measurements covering the vertical cross section are routinely available for classification; only weather stations at the ground provide the necessary observations. Nowadays, with the proliferation of automatic weather stations and mesonets in some regions, measurements close to the crest of the obstacle are also available so that the first foehn characteristic of air crossing the topographic obstacle can be checked. The second characteristic, that air descends, leads to adiabatic warming and consequently to a decrease in relative humidity. This pattern can be examined through differences between the crest and a downwind station of variables that are approximately conserved in foehn flow, such as potential temperature or mixing ratio.
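The crest-to-lee comparison of potential temperature can be illustrated with a short Python sketch. It uses the standard dry-air (Poisson) relation for potential temperature; the function names and the station values are hypothetical and are not taken from the experiment's dataset.

```python
# Standard textbook constants for dry air.
P0_HPA = 1000.0          # reference pressure in hPa
KAPPA = 287.05 / 1004.0  # R_d / c_p (gas constant over specific heat)


def potential_temperature(temp_c, pressure_hpa):
    """Potential temperature in K from temperature (deg C) and pressure (hPa)."""
    temp_k = temp_c + 273.15
    return temp_k * (P0_HPA / pressure_hpa) ** KAPPA


def theta_difference(crest, valley):
    """Crest-minus-valley potential temperature difference in K.

    Each argument is a (temperature in deg C, pressure in hPa) tuple.
    """
    return potential_temperature(*crest) - potential_temperature(*valley)


# Hypothetical observations (assumed values for illustration only).
crest_obs = (-2.0, 780.0)   # station near the crest
valley_obs = (14.0, 950.0)  # station in the lee

# A difference close to zero is consistent with air descending from crest level.
print(round(theta_difference(crest_obs, valley_obs), 2))
```

Although the two assumed stations differ by 16 °C in temperature, their potential temperatures differ by well under 1 K, which is the kind of signal the second foehn characteristic refers to.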
Classification is made more ambiguous by processes for which potential temperature and mixing ratio are not conserved, that is, turbulent mixing within the foehn flow, at the surface and its upper interface; mixing air in from tributaries; phase changes of water (formation and evaporation of liquid and solid particles); and daytime warming and nighttime cooling due to surface sensible heat flux. How large these diabatic effects are varies with the season, time of day, location, and large-scale and mesoscale flow configurations. Information about their contribution is not readily available so that classifications become difficult and possess an unknown and variable degree of uncertainty.
THE COMMUNITY FOEHN CLASSIFICATION EXPERIMENT.
The Community Foehn Classification Experiment set out to quantify the uncertainty of human foehn classifications, to compare them to machine classifications, and to provide a dataset for the development of foehn classification algorithms. Three groups of human experts and two objective algorithms faced the task of identifying foehn periods. The first group (most of them are coauthors of this paper) consisted of 26 seasoned experts in mountain meteorology from different continents with operational or research backgrounds and thus a broad range of concepts of what constitutes foehn. The other two groups were made up of students taking the advanced weather forecasting course at the University of Innsbruck in 2016 (34) and 2017 (18), respectively. The student groups had a fairly homogeneous level of expertise because they had received four hours of lectures on foehn and had to apply it in homework problems in their advanced weather forecasting course. It was explained to the students why it was crucial for the outcome of this study that they worked completely independently. In addition to human experts, two algorithms were used that also employ the concept shown in Fig. 1. One, labeled A1 henceforth, is in operational use by the Swiss weather service. It uses percentiles of the distribution of the difference of potential temperature between crest and downstream locations (small; cf. Fig. 1), wind speed (high), and relative humidity (low) as hard thresholds for the classification of three categories: no foehn, foehn air mixed with cold valley air, and foehn. The second algorithm, A2, in operational use at the University of Innsbruck, learns from the data by itself and does not use hard thresholds. 
Algorithm A2 uses statistical mixture models to fit two or more parametric distributions to the observed distribution of classifying variables, such as wind speed and the potential temperature difference between crest and downwind stations. The result is a probability for foehn between 0 and 1 instead of merely a binary yes–no classification. Both algorithms require that the appropriate directional sector for foehn winds be manually set.
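The mixture-model idea can be sketched in a few lines of Python. This is a deliberately simplified illustration and not the A2 implementation: it assumes a single classifying variable (the crest-minus-valley potential temperature difference), two Gaussian components fitted with plain expectation-maximization, and synthetic data. All names and numbers are hypothetical.

```python
import math
import random


def normal_pdf(x, mean, std):
    """Density of a normal distribution."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def fit_two_gaussians(values, n_iter=100):
    """Fit a two-component 1D Gaussian mixture with plain EM."""
    lo, hi = min(values), max(values)
    means = [lo, hi]                       # start at the extremes
    stds = [max((hi - lo) / 4.0, 1e-3)] * 2
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of component 1 for each observation.
        resp = []
        for x in values:
            p0 = weights[0] * normal_pdf(x, means[0], stds[0])
            p1 = weights[1] * normal_pdf(x, means[1], stds[1])
            total = p0 + p1
            resp.append(p1 / total if total > 0.0 else 0.5)
        # M-step: update weight, mean, and spread of each component.
        for k in (0, 1):
            r = resp if k == 1 else [1.0 - ri for ri in resp]
            total = sum(r)
            if total < 1e-9:
                continue
            weights[k] = total / len(values)
            means[k] = sum(ri * x for ri, x in zip(r, values)) / total
            var = sum(ri * (x - means[k]) ** 2 for ri, x in zip(r, values)) / total
            stds[k] = max(math.sqrt(var), 1e-3)  # floor avoids collapse
    return weights, means, stds


def foehn_probability(x, weights, means, stds):
    """Posterior probability of the component with the smaller mean.

    Assumption: a small potential temperature difference indicates foehn.
    """
    k = 0 if means[0] <= means[1] else 1
    dens = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    total = sum(dens)
    return dens[k] / total if total > 0.0 else 0.5


# Synthetic differences in K: a "foehn" cluster and a "no foehn" cluster.
rng = random.Random(42)
sample = [rng.gauss(0.5, 0.5) for _ in range(200)]
sample += [rng.gauss(6.0, 1.5) for _ in range(200)]
params = fit_two_gaussians(sample)
print(foehn_probability(0.0, *params), foehn_probability(8.0, *params))
```

No threshold is specified anywhere in the sketch; the boundary between the two regimes follows from the fitted distributions, which is the property the text attributes to A2.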
The classification experiment was designed to strike a balance between ideal goals and practical feasibility for the human classifiers. Therefore, five topographically different locations of differing annual foehn frequencies in the Swiss Alps were selected (Table 1 and Fig. 2). Twelve 48-h periods at each station yielded a total of 60 cases, for which the experts had to classify south foehn periods lasting at least 1 h at 30-min resolution. One of the coauthors, who did not himself manually classify (D. Plavcan), selected these cases based on results from the two automated classification algorithms, A1 and A2, to cover all permutations: phases that both, only one, or neither of the algorithms classified as foehn. Cases contained no, one, or several foehn periods. Unbeknownst to the classifiers, one difficult 48-h period appeared twice in order to estimate reproducibility.
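The two constraints named above, 30-min resolution and a minimum duration of 1 h, can be applied to a series of yes/no flags as in the following Python sketch. The function name and the example flags are hypothetical; each flag is assumed to stand for one 30-min interval.

```python
def foehn_periods(flags, step_min=30, min_duration_min=60):
    """Return (start, end) offsets in minutes of sufficiently long foehn runs.

    flags: sequence of booleans, one per time step of length step_min.
    Runs shorter than min_duration_min are discarded.
    """
    periods = []
    start = None
    # The appended False acts as a sentinel that closes a run at the end.
    for i, flag in enumerate(list(flags) + [False]):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if (i - start) * step_min >= min_duration_min:
                periods.append((start * step_min, i * step_min))
            start = None
    return periods


def total_duration_min(periods):
    """Sum of period lengths in minutes."""
    return sum(end - begin for begin, end in periods)


# Hypothetical flags: a single 30-min spike is dropped, longer runs are kept.
example = [False, True, False, True, True, False, True, True, True]
print(foehn_periods(example))
```

A full 48-h case would contain 96 such flags.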
Each participant received a wind speed–coded wind rose for each location, a pseudo-3D image of the location from Google Earth, exact coordinates, plots of meteorological variables for each of the 60 periods of 48 h, and instructions that contained an annotated example of an additional case reproduced here in Fig. 3. To classify only south foehn events, air had to cross the Alpine crest from south to north as indicated by the wind direction at the crest plotted in black instead of gray, which is fulfilled for the whole 48-h period in this case. Three periods of foehn are inferred: from 9:00 to 10:20, 11:10 to 14:30, and 31:00 to 45:20 (as hh:min). During these periods, similar potential temperatures at crest and the classification location imply the second foehn characteristic of lee-slope descent. Wind directions are from the appropriate sector¹ and wind speeds are higher. Temperatures increase at the onset of each period, presumably when foehn erodes an underlying shallow cold pool. Humidity also drops, reflecting the drawdown of drier air from higher altitudes. Because relative humidity (%) instead of specific humidity (g kg⁻¹) is plotted, the temperature increase additionally contributes to a drop in relative humidity.
The three human groups classified foehn duration during the 12 × 48 h periods at each of the 5 locations broadly similarly, as Fig. 4 shows. Median durations (colored horizontal lines) are within a few percentage points of each other. The group of mountain meteorology experts has the greatest diversity of backgrounds and consequently the most varied concepts of what constitutes foehn. As a result, their classification variation is larger than that of the second group of students, who all had the same foehn concept instilled in their course. The variation of the first group of students, on the other hand, is larger—mainly because of a few outliers at each location.
The variation and thus classification uncertainty is smallest at location 4, a station at the northern edge of the Alps. The largest uncertainty occurred for location 1, where foehn can potentially blow from several wind sectors and for which the crest station might not always be representative of the upstream conditions.
The agreement between the algorithms and human classifications varies. Results for A1 are within a few percentage points of the medians of the human groups at locations 2 and 3 and for A2 at locations 1 and 4. However, they are at the margins of human classifications for locations 2 (A2), 3 (A2), and 5 (A1 and A2), and A1 is even outside at locations 1 and 4.
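The comparison described here rests on two simple statistics per group and location: the median classified foehn share and its spread, for which the interquartile range is used in this paper. A minimal Python sketch follows; the percentages are invented for illustration and are not results of the experiment.

```python
import statistics


def median_and_iqr(values):
    """Return (median, interquartile range) of a list of numbers."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q2, q3 - q1


def inside_range(value, values):
    """True if value lies between the smallest and largest of values."""
    return min(values) <= value <= max(values)


# Hypothetical foehn shares (% of classified time) from seven human classifiers
# at one location, plus one hypothetical algorithm result.
human_percent = [22.0, 25.0, 27.0, 30.0, 31.0, 33.0, 38.0]
algorithm_percent = 29.0

median, iqr = median_and_iqr(human_percent)
print(median, iqr, inside_range(algorithm_percent, human_percent))
```

Whether an algorithm counts as "within", "at the margins of", or "outside" the human classifications depends on the chosen range; the min–max range used in this sketch is an assumption.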
Figure 5 shows the classifications from the three groups of human classifiers and the two algorithms for one of the 60 cases. At about midday of the second day, the potential temperature at valley station 1 reached a value close to that of the crest station (purple line), indicating descent of air. Wind speeds also increased. In the evening the signals in the variables reverse, indicating the cessation of foehn conditions. Human classifications agree on a core period of foehn from 11:00 to 14:30 (labeled “easy” in Fig. 5) but differ in onset and end times, with end times less unanimous than onset times. The two algorithms classify similarly.
The nighttime period between days 1 and 2, on the other hand, is more difficult. About 60% of the experts and students classified it as foehn (labeled “difficult”), again agreeing for the core period but differing for onset and even more so for end times. On the evening of the first day the wind direction changed into the foehn sector. At the same time, both average and peak wind speeds increased and potential temperature also increased. Unlike in the easy period, however, potential temperature is 5 K lower than at the crest, which likely led the other 40% to classify it as a radiatively cooled nocturnal downslope/downvalley flow. Air originating from a different level than represented by the crest station (cf. Fig. 2) and mixing of foehn air with radiatively cooled air from the valley and its tributaries might have been responsible for such a large difference. The three-category algorithm A1 classifies no foehn, whereas the mixture model algorithm A2 gives a probability close to 1 that it is foehn. The decrease and fluctuations of the probability toward the end of the period stem from the decrease and fluctuations in wind speed and, later on, from the increase in potential temperature difference.
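Statements such as "about 60% classified it as foehn" correspond to the share of classifiers flagging foehn at each time step. A short Python sketch, using five invented classifier series rather than the experiment's data, shows how such shares and a unanimous core period could be derived.

```python
def foehn_share(classifications):
    """Share of classifiers (0..1) flagging foehn at each time step.

    classifications: list of equally long 0/1 sequences, one per classifier.
    """
    n = len(classifications)
    return [sum(step) / n for step in zip(*classifications)]


def steps_with_agreement(shares, minimum=1.0):
    """Indices of time steps where at least `minimum` of classifiers say foehn."""
    return [i for i, share in enumerate(shares) if share >= minimum]


# Hypothetical classifications: agreement in the middle, spread at the edges.
classifiers = [
    [0, 0, 1, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 1, 1, 1, 0],
    [0, 0, 1, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 1, 1, 0, 0],
]
shares = foehn_share(classifiers)
print(shares)
print(steps_with_agreement(shares))       # unanimous core period
print(steps_with_agreement(shares, 0.6))  # at least 60% agreement
```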
This “difficult” period indicates that a simple yes or no might not be enough for all applications when it comes to classifying foehn flows, for example because of the superposition of foehn and a radiatively cooled downvalley wind. Algorithm A1 adds the third category of “mixed foehn/valley air” (although it does not classify it as such in this particular case). Algorithm A2 gives a continuous probability of foehn occurrence.
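A three-category scheme with hard thresholds can be sketched as follows in Python. The threshold values, the sector limits, and the decision order are assumptions made purely for illustration; the operational A1 algorithm derives its thresholds from percentiles of the observed distributions and is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical hard limits; not the values used by A1."""
    max_theta_diff_foehn: float = 2.0   # K, crest minus valley
    max_theta_diff_mixed: float = 5.0   # K
    min_wind_speed: float = 4.0         # m/s
    max_rel_humidity: float = 60.0      # %
    sector: tuple = (135.0, 225.0)      # degrees; manually set, no wrap at north


def classify(theta_diff, wind_speed, rel_humidity, wind_dir, th=Thresholds()):
    """Return 'foehn', 'mixed', or 'no foehn' for one observation."""
    low, high = th.sector
    if not low <= wind_dir <= high:
        return "no foehn"
    if wind_speed < th.min_wind_speed or rel_humidity > th.max_rel_humidity:
        return "no foehn"
    if abs(theta_diff) <= th.max_theta_diff_foehn:
        return "foehn"
    if abs(theta_diff) <= th.max_theta_diff_mixed:
        return "mixed"  # foehn air mixed with colder valley air
    return "no foehn"


print(classify(0.5, 8.0, 35.0, 180.0))
print(classify(4.0, 6.0, 50.0, 170.0))
print(classify(4.0, 6.0, 50.0, 300.0))
```

The intermediate category gives a hard-threshold scheme some room for superposed flows, whereas a probabilistic output expresses the same ambiguity on a continuous scale.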
Changes in classification uncertainty.
Over all 60 cases, the beginning and end of a foehn event were delineated with higher variability among all classifiers than its core period. Although the majority of the classified foehn events started with a temperature increase, the uncertainty was not clearly different from that of events that started with no change or a decrease in temperature. Classification uncertainty was also higher for the nighttime than for the daytime, for reasons similar to those in the difficult period in Fig. 5. Classification uncertainty also varied somewhat seasonally, with low uncertainty in the fall [September–November (SON)] and winter [December–February (DJF)] months; the highest uncertainty in the spring [March–May (MAM)], particularly among human classifiers; and medium uncertainty in the summer months [June–August (JJA)].
To evaluate reproducibility, one of the more difficult cases (at location 1) occurred twice in the dataset, unbeknownst to the classifiers. Figure 6 shows the relative frequency of the absolute difference between the foehn durations classified at the first and second occurrences of that case. For perfect reproducibility, the difference in classified foehn duration between the two identical cases would be zero. However, fewer than half of the classifiers achieved perfect reproducibility.
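The reproducibility measure described above reduces to an absolute difference per classifier and the share of classifiers for whom that difference is zero. The Python sketch below uses invented classifier labels and durations; it only illustrates the calculation and does not reproduce Fig. 6.

```python
def duration_differences(first, second):
    """Absolute difference in classified foehn duration per classifier.

    first, second: dicts mapping classifier id to duration (hours) for the
    first and second occurrence of the same case.
    """
    return {name: abs(first[name] - second[name])
            for name in first if name in second}


def share_reproduced(differences, tolerance=0.0):
    """Share of classifiers whose difference does not exceed the tolerance."""
    hits = sum(1 for diff in differences.values() if diff <= tolerance)
    return hits / len(differences)


# Hypothetical durations in hours at 30-min resolution.
first_pass = {"c1": 6.0, "c2": 0.0, "c3": 8.5, "c4": 5.0, "c5": 7.0}
second_pass = {"c1": 6.0, "c2": 4.5, "c3": 8.0, "c4": 5.0, "c5": 2.0}

diffs = duration_differences(first_pass, second_pass)
print(diffs)
print(share_reproduced(diffs))                 # exact reproduction
print(share_reproduced(diffs, tolerance=0.5))  # within one 30-min step
```

Allowing a tolerance of one time step is an assumption added here to show how sensitive the share is to the definition of "reproduced".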
This lack of reproducibility is worrisome, although probably less extreme for easier cases. Nevertheless, it corroborates the first author’s personal experience of classifying foehn at different locations globally.
The dataset will be available from the University of California (UC), Irvine, which hosts a large repository of classification datasets (https://archive.ics.uci.edu/ml/about.html).
Several lessons have been learned from this experiment. Points i–iii below add supporting evidence to what was previously at least informally known from other classification tasks; points iv–vi add new knowledge. i) Busy experts are willing to volunteer a chunk of their scarce time provided the classification task is an intellectually challenging puzzle. ii) Human experts use implicit (and in the case of the master’s students, explicitly taught) physically based concepts to help them distinguish between the two categories of foehn–no foehn. iii) Expert classifications carry uncertainty and are not even necessarily reproducible; this uncertainty needs to be quantified (as here) or at least considered when interpreting results using such classifications. iv) Uncertainty is largest for onset and even more so for the ending of a foehn event and also larger during the night. v) Combining advanced statistical and/or machine learning models with physically based concepts for choosing their input variables yields similar results to those of human experts. In addition, such models easily scale to longer time series or more locations and are reproducible, which is a fundamental scientific requirement and allows the comparison of different datasets (foehn occurrence at different locations in this case). It is thus highly recommended to develop objective classification procedures, ideally without having to resort to manually specified and/or hard limits. If the algorithms are additionally made available as packages of open-source languages, foehn classifications can easily be reproduced by other researchers. vi) Diagnoses contain more information when they are probabilistic instead of binary yes–no, a concept that has a long history of implementation in (weather) forecasts.
In addition to shedding light on human and machine classification of foehn, the dataset allows the testing of existing and newly developed algorithms for unsupervised learning tasks when truth is not known, such as in the case of foehn occurrence. It can also serve a community interested in estimating the accuracy of previous human foehn classifications and climatologies.
The authors give many thanks to Achim Zeileis for discussions on how to best design the experiment, and a huge thanks to the experts and the 2016 and 2017 cohorts of the Advanced Weather Forecasting class who classified these 60 periods.
Current affiliation: UBIMET, Vienna, Austria
¹ Deduced from wind roses and topography maps; not shown.