The Wind Forecast Improvement Project (WFIP) is a public–private research program, the goal of which is to improve the accuracy of short-term (0–6 h) wind power forecasts for the wind energy industry. WFIP was sponsored by the U.S. Department of Energy (DOE), with partners that included the National Oceanic and Atmospheric Administration (NOAA), private forecasting companies (WindLogics and AWS Truepower), DOE national laboratories, grid operators, and universities. WFIP employed two avenues for improving wind power forecasts: first, through the collection of special observations to be assimilated into forecast models and, second, by upgrading NWP forecast models and ensembles. The new observations were collected during concurrent year-long field campaigns in two high wind energy resource areas of the United States (the upper Great Plains and Texas) and included 12 wind profiling radars, 12 sodars, several lidars and surface flux stations, 184 instrumented tall towers, and over 400 nacelle anemometers. Results demonstrate that a substantial reduction (12%–5% for forecast hours 1–12) in power RMSE was achieved from the combination of improved numerical weather prediction models and assimilation of new observations, equivalent to the previous decade’s worth of improvements found for low-level winds in NOAA/National Weather Service (NWS) operational weather forecast models. Data-denial experiments run over select periods of time demonstrate that up to a 6% improvement came from the new observations. Ensemble forecasts developed by the private sector partners also produced significant improvements in power production and ramp prediction. Based on the success of WFIP, DOE is planning follow-on field programs.
An observational, data assimilation, and modeling study demonstrates improvements in the accuracy of wind forecasts for wind energy.
Wind power is a variable energy source, dependent on weather conditions. Electric grid operators keep the grid stable by balancing variable generation resources (e.g., wind and solar) and conventional generation (e.g., coal, gas, and nuclear) with energy demand. Having accurate advance knowledge of the amount of wind power available through reliable weather forecasts can lead to improvements in the efficiency of the entire electrical grid system, including the operation of fossil fuel plants, resulting in lower costs as well as lower CO2 emissions (Marquis et al. 2011; GE Energy 2010; EnerNex 2011). Lowering the costs of integrating wind energy onto the grid can accelerate the development of wind energy as a growing component of the nation’s energy portfolio, thereby mitigating anthropogenically forced climate change while also reducing air pollution.
The U.S. Department of Energy (DOE) sponsored the Wind Forecast Improvement Project (WFIP) with the goal of advancing the integration of wind power and reducing the cost of energy by improving short-term wind energy forecasts, including forecasts of ramp events (large changes in wind power production over short time intervals). WFIP was a public–private partnership with two private sector teams led by forecasting companies WindLogics and AWS Truepower that collaborated with DOE and National Oceanic and Atmospheric Administration (NOAA) laboratories and with the National Weather Service. The core of WFIP was composed of two concurrent year-long field programs in high wind energy resource areas of the United States (the upper Great Plains and Texas).
WFIP employed two avenues for improving wind energy forecasts: enhanced measurement networks and numerical weather prediction (NWP) model system advancements. The former included networks of in situ and remote sensing instruments deployed in the two study areas, as well as proprietary tall tower and turbine nacelle (i.e., the housing containing the generator and gearbox) anemometer observations from the wind energy industry, which were assimilated into NOAA's NWP models for the first time. Additional observations allow for a more precise depiction of the model’s initial state of the atmosphere, potentially resulting in more accurate forecasts. The intent of the WFIP instrumentation networks was to provide observations through a deep layer of the atmosphere, and over a sufficiently broad area, to influence NWP forecasts out to at least a 6-h lead time. These observations were assimilated into real-time forecasts as well as retrospective simulations spanning the WFIP field campaign year to allow for an evaluation of seasonal differences in the skill of the models and the impact of the observations.
Wind energy forecasting efforts have often focused on day-ahead forecasting (typically 18–42 h ahead) because some conventional power plants, such as coal plants, are normally scheduled that far in advance. But additional opportunities exist on the short-term (0–6 h) time frame to adjust schedules, start natural gas generators, minimize scheduling errors in the energy markets, or use bilateral trading to take advantage of improved forecasts. Even when a wind power plant has been scheduled a day ahead, a more accurate forecast for the next few hours is important for balancing day-ahead forecast errors, minimizing penalties, and maximizing revenues. This short-term time frame was the focus of WFIP.
The second avenue for enhancing wind energy forecasts was to improve the NWP forecast systems used by all partners directly and to produce a broad assessment of model skill specifically validated with turbine-height wind observations. Midway through the WFIP field program, the NOAA/National Weather Service (NWS) upgraded its operational hourly updated NWP forecast model from the Rapid Update Cycle (RUC) model to the Rapid Refresh (RAP) model, which included improvements resulting from WFIP, and the impacts of this upgrade were evaluated using WFIP observations. In addition, improvements to the research version of RAP and to NOAA’s High-Resolution Rapid Refresh (HRRR) research model were continuously made during WFIP. WindLogics incorporated forecasts from these improved NOAA models into machine learning energy prediction algorithms, while the AWS Truepower team developed and operated in real time an experimental nine-member optimized ensemble forecast system for WFIP that utilized RAP and HRRR for initial and boundary conditions. Also, with WFIP funding, NOAA developed a data-dissemination capability to make the large amounts of raw model output from the HRRR model available in real time to the two private sector teams and to the entire wind energy industry.
Complementary research to the work discussed here addressed the accuracy of using stability-dependent wind profile relationships to estimate the wind speed at turbine-hub height, the development of a community ramp tool and metric, development of a gap-filling and quality control algorithm for remote sensing data, and various sensitivity studies examining the role of observation type and data assimilation techniques in model performance.
The WFIP program builds upon a long research effort within the meteorological community aimed at providing better environmental information to support wind energy. The first computer models devoted to wind power forecasting were developed during the 1980s, an outgrowth of a Pacific Northwest National Laboratory (PNNL) working group (Wendell et al. 1978; Bossanyi 1985). Throughout the 1990s, a variety of statistical approaches were employed to improve forecast skill. In 1999, eWind, the predecessor of the forecasting system used in the WFIP southern study area (SSA), was developed by AWS Truepower (AWST). In the early 2000s, the California Independent System Operator (CAISO) developed a centralized wind power forecasting system (Makarov et al. 2010). Since then, a large number of ISOs, utilities, and balancing authorities have deployed wind power forecasting systems in the United States, all of which are dependent on national-scale or global wind forecasts and observations provided by NOAA or other national forecasting centers. With deeper penetration of wind energy, accurately forecasting the wind is ever more critical for developing and managing the modern electrical grid (Monteiro et al. 2009; Mahoney et al. 2012; Giebel and Kariniotakis 2007). Thus, the WFIP research effort seeks to complement and continue the evolution of wind energy forecasting, further facilitating the development and operation of wind power production in the United States.
A strength of WFIP was its collaborative framework, bringing together federally funded laboratories and centers, private sector companies, and universities. Participants included several Department of Energy (DOE) national laboratories [National Renewable Energy Laboratory (NREL), Argonne National Laboratory (ANL), Pacific Northwest National Laboratory (PNNL), and Lawrence Livermore National Laboratory (LLNL)], two NOAA research laboratories [Earth System Research Laboratory (ESRL) and Air Resources Laboratory (ARL)], the NWS, and two teams of partners from the private sector and university communities. One team, led by AWS Truepower, included the Electric Reliability Council of Texas (ERCOT), which operates the electric grid in most of Texas. A second team was led by WindLogics, Inc. (a subsidiary of NextEra Energy), and included the Midcontinent Independent System Operator (MISO), which operates the electric grid in the upper Midwest states. Additional observations were also made available in-kind by Iberdrola USA, Leosphere, and West Texas A&M University (WTAMU). A list of team partners is provided in Table 1.
New instrumentation was deployed or acquired in two high wind energy resource areas of the United States during concurrent year-long field campaigns that ran from September 2011 to September 2012. The first area was in the upper Great Plains (Fig. 1), or the northern study area (NSA), where DOE and NOAA partnered with the WindLogics team. The second field campaign was centered over the SSA in western and central Texas (Fig. 2), where DOE and NOAA partnered with the AWS Truepower team. A vital and ultimately successful aspect of WFIP was the collaboration between the wind energy industry and NOAA in acquiring and assimilating for the first time into NOAA’s models proprietary data from tall tower (mostly 60 m) and wind turbine nacelle-mounted anemometers (Table 2). All 405 nacelle anemometer sites and a subset of the tall tower sites were available for assimilation in real time (41 out of 133 in the NSA, 27 out of 51 in the SSA), while the remaining industry tall towers were available after a several-day delay for use in retrospective modeling studies. The WFIP observing systems also included 12 wind profiling radars, 12 sodars, several lidars, and 71 surface meteorological stations. Observations from the radar wind profilers and sodars, as well as surface meteorological stations, also were assimilated into the NWP models used to make wind power forecasts. The primary observations used for evaluating model performance are the tall tower data, wind profiler data, and power output from 23 wind plants in the NSA and 34 wind plants in the SSA. The wind plant power data were independent of the data assimilation process.
To assist in maintaining the WFIP instrumentation throughout the year-long field campaign and in identifying potential model problems, the observations and model forecasts were displayed continuously on a real-time publicly accessible website, updated on a subhourly basis, with separate websites for the proprietary data. The public websites for the NSA and SSA can be found online (http://wfip.esrl.noaa.gov/psd/programs/wfip/).
A key component of WFIP was to develop improved quality control procedures to ensure that the assimilated observations were as accurate as possible, as a few erroneous observations can easily negate the positive impact of many accurate observations when assimilated into an NWP model. New processing techniques were implemented to reduce spurious signals from birds and other contamination in wind profiling radar data (Bianco et al. 2013; Wilczak et al. 1995), and techniques were also developed to identify and correct direction offsets in the tall tower observations, as discussed in Wilczak et al. (2014). The wind profiler QC technique continues to be applied to boundary layer profiler wind observations with QC flags relayed through the Meteorological Assimilation Data Ingest System (MADIS; http://madis.noaa.gov), improving their assimilation into NOAA real-time model forecast systems.
NOAA WEATHER MODELS.
Because of the focus on short-term forecasts (up to 15 h), the principal NOAA models used during WFIP were the hourly updated 13-km-resolution RUC, the 13-km-resolution RAP, and the 3-km-resolution HRRR (Fig. 3). RAP and HRRR both used version 3.4.1 of the Advanced Research core of the Weather Research and Forecasting Model (ARW; Skamarock et al. 2008). RUC and RAP provided forecasts out to 18 h, while HRRR provided forecasts out to 15 h.
RUC was the NOAA/NWS/National Centers for Environmental Prediction (NCEP) operational hourly updated forecast system through the first half of the WFIP field campaign, when it was replaced by RAP on 1 May 2012 (Table 3). Prior to this date, the RAP model was run at NCEP in a test mode, and we refer to both the operational and test versions as the NCEP_RAP. Research versions of the RAP (ESRL_RAP) and HRRR models were run in real time 24 h a day for 7 days a week by NOAA/ESRL through the entire WFIP campaign; these versions differed from NCEP_RAP as improvements to the model physics were incorporated over time. In particular, estimates of surface aerodynamic roughness lengths were improved, and the vertical resolution in the land surface model was increased, which improved the boundary layer diurnal cycle.
All of these models use three-dimensional (3D) variational data assimilation, with RAP using the Gridpoint Statistical Interpolation (GSI) analysis system (Wu et al. 2002). The GSI is capable of assimilating a diverse set of observations, and new capabilities for assimilating energy-related observations (tall towers and nacelle anemometers) were developed for GSI as an outcome from WFIP. HRRR was in development during WFIP and at that time did not perform data assimilation on the 3-km grid, but used initial and boundary conditions obtained by direct interpolation from the 13-km ESRL_RAP. (In 2013, a 3-km data assimilation system was added to the HRRR.)
The intent during WFIP was to assimilate the special WFIP observations into the research ESRL_RAP and HRRR models, but not into the operational NCEP_RUC and NCEP_RAP, and then to compare the skill of these models. Inadvertently, the NCEP_RAP assimilated a small subset of the WFIP observations (one wind profiling radar and five sodars in the NSA; three sodars in the SSA; none of the tall towers, nacelle anemometers, or surface mesonet). Since these observations will have added some skill to NCEP_RAP, comparisons of the skill of ESRL_RAP and HRRR to the NCEP_RAP model, shown later in the "Northern study area" and "Southern study area" sections, will provide a conservative estimate of what the improvement would have been had none of the WFIP observations been assimilated into NCEP_RAP.
The latency of the NCEP_RUC, NCEP_RAP, and ESRL_RAP models was approximately 1 h during WFIP, while the latency of ESRL_HRRR was 1.5–2 h. When HRRR became operational at NCEP in 2014, the latency was reduced to approximately 1 h. All model forecast intercomparisons shown are independent of time latency. Details on RUC can be found in Benjamin et al. (2010), and for RAP and HRRR can be found online (http://rapidrefresh.noaa.gov and http://rapidrefresh.noaa.gov/hrrr, respectively).
REAL-TIME FORECAST ERROR STATISTICS: RAP AND RUC MODELS.
NCEP_RUC was used as a baseline forecast against which to compare the upgraded WFIP forecasts until RUC ceased operations on 1 May 2012. The NCEP_RUC model did not assimilate any of the new WFIP observations while the research ESRL_RAP did, and so a comparison of these two models combines fundamental model improvements of RAP over RUC, as well as the impacts of assimilation of the WFIP data. The tall tower datasets are the primary source used for this evaluation. To properly evaluate the skill of an NWP model at forecasting winds for wind energy, it is essential to convert from wind speed to the equivalent power that a wind turbine would produce. To convert wind speed into power, we used a generic International Electrotechnical Commission class 2 (IEC 2005) wind turbine power curve, which is the most common type of wind turbine deployed by the NSA and SSA wind generator partners.
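The conversion from wind speed to power can be illustrated with a minimal sketch. The function below is not the IEC class 2 curve used in the study; the cut-in, rated, and cut-out speeds and the cubic shape between cut-in and rated speed are illustrative assumptions chosen only to show why power, rather than wind speed, is the more demanding verification variable.

```python
def generic_power_curve(speed_ms, cut_in=3.0, rated_speed=12.0, cut_out=25.0):
    """Return power as a fraction of rated capacity (0 to 1) for a hub-height wind speed.

    All three thresholds are hypothetical values for illustration,
    not parameters taken from the IEC class 2 curve.
    """
    if speed_ms < cut_in or speed_ms >= cut_out:
        return 0.0  # turbine is idle below cut-in and shut down at cut-out
    if speed_ms >= rated_speed:
        return 1.0  # output is capped at rated power
    # Assumed cubic growth of power between cut-in and rated speed
    return (speed_ms**3 - cut_in**3) / (rated_speed**3 - cut_in**3)


# The same 1 m/s wind speed error matters far more on the steep part of the curve
error_mid_curve = generic_power_curve(9.0) - generic_power_curve(8.0)
error_above_rated = generic_power_curve(16.0) - generic_power_curve(15.0)
```

Under these assumptions, a 1 m s−1 error near 8–9 m s−1 produces a sizable power error, while the same error above rated speed produces none, which is why wind speed skill and power skill need not be identical.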
We have chosen to use a simple mean bias correction for the RUC–RAP comparison, and for the data-denial analysis that follows, after testing indicated that although more complex bias-correction techniques reduced the overall model error, they did not significantly alter the relative improvement between models or the improvement due to the assimilation of the new observations. The percentage root-mean-square error (RMSE) relative improvement of the ESRL_RAP model over NCEP_RUC is shown in Fig. 4 for the bias-corrected power evaluated using the 41 real-time tall towers in the NSA and the 27 real-time tall towers in the SSA. The improvement in hub-height power ranged from 12% to 5% for forecast hours 1–12.
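A simple mean bias correction followed by a relative RMSE comparison can be sketched as follows. The function names and the sample numbers are hypothetical, and the sketch ignores missing data and the per-forecast-hour stratification used in the study.

```python
import math


def rmse(forecast, observed):
    """Root-mean-square error between two equal-length series."""
    return math.sqrt(sum((f - o) ** 2 for f, o in zip(forecast, observed)) / len(observed))


def remove_mean_bias(forecast, observed):
    """Subtract the mean forecast-minus-observed error from every forecast value."""
    bias = sum(f - o for f, o in zip(forecast, observed)) / len(observed)
    return [f - bias for f in forecast]


def relative_improvement_pct(rmse_baseline, rmse_new):
    """Percentage RMSE reduction of a new model relative to a baseline model."""
    return 100.0 * (rmse_baseline - rmse_new) / rmse_baseline


# Made-up power values (fraction of rated capacity) for illustration only
observed = [0.20, 0.50, 0.80, 0.40]
baseline = [0.40, 0.60, 1.00, 0.30]   # stands in for the baseline model
upgraded = [0.30, 0.55, 0.90, 0.40]   # stands in for the upgraded model

rmse_baseline = rmse(remove_mean_bias(baseline, observed), observed)
rmse_upgraded = rmse(remove_mean_bias(upgraded, observed), observed)
improvement = relative_improvement_pct(rmse_baseline, rmse_upgraded)
```

Because the mean error is removed from each model separately, the comparison reflects differences in the random component of the error rather than differences in overall bias.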
For comparison purposes, an analysis of the operational NOAA/NWS North American Mesoscale Forecast System (NAM) and Global Forecast System (GFS) models' 850-hPa vector-wind RMSE (using radiosondes for verification) encompassing the past 10 years (www.emc.ncep.noaa.gov/mmb/verif/vlcek/) over North America shows an annual improvement for the 12-h NAM forecasts (using the Eta Model before 2006) of 0.7% yr⁻¹, while for the GFS the value is approximately 0.4% yr⁻¹. Repeating the analysis shown in Fig. 4 for hub-height vector winds instead of power, similar improvements of 12%–5% for forecast hours 1–12 are found (not shown). Thus, the regional improvement from the combination of RAP and assimilation of the WFIP observations in the NSA and SSA represents close to a decade’s worth of improvement typically found in the operational models over North America, marking a significant advancement for the wind energy industry.
DATA-DENIAL NWP EXPERIMENT RESULTS.
One of the primary goals of WFIP was to determine the impact of the special WFIP observations on the model forecast skill of turbine hub-height winds. Isolating the impact of the new observations required controlled data-denial simulations, where the identical NWP model was run twice: first as a control run that assimilated only the routinely available observations and second as an experimental run that assimilated both the routine and the special WFIP observations. Differences in forecast skill between these two simulations determine the impact that the special WFIP observations alone had on improving model forecast skill.
Six separate data-denial episodes were chosen, ranging in length from 7 to 12 days, for a total of 55 days (Table 4). The intent in selecting these days was to get a distribution through all four seasons of the year. In addition, weeks were chosen when few observations were missing, a variety of meteorological phenomena were sampled (cold fronts, low-level jets, thunderstorms), and there were large-amplitude ramp events.
Figure 5 displays RMSEs of the tall tower–derived wind power for the control and experimental simulations, both using the RAP model. The RMSE (Fig. 5, top panels) is expressed as a percentage of the maximum wind power capable of being generated (the rated power). For all hours in both the NSA and SSA, the experimental simulations (that assimilate the WFIP observations) have RMSEs smaller than or equal to those of the control. The improvement is slightly larger in the NSA, where there were more observations assimilated over a larger domain, than in the SSA. The bottom panels in Fig. 5 show the difference between the two curves of the top panels; this difference, which defines the improvement in the forecast, is approximately 1% of capacity at forecast hour 1 and is statistically significant through forecast hour 7 in the NSA, and through forecast hour 4 in the SSA, at the 95% confidence level. When expressed as a relative percentage improvement, the maximum RMSE improvement (at forecast hour 1) in the bottom two panels in Fig. 5 is equivalent to approximately 5%–6%. Similar magnitudes of improvement were found for r², the coefficient of determination, for the NSA and SSA (not shown). Interestingly, the SSA has higher values of RMSE than the NSA, perhaps as a result of more prevalent and more difficult to forecast low-level jets (e.g., Freedman et al. 2008), the presence of complex terrain (many of the wind plants are on mesa tops), and possibly more frequent convection. Additional statistical analyses can be found in the three DOE final reports from NOAA, WindLogics, and AWS Truepower (Wilczak et al. 2014; Finley et al. 2014; Freedman et al. 2014).
To demonstrate that the WFIP observations also have a positive impact on a deeper layer of the atmosphere than only at turbine heights, we show the improvement in vector-wind 0–2-km layer-averaged RMSE, using data from the radar wind profilers as verification (Fig. 6). The RMSE improvement is large at the initialization time (hour 0), indicating a closer model fit to the newly assimilated observations, and this improvement diminishes with time but remains statistically significant for the first eight forecast hours. Previous profiler data-denial experiments (Benjamin et al. 2004, 2010) showed similar short-range forecast impacts from regional profiler networks.
In addition to the hourly updated RAP model, data-denial assimilation experiments were also run with the NOAA/NWS NAM 12-km parent and 4-km CONUS nest domains. These results are consistent with the RAP experiments, as discussed in Wilczak et al. (2014).
The statistics shown in Fig. 5 are averages of power RMSE calculated at individual point locations. These statistics quantify forecast skill applicable to an individual wind plant that fits within a single model grid cell. For some applications, one would instead be interested in comparing spatially averaged power observations with spatially averaged model forecasts. For example, a grid operator may be more interested in the aggregate wind power of the entire balancing area or the aggregate power in a geographic area that feeds into one transmission node. Spatially averaged forecast skill can differ from the average skill of individual point locations if there are compensating errors, where an overforecast at one point tends to balance an underforecast at another point. This difference is the same as that found between forecasting precipitation at a point location versus a catchment basin that spans many model grid points.
To evaluate the effects of spatial averaging, we used forecasts and observations from the NSA, since that domain had tower data covering a larger geographic area than the SSA. First, an 8 × 8 grid was overlain on the NSA domain (Fig. 7, left panel), with each grid box approximately 100 km (north–south) × 150 km (east–west). Within each of these grid boxes all of the tower observations and power forecasts for those towers were averaged at each hour. The RMSE was then computed for each of these 64 sets of aggregated observations and forecasts and averaged. The process was then repeated using a 4 × 4 grid, a 2 × 2 grid, and finally averaging the observations and forecasts for all of the tower sites together (a 1 × 1 grid) and, then, calculating the RMSE.
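The aggregation procedure described above can be sketched in a few lines. This is an illustrative reconstruction, not the code used in the study: the data layout (a list of dictionaries with `lat`, `lon`, `obs`, and `fcst` keys), the equal-angle grid boxes, and the choice to skip boxes that contain no towers are all assumptions.

```python
import math
from collections import defaultdict


def rmse(forecast, observed):
    """Root-mean-square error between two equal-length series."""
    return math.sqrt(sum((f - o) ** 2 for f, o in zip(forecast, observed)) / len(observed))


def box_index(lat, lon, bounds, n):
    """Map a tower location to its (row, column) box in an n x n grid over the domain."""
    lat_min, lat_max, lon_min, lon_max = bounds
    # Clamp so that towers on the upper edge of the domain fall in the last box
    i = min(int((lat - lat_min) / (lat_max - lat_min) * n), n - 1)
    j = min(int((lon - lon_min) / (lon_max - lon_min) * n), n - 1)
    return i, j


def aggregated_rmse(towers, bounds, n):
    """Average towers within each grid box at each hour, then average the per-box RMSEs.

    towers: list of dicts with 'lat', 'lon', 'obs', 'fcst' (equal-length hourly series).
    Boxes without towers are skipped (an assumption of this sketch).
    """
    boxes = defaultdict(list)
    for tower in towers:
        boxes[box_index(tower["lat"], tower["lon"], bounds, n)].append(tower)

    box_rmses = []
    for members in boxes.values():
        hours = range(len(members[0]["obs"]))
        obs_mean = [sum(m["obs"][h] for m in members) / len(members) for h in hours]
        fcst_mean = [sum(m["fcst"][h] for m in members) / len(members) for h in hours]
        box_rmses.append(rmse(fcst_mean, obs_mean))
    return sum(box_rmses) / len(box_rmses)


# Two hypothetical towers with compensating errors (one overforecast, one underforecast)
towers = [
    {"lat": 0.5, "lon": 0.5, "obs": [0.5, 0.5], "fcst": [0.6, 0.6]},
    {"lat": 1.5, "lon": 1.5, "obs": [0.5, 0.5], "fcst": [0.4, 0.4]},
]
bounds = (0.0, 2.0, 0.0, 2.0)  # lat_min, lat_max, lon_min, lon_max
rmse_separate = aggregated_rmse(towers, bounds, n=2)  # each tower in its own box
rmse_combined = aggregated_rmse(towers, bounds, n=1)  # both towers in one aggregate
```

In this toy case the two errors cancel exactly once the towers are averaged together, which is the extreme form of the compensating-error effect that lowers RMSE as the aggregation area grows.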
The forecast power RMSEs for the various degrees of spatial averaging are shown in Fig. 7 (right panel), with the solid curves for the average of the 55 days of the data-denial control simulations and the dashed curves for the experimental simulations assimilating the new WFIP observations, using all 133 tall towers for verification. The reduction in RMSE provided by spatial averaging is very large, with more than a factor of 2 difference between treating each tower individually and aggregating all towers together.
The difference between the dashed lines and solid lines shows the improvement from assimilating the new WFIP observations at the various degrees of spatial averaging. Interestingly, although the RMSE decreases continuously with more spatial averaging, the improvement from assimilating the WFIP observations remains fairly constant for all size averages until it finally decreases in the 1 × 1 box when all towers are combined into a single aggregate. This indicates that even for moderately large aggregation areas (the 2 × 2 boxes are 400 km × 600 km) forecasts can be improved significantly with assimilation of new observations. In this case the improvement averaged for forecast hours 1–6 for the 2 × 2 grid is 0.8% of rated capacity, while the relative improvement (0.8/13 × 100) is 6%.
The preceding analyses focused on the NOAA RAP and RUC forecast models and used data-denial studies to determine the impact of the new observations on forecast skill, using tall-tower or wind profiler observations for verification. The next two sections focus on specific WindLogics results from the northern study area and then AWS Truepower results from the southern study area, both using actual wind plant power output for model evaluation (including ramp events), and both evaluating the impact of the HRRR model. The southern study area analysis also focuses on the use of ensemble forecast systems for wind energy and addresses the impact and forecasting of low-level jets.
NORTHERN STUDY AREA.
In the NSA, WindLogics made wind power forecasts for 23 NextEra Energy operational wind plants (as shown in Fig. 1). Aggregate forecasts were also created by summing forecasts for the individual plants. The analysis presented here will focus on the aggregate results. All forecast skill evaluations are based on observed wind plant power production.
The wind power forecasts generated from the various models applied several levels of postprocessing, a common practice among commercial forecast vendors. Although evaluation of postprocessing techniques commonly used in the wind energy forecasting sector was not the focus of WFIP, it is important to assess whether the fundamental wind speed forecast improvements achieved for the raw forecasts remain after typical postprocessing is applied to them. The levels of forecasts included 1) a “raw” forecast made using hub-height wind forecasts directly from the models and converted into power using a plant-specific power curve; 2) a “bias corrected” forecast made by calculating a rolling 2-week hub-height wind speed model bias relative to the turbine nacelle anemometer measurements over the previous 2 weeks (after a turbine-manufacturer-provided blade-wash correction was applied to them) for every forecast hour, and then bias correcting the wind speed before applying a plant-specific power curve; 3) a “trained” forecast that utilized sophisticated methods for creating nonlinear regression functions from a set of training data (Support Vector Machine, SVM; Cortes and Vapnik 1995; Chang and Lin 2011) using NWP model data as inputs and observed wind plant power data as the “target” variable to statistically correct the individual model-based wind power forecasts; and 4) a “trained ensemble,” which combined trained forecasts from a short-range model with the NAM and the local wind plant persistence forecast to generate wind power forecasts. Note that to accurately predict power from an operating wind plant the forecast must consider turbine waking, which is accounted for explicitly in the bias-corrected forecasts and implicitly in the trained forecasts.
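The rolling bias correction in level 2 can be sketched as below. This is a simplified illustration rather than the WindLogics implementation: it assumes hourly series for a single forecast lead time, treats every past observation as immediately available (no data latency), and clamps corrected speeds at zero, none of which is specified in the description above.

```python
def rolling_bias_correct(forecast, observed, window=14 * 24):
    """Correct each forecast wind speed using the mean error over the previous `window` samples.

    forecast, observed: equal-length hourly wind speed series for one forecast lead time.
    window: number of past samples in the rolling period (14 days of hourly data by default).
    """
    corrected = []
    for k, speed in enumerate(forecast):
        start = max(0, k - window)
        past_errors = [forecast[i] - observed[i] for i in range(start, k)]
        # With no history yet, leave the forecast uncorrected
        bias = sum(past_errors) / len(past_errors) if past_errors else 0.0
        corrected.append(max(0.0, speed - bias))  # wind speed cannot be negative
    return corrected


# A forecast that is consistently 1 m/s too high is corrected once history accumulates
example = rolling_bias_correct([6.0] * 5, [5.0] * 5, window=3)
```

The corrected wind speeds would then be passed through a plant-specific power curve, so that the nonlinearity of the curve acts on the debiased speeds rather than on the raw ones.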
Since several months of data are required for the training process, the trained forecasts were generated starting in January 2012 and continued through the WFIP field campaign ending in August 2012. The forecast system was trained monthly using hourly data from the start of the field campaign through the end of a given month and, then, was used to produce trained forecasts for the following month. A comparison of the system-aggregate raw, bias-corrected, and trained power forecast RMSEs, expressed as a percentage of the rated capacity, for the first 12 forecast hours and for the 8-month period, is shown in Fig. 8. As can be seen, bias correcting the model wind speeds prior to calculating the power improves the power forecasts for all three models (HRRR, ESRL_RAP, and NCEP_RAP), but particularly for HRRR. ESRL_RAP, which assimilates the WFIP observations, is more skillful than NCEP_RAP, which does not assimilate the WFIP observations. At the nonaggregated wind plant level (not shown), the ESRL_RAP-based bias-corrected power forecasts also had the lowest forecast errors at most wind plant locations, followed by the HRRR and NCEP_RAP bias-corrected forecasts, respectively.
For all models, the training process further improves the overall RMSE compared to the simpler bias-correction method, with absolute percentage improvements of 0.3%–0.6% of rated capacity averaged over the first 12 forecast hours (with improvements of 1%–2% of rated capacity compared to the raw forecasts). While the training process reduces the forecast error differences somewhat between the various model-based forecasts, the ESRL_RAP trained forecasts (which assimilate the WFIP observations) are still the best of the individual models, with RMSEs lower by 0.62% of rated capacity compared with the NCEP_RAP trained forecasts.
Typically, the WindLogics operational forecasts are made from an ensemble of several trained models that also include persistence information for the first several forecast hours. An example of such an operational forecast is shown in Fig. 8 for the ESRL_RAP (plus NAM) two-member ensemble. All trained ensemble forecasts have lower RMSEs in the first 2 h when persistence information adds skill, with average RMSE improvements of 0.1% (trained ensemble ESRL_RAP plus NAM versus trained ESRL_RAP) and 0.7% (trained ensemble HRRR plus NAM versus trained HRRR) of rated capacity at later forecast hours resulting from the use of the ensemble.
While standard bulk statistical error metrics are useful for gauging forecast skill, they often are inadequate in capturing a complete sense of forecast impacts. This is particularly true for power system operations because reliability of the grid is of primary importance. Grid operators are concerned about very large forecast errors that develop quickly, even if they occur only rarely, because they can influence the reserves and operating practices that operators use to ensure reliability. These large forecast errors often occur as a result of actual or predicted “ramp events,” when wind power production is changing rapidly. The ramp rate is of particular concern as there are limits to how quickly conventional generation can be ramped up (or down) to offset the changes in wind generation in order to keep the system balanced. Since grid operators must balance the system on a minute-by-minute basis, in this ramp analysis the observations were used at their highest temporal resolution (10 min) and the hourly model output was interpolated to these 10-min intervals.
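Placing hourly model output on the 10-min observation grid can be sketched as follows. The text does not state which interpolation method was used, so linear interpolation is an assumption here, and the function name is hypothetical.

```python
def interpolate_to_10min(hourly, steps_per_hour=6):
    """Linearly interpolate an hourly series to `steps_per_hour` values per hour."""
    fine = []
    for current, following in zip(hourly, hourly[1:]):
        # Emit the value on the hour plus the intermediate values up to the next hour
        fine.extend(current + (following - current) * k / steps_per_hour
                    for k in range(steps_per_hour))
    fine.append(hourly[-1])  # close the series with the final hourly value
    return fine


# Two hourly values become seven values at 10-min spacing
fine_series = interpolate_to_10min([0.0, 6.0])
```

A linear interpolation cannot add variability between model output times, so forecast ramps resolved this way can be no steeper than the hourly output implies.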
Because the definition of a “wind energy ramp event” will vary from one operating area to another depending on the penetration of wind on the system and the other types of generation available, a suite of ramp definitions was used as follows:
The power changes X% (of rated capacity) over a Y-h period (or less).
The event can be longer than Y h as long as d(power)/dt ≥ (X% capacity)/(Y h) occurs at some point during the event.
The beginning (end) of a ramp occurs when the 10-min ramp rate exceeds (falls below) 2.5% of the defined threshold rate.
A correctly forecast ramp (or “hit”) occurs when the midpoint of the predicted ramp is within ±2 h of the observed ramp midpoint and the magnitude of the predicted ramp is within ±50% of observed (and of correct sign).
Reasonable ranges of X and Y were chosen from scatterplots of observed ramp amplitude versus ramp duration for the entire system (or an individual site) for all ramps that equaled or exceeded 15% of the rated capacity [i.e., large enough to potentially create issues in Midcontinent Independent System Operator (MISO) grid system operations at the time of this study].
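One possible reading of the ramp definition above is sketched below for a 10-min power series expressed in percent of rated capacity. It is not the project's implementation. The function name, the treatment of an event as a run of same-signed 10-min ramp rates, and the sliding-window test for "X% over Y h or less" are all assumptions made for illustration.

```python
def find_ramps(power, x_pct, y_hours, dt_hours=1 / 6, edge_fraction=0.025):
    """Return (start_index, end_index, signed_change) for each ramp event.

    power: 10-min series in % of rated capacity.
    Assumptions: an event is a run of same-signed 10-min ramp rates whose
    magnitude stays above edge_fraction * (X / Y); it qualifies if the power
    changes by at least X% within some window of Y h or less inside the run.
    """
    threshold_rate = x_pct / y_hours             # % of capacity per hour
    edge_rate = edge_fraction * threshold_rate   # start/end criterion
    window = max(1, round(y_hours / dt_hours))   # samples spanning Y h
    rates = [(b - a) / dt_hours for a, b in zip(power, power[1:])]

    ramps = []
    i = 0
    while i < len(rates):
        if abs(rates[i]) <= edge_rate:
            i += 1
            continue
        sign = 1 if rates[i] > 0 else -1
        start = i
        while i < len(rates) and sign * rates[i] > edge_rate:
            i += 1
        end = i  # index into `power` of the last point of the run
        qualifies = any(
            sign * (power[min(j + window, end)] - power[j]) >= x_pct
            for j in range(start, end)
        )
        if qualifies:
            ramps.append((start, end, power[end] - power[start]))
    return ramps


# Hypothetical series: a 30% up-ramp over 50 min.
series = [10.0, 10.0, 10.0, 15.0, 22.0, 30.0, 36.0, 40.0, 40.0, 40.0]
ramps_20_in_1h = find_ramps(series, x_pct=20.0, y_hours=1.0)
ramps_40_in_1h = find_ramps(series, x_pct=40.0, y_hours=1.0)
down_ramps = find_ramps(series[::-1], x_pct=20.0, y_hours=1.0)
```

With these sample values the same series counts as a ramp under the 20%-in-1-h definition but not under the 40%-in-1-h definition, which is why results are reported over a suite of definitions.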
Several metrics were calculated to evaluate the accuracy of the wind power forecasts during wind energy ramp events. The metrics (calculated both at the wind plant level and the system aggregate level) included frequency bias, probability of detection, false-alarm rate, and critical-success index (or threat score), as well as errors in ramp-event timing, magnitude, duration, and ramp rate.
The frequency bias is defined as the ratio of the number of forecasted ramp events to the number of observed events and is shown in Fig. 9 for the aggregate of forecasts over a subset of ramp definitions. Results from a suite of ramp definitions are shown to illustrate the range of variability in the metrics for a given model forecast and the consistency in the results across ramp definitions. Only the bias-corrected forecasts are shown, because the trained forecasts were statistically optimized by minimizing the RMSE at the expense of the sharpness of ramp events and are therefore not optimal for forecasting ramp events. As can be seen in Fig. 9, the bias-corrected ESRL_RAP-based power forecasts most accurately predict the total number of aggregate ramp events on average. (A frequency bias value of 1 indicates that the model predicts the same number of events as were observed.) The HRRR-based forecasts tend to overpredict the number of aggregate ramp events by about 9%, and the NCEP_RAP-based forecasts tend to underpredict the number of events by about 10%. When a similar analysis is performed at the wind plant level, all model-based bias-corrected power forecasts tend to underpredict the number of events for most ramp definitions, but the HRRR-based power forecasts do a significantly better job of forecasting the number of events for all ramp definitions than do the other forecasts (not shown), likely because of the higher resolution of the HRRR.
A comparison of the probability of detection values for all three bias-corrected model-based system-aggregate power forecasts for a subset of ramp-event definitions is also shown in Fig. 9. Probability of detection is defined as the fraction of observed ramp events that is predicted correctly, and a value of 1 indicates a perfect forecast. The ESRL_RAP-based forecasts more accurately predict ramp events than do the HRRR and NCEP_RAP-based forecasts for most ramp definitions, by about 1% and 4%, respectively. A similar analysis done at the wind plant level showed that the HRRR-based forecasts were the most accurate, followed by those of ESRL_RAP and then NCEP_RAP (not shown).
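The hit criterion and the categorical scores can be sketched as follows. Frequency bias and probability of detection follow the definitions given in the text. The false-alarm and critical-success-index formulas are the standard contingency-table forms, which the text does not spell out, and the one-to-one matching of forecast to observed ramps is an assumption. All names and sample values are hypothetical.

```python
def is_hit(fcst, obs, max_dt_hours=2.0, max_rel_err=0.5):
    """fcst and obs are (midpoint_hour, signed_magnitude) tuples.

    A hit requires the same sign, midpoints within +/-2 h, and a forecast
    magnitude within +/-50% of the observed magnitude.
    """
    f_mid, f_mag = fcst
    o_mid, o_mag = obs
    same_sign = f_mag * o_mag > 0
    close_in_time = abs(f_mid - o_mid) <= max_dt_hours
    close_in_size = abs(f_mag - o_mag) <= max_rel_err * abs(o_mag)
    return same_sign and close_in_time and close_in_size


def ramp_scores(forecast_ramps, observed_ramps):
    """Categorical scores; assumes both lists are non-empty."""
    unmatched = list(forecast_ramps)
    hits = 0
    for obs in observed_ramps:
        # Assumption: each forecast ramp can match at most one observed ramp.
        match = next((f for f in unmatched if is_hit(f, obs)), None)
        if match is not None:
            unmatched.remove(match)
            hits += 1
    misses = len(observed_ramps) - hits
    false_alarms = len(unmatched)
    return {
        "frequency_bias": len(forecast_ramps) / len(observed_ramps),
        "pod": hits / len(observed_ramps),
        "far": false_alarms / len(forecast_ramps),  # false alarms per forecast event
        "csi": hits / (hits + misses + false_alarms),
    }


# Hypothetical events: (midpoint in hours, change in % of rated capacity).
observed = [(3.0, 30.0), (10.0, -25.0), (20.0, 40.0)]
forecast = [(4.0, 24.0), (11.5, -30.0), (15.0, 20.0)]
scores = ramp_scores(forecast, observed)
```

In this example the frequency bias is 1 even though one observed ramp is missed and one forecast ramp is a false alarm, which illustrates why frequency bias is reported alongside probability of detection rather than on its own.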
Ramp-rate errors for the aggregate bias-corrected forecasts for all events classified as hits (i.e., correctly predicted ramps) are shown in Fig. 10. Negative values indicate that the forecast ramp rate is smaller than observed. As can be seen in Fig. 10, all forecasts underpredict the ramp rate for all ramp definitions, but the HRRR-based forecasts have significantly smaller ramp-rate errors (50%–60% less for most ramp definitions) than do the coarser-resolution model-based forecasts. This is largely because the HRRR-based forecasts significantly outperformed the coarser-resolution forecasts in accurately forecasting ramp duration. There is a clear demarcation in forecast ramp-rate errors as a function of ramp definition, with all forecasts more accurately predicting ramp rate for the smaller-magnitude (15% rated) ramp events. For these events, the HRRR-based forecasts are extremely accurate at correctly predicting the average ramp rate, with errors as much as 80%–90% smaller than the coarser-resolution forecasts. Similar results were obtained when the ramp-rate errors were calculated at the individual wind plants (not shown).
SOUTHERN STUDY AREA.
For WFIP, a major component of AWS Truepower’s contribution in the SSA region was the continued development and evaluation of ensemble forecast systems for wind energy. The WFIP SSA forecasting system (WFIPFS) is an enhanced and expanded version of AWST’s operational eWind forecast system with five core components: 1) an ensemble of rapid-update short-term NWP forecasts, 2) a statistical adjustment procedure for each of the NWP forecasts, 3) a set of statistical time series prediction schemes, 4) an ensemble composite weighting algorithm, and 5) a wind plant output model. The WFIP enhanced observations were incorporated into most of the model system’s data assimilation schemes. Actual wind plant power production from 34 plants was used for model postprocessing and validation.
Prior to WFIP, ERCOT was using two AWST forecast products: an optimized short-term wind power forecast (OPTENS_STWPF), which provided overall power production forecasts, and the ERCOT Large Ramp Alert System (ELRAS), which was used for ramp forecasts. These two were used as the baseline against which the WFIP forecasts were compared. ELRAS was primarily used to alert system operators that a major generation source of wind may become unavailable in a short amount of time (e.g., due to a forecast ramp event). If a large ramp is forecast, operators could subjectively decide to dispatch other available resources to meet whatever the load demand might be during their reliability update process.
The NWP component of the WFIPFS is composed of nine individual members based on three different modeling systems run by AWST, as well as HRRR. Most of the models have similar grid configurations. However, given the scale and nature of the phenomena affecting wind forecasts in the boundary layer (i.e., low-level jets, convective outflow boundaries, frontal systems), the following attributes were varied among the ensemble members:
NWP models used to generate the simulations,
source of lateral boundary conditions (BCs),
boundary layer physics scheme,
convective cloud scheme, and
data assimilation scheme and incorporation of enhanced observations used to initialize the models.
The three models used by AWST were 1) WRF version 3.3.1 (Skamarock et al. 2005), 2) the Advanced Regional Prediction System (ARPS version 5.2.11) model (Xue et al. 2000, 2001), and 3) the Mesoscale Atmospheric Simulation System (MASS) model (Manobianco et al. 1996). All simulations had a horizontal grid spacing of 5 km and an update frequency of 2 h with most simulations using initial conditions (ICs) and BCs from ESRL_RAP. Three of the ensemble members were warm started (no spinup) using the previous forecast to initialize conditions for 11 of 12 runs per day, and a low-resolution (15 km) ensemble Kalman filter (EnKF) member was used to produce ICs and BCs for two of the ensemble members. The nine AWST high-resolution models produced 13-h power forecasts every 2 h for each wind plant for 1 yr, as well as a system-wide aggregate.
The baseline OPTENS_STWPF used two 8-km MASS runs employing NOAA GFS and NAM ICs and BCs, weighted each of three postprocessing methods [unadjusted, persistence adjusted, and model output statistics (MOS) adjusted] based on their relative performance over the previous month, and incorporated the latest 15-min power data for persistence corrections. An optimized WFIP ensemble forecast (OPTENS_WFIP) was similar except that it used forecasts from each of the 10 different NWP models (9 AWST plus HRRR) that used the ESRL_RAP ICs and BCs. The NWP forecasts were available for bias correction 2 h after initialization. A bias-corrected forecast was delivered every 15 min. The most recent wind plant tower and power generation data were used in the bias correction algorithm to include information for “persistence.”
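Weighting postprocessing methods by their recent performance can be illustrated with a simple scheme. The text does not describe the actual weighting algorithm, so the inverse-mean-square-error weights below are an assumption chosen for illustration, and the function names and numbers are hypothetical.

```python
def inverse_mse_weights(errors_by_method):
    """Weights proportional to 1 / MSE over a training window.

    errors_by_method: {method_name: [forecast - observed, ...]}.
    Assumption: inverse-MSE weighting; the source only says that methods
    were weighted by relative performance over the previous month.
    Requires a nonzero MSE for every method.
    """
    inverse = {}
    for name, errors in errors_by_method.items():
        mse = sum(e * e for e in errors) / len(errors)
        inverse[name] = 1.0 / mse
    total = sum(inverse.values())
    return {name: value / total for name, value in inverse.items()}


def combine(forecasts, weights):
    """Weighted composite of the individual method forecasts."""
    return sum(weights[name] * forecasts[name] for name in forecasts)


# Hypothetical errors over the training window (% of rated capacity).
recent_errors = {
    "unadjusted": [4.0, -4.0],
    "persistence_adjusted": [2.0, -2.0],
    "mos_adjusted": [2.0, 2.0],
}
weights = inverse_mse_weights(recent_errors)
composite = combine(
    {"unadjusted": 30.0, "persistence_adjusted": 39.0, "mos_adjusted": 39.0},
    weights,
)
```

Here the unadjusted method, with four times the mean square error of the other two, receives one-quarter of their weight in the composite.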
Comparisons of the WFIP ensemble forecasts (OPTENS_WFIP) with the baseline (OPTENS_STWPF) are shown in Fig. 11. The greatest error reduction (30%–60%) occurs in the first 90 min of the forecast. Beyond 90 min, the forecast error reduction steadily decreases but is still apparent. The initial improvement before 90 min can be attributed to several factors, including more accurate (and a larger ensemble of) higher-resolution WFIP models, and the use of the ESRL_RAP ICs and BCs. Data-denial experiments (not shown) indicate lesser contributions to the magnitude of the overall forecast improvement at longer forecast time horizons from assimilated project observations. Additional sensitivity experiments are required to determine which component of the WFIPFS contributed most to forecast improvement. Although an ensemble forecast should outperform a deterministic forecast, and a large ensemble should outperform a small ensemble, the results here demonstrate the magnitude of these improvements when applied to wind energy forecasting.
One way to graphically summarize the improved ensemble performance (and compare the individual ensemble members) is through a Taylor diagram (Taylor 2001). The similarity between the ensemble performance and the observations is quantified in terms of their correlation (or their coefficient of determination), their centered root-mean-square difference, and the amplitude of their variations (represented by their standard deviations). Thus, Taylor diagrams can be especially useful in evaluating multiple aspects of model performance in a phase–amplitude space.
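The three quantities plotted on a Taylor diagram can be computed as in the sketch below, which also checks the geometric relation from Taylor (2001) that lets them share one plot: the squared centered RMS difference equals the sum of the two variances minus twice the product of the standard deviations and the correlation. The function name and sample values are hypothetical.

```python
import math


def taylor_stats(forecast, observed):
    """Return (sigma_f, sigma_o, correlation, centered RMS difference)."""
    n = len(observed)
    f_mean = sum(forecast) / n
    o_mean = sum(observed) / n
    f_anom = [f - f_mean for f in forecast]
    o_anom = [o - o_mean for o in observed]
    sigma_f = math.sqrt(sum(a * a for a in f_anom) / n)
    sigma_o = math.sqrt(sum(a * a for a in o_anom) / n)
    corr = sum(a * b for a, b in zip(f_anom, o_anom)) / (n * sigma_f * sigma_o)
    crmsd = math.sqrt(sum((a - b) ** 2 for a, b in zip(f_anom, o_anom)) / n)
    return sigma_f, sigma_o, corr, crmsd


# Hypothetical aggregate power, in % of rated capacity.
observed_power = [10.0, 30.0, 50.0, 40.0, 20.0]
forecast_power = [15.0, 25.0, 40.0, 45.0, 25.0]
sigma_f, sigma_o, corr, crmsd = taylor_stats(forecast_power, observed_power)

# Law-of-cosines relation underlying the diagram.
identity_gap = crmsd ** 2 - (sigma_f ** 2 + sigma_o ** 2 - 2 * sigma_f * sigma_o * corr)
```

Because the centered RMS difference removes the mean of each series, a Taylor diagram does not show overall bias, which has to be assessed separately.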
Figure 12 is a Taylor diagram showing the 3-h forecast performance of the individual ensemble members (various symbols), the OPTENS_STWPF (open upside-down black triangle), and the OPTENS_WFIP (open green triangle) for the WFIP system-wide aggregate as compared with observations (black asterisk). Note that the individual model members (unoptimized, i.e., with no statistical postprocessing applied) show considerable scatter in the phase–amplitude space, with the MASS and HRRR members performing best. There is also a significant increase in overall skill shown by the OPTENS_WFIP as compared with the baseline OPTENS_STWPF, with definitive movement toward minimizing RMSE, increasing r², and capturing the characteristic observational variability (solid blue arrow in Fig. 12). Note that there is still significant room for improvement, as denoted by the dotted red arrow.
Next, the use of the WFIP ensemble for forecasting ramp events is investigated. Single forecasts from deterministic models cannot communicate the likelihood of occurrence or likelihood of different ramp event scenarios. Therefore, 6-h probabilistic ramp event forecasts were created every 15 min. For individual models these were derived using quantile regression of historical forecasts. These forecasts contain the probability of exceedance for several ramp-rate thresholds and a probability distribution of ramp rates. The WFIP probabilistic ramp forecasts were compared to ELRAS, which was based on a single ARPS model run and 3D variational data assimilation (3DVAR; Zack et al. 2011).
The low-level jet (LLJ) is a phenomenon that has been investigated by wind energy interests for nearly 40 years (Sisterson and Frenzen 1978; Kelly et al. 2004; Banta et al. 2008). LLJs occur regularly throughout the year in the southern Great Plains (e.g., Bonner 1968) and are especially prevalent over the WFIP SSA (Freedman et al. 2008). The height of the LLJ wind speed maximum varies between 50 and 400 m, but typically occurs at about 200 m (Banta et al. 2002). Thus, a special concern for wind energy interests and a forecasting challenge introduced by LLJs is the large vertical shear [upward of 8 m s⁻¹ (100 m)⁻¹] that can occur across the turbine rotor plane.
Critical observational and forecasting issues concerning LLJs are 1) the strength of the vertical gradient of wind speed, 2) their formation and persistence, 3) spatial characteristics such as width and depth, and 4) intermittent turbulence leading to propagation of strong winds toward the surface. To ensure sufficient vertical resolution of the full profile of the LLJ, the field measurement campaign included the deployment of several integrated observation sites: that is, the collocation of a surface meteorological station, sodar, and wind profiling radar.
Qualitative analysis indicates that the LLJ is a regular, periodic (e.g., Fig. SB1), and dominant feature in the SSA that drives capacity factors to over 60% (and therefore a large fraction of power production) during the nocturnal hours. Given the large wind shears that can be generated by LLJs, model forecast errors that displace the wind profile by just a few tens of meters can lead to large errors in forecast power production, as can mistiming their onset or cessation.
OPTENS_WFIP produced a marked improvement in forecasting the amplitude and phase of the LLJ, as demonstrated in Fig. SB1 (showing the 3-h forecasts for a several-day sequence dominated by the formation and decay of the LLJ). In particular, the individual model-member raw forecasts were often significantly in error regarding the amplitude and phase of the diurnal wind speed cycle, while the OPTENS_WFIP forecasts (solid blue line in Fig. SB1) were more skillful. This is consistent with the findings of Deppe et al. (2013) and highlights a continuing issue that models have in capturing the temporal and spatial distributions of LLJs (Storm et al. 2009; Storm and Basu 2010). Accurate LLJ depiction is crucial for wind energy forecasts, and the results here illustrate the limitations of model parameterization schemes, especially for the PBL (Deppe et al. 2013; Werth et al. 2011). This further demonstrates the critical role played by the WFIP model system statistical postprocessing and bias correction schemes, which show much better alignment with the observations (solid black line in Fig. SB1). The large spread in individual model-member forecasts (Figs. 12 and SB1) is striking and suggests the value of probabilistic uncertainty forecasts for LLJ scenarios, a subject deserving of additional study but beyond the scope of this paper.
The probabilistic ramp forecasts were verified using the ranked probability skill score (RPSS; Murphy 1969). The RPSS represents the improvement of the ramp probability forecast over the climatological ramp probabilities, and a value greater than zero indicates forecast skill greater than that of the reference (climatological) forecast. Several ensemble forecasts were generated using combinations of all (“All”) and the best ARPS, WRF, and MASS members (i.e., “Best 3”) from the MOS method. The results (Fig. 13) show that the ensembles produced a more accurate probabilistic forecast of 60-min ramps than any one of the single members. There is, on average, a 20% improvement in RPSS (forecast skill) of the ensemble forecasts (Best 3, All) over the best-performing single-member methods (i.e., HRRR, MASS). This result highlights the additional value from using an ensemble to generate a probabilistic ramp forecast. Of the single members, the HRRR probabilistic ramp forecast performed the best, followed by the MASS and WRF [Mellor–Yamada–Nakanishi–Niino (MYNN), University of Wisconsin (UW)] forecast members. The ARPS [3DVAR, the ARPS Data Analysis System (ADAS)] and EnKF members (ARPS, WRF) performed poorly, mostly because of higher false-alarm rates. The baseline ELRAS also performed poorly when compared to the other WFIP members. The ensemble that included all of these members outperformed the Best 3 ensemble, highlighting the advantage of model diversity.
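The skill score can be sketched as follows for ordered ramp-rate categories. This is a generic illustration of the ranked probability score and its skill score against climatology, not the verification code used in the project. The category layout, function names, and probabilities are hypothetical.

```python
def rps(probs, observed_category):
    """Ranked probability score for one forecast over ordered categories.

    probs: forecast probability of each category (sums to 1).
    observed_category: index of the category that occurred.
    """
    cumulative_forecast = 0.0
    score = 0.0
    for k, p in enumerate(probs):
        cumulative_forecast += p
        cumulative_observed = 1.0 if k >= observed_category else 0.0
        score += (cumulative_forecast - cumulative_observed) ** 2
    return score


def rpss(forecasts, climatology, observed_categories):
    """Skill relative to climatology: 1 is perfect, 0 matches climatology."""
    rps_forecast = sum(rps(p, o) for p, o in zip(forecasts, observed_categories))
    rps_reference = sum(rps(climatology, o) for o in observed_categories)
    return 1.0 - rps_forecast / rps_reference


# Hypothetical categories: small, moderate, and large ramp rates.
climatology = [0.6, 0.3, 0.1]
outcomes = [0, 2, 1]
perfect = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
sharp_but_wrong = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]

skill_perfect = rpss(perfect, climatology, outcomes)
skill_climatology = rpss([climatology] * 3, climatology, outcomes)
skill_wrong = rpss(sharp_but_wrong, climatology, outcomes)
```

A confident forecast that puts its probability in the wrong categories scores below zero, which is consistent with members that have high false-alarm rates performing poorly on this metric.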
WFIP allowed for NOAA, DOE national laboratories, private-sector forecasting companies, universities, and electric grid operators to collaborate on improving wind energy forecasts. A significant number of new observing systems, including proprietary tall-tower and turbine nacelle anemometer measurements provided by the private sector, were assimilated into NOAA’s regional and hourly updated forecast models, in both real-time and retrospective simulations. Data-denial experiments demonstrated that assimilation of these observations led to statistically significant improvements in turbine-height power forecasts. Improvements in the forecasts also occurred with the transition from the RUC to the RAP model midway through the WFIP field campaign. The improvements from the combination of assimilation of the additional observations and the upgrade to the RAP model ranged from 12% to 5% for forecast hours 1–12 in the NSA and SSA, equivalent to approximately the previous decade’s improvements achieved for NOAA/NWS operational forecasts for 850-hPa winds over North America.
The results from the NSA demonstrate that the research models (ESRL_RAP, HRRR) that included the additional WFIP data assimilation and improved model physics produced more accurate wind power forecasts than those created using the current operational model (NCEP_RAP), as illustrated in the general bulk error statistics and wind energy ramp forecast performance. A comparison of the various forecast metrics calculated with the two research models indicated that for overall bulk statistics, the 13-km-resolution ESRL_RAP-based forecasts are more accurate, while for ramp rates the 3-km HRRR-based forecasts performed best. These analyses collectively show that, for power system operations, no single weather model is best for every forecast need, which is reason to maintain a diverse choice of models.
For the SSA, the AWST forecasts (OPTENS_WFIP) demonstrated impressive improvement in forecast power production compared with the baseline (OPTENS_STWPF), with the largest improvement (60%) in aggregate capacity factor RMSE at hour 1, and consistently better performance (>20% decrease in RMSE) through hour 3. The probabilistic WFIP ensemble ramp predictions resulted in a large (20% or more) improvement in the RPSS as compared with the baseline (ELRAS) forecasts. Finally, the enhanced field observations facilitated identification and analysis of the principal phenomena (LLJs) responsible for the winds generating the larger capacity factors notable in the ERCOT domain and were also key to model system performance in capturing the phase and amplitude of the diurnal wind speed cycle.
Although the WFIP analyses have shown that significant improvements were made in wind energy forecasts, including wind ramp events, the remaining step of quantifying the economic benefits of these forecast improvements is still in process. One of the challenges of that analysis is in associating a monetary benefit to the improved grid reliability that is achieved through better short-term wind energy forecasts. Future efforts coordinated across DOE, NOAA, and the private sector are also needed to continue improvements to model forecast systems and data assimilation methods to optimally utilize additional observations, to investigate forecast uncertainty, and to develop techniques required in more extreme complex-terrain environments.
Numerous individuals made significant contributions to the success of WFIP. These include Tim Martin, Clark King, Jesse Leach, Tom Ayers, Jim Jordan, Dan Gottas, David Welsh, Leon Benjamin, Matthew Filippelli, Kurt Elsholz, Vic Morris, and Dan Nelson (field deployment and data acquisition); John Schroeder, Tom Strong, Shane Beard, Dave Christensen, Dennis Finn, Roger Carter, Brad Reese, Robert Lipschutz, Jason Rich, and Charles Kovalsky (data communications); Philippe Beaucage, Katherine Rojowsky, Sukanta Basu, and Paul Svenson (data analysis); Jennifer Leise, Deborah Hanley, and Ken Pennock (project management); Carrie Gillespie, Ken DeRose, Dennis Todey, and Bob Conzemius (real-time meteorological tower and turbine nacelle data); Victor Yannuzzi and Francisco Guzman (forecasting support); Dennis Keyser, Steven Levine, Jeff Whiting, and Ming Hu (GSI and data assimilation); Eric Rogers and Curtis Alexander (model development); Richard Eckman and Elizabeth Weatherhead (statistical analysis); Stan Calvert, Will Shaw, Geoff DiMego, John Brown, and Allen White (program support); Brian Ancell, Keith Brewster, Kevin Thomas, and Steve Young (AWST modeling systems); Venkat Banunarayanan, Saleh Nasir, Kristen Orwig, Greg Brinkman, Greg Stark, Erik Ela, and Jie Zhang (ongoing economic analysis); Isabel Flores (ERCOT liaison); and Michael McMullen (MISO liaison). The authors gratefully thank three anonymous reviewers for many constructive comments and suggestions. This work was supported by the U.S. Department of Energy (DOE), Office of Energy Efficiency and Renewable Energy, and by the National Oceanic and Atmospheric Administration.
ADDITIONAL AFFILIATION: Atmospheric Sciences Research Center, University at Albany, State University of New York, Albany, New York