New Methods for Data Storage of Model Output from Ensemble Simulations

Peter D. Düben, Martin Leutbecher, and Peter Bauer

European Centre for Medium-Range Weather Forecasts, Reading, United Kingdom

Abstract

Data storage and data processing generate significant cost for weather and climate modeling centers. The volume of data that needs to be stored and disseminated to end users increases with increasing model resolution and the use of larger forecast ensembles. If the precision of the data is reduced, cost can be reduced accordingly. In this paper, three new methods that allow a reduction in precision with minimal loss of information are suggested and tested. Two of these methods rely on the similarities between ensemble members in ensemble forecasts. Precision is therefore high at the beginning of a forecast, when ensemble members are similar and need to be distinguished, and decreases with increasing ensemble spread. Keeping precision high for predictable situations and low elsewhere appears to be a useful approach to optimize data storage in weather forecasts. All methods are tested with data from operational weather forecasts of the European Centre for Medium-Range Weather Forecasts.

© 2019 American Meteorological Society. For information regarding reuse of this content and general copyright information, consult the AMS Copyright Policy (www.ametsoc.org/PUBSReuseLicenses).

Corresponding author: Peter D. Düben, peter.dueben@ecmwf.int

1. Introduction

Modern predictions of weather and climate require the use of supercomputing facilities. Numerical predictions of weather and climate are data intensive, and requirements for data storage in weather and climate science are larger than in many other high-performance computing applications. Initial conditions after data assimilation, model output of simulations, and data of climate reanalyses need to be stored. The amount of data is substantial: the data-handling system of the European Centre for Medium-Range Weather Forecasts (ECMWF) provides access to over 210 petabytes of primary data, which corresponds to the data volume that can be stored on the hard disks of 100 000 PCs. ECMWF’s data archive grows by about 233 terabytes per day (ECMWF 2017). Unfortunately, supercomputing storage capacities have not grown at the same rate as processing power over the last 30 years (Kuhn et al. 2016).

Several parameters influence the data volume of model output and need to be traded off against each other, such as output frequency, the resolution of the data, the number of model fields that are stored, the data format and compression techniques that are used, and numerical precision. There is also a trade-off between data volume on one side and staff and computing time on the other: if more effort is spent to identify the optimal data format for individual model fields, and if more resources are spent on data postprocessing using costly compression techniques, a stronger reduction of the amount of data that needs to be stored can be achieved. The more effort is spent, the larger the savings.

It is easier to generate significant savings with compression methods that accept a loss of information, or that reduce the number of output fields or the resolution, than with compression methods that attempt to preserve the information content and keep resolution and the number of output fields constant. Because of the continuous growth of data volume at weather and climate centers, it is likely that more aggressive measures, such as reducing the number of output fields or applying stronger data compression, will be required to achieve a sustainable solution in the future.

If data volume can be reduced using lower numerical precision or data compression techniques, more data points can be stored at the same cost. However, if precision is too low, scientific diagnostics and downstream use of model output may be perturbed and information may be lost. Precision in data storage should be adjusted to the level of forecast error that is present in simulations. For example, it is hard to argue that temperature should be stored with five significant digits in a 10-day forecast whose predictions show uncertainties of several degrees. Forecast error increases with lead time (e.g., see Magnusson and Källén 2013), and it has been suggested that numerical precision should therefore decrease with forecast lead time (Düben et al. 2015).

Data compression can be either lossless or lossy. Lossless compression allows the original field to be reproduced exactly with no error, while information is lost if lossy compression techniques are used. Unfortunately, it is difficult to design efficient lossless compression techniques for the floating-point numbers that would be required for output of weather and climate models (Hübbe et al. 2013). Therefore, model output is typically stored with lossy compression techniques. While most models are running with double-precision floating-point numbers with 64 bits per variable, model output is typically stored either with single-precision floating-point numbers with 32 bits per variable (mostly in netCDF format) or in fixed-point representation with 16, 24, or 32 bits per variable (mostly in GRIB format). Data volume could be reduced even further using additional compression techniques, and several papers have investigated the use of lossy compression techniques for weather and climate data (Woodring et al. 2011; Bicer et al. 2013; Hübbe et al. 2013; Wang et al. 2015; Zender 2016; Silver and Zender 2017). However, weather and climate data are very diverse, and the variability of data can change quickly within a single model level or a single grid column. Specific humidity, for example, is confined between 0 and 1, but very small values can still be very important and values can change by orders of magnitude between the stratosphere and the troposphere within a single vertical level; precipitation can have significant outliers in extreme events; for fields such as geopotential height or temperature both global and local gradients carry important information; fields such as surface pressure can depend heavily on local features such as topography and show steep local gradients. This diversity makes it difficult to quantify a “loss” of information if lossy compression techniques are used, and weather and climate scientists are often skeptical whether an additional compression can preserve all relevant information. As a result, domain scientists need to be involved closely when investigating compression techniques; see, for example, the study of Baker et al. (2016), which invited domain scientists to apply their climate diagnostics to model output data of the Community Earth System Model after the fpzip algorithm (Lindstrom and Isenburg 2006) was used for lossy compression. There are approaches to identify the impact of compression techniques on the quality of diverse weather and climate data (e.g., Baker et al. 2017). One suggestion is to compare the effect of compression techniques to the natural variability of fields or the variability within an ensemble (Baker et al. 2014), or to focus on maximum rather than mean error norms (Woodring et al. 2011).

In this paper, we aim to improve the information content per bit for model output from ensemble predictions. The ensemble forecast of ECMWF produces more than 55 terabytes of output data during a single day (T. Quintino 2018, personal communication). Since ensemble spread is tuned to be a good approximation of forecast error, model fields are naturally very similar between different ensemble members at the beginning of a forecast, while ensemble spread, and hence the difference between model fields, increases with forecast lead time. In the following, we test whether it is possible to use similarities between ensemble members to increase the efficiency of model output storage. One approach could be to store ensemble data in the form of probability distributions for each data point instead of the standard approach of saving all ensemble members independently. If, for example, model data of ensemble simulations were stored as mean fields and a couple of higher moments, data volume could be reduced significantly (e.g., see Stephenson and Doblas-Reyes 2000). However, in this case, individual ensemble members could not be reproduced from the data. This seems to be a great loss of relevant information, in particular for the evaluation of meteorological features such as individual tropical cyclones or fronts. Furthermore, it may not be easy to represent PDFs of variables with a small number of parameters since probability distributions are not necessarily Gaussian for variables in Earth system models. For several quantities non-Gaussian features may be important to describe the system, and non-Gaussian features are likely to occur due to the nonlinearity of the Earth system (e.g., see Cooper et al. 2013). It would also be possible to design statistical models that resemble the properties of ensemble data and only require a small fraction of the data volume of the initial data (Castruccio and Genton 2016).

We suggest new methods for data storage of model output from ensemble simulations that are not based on standard techniques for lossy compression. The methods exploit similarities between ensemble members and link precision to predictability, thus maintaining high precision where predictability is high while using lower precision for results that show large uncertainty. Therefore, precision will, on average, also decrease with forecast lead time as forecast errors increase. Because of the link between precision and forecast uncertainty, we call the new methods “predictability-optimized compression.” It is still possible to reproduce meteorological features for individual ensemble members. We hope that the new storage methods will reduce rounding errors for a given number of bits per variable such that the overall number of bits that are stored for ensemble forecast data can be reduced. We test the approach for storage of model output of an operational ensemble forecast with the Integrated Forecast System (IFS) at ECMWF and investigate savings and errors for data that are stored in GRIB format. It is not the focus of this paper to show that the new methods for data storage can compete with existing methods for lossy data compression in terms of the potential reduction of data volume. Existing lossy compression techniques could still be used on top of the methods presented in this paper. However, since the methods of this paper, which are developed specifically for weather model output, are intuitive for domain scientists and not as cryptic as many lossy compression techniques, there is some hope that domain scientists may be less reluctant to use them. This user friendliness is a key point.

The obvious use of predictability-optimized compression is the compression of model output from ensemble weather predictions. The methods presented in this paper could potentially also be used to compress model output from climate simulations, but savings would be smaller because the ensemble spread would be larger. For climate data, it might also be easier to remove time-averaged mean fields instead of mean fields that are calculated across the ensemble (as suggested in this paper). Our new methods are therefore not tested for model output from climate simulations in this paper.

Section 2 provides an overview on the GRIB storage format. Section 3 introduces the new methods for storage. Section 4 evaluates the quality of the new storage methods and section 5 evaluates savings that can be achieved when the different storage methods are used. Section 6 concludes the paper.

2. Data storage using GRIB

At ECMWF, model output data are stored using the GRIB standard (Dey 2007). In GRIB, global meteorological data are stored per vertical level in the form of two-dimensional horizontal fields. Each data point is stored as a fixed-point number. Here, the maximal and minimal value of each field is diagnosed per model level and the interval between these values is divided into $2^n$ representable values, with $n$ being the number of bits that are stored per variable. This approach works very well for physical fields that show a fixed range, such as surface temperature, which will hardly exceed the range between 223 and 323 K. However, if variations become small in comparison to the difference between the maximal and the minimal field values, a large number of bits per variable $n$ is needed to maintain a sufficient level of precision, in particular if the local magnitude of field values and their variability changes significantly within a vertical level. ECMWF’s IFS is a spectral model and model fields are represented in spectral space, using spherical harmonics as basis functions, and/or in gridpoint space on a Gaussian grid that has a reduced number of grid points toward the poles compared to a standard longitude–latitude grid. Coefficients for spectral fields are also stored as fixed-point numbers. However, the magnitude of the numbers is scaled with a power law to acknowledge the dependence of the magnitude of spectral coefficients on the wavenumber. While the model is called “spectral,” nonlinear terms and the physical processes that act in the vertical grid column are still calculated in gridpoint space. Because the resolution in spectral and gridpoint space is not always equivalent, model fields are stored either in spectral or gridpoint space, depending on the nature of the respective fields.
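
To make the fixed-point packing concrete, the sketch below quantizes a field to $n$ bits between its minimum and maximum. It is a simplified stand-in for the idea behind GRIB simple packing, not ECMWF’s actual encoder; the function names and the synthetic field are illustrative.

```python
import numpy as np

def pack_fixed_point(field, n_bits):
    """Quantize a field to n_bits fixed-point values between its minimum and
    maximum (a simplified stand-in for GRIB simple packing, not the actual encoder)."""
    fmin, fmax = float(field.min()), float(field.max())
    step = (fmax - fmin) / (2**n_bits - 1) if fmax > fmin else 1.0
    codes = np.round((field - fmin) / step).astype(np.uint32)
    return codes, fmin, step

def unpack_fixed_point(codes, fmin, step):
    """Reconstruct the field from the stored integer codes."""
    return fmin + codes.astype(np.float64) * step

# The rounding error is bounded by half of the quantization step.
field = 101325.0 + 2000.0 * np.random.randn(100, 200)  # synthetic surface pressure (Pa)
codes, fmin, step = pack_fixed_point(field, n_bits=16)
restored = unpack_fixed_point(codes, fmin, step)
assert np.max(np.abs(restored - field)) <= 0.5 * step + 1e-6
```

Every bit that is removed doubles the quantization step and hence the maximal rounding error, which is why the field-wide range between minimum and maximum matters so much for the precision that is achieved.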

The GRIB format allows the user to define the number of bits per value n for each field that is stored. The size of files scales roughly linearly with the number of bits per value. At ECMWF all model fields are stored with the same number of bits at all lead times and most model fields are stored with 16 bits per value. However, a couple of fields that are known to be sensitive to low precision are stored with 24 bits per value. GRIB files of gridpoint fields can be compressed using second-order, PNG and JPEG-2000 compression. Second-order compression is the standard compression technique for GRIB files at ECMWF (in the rare case that compression is actually used). Second-order compression adjusts the range between minimal and maximal values for local areas to reduce cost (Dey 2007).

3. Three new compression methods for data storage

The three new compression methods that are suggested and evaluated in this paper are introduced in the following. There is no benefit in terms of data volume if precision is kept at the same level (using the same n) as the original data: the original data can then be reproduced up to very small errors caused by mapping and averaging procedures, but overall storage requirements will be higher with the new compression methods since additional fields need to be stored. However, the hope is that precision can be reduced for the stored fields with no significant increase in error if the new compression methods are used. Whether this is possible is tested in section 4.

a. Method 1: Remove coarse-resolution field

Global geophysical fields can show large-scale gradients over the globe; surface temperature, for example, may range from 223 K at the pole to 323 K in the tropics. This causes a large spread between the maximal and minimal value across a vertical level and therefore low precision. However, if a mean field is removed from the data, the perturbations from the mean field will show a smaller spread between the maximal and minimal field values. Steps 1–3 describe the encoding of a field with method 1 and step 4 describes the decoding:

  1. Map the original field onto a coarser grid and store the coarse-grid information at standard precision.

  2. Map the coarse-grid field back onto the original grid and subtract it from the original field to calculate perturbations from the mean field.

  3. Reduce precision for the perturbation field and store it.

  4. Map the coarse-grid field back onto the original grid and add the perturbation field to regenerate the original field.

The larger the difference between the fine and the coarse grid, the smaller the additional cost to store the mean field. However, the differences between the maximal and minimal field values in the perturbation field will increase, such that more precision will be necessary. Method 1 does not require ensemble information and can be performed for any model field independently. Precision will be independent of ensemble spread, and method 1 is therefore not a predictability-optimized compression. Method 1 is based on the assumption that large-scale gradients can be removed from the data to reduce the dynamic range of the numbers that are stored and thereby increase precision. The same assumption is used when applying second-order compression to GRIB data, which adjusts the range between minimal and maximal values for local areas. However, the use of a coarser grid in method 1 allows a more detailed representation of large-scale patterns.
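
A minimal sketch of the encoding and decoding steps of method 1, under simplifying assumptions: a regular grid stands in for ECMWF’s reduced Gaussian grids, block averaging and nearest-neighbor upsampling stand in for the actual mapping between fine and coarse grids, and quantize mimics storing a field at reduced fixed-point precision.

```python
import numpy as np

def quantize(field, n_bits):
    """Round a field to n_bits fixed-point precision between its min and max
    (stand-in for storing and re-reading a GRIB field at reduced precision)."""
    fmin, fmax = float(field.min()), float(field.max())
    step = (fmax - fmin) / (2**n_bits - 1) if fmax > fmin else 1.0
    return fmin + np.round((field - fmin) / step) * step

def encode_method1(field, factor=4, n_bits_pert=10):
    """Method 1: store a block-averaged coarse field at full precision and
    the fine-grid perturbation at reduced precision."""
    ny, nx = field.shape
    coarse = field.reshape(ny // factor, factor, nx // factor, factor).mean(axis=(1, 3))
    coarse_on_fine = np.repeat(np.repeat(coarse, factor, axis=0), factor, axis=1)
    perturbation = quantize(field - coarse_on_fine, n_bits_pert)
    return coarse, perturbation

def decode_method1(coarse, perturbation, factor=4):
    """Map the coarse field back to the fine grid and add the perturbation."""
    coarse_on_fine = np.repeat(np.repeat(coarse, factor, axis=0), factor, axis=1)
    return coarse_on_fine + perturbation

# Synthetic field with a large-scale gradient plus small local variability
lat = np.linspace(-90, 90, 256)[:, None]
field = 288.0 - 0.5 * np.abs(lat) + 2.0 * np.random.randn(256, 512)
coarse, pert = encode_method1(field)
restored = decode_method1(coarse, pert)
print("max error:", np.max(np.abs(restored - field)))
```

The perturbation field only spans the local variability around the coarse field, so fewer bits are needed for a given error tolerance than for the original field with its large-scale gradient.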

b. Method 2: Remove ensemble mean

Method 2 is similar to method 1. However, instead of a coarse-resolution mean field the ensemble mean will be used to calculate perturbations. Steps 1–3 describe the encoding of a field with method 2 and step 4 describes the decoding:

  1. Calculate and store the ensemble mean field at standard precision.

  2. Calculate the perturbations from the ensemble mean for all ensemble members.

  3. Reduce precision for the perturbation field and store it.

  4. Add the ensemble mean to the perturbations to regenerate the original field.

Method 2 links the precision of the data to predictability since rounding errors will be large when the ensemble spread is large. Since ensemble forecasts are tuned to provide a reliable estimate of forecast uncertainty, ensemble spread is large where uncertainty is large. The ensemble mean needs to be stored as an additional field. However, storing one more field for a 50-member ensemble forecast (as used at ECMWF) increases data volume by only 2% if precision remains the same for all fields. If precision were kept at the original level for the perturbation fields, the perturbation field of the last ensemble member could, in theory, be reproduced from the ensemble mean and all of the other perturbation fields, such that no additional field would be required.
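
A minimal sketch of method 2 with synthetic data; quantize again mimics fixed-point storage at reduced precision, and the array layout (members, y, x) is an assumption for illustration.

```python
import numpy as np

def quantize(field, n_bits):
    """Round values to n_bits fixed-point precision between their min and max."""
    fmin, fmax = float(field.min()), float(field.max())
    step = (fmax - fmin) / (2**n_bits - 1) if fmax > fmin else 1.0
    return fmin + np.round((field - fmin) / step) * step

def encode_method2(members, n_bits_pert=10):
    """Method 2: store the ensemble mean at full precision and each member's
    perturbation from the mean at reduced precision.
    members: array of shape (n_members, ny, nx)."""
    mean = members.mean(axis=0)
    perturbations = np.stack([quantize(m - mean, n_bits_pert) for m in members])
    return mean, perturbations

def decode_method2(mean, perturbations):
    """Add the ensemble mean back to each stored perturbation."""
    return mean[None, :, :] + perturbations

# Synthetic 50-member ensemble: a small spread (high predictability) leads to
# small rounding errors for the perturbations, as argued in the text.
members = 288.0 + 0.2 * np.random.randn(50, 64, 128)
mean, perts = encode_method2(members)
restored = decode_method2(mean, perts)
print("max error:", np.max(np.abs(restored - members)))
```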

c. Method 3: Remove ensemble spread

In method 2, the difference between the maximal and minimal value (and therefore precision) is limited by the maximal ensemble spread per vertical level. However, ensemble spread and predictability can vary significantly over the globe. Steps 1–3 describe the encoding of a field with method 3 and step 4 describes the decoding:

  1. Identify the minimal field value from all ensemble members at each grid point ($x_i^{\min}$ at grid point $i$) and the ensemble spread ($s_i$, defined as the difference between the maximal and minimal field value). Store both the minimal values and the ensemble spread at standard precision.

  2. Calculate the difference between each ensemble member and the minimal field value and normalize each value with the ensemble spread at each grid point. A field value $x_i$ is converted into $\tilde{x}_i = (x_i - x_i^{\min})/s_i$.

  3. Reduce precision for the perturbation field ($\tilde{x}$) and store it.

  4. Multiply $\tilde{x}_i$ by the ensemble spread and add the minimal field value at each grid point to regenerate the original field.

In method 3, the normalized values at each grid point span the full range between 0.0 and 1.0: at each grid point there is one ensemble member that takes the minimal and one that takes the maximal field value, such that local precision is maximized. In principle, the ensemble mean or the maximal field could also be used instead of the field of minimal ensemble values for this method.
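
A minimal sketch of method 3 with synthetic data; the array layout and the fixed-point step for the normalized values are assumptions for illustration.

```python
import numpy as np

def encode_method3(members, n_bits_pert=10):
    """Method 3: store the gridpoint-wise ensemble minimum and spread at full
    precision and the normalized members at reduced precision.
    members: array of shape (n_members, ny, nx)."""
    vmin = members.min(axis=0)
    spread = members.max(axis=0) - vmin
    safe_spread = np.where(spread > 0.0, spread, 1.0)       # avoid division by zero
    normalized = (members - vmin[None]) / safe_spread[None]  # values in [0, 1]
    step = 1.0 / (2**n_bits_pert - 1)
    stored = np.round(normalized / step) * step              # fixed-point in [0, 1]
    return vmin, spread, stored

def decode_method3(vmin, spread, stored):
    """Rescale by the local spread and add the local minimum back."""
    return vmin[None] + stored * spread[None]

# Synthetic ensemble whose spread varies strongly across the domain: the local
# rounding error scales with the local spread, i.e., with local uncertainty.
base = np.random.randn(64, 128)
spread_map = np.abs(np.random.randn(64, 128)) + 0.01
members = base[None] + spread_map[None] * np.random.rand(30, 64, 128)
vmin, spread, stored = encode_method3(members)
restored = decode_method3(vmin, spread, stored)
rel = np.abs(restored - members) / np.maximum(spread, 1e-12)
print("max error relative to local spread:", rel.max())
```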

Errors that are caused by the growth of perturbations in a chaotic system typically show an exponential increase in time. Errors that are caused by errors or approximations within the forecast model, on the other hand, show a linear growth in time. Both kinds of errors are present within ensemble weather predictions (Magnusson and Källén 2013). The ensemble spread will, to a good approximation, follow the general behavior of the error that is present within model simulations. The precision that is available for methods 2 and 3 will decrease linearly with increasing ensemble spread. It is, however, not possible to provide a theoretical limit for the precision that should be used for all output fields of weather models because model error depends on the weather regime that is present and therefore varies between different forecasts. Furthermore, different model fields also show different error growth regarding growth rate and shape. The growth rate can even differ for the same field at different locations across the globe.
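
The two growth regimes can be summarized schematically. This is only an idealized sketch with illustrative parameters $a$, $b$, and $\lambda$, not a fitted model of IFS error growth:

```latex
% Idealized error growth: exponential growth of initial-condition errors plus
% linear growth of model error; the ensemble spread s(t) approximately follows E(t),
% and the quantization step of methods 2 and 3 scales with the spread.
E(t) \;\approx\; a\,\bigl(e^{\lambda t} - 1\bigr) \;+\; b\,t ,
\qquad
\Delta x_{\mathrm{round}}(t) \;\propto\; \frac{s(t)}{2^{n}} ,
```

where $\Delta x_{\mathrm{round}}$ is the quantization step of the perturbation fields and $n$ is the number of bits per value.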

4. Tests with weather forecast data

In this section we present a selection of results for different model fields. We perform tests for operational ensemble weather forecast data from a forecast that was issued at midday on 1 September 2017. The forecasts were calculated at 18-km horizontal resolution on an octahedral reduced Gaussian grid with 640 grid points between the equator and the pole (O640). For spectral data, spherical harmonics are represented up to wavenumber 639 (T639). The data were evaluated at the resolution of the forecast simulations.

For all examples, the mean field at coarse resolution, the ensemble mean, the minimal field values, and the ensemble spread that are used for methods 1–3 are stored at the standard precision of 16 bits per value with no reduction in precision. However, the perturbation fields that are stored for each method, which represent the largest part of the storage cost, are stored at reduced precision. We present results that use a reduced Gaussian grid with 160 grid points between pole and equator (N160) and an octahedral reduced Gaussian grid with 320 grid points between pole and equator (O320) as coarse grids for method 1. The octahedral grid has a smaller concentration of grid points toward the poles in comparison to the standard reduced Gaussian grid. However, it was found that results with N160 were slightly better than results with an O160 grid, such that the slightly higher cost of storing the coarse grid was justified. The same was not true for an N320 grid in comparison to an O320 grid.

a. Mean errors for standard model fields

We test the three methods for four different types of fields that are often used in meteorology. All fields are plotted in Fig. 1 for the initial date of the forecast. Figure 2 shows the perturbation fields that are stored for the three methods for the example of surface pressure. It is visible that the large-scale topographic information is removed from all of the fields and that, for method 3, the data range at each grid point spans the full interval between zero and one.

Fig. 1.

(from top left to bottom right) Initial fields for geopotential height (m2 s−2) at 500 hPa, 2-m temperature (°C), surface pressure (hPa), and specific humidity (g kg−1) at 850 hPa for 1 Sep 2017. Geopotential height at 500 hPa and 2-m temperature show large-scale gradients. Surface pressure shows small-scale gradients caused by the strong dependence on topography. Specific humidity is very challenging since both large- and small-scale gradients are present. Geopotential height is stored in spectral space. The 2-m temperature, surface pressure, and specific humidity are stored in gridpoint space.

Fig. 2.

(from top left to bottom right) Plots of the perturbation data that are actually stored for a single ensemble member at reduced precision for surface pressure for methods 1–3 at day 5 of the forecast. For method 1, coarse grid data are stored on an N160 grid.

Figure 3 shows global plots of the error if precision is reduced for surface pressure for the different storage methods with 10 bits precision. For the standard data format and storage method 1, no evolution with forecast lead time is visible, as expected. The magnitude of errors appears to be reduced for method 1 in comparison to the default storage at 10 bits precision. For storage methods 2 and 3, error levels increase in time, as expected. After 1 day, the error with 10 bits for the new storage methods 2 and 3 is smaller than the error with 14 bits for the standard data format. After 10 days, the error with 10 bits is still much smaller with methods 2 and 3 than with the standard data format at 10 bits precision. For method 1 it was found that a very strong reduction in precision to six bits for the perturbation field can cause a bias over the ocean when the fields are recombined. Because perturbations in surface pressure over the ocean are very small, almost all ocean points will be rounded to the closest number that is represented when precision is reduced. This number is not necessarily zero, such that a bias can be generated. However, the same mechanism will also generate a bias if the default storage method is used at a very low level of precision, and the bias could be avoided if either the maximal and minimal field values or the general storage data format were adjusted to make sure that zero can be represented exactly.

Fig. 3.

Plots of the error for surface pressure (Pa) data of one ensemble member that is stored at reduced precision in comparison to the standard data format with 16 bits per value when the different storage methods are used. Please note the change of the color bars. (from top to bottom) Reduced precision to 14 and 10 bits when the standard data format is used at day 1 of the forecast, method 1 using an N160 grid as coarse grid, method 2, and method 3. Methods 1, 2, and 3 store the perturbation fields with only 10 bits per value and results are shown for (left) day 1 and (right) day 10. Fields that were stored using methods 1–3 were postprocessed into the original data format at 16 bits per value to calculate the error.

Figure 4 shows the global L1 norm of the absolute error for reduced-precision data calculated against the default data at standard precision (16 bits). As expected, the error of the default method and of method 1 does not change significantly over time, while the error increases with time for methods 2 and 3. Each of the three methods manages to reduce the error in comparison to the default storage for 2-m temperature and surface pressure data at the same precision level of 10 bits. However, it is clearly visible that potential savings will be different for different fields, and neither method 2 nor method 1 with an N160 coarse grid shows any benefit for specific humidity at 850 hPa compared to the default encoding with the same number of bits. Method 1 will be most successful if most features of a given field can be resolved on the coarse grid. Method 2 will be most successful if predictability is high. Specific humidity shows many small-scale gradients (see Fig. 1) that cannot be resolved on the N160 grid, and predictability is low. Results show that methods 2 and 3 can reduce errors for both spectral and gridpoint fields when the precision of the perturbation field is reduced to the same level of precision as in standard data storage. If forecast errors were compared at different precision levels, the relative change in forecast error between the different storage techniques would stay approximately the same since rounding errors change by a factor of 2 per additional or removed bit. Therefore, results should be considered as relative savings (e.g., two or four bits more or less).
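
For reference, a global L1 error norm of this kind can be computed as an area-weighted mean absolute difference. The sketch below assumes a regular latitude–longitude grid with cosine-latitude weights; the exact weighting used for the reduced Gaussian grid in the operational evaluation may differ.

```python
import numpy as np

def global_l1_error(field, reference, lat):
    """Area-weighted global mean absolute error on a regular lat-lon grid.
    Cosine-latitude weights are assumed here for illustration."""
    weights = np.cos(np.deg2rad(lat))[:, None] * np.ones_like(field)
    return np.sum(weights * np.abs(field - reference)) / np.sum(weights)

lat = np.linspace(-89.5, 89.5, 180)
reference = 288.0 + 10.0 * np.random.randn(180, 360)
compressed = reference + 0.05 * np.random.randn(180, 360)  # stand-in for decoded data
print("global L1 error:", global_l1_error(compressed, reference, lat))
```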

Fig. 4.

(from top left to bottom right) Global L1 norm of absolute errors for geopotential height (m2 s−2) at 500 hPa, 2-m temperature (°C), surface pressure (hPa), and specific humidity (g kg−1) at 850 hPa in comparison to the default data that are stored at 16 bits precision. There are three lines that indicate the error for a precision reduction to 14, 12, and 10 bits for the default storage. Precision was reduced to 10 bits for the perturbation fields of methods 1–3. We present two different grids to store the coarse grid information for method 1 (N160 and O320). Results for method 1 are not plotted for geopotential height since it cannot be used in spectral space.

b. Errors compared to analysis for standard model fields

Figure 5 shows the global L1 norm of the absolute error for the different storage methods when calculated against the analysis at standard precision and with a stronger reduction in precision for the perturbation field to 6 bits (since a reduction to 10 bits would not cause a visible difference between the lines). Here, the analysis serves as the best guess of the state of the atmosphere at any given time to calculate forecast errors. Methods 1–3 do not increase the calculated error significantly, except for method 1 for surface pressure. As expected, the error for the standard data format and method 1 at reduced precision is most prominent at the beginning of the forecast, since differences between forecast and analysis are very small there. The same is not visible for methods 2 and 3, which are predictability-optimized compression methods and show a very low error at the beginning of the forecast.

Fig. 5.

As in Fig. 4, but with the error calculated against model analysis and with precision for the perturbation fields of methods 1–3 reduced to only 6 bits. Crosses are added to the default data with 16 bits to improve readability.

The different fields have different properties regarding error growth. Specific humidity is difficult to predict and shows large errors. The increase of error caused by the different compression techniques is therefore small. Methods 1–3 work well for geopotential height and 2-m temperature. These fields are comparably smooth and well predictable. Method 1 has difficulties with surface pressure because of the small-scale gradients that are present.

c. Impact on probabilistic scores

We will now evaluate the impact of a reduction in precision on the continuous ranked probability score (CRPS) that is a standard diagnostic to assess model quality for ensemble simulations. We calculate CRPS for different fields in three regions of the southern extratropics, northern extratropics, and tropics for different lead times and use ECMWF’s standard evaluation tool. Here, the original data are mapped to a longitude–latitude grid of 1.5° resolution before the CRPS is calculated. To mimic the use of reduced precision data storage and to allow a fair comparison we reduce precision for the original data before the mapping is performed and increase precision to 16 bits for all fields that are stored at 1.5° resolution. We evaluate CRPS for simulations with the operational ensemble for dates between 12 January and 28 December 2016 evaluating 51 forecasts that were distributed equidistantly between these dates (one forecast every 7 days). We evaluate the impact on CRPS for geopotential height, temperature, and zonal and meridional velocity (Z, T, u, and υ) for the southern and northern extratropics and the tropics at 50, 100, 200, 500, 700, 850, and 1000 hPa. We only evaluate results for methods 2 and 3 in comparison to the standard data format. Since the fields that are evaluated are stored in spectral space in the original data format, method 1 cannot be used. We show a reduction of precision to only 6 bits since a smaller increase to, for example 10 bits, would cause a difference compared to the original data that would be too small in comparison to the error bars to be meaningful.
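
The CRPS of an ensemble forecast for a scalar quantity can be estimated with the well-known kernel (energy) form of the score. The sketch below uses that generic form and synthetic numbers; it is not ECMWF’s evaluation tool, and the quantization step chosen for the 6-bit example is illustrative.

```python
import numpy as np

def crps_ensemble(members, obs):
    """CRPS of an ensemble forecast for a scalar observation, using the kernel
    (energy) form: mean|x_i - y| - 0.5 * mean|x_i - x_j|."""
    members = np.asarray(members, dtype=float)
    term1 = np.mean(np.abs(members - obs))
    term2 = 0.5 * np.mean(np.abs(members[:, None] - members[None, :]))
    return term1 - term2

# Effect of quantizing the members before scoring (mimicking reduced-precision storage)
rng = np.random.default_rng(0)
members = 280.0 + rng.normal(0.0, 1.5, size=51)
obs = 281.0
step = 3.0 / 2**6                          # illustrative quantization step for 6 bits
quantized = np.round(members / step) * step
print(crps_ensemble(members, obs), crps_ensemble(quantized, obs))
```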

While the magnitude of improvements from the suggested storage methods is different between different quantities and locations, the error for methods 2 and 3 appears to be smaller when compared to the standard method. However, errors are of the same magnitude as the error bars for some of the quantities, which makes it difficult to compare results. Since we cannot plot all of the scores that were calculated, Fig. 6 shows only example results for Z500, T850, and u200.

Fig. 6.

Error in CRPS caused by a reduction in precision for the standard data format as well as for methods 2 and 3 plotted against forecast lead time in days. Precision is reduced to 6 bits for the standard data format and for the perturbation fields when using methods 2 or 3. Plots show the error for (top) geopotential height at 500 hPa, (middle) temperature at 850 hPa, and (bottom) zonal velocity at 200 hPa in (left) the northern extratropics, (middle) the tropics, and (right) the southern extratropics. Please note that the absolute values for CRPS are of O(10) for z500, O(0.1) for t850, and O(1) for u200 such that errors caused by the precision reduction are reasonably small.

d. Fields with strong changes in variability

There is an additional benefit of storage method 3 that was not visible in the results presented so far. Because of the normalization of the data with the ensemble spread, local errors are scaled to the local variability of the field. This can be very beneficial for quantities that change significantly in magnitude and variability within a vertical layer, as is the case, for example, for specific humidity at 200 hPa. The 200-hPa level lies within the troposphere at the equator and within the stratosphere near the poles, such that the magnitude of specific humidity changes by orders of magnitude between the equator and the pole. Such a field is very difficult to store in GRIB format since precision will not be sufficient to represent local gradients between very small numbers near the pole unless the number of bits per variable n is high. However, if numbers are normalized by the ensemble spread, this disadvantage disappears. Figure 7 shows results for specific humidity at 200 hPa over Iceland. As expected, precision for method 3 can be reduced much further than for the default method due to the normalization by ensemble spread. The same benefit cannot be expected for methods 1 and 2 since the magnitude of local data points is not adjusted to the local variability.
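
A toy illustration of this effect with made-up numbers rather than real humidity data: fixed-point packing over a whole level loses the structure in the small values, while the per-gridpoint normalization of method 3 keeps the error proportional to the local variability.

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy 'level' whose values span four orders of magnitude (made-up numbers,
# loosely mimicking specific humidity from the tropics to the poles)
magnitudes = np.logspace(-6, -2, 1000)
members = magnitudes[None, :] * (1.0 + 0.05 * rng.standard_normal((50, 1000)))

# Fixed-point packing of one member over the whole level with 10 bits
field = members[0]
step = (field.max() - field.min()) / (2**10 - 1)
packed = field.min() + np.round((field - field.min()) / step) * step
print("median relative error, level-wide packing:",
      np.median(np.abs(packed - field) / field))

# Method 3: normalize by the local ensemble spread before quantizing
vmin = members.min(axis=0)
spread = members.max(axis=0) - vmin
stored = np.round((field - vmin) / spread * (2**10 - 1)) / (2**10 - 1)
restored = vmin + stored * spread
print("median relative error, method 3:",
      np.median(np.abs(restored - field) / field))
```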

Fig. 7.

(from top left to bottom right) Specific humidity in (g kg−1) at 200 hPa over Iceland for the first ensemble member at day 5 of the forecast for default data at 16 bits precision, data for which precision has been reduced to 14 and 12 bits, and data that are using method 3 with precision for the perturbation field reduced to 10 bits.

5. Savings

Tables 1 and 2 present the data volume that needs to be stored for the different storage methods. As expected, methods 2 and 3 increase the storage volume only slightly in comparison to the default method with 10 bits per value. Method 1 generates more data than methods 2 and 3, and the data volume depends on the resolution of the coarse grid. The use of methods 1–3 will, unfortunately, reduce the savings that are possible when using second-order compression. This is to be expected since the perturbation fields are not as smooth as the original fields, which reduces the benefit of standard compression techniques. However, savings with second-order compression depend on the properties of the fields and will differ between fields, both for the original data format and when methods 1–3 have been applied. For surface pressure the compressed data volume for method 2 is actually smaller than the compressed version of the original data format at 10 bits precision.

Table 1.

Storage volume in gigabytes for gridpoint data for the different model fields and storage methods. Data volume will be the same for all gridpoint fields if no compression is used. If the standard second-order GRIB compression is applied to the data (Dey 2007), the size of the files will depend on the properties of the fields that are stored. For methods 1–3 the perturbation fields are stored with 10 bits per value. It should be noted that standard model data are not compressed in standard data processing and storage at ECMWF.

Table 2.

Storage volume in megabytes for spectral fields calculated from geopotential height data at 500 hPa. Data are stored using a special packing technique for spectral fields (see section 2). For methods 2 and 3 the perturbation fields are stored with 10 bits per value.

It has been suggested to reduce numerical precision with forecast lead time as errors grow (Düben et al. 2015). If, for example, the precision of the data were reduced linearly over time from 16 bits per variable at the beginning of a forecast to 10 bits at the end, this would save almost 20% of the storage requirements (see the short calculation below). However, this approach would be a challenge for the file system since the size of data files would change over time. This method could, in principle, also be applied to data stored with method 1. However, a further reduction of precision with forecast lead time may have a rather severe impact on data quality for methods 2 and 3, since the increase in error during a forecast is already represented by the ensemble spread and the difference to the ensemble mean.
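
The figure of almost 20% follows from the average number of bits per value over the forecast, assuming equally spaced output times and a linear decrease from 16 to 10 bits:

```latex
% Average bits per value and resulting saving for a linear decrease from 16 to 10 bits:
\bar{n} = \frac{16 + 10}{2} = 13 ,
\qquad
\text{saving} = 1 - \frac{\bar{n}}{16} = \frac{3}{16} \approx 19\% .
```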

6. Conclusions

We suggest three new methods to store model output of ensemble simulations that allow a stronger reduction in numerical precision to reduce data volume. For each of the three methods, as long as precision is not reduced, the original fields can be reproduced at almost the same level of quality as the original data; small errors may be caused by field manipulations such as the mapping procedure between the fine and the coarse grid for method 1 or numerical rounding for the perturbation fields. The two storage methods (methods 2 and 3) that are most successful in reducing the error, when using the same level of precision for the perturbation fields as in the standard methods, exploit similarities between ensemble members to increase the information content per stored bit. These methods take into account forecast uncertainty, which is represented by the ensemble spread, to provide high precision for model variables that can be predicted with low uncertainty (small ensemble spread) and low precision for model variables that show large uncertainties (large ensemble spread). We therefore call these methods “predictability-optimized compression.” In general, it is difficult to justify that model output should use the same level of precision for predictable and unpredictable weather situations. However, it is an advantage for data handling if all data files have the same size. Because the suggested storage methods combine precision with predictability, effective precision will decrease with growing errors but all files will still have the same size. The three methods that were suggested showed no detrimental impact on global conservation properties.

The new storage methods have three disadvantages: 1) Additional fields need to be stored. However, the additional cost is limited if many ensemble members are stored (+2%–4% for methods 2 and 3 if 50 ensemble members are stored at the same precision). 2) The benefit of other compression techniques, such as second-order compression that can be used for GRIB files, is reduced when the new storage methods are used. 3) Additional postprocessing is necessary before the user can reproduce the original fields, and individual ensemble members can only be evaluated if the ensemble mean field or the ensemble minimum and ensemble spread fields are available for methods 2 and 3, respectively. This will generate additional requirements for the storage and data-handling system. However, similar requirements already exist for some fields; for example, to map fields from model to pressure levels the logarithm of surface pressure needs to be available. The GRIB data format offers enough flexibility to store all ensemble members plus the mean and spread fields within a single file, even if the fields are stored at different resolutions. There is also a risk that method 1 can generate values that are slightly negative for variables that are always positive. The same problem will not appear for methods 2 and 3. Whether this is a problem for method 1 will depend on the mapping procedure that is used between coarse and fine grids. If, for example, the coarse grid were used to store not averaged or interpolated values of the original field but the minimal value of the field within a certain area, all values of the recombined field would remain positive.

We study the usefulness of the three storage methods on model output data of the operational ensemble weather forecast at ECMWF. Results were therefore evaluated for model output data stored in GRIB format, which is the standard data format at ECMWF. However, similar results should be possible for other storage formats.

Tests show that the three methods can indeed allow a stronger reduction in precision to reduce data volume. However, the degree to which precision can be reduced depends on the distribution and the predictability of the fields. It is, therefore, difficult to make general statements about the amount of data volume that could be saved for all output fields of ensemble simulations. The results clearly indicate that methods 2 and 3 show very low errors at the beginning of forecasts that grow over time, as expected. Given the results of this paper, method 3 appears to be the most promising candidate to generate savings in ensemble data storage while keeping the information content as high as possible.

Acknowledgments

Peter Bauer and Peter D. Düben gratefully acknowledge funding from the ESIWACE project. The ESiWACE project has received funding from the European Union’s Horizon 2020 research and innovation program under Grant 675191. Peter D. Düben gratefully acknowledges funding from the Royal Society. Many thanks to two anonymous reviewers, Michail Diamantakis, Pedro Maciel, Enrico Fucile, Nils Wedi, and Charlie Zender for helpful discussions and comments. No primary data have been used in this paper. Model output of ensemble simulations at ECMWF has been used to evaluate the new methods for data storage that are introduced in this paper. The data are available from the authors upon request.

REFERENCES

  • Baker, A. H., and Coauthors, 2014: A methodology for evaluating the impact of data compression on climate simulation data. Proc. 23rd Int. Symp. on High-Performance Parallel and Distributed Computing (HPDC’14), Vancouver, BC, Canada, ACM, 203–214, https://doi.org/10.1145/2600212.2600217.

  • Baker, A. H., and Coauthors, 2016: Evaluating lossy data compression on climate simulation data within a large ensemble. Geosci. Model Dev., 9, 4381–4403, https://doi.org/10.5194/gmd-9-4381-2016.

  • Baker, A. H., H. Xu, D. M. Hammerling, S. Li, and J. P. Clyne, 2017: Toward a multi-method approach: Lossy data compression for climate simulation data. High Performance Computing, J. M. Kunkel et al., Eds., Springer International Publishing, 30–42.

  • Bicer, T., J. Yin, D. Chiu, G. Agrawal, and K. Schuchardt, 2013: Integrating online compression to accelerate large-scale data analytics applications. 2013 IEEE 27th Int. Symp. on Parallel and Distributed Processing, Boston, MA, IEEE, 1205–1216, https://doi.org/10.1109/IPDPS.2013.81.

  • Castruccio, S., and M. G. Genton, 2016: Compressing an ensemble with statistical models: An algorithm for global 3D spatio-temporal temperature. Technometrics, 58, 319–328, https://doi.org/10.1080/00401706.2015.1027068.

  • Cooper, F. C., J. G. Esler, and P. H. Haynes, 2013: Estimation of the local response to a forcing in a high dimensional system using the fluctuation-dissipation theorem. Nonlinear Processes Geophys., 20, 239–248, https://doi.org/10.5194/npg-20-239-2013.

  • Dey, C., 2007: Guide to the WMO table driven code form used for the representation and exchange of regularly spaced data in binary form. WMO, accessed 19 October 2017, 103 pp., https://www.wmo.int/pages/prog/www/WMOCodes/Guides/GRIB/GRIB2_062006.pdf.

  • Düben, P. D., F. P. Russell, X. Niu, W. Luk, and T. N. Palmer, 2015: On the use of programmable hardware and reduced numerical precision in earth-system modeling. J. Adv. Model. Earth Syst., 7, 1393–1408, https://doi.org/10.1002/2015MS000494.

  • ECMWF, 2017: Data handling system. ECMWF, accessed 23 October 2017, https://www.ecmwf.int/en/computing/our-facilities/data-handling-system.

  • Hübbe, N., A. Wegener, J. M. Kunkel, Y. Ling, and T. Ludwig, 2013: Evaluating lossy compression on climate data. Supercomputing, J. M. Kunkel, T. Ludwig, and H. W. Meuer, Eds., Springer, 343–356.

  • Kuhn, M., J. Kunkel, and T. Ludwig, 2016: Data compression for climate data. Supercomput. Front. Innov.: Int. J., 3, 75–94, https://doi.org/10.14529/jsfi160105.

  • Lindstrom, P., and M. Isenburg, 2006: Fast and efficient compression of floating-point data. IEEE Trans. Vis. Comput. Graph., 12, 1245–1250, https://doi.org/10.1109/TVCG.2006.143.

  • Magnusson, L., and E. Källén, 2013: Factors influencing skill improvements in the ECMWF forecasting system. Mon. Wea. Rev., 141, 3142–3153, https://doi.org/10.1175/MWR-D-12-00318.1.

  • Silver, J. D., and C. S. Zender, 2017: The compression–error trade-off for large gridded data sets. Geosci. Model Dev., 10, 413–423, https://doi.org/10.5194/gmd-10-413-2017.

  • Stephenson, D. B., and F. J. Doblas-Reyes, 2000: Statistical methods for interpreting Monte Carlo ensemble forecasts. Tellus, 52A, 300–322, https://doi.org/10.3402/tellusa.v52i3.12267.

  • Wang, N., J.-W. Bao, J.-L. Lee, F. Moeng, and C. Matsumoto, 2015: Wavelet compression technique for high-resolution global model data on an icosahedral grid. J. Atmos. Oceanic Technol., 32, 1650–1667, https://doi.org/10.1175/JTECH-D-14-00217.1.

  • Woodring, J., S. Mniszewski, C. Brislawn, D. DeMarle, and J. Ahrens, 2011: Revisiting wavelet compression for large-scale climate data using JPEG 2000 and ensuring data precision. 2011 IEEE Symp. on Large Data Analysis and Visualization, Providence, RI, IEEE, 31–38, https://doi.org/10.1109/LDAV.2011.6092314.

  • Zender, C. S., 2016: Bit grooming: Statistically accurate precision-preserving quantization with compression, evaluated in the netCDF operators (NCO, v4.4.8+). Geosci. Model Dev., 9, 3199–3211, https://doi.org/10.5194/gmd-9-3199-2016.