1. Limitations of performance metrics
One of the primary objectives of measurement or model evaluation is to quantify the uncertainty in the data. This is because the uncertainty directly determines the information content of the data (e.g., Jaynes 2003), and dictates our rational use of the information, be it for data assimilation, hypothesis testing, or decision-making. Further, by appropriately quantifying the uncertainty, one gains insight into the error characteristics of the measurement or model, especially via efforts to separate the systematic error and random error (e.g., Barnston and Thomas 1983; Ebert and McBride 2000).
Currently, the common practice in measurement or model verification is to compute a standard set of performance metrics. These metrics are statistical measures that summarize the similarity and difference between two datasets, based on direct comparison of datum pairs at their corresponding spatial/temporal locations. The most commonly used are bias, mean square error (MSE), and the correlation coefficient (CC) (e.g., Fisher 1958; Wilks 2011), but many variants or derivatives, such as (un)conditional bias (e.g., Stewart 1990), MSE-based measures (Murphy and Winkler 1987), unbiased root-mean-square error (ubRMSE), the anomaly correlation coefficient, the coefficient of determination (CoD), and skill scores (SS; e.g., Murphy and Epstein 1989), also fall into this category. Table 1 lists some of these metrics and their definitions.
Table 1. Examples of conventional performance metrics. The observations and forecasts are denoted as x and y, respectively.
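For concreteness, the following is a minimal Python sketch of how such pairwise metrics are typically computed from matched observation–prediction pairs. The specific definitions chosen here for CoD (the square of CC) and SS (the MSE-based skill score against the observed climatology) are common conventions and may differ in detail from those in Table 1.

```python
import numpy as np

def conventional_metrics(x, y):
    """Pairwise performance metrics for predictions y against observations x."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    bias = np.mean(y - x)
    mse = np.mean((y - x) ** 2)
    cc = np.corrcoef(x, y)[0, 1]
    ubrmse = np.sqrt(np.mean(((y - y.mean()) - (x - x.mean())) ** 2))
    cod = cc ** 2                                    # CoD taken here as the square of CC
    ss = 1.0 - mse / np.mean((x - x.mean()) ** 2)    # MSE-based skill vs. observed climatology
    return {"bias": bias, "MSE": mse, "CC": cc, "ubRMSE": ubrmse, "CoD": cod, "SS": ss}
```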
Among them, the “big three”—bias, MSE, and CC—are the most widely used in diverse disciplines, exemplified by the popular “Taylor diagram” (Taylor 2001).
These metrics do, however, have several limitations:
Interdependence. Most of these conventional performance metrics are not independent; they relate to one another in well-documented but nontrivial ways. For example, the MSE can be decomposed in several ways that link it to other metrics, such as the bias and the correlation coefficient (e.g., Murphy 1988; Barnston 1992; Taylor 2001; Gupta et al. 2009; Entekhabi et al. 2010); a numerical check of one such decomposition is sketched after this list. These relations indicate both redundancy among the metrics and their only indirect connection to independent error characteristics, which leads to ambiguity in the interpretation and intercomparison of these metrics.
Underdetermination. It is easy to verify that these metrics do not describe unique error characteristics, even when many of them are used collectively. In fact, many different combinations of error characteristics can produce the same values of these metrics.
This is illustrated in Fig. 1. A monthly time series of land surface temperature anomalies, extracted from satellite-based observations (Wan et al. 2004) at a location in the United States (35°N, 95°W), is used as the reference (black curves) for validating two separate hypothetical sets of predictions (Figs. 1a and 1b, blue curves). Their respective scatterplots are also shown (Figs. 1c and 1d), with the values of five major conventional metrics listed (bias, MSE, CC, CoD, and SS). Whether judged from the time series plots or the scatterplots, the two sets of predictions exhibit very different error characteristics; yet all the metrics except bias give nearly identical values (Figs. 1c and 1d). In fact, there are infinitely many ways to construct measurements that produce identical values for many of the metrics (a numerical sketch mirroring Fig. 1 follows its caption below). Given only a set of metric values, therefore, one faces a fundamental difficulty in inferring and communicating the error characteristics of the predictions.
Incompleteness. There are no well-accepted guidelines on how many of these metrics are sufficient. Many inexperienced users follow a "the more the better" philosophy, and it is not rare to see studies that employ as many as 7–10 metrics, many of them variants of one another. End users and stakeholders, however, generally prefer as concise a summary as possible.
Linear error assumption. The metrics discussed here are all based on the least squares measure. The underlying assumption is that the measurement model is linear, with additive, Gaussian random error. When this assumption is not satisfied, one can still compute these metrics, but the results are misleading or meaningless (Barnston and Thomas 1983; Habib et al. 2001). In practice, these metrics are frequently misused by applying them without regard to the underlying error model.
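As the numerical check promised above, the sketch below verifies one standard decomposition, MSE = (mean error)² + s_y² + s_x² − 2·s_x·s_y·CC, on synthetic data. The decomposition is exact when population-style (ddof = 0) sample moments are used; the data here are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(10.0, 3.0, 5000)                      # synthetic "observations"
y = 0.8 * x + 2.0 + rng.normal(0.0, 1.5, x.size)     # synthetic "predictions"

mse = np.mean((y - x) ** 2)
cc = np.corrcoef(x, y)[0, 1]
# One classic decomposition: MSE = (mean error)^2 + s_y^2 + s_x^2 - 2 s_x s_y CC
decomp = (y.mean() - x.mean()) ** 2 + np.var(y) + np.var(x) - 2.0 * np.std(x) * np.std(y) * cc
print(mse, decomp)   # the two agree to floating-point precision
```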
Fig. 1. Dramatically different measurement errors can produce identical values of the conventional metrics. (a),(b) Two separate measurements (blue curves) of the same reference dataset (black curves) are constructed, and (c),(d) their associated scatterplots are shown (blue line: linear fit of measurements vs truth; black line: linear fit of truth vs truth). The corresponding values of the conventional metrics (bias, MSE, CC, CoD, and SS) are given in each scatterplot. The two erroneous measurements, though distinctly different, yield nearly identical values (subject to sampling error from the limited sample size) in all the metrics but bias. In contrast, the three parameters (λ, δ, and σ) of the error model in (1) clearly depict their differences.
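The sketch below mirrors the construction in Fig. 1, using the two parameter triples quoted in section 2, (λ, δ, σ) = (0.6, 6, 1.1) and (1.7, 5.1, 3.3), but with a synthetic Gaussian series standing in for the MODIS LST anomaly reference; the resulting metric values are therefore only indicative of those in the figure.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0.0, 3.0, 240)    # synthetic stand-in for the monthly LST anomaly reference

# Two hypothetical measurement sets built with the parameter values quoted in section 2
y1 = 0.6 * x + 6.0 + rng.normal(0.0, 1.1, x.size)    # (lambda, delta, sigma) = (0.6, 6, 1.1)
y2 = 1.7 * x + 5.1 + rng.normal(0.0, 3.3, x.size)    # (lambda, delta, sigma) = (1.7, 5.1, 3.3)

for name, y in [("set 1", y1), ("set 2", y2)]:
    bias = np.mean(y - x)
    mse = np.mean((y - x) ** 2)
    cc = np.corrcoef(x, y)[0, 1]
    ss = 1.0 - mse / np.var(x)                       # MSE-based skill vs. observed climatology
    print(f"{name}: bias={bias:5.2f} MSE={mse:6.2f} CC={cc:4.2f} CoD={cc**2:4.2f} SS={ss:6.2f}")
# Despite the very different error structures, the non-bias metrics come out similar for the
# two sets while the bias differs; exact values depend on the reference series and sample size.
```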
Many enhanced techniques and metrics have been proposed to overcome these limitations. These include various decomposition schemes of the conventional metrics (e.g., Murphy 1988; Stewart 1990; Barnston 1992; Taylor 2001; Gupta et al. 2009; Entekhabi et al. 2010), object-oriented metrics (e.g., Ebert and McBride 2000), and nonparametric metrics that do not rely on underlying assumptions about the nature of the systematic error (e.g., Weijs et al. 2010; Gong et al. 2013; Vrugt and Sadegh 2013). These are all generally valuable efforts; however, we doubt that these kinds of relatively complex statistical analyses will ever fully replace the use of linear performance metrics. As such, we propose an alternative philosophy when applying these simple metrics.
2. Error characterization with an error model
To characterize the errors, we adopt a simple linear, additive error model,

yi = λxi + δ + εi, εi ~ N(0, σ²), (1)

where xi and yi denote the observations and predictions, respectively. The first two terms depict a deterministic relationship between x and y, and they jointly quantify the systematic error with λ and δ. The random error εi, a stochastic term, is quantified solely by σ, and is "orthogonal" (Papoulis 1984) to the systematic error. The separation of the errors into systematic and random parts is critical for uncertainty quantification, because the random error is an amalgamation of all the errors unexplainable with the data and knowledge at hand, and directly reflects our current state of ignorance.
It is easy to see that (1) characterizes the conditional distribution p(y | x) as

p(yi | xi) = N(λxi + δ, σ²), (2)

that is, a Gaussian distribution centered on the deterministic (systematic) part of the model, with its spread set by the random error.
Once the error model (1) is parameterized, all the relevant error information is captured by the three parameters. This can be demonstrated with Fig. 1 as well. Whereas four of the five conventional metrics fail to differentiate the two sets of measurements in Figs. 1a and 1b (see their respective values in Figs. 1c and 1d), the three model parameters completely capture their different error characteristics, including the differences in scale error (λ = 0.6 vs 1.7), displacement error (δ = 6 vs 5.1), and random error (σ = 1.1 vs 3.3). Moreover, their representation of the error characteristics is strikingly intuitive when one compares the numbers with the scatterplots (Figs. 1c and 1d). For example, the scale error λ, being less than 1 for the first set of measurements, indicates that those measurements compress the dynamic range of the data, a fact easily confirmed by both the time series (Fig. 1a) and the scatterplot (Fig. 1c); conversely, λ = 1.7 for the second set indicates an amplified dynamic range (Figs. 1b and 1d). The popular conventional metrics listed in Table 1, in contrast, clearly lack such explanatory power.
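A minimal sketch of the parameter estimation, assuming ordinary least squares is acceptable (under the Gaussian assumption in (1) it coincides with maximum likelihood); the function name is illustrative. Applied to the two measurement sets of Fig. 1, such a fit recovers values close to (0.6, 6, 1.1) and (1.7, 5.1, 3.3).

```python
import numpy as np

def fit_linear_error_model(x, y):
    """Estimate (lambda, delta, sigma) of y = lambda*x + delta + eps, eps ~ N(0, sigma^2)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    lam, delta = np.polyfit(x, y, 1)        # slope and intercept of y regressed on x
    resid = y - (lam * x + delta)
    sigma = np.std(resid, ddof=2)           # residual std. dev.; two dof consumed by the fit
    return lam, delta, sigma
```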
Table 2. Expressions of the conventional performance metrics in terms of the three model parameters.
As far as all the performance metrics illustrated in the Taylor diagram are concerned, the three parameters of the error model are sufficient to fully capture the error characteristics of model predictions. There is no additional error information to be gained from these performance metrics.
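Table 2 itself is not reproduced in this excerpt, but three representative relations that follow directly from (1) can be checked by Monte Carlo, as in the sketch below; the closed-form expressions are derived from the model, with μ_x and σ_x denoting the mean and standard deviation of the reference, and the parameter values are arbitrary.

```python
import numpy as np

# Monte Carlo check of relations implied by y = lam*x + delta + eps, eps ~ N(0, sigma^2)
rng = np.random.default_rng(2)
mu_x, s_x = 10.0, 3.0                      # mean and std. dev. of the reference
lam, delta, sigma = 1.3, -2.0, 1.5         # arbitrary illustrative parameters
x = rng.normal(mu_x, s_x, 200_000)
y = lam * x + delta + rng.normal(0.0, sigma, x.size)

bias_model = (lam - 1.0) * mu_x + delta
mse_model = bias_model ** 2 + (lam - 1.0) ** 2 * s_x ** 2 + sigma ** 2
cc_model = lam * s_x / np.sqrt(lam ** 2 * s_x ** 2 + sigma ** 2)

print(np.mean(y - x), bias_model)          # sample bias vs. model-implied value
print(np.mean((y - x) ** 2), mse_model)    # sample MSE vs. model-implied value
print(np.corrcoef(x, y)[0, 1], cc_model)   # sample CC vs. model-implied value
```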
In fact, the relationship between the conventional metrics and the linear, additive error model (1) has been noted in earlier works (e.g., Murphy 1995; Livezey et al. 1995). However, those studies invariably stopped short of explicitly defining and treating the random error with a parametric form. This is the root cause of the complexity in interpreting the conventional metrics, and of the vagueness in using them to quantify uncertainty. The various MSE decomposition schemes [(4a)–(4c)] help alleviate the problem, yet a decomposition based on the error-parameter representation (3) is more straightforward.
However, for such a nonlinear, non-Gaussian error model, or for any generic nonlinear error model, the conventional metrics are inappropriate; they are not meaningful representations of the error structure (Barnston and Thomas 1983), because their assumed separation of systematic and random error no longer applies. For example, Habib et al. (2001) demonstrated that the correlation coefficient produces misleading results in such cases. Therefore, the indiscriminate use of the conventional metrics with nonlinear error models such as (7) is imprudent and incorrect.
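As one illustration (not necessarily the model denoted (7), which is not reproduced in this excerpt), a multiplicative error model of the kind examined by Tian et al. (2013) becomes linear and additive in log space, where it can again be fit by least squares; the parameter values below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.gamma(2.0, 5.0, 2000)                                  # positive reference values
y = 1.2 * x ** 0.8 * np.exp(rng.normal(0.0, 0.5, x.size))      # multiplicative error

# In log space the model ln(y) = ln(A) + B*ln(x) + eps is linear and additive,
# so its parameters can be estimated by ordinary least squares:
B, lnA = np.polyfit(np.log(x), np.log(y), 1)
sigma_log = np.std(np.log(y) - (lnA + B * np.log(x)), ddof=2)
print(np.exp(lnA), B, sigma_log)                               # roughly (1.2, 0.8, 0.5)

# The raw-space correlation coefficient, by contrast, folds the scale, shape, and random
# components into a single number whose value also depends on the skewed distribution of x.
print(np.corrcoef(x, y)[0, 1])
```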
Error characterization is equivalent to characterizing the conditional distribution of model predictions given observations (the error model), whereas the conventional metrics are valid only for the linear, additive error model (1). We therefore argue that error analysis should be based on an appropriate error model, which defines and separates the systematic and random errors and whose parameters quantify them. It is up to the researcher to demonstrate that a particular error model fits well for this purpose; the choice of error model is, of course, dictated by the developer's empirical understanding of the error characteristics.
3. Summary
Based on these considerations, we promote here a philosophy for error analysis and uncertainty quantification that puts more emphasis on error modeling than on computing the conventional performance metrics. There are two primary motivations for this comment.
First, the most popular conventional performance metrics (bias, MSE, and the correlation coefficient) have several limitations related to interdependence, underdetermination, and incompleteness. These limitations make it difficult to infer the error characteristics from a given set of metric values (Fig. 1), and the metrics' roles in characterizing p(y | x) are vague. The traditional remedy is to introduce more metrics or decomposition schemes [e.g., (4a)–(4c)], leading to a proliferation of metrics. We demonstrate, however, that these metrics and decomposition schemes can all be derived from the linear error model, whose parameters can be interpreted natively in a way that is both coherent and free from these difficulties, without invoking a multitude of metrics.
Second, making the choice of error model the first step in model evaluation forces us to reconcile our specific knowledge of the situation, including research and application goals, with the way we choose to present the agreement between our predictive models and observational data.
It is worth emphasizing that the validity of both error modeling and performance metrics, and indeed of the verification enterprise as a whole [see (8)], hinges on the assumption of statistical stationarity. Thus, when we establish an error model or compute a set of metrics, we should select spatial and temporal domains within which the error characteristics of the instrument or model can be assumed stationary, and treat each domain (e.g., summer vs winter) with separate model fitting or metric computation, as sketched below.
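A minimal sketch of such stratification, assuming monthly data with an accompanying calendar-month index; the season definitions and function names are illustrative.

```python
import numpy as np

def fit_linear_error_model(x, y):
    lam, delta = np.polyfit(x, y, 1)
    sigma = np.std(y - (lam * x + delta), ddof=2)
    return lam, delta, sigma

def seasonal_error_models(months, x, y):
    """Fit separate linear error models over sub-domains (here JJA vs. DJF) within which
    the error characteristics are assumed stationary; the stratification is up to the analyst."""
    months, x, y = (np.asarray(a) for a in (months, x, y))
    seasons = {"summer (JJA)": [6, 7, 8], "winter (DJF)": [12, 1, 2]}
    return {name: fit_linear_error_model(x[np.isin(months, m)], y[np.isin(months, m)])
            for name, m in seasons.items()}
```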
It is important to be cognizant that the error model itself is a characterization of the conditional probability distribution p(y | x), which contains all of the information available for evaluating our predictive model (Murphy et al. 1989). Our job is to find a suitable analytic, parametric description of this conditional. Since the error information is distilled into the error model's parameters, it is not surprising that most of the conventional metrics can be expressed in terms of the model parameters (Table 2). Of course, a suitable parametric error model is not always easy to find, and in such cases nonparametric approaches, which are fundamentally based on information-theoretic concepts (Weijs et al. 2010; Nearing and Gupta 2015), may be employed.
Acknowledgments
This research was supported by the NASA Earth System Data Records Uncertainty Analysis Program (Martha E. Maiden) under solicitation NNH10ZDA001N-ESDRERR. Computing resources were provided by the NASA Center for Climate Simulation. We thank two anonymous reviewers for their helpful comments.
REFERENCES
Barnston, A. G., 1992: Correspondence among the correlation, RMSE, and Heidke forecast verification measures; refinement of the Heidke score. Wea. Forecasting, 7, 699–709, doi:10.1175/1520-0434(1992)007<0699:CATCRA>2.0.CO;2.
Barnston, A. G., and J. L. Thomas, 1983: Rainfall measurement accuracy in FACE: A comparison of gage and radar rainfalls. J. Climate Appl. Meteor., 22, 2038–2052, doi:10.1175/1520-0450(1983)022<2038:RMAIFA>2.0.CO;2.
Carroll, R. J., D. Ruppert, L. A. Stefanski, and C. M. Crainiceanu, 2006: Measurement Error in Nonlinear Models: A Modern Perspective. 2nd ed. CRC Press, 485 pp.
Ebert, E. E., and J. L. McBride, 2000: Verification of precipitation in weather systems: Determination of systematic errors. J. Hydrol., 239, 179–202, doi:10.1016/S0022-1694(00)00343-7.
Entekhabi, D., R. H. Reichle, R. D. Koster, and W. T. Crow, 2010: Performance metrics for soil moisture retrievals and application requirements. J. Hydrometeor., 11, 832–840, doi:10.1175/2010JHM1223.1.
Fisher, S. R. A., 1958: Statistical Methods for Research Workers. 13th ed. Hafner Publishing Co., 356 pp.
Gong, W., H. Gupta, D. Yang, K. Sricharan, and A. Hero, 2013: Estimating epistemic and aleatory uncertainties during hydrologic modeling: An information theoretic approach. Water Resour. Res., 49, 2253–2273, doi:10.1002/wrcr.20161.
Gupta, H. V., H. Kling, K. K. Yilmaz, and G. F. Martinez, 2009: Decomposition of the mean squared error and NSE performance criteria: Implications for improving hydrological modelling. J. Hydrol., 377, 80–91, doi:10.1016/j.jhydrol.2009.08.003.
Habib, E., W. F. Krajewski, and G. J. Ciach, 2001: Estimation of rainfall interstation correlation. J. Hydrometeor., 2, 621–629, doi:10.1175/1525-7541(2001)002<0621:EORIC>2.0.CO;2.
Jaynes, E. T., 2003: Probability Theory: The Logic of Science. Cambridge University Press, 762 pp.
Livezey, R. E., J. D. Hoopingarner, and J. Huang, 1995: Verification of official monthly mean 700-hPa height forecasts: An update. Wea. Forecasting, 10, 512–527, doi:10.1175/1520-0434(1995)010<0512:VOOMMH>2.0.CO;2.
Murphy, A. H., 1988: Skill scores based on the mean square error and their relationships to the correlation coefficient. Mon. Wea. Rev., 116, 2417–2424, doi:10.1175/1520-0493(1988)116<2417:SSBOTM>2.0.CO;2.
Murphy, A. H., 1995: The coefficients of correlation and determination as measures of performance in forecast verification. Wea. Forecasting, 10, 681–688, doi:10.1175/1520-0434(1995)010<0681:TCOCAD>2.0.CO;2.
Murphy, A. H., and R. L. Winkler, 1987: A general framework for forecast verification. Mon. Wea. Rev., 115, 1330–1338, doi:10.1175/1520-0493(1987)115<1330:AGFFFV>2.0.CO;2.
Murphy, A. H., and E. S. Epstein, 1989: Skill scores and correlation coefficients in model verification. Mon. Wea. Rev., 117, 572–582, doi:10.1175/1520-0493(1989)117<0572:SSACCI>2.0.CO;2.
Murphy, A. H., B. G. Brown, and Y.-S. Chen, 1989: Diagnostic verification of temperature forecasts. Wea. Forecasting, 4, 485–501, doi:10.1175/1520-0434(1989)004<0485:DVOTF>2.0.CO;2.
Nearing, G., and H. Gupta, 2015: The quantity and quality of information in hydrologic models. Water Resour. Res., 51, 524–538, doi:10.1002/2014WR015895.
Papoulis, A., 1984: Probability, Random Variables and Stochastic Processes. 2nd ed. McGraw-Hill Inc., 575 pp.
Stewart, T. R., 1990: A decomposition of the correlation coefficient and its use in analyzing forecasting skill. Wea. Forecasting, 5, 661–666, doi:10.1175/1520-0434(1990)005<0661:ADOTCC>2.0.CO;2.
Tang, L., Y. Tian, F. Yan, and E. Habib, 2015: An improved procedure for the validation of satellite-based precipitation estimates. Atmos. Res., 163, 61–73, doi:10.1016/j.atmosres.2014.12.016.
Taylor, K. E., 2001: Summarizing multiple aspects of model performance in a single diagram. J. Geophys. Res., 106, 7183–7192, doi:10.1029/2000JD900719.
Tian, Y., C. D. Peters-Lidard, and J. B. Eylander, 2010: Real-time bias reduction for satellite-based precipitation estimates. J. Hydrometeor., 11, 1275–1285, doi:10.1175/2010JHM1246.1.
Tian, Y., G. J. Huffman, R. F. Adler, L. Tang, M. Sapiano, V. Maggioni, and H. Wu, 2013: Modeling errors in daily precipitation measurements: Additive or multiplicative? Geophys. Res. Lett., 40, 2060–2065, doi:10.1002/grl.50320.
Vrugt, J., and M. Sadegh, 2013: Towards diagnostic model calibration and evaluation: Approximate Bayesian computation. Water Resour. Res., 49, 4335–4345, doi:10.1002/wrcr.20354.
Wan, Z., Y. Zhang, Q. Zhang, and Z.-L. Li, 2004: Quality assessment and validation of the MODIS global land surface temperature. Int. J. Remote Sens., 25, 261–274, doi:10.1080/0143116031000116417.
Weijs, S., G. Schoups, and N. Giesen, 2010: Why hydrologic predictions should be evaluated using information theory. Hydrol. Earth Syst. Sci., 14, 2545–2558, doi:10.5194/hess-14-2545-2010.
Wilks, D. S., 2011: Statistical Methods in the Atmospheric Sciences. Academic Press, 676 pp.
Willmott, C. J., 1981: On the validation of models. Phys. Geogr., 2, 184–194.