## 1. Limitations of performance metrics

One of the primary objectives of measurement or model evaluation is to quantify the uncertainty in the data. This is because the uncertainty directly determines the information content of the data (e.g., Jaynes 2003), and dictates our rational use of the information, be it for data assimilation, hypothesis testing, or decision-making. Further, by appropriately quantifying the uncertainty, one gains insight into the error characteristics of the measurement or model, especially via efforts to separate the systematic error and random error (e.g., Barnston and Thomas 1983; Ebert and McBride 2000).

Currently, the common practice in measurement or model verification is to compute a standard set of performance metrics: statistical measures that summarize the similarity and difference between two datasets, based on direct comparison of datum pairs at their corresponding spatial/temporal locations. The most commonly used are bias, mean square error (MSE), and the correlation coefficient (CC) (e.g., Fisher 1958; Wilks 2011), but many variants or derivatives, such as (un)conditional bias (e.g., Stewart 1990), MSE (Murphy and Winkler 1987), unbiased root-mean-square error (ubRMSE), the anomaly correlation coefficient, the coefficient of determination (CoD), and skill score (SS; e.g., Murphy and Epstein 1989), also fall into this category. Table 1 lists some of these metrics and their definitions.

Table 1. Examples of conventional performance metrics. The observations and forecasts are denoted as *x* and *y*, respectively.

Among them, the “big three”—bias, MSE, and CC—are the most widely used in diverse disciplines, exemplified by the popular “Taylor diagram” (Taylor 2001).
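For concreteness, the "big three" can be computed in a few lines (a minimal NumPy sketch; the toy arrays are illustrative, not the data of Fig. 1):

```python
import numpy as np

def big_three(x, y):
    """Compute the 'big three' metrics between reference x and estimate y."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    bias = np.mean(y - x)            # mean error
    mse = np.mean((y - x) ** 2)      # mean square error
    cc = np.corrcoef(x, y)[0, 1]     # Pearson correlation coefficient
    return bias, mse, cc

# Tiny worked example: a constant offset of +0.5
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.5, 2.5, 3.5, 4.5])
bias, mse, cc = big_three(x, y)      # bias = 0.5, mse = 0.25, cc = 1.0
```

A pure offset leaves the correlation at exactly 1, which already hints at how these metrics partition error information differently.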

These metrics do, however, have several limitations:

Interdependence. Most of these conventional performance metrics are not independent; they have been demonstrated to relate to each other in complex ways. For example, the MSE can be decomposed in many ways to link it with other metrics, such as bias and correlation coefficient (e.g., Murphy 1988; Barnston 1992; Taylor 2001; Gupta et al. 2009; Entekhabi et al. 2010). These relations indicate both redundancy among these metrics, and the metrics’ indirect connection to independent error characteristics. This leads to ambiguity in the interpretation and intercomparison of these metrics.

Underdetermination. It is easy to verify that these metrics do not describe unique error characteristics, even when many of them are used collectively. In fact, many different combinations of error characteristics can produce the same values of these metrics.

This is illustrated in Fig. 1. A monthly time series of land surface temperature anomaly data, extracted from satellite-based observations (Wan et al. 2004) over a location in the United States (35°N, 95°W), is used as the reference (black curves) for validating two separate hypothetical sets of predictions (Figs. 1a and 1b, blue curves). Their respective scatterplots are also shown (Figs. 1c and 1d), with values of five major conventional metrics listed (bias, MSE, CC, CoD, and SS). Whether seen from the time series plots or the scatterplots, the two sets of predictions exhibit very different error characteristics. However, all the metrics except bias give nearly identical values (Figs. 1c and 1d). In fact, there are infinitely many ways to construct predictions that produce identical values for many of the metrics. Therefore, given a set of these metric values, one will have fundamental difficulty inferring and communicating the error characteristics of the predictions.
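This underdetermination is easy to reproduce numerically. The sketch below (synthetic data; the parameter values are assumptions chosen for the demonstration, not those of Fig. 1) constructs two linear error models with clearly different parameters that nonetheless share the same population MSE and CC over a zero-mean reference series:

```python
import numpy as np

rng = np.random.default_rng(42)
N = 200_000
x = rng.normal(0.0, 2.0, N)        # zero-mean reference series, sigma_x = 2
sx2 = 4.0                          # population variance of x

# Model A: y = lam*x + delta + eps, eps ~ N(0, sigma^2)
lamA, dA, sA = 0.6, 1.0, 1.1
# Model B: scaling (lam, sigma) by a common factor preserves CC;
# delta is then solved so the population MSE also matches.
c = 0.5
lamB, sB = c * lamA, c * sA
dB = np.sqrt(dA**2 + ((lamA - 1)**2 - (lamB - 1)**2) * sx2 + sA**2 - sB**2)

yA = lamA * x + dA + rng.normal(0.0, sA, N)
yB = lamB * x + dB + rng.normal(0.0, sB, N)

def mse_cc(x, y):
    return np.mean((y - x) ** 2), np.corrcoef(x, y)[0, 1]

mseA, ccA = mse_cc(x, yA)
mseB, ccB = mse_cc(x, yB)
# Different (lam, delta, sigma), yet MSE and CC agree to sampling error.
```

The two models compress versus preserve the dynamic range quite differently, yet no amount of staring at their MSE and CC values would reveal that.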

Incompleteness. There are no well-accepted guidelines on how many of these metrics are sufficient. Many inexperienced users follow a "the more the better" philosophy, and it is not rare to see works that employ as many as 7–10 metrics, many being variants of one another. However, end users and stakeholders generally prefer as concise a summary as possible.

Linear error assumption. The metrics discussed here are all based on the least squares measure. The underlying assumption is that the measurement model is linear, with additive, Gaussian random error. When this assumption is not satisfied, one can still compute these metrics, yet they are misleading or meaningless (Barnston and Thomas 1983; Habib et al. 2001). In fact, these metrics are frequently misused by applying them in ignorance of the underlying error model.

Many enhanced techniques and metrics have been proposed to overcome these limitations. These include various decomposition schemes of the conventional metrics (e.g., Murphy 1988; Stewart 1990; Barnston 1992; Taylor 2001; Gupta et al. 2009; Entekhabi et al. 2010), object-oriented metrics (e.g., Ebert and McBride 2000), and nonparametric metrics that do not rely on underlying assumptions about the nature of the systematic error (e.g., Weijs et al. 2010; Gong et al. 2013; Vrugt and Sadegh 2013). These are all generally valuable efforts; however, we doubt that these kinds of relatively complex statistical analyses will ever fully replace the use of linear performance metrics. As such, we propose an alternative philosophy when applying these simple metrics.

## 2. Error characterization with an error model

If the observations and predictions are denoted as *x* and *y*, respectively, then the most generic mathematical expression for the relationship between the two is their joint distribution, *p*(*y*, *x*). As Murphy et al. (1989) pointed out, *p*(*y*, *x*) "contains *all* of the nontime-dependent information relevant to forecast verification." Therefore, the most general approach for model verification is directly characterizing *p*(*y*, *x*) (Nearing and Gupta 2015). Since *p*(*y*, *x*) = *p*(*y* | *x*)*p*(*x*) and *p*(*x*) is given, this is equivalent to characterizing *p*(*y* | *x*), which is the conditional distribution, or the likelihood function, or the error model. Murphy and Winkler (1987) named this factorization the "likelihood-base rate factorization." We hereby argue and demonstrate that the error model and the conventional metrics are closely related, and that there are unique advantages offered by the error model.
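The factorization can be verified on a toy discrete joint distribution (the probability table below is an arbitrary illustration):

```python
import numpy as np

# Toy joint distribution p(y, x) over 2 forecast and 3 observation categories
p_yx = np.array([[0.10, 0.20, 0.05],
                 [0.15, 0.30, 0.20]])   # rows: y; columns: x; sums to 1

p_x = p_yx.sum(axis=0)                  # base rate p(x), marginalizing over y
p_y_given_x = p_yx / p_x                # likelihood p(y | x), column-wise

# Reassembling the joint from the factorization recovers p(y, x) exactly
reassembled = p_y_given_x * p_x
```

Characterizing *p*(*y* | *x*) therefore loses nothing relative to the full joint, provided the base rate *p*(*x*) is known.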

Consider first a simple linear error model with additive random error:

*y*_{i} = *λx*_{i} + *δ* + *ε*_{i},  (1)

where *x*_{i} and *y*_{i} are the observations and predictions, respectively, indexed by *i* = 1, 2, …, *N*, with *N* being the total number of data points. The three terms on the right-hand side represent three separate sources of estimation errors. The first term, with an error parameter *λ*, represents the scale error, or dynamic-range error, when *λ* deviates from unity. The second term, *δ*, represents a constant deviation, and can be called the displacement error. The third term, *ε*_{i}, is a stochastic variable and represents the random error, typically assumed to be independent and identically distributed, with zero mean and a standard deviation of *σ*.

The first two terms depict a deterministic relationship between *x* and *y*, and they jointly quantify the systematic error, with *λ* and *δ.* The random error, *ε*_{i} , a stochastic term, is solely quantified by *σ*, and is “orthogonal” (Papoulis 1984) to the systematic error. The separation of the errors into systematic and random parts is critical for uncertainty quantification, as random error is an amalgamation of all types of errors unexplainable with current data and knowledge at hand, and directly reflects our current state of ignorance.

It is easy to see that (1) characterizes the conditional distribution *p*(*y* | *x*) as

*p*(*y* | *x*) = *N*(*λx* + *δ*, *σ*²),  (2)

where *N*( ) denotes the Gaussian distribution, assumed here for convenience. Therefore, an error model is essentially a parametric description of the conditional distribution. Subsequently, if (1) is the correct error model, its three parameters, *λ*, *δ*, and *σ*, capture all the error characteristics in the measurements. Given any set of estimates *y*_{i} and reference *x*_{i}, the three quantities, *λ*, *δ*, and *σ*, and their confidence intervals, can be easily estimated in any number of ways, including the ordinary least squares method, the maximum likelihood method, or the Bayesian method (e.g., Carroll et al. 2006; Wilks 2011).
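As a minimal sketch of the ordinary least squares route (synthetic data; the parameter values are illustrative):

```python
import numpy as np

def fit_linear_error_model(x, y):
    """OLS estimates of (lam, delta, sigma) in y = lam*x + delta + eps."""
    lam, delta = np.polyfit(x, y, 1)      # slope and intercept
    resid = y - (lam * x + delta)
    sigma = resid.std(ddof=2)             # two fitted parameters
    return lam, delta, sigma

# Synthetic check: recover known error characteristics
rng = np.random.default_rng(1)
x = rng.normal(10.0, 3.0, 50_000)
y = 0.6 * x + 6.0 + rng.normal(0.0, 1.1, x.size)
lam, delta, sigma = fit_linear_error_model(x, y)
# Estimates land near the true (0.6, 6.0, 1.1)
```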

Once the error model (1) is parameterized, it is easy to see that all the relevant error information is captured by the three parameters. This can be demonstrated with Fig. 1 as well. While four of the five conventional metrics fail to differentiate the two sets of measurements shown in Figs. 1a and 1b (see their respective values in Figs. 1c and 1d), the three model parameters completely capture the different error characteristics, including their differences in the scale error (*λ* = 0.6 vs 1.7), displacement error (*δ* = 6 vs 5.1), and random error (*σ* = 1.1 vs 3.3). Moreover, their representation of the error characteristics is strikingly intuitive when one compares the numbers with the scatterplots (Figs. 1c and 1d). For example, the scale error *λ*, being less than 1 for the first set of measurements, indicates that the measurements compress the dynamic range of the data, a fact easily confirmed by both the time series (Fig. 1a) and the scatterplot (Fig. 1c). Likewise, *λ* > 1 for the second set indicates an expanded dynamic range. The popular conventional metrics listed in Table 1, in contrast, clearly lack such explanatory power.

The error-model parameters also disambiguate the conventional metrics. For example, even without displacement error (*δ* = 0), the mere presence of scale error (*λ* ≠ 1) will be interpreted as "bias." Further, bias reflects the combination of both inherent flaws in the model (*λ* and *δ*) and the mean value of the data sample: bias = (*λ* − 1)x̄ + *δ*, where x̄ is the sample mean of *x*. By contrast, *λ* and *δ* define the error character of the predictions (in the context of the linear error model) independent of any particular data sample; once estimated with a given calibration dataset, they will, assuming validity of the error model, remain valid in the context of other observations.

Table 2. Expressions of the conventional performance metrics in terms of the three model parameters.

For example, as Table 2 shows, the correlation coefficient contains no information about the displacement error *δ*; it is determined only by the scale error *λ*, the random error *σ*, and the variability of the data.

As far as all the performance metrics illustrated in the Taylor diagram are concerned, the three parameters of the error model are sufficient to fully capture the error characteristics of model predictions. There is no additional error information to be gained from these performance metrics.
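These correspondences can be checked numerically: with OLS-fitted parameters and population-style (ddof = 0) sample moments, the sample bias, MSE, and CC reduce exactly to expressions in *λ*, *δ*, and *σ*, by the OLS orthogonality conditions (a sketch with synthetic data):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(5.0, 2.0, 10_000)
y = 1.7 * x + 5.1 + rng.normal(0.0, 3.3, x.size)

# OLS fit of the linear error model y = lam*x + delta + eps
lam, delta = np.polyfit(x, y, 1)
sigma = (y - lam * x - delta).std(ddof=0)   # ddof=0 makes identities exact

xbar, sx = x.mean(), x.std(ddof=0)

bias_direct = np.mean(y - x)
bias_from_params = (lam - 1) * xbar + delta

mse_direct = np.mean((y - x) ** 2)
mse_from_params = (lam - 1) ** 2 * sx**2 + bias_from_params**2 + sigma**2

cc_direct = np.corrcoef(x, y)[0, 1]
cc_from_params = lam * sx / np.sqrt(lam**2 * sx**2 + sigma**2)
# Each "direct" metric equals its parameter expression to machine precision.
```

The identities hold exactly (not just asymptotically) because the OLS residuals have zero mean and zero sample covariance with *x*.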

In fact, the relationship between the conventional metrics and the linear, additive error model (1) has been noted in earlier works (e.g., Murphy 1995; Livezey et al. 1995). However, they invariably failed to explicitly define and treat the random error with a parametric form. This is the root cause of the complexity in the interpretation of the conventional metrics, and the vagueness in using them to quantify uncertainty. The various MSE decomposition schemes [(4a)–(4c)] help alleviate the problem, yet a decomposition scheme with the error-parameter representation (3) is more straightforward.

The linear, additive model is not the only possible error model. For satellite precipitation estimates, for example, Tian et al. (2013) found a multiplicative error model more appropriate:

*y*_{i} = *αx*_{i}^{*β*}*ε*_{i},  (7)

in which the systematic error is quantified by two parameters, *α* and *β*, and the random error, *ε*_{i}, is no longer additive and Gaussian. For the particular case of precipitation errors studied in Tian et al. (2013), the probability distribution of the random error is approximately lognormal, ln *ε*_{i} ∼ *N*(0, *σ*²), so that *σ* quantifies the random error.
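Because taking logarithms linearizes the multiplicative model (ln *y* = ln *α* + *β* ln *x* + ln *ε*), the same least squares machinery applies in log space. A sketch with synthetic, strictly positive data (parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.lognormal(mean=1.0, sigma=0.5, size=50_000)   # positive "rain rates"
alpha, beta, sigma_ln = 1.2, 0.8, 0.3
y = alpha * x**beta * rng.lognormal(mean=0.0, sigma=sigma_ln, size=x.size)

# ln y = ln(alpha) + beta * ln(x) + ln(eps); fit by OLS in log space
b_hat, ln_a_hat = np.polyfit(np.log(x), np.log(y), 1)
alpha_hat = np.exp(ln_a_hat)
sigma_hat = (np.log(y) - ln_a_hat - b_hat * np.log(x)).std(ddof=2)
# Estimates land near the true (1.2, 0.8, 0.3)
```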

However, for such a nonlinear, non-Gaussian error model, or any generic nonlinear error models, the conventional metrics are inappropriate; they are not meaningful representations of the error structure (Barnston and Thomas 1983) as their assumed separation of systematic and random error no longer applies. For example, Habib et al. (2001) demonstrated that the correlation coefficient would produce misleading results in such cases. Therefore, the indiscriminate use of the conventional metrics with nonlinear error models such as (7) is imprudent and incorrect.

Error characterization is equivalent to characterizing the conditional distribution of model predictions given observations (the error model), while the conventional metrics are only valid for the linear, additive error model (1). Therefore, we argue that error analysis should be based on an appropriate error model, which defines and separates the systematic and random errors and whose parameters quantify them. It is up to the researcher to demonstrate that a particular error model fits well for this purpose; the choice is ultimately dictated by empirical understanding of the error characteristics.

## 3. Summary

When concurrent observations *x* and model forecasts *y* are available, one will be able to establish the conditional distribution, or error model, *p*(*y* | *x*). This will enable one to rationally estimate the observation *x* and the associated uncertainty, *p*(*x* | *y*), when only the forecast *y* is available, such as in many real-time applications (e.g., Tian et al. 2010). This step is enabled by Bayes's theorem (e.g., Jaynes 2003):

*p*(*x* | *y*) = *p*(*y* | *x*)*p*(*x*)/*p*(*y*),  (8)

in which the error model serves as the likelihood function *p*(*y* | *x*). Of course, this inference procedure between *p*(*y* | *x*) and *p*(*x* | *y*) is fully reversible [see (8)], depending on the focus of the evaluation (Murphy and Winkler 1987).
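For the Gaussian error model (1), and assuming a Gaussian prior for *x* (an assumption made here purely for illustration), the inversion in (8) has a closed form, which can be checked against brute-force normalization on a grid:

```python
import numpy as np

# Error model p(y|x) = N(lam*x + delta, sigma^2); prior p(x) = N(mu, tau^2)
lam, delta, sigma = 0.6, 6.0, 1.1
mu, tau = 10.0, 3.0
y_obs = 11.0

# Conjugate-Gaussian posterior p(x|y): precision-weighted combination
prec = 1.0 / tau**2 + lam**2 / sigma**2
post_var = 1.0 / prec
post_mean = post_var * (mu / tau**2 + lam * (y_obs - delta) / sigma**2)

# Brute-force check of Bayes' theorem on a dense grid
xg = np.linspace(-20.0, 40.0, 20_001)
dx = xg[1] - xg[0]
prior = np.exp(-0.5 * ((xg - mu) / tau) ** 2)
like = np.exp(-0.5 * ((y_obs - lam * xg - delta) / sigma) ** 2)
post = prior * like
post /= post.sum() * dx                      # normalize p(x|y) numerically
grid_mean = (xg * post).sum() * dx
grid_var = ((xg - grid_mean) ** 2 * post).sum() * dx
# Analytic and numerical posterior moments agree.
```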

Based on this consideration, here we promote a philosophy for error analysis and uncertainty quantification that puts more emphasis on error modeling than on computing conventional performance metrics. There are two primary motivations for this comment.

First, the most popular conventional performance metrics (bias, MSE, and the correlation coefficient) have several limitations related to interdependence, underdetermination, and incompleteness. These limitations make it difficult to infer the error characteristics from a given set of metric values (Fig. 1), and the metrics' roles in characterizing *p*(*y* | *x*) are vague. The traditional remedy is to introduce more metrics or decomposition schemes [e.g., (4a)–(4c)], leading to a proliferation of metrics. We demonstrate, however, that these metrics and decomposition schemes can all be derived from the linear error model, which itself can be interpreted in a way that is both coherent and free from these difficulties, without invoking a multitude of metrics.

Second, making the choice of an error model the first step of model evaluation forces us to reconcile our specific knowledge of the situation, including research and application goals, with the way we choose to present the agreement between our predictive models and observational data.

It is worth emphasizing that the validity of both error modeling and performance metrics, indeed of the entire verification enterprise [see (8)], hinges on the assumption of statistical stationarity. Thus, when we establish an error model or compute a set of metrics, we should select spatial and temporal domains within each of which the error characteristics of the instrument or model are believed to be stationary, and treat each domain (e.g., summer vs winter) with separate model fitting or metric computation.
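In practice, such stratification can be as simple as fitting one error model per domain. A sketch with synthetic monthly data (the two-regime split and parameter values are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(11)
months = np.tile(np.arange(1, 13), 200)        # 200 years of monthly data
x = rng.normal(0.0, 2.0, months.size)

# Error characteristics differ between the two stationary regimes
summer = np.isin(months, (6, 7, 8))
y = np.where(summer, 1.4 * x + 2.0, 0.7 * x - 1.0)
y = y + rng.normal(0.0, np.where(summer, 2.0, 1.0))

def fit(x, y):
    """OLS estimates of (lam, delta, sigma) for one stationary domain."""
    lam, delta = np.polyfit(x, y, 1)
    sigma = (y - lam * x - delta).std(ddof=2)
    return lam, delta, sigma

params_summer = fit(x[summer], y[summer])
params_winter = fit(x[~summer], y[~summer])    # the non-summer regime
# Each fit recovers its own regime's (lam, delta, sigma); a pooled fit
# would blur the two regimes into a single, nonphysical parameter set.
```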

It is important to be cognizant of the fact that the error model itself is a characterization of a conditional probability distribution *p*(*y* | *x*), which contains all of the information available for evaluating our predictive model (Murphy et al. 1989). Our job is to find a suitable analytic, parametric description of that conditional distribution. Since the error information is distilled in the error model's parameters, it is not surprising to see the relationship between most of the conventional metrics and the model parameters (Table 2). Of course, a suitable parametric error model is not always easy to find, and in such cases, nonparametric approaches, which are fundamentally based on information theoretic concepts (Nearing and Gupta 2015; Weijs et al. 2010), may be employed.

## Acknowledgments

This research was supported by the NASA Earth System Data Records Uncertainty Analysis Program (Martha E. Maiden) under solicitation NNH10ZDA001N-ESDRERR. Computing resources were provided by the NASA Center for Climate Simulation. We thank two anonymous reviewers for their helpful comments.

## REFERENCES

Barnston, A. G., 1992: Correspondence among the correlation, RMSE, and Heidke forecast verification measures; refinement of the Heidke score. *Wea. Forecasting*, **7**, 699–709, doi:10.1175/1520-0434(1992)007<0699:CATCRA>2.0.CO;2.

Barnston, A. G., and J. L. Thomas, 1983: Rainfall measurement accuracy in FACE: A comparison of gage and radar rainfalls. *J. Climate Appl. Meteor.*, **22**, 2038–2052, doi:10.1175/1520-0450(1983)022<2038:RMAIFA>2.0.CO;2.

Carroll, R. J., D. Ruppert, L. A. Stefanski, and C. M. Crainiceanu, 2006: *Measurement Error in Nonlinear Models: A Modern Perspective.* 2nd ed. CRC Press, 485 pp.

Ebert, E. E., and J. L. McBride, 2000: Verification of precipitation in weather systems: Determination of systematic errors. *J. Hydrol.*, **239**, 179–202, doi:10.1016/S0022-1694(00)00343-7.

Entekhabi, D., R. H. Reichle, R. D. Koster, and W. T. Crow, 2010: Performance metrics for soil moisture retrievals and application requirements. *J. Hydrometeor.*, **11**, 832–840, doi:10.1175/2010JHM1223.1.

Fisher, S. R. A., 1958: *Statistical Methods for Research Workers.* 13th ed. Hafner Publishing Co., 356 pp.

Gong, W., H. Gupta, D. Yang, K. Sricharan, and A. Hero, 2013: Estimating epistemic and aleatory uncertainties during hydrologic modeling: An information theoretic approach. *Water Resour. Res.*, **49**, 2253–2273, doi:10.1002/wrcr.20161.

Gupta, H. V., H. Kling, K. K. Yilmaz, and G. F. Martinez, 2009: Decomposition of the mean squared error and NSE performance criteria: Implications for improving hydrological modelling. *J. Hydrol.*, **377**, 80–91, doi:10.1016/j.jhydrol.2009.08.003.

Habib, E., W. F. Krajewski, and G. J. Ciach, 2001: Estimation of rainfall interstation correlation. *J. Hydrometeor.*, **2**, 621–629, doi:10.1175/1525-7541(2001)002<0621:EORIC>2.0.CO;2.

Jaynes, E. T., 2003: *Probability Theory: The Logic of Science.* Cambridge University Press, 762 pp.

Livezey, R. E., J. D. Hoopingarner, and J. Huang, 1995: Verification of official monthly mean 700-hPa height forecasts: An update. *Wea. Forecasting*, **10**, 512–527, doi:10.1175/1520-0434(1995)010<0512:VOOMMH>2.0.CO;2.

Murphy, A. H., 1988: Skill scores based on the mean square error and their relationships to the correlation coefficient. *Mon. Wea. Rev.*, **116**, 2417–2424, doi:10.1175/1520-0493(1988)116<2417:SSBOTM>2.0.CO;2.

Murphy, A. H., 1995: The coefficients of correlation and determination as measures of performance in forecast verification. *Wea. Forecasting*, **10**, 681–688, doi:10.1175/1520-0434(1995)010<0681:TCOCAD>2.0.CO;2.

Murphy, A. H., and R. L. Winkler, 1987: A general framework for forecast verification. *Mon. Wea. Rev.*, **115**, 1330–1338, doi:10.1175/1520-0493(1987)115<1330:AGFFFV>2.0.CO;2.

Murphy, A. H., and E. S. Epstein, 1989: Skill scores and correlation coefficients in model verification. *Mon. Wea. Rev.*, **117**, 572–582, doi:10.1175/1520-0493(1989)117<0572:SSACCI>2.0.CO;2.

Murphy, A. H., B. G. Brown, and Y.-S. Chen, 1989: Diagnostic verification of temperature forecasts. *Wea. Forecasting*, **4**, 485–501, doi:10.1175/1520-0434(1989)004<0485:DVOTF>2.0.CO;2.

Nearing, G., and H. Gupta, 2015: The quantity and quality of information in hydrologic models. *Water Resour. Res.*, **51**, 524–538, doi:10.1002/2014WR015895.

Papoulis, A., 1984: *Probability, Random Variables and Stochastic Processes.* 2nd ed. McGraw-Hill Inc., 575 pp.

Stewart, T. R., 1990: A decomposition of the correlation coefficient and its use in analyzing forecasting skill. *Wea. Forecasting*, **5**, 661–666, doi:10.1175/1520-0434(1990)005<0661:ADOTCC>2.0.CO;2.

Tang, L., Y. Tian, F. Yan, and E. Habib, 2015: An improved procedure for the validation of satellite-based precipitation estimates. *Atmos. Res.*, **163**, 61–73, doi:10.1016/j.atmosres.2014.12.016.

Taylor, K. E., 2001: Summarizing multiple aspects of model performance in a single diagram. *J. Geophys. Res.*, **106**, 7183–7192, doi:10.1029/2000JD900719.

Tian, Y., C. D. Peters-Lidard, and J. B. Eylander, 2010: Real-time bias reduction for satellite-based precipitation estimates. *J. Hydrometeor.*, **11**, 1275–1285, doi:10.1175/2010JHM1246.1.

Tian, Y., G. J. Huffman, R. F. Adler, L. Tang, M. Sapiano, V. Maggioni, and H. Wu, 2013: Modeling errors in daily precipitation measurements: Additive or multiplicative? *Geophys. Res. Lett.*, **40**, 2060–2065, doi:10.1002/grl.50320.

Vrugt, J., and M. Sadegh, 2013: Towards diagnostic model calibration and evaluation: Approximate Bayesian computation. *Water Resour. Res.*, **49**, 4335–4345, doi:10.1002/wrcr.20354.

Wan, Z., Y. Zhang, Q. Zhang, and Z.-L. Li, 2004: Quality assessment and validation of the MODIS global land surface temperature. *Int. J. Remote Sens.*, **25**, 261–274, doi:10.1080/0143116031000116417.

Weijs, S., G. Schoups, and N. Giesen, 2010: Why hydrologic predictions should be evaluated using information theory. *Hydrol. Earth Syst. Sci.*, **14**, 2545–2558, doi:10.5194/hess-14-2545-2010.

Wilks, D. S., 2011: *Statistical Methods in the Atmospheric Sciences.* Academic Press, 676 pp.

Willmott, C. J., 1981: On the validation of models. *Phys. Geogr.*, **2**, 184–194.