Measuring the Accuracy of Species Distribution Models: A Review. Liu, C., White, M., & Newell, G. In Anderssen, R. S., Braddock, R. D., & Newham, L. T. H., editors, 18th World IMACS Congress and MODSIM09 International Congress on Modelling and Simulation, pages 4241–4247, July, 2009.
Species distribution models (SDMs) are empirical models relating species occurrence to environmental variables based on statistical or other response surfaces. Species distribution modeling can be used as a tool to solve many theoretical and applied ecological and environmental problems, which include testing biogeographical, ecological and evolutionary hypotheses, assessing species invasion and climate change impact, and supporting conservation planning and reserve selection. The utility of SDM in real world applications requires the knowledge of the model's accuracy. The accuracy of a model includes two aspects: discrimination capacity and reliability. The former is the power of the model to differentiate presences from absences; and the latter refers to the capability of the predicted probabilities to reflect the observed proportion of sites occupied by the subject species. Similar methodology has been used for model accuracy assessment in different fields, including medical diagnostic test, weather forecasting and machine learning, etc. Some accuracy measures are used in all fields, e.g. the overall accuracy and the area under the receiver operating characteristic curve; while the use of other measures is largely restricted to specific fields, e.g. F-measure is mainly used in machine learning field, or is referred to by different names in different fields, e.g. "true skill statistic" is used in atmospheric science and it is called "Youden's J" in medical diagnostic field. In this paper we review those accuracy measures typically used in ecology. Generally, the measures can be divided into two groups: threshold-dependent and threshold-independent. Measures in the first group are used for binary predictions, and those in the second group are used for continuous predictions. Continuous predictions may be transformed to binary ones if a specific threshold is employed. In such cases, the threshold-dependent accuracy measures can also be used.
The threshold-dependent indices used in or introduced to SDM field include overall accuracy, sensitivity, specificity, positive predictive value, negative predictive value, odds ratio, true skill statistic, F-measure, Cohen's kappa, and normalized mutual information (NMI). However, since NMI only measures the agreement between two patterns, it cannot differentiate the worse-than-random models from the better-than-random models, which reduces its utility as an accuracy measure. The threshold-independent indices used in or introduced to the SDM field include the area under the receiver operating characteristic curve (AUC), Gini index, and point biserial correlation coefficient. The proportion of explained deviance $D^2$ and its adjusted form have been also introduced into SDM field. But this adjusted metric has no theoretical foundation in the context of generalized linear modeling. Therefore, we provide another adjusted form, which was proposed by H. V. Houwelingen based on the asymptotic $\chi^2$ distribution of the log-likelihood statistics. Its superiority over other related measures has been found through previous simulation studies. We also provide another analogous measure, the coefficient of determination $R^2$, which has had a long history in weather forecast verification and was also recommended for use in medical diagnosis. Though these measures $D^2$ and $R^2$ are routinely used to evaluate generalized linear models (GLMs), we argue that nothing prevents them from being applied to other GLM-like models. In SDM accuracy assessment, discrimination capacity is often considered, but model reliability is frequently ignored. The primary reason for this is that no reliability measure has been introduced into the ecological literature. To meet this need we also suggest that root mean square error be used as a reliability measure. Its squared form, mean square error, has been used in meteorology for a long time, and is called Brier's score.
We also discuss the effect of prevalence dependence of accuracy measures and the precision of accuracy estimates.
@inproceedings{liuMeasuringAccuracySpecies2009,
  title = {Measuring the Accuracy of Species Distribution Models: A Review},
  booktitle = {18th {{World IMACS Congress}} and {{MODSIM09 International Congress}} on {{Modelling}} and {{Simulation}}},
  author = {Liu, C. and White, M. and Newell, G.},
  editor = {Anderssen, R. S. and Braddock, R. D. and Newham, L. T. H.},
  year = {2009},
  month = jul,
  pages = {4241--4247},
  abstract = {Species distribution models (SDMs) are empirical models relating species occurrence to environmental variables based on statistical or other response surfaces. Species distribution modeling can be used as a tool to solve many theoretical and applied ecological and environmental problems, which include testing biogeographical, ecological and evolutionary hypotheses, assessing species invasion and climate change impact, and supporting conservation planning and reserve selection. The utility of SDM in real world applications requires the knowledge of the model's accuracy. The accuracy of a model includes two aspects: discrimination capacity and reliability. The former is the power of the model to differentiate presences from absences; and the latter refers to the capability of the predicted probabilities to reflect the observed proportion of sites occupied by the subject species. Similar methodology has been used for model accuracy assessment in different fields, including medical diagnostic test, weather forecasting and machine learning, etc. Some accuracy measures are used in all fields, e.g. the overall accuracy and the area under the receiver operating characteristic curve; while the use of other measures is largely restricted to specific fields, e.g. F-measure is mainly used in machine learning field, or is referred to by different names in different fields, e.g. "true skill statistic" is used in atmospheric science and it is called "Youden's J" in medical diagnostic field. In this paper we review those accuracy measures typically used in ecology. Generally, the measures can be divided into two groups: threshold-dependent and threshold-independent. Measures in the first group are used for binary predictions, and those in the second group are used for continuous predictions. Continuous predictions may be transformed to binary ones if a specific threshold is employed. In such cases, the threshold-dependent accuracy measures can also be used.
The threshold-dependent indices used in or introduced to SDM field include overall accuracy, sensitivity, specificity, positive predictive value, negative predictive value, odds ratio, true skill statistic, F-measure, Cohen's kappa, and normalized mutual information (NMI). However, since NMI only measures the agreement between two patterns, it cannot differentiate the worse-than-random models from the better-than-random models, which reduces its utility as an accuracy measure. The threshold-independent indices used in or introduced to the SDM field include the area under the receiver operating characteristic curve (AUC), Gini index, and point biserial correlation coefficient. The proportion of explained deviance {$D^2$} and its adjusted form have been also introduced into SDM field. But this adjusted metric has no theoretical foundation in the context of generalized linear modeling. Therefore, we provide another adjusted form, which was proposed by H. V. Houwelingen based on the asymptotic {$\chi^2$} distribution of the log-likelihood statistics. Its superiority over other related measures has been found through previous simulation studies. We also provide another analogous measure, the coefficient of determination {$R^2$}, which has had a long history in weather forecast verification and was also recommended for use in medical diagnosis. Though these measures {$D^2$} and {$R^2$} are routinely used to evaluate generalized linear models (GLMs), we argue that nothing prevents them from being applied to other GLM-like models. In SDM accuracy assessment, discrimination capacity is often considered, but model reliability is frequently ignored. The primary reason for this is that no reliability measure has been introduced into the ecological literature. To meet this need we also suggest that root mean square error be used as a reliability measure. Its squared form, mean square error, has been used in meteorology for a long time, and is called Brier's score.
We also discuss the effect of prevalence dependence of accuracy measures and the precision of accuracy estimates.},
  isbn = {978-0-9758400-7-8},
  keywords = {*imported-from-citeulike-INRMM,~INRMM-MiD:c-12770318,accuracy,field-measurements,modelling,modelling-uncertainty,review,species-distribution},
  lccn = {INRMM-MiD:c-12770318}
}
