Estimating species diversity and distribution in the era of Big Data: to what extent can we trust public databases?

Estimating species diversity and distribution in the era of Big Data: to what extent can we trust public databases? Maldonado, C., Molina, C. I., Zizka, A., Persson, C., Taylor, C. M., Albán, J., Chilquillo, E., Rønsted, N., & Antonelli, A. Global Ecology and Biogeography, 24(8):973--984, August 2015.


Aim

Massive digitalization of natural history collections is now leading to a steep accumulation of publicly available species distribution data. However, taxonomic errors and geographical uncertainty of species occurrence records are now acknowledged by the scientific community – putting into question to what extent such data can be used to unveil correct patterns of biodiversity and distribution. We explore this question through quantitative and qualitative analyses of uncleaned versus manually verified datasets of species distribution records across different spatial scales.

Location

The American tropics.

Methods

As test case we used the plant tribe Cinchoneae (Rubiaceae). We compiled four datasets of species occurrences: one created manually and verified through classical taxonomic work, and the rest derived from GBIF under different cleaning and filling schemes. We used new bioinformatic tools to code species into grids, ecoregions, and biomes following WWF's classification. We analysed species richness and altitudinal ranges of the species.

Results

Altitudinal ranges for species and genera were correctly inferred even without manual data cleaning and filling. However, erroneous records affected spatial patterns of species richness. They led to an overestimation of species richness in certain areas outside the centres of diversity in the clade. The location of many of these areas comprised the geographical midpoint of countries and political subdivisions, assigned long after the specimens had been collected.

Main conclusion

Open databases and integrative bioinformatic tools allow a rapid approximation of large-scale patterns of biodiversity across space and altitudinal ranges. We found that geographic inaccuracy affects diversity patterns more than taxonomic uncertainties, often leading to false positives, i.e. overestimating species richness in relatively species poor regions. Public databases for species distribution are valuable and should be more explored, but under scrutiny and validation by taxonomic experts. We suggest that database managers implement easy ways of community feedback on data quality.
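
To make the abstract's two central ideas concrete (coding occurrence records into grid cells to count species richness, and the problem of records placed at the geographical midpoint of a country), here is a minimal Python sketch. It is not the authors' workflow and does not reproduce the SpeciesGeoCoder tool named in the keywords. The function names, the 1-degree cell size, the 0.01-degree tolerance, the placeholder species labels, and the centroid coordinates are all assumptions chosen for illustration.

```python
import math
from collections import defaultdict


def grid_cell(lat, lon, cell_size=1.0):
    """Return the (row, col) index of the square cell containing a point.

    cell_size is in decimal degrees; 1.0 is an arbitrary illustrative choice.
    """
    return (math.floor(lat / cell_size), math.floor(lon / cell_size))


def richness_per_cell(records, cell_size=1.0):
    """Count distinct species per grid cell.

    records is an iterable of (species, lat, lon) tuples.
    """
    species_by_cell = defaultdict(set)
    for species, lat, lon in records:
        species_by_cell[grid_cell(lat, lon, cell_size)].add(species)
    return {cell: len(names) for cell, names in species_by_cell.items()}


def near_centroid(lat, lon, centroid, tolerance=0.01):
    """Flag a record lying within `tolerance` degrees of a known centroid.

    Such records may have been georeferenced to the midpoint of a country
    or subdivision rather than to the true collecting locality.
    """
    c_lat, c_lon = centroid
    return abs(lat - c_lat) <= tolerance and abs(lon - c_lon) <= tolerance


# Made-up records with placeholder species labels (not real occurrences).
records = [
    ("species_a", -16.5, -68.1),
    ("species_b", -16.2, -68.9),
    ("species_a", -16.9, -68.4),
    ("species_c", -10.0, -60.0),  # sits exactly on the hypothetical centroid
]
hypothetical_centroid = (-10.0, -60.0)

richness = richness_per_cell(records)
suspect = [r for r in records if near_centroid(r[1], r[2], hypothetical_centroid)]
```

In this toy input the lone record at the hypothetical centroid creates an occupied cell of its own, which is the kind of false positive in a species-poor area that the abstract describes. A real screening step would need published centroid coordinates and a tolerance chosen for the dataset's coordinate precision.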

@article{maldonado_estimating_2015,
	title = {Estimating species diversity and distribution in the era of {Big} {Data}: to what extent can we trust public databases?},
	volume = {24},
	issn = {1466-8238},
	shorttitle = {Estimating species diversity and distribution in the era of {Big} {Data}},
	url = {http://onlinelibrary.wiley.com/doi/10.1111/geb.12326/abstract},
	doi = {10.1111/geb.12326},
	abstract = {Aim

Massive digitalization of natural history collections is now leading to a steep accumulation of publicly available species distribution data. However, taxonomic errors and geographical uncertainty of species occurrence records are now acknowledged by the scientific community – putting into question to what extent such data can be used to unveil correct patterns of biodiversity and distribution. We explore this question through quantitative and qualitative analyses of uncleaned versus manually verified datasets of species distribution records across different spatial scales.


Location

The American tropics.


Methods

As test case we used the plant tribe Cinchoneae (Rubiaceae). We compiled four datasets of species occurrences: one created manually and verified through classical taxonomic work, and the rest derived from GBIF under different cleaning and filling schemes. We used new bioinformatic tools to code species into grids, ecoregions, and biomes following WWF's classification. We analysed species richness and altitudinal ranges of the species.


Results

Altitudinal ranges for species and genera were correctly inferred even without manual data cleaning and filling. However, erroneous records affected spatial patterns of species richness. They led to an overestimation of species richness in certain areas outside the centres of diversity in the clade. The location of many of these areas comprised the geographical midpoint of countries and political subdivisions, assigned long after the specimens had been collected.


Main conclusion

Open databases and integrative bioinformatic tools allow a rapid approximation of large-scale patterns of biodiversity across space and altitudinal ranges. We found that geographic inaccuracy affects diversity patterns more than taxonomic uncertainties, often leading to false positives, i.e. overestimating species richness in relatively species poor regions. Public databases for species distribution are valuable and should be more explored, but under scrutiny and validation by taxonomic experts. We suggest that database managers implement easy ways of community feedback on data quality.},
	language = {en},
	number = {8},
	urldate = {2018-02-19TZ},
	journal = {Global Ecology and Biogeography},
	author = {Maldonado, Carla and Molina, Carlos I. and Zizka, Alexander and Persson, Claes and Taylor, Charlotte M. and Albán, Joaquina and Chilquillo, Eder and Rønsted, Nina and Antonelli, Alexandre},
	month = aug,
	year = {2015},
	keywords = {Cinchoneae, GBIF, Rubiaceae, SpeciesGeoCoder, data quality, occurrence data, species richness},
	pages = {973--984}
}
