A New Method for Species Identification via Protein-Coding and Non-Coding DNA Barcodes by Combining Machine Learning with Bioinformatic Methods. Zhang, A., Feng, J., Ward, R. D., Wan, P., Gao, Q., Wu, J., & Zhao, W. PLoS ONE, 7(2):e30986, 2012.
Species identification via DNA barcodes is contributing greatly to current bioinventory efforts. The initial, and widely accepted, proposal was to use the protein-coding cytochrome c oxidase subunit I (COI) region as the standard barcode for animals, but recently non-coding internal transcribed spacer (ITS) genes have been proposed as candidate barcodes for both animals and plants. However, achieving a robust alignment for non-coding regions can be problematic. Here we propose two new methods (DV-RBF and FJ-RBF) to address this issue for species assignment by both coding and non-coding sequences that take advantage of the power of machine learning and bioinformatics. We demonstrate the value of the new methods with four empirical datasets, two representing typical protein-coding COI barcode datasets (neotropical bats and marine fish) and two representing non-coding ITS barcodes (rust fungi and brown algae). Using two random sub-sampling approaches, we demonstrate that the new methods significantly outperformed existing Neighbor-joining (NJ) and Maximum likelihood (ML) methods for both coding and non-coding barcodes when there was complete species coverage in the reference dataset. The new methods also out-performed NJ and ML methods for non-coding sequences in circumstances of potentially incomplete species coverage, although then the NJ and ML methods performed slightly better than the new methods for protein-coding barcodes. A 100% success rate of species identification was achieved with the two new methods for 4,122 bat queries and 5,134 fish queries using COI barcodes, with 95% confidence intervals (CI) of 99.75-100%. The new methods also obtained a 96.29% success rate (95%CI: 91.62-98.40%) for 484 rust fungi queries and a 98.50% success rate (95%CI: 96.60-99.37%) for 1094 brown algae queries, both using ITS barcodes.
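The abstract reports identification success rates with 95% confidence intervals but does not say how those intervals were computed. As a rough illustration only, the sketch below computes a Wilson score interval for a success proportion. The choice of the Wilson method, the function name `wilson_interval`, and the counts in the usage example are all assumptions made for illustration; they are not taken from the paper, and the output is not expected to reproduce the bounds quoted above.

```python
from math import sqrt
from statistics import NormalDist
from typing import Tuple


def wilson_interval(successes: int, total: int, confidence: float = 0.95) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (illustrative helper).

    This is an assumed method; the source does not state which interval
    the authors used for their reported confidence intervals.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= successes <= total:
        raise ValueError("successes must lie between 0 and total")

    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = successes / total

    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half_width = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom

    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


if __name__ == "__main__":
    # Hypothetical counts, chosen only to show the call; not data from the paper.
    low, high = wilson_interval(successes=480, total=500)
    print(f"success rate: {480 / 500:.2%}, 95% CI: {low:.2%} to {high:.2%}")
```

The Wilson interval is used here because it stays within [0, 1] and behaves sensibly when the observed rate is at or near 100%, which is the situation described for the COI datasets; an exact (Clopper-Pearson) interval would be an equally reasonable alternative.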
@article{zhang_new_2012,
title = {A {New} {Method} for {Species} {Identification} via {Protein}-{Coding} and {Non}-{Coding} {DNA} {Barcodes} by {Combining} {Machine} {Learning} with {Bioinformatic} {Methods}},
volume = {7},
doi = {10.1371/journal.pone.0030986},
abstract = {Species identification via DNA barcodes is contributing greatly to current bioinventory efforts. The initial, and widely accepted, proposal was to use the protein-coding cytochrome c oxidase subunit I (COI) region as the standard barcode for animals, but recently non-coding internal transcribed spacer (ITS) genes have been proposed as candidate barcodes for both animals and plants. However, achieving a robust alignment for non-coding regions can be problematic. Here we propose two new methods (DV-RBF and FJ-RBF) to address this issue for species assignment by both coding and non-coding sequences that take advantage of the power of machine learning and bioinformatics. We demonstrate the value of the new methods with four empirical datasets, two representing typical protein-coding COI barcode datasets (neotropical bats and marine fish) and two representing non-coding ITS barcodes (rust fungi and brown algae). Using two random sub-sampling approaches, we demonstrate that the new methods significantly outperformed existing Neighbor-joining (NJ) and Maximum likelihood (ML) methods for both coding and non-coding barcodes when there was complete species coverage in the reference dataset. The new methods also out-performed NJ and ML methods for non-coding sequences in circumstances of potentially incomplete species coverage, although then the NJ and ML methods performed slightly better than the new methods for protein-coding barcodes. A 100\% success rate of species identification was achieved with the two new methods for 4,122 bat queries and 5,134 fish queries using COI barcodes, with 95\% confidence intervals (CI) of 99.75-100\%. The new methods also obtained a 96.29\% success rate (95\%CI: 91.62-98.40\%) for 484 rust fungi queries and a 98.50\% success rate (95\%CI: 96.60-99.37\%) for 1094 brown algae queries, both using ITS barcodes.},
language = {eng},
number = {2},
journal = {PLoS ONE},
author = {Zhang, Ai-Bing and Feng, Jie and Ward, Robert D and Wan, Ping and Gao, Qiang and Wu, Jun and Zhao, Wei-Zhong},
year = {2012},
pmid = {22363527},
pages = {e30986},
}