Virk: An Active Learning-based System for Bootstrapping Knowledge Base Development in the Neurosciences. Ambert, K. H., Cohen, A. M., Burns, G. A., Boudreau, E., & Sonmez, K. Frontiers in Neuroinformatics, 2013.
The frequency and volume of newly published scientific literature are quickly making manual maintenance of publicly available databases of primary data unrealistic and costly. Although machine learning can be useful for developing automated approaches to identifying scientific publications containing relevant information for a database, developing such tools necessitates manually annotating an unrealistic number of documents. One approach to this problem, active learning, builds classification models by iteratively identifying documents that provide the most information to a classifier. Although this approach has been shown to be effective for related problems, it falls short in the context of scientific database curation. We present Virk, an active learning system that, while being trained, simultaneously learns a classification model and identifies documents having information of interest for a knowledge base. Our approach uses a support vector machine classifier with input features derived from neuroscience-related publications from the primary literature. Using our approach, we were able to increase the size of the Neuron Registry, a knowledge base of neuron-related information, by 90% in 3 months. Using standard biocuration methods, it would have taken between 1 and 2 years to make the same number of contributions to the Neuron Registry. Here, we describe the system pipeline in detail and evaluate its performance against other approaches to sampling in active learning.
@article{ambert_virk:_2013,
	title = {Virk: {An} {Active} {Learning}-based {System} for {Bootstrapping} {Knowledge} {Base} {Development} in the {Neurosciences}},
	volume = {7},
	issn = {1662-5196},
	url = {http://www.frontiersin.org/neuroinformatics/10.3389/fninf.2013.00038/abstract},
	doi = {10.3389/fninf.2013.00038},
	abstract = {The frequency and volume of newly published scientific literature are quickly making manual maintenance of publicly available databases of primary data unrealistic and costly. Although machine learning can be useful for developing automated approaches to identifying scientific publications containing relevant information for a database, developing such tools necessitates manually annotating an unrealistic number of documents. One approach to this problem, active learning, builds classification models by iteratively identifying documents that provide the most information to a classifier. Although this approach has been shown to be effective for related problems, it falls short in the context of scientific database curation. We present Virk, an active learning system that, while being trained, simultaneously learns a classification model and identifies documents having information of interest for a knowledge base. Our approach uses a support vector machine classifier with input features derived from neuroscience-related publications from the primary literature. Using our approach, we were able to increase the size of the Neuron Registry, a knowledge base of neuron-related information, by 90\% in 3 months. Using standard biocuration methods, it would have taken between 1 and 2 years to make the same number of contributions to the Neuron Registry. Here, we describe the system pipeline in detail and evaluate its performance against other approaches to sampling in active learning.},
	number = {38},
	journal = {Frontiers in Neuroinformatics},
	author = {Ambert, Kyle H. and Cohen, Aaron M. and Burns, Gully A. P. C. and Boudreau, Eilis and Sonmez, Kemal},
	year = {2013}
}