Unbiased online active learning in data streams. Chu, W., Zinkevich, M., Li, L., Thomas, A., & Tseng, B. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '11, pages 195–203, New York, NY, USA, August 2011. Association for Computing Machinery.
Unlabeled samples can be intelligently selected for labeling to minimize classification error. In many real-world applications, a large number of unlabeled samples arrive in a streaming manner, making it impossible to maintain all the data in a candidate pool. In this work, we focus on binary classification problems and study selective labeling in data streams where a decision is required on each sample sequentially. We consider the unbiasedness property in the sampling process, and design optimal instrumental distributions to minimize the variance in the stochastic process. Meanwhile, Bayesian linear classifiers with weighted maximum likelihood are optimized online to estimate parameters. In empirical evaluation, we collect a data stream of user-generated comments on a commercial news portal in 30 consecutive days, and carry out offline evaluation to compare various sampling strategies, including unbiased active learning, biased variants, and random sampling. Experimental results verify the usefulness of online active learning, especially in the non-stationary situation with concept drift.
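The unbiasedness property mentioned in the abstract can be illustrated with a minimal sketch of inverse-probability weighting on a stream. This is not the paper's instrumental distribution or its Bayesian classifier; the function name `importance_weighted_sum`, the `query_prob` callback, and the toy stream are all hypothetical, chosen only to show why dividing each queried sample's contribution by its query probability keeps the estimate unbiased.

```python
import random


def importance_weighted_sum(stream, query_prob, rng):
    """Estimate the total loss over a stream while querying only some samples.

    Each sample x is queried with probability p = query_prob(x), decided
    once as the sample arrives. A queried sample contributes loss / p, so
    the expected contribution of every sample equals its true loss and the
    overall estimate is unbiased for any p > 0.
    (Illustrative assumption: the loss is revealed only when queried.)
    """
    total = 0.0
    n_queried = 0
    for x, loss in stream:
        p = query_prob(x)
        if rng.random() < p:      # sequential decision on this sample
            total += loss / p     # importance weight 1 / p
            n_queried += 1
    return total, n_queried


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy stream: 100 samples, each with loss 1.0 (true total is 100).
    stream = [(i / 100.0, 1.0) for i in range(100)]
    # Hypothetical query rule: query less often as x grows, never below 0.2.
    estimate, n_queried = importance_weighted_sum(
        stream, lambda x: max(0.2, 1.0 - x), rng
    )
    print(f"queried {n_queried} of {len(stream)} samples, estimate = {estimate:.2f}")
```

A single run can be far from the true total because small query probabilities produce large weights; the choice of `query_prob` controls that variance, which is the quantity the abstract says the instrumental distributions are designed to minimize.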
@inproceedings{chu_unbiased_2011,
	address = {New York, NY, USA},
	series = {{KDD} '11},
	title = {Unbiased online active learning in data streams},
	isbn = {978-1-4503-0813-7},
	url = {https://doi.org/10.1145/2020408.2020444},
	doi = {10.1145/2020408.2020444},
	abstract = {Unlabeled samples can be intelligently selected for labeling to minimize classification error. In many real-world applications, a large number of unlabeled samples arrive in a streaming manner, making it impossible to maintain all the data in a candidate pool. In this work, we focus on binary classification problems and study selective labeling in data streams where a decision is required on each sample sequentially. We consider the unbiasedness property in the sampling process, and design optimal instrumental distributions to minimize the variance in the stochastic process. Meanwhile, Bayesian linear classifiers with weighted maximum likelihood are optimized online to estimate parameters. In empirical evaluation, we collect a data stream of user-generated comments on a commercial news portal in 30 consecutive days, and carry out offline evaluation to compare various sampling strategies, including unbiased active learning, biased variants, and random sampling. Experimental results verify the usefulness of online active learning, especially in the non-stationary situation with concept drift.},
	urldate = {2022-03-28},
	booktitle = {Proceedings of the 17th {ACM} {SIGKDD} international conference on {Knowledge} discovery and data mining},
	publisher = {Association for Computing Machinery},
	author = {Chu, Wei and Zinkevich, Martin and Li, Lihong and Thomas, Achint and Tseng, Belle},
	month = aug,
	year = {2011},
	keywords = {active learning, adaptive importance sampling, bayesian online learning, data streaming, unbiasedness},
	pages = {195--203},
}
