Facing the reality of data stream classification: coping with scarcity of labeled data. Masud, M. M., Woolam, C., Gao, J., Khan, L., Han, J., Hamlen, K. W., & Oza, N. C. Knowledge and Information Systems, 33(1):213–244, October 2012.
Recent approaches for classifying data streams are mostly based on supervised learning algorithms, which can only be trained with labeled data. Manual labeling of data is both costly and time consuming. Therefore, in a real streaming environment where large volumes of data appear at a high speed, only a small fraction of the data can be labeled. Thus, only a limited number of instances will be available for training and updating the classification models, leading to poorly trained classifiers. We apply a novel technique to overcome this problem by utilizing both unlabeled and labeled instances to train and update the classification model. Each classification model is built as a collection of micro-clusters using semi-supervised clustering, and an ensemble of these models is used to classify unlabeled data. Empirical evaluation of both synthetic and real data reveals that our approach outperforms state-of-the-art stream classification algorithms that use ten times more labeled data than our approach.
@article{masud_facing_2012,
	title = {Facing the reality of data stream classification: coping with scarcity of labeled data},
	volume = {33},
	issn = {0219-3116},
	shorttitle = {Facing the reality of data stream classification},
	url = {https://doi.org/10.1007/s10115-011-0447-8},
	doi = {10.1007/s10115-011-0447-8},
	abstract = {Recent approaches for classifying data streams are mostly based on supervised learning algorithms, which can only be trained with labeled data. Manual labeling of data is both costly and time consuming. Therefore, in a real streaming environment where large volumes of data appear at a high speed, only a small fraction of the data can be labeled. Thus, only a limited number of instances will be available for training and updating the classification models, leading to poorly trained classifiers. We apply a novel technique to overcome this problem by utilizing both unlabeled and labeled instances to train and update the classification model. Each classification model is built as a collection of micro-clusters using semi-supervised clustering, and an ensemble of these models is used to classify unlabeled data. Empirical evaluation of both synthetic and real data reveals that our approach outperforms state-of-the-art stream classification algorithms that use ten times more labeled data than our approach.},
	language = {en},
	number = {1},
	urldate = {2022-03-28},
	journal = {Knowledge and Information Systems},
	author = {Masud, Mohammad M. and Woolam, Clay and Gao, Jing and Khan, Latifur and Han, Jiawei and Hamlen, Kevin W. and Oza, Nikunj C.},
	month = oct,
	year = {2012},
	pages = {213--244},
}
