Semi-supervised Learning over Streaming Data using MOA. Le Nguyen, M. H., Gomes, H. M., & Bifet, A. In 2019 IEEE International Conference on Big Data (Big Data), pages 553–562, December 2019.
Machine learning algorithms for data streams usually suppose that all data examples available for learning are strictly labeled. Unfortunately, in real-world scenarios, data examples are not always labeled. Semi-supervised learning is a challenging task to learn using labeled and unlabeled data at the same time. It is especially relevant in the context of data streams, where the data is generated in real-time, and the labels may be missing due to various factors (e.g., network delay, errors during the communication between sensors, expensive labeling process, and others). In this paper, we present two novel approaches to handle missing labels for classification learning in data streams, namely cluster-and-label and self-training. We discuss the strengths and weaknesses of each solution to establish a baseline to evaluate semi-supervised learning techniques in data streams. These methods are implemented inside the MOA (Massive Online Analysis) open-source software as an internal benchmark component, to help researchers to run experimental comparisons on semi-supervised learning on data streams easily.
@inproceedings{le_nguyen_semi-supervised_2019,
	title = {Semi-supervised {Learning} over {Streaming} {Data} using {MOA}},
	doi = {10.1109/BigData47090.2019.9006217},
	booktitle = {2019 {IEEE} {International} {Conference} on {Big} {Data} ({Big} {Data})},
	author = {Le Nguyen, Minh Huong and Gomes, Heitor Murilo and Bifet, Albert},
	month = dec,
	year = {2019},
	keywords = {Clustering algorithms, Data models, Labeling, Prediction algorithms, Predictive models, Semisupervised learning, Supervised learning, data streams, semi-supervised learning},
	pages = {553--562},
}
