Online Clustering: Algorithms, Evaluation, Metrics, Applications and Benchmarking. Montiel, J., Ngo, H., Le-Nguyen, M., & Bifet, A. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD '22, pages 4808–4809, New York, NY, USA, August 2022. Association for Computing Machinery.
Online clustering algorithms play a critical role in data science, especially with the advantages regarding time, memory usage and complexity, while maintaining a high performance compared to traditional clustering methods. This tutorial serves, first, as a survey on online machine learning and, in particular, data stream clustering methods. During this tutorial, state-of-the-art algorithms and the associated core research threads will be presented by identifying different categories based on distance, density grids and hidden statistical models. Clustering validity indices, an important part of the clustering process which are usually neglected or replaced with classification metrics, resulting in misleading interpretation of final results, will also be deeply investigated. Then, this introduction will be put into the context with River, a go-to Python library merged between Creme and scikit-multiflow. It is also the first open-source project to include an online clustering module that can facilitate reproducibility and allow direct further improvements. From this, we propose methods of clustering configuration, applications and settings for benchmarking, using real-world problems and datasets.
@inproceedings{montiel_online_2022,
	address = {New York, NY, USA},
	series = {{KDD} '22},
	title = {Online {Clustering}: {Algorithms}, {Evaluation}, {Metrics}, {Applications} and {Benchmarking}},
	isbn = {978-1-4503-9385-0},
	shorttitle = {Online {Clustering}},
	url = {https://doi.org/10.1145/3534678.3542600},
	doi = {10.1145/3534678.3542600},
	abstract = {Online clustering algorithms play a critical role in data science, especially with the advantages regarding time, memory usage and complexity, while maintaining a high performance compared to traditional clustering methods. This tutorial serves, first, as a survey on online machine learning and, in particular, data stream clustering methods. During this tutorial, state-of-the-art algorithms and the associated core research threads will be presented by identifying different categories based on distance, density grids and hidden statistical models. Clustering validity indices, an important part of the clustering process which are usually neglected or replaced with classification metrics, resulting in misleading interpretation of final results, will also be deeply investigated. Then, this introduction will be put into the context with River, a go-to Python library merged between Creme and scikit-multiflow. It is also the first open-source project to include an online clustering module that can facilitate reproducibility and allow direct further improvements. From this, we propose methods of clustering configuration, applications and settings for benchmarking, using real-world problems and datasets.},
	urldate = {2023-03-31},
	booktitle = {Proceedings of the 28th {ACM} {SIGKDD} {Conference} on {Knowledge} {Discovery} and {Data} {Mining}},
	publisher = {Association for Computing Machinery},
	author = {Montiel, Jacob and Ngo, Hoang-Anh and Le-Nguyen, Minh-Huong and Bifet, Albert},
	month = aug,
	year = {2022},
	keywords = {benchmarking, data streams, decision support, online clustering, stream clustering, stream learning},
	pages = {4808--4809},
}
