State-of-the-art on clustering data streams. Ghesmoune, M., Lebbah, M., & Azzag, H. Big Data Analytics, 1(1):13, December, 2016.
Clustering is a key data mining task: the problem of partitioning a set of observations into clusters such that observations within a cluster are similar and observations in different clusters are dissimilar. The traditional setup, in which a static dataset is available in its entirety for random access, does not apply when the entire dataset is not available at the start of learning, the data continue to arrive at a rapid rate, the data cannot be accessed randomly, and only one or at most a small number of passes over the data can be made to generate the clustering results. Data of this type are referred to as data streams. The data stream clustering problem requires a process capable of partitioning observations continuously while respecting memory and time restrictions. In the data stream clustering literature, a large number of algorithms use a two-phase scheme consisting of an online component that processes data stream points and produces summary statistics, and an offline component that uses the summary data to generate the clusters. An alternative class of algorithms generates the final clusters without an offline phase. This paper presents a comprehensive survey of data stream clustering methods and an overview of the most well-known streaming platforms that implement clustering.
@article{ghesmoune_state---art_2016,
	title = {State-of-the-art on clustering data streams},
	volume = {1},
	issn = {2058-6345},
	url = {https://doi.org/10.1186/s41044-016-0011-3},
	doi = {10.1186/s41044-016-0011-3},
	abstract = {Clustering is a key data mining task: the problem of partitioning a set of observations into clusters such that observations within a cluster are similar and observations in different clusters are dissimilar. The traditional setup, in which a static dataset is available in its entirety for random access, does not apply when the entire dataset is not available at the start of learning, the data continue to arrive at a rapid rate, the data cannot be accessed randomly, and only one or at most a small number of passes over the data can be made to generate the clustering results. Data of this type are referred to as data streams. The data stream clustering problem requires a process capable of partitioning observations continuously while respecting memory and time restrictions. In the data stream clustering literature, a large number of algorithms use a two-phase scheme consisting of an online component that processes data stream points and produces summary statistics, and an offline component that uses the summary data to generate the clusters. An alternative class of algorithms generates the final clusters without an offline phase. This paper presents a comprehensive survey of data stream clustering methods and an overview of the most well-known streaming platforms that implement clustering.},
	number = {1},
	urldate = {2022-03-25},
	journal = {Big Data Analytics},
	author = {Ghesmoune, Mohammed and Lebbah, Mustapha and Azzag, Hanene},
	month = dec,
	year = {2016},
	keywords = {Data stream clustering, State-of-the-art, Streaming platforms},
	pages = {13},
}
