Approximate Distributed K-Means Clustering over a Peer-to-Peer Network. Datta, S., Giannella, C., & Kargupta, H. IEEE Transactions on Knowledge and Data Engineering, 21(10):1372–1388, October, 2009. Publisher: Published by the IEEE Computer Society
Approximate Distributed K-Means Clustering over a Peer-to-Peer Network [link]Paper  doi  abstract   bibtex   
Data intensive Peer-to-Peer (P2P) networks are finding increasing number of applications. Data mining in such P2P environments is a natural extension. However, common monolithic data mining architectures do not fit well in such environments since they typically require centralizing the distributed data which is usually not practical in a large P2P network. Distributed data mining algorithms that avoid large-scale synchronization or data centralization offer an alternate choice. This paper considers the distributed K-means clustering problem where the data and computing resources are distributed over a large P2P network. It offers two algorithms which produce an approximation of the result produced by the standard centralized K-means clustering algorithm. The first is designed to operate in a dynamic P2P network that can produce clusterings by “local” synchronization only. The second algorithm uses uniformly sampled peers and provides analytical guarantees regarding the accuracy of clustering on a P2P network. Empirical results show that both the algorithms demonstrate good performance compared to their centralized counterparts at the modest communication cost.
@article{Datta2009,
	title = {Approximate {Distributed} {K}-{Means} {Clustering} over a {Peer}-to-{Peer} {Network}},
	volume = {21},
	issn = {1041-4347},
	url = {http://www.computer.org/portal/web/csdl/doi/10.1109/TKDE.2008.222},
	doi = {10.1109/TKDE.2008.222},
	abstract = {Data intensive Peer-to-Peer (P2P) networks are finding increasing number of applications. Data mining in such P2P environments is a natural extension. However, common monolithic data mining architectures do not fit well in such environments since they typically require centralizing the distributed data which is usually not practical in a large P2P network. Distributed data mining algorithms that avoid large-scale synchronization or data centralization offer an alternate choice. This paper considers the distributed K-means clustering problem where the data and computing resources are distributed over a large P2P network. It offers two algorithms which produce an approximation of the result produced by the standard centralized K-means clustering algorithm. The first is designed to operate in a dynamic P2P network that can produce clusterings by “local” synchronization only. The second algorithm uses uniformly sampled peers and provides analytical guarantees regarding the accuracy of clustering on a P2P network. Empirical results show that both the algorithms demonstrate good performance compared to their centralized counterparts at the modest communication cost.},
	number = {10},
	urldate = {2011-11-06},
	journal = {IEEE Transactions on Knowledge and Data Engineering},
	author = {Datta, S. and Giannella, C.R. and Kargupta, H.},
	month = oct,
	year = {2009},
	note = {Publisher: Published by the IEEE Computer Society},
	pages = {1372--1388},
}

Downloads: 0