Approximate Distributed K-Means Clustering over a Peer-to-Peer Network

Approximate Distributed K-Means Clustering over a Peer-to-Peer Network. Datta, S., Giannella, C., & Kargupta, H. IEEE Transactions on Knowledge and Data Engineering, 21(10):1372–1388, October, 2009. Publisher: Published by the IEEE Computer Society

Paper doi abstract bibtex

Data intensive Peer-to-Peer (P2P) networks are finding increasing number of applications. Data mining in such P2P environments is a natural extension. However, common monolithic data mining architectures do not fit well in such environments since they typically require centralizing the distributed data which is usually not practical in a large P2P network. Distributed data mining algorithms that avoid large-scale synchronization or data centralization offer an alternate choice. This paper considers the distributed K-means clustering problem where the data and computing resources are distributed over a large P2P network. It offers two algorithms which produce an approximation of the result produced by the standard centralized K-means clustering algorithm. The first is designed to operate in a dynamic P2P network that can produce clusterings by “local” synchronization only. The second algorithm uses uniformly sampled peers and provides analytical guarantees regarding the accuracy of clustering on a P2P network. Empirical results show that both the algorithms demonstrate good performance compared to their centralized counterparts at the modest communication cost.

@article{Datta2009,
	title = {Approximate {Distributed} {K}-{Means} {Clustering} over a {Peer}-to-{Peer} {Network}},
	volume = {21},
	issn = {1041-4347},
	url = {http://www.computer.org/portal/web/csdl/doi/10.1109/TKDE.2008.222},
	doi = {10.1109/TKDE.2008.222},
	abstract = {Data intensive Peer-to-Peer (P2P) networks are finding increasing number of applications. Data mining in such P2P environments is a natural extension. However, common monolithic data mining architectures do not fit well in such environments since they typically require centralizing the distributed data which is usually not practical in a large P2P network. Distributed data mining algorithms that avoid large-scale synchronization or data centralization offer an alternate choice. This paper considers the distributed K-means clustering problem where the data and computing resources are distributed over a large P2P network. It offers two algorithms which produce an approximation of the result produced by the standard centralized K-means clustering algorithm. The first is designed to operate in a dynamic P2P network that can produce clusterings by “local” synchronization only. The second algorithm uses uniformly sampled peers and provides analytical guarantees regarding the accuracy of clustering on a P2P network. Empirical results show that both the algorithms demonstrate good performance compared to their centralized counterparts at the modest communication cost.},
	number = {10},
	urldate = {2011-11-06},
	journal = {IEEE Transactions on Knowledge and Data Engineering},
	author = {Datta, S. and Giannella, C.R. and Kargupta, H.},
	month = oct,
	year = {2009},
	note = {Publisher: Published by the IEEE Computer Society},
	pages = {1372--1388},
}

Downloads: 0

{"_id":"uLEoqsQo3WFQppGGj","bibbaseid":"datta-giannella-kargupta-approximatedistributedkmeansclusteringoverapeertopeernetwork-2009","author_short":["Datta, S.","Giannella, C.","Kargupta, H."],"bibdata":{"bibtype":"article","type":"article","title":"Approximate Distributed K-Means Clustering over a Peer-to-Peer Network","volume":"21","issn":"1041-4347","url":"http://www.computer.org/portal/web/csdl/doi/10.1109/TKDE.2008.222","doi":"10.1109/TKDE.2008.222","abstract":"Data intensive Peer-to-Peer (P2P) networks are finding increasing number of applications. Data mining in such P2P environments is a natural extension. However, common monolithic data mining architectures do not fit well in such environments since they typically require centralizing the distributed data which is usually not practical in a large P2P network. Distributed data mining algorithms that avoid large-scale synchronization or data centralization offer an alternate choice. This paper considers the distributed K-means clustering problem where the data and computing resources are distributed over a large P2P network. It offers two algorithms which produce an approximation of the result produced by the standard centralized K-means clustering algorithm. The first is designed to operate in a dynamic P2P network that can produce clusterings by “local” synchronization only. The second algorithm uses uniformly sampled peers and provides analytical guarantees regarding the accuracy of clustering on a P2P network. Empirical results show that both the algorithms demonstrate good performance compared to their centralized counterparts at the modest communication cost.","number":"10","urldate":"2011-11-06","journal":"IEEE Transactions on Knowledge and Data Engineering","author":[{"propositions":[],"lastnames":["Datta"],"firstnames":["S."],"suffixes":[]},{"propositions":[],"lastnames":["Giannella"],"firstnames":["C.R."],"suffixes":[]},{"propositions":[],"lastnames":["Kargupta"],"firstnames":["H."],"suffixes":[]}],"month":"October","year":"2009","note":"Publisher: Published by the IEEE Computer Society","pages":"1372–1388","bibtex":"@article{Datta2009,\n\ttitle = {Approximate {Distributed} {K}-{Means} {Clustering} over a {Peer}-to-{Peer} {Network}},\n\tvolume = {21},\n\tissn = {1041-4347},\n\turl = {http://www.computer.org/portal/web/csdl/doi/10.1109/TKDE.2008.222},\n\tdoi = {10.1109/TKDE.2008.222},\n\tabstract = {Data intensive Peer-to-Peer (P2P) networks are finding increasing number of applications. Data mining in such P2P environments is a natural extension. However, common monolithic data mining architectures do not fit well in such environments since they typically require centralizing the distributed data which is usually not practical in a large P2P network. Distributed data mining algorithms that avoid large-scale synchronization or data centralization offer an alternate choice. This paper considers the distributed K-means clustering problem where the data and computing resources are distributed over a large P2P network. It offers two algorithms which produce an approximation of the result produced by the standard centralized K-means clustering algorithm. The first is designed to operate in a dynamic P2P network that can produce clusterings by “local” synchronization only. The second algorithm uses uniformly sampled peers and provides analytical guarantees regarding the accuracy of clustering on a P2P network. Empirical results show that both the algorithms demonstrate good performance compared to their centralized counterparts at the modest communication cost.},\n\tnumber = {10},\n\turldate = {2011-11-06},\n\tjournal = {IEEE Transactions on Knowledge and Data Engineering},\n\tauthor = {Datta, S. and Giannella, C.R. and Kargupta, H.},\n\tmonth = oct,\n\tyear = {2009},\n\tnote = {Publisher: Published by the IEEE Computer Society},\n\tpages = {1372--1388},\n}\n\n","author_short":["Datta, S.","Giannella, C.","Kargupta, H."],"key":"Datta2009","id":"Datta2009","bibbaseid":"datta-giannella-kargupta-approximatedistributedkmeansclusteringoverapeertopeernetwork-2009","role":"author","urls":{"Paper":"http://www.computer.org/portal/web/csdl/doi/10.1109/TKDE.2008.222"},"metadata":{"authorlinks":{}},"html":""},"bibtype":"article","biburl":"https://bibbase.org/zotero/ifromm","dataSources":["N4kJAiLiJ7kxfNsoh"],"keywords":[],"search_terms":["approximate","distributed","means","clustering","over","peer","peer","network","datta","giannella","kargupta"],"title":"Approximate Distributed K-Means Clustering over a Peer-to-Peer Network","year":2009}