μDBSCAN: An Exact Scalable DBSCAN Algorithm for Big Data Exploiting Spatial Locality

μDBSCAN: An Exact Scalable DBSCAN Algorithm for Big Data Exploiting Spatial Locality. Sarma, A., Goyal, P., Kumari, S., Wani, A., Challa, J. S., Islam, S., & Goyal, N. In 2019 IEEE International Conference on Cluster Computing (CLUSTER), pages 1–11, September, 2019. ISSN: 2168-9253
doi abstract bibtex

DBSCAN is one of the most popular and effective clustering algorithms that is capable of identifying arbitrary-shaped clusters and noise efficiently. However, its super-linear complexity makes it infeasible for applications involving clustering of Big Data. A major portion of the computation time of DBSCAN is taken up by the neighborhood queries, which becomes a bottleneck to its performance. We address this issue in our proposed micro-cluster based DBSCAN algorithm, μDBSCAN, which identifies core-points even without performing neighbourhood queries and becomes instrumental in reducing the run-time of the algorithm. It also significantly reduces the computation time per neighbourhood query while producing exact DBSCAN clusters. Moreover, the micro-cluster based solution makes it scalable for high dimensional data. We also propose a highly scalable distributed implementation of μDBSCAN, μDBSCAN-D, to exploit a commodity cluster infrastructure. Experimental results demonstrate tremendous improvements in performance of our proposed algorithms as compared to their respective state-of-the-art solutions for various standard datasets. μDBSCAN-D is an exact parallel solution for DBSCAN which is capable of processing massive amounts of data efficiently (1 billion data points in 41 minutes on a 32 node cluster), while producing a clustering that is same as that of traditional DBSCAN.

@inproceedings{sarma_dbscan_2019,
	title = {μ{DBSCAN}: {An} {Exact} {Scalable} {DBSCAN} {Algorithm} for {Big} {Data} {Exploiting} {Spatial} {Locality}},
	shorttitle = {μ{DBSCAN}},
	doi = {10.1109/CLUSTER.2019.8891020},
	abstract = {DBSCAN is one of the most popular and effective clustering algorithms that is capable of identifying arbitrary-shaped clusters and noise efficiently. However, its super-linear complexity makes it infeasible for applications involving clustering of Big Data. A major portion of the computation time of DBSCAN is taken up by the neighborhood queries, which becomes a bottleneck to its performance. We address this issue in our proposed micro-cluster based DBSCAN algorithm, μDBSCAN, which identifies core-points even without performing neighbourhood queries and becomes instrumental in reducing the run-time of the algorithm. It also significantly reduces the computation time per neighbourhood query while producing exact DBSCAN clusters. Moreover, the micro-cluster based solution makes it scalable for high dimensional data. We also propose a highly scalable distributed implementation of μDBSCAN, μDBSCAN-D, to exploit a commodity cluster infrastructure. Experimental results demonstrate tremendous improvements in performance of our proposed algorithms as compared to their respective state-of-the-art solutions for various standard datasets. μDBSCAN-D is an exact parallel solution for DBSCAN which is capable of processing massive amounts of data efficiently (1 billion data points in 41 minutes on a 32 node cluster), while producing a clustering that is same as that of traditional DBSCAN.},
	booktitle = {2019 {IEEE} {International} {Conference} on {Cluster} {Computing} ({CLUSTER})},
	author = {Sarma, Aditya and Goyal, Poonam and Kumari, Sonal and Wani, Anand and Challa, Jagat Sesh and Islam, Saiyedul and Goyal, Navneet},
	month = sep,
	year = {2019},
	note = {ISSN: 2168-9253},
	keywords = {Approximation algorithms, Big Data, Clustering algorithms, Density-based Clustering, Distributed Computing, Distributed databases, Exact Clustering Algorithm, Optimized Neighborhood Query, Parallel algorithms, Partitioning algorithms, Spatial Locality, Time complexity},
	pages = {1--11},
}

Downloads: 0

{"_id":"qWAbxBRxbCy3sCMbv","bibbaseid":"sarma-goyal-kumari-wani-challa-islam-goyal-dbscananexactscalabledbscanalgorithmforbigdataexploitingspatiallocality-2019","author_short":["Sarma, A.","Goyal, P.","Kumari, S.","Wani, A.","Challa, J. S.","Islam, S.","Goyal, N."],"bibdata":{"bibtype":"inproceedings","type":"inproceedings","title":"μDBSCAN: An Exact Scalable DBSCAN Algorithm for Big Data Exploiting Spatial Locality","shorttitle":"μDBSCAN","doi":"10.1109/CLUSTER.2019.8891020","abstract":"DBSCAN is one of the most popular and effective clustering algorithms that is capable of identifying arbitrary-shaped clusters and noise efficiently. However, its super-linear complexity makes it infeasible for applications involving clustering of Big Data. A major portion of the computation time of DBSCAN is taken up by the neighborhood queries, which becomes a bottleneck to its performance. We address this issue in our proposed micro-cluster based DBSCAN algorithm, μDBSCAN, which identifies core-points even without performing neighbourhood queries and becomes instrumental in reducing the run-time of the algorithm. It also significantly reduces the computation time per neighbourhood query while producing exact DBSCAN clusters. Moreover, the micro-cluster based solution makes it scalable for high dimensional data. We also propose a highly scalable distributed implementation of μDBSCAN, μDBSCAN-D, to exploit a commodity cluster infrastructure. Experimental results demonstrate tremendous improvements in performance of our proposed algorithms as compared to their respective state-of-the-art solutions for various standard datasets. μDBSCAN-D is an exact parallel solution for DBSCAN which is capable of processing massive amounts of data efficiently (1 billion data points in 41 minutes on a 32 node cluster), while producing a clustering that is same as that of traditional DBSCAN.","booktitle":"2019 IEEE International Conference on Cluster Computing (CLUSTER)","author":[{"propositions":[],"lastnames":["Sarma"],"firstnames":["Aditya"],"suffixes":[]},{"propositions":[],"lastnames":["Goyal"],"firstnames":["Poonam"],"suffixes":[]},{"propositions":[],"lastnames":["Kumari"],"firstnames":["Sonal"],"suffixes":[]},{"propositions":[],"lastnames":["Wani"],"firstnames":["Anand"],"suffixes":[]},{"propositions":[],"lastnames":["Challa"],"firstnames":["Jagat","Sesh"],"suffixes":[]},{"propositions":[],"lastnames":["Islam"],"firstnames":["Saiyedul"],"suffixes":[]},{"propositions":[],"lastnames":["Goyal"],"firstnames":["Navneet"],"suffixes":[]}],"month":"September","year":"2019","note":"ISSN: 2168-9253","keywords":"Approximation algorithms, Big Data, Clustering algorithms, Density-based Clustering, Distributed Computing, Distributed databases, Exact Clustering Algorithm, Optimized Neighborhood Query, Parallel algorithms, Partitioning algorithms, Spatial Locality, Time complexity","pages":"1–11","bibtex":"@inproceedings{sarma_dbscan_2019,\n\ttitle = {μ{DBSCAN}: {An} {Exact} {Scalable} {DBSCAN} {Algorithm} for {Big} {Data} {Exploiting} {Spatial} {Locality}},\n\tshorttitle = {μ{DBSCAN}},\n\tdoi = {10.1109/CLUSTER.2019.8891020},\n\tabstract = {DBSCAN is one of the most popular and effective clustering algorithms that is capable of identifying arbitrary-shaped clusters and noise efficiently. However, its super-linear complexity makes it infeasible for applications involving clustering of Big Data. A major portion of the computation time of DBSCAN is taken up by the neighborhood queries, which becomes a bottleneck to its performance. We address this issue in our proposed micro-cluster based DBSCAN algorithm, μDBSCAN, which identifies core-points even without performing neighbourhood queries and becomes instrumental in reducing the run-time of the algorithm. It also significantly reduces the computation time per neighbourhood query while producing exact DBSCAN clusters. Moreover, the micro-cluster based solution makes it scalable for high dimensional data. We also propose a highly scalable distributed implementation of μDBSCAN, μDBSCAN-D, to exploit a commodity cluster infrastructure. Experimental results demonstrate tremendous improvements in performance of our proposed algorithms as compared to their respective state-of-the-art solutions for various standard datasets. μDBSCAN-D is an exact parallel solution for DBSCAN which is capable of processing massive amounts of data efficiently (1 billion data points in 41 minutes on a 32 node cluster), while producing a clustering that is same as that of traditional DBSCAN.},\n\tbooktitle = {2019 {IEEE} {International} {Conference} on {Cluster} {Computing} ({CLUSTER})},\n\tauthor = {Sarma, Aditya and Goyal, Poonam and Kumari, Sonal and Wani, Anand and Challa, Jagat Sesh and Islam, Saiyedul and Goyal, Navneet},\n\tmonth = sep,\n\tyear = {2019},\n\tnote = {ISSN: 2168-9253},\n\tkeywords = {Approximation algorithms, Big Data, Clustering algorithms, Density-based Clustering, Distributed Computing, Distributed databases, Exact Clustering Algorithm, Optimized Neighborhood Query, Parallel algorithms, Partitioning algorithms, Spatial Locality, Time complexity},\n\tpages = {1--11},\n}\n\n\n\n","author_short":["Sarma, A.","Goyal, P.","Kumari, S.","Wani, A.","Challa, J. S.","Islam, S.","Goyal, N."],"key":"sarma_dbscan_2019","id":"sarma_dbscan_2019","bibbaseid":"sarma-goyal-kumari-wani-challa-islam-goyal-dbscananexactscalabledbscanalgorithmforbigdataexploitingspatiallocality-2019","role":"author","urls":{},"keyword":["Approximation algorithms","Big Data","Clustering algorithms","Density-based Clustering","Distributed Computing","Distributed databases","Exact Clustering Algorithm","Optimized Neighborhood Query","Parallel algorithms","Partitioning algorithms","Spatial Locality","Time complexity"],"metadata":{"authorlinks":{}},"html":""},"bibtype":"inproceedings","biburl":"https://bibbase.org/zotero/mh_lenguyen","dataSources":["iwKepCrWBps7ojhDx"],"keywords":["approximation algorithms","big data","clustering algorithms","density-based clustering","distributed computing","distributed databases","exact clustering algorithm","optimized neighborhood query","parallel algorithms","partitioning algorithms","spatial locality","time complexity"],"search_terms":["dbscan","exact","scalable","dbscan","algorithm","big","data","exploiting","spatial","locality","sarma","goyal","kumari","wani","challa","islam","goyal"],"title":"μDBSCAN: An Exact Scalable DBSCAN Algorithm for Big Data Exploiting Spatial Locality","year":2019}