μDBSCAN: An Exact Scalable DBSCAN Algorithm for Big Data Exploiting Spatial Locality. Sarma, A., Goyal, P., Kumari, S., Wani, A., Challa, J. S., Islam, S., & Goyal, N. In 2019 IEEE International Conference on Cluster Computing (CLUSTER), pages 1–11, September, 2019. ISSN: 2168-9253
doi  abstract   bibtex   
DBSCAN is one of the most popular and effective clustering algorithms that is capable of identifying arbitrary-shaped clusters and noise efficiently. However, its super-linear complexity makes it infeasible for applications involving clustering of Big Data. A major portion of the computation time of DBSCAN is taken up by the neighborhood queries, which becomes a bottleneck to its performance. We address this issue in our proposed micro-cluster based DBSCAN algorithm, μDBSCAN, which identifies core-points even without performing neighbourhood queries and becomes instrumental in reducing the run-time of the algorithm. It also significantly reduces the computation time per neighbourhood query while producing exact DBSCAN clusters. Moreover, the micro-cluster based solution makes it scalable for high dimensional data. We also propose a highly scalable distributed implementation of μDBSCAN, μDBSCAN-D, to exploit a commodity cluster infrastructure. Experimental results demonstrate tremendous improvements in performance of our proposed algorithms as compared to their respective state-of-the-art solutions for various standard datasets. μDBSCAN-D is an exact parallel solution for DBSCAN which is capable of processing massive amounts of data efficiently (1 billion data points in 41 minutes on a 32 node cluster), while producing a clustering that is same as that of traditional DBSCAN.
@inproceedings{sarma_dbscan_2019,
	title = {μ{DBSCAN}: {An} {Exact} {Scalable} {DBSCAN} {Algorithm} for {Big} {Data} {Exploiting} {Spatial} {Locality}},
	shorttitle = {μ{DBSCAN}},
	doi = {10.1109/CLUSTER.2019.8891020},
	abstract = {DBSCAN is one of the most popular and effective clustering algorithms that is capable of identifying arbitrary-shaped clusters and noise efficiently. However, its super-linear complexity makes it infeasible for applications involving clustering of Big Data. A major portion of the computation time of DBSCAN is taken up by the neighborhood queries, which becomes a bottleneck to its performance. We address this issue in our proposed micro-cluster based DBSCAN algorithm, μDBSCAN, which identifies core-points even without performing neighbourhood queries and becomes instrumental in reducing the run-time of the algorithm. It also significantly reduces the computation time per neighbourhood query while producing exact DBSCAN clusters. Moreover, the micro-cluster based solution makes it scalable for high dimensional data. We also propose a highly scalable distributed implementation of μDBSCAN, μDBSCAN-D, to exploit a commodity cluster infrastructure. Experimental results demonstrate tremendous improvements in performance of our proposed algorithms as compared to their respective state-of-the-art solutions for various standard datasets. μDBSCAN-D is an exact parallel solution for DBSCAN which is capable of processing massive amounts of data efficiently (1 billion data points in 41 minutes on a 32 node cluster), while producing a clustering that is same as that of traditional DBSCAN.},
	booktitle = {2019 {IEEE} {International} {Conference} on {Cluster} {Computing} ({CLUSTER})},
	author = {Sarma, Aditya and Goyal, Poonam and Kumari, Sonal and Wani, Anand and Challa, Jagat Sesh and Islam, Saiyedul and Goyal, Navneet},
	month = sep,
	year = {2019},
	note = {ISSN: 2168-9253},
	keywords = {Approximation algorithms, Big Data, Clustering algorithms, Density-based Clustering, Distributed Computing, Distributed databases, Exact Clustering Algorithm, Optimized Neighborhood Query, Parallel algorithms, Partitioning algorithms, Spatial Locality, Time complexity},
	pages = {1--11},
}

Downloads: 0