A Novel Scalable DBSCAN Algorithm with Spark

A Novel Scalable DBSCAN Algorithm with Spark. Han, D., Agrawal, A., Liao, W., & Choudhary, A. In 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pages 1393–1402, May, 2016.

DBSCAN is a well-known density-based clustering algorithm that can identify arbitrarily shaped clusters and eliminate noise data. However, parallelizing DBSCAN is challenging: implementations based on MPI or OpenMP lack fault tolerance and give no guarantee that the workload is balanced. Moreover, programming with MPI requires data scientists to have advanced experience in handling communication between nodes, which is a big challenge. We present a new parallel DBSCAN algorithm using the new big data framework Spark. To reduce search time, we apply a kd-tree in our algorithm. More specifically, we propose a novel approach that avoids communication between executors so that partial clusters can be obtained locally and more efficiently. Based on the Java API, we select appropriate data structures carefully: a Queue to hold the neighbors of a data point, and a Hashtable for checking the status of and processing the data points. In addition, we use other advanced features of Spark to make our implementation more effective. We implement the algorithm in Java and evaluate its scalability using different numbers of processing cores. Our experiments demonstrate that the proposed algorithm scales up very well. Using data sets containing up to 1 million high-dimensional points, we show that the proposed algorithm achieves speedups of up to 6 using 8 cores (10k points), 10 using 32 cores (100k points), and 137 using 512 cores (1M points). In another experiment using 10k data points, the algorithm with MapReduce achieves speedups of 1.3 using 2 cores, 2.0 using 4 cores, and 3.2 using 8 cores.
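
As a rough illustration of the data-structure choice the abstract mentions, the following is a minimal single-machine sketch of DBSCAN cluster expansion, with a Queue holding the neighbors still to be processed and a hash map recording each point's status. It is not the authors' implementation: the class and method names are hypothetical, a brute-force scan stands in for the kd-tree search, a HashMap stands in for Hashtable, and the Spark partitioning and merging of partial clusters are omitted entirely.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical sketch class; names are illustrative, not taken from the paper.
class LocalDbscanSketch {
    static final int NOISE = -1;

    // Returns indices of all points within eps of point p (including p itself).
    // Assumption: a linear scan replaces the kd-tree search used in the paper.
    static List<Integer> regionQuery(double[][] pts, int p, double eps) {
        List<Integer> neighbors = new ArrayList<>();
        for (int q = 0; q < pts.length; q++) {
            double d2 = 0.0;
            for (int k = 0; k < pts[p].length; k++) {
                double diff = pts[p][k] - pts[q][k];
                d2 += diff * diff;
            }
            if (d2 <= eps * eps) {
                neighbors.add(q);
            }
        }
        return neighbors;
    }

    // Maps each point index to a cluster id (0, 1, ...) or NOISE.
    static Map<Integer, Integer> cluster(double[][] pts, double eps, int minPts) {
        Map<Integer, Integer> label = new HashMap<>(); // status of each visited point
        int clusterId = 0;
        for (int p = 0; p < pts.length; p++) {
            if (label.containsKey(p)) {
                continue; // already assigned to a cluster or marked as noise
            }
            List<Integer> seeds = regionQuery(pts, p, eps);
            if (seeds.size() < minPts) {
                label.put(p, NOISE); // may later be relabeled as a border point
                continue;
            }
            label.put(p, clusterId);
            Queue<Integer> queue = new ArrayDeque<>(seeds); // neighbors to process
            while (!queue.isEmpty()) {
                int q = queue.poll();
                Integer prev = label.get(q);
                if (prev != null && prev != NOISE) {
                    continue; // already belongs to a cluster
                }
                label.put(q, clusterId);
                if (prev != null) {
                    continue; // former noise point: border point, not expanded
                }
                List<Integer> qNeighbors = regionQuery(pts, q, eps);
                if (qNeighbors.size() >= minPts) {
                    queue.addAll(qNeighbors); // q is a core point: keep expanding
                }
            }
            clusterId++;
        }
        return label;
    }

    public static void main(String[] args) {
        double[][] pts = {
            {0, 0}, {0, 1}, {1, 0},       // first dense group
            {10, 10}, {10, 11}, {11, 10}, // second dense group
            {50, 50}                      // isolated point
        };
        Map<Integer, Integer> labels = cluster(pts, 1.5, 3);
        for (int i = 0; i < pts.length; i++) {
            System.out.println("point " + i + " -> " + labels.get(i));
        }
    }
}
```

In a Spark setting, a routine of this kind would plausibly run independently on each partition to produce partial clusters, which is consistent with the abstract's goal of avoiding communication between executors; how the partial clusters are then merged is not described in the abstract and is not sketched here.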

@inproceedings{han_novel_2016,
	title = {A {Novel} {Scalable} {DBSCAN} {Algorithm} with {Spark}},
	doi = {10.1109/IPDPSW.2016.57},
	abstract = {DBSCAN is a well-known clustering algorithm which is based on density and is able to identify arbitrary shaped clusters and eliminate noise data. However, parallelization of DBSCAN is a challenging work because based on MPI or OpenMP environments, there exist the issues of lack of fault-tolerance and there is no guarantee that workload is balanced. Moreover, programming with MPI requires data scientists to have an advanced experience to handle communication between nodes which is a big challenge. We present a new parallel DBSCAN algorithm using the new big data framework Spark. In order to reduce search time, we apply kd-tree in our algorithm. More specifically, we propose a novel approach to avoid communication between executors so that we can locally obtain partial clusters more efficiently. Based on Java API, we select appropriate data structures carefully: Using Queue to contain neighbors of the data point, and using Hashtable when checking the status of and processing the data points. In addition, we use other advanced features from Spark to make our implementation more effective. We implement the algorithm in Java and evaluate its scalability by using different number of processing cores. Our experiments demonstrate that the algorithm we propose scales up very well. Using data sets containing up to 1 million high-dimensional points, we show that our proposed algorithm achieves speedups up to 6 using 8 cores (10k), 10 using 32 cores (100k), and 137 using 512 cores (1m). Another experiment using 10k data points is conducted and the result shows that the algorithm with MapReduce achieves speedups to 1.3 using 2 cores, 2.0 using 4 cores, and 3.2 using 8 cores.},
	booktitle = {2016 {IEEE} {International} {Parallel} and {Distributed} {Processing} {Symposium} {Workshops} ({IPDPSW})},
	author = {Han, Dianwei and Agrawal, Ankit and Liao, Wei-Keng and Choudhary, Alok},
	month = may,
	year = {2016},
	keywords = {Algorithm design and analysis, Big data, Clustering algorithms, DBSCAN, Distributed databases, Programming, Spark framework, Sparks, bigdata, clustering},
	pages = {1393--1402},
}
