Supporting queries and analyses of large-scale social media data with customizable and scalable indexing techniques over NoSQL databases. Gao, X. & Qiu, J. In Cluster, Cloud and Grid Computing (CCGrid), 2014 14th IEEE/ACM International Symposium on, pages 587-590, 2014. IEEE.
Social media data analysis demonstrates two special characteristics in Big Data processing. First, most analyses focus on data subsets related to specific social events or activities instead of the whole dataset. Second, analysis workflows consist of multiple stages, and algorithms applied in each stage may use different computation and communication patterns depending on processing frameworks. This paper presents our efforts in supporting the data storage and processing requirements for such characteristics. To achieve efficient queries about target data subsets, we propose a general customizable and scalable indexing framework that can be built over distributed NoSQL databases. This framework allows users to define suitable customized index structures for their query patterns against social media data, and supports scalable indexing of both historical and streaming data. We implement this framework on HBase, and name it Indexed HBase. Starting from Indexed HBase, we build a distributed analysis stack based on YARN to support analysis algorithms using different processing frameworks, such as Hadoop MapReduce, Harp, and Giraph. This analysis stack is used to host the Truthy social media data observatory, and we have applied the customized index structures in supporting both query evaluation and sophisticated analysis algorithms. Performance tests show that our solutions outperform implementations using both direct raw data scans and current indexing mechanisms in existing NoSQL databases. © 2014 IEEE.
@inproceedings{Gao2014,
 title = {Supporting queries and analyses of large-scale social media data with customizable and scalable indexing techniques over NoSQL databases},
 type = {inproceedings},
 year = {2014},
 pages = {587--590},
 publisher = {IEEE},
 id = {ffbe3f99-df76-3e38-92d6-b9ea3d6f7ac7},
 created = {2018-01-09T20:30:41.927Z},
 file_attached = {false},
 profile_id = {42d295c0-0737-38d6-8b43-508cab6ea85d},
 last_modified = {2020-05-11T14:43:44.469Z},
 read = {false},
 starred = {false},
 authored = {true},
 confirmed = {true},
 hidden = {false},
 citation_key = {Gao2014},
 source_type = {CONF},
 folder_uuids = {36d8ccf4-7085-47fa-8ab9-897283d082c5},
 private_publication = {false},
 abstract = {Social media data analysis demonstrates two special characteristics in Big Data processing. First, most analyses focus on data subsets related to specific social events or activities instead of the whole dataset. Second, analysis workflows consist of multiple stages, and algorithms applied in each stage may use different computation and communication patterns depending on processing frameworks. This paper presents our efforts in supporting the data storage and processing requirements for such characteristics. To achieve efficient queries about target data subsets, we propose a general customizable and scalable indexing framework that can be built over distributed NoSQL databases. This framework allows users to define suitable customized index structures for their query patterns against social media data, and supports scalable indexing of both historical and streaming data. We implement this framework on HBase, and name it Indexed HBase. Starting from Indexed HBase, we build a distributed analysis stack based on YARN to support analysis algorithms using different processing frameworks, such as Hadoop MapReduce, Harp, and Giraph. This analysis stack is used to host the Truthy social media data observatory, and we have applied the customized index structures in supporting both query evaluation and sophisticated analysis algorithms. Performance tests show that our solutions outperform implementations using both direct raw data scans and current indexing mechanisms in existing NoSQL databases. © 2014 IEEE.},
 bibtype = {inproceedings},
 author = {Gao, Xiaoming and Qiu, Judy},
 booktitle = {Cluster, Cloud and Grid Computing (CCGrid), 2014 14th IEEE/ACM International Symposium on}
}
