Exploiting MapReduce and data compression for data-intensive applications. Ruan, G., Zhang, H., & Plale, B. In ACM International Conference Proceeding Series, 2013.
HPC platforms show good success for predominantly compute-intensive jobs; however, data-intensive jobs still struggle on HPC platforms, as large amounts of concurrent data movement from I/O nodes to compute nodes can easily saturate the network links. MapReduce, the "moving computation to data" paradigm for many pleasingly parallel applications, assumes that data are resident on local disks and that computation is scheduled where the data are located. However, on an HPC machine data must be staged from a broader file system (such as Lustre) to HDFS, where it can be accessed; this staging can represent a substantial delay in processing. In this paper we examine data compression's effect on reducing the bandwidth needed to deliver data to the application, as well as its impact on the overall performance of data-intensive applications. Our study examines two types of applications: a 3D time-series caries lesion assessment operating on a large-scale medical image dataset, and an HTRC word-counting task performing large-scale text analysis, running on XSEDE resources. Our extensive experimental results demonstrate significant performance improvements in terms of storage space, data stage-in time, and job execution time. © 2013 by the Association for Computing Machinery, Inc.
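The workflow the abstract describes can be sketched concretely. Below is a minimal Hadoop word-count job in the spirit of the paper's HTRC text-analysis workload, run over gzip-compressed input staged into HDFS; all class and path names are illustrative, not taken from the paper. Hadoop's default text input format selects a decompression codec from the file extension, so only compressed bytes cross the network during stage-in while the map and reduce logic stays unchanged.

// CompressedWordCount.java: word count over gzip-compressed text in HDFS.
// Illustrative sketch only; names here are not from the paper.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressedWordCount {

  // Emits (token, 1) for every whitespace-delimited token in a line.
  public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Sums the counts for each token; also usable as a combiner.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "compressed word count");
    job.setJarByClass(CompressedWordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // args[0] may point at .gz files staged into HDFS; the default input
    // format decompresses them transparently based on the file extension.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

One design caveat: gzip is not splittable, so each .gz file is consumed by a single map task; a splittable codec such as bzip2 preserves input parallelism at the cost of slower decompression.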
@inproceedings{Ruan2013,
 title = {Exploiting {MapReduce} and data compression for data-intensive applications},
 year = {2013},
 abstract = {HPC platforms show good success for predominantly compute-intensive jobs; however, data-intensive jobs still struggle on HPC platforms, as large amounts of concurrent data movement from I/O nodes to compute nodes can easily saturate the network links. MapReduce, the "moving computation to data" paradigm for many pleasingly parallel applications, assumes that data are resident on local disks and that computation is scheduled where the data are located. However, on an HPC machine data must be staged from a broader file system (such as Lustre) to HDFS, where it can be accessed; this staging can represent a substantial delay in processing. In this paper we examine data compression's effect on reducing the bandwidth needed to deliver data to the application, as well as its impact on the overall performance of data-intensive applications. Our study examines two types of applications: a 3D time-series caries lesion assessment operating on a large-scale medical image dataset, and an HTRC word-counting task performing large-scale text analysis, running on XSEDE resources. Our extensive experimental results demonstrate significant performance improvements in terms of storage space, data stage-in time, and job execution time. © 2013 by the Association for Computing Machinery, Inc.},
 author = {Ruan, G. and Zhang, H. and Plale, B.},
 doi = {10.1145/2484762.2484785},
 booktitle = {ACM International Conference Proceeding Series}
}
