Introduction to Harp: when Big Data Meets HPC. Zhang, B., Peng, B., Chen, L., Li, E., Zhou, Y., & Qiu, J. Technical Report 2017.
Data analytics is undergoing a revolution in many scientific domains, demanding cost-effective parallel data analysis techniques. We consider the challenges of creating a high-performance data analysis software framework in the context of the current HPC-ABDS software stack (High Performance Computing enhanced Apache Big Data Stack) [1]. We have summarized a list of current data processing software from both HPC and commercial sources [2]. Many critical components of the commodity stack (such as Hadoop) come from Apache open source projects for community usage, while HPC techniques (such as collective communication) are needed to bring performance and other parallel computing capabilities. Many machine learning algorithms are built on iterative computation, which can be formulated as A_t = F(D, A_{t-1}) (1), where D is the observed dataset, A is the model parameters to learn, and F is the model update function. The algorithm keeps updating model A until convergence, either by reaching a threshold criterion or a fixed number of iterations. This iterative procedure has several advantages: apparently simple functions, applied repeatedly, can produce complex behavior for interesting problems. The power of iteration and its extensions lies in the approximation or accuracy that can be obtained at each step, even if the computation stops abruptly before converging to the final answer. To effectively support large-scale data processing, Twister [3] introduced iterative MapReduce using long-running processes or threads with in-memory caching of invariant data. Harp [4] introduces the full set of collective communication operations listed in Table 1 (broadcast, reduce, allgather, allreduce, rotation, regroup, and push & pull), adding a separate communication abstraction; the Harp prototype implements the MapCollective concept as a plug-in to the Hadoop ecosystem (see Figure 1 and Figure 2). Instead of using the shuffle phase, Harp uses optimized collective communication operations for data movement, since fine-grained data alignment for multiple models is critical for improving performance. It further provides high-level interfaces with various synchronization patterns for parallelizing iterative computation. These enhancements make it possible to exploit HPC capabilities for big data software systems.
[Figure 1: Map-Collective Model (collective communication among map tasks replaces the MapReduce shuffle). Figure 2: Harp Architecture (Harp plugs into YARN alongside MapReduce V2, supporting both MapReduce and MapCollective applications).]
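The iterative update and collective synchronization described in the abstract can be sketched in a few lines of Java. The program below is not Harp code; it is a minimal, self-contained illustration (with hypothetical names such as allreduceSum and localGradient) of the pattern Harp optimizes: each map task computes a partial model update from its data partition, an allreduce-style collective combines the partials, and the global model A is updated as A_t = F(D, A_{t-1}) until convergence.

import java.util.Arrays;

// Minimal sketch (not the Harp API): each simulated worker computes a partial
// model update from its data partition; an allreduce-style sum combines the
// partials into the next global model, A_t = F(D, A_{t-1}), until convergence.
public class IterativeAllreduceSketch {

    // Hypothetical collective: sum the per-worker partial gradients (allreduce).
    static double allreduceSum(double[] partials) {
        double sum = 0.0;
        for (double p : partials) {
            sum += p;
        }
        return sum;
    }

    // Local part of F(D, A): gradient of a squared-error objective for fitting
    // the mean of this worker's partition, evaluated at the current model.
    static double localGradient(double[] partition, double model) {
        double g = 0.0;
        for (double x : partition) {
            g += (model - x);
        }
        return g;
    }

    public static void main(String[] args) {
        // Observed dataset D, split across 4 simulated workers (map tasks).
        double[][] partitions = {
            {1.0, 2.0}, {3.0, 4.0}, {5.0, 6.0}, {7.0, 8.0}
        };
        int n = Arrays.stream(partitions).mapToInt(p -> p.length).sum();

        double model = 0.0;        // model parameters A
        double learningRate = 0.5;

        for (int iter = 0; iter < 100; iter++) {
            // "Map" phase: each worker computes its partial update locally.
            double[] partials = new double[partitions.length];
            for (int k = 0; k < partitions.length; k++) {
                partials[k] = localGradient(partitions[k], model);
            }
            // "Collective" phase: an allreduce replaces the MapReduce shuffle.
            double globalGradient = allreduceSum(partials) / n;
            double next = model - learningRate * globalGradient;

            if (Math.abs(next - model) < 1e-9) {  // threshold criterion
                model = next;
                break;
            }
            model = next;
        }
        System.out.println("Converged model (mean of D): " + model);  // ~4.5
    }
}

In Harp itself the combination step would be carried out by its optimized collective operations (such as allreduce or rotation) over distributed map tasks, rather than by a local loop as in this single-process sketch.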
@techreport{
 title = {Introduction to Harp: when Big Data Meets HPC},
 type = {techreport},
 year = {2017},
 pages = {10},
 websites = {https://pdfs.semanticscholar.org/f69e/9f2852c881da4df0142360c745441075a28f.pdf?_ga=2.28162492.1830027813.1567539825-1063534713.1566236187},
 id = {d1ae1abc-8111-3180-bb02-5a653b5af0c6},
 created = {2019-10-01T17:21:02.693Z},
 file_attached = {true},
 profile_id = {42d295c0-0737-38d6-8b43-508cab6ea85d},
 last_modified = {2020-05-11T14:43:32.566Z},
 read = {false},
 starred = {false},
 authored = {true},
 confirmed = {false},
 hidden = {false},
 citation_key = {Zhang2017},
 private_publication = {false},
 abstract = {Data analytics is undergoing a revolution in many scientific domains, demanding cost-effective parallel data analysis techniques. We consider the challenges of creating a high-performance data analysis software framework in the context of the current HPC-ABDS software stack (High Performance Computing enhanced Apache Big Data Stack) [1]. We have summarized a list of current data processing software from both HPC and commercial sources [2]. Many critical components of the commodity stack (such as Hadoop) come from Apache open source projects for community usage, while HPC techniques (such as collective communication) are needed to bring performance and other parallel computing capabilities. Many machine learning algorithms are built on iterative computation, which can be formulated as A_t = F(D, A_{t-1}) (1), where D is the observed dataset, A is the model parameters to learn, and F is the model update function. The algorithm keeps updating model A until convergence, either by reaching a threshold criterion or a fixed number of iterations. This iterative procedure has several advantages: apparently simple functions, applied repeatedly, can produce complex behavior for interesting problems. The power of iteration and its extensions lies in the approximation or accuracy that can be obtained at each step, even if the computation stops abruptly before converging to the final answer. To effectively support large-scale data processing, Twister [3] introduced iterative MapReduce using long-running processes or threads with in-memory caching of invariant data. Harp [4] introduces the full set of collective communication operations listed in Table 1 (broadcast, reduce, allgather, allreduce, rotation, regroup, and push & pull), adding a separate communication abstraction; the Harp prototype implements the MapCollective concept as a plug-in to the Hadoop ecosystem (see Figure 1 and Figure 2). Instead of using the shuffle phase, Harp uses optimized collective communication operations for data movement, since fine-grained data alignment for multiple models is critical for improving performance. It further provides high-level interfaces with various synchronization patterns for parallelizing iterative computation. These enhancements make it possible to exploit HPC capabilities for big data software systems.},
 bibtype = {techreport},
 author = {Zhang, Bingjing and Peng, Bo and Chen, Langshi and Li, Ethan and Zhou, Yiming and Qiu, Judy}
}
