Harp-DAAL for High Performance Big Data Computing. Qiu, J. Parallel Universe, 3, 2018.
Large-scale data analytics is revolutionizing many business and scientific domains. Easy-to-use, scalable parallel techniques are necessary to process big data and gain meaningful insights. We introduce a novel HPC-Cloud convergence framework named Harp-DAAL and demonstrate that the combination of Big Data (Hadoop) and HPC techniques can simultaneously achieve productivity and performance. Harp is a distributed Hadoop-based framework that orchestrates efficient node synchronization [1]. Harp uses the Intel® Data Analytics Acceleration Library (DAAL) [2] for its highly optimized kernels on Intel® Xeon and Xeon Phi architectures. In this way, the high-level API of Big Data tools can be combined with intra-node fine-grained parallelism that is optimized for HPC platforms. We illustrate this framework in detail with K-means clustering, a compute-bound algorithm used in image clustering. We also show the broad applicability of Harp-DAAL by discussing the performance of three other big data algorithms: Subgraph Counting by color coding, Matrix Factorization, and Latent Dirichlet Allocation. They share difficult challenges, including load imbalance, irregular structure, and communication issues.
Figure 1: Cloud-HPC interoperable software for High Performance Big Data Analytics at Scale.
The categories in Figure 1 classify data-intensive computation into five computation models that map onto five distinct system architectures. The classification starts with Sequential, followed by centralized batch architectures corresponding exactly to the three forms of MapReduce: Map-Only, MapReduce, and Iterative MapReduce. Category five is the classic MPI model. Harp brings Hadoop users the benefits of supporting all five classes of data-intensive computation, from pleasingly parallel to machine learning and simulations.
We have expanded the applicability of Hadoop (with the Harp plugin) to more classes of Big Data applications, especially complex data analytics such as machine learning and graph processing. We have redesigned a modular software stack with native kernels (from DAAL) to effectively utilize scale-up servers for machine learning and data analytics applications. Harp-DAAL shows how simulations and Big Data can use common programming environments with a runtime based on a rich set of collectives and libraries.
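The Iterative MapReduce pattern that the abstract applies to K-means can be illustrated with a minimal single-process sketch. This is not Harp's or DAAL's API: the function names (`local_step`, `allreduce`, `kmeans`) and the list-of-partitions stand-in for worker nodes are assumptions made purely for illustration. Each "worker" computes partial centroid sums over its own data partition, and a collective-style reduction combines them before the next iteration.

```python
def local_step(points, centroids):
    """Map phase on one (simulated) worker: assign each point to its nearest
    centroid and return per-centroid partial sums and counts."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for p in points:
        # Index of the nearest centroid by squared Euclidean distance.
        k = min(
            range(len(centroids)),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])),
        )
        counts[k] += 1
        for d in range(dim):
            sums[k][d] += p[d]
    return sums, counts


def allreduce(partials):
    """Hypothetical stand-in for a collective operation: element-wise sum of
    every worker's partial sums and counts."""
    num_centroids = len(partials[0][1])
    dim = len(partials[0][0][0])
    total_sums = [[0.0] * dim for _ in range(num_centroids)]
    total_counts = [0] * num_centroids
    for sums, counts in partials:
        for j in range(num_centroids):
            total_counts[j] += counts[j]
            for d in range(dim):
                total_sums[j][d] += sums[j][d]
    return total_sums, total_counts


def kmeans(partitions, centroids, iterations=10):
    """Iterate: local computation per partition, then a global reduction."""
    for _ in range(iterations):
        partials = [local_step(part, centroids) for part in partitions]
        sums, counts = allreduce(partials)
        # Recompute each centroid; keep the old one if no point was assigned.
        centroids = [
            [s / c for s in sums[j]] if c else centroids[j]
            for j, c in enumerate(counts)
        ]
    return centroids
```

For example, `kmeans([[(0, 0), (0, 1)], [(10, 10), (10, 11)]], [[0.0, 0.0], [10.0, 10.0]], iterations=3)` treats the two inner lists as the data held by two workers. In a real distributed setting the list comprehension over partitions would run in parallel across nodes, and the reduction would be a network collective rather than a local loop.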
