Leveraging Network-level parallelism with Multiple Process-Endpoints for MPI Broadcast. Ruhela, A., Ramesh, B., Chakraborty, S., Hashmi, J. M., Subramoni, H., & Panda, D. K. In Third Annual Workshop on Emerging Parallel and Distributed Runtime Systems and Middleware (IPDRM) held in conjunction with SC'19, 2019. IEEE/ACM.
@inproceedings{ruhela-ipdrm19,
  title={Leveraging Network-level parallelism with Multiple Process-Endpoints for MPI Broadcast},
  author={Ruhela, Amit and Ramesh, Bharath and Chakraborty, Sourav and Hashmi, Jahanzeb Maqbool and Subramoni, Hari and Panda, Dhabaleswar K.},
  booktitle={Third Annual Workshop on Emerging Parallel and Distributed Runtime Systems and Middleware (IPDRM) held in conjunction with SC'19},
  year={2019},
  abstract={The Message Passing Interface has been the dominant programming model for developing scalable and high-performance parallel applications. Collective operations enable group communication in a portable and efficient manner and are used by a large number of applications across different domains. Optimization of collective operations is the key to achieving good speed-ups and portability. Broadcast, or one-to-all communication, is one of the most commonly used collectives in MPI applications. However, the existing algorithms for broadcast do not effectively utilize the high degree of parallelism and increased message rate capabilities offered by modern architectures. In this paper, we address these challenges and propose a Scalable Multi-Endpoint broadcast algorithm that combines hierarchical communication with multiple endpoints per node for high performance and scalability. We evaluate the proposed algorithm against state-of-the-art designs in other MPI libraries, including MVAPICH2, Intel MPI, and Spectrum MPI. We demonstrate the benefits of the proposed algorithm at the benchmark and application levels at scale on four different hardware architectures, including Intel Cascade Lake, Intel Skylake, AMD EPYC, and IBM POWER9, and with InfiniBand and Omni-Path interconnects. Compared to other state-of-the-art designs, our proposed design shows up to 2.5 times performance improvement at the microbenchmark level with 128 nodes. We also observe up to 37\% improvement in broadcast communication latency for the SPECMPI scientific applications.},
  organization={IEEE/ACM}
}
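The abstract describes a two-level scheme: an inter-node phase from the root to one leader rank per node, spread across multiple endpoints for higher message rates, followed by an intra-node phase from each leader to its local ranks. The following standalone Python sketch simulates that structure only; it is not the paper's implementation, and the function name, rank layout, and round-robin endpoint striping are illustrative assumptions.

```python
# Schematic simulation of a hierarchical, multi-endpoint broadcast
# (illustrative only; not the MVAPICH2-based design from the paper).
# Ranks are laid out node-by-node: node i holds ranks
# [i * ranks_per_node, (i + 1) * ranks_per_node).

def hierarchical_bcast(message, ranks_per_node, num_nodes, num_endpoints):
    """Return (buffers, endpoint_loads) after a two-level broadcast from rank 0.

    buffers maps each rank to the data it holds; endpoint_loads counts how
    many inter-node sends each endpoint carried (round-robin striping).
    """
    buffers = {0: message}
    endpoint_loads = [0] * num_endpoints

    # Inter-node phase: root sends to one leader rank per node,
    # striping the sends across the available endpoints.
    leaders = [node * ranks_per_node for node in range(num_nodes)]
    for i, leader in enumerate(leaders):
        endpoint_loads[i % num_endpoints] += 1  # modeled endpoint usage
        buffers[leader] = message               # modeled network transfer

    # Intra-node phase: each leader copies the message to the
    # remaining ranks on its node (e.g. via shared memory).
    for leader in leaders:
        for offset in range(1, ranks_per_node):
            buffers[leader + offset] = buffers[leader]

    return buffers, endpoint_loads

bufs, loads = hierarchical_bcast("payload", ranks_per_node=4,
                                 num_nodes=3, num_endpoints=2)
assert all(bufs[r] == "payload" for r in range(12))
```

With more endpoints, the inter-node sends in the first phase can proceed concurrently instead of serializing on a single process-endpoint, which is the source of the message-rate benefit the abstract claims.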
