Evaluating ARM HPC clusters for scientific workloads. Maqbool, J.; Oh, S.; and Fox, G. C Concurrency and Computation: Practice and Experience, 27(17):5390–5410, Wiley Online Library, 2015.
abstract   bibtex   
The power consumption of modern high‐performance computing (HPC) systems that are built using power hungry commodity servers is one of the major hurdles for achieving Exascale computation. Several efforts have been made by the HPC community to encourage the use of low‐powered system‐on‐chip (SoC) embedded processors in large‐scale HPC systems. These initiatives have successfully demonstrated the use of ARM SoCs in HPC systems, but there is still a need to analyze the viability of these systems for HPC platforms before a case can be made for Exascale computation. The major shortcomings of current ARM‐HPC evaluations include a lack of detailed insights about performance levels on distributed multicore systems and performance levels for benchmarking in large‐scale applications running on HPC. In this paper, we present a comprehensive evaluation of results that covers major aspects of server and HPC benchmarking for ARM‐based SoCs. For the experiments, we built an unconventional cluster of ARM Cortex‐A9s that is referred to as Weiser and ran single‐node benchmarks (STREAM, Sysbench, and PARSEC) and multi‐node scientific benchmarks (High‐performance Linpack (HPL), NASA Advanced Supercomputing (NAS) Parallel Benchmark, and Gadget‐2) in order to provide a baseline for performance limitations of the system. Based on the experimental results, we claim that the performance of ARM SoCs depends heavily on the memory bandwidth, network latency, application class, workload type, and support for compiler optimizations. During server‐based benchmarking, we observed that when performing memory intensive benchmarks for database transactions, x86 performed 12% better for multithreaded query processing. However, ARM performed four times better for performance to power ratios for a single core and 2.6 times better on four cores. We noticed that emulated double precision floating point in Java resulted in three to four times slower performance as compared with the performance in C for CPU‐bound benchmarks. Even though Intel x86 performed slightly better in computation‐oriented applications, ARM showed better scalability in I/O bound applications for shared memory benchmarks. We incorporated the support for ARM in the MPJ‐Express runtime and performed comparative analysis of two widely used message passing libraries. We obtained similar results for network bandwidth, large‐scale application scaling, floating‐point performance, and energy‐efficiency for clusters in message passing evaluations (NBP and Gadget 2 with MPJ‐Express and MPICH). Our findings can be used to evaluate the energy efficiency of ARM‐based clusters for server workloads and scientific workloads and to provide a guideline for building energy‐efficient HPC clusters.
@article{maqbool2015evaluating,
  title={{Evaluating ARM HPC clusters for scientific workloads}},
  author={Maqbool, Jahanzeb and Oh, Sangyoon and Fox, Geoffrey C},
  journal={Concurrency and Computation: Practice and Experience},
  volume={27},
  number={17},
  pages={5390--5410},
  year={2015},
  abstract={The power consumption of modern high‐performance computing (HPC) systems that are built using power hungry commodity servers is one of the major hurdles for achieving Exascale computation. Several efforts have been made by the HPC community to encourage the use of low‐powered system‐on‐chip (SoC) embedded processors in large‐scale HPC systems. These initiatives have successfully demonstrated the use of ARM SoCs in HPC systems, but there is still a need to analyze the viability of these systems for HPC platforms before a case can be made for Exascale computation. The major shortcomings of current ARM‐HPC evaluations include a lack of detailed insights about performance levels on distributed multicore systems and performance levels for benchmarking in large‐scale applications running on HPC. In this paper, we present a comprehensive evaluation of results that covers major aspects of server and HPC benchmarking for ARM‐based SoCs. For the experiments, we built an unconventional cluster of ARM Cortex‐A9s that is referred to as Weiser and ran single‐node benchmarks (STREAM, Sysbench, and PARSEC) and multi‐node scientific benchmarks (High‐performance Linpack (HPL), NASA Advanced Supercomputing (NAS) Parallel Benchmark, and Gadget‐2) in order to provide a baseline for performance limitations of the system. Based on the experimental results, we claim that the performance of ARM SoCs depends heavily on the memory bandwidth, network latency, application class, workload type, and support for compiler optimizations. During server‐based benchmarking, we observed that when performing memory intensive benchmarks for database transactions, x86 performed 12\% better for multithreaded query processing. However, ARM performed four times better for performance to power ratios for a single core and 2.6 times better on four cores. We noticed that emulated double precision floating point in Java resulted in three to four times slower performance as compared with the performance in C for CPU‐bound benchmarks. Even though Intel x86 performed slightly better in computation‐oriented applications, ARM showed better scalability in I/O bound applications for shared memory benchmarks. We incorporated the support for ARM in the MPJ‐Express runtime and performed comparative analysis of two widely used message passing libraries. We obtained similar results for network bandwidth, large‐scale application scaling, floating‐point performance, and energy‐efficiency for clusters in message passing evaluations (NBP and Gadget 2 with MPJ‐Express and MPICH). Our findings can be used to evaluate the energy efficiency of ARM‐based clusters for server workloads and scientific workloads and to provide a guideline for building energy‐efficient HPC clusters.},
  publisher={Wiley Online Library}
}
Downloads: 0