Revisiting Linpack Algorithm on Large-scale CPU-GPU Heterogeneous Systems

Chaoyang Shui, Xianzhi Yu, Yujin Yan, Yinshan Wang, Ke Meng, Guangming Tan
DOI: https://doi.org/10.1145/3332466.3374530
2020-01-01
Abstract: As the gap widens between GPU computing capability and that of other components (CPU, PCIe bus, and communication network), it is increasingly challenging to design high-performance parallel algorithms for large CPU-GPU heterogeneous systems. There are two main reasons. First, simply offloading the kernel library to the GPU incurs large-volume data transfers over the low-speed PCIe bus. Second, network communication overheads severely limit scalability. To address these issues, we advocate a paradigm shift to CPU-centric, fine-grained pipelining algorithm design. Taking the Linpack benchmark as a case study, we show the effectiveness of the new design paradigm. Our optimized Linpack program achieves 63.79 PFlops on 16384 GPUs. Its floating-point efficiency outperforms the NVIDIA proprietary counterparts by 5% on average.