HiCCL: A Hierarchical Collective Communication Library

Mert Hidayetoglu, Simon Garcia de Gonzalo, Elliott Slaughter, Pinku Surana, Wen-mei Hwu, William Gropp, Alex Aiken
2024-08-12
Abstract: HiCCL (Hierarchical Collective Communication Library) addresses the growing complexity and diversity in high-performance network architectures. As GPU systems have evolved into networks of GPUs with different multilevel communication hierarchies, optimizing each collective function for a specific system has become a challenging task. Consequently, many collective libraries struggle to adapt to different hardware and software, especially across systems from different vendors. HiCCL's library design decouples the collective communication logic from network-specific optimizations through a compositional API. The communication logic is composed using multicast, reduction, and fence primitives, which are then factorized for a specified network hierarchy using only point-to-point operations within a level. Finally, striping and pipelining optimizations are applied as specified to streamline the execution. Performance evaluation of HiCCL across four different machines (two with Nvidia GPUs, one with AMD GPUs, and one with Intel GPUs) demonstrates an average 17$\times$ higher throughput than the collectives of highly specialized GPU-aware MPI implementations, and competitive throughput with those of vendor-specific libraries (NCCL, RCCL, and OneCCL), while providing portability across all four machines.
Distributed, Parallel, and Cluster Computing
What problem does this paper attempt to address?
This paper addresses the problem of optimizing collective communication functions for increasingly complex and diverse high-performance computing network architectures, with the goal of achieving both high efficiency and portability across different hardware and software systems. Specifically, as GPU systems evolve into networks of GPUs with multi-level communication hierarchies, optimizing each collective function for a specific system has become very challenging. As a result, many collective libraries have difficulty adapting to different hardware and software, especially across systems from different vendors. HiCCL addresses these issues through a library design that decouples the collective communication logic from network-specific optimizations through a compositional API. The communication logic is composed from multicast, reduction, and fence primitives, and is then factorized for the specified network hierarchy using point-to-point operations confined to a single level. Finally, striping and pipelining optimizations are applied as specified to streamline execution.

The main contributions of the paper include:
- Introducing a machine-independent specification for constructing collective functions using multicast, reduction, and fence primitives, which are sufficient to express all collective functions in the MPI standard and their alternative implementations.
- Identifying a set of unified hierarchical optimizations applicable to any collective function composed of the proposed primitives, and demonstrating how these optimizations can adapt to different modern GPU systems and are sufficient to saturate the throughput of various networks.
- Proposing HiCCL, a hierarchical communication library that integrates multiple communication functions without relying on existing collective functions.
The performance portability of HiCCL is demonstrated by matching or exceeding the performance of available MPI and vendor-provided libraries on different systems. In summary, HiCCL aims to provide collective communication operations with high throughput and performance portability, while automatically handling most of the process of constructing optimized collectives, and supporting GPUs from different vendors across diverse network architectures.
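
The idea of factorizing a multicast for a network hierarchy using only point-to-point operations within a level can be illustrated with a small sketch. The Python snippet below is not HiCCL's API or its actual algorithm: the function name `factorize_multicast`, the encoding of the hierarchy as a list of branching factors, and the choice of the first destination in each group as that group's representative are all assumptions made purely for illustration.

```python
from math import prod


def factorize_multicast(root, leaves, hierarchy):
    """Break one multicast into point-to-point sends, level by level.

    Hypothetical illustration, not HiCCL's implementation.
    hierarchy: branching factor per level, outermost first,
               e.g. [2, 4] means 2 nodes with 4 GPUs each (8 ranks).
    Returns a sorted list of (level, src, dst) tuples.
    """
    ops = []

    def recurse(src, targets, level, base, span):
        # src already holds the data inside the block [base, base + span).
        if level == len(hierarchy) or not targets:
            return
        sub = span // hierarchy[level]  # size of each sub-block at this level
        for g in range(hierarchy[level]):
            lo = base + g * sub
            group = [t for t in targets if lo <= t < lo + sub]
            if lo <= src < lo + sub:
                rep = src  # the sender's own sub-block needs no transfer
            elif group:
                rep = group[0]  # assumed choice: first destination represents the group
                ops.append((level, src, rep))
            else:
                continue  # nothing to deliver in this sub-block
            recurse(rep, [t for t in group if t != rep], level + 1, lo, sub)

    recurse(root, [t for t in leaves if t != root], 0, 0, prod(hierarchy))
    return sorted(ops)


# Broadcast from rank 0 to all 8 ranks on a 2-node x 4-GPU hierarchy.
hierarchical = factorize_multicast(0, range(8), [2, 4])
# The same broadcast with no hierarchy: one level containing all 8 ranks.
flat = factorize_multicast(0, range(8), [8])

print("hierarchical:", hierarchical)
print("flat:", flat)
```

In this sketch, the `[2, 4]` hierarchy produces a single level-0 (inter-node) transfer followed by level-1 (intra-node) transfers, whereas the flat `[8]` hierarchy sends every message directly from the root. Striping and pipelining, which the paper applies on top of the factorized operations, are not modeled here.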