Abstract: HiCCL (Hierarchical Collective Communication Library) addresses the growing complexity and diversity in high-performance network architectures. As GPU systems have evolved into networks of GPUs with different multilevel communication hierarchies, optimizing each collective function for a specific system has become a challenging task. Consequently, many collective libraries struggle to adapt to different hardware and software, especially across systems from different vendors. HiCCL's library design decouples the collective communication logic from network-specific optimizations through a compositional API. The communication logic is composed using multicast, reduction, and fence primitives, which are then factorized for a specified network hierarchy using only point-to-point operations within a level. Finally, striping and pipelining optimizations are applied as specified to streamline the execution. Performance evaluation of HiCCL across four different machines (two with Nvidia GPUs, one with AMD GPUs, and one with Intel GPUs) demonstrates an average 17× higher throughput than the collectives of highly specialized GPU-aware MPI implementations, and competitive throughput with those of vendor-specific libraries (NCCL, RCCL, and OneCCL), while providing portability across all four machines.
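
To make the compositional idea concrete, here is a minimal single-process Python sketch of how an all-reduce might be described with reduction, fence, and multicast steps. The function names (`multicast`, `reduction`, `run`, `all_reduce_program`) and the tuple-based program format are assumptions made for illustration and are not HiCCL's actual API; ranks are simulated as entries in a list rather than real GPUs.

```python
# Toy simulation: buffers[r] stands in for the data held by rank r.
# All names here are hypothetical; this is not HiCCL's real interface.

def multicast(buffers, root, destinations):
    """Copy root's vector to every destination rank."""
    for r in destinations:
        buffers[r] = list(buffers[root])


def reduction(buffers, sources, root):
    """Element-wise sum of the source vectors, stored on root."""
    buffers[root] = [sum(vals) for vals in zip(*(buffers[r] for r in sources))]


PRIMITIVES = {"multicast": multicast, "reduction": reduction}


def all_reduce_program(num_ranks, root=0):
    """Describe all-reduce as reduction -> fence -> multicast."""
    ranks = list(range(num_ranks))
    others = [r for r in ranks if r != root]
    return [
        ("reduction", ranks, root),
        ("fence",),  # the multicast depends on the completed reduction
        ("multicast", root, others),
    ]


def run(program, buffers):
    """Execute steps in order; in this sequential toy a fence is a no-op marker."""
    for kind, *args in program:
        if kind != "fence":
            PRIMITIVES[kind](buffers, *args)
    return buffers


result = run(all_reduce_program(3), [[1, 2], [3, 4], [5, 6]])
print(result)  # [[9, 12], [9, 12], [9, 12]]
```

In this sketch the program is plain data, so the description of what to communicate stays separate from how a runtime would later map it onto a particular network.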

What problem does this paper attempt to address?

The paper aims to optimize collective communication functions for increasingly complex and diverse high-performance computing network architectures, achieving both high efficiency and portability across different hardware and software systems. Specifically, as GPU systems evolve into GPU networks with multi-level communication hierarchies, optimizing each collective function for a specific system has become very challenging. As a result, many collective libraries have difficulty adapting to different hardware and software, especially across systems from different vendors. HiCCL addresses these issues through its library design, which decouples the collective communication logic from network-specific optimizations through a compositional API. The communication logic is composed of multicast, reduction, and fence primitives, and is then factorized according to the specified network hierarchy using point-to-point operations confined to a single level. Finally, striping and pipelining optimizations are applied as specified to streamline execution.

The main contributions of the paper include:

- Introducing a machine-independent specification for constructing collective functions using multicast, reduction, and fence primitives, which are sufficient to express all collective functions in the MPI standard and their alternative implementations.
- Identifying a set of unified hierarchical optimizations applicable to any collective function composed of the proposed primitives, and demonstrating how these optimizations can adapt to different modern GPU systems and are sufficient to saturate the throughput of various networks.
- Proposing HiCCL, a hierarchical communication library that integrates multiple communication functions without relying on existing collective functions.

The performance portability of HiCCL is demonstrated by matching or exceeding the performance of available MPI and vendor-provided libraries on different systems. In summary, HiCCL aims to provide collective communication operations with high throughput and performance portability, while automatically handling most of the process of constructing optimized collectives and supporting GPUs from different vendors across diverse network architectures.
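
The factorization step described above can be illustrated with a small Python sketch that breaks one multicast into point-to-point sends, each confined to a single level of the hierarchy. The function name `factorize_multicast`, the node-major rank numbering, and the list-of-branching-factors hierarchy format are assumptions for this example; the scheme HiCCL actually uses may differ.

```python
def factorize_multicast(hierarchy, root=0):
    """Return (level, src, dst) sends that deliver root's data to all ranks.

    hierarchy[i] is the number of subgroups at level i, outermost first,
    e.g. [2, 4] for 2 nodes with 4 GPUs each (hypothetical layout).
    """
    sends = []

    def recurse(level, base, size, holder):
        # `holder` is the rank in [base, base + size) that already has the data.
        if level == len(hierarchy):
            return
        group = size // hierarchy[level]   # ranks per subgroup at this level
        offset = (holder - base) % group   # same local position in every subgroup
        leaders = [base + g * group + offset for g in range(hierarchy[level])]
        for target in leaders:
            if target != holder:
                sends.append((level, holder, target))
        # Each subgroup leader now holds the data and forwards it one level down.
        for g, leader in enumerate(leaders):
            recurse(level + 1, base + g * group, group, leader)

    total = 1
    for factor in hierarchy:
        total *= factor
    recurse(0, 0, total, root)
    return sends


sends = factorize_multicast([2, 4])
for level, src, dst in sends:
    print(f"level {level}: {src} -> {dst}")
```

For the assumed 2-node, 4-GPU layout, only one send crosses the outer (inter-node) level and the remaining six stay inside a node, which is the kind of traffic shaping a hierarchical factorization is meant to achieve.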

HiCCL: A Hierarchical Collective Communication Library

gZCCL: Compression-Accelerated Collective Communication Framework for GPU Clusters

OCCL: a Deadlock-free Library for GPU Collective Communication

ACCL+: an FPGA-Based Collective Engine for Distributed Applications

Design and Implementation of High Performance Communication Library on Cluster

HiWayLib: A Software Framework for Enabling High Performance Communications for Heterogeneous Pipeline Computations

Intra-Cluster Coalescing to Reduce GPU NoC Pressure

Intra-Cluster Coalescing and Distributed-Block Scheduling to Reduce GPU NoC Pressure

SNCL: a Supernode OpenCL Implementation for Hybrid Computing Arrays

POSTER: Optimizing Collective Communications with Error-bounded Lossy Compression for GPU Clusters

Hoplite: efficient and fault-tolerant collective communication for task-based distributed systems

ForestColl: Efficient Collective Communications on Heterogeneous Network Fabrics

Optimizing Communication for Latency Sensitive HPC Applications on up to 48 FPGAs Using ACCL

Monitoring Collective Communication Among GPUs

Delocalization and spin-wave dynamics in ferromagnetic chains with long-range correlated random exchange

Decomposing Collectives for Exploiting Multi-lane Communication

High-Performance Genomic Analysis Heterogeneous System Using OpenCL

Optimizing ML Concurrent Computation and Communication with GPU DMA Engines

A Systemic Strategy for Tuning Intra-node Collective Communication on Multicore Systems

Accelerating Communication for Parallel Programming Models on GPU Systems

GMH: A Message Passing Toolkit for GPU Clusters