Efficient All-to-All Collective Communication Schedules for Direct-Connect Topologies

Prithwish Basu, Liangyu Zhao, Jason Fantl, Siddharth Pal, Arvind Krishnamurthy, Joud Khoury
2024-04-26
Abstract: The all-to-all collective communications primitive is widely used in machine learning (ML) and high performance computing (HPC) workloads, and optimizing its performance is of interest to both ML and HPC communities. All-to-all is a particularly challenging workload that can severely strain the underlying interconnect bandwidth at scale. This paper takes a holistic approach to optimize the performance of all-to-all collective communications on supercomputer-scale direct-connect interconnects. We address several algorithmic and practical challenges in developing efficient and bandwidth-optimal all-to-all schedules for any topology and lowering the schedules to various runtimes and interconnect technologies. We also propose a novel topology that delivers near-optimal all-to-all performance.
Distributed, Parallel, and Cluster Computing, Networking and Internet Architecture
What problem does this paper attempt to address?
This paper addresses the problem of optimizing all-to-all collective communication on supercomputer-scale direct-connect topologies. All-to-all is widely used in machine learning (ML) and high-performance computing (HPC) workloads, so its performance matters to both communities. It is a particularly challenging workload because, at scale, it can severely strain the underlying interconnect bandwidth. The paper therefore takes a holistic approach, with three main objectives:

1. **Develop efficient all-to-all schedules**: Develop efficient, bandwidth-optimal all-to-all schedules for any topology, and lower those schedules to various runtimes and interconnect technologies.
2. **Propose a new topology**: Propose a novel topology that delivers near-optimal all-to-all performance.
3. **Address practical challenges**: Address the algorithmic and practical challenges that arise in developing these schedules.

Together, these contributions aim to make all-to-all collective communication more efficient on large-scale direct-connect networks, thereby improving overall system performance.
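For readers unfamiliar with the primitive, the following is a minimal sketch of what all-to-all computes, not the paper's scheduling algorithm. Every node holds one distinct chunk for every node, and after the exchange every node holds the chunk that each sender addressed to it, which amounts to a transpose of the send buffers. The names `all_to_all` and `chunks_on_network` are hypothetical and chosen only for illustration; the sketch ignores topology, routing, and timing entirely.

```python
def all_to_all(send_buffers):
    """Reference semantics of all-to-all on n nodes.

    send_buffers[src][dst] is the chunk that node `src` wants to deliver
    to node `dst`. The result recv[dst][src] is the chunk that node `dst`
    received from node `src`, i.e. the transpose of the input.
    """
    n = len(send_buffers)
    for rank, row in enumerate(send_buffers):
        if len(row) != n:
            raise ValueError(f"node {rank} must hold exactly {n} chunks")
    return [[send_buffers[src][dst] for src in range(n)] for dst in range(n)]


def chunks_on_network(n):
    """Number of chunks that must leave their source node.

    Each of the n nodes sends one chunk to each of the other n - 1 nodes;
    the chunk a node addresses to itself never touches the network.
    """
    return n * (n - 1)


if __name__ == "__main__":
    n = 3
    send = [[f"{src}->{dst}" for dst in range(n)] for src in range(n)]
    recv = all_to_all(send)
    for dst, row in enumerate(recv):
        print(f"node {dst} received {row}")
    print("chunks crossing the network:", chunks_on_network(n))
```

The quadratic growth in the number of chunks is one simple way to see why the workload strains interconnect bandwidth as the node count increases; how those chunks are routed and scheduled over a given direct-connect topology is the problem the paper studies.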