Efficient All-to-All Collective Communication Schedules for Direct-Connect Topologies

Prithwish Basu, Liangyu Zhao, Jason Fantl, Siddharth Pal, Arvind Krishnamurthy, Joud Khoury

2024-04-26

Abstract: The all-to-all collective communications primitive is widely used in machine learning (ML) and high performance computing (HPC) workloads, and optimizing its performance is of interest to both ML and HPC communities. All-to-all is a particularly challenging workload that can severely strain the underlying interconnect bandwidth at scale. This paper takes a holistic approach to optimize the performance of all-to-all collective communications on supercomputer-scale direct-connect interconnects. We address several algorithmic and practical challenges in developing efficient and bandwidth-optimal all-to-all schedules for any topology and lowering the schedules to various runtimes and interconnect technologies. We also propose a novel topology that delivers near-optimal all-to-all performance.

Distributed, Parallel, and Cluster Computing; Networking and Internet Architecture

What problem does this paper attempt to address?

This paper aims to optimize the performance of all-to-all collective communication on supercomputer-scale direct-connect topologies. All-to-all is widely used in machine learning (ML) and high-performance computing (HPC) workloads, so its performance matters to both communities. It is a particularly challenging workload that can severely strain the underlying interconnect bandwidth as systems scale. The paper therefore takes a holistic approach to optimizing all-to-all on large-scale direct-connect interconnects. Its main objectives are:

1. **Develop efficient all-to-all schedules**: Construct efficient, bandwidth-optimal all-to-all schedules for any topology, and lower those schedules to various runtimes and interconnect technologies.
2. **Propose a new topology**: Introduce a novel topology that delivers near-optimal all-to-all performance.
3. **Address practical challenges**: Resolve the algorithmic and practical challenges that arise in developing these schedules.

Together, these contributions aim to make all-to-all collective communication more efficient on large-scale direct-connect networks and thereby improve overall system performance.
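To make the bandwidth concern concrete, the following minimal sketch shows what an all-to-all exchange does and how quickly link load grows on a small direct-connect network. It is an illustration only, not the paper's scheduling method: the unidirectional ring topology, the fixed clockwise routing, and the function names `all_to_all` and `ring_link_loads` are assumptions chosen for simplicity.

```python
def all_to_all(send):
    """Reference semantics of all-to-all.

    send[src][dst] is the chunk that node `src` holds for node `dst`.
    The result recv satisfies recv[dst][src] == send[src][dst],
    i.e. the chunk matrix is transposed across nodes.
    """
    n = len(send)
    return [[send[src][dst] for src in range(n)] for dst in range(n)]


def ring_link_loads(n):
    """Count chunks crossing each link of a unidirectional ring (assumed topology).

    Link k connects node k to node (k + 1) % n. Every chunk is routed
    clockwise from its source to its destination, one hop at a time.
    """
    loads = [0] * n
    for src in range(n):
        for dst in range(n):
            if src == dst:
                continue  # a node's own chunk never touches the network
            node = src
            while node != dst:
                loads[node] += 1  # the chunk crosses link node -> node + 1
                node = (node + 1) % n
    return loads


send = [[f"{s}->{d}" for d in range(4)] for s in range(4)]
recv = all_to_all(send)
print(recv[2])             # ['0->2', '1->2', '2->2', '3->2']
print(ring_link_loads(4))  # [6, 6, 6, 6]
print(ring_link_loads(8))  # [28, 28, 28, 28, 28, 28, 28, 28]
```

On this assumed ring, each link carries n(n-1)/2 chunks, so doubling the node count from 4 to 8 raises per-link load from 6 to 28. That quadratic growth on a low-degree topology is one simple way to see why all-to-all strains interconnect bandwidth at scale.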
