Efficient Direct-Connect Topologies for Collective Communications

Liangyu Zhao, Siddharth Pal, Tapan Chugh, Weiyang Wang, Jason Fantl, Prithwish Basu, Joud Khoury, Arvind Krishnamurthy
2024-05-13
Abstract: We consider the problem of distilling efficient network topologies for collective communications. We provide an algorithmic framework for constructing direct-connect topologies optimized for the latency vs. bandwidth trade-off associated with the workload. Our approach synthesizes many different topologies and schedules for a given cluster size and degree and then identifies the appropriate topology and schedule for a given workload. Our algorithms start from small, optimal base topologies and associated communication schedules and use techniques that can be iteratively applied to derive much larger topologies and schedules. Additionally, we incorporate well-studied large-scale graph topologies into our algorithmic framework by producing efficient collective schedules for them using a novel polynomial-time algorithm. Our evaluation uses multiple testbeds and large-scale simulations to demonstrate significant performance benefits from our derived topologies and schedules.
Networking and Internet Architecture; Distributed, Parallel, and Cluster Computing; Machine Learning
What problem does this paper attempt to address?
This paper addresses how to efficiently construct direct-connect network topologies, together with their communication schedules, for collective communication in large-scale distributed machine learning (ML) and high-performance computing (HPC). Specifically, it focuses on constructing high-performance direct-connect topologies and schedules that optimize the trade-off between latency and bandwidth under given network performance characteristics and degree constraints.

### Paper Background
Collective communication operations involve concurrent aggregation and distribution of data across a cluster of nodes and are widely used in machine learning and high-performance computing. As accelerator compute power has improved, collective operations have become a significant overhead in large-scale distributed ML training. To address this, some studies have proposed optical circuit switching to achieve higher bandwidth at reasonable capital expenditure and energy consumption. Hosts communicate through a limited number of reconfigurable optical circuits, which makes the network topology a configurable component.

### Existing Problems
Although existing ML systems based on optical circuits conform to the direct-connect model, they fail to fully exploit the flexibility that topology reconfiguration provides. For example, ring allreduce is bandwidth-efficient, but the ring has a large graph diameter, resulting in high total-hop-count latency. The double binary tree has a logarithmic diameter but suffers from poor load balancing and bandwidth efficiency. Other efficient collective algorithms (such as recursive doubling and the Bruck algorithm) perform well in switched networks, but their dynamic communication patterns make them unsuitable for degree-constrained direct-connect networks.
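
The diameter argument above can be checked with a small sketch. It measures the diameter of a directed circulant graph by breadth-first search. The function name `diameter` and the degree-4 circulant example are illustrative assumptions; they are not topologies or code taken from the paper.

```python
from collections import deque

def diameter(n, offsets):
    """Diameter of a directed circulant graph on n nodes.

    Node i has an outgoing link to (i + s) % n for every s in offsets.
    Circulant graphs look the same from every node, so one BFS from
    node 0 is enough to find the longest shortest path.
    """
    dist = [-1] * n
    dist[0] = 0
    queue = deque([0])
    while queue:
        u = queue.popleft()
        for s in offsets:
            v = (u + s) % n
            if dist[v] < 0:
                dist[v] = dist[u] + 1
                queue.append(v)
    if min(dist) < 0:
        raise ValueError("graph is not strongly connected")
    return max(dist)

# Hypothetical 16-node examples (assumed for illustration only).
ring_uni = diameter(16, [1])              # degree-1 unidirectional ring
ring_bi = diameter(16, [1, -1])           # degree-2 bidirectional ring
circulant = diameter(16, [1, -1, 4, -4])  # degree-4 circulant graph
print(ring_uni, ring_bi, circulant)
```

Under these assumptions the 16-node unidirectional ring has diameter 15 and the bidirectional ring 8, while the illustrative degree-4 circulant graph has diameter 3, which shows how a modest degree budget can shorten the hop count considerably.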
### Solutions
The paper proposes an algorithmic toolchain for rapidly synthesizing efficient network topologies and schedules for collective communication. The main contributions include:

1. **Extension Techniques**: Starting from small, optimal topologies and communication schedules, generate near-optimal large-scale topologies and schedules through a series of extension techniques.
2. **Polynomial-Time Schedule Generation Algorithm**: Generate optimal collective communication schedules for large-scale topologies with certain symmetry properties.
3. **Topology Enumeration and Search Algorithm**: Identify the best topologies and schedules by exploring Pareto-efficient options with different bandwidth efficiencies, total-hop-count latencies, and all-to-all throughputs.
4. **Compiler**: Develop a compiler that implements the optimized schedules and integrates them into ML frameworks (such as PyTorch).

### Experimental Verification
The paper evaluates the proposed method on two testbeds (a 12-node GPU cluster and a 54-node CPU cluster on the Frontera supercomputer) and in large-scale simulations. The results show that the method reduces collective communication time by more than 30% in DNN training and by up to 3.1 times in HPC workloads. In addition, the schedule generation algorithm is several orders of magnitude faster than existing methods and can generate schedules for thousands of nodes within a few minutes.

### Summary
This paper addresses the collective communication bottleneck in large-scale distributed computing by jointly optimizing network topologies and communication schedules, providing new solutions for future high-performance computing and machine learning applications.
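
To make the idea of choosing among Pareto-efficient options concrete, here is a minimal sketch. It assumes a simple alpha-beta style cost model (time = per-hop latency times hop count, plus a bandwidth factor times message size over link bandwidth), which is a common way to reason about this trade-off and is not claimed to be the paper's exact model. All names are hypothetical. The `ring` entry uses the commonly cited ring allreduce costs for 16 nodes; the other two entries are made-up placeholders, not results from the paper.

```python
from typing import NamedTuple

class Candidate(NamedTuple):
    name: str
    hops: int         # total-hop-count latency, in communication steps
    bw_factor: float  # bandwidth cost as a multiple of M/B (lower is better)

def pareto_front(cands):
    """Keep candidates that no other candidate beats on both metrics."""
    front = []
    for c in cands:
        dominated = any(
            o.hops <= c.hops and o.bw_factor <= c.bw_factor
            and (o.hops < c.hops or o.bw_factor < c.bw_factor)
            for o in cands
        )
        if not dominated:
            front.append(c)
    return front

def est_time(c, alpha, msg_bytes, bandwidth):
    """Assumed alpha-beta estimate of one collective's runtime in seconds."""
    return alpha * c.hops + c.bw_factor * msg_bytes / bandwidth

def best_for(cands, alpha, msg_bytes, bandwidth):
    """Pick the Pareto-efficient candidate with the lowest estimated time."""
    return min(pareto_front(cands),
               key=lambda c: est_time(c, alpha, msg_bytes, bandwidth))

# Illustrative candidates for N = 16 nodes.
CANDIDATES = [
    Candidate("ring", hops=2 * (16 - 1), bw_factor=2 * (16 - 1) / 16),
    Candidate("low-diameter", hops=8, bw_factor=2.5),   # placeholder values
    Candidate("dominated", hops=12, bw_factor=3.0),     # placeholder values
]

# Assumed link parameters: 10 microseconds per hop, 10 GB/s per link.
small = best_for(CANDIDATES, alpha=10e-6, msg_bytes=1e3, bandwidth=1e10)
large = best_for(CANDIDATES, alpha=10e-6, msg_bytes=1e9, bandwidth=1e10)
print(small.name, large.name)
```

With these assumed numbers, the small message favors the low-hop candidate and the large message favors the bandwidth-efficient ring, which is why a workload-aware search keeps the whole Pareto front rather than a single topology.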