Abstract: We consider the problem of distilling efficient network topologies for collective communications. We provide an algorithmic framework for constructing direct-connect topologies optimized for the latency vs. bandwidth trade-off associated with the workload. Our approach synthesizes many different topologies and schedules for a given cluster size and degree and then identifies the appropriate topology and schedule for a given workload. Our algorithms start from small, optimal base topologies and associated communication schedules and use techniques that can be iteratively applied to derive much larger topologies and schedules. Additionally, we incorporate well-studied large-scale graph topologies into our algorithmic framework by producing efficient collective schedules for them using a novel polynomial-time algorithm. Our evaluation uses multiple testbeds and large-scale simulations to demonstrate significant performance benefits from our derived topologies and schedules.

What problem does this paper attempt to address?

The paper addresses how to efficiently construct direct-connect network topologies, together with their communication schedules, for collective communication in large-scale distributed machine learning (ML) and high-performance computing (HPC). Specifically, it focuses on synthesizing high-performance direct-connect topologies and schedules that optimize the trade-off between latency and bandwidth under given network performance characteristics and degree constraints.

### Paper Background

Collective communication operations involve concurrent aggregation and distribution of data across a cluster of nodes and are widely used in machine learning and high-performance computing. As accelerator compute power has improved, collective operations have become a significant overhead in large-scale distributed ML training. To address this, some studies have proposed optical circuit switching to achieve higher bandwidth while keeping capital expenditure and energy consumption reasonable. Hosts communicate through a limited number of reconfigurable optical circuits, which makes the network topology a configurable component.

### Existing Problems

Although existing ML systems based on optical circuits conform to the direct-connect model, they fail to fully exploit the flexibility that topology reconfiguration provides. For example, ring allreduce is bandwidth-efficient, but the ring has a large graph diameter, which results in high total-hop latency. The double binary tree has a logarithmic diameter but suffers from poor load balancing and bandwidth efficiency. Other efficient collective algorithms (such as recursive doubling and the Bruck algorithm) perform well in switched networks, but their dynamic communication patterns make them unsuitable for degree-constrained direct-connect networks.
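To make the tension between diameter and bandwidth efficiency concrete, here is a minimal sketch. It is not the paper's code. It assumes the standard alpha-beta cost model (per-hop latency `alpha`, per-byte transfer time `beta`), under which ring allreduce on `n` nodes takes `2(n-1)` steps, each sending `msg_bytes / n`. The helper names `circulant`, `diameter`, and `ring_allreduce_time`, and the offsets `{1, 4}`, are illustrative choices rather than anything taken from the paper.

```python
from collections import deque


def circulant(n, offsets):
    """Directed circulant graph: node i has an edge to (i + s) % n for each offset s."""
    return {i: [(i + s) % n for s in offsets] for i in range(n)}


def diameter(adj):
    """Largest shortest-path hop count over all ordered node pairs (BFS from every node)."""
    worst = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        if len(dist) < len(adj):
            raise ValueError("graph is not strongly connected")
        worst = max(worst, max(dist.values()))
    return worst


def ring_allreduce_time(n, msg_bytes, alpha, beta):
    """Alpha-beta estimate for ring allreduce (assumed model, not from the paper):
    2(n-1) steps, each paying one hop latency and moving msg_bytes / n bytes."""
    steps = 2 * (n - 1)
    return steps * alpha + steps * (msg_bytes / n) * beta


if __name__ == "__main__":
    n = 16
    ring = circulant(n, [1, -1])    # bidirectional ring, out-degree 2
    shortcut = circulant(n, [1, 4])  # same out-degree, one longer offset
    print("ring diameter:", diameter(ring))
    print("circulant {1,4} diameter:", diameter(shortcut))
    print("ring allreduce time:", ring_allreduce_time(n, 1 << 20, alpha=1e-5, beta=1e-9))
```

With 16 nodes and out-degree 2, the bidirectional ring has diameter 8 while the offset-{1, 4} circulant graph has diameter 6. The latency term of the ring estimate grows linearly in `n`, which is why small messages favor low-diameter topologies and large messages favor bandwidth-efficient ones.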
### Solutions

The paper proposes an algorithmic toolchain for rapidly synthesizing efficient network topologies and schedules for collective communication. The main contributions are:

1. **Expansion Techniques**: Starting from small optimal topologies and communication schedules, generate near-optimal large-scale topologies and schedules by applying a series of expansion techniques.
2. **Polynomial-Time Schedule Generation Algorithm**: Generate optimal collective communication schedules for large-scale topologies with certain symmetry properties.
3. **Topology Enumeration and Search Algorithm**: Identify the best topologies and schedules by exploring Pareto-efficient options with different bandwidth efficiencies, total-hop latencies, and all-to-all throughputs.
4. **Compiler**: Lower the optimized schedules to executable form and integrate them into ML frameworks (such as PyTorch).

### Experimental Verification

The paper evaluates the proposed method on two testbeds (a 12-node GPU cluster and a 54-node CPU cluster on the Frontera supercomputer) and in large-scale simulations. The results show that the method reduces collective communication time by more than 30% in DNN training and reduces communication time by a factor of up to 3.1 in HPC workloads. In addition, the schedule generation algorithm is several orders of magnitude faster than existing methods and can generate schedules for thousands of nodes within a few minutes.

### Summary

The paper tackles the collective communication bottleneck in large-scale distributed computing by jointly optimizing network topologies and communication schedules, offering a new approach for future high-performance computing and machine learning applications.
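The Pareto-efficient selection step listed among the contributions can be sketched as follows. This is an illustration under stated assumptions, not the paper's search algorithm. Each candidate is reduced to two costs where lower is better (a hop-latency term and a bandwidth term), the workload is scored with an assumed alpha-beta model, and the `Candidate` type, the function names, and all numbers are hypothetical placeholders rather than measured data.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    name: str
    hop_latency: float     # hypothetical latency term in hops; lower is better
    bandwidth_cost: float  # hypothetical bandwidth term in message multiples; lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """a dominates b if it is no worse on both costs and strictly better on at least one."""
    no_worse = a.hop_latency <= b.hop_latency and a.bandwidth_cost <= b.bandwidth_cost
    better = a.hop_latency < b.hop_latency or a.bandwidth_cost < b.bandwidth_cost
    return no_worse and better


def pareto_front(cands: List[Candidate]) -> List[Candidate]:
    """Keep only candidates that no other candidate dominates."""
    return [c for c in cands if not any(dominates(o, c) for o in cands)]


def best_for_workload(cands: List[Candidate], msg_bytes: float,
                      alpha: float, beta: float) -> Candidate:
    """Pick the Pareto-efficient candidate with the lowest assumed alpha-beta cost."""
    def cost(c: Candidate) -> float:
        return c.hop_latency * alpha + c.bandwidth_cost * msg_bytes * beta
    return min(pareto_front(cands), key=cost)


if __name__ == "__main__":
    # Placeholder candidates for illustration only.
    candidates = [
        Candidate("A", hop_latency=2, bandwidth_cost=3.0),
        Candidate("B", hop_latency=4, bandwidth_cost=1.5),
        Candidate("C", hop_latency=4, bandwidth_cost=3.0),  # dominated by A and B
        Candidate("D", hop_latency=8, bandwidth_cost=1.0),
    ]
    print([c.name for c in pareto_front(candidates)])
    print(best_for_workload(candidates, msg_bytes=0.1, alpha=1.0, beta=1.0).name)
    print(best_for_workload(candidates, msg_bytes=100.0, alpha=1.0, beta=1.0).name)
```

Filtering to the Pareto front first means the workload-specific choice only has to compare candidates that represent a genuine trade-off. A third metric such as all-to-all throughput could be added as another field with the same dominance test.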

Efficient Direct-Connect Topologies for Collective Communications
