Abstract: Large-scale distributed training in production datacenters constitutes a challenging workload bottlenecked by network communication. In response, both major industry players (e.g., Ultra Ethernet Consortium) and parts of academia have surprisingly, and almost unanimously, agreed that packet spraying is necessary to improve the performance of large-scale distributed training workloads. In this paper, we challenge this prevailing belief and pose the question: How close can a singlepath transport approach an optimal multipath transport? We demonstrate that singlepath transport (from a NIC's perspective) is sufficient and can perform nearly as well as an ideal multipath transport with packet spraying, particularly in the context of distributed training in leaf-spine topologies. Our assertion is based on four key observations about workloads driven by collective communication patterns: (i) flows within a collective start almost simultaneously, (ii) flow sizes are nearly equal, (iii) the completion time of a collective is more crucial than individual flow completion times, and (iv) flows can be split upon arrival. We analytically prove that singlepath transport, using minimal flow splitting (at the application layer), is equivalent to an ideal multipath transport with packet spraying in terms of maximum congestion. Our preliminary evaluations support our claims. This paper suggests an alternative agenda for developing next-generation transport protocols tailored for large-scale distributed training.

What problem does this paper attempt to address?

The paper asks whether multipath transport (such as packet spraying) is really necessary to improve the performance of large-scale distributed training. Specifically, the authors challenge the widespread view that packet spraying must be adopted for these workloads, and ask how closely single-path transport can approach the performance of an optimal multipath transport.

### Background of the Paper and Problem Statement

1. **Bottlenecks in Large-Scale Distributed Training**:
   - Large-scale distributed training in production datacenters is a challenging workload whose main bottleneck is network communication.
   - Industry and academia widely believe that packet spraying is necessary to improve the performance of large-scale distributed training.
2. **Research Motivation**:
   - The authors ask whether single-path transport can approach the performance of an optimal multipath transport.
   - By analyzing the characteristics of workloads driven by collective communication patterns, the authors find that single-path transport can perform nearly as well as multipath transport for these workloads.

### Main Observations and Hypotheses

The authors base their hypothesis on four key observations:

1. **Flows in Collective Operations Start Almost Simultaneously**:
   - The flows of a collective operation (such as allReduce) reach the NIC at almost the same time.
2. **Flow Sizes are Almost Equal**:
   - Within each step, all flows have the same size.
3. **Completion Time of a Collective Matters More than that of a Single Flow**:
   - The completion time of the collective directly affects training time, while the completion time of any individual flow is less important.
4. **Flows can be Split upon Arrival**:
   - A minimal number of flows can be split at the application layer, so that load is spread uniformly across paths.

### Research Conclusions

- Through an analytical proof and a preliminary evaluation, the authors show that single-path transport (from the perspective of the NIC) can approach the performance of an optimal multipath transport, especially in a leaf-spine topology.
- The authors propose a new single-path transport protocol, Ethereal, which achieves a collective completion time similar to that of packet spraying in data-parallel distributed training workloads.

### Formula Representation

The key formulas in the paper are as follows.

- Consider a leaf-spine topology with \( \ell \) leaves, \( s \) spines, and \( k \) servers.
- Each server \( i \) sends \( n_{i,j} \) flows, each of size \( f_i \), to any set of destinations under leaf \( j \). The resulting demand set is

\[ M=\{ f_i\times n_{i,j}\mid f_i, n_{i,j}\in\mathbb{N}, i\in[1,k], j\in[1,\ell]\} \]

- Theorem 1 (Equivalence): Given the above leaf-spine topology, a greedy distribution algorithm (ALG) that runs on each node, splits the minimum number of flows, and assigns each flow to the least congested uplink (from a local perspective) is equivalent to packet spraying with respect to minimizing the maximum congestion.
- Taking server 1 as the example, ALG assigns the following demand to each uplink, where \( r = n_{1,j} \bmod s \) is the number of flows left over after whole flows are divided evenly among the \( s \) uplinks:

\[ \text{ALG assigns } f_1\cdot\left(\left\lfloor\frac{n_{1,j}}{s}\right\rfloor+\frac{r}{s}\right)\text{ demand to each uplink} \]

- The optimal multipath load-balancing algorithm (OPT) distributes the total demand evenly over all uplinks:

\[ \text{OPT assigns } f_1\cdot\frac{n_{1,j}}{s}\text{ demand to each uplink} \]

Because \( n_{1,j} = s\cdot\left\lfloor\frac{n_{1,j}}{s}\right\rfloor + r \), the two expressions are equal. This is how the authors prove that, under these conditions, single-path transport matches multipath transport in maximum congestion.
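As a rough numerical check of the equivalence stated in Theorem 1, the following Python sketch compares the per-uplink demand of a greedy assignment with that of ideal packet spraying. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, it models a single server sending toward a single destination leaf, and it assumes that each leftover flow is split into \( s \) equal subflows (the summary does not say exactly how the split is performed).

```python
from fractions import Fraction


def greedy_uplink_loads(num_flows, flow_size, num_uplinks):
    """Per-uplink demand when whole flows are placed greedily and only
    the leftover flows are split (hypothetical helper)."""
    loads = [Fraction(0)] * num_uplinks
    whole_per_uplink, leftover = divmod(num_flows, num_uplinks)

    def least_loaded():
        # Local view: pick the uplink with the smallest demand so far.
        return min(range(num_uplinks), key=loads.__getitem__)

    # Unsplit flows: each one goes to the currently least-loaded uplink.
    for _ in range(whole_per_uplink * num_uplinks):
        loads[least_loaded()] += flow_size

    # Assumption: each leftover flow is cut into num_uplinks equal
    # subflows, and each subflow is again placed greedily.
    piece = Fraction(flow_size, num_uplinks)
    for _ in range(leftover * num_uplinks):
        loads[least_loaded()] += piece

    return loads


def spraying_uplink_load(num_flows, flow_size, num_uplinks):
    """Per-uplink demand under ideal packet spraying: f * n / s."""
    return Fraction(flow_size * num_flows, num_uplinks)


def unsplit_max_load(num_flows, flow_size, num_uplinks):
    """Worst uplink demand if flows may not be split: f * ceil(n / s)."""
    return flow_size * -(-num_flows // num_uplinks)


if __name__ == "__main__":
    n, f, s = 10, 4, 4  # 10 flows of size 4 over 4 uplinks
    print("greedy with splitting:", max(greedy_uplink_loads(n, f, s)))
    print("ideal spraying:       ", spraying_uplink_load(n, f, s))
    print("greedy, no splitting: ", unsplit_max_load(n, f, s))
```

With 10 flows of size 4 over 4 uplinks, the greedy assignment and ideal spraying both place 10 units of demand on every uplink, whereas placing only whole flows would leave some uplink carrying 12. Exact fractions are used so that the comparison is not affected by floating-point rounding.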

Challenging the Need for Packet Spraying in Large-Scale Distributed Training
