Abstract: Unlike traditional data center traffic, AI training traffic consists primarily of a small number of large flows. This makes it hard for existing routing strategies to balance routing granularity against reordering overhead. Existing serial flowlet schemes aim for a better trade-off in TCP scenarios than flow-level load balancing or packet spraying, but they are poorly suited to AI training clusters built on high-performance RDMA networks. To address this, we propose ParaLet, a parallel-flowlet strategy that resolves two problems of serial flowlets: insufficient routing entropy in AI training traffic and the difficulty of identifying time gaps in RDMA networks. ParaLet requires only a small number of Queue Pairs, which are decoupled from connections, and thus avoids scalability limits. Theoretical analysis and simulations indicate that ParaLet achieves near-optimal throughput and reduces flow completion time by a factor of 1.5 to 3.4 compared with existing methods.

Network Load Balancing with Parallel Flowlets for AI Training Clusters