Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning

Xiuhong Li,Xingcheng Zhang,Jiangfei Duan,Chao Yang,Qianchao Zhu,Chang Chen,Peng Sun
DOI: https://doi.org/10.1145/3620666.3651379
2024-04-27
Abstract:Efficiently training large language models (LLMs) necessitates the adoption of hybrid parallel methods, integrating multiple communications collectives within distributed partitioned graphs. Overcoming communication bottlenecks is crucial and is often achieved through communication and computation overlaps. However, existing overlap methodologies tend to lean towards either fine-grained kernel fusion or limited operation scheduling, constraining performance optimization in heterogeneous training environments. This paper introduces Centauri, an innovative framework that encompasses comprehensive communication partitioning and hierarchical scheduling schemes for optimized overlap. We propose a partition space comprising three inherent abstraction dimensions: primitive substitution, topology-aware group partitioning, and workload partitioning. These dimensions collectively create a comprehensive optimization space for efficient overlap. To determine the efficient overlap of communication and computation operators, we decompose the scheduling tasks in hybrid parallel training into three hierarchical tiers: operation, layer, and model. Through these techniques, Centauri effectively overlaps communication latency and enhances hardware utilization. Evaluation results demonstrate that Centauri achieves up to 1.49× speedup over prevalent methods across various parallel training configurations.
Computer Science
What problem does this paper attempt to address?