SDCC: Software-Defined Collective Communication for Distributed Training

Xin Jin, Zhen Zhang, Yunshan Jia, Yun Ma, Xuanzhe Liu
DOI: https://doi.org/10.1007/s11432-023-3894-4
2024-01-01
Abstract: Communication is crucial to the performance of distributed training. Today’s solutions tightly couple the control and data planes and lack flexibility, generality, and performance. In this study, we present SDCC, a software-defined collective communication framework for distributed training. SDCC is based on the principle of modern systems design to effectively decouple the control plane from the data plane. SDCC abstracts the operations for collective communication in distributed training with dataflow operations and unifies computing and communication with a single dataflow graph. The abstraction, together with the unification, is powerful: it enables users to easily express new and existing collective communication algorithms and optimizations, simplifies the integration with different computing engines (e.g., PyTorch and TensorFlow) and network transports (e.g., Linux TCP and kernel bypass), and allows the system to improve performance by exploiting parallelism exposed by the dataflow graph. We further demonstrate the benefits of SDCC in four use cases.
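To make the idea of a single dataflow graph for computing and communication concrete, the following is a minimal sketch and not SDCC's actual API. It assumes a hypothetical in-process `Graph` class, with hypothetical node names (`grad`, `send`, `reduce`) standing in for a two-worker all-reduce, and plain Python functions standing in for real compute kernels and network transports. Grouping nodes into waves illustrates the kind of parallelism such a graph can expose.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Node:
    name: str
    fn: Callable[..., object]
    deps: List[str] = field(default_factory=list)


class Graph:
    """Hypothetical toy dataflow graph; not SDCC's real interface."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Node] = {}

    def add(self, name, fn, deps=()):
        self.nodes[name] = Node(name, fn, list(deps))
        return name

    def levels(self) -> List[List[str]]:
        """Group nodes into waves; nodes in one wave do not depend on each other."""
        depth: Dict[str, int] = {}

        def visit(n: str) -> int:
            if n not in depth:
                deps = self.nodes[n].deps
                depth[n] = 1 + max((visit(d) for d in deps), default=-1)
            return depth[n]

        for n in self.nodes:
            visit(n)
        waves: List[List[str]] = [[] for _ in range(max(depth.values()) + 1)]
        for n, d in depth.items():
            waves[d].append(n)
        return waves

    def run(self) -> Dict[str, object]:
        results: Dict[str, object] = {}
        for wave in self.levels():
            for n in wave:  # a real runtime could execute a wave concurrently
                node = self.nodes[n]
                results[n] = node.fn(*(results[d] for d in node.deps))
        return results


def build_two_worker_allreduce(grads):
    """Express a two-worker all-reduce (sum) as compute and communication nodes."""
    g = Graph()
    for i, v in enumerate(grads):
        g.add(f"grad{i}", lambda v=v: v)  # compute op: produce a local gradient
    for i in range(2):
        # communication op: identity function standing in for a network send
        g.add(f"send{i}", lambda x: x, [f"grad{i}"])
    for i in range(2):
        peer = 1 - i
        # reduce op: combine the local gradient with the one received from the peer
        g.add(f"reduce{i}", lambda a, b: a + b, [f"grad{i}", f"send{peer}"])
    return g


graph = build_two_worker_allreduce([1.0, 3.0])
waves = graph.levels()
results = graph.run()
```

In this sketch the scheduler sees compute and communication nodes uniformly, so swapping the all-reduce algorithm or the transport would only change which nodes are added to the graph, not how the graph is executed.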