Abstract: With the rapid growth in the volume of data sets, models, and devices in the domain of deep learning, there is increasing attention on large-scale distributed deep learning. In contrast to traditional distributed deep learning, the large-scale scenario poses new challenges that include fault tolerance, scalability of algorithms and infrastructures, and heterogeneity in data sets, models, and resources. Due to intensive synchronization of models and sharing of data across GPUs and computing nodes during distributed training and inference processes, communication efficiency becomes the bottleneck for achieving high performance at a large scale. This article surveys the literature over the period of 2018-2023 on algorithms and technologies aimed at achieving efficient communication in large-scale distributed deep learning at various levels, including algorithms, frameworks, and infrastructures. Specifically, we first introduce efficient algorithms for model synchronization and communication data compression in the context of large-scale distributed training. Next, we introduce efficient strategies related to resource allocation and task scheduling for use in distributed training and inference. After that, we present the latest technologies pertaining to modern communication infrastructures used in distributed deep learning with a focus on examining the impact of the communication overhead in a large-scale and heterogeneous setting. Finally, we conduct a case study on the distributed training of large language models at a large scale to illustrate how to apply these technologies in real cases. This article aims to offer researchers a comprehensive understanding of the current landscape of large-scale distributed deep learning and to reveal promising future research directions toward communication-efficient solutions in this scope.

Communication Efficient Distributed Learning with Feature Partitioned Data

Efficient Partitioning and Communication Scheme-Based Distributed Edge Computing to Accelerate Deep Neural Network

Communication-Efficient Distributed Deep Learning: A Comprehensive Survey

Communication-Efficient Distributed Learning via Sparse and Adaptive Stochastic Gradient

Distributed Learning Systems with First-order Methods

Toward Communication Efficient Adaptive Gradient Method

Lazily Aggregated Quantized Gradient Innovation for Communication-Efficient Federated Learning

Communication Lower Bounds for Distributed Convex Optimization: Partition Data on Features

Efficient Privacy-Preserving Machine Learning in Hierarchical Distributed System

Communication-Efficient Distributed Learning via Lazily Aggregated Quantized Gradients

Decentralized Edge Learning via Unreliable Device-to-Device Communications

FedBCD: A Communication-Efficient Collaborative Learning Framework for Distributed Features

Debiased distributed learning for sparse partial linear models in high dimensions

High-Dimensional Distributed Sparse Classification with Scalable Communication-Efficient Global Updates

Communication-Efficient Large-Scale Distributed Deep Learning: A Comprehensive Survey

SparDL: Distributed Deep Learning Training with Efficient Sparse Communication

Communication-Efficient Adaptive Batch Size Strategies for Distributed Local Gradient Methods

Resource-constrained Federated Edge Learning with Heterogeneous Data: Formulation and Analysis

Distributed Event-Based Learning via ADMM

Communication-Efficient Distributed Deep Learning via Federated Dynamic Averaging

A Hybrid Data and Model Transfer Framework for Distributed Machine Learning