Abstract: The large parameter counts and computational cost of deep neural networks (DNNs) limit their applicability on a single computing node with little computing power, such as an edge or mobile device. Most previous works use model pruning and compression to reduce DNN parameters for resource-constrained devices, but most compression methods can suffer from accuracy loss. We find that combining many weak computing nodes into a distributed system to run large, sophisticated DNN models is a promising solution to this issue. Such a system, however, requires purpose-built distributed DNN models and inference schemes. One major challenge is designing an efficient distributed DNN model for data parallelism and model parallelism; communication overhead is another critical performance bottleneck. In this article, we therefore propose the DFSNet framework (Dividing-Fuse neural Network with Searching Strategy) for distributed DNN architectures. First, DFSNet includes a joint "dividing-fusing" method that converts regular DNN models into models that are friendly to distributed systems. This method divides the conventional DNN model along the channel dimension and inserts a few special layers that fuse feature-map information from different channel groups to improve accuracy. Because the fusion layers are sparse in the network, they add little extra inference time and communication overhead on the distributed nodes, yet they substantially help maintain the accuracy of the distributed network. Second, considering the architecture of distributed computing nodes, we propose a parallel fusion topology to improve the utilization of the different computing nodes.
Finally, the popular weight-sharing neural architecture search (NAS) technique is used to search for the positions of the fusion layers in the distributed DNN model that yield high accuracy, and thereby to generate an efficient distributed DNN model. Compared with the original network, our converted distributed DNN achieves better performance (e.g., a 1.88% precision gain for ResNet56 on the CIFAR-100 dataset and a 1.25% precision gain for MobileNetV2 on the ImageNet dataset). In addition, most layers of the DNN are divided across different distributed nodes along the channel dimension, which makes the model particularly suitable for distributed DNN architectures and keeps communication overhead very low.
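The dividing-fusing idea can be illustrated with a toy sketch. This is not the authors' implementation: it assumes every layer is a 1x1 convolution acting on a single channel vector, uses random weights, and takes hand-picked fusion positions where DFSNet would search for them with NAS. The names `divided_layer`, `fusion_layer`, and `run` are hypothetical, and the communication count assumes that at each fusion layer every node receives all the channels it does not already hold.

```python
import random


def matvec(w, x):
    """Multiply a weight matrix (list of rows) by a channel vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def relu(v):
    return [max(0.0, a) for a in v]


def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def divided_layer(parts, weights):
    """Each node transforms only its own channel group: no communication."""
    return [relu(matvec(w, p)) for w, p in zip(weights, parts)]


def fusion_layer(parts, weights):
    """Each node gathers every group, then computes its own output slice."""
    full = [a for p in parts for a in p]
    out = [relu(matvec(w, full)) for w in weights]
    # Assumed cost model: a node receives the channels held by other nodes.
    received = sum(len(full) - len(p) for p in parts)
    return out, received


def run(x, groups, depth, fusion_positions, seed=0):
    """Run a toy divided network; return per-node outputs and values sent."""
    assert len(x) % groups == 0, "channels must split evenly across nodes"
    rng = random.Random(seed)
    size = len(x) // groups
    parts = [x[i * size:(i + 1) * size] for i in range(groups)]
    comm = 0
    for layer in range(depth):
        if layer in fusion_positions:
            weights = [rand_matrix(rng, size, len(x)) for _ in range(groups)]
            parts, received = fusion_layer(parts, weights)
            comm += received
        else:
            weights = [rand_matrix(rng, size, size) for _ in range(groups)]
            parts = divided_layer(parts, weights)
    return parts, comm


if __name__ == "__main__":
    x = [1.0] * 8
    for positions in (set(), {2, 5}, set(range(6))):
        _, comm = run(x, groups=4, depth=6, fusion_positions=positions)
        print(sorted(positions), "values communicated:", comm)
```

Under these assumptions, with 8 channels on 4 nodes and 6 layers, two fusion layers exchange 48 values while fusing at every layer exchanges 144, which is the sense in which sparse fusion layers keep communication overhead low.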
