Decentralized Deep Learning using Momentum-Accelerated Consensus

Aditya Balu, Zhanhong Jiang, Sin Yong Tan, Chinmay Hegde, Young M Lee, Soumik Sarkar
DOI: https://doi.org/10.48550/arXiv.2010.11166
2020-11-29
Abstract: We consider the problem of decentralized deep learning where multiple agents collaborate to learn from a distributed dataset. While there exist several decentralized deep learning approaches, the majority consider a central parameter-server topology for aggregating the model parameters from the agents. However, such a topology may be inapplicable in networked systems such as ad-hoc mobile networks, field robotics, and power network systems where direct communication with the central parameter server may be inefficient. In this context, we propose and analyze a novel decentralized deep learning algorithm where the agents interact over a fixed communication topology (without a central server). Our algorithm is based on the heavy-ball acceleration method used in gradient-based optimization. We propose a novel consensus protocol where each agent shares with its neighbors its model parameters as well as gradient-momentum values during the optimization process. We consider both strongly convex and non-convex objective functions and theoretically analyze our algorithm's performance. We present several empirical comparisons with competing decentralized learning methods to demonstrate the efficacy of our approach under different communication topologies.
Machine Learning; Distributed, Parallel, and Cluster Computing
What problem does this paper attempt to address?
The paper addresses how to train deep learning models efficiently and accurately in a decentralized network, using a momentum-accelerated consensus algorithm. Specifically, it studies how multiple nodes can learn collaboratively from distributed datasets when there is no central parameter server. Most existing decentralized deep-learning methods rely on a central parameter server to aggregate the model parameters of each node, but in some networked systems (such as ad-hoc mobile networks, field robotics, and power network systems), direct communication with a central server may be inefficient. The paper therefore proposes a new decentralized deep-learning algorithm that builds on the heavy-ball acceleration method from gradient-based optimization and introduces a novel consensus protocol: during optimization, each node shares both its model parameters and its gradient-momentum values with its neighbors.

### Main Contributions

1. **Algorithm Proposal**: The paper proposes the Decentralized Momentum Stochastic Gradient Descent (DMSGD) algorithm, which combines decentralized consensus with the classical momentum method (also known as the heavy-ball method).
2. **Theoretical Analysis**: For smooth, non-convex objective functions, the algorithm is proved to converge to a first-order stationary point. That is, it produces an estimate \(x\) with a sufficiently small gradient after \(O\left(\frac{1}{\epsilon}+\frac{1}{N\epsilon^{2}}\right)\) iterations, where \(N\) is the number of nodes.
3. **Experimental Verification**: Comparisons with baseline decentralized methods (such as D-PSGD and CDSGD) show that, when the momentum term is appropriately weighted, DMSGD is faster and more accurate, which supports its practical value.
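To make the idea of sharing both parameters and momentum concrete, here is a minimal NumPy sketch on a toy quadratic problem. It is not the paper's exact DMSGD update rule: the function name `decentralized_heavy_ball`, the uniform mixing weights, the step size `lr`, the momentum coefficient `beta`, and the use of exact rather than stochastic gradients are all assumptions made for illustration.

```python
import numpy as np


def decentralized_heavy_ball(targets, W, lr=0.1, beta=0.5, steps=200):
    """Toy momentum-consensus loop (illustrative; not the paper's exact DMSGD rule).

    Agent i holds the local loss 0.5 * ||x - targets[i]||^2, so the network-wide
    average loss is minimized at targets.mean(axis=0).
    W is a doubly stochastic mixing matrix; W[i, j] > 0 only if i and j are neighbors.
    """
    n, d = targets.shape
    x = np.zeros((n, d))  # row i: agent i's model parameters
    v = np.zeros((n, d))  # row i: agent i's momentum buffer
    for _ in range(steps):
        grads = x - targets  # each agent's gradient of its own local loss
        # Average neighbors' momentum, then take a heavy-ball step with the local gradient.
        v = beta * (W @ v) - lr * grads
        # Average neighbors' parameters, then apply the updated momentum.
        x = W @ x + v
    return x


# Three agents on a fully connected graph with uniform weights (an assumption).
targets = np.array([[0.0, 0.0], [2.0, 4.0], [4.0, 2.0]])
W = np.full((3, 3), 1.0 / 3.0)
x_final = decentralized_heavy_ball(targets, W)
print("network average:", x_final.mean(axis=0))
print("global optimum: ", targets.mean(axis=0))
```

In this toy setting the average of the agents' parameters follows an ordinary heavy-ball iteration on the averaged loss, because the mixing matrix is doubly stochastic. With a constant step size, individual agents agree with each other only approximately, so the sketch should be read as an illustration of the communication pattern rather than as a reproduction of the paper's results.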
### Problem Background

The need to accelerate the training of deep neural networks on large-scale distributed datasets has driven the development of distributed and parallel learning methods. Existing methods fall mainly into two categories:

- **Distributed GPU Environment**: Extends deep-learning algorithms from the traditional single CPU-GPU setting to multi-GPU networks.
- **Federated Learning**: Handles inherently decentralized datasets, where each computing node has its own data samples and does not share them.

However, most of these methods still rely on a central parameter server to aggregate model parameters. This paper focuses on a fully decentralized learning environment: each node in the network maintains its own model parameters and communicates with neighboring nodes over a predefined communication topology, with the goal of reaching a consensus model for the entire network.

### Momentum Acceleration

Momentum is a widely used technique for accelerating the convergence of gradient descent. In the decentralized learning literature, however, momentum acceleration has received little study, and rigorous theoretical guarantees are lacking in particular for non-convex and stochastic optimization. This paper aims to fill that gap both theoretically and empirically.

### Experimental Results

The paper verifies the effectiveness of the DMSGD algorithm through simulation experiments on a GPU cluster. The experiments cover different data distribution strategies (IID and non-IID) and different communication topologies (fully connected, ring, and bipartite). The results show that DMSGD outperforms CDSGD on non-IID data, suggesting that the momentum term helps when data is unbalanced across nodes.
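A communication topology is usually encoded as a mixing matrix that tells each node how to weight the values it receives from its neighbors. The sketch below builds such matrices for the fully connected and ring topologies and compares how quickly each one averages information. The uniform weights and the helper names are assumptions for illustration; the summary does not state the weights the paper used, and the bipartite case is omitted because its exact structure is not given here.

```python
import numpy as np


def fully_connected_matrix(n):
    """Every node averages all n nodes (itself included) with equal weight."""
    return np.full((n, n), 1.0 / n)


def ring_matrix(n):
    """Each node averages itself and its two ring neighbors with weight 1/3 (n >= 3)."""
    W = np.zeros((n, n))
    for i in range(n):
        for j in (i - 1, i, i + 1):
            W[i, j % n] = 1.0 / 3.0
    return W


def second_largest_eigenvalue_magnitude(W):
    """For a symmetric doubly stochastic W, values closer to 0 mean faster averaging."""
    eigvals = np.linalg.eigvalsh(W)  # eigvalsh assumes W is symmetric
    return np.sort(np.abs(eigvals))[-2]


for name, W in [("fully connected", fully_connected_matrix(5)),
                ("ring", ring_matrix(5))]:
    # Rows and columns both sum to 1, so repeated mixing preserves the network average.
    assert np.allclose(W.sum(axis=0), 1.0) and np.allclose(W.sum(axis=1), 1.0)
    print(name, second_largest_eigenvalue_magnitude(W))
```

The ring has a larger second eigenvalue magnitude than the fully connected graph, which reflects that information takes more rounds to spread around a sparse topology. This is one reason decentralized methods are typically evaluated on several topologies.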
### Conclusions and Future Work

This paper proposes DMSGD, a decentralized deep-learning algorithm based on momentum-accelerated consensus, and supports it with both theoretical analysis and experiments. Future work includes extending the analysis to Nesterov momentum for non-convex objective functions, analyzing non-IID data settings, and further studying convergence acceleration techniques.