Abstract: Many modern large-scale machine learning problems benefit from decentralized and stochastic optimization. Recent works have shown that utilizing both decentralized computing and local stochastic gradient estimates can outperform state-of-the-art centralized algorithms in applications involving highly non-convex problems, such as training deep neural networks.
In this work, we propose a decentralized stochastic algorithm to deal with certain smooth non-convex problems where there are $m$ nodes in the system, and each node has a large number of samples (denoted as $n$). Unlike the majority of the existing decentralized learning algorithms for either stochastic or finite-sum problems, we focus on reducing both the total number of communication rounds among the nodes and the number of local data samples accessed. In particular, we propose an algorithm named D-GET (decentralized gradient estimation and tracking), which jointly performs decentralized gradient estimation (which estimates the local gradient using a subset of local samples) and gradient tracking (which tracks the global full gradient using local estimates). We show that, to achieve a certain $\epsilon$-stationary solution of the deterministic finite-sum problem, the proposed algorithm achieves an $\mathcal{O}(mn^{1/2}\epsilon^{-1})$ sample complexity and an $\mathcal{O}(\epsilon^{-1})$ communication complexity. The sample complexity significantly improves upon the best existing bound of $\mathcal{O}(mn\epsilon^{-1})$, while the communication complexity matches the best existing bound of $\mathcal{O}(\epsilon^{-1})$. Similarly, for online problems, the proposed method achieves an $\mathcal{O}(m \epsilon^{-3/2})$ sample complexity and an $\mathcal{O}(\epsilon^{-1})$ communication complexity, while the best existing bounds are $\mathcal{O}(m\epsilon^{-2})$ and $\mathcal{O}(\epsilon^{-2})$, respectively.
Optimization and Control; Distributed, Parallel, and Cluster Computing; Machine Learning; Signal Processing
What problem does this paper attempt to address?
This paper aims to reduce, at the same time, the total number of communication rounds and the number of local data samples accessed in decentralized non-convex optimization. Specifically, it asks how to design an effective decentralized stochastic algorithm for certain smooth non-convex problems in a system with \(m\) nodes, where each node holds a large number of samples (denoted as \(n\)). Unlike most existing decentralized learning algorithms, which target only one of these two costs, the proposed algorithm is designed to keep both the number of communication rounds between nodes and the number of local sample accesses low.
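For concreteness, the setting described above is commonly written as follows. This is a sketch in assumed notation: the dimension \(d\), the local objectives \(f_i\), the per-sample losses \(f_{i,j}\), and the random sample \(\xi\) are symbols introduced here for illustration and do not appear in the abstract.

\[
\min_{x \in \mathbb{R}^d} \; f(x) = \frac{1}{m} \sum_{i=1}^{m} f_i(x),
\qquad
f_i(x) = \frac{1}{n} \sum_{j=1}^{n} f_{i,j}(x) \quad \text{(finite-sum case)},
\qquad
f_i(x) = \mathbb{E}_{\xi}\left[ f_i(x; \xi) \right] \quad \text{(online case)}.
\]

Under this reading, node \(i\) can only evaluate gradients of its own \(f_i\), so sample complexity counts evaluations of \(\nabla f_{i,j}\) (or \(\nabla f_i(\cdot\,; \xi)\)) summed over all nodes, and communication complexity counts rounds of exchange between neighboring nodes.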
### Main Contributions
1. **Proposed a new Decentralized Gradient Estimation and Tracking algorithm (D-GET)**:
- The algorithm estimates each local gradient using a subset of local samples and tracks the global full gradient using the local estimates, combining both steps in one decentralized method.
- For deterministic finite-sum problems, the D-GET algorithm achieves a sample complexity of \(O(mn^{1/2}\epsilon^{-1})\) and a communication complexity of \(O(\epsilon^{-1})\).
- For online problems, the D-GET algorithm achieves a sample complexity of \(O(m\epsilon^{-3/2})\) and a communication complexity of \(O(\epsilon^{-1})\).
2. **Improved the complexity of existing methods**:
- For finite-sum problems, the best existing sample complexity is \(O(mn\epsilon^{-1})\) and the best existing communication complexity is \(O(\epsilon^{-1})\). D-GET achieves \(O(mn^{1/2}\epsilon^{-1})\) and \(O(\epsilon^{-1})\) respectively, so it needs a factor of \(n^{1/2}\) fewer samples while matching the communication bound.
- For online problems, the best existing sample complexity is \(O(m\epsilon^{-2})\) and the best existing communication complexity is \(O(\epsilon^{-2})\). D-GET achieves \(O(m\epsilon^{-3/2})\) and \(O(\epsilon^{-1})\) respectively, improving both bounds.
3. **Theoretical analysis**:
- The paper analyzes the convergence of the D-GET algorithm and proves the sample and communication complexity bounds stated above.
- By introducing two auxiliary variables \(v\) and \(y\), which estimate the local gradient and track the global gradient respectively, the D-GET algorithm combines modern variance reduction techniques with decentralized gradient tracking.
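The interplay between the two auxiliary variables can be illustrated with a minimal NumPy sketch. It is not the paper's reference implementation. The names `v` and `y` follow the summary above; everything else is an assumption made for illustration: the ring topology and its mixing matrix `W`, the least-squares local losses, the SARAH/SPIDER-style recursive estimator used as the variance reduction step, and the parameters `alpha` (step size), `q` (period of full local gradient refreshes), and `batch` (mini-batch size).

```python
import numpy as np


def ring_mixing_matrix(m):
    """Doubly stochastic mixing matrix for a ring of m nodes (assumed topology)."""
    W = np.zeros((m, m))
    for i in range(m):
        for j in (i - 1, i, i + 1):
            W[i, j % m] += 1.0 / 3.0
    return W


def local_grad(A, b, x, idx):
    """Gradient of the assumed local loss 0.5 * mean((A x - b)^2) over rows idx."""
    Ai = A[idx]
    return Ai.T @ (Ai @ x - b[idx]) / len(idx)


def dget_sketch(A, b, W, alpha=0.05, q=8, batch=8, rounds=300, seed=0):
    """Gradient estimation (v) plus gradient tracking (y) on m nodes.

    A has shape (m, n, d) and b has shape (m, n): node i holds A[i], b[i].
    Returns the local iterates, one row per node.
    """
    rng = np.random.default_rng(seed)
    m, n, d = A.shape
    full = np.arange(n)
    x = np.zeros((m, d))
    # Initialise the local estimates with full local gradients, and let the
    # tracker start from the same values.
    v = np.stack([local_grad(A[i], b[i], x[i], full) for i in range(m)])
    y = v.copy()
    for r in range(1, rounds + 1):
        # One communication round: mix neighbors' iterates, step along the tracker.
        x_new = W @ x - alpha * y
        v_new = np.empty_like(v)
        for i in range(m):
            if r % q == 0:
                # Periodic refresh: full local gradient (n sample accesses).
                v_new[i] = local_grad(A[i], b[i], x_new[i], full)
            else:
                # Recursive estimate from a mini-batch evaluated at both iterates.
                idx = rng.choice(n, size=batch, replace=False)
                v_new[i] = (v[i]
                            + local_grad(A[i], b[i], x_new[i], idx)
                            - local_grad(A[i], b[i], x[i], idx))
        # Gradient tracking: mix the trackers and add the change in local estimates.
        y = W @ y + v_new - v
        x, v = x_new, v_new
    return x
```

Calling `dget_sketch(A, b, ring_mixing_matrix(m))` with synthetic data of shape `(m, n, d)` returns one iterate per node. Least squares is convex and is used here only to keep the example short; the point of the sketch is the update pattern, in which most rounds touch only `batch` samples per node while `y` keeps every node's search direction close to the network-wide average gradient.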
### Application Scenarios
- **Large-scale machine learning tasks**: For highly non-convex problems such as training deep neural networks, the D-GET algorithm targets lower communication overhead and fewer sample accesses, in line with its complexity bounds.
- **Distributed computing**: In applications that call for data privacy protection, network robustness, and computational efficiency, the D-GET algorithm offers a decentralized alternative to centralized training.
### Conclusion
With the D-GET algorithm, this paper advances decentralized non-convex optimization by lowering sample complexity in the finite-sum setting and both sample and communication complexity in the online setting. This provides new ideas and tools for large-scale machine learning and distributed computing.