Decentralized Stochastic Gradient Descent Ascent for Finite-Sum Minimax Problems

Hongchang Gao
2024-06-11
Abstract: Minimax optimization problems have attracted significant attention in recent years due to their widespread application in numerous machine learning models. To solve the minimax problem, a wide variety of stochastic optimization methods have been proposed. However, most of them ignore the distributed setting where the training data is distributed across multiple workers. In this paper, we develop a novel decentralized stochastic gradient descent ascent method for the finite-sum minimax problem. In particular, by employing the variance-reduced gradient, our method can achieve $O(\frac{\sqrt{n}\kappa^3}{(1-\lambda)^2\epsilon^2})$ sample complexity and $O(\frac{\kappa^3}{(1-\lambda)^2\epsilon^2})$ communication complexity for the nonconvex-strongly-concave minimax problem. As far as we know, our work is the first to achieve such theoretical complexities for this kind of minimax problem. Finally, we apply our method to AUC maximization, and the experimental results confirm the effectiveness of our method.
Machine Learning, Optimization and Control
What problem does this paper attempt to address?
This paper studies finite-sum minimax optimization in the decentralized setting, where the training data is distributed across multiple worker nodes. Most existing stochastic minimax methods ignore this setting. The paper proposes a decentralized stochastic gradient descent ascent (DSGDA) method for nonconvex-strongly-concave minimax problems. In DSGDA, each worker node computes a variance-reduced gradient on its local dataset and updates its model parameters through a gradient-tracking communication scheme. For nonconvex-strongly-concave problems, the paper proves a communication complexity of $O(\frac{\kappa^3}{(1-\lambda)^2\epsilon^2})$ and a sample complexity at each worker node of $O(\frac{\sqrt{n}\kappa^3}{(1-\lambda)^2\epsilon^2})$, which it reports as better than existing methods, including those that require periodic computation of the full gradient. Experiments on AUC maximization show DSGDA performing better than the compared methods, supporting its effectiveness in practical applications.
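To make the two ingredients named above concrete (a variance-reduced local gradient and gradient-tracking communication), here is a minimal NumPy sketch on a toy problem. It is not the paper's algorithm. It assumes a SAGA-style gradient table as the variance-reduced estimator, a 1/3-weight ring as the communication graph, a scalar quadratic objective $f_i(x, y) = \frac{1}{2}a_i x^2 + xy - \frac{1}{2}(y - c_i)^2$ that is strongly concave in $y$ (and, for simplicity, convex in $x$), and an untuned step size. The function names `ring_mixing_matrix` and `run_dsgda_sketch` are hypothetical.

```python
import numpy as np


def ring_mixing_matrix(k):
    """Doubly stochastic mixing matrix for a ring (assumed topology)."""
    W = np.zeros((k, k))
    for i in range(k):
        W[i, i] += 1.0 / 3.0
        W[i, (i - 1) % k] += 1.0 / 3.0
        W[i, (i + 1) % k] += 1.0 / 3.0
    return W


def run_dsgda_sketch(a, c, eta=0.02, steps=5000, seed=0):
    """Toy decentralized descent ascent with SAGA tables and gradient tracking.

    a, c: arrays of shape (K, n) holding the per-sample coefficients of
    f_i(x, y) = 0.5 * a_i * x**2 + x * y - 0.5 * (y - c_i)**2 on K workers.
    Returns the final local copies of x and y, each of shape (K,).
    """
    rng = np.random.default_rng(seed)
    K, n = a.shape
    W = ring_mixing_matrix(K)
    rows = np.arange(K)
    x = np.zeros(K)
    y = np.zeros(K)

    # SAGA tables: last gradient seen for every local sample.
    # They are filled by one initial pass; no full gradient is computed later.
    tab_x = a * x[:, None] + y[:, None]
    tab_y = x[:, None] - y[:, None] + c
    g_x, g_y = tab_x.mean(axis=1), tab_y.mean(axis=1)
    u, v = g_x.copy(), g_y.copy()  # gradient trackers

    for _ in range(steps):
        # Mix with neighbours, then descend in x and ascend in y.
        x = W @ x - eta * u
        y = W @ y + eta * v

        # Each worker draws one local sample and forms a SAGA estimate.
        idx = rng.integers(n, size=K)
        new_x = a[rows, idx] * x + y
        new_y = x - y + c[rows, idx]
        g_x_new = new_x - tab_x[rows, idx] + tab_x.mean(axis=1)
        g_y_new = new_y - tab_y[rows, idx] + tab_y.mean(axis=1)
        tab_x[rows, idx] = new_x
        tab_y[rows, idx] = new_y

        # Gradient tracking: mix the trackers and add the estimate change.
        u = W @ u + g_x_new - g_x
        v = W @ v + g_y_new - g_y
        g_x, g_y = g_x_new, g_y_new
    return x, y


# Toy data: 4 workers with 8 local samples each.
data_rng = np.random.default_rng(1)
a = data_rng.uniform(0.5, 1.5, size=(4, 8))
c = data_rng.normal(size=(4, 8))
x_final, y_final = run_dsgda_sketch(a, c)

# Closed-form saddle point of the averaged toy objective.
x_star = -c.mean() / (1.0 + a.mean())
y_star = x_star + c.mean()
```

On this toy problem, every worker's local copy should end up close to the closed-form saddle point `(x_star, y_star)`, although each worker touches only one local sample per iteration and communicates only with its ring neighbours. The estimator, topology, and step size are illustrative choices and would need to be replaced by the paper's own to reproduce its guarantees.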