Yihan Zhang, My T. Thai, Jie Wu, Hongchang Gao
Abstract: Stochastic bilevel optimization finds widespread applications in machine learning, including meta-learning, hyperparameter optimization, and neural architecture search. To extend stochastic bilevel optimization to distributed data, several decentralized stochastic bilevel optimization algorithms have been developed. However, existing methods often suffer from slow convergence rates and high communication costs in heterogeneous settings, limiting their applicability to real-world tasks. To address these issues, we propose two novel decentralized stochastic bilevel gradient descent algorithms based on simultaneous and alternating update strategies. Our algorithms can achieve faster convergence rates and lower communication costs than existing methods. Importantly, our convergence analyses do not rely on strong assumptions regarding heterogeneity. More importantly, our theoretical analysis clearly discloses how the additional communication required for estimating hypergradient under the heterogeneous setting affects the convergence rate. To the best of our knowledge, this is the first time such favorable theoretical results have been achieved with mild assumptions in the heterogeneous setting. Furthermore, we demonstrate how to establish the convergence rate for the alternating update strategy when combined with the variance-reduced gradient. Finally, experimental results confirm the efficacy of our algorithms.
Machine Learning; Distributed, Parallel, and Cluster Computing; Optimization and Control
What problem does this paper attempt to address?
This paper addresses three key problems of decentralized stochastic bilevel optimization algorithms in heterogeneous settings:
1. **High communication complexity**: Existing decentralized bilevel optimization algorithms for heterogeneous settings usually require many communication rounds and incur a high communication cost per round. For example, DSBO and Gossip-DSBO communicate inside the inner loop, so their number of communication rounds matches their iteration complexity, that is, \(O\left(\frac{1}{\epsilon^2}\log\frac{1}{\epsilon}\right)\). Moreover, these methods transmit a Hessian or Jacobian matrix in each round, resulting in a communication cost of \(O(d_y^2)\) or \(O(d_xd_y)\) per round.
2. **Strong assumptions in convergence analysis**: Existing methods usually rely on strong assumptions that limit heterogeneity in order to establish theoretical convergence rates. For example, DSBO requires the upper-level loss function to be Lipschitz continuous with respect to \(x\), that is, to have a bounded gradient \(\|\nabla_1 f^{(k)}(x,y)\|\leq c_{fx}\), and Gossip-DSBO additionally requires the lower-level loss function to be Lipschitz continuous with respect to \(y\), that is, \(\|\nabla_2 g^{(k)}(x,y)\|\leq c_{gy}\). These assumptions may not hold in practical applications.
3. **Challenges of the alternating update strategy**: Because bilevel optimization involves two variables, \(x\) and \(y\), they can be updated either simultaneously or alternately. In a heterogeneous setting, the alternating strategy introduces additional challenges. Specifically, it first updates \(y_t\) to \(y_{t+1}\) and then uses \(y_{t+1}\) to compute the gradient for updating \(x_t\), which may make the gradients across nodes more heterogeneous and thus affect convergence.
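The difference between the two update orders in item 3 can be sketched on a toy problem. The example below is a minimal, deterministic, single-node illustration and not the paper's algorithm: the quadratic objectives, the constants `a`, `b`, `c`, the step sizes, and the function names are all assumptions chosen for illustration. It uses the standard hypergradient \(\nabla_1 f - \nabla^2_{12} g\,[\nabla^2_{22} g]^{-1}\nabla_2 f\), which for this toy reduces to a scalar expression.

```python
# Toy bilevel problem (hypothetical, for illustration only):
#   lower level: g(x, y) = 0.5 * (y - A*x)**2   ->  y*(x) = A*x
#   upper level: f(x, y) = 0.5 * (y - B)**2 + 0.5 * C * x**2
A, B, C = 2.0, 1.0, 1.0


def grad_y_lower(x, y):
    """Gradient of the lower-level loss g with respect to y."""
    return y - A * x


def hypergradient(x, y):
    """Hypergradient estimate at (x, y).

    grad_1 f = C*x, grad_2 f = y - B, d2g/dxdy = -A, d2g/dy2 = 1,
    so grad_1 f - (d2g/dxdy) * (d2g/dy2)^-1 * grad_2 f = C*x + A*(y - B).
    """
    return C * x + A * (y - B)


def run(strategy, steps=2000, eta_x=0.05, eta_y=0.5):
    """Run gradient descent with the chosen update order."""
    x, y = 0.0, 0.0
    for _ in range(steps):
        if strategy == "simultaneous":
            # Both gradients are evaluated at the same point (x_t, y_t).
            gx, gy = hypergradient(x, y), grad_y_lower(x, y)
            x, y = x - eta_x * gx, y - eta_y * gy
        elif strategy == "alternating":
            # y is updated first; x then uses the fresh y_{t+1}.
            y = y - eta_y * grad_y_lower(x, y)
            x = x - eta_x * hypergradient(x, y)
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return x, y


if __name__ == "__main__":
    for name in ("simultaneous", "alternating"):
        print(name, run(name))
```

For these constants the upper-level objective \(F(x) = f(x, y^*(x))\) is minimized at \(x = 0.4\) with \(y^*(x) = 0.8\), and both update orders reach that point. The sketch only shows where the two strategies evaluate their gradients; the stochastic, multi-node difficulty described in item 3 is not modeled here.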
To solve these problems, this paper makes the following contributions:
- **Develop a new decentralized stochastic bilevel gradient descent algorithm, DSVRBGD-S**, based on the simultaneous update strategy and variance-reduced gradient estimators, which reduces both the number of communication rounds and the communication cost per round. Specifically, the communication cost per round is only \(O(d_x + d_y)\), and the number of communication rounds can be as small as \(O\left(\frac{1}{K\epsilon^{3/2}}\right)\). To the best of the authors' knowledge, this is the first method to achieve such a low communication complexity in a heterogeneous setting.
- **Establish the theoretical convergence rate of the algorithm** without relying on strong heterogeneity assumptions, and analyze in detail how the additional communication affects the convergence rate, in particular its dependence on the spectral gap of the communication topology. According to the authors, this is the first time such favorable theoretical results have been achieved in a heterogeneous setting.
- **Develop a second decentralized stochastic bilevel gradient descent algorithm, DSVRBGD-A**, based on the alternating update strategy and variance-reduced gradient estimators, which also has a low communication complexity and an established theoretical convergence rate. According to the authors, this is the first convergence rate established for an alternating variance-reduced gradient descent method in a heterogeneous setting.
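To make the communication quantities above concrete, the following sketch counts the numbers sent per round under the three cost regimes mentioned in this summary, and runs a simple gossip-averaging round on a ring topology. It is an illustration under stated assumptions, not the paper's protocol: the ring with equal weights of 1/3, the scalar node values, and all function names are hypothetical choices. The spectral gap is computed from the known eigenvalues of this circulant mixing matrix, \((1 + 2\cos(2\pi j/K))/3\).

```python
import math


def floats_per_round(d_x, d_y):
    """Numbers transmitted per node per round under each cost regime."""
    return {
        "hessian, O(d_y^2)": d_y * d_y,
        "jacobian, O(d_x * d_y)": d_x * d_y,
        "vectors only, O(d_x + d_y)": d_x + d_y,
    }


def ring_gossip_step(values):
    """One gossip round on a ring (assumed topology, needs >= 3 nodes).

    Each node replaces its value with the equal-weight average of itself
    and its two ring neighbors.
    """
    k = len(values)
    return [
        (values[(i - 1) % k] + values[i] + values[(i + 1) % k]) / 3.0
        for i in range(k)
    ]


def ring_spectral_gap(k):
    """1 minus the second-largest eigenvalue modulus of the ring mixing matrix."""
    second = max(
        abs((1.0 + 2.0 * math.cos(2.0 * math.pi * j / k)) / 3.0)
        for j in range(1, k)
    )
    return 1.0 - second


def disagreement(values):
    """Largest deviation of any node from the network average."""
    mean = sum(values) / len(values)
    return max(abs(v - mean) for v in values)


if __name__ == "__main__":
    print(floats_per_round(d_x=1000, d_y=100))
    vals = [float(i) for i in range(8)]  # one scalar per node, for brevity
    print("spectral gap:", ring_spectral_gap(len(vals)))
    for _ in range(50):
        vals = ring_gossip_step(vals)
    print("disagreement after 50 rounds:", disagreement(vals))
```

A smaller spectral gap (for example, a larger ring) means each gossip round removes less disagreement between nodes, which is one intuition for why the convergence rate depends on the communication topology.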
Through these contributions, the paper achieves more communication-efficient algorithm design and removes the strong heterogeneity assumptions that existing theoretical analyses rely on, filling a gap in the literature.