A Penalty-Based Method for Communication-Efficient Decentralized Bilevel Programming

Parvin Nazari, Ahmad Mousavi, Davoud Ataee Tarzanagh, George Michailidis
2024-10-10
Abstract: Bilevel programming has recently received attention in the literature due to its wide range of applications, including reinforcement learning and hyper-parameter optimization. However, it is widely assumed that the underlying bilevel optimization problem is solved either by a single machine or by multiple machines connected in a star-shaped network, i.e., in a federated learning setting. The latter approach suffers from a high communication cost on the central node (e.g., parameter server). Hence, there is an interest in developing methods that solve bilevel optimization problems in a communication-efficient, decentralized manner. To that end, this paper introduces a penalty function-based decentralized algorithm with theoretical guarantees for this class of optimization problems. Specifically, a distributed alternating gradient-type algorithm for solving consensus bilevel programming over a decentralized network is developed. A key feature of the proposed algorithm is the estimation of the hyper-gradient of the penalty function through decentralized computation of matrix-vector products and a few vector communications. The estimation is integrated into an alternating algorithm for solving the penalized reformulation of the bilevel optimization problem. Under appropriate step sizes and penalty parameters, our theoretical framework ensures non-asymptotic convergence to the optimal solution of the original problem under various convexity conditions. Our theoretical result highlights improvements in the iteration complexity of decentralized bilevel optimization, all while making efficient use of vector communication. Empirical results demonstrate that the proposed method performs well in real-world settings.
Machine Learning; Distributed, Parallel, and Cluster Computing; Optimization and Control
What problem does this paper attempt to address?
This paper aims to develop a communication-efficient decentralized algorithm for bilevel programming. Specifically, the authors address the high communication cost and the computational burden of solving bilevel optimization problems over decentralized networks. Existing methods either run on a single machine or on a star-shaped network (as in federated learning), which places a heavy communication load on the central node (e.g., a parameter server), or they require expensive Hessian computations and matrix communication between nodes.

To address these issues, the paper proposes the Decentralized Alternating Gradient Method (DAGM), a decentralized alternating gradient method based on penalty functions. DAGM reformulates the bilevel optimization problem with a penalty function and approximates the inverse Hessian with a Neumann series, which avoids computing the full Hessian explicitly. This lets the algorithm run efficiently over a decentralized network while keeping the communication cost low.

### Key Contributions

1. **Lightweight Computation and Communication**: By optimizing a penalized reformulation, DAGM keeps both decentralized communication and computation lightweight. It estimates the hyper-gradient (i.e., the gradient of the outer function) using local matrix-vector products and decentralized vector communications.
2. **Iteration Complexity and Speedup**: DAGM comes with guarantees on convergence rate and communication complexity for smooth strongly convex, convex, and non-convex bilevel problems. In particular, it achieves a linear speedup (an \( n^{-1} \) improvement in the complexity bound) even though only vectors are communicated.
3. **Experimental Evaluation**: The authors evaluate DAGM on large-scale problems and demonstrate the robustness and scalability of DIHGP, the paper's decentralized inverse Hessian-gradient product estimation procedure. This is presented as the first theoretical and empirical study of a Neumann-series-based DIHGP method.

### Summary

The core objective of this paper is an efficient decentralized bilevel optimization algorithm that combines a penalty-function reformulation with a Neumann series approximation to reduce communication cost and improve computational efficiency. The method targets large-scale distributed systems and, according to the reported experiments, performs well and remains robust in practical applications.
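To make the Neumann series idea mentioned above concrete, here is a minimal single-machine sketch in Python. It is not the paper's decentralized DIHGP procedure; it only illustrates the underlying identity \( H^{-1} v = \eta \sum_{k=0}^{\infty} (I - \eta H)^k v \), which holds for a symmetric positive definite \( H \) when \( \| I - \eta H \| < 1 \). The function name `neumann_inverse_hvp`, the callback `hvp`, and the parameters `eta` and `num_terms` are hypothetical names chosen for this illustration, and the test matrix is synthetic.

```python
import numpy as np


def neumann_inverse_hvp(hvp, v, eta, num_terms):
    """Approximate H^{-1} v by a truncated Neumann series.

    Computes eta * sum_{k=0}^{num_terms} (I - eta*H)^k v using only
    matrix-vector products hvp(u) = H @ u, so H is never inverted.
    All names here are illustrative, not taken from the paper.
    """
    term = v.copy()   # holds (I - eta*H)^k v, starting at k = 0
    total = v.copy()  # running sum of the series terms
    for _ in range(num_terms):
        term = term - eta * hvp(term)  # apply (I - eta*H) once more
        total += term
    return eta * total


# Synthetic symmetric positive definite "Hessian" (an assumption for the demo).
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
H = A @ A.T + 5.0 * np.eye(5)
v = rng.standard_normal(5)

# eta = 1 / lambda_max(H) guarantees ||I - eta*H||_2 < 1 for SPD H.
eta = 1.0 / np.linalg.norm(H, 2)

approx = neumann_inverse_hvp(lambda u: H @ u, v, eta, num_terms=500)
exact = np.linalg.solve(H, v)
error = np.linalg.norm(approx - exact)
print(f"approximation error: {error:.2e}")
```

Because each iteration needs only one product with \( H \), the same pattern can in principle be driven by Hessian-vector products and vector exchanges rather than by forming or communicating a full matrix, which is the property the summary attributes to DAGM. The truncation length trades accuracy against the number of products, and convergence slows as \( H \) becomes more ill-conditioned.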