Tianchen Zhou, FNU Hairi, Haibo Yang, Jia Liu, Tian Tong, Fan Yang, Michinari Momma, Yan Gao
Abstract: Reinforcement learning with multiple, potentially conflicting objectives is pervasive in real-world applications, while this problem remains theoretically under-explored. This paper tackles the multi-objective reinforcement learning (MORL) problem and introduces an innovative actor-critic algorithm named MOAC which finds a policy by iteratively making trade-offs among conflicting reward signals. Notably, we provide the first analysis of finite-time Pareto-stationary convergence and corresponding sample complexity in both discounted and average reward settings. Our approach has two salient features: (a) MOAC mitigates the cumulative estimation bias resulting from finding an optimal common gradient descent direction out of stochastic samples. This enables provable convergence rate and sample complexity guarantees independent of the number of objectives; (b) With proper momentum coefficient, MOAC initializes the weights of individual policy gradients using samples from the environment, instead of manual initialization. This enhances the practicality and robustness of our algorithm. Finally, experiments conducted on a real-world dataset validate the effectiveness of our proposed method.
What problem does this paper attempt to address?
This paper addresses finite-time convergence to Pareto-stationary solutions in multi-objective reinforcement learning (MORL), together with the corresponding sample complexity analysis. Specifically, it proposes MOAC (Multi-Objective Actor-Critic), an actor-critic algorithm that finds a policy by iteratively making trade-offs among conflicting reward signals. The main contributions of the paper are as follows:
1. **Proposing a unified multi-objective actor-critic algorithm framework**: MOAC builds on an MGDA-style (Multiple Gradient Descent Algorithm) policy gradient update and applies to both the discounted-reward and average-reward settings in MORL. The framework guarantees finite-time convergence to an ϵ-Pareto-stationary solution with a sample complexity of O(1/ϵ²).
2. **Alleviating the cumulative estimation bias**: To reduce the cumulative estimation bias caused by stochastic MGDA-type policy parameter updates, the paper proposes a momentum mechanism. Notably, with this mechanism the convergence rate and sample complexity of MOAC are independent of the number of objectives, in sharp contrast to results in general multi-objective optimization (MOO), which usually depend on the number of objectives M.
3. **Environment-initialized weights**: Building on the momentum mechanism, the paper shows that with an appropriate momentum coefficient schedule, MOAC can initialize the weights of the individual policy gradients from data sampled in the environment instead of by manual initialization. This enhances the practicality and robustness of the method.
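To make the MGDA-style update in item 1 more concrete, the sketch below computes the min-norm convex combination of two gradient estimates, which serves as a common update direction in the two-objective case. It uses the standard closed form for two gradients. The function names and the toy gradients are illustrative assumptions; this is not MOAC's implementation, which handles M objectives and actor-critic gradient estimates.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def min_norm_weight(g1, g2):
    """Return lam in [0, 1] minimizing ||lam*g1 + (1-lam)*g2||^2."""
    diff = [a - b for a, b in zip(g1, g2)]
    denom = dot(diff, diff)
    if denom == 0.0:
        return 0.5  # identical gradients: every weight gives the same direction
    lam = -dot(diff, g2) / denom    # unconstrained minimizer of the quadratic
    return min(1.0, max(0.0, lam))  # clip to the interval [0, 1]

def common_direction(g1, g2):
    """Min-norm convex combination of the two gradients, plus its weight."""
    lam = min_norm_weight(g1, g2)
    d = [lam * a + (1.0 - lam) * b for a, b in zip(g1, g2)]
    return d, lam

# Hypothetical gradient estimates for two conflicting objectives.
g1, g2 = [1.0, 0.0], [0.0, 1.0]
d, lam = common_direction(g1, g2)
print(lam, d)  # prints: 0.5 [0.5, 0.5]
```

The resulting direction has a non-negative inner product with both gradients, so a small step along it does not decrease either objective to first order. When the min-norm combination is the zero vector (for example, two exactly opposing gradients), no such common direction exists and the point is Pareto-stationary.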
### Background and Motivation of the Paper
Traditional reinforcement learning (RL) mainly focuses on optimizing a single reward, but real-world applications often involve multiple, potentially conflicting objectives. For example, in an RL-based short-video recommendation system, the agent needs to optimize a multi-dimensional reward that includes the user's viewing time and the numbers of likes, dislikes, and comments. Similarly, in an e-commerce recommendation system, the agent needs to balance the preferences of different user groups: some users prefer fast delivery, while others are willing to accept slower delivery in exchange for a lower price.
These applications require solving the multi-objective reinforcement learning (MORL) problem. However, research on MORL is still in its infancy and, in particular, lacks a rigorous theoretical understanding of finite-time convergence and sample complexity. This paper therefore aims to establish a theoretical foundation for MORL.
### Technical Challenges
1. **Actor-critic dependence**: In actor-critic RL, the actor and critic components are both approximated through value functions guided by the Bellman optimality principle, which creates a complex dependence between the two. The coupling between multiple objectives further exacerbates this dependence, so convergence analyses from traditional multi-objective optimization (MOO) do not apply directly to actor-critic policy-gradient MORL methods.
2. **Cumulative estimation bias**: In actor-critic MORL, trajectory lengths are limited in practice, so both the actor and the critic must update their parameters through stochastic approximation. Stochastic MGDA-type updates therefore inevitably introduce cumulative estimation bias. If not handled carefully, this bias may significantly degrade the performance of MORL or even cause the policy parameter updates to diverge.
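One way to see where such a bias can come from: the min-norm weights are a nonlinear function of the gradients, so zero-mean noise in the gradient samples does not average out in the weights. The toy sketch below illustrates this for two objectives with hand-picked noisy samples. The function name and all numbers are assumptions chosen for illustration, not quantities from the paper.

```python
def mgda_weight(g1, g2):
    """Weight on g1 in the min-norm convex combination of g1 and g2.

    Assumes g1 != g2 so that the denominator is non-zero.
    """
    diff = [a - b for a, b in zip(g1, g2)]
    denom = sum(x * x for x in diff)
    lam = -sum(x * y for x, y in zip(diff, g2)) / denom
    return min(1.0, max(0.0, lam))

g1 = [1.0, 0.0]        # first gradient, assumed noise-free for simplicity
g2_true = [0.0, 1.0]   # true second gradient

# Two equally likely noisy samples of g2 whose average equals g2_true.
g2_samples = [[0.0, 1.5], [0.0, 0.5]]

lam_true = mgda_weight(g1, g2_true)
lam_avg = sum(mgda_weight(g1, g) for g in g2_samples) / len(g2_samples)
bias = lam_avg - lam_true

print(lam_true, lam_avg, bias)
```

Here the gradient noise is unbiased, yet the averaged weight differs from the weight computed on the true gradients (about 0.446 versus 0.5). Repeating such biased weight estimates over many policy updates is what allows the error to accumulate.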
### Main Contributions
1. **Unified multi-objective actor-critic algorithm framework**: The paper proposes MOAC, a multi-objective actor-critic framework based on MGDA-style policy-gradient updates that applies to both the discounted-reward and average-reward settings in MORL. The framework guarantees finite-time convergence to a Pareto-stationary solution with a sample complexity of O(1/ϵ²).
2. **Momentum mechanism**: A momentum mechanism is proposed to alleviate the cumulative estimation bias, making the convergence rate and sample complexity of MOAC independent of the number of objectives.
3. **Environment-initialized weights**: With an appropriate momentum coefficient schedule, MOAC can initialize the weights of the individual policy gradients from data sampled in the environment, enhancing the practicality and robustness of the method.
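Items 2 and 3 can be illustrated with a short sketch. It assumes the momentum mechanism takes the form of a convex combination, λ_t = (1 − η_t) λ_{t−1} + η_t λ̂_t, where λ̂_t denotes weights computed from fresh samples. The summary does not state the update rule, so this form, the schedule η_t = 1/t, and all names and numbers below are assumptions made for illustration.

```python
def momentum_update(lam_prev, lam_hat, eta):
    """Blend the previous weights with fresh sample-based weights."""
    return [(1.0 - eta) * p + eta * h for p, h in zip(lam_prev, lam_hat)]

def schedule(t):
    """Assumed momentum schedule with eta_1 = 1 (not taken from the paper)."""
    return 1.0 / t

def run(lam_init, lam_hats):
    """Apply the momentum update over a sequence of sample-based weights."""
    lam = list(lam_init)
    for t, lam_hat in enumerate(lam_hats, start=1):
        lam = momentum_update(lam, lam_hat, schedule(t))
    return lam

# Hypothetical sample-based weight estimates for two objectives.
lam_hats = [[0.2, 0.8], [0.6, 0.4], [0.7, 0.3]]

# Two very different manual initializations.
a = run([1.0, 0.0], lam_hats)
b = run([0.0, 1.0], lam_hats)
print(a == b)  # prints: True
```

Because η_1 = 1 in this sketch, the first update overwrites the manual initialization with the first sample-based estimate, so the manual choice has no effect on later iterates. For t > 1 the update averages over past estimates, which damps the noise in any single estimate, and each iterate stays on the probability simplex because it is a convex combination of points on the simplex.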
### Related Work
1. **MGDA-type multi-objective optimization (MOO) methods**: Multi-objective optimization focuses on simultaneously optimizing a set of objective functions. The MGDA algorithm has received increasing attention in recent years, but its convergence