Abstract: This paper investigates multi-objective reinforcement learning (MORL), which focuses on learning Pareto optimal policies in the presence of multiple reward functions. Despite MORL's significant empirical success, its optimization targets and efficient learning algorithms are still not satisfactorily understood. We systematically analyze several optimization targets, assessing both their ability to find all Pareto optimal policies and how well preferences over the objectives control the learned policies. This analysis identifies Tchebycheff scalarization as a favorable scalarization method for MORL. Because Tchebycheff scalarization is non-smooth, we reformulate its minimization problem as a new min-max-max optimization problem. Using this reformulation, we then propose efficient algorithms that learn Pareto optimal policies for the stochastic policy class. We first propose an online UCB-based algorithm that achieves an $\varepsilon$ learning error with $\tilde{\mathcal{O}}(\varepsilon^{-2})$ sample complexity for a single given preference. To further reduce the cost of environment exploration under different preferences, we propose a preference-free framework that first explores the environment without pre-defined preferences and then generates solutions for any number of preferences. We prove that it requires only $\tilde{\mathcal{O}}(\varepsilon^{-2})$ exploration complexity in the exploration phase and no additional exploration afterward. Lastly, we analyze smooth Tchebycheff scalarization, an extension of Tchebycheff scalarization, and prove that it is better at distinguishing Pareto optimal policies from other weakly Pareto optimal policies based on the entry values of preference vectors. We also extend our algorithms and theoretical analysis to this optimization target.
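To make the two scalarizations named in the abstract concrete, the following is a minimal sketch using their common textbook forms: the weighted Tchebycheff distance to an ideal point, and a log-sum-exp smoothing of it. The exact definitions in the paper may differ (for example in how the ideal point is offset), and the function names, the three value vectors, the ideal point, the weights, and the smoothing parameter `mu` are all hypothetical. The example works over a small finite set of value vectors, not an MDP.

```python
import math

def tchebycheff(values, ideal, weights):
    """Weighted Tchebycheff scalarization (to be minimized):
    max_i w_i * (ideal_i - value_i). Textbook form, assumed here."""
    return max(w * (z - v) for v, z, w in zip(values, ideal, weights))

def smooth_tchebycheff(values, ideal, weights, mu):
    """Log-sum-exp smoothing of the above:
    mu * log(sum_i exp(w_i * (ideal_i - value_i) / mu))."""
    gaps = [w * (z - v) / mu for v, z, w in zip(values, ideal, weights)]
    m = max(gaps)  # subtract the max before exponentiating, for stability
    return mu * (m + math.log(sum(math.exp(g - m) for g in gaps)))

def linear(values, weights):
    """Linear scalarization (to be maximized), shown only for contrast."""
    return sum(w * v for v, w in zip(values, weights))

# Hypothetical value vectors of three candidates on two objectives.
# None of them dominates another, so all three are Pareto optimal here.
candidates = {"A": (1.0, 0.0), "B": (0.0, 1.0), "C": (0.4, 0.4)}
ideal = (1.0, 1.0)    # assumed best achievable value per objective
weights = (0.5, 0.5)  # assumed preference vector
mu = 0.1              # assumed smoothing parameter

best_tch = min(candidates, key=lambda k: tchebycheff(candidates[k], ideal, weights))
best_lin = max(candidates, key=lambda k: linear(candidates[k], weights))
tch_c = tchebycheff(candidates["C"], ideal, weights)
stch_c = smooth_tchebycheff(candidates["C"], ideal, weights, mu)
```

In this toy set, the Tchebycheff objective selects the balanced candidate `C` under equal weights, while the linear objective prefers one of the extreme candidates. The smoothed value always lies between the Tchebycheff value and that value plus `mu * log(m)` for `m` objectives, so it approaches the non-smooth objective as `mu` shrinks.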

Actor-critic multi-objective reinforcement learning for non-linear utility functions

Successive Convex Approximation Based Off-Policy Optimization for Constrained Reinforcement Learning

Multi-Objective Reinforcement Learning: Convexity, Stationarity and Pareto Optimality

Utility-Based Reinforcement Learning: Unifying Single-objective and Multi-objective Reinforcement Learning

UCB-driven Utility Function Search for Multi-objective Reinforcement Learning

A Robust Policy Bootstrapping Algorithm for Multi-objective Reinforcement Learning in Non-stationary Environments

Traversing Pareto Optimal Policies: Provably Efficient Multi-Objective Reinforcement Learning

Opponent Learning Awareness and Modelling in Multi-Objective Normal Form Games

A Scale-Independent Multi-Objective Reinforcement Learning with Convergence Analysis

In Search for Architectures and Loss Functions in Multi-Objective Reinforcement Learning

C-MORL: Multi-Objective Reinforcement Learning through Efficient Discovery of Pareto Front

MORAL: Aligning AI with Human Norms through Multi-Objective Reinforced Active Learning

Scalable Primal-Dual Actor-Critic Method for Safe Multi-Agent RL with General Utilities

A Two-Stage Multi-Objective Deep Reinforcement Learning Framework

Finite-Time Convergence and Sample Complexity of Actor-Critic Multi-Objective Reinforcement Learning

Welfare and Fairness in Multi-objective Reinforcement Learning

Learning Skills to Navigate without a Master: A Sequential Multi-Policy Reinforcement Learning Algorithm

An Empirical Investigation of Value-Based Multi-objective Reinforcement Learning for Stochastic Environments

A Multi-Agent Off-Policy Actor-Critic Algorithm for Distributed Reinforcement Learning

Multi-objective optimisation via the R2 utilities

Generalizing Across Multi-Objective Reward Functions in Deep Reinforcement Learning