Abstract: This paper investigates multi-objective reinforcement learning (MORL), which focuses on learning Pareto optimal policies in the presence of multiple reward functions. Despite MORL's significant empirical success, a satisfactory understanding of its various optimization targets and of efficient learning algorithms is still lacking. Our work offers a systematic analysis of several optimization targets, assessing their ability to find all Pareto optimal policies and the controllability of the learned policies through preferences over the different objectives. We then identify Tchebycheff scalarization as a favorable scalarization method for MORL. Because Tchebycheff scalarization is non-smooth, we reformulate its minimization problem as a new min-max-max optimization problem. Using this reformulation, we propose efficient algorithms for the stochastic policy class to learn Pareto optimal policies. We first propose an online UCB-based algorithm that achieves an $\varepsilon$ learning error with $\tilde{\mathcal{O}}(\varepsilon^{-2})$ sample complexity for a single given preference. To further reduce the cost of environment exploration under different preferences, we propose a preference-free framework that first explores the environment without pre-defined preferences and then generates solutions for any number of preferences. We prove that it requires only $\tilde{\mathcal{O}}(\varepsilon^{-2})$ exploration complexity in the exploration phase and no additional exploration afterward. Lastly, we analyze smooth Tchebycheff scalarization, an extension of Tchebycheff scalarization, which is proved to be more advantageous for distinguishing Pareto optimal policies from other weakly Pareto optimal policies based on the entry values of preference vectors. Furthermore, we extend our algorithms and theoretical analysis to accommodate this optimization target.
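To make the scalarization targets named in the abstract concrete, the following is a minimal sketch, not the paper's algorithm. It assumes the common textbook forms of the weighted Tchebycheff loss, $\max_i \lambda_i (z_i - V_i)$, and of its log-sum-exp smoothing; the abstract itself does not state these formulas. The candidate value vectors, the reference point `ideal`, and all function names are hypothetical. The sketch compares linear and Tchebycheff scalarization on three Pareto optimal value vectors, one of which lies in a non-convex part of the front.

```python
import math

def linear(values, weights):
    """Linear scalarization (to be maximized): sum_i w_i * V_i."""
    return sum(w * v for v, w in zip(values, weights))

def tchebycheff(values, ideal, weights):
    """Weighted Tchebycheff loss (to be minimized): max_i w_i * (z_i - V_i).

    Assumed textbook form; `ideal` is a reference point z that upper-bounds
    every objective value.
    """
    return max(w * (z - v) for v, z, w in zip(values, ideal, weights))

def smooth_tchebycheff(values, ideal, weights, mu=0.1):
    """Log-sum-exp smoothing of the Tchebycheff loss (assumed form).

    Computed with the max-shift trick for numerical stability.
    """
    terms = [w * (z - v) / mu for v, z, w in zip(values, ideal, weights)]
    m = max(terms)
    return mu * (m + math.log(sum(math.exp(t - m) for t in terms)))

# Hypothetical value vectors of three policies under two objectives.
# All three are Pareto optimal, but C lies below the segment joining A and B.
candidates = {"A": (1.0, 0.0), "B": (0.0, 1.0), "C": (0.45, 0.45)}
ideal = (1.0, 1.0)  # hypothetical reference point

def best_by_linear(weights):
    return max(candidates, key=lambda k: linear(candidates[k], weights))

def best_by_tchebycheff(weights):
    return min(candidates, key=lambda k: tchebycheff(candidates[k], ideal, weights))
```

In this toy instance, no weight vector on the simplex makes linear scalarization select C, whereas the Tchebycheff loss selects C under the balanced preference (0.5, 0.5) and selects A when the preference shifts to (0.9, 0.1). The smoothed loss stays within `mu * log(2)` of the non-smooth one for two objectives.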

Traversing Pareto Optimal Policies: Provably Efficient Multi-Objective Reinforcement Learning
