Abstract: This paper investigates multi-objective reinforcement learning (MORL), which aims to learn Pareto optimal policies in the presence of multiple reward functions. Despite MORL's significant empirical success, a satisfactory understanding of its various optimization targets and of efficient learning algorithms is still lacking. We systematically analyze several optimization targets, assessing whether each can find all Pareto optimal policies and how well preferences over the objectives control the learned policy. Based on this analysis, we identify Tchebycheff scalarization as a favorable scalarization method for MORL. Because Tchebycheff scalarization is non-smooth, we reformulate its minimization problem as a new min-max-max optimization problem. Using this reformulation, we propose efficient algorithms for the stochastic policy class to learn Pareto optimal policies. We first propose an online UCB-based algorithm that achieves an $\varepsilon$ learning error with $\tilde{\mathcal{O}}(\varepsilon^{-2})$ sample complexity for a single given preference. To further reduce the cost of environment exploration under different preferences, we propose a preference-free framework that first explores the environment without pre-defined preferences and then generates solutions for any number of preferences. We prove that it requires only $\tilde{\mathcal{O}}(\varepsilon^{-2})$ exploration complexity in the exploration phase and no additional exploration afterward. Lastly, we analyze smooth Tchebycheff scalarization, an extension of Tchebycheff scalarization, and prove that it is more advantageous for distinguishing Pareto optimal policies from other weakly Pareto optimal policies based on the entry values of the preference vectors. We also extend our algorithms and theoretical analysis to accommodate this optimization target.
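
To illustrate why the choice of scalarization matters, the following is a minimal sketch and not the paper's algorithm. It assumes the textbook weighted Tchebycheff loss, $\max_i \lambda_i (z_i^* - V_i)$, and a common log-sum-exp smoothing of it; the abstract does not state the exact definitions used in the paper, so these forms are assumptions. The three value vectors in `V`, the ideal point `ideal`, the weight grid, and the smoothing parameter `mu` are all hypothetical. The toy compares which candidates each scalarization can select as the preference weights vary.

```python
import numpy as np

def linear(values, weights):
    """Linear scalarization (higher is better): sum_i w_i * V_i."""
    return values @ weights

def tchebycheff(values, weights, ideal):
    """Assumed weighted Tchebycheff loss (lower is better): max_i w_i * (ideal_i - V_i)."""
    return np.max(weights * (ideal - values), axis=-1)

def smooth_tchebycheff(values, weights, ideal, mu):
    """Assumed smoothing: mu * log sum_i exp(w_i * (ideal_i - V_i) / mu), computed stably."""
    z = weights * (ideal - values) / mu
    z_max = z.max(axis=-1, keepdims=True)
    return mu * (z_max.squeeze(-1) + np.log(np.exp(z - z_max).sum(axis=-1)))

# Hypothetical value vectors of three candidate policies over two objectives.
# The third is not dominated by either of the others, but it lies in a
# concave part of the front formed by these three points.
V = np.array([[1.0, 0.0], [0.0, 1.0], [0.4, 0.4]])
ideal = V.max(axis=0)  # per-objective best value among the candidates

# Sweep preference weights (w, 1 - w) over a coarse grid.
weights_grid = [np.array([w, 1.0 - w]) for w in np.linspace(0.05, 0.95, 19)]
linear_winners = {int(np.argmax(linear(V, w))) for w in weights_grid}
tch_winners = {int(np.argmin(tchebycheff(V, w, ideal))) for w in weights_grid}

print("selected by linear scalarization:", sorted(linear_winners))
print("selected by Tchebycheff scalarization:", sorted(tch_winners))
```

In this toy, no weight on the grid makes linear scalarization select the third candidate, while Tchebycheff scalarization selects it for weights near $(0.5, 0.5)$. The smoothed loss is always at least the Tchebycheff loss and exceeds it by at most `mu * log(m)` for `m` objectives, so it approaches the non-smooth loss as `mu` shrinks.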

Policy Learning for Many Outcomes of Interest: Combining Optimal Policy Trees with Multi-objective Bayesian Optimisation

Reduced-Rank Multi-objective Policy Learning and Optimization

Policy Learning with Rare Outcomes

Pareto-Optimal Estimation and Policy Learning on Short-term and Long-term Treatment Effects

Human-in-the-Loop Policy Optimization for Preference-Based Multi-Objective Reinforcement Learning

POETREE: Interpretable Policy Learning with Adaptive Decision Trees

Learning Optimal Prescriptive Trees from Observational Data

Navigating Trade-offs: Policy Summarization for Multi-Objective Reinforcement Learning

Exploring trade-offs in agro-ecological landscapes: using a multi-objective land-use allocation model to support agroforestry research

Optimal Decision Tree Policies for Markov Decision Processes

Policy Trees for Prediction: Interpretable and Adaptive Model Selection for Machine Learning

Traversing Pareto Optimal Policies: Provably Efficient Multi-Objective Reinforcement Learning

More Efficient Policy Learning via Optimal Retargeting

Mixed-Integer Optimization with Constraint Learning

Offline Multi-Action Policy Learning: Generalization and Optimization

Integrating decision modeling and machine learning to inform treatment stratification

Multi-Objective Recommendation via Multivariate Policy Learning

Policy Optimization with Advantage Regularization for Long-Term Fairness in Decision Systems

Reward Model Learning vs. Direct Policy Optimization: A Comparative Analysis of Learning from Human Preferences

Eliciting User Preferences for Personalized Multi-Objective Decision Making through Comparative Feedback

Learning to Expand/Contract Pareto Sets in Dynamic Multi-Objective Optimization With a Changing Number of Objectives