Abstract: Multi-objective reinforcement learning (MORL) aims to find a set of high-performing and diverse policies that address trade-offs between multiple conflicting objectives. However, in practice, decision makers (DMs) often deploy only one or a limited number of trade-off policies. Providing too many diversified trade-off policies to the DM not only significantly increases their workload but also introduces noise in multi-criterion decision-making. With this in mind, we propose a human-in-the-loop policy optimization framework for preference-based MORL that interactively identifies policies of interest. Our method proactively learns the DM's implicit preference information without requiring any a priori knowledge, which is often unavailable in real-world black-box decision scenarios. The learned preference information is used to progressively guide policy optimization towards policies of interest. We evaluate our approach against three conventional MORL algorithms that do not consider preference information and four state-of-the-art preference-based MORL algorithms on two MORL environments for robot control and smart grid management. Experimental results fully demonstrate the effectiveness of our proposed method in comparison to the other peer algorithms.

What problem does this paper attempt to address?

The paper addresses a problem in Multi-Objective Reinforcement Learning (MORL): how to steer policy search toward the trade-off policies that the Decision Maker (DM) actually prefers when objectives conflict, rather than handing the DM a large set of diverse policies to choose from. Specifically, the paper attempts to solve the following key issues:

1. **Effective Utilization of Preference Information**: Traditional MORL methods often generate a large number of trade-off policies, which increases the burden on the decision maker and may introduce noise into the multi-criterion decision-making process. The paper therefore proposes a human-in-the-loop policy optimization framework that actively learns the decision maker's implicit preference information and uses it to guide policy optimization.
2. **Combination of Preference Learning and Policy Optimization**: The proposed framework (referred to as CBOB) includes three core modules: seed policy generation, preference acquisition, and policy optimization. The modules work in coordination: the decision maker's preferences are learned progressively through interaction, and the direction of policy optimization is adjusted accordingly to find the policies the decision maker is interested in.
3. **Practicality and Efficiency**: The method targets real-world black-box decision scenarios, where prior preference information is usually unavailable. Because preferences are learned through interaction with the decision maker, no a priori knowledge of them is required.
4. **Experimental Validation**: The paper reports experiments on two MORL environments (robot control and smart grid management), comparing the proposed method with three conventional MORL algorithms that ignore preference information and four state-of-the-art preference-based MORL algorithms. The authors report that the results demonstrate the effectiveness of the proposed method relative to these peer algorithms.

In summary, the main contribution of the paper is a human-in-the-loop policy optimization framework that aims to improve the performance and practicality of MORL in real-world applications through a preference learning mechanism.
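
To make the preference-acquisition idea concrete, here is a minimal Python sketch. It is not the CBOB algorithm from the paper. It assumes a linear utility over objective values, fitted with a Bradley-Terry (logistic) model on pairwise comparisons. The candidate objective vectors, the simulated decision maker, and every function name are hypothetical and exist only for illustration.

```python
import math
import random


def utility(weights, objectives):
    """Linear utility: weighted sum of objective values (an assumption)."""
    return sum(w * f for w, f in zip(weights, objectives))


def fit_preference_weights(comparisons, n_objectives, lr=0.5, epochs=500):
    """Fit utility weights from (preferred, rejected) pairs.

    Gradient ascent on the Bradley-Terry log-likelihood, where
    P(a preferred over b) = sigmoid(utility(a) - utility(b)).
    """
    w = [1.0 / n_objectives] * n_objectives
    for _ in range(epochs):
        grad = [0.0] * n_objectives
        for better, worse in comparisons:
            diff = [b - c for b, c in zip(better, worse)]
            p = 1.0 / (1.0 + math.exp(-utility(w, diff)))
            for i in range(n_objectives):
                grad[i] += (1.0 - p) * diff[i]
        w = [wi + lr * g / len(comparisons) for wi, g in zip(w, grad)]
    return w


def simulated_dm(hidden_weights, a, b):
    """Hypothetical DM: returns (preferred, rejected) using hidden weights."""
    if utility(hidden_weights, a) >= utility(hidden_weights, b):
        return (a, b)
    return (b, a)


def run_demo(seed=0, n_queries=15):
    rng = random.Random(seed)
    # Hypothetical objective vectors of candidate trade-off policies,
    # placed on a quarter circle so that no candidate dominates another.
    candidates = [
        (math.cos(i * math.pi / 20), math.sin(i * math.pi / 20))
        for i in range(11)
    ]
    hidden = (0.8, 0.2)  # the DM's preference, unknown to the learner
    comparisons = []
    for _ in range(n_queries):
        a, b = rng.sample(candidates, 2)
        comparisons.append(simulated_dm(hidden, a, b))
    learned = fit_preference_weights(comparisons, n_objectives=2)
    # Recommend the candidate with the highest learned utility.
    best = max(candidates, key=lambda f: utility(learned, f))
    return learned, best, candidates


if __name__ == "__main__":
    learned, best, _ = run_demo()
    print("learned weights:", learned)
    print("recommended objective vector:", best)
```

In an interactive setting, the learned utility would be fed back to the policy optimizer so that later training concentrates on the region the decision maker prefers. The sketch stops at ranking a fixed set of candidates.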

Human-in-the-Loop Policy Optimization for Preference-Based Multi-Objective Reinforcement Learning

Beyond Reward: Offline Preference-guided Policy Optimization

C-MORL: Multi-Objective Reinforcement Learning through Efficient Discovery of Pareto Front

Traversing Pareto Optimal Policies: Provably Efficient Multi-Objective Reinforcement Learning

A Two-Stage Multi-Objective Deep Reinforcement Learning Framework

Scaling Pareto-Efficient Decision Making Via Offline Multi-Objective RL

Eliciting User Preferences for Personalized Multi-Objective Decision Making through Comparative Feedback

Demonstration Guided Multi-Objective Reinforcement Learning

Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization

Approximating Pareto Frontier Through Bayesian-optimization-directed Robust Multi-objective Reinforcement Learning

Policy Optimization in RLHF: The Impact of Out-of-preference Data

In Search for Architectures and Loss Functions in Multi-Objective Reinforcement Learning

Hyperparameter Optimization for Multi-Objective Reinforcement Learning

Prediction Guided Meta-Learning for Multi-Objective Reinforcement Learning

Navigating Trade-offs: Policy Summarization for Multi-Objective Reinforcement Learning

Robust Multiobjective Reinforcement Learning Considering Environmental Uncertainties

Deep Reinforcement Learning for Multiobjective Optimization

Learning Pareto Set for Multi-Objective Continuous Robot Control

Scalable Multi-Objective Reinforcement Learning with Fairness Guarantees using Lorenz Dominance

Direct Preference-based Policy Optimization without Reward Modeling

Dynamic preference inference network: Improving sample efficiency for multi-objective reinforcement learning by preference estimation