Human-in-the-Loop Policy Optimization for Preference-Based Multi-Objective Reinforcement Learning

Ke Li, Han Guo
2024-01-04
Abstract: Multi-objective reinforcement learning (MORL) aims to find a set of high-performing and diverse policies that address trade-offs between multiple conflicting objectives. However, in practice, decision makers (DMs) often deploy only one or a limited number of trade-off policies. Providing too many diversified trade-off policies to the DM not only significantly increases their workload but also introduces noise in multi-criterion decision-making. With this in mind, we propose a human-in-the-loop policy optimization framework for preference-based MORL that interactively identifies policies of interest. Our method proactively learns the DM's implicit preference information without requiring any a priori knowledge, which is often unavailable in real-world black-box decision scenarios. The learned preference information is used to progressively guide policy optimization towards policies of interest. We evaluate our approach against three conventional MORL algorithms that do not consider preference information and four state-of-the-art preference-based MORL algorithms on two MORL environments for robot control and smart grid management. Experimental results fully demonstrate the effectiveness of our proposed method in comparison to the other peer algorithms.
Neural and Evolutionary Computing
What problem does this paper attempt to address?
The paper addresses a problem in Multi-Objective Reinforcement Learning (MORL): when objectives conflict, how can the search be steered towards the few trade-off policies that the Decision Maker (DM) actually cares about, instead of handing the DM a large and diverse policy set to sift through? Specifically, the paper covers the following points:

1. **Effective Utilization of Preference Information**: Conventional MORL methods often generate a large number of trade-off policies. This increases the DM's workload and can introduce noise into multi-criterion decision-making. The paper therefore proposes a human-in-the-loop policy optimization framework that actively learns the DM's implicit preference information and uses it to guide policy optimization.
2. **Combination of Preference Learning and Policy Optimization**: The proposed framework (referred to as CBOB) has three core modules: seed policy generation, preference acquisition, and policy optimization. The modules work together: the framework learns the DM's preferences step by step through interaction and adjusts the direction of policy optimization accordingly, so that the search converges on the policies of interest to the DM.
3. **Practicality and Efficiency**: The method targets real-world black-box decision scenarios, where prior preference information is usually unavailable. Because preferences are learned through interaction, no a priori knowledge of them is required.
4. **Experimental Validation**: The paper evaluates the method on two MORL environments (robot control and smart grid management) against three conventional MORL algorithms that ignore preference information and four state-of-the-art preference-based MORL algorithms. The authors report that the results demonstrate the effectiveness of the proposed method relative to these peers.
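
The interplay of the three modules described above (seed policy generation, preference acquisition, policy optimization) can be illustrated with a small toy loop. The sketch below is not the CBOB algorithm: it assumes a two-objective problem whose trade-off curve is a quarter circle, a simulated DM with a hidden linear utility, and a simple hypothesis-elimination scheme swapped in for the paper's preference learning. All names (`evaluate_policy`, `ask_dm`, `run_interaction`, `HYPOTHESES`, `HIDDEN_DM_WEIGHTS`) are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical hypothesis space: candidate linear utility weights (w1, w2).
HYPOTHESES = [(1 - k / 100, k / 100) for k in range(101)]
# Hidden preference of a simulated DM (an assumption for this toy example).
HIDDEN_DM_WEIGHTS = HYPOTHESES[80]


def evaluate_policy(theta):
    """Toy stand-in for training and evaluating a policy: theta in [0, 1]
    maps to a point on a two-objective trade-off curve (both maximized)."""
    angle = theta * math.pi / 2
    return (math.cos(angle), math.sin(angle))


def utility(weights, objectives):
    """Linear scalarization of a two-objective vector."""
    return weights[0] * objectives[0] + weights[1] * objectives[1]


def prefers(weights, a, b):
    """True if a is at least as good as b under the given weights."""
    return utility(weights, a) >= utility(weights, b)


def ask_dm(a, b):
    """Simulated DM answer; in practice this would be a query to a human."""
    return prefers(HIDDEN_DM_WEIGHTS, a, b)


def run_interaction(rounds=3, n_candidates=5):
    # 1) Seed policy generation: evenly spread trade-offs.
    thetas = [i / (n_candidates - 1) for i in range(n_candidates)]
    consistent = list(HYPOTHESES)
    width = 0.5
    for _ in range(rounds):
        points = [evaluate_policy(t) for t in thetas]
        # 2) Preference acquisition: pairwise queries prune every
        #    hypothesis that disagrees with the DM's answer.
        for i in range(len(points)):
            for j in range(i + 1, len(points)):
                answer = ask_dm(points[i], points[j])
                consistent = [w for w in consistent
                              if prefers(w, points[i], points[j]) == answer]
        # Preference estimate: mean of the surviving hypotheses.
        estimate = tuple(sum(w[d] for w in consistent) / len(consistent)
                         for d in range(2))
        # 3) Policy optimization: grid search for the best policy under the
        #    estimated utility, then resample candidates in a narrower window.
        center = max((i / 200 for i in range(201)),
                     key=lambda t: utility(estimate, evaluate_policy(t)))
        width /= 2
        thetas = [min(1.0, max(0.0, center + width * (2 * i / (n_candidates - 1) - 1)))
                  for i in range(n_candidates)]
    return center, estimate, consistent
```

Calling `run_interaction()` returns the recommended policy parameter, the estimated weights, and the hypotheses still consistent with the DM's answers. In this toy setting the candidate window shrinks around the region the simulated DM prefers, which mirrors the idea of progressively guiding optimization towards policies of interest. A real system would replace `evaluate_policy` with RL training and `ask_dm` with actual human feedback.
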
In summary, the paper's main contribution is a human-in-the-loop policy optimization framework that uses interactive preference learning to improve the performance and practicality of MORL in real-world applications.