Abstract: Decision-focused (DF) model-based reinforcement learning has recently been introduced as a powerful algorithm that can focus on learning the MDP dynamics that are most relevant for obtaining high returns. While this approach increases the agent's performance by directly optimizing the reward, it does so by learning less accurate dynamics from a maximum likelihood perspective. We demonstrate that when the reward function is defined by preferences over multiple objectives, the DF model may be sensitive to changes in the objective preferences. In this work, we develop the robust decision-focused (RDF) algorithm, which leverages the non-identifiability of DF solutions to learn models that maximize expected returns while simultaneously learning models that transfer to changes in the preference over multiple objectives. We demonstrate the effectiveness of RDF on two synthetic domains and two healthcare simulators, showing that it significantly improves the robustness of DF model learning to changes in the reward function without compromising training-time return.

What problem does this paper attempt to address?

The paper addresses the performance degradation of decision-focused (DF) model-based reinforcement learning when the reward function changes between training and deployment. Specifically:

1. **Limitations of the DF model**: The DF model learns environment dynamics by optimizing the return under a specific reward function. This improves the agent's performance during training, but at the cost of model accuracy from a maximum likelihood estimation (MLE) perspective. In particular, when the reward function is defined by preferences over multiple objectives, the DF model can be sensitive to changes in those preferences, which can lead to poor performance at deployment.
2. **Changes in multi-objective rewards**: In many practical applications, the reward function may change. For example, in healthcare, doctors may adjust their preferences over treatment plans according to a patient's specific condition, such as weighing long-term against short-term health effects, the aggressiveness of treatment, or other indicators of patient well-being. Such changes require the model to maintain good performance under different reward preferences.
3. **Robustness requirements**: Ideally, the learned dynamics model should perform well not only during training but also under different reward preferences at deployment.

To address these problems, the authors propose the robust decision-focused (RDF) algorithm, which exploits the non-identifiability (non-uniqueness) of DF solutions to learn a model that performs well under different reward preferences. Specifically, RDF achieves this as follows:

- **Optimization objective**: RDF optimizes the model parameters to maximize the expected return over a range of possible reward functions while preserving good performance under the training-time reward preferences.
- **Theoretical analysis**: The authors provide a theoretical analysis, proving that the RDF model can obtain a tighter upper bound than the DF model under the test-time reward function.
- **Experimental verification**: Through experiments on two synthetic domains and two healthcare simulators, the authors demonstrate the robustness and performance of the RDF model under different reward preferences.

In conclusion, the paper aims to improve the robustness and generalization of the learned model across reward preferences through the RDF algorithm, so that it adapts better to changing preferences in practical applications.
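The idea of exploiting non-uniqueness can be illustrated with a toy selection problem. The sketch below is not the authors' algorithm. It assumes the reward is a weighted sum of per-objective returns, and that each candidate dynamics model has already been reduced to the per-objective returns of the policy planned in it. The model names, the numbers, and the helpers `scalarized_return`, `df_select`, and `robust_select` are all hypothetical.

```python
from typing import Dict, List, Sequence

# Hypothetical per-objective returns of the policy planned in each candidate
# dynamics model. Illustrative numbers only, not results from the paper.
# Objective 0 and objective 1 could be, e.g., short- and long-term health.
CANDIDATES: Dict[str, List[float]] = {
    "model_a": [10.0, 0.0],
    "model_b": [10.0, 4.0],
    "model_c": [4.0, 5.0],
}


def scalarized_return(objective_returns: Sequence[float],
                      weights: Sequence[float]) -> float:
    """Return under a preference vector: weighted sum of objective returns."""
    return sum(w * g for w, g in zip(weights, objective_returns))


def df_select(candidates: Dict[str, List[float]],
              train_w: Sequence[float]) -> str:
    """DF-style choice: any maximizer of the training-time return.

    Python's max() returns the first maximal item, so ties are broken by
    insertion order, i.e. arbitrarily with respect to other preferences.
    """
    return max(candidates,
               key=lambda name: scalarized_return(candidates[name], train_w))


def robust_select(candidates: Dict[str, List[float]],
                  train_w: Sequence[float],
                  other_ws: Sequence[Sequence[float]],
                  tol: float = 1e-9) -> str:
    """Robust choice (assumed form): among models whose training-time return
    is within `tol` of the best, pick the one with the highest mean return
    over a set of alternative preference vectors."""
    best = max(scalarized_return(g, train_w) for g in candidates.values())
    tied = [name for name, g in candidates.items()
            if scalarized_return(g, train_w) >= best - tol]

    def mean_return(name: str) -> float:
        total = sum(scalarized_return(candidates[name], w) for w in other_ws)
        return total / len(other_ws)

    return max(tied, key=mean_return)


train_w = [1.0, 0.0]                                # training-time preference
other_ws = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]     # possible deployment preferences

print(df_select(CANDIDATES, train_w))               # model_a
print(robust_select(CANDIDATES, train_w, other_ws)) # model_b
```

Under the training preference, `model_a` and `model_b` are indistinguishable (both return 10), so a purely decision-focused criterion has no reason to prefer either. The robust criterion keeps the same training-time return but picks `model_b`, which also does better when the preference shifts toward the second objective.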

Decision-Focused Model-based Reinforcement Learning for Reward Transfer
