Decision-Focused Model-based Reinforcement Learning for Reward Transfer

Abhishek Sharma, Sonali Parbhoo, Omer Gottesman, Finale Doshi-Velez
2024-01-02
Abstract: Decision-focused (DF) model-based reinforcement learning has recently been introduced as a powerful algorithm that can focus on learning the MDP dynamics that are most relevant for obtaining high returns. While this approach increases the agent's performance by directly optimizing the reward, it does so by learning less accurate dynamics from a maximum likelihood perspective. We demonstrate that when the reward function is defined by preferences over multiple objectives, the DF model may be sensitive to changes in the objective preferences. In this work, we develop the robust decision-focused (RDF) algorithm, which leverages the non-identifiability of DF solutions to learn models that maximize expected returns while simultaneously learning models that transfer to changes in the preference over multiple objectives. We demonstrate the effectiveness of RDF on two synthetic domains and two healthcare simulators, showing that it significantly improves the robustness of DF model learning to changes in the reward function without compromising training-time return.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the performance degradation of decision-focused (DF) model-based reinforcement learning when the reward function changes between training and deployment. Specifically:

1. **Limitations of the DF model**: The DF model learns environment dynamics by optimizing the return under a specific reward function. This improves the agent's performance during training, but at the cost of accuracy in the maximum likelihood estimation (MLE) sense. In particular, when the reward function is defined by preferences over multiple objectives, the DF model can be sensitive to changes in those preferences, which leads to poor performance at deployment.
2. **Changes in multi-objective rewards**: In many practical applications, the reward function may change. In healthcare, for example, doctors may adjust their preferences over treatment plans according to a patient's specific condition, such as the trade-off between long-term and short-term health effects, the aggressiveness of treatment, or other indicators of patient well-being. The model must therefore keep performing well under different reward preferences.
3. **Robustness requirements**: Ideally, the learned dynamics model performs well in training and also maintains high performance when it faces different reward preferences at deployment.

To address these problems, the authors propose the robust decision-focused (RDF) algorithm. Because DF solutions are non-unique (non-identifiable), several different models can achieve the same training-time return; RDF exploits this freedom to learn a model that also performs well under different reward preferences.
Specifically, the RDF algorithm achieves this goal in the following ways:

- **Optimization objective**: RDF optimizes the model parameters to maximize the expected return across a range of possible reward functions while preserving good performance under the training-time reward preference.
- **Theoretical analysis**: The authors provide a theoretical analysis showing that the RDF model can obtain a tighter upper bound than the DF model under the test-time reward function.
- **Experimental verification**: Through experiments on two synthetic domains and two healthcare simulators, the authors demonstrate the robustness and performance of the RDF model under different reward preferences.

In conclusion, this paper aims to improve the robustness and generalization of DF model learning under different reward preferences, so that learned models adapt better to the changing preferences encountered in practical applications.
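
The optimization objective described above can be illustrated with a small tabular sketch. This is not the authors' implementation: it replaces whatever optimization procedure the paper uses with an exhaustive search over a finite set of randomly generated candidate models. It also assumes, purely for illustration, that the true dynamics are available for evaluating returns. All names (`plan`, `true_return`, `score`, `select_models`), the MDP sizes, the discount factor, and the preference grid are hypothetical. The reward is taken to be a linear scalarization `R @ w` of a vector-valued reward, which is one common way to encode preferences over multiple objectives.

```python
import numpy as np

GAMMA = 0.9  # discount factor (assumed for this toy example)


def plan(T, r, iters=500):
    """Value iteration in model T (S, A, S) with scalar reward r (S, A).

    Returns the greedy deterministic policy as an array of shape (S,).
    """
    V = np.zeros(T.shape[0])
    for _ in range(iters):
        Q = r + GAMMA * (T @ V)  # (S, A, S) @ (S,) -> (S, A)
        V = Q.max(axis=1)
    return Q.argmax(axis=1)


def true_return(policy, T_true, r, mu0):
    """Exact return of a deterministic policy in the true MDP.

    Solves the linear Bellman equation (I - gamma * P_pi) V = r_pi.
    """
    S = T_true.shape[0]
    idx = np.arange(S)
    P_pi = T_true[idx, policy]  # (S, S) transitions under the policy
    r_pi = r[idx, policy]       # (S,) rewards under the policy
    V = np.linalg.solve(np.eye(S) - GAMMA * P_pi, r_pi)
    return float(mu0 @ V)


def score(T_hat, w, T_true, R, mu0):
    """Plan in the learned model T_hat, then evaluate in the true MDP."""
    r = R @ w  # scalarize the vector reward: (S, A, K) @ (K,) -> (S, A)
    return true_return(plan(T_hat, r), T_true, r, mu0)


def select_models(candidates, w_train, test_prefs, T_true, R, mu0, eps=1e-8):
    """Return (DF-style index, RDF-style index, train scores, robust scores)."""
    train = np.array([score(T, w_train, T_true, R, mu0) for T in candidates])
    robust = np.array([
        np.mean([score(T, w, T_true, R, mu0) for w in test_prefs])
        for T in candidates
    ])
    df_idx = int(train.argmax())  # DF-style: best training-time return only
    # RDF-style: among models that tie on training return (non-identifiability),
    # pick the one with the best average return over other preferences.
    feasible = np.flatnonzero(train >= train.max() - eps)
    rdf_idx = int(feasible[robust[feasible].argmax()])
    return df_idx, rdf_idx, train, robust


rng = np.random.default_rng(0)
S, A, K = 4, 2, 2  # states, actions, objectives (toy sizes)


def random_model():
    T = rng.random((S, A, S))
    return T / T.sum(axis=2, keepdims=True)  # normalize rows to probabilities


T_true = random_model()
R = rng.random((S, A, K))
mu0 = np.full(S, 1.0 / S)  # uniform initial state distribution
candidates = [random_model() for _ in range(20)]
w_train = np.array([0.8, 0.2])
test_prefs = [np.array([a, 1.0 - a]) for a in np.linspace(0.0, 1.0, 5)]

df_idx, rdf_idx, train, robust = select_models(
    candidates, w_train, test_prefs, T_true, R, mu0
)
print("DF-style choice: ", df_idx, train[df_idx], robust[df_idx])
print("RDF-style choice:", rdf_idx, train[rdf_idx], robust[rdf_idx])
```

By construction, the RDF-style choice matches the best training-time return (up to `eps`) and its average return over the sampled preferences is at least that of the DF-style choice. With only a handful of deterministic policies in such a small MDP, several candidate models typically induce the same policy under `w_train`, which is the kind of tie the selection step relies on.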