Pruning the Way to Reliable Policies: A Multi-Objective Deep Q-Learning Approach to Critical Care

Ali Shirali, Alexander Schubert, Ahmed Alaa
DOI: https://doi.org/10.1109/JBHI.2024.3415115
2024-10-14
Abstract: Medical treatments often involve a sequence of decisions, each informed by previous outcomes. This process closely aligns with reinforcement learning (RL), a framework for optimizing sequential decisions to maximize cumulative rewards under unknown dynamics. While RL shows promise for creating data-driven treatment plans, its application in medical contexts is challenging due to the frequent need to use sparse rewards, primarily defined based on mortality outcomes. This sparsity can reduce the stability of offline estimates, posing a significant hurdle in fully utilizing RL for medical decision-making. We introduce a deep Q-learning approach to obtain more reliable critical care policies by integrating relevant but noisy frequently measured biomarker signals into the reward specification without compromising the optimization of the main outcome. Our method prunes the action space based on all available rewards before training a final model on the sparse main reward. This approach minimizes potential distortions of the main objective while extracting valuable information from intermediate signals to guide learning. We evaluate our method in off-policy and offline settings using simulated environments and real health records from intensive care units. Our empirical results demonstrate that our method outperforms common offline RL methods such as conservative Q-learning and batch-constrained deep Q-learning. By disentangling sparse rewards and frequently measured reward proxies through action pruning, our work represents a step towards developing reliable policies that effectively harness the wealth of available information in data-intensive critical care environments.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
This paper asks how deep Q-learning can be used to learn more reliable treatment policies for the intensive care unit (ICU). It targets two challenges:

1. **Sparse Reward Signals**: In medical settings, the main reward is usually based on patient survival. This signal is sparse and delayed, which makes offline value estimates less stable.
2. **Frequent but Noisy Biomarker Signals**: Many biomarkers are measured frequently and can provide immediate feedback, but these signals are noisy and can distort learning, leading to suboptimal policies.

To address these problems, the authors propose a two-stage, multi-objective deep Q-learning method.

### First Stage: Multi-objective Deep Q-Learning

- **Integrating Multiple Reward Signals**: Learn a vector-valued Q-function that handles several reward signals at once, including the sparse main reward and the frequent but noisy auxiliary rewards.
- **Action Space Pruning**: Prune the action space using all available reward signals, removing actions that are unlikely to perform well under any weighting of the rewards. This step reduces the dependence on noisy intermediate rewards and avoids their negative impact on the final policy.

### Second Stage: Q-Learning Based on the Pruned Action Space

- **Using Only Sparse Main Rewards**: Run Q-learning on the pruned action space with the sparse main reward only, so that the learned policy focuses on the long-term main goal and is not disturbed by short-term noisy signals.

### Specific Implementation of the Method

- **Vector-valued Q-function**: Learn a Q-function that outputs a vector, with each dimension corresponding to one reward signal.
- **Conservative Estimation**: To make the update equation valid for any weight combination, use a conservative linear estimate to approximate the softmax operation.
- **Posterior Sampling**: Sample weights w from the posterior distribution P(w|s, a) to improve the accuracy of the approximation.
- **Double Q-Learning**: Introduce a target network Q' to reduce overestimation and improve the stability of learning.
- **Action Space Pruning**: Define a stochastic policy πβ_P(a|s) from the learned Q-function and the weight prior, and obtain the pruned action set Πβ(s) by sampling from it.

### Experimental Verification

The authors evaluate the method in offline and off-policy settings, using simulated environments (such as Lunar Lander and Sepsis Simulator) and real-world health records. The results show that the method outperforms common offline reinforcement learning methods such as Conservative Q-Learning (CQL) and Batch-Constrained Deep Q-Learning (BCQ), and that it can use intermediate signals to simplify the learning problem while minimizing their impact on the main goal.

In conclusion, the paper addresses the challenges posed by sparse rewards and noisy intermediate signals in medical settings with a two-stage algorithm, offering a new approach to building reliable, data-driven medical decision-support tools.
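
The two-stage idea described above (prune with all rewards, then act on the main reward alone) can be illustrated for a single state with a small sketch. This is not the authors' implementation. It assumes the vector-valued Q estimates are already given, uses a flat Dirichlet prior over reward weights, and picks the inverse temperature `beta` and the sample count arbitrarily. The function names `sample_simplex_weights`, `prune_actions`, and `greedy_on_pruned` are hypothetical.

```python
import math
import random


def sample_simplex_weights(n_rewards, rng):
    """Draw one weight vector from a flat Dirichlet prior (an assumed prior)."""
    draws = [rng.expovariate(1.0) for _ in range(n_rewards)]
    total = sum(draws)
    return [d / total for d in draws]


def softmax(scores, beta):
    """Softmax with inverse temperature beta, shifted for numerical stability."""
    top = max(scores)
    exps = [math.exp(beta * (s - top)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def prune_actions(q_vectors, beta=5.0, n_samples=50, seed=0):
    """Return the set of actions that were sampled at least once.

    q_vectors[a] is the vector-valued Q estimate of action a in one state,
    with one entry per reward signal. Each round samples a weight vector,
    scalarizes the Q vectors with it, and samples one action from the
    resulting softmax policy.
    """
    rng = random.Random(seed)
    n_rewards = len(q_vectors[0])
    kept = set()
    for _ in range(n_samples):
        w = sample_simplex_weights(n_rewards, rng)
        scalar_q = [sum(wi * qi for wi, qi in zip(w, q)) for q in q_vectors]
        probs = softmax(scalar_q, beta)
        action = rng.choices(range(len(q_vectors)), weights=probs, k=1)[0]
        kept.add(action)
    return kept


def greedy_on_pruned(main_q, kept):
    """Stage two: act greedily on main-reward Q-values, restricted to kept actions."""
    return max(kept, key=lambda a: main_q[a])


# Illustrative values only: columns are (main reward, biomarker proxy).
q_vectors = [[1.0, 0.2], [0.3, 1.0], [0.8, 0.8], [-10.0, -10.0]]
kept = prune_actions(q_vectors)

# A noisy main-reward estimate that overrates action 3.
main_q = [1.0, 0.3, 0.8, 2.0]
print(kept, greedy_on_pruned(main_q, kept))
```

In this toy example action 3 scores far below the others under every weighting, so the softmax gives it a negligible probability and it never enters the pruned set. The second stage therefore cannot select it, even though the noisy main-reward estimate ranks it highest. In the paper's setting the second stage would retrain a Q-function on the pruned action space rather than reuse fixed estimates as done here.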