Harnessing Mixed Offline Reinforcement Learning Datasets via Trajectory Weighting

Zhang-Wei Hong, Pulkit Agrawal, Rémi Tachet des Combes, Romain Laroche
2023-06-23
Abstract: Most offline reinforcement learning (RL) algorithms return a target policy maximizing a trade-off between (1) the expected performance gain over the behavior policy that collected the dataset, and (2) the risk stemming from the out-of-distribution-ness of the induced state-action occupancy. It follows that the performance of the target policy is strongly related to the performance of the behavior policy and, thus, the trajectory return distribution of the dataset. We show that in mixed datasets consisting of mostly low-return trajectories and minor high-return trajectories, state-of-the-art offline RL algorithms are overly restrained by low-return trajectories and fail to exploit high-performing trajectories to the fullest. To overcome this issue, we show that, in deterministic MDPs with stochastic initial states, the dataset sampling can be re-weighted to induce an artificial dataset whose behavior policy has a higher return. This re-weighted sampling strategy may be combined with any offline RL algorithm. We further analyze that the opportunity for performance improvement over the behavior policy correlates with the positive-sided variance of the returns of the trajectories in the dataset. We empirically show that while CQL, IQL, and TD3+BC achieve only a part of this potential policy improvement, these same algorithms combined with our reweighted sampling strategy fully exploit the dataset. Furthermore, we empirically demonstrate that, despite its theoretical limitation, the approach may still be efficient in stochastic environments. The code is available at https://github.com/Improbable-AI/harness-offline-rl.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
The problem this paper addresses: in mixed offline reinforcement learning (offline RL) datasets, existing algorithms fail to fully exploit the small number of high-return trajectories and are instead constrained by the large number of low-return trajectories. Specifically:

1. **Problem Background**:
   - Offline RL optimizes policies from previously collected data without interacting with the environment, which makes training safer and cheaper.
   - However, current offline RL algorithms are usually anchored to the behavior policy, that is, the policy that collected the data. The performance of the learned policy therefore depends strongly on the quality of the behavior policy.
2. **Specific Problems**:
   - Mixed datasets contain many low-return trajectories and only a few high-return trajectories. Existing offline RL algorithms perform poorly in this setting because they are overly constrained by the low-return trajectories and cannot fully exploit the high-return ones.
   - As a result, the algorithms realize only part of the achievable improvement over the behavior policy.
3. **Objectives**:
   - Re-weight trajectories so that offline RL algorithms make better use of the high-return trajectories in mixed datasets, thereby improving overall performance.
4. **Solutions**:
   - The paper proposes a sampling strategy based on trajectory weighting, with two variants: Return-weighting (RW) and Advantage-weighting (AW). Both adjust the sampling weights of the trajectories in the dataset so that the algorithm focuses more on high-return trajectories.
   - Theoretical analysis shows that, in deterministic MDPs with stochastic initial states, re-weighting induces an artificial dataset whose behavior policy has a higher return, so the algorithm is less constrained by the low-return trajectories. Experiments support the method's effectiveness, especially on datasets where high-return trajectories are sparse.
5. **Experimental Verification**:
   - Experiments on multiple benchmark datasets show that the proposed weighted sampling strategy significantly outperforms uniform sampling and other baseline methods.
   - The results indicate that the weighted sampling strategy can improve performance even in stochastic environments, despite the theoretical analysis covering only deterministic dynamics.

In summary, this paper aims to overcome the limitations of existing offline RL algorithms on datasets with sparse high-return trajectories by re-weighting the trajectories in mixed datasets, thereby achieving better policy improvement.
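To make the return-weighting idea concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the sampling weights are a softmax over min-max normalized trajectory returns; the function names `return_weights` and `sample_batch`, the `temperature` parameter, and the choice to draw a uniform step inside each sampled trajectory are all illustrative assumptions rather than details taken from the paper.

```python
import math
import random


def return_weights(returns, temperature=0.1):
    """Map trajectory returns to sampling probabilities.

    Assumption: softmax over min-max normalized returns. A smaller
    temperature concentrates probability on high-return trajectories.
    """
    lo, hi = min(returns), max(returns)
    if hi == lo:
        # All returns are equal, so fall back to uniform sampling.
        return [1.0 / len(returns)] * len(returns)
    # Normalized return lies in [0, 1]; subtracting 1 makes the largest
    # logit 0, which keeps math.exp from overflowing.
    logits = [((g - lo) / (hi - lo) - 1.0) / temperature for g in returns]
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_batch(traj_lengths, weights, batch_size, rng):
    """Draw (trajectory index, step index) pairs.

    Trajectories are drawn according to `weights`; the step inside each
    drawn trajectory is uniform (an illustrative choice).
    """
    traj_ids = rng.choices(range(len(traj_lengths)), weights=weights, k=batch_size)
    return [(i, rng.randrange(traj_lengths[i])) for i in traj_ids]


# Hypothetical toy dataset: two low-return trajectories and one high-return one.
returns = [1.0, 2.0, 10.0]
lengths = [5, 7, 3]
weights = return_weights(returns)
batch = sample_batch(lengths, weights, batch_size=4, rng=random.Random(0))
```

Under these assumptions, the expected return of a sampled trajectory is at least the dataset's mean return, which mirrors the summary's point that re-weighting behaves like a dataset collected by a better behavior policy. The sampled indices could then feed the mini-batch construction of any offline RL algorithm in place of uniform sampling.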