Model-Based Reinforcement Learning with Multi-Task Offline Pretraining

Minting Pan, Yitao Zheng, Yunbo Wang, Xiaokang Yang
2024-06-05
Abstract: Pretraining reinforcement learning (RL) models on offline datasets is a promising way to improve their training efficiency in online tasks, but challenging due to the inherent mismatch in dynamics and behaviors across various tasks. We present a model-based RL method that learns to transfer potentially useful dynamics and action demonstrations from offline data to a novel task. The main idea is to use the world models not only as simulators for behavior learning but also as tools to measure the task relevance for both dynamics representation transfer and policy transfer. We build a time-varying, domain-selective distillation loss to generate a set of offline-to-online similarity weights. These weights serve two purposes: (i) adaptively transferring the task-agnostic knowledge of physical dynamics to facilitate world model training, and (ii) learning to replay relevant source actions to guide the target policy. We demonstrate the advantages of our approach compared with the state-of-the-art methods in Meta-World and DeepMind Control Suite.
Machine Learning, Artificial Intelligence, Robotics
What problem does this paper attempt to address?
This paper addresses how to use multi-task offline data to improve the training efficiency and performance of reinforcement learning (RL) models on online visual control tasks. Specifically, it focuses on transferring useful dynamics and behavioral knowledge from offline data to a new online task, so that training time drops and generalization improves even when the source and target tasks differ significantly.

### Main problems solved in the paper

1. **Sample efficiency in visual control tasks**:
   - Visual reinforcement learning (visual RL) must learn policies from high-dimensional, complex observations. This usually requires a large number of interactions with the environment, which limits its use in the real world.
   - Model-based RL improves sample efficiency by learning a differentiable simulator of the environment (the world model) and optimizing policies on imagined trajectories.
2. **Knowledge transfer across tasks**:
   - Existing pretraining and fine-tuning methods can improve performance to some extent, but direct fine-tuning can suffer from differences in visual observations, physical dynamics, or action spaces between tasks.
   - The paper proposes a domain-selective transfer learning method that adaptively identifies how relevant each offline task is to the online task and uses relevant source actions to guide learning of the target policy.

### Main contributions

1. **Novel pretraining and fine-tuning pipeline**:
   - A pretraining method based on multi-task offline data transfers the dynamics of multiple source tasks by learning a set of importance weights.
   - These importance weights are used both for representation learning and for behavior guidance during policy optimization.
2. **Domain-selective behavior learning scheme**:
   - An action replay generation module reproduces source-task actions from target states, providing guidance that helps improve the target policy.
   - The most relevant source tasks are selected dynamically, so the selection can change from one time step to the next.

### Experimental verification

The paper reports experiments on two benchmarks, Meta-World and DeepMind Control Suite. The results show that the proposed method significantly outperforms existing baseline methods on multiple tasks, especially when dealing with visual inputs.

### Summary

This paper tackles the use of multi-task offline data to improve online performance in visual control tasks by introducing a domain-selective transfer learning method. By adaptively identifying task relevance and replaying relevant source actions, the method transfers useful knowledge and improves the training efficiency and generalization ability of the model.
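
### Illustrative sketch of similarity-weighted transfer

The idea of importance weights that steer both representation transfer and source-task selection can be illustrated with a minimal sketch. This is not the paper's formulation: the relevance score (a per-source prediction error on target data), the softmax with a temperature, the squared-distance distillation term, and all function names below are assumptions chosen only to make the mechanism concrete.

```python
import math


def similarity_weights(source_errors, temperature=1.0):
    """Turn per-source errors into weights that sum to 1.

    Assumption: a lower error of a source world model on target data
    means that source task is more relevant, so it gets a larger weight.
    """
    scaled = [-e / temperature for e in source_errors]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [x / total for x in exps]


def weighted_distillation_loss(target_feat, source_feats, weights):
    """Weighted sum of squared distances between the target features
    and the features produced by each source model (hypothetical loss)."""
    loss = 0.0
    for w, feat in zip(weights, source_feats):
        sq_dist = sum((t - s) ** 2 for t, s in zip(target_feat, feat))
        loss += w * sq_dist
    return loss


def most_relevant_per_step(weights_per_step):
    """For each time step, return the index of the highest-weight source."""
    return [max(range(len(w)), key=w.__getitem__) for w in weights_per_step]


# Toy numbers, invented for illustration only.
errors = [0.2, 1.5, 3.0]            # one error per source task
weights = similarity_weights(errors)
target = [0.5, -0.1]                # target model features
sources = [[0.4, 0.0], [1.0, 1.0], [-2.0, 0.3]]
loss = weighted_distillation_loss(target, sources, weights)
selected = most_relevant_per_step([weights, similarity_weights([2.0, 0.1, 1.0])])
```

Recomputing the weights at every time step is what would make the selection time-varying in this sketch: as the errors change during online training, the weights shift and a different source task can become the one that guides the target policy.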