Continual Deep Reinforcement Learning with Task-Agnostic Policy Distillation

Muhammad Burhan Hafez, Kerim Erekmen
2024-11-26
Abstract: Central to the development of universal learning systems is the ability to solve multiple tasks without retraining from scratch when new data arrives. This is crucial because each task requires significant training time. Addressing the problem of continual learning necessitates various methods due to the complexity of the problem space. This problem space includes: (1) addressing catastrophic forgetting to retain previously learned tasks, (2) demonstrating positive forward transfer for faster learning, (3) ensuring scalability across numerous tasks, and (4) facilitating learning without requiring task labels, even in the absence of clear task boundaries. In this paper, the Task-Agnostic Policy Distillation (TAPD) framework is introduced. This framework alleviates problems (1)-(4) by incorporating a task-agnostic phase, where an agent explores its environment without any external goal and maximizes only its intrinsic motivation. The knowledge gained during this phase is later distilled for further exploration. Therefore, the agent acts in a self-supervised manner by systematically seeking novel states. By utilizing task-agnostic distilled knowledge, the agent can solve downstream tasks more efficiently, leading to improved sample efficiency. Our code is available at the repository: https://github.com/wabbajack1/TAPD
Machine Learning
What problem does this paper attempt to address?
This paper targets four key challenges in continual reinforcement learning (continual RL):

1. **Catastrophic Forgetting**: Retaining previously learned knowledge while learning a new task.
2. **Positive Forward Transfer**: Reusing earlier knowledge so that new tasks are learned faster.
3. **Scalability**: Working effectively across many tasks and maintaining performance as the number of tasks grows.
4. **Learning Without Task Labels**: Learning effectively without task labels, even when task boundaries are unclear.

To address these challenges, the paper introduces the **Task-Agnostic Policy Distillation (TAPD)** framework. TAPD adds a task-agnostic phase in which the agent explores the environment without an external goal and maximizes only its intrinsic motivation. The knowledge obtained in this phase is then distilled and used for further exploration.

### Main Features of the TAPD Framework

1. **Self-Supervised Learning**:
   - In the task-agnostic phase, the agent explores the environment in a self-supervised manner, without relying on the reward function of a specific task.
   - By systematically seeking novel states, the agent accumulates general knowledge that can be reused in subsequent tasks.
2. **Policy Distillation**:
   - Knowledge obtained in the task-agnostic phase is distilled into a knowledge-base network so that it can be quickly adapted and applied when specific tasks are encountered.
   - Reusing the distilled knowledge improves sample efficiency, so the agent solves downstream tasks with fewer interactions.
3. **Alternating Training Mechanism**:
   - Builds on the "Progress & Compress" framework proposed by Schwarz et al. by adding a task-agnostic phase.
   - Specific tasks are learned in the progress phase, newly learned knowledge is distilled in the compress phase, and the task-agnostic phase focuses on exploration and the acquisition of general knowledge.

In this way, the TAPD framework aims to overcome the main challenges in continual reinforcement learning (catastrophic forgetting, positive forward transfer, scalability, and learning without task labels) and thereby support more efficient and adaptable multi-task learning.
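
The two ideas summarized above, distilling a policy into a knowledge base and rewarding novel states, can be illustrated with a minimal sketch. It is not taken from the TAPD repository. The KL direction (teacher to student), the `temperature` parameter, and the count-based `novelty_bonus` are assumptions chosen for illustration, since the summary does not specify the exact distillation objective or intrinsic reward.

```python
import math
from collections import Counter


def softmax(logits, temperature=1.0):
    """Turn action logits into probabilities; a higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same action set."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """Assumed objective: KL(teacher || student) on the action distribution of one state."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return kl_divergence(teacher, student)


def novelty_bonus(visit_counts, state):
    """Hypothetical intrinsic reward: 1 / sqrt(number of visits to this state)."""
    visit_counts[state] += 1
    return 1.0 / math.sqrt(visit_counts[state])


if __name__ == "__main__":
    # The student matches the teacher exactly, so the loss is zero.
    print(distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
    # The student disagrees with the teacher, so the loss is positive.
    print(distillation_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]))
    # Revisiting a state shrinks the bonus, which pushes the agent toward new states.
    counts = Counter()
    print([round(novelty_bonus(counts, "s0"), 3) for _ in range(3)])
```

In a full agent, the loss would be averaged over states sampled from the active policy and minimized with respect to the knowledge-base network's parameters, and the bonus would stand in for the task reward during the task-agnostic phase.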