Abstract: Central to the development of universal learning systems is the ability to solve multiple tasks without retraining from scratch when new data arrives. This is crucial because each task requires significant training time. Addressing the problem of continual learning necessitates various methods due to the complexity of the problem space. This problem space includes: (1) addressing catastrophic forgetting to retain previously learned tasks, (2) demonstrating positive forward transfer for faster learning, (3) ensuring scalability across numerous tasks, and (4) facilitating learning without requiring task labels, even in the absence of clear task boundaries. In this paper, the Task-Agnostic Policy Distillation (TAPD) framework is introduced. This framework alleviates problems (1)-(4) by incorporating a task-agnostic phase, where an agent explores its environment without any external goal and maximizes only its intrinsic motivation. The knowledge gained during this phase is later distilled for further exploration. Therefore, the agent acts in a self-supervised manner by systematically seeking novel states. By utilizing task-agnostic distilled knowledge, the agent can solve downstream tasks more efficiently, leading to improved sample efficiency. Our code is available at https://github.com/wabbajack1/TAPD.

What problem does this paper attempt to address?

This paper addresses several key challenges in continual reinforcement learning (continual RL):

1. **Catastrophic Forgetting**: How to retain previously learned knowledge while learning a new task.
2. **Positive Forward Transfer**: How to reuse earlier knowledge so that new tasks are mastered more quickly.
3. **Scalability**: How to keep the algorithm effective as the number of tasks increases.
4. **Learning Without Task Labels**: How to learn effectively without task labels, even in the absence of clear task boundaries.

To address these problems, the paper introduces a new framework, **Task-Agnostic Policy Distillation (TAPD)**. The framework adds a task-agnostic phase in which the agent explores the environment without an external goal and maximizes only its intrinsic motivation. The knowledge obtained in this phase is then distilled and used for further exploration.

### Main Features of the TAPD Framework

1. **Self-Supervised Learning**:
   - In the task-agnostic phase, the agent explores the environment in a self-supervised manner, without relying on the reward function of any specific task.
   - By systematically seeking novel states, the agent accumulates general knowledge that can be reused in subsequent tasks.
2. **Policy Distillation**:
   - The knowledge obtained in the task-agnostic phase is distilled into a knowledge-base network so that it can be quickly adapted and applied when a specific task is encountered.
   - This improves sample efficiency, enabling the agent to solve downstream tasks more efficiently.
3. **Alternating Training Mechanism**:
   - Builds on the "Progress & Compress" framework proposed by Schwarz et al. by adding a task-agnostic phase.
   - Specific tasks are learned in the progress phase, newly learned knowledge is distilled in the compress phase, and the task-agnostic phase focuses on exploration and the acquisition of general knowledge.

In this way, the TAPD framework aims to overcome the main challenges in continual reinforcement learning, namely catastrophic forgetting, positive forward transfer, scalability, and learning without task labels, thereby achieving more efficient multi-task learning and adaptability.
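The two ingredients described above, a novelty-seeking intrinsic reward and policy distillation, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper or its repository: `CountNovelty` uses a simple count-based bonus as a stand-in for whatever intrinsic-motivation signal TAPD actually uses, and `distillation_loss` uses the KL divergence between action distributions, a common choice in the policy distillation literature.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of action logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]


def distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) between two action distributions.

    Hypothetical helper: the teacher stands for the exploring policy and
    the student for the knowledge-base network being distilled into.
    """
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


class CountNovelty:
    """Count-based intrinsic reward (illustrative stand-in only).

    States visited less often yield a larger bonus, so an agent that
    maximizes this reward is pushed toward novel states.
    """

    def __init__(self):
        self.counts = {}

    def reward(self, state):
        self.counts[state] = self.counts.get(state, 0) + 1
        return 1.0 / math.sqrt(self.counts[state])


if __name__ == "__main__":
    bonus = CountNovelty()
    # The bonus shrinks as the same state is revisited.
    print([round(bonus.reward("s0"), 3) for _ in range(3)])
    # The loss is zero only when student and teacher agree.
    print(distillation_loss([2.0, 0.0, 0.0], [0.0, 0.0, 2.0]))
```

In a full agent, the intrinsic reward would replace the task reward during the task-agnostic phase, and the distillation loss would be minimized with respect to the student network's parameters over states visited by the teacher.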

Continual Deep Reinforcement Learning with Task-Agnostic Policy Distillation

PolyTask: Learning Unified Policies through Behavior Distillation

Reinforcement Learning via Auxiliary Task Distillation

Experience Consistency Distillation Continual Reinforcement Learning for Robotic Manipulation Tasks

Dual Policy Distillation

Continual Task Learning through Adaptive Policy Self-Composition

TAME: Task Agnostic Continual Learning using Multiple Experts

Prototype-Sample Relation Distillation: Towards Replay-Free Continual Learning

Behavior Self-Organization Supports Task Inference for Continual Robot Learning

Attention-Based Policy Distillation for UAV Simultaneous Target Tracking and Obstacle Avoidance

Efficient Open-world Reinforcement Learning via Knowledge Distillation and Autonomous Rule Discovery

Real-time Policy Distillation in Deep Reinforcement Learning

Continual Task Allocation in Meta-Policy Network via Sparse Prompting

Online Policy Distillation with Decision-Attention

Continual Offline Reinforcement Learning via Diffusion-based Dual Generative Replay

Lifetime policy reuse and the importance of task capacity

Task Aware Dreamer for Task Generalization in Reinforcement Learning

Transferring Domain Knowledge with an Adviser in Continuous Tasks

Text-Aware Diffusion for Policy Learning

Densely Distilling Cumulative Knowledge for Continual Learning

Task Agnostic Continual Learning via Meta Learning