Abstract: Bayesian policy reuse (BPR) is a general policy transfer framework that selects a source policy from an offline library by inferring a belief over tasks from observation signals and a trained observation model. In this paper, we propose an improved BPR method that achieves more efficient policy transfer in deep reinforcement learning (DRL). First, most BPR algorithms use the episodic return as the observation signal. This signal carries limited information and is not available until the end of an episode. Instead, we use the state transition sample, which is both informative and instantaneous, as the observation signal for faster and more accurate task inference. Second, BPR algorithms usually require many samples to estimate the probability distribution of a tabular observation model. Such a model can be expensive or even infeasible to learn and maintain, especially when the state transition sample is the signal. We therefore propose a scalable observation model that fits the state transition functions of source tasks from only a small number of samples and can generalize to any signal observed in the target task. Moreover, we extend offline-mode BPR to the continual learning setting by expanding the scalable observation model in a plug-and-play fashion, which can avoid negative transfer when new, unknown tasks are encountered. Experimental results show that our method can consistently facilitate faster and more efficient policy transfer.
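The task-inference step described in the abstract can be illustrated as a Bayesian belief update over source tasks driven by single state transitions. The following is a minimal Python sketch, not the paper's implementation, and it rests on several assumptions. Each source task's fitted transition function is a hypothetical callable that returns a predicted mean next state. The observation model scores a transition (s, a, s') with an isotropic Gaussian likelihood of fixed width `sigma`. The policy is chosen greedily as that of the most probable task, which is a simplification of the selection rules used in BPR variants. The continual extension is mimicked by a hypothetical `add_source_task` helper that appends a new model and reassigns part of the belief mass to it.

```python
import math
from typing import Callable, List, Sequence

# Hypothetical interface (assumption): a fitted transition model for one source
# task, mapping (state, action) to a predicted mean next state.
TransitionModel = Callable[[Sequence[float], Sequence[float]], List[float]]


def gaussian_log_likelihood(next_state: Sequence[float],
                            predicted: Sequence[float],
                            sigma: float) -> float:
    """Log-density of next_state under an isotropic Gaussian centred on predicted."""
    d = len(next_state)
    sq_err = sum((x - m) ** 2 for x, m in zip(next_state, predicted))
    return -0.5 * sq_err / sigma ** 2 - d * math.log(sigma * math.sqrt(2.0 * math.pi))


def update_belief(belief: Sequence[float],
                  models: Sequence[TransitionModel],
                  state: Sequence[float],
                  action: Sequence[float],
                  next_state: Sequence[float],
                  sigma: float = 0.1) -> List[float]:
    """One Bayesian update of the task belief from a single transition (s, a, s')."""
    # Unnormalized log-posterior: log prior + log likelihood of s' under each task.
    log_post = [
        (math.log(b) if b > 0 else -math.inf)
        + gaussian_log_likelihood(next_state, model(state, action), sigma)
        for b, model in zip(belief, models)
    ]
    peak = max(log_post)  # subtract the max before exponentiating, for stability
    weights = [math.exp(lp - peak) for lp in log_post]
    total = sum(weights)
    return [w / total for w in weights]


def select_policy(belief: Sequence[float]) -> int:
    """Greedy choice (assumption): index of the most probable source task's policy."""
    return max(range(len(belief)), key=lambda i: belief[i])


def add_source_task(belief: Sequence[float],
                    models: List[TransitionModel],
                    new_model: TransitionModel,
                    prior_mass: float = 0.5) -> List[float]:
    """Hypothetical plug-in step: append a new task model and give it prior_mass."""
    models.append(new_model)
    return [b * (1.0 - prior_mass) for b in belief] + [prior_mass]


# Toy usage: two 1-D source tasks; the action pushes the state up in task 0
# and down in task 1. One transition from 0.0 to 1.0 under action 1.0
# is consistent with task 0 only.
models: List[TransitionModel] = [
    lambda s, a: [s[0] + a[0]],
    lambda s, a: [s[0] - a[0]],
]
belief = update_belief([0.5, 0.5], models, state=[0.0], action=[1.0], next_state=[1.0])
print(select_policy(belief))  # 0
```

Working in log space and subtracting the maximum before exponentiating keeps the update numerically stable when likelihoods are tiny, which is common with sharp Gaussian models. Because each update consumes a single transition, the belief can change at every step rather than once per episode, which is the advantage the abstract attributes to state-transition signals over episodic returns.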

Lifetime policy reuse and the importance of task capacity

Failure-aware Policy Learning for Self-assessable Robotics Tasks

Heuristically Adaptive Policy Reuse in Reinforcement Learning

Context-Aware Policy Reuse

Multi-Task Policy Search

Statistical Guarantees for Lifelong Reinforcement Learning using PAC-Bayesian Theory

Hierarchical Orchestra of Policies

Knowledge Reuse of Multi-Agent Reinforcement Learning in Cooperative Tasks

On the Value of Myopic Behavior in Policy Reuse

Latent Plans for Task-Agnostic Offline Reinforcement Learning

I Know How: Combining Prior Policies to Solve New Tasks

Leveraging the Efficiency of Multi-Task Robot Manipulation Via Task-Evoked Planner and Reinforcement Learning

Continual Task Allocation in Meta-Policy Network via Sparse Prompting

HLifeRL: A Hierarchical Lifelong Reinforcement Learning Framework

Developing cooperative policies for multi-stage reinforcement learning tasks

Reinforcement Learning Experience Reuse with Policy Residual Representation

Continual Deep Reinforcement Learning with Task-Agnostic Policy Distillation

Efficient Bayesian Policy Reuse with a Scalable Observation Model in Deep Reinforcement Learning

Planning with a Learned Policy Basis to Optimally Solve Complex Tasks

Policy Information Capacity: Information-Theoretic Measure for Task Complexity in Deep Reinforcement Learning

Safe Policy Search for Lifelong Reinforcement Learning with Sublinear Regret