Abstract: Having explored an environment, intelligent agents should be able to transfer their knowledge to most downstream tasks within that environment. Referred to as "zero-shot learning," this ability remains elusive for general-purpose reinforcement learning algorithms. While recent works have attempted to produce zero-shot RL agents, they make assumptions about the nature of the tasks or the structure of the MDP. We present Proto Successor Measure: the basis set for all possible solutions of Reinforcement Learning in a dynamical system. We prove that any possible policy can be represented using an affine combination of these policy-independent basis functions. Given a reward function at test time, we simply need to find the right set of linear weights to combine these basis functions into the solution corresponding to the optimal policy. We derive a practical algorithm to learn these basis functions using only interaction data from the environment and show that our approach can produce the optimal policy at test time for any given reward function without additional environmental interactions. Project page: https://agarwalsiddhant10.github.io/projects/psm.html

What problem does this paper attempt to address?

This paper addresses zero-shot learning in Reinforcement Learning (RL): after exploring an environment, an agent should be able to transfer its knowledge to most downstream tasks in that environment without additional environmental interactions. Existing general-purpose RL algorithms struggle to achieve this. Although some works have attempted to produce zero-shot RL agents, they make assumptions about the nature of the task or the structure of the Markov Decision Process (MDP), which limits how well they generalize to unseen tasks.

To solve this problem, the authors propose the Proto Successor Measure (PSM), a basis set representing all possible reinforcement learning solutions in a dynamical system. They prove that any possible policy can be represented by an affine combination of these policy-independent basis functions. Given a reward function at test time, it is only necessary to find the linear weights that combine these basis functions into the solution corresponding to the optimal policy. The authors also derive a practical algorithm that learns the basis functions using only interaction data from the environment, and show that their method can produce the optimal policy for any given reward function at test time without additional environmental interactions.

### Key Contributions:
1. **A mathematically complete representation-learning perspective**: A novel and mathematically complete representation-learning method for Markov Decision Processes (MDPs) is proposed.
2. **Efficient practical implementation**: Learning the basis functions is reduced to a single optimization problem.
3. **Extensive experimental verification**: Evaluation on multiple tasks demonstrates that the learned representation can quickly infer the optimal policy.

### Formula Presentation:
- **Bellman Flow Constraint**:
\[ d^\pi(s, a) = (1 - \gamma)\mu(s)\pi(a|s) + \gamma\sum_{s' \in S, a' \in A} P(s|s', a')\, d^\pi(s', a')\, \pi(a|s) \]
- **Successor Measure**:
\[ M^\pi(s, a, s^+, a^+) = (1 - \gamma)\, 1[s = s^+, a = a^+] + \gamma\sum_{s' \in S, a' \in A} P(s^+|s', a')\, M^\pi(s, a, s', a')\, \pi(a^+|s^+) \]
- **Objective function in linear programming form**:
\[ \max_w E_\mu[(\Phi w + b)r] \]
Subject to:
\[ \Phi w + b \geq 0 \quad \forall s, a \]

Through these formulas and theoretical derivations, the authors show how to use basis functions and bias terms to represent any possible state-visitation distribution and successor measure, thereby achieving the goal of zero-shot learning.
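
The affine structure behind the Bellman flow constraint can be checked numerically on a toy problem. The sketch below is not the authors' algorithm: it uses a hypothetical 2-state, 2-action MDP whose transition probabilities, initial distribution, and discount are all made-up assumptions. It computes \(d^\pi\) for two policies by iterating the flow equation, then checks that an affine combination of the two (weights summing to 1) still satisfies the policy-free form of the constraint obtained by summing over actions. It also shows that such a combination can have negative entries, which is why the linear program needs the constraint \(\Phi w + b \geq 0\).

```python
from itertools import product

# Hypothetical 2-state, 2-action MDP; every number here is an illustrative assumption.
S, A = [0, 1], [0, 1]
GAMMA = 0.9
MU = [0.5, 0.5]  # initial state distribution mu(s)
# P[s_prev][a_prev][s_next] = probability of reaching s_next from (s_prev, a_prev)
P = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.7, 0.3], [0.1, 0.9]],
]


def inflow(d, s):
    """Discounted probability mass flowing into state s under visitation d."""
    return GAMMA * sum(P[sp][ap][s] * d[(sp, ap)] for sp, ap in product(S, A))


def visitation(pi, iters=2000):
    """Iterate the Bellman flow equation to its fixed point d^pi(s, a)."""
    d = {(s, a): 0.0 for s, a in product(S, A)}
    for _ in range(iters):
        d = {
            (s, a): pi[s][a] * ((1 - GAMMA) * MU[s] + inflow(d, s))
            for s, a in product(S, A)
        }
    return d


def flow_residual(d):
    """Largest violation of the policy-free constraint
    sum_a d(s, a) = (1 - gamma) * mu(s) + gamma * sum_{s', a'} P(s | s', a') d(s', a')."""
    return max(
        abs(sum(d[(s, a)] for a in A) - (1 - GAMMA) * MU[s] - inflow(d, s))
        for s in S
    )


pi_1 = [[1.0, 0.0], [1.0, 0.0]]  # always take action 0
pi_2 = [[0.0, 1.0], [0.0, 1.0]]  # always take action 1
d_1, d_2 = visitation(pi_1), visitation(pi_2)

# Affine combination: the weights sum to 1 but need not lie in [0, 1].
alpha = 1.5
d_mix = {k: alpha * d_1[k] + (1 - alpha) * d_2[k] for k in d_1}

print("flow residuals:", flow_residual(d_1), flow_residual(d_2), flow_residual(d_mix))
print("total mass of d_1:", sum(d_1.values()))
print("smallest entry of d_mix:", min(d_mix.values()))
```

Because the summed constraint is linear in \(d\) with a constant term, its solution set is affine, which is the property the basis-plus-bias form \(\Phi w + b\) relies on. With `alpha = 1.5` the mixture satisfies the flow constraint but is not a valid distribution, so it corresponds to no policy until non-negativity is enforced.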

Proto Successor Measure: Representing the Space of All Possible Solutions of Reinforcement Learning

Zero-shot Policy Learning with Spatial Temporal Reward Decomposition on Contingency-aware Observation

Planning with a Learned Policy Basis to Optimally Solve Complex Tasks

Emergent Complexity and Zero-shot Transfer via Unsupervised Environment Design

Never Give Up: Learning Directed Exploration Strategies

Curiosity & Entropy Driven Unsupervised RL in Multiple Environments

Zero-shot Imitation Policy via Search in Demonstration Dataset

Optimistic Linear Support and Successor Features as a Basis for Optimal Policy Transfer

Role Play: Learning Adaptive Role-Specific Strategies in Multi-Agent Interactions

Improving Zero-Shot Coordination Performance Based on Policy Similarity

Bridging RL Theory and Practice with the Effective Horizon

Pixel to policy: DQN Encoders for within & cross-game reinforcement learning

A Comparison of Self-Play Algorithms Under a Generalized Framework

Model Predictive Control and Reinforcement Learning: A Unified Framework Based on Dynamic Programming

Maximum Entropy Population Based Training for Zero-Shot Human-AI Coordination

Hierarchical Deep Reinforcement Learning Agent with Counter Self-play on Competitive Games

A Robust Policy Bootstrapping Algorithm for Multi-objective Reinforcement Learning in Non-stationary Environments

Emergent Solutions to High-Dimensional Multitask Reinforcement Learning

KnowPC: Knowledge-Driven Programmatic Reinforcement Learning for Zero-shot Coordination

Iteratively Learning Novel Strategies with Diversity Measured in State Distances