Abstract: Offline reinforcement learning (RL) leverages previously collected data to extract policies that perform satisfactorily in online environments. However, offline RL suffers from the distribution shift between the offline dataset and the online environment. In the multi-agent RL (MARL) setting, this distribution shift may arise from nonstationary opponents (exogenous agents beyond the learner's control) whose behavior during online testing differs from that recorded in the offline dataset. Hence, the key to broader deployment of offline MARL is online adaptation to nonstationary opponents. Recent advances in large language models have demonstrated the surprising generalization ability of the transformer architecture in sequence modeling, which raises the question of whether an offline-trained transformer policy can adapt to nonstationary opponents during online testing. This work proposes the self-confirming loss (SCL) for offline transformer training to address online nonstationarity, motivated by the self-confirming equilibrium (SCE) in game theory. The gist is that the transformer learns to predict the opponents' future moves and then acts on those predictions. As a weaker variant of Nash equilibrium (NE), SCE (and correspondingly SCL) requires only local consistency: the agent's local observations do not deviate from its conjectures. This leads to a more adaptable policy than one dictated by NE, which focuses on global optimality. We evaluate the online adaptability of the self-confirming transformer (SCT) by playing against nonstationary opponents that employ a variety of policies, ranging from random play to benchmark MARL policies. Experimental results demonstrate that SCT can adapt to nonstationary opponents online, achieving higher returns than vanilla transformers and offline MARL baselines.
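
The abstract does not give the exact form of the self-confirming loss, so the following is only a minimal sketch of the "predict the opponent, then act" idea under stated assumptions. It assumes discrete actions, a cross-entropy term for the agent's own action, and a second cross-entropy term that penalizes conjectures about the opponent that disagree with what was actually observed. The function names, the two-term structure, and the `weight` parameter are hypothetical and are not taken from the paper.

```python
import math


def cross_entropy(probs, target):
    """Negative log-likelihood of index `target` under the distribution `probs`."""
    # Clamp to avoid log(0) when the model assigns zero probability.
    return -math.log(max(probs[target], 1e-12))


def self_confirming_loss_sketch(opp_pred_probs, opp_action,
                                agent_probs, agent_action, weight=1.0):
    """Hypothetical two-term loss (an assumption, not the paper's definition).

    opp_pred_probs: the model's conjecture about the opponent's next action.
    opp_action:     the opponent action recorded in the offline data.
    agent_probs:    the model's distribution over the agent's own action.
    agent_action:   the agent action recorded in the offline data.
    weight:         hypothetical trade-off between the two terms.
    """
    # Local consistency: the conjecture should match the observed opponent move.
    conjecture_loss = cross_entropy(opp_pred_probs, opp_action)
    # Behavior term: the agent's action should match the recorded one.
    action_loss = cross_entropy(agent_probs, agent_action)
    return action_loss + weight * conjecture_loss


# A conjecture that agrees with the observation yields a lower loss
# than one that contradicts it, all else being equal.
consistent = self_confirming_loss_sketch([0.7, 0.3], 0, [0.2, 0.8], 1)
inconsistent = self_confirming_loss_sketch([0.3, 0.7], 0, [0.2, 0.8], 1)
```

In this sketch the conjecture term is what ties the policy to local observations: only the opponent moves that actually appear in the data constrain the prediction, which mirrors the local-consistency requirement described in the abstract.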

Continual Task Learning through Adaptive Policy Self-Composition

Solving Continual Offline Reinforcement Learning with Decision Transformer

CLFR-M: Continual Learning Framework for Robots Via Human Feedback and Dynamic Memory

Multi-Task Reinforcement Learning in Continuous Control with Successor Feature-Based Concurrent Composition

Dynamics-Adaptive Continual Reinforcement Learning Via Progressive Contextualization

Effective Offline Robot Learning with Structured Task Graph

Solving Continual Offline RL through Selective Weights Activation on Aligned Spaces

Adaptive Policy Learning for Offline-to-Online Reinforcement Learning

Continual Offline Reinforcement Learning via Diffusion-based Dual Generative Replay

Online Reinforcement Learning in Non-Stationary Context-Driven Environments

Self-Confirming Transformer for Locally Consistent Online Adaptation in Multi-Agent Reinforcement Learning

Latent Plans for Task-Agnostic Offline Reinforcement Learning

Continual Task Allocation in Meta-Policy Network via Sparse Prompting

Hierarchical Orchestra of Policies

Efficient and Stable Offline-to-online Reinforcement Learning Via Continual Policy Revitalization

Improving Plasticity in Online Continual Learning via Collaborative Learning

Multiagent Continual Coordination via Progressive Task Contextualization

Online Continual Learning For Interactive Instruction Following Agents

Continual Sequence Generation with Adaptive Compositional Modules

Online Fast Adaptation and Knowledge Accumulation: a New Approach to Continual Learning