Counterfactual Conservative Q Learning for Offline Multi-agent Reinforcement Learning

Jianzhun Shao, Yun Qu, Chen Chen, Hongchang Zhang, Xiangyang Ji

2023-09-22

Abstract: Offline multi-agent reinforcement learning is challenging due to the coupling of the distribution shift issue common in the offline setting with the high-dimensionality issue common in the multi-agent setting, which makes the out-of-distribution (OOD) action and value overestimation phenomena excessively severe. To mitigate this problem, we propose a novel multi-agent offline RL algorithm, named CounterFactual Conservative Q-Learning (CFCQL), to conduct conservative value estimation. Rather than regarding all the agents as a single high-dimensional agent and directly applying single-agent methods to it, CFCQL calculates the conservative regularization for each agent separately in a counterfactual way and then linearly combines them to realize an overall conservative value estimation. We prove that it still enjoys the underestimation property and the performance guarantee that single-agent conservative methods do, but the induced regularization and safe policy improvement bound are independent of the number of agents, which makes it theoretically superior to the direct treatment referred to above, especially when the number of agents is large. We further conduct experiments on four environments, covering both discrete and continuous action settings, on both existing datasets and our own man-made ones, demonstrating that CFCQL outperforms existing methods on most datasets, and by a remarkable margin on some of them.

Artificial Intelligence

What problem does this paper attempt to address?

The paper addresses a problem in offline multi-agent reinforcement learning (offline MARL): because of distribution shift and the high-dimensional joint action space, out-of-distribution (OOD) actions and value overestimation become severe. These problems are particularly prominent in multi-agent environments because the joint action space grows exponentially with the number of agents, so any given joint action is more likely to be OOD. This intensifies extrapolation error and overestimation and may ultimately lead to unexpected or even catastrophic policies. To address this, the authors propose a new multi-agent offline RL algorithm, Counterfactual Conservative Q-Learning (CFCQL), which performs conservative value estimation. Instead of treating all agents as a single high-dimensional agent and directly applying single-agent methods, CFCQL calculates a conservative regularization term for each agent separately in a counterfactual way and linearly combines the terms to achieve an overall conservative value estimate. The method retains the underestimation property and performance guarantees of single-agent conservative methods, while the induced regularization and the safe policy improvement bound are independent of the number of agents. It is therefore theoretically superior to the direct treatment, especially when the number of agents is large.
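The per-agent counterfactual regularization described above can be illustrated with a small tabular sketch. It is not the authors' implementation: the function name, the single-state tabular Q, the assumption that the behavior policy factorizes across agents, and the choice of weights are all hypothetical and made only for illustration. For each agent, the sketch compares the expected Q-value when only that agent acts according to the learned policy (the others follow the dataset behavior) against the expected Q-value under the dataset behavior, then combines the per-agent gaps linearly.

```python
import numpy as np

def counterfactual_penalty(q, mu, beta, weights):
    """Linear combination of per-agent counterfactual gaps at one state.

    q       : array with one axis per agent, joint Q-values Q(s, a^1..a^n).
    mu      : list of per-agent learned policies (probability vectors).
    beta    : list of per-agent behavior policies (ASSUMPTION: the dataset
              behavior factorizes across agents; used only for this sketch).
    weights : per-agent mixing weights (hypothetical, e.g. uniform).
    """
    n = q.ndim

    def expect(dists):
        # Contract the leading axis with one agent's distribution at a time.
        out = q
        for d in dists:
            out = np.tensordot(d, out, axes=(0, 0))
        return float(out)

    data_value = expect(beta)  # E_{a ~ beta} Q(s, a)
    penalty = 0.0
    for i in range(n):
        # Only agent i deviates to its learned policy; the others stay on-data.
        dists = [mu[j] if j == i else beta[j] for j in range(n)]
        penalty += weights[i] * (expect(dists) - data_value)
    return penalty

# Two agents, two actions each.
q = np.array([[1.0, 0.0],
              [0.0, 3.0]])
beta = [np.array([0.5, 0.5]), np.array([0.5, 0.5])]
mu = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
penalty = counterfactual_penalty(q, mu, beta, weights=[0.25, 0.75])
```

In a training loop, a term of this kind would be added to the usual temporal-difference loss so that Q-values are pushed down where a single agent's deviation from the data looks overly attractive. Because each term changes only one agent's action distribution, the sketch never needs to evaluate joint actions in which all agents are off-data at once.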

Believe What You See: Implicit Constraint Approach for Offline Multi-Agent Reinforcement Learning

Mildly Conservative Q-Learning for Offline Reinforcement Learning

Successive Convex Approximation Based Off-Policy Optimization for Constrained Reinforcement Learning

Adaptable Conservative Q-Learning for Offline Reinforcement Learning

Conservative In-Distribution Q-Learning for Offline Reinforcement Learning

Strategically Conservative Q-Learning

Plan Better Amid Conservatism: Offline Multi-Agent Reinforcement Learning with Actor Rectification

Offline Quantum Reinforcement Learning in a Conservative Manner

DCE: Offline Reinforcement Learning with Double Conservative Estimates

Offline Decentralized Multi-Agent Reinforcement Learning

UAC: Offline Reinforcement Learning with Uncertain Action Constraint

Confidence-Conditioned Value Functions for Offline Reinforcement Learning

Offline Multi-Agent Reinforcement Learning with Coupled Value Factorization

CDSA: Conservative Denoising Score-based Algorithm for Offline Reinforcement Learning

Federated Offline Reinforcement Learning: Collaborative Single-Policy Coverage Suffices

Budgeting Counterfactual for Offline RL

Offline RL With Realistic Datasets: Heteroskedasticity and Support Constraints

Efficient Offline Reinforcement Learning With Relaxed Conservatism

Constraints Penalized Q-learning for Safe Offline Reinforcement Learning

State-Constrained Offline Reinforcement Learning