Abstract: Efficient exploration is crucial in cooperative multi-agent reinforcement learning (MARL), especially in sparse-reward settings. However, because they rely on unimodal policies, existing methods are prone to falling into local optima, which hinders the exploration of better policies. Furthermore, in sparse-reward settings each agent receives rewards only rarely, which poses significant challenges to inter-agent cooperation. This not only makes policy learning harder but also degrades overall performance on multi-agent tasks. To address these issues, we propose Consistency Policy with Intention Guidance (CPIG), which has two primary components: (a) a multimodal policy that enhances each agent's exploration capability, and (b) intentions shared among agents to foster cooperation. For component (a), CPIG uses a consistency model as the policy, leveraging its multimodal and stochastic nature to facilitate exploration. For component (b), we introduce an Intention Learner that infers an intention over the global state from each agent's local observation. This intention then guides the Consistency Policy, promoting cooperation among agents. We evaluate the proposed method in the multi-agent particle environments (MPE) and multi-agent MuJoCo (MAMuJoCo). Empirical results show that our method achieves performance comparable to various baselines in dense-reward environments and significantly improves performance in sparse-reward settings, outperforming state-of-the-art (SOTA) algorithms by 20%.
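The action-sampling path of an intention-guided consistency policy can be sketched as follows. This is a minimal illustration under stated assumptions, not the CPIG implementation. The class names IntentionLearner and ConsistencyPolicy, the one-hidden-layer network with random untrained weights, the constants EPS, T_MAX and SIGMA_DATA (borrowed from the consistency-model skip-connection parameterisation of Song et al., 2023), the sampling times and the clipping of actions to [-1, 1] are all hypothetical. The sketch shows three ideas. First, the parameterisation satisfies f(a, EPS) = a. Second, an action is generated in one step from Gaussian noise, with optional multistep refinement. Third, the policy is conditioned on an intention inferred from the agent's local observation.

```python
import numpy as np

# Assumed constants from the consistency-model literature, not taken from CPIG.
EPS, T_MAX, SIGMA_DATA = 0.002, 80.0, 0.5


def c_skip(t):
    # Equals 1 at t == EPS, so the model reduces to the identity there.
    return SIGMA_DATA**2 / ((t - EPS) ** 2 + SIGMA_DATA**2)


def c_out(t):
    # Equals 0 at t == EPS.
    return SIGMA_DATA * (t - EPS) / np.sqrt(SIGMA_DATA**2 + t**2)


class IntentionLearner:
    """Hypothetical stand-in: maps a local observation to a latent intention.

    The real module is trained to infer an intention over the global state;
    here the weights are random placeholders.
    """

    def __init__(self, obs_dim, intent_dim, rng):
        self.W = rng.normal(scale=0.1, size=(intent_dim, obs_dim))

    def __call__(self, obs):
        return np.tanh(self.W @ obs)


class ConsistencyPolicy:
    """Hypothetical consistency-model policy pi(a | obs, intention), untrained."""

    def __init__(self, obs_dim, intent_dim, act_dim, hidden, rng):
        in_dim = act_dim + 1 + obs_dim + intent_dim  # noisy action, time, condition
        self.W1 = rng.normal(scale=0.1, size=(hidden, in_dim))
        self.W2 = rng.normal(scale=0.1, size=(act_dim, hidden))
        self.act_dim = act_dim

    def F(self, a_t, t, cond):
        # Free-form network; a real model would use a learned time embedding.
        h = np.tanh(self.W1 @ np.concatenate([a_t, [t], cond]))
        return self.W2 @ h

    def f(self, a_t, t, cond):
        # Skip-connection parameterisation enforcing f(a, EPS) == a.
        return c_skip(t) * a_t + c_out(t) * self.F(a_t, t, cond)

    def sample(self, obs, intention, rng, times=(T_MAX, 10.0, 1.0)):
        cond = np.concatenate([obs, intention])  # intention guidance
        a = rng.normal(scale=times[0], size=self.act_dim)  # a_T ~ N(0, T^2 I)
        a = self.f(a, times[0], cond)  # one-step generation
        for t in times[1:]:  # optional multistep refinement
            a_t = a + np.sqrt(t**2 - EPS**2) * rng.normal(size=self.act_dim)
            a = self.f(a_t, t, cond)
        return np.clip(a, -1.0, 1.0)


# Usage: one agent samples an action from its local observation.
rng = np.random.default_rng(0)
obs_dim, intent_dim, act_dim = 6, 4, 2
learner = IntentionLearner(obs_dim, intent_dim, rng)
policy = ConsistencyPolicy(obs_dim, intent_dim, act_dim, hidden=32, rng=rng)
obs = rng.normal(size=obs_dim)
action = policy.sample(obs, learner(obs), rng)
```

Because each call draws fresh noise, the same observation and intention can yield different actions. This stochasticity is the property the abstract relies on for exploration. In a trained model, the network F would also be fitted so that those draws cover several action modes rather than one.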

CASA: Bridging the Gap between Policy Improvement and Policy Evaluation with Conflict Averse Policy Iteration

Generalised Policy Improvement with Geometric Policy Composition

A Policy-Gradient Approach to Solving Imperfect-Information Games with Iterate Convergence

On The Convergence Of Policy Iteration-Based Reinforcement Learning With Monte Carlo Policy Evaluation

GAILPG: Multi-Agent Policy Gradient with Generative Adversarial Imitation Learning

Bridging the Gap between Newton-Raphson Method and Regularized Policy Iteration

Blending Imitation and Reinforcement Learning for Robust Policy Improvement

Dual Parallel Policy Iteration with Coupled Policy Improvement

CPIG: Leveraging Consistency Policy with Intention Guidance for Multi-agent Exploration

Explicitly Coordinated Policy Iteration

Don't Start from Scratch: Behavioral Refinement via Interpolant-based Policy Diffusion

Iterative Regularized Policy Optimization with Imperfect Demonstrations

Interpolated Policy Gradient: Merging On-Policy and Off-Policy Gradient Estimation for Deep Reinforcement Learning

Policy Optimization over General State and Action Spaces

Easy Monotonic Policy Iteration

Adversarial Constrained Policy Optimization: Improving Constrained Reinforcement Learning by Adapting Budgets

An Active Exploration Method for Data Efficient Reinforcement Learning

Theoretically Guaranteed Policy Improvement Distilled from Model-Based Planning

Bidirectional Model-based Policy Optimization

Combing Policy Evaluation and Policy Improvement in a Unified F-Divergence Framework

Towards Imitation Learning to Branch for MIP: A Hybrid Reinforcement Learning Based Sample Augmentation Approach