Abstract: Policy diversity, the variety of policies an agent can adopt, improves reinforcement learning (RL) by fostering more robust, adaptable, and innovative problem-solving. Standard RL usually models the environment as a Markov Decision Process (MDP). However, in many real-world scenarios, rewards depend on the agent's history of states and actions, which makes the problem a non-MDP. Under the premise of policy diffusion initialization, a non-MDP may have an unstructured, expanding solution space because of varying historical information and temporal dependencies, so its solutions have non-equivalent closed forms. In this paper, we argue that deriving diverse solutions for non-MDPs requires policies to break through the boundaries of the current solution space through gradual dispersion, with the goal of expanding the solution space and thereby obtaining more diverse policies. Specifically, we first model the sequences of states and actions with a transformer-based method to learn policy embeddings for dispersion in the solution space, since transformers handle sequential data well and capture the long-range dependencies found in non-MDPs. Then, we stack the policy embeddings to construct a dispersion matrix, which serves as the policy diversity measure that induces policy dispersion in the solution space and yields a set of diverse policies. Finally, we prove that if the dispersion matrix is positive definite, the dispersed embeddings effectively enlarge the disagreements across policies, yielding a diverse expression of the original policy embedding distribution. Experimental results in both non-MDP and MDP environments show that this dispersion scheme obtains more expressive, diverse policies by expanding the solution space and performs more robustly than recent learning baselines.
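
The dispersion-matrix step can be illustrated with a small sketch. The abstract states only that policy embeddings are stacked into a dispersion matrix and that positive definiteness matters; the rest is an assumption made for illustration. The sketch uses the Gram matrix of the stacked embeddings as a stand-in for the dispersion matrix and its log-determinant as a diversity score, and the function names `gram_matrix` and `cholesky_log_det` are hypothetical. The paper's actual construction may differ.

```python
import math


def gram_matrix(embeddings):
    """Stack embeddings row-wise as E and return E @ E^T.

    Assumption: the Gram matrix stands in for the paper's dispersion matrix.
    """
    n = len(embeddings)
    return [
        [sum(a * b for a, b in zip(embeddings[i], embeddings[j])) for j in range(n)]
        for i in range(n)
    ]


def cholesky_log_det(matrix, tol=1e-12):
    """Return log det(matrix) via Cholesky factorization.

    Returns None when a pivot is not strictly positive, i.e. when the
    matrix is not (numerically) positive definite.
    """
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    log_det = 0.0
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                pivot = matrix[i][i] - partial
                if pivot <= tol:
                    return None  # not positive definite
                lower[i][j] = math.sqrt(pivot)
                log_det += 2.0 * math.log(lower[i][j])
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return log_det


# Hypothetical 2-D policy embeddings, one row per policy.
orthogonal = [[1.0, 0.0], [0.0, 1.0]]      # policies that disagree strongly
near_parallel = [[1.0, 0.0], [0.9, 0.1]]   # policies that almost coincide
collapsed = [[1.0, 0.0], [1.0, 0.0]]       # identical policies

score_orthogonal = cholesky_log_det(gram_matrix(orthogonal))
score_near_parallel = cholesky_log_det(gram_matrix(near_parallel))
score_collapsed = cholesky_log_det(gram_matrix(collapsed))
```

Under these assumptions, identical embeddings give a singular matrix, so the positive-definiteness check fails. Embeddings that point in more distinct directions give a larger log-determinant. This matches the intuition behind the abstract's positive-definiteness condition, though it does not reproduce the paper's proof.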

Diversifying Policies With Non-Markov Dispersion to Expand the Solution Space
