Abstract: Offline reinforcement learning (RL) seeks to learn optimal policies from static datasets without interacting with the environment. A common challenge is handling multi-modal action distributions, where multiple behaviours are represented in the data. Existing methods often assume unimodal behaviour policies, leading to suboptimal performance when this assumption is violated. We propose Weighted Imitation Learning on One Mode (LOM), a novel approach that focuses on learning from a single, promising mode of the behaviour policy. By using a Gaussian mixture model to identify modes and selecting the best mode based on expected returns, LOM avoids the pitfalls of averaging over conflicting actions. Theoretically, we show that LOM improves performance while maintaining simplicity in policy learning. Empirically, LOM outperforms existing methods on standard D4RL benchmarks and demonstrates its effectiveness in complex, multi-modal scenarios.

What problem does this paper attempt to address?

This paper addresses the problem of multi-modal behavior policies in offline reinforcement learning (RL). The main challenges are:

1. **Multi-modal action distribution**: In offline RL, the behavior policy that generated the dataset may mix several distinct behavior patterns (modes), which appear in the data as multiple valid but potentially conflicting actions. For example, in autonomous driving, conservative and aggressive driving styles may lead to different but equally effective ways of navigating; in robotic manipulation, the way to grasp an object depends on the robot's approach, the object's position, and environmental constraints.
2. **Limitations of existing methods**: Most existing offline RL methods assume that the behavior policy is unimodal, which degrades performance on multi-modal datasets. When this assumption is violated, the learned policy may converge to an averaged action that does not appear in the dataset, giving suboptimal or ineffective results.

To address these problems, the authors propose **Weighted Imitation Learning on One Mode (LOM)**. LOM handles multi-modality in three steps:

1. **Modeling behavior policies**: Use a Gaussian mixture model (GMM) to model the behavior policy and capture the multi-modal structure of the action space. Each mode represents a distinct set of actions associated with a given state.
2. **Evaluating and selecting the optimal mode**: Introduce a hyper-Q-function to evaluate the expected return of each mode and dynamically select the most advantageous one. This step ensures that learning focuses on the most promising mode rather than averaging over all conflicting actions.
3. **Weighted imitation learning**: Perform weighted imitation learning on the selected mode, so that the learned policy concentrates on the most beneficial actions while remaining a simple unimodal policy.

The main contributions of LOM are:

- Proposing the LOM method specifically for multi-modal problems in offline RL.
- Introducing the hyper-Q-function and hyper-policy to evaluate and select action modes.
- Providing theoretical guarantees that LOM outperforms both the behavior policy and the optimal action mode.
- Showing experimentally that LOM outperforms existing state-of-the-art (SOTA) offline RL methods on the standard D4RL benchmarks, with particularly strong results on multi-modal datasets.

Together, these components let LOM keep policy learning simple in multi-modal scenarios while improving performance.
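
The three steps described above can be illustrated with a toy, one-dimensional sketch at a single fixed state. This is not the authors' implementation: the mixture parameters are given rather than learned, `mode_values` is a hypothetical stand-in for the hyper-Q-function's estimate of each mode's expected return, and the exponential advantage weighting in `weighted_imitation_mean` is an assumed choice of weights that the summary does not specify.

```python
import math
import random


def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def responsibilities(action, weights, means, stds):
    """Posterior probability that `action` was drawn from each mixture mode."""
    joint = [w * gaussian_pdf(action, m, s)
             for w, m, s in zip(weights, means, stds)]
    total = sum(joint)
    return [j / total for j in joint]


def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=lambda k: values[k])


def weighted_imitation_mean(actions, advantages, temperature=1.0):
    """Weighted maximum-likelihood mean of a unimodal Gaussian policy.

    Assumption: weights are exp(advantage / temperature); the summary only
    says the imitation is "weighted".
    """
    weights = [math.exp(a / temperature) for a in advantages]
    return sum(w * x for w, x in zip(weights, actions)) / sum(weights)


# Step 1: a two-mode behaviour policy (hypothetical, hand-set GMM parameters).
mix_weights = [0.5, 0.5]
mode_means = [-1.0, 1.0]
mode_stds = [0.1, 0.1]

# Sample an offline "dataset" of actions from the mixture.
rng = random.Random(0)
actions = []
for _ in range(2000):
    k = 0 if rng.random() < mix_weights[0] else 1
    actions.append(rng.gauss(mode_means[k], mode_stds[k]))

# A unimodal fit to all of the data averages the two modes.
naive_mean = sum(actions) / len(actions)

# Step 2: pick the mode with the highest estimated return
# (hypothetical numbers standing in for hyper-Q-function outputs).
mode_values = [0.3, 0.9]
best = argmax(mode_values)

# Keep only the actions the mixture attributes to the selected mode.
selected = [
    a for a in actions
    if argmax(responsibilities(a, mix_weights, mode_means, mode_stds)) == best
]

# Step 3: weighted imitation on the selected mode, using a toy advantage
# that prefers actions close to 1.0 (hypothetical).
advantages = [-(a - 1.0) ** 2 for a in selected]
lom_mean = weighted_imitation_mean(selected, advantages)

print("mean over all modes:", naive_mean)
print("selected mode index:", best)
print("weighted mean on selected mode:", lom_mean)
```

In this toy setting the averaged action falls between the two modes, in a region the dataset barely covers, while the policy fitted on the selected mode stays inside it. A full method would condition every quantity on the state and learn the mixture and the value estimates from data.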

Learning on One Mode: Addressing Multi-Modality in Offline Reinforcement Learning

Beyond Reward: Offline Preference-guided Policy Optimization

Goal-conditioned Offline Reinforcement Learning through State Space Partitioning

Scrutinize What We Ignore: Reining In Task Representation Shift Of Context-Based Offline Meta Reinforcement Learning

Policy-regularized Offline Multi-objective Reinforcement Learning

Offline Multi-Agent Reinforcement Learning with Implicit Global-to-Local Value Regularization

Effective Multimodal Reinforcement Learning with Modality Alignment and Importance Enhancement

MAHALO: Unifying Offline Reinforcement Learning and Imitation Learning from Observations

Mildly Conservative Q-Learning for Offline Reinforcement Learning

LAPO: Latent-Variable Advantage-Weighted Policy Optimization for Offline Reinforcement Learning

Offline Reinforcement Learning via High-Fidelity Generative Behavior Modeling

MOReL: Model-Based Offline Reinforcement Learning

Urban-Focused Multi-Task Offline Reinforcement Learning with Contrastive Data Sharing

MORE-3S: Multimodal-based Offline Reinforcement Learning with Shared Semantic Spaces

Offline Reinforcement Learning with Reverse Model-based Imagination

Offline Multitask Representation Learning for Reinforcement Learning

Align Your Intents: Offline Imitation Learning via Optimal Transport

TWOSOME: an Efficient Online Framework to Align LLMs with Embodied Environments Via Reinforcement Learning

Imitation-Regularized Offline Learning

A2PO: Towards Effective Offline Reinforcement Learning from an Advantage-aware Perspective