Abstract: We focus on offline imitation learning (IL), which aims to mimic an expert's behavior using demonstrations without any interaction with the environment. One of the main challenges in offline IL is the limited support of expert demonstrations, which typically cover only a small fraction of the state-action space. While it may not be feasible to obtain numerous expert demonstrations, it is often possible to gather a larger set of sub-optimal demonstrations. For example, in treatment optimization problems, there are varying levels of doctor treatments available for different chronic conditions. These range from treatment specialists and experienced general practitioners to less experienced general practitioners. Similarly, when robots are trained to imitate humans in routine tasks, they might learn from individuals with different levels of expertise and efficiency. In this paper, we propose an offline IL approach that leverages the larger set of sub-optimal demonstrations while effectively mimicking expert trajectories. Existing offline IL methods based on behavior cloning or distribution matching often face issues such as overfitting to the limited set of expert demonstrations or inadvertently imitating sub-optimal trajectories from the larger dataset. Our approach, which is based on inverse soft-Q learning, learns from both expert and sub-optimal demonstrations. It assigns higher importance (through learned weights) to aligning with expert demonstrations and lower importance to aligning with sub-optimal ones. A key contribution of our approach, called SPRINQL, is transforming the offline IL problem into a convex optimization over the space of Q functions. Through comprehensive experimental evaluations, we demonstrate that the SPRINQL algorithm achieves state-of-the-art (SOTA) performance on offline IL benchmarks.
Code is available at https://github.com/hmhuy0/SPRINQL

What problem does this paper attempt to address?

The problem this paper addresses is the scarcity of expert demonstrations in offline imitation learning (IL). Traditional offline IL methods depend on high-quality expert demonstrations, which are difficult and costly to obtain and are therefore available only in small quantities. Sub-optimal demonstrations are easier to collect in bulk, but using them directly may lead to learning sub-optimal policies. To address this, the authors propose a new algorithm, SPRINQL (Sub-optimal Demonstrations driven Reward regularized INverse soft Q Learning), which aims to improve learning by combining expert and sub-optimal demonstrations. The core contributions of the paper are:

1. **Proposing the SPRINQL algorithm**: Built on the inverse soft Q-learning framework, the algorithm learns from expert and sub-optimal demonstrations simultaneously.
2. **Theoretical properties**: The paper establishes key theoretical properties of the SPRINQL objective function that support the scalability and efficiency of the algorithm. In particular, the objective is developed through distribution matching and reward regularization, so it compensates for the shortage of expert samples while also using non-expert data to improve learning.
3. **Empirical evaluation**: Extensive experimental comparisons with the best existing offline IL algorithms show that SPRINQL achieves state-of-the-art performance on benchmark problems. Moreover, SPRINQL can recover a reward function that is highly positively correlated with the true reward, which is an advantage that other IL algorithms do not have.

### Detailed Explanation

#### Research Background

The goal of offline imitation learning is to learn the behavior of experts from a given static dataset without interacting with the environment. However, existing methods usually need a large amount of expert demonstration data to perform well, which is often unrealistic in practical applications. How to make effective use of limited expert data together with abundant sub-optimal data has therefore become the focus of research.

#### Main Challenges

- **Limited expert data**: Expert demonstrations are usually few, which makes the model prone to overfitting.
- **The influence of sub-optimal data**: Using sub-optimal data directly may lead to learning sub-optimal policies, because these data contain wrong or inefficient behaviors.

#### The Core Idea of SPRINQL

SPRINQL addresses these challenges through three key components:

1. **Distribution matching**: It matches the occupancy distribution of the expert demonstrations and also the distribution of the sub-optimal demonstrations, thus making full use of all available data.
2. **Reward regularization**: A reward regularization term is introduced so that demonstrations with a higher skill level receive higher reward values, which steers the model toward expert behavior.
3. **Optimization objective transformation**: The original objective function is transformed into the Q-space, which makes the optimization problem easier to solve and ensures that the Q-function found lower-bounds its true value.

#### Experimental Results

In experiments on multiple benchmark tasks, SPRINQL performs significantly better than existing methods, especially when the demonstration data vary in quality and quantity. In addition, SPRINQL recovers rewards close to the true reward function, which further supports its potential for inverse reinforcement learning.

In conclusion, this paper addresses the problem of limited expert data in offline imitation learning by introducing the SPRINQL algorithm, providing new ideas and methods for future research.
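
The three components listed under the core idea can be illustrated with a toy tabular sketch. This is not the paper's objective or implementation: it only assumes the standard inverse soft-Q identities for discrete actions (soft value V(s) = log Σ_a exp Q(s, a) and implied reward r(s, a) = Q(s, a) − γ V(s′)). The function names, the fixed per-level `weights`, the per-level reference rewards `targets`, and the squared-error regularizer are all hypothetical choices made for illustration. In the paper the weights are learned.

```python
import numpy as np


def soft_value(q):
    """Soft state value V(s) = log sum_a exp Q(s, a); q has shape (S, A)."""
    m = q.max(axis=1)
    return m + np.log(np.exp(q - m[:, None]).sum(axis=1))


def implied_reward(q, s, a, s_next, gamma):
    """Reward implied by Q on transitions (s, a, s'): Q(s, a) - gamma * V(s')."""
    return q[s, a] - gamma * soft_value(q)[s_next]


def toy_objective(q, datasets, weights, targets, alpha, gamma, s0):
    """Hypothetical level-weighted objective written purely in terms of Q.

    datasets: one (s, a, s_next) tuple of integer arrays per expertise level
    weights:  importance of each level (assumed fixed here, larger for experts)
    targets:  assumed reference reward per level (higher for more skilled data)
    """
    # Initial-state term, as in inverse soft-Q style objectives.
    total = -(1.0 - gamma) * soft_value(q)[s0]
    for (s, a, s_next), w, target in zip(datasets, weights, targets):
        r = implied_reward(q, s, a, s_next, gamma)
        # Weighted matching term minus a regularizer pulling r toward its target.
        total += w * r.mean() - 0.5 * alpha * w * np.mean((r - target) ** 2)
    return total


# Tiny example: 3 states, 2 actions, one expert set and one sub-optimal set.
q = np.zeros((3, 2))
expert = (np.array([0, 1]), np.array([1, 1]), np.array([1, 2]))
subopt = (np.array([0, 1, 2]), np.array([0, 0, 0]), np.array([1, 2, 2]))
value = toy_objective(q, [expert, subopt], weights=[0.7, 0.3],
                      targets=[1.0, 0.0], alpha=1.0, gamma=0.9, s0=0)
```

In this sketch the objective depends on the data only through Q, which mirrors the idea of moving the optimization into Q-space. A training loop would maximize `toy_objective` over `q`, for example by gradient ascent.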

SPRINQL: Sub-optimal Demonstrations driven Offline Imitation Learning

Beyond Reward: Offline Preference-guided Policy Optimization

How to Leverage Diverse Demonstrations in Offline Imitation Learning

Discriminator-Weighted Offline Imitation Learning from Suboptimal Demonstrations

Offline Imitation Learning with Suboptimal Demonstrations via Relaxed Distribution Matching

Offline Imitation Learning with Model-based Reverse Augmentation

Imitation Learning from Imperfection: Theoretical Justifications and Algorithms

UNIQ: Offline Inverse Q-learning for Avoiding Undesirable Demonstrations

Mitigating Covariate Shift in Imitation Learning via Offline Data Without Great Coverage

Sparse Q-Learning: Offline Reinforcement Learning with Implicit Value Regularization

When Demonstrations Meet Generative World Models: A Maximum Likelihood Framework for Offline Inverse Reinforcement Learning

Expert Proximity as Surrogate Rewards for Single Demonstration Imitation Learning

Offline RL with No OOD Actions: In-Sample Learning Via Implicit Value Regularization

Bridging Imitation and Online Reinforcement Learning: An Optimistic Tale

Believe What You See: Implicit Constraint Approach for Offline Multi-Agent Reinforcement Learning

Offline Imitation Learning by Controlling the Effective Planning Horizon

SEABO: A Simple Search-Based Method for Offline Imitation Learning

Offline Reinforcement Learning with Implicit Q-Learning

Robust Offline Imitation Learning from Diverse Auxiliary Data

Offline Imitation Learning with a Misspecified Simulator