Abstract: We focus on offline imitation learning (IL), which aims to mimic an expert's behavior using demonstrations without any interaction with the environment. One of the main challenges in offline IL is the limited support of expert demonstrations, which typically cover only a small fraction of the state-action space. While it may not be feasible to obtain numerous expert demonstrations, it is often possible to gather a larger set of sub-optimal demonstrations. For example, in treatment optimization problems, treatments for different chronic conditions come from doctors with varying levels of expertise, ranging from treatment specialists and experienced general practitioners to less experienced general practitioners. Similarly, when robots are trained to imitate humans in routine tasks, they might learn from individuals with different levels of expertise and efficiency. In this paper, we propose an offline IL approach that leverages the larger set of sub-optimal demonstrations while effectively mimicking expert trajectories. Existing offline IL methods based on behavior cloning or distribution matching often face issues such as overfitting to the limited set of expert demonstrations or inadvertently imitating sub-optimal trajectories from the larger dataset. Our approach, which is based on inverse soft-Q learning, learns from both expert and sub-optimal demonstrations. It assigns higher importance (through learned weights) to aligning with expert demonstrations and lower importance to aligning with sub-optimal ones. A key contribution of our approach, called SPRINQL, is transforming the offline IL problem into a convex optimization over the space of Q functions. Through comprehensive experimental evaluations, we demonstrate that the SPRINQL algorithm achieves state-of-the-art (SOTA) performance on offline IL benchmarks.
Code is available at https://github.com/hmhuy0/SPRINQL.

Self-adaptive Inverse Soft-Q Learning for Imitation

Off-Dynamics Inverse Reinforcement Learning

Extrinsicaly Rewarded Soft Q Imitation Learning with Discriminator

Imitation Learning from Imperfection: Theoretical Justifications and Algorithms

Imitator Learning: Achieve Out-of-the-Box Imitation Ability in Variable Environments

Aligning Human Intent from Imperfect Demonstrations with Confidence-based Inverse soft-Q Learning

Limited Preference Aided Imitation Learning from Imperfect Demonstrations

Quality Diversity Imitation Learning

Augmented Q Imitation Learning (AQIL)

Accelerating Self-Imitation Learning from Demonstrations via Policy Constraints and Q-Ensemble

Sample Efficient Imitation Learning via Reward Function Trained in Advance

Imitation Learning via Simultaneous Optimization of Policies and Auxiliary Trajectories

Self-Practice Imitation Learning From Weak Policy

Planning for Sample Efficient Imitation Learning

Reward-free World Models for Online Imitation Learning

Confidence-Aware Imitation Learning from Demonstrations with Varying Optimality

Support-weighted Adversarial Imitation Learning

Robust Visual Imitation Learning with Inverse Dynamics Representations

Inverse Reinforcement Q-Learning Through Expert Imitation for Discrete-Time Systems

Co-Imitation Learning without Expert Demonstration

SPRINQL: Sub-optimal Demonstrations driven Offline Imitation Learning