Abstract: We identify two issues with the family of algorithms based on the Adversarial Imitation Learning framework. The first problem is implicit bias present in the reward functions used in these algorithms. While these biases might work well for some environments, they can also lead to sub-optimal behavior in others. Secondly, even though these algorithms can learn from few expert demonstrations, they require a prohibitively large number of interactions with the environment in order to imitate the expert for many real-world applications. In order to address these issues, we propose a new algorithm called Discriminator-Actor-Critic that uses off-policy Reinforcement Learning to reduce policy-environment interaction sample complexity by an average factor of 10. Furthermore, since our reward function is designed to be unbiased, we can apply our algorithm to many problems without making any task-specific adjustments.

What problem does this paper attempt to address?

This paper addresses two key problems in the Adversarial Imitation Learning (AIL) framework:

1. **Implicit Bias in the Reward Function**: The reward functions used in existing AIL algorithms may carry implicit biases. These biases may work well in some environments but lead to sub-optimal behavior in others. For example, some forms of reward function introduce a survival bonus, so the agent tends to prolong the episode rather than complete the task as quickly as possible. Improper handling of absorbing states may also degrade performance.
2. **Low Sample Efficiency**: Although these algorithms can learn from a small number of expert demonstrations, they need extensive interaction with the environment to imitate the expert successfully, which is impractical for many real-world applications (such as robotics).

To solve these problems, the authors propose a new algorithm, Discriminator-Actor-Critic (DAC). The main improvements of DAC include:

- **Unbiased Reward Function**: By explicitly learning the rewards of absorbing (terminating) states, DAC can eliminate the implicit biases in the reward function. This lets the algorithm be applied to different tasks without task-specific adjustments to the reward function.
- **Off-Policy Reinforcement Learning**: DAC uses off-policy reinforcement learning to significantly reduce the number of interactions between the agent and the environment. Specifically, DAC uses TD3 (an off-policy RL algorithm) and an off-policy discriminator, reducing sample complexity by approximately an order of magnitude.

Through these improvements, DAC improves sample efficiency and can also achieve state-of-the-art performance on a variety of complex imitation learning tasks.
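The reward-bias point can be illustrated with a small numerical sketch. It is not the authors' implementation: the function names, the fixed discriminator output `d = 0.5`, and the discount `gamma = 0.99` are assumptions chosen for illustration. It compares two reward forms commonly used in adversarial imitation learning, `-log(1 - D)` (always positive) and `log(D)` (always negative), and shows that even an uninformative discriminator makes the first favor long episodes and the second favor short ones. It then shows that counting a reward for the absorbing state over the infinite tail removes the dependence on episode length in this toy setting.

```python
import math


def positive_reward(d: float) -> float:
    """r = -log(1 - D): strictly positive for D in (0, 1), acts as a survival bonus."""
    return -math.log(1.0 - d)


def negative_reward(d: float) -> float:
    """r = log(D): strictly negative for D in (0, 1), acts as a per-step penalty."""
    return math.log(d)


def discounted_return(rewards, gamma: float = 0.99) -> float:
    """Sum of gamma^t * r_t over a finite episode."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def return_with_absorbing(rewards, r_absorbing: float, gamma: float = 0.99) -> float:
    """Finite-episode return plus the infinite discounted tail spent in the
    absorbing state: gamma^T * r_absorbing / (1 - gamma), with T = len(rewards).
    Here r_absorbing is passed in by hand (assumption); in DAC it would be learned."""
    horizon = len(rewards)
    tail = (gamma ** horizon) * r_absorbing / (1.0 - gamma)
    return discounted_return(rewards, gamma) + tail


d = 0.5  # assumed: a discriminator that cannot tell expert from policy

# Positive rewards: a longer episode always scores higher.
short_pos = discounted_return([positive_reward(d)] * 10)
long_pos = discounted_return([positive_reward(d)] * 100)

# Negative rewards: a shorter episode always scores higher.
short_neg = discounted_return([negative_reward(d)] * 10)
long_neg = discounted_return([negative_reward(d)] * 100)

# With the absorbing state rewarded like any other state, length no longer matters.
r = positive_reward(d)
short_abs = return_with_absorbing([r] * 10, r_absorbing=r)
long_abs = return_with_absorbing([r] * 100, r_absorbing=r)

print(long_pos > short_pos, short_neg > long_neg, math.isclose(short_abs, long_abs))
```

The last comparison sets the absorbing-state reward equal to the per-step reward only to make the cancellation easy to see. The summary above says only that DAC learns this reward rather than implicitly fixing it to zero.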

Discriminator-Actor-Critic: Addressing Sample Inefficiency and Reward Bias in Adversarial Imitation Learning

Adversarial Imitation Learning via Boosting

Efficient Imitation Learning with Conservative World Models

Planning for Sample Efficient Imitation Learning

Addressing Implicit Bias in Adversarial Imitation Learning with Mutual Information

Sample-efficient Adversarial Imitation Learning from Observation

Rethinking Adversarial Inverse Reinforcement Learning: Policy Imitation, Transferable Reward Recovery and Algebraic Equilibrium Proof

RILe: Reinforced Imitation Learning

Discriminator-Weighted Offline Imitation Learning from Suboptimal Demonstrations

Imitation Learning from Imperfection: Theoretical Justifications and Algorithms

Addressing reward bias in Adversarial Imitation Learning with neutral reward functions

Sample-Efficient Model-Free Reinforcement Learning with Off-Policy Critics

Extrinsicaly Rewarded Soft Q Imitation Learning with Discriminator

Recursive Least Squares Advantage Actor-Critic Algorithms

Co-Adaptation of Algorithmic and Implementational Innovations in Inference-based Deep Reinforcement Learning

Unlabeled Imperfect Demonstrations in Adversarial Imitation Learning

A Pragmatic Look at Deep Imitation Learning

Non-Adversarial Imitation Learning and its Connections to Adversarial Methods

Imitator Learning: Achieve Out-of-the-Box Imitation Ability in Variable Environments