Discriminator-Actor-Critic: Addressing Sample Inefficiency and Reward Bias in Adversarial Imitation Learning

Ilya Kostrikov, Kumar Krishna Agrawal, Debidatta Dwibedi, Sergey Levine, Jonathan Tompson
DOI: https://doi.org/10.48550/arXiv.1809.02925
2018-10-16
Abstract: We identify two issues with the family of algorithms based on the Adversarial Imitation Learning framework. The first problem is implicit bias present in the reward functions used in these algorithms. While these biases might work well for some environments, they can also lead to sub-optimal behavior in others. Secondly, even though these algorithms can learn from few expert demonstrations, they require a prohibitively large number of interactions with the environment in order to imitate the expert for many real-world applications. In order to address these issues, we propose a new algorithm called Discriminator-Actor-Critic that uses off-policy Reinforcement Learning to reduce policy-environment interaction sample complexity by an average factor of 10. Furthermore, since our reward function is designed to be unbiased, we can apply our algorithm to many problems without making any task-specific adjustments.
Machine Learning
What problem does this paper attempt to address?
This paper addresses two key problems in the Adversarial Imitation Learning (AIL) framework:

1. **Implicit Bias in the Reward Function**: The reward functions used in existing AIL algorithms carry implicit biases. These biases may happen to help in some environments but lead to sub-optimal behavior in others. For example, some forms of the reward function introduce a survival bonus, so the agent tends to prolong the episode rather than complete the task as quickly as possible. Improper handling of absorbing states can also degrade performance.
2. **Low Sample Efficiency**: Although these algorithms can learn from a small number of expert demonstrations, they need a very large number of environment interactions to imitate the expert successfully, which is impractical for many real-world applications such as robotics.

To solve these problems, the authors propose a new algorithm, Discriminator-Actor-Critic (DAC). The main improvements of DAC are:

- **Unbiased Reward Function**: By explicitly learning the rewards of terminating (absorbing) states, DAC removes the implicit biases in the reward function. The algorithm can then be applied to different tasks without task-specific adjustments to the reward function.
- **Off-Policy Reinforcement Learning**: DAC uses off-policy reinforcement learning to significantly reduce the number of interactions between the agent and the environment. Specifically, DAC uses TD3 (an off-policy RL algorithm) together with an off-policy discriminator, reducing sample complexity by approximately an order of magnitude.

With these improvements, DAC improves sample efficiency and also achieves state-of-the-art performance on a variety of complex imitation learning tasks.
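
The reward-bias argument can be made concrete with a small numerical sketch. It is an illustration only, not the authors' implementation: the function names are hypothetical, `d` is assumed to be the discriminator's probability that a transition comes from the expert, and the three reward forms are the ones commonly used in adversarial imitation learning. The sketch shows how the sign of the per-step reward favors longer or shorter episodes when the absorbing state is implicitly given zero reward, and how assigning the absorbing state its own reward removes that dependence on episode length.

```python
import math


def reward_positive(d):
    """-log(1 - D): never negative, so each extra step adds reward (survival bonus)."""
    return -math.log(1.0 - d)


def reward_negative(d):
    """log(D): never positive, so each extra step costs reward (per-step penalty)."""
    return math.log(d)


def reward_unbounded(d):
    """log(D) - log(1 - D): can take either sign depending on D."""
    return math.log(d) - math.log(1.0 - d)


def discounted_return(rewards, gamma=0.99, absorbing_reward=0.0):
    """Discounted return of an episode that ends after len(rewards) steps.

    After termination the agent is assumed to stay in an absorbing state
    forever and collect `absorbing_reward` at every step. The default of 0.0
    corresponds to ignoring the absorbing state (assumption for this sketch).
    """
    horizon = len(rewards)
    body = sum(gamma ** t * r for t, r in enumerate(rewards))
    # Geometric series for the infinite tail spent in the absorbing state.
    tail = gamma ** horizon * absorbing_reward / (1.0 - gamma)
    return body + tail


if __name__ == "__main__":
    d = 0.5  # an uninformative discriminator, chosen only for illustration
    for name, fn in [("-log(1-D)", reward_positive),
                     ("log(D)", reward_negative),
                     ("log(D)-log(1-D)", reward_unbounded)]:
        r = fn(d)
        short = discounted_return([r] * 10)
        long_ = discounted_return([r] * 100)
        print(f"{name:>16}: 10 steps -> {short:8.3f}, 100 steps -> {long_:8.3f}")
```

With the absorbing reward left at zero, the strictly positive form rewards staying alive and the strictly negative form rewards ending the episode early, regardless of what the expert did. Passing a non-zero `absorbing_reward` (in DAC this value is learned by the discriminator) makes the return depend on how the absorbing state is scored rather than on the sign convention alone.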