Imitation Learning to Outperform Demonstrators by Directly Extrapolating Demonstrations

Yuanying Cai, Chuheng Zhang, Wei Shen, Xiaonan He, Xuyun Zhang, Longbo Huang
DOI: https://doi.org/10.1145/3511808.3557357
2022-01-01
Abstract: We consider the problem of imitation learning from suboptimal demonstrations, where the aim is to learn a policy that is better than the demonstrators. Previous methods usually learn a reward function that encodes the underlying intention of the demonstrators and then use standard reinforcement learning to learn a policy from this reward function. Such methods can fail to control the distribution shift between the demonstrations and the learned policy: the learned reward function may not generalize well to out-of-distribution samples and can mislead the agent into highly uncertain states, resulting in degraded performance. To address this limitation, we propose a novel algorithm called Outperforming demonstrators by Directly Extrapolating Demonstrations (ODED). Instead of learning a reward function, ODED trains an ensemble of extrapolation networks that generate extrapolated demonstrations, i.e., demonstrations that may be induced by a good agent, based on the provided demonstrations. With these extrapolated demonstrations, we can use an off-the-shelf imitation learning algorithm to learn a good policy. Guided by the extrapolated demonstrations, the learned policy avoids visiting highly uncertain states and therefore controls the distribution shift. Empirically, we show that ODED outperforms suboptimal demonstrators and achieves better performance than state-of-the-art imitation learning algorithms on the MuJoCo and DeepMind Control Suite tasks.
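The abstract describes a two-stage pipeline: extrapolate the demonstrations with an ensemble, then run ordinary imitation learning on the result. The sketch below only illustrates that shape on a toy one-dimensional problem and is not the authors' implementation. Everything in it is an assumption made for illustration: the linear "extrapolators" stand in for the networks, the bootstrap training scheme, the ranked demonstrator qualities, the toy optimal policy `a = 2 * s`, and the use of ensemble standard deviation as an uncertainty signal are all hypothetical choices, and the abstract does not state how ODED trains its networks.

```python
import numpy as np

rng = np.random.default_rng(0)
OPTIMAL_GAIN = 2.0             # toy optimal policy: a = 2 * s (assumption)
QUALITIES = [0.2, 0.5, 0.8]    # hypothetical ranked demonstrators, worst to best

# Each demonstrator acts on the same states with a scaled-down, noisy gain.
states = rng.uniform(-1.0, 1.0, size=300)
demos = {q: q * OPTIMAL_GAIN * states + rng.normal(0.0, 0.01, size=states.shape)
         for q in QUALITIES}

def fit_ensemble(states, demos, qualities, n_models=5):
    """Fit linear models that predict the action change from a worse
    demonstrator to the next better one (stand-in for extrapolation networks)."""
    feats, targets = [], []
    for lo, hi in zip(qualities[:-1], qualities[1:]):
        feats.append(np.stack([states, demos[lo]], axis=1))
        targets.append(demos[hi] - demos[lo])
    X, y = np.concatenate(feats), np.concatenate(targets)
    weights = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X), size=len(X))   # bootstrap resample
        w, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        weights.append(w)
    return np.stack(weights)                         # shape (n_models, 2)

def extrapolate(ensemble, states, actions):
    """Apply every ensemble member to the given demonstrations; return the
    mean extrapolated action and the ensemble disagreement per sample."""
    X = np.stack([states, actions], axis=1)          # shape (n, 2)
    preds = actions[None, :] + ensemble @ X.T        # shape (n_models, n)
    return preds.mean(axis=0), preds.std(axis=0)

def behavior_clone(states, actions):
    """Simplest off-the-shelf imitation learner: least-squares fit of a = gain * s."""
    return float(states @ actions / (states @ states))

ensemble = fit_ensemble(states, demos, QUALITIES)
best = demos[QUALITIES[-1]]
extrap_actions, disagreement = extrapolate(ensemble, states, best)

gain_best = behavior_clone(states, best)
gain_extrap = behavior_clone(states, extrap_actions)
print(f"gain cloned from best demonstrator:        {gain_best:.3f}")
print(f"gain cloned from extrapolated demonstrations: {gain_extrap:.3f}")
print(f"mean ensemble disagreement: {disagreement.mean():.4f}")
```

In this toy setup the policy cloned from the extrapolated demonstrations ends up closer to the optimal gain than the one cloned from the best demonstrator, although a linear extrapolation like this one can overshoot. The numbers say nothing about the results reported in the paper.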