EvIL: Evolution Strategies for Generalisable Imitation Learning

Silvia Sapora, Gokul Swamy, Chris Lu, Yee Whye Teh, Jakob Nicolaus Foerster
2024-06-16
Abstract: Often in imitation learning (IL), the environment we collect expert demonstrations in and the environment we want to deploy our learned policy in aren't exactly the same (e.g. demonstrations collected in simulation but deployment in the real world). Compared to policy-centric approaches to IL like behavioural cloning, reward-centric approaches like inverse reinforcement learning (IRL) often better replicate expert behaviour in new environments. This transfer is usually performed by optimising the recovered reward under the dynamics of the target environment. However, (a) we find that modern deep IL algorithms frequently recover rewards which induce policies far weaker than the expert, even in the same environment the demonstrations were collected in. Furthermore, (b) these rewards are often quite poorly shaped, necessitating extensive environment interaction to optimise effectively. We provide simple and scalable fixes to both of these concerns. For (a), we find that reward model ensembles combined with a slightly different training objective significantly improve re-training and transfer performance. For (b), we propose a novel evolution-strategies-based method, EvIL, to optimise for a reward-shaping term that speeds up re-training in the target environment, closing a gap left open by the classical theory of IRL. On a suite of continuous control tasks, we are able to re-train policies in target (and source) environments more interaction-efficiently than prior work.
Neural and Evolutionary Computing, Machine Learning
What problem does this paper attempt to address?
This paper addresses the problem that, in imitation learning (IL), the rewards recovered from expert demonstrations are hard to retrain on effectively, whether in the source environment or in a different target environment. It focuses on two issues:

1. **Recovered reward functions do not induce expert-level policies**: Even in the same environment the demonstrations were collected in, the reward functions recovered by modern deep IL algorithms (reward-centric methods such as inverse reinforcement learning, IRL) often induce policies far weaker than the expert's. The paper finds that this happens frequently in practice, not just in theory.
2. **Recovered reward functions are poorly shaped, making retraining inefficient**: These reward functions usually need a large number of environment interactions to optimize effectively, which makes retraining costly.

To address these issues, the paper proposes two main sets of methods:

- **Improving Retrainability in IRL**:
  - **Policy Buffer**: Maintain a buffer containing trajectories from all past policies, so that the IRL discriminator is continually retrained on the complete policy history.
  - **Discriminator and Policy Ensembles**: Use ensembles to enlarge the part of the state space where the discriminator provides useful feedback, with each discriminator trained on a different set of states.
  - **Random Policy Resets**: Occasionally re-initialize the learner's policy during training to avoid premature convergence and improve exploration.
- **Decoupling Shaping from Discrimination**:
  - Use a two-stage process: first learn the reward function, then learn a shaping term that is added to the reward during retraining. This sidesteps the problem that reward shaping is invariant under the sequence of loss functions the discriminator sees during game-solving.

In addition, the paper introduces a method based on evolution strategies (ES), called EvIL, to optimize the shaping term. It directly maximizes retraining efficiency, without the roundabout route of learning a value-function network. Specifically, evolution strategies are used to evolve a potential function Φ that accelerates training, and each candidate shaping term is scored by the area under the performance curve (AUC), which serves as the fitness function.

In summary, the main contribution of the paper is a method that combines the advantages of modern deep-learning architectures and classical IRL methods to make retraining both efficient and effective.
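The idea of evolving a potential function Φ with an AUC fitness can be illustrated on a toy problem. The sketch below is not the paper's implementation: the six-state chain environment, the tabular Q-learning inner loop, the OpenAI-ES-style antithetic update, and every name and hyperparameter in it (`step`, `shaped`, `train_curve`, `evolve`, `sigma`, `lr`, and so on) are assumptions chosen for illustration. The sparse goal reward stands in for a recovered reward, and the shaping term uses the standard potential-based form r + γΦ(s') − Φ(s).

```python
import random

GAMMA = 0.9
N_STATES = 6        # hypothetical toy chain 0..5; state 5 is the goal
ACTIONS = (-1, +1)  # move left / move right

def step(s, a):
    """Toy dynamics: reward 1 only when the goal state is reached."""
    s2 = min(max(s + a, 0), N_STATES - 1)
    done = s2 == N_STATES - 1
    return s2, (1.0 if done else 0.0), done

def shaped(r, s, s2, phi, done):
    """Potential-based shaping r + gamma*Phi(s') - Phi(s), with Phi(terminal) = 0."""
    return r + GAMMA * (0.0 if done else phi[s2]) - phi[s]

def train_curve(phi, seed, episodes=30, horizon=20, alpha=0.5, eps=0.2):
    """Q-learning on the shaped reward; logs the unshaped discounted return per episode."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    curve = []
    for _ in range(episodes):
        s, ret = 0, 0.0
        for t in range(horizon):
            if rng.random() < eps:
                i = rng.randrange(2)  # explore
            else:
                best = max(q[s])      # exploit, breaking ties at random
                i = rng.choice([j for j in (0, 1) if q[s][j] == best])
            s2, r, done = step(s, ACTIONS[i])
            bootstrap = 0.0 if done else GAMMA * max(q[s2])
            q[s][i] += alpha * (shaped(r, s, s2, phi, done) + bootstrap - q[s][i])
            ret += (GAMMA ** t) * r
            s = s2
            if done:
                break
        curve.append(ret)
    return curve

def auc(curve):
    """Area under the learning curve, normalised by its length (the mean return)."""
    return sum(curve) / len(curve)

def fitness(phi, seeds=(0, 1, 2)):
    """Average AUC over a few inner-loop seeds."""
    return sum(auc(train_curve(phi, s)) for s in seeds) / len(seeds)

def evolve(generations=10, pop=8, sigma=0.3, lr=0.5, seed=0):
    """ES update with antithetic noise pairs applied to the potential table."""
    rng = random.Random(seed)
    phi = [0.0] * N_STATES
    for _ in range(generations):
        grad = [0.0] * N_STATES
        for _ in range(pop // 2):
            noise = [rng.gauss(0.0, 1.0) for _ in range(N_STATES)]
            f_plus = fitness([p + sigma * n for p, n in zip(phi, noise)])
            f_minus = fitness([p - sigma * n for p, n in zip(phi, noise)])
            for k in range(N_STATES):
                grad[k] += (f_plus - f_minus) * noise[k]
        phi = [p + lr * g / (pop * sigma) for p, g in zip(phi, grad)]
    return phi

if __name__ == "__main__":
    phi = evolve()
    print("AUC without shaping:", fitness([0.0] * N_STATES))
    print("AUC with evolved potential:", fitness(phi))
```

Because the fitness is the AUC of the inner learner's training curve, a potential scores well only if it makes learning faster, which is the quantity the summary says EvIL optimizes directly. In this sketch the outer loop never differentiates through the inner learner; it only needs fitness evaluations, which is why a gradient-free method such as ES fits here.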