Supervised Meta-Reinforcement Learning with Trajectory Optimization for Manipulation Tasks
Lei Wang,Yunzhou Zhang,Delong Zhu,Sonya Coleman,Dermot Kerr
DOI: https://doi.org/10.1109/tcds.2023.3286465
IF: 4.546
2024-01-01
IEEE Transactions on Cognitive and Developmental Systems
Abstract:Learning from small amounts of samples with reinforcement learning (RL) is challenging in many tasks, especially, in real-world applications, such as robotics. Meta-RL (meta-RL) has been proposed as an approach to address this problem by generalizing to new tasks through experience from previous similar tasks. However, these approaches generally perform meta-optimization by focusing direct policy search methods on validation samples from adapted policies, thus, requiring large amounts of on-policy samples during meta-training. To this end, we propose a novel algorithm called supervised meta-RL with trajectory optimization (SMRL-TO) by integrating model-agnostic meta-learning (MAML) and iterative LQR (iLQR)-based trajectory optimization. Our approach is designed to provide online supervision for validation samples through iLQR-based trajectory optimization and embed simple imitation learning into the meta-optimization rather than policy gradient steps. This is actually a bi-level optimization that needs to calculate several gradient updates in each meta-iteration, consisting of off-policy RL in the inner loop and online imitation learning in the outer loop. SMRL-TO can achieve significant improvements in sample efficiency without human-provided demonstrations, due to the effective supervision from iLQR-based trajectory optimization. In this article, we describe how to use iLQR-based trajectory optimization to obtain labeled data and then how leverage them to assist the training of meta-learner. Through a series of robotic manipulation tasks, we further show that compared with the previous methods, the proposed approach can substantially improve sample efficiency and achieve better asymptotic performance.