Deep Generative Models in Robotics: A Survey on Learning from Multimodal Demonstrations

Julen Urain, Ajay Mandlekar, Yilun Du, Mahi Shafiullah, Danfei Xu, Katerina Fragkiadaki, Georgia Chalvatzaki, Jan Peters
2024-08-21
Abstract: Learning from Demonstrations, the field that proposes to learn robot behavior models from data, is gaining popularity with the emergence of deep generative models. Although the problem has been studied for years under names such as Imitation Learning, Behavioral Cloning, or Inverse Reinforcement Learning, classical methods have relied on models that don't capture complex data distributions well or don't scale well to large numbers of demonstrations. In recent years, the robot learning community has shown increasing interest in using deep generative models to capture the complexity of large datasets. In this survey, we aim to provide a unified and comprehensive review of the last year's progress in the use of deep generative models in robotics. We present the different types of models that the community has explored, such as energy-based models, diffusion models, action value maps, or generative adversarial networks. We also present the different types of applications in which deep generative models have been used, from grasp generation to trajectory generation or cost learning. One of the most important elements of generative models is the generalization out of distributions. In our survey, we review the different decisions the community has made to improve the generalization of the learned models. Finally, we highlight the research challenges and propose a number of future directions for learning deep generative models in robotics.
Robotics, Machine Learning
What problem does this paper attempt to address?
The paper attempts to address the challenges encountered when training robot behavior models through multimodal demonstrations (such as images, language, touch, etc.). Specifically, the paper focuses on the following aspects:

1. **Data Diversity**: Demonstrators differ in skill level, preferences, and strategies, so a dataset often contains several distinct ways of solving the same task. Unimodal distribution models cannot capture this diversity, which hurts performance. The paper explores how Deep Generative Models (DGMs) can capture these complex multimodal data distributions.
2. **Heterogeneous Action and State Spaces**: Unlike in computer vision, robot actions can take various forms, such as torque commands, target positions, or trajectories. Robot behavior can also be modeled in either configuration space or task space, which makes both datasets and solutions heterogeneous. The paper discusses how to apply deep generative models in these heterogeneous settings.
3. **Partially Observable Demonstrations**: Human demonstrators act not only on observable elements but also on internal states that robot sensors may not capture. This mismatch leaves the task context only partially represented, which makes the demonstrated behavior more ambiguous to learn. The paper discusses reducing this ambiguity by encoding the history of observations.
4. **Temporal Dependency and Long-term Planning**: Robot tasks often involve sequential decision-making, where actions are temporally correlated. Small errors therefore compound over time and can drive the robot into situations not seen in the training demonstrations. The paper discusses reducing these compounding errors by learning short-horizon skills or by generating action trajectories.
5. **Mismatch Between Training and Evaluation Objectives**: Learning from offline demonstrations is typically framed as a density estimation problem, but when the learned model is used to solve a specific task, the evaluation metric is the task success rate. This mismatch between training and evaluation objectives can lead to poor performance. The paper discusses combining behavior cloning with reinforcement learning fine-tuning to address this issue.
6. **Distribution Shift and Generalization**: A fundamental challenge of learning from offline demonstrations is the distribution shift between the demonstration data and real-world deployment scenarios. The paper explores how models can extrapolate from the given demonstrations and adapt to new, unseen environments.

Overall, the paper aims to provide a unified and comprehensive review of recent advances in the use of deep generative models in robotics, with particular attention to how they address the challenges above.
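The data-diversity point can be made concrete with a toy example. The sketch below is not taken from the paper: it assumes a hypothetical one-dimensional steering action where demonstrators pass an obstacle either on the left (about -1) or on the right (about +1). All function names are illustrative, and a simple two-center clustering stands in for a multimodal model; it is not a deep generative model. A single Gaussian fitted to these demonstrations places its mean near 0, an action no demonstrator ever took.

```python
import random
import statistics


def make_demonstrations(n=200, seed=0):
    """Hypothetical 1-D steering actions: pass an obstacle left (-1) or right (+1)."""
    rng = random.Random(seed)
    return [rng.choice([-1.0, 1.0]) + rng.gauss(0.0, 0.1) for _ in range(n)]


def fit_unimodal(actions):
    """Maximum-likelihood fit of a single Gaussian: returns (mean, std)."""
    return statistics.fmean(actions), statistics.pstdev(actions)


def fit_two_modes(actions, iters=20):
    """1-D k-means with two centers, used here as a stand-in for a multimodal model."""
    centers = [min(actions), max(actions)]
    for _ in range(iters):
        clusters = ([], [])
        for a in actions:
            nearest = 0 if abs(a - centers[0]) <= abs(a - centers[1]) else 1
            clusters[nearest].append(a)
        # Keep the old center if a cluster happens to be empty.
        centers = [
            statistics.fmean(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return sorted(centers)


demos = make_demonstrations()
mean, std = fit_unimodal(demos)
modes = fit_two_modes(demos)

# Count how many demonstrated actions lie close to the unimodal mean.
near_mean = sum(abs(a - mean) < 0.3 for a in demos)

print(f"unimodal fit: mean={mean:.2f}, std={std:.2f}")
print(f"demonstrations near the unimodal mean: {near_mean}")
print(f"recovered modes: {modes[0]:.2f}, {modes[1]:.2f}")
```

In this toy setting, a policy that executes the unimodal mean steers straight into the obstacle, while a model that represents both modes can commit to one of the demonstrated strategies. This is the motivation the summary gives for moving from unimodal models to deep generative models.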