Learning an Actionable Discrete Diffusion Policy via Large-Scale Actionless Video Pre-Training

Haoran He, Chenjia Bai, Ling Pan, Weinan Zhang, Bin Zhao, Xuelong Li
2024-10-09
Abstract: Learning a generalist embodied agent capable of completing multiple tasks poses challenges, primarily stemming from the scarcity of action-labeled robotic datasets. In contrast, a vast amount of human videos exists, capturing intricate tasks and interactions with the physical world. Promising prospects arise for utilizing actionless human videos for pre-training and transferring the knowledge to facilitate robot policy learning through limited robot demonstrations. However, this remains a challenge due to the domain gap between humans and robots. Moreover, it is difficult to extract useful information representing the dynamic world from human videos, because of their noisy and multimodal data structure. In this paper, we introduce a novel framework to tackle these challenges, which leverages a unified discrete diffusion to combine generative pre-training on human videos and policy fine-tuning on a small number of action-labeled robot videos. We start by compressing both human and robot videos into unified video tokens. In the pre-training stage, we employ a discrete diffusion model with a mask-and-replace diffusion strategy to predict future video tokens in the latent space. In the fine-tuning stage, we harness the imagined future videos to guide low-level action learning with a limited set of robot data. Experiments demonstrate that our method generates high-fidelity future videos for planning and that the fine-tuned policies outperform previous state-of-the-art approaches. Our project website is available at https://video-diff.github.io/.
Machine Learning, Computer Vision and Pattern Recognition, Robotics
What problem does this paper attempt to address?
The paper addresses how to pre-train a model on large-scale human video data without action labels and then fine-tune it on a small amount of action-labeled robot video data, yielding a generalist robot agent capable of performing various tasks. Specifically, the paper aims to overcome the following challenges:

1. **Scarcity of Robot Data**: Unlike the vast amount of data available on the internet, high-quality robot interaction data is difficult to obtain because it usually needs to be collected through teleoperation or kinematic solvers, which is both expensive and time-consuming.
2. **Domain Gap Between Humans and Robots**: The behavior patterns in human videos are complex and diverse, and significant differences in morphology and dynamics make it hard to transfer this knowledge directly to robots.
3. **Extracting Useful Information from Human Videos**: Human video data is noisy and multimodal, making it challenging to extract useful information that represents the dynamic world.

To address these challenges, the paper proposes a video policy learning framework based on a discrete diffusion model (VPDD), which proceeds in two phases:

- **Pre-training Phase**: Large-scale human video data without action labels is used for pre-training. By predicting future video tokens with a discrete diffusion model, the framework learns common-sense knowledge, dynamic rules, and behavior patterns of interactions with the physical world.
- **Fine-tuning Phase**: A small amount of action-labeled robot video data is used for fine-tuning. The predicted future videos guide low-level action learning, so the model generates concrete actions conditioned on them.

In this way, VPDD can effectively transfer the knowledge learned from human videos to robot tasks, improving the robot's performance in multi-task environments.
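
To make the "mask-and-replace" idea from the abstract more concrete, the following is a minimal sketch of one forward corruption step on discrete video tokens. It is not the authors' implementation: the function name `mask_and_replace_step`, the parameters `keep_prob` and `mask_prob`, and the convention that the mask token uses id `vocab_size` are all assumptions made for illustration. Each token is either kept, replaced by a uniformly random token, or turned into a mask token, and a token that is already masked stays masked.

```python
import numpy as np


def mask_and_replace_step(tokens, vocab_size, keep_prob, mask_prob, rng):
    """Apply one illustrative forward corruption step to discrete token ids.

    Assumed convention (not from the paper): ids 0..vocab_size-1 are ordinary
    tokens and id == vocab_size is the [MASK] token.
    With probability keep_prob a token is kept, with probability mask_prob it
    becomes [MASK], and with the remaining probability it is resampled
    uniformly from the vocabulary.
    """
    if keep_prob < 0.0 or mask_prob < 0.0 or keep_prob + mask_prob > 1.0 + 1e-9:
        raise ValueError("keep_prob and mask_prob must be >= 0 and sum to <= 1")

    mask_id = vocab_size
    tokens = np.asarray(tokens, dtype=np.int64)

    # One uniform draw per token decides which of the three outcomes applies.
    u = rng.random(tokens.shape)
    random_tokens = rng.integers(0, vocab_size, size=tokens.shape)

    replace = (u >= keep_prob) & (u < 1.0 - mask_prob)
    to_mask = u >= 1.0 - mask_prob

    out = tokens.copy()
    out[replace] = random_tokens[replace]
    out[to_mask] = mask_id

    # [MASK] is treated as absorbing: once masked, a token stays masked.
    out[tokens == mask_id] = mask_id
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video_tokens = rng.integers(0, 16, size=(2, 8))  # toy "video token" grid
    noisy = mask_and_replace_step(
        video_tokens, vocab_size=16, keep_prob=0.6, mask_prob=0.3, rng=rng
    )
    print(video_tokens)
    print(noisy)
```

Applying such a step repeatedly, with probabilities that shift toward masking, gradually destroys the token sequence; a discrete diffusion model is then trained to reverse that corruption. The specific schedule and transition probabilities used by VPDD are not given in this summary, so the values above are placeholders.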