Abstract: Learning a generalist embodied agent capable of completing multiple tasks poses challenges, primarily stemming from the scarcity of action-labeled robotic datasets. In contrast, a vast amount of human videos exist, capturing intricate tasks and interactions with the physical world. This makes it promising to pre-train on actionless human videos and transfer the knowledge to robot policy learning with limited robot demonstrations. However, this remains a challenge due to the domain gap between humans and robots. Moreover, it is difficult to extract useful information representing the dynamic world from human videos, because of their noisy and multimodal data structure. In this paper, we introduce a novel framework to tackle these challenges, which leverages a unified discrete diffusion to combine generative pre-training on human videos and policy fine-tuning on a small number of action-labeled robot videos. We start by compressing both human and robot videos into unified video tokens. In the pre-training stage, we employ a discrete diffusion model with a mask-and-replace diffusion strategy to predict future video tokens in the latent space. In the fine-tuning stage, we harness the imagined future videos to guide low-level action learning with a limited set of robot data. Experiments demonstrate that our method generates high-fidelity future videos for planning and yields fine-tuned policies that outperform previous state-of-the-art approaches. Our project website is available at https://video-diff.github.io/.

What problem does this paper attempt to address?

The paper addresses how to pre-train a model on large-scale human video data without action labels and then fine-tune it on a small amount of action-labeled robot video data, with the goal of producing a generalist robot agent capable of performing various tasks. Specifically, the paper aims to overcome the following challenges:

1. **Scarcity of Robot Data**: Unlike the vast amount of data available on the internet, high-quality robot interaction data is difficult to obtain because it usually needs to be collected through teleoperation or kinematic solvers, which is both expensive and time-consuming.
2. **Domain Gap Between Humans and Robots**: The behavior patterns in human videos are complex and diverse, and humans and robots differ significantly in morphology and dynamics, so this knowledge is hard to transfer directly to robots.
3. **Extracting Useful Information from Human Videos**: Human video data is noisy and multimodal, making it challenging to extract useful information that represents the dynamic world.

To address these challenges, the paper proposes a video policy learning framework based on a discrete diffusion model (VPDD), which proceeds in two phases:

- **Pre-training Phase**: Large-scale human video data without action labels is used for pre-training. A discrete diffusion model predicts future video tokens in the latent space, and in doing so the framework learns common-sense knowledge, dynamic rules, and behavior patterns from interactions with the physical world.
- **Fine-tuning Phase**: A small amount of action-labeled robot video data is used for fine-tuning. The predicted future videos guide low-level action learning, so the model generates actions conditioned on the futures it predicts.

In this way, VPDD transfers the knowledge learned from human videos to robot tasks, improving the robot's performance in multi-task environments.
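To make the "mask-and-replace" idea from the abstract more concrete, the following is a minimal sketch of one forward corruption step on discrete video tokens. It is an illustration only, not the paper's implementation: the function name `mask_and_replace_step`, the fixed probabilities `mask_prob` and `replace_prob`, and the convention that the `[MASK]` id equals `vocab_size` are all assumptions made for this example. A real model would use a noise schedule over many steps and train a network to reverse the corruption.

```python
import numpy as np


def mask_and_replace_step(tokens, vocab_size, mask_prob, replace_prob, rng):
    """Apply one illustrative corruption step to an array of token ids.

    Assumed convention: ids 0..vocab_size-1 are ordinary codebook entries,
    and id == vocab_size is the special [MASK] token.
    """
    tokens = np.asarray(tokens)
    mask_id = vocab_size
    out = tokens.copy()

    # One uniform draw per token decides its fate: mask, replace, or keep.
    u = rng.random(tokens.shape)
    already_masked = tokens == mask_id  # masked tokens stay masked

    to_mask = (u < mask_prob) & ~already_masked
    to_replace = (
        (u >= mask_prob) & (u < mask_prob + replace_prob) & ~already_masked
    )

    out[to_mask] = mask_id
    # Replacement resamples uniformly from the codebook (may redraw the same id).
    out[to_replace] = rng.integers(0, vocab_size, size=int(to_replace.sum()))
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video_tokens = np.array([3, 7, 1, 12, 5, 9, 0, 14])  # hypothetical token ids
    noisy = mask_and_replace_step(
        video_tokens, vocab_size=16, mask_prob=0.3, replace_prob=0.2, rng=rng
    )
    print("original:", video_tokens)
    print("corrupted:", noisy)
```

During pre-training, a denoising network would be asked to recover the clean future video tokens from such corrupted sequences; the sketch covers only the corruption side.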

Learning an Actionable Discrete Diffusion Policy via Large-Scale Actionless Video Pre-Training

Learning Robot Manipulation Skills from Human Demonstration Videos Using Two-Stream 2-D/3-D Residual Networks with Self-Attention

Prediction with Action: Visual Policy Learning via Joint Denoising Process

Diffusion Policy: Visuomotor Policy Learning via Action Diffusion

Discrete Policy: Learning Disentangled Action Space for Multi-Task Robotic Manipulation

Diffusion Transformer Policy

3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations

Diffusion Co-Policy for Synergistic Human-Robot Collaborative Tasks

Masked Diffusion with Task-awareness for Procedure Planning in Instructional Videos

One-Step Diffusion Policy: Fast Visuomotor Policies via Diffusion Distillation

Text-Aware Diffusion for Policy Learning

Enabling Stateful Behaviors for Diffusion-based Policy Learning

Dreamitate: Real-World Visuomotor Policy Learning via Video Generation

Diffusion-VLA: Scaling Robot Foundation Models via Unified Diffusion and Autoregression

DNAct: Diffusion Guided Multi-Task 3D Policy Learning

Diff-DAgger: Uncertainty Estimation with Diffusion Policy for Robotic Manipulation

Enhancing Exploration with Diffusion Policies in Hybrid Off-Policy RL: Application to Non-Prehensile Manipulation

EquiBot: SIM(3)-Equivariant Diffusion Policy for Generalizable and Data Efficient Learning

Hierarchical Diffusion Policy: manipulation trajectory generation via contact guidance