Abstract: Imitation Learning presents a promising approach for learning generalizable and complex robotic skills. The recently proposed Diffusion Policy generates robot action sequences through a conditional denoising diffusion process, achieving state-of-the-art performance compared to other imitation learning methods. This paper summarizes five key components of Diffusion Policy: 1) observation sequence input; 2) action sequence execution; 3) receding horizon; 4) U-Net or Transformer network architecture; and 5) FiLM conditioning. By conducting experiments across ManiSkill and Adroit benchmarks, this study aims to elucidate the contribution of each component to the success of Diffusion Policy in various scenarios. We hope our findings will provide valuable insights for the application of Diffusion Policy in future research and industry.

What problem does this paper attempt to address?

The problem that this paper attempts to solve is: **How can the contribution of each key component of Diffusion Policy to overall performance be understood and quantified?** Although Diffusion Policy performs very well in robot imitation learning, the specific roles of its internal components have not yet been systematically analyzed. As a result, many researchers lack clear guidance when using or modifying Diffusion Policy and may inadvertently weaken its performance.

Specifically, the paper experimentally analyzes the influence of the following five key components on the performance of Diffusion Policy:

1. **Observation Sequence Input**: Diffusion Policy takes a sequence of past observations as input instead of relying only on the current observation.
2. **Action Sequence Execution**: Diffusion Policy executes a sequence of actions per inference instead of a single action.
3. **Receding Horizon Control**: Diffusion Policy predicts multiple future actions but executes only the first few in the environment, balancing long-term planning against real-time responsiveness.
4. **Denoising Network Architecture**: Diffusion Policy uses a U-Net or Transformer as the denoising network instead of a simple multi-layer perceptron (MLP).
5. **FiLM Conditioning**: Diffusion Policy feeds the observation sequence to the denoising network as FiLM conditioning instead of passing it directly as network input.

To evaluate the importance of these components, the paper conducts ablation experiments on the ManiSkill and Adroit benchmarks and draws the following conclusions:

- **Observation Sequence Input**: It is crucial for tasks requiring absolute control, but has less impact on incremental control tasks.
- **Action Sequence Execution**: It generally improves performance by 10-20%, but for tasks requiring real-time feedback, shorter action sequences or single-action execution are more effective.
- **Receding Horizon Control**: It significantly improves performance on long-horizon tasks, but has little impact on short-horizon tasks.
- **Denoising Network Architecture**: U-Net is very important for complex tasks, while an MLP is sufficient for simple tasks.
- **FiLM Conditioning**: It significantly improves performance on complex tasks, but is not necessary for simple tasks.

Based on these experimental results, the paper offers specific suggestions for future research and applications, helping researchers better understand and optimize each component of Diffusion Policy.

### Summary

This paper aims to reveal, through systematic experiments and analysis, how each key component of Diffusion Policy contributes to its performance, thereby providing guidance for future research and practical applications.
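The first three components (observation sequence input, action sequence execution, and receding horizon control) interact in the rollout loop, which can be sketched roughly as follows. This is only an illustrative sketch, not the paper's implementation: `toy_policy` and `toy_env_step` are hypothetical stand-ins (the real policy is a diffusion denoiser, which is not modeled here), and the horizon values `OBS_HORIZON`, `PRED_HORIZON`, and `ACT_HORIZON` are assumed for the example.

```python
from collections import deque

# Assumed example values; the source does not specify concrete horizons.
OBS_HORIZON = 2    # number of past observations fed to the policy
PRED_HORIZON = 8   # number of future actions predicted per inference
ACT_HORIZON = 4    # number of predicted actions actually executed


def toy_policy(obs_history, pred_horizon):
    """Hypothetical stand-in for the denoising network.

    Returns pred_horizon identical actions, each a small step toward zero
    computed from the most recent observation.
    """
    latest = obs_history[-1]
    return [-0.1 * latest for _ in range(pred_horizon)]


def toy_env_step(state, action):
    """Hypothetical 1-D environment: the action is added to the state."""
    return state + action


def rollout(initial_state, max_steps):
    """Run a receding-horizon rollout and return (state, steps, inferences)."""
    # Pad the history with the first observation so it always holds
    # exactly OBS_HORIZON entries.
    obs_history = deque([initial_state] * OBS_HORIZON, maxlen=OBS_HORIZON)
    state = initial_state
    steps = 0
    inferences = 0
    while steps < max_steps:
        # Predict a long action sequence from the observation sequence.
        actions = toy_policy(list(obs_history), PRED_HORIZON)
        inferences += 1
        # Execute only the first ACT_HORIZON actions, then re-plan.
        for action in actions[:ACT_HORIZON]:
            state = toy_env_step(state, action)
            obs_history.append(state)
            steps += 1
            if steps >= max_steps:
                break
    return state, steps, inferences


if __name__ == "__main__":
    final_state, steps, inferences = rollout(1.0, max_steps=10)
    print(final_state, steps, inferences)
```

In this sketch, setting `ACT_HORIZON = 1` corresponds to single-action execution (re-planning after every step), while setting `ACT_HORIZON = PRED_HORIZON` executes every predicted action before re-planning; these are the kinds of settings the ablations described above vary.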

Unpacking the Individual Components of Diffusion Policy

Diffusion Policy: Visuomotor Policy Learning via Action Diffusion

3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations

One-Step Diffusion Policy: Fast Visuomotor Policies via Diffusion Distillation

Don't Start from Scratch: Behavioral Refinement via Interpolant-based Policy Diffusion

Enabling Stateful Behaviors for Diffusion-based Policy Learning

PlayFusion: Skill Acquisition via Diffusion from Language-Annotated Play

Hierarchical Diffusion Policy: manipulation trajectory generation via contact guidance

Crossway Diffusion: Improving Diffusion-based Visuomotor Policy via Self-supervised Learning

3D Diffuser Actor: Policy Diffusion with 3D Scene Representations

AffordDP: Generalizable Diffusion Policy with Transferable Affordance

EquiBot: SIM(3)-Equivariant Diffusion Policy for Generalizable and Data Efficient Learning

Diffusion Imitation from Observation

Prediction with Action: Visual Policy Learning via Joint Denoising Process

Learning an Actionable Discrete Diffusion Policy via Large-Scale Actionless Video Pre-Training

Learning Multimodal Behaviors from Scratch with Diffusion Policy Gradient

Diffusion Policy Policy Optimization

Diff-DAgger: Uncertainty Estimation with Diffusion Policy for Robotic Manipulation

Diffusion Co-Policy for Synergistic Human-Robot Collaborative Tasks

Diffusion-Reward Adversarial Imitation Learning