Unifying Diffusion Models with Action Detection Transformers for Multi-task Robotic Manipulation

Abstract: We present ChainedDiffuser, a policy architecture that unifies transformer-based end-effector action prediction and diffusion-based trajectory generation for learning multimodal multi-task robotic manipulation from demonstrations. Our model sets a new record on established manipulation benchmarks across a variety of settings, significantly outperforming all prior state-of-the-art approaches. Our main innovation is to use a global transformer-based action predictor to predict actions at keyframes, a task that requires multimodal semantic scene understanding, and to use a local trajectory diffuser to predict trajectory segments that connect predicted macro-actions. ChainedDiffuser outperforms both state-of-the-art macro-action prediction models that use motion planners for trajectory prediction, and trajectory diffusion policies that do not predict keyframe macro-actions. We conduct experiments in both simulated and real-world environments and demonstrate ChainedDiffuser’s ability in solving a wide range of manipulation tasks involving interactions with diverse objects.
Engineering, Computer Science
What problem does this paper attempt to address?