Two-stream 2D/3D Residual Networks for Learning Robot Manipulations from Human Demonstration Videos

Xin Xu, Kun Qian, Bo Zhou, Shenghao Chen, Yitong Li
DOI: https://doi.org/10.1109/icra48506.2021.9561308
2021-01-01
Abstract: Learning manipulation skills by observing human demonstration videos is a promising direction for intelligent robotic systems. Recent advances in video-to-command translation provide an end-to-end approach for translating a video into robot plans. However, general video captioning methods focus on understanding the full frame and give little consideration to the spatio-temporal features in videos. In this paper, we propose two-stream 2D/3D residual networks that enable robots to learn manipulation tasks from human demonstration videos. We integrate spatial features from a 2D residual network and temporal features from a 3D residual network as inputs to RNN layers. An encoder-decoder architecture then encodes the spatio-temporal features and sequentially generates the command words. Experimental results on an extended manipulation dataset show that our approach outperforms state-of-the-art methods. Real-world experiments on a Baxter robotic arm indicate that our method can produce more accurate commands from video demonstrations.
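The data flow described in the abstract can be illustrated with a small, standard-library-only sketch. It is not the authors' implementation: the random vectors stand in for 2D and 3D residual network outputs, per-step concatenation is assumed as the way the two streams are integrated, a plain tanh RNN with untrained random weights replaces whatever recurrent cell the paper uses, and the vocabulary, dimensions, and function names are all hypothetical.

```python
import math
import random

def fuse_streams(spatial_feats, temporal_feats):
    """Concatenate per-step spatial and temporal vectors (assumed fusion rule)."""
    if len(spatial_feats) != len(temporal_feats):
        raise ValueError("both streams must cover the same number of time steps")
    return [s + t for s, t in zip(spatial_feats, temporal_feats)]

def rand_matrix(rows, cols, rng):
    """Small random weights; a trained model would learn these."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def rnn_step(w_in, w_h, x, h):
    """One step of a plain tanh RNN: h' = tanh(W_in x + W_h h)."""
    return [math.tanh(a + b) for a, b in zip(matvec(w_in, x), matvec(w_h, h))]

def encode(fused, w_in, w_h, hidden_size):
    """Run the encoder over the fused sequence and return the final state."""
    h = [0.0] * hidden_size
    for x in fused:
        h = rnn_step(w_in, w_h, x, h)
    return h

def decode(h, vocab, w_emb, w_in, w_h, w_out, max_len=5):
    """Greedy decoding: emit the top-scoring word until <eos> or max_len."""
    bos, eos = vocab.index("<bos>"), vocab.index("<eos>")
    words, prev = [], bos
    for _ in range(max_len):
        h = rnn_step(w_in, w_h, w_emb[prev], h)
        scores = matvec(w_out, h)
        scores[bos] = float("-inf")  # never emit the start token
        prev = max(range(len(vocab)), key=scores.__getitem__)
        if prev == eos:
            break
        words.append(vocab[prev])
    return words

# Hypothetical sizes: T steps, 2D/3D feature dims, hidden size, embedding size.
rng = random.Random(0)
T, D2, D3, H, E = 4, 6, 3, 8, 5
spatial = [[rng.random() for _ in range(D2)] for _ in range(T)]   # 2D-stream stand-in
temporal = [[rng.random() for _ in range(D3)] for _ in range(T)]  # 3D-stream stand-in
vocab = ["<bos>", "<eos>", "left_hand", "grasp", "cup", "pour"]   # hypothetical words

fused = fuse_streams(spatial, temporal)
state = encode(fused, rand_matrix(H, D2 + D3, rng), rand_matrix(H, H, rng), H)
command = decode(state, vocab, rand_matrix(len(vocab), E, rng),
                 rand_matrix(H, E, rng), rand_matrix(H, H, rng),
                 rand_matrix(len(vocab), H, rng))
print(command)
```

Because the weights are random, the printed command is meaningless; the sketch only shows how fused spatio-temporal features feed an encoder whose final state seeds a word-by-word decoder.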