Global and Local C3D Ensemble System for First Person Interactive Action Recognition.

Lingling Fa, Yan Song, Xiangbo Shu
DOI: https://doi.org/10.1007/978-3-319-73600-6_14
2018-01-01
Abstract: Action recognition in first-person videos differs from that in third-person videos. In this paper, we aim to recognize interactive actions in first-person videos. Such actions contain two kinds of motion: the ego-motion of the observer and the motion of the actor. To enable an observer to understand "what activity others are doing to me", we propose a twin-stream network architecture based on 3D convolutional networks. The global action C3D learns interactions together with ego-motion, while the local salient motion C3D analyzes the actor's motion within a salient region, which matters especially when the action happens at a distance from the observer. We also propose a sampling method for extracting the clips fed to the C3D models and investigate different C3D architectures to improve performance. We carry out experiments on the JPL First-Person Interaction benchmark dataset. Experimental results show that the ensemble of the global and local networks improves accuracy over state-of-the-art methods by 3.26%.
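The abstract does not say how the outputs of the global and local C3D streams are combined. As a minimal sketch, assuming a weighted average of per-class softmax probabilities (a common late-fusion choice, not necessarily the authors' method), the ensemble step could look like the following. The function names and the `global_weight` parameter are hypothetical and introduced only for illustration.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predict(global_scores, local_scores, global_weight=0.5):
    """Fuse two streams by weighted averaging of class probabilities.

    ASSUMPTION: late fusion by weighted average; the abstract does not
    specify the fusion rule. `global_weight` is a hypothetical parameter
    giving the share assigned to the global stream.
    """
    if len(global_scores) != len(local_scores):
        raise ValueError("both streams must score the same set of classes")
    if not 0.0 <= global_weight <= 1.0:
        raise ValueError("global_weight must lie in [0, 1]")

    p_global = softmax(global_scores)
    p_local = softmax(local_scores)
    fused = [
        global_weight * g + (1.0 - global_weight) * l
        for g, l in zip(p_global, p_local)
    ]
    # Index of the class with the highest fused probability.
    best = max(range(len(fused)), key=fused.__getitem__)
    return best, fused


if __name__ == "__main__":
    # Made-up scores for three classes, purely for illustration.
    label, probs = ensemble_predict([2.0, 0.0, 0.0], [0.0, 0.0, 3.0])
    print(label, probs)
```

Averaging probabilities rather than raw scores keeps the two streams on a comparable scale; in practice the weight would be tuned on held-out data.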