Deep Key Clips-Video Feature Fusion Framework for Action Recognition

Chao Li, Yue Ming, Yuan Shen, Hui Yu
DOI: https://doi.org/10.1109/ICMEW.2019.00034
2019-01-01
Abstract: Action recognition is crucial for many computer vision applications. Recently, deep learning has brought breakthroughs in action recognition performance. However, videos contain a large number of redundant frames that carry similar information, which makes it difficult to capture discriminative spatio-temporal features for long-term actions. In this paper, we propose a novel framework for action recognition: the Deep Key Clips-Video feature fusion framework. First, we propose a key clip selection algorithm based on background subtraction, which uses the image average gradient to select key clips for training. Then, we superimpose the key frames to generate historical contour images, effectively aggregating long-term information about the actions. Key video clips and historical contour images are fed to a 3D convolutional network and a 2D convolutional network respectively, which extract clip-level and long-term video-level features. Finally, we fuse these two sub-networks to improve recognition accuracy. We conduct experiments on two mainstream action recognition datasets, UCF-101 and HMDB-51. Comparisons with state-of-the-art methods demonstrate the effectiveness of the proposed network for action recognition.
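The abstract does not give the details of key clip selection or of the historical contour images, so the following is only a rough sketch of how those two steps might look, not the authors' implementation. Everything specific here is an assumption made for illustration: the median-frame background model, the gradient formula, the non-overlapping clips, the per-pixel maximum used to superimpose frames, and all function and parameter names (`average_gradient`, `select_key_clips`, `historical_contour_image`, `clip_len`, `num_clips`).

```python
import numpy as np


def average_gradient(image):
    """Mean gradient magnitude of a 2D grayscale image (one common definition)."""
    gy, gx = np.gradient(image.astype(np.float64))
    return float(np.mean(np.sqrt((gx ** 2 + gy ** 2) / 2.0)))


def select_key_clips(frames, clip_len=16, num_clips=2):
    """Return start indices of the clips with the highest foreground average gradient.

    Assumption: the background is approximated by the per-pixel median frame,
    and clips are non-overlapping windows of clip_len frames.
    """
    frames = np.asarray(frames, dtype=np.float64)
    background = np.median(frames, axis=0)   # crude static-background estimate
    foreground = np.abs(frames - background)  # background subtraction
    scores = np.array([average_gradient(f) for f in foreground])
    starts = list(range(0, len(frames) - clip_len + 1, clip_len))
    clip_scores = [scores[s:s + clip_len].mean() for s in starts]
    best = np.argsort(clip_scores)[::-1][:num_clips]
    return sorted(starts[i] for i in best)


def historical_contour_image(frames, key_indices):
    """Superimpose the foreground of the chosen key frames into one 2D image.

    Assumption: superimposition is a per-pixel maximum over the key frames.
    """
    frames = np.asarray(frames, dtype=np.float64)
    background = np.median(frames, axis=0)
    stack = np.abs(frames[key_indices] - background)
    return stack.max(axis=0)


# Synthetic video: 16 static frames, then 16 frames with a small moving patch.
video = np.zeros((32, 8, 8))
for t in range(16, 32):
    col = (t - 16) % 6
    video[t, 2:4, col:col + 2] = 1.0

key_starts = select_key_clips(video, clip_len=16, num_clips=1)
contour = historical_contour_image(video, [16, 20])
```

In this sketch the selected clips would go to the 3D network and the single superimposed image to the 2D network; how the two sub-networks are fused is not specified in the abstract and is left out here.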