Action Recognition in Videos with Temporal Segments Fusions
Yuanye Fang, Rui Zhang, Qiu-Feng Wang, Kaizhu Huang
DOI: https://doi.org/10.1007/978-3-030-39431-8_23
2020-01-01
Abstract: Deep Convolutional Neural Networks (CNNs) have achieved great success in object recognition. However, they struggle to capture long-range temporal information, which plays an important role in action recognition in videos. To overcome this issue, two-stream architectures built on spatial and temporal segment-based CNNs have recently been widely used. However, the relationship among the segments has not been sufficiently investigated. In this paper, we propose to combine multiple segments through a fully connected layer in a deep CNN model that covers the whole action video. Moreover, four streams (i.e., RGB, RGB differences, optical flow, and warped optical flow) are integrated by a linear combination whose weights are optimized on the validation data. We evaluate the recognition accuracy of the proposed method on two benchmark datasets, UCF101 and HMDB51. Extensive experiments show encouraging results: compared with the baseline, accuracy improves from 94.20% to 97.30% on UCF101 and from 69.40% to 77.99% on HMDB51. Furthermore, the proposed method achieves accuracy competitive with the state-of-the-art method based on 3D convolution, but with far fewer parameters.
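The two fusion steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy scores, and the use of an exhaustive grid search to pick the stream weights are all assumptions made for illustration, since the abstract only states that segments are combined by a fully connected layer and that stream weights are optimized on validation data.

```python
from itertools import product


def fuse_segments(segment_features, weights, bias):
    """Concatenate per-segment feature vectors and apply one fully connected layer.

    segment_features: list of K feature vectors, one per temporal segment.
    weights: one row per class, each of length K * feature_dim (hypothetical values).
    bias: one value per class.
    """
    concatenated = [x for feat in segment_features for x in feat]
    return [sum(w * x for w, x in zip(row, concatenated)) + b
            for row, b in zip(weights, bias)]


def fuse_streams(stream_scores, stream_weights):
    """Linear combination of the per-class scores produced by each stream."""
    num_classes = len(stream_scores[0])
    return [sum(w * s[c] for w, s in zip(stream_weights, stream_scores))
            for c in range(num_classes)]


def accuracy(samples, labels, stream_weights):
    """Fraction of samples whose fused arg-max class equals the label."""
    correct = 0
    for stream_scores, label in zip(samples, labels):
        fused = fuse_streams(stream_scores, stream_weights)
        predicted = max(range(len(fused)), key=fused.__getitem__)
        correct += predicted == label
    return correct / len(labels)


def search_stream_weights(samples, labels, grid=(0.0, 0.5, 1.0, 1.5, 2.0)):
    """Grid search over stream weights on validation data (assumed optimizer)."""
    num_streams = len(samples[0])
    best_weights, best_acc = None, -1.0
    for candidate in product(grid, repeat=num_streams):
        if not any(candidate):
            continue  # skip the all-zero combination
        acc = accuracy(samples, labels, candidate)
        if acc > best_acc:
            best_weights, best_acc = candidate, acc
    return best_weights, best_acc


# Toy validation set: 3 videos, 4 streams (RGB, RGB differences,
# optical flow, warped optical flow), 2 classes. Scores are made up.
val_scores = [
    [[0.60, 0.40], [0.5, 0.5], [0.2, 0.8], [0.3, 0.7]],
    [[0.70, 0.30], [0.6, 0.4], [0.6, 0.4], [0.5, 0.5]],
    [[0.55, 0.45], [0.5, 0.5], [0.1, 0.9], [0.2, 0.8]],
]
val_labels = [1, 0, 1]

weights, acc = search_stream_weights(val_scores, val_labels)
print("stream weights:", weights, "validation accuracy:", acc)
```

In practice the per-segment features and per-stream scores would come from trained CNNs, and the fully connected layer would be learned jointly with the network rather than supplied by hand; the sketch only shows the shape of the two combinations.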