Deep Spatiotemporal Relation Learning with 3D Multi-Level Dense Fusion for Video Action Recognition

Junxuan Zhang, Haifeng Hu
DOI: https://doi.org/10.1109/access.2019.2895472
IF: 3.9
2019-01-01
IEEE Access
Abstract: Two-stream convolutional neural networks have shown remarkable performance in video action recognition. Many recent works focus on fusing appearance and motion information to obtain a robust spatiotemporal representation of action videos. However, most of these networks are built on 2D convolution and apply spatiotemporal fusion only at the top layers, so they can neither take full advantage of the spatiotemporal relations available at multiple levels of the network nor capture temporal dynamics in low-level details. In this paper, we propose a novel convolutional fusion network based on the two-stream architecture, called 3D Multi-Level Dense Fusion (MLDF-3D), for deep spatiotemporal relation learning. The proposed network has three main merits: (i) rather than performing fusion only at the last convolution layer or the softmax layer, MLDF-3D performs spatiotemporal fusion at multiple levels of the network, which lets it fully explore the relations among features extracted by the two streams; (ii) we introduce dense connections in MLDF-3D so that the fusion features at multiple levels are directly connected; and (iii) we develop a sequence segment framework for modeling long-range temporal structure. Our experimental results show that MLDF-3D effectively learns the correlation of two-stream features to obtain a robust action representation, and that it achieves state-of-the-art performance on HMDB51 and UCF101.
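The abstract does not say how the sequence segment framework in (iii) samples or aggregates, so the following is only a minimal sketch of one common way such a scheme can work, not the authors' implementation. It assumes the video is split into equal-length segments, one short snippet of consecutive frames is taken from the middle of each segment, and per-snippet class scores are averaged. The function names, the centered sampling, and the averaging step are all hypothetical choices made for illustration.

```python
from typing import Callable, List, Sequence


def split_into_segments(num_frames: int, num_segments: int) -> List[range]:
    """Split frame indices 0..num_frames-1 into contiguous, near-equal segments."""
    if num_segments <= 0 or num_frames < num_segments:
        raise ValueError("need at least one frame per segment")
    bounds = [i * num_frames // num_segments for i in range(num_segments + 1)]
    return [range(bounds[i], bounds[i + 1]) for i in range(num_segments)]


def sample_snippets(num_frames: int, num_segments: int, snippet_len: int) -> List[List[int]]:
    """Take one snippet of consecutive frames near the center of each segment.

    Assumption: centered sampling; random offsets are an equally plausible choice.
    """
    if snippet_len <= 0 or snippet_len > num_frames:
        raise ValueError("snippet_len must be between 1 and num_frames")
    snippets = []
    for seg in split_into_segments(num_frames, num_segments):
        center = seg.start + len(seg) // 2
        # Clamp the start index so the snippet stays inside the video.
        start = max(0, min(center - snippet_len // 2, num_frames - snippet_len))
        snippets.append(list(range(start, start + snippet_len)))
    return snippets


def video_scores(
    snippets: Sequence[Sequence[int]],
    score_fn: Callable[[Sequence[int]], Sequence[float]],
) -> List[float]:
    """Average per-snippet class scores into one video-level score vector.

    Assumption: simple averaging as the consensus; score_fn stands in for the
    fusion network applied to a single snippet.
    """
    per_snippet = [score_fn(s) for s in snippets]
    return [sum(col) / len(col) for col in zip(*per_snippet)]


if __name__ == "__main__":
    snippets = sample_snippets(num_frames=10, num_segments=3, snippet_len=2)
    print(snippets)
    # A dummy scorer with two "classes", used only to exercise the pipeline.
    print(video_scores(snippets, lambda s: [float(len(s)), float(s[0])]))
```

Because each snippet comes from a different part of the video, the averaged score reflects the whole sequence rather than a single short clip, which is the general motivation for segment-based sampling.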