Abstract: Two-stream convolutional neural networks have shown remarkable performance in video action recognition. Many recent works focus on fusing appearance and motion information to obtain a robust spatiotemporal representation of action videos. However, most of these networks are built on 2D convolutional architectures and apply spatiotemporal fusion only at the top layers of the network. As a result, they cannot take full advantage of the spatiotemporal relations available at multiple levels of the network, nor can they capture temporal dynamics in low-level details. In this paper, we propose a novel convolutional fusion network based on a two-stream network, called 3D Multi-Level Dense Fusion (MLDF-3D), for deep spatiotemporal relation learning. The proposed network has three main merits: (i) rather than performing fusion only at the last convolutional layer or the softmax layer, MLDF-3D performs spatiotemporal fusion at multiple levels of the network, which lets it fully explore the potential relations among features extracted by the two streams; (ii) we introduce dense connections in MLDF-3D so that the fusion features at multiple levels are directly connected; and (iii) we develop a sequence segment framework for modeling long-range temporal structure. Our experimental results show that MLDF-3D effectively learns the correlation between two-stream features to obtain a robust action representation, and achieves state-of-the-art performance on HMDB51 and UCF101.
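
The three ideas named in the abstract (fusion at multiple levels, dense connections between fused features, and segment-level aggregation) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the use of a 1x1x1 convolution as the fusion operator, the random untrained weights, the assumption that all levels share the same temporal and spatial size, and the averaging consensus over segments are all hypothetical choices made only to show the data flow.

```python
import numpy as np


def pointwise_conv3d(x, weight):
    """1x1x1 convolution: mix the channels of x (C_in, T, H, W) using weight (C_out, C_in)."""
    return np.einsum("oc,cthw->othw", weight, x)


def dense_multilevel_fusion(appearance_feats, motion_feats, out_channels, rng):
    """Fuse two streams at every level (hypothetical sketch, untrained random weights).

    appearance_feats, motion_feats: lists of arrays, one per level, each (C, T, H, W).
    Assumes every level has the same T, H, W so feature maps can be concatenated.
    """
    fused_so_far = []
    for app, mot in zip(appearance_feats, motion_feats):
        # Dense connection: the current level sees both streams plus every earlier fused map.
        stacked = np.concatenate([app, mot] + fused_so_far, axis=0)
        weight = rng.standard_normal((out_channels, stacked.shape[0])) * 0.1
        fused = np.maximum(pointwise_conv3d(stacked, weight), 0.0)  # ReLU
        fused_so_far.append(fused)
    return fused_so_far[-1]


def segment_consensus(segment_scores):
    """Average per-segment class scores into one video-level score (assumed consensus)."""
    return np.mean(np.asarray(segment_scores, dtype=float), axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    levels, channels, T, H, W = 3, 4, 2, 5, 5
    app = [rng.standard_normal((channels, T, H, W)) for _ in range(levels)]
    mot = [rng.standard_normal((channels, T, H, W)) for _ in range(levels)]
    fused = dense_multilevel_fusion(app, mot, out_channels=8, rng=rng)
    print(fused.shape)
```

In this sketch the input to each fusion step grows with depth (8, 16, then 24 channels for the sizes above), which is the direct consequence of connecting every earlier fused map to the current level.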

Convolutional Gated Recurrent Units Fusion For Video Action Recognition

Two-Stream 3-D Convnet Fusion for Action Recognition in Videos with Arbitrary Size and Length

Two-Stream Gated Fusion ConvNets for Action Recognition

Residual Gating Fusion Network for Human Action Recognition

Local Fusion Networks with Chained Residual Pooling for Video Action Recognition

Multi-scale Spatiotemporal Information Fusion Network for Video Action Recognition

Long-term 3D Convolutional Fusion Network for Action Recognition

3D Convolutional Two-Stream Network for Action Recognition in Videos

Action Recognition in Videos with Temporal Segments Fusions

Fully Convolutional Networks for Action Recognition

Deep Spatiotemporal Relation Learning with 3D Multi-Level Dense Fusion for Video Action Recognition

Two Stream LSTM: A Deep Fusion Framework for Human Action Recognition

Pooling the Convolutional Layers in Deep ConvNets for Video Action Recognition

Deep Fusion Module for Video Action Recognition

Two-Stream Convolutional Neural Network for Video Action Recognition

Spatiotemporal Fusion Networks for Video Action Recognition

Multiple Feature Fusion in Convolutional Neural Networks for Action Recognition

Learning Gating ConvNet for Two-Stream based Methods in Action Recognition

Multi-dimension Feature Fusion for Action Recognition

3-Stream Convolutional Networks for Video Action Recognition with Hybrid Motion Field

Multi-level Three-Stream Convolutional Networks for Video-Based Action Recognition