Abstract: Learning spatiotemporal features with deep neural networks has long been a difficult task in computer vision. In this paper, we present a novel deep architecture, termed the Bifurcated Convolutional Neural Network (BifurcatedNet), to learn discriminative video representations in an end-to-end manner. BifurcatedNet is built by stacking bifurcated blocks that aim to simultaneously capture the static appearance information and the temporal dynamics of the input data. Specifically, each bifurcated block is composed of two separate branches: an appearance branch and a dynamic branch. The appearance branch employs 2D convolution to obtain the spatial responses of image pixels or filters in each input frame, while the dynamic branch uses spatiotemporal convolution to exploit the temporal dynamics between pixels and filter responses across multiple frames. Experiments are conducted on two popular action recognition benchmarks, UCF101 and HMDB51. With only RGB input, BifurcatedNet outperforms existing state-of-the-art models under the same experimental setting. BifurcatedNet is also implemented in a two-stream fashion using both RGB and optical flow input, and it still achieves state-of-the-art performance, demonstrating the effectiveness of the network design. Furthermore, to evaluate its generalization ability, we conduct experiments on the Chalearn LAP IsoGD dataset and find that our model also works well on gesture recognition tasks.
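To make the two-branch idea concrete, the following is a minimal NumPy sketch of a single bifurcated block on a single-channel clip. It is an illustration under stated assumptions, not the authors' implementation: the function names, the odd kernel sizes, the zero padding, and in particular the fusion step (element-wise sum followed by ReLU) are hypothetical choices, since the abstract does not say how the two branches are combined.

```python
import numpy as np

def conv2d_per_frame(video, kernel):
    """Appearance branch: slide one 2D kernel over each frame independently.
    video: (T, H, W); kernel: (kh, kw) with odd sizes; zero padding keeps H and W."""
    T, H, W = video.shape
    kh, kw = kernel.shape
    padded = np.pad(video, ((0, 0), (kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros((T, H, W), dtype=float)
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * padded[:, i:i + H, j:j + W]
    return out

def conv3d(video, kernel):
    """Dynamic branch: a 3D kernel (kt, kh, kw) that also mixes neighboring frames.
    Odd kernel sizes and zero padding keep the output at (T, H, W)."""
    T, H, W = video.shape
    kt, kh, kw = kernel.shape
    padded = np.pad(video, ((kt // 2, kt // 2), (kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros((T, H, W), dtype=float)
    for t in range(kt):
        for i in range(kh):
            for j in range(kw):
                out += kernel[t, i, j] * padded[t:t + T, i:i + H, j:j + W]
    return out

def bifurcated_block(video, k2d, k3d):
    """Run both branches on the same input and fuse them.
    ASSUMPTION: fusion by element-wise sum + ReLU; the abstract does not specify it."""
    appearance = conv2d_per_frame(video, k2d)  # per-frame spatial responses
    dynamic = conv3d(video, k3d)               # responses across multiple frames
    return np.maximum(appearance + dynamic, 0.0)
```

For example, calling `bifurcated_block(clip, np.ones((3, 3)) / 9.0, k3d)` on a `(T, H, W)` array returns an array of the same shape, so blocks of this kind can be stacked. With a purely temporal difference kernel such as `np.array([-1.0, 0.0, 1.0]).reshape(3, 1, 1)`, the dynamic branch responds only where consecutive frames differ, which is the kind of cue a per-frame 2D convolution cannot see.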

BDNet: a Method Based on Forward and Backward Convolutional Networks for Action Recognition in Videos

Learning SpatioTemporal and Motion Features in a Unified 2D Network for Action Recognition

Appearance-and-Dynamic Learning With Bifurcated Convolution Neural Network for Action Recognition

Bi-direction Hierarchical LSTM with Spatial-Temporal Attention for Action Recognition

A Novel 3D Convolutional Neural Network for Action Recognition in Infrared Videos

3D Convolutional Two-Stream Network for Action Recognition in Videos

ACTION-Net: Multipath Excitation for Action Recognition

A Jeap-BiLSTM Neural Network for Action Recognition

DC3D: A Video Action Recognition Network Based on Dense Connection

Fully Convolutional Networks for Action Recognition

Action recognition method based on a novel keyframe extraction method and enhanced 3D convolutional neural network

An Attention Mechanism Based Convolutional LSTM Network for Video Action Recognition

Action Recognition by an Attention-Aware Temporal Weighted Convolutional Neural Network

Spatio-Temporal Adaptive Network with Bidirectional Temporal Difference for Action Recognition

Integrating Temporal and Spatial Attention for Video Action Recognition

Action Recognition Using Action Sequences Optimization and Two-Stream 3D Dilated Neural Network

Action Recognition in Videos with Temporal Segments Fusions

Exploring Hybrid Spatio-Temporal Convolutional Networks for Human Action Recognition

T-C3D: Temporal Convolutional 3D Network for Real-Time Action Recognition

Joint Network based Attention for Action Recognition

Spatial-Temporal Neural Networks For Action Recognition