Abstract: Deep convolutional neural networks (DCNNs) and recurrent neural networks (RNNs) have become an important research area in multimedia understanding and have achieved remarkable action recognition performance. However, videos contain rich motion information of varying dimensions. Existing recurrent pipelines fail to capture long-term motion dynamics in videos with varied motion scales and complex actions performed by multiple actors. Capturing contextual and salient features matters more than mapping video frames to a static video representation. This work proposes a novel pipeline that analyzes and processes video information using a 3D convolutional (C3D) network and a newly introduced deep bidirectional LSTM. Like the popular two-stream ConvNet, we use a two-stream framework with one modification: we replace the optical flow stream with a saliency-aware stream to avoid its computational complexity. First, we generate a saliency-aware video stream by applying a saliency-aware method. Second, a two-stream C3D network takes two types of input, the RGB stream and the saliency-aware video stream, to collect both spatial and semantic temporal features. Next, a deep bidirectional LSTM network learns sequential deep temporal dynamics. Finally, a time-series pooling layer and a softmax layer classify human activity and behavior. The proposed system can learn long-term temporal dependencies and predict complex human actions. Experimental results demonstrate a significant improvement in action recognition accuracy on different benchmark datasets.
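The data flow around the two learned networks can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the C3D and bidirectional LSTM stages are omitted, and every name here (`split_into_clips`, `fuse_streams`, `temporal_mean_pool`, `classify`) is hypothetical. The 16-frame clip length, fusion of the two streams by concatenation, and mean pooling over time are assumptions, since the abstract does not specify them.

```python
import math

def split_into_clips(num_frames, clip_len=16):
    """Return (start, end) frame index pairs for non-overlapping clips.
    The clip length of 16 is an assumption, not taken from the abstract."""
    return [(s, s + clip_len)
            for s in range(0, num_frames - clip_len + 1, clip_len)]

def fuse_streams(rgb_feats, sal_feats):
    """Concatenate per-clip feature vectors from the RGB and saliency streams.
    Concatenation is an assumed fusion rule."""
    if len(rgb_feats) != len(sal_feats):
        raise ValueError("both streams must cover the same clips")
    return [r + s for r, s in zip(rgb_feats, sal_feats)]

def temporal_mean_pool(seq):
    """Average a sequence of equal-length vectors over the time axis
    (one possible form of time-series pooling)."""
    n = len(seq)
    return [sum(col) / n for col in zip(*seq)]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(seq, weights, bias):
    """Pool per-clip vectors over time, apply a linear layer, then softmax.
    In the full pipeline, seq would be the bidirectional LSTM outputs."""
    pooled = temporal_mean_pool(seq)
    logits = [sum(w * x for w, x in zip(row, pooled)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)

# Toy example: a 40-frame video yields two full 16-frame clips.
clips = split_into_clips(40)
# Hand-written stand-ins for per-clip C3D features of each stream.
fused = fuse_streams([[1.0, 0.0], [0.0, 1.0]], [[0.5], [0.5]])
# Hypothetical classifier weights for two action classes.
probs = classify(fused, [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]], [0.0, 0.0])
```

Here `probs` is a probability distribution over the two hypothetical classes. In a real system the hand-written feature vectors would be replaced by learned C3D features passed through the bidirectional LSTM before pooling.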

3D-Cnn-based Fused Feature Maps with LSTM Applied to Action Recognition.

Learning SpatioTemporal and Motion Features in a Unified 2D Network for Action Recognition

Bidirectional LSTM with Saliency-Aware 3D-CNN Features for Human Action Recognition

Long-term 3D Convolutional Fusion Network for Action Recognition

Human Action Recognition From Digital Videos Based on Deep Learning.

Beyond Frame-level CNN: Saliency-Aware 3-D CNN with LSTM for Video Action Recognition.

Bi-direction Hierarchical LSTM with Spatial-Temporal Attention for Action Recognition

A Spatio-temporal Hybrid Network for Action Recognition

Joint Multi-Scale Residual and Motion Feature Learning for Action Recognition.

Human Action Recognition Based on Three-Stream Network with Frame Sequence Features

Action recognition using three dimension convolution and long short term memory

Two Stream LSTM: A Deep Fusion Framework for Human Action Recognition

Human Action Recognition Based on Improved Fusion Attention CNN and RNN

An Enhanced 3dcnn-Convlstm For Spatiotemporal Multimedia Data Analysis

Human action recognition using attention based LSTM network with dilated CNN features

Action Recognition in Videos with Spatio-Temporal Fusion 3D Convolutional Neural Networks

Residual Attention Fusion Network for Video Action Recognition

Action Recognition By Learning Deep Multi-Granular Spatio-Temporal Video Representation

Skeleton-based Action Recognition Using LSTM and CNN

Human Action Recognition Based on Selected Spatio-Temporal Features Via Bidirectional LSTM

A deep multimodal network based on bottleneck layer features fusion for action recognition