Abstract: Human activity recognition in videos with convolutional neural network (CNN) features has received increasing attention in multimedia understanding. Treating a video as a sequence of frames, recent work set a new record on several benchmark datasets by feeding frame-level CNN sequence features to a long short-term memory (LSTM) model for video activity recognition. This recurrent model-based visual recognition pipeline is a natural choice for perceptual problems with time-varying visual input or sequential outputs. However, because this pipeline takes frame-level CNN sequence features as input to the LSTM, it may fail to capture the rich motion information in adjacent frames or across multiple clips. Furthermore, an activity is performed by one or more subjects, so it is important to attend to salient features rather than map an entire frame into a static representation. To tackle these issues, we propose a novel pipeline, saliency-aware three-dimensional (3-D) CNN with LSTM, for video action recognition by integrating LSTM with saliency-aware deep 3-D CNN features on video shots. Specifically, we first apply saliency-aware methods to generate saliency-aware videos. Then, we design an end-to-end pipeline by integrating 3-D CNN with LSTM, followed by a time series pooling layer and a softmax layer to predict the activities. Notably, we set a new record on two benchmark datasets, i.e., UCF101 with 13,320 videos and HMDB-51 with 6,766 videos. Our method outperforms the state-of-the-art end-to-end methods of action recognition by 3.8% and 3.2%, respectively, on these two datasets.
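The final stage of the pipeline described above (time series pooling followed by a softmax layer) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state which pooling is used, so mean pooling over time is assumed here, and the LSTM outputs are assumed to be already projected to per-class scores. The function names, class names, and numeric values are all hypothetical.

```python
import math

def temporal_mean_pool(clip_outputs):
    """Average per-clip score vectors over time.

    Assumption: mean pooling; the abstract only says "time series pooling".
    """
    if not clip_outputs:
        raise ValueError("need at least one clip output")
    num_clips = len(clip_outputs)
    dim = len(clip_outputs[0])
    return [sum(vec[i] for vec in clip_outputs) / num_clips for i in range(dim)]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_activity(clip_outputs, class_names):
    """Pool the per-clip scores over time, then pick the most probable class."""
    pooled = temporal_mean_pool(clip_outputs)
    probs = softmax(pooled)
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs

# Hypothetical per-clip class scores standing in for LSTM outputs
# at three time steps (one row per video shot).
lstm_outputs = [
    [2.0, 0.5, 0.1],
    [1.5, 1.0, 0.2],
    [2.5, 0.0, 0.3],
]
label, probs = predict_activity(lstm_outputs, ["archery", "bowling", "diving"])
print(label, [round(p, 3) for p in probs])
```

Pooling before the softmax gives one prediction per video regardless of how many shots it contains. Other pooling choices (for example, max pooling or taking the last time step) would fit the same interface.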

Continuous Action Recognition Based on Hybrid CNN-LDCRF Model

Continuous Action Segmentation and Recognition Using Hybrid Convolutional Neural Network-Hidden Markov Model Model

Continuous action recognition with weakly labelling videos

Continuous Action Recognition and Segmentation in Untrimmed Videos

Continuous Gesture Segmentation and Recognition Using 3DCNN and Convolutional LSTM

Human Action Recognition From Digital Videos Based on Deep Learning

Continuous Human Action Recognition in Real Time

Joint Multi-Scale Residual and Motion Feature Learning for Action Recognition

Action Recognition Based on a Hybrid Deep Network

Bi-direction Hierarchical LSTM with Spatial-Temporal Attention for Action Recognition

Latent Pose Estimator for Continuous Action Recognition

Action Recognition Using Co-trained Deep Convolutional Neural Networks

ARCH: Adaptive Recurrent-Convolutional Hybrid Networks for Long-Term Action Recognition

Realistic Human Action Recognition: when CNNS Meet LDS

Beyond Frame-level CNN: Saliency-Aware 3-D CNN with LSTM for Video Action Recognition

A Spatio-temporal Hybrid Network for Action Recognition

Action recognition with temporal scale-invariant deep learning framework

Integrating Temporal and Spatial Attention for Video Action Recognition

Action Recognition By Learning Deep Multi-Granular Spatio-Temporal Video Representation

Human behavior segmentation and recognition using Continuous Linear Dynamic System

Action Recognition in Videos with Temporal Segments Fusions