Abstract: Despite their outstanding performance in image recognition, convolutional neural networks (CNNs) do not yet achieve comparably impressive results on action recognition in videos. This is partly due to the inability of CNNs to model long-range temporal structures, especially those involving the individual action stages that are critical to human action recognition. In this paper, we propose a novel action-stage (ActionS) emphasized spatiotemporal vector of locally aggregated descriptors (ActionS-ST-VLAD) method that aggregates informative deep features across the entire video using adaptive video feature segmentation and adaptive segment feature sampling (AVFS-ASFS). In our ActionS-ST-VLAD encoding approach, AVFS-ASFS selects the keyframe features and automatically splits the corresponding deep features into segments, with the features in each segment belonging to a temporally coherent ActionS. Then, based on the keyframe feature extracted in each segment, a flow-guided warping technique is introduced to detect and discard redundant feature maps, while the informative ones are aggregated using our similarity weight. Furthermore, we exploit an RGBF modality to capture motion-salient regions in the RGB images corresponding to action activity. Extensive experiments are conducted on four public benchmarks: HMDB51, UCF101, Kinetics, and ActivityNet. Results show that our method effectively pools useful deep features spatiotemporally, leading to state-of-the-art performance for video-based action recognition.
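For readers unfamiliar with VLAD, the aggregation step that the abstract builds on can be sketched as follows. This is a minimal NumPy illustration of standard VLAD encoding with an optional per-descriptor weight, not the ActionS-ST-VLAD method itself. The function name `vlad_encode`, the `weights` argument (a hypothetical stand-in for a similarity weight), and the hard nearest-center assignment are all assumptions made for illustration. Segmentation, keyframe selection, and flow-guided warping are not shown.

```python
import numpy as np

def vlad_encode(features, centers, weights=None):
    """Aggregate local descriptors into a single VLAD vector.

    features: (N, D) array of per-frame (or per-location) descriptors.
    centers:  (K, D) array of codebook centers, e.g. from k-means.
    weights:  optional (N,) array of per-descriptor weights; this is a
              hypothetical stand-in for a similarity weight and defaults
              to uniform weighting.
    """
    features = np.asarray(features, dtype=np.float64)
    centers = np.asarray(centers, dtype=np.float64)
    if weights is None:
        weights = np.ones(features.shape[0])
    weights = np.asarray(weights, dtype=np.float64)

    # Hard-assign each descriptor to its nearest center (squared L2 distance).
    dists = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    assign = dists.argmin(axis=1)

    # Accumulate the weighted residuals (descriptor minus center) per center.
    vlad = np.zeros_like(centers)
    for k in range(centers.shape[0]):
        mask = assign == k
        if mask.any():
            residuals = features[mask] - centers[k]
            vlad[k] = (weights[mask, None] * residuals).sum(axis=0)

    # Intra-normalize each center's block, then L2-normalize the full vector.
    norms = np.linalg.norm(vlad, axis=1, keepdims=True)
    vlad = vlad / np.maximum(norms, 1e-12)
    flat = vlad.ravel()
    return flat / max(np.linalg.norm(flat), 1e-12)
```

The output has length K × D regardless of how many descriptors go in, which is what lets a variable-length video be pooled into a fixed-size representation. Down-weighting a descriptor shrinks its contribution to the residual sum, and a weight of zero discards it entirely.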

Realistic Human Action Recognition: when Deep Learning Meets VLAD

Human Action Recognition Using Deep Learning Methods

Human Action Recognition From Digital Videos Based on Deep Learning

Realistic Human Action Recognition by Fast Hog3d and Self-Organization Feature Map

Action recognition using attention-based spatio-temporal VLAD networks and adaptive video sequences optimization

Visualization As Intermediate Representations (VLAIR) for Human Activity Recognition

DB-LSTM: Densely-connected Bi-directional LSTM for Human Action Recognition

View-invariant Human Action Recognition Via Robust Locally Adaptive Multi-View Learning

Online Robust Action Recognition Based on a Hierarchical Model

Action-Stage Emphasized Spatiotemporal VLAD for Video Action Recognition

Human Action Recognition with Contextual Constraints Using a RGB-D Sensor

Good Practices for Learning to Recognize Actions Using FV and VLAD

Deep Learning-Based Human Action Recognition in Videos

Combining Sparse And Dense Descriptors With Temporal Semantic Structures For Robust Human Action Recognition

Action Recognition By Learning Deep Multi-Granular Spatio-Temporal Video Representation

Deep spatiotemporal LSTM network with temporal pattern feature for 3D human action recognition

Human Action Recognition Based on DMMs, HOGs and Contourlet Transform

A Comprehensive Study of Deep Video Action Recognition

A Novel Trajectory-VLAD Based Action Recognition Algorithm for Video Analysis

Real Time Human Action Recognition in a Long Video Sequence

DA-VLAD: Discriminative Action Vector of Locally Aggregated Descriptors for Action Recognition