Abstract: Video analysis is an important branch of computer vision because of its wide range of applications, from video surveillance and video indexing and retrieval to human-computer interaction. All of these applications depend on a good video representation, which encodes video content into a fixed-length feature vector. Most existing methods treat video as a flat image sequence, but we argue that video is an information-intensive medium with an intrinsic hierarchical structure, which previous approaches have largely ignored. In this work, we therefore represent the hierarchical structure of video at multiple granularities: from short to long, a single frame, consecutive frames (motion), a short clip, and the entire video. We further propose a novel deep learning framework that models each granularity individually. Specifically, we model the frame and motion granularities with 2D convolutional neural networks and the clip and video granularities with 3D convolutional neural networks. Long Short-Term Memory networks are applied to the frame, motion, and clip granularities to further exploit long-term temporal cues. The whole framework thus uses multi-stream CNNs to learn a hierarchical representation that captures both the spatial and the temporal information of video. To validate its effectiveness in video analysis, we apply this video representation to the action recognition task. We adopt a distribution-based fusion strategy to combine the decision scores from all granularities, which are obtained from a softmax layer on top of each stream. We conduct extensive experiments on three action benchmarks (UCF101, HMDB51, and CCV) and achieve competitive performance compared with several state-of-the-art methods.
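
To make the score-level fusion step concrete, the sketch below combines per-stream softmax scores into a single prediction. The abstract does not spell out the distribution-based fusion strategy, so this uses a plain weighted average of the per-stream class distributions as a stand-in. The function names, the stream names used as dictionary keys, the weights, and the example logits are all hypothetical and are not taken from the paper.

```python
import math
from typing import Dict, List, Optional


def softmax(logits: List[float]) -> List[float]:
    """Turn raw scores into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scores(
    stream_logits: Dict[str, List[float]],
    weights: Optional[Dict[str, float]] = None,
) -> List[float]:
    """Weighted average of per-stream softmax distributions.

    ASSUMPTION: this is generic late fusion, not the paper's
    distribution-based strategy, whose details the abstract omits.
    """
    names = list(stream_logits)
    if weights is None:
        # Default: every stream contributes equally.
        weights = {name: 1.0 for name in names}
    weight_sum = sum(weights[name] for name in names)
    num_classes = len(stream_logits[names[0]])
    fused = [0.0] * num_classes
    for name in names:
        probs = softmax(stream_logits[name])
        for c, p in enumerate(probs):
            fused[c] += (weights[name] / weight_sum) * p
    return fused


# Hypothetical logits for a 3-class problem, one list per granularity stream.
logits = {
    "frame": [2.0, 0.5, 0.1],
    "motion": [1.0, 1.2, 0.3],
    "clip": [1.5, 0.2, 0.4],
    "video": [0.8, 0.9, 0.1],
}
fused = fuse_scores(logits)
predicted = max(range(len(fused)), key=fused.__getitem__)
print(predicted, [round(p, 3) for p in fused])
```

Because each stream's scores are normalized by softmax before averaging, the fused vector is itself a probability distribution, so streams with differently scaled logits contribute on a comparable footing.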

An attention-based spatial-temporal hierarchical ConvLSTM network for action recognition in videos

Learning SpatioTemporal and Motion Features in a Unified 2D Network for Action Recognition

Bi-direction Hierarchical LSTM with Spatial-Temporal Attention for Action Recognition

An Attention Mechanism Based Convolutional LSTM Network for Video Action Recognition.

An Attentional Spatial Temporal Graph Convolutional Network with Co-Occurrence Feature Learning for Action Recognition

Spatial Mask ConvLSTM Network and Intra-Class Joint Training Method for Human Action Recognition in Video.

Convolutional Networks with Channel and STIPs Attention Model for Action Recognition in Videos

Action recognition with hierarchical convolutional neural networks features and bi-directional long short-term memory model

An End to End Framework with Adaptive Spatio-Temporal Attention Module for Human Action Recognition.

Hierarchical Attention Network for Action Recognition in Videos

Spatio-Temporal Attention Networks for Action Recognition and Detection

STCA: an action recognition network with spatio-temporal convolution and attention

Spatial-Temporal Neural Networks For Action Recognition

Graph-Temporal LSTM Networks for Skeleton-Based Action Recognition

Action Recognition by an Attention-Aware Temporal Weighted Convolutional Neural Network.

Recurrent Attention Network Using Spatial-Temporal Relations for Action Recognition

Unified Spatio-Temporal Attention Networks for Action Recognition in Videos.

Learning Hierarchical Video Representation for Action Recognition

Human Action Recognition Based on Hierarchical Multi-Scale Adaptive Conv-Long Short-Term Memory Network

SSTA-Net: Self-supervised Spatio-Temporal Attention Network for Action Recognition.

Action Recognition Based on Two-Stream Convolutional Networks with Long-Short-Term Spatiotemporal Features