Abstract: Video analysis is an important branch of computer vision due to its wide applications, ranging from video surveillance, video indexing, and retrieval to human-computer interaction. All of these applications depend on a good video representation, which encodes video content into a fixed-length feature vector. Most existing methods treat video as a flat image sequence, but we argue that video is an information-intensive medium with an intrinsic hierarchical structure, which previous approaches have largely ignored. Therefore, in this work, we represent the hierarchical structure of video with multiple granularities: from short to long, the single frame, consecutive frames (motion), the short clip, and the entire video. Furthermore, we propose a novel deep learning framework to model each granularity individually. Specifically, we model the frame and motion granularities with 2D convolutional neural networks (CNNs) and the clip and video granularities with 3D CNNs. Long Short-Term Memory networks are applied to the frame, motion, and clip granularities to further exploit long-term temporal cues. Consequently, the whole framework uses multi-stream CNNs to learn a hierarchical representation that captures both the spatial and the temporal information of video. To validate its effectiveness in video analysis, we apply this video representation to the action recognition task. We adopt a distribution-based fusion strategy to combine the decision scores from all the granularities, which are obtained by a softmax layer on top of each stream. We conduct extensive experiments on three action benchmarks (UCF101, HMDB51, and CCV) and achieve competitive performance against several state-of-the-art methods.
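
The score-level fusion described in the abstract can be illustrated with a minimal sketch. The abstract does not spell out the distribution-based fusion strategy, so the code below substitutes a plain weighted average of per-stream softmax scores; it is not the authors' method. The function names, the uniform default weights, and the example logits are all hypothetical and chosen only for illustration.

```python
import math

# The four granularities named in the abstract, one stream each.
GRANULARITIES = ("frame", "motion", "clip", "video")


def softmax(logits):
    """Numerically stable softmax over a list of raw class scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scores(stream_logits, weights=None):
    """Weighted average of per-stream softmax scores.

    Assumption: a simple stand-in for the paper's distribution-based
    fusion, whose details are not given in the abstract.
    """
    names = list(stream_logits)
    if weights is None:
        # Hypothetical default: every stream contributes equally.
        weights = {name: 1.0 for name in names}
    norm = sum(weights[name] for name in names)
    num_classes = len(stream_logits[names[0]])
    fused = [0.0] * num_classes
    for name in names:
        probs = softmax(stream_logits[name])
        for c, p in enumerate(probs):
            fused[c] += weights[name] / norm * p
    return fused


# Made-up logits for a 3-class problem, one list per stream.
example_logits = {
    "frame": [2.0, 1.0, 0.1],
    "motion": [0.5, 2.5, 0.3],
    "clip": [1.0, 3.0, 0.2],
    "video": [0.8, 2.2, 0.4],
}

fused = fuse_scores(example_logits)
predicted_class = max(range(len(fused)), key=fused.__getitem__)
```

Because each stream's scores pass through a softmax before averaging, the fused vector is itself a probability distribution, and the predicted action is simply its arg-max. Non-uniform weights could be passed through the hypothetical `weights` argument to favor streams that are more reliable on a given benchmark.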

Learning Motion and Content-Dependent Features with Convolutions for Action Recognition

Learning SpatioTemporal and Motion Features in a Unified 2D Network for Action Recognition

Learning Comprehensive Motion Representation for Action Recognition

An Attentional Spatial Temporal Graph Convolutional Network with Co-Occurrence Feature Learning for Action Recognition

Temporal Distinct Representation Learning for Action Recognition

Memory-Augmented Temporal Dynamic Learning for Action Recognition

Human Action Recognition From Digital Videos Based on Deep Learning

Learning and Distillating the Internal Relationship of Motion Features in Action Recognition

Action Recognition By Learning Deep Multi-Granular Spatio-Temporal Video Representation

Action recognition using three dimension convolution and long short term memory

3-Stream Convolutional Networks for Video Action Recognition with Hybrid Motion Field

Temporal Pyramid Pooling-Based Convolutional Neural Network for Action Recognition

Human Action Recognition Combining Sequential Dynamic Images and Two-Stream Convolutional Network

3D-TDC: A 3D temporal dilation convolution framework for video action recognition

Action Recognition with Trajectory-Pooled Deep-Convolutional Descriptors

Lattice Long Short-Term Memory for Human Action Recognition

Fine-Tuned Temporal Dense Sampling with 1D Convolutional Neural Network for Human Action Recognition

Deep Multi-Kernel Convolutional LSTM Networks and an Attention-Based Mechanism for Videos

Deep Convolutional Neural Networks for Action Recognition Using Depth Map Sequences

Learning Hierarchical Video Representation for Action Recognition

Learning Deep Trajectory Descriptor for Action Recognition in Videos Using Deep Neural Networks