Abstract: Human actions in video sequences are three-dimensional (3D) spatio-temporal signals that characterize both the visual appearance and the motion dynamics of the humans and objects involved. Inspired by the success of convolutional neural networks (CNNs) for image classification, recent work has attempted to learn 3D CNNs for recognizing human actions in videos. However, partly because training 3D convolution kernels is highly complex and requires large quantities of training videos, only limited success has been reported. This motivates us to investigate a new deep architecture that can handle 3D signals more effectively. Specifically, we propose factorized spatio-temporal convolutional networks (FstCN), which factorize the learning of the original 3D convolution kernels into a sequential process: learning 2D spatial kernels in the lower layers (called spatial convolutional layers), followed by learning 1D temporal kernels in the upper layers (called temporal convolutional layers). We introduce a novel transformation and permutation operator to make this factorization possible. Moreover, to address the issue of sequence alignment, we propose an effective training and inference strategy based on sampling multiple video clips from a given action video sequence. We have tested FstCN on two commonly used benchmark datasets (UCF-101 and HMDB-51). Without using auxiliary training videos to boost performance, FstCN outperforms existing CNN-based methods and achieves performance comparable to that of a recent method that benefits from auxiliary training videos.
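The core idea of replacing a 3D kernel with a 2D spatial kernel followed by a 1D temporal kernel can be illustrated with a small sketch. The code below is not the FstCN architecture: it omits the transformation and permutation operator, nonlinearities, and learning, and all function names are hypothetical. It only shows, in pure Python, that a 3D kernel formed as the outer product of a 1D temporal kernel and a 2D spatial kernel gives the same response as applying the two kernels in sequence, while needing fewer parameters.

```python
import random

def conv2d_valid(frame, k2d):
    """Valid 2D cross-correlation of one H x W frame with a kh x kw kernel."""
    H, W = len(frame), len(frame[0])
    kh, kw = len(k2d), len(k2d[0])
    return [[sum(frame[y + i][x + j] * k2d[i][j]
                 for i in range(kh) for j in range(kw))
             for x in range(W - kw + 1)]
            for y in range(H - kh + 1)]

def conv1d_time_valid(clip, k1d):
    """Valid 1D cross-correlation along the time axis of a T x H x W clip."""
    T, H, W = len(clip), len(clip[0]), len(clip[0][0])
    kt = len(k1d)
    return [[[sum(clip[t + s][y][x] * k1d[s] for s in range(kt))
              for x in range(W)]
             for y in range(H)]
            for t in range(T - kt + 1)]

def conv3d_valid(clip, k3d):
    """Valid 3D cross-correlation of a T x H x W clip with a kt x kh x kw kernel."""
    T, H, W = len(clip), len(clip[0]), len(clip[0][0])
    kt, kh, kw = len(k3d), len(k3d[0]), len(k3d[0][0])
    return [[[sum(clip[t + s][y + i][x + j] * k3d[s][i][j]
                  for s in range(kt) for i in range(kh) for j in range(kw))
              for x in range(W - kw + 1)]
             for y in range(H - kh + 1)]
            for t in range(T - kt + 1)]

def factorized_conv(clip, k2d, k1d):
    """Spatial 2D filtering of every frame, then temporal 1D filtering."""
    spatial = [conv2d_valid(frame, k2d) for frame in clip]
    return conv1d_time_valid(spatial, k1d)

def outer_kernel(k1d, k2d):
    """Rank-1 3D kernel: k3d[s][i][j] = k1d[s] * k2d[i][j]."""
    return [[[a * v for v in row] for row in k2d] for a in k1d]

def max_abs_diff(a, b):
    """Largest absolute element-wise difference between two 3D nested lists."""
    return max(abs(p - q)
               for fa, fb in zip(a, b)
               for ra, rb in zip(fa, fb)
               for p, q in zip(ra, rb))

# Toy sizes chosen for illustration only (not taken from the paper).
random.seed(0)
T, H, W = 5, 6, 6
clip = [[[random.random() for _ in range(W)] for _ in range(H)] for _ in range(T)]
k2d = [[random.random() for _ in range(3)] for _ in range(3)]
k1d = [random.random() for _ in range(3)]

full = conv3d_valid(clip, outer_kernel(k1d, k2d))
fact = factorized_conv(clip, k2d, k1d)
diff = max_abs_diff(full, fact)

params_3d = 3 * 3 * 3        # one full 3 x 3 x 3 kernel
params_factorized = 3 * 3 + 3  # one 3 x 3 spatial kernel plus one 3-tap temporal kernel
```

In this toy setting the two outputs agree up to floating-point rounding, and the factorized form uses 12 parameters instead of 27. The equivalence holds only for kernels that are separable in this rank-1 sense; a general 3D kernel is not, which is one reason a factorized network is a different model from a 3D CNN and not merely a cheaper implementation of one.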

3D Convolutional Network Based Foreground Feature Fusion

Two-Stream 3-D Convnet Fusion for Action Recognition in Videos with Arbitrary Size and Length

DC3D: A Video Action Recognition Network Based on Dense Connection

Convolutional Gated Recurrent Units Fusion For Video Action Recognition

Dynamic Spatio-Temporal Feature Learning via Graph Convolution in 3D Convolutional Networks

Human Action Recognition using Factorized Spatio-Temporal Convolutional Networks

Research on Diverse Feature Fusion Network Based on Video Action Recognition

3D Convolutional Neural Network for Action Recognition

Action Recognition in Videos with Temporal Segments Fusions

Human Action Recognition Based on Three-Stream Network with Frame Sequence Features

Multi-modality Fusion Network for Action Recognition

D3D: Dual 3-D Convolutional Network for Real-Time Action Recognition

3-Stream Convolutional Networks for Video Action Recognition with Hybrid Motion Field

Enhanced Action Recognition With Visual Attribute-Augmented 3D Convolutional Neural Network

Cross-Fiber Spatial-Temporal Co-enhanced Networks for Video Action Recognition

Joint Feature Optimization and Fusion for Compressed Action Recognition

Visual Attribute-augmented Three-dimensional Convolutional Neural Network for Enhanced Human Action Recognition

3D-TDC: A 3D temporal dilation convolution framework for video action recognition

MiCT: Mixed 3D/2D Convolutional Tube for Human Action Recognition

Multi-Directional Convolution Networks with Spatial-Temporal Feature Pyramid Module for Action Recognition

A Real-Time Action Representation With Temporal Encoding and Deep Compression