Abstract: Deep convolutional networks have proven highly effective in a variety of vision-based classification tasks, but large datasets are a prerequisite for their high performance. In many realistic settings, such as human action recognition in videos, a massive quantity of training samples is hard to obtain. The resulting data deficiency, especially of labeled data, makes deeper model structures, otherwise a promising solution, highly prone to overfitting. Additionally, when modeling capacity is constrained by model depth, the high-level visual cues that accompany human action, such as object interaction, scene context, and pose variation, pose both extrinsic and intrinsic challenges for traditional deep convolutional networks. To address these limitations, we propose a dataset remodeling strategy: we transfer the parameters of ResNet-101 layers trained on the ImageNet dataset to initialize the learning model, and we adopt an augmented data variation approach to overcome the overfitting caused by sample deficiency. To improve the model structure, we design a novel, deeper two-stream ConvNet for learning action complexity. Combined with a dis-order strategy for the training/testing video sets, the proposed model and learning strategy achieve a significant improvement in action recognition. Experiments on two challenging datasets, UCF101 and KTH, verify superior performance in comparison with other state-of-the-art methods. (C) 2017 Published by Elsevier B.V.
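The two ingredients of the strategy named in the abstract, parameter transfer and data augmentation, can be illustrated with a minimal sketch. This is not the authors' implementation: the parameter names, the toy shapes, and the choice of random crop plus horizontal flip are assumptions made for illustration, and plain Python lists stand in for real tensors. The only facts relied on are that ImageNet classification uses 1000 classes and UCF101 has 101, so a pretrained classifier head cannot be copied as-is.

```python
import random


def transfer_parameters(pretrained, target):
    """Copy pretrained values into `target` where name and shape both match.

    Both arguments map a parameter name to a (shape, values) pair.
    Returns the names that were copied and the names left at their
    original initialization.
    """
    copied, kept = [], []
    for name, (shape, _) in list(target.items()):
        source = pretrained.get(name)
        if source is not None and source[0] == shape:
            target[name] = (shape, list(source[1]))
            copied.append(name)
        else:
            kept.append(name)
    return copied, kept


def augment(frame, crop, rng):
    """Random square crop plus random horizontal flip of a 2D frame.

    The abstract does not specify the augmentation used; crop and flip
    are assumed here as common examples.
    """
    height, width = len(frame), len(frame[0])
    top = rng.randint(0, height - crop)
    left = rng.randint(0, width - crop)
    patch = [row[left:left + crop] for row in frame[top:top + crop]]
    if rng.random() < 0.5:
        patch = [row[::-1] for row in patch]
    return patch


# Hypothetical toy parameters: a shared convolution layer and a classifier
# head whose size differs (1000 ImageNet classes vs. 101 UCF101 classes).
pretrained = {
    "conv1.weight": ((2, 2), [0.1, 0.2, 0.3, 0.4]),
    "fc.weight": ((1000,), [0.0] * 1000),
}
target = {
    "conv1.weight": ((2, 2), [0.0, 0.0, 0.0, 0.0]),
    "fc.weight": ((101,), [0.0] * 101),
}
copied, kept = transfer_parameters(pretrained, target)

# A 4x4 toy "frame" augmented into a 3x3 patch.
frame = [[r * 4 + c for c in range(4)] for r in range(4)]
patch = augment(frame, crop=3, rng=random.Random(0))
```

In this sketch the convolution weights are taken from the pretrained dictionary, while the mismatched classifier head keeps its fresh initialization and would be trained on the target dataset.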

Video Action Recognition Based on Deeper Convolution Networks with Pair-Wise Frame Motion Concatenation

Going Deeper with Two-Stream ConvNets for Action Recognition in Video Surveillance

Pooling the Convolutional Layers in Deep ConvNets for Video Action Recognition

Temporal Interaction and Excitation for Action Recognition

Long-term 3D Convolutional Fusion Network for Action Recognition

Temporal Distinct Representation Learning for Action Recognition

Learning Motion and Content-Dependent Features with Convolutions for Action Recognition

Frame-skip Convolutional Neural Networks for Action Recognition

3D Convolutional Two-Stream Network for Action Recognition in Videos

Human Action Recognition Combining Sequential Dynamic Images and Two-Stream Convolutional Network

Action Recognition Using Co-trained Deep Convolutional Neural Networks

ResLNet: Deep Residual LSTM Network with Longer Input for Action Recognition

A New Depth Residual Network Combined Recurrent with Residual Structure for Human Action Recognition from Videos

Real-Time Action Recognition with Enhanced Motion Vector CNNs

Action recognition with temporal scale-invariant deep learning framework

Motion Enhanced Model Based on High-Level Spatial Features

Exploring Hybrid Spatio-Temporal Convolutional Networks for Human Action Recognition

Action Recognition with Motion Map 3D Network

Fully Convolutional Networks for Action Recognition

3-Stream Convolutional Networks for Video Action Recognition with Hybrid Motion Field

The Very Deep Multi-stage Two-stream Convolutional Neural Network for Action Recognition