Abstract:Most existing action recognition approaches directly leverage the video-level features to recognize human actions from videos. Although these methods have made remarkable progress, the accuracy is still unsatisfied. When the test video involves complex backgrounds and activities, existing methods usually suffer from a significant drop in accuracy. Human action is inherently a high-level concept. Merely applying a video classification model without a detailed semantic understanding of the video content, e.g., objects, scene context, object motions, object interactions, is inadequate to tackle the challenges for action recognition. Fine-level semantic understanding of videos generates elementary semantic concepts from the raw video data, such as the semantics of objects and background regions. It can be employed to bridge the gap between the raw video data and the high-level concept of human actions. In this work, we leverage dense semantic segmentation masks, which encode rich semantic details, provide extra information for the network training, and improve the performance of action recognition. We propose a novel deep architecture which is named as Dense Semantics-Assisted Convolutional Neural Networks (DSA-CNNs) to effectively utilize dense semantic information of video by a bottom-up attention way in the spatial stream, while by the way of branch fusion in the temporal stream. To verify the effectiveness of our approach, we conduct extensive experiments on publicly available datasets – UCF101, HMDB51, and Kinetics. The experimental results demonstrate that our approach substantially improves existing methods and achieves very competitive performance. It also shows that our approach is superior to other related methods that utilize extra information for action recognition.

Action recognition method based on a novel keyframe extraction method and enhanced 3D convolutional neural network

Learning SpatioTemporal and Motion Features in a Unified 2D Network for Action Recognition

DC3D: A Video Action Recognition Network Based on Dense Connection

Human Action Recognition Using Deep Learning Methods.

Human Action Recognition Based on Three-Stream Network with Frame Sequence Features

Short-Term Action Recognition by 3D Convolutional Neural Network with Pixel-Wise Evidences

An efficient attention module for 3d convolutional neural networks in action recognition

Action Recognition By Learning Deep Multi-Granular Spatio-Temporal Video Representation

ACTION-Net: Multipath Excitation for Action Recognition

3D-TDC: A 3D temporal dilation convolution framework for video action recognition

3D Convolutional Neural Network for Action Recognition.

Spatiotemporal Multimodal Learning With 3D CNNs for Video Action Recognition

Enhanced Action Recognition With Visual Attribute-Augmented 3D Convolutional Neural Network

D3D: Dual 3-D Convolutional Network for Real-Time Action Recognition

Action recognition using three dimension convolution and long short term memory

Deep Convolutional Neural Networks for Action Recognition Using Depth Map Sequences

CANet: Comprehensive Attention Network for video-based action recognition

Multi-cue based four-stream 3D ResNets for video-based action recognition

An Improved Human Action Recognition Method Based on 3D Convolutional Neural Network

Dense Semantics-Assisted Networks For Video Action Recognition

Action-Stage Emphasized Spatiotemporal VLAD for Video Action Recognition