Abstract: Most existing action recognition approaches directly leverage video-level features to recognize human actions from videos. Although these methods have made remarkable progress, their accuracy is still unsatisfactory: when the test video involves complex backgrounds and activities, existing methods usually suffer a significant drop in accuracy. Human action is inherently a high-level concept. Merely applying a video classification model without a detailed semantic understanding of the video content (e.g., objects, scene context, object motions, and object interactions) is inadequate to tackle the challenges of action recognition. Fine-level semantic understanding of videos generates elementary semantic concepts from the raw video data, such as the semantics of objects and background regions, and can be employed to bridge the gap between the raw video data and the high-level concept of human actions. In this work, we leverage dense semantic segmentation masks, which encode rich semantic details, to provide extra information for network training and improve the performance of action recognition. We propose a novel deep architecture, named Dense Semantics-Assisted Convolutional Neural Networks (DSA-CNNs), that effectively utilizes the dense semantic information of a video through bottom-up attention in the spatial stream and through branch fusion in the temporal stream. To verify the effectiveness of our approach, we conduct extensive experiments on publicly available datasets: UCF101, HMDB51, and Kinetics. The experimental results demonstrate that our approach substantially improves on existing methods and achieves very competitive performance. They also show that our approach is superior to other related methods that utilize extra information for action recognition.
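
The two mechanisms named in the abstract, attention driven by semantic masks in the spatial stream and branch fusion in the temporal stream, can be illustrated with a toy sketch. This is not the DSA-CNN implementation: the function names (`mask_attention`, `fuse_branches`), the softmax-based residual attention, and the weighted-average fusion are all assumptions chosen for illustration, and the abstract does not specify any of these details.

```python
import math


def softmax(values):
    """Numerically stable softmax over a flat list of scores."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def mask_attention(features, mask_scores):
    """Reweight spatial features with attention derived from a semantic mask.

    features:    list of channels, each a flat list of H*W activations.
    mask_scores: flat list of H*W relevance scores, assumed here to come
                 from a dense semantic segmentation mask (hypothetical input).
    The residual form f * (1 + a) is an illustrative choice, not the
    formulation used in the paper.
    """
    attn = softmax(mask_scores)
    return [[f * (1.0 + a) for f, a in zip(channel, attn)] for channel in features]


def fuse_branches(flow_scores, semantic_scores, weight=0.5):
    """Fuse per-class scores from two temporal branches by weighted averaging.

    The fusion rule and the `weight` parameter are assumptions for this sketch.
    """
    return [weight * f + (1.0 - weight) * s
            for f, s in zip(flow_scores, semantic_scores)]


if __name__ == "__main__":
    # One channel over a 2x2 spatial grid, with a mask that favours position 1.
    feats = [[1.0, 2.0, 3.0, 4.0]]
    mask = [0.0, 2.0, 0.0, 0.0]
    print(mask_attention(feats, mask))
    print(fuse_branches([0.7, 0.3], [0.4, 0.6]))
```

In this sketch, positions that the mask scores highly receive a larger multiplicative boost, while the residual term keeps the original activations from being suppressed entirely.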

Toward Accurate Person-level Action Recognition in Videos of Crowed Scenes

Towards Accurate Human Pose Estimation in Videos of Crowded Scenes

Human Action Recognition Using Deep Learning Methods.

Person-level Action Recognition in Complex Events Via TSD-TSM Networks.

Online Robust Action Recognition Based on a Hierarchical Model

Human Action Recognition with Contextual Constraints Using a RGB-D Sensor

Exploring 3d Human Action Recognition: From Offline To Online

Human Action Recognition Based on Three-Stream Network with Frame Sequence Features

Human Action Recognition From Digital Videos Based on Deep Learning.

Human Activity Recognition based on Dynamic Spatio-Temporal Relations

Human Action Recognition Based on Hierarchical Multi-Scale Adaptive Conv-Long Short-Term Memory Network

Action Recognition by Exploring Data Distribution and Feature Correlation

Action Machine: Toward Person-Centric Action Recognition in Videos

Action Recognition By Learning Deep Multi-Granular Spatio-Temporal Video Representation

Dense Semantics-Assisted Networks For Video Action Recognition

A Simple Baseline for Pose Tracking in Videos of Crowed Scenes

Action Machine: Rethinking Action Recognition in Trimmed Videos

Combining Sparse And Dense Descriptors With Temporal Semantic Structures For Robust Human Action Recognition

Action Recognition Framework in Traffic Scene for Autonomous Driving System.

Spatiotemporal Multi-Task Network for Human Activity Understanding.