Abstract: Most existing action recognition approaches directly use video-level features to recognize human actions in videos. Although these methods have made remarkable progress, their accuracy is still unsatisfactory: when the test video involves complex backgrounds and activities, existing methods usually suffer a significant drop in accuracy. Human action is inherently a high-level concept. Merely applying a video classification model without a detailed semantic understanding of the video content (e.g., objects, scene context, object motions, and object interactions) is inadequate for the challenges of action recognition. Fine-level semantic understanding of videos produces elementary semantic concepts from the raw video data, such as the semantics of objects and background regions, and can therefore bridge the gap between the raw video data and the high-level concept of human actions. In this work, we leverage dense semantic segmentation masks, which encode rich semantic details, to provide extra information for network training and improve the performance of action recognition. We propose a novel deep architecture, named Dense Semantics-Assisted Convolutional Neural Networks (DSA-CNNs), that exploits the dense semantic information of a video through bottom-up attention in the spatial stream and through branch fusion in the temporal stream. To verify the effectiveness of our approach, we conduct extensive experiments on three publicly available datasets: UCF101, HMDB51, and Kinetics. The experimental results demonstrate that our approach substantially improves on existing methods and achieves very competitive performance. They also show that our approach is superior to other related methods that utilize extra information for action recognition.

A Closer Look at Video Sampling for Sequential Action Recognition

Sequential Segment Networks for Action Recognition

Dynamic Sampling Networks for Efficient Action Recognition in Videos

Revisiting the Spatial and Temporal Modeling for Few-shot Action Recognition

Temporal Distinct Representation Learning for Action Recognition

Task-adaptive Spatial-Temporal Video Sampler for Few-shot Action Recognition

Temporal Segment Networks for Action Recognition in Videos

A Discussion of Data Sampling Strategies for Early Action Prediction

Temporal Segment Networks: Towards Good Practices for Deep Action Recognition

Action-Stage Emphasized Spatiotemporal VLAD for Video Action Recognition

Feature Sampling Strategies for Action Recognition

Discriminative Segment Focus Network for Fine-grained Video Action Recognition

MGSampler: an Explainable Sampling Strategy for Video Action Recognition

Temporal Action Detection with Structured Segment Networks

TSI: Temporal Saliency Integration for Video Action Recognition

Dense Semantics-Assisted Networks for Video Action Recognition

Temporal-Spatial Mapping for Action Recognition

Short-Term Action Learning for Video Action Recognition

Rethinking the Video Sampling and Reasoning Strategies for Temporal Sentence Grounding

TDN: Temporal Difference Networks for Efficient Action Recognition

Fast and Reliable Human Action Recognition in Video Sequences by Sequential Analysis