Abstract: Most existing action recognition approaches directly leverage video-level features to recognize human actions in videos. Although these methods have made remarkable progress, their accuracy remains unsatisfactory: when a test video involves complex backgrounds and activities, existing methods usually suffer a significant drop in accuracy. Human action is inherently a high-level concept. Merely applying a video classification model without a detailed semantic understanding of the video content (e.g., objects, scene context, object motions, and object interactions) is inadequate for tackling the challenges of action recognition. Fine-level semantic understanding of videos yields elementary semantic concepts from the raw video data, such as the semantics of objects and background regions, and can be employed to bridge the gap between the raw video data and the high-level concept of human actions. In this work, we leverage dense semantic segmentation masks, which encode rich semantic details, to provide extra information for network training and improve the performance of action recognition. We propose a novel deep architecture, named Dense Semantics-Assisted Convolutional Neural Networks (DSA-CNNs), that utilizes the dense semantic information of a video through bottom-up attention in the spatial stream and through branch fusion in the temporal stream. To verify the effectiveness of our approach, we conduct extensive experiments on three publicly available datasets: UCF101, HMDB51, and Kinetics. The experimental results demonstrate that our approach substantially improves on existing methods and achieves very competitive performance. They also show that our approach is superior to other related methods that utilize extra information for action recognition.

A New Architecture of Neural Network for Fine-Grained Video Analysis Based on Visual Attention

Learning Attention-Enhanced Spatiotemporal Representation for Action Recognition

Action Recognition by an Attention-Aware Temporal Weighted Convolutional Neural Network

An Attentional Spatial Temporal Graph Convolutional Network with Co-Occurrence Feature Learning for Action Recognition

Spatio-Temporal Attention Networks for Action Recognition and Detection

TEINet: Towards an Efficient Architecture for Video Recognition

CANet: Comprehensive Attention Network for video-based action recognition

Video Action Recognition Via Neural Architecture Searching

Attention-based Temporal Weighted Convolutional Neural Network for Action Recognition

Dense Semantics-Assisted Networks For Video Action Recognition

Recurrent Attention Network Using Spatial-Temporal Relations for Action Recognition

Action recognition using attention-based spatio-temporal VLAD networks and adaptive video sequences optimization

Unified Spatio-Temporal Attention Networks for Action Recognition in Videos

Select and Focus: Action Recognition with Spatial-Temporal Attention

Global Context-Aware Attention LSTM Networks for 3D Action Recognition

An efficient attention module for 3D convolutional neural networks in action recognition

Content-Aware Attention Network For Action Recognition

A hybrid attention-guided ConvNeXt-GRU network for action recognition

STA-TSN: Spatial-Temporal Attention Temporal Segment Network for Action Recognition in Video

Spatio-Temporal Self-Attention Weighted VLAD Neural Network for Action Recognition

Spatial-Temporal Hypergraph Neural Network based on Attention Mechanism for Multi-view Data Action Recognition