Abstract: Recognizing actions in videos is not a trivial task because video is an information-intensive medium that includes multiple modalities. Moreover, on each modality, an action may appear only in some spatial regions, or only some of the temporal video segments may contain the action. A valid question is how to locate the attended spatial areas and select the relevant video segments for action recognition. In this paper, we devise a general attention neural cell, called AttCell, that estimates the attention probability not only at each spatial location but also for each video segment in a temporal sequence. With AttCell, we propose unified Spatio-Temporal Attention Networks (STAN) in the context of multiple modalities. Specifically, STAN extracts the feature map of one convolutional layer as the local descriptors on each modality and pools the extracted descriptors with the spatial attention measured by AttCell as a representation of each segment. Then, we concatenate the representations from each modality to seek a consensus on the temporal attention, a priori, and holistically fuse the combined representations of the video segments into the video representation for recognition. Our model differs from conventional attention-based deep networks in that our temporal attention provides principled, global guidance across different modalities and video segments. Extensive experiments are conducted on four public datasets: UCF101, CCV, THUMOS14, and Sports-1M. Our STAN consistently achieves superior results over several state-of-the-art techniques. More remarkably, we validate and demonstrate the effectiveness of our proposal when capitalizing on different numbers of modalities.
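The two-stage pooling described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the internals of AttCell, so the linear scoring followed by a softmax used here is an assumption, as are the names `attention_pool`, `stan_sketch`, the weight vectors, and the modality keys `"rgb"` and `"flow"`.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / np.sum(e)


def attention_pool(items, w):
    """Attention-weighted sum of the rows of `items`.

    items: (n, d) array of descriptors.
    w:     (d,) scoring vector (hypothetical stand-in for AttCell).
    Returns the pooled (d,) vector and the (n,) attention probabilities.
    """
    probs = softmax(items @ w)
    return probs @ items, probs


def stan_sketch(feature_maps, spatial_w, temporal_w):
    """Spatial attention per segment, then temporal attention over segments.

    feature_maps: dict mapping modality name -> (T, H, W, C) array,
                  one conv feature map per video segment.
    spatial_w:    dict mapping modality name -> (C,) scoring vector.
    temporal_w:   (sum of C over modalities,) scoring vector.
    """
    per_modality = []
    for name in sorted(feature_maps):
        fmap = feature_maps[name]
        T, H, W, C = fmap.shape
        segments = []
        for t in range(T):
            # Each spatial location contributes one C-dim local descriptor.
            local = fmap[t].reshape(H * W, C)
            pooled, _ = attention_pool(local, spatial_w[name])
            segments.append(pooled)
        per_modality.append(np.stack(segments))  # (T, C)

    # Concatenate modalities so one temporal attention is shared by all.
    combined = np.concatenate(per_modality, axis=1)  # (T, sum of C)
    return attention_pool(combined, temporal_w)


rng = np.random.default_rng(0)
maps = {m: rng.normal(size=(3, 7, 7, 8)) for m in ("rgb", "flow")}
sw = {m: rng.normal(size=8) for m in maps}
tw = rng.normal(size=16)
video_repr, temporal_probs = stan_sketch(maps, sw, tw)
print(video_repr.shape, temporal_probs.shape)
```

Computing the temporal attention on the concatenated representation, rather than per modality, is what gives every modality the same segment weights in this sketch. In a trained model the scoring vectors would be learned jointly with the classifier.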

Spatio-temporal Deformable 3D ConvNets with Attention for Action Recognition

Learning SpatioTemporal and Motion Features in a Unified 2D Network for Action Recognition

Spatio-Temporal Attention Networks for Action Recognition and Detection

A Channel-Wise Spatial-Temporal Aggregation Network for Action Recognition

An Attentional Spatial Temporal Graph Convolutional Network with Co-Occurrence Feature Learning for Action Recognition

Unified Spatio-Temporal Attention Networks for Action Recognition in Videos.

Spatio-Temporal Adaptive Network with Bidirectional Temporal Difference for Action Recognition

STCA: an action recognition network with spatio-temporal convolution and attention

An efficient attention module for 3d convolutional neural networks in action recognition

StNet: Local and Global Spatial-Temporal Modeling for Action Recognition

SSTA-Net: Self-supervised Spatio-Temporal Attention Network for Action Recognition.

Spatio-Temporal Attention-Based LSTM Networks for 3D Action Recognition and Detection

3D-TDC: A 3D temporal dilation convolution framework for video action recognition

Global Context-Aware Attention LSTM Networks for 3D Action Recognition.

Cross-Modal Learning with 3D Deformable Attention for Action Recognition

Attention-based Temporal Weighted Convolutional Neural Network for Action Recognition

Spatial Temporal Graph Attention Network for Skeleton-Based Action Recognition

Spatiotemporal Interaction Residual Networks with Pseudo3D for Video Action Recognition.

Unified Spatio-Temporal Attention Models for Advanced Human Action Recognition & Detection

ACTION-Net: Multipath Excitation for Action Recognition

D3D: Dual 3-D Convolutional Network for Real-Time Action Recognition