Abstract: Recognizing actions in videos is not a trivial task because video is an information-intensive medium that includes multiple modalities. Moreover, within each modality, an action may appear only in some spatial regions, and only some of the temporal video segments may contain the action. A valid question is how to locate the attended spatial areas and select the video segments relevant to action recognition. In this paper, we devise a general attention neural cell, called AttCell, that estimates the attention probability not only at each spatial location but also for each video segment in a temporal sequence. With AttCell, we propose unified Spatio-Temporal Attention Networks (STAN) in the context of multiple modalities. Specifically, STAN extracts the feature map of one convolutional layer as the local descriptors on each modality and pools the extracted descriptors, weighted by the spatial attention measured by AttCell, into a representation of each segment. Then, we concatenate the representations from each modality to seek a consensus on the temporal attention, a priori, and use it to holistically fuse the combined representations of the video segments into the video representation for recognition. Our model differs from conventional attention-based deep networks in that our temporal attention provides principled, global guidance across different modalities and video segments. Extensive experiments are conducted on four public datasets (UCF101, CCV, THUMOS14, and Sports-1M), on which STAN consistently achieves superior results over several state-of-the-art techniques. More remarkably, we validate and demonstrate the effectiveness of our proposal when capitalizing on different numbers of modalities.
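
The pipeline described in the abstract (spatial attention pooling per segment and modality, concatenation across modalities, then temporal attention over segments) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify the internals of AttCell, so the sketch assumes a single linear scoring vector followed by a softmax, and all function names, parameter names, and shapes below are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def attention_pool(descriptors, w):
    """Pool N descriptors of size D into one vector using attention weights.

    ASSUMPTION: the attention score is a plain dot product with a scoring
    vector w of shape (D,); the real AttCell may be a richer network.
    Returns the pooled vector (D,) and the attention probabilities (N,).
    """
    scores = descriptors @ w          # one scalar score per descriptor
    probs = softmax(scores)           # attention probabilities, sum to 1
    return probs @ descriptors, probs


def stan_like_video_representation(features, spatial_w, temporal_w):
    """Hypothetical two-stage attention over modalities and segments.

    features:   dict modality -> array (T, N, D_m), i.e. T segments, each
                with N spatial locations and D_m-dimensional descriptors.
    spatial_w:  dict modality -> scoring vector (D_m,).
    temporal_w: scoring vector of shape (sum of all D_m,).
    """
    per_modality = []
    for name in sorted(features):     # fixed order so concatenation is stable
        segments = features[name]
        pooled = np.stack(
            [attention_pool(seg, spatial_w[name])[0] for seg in segments]
        )                             # (T, D_m): one vector per segment
        per_modality.append(pooled)
    combined = np.concatenate(per_modality, axis=1)   # (T, sum of D_m)
    # A single temporal attention is shared across all modalities.
    video, temporal_probs = attention_pool(combined, temporal_w)
    return video, temporal_probs
```

In this sketch, the temporal attention is computed once on the concatenated per-segment representation, which mirrors the abstract's idea of seeking a consensus across modalities before fusing segments; a classifier on top of the returned video vector is omitted.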

Attention Transfer (ANT) Network for View-invariant Action Recognition

View-invariant action recognition via Unsupervised AttentioN Transfer (UANT)

View-invariant Human Action Recognition Via Robust Locally Adaptive Multi-View Learning

View-invariant action recognition: a survey

View-Invariant Human Action Recognition Via View Transformation Network (VTN)

Shifting Perspective to See Difference: A Novel Multi-View Method for Skeleton Based Action Recognition

Spatio-Temporal Attention Deep Network for Skeleton Based View-Invariant Human Action Recognition

Spatial-Temporal Alignment Network for Action Recognition

A Novel View Attention Network for Skeleton Based Human Action Recognition

B2C-AFM: Bi-Directional Co-Temporal and Cross-Spatial Attention Fusion Model for Human Action Recognition

Joint Network based Attention for Action Recognition

Spatio-Temporal Adaptive Network with Bidirectional Temporal Difference for Action Recognition

Temporal Attentive Network for Action Recognition

Spatial-temporal hypergraph based on dual-stage attention network for multi-view data lightweight action recognition

Spatial-Temporal Alignment Network for Action Recognition and Detection

View-Robust Neural Networks for Unseen Human Action Recognition in Videos

Spatial-Temporal Hypergraph Neural Network based on Attention Mechanism for Multi-view Data Action Recognition

View Adaptive Neural Networks for High Performance Skeleton-based Human Action Recognition

Decoupled Spatial-Temporal Attention Network for Skeleton-Based Action Recognition

Unified Spatio-Temporal Attention Networks for Action Recognition in Videos

Relation-mining Self-Attention Network for Skeleton-Based Human Action Recognition