Abstract: Recognizing actions in videos is not a trivial task because video is an information-intensive medium that includes multiple modalities. Moreover, on each modality, an action may appear only in some spatial regions, or only some of the temporal video segments may contain the action. A natural question is how to locate the spatial areas to attend to and select the relevant video segments for action recognition. In this paper, we devise a general attention neural cell, called AttCell, that estimates the attention probability not only at each spatial location but also for each video segment in a temporal sequence. With AttCell, a unified Spatio-Temporal Attention Network (STAN) is proposed in the context of multiple modalities. Specifically, STAN extracts the feature map of one convolutional layer as the local descriptors on each modality and pools the extracted descriptors with the spatial attention measured by AttCell to form a representation of each segment. Then, we concatenate the representations from all modalities to seek a consensus on the temporal attention, a priori, and use it to holistically fuse the combined representations of the video segments into the video representation for recognition. Our model differs from conventional attention-based deep networks in that our temporal attention provides principled and global guidance across different modalities and video segments. Extensive experiments are conducted on four public datasets: UCF101, CCV, THUMOS14, and Sports-1M. Our STAN consistently achieves superior results over several state-of-the-art techniques. More remarkably, we validate and demonstrate the effectiveness of our proposal when capitalizing on different numbers of modalities.
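
The two-stage pooling described in the abstract can be illustrated with a minimal sketch. This is not the authors' AttCell: the abstract does not specify how attention probabilities are computed, so the sketch assumes a simple linear score followed by a softmax. The names `attention_pool`, `video_representation`, `spatial_w`, and `temporal_w` are hypothetical, and the scoring vectors stand in for parameters that would be learned in practice.

```python
import math
import random


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attention_pool(descriptors, w):
    """Attention-weighted average of a list of equal-length vectors.

    Assumption: the attention score of each vector is a linear
    projection onto `w`; the real AttCell may differ.
    """
    weights = softmax([dot(d, w) for d in descriptors])
    dim = len(descriptors[0])
    pooled = [
        sum(a * d[i] for a, d in zip(weights, descriptors))
        for i in range(dim)
    ]
    return pooled, weights


def video_representation(video, spatial_w, temporal_w):
    """video[t][m] holds the local descriptors of segment t, modality m."""
    segment_reps = []
    for segment in video:
        rep = []
        for m, descriptors in enumerate(segment):
            # Spatial attention: pool local descriptors per modality.
            pooled, _ = attention_pool(descriptors, spatial_w[m])
            rep.extend(pooled)  # concatenate across modalities
        segment_reps.append(rep)
    # Temporal attention: one weight per segment, shared by all modalities.
    return attention_pool(segment_reps, temporal_w)


# Toy input: 3 segments, 2 modalities, 4 spatial locations, 5-dim descriptors.
rng = random.Random(0)
video = [
    [[[rng.gauss(0, 1) for _ in range(5)] for _ in range(4)] for _ in range(2)]
    for _ in range(3)
]
spatial_w = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(2)]
temporal_w = [rng.gauss(0, 1) for _ in range(10)]
video_rep, temporal_weights = video_representation(video, spatial_w, temporal_w)
```

Because the temporal weights are computed on the concatenated per-modality representations, each segment receives a single weight that applies to every modality, which is one way to read the "consensus on the temporal attention" described above.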

MuAt-Va: Multi-Attention and Video-Auxiliary Network for Device-Free Action Recognition

Cross-modality Online Distillation for Multi-View Action Recognition

MMTSA: Multimodal Temporal Segment Attention Network for Efficient Human Activity Recognition

B2C-AFM: Bi-Directional Co-Temporal and Cross-Spatial Attention Fusion Model for Human Action Recognition

AMIR

Improving human action recognition by jointly exploiting video and WiFi clues

SSTA-Net: Self-supervised Spatio-Temporal Attention Network for Action Recognition

Wi-ATCN: Attentional Temporal Convolutional Network for Human Action Prediction Using WiFi Channel State Information

A Multimode Two-Stream Network for Egocentric Action Recognition

Action recognition using attention-based spatio-temporal VLAD networks and adaptive video sequences optimization

DeepMV

MaskFi: Unsupervised Learning of WiFi and Vision Representations for Multimodal Human Activity Recognition

Enhanced Attention Tracking with Multi-Branch Network for Egocentric Activity Recognition

Multi-Person Action Recognition in Microwave Sensors

ASM2TV: An Adaptive Semi-Supervised Multi-Task Multi-View Learning Framework for Human Activity Recognition

Unified Spatio-Temporal Attention Networks for Action Recognition in Videos

Multi-Channel Deep Networks on Sequence Data for Multi-Action Recognition

MA-VLAD: a fine-grained local feature aggregation scheme for action recognition

Human-Object Contour for Action Recognition with Attentional Multi-modal Fusion Network

An AIoT Framework With Multi-modal Frequency Fusion for WiFi-Based Coarse and Fine Activity Recognition

AttnSense: Multi-level Attention Mechanism for Multimodal Human Activity Recognition