Abstract: Numerous studies have highlighted the crucial role of motion information in accurate action recognition in videos. However, current methods rely heavily on temporal differences of features extracted by convolutional neural networks (CNNs) to represent motion, which has two potential limitations: (1) the difference operation yields an incomplete representation of the moving target's contour, and (2) all extracted motion features are treated equally, regardless of their relevance to the classification task, which may degrade performance. To address these limitations, we propose a novel approach called the Motion Accumulation and Selection Network (MAS-Net). Although our approach also considers spatial attributes, it draws inspiration from the cumulative and selective nature of human visual attention, with a primary focus on capturing the temporal attributes of actions for recognition. Further, a motion selection module is used to prioritize relevant temporal features while filtering out irrelevant ones. There is currently a growing demand for action recognition that depends on strong temporal information, as opposed to conventional scene-related datasets such as UCF-101 and HMDB-51. Therefore, we evaluated MAS-Net on benchmark video datasets that primarily emphasize temporal information, including Something-Something V1 & V2, Diving48, and Kinetics-400. Our experimental results demonstrate that MAS-Net achieves state-of-the-art performance on the Something-Something V1 & V2 and Diving48 datasets. Furthermore, when compared to other 2D CNN-based models, MAS-Net exhibits competitive results on the Kinetics-400 dataset while maintaining computational efficiency. These findings highlight the effectiveness and efficiency of MAS-Net for temporal modeling in video analysis tasks.
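
The ideas named in the abstract (temporal differences of features, accumulation over time, and selective weighting of motion features) can be illustrated with a minimal NumPy sketch. This is not the MAS-Net implementation: the function names, the running sum of absolute differences, and the fixed sigmoid channel gates are all assumptions chosen for illustration, and in a real model the gate weights would be learned rather than hard-coded.

```python
import numpy as np

def temporal_difference(features):
    """Frame-to-frame differences of a (T, C, H, W) feature tensor.

    Returns an array of shape (T-1, C, H, W).
    """
    return features[1:] - features[:-1]

def accumulate_motion(diffs):
    """Hypothetical accumulation: running sum of absolute differences over time."""
    return np.cumsum(np.abs(diffs), axis=0)

def select_channels(motion, logits):
    """Hypothetical selection: scale each channel by a sigmoid gate in (0, 1).

    logits has shape (C,); a trained model would learn these values.
    """
    gates = 1.0 / (1.0 + np.exp(-logits))
    return motion * gates[None, :, None, None]

# Stand-in for CNN features of 8 frames, 4 channels, 7x7 spatial size.
rng = np.random.default_rng(0)
features = rng.standard_normal((8, 4, 7, 7))

diffs = temporal_difference(features)        # shape (7, 4, 7, 7)
accumulated = accumulate_motion(diffs)       # non-negative, grows over time
# Illustrative logits: channel 0 is mostly kept, channel 2 mostly suppressed.
selected = select_channels(accumulated, np.array([4.0, 0.0, -4.0, 2.0]))
```

A plain difference only captures what changed between two adjacent frames, which is why the sketch sums differences over time; the gates show one simple way to avoid weighting every motion channel equally.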

Human Action Recognition Method Based on Motion Excitation and Temporal Aggregation Module.

META: Motion Excitation with Temporal Attention for Compressed Video Action Recognition

TEA: Temporal Excitation and Aggregation for Action Recognition.

Temporal Interaction and Excitation for Action Recognition

A Temporal Order Modeling Approach to Human Action Recognition from Multimodal Sensor Data.

Human Action Recognition Based on Three-Stream Network with Frame Sequence Features

Human Action Recognition Based on Hierarchical Multi-Scale Adaptive Conv-Long Short-Term Memory Network

Human Action Recognition Using Deep Learning Methods.

Extracting Hierarchical Spatial and Temporal Features for Human Action Recognition

Multimodal human action recognition based on spatio-temporal action representation recognition model

MAFormer: A cross-channel spatio-temporal feature aggregation method for human action recognition

Human action recognition using motion energy template

Human Action Recognition Using Multi-Velocity STIPs and Motion Energy Orientation Histogram.

An Efficient Motion Visual Learning Method for Video Action Recognition

A Human Action Recognition Model Inspired By Multiple Scale Temporal Segments Model Fusion

Human Activity Recognition based on Dynamic Spatio-Temporal Relations

Temporal Information Oriented Motion Accumulation and Selection Network for RGB-based Action Recognition.

Multiple temporal scale aggregation graph convolutional network for skeleton-based action recognition

Mixed Resolution Network with Hierarchical Motion Modeling for Efficient Action Recognition

Spatio-Temporal Human Action Recognition Model with Flexible-interval Sampling and Normalization

Multi-level Channel Attention Excitation Network for Human Action Recognition in Videos