Abstract: Numerous studies have highlighted the crucial role of motion information in accurate action recognition in videos. However, current methods rely heavily on temporal differences of features extracted by convolutional neural networks (CNNs) to represent motion, which has two potential limitations: (1) the difference operation yields an incomplete representation of the moving target's contour, and (2) all extracted motion features are treated equally, regardless of their relevance to the classification task, which may degrade performance. To address these limitations, we propose a novel approach called the Motion Accumulation and Selection Network (MAS-Net). Although our approach also considers spatial attributes, it draws inspiration from the cumulative and selective nature of human visual attention, with a primary focus on capturing the temporal attributes of actions for recognition. Further, a motion selection module is used to prioritize relevant temporal features while filtering out irrelevant ones. There is a growing demand for action recognition on data with strong temporal information, as opposed to conventional scene-related datasets such as UCF-101 and HMDB-51. Therefore, we evaluated MAS-Net on benchmark video datasets that primarily emphasize temporal information, including Something-Something V1 & V2, Diving48, and Kinetics-400. Our experimental results demonstrate that MAS-Net achieves state-of-the-art performance on the Something-Something V1 & V2 and Diving48 datasets. Furthermore, compared to other 2D CNN-based models, MAS-Net achieves competitive results on the Kinetics-400 dataset while maintaining computational efficiency. These findings highlight the effectiveness and efficiency of MAS-Net for temporal modeling in video analysis tasks.

Collaborative Positional-Motion Excitation Module for Efficient Action Recognition

Learning SpatioTemporal and Motion Features in a Unified 2D Network for Action Recognition

ACTION-Net: Multipath Excitation for Action Recognition

Two-Path Motion Excitation for Action Recognition

Temporal Interaction and Excitation for Action Recognition

Learning Comprehensive Motion Representation for Action Recognition

TEINet: Towards an Efficient Architecture for Video Recognition

Joint Multi-Scale Residual and Motion Feature Learning for Action Recognition

An Efficient Attention Module for 3D Convolutional Neural Networks in Action Recognition

Channel Attention Module for Efficient Action Recognition

Efficient Spatio-Temporal Network for Action Recognition

A Two-Pathway Convolutional Neural Network with Temporal Pyramid Network for Action Recognition

Multi-Kernel Excitation Network for Video Action Recognition

Spatiotemporal Multimodal Learning With 3D CNNs for Video Action Recognition

EPAM-Net: An Efficient Pose-driven Attention-guided Multimodal Network for Video Action Recognition

A Multi-scale Interaction Motion Network for Action Recognition Based on Capsule Network

Mixed Resolution Network with Hierarchical Motion Modeling for Efficient Action Recognition

Multi-level Channel Attention Excitation Network for Human Action Recognition in Videos

Empowering Efficient Spatio-Temporal Learning with a 3D CNN for Pose-Based Action Recognition

An Efficient Motion Visual Learning Method for Video Action Recognition

Temporal Information Oriented Motion Accumulation and Selection Network for RGB-based Action Recognition.