Abstract: Spatial-temporal modeling is crucial for action recognition in videos, a core problem in artificial intelligence. However, robustly extracting motion information remains a primary challenge because appearances deform over time and motion frequencies vary between actions. To address these issues, we propose the Motion Sensitive Network (MSN), which draws on the theory of artificial neural networks and on key concepts of autonomous system control and decision-making. Specifically, we introduce a Spatial-Temporal Pyramid Motion Extraction (STP-ME) module that adjusts convolution kernel sizes and time intervals synchronously to gather motion information at different temporal scales, in line with the learning and prediction characteristics of artificial neural networks. We also introduce a Variable Scale Motion Excitation (DS-ME) module, which uses a differential model to capture motion information, in keeping with the flexibility required of autonomous system control. In particular, it employs a multi-scale deformable convolutional network to alter the motion scale of the target object before computing temporal differences across consecutive frames, which provides theoretical support for the flexibility of autonomous systems. Temporal modeling is a crucial step in understanding environmental changes and actions within autonomous systems, and MSN, by exploiting the strengths of artificial neural networks (ANNs) in this task, provides an effective framework for their future use in such systems. We evaluate the proposed method on three challenging action recognition datasets (Kinetics-400, Something-Something V1, and Something-Something V2). The results show accuracy improvements of 1.1% to 2.2% on the test set. Compared with state-of-the-art (SOTA) methods, the proposed approach achieves a maximum performance of 89.90%. In ablation experiments, the performance gain attributable to the proposed modules ranges from 2% to 5.3%. MSN demonstrates significant potential in a variety of challenging scenarios and offers an initial exploration of integrating artificial neural networks into the domain of autonomous systems.
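
As a rough illustration of the multi-scale idea described in the abstract, the sketch below pairs a spatial kernel size with a temporal interval and computes frame differences at each pairing. It is not the authors' STP-ME module: a fixed box filter stands in for learned convolutions, and the names `box_smooth` and `pyramid_motion`, as well as the `scales` values, are assumptions made for this example only.

```python
import numpy as np

def box_smooth(frames, k):
    """Average each pixel over a k x k spatial window (edge-padded, k odd).

    A fixed box filter is an assumed stand-in for a learned k x k convolution.
    """
    pad = k // 2
    padded = np.pad(frames, ((0, 0), (pad, pad), (pad, pad)), mode="edge")
    t, h, w = frames.shape
    out = np.zeros((t, h, w), dtype=np.float64)
    # Sum the k*k shifted views of the padded clip, then normalise.
    for dy in range(k):
        for dx in range(k):
            out += padded[:, dy:dy + h, dx:dx + w]
    return out / (k * k)

def pyramid_motion(frames, scales=((1, 1), (3, 2), (5, 3))):
    """Return one temporal-difference map per (kernel_size, interval) pair.

    Larger kernels are paired with longer intervals, so both grow together.
    The specific pairs are hypothetical, not values from the paper.
    """
    frames = np.asarray(frames, dtype=np.float64)
    feats = []
    for k, tau in scales:
        smoothed = box_smooth(frames, k)
        # Difference between frames that are tau steps apart.
        feats.append(smoothed[tau:] - smoothed[:-tau])
    return feats

if __name__ == "__main__":
    video = np.random.default_rng(0).random((8, 16, 16))  # (T, H, W) toy clip
    for (k, tau), f in zip(((1, 1), (3, 2), (5, 3)), pyramid_motion(video)):
        print(f"kernel={k} interval={tau} -> shape {f.shape}")
```

Each scale shortens the output along the time axis by its interval, so a real network would need to pad or resample the maps before fusing them. A static clip yields all-zero maps at every scale, which is the expected behaviour for a pure motion cue.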

Motion Stimulation for Compositional Action Recognition

Learning SpatioTemporal and Motion Features in a Unified 2D Network for Action Recognition

Learning Comprehensive Motion Representation for Action Recognition

Learning Latent Spatio-Temporal Compositional Model for Human Action Recognition

Online Robust Action Recognition Based on a Hierarchical Model

Something-Else: Compositional Action Recognition with Spatial-Temporal Interaction Networks

Motion sensitive network for action recognition in control and decision-making of autonomous systems

I Know How You Move: Explicit Motion Estimation for Human Action Recognition

TSI: Temporal Saliency Integration for Video Action Recognition

Modelling Spatio-Temporal Interactions For Compositional Action Recognition

An Animation-based Augmentation Approach for Action Recognition from Discontinuous Video

Progressive Instance-Aware Feature Learning for Compositional Action Recognition

Action-Stage Emphasized Spatiotemporal VLAD for Video Action Recognition

Compositional Structure Learning for Action Understanding

Human Action Recognition in Unconstrained Videos by Explicit Motion Modeling

Action Recognition with Multi-stream Motion Modeling and Mutual Information Maximization

Human Action Recognition Using Multi-Velocity STIPs and Motion Energy Orientation Histogram

Multi-view key information representation and multi-modal fusion for single-subject routine action recognition

Attention-Based Multilevel Co-Occurrence Graph Convolutional LSTM for 3-D Action Recognition

Embedding Motion and Structure Features for Action Recognition

STM: SpatioTemporal and Motion Encoding for Action Recognition