Abstract: Videos contain rich spatio-temporal information. Traditional methods for extracting motion, used in tasks such as action recognition, often rely on visual contents rather than precise motion features. This phenomenon is referred to as 'blind motion extraction' behavior, which proves inefficient in capturing motions of interest due to a lack of motion-guided cues. Recently, attention mechanisms have enhanced many computer vision tasks by effectively highlighting salient visual areas. Inspired by this, we propose a modified Sigmoid function with learnable slope and shift parameters as an attention mechanism to modulate motion signals from frame differencing maps. This approach generates a sequence of attention maps that enhance the processing of motion-related video content. To ensure temporal continuity and smoothness of the attention maps, we apply pair-wise temporal attention variation regularization to remove unwanted motions (e.g., noise) while preserving important ones. We then perform Hadamard product between each pair of attention maps and the original video frames to highlight the evolving motions of interest over time. These highlighted motions, termed video motion prompts, are subsequently used as inputs to the model instead of the original video frames. We formalize this process as a motion prompt layer and incorporate the regularization term into the loss function to learn better motion prompts. This layer serves as an adapter between the model and the video data, bridging the gap between traditional 'blind motion extraction' and the extraction of relevant motions of interest. We show that our lightweight, plug-and-play motion prompt layer seamlessly integrates into models like SlowFast, X3D, and TimeSformer, enhancing performance on benchmarks such as FineGym and MPII Cooking 2.

What problem does this paper attempt to address?

This paper addresses "blind motion extraction" in video processing: traditional methods rely on visual content rather than precise motion features when extracting motion information from videos. Because they lack motion-guided cues, they capture the motions of interest inefficiently. Specifically:

1. **Limitations of traditional methods**: Traditional motion extraction methods usually rely on visual content (such as objects, human subjects, and scene layouts) rather than specific motion features. The paper calls this "blind motion extraction" because the focus is on static visual information rather than on the motion itself.
2. **The need for an attention mechanism**: Attention mechanisms have recently performed well in many computer vision tasks by highlighting salient visual areas. How to apply them to video motion extraction so that the model focuses on motion features remains a challenge.
3. **Deficiencies of existing methods**: Existing attention mechanisms face challenges in video processing, such as high computational complexity, poor scalability to videos of different lengths, and difficulty capturing temporal dependencies.

To address these problems, the authors propose a video motion prompt layer built on an attention mechanism. The layer learns to modulate the motion signals in frame differencing maps and generates a sequence of attention maps that enhance the capture of the motions of interest.
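
The attention step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `motion_attention_maps` and `motion_prompts` are hypothetical, frames are assumed to be grayscale with values in [0, 1], absolute frame differences are assumed, and the slope and shift are fixed numbers here although the paper learns them. Pairing attention map t with frame t+1 is also an assumption.

```python
import numpy as np

def motion_attention_maps(frames, slope, shift):
    """Turn frame differences into attention maps with a shifted, scaled sigmoid.

    frames: array of shape (T, H, W) with values in [0, 1] (assumed grayscale).
    Returns an array of shape (T-1, H, W) with values strictly between 0 and 1.
    """
    # Frame differencing: absolute change between consecutive frames (assumption).
    diffs = np.abs(np.diff(frames, axis=0))
    # Modified sigmoid: slope controls sharpness, shift sets the motion threshold.
    return 1.0 / (1.0 + np.exp(-slope * (diffs - shift)))

def motion_prompts(frames, slope, shift):
    """Hadamard (element-wise) product of attention maps and video frames."""
    attn = motion_attention_maps(frames, slope, shift)
    # Assumption: attention map t is applied to frame t+1.
    return attn * frames[1:]

# Toy clip: static gray background with one bright pixel that moves.
frames = np.full((3, 4, 4), 0.5)
frames[1, 1, 1] = 1.0
frames[2, 2, 2] = 1.0

prompts = motion_prompts(frames, slope=20.0, shift=0.25)
print(prompts.shape)     # one prompt per consecutive frame pair
print(prompts[0, 1, 1])  # moving pixel: kept almost unchanged
print(prompts[0, 0, 0])  # static background: strongly suppressed
```

In this sketch a larger slope makes the attention closer to a hard threshold, while the shift decides how large a frame difference must be before a pixel counts as motion.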

The specific contributions are as follows:

- **Introducing video motion prompts**: A video motion prompt is defined as a sequence of video frames highlighted in space-time. A plug-in motion prompt layer is inserted between the video input and the model architecture, where it acts as an adapter that bridges the gap between traditional "blind motion extraction" and the extraction of relevant motions.
- **Improved attention mechanism**: A modified Sigmoid function with learnable slope and shift parameters is proposed as a Power Normalization (PN) function for activating and modulating motion. A pair-wise temporal attention variation regularization term keeps the generated attention maps smooth and continuous in space-time, removing unwanted motions (such as noise) while retaining important ones.
- **Experimental verification**: Experiments on several popular video models (such as SlowFast, X3D, and TimeSformer) demonstrate the simplicity and effectiveness of the motion prompt layer, which achieves state-of-the-art performance on generic and fine-grained action recognition tasks.

In summary, the paper tackles the "blind motion extraction" problem of traditional methods by introducing a video motion prompt layer that improves the efficiency and accuracy of capturing the motions of interest.
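
The pair-wise temporal attention variation regularization can be sketched in the same hedged way. The exact form used in the paper is not given in this summary, so the example below assumes a mean squared difference between consecutive attention maps, added to the task loss with a weight. The names `temporal_variation_penalty`, `total_loss`, and `weight` are hypothetical.

```python
import numpy as np

def temporal_variation_penalty(attn_maps):
    """Mean squared change between consecutive attention maps.

    attn_maps: array of shape (T, H, W).
    Assumption: the summary does not specify the penalty, so a squared
    difference is used here purely for illustration.
    """
    variation = np.diff(attn_maps, axis=0)  # map[t+1] - map[t] for each pair
    return float(np.mean(variation ** 2))

def total_loss(task_loss, attn_maps, weight):
    """Task loss plus the weighted regularization term (weight is hypothetical)."""
    return task_loss + weight * temporal_variation_penalty(attn_maps)

# Attention that drifts slowly over time versus attention that flickers.
smooth = np.stack([np.full((4, 4), v) for v in (0.50, 0.52, 0.54)])
flicker = np.stack([np.zeros((4, 4)), np.ones((4, 4)), np.zeros((4, 4))])

print(temporal_variation_penalty(smooth))   # small: maps change gradually
print(temporal_variation_penalty(flicker))  # large: maps jump between frames
print(total_loss(1.0, flicker, weight=0.1))
```

Under this assumption, minimizing the combined loss discourages abrupt frame-to-frame changes in attention, which is how flickering noise would be penalized while slowly evolving motion is kept.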

Motion meets Attention: Video Motion Prompts

Motion Prompting: Controlling Video Generation with Motion Trajectories

Adaptive Hierarchical Motion-Focused Model for Video Prediction

Edit-Your-Motion: Space-Time Diffusion Decoupling Learning for Video Motion Editing

Motion Control for Enhanced Complex Action Video Generation

Motion Inversion for Video Customization

Motion-I2V: Consistent and Controllable Image-to-Video Generation with Explicit Motion Modeling

Explicit Motion Handling and Interactive Prompting for Video Camouflaged Object Detection

Motion-Aware Rapid Video Saliency Detection

Nonparametric motion model

Fast and Accurate Action Detection in Videos With Motion-Centric Attention Model

MotionFlow: Attention-Driven Motion Transfer in Video Diffusion Models

Optical-Flow Guided Prompt Optimization for Coherent Video Generation

MotionCrafter: One-Shot Motion Customization of Diffusion Models

Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance

Motion-Focused Contrastive Learning of Video Representations

MotionZero: Exploiting Motion Priors for Zero-shot Text-to-Video Generation

CustomCrafter: Customized Video Generation with Preserving Motion and Concept Composition Abilities

Interpretable Spatio-temporal Attention for Video Action Recognition

Video Saliency Detection via Dynamic Consistent Spatio-Temporal Attention Modelling

Motion and Context-Aware Audio-Visual Conditioned Video Prediction