Abstract: High frame-rate (HFR) videos for action recognition improve fine-grained expression while reducing the density of spatio-temporal relations and motion information. Thus, large amounts of video samples are continuously required for traditional data-driven training. However, samples are not always sufficient in real-world scenarios, which motivates few-shot action recognition (FSAR) research. We observe that most recent FSAR works build the spatio-temporal relation of video samples via temporal alignment after spatial feature extraction, cutting apart spatial and temporal features within samples. They also capture motion information from a narrow perspective between adjacent frames without considering density, leading to insufficient motion information capturing. Therefore, we propose a novel plug-and-play architecture for FSAR called Spatio-tempOral frAme tuPle enhancer (SOAP) in this paper. The model built on this architecture is referred to as SOAP-Net. Temporal connections between different feature channels and the spatio-temporal relation of features are considered instead of simple feature extraction. Comprehensive motion information is also captured, using frame tuples with multiple frames that contain more motion information than adjacent frames. Combining frame tuples of diverse frame counts further provides a broader perspective. SOAP-Net achieves new state-of-the-art performance across well-known benchmarks such as SthSthV2, Kinetics, UCF101, and HMDB51. Extensive empirical evaluations underscore the competitiveness, pluggability, generalization, and robustness of SOAP. The code is released at https://github.com/wenbohuang1002/SOAP.

What problem does this paper attempt to address?

This paper aims to address two main challenges in **Few-Shot Action Recognition (FSAR)**:

1. **Optimizing Spatio-Temporal Relation Construction**:
   - Although high-frame-rate (HFR) videos improve the representation of fine-grained actions, they reduce the density of spatio-temporal relations and motion information. As a result, traditional data-driven models require a large number of video samples for training.
   - In real-world scenarios, samples of target actions (such as "falling") are usually scarce and difficult to collect, which makes few-shot learning especially important.
   - Most existing FSAR methods perform temporal alignment only after extracting spatial features, which separates spatial and temporal features and ignores the close connection between them.
2. **Comprehensive Motion Information Capturing**:
   - Motion information is unique to videos and is crucial for helping models recognize target actions.
   - The density of motion information in HFR videos is low, and mainstream methods can only process a limited number of frames at a time.
   - Most existing methods focus only on motion between adjacent frames. This narrow perspective ignores the density of motion information and leads to insufficient capture.

### Solutions

To solve the above problems, the authors propose a new plug-and-play architecture named **Spatio-tempOral frAme tuPle enhancer (SOAP)**. Specifically:

- **3D Dimension Enhancement Module (3DEM)**: Establishes spatio-temporal relations through 3D convolution instead of relying on simple temporal alignment.
- **Channel-Wise Enhancement Module (CWEM)**: Adaptively calibrates the temporal connections between different channels.
- **Hybrid Motion Enhancement Module (HMEM)**: Uses multi-scale frame tuples to capture comprehensive motion information from a broader perspective.

These modules work together to improve action recognition performance in few-shot settings. Experimental results show that SOAP-Net reaches a new state-of-the-art level on several well-known benchmark datasets such as SthSthV2, Kinetics, UCF101, and HMDB51.

### Summary

The core contributions of the paper are:

- **Optimizing Spatio-Temporal Relation Construction**: Avoids simple temporal alignment and strengthens the modeling of spatio-temporal relations.
- **Comprehensive Motion Information Capturing**: Captures richer motion information through multi-scale frame tuples.
- **Effectiveness Verification**: Demonstrates the competitiveness, pluggability, generalization ability, and robustness of SOAP-Net on multiple benchmark datasets.

Through these improvements, SOAP-Net improves few-shot action recognition performance and addresses key limitations of existing methods.
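The multi-scale frame tuple idea attributed to HMEM above can be illustrated with a small toy sketch. This is not the authors' implementation: the function names (`frame_tuples`, `tuple_motion`, `hybrid_motion`), the tuple sizes, and the motion score (mean absolute difference between the first and last frame of a tuple) are all assumptions chosen only to show why longer tuples can expose more motion than adjacent frames when motion is spread thinly across frames.

```python
from typing import Dict, List, Sequence

Frame = List[float]  # hypothetical per-frame feature vector


def frame_tuples(frames: Sequence[Frame], size: int) -> List[Sequence[Frame]]:
    """Return every sliding window of `size` consecutive frames."""
    if size < 2:
        raise ValueError("a frame tuple needs at least two frames")
    return [frames[i:i + size] for i in range(len(frames) - size + 1)]


def tuple_motion(frame_tuple: Sequence[Frame]) -> float:
    """Toy motion score (an assumption): mean absolute difference
    between the first and the last frame of the tuple."""
    first, last = frame_tuple[0], frame_tuple[-1]
    return sum(abs(b - a) for a, b in zip(first, last)) / len(first)


def hybrid_motion(frames: Sequence[Frame],
                  sizes: Sequence[int] = (2, 3, 4)) -> Dict[int, float]:
    """Average the toy motion score over all tuples of each size."""
    scores: Dict[int, float] = {}
    for size in sizes:
        tuples = frame_tuples(frames, size)
        if tuples:  # skip sizes longer than the clip
            scores[size] = sum(tuple_motion(t) for t in tuples) / len(tuples)
    return scores


# Slow, steady motion: one feature grows by 1.0 per frame, the other is static.
frames = [[float(t), 0.0] for t in range(8)]
scores = hybrid_motion(frames)
print(scores)
```

In this toy clip, size-2 tuples (adjacent frames) yield the smallest score, and the score grows with tuple size. Combining several sizes gives a view of motion at multiple temporal scales, which is the intuition the summary describes.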

SOAP: Enhancing Spatio-Temporal Relation and Motion Information Capturing for Few-Shot Action Recognition
