SOAP: Enhancing Spatio-Temporal Relation and Motion Information Capturing for Few-Shot Action Recognition

Wenbo Huang, Jinghui Zhang, Xuwei Qian, Zhen Wu, Meng Wang, Lei Zhang
DOI: https://doi.org/10.1145/3664647.3681062
2024-08-22
Abstract: High frame-rate (HFR) videos for action recognition improve fine-grained expression while reducing the density of spatio-temporal relations and motion information. As a result, traditional data-driven training continuously requires large amounts of video samples. However, samples are not always sufficient in real-world scenarios, which motivates few-shot action recognition (FSAR) research. We observe that most recent FSAR works build the spatio-temporal relation of video samples via temporal alignment after spatial feature extraction, cutting apart spatial and temporal features within samples. They also capture motion information from a narrow perspective between adjacent frames without considering density, leading to insufficient motion information capturing. Therefore, we propose a novel plug-and-play architecture for FSAR called Spatio-tempOral frAme tuPle enhancer (SOAP) in this paper. The model we designed with this architecture is referred to as SOAP-Net. Temporal connections between different feature channels and the spatio-temporal relation of features are considered instead of simple feature extraction. Comprehensive motion information is also captured, using frame tuples with multiple frames, which contain more motion information than adjacent frames. Combining frame tuples of diverse frame counts further provides a broader perspective. SOAP-Net achieves new state-of-the-art performance across well-known benchmarks such as SthSthV2, Kinetics, UCF101, and HMDB51. Extensive empirical evaluations underscore the competitiveness, pluggability, generalization, and robustness of SOAP. The code is released at https://github.com/wenbohuang1002/SOAP.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper aims to solve two main challenges in **Few-Shot Action Recognition (FSAR)**:

1. **Optimizing Spatio-Temporal Relation Construction**:
   - Although high-frame-rate (HFR) videos improve the representation of fine-grained actions, they reduce the density of spatio-temporal relations and motion information. As a result, traditional data-driven models require a large number of video samples for training.
   - In real-world scenarios, samples of target actions (such as "falling") are usually scarce and difficult to collect, so few-shot learning becomes especially important.
   - Most existing FSAR methods perform temporal alignment only after extracting spatial features, which separates spatial and temporal features and ignores the close connection between them.
2. **Comprehensive Motion Information Capturing**:
   - Motion information is a feature unique to videos and is crucial for helping models recognize target actions dynamically.
   - However, the density of motion information in HFR videos is low, and mainstream methods can only process a limited number of frames at a time, so the motion information they capture is insufficient.
   - Most existing methods focus only on the motion information between adjacent frames. This narrow perspective ignores the density of motion information.

### Solutions

To solve the above problems, the authors propose a new plug-and-play architecture named **Spatio-tempOral frAme tuPle enhancer (SOAP)**. Specifically:

- **3-Dimension Enhancement Module (3DEM)**: Establishes spatio-temporal relations through 3D convolution instead of relying on simple temporal-alignment operations.
- **Channel-Wise Enhancement Module (CWEM)**: Adaptively calibrates the temporal connections between different channels.
- **Hybrid Motion Enhancement Module (HMEM)**: Uses multi-scale frame tuples to capture comprehensive motion information from a broader perspective.
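The frame-tuple idea behind HMEM can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: SOAP operates on learned feature maps, whereas this example uses plain per-frame feature vectors. The function names (`frame_tuples`, `tuple_motion`, `hybrid_motion`), the use of absolute frame differences as a stand-in for motion, and the simple averaging across tuple sizes are all assumptions made for illustration.

```python
from typing import List, Sequence

Frame = List[float]  # one flattened per-frame feature vector (hypothetical layout)


def frame_tuples(frames: Sequence[Frame], size: int) -> List[Sequence[Frame]]:
    """Return every window of `size` consecutive frames (stride 1)."""
    if size < 2 or size > len(frames):
        raise ValueError("size must be between 2 and the number of frames")
    return [frames[i:i + size] for i in range(len(frames) - size + 1)]


def tuple_motion(tup: Sequence[Frame]) -> Frame:
    """Accumulate absolute differences between successive frames in one tuple.

    Assumption: absolute differences are a crude proxy for motion information.
    """
    dim = len(tup[0])
    motion = [0.0] * dim
    for prev, nxt in zip(tup, tup[1:]):
        for d in range(dim):
            motion[d] += abs(nxt[d] - prev[d])
    return motion


def hybrid_motion(frames: Sequence[Frame], sizes: Sequence[int] = (2, 3, 4)) -> Frame:
    """Average tuple motion within each tuple size, then across all sizes."""
    dim = len(frames[0])
    combined = [0.0] * dim
    for size in sizes:
        tuples = frame_tuples(frames, size)
        for tup in tuples:
            m = tuple_motion(tup)
            for d in range(dim):
                combined[d] += m[d] / (len(tuples) * len(sizes))
    return combined


if __name__ == "__main__":
    # Four toy frames with two feature dimensions each.
    frames = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [6.0, 2.0]]
    print(hybrid_motion(frames, sizes=(2, 3, 4)))
```

With `size=2` the windows reduce to the adjacent-frame pairs that the summary describes as a narrow perspective. Larger tuples accumulate motion over more frames, and mixing several sizes combines both views.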
These modules work together to enable SOAP to improve action recognition performance in few-shot settings. Experimental results show that SOAP-Net reaches a new state-of-the-art level on several well-known benchmark datasets such as SthSthV2, Kinetics, UCF101, and HMDB51.

### Summary

The core contributions of the paper are:

- **Optimizing Spatio-Temporal Relation Construction**: Avoid simple temporal-alignment operations and strengthen the modeling of spatio-temporal relations.
- **Comprehensive Motion Information Capturing**: Capture richer motion information through multi-scale frame tuples.
- **Effectiveness Verification**: Demonstrate the competitiveness, pluggability, generalization ability, and robustness of SOAP on multiple benchmark datasets.

Through these improvements, SOAP-Net improves few-shot action recognition performance and addresses key limitations of existing methods.