MoLo: Motion-augmented Long-short Contrastive Learning for Few-shot Action Recognition

Xiang Wang,Shiwei Zhang,Zhiwu Qing,Changxin Gao,Yingya Zhang,Deli Zhao,Nong Sang

2023-04-03

Abstract: Current state-of-the-art approaches for few-shot action recognition achieve promising performance by conducting frame-level matching on learned visual features. However, they generally suffer from two limitations: i) the matching procedure between local frames tends to be inaccurate due to the lack of guidance to force long-range temporal perception; ii) explicit motion learning is usually ignored, leading to partial information loss. To address these issues, we develop a Motion-augmented Long-short Contrastive Learning (MoLo) method that contains two crucial components, including a long-short contrastive objective and a motion autodecoder. Specifically, the long-short contrastive objective is to endow local frame features with long-form temporal awareness by maximizing their agreement with the global token of videos belonging to the same class. The motion autodecoder is a lightweight architecture to reconstruct pixel motions from the differential features, which explicitly embeds the network with motion dynamics. By this means, MoLo can simultaneously learn long-range temporal context and motion cues for comprehensive few-shot matching. To demonstrate the effectiveness, we evaluate MoLo on five standard benchmarks, and the results show that MoLo favorably outperforms recent advanced methods. The source code is available at https://github.com/alibaba-mmai-research/MoLo.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

The paper "MoLo: Motion-augmented Long-short Contrastive Learning for Few-shot Action Recognition" addresses two main issues in few-shot action recognition:

1. **Inaccurate Local Frame Matching**: Existing few-shot action recognition methods typically rely on local frame-level matching. Without guidance from long-range temporal context, this matching is easily confused by similar frames that co-occur across videos, which makes it inaccurate.
2. **Neglect of Explicit Motion Learning**: Most existing methods do not learn motion explicitly, so part of the information is lost. Motion dynamics are considered crucial in video understanding, yet current few-shot methods do not fully use the motion cues between frames, which hurts matching performance.

To overcome these issues, the authors propose **Motion-augmented Long-short Contrastive Learning (MoLo)**, a method with two key components:

- **Long-short Contrastive Objective**: It maximizes the agreement between local frame features and the global token of videos belonging to the same class, which gives local frame features long-range temporal awareness.
- **Motion Autodecoder**: A lightweight architecture that reconstructs pixel motions from differential features, which explicitly embeds motion dynamics into the network.

With these designs, MoLo learns long-range temporal context and motion cues at the same time, giving more comprehensive few-shot matching. The experiments show that MoLo outperforms recent state-of-the-art methods on five standard benchmarks.
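The two components can be illustrated with a small NumPy sketch. This is not the authors' implementation (see the official repository for that). It assumes an InfoNCE-style loss for the long-short contrastive objective and takes "differential features" to mean differences between adjacent frame features. The function names, the `temperature` value, and the tensor shapes are all hypothetical choices made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def long_short_contrastive_loss(frame_feats, global_tokens,
                                frame_labels, video_labels,
                                temperature=0.07):
    """InfoNCE-style loss (an assumption, not the paper's exact formula).

    frame_feats:   (N, D) local frame features
    global_tokens: (M, D) one global token per video
    frame_labels:  (N,)   class label of the video each frame came from
    video_labels:  (M,)   class label of each video
    Frames are pulled toward global tokens of same-class videos and
    pushed away from global tokens of other classes.
    """
    f = l2_normalize(frame_feats)
    g = l2_normalize(global_tokens)
    logits = f @ g.T / temperature                       # (N, M)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    positives = frame_labels[:, None] == video_labels[None, :]  # (N, M)
    pos_count = positives.sum(axis=1)
    valid = pos_count > 0            # skip frames with no same-class video
    if not valid.any():
        return 0.0
    per_frame = -(log_prob * positives).sum(axis=1)[valid] / pos_count[valid]
    return float(per_frame.mean())


def differential_features(frame_feats_seq):
    """Adjacent-frame feature differences: (T, D) -> (T - 1, D).

    Assumed here to be the input from which a motion decoder would
    reconstruct pixel motion; the decoder itself is not sketched.
    """
    return frame_feats_seq[1:] - frame_feats_seq[:-1]


if __name__ == "__main__":
    # Toy data: two videos (classes 0 and 1) and three frames.
    tokens = np.array([[1.0, 0.0], [0.0, 1.0]])
    frames = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
    loss = long_short_contrastive_loss(
        frames, tokens, np.array([0, 0, 1]), np.array([0, 1]))
    print("long-short contrastive loss:", loss)
    print("differential features shape:", differential_features(frames).shape)
```

In this sketch the loss is small when each frame feature points toward the global token of a same-class video and grows when the labels are mismatched, which is the behavior the summary describes. The pixel-motion reconstruction target and the decoder architecture are left out because the text above does not specify them.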

Revisiting the Spatial and Temporal Modeling for Few-shot Action Recognition

ActionCLIP: Adapting Language-Image Pretrained Models for Video Action Recognition

Two-stream joint matching method based on contrastive learning for few-shot action recognition

Consistency Prototype Module and Motion Compensation for Few-Shot Action Recognition (CLIP-CP$\mathbf{M^2}$C)

Few-shot action recognition with implicit temporal alignment and pair similarity optimization

HyRSM++: Hybrid relation guided temporal set matching for few-shot action recognition

On the Importance of Spatial Relations for Few-shot Action Recognition

Cross-Modal Contrastive Learning Network for Few-Shot Action Recognition

Few-Shot Action Recognition with Compromised Metric via Optimal Transport

Learning Comprehensive Motion Representation for Action Recognition

SMAM: Self and Mutual Adaptive Matching for Skeleton-Based Few-Shot Action Recognition

Few-shot Action Recognition via Intra- and Inter-Video Information Maximization

Task-adaptive Spatial-Temporal Video Sampler for Few-shot Action Recognition

Task-Specific Alignment and Multiple Level Transformer for Few-Shot Action Recognition

Few-shot Action Recognition with Prototype-centered Attentive Learning

Neighbor-Guided Consistent and Contrastive Learning for Semi-Supervised Action Recognition

Motion Sensitive Contrastive Learning for Self-supervised Video Representation

Multi-scale motion contrastive learning for self-supervised skeleton-based action recognition

MIE-Net: Motion Information Enhancement Network for Fine-Grained Action Recognition Using RGB Sensors

Self-supervised pretext task collaborative multi-view contrastive learning for video action recognition