MoLo: Motion-augmented Long-short Contrastive Learning for Few-shot Action Recognition

Xiang Wang, Shiwei Zhang, Zhiwu Qing, Changxin Gao, Yingya Zhang, Deli Zhao, Nong Sang
2023-04-03
Abstract: Current state-of-the-art approaches for few-shot action recognition achieve promising performance by conducting frame-level matching on learned visual features. However, they generally suffer from two limitations: i) the matching procedure between local frames tends to be inaccurate due to the lack of guidance to force long-range temporal perception; ii) explicit motion learning is usually ignored, leading to partial information loss. To address these issues, we develop a Motion-augmented Long-short Contrastive Learning (MoLo) method that contains two crucial components, including a long-short contrastive objective and a motion autodecoder. Specifically, the long-short contrastive objective is to endow local frame features with long-form temporal awareness by maximizing their agreement with the global token of videos belonging to the same class. The motion autodecoder is a lightweight architecture to reconstruct pixel motions from the differential features, which explicitly embeds the network with motion dynamics. By this means, MoLo can simultaneously learn long-range temporal context and motion cues for comprehensive few-shot matching. To demonstrate the effectiveness, we evaluate MoLo on five standard benchmarks, and the results show that MoLo favorably outperforms recent advanced methods. The source code is available at https://github.com/alibaba-mmai-research/MoLo.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

The paper "MoLo: Motion-augmented Long-short Contrastive Learning for Few-shot Action Recognition" aims to address two main issues in few-shot action recognition:

1. **Inaccurate Local Frame Matching**: Existing few-shot action recognition methods typically rely on local frame-level matching. Without guidance from long-range temporal context, this matching is easily misled by similar co-occurring video frames, which makes it inaccurate.
2. **Neglect of Explicit Motion Learning**: Most existing methods overlook explicit motion learning, resulting in partial information loss. Motion dynamics are considered crucial in video understanding, but current few-shot methods do not fully use the motion cues between frames, which hurts matching performance.

To overcome these issues, the authors propose **Motion-augmented Long-short Contrastive Learning (MoLo)**, which has two key components:

- **Long-short Contrastive Objective**: By maximizing the agreement between local frame features and the global token of videos belonging to the same class, it endows local frame features with long-range temporal awareness.
- **Motion Autodecoder**: A lightweight module that reconstructs pixel motion from differential features, explicitly embedding motion dynamics into the network.

With these designs, MoLo learns long-range temporal context and motion cues jointly for comprehensive few-shot matching. According to the abstract, MoLo favorably outperforms recent advanced methods on five standard benchmarks.
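
To make the two ideas above more concrete, here is a minimal pure-Python sketch. It is not the authors' implementation (that lives in the linked repository). It assumes an InfoNCE-style form for the long-short contrastive objective, cosine similarity as the agreement measure, and a temperature of 0.07. It also assumes that "differential features" can be approximated by differences between adjacent frame features. The function names `long_short_contrastive_loss` and `frame_differences` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def long_short_contrastive_loss(frame_feats, frame_labels,
                                global_tokens, token_labels,
                                temperature=0.07):
    """InfoNCE-style loss (an assumed form, not taken from the paper).

    Each local frame feature is pulled towards the global tokens of
    videos with the same class label and pushed away from the rest.
    """
    losses = []
    for feat, label in zip(frame_feats, frame_labels):
        if label not in token_labels:
            continue  # no positive global token for this frame
        logits = [cosine(feat, tok) / temperature for tok in global_tokens]
        top = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(x - top) for x in logits]
        positives = sum(e for e, t in zip(exps, token_labels) if t == label)
        losses.append(-math.log(positives / sum(exps)))
    return sum(losses) / len(losses) if losses else 0.0


def frame_differences(frame_feats):
    """Differences between adjacent frame features.

    This stands in for the "differential features" that a motion
    decoder would take as input (an assumption for illustration).
    """
    return [[b - a for a, b in zip(prev, nxt)]
            for prev, nxt in zip(frame_feats, frame_feats[1:])]


if __name__ == "__main__":
    frames = [[1.0, 0.0], [0.0, 1.0]]   # two local frame features
    tokens = [[1.0, 0.0], [0.0, 1.0]]   # two global video tokens
    aligned = long_short_contrastive_loss(frames, [0, 1], tokens, [0, 1])
    swapped = long_short_contrastive_loss(frames, [0, 1], tokens, [1, 0])
    print(aligned < swapped)
    print(frame_differences([[1, 2], [3, 5], [3, 5]]))
```

The loss is small when a frame feature already agrees with the global token of its own class and large when it agrees with a token of another class, which is the behaviour the long-short objective is described as encouraging. A real model would compute the same quantities on batched tensors and backpropagate through the feature extractor.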