Abstract: Current few-shot action recognition involves two primary sources of information for classification: (1) intra-video information, determined by frame content within a single video clip, and (2) inter-video information, measured by relationships (e.g., feature similarity) among videos. However, existing methods inadequately exploit these two information sources. For intra-video information, current sampling operations on input videos may omit critical action information, reducing how efficiently the video data is used. For inter-video information, action misalignment among videos makes it difficult to compute precise relationships. Moreover, how to jointly consider inter- and intra-video information remains under-explored for few-shot action recognition. To this end, we propose a novel framework, Video Information Maximization (VIM), for few-shot video action recognition. VIM is equipped with an adaptive spatial-temporal video sampler and a spatiotemporal action alignment model to maximize intra- and inter-video information, respectively. The video sampler adaptively selects important frames and amplifies critical spatial regions for each input video based on the task at hand. This preserves and emphasizes informative parts of video clips while eliminating interference at the data level. The alignment model performs temporal and spatial action alignment sequentially at the feature level, leading to more precise measurements of inter-video similarity. Finally, these goals are facilitated by incorporating additional loss terms based on mutual information measurement. Consequently, VIM acts to maximize the distinctiveness of video information from limited video data. Extensive experimental results on public datasets for few-shot action recognition demonstrate the effectiveness and benefits of our framework.
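To illustrate why action misalignment can distort inter-video similarity, the following is a minimal sketch rather than the VIM alignment model. It assumes toy one-hot frame features and replaces the learned, feature-level alignment with a hypothetical brute-force search over integer temporal shifts; the function names `framewise_similarity` and `best_shift_similarity` are illustrative only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def framewise_similarity(x, y):
    """Mean cosine similarity between frames that share the same index."""
    n = min(len(x), len(y))
    return sum(cosine(x[i], y[i]) for i in range(n)) / n


def best_shift_similarity(x, y, max_shift=2):
    """Assumed stand-in for alignment: try integer shifts, keep the best score."""
    best_score, best_shift = -1.0, 0
    for s in range(-max_shift, max_shift + 1):
        # A positive shift drops leading frames of x; a negative one drops them from y.
        xs, ys = (x[s:], y) if s >= 0 else (x, y[-s:])
        if not xs or not ys:
            continue
        score = framewise_similarity(xs, ys)
        if score > best_score:
            best_score, best_shift = score, s
    return best_score, best_shift


def one_hot(index, dim=6):
    """Toy frame feature: a one-hot vector standing in for a frame embedding."""
    return [1.0 if i == index else 0.0 for i in range(dim)]


# Two toy videos showing the same action, with video_b starting one frame later.
video_a = [one_hot(i) for i in range(0, 5)]
video_b = [one_hot(i) for i in range(1, 6)]

naive = framewise_similarity(video_a, video_b)
aligned, shift = best_shift_similarity(video_a, video_b)
print(naive, aligned, shift)
```

In this toy setting the unaligned frame-wise score is zero even though the two clips contain the same action, while the score after shifting is maximal. A learned alignment model would replace the hard shift search used here.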
