Abstract: Few-shot video object segmentation (FSVOS) aims to accurately segment novel objects in a given video sequence, where the target objects are specified by a small number of annotated support images. Most previous top-performing methods adopt either the support-query semantic correlation learning paradigm or the intra-query temporal correlation learning paradigm. The former fails to model temporal consistency across frames, resulting in temporally inconsistent segmentation, while the latter loses diverse support object information, leading to incomplete segmentation. We therefore argue that it is more desirable to learn both correlations in a collaborative manner. In this work, we examine the issues that arise when few-shot image segmentation methods are combined with video object segmentation methods, and propose a dedicated Collaborative Correlation Network (CoCoNet) to address them. CoCoNet comprises a pixel correlation calibration module and a temporal correlation mining module, and enjoys several merits. First, the pixel correlation calibration module mitigates noise in the support-query correlation by integrating the affinity learning strategy with the prototype learning strategy. Specifically, we employ Optimal Transport to enrich the pixel correlation with contextual information, thereby reducing intra-class differences between support and query. Second, the temporal correlation mining module alleviates uncertainty in the initial frame and establishes reliable guidance for subsequent frames of the query video. With these two modules working together, CoCoNet effectively establishes support-query and temporal correlations simultaneously and achieves accurate FSVOS. Extensive experimental results on two challenging benchmarks demonstrate that our method performs favorably against state-of-the-art FSVOS methods.
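The abstract states that Optimal Transport is used to enrich the support-query pixel correlation, but gives no formulation. As a rough illustration of the general idea only, the sketch below applies entropy-regularised Optimal Transport (Sinkhorn iterations) to a cosine-similarity matrix between support and query pixel features. The function names, the uniform marginals, the cost `1 - cosine`, and the values of `eps` and `n_iters` are all assumptions made for this example; it is not the CoCoNet module.

```python
import numpy as np

def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Entropy-regularised OT plan via Sinkhorn iterations (generic sketch)."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)  # assumed: uniform mass over support pixels
    b = np.full(m, 1.0 / m)  # assumed: uniform mass over query pixels
    K = np.exp(-cost / eps)  # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)      # rescale rows towards marginal a
        v = b / (K.T @ u)    # rescale columns towards marginal b
    # Column sums equal b after the final update; row sums approach a
    # as the iterations converge.
    return u[:, None] * K * v[None, :]

def calibrated_correlation(support_feats, query_feats, eps=0.1):
    """Hypothetical helper: turn a raw cosine correlation into a transport plan.

    support_feats: (Ns, C) array of support pixel features
    query_feats:   (Nq, C) array of query pixel features
    """
    s = support_feats / np.linalg.norm(support_feats, axis=1, keepdims=True)
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    cosine = s @ q.T         # raw pixel-to-pixel correlation
    cost = 1.0 - cosine      # assumed cost: dissimilar pairs are expensive
    return sinkhorn_plan(cost, eps=eps)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    support = rng.normal(size=(6, 16))  # toy features, not real data
    query = rng.normal(size=(8, 16))
    plan = calibrated_correlation(support, query)
    print(plan.shape, plan.sum())
```

Unlike a raw cosine matrix, each entry of the transport plan depends on every other pairing through the marginal constraints, which is one way a correlation can take global context into account. How the paper incorporates such a plan into segmentation is not described in the abstract.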
