Abstract: Few-shot video object segmentation (FSVOS) aims to accurately segment novel objects in given video sequences, where the target objects are specified by a limited number of annotated support images. Most previous top-performing methods adopt either the support-query semantic correlation learning paradigm or the intra-query temporal correlation learning paradigm. Nevertheless, they either fail to model temporal consistency across frames, resulting in temporally discontinuous segmentation, or lose diverse support object information, leading to incomplete segmentation. Therefore, we argue that it is more desirable to model both correlations in a collaborative manner. In this work, we delve into the issues that arise when few-shot image segmentation methods are combined with video object segmentation methods, and we propose a dedicated Collaborative Correlation Network (CoCoNet) to address them. CoCoNet comprises a pixel correlation calibration module and a temporal correlation mining module, and it enjoys several merits. First, the pixel correlation calibration module mitigates noise in the support-query correlation by integrating the affinity learning strategy with the prototype learning strategy. Specifically, we employ Optimal Transport to enrich pixel correlation with contextual information, thereby reducing intra-class differences between support and query. Second, the temporal correlation mining module alleviates uncertainty in the initial frame and establishes reliable guidance for subsequent frames of the query video. With the collaboration of these two modules, CoCoNet can effectively establish support-query and temporal correlations simultaneously and achieve accurate FSVOS. Extensive experimental results on two challenging benchmarks demonstrate that our method performs favorably against state-of-the-art FSVOS methods.
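
To make the Optimal Transport step more concrete, the following is a minimal sketch of how a support-query pixel correlation could be turned into a transport plan. The abstract does not specify the formulation, so everything here is an assumption for illustration: entropy-regularised transport solved with Sinkhorn iterations, uniform marginals, a cost of one minus cosine similarity, and the hypothetical names `cosine_correlation`, `sinkhorn`, `eps`, and `n_iters`. It is not CoCoNet's implementation.

```python
import numpy as np


def cosine_correlation(support, query):
    """Cosine similarity between every support pixel and every query pixel.

    support: (Ns, C) features, query: (Nq, C) features -> (Ns, Nq) matrix.
    """
    s = support / (np.linalg.norm(support, axis=1, keepdims=True) + 1e-8)
    q = query / (np.linalg.norm(query, axis=1, keepdims=True) + 1e-8)
    return s @ q.T


def sinkhorn(corr, eps=0.5, n_iters=200):
    """Entropy-regularised transport plan for a correlation matrix.

    Assumptions (not taken from the paper): cost = 1 - correlation,
    uniform mass on support and query pixels, plain Sinkhorn scaling.
    """
    cost = 1.0 - corr
    kernel = np.exp(-cost / eps)        # Gibbs kernel
    ns, nq = corr.shape
    a = np.full(ns, 1.0 / ns)           # uniform mass over support pixels
    b = np.full(nq, 1.0 / nq)           # uniform mass over query pixels
    u = np.ones(ns)
    v = np.ones(nq)
    for _ in range(n_iters):
        u = a / (kernel @ v)            # rescale rows toward marginal a
        v = b / (kernel.T @ u)          # rescale columns toward marginal b
    return u[:, None] * kernel * v[None, :]


# Toy features standing in for flattened support and query pixels.
rng = np.random.default_rng(0)
support_feats = rng.normal(size=(6, 16))
query_feats = rng.normal(size=(5, 16))

corr = cosine_correlation(support_feats, query_feats)
plan = sinkhorn(corr)
```

Unlike the raw correlation, where each entry depends only on one support-query pair, every entry of `plan` is constrained by the row and column marginals, so a match is scored relative to all other candidate matches. This is one plausible reading of enriching pixel correlation with contextual information.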

Exploring the Better Correlation for Few-Shot Video Object Segmentation
