Abstract: To address the semantic inconsistency issue with SAM or other single-image segmentation models handling image sequences, we introduce BYOCL. This novel model outperforms SAM in extensive experiments, showcasing its hierarchical prototype capabilities across CLIP and other representations. BYOCL significantly reduces time and space consumption by dividing inputs into smaller batches, achieving exponential time reduction compared to previous methods. Our approach leverages the SAM image encoder for feature extraction, followed by Intra-Batch and Inter-Batch clustering algorithms. Extensive experiments demonstrate that BYOCL far exceeds the previous state-of-the-art single-image segmentation model. Our work is the first to apply consistent segmentation using foundation models without requiring training, utilizing plug-and-play modules for any latent space, making our method highly efficient. Models are available at https://github.com/cyt1202/BYOCL.git

What problem does this paper attempt to address?

This paper addresses the **semantic inconsistency problem** that arises when single-image segmentation models (such as SAM) are applied to image sequences. Because these models have no notion of how the images in a sequence relate to one another, their segmentation results are unstable and unreliable across frames, which degrades the performance of downstream tasks.

### Main problems:

1. **Semantic inconsistency**: When a single-image segmentation model (such as SAM) processes an image sequence, it cannot maintain semantic consistency between frames, so the segmentation results are unstable.
2. **High time and space consumption**: Traditional segmentation methods are slow and memory-intensive when processing large-scale image sequences.

### Solutions:

To solve these problems, the authors propose the **BYOCL (Build Your Own Consistent Latent)** model. It uses the SAM image encoder to extract features and then applies a hierarchical clustering method that combines intra-batch clustering and inter-batch clustering, which keeps segmentation results semantically consistent across an image sequence. In addition, BYOCL significantly reduces time and space consumption by dividing the input images into small batches for processing.

### Specific improvements:

- **Hierarchical clustering**: BYOCL clusters features first within each batch and then across batches, which keeps features consistent between different images.
- **Zero-shot segmentation**: BYOCL can be applied directly to various datasets without additional training.
- **Efficient processing**: By dividing the input images into small batches, BYOCL significantly reduces the consumption of computing resources and improves processing speed.
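
The two-level scheme described above (cluster within each batch, then cluster across batches) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: plain k-means is swapped in for both clustering stages, random vectors stand in for SAM image-encoder features, and the names `kmeans`, `hierarchical_cluster`, `batch_size`, `k_intra`, and `k_inter` are hypothetical.

```python
import numpy as np


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means (a stand-in; the paper's clustering may differ).

    Returns (centroids, labels) for an (N, D) array of points.
    """
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct input points.
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    labels = np.zeros(len(points), dtype=int)
    for _ in range(iters):
        # Distance from every point to every centroid, shape (N, k).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # leave an empty cluster's centroid in place
                centroids[j] = members.mean(axis=0)
    return centroids, labels


def hierarchical_cluster(features, batch_size, k_intra, k_inter):
    """Assign one global cluster id to each row of an (N, D) feature array."""
    prototypes, proto_index = [], []
    offset = 0
    # Stage 1 (intra-batch): cluster each small batch on its own.
    for start in range(0, len(features), batch_size):
        batch = features[start:start + batch_size]
        k = min(k_intra, len(batch))
        cents, labels = kmeans(batch, k)
        prototypes.append(cents)
        proto_index.append(labels + offset)  # index into the stacked prototypes
        offset += k
    prototypes = np.concatenate(prototypes)
    proto_index = np.concatenate(proto_index)
    # Stage 2 (inter-batch): cluster the batch prototypes into shared groups.
    _, proto_to_global = kmeans(prototypes, min(k_inter, len(prototypes)))
    # Each feature inherits the global id of its batch-level prototype.
    return proto_to_global[proto_index]


# Random vectors used purely as placeholder features.
feats = np.random.default_rng(1).normal(size=(40, 8))
ids = hierarchical_cluster(feats, batch_size=10, k_intra=3, k_inter=2)
print(ids.shape)
```

In this sketch only the second stage ever sees data from more than one batch, and it operates on a handful of prototypes rather than on every feature vector, which is one plausible reading of where the batching saves time and memory.
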
### Experimental verification:

The authors conducted extensive experiments on multiple datasets (such as DAVIS and MOSE), and the results show that BYOCL is superior to existing single-image segmentation models (such as SAM) in terms of segmentation accuracy and consistency. Reported metrics include mean intersection-over-union (IoU), F1-score, and recall.

### Summary:

By introducing a hierarchical clustering method, BYOCL solves the semantic inconsistency problem of single-image segmentation models on image sequences, and it shows better segmentation performance and higher time efficiency on multiple datasets.
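
For reference, the metrics named above have standard definitions for a pair of binary masks, sketched below. The paper's exact evaluation protocol (per-class averaging, handling of empty masks) is not given here, so the function name `mask_metrics` and the convention of returning 0.0 when a denominator is zero are assumptions made for illustration.

```python
import numpy as np


def mask_metrics(pred, gt):
    """IoU, F1, and recall for two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    tp = int(np.logical_and(pred, gt).sum())   # predicted and present
    fp = int(np.logical_and(pred, ~gt).sum())  # predicted but absent
    fn = int(np.logical_and(~pred, gt).sum())  # present but missed

    def ratio(num, den):
        # Assumed convention: a zero denominator yields 0.0.
        return num / den if den > 0 else 0.0

    return {
        "iou": ratio(tp, tp + fp + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
        "recall": ratio(tp, tp + fn),
    }


# One true positive, one false positive, one false negative.
scores = mask_metrics([1, 1, 0, 0], [1, 0, 1, 0])
print(scores)  # iou = 1/3, f1 = 0.5, recall = 0.5
```

Mean IoU is then the average of the per-mask (or per-class) IoU values over the evaluation set.
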

BYOCL: Build Your Own Consistent Latent with Hierarchical Representative Latent Clustering

Mejigclu: more effective jigsaw clustering for unsupervised visual representation learning

Consensus Clustering With Unsupervised Representation Learning

Tuning-free Universally-Supervised Semantic Segmentation

Open-Vocabulary Semantic Segmentation with Image Embedding Balancing

Spatial and Semantic Consistency Contrastive Learning for Self-Supervised Semantic Segmentation of Remote Sensing Images

Frozen CLIP: A Strong Backbone for Weakly Supervised Semantic Segmentation

Semantic-Enhanced Image Clustering

Deep Clustering by Semantic Contrastive Learning

LCCo: Lending CLIP to Co-Segmentation

SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation

A Simple Baseline for Open-Vocabulary Semantic Segmentation with Pre-trained Vision-language Model

Self-Calibrated CLIP for Training-Free Open-Vocabulary Segmentation

Collaborating Foundation Models for Domain Generalized Semantic Segmentation

ClipSAM: CLIP and SAM Collaboration for Zero-Shot Anomaly Segmentation

Towards Label-free Scene Understanding by Vision Foundation Models

A Lightweight Clustering Framework for Unsupervised Semantic Segmentation

Harnessing Vision Foundation Models for High-Performance, Training-Free Open Vocabulary Segmentation

TagCLIP: Improving Discrimination Ability of Open-Vocabulary Semantic Segmentation

Explore the Potential of CLIP for Training-Free Open Vocabulary Semantic Segmentation

A Hybrid Framework for Referring Image Segmentation: Dual-Decoder Model with SAM Complementation