Abstract: The landscape of publicly available vision foundation models (VFMs), such as CLIP and Segment Anything Model (SAM), is expanding rapidly. VFMs are endowed with distinct capabilities stemming from their pre-training objectives. For instance, CLIP excels in semantic understanding, while SAM specializes in spatial understanding for segmentation. In this work, we introduce a simple recipe to efficiently merge VFMs into a unified model that absorbs their expertise. Our method integrates techniques of multi-task learning, continual learning, and distillation. Further, it demands significantly less computational cost compared to traditional multi-task training from scratch, and it only needs a small fraction of the pre-training datasets that were initially used to train individual models. By applying our method to SAM and CLIP, we obtain SAM-CLIP: a unified model that combines the capabilities of SAM and CLIP into a single vision transformer. Compared with deploying SAM and CLIP independently, our merged model, SAM-CLIP, reduces storage and compute costs for inference, making it well-suited for edge device applications. We show that SAM-CLIP not only retains the foundational strengths of SAM and CLIP, but also introduces synergistic functionalities, notably in zero-shot semantic segmentation, where SAM-CLIP establishes new state-of-the-art results on 5 benchmarks. It outperforms previous models that are specifically designed for this task by a large margin, including +6.8% and +5.9% mean IoU improvement on Pascal-VOC and COCO-Stuff datasets, respectively.

What problem does this paper attempt to address?

This paper addresses the cost of maintaining separate vision foundation models (VFMs). Models such as CLIP and the Segment Anything Model (SAM) have distinct strengths that stem from their pre-training objectives: CLIP excels at semantic understanding, while SAM specializes in spatial understanding for segmentation. Maintaining and deploying multiple models for different downstream tasks is inefficient (high memory usage and running time, especially on edge devices) and forgoes the opportunity for cross-model learning. Multi-task learning can address this, but it usually incurs high training costs and requires simultaneous access to the data for all tasks. In addition, training foundation models usually depends on unsupervised or semi-supervised methods, which require a large amount of computing resources. To overcome these challenges, the paper proposes an efficient method for merging VFMs with different pre-training objectives into a unified model, and applies it to SAM and CLIP to obtain SAM-CLIP. The method integrates techniques from multi-task learning, continual learning, and knowledge distillation, aiming to reduce storage and compute costs and to make the model better suited to edge-device applications. SAM-CLIP not only retains the foundational strengths of SAM and CLIP but also shows synergistic capabilities on new tasks such as zero-shot semantic segmentation, where it establishes new state-of-the-art results on 5 benchmarks. Specifically, SAM-CLIP improves mean Intersection over Union (mean IoU) by 6.8% on Pascal-VOC and by 5.9% on COCO-Stuff, significantly outperforming previous models designed specifically for this task.
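
For readers unfamiliar with the metric quoted above, mean IoU averages the per-class ratio of intersection to union between predicted and ground-truth labels. The following is a minimal sketch in plain Python, not the paper's evaluation code. The function name `mean_iou`, the flat label lists, and the convention of skipping classes that appear in neither the prediction nor the ground truth are all assumptions made for illustration; benchmark toolkits differ in how they handle absent classes and ignore labels.

```python
def mean_iou(pred, target, num_classes):
    """Mean Intersection over Union for two flat sequences of class labels.

    Assumed convention: a class that appears in neither `pred` nor `target`
    has an empty union and is skipped rather than counted as 0 or 1.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")

    ious = []
    for c in range(num_classes):
        # Pixels where both prediction and ground truth are class c.
        intersection = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels where either prediction or ground truth is class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union == 0:
            continue  # class absent from both; skip it (assumption)
        ious.append(intersection / union)

    return sum(ious) / len(ious) if ious else 0.0


# Toy example with 6 "pixels" and 3 classes (hypothetical data).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
print(f"mean IoU = {score:.3f}")
```

In the toy example the per-class IoUs are 1/3, 2/3, and 1/2, so their mean is 0.5.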

SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial Understanding

Open-Vocabulary SAM: Segment and Recognize Twenty-thousand Classes Interactively

Harnessing Vision Foundation Models for High-Performance, Training-Free Open Vocabulary Segmentation

ProxyCLIP: Proxy Attention Improves CLIP for Open-Vocabulary Segmentation

ClipSAM: CLIP and SAM Collaboration for Zero-Shot Anomaly Segmentation

SAM Fails to Segment Anything? – SAM-Adapter: Adapting SAM in Underperformed Scenes: Camouflage, Shadow, Medical Image Segmentation, and More

SAM-Adapter: Adapting Segment Anything in Underperformed Scenes

Self-Calibrated CLIP for Training-Free Open-Vocabulary Segmentation

EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model

EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything

ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference

SimCMF: A Simple Cross-modal Fine-tuning Strategy from Vision Foundation Models to Any Imaging Modality

CLIPer: Hierarchically Improving Spatial Representation of CLIP for Open-Vocabulary Semantic Segmentation

RAP-SAM: Towards Real-Time All-Purpose Segment Anything

FusionSAM: Latent Space driven Segment Anything Model for Multimodal Fusion and Segmentation

ITACLIP: Boosting Training-Free Semantic Segmentation with Image, Text, and Architectural Enhancements

Explore the Potential of CLIP for Training-Free Open Vocabulary Semantic Segmentation

CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction

Effective SAM Combination for Open-Vocabulary Semantic Segmentation

Towards Label-free Scene Understanding by Vision Foundation Models

Collaborative Vision-Text Representation Optimizing for Open-Vocabulary Segmentation