SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial Understanding

Haoxiang Wang, Pavan Kumar Anasosalu Vasu, Fartash Faghri, Raviteja Vemulapalli, Mehrdad Farajtabar, Sachin Mehta, Mohammad Rastegari, Oncel Tuzel, Hadi Pouransari
2024-06-11
Abstract: The landscape of publicly available vision foundation models (VFMs), such as CLIP and Segment Anything Model (SAM), is expanding rapidly. VFMs are endowed with distinct capabilities stemming from their pre-training objectives. For instance, CLIP excels in semantic understanding, while SAM specializes in spatial understanding for segmentation. In this work, we introduce a simple recipe to efficiently merge VFMs into a unified model that absorbs their expertise. Our method integrates techniques of multi-task learning, continual learning, and distillation. Further, it demands significantly less computational cost compared to traditional multi-task training from scratch, and it only needs a small fraction of the pre-training datasets that were initially used to train individual models. By applying our method to SAM and CLIP, we obtain SAM-CLIP: a unified model that combines the capabilities of SAM and CLIP into a single vision transformer. Compared with deploying SAM and CLIP independently, our merged model, SAM-CLIP, reduces storage and compute costs for inference, making it well-suited for edge device applications. We show that SAM-CLIP not only retains the foundational strengths of SAM and CLIP, but also introduces synergistic functionalities, notably in zero-shot semantic segmentation, where SAM-CLIP establishes new state-of-the-art results on 5 benchmarks. It outperforms previous models that are specifically designed for this task by a large margin, including +6.8% and +5.9% mean IoU improvement on Pascal-VOC and COCO-Stuff datasets, respectively.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
The paper addresses the cost of maintaining several vision foundation models (VFMs) side by side. Publicly available VFMs such as CLIP and the Segment Anything Model (SAM) each have distinct strengths: CLIP is good at semantic understanding, while SAM is good at spatial understanding for segmentation. Deploying a separate model for each downstream task is inefficient, since it increases memory use and running time (a particular concern on edge devices), and it gives the models no opportunity to learn from one another. Multi-task learning can address this, but it usually carries a high training cost and requires simultaneous access to the data for all tasks. Training foundation models also typically relies on unsupervised or semi-supervised methods, which demand large amounts of compute. To overcome these challenges, the paper proposes an efficient method for merging VFMs with different pre-training objectives into a single unified model, SAM-CLIP. The method combines techniques from multi-task learning, continual learning, and knowledge distillation, and it aims to reduce storage and compute costs so that the merged model is better suited to edge-device applications. SAM-CLIP retains the core strengths of both SAM and CLIP and also shows synergistic capabilities on new tasks such as zero-shot semantic segmentation, where it sets a new state of the art on five benchmarks. Specifically, SAM-CLIP improves mean Intersection over Union (mean IoU) by 6.8% and 5.9% on the Pascal-VOC and COCO-Stuff datasets, respectively, significantly outperforming previous models designed specifically for this task.
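For readers unfamiliar with the mean IoU metric quoted above, the following is a minimal Python sketch of how it is commonly computed for semantic segmentation. It is an illustration only, not the paper's evaluation code: the function name `mean_iou`, the flattened label lists, and the `ignore_index=255` convention for unlabeled pixels are all assumptions made for this example.

```python
from typing import Sequence


def mean_iou(
    pred: Sequence[int],
    target: Sequence[int],
    num_classes: int,
    ignore_index: int = 255,  # assumed "void" label; not taken from the paper
) -> float:
    """Mean Intersection over Union across classes, over flattened pixel labels."""
    intersection = [0] * num_classes
    union = [0] * num_classes

    for p, t in zip(pred, target):
        if t == ignore_index:
            continue  # skip pixels that have no ground-truth label
        if p == t:
            # Correct pixel: counts once in both intersection and union.
            intersection[t] += 1
            union[t] += 1
        else:
            # Wrong pixel: enlarges the union of the true class and,
            # if valid, the union of the predicted class.
            union[t] += 1
            if 0 <= p < num_classes:
                union[p] += 1

    # Average only over classes that appear in the prediction or the target.
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    pred = [0, 0, 1, 1, 2, 2]
    target = [0, 1, 1, 1, 2, 0]
    # Per-class IoU here is 1/3, 2/3 and 1/2.
    print(mean_iou(pred, target, num_classes=3))
```

Per-class IoU is the number of pixels where prediction and ground truth agree on a class, divided by the number of pixels where either one assigns that class. Averaging over classes rather than pixels keeps frequent classes such as background from dominating the score.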