Abstract: A handful of visual foundation models (VFMs) have recently emerged as the backbones for numerous downstream tasks. VFMs like CLIP, DINOv2, SAM are trained with distinct objectives, exhibiting unique characteristics for various downstream tasks. We find that despite their conceptual differences, these models can be effectively merged into a unified model through multi-teacher distillation. We name this approach AM-RADIO (Agglomerative Model -- Reduce All Domains Into One). This integrative approach not only surpasses the performance of individual teacher models but also amalgamates their distinctive features, such as zero-shot vision-language comprehension, detailed pixel-level understanding, and open vocabulary segmentation capabilities. In pursuit of the most hardware-efficient backbone, we evaluated numerous architectures in our multi-teacher distillation pipeline using the same training recipe. This led to the development of a novel architecture (E-RADIO) that exceeds the performance of its predecessors and is at least 7x faster than the teacher models. Our comprehensive benchmarking process covers downstream tasks including ImageNet classification, ADE20k semantic segmentation, COCO object detection and LLaVa-1.5 framework.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper aims to merge multiple pre-trained Vision Foundation Models (VFMs) into a single model through a multi-teacher distillation method. Specifically, the authors propose a framework named AM-RADIO, which can integrate the unique attributes of models like CLIP, DINOv2, and SAM into one model. This allows for state-of-the-art feature representation in a single forward pass and enables tasks such as zero-shot classification and open-vocabulary instance segmentation with almost no additional cost.

### Main Contributions

1. **Multi-Teacher Distillation Method**: A general method is proposed to distill multiple foundation models with different input resolutions into one model.
2. **Performance Surpassing Teacher Models**: It is demonstrated that these student models can surpass their teacher models' performance in representative benchmarks.
3. **Downstream Task Compatibility**: The student models can directly replace the teacher models, or their features can be directly used in downstream applications, such as providing visual encoding for LLaVA.
4. **Efficient Architecture**: Various efficient architectures are evaluated, and a new hybrid architecture (E-RADIO) is proposed, which significantly improves inference speed while maintaining model quality.

### Method Overview

- **Teacher Model Selection**: CLIP, DINOv2, and SAM are chosen as teacher models due to their excellent performance in different tasks.
- **Distillation Dataset**: The DataComp-1B dataset is used for training to better measure "zero-shot" performance.
- **Loss Function**: A composite loss function is adopted, including cosine similarity loss for feature matching and smooth L1 loss to ensure the student model learns the feature representations of the teacher models.
- **Student Model Architecture**: Two student model architectures are studied: the standard ViT architecture and an efficient architecture variant, the latter prioritizing high throughput on GPUs.

### Experimental Results

- **Image Classification**: The AM-RADIO model performs excellently in k-NN and zero-shot classification tasks on ImageNet-1K.
- **Pixel-Level Tasks**: The AM-RADIO model also achieves good results in semantic segmentation tasks on ADE20K and Pascal VOC.
- **Vision-Language Models**: The AM-RADIO model performs well in multiple tasks under the LLaVA-1.5 framework.
- **Instance Segmentation**: The AM-RADIO model can effectively replicate SAM's visual features in the COCO instance segmentation task.

### Conclusion

Through the AM-RADIO framework, the authors successfully integrate the characteristics of multiple Vision Foundation Models into a single model, achieving outstanding performance in various benchmarks and providing an efficient solution in resource-constrained environments.
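The composite loss mentioned in the Method Overview (a cosine similarity term plus a smooth L1 term, applied against each teacher) can be illustrated with a small sketch. This is not the authors' implementation: it uses plain Python lists instead of a tensor library for readability, and the function names, the per-teacher dictionary layout, the idea of one student output per teacher, and the mixing weights `cos_weight` and `l1_weight` are all assumptions made for illustration.

```python
import math


def cosine_distance(student, teacher):
    """Return 1 - cosine similarity of two equal-length feature vectors."""
    dot = sum(s * t for s, t in zip(student, teacher))
    norm_s = math.sqrt(sum(s * s for s in student))
    norm_t = math.sqrt(sum(t * t for t in teacher))
    return 1.0 - dot / (norm_s * norm_t + 1e-8)


def smooth_l1(student, teacher, beta=1.0):
    """Mean smooth L1 loss: quadratic for small errors, linear for large ones."""
    total = 0.0
    for s, t in zip(student, teacher):
        diff = abs(s - t)
        total += 0.5 * diff * diff / beta if diff < beta else diff - 0.5 * beta
    return total / len(student)


def multi_teacher_loss(student_feats, teacher_feats, cos_weight=0.5, l1_weight=0.5):
    """Sum the composite feature-matching loss over all teachers.

    Both arguments map a teacher name to a feature vector; the student
    vector is assumed to come from an output dedicated to that teacher.
    The weights are illustrative placeholders, not values from the text.
    """
    total = 0.0
    for name, target in teacher_feats.items():
        pred = student_feats[name]
        total += cos_weight * cosine_distance(pred, target)
        total += l1_weight * smooth_l1(pred, target)
    return total


# Toy example with made-up 3-dimensional features for two teachers.
teachers = {"clip": [1.0, 0.0, 0.0], "dinov2": [0.0, 2.0, 0.0]}
student = {"clip": [0.9, 0.1, 0.0], "dinov2": [0.0, 1.5, 0.5]}
print(multi_teacher_loss(student, teachers))
```

The cosine term only penalizes a mismatch in direction, while the smooth L1 term also penalizes a mismatch in magnitude, which is one plausible reason to combine the two when the student must reproduce teacher features closely.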

AM-RADIO: Agglomerative Vision Foundation Model -- Reduce All Domains Into One

SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial Understanding

ViM: Vision Middleware for Unified Downstream Transferring

V2A-Mapper: A Lightweight Solution for Vision-to-Audio Generation by Connecting Foundation Models

TransAgent: Transfer Vision-Language Foundation Models with Heterogeneous Agent Collaboration

MMFuser: Multimodal Multi-Layer Feature Fuser for Fine-Grained Vision-Language Understanding

Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks

Unmasked Teacher: Towards Training-Efficient Video Foundation Models

FMViT: A multiple-frequency mixing Vision Transformer

OV-NeRF: Open-vocabulary Neural Radiance Fields with Vision and Language Foundation Models for 3D Semantic Understanding

ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning

Progressive Volume Distillation with Active Learning for Efficient NeRF Architecture Conversion

One for All: Toward Unified Foundation Models for Earth Vision

Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone

A Progressive Framework of Vision-language Knowledge Distillation and Alignment for Multilingual Scene

Divide and Conquer: Rethinking the Training Paradigm of Neural Radiance Fields

One to Transfer All: A Universal Transfer Framework for Vision Foundation Model with Few Data

A Unified Audio-Visual Learning Framework for Localization, Separation, and Recognition

EfficientVLM: Fast and Accurate Vision-Language Models via Knowledge Distillation and Modal-adaptive Pruning

Forging Vision Foundation Models for Autonomous Driving: Challenges, Methodologies, and Opportunities

AMD: Automatic Multi-step Distillation of Large-scale Vision Models