SimCMF: A Simple Cross-modal Fine-tuning Strategy from Vision Foundation Models to Any Imaging Modality

Chenyang Lei, Liyi Chen, Jun Cen, Xiao Chen, Zhen Lei, Felix Heide, Qifeng Chen, Zhaoxiang Zhang

2024-11-28

Abstract: Foundation models like ChatGPT and Sora that are trained on a huge scale of data have made a revolutionary social impact. However, it is extremely challenging for sensors in many different fields to collect similar scales of natural images to train strong foundation models. To this end, this work presents a simple and effective framework, SimCMF, to study an important problem: cross-modal fine-tuning from vision foundation models trained on natural RGB images to other imaging modalities of different physical properties (e.g., polarization). In SimCMF, we conduct a thorough analysis of different basic components from the most naive design and ultimately propose a novel cross-modal alignment module to address the modality misalignment problem. We apply SimCMF to a representative vision foundation model Segment Anything Model (SAM) to support any evaluated new imaging modality. Given the absence of relevant benchmarks, we construct a benchmark for performance evaluation. Our experiments confirm the intriguing potential of transferring vision foundation models in enhancing other sensors' performance. SimCMF can improve the segmentation performance (mIoU) from 22.15% to 53.88% on average for evaluated modalities and consistently outperforms other baselines. The code is available at https://github.com/mt-cly/SimCMF

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper addresses the challenge of transferring vision foundation models trained on natural RGB images to other imaging modalities with different physical properties, such as polarization. Specifically, it explores how cross-modal fine-tuning can adapt a powerful vision foundation model to a new imaging modality when only limited data is available, thereby improving performance on that modality. The proposed method, SimCMF (Simple Cross-modal Fine-tuning), targets two main challenges:

1. **Modality Misalignment**: Different imaging modalities may capture completely different physical signals, so their data representations can differ significantly in dimension, dynamic range, and semantic information. This misalignment makes it difficult to apply pre-trained foundation models directly.
2. **Fine-tuning Cost**: As foundation models grow rapidly in scale, the cost of fine-tuning them grows as well. Efficient, parameter-economical fine-tuning strategies are therefore crucial for practical applications.

To address these challenges, SimCMF introduces a cross-modal alignment module that handles the misalignment between the target modality and the RGB modality the model was pre-trained on. The paper also systematically analyzes different fine-tuning strategies, including full fine-tuning (FFT) and parameter-efficient fine-tuning (PEFT), to identify which works best.

Experiments on multiple imaging modalities verify the effectiveness of SimCMF, with significant gains on segmentation tasks. For example, on the constructed AIMS benchmark, SimCMF raises the average mIoU from 22.15% to 53.88%, demonstrating its applicability across different imaging modalities.
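The reported metric, mIoU (mean intersection-over-union), can be illustrated with a minimal sketch. This is not the paper's evaluation code: the function name `mean_iou`, the toy label maps, and the choice to skip classes absent from both prediction and ground truth are assumptions made for illustration only.

```python
def mean_iou(pred, target, num_classes):
    """Average per-class IoU over classes that appear in pred or target.

    pred and target are flat sequences of integer class labels of equal length.
    Skipping classes absent from both is an assumption of this sketch.
    """
    ious = []
    for c in range(num_classes):
        # Pixels where both prediction and ground truth are class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels where either prediction or ground truth is class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Hypothetical flattened label maps for a 2x3 image with two classes.
pred = [0, 0, 1, 1, 1, 0]
target = [0, 1, 1, 1, 0, 0]
score = mean_iou(pred, target, num_classes=2)

# Absolute gain quoted in the summary, in percentage points.
gain_points = round(53.88 - 22.15, 2)
```

In this toy case each class has 2 overlapping pixels out of a union of 4, so `score` is 0.5. The improvement quoted above, from 22.15% to 53.88%, corresponds to an absolute gain of 31.73 percentage points.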

SimMAT: Exploring Transferability from Vision Foundation Models to Any Image Modality

Adapting the Segment Anything Model for Multi-modal Retinal Anomaly Detection and Localization

CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation with Transformers

Segment Anything with Multiple Modalities

SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial Understanding

CMF: Cascaded Multi-Model Fusion for Referring Image Segmentation

Customize Segment Anything Model for Multi-Modal Semantic Segmentation with Mixture of LoRA Experts

Foundation Model-Based Multimodal Remote Sensing Data Classification

Image-to-Image Matching via Foundation Models: A New Perspective for Open-Vocabulary Semantic Segmentation

Cross-Modal Progressive Comprehension for Referring Segmentation

DCMFNet: Deep Cross-Modal Fusion Network for Different Modalities with Iterative Gated Fusion

4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities

Simple Scalable Multimodal Semantic Segmentation Model

A Crossmodal Multiscale Fusion Network for Semantic Segmentation of Remote Sensing Data

Fine-Grained Cross-Modal Semantic Consistency in Natural Conservation Image Data from a Multi-Task Perspective

Improving the Generalization of Visual Classification Models Across IoT Cameras via Cross-modal Inference and Fusion

CMEFusion: Cross-Modal Enhancement and Fusion of FIR and Visible Images

SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-modal Fusion