Abstract: Image matching for both cross-view and cross-modality plays a critical role in multimodal perception. In practice, the modality gap caused by different imaging systems/styles poses great challenges to the matching task. Existing works try to extract invariant features for specific modalities and train on limited datasets, showing poor generalization. In this paper, we present MINIMA, a unified image matching framework for multiple cross-modal cases. Without pursuing fancy modules, our MINIMA aims to enhance universal performance from the perspective of data scaling up. For such purpose, we propose a simple yet effective data engine that can freely produce a large dataset containing multiple modalities, rich scenarios, and accurate matching labels. Specifically, we scale up the modalities from cheap but rich RGB-only matching data, by means of generative models. Under this setting, the matching labels and rich diversity of the RGB dataset are well inherited by the generated multimodal data. Benefiting from this, we construct MD-syn, a new comprehensive dataset that fills the data gap for general multimodal image matching. With MD-syn, we can directly train any advanced matching pipeline on randomly selected modality pairs to obtain cross-modal ability. Extensive experiments on in-domain and zero-shot matching tasks, including 19 cross-modal cases, demonstrate that our MINIMA can significantly outperform the baselines and even surpass modality-specific methods. The dataset and code are available at https://github.com/LSXI7/MINIMA.

What problem does this paper attempt to address?

This paper addresses cross-view and cross-modal image matching, where the modality gap between different imaging systems or styles makes the matching task difficult. Existing methods typically extract invariant features for specific modalities and train on limited datasets, which restricts their generalization. The paper therefore aims to build a unified image matching framework that handles multiple cross-modal cases and improves general performance by scaling up the training data.

Specifically, the paper proposes MINIMA (Modality Invariant Image Matching), a unified framework for multimodal image matching. To address the small scale and insufficient scene coverage of existing datasets, the authors introduce a simple yet effective data engine. Using generative models, the engine produces a large dataset with multiple modalities, rich scenes, and accurate matching labels from inexpensive but abundant RGB-only matching data. Because the additional modalities are generated from the RGB images, the matching labels and diversity of the RGB dataset carry over to the generated data. With this engine, the authors construct MD-syn, a new comprehensive dataset that fills the data gap in multimodal image matching.

### Main Problem Summary:

1. **Modality Gap**: Differences between imaging systems or styles complicate cross-modal matching.
2. **Dataset Limitations**: Existing cross-modal datasets are small and cover too few scenes, which limits model generalization.
3. **Poor Generalization Ability**: Existing methods extract matching features for specific modalities and perform poorly on others.

### Solutions:

- **Unified Framework**: Propose MINIMA, a unified matching framework for multiple cross-modal cases.
- **Data Generation Engine**: Use generative models to produce a large-scale multimodal dataset from RGB matching data, preserving data diversity and label accuracy.
- **New Dataset**: Construct the MD-syn dataset to fill the data gap in multimodal image matching and support a wider range of matching tasks.

With these components, MINIMA significantly outperforms the baselines on in-domain and zero-shot matching tasks, and even surpasses modality-specific methods.
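The idea of training on randomly selected modality pairs while reusing the RGB matching labels can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the `MatchingPair` type, the `sample_cross_modal_pair` helper, the modality names, and the per-pixel "translators" are all hypothetical stand-ins for the generative models the paper describes. The only property the sketch relies on is that each translation keeps the pixel layout of the RGB image, so the original correspondences stay valid.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Image = List[List[float]]                        # toy stand-in for an image array
Match = Tuple[Tuple[int, int], Tuple[int, int]]  # (pixel in A, pixel in B)


@dataclass
class MatchingPair:
    image_a: Image
    image_b: Image
    matches: List[Match]
    modality_a: str = "rgb"
    modality_b: str = "rgb"


# Hypothetical per-pixel "translators" standing in for generative models.
# Each one keeps the image size and pixel layout, so RGB labels stay valid.
TRANSLATORS: Dict[str, Callable[[Image], Image]] = {
    "rgb": lambda img: [row[:] for row in img],
    "infrared": lambda img: [[1.0 - v for v in row] for row in img],
    "sketch": lambda img: [[1.0 if v > 0.5 else 0.0 for v in row] for row in img],
}


def sample_cross_modal_pair(pair: MatchingPair, rng: random.Random) -> MatchingPair:
    """Pick a random modality for each image and reuse the RGB matching labels."""
    names = list(TRANSLATORS)
    mod_a, mod_b = rng.choice(names), rng.choice(names)
    return MatchingPair(
        image_a=TRANSLATORS[mod_a](pair.image_a),
        image_b=TRANSLATORS[mod_b](pair.image_b),
        matches=list(pair.matches),  # labels are inherited unchanged
        modality_a=mod_a,
        modality_b=mod_b,
    )


rgb_pair = MatchingPair(
    image_a=[[0.1, 0.9], [0.4, 0.6]],
    image_b=[[0.9, 0.1], [0.6, 0.4]],
    matches=[((0, 0), (0, 1)), ((1, 1), (1, 0))],
)
sample = sample_cross_modal_pair(rgb_pair, random.Random(0))
print(sample.modality_a, sample.modality_b, sample.matches)
```

In a real pipeline the translated images would presumably be generated once and stored, with the sampling step only choosing which stored modality to load for each side of the pair.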

MINIMA: Modality Invariant Image Matching

Multimodal image matching: A scale-invariant algorithm and an open dataset

REMM: Rotation-Equivariant Framework for End-to-End Multimodal Image Matching

Cross-modal Image Retrieval with Deep Mutual Information Maximization

Modality-Adaptive Mixup and Invariant Decomposition for RGB-Infrared Person Re-Identification

Multimodal Remote Sensing Image Matching via Learning Features and Attention Mechanism

Self-reinforcing Unsupervised Matching

Cross-Modal Object Tracking: Modality-Aware Representations and a Unified Benchmark

MXM-CLR: A Unified Framework for Contrastive Learning of Multifold Cross-Modal Representations

Cross-Modal Object Tracking via Modality-Aware Fusion Network and a Large-Scale Dataset

MiC: Image-text Matching in Circles with cross-modal generative knowledge enhancement

Alleviating the Inconsistency of Multimodal Data in Cross-Modal Retrieval

Achieving Cross Modal Generalization with Multimodal Unified Representation

Cross-Modal Information Maximization for Medical Imaging: CMIM

Inter-Modality Similarity Learning for Unsupervised Multi-Modality Person Re-Identification

Mind the Gap: Learning Modality-Agnostic Representations With a Cross-Modality UNet

Multi-Modality Cross Attention Network for Image and Sentence Matching

SimCMF: A Simple Cross-modal Fine-tuning Strategy from Vision Foundation Models to Any Imaging Modality

A New Invariant Feature for Multi-Modal Images Matching

Cross-Modality Image Matching Network with Modality-Invariant Feature Representation for Airborne-Ground Thermal Infrared and Visible Datasets

Efficient Bilateral Cross-Modality Cluster Matching for Unsupervised Visible-Infrared Person ReID