Abstract: Multimodal image matching is fundamental to information fusion, change detection, and image-based navigation. However, multimodal images may simultaneously suffer from severe nonlinear radiation distortion (NRD) and complex geometric differences, which pose great challenges to existing methods. Although deep learning-based methods have shown potential in image matching, they mainly focus on same-source images or on a single type of multimodal image pair, such as optical-synthetic aperture radar (SAR). One of the main obstacles is the lack of public data covering different types of multimodal images. In this paper, we make two major contributions to the multimodal image matching community. First, we collect six typical types of image pairs, including optical-optical, optical-infrared, optical-SAR, optical-depth, optical-map, and nighttime, to construct a multimodal image dataset with a total of 1200 pairs. This dataset is diverse in image categories, feature classes, resolutions, geometric variations, etc. Second, we propose a scale and rotation invariant feature transform (SRIF) method, which achieves good matching performance without relying on data characteristics. This is one of the advantages of SRIF over deep learning methods. SRIF obtains the scales of FAST keypoints by projecting them into a simple pyramid scale space, motivated by the observation that methods with and without a scale space perform similarly when the scale change factor is small. This strategy greatly reduces the complexity compared with the traditional Gaussian scale space. SRIF also introduces a local intensity binary transform (LIBT) for SIFT-like feature description, which substantially enhances the structure information inside multimodal images. Extensive experiments on these 1200 image pairs show that SRIF outperforms current state-of-the-art methods, including RIFT, CoFSM, LNIFT, and MS-HLMO, by a large margin.
Both the dataset and the code of SRIF will be publicly available at https://github.com/LJY-RS/SRIF.
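
As a rough illustration of the idea of projecting keypoints into a simple pyramid scale space, the following minimal sketch maps keypoints detected once at full resolution onto each level of a downsampling pyramid. It is not the authors' implementation: the function name `project_keypoints`, the number of levels, the scale factor, and the choice to copy every keypoint to every level are all assumptions made for illustration, and the keypoints are assumed to come from a FAST detector run elsewhere.

```python
from typing import List, Tuple

Keypoint = Tuple[float, float]                      # (x, y) at full resolution
ScaledKeypoint = Tuple[float, float, int, float]    # (x, y, level, scale)


def project_keypoints(
    keypoints: List[Keypoint],
    num_levels: int = 4,          # assumed value, not taken from the paper
    scale_factor: float = 1.6,    # assumed value, not taken from the paper
) -> List[ScaledKeypoint]:
    """Project full-resolution keypoints into each level of a simple pyramid.

    Level 0 is the original image; level l is downsampled by scale_factor ** l.
    No Gaussian blurring or per-level detection is performed here: each
    keypoint is only re-expressed in the coordinates of every level.
    """
    projected: List[ScaledKeypoint] = []
    for level in range(num_levels):
        scale = scale_factor ** level
        for x, y in keypoints:
            # Coordinates shrink by the same factor as the image at this level.
            projected.append((x / scale, y / scale, level, scale))
    return projected


if __name__ == "__main__":
    # One hypothetical keypoint, projected into a 3-level pyramid.
    for kp in project_keypoints([(64.0, 32.0)], num_levels=3, scale_factor=2.0):
        print(kp)
```

Because detection happens only once, the cost of this step grows with the number of keypoints times the number of levels rather than with repeated filtering of the whole image, which is the kind of saving the abstract attributes to avoiding a Gaussian scale space.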

Attention-based multimodal image matching

Select & Re-Rank: Effectively and Efficiently Matching Multimodal Data with Dynamically Evolving Attention

Embrace Smaller Attention: Efficient Cross-Modal Matching with Dual Gated Attention Fusion

Multimodal Remote Sensing Image Matching via Learning Features and Attention Mechanism

Multi-Modality Cross Attention Network for Image and Sentence Matching

TS-Net: Combining modality specific and common features for multimodal patch matching

Multi-Modal Image Fusion Via Deep Laplacian Pyramid Hybrid Network

Dual Semantic Relationship Attention Network for Image-Text Matching

Multi-level network based on transformer encoder for fine-grained image–text matching

Two-Stream Convolutional Neural Network For Multimodal Matching

TransMatch: A Transformer-Based Multilevel Dual-Stream Feature Matching Network for Unsupervised Deformable Image Registration

Multimodal Transformer With Multi-View Visual Representation for Image Captioning

Improving Transformer-based Image Matching by Cascaded Capturing Spatially Informative Keypoints

A Concurrent Multiscale Detector for End-to-End Image Matching

A Semi-Supervised Image Registration Framework Based on Multimodal Cross-Attention

Shape-Former: Bridging CNN and Transformer via ShapeConv for multimodal image matching

Multimodal image matching: A scale-invariant algorithm and an open dataset

Instance-aware Image and Sentence Matching with Selective Multimodal LSTM

Comateformer: Combined Attention Transformer for Semantic Sentence Matching

Cross-Modal Attention With Semantic Consistence for Image–Text Matching