Abstract: Matching multimodal remote sensing images (RSIs) remains challenging because of significant nonlinear radiometric differences and geometric distortions, which often produce one-to-many matches or mismatches. To tackle this challenge, we propose a novel approach for multimodal RSI matching called modality-independent consistency matching (MICM), which leverages deep convolutional neural networks and the transformer attention mechanism to improve matching performance. The proposed MICM method consists of three key steps. First, a U-Net-like feature extraction backbone network is employed to learn multiscale invariant features from multimodal RSIs, enabling the extraction of rich and evenly distributed feature keypoints. Second, a hybrid approach combining local learned features with the transformer attention mechanism is introduced to aggregate the learned features, capturing fine detail while modeling long-range dependencies to enhance the representation ability of the features. Third, a feature consistency correlation strategy is adopted to maximize the number of correct corresponding feature points, ensuring reliable matching performance. The proposed method has been extensively evaluated on both same-scene and different-scene multimodal RSIs captured with various imaging modes, wavebands, and platforms. The results show that MICM achieves superior matching performance compared with commonly used and state-of-the-art handcrafted and learning-based methods on both the same-scene and different-scene datasets. The proposed method serves as a valuable reference for addressing common challenges in multimodal RSI matching.
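
The abstract does not say how the feature consistency correlation strategy is implemented. One common way to suppress one-to-many matches is a mutual nearest-neighbor check: a pair is kept only if each descriptor is the other's best match. The sketch below illustrates that general idea under stated assumptions. It is not the MICM implementation. The function names (`cosine`, `mutual_nearest_matches`), the cosine similarity measure, and the `min_sim` threshold are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length descriptor vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mutual_nearest_matches(desc_a, desc_b, min_sim=0.0):
    """Return (i, j, similarity) for pairs that are each other's best match.

    desc_a, desc_b: lists of descriptor vectors from the two images.
    min_sim: assumed similarity threshold (hypothetical, not from the paper).
    """
    if not desc_a or not desc_b:
        return []
    # Dense similarity table: sim[i][j] compares desc_a[i] with desc_b[j].
    sim = [[cosine(u, v) for v in desc_b] for u in desc_a]
    # Best candidate in image B for every descriptor in image A.
    best_b = [max(range(len(desc_b)), key=row.__getitem__) for row in sim]
    # Best candidate in image A for every descriptor in image B.
    best_a = [
        max(range(len(desc_a)), key=lambda i, j=j: sim[i][j])
        for j in range(len(desc_b))
    ]
    matches = []
    for i, j in enumerate(best_b):
        # Keep the pair only when the choice agrees in both directions.
        if best_a[j] == i and sim[i][j] >= min_sim:
            matches.append((i, j, sim[i][j]))
    return matches
```

In this sketch, two descriptors in one image that both prefer the same descriptor in the other image cannot both survive, because only the pair confirmed in the reverse direction is kept. A learned matcher would typically replace the hand-written similarity with scores produced by the network.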
