Abstract: We propose a novel unsupervised cross-modal homography estimation framework based on intra-modal Self-supervised learning, Correlation, and consistent feature map Projection, namely SCPNet. The concept of intra-modal self-supervised learning is first presented to facilitate the unsupervised cross-modal homography estimation. The correlation-based homography estimation network and the consistent feature map projection are combined to form the learnable architecture of SCPNet, boosting the unsupervised learning framework. SCPNet is the first to achieve effective unsupervised homography estimation on the satellite-map image pair cross-modal dataset, GoogleMap, under [-32,+32] offset on a 128x128 image, leading the supervised approach MHN by 14.0% of mean average corner error (MACE). We further conduct extensive experiments on several cross-modal/spectral and manually-made inconsistent datasets, on which SCPNet achieves state-of-the-art (SOTA) performance among unsupervised approaches and attains 49.0%, 25.2%, 36.4%, and 10.7% lower MACEs than the supervised approach MHN. Source code is available at https://github.com/RM-Zhang/SCPNet.
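The abstract reports results as mean average corner error (MACE). As a minimal sketch, assuming the usual four-corner parameterization (each homography is described by the 2D displacement of the four image corners), the metric can be computed as below. The function names `corner_error`, `mace`, and `relative_reduction` are hypothetical, and reading the quoted percentages as relative reductions is an assumption, not something the abstract states explicitly.

```python
import numpy as np

def corner_error(pred_offsets, gt_offsets):
    """Average Euclidean distance over the four corners of one image pair."""
    pred = np.asarray(pred_offsets, dtype=float).reshape(4, 2)
    gt = np.asarray(gt_offsets, dtype=float).reshape(4, 2)
    return float(np.linalg.norm(pred - gt, axis=1).mean())

def mace(pred_batch, gt_batch):
    """Mean of the per-pair average corner errors over a dataset."""
    return float(np.mean([corner_error(p, g) for p, g in zip(pred_batch, gt_batch)]))

def relative_reduction(baseline_mace, new_mace):
    """Percentage by which new_mace is lower than baseline_mace (assumed reading)."""
    return 100.0 * (baseline_mace - new_mace) / baseline_mace

# Toy values for illustration only, not results from the paper.
gt = np.zeros((4, 2))
pred = np.tile([3.0, 4.0], (4, 1))   # every corner is off by (3, 4) pixels
print(corner_error(pred, gt))        # 5.0
```

Under this reading, a method with a lower MACE than a baseline yields a positive `relative_reduction`.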

What problem does this paper attempt to address?

The paper addresses unsupervised cross-modal homography estimation, especially in cases with large offsets and significant modality differences. It proposes a new framework, SCPNet, which combines three key components: intra-modal self-supervised learning, correlation, and consistent feature map projection.

### Main Contributions and Problems Solved

1. **Proposed SCPNet Framework**: The framework achieves effective unsupervised cross-modal homography estimation, particularly in scenarios with a large offset range ([-32,+32]) and significant modality differences (e.g., satellite-map image pairs). It surpasses the supervised method MHN, with a 14.0% lower mean average corner error (MACE).
2. **Introduced the Concept of Intra-Modal Self-Supervised Learning**: By applying simulated homography transformations within each of the two modalities, the method extracts self-supervised information from both branches. This lets the network train with intra-modal self-supervision and extend the learned knowledge to cross-modal scenarios.
3. **Combined Correlation and Consistent Feature Map Projection**: These two components together form the unsupervised learning network architecture. The correlation-based homography estimation network helps the model learn more distinct knowledge, while the consistent feature map projection supervises both the cross-modal homography estimation and the projection into a cross-modal consistent latent space, further improving estimation accuracy.

### Experimental Results

- Experimental evaluations were conducted on multiple cross-modal/spectral datasets, including GoogleMap, Flash/no-flash, Harvard, and RGB/NIR. SCPNet demonstrated the best performance among unsupervised approaches across these datasets.
- On the GoogleMap dataset, even with large offsets and significant modality differences, SCPNet estimated homographies stably and accurately, reducing MACE by 37.2% and 14.0% compared to the supervised methods DHN and MHN, respectively.
- On the Flash/no-flash dataset, SCPNet also provided the best performance.

In summary, the paper addresses the problem of unsupervised cross-modal homography estimation, and SCPNet demonstrates superior performance, especially on data with large offsets and significant modality differences.
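The intra-modal self-supervision idea summarized above (warping an image of a single modality with a simulated homography so that the ground-truth transform is known) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' pipeline: the helper names `homography_from_offsets`, `warp`, and `make_intra_modal_sample` are hypothetical, the input is assumed to be a square patch, corner offsets are drawn uniformly from [-32, +32], and nearest-neighbour sampling is used for brevity. The released SCPNet code at the URL in the abstract is the reference for the actual method.

```python
import numpy as np

def homography_from_offsets(offsets, size=128):
    """Solve for the 3x3 homography that moves the four patch corners by `offsets`."""
    src = np.array([[0, 0], [size - 1, 0], [size - 1, size - 1], [0, size - 1]], dtype=float)
    dst = src + np.asarray(offsets, dtype=float).reshape(4, 2)
    rows, rhs = [], []
    for (x, y), (u, v) in zip(src, dst):
        # Two linear equations per correspondence, with H[2, 2] fixed to 1.
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); rhs.append(u)
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y]); rhs.append(v)
    h = np.linalg.solve(np.array(rows), np.array(rhs))
    return np.append(h, 1.0).reshape(3, 3)

def warp(image, H):
    """Warp `image` by H via inverse mapping with nearest-neighbour sampling."""
    h, w = image.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    pts = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])
    src = np.linalg.inv(H) @ pts                 # where each output pixel comes from
    sx = np.rint(src[0] / src[2]).astype(int)
    sy = np.rint(src[1] / src[2]).astype(int)
    valid = (sx >= 0) & (sx < w) & (sy >= 0) & (sy < h)
    flat_in = image.reshape(h * w, -1)
    flat_out = np.zeros_like(flat_in)            # pixels with no source stay zero
    flat_out[valid] = flat_in[sy[valid] * w + sx[valid]]
    return flat_out.reshape(image.shape)

def make_intra_modal_sample(image, rng, max_offset=32):
    """Return (image, warped image, ground-truth corner offsets) for one modality."""
    offsets = rng.uniform(-max_offset, max_offset, size=(4, 2))
    H = homography_from_offsets(offsets, size=image.shape[0])
    return image, warp(image, H), offsets

rng = np.random.default_rng(0)
patch = rng.random((128, 128))                   # stand-in for a satellite or map patch
_, warped, gt_offsets = make_intra_modal_sample(patch, rng)
```

Because both images in such a pair come from the same modality, the sampled offsets can serve as a training target without any cross-modal labels; how SCPNet transfers this to cross-modal pairs is described in the paper itself.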

SCPNet: Unsupervised Cross-modal Homography Estimation via Intra-modal Self-supervised Learning
