SCPNet: Unsupervised Cross-modal Homography Estimation via Intra-modal Self-supervised Learning

Runmin Zhang, Jun Ma, Si-Yuan Cao, Lun Luo, Beinan Yu, Shu-Jie Chen, Junwei Li, Hui-Liang Shen
2024-07-11
Abstract: We propose a novel unsupervised cross-modal homography estimation framework based on intra-modal Self-supervised learning, Correlation, and consistent feature map Projection, namely SCPNet. The concept of intra-modal self-supervised learning is first presented to facilitate the unsupervised cross-modal homography estimation. The correlation-based homography estimation network and the consistent feature map projection are combined to form the learnable architecture of SCPNet, boosting the unsupervised learning framework. SCPNet is the first to achieve effective unsupervised homography estimation on the satellite-map image pair cross-modal dataset, GoogleMap, under [-32,+32] offset on a 128x128 image, leading the supervised approach MHN by 14.0% of mean average corner error (MACE). We further conduct extensive experiments on several cross-modal/spectral and manually-made inconsistent datasets, on which SCPNet achieves the state-of-the-art (SOTA) performance among unsupervised approaches, and owns 49.0%, 25.2%, 36.4%, and 10.7% lower MACEs than the supervised approach MHN. Source code is available at https://github.com/RM-Zhang/SCPNet.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses unsupervised cross-modal homography estimation, particularly in cases with large offsets and significant modality differences. It proposes a new framework called SCPNet, which combines three key components: intra-modal self-supervised learning, correlation, and consistent feature map projection.

### Main Contributions and Problems Solved

1. **Proposed SCPNet Framework**: The framework achieves effective unsupervised cross-modal homography estimation under a large offset range ([-32,+32] on a 128x128 image) and significant modality differences (e.g., satellite-map image pairs). It outperforms the supervised method MHN, with a 14.0% lower mean average corner error (MACE).
2. **Introduced the Concept of Intra-Modal Self-Supervised Learning**: Simulated homography transformations are applied within each of the two modalities, which yields self-supervised information from both branches. The network trains on this intra-modal supervision and carries the learned knowledge over to the cross-modal setting.
3. **Combined Correlation and Consistent Feature Map Projection**: These two components together form the learnable architecture of the unsupervised framework. The correlation-based network estimates the homography, while the consistent feature map projection supervises cross-modal homography estimation and the projection into a cross-modal consistent latent space, further improving estimation accuracy.

### Experimental Results

- Experimental evaluations were conducted on multiple cross-modal/spectral datasets, including GoogleMap, Flash/no-flash, Harvard, and RGB/NIR. SCPNet achieved state-of-the-art performance among unsupervised approaches on these datasets.
- On the GoogleMap dataset, despite large offsets and significant modality differences, SCPNet estimated the homography stably and accurately, reducing MACE by 37.2% and 14.0% compared to the supervised methods DHN and MHN, respectively.
- On the Flash/no-flash dataset, SCPNet also provided the best performance.

In summary, the paper addresses the problem of unsupervised cross-modal homography estimation, and SCPNet demonstrates superior performance, especially on data with large offsets and significant modality differences.
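
The offset range and the corner-error metric mentioned above can be illustrated with a short NumPy sketch. It is not taken from the SCPNet source code. It assumes the common 4-point parameterization, in which a homography is defined by displacing the four image corners, and it computes the average corner error as the mean Euclidean distance between predicted and ground-truth corners (MACE would then be the mean of this value over a test set). All function names are hypothetical.

```python
import numpy as np


def homography_from_corners(src, dst):
    """Solve the 3x3 homography that maps four src points to four dst points (DLT)."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    A = np.asarray(rows, dtype=np.float64)
    # The homography is the null vector of A: the last right-singular vector.
    _, _, vt = np.linalg.svd(A)
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]


def apply_homography(H, pts):
    """Map an (N, 2) array of points through H using homogeneous coordinates."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = pts_h @ H.T
    return mapped[:, :2] / mapped[:, 2:3]


def sample_corner_offsets(rng, max_offset=32):
    """Draw a random displacement in [-max_offset, +max_offset] for each corner."""
    return rng.uniform(-max_offset, max_offset, size=(4, 2))


def average_corner_error(pred_corners, gt_corners):
    """Mean Euclidean distance between predicted and ground-truth corners."""
    return float(np.mean(np.linalg.norm(pred_corners - gt_corners, axis=1)))


# Simulate one training label on a 128x128 image (assumed setup).
rng = np.random.default_rng(0)
size = 128
corners = np.array(
    [[0, 0], [size - 1, 0], [size - 1, size - 1], [0, size - 1]],
    dtype=np.float64,
)
offsets = sample_corner_offsets(rng)
gt_corners = corners + offsets
H_gt = homography_from_corners(corners, gt_corners)

# An identity prediction (zero offsets) stands in for a network output here.
ace_identity = average_corner_error(corners, gt_corners)
# Warping the corners with the recovered homography reproduces the labels.
ace_recovered = average_corner_error(apply_homography(H_gt, corners), gt_corners)
```

In an intra-modal setup of this kind, warping an image of a single modality with `H_gt` would give an image pair whose homography is known by construction, so no cross-modal ground truth is needed for that branch.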