InterNet: Unsupervised Cross-modal Homography Estimation Based on Interleaved Modality Transfer and Self-supervised Homography Prediction

Junchen Yu, Si-Yuan Cao, Runmin Zhang, Chenghao Zhang, Jianxin Hu, Zhu Yu, Beinan Yu, Hui-liang Shen
2024-09-27
Abstract: We propose a novel unsupervised cross-modal homography estimation framework, based on interleaved modality transfer and self-supervised homography prediction, named InterNet. InterNet integrates modality transfer and self-supervised homography estimation, introducing an innovative interleaved optimization framework to alternately promote both components. The modality transfer gradually narrows the modality gaps, facilitating the self-supervised homography estimation to fully leverage the synthetic intra-modal data. The self-supervised homography estimation progressively achieves reliable predictions, thereby providing robust cross-modal supervision for the modality transfer. To further boost the estimation accuracy, we also formulate a fine-grained homography feature loss to improve the connection between two components. Furthermore, we employ a simple yet effective distillation training technique to reduce model parameters and improve cross-domain generalization ability while maintaining comparable performance. Experiments reveal that InterNet achieves the state-of-the-art (SOTA) performance among unsupervised methods, and even outperforms many supervised methods such as MHN and LocalTrans.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The problem this paper addresses is homography estimation between cross-modal images. Specifically, the authors propose InterNet, a new unsupervised framework for estimating the homography between images of different modalities. The problem matters in practical computer vision tasks such as robot localization without GPS signals, multi-modal image inpainting, and multi-spectral image fusion.

### Problem Background

Traditional homography estimation methods usually rely on labeled data for supervised learning. However, because multi-modal images are captured by different imaging sensors, the true homography between them is usually unknown, so sufficient labeled data is hard to obtain. Existing unsupervised methods mainly estimate the homography by optimizing the similarity between the warped source image and the target image, but they perform poorly under large deformations and modality gaps.

### Core Contributions of the Paper

1. **Proposing a new unsupervised cross-modal homography estimation framework, InterNet**:
   - InterNet combines modality transfer and self-supervised homography prediction. By alternately optimizing these two modules, it gradually narrows the modality gap and improves the accuracy of cross-modal homography estimation.
2. **Introducing an interleaved optimization framework**:
   - Inspired by the alternating direction method of multipliers (ADMM) and the split Bregman method, InterNet adopts an interleaved optimization strategy that decomposes a complex optimization problem into more tractable sub-problems to achieve better convergence.
3. **Fine-grained Homography Feature Loss (FGHomo Loss)**:
   - To further strengthen the mutual promotion between the two modules, the authors propose a fine-grained homography feature loss that constrains feature consistency in the homography estimation module.
4. **Distillation Training Technique**:
   - A simple distillation training technique significantly reduces the number of model parameters and improves cross-domain generalization while maintaining comparable performance.

### Experimental Results

Experiments show that InterNet achieves state-of-the-art performance among unsupervised methods on multiple datasets, and in some cases even outperforms supervised methods. For example, on the GoogleMap and WHU-OPT-SAR datasets, the mean average corner error (MACE) of InterNet is 54.3% and 47.4% lower than that of MHN, respectively, and 61.8% and 85.8% lower than that of LocalTrans, respectively.

### Summary

The main contribution of this paper is InterNet, an unsupervised cross-modal homography estimation framework. By interleaving the optimization of modality transfer and self-supervised homography prediction, it addresses homography estimation under large modality gaps and large deformations, and it performs strongly on multiple benchmark datasets.
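
The MACE figures quoted in the experimental results measure how far the four image corners land from their true positions after warping. In the homography literature MACE is conventionally expanded as mean average corner error. The sketch below illustrates that convention in plain Python; it is not the authors' evaluation code. The function names, the nested-list matrix representation, and the choice of `(width - 1, height - 1)` as the far corner are all assumptions made for illustration.

```python
import math


def warp_point(H, x, y):
    """Apply a 3x3 homography H (nested lists, row-major) to the point (x, y)."""
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    u = (H[0][0] * x + H[0][1] * y + H[0][2]) / w
    v = (H[1][0] * x + H[1][1] * y + H[1][2]) / w
    return u, v


def average_corner_error(H_pred, H_true, width, height):
    """Mean Euclidean distance between the four image corners warped by the
    predicted homography and by the ground-truth homography."""
    # Assumed corner convention: pixel coordinates from 0 to size - 1.
    corners = [(0, 0), (width - 1, 0), (width - 1, height - 1), (0, height - 1)]
    total = 0.0
    for x, y in corners:
        px, py = warp_point(H_pred, x, y)
        tx, ty = warp_point(H_true, x, y)
        total += math.hypot(px - tx, py - ty)
    return total / len(corners)


def mean_average_corner_error(pairs, width, height):
    """Average the per-image corner error over (H_pred, H_true) pairs."""
    errors = [average_corner_error(p, t, width, height) for p, t in pairs]
    return sum(errors) / len(errors)


# Toy inputs (hypothetical): a perfect prediction and one that is off by a
# pure translation of (3, 4) pixels, so every corner is displaced by 5 pixels.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shift = [[1.0, 0.0, 3.0], [0.0, 1.0, 4.0], [0.0, 0.0, 1.0]]

print(average_corner_error(shift, identity, 128, 128))  # 5.0
print(mean_average_corner_error(
    [(shift, identity), (identity, identity)], 128, 128))  # 2.5
```

A statement such as "54.3% lower than MHN" compares this quantity between two methods averaged over the same test set, so it describes a relative reduction in corner displacement rather than an absolute pixel error.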