XPoint: A Self-Supervised Visual-State-Space based Architecture for Multispectral Image Registration

Ismail Can Yagmur, Hasan F. Ates, Bahadir K. Gunturk
2024-11-12
Abstract: Accurate multispectral image matching presents significant challenges due to non-linear intensity variations across spectral modalities, extreme viewpoint changes, and the scarcity of labeled datasets. Current state-of-the-art methods are typically specialized for a single spectral difference, such as visible-infrared, and struggle to adapt to other modalities due to their reliance on expensive supervision, such as depth maps or camera poses. To address the need for rapid adaptation across modalities, we introduce XPoint, a self-supervised, modular image-matching framework designed for adaptive training and fine-tuning on aligned multispectral datasets, allowing users to customize key components based on their specific tasks. XPoint employs modularity and self-supervision to allow for the adjustment of elements such as the base detector, which generates pseudo-ground-truth keypoints invariant to viewpoint and spectrum variations. The framework integrates a VMamba encoder, pretrained on segmentation tasks, for robust feature extraction, and includes three joint decoder heads: two are dedicated to interest point and descriptor extraction, and a task-specific homography regression head imposes geometric constraints for superior performance in tasks like image registration. This flexible architecture enables quick adaptation to a wide range of modalities, demonstrated by training on Optical-Thermal data and fine-tuning on settings such as visual-near infrared, visual-infrared, visual-longwave infrared, and visual-synthetic aperture radar. Experimental results show that XPoint consistently outperforms or matches state-of-the-art methods in feature matching and image registration tasks across five distinct multispectral datasets. Our source code is available at https://github.com/canyagmur/XPoint.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the challenges of multispectral image matching, which arise from nonlinear intensity changes between spectral modalities, extreme viewpoint changes, and the scarcity of labeled datasets. Existing state-of-the-art methods usually focus on a single spectral difference (such as visible-infrared) and have difficulty adapting to other modalities because they rely on expensive supervision, such as depth maps or camera poses. Specifically, the paper aims to solve the following problems:

1. **Cross-modality Adaptability**: Existing methods perform poorly across different spectral modalities and are difficult to generalize to unseen ones.
2. **Scarcity of Labeled Data**: Multispectral image matching requires a large amount of labeled data, which is often difficult to obtain.
3. **Viewpoint and Spectral Changes**: Multispectral images change significantly across viewpoints and spectra, which makes matching harder.

To solve these problems, the paper proposes XPoint, a self-supervised, modular image-matching framework that supports adaptive training and fine-tuning on aligned multispectral datasets and lets users customize key components for specific tasks. The main contributions of XPoint include:

- **Multispectral Homographic Transformation**: An improved multispectral homographic transformation method is introduced to generate a set of pseudo-ground-truth keypoints that are invariant to viewpoint and spectral changes.
- **Pre-trained VMamba Encoder**: A VMamba encoder pre-trained on a segmentation task is used to strengthen feature extraction.
- **Geometrically Constrained Regression Head**: A task-specific head for homography regression is introduced to impose geometric constraints and improve matching performance.
- **Improved Detector Loss**: For datasets with significant spectral differences (such as VIS-SAR and VIS-NIR), a weighted cross-entropy loss is adopted to improve model performance under difficult conditions.

With these improvements, XPoint achieves high-precision image matching and registration on multiple multispectral datasets, demonstrating strong performance in multimodal image-matching tasks.
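The weighted cross-entropy idea behind the improved detector loss can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function name `weighted_cross_entropy`, the three-class toy setup (two keypoint classes plus a "no keypoint" class), and the weight values are all assumptions made for illustration, and the summary above does not state the actual weighting scheme.

```python
import numpy as np


def weighted_cross_entropy(logits, labels, class_weights):
    """Weighted mean cross-entropy over N samples and C classes.

    logits:        (N, C) unnormalized scores
    labels:        (N,) integer class indices in [0, C)
    class_weights: (C,) non-negative per-class weights (hypothetical values)
    """
    logits = np.asarray(logits, dtype=np.float64)
    labels = np.asarray(labels, dtype=np.int64)
    class_weights = np.asarray(class_weights, dtype=np.float64)

    # Numerically stable log-softmax: subtract the row maximum first.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

    # Negative log-likelihood of the true class for each sample.
    nll = -log_probs[np.arange(labels.shape[0]), labels]

    # Each sample is weighted by the weight of its true class,
    # and the sum is normalized by the total weight.
    sample_weights = class_weights[labels]
    return float((sample_weights * nll).sum() / sample_weights.sum())


# Toy example (assumed setup): classes 0 and 1 are keypoint classes,
# class 2 is the frequent "no keypoint" class and is down-weighted.
toy_logits = np.array([[0.0, 0.0, 5.0],
                       [0.0, 0.0, 5.0]])
toy_labels = np.array([0, 2])
print(weighted_cross_entropy(toy_logits, toy_labels, [1.0, 1.0, 1.0]))
print(weighted_cross_entropy(toy_logits, toy_labels, [10.0, 10.0, 1.0]))
```

In the toy example, the first sample is a missed keypoint. Raising the weights of the keypoint classes increases the loss for that mistake relative to the unweighted case, which is the general motivation for class weighting when keypoints are rare compared to background.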