Abstract: Image retrieval is one of the key techniques of computer vision and has been studied for a long time. Nevertheless, little attention has been paid to infrared and visible cross-modal retrieval, which can be widely used in various applications, e.g., infrared and visible surveillance systems. In this paper, we propose a shared-features-based infrared-visible cross-modal image retrieval method. Similar visual features are extracted from infrared and visible images as the shared features, and the Euclidean distance is used to measure the similarity between these features. The core of the proposed method comes from three aspects: 1) a feature separation network separates image features into shared features and exclusive features; 2) a Maximum Mean Discrepancy (MMD) loss constrains the distribution of shared features, which reduces the retrieval error caused by different imaging angles and by the similarity of infrared images; 3) a cross-layer fusion encoder compensates for the context loss in the convolution of infrared images. Experimental results on the Infrared-Visible dataset demonstrate that the proposed method is effective and outperforms the state-of-the-art approaches.
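
The retrieval step described in the abstract, ranking candidates by Euclidean distance between shared features, can be illustrated with a minimal sketch. The feature vectors, the function names `euclidean` and `retrieve`, and the `top_k` parameter are all hypothetical and chosen for illustration; the abstract does not specify the feature dimensionality or how the ranking is implemented.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def retrieve(query, gallery, top_k=1):
    """Return indices of the top_k gallery vectors closest to the query."""
    ranked = sorted(range(len(gallery)),
                    key=lambda i: euclidean(query, gallery[i]))
    return ranked[:top_k]


# Hypothetical shared features (assumed values, not from the paper):
# one infrared query and a gallery of three visible-image features.
infrared_query = [0.9, 0.1, 0.4]
visible_gallery = [
    [0.1, 0.8, 0.3],
    [0.8, 0.2, 0.5],
    [0.5, 0.5, 0.5],
]

best = retrieve(infrared_query, visible_gallery, top_k=2)
print(best)  # indices of the two nearest visible-image features
```

The same function works in the opposite direction (a visible query against an infrared gallery), since the distance is symmetric and both modalities are assumed to be mapped into the same shared feature space.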

What problem does this paper attempt to address?

This paper addresses infrared and visible-light cross-modal image retrieval: given a visible-light image, retrieve the most similar infrared image, and vice versa. The problem matters in practice. For example, in a surveillance system, an image of a suspect captured by an infrared camera at night can be matched against visible-light video recorded during the day, and vice versa.

### Main Challenges

1. **Different Imaging Effects**: Infrared images usually have better imaging quality than visible-light images in low-light conditions, but visible-light images carry more texture information and important color information. As a result, the context information of infrared images is quickly lost during convolution.
2. **Different Imaging Angles**: Even when the infrared and visible-light cameras capture the same object, differing imaging angles can misalign the image pixels.
3. **Similarity between Infrared Images**: Current infrared cameras distinguish differences in thermal radiation only weakly, so infrared images are highly similar to one another, which easily leads to misjudgment.

### Solutions

To address these challenges, the authors propose an infrared and visible-light cross-modal image retrieval method based on shared features. The method has three core components:

1. **Feature Separation Network**: Divides image features into shared features and exclusive features. Shared features are used for cross-modal matching, while exclusive features retain the information specific to each modality.
2. **Maximum Mean Discrepancy (MMD) Loss**: Constrains the distribution of shared features, reducing retrieval errors caused by different imaging angles and by the similarity between infrared images.
3. **Cross-layer Fusion Encoder**: Compensates for the context information lost during the convolution of infrared images.

Together, these components extract latent similar features from infrared and visible-light images, enabling effective cross-modal image retrieval.

### Summary

The main contributions of this paper are:

- A novel infrared and visible-light cross-modal image retrieval method based on shared feature extraction.
- A cross-layer fusion encoder and an MMD loss, which reduce context information loss during convolution and push the shared features of the two modalities toward the same distribution.
- Experimental results on the Infrared-Visible dataset showing that the method is effective and outperforms existing baseline methods.
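
To make the role of the MMD loss more concrete, the following is a minimal sketch of the standard biased estimate of squared MMD between two sets of feature vectors. The Gaussian (RBF) kernel, the bandwidth `sigma`, the function names, and the sample values are assumptions for illustration; the text above does not state which kernel or estimator the authors use.

```python
import math


def rbf_kernel(a, b, sigma=1.0):
    """Gaussian kernel between two vectors (kernel choice is an assumption)."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mean_kernel(xs, ys, sigma):
    """Average kernel value over all pairs (x, y) with x in xs and y in ys."""
    total = sum(rbf_kernel(x, y, sigma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(xs, ys, sigma=1.0):
    """Biased estimate of squared MMD between sample sets xs and ys."""
    return (mean_kernel(xs, xs, sigma)
            + mean_kernel(ys, ys, sigma)
            - 2.0 * mean_kernel(xs, ys, sigma))


# Hypothetical shared features for a small batch (assumed values).
infrared_shared = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3]]
visible_near = [[0.1, 0.0], [0.2, 0.2], [0.0, 0.3]]
visible_far = [[3.0, 3.0], [3.2, 3.1], [2.9, 3.3]]

loss_near = mmd_squared(infrared_shared, visible_near)
loss_far = mmd_squared(infrared_shared, visible_far)
print(loss_near, loss_far)
```

The estimate is zero when the two sample sets coincide and grows as their distributions move apart, so minimizing it during training encourages the shared features of the two modalities to follow the same distribution.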

Infrared and Visible Cross-Modal Image Retrieval Through Shared Features
