Abstract: Multispectral object detection (MOD), which incorporates additional information from thermal images into object detection (OD) to robustly cope with complex illumination conditions, has garnered significant attention. However, existing MOD methods demand a considerable amount of annotated data for training. Inspired by the concept of few-shot learning, we propose a novel task called few-shot multispectral object detection (FSMOD) that aims to accomplish MOD using only a few annotated samples from each category. Specifically, we first design a cross-modality interaction (CMI) module, which leverages different attention mechanisms to let information from the visible and thermal modalities interact during backbone feature extraction. Guided by this interaction process, the detector is able to extract modality-specific backbone features with better discrimination. To improve the few-shot learning ability of the detector, we also design a semantic prototype metric (SPM) loss that integrates semantic knowledge, i.e., word embeddings, into the optimization process of the embedding space. Semantic knowledge provides a stable category representation when visual information is insufficient. Extensive experiments on the customized FSMOD dataset demonstrate that the proposed method achieves state-of-the-art performance. Source code is publicly available at https://github.com/HuangLian126/FSMOD .

What problem does this paper attempt to address?

The paper primarily focuses on addressing the problem of multispectral object detection under limited labeled data conditions. Specifically:

1. **Research Background and Challenges**: Traditional single-modality object detection methods experience performance degradation under varying lighting conditions, while multispectral object detection (MOD) improves detection accuracy in complex lighting conditions by fusing visible light images and thermal imaging. However, existing multispectral object detection methods require a large amount of labeled data for training, which is often difficult to meet in practical applications. Additionally, convolutional neural network (CNN)-based methods are prone to overfitting when labeled data is limited.
2. **Objectives and Contributions**: The paper proposes a new task, Few-Shot Multispectral Object Detection (FSMOD), aiming to achieve multispectral object detection using a small amount of labeled data per category. To achieve this goal, the paper introduces two key components:
   - **Cross-modality Interaction (CMI) Module**: This module promotes interaction between visible light image features and thermal image features through spatial attention and channel attention mechanisms, thereby extracting more discriminative features.
   - **Semantic Prototype Metric (SPM) Loss**: This loss function integrates semantic knowledge (such as word embeddings) into the optimization process of the embedding space to enhance the model's learning ability under few-shot conditions, ensuring stable category representation even when visual information is insufficient.
3. **Method Overview**:
   - The paper adopts Faster R-CNN as the base detector and designs the CMI module during the feature extraction stage to enhance feature quality.
   - For shallow features (stages 2 and 3), the CMI module uses spatial attention mechanisms to enhance the spatial information representation of features; for deep features (stages 4 and 5), it uses channel attention mechanisms to enhance the channel information representation of features.
   - The Cross-Modality Fusion (CMF) module fuses the enhanced visible light features and thermal features to generate comprehensive features, which are then fed into the subsequent detection head to complete the detection task.
   - The SPM loss uses semantic prototypes to guide the Region of Interest (RoI) features to be closer to the semantic prototypes of the same category, thereby forming good classification boundaries.

In summary, this paper proposes a few-shot multispectral object detection method that combines cross-modality interaction and semantic knowledge, aiming to address the need for a large amount of labeled data in existing multispectral object detection methods and the issue of overfitting under few-shot conditions. By introducing cross-modality interaction mechanisms and semantic prototype metric loss, the model's detection performance under few-shot conditions is improved.
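
The idea behind the SPM loss, pulling each RoI feature toward the semantic prototype of its own category, can be illustrated with a small sketch. This is not the paper's formulation: the summary does not give the exact loss, so the version below assumes a cross-entropy over temperature-scaled cosine similarities, a common choice for prototype-based metric losses. The function names, the `temperature` value, and the toy three-dimensional vectors are all hypothetical, and the RoI features are assumed to be already projected into the same space as the word embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def prototype_metric_loss(roi_feature, label, prototypes, temperature=0.1):
    """Cross-entropy over cosine similarities to every class prototype.

    Assumed form (not taken from the paper): the loss is small when the
    RoI feature is closest to the prototype at index `label`.
    """
    logits = [cosine(roi_feature, p) / temperature for p in prototypes]
    # Numerically stable log-sum-exp over the logits.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum_exp - logits[label]


# Toy stand-ins for word-embedding prototypes of two categories.
prototypes = [
    [1.0, 0.0, 0.0],  # hypothetical prototype for category 0
    [0.0, 1.0, 0.0],  # hypothetical prototype for category 1
]

# An RoI feature near its own prototype versus one near the wrong prototype.
loss_near = prototype_metric_loss([0.9, 0.1, 0.0], 0, prototypes)
loss_far = prototype_metric_loss([0.1, 0.9, 0.0], 0, prototypes)
print(loss_near < loss_far)
```

In this sketch the prototypes stay fixed while the RoI features would be optimized, which reflects the summary's point that semantic knowledge supplies a stable category representation when few visual examples are available.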

Cross-modality interaction for few-shot multispectral object detection with semantic knowledge

Cross-domain Multi-modal Few-shot Object Detection via Rich Text

Multi-Modal Few-Shot Object Detection with Meta-Learning-Based Cross-Modal Prompting

Few-Shot Object Detection with Memory Contrastive Proposal Based on Semantic Priors

Multimodality Helps Few-Shot 3D Point Cloud Semantic Segmentation

MM-FSOD: Meta and metric integrated few-shot object detection

Semantic Enhanced Few-shot Object Detection

Cross-Domain Few-Shot Hyperspectral Image Classification With Cross-Modal Alignment and Supervised Contrastive Learning

Cross-Modality Attentive Feature Fusion for Object Detection in Multispectral Remote Sensing Imagery

Few-Shot Object Detection With Multilevel Information Interaction for Optical Remote Sensing Images

Object Segmentation by Mining Cross-Modal Semantics

Attention-based Cross-modality Interaction for Multispectral Pedestrian Detection

Object Detection in Multispectral Remote Sensing Images Based on Cross-Modal Cross-Attention

Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models

Few-Shot Object Detection with Self-Adaptive Global Similarity and Two-Way Foreground Stimulator in Remote Sensing Images

Learning transferable cross-modality representations for few-shot hyperspectral and LiDAR collaborative classification

The Cross-Modality Disparity Problem in Multispectral Pedestrian Detection.

RGB-D Salient Object Detection with Cross-Modality Modulation and Selection

Cross-Modality Fusion Transformer for Multispectral Object Detection

Adaptive Cross-Modal Few-Shot Learning

Few-shot Object Detection via Improved Classification Features