Cross-modality interaction for few-shot multispectral object detection with semantic knowledge

Lian Huang, Zongju Peng, Fen Chen, Shaosheng Dai, Ziqiang He, Kesheng Liu
DOI: https://doi.org/10.1016/j.neunet.2024.106156
IF: 7.8
2024-02-09
Neural Networks
Abstract: Multispectral object detection (MOD), which incorporates additional information from thermal images into object detection (OD) to robustly cope with complex illumination conditions, has garnered significant attention. However, existing MOD methods always demand a considerable amount of annotated data for training. Inspired by the concept of few-shot learning, we propose a novel task called few-shot multispectral object detection (FSMOD) that aims to accomplish MOD using only a few annotated data from each category. Specifically, we first design a cross-modality interaction (CMI) module, which leverages different attention mechanisms to interact with the information from visible and thermal modalities during backbone feature extraction. With the guidance of interaction process, the detector is able to extract modality-specific backbone features with better discrimination. To improve the few-shot learning ability of the detector, we also design a semantic prototype metric (SPM) loss that integrates semantic knowledge, i.e., word embeddings, into the optimization process of embedding space. Semantic knowledge provides stable category representation when visual information is insufficient. Extensive experiments on the customized FSMOD dataset demonstrate that the proposed method achieves state-of-the-art performance. Source code is publicly available at https://github.com/HuangLian126/FSMOD.
computer science, artificial intelligence,neurosciences
What problem does this paper attempt to address?
The paper addresses multispectral object detection when labeled data is limited. Specifically:

1. **Research Background and Challenges**: Single-modality object detectors lose accuracy under varying lighting conditions. Multispectral object detection (MOD) improves accuracy in complex lighting by fusing visible-light images with thermal images. However, existing MOD methods need a large amount of labeled data for training, a requirement that is often hard to meet in practice. In addition, convolutional neural network (CNN)-based methods are prone to overfitting when labeled data is scarce.

2. **Objectives and Contributions**: The paper proposes a new task, Few-Shot Multispectral Object Detection (FSMOD), which aims to perform multispectral object detection using only a small amount of labeled data per category. To this end, it introduces two key components:
   - **Cross-Modality Interaction (CMI) Module**: This module promotes interaction between visible-light features and thermal features through spatial attention and channel attention mechanisms, so that the backbone extracts more discriminative features.
   - **Semantic Prototype Metric (SPM) Loss**: This loss integrates semantic knowledge (word embeddings) into the optimization of the embedding space. It strengthens learning under few-shot conditions by providing a stable category representation even when visual information is insufficient.

3. **Method Overview**:
   - The paper adopts Faster R-CNN as the base detector and applies the CMI module during the feature extraction stage to improve feature quality.
   - For shallow features (stages 2 and 3), the CMI module uses spatial attention to strengthen the spatial information in the features; for deep features (stages 4 and 5), it uses channel attention to strengthen the channel information.
   - The Cross-Modality Fusion (CMF) module fuses the enhanced visible-light and thermal features into a combined feature, which is then fed to the detection head.
   - The SPM loss pulls Region of Interest (RoI) features toward the semantic prototype of their own category, which helps form well-separated classification boundaries.

In summary, the paper proposes a few-shot multispectral object detection method that combines cross-modality interaction with semantic knowledge. It targets two weaknesses of existing multispectral detectors: their need for large amounts of labeled data and their tendency to overfit under few-shot conditions. The CMI module and the SPM loss together improve detection performance when only a few labeled examples are available.
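
The stage-dependent choice of attention described in the method overview can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the residual form, and the parameter-free attention (pooling followed by a sigmoid, with no learned convolutions or fully connected layers) are all assumptions made for illustration. It only shows the general idea of each modality being reweighted by attention computed from the other, with spatial attention at stages 2 and 3 and channel attention at stages 4 and 5.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(feat):
    """Map a (C, H, W) feature to a (1, H, W) attention map in (0, 1).

    Assumption: channel-wise mean and max are simply averaged; a real
    module would typically pass them through a learned convolution.
    """
    avg = feat.mean(axis=0, keepdims=True)
    mx = feat.max(axis=0, keepdims=True)
    return sigmoid(0.5 * (avg + mx))

def channel_attention(feat):
    """Map a (C, H, W) feature to (C, 1, 1) channel weights in (0, 1).

    Assumption: global average pooling followed directly by a sigmoid,
    without the learned bottleneck layers a real module would use.
    """
    pooled = feat.mean(axis=(1, 2), keepdims=True)
    return sigmoid(pooled)

def cross_modality_interaction(visible, thermal, stage):
    """Hypothetical CMI step: each modality is reweighted by attention
    computed from the other one, with a residual connection."""
    if stage in (2, 3):        # shallow stages: spatial attention
        attend = spatial_attention
    elif stage in (4, 5):      # deep stages: channel attention
        attend = channel_attention
    else:
        raise ValueError("stage must be one of 2, 3, 4, 5")
    visible_out = visible + visible * attend(thermal)
    thermal_out = thermal + thermal * attend(visible)
    return visible_out, thermal_out

# Usage: random stand-ins for backbone features of shape (C, H, W).
rng = np.random.default_rng(0)
vis = rng.standard_normal((8, 4, 4))
thr = rng.standard_normal((8, 4, 4))
vis_s2, thr_s2 = cross_modality_interaction(vis, thr, stage=2)
vis_s5, thr_s5 = cross_modality_interaction(vis, thr, stage=5)
```

Both outputs keep the input shape, so under these assumptions the step could sit between backbone stages without changing the rest of the detector.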
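
The idea behind the SPM loss, pulling RoI features toward the semantic prototype of their own category, can likewise be sketched in a few lines. The exact formulation is not given in this summary, so the version below is an assumption: it treats each category's word embedding as a fixed prototype, scores RoI features by cosine similarity, and applies a temperature-scaled softmax cross-entropy. The function names, the `temperature` parameter, and the premise that RoI features already live in the same space as the word embeddings are all hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Normalize each row to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def semantic_prototype_loss(roi_feats, prototypes, labels, temperature=0.1):
    """Hypothetical prototype-metric loss.

    roi_feats:  (N, D) RoI features, assumed already projected to the
                word-embedding dimension D.
    prototypes: (K, D) one semantic prototype (word embedding) per category.
    labels:     (N,) integer category index of each RoI.
    """
    # Cosine similarity between every RoI feature and every prototype.
    sims = l2_normalize(roi_feats) @ l2_normalize(prototypes).T  # (N, K)
    logits = sims / temperature
    # Numerically stable log-softmax over categories.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Negative log-probability of the true category, averaged over RoIs.
    return float(-log_probs[np.arange(len(labels)), labels].mean())

# Usage: three categories with toy 3-d "word embeddings".
protos = np.eye(3)
feats = np.array([[1.0, 0.1, 0.0],
                  [0.0, 0.9, 0.2]])
loss_matched = semantic_prototype_loss(feats, protos, np.array([0, 1]))
loss_mismatched = semantic_prototype_loss(feats, protos, np.array([2, 0]))
```

In this toy setup the loss is lower when each feature is labeled with the category whose prototype it points toward, which is the behavior the summary describes: minimizing it moves RoI features closer to the prototype of their own category.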