Abstract: Object detection in poor-illumination environments is a challenging task, as objects are usually not clearly visible in RGB images. Because infrared images provide clear edge information that complements RGB images, fusing RGB and infrared images has the potential to enhance detection in poor-illumination environments. However, existing works involving both visible and infrared images focus only on image fusion rather than object detection. Moreover, they fuse the two image modalities directly, which ignores the mutual interference between them. To fuse the two modalities in a way that maximizes the advantages of cross-modality, we design a dual-enhancement-based cross-modality object detection network, DEYOLO, in which semantic-spatial cross-modality modules and novel bi-directional decoupled focus modules are designed to achieve detection-centered mutual enhancement of RGB-infrared (RGB-IR). Specifically, a dual semantic enhancing channel weight assignment module (DECA) and a dual spatial enhancing pixel weight assignment module (DEPA) are first proposed to aggregate cross-modality information in the feature space and improve the feature representation ability, such that feature fusion can aim at the object detection task. Meanwhile, a dual-enhancement mechanism, including enhancements for two-modality fusion and for a single modality, is designed in both DECA and DEPA to reduce interference between the two image modalities. Then, a novel bi-directional decoupled focus is developed to enlarge the receptive field of the backbone network in different directions, which improves the representation quality of DEYOLO. Extensive experiments on M3FD and LLVIP show that our approach outperforms SOTA object detection algorithms by a clear margin. Our code is available at https://github.com/chips96/DEYOLO.
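To make the idea of cross-modality channel weighting more concrete, the following is a minimal illustrative sketch in plain Python. It is not the DECA module from the paper or its repository: the function names (`global_avg_pool`, `cross_channel_reweight`), the use of global average pooling followed by a sigmoid, and the simple multiplicative re-weighting are all assumptions chosen for illustration. It only shows the general pattern of deriving per-channel weights from one modality and applying them to the other.

```python
import math


def global_avg_pool(feat):
    """Return one mean value per channel.

    feat is a list of channels; each channel is an H x W list of lists.
    """
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feat]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def scale_channels(feat, weights):
    """Multiply every value in channel c by weights[c]."""
    return [[[v * w for v in row] for row in ch] for ch, w in zip(feat, weights)]


def cross_channel_reweight(feat_rgb, feat_ir):
    """Hypothetical cross-modality channel weighting (an assumption, not DECA).

    Each modality's channels are scaled by weights computed from the
    OTHER modality's per-channel statistics.
    """
    w_from_rgb = [sigmoid(m) for m in global_avg_pool(feat_rgb)]
    w_from_ir = [sigmoid(m) for m in global_avg_pool(feat_ir)]
    enhanced_rgb = scale_channels(feat_rgb, w_from_ir)
    enhanced_ir = scale_channels(feat_ir, w_from_rgb)
    return enhanced_rgb, enhanced_ir


# Toy feature maps: 2 channels, 2 x 2 spatial size per modality.
rgb = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 0.0]]]
ir = [[[0.0, 0.0], [0.0, 0.0]], [[2.0, 2.0], [2.0, 2.0]]]
out_rgb, out_ir = cross_channel_reweight(rgb, ir)
```

In this toy input, the first infrared channel is all zeros, so its pooled value maps to a weight of 0.5 and the first RGB channel is halved. A learned module would typically replace the fixed sigmoid-of-mean with trainable layers; that design choice is outside what the abstract specifies.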

IMC-Det: Intra–Inter Modality Contrastive Learning for Video Object Detection

Detecting Human-Object Interactions with Object-Guided Cross-Modal Calibrated Semantics

Multilevel Spatial-Temporal Feature Aggregation for Video Object Detection

Video object matching across multiple non-overlapping camera views based on multi-feature fusion and incremental learning

Video Salient Object Detection via Contrastive Features and Attention Modules

IMD-Net: Interpretable multi-scale detection network for infrared dim and small objects

Multimodal Contrastive Training for Visual Representation Learning

A Simple yet Effective Network based on Vision Transformer for Camouflaged Object and Salient Object Detection

Open-Vocabulary 3D Detection via Image-level Class and Debiased Cross-modal Contrastive Learning

DPDETR: Decoupled Position Detection Transformer for Infrared-Visible Object Detection

More Pictures Say More: Visual Intersection Network for Open Set Object Detection

Abnormal Event Detection Using Deep Contrastive Learning for Intelligent Video Surveillance System

Learning Task-Aware Language-Image Representation for Class-Incremental Object Detection

DeepInteraction: 3D Object Detection via Modality Interaction

DEYOLO: Dual-Feature-Enhancement YOLO for Cross-Modality Object Detection

Task-decoupled interactive embedding network for object detection

Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning

SimDet: Cross Similarity Attention for One-shot Object Detection

Semantics Meets Temporal Correspondence: Self-supervised Object-centric Learning in Videos

Multi-Semantic Interactive Learning for Object Detection

Lightweight Spatial Sliced-Concatenate-Multireceptive-Field Enhance and Joint Channel Attention Mechanism for Infrared Object Detection