A Vision Enhancement and Feature Fusion Multiscale Detection Network

Chengwu Qian, Jiangbo Qian, Chong Wang, Xulun Ye, Caiming Zhong

DOI: https://doi.org/10.1007/s11063-024-11471-w

IF: 2.565

2024-02-09

Neural Processing Letters

Abstract: In object detection, real scenes often contain heavy occlusion, which can easily degrade detector accuracy. Currently, most detectors use a convolutional neural network (CNN) as the backbone network, but CNNs are not robust when detecting occluded objects, and the absence of object pixels makes conventional convolution ineffective at extracting features, leading to a decrease in detection accuracy. To address these two problems, we propose VFN (A Vision Enhancement and Feature Fusion Multiscale Detection Network), which first builds a multiscale backbone network using different stages of the Swin Transformer, and then utilizes a vision enhancement module based on dilated convolution to enhance the vision of feature points at different scales and address the problem of missing pixels. Finally, the feature guidance module enables features at each scale to be enhanced by fusing with each other. The overall accuracy of VFN on both the PASCAL VOC dataset and the CrowdHuman dataset is better than that of other methods, and its ability to find occluded objects is also better, demonstrating the effectiveness of our method. The code is available at https://github.com/qcw666/vfn.

computer science, artificial intelligence

What problem does this paper attempt to address?

This paper addresses the drop in object detection accuracy caused by occlusion in real-world scenes. Most current detectors use a convolutional neural network (CNN) as the backbone, but CNNs are not robust to occlusion: when object pixels are missing, conventional convolution cannot extract features effectively, and detection accuracy falls. To address this, the authors propose VFN (A Vision Enhancement and Feature Fusion Multiscale Detection Network). VFN builds a multiscale backbone network, uses dilated convolutions to enlarge the receptive field of feature points at different scales, and uses a feature guidance module so that features at different scales enhance one another. The aim is higher detection accuracy, especially for occluded objects.

The main contributions of the paper are:

1. An effective method (VFN) for improving object detection under occlusion. The Swin Transformer is used instead of a CNN to construct a multiscale detection framework, and image information is retained through patch merging, which makes the algorithm more robust.

2. A vision enhancement module and a feature guidance module that take the connections between distant feature points into account. According to the authors, this is the first time dilated convolutions and attention mechanisms have been combined to build such modules, which strengthen the connections between features.

3. Experiments on the PASCAL VOC and CrowdHuman datasets in which VFN gives better results than existing methods, particularly when detecting occluded objects.

Together, these improvements make VFN more accurate and robust for occluded object detection, which matters for detection performance in real-world applications.
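The role of dilated convolution in enlarging the receptive field can be illustrated with a minimal pure-Python sketch. This is not the authors' vision enhancement module; the function names, the 1D setting, the example kernel, and the dilation rates are all assumptions chosen for illustration. It only shows the general property that a kernel of size k with dilation d spans k + (k - 1)(d - 1) input positions without adding any weights.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Number of input positions spanned by one dilated kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def dilated_conv1d(signal, kernel, dilation=1):
    """Valid-mode 1D dilated convolution (cross-correlation convention)."""
    span = effective_kernel_size(len(kernel), dilation)
    out = []
    for start in range(len(signal) - span + 1):
        # Kernel taps are spaced `dilation` positions apart in the input.
        out.append(sum(w * signal[start + i * dilation]
                       for i, w in enumerate(kernel)))
    return out


# Hypothetical toy input and kernel, used only for illustration.
signal = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, 0, -1]

print(effective_kernel_size(3, 1))                 # 3
print(effective_kernel_size(3, 2))                 # 5
print(dilated_conv1d(signal, kernel, dilation=1))  # [-2, -2, -2, -2, -2]
print(dilated_conv1d(signal, kernel, dilation=2))  # [-4, -4, -4]
```

With the same three weights, dilation 2 compares values four positions apart instead of two. A wider span like this is what allows a feature point to draw on context beyond a locally occluded region.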

Millimeter-Wave Radar and Vision Fusion Target Detection Algorithm Based on an Extended Network

SSF: Sparse Point Cloud Object Detection Based on Self-Adaptive Voxel Encoding and Focal-Sparse Convolution

Attention-based Fusion Factor in FPN for Object Detection

An Adaptive Attention Fusion Mechanism Convolutional Network for Object Detection in Remote Sensing Images

PVAFN: Point-Voxel Attention Fusion Network with Multi-Pooling Enhancing for 3D Object Detection

A Multi-Feature Fusion and Attention Network for Multi-Scale Object Detection in Remote Sensing Images

When CNN meet with ViT: decision-level feature fusion for camouflaged object detection

ICAFusion: Iterative cross-attention guided feature fusion for multispectral object detection

FAFFENet: frequency attention and feature fusion enhancement network for multiscale remote sensing target detection

A Saliency Enhanced Feature Fusion based multiscale RGB-D Salient Object Detection Network

Enhanced Hybrid Vision Transformer with Multi-Scale Feature Integration and Patch Dropping for Facial Expression Recognition

NLFFTNet: A non-local feature fusion transformer network for multi-scale object detection

3D Object Detection Based on Attention and Multi-Scale Feature Fusion

MM-FPN: Multi-path and Multi-scale Feature Pyramid Network for Object Detection

Dynamic multi-headed self-attention and multiscale enhancement vision transformer for object detection

Adaptively Attentional Feature Fusion Oriented to Multiscale Object Detection in Remote Sensing Images

Progressive structure network-based multiscale feature fusion for object detection in real-time application

MPF-Net: multi-projection filtering network for few-shot object detection

Pyramid attention object detection network with multi-scale feature fusion

Multi-Scale Feature Fusion Point Cloud Object Detection Based on Original Point Cloud and Projection