A Vision Enhancement and Feature Fusion Multiscale Detection Network

Chengwu Qian, Jiangbo Qian, Chong Wang, Xulun Ye, Caiming Zhong
DOI: https://doi.org/10.1007/s11063-024-11471-w
IF: 2.565
2024-02-09
Neural Processing Letters
Abstract: In object detection, real scenes often contain heavy occlusion, which can easily reduce detector accuracy. Currently, most detectors use a convolutional neural network (CNN) as a backbone network, but CNNs are not robust to occlusion, and the absence of object pixels makes conventional convolution ineffective at extracting features, leading to a decrease in detection accuracy. To address these two problems, we propose VFN (A Vision Enhancement and Feature Fusion Multiscale Detection Network), which first builds a multiscale backbone network using different stages of the Swin Transformer, and then utilizes a vision enhancement module based on dilated convolution to enhance the vision of feature points at different scales and address the problem of missing pixels. Finally, the feature guidance module enables features at each scale to be enhanced by fusing with each other. The overall accuracy of VFN on both the PASCAL VOC dataset and the CrowdHuman dataset is better than that of other methods, and its ability to find occluded objects is also better, demonstrating the effectiveness of our method. The code is available at https://github.com/qcw666/vfn.
computer science, artificial intelligence
What problem does this paper attempt to address?
This paper addresses the loss of detection accuracy caused by occlusion in real-world object detection. Most current detectors use a convolutional neural network (CNN) as the backbone, but CNNs are not robust to occlusion: when object pixels are missing, conventional convolution cannot extract features effectively, and detection accuracy drops. The authors therefore propose VFN, a Vision Enhancement and Feature Fusion Multiscale Detection Network. VFN builds a multiscale backbone from different stages of the Swin Transformer, uses dilated convolution to enlarge the receptive field of feature points at each scale, and applies a feature guidance module so that features at different scales enhance one another through fusion. The main contributions of the paper are:
1. An effective method (VFN) for object detection under occlusion. The Swin Transformer replaces the CNN backbone in a multiscale detection framework, and patch merging retains image information, which improves robustness.
2. A vision enhancement module and a feature guidance module that combine dilated convolution with attention mechanisms to account for connections between distant feature points. The authors present this combination as the first of its kind.
3. Experiments on the PASCAL VOC and CrowdHuman datasets in which VFN outperforms existing methods, particularly on occluded objects.
Together, these improvements make VFN more accurate and robust on occluded object detection, which matters for detection performance in real-world applications.
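To make the receptive-field argument concrete, the following is a minimal sketch of dilated convolution in general, not the VFN vision enhancement module itself. The function names, the 1D setting, the kernel, and the dilation rates are all illustrative assumptions; the summary above does not state the dilation rates or layer layout used in the paper.

```python
# Illustrative sketch only: plain 1D dilated convolution in pure Python.
# Names and values here are hypothetical and not taken from the VFN code.

def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Span of input positions covered by one dilated kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def dilated_conv1d(signal, kernel, dilation=1):
    """Valid (no padding) 1D convolution with taps spaced `dilation` apart."""
    span = (len(kernel) - 1) * dilation  # distance from first to last tap
    return [
        sum(w * signal[i + j * dilation] for j, w in enumerate(kernel))
        for i in range(len(signal) - span)
    ]


signal = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, 1, 1]

dense = dilated_conv1d(signal, kernel, dilation=1)   # each output sees 3 neighbors
sparse = dilated_conv1d(signal, kernel, dilation=2)  # each output spans 5 positions

print(effective_kernel_size(3, 1), dense)
print(effective_kernel_size(3, 2), sparse)
```

With the same three weights, a dilation of 2 lets each output position draw on inputs spread over five positions instead of three. This is the general property that allows a feature point to use context from farther away when nearby pixels are occluded.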