Learning to zoom: Exploiting mixed-scale contextual information for object detection
Boying Wang, Ruyi Ji, Libo Zhang, Yanjun Wu, Jing Liu
DOI: https://doi.org/10.1016/j.eswa.2024.125871
IF: 8.5
2024-12-04
Expert Systems with Applications
Abstract: With the development of deep neural networks, object detection has made a significant leap. However, apart from intrinsic intra- and inter-class differences, objects in real-world scenarios may exhibit diverse scales, blurred appearance, or even severe occlusion. Inspired by how humans look at blurred images, i.e., by zooming in and out, we propose a Mixed-Scale Network (dubbed MSNet) to tackle these issues. Specifically, MSNet employs the zoom strategy to exploit mixed-scale contextual information, which fully unleashes the representation ability of deep neural networks. Firstly, the global feature aggregation module and the global feature enhancement module aim at aggregating mixed-scale features from the global perspective, learning the transform offsets of pixels to align the higher-resolution features contextually. Moreover, the local feature aggregation module enriches the instance-level features by adaptively aligning the features of different receptive fields. Extensive experiments on MS COCO demonstrate the effectiveness of MSNet, yielding significant improvements of 0.7–2.8 points in box AP over baselines when paired with different detectors, backbones, and schedules. In addition, for pixel-level prediction tasks including semantic segmentation and instance segmentation, the proposed method gains consistent improvements of 0.6–2.7 points in mIoU/mask AP over baselines, further substantiating the effectiveness of MSNet.
computer science, artificial intelligence; engineering, electrical & electronic; operations research & management science
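
The abstract says the global modules learn per-pixel transform offsets to align features across scales, but it gives no implementation details. The sketch below only illustrates the general idea of offset-based alignment and is not the authors' MSNet code: a coarse single-channel map is upsampled, resampled at offset-shifted positions with bilinear interpolation, and added to the fine map. The function names, the nearest-neighbour upsampling, the additive fusion, and the hand-supplied offsets (which a trained layer would predict in a real model) are all assumptions made for illustration.

```python
import math


def upsample_nearest(feat, factor=2):
    """Nearest-neighbour upsampling of a 2-D map stored as a list of rows."""
    h, w = len(feat), len(feat[0])
    return [[feat[i // factor][j // factor] for j in range(w * factor)]
            for i in range(h * factor)]


def bilinear_sample(feat, y, x):
    """Sample a 2-D map at a fractional (y, x), clamping to the border."""
    h, w = len(feat), len(feat[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    top = feat[y0][x0] * (1 - wx) + feat[y0][x1] * wx
    bottom = feat[y1][x0] * (1 - wx) + feat[y1][x1] * wx
    return top * (1 - wy) + bottom * wy


def align_and_fuse(high_res, low_res, offsets, factor=2):
    """Warp the upsampled coarse map by per-pixel offsets, then add it to the fine map.

    offsets[i][j] is a hypothetical (dy, dx) pair; in a trained network these
    would be predicted from the features rather than supplied by hand.
    """
    up = upsample_nearest(low_res, factor)
    fused = []
    for i, row in enumerate(high_res):
        fused_row = []
        for j, value in enumerate(row):
            dy, dx = offsets[i][j]
            fused_row.append(value + bilinear_sample(up, i + dy, j + dx))
        fused.append(fused_row)
    return fused


if __name__ == "__main__":
    fine = [[0.0] * 4 for _ in range(4)]
    coarse = [[1.0, 2.0], [3.0, 4.0]]
    shifts = [[(0.0, 0.0)] * 4 for _ in range(4)]
    shifts[0][1] = (0.0, 0.5)  # shift one pixel's sampling point half a pixel right
    for fused_row in align_and_fuse(fine, coarse, shifts):
        print(fused_row)
```

With all-zero offsets the result is simply the fine map plus the upsampled coarse map; a non-zero offset moves the sampling point, which is what lets misaligned content from the coarser scale be pulled into place before fusion.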