YOLO-MS: Rethinking Multi-Scale Representation Learning for Real-time Object Detection.

Yuming Chen, Xinbin Yuan, Ruiqi Wu, Jiabao Wang, Qibin Hou, Ming-Ming Cheng
DOI: https://doi.org/10.48550/arxiv.2308.05480
2023-01-01
Abstract: We aim at providing the object detection community with an efficient and performant object detector, termed YOLO-MS. The core design is based on a series of investigations on how convolutions with different kernel sizes affect the detection performance of objects at different scales. The outcome is a new strategy that can strongly enhance multi-scale feature representations of real-time object detectors. To verify the effectiveness of our strategy, we build a network architecture, termed YOLO-MS. We train our YOLO-MS on the MS COCO dataset from scratch without relying on any other large-scale datasets, like ImageNet, or pre-trained weights. Without bells and whistles, our YOLO-MS outperforms the recent state-of-the-art real-time object detectors, including YOLO-v7 and RTMDet, when using a comparable number of parameters and FLOPs. Taking the XS version of YOLO-MS as an example, with only 4.5M learnable parameters and 8.7G FLOPs, it can achieve an AP score of 43%+ on MS COCO, which is about 2%+ higher than RTMDet with the same model size. Moreover, our work can also be used as a plug-and-play module for other YOLO models. Typically, our method significantly improves the AP of YOLOv8 from 37%+ to 40%+ with even fewer parameters and FLOPs. Code is available at https://github.com/FishAndWasabi/YOLO-MS.