Crowd-SAM: SAM as a Smart Annotator for Object Detection in Crowded Scenes

Zhi Cai, Yingjie Gao, Yaoyan Zheng, Nan Zhou, Di Huang
2024-07-19
Abstract: In computer vision, object detection is an important task that finds its application in many scenarios. However, obtaining extensive labels can be challenging, especially in crowded scenes. Recently, the Segment Anything Model (SAM) has been proposed as a powerful zero-shot segmenter, offering a novel approach to instance segmentation tasks. However, the accuracy and efficiency of SAM and its variants are often compromised when handling objects in crowded and occluded scenes. In this paper, we introduce Crowd-SAM, a SAM-based framework designed to enhance SAM's performance in crowded and occluded scenes with the cost of few learnable parameters and minimal labeled images. We introduce an efficient prompt sampler (EPS) and a part-whole discrimination network (PWD-Net), enhancing mask selection and accuracy in crowded scenes. Despite its simplicity, Crowd-SAM rivals state-of-the-art (SOTA) fully-supervised object detection methods on several benchmarks including CrowdHuman and CityPersons. Our code is available at https://github.com/FelixCaae/CrowdSAM.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses object detection in crowded scenes, where objects are dense and frequently occlude one another. Traditional object detection methods typically require large amounts of labeled data for training, which is both time-consuming and costly to collect. The paper therefore asks how to improve the accuracy and efficiency of detection in crowded scenes without relying on extensive labels.
To this end, the authors propose Crowd-SAM, a framework built on the Segment Anything Model (SAM). It aims to improve SAM's performance in crowded scenes using only a small number of learnable parameters and minimal labeled images. It does so by introducing two components, the Efficient Prompt Sampler (EPS) and the Part-Whole Discrimination Network (PWD-Net), which improve mask selection and accuracy in crowded scenes.
The main contributions of Crowd-SAM are:
1. Proposing Crowd-SAM, a self-prompted segmentation method for labeling images that contain clustered objects, which produces accurate results from only a few examples.
2. Designing two new components, EPS and PWD-Net, which effectively unleash the capabilities of SAM in crowded scenes.
3. Conducting comprehensive experiments on two benchmarks, demonstrating the effectiveness and generalization ability of Crowd-SAM.
With these contributions, Crowd-SAM achieves performance comparable to fully supervised object detection methods on public benchmarks such as CrowdHuman and CityPersons, while remaining simple and fast to train.
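To make the idea of mask selection for detection concrete, the following is a minimal sketch of one generic way to turn scored candidate masks into detection boxes. It is not the paper's EPS or PWD-Net: the function names (`mask_to_box`, `mask_iou`, `select_masks`), the thresholds, and the per-mask scores are all assumptions made for illustration. The score simply stands in for whatever quality estimate a learned discriminator might provide.

```python
from typing import List, Tuple

Mask = List[List[int]]  # binary mask stored as rows of 0/1 (illustrative format)


def mask_to_box(mask: Mask) -> Tuple[int, int, int, int]:
    """Return the tight (x_min, y_min, x_max, y_max) box around a non-empty mask."""
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, v in enumerate(row) if v]
    if not ys:
        raise ValueError("empty mask")
    return min(xs), min(ys), max(xs), max(ys)


def mask_iou(a: Mask, b: Mask) -> float:
    """Intersection-over-union of two same-sized binary masks."""
    inter = sum(1 for ra, rb in zip(a, b) for va, vb in zip(ra, rb) if va and vb)
    union = sum(1 for ra, rb in zip(a, b) for va, vb in zip(ra, rb) if va or vb)
    return inter / union if union else 0.0


def select_masks(masks: List[Mask], scores: List[float],
                 score_thr: float = 0.5, iou_thr: float = 0.5) -> List[int]:
    """Greedy selection: drop low-scoring masks, then drop near-duplicates.

    Hypothetical stand-in for a mask-selection step; thresholds are arbitrary.
    """
    order = sorted((i for i, s in enumerate(scores) if s >= score_thr),
                   key=lambda i: scores[i], reverse=True)
    kept: List[int] = []
    for i in order:
        if all(mask_iou(masks[i], masks[k]) <= iou_thr for k in kept):
            kept.append(i)
    return kept


# Toy example: two overlapping candidates for one object, one separate
# object, and one low-scoring fragment.
whole = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0]]
near_dup = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0]]
other = [[0, 0, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
fragment = [[0, 0, 0, 1], [0, 0, 0, 0], [0, 0, 0, 0]]

masks = [whole, near_dup, other, fragment]
scores = [0.9, 0.8, 0.7, 0.2]

kept = select_masks(masks, scores)
boxes = [mask_to_box(masks[i]) for i in kept]
```

In this toy case the near-duplicate is suppressed by its overlap with the higher-scoring mask and the fragment is dropped by the score threshold, leaving one box per object. In crowded scenes a fixed overlap threshold can also suppress genuinely distinct but heavily occluded objects, which is one reason a learned selection step is attractive.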