Segment Anything without Supervision

XuDong Wang, Jingfeng Yang, Trevor Darrell
2024-06-29
Abstract: The Segment Anything Model (SAM) requires labor-intensive data labeling. We present Unsupervised SAM (UnSAM) for promptable and automatic whole-image segmentation that does not require human annotations. UnSAM utilizes a divide-and-conquer strategy to "discover" the hierarchical structure of visual scenes. We first leverage top-down clustering methods to partition an unlabeled image into instance/semantic level segments. For all pixels within a segment, a bottom-up clustering method is employed to iteratively merge them into larger groups, thereby forming a hierarchical structure. These unsupervised multi-granular masks are then utilized to supervise model training. Evaluated across seven popular datasets, UnSAM achieves competitive results with the supervised counterpart SAM, and surpasses the previous state-of-the-art in unsupervised segmentation by 11% in terms of AR. Moreover, we show that supervised SAM can also benefit from our self-supervised labels. By integrating our unsupervised pseudo masks into SA-1B's ground-truth masks and training UnSAM with only 1% of SA-1B, a lightly semi-supervised UnSAM can often segment entities overlooked by supervised SAM, exceeding SAM's AR by over 6.7% and AP by 3.9% on SA-1B.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
This paper addresses how to train a promptable, whole-image segmentation model without human annotations. The proposed model is called Unsupervised SAM (UnSAM). The Segment Anything Model (SAM) depends on a large amount of manually labeled data, whereas UnSAM generates its own training masks with a divide-and-conquer strategy. First, a top-down clustering method partitions each unlabeled image into instance/semantic-level segments. Then, within each segment, a bottom-up clustering method iteratively merges pixels into larger groups, producing a hierarchical structure. The resulting unsupervised multi-granular masks are used as pseudo labels to supervise model training. Evaluated on seven popular datasets, UnSAM achieves results competitive with supervised SAM and surpasses the previous state of the art in unsupervised segmentation by 11% in average recall (AR). The paper also shows that supervised SAM can benefit from these self-supervised labels: by combining UnSAM's pseudo masks with the ground-truth masks of SA-1B and training on only 1% of SA-1B, a lightly semi-supervised UnSAM exceeds SAM's AR by over 6.7% and its average precision (AP) by 3.9% on SA-1B. It can also often segment entities that supervised SAM overlooks, which the paper attributes to human annotations being biased and tending to miss small objects.
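The divide-and-conquer idea described above can be illustrated with a toy sketch. This is not the authors' implementation: the mean-intensity split in `divide`, the intensity-difference merge rule in `conquer`, the threshold values, and all function names are assumptions chosen only to keep the example small and self-contained. It shows the overall shape of the procedure: a top-down split into coarse segments, then bottom-up merging of pixels inside each segment, with the groups at every level collected as pseudo masks.

```python
from itertools import combinations

def divide(image):
    """Top-down stand-in (assumption): split pixels into two coarse
    segments around the mean intensity of the image."""
    pixels = [(r, c) for r in range(len(image)) for c in range(len(image[0]))]
    mean = sum(image[r][c] for r, c in pixels) / len(pixels)
    bright = {p for p in pixels if image[p[0]][p[1]] >= mean}
    dark = set(pixels) - bright
    return [s for s in (bright, dark) if s]

def _mean(image, group):
    return sum(image[r][c] for r, c in group) / len(group)

def _adjacent(a, b):
    # Two groups touch if any pair of their pixels is 4-connected.
    return any(abs(r1 - r2) + abs(c1 - c2) == 1
               for r1, c1 in a for r2, c2 in b)

def conquer(image, segment, thresholds):
    """Bottom-up stand-in (assumption): start from single pixels and merge
    adjacent groups whose mean intensities differ by less than the current
    threshold. Returns one list of groups per threshold, fine to coarse."""
    groups = [frozenset([p]) for p in sorted(segment)]
    levels = []
    for t in sorted(thresholds):
        merged = True
        while merged:
            merged = False
            for a, b in combinations(groups, 2):
                if _adjacent(a, b) and abs(_mean(image, a) - _mean(image, b)) < t:
                    groups = [g for g in groups if g not in (a, b)] + [a | b]
                    merged = True
                    break  # restart the scan on the updated group list
        levels.append(list(groups))
    return levels

def pseudo_masks(image, thresholds):
    """Collect the coarse segments plus every group from every level."""
    masks = set()
    for segment in divide(image):
        masks.add(frozenset(segment))
        for level in conquer(image, segment, thresholds):
            masks.update(level)
    return masks

# Hypothetical 2x4 grayscale image: a bright region on the left, a dark one on the right.
image = [
    [0.9, 0.9, 0.1, 0.1],
    [0.9, 0.7, 0.1, 0.3],
]
masks = pseudo_masks(image, thresholds=[0.05, 0.5])
for m in sorted(masks, key=lambda m: (len(m), sorted(m))):
    print(sorted(m))
```

In this sketch each mask is a set of pixel coordinates, and masks at different granularities overlap by design (a part is contained in its whole). In the paper's setting, such multi-granular masks serve as the training targets for the segmentation model.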