Segment Anything

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, Ross Girshick
DOI: https://doi.org/10.48550/arXiv.2304.02643
2023-04-06
Abstract: We introduce the Segment Anything (SA) project: a new task, model, and dataset for image segmentation. Using our efficient model in a data collection loop, we built the largest segmentation dataset to date (by far), with over 1 billion masks on 11M licensed and privacy respecting images. The model is designed and trained to be promptable, so it can transfer zero-shot to new image distributions and tasks. We evaluate its capabilities on numerous tasks and find that its zero-shot performance is impressive -- often competitive with or even superior to prior fully supervised results. We are releasing the Segment Anything Model (SAM) and corresponding dataset (SA-1B) of 1B masks and 11M images at https://segment-anything.com to foster research into foundation models for computer vision.
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
This paper sets out to build a foundation model for image segmentation: a model that can be steered by prompts and can therefore transfer zero-shot to new data distributions and tasks. It proposes three interrelated components to achieve this goal:

1. **Promptable Segmentation Task**: A general task whose goal is to return a valid segmentation mask for any type of prompt (such as points, boxes, or text). Even when the prompt is ambiguous, the model should return a reasonable mask for at least one of the objects the prompt could refer to.
2. **Segment Anything Model (SAM)**: A model designed to support flexible prompts and to produce segmentation masks in real time or near real time. It consists of an image encoder, a prompt encoder, and a lightweight mask decoder. It handles multiple prompt types and can output several candidate masks when a prompt is ambiguous.
3. **Data Engine**: Because existing segmentation datasets are limited in scale, the authors developed a data engine with three stages: model-assisted manual annotation, semi-automatic annotation, and fully automatic annotation. It collected more than 1 billion segmentation masks, which form SA-1B, the largest segmentation dataset to date.

Together, these three components are intended to yield a powerful segmentation foundation model: one that performs well not only on its training data but also, through prompt engineering, zero-shot on unseen data distributions and tasks, and can thereby address a wide range of downstream segmentation problems.
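To make the promptable interface more concrete, the following is a minimal toy sketch in plain Python. It is not SAM and does not use the released code: the functions `encode_image` and `segment_from_point` are hypothetical names, and simple region growing stands in for the learned prompt encoder and mask decoder. It only illustrates two ideas from the summary above: the image is processed once and can then be queried with prompts, and a single ambiguous point prompt can yield several nested candidate masks (for example a subpart, a part, and the whole).

```python
from collections import deque


def encode_image(image):
    """Toy stand-in for an image encoder: runs once per image.

    Here it only caches the pixel grid and its size; a real encoder
    would compute a learned embedding.
    """
    return {"pixels": image, "h": len(image), "w": len(image[0])}


def segment_from_point(embedding, point, tolerances=(0, 4, 9)):
    """Toy stand-in for prompt encoder + mask decoder.

    Grows a 4-connected region from the prompted point, once per
    intensity tolerance, and returns one candidate mask per tolerance.
    Several tolerances mimic returning multiple masks for one
    ambiguous prompt.
    """
    pixels, h, w = embedding["pixels"], embedding["h"], embedding["w"]
    r0, c0 = point
    seed = pixels[r0][c0]
    candidates = []
    for tol in tolerances:
        mask = [[False] * w for _ in range(h)]
        mask[r0][c0] = True
        queue = deque([(r0, c0)])
        while queue:
            r, c = queue.popleft()
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if (0 <= nr < h and 0 <= nc < w and not mask[nr][nc]
                        and abs(pixels[nr][nc] - seed) <= tol):
                    mask[nr][nc] = True
                    queue.append((nr, nc))
        area = sum(sum(row) for row in mask)
        candidates.append({"tolerance": tol, "mask": mask, "area": area})
    return candidates


if __name__ == "__main__":
    # A bright pixel inside a mid-gray square on a dark background.
    image = [
        [0, 0, 0, 0, 0],
        [0, 5, 5, 5, 0],
        [0, 5, 9, 5, 0],
        [0, 5, 5, 5, 0],
        [0, 0, 0, 0, 0],
    ]
    embedding = encode_image(image)  # computed once, reusable across prompts
    for cand in segment_from_point(embedding, (2, 2)):
        print(cand["tolerance"], cand["area"])
```

In this toy, a click on the center pixel is ambiguous between the pixel itself, the surrounding square, and the whole image, so three masks come back. The actual model additionally predicts a confidence score for each candidate mask, which this sketch does not attempt to imitate.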