ZIM: Zero-Shot Image Matting for Anything

Beomyoung Kim, Chanyong Shin, Joonhyun Jeong, Hyungsik Jung, Se-Yun Lee, Sewhan Chun, Dong-Hyun Hwang, Joonsang Yu
2024-11-01
Abstract: The recent segmentation foundation model, Segment Anything Model (SAM), exhibits strong zero-shot segmentation capabilities, but it falls short in generating fine-grained precise masks. To address this limitation, we propose a novel zero-shot image matting model, called ZIM, with two key contributions: First, we develop a label converter that transforms segmentation labels into detailed matte labels, constructing the new SA1B-Matte dataset without costly manual annotations. Training SAM with this dataset enables it to generate precise matte masks while maintaining its zero-shot capability. Second, we design the zero-shot matting model equipped with a hierarchical pixel decoder to enhance mask representation, along with a prompt-aware masked attention mechanism to improve performance by enabling the model to focus on regions specified by visual prompts. We evaluate ZIM using the newly introduced MicroMat-3K test set, which contains high-quality micro-level matte labels. Experimental results show that ZIM outperforms existing methods in fine-grained mask generation and zero-shot generalization. Furthermore, we demonstrate the versatility of ZIM in various downstream tasks requiring precise masks, such as image inpainting and 3D NeRF. Our contributions provide a robust foundation for advancing zero-shot matting and its downstream applications across a wide range of computer vision tasks. The code is available at https://github.com/naver-ai/ZIM.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses a limitation of existing promptable segmentation models such as the Segment Anything Model (SAM): they struggle to generate fine-grained masks. SAM performs well in zero-shot segmentation, but its masks often lack precision around complex boundaries and fine details such as hair strands. To overcome this, the authors propose ZIM, a zero-shot image matting model that generates high-quality, fine-grained matte masks while retaining zero-shot capability. The main contributions are:

1. **Label Converter**: A label converter transforms segmentation labels into detailed matte labels, which is used to construct a new large-scale fine-grained matting dataset (SA1B-Matte) without costly manual annotation. Training SAM on this dataset lets it generate more precise matte masks while retaining its zero-shot capability.
2. **Zero-Shot Matting Model**: The matting model adds a hierarchical pixel decoder and a prompt-aware masked attention mechanism. The hierarchical pixel decoder uses a multi-level feature pyramid design to make the mask feature representation richer and more robust. The prompt-aware masked attention mechanism lets the model focus on the regions specified by visual prompts.
3. **New Test Set**: A new test set (MicroMat-3K) containing 3000 high-quality micro-level matte labels is introduced to evaluate zero-shot matting performance.

Together, these contributions provide a foundation for advancing zero-shot matting and its downstream applications, particularly tasks that require high-precision masks, such as image inpainting and 3D NeRF.
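
The idea of attention that focuses on a prompted region can be illustrated with a small, generic sketch. It assumes the visual prompt is a bounding box and that the attention mask is binary. It is plain masked scaled dot-product attention written with the Python standard library, not the authors' implementation. The helper names `box_attention_mask` and `masked_attention`, the box convention, and the toy inputs are all hypothetical.

```python
import math


def box_attention_mask(height, width, box):
    """Flat row-major list of booleans: True where a pixel lies inside the box.

    Assumed convention: box = (x0, y0, x1, y1), with x1 and y1 exclusive.
    """
    x0, y0, x1, y1 = box
    return [x0 <= x < x1 and y0 <= y < y1
            for y in range(height) for x in range(width)]


def masked_attention(query, keys, values, mask):
    """Single-query scaled dot-product attention restricted to masked-in positions."""
    scale = 1.0 / math.sqrt(len(query))
    # Positions outside the mask get a logit of -inf, so softmax assigns them zero weight.
    logits = [scale * sum(q * k for q, k in zip(query, key)) if keep else -math.inf
              for key, keep in zip(keys, mask)]
    peak = max(logits)
    if peak == -math.inf:
        raise ValueError("mask excludes every position")
    exps = [math.exp(logit - peak) for logit in logits]  # numerically stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output


# Toy 4x4 feature map; the box prompt covers the central 2x2 pixels.
height, width = 4, 4
mask = box_attention_mask(height, width, box=(1, 1, 3, 3))
keys = [[float(i % 3), 1.0] for i in range(height * width)]
values = [[float(i)] for i in range(height * width)]
weights, output = masked_attention([0.5, -0.5], keys, values, mask)
```

In this toy setup every position outside the box receives zero attention weight, so the output is a weighted average of the values inside the prompted region only. A soft mask, for example for point prompts, could be sketched the same way by adding a finite bias to the logits instead of `-inf`; that variant is likewise an assumption, not something stated on this page.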