Unifying Automatic and Interactive Matting with Pretrained ViTs

Zixuan Ye, Wenze Liu, He Guo, Yujia Liang, Chaoyi Hong, Hao Lu, Zhiguo Cao
DOI: https://doi.org/10.1109/cvpr52733.2024.02417
2024-01-01
Abstract: Automatic and interactive matting have substantially improved image matting, the former by alleviating the need for auxiliary input and the latter by enabling object selection. Because the two settings differ in whether prompts exist, they suffer from weaknesses in either instance completeness or region details. Moreover, when dealing with different scenarios, switching directly between the two matting models is inconvenient and increases the workload. We therefore ask whether the limitations of both settings can be alleviated while unifying them for more convenient use. Our key idea is to offer saliency guidance in automatic mode so that it attends to detailed regions, and to refine instance completeness in interactive mode by replacing the binary mask guidance with a more probabilistic form. With different guidance for each mode, we achieve unification through adaptable guidance, defined as saliency information in automatic mode and as the user cue in interactive mode. In our method it is instantiated as a candidate feature: an automatic switch between the class token of a pretrained ViT and the average feature of the user prompts, controlled by whether user prompts exist. We then use the candidate feature to generate a probabilistic similarity map that serves as guidance, alleviating the over-reliance on a binary mask. Extensive experiments show that our method adapts well to both automatic and interactive scenarios with a more lightweight framework. Code is available at github.com/coconut/SMat.
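The candidate-feature switch described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of cosine similarity, and the rescaling of similarities to [0, 1] are all assumptions made for illustration, and the ViT features are replaced by small hand-written arrays.

```python
import numpy as np

def candidate_feature(patch_feats, cls_token, prompt_mask=None):
    """Pick the guidance source (hypothetical helper, not from the paper's code).

    patch_feats: (N, C) patch features from a ViT backbone.
    cls_token:   (C,) class token, used when no user prompt exists.
    prompt_mask: (N,) boolean mask of prompted patches, or None.
    """
    if prompt_mask is not None and prompt_mask.any():
        # Interactive mode: average the features of the prompted patches.
        return patch_feats[prompt_mask].mean(axis=0)
    # Automatic mode: fall back to the class token as a saliency cue.
    return cls_token

def similarity_map(patch_feats, candidate, eps=1e-8):
    """Soft guidance in [0, 1] for every patch.

    Assumption: cosine similarity rescaled from [-1, 1] to [0, 1];
    the abstract only says the map is probabilistic.
    """
    feats = patch_feats / (np.linalg.norm(patch_feats, axis=1, keepdims=True) + eps)
    cand = candidate / (np.linalg.norm(candidate) + eps)
    cos = np.clip(feats @ cand, -1.0, 1.0)
    return (cos + 1.0) / 2.0

# Toy example: three patches with two-dimensional features.
patch_feats = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
cls_token = np.array([1.0, 0.0])

auto_map = similarity_map(patch_feats, candidate_feature(patch_feats, cls_token))
prompt = np.array([False, True, False])
inter_map = similarity_map(patch_feats, candidate_feature(patch_feats, cls_token, prompt))
```

In this sketch the same `similarity_map` call serves both modes; only the source of the candidate feature changes, which is the sense in which the guidance is adaptable. Unlike a binary mask, the resulting map assigns graded values to patches that only partly resemble the candidate.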