SAN: Side Adapter Network for Open-vocabulary Semantic Segmentation

Mengde Xu, Zheng Zhang, Fangyun Wei, Han Hu, Xiang Bai
DOI: https://doi.org/10.1109/tpami.2023.3311618
IF: 23.6
2023-01-01
IEEE Transactions on Pattern Analysis and Machine Intelligence
Abstract: This paper concentrates on open-vocabulary semantic segmentation, where a well-optimized model is able to segment arbitrary categories that appear in an image. To achieve this goal, we present a novel framework termed Side Adapter Network, or SAN for short. Our design principles are three-fold: 1) Recent large-scale vision-language models (e.g., CLIP) show promising open-vocabulary image classification capability, so adapting a pre-trained CLIP model to open-vocabulary semantic segmentation is economical in training cost. 2) Our SAN model should be both lightweight and effective in order to reduce the inference cost. To achieve this, we fuse the CLIP model's intermediate features to enhance the representation capability of the SAN model, and drive the CLIP model to focus on the informative areas of an image with the aid of the attention biases predicted by a side adapter network. 3) Our approach should empower mainstream segmentation architectures with the capability of open-vocabulary segmentation. We present P-SAN and R-SAN to support the widely adopted pixel-wise segmentation and region-wise segmentation, respectively. Experimentally, our approach achieves state-of-the-art performance on 5 commonly used benchmarks while requiring far fewer trainable parameters and GFLOPs. For instance, our R-SAN outperforms the previous best method, OvSeg, by +2.3 averaged mIoU across all benchmarks while using only 6% of the trainable parameters and less than 1% of the GFLOPs. In addition, we conduct a comprehensive analysis of the open-vocabulary semantic segmentation datasets and verify the feasibility of transferring a well-optimized R-SAN model to the video segmentation task. Code and models are available at https://github.com/MendelXu/SAN.
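The abstract mentions attention biases that steer CLIP toward the informative areas of an image. As a rough illustration of the general idea only, and not the authors' implementation, the sketch below adds a bias term to the attention logits before the softmax. The function name `biased_attention`, the tensor shapes, and the hand-set bias values are all assumptions made for this example; in SAN the biases are predicted by the side adapter network rather than fixed by hand.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def biased_attention(q, k, v, bias):
    """Scaled dot-product attention with an additive bias on the logits.

    q: (n_q, d) queries; k, v: (n_k, d) keys and values;
    bias: (n_q, n_k) additive term (hypothetical stand-in for a predicted bias).
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d) + bias
    weights = softmax(logits, axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
n_patches, dim = 6, 8
query = rng.normal(size=(1, dim))           # one query token
keys = rng.normal(size=(n_patches, dim))    # one key per image patch
values = rng.normal(size=(n_patches, dim))

# Hand-set bias (assumption): keep the first three patches, suppress the rest.
bias = np.array([[0.0, 0.0, 0.0, -1e4, -1e4, -1e4]])

out, w = biased_attention(query, keys, values, bias)
```

With a large negative bias on the last three patches, their attention weights collapse to effectively zero, so the output is a mixture of the first three value vectors only. A softer bias would reweight the patches rather than mask them.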
computer science, artificial intelligence; engineering, electrical & electronic
What problem does this paper attempt to address?