Language-Driven Visual Consensus for Zero-Shot Semantic Segmentation
Zicheng Zhang, Tong Zhang, Yi Zhu, Jianzhuang Liu, Xiaodan Liang, QiXiang Ye, Wei Ke
DOI: https://doi.org/10.1109/tcsvt.2024.3504816
IF: 5.859
2024-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: The pre-trained vision-language model, exemplified by CLIP, advances zero-shot semantic segmentation by aligning visual features with class embeddings through a transformer decoder to generate semantic masks. Despite its effectiveness, prevailing methods within this paradigm encounter challenges, including overfitting on seen classes and small fragmentation in masks. To mitigate these issues, we propose a Language-Driven Visual Consensus (LDVC) approach, fostering improved alignment of semantic and visual information. Specifically, we leverage class embeddings as anchors due to their discrete and abstract nature, steering vision features toward class embeddings. Moreover, to circumvent noisy alignments from the vision part due to its redundant nature, we introduce route attention into self-attention for finding visual consensus, thereby enhancing semantic consistency within the same object. Equipped with a vision-language prompting strategy, our approach significantly boosts the generalization capacity of segmentation models for unseen classes. Experimental results underscore the effectiveness of our approach, showcasing mIoU gains of 4.5 on the PASCAL VOC 2012 and 3.6 on the COCO-Stuff 164k for unseen classes compared with the state-of-the-art methods.
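The general paradigm the abstract refers to, matching per-pixel visual features against class embeddings to produce a semantic mask, can be illustrated with a minimal sketch. This is not the LDVC method: it omits the transformer decoder, route attention, and the vision-language prompting strategy, and simply assigns each pixel the class whose embedding has the highest cosine similarity. The function names and the toy features below are hypothetical and chosen only for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def segment_by_similarity(pixel_features, class_embeddings):
    """Label each pixel with the class of highest cosine similarity.

    pixel_features: H x W grid of D-dimensional feature vectors (nested lists).
    class_embeddings: list of C class embeddings, each D-dimensional.
    Returns an H x W grid of class indices.
    """
    classes = [l2_normalize(c) for c in class_embeddings]
    mask = []
    for row in pixel_features:
        mask_row = []
        for feat in row:
            f = l2_normalize(feat)
            # Cosine similarity reduces to a dot product after normalization.
            scores = [sum(a * b for a, b in zip(f, c)) for c in classes]
            mask_row.append(max(range(len(scores)), key=scores.__getitem__))
        mask.append(mask_row)
    return mask


# Toy data (assumed): a 2 x 2 "image" with 3-dimensional features
# and two class embeddings standing in for text embeddings.
toy_classes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
toy_features = [
    [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1]],
    [[0.7, 0.3, 0.2], [0.1, 0.9, 0.0]],
]
toy_mask = segment_by_similarity(toy_features, toy_classes)
print(toy_mask)  # [[0, 1], [0, 1]]
```

Because the class embeddings act as fixed anchors in this sketch, an unseen class can be handled at inference time by appending its embedding to `class_embeddings`, which is the property that makes this family of models usable for zero-shot segmentation.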