CLIP for Lightweight Semantic Segmentation

Ke Jin, Wankou Yang
DOI: https://doi.org/10.48550/arXiv.2310.07394
2023-10-11
Abstract: The large-scale pretrained model CLIP, trained on 400 million image-text pairs, offers a promising paradigm for tackling vision tasks, albeit at the image level. Later works, such as DenseCLIP and LSeg, extend this paradigm to dense prediction, including semantic segmentation, and have achieved excellent results. However, these methods either rely on CLIP-pretrained visual backbones or use non-pretrained but heavy backbones such as Swin, and they become ineffective when applied to lightweight backbones. The reason is that lightweight networks, whose feature extraction ability is relatively limited, have difficulty producing image features that align well with text embeddings. In this work, we present a new feature fusion module that tackles this problem and enables the language-guided paradigm to be applied to lightweight networks. Specifically, the module is a parallel design of a CNN and a transformer with a two-way bridge in between, where the CNN extracts spatial information and visual context from the feature map of the image encoder, and the transformer propagates text embeddings from the text encoder forward. The core of the module is the bidirectional fusion of visual and text features across the bridge, which promotes their proximity and alignment in embedding space. The module is model-agnostic: it not only makes language-guided lightweight semantic segmentation practical, but also fully exploits the pretrained knowledge of language priors and achieves better performance than previous SOTA work, such as DenseCLIP, whatever the vision backbone is. Extensive experiments have been conducted to demonstrate the superiority of our method.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses how to apply the language-guided paradigm to lightweight visual backbones for efficient semantic segmentation. Existing methods such as DenseCLIP and LSeg rely on CLIP-pretrained visual backbones or on non-pretrained but heavier backbones (such as Swin) for dense prediction, and they perform poorly when applied to lightweight backbones. The reason is that lightweight networks have relatively weak feature extraction ability, which makes it hard to align image features with text embeddings. To solve this, the authors propose a new feature fusion module that makes the language-guided paradigm applicable to lightweight networks. The module fuses visual and text features in both directions through a two-way bridge between a CNN and a Transformer, improving the semantic segmentation performance of lightweight networks.

### Main Contributions

1. **Language-guided segmentation on lightweight visual backbones**: The proposed feature fusion module allows the language-guided paradigm to be applied effectively to lightweight networks.
2. **Model independence**: The module is not limited to lightweight networks. It also makes full use of pretrained language priors and outperforms existing SOTA methods on various visual backbones.

### Method Overview

- **Feature Fusion Module**: Composed of a CNN and a Transformer that communicate through a two-way bridge. The CNN extracts spatial information and visual context, while the Transformer propagates text embeddings. The two-way bridge fuses the features through a lightweight cross-attention mechanism.
- **Conv-Former Structure**: It consists of four parts, Conv, Former, Conv2Former, and Former2Conv, arranged as a parallel structure. Conv2Former carries the interaction from image features to text embeddings, and Former2Conv carries it from text embeddings to image features.

### Experimental Results

- Experiments on the ADE20K and Cityscapes datasets show that the method significantly improves semantic segmentation performance on lightweight backbones (such as MobileNetV2, Xception, and EfficientFormer). The computational cost rises slightly, which is presented as an acceptable trade-off.
- On heavy backbones (such as ResNet-50, ResNet-101, and ViT-B), the method also performs better, supporting its model independence and generalization ability.

### Conclusion

With the new feature fusion module, the authors apply the language-guided paradigm to lightweight semantic segmentation and report strong performance on multiple datasets and different types of backbones.
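
The two-way bridge from the Method Overview can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the single-head attention, the residual additions, the order of the two steps, the tensor sizes, and the randomly initialized projection weights (`c2f_weights`, `f2c_weights`) are all assumptions made for illustration, and the Conv and Former branches themselves are omitted. It only shows how text embeddings can attend to pixel features (the Conv2Former direction) and how pixel features can then attend to the updated text embeddings (the Former2Conv direction).

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head cross-attention with a residual connection.

    queries: (n_q, dim) tokens to be updated
    context: (n_c, dim) tokens that supply the information
    """
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (n_q, n_c), rows sum to 1
    return queries + attn @ v


rng = np.random.default_rng(0)
num_classes, height, width, dim = 5, 8, 8, 16  # illustrative sizes only

# Stand-ins for the encoder outputs (random here, learned in practice).
text = rng.standard_normal((num_classes, dim))    # one embedding per class name
feat = rng.standard_normal((height, width, dim))  # image feature map
pixels = feat.reshape(height * width, dim)        # flatten spatial positions


def init_proj():
    """Hypothetical projection matrix; a trained model would learn these."""
    return rng.standard_normal((dim, dim)) / np.sqrt(dim)


c2f_weights = [init_proj() for _ in range(3)]  # image -> text direction
f2c_weights = [init_proj() for _ in range(3)]  # text -> image direction

# Conv2Former direction: text embeddings query the pixel features.
text_fused = cross_attention(text, pixels, *c2f_weights)

# Former2Conv direction: pixel features query the updated text embeddings.
pixels_fused = cross_attention(pixels, text_fused, *f2c_weights)
feat_fused = pixels_fused.reshape(height, width, dim)

print(text_fused.shape, feat_fused.shape)
```

Because the number of text tokens equals the number of classes, each attention matrix has one small dimension, which is one plausible reason a bridge of this kind can stay lightweight.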