Abstract: The large-scale pretrained model CLIP, trained on 400 million image-text pairs, offers a promising paradigm for tackling vision tasks, albeit at the image level. Later works, such as DenseCLIP and LSeg, extend this paradigm to dense prediction, including semantic segmentation, and have achieved excellent results. However, these methods either rely on CLIP-pretrained visual backbones or use non-pretrained but heavy backbones such as Swin, and they become ineffective when applied to lightweight backbones. The reason is that lightweight networks, whose feature extraction ability is relatively limited, have difficulty producing image features that align well with text embeddings. In this work, we present a new feature fusion module that tackles this problem and enables the language-guided paradigm to be applied to lightweight networks. Specifically, the module is a parallel design of a CNN and a transformer with a two-way bridge in between, where the CNN extracts spatial information and visual context from the feature map produced by the image encoder, and the transformer propagates the text embeddings from the text encoder forward. The core of the module is the bidirectional fusion of visual and text features across the bridge, which promotes their proximity and alignment in the embedding space. The module is model-agnostic: it not only makes language-guided lightweight semantic segmentation practical, but also fully exploits the pretrained knowledge of language priors and achieves better performance than previous SOTA work, such as DenseCLIP, regardless of the vision backbone. Extensive experiments have been conducted to demonstrate the superiority of our method.

What problem does this paper attempt to address?

The problem this paper attempts to solve is how to apply the language-guided paradigm to lightweight visual backbones for efficient semantic segmentation. Existing methods such as DenseCLIP and LSeg rely on CLIP-pretrained visual backbones or on non-pretrained but heavier backbones (such as Swin) for dense prediction, and they do not perform well when applied to lightweight backbones. This is because the feature extraction ability of lightweight networks is relatively weak, which makes it difficult to align image features with text embeddings. To solve this problem, the authors propose a new feature fusion module that makes the language-guided paradigm applicable to lightweight networks. The module fuses visual and text features in both directions through a two-way bridge between a CNN and a Transformer, thereby improving the performance of lightweight networks on semantic segmentation.

### Main Contributions

1. **Language-guided segmentation on lightweight visual backbones**: The proposed feature fusion module enables the language-guided paradigm to be applied effectively to lightweight networks.
2. **Model-agnostic design**: The module is not limited to lightweight networks. It also makes full use of pretrained language priors and achieves better performance than existing SOTA methods on various visual backbones.

### Method Overview

- **Feature Fusion Module**: Composed of a CNN and a Transformer that communicate through a two-way bridge. The CNN extracts spatial information and visual context, while the Transformer propagates the text embeddings. The two-way bridge performs feature fusion through a lightweight cross-attention mechanism.
- **Conv-Former Structure**: Consists of four parts (Conv, Former, Conv2Former and Former2Conv) arranged as a parallel structure. Conv2Former carries information from image features to text embeddings, and Former2Conv carries information from text embeddings back to image features.

### Experimental Results

- Experiments on the ADE20K and Cityscapes datasets show that the method significantly improves semantic segmentation performance on lightweight backbones (such as MobileNetV2, Xception and EfficientFormer). The computational cost increases slightly, which is an acceptable trade-off.
- On heavy backbones (such as ResNet-50, ResNet-101 and ViT-B), the method also shows superior performance, supporting its model-agnostic design and generalization ability.

### Conclusion

By proposing a new feature fusion module, the authors apply the language-guided paradigm to lightweight semantic segmentation and demonstrate excellent performance on multiple datasets and on different types of backbones.
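
To make the two-way bridge from the Method Overview more concrete, here is a minimal NumPy sketch of bidirectional cross-attention between flattened visual features and text embeddings. It is an illustration under assumptions, not the authors' implementation: the function names (`cross_attention`, `two_way_bridge`), the single attention head, the residual connections, the order of the two steps, and all shapes and weights are hypothetical. The Conv and Former branches themselves are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head cross-attention: each query row attends over the context rows.

    queries: (n_q, C), context: (n_c, C), weights: (C, C). Returns (n_q, C).
    """
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (n_q, n_c) scaled dot products
    return softmax(scores, axis=-1) @ v


def two_way_bridge(visual, text, params):
    """Hypothetical bridge: fuse visual and text features in both directions.

    visual: (H*W, C) flattened feature map, text: (K, C) class embeddings.
    """
    # Conv2Former direction: text tokens query the visual features.
    text_out = text + cross_attention(text, visual, *params["conv2former"])
    # Former2Conv direction: visual positions query the updated text tokens.
    visual_out = visual + cross_attention(visual, text_out, *params["former2conv"])
    return visual_out, text_out


# Toy example with random inputs and weights (assumed sizes, not from the paper).
rng = np.random.default_rng(0)
C, HW, K = 8, 16, 5  # channels, spatial positions, number of classes
visual = rng.standard_normal((HW, C))
text = rng.standard_normal((K, C))
params = {
    name: [0.1 * rng.standard_normal((C, C)) for _ in range(3)]
    for name in ("conv2former", "former2conv")
}
v2, t2 = two_way_bridge(visual, text, params)
print(v2.shape, t2.shape)  # (16, 8) (5, 8)
```

Both outputs keep the shape of their inputs, so a block like this could in principle be dropped between an image encoder and a segmentation head without changing either interface. Because the text side has only K tokens (one per class), each attention map has H*W × K entries, which is one plausible reading of why the summary calls the cross-attention "lightweight".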

CLIP for Lightweight Semantic Segmentation

ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference

SegCLIP: Multimodal Visual-Language and Prompt Learning for High-Resolution Remote Sensing Semantic Segmentation

CLIP is Also an Efficient Segmenter: A Text-Driven Approach for Weakly Supervised Semantic Segmentation

MTA-CLIP: Language-Guided Semantic Segmentation with Mask-Text Alignment

CLIP-Driven Prototype Network for Few-Shot Semantic Segmentation

LightCLIP: Learning Multi-Level Interaction for Lightweight Vision-Language Models

CLIPer: Hierarchically Improving Spatial Representation of CLIP for Open-Vocabulary Semantic Segmentation

DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting

WeakCLIP: Adapting CLIP for Weakly-Supervised Semantic Segmentation

CLIP-S$^4$: Language-Guided Self-Supervised Semantic Segmentation

Frozen CLIP: A Strong Backbone for Weakly Supervised Semantic Segmentation

SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference

CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction

LCCo: Lending CLIP to Co-Segmentation

Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP

MMF-CLIP: An Image-Text Multimodal Semantic Segmentation Method for Remote Sensing Images

Cascade-CLIP: Cascaded Vision-Language Embeddings Alignment for Zero-Shot Semantic Segmentation

CFENet: Leveraging CLIP Text Features for Enhanced Few-Shot Semantic Segmentation

Self-Calibrated CLIP for Training-Free Open-Vocabulary Segmentation