Abstract: Pre-trained vision-language models (VLMs), such as CLIP, have demonstrated impressive zero-shot recognition capability, but still underperform in dense prediction tasks. Self-distillation has recently emerged as a promising approach for fine-tuning VLMs to better adapt to local regions without requiring extensive annotations. However, previous state-of-the-art approaches often suffer from significant "foreground bias", where models tend to wrongly identify background regions as foreground objects. To alleviate this issue, we propose DenseVLM, a framework designed to learn unbiased region-language alignment from powerful pre-trained VLM representations. By leveraging the pre-trained VLM to retrieve categories for unlabeled regions, DenseVLM effectively decouples the interference between foreground and background region features, ensuring that each region is accurately aligned with its corresponding category. We show that DenseVLM can be seamlessly integrated into open-vocabulary object detection and image segmentation tasks, leading to notable performance improvements. Furthermore, it exhibits promising zero-shot scalability when training on more extensive and diverse datasets.

What problem does this paper attempt to address?

This paper addresses the "foreground bias" that existing vision-language models (VLMs) exhibit in open-vocabulary dense prediction tasks. Specifically, these models tend to misidentify background regions as foreground objects, which degrades their dense prediction performance. To alleviate this problem, the paper proposes a new framework, DenseVLM, which learns unbiased region-language alignment from powerful pre-trained VLM representations.

### Main problems

1. **Foreground bias**: Existing VLMs perform poorly in local visual-semantic understanding, especially in locating and identifying small objects and background content. This is because these models are trained mainly to align whole images with global text, and so neglect the correspondence between local image regions and text descriptions.
2. **High cost of data annotation**: Some methods train on region-text pairs or pseudo region-text pairs, but these methods are limited by high annotation costs and poor scalability.
3. **Limitations of self-supervised methods**: Although self-supervised methods such as CLIPSelf and MaskEmbed can align region semantics without annotated data, their effectiveness is bounded by the performance of the teacher model, and they are prone to foreground bias.

### Solutions

The paper proposes the DenseVLM framework. Its core idea is to retrieve the categories of unlabeled regions with a powerful pre-trained VLM, and then use those categories to achieve accurate region-language alignment. The specific steps are as follows:

1. **Region feature extraction**: Extract a dense feature map from the input image and divide it into multiple regions.
2. **Category retrieval**: Use a pre-trained VLM (P-VLM) to retrieve the most relevant category for each region. The category is determined by computing the cosine similarity between the region feature and each text embedding.
3. **Decoupled alignment**: Decouple the alignment of region features and text embeddings into foreground and background parts. This reduces the interference between foreground and background features and ensures that each region is aligned with its corresponding category.
4. **End-to-end optimization**: Optimize the model by minimizing the negative log-likelihood losses of the foreground and background regions, which supports end-to-end training.

### Experimental results

The paper reports experiments on multiple open-vocabulary dense prediction benchmarks, covering object detection (box classification) and image segmentation (thing and background mask recognition). The results show that DenseVLM significantly outperforms other methods on these tasks, especially in background recognition, which indicates that it alleviates the foreground bias problem.

### Summary

DenseVLM alleviates the foreground bias of existing VLMs in dense prediction tasks by using powerful pre-trained VLM representations, and thereby improves performance on open-vocabulary dense prediction tasks.
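
The category retrieval and decoupled alignment steps described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`retrieve_categories`, `decoupled_nll`), the hard-label treatment of the retrieved category, the `temperature` value, and the way the softmax is restricted to one group are all assumptions made for illustration. The summary only states that categories are retrieved by cosine similarity and that foreground and background regions are aligned separately with negative log-likelihood losses.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


def retrieve_categories(region_feats, text_embeds):
    """Pick, for each region, the category with the highest cosine similarity.

    region_feats: (R, D) region features from the frozen pre-trained VLM.
    text_embeds:  (C, D) one text embedding per category name.
    Returns (labels of shape (R,), similarity matrix of shape (R, C)).
    """
    sims = l2_normalize(region_feats) @ l2_normalize(text_embeds).T
    return sims.argmax(axis=1), sims


def decoupled_nll(student_feats, text_embeds, labels, is_foreground,
                  temperature=0.01):
    """Negative log-likelihood with foreground/background decoupling.

    Assumption (hypothetical formulation): each region's softmax runs only
    over categories in the same group (foreground or background) as its
    retrieved label, so the two groups do not compete with each other.
    is_foreground: (C,) boolean flag per category.
    """
    logits = l2_normalize(student_feats) @ l2_normalize(text_embeds).T
    logits = logits / temperature
    region_is_fg = is_foreground[labels]                        # (R,)
    same_group = region_is_fg[:, None] == is_foreground[None, :]
    masked = np.where(same_group, logits, -np.inf)              # drop other group
    # Numerically stable log-softmax over the remaining categories.
    shifted = masked - masked.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(labels)), labels]
    fg_loss = nll[region_is_fg].mean() if region_is_fg.any() else 0.0
    bg_loss = nll[~region_is_fg].mean() if (~region_is_fg).any() else 0.0
    return fg_loss + bg_loss


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    text = rng.normal(size=(6, 16))          # 6 hypothetical categories
    teacher = rng.normal(size=(5, 16))       # 5 regions from the frozen VLM
    student = rng.normal(size=(5, 16))       # same regions from the trained model
    fg_flags = np.array([True, True, True, False, False, False])
    labels, _ = retrieve_categories(teacher, text)
    print("retrieved labels:", labels)
    print("decoupled loss:", decoupled_nll(student, text, labels, fg_flags))
```

In this sketch, a region whose retrieved label is a background category is never penalized for resembling a foreground category, and vice versa, which is one simple way to realize the "reduced interference" the summary describes.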

DenseVLM: A Retrieval and Decoupled Alignment Framework for Open-Vocabulary Dense Prediction

Exploring Vision-Language Models for Imbalanced Learning

F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models

Unified Lexical Representation for Interpretable Visual-Language Alignment

Dense Multimodal Alignment for Open-Vocabulary 3D Scene Understanding

Learning Object-Language Alignments for Open-Vocabulary Object Detection

DST-Det: Simple Dynamic Self-Training for Open-Vocabulary Object Detection

A Progressive Framework of Vision-language Knowledge Distillation and Alignment for Multilingual Scene

LLMs Meet VLMs: Boost Open Vocabulary Object Detection with Fine-grained Descriptors

ProbVLM: Probabilistic Adapter for Frozen Vision-Language Models

Aligning Bag of Regions for Open-Vocabulary Object Detection

CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction

ResCLIP: Residual Attention for Training-free Dense Vision-language Inference

EfficientVLM: Fast and Accurate Vision-Language Models via Knowledge Distillation and Modal-adaptive Pruning

OpenDlign: Open-World Point Cloud Understanding with Depth-Aligned Images

DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting

SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference

Improving Zero-Shot Generalization for CLIP with Variational Adapter

LaMI-DETR: Open-Vocabulary Detection with Language Model Instruction

Enhancing Fine-Grained Image Classifications via Cascaded Vision Language Models