Abstract: Open-Vocabulary Segmentation (OVS) aims at segmenting images from free-form textual concepts without predefined training classes. While existing vision-language models such as CLIP can generate segmentation masks by leveraging coarse spatial information from Vision Transformers, they face challenges in spatial localization due to their global alignment of image and text features. Conversely, self-supervised visual models like DINO excel in fine-grained visual encoding but lack integration with language. To bridge this gap, we present Talk2DINO, a novel hybrid approach that combines the spatial accuracy of DINOv2 with the language understanding of CLIP. Our approach aligns the textual embeddings of CLIP to the patch-level features of DINOv2 through a learned mapping function without the need to fine-tune the underlying backbones. At training time, we exploit the attention maps of DINOv2 to selectively align local visual patches with textual embeddings. We show that the powerful semantic and localization abilities of Talk2DINO can enhance the segmentation process, resulting in more natural and less noisy segmentations, and that our approach can also effectively distinguish foreground objects from the background. Experimental results demonstrate that Talk2DINO achieves state-of-the-art performance across several unsupervised OVS benchmarks. Source code and models are publicly available at: https://lorebianchi98.github.io/Talk2DINO/
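The training-time alignment described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the two-layer tanh mapping, the weight matrices `w1` and `w2`, the helper names, and the max-over-heads selection rule are all assumptions made for illustration, and the toy vectors stand in for real CLIP text embeddings and DINOv2 patch features and attention maps.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)

def project_text(text_emb, w1, w2):
    """Hypothetical learned mapping: W2 * tanh(W1 * t), text space -> patch space."""
    hidden = [math.tanh(sum(w * t for w, t in zip(row, text_emb))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

def head_embeddings(patches, attn_maps):
    """One attention-weighted average of the patch features per attention head."""
    dim = len(patches[0])
    embeddings = []
    for attn in attn_maps:
        total = sum(attn)
        embeddings.append(
            [sum(a * p[d] for a, p in zip(attn, patches)) / total for d in range(dim)]
        )
    return embeddings

def alignment_score(text_emb, patches, attn_maps, w1, w2):
    """Return (index of the best-matching head, its cosine similarity)."""
    projected = project_text(text_emb, w1, w2)
    sims = [cosine(projected, h) for h in head_embeddings(patches, attn_maps)]
    best = max(range(len(sims)), key=sims.__getitem__)
    return best, sims[best]

# Toy input: 3 patches with 2-dim features, 2 attention heads (all values made up).
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
attn_maps = [[0.9, 0.05, 0.05], [0.05, 0.9, 0.05]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # placeholder for trained weights
best_head, score = alignment_score([1.0, 0.0], patches, attn_maps, identity, identity)
print(best_head, score)  # head 0 is selected on this toy input
```

In a real setup the score would feed a training loss that updates only the mapping weights, since the abstract states that neither backbone is fine-tuned; the loss itself is not specified in this text and is left out of the sketch.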

What problem does this paper attempt to address?

The paper addresses the challenges of Open-Vocabulary Segmentation (OVS). The OVS task is to segment images according to free-form text concepts supplied at inference time, without a fixed set of categories defined during training. Existing vision-language models such as CLIP can produce segmentation masks from the coarse spatial information of Vision Transformers, but they localize poorly because they are trained to align image and text features globally. Self-supervised vision models such as DINO, by contrast, excel at fine-grained visual encoding but lack integration with language. To bridge this gap, the paper proposes Talk2DINO, a hybrid method that combines the spatial accuracy of DINOv2 with the language understanding of CLIP. It aligns the text embeddings of CLIP with the patch-level features of DINOv2 through a learned mapping function, without fine-tuning the underlying backbones. The result is more natural, less noisy segmentation and an effective separation of foreground objects from the background.

### Main contributions of the paper:

1. **Propose Talk2DINO**: This is the first model to directly align the DINOv2 and CLIP feature spaces for OVS. By mapping the text embeddings of CLIP into the DINOv2 space with a non-linear function, Talk2DINO gives DINOv2 language capabilities.
2. **Novel training scheme**: The model is trained with a new scheme that selects the most relevant visual self-attention heads and does not require fine-tuning of the backbones.
3. **Efficient inference process**: The paper demonstrates the capabilities of Talk2DINO in unsupervised OVS and designs a computationally efficient inference process, which includes a new background cleaning method based on DINOv2 self-attention.
4. **Experimental results**: Talk2DINO achieves state-of-the-art performance on standard OVS benchmarks, supporting the effectiveness of the proposed method.

### Problems solved:

- **Challenges in spatial localization**: Models such as CLIP are limited in spatial localization because they mainly focus on the global alignment of images and text. Talk2DINO addresses this by introducing the fine-grained spatial features of DINOv2.
- **Recognition of background regions**: Another challenge in the OVS task is recognizing background areas that do not belong to any of the given categories. The paper proposes a background cleaning method based on DINOv2 self-attention, which distinguishes the foreground from the background.

In conclusion, by combining the language understanding of CLIP with the spatial localization ability of DINOv2, the paper improves open-vocabulary segmentation performance, particularly in spatial localization and background recognition.
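The inference step summarized above (matching each patch against text embeddings that have already been mapped into the DINOv2 space) might look roughly like the following sketch. It rests on stated assumptions: the function names are hypothetical, the feature vectors are toy values, and the fixed similarity threshold used to mark background is a simple stand-in, not the paper's self-attention-based background cleaning.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)

def segment_patches(patch_feats, class_embs, bg_threshold=0.5):
    """Assign each patch the index of its most similar class, or -1 for background.

    patch_feats: list of patch feature vectors (stand-ins for DINOv2 patch tokens).
    class_embs:  list of class text embeddings already mapped into patch space.
    bg_threshold: assumed cutoff; patches whose best similarity falls below it
                  are treated as background (a simplification, see lead-in).
    """
    labels = []
    for patch in patch_feats:
        sims = [cosine(patch, emb) for emb in class_embs]
        best = max(range(len(sims)), key=sims.__getitem__)
        labels.append(best if sims[best] >= bg_threshold else -1)
    return labels

# Toy input: four patches, two classes (all values made up).
patch_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, -1.0]]
class_embs = [[1.0, 0.0], [0.0, 1.0]]
print(segment_patches(patch_feats, class_embs))  # → [0, 0, 1, -1]
```

The per-patch labels would then be reshaped to the patch grid and upsampled to image resolution to obtain a mask; that step is omitted here.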

Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation

CLIP-DINOiser: Teaching CLIP a few DINO tricks for open-vocabulary semantic segmentation

Betrayed by Attention: A Simple yet Effective Approach for Self-supervised Video Object Segmentation

Mask DINO: Towards a Unified Transformer-Based Framework for Object Detection and Segmentation

OV-DINO: Unified Open-Vocabulary Detection with Language-Aware Selective Fusion

Emergent Open-Vocabulary Semantic Segmentation from Off-the-shelf Vision-Language Models

OpenDAS: Open-Vocabulary Domain Adaptation for 2D and 3D Segmentation

Harnessing Vision Foundation Models for High-Performance, Training-Free Open Vocabulary Segmentation

From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language Models

DIVE: Taming DINO for Subject-Driven Video Editing

Learning Open-vocabulary Semantic Segmentation Models from Natural Language Supervision

Towards Open-Vocabulary Semantic Segmentation Without Semantic Labels

Segment Any 3D Object with Language

Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

Upsampling DINOv2 features for unsupervised vision tasks and weakly supervised materials segmentation

DINOv2: Learning Robust Visual Features without Supervision

Non-Contrastive Self-Supervised Learning of Utterance-Level Speech Representations

FrozenSeg: Harmonizing Frozen Foundation Models for Open-Vocabulary Segmentation

Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP

Assessing the Performance of the DINOv2 Self-supervised Learning Vision Transformer Model for the Segmentation of the Left Atrium from MRI Images

DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding