Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation

Luca Barsellotti, Lorenzo Bianchi, Nicola Messina, Fabio Carrara, Marcella Cornia, Lorenzo Baraldi, Fabrizio Falchi, Rita Cucchiara
2024-11-29
Abstract: Open-Vocabulary Segmentation (OVS) aims at segmenting images from free-form textual concepts without predefined training classes. While existing vision-language models such as CLIP can generate segmentation masks by leveraging coarse spatial information from Vision Transformers, they face challenges in spatial localization due to their global alignment of image and text features. Conversely, self-supervised visual models like DINO excel in fine-grained visual encoding but lack integration with language. To bridge this gap, we present Talk2DINO, a novel hybrid approach that combines the spatial accuracy of DINOv2 with the language understanding of CLIP. Our approach aligns the textual embeddings of CLIP to the patch-level features of DINOv2 through a learned mapping function without the need to fine-tune the underlying backbones. At training time, we exploit the attention maps of DINOv2 to selectively align local visual patches with textual embeddings. We show that the powerful semantic and localization abilities of Talk2DINO can enhance the segmentation process, resulting in more natural and less noisy segmentations, and that our approach can also effectively distinguish foreground objects from the background. Experimental results demonstrate that Talk2DINO achieves state-of-the-art performance across several unsupervised OVS benchmarks. Source code and models are publicly available at: https://lorebianchi98.github.io/Talk2DINO/
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
This paper addresses the challenges of Open-Vocabulary Segmentation (OVS). The OVS task is to segment images according to free-form text concepts supplied at inference time, with no predefined set of categories fixed during training. Existing vision-language models such as CLIP can generate segmentation masks by leveraging the coarse spatial information produced by Vision Transformers, but they struggle with spatial localization because they are trained to align image and text features globally. Self-supervised vision models such as DINO, in contrast, excel at fine-grained visual encoding but have no integration with language. To bridge this gap, the paper proposes Talk2DINO, a hybrid method that combines the spatial accuracy of DINOv2 with the language understanding of CLIP. It aligns the text embeddings of CLIP with the patch-level features of DINOv2 through a learned mapping function, without fine-tuning either backbone. The result is more natural and less noisy segmentation, together with an effective way to separate foreground objects from the background.

### Main contributions of the paper:

1. **Propose Talk2DINO**: This is the first model to directly align the DINOv2 and CLIP feature spaces for OVS. By mapping the text embeddings of CLIP into the DINOv2 space with a learned non-linear function, Talk2DINO effectively gives DINOv2 language capabilities.
2. **Novel training scheme**: The model is trained with a new scheme that selects the most relevant visual self-attention heads and does not require fine-tuning the backbones.
3. **Efficient inference process**: The paper demonstrates the capabilities of Talk2DINO in unsupervised OVS and designs a computationally efficient inference procedure, including a new background-cleaning method based on DINOv2 self-attention.
4. **Experimental results**: Talk2DINO achieves state-of-the-art performance on standard OVS benchmarks, supporting the effectiveness of the proposed method.

### Problems solved:

- **Challenges in spatial localization**: Models such as CLIP are limited in spatial localization because they mainly align images and text globally. Talk2DINO addresses this by building on the fine-grained spatial features of DINOv2.
- **Recognition of background regions**: Another challenge in OVS is identifying background areas that do not belong to any of the given categories. The paper proposes a background-cleaning method based on DINOv2 self-attention that distinguishes the foreground from the background.

In conclusion, by combining the language understanding of CLIP with the spatial localization ability of DINOv2, the paper improves open-vocabulary segmentation, with the main gains in spatial localization and background recognition.
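
The alignment idea summarized above, mapping CLIP text embeddings into the DINOv2 patch space and then labeling each patch by its most similar concept, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names (`project_text`, `segment_patches`), the two-layer linear/tanh/linear form of the mapping, and the toy vectors are all assumptions made for illustration. In practice the patch features would come from a frozen DINOv2 backbone and the text embeddings from a frozen CLIP text encoder.

```python
import math


def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def project_text(text_emb, w_in, w_out):
    """Hypothetical learned mapping: linear -> tanh -> linear.

    Stands in for the non-linear function that moves a CLIP text
    embedding into the DINOv2 patch-feature space (assumed form).
    """
    hidden = [math.tanh(h) for h in matvec(w_in, text_emb)]
    return matvec(w_out, hidden)


def cosine(a, b):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b + 1e-8)


def segment_patches(patch_feats, text_embs, w_in, w_out):
    """Assign each patch the index of the most similar mapped concept."""
    mapped = [project_text(t, w_in, w_out) for t in text_embs]
    labels = []
    for patch in patch_feats:
        sims = [cosine(patch, m) for m in mapped]
        labels.append(max(range(len(sims)), key=sims.__getitem__))
    return labels


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]      # toy weights, not learned
    concepts = [[1.0, 0.0], [0.0, 1.0]]      # toy "text embeddings"
    patches = [[1.0, 0.1], [0.1, 1.0], [2.0, 0.0]]  # toy "patch features"
    print(segment_patches(patches, concepts, identity, identity))  # [0, 1, 0]
```

Only the mapping weights would be trained in such a setup, which is consistent with the summary's point that neither backbone is fine-tuned. The attention-head selection used during training and the background-cleaning step are not shown here.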