Abstract: Personalized retrieval and segmentation aim to locate specific instances within a dataset based on an input image and a short description of the reference instance. While supervised methods are effective, they require extensive labeled data for training. Recently, self-supervised foundation models have been introduced to these tasks showing comparable results to supervised methods. However, a significant flaw in these models is evident: they struggle to locate a desired instance when other instances within the same class are presented. In this paper, we explore text-to-image diffusion models for these tasks. Specifically, we propose a novel approach called PDM for Personalized Features Diffusion Matching, that leverages intermediate features of pre-trained text-to-image models for personalization tasks without any additional training. PDM demonstrates superior performance on popular retrieval and segmentation benchmarks, outperforming even supervised methods. We also highlight notable shortcomings in current instance and segmentation datasets and propose new benchmarks for these tasks.

What problem does this paper attempt to address?

The problem this paper attempts to solve is how to accurately perform personalized retrieval and segmentation when multiple instances of the same class are present. Specifically:

1. **Personalized Retrieval**: Given a reference image and a description, retrieve the images in a database that contain the specific instance.
2. **Personalized Segmentation**: Segment a specific reference object in a new scene, even if the scene contains other objects of the same class.

### Main Problem

Current methods perform well when an image contains a single instance or several instances of different categories, but their performance drops significantly when it contains multiple instances of the same category. For example, when there are several dogs or cars in an image, existing methods have difficulty identifying the specific instance the user specified.

### Solution Proposed in the Paper

To solve this problem, the authors propose a new method, PDM (Personalized Features Diffusion Matching), which uses the intermediate features of a pre-trained text-to-image diffusion model for personalized retrieval and segmentation. The specific steps are as follows:

- **Feature Extraction**: Extract features containing semantic and appearance information from the pre-trained diffusion model.
- **Feature Matching**: Combine the semantic and appearance features to generate a comprehensive similarity map (SDF) used to locate the target instance.
- **No Additional Training Required**: PDM requires no additional training or fine-tuning and can be applied directly in a zero-shot setting.

### Experimental Verification

The authors verified the effectiveness of PDM through the following experiments:

- **Personalized Image Segmentation**: Tests were carried out on the PerSeg and PerMIS-Image datasets, and the results show that PDM outperforms existing self-supervised and supervised methods on both the mIoU and bIoU metrics.
- **Video Label Propagation**: Tests were carried out on the DAVIS and PerMIS-Video datasets, and PDM performs well in region and contour similarity (J&F).
- **Personalized Retrieval**: Tests were carried out on the ROxford, RParis and PerMIR datasets, and PDM significantly outperforms other methods in mean average precision (mAP).

### Conclusion

By using the intermediate features of the diffusion model, PDM performs well when an image contains multiple instances of the same category and overcomes the limitations of existing methods in complex scenarios.
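The feature matching step described above can be illustrated with a small toy example. This is a minimal sketch under stated assumptions, not the authors' implementation: the feature vectors are hand-made placeholders rather than activations from a diffusion model, and the function names (`similarity_map`, `combined_map`, `locate`), the weight `alpha`, and the weighted-average combination rule are all hypothetical choices made for illustration. The summary does not say how PDM actually combines the two feature types.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def similarity_map(ref_vec, target_feats):
    """Cosine similarity between a reference vector (C,) and a feature map (H, W, C)."""
    return l2_normalize(target_feats) @ l2_normalize(ref_vec)  # shape (H, W)


def combined_map(ref_app, tgt_app, ref_sem, tgt_sem, alpha=0.5):
    """Hypothetical combination rule: weighted average of the two similarity maps."""
    app = similarity_map(ref_app, tgt_app)
    sem = similarity_map(ref_sem, tgt_sem)
    return alpha * app + (1.0 - alpha) * sem


def locate(sim):
    """Return the (row, col) position of the highest similarity score."""
    row, col = np.unravel_index(np.argmax(sim), sim.shape)
    return int(row), int(col)


# Toy 4x4 scene with 4-dimensional placeholder features.
H, W, C = 4, 4, 4
basis = np.eye(C)

# Background positions carry basis[0] in both feature types.
tgt_sem = np.tile(basis[0], (H, W, 1))
tgt_app = np.tile(basis[0], (H, W, 1))

# Two instances of the same class: identical semantic features ...
tgt_sem[1, 2] = basis[1]  # target instance
tgt_sem[3, 0] = basis[1]  # distractor of the same class
# ... but different appearance features.
tgt_app[1, 2] = basis[2]  # target appearance
tgt_app[3, 0] = basis[3]  # distractor appearance

# Reference instance: same class and same appearance as the target at (1, 2).
ref_sem, ref_app = basis[1], basis[2]

sem_only = similarity_map(ref_sem, tgt_sem)
combined = combined_map(ref_app, tgt_app, ref_sem, tgt_sem)

print("semantic scores (target, distractor):", sem_only[1, 2], sem_only[3, 0])
print("combined location:", locate(combined))
```

In this toy setup the semantic map alone scores the target and the same-class distractor identically, which mirrors the failure case the paper describes. Adding the appearance term breaks the tie in favor of the reference instance.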

Where's Waldo: Diffusion Features for Personalized Segmentation and Retrieval
