Where's Waldo: Diffusion Features for Personalized Segmentation and Retrieval

Dvir Samuel, Rami Ben-Ari, Matan Levy, Nir Darshan, Gal Chechik
2024-09-30
Abstract: Personalized retrieval and segmentation aim to locate specific instances within a dataset based on an input image and a short description of the reference instance. While supervised methods are effective, they require extensive labeled data for training. Recently, self-supervised foundation models have been introduced to these tasks showing comparable results to supervised methods. However, a significant flaw in these models is evident: they struggle to locate a desired instance when other instances within the same class are presented. In this paper, we explore text-to-image diffusion models for these tasks. Specifically, we propose a novel approach called PDM for Personalized Features Diffusion Matching, that leverages intermediate features of pre-trained text-to-image models for personalization tasks without any additional training. PDM demonstrates superior performance on popular retrieval and segmentation benchmarks, outperforming even supervised methods. We also highlight notable shortcomings in current instance and segmentation datasets and propose new benchmarks for these tasks.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses how to perform personalized retrieval and segmentation accurately when multiple instances of the same class are present. Specifically:

1. **Personalized Retrieval**: Given a reference image and a description, retrieve the database images that contain the specific instance.
2. **Personalized Segmentation**: Segment a specific reference object in a new scene, even if the scene contains other objects of the same class.

### Main Problem

Current methods perform well when an image contains a single instance, or several instances of different categories, but their performance drops significantly when multiple instances of the same category appear. For example, when an image contains several dogs or cars, existing methods have difficulty identifying the specific instance the user specified.

### Solution Proposed in the Paper

The authors propose PDM (Personalized Diffusion Features Matching), which uses the intermediate features of a pre-trained text-to-image diffusion model for personalized retrieval and segmentation. The steps are as follows:

- **Feature Extraction**: Extract features that carry semantic and appearance information from the pre-trained diffusion model.
- **Feature Matching**: Combine the semantic and appearance features to generate a comprehensive similarity map (SDF) for locating the target instance.
- **No Additional Training Required**: PDM needs no additional training or fine-tuning and can be applied directly in a zero-shot setting.

### Experimental Verification

The authors verified the effectiveness of PDM through the following experiments:

- **Personalized Image Segmentation**: On the PerSeg and PerMIS-Image datasets, PDM outperforms existing self-supervised and supervised methods in both mIoU and bIoU.
- **Video Label Propagation**: On the DAVIS and PerMIS-Video datasets, PDM performs well in region and contour similarity (J&F).
- **Personalized Retrieval**: On the ROxford, RParis and PerMIR datasets, PDM significantly outperforms other methods in mean average precision (mAP).

### Conclusion

By using the intermediate features of the diffusion model, PDM performs well when multiple instances of the same category are present and overcomes the limitations of existing methods in complex scenarios.
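
To make the feature-matching step in the summary more concrete, here is a minimal NumPy sketch of how an appearance similarity map and a semantic similarity map could be combined to locate one instance among same-class distractors. It is an illustration under stated assumptions, not the authors' implementation. The function names (`cosine_map`, `combined_similarity_map`, `locate`), the weighted-average fusion and the `alpha` parameter are all hypothetical, and the features are assumed to be already extracted as a `(C,)` reference vector and an `(H, W, C)` target feature map.

```python
import numpy as np


def cosine_map(query, feature_map, eps=1e-8):
    """Cosine similarity between a (C,) query and every location of an (H, W, C) map."""
    q = query / (np.linalg.norm(query) + eps)
    f = feature_map / (np.linalg.norm(feature_map, axis=-1, keepdims=True) + eps)
    return f @ q  # shape (H, W)


def combined_similarity_map(ref_app, ref_sem, tgt_app, tgt_sem, alpha=0.5):
    """Fuse appearance and semantic similarity maps.

    Assumption: a simple weighted average; the paper's actual fusion rule
    is not specified in this summary.
    """
    app = cosine_map(ref_app, tgt_app)
    sem = cosine_map(ref_sem, tgt_sem)
    return alpha * app + (1.0 - alpha) * sem


def locate(sim_map):
    """Return the (row, col) of the highest-scoring location."""
    return np.unravel_index(np.argmax(sim_map), sim_map.shape)


# Toy example: a 4x5 target map with random features (hypothetical data).
rng = np.random.default_rng(0)
ref_app, ref_sem = rng.normal(size=16), rng.normal(size=8)
tgt_app = rng.normal(size=(4, 5, 16))
tgt_sem = rng.normal(size=(4, 5, 8))

# The target instance at (2, 3) matches both semantics and appearance.
tgt_app[2, 3], tgt_sem[2, 3] = ref_app, ref_sem
# A same-class distractor at (0, 1) matches semantics only.
tgt_sem[0, 1] = ref_sem

sim = combined_similarity_map(ref_app, ref_sem, tgt_app, tgt_sem)
print("best match at:", locate(sim))
```

In this toy setup the semantic map alone cannot separate the target from the same-class distractor, because both locations have identical semantic features. The appearance term breaks the tie, which is the intuition the summary gives for combining the two kinds of features.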