CleanDIFT: Diffusion Features without Noise

Nick Stracke, Stefan Andreas Baumann, Kolja Bauer, Frank Fundel, Björn Ommer
2024-12-05
Abstract: Internal features from large-scale pre-trained diffusion models have recently been established as powerful semantic descriptors for a wide range of downstream tasks. Works that use these features generally need to add noise to images before passing them through the model to obtain the semantic features, as the models do not offer the most useful features when given images with little to no noise. We show that this noise has a critical impact on the usefulness of these features that cannot be remedied by ensembling with different random noises. We address this issue by introducing a lightweight, unsupervised fine-tuning method that enables diffusion backbones to provide high-quality, noise-free semantic features. We show that these features readily outperform previous diffusion features by a wide margin in a wide variety of extraction setups and downstream tasks, offering better performance than even ensemble-based methods at a fraction of the cost.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The problem this paper addresses is that existing methods for extracting features from pre-trained diffusion models must add noise to the input image to obtain useful features. This destroys part of the information in the image and also requires tuning the timestep for each downstream task. The authors propose CleanDIFT, a feature extraction method that removes the need to add noise and produces general-purpose diffusion features that are independent of the timestep, improving the performance of these features across downstream tasks.

### Main Contributions

1. **Propose CleanDIFT**: A method for fine-tuning diffusion models so that they operate on clean images, making the knowledge inside these models more accessible.
2. **Integrate information of all timesteps**: By consolidating the timestep-dependent feature extraction functions into a single feature prediction, the method removes the need to tune the timestep for each downstream task.
3. **Significant performance improvement**: The features improve results on a variety of downstream tasks and surpass the current state of the art in zero-shot unsupervised semantic correspondence.
4. **Higher efficiency**: The method is more efficient than previous approaches that address the problem through noise ensembling or supervised training.

### Method Overview

- **Preliminary concept**: Diffusion models are trained to recover the original image \(x_0\) from a noisy version \(x_t\) obtained by adding random Gaussian noise. The denoising task differs across timesteps \(t\), so the extracted features carry different semantic information depending on \(t\).
- **Feature extraction**: Existing diffusion feature extraction methods usually first add noise to the image and then extract features from a U-Net denoiser. Adding noise limits the perceptual information that the model can extract.
- **CleanDIFT**: The authors propose a lightweight fine-tuning method that enables diffusion models to extract high-quality features directly from clean images. Specifically, they initialize a trainable feature extraction model that receives clean input images, while the frozen diffusion model receives noisy images. A point-wise, timestep-conditional feature projection head aligns the features of the feature extraction model with the timestep-dependent features of the diffusion model.

### Experimental Results

- **Unsupervised semantic correspondence matching**: Extensive evaluations on the SPair-71k dataset show that CleanDIFT features significantly outperform standard diffusion features on multiple metrics.
- **Monocular depth estimation**: Experiments on the NYUv2 dataset show that CleanDIFT features yield a significant performance improvement in depth estimation.
- **Semantic segmentation**: Experiments on the PASCAL VOC dataset show that training a linear probe on CleanDIFT features significantly reduces noise in the feature maps and improves segmentation performance.

### Conclusion

The paper addresses the limitations of existing diffusion feature extraction methods by proposing CleanDIFT. The method improves feature quality and yields significant performance gains across a range of downstream tasks. Its efficiency and generality also make it promising for practical applications.
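
The noising step and the feature alignment described in the Method Overview can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the authors' implementation: the random arrays stand in for U-Net features, `project_pointwise` is a hypothetical linear head with a sinusoidal timestep embedding, and the negative cosine similarity loss is an assumed choice that the summary above does not specify. All function names and shapes are made up for the example.

```python
import numpy as np


def add_noise(x0, alpha_bar_t, rng):
    """Standard forward diffusion step:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps


def timestep_embedding(t, dim):
    """Sinusoidal embedding of a timestep (dim must be even)."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])


def project_pointwise(feats, t, weight):
    """Hypothetical point-wise, timestep-conditional head: the same linear
    map is applied independently at every spatial location.
    feats: (num_locations, dim), weight: (dim, dim)."""
    return (feats + timestep_embedding(t, feats.shape[1])) @ weight


def alignment_loss(student, teacher, eps=1e-8):
    """Mean negative cosine similarity per location (assumed loss)."""
    s = student / (np.linalg.norm(student, axis=1, keepdims=True) + eps)
    q = teacher / (np.linalg.norm(teacher, axis=1, keepdims=True) + eps)
    return float(-(s * q).sum(axis=1).mean())


rng = np.random.default_rng(0)
x0 = rng.standard_normal((3, 8, 8))            # toy "clean image"
x_t = add_noise(x0, alpha_bar_t=0.5, rng=rng)  # input to the frozen model

# Random stand-ins for features at 64 spatial locations with 16 channels.
teacher_feats = rng.standard_normal((64, 16))  # frozen model on x_t
student_feats = rng.standard_normal((64, 16))  # trainable model on x0
weight = np.eye(16)                            # untrained projection head

# t=100 is an arbitrary timestep chosen for the example.
projected = project_pointwise(student_feats, t=100, weight=weight)
loss = alignment_loss(projected, teacher_feats)
print(f"alignment loss: {loss:.4f}")
```

In a real training loop the loss would be minimized with respect to the trainable model and the projection head while the diffusion model stays frozen. At inference only the fine-tuned model is needed, and it receives the clean image.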