Abstract: Internal features from large-scale pre-trained diffusion models have recently been established as powerful semantic descriptors for a wide range of downstream tasks. Works that use these features generally need to add noise to images before passing them through the model to obtain the semantic features, as the models do not offer the most useful features when given images with little to no noise. We show that this noise has a critical impact on the usefulness of these features that cannot be remedied by ensembling with different random noises. We address this issue by introducing a lightweight, unsupervised fine-tuning method that enables diffusion backbones to provide high-quality, noise-free semantic features. We show that these features readily outperform previous diffusion features by a wide margin in a wide variety of extraction setups and downstream tasks, offering better performance than even ensemble-based methods at a fraction of the cost.

What problem does this paper attempt to address?

This paper addresses a limitation of existing methods for extracting features from pre-trained diffusion models: they must add noise to the input image to obtain useful features. The added noise destroys information in the image, and the noise timestep has to be tuned separately for each downstream task. The authors propose a new feature extraction method, CleanDIFT, which removes the need to add noise during feature extraction and produces general-purpose, timestep-independent diffusion features, improving performance across a range of downstream tasks.

### Main Contributions

1. **Propose CleanDIFT**: A method for fine-tuning diffusion models so that they operate on clean images, making the knowledge inside these models more accessible.
2. **Integrate information from all timesteps**: Consolidating the timestep-dependent feature extraction functions into a single feature prediction removes the need to tune the timestep for each downstream task.
3. **Significant performance improvement**: CleanDIFT features improve performance on a variety of downstream tasks and, in particular, surpass the state of the art in zero-shot unsupervised semantic correspondence.
4. **Higher efficiency**: The method is more efficient than previous approaches that address the problem through noise ensembling or supervised training.

### Method Overview

- **Preliminary concept**: Diffusion models are trained to recover the original image \(x_0\) from a version \(x_t\) corrupted with random Gaussian noise. The denoising task differs across timesteps \(t\), so the features extracted at different timesteps carry different semantic information.
- **Feature extraction**: Existing diffusion feature extraction methods usually first add noise to the image and then extract features from a U-Net denoiser. The added noise limits the perceptual information that the model can extract.
- **CleanDIFT**: The authors propose a lightweight fine-tuning method that enables diffusion models to extract high-quality features directly from clean images. Specifically, they initialize a trainable feature extraction model that receives clean input images, while the frozen diffusion model receives noisy images. A point-wise, timestep-conditional feature projection head aligns the features of the feature extraction model with the timestep-dependent features of the diffusion model.

### Experimental Results

- **Unsupervised semantic correspondence matching**: Extensive evaluations on the SPair-71k dataset show that CleanDIFT features significantly outperform standard diffusion features on multiple metrics.
- **Monocular depth estimation**: Experiments on the NYUv2 dataset show that CleanDIFT features yield a significant improvement in depth estimation.
- **Semantic segmentation**: Experiments on the PASCAL VOC dataset show that training a linear probe on CleanDIFT features significantly reduces the noise in the feature maps and improves segmentation performance.

### Conclusion

By proposing CleanDIFT, the paper addresses the shortcomings of existing diffusion feature extraction methods. The method improves feature quality and delivers significant performance gains across a variety of downstream tasks. Its efficiency and generality also make it promising for practical applications.
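The training setup described in the Method Overview can be illustrated with a minimal, pure-Python sketch. It is not the authors' implementation. The function names, the per-timestep affine map standing in for the projection head, and the cosine-based alignment loss are all assumptions made for illustration; the summary only states that projected clean-image features are aligned with the frozen model's timestep-dependent features. The noising step uses the standard forward diffusion formula \(x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon\).

```python
import math
import random


def add_noise(x0, alpha_bar_t, rng):
    """Standard forward diffusion: x_t = sqrt(a) * x0 + sqrt(1 - a) * eps."""
    signal = math.sqrt(alpha_bar_t)
    sigma = math.sqrt(1.0 - alpha_bar_t)
    return [signal * v + sigma * rng.gauss(0.0, 1.0) for v in x0]


def project(feature, t, scale, shift):
    """Hypothetical timestep-conditional head: one affine map per timestep,
    applied to every feature entry independently (point-wise)."""
    return [scale[t] * f + shift[t] for f in feature]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / (norm + 1e-12)  # small epsilon avoids division by zero


def alignment_loss(clean_feature, noisy_feature, t, scale, shift):
    """Assumed objective: 1 - cosine similarity between the projected
    clean-image feature and the frozen model's feature at timestep t."""
    projected = project(clean_feature, t, scale, shift)
    return 1.0 - cosine_similarity(projected, noisy_feature)


# Toy usage. Real features would come from U-Net layers; here the "feature"
# of an input is simply the input itself, to keep the sketch self-contained.
rng = random.Random(0)
x0 = [0.5, -1.0, 2.0]                      # clean input (trainable branch)
x_t = add_noise(x0, alpha_bar_t=0.5, rng=rng)  # noisy input (frozen branch)
scale = {10: 1.0}                          # head parameters for timestep 10
shift = {10: 0.0}
loss = alignment_loss(x0, x_t, 10, scale, shift)
print(loss)
```

In an actual training loop, only the clean-image extractor and the projection head would receive gradients from this loss, while the diffusion model that sees the noisy input stays frozen. At inference time the head and the noisy branch are no longer needed, since features are read directly from the clean-image extractor.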

CleanDIFT: Diffusion Features without Noise

FRDiff : Feature Reuse for Universal Training-free Acceleration of Diffusion Models

E2EDiff: Direct Mapping from Noise to Data for Enhanced Diffusion Models

Immiscible Diffusion: Accelerating Diffusion Training with Noise Assignment

simple diffusion: End-to-end diffusion for high resolution images

Your Diffusion Model is Secretly a Noise Classifier and Benefits from Contrastive Training

Not All Steps Are Created Equal: Selective Diffusion Distillation for Image Manipulation

Efficient Diffusion Training Via Min-SNR Weighting Strategy

Diffusion Features to Bridge Domain Gap for Semantic Segmentation

Improved Noise Schedule for Diffusion Training

Training-Free Adaptive Diffusion with Bounded Difference Approximation Strategy

Faster Diffusion: Rethinking the Role of the Encoder for Diffusion Model Inference

Diffusion 3D Features (Diff3F): Decorating Untextured Shapes with Distilled Semantic Features

Consistent Diffusion Meets Tweedie: Training Exact Ambient Diffusion Models with Noisy Data

FAM Diffusion: Frequency and Attention Modulation for High-Resolution Image Generation with Stable Diffusion

Removing Structured Noise with Diffusion Models

TendiffPure: a convolutional tensor-train denoising diffusion model for purification

Coarse-to-fine Mechanisms Mitigate Diffusion Limitations on Image Restoration

Attention-Driven Training-Free Efficiency Enhancement of Diffusion Models

Adaptive Training Meets Progressive Scaling: Elevating Efficiency in Diffusion Models

Diffusion Models Without Attention