Representative Feature Extraction During Diffusion Process for Sketch Extraction with One Example

Kwan Yun, Youngseo Kim, Kwanggyoon Seo, Chang Wook Seo, Junyong Noh
2024-01-09
Abstract: We introduce DiffSketch, a method for generating a variety of stylized sketches from images. Our approach focuses on selecting representative features from the rich semantics of deep features within a pretrained diffusion model. This novel sketch generation method can be trained with one manual drawing. Furthermore, efficient sketch extraction is ensured by distilling a trained generator into a streamlined extractor. We select denoising diffusion features through analysis and integrate these selected features with VAE features to produce sketches. Additionally, we propose a sampling scheme for training models using a conditional generative approach. Through a series of comparisons, we verify that distilled DiffSketch not only outperforms existing state-of-the-art sketch extraction methods but also surpasses diffusion-based stylization methods in the task of extracting sketches.
Computer Vision and Pattern Recognition, Artificial Intelligence, Graphics
What problem does this paper attempt to address?
The problem this paper attempts to solve is: **generating diverse stylized sketches from images while training on only one manually drawn sketch**. Specifically, the authors propose a new method, DiffSketch, which uses representative features from a pretrained diffusion model to generate sketches. The method extracts sketches efficiently while maintaining high-quality results.

### Specific description of the problem

1. **Data scarcity problem**:
   - Existing sketch extraction methods usually require a large amount of labeled data for training, which is hard to obtain in practice. To overcome this challenge, DiffSketch trains the sketch generator with only one manually drawn sketch.
2. **Feature selection and aggregation problem**:
   - The features produced by the diffusion model during denoising are very rich, but selecting the most representative ones is difficult. The authors select features from multiple time steps through statistical analysis and clustering, then fuse these features with VAE features to generate more refined sketches.
3. **Personalized sketch generation problem**:
   - Existing diffusion-based stylization methods make it difficult to control the style when generating sketches, so the results may not meet expectations. DiffSketch is designed so that the generated sketches stay faithful to the style of the given manual sketch.
4. **Efficient inference problem**:
   - Inference with a diffusion model is usually time-consuming and memory-intensive. To address this, the authors propose a distilled network (Distilled DiffSketch), which significantly improves inference speed and reduces memory usage while preserving quality.
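The clustering-based selection of time steps mentioned in item 2 can be illustrated with a toy example. This is a minimal sketch and not the authors' procedure: it assumes each denoising step has already been summarized as a single feature vector, and it swaps in a plain k-means because the summary does not say which clustering algorithm is used. The function name `select_representative_timesteps` and its parameters are hypothetical.

```python
import math


def _dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def _mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def select_representative_timesteps(features, k, iters=50):
    """Return up to k timestep indices, one nearest to each k-means centroid.

    features: list of T vectors, one (hypothetical) summary vector per denoising step.
    """
    num_steps = len(features)
    if not 1 <= k <= num_steps:
        raise ValueError("k must be between 1 and the number of timesteps")

    # Deterministic init (an assumption): centroids start at evenly spaced timesteps.
    if k == 1:
        init = [0]
    else:
        init = [round(i * (num_steps - 1) / (k - 1)) for i in range(k)]
    centroids = [list(features[i]) for i in init]

    for _ in range(iters):
        # Assign every timestep to its nearest centroid.
        labels = [min(range(k), key=lambda c: _dist(f, centroids[c])) for f in features]
        updated = []
        for c in range(k):
            members = [f for f, lab in zip(features, labels) if lab == c]
            # Keep the old centroid if a cluster ends up empty.
            updated.append(_mean(members) if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated

    # The representative of each cluster is the timestep closest to its centroid.
    picks = {min(range(num_steps), key=lambda t: _dist(features[t], c)) for c in centroids}
    return sorted(picks)
```

If two centroids share the same nearest time step, the function returns fewer than `k` indices. In a real pipeline the per-step vectors would come from the diffusion model's intermediate features, which this sketch does not compute.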
### Overview of solutions

- **Feature selection and aggregation**: Statistically analyze the features generated during the diffusion process, select the most representative ones, and fuse them with VAE features.
- **Personalized training**: Use a conditional CLIP-guided method so that the generated sketches stay faithful to the style of the given manual sketch.
- **Efficient inference**: Distill the complex generation model into an efficient image-to-image translation network, achieving fast and high-quality sketch extraction.

### Experimental verification

The authors verify the effectiveness of DiffSketch through a series of experiments, including comparisons with existing methods, ablation experiments, and user perception studies. The results show that DiffSketch not only outperforms existing sketch extraction methods on quantitative metrics but also obtains higher scores in user perception.

### Formula summary

The formulas in the paper are mainly concentrated in the loss function and feature aggregation parts:

- **Total loss function**:
  \[ L = L_{\text{rec}} + \lambda_{\text{across}} L_{\text{across}} + \lambda_{\text{within}} L_{\text{within}} \]
  where:
  - \( L_{\text{rec}} \) is the reconstruction loss, which ensures that the generated sketch is similar to the real sketch.
  - \( L_{\text{across}} \) and \( L_{\text{within}} \) are directional CLIP losses, which maintain the consistency of cross-domain and intra-domain differences.
- **Reconstruction loss**:
  \[ L_{\text{rec}} = \lambda_{L1} L_{L1} + \lambda_{\text{LPIPS}} L_{\text{LPIPS}} + \lambda_{\text{CLIPsim}} L_{\text{CLIPsim}} \]
  where:
  - \( L_{L1} \) is the L1 distance loss, which is used to avoid blurry sketch results.
  - \( L_{\text{LPIPS}} \) is the perceptual loss, which captures perceptual similarity.
  - \( L_{\text{CLIPsim}} \) is the semantic similarity loss, which calculates the cosine distance.

Through these
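As a rough illustration of how the loss terms in the formula summary combine, the following sketch writes the two weighted sums in plain Python. It makes several assumptions: the LPIPS term and the two directional CLIP terms are passed in as precomputed scalars (they require pretrained networks not shown here), the embeddings are plain lists standing in for CLIP embeddings, and all weights default to 1.0 as placeholders rather than the paper's values. The function names are hypothetical.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def l1_loss(pred, target):
    """Mean absolute difference between two flattened images."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def reconstruction_loss(l1, lpips, clip_sim, w_l1=1.0, w_lpips=1.0, w_clipsim=1.0):
    """L_rec = w_l1 * L_L1 + w_lpips * L_LPIPS + w_clipsim * L_CLIPsim."""
    return w_l1 * l1 + w_lpips * lpips + w_clipsim * clip_sim


def total_loss(l_rec, l_across, l_within, w_across=1.0, w_within=1.0):
    """L = L_rec + w_across * L_across + w_within * L_within."""
    return l_rec + w_across * l_across + w_within * l_within


# Toy usage with made-up numbers (not results from the paper).
pred, target = [0.0, 0.5, 1.0], [0.0, 1.0, 1.0]
emb_pred, emb_target = [1.0, 0.0], [0.0, 1.0]
l_rec = reconstruction_loss(
    l1_loss(pred, target),
    lpips=0.2,  # placeholder for a perceptual-network output
    clip_sim=cosine_distance(emb_pred, emb_target),
)
loss = total_loss(l_rec, l_across=0.3, l_within=0.1)
```

Keeping each term as a separate scalar makes it easy to log the terms individually and to tune the weights without touching the term definitions.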