Memorizing Swin-Transformer Denoising Network for Diffusion Model

Jindou Chen,Yiqing Shen
DOI: https://doi.org/10.3390/electronics13204050
IF: 2.9
2024-10-16
Electronics
Abstract:Diffusion models have garnered significant attention in the field of image generation. However, existing denoising architectures, such as U-Net, face limitations in capturing the global context, while Vision Transformers (ViTs) may struggle with local receptive fields. To address these challenges, we propose a novel Swin-Transformer-based denoising network architecture that leverages the strengths of both U-Net and ViT. Moreover, our approach integrates the k-Nearest Neighbor (kNN) based memorizing attention module into the Swin-Transformer, enabling it to effectively harness crucial contextual information from feature maps and enhance its representational capacity. Finally, we introduce an innovative hierarchical time stream embedding scheme that optimizes the incorporation of temporal cues during the denoising process. This method surpasses basic approaches like simple addition or concatenation of fixed time embeddings, facilitating a more effective fusion of temporal information. Extensive experiments conducted on four benchmark datasets demonstrate the superior performance of our proposed model compared to U-Net and ViT as denoising networks. Our model outperforms baselines on the CRC-VAL-HE-7K and CelebA datasets, achieving improved FID scores of 14.39 and 4.96, respectively, and even surpassing DiT and UViT under our experiment setting. The Memorizing Swin-Transformer architecture, coupled with the hierarchical time stream embedding, sets a new state-of-the-art in denoising diffusion models for image generation.
engineering, electrical & electronic,computer science, information systems,physics, applied
What problem does this paper attempt to address?