MultiSpectral diffusion: joint generation of wavelet coefficients for image synthesis and upsampling
Goudarzvand, Iman
DOI: https://doi.org/10.1007/s11042-024-20383-9
IF: 2.577
2024-10-19
Multimedia Tools and Applications
Abstract: Diffusion models have become a prevalent framework in deep generative modeling across various modalities. However, despite producing high-quality results, these models are computationally expensive and suffer from slow convergence. In this work, we address these challenges in image generation by leveraging the wavelet domain, which decomposes images into low- and high-frequency components, each at half the resolution of the original image in both height and width. We observe that prioritizing the learning of low-frequency components over high-frequency details and masking out unnecessary high-frequency content in wavelet space can significantly enhance training convergence and reduce computational demands. This strategy reduces the complexity associated with high-frequency details during training, allowing the model to capture the most representative features of the data distribution while maintaining a balance in detail preservation. To facilitate controlled learning across different wavelet coefficients, we employ a multitask loss function, with each task corresponding to the learning of a distinct wavelet subband. Additionally, to ensure consistency among wavelet coefficients, which is crucial for accurate reconstruction in pixel space, we introduce a multispectral cross-attention mechanism to aid the joint generation of different wavelet coefficients. The sampling process involves jointly generating wavelet coefficients, followed by an inverse wavelet transform to convert them back to pixel space. Our approach not only improves the training efficiency for unconditional image generation compared with the standard denoising diffusion probabilistic model (vanilla DDPM) but also uniquely supports the generation of high-frequency content conditioned on a low-resolution image, enabling both image generation and upsampling within a single model. To our knowledge, this capability is novel.
Our model demonstrates superior performance in image generation compared with baseline models on the STL-10 dataset, as evidenced by improved Fréchet inception distance (FID) and recall scores.
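The wavelet decomposition and the per-subband loss described in the abstract can be illustrated with a minimal sketch. It assumes a one-level orthonormal Haar wavelet (the abstract does not name the wavelet family), and the function names `haar_dwt2`, `haar_idwt2`, and `weighted_subband_mse`, as well as the example weights, are hypothetical and not taken from the paper. The sketch shows that each subband has half the height and width of the input, that the inverse transform recovers the pixels, and how a weighted sum of per-subband errors could favor the low-frequency band.

```python
# Illustrative sketch only: a one-level 2D Haar transform in pure Python.
# The wavelet family, function names, and loss weights are assumptions.

BANDS = ("LL", "LH", "HL", "HH")


def haar_dwt2(image):
    """Split an image (list of rows, even height and width) into four half-size subbands."""
    h, w = len(image), len(image[0])
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    out = {name: [[0.0] * (w // 2) for _ in range(h // 2)] for name in BANDS}
    for i in range(h // 2):
        for j in range(w // 2):
            a, b = image[2 * i][2 * j], image[2 * i][2 * j + 1]
            c, d = image[2 * i + 1][2 * j], image[2 * i + 1][2 * j + 1]
            out["LL"][i][j] = (a + b + c + d) / 2  # low-frequency average
            out["LH"][i][j] = (a - b + c - d) / 2  # difference between columns
            out["HL"][i][j] = (a + b - c - d) / 2  # difference between rows
            out["HH"][i][j] = (a - b - c + d) / 2  # diagonal difference
    return out


def haar_idwt2(bands):
    """Invert haar_dwt2, mapping the four subbands back to pixel space."""
    h2, w2 = len(bands["LL"]), len(bands["LL"][0])
    image = [[0.0] * (2 * w2) for _ in range(2 * h2)]
    for i in range(h2):
        for j in range(w2):
            ll, lh = bands["LL"][i][j], bands["LH"][i][j]
            hl, hh = bands["HL"][i][j], bands["HH"][i][j]
            image[2 * i][2 * j] = (ll + lh + hl + hh) / 2
            image[2 * i][2 * j + 1] = (ll - lh + hl - hh) / 2
            image[2 * i + 1][2 * j] = (ll + lh - hl - hh) / 2
            image[2 * i + 1][2 * j + 1] = (ll - lh - hl + hh) / 2
    return image


def weighted_subband_mse(pred, target, weights):
    """Weighted sum of per-subband mean squared errors (one task per subband)."""
    total = 0.0
    for name in BANDS:
        sq = [(p - t) ** 2
              for prow, trow in zip(pred[name], target[name])
              for p, t in zip(prow, trow)]
        total += weights[name] * sum(sq) / len(sq)
    return total


if __name__ == "__main__":
    img = [[float(4 * r + c) for c in range(4)] for r in range(4)]
    bands = haar_dwt2(img)
    print(len(bands["LL"]), len(bands["LL"][0]))  # subband height and width
    print(haar_idwt2(bands) == img)               # reconstruction check
    # Hypothetical weights that emphasize the low-frequency subband.
    weights = {"LL": 1.0, "LH": 0.25, "HL": 0.25, "HH": 0.25}
    print(weighted_subband_mse(bands, bands, weights))
```

In a diffusion setting, the predictions and targets passed to such a loss would be per-subband network outputs and training targets rather than raw coefficients; the weighting scheme used by the authors is not stated in the abstract.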
computer science, information systems, theory & methods, engineering, electrical & electronic, software engineering