Edify Image: High-Quality Image Generation with Pixel Space Laplacian Diffusion Models

NVIDIA, Yuval Atzmon, Maciej Bala, Yogesh Balaji, Tiffany Cai, Yin Cui, Jiaojiao Fan, Yunhao Ge, Siddharth Gururani, Jacob Huffman, Ronald Isaac, Pooya Jannaty, Tero Karras, Grace Lam, J. P. Lewis, Aaron Licata, Yen-Chen Lin, Ming-Yu Liu, Qianli Ma, Arun Mallya, Ashlee Martino-Tarr, Doug Mendez, Seungjun Nah, Chris Pruett, Fitsum Reda, Jiaming Song, Ting-Chun Wang, Fangyin Wei, Xiaohui Zeng, Yu Zeng, Qinsheng Zhang
2024-11-12
Abstract: We introduce Edify Image, a family of diffusion models capable of generating photorealistic image content with pixel-perfect accuracy. Edify Image utilizes cascaded pixel-space diffusion models trained using a novel Laplacian diffusion process, in which image signals at different frequency bands are attenuated at varying rates. Edify Image supports a wide range of applications, including text-to-image synthesis, 4K upsampling, ControlNets, 360 HDR panorama generation, and finetuning for image customization.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
### Problems the paper attempts to solve

The paper "Edify Image: High-Quality Image Generation with Pixel Space Laplacian Diffusion Models" attempts to solve the following problems:

1. **Generate high-quality images**:
   - Although existing text-to-image models can generate realistic images, there is still room for improvement in detail and resolution. The paper proposes a new diffusion model, the Laplacian Diffusion Model, which aims to generate high-resolution, pixel-accurate images.
2. **Multi-scale image generation**:
   - Existing diffusion models tend to accumulate artifacts when generating high-resolution images. The paper addresses this with a multi-scale Laplacian diffusion process, which captures and refines details at multiple scales by attenuating the image signal at different rates in different frequency bands.
3. **Support for multiple applications**:
   - Beyond text-to-image generation, the paper extends the model to 4K upsampling, ControlNets, 360° HDR panorama generation, and fine-tuning for customization. These applications require a high degree of controllability and flexibility from the model.
4. **Improve generation efficiency and quality**:
   - The proposed method improves training efficiency by separating low-frequency and high-frequency components. In addition, a Mixture of Experts approach lets the model be trained and run efficiently across different resolution ranges.

### Specific technical means

- **Laplacian Diffusion Model**:
  - Multi-scale image generation is achieved by decomposing the image into components in different frequency bands and attenuating the signal in each band at a different rate.
  - Formula representation:
    \[ \mu(x_0, t) = \sum_{i=1}^{3} \alpha_t^{(i)} x_0^{(i)} \]
    where \(x_0^{(i)}\) is the \(i\)-th frequency band of the clean image \(x_0\) and \(\alpha_t^{(i)}\) is its attenuation factor, which is monotonically non-increasing in time \(t\). The noisy sample at time \(t\) adds noise to this mean:
    \[ x_t = \mu(x_0, t) + \sigma_t \epsilon = \sum_{i=1}^{3} \alpha_t^{(i)} x_0^{(i)} + \sigma_t \epsilon \]
    where \(\sigma_t\) is the noise level and \(\epsilon\) is the noise term.
- **Multi-stage generation**:
  - A two-stage cascaded model is used to generate 1024-resolution images. The first stage generates 256-resolution images, and the second stage upsamples them to 1024 resolution.
- **Conditional input**:
  - Multiple conditional inputs, such as T5 embeddings, camera properties, and media types, are introduced to improve the controllability and diversity of the generated images.
- **4K upsampling**:
  - Starting from a pre-trained 1K generator, 4K-resolution images are generated through appropriate noise-level scaling and fine-tuning.
- **Additional control**:
  - Training a ControlNet encoder adds support for control inputs such as depth maps and sketches, enabling more flexible image generation.

### Conclusion

By introducing the Laplacian Diffusion Model, the paper addresses key problems in high-quality and multi-scale image generation. The model performs well on text-to-image generation and also shows strong controllability and flexibility across multiple applications.
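
The Laplacian forward process described under "Specific technical means" can be illustrated with a small NumPy sketch. It splits an image into three frequency bands and forms \(x_t\) as the attenuated bands plus noise. Everything concrete here is an assumption made for illustration and is not taken from the paper: the average-pool/nearest-neighbour decomposition, the pooling factors `(1, 4, 16)`, the linear `alphas` schedule in which higher-frequency bands decay faster, and all function names.

```python
import numpy as np

def downsample(x, factor):
    """Average-pool a (H, W) image by an integer factor (H, W divisible by factor)."""
    h, w = x.shape
    return x.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def upsample(x, factor):
    """Nearest-neighbour upsample a (H, W) image by an integer factor."""
    return np.repeat(np.repeat(x, factor, axis=0), factor, axis=1)

def laplacian_bands(x0, factors=(1, 4, 16)):
    """Split x0 into len(factors) bands that sum back to x0.

    bands[0] holds the highest frequencies, bands[-1] the lowest.
    The pooling factors are hypothetical, chosen only for this sketch.
    """
    # Progressively blurrier copies of x0; factor 1 leaves x0 unchanged.
    lows = [upsample(downsample(x0, f), f) for f in factors]
    # Each band is the detail lost between two consecutive blur levels.
    bands = [lows[i] - lows[i + 1] for i in range(len(lows) - 1)]
    bands.append(lows[-1])  # residual low-frequency image
    return bands

def alphas(t):
    """Hypothetical attenuation factors for t in [0, 1].

    Each factor is non-increasing in t; higher-frequency bands reach 0 sooner.
    """
    rates = np.array([4.0, 2.0, 1.0])
    return np.clip(1.0 - rates * t, 0.0, 1.0)

def forward_sample(x0, t, sigma_t, rng):
    """Draw x_t = sum_i alpha_t^(i) * x0^(i) + sigma_t * eps."""
    bands = laplacian_bands(x0)
    mean = sum(a * b for a, b in zip(alphas(t), bands))
    eps = rng.standard_normal(x0.shape)
    return mean + sigma_t * eps

rng = np.random.default_rng(0)
x0 = rng.random((64, 64))  # stand-in for a grayscale image
bands = laplacian_bands(x0)
x_t = forward_sample(x0, t=0.25, sigma_t=0.5, rng=rng)
```

With this schedule, at `t=0.25` the highest-frequency band is already fully attenuated while the two lower bands are only partly attenuated, which is one simple way to realize "different rates in different frequency bands".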