LSReGen: Large-Scale Regional Generator via Backward Guidance Framework

Bowen Zhang, Cheng Yang, Xuanhui Liu
2024-07-21
Abstract: In recent years, advancements in AIGC (Artificial Intelligence Generated Content) technology have significantly enhanced the capabilities of large text-to-image models. Despite these improvements, controllable image generation remains a challenge. Current methods, such as training, forward guidance, and backward guidance, have notable limitations. The first two approaches either demand substantial computational resources or produce subpar results. The third approach depends on phenomena specific to certain model architectures, complicating its application to large-scale image generation. To address these issues, we propose a novel controllable generation framework that offers a generalized interpretation of backward guidance without relying on specific assumptions. Leveraging this framework, we introduce LSReGen, a large-scale layout-to-image method designed to generate high-quality, layout-compliant images. Experimental results show that LSReGen outperforms existing methods in the large-scale layout-to-image task, underscoring the effectiveness of our proposed framework. Our code and models will be open-sourced.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the difficulty that current controllable image generation methods have in producing large-scale, high-quality images that comply with a given layout. It identifies limitations in each of the three existing approaches:

1. **Model Training**: Achieves excellent control over generation, but requires a large amount of computing resources, especially for models with many parameters and large-scale datasets.
2. **Forward Guidance**: Adds almost no computational overhead, but the quality of the generated images is not ideal; for example, mottling may occur.
3. **Backward Guidance**: Updates the intermediate variables of the denoising process through back-propagation and obtains good results with relatively small overhead at inference time. However, most backward-guidance methods rely on the cross-attention map phenomenon found in specific model architectures, which limits their application to large-scale image generation.

To solve these problems, the paper proposes a new controllable generation framework that gives a general interpretation of backward guidance without relying on specific assumptions or model-architecture features. Based on this framework, the authors introduce LSReGen, a method for generating high-quality, large-scale images that meet layout requirements. The experimental results show that LSReGen outperforms existing methods on large-scale layout-to-image tasks, supporting the effectiveness of the proposed framework.

### Main Contributions

1. **General Backward-Guidance Framework**: A training-free framework that explains backward guidance in general terms, without relying on cross-attention maps.
2. **Large-Scale Layout-to-Image Method**: Building on this framework, LSReGen generates high-quality, large-scale images that meet layout requirements.
3. **Experimental Verification**: The experimental results show that LSReGen outperforms existing methods on large-scale layout-to-image tasks, further supporting the effectiveness of the proposed framework.

### Method Overview

- **Backward-Guidance Framework**: The framework defines a feature extraction method and a distance function, takes the control information as input, and updates the intermediate variables step by step so that their features gradually approach the target features.
- **Large-Scale Region Generator**: A pre-trained layout-to-image model with fewer parameters (such as GLIGEN) serves as the feature extractor. Layout features are captured by up-sampling and adding noise, and the squared L2 norm measures the distance between features during generation.

In conclusion, the paper aims to overcome the limitations of existing controllable image generation methods, especially in generating large-scale, high-quality images that meet layout requirements, by proposing a new backward-guidance framework and a corresponding generation method.
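
The backward-guidance update described in the Method Overview can be illustrated with a toy sketch. This is not the authors' implementation. The block-mean feature extractor, the names `extract_features` and `guidance_step`, the step size, and the numbers are all hypothetical stand-ins. In a real setting the features would come from a network, the gradient would be obtained by back-propagation, and the updates would be interleaved with denoising steps. The sketch only shows the core idea: take the gradient of a squared L2 feature distance with respect to the intermediate variable and step against it.

```python
from typing import List

BLOCK = 2  # hypothetical pooling width of the toy feature extractor


def extract_features(latent: List[float]) -> List[float]:
    """Toy feature extractor: mean of each consecutive block of the latent."""
    return [sum(latent[i:i + BLOCK]) / BLOCK for i in range(0, len(latent), BLOCK)]


def squared_l2(a: List[float], b: List[float]) -> float:
    """Squared L2 distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def guidance_step(latent: List[float], target: List[float], step_size: float) -> List[float]:
    """One guidance update: a gradient step on the feature distance."""
    feats = extract_features(latent)
    updated = []
    for i, value in enumerate(latent):
        j = i // BLOCK  # index of the feature this latent entry contributes to
        # Analytic gradient of (feats[j] - target[j])**2 w.r.t. latent[i],
        # where feats[j] is a block mean (so d feats[j] / d latent[i] = 1 / BLOCK).
        grad = 2.0 * (feats[j] - target[j]) / BLOCK
        updated.append(value - step_size * grad)
    return updated


latent = [0.0, 1.0, 4.0, -2.0]  # stand-in for the intermediate variable
target = [2.0, -1.0]            # stand-in for features derived from the layout

before = squared_l2(extract_features(latent), target)
for _ in range(10):
    latent = guidance_step(latent, target, step_size=0.5)
after = squared_l2(extract_features(latent), target)

print(before, after)
```

With these toy values each update halves the feature residual, so the distance shrinks steadily toward zero. That mirrors the "gradually approach the target features" behaviour in the overview, though a real model would not behave this cleanly.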