HiPrompt: Tuning-free Higher-Resolution Generation with Hierarchical MLLM Prompts

Xinyu Liu, Yingqing He, Lanqing Guo, Xiang Li, Bu Jin, Peng Li, Yan Li, Chi-Min Chan, Qifeng Chen, Wei Xue, Wenhan Luo, Qingfeng Liu, Yike Guo
2024-09-07
Abstract: The potential for higher-resolution image generation using pretrained diffusion models is immense, yet these models often struggle with issues of object repetition and structural artifacts especially when scaling to 4K resolution and higher. We figure out that the problem is caused by that, a single prompt for the generation of multiple scales provides insufficient efficacy. In response, we propose HiPrompt, a new tuning-free solution that tackles the above problems by introducing hierarchical prompts. The hierarchical prompts offer both global and local guidance. Specifically, the global guidance comes from the user input that describes the overall content, while the local guidance utilizes patch-wise descriptions from MLLMs to elaborately guide the regional structure and texture generation. Furthermore, during the inverse denoising process, the generated noise is decomposed into low- and high-frequency spatial components. These components are conditioned on multiple prompt levels, including detailed patch-wise descriptions and broader image-level prompts, facilitating prompt-guided denoising under hierarchical semantic guidance. It further allows the generation to focus more on local spatial regions and ensures the generated images maintain coherent local and global semantics, structures, and textures with high definition. Extensive experiments demonstrate that HiPrompt outperforms state-of-the-art works in higher-resolution image generation, significantly reducing object repetition and enhancing structural quality.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses how to avoid object repetition and structural artifacts when pretrained diffusion models are used to generate high-resolution images. Specifically:

1. **Object Repetition Problem**: When generating images at 4K resolution or higher, existing methods tend to repeat objects, which degrades the quality and realism of the images.
2. **Structural Artifact Problem**: During high-resolution generation, local regions are described in too little detail, so structural artifacts are likely to occur and the generated images lack coherence and consistency.

To solve these problems, the authors propose HiPrompt, a tuning-free method for high-resolution image generation. It introduces hierarchical prompts that provide both global and local guidance, improving the quality and detail accuracy of the generated images.

### Main Contributions

1. **Hierarchical Prompts (HiPrompt)**: Hierarchical semantic guidance resolves the semantic mismatch between the global prompt and local patches and reduces object repetition.
2. **Image Decomposition and Parallel Denoising**: The noisy image is decomposed into low-frequency and high-frequency components, which are conditioned on the global and local prompts respectively, keeping local and global structures consistent and coherent.
3. **Extensive Experimental Verification**: Quantitative and qualitative evaluations demonstrate the superior performance of HiPrompt in high-resolution image generation, with significantly less object repetition and better structural quality.

### Method Overview

The main workflow of HiPrompt includes:

- **Hierarchical Prompt Generation**: Use multimodal large language models (MLLMs) to generate detailed local descriptions, and combine them with the global description supplied by the user to form multi-level prompts.
- **Noise Decomposition**: Decompose the noisy image into low-frequency and high-frequency components, which are denoised under the guidance of the global and local prompts respectively.
- **Parallel Denoising**: Process the different frequency components in parallel and combine them to synthesize a high-quality, high-resolution image.

Through these innovations, HiPrompt can generate images with higher resolution, fewer repeated objects, and clearer structures without additional training.
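
The frequency split described above can be illustrated with a minimal sketch. It is not the authors' implementation: the box blur used as the low-pass filter, the two-band split, and all function names (`box_blur`, `split_frequencies`, `combine_predictions`) are assumptions chosen for illustration, and the two input arrays merely stand in for noise predictions conditioned on the global prompt and on a patch-level prompt.

```python
import numpy as np


def box_blur(x: np.ndarray, radius: int = 2) -> np.ndarray:
    """Assumed low-pass filter: mean over a (2r+1)x(2r+1) window, edge-padded."""
    k = 2 * radius + 1
    padded = np.pad(x, radius, mode="edge")
    out = np.zeros_like(x, dtype=float)
    # Sum all shifted copies of the padded array, then normalise.
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + x.shape[0], dx:dx + x.shape[1]]
    return out / (k * k)


def split_frequencies(x: np.ndarray, radius: int = 2):
    """Split a 2D array into a low-frequency part and its high-frequency residual."""
    low = box_blur(x, radius)
    high = x - low
    return low, high


def combine_predictions(eps_global: np.ndarray, eps_local: np.ndarray,
                        radius: int = 2) -> np.ndarray:
    """Keep the low band of the global-prompt prediction and the high band
    of the local-prompt prediction (an assumed pairing for this sketch)."""
    low, _ = split_frequencies(eps_global, radius)
    _, high = split_frequencies(eps_local, radius)
    return low + high


# Toy stand-ins for two prompt-conditioned noise predictions.
rng = np.random.default_rng(0)
eps_global = rng.standard_normal((16, 16))
eps_local = rng.standard_normal((16, 16))
combined = combine_predictions(eps_global, eps_local)
```

Because the high band is defined as the residual of the low band, the two bands always sum back to the input, so when both predictions agree the combination returns them unchanged. Only where the global and local predictions differ does the split decide which prompt governs coarse structure and which governs fine detail.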