Abstract: The potential for higher-resolution image generation using pretrained diffusion models is immense, yet these models often struggle with object repetition and structural artifacts, especially when scaling to 4K resolution and higher. We find that the problem arises because a single prompt used to generate multiple scales is insufficiently effective. In response, we propose HiPrompt, a new tuning-free solution that tackles the above problems by introducing hierarchical prompts. The hierarchical prompts offer both global and local guidance. Specifically, the global guidance comes from the user input that describes the overall content, while the local guidance utilizes patch-wise descriptions from MLLMs to elaborately guide the regional structure and texture generation. Furthermore, during the inverse denoising process, the generated noise is decomposed into low- and high-frequency spatial components. These components are conditioned on multiple prompt levels, including detailed patch-wise descriptions and broader image-level prompts, facilitating prompt-guided denoising under hierarchical semantic guidance. This further allows the generation to focus more on local spatial regions and ensures the generated images maintain coherent local and global semantics, structures, and textures with high definition. Extensive experiments demonstrate that HiPrompt outperforms state-of-the-art works in higher-resolution image generation, significantly reducing object repetition and enhancing structural quality.

What problem does this paper attempt to address?

The problem this paper attempts to solve is how to avoid object repetition and structural artifacts when using pre-trained diffusion models to generate high-resolution images. Specifically:

1. **Object Repetition Problem**: When generating images at 4K resolution or higher, existing methods tend to repeat objects, which degrades the quality and realism of the images.
2. **Structural Artifact Problem**: During high-resolution generation, local regions are described in too little detail, so structural artifacts are likely to occur and the generated images lack coherence and consistency.

To solve these problems, the authors propose HiPrompt, a tuning-free method for high-resolution image generation. It introduces hierarchical prompts that provide both global and local guidance, improving the quality and detail accuracy of the generated images.

### Main Contributions

1. **Hierarchical Prompts (HiPrompt)**: Hierarchical semantic guidance resolves the semantic mismatch between the global prompt and local patches and reduces object repetition.
2. **Image Decomposition and Parallel Denoising**: The image is decomposed into low-frequency and high-frequency components, which are conditioned on the global and local prompts respectively, keeping local and global structures consistent and coherent.
3. **Extensive Experimental Verification**: Quantitative and qualitative evaluations show that HiPrompt outperforms prior work in high-resolution image generation, significantly reducing object repetition and improving structural quality.

### Method Overview

The main workflow of HiPrompt includes:

- **Hierarchical Prompt Generation**: Use multimodal large language models (MLLMs) to generate detailed local descriptions, and combine them with the global description supplied by the user to form multi-level prompts.
- **Noise Decomposition**: Decompose the noisy image into low-frequency and high-frequency components, which are denoised under the guidance of the global and local prompts respectively.
- **Parallel Denoising**: Process the different frequency components in parallel and combine them to synthesize a high-quality, high-resolution image.

Through these innovations, HiPrompt can generate images with higher resolution, fewer repeated objects, and clearer structures without additional training.
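
The frequency split described in the method overview can be illustrated with a small sketch. This is not the authors' implementation: the text above does not specify the filter, so a simple box blur stands in for the low-pass step, and the names `box_blur`, `split_frequencies`, and `merge_hierarchical` are hypothetical. In a real pipeline the two inputs would be noise predictions from the diffusion model conditioned on the global prompt and on a patch-wise prompt; here they are plain 2D lists of floats.

```python
def box_blur(x, radius=1):
    """Low-pass filter: average each value with its neighbours (edges clamped).

    Assumption: a box blur is used only as a stand-in for whatever
    low-pass filter the actual method applies.
    """
    h, w = len(x), len(x[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            total, count = 0.0, 0
            for di in range(-radius, radius + 1):
                for dj in range(-radius, radius + 1):
                    ii = min(max(i + di, 0), h - 1)  # clamp row index
                    jj = min(max(j + dj, 0), w - 1)  # clamp column index
                    total += x[ii][jj]
                    count += 1
            out[i][j] = total / count
    return out


def split_frequencies(x, radius=1):
    """Return (low, high) such that low + high reconstructs x."""
    low = box_blur(x, radius)
    high = [[a - b for a, b in zip(row_x, row_l)] for row_x, row_l in zip(x, low)]
    return low, high


def merge_hierarchical(noise_global, noise_local, radius=1):
    """Take low frequencies from the global-prompt prediction and
    high frequencies from the local-prompt prediction, then sum them."""
    low_g, _ = split_frequencies(noise_global, radius)
    _, high_l = split_frequencies(noise_local, radius)
    return [[a + b for a, b in zip(row_l, row_h)] for row_l, row_h in zip(low_g, high_l)]


# Toy inputs standing in for two noise predictions on a 3x3 patch.
eps_global = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]
eps_local = [[0.9, 0.1, 0.8], [0.2, 0.7, 0.3], [0.6, 0.4, 0.5]]
eps_merged = merge_hierarchical(eps_global, eps_local)
```

Because the high-frequency part is defined as the residual of the low-pass output, the two components always sum back to the input, so merging two identical predictions returns the original prediction. That makes the split safe to apply at every denoising step in this sketch.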

HiPrompt: Tuning-free Higher-Resolution Generation with Hierarchical MLLM Prompts

FilterPrompt: A Simple yet Efficient Approach to Guide Image Appearance Transfer in Diffusion Models

Optimizing Prompts for Text-to-Image Generation

Prompt-In-Prompt Learning for Universal Image Restoration

PromptFix: You Prompt and We Fix the Photo

Dynamic Prompting: A Unified Framework for Prompt Tuning

AccDiffusion: An Accurate Method for Higher-Resolution Image Generation

Leveraging Hallucinations to Reduce Manual Prompt Dependency in Promptable Segmentation

Dynamic Prompt Optimizing for Text-to-Image Generation

Adaptive Multi-Modality Prompt Learning

NeuroPrompts: An Adaptive Framework to Optimize Prompts for Text-to-Image Generation

PromptCoT: Align Prompt Distribution Via Adapted Chain-of-Thought

Modal-aware Prompt Tuning with Deep Adaptive Feature Enhancement

Batch-Instructed Gradient for Prompt Evolution: Systematic Prompt Optimization for Enhanced Text-to-Image Synthesis

PromptRestorer: A Prompting Image Restoration Method with Degradation Perception

Hierarchical Decomposition of Prompt-Based Continual Learning: Rethinking Obscured Sub-optimality

InitNO: Boosting Text-to-Image Diffusion Models via Initial Noise Optimization

The Silent Prompt: Initial Noise as Implicit Guidance for Goal-Driven Image Generation

HiDiffusion: Unlocking Higher-Resolution Creativity and Efficiency in Pretrained Diffusion Models

AccDiffusion v2: Towards More Accurate Higher-Resolution Diffusion Extrapolation