Generative Photomontage

Sean J. Liu, Nupur Kumari, Ariel Shamir, Jun-Yan Zhu

2024-08-17

Abstract: Text-to-image models are powerful tools for image creation. However, the generation process is akin to a dice roll and makes it difficult to achieve a single image that captures everything a user wants. In this paper, we propose a framework for creating the desired image by compositing it from various parts of generated images, in essence forming a Generative Photomontage. Given a stack of images generated by ControlNet using the same input condition and different seeds, we let users select desired parts from the generated results using a brush stroke interface. We introduce a novel technique that takes in the user's brush strokes, segments the generated images using a graph-based optimization in diffusion feature space, and then composites the segmented regions via a new feature-space blending method. Our method faithfully preserves the user-selected regions while compositing them harmoniously. We demonstrate that our flexible framework can be used for many applications, including generating new appearance combinations, fixing incorrect shapes and artifacts, and improving prompt alignment. We show compelling results for each application and demonstrate that our method outperforms existing image blending methods and various baselines.

Computer Vision and Pattern Recognition, Graphics

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper addresses the difficulty of getting a text-to-image model to satisfy all of a user's requirements in a single generation. Specifically:

1. **Uncertainty of Generated Results**: Text-to-image generation is akin to rolling dice, so a single run rarely produces an image that fully meets user expectations. For example, for the prompt "future robot," each generated result differs, and the user may like one part of one result and another part of a different result.
2. **Insufficient User Control**: Although some methods (such as ControlNet) increase user control by adding extra conditions such as edge maps or depth maps, their outputs still vary from seed to seed and cannot fully satisfy user needs.

To address these issues, the paper proposes a new framework, **Generative Photomontage**. Given a stack of images generated from the same input condition with different seeds, users select desired regions with brush strokes, and the framework segments and composites these regions into the final image. This enhances user control, corrects erroneous shapes and artifacts in the generated images, and improves alignment for long and complex prompts.

### Main Contributions

- **User Interaction and Control**: By selecting desired parts from multiple generated images, users can leverage the model's generative capabilities while retaining fine-grained control over the final result.
- **Error Correction**: Users can replace unsatisfactory parts, gradually constructing the desired result.
- **Multiple Applications**: The method applies to various scenarios, including creating new appearance combinations, correcting shape errors, reducing artifacts, and improving prompt alignment.

The paper shows results for each of these applications and reports that the method outperforms existing image blending methods and various baselines.
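
The select-and-composite idea can be illustrated with a toy sketch. This is not the paper's method: the paper segments with a graph-based optimization in diffusion feature space and blends in feature space, whereas the sketch below assigns each pixel to the nearest stroke-mean feature (a unary cost only, with no smoothness term) and then copies pixels directly. The function names `segment_by_strokes` and `composite`, the array layouts, and the one-channel "features" are all assumptions made for illustration.

```python
import numpy as np


def segment_by_strokes(feats, strokes):
    """Toy per-pixel labeling (hypothetical helper, not the paper's graph cut).

    feats:   (K, H, W, C) per-pixel features for K generated images
             (a stand-in for diffusion features).
    strokes: (H, W) ints; -1 = unmarked, k = user stroke selecting image k.
    Returns an (H, W) label map choosing one source image per pixel.
    """
    K = feats.shape[0]
    costs = np.full((K,) + strokes.shape, np.inf)
    for k in range(K):
        mask = strokes == k
        if not mask.any():
            continue  # image k was not selected by any stroke
        mu = feats[k][mask].mean(axis=0)  # mean feature under the stroke
        costs[k] = ((feats[k] - mu) ** 2).sum(axis=-1)
    labels = costs.argmin(axis=0)
    marked = strokes >= 0
    labels[marked] = strokes[marked]  # stroke pixels keep the user's choice
    return labels


def composite(images, labels):
    """Hard pixel-space composite: take each pixel from its labeled image."""
    rows, cols = np.indices(labels.shape)
    return images[labels, rows, cols]


# Two 2x4 "images" sharing one layout: left half is region A, right half region B.
layout = np.array([[0.0, 0.0, 1.0, 1.0]] * 2)
feats = np.stack([layout, layout])[..., None]          # (2, 2, 4, 1)
images = np.stack([np.full((2, 4, 3), 10), np.full((2, 4, 3), 200)])

strokes = np.full((2, 4), -1)
strokes[0, 0] = 0   # user wants region A from image 0
strokes[0, 3] = 1   # user wants region B from image 1

labels = segment_by_strokes(feats, strokes)
out = composite(images, labels)
```

Here `labels` selects image 0 for the left half and image 1 for the right half, so `out` takes its left half from the first image and its right half from the second. A hard pixel copy like this leaves visible seams on real images, which is the kind of problem a blending step is meant to address.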
