Training-Free Consistent Text-to-Image Generation

Yoad Tewel, Omri Kaduri, Rinon Gal, Yoni Kasten, Lior Wolf, Gal Chechik, Yuval Atzmon

2024-05-30

Abstract: Text-to-image models offer a new level of creative flexibility by allowing users to guide the image generation process through natural language. However, using these models to consistently portray the same subject across diverse prompts remains challenging. Existing approaches fine-tune the model to teach it new words that describe specific user-provided subjects or add image conditioning to the model. These methods require lengthy per-subject optimization or large-scale pre-training. Moreover, they struggle to align generated images with text prompts and face difficulties in portraying multiple subjects. Here, we present ConsiStory, a training-free approach that enables consistent subject generation by sharing the internal activations of the pretrained model. We introduce a subject-driven shared attention block and correspondence-based feature injection to promote subject consistency between images. Additionally, we develop strategies to encourage layout diversity while maintaining subject consistency. We compare ConsiStory to a range of baselines, and demonstrate state-of-the-art performance on subject consistency and text alignment, without requiring a single optimization step. Finally, ConsiStory can naturally extend to multi-subject scenarios, and even enable training-free personalization for common objects.
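The subject-driven shared attention named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that each image in a batch already has per-patch queries, keys, and values, and that a boolean subject mask per image is given. The names `subject_shared_attention` and `subject_mask` are hypothetical. Each image attends to all of its own patches plus only the subject patches of the other images.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; entries set to -inf receive zero weight.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def subject_shared_attention(q, k, v, subject_mask):
    """Illustrative shared attention over a batch of images.

    q, k, v:       (B, N, D) per-image queries, keys, values for N patches.
    subject_mask:  (B, N) bool, True where a patch belongs to the subject
                   (assumed to be provided; hypothetical input).
    Returns:       (B, N, D) attention outputs.
    """
    B, N, D = q.shape
    # Pool keys and values from every image in the batch.
    k_all = k.reshape(B * N, D)
    v_all = v.reshape(B * N, D)
    out = np.empty_like(q)
    for i in range(B):
        # Image i may attend to all of its own patches,
        # but only to subject patches of the other images.
        allowed = subject_mask.copy()
        allowed[i] = True
        allowed = allowed.reshape(B * N)
        logits = q[i] @ k_all.T / np.sqrt(D)            # (N, B*N)
        logits = np.where(allowed[None, :], logits, -np.inf)
        out[i] = softmax(logits) @ v_all
    return out
```

With an all-False mask this reduces to ordinary per-image self-attention, so the mask alone controls how much information flows between images.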

Computer Vision and Pattern Recognition, Artificial Intelligence, Graphics, Machine Learning

What problem does this paper attempt to address?

This paper proposes ConsiStory, a training-free method for consistent text-to-image generation. Existing text-to-image models can generate images guided by natural language, but they struggle to portray the same subject consistently across multiple prompts. Existing methods typically require per-subject fine-tuning or large-scale pretraining, and they perform poorly on text alignment and on portraying multiple subjects. ConsiStory achieves consistency by sharing the internal activations of a pretrained model, with no optimization or pretraining. It introduces a subject-driven shared attention block and correspondence-based feature injection to promote subject consistency across images. The method also develops strategies that encourage layout diversity while maintaining subject consistency. ConsiStory is around 20 times faster than current state-of-the-art methods and extends naturally to multi-subject scenarios, even supporting training-free personalization for common objects. Unlike existing techniques that rely on personalization, fine-tuning, or image conditioning, ConsiStory aligns features directly during generation, avoiding a posteriori consistency constraints on the target images and addressing issues of model creativity and deviation from the training distribution. Experiments in the paper compare ConsiStory with various baseline methods and demonstrate its superior performance in subject consistency and text alignment. User studies also show a preference for ConsiStory's results. Furthermore, the paper examines the impact of the method's individual components and demonstrates compatibility with existing editing tools as well as training-free personalization for common objects.
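The correspondence-based feature injection mentioned above can be sketched in the same spirit. This is a simplified illustration under stated assumptions, not the paper's implementation. The function name `inject_features`, the blending weight `alpha`, and the similarity threshold `sim_threshold` are hypothetical. The sketch also uses one feature array for both matching and blending, which a real system need not do. For each subject patch in a target image, it finds the most similar subject patch in a source image and blends the two features.

```python
import numpy as np

def inject_features(target, source, target_mask, source_mask,
                    alpha=0.8, sim_threshold=0.5):
    """Blend source features into matching target patches.

    target, source:           (N, D) patch features of two images.
    target_mask, source_mask: (N,) bool subject masks (assumed given).
    alpha, sim_threshold:     hypothetical illustrative values.
    """
    # Cosine similarity between every target patch and every source patch.
    t = target / (np.linalg.norm(target, axis=1, keepdims=True) + 1e-8)
    s = source / (np.linalg.norm(source, axis=1, keepdims=True) + 1e-8)
    sim = t @ s.T                                   # (N, N)
    # Only subject patches of the source are valid match candidates.
    sim = np.where(source_mask[None, :], sim, -np.inf)
    nearest = sim.argmax(axis=1)                    # best source patch per target patch
    best = sim.max(axis=1)
    # Inject only into target subject patches with a sufficiently good match.
    blend = target_mask & (best > sim_threshold)
    out = target.copy()
    out[blend] = (1 - alpha) * target[blend] + alpha * source[nearest[blend]]
    return out
```

Restricting the blend to masked, well-matched patches leaves backgrounds and poorly matched regions untouched, which is one plausible way to keep layouts free to vary while subject details are shared.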

Chasing Consistency in Text-to-3D Generation from a Single Image

Be Yourself: Bounded Attention for Multi-Subject Text-to-Image Generation

The Chosen One: Consistent Characters in Text-to-Image Diffusion Models

AutoStudio: Crafting Consistent Subjects in Multi-turn Interactive Image Generation

OneActor: Consistent Character Generation via Cluster-Conditioned Guidance

Subject-driven Text-to-Image Generation via Apprenticeship Learning

Customization Assistant for Text-to-image Generation

Isolated Diffusion: Optimizing Multi-Concept Text-to-Image Generation Training-Freely with Isolated Diffusion Guidance

DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation

FastComposer: Tuning-Free Multi-subject Image Generation with Localized Attention

StoryMaker: Towards Holistic Consistent Characters in Text-to-image Generation

Multi-Shot Character Consistency for Text-to-Video Generation

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

Direct Consistency Optimization for Robust Customization of Text-to-Image Diffusion Models

DreamArtist: Towards Controllable One-Shot Text-to-Image Generation via Positive-Negative Prompt-Tuning

Training-free Composite Scene Generation for Layout-to-Image Synthesis

Obtaining Favorable Layouts for Multiple Object Generation