Abstract: Although text-to-image (T2I) models exhibit remarkable generation capabilities, they frequently fail to accurately bind semantically related objects or attributes in the input prompts, a challenge termed semantic binding. Previous approaches either involve intensive fine-tuning of the entire T2I model or require users or large language models to specify generation layouts, adding complexity. In this paper, we define semantic binding as the task of associating a given object with its attribute, termed attribute binding, or linking it to other related sub-objects, referred to as object binding. We introduce a novel method called Token Merging (ToMe), which enhances semantic binding by aggregating relevant tokens into a single composite token. This ensures that the object, its attributes, and its sub-objects all share the same cross-attention map. Additionally, to address potential confusion among main objects in complex textual prompts, we propose end token substitution as a complementary strategy. To further refine our approach in the initial stages of T2I generation, where layouts are determined, we incorporate two auxiliary losses, an entropy loss and a semantic binding loss, to iteratively update the composite token and improve generation integrity. We conducted extensive experiments to validate the effectiveness of ToMe, comparing it against various existing methods on T2I-CompBench and our proposed GPT-4o object binding benchmark. Our method is particularly effective in complex scenarios that involve multiple objects and attributes, which previous methods often fail to address. The code will be publicly available at https://github.com/hutaihang/ToMe.
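To make the idea of a composite token concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The abstract only says that relevant tokens are "aggregated", so the element-wise sum used here, the function name `merge_tokens`, and the choice to keep the composite at the position of the last grouped token are all illustrative assumptions. Padding the sequence back to a text encoder's fixed length is omitted.

```python
from typing import List, Sequence

Vector = List[float]


def merge_tokens(embeddings: Sequence[Vector], group: Sequence[int]) -> List[Vector]:
    """Collapse the tokens at positions `group` into one composite token.

    Assumption: the composite is the element-wise sum of the grouped
    embeddings and is stored at the position of the last index in `group`
    (taken here to be the object noun). The other grouped positions are
    dropped, so the returned sequence is shorter than the input.
    """
    if not group:
        # Nothing to merge: return a copy of the input sequence.
        return [list(vec) for vec in embeddings]

    dim = len(embeddings[0])
    composite = [sum(embeddings[i][d] for i in group) for d in range(dim)]
    anchor = group[-1]

    merged: List[Vector] = []
    for i, vec in enumerate(embeddings):
        if i == anchor:
            merged.append(composite)      # composite replaces the object token
        elif i not in group:
            merged.append(list(vec))      # unrelated tokens pass through unchanged
    return merged


# Toy 2-dimensional embeddings for the prompt "a red car".
prompt_embeddings = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]
merged = merge_tokens(prompt_embeddings, group=[1, 2])  # merge "red" into "car"
```

Because "red" and "car" now occupy a single position, any cross-attention computed over `merged` has one map for both, which is the property the abstract describes.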

What problem does this paper attempt to address?

The paper addresses the problem of accurately binding semantically related objects and attributes in text-to-image (T2I) generation, known as the "semantic binding" problem. Existing T2I models often fail to associate a given object with its attributes or with related sub-objects when generating images. For example, the generated image might place a hat on the wrong object or ignore certain attributes. To solve this problem, the authors propose a method called Token Merging (ToMe). ToMe enhances semantic binding by merging related tokens into a single composite token, so that an object, its attributes, and its sub-objects share the same cross-attention map. To address confusion among main objects in complex text prompts, the authors add a complementary strategy called end token substitution. In the initial stages of T2I generation, where layouts are determined, they also apply two auxiliary losses, an entropy loss and a semantic binding loss, to iteratively update the composite token and improve the integrity of the generated images. In experiments on T2I-CompBench and the authors' proposed GPT-4o object binding benchmark, ToMe is compared against various existing methods and is reported to be particularly effective in complex scenarios involving multiple objects and attributes, which previous methods often fail to handle.
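The summary names an entropy loss but does not define it. One plausible reading, sketched below as an assumption rather than the paper's formula, is the Shannon entropy of the composite token's cross-attention map: a low value means the attention is concentrated on a small region, while a high value means it is spread across the image. The function name `attention_entropy` and the flattened-list input are hypothetical.

```python
import math
from typing import Sequence


def attention_entropy(attn: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a flattened, non-negative attention map.

    Assumption: the map is normalized to sum to 1 before the entropy is
    taken. Zero entries contribute nothing, following the convention
    0 * log(0) = 0.
    """
    total = sum(attn)
    if total <= 0:
        raise ValueError("attention map must have positive total mass")
    probs = [a / total for a in attn]
    return -sum(p * math.log(p) for p in probs if p > 0)


# A diffuse map has higher entropy than a concentrated one.
diffuse = attention_entropy([0.25, 0.25, 0.25, 0.25])
focused = attention_entropy([0.97, 0.01, 0.01, 0.01])
```

Under this reading, minimizing the value with respect to the composite token would push its attention toward a single region. How the loss is weighted against the semantic binding loss is not stated in the text above.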

Token Merging for Training-Free Semantic Binding in Text-to-Image Synthesis
