Abstract: Scene text editing is a challenging task that involves modifying or inserting specified text in an image while maintaining a natural and realistic appearance. Most previous approaches rely on style-transfer models that crop out text regions and feed them into image transfer models, such as GANs. However, these methods are limited in their ability to change text style and cannot insert text into images. Recent advances in diffusion models have shown promise in overcoming these limitations through text-conditional image editing. Nevertheless, our empirical analysis reveals that state-of-the-art diffusion models struggle to render correct text and to control text style. To address these problems, we propose DIFFSTE, which improves pre-trained diffusion models with a dual encoder design: a character encoder for better text legibility and an instruction encoder for better style control. We introduce an instruction tuning framework that trains the model to learn the mapping from a text instruction to the corresponding image, rendered either in the specified style or in the style of the surrounding text in the background. This training approach also gives our model zero-shot generalization to three scenarios: generating text with unseen font variations (e.g., italic and bold), mixing different fonts to construct a new font, and following more relaxed natural-language instructions to guide generation. We evaluate our approach on five datasets and demonstrate its superior performance in terms of text correctness, image naturalness, and style controllability. Our code is publicly available at https://github.com/UCSB-NLP-Chang/DiffSTE
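
As a rough illustration of the dual encoder idea described in the abstract, the sketch below prepares the two inputs such a model might consume: a character-level token sequence for the character encoder and a natural-language instruction for the instruction encoder. The function names, the instruction template, and the use of Unicode code points as character tokens are all assumptions made for illustration; they are not taken from the DIFFSTE code base.

```python
from typing import Dict, List, Optional, Union


def build_instruction(text: str,
                      font: Optional[str] = None,
                      color: Optional[str] = None) -> str:
    """Compose a text instruction (hypothetical template).

    When no style is given, the instruction carries only the target text,
    which corresponds to the case where the model is expected to follow
    the style of the surrounding text in the image.
    """
    styles = []
    if font is not None:
        styles.append(f"font {font}")
    if color is not None:
        styles.append(f"color {color}")
    instruction = f'Write "{text}"'
    if styles:
        instruction += " in " + " and ".join(styles)
    return instruction


def char_tokens(text: str) -> List[int]:
    """Character-level tokens (assumption: one Unicode code point per token)."""
    return [ord(ch) for ch in text]


def prepare_inputs(text: str,
                   font: Optional[str] = None,
                   color: Optional[str] = None
                   ) -> Dict[str, Union[str, List[int]]]:
    """Bundle the inputs for the two (hypothetical) encoders."""
    return {
        "instruction": build_instruction(text, font, color),  # instruction encoder input
        "characters": char_tokens(text),                      # character encoder input
    }


if __name__ == "__main__":
    print(prepare_inputs("Hello", font="Arial", color="red"))
```

Keeping the target text as separate character tokens, rather than only embedding it inside the instruction, reflects the abstract's motivation: spelling is handled by one encoder while style is specified through the other.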

DF-CLIP: Towards Disentangled and Fine-grained Image Editing from Text

DeltaSpace: A Semantic-aligned Feature Space for Flexible Text-guided Image Editing

TextCLIP: Text-Guided Face Image Generation And Manipulation Without Adversarial Training

CLIP-PAE: Projection-Augmentation Embedding to Extract Relevant Features for a Disentangled, Interpretable, and Controllable Text-Guided Face Manipulation

Where You Edit is What You Get: Text-guided Image Editing with Region-Based Attention

E4C: Enhance Editability for Text-Based Image Editing by Harnessing Efficient CLIP Guidance

Lightweight Text-Driven Image Editing With Disentangled Content and Attributes

DesignEdit: Multi-Layered Latent Decomposition and Fusion for Unified & Accurate Image Editing

StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery

Revealing Directions for Text-guided 3D Face Editing

The Curious Case of End Token: A Zero-Shot Disentangled Image Editing using CLIP

Predict, Prevent, and Evaluate: Disentangled Text-Driven Image Manipulation Empowered by Pre-Trained Vision-Language Model

FEAT: Face Editing with Attention

Unified Diffusion-Based Rigid and Non-Rigid Editing with Text and Image Guidance

Robust Text-driven Image Editing Method that Adaptively Explores Directions in Latent Spaces of StyleGAN and CLIP

CLIP-Guided StyleGAN Inversion for Text-Driven Real Image Editing

Improving Diffusion Models for Scene Text Editing with Dual Encoders

CLIP2GAN: Towards Bridging Text with the Latent Space of GANs

Mask-guided GAN for robust text editing in the scene

ChatFace: Chat-Guided Real Face Editing via Diffusion Latent Space Manipulation

Zero-shot Text-driven Physically Interpretable Face Editing