Abstract: Scene text editing is a challenging task that involves modifying or inserting specified text in an image while maintaining its natural and realistic appearance. Most previous approaches to this task rely on style-transfer models that crop out text regions and feed them into image transfer models, such as GANs. However, these methods are limited in their ability to change text style and are unable to insert text into images. Recent advances in diffusion models have shown promise in overcoming these limitations with text-conditional image editing. However, our empirical analysis reveals that state-of-the-art diffusion models struggle with rendering correct text and controlling text style. To address these problems, we propose DIFFSTE to improve pre-trained diffusion models with a dual encoder design, which includes a character encoder for better text legibility and an instruction encoder for better style control. An instruction tuning framework is introduced to train our model to learn the mapping from the text instruction to the corresponding image with either the specified style or the style of the surrounding text in the background. This training method also gives our model zero-shot generalization ability in three scenarios: generating text with unseen font variations (e.g., italic and bold), mixing different fonts to construct a new font, and using more relaxed forms of natural language as instructions to guide the generation task. We evaluate our approach on five datasets and demonstrate its superior performance in terms of text correctness, image naturalness, and style controllability. Our code is publicly available at https://github.com/UCSB-NLP-Chang/DiffSTE

Letter embedding guidance diffusion model for scene text editing

Improving Diffusion Models for Scene Text Editing with Dual Encoders

TextCtrl: Diffusion-based Scene Text Editing with Prior Guidance Control

UATST: Towards Unpaired Arbitrary Text-Guided Style Transfer with Cross-Space Modulation

TeSTNeRF: Text-Driven 3D Style Transfer Via Cross-Modal Learning

Explicitly-Decoupled Text Transfer With Minimized Background Reconstruction for Scene Text Editing

On Manipulating Scene Text in the Wild with Diffusion Models

DiffUTE: Universal Text Editing Diffusion Model

Exploring Stroke-Level Modifications for Scene Text Editing

TextMastero: Mastering High-Quality Scene Text Editing in Diverse Languages and Styles

Layout Agnostic Scene Text Image Synthesis with Diffusion Models

Brush Your Text: Synthesize Any Scene Text on Images via Diffusion Model

Forgedit: Text Guided Image Editing via Learning and Forgetting

Enhancing Scene Text Detectors with Realistic Text Image Synthesis Using Diffusion Models

Scene Style Text Editing

ControlStyle: Text-Driven Stylized Image Generation Using Diffusion Priors

LayerDiffusion: Layered Controlled Image Editing with Diffusion Models

Perceptual Similarity guidance and text guidance optimization for Editing Real Images using Guided Diffusion Models

Arbitrary Style Guidance for Enhanced Diffusion-Based Text-to-Image Generation

StyleDiffusion: Prompt-Embedding Inversion for Text-Based Editing

DiffuseST: Unleashing the Capability of the Diffusion Model for Style Transfer