Abstract: Centred on content modification and style preservation, Scene Text Editing (STE) remains a challenging task despite considerable progress in text-to-image synthesis and text-driven image manipulation recently. GAN-based STE methods generally encounter a common issue of model generalization, while Diffusion-based STE methods suffer from undesired style deviations. To address these problems, we propose TextCtrl, a diffusion-based method that edits text with prior guidance control. Our method consists of two key components: (i) By constructing fine-grained text style disentanglement and robust text glyph structure representation, TextCtrl explicitly incorporates Style-Structure guidance into model design and network training, significantly improving text style consistency and rendering accuracy. (ii) To further leverage the style prior, a Glyph-adaptive Mutual Self-attention mechanism is proposed which deconstructs the implicit fine-grained features of the source image to enhance style consistency and vision quality during inference. Furthermore, to fill the vacancy of the real-world STE evaluation benchmark, we create the first real-world image-pair dataset termed ScenePair for fair comparisons. Experiments demonstrate the effectiveness of TextCtrl compared with previous methods concerning both style fidelity and text accuracy.

What problem does this paper attempt to address?

This paper addresses style consistency and text-rendering accuracy in Scene Text Editing (STE). Despite significant progress in text-to-image synthesis and text-driven image manipulation, existing STE methods still face two challenges:

1. **GAN-based STE methods**: These methods generalize poorly, because GAN model capacity is limited and text styles are difficult to decompose accurately.
2. **Diffusion-based STE methods**: These methods perform well in image synthesis and manipulation, but they can introduce style deviations in practice, and style consistency is especially hard to maintain in complex scenes. In addition, because text prompts correlate only weakly with glyph structures, these methods are prone to spelling mistakes, which reduces text-rendering accuracy.

To solve these problems, the authors propose TextCtrl, a diffusion-based STE method that edits text with prior guidance control. Its main contributions are:

- **Fine-grained text-style disentanglement**: By constructing a fine-grained text-style disentanglement and a robust text-glyph-structure representation, TextCtrl explicitly incorporates Style-Structure guidance into model design and network training, significantly improving text-style consistency and rendering accuracy.
- **Glyph-adaptive Mutual Self-attention mechanism**: To further exploit the style prior, a Glyph-adaptive Mutual Self-attention mechanism is proposed. During inference, it deconstructs the implicit fine-grained features of the source image to enhance style consistency and visual quality.
- **Real-world evaluation benchmark**: To fill the gap in real-world STE evaluation benchmarks, the authors create ScenePair, the first real-world image-pair dataset, for a fair comparison of different methods.

According to the authors' experiments, TextCtrl compares favourably with previous methods in both style fidelity and text accuracy.
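The mutual self-attention idea described above can be made more concrete with a small sketch. It is an illustration under stated assumptions, not the authors' implementation: it assumes the editing branch's queries attend to a weighted mix of source-branch and editing-branch keys and values, and that the mixing weight comes from a glyph similarity score. The function names (`attention`, `mutual_self_attention`, `glyph_blend_weight`), the linear mixing rule, and the cosine-similarity weight are all hypothetical choices made for this example; the summary does not say how TextCtrl computes them.

```python
import math


def softmax(row):
    # Numerically stable softmax over one row of scores.
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    # Scaled dot-product attention over lists of equal-length vectors.
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return out


def mix(src, edit, blend):
    # Element-wise blend of two equally shaped lists of vectors.
    return [
        [blend * s + (1.0 - blend) * e for s, e in zip(s_row, e_row)]
        for s_row, e_row in zip(src, edit)
    ]


def mutual_self_attention(q_edit, k_edit, v_edit, k_src, v_src, blend):
    # Assumed rule: editing queries attend to blended keys/values.
    #   blend = 0.0 -> plain self-attention on the editing branch
    #   blend = 1.0 -> editing queries see only source keys/values
    if not 0.0 <= blend <= 1.0:
        raise ValueError("blend must lie in [0, 1]")
    return attention(q_edit, mix(k_src, k_edit, blend), mix(v_src, v_edit, blend))


def glyph_blend_weight(glyph_src, glyph_tgt):
    # Hypothetical weight: cosine similarity of two glyph embeddings,
    # clamped to [0, 1]. Similar glyphs -> reuse more source features.
    dot = sum(a * b for a, b in zip(glyph_src, glyph_tgt))
    norm = math.sqrt(sum(a * a for a in glyph_src)) * math.sqrt(
        sum(b * b for b in glyph_tgt)
    )
    if norm == 0.0:
        return 0.0
    return min(1.0, max(0.0, dot / norm))


if __name__ == "__main__":
    # Toy features: two tokens with two channels each.
    q_e = [[1.0, 0.0], [0.0, 1.0]]
    k_e = [[1.0, 0.0], [0.0, 1.0]]
    v_e = [[1.0, 2.0], [3.0, 4.0]]
    k_s = [[0.5, 0.5], [1.0, -1.0]]
    v_s = [[10.0, 20.0], [30.0, 40.0]]
    lam = glyph_blend_weight([1.0, 0.2], [0.9, 0.4])
    print(mutual_self_attention(q_e, k_e, v_e, k_s, v_s, lam))
```

In this sketch the weight acts as a dial: when the source and target glyphs are similar, more source-image features flow into the edited result, which is one plausible reading of how a glyph-adaptive weight could favour style consistency without forcing the source text's structure onto a very different target.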

TextCtrl: Diffusion-based Scene Text Editing with Prior Guidance Control
