TextCtrl: Diffusion-based Scene Text Editing with Prior Guidance Control

Weichao Zeng, Yan Shu, Zhenhang Li, Dongbao Yang, Yu Zhou
2024-10-14
Abstract: Centred on content modification and style preservation, Scene Text Editing (STE) remains a challenging task despite considerable progress in text-to-image synthesis and text-driven image manipulation recently. GAN-based STE methods generally encounter a common issue of model generalization, while Diffusion-based STE methods suffer from undesired style deviations. To address these problems, we propose TextCtrl, a diffusion-based method that edits text with prior guidance control. Our method consists of two key components: (i) By constructing fine-grained text style disentanglement and robust text glyph structure representation, TextCtrl explicitly incorporates Style-Structure guidance into model design and network training, significantly improving text style consistency and rendering accuracy. (ii) To further leverage the style prior, a Glyph-adaptive Mutual Self-attention mechanism is proposed which deconstructs the implicit fine-grained features of the source image to enhance style consistency and vision quality during inference. Furthermore, to fill the vacancy of the real-world STE evaluation benchmark, we create the first real-world image-pair dataset termed ScenePair for fair comparisons. Experiments demonstrate the effectiveness of TextCtrl compared with previous methods concerning both style fidelity and text accuracy.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses style consistency and text-rendering accuracy in Scene Text Editing (STE). Despite significant progress in text-to-image synthesis and text-driven image manipulation, existing STE methods still face two challenges:

1. **GAN-based STE methods**: These methods generally generalize poorly, because the model capacity of GANs is limited and text styles are difficult to decompose accurately.
2. **Diffusion-based STE methods**: Although these methods perform well in image synthesis and manipulation, they can introduce style deviations, and style consistency is especially hard to maintain in complex scenes. In addition, because text prompts correlate only weakly with glyph structures, these methods are prone to spelling mistakes, which reduces text-rendering accuracy.

To solve these problems, the authors propose TextCtrl, a diffusion-based STE method that edits text with prior guidance control. TextCtrl makes three main contributions:

- **Fine-grained text style disentanglement**: By constructing fine-grained text style disentanglement and a robust text glyph structure representation, TextCtrl explicitly incorporates Style-Structure guidance into model design and network training, significantly improving text style consistency and rendering accuracy.
- **Glyph-adaptive Mutual Self-attention mechanism**: To further leverage the style prior, the authors propose a Glyph-adaptive Mutual Self-attention mechanism. During inference, it deconstructs the implicit fine-grained features of the source image to enhance style consistency and visual quality.
- **Real-world evaluation benchmark**: To fill the gap in real-world STE evaluation benchmarks, the authors create ScenePair, the first real-world image-pair dataset, for fair comparison of different methods.

According to the reported experiments, these improvements allow TextCtrl to outperform previous methods in both style fidelity and text accuracy.
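
The abstract's Glyph-adaptive Mutual Self-attention is described only at a high level, so the following is a minimal NumPy sketch of the general mutual self-attention idea, not the authors' implementation. It assumes that queries come from the editing branch while keys and values are drawn partly from the source-image branch. The scalar `lam`, the linear blend, and all function names are hypothetical; how TextCtrl actually derives a glyph-adaptive weight is not stated in this summary.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    """Scaled dot-product attention for (tokens, dim) arrays."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def mutual_self_attention(q_edit, k_edit, v_edit, k_src, v_src, lam):
    """Blend source-branch keys/values into the editing branch.

    Assumption: `lam` is a hypothetical scalar in [0, 1]. With lam = 1.0
    the editing queries attend only to source features; with lam = 0.0
    this reduces to ordinary self-attention on the editing branch.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    k = lam * k_src + (1.0 - lam) * k_edit
    v = lam * v_src + (1.0 - lam) * v_edit
    return attention(q_edit, k, v)


# Toy usage with random features standing in for real attention inputs.
rng = np.random.default_rng(0)
tokens, dim = 16, 8
q_e, k_e, v_e, k_s, v_s = (rng.standard_normal((tokens, dim)) for _ in range(5))
out = mutual_self_attention(q_e, k_e, v_e, k_s, v_s, lam=0.7)
print(out.shape)  # (16, 8)
```

In this sketch, a larger `lam` pulls more of the source image's appearance into the edited result, which is one plausible way a style prior could be reused at inference time without retraining.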