APRNet: Attention-based Pixel-wise Rendering Network for Photo-Realistic Text Image Generation

Yangming Shi, Haisong Ding, Kai Chen, Qiang Huo
DOI: https://doi.org/10.48550/arXiv.2203.07705
2022-03-15
Abstract: Style-guided text image generation tries to synthesize a text image by imitating a reference image's appearance while keeping the text content unaltered. The appearance of a text image includes many aspects. In this paper, we focus on transferring the style image's background and foreground color patterns to the content image to generate a photo-realistic text image. To achieve this goal, we propose 1) a content-style cross attention based pixel sampling approach to roughly mimic the style text image's background; 2) a pixel-wise style modulation technique to transfer the varying color patterns of the style image to the content image in a spatially adaptive way; 3) a cross attention based multi-scale style fusion approach to solve the text foreground misalignment issue between style and content images; 4) an image patch shuffling strategy to create style, content and ground truth image tuples for training. Experimental results on Chinese handwriting text image synthesis with the SCUT-HCCDoc and CASIA-OLHWDB datasets demonstrate that the proposed method can improve the quality of synthetic text images and make them more photo-realistic.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses style-guided text image generation: how to transfer the background and foreground color patterns of a reference (style) image to a content image while keeping the text content unchanged, so that the result is a photo-realistic text image. The main challenges include complex backgrounds, varying lighting conditions, and misalignment between the text foreground of the style and content images. To address these challenges, the paper makes four technical contributions:

1. **Pixel Sampling Module Based on Content-Style Cross-Attention (AttnPixamp)**: roughly imitates the background of the style text image.
2. **Pixel-level Style Modulation Technique (PixyMod)**: adaptively transfers the spatially varying color patterns of the style image to the content image.
3. **Multi-scale Style Fusion Module Based on Attention Mechanism (AttnMuSF)**: resolves the text foreground misalignment between the style and content images.
4. **Image Patch Shuffling Strategy (Single Crop)**: creates the style, content, and ground-truth image triplets required for training.

Through these techniques, the paper aims to improve the quality of text image generation, especially for line-level style transfer. Experimental results on Chinese handwritten text image synthesis show that the proposed method improves the quality of the synthesized images and makes them more photo-realistic.
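
To make the first, second, and fourth ideas above more concrete, the following is a minimal NumPy sketch, not the authors' implementation. The function names, the array shapes, the scaled dot-product form of the attention, the normalize-then-scale-and-shift form of the modulation, and the grid-based patch permutation are all assumptions made for illustration; the text above does not specify these details.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_pixel_sampling(content_feat, style_feat, style_pixels):
    """Hypothetical cross-attention pixel sampling.

    content_feat: (Nc, d) features, one row per content-image location.
    style_feat:   (Ns, d) features, one row per style-image location.
    style_pixels: (Ns, 3) RGB values at the style-image locations.
    Returns (Nc, 3): each content location receives a convex
    combination of style pixels, weighted by feature similarity.
    """
    d = content_feat.shape[-1]
    scores = content_feat @ style_feat.T / np.sqrt(d)  # (Nc, Ns)
    weights = softmax(scores, axis=-1)                 # each row sums to 1
    return weights @ style_pixels


def pixelwise_modulation(content_feat, gamma, beta, eps=1e-5):
    """Hypothetical pixel-wise style modulation.

    content_feat, gamma, beta: (N, C). Each channel is normalized over
    the N spatial positions, then scaled and shifted with a separate
    (gamma, beta) pair at every position (assumed to be predicted
    from the style image).
    """
    mean = content_feat.mean(axis=0, keepdims=True)
    std = content_feat.std(axis=0, keepdims=True)
    normalized = (content_feat - mean) / (std + eps)
    return gamma * normalized + beta


def shuffle_patches(image, patch, rng):
    """Randomly permute the non-overlapping patch x patch tiles of an image.

    image: (H, W, C) with H and W divisible by `patch`.
    Assumption: a shuffled copy like this could serve as a style input
    whose text layout no longer lines up with the ground truth.
    """
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    tiles = (image.reshape(gh, patch, gw, patch, c)
                  .transpose(0, 2, 1, 3, 4)
                  .reshape(gh * gw, patch, patch, c))
    tiles = tiles[rng.permutation(gh * gw)]
    return (tiles.reshape(gh, gw, patch, patch, c)
                 .transpose(0, 2, 1, 3, 4)
                 .reshape(h, w, c))
```

Because the attention weights in `attention_pixel_sampling` are non-negative and sum to one per content location, every sampled color stays within the range of colors present in the style image, which fits the stated goal of only roughly mimicking the style background before later stages refine it.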