GlyphDraw: Seamlessly Rendering Text with Intricate Spatial Structures in Text-to-Image Generation

Jian Ma, Mingjun Zhao, Chen Chen, Ruichen Wang, Di Niu, Haonan Lu, Xiaodong Lin
DOI: https://doi.org/10.48550/arXiv.2303.17870
2023-05-23
Abstract: Recent breakthroughs in the field of language-guided image generation have yielded impressive achievements, enabling the creation of high-quality and diverse images based on user instructions. Although the synthesis performance is fascinating, one significant limitation of current image generation models is their insufficient ability to generate text coherently within images, particularly for complex glyph structures like Chinese characters. To address this problem, we introduce GlyphDraw, a general learning framework aiming to endow image generation models with the capacity to generate images coherently embedded with text for any specific language. We first sophisticatedly design the image-text dataset's construction strategy, then build our model specifically on a diffusion-based image generator and carefully modify the network structure to allow the model to learn drawing language characters with the help of glyph and position information. Furthermore, we maintain the model's open-domain image synthesis capability by preventing catastrophic forgetting by using parameter-efficient fine-tuning techniques. Extensive qualitative and quantitative experiments demonstrate that our method not only produces accurate language characters as in prompts, but also seamlessly blends the generated text into the background. Please refer to our project page: https://1073521013.github.io/glyph-draw.github.io/
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses a weakness of current text-to-image models: they struggle to render text with complex character structures, such as Chinese characters. Existing image-generation models can produce high-quality and diverse images from user instructions, but they perform poorly at generating coherent text inside those images, especially for complex glyph structures. Some methods can render English text by relying on pre-trained language models (such as T5-XXL), but their ability to generate non-Latin characters (such as Chinese) remains limited. This is mainly because Chinese characters have a more complex two-dimensional spatial structure, built from eight different types of strokes, and the number of commonly used characters runs into the thousands. Generating accurate and diverse Chinese characters is therefore harder, and it remains an unsolved research problem. In addition, freezing a pre-trained language model is inflexible and hard to adapt to a user-specified downstream language, while training a language-specific model from scratch is costly and requires a large amount of data. The authors therefore design a general and flexible algorithm that tackles the visual text rendering challenge with a lightweight training strategy and dataset.

To address this problem, the paper proposes GlyphDraw, a general learning framework that aims to give image-generation models the ability to generate images with coherently embedded text in any specific language. The framework uses character glyphs and text positions as auxiliary information to provide greater control over the character-generation process. According to the paper, the method generates diverse visual text that matches the given instructions, selects a suitable font style, and blends the text seamlessly into the background, while maintaining high generation quality and avoiding overfitting and catastrophic forgetting. The paper verifies the method experimentally on Chinese and English character rendering, reporting OCR accuracies of 74% and 75%, which are significantly better than those of previous image-synthesis methods.
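
The summary above does not say how OCR accuracy is computed. As a minimal sketch, assuming an exact-match definition (an image counts as correct only when the text read back by an OCR engine equals the text requested in the prompt), the metric could be computed as below. The function name `ocr_exact_match_accuracy` and the sample strings are hypothetical, the OCR step itself is out of scope, and the paper's actual protocol may differ.

```python
def ocr_exact_match_accuracy(expected_texts, recognized_texts):
    """Fraction of images whose OCR output equals the requested text exactly.

    Assumption: exact string match after trimming surrounding whitespace.
    This is an illustrative definition, not necessarily the paper's.
    """
    if len(expected_texts) != len(recognized_texts):
        raise ValueError("inputs must have the same length")
    if not expected_texts:
        return 0.0
    hits = sum(
        1
        for expected, recognized in zip(expected_texts, recognized_texts)
        if expected.strip() == recognized.strip()
    )
    return hits / len(expected_texts)


# Hypothetical data: text requested in five prompts vs. text an OCR engine
# read back from the generated images (two of the five contain an error).
expected = ["开业大吉", "OPEN", "新年快乐", "SALE", "欢迎光临"]
recognized = ["开业大吉", "OPEN", "新年快东", "SALE", "欢迎光"]
accuracy = ocr_exact_match_accuracy(expected, recognized)
print(f"OCR accuracy: {accuracy:.0%}")
```

Exact matching is strict: a single wrong character fails the whole image. A character-level score based on edit distance would give partial credit, so the choice between the two changes how the reported percentages should be read.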