Few-shot Font Generation based on SAE and Diffusion Model

Yizuo Shi, Wenxia Yang, Mengxu Yuan, Zeping Yi
DOI: https://doi.org/10.54097/rp4sqj55
2024-08-15
Abstract: Few-shot generation of Chinese fonts has been an intriguing and important challenge in recent years, primarily because Chinese characters are intricate and each font is distinctive. Conventional GAN-based font generation models suffer from unstable training and inaccurate generation. Meanwhile, diffusion models have achieved remarkable success in image generation and are already used in commercial AI painting applications, and some studies have begun to apply them to Few-shot Font Generation (FFG). In this paper, we present a straightforward few-shot font generation framework built on a conditional diffusion model. Three encoders produce conditional embedding tokens that capture essential character information such as content and style. By incorporating these conditions into the diffusion process, the model can effectively represent all three pieces of information. Our model has three key features: i) The encoders and the diffusion model are disentangled. The content encoder extracts only the content, that is, the relative positions of strokes; the style coding provides only style features; and the diffusion model is limited to generating the target image without obscuring any content or style information. This improves the model’s interpretability and makes it simpler to add new functionality. ii) Thanks to pre-training, our model requires fewer training steps for a new font. We train only the style coding, on a small scale, which avoids extensive training of the large-scale diffusion model. iii) Our model supports two types of few-shot training. The first involves a single style with different characters and requires only a few characters for training. The second involves different styles and requires only a few fonts for training. Experimental results show that our model outperforms previous few-shot font generation models in quality, generation speed, and the amount of training data required.
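As a rough illustration of how conditional embedding tokens might be combined and paired with a diffusion process, the following is a minimal NumPy sketch, not the authors' implementation. The function names, token shapes, and linear noise schedule are all assumptions made for illustration, and since the abstract does not specify the role of the third encoder, it appears here only as a placeholder called `other_tokens`. The noising step is the standard DDPM forward process rather than anything specific to this paper.

```python
import numpy as np

def build_condition(content_tokens, style_tokens, other_tokens):
    """Stack per-encoder token sequences into one conditioning sequence.

    All three inputs are hypothetical (num_tokens, dim) arrays; the third
    encoder's role is not specified in the abstract.
    """
    return np.concatenate([content_tokens, style_tokens, other_tokens], axis=0)

def forward_noise(x0, t, alphas_cumprod, rng):
    """Standard DDPM forward step:
    x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps, with eps ~ N(0, I).
    """
    eps = rng.standard_normal(x0.shape)
    a_bar = alphas_cumprod[t]
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
    return x_t, eps

rng = np.random.default_rng(0)

# Assumed linear beta schedule with 1000 steps (a common default, not from the paper).
betas = np.linspace(1e-4, 0.02, 1000)
alphas_cumprod = np.cumprod(1.0 - betas)

# Placeholder encoder outputs: random arrays stand in for real encoders.
content = rng.standard_normal((4, 16))
style = rng.standard_normal((2, 16))
other = rng.standard_normal((1, 16))
cond = build_condition(content, style, other)

# A placeholder 32x32 glyph image, noised to timestep 500.
x0 = rng.standard_normal((32, 32))
x_t, eps = forward_noise(x0, 500, alphas_cumprod, rng)
```

In a full model, a denoising network would receive `x_t`, the timestep, and `cond`, and would be trained to predict `eps`. Under the few-shot setting described above, only the style-related parameters would be updated while the pre-trained diffusion network stays fixed.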