One-shot font generation via local style self-supervision using Region-Aware Contrastive Loss

Jeong-Sik Lee, Hyun-Chul Choi
DOI: https://doi.org/10.1016/j.jksuci.2024.102028
IF: 9.006
2024-04-18
Journal of King Saud University - Computer and Information Sciences
Highlights:
• Region-Aware Contrastive loss (RAC-loss) maximizes the style information between patches of the generated and reference images, which self-supervises the generator in local style.
• With the proposed RAC-loss, our one-shot font generator (RAC-Font) outperforms previous few-shot and one-shot methods both quantitatively and qualitatively.
• Through self-supervision of local style, we propose a model that has a much simpler structure than previous methods and allows real-time inference.
• The proposed method considers style at a finer-grained (patch) level than previous component-level approaches, resulting in more fine-grained font images.

Abstract: Compositional scripts such as Hangeul (Korean characters) and Chinese characters comprise a very large number of characters, which makes manual font design labor-intensive and costly. Although many few-shot font generation methods have been introduced, each suffers from at least one of the following limitations: missing local font styles, a need for additional component labeling, or high complexity in network structure and training. To address these limitations, and based on our observation that font style can be perceived at the patch level rather than the component level, we propose Region-Aware Contrastive loss (RAC-loss) so that the generator can capture local style through self-supervision. The proposed loss maximizes the style information between patches of the generated image and the style reference image. We also introduce an attention mechanism into the patch-level contrastive loss to handle multiple patch correspondences. This attention learns the style similarity between two glyph images and serves as a patch-correspondence map. RAC-loss gives the generator finer-grained feedback than a component-level loss, allowing it to incorporate local styles even with a straightforward structure such as a visual geometry group network (VGGNet). This results in a low inference latency (3.02 ms), and the proposed method achieves a mean Fréchet Inception Distance (mFID) of 43.18 on the test dataset, 5.42 lower than the previous method.
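The abstract describes a patch-level contrastive loss whose positives are weighted by an attention map between generated and reference patches. The sketch below illustrates that general idea only; it is not the authors' RAC-loss. It assumes an InfoNCE-style formulation, replaces the learned attention with a plain softmax over cosine similarities, and takes negatives from patches of other fonts. The function name `attention_weighted_patch_nce`, the temperatures `tau` and `attn_tau`, and all array shapes are hypothetical choices made for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def logsumexp(x, axis):
    """Numerically stable log(sum(exp(x))) along the given axis."""
    m = x.max(axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.exp(x - m).sum(axis=axis))


def attention_weighted_patch_nce(gen, ref, neg, tau=0.07, attn_tau=0.1):
    """Attention-weighted patch contrastive loss (illustrative assumption).

    gen: (N, D) patch features of the generated glyph (queries)
    ref: (M, D) patch features of the style reference glyph (positives)
    neg: (K, D) patch features of glyphs in other fonts (negatives)
    """
    q, p, n = l2_normalize(gen), l2_normalize(ref), l2_normalize(neg)

    # Cosine similarity between every generated patch and every reference patch.
    sim = q @ p.T  # (N, M)

    # Stand-in for the learned attention: a soft patch-correspondence map.
    # Each row sums to 1, so one query may match several reference patches.
    attn = softmax(sim / attn_tau, axis=1)  # (N, M)

    # InfoNCE term for every (query, positive) pair against all negatives:
    # -log( exp(pos) / (exp(pos) + sum_k exp(neg_k)) )
    pos = sim / tau  # (N, M)
    neg_lse = logsumexp(q @ n.T / tau, axis=1)  # (N,)
    nce = np.logaddexp(pos, neg_lse[:, None]) - pos  # (N, M), always >= 0

    # Weight each pair by its attention score, then average over queries.
    return float((attn * nce).sum(axis=1).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ref_patches = rng.normal(size=(16, 32))
    other_font_patches = rng.normal(size=(64, 32))
    # A generated glyph whose patches match the reference style exactly.
    matched = attention_weighted_patch_nce(ref_patches, ref_patches, other_font_patches)
    # A generated glyph whose patches come from other fonts instead.
    mismatched = attention_weighted_patch_nce(
        other_font_patches[:16], ref_patches, other_font_patches
    )
    print("matched style loss:", matched)
    print("mismatched style loss:", mismatched)
```

In this sketch the loss is small when generated patches resemble reference patches more than they resemble patches from other fonts, which is the kind of local style feedback the abstract attributes to a patch-level loss. In a real training setup the features would come from an encoder and the attention would be learned, as the abstract states.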
computer science, information systems