FigGen: Text to Scientific Figure Generation

Juan A Rodriguez, David Vazquez, Issam Laradji, Marco Pedersoli, Pau Rodriguez
2023-12-17
Abstract: The generative modeling landscape has experienced tremendous growth in recent years, particularly in generating natural images and art. Recent techniques have shown impressive potential in creating complex visual compositions while delivering impressive realism and quality. However, state-of-the-art methods have been focusing on the narrow domain of natural images, while other distributions remain unexplored. In this paper, we introduce the problem of text-to-figure generation, that is, creating scientific figures of papers from text descriptions. We present FigGen, a diffusion-based approach for text-to-figure as well as the main challenges of the proposed task. Code and models are available at https://github.com/joanrod/figure-diffusion
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses the problem of automatically generating scientific figures from textual descriptions (text-to-figure generation). Specifically, the authors study how generative models can produce the scientific figures that correspond to textual descriptions in research papers. The task has practical value because scientific figures are a crucial part of research communication, conveying findings concisely and intuitively. Existing generative models, however, focus mainly on natural images and artwork, and generation in the specific domain of scientific figures remains largely unexplored.

### Main Challenges

1. **Complex Relationship Representation**: Scientific figures usually contain complex discrete components, such as boxes, arrows, and text, and generating them requires a fine-grained understanding of the relationships between these components.
2. **Technical Text Descriptions**: The model must handle variable-length and highly technical text descriptions.
3. **Diverse Figure Styles**: Different types of scientific figures have different styles and layouts, which makes generation harder.
4. **Alignment of Images and Text**: The generated figures need to be highly consistent with the textual descriptions, which places high demands on the model's alignment capabilities.

### Solution

The authors propose a diffusion-based approach, called FigGen, for generating scientific figures from textual descriptions. It has three components:

1. **Image Autoencoder**: An image autoencoder is first trained to compress images into a low-dimensional latent representation, which accelerates the training of the diffusion model.
2. **Text Encoder**: A BERT-architecture text encoder is trained from scratch to capture the relationship between textual descriptions and figures.
3. **Diffusion Model**: The diffusion process runs in the latent representation space, and figures are generated by a denoising U-Net conditioned on the time step and the text.

### Experimental Results

The authors conducted experiments on the Paper2Fig100k dataset, which contains a large number of paper-figure pairs. The results show that FigGen can learn the relationship between textual descriptions and figures and can generate images that match the distribution of figures in the dataset. Although the quality of the generated images still needs improvement, the results demonstrate the potential of generative models for scientific figure generation.

### Conclusion

The paper introduces the task of text-to-figure generation and proposes the FigGen model. Although the current results are not yet good enough for direct use in actual research, the authors point out directions for future work, including improving the alignment of text and images and designing better validation metrics and loss functions.
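
To make the latent diffusion step under Solution more concrete, the following is a minimal NumPy sketch of one training step: a latent is noised at a chosen time step and a denoiser is scored on how well it predicts that noise. It is not the authors' implementation. The linear noise schedule, the latent and embedding shapes, and the helper names (`linear_beta_schedule`, `q_sample`, `toy_denoiser`, `noise_prediction_loss`) are all assumptions made for illustration, and a single linear map stands in for the conditional U-Net.

```python
import numpy as np


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Assumed linear noise schedule; the summary does not say which one FigGen uses."""
    return np.linspace(beta_start, beta_end, num_steps)


def q_sample(z0, t, alphas_cumprod, noise):
    """Forward diffusion in latent space:
    z_t = sqrt(abar_t) * z0 + sqrt(1 - abar_t) * noise."""
    abar_t = alphas_cumprod[t]
    return np.sqrt(abar_t) * z0 + np.sqrt(1.0 - abar_t) * noise


def toy_denoiser(z_t, t, num_steps, text_emb, weights):
    """Hypothetical stand-in for the conditional U-Net: one linear map over
    the flattened noisy latent, a scalar time feature, and the text embedding."""
    features = np.concatenate([z_t.ravel(), [t / num_steps], text_emb])
    return (features @ weights).reshape(z_t.shape)


def noise_prediction_loss(pred_noise, noise):
    """Mean squared error between predicted and true noise."""
    return float(np.mean((pred_noise - noise) ** 2))


rng = np.random.default_rng(0)
num_steps = 1000
alphas_cumprod = np.cumprod(1.0 - linear_beta_schedule(num_steps))

# Placeholders for the outputs of the image autoencoder and the text encoder.
z0 = rng.standard_normal((4, 8, 8))   # assumed latent shape
text_emb = rng.standard_normal(16)    # assumed text embedding size

# One training step: sample a time step, noise the latent, predict the noise.
t = int(rng.integers(0, num_steps))
noise = rng.standard_normal(z0.shape)
z_t = q_sample(z0, t, alphas_cumprod, noise)

weights = 0.01 * rng.standard_normal((z0.size + 1 + text_emb.size, z0.size))
pred_noise = toy_denoiser(z_t, t, num_steps, text_emb, weights)
print("loss:", noise_prediction_loss(pred_noise, noise))
```

In a real system the random placeholders would be replaced by the autoencoder's latent for a figure and the text encoder's output for its description, and the loss would be minimized by gradient descent over the denoiser's parameters.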