Abstract: The generative modeling landscape has experienced tremendous growth in recent years, particularly in generating natural images and art. Recent techniques have shown impressive potential in creating complex visual compositions while delivering impressive realism and quality. However, state-of-the-art methods have been focusing on the narrow domain of natural images, while other distributions remain unexplored. In this paper, we introduce the problem of text-to-figure generation, that is, creating scientific figures of papers from text descriptions. We present FigGen, a diffusion-based approach for text-to-figure as well as the main challenges of the proposed task. Code and models are available at https://github.com/joanrod/figure-diffusion

What problem does this paper attempt to address?

The paper addresses the problem of automatically generating scientific figures from textual descriptions (text-to-figure generation). Specifically, the authors study how generative models can produce the scientific figures that correspond to textual descriptions in research papers. The task has practical importance because scientific figures are a crucial part of research communication, conveying findings concisely and intuitively. Existing generative models, however, focus mainly on natural images and artworks, and generation in the specific domain of scientific figures remains largely unexplored.

### Main Challenges

1. **Complex Relationship Representation**: Scientific figures usually contain complex discrete components, such as boxes, arrows, and text, which require a fine-grained understanding of the relationships between these components.
2. **Technical Text Descriptions**: Generating scientific figures requires handling variable-length, highly technical text descriptions.
3. **Diverse Figure Styles**: Different types of scientific figures have different styles and layouts, which increases the difficulty of generation.
4. **Alignment of Images and Text**: The generated figures need to be highly consistent with the textual descriptions, which places high demands on the model's alignment capabilities.

### Solution

The authors propose a diffusion-based approach, called FigGen, for generating scientific figures from textual descriptions. It has three components:

1. **Image Autoencoder**: First, an image autoencoder is trained to compress images into a low-dimensional latent representation space, which accelerates the training of the diffusion model.
2. **Text Encoder**: A BERT-based text encoder is trained from scratch to capture the relationship between textual descriptions and figures.
3. **Diffusion Model**: The diffusion process is conducted in the latent representation space, generating figures through a denoising U-Net conditioned on time and text.

### Experimental Results

The authors conducted experiments on the Paper2Fig100k dataset, which contains a large number of paper-figure pairs. The results show that FigGen can learn the relationship between textual descriptions and figures and generate images that conform to the data distribution. Although the quality of the generated images still needs improvement, the results demonstrate the potential of generative models for scientific figure generation.

### Conclusion

The paper introduces the task of text-to-figure generation and proposes the FigGen model. Although the current results are not yet good enough for direct use in actual research, the authors point out directions for future work, including improving text-image alignment and designing better validation metrics and loss functions.
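To make the diffusion step described under Solution more concrete, the sketch below shows the forward noising step and the noise-prediction loss of a DDPM-style diffusion process applied to a latent vector. It is a minimal illustration and not the authors' implementation: the linear noise schedule, the step count of 1000, the noise-prediction objective, the example latent values, and all function names are assumptions introduced here. The U-Net and the text conditioning are omitted.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, linearly spaced (a common default, assumed here)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar[t] is the running product of (1 - beta_s) for all s <= t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(latent, t, alpha_bars, rng):
    """Sample a noisy latent z_t from q(z_t | z_0); also return the noise used."""
    noise = [rng.gauss(0.0, 1.0) for _ in latent]
    signal_scale = math.sqrt(alpha_bars[t])
    noise_scale = math.sqrt(1.0 - alpha_bars[t])
    noisy = [signal_scale * z + noise_scale * e for z, e in zip(latent, noise)]
    return noisy, noise


def denoising_loss(predicted_noise, true_noise):
    """Mean squared error between the predicted and the actually added noise."""
    squared = [(p - e) ** 2 for p, e in zip(predicted_noise, true_noise)]
    return sum(squared) / len(squared)


rng = random.Random(0)
betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)

# Stand-in for the autoencoder's latent of one figure (hypothetical values).
latent = [0.5, -1.0, 0.25, 2.0]
noisy, noise = add_noise(latent, 500, alpha_bars, rng)

# In training, a denoising network would predict `noise` from `noisy`, the
# timestep, and the text embedding. A perfect prediction gives zero loss.
perfect_loss = denoising_loss(noise, noise)
```

In a full system the denoiser would replace the perfect prediction above, and the loss would be averaged over randomly sampled timesteps and training figures.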

FigGen: Text to Scientific Figure Generation
