Data Extrapolation for Text-to-image Generation on Small Datasets

Senmao Ye, Fei Liu

2024-10-02

Abstract: Text-to-image generation requires a large amount of training data to synthesize high-quality images. To augment training data, previous methods rely on data interpolations such as cropping, flipping, and mixing up, which fail to introduce new information and yield only marginal improvements. In this paper, we propose a new data augmentation method for text-to-image generation using linear extrapolation. Specifically, we apply linear extrapolation only to text features, and new image data are retrieved from the internet by search engines. To ensure the reliability of the new text-image pairs, we design two outlier detectors to purify the retrieved images. Based on extrapolation, we construct training samples dozens of times larger than the original dataset, resulting in a significant improvement in text-to-image performance. Moreover, we propose a NULL-guidance to refine score estimation, and apply recurrent affine transformation to fuse text information. Our model achieves FID scores of 7.91, 9.52 and 5.00 on the CUB, Oxford and COCO datasets, respectively. The code and data will be available on GitHub (https://github.com/senmaoy/RAT-Diffusion).

Computer Vision and Pattern Recognition, Artificial Intelligence

What problem does this paper attempt to address?

This paper addresses the lack of training data for text-to-image generation on small datasets. Existing data augmentation methods such as cropping, flipping, and mixing up can create new views of existing samples but cannot introduce new information, so they yield only limited improvement. The paper therefore proposes a data augmentation method based on linear extrapolation: text features are linearly extrapolated to create new training samples, and matching images are retrieved from the internet. The main contributions are:

1. **Proposing a new data augmentation method**: New image data are retrieved from the internet based on linearly extrapolated text features, producing training samples dozens of times larger than the original dataset.
2. **Designing two outlier detectors**: These purify the retrieved images and ensure the reliability of the new text-image pairs.
3. **Proposing NULL-condition guidance**: This refines score estimation and further improves generation quality.
4. **Applying recurrent affine transformation**: This fuses text information into the diffusion model.

With these methods, the paper reports FID scores of 7.91, 9.52, and 5.00 on the CUB, Oxford, and COCO datasets, respectively.
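The difference between interpolation and extrapolation in feature space can be illustrated with a minimal sketch. The summary above does not give the paper's exact formula, so the code below uses the generic textbook form of linear extrapolation; the function names and the step size `lam` are hypothetical and are not taken from the paper or its repository.

```python
from typing import List, Sequence


def interpolate(anchor: Sequence[float], neighbor: Sequence[float], lam: float = 0.5) -> List[float]:
    """Move `anchor` toward `neighbor`; for 0 <= lam <= 1 the result stays between them."""
    if len(anchor) != len(neighbor):
        raise ValueError("feature vectors must have the same length")
    return [a + lam * (n - a) for a, n in zip(anchor, neighbor)]


def extrapolate(anchor: Sequence[float], neighbor: Sequence[float], lam: float = 0.5) -> List[float]:
    """Move `anchor` away from `neighbor`; for lam > 0 the result leaves the segment.

    Generic form (an assumption, not the paper's stated equation):
        new = anchor + lam * (anchor - neighbor)
    """
    if len(anchor) != len(neighbor):
        raise ValueError("feature vectors must have the same length")
    return [a + lam * (a - n) for a, n in zip(anchor, neighbor)]


# Toy 2-D "text features" (illustrative values only).
anchor = [1.0, 2.0]
neighbor = [0.0, 0.0]

inside = interpolate(anchor, neighbor)   # lies between the two points
outside = extrapolate(anchor, neighbor)  # lies beyond the anchor
```

In the approach summarized above, an extrapolated text feature alone is not a training sample: it still has to be paired with an image retrieved from the internet and passed through the outlier detectors, steps this sketch does not attempt to reproduce.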

A Recipe for Scaling Up Text-to-Video Generation with Text-free Videos

Boosting Unsupervised Contrastive Learning Using Diffusion-Based Data Augmentation from Scratch

Emage: Non-Autoregressive Text-to-Image Generation

Recurrent Affine Transformation for Text-to-image Synthesis

Multimodal Data Augmentation for Image Captioning using Diffusion Models

A Simple Background Augmentation Method for Object Detection with Diffusion Model

Effective Data Augmentation With Diffusion Models

DoubleMix: Simple Interpolation-Based Data Augmentation for Text Classification

Improving Text Generation on Images with Synthetic Captions

AnyText: Multilingual Visual Text Generation And Editing

Medical diffusion on a budget: Textual Inversion for medical image generation

Data-Efficient Augmentation for Training Neural Networks

SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds

TTIDA: Controllable Generative Data Augmentation via Text-to-Text and Text-to-Image Models

Not Just Pretty Pictures: Toward Interventional Data Augmentation Using Text-to-Image Generators

MoLE: Enhancing Human-centric Text-to-image Diffusion via Mixture of Low-rank Experts

CogView3: Finer and Faster Text-to-Image Generation via Relay Diffusion