Data Extrapolation for Text-to-image Generation on Small Datasets

Senmao Ye, Fei Liu
2024-10-02
Abstract: Text-to-image generation requires a large amount of training data to synthesize high-quality images. To augment training data, previous methods rely on data interpolations such as cropping, flipping, and mixing up, which fail to introduce new information and yield only marginal improvements. In this paper, we propose a new data augmentation method for text-to-image generation using linear extrapolation. Specifically, we apply linear extrapolation only to text features, and new image data are retrieved from the internet by search engines. To ensure the reliability of the new text-image pairs, we design two outlier detectors to purify the retrieved images. Based on extrapolation, we construct training samples dozens of times larger than the original dataset, resulting in a significant improvement in text-to-image performance. Moreover, we propose a NULL-guidance to refine score estimation, and apply recurrent affine transformation to fuse text information. Our model achieves FID scores of 7.91, 9.52 and 5.00 on the CUB, Oxford and COCO datasets. The code and data will be available on GitHub (https://github.com/senmaoy/RAT-Diffusion).
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses the shortage of training data for text-to-image generation on small datasets. Existing augmentation methods such as cropping, flipping, and mixing create new views of existing samples but cannot introduce new information, so they yield only limited improvement. The paper instead proposes data augmentation based on linear extrapolation: text features are linearly extrapolated to create new training samples, and matching images are retrieved from the internet through search engines. The main contributions are:

1. **A new data augmentation method**: Text features are linearly extrapolated and new images are retrieved from the internet, producing training samples dozens of times larger than the original dataset.
2. **Two outlier detectors**: These purify the retrieved images and ensure the reliability of the new text-image pairs.
3. **NULL-guidance**: This refines score estimation and further improves generation quality.
4. **Recurrent affine transformation**: This fuses text information into the diffusion model.

With these methods, the model achieves FID scores of 7.91, 9.52, and 5.00 on the CUB, Oxford, and COCO datasets.
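
To make the idea of linear extrapolation on text features concrete, here is a minimal sketch. The summary does not give the paper's exact formula, so this assumes the common feature-space form `anchor + lam * (anchor - neighbor)` with `lam > 0`. The function names `extrapolate` and `augment`, the parameter `lam`, and the toy vectors are all hypothetical, and the image retrieval and outlier detection steps are not shown.

```python
from typing import List, Sequence


def extrapolate(anchor: Sequence[float],
                neighbor: Sequence[float],
                lam: float = 0.5) -> List[float]:
    """Return anchor + lam * (anchor - neighbor).

    With lam > 0 the result lies beyond the anchor, on the side facing
    away from the neighbor (extrapolation). A value of lam in [-1, 0]
    would instead stay on the segment between the two points
    (interpolation), which is why it is rejected here.
    """
    if len(anchor) != len(neighbor):
        raise ValueError("feature vectors must have the same length")
    if lam <= 0:
        raise ValueError("lam must be positive for extrapolation")
    return [a + lam * (a - n) for a, n in zip(anchor, neighbor)]


def augment(features: Sequence[Sequence[float]],
            lam: float = 0.5) -> List[List[float]]:
    """Extrapolate every ordered pair (i, j) with i != j.

    This is an assumed pairing scheme for illustration; the paper may
    select pairs differently (for example, nearest neighbors only).
    """
    new_features = []
    for i, anchor in enumerate(features):
        for j, neighbor in enumerate(features):
            if i != j:
                new_features.append(extrapolate(anchor, neighbor, lam))
    return new_features


if __name__ == "__main__":
    # Toy 2-D "text features"; real features would come from a text encoder.
    toy = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
    for vec in augment(toy, lam=0.5):
        print(vec)
```

Under this assumed pairing, n original features give n * (n - 1) extrapolated ones, which suggests how a small dataset could grow to many times its original size. In the method described above, each new text feature would still need a retrieved and filtered image before it becomes a usable training pair.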