CLIP-VQDiffusion: Language Free Training of Text To Image generation using CLIP and vector quantized diffusion model

Seungdae Han, Joohee Kim
2024-03-22
Abstract: There has been significant progress in text-conditional image generation models. Recent advances in this field depend not only on improvements in model architecture but also on vast quantities of paired text-image data. However, creating such datasets is costly and labor-intensive. Well-known face datasets lack corresponding text captions, making it difficult to develop text-conditional image generation models on them. Some research has therefore focused on training text-to-image generation models using only images, without text captions. Here, we propose CLIP-VQDiffusion, which leverages the pretrained CLIP model to provide multimodal text-image representations and strong image generation capabilities. On the FFHQ dataset, our model outperformed previous state-of-the-art methods by 4.4% in clipscore and generated very realistic images for both in-distribution and out-of-distribution text. The pretrained models and code will soon be available at
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the dependence of text-to-image generation models on large paired text-image datasets for training. Recent progress relies not only on better model architectures but also on extensive paired data, which is costly and time-consuming to create. In particular, some well-known face datasets have no accompanying text descriptions, which makes it hard to train text-conditioned image generation models on them. The paper therefore proposes CLIP-VQDiffusion, which uses the pretrained CLIP model to supply multimodal text-image representations and combines it with the image generation capabilities of the Vector Quantized Diffusion Model (VQ-Diffusion). This makes it possible to train a text-to-image generation model on a dataset that contains no text. On the FFHQ dataset, the model outperforms previous state-of-the-art methods by 4.4% in clipscore, and the generated images are highly realistic for text prompts both inside and outside the training distribution.
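
To make the idea of training without captions more concrete, here is a minimal sketch. It rests on an assumption that the summary above does not spell out: because CLIP embeds images and text in a shared space, an image's own embedding can stand in for a caption embedding during training, and a text embedding is substituted at generation time. The function names, the `noise_scale` parameter, and the Gaussian perturbation are hypothetical choices for illustration, and the vectors are random stand-ins rather than real CLIP outputs.

```python
import math
import random


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def training_condition(image_emb, noise_scale=0.1, rng=None):
    """Build the conditioning vector from an image embedding only.

    Assumption: a small random perturbation is added so that the generator
    tolerates the gap between image and text embeddings at inference time.
    """
    if rng is None:
        rng = random.Random()
    unit = l2_normalize(image_emb)
    noise = l2_normalize([rng.gauss(0.0, 1.0) for _ in image_emb])
    return l2_normalize([u + noise_scale * n for u, n in zip(unit, noise)])


def inference_condition(text_emb):
    """At sampling time, a text embedding takes the image embedding's place."""
    return l2_normalize(text_emb)


def cosine(a, b):
    """Cosine similarity of two unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    rng = random.Random(0)
    dim = 512  # hypothetical embedding width
    image_emb = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # stand-in for a CLIP image embedding
    text_emb = [rng.gauss(0.0, 1.0) for _ in range(dim)]   # stand-in for a CLIP text embedding

    cond_train = training_condition(image_emb, rng=rng)
    cond_infer = inference_condition(text_emb)
    print(len(cond_train), len(cond_infer))
    print(cosine(cond_train, l2_normalize(image_emb)))
```

With a small `noise_scale`, the training condition stays close to the original image embedding, so the generator sees conditions that resemble what a matching caption would produce under the shared-space assumption.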