Text-to-Image Generation Via Energy-Based CLIP

Roy Ganz, Michael Elad
2024-08-30
Abstract: Joint Energy Models (JEMs), while drawing significant research attention, have not been successfully scaled to real-world, high-resolution datasets. We present EB-CLIP, a novel approach extending JEMs to the multimodal vision-language domain using CLIP, integrating both generative and discriminative objectives. For the generative objective, we introduce an image-text joint-energy function based on cosine similarity in the CLIP space, training CLIP to assign low energy to real image-caption pairs and high energy otherwise. For the discriminative objective, we employ a contrastive adversarial loss, extending the adversarial training objective to the multimodal domain. EB-CLIP not only generates realistic images from text but also achieves competitive results on the compositionality benchmark, outperforming leading methods with fewer parameters. Additionally, we demonstrate the superior guidance capability of EB-CLIP by enhancing CLIP-based generative frameworks and converting unconditional diffusion models to text-based ones. Lastly, we show that EB-CLIP can serve as a more robust evaluation metric for text-to-image generative tasks than CLIP.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### What problems does this paper attempt to solve?

This paper addresses the challenges that Joint Energy Models (JEMs) encounter when extended to real-world, high-resolution datasets. Although JEMs have received extensive research attention, they perform poorly on large-scale, high-resolution image datasets, suffering from unstable training and high computational cost. In addition, existing CLIP-based text-to-image generation methods (such as CLIPAG) can produce visually appealing images, but these images are often not realistic enough, and the methods rely on multi-view augmentation.

To solve these problems, the authors propose **CLIP-JEM** (called EB-CLIP in the abstract), a method that extends JEMs to the multimodal vision-language domain. By combining generative and discriminative objectives, CLIP-JEM not only generates realistic images from text but also achieves competitive results on the compositionality benchmark with fewer parameters. CLIP-JEM can also serve as an evaluation metric for text-to-image generation tasks, where it is more robust than standard CLIP.

#### Summary of main problems:

1. **Scalability issues of JEMs**: Existing JEMs are difficult to apply to large-scale, high-resolution, real-world datasets.
2. **Lack of realism in CLIPAG-generated images**: Although images generated by CLIPAG look good, they are not realistic enough and require multi-view augmentation.
3. **Lack of effective evaluation metrics for text-to-image generation**: Existing evaluation methods cannot fully measure the quality and consistency of generated images.

By introducing CLIP-JEM, the authors aim to overcome these problems, achieve high-quality text-to-image generation, and provide a more reliable evaluation tool.
### Formula representation

The energy function and loss functions involved in the paper are as follows:

- **Joint image-text energy function**:

  \[ E_\theta(I, T) = -\text{CosineSimilarity}(f_I^\theta(I), f_T^\theta(T)) \]

  where \(I\) and \(T\) are the visual and text inputs respectively, \(\text{CosineSimilarity}\) is the cosine similarity, and \(f_I^\theta\) and \(f_T^\theta\) are the visual and text encoders of CLIP respectively. Because of the minus sign, a well-matched image-text pair (high cosine similarity) has low energy.
- **Contrastive energy loss**: The model is trained by minimizing the energy of "positive sample" pairs (real image-caption pairs) and maximizing the energy of "negative sample" pairs, in line with the abstract's goal of assigning low energy to real pairs and high energy otherwise.
- **Adversarial loss**: The CLIP loss is extended through adversarial training so that the model is robust to adversarial samples and produces semantically meaningful gradients.

Together, these objectives enable CLIP-JEM to achieve high-quality text-to-image generation in the multimodal vision-language domain.
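
As a rough illustration of the joint energy function, the sketch below evaluates \(E(I, T)\) as the negative cosine similarity of two precomputed embedding vectors, using only the Python standard library. The embeddings stand in for the outputs of CLIP's image and text encoders, which are not run here. The `energy_gap_loss` helper is a hypothetical, simplified objective (mean energy of positive pairs minus mean energy of negative pairs) that follows the abstract's description of low energy for real image-caption pairs and high energy otherwise. It is an assumption for illustration, not the paper's exact contrastive energy loss.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_u * norm_v)


def joint_energy(image_emb, text_emb):
    """E(I, T) = -cos(f_I(I), f_T(T)), computed on precomputed embeddings."""
    return -cosine_similarity(image_emb, text_emb)


def energy_gap_loss(positive_pairs, negative_pairs):
    """Hypothetical illustrative objective (not the paper's exact loss).

    Returns mean energy of positive pairs minus mean energy of negative
    pairs. Minimizing it pushes real pairs toward low energy and all
    other pairs toward high energy.
    """
    pos = sum(joint_energy(i, t) for i, t in positive_pairs) / len(positive_pairs)
    neg = sum(joint_energy(i, t) for i, t in negative_pairs) / len(negative_pairs)
    return pos - neg


# Toy 2-D embeddings (made-up values, not real CLIP outputs).
text = [1.0, 0.0]
matching_image = [2.0, 0.0]    # same direction as the text embedding
mismatched_image = [0.0, 3.0]  # orthogonal to the text embedding

low = joint_energy(matching_image, text)     # aligned pair: energy near -1
high = joint_energy(mismatched_image, text)  # orthogonal pair: energy near 0
loss = energy_gap_loss([(matching_image, text)], [(mismatched_image, text)])
```

Since cosine similarity lies in \([-1, 1]\), the energy is bounded in the same interval, and the aligned pair in this toy example receives lower energy than the orthogonal one. In practice the embeddings would come from the CLIP encoders and the loss would be minimized with respect to their parameters \(\theta\).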