Abstract: Joint Energy Models (JEMs), while drawing significant research attention, have not been successfully scaled to real-world, high-resolution datasets. We present EB-CLIP, a novel approach extending JEMs to the multimodal vision-language domain using CLIP, integrating both generative and discriminative objectives. For the generative objective, we introduce an image-text joint-energy function based on cosine similarity in the CLIP space, training CLIP to assign low energy to real image-caption pairs and high energy otherwise. For the discriminative objective, we employ a contrastive adversarial loss, extending the adversarial training objective to the multimodal domain. EB-CLIP not only generates realistic images from text but also achieves competitive results on the compositionality benchmark, outperforming leading methods with fewer parameters. Additionally, we demonstrate the superior guidance capability of EB-CLIP by enhancing CLIP-based generative frameworks and converting unconditional diffusion models to text-based ones. Lastly, we show that EB-CLIP can serve as a more robust evaluation metric for text-to-image generative tasks than CLIP.

### What problems does this paper attempt to solve?

This paper addresses the challenges that Joint Energy Models (JEMs) encounter when extended to real-world, high-resolution datasets. Although JEMs have received extensive research attention, they perform poorly on large-scale, high-resolution image datasets, suffering from unstable training and high computational cost. In addition, existing CLIP-based text-to-image generation methods (such as CLIPAG) can generate visually appealing images, but these images are often not realistic enough, and the methods rely on multi-view augmentation techniques.

To solve these problems, the authors propose **CLIP-JEM** (referred to as EB-CLIP in the abstract), a new method that extends JEMs to the multimodal vision-language domain. By combining generative and discriminative objectives, CLIP-JEM not only generates realistic images from text but also achieves competitive results on compositionality benchmarks with fewer parameters. CLIP-JEM also serves as a more robust evaluation metric for text-to-image generation tasks than the standard CLIP model.

#### Summary of main problems:

1. **Scalability issues of JEMs**: Existing JEMs are difficult to apply to large-scale, high-resolution real-world datasets.
2. **Lack of realism in CLIPAG-generated images**: Although the images generated by CLIPAG are visually appealing, they are not realistic enough and require multi-view augmentation.
3. **Lack of effective evaluation metrics for text-to-image generation**: Existing evaluation methods cannot fully measure the quality and consistency of generated images.

By introducing CLIP-JEM, the authors aim to overcome these problems, achieve high-quality text-to-image generation, and provide a more reliable evaluation tool.

### Formula representation

The energy function and loss functions involved in the paper are as follows:

- **Joint image-text energy function**:
  \[ E_\theta(I, T) = -\text{CosineSimilarity}\left(f_I^\theta(I), f_T^\theta(T)\right) \]
  where \(I\) and \(T\) are the visual and text inputs respectively, \(\text{CosineSimilarity}\) is the cosine similarity, and \(f_I^\theta\) and \(f_T^\theta\) are the visual and text encoders of CLIP respectively.
- **Contrastive energy loss**: The model is trained by minimizing the energy of "positive sample" pairs (real image-caption pairs) and maximizing the energy of "negative sample" pairs, so that real pairs receive low energy and all other pairs receive high energy.
- **Adversarial loss**: The CLIP loss function is extended through adversarial training so that the model is robust to adversarial samples and produces semantically meaningful gradients.

Together, these components enable CLIP-JEM to achieve efficient and high-quality text-to-image generation in the multimodal vision-language domain.
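
As a rough illustration of the joint energy function, the following minimal sketch computes the negative cosine similarity between two embedding vectors in plain Python. The 3-dimensional vectors are hypothetical stand-ins for the outputs of CLIP's image and text encoders, and `energy_gap` is an illustrative objective assumed here for demonstration only; it is not the paper's exact loss. Following the abstract, a real image-caption pair should receive lower energy than a mismatched one.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_u * norm_v)


def joint_energy(image_emb: Sequence[float], text_emb: Sequence[float]) -> float:
    """E(I, T) = -cos(f_I(I), f_T(T)); lower energy means a better match."""
    return -cosine_similarity(image_emb, text_emb)


def energy_gap(pos_image_emb: Sequence[float],
               neg_image_emb: Sequence[float],
               text_emb: Sequence[float]) -> float:
    """Illustrative objective (an assumption, not the paper's exact loss):
    E(positive pair) - E(negative pair). Minimizing it pushes real pairs
    toward low energy and other pairs toward high energy."""
    return joint_energy(pos_image_emb, text_emb) - joint_energy(neg_image_emb, text_emb)


# Hypothetical 3-d embeddings standing in for CLIP encoder outputs.
text_emb = [1.0, 0.0, 0.0]
matched_image_emb = [2.0, 0.0, 0.0]     # same direction as the caption
mismatched_image_emb = [0.0, 3.0, 0.0]  # orthogonal to the caption

e_pos = joint_energy(matched_image_emb, text_emb)
e_neg = joint_energy(mismatched_image_emb, text_emb)
gap = energy_gap(matched_image_emb, mismatched_image_emb, text_emb)
```

Because cosine similarity ignores vector length, the energy depends only on the angle between the two embeddings and is bounded in [-1, 1]. In practice the embeddings would come from the trained encoders, and the energy would be differentiated with respect to the image or the model parameters.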

Text-to-Image Generation Via Energy-Based CLIP
