Abstract: Text-to-image synthesis, a subfield of multimodal generation, has gained significant attention in recent years. We propose a cost-effective approach for image-to-prompt generation that leverages generative models to generate textual prompts without the need for large amounts of annotated data. Our method combines the CLIP model with the K-nearest neighbors (KNN) algorithm and is divided into two stages: an offline stage and an online stage. Our method achieves the highest score, 0.612, among the compared models, which is 0.013, 0.055, and 0.011 higher than the baselines, which include CLIP and CLIP + KNN (top 10).

What problem does this paper attempt to address?

This paper addresses image-to-prompt generation, that is, producing a text prompt for a given image, and targets limitations of text-to-image synthesis techniques. Existing methods often struggle to capture all the details and features of an image, such as color, texture, shape, and position, in a simple text description. To address this, the paper proposes a cost-effective method that uses generative models to produce text prompts without needing large amounts of annotated data.

The method consists of an offline phase and an online phase. In the offline phase, prompt text is collected from various sources, and the CLIP text encoder converts these texts into embedding representations, which are stored. A Sentence Transformer model is then applied to further enhance the representation of the text. In the online phase, the CLIP image encoder converts the input image into an image embedding, and the K-nearest neighbors (KNN) algorithm finds the K most similar text prompts in the stored CLIP text embedding database. Finally, the predicted prompt for the image is obtained by averaging these K embeddings and using the embedding corresponding to the image obtained from the Sentence Transformer model.

In the experiments, the paper uses several existing image-text datasets, such as LAION-400M, COYO-700M, COCO, and RedCaps. It also covers feature engineering and training parameter selection, and uses cosine similarity as the evaluation metric. The proposed method achieves the highest cosine similarity score in the comparison, 0.612, outperforming both CLIP alone and CLIP with KNN (top 10).
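The online retrieval step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the embeddings have already been computed (plain NumPy arrays stand in for CLIP and Sentence Transformer outputs), it reads the final step as averaging the Sentence Transformer embeddings of the K retrieved prompts, and the function names `predict_prompt_embedding` and `cosine_similarity` are hypothetical.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products equal cosine similarity."""
    x = np.asarray(x, dtype=np.float64)
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def predict_prompt_embedding(image_emb, clip_text_embs, st_text_embs, k=10):
    """Retrieve the k stored prompts closest to the image and average them.

    image_emb:      (d,)   image embedding (stand-in for the CLIP image encoder output)
    clip_text_embs: (n, d) stored prompt embeddings (stand-in for CLIP text encoder output)
    st_text_embs:   (n, m) embeddings of the same n prompts from a second text model
                           (stand-in for Sentence Transformer output; an assumption)
    Returns the averaged (m,) embedding and the indices of the k retrieved prompts.
    """
    query = l2_normalize(np.asarray(image_emb)[None, :])
    keys = l2_normalize(clip_text_embs)
    sims = (keys @ query.T).ravel()          # cosine similarity to every stored prompt
    k = min(k, sims.shape[0])
    top_idx = np.argsort(-sims)[:k]          # indices of the k most similar prompts
    mean_emb = np.asarray(st_text_embs, dtype=np.float64)[top_idx].mean(axis=0)
    return mean_emb, top_idx


def cosine_similarity(a, b):
    """Cosine similarity between two vectors, the evaluation metric named in the summary."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


if __name__ == "__main__":
    # Toy data: four stored prompts with 4-d "CLIP" and 2-d "Sentence Transformer" embeddings.
    clip_text = np.eye(4)
    st_text = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
    image = np.array([0.9, 0.1, 0.0, 0.0])

    pred, neighbors = predict_prompt_embedding(image, clip_text, st_text, k=2)
    print("retrieved prompt indices:", neighbors)
    print("score against a reference embedding:", cosine_similarity(pred, [1.0, 1.0]))
```

In practice the offline phase would fill `clip_text_embs` and `st_text_embs` once for the whole prompt collection, so the online phase only needs one image encoding and one similarity search per query. For large prompt collections the brute-force similarity computation here would likely be replaced by an approximate nearest-neighbor index.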

CLIP Model for Images to Textual Prompts Based on Top-k Neighbors

The CLIP Model is Secretly an Image-to-Prompt Converter

Emage: Non-Autoregressive Text-to-Image Generation

TextCLIP: Text-Guided Face Image Generation And Manipulation Without Adversarial Training

Promptify: Text-to-Image Generation through Interactive Prompt Exploration with Large Language Models

Batch-Instructed Gradient for Prompt Evolution: Systematic Prompt Optimization for Enhanced Text-to-Image Synthesis

Best Prompts for Text-to-Image Models and How to Find Them

A User-Friendly Framework for Generating Model-Preferred Prompts in Text-to-Image Synthesis

Optimizing Prompts Using In-Context Few-Shot Learning for Text-to-Image Generative Models

GALIP: Generative Adversarial CLIPs for Text-to-Image Synthesis

Dynamic Prompt Optimizing for Text-to-Image Generation

CLIP-GEN: Language-Free Training of a Text-to-Image Generator with CLIP

Image Captions Are Natural Prompts for Text-to-Image Models

PromptMagician: Interactive Prompt Engineering for Text-to-Image Creation

One Model to Edit Them All: Free-Form Text-Driven Image Manipulation with Semantic Modulations

CLIP-Mesh: Generating textured meshes from text using pretrained image-text models

NeuroPrompts: An Adaptive Framework to Optimize Prompts for Text-to-Image Generation

A Prompt Log Analysis of Text-to-Image Generation Systems

Optimizing Prompts for Text-to-Image Generation

What Do You Want? User-centric Prompt Generation for Text-to-image Synthesis via Multi-turn Guidance

Prompt-Based Modality Bridging for Unified Text-to-Face Generation and Manipulation