CLIP Model for Images to Textual Prompts Based on Top-k Neighbors

Xin Zhang, Xin Zhang, YeMing Cai, Tianzhi Jia
2024-01-18
Abstract: Text-to-image synthesis, a subfield of multimodal generation, has gained significant attention in recent years. We propose a cost-effective approach for image-to-prompt generation that leverages generative models to generate textual prompts without the need for large amounts of annotated data. Our method combines the CLIP model with the K-nearest neighbors (KNN) algorithm and is divided into two stages: an offline stage and an online stage. Our method achieves the highest score among the compared models, 0.612, which is 0.013, 0.055, 0.011 higher than CLIP, CLIP + KNN (top 10) respectively.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses image-to-prompt generation, that is, producing a text prompt from an image, and specifically the limitations of text-to-image synthesis techniques. Simple text descriptions often fail to capture all the details and features of an image, such as color, texture, shape, and position. To address this, the paper proposes a cost-effective method that uses generative models to generate text prompts without the need for large amounts of annotated data.

The method consists of an offline phase and an online phase. In the offline phase, prompt text data are collected from various sources, and the CLIP model's text encoder transforms these texts into embedding representations that are stored in a database. A Sentence Transformer model is then employed to further enhance the understanding of the text. In the online phase, the input image is transformed into an image embedding by CLIP's image encoder, and the K-nearest neighbors (KNN) algorithm finds the K most similar text prompts in the stored CLIP text embedding database. Finally, the predicted text prompt for the image is generated by averaging these K embeddings and using the corresponding embedding obtained from the Sentence Transformer model.

In the experiments, the paper uses several existing image-text datasets, such as LAION-400M, COYO-700M, COCO, and RedCaps. It also covers feature engineering and training parameter selection, and uses cosine similarity as the evaluation metric. The proposed method achieves the highest cosine similarity score in the comparison, 0.612, outperforming both the CLIP model alone and CLIP with KNN (top 10).
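
The online phase summarized above (nearest-neighbor lookup, averaging, cosine-similarity scoring) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the embeddings have already been computed and stored, and that the averaging is done over the Sentence Transformer embeddings of the retrieved prompts, which is one reading of the summary. The function names and the toy vectors are hypothetical, and plain Python lists are used only to keep the example dependency-free.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_neighbors(image_emb, clip_text_embs, k):
    """Indices of the k stored CLIP text embeddings most similar to the image embedding."""
    scored = [(cosine_similarity(image_emb, t), i) for i, t in enumerate(clip_text_embs)]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [i for _, i in scored[:k]]


def mean_embedding(vectors):
    """Element-wise average of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


# Toy stand-ins for the offline database (hypothetical values, not real model outputs).
clip_text_embs = [      # CLIP text embeddings, one per stored prompt
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
sentence_embs = [       # Sentence Transformer embeddings of the same prompts
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [-1.0, 0.0],
]

# Online phase: image embedding -> k nearest prompts -> averaged prediction.
image_emb = [1.0, 0.05, 0.0]   # stand-in for the CLIP image encoder output
neighbor_idx = top_k_neighbors(image_emb, clip_text_embs, k=2)
predicted = mean_embedding([sentence_embs[i] for i in neighbor_idx])

# Evaluation: cosine similarity against a (hypothetical) ground-truth prompt embedding.
score = cosine_similarity(predicted, [1.0, 1.0])
```

With real data, the exhaustive scan in `top_k_neighbors` would typically be replaced by a vectorized or approximate nearest-neighbor index, since the prompt database may contain millions of entries.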