Enhance Image-to-Image Generation with LLaVA-generated Prompts

Zhicheng Ding, Panfeng Li, Qikai Yang, Siyang Li

DOI: https://doi.org/10.1109/ISPDS62779.2024.10667513

2024-09-21

Abstract: This paper presents a novel approach to enhance image-to-image generation by leveraging the multimodal capabilities of the Large Language and Vision Assistant (LLaVA). We propose a framework where LLaVA analyzes input images and generates textual descriptions, hereinafter LLaVA-generated prompts. These prompts, along with the original image, are fed into the image-to-image generation pipeline. This enriched representation guides the generation process towards outputs that exhibit a stronger resemblance to the input image. Extensive experiments demonstrate the effectiveness of LLaVA-generated prompts in promoting image similarity. We observe a significant improvement in the visual coherence between the generated and input images compared to traditional methods. Future work will explore fine-tuning LLaVA prompts for increased control over the creative process. By providing more specific details within the prompts, we aim to achieve a delicate balance between faithfulness to the original image and artistic expression in the generated outputs.
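The two-stage flow described in the abstract (describe the image, then condition generation on both the image and the description) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `generate_with_llava_prompt`, `GenerationResult`, the `captioner` and `img2img` callables, and the `strength` parameter are all hypothetical names introduced here. In practice the captioner would wrap a LLaVA model and the generator would wrap an image-to-image diffusion pipeline; both are passed in as plain callables so the sketch stays independent of any particular library.

```python
from dataclasses import dataclass
from typing import Any, Callable

# Hypothetical interfaces (assumptions, not taken from the paper):
#   Captioner: image -> textual description (the "LLaVA-generated prompt")
#   Img2Img:   (image, prompt, strength) -> generated image
Captioner = Callable[[Any], str]
Img2Img = Callable[[Any, str, float], Any]


@dataclass
class GenerationResult:
    prompt: str  # the description produced for the input image
    image: Any   # the image returned by the generator


def generate_with_llava_prompt(
    image: Any,
    captioner: Captioner,
    img2img: Img2Img,
    strength: float = 0.5,
) -> GenerationResult:
    """Describe the input image, then feed image + description to the generator."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must lie in [0, 1]")

    # Stage 1: the vision-language model turns the image into a text prompt.
    prompt = captioner(image).strip()
    if not prompt:
        raise ValueError("captioner returned an empty prompt")

    # Stage 2: the generator is conditioned on the original image AND the prompt.
    output = img2img(image, prompt, strength)
    return GenerationResult(prompt=prompt, image=output)


if __name__ == "__main__":
    # Stand-in callables so the sketch runs without any model weights.
    def fake_captioner(img: Any) -> str:
        return "a red bicycle leaning against a brick wall"

    def fake_img2img(img: Any, prompt: str, strength: float) -> Any:
        return {"source": img, "prompt": prompt, "strength": strength}

    result = generate_with_llava_prompt("input.png", fake_captioner, fake_img2img)
    print(result.prompt)
```

Passing the two models in as callables is a design choice for the sketch only: it keeps the prompt-generation step and the image-generation step separately replaceable, which mirrors the way the abstract describes them as distinct stages.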

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper aims to address two key issues in image generation:

1. **Lack of control**: When generation relies solely on the input image, the output may deviate from the user's intent, resulting in images that lack control or fidelity.
2. **Inaccuracy and instability of generated images**: Images generated by large language models can be inaccurate and unstable, which requires careful consideration and mitigation strategies.

To address these issues, the paper proposes a method that enhances image-to-image generation by leveraging the capabilities of the Large Language and Vision Assistant (LLaVA). Specifically, LLaVA analyzes the input image and generates text descriptions (referred to as LLaVA-generated prompts), which are fed into the image generation pipeline along with the original image. This guides the generation process so that the output image more closely resembles the input image, and experiments show that the approach significantly improves the visual consistency between the generated image and the input image.

Future work will explore fine-tuning LLaVA prompts to increase control over the creative process, balancing fidelity to the input image against artistic expression by providing more specific details in the prompts.
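The claim about improved visual consistency implies a numeric comparison between the input and generated images. This page does not say which measures the paper uses, so the sketch below is an assumption: it shows two common pixel-level measures, root-mean-square error (RMSE) and peak signal-to-noise ratio (PSNR), computed over flattened pixel values. The function names and the default `max_value` of 255 (8-bit images) are chosen here for illustration.

```python
import math
from typing import Sequence


def rmse(reference: Sequence[float], generated: Sequence[float]) -> float:
    """Root-mean-square error between two equally sized pixel sequences."""
    if len(reference) != len(generated) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    squared = sum((r - g) ** 2 for r, g in zip(reference, generated))
    return math.sqrt(squared / len(reference))


def psnr(
    reference: Sequence[float],
    generated: Sequence[float],
    max_value: float = 255.0,  # assumption: 8-bit pixel range
) -> float:
    """Peak signal-to-noise ratio in dB; higher means closer to the reference."""
    err = rmse(reference, generated)
    if err == 0.0:
        return math.inf  # identical images
    return 20.0 * math.log10(max_value / err)


if __name__ == "__main__":
    input_pixels = [12, 200, 34, 90]
    output_pixels = [10, 198, 40, 95]
    print(f"RMSE: {rmse(input_pixels, output_pixels):.3f}")
    print(f"PSNR: {psnr(input_pixels, output_pixels):.2f} dB")
```

Lower RMSE and higher PSNR both indicate that the generated image stays closer to the input. Pixel-level measures like these do not capture perceptual or semantic similarity, so they are only one possible way to read "visual consistency".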

LLMGA: Multimodal Large Language Model based Generation Assistant

Attention Prompting on Image for Large Vision-Language Models

LoGoPrompt: Synthetic Text Images Can Be Good Visual Prompts for Vision-Language Models

LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing

Prompt Highlighter: Interactive Control for Multi-Modal LLMs

Visual Prompting in Multimodal Large Language Models: A Survey

PromptCharm: Text-to-Image Generation through Multi-modal Prompting and Refinement

Promptify: Text-to-Image Generation through Interactive Prompt Exploration with Large Language Models

PromptMagician: Interactive Prompt Engineering for Text-to-Image Creation

Mutual Prompt Leaning for Vision Language Models

X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models

TG-LLaVA: Text Guided LLaVA via Learnable Latent Embeddings

Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge

Filling the Image Information Gap for VQA: Prompting Large Language Models to Proactively Ask Questions

LaViP: Language-Grounded Visual Prompts

Empowering Visual Creativity: A Vision-Language Assistant to Image Editing Recommendations

LLM Blueprint: Enabling Text-to-Image Generation with Complex and Detailed Prompts

PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance

LLaVA-Read: Enhancing Reading Ability of Multimodal Language Models

ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts