Enhance Image-to-Image Generation with LLaVA-generated Prompts

Zhicheng Ding, Panfeng Li, Qikai Yang, Siyang Li
DOI: https://doi.org/10.1109/ISPDS62779.2024.10667513
2024-09-21
Abstract: This paper presents a novel approach to enhance image-to-image generation by leveraging the multimodal capabilities of the Large Language and Vision Assistant (LLaVA). We propose a framework where LLaVA analyzes input images and generates textual descriptions, hereinafter LLaVA-generated prompts. These prompts, along with the original image, are fed into the image-to-image generation pipeline. This enriched representation guides the generation process towards outputs that exhibit a stronger resemblance to the input image. Extensive experiments demonstrate the effectiveness of LLaVA-generated prompts in promoting image similarity. We observe a significant improvement in the visual coherence between the generated and input images compared to traditional methods. Future work will explore fine-tuning LLaVA prompts for increased control over the creative process. By providing more specific details within the prompts, we aim to achieve a delicate balance between faithfulness to the original image and artistic expression in the generated outputs.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses two key issues in image generation:

1. **Lack of control**: When generation relies solely on the input image, the output may deviate from the user's intent and lose fidelity to the source.
2. **Inaccuracy and instability of generated images**: Images generated by large language models can be inaccurate and unstable, so mitigation strategies need careful consideration.

To address these issues, the paper proposes enhancing image-to-image generation with the Large Language and Vision Assistant (LLaVA). LLaVA analyzes the input image and generates a text description (the LLaVA-generated prompt), which is fed into the image generation pipeline together with the original image. The added text guides generation toward outputs that more closely resemble the input, and the reported experiments show a significant improvement in visual consistency between the generated and input images.

Future work will explore fine-tuning LLaVA prompts to increase control over the creative process. By putting more specific details in the prompts, the authors aim to balance fidelity to the input image with artistic expression.
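The two-stage flow described above (describe the image, then condition generation on both the image and the description) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `captioner` and `img2img` callables, the `PromptedResult` type, and the `strength` parameter are all assumptions introduced here, and the paper's summary does not specify which models or settings were used.

```python
from dataclasses import dataclass
from typing import Any, Callable

# Hypothetical interfaces; neither signature comes from the paper.
Captioner = Callable[[Any], str]                 # image -> text description
ImageToImage = Callable[[Any, str, float], Any]  # (image, prompt, strength) -> image


@dataclass
class PromptedResult:
    """The description that was used and the image generated from it."""
    prompt: str
    image: Any


def generate_with_llava_prompt(
    image: Any,
    captioner: Captioner,
    img2img: ImageToImage,
    strength: float = 0.5,  # assumed knob: how far the output may drift from the input
) -> PromptedResult:
    """Describe `image`, then run image-to-image generation on image + description."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")

    # Stage 1: a vision-language model (LLaVA in the paper) describes the input.
    prompt = captioner(image).strip()
    if not prompt:
        raise ValueError("captioner returned an empty description")

    # Stage 2: the original image and the generated prompt are passed together.
    generated = img2img(image, prompt, strength)
    return PromptedResult(prompt=prompt, image=generated)
```

In practice, `captioner` would wrap a LLaVA checkpoint and `img2img` would wrap a diffusion-based image-to-image pipeline. Keeping both as injected callables lets the control flow be exercised with stubs before any model weights are loaded.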