Image Textualization: An Automatic Framework for Creating Accurate and Detailed Image Descriptions

Renjie Pi, Jianshu Zhang, Jipeng Zhang, Rui Pan, Zhekai Chen, Tong Zhang

2024-06-12

Abstract: Image description datasets play a crucial role in the advancement of various applications such as image understanding, text-to-image generation, and text-image retrieval. Currently, image description datasets primarily originate from two sources. One source is the scraping of image-text pairs from the web. Despite their abundance, these descriptions are often of low quality and noisy. The other is human labeling, but descriptions in datasets such as COCO are generally very short and lack details. Although detailed image descriptions can be annotated by humans, the high annotation cost limits the feasibility. These limitations underscore the need for more efficient and scalable methods to generate accurate and detailed image descriptions. In this paper, we propose an innovative framework termed Image Textualization (IT), which automatically produces high-quality image descriptions by leveraging existing multi-modal large language models (MLLMs) and multiple vision expert models in a collaborative manner to convert as much of the visual information as possible into text. To address the current lack of benchmarks for detailed descriptions, we propose several benchmarks for comprehensive evaluation, which verify the quality of image descriptions created by our framework. Furthermore, we show that LLaVA-7B, benefiting from training on IT-curated descriptions, acquires an improved capability to generate richer image descriptions, substantially increasing the length and detail of its output with less hallucination.

Computer Vision and Pattern Recognition, Computation and Language

What problem does this paper attempt to address?

The main problem this paper addresses is that existing image captioning datasets lack quality and detail. Specifically:

1. **Image-text pairs obtained through web crawling**: These datasets are large-scale, but their captions are low in quality and noisy.
2. **Manually annotated datasets**: Captions in datasets such as COCO are usually very brief and lack details. Moreover, the high annotation cost makes it difficult to produce high-quality detailed captions at scale.

These problems leave a significant gap between image captions and the information actually present in the images, which limits the performance of multi-modal large language models (MLLMs) in tasks such as image understanding, text-to-image generation, and text-image retrieval.

To solve these problems, the authors propose an innovative framework named **Image Textualization (IT)**, which automatically generates high-quality image captions in three stages:

- **Utilizing existing multi-modal large language models (MLLMs)**: Provide the overall caption structure of the image.
- **Combining multiple vision expert models**: Extract fine-grained object information, supplement the details missing from the MLLM-generated captions, and identify and correct hallucinated content.
- **Finally, utilizing powerful large language models (LLMs)**: Regenerate captions based on the information from the previous two stages, ensuring that the captions are both detailed and free of hallucinations.

In addition, to evaluate the quality of the generated image captions, the authors propose several benchmarks (DID-Bench, D2I-Bench, and LIN-Bench) and conduct extensive experiments. The results show that captions generated with the IT framework are more detailed and accurate and contain fewer hallucinations, significantly improving the performance of downstream tasks.

In summary, this paper aims to generate high-quality, detailed image captions automatically, to make up for the deficiencies of existing datasets and to promote the application and development of multi-modal large language models in image-related tasks.
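The three stages described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `Detection`, `words_of`, and `textualize` are hypothetical, the MLLM caption and the vision expert outputs are passed in as plain data rather than produced by models, and the final LLM rewrite is replaced by a simple rule-based stand-in. It only shows the data flow: objects named in the coarse caption but not detected are treated as hallucinations, and detected objects absent from the caption are treated as missing details.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One object reported by a (hypothetical) vision expert model."""
    label: str
    detail: str


def words_of(text: str) -> set[str]:
    """Lowercased words of a text with surrounding punctuation stripped."""
    return {w.strip(".,;:!?").lower() for w in text.split()} - {""}


def textualize(coarse_caption: str,
               detections: list[Detection],
               vocabulary: set[str]) -> str:
    """Toy version of the three stages; in practice each stage is a model call."""
    detected = {d.label for d in detections}

    # Stage 1: `coarse_caption` stands for the MLLM output (assumed given here).
    sentences = [s.strip() for s in coarse_caption.split(".") if s.strip()]

    # Stage 2: compare caption mentions with the detections.
    mentioned = words_of(coarse_caption) & vocabulary
    hallucinated = mentioned - detected          # named but not detected
    missing = [d for d in detections if d.label not in mentioned]

    # Stage 3: rule-based stand-in for the LLM rewrite (an assumption):
    # drop sentences naming hallucinated objects, append missing details.
    kept = [s for s in sentences if not (words_of(s) & hallucinated)]
    added = [f"There is a {d.label} {d.detail}" for d in missing]
    return ". ".join(kept + added) + "."


example = textualize(
    "A man sits on a bench. A dog lies next to him.",
    [
        Detection("man", "wearing a red jacket"),
        Detection("bench", "made of wood"),
        Detection("bicycle", "leaning against the bench"),
    ],
    vocabulary={"man", "bench", "dog", "bicycle"},
)
print(example)
```

In this toy example the sentence about the undetected dog is dropped and the detected bicycle is added. A real pipeline would replace the word matching with model-based grounding and the string templates with an LLM prompt.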

A Novel Evaluation Framework for Image2Text Generation

Image2Text2Image: A Novel Framework for Label-Free Evaluation of Image-to-Text Generation with Text-to-Image Diffusion Models

DOCCI: Descriptions of Connected and Contrasting Images

Descriptive Image Quality Assessment in the Wild

Benchmarking and Improving Detail Image Caption

Enhancing Image Description Generation through Deep Reinforcement Learning: Fusing Multiple Visual Features and Reward Mechanisms

Visual Spatial Description: Controlled Spatial-Oriented Image-to-Text Generation

Visuals to Text: A Comprehensive Review on Automatic Image Captioning

From Captions to Visual Concepts and Back

ImageInWords: Unlocking Hyper-Detailed Image Descriptions

Vision Language Modeling of Content, Distortion and Appearance for Image Quality Assessment

Application of Dual Attention Mechanism in Chinese Image Captioning

What If We Recaption Billions of Web Images with LLaMA-3?

T2I-Scorer: Quantitative Evaluation on Text-to-Image Generation Via Fine-Tuned Large Multi-Modal Models

Text-to-Image Generation for Abstract Concepts

Unified Text-to-Image Generation and Retrieval

Exploring the Distinctiveness and Fidelity of the Descriptions Generated by Large Vision-Language Models

Synthesizing Spoken Descriptions of Images

LCM-Captioner: A lightweight text-based image captioning method with collaborative mechanism between vision and text

Vision-Language Model for Generating Textual Descriptions From Clinical Images: Model Development and Validation Study