Abstract: Training Large Multimodality Models (LMMs) relies on descriptive image captions that connect image and language. Existing methods either distill captions from LMMs or construct them from internet images or human annotation. We propose to leverage off-the-shelf visual specialists, which were originally trained on annotated images for tasks other than image captioning, to enhance image captions. Our approach, named DCE, explores low-level and fine-grained object attributes (e.g., depth, emotion, and fine-grained categories) and object relations (e.g., relative location and human-object interaction (HOI)), and combines these attributes into the descriptive caption. Experiments demonstrate that such visual specialists improve performance on visual understanding tasks as well as on reasoning that benefits from more accurate visual understanding. We will release the source code and the pipeline so that other visual specialists can be easily combined into the pipeline. The complete source code of the DCE pipeline and datasets will be available at https://github.com/syp2ysy/DCE.

### What problems does this paper attempt to solve?

This paper addresses the lack of detail and accuracy in existing image captions, including those generated by current large multimodal models (LMMs). Existing methods either distill captions from LMMs or construct them from internet images or manual annotation, and each route has drawbacks:

1. **Manual annotation is costly and hard to scale**: For example, the COCO dataset provides high-quality captions, but its annotation cost is high and it is difficult to expand on a large scale.
2. **LMM-generated captions scale well but still fall short in comprehensiveness and accuracy**: For example, important objects and attributes may be ignored, and 3D spatial relationships may be missing, which matters most in tasks that require a comprehensive understanding of the scene structure.

To overcome these problems, the authors propose Descriptive Caption Enhancement (DCE), which enhances image captions using existing visual specialist models. DCE generates more detailed and accurate captions by combining fine-grained object attributes (such as depth, emotion, and fine-grained categories) with relationships between objects (such as relative positions and human-object interactions).

### Specific objectives

1. **Improve the quality of image captions**: Introduce more fine-grained visual attributes and relationship information between objects so that the generated captions are more detailed and accurate.
2. **Reduce costs and improve efficiency**: Use open-source visual specialist models and large language models (LLMs) to reduce the dependence on expensive manual annotation.
3. **Enhance multimodal perception ability**: Use more detailed image captions to improve the performance of LMMs in visual understanding and reasoning tasks.

### Method overview

The main steps of DCE are:

- **Object localization**: Use a state-of-the-art open-world object detection model to localize objects.
- **Attribute extraction**: Extract object-level attributes (such as fine-grained categories, low-level attributes, and OCR) through multiple visual specialist models.
- **Relationship extraction**: Extract relationship information between objects, including human-object interactions, 2D and 3D relative positions, and object counts.
- **Caption generation**: Use a large language model to integrate the extracted object attributes and relationship information into detailed regional captions, and finally generate a complete image caption.

With this method, DCE can substantially improve the quality and detail of image captions without incurring high costs, thus better supporting multimodal perception tasks.
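
To make the relationship-extraction and caption-generation steps more concrete, here is a minimal Python sketch. It is not the DCE implementation: the `Region` type, the helper names, the depth convention (smaller value means closer to the camera), and the `margin` threshold are all assumptions made for illustration. It also assumes that detection, depth estimation, and attribute specialists have already run and produced the inputs. The sketch derives simple 2D and 3D relative positions and object counts, then assembles a text prompt that a language model could turn into a caption.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Region:
    """One detected object plus attributes from upstream specialists (hypothetical type)."""
    label: str                               # fine-grained category
    box: Tuple[float, float, float, float]   # (x1, y1, x2, y2) in pixels
    depth: float                             # assumed convention: smaller = closer
    attributes: Dict[str, str] = field(default_factory=dict)  # e.g. emotion, OCR text


def relation_2d(a: Region, b: Region) -> str:
    """Left/right relation based on the horizontal centres of the two boxes."""
    a_cx = (a.box[0] + a.box[2]) / 2
    b_cx = (b.box[0] + b.box[2]) / 2
    return "left of" if a_cx < b_cx else "right of"


def relation_3d(a: Region, b: Region, margin: float = 0.5) -> str:
    """Front/behind relation from depth; `margin` is an illustrative threshold."""
    if abs(a.depth - b.depth) < margin:
        return "at a similar distance as"
    return "in front of" if a.depth < b.depth else "behind"


def count_objects(regions: List[Region]) -> Dict[str, int]:
    """Number of detected instances per category."""
    return dict(Counter(r.label for r in regions))


def build_prompt(regions: List[Region]) -> str:
    """Collect attributes, pairwise relations, and counts into one LLM prompt."""
    lines = ["Objects:"]
    for r in regions:
        attrs = ", ".join(f"{k}={v}" for k, v in sorted(r.attributes.items()))
        suffix = f", {attrs}" if attrs else ""
        lines.append(f"- {r.label} (depth={r.depth:.1f}{suffix})")
    lines.append("Relations:")
    for i, a in enumerate(regions):
        for b in regions[i + 1:]:
            lines.append(
                f"- {a.label} is {relation_2d(a, b)} and {relation_3d(a, b)} {b.label}"
            )
    counts = ", ".join(f"{n} x {label}" for label, n in count_objects(regions).items())
    lines.append(f"Counts: {counts}")
    lines.append("Write one detailed caption using only the facts above.")
    return "\n".join(lines)


if __name__ == "__main__":
    dog = Region("golden retriever", (10, 40, 110, 200), 2.0, {"emotion": "happy"})
    man = Region("man", (150, 20, 260, 220), 4.5)
    print(build_prompt([dog, man]))
```

Keeping the specialists' outputs as structured facts and leaving the wording to the language model is one way to let additional specialists be plugged in: a new specialist only needs to add an entry to `attributes` or a new relation line.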

Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception
