Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception

Yanpeng Sun, Jing Hao, Ke Zhu, Jiang-Jiang Liu, Yuxiang Zhao, Xiaofan Li, Gang Zhang, Zechao Li, Jingdong Wang
2024-12-19
Abstract: Training Large Multimodality Models (LMMs) relies on descriptive image captions that connect image and language. Existing methods either distill captions from LMMs or construct them from internet images or human annotation. We propose to leverage off-the-shelf visual specialists, which were originally trained on annotated images for tasks other than image captioning, to enhance image captions. Our approach, named DCE, explores object low-level and fine-grained attributes (e.g., depth, emotion, and fine-grained categories) and object relations (e.g., relative location and human-object interaction (HOI)), and combines the attributes into the descriptive caption. Experiments demonstrate that such visual specialists improve performance on visual understanding tasks as well as on reasoning that benefits from more accurate visual understanding. We will release the source code and the pipeline so that other visual specialists can be easily combined into the pipeline. The complete source code of the DCE pipeline and datasets will be available at https://github.com/syp2ysy/DCE.
Computer Vision and Pattern Recognition
### What problems does this paper attempt to solve?

This paper addresses the lack of detail and accuracy in the image captions used to train large-scale multimodal models (LMMs). Existing methods either distill captions from LMMs or construct them from Internet images or manual annotation, and these approaches have the following problems:

1. **High cost and poor scalability of manual annotation**: The COCO dataset, for example, provides high-quality captions, but its annotation cost is high and it is difficult to expand on a large scale.
2. **Limited comprehensiveness and accuracy of LMM-generated captions**: These captions scale well, but important objects and attributes may be ignored, and 3D spatial relationships may be missing. This matters most in tasks that require a comprehensive understanding of the scene structure.

To overcome these problems, the authors propose Descriptive Caption Enhancement (DCE), which enhances image captions by using existing visual specialist models. DCE generates more detailed and accurate captions by combining fine-grained object attributes (such as depth, emotion, and fine-grained categories) with relationships between objects (such as relative positions and human-object interactions).

### Specific objectives

1. **Improve the quality of image captions**: Introduce more fine-grained visual attributes and inter-object relationship information so that the generated captions are more detailed and accurate.
2. **Reduce costs and improve efficiency**: Use open-source visual specialist models and large language models (LLMs) to reduce the dependence on expensive manual annotation.
3. **Enhance multimodal perception ability**: Use the more detailed captions to improve LMM performance on visual understanding and reasoning tasks.

### Method overview

The main steps of DCE include:

- **Object localization**: Use a state-of-the-art open-world object detection model to localize objects.
- **Attribute extraction**: Extract object-level attributes (such as fine-grained categories, low-level attributes, and OCR) through multiple visual specialist models.
- **Relationship extraction**: Extract relationship information between objects, including human-object interactions, 2D and 3D relative position relationships, and object counts.
- **Caption generation**: Use a large language model to integrate the extracted object attributes and relationship information into detailed regional captions, and finally generate a complete image caption.

In this way, DCE can significantly improve the quality and detail of image captions without incurring high annotation costs, and thus better support multimodal perception tasks.
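
The relationship-extraction and caption-generation steps above can be illustrated with a small Python sketch. It is a minimal illustration under stated assumptions, not the DCE implementation: the `Region` structure, the helper names, the center-based left/right rule, the depth tolerance, and the prompt layout are all hypothetical, and the detector and specialist outputs are hard-coded stand-ins for what real models would return.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Region:
    """One detected object plus the attributes specialists reported for it.

    Hypothetical structure for illustration; not taken from the DCE code.
    """
    label: str
    box: tuple      # (x_min, y_min, x_max, y_max) in pixels
    depth: float    # assumed mean depth of the region; smaller means closer
    attributes: dict = field(default_factory=dict)


def center_x(region):
    """Horizontal center of the bounding box."""
    return (region.box[0] + region.box[2]) / 2


def relation_2d(a, b):
    """Left/right relation of a to b, judged by box centers (a simplification)."""
    return "left of" if center_x(a) < center_x(b) else "right of"


def relation_3d(a, b, tolerance=0.5):
    """Front/behind relation of a to b from depth; tolerance is an assumed value."""
    if abs(a.depth - b.depth) <= tolerance:
        return "at a similar distance as"
    return "in front of" if a.depth < b.depth else "behind"


def count_objects(regions):
    """Number of detected objects per label."""
    return dict(Counter(r.label for r in regions))


def build_prompt(regions):
    """Collect attributes, pairwise relations, and counts into one LLM prompt."""
    lines = ["Objects:"]
    for i, r in enumerate(regions):
        attrs = ", ".join(f"{k}={v}" for k, v in sorted(r.attributes.items()))
        lines.append(f"- [{i}] {r.label} ({attrs})" if attrs else f"- [{i}] {r.label}")
    lines.append("Relations:")
    for i, a in enumerate(regions):
        for j in range(i + 1, len(regions)):
            b = regions[j]
            lines.append(
                f"- [{i}] {a.label} is {relation_2d(a, b)} "
                f"and {relation_3d(a, b)} [{j}] {b.label}"
            )
    lines.append("Counts:")
    for label, n in sorted(count_objects(regions).items()):
        lines.append(f"- {label}: {n}")
    lines.append("Write one detailed caption that uses only the facts above.")
    return "\n".join(lines)


# Hard-coded stand-ins for detector and specialist outputs.
regions = [
    Region("person", (40, 60, 180, 400), depth=2.0, attributes={"emotion": "happy"}),
    Region("dog", (220, 250, 380, 410), depth=3.5, attributes={"breed": "corgi"}),
]
prompt = build_prompt(regions)
print(prompt)
```

The resulting string would then be passed to a large language model of choice. Keeping the specialist facts as explicit, indexed lines is one way to let the model merge them into a caption without having to infer geometry itself.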