Abstract: Object description plays an important role in helping visually impaired individuals understand and compare the differences between objects. Recent multimodal large language models (MLLMs) exhibit powerful perceptual abilities and demonstrate impressive potential for generating object-centric captions. However, the descriptions generated by such models often still contain a lot of content that is not relevant to the user's intent. In some scenarios, users may only need the details of certain dimensions of an object. In this paper, we propose a training-free captioning refinement pipeline, Dimension Tailor, designed to enhance user-specified details in object descriptions. The pipeline includes three steps: dimension extracting, erasing, and supplementing, which decompose the description into predefined dimensions and align it with user intent. As a result, it can not only improve the quality of object details but also offer flexibility in including or excluding specific dimensions based on user preferences. We conducted extensive experiments to demonstrate the effectiveness of Dimension Tailor on controllable object descriptions. Notably, the proposed pipeline consistently improves the performance of recent MLLMs. The code is currently accessible at the following anonymous link: https://github.com/xin-ran-w/ControllableObjectDescription.

What problem does this paper attempt to address?

This paper asks how to make the object descriptions generated by multimodal large language models (MLLMs) match users' intentions more closely, especially when users only need detailed information in specific dimensions, while avoiding redundant or irrelevant content. To that end, it proposes a training-free description refinement pipeline, Dimension Tailor, which enhances the object details specified by users and offers the flexibility to include or exclude specific dimensions according to users' preferences.

### Main Problems and Solutions

1. **Problem Description**:
   - The object descriptions generated by existing MLLMs may contain redundant or irrelevant content.
   - In specific scenarios, users may only need detailed information in certain dimensions, but existing models struggle to control these dimensions precisely.
2. **Solution**:
   - A training-free description refinement pipeline named Dimension Tailor is proposed.
   - The pipeline includes three main steps: dimension extracting, dimension erasing, and dimension supplementing.
   - Through these three steps, long descriptions are decomposed into predefined dimensions and the content is adjusted according to users' intentions, improving the quality and controllability of the description.

### Specific Implementation

- **Dimension Extracting**: Extract each predefined dimension from the generated description and format it into a structured tuple.
- **Dimension Erasing**: Delete incorrect concepts or irrelevant dimensions so that the description is aligned with users' intentions.
- **Dimension Supplementing**: Add missing dimensions according to users' intentions so that the description is complete and accurate.

### Experimental Verification

The paper verifies the effectiveness of Dimension Tailor through extensive experiments, focusing on its performance in the controllable object description task. The experimental results show that the method improves the description quality of MLLMs and brings it more in line with users' needs.

### Summary

The main contribution of this paper is a training-free description refinement pipeline, Dimension Tailor, which aligns the generated description with users' intentions and improves the performance of MLLMs in the controllable object description task. In addition, three evaluation metrics are designed to evaluate the controllability of object descriptions.
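The three steps described above (extracting, erasing, supplementing) can be illustrated with a minimal sketch. This is not the authors' implementation: the dimension list, the keyword vocabulary, and the `verify` and `query` callbacks are all hypothetical stand-ins for what the summary describes as model-driven steps, and the tuple layout `(object, dimension, value)` is an assumption about what a "structured tuple" might look like.

```python
from typing import Callable, List, Optional, Set, Tuple

# Hypothetical predefined dimensions; the paper's actual list is not given here.
DIMENSIONS = ("color", "material", "shape")

# Toy keyword vocabulary standing in for a model-based extractor (assumption).
VOCAB = {
    "color": {"red", "blue", "green", "white"},
    "material": {"wooden", "metal", "plastic", "ceramic"},
    "shape": {"round", "square", "tall"},
}

DimTuple = Tuple[str, str, str]  # (object, dimension, value)


def extract(obj: str, description: str) -> List[DimTuple]:
    """Step 1: decompose a description into (object, dimension, value) tuples."""
    words = description.lower().replace(",", " ").replace(".", " ").split()
    tuples: List[DimTuple] = []
    for word in words:
        for dim, vocab in VOCAB.items():
            if word in vocab:
                tuples.append((obj, dim, word))
    return tuples


def erase(
    tuples: List[DimTuple],
    wanted: Set[str],
    verify: Callable[[DimTuple], bool],
) -> List[DimTuple]:
    """Step 2: drop tuples in unwanted dimensions or that fail verification."""
    return [t for t in tuples if t[1] in wanted and verify(t)]


def supplement(
    obj: str,
    tuples: List[DimTuple],
    wanted: Set[str],
    query: Callable[[str, str], Optional[str]],
) -> List[DimTuple]:
    """Step 3: fill in wanted dimensions that are still missing."""
    present = {t[1] for t in tuples}
    result = list(tuples)
    for dim in DIMENSIONS:
        if dim in wanted and dim not in present:
            value = query(obj, dim)  # hypothetical call to an MLLM
            if value is not None:
                result.append((obj, dim, value))
    return result


def tailor(obj, description, wanted, verify, query) -> List[DimTuple]:
    """Run extracting, erasing, and supplementing in order."""
    tuples = extract(obj, description)
    tuples = erase(tuples, wanted, verify)
    return supplement(obj, tuples, wanted, query)


# Stub "visual facts" standing in for checks against the image (assumption).
truth = {"color": "red", "material": "ceramic", "shape": "round"}
result = tailor(
    "mug",
    "A blue, round mug on a wooden table.",
    wanted={"color", "shape"},
    verify=lambda t: truth.get(t[1]) == t[2],
    query=lambda obj, dim: truth.get(dim),
)
print(result)
```

In this toy run the incorrect color and the unrequested material are erased, the correct shape is kept, and the missing color is supplemented from the stub. In a real pipeline each callback would presumably be backed by a model query rather than a lookup table.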

Detailed Object Description with Controllable Dimensions
