Detailed Object Description with Controllable Dimensions

Xinran Wang, Haiwen Zhang, Baoteng Li, Kongming Liang, Hao Sun, Zhongjiang He, Zhanyu Ma, Jun Guo
2024-11-28
Abstract: Object description plays an important role in helping visually impaired individuals understand and compare objects. Recent multimodal large language models (MLLMs) exhibit powerful perceptual abilities and demonstrate impressive potential for generating object-centric captions. However, the descriptions generated by such models often still contain much content that is irrelevant to the user's intent. In some scenarios, users may only need the details of certain dimensions of an object. In this paper, we propose a training-free captioning refinement pipeline, Dimension Tailor, designed to enhance user-specified details in object descriptions. The pipeline includes three steps: dimension extracting, erasing, and supplementing, which decompose the description into predefined dimensions and align it with user intent. It can therefore not only improve the quality of object details but also offer flexibility in including or excluding specific dimensions based on user preferences. We conducted extensive experiments to demonstrate the effectiveness of Dimension Tailor on controllable object descriptions. Notably, the proposed pipeline consistently improves the performance of recent MLLMs. The code is currently accessible at the following link: https://github.com/xin-ran-w/ControllableObjectDescription.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses how to make the object descriptions generated by Multimodal Large Language Models (MLLMs) better match users' intentions, especially when users only need detailed information in specific dimensions, while avoiding redundant or irrelevant content. Existing MLLMs may include a large amount of content irrelevant to the user's intent when generating object descriptions, particularly in scenarios where users care only about certain dimensions of the object. The paper therefore proposes a training-free description refinement pipeline, Dimension Tailor, which aims to enhance the user-specified details of an object description and to offer the flexibility to include or exclude specific dimensions according to user preferences.

### Main Problems and Solutions

1. **Problem Description**:
   - The object descriptions generated by existing MLLMs may contain redundant or irrelevant content.
   - In specific scenarios, users may only need detailed information in certain dimensions, but existing models struggle to control these dimensions precisely.
2. **Solution**:
   - A training-free description refinement pipeline named Dimension Tailor is proposed.
   - The pipeline includes three main steps: dimension extracting, dimension erasing, and dimension supplementing.
   - Through these three steps, long descriptions are decomposed into predefined dimensions and the content is adjusted according to the user's intent, improving the quality and controllability of the description.

### Specific Implementation

- **Dimension Extracting**: Extract each predefined dimension from the generated description and format it as a structured tuple.
- **Dimension Erasing**: Delete incorrect concepts or irrelevant dimensions so that the description is aligned with the user's intent.
- **Dimension Supplementing**: Add missing dimensions according to the user's intent so that the description is complete and accurate.

### Experimental Verification

The paper verifies the effectiveness of Dimension Tailor through extensive experiments, focusing on its performance improvement in the controllable object description task. The experimental results show that the method can significantly improve the description quality of MLLMs and bring the output closer to users' needs.

### Summary

The main contribution of this paper is a training-free description refinement pipeline, Dimension Tailor, which aligns the generated description with the user's intent and improves the performance of MLLMs on the controllable object description task. In addition, three evaluation metrics are designed to comprehensively evaluate the controllability of object descriptions.
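
The three steps listed under Specific Implementation can be illustrated with a toy sketch. This is not the authors' implementation: the paper's pipeline works with MLLM-generated descriptions, whereas the sketch below assumes a hypothetical keyword lexicon (`LEXICON`) for extraction and a hypothetical `observed` dictionary standing in for whatever component verifies or supplies attribute values. All function names, dimension names, and the tuple layout are assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical dimension vocabulary; the paper's predefined dimensions may differ.
LEXICON: Dict[str, Set[str]] = {
    "color": {"red", "blue", "green", "black", "white"},
    "material": {"wooden", "metal", "plastic", "leather"},
    "shape": {"round", "square", "rectangular"},
}

# Assumed tuple layout: (object, dimension, value).
DimTuple = Tuple[str, str, str]


def extract(description: str, obj: str) -> List[DimTuple]:
    """Step 1: turn words of the description into structured tuples."""
    words = [w.strip(".,;").lower() for w in description.split()]
    return [
        (obj, dim, w)
        for w in words
        for dim, vocab in LEXICON.items()
        if w in vocab
    ]


def erase(
    tuples: List[DimTuple], wanted: Set[str], observed: Dict[str, str]
) -> List[DimTuple]:
    """Step 2: drop unwanted dimensions and values contradicted by `observed`."""
    kept = []
    for obj, dim, value in tuples:
        if dim not in wanted:
            continue  # irrelevant to the user's intent
        if dim in observed and observed[dim] != value:
            continue  # treated as an incorrect concept
        kept.append((obj, dim, value))
    return kept


def supplement(
    tuples: List[DimTuple], obj: str, wanted: Set[str], observed: Dict[str, str]
) -> List[DimTuple]:
    """Step 3: add wanted dimensions that are still missing."""
    present = {dim for _, dim, _ in tuples}
    added = [
        (obj, dim, observed[dim])
        for dim in sorted(wanted)
        if dim not in present and dim in observed
    ]
    return tuples + added


def tailor(
    description: str, obj: str, wanted: Set[str], observed: Dict[str, str]
) -> List[DimTuple]:
    """Run extract -> erase -> supplement in sequence."""
    kept = erase(extract(description, obj), wanted, observed)
    return supplement(kept, obj, wanted, observed)


result = tailor(
    "A red wooden chair with a round seat.",
    "chair",
    wanted={"color", "material"},
    observed={"color": "blue", "material": "wooden", "shape": "round"},
)
print(result)  # [('chair', 'material', 'wooden'), ('chair', 'color', 'blue')]
```

In this example the shape tuple is erased because the user did not ask for it, the color tuple is erased because it conflicts with the assumed observation, and the missing color dimension is then supplemented from `observed`.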