VoCoT: Unleashing Visually Grounded Multi-Step Reasoning in Large Multi-Modal Models

Zejun Li, Ruipu Luo, Jiwen Zhang, Minghui Qiu, Zhongyu Wei
2024-05-28
Abstract: While large multi-modal models (LMMs) have exhibited impressive capabilities across diverse tasks, their effectiveness in handling complex tasks has been limited by the prevailing single-step reasoning paradigm. To this end, this paper proposes VoCoT, a multi-step Visually grounded object-centric Chain-of-Thought reasoning framework tailored for inference with LMMs. VoCoT is characterized by two key features: (1) object-centric reasoning paths that revolve around cross-modal shared object-level information, and (2) visually grounded representation of object concepts in a multi-modal interleaved and aligned manner, which effectively bridges the modality gap within LMMs during long-term generation. Additionally, we construct an instruction dataset to facilitate LMMs in adapting to reasoning with VoCoT. By introducing VoCoT into the prevalent open-source LMM architecture, we introduce VolCano. With only 7B parameters and limited input resolution, VolCano demonstrates excellent performance across various scenarios, surpassing SOTA models, including GPT-4V, in tasks requiring complex reasoning. Our code, data and model will be available at https://github.com/RupertLuo/VoCoT.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper aims to address the limitations that current large multi-modal models (LMMs) face on complex tasks, in particular the limited effectiveness that results from the prevailing single-step reasoning paradigm. Specifically:

1. **Enhancing Multi-step Reasoning Ability**: The paper proposes VoCoT, an object-centric, visually grounded Chain-of-Thought (CoT) reasoning framework designed to strengthen the multi-step reasoning ability of LMMs in multi-modal settings.
2. **Visually Grounded Representation**: VoCoT represents each object as a tuple containing a text description, coordinates, and the corresponding visual representation, which supports cross-modal information fusion and alignment.
3. **Improving Interpretability**: Compared with single-step reasoning, multi-step reasoning exposes the problem-solving process, which makes the model's output easier to interpret.
4. **Reducing Hallucinations**: Because object mentions are visually grounded, the framework reduces the likelihood of generating erroneous information during reasoning, which makes the model's inferences more reliable.

In summary, the goal of this paper is to improve the ability of LMMs to handle tasks that require complex reasoning by introducing the VoCoT framework, and to improve the reliability and interpretability of their outputs.
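
To make the object tuple described above more concrete, here is a minimal Python sketch of one way such a tuple could be held in memory and rendered into a text reasoning step. It is an illustration only, not the authors' implementation: the class name `GroundedObject`, the field names, the normalized `(x_min, y_min, x_max, y_max)` box convention, the bracketed text format, and the plain list of floats standing in for a visual representation are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class GroundedObject:
    """Hypothetical container for one object mention in a reasoning path."""
    description: str                         # text description of the object
    box: Tuple[float, float, float, float]   # assumed (x_min, y_min, x_max, y_max) in [0, 1]
    visual_feature: List[float]              # stand-in for the visual representation

    def is_valid(self) -> bool:
        """Check that the box is non-empty and lies inside the image."""
        x0, y0, x1, y1 = self.box
        return 0.0 <= x0 < x1 <= 1.0 and 0.0 <= y0 < y1 <= 1.0


def serialize(obj: GroundedObject) -> str:
    """Render the textual part of the tuple: description followed by coordinates."""
    coords = ", ".join(f"{v:.2f}" for v in obj.box)
    return f"{obj.description} [{coords}]"


def render_step(template: str, objects: List[GroundedObject]) -> str:
    """Fill numbered slots such as {0}, {1} in a reasoning step with grounded mentions."""
    return template.format(*(serialize(o) for o in objects))


cup = GroundedObject("red cup", (0.10, 0.20, 0.40, 0.60), [0.0] * 4)
table = GroundedObject("wooden table", (0.00, 0.50, 1.00, 1.00), [0.0] * 4)
step = render_step("The {0} is resting on the {1}.", [cup, table])
print(step)
```

In this sketch only the description and coordinates appear in the rendered text; in an actual model the visual representation would be interleaved into the input sequence alongside them, which a plain string cannot show.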