Abstract: A remarkable ability of human beings is compositional reasoning, i.e., the capacity to make "infinite use of finite means". However, current large vision-language foundation models (VLMs) fall short of such compositional abilities because of their "bag-of-words" behavior and their inability to construct words that correctly represent visual entities and the relations among them. To address this, we propose CoVLM, which guides the LLM to explicitly compose visual entities and relationships in the text and to communicate dynamically with the vision encoder and detection network, achieving vision-language communicative decoding. Specifically, we first devise a set of novel communication tokens for the LLM that enable dynamic communication between the visual detection system and the language system. The LLM generates a communication token after a visual entity or a relation to tell the detection network to propose regions relevant to the sentence generated so far. The proposed regions of interest (ROIs) are then fed back into the LLM so that subsequent language generation is conditioned on the relevant regions. The LLM is thus able to compose visual entities and relationships through the communication tokens. Vision-to-language and language-to-vision communication is performed iteratively until the entire sentence is generated. Our framework bridges the gap between visual perception and LLMs and outperforms previous VLMs by a large margin on compositional reasoning benchmarks (e.g., ~20% in HICO-DET mAP, ~14% in Cola top-1 accuracy, and ~3% in ARO top-1 accuracy). We also achieve state-of-the-art performance on traditional vision-language tasks such as referring expression comprehension and visual question answering.
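The decoding loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the token name `<comm>`, the functions `communicative_decode`, `toy_next_token`, and `toy_propose_regions`, the `SCENE` table, and the box format are all hypothetical stand-ins chosen for illustration. The sketch only shows the control flow: the language side emits a communication token after an entity or relation, the detection side proposes regions for the sentence so far, and those regions are fed back into the next generation step.

```python
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # hypothetical (x1, y1, x2, y2) format
COMM = "<comm>"  # hypothetical name for a communication token
EOS = "<eos>"

def communicative_decode(
    next_token: Callable[[List[str], List[Box]], str],
    propose_regions: Callable[[str], List[Box]],
    max_steps: int = 32,
) -> Tuple[str, List[Tuple[str, List[Box]]]]:
    """Alternate language generation and region proposal until EOS."""
    tokens: List[str] = []
    rois: List[Box] = []
    trace: List[Tuple[str, List[Box]]] = []
    for _ in range(max_steps):
        # Vision-to-language: the latest ROIs condition the next token.
        tok = next_token(tokens, rois)
        if tok == EOS:
            break
        tokens.append(tok)
        if tok == COMM:
            # Language-to-vision: query the detector with the sentence so far.
            text = " ".join(t for t in tokens if t != COMM)
            rois = propose_regions(text)
            trace.append((text, rois))
    sentence = " ".join(t for t in tokens if t != COMM)
    return sentence, trace

# Toy scene standing in for an image: entity name -> bounding box (assumption).
SCENE = {"man": (10.0, 20.0, 110.0, 220.0), "horse": (40.0, 90.0, 300.0, 260.0)}

def toy_propose_regions(text: str) -> List[Box]:
    """Stand-in detector: ground an entity, or propose an object for a relation."""
    last = text.split()[-1]
    if last in SCENE:
        return [SCENE[last]]
    if last == "riding":
        return [SCENE["horse"]]
    return []

def toy_next_token(tokens: List[str], rois: List[Box]) -> str:
    """Stand-in LLM: a hand-written policy, not a real language model."""
    words = [t for t in tokens if t != COMM]
    # Emit a communication token right after an entity or a relation.
    if tokens and (tokens[-1] in SCENE or tokens[-1] == "riding"):
        return COMM
    if not words:
        return "a"
    if words == ["a"]:
        return "man"
    if words[-1] == "man":
        return "riding"
    if words[-1] == "riding":
        return "a"
    if words[-1] == "a":
        # Name the object found in the fed-back region.
        return next(name for name, box in SCENE.items() if box in rois)
    return EOS

sentence, trace = communicative_decode(toy_next_token, toy_propose_regions)
print(sentence)  # a man riding a horse
for text, boxes in trace:
    print(text, boxes)
```

In this toy run the final noun is chosen only after the relation "riding" has triggered a region proposal, which mirrors the idea that generation is conditioned on the regions returned by the detection side.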

Empowering Vision-Language Models for Reasoning Ability Through Large Language Models

Enhance Reasoning Ability of Visual-Language Models via Large Language Models

Enhancing Advanced Visual Reasoning Ability of Large Language Models

Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models

Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning

Mind's Eye of LLMs: Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models

Vision-Language Models Can Self-Improve Reasoning Via Reflection

LLaVA-o1: Let Vision Language Models Reason Step-by-Step

Large Language Models are Visual Reasoning Coordinators

ProReason: Multi-Modal Proactive Reasoning with Decoupled Eyesight and Wisdom

Advancing Large Multi-modal Models with Explicit Chain-of-Reasoning and Visual Question Generation

IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models

How Far Are We from Intelligent Visual Deductive Reasoning?

Empowering MultiModal Models' In-Context Learning Ability through Large Language Models

Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models

Filling the Image Information Gap for VQA: Prompting Large Language Models to Proactively Ask Questions

Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models

LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

Improve Vision Language Model Chain-of-thought Reasoning

CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding