Abstract: Visual language tasks require AI models to comprehend and reason with both visual and textual content. Driven by the power of Large Language Models (LLMs), two prominent methods have emerged: (1) the hybrid integration of LLMs and Vision-Language Models (VLMs), where VLMs first convert visual inputs into language descriptions that serve as inputs for LLMs to generate the final answer(s); (2) visual feature alignment in language space, where visual inputs are encoded as embeddings and projected into the LLMs' language space via further supervised fine-tuning. The first approach offers low training costs and interpretability but is hard to optimize in an end-to-end fashion. The second approach delivers decent performance, but feature alignment usually requires large amounts of training data and lacks interpretability. To tackle this dilemma, we propose a novel approach, Inner Monologue Multi-Modal Optimization (IMMO), to solve complex vision language problems by simulating inner monologue, a cognitive process in which an individual engages in silent verbal communication with themselves. We enable LLMs and VLMs to interact through natural language conversation and propose a two-stage training process to learn how to conduct the inner monologue (self-asking questions and answering them). IMMO is evaluated on two popular tasks, and the results suggest that, by emulating the cognitive phenomenon of internal dialogue, our approach can enhance reasoning and explanation abilities, contributing to a more effective fusion of vision and language models. More importantly, instead of using predefined human-crafted monologues, IMMO learns this process within the deep learning models, promising wider applicability to many different AI problems beyond vision language tasks.
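The interaction described in the abstract, an LLM asking itself questions and a VLM answering them in natural language, can be illustrated with a minimal sketch. This is not the authors' implementation: the function `run_inner_monologue`, its parameters, and the toy stand-ins for the two models are all hypothetical names chosen for illustration, and the two-stage training process is not shown.

```python
from typing import Callable, List, Optional, Tuple

Turn = Tuple[str, str]  # (self-asked question, answer from the vision side)


def run_inner_monologue(
    task_question: str,
    propose_question: Callable[[str, List[Turn]], Optional[str]],  # hypothetical LLM role
    describe_image: Callable[[str], str],                          # hypothetical VLM role
    give_answer: Callable[[str, List[Turn]], str],                 # hypothetical LLM role
    max_turns: int = 3,
) -> Tuple[str, List[Turn]]:
    """Alternate self-asked questions and visual answers, then answer the task."""
    monologue: List[Turn] = []
    for _ in range(max_turns):
        sub_question = propose_question(task_question, monologue)
        if sub_question is None:  # the "LLM" decides it has enough information
            break
        monologue.append((sub_question, describe_image(sub_question)))
    return give_answer(task_question, monologue), monologue


# Toy stand-ins so the sketch runs without any model; real systems would
# call an LLM and a VLM here.
def toy_propose(task: str, history: List[Turn]) -> Optional[str]:
    pending = ["What objects are visible?", "What color is the cup?"]
    return pending[len(history)] if len(history) < len(pending) else None


def toy_vlm(question: str) -> str:
    facts = {
        "What objects are visible?": "a cup on a table",
        "What color is the cup?": "red",
    }
    return facts.get(question, "unknown")


def toy_answer(task: str, history: List[Turn]) -> str:
    # Assumption for the toy: the last visual answer resolves the task.
    return history[-1][1] if history else "unknown"


answer, trace = run_inner_monologue(
    "What color is the cup?", toy_propose, toy_vlm, toy_answer
)
```

Because every exchange is kept in `trace` as plain text, the intermediate questions and answers stay readable, which is the interpretability property the abstract attributes to language-based interaction.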

Vision Language Models See What You Want but not What You See

Inferring Human Vision in a Human-Like Way: Key Factors Influencing the Cognitive Processing of Level-1 Visual Perspective-Taking

Analyzing the Roles of Language and Vision in Learning from Limited Data

Think Twice: Perspective-Taking Improves Large Language Models' Theory-of-Mind Capabilities

Visual cognition in multimodal large language models

Grounding Visual Illusions in Language: Do Vision-Language Models Perceive Illusions Like Humans?

Lost in Translation: When GPT-4V(ision) Can't See Eye to Eye with Text. A Vision-Language-Consistency Analysis of VLLMs and Beyond

Seeing Through Their Eyes: Evaluating Visual Perspective Taking in Vision Language Models

Vision-Language Intelligence: Tasks, Representation Learning, and Large Models

Tackling Vision Language Tasks Through Learning Inner Monologues

Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models

MVP-Bench: Can Large Vision-Language Models Conduct Multi-level Visual Perception Like Humans?

InsightSee: Advancing Multi-agent Vision-Language Models for Enhanced Visual Understanding

Response Wide Shut: Surprising Observations in Basic Vision Language Model Capabilities

EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Language Models

Understanding the Limits of Vision Language Models Through the Lens of the Binding Problem

Can 3D Vision-Language Models Truly Understand Natural Language?

Can Language Models Understand Physical Concepts?

An Introduction to Vision-Language Modeling

Are VLMs Really Blind

A Survey on Vision-Language-Action Models for Embodied AI