Abstract: Visual language tasks require AI models to comprehend and reason with both visual and textual content. Driven by the power of Large Language Models (LLMs), two prominent methods have emerged: (1) hybrid integration of LLMs and Vision-Language Models (VLMs), where visual inputs are first converted into language descriptions by VLMs and then serve as inputs for LLMs to generate the final answer(s); (2) visual feature alignment in language space, where visual inputs are encoded as embeddings and projected into the LLMs' language space via further supervised fine-tuning. The first approach has low training cost and offers interpretability but is hard to optimize in an end-to-end fashion. The second approach achieves decent performance, but feature alignment usually requires large amounts of training data and lacks interpretability. To tackle this dilemma, we propose a novel approach, Inner Monologue Multi-Modal Optimization (IMMO), which solves complex vision language problems by simulating inner monologue, a cognitive process in which an individual engages in silent verbal communication with themselves. We enable LLMs and VLMs to interact through natural language conversation and propose a two-stage training process for learning how to conduct the inner monologue (self-asking and answering questions). IMMO is evaluated on two popular tasks, and the results suggest that, by emulating the cognitive phenomenon of internal dialogue, our approach can enhance reasoning and explanation abilities, contributing to a more effective fusion of vision and language models. More importantly, instead of using predefined human-crafted monologues, IMMO learns this process within the deep learning models, promising wider applicability to many AI problems beyond vision language tasks.
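The interaction described in the abstract, in which one model asks itself questions and another answers them from the image, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `ANSWER:` stopping convention, the turn budget, and the toy stand-ins for the LLM and VLM are all assumptions made for illustration, and the two-stage training process is not shown.

```python
from typing import Callable, List, Tuple

# Hypothetical convention: the reasoner marks its final answer with this prefix.
FINAL_PREFIX = "ANSWER:"

Dialogue = List[Tuple[str, str]]  # (self-asked question, observer's answer) pairs
Reasoner = Callable[[str, Dialogue, bool], str]  # stands in for the LLM
Observer = Callable[[object, str], str]          # stands in for the VLM


def strip_final(message: str) -> str:
    """Remove the final-answer prefix if present."""
    if message.startswith(FINAL_PREFIX):
        return message[len(FINAL_PREFIX):].strip()
    return message.strip()


def inner_monologue(image, task: str, reasoner: Reasoner, observer: Observer,
                    max_turns: int = 3) -> Tuple[str, Dialogue]:
    """Alternate between self-asked questions and image-grounded answers."""
    dialogue: Dialogue = []
    for _ in range(max_turns):
        message = reasoner(task, dialogue, False)
        if message.startswith(FINAL_PREFIX):
            return strip_final(message), dialogue
        # Otherwise the message is a question about the image.
        dialogue.append((message, observer(image, message)))
    # Turn budget exhausted: ask the reasoner to commit to an answer.
    return strip_final(reasoner(task, dialogue, True)), dialogue


# Toy stand-ins so the sketch runs without any real model.
def toy_reasoner(task: str, dialogue: Dialogue, force_final: bool) -> str:
    questions = ["What objects are visible?", "What color is the main object?"]
    if force_final or len(dialogue) >= len(questions):
        facts = "; ".join(answer for _, answer in dialogue)
        return f"{FINAL_PREFIX} based on [{facts}]"
    return questions[len(dialogue)]


def toy_observer(image: dict, question: str) -> str:
    # The "image" here is just a lookup table of canned answers.
    return image.get(question, "unknown")


toy_image = {
    "What objects are visible?": "a cat",
    "What color is the main object?": "black",
}
answer, trace = inner_monologue(toy_image, "Describe the pet.",
                                toy_reasoner, toy_observer)
```

The returned `trace` keeps every question and answer in natural language, which is the kind of readable intermediate record the abstract associates with interpretability. In a real system the two callables would wrap an actual LLM and VLM.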

Visual Reasoning with Multi-hop Feature Modulation

FiLM: Visual Reasoning with a General Conditioning Layer

From FiLM to Video: Multi-turn Question Answering with Multi-modal Context

A Multi-Modal Context Reasoning Approach for Conditional Inference on Joint Textual and Visual Clues

Multi-Stage Image-Language Cross-Generative Fusion Network for Video-Based Referring Expression Comprehension

DMRM: A Dual-Channel Multi-Hop Reasoning Model for Visual Dialog

CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion

VidCoM: Fast Video Comprehension through Large Language Models with Multimodal Tools

Generating Images with Multimodal Language Models

Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models

Collaborative Reasoning on Multi-Modal Semantic Graphs for Video-Grounded Dialogue Generation

FILS: Self-Supervised Video Feature Prediction In Semantic Language Space

Improving Visual Commonsense in Language Models via Multiple Image Generation

Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models

Tackling Vision Language Tasks Through Learning Inner Monologues

HVLM: Exploring Human-Like Visual Cognition and Language-Memory Network for Visual Dialog

HyViLM: Enhancing Fine-Grained Recognition with a Hybrid Encoder for Vision-Language Models

Interleaved-Modal Chain-of-Thought

Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training

EVLM: An Efficient Vision-Language Model for Visual Understanding