Abstract: Visual language tasks require AI models to comprehend and reason with both visual and textual content. Driven by the power of Large Language Models (LLMs), two prominent methods have emerged: (1) the hybrid integration of LLMs and Vision-Language Models (VLMs), where visual inputs are first converted into language descriptions by VLMs and then serve as inputs for LLMs to generate the final answer(s); (2) visual feature alignment in language space, where visual inputs are encoded as embeddings and projected into the LLM's language space via further supervised fine-tuning. The first approach incurs low training costs and offers interpretability, but is hard to optimize in an end-to-end fashion. The second approach achieves decent performance, but feature alignment usually requires large amounts of training data and lacks interpretability. To tackle this dilemma, we propose a novel approach, Inner Monologue Multi-Modal Optimization (IMMO), which solves complex vision language problems by simulating inner monologue, a cognitive process in which an individual engages in silent verbal communication with themselves. We enable LLMs and VLMs to interact through natural language conversation and propose a two-stage training process to learn how to conduct the inner monologue (self-asking and answering questions). IMMO is evaluated on two popular tasks, and the results suggest that, by emulating the cognitive phenomenon of internal dialogue, our approach can enhance reasoning and explanation abilities, contributing to a more effective fusion of vision and language models. More importantly, instead of using predefined human-crafted monologues, IMMO learns this process within the deep learning models, promising wider applicability to many AI problems beyond vision language tasks.

Hypothesis, Verification, and Induction: Grounding Large Language Models with Self-Driven Skill Learning

Self-driven Grounding: Large Language Model Agents with Automatical Language-aligned Skill Learning

Grounding Language for Robotic Manipulation via Skill Library

Learning Reward for Robot Skills Using Large Language Models via Self-Alignment

Grounding Large Language Models In Embodied Environment With Imperfect World Models

Grounding Language with Visual Affordances over Unstructured Data

Learning to Program with Natural Language

Supervised Knowledge Makes Large Language Models Better In-context Learners

Can Large Language Models Invent Algorithms to Improve Themselves?

Large Language Models are reasoners with Self-Verification

Long-horizon Locomotion and Manipulation on a Quadrupedal Robot with Large Language Models

SELF: Self-Evolution with Language Feedback

Introspective Tips: Large Language Model for In-Context Decision Making

Drive Like a Human: Rethinking Autonomous Driving with Large Language Models

Large Language Models for Autonomous Driving (LLM4AD): Concept, Benchmark, Simulation, and Real-Vehicle Experiment

Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents

Self-Training Large Language Models for Improved Visual Program Synthesis With Visual Reinforcement

LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent

Hypothesis Search: Inductive Reasoning with Language Models

Tackling Vision Language Tasks Through Learning Inner Monologues

Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models