Abstract: Visual language tasks require AI models to comprehend and reason over both visual and textual content. Driven by the power of Large Language Models (LLMs), two prominent methods have emerged: (1) hybrid integration of LLMs and Vision-Language Models (VLMs), where VLMs first convert visual inputs into language descriptions that serve as inputs for LLMs to generate the final answer(s); (2) visual feature alignment in language space, where visual inputs are encoded as embeddings and projected into LLMs' language space via further supervised fine-tuning. The first approach has low training costs and offers interpretability but is hard to optimize end to end. The second approach achieves decent performance, but feature alignment usually requires large amounts of training data and lacks interpretability. To tackle this dilemma, we propose a novel approach, Inner Monologue Multi-Modal Optimization (IMMO), which solves complex vision language problems by simulating inner monologue, a cognitive process in which an individual engages in silent verbal communication with themselves. We enable LLMs and VLMs to interact through natural language conversation and propose a two-stage training process for learning how to conduct the inner monologue (self-asking and answering questions). IMMO is evaluated on two popular tasks, and the results suggest that, by emulating the cognitive phenomenon of internal dialogue, our approach can enhance reasoning and explanation abilities, contributing to a more effective fusion of vision and language models. More importantly, instead of using predefined human-crafted monologues, IMMO learns this process within the deep learning models, promising wider applicability to many different AI problems beyond vision language tasks.
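
The conversational interaction described in the abstract can be pictured with a minimal sketch. This is not the authors' implementation: the two-stage training is omitted entirely, and every name here (`inner_monologue`, `Reasoner`, `Observer`, the `ANSWER:` convention, the toy stubs) is a hypothetical stand-in chosen for illustration. The sketch only shows the control flow of a reasoner (playing the LLM role) asking questions in natural language and an observer (playing the VLM role) answering them until the reasoner commits to a final answer.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions, not taken from the paper):
# a reasoner maps (task, dialogue so far) to either a new question
# or a final answer marked with ANSWER_PREFIX; an observer maps
# (image id, question) to a textual answer about the image.
Dialogue = List[Tuple[str, str]]
Reasoner = Callable[[str, Dialogue], str]
Observer = Callable[[str, str], str]

ANSWER_PREFIX = "ANSWER:"


def strip_prefix(message: str) -> str:
    """Remove the final-answer marker if present."""
    if message.startswith(ANSWER_PREFIX):
        return message[len(ANSWER_PREFIX):].strip()
    return message.strip()


def inner_monologue(
    task: str,
    image_id: str,
    reasoner: Reasoner,
    observer: Observer,
    max_turns: int = 3,
) -> Tuple[str, Dialogue]:
    """Alternate self-asked questions and answers, then return the answer."""
    dialogue: Dialogue = []
    for _ in range(max_turns):
        message = reasoner(task, dialogue)
        if message.startswith(ANSWER_PREFIX):
            return strip_prefix(message), dialogue
        # The message is a question: let the observer answer it.
        dialogue.append((message, observer(image_id, message)))
    # Turn budget exhausted: ask the reasoner once more for its best answer.
    return strip_prefix(reasoner(task, dialogue)), dialogue


# Toy stand-ins for the LLM and VLM so the sketch runs without any model.
def toy_reasoner(task: str, dialogue: Dialogue) -> str:
    if not dialogue:
        return "What objects are on the table?"
    return f"{ANSWER_PREFIX} {dialogue[-1][1]}"


def toy_observer(image_id: str, question: str) -> str:
    return "a red mug and a laptop"


answer, trace = inner_monologue(
    "What is on the table?", "img_001", toy_reasoner, toy_observer
)
```

The returned `trace` is the list of question and answer pairs, which is where the interpretability mentioned in the abstract would come from in a design like this: the intermediate dialogue is plain text that can be inspected.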

Internship: probing joint vision-and-language representations

Neuro-Vision to Language: Enhancing Brain Recording-based Visual Reconstruction and Language Interaction

Revealing Vision-Language Integration in the Brain with Multimodal Networks

INTERN: A New Learning Paradigm Towards General Vision

12-in-1: Multi-Task Vision and Language Representation Learning

Tackling Vision Language Tasks Through Learning Inner Monologues

Multimodal 3D Fusion and In-Situ Learning for Spatially Aware AI

Superpixel Semantics Representation and Pre-training for Vision-Language Task

Mutual influence between language and perception in multi-agent communication games

Neural Implicit Vision-Language Feature Fields

Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models

From Unimodal to Multimodal: Scaling up Projectors to Align Modalities

An Introduction to Vision-Language Modeling

Grounded Language Acquisition From Object and Action Imagery

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

Modeling Collaborator: Enabling Subjective Vision Classification With Minimal Human Effort via LLM Tool-Use

Expanding Frozen Vision-Language Models without Retraining: Towards Improved Robot Perception

Synthesis of Vision and Language: Multifaceted Image Captioning Application

Explainable Semantic Space by Grounding Language to Vision with Cross-Modal Contrastive Learning

Can Linguistic Knowledge Improve Multimodal Alignment in Vision-Language Pretraining?

Evaluating the Representational Hub of Language and Vision Models