Abstract: Visual language tasks require AI models to comprehend and reason with both visual and textual content. Driven by the power of Large Language Models (LLMs), two prominent methods have emerged: (1) the hybrid integration of LLMs and Vision-Language Models (VLMs), where visual inputs are first converted into language descriptions by VLMs, which then serve as inputs for LLMs to generate the final answer(s); (2) visual feature alignment in language space, where visual inputs are encoded as embeddings and projected into the LLMs' language space via further supervised fine-tuning. The first approach has low training costs and offers interpretability, but is hard to optimize in an end-to-end fashion. The second approach achieves decent performance, but feature alignment usually requires large amounts of training data and lacks interpretability. To tackle this dilemma, we propose a novel approach, Inner Monologue Multi-Modal Optimization (IMMO), to solve complex vision language problems by simulating inner monologue, a cognitive process in which an individual engages in silent verbal communication with themselves. We enable LLMs and VLMs to interact through natural language conversation and propose a two-stage training process to learn how to conduct the inner monologue (self-asking questions and answering them). IMMO is evaluated on two popular tasks, and the results suggest that, by emulating the cognitive phenomenon of internal dialogue, our approach can enhance reasoning and explanation abilities, contributing to a more effective fusion of vision and language models. More importantly, instead of using predefined human-crafted monologues, IMMO learns this process within the deep learning models, promising wider applicability to many different AI problems beyond vision language tasks.
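
The interaction described in the abstract, in which a language model asks itself questions about the image and a vision-language model answers them in natural language, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `inner_monologue`, the `reasoner` and `observer` interfaces, the `ANSWER:` prefix convention, and the `max_turns` limit are all assumptions made for illustration, and the two-stage training process is not shown. The toy stand-ins only make the loop runnable without any model.

```python
from typing import Callable, List, Tuple

Dialogue = List[Tuple[str, str]]  # (self-asked question, observer's answer)

# Hypothetical interfaces (assumptions, not taken from the paper):
#   reasoner(task_question, dialogue, must_answer) -> next question, or "ANSWER: <text>"
#   observer(image_ref, question) -> natural-language answer about the image
Reasoner = Callable[[str, Dialogue, bool], str]
Observer = Callable[[str, str], str]

ANSWER_PREFIX = "ANSWER:"  # assumed convention for signalling a final answer


def inner_monologue(task_question: str, image_ref: str,
                    reasoner: Reasoner, observer: Observer,
                    max_turns: int = 3) -> Tuple[str, Dialogue]:
    """Alternate self-asked questions and visual answers until a final answer."""
    dialogue: Dialogue = []
    for _ in range(max_turns):
        step = reasoner(task_question, dialogue, False)
        if step.startswith(ANSWER_PREFIX):
            return step[len(ANSWER_PREFIX):].strip(), dialogue
        # Otherwise the step is a question; ask the observer and record the exchange.
        dialogue.append((step, observer(image_ref, step)))
    # Turn budget used up: require the reasoner to commit to an answer.
    step = reasoner(task_question, dialogue, True)
    if step.startswith(ANSWER_PREFIX):
        step = step[len(ANSWER_PREFIX):]
    return step.strip(), dialogue


# Toy stand-ins so the loop can run without any model (purely illustrative).
def toy_reasoner(task_question: str, dialogue: Dialogue, must_answer: bool) -> str:
    if not dialogue and not must_answer:
        return "What objects are on the table?"
    facts = " ".join(answer for _, answer in dialogue)
    return "ANSWER: yes" if "cup" in facts else "ANSWER: no"


def toy_observer(image_ref: str, question: str) -> str:
    return "A cup and a book."


answer, trace = inner_monologue("Is there a cup?", "img.jpg",
                                toy_reasoner, toy_observer)
```

The returned `trace` keeps the full question-and-answer exchange in plain text, which is where the interpretability claimed for conversation-based approaches would come from in a setup like this.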

Voila-A: Aligning Vision-Language Models with User's Gaze Attention

An Accuracy Enhanced Vision Language Grounding Method Fused with Gaze Intention

G-VOILA: Gaze-Facilitated Information Querying in Daily Scenarios

VisuaLizations As Intermediate Representations (VLAIR): an Approach for Applying Deep Learning-Based Computer Vision to Non-Image-based Data

A-VL: Adaptive Attention for Large Vision-Language Models

Unveiling Visual Perception in Language Models: An Attention Head Analysis Approach

Seeing Eye to AI: Human Alignment via Gaze-Based Response Rewards for Large Language Models

Exploring the Zero-Shot Capabilities of Vision-Language Models for Improving Gaze Following

VLM-HOI: Vision Language Models for Interpretable Human-Object Interaction Analysis

FGAIF: Aligning Large Vision-Language Models with Fine-grained AI Feedback

VLFeedback: A Large-Scale AI Feedback Dataset for Large Vision-Language Models Alignment

LG-Gaze: Learning Geometry-aware Continuous Prompts for Language-Guided Gaze Estimation

Fine-Grained Verifiers: Preference Modeling as Next-token Prediction in Vision-Language Alignment

VIALM: A Survey and Benchmark of Visually Impaired Assistance with Large Models

Expanding Frozen Vision-Language Models without Retraining: Towards Improved Robot Perception

Divert More Attention to Vision-Language Object Tracking

RoVRM: A Robust Visual Reward Model Optimized via Auxiliary Textual Preference Data

Enhancing Human-Computer Interaction in Chest X-ray Analysis using Vision and Language Model with Eye Gaze Patterns

Vision and Language Navigation Using Multi-head Attention Mechanism

Tackling Vision Language Tasks Through Learning Inner Monologues

ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling