Abstract: Visual language tasks require AI models to comprehend and reason over both visual and textual content. Driven by the power of Large Language Models (LLMs), two prominent methods have emerged: (1) hybrid integration of LLMs and Vision-Language Models (VLMs), where visual inputs are first converted into language descriptions by VLMs, which then serve as inputs for LLMs to generate the final answer(s); (2) visual feature alignment in language space, where visual inputs are encoded as embeddings and projected into the LLMs' language space via further supervised fine-tuning. The first approach offers low training costs and interpretability but is hard to optimize end to end. The second approach achieves decent performance, but feature alignment usually requires large amounts of training data and lacks interpretability. To tackle this dilemma, we propose a novel approach, Inner Monologue Multi-Modal Optimization (IMMO), which solves complex vision language problems by simulating inner monologue, a cognitive process in which an individual engages in silent verbal communication with themselves. We enable LLMs and VLMs to interact through natural language conversation and propose a two-stage training process to learn how to conduct the inner monologue (self-asking questions and answering them). IMMO is evaluated on two popular tasks, and the results suggest that, by emulating the cognitive phenomenon of internal dialogue, our approach can enhance reasoning and explanation abilities, contributing to a more effective fusion of vision and language models. More importantly, instead of using predefined human-crafted monologues, IMMO learns this process within the deep learning models, promising wider applicability to many AI problems beyond vision language tasks.
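
The natural language interaction between an LLM and a VLM described in the abstract can be pictured, very roughly, as a question-and-answer loop. The sketch below is only an illustration under stated assumptions and is not the authors' implementation: the function `inner_monologue`, the `ANSWER:` prefix convention, the `max_turns` budget, and the toy stand-in models are all hypothetical, and the two-stage training process is not shown.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces (assumptions, not taken from the paper):
#   ask(task, history)      -> a follow-up question, or "ANSWER: <text>" when done
#   answer_visual(image, q) -> a natural language answer about the image
Turn = Tuple[str, str]
Asker = Callable[[str, List[Turn]], str]
Answerer = Callable[[str, str], str]


def inner_monologue(
    task: str,
    image: str,
    ask: Asker,
    answer_visual: Answerer,
    max_turns: int = 3,
) -> Tuple[Optional[str], List[Turn]]:
    """Run a self-asking loop: the LLM asks, the VLM answers, until the LLM concludes."""
    history: List[Turn] = []
    for _ in range(max_turns):
        message = ask(task, history)
        if message.startswith("ANSWER:"):
            # The LLM decided it has enough visual evidence to answer.
            return message[len("ANSWER:"):].strip(), history
        # Otherwise treat the message as a question for the VLM.
        history.append((message, answer_visual(image, message)))
    # Turn budget exhausted without a final answer.
    return None, history


# Toy stand-ins so the sketch runs without any real model.
def toy_llm(task: str, history: List[Turn]) -> str:
    if not history:
        return "What object is in the image?"
    return "ANSWER: " + history[-1][1]


def toy_vlm(image: str, question: str) -> str:
    return "a red bicycle"


final_answer, transcript = inner_monologue(
    "What is shown?", "image.jpg", toy_llm, toy_vlm
)
print(final_answer)
print(transcript)
```

Because the whole exchange is kept as plain text in `transcript`, the intermediate questions and answers remain readable, which is the kind of interpretability the abstract attributes to language-based interaction.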

Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment

Task-Oriented Multi-Modal Mutual Learning for Vision-Language Models

Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization

u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model

Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models

MPT4LM: Multi-Modal Prompt Tuning Makes Pre-Trained Large Language Models Better Vision-Language Learners

Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization

InfMLLM: A Unified Framework for Visual-Language Tasks

Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization

TC-LLaVA: Rethinking the Transfer from Image to Video Understanding with Temporal Considerations

MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models

Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization

Exploring the Transferability of Visual Prompting for Multimodal Large Language Models

Tackling Vision Language Tasks Through Learning Inner Monologues

Multi-modal Preference Alignment Remedies Degradation of Visual Instruction Tuning on Language Models

Efficient Multi-modal Large Language Models via Visual Token Grouping

Modality-Fair Preference Optimization for Trustworthy MLLM Alignment

Instruction Tuning-free Visual Token Complement for Multimodal LLMs

MMedPO: Aligning Medical Vision-Language Models with Clinical-Aware Multimodal Preference Optimization

Enhancing Model Performance: Another Approach to Vision-Language Instruction Tuning

ChatTracker: Enhancing Visual Tracking Performance via Chatting with Multimodal Large Language Model