Abstract: Evaluating and rethinking the current landscape of Large Multimodal Models (LMMs), we observe that widely used visual-language projection approaches (e.g., Q-former or MLP) focus on aligning images with their text descriptions yet ignore visual knowledge-dimension alignment, i.e., connecting visuals to their relevant knowledge. Visual knowledge plays a significant role in analyzing, inferring, and interpreting information from visuals, and it helps improve the accuracy of answers to knowledge-based visual questions. In this paper, we explore improving LMMs with visual-language knowledge alignment, with a particular focus on challenging knowledge-based visual question answering (VQA). To this end, we present a Cognitive Visual-Language Mapper (CVLM), which contains a pretrained Visual Knowledge Aligner (VKA) and a Fine-grained Knowledge Adapter (FKA) used in the multimodal instruction tuning stage. Specifically, we design the VKA based on the interaction between a small language model and a visual encoder, training it on collected image-knowledge pairs to achieve visual knowledge acquisition and projection. The FKA is employed to distill the fine-grained visual knowledge of an image and inject it into Large Language Models (LLMs). We conduct extensive experiments on knowledge-based VQA benchmarks, and the results show that CVLM significantly improves the performance of LMMs on knowledge-based VQA (an average gain of 5.0%). Ablation studies also verify the effectiveness of VKA and FKA individually. The code is available at https://github.com/HITsz-TMG/Cognitive-Visual-Language-Mapper

Visual Question Answering Instruction: Unlocking Multimodal Large Language Model To Domain-Specific Visual Multitasks

Simple and Effective Visual Question Answering in a Single Modality

Natural Language Understanding and Inference with MLLM in Visual Question Answering: A Survey

What Large Language Models Bring to Text-rich VQA?

Enhanced Multimodal RAG-LLM for Accurate Visual Question Answering

Multitask Learning for Visual Question Answering

VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks

From Images to Textual Prompts: Zero-Shot Visual Question Answering with Frozen Large Language Models

SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant

Language Guided Visual Question Answering: Elevate Your Multimodal Language Model Using Knowledge-Enriched Prompts

Guiding Vision-Language Model Selection for Visual Question-Answering Across Tasks, Domains, and Knowledge Types

Multi-Agents Based on Large Language Models for Knowledge-based Visual Question Answering

Enhancing Advanced Visual Reasoning Ability of Large Language Models

LOVA3: Learning to Visual Question Answering, Asking and Assessment

Declarative Knowledge Distillation from Large Language Models for Visual Question Answering Datasets

Prompting Large Language Models with Fine-Grained Visual Relations from Scene Graph for Visual Question Answering

Understanding Multimodal LLMs: the Mechanistic Interpretability of Llava in Visual Question Answering

Cognitive Visual-Language Mapper: Advancing Multimodal Comprehension with Enhanced Visual Knowledge Alignment

LION : Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge

Right this way: Can VLMs Guide Us to See More to Answer Questions?

Selectively Answering Visual Questions