Cognitive Visual-Language Mapper: Advancing Multimodal Comprehension with Enhanced Visual Knowledge Alignment

Yunxin Li, Xinyu Chen, Baotian Hu, Haoyuan Shi, Min Zhang

2024-06-26

Abstract: Evaluating and rethinking the current landscape of Large Multimodal Models (LMMs), we observe that widely used visual-language projection approaches (e.g., Q-former or MLP) focus on the alignment of image-text descriptions yet ignore visual knowledge-dimension alignment, i.e., connecting visuals to their relevant knowledge. Visual knowledge plays a significant role in analyzing, inferring, and interpreting information from visuals, helping improve the accuracy of answers to knowledge-based visual questions. In this paper, we mainly explore improving LMMs with visual-language knowledge alignment, aimed especially at challenging knowledge-based visual question answering (VQA). To this end, we present a Cognitive Visual-Language Mapper (CVLM), which contains a pretrained Visual Knowledge Aligner (VKA) and a Fine-grained Knowledge Adapter (FKA) used in the multimodal instruction tuning stage. Specifically, we design the VKA based on the interaction between a small language model and a visual encoder, training it on collected image-knowledge pairs to achieve visual knowledge acquisition and projection. The FKA is employed to distill the fine-grained visual knowledge of an image and inject it into Large Language Models (LLMs). We conduct extensive experiments on knowledge-based VQA benchmarks, and the experimental results show that CVLM significantly improves the performance of LMMs on knowledge-based VQA (an average gain of 5.0%). Ablation studies also verify the effectiveness of VKA and FKA individually. The code is available at https://github.com/HITsz-TMG/Cognitive-Visual-Language-Mapper

Computation and Language, Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses a gap in existing Large Multimodal Models (LMMs): their visual-language alignment methods focus mainly on aligning images with text descriptions and ignore alignment in the visual knowledge dimension, that is, connecting visual content with the knowledge relevant to it. Visual knowledge plays an important role in analyzing, reasoning about, and interpreting visual information, and it helps improve the accuracy of answers to knowledge-based visual questions. The main goal of the paper is therefore to improve LMMs by strengthening visual-language knowledge alignment, especially for challenging knowledge-based visual question answering (VQA) tasks. To achieve this, the authors propose a Cognitive Visual-Language Mapper (CVLM), which contains a pretrained Visual Knowledge Aligner (VKA) and a Fine-grained Knowledge Adapter (FKA) used in the multimodal instruction tuning stage. The VKA is designed around the interaction between a small language model and a visual encoder, and it is trained on collected image-knowledge pairs to acquire and project visual knowledge. The FKA distills the fine-grained visual knowledge of an image and injects it into Large Language Models (LLMs). The paper evaluates CVLM through extensive experiments on multiple knowledge-based VQA benchmarks, and the results show an average performance gain of 5.0% for LMMs on these tasks. Ablation studies also verify the effectiveness of the VKA and the FKA individually.
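To make the idea of projecting visual features into an LLM's input space more concrete, here is a minimal, illustrative sketch in pure Python. It is not the authors' implementation: the summary above does not specify how the VKA or FKA work internally, so the sketch assumes a generic query-based projector in which a few query vectors cross-attend over visual patch features, and the pooled result is linearly projected to the LLM embedding size and prepended to the text embeddings. All names (`cross_attend`, `knowledge_tokens`), all dimensions, and the random weights are hypothetical stand-ins for trained components.

```python
import math
import random

def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m); both are lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def rand_matrix(rows, cols, rng):
    """Random weights standing in for trained parameters (assumption)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def cross_attend(queries, features):
    """Single-head cross-attention: each query pools over the feature rows."""
    d = len(queries[0])
    feats_t = [list(col) for col in zip(*features)]
    scores = matmul(queries, feats_t)                      # (n_queries x n_features)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, features)                       # (n_queries x d)

def knowledge_tokens(patch_feats, queries, proj):
    """Pool patch features with the queries, then project to the LLM width."""
    pooled = cross_attend(queries, patch_feats)
    return matmul(pooled, proj)

rng = random.Random(0)
VIS_DIM, LLM_DIM = 8, 16                 # hypothetical feature sizes
N_PATCHES, N_QUERIES, N_TEXT = 6, 4, 5   # hypothetical sequence lengths

patch_feats = rand_matrix(N_PATCHES, VIS_DIM, rng)  # stand-in for visual encoder output
queries = rand_matrix(N_QUERIES, VIS_DIM, rng)      # stand-in for learned queries
proj = rand_matrix(VIS_DIM, LLM_DIM, rng)           # stand-in for the projection layer
text_embeds = rand_matrix(N_TEXT, LLM_DIM, rng)     # stand-in for instruction embeddings

k_tokens = knowledge_tokens(patch_feats, queries, proj)
llm_input = k_tokens + text_embeds       # prepend the projected tokens to the text
print(len(llm_input), len(llm_input[0])) # prints: 9 16
```

In this sketch the LLM would receive nine input vectors, four projected visual tokens followed by five text tokens. In a real system the queries and projection would be trained (for example on image-knowledge pairs, as the summary describes) rather than sampled at random.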
