Cognitive Visual-Language Mapper: Advancing Multimodal Comprehension with Enhanced Visual Knowledge Alignment

Yunxin Li, Xinyu Chen, Baotian Hu, Haoyuan Shi, Min Zhang
2024-06-26
Abstract: Evaluating and rethinking the current landscape of Large Multimodal Models (LMMs), we observe that widely used visual-language projection approaches (e.g., Q-former or MLP) focus on the alignment of image-text descriptions yet ignore visual knowledge-dimension alignment, i.e., connecting visuals to their relevant knowledge. Visual knowledge plays a significant role in analyzing, inferring, and interpreting information from visuals, helping improve the accuracy of answers to knowledge-based visual questions. In this paper, we mainly explore improving LMMs with visual-language knowledge alignment, especially for challenging knowledge-based visual question answering (VQA). To this end, we present a Cognitive Visual-Language Mapper (CVLM), which contains a pretrained Visual Knowledge Aligner (VKA) and a Fine-grained Knowledge Adapter (FKA) used in the multimodal instruction tuning stage. Specifically, we design the VKA based on the interaction between a small language model and a visual encoder, training it on collected image-knowledge pairs to achieve visual knowledge acquisition and projection. The FKA is employed to distill the fine-grained visual knowledge of an image and inject it into Large Language Models (LLMs). We conduct extensive experiments on knowledge-based VQA benchmarks, and the results show that CVLM significantly improves the performance of LMMs on knowledge-based VQA (average gain of 5.0%). Ablation studies also verify the effectiveness of VKA and FKA, respectively. The code is available at https://github.com/HITsz-TMG/Cognitive-Visual-Language-Mapper
Computation and Language, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses a gap in existing Large Multimodal Models (LMMs): their visual-language alignment methods focus on aligning images with text descriptions and ignore alignment along the visual knowledge dimension, that is, connecting visuals with their relevant knowledge. Visual knowledge plays an important role in analyzing, reasoning about, and interpreting visual information, and it helps improve the accuracy of answers to knowledge-based visual questions. The main goal of the paper is therefore to improve LMMs by strengthening visual-language knowledge alignment, especially for challenging knowledge-based visual question answering (VQA) tasks. To this end, the authors propose a Cognitive Visual-Language Mapper (CVLM), which contains a pretrained Visual Knowledge Aligner (VKA) and a Fine-grained Knowledge Adapter (FKA) used in the multimodal instruction tuning stage. The VKA is built on the interaction between a small language model and a visual encoder, and is trained on collected image-knowledge pairs to acquire and project visual knowledge. The FKA distills the fine-grained visual knowledge of an image and injects it into Large Language Models (LLMs). The paper evaluates CVLM through extensive experiments on knowledge-based VQA benchmarks, and the results show an average gain of 5.0% on these tasks. Ablation studies also verify the effectiveness of the VKA and the FKA, respectively.
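
The data flow described above can be illustrated with a toy sketch. This is not the authors' implementation: the function names, dimensions, random weights, and the top-k dot-product selection used for the adapter are all assumptions made for illustration, since the abstract does not specify how the VKA or FKA are built internally. The sketch only shows the general idea that knowledge vectors derived from visual features are placed alongside the usual image and text tokens before they reach the LLM.

```python
import random

rng = random.Random(0)
VIS_DIM, LLM_DIM = 8, 6  # assumed toy dimensions, not from the paper


def random_projection(in_dim, out_dim):
    """Random weight matrix standing in for a learned linear projection."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def apply_projection(weights, vec):
    """Matrix-vector product: maps a vector of size in_dim to size out_dim."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def visual_knowledge_aligner(visual_feats, knowledge_proj):
    """Hypothetical VKA stand-in: one knowledge vector per visual feature,
    projected into the LLM embedding space."""
    return [apply_projection(knowledge_proj, f) for f in visual_feats]


def fine_grained_knowledge_adapter(knowledge_vecs, question_vec, k):
    """Hypothetical FKA stand-in (assumption): keep the k knowledge vectors
    with the highest dot-product similarity to the question vector."""
    ranked = sorted(knowledge_vecs, key=lambda v: dot(v, question_vec), reverse=True)
    return ranked[:k]


# Toy inputs: 4 visual features from an encoder, 3 question token embeddings.
visual_feats = [[rng.random() for _ in range(VIS_DIM)] for _ in range(4)]
text_tokens = [[rng.random() for _ in range(LLM_DIM)] for _ in range(3)]
question_vec = [sum(col) / len(text_tokens) for col in zip(*text_tokens)]

# Conventional description-level projection (e.g. an MLP) for image tokens.
image_proj = random_projection(VIS_DIM, LLM_DIM)
image_tokens = [apply_projection(image_proj, f) for f in visual_feats]

# Knowledge-level branch: align, then select fine-grained knowledge.
knowledge_proj = random_projection(VIS_DIM, LLM_DIM)
knowledge_vecs = visual_knowledge_aligner(visual_feats, knowledge_proj)
knowledge_tokens = fine_grained_knowledge_adapter(knowledge_vecs, question_vec, k=2)

# Sequence handed to the LLM: image tokens, knowledge tokens, question tokens.
llm_input = image_tokens + knowledge_tokens + text_tokens
print(len(llm_input), "input vectors of size", len(llm_input[0]))
```

In a real system the projections would be learned and the knowledge branch would be pretrained on image-knowledge pairs, as the abstract states; the selection rule here is only one plausible way to read "distill the fine-grained visual knowledge".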