What is the limitation of multimodal LLMs? A deeper look into multimodal LLMs through prompt probing

Shuhan Qi, Zhengying Cao, Jun Rao, Lei Wang, Jing Xiao, Xuan Wang
DOI: https://doi.org/10.1016/j.ipm.2023.103510
IF: 7.466
2023-09-27
Information Processing & Management
Abstract: Large language models (LLMs) are believed to contain vast knowledge. Many works have extended LLMs to multimodal models and applied them to various multimodal downstream tasks with a unified model structure using prompts. Appropriate prompts can stimulate the knowledge capabilities of the model to solve different tasks. However, how the content of a prompt affects the model's understanding of the information is still under-explored in the literature. We fill this gap by offering a systematic study on prompt probing for multimodal LLMs, examining various factors that affect their understanding of prompts. To achieve this goal, we propose a novel prompt probing framework that starts with the input and designs three types of input change strategies as templates for probing: visual prompt, text prompt, and extra knowledge prompt. Our extensive experiments on the VQA dataset show that existing multimodal LLMs do not understand the input content but instead simply fit the training data distribution. Current multimodal models are still very far from understanding prompts properly.
computer science, information systems, information science & library science