Muffin or Chihuahua? Challenging Multimodal Large Language Models with Multipanel VQA

Yue Fan, Jing Gu, Kaiwen Zhou, Qianqi Yan, Shan Jiang, Ching-Chen Kuo, Xinze Guan, Xin Eric Wang
2024-06-27
Abstract: Multipanel images, commonly seen as web screenshots, posters, etc., pervade our daily lives. These images, characterized by their composition of multiple subfigures in distinct layouts, effectively convey information to people. Toward building advanced multimodal AI applications, such as agents that understand complex scenes and navigate through webpages, the skill of multipanel visual reasoning is essential, and a comprehensive evaluation of models in this regard is important. Therefore, we introduce Multipanel Visual Question Answering (MultipanelVQA), a novel benchmark comprising 6,600 triplets of questions, answers, and multipanel images that specifically challenge models in comprehending multipanel images. Our evaluation shows that questions in the MultipanelVQA benchmark pose significant challenges to the state-of-the-art Multimodal Large Language Models (MLLMs) tested, even though humans can attain approximately 99% accuracy on these questions. Distinctively, the MultipanelVQA benchmark features synthetically generated multipanel images specifically crafted to isolate and assess the impact of various factors, such as the layout, on MLLMs' multipanel image comprehension abilities. As a result, in addition to benchmarking the capabilities of MLLMs in understanding multipanel images, we analyze various factors of the multipanel image that affect MLLMs' performance with synthetic data and offer insights for enhancement. Code and data are released at https://sites.google.com/view/multipanelvqa/home.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
This paper evaluates how well Multimodal Large Language Models (MLLMs) understand multipanel images. It introduces a new benchmark, MultipanelVQA, which contains 6,600 triplets of questions, answers, and multipanel images designed to challenge MLLMs' ability to understand such images. Although humans answer these questions with about 99% accuracy, the state-of-the-art MLLMs tested struggle with the task. Using this benchmark, the paper evaluates how MLLMs process multipanel images and analyzes the factors that affect their performance, such as layout, background elements, and text content.

### Main contributions of the paper:

1. **Proposing the MultipanelVQA benchmark**: The benchmark includes real-world and synthetic data and focuses on evaluating a model's ability to understand the content and layout of multipanel images.
2. **Benchmarking multiple open-source and proprietary MLLMs**: All tested models face significant challenges in interpreting multipanel images, even though they perform well on single-panel image tasks.
3. **Conducting a thorough error analysis using synthetic data**: The synthetic data lets the paper isolate and analyze individual factors that affect model performance, including subfigure content, layout, background, and visual text prompts.
4. **Exploring subfigure captions as visual prompts**: Adding sequential numbers to subfigures as visual prompts improves multipanel understanding for some MLLMs.

### Specific content of the paper:

- **Introduction**: Describes the progress of MLLMs in jointly processing visual and text data, and the importance of multipanel images in daily life. It points out that humans usually understand multipanel images easily, whereas MLLMs have difficulty processing them.
- **Related work**: Reviews the development of MLLMs and existing evaluation benchmarks, and emphasizes the role of synthetic data in model training and evaluation.
- **MultipanelVQA benchmark**: Describes the composition of the benchmark dataset, which includes real-world data and synthetic data. Each multipanel image is accompanied by three questions in different styles, which evaluate the model's content identification, position description, and visual positioning abilities, respectively.
- **Experimental setup**: Introduces the eight popular MLLMs used for evaluation, covering both open-source and proprietary models. The evaluation process compares predicted answers with ground-truth answers using an automatic script and uses GPT-4 as a secondary judge.
- **Main results**: Reports the average accuracy of each model on synthetic and real-world data. Proprietary models (such as GPT-4V, GPT-4o, and Gemini Pro Vision) perform best, but a significant gap to human-level accuracy remains.
- **Error analysis**: Through case studies and comparative experiments with synthetic data, analyzes the main sources of model errors on multipanel images, including interference from adjacent subfigures, similarity of subfigure content, and the influence of layout.

### Conclusion:

Through the MultipanelVQA benchmark, the paper reveals the deficiencies of current MLLMs in understanding multipanel images and provides a detailed error analysis and suggestions for improvement, pointing to directions for improving model performance.
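
The two-stage evaluation described in the experimental setup (a scripted comparison of predicted and ground-truth answers, with GPT-4 as a secondary judge) could be sketched roughly as follows. This is a minimal illustration, not the authors' evaluation script: the normalization rules, the function names `normalize`, `is_correct`, and `accuracy`, and the `judge` callback that stands in for a GPT-4 call are all assumptions.

```python
import re
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical type for the secondary judge: takes (prediction, answer)
# and returns True if the prediction should count as correct.
Judge = Callable[[str, str], bool]


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return " ".join(text.split())


def is_correct(prediction: str, answer: str, judge: Optional[Judge] = None) -> bool:
    """Stage 1: scripted exact match after normalization.
    Stage 2: defer to the judge (if one is given) when stage 1 fails."""
    if normalize(prediction) == normalize(answer):
        return True
    if judge is not None:
        return judge(prediction, answer)
    return False


def accuracy(pairs: Iterable[Tuple[str, str]], judge: Optional[Judge] = None) -> float:
    """Fraction of (prediction, answer) pairs judged correct."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(is_correct(p, a, judge) for p, a in pairs) / len(pairs)
```

In practice the `judge` argument would wrap a model call; in this sketch any function with the same signature can be passed, which also makes the scripted stage easy to check in isolation.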