Abstract: Visual language is a system of communication that conveys information through symbols, shapes, and spatial arrangements. Diagrams are a typical example of a visual language depicting complex concepts and their relationships in the form of an image. The symbolic nature of diagrams presents significant challenges for building models capable of understanding them. Yet, recent studies seem to suggest that Large Vision-Language Models (LVLMs) can even tackle complex reasoning tasks involving diagrams. In this paper, we investigate this phenomenon by developing a comprehensive test suite to evaluate the diagram comprehension capability of LVLMs. Our test suite uses a variety of questions focused on concept entities and their relationships over a set of synthetic as well as real diagrams across several domains to evaluate the recognition and reasoning abilities of models. Our evaluation of three LVLMs (GPT-4V, GPT-4o, and Gemini) shows that while these models can accurately identify and reason about entities, their ability to understand relationships is notably limited. Further testing reveals that the decent performance on diagram understanding largely stems from leveraging their background knowledge as shortcuts to identify and reason about the relational information. Thus, we conclude that LVLMs have a limited capability for genuine diagram understanding, and their impressive performance in diagram reasoning is an illusion emanating from other confounding factors, such as the background knowledge in the models.

What problem does this paper attempt to address?

The paper asks whether Large Vision-Language Models (LVLMs) truly understand the visual language of diagrams. Specifically, the authors designed a comprehensive test suite to evaluate how well LVLMs recognize and reason about entities and their relationships in diagrams.

### Background and Motivation

- **Visual Language**: Visual language is a communication system that conveys information through symbols, shapes, and spatial arrangements. Diagrams are a typical example of visual language, presenting complex concepts and their relationships in image form.
- **Challenges**: The symbolic nature of diagrams makes it difficult to build models that can understand them.
- **Existing Research**: Recent studies suggest that LVLMs can handle complex reasoning tasks involving diagrams, but it remains unclear whether these models truly understand the symbolic information the diagrams contain.

### Research Methodology

- **Test Suite**: The authors developed a comprehensive test suite containing various types of questions to evaluate the ability of LVLMs to recognize and reason about entities and their relationships in diagrams. The test suite covers both synthetic and real diagrams.
- **Evaluation Subjects**: The authors evaluated three LVLMs (GPT-4V, GPT-4o, and Gemini).

### Key Findings

1. **Entity Recognition and Reasoning**:
   - LVLMs can accurately recognize and reason about entities in diagrams, whether the entities are represented in text or visual form.
   - The models performed exceptionally well on entity-related questions, especially when using Chain of Thought (CoT) prompts.
2. **Relationship Recognition and Reasoning**:
   - LVLMs exhibited significant difficulty in recognizing and reasoning about relationships in diagrams, whether implicit relationships (such as relative positions) or explicit relationships (such as arrows).
   - Even with CoT prompts, the models performed poorly on relationship-related questions, with an average accuracy of only 40% to 66%.
3. **Performance on Real Diagrams**:
   - On real diagrams, LVLMs performed better at recognizing relationships, with an average accuracy of 81.09%, compared to 68.00% on synthetic diagrams.
   - Nevertheless, the models still had difficulty reasoning about relationships.

### Key Conclusions

- **Knowledge Dependence**: The good performance of LVLMs in relationship recognition relies mainly on their background knowledge rather than on a true understanding of the diagrams themselves.
- **Hallucinated Relationships**: When a diagram provides no relationships, LVLMs infer relationships from their learned knowledge. When the provided relationships contradict that knowledge, the models tend to ignore them and answer from background knowledge instead.

### Summary

Although LVLMs excel at recognizing and reasoning about entities in diagrams, they have significant limitations in understanding and reasoning about relationships. Their performance depends more on background knowledge than on a true understanding of the diagrams themselves. This finding exposes the shortcomings of LVLMs in diagram understanding and points to directions for future research.
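The findings above are reported as accuracies broken down by what a question targets (entities or relationships) and which skill it probes (recognition or reasoning). The following is a minimal sketch of how such a breakdown could be computed. It is not the authors' evaluation code: the `Record` fields, the `accuracy_by_category` helper, the exact-match scoring rule, and the toy records are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One graded question. All field names are illustrative assumptions."""
    target: str     # "entity" or "relation"
    skill: str      # "recognition" or "reasoning"
    gold: str       # reference answer
    predicted: str  # model answer, already extracted from its response


def accuracy_by_category(records):
    """Return {(target, skill): accuracy} using case-insensitive exact match."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        key = (r.target, r.skill)
        total[key] += 1
        if r.predicted.strip().lower() == r.gold.strip().lower():
            correct[key] += 1
    return {key: correct[key] / total[key] for key in total}


# Toy records invented for illustration only (not results from the paper).
toy_records = [
    Record("entity", "recognition", "A", "A"),
    Record("entity", "recognition", "B", "b"),
    Record("relation", "recognition", "yes", "no"),
    Record("relation", "recognition", "yes", "yes"),
    Record("relation", "reasoning", "3", "2"),
]

scores = accuracy_by_category(toy_records)
for (target, skill), acc in sorted(scores.items()):
    print(f"{target:8s} {skill:12s} {acc:.2f}")
```

Keeping the two axes as separate fields makes it straightforward to add a third one, such as synthetic versus real diagrams, without changing the scoring logic.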

Do Vision-Language Models Really Understand Visual Language?

Are Large Vision Language Models up to the Challenge of Chart Comprehension and Reasoning? An Extensive Investigation into the Capabilities and Limitations of LVLMs

How Far Are We from Intelligent Visual Deductive Reasoning?

VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception of Geometric Information

An Introduction to Vision-Language Modeling

Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models

Are VLMs Really Blind

Can Large Language Models Understand Symbolic Graphics Programs?

A Vision Check-up for Language Models

Vision language models are blind

Visually Descriptive Language Model for Vector Graphics Reasoning

Analyzing the Roles of Language and Vision in Learning from Limited Data

Effectiveness Assessment of Recent Large Vision-Language Models

Visual Description Grounding Reduces Hallucinations and Boosts Reasoning in LVLMs

Visual cognition in multimodal large language models

RelationVLM: Making Large Vision-Language Models Understand Visual Relations

Understanding Figurative Meaning through Explainable Visual Entailment

Visualization Literacy of Multimodal Large Language Models: A Comparative Study

LVLM-Interpret: An Interpretability Tool for Large Vision-Language Models

Response Wide Shut: Surprising Observations in Basic Vision Language Model Capabilities