Probing Mechanical Reasoning in Large Vision Language Models

Haoran Sun, Qingying Gao, Haiyun Lyu, Dezhi Luo, Hokin Deng, Yijiang Li

2024-10-01

Abstract: Mechanical reasoning is a fundamental ability that sets human intelligence apart from other animal intelligence. Mechanical reasoning allows us to design tools, build bridges and canals, and construct houses which set the foundation of human civilization. Embedding machines with such ability is an important step towards building human-level artificial intelligence. Recently, Li et al. built CogDevelop2K, a data-intensive cognitive experiment benchmark for assaying the developmental trajectory of machine intelligence (Li et al., 2024). Here, to investigate mechanical reasoning in Vision Language Models, we leverage the MechBench of CogDevelop2K, which contains approximately 150 cognitive experiments, to test understanding of mechanical system stability, gears and pulley systems, seesaw-like systems and leverage principle, inertia and motion, and other fluid-related systems in Large Vision Language Models. We observe diverse yet consistent behaviors over these aspects in VLMs.

Artificial Intelligence, Neurons and Cognition

What problem does this paper attempt to address?

This paper evaluates the mechanical reasoning capabilities of large Vision Language Models (VLMs). Specifically, the authors use MechBench, the part of CogDevelop2K that contains approximately 150 cognitive experiments, to test the VLMs' understanding of the following areas:

1. **Mechanical system stability**: For example, determining which objects are more likely to tip over or remain stable.
2. **Pulley systems**: For example, determining which pulley system requires the least effort to lift a heavy object.
3. **Gear systems**: For example, determining how the rotation direction of one gear affects the rotation direction of another gear.
4. **Seesaw systems and the principle of the lever**: For example, determining how to adjust a position so that the seesaw balances.
5. **Inertia and motion**: For example, determining the motion state of an object under different conditions.
6. **Fluid mechanics**: For example, determining the behavior of a fluid system.

Through these experiments, the authors aim to characterize how current VLMs perform on these mechanical reasoning tasks and how that performance differs from human performance. This helps reveal the strengths and limitations of VLMs in mechanical reasoning and provides a basis for improving these models.
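To make the task types above concrete, the following is a minimal sketch of the textbook physics behind the gear, seesaw, and pulley questions. It is not code from MechBench or from the paper; the function names and the idealized assumptions (a simple chain of externally meshed gears, a massless lever, a frictionless pulley with massless rope) are illustrative choices made here.

```python
def gear_direction(first_direction: str, gear_index: int) -> str:
    """Rotation direction of the gear at 0-based position gear_index in a
    simple chain of externally meshed gears (assumed setup).
    Each meshed neighbor turns opposite to the gear that drives it."""
    if first_direction not in ("cw", "ccw"):
        raise ValueError("first_direction must be 'cw' or 'ccw'")
    if gear_index % 2 == 0:
        return first_direction
    return "ccw" if first_direction == "cw" else "cw"


def balancing_distance(mass_a: float, dist_a: float, mass_b: float) -> float:
    """Distance from the pivot at which mass_b balances mass_a sitting at
    dist_a on the other side (lever principle: m_a * d_a = m_b * d_b)."""
    return mass_a * dist_a / mass_b


def ideal_pulley_effort(load: float, supporting_strands: int) -> float:
    """Effort needed to hold a load with an ideal pulley system: the load
    is shared equally by the rope strands that support it."""
    return load / supporting_strands


if __name__ == "__main__":
    # Second gear in a chain turns opposite to the first.
    print(gear_direction("cw", 1))
    # A 30 kg rider balancing a 40 kg rider seated 1.5 m from the pivot.
    print(balancing_distance(40, 1.5, 30))
    # A 100 N load supported by 4 rope strands.
    print(ideal_pulley_effort(100, 4))
```

A benchmark question of this kind asks the model to reach the same conclusion from an image alone, without being given the quantities explicitly.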
