Abstract: Large language models (LLMs) have shown remarkable proficiency in human-level reasoning and generation, which has encouraged extensive research on their application to mathematical problem solving. However, current work has focused largely on text-based mathematical problems, with limited investigation of problems involving geometric information. Addressing this gap, we aim to enable LLMs to solve geometric problems by understanding image input. We first analyze the limitations of current Multimodal Large Language Models (MLLMs) in this area: they struggle to accurately comprehend basic geometric elements and their relationships. To overcome these challenges, we take advantage of the unique characteristics of geometric problems (such as unique geometric logical form and geometric scalability) and the capacity of textual LLMs to build an enriched multimodal geometry dataset based on existing data. The augmented dataset, Geo170K, contains more than 170K geometric image-caption and question-answer pairs. Utilizing our constructed Geo170K dataset, we develop G-LLaVA, which demonstrates exceptional performance in solving geometric problems, significantly outperforming GPT-4-V on the MathVista benchmark with only 7B parameters.

What problem does this paper attempt to address?

This paper addresses the difficulty that current multimodal large language models (MLLMs) have in understanding geometric figures and their relationships when solving geometry problems. Although existing large language models (LLMs) already perform well on mathematical problems given in text form, they perform poorly on problems that require understanding geometric information. Specifically, state-of-the-art MLLMs make significant errors in understanding and describing basic geometric elements (such as points, lines, and angles) and their mutual relationships, which greatly limits their usefulness for geometric problems.

By analyzing the limitations of existing MLLMs, the authors found that these models are usually trained on general-domain images and descriptions, and that the semantic understanding required in those domains differs considerably from the skills needed for geometric reasoning. To overcome this, the paper exploits the unique characteristics of geometric problems (such as their unique geometric logical form and geometric scalability), combined with the capabilities of text-only LLMs, to construct an enriched dataset, Geo170K, containing more than 170,000 geometric image-caption and question-answer pairs.

The G-LLaVA model developed on this dataset performs strongly on geometric problems. In particular, on the MathVista benchmark, G-LLaVA with only 7B parameters significantly outperforms GPT-4-V. Overall, the paper aims to improve the performance of MLLMs in geometric problem solving through an enriched dataset and a model trained on it, thereby addressing this gap in the research field.

G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model
