G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model

Jiahui Gao, Renjie Pi, Jipeng Zhang, Jiacheng Ye, Wanjun Zhong, Yufei Wang, Lanqing Hong, Jianhua Han, Hang Xu, Zhenguo Li, Lingpeng Kong
2023-12-19
Abstract: Large language models (LLMs) have shown remarkable proficiency in human-level reasoning and generation capabilities, which encourages extensive research on their application in mathematical problem solving. However, current work has been largely focused on text-based mathematical problems, with limited investigation in problems involving geometric information. Addressing this gap, we aim to enable LLMs to solve geometric problems by understanding image input. We first analyze the limitations of current Multimodal Large Language Models (MLLMs) in this area: they struggle to accurately comprehend basic geometric elements and their relationships. To overcome these challenges, we take advantage of the unique characteristics of geometric problems (such as unique geometric logical form, and geometric scalability) and the capacity of the textual LLMs to build an enriched multimodal geometry dataset based on existing data. The augmented dataset, Geo170K, contains more than 170K geometric image-caption and question-answer pairs. Utilizing our constructed Geo170K dataset, we develop G-LLaVA, which demonstrates exceptional performance in solving geometric problems, significantly outperforming GPT-4-V on the MathVista benchmark with only 7B parameters.
Computation and Language
What problem does this paper attempt to address?
This paper addresses the difficulty that current multimodal large language models (MLLMs) have in understanding geometric figures and the relationships within them. Although large language models (LLMs) already perform well on text-based mathematical problems, they perform poorly on problems that require understanding geometric information. Specifically, state-of-the-art MLLMs make significant errors when interpreting and describing basic geometric elements (such as points, lines, and angles) and their relationships, which limits their usefulness for geometric problems. Analyzing these limitations, the authors find that such models are usually trained on general-domain images and descriptions, and the semantic understanding learned there differs considerably from the skills that geometric reasoning requires. To overcome this, they exploit characteristics specific to geometric problems (such as the unique geometric logical form and geometric scalability), together with the capabilities of text-only LLMs, to build an augmented dataset, Geo170K, containing more than 170K geometric image-caption and question-answer pairs derived from existing data. G-LLaVA, trained on this dataset, performs strongly on geometric problems; in particular, with only 7B parameters it significantly outperforms GPT-4-V on the MathVista benchmark. Overall, the paper aims to improve MLLM performance on geometric problem solving through an augmented dataset and a model trained on it, thereby filling this gap in the research.