Diagram Formalization Enhanced Multi-Modal Geometry Problem Solver

Zeren Zhang, Jo-Ku Cheng, Jingyang Deng, Lu Tian, Jinwen Ma, Ziran Qin, Xiaokai Zhang, Na Zhu, Tuo Leng
2024-09-06
Abstract: Mathematical reasoning remains an ongoing challenge for AI models, especially for geometry problems that require both linguistic and visual signals. As the vision encoders of most MLLMs are trained on natural scenes, they often struggle to understand geometric diagrams, performing no better in geometry problem solving than LLMs that only process text. This limitation is amplified by the lack of effective methods for representing geometric relationships. To address these issues, we introduce the Diagram Formalization Enhanced Geometry Problem Solver (DFE-GPS), a new framework that integrates visual features, geometric formal language, and natural language representations. We propose a novel synthetic data approach and create a large-scale geometric dataset, SynthGeo228K, annotated with both formal and natural language captions, designed to enhance the vision encoder for a better understanding of geometric structures. Our framework improves MLLMs' ability to process geometric diagrams and extends their application to open-ended tasks on the formalgeo7k dataset.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses how to improve the performance of multimodal large language models (MLLMs) on geometry problems, particularly their handling of geometric diagrams. It observes that current MLLMs struggle to understand and process such diagrams, so they perform no better on geometry problems than large language models (LLMs) that handle only text. To address this, the authors propose a framework called the "Diagram Formalization Enhanced Geometry Problem Solver" (DFE-GPS), which improves the model's understanding of geometric diagrams by integrating visual features, geometric formal language, and natural language representations.

### Main Issues:

1. **Insufficient understanding of geometric figures**: The vision encoders of most MLLMs are pre-trained on natural scenes, which makes it difficult for them to interpret geometric figures.
2. **Lack of effective geometric relationship representation methods**: Existing models have no effective way to represent geometric relationships, which leads to poor performance on geometry problems.

### Solutions:

- **Introducing the DFE-GPS framework**: The framework includes a Diagram Formalizer module that generates formal descriptions of geometric figures, strengthening the visual component and improving the LLM's recognition of geometric structures.
- **Creating a large-scale geometric dataset**: The authors propose a synthetic data generation method and use it to build SynthGeo228K, a dataset of 228,000 geometric figures, each annotated with formal and natural language descriptions.
- **Multi-stage training process**: The model is trained in three stages, which focus in turn on training the Diagram Formalizer module, aligning visual features with the language model, and instruction fine-tuning.
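The pieces described above can be pictured with a minimal sketch. It is not the authors' implementation: the `GeoSample` record, the `build_solver_prompt` template, the stage identifiers, and the caption strings are all hypothetical names chosen for illustration, and the formal caption uses made-up placeholder notation. Only two things are taken from the summary: each figure carries a formal and a natural language description, and training runs in three ordered stages. The sketch covers the text side only; how visual features are fed to the language model is not modeled here.

```python
from dataclasses import dataclass


@dataclass
class GeoSample:
    """Hypothetical record: one diagram with its two captions."""
    image_path: str
    formal_caption: str    # formal-language description (placeholder notation)
    natural_caption: str   # natural-language description of the same diagram


# Stage order follows the summary above; the identifiers are illustrative.
TRAINING_STAGES = (
    "train_diagram_formalizer",
    "align_visual_features_with_llm",
    "instruction_fine_tuning",
)


def build_solver_prompt(formal_description: str, question: str) -> str:
    """Place the formal diagram description ahead of the problem text.

    The template is an assumption; the source does not specify a prompt format.
    """
    return (
        "Diagram (formal description):\n"
        f"{formal_description}\n\n"
        "Problem:\n"
        f"{question}\n"
    )


if __name__ == "__main__":
    sample = GeoSample(
        image_path="diagram_0001.png",
        formal_caption="Triangle(A,B,C)",
        natural_caption="A triangle with vertices A, B and C.",
    )
    for number, stage in enumerate(TRAINING_STAGES, start=1):
        print(f"stage {number}: {stage}")
    print(build_solver_prompt(sample.formal_caption, "Find the length of AB."))
```

Keeping the formal description as plain text alongside the question reflects the idea that a formalized diagram gives the language model an explicit statement of geometric relationships in addition to whatever the vision encoder provides.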
### Experimental Results:

- **Performance improvement**: DFE-GPS performs strongly on both accuracy and process evaluation scores, for multiple-choice as well as open-ended questions.
- **Comparison with other models**: DFE-GPS significantly outperforms existing MLLMs and LLMs, particularly in handling geometric figures.

### Conclusion:

By introducing the DFE-GPS framework and a large-scale synthetic dataset, the authors improve the performance of multimodal large language models on geometry problems, addressing in particular the shortcomings of existing models in understanding and processing geometric figures. Future work can explore reinforcement learning and tree search strategies to enhance the model's performance further.