Abstract: Advances in large language models (LLMs) have spurred research into enhancing their reasoning capabilities, particularly in math-rich STEM documents. While LLMs can generate equations or solve math-related queries, their ability to fully understand and interpret abstract mathematical symbols in long, math-rich documents remains limited. In this paper, we introduce STEM-PoM, a comprehensive benchmark dataset designed to evaluate LLMs' reasoning abilities on math symbols within contextual scientific text. The dataset, sourced from real-world ArXiv documents, contains over 2K math symbols classified as main attributes of variables, constants, operators, and unit descriptors, with additional sub-attributes including scalar/vector/matrix for variables and local/global/discipline-specific labels for both constants and operators. Our extensive experiments show that state-of-the-art LLMs achieve an average of 20-60% accuracy under in-context learning and 50-60% accuracy with fine-tuning, revealing a significant gap in their mathematical reasoning capabilities. STEM-PoM fuels future research of developing advanced Math-AI models that can robustly handle math symbols.

What problem does this paper attempt to address?

The problem that this paper attempts to solve is: **evaluating how well large language models (LLMs) reason about mathematical symbols, especially in long-form, math-rich scientific documents**. Although existing LLMs can generate equations and answer math problems, their ability to understand and interpret abstract mathematical symbols remains limited. Specifically, the paper aims to:

1. **Construct a comprehensive benchmark dataset (STEM-PoM)** for evaluating LLMs' understanding of and reasoning about mathematical symbols. The dataset contains 2,109 mathematical symbols extracted from real-world arXiv documents, each classified according to its attributes.
2. **Reveal the deficiencies of existing models in understanding mathematical symbols**. In the authors' experiments, even state-of-the-art LLMs reach an average of only 20-60% accuracy under in-context learning and 50-60% accuracy with fine-tuning, indicating a significant gap in these models' mathematical symbolic reasoning.
3. **Promote future research** on more advanced Math-AI models that handle mathematical symbols robustly, especially in complex scientific documents.

### Specific Problem Description

- **Polysemy of Mathematical Symbols**: The same mathematical symbol may have different meanings in different contexts. For example, in the linear equation \(y = mx + p\), \(y\) is a variable, while in the cross-entropy loss function \(L(x,y)=-\sum_{i=1}^{N}x_{i}\log(y_{i})\), \(y\) represents a fixed target label and is regarded as a constant. Classifying a symbol correctly therefore requires understanding its context.
- **Lack of High-Quality Datasets**: Existing datasets of annotated mathematical symbols are insufficient to support advanced natural language processing tasks, especially mathematical-symbol classification and understanding.
- **Limitations of Traditional Methods**: Traditional semantic-parsing methods (such as LaTeXML or arXMLiv) have difficulty accurately matching abstract mathematical symbols with their corresponding XML tags in math-rich documents, resulting in insufficient precision.

By introducing the STEM-PoM dataset, the authors hope to give researchers a tool for evaluating and improving LLMs' capabilities in mathematical-symbol reasoning, thereby promoting further development in this field.
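The label taxonomy from the abstract and the polysemy example above can be made concrete with a small sketch. This is only an illustration under stated assumptions: the names `TAXONOMY`, `SymbolLabel`, and `main_attribute_accuracy` are hypothetical and do not come from the paper or its released code, and the scoring function is a plain exact-match accuracy on the main attribute, which may differ from the authors' evaluation protocol.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Main attributes and their sub-attributes as listed in the abstract.
# No sub-attributes are listed there for unit descriptors.
TAXONOMY = {
    "variable": ("scalar", "vector", "matrix"),
    "constant": ("local", "global", "discipline-specific"),
    "operator": ("local", "global", "discipline-specific"),
    "unit descriptor": (),
}


@dataclass(frozen=True)
class SymbolLabel:
    """Hypothetical record: one symbol occurrence plus its context-dependent label."""
    symbol: str
    context: str
    main: str
    sub: Optional[str] = None  # left unset when the sub-attribute is unknown

    def __post_init__(self) -> None:
        # Reject labels that fall outside the taxonomy.
        if self.main not in TAXONOMY:
            raise ValueError(f"unknown main attribute: {self.main!r}")
        if self.sub is not None and self.sub not in TAXONOMY[self.main]:
            raise ValueError(f"{self.sub!r} is not a sub-attribute of {self.main!r}")


def main_attribute_accuracy(gold: Sequence[SymbolLabel], predicted: Sequence[str]) -> float:
    """Fraction of symbols whose predicted main attribute matches the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    hits = sum(g.main == p for g, p in zip(gold, predicted))
    return hits / len(gold)


# The same symbol y receives different main attributes in different contexts.
gold = [
    SymbolLabel("y", "linear equation y = mx + p", "variable"),
    SymbolLabel("y", "cross-entropy loss L(x, y)", "constant"),
]

# A context-blind guess that always answers "variable" gets only one of the two right.
score = main_attribute_accuracy(gold, ["variable", "variable"])
```

Keying each record on the pair of symbol and context, rather than on the symbol alone, is what lets the two occurrences of \(y\) carry different labels.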

STEM-POM: Evaluating Language Models Math-Symbol Reasoning in Document Parsing

MathOdyssey: Benchmarking Mathematical Problem-Solving Skills in Large Language Models Using Odyssey Math Data

Reasoning in Large Language Models Through Symbolic Math Word Problems

DocMath-Eval: Evaluating Math Reasoning Capabilities of LLMs in Understanding Long and Specialized Documents

Solving Math Word Problems by Combining Language Models With Symbolic Solvers

Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark

Mathfish: Evaluating Language Model Math Reasoning via Grounding in Educational Curricula

Evaluating LLMs' Mathematical Reasoning in Financial Document Question Answering

MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models

Large Language Models for Mathematical Reasoning: Progresses and Challenges

CHAMP: A Competition-level Dataset for Fine-Grained Analyses of LLMs' Mathematical Reasoning Capabilities

Do Large Language Models Truly Grasp Mathematics? An Empirical Exploration From A Psychological Perspective

We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?

ReasonAgain: Using Extractable Symbolic Programs to Evaluate Mathematical Reasoning

Evaluating LLMs' Mathematical and Coding Competency through Ontology-guided Interventions

Mathify: Evaluating Large Language Models on Mathematical Problem Solving Tasks

MultiMath: Bridging Visual and Mathematical Reasoning for Large Language Models

Investigating Symbolic Capabilities of Large Language Models

Math-PUMA: Progressive Upward Multimodal Alignment to Enhance Mathematical Reasoning