FinanceMath: Knowledge-Intensive Math Reasoning in Finance Domains

Yilun Zhao, Hongjun Liu, Yitao Long, Rui Zhang, Chen Zhao, Arman Cohan
2024-08-08
Abstract: We introduce FinanceMath, a novel benchmark designed to evaluate LLMs' capabilities in solving knowledge-intensive math reasoning problems. Compared to prior works, this study features three core advancements. First, FinanceMath includes 1,200 problems with a hybrid of textual and tabular content. These problems require college-level knowledge in the finance domain for effective resolution. Second, we provide expert-annotated, detailed solution references in Python program format, ensuring a high-quality benchmark for LLM assessment. We also construct a finance-domain knowledge bank and investigate various knowledge integration strategies. Finally, we evaluate a wide spectrum of 44 LLMs with both Chain-of-Thought and Program-of-Thought prompting methods. Our experimental results reveal that the current best-performing system (i.e., GPT-4o) achieves only 60.9% accuracy using CoT prompting, leaving substantial room for improvement. Moreover, while augmenting LLMs with external knowledge can improve model performance (e.g., from 47.5% to 54.5% for Gemini-1.5-Pro), their accuracy remains significantly lower than the estimated human expert performance of 92%. We believe that FinanceMath can advance future research in the area of domain-specific knowledge retrieval and integration, particularly within the context of solving reasoning-intensive tasks.
Computation and Language
What problem does this paper attempt to address?
The paper addresses how to evaluate the capabilities of large language models (LLMs) in knowledge-intensive mathematical reasoning within the financial domain. Specifically, it proposes a new benchmark dataset named FinanceMath, whose main features are:

1. **A large set of complex problems**: 1,200 problems that require specialized, college-level financial knowledge to solve, with 40.2% of the problems involving the interpretation of tabular data.
2. **Expert-annotated solutions**: Detailed reference solutions written as Python programs, providing a high-quality standard for benchmark evaluation.
3. **A financial knowledge base**: A knowledge base containing 864 financial terms and their definitions, used to explore various knowledge integration strategies.
4. **Extensive model evaluation**: An evaluation of 44 LLMs, including general-purpose, math-specific, and code generation models, under both Chain-of-Thought (CoT) and Program-of-Thought (PoT) prompting.

The authors find that even the strongest current model, GPT-4o, falls well short of human experts on these tasks: it reaches only 60.9% accuracy with CoT prompting, against an estimated 92% for human experts. This indicates that existing LLMs still have significant room for improvement on complex, domain-specific problems. The paper also examines how incorporating external knowledge affects model performance (for example, Gemini-1.5-Pro improves from 47.5% to 54.5%) and reports the corresponding experimental results and analysis.
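To make the idea of a solution in Python program format more concrete, the following is a minimal sketch of what such a program might look like. The bond-pricing problem, the function names `solution` and `is_correct`, and the 1% relative tolerance are all illustrative assumptions; none of them are taken from the benchmark itself.

```python
import math


def solution() -> float:
    # Hypothetical problem (not from FinanceMath): a bond pays a 5% annual
    # coupon on a face value of 1,000 for 3 years, and the market discount
    # rate is 6%. What is the bond's price?
    face_value = 1000.0
    coupon_rate = 0.05
    discount_rate = 0.06
    years = 3

    coupon = face_value * coupon_rate
    # Present value of each annual coupon payment.
    pv_coupons = sum(
        coupon / (1 + discount_rate) ** t for t in range(1, years + 1)
    )
    # Present value of the face value repaid at maturity.
    pv_face = face_value / (1 + discount_rate) ** years
    return pv_coupons + pv_face


def is_correct(predicted: float, reference: float, rel_tol: float = 1e-2) -> bool:
    # Assumed scoring rule: accept a numeric answer that falls within a
    # relative tolerance of the reference value.
    return math.isclose(predicted, reference, rel_tol=rel_tol)


if __name__ == "__main__":
    reference = solution()
    print(round(reference, 2))
    print(is_correct(970.0, reference))
```

Under Program-of-Thought prompting, a model would be asked to produce a program of this kind, and the value it returns when executed would be compared with the reference answer. The financial knowledge needed here (discounting cash flows to present value) is the sort of domain knowledge a knowledge bank entry could supply.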