FinanceMath: Knowledge-Intensive Math Reasoning in Finance Domains

Yilun Zhao, Hongjun Liu, Yitao Long, Rui Zhang, Chen Zhao, Arman Cohan

2024-08-08

Abstract: We introduce FinanceMath, a novel benchmark designed to evaluate LLMs' capabilities in solving knowledge-intensive math reasoning problems. Compared to prior works, this study features three core advancements. First, FinanceMath includes 1,200 problems with a hybrid of textual and tabular content. These problems require college-level knowledge in the finance domain for effective resolution. Second, we provide expert-annotated, detailed solution references in Python program format, ensuring a high-quality benchmark for LLM assessment. We also construct a finance-domain knowledge bank and investigate various knowledge integration strategies. Finally, we evaluate a wide spectrum of 44 LLMs with both Chain-of-Thought and Program-of-Thought prompting methods. Our experimental results reveal that the current best-performing system (i.e., GPT-4o) achieves only 60.9% accuracy using CoT prompting, leaving substantial room for improvement. Moreover, while augmenting LLMs with external knowledge can improve model performance (e.g., from 47.5% to 54.5% for Gemini-1.5-Pro), their accuracy remains significantly lower than the estimated human expert performance of 92%. We believe that FinanceMath can advance future research in the area of domain-specific knowledge retrieval and integration, particularly within the context of solving reasoning-intensive tasks.

Computation and Language

What problem does this paper attempt to address?

The paper addresses the problem of evaluating how well large language models (LLMs) handle knowledge-intensive mathematical reasoning in the financial domain. It proposes a new benchmark dataset, FinanceMath, whose main features are:

1. **A large set of complex problems**: 1,200 problems that require specialized, college-level financial knowledge to solve, 40.2% of which involve interpreting tabular data.
2. **Expert-annotated solutions**: Detailed reference solutions written as Python programs, giving the benchmark a high-quality evaluation standard.
3. **A financial knowledge base**: A knowledge bank of 864 financial terms and their definitions, used to explore several knowledge integration strategies.
4. **Extensive model evaluation**: 44 LLMs, including general-purpose, math-specific, and code generation models, evaluated with both Chain-of-Thought (CoT) and Program-of-Thought (PoT) prompting.

The authors find that even the strongest current model is far from human expert level on these tasks: the best system, GPT-4o, reaches only 60.9% accuracy with CoT prompting, against an estimated 92% for human experts. Existing LLMs therefore leave significant room for improvement on complex, domain-specific problems. The paper also examines how incorporating external knowledge affects performance; for example, it raises Gemini-1.5-Pro from 47.5% to 54.5% accuracy, which is still well below the human expert estimate.
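To make the "solution as a Python program" idea concrete, here is a minimal sketch of what such a reference solution and an accuracy check might look like. Everything in it is an assumption made for illustration: the bond-pricing problem is invented rather than taken from FinanceMath, and the function names `solution`, `is_correct`, and `accuracy`, as well as the 1% relative tolerance, are hypothetical and do not describe the paper's actual evaluation protocol.

```python
import math


def solution() -> float:
    """Hypothetical problem (not from the dataset): price a 3-year bond
    with face value 1,000 and a 5% annual coupon, discounted at 4%."""
    face_value = 1000.0
    coupon_rate = 0.05
    discount_rate = 0.04
    years = 3

    coupon = face_value * coupon_rate
    # Present value of each annual coupon payment.
    pv_coupons = sum(
        coupon / (1 + discount_rate) ** t for t in range(1, years + 1)
    )
    # Present value of the face value repaid at maturity.
    pv_face = face_value / (1 + discount_rate) ** years
    return pv_coupons + pv_face


def is_correct(predicted: float, reference: float, rel_tol: float = 0.01) -> bool:
    """Assumed matching rule: a prediction counts as correct if it lies
    within a relative tolerance of the reference answer."""
    return math.isclose(predicted, reference, rel_tol=rel_tol)


def accuracy(predictions: list[float], references: list[float]) -> float:
    """Fraction of predictions judged correct against their references."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not predictions:
        return 0.0
    hits = sum(is_correct(p, r) for p, r in zip(predictions, references))
    return hits / len(predictions)


if __name__ == "__main__":
    reference = solution()
    print(round(reference, 2))
    print(accuracy([1027.75, 900.0], [reference, reference]))
```

Under a scheme like this, a model's PoT output would be executed and its return value compared with the reference, while a CoT answer would be parsed from the generated text before the same comparison.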
