Abstract: Large Language Models (LLMs), combined with program-based solving techniques, are increasingly demonstrating proficiency in mathematical reasoning. For example, closed-source models such as OpenAI GPT-4 and Claude show excellent results in solving math word problems. However, progress in math word problem-solving for open-source LLMs is limited, and the challenges these models face are not well-studied. In this paper, we study the performance of strong open-source LLMs, including Llama 2 (7B), Code Llama (7B), and Mistral (7B), on math word problems using program-based solving techniques. Specifically, we analyze the outputs of these models when applied to math word problems and identify a category of problems that poses a significant challenge: those involving quantities spanning multiple units. To address this issue, we propose a systematic approach that defines the unit of each quantity and ensures the consistency of these units during mathematical operations. We developed Unit Consistency Programs (UCPs), an annotated dataset of math word problems, each paired with programs containing unit specifications and unit verification routines. We fine-tuned Llama 2 (7B), Code Llama (7B), and Mistral (7B) models with UCPs to produce their VerityMath variants. Our findings indicate that our approach, which incorporates unit consistency, currently slightly underperforms compared to an approach that does not. To understand the reasons behind this, we conduct an in-depth error analysis and suggest options for future improvements. Our code and dataset are available at https://github.com/vernontoh/VerityMath.
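
To make the idea of a program with unit specifications and unit verification routines concrete, the following is a minimal sketch only. It is not the UCP format from the paper or its repository: the `Quantity` class, the `add` and `apply_rate` helpers, the unit strings, and the word problem itself (a worker paid 15 dollars per hour for 4 hours plus a 10 dollar bonus) are all hypothetical, chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quantity:
    """A number paired with a unit label (hypothetical helper, not from the paper)."""
    value: float
    unit: str


def add(a: Quantity, b: Quantity) -> Quantity:
    """Add two quantities, verifying that both carry the same unit."""
    if a.unit != b.unit:
        raise ValueError(f"unit mismatch: {a.unit} + {b.unit}")
    return Quantity(a.value + b.value, a.unit)


def apply_rate(rate: Quantity, amount: Quantity) -> Quantity:
    """Multiply a rate such as 'dollar/hour' by an amount in 'hour'.

    The rate's denominator must match the amount's unit; the result
    carries the rate's numerator as its unit.
    """
    numerator, denominator = rate.unit.split("/")
    if denominator != amount.unit:
        raise ValueError(f"unit mismatch: {rate.unit} * {amount.unit}")
    return Quantity(rate.value * amount.value, numerator)


def solve() -> Quantity:
    # Unit specifications: every quantity in the problem is declared with its unit.
    wage = Quantity(15.0, "dollar/hour")
    hours_worked = Quantity(4.0, "hour")
    bonus = Quantity(10.0, "dollar")

    # Unit verification happens inside each operation.
    pay = apply_rate(wage, hours_worked)  # dollar/hour * hour -> dollar
    return add(pay, bonus)                # dollar + dollar -> dollar


if __name__ == "__main__":
    print(solve())
```

In this sketch, a program that tried to add `hours_worked` to `bonus` would raise a `ValueError` instead of silently returning a number, which is the kind of inconsistency a unit check is meant to surface.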

CMATH: Can Your Language Model Pass Chinese Elementary School Math Test?

How well do Large Language Models perform in Arithmetic tasks?

CMMaTH: A Chinese Multi-modal Math Skill Evaluation Benchmark for Foundation Models

FineMath: A Fine-Grained Mathematical Evaluation Benchmark for Chinese Large Language Models

CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of Large Multimodal Models

CHAMP: A Competition-level Dataset for Fine-Grained Analyses of LLMs' Mathematical Reasoning Capabilities

A Careful Examination of Large Language Model Performance on Grade School Arithmetic

MathOdyssey: Benchmarking Mathematical Problem-Solving Skills in Large Language Models Using Odyssey Math Data

Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models

SuperCLUE-Math6: Graded Multi-Step Math Reasoning Benchmark for LLMs in Chinese

Mathfish: Evaluating Language Model Math Reasoning via Grounding in Educational Curricula

Mathify: Evaluating Large Language Models on Mathematical Problem Solving Tasks

MathAttack: Attacking Large Language Models towards Math Solving Ability

ConceptMath: A Bilingual Concept-wise Benchmark for Measuring Mathematical Reasoning of Large Language Models

MATHWELL: Generating Educational Math Word Problems Using Teacher Annotations

STEM-POM: Evaluating Language Models Math-Symbol Reasoning in Document Parsing

VerityMath: Advancing Mathematical Reasoning by Self-Verification Through Unit Consistency

HARDMath: A Benchmark Dataset for Challenging Problems in Applied Mathematics