Abstract: Large language models (LLMs) can solve an increasing number of complex reasoning tasks, yet they make surprising mistakes in basic numerical understanding and processing (such as claiming 9.11 > 9.9). This ability is essential for tackling complex arithmetic and mathematical problems and serves as a foundation for most reasoning tasks, but previous work has paid little attention to it or has discussed only a few restricted tasks (such as integer addition). In this paper, we comprehensively investigate the numerical understanding and processing ability (NUPA) of LLMs. First, we introduce a benchmark covering four common numerical representations and 17 distinct numerical tasks in four major categories, resulting in 41 meaningful combinations in total. These tasks are derived from primary and secondary education curricula and encompass nearly all everyday numerical understanding and processing scenarios; their rules are simple and clear. Using the benchmark, we find that current LLMs fail frequently on many of the tasks. To study the problem, we train small models with existing and potential techniques for enhancing NUPA (such as special tokenizers, positional encodings (PEs), and number formats) and comprehensively evaluate their effectiveness using our testbed. We also finetune practical-scale LLMs on our proposed NUPA tasks and find that 1) naive finetuning can substantially improve NUPA on many but not all tasks, and 2) surprisingly, techniques designed to enhance NUPA prove ineffective for finetuning pretrained models. We further explore the impact of chain-of-thought techniques on NUPA. Our work takes a preliminary step towards understanding and improving the NUPA of LLMs. Our benchmark and code are released at https://github.com/GraphPKU/number_cookbook.

Pre-training and Evaluation of Numeracy-Oriented Language Model

Enhancing Financial Sentiment Analysis Ability of Language Model via Targeted Numerical Change-Related Masking

Arithmetic-Based Pretraining -- Improving Numeracy of Pretrained Language Models

NumLLM: Numeric-Sensitive Large Language Model for Chinese Finance

Pre-Calc: Learning to Use the Calculator Improves Numeracy in Language Models

MathBERT: A Pre-trained Language Model for General NLP Tasks in Mathematics Education

Evaluating Large Language Models on Financial Report Summarization: An Empirical Study

Number Cookbook: Number Understanding of Language Models and How to Improve It

FinBERT: A Pre-trained Financial Language Representation Model for Financial Text Mining

Laying Anchors: Semantically Priming Numerals in Language Modeling

Numeral Understanding in Financial Tweets for Fine-Grained Crowd-Based Forecasting

Reflection of Thought: Inversely Eliciting Numerical Reasoning in Language Models via Solving Linear Systems

JiuZhang: A Chinese Pre-trained Language Model for Mathematical Problem Understanding

An Improved Math Word Problem (MWP) Model Using Unified Pretrained Language Model (UniLM) for Pretraining

Injecting Numerical Reasoning Skills into Language Models

CFinBench: A Comprehensive Chinese Financial Benchmark for Large Language Models

NUMCoT: Numerals and Units of Measurement in Chain-of-Thought Reasoning using Large Language Models

SNFinLLM: Systematic and Nuanced Financial Domain Adaptation of Chinese Large Language Models

Data-Centric Financial Large Language Models

Enabling and Analyzing How to Efficiently Extract Information from Hybrid Long Documents with LLMs

Numerical Reasoning for Financial Reports