Abstract: State-of-the-art pretrained language models tend to perform below their capabilities when applied out-of-the-box on tasks that require understanding and working with numbers. Recent work suggests two main reasons for this: (1) popular tokenisation algorithms have limited expressiveness for numbers, and (2) common pretraining objectives do not target numeracy. Approaches that address these shortcomings usually require architectural changes or pretraining from scratch. In this paper, we propose a new extended pretraining approach called Arithmetic-Based Pretraining that jointly addresses both in one extended pretraining step without requiring architectural changes or pretraining from scratch. Arithmetic-Based Pretraining combines contrastive learning to improve the number representation, and a novel extended pretraining objective called Inferable Number Prediction Task to improve numeracy. Our experiments show the effectiveness of Arithmetic-Based Pretraining in three different tasks that require improved numeracy, i.e., reading comprehension in the DROP dataset, inference-on-tables in the InfoTabs dataset, and table-to-text generation in the WikiBio and SciGen datasets.

What problem does this paper attempt to address?

The paper addresses the tendency of pretrained language models to perform below their capabilities on tasks that require understanding and working with numbers, such as reading comprehension, inference-on-tables, and table-to-text generation. It identifies two main reasons for this:

1. **Limited expressiveness of popular tokenization algorithms for numbers**: Common tokenization algorithms (such as byte pair encoding and WordPiece) are designed to capture frequently occurring patterns in text, while numbers usually have different frequency distributions. As a result, similar numbers can be split into very different token sequences, which hampers the model's understanding of numbers.
2. **Common pretraining objectives do not target numeracy**: Existing objectives (such as denoising autoencoding and masked language modeling) mainly focus on language structure and semantic understanding rather than on working with numbers.

To address these issues, the paper proposes an extended pretraining method called Arithmetic-Based Pretraining. It tackles both problems jointly in one extended pretraining step, without modifying the model architecture or pretraining from scratch. The method has two main components:

- **Contrastive learning**: Combines subword-level and character-level tokenization to improve number representation.
- **Inferable Number Prediction Task**: A new extended pretraining objective that improves numeracy.

The experiments show the effectiveness of Arithmetic-Based Pretraining on three tasks that require improved numeracy: reading comprehension on the DROP dataset, inference-on-tables on the InfoTabs dataset, and table-to-text generation on the WikiBio and SciGen datasets.
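The first reason above, that similar numbers end up with dissimilar token sequences, can be illustrated with a small sketch. This is not the paper's code: the greedy longest-match tokenizer is a simplified, WordPiece-style stand-in, and the vocabulary `TOY_VOCAB` is hypothetical, chosen only to make the effect visible. The character-level view is shown alongside because the method is described as combining both tokenizations.

```python
def greedy_subword_tokenize(text, vocab):
    """Greedy longest-match-first segmentation (simplified, WordPiece-style)."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens


def char_tokenize(text):
    """Character-level view: every digit is its own token."""
    return list(text)


# Hypothetical vocabulary (an assumption for illustration only):
# frequent digit strings get their own entry, rare ones do not.
TOY_VOCAB = {"19", "20", "199", "2000", "00"}

for number in ["1999", "2000", "2001"]:
    print(number, greedy_subword_tokenize(number, TOY_VOCAB), char_tokenize(number))
```

With this toy vocabulary, the three neighbouring numbers are split into `['199', '9']`, `['2000']`, and `['20', '0', '1']` at the subword level, while the character-level view gives every four-digit number four tokens. A real tokenizer's splits depend on its learned vocabulary and will differ from this example.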

Arithmetic-Based Pretraining -- Improving Numeracy of Pretrained Language Models

Pre-training and Evaluation of Numeracy-Oriented Language Model.

Pre-Calc: Learning to Use the Calculator Improves Numeracy in Language Models

Reverse That Number! Decoding Order Matters in Arithmetic Learning

Teaching Arithmetic to Small Transformers

Overcoming Barriers to Skill Injection in Language Modeling: Case Study in Arithmetic

Number Cookbook: Number Understanding of Language Models and How to Improve It

Investigating the Limitations of Transformers with Simple Arithmetic Tasks

Laying Anchors: Semantically Priming Numerals in Language Modeling

MathBERT: A Pre-trained Language Model for General NLP Tasks in Mathematics Education

Measuring and Improving BERT's Mathematical Abilities by Predicting the Order of Reasoning

Reflection of Thought: Inversely Eliciting Numerical Reasoning in Language Models via Solving Linear Systems

Arithmetic with language models: From memorization to computation

Injecting Numerical Reasoning Skills into Language Models

Pre-trained Large Language Models Use Fourier Features to Compute Addition

An Improved Math Word Problem (MWP) Model Using Unified Pretrained Language Model (UniLM) for Pretraining

MWP-BERT: Numeracy-Augmented Pre-training for Math Word Problem Solving

Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs

Interpreting Arithmetic Mechanism in Large Language Models through Comparative Neuron Analysis

Scaling Behavior for Large Language Models regarding Numeral Systems: An Example using Pythia