Arithmetic-Based Pretraining -- Improving Numeracy of Pretrained Language Models

Dominic Petrak, Nafise Sadat Moosavi, Iryna Gurevych
DOI: https://doi.org/10.48550/arXiv.2205.06733
2023-06-09
Abstract: State-of-the-art pretrained language models tend to perform below their capabilities when applied out-of-the-box on tasks that require understanding and working with numbers. Recent work suggests two main reasons for this: (1) popular tokenisation algorithms have limited expressiveness for numbers, and (2) common pretraining objectives do not target numeracy. Approaches that address these shortcomings usually require architectural changes or pretraining from scratch. In this paper, we propose a new extended pretraining approach called Arithmetic-Based Pretraining that jointly addresses both in one extended pretraining step without requiring architectural changes or pretraining from scratch. Arithmetic-Based Pretraining combines contrastive learning to improve the number representation, and a novel extended pretraining objective called Inferable Number Prediction Task to improve numeracy. Our experiments show the effectiveness of Arithmetic-Based Pretraining in three different tasks that require improved numeracy, i.e., reading comprehension in the DROP dataset, inference-on-tables in the InfoTabs dataset, and table-to-text generation in the WikiBio and SciGen datasets.
Computation and Language
What problem does this paper attempt to address?
The paper addresses the weak performance of pre-trained language models on tasks that require understanding and manipulating numbers, such as reading comprehension, table reasoning, and table-to-text generation, where these models often fall short of their potential. It identifies two main reasons:

1. **Limited numerical expressiveness of popular tokenization algorithms**: Common tokenization algorithms (such as byte pair encoding and WordPiece) handle numbers poorly because they are designed to capture frequently occurring patterns in text, while numbers follow different frequency distributions. As a result, similar numbers can be split into quite different token sequences, which hampers the model's understanding of numbers.
2. **Common pre-training objectives do not target numeracy**: Existing pre-training objectives (such as denoising autoencoding and masked language modeling) focus mainly on language structure and semantics rather than on numerical computation.

To address these issues, the paper proposes a new extended pre-training method, Arithmetic-Based Pretraining. It tackles both problems in a single extended pre-training step, jointly improving number representation and numerical computation ability, without modifying the model architecture or pre-training from scratch. The method has two main components:

- **Contrastive learning**: Combines subword-level and character-level tokenization to improve number representation.
- **Inferable Number Prediction Task**: A new pre-training objective that improves the model's ability to work with numbers.
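
The tokenization issue and the contrastive idea can be illustrated with a minimal Python sketch. It is illustrative only and not the paper's implementation: `TOY_VOCAB`, `subword_tokenize`, `char_tokenize`, and `info_nce` are hypothetical names, the vocabulary is hand-picked rather than learned from a corpus, and InfoNCE is assumed here as a stand-in contrastive loss, since this summary does not state which loss the paper uses.

```python
import math

# Hypothetical toy vocabulary. A real BPE/WordPiece vocabulary is learned
# from corpus frequencies, so which number pieces exist is data-dependent.
TOY_VOCAB = {"0", "1", "2", "3", "4", "5", "6", "7", "8", "9",
             "10", "12", "25", "99", "100"}


def subword_tokenize(number: str, vocab=TOY_VOCAB) -> list[str]:
    """Greedy longest-match-first segmentation, in the spirit of WordPiece."""
    tokens, i = [], 0
    while i < len(number):
        for j in range(len(number), i, -1):
            if number[i:j] in vocab:
                tokens.append(number[i:j])
                i = j
                break
        else:
            raise ValueError(f"no vocabulary entry covers {number[i]!r}")
    return tokens


def char_tokenize(number: str) -> list[str]:
    """Character-level tokenization: one token per character."""
    return list(number)


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss over a batch.

    anchors[i] and positives[i] stand for two representations of the same
    number (for example subword-level and character-level); the remaining
    positives in the batch serve as negatives for anchors[i].
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        peak = max(logits)  # subtract the max for numerical stability
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


# Neighbouring numbers receive differently shaped subword segmentations,
# while the character-level view stays uniform.
print(subword_tokenize("1025"))  # ['10', '25']
print(subword_tokenize("1026"))  # ['10', '2', '6']
print(char_tokenize("1025"))     # ['1', '0', '2', '5']
```

In this sketch the loss is small when the two views of the same number have similar vectors and large when they are mismatched, which is the property a contrastive objective relies on to align the two representations.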
Experimental results show that Arithmetic-Based Pretraining improves model performance on several tasks that require numeracy: reading comprehension on the DROP dataset, inference-on-tables on the InfoTabs dataset, and table-to-text generation on the WikiBio and SciGen datasets.