Pre-training and Evaluation of Numeracy-Oriented Language Model.

Fuli Feng, Xilin Rui, Wenjie Wang, Yixin Cao, Tat-Seng Chua
DOI: https://doi.org/10.1145/3490354.3494412
2021-01-01
Abstract: Pre-trained language models (LMs) have led to significant performance gains in various natural language processing (NLP) applications due to their strong literacy, e.g., the ability to capture word dependencies. However, existing pre-trained LMs largely ignore numeracy: they treat numbers within text as plain words and do not understand basic numerical concepts. This weak numeracy has become a barrier to the use of pre-trained LMs in NLP applications over number-intensive financial documents such as annual filings and analyst reports, even as the understanding and analysis of such documents are becoming increasingly important. To bridge this gap, this work explores numerical pre-training to empower LMs with numeracy. In particular, we propose two numerical pre-training methods with objectives that encourage the LM to understand the magnitude and value of numbers and to encode the dependency between a number and its context. By applying the proposed methods to BERT, we pre-train two LMs, named BERT-M and BERT-V. Moreover, we construct four datasets of financial documents for evaluating the numeracy of pre-trained LMs, which focus on three fundamental perspectives of numeracy: a) number embedding; b) number-text composition; and c) number-number composition. Extensive experiments on the datasets validate the effectiveness of the pre-trained BERT-M and BERT-V, which outperform the state-of-the-art LM for financial documents (FinBERT) by 4.83% and 4.34% on average, respectively. Furthermore, their aggregation, named BERT-MV, increases the gain to 10.88%.