MathBERT: A Pre-Trained Model for Mathematical Formula Understanding

Shuai Peng, Ke Yuan, Liangcai Gao, Zhi Tang
DOI: https://doi.org/10.48550/arXiv.2105.00377
2021-05-02
Abstract: Large-scale pre-trained models like BERT have achieved great success in various Natural Language Processing (NLP) tasks, but adapting them to math-related tasks remains a challenge. Current pre-trained models neglect the structural features of formulas and the semantic correspondence between a formula and its context. To address these issues, we propose a novel pre-trained model, MathBERT, which is jointly trained with mathematical formulas and their corresponding contexts. In addition, to further capture the semantic-level structural features of formulas, a new pre-training task is designed to predict masked formula substructures extracted from the Operator Tree (OPT), the semantic structural representation of formulas. We conduct experiments on three downstream tasks to evaluate the performance of MathBERT: mathematical information retrieval, formula topic classification, and formula headline generation. Experimental results demonstrate that MathBERT significantly outperforms existing methods on all three tasks. Moreover, we qualitatively show that this pre-trained model effectively captures the semantic-level structural information of formulas. To the best of our knowledge, MathBERT is the first pre-trained model for mathematical formula understanding.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the poor performance of existing large-scale pre-trained models on math-related tasks. The main reason is that these models ignore the structural features of mathematical formulas and the semantic correspondence between formulas and their contexts. To meet this challenge, the authors propose a new pre-trained model, MathBERT, which is jointly trained on mathematical formulas and their corresponding contexts. In addition, to capture the structural features of formulas at the semantic level, they design a new pre-training task: predicting masked formula substructures extracted from the Operator Tree (OPT). In this way, MathBERT aims to better understand and process mathematical formulas, and so to perform better on downstream tasks such as mathematical information retrieval, formula topic classification, and formula headline generation. Specifically, the main contributions of the paper are:
- Proposing MathBERT, the first pre-trained model for mathematical formula understanding, which is jointly trained on mathematical formulas, their contexts, and their operator trees.
- Designing a new pre-training task to capture the structural information of mathematical formulas at the semantic level.
- Showing that MathBERT performs significantly better than existing methods on three downstream tasks.
- Constructing a new dataset of mathematical formulas and their corresponding contexts for formula topic classification, with plans to make this dataset public.
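
The idea of predicting masked substructures from an operator tree can be illustrated with a small sketch. This is not the authors' implementation: the `Node` class, the choice of parent-child label pairs as "substructures", the 15% masking ratio, and the `[MASK]` token are all assumptions made for illustration, since the text above does not specify them.

```python
import random
from dataclasses import dataclass, field
from typing import List

MASK = "[MASK]"  # assumed placeholder token, borrowed from BERT-style masking


@dataclass
class Node:
    """One operator-tree node: an operator or operand label plus its children."""
    label: str
    children: List["Node"] = field(default_factory=list)


def substructures(root):
    """Collect (parent, child) label pairs for every edge, in pre-order.

    Treating each edge as one substructure is an assumption of this sketch.
    """
    pairs = []
    stack = [root]
    while stack:
        node = stack.pop()
        for child in node.children:
            pairs.append((node.label, child.label))
        # Push children reversed so the leftmost child is visited first.
        stack.extend(reversed(node.children))
    return pairs


def mask_substructures(pairs, ratio=0.15, seed=0):
    """Hide a random subset of substructures; return model inputs and targets."""
    rng = random.Random(seed)
    k = max(1, round(len(pairs) * ratio))  # always mask at least one pair
    masked = set(rng.sample(range(len(pairs)), k))
    inputs = [(MASK, MASK) if i in masked else p for i, p in enumerate(pairs)]
    targets = {i: pairs[i] for i in masked}  # what the model should predict
    return inputs, targets


# Operator tree for the formula a^2 + b.
tree = Node("+", [Node("^", [Node("a"), Node("2")]), Node("b")])
pairs = substructures(tree)
inputs, targets = mask_substructures(pairs)
```

During pre-training, a model would receive `inputs` alongside the formula tokens and context, and be trained to recover the entries in `targets`.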