MatSci-NLP: Evaluating Scientific Language Models on Materials Science Language Tasks Using Text-to-Schema Modeling

Yu Song, Santiago Miret, Bang Liu
2023-05-15
Abstract: We present MatSci-NLP, a natural language benchmark for evaluating the performance of natural language processing (NLP) models on materials science text. We construct the benchmark from publicly available materials science text data to encompass seven different NLP tasks, including conventional NLP tasks like named entity recognition and relation classification, as well as NLP tasks specific to materials science, such as synthesis action retrieval, which relates to creating synthesis procedures for materials. We study various BERT-based models pretrained on different scientific text corpora on MatSci-NLP to understand the impact of pretraining strategies on understanding materials science text. Given the scarcity of high-quality annotated data in the materials science domain, we perform our fine-tuning experiments with limited training data to encourage generalization across MatSci-NLP tasks. Our experiments in this low-resource training setting show that language models pretrained on scientific text outperform BERT trained on general text. MatBERT, a model pretrained specifically on materials science journals, generally performs best for most tasks. Moreover, we propose a unified text-to-schema method for multitask learning on MatSci-NLP and compare its performance with traditional fine-tuning methods. In our analysis of different training methods, we find that our proposed text-to-schema methods inspired by question-answering consistently outperform single and multitask NLP fine-tuning methods. The code and datasets are publicly available at https://github.com/BangLab-UdeM-Mila/NLP4MatSci-ACL23.
Computation and Language, Materials Science, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses natural language processing (NLP) tasks in the field of materials science. Specifically:

1. **Develop and evaluate NLP models applicable to materials science texts**: Materials science research involves a large amount of text data, such as journal articles, patents, and technical reports. These texts contain rich knowledge, but effective tools for processing and understanding them are lacking. The research therefore develops an NLP benchmark (MatSci-NLP) specifically for materials science text to evaluate the performance of different NLP models on materials science tasks.
2. **Explore the impact of pre-training strategies on the performance of downstream tasks**: Because high-quality labeled data is scarce in materials science, the researchers examine how different pre-training strategies (for example, pre-training on general text versus domain-specific text) affect model performance on materials science tasks. In particular, the research asks whether language models dedicated to materials science (such as MatBERT) are more effective than general-purpose language models (such as BERT).
3. **Propose and validate new multi-task learning methods**: To improve learning efficiency in low-resource settings, the research proposes a text-to-schema-based multi-task learning method and compares it with traditional single-task and multi-task fine-tuning methods. The results show that this method outperforms both baselines on most tasks.
### Specific problem decomposition

- **Q1: What is the impact of in-domain pre-training on the downstream performance of language models on MatSci-NLP tasks?**
  - The research finds that models pre-trained specifically on materials science text (such as MatBERT) usually perform best on most tasks, followed by SciBERT. This indicates that in-domain pre-training helps models acquire domain-relevant knowledge.
- **Q2: How do contextual data patterns and multi-task learning affect the learning efficiency in low-resource training environments?**
  - The experimental results show that the question-answering-inspired text-to-schema method (Task-Schema) performs best for most models and is superior to the single-task and multi-task fine-tuning settings.

By investigating these questions, the authors hope to promote the development of NLP tools for materials science, thereby accelerating the discovery, synthesis, and application of new materials.
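The text-to-schema idea discussed under Q2 can be illustrated with a minimal sketch: examples from different tasks are serialized into one shared, question-like input format so that a single model can be trained on all of them. This is an illustration only. The `Example` class, the `to_schema_prompt` function, the `|` separator, and the label sets are all hypothetical and are not taken from the paper or its code repository.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    """One annotated sample; all field names here are hypothetical."""
    task: str          # task name, e.g. "named entity recognition"
    text: str          # the source sentence
    schema: List[str]  # candidate labels the model may choose from


def to_schema_prompt(ex: Example) -> str:
    """Serialize an example into a single unified input string.

    Assumption: the task name and its label schema are prepended to the
    text, so examples from every task share one input format.
    """
    options = ", ".join(ex.schema)
    return f"task: {ex.task} | schema: {options} | text: {ex.text}"


# Hypothetical samples from two different tasks (labels are made up).
examples = [
    Example(
        task="named entity recognition",
        text="LiFePO4 was annealed at 700 C.",
        schema=["material", "property", "condition"],
    ),
    Example(
        task="synthesis action retrieval",
        text="The powder was ground and calcined.",
        schema=["mixing", "heating", "cooling"],
    ),
]

# Both tasks now share one input format and can be mixed in one batch.
prompts = [to_schema_prompt(ex) for ex in examples]
for p in prompts:
    print(p)
```

Because every task is reduced to the same string layout, adding a new task in this sketch only requires supplying a task name and a label list, with no task-specific model head.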