Literature Classification and its Applications in Condensed Matter Physics and Materials Science by Natural Language Processing

Siyuan Wu, Tiannian Zhu, Sijia Tu, Ruijuan Xiao, Jie Yuan, Quansheng Wu, Hong Li, Hongming Weng
DOI: https://doi.org/10.1088/1674-1056/ad3c30
2024-04-10
Chinese Physics B
Abstract: The exponential growth of the literature makes it increasingly difficult for researchers to obtain comprehensive information in related fields. While natural language processing (NLP) may offer an effective solution to literature classification, it remains hindered by the lack of labelled datasets. In this article, we introduce a novel method for generating literature classification models through semi-supervised learning, which can generate labelled datasets iteratively with limited human input. We apply this method to train NLP models for classifying literature related to several research directions, namely battery, superconductor, topological material, and artificial intelligence (AI) in materials science. The trained NLP 'battery' model, applied to a larger dataset distinct from the training and testing datasets, achieves an F1 score of 0.738, which indicates the accuracy and reliability of this scheme. Furthermore, our approach demonstrates that even with insufficient data, the not-yet-well-trained models from the first few cycles can identify relationships among different research fields and facilitate the discovery and understanding of interdisciplinary directions.
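For readers unfamiliar with the two ideas the abstract relies on, iterative growth of a labelled dataset and evaluation by F1 score, the following is a minimal sketch in plain Python. It is not the authors' procedure: it uses generic self-training with a small naive Bayes text classifier, and a confidence threshold stands in for the limited human input the abstract mentions. All function names, the threshold, and the toy titles are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def train(labelled):
    """Collect per-class word counts and document counts (naive Bayes statistics)."""
    counts = {0: Counter(), 1: Counter()}
    docs = Counter()
    for text, label in labelled:
        counts[label].update(tokenize(text))
        docs[label] += 1
    return counts, docs


def predict_proba(model, text):
    """Estimate P(label == 1 | text) with add-one smoothing."""
    counts, docs = model
    vocab_size = len(set(counts[0]) | set(counts[1]))
    log_score = {}
    for label in (0, 1):
        total = sum(counts[label].values())
        score = math.log((docs[label] + 1) / (docs[0] + docs[1] + 2))
        for word in tokenize(text):
            score += math.log((counts[label][word] + 1) / (total + vocab_size + 1))
        log_score[label] = score
    # Normalise the two log scores into a probability for class 1.
    top = max(log_score.values())
    e0 = math.exp(log_score[0] - top)
    e1 = math.exp(log_score[1] - top)
    return e1 / (e0 + e1)


def self_train(labelled, unlabelled, threshold=0.8, cycles=3):
    """Generic self-training: confident predictions are added as new labels.

    Assumption: the threshold replaces the human check described in the abstract.
    """
    labelled, pool = list(labelled), list(unlabelled)
    for _ in range(cycles):
        model = train(labelled)
        remaining = []
        for text in pool:
            p = predict_proba(model, text)
            if p >= threshold:
                labelled.append((text, 1))
            elif p <= 1 - threshold:
                labelled.append((text, 0))
            else:
                remaining.append(text)  # still ambiguous, keep for a later cycle
        if len(remaining) == len(pool):
            break  # nothing new was labelled in this cycle
        pool = remaining
    return train(labelled), labelled


def f1_score(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy, made-up titles (1 = battery-related, 0 = not); not data from the paper.
seed = [
    ("lithium ion battery cathode material", 1),
    ("solid electrolyte for lithium battery", 1),
    ("superconducting transition in cuprates", 0),
    ("topological insulator surface states", 0),
]
pool = [
    "lithium battery anode degradation",
    "cuprates superconducting gap",
    "neural network potentials",
]
model, grown = self_train(seed, pool)
```

Here `grown` holds the seed labels plus any pool items the classifier labelled with confidence, and `f1_score` can then be applied to predictions on a separate held-out set, which is the kind of evaluation behind a figure such as the reported 0.738.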
physics, multidisciplinary