Abstract: The overwhelming majority of scientific knowledge is published as text, which is difficult to analyse by either traditional statistical analysis or modern machine learning methods. By contrast, the main source of machine-interpretable data for the materials research community has come from structured property databases [1,2], which encompass only a small fraction of the knowledge present in the research literature. Beyond property values, publications contain valuable knowledge regarding the connections and relationships between data items as interpreted by the authors. To improve the identification and use of this knowledge, several studies have focused on the retrieval of information from scientific literature using supervised natural language processing [3,4,5,6,7,8,9,10], which requires large hand-labelled datasets for training. Here we show that materials science knowledge present in the published literature can be efficiently encoded as information-dense word embeddings [11,12,13] (vector representations of words) without human labelling or supervision. Without any explicit insertion of chemical knowledge, these embeddings capture complex materials science concepts such as the underlying structure of the periodic table and structure–property relationships in materials. Furthermore, we demonstrate that an unsupervised method can recommend materials for functional applications several years before their discovery. This suggests that latent knowledge regarding future discoveries is to a large extent embedded in past publications. Our findings highlight the possibility of extracting knowledge and relationships from the massive body of scientific literature in a collective manner, and point towards a generalized approach to the mining of scientific literature.
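
To make the idea of recommending materials from word embeddings concrete, the following is a minimal sketch of one common way such vectors can be queried: rank candidate words by cosine similarity to an application word. It is not the authors' implementation. The vectors are three-dimensional toy values invented for illustration, the vocabulary ("thermoelectric", "Bi2Te3", "PbTe", "NaCl") is a hypothetical example, and the helper names `cosine_similarity` and `rank_by_similarity` are assumptions; real embeddings would be learned from a text corpus and have far more dimensions.

```python
import math

# Toy 3-dimensional vectors, invented purely for illustration.
# Learned embeddings would come from training on a literature corpus.
toy_embeddings = {
    "thermoelectric": [0.9, 0.1, 0.2],
    "Bi2Te3": [0.8, 0.2, 0.1],
    "PbTe": [0.7, 0.3, 0.3],
    "NaCl": [0.1, 0.9, 0.2],
}


def cosine_similarity(u, v):
    """Return the cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(query, embeddings):
    """Rank every word except the query by similarity to the query word."""
    query_vector = embeddings[query]
    scores = [
        (word, cosine_similarity(query_vector, vector))
        for word, vector in embeddings.items()
        if word != query
    ]
    # Highest similarity first.
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


ranking = rank_by_similarity("thermoelectric", toy_embeddings)
for word, score in ranking:
    print(f"{word}: {score:.3f}")
```

With these toy values, the two words placed near the application word rank ahead of the unrelated one. In a realistic setting the ranking would depend entirely on the trained embeddings, so the ordering here says nothing about actual materials.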

A GPT-assisted iterative method for extracting domain knowledge from a large volume of literature of electromagnetic wave absorbing materials with limited manually annotated data

Text to Insight: Accelerating Organic Materials Knowledge Extraction via Deep Learning

Large Language Models as Master Key: Unlocking the Secrets of Materials Science with GPT

Flexible, Model-Agnostic Method for Materials Data Extraction from Text Using General Purpose Language Models

Mining experimental data from Materials Science literature with Large Language Models: an evaluation study

Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature

Polymetis: Large Language Modeling for Multiple Material Domains

Towards Development of Automated Knowledge Maps and Databases for Materials Engineering using Large Language Models

NLP for Knowledge Discovery and Information Extraction from Energetics Corpora

An automatic descriptors recognizer customized for materials science literature

Application of machine reading comprehension techniques for named entity recognition in materials science

A Large Language Model-Powered Literature Review for High-Angle Annular Dark Field Imaging

Ensemble pretrained language models to extract biomedical knowledge from literature

Construction and Application of Materials Knowledge Graph Based on Author Disambiguation: Revisiting the Evolution of LiFePO4

Extracting accurate materials data from research papers with conversational language models and prompt engineering

Reconstructing Materials Tetrahedron: Challenges in Materials Information Extraction

“FabNER”: information extraction from manufacturing process science domain literature using named entity recognition

Unsupervised word embeddings capture latent knowledge from materials science literature

Construction and Application of Materials Knowledge Graph in Multidisciplinary Materials Science via Large Language Model