Leveraging language representation for materials exploration and discovery

Jiaxing Qu, Yuxuan Richard Xie, Kamil M. Ciesielski, Claire E. Porter, Eric S. Toberer, Elif Ertekin
DOI: https://doi.org/10.1038/s41524-024-01231-8
IF: 12.256
2024-03-22
npj Computational Materials
Abstract: Data-driven approaches to materials exploration and discovery are building momentum due to emerging advances in machine learning. However, parsimonious representations of crystals for navigating the vast materials search space remain limited. To address this limitation, we introduce a materials discovery framework that utilizes natural language embeddings from language models as representations of compositional and structural features. The contextual knowledge encoded in these language representations conveys information about material properties and structures, enabling both similarity analysis to recall relevant candidates based on a query material and multi-task learning to share information across related properties. Applying this framework to thermoelectrics, we demonstrate diversified recommendations of prototype crystal structures and identify under-studied material spaces. Validation through first-principles calculations and experiments confirms the potential of the recommended materials as high-performance thermoelectrics. Language-based frameworks offer versatile and adaptable embedding structures for effective materials exploration and discovery, applicable across diverse material systems.
materials science, multidisciplinary; chemistry, physical
What problem does this paper attempt to address?
This paper addresses a central challenge in materials exploration and discovery: finding new materials with desired properties in a vast search space. Machine learning methods are increasingly used in materials science, but the lack of concise crystal representations that support efficient search remains a limiting factor. The paper proposes a materials discovery framework that uses natural language embeddings from pre-trained language models, which capture contextual knowledge about material properties and structures. The framework converts the compositional and structural features of materials into natural language representations. These representations enable similarity analysis, which recalls candidate materials relevant to a query material, and multi-task learning, which shares information among related properties. Applied to thermoelectrics, the approach recommends prototype crystals with diverse structures and identifies underexplored material spaces. First-principles calculations and experiments provide preliminary validation of the recommended materials as high-performance thermoelectrics.
The paper notes that early material representations relied on manually designed descriptors, while more recent methods treat atomic structures as graphs. Both approaches are limited in their ability to provide general, task-agnostic representations. In contrast, pre-trained Transformer models produce contextual embeddings for materials science text, offering flexible and adaptable embedding structures for materials exploration. By evaluating different embedding methods, the study demonstrates that language representations are effective both for recalling relevant material candidates and for predicting material properties.
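The similarity-based recall step described above can be illustrated with a minimal sketch. It assumes cosine similarity as the ranking metric and uses invented three-dimensional vectors with placeholder material names; the function names `cosine_similarity` and `recall_candidates` are hypothetical, and none of this is taken from the authors' code. Real language-model embeddings would have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recall_candidates(query, embeddings, top_k=2):
    """Rank all other materials by similarity to the query material."""
    query_vec = embeddings[query]
    scored = [
        (name, cosine_similarity(query_vec, vec))
        for name, vec in embeddings.items()
        if name != query  # do not recommend the query itself
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical embeddings, invented for illustration only.
toy_embeddings = {
    "material_A": [1.0, 0.0, 0.0],
    "material_B": [0.9, 0.1, 0.0],
    "material_C": [0.0, 1.0, 0.0],
    "material_D": [0.5, 0.5, 0.0],
}

ranked = recall_candidates("material_A", toy_embeddings, top_k=2)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

With these toy vectors, `material_B` ranks first and `material_D` second, because their vectors point in directions closest to that of `material_A`.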
The paper also adopts a Multi-gate Mixture-of-Experts (MMoE) model for multi-task learning, exploiting the correlations between different material property prediction tasks to improve learning efficiency and accuracy. Applied to the search for high-performance thermoelectric materials, the framework identifies structurally diverse thermoelectric candidates as well as underexplored material spaces. First-principles calculations and experiments support the framework's recommendations, suggesting a new avenue for efficient exploration and discovery in materials science.
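For readers unfamiliar with MMoE, the following is a minimal sketch of a forward pass in its general form: shared experts process the same input embedding, and each task mixes the expert outputs through its own softmax gate before a task-specific output layer. The layer sizes, the random untrained weights, the single linear "tower" per task, and all function names are assumptions made for illustration. This is not the architecture or configuration used in the paper.

```python
import math
import random


def linear(weights, x):
    """Multiply a weight matrix (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def relu(v):
    return [max(0.0, a) for a in v]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def mmoe_forward(x, experts, gates, towers):
    """Return one scalar prediction per task, plus each task's gate weights."""
    # Every shared expert sees the same input embedding.
    expert_outputs = [relu(linear(w, x)) for w in experts]
    hidden_dim = len(expert_outputs[0])
    predictions, gate_weights = [], []
    for gate, tower in zip(gates, towers):
        # Each task has its own softmax gate over the shared experts.
        g = softmax(linear(gate, x))
        # Task-specific weighted mixture of the expert outputs.
        mixed = [
            sum(g[i] * expert_outputs[i][j] for i in range(len(experts)))
            for j in range(hidden_dim)
        ]
        # A linear tower maps the mixture to a scalar property prediction.
        predictions.append(sum(t * m for t, m in zip(tower, mixed)))
        gate_weights.append(g)
    return predictions, gate_weights


# Hypothetical sizes and random, untrained weights (illustration only).
rng = random.Random(0)
embed_dim, hidden_dim, n_experts, n_tasks = 4, 3, 2, 2
experts = [random_matrix(hidden_dim, embed_dim, rng) for _ in range(n_experts)]
gates = [random_matrix(n_experts, embed_dim, rng) for _ in range(n_tasks)]
towers = [[rng.uniform(-1.0, 1.0) for _ in range(hidden_dim)] for _ in range(n_tasks)]

x = [0.2, -0.1, 0.4, 0.3]  # hypothetical material embedding
predictions, gate_weights = mmoe_forward(x, experts, gates, towers)
print(predictions, gate_weights)
```

Because the gates are task-specific while the experts are shared, related property-prediction tasks can reuse the same experts, and less related tasks can weight them differently. In a trained model the expert, gate, and tower weights would all be learned jointly.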