Atoms as words: A novel approach to deciphering material properties using NLP-inspired machine learning on crystallographic information files (CIFs)

Lalit Yadav
DOI: https://doi.org/10.1063/5.0187741
IF: 1.697
2024-04-01
AIP Advances
Abstract: In condensed matter physics and materials science, predicting material properties necessitates understanding intricate many-body interactions. Conventional methods such as density functional theory and molecular dynamics often resort to simplifying approximations and are computationally expensive. Meanwhile, recent machine learning methods use handcrafted descriptors for material representation, which sometimes neglect vital crystallographic information and are often limited to single property prediction or a sub-class of crystal structures. In this study, we pioneer an unsupervised strategy, drawing inspiration from natural language processing to harness the underutilized potential of Crystallographic Information Files (CIFs). We conceptualize atoms and atomic positions within a crystallographic information file similarly to words in textual content. Using a Word2Vec-inspired technique, we produce atomic embeddings that capture intricate atomic relationships. Our model, CIFSemantics, trained on the extensive Materials Project dataset, adeptly predicts 15 distinct material properties from the CIFs. Its performance rivals that of specialized models, marking a significant step forward in material property predictions.
materials science, multidisciplinary; nanoscience & nanotechnology; physics, applied
What problem does this paper attempt to address?
The paper addresses the problem of predicting material properties in materials science, specifically how to extract those properties effectively from Crystallographic Information Files (CIFs). Its main contributions are:

1. **Proposing a novel method**: Inspired by Natural Language Processing (NLP), the authors treat the atoms and atomic positions in a CIF as words in a text, and use a Word2Vec-like machine learning method to generate atomic embeddings that capture the complex relationships between atoms.
2. **Addressing the limitations of traditional methods**: Conventional methods such as Density Functional Theory (DFT) and molecular dynamics rely on simplifying approximations and are computationally expensive. Existing machine learning methods often depend on handcrafted descriptors, which may overlook important crystallographic information and are often limited to predicting a single property or a sub-class of crystal structures.
3. **Developing a model named CIFSemantics**: Trained on a large-scale dataset from the Materials Project database, the model predicts 15 different material properties from CIFs, with performance comparable to models designed specifically for individual properties.
4. **Validating the effectiveness of atomic embeddings**: Experimental results show that the learned atomic embeddings reflect similarities and differences among elements in the periodic table, indicating that the method captures fundamental chemical properties.
5. **Evaluating prediction performance**: The model performs well on multiple property prediction tasks, including energy, density, and band gap, and is competitive with results reported in other literature.
6. **Demonstrating the model's generality and adaptability**: Beyond predicting properties from full CIFs, the model can also make effective predictions when only the chemical formula is known, for example predicting the Curie temperature of ferromagnetic materials, predicting solute diffusion barriers in metals, and screening Fractionally Doped Perovskite Oxides (FDPO).

In summary, the paper introduces a method that extracts material properties directly from crystallographic information, providing new tools and perspectives for materials science research.
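The atoms-as-words idea can be illustrated with a small Python sketch. It is not the authors' implementation: the summary does not say how CIFSemantics tokenizes a CIF, so the tokenization scheme (one token per element symbol, plus binned fractional coordinates), the toy structures, and the helper names `tokenize`, `skipgram_pairs`, and `cooccurrence_vectors` are all assumptions made for illustration. Instead of training a Word2Vec model, the sketch uses raw context counts as a stand-in for learned embeddings, which is enough to show how elements appearing in similar crystallographic contexts end up with similar vectors.

```python
from collections import Counter
from math import sqrt

# Hypothetical, heavily simplified atom-site rows (element, x, y, z) in the
# spirit of a CIF _atom_site loop. These are toy values, not real CIF data.
TOY_STRUCTURES = {
    "NaCl": [("Na", 0.0, 0.0, 0.0), ("Cl", 0.5, 0.5, 0.5)],
    "KCl": [("K", 0.0, 0.0, 0.0), ("Cl", 0.5, 0.5, 0.5)],
    "MgO": [("Mg", 0.0, 0.0, 0.0), ("O", 0.5, 0.5, 0.5)],
}


def tokenize(sites, bins=4):
    """Turn atom sites into a flat token "sentence".

    Assumption: each element symbol is one token, and each fractional
    coordinate is bucketed into one of `bins` position tokens per axis.
    """
    tokens = []
    for element, *coords in sites:
        tokens.append(element)
        for axis, value in zip("xyz", coords):
            bucket = min(int(value * bins), bins - 1)
            tokens.append(f"{axis}{bucket}")
    return tokens


def skipgram_pairs(tokens, window=4):
    """Yield (center, context) pairs, as in skip-gram Word2Vec training data."""
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]


def cooccurrence_vectors(structures, window=4):
    """Count-based stand-in for learned embeddings: context counts per token."""
    vectors = {}
    for sites in structures.values():
        for center, context in skipgram_pairs(tokenize(sites), window):
            vectors.setdefault(center, Counter())[context] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


vectors = cooccurrence_vectors(TOY_STRUCTURES)
print(tokenize(TOY_STRUCTURES["NaCl"]))
print("Na vs K :", cosine(vectors["Na"], vectors["K"]))
print("Na vs Mg:", cosine(vectors["Na"], vectors["Mg"]))
```

In this toy corpus Na and K occupy the same site next to the same anion, so their context vectors are closer to each other than either is to Mg. A real pipeline would presumably replace the count vectors with embeddings trained by a Word2Vec-style model over many CIFs, and then feed those embeddings to a downstream property predictor.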