Abstract: Recently, the remarkable capabilities of large language models (LLMs) have been illustrated across a variety of research domains such as natural language processing, computer vision, and molecular modeling. We extend this paradigm by utilizing LLMs for material property prediction by introducing our model Materials Informatics Transformer (MatInFormer). Specifically, we introduce a novel approach that involves learning the grammar of crystallography through the tokenization of pertinent space group information. We further illustrate the adaptability of MatInFormer by incorporating task-specific data pertaining to Metal-Organic Frameworks (MOFs). Through attention visualization, we uncover the key features that the model prioritizes during property prediction. The effectiveness of our proposed model is empirically validated across 14 distinct datasets, thereby underscoring its potential for high-throughput screening through accurate material property prediction.

What problem does this paper attempt to address?

The paper addresses three main problems:

1. **Challenges in material property prediction**: Graph neural networks (GNNs) have made significant progress in material property prediction, but they have limitations. They require relaxed structures as input, which in turn requires density functional theory (DFT) calculations and raises the computational cost, especially for large-scale systems such as metal-organic frameworks (MOFs) or supercells used to study crystal defect properties. GNNs also have difficulty capturing global features such as the crystal system, lattice parameters, and periodicity.
2. **Interpretability and flexibility**: Existing material property prediction models often lack interpretability, so it is difficult to see which features a model prioritizes when making a prediction. Their flexibility in handling different types of material data also needs improvement.
3. **Large-scale pre-training methods**: Designing appropriate pre-training methods is a key issue when developing large language models (LLMs) for material property prediction. Traditional GNN pre-training methods, such as node and edge prediction or self-supervised learning, are not suitable for LLMs, so new pre-training strategies need to be explored.

To address these problems, the paper introduces the **Materials Informatics Transformer (MatInFormer)**. Its main contributions are:

- **Text representation of materials**: Information about crystalline materials (such as space groups, informatics features, and chemical formulas) is converted into text sequences that an LLM can process.
- **Design of the Transformer architecture**: Built on the RoBERTa architecture, MatInFormer can learn the geometric information of crystal systems and provides a degree of interpretability through its attention mechanism.
- **Pre-training strategies**: Three pre-training strategies are proposed to improve model performance: masked language modeling (MLM), lattice parameter prediction (LPP), and the combination of the two (MLM + LPP).

According to the paper, these innovations allow MatInFormer to perform well across multiple datasets while also offering interpretability and flexibility, providing a new method for material property prediction.
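The text-representation and MLM ideas summarized above can be illustrated with a minimal sketch. This is not MatInFormer's actual tokenizer or vocabulary: the token names (`SG_167`, `SYS_trigonal`), the helper functions, and the sequence layout are all hypothetical, chosen only to show how space group information and a chemical formula might be turned into a token sequence and then masked for pre-training. The masking is also simplified: RoBERTa-style MLM replaces only some selected tokens with `[MASK]`, whereas this sketch always does.

```python
import random
import re


def formula_tokens(formula):
    """Split a simple formula such as 'Fe2O3' into element and count tokens.

    Assumption: no parentheses or fractional counts; a missing count means 1.
    """
    tokens = []
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        tokens.append(element)
        tokens.append(count or "1")
    return tokens


def crystal_to_tokens(formula, space_group_number, crystal_system):
    """Build a hypothetical text sequence: space group tokens, then formula."""
    header = ["[CLS]", f"SG_{space_group_number}", f"SYS_{crystal_system}", "[SEP]"]
    return header + formula_tokens(formula) + ["[SEP]"]


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Simplified MLM masking: hide each non-special token with prob. mask_prob.

    Returns the masked sequence and per-position labels, where a label is the
    original token at masked positions and None elsewhere.
    """
    rng = random.Random(seed)
    special = {"[CLS]", "[SEP]"}
    masked, labels = [], []
    for tok in tokens:
        if tok not in special and rng.random() < mask_prob:
            masked.append("[MASK]")
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


# Hematite (Fe2O3) crystallizes in space group 167, trigonal crystal system.
sequence = crystal_to_tokens("Fe2O3", 167, "trigonal")
masked_sequence, labels = mask_tokens(sequence, mask_prob=0.15, seed=0)
```

During pre-training, a model would be asked to recover the original token at each masked position, which is how it could pick up regularities linking space group tokens to composition. A lattice parameter prediction head would instead regress numeric targets from the same sequence and is not shown here.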

Materials Informatics Transformer: A Language Model for Interpretable Materials Properties Prediction

Toward Accurate Interpretable Predictions of Materials Properties within Transformer Language Models

Matminer: an Open Source Toolkit for Materials Data Mining

LLM4Mat-Bench: Benchmarking Large Language Models for Materials Property Prediction

Crystal Composition Transformer: Self-Learning Neural Language Model for Generative and Tinkering Design of Materials

Crystal Transformer: Self-learning neural language model for Generative and Tinkering Design of Materials

From Tokens to Materials: Leveraging Language Models for Scientific Discovery

MatText: Do Language Models Need More than Text & Scale for Materials Modeling?

Polymetis: Large Language Modeling for Multiple Material Domains

Materials Transformers Language Models for Generative Materials Design: a benchmark study

Large Language Models for Material Property Predictions: elastic constant tensor prediction and materials design

MatExpert: Decomposing Materials Discovery by Mimicking Human Experts

Evaluation of Open-Source Large Language Models for Metal-Organic Frameworks Research

High Entropy Alloy property predictions using Transformer-based language model

LLMatDesign: Autonomous Materials Discovery with Large Language Models

Mining experimental data from Materials Science literature with Large Language Models: an evaluation study

Interpretable Machine Learning for Materials Design

MatChat: A Large Language Model and Application Service Platform for Materials Science

Materials science in the era of large language models: a perspective

Flexible, Model-Agnostic Method for Materials Data Extraction from Text Using General Purpose Language Models

A Multi-agent Framework for Materials Laws Discovery