Abstract: Large language models (LLMs) are increasingly being used in materials science. However, little attention has been given to benchmarking and standardized evaluation for LLM-based materials property prediction, which hinders progress. We present LLM4Mat-Bench, the largest benchmark to date for evaluating the performance of LLMs in predicting the properties of crystalline materials. LLM4Mat-Bench contains about 1.9M crystal structures in total, collected from 10 publicly available materials data sources, and 45 distinct properties. LLM4Mat-Bench features different input modalities: crystal composition, CIF, and crystal text description, with 4.7M, 615.5M, and 3.1B tokens in total for each modality, respectively. We use LLM4Mat-Bench to fine-tune models with different sizes, including LLM-Prop and MatBERT, and provide zero-shot and few-shot prompts to evaluate the property prediction capabilities of LLM-chat-like models, including Llama, Gemma, and Mistral. The results highlight the challenges of general-purpose LLMs in materials science and the need for task-specific predictive models and task-specific instruction-tuned LLMs in materials property prediction.

What problem does this paper attempt to address?

The paper addresses the lack of a standardized evaluation benchmark for predicting material properties with large language models (LLMs). Although LLMs have achieved remarkable success in natural language processing and other scientific tasks, applying them in materials science remains challenging, particularly for predicting the properties of crystalline materials. To advance research in this area, the authors propose **LLM4Mat-Bench**, the largest benchmark to date for evaluating LLMs on property prediction for crystalline materials.

### Main problems

1. **Lack of a standardized evaluation benchmark**: Research on LLM-based material property prediction has no unified, standardized benchmark. This makes results from different studies hard to compare and slows progress in the field.
2. **Insufficient diversity and scale of datasets**: Existing datasets are usually small and drawn from a single source, so they cannot comprehensively evaluate LLMs across different tasks and input modalities.
3. **Diversity of input modalities**: Existing research mainly focuses on specific input modalities (such as chemical composition or structure) and overlooks others (such as text descriptions) that may be more helpful for prediction.

### Solutions

- **Construct a large-scale benchmark dataset**: LLM4Mat-Bench contains approximately 1.9 million crystal structures from 10 publicly available materials data sources, covering 45 different properties. The data sources are:
  - hMOF
  - Materials Project
  - OQMD
  - OMDB
  - JARVIS-DFT
  - QMOF
  - JARVIS-QETB
  - GNoME
  - Cantor HEA
  - SNUMAT
- **Diverse input modalities**: LLM4Mat-Bench provides three input modalities:
  - **Composition**
  - **Crystallographic Information File** (CIF)
  - **Crystal text description**
- **Model evaluation**: LLM4Mat-Bench is used to fine-tune models of different sizes, including LLM-Prop and MatBERT, and to evaluate chat-style LLMs such as Llama2 with zero-shot and few-shot prompts.

### Goals

- **Promote research progress**: Provide a large-scale, diverse benchmark dataset to support research on LLMs for material property prediction.
- **Evaluate model performance**: Systematically evaluate different LLMs across input modalities to provide a reference for future research.
- **Identify challenges**: Reveal the limitations of current LLMs in materials science, especially when they handle complex input modalities.

### Conclusions

By constructing and evaluating LLM4Mat-Bench, the authors hope to promote the application of LLMs in materials science, especially for material property prediction and the discovery of new materials. The results also emphasize the importance of task-specific LLMs and fine-tuning, as well as the challenges of handling different input modalities.
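To make the zero-shot and few-shot evaluation setup more concrete, here is a minimal Python sketch of how such prompts might be assembled for one of the three input modalities and how a numeric answer might be pulled out of a model's reply. It is an illustration under stated assumptions, not the benchmark's actual prompt templates: the names `Example`, `build_prompt`, and `parse_prediction`, the prompt wording, and the placeholder inputs are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional, Sequence

# The three input modalities named in the summary (labels are hypothetical).
MODALITIES = ("composition", "cif", "description")


@dataclass(frozen=True)
class Example:
    """One labeled demonstration for a few-shot prompt."""
    material_input: str  # composition string, CIF text, or text description
    target: float        # known property value (placeholder data here)


def build_prompt(modality: str, property_name: str, query: str,
                 shots: Sequence[Example] = ()) -> str:
    """Build a zero-shot prompt (no shots) or a few-shot prompt (with shots)."""
    if modality not in MODALITIES:
        raise ValueError(f"unknown modality: {modality!r}")
    lines = [
        f"Predict the {property_name} of the material from its {modality}.",
        "Answer with a single number.",
    ]
    # Each demonstration shows an input followed by its known answer.
    for ex in shots:
        lines.append(f"Input: {ex.material_input}")
        lines.append(f"{property_name}: {ex.target}")
    # The query comes last, with the answer left blank for the model.
    lines.append(f"Input: {query}")
    lines.append(f"{property_name}:")
    return "\n".join(lines)


_NUMBER = re.compile(r"[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?")


def parse_prediction(response: str) -> Optional[float]:
    """Return the first number in a model reply, or None if there is none."""
    match = _NUMBER.search(response)
    return float(match.group()) if match else None


if __name__ == "__main__":
    # Placeholder demonstrations; the values are made up for illustration.
    demos = [Example("A2B", 1.0), Example("CD3", 2.0)]
    print(build_prompt("composition", "band gap (eV)", "NaCl", shots=demos))
```

Returning `None` for replies that contain no number keeps invalid outputs separate from wrong predictions, which matters when general-purpose chat models do not always follow the requested answer format.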

LLM4Mat-Bench: Benchmarking Large Language Models for Materials Property Prediction
