LLM4Mat-Bench: Benchmarking Large Language Models for Materials Property Prediction

Andre Niyongabo Rubungo, Kangming Li, Jason Hattrick-Simpers, Adji Bousso Dieng
2024-11-01
Abstract: Large language models (LLMs) are increasingly being used in materials science. However, little attention has been given to benchmarking and standardized evaluation for LLM-based materials property prediction, which hinders progress. We present LLM4Mat-Bench, the largest benchmark to date for evaluating the performance of LLMs in predicting the properties of crystalline materials. LLM4Mat-Bench contains about 1.9M crystal structures in total, collected from 10 publicly available materials data sources, and 45 distinct properties. LLM4Mat-Bench features different input modalities: crystal composition, CIF, and crystal text description, with 4.7M, 615.5M, and 3.1B tokens in total for each modality, respectively. We use LLM4Mat-Bench to fine-tune models with different sizes, including LLM-Prop and MatBERT, and provide zero-shot and few-shot prompts to evaluate the property prediction capabilities of LLM-chat-like models, including Llama, Gemma, and Mistral. The results highlight the challenges of general-purpose LLMs in materials science and the need for task-specific predictive models and task-specific instruction-tuned LLMs in materials property prediction.
Materials Science, Computation and Language
What problem does this paper attempt to address?
This paper addresses the lack of a standardized evaluation benchmark for large language models (LLMs) used to predict material properties. Although LLMs have achieved remarkable success in natural language processing and other scientific tasks, their application in materials science still faces challenges, especially in predicting the properties of crystalline materials. To advance research in this area, the authors propose **LLM4Mat-Bench**, the largest benchmark dataset to date for evaluating LLMs on predicting the properties of crystalline materials.

### Main problems

1. **Lack of a standardized evaluation benchmark**: Research on using LLMs to predict material properties has no unified, standardized evaluation benchmark. This makes results from different studies difficult to compare and hinders progress in the field.
2. **Insufficient diversity and scale of datasets**: Existing datasets are usually small and drawn from a single source, so they cannot comprehensively evaluate LLM performance across tasks and input modalities.
3. **Diversity of input modalities**: Existing research mainly focuses on specific input modalities (such as chemical composition or structure) and ignores other modalities (such as text descriptions) that may be more helpful for prediction.

### Solutions

- **Construct a large-scale benchmark dataset**: LLM4Mat-Bench contains approximately 1.9 million crystal structures from 10 publicly available materials data sources, covering 45 distinct properties. The data sources are:
  - hMOF
  - Materials Project
  - OQMD
  - OMDB
  - JARVIS-DFT
  - QMOF
  - JARVIS-QETB
  - GNoME
  - Cantor HEA
  - SNUMAT
- **Diverse input modalities**: LLM4Mat-Bench provides three input modalities:
  - **Composition**
  - **Crystallographic Information File** (CIF)
  - **Crystal text description** (Description)
- **Model evaluation**: LLM4Mat-Bench is used to evaluate models of different sizes. Task-specific models such as LLM-Prop and MatBERT are fine-tuned on the benchmark, while chat-style LLMs such as Llama2 are evaluated through zero-shot and few-shot prompts.

### Goals

- **Promote research progress**: Provide a large-scale, diverse benchmark dataset to support research on LLMs for material property prediction.
- **Evaluate model performance**: Systematically evaluate different LLMs across the input modalities to provide a reference for future research.
- **Identify challenges**: Reveal the limitations of current LLMs in materials science, especially when they handle complex input modalities.

### Conclusions

By constructing and evaluating LLM4Mat-Bench, the authors hope to promote the application of LLMs in materials science, especially for material property prediction and new material discovery. The results also emphasize the importance of task-specific LLMs and fine-tuning, as well as the challenges of handling different input modalities.
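
The zero-shot and few-shot evaluation mentioned under Solutions can be illustrated with a minimal sketch. It is not the benchmark's actual prompt template, which this summary does not give: the `Example` class, the `build_prompt` function, the prompt wording, and the numeric values are all hypothetical placeholders chosen for illustration. The idea it shows is only that a zero-shot prompt contains the query material alone, while a few-shot prompt prepends a handful of solved examples in the same input modality.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Example:
    """Hypothetical record: one material input and its property value."""
    material_input: str  # composition, CIF text, or text description
    value: float


def build_prompt(modality: str, property_name: str, query_input: str,
                 shots: Sequence[Example] = ()) -> str:
    """Build a zero-shot prompt (no shots) or a few-shot prompt (with shots).

    The wording is an assumption made for this sketch, not the template
    used by LLM4Mat-Bench.
    """
    lines = [
        f"Predict the {property_name} of the material from its {modality}. "
        "Answer with a number only."
    ]
    # Few-shot: show solved examples before the query.
    for ex in shots:
        lines.append(f"{modality}: {ex.material_input}")
        lines.append(f"{property_name}: {ex.value}")
    # The query material, with the answer left for the model to complete.
    lines.append(f"{modality}: {query_input}")
    lines.append(f"{property_name}:")
    return "\n".join(lines)


# Placeholder numbers below are NOT taken from any dataset.
shots = [Example("MgO", 1.0), Example("LiF", 2.0)]

zero_shot = build_prompt("composition", "band gap (eV)", "NaCl")
few_shot = build_prompt("composition", "band gap (eV)", "NaCl", shots=shots)

print(zero_shot)
print("---")
print(few_shot)
```

Switching `modality` to `"CIF"` or `"description"` and passing the corresponding text as `query_input` would cover the other two input modalities in the same way, although CIF and description inputs are much longer than compositions, as the token counts in the abstract suggest.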