Abstract: Information extraction and textual comprehension from materials literature are vital for developing an exhaustive knowledge base that enables accelerated materials discovery. Language models have demonstrated their capability to answer domain-specific questions and retrieve information from knowledge bases. However, there are no benchmark datasets in the materials domain that can be used to evaluate how well these language models understand key concepts. In this work, we curate a dataset of 650 challenging questions from the materials domain that require the knowledge and skills of a materials science student who has completed an undergraduate degree. We classify these questions by their structure and by materials science subcategory. Further, we evaluate the performance of the LLaMA-2-70B, GPT-3.5, and GPT-4 models on these questions using zero-shot and chain-of-thought prompting. GPT-4 performs best (~62% accuracy), ahead of GPT-3.5. Interestingly, in contrast to what is generally observed, chain-of-thought prompting yields no significant improvement in accuracy. To evaluate the limitations, we performed an error analysis, which revealed that conceptual errors (~72%), rather than computational errors (~28%), are the major contributor to the reduced performance of LLMs. We also compared GPT-4 with human performance and observed that GPT-4 is better than an average student and comes close to qualifying in the exam. We also show applications of the best-performing model (GPT-4) to composition extraction from tables in materials science research papers and to code-writing tasks. While LLMs perform poorly on composition extraction, GPT-4 outperforms all other models on the code-writing task. We hope that the dataset, analysis, and applications discussed in this work will promote further research in developing better materials science domain-specific LLMs and strategies for information extraction.
