Abstract: Large Language Models (LLMs) have the potential to revolutionize scientific research, yet their robustness and reliability in domain-specific applications remain insufficiently explored. This study conducts a comprehensive evaluation and robustness analysis of LLMs within the field of materials science, focusing on domain-specific question answering and materials property prediction. Three distinct datasets are used in this study: 1) a set of multiple-choice questions from undergraduate-level materials science courses, 2) a dataset including various steel compositions and yield strengths, and 3) a band gap dataset, containing textual descriptions of material crystal structures and band gap values. The performance of LLMs is assessed using various prompting strategies, including zero-shot chain-of-thought, expert prompting, and few-shot in-context learning. The robustness of these models is tested against various forms of 'noise', ranging from realistic disturbances to intentionally adversarial manipulations, to evaluate their resilience and reliability under real-world conditions. Additionally, the study uncovers unique phenomena of LLMs during predictive tasks, such as mode collapse behavior when the proximity of prompt examples is altered and performance enhancement from train/test mismatch. The findings aim to provide informed skepticism for the broad use of LLMs in materials science and to inspire advancements that enhance their robustness and reliability for practical applications.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper aims to evaluate the performance and robustness of large language models (LLMs) in the tasks of question answering and property prediction in the field of materials science. Specifically, the study focuses on the following aspects:

1. **Performance Evaluation**:
   - **Question Answering Task**: Evaluating the performance of LLMs in answering domain-specific questions in materials science using a multiple-choice question dataset from undergraduate materials science courses (MSE-MCQs).
   - **Property Prediction Task**: Assessing the performance of LLMs in predicting material properties using datasets that include different steel compositions and their yield strengths (matbench_steels) and datasets that describe material crystal structures and band gap values (band gap dataset).
2. **Robustness Analysis**:
   - **Text Perturbation**: Testing the stability and reliability of LLMs under various forms of "noise," including sentence reordering, synonym replacement, distracting information, unit mixing, and redundant information.
   - **Training Data Selection**: Investigating the impact of different training data selection methods (such as farthest neighbor, random neighbor, and nearest neighbor) on the performance of LLMs, particularly how selecting highly relevant training data can enhance model performance.
3. **Mode Collapse Phenomenon**:
   - Studying the mode collapse behavior of LLMs in prediction tasks, where the model generates the same output when the similarity of input examples changes. This phenomenon reveals the limitations of LLMs in handling different inputs.
4. **Training and Testing Mismatch**:
   - Exploring how the performance of LLMs changes when there is a mismatch between training and testing data. For example, discovering that certain adversarial perturbations (such as shuffling or randomizing) can enhance the predictive ability of fine-tuned models.
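To make the text-perturbation idea concrete, here is a minimal Python sketch of three of the noise types named above: sentence reordering, distracting information, and unit mixing. It is an illustration under stated assumptions, not the paper's implementation. The function names, the regex-based sentence splitter, the Celsius-to-kelvin rewrite, and the sample question are all hypothetical choices made for this example.

```python
import random
import re


def split_sentences(text):
    """Split text into sentences on '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def reorder_sentences(text, seed=0):
    """Sentence reordering: shuffle sentence order reproducibly."""
    rng = random.Random(seed)
    sentences = split_sentences(text)
    rng.shuffle(sentences)
    return " ".join(sentences)


def add_distractor(text, distractor):
    """Distracting information: append a sentence unrelated to the answer."""
    return f"{text.strip()} {distractor.strip()}"


def mix_units(text):
    """Unit mixing (assumed variant): rewrite temperatures in °C as kelvin."""
    def to_kelvin(match):
        return f"{float(match.group(1)) + 273.15:.2f} K"

    return re.sub(r"(-?\d+(?:\.\d+)?)\s*°C", to_kelvin, text)


if __name__ == "__main__":
    # Hypothetical question, written for this sketch only.
    question = (
        "A steel bar is heated to 500 °C. "
        "It is then quenched in water. "
        "Which phase is expected to form?"
    )
    print(reorder_sentences(question, seed=1))
    print(add_distractor(question, "Copper has a reddish color."))
    print(mix_units(question))
```

A robustness check would then compare the model's answers on the original and perturbed versions of each question. Fixing the seed keeps the perturbed set reproducible across runs.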
### Main Objectives

- **Evaluate the Applicability of LLMs in Materials Science**: Understand the reliability and limitations of LLMs in practical applications through systematic performance evaluation and robustness analysis.
- **Provide Improvement Suggestions**: Based on the research findings, propose methods to enhance the performance and robustness of LLMs in the field of materials science to promote their widespread application in scientific research.

### Conclusions

- **Performance Improvement**: Techniques such as expert prompting and zero-shot chain-of-thought prompting can significantly improve the performance of LLMs in question answering tasks.
- **Robustness Challenges**: Although LLMs exhibit some robustness under certain types of noise, they still face challenges under complex and deceptive conditions (such as the introduction of redundant information).
- **Mode Collapse**: In property prediction tasks, ineffective few-shot examples can lead to mode collapse, causing the model to default to generating responses from memory.
- **Training Data Selection**: Selecting highly relevant training data can significantly improve the predictive performance of LLMs, especially when the data volume is limited.
- **Training and Testing Mismatch**: Certain adversarial perturbations can unexpectedly enhance the performance of fine-tuned models, providing new insights for optimizing training costs.

In summary, this paper reveals the potential and limitations of LLMs in practical applications through a comprehensive evaluation in the field of materials science and provides valuable references for future improvements.
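The nearest, random, and farthest neighbor selection of few-shot examples mentioned in the summary can be sketched as follows. This is a hedged illustration only: the summary does not say how the paper measures similarity, so Euclidean distance over a numeric feature vector is an assumption, as are the function names, the prompt template, and the toy data, which are made up and not taken from matbench_steels.

```python
import math
import random


def select_examples(query, pool, k, strategy="nearest", seed=0):
    """Pick k few-shot examples from pool, a list of (features, target) pairs.

    Similarity is Euclidean distance between feature vectors (an assumption).
    """
    if strategy == "random":
        return random.Random(seed).sample(pool, k)
    # Rank candidates from closest to farthest from the query.
    ranked = sorted(pool, key=lambda ex: math.dist(query, ex[0]))
    if strategy == "nearest":
        return ranked[:k]
    if strategy == "farthest":
        return ranked[-k:]
    raise ValueError(f"unknown strategy: {strategy}")


def build_prompt(query, examples):
    """Format the selected examples and the query as a few-shot prompt."""
    lines = [f"Composition: {feat} -> Yield strength: {y} MPa" for feat, y in examples]
    lines.append(f"Composition: {query} -> Yield strength:")
    return "\n".join(lines)


if __name__ == "__main__":
    # Toy, made-up (features, target) pairs for illustration only.
    pool = [
        ((0.10, 1.0), 1200.0),
        ((0.20, 2.0), 1400.0),
        ((0.50, 5.0), 1800.0),
        ((0.90, 9.0), 2100.0),
    ]
    query = (0.15, 1.5)
    for strategy in ("nearest", "random", "farthest"):
        chosen = select_examples(query, pool, k=2, strategy=strategy)
        print(strategy)
        print(build_prompt(query, chosen))
```

Running the same query under each strategy and comparing the predictions is one way to probe the reported sensitivity to example proximity, including the mode collapse that can occur when the selected examples are far from the query.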

Evaluating the Performance and Robustness of LLMs in Materials Science Q&A and Property Predictions
