Abstract: Large Language Models (LLMs) create exciting possibilities for powerful language processing tools to accelerate research in materials science. While LLMs have great potential to accelerate materials understanding and discovery, they currently fall short in being practical materials science tools. In this position paper, we show relevant failure cases of LLMs in materials science that reveal current limitations of LLMs related to comprehending and reasoning over complex, interconnected materials science knowledge. Given those shortcomings, we outline a framework for developing Materials Science LLMs (MatSci-LLMs) that are grounded in materials science knowledge and hypothesis generation followed by hypothesis testing. The path to attaining performant MatSci-LLMs rests in large part on building high-quality, multi-modal datasets sourced from scientific literature where various information extraction challenges persist. As such, we describe key materials science information extraction challenges which need to be overcome in order to build large-scale, multi-modal datasets that capture valuable materials science knowledge. Finally, we outline a roadmap for applying future MatSci-LLMs for real-world materials discovery via: 1. Automated Knowledge Base Generation; 2. Automated In-Silico Material Design; and 3. MatSci-LLM Integrated Self-Driving Materials Laboratories.

What problem does this paper attempt to address?

This paper examines the obstacles to applying large language models (LLMs) in materials science. Although LLMs have shown great potential in natural language processing, they struggle to comprehend and reason over complex, interconnected materials science knowledge, and so cannot yet serve as practical materials science research tools. The paper proposes a framework for developing Materials Science LLMs (MatSci-LLMs) grounded in materials science knowledge, and highlights the challenges that must be addressed, including understanding materials structure, properties, and behavior, as well as the synthesis and analysis procedures described in experimental text.

For MatSci-LLMs to contribute to real-world materials discovery, the paper sets out two requirements: 1. Domain knowledge and grounded reasoning: MatSci-LLMs need to understand the field of materials science and be able to reason from its core principles. 2. Augmenting materials scientists: MatSci-LLMs should be able to perform tasks that accelerate materials science research and support scientists' work in a reliable and interpretable manner.

The paper presents failure cases of current LLMs in materials science and attributes them mainly to a lack of comprehensive understanding of materials science knowledge, especially in numerical problem solving and in reasoning about materials science principles. To address these issues, the paper suggests developing large-scale, multimodal datasets grounded in materials science knowledge and integrating MatSci-LLMs with real-world simulation and experimental tools, so that materials design, synthesis, and analysis can be automated and accelerated.

The paper also discusses information extraction challenges specific to materials science, such as its unique language representations, incomplete descriptions, text-to-structure conversion, and multimodal information extraction. It emphasizes the importance of understanding context across documents and sources and of handling diverse experimental and simulation procedures. Building a high-quality, multimodal materials science corpus is seen as a key step toward this goal.

Are LLMs Ready for Real-World Materials Discovery?
