Are LLMs Ready for Real-World Materials Discovery?

Santiago Miret, N M Anoop Krishnan
2024-02-08
Abstract: Large Language Models (LLMs) create exciting possibilities for powerful language processing tools to accelerate research in materials science. While LLMs have great potential to accelerate materials understanding and discovery, they currently fall short in being practical materials science tools. In this position paper, we show relevant failure cases of LLMs in materials science that reveal current limitations of LLMs related to comprehending and reasoning over complex, interconnected materials science knowledge. Given those shortcomings, we outline a framework for developing Materials Science LLMs (MatSci-LLMs) that are grounded in materials science knowledge and hypothesis generation followed by hypothesis testing. The path to attaining performant MatSci-LLMs rests in large part on building high-quality, multi-modal datasets sourced from scientific literature where various information extraction challenges persist. As such, we describe key materials science information extraction challenges which need to be overcome in order to build large-scale, multi-modal datasets that capture valuable materials science knowledge. Finally, we outline a roadmap for applying future MatSci-LLMs for real-world materials discovery via: 1. Automated Knowledge Base Generation; 2. Automated In-Silico Material Design; and 3. MatSci-LLM Integrated Self-Driving Materials Laboratories.
Artificial Intelligence, Machine Learning, Materials Science, Computation and Language
What problem does this paper attempt to address?
This paper examines the obstacles to applying large language models (LLMs) in materials science. Although LLMs have shown great potential in natural language processing, they struggle to understand and reason over complex, interconnected materials science knowledge, and cannot yet serve as practical research tools in the field. The paper proposes a framework for building Materials Science LLMs (MatSci-LLMs) grounded in materials science knowledge, and highlights the challenges that must be addressed, including understanding materials structure, properties, and behavior, as well as the synthesis and analysis procedures conveyed in experimental descriptions.
For MatSci-LLMs to contribute to real-world materials discovery, the paper sets out two requirements: 1. Domain knowledge and grounded reasoning: MatSci-LLMs need to understand the field of materials science and be able to reason from its core principles. 2. Augmenting materials scientists: MatSci-LLMs should perform tasks that accelerate materials science research and support scientists' work in a reliable and interpretable manner.
The paper presents failure cases of current LLMs in materials science, which stem mainly from an incomplete understanding of materials science knowledge, especially in numerical problem solving and in reasoning about materials science principles. To address these issues, the paper suggests developing large-scale, multimodal datasets grounded in materials science knowledge and integrating MatSci-LLMs with real-world simulation and experimental tools, so that materials design, synthesis, and analysis can be automated and accelerated.
The paper also discusses information extraction challenges specific to materials science, such as unique language representations, incomplete descriptions, text-to-structure conversion, and multimodal information extraction. It emphasizes the importance of understanding context across documents and sources and of handling diverse experimental and simulation procedures. Building a high-quality multimodal materials science corpus is seen as a key step toward this goal.