Developing Large Language Models for Quantum Chemistry Simulation Input Generation

Robert Pollice,Pieter Floris Jacobs

DOI: https://doi.org/10.26434/chemrxiv-2024-9g2w2

2024-09-02

Abstract:Scientists across domains are often challenged to master domain-specific languages (DSLs) for their research, which are merely a means to an end but are pervasive in fields like computational chemistry. Automated code generation promises to overcome this barrier, allowing researchers to focus on their core expertise. While large language models (LLMs) have shown impressive capabilities in synthesizing code from natural language prompts, they often struggle with DSLs, likely due to their limited exposure during training. In this work, we investigate the potential of foundational LLMs for generating input files for the quantum chemistry package ORCA by establishing a general framework that can be adapted to other DLSs. To improve upon GPT-3.5 Turbo as our base model, we explore the impact of prompt engineering, retrieval-augmented generation, and finetuning via synthetically generated datasets. We find that finetuning, even with synthetic datasets as small as 500 samples, significantly improves performance. Additionally, we observe that finetuning shows synergism with advanced prompt engineering such as chain-of-thought prompting. Consequently, our best finetuned models outperform the formally much more powerful GPT-4o model. All tools and datasets are made openly available for future research. We believe that this research lays the groundwork for a wider adoption of LLMs for DSLs in chemistry and beyond.

Chemistry

What problem does this paper attempt to address?

The paper aims to address the challenges faced by scientists when conducting research using Domain Specific Languages (DSLs). Specifically, the goals of the paper are: 1. **Develop a Framework**: Establish a general framework for utilizing Large Language Models (LLMs) to generate input files for the quantum chemistry software package ORCA. 2. **Improve Performance**: Enhance the performance of base models (such as GPT-3.5 Turbo) in generating ORCA input files through methods like fine-tuning, prompt engineering, and Retrieval-Augmented Generation (RAG). 3. **Validate Effectiveness**: Demonstrate that significant performance improvements can be achieved even with fine-tuning on synthetic datasets, and that combining advanced prompt engineering techniques (such as Chain-of-Thought Prompting) can further boost performance. 4. **Surpass Existing Models**: Show that the best fine-tuned model outperforms the more powerful GPT-4 model. 5. **Open Resources**: Make all tools and datasets publicly available so that future researchers can further improve and apply them. Through these efforts, the paper hopes to lay a solid foundation for the synthesis of DSLs in chemistry and other fields, and to enhance the efficiency of researchers' work.

Developing Large Language Models for Quantum Chemistry Simulation Input Generation

LlaSMol: Advancing Large Language Models for Chemistry with a Large-Scale, Comprehensive, High-Quality Instruction Tuning Dataset

Large Language Models as Molecular Design Engines

Quantum Many-Body Physics Calculations with Large Language Models

Large Language Model-Guided Prediction Toward Quantum Materials Synthesis

Unleashing the Potential of LLMs for Quantum Computing: A Study in Quantum Architecture Design

Exploring the Benefits of Domain-Pretraining of Generative Large Language Models for Chemistry

What can Large Language Models do in chemistry? A comprehensive benchmark on eight tasks

Assessment of Fine-Tuned Large Language Models for Real-World Chemistry and Material Science Applications

Leveraging large language models for predictive chemistry

Large Language Models are Catalyzing Chemistry Education

Structured Chemistry Reasoning with Large Language Models

Fine-tuning Large Language Models for Chemical Text Mining

Quantum space-efficient large language models for Prolog query translation

Augmenting large language models with chemistry tools

Small Molecule Optimization with Large Language Models

14 Examples of How LLMs Can Transform Materials Science and Chemistry: A Reflection on a Large Language Model Hackathon

ChemDFM: A Large Language Foundation Model for Chemistry

Are large language models superhuman chemists?