Abstract: The extraction of Metal-Organic Frameworks (MOFs) synthesis conditions from literature text has been challenging but crucial for the logical design of new MOFs with desirable functionality. The recent advent of large language models (LLMs) provides a disruptively new solution to this long-standing problem, and the latest research has reported over 90% F1 in extracting correct conditions from MOFs literature. We argue in this paper that most existing synthesis extraction practices with LLMs stay with primitive zero-shot learning, which could lead to degraded extraction and application performance due to the lack of specialized knowledge. This work pioneers and optimizes the few-shot in-context learning paradigm for LLM extraction of material synthesis conditions. First, we propose a human-AI joint data curation process to secure high-quality ground-truth demonstrations for few-shot learning. Second, we apply a BM25 algorithm based on the retrieval-augmented generation (RAG) technique to adaptively select few-shot demonstrations for each MOF's extraction. Over a dataset randomly sampled from 84,898 well-defined MOFs, the proposed few-shot method achieves much higher average F1 performance (0.93 vs. 0.81, +14.8%) than the native zero-shot LLM using the same GPT-4 model, under a fully automatic evaluation that is more objective than the previous human evaluation. The proposed method is further validated through real-world material experiments: compared with the baseline zero-shot LLM, the proposed few-shot approach increases the MOFs structural inference performance (R^2) by 29.4% on average.
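To make the reported F1 comparison concrete, here is a minimal sketch of how an automatic F1 score over extracted synthesis conditions could be computed, and how the +14.8% figure follows from 0.93 vs. 0.81. The field-level exact-match scoring, the field names, and the toy values are assumptions for illustration; the abstract does not specify the paper's actual evaluation protocol.

```python
def f1_score(predicted, gold):
    """F1 over sets of (field, value) pairs, using exact match (an assumed scoring rule)."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def relative_gain(baseline, improved):
    """Relative improvement of `improved` over `baseline`."""
    return (improved - baseline) / baseline


# Toy, hypothetical extraction result for one synthesis paragraph.
gold = {
    ("metal_source", "Zn(NO3)2·6H2O"),
    ("solvent", "DMF"),
    ("temperature", "120 °C"),
    ("time", "24 h"),
}
predicted = {
    ("solvent", "DMF"),
    ("temperature", "120 °C"),
    ("time", "12 h"),  # wrong value, so it counts against precision
}

example_f1 = f1_score(predicted, gold)

# Relative gain implied by the F1 scores quoted in the abstract.
reported_gain = relative_gain(0.81, 0.93)
```

With the quoted scores, `relative_gain(0.81, 0.93)` is about 0.148, which matches the +14.8% stated in the abstract.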

What problem does this paper attempt to address?

The paper addresses the challenging problem of extracting metal-organic framework (MOF) synthesis conditions from literature text. The authors argue that although large language models (LLMs) have shown great potential in this field, with studies reporting over 90% F1 in extracting synthesis conditions from MOF literature, most existing LLM-based extraction practices still rely on primitive zero-shot learning. This can degrade extraction performance because the model lacks specialized knowledge.

To address these issues, the paper makes the following key contributions:

1. **Introduction of the few-shot in-context learning paradigm**: This method optimizes LLM-based extraction of material synthesis conditions. First, a human-AI joint data curation process is proposed to secure high-quality ground-truth demonstrations for few-shot learning; second, the BM25 algorithm (based on the retrieval-augmented generation technique) is used to adaptively select few-shot demonstrations for each MOF's extraction.
2. **Improved extraction performance**: On a dataset randomly sampled from 84,898 well-defined MOFs, the proposed few-shot method achieved higher average F1 (0.93 vs. 0.81, an improvement of +14.8%) than the native zero-shot LLM, with both using the GPT-4 model. The method was also validated through real-world material experiments, showing an average improvement of 29.4% in MOF structural inference performance (R²) over the baseline zero-shot LLM.
3. **Addressed several non-trivial challenges**:
   - Tackled the problem of obtaining high-quality ground-truth demonstrations, which is a daunting task in scientific literature.
   - Improved data quality through a human-AI joint data curation process.
   - Used the BM25 algorithm to adaptively select the best combination of few-shot demonstrations, significantly outperforming random selection.
4. **Considered scalability issues**: To support large-scale data processing, the paper proposes several techniques, including an offline model that detects the most relevant synthesis paragraphs in each document and an LLM-based coreference resolution method that addresses the issue of proxy terms.

In summary, by introducing the few-shot learning paradigm, the paper improves the accuracy of MOF synthesis condition extraction and validates its effectiveness through practical applications, providing useful tools and techniques for materials science.
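The BM25-based demonstration selection described above can be sketched as follows. This is a minimal, self-contained illustration, not the paper's implementation: the tokenizer, the `k1` and `b` defaults, the smoothed IDF variant, the prompt wording, and the toy demonstration pool (`demo_pool`) are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphanumeric runs (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


class BM25:
    """Okapi BM25 scorer over a small in-memory corpus."""

    def __init__(self, corpus, k1=1.5, b=0.75):
        self.k1 = k1
        self.b = b
        self.docs = [tokenize(doc) for doc in corpus]
        self.term_freqs = [Counter(doc) for doc in self.docs]
        self.avgdl = sum(len(doc) for doc in self.docs) / len(self.docs)
        n_docs = len(self.docs)
        # Document frequency: number of documents containing each term.
        doc_freq = Counter(term for tf in self.term_freqs for term in tf)
        # Smoothed IDF that stays non-negative for very common terms.
        self.idf = {
            term: math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
            for term, df in doc_freq.items()
        }

    def score(self, query, index):
        tf = self.term_freqs[index]
        length_norm = 1 - self.b + self.b * len(self.docs[index]) / self.avgdl
        total = 0.0
        for term in tokenize(query):
            freq = tf.get(term, 0)
            if freq:
                total += self.idf[term] * freq * (self.k1 + 1) / (freq + self.k1 * length_norm)
        return total

    def top_k(self, query, k):
        """Indices of the k highest-scoring documents, best first."""
        order = sorted(range(len(self.docs)), key=lambda i: self.score(query, i), reverse=True)
        return order[:k]


def build_prompt(paragraph, demos, k=2):
    """Assemble a few-shot prompt from the k demonstrations most similar to `paragraph`."""
    ranker = BM25([text for text, _ in demos])
    parts = ["Extract the MOF synthesis conditions as JSON."]
    for i in ranker.top_k(paragraph, k):
        text, answer = demos[i]
        parts.append(f"Paragraph: {text}\nConditions: {answer}")
    parts.append(f"Paragraph: {paragraph}\nConditions:")
    return "\n\n".join(parts)


# Toy, hypothetical demonstration pool of (paragraph, curated answer) pairs.
demo_pool = [
    ("Zn(NO3)2 and terephthalic acid were dissolved in DMF and heated at 120 °C for 24 h.",
     '{"solvent": "DMF", "temperature": "120 °C", "time": "24 h"}'),
    ("CuCl2 and 4,4'-bipyridine were stirred in methanol at room temperature for 2 h.",
     '{"solvent": "methanol", "temperature": "room temperature", "time": "2 h"}'),
    ("ZrCl4 and terephthalic acid were dissolved in DMF with acetic acid and heated at 120 °C for 48 h.",
     '{"solvent": "DMF", "temperature": "120 °C", "time": "48 h"}'),
]

target = "Cu(NO3)2 and pyrazine were stirred in methanol at room temperature overnight."
selected = BM25([text for text, _ in demo_pool]).top_k(target, 2)
prompt = build_prompt(target, demo_pool, k=2)
```

For the room-temperature methanol paragraph, the ranker puts the CuCl2 demonstration first because it shares the rarest terms ("stirred", "methanol", "room", "temperature"), which is the behavior adaptive selection relies on: each target paragraph is paired with the curated examples that most resemble it rather than with a fixed or random set.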

LLM-based MOFs Synthesis Condition Extraction using Few-Shot Demonstrations

Mining Insights on Metal-Organic Framework Synthesis from Scientific Literature Texts

Post-Pretraining Large Language Model Enabled Reverse Design of MOFs for Hydrogen Storage

Evaluation of Open-Source Large Language Models for Metal-Organic Frameworks Research

Machine learning accelerates the investigation of targeted MOFs: Performance prediction, rational design and intelligent synthesis

MolX: Enhancing Large Language Models for Molecular Learning with A Multi-Modal Extension

Machine learning: An accelerator for the exploration and application of advanced metal-organic frameworks

MOFSimplify, machine learning models with extracted stability data of three thousand metal–organic frameworks

Connecting metal-organic framework synthesis to applications with a self-supervised multimodal model

Inverse Design of Metal-Organic Frameworks Using Quantum Natural Language Processing

High-throughput and machine learning approaches for the discovery of metal organic frameworks

Harnessing Large Language Model to collect and analyze Metal-organic framework property dataset

Building Open Knowledge Graph for Metal-Organic Frameworks (MOF-KG): Challenges and Case Studies

DrugLLM: Open Large Language Model for Few-shot Molecule Generation

Accelerating Discovery of Water Stable Metal−Organic Frameworks by Machine Learning

Accelerate Synthesis of Metal–Organic Frameworks by a Robotic Platform and Bayesian Optimization

Multimodal Large Language Models for Inverse Molecular Design with Retrosynthetic Planning

Understanding the diversity of the metal-organic framework ecosystem

Identification of optimal metal-organic frameworks by machine learning: Structure decomposition, feature integration, and predictive modeling

Fine-tuning Large Language Models for Chemical Text Mining

Multi-modal conditioning for metal-organic frameworks generation using 3D modeling techniques