LLM-based MOFs Synthesis Condition Extraction using Few-Shot Demonstrations

Lei Shi, Zhimeng Liu, Yi Yang, Weize Wu, Yuyang Zhang, Hongbo Zhang, Jing Lin, Siyu Wu, Zihan Chen, Ruiming Li, Nan Wang, Zipeng Liu, Huobin Tan, Hongyi Gao, Yue Zhang, Ge Wang
2024-08-06
Abstract: The extraction of Metal-Organic Framework (MOF) synthesis conditions from literature text has been challenging but crucial for the logical design of new MOFs with desirable functionality. The recent advent of large language models (LLMs) provides a disruptively new solution to this long-standing problem, and recent studies have reported over 90% F1 in extracting correct conditions from MOF literature. We argue in this paper that most existing synthesis extraction practices with LLMs stay with primitive zero-shot learning, which could lead to degraded extraction and application performance due to the lack of specialized knowledge. This work pioneers and optimizes the few-shot in-context learning paradigm for LLM extraction of material synthesis conditions. First, we propose a human-AI joint data curation process to secure high-quality ground-truth demonstrations for few-shot learning. Second, we apply a BM25 algorithm based on the retrieval-augmented generation (RAG) technique to adaptively select few-shot demonstrations for each MOF's extraction. Over a dataset randomly sampled from 84,898 well-defined MOFs, the proposed few-shot method achieves much higher average F1 performance (0.93 vs. 0.81, +14.8%) than the native zero-shot LLM using the same GPT-4 model, under a fully automatic evaluation that is more objective than the previous human evaluation. The proposed method is further validated through real-world material experiments: compared with the baseline zero-shot LLM, the proposed few-shot approach increases the MOF structural inference performance (R^2) by 29.4% on average.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses the problem of extracting metal-organic framework (MOF) synthesis conditions from literature text. The authors argue that, although large language models (LLMs) have shown great potential here, with prior studies reporting F1 scores above 90% for extracting synthesis conditions from MOF literature, most existing LLM-based extraction practices still rely on primitive zero-shot learning. Because the model then lacks specialized knowledge, extraction performance can suffer. To address this, the paper makes the following key contributions:

1. **Introduction of the few-shot in-context learning paradigm**: This method optimizes LLM-based extraction of material synthesis conditions. First, a human-AI joint data curation process is proposed to secure high-quality ground-truth demonstrations for few-shot learning. Second, a BM25 algorithm, used in the manner of retrieval-augmented generation (RAG), adaptively selects the few-shot demonstrations for each MOF's extraction.
2. **Improved extraction performance**: On a dataset randomly sampled from 84,898 well-defined MOFs, the proposed few-shot method achieved a higher average F1 than the native zero-shot LLM (0.93 vs. 0.81, a relative gain of 14.8%), with both using the same GPT-4 model. The method was also validated through real-world material experiments, where it improved MOF structural inference performance (R²) by 29.4% on average over the zero-shot baseline.
3. **Addressed several non-trivial challenges**:
   - Tackled the difficulty of obtaining high-quality ground-truth demonstrations, which is a daunting task for scientific literature.
   - Improved data quality through the human-AI joint data curation process.
   - Used the BM25 algorithm to adaptively select the best combination of few-shot demonstrations, outperforming random selection.
4. **Considered scalability issues**: To support large-scale processing, the paper proposes an offline model that detects the most relevant synthesis paragraphs in each document, and an LLM-based coreference resolution method that handles proxy terms.

In summary, by introducing the few-shot learning paradigm, the paper improves the accuracy of MOF synthesis condition extraction and validates the gain in a downstream application, providing useful tools and techniques for materials science.
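The BM25-based demonstration selection described above can be illustrated with a minimal sketch. It is not the authors' implementation: the tokenizer, the BM25 parameters (`k1`, `b`), the prompt wording, the output field names, and the three-paragraph demonstration pool are all assumptions made up for illustration. The sketch ranks curated demonstrations by BM25 similarity to the paragraph being processed and places the top-k ones ahead of the query in the prompt.

```python
import json
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (an assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


class BM25:
    """Plain Okapi BM25 over a small in-memory corpus."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.tfs = [Counter(tokenize(d)) for d in docs]
        self.lengths = [sum(tf.values()) for tf in self.tfs]
        self.avgdl = sum(self.lengths) / len(docs)
        n = len(docs)
        df = Counter(term for tf in self.tfs for term in tf)
        # Smoothed IDF that stays positive even for very common terms.
        self.idf = {t: math.log(1 + (n - c + 0.5) / (c + 0.5)) for t, c in df.items()}

    def score(self, query, i):
        tf, dl = self.tfs[i], self.lengths[i]
        total = 0.0
        for term in tokenize(query):
            f = tf.get(term, 0)
            if f == 0:
                continue
            norm = f + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf[term] * f * (self.k1 + 1) / norm
        return total

    def top_k(self, query, k):
        order = sorted(range(len(self.tfs)), key=lambda i: self.score(query, i), reverse=True)
        return order[:k]


# Hypothetical demonstration pool: (paragraph, curated ground-truth conditions).
# These entries are invented for illustration and are not taken from the paper.
POOL = [
    ("Zn(NO3)2.6H2O and terephthalic acid were dissolved in DMF and heated at 120 C for 24 h.",
     {"metal_source": "Zn(NO3)2.6H2O", "solvent": "DMF", "temperature": "120 C", "time": "24 h"}),
    ("Cu(NO3)2.3H2O and trimesic acid were stirred in ethanol/water at room temperature for 12 h.",
     {"metal_source": "Cu(NO3)2.3H2O", "solvent": "ethanol/water", "temperature": "room temperature", "time": "12 h"}),
    ("ZrCl4 and terephthalic acid were dissolved in DMF with acetic acid and heated at 120 C for 48 h.",
     {"metal_source": "ZrCl4", "solvent": "DMF", "temperature": "120 C", "time": "48 h"}),
]

QUERY = "ZrCl4 and 2-aminoterephthalic acid were dissolved in DMF and heated at 120 C for 48 h."


def build_few_shot_prompt(query_paragraph, pool, k=2):
    """Select the k most similar demonstrations and assemble a few-shot prompt."""
    index = BM25([paragraph for paragraph, _ in pool])
    chosen = index.top_k(query_paragraph, k)
    parts = ["Extract the MOF synthesis conditions as JSON."]
    for i in chosen:
        paragraph, answer = pool[i]
        parts.append(f"Paragraph: {paragraph}\nConditions: {json.dumps(answer)}")
    parts.append(f"Paragraph: {query_paragraph}\nConditions:")
    return "\n\n".join(parts), chosen


if __name__ == "__main__":
    prompt, chosen = build_few_shot_prompt(QUERY, POOL, k=2)
    print(prompt)
```

In this toy pool the zirconium and zinc solvothermal paragraphs rank above the room-temperature copper one, so the model sees demonstrations that resemble the query's chemistry rather than arbitrary examples. A production setup would likely index a much larger curated pool and tune `k` against a validation set.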