Knowledge Discovery from Porous Organic Cages Literature Using a Large Language Model

Yaoyi Su, Siyuan Yang, Yuanhan Liu, Aiting Kai, Linjiang Chen, Ming Liu
DOI: https://doi.org/10.26434/chemrxiv-2024-jm0ph
2024-10-23
Abstract: Porous organic cages (POCs) are an emerging subclass of porous materials, drawing increasing attention due to their structural tunability, modularity and processibility, with the research in this area rapidly expanding. Nevertheless, it is a time-consuming and labour-intensive process to obtain sufficient information from the extensive literature on organic molecular cages. This article presents a GPT-4-based literature reading method that incorporates multi-label text classification and follow-up information extraction, in which the potential of GPT-4 can be fully exploited to rapidly extract valid information from the literature. In the process of multi-label text classification, the prompt-engineered GPT-4 demonstrated the ability to label text with proper recall rates according to the type of information contained in the text, including authors, affiliations, synthetic procedures, surface area, and the CCDC number of corresponding cages. Additionally, GPT-4 demonstrated proficiency in information extraction, effectively transforming labeled text into concise tabulated data. Furthermore, we built a chatbot based on this database, allowing for quick and comprehensive searching across the entire database and responding to cage-related questions.
Chemistry
What problem does this paper attempt to address?
The problem this paper attempts to address is the **efficient extraction and organization of information from the porous organic cage (POC) literature**. Specifically, the authors face two main challenges:

1. **Time-consuming and labor-intensive information extraction**: The research literature on POCs is extensive, so manually extracting key information (such as authors, affiliations, synthesis steps, specific surface area, and CCDC numbers) takes considerable time and effort.
2. **Standardization and systematization of information**: Information in the existing literature is scattered and inconsistently formatted, which makes systematic analysis and reuse difficult.

To solve these problems, the authors developed a large language model method based on GPT-4. Through multi-label text classification and subsequent information extraction, they quickly and accurately extract key information from the literature and organize it into a structured database. Additionally, they built a chatbot that lets researchers quickly query and obtain relevant information about POCs.

### Main Contributions:

1. **Multi-label text classification**: GPT-4 classifies paragraphs from the literature, labeling those that contain specific information such as authors, affiliations, and synthesis steps.
2. **Information extraction and tabulation**: The labeled text is processed further to extract the information and organize it into structured tabular data.
3. **Chatbot**: A chatbot built on the extracted data quickly answers questions about POCs.

Through these methods, the authors aim to make literature reading and information extraction more efficient for researchers, thereby accelerating the design, synthesis, and application of POCs.
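
The classify-then-extract flow summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the prompt wording, the function names (`classify`, `extract`, `process`), and the JSON answer format are all assumptions made for illustration. Only the label set comes from the abstract. The model is passed in as a plain `llm` callable (prompt in, text out), so any GPT-4 client could be wrapped behind it.

```python
import json
from typing import Callable, Dict, List

# Label set taken from the information types named in the abstract.
LABELS = ["authors", "affiliations", "synthetic procedure",
          "surface area", "CCDC number"]


def build_classification_prompt(paragraph: str) -> str:
    """Hypothetical prompt wording; the paper's actual prompts are not shown here."""
    return (
        "Assign every applicable label to the paragraph below.\n"
        f"Allowed labels: {', '.join(LABELS)}.\n"
        "Answer with a JSON list of labels (an empty list if none apply).\n\n"
        f"Paragraph:\n{paragraph}"
    )


def build_extraction_prompt(paragraph: str, label: str) -> str:
    """Hypothetical prompt asking for the value behind one label."""
    return (
        f"Extract the {label} from the paragraph below. "
        "Answer with the value only.\n\n"
        f"Paragraph:\n{paragraph}"
    )


def classify(paragraph: str, llm: Callable[[str], str]) -> List[str]:
    """Multi-label step: return every known label the model assigns."""
    raw = llm(build_classification_prompt(paragraph))
    try:
        predicted = json.loads(raw)
    except json.JSONDecodeError:
        return []  # unparseable answer: treat the paragraph as unlabeled
    if not isinstance(predicted, list):
        return []
    # Keep only labels from the allowed set, in a fixed order.
    return [label for label in LABELS if label in predicted]


def extract(paragraph: str, labels: List[str],
            llm: Callable[[str], str]) -> Dict[str, str]:
    """Extraction step: one table row holding a value per assigned label."""
    return {
        label: llm(build_extraction_prompt(paragraph, label)).strip()
        for label in labels
    }


def process(paragraphs: List[str],
            llm: Callable[[str], str]) -> List[Dict[str, str]]:
    """Run classification, then extraction, and skip unlabeled paragraphs."""
    rows = []
    for paragraph in paragraphs:
        labels = classify(paragraph, llm)
        if labels:
            rows.append(extract(paragraph, labels, llm))
    return rows
```

Keeping classification and extraction as separate calls mirrors the two-stage description above: paragraphs that receive no label never reach the extraction step, and the resulting list of rows can be loaded into whatever table or database a chatbot would query.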