Knowledge Discovery from Porous Organic Cages Literature Using a Large Language Model

Yaoyi Su,Siyuan Yang,Yuanhan Liu,Aiting Kai,Linjiang Chen,Ming Liu

DOI: https://doi.org/10.26434/chemrxiv-2024-jm0ph

2024-10-23

Abstract: Porous organic cages (POCs) are an emerging subclass of porous materials that is drawing increasing attention for its structural tunability, modularity and processability, and research in this area is expanding rapidly. Nevertheless, obtaining sufficient information from the extensive literature on organic molecular cages is time-consuming and labour-intensive. This article presents a GPT-4-based literature reading method that combines multi-label text classification with a follow-up information extraction step, so that GPT-4 can be used to rapidly extract valid information from the literature. In multi-label text classification, the prompt-engineered GPT-4 labelled text with proper recall rates according to the type of information it contained, including authors, affiliations, synthetic procedures, surface area, and the CCDC number of the corresponding cages. GPT-4 was also proficient at information extraction, effectively transforming labelled text into concise tabulated data. Furthermore, we built a chatbot on top of this database, allowing quick and comprehensive searching across the entire database and responding to cage-related questions.

Chemistry

What problem does this paper attempt to address?

The problem this paper attempts to address is: **Efficient extraction and organization of information from Porous Organic Cages (POCs) literature**. Specifically, the main challenges faced by the authors include:

1. **Time-consuming and labor-intensive information extraction**: The volume of research literature on POCs makes it slow and laborious to manually extract key information (such as authors, affiliations, synthesis steps, specific surface area, and CCDC numbers).
2. **Standardization and systematization of information**: Information in the existing literature is scattered and inconsistently formatted, which makes systematic analysis and reuse difficult.

To solve these problems, the authors developed a large language model method based on GPT-4. Through multi-label text classification and subsequent information extraction, the method pulls key information from the literature and organizes it into a structured database. Additionally, they built a chatbot that lets researchers query this database for information about POCs.

### Main Contributions:

1. **Multi-label text classification**: Using GPT-4 to classify literature paragraphs, marking those that contain specific information such as authors, affiliations, and synthesis steps.
2. **Information extraction and tabulation**: Further processing the marked text to extract its content and organize it into structured tabular data.
3. **Chatbot**: Building a chatbot on the extracted data to answer questions about POCs.

Through these methods, the authors aim to make literature reading and information extraction more efficient for researchers, thereby accelerating the design, synthesis, and application research of POCs.
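The multi-label classification step can be pictured with a minimal Python sketch. It is not the authors' implementation: the prompt wording, the function names (`build_classification_prompt`, `parse_labels`, `classify`) and the keyword-based `fake_llm` stand-in are all assumptions made for illustration, and the example paragraph and its values are invented. Only the label set comes from the abstract. In practice `fake_llm` would be replaced by a real call to GPT-4.

```python
import json
from typing import Callable

# Label set taken from the information types named in the abstract.
LABELS = ["authors", "affiliations", "synthetic procedure",
          "surface area", "CCDC number"]


def build_classification_prompt(paragraph: str) -> str:
    """Ask for every label that applies (multi-label), as a JSON array.

    The wording is hypothetical; the paper's actual prompt is not shown here.
    """
    return (
        "You label paragraphs from papers on porous organic cages.\n"
        f"Allowed labels: {', '.join(LABELS)}.\n"
        "Return a JSON array with every label that applies, or [] if none.\n\n"
        f"Paragraph:\n{paragraph}"
    )


def parse_labels(reply: str) -> list[str]:
    """Keep only known labels; treat a malformed reply as 'no labels'."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    return [label for label in LABELS if label in data]


def classify(paragraph: str, llm: Callable[[str], str]) -> list[str]:
    """Run one paragraph through the model and return its validated labels."""
    return parse_labels(llm(build_classification_prompt(paragraph)))


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call: keyword matching on the paragraph only."""
    text = prompt.split("Paragraph:\n", 1)[1].lower()
    hits = []
    if "surface area" in text:
        hits.append("surface area")
    if "ccdc" in text:
        hits.append("CCDC number")
    return json.dumps(hits)


# Invented example paragraph; the numbers are placeholders, not real data.
example = "The cage has a surface area of 500 m2/g (CCDC 0000000)."
print(classify(example, fake_llm))
```

Validating the reply against a fixed label list keeps unexpected model output out of the downstream table. The extraction step would follow the same pattern, with a second prompt applied only to paragraphs that carry the relevant label.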

Deep Generative Design of Porous Organic Cages via a Variational Autoencoder

Recent Advances in the Applications of Porous Organic Cages

Exploring the Potential of Large Language Models in Molecular Tasks: An Insightful Evaluation with GPT‐4

Streamlining the Automated Discovery of Porous Organic Cages

Explainable graph neural networks for organic cages

Image and Data Mining in Reticular Chemistry Using GPT-4V

Evaluation of Open-Source Large Language Models for Metal-Organic Frameworks Research

OceanGPT: A Large Language Model for Ocean Science Tasks

ProteinGPT: Multimodal LLM for Protein Property Prediction and Structure Understanding

ProtChatGPT: Towards Understanding Proteins with Large Language Models

Clue-Guided Path Exploration: Optimizing Knowledge Graph Retrieval with Large Language Models to Address the Information Black Box Challenge

A GPT-assisted iterative method for extracting domain knowledge from a large volume of literature of electromagnetic wave absorbing materials with limited manually annotated data

ChatGPT, an Opportunity to Understand More About Language Models

Large language models help facilitate the automated synthesis of information on potential pest controllers

Chain of Knowledge: A Framework for Grounding Large Language Models with Structured Knowledge Bases

ChatGPT Chemistry Assistant for Text Mining and Prediction of MOF Synthesis

Large-Scale Text Analysis Using Generative Language Models: A Case Study in Discovering Public Value Expressions in AI Patents

GPTZoo: A Large-scale Dataset of GPTs for the Research Community