Automating Knowledge Discovery from Scientific Literature via LLMs: A Dual-Agent Approach with Progressive Ontology Prompting

Yuting Hu, Dancheng Liu, Qingyun Wang, Charles Yu, Heng Ji, Jinjun Xiong
2024-08-21
Abstract: To address the challenge of automating knowledge discovery from a vast volume of literature, in this paper, we introduce a novel framework based on large language models (LLMs) that combines a progressive ontology prompting (POP) algorithm with a dual-agent system, named LLM-Duo, designed to enhance the automation of knowledge extraction from scientific articles. The POP algorithm utilizes a prioritized breadth-first search (BFS) across a predefined ontology to generate structured prompt templates and action orders, thereby guiding LLMs to discover knowledge in an automatic manner. Additionally, our LLM-Duo employs two specialized LLM agents: an explorer and an evaluator. These two agents work collaboratively and adversarially to enhance the reliability of the discovery and annotation processes. Experiments demonstrate that our method outperforms advanced baselines, enabling more accurate and complete annotations. To validate the effectiveness of our method in real-world scenarios, we employ our method in a case study of speech-language intervention discovery. Our method identifies 2,421 interventions from 64,177 research articles in the speech-language therapy domain. We curate these findings into a publicly accessible intervention knowledge base that holds significant potential to benefit the speech-language therapy community.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper addresses the challenge of automating knowledge discovery from a large body of literature. Specifically, the researchers propose a new framework based on large language models (LLMs) that combines the Progressive Ontology Prompting (POP) algorithm with a dual-agent system (LLM-Duo) to improve automatic knowledge extraction from scientific articles.

### Detailed interpretation

#### Background and motivation

With millions of research articles published every year, the body of scientific knowledge is huge, which presents both challenges and opportunities for researchers who acquire knowledge through advanced analysis tools and interdisciplinary methods. Discovering knowledge from scientific literature lets researchers keep up with the latest developments in their fields and gain valuable insights, thereby improving the quality of their work. However, because manual review is inefficient, only a very limited share of this knowledge is ever collected and organized. For example, in healthcare, evidence-based interventions are practices and treatments that are grounded in systematic research and proven effective through controlled studies. This underlines the importance of using evidence from well-designed and well-implemented research as the basis for medical decision-making.

#### Limitations of existing methods

Although LLMs have shown great potential for automated knowledge discovery, they still face challenges when dealing with a large amount of domain knowledge. In particular, their context window is limited, which restricts how much input text the model can process at one time and can lead to incomplete analysis and lost connections between data points across documents. To address this issue, Retrieval-Augmented Generation (RAG) combines a retrieval component with a generation model, giving the system access to a broader range of information than fits in the context window of a single model.

#### Proposed methods

1. **Progressive Ontology Prompting (POP) algorithm**:
   - This algorithm uses a prioritized breadth-first search (BFS) to traverse a predefined ontology graph, generating structured prompt templates and action sequences that guide LLMs to discover knowledge automatically.
   - Specifically, the algorithm sorts the neighbors of each node by their ratio of out-degree to in-degree and visits the neighbors with a higher ratio first, so that most of the graph is reached quickly.
2. **Dual-agent system (LLM-Duo)**:
   - This system contains two specialized LLM agents: the explorer and the evaluator.
   - The explorer is a RAG-based chatbot that generates annotations in a zero-shot setting and argues with the evaluator to justify its answers.
   - The evaluator assesses the annotations and provides feedback that helps the explorer refine them.

#### Experiments and applications

- **Experimental verification**:
  - The researchers applied the method to the practical scenario of speech-language intervention discovery, identifying 2,421 interventions from 64,177 research articles.
  - The experimental results show that the method outperforms advanced baseline methods on multiple metrics, including Consistency Rounds, Verbosity Index, Enumeration Quantity, Faithfulness, Accuracy, and Cover.
- **Case study**:
  - In the speech-language intervention case study, the discovered interventions were curated into a publicly available intervention knowledge base, which is of significant value to the speech-language therapy community.

### Main contributions

1. **Problem modeling**: The paper models LLM-based automated knowledge discovery as a prompt design and scheduling problem over a predefined ontology graph.
2. **New algorithm**: It designs a new Progressive Ontology Prompting (POP) algorithm that converts knowledge graph ontologies into structured prompts and action sequences to achieve automatic knowledge discovery from literature.
3. **Dual-agent framework**: It proposes a new annotation framework that improves the quality of knowledge discovery through the cooperation and competition of two LLM agents, with performance superior to advanced baselines.
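
The prioritized BFS described under "Proposed methods" can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `pop_visit_order`, the toy `ontology` graph, and the choice to treat an in-degree of zero as one (to avoid division by zero) are all assumptions made for illustration. The sketch covers only the visiting order; turning each visited node into a prompt template is left out.

```python
from collections import deque


def pop_visit_order(graph, root):
    """Return nodes in prioritized-BFS order, starting from `root`.

    `graph` maps each node to a list of its successor nodes.
    Neighbors with a higher out-degree / in-degree ratio are visited first.
    """
    # Count out-degree and in-degree for every node that appears in the graph.
    out_deg = {node: len(succs) for node, succs in graph.items()}
    in_deg = {node: 0 for node in graph}
    for succs in graph.values():
        for succ in succs:
            in_deg[succ] = in_deg.get(succ, 0) + 1
            out_deg.setdefault(succ, 0)

    def ratio(node):
        # Assumption: an in-degree of 0 is treated as 1 to avoid dividing by zero.
        return out_deg[node] / max(in_deg[node], 1)

    order = []
    seen = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        # Stable sort: neighbors with equal ratios keep their listed order.
        for succ in sorted(graph.get(node, []), key=ratio, reverse=True):
            if succ not in seen:
                seen.add(succ)
                queue.append(succ)
    return order


# Hypothetical toy ontology, not taken from the paper.
ontology = {
    "Intervention": ["Target", "Participant", "Outcome"],
    "Target": ["Outcome"],
    "Participant": ["Age", "Diagnosis"],
    "Outcome": [],
    "Age": [],
    "Diagnosis": [],
}

order = pop_visit_order(ontology, "Intervention")
print(order)
```

In this toy graph, `Participant` has the highest out-to-in ratio among the root's neighbors, so it is queued ahead of `Target` and `Outcome`. Nodes that open up more of the graph are therefore expanded earlier.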