Benchmarking Biomedical Relation Knowledge in Large Language Models

Fenghui Zhang, Kuo Yang, Chenqian Zhao, Haixu Li, Xin Dong, Haoyu Tian, Xuezhong Zhou
DOI: https://doi.org/10.1007/978-981-97-5131-0_41
2024-01-01
Abstract: As a special kind of knowledge base (KB), a large language model (LLM) stores a great deal of knowledge in the parameters of a deep neural network, and evaluating the accuracy of that knowledge has become a key area of interest in LLM research. Although many studies have evaluated LLM knowledge, few address biomedical knowledge, owing to its complexity and scarcity. To address this gap, we designed five specific identification and evaluation tasks for the biomedical knowledge in LLMs, including the identification of genes for diseases, targets for drugs/compounds, drugs for diseases, and effectiveness for herbs. We selected four well-known LLMs (GPT-3.5turbo, GPT-4, ChatGLM-std, and LLaMA2-13B) to quantify the quality of their biomedical knowledge. Comprehensive experiments, comprising an overall evaluation of accuracy and completeness, ablation analysis, few-shot prompt optimization, and a case study, benchmarked the performance of these LLMs in identifying biomedical knowledge and assessed the quality of the biomedical knowledge implicit in them. The experimental results revealed several notable observations, e.g., the incompleteness and bias of the knowledge in different LLMs, which offer insight into the use of LLMs for biomedical discovery and application.