Benchmarking Retrieval-Augmented Large Language Models in Biomedical NLP: Application, Robustness, and Self-Awareness

Mingchen Li, Zaifu Zhan, Han Yang, Yongkang Xiao, Jiatan Huang, Rui Zhang
2024-05-16
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities in various biomedical natural language processing (NLP) tasks, leveraging the demonstrations within the input context to adapt to new tasks. However, LLMs are sensitive to the selection of demonstrations. To address the hallucination issue inherent in LLMs, the retrieval-augmented LLM (RAL) offers a solution by retrieving pertinent information from an established database. Nonetheless, existing research lacks a rigorous evaluation of the impact of retrieval-augmented large language models on different biomedical NLP tasks. This deficiency makes it challenging to ascertain the capabilities of RALs within the biomedical domain. Moreover, the outputs of RALs are affected by retrieving unlabeled, counterfactual, or diverse knowledge, which is common in the real world but not well studied in the biomedical domain. Finally, exploring the self-awareness ability is also crucial for a RAL system. In this paper, we therefore systematically investigate the impact of RALs on 5 different biomedical tasks (triple extraction, link prediction, classification, question answering, and natural language inference). We analyze the performance of RALs in four fundamental abilities: unlabeled robustness, counterfactual robustness, diverse robustness, and negative awareness. To this end, we propose an evaluation framework to assess the RALs' performance on different biomedical NLP tasks and establish four different testbeds based on the aforementioned fundamental abilities. Then, we evaluate 3 representative LLMs with 3 different retrievers on 5 tasks over 9 datasets.
Computation and Language
What problem does this paper attempt to address?
This paper examines retrieval-augmented large language models (RALs) in biomedical natural language processing (NLP). Although large language models (LLMs) perform well on various biomedical NLP tasks, they are sensitive to the selection of in-context demonstrations and prone to generating incorrect information (the hallucination problem). RALs address this by retrieving relevant information from an established database. However, current research lacks a rigorous evaluation of the impact of RALs on different biomedical NLP tasks, which makes it hard to determine what RALs can actually do in the biomedical domain. In addition, the output of a RAL may be influenced by retrieving unlabeled, counterfactual, or diverse knowledge, an effect that has not been well studied in the biomedical domain. The paper also emphasizes the importance of exploring the self-awareness ability of RALs. To study the impact of RALs systematically on five biomedical tasks (triple extraction, link prediction, classification, question answering, and natural language inference), the paper proposes an evaluation framework and establishes four testbeds, one for each fundamental ability: unlabeled robustness, counterfactual robustness, diverse robustness, and negative awareness. It evaluates three representative LLMs with three different retrievers on the five tasks over nine datasets. The results show that although RALs improve performance on most biomedical datasets and demonstrate some counterfactual robustness, they still face significant challenges in handling unlabeled and counterfactual retrieved information, as well as in negative awareness. The main contributions of the paper are proposing four key abilities for evaluating RALs in the biomedical domain, creating a new benchmark called BioRAB, and conducting a comprehensive evaluation of RALs on different tasks and datasets.
Through these experiments, the authors find that RALs still have room for improvement on certain tasks and datasets, especially question answering, and that they do not achieve optimal performance when the retrieved knowledge is diverse. RALs also fall short in identifying whether the retrieved information has a positive or negative impact on the final output (i.e., negative awareness).
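The testbed idea described above can be illustrated with a small sketch. This is not the BioRAB implementation: the summary does not specify how the testbeds are built, so every name here (`Example`, `make_unlabeled`, `make_counterfactual`, `retrieve`, `build_prompt`, `retrieval_effect`) is hypothetical, the retriever is a toy token-overlap ranker rather than one of the three retrievers evaluated in the paper, and the diverse-robustness testbed is left out. The sketch only assumes that an unlabeled corpus drops labels, that a counterfactual corpus contains some wrong labels, and that negative awareness concerns whether retrieval helped or hurt a prediction.

```python
import random
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Example:
    text: str
    label: Optional[str]  # None means the example is unlabeled


def make_unlabeled(corpus: List[Example]) -> List[Example]:
    """Assumed unlabeled-robustness corpus: keep the text, drop every label."""
    return [replace(ex, label=None) for ex in corpus]


def make_counterfactual(corpus: List[Example], labels: List[str],
                        rate: float, seed: int = 0) -> List[Example]:
    """Assumed counterfactual corpus: give a fraction `rate` of examples a wrong label."""
    rng = random.Random(seed)
    n_flip = round(rate * len(corpus))
    flip_ids = set(rng.sample(range(len(corpus)), n_flip))
    out = []
    for i, ex in enumerate(corpus):
        if i in flip_ids:
            wrong = rng.choice([lab for lab in labels if lab != ex.label])
            out.append(replace(ex, label=wrong))
        else:
            out.append(ex)
    return out


def retrieve(query: str, corpus: List[Example], k: int = 1) -> List[Example]:
    """Toy retriever (an assumption): rank corpus entries by Jaccard token overlap."""
    q = set(query.lower().split())

    def score(ex: Example) -> float:
        t = set(ex.text.lower().split())
        union = q | t
        return len(q & t) / len(union) if union else 0.0

    return sorted(corpus, key=score, reverse=True)[:k]


def build_prompt(query: str, retrieved: List[Example]) -> str:
    """Prepend retrieved examples to the query; unlabeled examples contribute text only."""
    lines = []
    for ex in retrieved:
        lines.append(f"Example: {ex.text}")
        if ex.label is not None:
            lines.append(f"Label: {ex.label}")
    lines.append(f"Input: {query}")
    lines.append("Label:")
    return "\n".join(lines)


def retrieval_effect(gold: str, pred_without: str, pred_with: str) -> str:
    """Reference signal for negative awareness: did retrieval help, hurt, or change nothing?"""
    before, after = pred_without == gold, pred_with == gold
    if after and not before:
        return "positive"
    if before and not after:
        return "negative"
    return "neutral"


corpus = [
    Example("aspirin relieves headache", "treats"),
    Example("smoking leads to lung cancer", "causes"),
    Example("metformin lowers blood glucose", "treats"),
    Example("asbestos exposure leads to mesothelioma", "causes"),
]
unlabeled = make_unlabeled(corpus)
counterfactual = make_counterfactual(corpus, ["treats", "causes"], rate=0.5)
prompt = build_prompt("aspirin for headache", retrieve("aspirin for headache", unlabeled))
```

In a real evaluation, the prompt would be sent to an LLM once with and once without the retrieved context, and `retrieval_effect` would give the reference answer against which the model's own judgment of the retrieved knowledge could be compared.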