Abstract: The rapid growth of biomedical knowledge has outpaced our ability to efficiently extract insights and generate novel hypotheses. Large language models (LLMs) have emerged as a promising tool to revolutionize knowledge interaction and potentially accelerate biomedical discovery. In this paper, we present a comprehensive evaluation of LLMs as biomedical hypothesis generators. We construct a dataset of background-hypothesis pairs from biomedical literature, carefully partitioned into training, seen, and unseen test sets based on publication date to mitigate data contamination. Using this dataset, we assess the hypothesis generation capabilities of top-tier instructed models in zero-shot, few-shot, and fine-tuning settings. To enhance the exploration of uncertainty, a crucial aspect of scientific discovery, we incorporate tool use and multi-agent interactions in our evaluation framework. Furthermore, we propose four novel metrics grounded in extensive literature review to evaluate the quality of generated hypotheses, considering both LLM-based and human assessments. Our experiments yield two key findings: 1) LLMs can generate novel and validated hypotheses, even when tested on literature unseen during training, and 2) increasing uncertainty through multi-agent interactions and tool use can facilitate diverse candidate generation and improve zero-shot hypothesis generation performance. However, we also observe that the integration of additional knowledge through few-shot learning and tool use may not always lead to performance gains, highlighting the need for careful consideration of the type and scope of external knowledge incorporated. These findings underscore the potential of LLMs as powerful aids in biomedical hypothesis generation and provide valuable insights to guide further research in this area.

What problem does this paper attempt to address?

The paper addresses a challenge created by the explosive growth of biomedical knowledge: how to extract insights and generate new hypotheses efficiently. Large language models (LLMs) are considered a promising tool here, since they could change how researchers interact with biomedical knowledge and potentially accelerate discovery. The paper comprehensively evaluates LLMs as biomedical hypothesis generators by constructing a dataset of background-hypothesis pairs from the biomedical literature. To mitigate data contamination, the dataset is partitioned by publication date into training, seen test, and unseen test sets. On this dataset, the paper evaluates the hypothesis generation ability of top-tier instructed models in zero-shot, few-shot, and fine-tuning settings. To enhance the exploration of uncertainty, a crucial aspect of scientific discovery, the paper incorporates tool use and multi-agent interaction into the evaluation framework. It also proposes four new metrics, grounded in an extensive literature review, for evaluating the quality of generated hypotheses through both LLM-based and human assessment. The experiments yield two main findings: first, LLMs can generate novel and validated hypotheses even when tested on literature unseen during training; second, increasing uncertainty through multi-agent interaction and tool use can facilitate the generation of diverse candidate hypotheses and improve zero-shot hypothesis generation performance. However, the paper also observes that integrating additional knowledge through few-shot learning and tool use does not always improve performance, which highlights the need to consider carefully the type and scope of external knowledge incorporated.
In conclusion, the paper highlights the potential of LLMs as powerful aids in biomedical hypothesis generation and provides insights to guide further research in this field. Its specific contributions include an early validation of zero-shot and few-shot hypothesis generation, evidence of LLMs' reasoning abilities and their capacity to generate novel hypotheses, multidimensional metrics for hypothesis evaluation, and an LLM-based multi-agent framework for hypothesis generation. The paper also describes the experimental setup, model selection, evaluation metrics, and result analysis in detail, showing how different models perform in zero-shot and few-shot settings and how the integration of external knowledge affects hypothesis generation. Through quantitative analysis of uncertainty and human evaluation, the paper explores both the potential and the limitations of LLMs in biomedical hypothesis generation.
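
The date-based partition described above can be illustrated with a minimal Python sketch. It reflects one plausible reading of the split, not the authors' actual procedure: pairs published before a cutoff are divided into training and seen test sets, and pairs published on or after the cutoff form the unseen test set. The `Pair` class, the `split_by_date` function, the cutoff date, and the `seen_fraction` parameter are all hypothetical names and values introduced for illustration.

```python
import random
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Pair:
    """A background-hypothesis pair with its publication date (hypothetical schema)."""
    background: str
    hypothesis: str
    published: date


def split_by_date(pairs, cutoff, seen_fraction=0.1, seed=0):
    """Partition pairs into (train, seen_test, unseen_test).

    Assumption: `cutoff` stands in for the date after which the evaluated
    model is presumed not to have seen the literature. Pairs published on or
    after it go to the unseen test set; earlier pairs are shuffled and split
    into seen test and training sets.
    """
    before = [p for p in pairs if p.published < cutoff]
    unseen_test = [p for p in pairs if p.published >= cutoff]

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(before)

    n_seen = int(len(before) * seen_fraction)
    seen_test = before[:n_seen]
    train = before[n_seen:]
    return train, seen_test, unseen_test
```

Splitting on publication date rather than at random is what lets the unseen test set act as a contamination check: a model whose training data ends before the cutoff cannot have memorized those hypotheses.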

Large Language Models as Biomedical Hypothesis Generators: A Comprehensive Evaluation
