Abstract: Medical information retrieval (MIR) is essential for retrieving relevant medical knowledge from diverse sources, including electronic health records, scientific literature, and medical databases. However, achieving effective zero-shot dense retrieval in the medical domain poses substantial challenges due to the lack of relevance-labeled data. In this paper, we introduce a novel approach called Self-Learning Hypothetical Document Embeddings (SL-HyDE) to tackle this issue. SL-HyDE leverages large language models (LLMs) as generators to generate hypothetical documents based on a given query. These generated documents encapsulate key medical context, guiding a dense retriever in identifying the most relevant documents. The self-learning framework progressively refines both pseudo-document generation and retrieval, utilizing unlabeled medical corpora without requiring any relevance-labeled data. Additionally, we present the Chinese Medical Information Retrieval Benchmark (CMIRB), a comprehensive evaluation framework grounded in real-world medical scenarios, encompassing five tasks and ten datasets. By benchmarking ten models on CMIRB, we establish a rigorous standard for evaluating medical information retrieval systems. Experimental results demonstrate that SL-HyDE significantly surpasses existing methods in retrieval accuracy while showcasing strong generalization and scalability across various LLM and retriever configurations. CMIRB data and evaluation code are publicly available at https://github.com/CMIRB-benchmark/CMIRB.

What problem does this paper attempt to address?

The paper addresses zero-shot dense retrieval for medical information retrieval (MIR) when no relevance-labeled data is available. It proposes a new method, Self-Learning Hypothetical Document Embeddings (SL-HyDE), to address three key challenges:

1. **LLMs lack specialized medical knowledge**: Although large language models (LLMs) are trained on a wide range of datasets, they usually do not possess sufficient domain-specific knowledge, especially in the medical field. This may lead them to generate irrelevant or misleading hypothetical documents.
2. **General text-embedding models cannot effectively represent medical queries and documents**: These models are usually designed for multi-domain, multi-task use and struggle to capture the nuances and knowledge-intensive character of medical text.
3. **High-quality relevance-labeled datasets are scarce in the medical field**: Creating and obtaining such datasets is time-consuming and resource-intensive, especially for non-English languages, which makes it difficult to train and fine-tune models.

To solve these problems, the paper proposes the SL-HyDE framework, which progressively optimizes hypothetical document generation and retrieval through a self-learning mechanism, without any relevance-labeled data. The paper also introduces the Chinese Medical Information Retrieval Benchmark (CMIRB), a comprehensive evaluation framework based on real-world medical scenarios. CMIRB contains five tasks and ten datasets and aims to provide a rigorous evaluation standard for Chinese medical information retrieval systems.

### Main contributions

1. **Propose the SL-HyDE framework**: A zero-shot medical information retrieval method that needs no relevance-labeled data.
2. **Develop the CMIRB benchmark**: A benchmark for evaluating retrieval models, covering multiple real-world medical tasks and datasets.
3. **Significantly improve retrieval accuracy**: SL-HyDE outperforms existing methods and generalizes and scales well across various LLM and retriever configurations.

### Method overview

- **Hypothetical Document Embeddings (HyDE)**: Use an LLM to generate a hypothetical document for each query, narrowing the semantic gap between queries and target documents.
- **Self-learning generator**: Generate hypothetical documents and optimize them according to the retrieval results, gradually improving the quality of the generator.
- **Self-learning retriever**: Use the generated hypothetical documents as supervision signals to improve the encoding ability of the retriever.

Through these innovations, SL-HyDE can effectively improve the accuracy and efficiency of medical information retrieval without annotated data.
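The HyDE retrieval step described above can be illustrated with a minimal Python sketch. This is a toy under stated assumptions, not the paper's implementation: `toy_generator` is a hypothetical stand-in for an LLM call that returns a canned passage, `embed` is a bag-of-words count vector standing in for a dense encoder, and the three-document corpus is invented for illustration. The self-learning stages are not shown.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a dense encoder: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def toy_generator(query):
    """Hypothetical stand-in for an LLM: returns a canned pseudo-document.

    A real system would prompt an LLM to write a passage answering the query.
    Unknown queries fall back to the query text itself.
    """
    canned = {
        "what helps with high blood sugar":
            "Metformin lowers blood glucose in patients with type 2 diabetes.",
    }
    key = query.strip().lower().rstrip("?")
    return canned.get(key, query)


def hyde_retrieve(query, corpus, generator, top_k=1):
    """Rank corpus documents by similarity to the generated pseudo-document."""
    pseudo_doc = generator(query)      # step 1: generate a hypothetical document
    pseudo_vec = embed(pseudo_doc)     # step 2: embed it in place of the raw query
    scored = [(cosine(pseudo_vec, embed(doc)), doc) for doc in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # step 3: rank by similarity
    return [doc for _, doc in scored[:top_k]]


# Invented example corpus (assumption, for illustration only).
corpus = [
    "Metformin is a first-line drug that lowers blood glucose in type 2 diabetes.",
    "Ibuprofen relieves pain and reduces fever.",
    "High altitude can cause headaches and nausea.",
]
query = "What helps with high blood sugar?"

# Baseline: embed the raw query directly (identity "generator").
direct_top = hyde_retrieve(query, corpus, generator=lambda q: q)
# HyDE: embed the generated pseudo-document instead.
hyde_top = hyde_retrieve(query, corpus, generator=toy_generator)

print("direct:", direct_top[0])
print("HyDE:  ", hyde_top[0])
```

In this toy setup the raw query shares only the surface word "high" with the altitude document and ranks it first, while the pseudo-document shares the clinically relevant vocabulary with the metformin document. This is the semantic gap that HyDE is meant to narrow; a real dense encoder would reduce, but not necessarily remove, this kind of mismatch.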

AutoMIR: Effective Zero-Shot Medical Information Retrieval without Relevance Labels

Efficient Self-Supervised Metric Information Retrieval: A Bibliography Based Method Applied to COVID Literature

Multistage and Multi-features Medical Image Retrieval System

A Survey on Relevance Feedback Techniques in Content-based Medical Image Retrieval

Benchmarking Retrieval-Augmented Large Language Models in Biomedical NLP: Application, Robustness, and Self-Awareness

3D-MIR: A Benchmark and Empirical Study on 3D Medical Image Retrieval in Radiology

VPL: Visual Proxy Learning Framework for Zero-Shot Medical Image Diagnosis

Zero-Shot Medical Information Retrieval via Knowledge Graph Embedding

AVPMIR: Adaptive Verifiable Privacy-Preserving Medical Image Retrieval

Rethinking masked image modelling for medical image representation

SciMMIR: Benchmarking Scientific Multi-modal Information Retrieval

Path to Medical AGI: Unify Domain-specific Medical LLMs with the Lowest Cost

Prospective Study for Semantic Inter-Media Fusion in Content-Based Medical Image Retrieval

CARES: A Comprehensive Benchmark of Trustworthiness in Medical Vision Language Models

Inquire, Interact, and Integrate: A Proactive Agent Collaborative Framework for Zero-Shot Multimodal Medical Reasoning

JMLR: Joint Medical LLM and Retrieval Training for Enhancing Reasoning and Professional Question Answering Capability

A comprehensive review of content-based image retrieval systems using deep learning and hand-crafted features in medical imaging: Research challenges and future directions

MedImageInsight: An Open-Source Embedding Model for General Domain Medical Imaging

Universal Model for Multi-Domain Medical Image Retrieval

RJUA-MedDQA: A Multimodal Benchmark for Medical Document Question Answering and Clinical Reasoning

Clinfo.ai: An Open-Source Retrieval-Augmented Large Language Model System for Answering Medical Questions using Scientific Literature