AutoMIR: Effective Zero-Shot Medical Information Retrieval without Relevance Labels

Lei Li, Xiangxu Zhang, Xiao Zhou, Zheng Liu
2024-10-26
Abstract: Medical information retrieval (MIR) is essential for retrieving relevant medical knowledge from diverse sources, including electronic health records, scientific literature, and medical databases. However, achieving effective zero-shot dense retrieval in the medical domain poses substantial challenges due to the lack of relevance-labeled data. In this paper, we introduce a novel approach called Self-Learning Hypothetical Document Embeddings (SL-HyDE) to tackle this issue. SL-HyDE leverages large language models (LLMs) as generators to generate hypothetical documents based on a given query. These generated documents encapsulate key medical context, guiding a dense retriever in identifying the most relevant documents. The self-learning framework progressively refines both pseudo-document generation and retrieval, utilizing unlabeled medical corpora without requiring any relevance-labeled data. Additionally, we present the Chinese Medical Information Retrieval Benchmark (CMIRB), a comprehensive evaluation framework grounded in real-world medical scenarios, encompassing five tasks and ten datasets. By benchmarking ten models on CMIRB, we establish a rigorous standard for evaluating medical information retrieval systems. Experimental results demonstrate that SL-HyDE significantly surpasses existing methods in retrieval accuracy while showcasing strong generalization and scalability across various LLM and retriever configurations. CMIRB data and evaluation code are publicly available at: https://github.com/CMIRB-benchmark/CMIRB.
Information Retrieval, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses zero-shot dense retrieval for Medical Information Retrieval (MIR) when no relevance-labeled data is available. It proposes a method named Self-Learning Hypothetical Document Embeddings (SL-HyDE) to address three key challenges:

1. **LLMs lack specialized medical knowledge**: Although large language models (LLMs) are trained on a wide range of datasets, they usually do not possess sufficient domain-specific knowledge, especially in medicine. As a result, they may generate irrelevant or misleading hypothetical documents.
2. **General text-embedding models represent medical queries and documents poorly**: These models are usually designed for multi-domain, multi-task use, so they struggle to capture the nuances and knowledge-intensive character of medical text.
3. **High-quality relevance-labeled datasets are scarce in the medical field**: Especially for non-English languages, creating and obtaining such datasets is both time-consuming and resource-intensive, which makes it difficult to train and fine-tune models.

To solve these problems, the paper proposes the SL-HyDE framework, which progressively refines both hypothetical document generation and retrieval through a self-learning mechanism, using unlabeled medical corpora and no relevance-labeled data. The paper also introduces the Chinese Medical Information Retrieval Benchmark (CMIRB), a comprehensive evaluation framework grounded in real-world medical scenarios. CMIRB contains five tasks and ten datasets and is intended to provide a rigorous standard for evaluating Chinese medical information retrieval systems.

### Main contributions

1. **Propose the SL-HyDE framework**: A method for zero-shot medical information retrieval that requires no relevance-labeled data.
2. **Develop the CMIRB benchmark**: A benchmark for evaluating retrieval models across multiple real-world medical tasks and datasets; ten models are benchmarked on it.
3. **Significantly improve retrieval accuracy**: SL-HyDE surpasses existing methods in retrieval accuracy and generalizes and scales well across various LLM and retriever configurations.

### Method overview

- **Hypothetical Document Embedding (HyDE)**: Use a large language model to generate a hypothetical document for each query, then retrieve with that document, which narrows the semantic gap between queries and target documents.
- **Self-learning generator**: Generate hypothetical documents and refine the generator according to the retrieval results, gradually improving the quality of its output.
- **Self-learning retriever**: Use the generated hypothetical documents as supervision signals to improve the encoding ability of the retriever.

Through these components, SL-HyDE improves the accuracy of medical information retrieval without annotated data.
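The HyDE retrieval step described in the method overview can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `stub_generator` is a hypothetical stand-in for an LLM call, `embed` is a toy bag-of-words vector standing in for a dense encoder, and the three-document corpus is invented for the example. The self-learning updates to the generator and retriever are not shown.

```python
import math
import re
from collections import Counter
from typing import Callable, List, Tuple


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (assumption: stands in for a dense encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hyde_retrieve(
    query: str,
    corpus: List[str],
    generate: Callable[[str], str],
    k: int = 1,
) -> List[Tuple[float, str]]:
    """Retrieve with the embedding of a generated hypothetical document
    instead of the embedding of the raw query."""
    hypothetical_doc = generate(query)
    h_vec = embed(hypothetical_doc)
    scored = [(cosine(h_vec, embed(doc)), doc) for doc in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def stub_generator(query: str) -> str:
    """Hypothetical stand-in for an LLM; returns a fixed answer-like passage."""
    return "Metformin is a first-line drug that lowers blood glucose in type 2 diabetes."


# Invented toy corpus, for illustration only.
CORPUS = [
    "Metformin lowers blood glucose and is first-line therapy for type 2 diabetes.",
    "Ibuprofen is a nonsteroidal anti-inflammatory drug used for pain and fever.",
    "Amoxicillin is an antibiotic used to treat bacterial infections.",
]

top = hyde_retrieve("What should I take for high blood sugar?", CORPUS, stub_generator, k=1)
print(top[0][1])
```

The query never mentions "metformin" or "glucose", but the hypothetical document does, so it shares far more vocabulary with the relevant passage than the query itself. In a real system the generator would be an LLM and `embed` a trained dense retriever, and the (hypothetical document, retrieved document) pairs would feed the self-learning steps summarized above.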