MedOdyssey: A Medical Domain Benchmark for Long Context Evaluation Up to 200K Tokens

Yongqi Fan, Hongli Sun, Kui Xue, Xiaofan Zhang, Shaoting Zhang, Tong Ruan
2024-06-21
Abstract: Numerous advanced Large Language Models (LLMs) now support context lengths up to 128K, and some extend to 200K. Some benchmarks in the generic domain have also followed up on evaluating long-context capabilities. In the medical domain, tasks are distinctive due to their unique contexts and need for domain expertise, necessitating further evaluation. However, despite the frequent presence of long texts in medical scenarios, benchmarks that evaluate the long-context capabilities of LLMs in this field are still rare. In this paper, we propose MedOdyssey, the first medical long-context benchmark with seven length levels ranging from 4K to 200K tokens. MedOdyssey consists of two primary components: the medical-context "needles in a haystack" task and a series of tasks specific to medical applications, together comprising 10 datasets. The first component includes challenges such as counter-intuitive reasoning and novel (unknown) facts injection to mitigate knowledge leakage and data contamination of LLMs. The second component confronts the challenge of requiring professional medical expertise. In particular, we design the "Maximum Identical Context" principle to improve fairness by guaranteeing that different LLMs observe as many identical contexts as possible. Our experiment evaluates advanced proprietary and open-source LLMs tailored for processing long contexts and presents detailed performance analyses. The results show that LLMs still face challenges in this area and that further research is needed. Our code and data are released in the repository: https://github.com/JOHNNY-fans/MedOdyssey
Computation and Language
What problem does this paper attempt to address?
The paper addresses the lack of benchmarks for evaluating large language models (LLMs) on long contexts in the medical field. Many advanced LLMs now support context lengths of 128K or even 200K tokens, and long-context benchmarks already exist for general domains, but comparable evaluations remain scarce in medicine. The paper introduces MedOdyssey, the first long-context evaluation benchmark specifically for the medical field. It covers seven length levels (from 4K to 200K tokens) and has two main components:

1. **Medical-context "Needles in a Haystack" task**: Knowledge fragments (needles) that are unrelated to the surrounding text are inserted into long texts, and the LLM is then asked questions about these fragments. This tests whether the model can locate and use specific information inside a long context.
2. **A series of medical-related tasks**: These include term normalization, question answering over medical knowledge graphs, table-based question answering, and case-based question answering.

To improve fairness, the researchers introduce the "Maximum Identical Context" principle, under which different models observe as many identical contexts as possible. The paper also uses Counter-intuitive Reasoning and Novel Facts Injection to mitigate knowledge leakage and data contamination.

Experiments on a range of advanced closed-source and open-source LLMs show that existing models still face many challenges in handling long medical texts, especially in producing correctly formatted output and in complex reasoning. The paper therefore emphasizes the need for further research to improve LLM performance in long-text medical scenarios.
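The needle-in-a-haystack setup described above can be illustrated with a minimal sketch. This is not the MedOdyssey implementation: the function names `build_haystack` and `contains_answer`, the whitespace tokenization, the relative `depth` parameter, and the substring-based scoring are all assumptions made for illustration, and the example needle is an invented fact about a fictional drug.

```python
def build_haystack(filler_tokens, needle, target_len, depth):
    """Return a context of exactly `target_len` whitespace tokens with
    `needle` inserted at relative position `depth` (0.0 = start, 1.0 = end).

    Hypothetical helper: a real benchmark would count tokens with the
    model's own tokenizer rather than by splitting on whitespace.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    needle_tokens = needle.split()
    budget = target_len - len(needle_tokens)  # tokens left for filler text
    if budget < 0:
        raise ValueError("needle is longer than the target length")
    if len(filler_tokens) < budget:
        raise ValueError("not enough filler text for the target length")
    body = filler_tokens[:budget]
    pos = round(depth * budget)  # insertion index inside the filler
    return " ".join(body[:pos] + needle_tokens + body[pos:])


def contains_answer(response, expected):
    """Crude scoring assumption: case-insensitive substring match."""
    return expected.lower() in response.lower()


# Illustrative data only: repeated filler text and an invented "novel fact".
filler = ("The patient was monitored overnight . " * 50).split()
needle = "The fictional drug Zeltravine is stored at 4 degrees ."
context = build_haystack(filler, needle, target_len=100, depth=0.5)
```

The resulting `context` would be passed to a model together with a question about the needle (for example, the storage temperature), and `contains_answer` applied to the reply. Sweeping `target_len` over several length levels and `depth` over several positions gives the usual grid of needle-retrieval results.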